Vera: Data Corruption by Design

Seeing a few users posting here about very frequent luup engine reloads and Lua code corruption, I just wanted to share my recent experience.

Our user-data.json, which contains all of our Z-Wave, Zigbee and other plugin configuration, is by design constantly being saved and rewritten. The file is stored not in RAM but on the SLC NAND storage, so that it can be recovered after a sudden shutdown of the unit (a power outage or unplugging of the Vera). The problem is that there is so much to write, and the writing and processing take so long on a large configuration, that the probability of the file getting corrupted increases.
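If you want to see how often that rewrite actually happens on your own unit, a crude check from an SSH session is just to watch the file's timestamp. This is only a sketch; the path below is an assumption based on where my unit keeps the compressed copy, so adjust it if yours differs:

# Poll the saved configuration's timestamp once a minute to see how
# frequently the engine rewrites it. /etc/cmh/user_data.json.lzo is an
# assumption; adjust the path to match your firmware.
while true; do
    ls -l /etc/cmh/user_data.json.lzo
    sleep 60
done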

My setup had been running 10-15 days at a time without a Luup reload for about a month, and then suddenly I started seeing devices go to “Can’t detect device” without any changes to my network. The problem gradually spread, and then, mysteriously, I would see a Luup reload every 10 hours (exactly). Upon a Luup reload, a perfectly configured device would start showing “Waiting for device to wake up to configure”. A manual configuration without waking the device would eliminate the error message (meaning it did not actually get reconfigured). Another device would show as undetected, and so on. I started seeing some scenes with “last run” dates in 2019 on the UI (yes, that is in the future). I no longer have much Lua code on the Vera, but I suspect many people who are seeing corrupted Lua code in their scenes actually have the same problem source. At one point in my recovery attempts, while trying to force a device configuration, my Vera started doing 2-3 Luup reloads a minute. The omnipresence of Luup reload calls within the engine contributes to the data corruption, as your data may be changing while it is being written, or the write may be interrupted by a reload.
The solution to all of this? I restored a backup from before the problems started and, boom, they were all gone.
I can’t emphasize enough the importance of backing up both the Z-Wave and the device configuration. My system is backed up daily to my NAS, but the current design has a scalability problem: writes to the NAND are so frequent that after a few months, or maybe a year or two, we will start seeing worn-out memory cells on our Veras (my previous Vera Edge had this problem after 10 months; the small size of the storage contributes greatly too), or data corruption caused by a write or a Luup reload happening at the wrong moment.
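For anyone who wants to replicate the daily NAS backup, here is a minimal sketch of the kind of script I run from the NAS's cron. It assumes the Vera sits at 192.168.1.10 and that the firmware still exposes the backup CGI the web UI uses (the exact path may vary by firmware version), so treat both as placeholders:

#!/bin/sh
# Pull a daily backup from the Vera onto local NAS storage.
# VERA_IP, DEST and the CGI path are assumptions -- adjust for your setup.
VERA_IP="192.168.1.10"
DEST="/volume1/backups/vera"
STAMP=$(date +%Y%m%d)

mkdir -p "$DEST"
# external=1 asks the unit to include the Z-Wave dongle data as well
wget -q -O "$DEST/vera_backup_$STAMP.tgz" \
  "http://$VERA_IP/cgi-bin/cmh/backup.sh?external=1" || exit 1

# keep only the 30 most recent backups
ls -1t "$DEST"/vera_backup_*.tgz | tail -n +31 | xargs -r rm -f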
Another element that is very damaging to the reliability of the whole system is the forced nightly heal. If you have a very stable mesh/network, it is not only unnecessary and a cause of downtime on the Z-Wave network, it also increases the probability of errors and data corruption.

I sent a long list of suggestions to CS a couple of months ago to address these problems, right around the time of the mios ownership change, but I am not sure they will be taken into account for the next UI design.

The conclusion, for now, is to rely on backups and to hold on to your known-good ones.

I do backups locally, but if you leave your Vera connected to the internet, a daily backup is saved online as well.
It saved me a couple of times before I implemented my local backup strategy (i.e. the controller coming up completely empty after a power loss).

[quote=“therealdb, post:2, topic:199954”]I do backups locally, but if you leave your Vera connected to the internet, a daily backup is saved online as well.
It saved me a couple of times before I implemented my local backup strategy (i.e. the controller coming up completely empty after a power loss).[/quote]

What prompted me to post this is my observation, from the forum, that there is a very large number of posts pointing out issues related to data corruption: scenes, Lua code, devices, or sometimes entire configs and databases having problems. I have disabled the daily backup to the server by removing the Vera from my account; I am finding the local backup strategy more reliable anyway.

I ran dmesg on my Vera Plus and guess what… I got this:

[ 2.484000] NAND device: Manufacturer ID: 0xc2, Chip ID: 0xf1 (Macronix NAND 128MiB 3,3V 8-bit), 128MiB, page size: 2048, OOB size: 64
[ 2.508000] [NAND]select ecc bit:4, sparesize :64 spare_per_sector=16
[ 2.520000] Scanning device for bad blocks
[ 2.744000] Bad eraseblock 907 at 0x000007160000

Am I ever glad I moved my Vera over to a USB SSD…
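If you want to check your own unit, SSH in and look for the same symptom; both commands below only read standard kernel interfaces:

# count the bad NAND eraseblocks the kernel reported at boot
dmesg | grep -ci "bad eraseblock"

# list the MTD partitions the NAND is divided into
cat /proc/mtd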

A couple of (relatively) simple changes they could make right now:

  • Store it on a separate partition. A mission-critical file like user_data.json should be stored on a dedicated partition, sized and managed in such a way that it cannot fill up. That file should NEVER be stored on a filesystem that has user-affected growth (e.g. installing plugins).

  • Split it up. As a single monolithic file, the corruption of any part is corruption of the whole. If you split it up, you create the opportunity for only changed elements to be written, which is not only faster, but also reduces overall write count on NAND blocks and extends storage life.

If they had some control over the hardware, I’d lobby for an area of RAM that’s battery-backed for storing userdata, and writing it to non-volatile storage at much larger intervals or particular events (actual config changes, orderly shutdown, etc.). Most of the data updates are noise anyway (motion trip, motion untrip, motion trip, motion untrip…).
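To illustrate the "split it up / write only what changed" idea (and nothing more; this is not how the engine actually stores things), here is a rough sketch. It assumes jq is available, which it is not on stock firmware, and that you have a decompressed copy of user_data.json to work from; the paths and the "devices"/"id" layout are assumptions as well:

#!/bin/sh
# Sketch only: break user_data.json into one file per device and rewrite
# a piece only when its content actually changed, so unchanged sections
# cost no NAND writes at all.
SRC=/tmp/user_data.json
OUT=/tmp/user_data.d
mkdir -p "$OUT"

jq -c '.devices[]' "$SRC" | while read -r dev; do
  id=$(printf '%s\n' "$dev" | jq -r '.id')
  cur="$OUT/device_$id.json"
  new="$cur.new"
  printf '%s\n' "$dev" > "$new"
  if cmp -s "$new" "$cur"; then
    rm -f "$new"       # identical content: skip the write entirely
  else
    mv "$new" "$cur"   # changed: atomically replace just this device's file
  fi
done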

[quote=“rigpapa, post:5, topic:199954”]A couple of (relatively) simple changes they could make right now:

  • Store it on a separate partition. A mission-critical file like user_data.json should be stored on a dedicated partition, sized and managed in such a way that it cannot fill up. That file should NEVER be stored on a filesystem that has user-affected growth (e.g. installing plugins).

  • Split it up. As a single monolithic file, the corruption of any part is corruption of the whole. If you split it up, you create the opportunity for only changed elements to be written, which is not only faster, but also reduces overall write count on NAND blocks and extends storage life.

If they had some control over the hardware, I’d lobby for an area of RAM that’s battery-backed for storing userdata, and writing it to non-volatile storage at much larger intervals or particular events (actual config changes, orderly shutdown, etc.). Most of the data updates are noise anyway (motion trip, motion untrip, motion trip, motion untrip…).[/quote]

I can’t agree more with you. On top of this, I would add preventing the LuaUPNP from reloading on a whim and treating a reload as a fix for bugs in the code. It self-reloads at the OS level whenever it crashes, but the program itself also reloads automatically, often for no justifiable reason.

rafale77, I know you and others have done extensive work investigating the reload issue. I happen to be lucky, in the sense that I’ve found a formula that works for my environment. My “production” (used by my household) Vera Plus is fairly stable, but I’ve been very specific in my choice of plugins (mostly my own, as you may imagine). Ejecting a few very popular plugins has stabilized my system considerably; some I’ve replaced with my own work, and I will get around to the others (and make those public as well).

What stands in stark contrast for me, however, are the Vera3 and VeraLite that I use for development and monitor very carefully. Where the Plus may often go 12-15 hours without a reload (6-8 is most common, which is ridiculous, but “acceptable” in this world, I guess), the 3 and the Lite, which have no Z-Wave devices at all, will often go 14 days and more without a reload, even with my test plugin devices running in full debug. I also use openLuup and prefer to develop there, so the Vera systems are frequently just left to themselves until I'm ready to test on Vera hardware. But even when I am hard at work on them, if I go to bed at night and get up the next morning, there are usually no reloads overnight. If I don't force a reload on them, they run for a relatively long time, until they don't. I monitor CPU load and memory use as part of my QA, and they are very stable (CPU load bouncing around a reasonable median, memfree flat) right up until the reload, and in the sample recorded just prior (typically at 5-minute intervals) there is never really a spike or trend in either. I've never been able to determine what eventually takes them out.
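For reference, that kind of 5-minute load/memory logging is easy to reproduce with nothing but /proc; a minimal sketch (the log path is arbitrary), run from cron every 5 minutes:

#!/bin/sh
# Append a timestamped load-average and free-memory sample to a CSV.
# Schedule from cron: */5 * * * * /path/to/this/script
LOG=/tmp/log/sysstats.csv
mkdir -p "$(dirname "$LOG")"
load=$(cut -d' ' -f1-3 /proc/loadavg)                   # 1/5/15 min load
memfree=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)   # in kB
printf '%s,%s,%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$load" "$memfree" >> "$LOG"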

So it seems, as you say, there are likely (too) many idiopathic causes for when Luup reaches for its poison pill, but it seems to me that their Z-Wave support is the most sinister culprit of the lot, because the box is actually pretty stable when no Z-Wave is running.

rigpapa, with all my tweaks to various scripts, I am now able to get to over 14 days without a Luup reload on my 138-node Z-Wave network. I guess, relative to a lot of people, I can’t complain, but it is still not acceptable to me. Your observations actually point to one of my suspicions, that the Z-Wave network can also cause a Luup reload, but I have never been able to pin it on a specific event.

One example of an unjustifiable reload: go to your device screens, go to add devices, then back out without doing anything else… boom, Luup reload. There are many other examples, and watching how ALTUI handles the same things shows how unnecessary they are. I also saw that on reload Luup checks the validity of user-data.json and tries to clean it, and if it can’t, it goes and loads the previous version, which is why it deletes devices on its own. It seems idiotic to me. I also have practically no plugins on my production Vera Plus, only System Monitor and ALTUI (and the 2 embedded mios ones), and I disabled auto-update on all of them, obviously.

Reloads can be caused by system-level overloads, which I found are often spikes you do not see while monitoring (depending on your sampling frequency). One example is log rotation, which can use a lot of RAM for a short time; my last reload was caused by it. The network monitor can spike the CPU load, and of course the LuaUPNP can do that too. It is not that usage creeps up over time (though there also seems to be a memory leak for some people); it is the short-term peak usage. I am now having to drop cache memory every hour, and I am even thinking of testing the LuaUPNP program in a QEMU VM.
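For what it's worth, the hourly cache drop is just a one-line crontab entry; a sketch assuming an OpenWrt-style root crontab (/etc/crontabs/root on these units) and the standard Linux drop_caches knob:

# flush dirty pages first, then drop the page cache plus dentry/inode caches
0 * * * * sync && echo 3 > /proc/sys/vm/drop_caches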

To get back to the topic of the thread, this is a contributing factor to the data corruption… every reload causes a lot of writes, eventually wearing out the embedded NAND, which has no wear-leveling algorithm.

Among other things, we have a dedicated team that has already started development on the next generation of the Luup engine.

If you remember the topics “LUA scripting - What should we do to improve it” and “Vera Plugins - Development environment and tools needed”, that’s what that feedback was for.

So stay tuned. 8)