Securing and stabilizing the Vera by taking it off the grid

It is indeed by looking at the LuaUPNP logs. I am comparing them to the queue list from Z-way and can observe the behaviors and their differences.

Not sure if this helps, but the logviewer plugin will display pretty detailed Z-Wave info (thanks to GenGen's additions) when logging is switched to verbose mode. However, that in itself can cause restarts:

http://forum.micasaverde.com/index.php?topic=13477.0

Picks up popcorn ready for next installment.

Just to give you an idea of the difference observed:

I just had a Luup crash and reload after manually sending 7 consecutive commands to a zwave device through the vera within a short period of time, not allowing the first command to complete with even an ack before sending the 7th.
In comparison, I have done the same on Z-Way and have reached over 500 commands in the queue (half of the commands in the queue are often waiting for an ack response from the Z-Wave radio, so ~200).

I have a lot of Fibaro FGS-223 switches, and since every command is sent 2 to 5 times before being executed, I can confirm that a massive number of commands (i.e. changing house mode to switch off all lights) may cause such weird behaviors and make the LuaUPnP engine restart.
But there's hope, since it's not that difficult to implement a robust queue. Let's hope they will invest in this area under the new effort.

Are you using debug log level then to see these levels in your log?

LV_ZWAVE 40
LV_SEND_ZWAVE 41
LV_RECEIVE_ZWAVE 42

I am trying to diagnose delays in scenes for basic functions, and I am comparing what I see against your references to queuing to find out whether I have the same issue. It will help me determine whether the problem is the devices or the core Luup engine and how it is written, which you have mentioned a lot in this thread.

I have enabled:
Show polling activity
Show individual jobs
Verbose logging

and yes debug enabled.
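
For anyone comparing logs the same way, a small filter can pull out just the three Z-Wave levels listed above. This is only a sketch in plain Lua: `zwave_level` and `filter_zwave` are made-up names, and it assumes each log line starts with its numeric level followed by whitespace, which should be checked against your own LuaUPnP log before relying on it.

```lua
-- Z-Wave log levels quoted in this thread.
local ZWAVE_LEVELS = {
  ["40"] = "LV_ZWAVE",
  ["41"] = "LV_SEND_ZWAVE",
  ["42"] = "LV_RECEIVE_ZWAVE",
}

-- Return the level name if the line starts with a Z-Wave level number, else nil.
-- ASSUMPTION: every log line begins with its numeric level and then whitespace.
local function zwave_level(line)
  local level = line:match("^(%d+)%s")
  return level and ZWAVE_LEVELS[level] or nil
end

-- Collect only the Z-Wave lines from a line iterator (for example io.lines(path)),
-- prefixing each kept line with the level name.
local function filter_zwave(lines)
  local out = {}
  for line in lines do
    local name = zwave_level(line)
    if name then
      out[#out + 1] = name .. " " .. line
    end
  end
  return out
end
```

Passing `io.lines` on a copy of the log to `filter_zwave` gives a table of send/receive lines that can be laid side by side with the Z-Way queue list.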

I am now very easily able to reproduce the Luup reload by overwhelming the command queue. Since most of my commands are issued from openLuup, I am even thinking about creating a buffered queue on the VeraBridge. The vera still polls a number of devices in spite of my having disabled device polling from the Z-Wave menu... I am thinking about testing some Lua code to disable polling on all Z-Wave devices, device by device, but polling is not the main issue from what I can see.

-- Disable polling on every device that has both PollSettings and WakeupInterval set.
local SID = "urn:micasaverde-com:serviceId:ZWaveDevice1"
for k, v in pairs(luup.devices) do
  local var = luup.variable_get(SID, "PollSettings", k)
  local bat = luup.variable_get(SID, "WakeupInterval", k)
  -- variable_get returns strings, so compare against "0" rather than the number 0
  if var ~= nil and bat ~= nil and var ~= "0" then
    luup.variable_set(SID, "PollSettings", "0", k)
    luup.variable_set(SID, "PollNoReply", "0", k)
    luup.variable_set(SID, "PollRatings", "5.0", k)
  end
end
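
The buffered queue idea mentioned above could look roughly like the following. It is a minimal sketch in plain Lua, not code from openLuup or the VeraBridge: `new_queue`, `send`, `ack` and `max_in_flight` are hypothetical names, and it assumes the caller has some way of knowing when a command has been acknowledged.

```lua
-- Minimal FIFO command queue: at most `max_in_flight` commands are outstanding,
-- the rest wait until an earlier command has been acknowledged.
-- `send` is a caller-supplied function that actually delivers one command.
local function new_queue(send, max_in_flight)
  local q = { pending = {}, head = 1, tail = 0, in_flight = 0 }

  -- Send queued commands while there is room for more outstanding ones.
  local function pump()
    while q.in_flight < max_in_flight and q.head <= q.tail do
      local cmd = q.pending[q.head]
      q.pending[q.head] = nil
      q.head = q.head + 1
      q.in_flight = q.in_flight + 1
      send(cmd)
    end
  end

  -- Add a command; it is sent immediately only if a slot is free.
  function q.push(cmd)
    q.tail = q.tail + 1
    q.pending[q.tail] = cmd
    pump()
  end

  -- Call when a command has been acknowledged (or has timed out).
  function q.ack()
    if q.in_flight > 0 then
      q.in_flight = q.in_flight - 1
    end
    pump()
  end

  -- Number of commands still waiting to be sent.
  function q.size()
    return q.tail - q.head + 1
  end

  return q
end
```

With `max_in_flight` set to 1, seven rapid `push` calls result in a single command being sent and six being held back, instead of all seven hitting the vera before the first ack.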

I do not have a very large network and I was experimenting with Reactor while using PLEG. Some of my z-wave commands were duplicated in both to see which one I liked. When PLEG and Reactor tried to run some things at 5:30 am, I would get a couple of the commands to work - turn off ceiling fan, start to ramp up the nightstand Hue lamps, and then nothing. A couple of minutes later, a LUUP reload message on Pushover. Most times that was it, Z-wave commands did not complete, probably because Reactor and PLEG were no longer triggering since the conditions were satisfied. Occasionally, PLEG and Reactor would finish ramping up the nightstand lights, disarming Blue Iris, setting house mode to Home, setting the thermostat temperatures, turning on a wall outlet, etc. I let it go a couple of weeks and it was very consistent behavior.

Either Vera did not like the duplicate commands, or both were sending too many too fast, or both were stacking the Z-Wave commands too high, or Reactor and PLEG just do not play well together. After that experiment, I disabled all of the Reactor devices (they are still there and they trigger; they just do not trigger scenes). Vera went back to being stable most of the time.

Thanks Don,

To me this is another confirmation of the Luup engine's lack of command queuing. Like you, what I am observing is that once it is overwhelmed, it just hangs, crashes and then reloads, ignoring all the commands from or to devices in between.

Latest iteration of my mod script tested on firmware 1.7.26 and 1.7.27:

I am basically keeping a record of the mods I made to my Vera Plus and wrote a script with the included files so I can remember and recover in case of unit failure. I am sharing it for those who want to try it.
As a reminder, this requires extroot because the mods need more disk space to install. I did a lot of trial-and-error testing, soft bricking my test unit, but thanks to the extrooting it was very easy to debrick it. With extrooting, these mods are completely risk free.
Summary of what this does:

  1. Eliminates 3 of the 5 scripts which sync time at startup, which also makes time sync more reliable at boot. Now the vera only runs ntpclient as the resident ntp service.
  2. Updates the Z-Wave serial API file to the latest version, 6.81.3
  3. Eliminates the remote access tunnel to the MiOS server
  4. Eliminates the relay, again to the MiOS server
  5. Eliminates the network monitor and internet connection check which can reboot the vera
  6. Disables the MiOS API server functions, eliminating various MiOS cloud functionality
  7. Fixes the backup function. I think MCV got a little too cute with it, having too many variations of the backup. I got it to be the same files every time.
  8. Massive updates to the OS libraries and programs. The most noticeable will be busybox from 1.19 to 1.23 and lighttpd from 1.4.32 to 1.4.46. A lot of these were compiled with Chaos Calmer but appear to work fine on the Vera's Barrier Breaker. The UI now feels snappier.
  9. Installs nano and sets it up as the default editor. (I am not a fan of vi)

Adding a disclaimer which I was just reminded to do and I thought was very important.

[size=14pt]These mods, because they prevent the vera from connecting to the MiOS servers, also prevent support from logging into your account and helping you troubleshoot issues.[/size] I have taken it upon myself to troubleshoot my issues by 1. having a test unit which is extrooted so I can always go back to factory firmware/config and 2. understanding the various scripts and logic layers. The issue is not so much that Vera will not support you; it is that they cannot easily do so when you prevent them from accessing your unit. There remain potential issues which these mods do not fix. A few I know of and worth mentioning: 1. data corruption, which is more of a hardware/firmware design long-term reliability problem. 2. a weakness in the command queue between the Z-Wave stack and the vera program. 3. a couple of issues not related to vera itself but more on the specific vera Z-Wave chip side, which I have not really gotten to the bottom of and which sometimes require excluding and reincluding the device (a dongle database issue in the Z-Wave chip itself), and which I am able to observe with a secondary Z-Wave controller.

Offering now Version 3 of my mods.

The big change comes from the fact that I managed to create my own cross compiling environment for the vera plus linux kernel and platform.

Change log from V2.

Updated busybox to latest stable 1.29.3. If you wonder what this does, it is the base script interpreter and tools of the OS. For me it made the vera run noticeably faster. You will actually feel it from the web UI. It also includes security updates. One significant thing I have noticed is that a few of my secure class sensors, which were an absolute pain to configure (with or without inclusion) because I suspected them of responding on the Z-Wave network at a rate the vera could not keep up with, are now much easier to configure. The overall number of retries required when trying to re-configure devices has also significantly decreased.
Updated nano to 3.2
Updated lighttpd which is the webserver to latest stable 1.4.52. It has performance and security upgrades
Updated luasocket, luasec, luafilesystem to latest. (this I do not know whether it had an effect. To me it did not break anything)
Other various library updates and patches to scripts to make them work with these updates.

File has become too big for me to attach here. I will find a way to host it.

This is too complex for me to understand and deploy, but it interests me that the work you are doing to stabilise Vera and the progress you've made appears to be more than what Vera have done in the last six months.

I am so tempted to try this as I'm getting somewhat ticked with the amazing ability of Vera to simply lose stuff and be generally a bit not great.

If it all goes wrong, is there a simple rollback? e.g. Reset, log in, restore backup?

Cheers!

C

I would tend to recommend doing this combined with extroot because then all you have to do is unplug the external SSD and the vera boots from default and all your changes are undone. As long as you have backups, you can then plug the drive back in, extroot again, and only run whatever you needed to run. I am actually blown away by how much faster the vera interface feels now... This makes me suspect that the program was written and tested on a much faster machine and then compiled to run on the little embedded machine.

I could go back and write a revert script, but the value is limited. A factory reset and recovery from backup should do that for you. I am only modifying files in the rootfs, which gets overwritten by the vera firmware updates. I am not modifying the kernel or upgrading the operating system. I am just upgrading packages within the operating system, some at the system level.

Note. The ZIP file is 2MB... I will get to it after work today.

Edit: Attached a screenshot of the login prompt. :smiley:

Awesome, thanks!

I'm probably not going to be able to do this for a week or so, but I need to take some leave so...

C

Always watching and reading the thread, thanks for the work. If things go sour with Vera's new battle plan, I know there are options for my future HA needs.


Here you go. I made the effort to pretty it up a little and now it gives you also the option to NOT take the vera offline if you so desire but will still upgrade the various packages within OpenWRT.

There are three files but due to size limitations, I am having to post them into 2 different posts.
Download all three files and unzip them which will give you 3 folders.
You have multiple options to get the files on the vera:
-SCP them on
-If you have your logs setup on a USB stick, you can remove it from the vera, put the files on the stick next to your logs and then plug it back on the vera. The folders will then be in /tmp/log/cmh
-If you have extrooted your vera, you can do the same as above by putting the 3 folders on the second partition, since you know where it is.

Upgrade process is to enter the VeraMods folder and execute the modvera.sh file

cd VeraMods
./modvera.sh

And this is the third zip file...

Please make sure that all three folders are in the same location on your drive. You do not need to merge the content.

As many of us have been anxiously waiting and expecting any sort of information regarding the new products and platform resulting from the takeover, I have continued to seek continuous improvement of the stability of my setup while looking at what problems people on the forum are reporting. I am at this point not really looking for anything more from vera in terms of functionality besides the request I have made a number of times for:

  1. An offline mode which would not rely at all on any of the vera cloud services. This is what this thread has mostly been about, and I have achieved almost all of it except for the single event server requirement.
  2. A Z-Wave command queue, the lack of which is now mitigated by my use of AKbooer's excellent openLuup.

I have now updated the vera plus with upgraded packages, the most useful ones being busybox (system command interpreter) and lighttpd (webserver). I even noticed that MiOS's own repo now offers some updated packages as well.

I have binned the vera problems into two separated categories:

  1. The hardware, which I am convinced was just sourced from an OEM (Sercomm) and on which Vera slapped their program without even modifying the firmware. What we call firmware updates for the vera are in reality only firmware for the Z-Wave and Zigbee radio processors plus the vera program called LuaUPnP. They are not firmware for the vera device itself, which would normally include a bootloader, a kernel and an operating system. The hardware is massively underpowered if you look at three areas:
    a. Plugins and integrations, due to the lack of CPU power, storage space and, to a lesser degree on the plus, RAM. There is really plenty of storage on the device, but the vera is only using 8.6MB of the 128MB available for its real use (to be fair, 20MB if we include the ROM), which I suspect is caused by a proprietary partition table imposed by the OEM. The resolution to this for me has been to offload all my plugins and integrations to openLuup.
    b. The CPU and network socket cannot keep up with the amount of communication required to talk to both all the vera services and your own integrations. I managed to cause a complete system reboot by sending too many requests. No clear resolution on this unless... I can change hardware.
    c. The vera uses SLC NAND for its storage, which is very good as it is a high endurance memory. Unfortunately, as I mentioned above, it restricts itself to a tiny area of its NAND cells to write its data, which eventually will cause wear, memory cell failure and the data corruption many have experienced. I calculated the lifetime for my use case to be 4-8 years given the frequency and amount of data which is written. Extrooting essentially resolves this problem. I have had no data corruption since I did it.

  2. Software
    a. Autoconfiguration of devices works most of the time, but I found it goes a little too far in its automation and tends to be buggy. It requires a lot of maintenance as well. It creates attributes and variables on its own but sometimes does it wrong, then comes back to check and... causes a luup reload because it's not right, and can enter an infinite loop of luup reloads. Thankfully I have ALTUI to stop this insanity, which is just short of manually modifying the user-data.json file.
    b. Looking at all the shell code, it is actually pretty resilient, but some of the scripting is obsolete and some scripts conflict with each other. I was able to correct most of it in my mods (for example time keeping, the backup mechanism, etc.).
    c. Luup reloads: the LuaUPnP program tends to overuse Luup reload commands for no particular reason, on UI7 at least. Many times a change of an attribute or exiting a page will cause a luup reload. I would have preferred a reload button at the top of the page and the elimination of all the auto reloads. It would resolve a lot of the problem in "a" as well. There are actually two types of luup reloads: one which is triggered by the program itself, and another which is caused by a crash of the luup engine and is more dangerous since data could be lost and corrupted. The vera uses what it calls NVRAM. That is a misnomer in vera's lingo, as it is really a ramdisk: a volatile file system, basically a virtual drive stored in RAM, which is the exact opposite of real NVRAM, i.e. memory that keeps its contents when power is removed. You will see it as a tmpfs drive. This drive is recreated at every boot and is where the vera writes all its logs and configurations; only every few minutes does it write them down to the NAND flash, which is non-volatile and survives a reboot. However, if the luup engine reloads while it was writing something, chances are that you will have problems when it loads back up. That's why there is a recovery mechanism with multiple versions of the user data. The luup engine also restarts every 1st of the month at midnight, and every time the system time changes.

So in conclusion, while I was able to work around the software issues, I am facing the possibility that all I have left is the hardware. And while I managed to address the storage problem, the CPU and the network socket could not be changed... or could they? ;D :stuck_out_tongue: :wink:

A good bit of experimentation on my part recently, along with that of reneboer and akbooer, has revealed a few interesting points on this that I think point away from hardware.

The network socket issue breaks into three classes on its own:

  1. The luup request system appears to have some kind of concurrency issue in which hitting it with too many simultaneous requests leads to a reload. In my own testing, the results are somewhat random, which is to say I can get it to fail, but how long it takes varies: sometimes immediately, sometimes only after several minutes of a punishing request load. I have not noticed that CPU load is a factor when the crash occurs.

  2. The built-in mechanism by which many plugins communicate on sockets with their interfaced devices has some quirks and bugs, but more importantly, the current mechanism of receiving "raw" data (raw protocol on the socket) returns one byte at a time to the handler, rather than a block of all available data, and this leads to massive inefficiency in processing the data, to the point that processing packets of any size is impractical using this method. I've already put in a request with Vera (recently) for a raw block protocol, recognizing it's probably a low priority for them at this moment. Unfortunately, this is the only way of handling received data asynchronously, because...

  3. If the foregoing mechanism doesn't work for you for whatever reason, you are left with direct luup.io calls or LuaSocket. This is fine for send-expect communication (send command to device, it soon returns a response of success/failure/status), but any protocol that returns data asynchronously (e.g. alarm panel notifies you that zone 7 is now breached and system is in alarm) is a problem because luup.io provides no mechanism for notifying on data available, and the usual mechanisms for waiting on a socket in LuaSocket don't work in the Luup environment; they lead to deadlocks and reloads (which makes sense, actually, given the architecture of the system). That means the plugin needs to poll the socket with a frequency that makes sense for the response time required, which is inherently inefficient (doing work when there's no data waiting/no work to be done) and can lead to high CPU utilization and the side-effects that come with it (impact on other services, etc.). Fixing Luup to make LuaSocket work is probably not in the cards, but I think extending luup.io with a few more useful functions (like being able to set handler functions for "data available" and "connection closed") would go miles toward resolving this and other problems with async comms.
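
The polling approach described in point 3 can be sketched as below. This is an illustration only, not code from any existing plugin: `drain` and `handle_chunk` are hypothetical names, and the socket is assumed to be a LuaSocket TCP object that has already had `settimeout(0)` called on it so that `receive` never blocks.

```lua
-- Read everything that is ready on a non-blocking socket, then return at once.
-- ASSUMPTION: `sock` behaves like a LuaSocket TCP client with timeout 0, so
-- receive() returns data, or nil plus "timeout"/"closed" and any partial data.
-- Returns the number of bytes handled, plus an error string if the socket failed.
local function drain(sock, handle_chunk)
  local total = 0
  while true do
    local data, err, partial = sock:receive(1024)
    local chunk = data or partial
    if chunk and #chunk > 0 then
      total = total + #chunk
      handle_chunk(chunk)        -- hand over a whole block, not single bytes
    end
    if err == "timeout" then
      break                      -- nothing more waiting right now
    elseif err then
      return total, err          -- "closed" or another socket error
    end
  end
  return total
end

-- Possible Luup glue (kept as a comment because it only runs inside Luup;
-- the 1 second interval and the names are illustrative):
--
--   function pollDevice()
--     drain(deviceSocket, onDeviceData)
--     luup.call_delay("pollDevice", 1)
--   end
```

The poll interval is the trade-off mentioned above: a short interval gives faster reaction to unsolicited messages but burns CPU when nothing is waiting, while a long interval does the opposite.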

All of these point to software to me. Which is good. It means they are fixable in a portable way that works for even the oldest hardware and keeps those hardware platforms viable longer (I've already asked about 2 and 3).

Thanks Rigpapa,

It's very insightful. The reason why I am pointing to hardware is because... I managed to make the whole unit reboot (not just the luup engine) by SCPing a large number of small files from the vera to a server. I was cloning the entire file system and passing it through the socket.
The Extrooting process which does the same but through USB never caused this so I concluded that the socket was the problem.