Zwave Network On Vera Explained

rafale77 · October 24, 2019, 6:54am

The basics:

Zwave is a low power, low frequency (900MHz) and therefore low bandwidth protocol. Its low frequency enables it to have a longer range for the same power. 433MHz has an even greater range/power but sacrifices even more bandwidth. Zigbee (2.4GHz) uses a pulsing trick similar to the FLiRS concept of zwave to save power while in reality using more RF power to meet about the same range while offering higher bandwidth due to the higher frequency. In essence, in terms of EM power efficiency, Zwave is a very good compromise between power requirement and bandwidth.

The vera:

With the bandwidth compromise in mind, the ultimate goal for a stable and large network should be to maximize the bandwidth and controller receiving time available for user data (useful data) as opposed to infrastructural data transmission (overhead), especially if that overhead is unnecessary. Most devices no longer require polling, for example, as they report their status whenever it changes. The only exceptions are the older switches which do not support instant status. No battery powered device that I know of, except maybe a few FLiRS, requires polling either, and when they do it is only to update their battery status. While asleep they will continue to send sensor status changes. On my particular setup, I have 2 devices out of 144 which require polling maybe once a day, but the vera wants to poll them all every few minutes. The vera also generates a lot more questionable overhead traffic…

rafale77 · October 24, 2019, 6:55am

The vera infrastructural data transmission (overhead):

Polling for AC powered devices: the vera recommends polling the devices on a very regular basis. This might have been useful for devices which would not send status updates on their own when needed (instant status or various AC powered sensors) but really at this point, I don’t know of any such device. This default setting shows up, incomprehensibly, in many places: in the zwave menu of UI7 and in a variable within each device (pollinginterval).

Wakeup for battery operated devices: because these are not listening constantly and are asleep most of the time, they use a different mechanism. They wake up at a regular interval, sending a signal to the vera to say: “hey I am still alive”. The vera sets a default of 1800s, which is every 30min. When the device wakes up, the vera takes the opportunity to send a polling signal to the device, and a series of exchanges occurs between the two, updating the vera with all of the device’s status; if needed, the vera will send some configuration data frames as well. As of 7.0.29 the vera also follows the poll with an nnu (neighbor node update), which takes about 1 minute for the device to complete as it sends “get info” signals all around itself to figure out who its neighbors are and then sends the output back to the vera. While this is happening, the network around this device is neutered as the bandwidth is saturated by these “get” and response signals. If the vera happens to be one of its neighbors… well, the vera will be swamped too. To this date, I still have no idea why vera decided to mess around with the default wakeup interval of the device by setting it to 1800 instead of just acquiring what it is and writing it in its device variable. This parameter is set during the configuration step of the inclusion.

Network Heal: This action is triggered by the controller and gets the zwave chip to send an nnu (same as above, neighbor node update) to every node on the network in the order of the node id from 2 to 232. This is designed to rebuild the mesh so that each node gets a refresh of its neighbors list and therefore enables the zwave chip to know the shortest route to a given node and also enables the remote node to know the fastest way to respond (return route). This heal is designed to be a repair step. It is intense on the bandwidth and disrupts parts of the network mesh for the entire time it is running. The run time gets longer the larger your network is and can take days. This is another point of grief I have with the vera: the UI7 currently forces a nightly heal which means that on a large network, you often start a new heal before the previous one could complete, and your network goes completely bonkers. This I found was a source of luup reloads as the vera reloads when it has too much lag in its network.

From the 3 processes above you can see how much overhead bandwidth the vera imposes on your network if you follow the default inclusion and settings. This can lead to a completely unreliable and unstable setup by violating the basic principle of maximizing the bandwidth for useful data as opposed to the humongous overhead the vera defaults to. I have been advocating to change all this mess, and it is what I meant when I wrote in my many posts that making the current vera stable is not a matter of adding more code but of deleting absurd code. Luckily @edward has come onboard and made changes.

rafale77 · October 24, 2019, 6:56am

On 7.0.29, try extending the wakeup interval of all battery powered devices from 1800 to something like 86400 or 43200 (i.e. 24 or 12h). This will be a huge relief to your network. As of now the entire wakeup process (wakeup, poll, nnu) takes over 1 minute and during that time, your network is disabled. Looking at the logs you will see data collisions on the network, and if you have, say, 10 battery operated sensors and they each wake up every 30min, you end up disabling your network for 1/3 of the time (10 wakeups of about 1 minute in every 30 minutes). When network collisions occur, the data is lost, and when one device wakes up while another is doing its nnu… well then the vera doesn’t know it woke up. If this happens twice to the same device, the vera says it cannot detect the device, and this is the song and dance of the “can’t detect device” which has been plaguing the vera. When this causes commands to lag too much, we end up with tardy commands (command lag) which, when the lag becomes too long, cause the vera to crash and reboot. Imagine combining this absurd wakeup process with a nightly network heal and you get the perfect storm!
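The one-third figure is simple arithmetic. Here is a minimal sketch in plain Lua, assuming each wakeup cycle ties up the network for a fixed number of seconds and that cycles never overlap. The function name is made up for illustration, and the 60 s and 3 s figures are just the rough estimates quoted in this thread.

```lua
-- Rough share of time the Z-Wave network is tied up by wakeup cycles.
-- Assumes cycles never overlap, so this is an upper bound on airtime used.
local function wakeup_busy_fraction(device_count, busy_seconds, wakeup_interval)
  local busy = device_count * busy_seconds
  -- The network cannot be busy for more than 100% of the time.
  return math.min(busy / wakeup_interval, 1)
end

-- 10 battery devices, ~60 s per wakeup (wakeup + poll + nnu), default 1800 s
print(wakeup_busy_fraction(10, 60, 1800))   -- about 0.33
-- Same devices with the interval extended to 24 h
print(wakeup_busy_fraction(10, 60, 86400))  -- about 0.007
-- Default interval but only ~3 s per wakeup (no nnu)
print(wakeup_busy_fraction(10, 3, 1800))    -- about 0.017
```

Either lever (longer interval or shorter wakeup) cuts the overhead by more than an order of magnitude in this estimate.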

Note that the wakeup interval is a configuration of the node and not of the vera. The variable “WakeupInterval” is used by the vera to calculate how long it will go without a wakeup signal from the device before it flags it as “cannot detect”. The vera will let one missed interval go but will flag the device the second time it misses the wakeup (2x the value of that variable). The “ConfiguredWakeupInterval” variable keeps a record of the value the vera used the last time it sent it to the device. Changing the wakeup interval can be done through the device settings page and will require waking up the device manually for the vera to send that configuration change to the device.
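To see where each sleeping device currently stands, something like the following could be run from the Lua test window. It is only a sketch: `wakeup_report` is a made-up helper name, the 2x rule is taken from the description above, and the `luup` table is passed in as a parameter so the function can also be tried with a stub outside of a vera.

```lua
-- Sketch: list sleeping Z-Wave devices with the interval the vera expects
-- and the delay after which it would flag "cannot detect" (2x the interval).
local ZWAVE_SID = "urn:micasaverde-com:serviceId:ZWaveDevice1"

local function wakeup_report(luup_api)
  local report = {}
  for id, dev in pairs(luup_api.devices) do
    -- variable_get returns (value, timestamp); the extra parentheses keep
    -- only the value so the timestamp is not passed to tonumber as a base.
    local interval = tonumber((luup_api.variable_get(ZWAVE_SID, "WakeupInterval", id)))
    -- device_num_parent == 1: direct child of the Z-Wave controller device
    if interval and interval > 0 and dev.device_num_parent == 1 then
      report[#report + 1] = {
        id = id,
        name = dev.description,
        interval = interval,
        configured = luup_api.variable_get(ZWAVE_SID, "ConfiguredWakeupInterval", id),
        flagged_after = 2 * interval,
      }
    end
  end
  return report
end

-- Only runs on a vera, where the global luup table exists.
if luup then
  for _, r in ipairs(wakeup_report(luup)) do
    luup.log(string.format("%s (#%d): wakeup %ds, flagged after %ds",
      tostring(r.name), r.id, r.interval, r.flagged_after))
  end
end
```

A device whose “ConfiguredWakeupInterval” differs from “WakeupInterval” is probably still waiting for a manual wakeup to receive the new value.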

The poll settings on the other hand are a parameter of the vera only and do not require a device reconfiguration (the firmware will do it if you try changing this variable from the device settings page but is really not needed). I prefer doing it with lua code and then run a luup reload. The code below will disable polling on all non battery operated devices:

-- Disable polling on every mains-powered Z-Wave device.
local ZW_SID = "urn:micasaverde-com:serviceId:ZWaveDevice1"
local HA_SID = "urn:micasaverde-com:serviceId:HaDevice1"
for k, v in pairs(luup.devices) do
  local var = luup.variable_get(ZW_SID, "PollSettings", k)
  local bat = luup.variable_get(HA_SID, "BatteryLevel", k)
  -- device_num_parent == 1: direct child of the Z-Wave controller (device 1)
  -- bat == nil: no battery level reported, so treated as mains powered
  if var ~= nil and v.device_num_parent == 1 and bat == nil then
    -- variable_get returns a string, so compare against "0", not 0
    if var ~= "0" then
      luup.variable_set(ZW_SID, "PollSettings", "0", k)
      luup.variable_set(ZW_SID, "PollNoReply", "0", k)
    end
  end
end

You can then determine afterwards if you need to enable polling on some specific devices. You could also disable it manually for your FLiRS by changing the “PollSettings” variable to 0. Other battery operated devices will still get polled when they wake up, and I am still requesting that this function be eliminated: it takes 3s of zwave bandwidth each time, to little benefit, when it could take 0.
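For the handful of devices that really do need polling, it can be switched back on at a slow rate. This is a sketch only: the device numbers are hypothetical placeholders, `enable_daily_poll` is a made-up helper, and it assumes “PollSettings” holds the polling interval in seconds.

```lua
-- Sketch: turn polling back on, once a day, for the few devices that need it.
local ZWAVE_SID = "urn:micasaverde-com:serviceId:ZWaveDevice1"
local NEEDS_POLLING = { 42, 57 }   -- hypothetical device numbers
local ONCE_A_DAY = 86400           -- seconds (assumed unit of PollSettings)

local function enable_daily_poll(luup_api, device_ids, interval)
  for _, id in ipairs(device_ids) do
    -- device variables are stored as strings
    luup_api.variable_set(ZWAVE_SID, "PollSettings", tostring(interval), id)
  end
end

-- Only runs on a vera, where the global luup table exists.
if luup then
  enable_daily_poll(luup, NEEDS_POLLING, ONCE_A_DAY)
end
```

As with the bulk change, a luup reload afterwards is probably the safest way to make sure the new values are picked up.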

On the upcoming 7.0.30, a few things I have requested have been implemented:

Disable nightly heal. I will say and repeat this again. The heal is not a maintenance procedure. It is heart surgery. Doing this nightly will get you to certain death. It should be used only manually by the user when he knows there is a problem. Not used properly, it leads to self-destruction and the only person who can know when to use it is the user. It should never be automated. Thank you ezlo for finally listening and allowing us to disable it.

Lua code to disable nightly heal:
luup.attr_set("EnableNightlyHeal",0,0)

Disable the Wakeup nnu: This is another aberration, forcing a mini self heal of a perfectly working device. Why? Because if the device woke up and the vera can detect it and is able to send it this command, then why for goodness sake would you want to ask it to update neighbor nodes? The only case when this would be needed is when the network is broken and the node is not reachable, which is self defeating. This function should never have existed and defies logic. Again thank you @edward for finally allowing us to disable this. It cuts down the wakeup time of the battery operated node from 1 min each time down to ~2-3s. It saves a ton of battery life and improves the stability of the vera’s network by freeing precious usable bandwidth (decreasing the overhead). It also reduces/eliminates the recurrent “can’t detect device” messages caused by devices waking up while the network was busy doing useless things. The wakeup polling remains another useless overhead for most devices which has not been disabled, but it is much less of an issue than the wakeup nnu.

This is the luup call to disable the wakeup nnu for all devices which wake up (i.e. battery operated devices):

for k, v in pairs(luup.devices) do
  local var = luup.variable_get("urn:micasaverde-com:serviceId:ZWaveDevice1", "WakeupInterval", k)
  -- variable_get returns a string (or nil), so convert before comparing with 0
  if var ~= nil and tonumber(var) ~= 0 and v.device_num_parent == 1 then
    luup.variable_set("urn:micasaverde-com:serviceId:ZWaveDevice1", "DisableWakeupARR_NNU", "1", k)
  end
end

TimeJump auto luup reload. The luup engine also has a function to kill itself and reload whenever it detects a time jump. This can be caused by a time zone change, a DST or end of the month or… somehow when the zwave chip has too much lag for the engine. The problem I have with this is that the reload does absolutely nothing but wipe out your data in the dram (notably ongoing scene timers), risking data corruption. It does nothing about the cause or consequences of the timejump. We will now have the ability to disable it!

lua code to disable the time jump reload:
luup.attr_set("ReloadOnTimeJumps",0,0)
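To confirm what the controller currently holds for the two attributes (“EnableNightlyHeal” and “ReloadOnTimeJumps”), they can be read back with luup.attr_get. A small sketch: `overhead_settings` is a made-up helper name, and it is an assumption on my part that the attributes simply read as nil on firmware older than 7.0.30.

```lua
-- Sketch: read back the two controller attributes used to disable the
-- nightly heal and the time jump reload (device 0 = the controller itself).
local function overhead_settings(luup_api)
  return {
    nightly_heal = luup_api.attr_get("EnableNightlyHeal", 0),
    reload_on_time_jumps = luup_api.attr_get("ReloadOnTimeJumps", 0),
  }
end

-- Only runs on a vera, where the global luup table exists.
if luup then
  local s = overhead_settings(luup)
  luup.log("EnableNightlyHeal=" .. tostring(s.nightly_heal) ..
           " ReloadOnTimeJumps=" .. tostring(s.reload_on_time_jumps))
end
```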

Note that at the moment, as reported by @therealdb, the time jump setting in 7.0.30 is buggy. It prevents time based scenes from triggering and also causes a small memory leak.

At this moment, the released version of 7.30 has been pulled, mostly due to update failures and a poorly tested/validated kernel which causes Christmas light mode and disables extroot. For those who have an old SATA ssd lying around, I propose getting a USB to SATA dongle and heading to the extroot thread to first extroot your vera and then run a hybrid upgrade, whereby only the program side of the vera is upgraded and not the OS/kernel. This enables all the new features without suffering from the odd problems of the new kernel.

rafale77 · October 24, 2019, 6:58am

The vera inclusion process:
The inclusion process is divided into two steps:

1. The zwave network inclusion, which is operated by the dongle and is the same for any host controller.
2. The configuration, which is unique to vera and follows the network inclusion. If the inclusion is secured, this step is critically tied to the first step.

All the inclusion failures I have seen are due to the zwave network being too busy or chatty because of the overhead traffic I described above. The chattiness can interrupt the inclusion flow by inserting commands between the 2 steps and causing delays. As your network grows, so will your overhead, and disproportionately so. The problem with secure class device inclusion is that steps 1 and 2 must follow each other closely in time (security feature) and any lag can make it fail, as the device often has a very tight time window in which it expects to get the security key. If it fails, you must exclude and include again.

Conclusion of this very long post: If you see inclusion failures, it is likely that your zwave network is flooded with unnecessary data transmissions set up by the vera. The first step is to increase your wakeup intervals and disable polling. The second step is to (wait for and) try 7.0.30 (this is why it is such a game changer for vera) to disable the 3 bloatware functions which are killing your network and the luup engine.

All this explains the increased probability for a successful secure class key exchange after a full reboot of the vera: We interrupt the zwave heal and kill a number of nnu calls in the queue while the vera is freshly started up allowing for more available bandwidth for the secure key exchange to happen.
It also explains the occasional “can’t detect device” messages, which are due to repeated missed wakeups caused by the network being too busy. It explains spontaneous luup reloads and occasional massive response delays to commands. It is also the cause of the HEM child devices ceasing to report, etc, etc…

After taking all of these actions, I am now on my 20th day of luup uptime and still running, have no more issues including devices on a 144 node network, the size of my logs has been divided by 5, my HEMs have never dropped child reports once and the frequency of delayed commands/missed sensor untrip signals has significantly decreased. We still have a “got CAN” problem to solve…

Edit: One thing I wanted to add regarding inclusion. One can compare zwave RF transmission to our voices talking. When we had the recommendation to be within 10ft of the vera to include a device, we can all picture that it was to make sure that the node to be added was loud enough to be heard by the controller and, vice versa, that the controller was loud enough to be heard by the node. The problem gets more complicated if you have a very chatty network with other devices very close to the vera. Because they will likely be repeaters, these devices could be louder than the node you are trying to add if they are closer, or if their antenna or positioning is somehow more favorable than that of the device to be included. This is why, as strange as it sounds, it is sometimes better to be further away from the vera, so that the signal is relayed by a louder repeater, or maybe two, each of which would be quieter because it would not be relaying all the messages. This is especially true when adding a battery operated device, which typically has lower power, while you have a repeater node very close to the vera.
Now the core issue is… why are all these devices so talkative when they should not be doing anything? This is where the overwhelming overhead comes in and should be reduced to the lowest possible amount.

Edit 2: The beta is now available! For posterity I am attaching a screenshot showing my current uptime.

Edit 3: Uptime with the beta. The official release for the plus/secure came and went.

Edit 4: After over 25 days of uninterrupted uptime, I upgraded to the release build 4833 but without the new kernel. Had to swap out a couple of sensors (1 ran out of battery and 1 was physically broken and could not be manually woken up). I can report a very noticeable decrease in battery consumption across all of my battery operated devices, be it FLiRS or regular ones. My Yale locks, for example, have now had the same lithium batteries for over 4 months, which has never occurred before; they typically lasted 1-2 months.

Edit 5: Summary of benefits observed.

After over 2 months of running with the wakeup arr/nnu disabled and nearly 2 months of extended wakeup intervals, I wanted to list out all the problems this solved so that people can refer to it if they encounter them:

The obvious one is the extended battery life, especially on FLiRS like the zwave locks, which used to burn through my lithium batteries within 1-2 months. Now, after two months without changing the batteries, they are all still showing 100% and are still running strong. Likewise my econet vents, for which I had to recharge the batteries about every 6 weeks, are at the very least extending their battery life by 3x, judging from the current battery level reading. It is still too soon to assess the battery life on all my sensors but I get the feeling it may be multiplied by even more. It is such a different experience to wake up every morning not having to check my vera UI to see which device needs new batteries!!! This was my life before, as every 3-4 days I had some batteries to swap out. I changed batteries on only 2 devices in 2.5 months!
See below the batteries consumed by my ~50 battery operated nodes over 2 years because of these absurd default or forced settings from the vera and knowing that I have been using rechargeable batteries on every device I could: It’s over 14lbs

My Aeon HEM which often used to stop updating data from one of the two child devices at least once every two weeks… completely stopped doing so.
I have a handful of Leviton 4 button scene and zone controllers with an embedded relay which would lose their association about once every two months, requiring a power cycle to recover. This is completely eliminated.
The frequency at which I get delayed scene or even simple command execution dramatically decreased… by at least 10X. I rarely encounter issues like these any more.
Luup reload but you already knew that.
Random can’t detect devices… Completely eliminated.
Frequency of missed sensor trips and untrip dramatically decreased. It still very occasionally misses untrips which I identified in the logs as associated with a “got CAN” and tardy event and sometimes wake-up polls.
Strangely high probability of secure key exchange failure during secure device inclusions. Eliminated.
Garage Door openers (Linear GD00Z) used to occasionally go out of sync on their open close status or stop responding to commands as if they were frozen, requiring a power cycle to recover. Eliminated.

@Pabla reported:

My z wave network is speedy fast
Battery life was never an issue for me with my Schlage locks but I’ve been observing that battery life isn’t going down as quickly
Inclusion/exclusion process is much more reliable
Any scenes I have run very quickly
Overall performance of my VP is way better

Sorin · October 24, 2019, 8:44am

As always, amazing contribution. Pinned!

therealdb · October 24, 2019, 4:01pm

Great post, as always. I still have complaints but we’re heading in the right direction. I managed to get 12 days max, but overall my unit is pretty stable (I occasionally get a reload because of my config but it should be fixed in a future version). I followed the rules in this post and the chattiness of my network dramatically improved. Can’t wait for all you to benefit from this new release.

richie.digital · October 25, 2019, 12:04am

still trying to understand where u got the beta firmware from

rafale77 · October 25, 2019, 1:21am

It’s a misnomer. It is not a beta but a more complete alpha some of us are testing

mikewop · October 25, 2019, 2:06am

Great explanation @rafale77, it makes a lot of sense once it’s laid out like this.
My battery-powered Z-Wave Sensors still have a setting for “poll this node at most every xx seconds”, what should that be set to? Or is that ignored for battery nodes?

Also, any ETA for 7.30 (or at least the beta) for the rest of us?

rafale77 · October 25, 2019, 3:08am

From what I can tell this setting doesn’t do anything. The polling is embedded in the response to a wake-up frame of the vera. I tried disabling polling using this parameter without success.

Some of us testers have expressed that it is ready for beta so I will leave the answer to @edward or @Sorin. Note that 7.30 has much much more than what I discussed here so it takes time to test and iterate to squash the bugs and this is a good thing. I can only second guess but one reason the vera has had all these problems was the lack of production testing. All the problems I described here probably would not be observed on a test setup with 2-5 devices. The full scale test which we, alpha testers are doing, should be able to improve how reliable and bug free the firmwares will be.

kigmatzomat · October 25, 2019, 3:40am

Oy vay. That’s informative but dang. No wonder so many Veras got worse over time. In many cases the default recommendation to a zwave problem was “add a smartplug as a repeater”, which was only making the heal cycle take longer and longer.

I suspect the polling setting may have worked at some point, as I remember some battery life improvement after a change. I guess that could have been as much due to a seasonal temp change as a polling change, but that lock went from a set of batteries every 3 weeks to seasonal.

There must be a faster implementation of the nnu step possible. I have used the homeseer ZSeer tool, which provides a map of a zwave network and allows you to edit/heal a network. I did a complete scan of my @40 device network a few months ago and it took under a half hour. I want to say it was only around 15m, but that could be wrong. I may fire up the laptop and give it another go , for science.

rafale77 · October 25, 2019, 4:00am

Polling settings work on FLiRS and AC powered devices. They don’t work on battery operated devices which go to sleep.

There is no need for an nnu or heal if your network is healthy. If you have a problem you can go straight to that device and do a manual nnu (update neighbor node) or a manual heal, the same way as on homeseer. It is no different. The command to the zwave chip is the same since these are all handled by it and not by the vera host per se. My point was that this should never be automated and run on its own, nightly, while the network is perfectly fine. It is a waste of power, a waste of bandwidth and a contributor to instability. Worst case, if you have a FLiRS out of battery during that time, it will completely collapse the network mesh.
The speed of the nnu depends on the device. What I observed is that it takes about 1min per device minimum. If it tries to run on battery operated ones then it waits until they wake up. If they wake up while another device is doing its nnu then it will skip and wait until the next time it wakes up, and you go through a round robin until every node has done its nnu. Mine almost never completed before the next one started.
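As a back-of-the-envelope check, here is a tiny sketch of the best case, assuming about 60 s per node and no time spent waiting on sleeping devices. Both are rough observations rather than specs, and `min_heal_minutes` is a made-up name.

```lua
-- Lower bound on the duration of a full heal, ignoring any waiting on
-- sleeping devices and any retries.
local function min_heal_minutes(node_count, seconds_per_node)
  return node_count * (seconds_per_node or 60) / 60
end

-- A 144 node network needs 144 minutes (2.4 hours) at the very least.
print(min_heal_minutes(144))
```

With battery devices on 30 minute or longer wakeup intervals skipping their turn, the real figure only goes up from there.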

slelieveld · October 25, 2019, 7:13am

@rafale77, great post! It would be helpful to have some sort of “how to” with screenshots for the less experienced users among us (with regards to the zwave protocol). I have seen contradictory advice from vera support (in the past) with regards to polling, wakeup, etc. times, and I think my zwave network must be a mess, with the can’t detect device, waiting for wake up to configure device, etc. messages and also the ghost messages that appear (I think due to “the engine” trying to reconfigure my device/network?)

dJOS · October 25, 2019, 7:39am

Agreed, I would like this also.

rafale77 · October 25, 2019, 3:12pm

I will post a few things shortly since 7.0.30 beta is now available. Because I have a lot of devices… I tend to do things through lua code to change configuration for a bunch of devices at a time. It will be in the first 4 posts of this thread.

jonas2 · October 25, 2019, 3:43pm

Thank you for an informative post!!
Do you think its safe for a regular user to upgrade to the beta?

rafale77 · October 25, 2019, 3:48pm

I would move very cautiously:
If you have a test unit, I would start by doing the upgrade on that. If not,

Make sure you make a backup of your entire setup.
Have the url of 7.0.29 handy to downgrade in case something goes bad.

Note that this upgrade goes deeper than most because it affects the file system and the OS. It may take a bit longer and may require a couple of manual reboots.

Details of the set up have been posted above.

jonas2 · October 25, 2019, 3:51pm

Thanks for your reply! Sounds like i’m gonna wait with the upgrade, i only have one Vera and don’t wanna lose it

rafale77 · October 25, 2019, 5:26pm

I ran the upgrade earlier on my test unit. It looks good and seems pretty safe to try. I even want to say that this upgrade is less likely to brick your vera than 7.0.29 because of the storage allocation change!

https://community.getvera.com/t/vera-7-30-core-firmware-beta-release/210665?u=rafale77