Zwave Network On Vera Explained

rafale77 · October 24, 2019, 6:58am

The vera inclusion process:
The inclusion process is divided into two steps:

1. The zwave network inclusion, which is operated by the dongle and is the same for any host controller.
2. The configuration, which is unique to vera and follows the network inclusion. If the inclusion is secured, this step is tightly coupled to the first step.

All the inclusion failures I have seen are due to the zwave network being too busy or chatty because of the overhead traffic I described above. The chattiness can interrupt the inclusion flow by inserting commands between the 2 steps and causing delays. As your network grows, so will your overhead, and disproportionately so. The problem with secure class device inclusion is that steps 1 and 2 must follow each other closely in time (a security feature), and any lag can make it fail because the device often has a very tight time window in which it expects to get the security key. If it fails, you must exclude and include again.
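To make the timing argument concrete, here is a minimal sketch of the idea, not a model of the actual zwave stack. It assumes that commands already queued ahead of the key exchange each cost a fixed amount of airtime, and that the device gives up after a fixed window. The function name, the 0.5 s airtime and the 10 s window are all hypothetical values chosen for illustration.

```python
def key_exchange_succeeds(queued_commands, airtime_per_command_s, window_s):
    """Return True if the key can be sent before the device's window closes.

    Simplified assumption: every command queued ahead of the key exchange
    delays it by the same fixed airtime.
    """
    delay_s = queued_commands * airtime_per_command_s
    return delay_s <= window_s


WINDOW_S = 10.0   # hypothetical time the device waits for its security key
AIRTIME_S = 0.5   # hypothetical airtime consumed by each queued command

for queued in (2, 10, 40):
    ok = key_exchange_succeeds(queued, AIRTIME_S, WINDOW_S)
    status = "ok" if ok else "fails"
    print(f"{queued:>3} queued commands -> delay {queued * AIRTIME_S:.1f}s -> {status}")
```

With these made-up numbers a quiet network gets the key through easily, while a queue of 40 overhead commands pushes the exchange past the window and the inclusion has to be redone.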

Conclusion of this very long post: If you see inclusion failures, it is likely that your zwave network is flooded with unnecessary data transmissions set up by the vera. The first step is for you to increase your wakeup intervals and disable polling. The second step is to (wait for and) try 7.0.30 (this is why it is such a game changer for vera) to disable the 3 bloatware functions which are killing your network and the luup engine.
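As a rough illustration of why longer wakeup intervals matter, the sketch below counts how many wakeups per day a set of battery nodes generates. It uses around 50 battery nodes, the figure mentioned later in this post; the two intervals (30 minutes and 24 hours) are hypothetical examples, not vera defaults, and the count ignores whatever traffic each wakeup triggers.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def wakeups_per_day(battery_nodes, wakeup_interval_s):
    """Total wakeups per day if every node wakes at the same fixed interval."""
    return battery_nodes * SECONDS_PER_DAY / wakeup_interval_s


NODES = 50  # approximate battery node count from this post

short_interval = wakeups_per_day(NODES, 30 * 60)        # hypothetical 30-minute interval
long_interval = wakeups_per_day(NODES, 24 * 60 * 60)    # hypothetical 24-hour interval

print(f"30-minute interval: {short_interval:.0f} wakeups/day")
print(f"24-hour interval:   {long_interval:.0f} wakeups/day")
```

Each wakeup is also an opportunity for the controller to send extra commands to the node, so the real reduction in traffic is likely larger than the raw wakeup count suggests.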

All this explains the increased probability for a successful secure class key exchange after a full reboot of the vera: We interrupt the zwave heal and kill a number of nnu calls in the queue while the vera is freshly started up allowing for more available bandwidth for the secure key exchange to happen.
It also explains the occasional “can’t detect device” messages, which are due to repeated missed wakeups caused by the network being too busy. It explains spontaneous luup reloads and occasional massive response delays to commands. It is also the cause of the HEM child device ceasing to report, etc, etc…

After taking all of these actions, I am now on my 20th day of luup uptime and still running. I no longer have issues including devices on a 144 node network, the size of my logs has been divided by 5, my HEMs have never dropped child reports once, and the frequency of delayed commands/missed sensor untrip signals has significantly decreased. We still have a “got CAN” problem to solve…

Edit: One thing I wanted to add regarding inclusion. One can make the analogy of the zwave RF transmission to our voices talking. When we had the recommendation to be within 10ft of the vera to include a device, we can all picture that it was to make sure that the node to be added was loud enough to be heard by the controller and, vice versa, that the controller was loud enough to be heard by the node. The problem now gets more complicated if you have a very chatty network with other devices very close to the vera. Because they will likely be repeaters, these devices could be louder than the node you are trying to add if they are closer, or if their antenna or positioning is somehow more favorable than that of the device to be included. This is why, as strange as it sounds, it is sometimes better to be further away from the vera so that the signal is relayed by a louder repeater, or maybe two, each of which would be quieter because it would not be relaying all the messages. This is especially true when adding a battery operated device, which typically has lower power, when you have a repeater node very close to the vera.
Now the core issue is… why are all these devices so talkative when they should not be doing anything? This is where the overwhelming overhead comes in and should be reduced to the lowest possible amount.

Edit 2: The beta is now available! For posterity I am attaching a screenshot showing my current uptime.

Edit 3: Uptime with the beta. The official release for the plus/secure came and went.

Edit 4: After over 25 days of uninterrupted uptime, I upgraded to the release build 4833 but without the new kernel. I had to swap out a couple of sensors (1 ran out of battery and 1 was physically broken and could not be manually woken up). I can report a very noticeable decrease in battery consumption across all of my battery operated devices, be it FLiRs or regular ones. My Yale locks, for example, have now had the same lithium batteries for over 4 months, which has never occurred before; they typically lasted 1-2 months.

Edit 5: Summary of benefits observed.

After over 2 months of running with wakeup arr/nnu disabled and nearly 2 months of extended wakeup intervals, I wanted to list out all the problems it solved so that people can refer to it if they encounter these:

The obvious one is the extended battery life, especially on FLiRs like the zwave locks, which used to burn through my lithium batteries within 1-2 months. Now, after two months without changing the batteries, they are all still showing 100% and are still running strong. Likewise my econet vent, for which I had to recharge the batteries about every 6 weeks, is at the very least extending its battery life by 3x judging from the current battery level reading. It is still too soon to assess the battery life on all my sensors, but I get the feeling it may multiply even more. It is such a different experience to wake up every morning not having to check my vera UI to see which device needs new batteries!!! This was my life before, as every 3-4 days I had some batteries to swap out. I changed batteries on only 2 devices in 2.5 months!
See below the batteries consumed by my ~50 battery operated nodes over 2 years because of these absurd default or forced settings from the vera and knowing that I have been using rechargeable batteries on every device I could: It’s over 14lbs

My Aeon HEM which often used to stop updating data from one of the two child devices at least once every two weeks… completely stopped doing so.
I have a handful of Leviton 4 button scene and zone controllers with an embedded relay which would each lose their association about once every two months, requiring a power cycle to recover. This is completely eliminated.
The frequency at which I get delayed scene or even simple command execution dramatically decreased… by at least 10X. I rarely encounter issues like these any more.
Luup reloads, but you already knew that.
Random can’t detect devices… Completely eliminated.
Frequency of missed sensor trips and untrips dramatically decreased. It still very occasionally misses untrips, which I identified in the logs as associated with a “got CAN” and tardy event and sometimes wake-up polls.
Strangely high probability of secure key exchange failure during secure device inclusions. Eliminated.
Garage Door openers (Linear GD00Z) used to occasionally go out of sync on their open close status or stop responding to commands as if they were frozen, requiring a power cycle to recover. Eliminated.

@Pabla reported:

My z wave network is speedy fast
Battery life was never an issue for me with my Schlage locks but I’ve been observing that battery life isn’t going down as quickly
Inclusion/exclusion process is much more reliable
Any scenes I have run very quickly
Overall performance of my VP is way better