Vera 7.31 Core Firmware BETA Release

rafale77 · January 3, 2020, 2:35pm

@therealdb, I am indeed experiencing the same thing you did: increased activity level in the house leads to crash reboots and massive command lag.

To me it is not just about being dismissed; it is more about the return on (time) investment. I have spent so much of my time on this platform, often discovering and fixing things which are so trivial and obvious that I am losing confidence in the devs. I so often looked at what I found and screamed “what were they thinking?” in awe. The new ownership has been very promising and has helped, but from what I observe, it is mostly focused on new products and cloud/mobile-app-based solutions, which is the opposite of what they say. This is a trust issue. They also appear to be building the new platform from the roof down instead of from the foundation up, which is not giving me a whole lot of confidence either. There is a problem of dependencies: when the devs start from the cloud and mobile app before building the local API, local processing, or even communication stack support, I don’t see how this can work. When you are changing and redoing things which are easy and work reasonably well instead of doing the hard stuff on which everything else is supposed to rely and which is currently broken… how could I trust the final product?

Domoworking · January 3, 2020, 3:15pm

rafale77:

We can get it to be 80% there by lowering the chattiness, but the fundamental problem is in its command queuing and frame handling. Here is an example of what I still observe on 7.31, and which outrages me:

01      01/02/20 7:30:16.052    ZWaveJobHandler::ReceivedFrame NONCE_GET flood node 17 <0x7e380520>
01      01/02/20 7:30:33.005    ZWaveSerial::Send m_iFrameID 14107 type 0x0 command 0x13 expected 3 got ack 1 response 44437568 request 0 failed to get at time 28885005 start time 28864991 wait 20000 ack 1 m_iSendsWithoutReceive 0 <0x7ef80520>
01      01/02/20 7:30:33.006    AlarmManager::Run callback for alarm 0x902ed8 entry 0x2a55308 type 52 id 25322 param=(nil) entry->when: 1577979013 time: 1577979033 tnum: 1 slow 0 tardy 20 <0x7ef80520>


01      01/02/20 9:30:55.456    ZWaveJobHandler::ReceivedFrame NONCE_GET flood node 203 <0x7e380520>
01      01/02/20 9:31:15.415    ZWaveSerial::Send m_iFrameID 20800 type 0x0 command 0x13 expected 3 got ack 1 response 47031776 request 0 failed to get at time 36127414 start time 36107402 wait 20000 ack 1 m_iSendsWithoutReceive 0 <0x7ef80520>
01      01/02/20 9:31:15.416    AlarmManager::Run callback for alarm 0x902ed8 entry 0x22eac38 type 52 id 34429 param=0x2722de0 entry->when: 1577986255 time: 1577986275 tnum: 1 slow 0 tardy 20 <0x7ef80520>
01      01/02/20 9:31:15.422    AlarmManager::Run callback for alarm 0x902ed8 entry 0x2a29cf0 type 52 id 34408 param=(nil) entry->when: 1577986256 time: 1577986275 tnum: 1 slow 0 tardy 19 <0x7ef80520>

Why you would hold down any command for 20s waiting for a frame is beyond my comprehension ability.
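For anyone who wants to check the numbers, here is a minimal Python sketch that pulls the "failed to get at time", "start time" and "wait" fields out of the two ZWaveSerial::Send lines above and subtracts them. It assumes those counters are in milliseconds, which is only inferred from "wait 20000" lining up with the "tardy 20" (seconds) on the AlarmManager lines; the function name and regular expression are illustrative and not part of any Vera tooling.

```python
import re

# The two ZWaveSerial::Send lines from the log excerpt above,
# with the timestamp and thread id prefix/suffix trimmed off.
LOG_LINES = [
    "ZWaveSerial::Send m_iFrameID 14107 type 0x0 command 0x13 expected 3 got ack 1 "
    "response 44437568 request 0 failed to get at time 28885005 start time 28864991 "
    "wait 20000 ack 1 m_iSendsWithoutReceive 0",
    "ZWaveSerial::Send m_iFrameID 20800 type 0x0 command 0x13 expected 3 got ack 1 "
    "response 47031776 request 0 failed to get at time 36127414 start time 36107402 "
    "wait 20000 ack 1 m_iSendsWithoutReceive 0",
]

PATTERN = re.compile(
    r"m_iFrameID (?P<frame>\d+).*?failed to get at time (?P<end>\d+) "
    r"start time (?P<start>\d+) wait (?P<wait>\d+)"
)


def blocked_time(line):
    """Return (frame id, elapsed, configured wait) for one Send failure line.

    Units are whatever the log uses; milliseconds is an assumption.
    """
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a ZWaveSerial::Send failure line")
    elapsed = int(match["end"]) - int(match["start"])
    return int(match["frame"]), elapsed, int(match["wait"])


results = [blocked_time(line) for line in LOG_LINES]
for frame, elapsed, wait in results:
    print(f"frame {frame}: blocked for {elapsed} (configured wait {wait})")
```

Under that assumption, both sends sat for just over the full 20000 wait before giving up, which matches the roughly 20 s of tardiness reported on the alarm callbacks that follow them.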

What scares me most is seeing a well-detailed report of an existing problem without any comment from the dev team. I think this is even more frustrating.

therealdb · January 3, 2020, 4:23pm

Yeah, I feel the same. But I truly hope they will continue to be the hackable platform it is now, even in the future.

melih · January 3, 2020, 5:27pm

Our intention is to empower customization. However, building a proper infrastructure to allow customization will take time.

rigpapa · January 3, 2020, 6:18pm

What mystifies me about this is that I am having a very different experience. Yes, I have occasional reloads, but not anything like before (pre-7.29). Currently, I’m not seeing a lot of slowdowns in my network. Everything works well, even though December started with me adding a bunch of new devices to the network and not having a smooth start with all of them (many documented in these forums, but include two Schlage Z-Wave locks–one Plus, one ancient/not–, a Dome water leak sensor, a Zooz siren ZSE19, a Zooz switch ZSE15, another Zooz switch ZSE25, an ancient outdoor GE/Jasco appliance switch, and two Ecolink garage door tilt sensors). For the 1-2 days when I was mucking about with these devices, things got wild–while trying to get the siren in particular (a secure device) added, I ended up losing half my network and had to restore from backup (and I made a backup before doing anything, which was life-saving, but you learn the hard way). Things generally stabilized, but I could get the Vera to reload by manually operating one of the Schlage locks. That ended up being resolved by doing a heal on the lock and a cluster of three switches closest to it. Not much trouble since (although I’ll likely still replace it at some point, as the new ZP models are much faster).

I have one community plugin that I run on my house Plus, which I’ve had disabled since I did the device mucking before Christmas. I know that plugin has problems (some quite serious), but I’m too busy to write my own, so I’ve fixed many of them in my own version of it. I know it still has issues, and it uses facilities within Luup that I believe contribute to instability (Luup’s <incoming> read and built-in TCP socket handling, specifically). Other than that, I use (since they’re all mine, of course) Reactor, Switchboard, Rachio, SiteSensor, DelayLight, Emby, and Virtual Sensor. I use two unpublished plugins that I call LockValet (manages lock codes) and SceneSlayer (makes scene controllers work the way I want them to). Oh, and I recently started using Battery Monitor, which is a community plugin, but again, I use a version I’ve modified. My system has been quite stable. Without that one suspect plugin running over the holidays, I ended up with (a record for me) an unbroken 18 days of runtime on 7.30 (4833) just recently, before a reload occurred that I didn’t even notice.

I have expressed here many times my opinions about Vera’s pre-acquisition handling of the App Marketplace and the large catalog of booby traps and time bombs that it allowed to grow, particularly in the transition between UI5 and UI7, when they had ample opportunity to clean house. Unfortunately, in holding an eye to their future plans, eZLO has really largely ignored the App Marketplace, so the rot that festers there continues to be a minefield for new users to the platform, and is no less a hazard to even seasoned users here.

Being able to write plugins is a gift, but with great power comes great responsibility, as they say. In the Vera/Luup world, the plugin environment is not tightly sandboxed, at least, not as much, for example, as a script in a browser tab is. Unfortunately, it’s very easy for plugins to create deadlocks and cause reloads that bring the entire system down. It’s very easy for plugins to delay system operations (e.g. luup.sleep() should never have been implemented). It’s not hard to imagine the operation of a plugin (or even any scene Lua fragment) creating a significant enough delay in system execution that a Z-Wave message is missed altogether (or perhaps more correctly, not handled within time constraints/requirements).

There’s been a lot of attention to the specific actions of the ZWave stack here, and I agree completely that there are some things going on that don’t look like good choices, but what if those exceptions were being caused by irritants elsewhere in the system?

In 2017, I was at my wits’ end with the instability of my system and on my way to bringing up HomeAssistant. But like @rafale77 and @therealdb, the investment and sunk costs in Vera kept me there. I then began to exhaust every option I could think of to get my system working and healthy. Lo and behold, I stopped using a couple of very common plugins, and things got much better. Like @anon53786315, I replaced all plugins with my own Lua, and things stayed good. Some time after, I decided that this community needed better tools, and so a lot of the work I had done just for myself, I ended up publishing and now support.

Anyway, the point of this rather long missive is this: what I don’t yet see happening in trying to chase down these recurring errors is removing all of the other variables from the system–as many as possible, no matter how far-fetched their relationship might seem. Back everything up, and then, as painful as it might be, I suggest disabling plugins systematically… not uninstalling them and killing all the devices/losing configs/scenes/etc., just un-LZO the implementation file and put if luup then return false end at the start of the startup function, which will keep the plugin from starting up. Then observe.

I think this is a possibility worth exploring. I would not be surprised if we find there’s an irritant that’s exacerbating the other problems. Yes, we want the ZWave problems fixed, but consider, if we’re looking at a bug in Luup’s ZWave error handling, how much better can it get if the error handling improves but the stimulant is still causing an error that needs to be handled?

rafale77 · January 3, 2020, 6:48pm

I completely agree with you @rigpapa. My journey started at the same place. I started by removing all plugins from the vera. I am only left today with ALTUI which I found indispensable and system monitor which is very simple and I modified for my needs.
This enabled me to isolate and partition all sources of instability. Storage, data corruption, mios cloud etc.: I systematically eliminated all of these and am left with a vera being just a “dumb” zwave/zigbee radio with an API, and even doing that, it has problems. Not on the zigbee side from what I am seeing. It is all zwave. Like you, I have hit over 21 days of uptime on build 4833, but this was with only two people in the house. Once we were 10, with young kids running around tripping sensors etc., the vera started reloading luup up to twice a day with error code 245. When it did not reload, it would accumulate delays to near infinity: over 2h. So as I said, the command queuing is fundamentally broken on the vera, and the chattiness/busyness is what will reveal it. The extremely poor testing done by mios on their code and release management has led the devs to assume their tests on 3-10 devices can be scaled to 150 nodes, and to implement some very weird and shocking code like high default wakeup frequency, the wakeup nnu or nightly heal, but I am uncovering more and more as I go. Now that my extended family is gone, my vera and zwave network appear to be stable again, but when stressed, my network was behaving basically like the unstressed pre-7.30 network. The lower chattiness of the network is a huge improvement, no doubt. But there is a fundamental problem underneath.

PS: memory leak when reloadontimejump is disabled is still there on 7.31…

PS2: In all fairness, fixing the command queue is not trivial, though the mistakes made were gross and absurd. This, I think, is why ezlo is working on a new platform and calls the current one beyond fixing. I am reporting to try to make sure things like these don’t repeat in the new firmware.

therealdb · January 3, 2020, 8:06pm

I only have nest, SysMon, Alexa and Harmony. I moved all my critical code to my own Lua and a custom-made C# app.

I found major problems when making frequent http calls into the vera, so I will probably change to a different path (i.e. the Vera calling endpoints to get updated values), but zwave seems to be the major source of problems for me. I’m not sure what’s next for us, but fixing is easier than rebuilding. We’ll see.

rigpapa · January 3, 2020, 8:22pm

IIRC, you were using frequent variablesets, is that right? I thought that odd, too, because my automated test tool can beat the thing to death when it’s setting up a test. I don’t think I’ve seen it crash, though, at least, not because I was sending a flurry of requests. They are sequential coming from my tool, though, so if you have multiple coming in simultaneously from multiple sources, it’s not hard to imagine concurrency problems in the core.

therealdb · January 4, 2020, 8:23am

Yes, I think it’s an issue in how multi-threading is handled.

It helped when I wrote a handler and updated things accordingly, and when I removed part of the logic and simply stopped updating variables that were not needed.

I will probably implement some sort of queue logic and see if this helps, since it’s all done from my custom app in C# updating things from non-supported devices into virtual ones.

EDIT: I just added a semaphore, so calls are now queued. I will try again and see if slowing them down helps. We’re speaking of about 1-2 calls every 90 secs. When updating many more variables in less time (let’s say, 10 every minute), the stability dramatically decreased. I can have a reload every 2 to 3 hours this way. Bear in mind I’m on an Edge; maybe the CPU/memory combo has a worse impact compared to the Plus.
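The app in question is C#, but the idea of putting a semaphore in front of the calls can be sketched in a few lines of Python. This is only an illustration under assumptions: `SerializedSender`, `fake_variable_set` and the `CurrentLevel` variable name are hypothetical stand-ins, and no real request is sent to a Vera; the fake sender just records what it was given and how many calls overlapped.

```python
import threading
import time


class SerializedSender:
    """Let only one call through at a time, with an optional pause after each."""

    def __init__(self, send, min_gap_s=0.0):
        self._send = send                      # callable that performs the real update
        self._gate = threading.Semaphore(1)    # one permit: calls queue up behind it
        self._min_gap_s = min_gap_s

    def submit(self, *args):
        with self._gate:
            result = self._send(*args)
            if self._min_gap_s:
                time.sleep(self._min_gap_s)    # slow the stream of calls down
            return result


# Hypothetical stand-in for the HTTP call; it only tracks overlap.
in_flight = 0
max_in_flight = 0
calls = []
_counter_lock = threading.Lock()


def fake_variable_set(device, variable, value):
    global in_flight, max_in_flight
    with _counter_lock:
        in_flight += 1
        max_in_flight = max(max_in_flight, in_flight)
    time.sleep(0.01)                           # pretend the controller takes a moment
    with _counter_lock:
        calls.append((device, variable, value))
        in_flight -= 1


sender = SerializedSender(fake_variable_set)
threads = [
    threading.Thread(target=sender.submit, args=(n, "CurrentLevel", n * 10))
    for n in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"calls made: {len(calls)}, most at once: {max_in_flight}")
```

Even with five threads submitting at the same moment, the sender never sees more than one call at a time. Setting `min_gap_s` adds the "slow them down" part on top of the queuing.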

dJOS · January 4, 2020, 9:16am

So that went very quickly (literally <10 mins) and very smoothly … especially compared to the Charlie Foxtrot that was all versions of 7.30!

Btw, @edward might have been nice if you’d put up a link to this topic in the 7.30 topic! I might not have needed to go through 4 hours of wasted time and frustration today!!!

anon37769099 · January 4, 2020, 1:00pm

Updated my test VP now, no problems I can see. Works well with an old RFXtrx433 with plugin version 1.87 as well.

I’ll give it some time before I upgrade the main unit, given some people’s experiences here.

Matsohl · January 4, 2020, 2:07pm

Works well on my Edge, but I can see that it starts my Lua once every minute.

Catman · January 4, 2020, 2:43pm

Does this have the kernel update as well?

C

Matsohl · January 4, 2020, 3:58pm

Good question, how can I check that?

Catman · January 4, 2020, 4:01pm

No idea
C

Matsohl · January 4, 2020, 4:23pm

Logged in via ssh and then: uname -r
It is 3.10.34. I don’t know if it is the latest.

Edit: Seems to be an old one, but I am not a virtuoso at the command line, so perhaps the experts here can guide you.

HSD99 · January 4, 2020, 5:27pm

You must be on an Edge as 3.14.24 is correct for that platform. The kernel for V+ is 3.10.14 for 7.29 and 7.31.

Catman · January 4, 2020, 5:57pm

So it doesn’t have the updated kernel in 7.31 then? Interesting

C

Matsohl · January 4, 2020, 6:44pm

Yes, I am on my test Edge. I have not updated my main V+ yet since I want to be sure of the hiccups.

rafale77 · January 4, 2020, 7:32pm

Yes it does, as I reported above… from my tests, it prevents extroot. It also no longer gives you a splash screen/banner upon ssh login. These are very customized kernels from the ezlo team, so the numbering does not tell the whole story.