Vera Plus Restarting Too Often?

Having created a Reactor routine that sends me an email every time my Vera controller restarts, I’ve lately noticed I’m getting those messages about every 12 hours. There’s no particular pattern, no set time of day, and definitely no correlation with Vera activity (i.e. neither commands sent nor Scenes running) at those times. I’ve even been using the System Monitor plug-in to keep track of Vera’s memory (RAM) usage, and that number remains stable in the 125-180 kb range at all times, with the CPU load consistently between 0.07 and 0.24, all of which I consider rather conservative.

And her free disk space, according to PuTTY, is quite generous (/rootfs is only 26% full, for example).
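For anyone who wants to eyeball the same numbers from a shell session (e.g. over PuTTY/SSH) rather than a plug-in, a minimal sketch follows. These are standard Linux/BusyBox tools; the exact output columns may differ slightly on Vera's firmware:

```shell
# Hedged sketch: poll memory, load average, and disk usage from a shell.
# Standard Linux commands; column layout may vary on Vera's BusyBox build.
free | awk '/^Mem:/ {print "free_mem_kb=" $4}'   # free RAM, in kB
uptime                                           # 1/5/15-min load averages
df -h /                                          # root filesystem usage
```

Logging these three lines from a cron job every few minutes gives a timeline you can line up against the restart emails.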

No Internet outages, no network devices cycling, no power spikes, no server logouts, etc. …

In other words … VERA RESTARTS RANDOMLY ONCE OR TWICE DAILY!

Any tips on how to troubleshoot this phenomenon and thereby put an end to it?

THANKS! - Libra

Related reading:

NOTES TO SELF

  • The Roku plug-in generates an enormous number of entries in the Luup log (even though it sits idle 100% of the time); ◄— DELETED IT!
  • Seeing this in the Log feels scary ‘TempLogFileSystemFailure (not failure, only WriteUserData)’;
  • Admitted to self that I’ll never use Insteon PLM again, so removed interface;
  • I realize I don’t quite know how to identify a Luup engine restart by looking at the Log (which I naively assume will be preceded by whatever caused the Reload());
  • It’s not “got CAN” or “exit code: 245” and no “tardy” found in Log.
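On the note-to-self about identifying a Luup engine restart in the log: one low-tech approach is to grep for the handful of strings that typically surround a reload. A self-contained sketch below runs against a fabricated two-line sample; on the Vera itself, the usual log location is /var/log/cmh/LuaUPnP.log (an assumption worth verifying on your unit):

```shell
# Demo against a tiny fabricated sample; on a real Vera, point LOG at the
# live Luup log instead (commonly /var/log/cmh/LuaUPnP.log -- an assumption).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
01 04/10/20 12:35:20.100 got CAN <0x7628a520>
2020-04-10 12:35:22 - LuaUPnP Terminated with Exit Code: 137
03 04/10/20 12:35:22.370 LuaUPNP: starting bLogUPnP 0 <0x77654320>
EOF
# Lines that bracket a reload: the "Terminated with Exit Code" banner and
# the subsequent "starting" line. grep -n keeps line numbers for context.
grep -nE 'Terminated with Exit Code|starting bLogUPnP' "$LOG"
rm -f "$LOG"
```

Whatever caused the reload, if it was logged at all, should sit just above the "Terminated" line.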

More reading for you: Securing and stabilizing the Vera by taking it off the grid

When it is this frequent and has been there this long without anyone doing anything about it, it is no longer a problem or a bug. It is a Vera feature, or the “Vera style” of coding. You either have to accept it and live with it, or walk (run) away. I personally have already wasted too much of my life trying to lead the horse to the river; it won’t drink, so I can’t save it.

There are also a number of related threads I can remember which got deleted/censored over the past few months…


Do you have a spare Raspberry PI (or similar) ?


I sure don’t. Haven’t dabbled in the Pi realm yet (almost did once, though!).

They’re fun. I have 3 (I think. I keep losing them…) :wink:

C


I actually have had a couple for quite some time, as I bargain-shopped the rPi3B+ when the 4 was released. I think I got 2 for $45. I just got 2 RaZberry shields off eBay for $38 to go with them. This makes for a Vera Edge alternative which is cheaper (~$42) and superior in every way: on hardware (quad-core 1.4 GHz ARM CPU, 1 GB RAM, SD card or USB storage) and on software (Z-Way), and you can run openLuup directly on it. Now I have test/spare units to play with.
The only thing to be wary of is the storage: just as on the Vera, I recommend booting from an external SSD rather than the native SD card.


On closer inspection, I’ve started noticing a few of these in my Luup log:

02	04/08/20 17:44:40.104	ZWaveNode::AddPollingCommand light/switch node 27 doesn't support any COMMAND_CLASS_BASIC <0x7668a520>
01	04/08/20 17:44:43.704	got CAN <0x7628a520>
02	04/08/20 17:44:43.704	ZWaveSerial::Send m_iFrameID 19348 got a CAN -- Dongle is in a bad state.  Wait 1 second before continuing to let it try to recover. <0x7668a520>
01	04/08/20 17:44:55.607	ZWaveJobHandler::ReceivedFrame NONCE_GET flood node 27 <0x7628a520>
01	04/08/20 17:44:55.608	got CAN <0x7628a520>
02	04/08/20 17:44:55.608	ZWaveSerial::Send m_iFrameID 19411 got a CAN -- Dongle is in a bad state.  Wait 1 second before continuing to let it try to recover. <0x76c8a520>
06

Should I be concerned and/or take preventive measures??
(I don’t really know what the “NONCE_GET” and “got CAN” debacle was about.)

Not sure I want to say anything without appearing to get up on my soapbox again.


https://community.getvera.com/t/got-can-and-tardiness/210126/8

There is unfortunately not much you can do to prevent it. There is a lot of partial truth posted about it: the CAN in itself should not be a problem, being mostly a misinterpretation/mishandling of a frame by the Vera firmware, but it is what the Vera does in response to it which is catastrophic. You have had it for a long, long time, and through a snowball effect it can lead to a Luup reload, but it is only one of the mechanisms causing the Vera to crash. At the core is the handling of the command queue, which was botched many moons ago in ways so illogical that they would challenge my 6-year-old.


Got another restart half an hour ago, and immediately inspected my LuaUPnP log from the Vera Plus, where I noticed this:

2020-04-10 12:35:22 - LuaUPnP Terminated with Exit Code: 137

03	04/10/20 12:35:22.370	LuaUPNP: starting bLogUPnP 0 <0x77654320>
02	04/10/20 12:35:22.380	JobHandler_LuaUPnP::Run: pid 4775 didn't exit <0x77654320>
02	04/10/20 12:35:22.543	UserData::TempLogFileSystemFailure start 1 <0x77654320>
02	04/10/20 12:35:22.916	UserData::TempLogFileSystemFailure 8889 res:1

Is there anything there I ought to be suspicious about? Should I keep looking earlier in the log?? I believe “Exit Code 137” corresponds to being killed by Linux signal 9 (SIGKILL), but can’t for the life of me comprehend what event would precipitate that.

REFERENCE
http://forum.micasaverde.com/index.php/topic,9543.msg63629.html#msg63629

Exit code 137 is an indication of running out of memory forcing the kernel to kill the biggest hitter.
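The 137 decodes mechanically: shells report a signal-killed process as 128 plus the signal number, and 137 − 128 = 9, i.e. SIGKILL, which is exactly what the kernel's OOM killer sends. A quick sketch; the dmesg grep at the end is the usual way to confirm an OOM kill on a generic Linux box (whether Vera's kernel logs it identically is an assumption):

```shell
# Decode a shell exit status: values above 128 mean "killed by signal
# (status - 128)". 137 therefore means signal 9, i.e. SIGKILL.
code=137
if [ "$code" -gt 128 ]; then
  echo "killed by signal $((code - 128))"
fi
# On a live system, an OOM kill usually leaves a trace in the kernel log
# (assumption: Vera's kernel logs it like mainline Linux does):
#   dmesg | grep -i 'out of memory\|oom'
```

The echo above prints `killed by signal 9` for an exit code of 137.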


I’ve just power cycled my Vera Plus in hopes that whatever instability it had accumulated might be shaken out. (Don’t bother telling me how naive a proposition that is, lol!)

You may be having a memory leak. There is a known one that comes from running the Vera with ReloadonTimeJump disabled. Do you happen to be set up that way, by any chance?

other related threads

I had forgotten about it. In the old days this was plugin-related and very frequent. More recently, all the reports have been related to the NONCE_GET flood/tardiness-hangup problem, but it is nothing new. It is one of the staple features of the Vera.