Reboot Battle

My Vera Edge has been suffering from random reboots for some time now. It’s not just a LUUP reload, it’s a complete reboot of the unit. The lights go off and come back on after some time, and the unit is fine again.

Sometimes it reboots just 2 times a day, but usually it’s a lot more. The result is that it’s missing scenes, and during restarts I can’t control my devices.

I logged a case with Vera support and although they have been very helpful, they have not yet been able to solve this.

They did eventually find out that the reboots are probably caused by “CPU overload”. At least that is what the log says:

01 04/05/19 6:04:28.733 ReportError nm_cpu / CPU overloaded, count: 5, sec: 1202, idle: 0, system: 8, delay: 179 / <0x77632000>
10 04/05/19 6:04:29.763 FileUtils::ReadURL starting user: pass: <0x77632000>
10 04/05/19 6:04:30.782 FileUtils::ReadURL resp:201 size 43 <0x77632000>
01 04/05/19 6:04:30.810 CheckCpuUsage Calling DoReboot <0x77632000>
02 04/05/19 6:04:30.832 DumpProcesses: /bin/ps (0) Output
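
For anyone who wants to count how often this happens, here is a rough Python sketch that pulls the counters out of the "CPU overloaded" lines and pairs them with the "Calling DoReboot" line that follows. It only assumes the line format shown in the excerpt above; the function names are made up for illustration, and the meaning of the individual counters is not documented here, so they are kept as raw numbers.

```python
import re

# Pattern for the "CPU overloaded" report as it appears in the excerpt.
# Field names are copied from the log; their exact meaning is an unknown,
# so they are only converted to integers and not interpreted.
OVERLOAD_RE = re.compile(
    r"CPU overloaded, count: (\d+), sec: (\d+), idle: (\d+), "
    r"system: (\d+), delay: (\d+)"
)


def parse_overload(line):
    """Return the overload counters from one log line, or None if absent."""
    match = OVERLOAD_RE.search(line)
    if match is None:
        return None
    keys = ("count", "sec", "idle", "system", "delay")
    return dict(zip(keys, map(int, match.groups())))


def find_reboot_triggers(lines):
    """Yield each overload report that is followed by a 'Calling DoReboot' line."""
    last_report = None
    for line in lines:
        report = parse_overload(line)
        if report is not None:
            last_report = report
        elif "Calling DoReboot" in line and last_report is not None:
            yield last_report
            last_report = None
```

Feeding it the lines of a saved copy of the log, `list(find_reboot_triggers(lines))` gives one entry per reboot that was preceded by an overload report.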

From that point they focussed on the plugins I have running. I removed some to check if that would help in stabilizing the unit, but it did not. I don’t have many plugins, and removing them all is just not an option as it would disable all the automation I have built up over the years.

Does anybody have any ideas or tips on how to troubleshoot this further? Is there a way to find out what device or service is causing these high CPU loads? I would really appreciate it if somebody has some insight on this. I added a screenshot with some CPU load data; is my CPU load really high or normal?

A list of your plugins (remaining) would be useful. Otherwise we have a chalk line on the sidewalk but no bodies, no witnesses.

At the moment I have the following plugins installed:

PLEG / Vera Alerts
Panel Manager (1 panel with 1 button only)
Virtual Sensor / Sitesensor (:wink:)
Smart Meter **
Sonos **
Honeywell Wifi Thermostats
System Monitor **
Eventwatcher **
HTTP Switch (WiFi Switch)
Broadlink RM Interface

(** = already tried removing; no effect)

I don’t suppose you have any memory of when it started? Your CPU loads are very high in general. Did you try reloading the firmware? Do you have scenes which run in frequent cycles?

Can you SSH into the box, run ps aux, and see which processes are consuming the CPU cycles?

Yeah, you’re down to the biggies.

I doubt VirtualSensor has anything to do with it, it’s just too dumb/simple. But it would be possible for SiteSensor to cause a sudden memory exhaustion event (for which a total OS crash could well be the result) if it’s hitting a site that returns a huge JSON response. I mean, it would have to be a very large response, but it’s not out of the question. The problem with the built-in JSON parser available in Luup is that you have to have the entire JSON string in RAM to parse it. If your SiteSensor is not using a JSON response, just matching a string in the response, that is done incrementally (in a 2048 byte sliding window), so it is unlikely to cause a memory problem no matter how large the response. But, if you are using JSON, I would say the next low-hanging fruit to try is to disable your SiteSensors (turn off all, and if the problem goes away, turn them back on one at a time until the problem comes back and you’ve likely identified the site request that’s causing heartburn).
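
To illustrate why incremental matching keeps memory flat, here is a minimal Python sketch of a sliding-window string search over a stream. It is not SiteSensor's actual code, just the general technique; the function name and the use of a file-like `stream` are assumptions, and only the 2048 byte chunk size comes from the description above.

```python
import io


def stream_contains(stream, needle, chunk_size=2048):
    """Return True if `needle` (bytes) occurs anywhere in a binary stream.

    Only one chunk plus a short tail is held in memory at a time, so the
    size of the full response does not matter.
    """
    if not needle:
        return True
    # Bytes to carry over so a match that straddles two chunks is still seen.
    keep = len(needle) - 1
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return False  # end of stream, no match found
        window = tail + chunk
        if needle in window:
            return True
        # Guard against keep == 0, where window[-0:] would keep everything.
        tail = window[-keep:] if keep else b""


# Usage: the match below straddles the first 2048 byte boundary.
body = io.BytesIO(b"x" * 2040 + b"temperature" + b"y" * 5000)
found = stream_contains(body, b"temperature")
```

A JSON parser that needs the whole document as one string cannot work this way, which is the difference described above.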

Beyond that, though, you’re into the hard stuff… PLEG, my friend. I walked away from it 18 months ago and it changed my life (or more correctly, saved Vera’s in my house).

Edit: Also, check your SiteSensor request interval–a high request rate will run the CPU hard as well. Once a minute is not a high rate. But the closer to 0 you are from there, the more suspect.

I don’t think PLEG is any harder than anything else to troubleshoot. Obviously with a logic engine it is possible to create an infinite loop, so I would review each PLEG to make sure conditions are evaluating as expected. But usually when I have done something dumb in PLEG I just cause the lights to flash on and off, taxing the Z-Wave network but not the CPU.

I agree with @rafale77 to SSH and see what processes are dominating.

Your answer reminded me of the fact that I did have memory “issues” before that caused the unit to completely freeze up and generated “can’t write user data” errors. I solved this by clearing the cache when memory drops below 50000. I will keep an eye on the memory level when a reboot occurs.
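
A rough Python sketch of that kind of check, for illustration only. It assumes the 50000 figure is in kB and refers to the `MemFree` line of `/proc/meminfo`; neither is stated above, and the function names are hypothetical. The sketch only makes the decision; it does not clear anything itself.

```python
def mem_free_kb(meminfo_text):
    """Extract the MemFree value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemFree:"):
            # Line looks like: "MemFree:   42000 kB"
            return int(line.split()[1])
    raise ValueError("MemFree not found")


def should_drop_caches(meminfo_text, threshold_kb=50000):
    """Return True when free memory is below the threshold.

    Assumption: threshold is in kB. On Linux the caches are usually cleared
    by writing to /proc/sys/vm/drop_caches as root; that step is left out.
    """
    return mem_free_kb(meminfo_text) < threshold_kb


# Usage with a made-up sample instead of the real /proc/meminfo:
sample = "MemTotal:  126000 kB\nMemFree:   42000 kB\n"
low = should_drop_caches(sample)
```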
I did reload the firmware a week ago and I am running the newest beta, which is doing quite well. I have some scenes that run frequently, but not a lot in my view.
Below is a screenshot from the SSH session:

SiteSensor just parses a simple JSON response from OpenWeather; I really don’t think that does a lot. But I will disable that one just to check if it helps. Yeah, I read on the forum that PLEG can be quite CPU intensive. The problem is that I built my whole automation around it, so saying goodbye to it is hard. I will have a look at your Reactor plugin, it might be a very good alternative. I am also going to see what disabling PLEG does to the CPU load.

If I were to replace my Edge, would a Vera Plus be a better choice?

Thank you both for looking into this!

I can see @rigpapa replying as I type… Pretty cool feature of this new forum site!

I don’t see anything crazy in your processes. It looks like it should be only ~35% busy, but your charts showed much higher CPU utilization rates. By any chance, do you see anything if you scroll up? The LuaUPnP program is at 23%, which seems OK.

No worries, I was just trying to give you some scenarios to play with. I agree, unless openWeather goes rogue, the JSON response will be light and no problem.

Also, try the top shell command… I think it’s a bit more readable than ps aux

Well, after a couple of weeks of reboot bonanza, watching CPU loads skyrocket, analysing logs, and listening to wife and children complaining about all sorts of things that used to work by “itself”, I was about to throw in the towel… :grimacing:

But after reading the extroot thread (Vera extroot) I decided to try that as a last resort. I bought a small SSD and removed the old USB stick I was using before.

I tried to run the script but not much happened. I reached out to the always helpful @rafale77 and tried some more things. In the meantime I noticed that the unit did not reboot anymore :smile: The CPU load averages around 0.40 (15min) and the unit is responsive and fast. Scenes are running uninterrupted again.

No idea what happened here, but needless to say I am quite happy. The unit has been up since Saturday without skipping a beat.

I still want to finish the extroot process to give myself some extra storage space, but for now it’s good.

Ouch. Yeah, I know @rafale77 recommends against USB sticks, and with well-founded reason. They just don’t stand up to the kind of punishment they get as a full-time storage device. It seems likely it was degrading, and maybe not great to begin with. Performance can also be an issue–if the stick isn’t fast enough, internal assumptions on timeouts/timing start to go wrong and things go tumbling down.

Great to hear that it worked out.