Restarts on just about any scene or Reactor

ruster34 · April 24, 2020, 2:11pm

@LibraSun - yes, and agreed - but my head scratching is that everything was fine a couple weeks ago with the same configs. I had moved a ton of stuff to Reactor months back and it has been noticeably more stable throughout the day. I dare say that I would even go a few days in a row with no restarts. But something must be hosed to create such a dramatic and recent change in stability…

@Catman Yes, that was my next step as well - just delete the house mode change and see if it executes ok.

@rigpapa One of the many things that is so incredible about Reactor is the ability to continue activity executions even factoring in restarts. But I’ve noticed with this recent issue that even Reactor is not completing some (not all) tasks that follow a delay when a restart occurs. Does that shed any light on it?

I’m going to try removing the house mode changes next and report back.

Thanks to all of you guys (and this community in general) for the pro-bono help we all provide each other!

Ryan

LibraSun · April 24, 2020, 2:39pm

FYI, this is why I wrote my Replacing HouseModes Plug-In with Reactor treatise last month!

ruster34 · April 24, 2020, 3:05pm

LOL

treatise

ruster34 · April 24, 2020, 3:31pm

Ironically - I just did this as my first troubleshooting step. I had the housemodes plugin, and had ‘clicking the button’ as an action in the scene. I removed that and used Reactor’s built-in “change house mode” as a replacement.

rigpapa · April 24, 2020, 3:48pm

I would also study your LuaUPnP.log file carefully. If you find a lot of red and yellow in there, and the infamous “got CAN” messages, that’s an indicator that ZWave commands are being dropped or ignored by the engine. There’s no way for Reactor to know that. It speaks to the health of your mesh and devices. Depending on the structure of the mesh, a single device changing neighbors or dropping out can cause big problems that result in delays and other issues, and very often lead to “got CANs”.
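For anyone who would rather count than scroll, here is a minimal Python sketch of that log check. It assumes each entry starts with a two-digit level followed by a date and time (as in the log excerpts quoted in this thread), and that levels 01 and 02 are the ones shown in red and yellow; the function name `summarize_log` is made up for illustration.

```python
import re
from collections import Counter

# Assumption: entries look like "01 04/24/20 16:16:30.306 ...", i.e. a
# two-digit level, then date, then time. Levels 01 and 02 are assumed to
# be the entries displayed in red and yellow respectively.
ENTRY = re.compile(r"^(\d{2})\s+\d{2}/\d{2}/\d{2}\s+\d{1,2}:\d{2}:\d{2}\.\d{3}\s")


def summarize_log(lines):
    """Count level-01 entries, level-02 entries, and lines containing 'got CAN'."""
    levels = Counter()
    got_can = 0
    for line in lines:
        match = ENTRY.match(line)
        if match:
            levels[match.group(1)] += 1
        if "got CAN" in line:
            got_can += 1
    return {"errors": levels["01"], "warnings": levels["02"], "got_can": got_can}
```

To try it, copy the log off the Vera and pass the open file in, for example `summarize_log(open("LuaUPnP.log", errors="replace"))`. A high `got_can` count relative to the time span covered would point toward the mesh rather than Reactor.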

ruster34 · April 24, 2020, 4:05pm

The only ‘red’ lines I’m seeing (and not sure how far back to go, this is all within the last 5 minutes or so)
I don’t know what device 277 is, can’t find it. Device 134 is a motion detector.

LuaInterface::CallFunction_Timer device 277 refreshCache took 10 seconds (multiple instances of this entry)

ZWaveNode::HandlePollUpdate_Alarm node 59 device 134 v1type: 0 v1level: 0 source: 0 status: 255 type: 7 event: 0 parms_len: 0 parms: 0 code: (null)

Update that I just ‘caught’ another restart in the log:
01 04/24/20 16:16:30.306 FileUtils::ReadURL 28/resp:0 user: pass: size 1 http://apps.mios.com/get_plugin_version2.php?plugin=4086&accesspoint=50061111&platform=mt7621_Luup &firmware=*1.7.4970*&oem=1 response: <0x7706c520>
01 04/24/20 16:16:30.307 JobHandler_LuaUPnP::GetPluginVersionOnline iPlugin: 4086 buffer empty <0x7706c520>

This restart was triggered by running a scene which contained only one plugin (which is also in the others that have been triggering it) - Harmony.

Edit 2: I have tried the same Harmony activity manually and it all worked fine. Ugh, this is driving me nuts.

ruster34 · April 24, 2020, 8:45pm

I’m considering just taking a solid backup and factory resetting everything. I have never seen it this unstable - restarting on small automations and/or not completing them. It would be understandable if I had added a new plugin, reconfigured something in some way but I haven’t.

anon53786315 · April 24, 2020, 8:47pm

So you’re saying that you’re commanding Harmony at the mode change? I have seen plugins that access an external server’s API cause problems when you command them to do something at a mode change. In my cases, I found the cause to be an inappropriate (or missing) timeout value for network operations in the plugin, or just some incomplete error handling. I’ve seen that cause a reload in one of two ways:

it causes a deadlock situation because Vera winds up with two things waiting for each other, which after 60 seconds leads to a reload
it causes the scene to run too long (several seconds), and Vera reloads to fix the problem

Did anything change in your home networking recently?
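To illustrate the missing-timeout point above: Vera plugins are written in Lua, so the following is only a generic Python sketch of the idea, not code from any plugin. The function name `fetch_status` and the 5-second default are assumptions for illustration. The point is that a network call with an explicit timeout gives up and returns, instead of leaving its caller waiting indefinitely on a peer that never answers.

```python
import socket


def fetch_status(host, port, timeout=5.0):
    """Return the first bytes the peer sends, or None if it fails or stalls.

    The timeout covers both the connection attempt and the read, so the
    caller is never blocked for much longer than `timeout` seconds per step.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            return conn.recv(1024)
    except OSError:
        # socket.timeout is a subclass of OSError, so a stalled peer lands
        # here as well as refused or unreachable connections.
        return None
```

Without the `timeout` argument, the `recv` call would wait for as long as the peer stays silent, which is the kind of stall that can end in the 60-second reload described above.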

ruster34 · April 24, 2020, 8:58pm

So - the Harmony does change on mode changes, but the restart is happening even if there is no mode change. And then the restart doesn’t happen if I just manipulate the Harmony ‘device’ manually. It’s literally like my ‘scene engine’, for lack of a better term, is broken. I can alter all plugins manually - meaning I can change thermostat modes, run Harmony activities, etc. One of the restarts even triggered from turning on a virtual switch which does 2 things: unlocking deadbolt and disarming the alarm.

Side Note:
I think I have run some command at some point that recycles logs faster than normal, because every time I pull them up, they only go back to about 10 minutes ago, with the first entry being:
02 04/24/20 16:42:17.624 Finished rotate logs <0x7713d320>

Now, to your question about home networking -
Yes. We had left the house vacant for a few weeks and internet had crapped the bed, and I had to reboot even netgear switches to get them working again. I wonder if I got a surge or crazy power spike. I have vera, router, NAS, etc on a UPS so didn’t see any issues there, but I had to power cycle like every single thing. Ubiquiti APs, you name it.
I have lots of chromecast audios and in doing all this ‘restoring’ I decided to create a new WLAN and put them on that to get them a little separate. I also gave my 5 GHz network a different name from the 2.4 GHz one. Vera is on LAN directly into the router - and the IPs of other devices, like the Harmony and Alarm panel for example, are accurate and never changed.

What ideas do you have on the networking change?

anon53786315 · April 24, 2020, 10:08pm

Interesting … I wonder if instead of a power glitch you had a hacking attempt. If you haven’t already rebooted your router, I’d suggest that. Some hacking attempts leave malicious code in the router’s RAM that gets cleared out with a reboot.

One hacking method is to change the name servers in your router to something else, so that all your network accesses get diverted to the hacker, who intercepts data and passes it on. This tends to slow down access to everything. It doesn’t slow down your connection speed, but it adds a delay to the start of every access. Do you know how to check the nameservers in your router?
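One rough way to look for that start-of-access delay from a computer on the same network is to time a name lookup. This is a minimal Python sketch, not a router check: it only measures the machine it runs on, the operating system may answer from cache on repeat runs, and the function name `time_lookup` is made up for illustration.

```python
import socket
import time


def time_lookup(hostname, resolver=socket.getaddrinfo):
    """Resolve `hostname` once and return (succeeded, elapsed_seconds).

    `resolver` is injectable only so the function can be exercised
    without a network; by default the system resolver is used.
    """
    start = time.perf_counter()
    try:
        resolver(hostname, None)
        succeeded = True
    except OSError:
        # socket.gaierror (lookup failure) is a subclass of OSError.
        succeeded = False
    return succeeded, time.perf_counter() - start
```

For example, `time_lookup("apps.mios.com")` times a lookup of the host that appears in the log excerpt earlier in the thread. Lookups that consistently take a second or more would be worth following up on the router's DNS settings page.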

With regards to your faster log recycling – were you saving logs to an external USB stick before? Maybe the “Store Logs on USB Device” got unchecked somehow, or the USB stick failed/got corrupted? Vera can’t store as much internally so the logs recycle faster without a stick.

When you look at your log file at the time of one of the reloads, can you find the line that has “LuaUPnP::Reload” on it? What else is on that line?

If you’re only seeing the problem when you execute the commands in a scene, then maybe that’s the “scene ran too long” problem. You can look in your log file and it will tell you when the scene started executing (search for the scene name in the log). How long between the scene start and the “LuaUPnP::Reload” line?
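The measurement described above can be sketched in Python. This assumes each log entry carries its date and time as the second and third whitespace-separated fields (as in the excerpts quoted in this thread); the function names and the scene name passed in are placeholders, not anything Vera defines.

```python
from datetime import datetime


def parse_time(line):
    """Return the entry's timestamp, or None if the line has no parsable one."""
    parts = line.split(None, 3)
    if len(parts) < 3:
        return None
    try:
        return datetime.strptime(parts[1] + " " + parts[2], "%m/%d/%y %H:%M:%S.%f")
    except ValueError:
        return None


def seconds_to_reload(lines, scene_name, reload_marker="LuaUPnP::Reload"):
    """Seconds from the first line mentioning `scene_name` to the next reload line.

    Returns None if the scene is never mentioned or no reload follows it.
    """
    start = None
    for line in lines:
        stamp = parse_time(line)
        if stamp is None:
            continue
        if start is None:
            if scene_name in line:
                start = stamp
        elif reload_marker in line:
            return (stamp - start).total_seconds()
    return None
```

Called as `seconds_to_reload(open("LuaUPnP.log", errors="replace"), "My Scene Name")`, a result of several seconds would fit the "scene ran too long" explanation, while a figure near 60 would fit the deadlock one.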

ruster34 · April 25, 2020, 9:46am

Interesting idea -
Yeah I rebooted the router, vera, switches, APs, just about everything connected. Things were in this down state for about 2 weeks so everything had festered for a while. I will say something interesting was that the time was way off for a windows PC that is hardwired (and set to automatically update). Trying to confirm in my router now and it’s off a few seconds, looks like the last response from the NTP server was April 4th and it just says waiting for response.

Regarding the USB logging, I have a RFX transceiver (window shades and temp sensors) so can’t use a jump drive for them.

LibraSun · April 25, 2020, 3:59pm

Might try swapping out some of your Ethernet cables in case one or more took an electrical strike. Same logic applies to one or more router/switch ports in network.

ruster34 · April 25, 2020, 5:01pm

Next ‘upgrade’ in home automation is to replace router and net switches… and this very well might be the last straw to do it. I power cycled the router this morning, only thing that would force it to refresh its NTP server connection, and it updated its time a few seconds. I also took one of the switches out of the equation for the APs connection. Pulled the plug on vera and let it sit for an hour in timeout. I had to partial reset it twice (3x reset pushes in 6 seconds) to get it to establish connection again on the network. Of course I had to synchronize a few things like the ecobee token. That among other things accounted for the 3-5 restarts off the bat, but it’s been stable since, best I could tell.

I wonder if there was an issue with the timers in scenes and the router being off a bit on time? I know time being off can make all kinds of goofy things… so might be it. I’ll know for sure if it makes it 24 hours.

Side note of ‘smart home hell’ or ‘WAF’, etc. - I have a condition that starts chromecasting to 6 speakers throughout the house, pretty fun - even changes the Pandora station based on weather temps, etc. Basically the condition is an ‘on or off’ thing, which begins playing when we all wake up, come home, etc. Turns off at goodnight, leave house, etc. At exactly midnight, music began playing like a freaking scene out of Poltergeist! Hence vera got put in timeout when I got up and ran my morning scene - which of course triggered several restarts…

LibraSun · April 25, 2020, 5:03pm

You have just lived my nightmare. #death_by_waf

ruster34 · April 26, 2020, 2:21pm

Ok, so if I run this group activity (denoted by the red arrow), it triggers a restart. My only ideas that would hose it up would be the variable setting or thermostat mode (which uses the ecobee plugin). I can run both of those manually (denoted by the green arrows) and they both succeed without issue.

Is it possible the delay for 10 seconds is causing the tripping? @LibraSun you mentioned the scene timers issue, but I did run that code…

rigpapa · April 26, 2020, 2:38pm

The question is not whether you can run the commands individually, but whether removing them makes the restart go away. My suspicion is that the Ecobee plugin is communicating with its target devices and something in that interaction is causing a deadlock. In this instance, it may simply be because you are running the activity via the test function, which makes an HTTP request to the Vera to run the activity; if the Ecobee plugin also uses HTTP in its communication, that would be the makings of a potential deadlock.

An alternate way to test this activity would be to toggle the “NOT” group setting on the “Everyone Awake” group. This would cause a reversal of the condition group’s result that will make the group activity run (or not–toggle it back and forth until it does). This is a more “natural” way of running the activity, because it is driven by the normal flow of logic and not by an outside HTTP request.

LibraSun · April 26, 2020, 2:45pm

Always suspect the ecobee plug-in! Unmaintained (its keeper banished to Oblivion) and fraught with peril.
I use it and recall having to navigate several workarounds because the API kept returning bogus values, like (nil) for the ClimateHold. Why? Because any deviation from its prevailing “Comfort Setting” (home, away, etc.) results in a blank there.
Vera no likey the blank!
But I think @rigpapa is on to the correct path here. Keep us posted!

rigpapa · April 26, 2020, 2:52pm

I want to be clear, though, I’m not saying there’s a problem with the Ecobee plugin, but there are a lot of ways to deadlock Luup, and not all of them are obvious, or apparently consistent. Sometimes, you really can’t put a finger on why, and you just have to do things differently to make it work. I will say that I see one pitfall, though, and that is that the Ecobee plugin does a lot of its updates in <run> (inline) tasks rather than jobs, and when you have potentially long-running tasks like communication with a remote device (and that itself has long timeouts when things go badly), it’s much higher risk that the plugin’s work causes issues. On the other hand, I have recent experience with serially-run jobs deadlocking, not every time but maybe one of five runs (often enough to be troublesome), for no apparent reason. I, for one, will say “good riddance” when this firmware is put to pasture.

anon53786315 · April 26, 2020, 8:46pm

You use both Harmony Hub and Ecobee plugin – I have found them both to be incredibly fragile to the unexpected on the network. So much so, that I have the two of them isolated in a 2nd Vera so the other 99% of my home automation can proceed undisturbed on my primary Vera while those two plugins reload to their hearts’ content.

You said this problem began suddenly when you started having network problems, right? Did you resolve those? Perhaps that’s an easier path to get back to where you were than digging into plugin code and Vera firmware that you can’t fix?

ruster34 · April 27, 2020, 3:30am

hahaa that’s so funny - and sad.

I have a vera lite and that’s not a bad idea to isolate those 2 plugins over there - can’t believe I’ve not thought of that. Unfortunately I have considered running HA on my HTPC and putting all the heavy lifting over there and leaving the ‘easy’ stuff to Vera since it’s more her style.

I’ve tried to rectify any and all network problems - and all devices (I have tons, ip cameras, smart devices like ecobee, etc, probably over 50 ip clients) are all happy and up. Vera continues to spiral…

For example, I had a device trigger on today, plus a vera alert, that only comes on in night mode and after a certain time, and it came on when neither condition was true. It’s a bit out of sorts and I’m close to giving up hope.