Context and thread for executing actions

I have a plugin (Vera2 UI5) that expects to have a service action of a device instance triggered by an HTTP call from an external device as follows:

http://:3480/data_request?id=action&DeviceNum=&serviceId=&action=SetPresent&newPresentValue=

No problem triggering the action. This action read/writes to local variables and also does luup.variable_set and luup.call_delay (used for a timeout check).

So my problem is that I have more than one HTTP client triggering the action for a single device instance and the two clients are asynchronous. When the actions are triggered more than, say, 2 seconds apart, everything runs just fine. When two HTTP clients trigger the same action within about 200ms of each other, I get a LuuP crash/restart.

So this raises my concerns about what thread is running and what context is running at certain times. It was my understanding (perhaps wrong!) from many messages in this forum, that each device instance (I only have 1 of these) is in it’s own Luup context (assuming that means local variable instances) and also has two threads. Don’t know why 2 threads. What is not clear is whether the HTTP parsing calls the action function or whether it “messages”, “queues” or similiar, the device to perform the action. Or worse yet, it creates a job (which I haven’t seen defined very well) that executes the action and the job has its own thread.

So, I ask:

  1. what is a luup context
  2. what are the two threads per device instance used for
  3. in what thread are the service actions of a device instance executed
  4. in what thread and context are callbacks executed (specifically call_delay, but other callbacks as well)
  5. what is a job and in what thread and context are the jobs run

Though I’m no expert, I have 30 years of experience in mutlitasking, embedded programming. I always prefer to understand the architecture of the system so I can then solve my own problems rather that bug you guys to fix them. That is why I intentionally haven’t posted code or other specifics. Having said that, I’m very happy to answer any questions and provide any code.

Also, I’m very happy to read manuals, so if the answer is in a manual, please give me a link. I couldn’t figure out the answers from the plugin development docs, though.

Regards
Keith

Reading this with great interest as I am observing similar issues with HTTP calls sent from a client but to multiple devices. On a Vera Plus (UI7) it works but with limitations. It seems that if I sent too many commands within a short time (I have not been able to quantify the time) I often get a Luup reload. I initially thought about as a command queuing but upon looking my logs, it is likely the Luup engine itself not being able to handle the jobs and having a job hang and timeout is what is apparently crashing the luup engine as I am getting an error code 137.

e[7;36m2019-01-15 16:42:56 - LuaUPnP Terminated with Exit Code: 137e[1;00m 02 01/15/19 16:42:58.206 e[33;1mJobHandler_LuaUPnP::Run: pid 27983 didn't exite[0m <0x77fe8320> 02 01/15/19 16:42:58.648 e[33;1mUserData::TempLogFileSystemFailure start 1e[0m <0x77fe8320> 02 01/15/19 16:42:58.750 e[33;1mUserData::TempLogFileSystemFailure 7621 res:1

Though it may not be exactly the same problem since you are dealing with a single device and multiple commands, I am seeing a crash for multiple device and commands. I too have been looking for the command queuing limits of the engine.

As I mentioned, in the action code I make calls to luup.variable_set (two). I have now changed that to make them conditional and happen far less often. I’m going to watch for “back to back” executions of the action (which is what seems to trigger the Luup restart) and see whether this changes things (i.e. less likely to restart Luup now).

I have a few other reorganizations of code I can do (haven’t yet) to do far less work in the action (e.g. executing much code in a call_delay callback) to see if I can minimize the work that is done without a break (i.e. back to back executions of the entire action).

My suspicion is that I am crashing because I’m doing back to back executions of the action without a break in between. Although this is completely “legal”, perhaps the system is unable to do enough periodic housecleaning (garbage collection maybe?) and it incorrectly concludes a failure (memory leak maybe?) and crashes Luup. If I only do a tiny bit in the action (I can’t prevent back to back calls to the action) and schedule the rest with call delay, perhaps it will get the break it needs to clean house. This will be my next test if this one prove better but not perfect.

I’m totally guessing since I am unfamiliar with the Lua (or Luup) environment. My guesses are that the problem is buried in the system, not in my code. How arrogant of me!

Will advise on the current test (fewer luup.variable_set calls in callback) when it has a bit of runtime history…

I have experienced the same. Most of my commands though come from openLuup and I am testing some codes to add intervals between commands on openLuup. I am fairly certain now that the vera has no command queue.

If the action you are running is enclosed in a tag in the implementation file (I_xxx.xml), you might want to try changing that to a tag. This changes the execution environment from an immediate, in-line run of the action’s implementation to deferred/queued. The use of requires that you return a job completion status from the implementation, with [tt]return 4.0[/tt] being the most common success indication, and [tt]return 2,0[/tt] indicating failure.

I’ve found that many deadlocks can be avoided by careful selection of what runs in line, and what is deferred, particularly when communicating with certain classes of Z-Wave devices (locks, in particular… the deadlocks, oh, so many).

Thank you. Unfortunately the core implemetations code on the vera does not appear accessible so I don’t know whether they run as job or run, However I am now testing akbooer’s suggested code on openLuup to see if openLuup can delay its calls. It appears than even a very slight delay has been able to address my problem with this one particular scene but I have another which potentially could be worse triggering 50+ zwave actions I will test.

FWIW, IIRC, a job will return a job number as part of its return data, whereas a run does not.

Turns out that removing the guarantee of two luup.set_variable in the action did not improve it. It crashed after running the action without calling luup.set_variable once.

So I added the line “os.execute(“echo 3 > /proc/sys/vm/drop_caches”)” just before returning from the action and it has not failed yet. Very premature and I’m not convinced since I have been checking top and free and I see very low on free memory and still very much cache.

@rigpapa: I’m interested in the vs . Sounds promising! I don’t really see a downside to doing that.

@rafale77: keep me informed of what you try and what is better, worse, disastrous and, most importantly, successful. Your situation and mine might be the same.

So I searched re vs and looky here!

http://forum.micasaverde.com/index.php/topic,28583.msg204271.html#msg204271

Sure seems to support rigpapa’s suggestion.

So adding “os.execute(“echo 3 > /proc/sys/vm/drop_caches”)” helped a lot but didn’t cure the problem. Still got Luup reloads with same status (245).

Eliminated os.execute and changed to and changed the return true to return 4,nil. Amazing how much code there is out there that has no return at all in either or tags of actions. Even MCV code. Lots of it.

I will let you know of the stability in the next couple of days.

@rafale77: I have very little in my action but it is always triggered by HTTP request. I think you are in the same boat. I’m starting to think that although my code is perfectly acceptable to be (i.e. it executes quicly), in the big picture it is only the tail end of an HTTP request and, for all I know, HTTP is held up pending my return code. It may be that if the action comes from HTTP, is the only safe way. Again, let’s see about the stability.

AFAIK, this is not an issue. As for scenes, a blank return from either run or job code should do the expected thing: exit successfully.

That’s what I found in my testing whilst reverse engineering the Luup engine for openLuup.

Well, changing from to didn’t help. It still crashes when two jobs run back to back. Surprised me.

So now I have it as and I added back in “os.execute(“echo 3 > /proc/sys/vm/drop_caches”)” which definitely helped. I also added in “os.execute(“sync”)” before drop_caches to free a bit more. Oddly, neither of these two seem to help free memory much. Maybe it just slows things down a bit to let things settle on the back to back jobs.

Essentially, I’m still believing that the two executions back to back are an issue as is lack of free memory. [edit] Keep in mind I’m on Vera 2!

Just out of curiosity, what is exit code 245 from Luup? I suspect it simply reflects the signal 11 (segmentation fault - likely from too little free mem) noted immediately after.

Because most of my http calls to the vera come from openLuup and that akbooer suggested slowing down the rate of the calls by changing the openLuup code, it appears that my problems have been resolved for now.

See what I tried to do below:

±-------------+ eth ±--------------------+. uart ±---------------------+ zwave ±----------------- —+
| OpenLuup |. <–> | Vera Luup API | <–>. | ZM5304 Zwave. | <—> | Zwave network. |
±-------------+ ±--------------------+ ±---------------------+ ±---------------------+

The Vera Luup to Zwave chip and the Zwave network itself are the bottlenecks and the vera should really be queuing its data flow to the Zwave chip like pretty much every controller I know, but it does not.
In order to have a queue, you need to have memory and Sorin replied to my question on queueing saying that they believe the zwave SOC (ZM5304) does it but it does not have the memory to do it. So essentially what is happening is that the Luup engine locks up because either it is getting too many http calls it is not able to store and queue or because it is a passthrough and forwards every call it gets to the zwave chip through its serial API which then locks up and causes the vera luup to reload.
The change from run to job was done in openLuup so that it prevents this problem upstream. The ethernet interface and openLuup itself are very significantly faster than the other protocols. It is a little like a bandwidth and processing power funnel as it narrows going to the right on the graph. Someone on the chain either needs to queue the calls so they get spaced out in time or we need multiply veras and zwave network so the calls can be spaced out in space.

It seems to me like you are trying to get the vera to queue the incoming calls but it maybe that either it is too slow (now that I know you have a vera 2) to process the calls or that it is just passing them through to the zwave radio hoping it will queue and it is the serial interface (likely USB in your case?) API which crashes. Did you consider slowing down the rate of the incoming calls?

I’m not actually hitting the z-wave so that isn’t an issue. I’m actually only doing call_delay (to schedule a timout check) and (maybe) some variable_sets. Pretty benign!

It is also hard to slow down or schedule the HTTP calls. Essentially, external devices (RPi) are monitoring for presence on bluetooth and when they see it, they call the action of the associated Vera device and repeat that periodically (every 30s) while still present. I have multiple RPis monitoring throughout the home so I may be present in more than one, though the presence also includes strength so it is easy to see which room I’m in. Pretty hard to co-ordinate these independent devices. I wouldn’t have thought I had to!

My next try might be to co-ordinate the processing in Vera. If I use to receive the info then call_delay to schedule the processing of same, I can ensure that I space out the processing. This won’t eliminate concurrent event calls but might minimise their effect. Like jumping through hoops!!!

@rafale77: glad yours is working. @akbooer has helped many, many people along the way. Don’t forget to give him Karma.

I just ran a stress test and I can confirm that it passes with flying colors with my setup.

Thank you for explaining. It seems then that the design issue in the Luup engine is not new and is deep. Actually the fact that I am not really seeing much delay anymore in my scene actions tells me that zwave is not the bottleneck. It is the vera itself which is the problem. It may not be able to receive and process too many http commands within a short time. It is lacking a command queue buffer on both the incoming and the outgoing commands.

Thank you for reminding me. He actually got quite a couple from me lately. ;D

I do something like this in many of my plugins, so much so that it’s become one of my design patterns. I have a “task queue”, which is a table of objects containing a time to next run and a function to be called. The plugin has a single “call_delay” thread that only calls the “task scheduler”. When called, it rips through the task queue and builds a list of ready tasks, then runs that ready list by calling the task-specific callback function for each. It then finds the next earliest waiting task, and schedules the next call_delay accordingly. When the tasks run, they can call to reschedule themselves for another future run on whatever schedule they need. When tasks “register”, they can include not just the callback function to use, but an array of arguments of any valid Lua type, so I can pass in a lot more than just a single string to my time-delayed tasks (a limitation in call_delay). Tasks can also be cancelled or rescheduled externally (i.e. an action implementation can call for the periodic update task to run immediately). It’s been a very helpful wrapper on Luup’s call_delay.

This is an awesome improvement (the wrapper) on the luup engine which should really be integrated in the vera firmware. Maybe AK can use it in OpenLuup too?

openLuup does a good job of scheduling tasks. Without trying to speak for akbooer, I think he needs his scheduler to work the Vera way because that’s what’s compatible. Nothing he does prevents my approach from working on openLuup, and mine is just one way of getting things done.

I could publish it as a module or encapsulated-function to embed inline; it’s only about 160 lines of loosely-spaced code (including trace/debug statements that could be removed).

This is exactly how the openLuup scheduler works, and, strangely (or perhaps not) I had done the same thing previously with a startup code library that I used ubiquitously on my Veras.

As per the above, I think I already do? It may just be a cosmetic tweak to the packaging which you’re after.

Very much appreciated! Thanks @GeekGoneOld (and Karma.)

Sorry akbooer, I should have looked at the scheduling code in openLuup closer before making this comment. It is indeed more or less how it works. I made the assumption that it behaved more like the vera which openLuup was trying to mimic.