[Altsteon] Stops working after a few hours

I have done many factory resets on my VeraLite and my PLM (2413U), and I keep running into this issue.

I have about 8 Insteon devices set up with Altsteon (3 relays, 3 dimmers, 1 dimmer KPL and an IOLinc). I can get them set up just fine, and they seem to work as advertised. However, after some period of time they stop responding to commands from the Vera interface. When I try to kill and restart the Altsteon service, I get the error “Failed to initialize PLM”, so I have to reboot to get it to respond again.

Today, I left the altsteon_cli open all day and noticed that once it stopped working, I got a “Broken pipe” error. Again, I tried to kill the process and start it again, and again got “Failed to initialize PLM”. When I went to check that the PLM was still connected to ttyUSB0 (dmesg | grep "ttyUSB"), I got a lot of errors:

ftdi_sio ttyUSB0: urb failed to clear flow control
ftdi_sio ttyUSB0: ftdi_set_termios FAILED to set databits/stopbits/parity
ftdi_sio ttyUSB0: ftdi_set_termios urb failed to set baudrate

I have unplugged everything else from the USB ports except the PLM, and I still get this error. Any suggestions on how to make this thing stable? Thanks.

The “Failed to initialize PLM” error is directly related to the dmesg errors you are seeing. For some reason, the serial chip in the PLM seems to end up locked up. I have seen this happen when Altsteon is shut down uncleanly or crashes.
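If you want to check for this state without reading through all of dmesg, a small filter helps. This is a minimal Python sketch, assuming you have captured the output of dmesg as text (Python may not be available on the Vera itself, so this would run on another machine). It only matches ftdi_sio failure lines like the ones quoted above, and the name ftdi_errors is hypothetical.

```python
import re

# Matches ftdi_sio failure lines such as
#   ftdi_sio ttyUSB0: urb failed to clear flow control
# Any prefix (e.g. a "[ 1234.5678]" kernel timestamp) before "ftdi_sio" is ignored.
FTDI_FAILURE = re.compile(
    r"ftdi_sio\s+(?P<port>tty\w+):\s*(?P<msg>.*fail.*)", re.IGNORECASE
)


def ftdi_errors(dmesg_text, port="ttyUSB0"):
    """Return the ftdi_sio failure messages logged for the given port."""
    errors = []
    for line in dmesg_text.splitlines():
        match = FTDI_FAILURE.search(line)
        if match and match.group("port") == port:
            errors.append(match.group("msg").strip())
    return errors
```

For example, passing it subprocess.run(["dmesg"], capture_output=True, text=True).stdout returns an empty list when the driver has logged no failures for ttyUSB0.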

Which brings me to the “Broken pipe”. A broken pipe error is what the operating system reports when one end of the communication socket between the daemon and a client tries to write after the other end has gone away. Since the cli is reporting it, I would guess that the daemon has crashed for some reason, and the crash could be what is causing the errors in dmesg. (As an aside, I have wondered why the PLM likes to crash in this way. Other serial devices I have worked with are a lot more graceful when something goes wrong.)
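To see where a “Broken pipe” comes from, here is a minimal Python sketch that reproduces it locally. It assumes a stream socket between the daemon and the client; socket.socketpair() stands in for that connection, which is an assumption and not Altsteon's actual transport. Closing one end and then writing to the other fails with EPIPE, which Python raises as BrokenPipeError (Python ignores SIGPIPE by default, so the process is not killed).

```python
import socket


def write_after_peer_closed():
    """Return the exception raised when writing to a socket whose peer has closed."""
    # A connected pair of local stream sockets stands in for the
    # daemon <-> client connection (an assumption for illustration).
    client, daemon = socket.socketpair()
    daemon.close()  # simulate the daemon going away, e.g. after a crash
    try:
        client.sendall(b"status\n")
    except BrokenPipeError as exc:  # errno EPIPE
        return exc
    finally:
        client.close()
    return None
```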

It shouldn’t be necessary to factory reset the PLM. Unplugging it for 10-30 seconds and plugging it back in should clear the lock-up. However, that won’t solve the root problem, which is that the daemon seems to be crashing.

So, here are some steps that may help me figure out what is going wrong:

  1. First, I want to make sure that my theory that the daemon is crashing is correct. Run it the same way you did, and when you get the “Broken pipe” error, hop over to a command prompt on the Vera and run ps xaf | grep "altsteon" (type it exactly as shown, including the double quotes). If the daemon is running, you should get back a line containing “altsteon”. You may also get a line that shows the grep command you entered; ignore that one.

  2. Assuming it is crashing, I need to understand what is causing the crash, and the best way to do that is to gather some logs. There is a command line flag that writes a log file to the current directory, but it probably isn’t a great idea to use that directly on the Vera. Instead, SSH in to the Vera, enable capture of the terminal output, and then run the daemon with the -ddd command line parameter. After it crashes, save the captured output and attach it to a message so I can look at it. If the file is too big, just copy the last 200 lines out of the file and attach those, but keep the full file in case I need to look farther back.
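The two checks above can be scripted. This minimal Python sketch assumes you run it somewhere Python is available (not necessarily the Vera) and feed it text you have already collected: the output of ps xaf, and the captured terminal log. The function names are hypothetical. On the Vera itself, grep "[a]ltsteon" avoids the self-matching grep line, and tail -n 200 extracts the end of the log.

```python
from collections import deque


def daemon_running(ps_output, name="altsteon"):
    """Step 1: decide from `ps xaf` output whether the daemon is alive.

    Lines belonging to the grep command itself are ignored, mirroring the
    advice to disregard the line that shows the command you typed.
    """
    for line in ps_output.splitlines():
        if name in line and "grep" not in line:
            return True
    return False


def last_lines(lines, n=200):
    """Step 2: return the last n lines of a captured log (any iterable of lines)."""
    # A deque with maxlen keeps only the most recent n lines in memory.
    return list(deque(lines, maxlen=n))
```

For the log, pass an open file, e.g. with open("putty.log") as log: tail = last_lines(log).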

I also have to admit that I have not done any testing on the VeraLite. If its hardware differs from the Vera 2 or 3, we could easily be hitting a situation where a CPU instruction that is valid on the Vera 2 or 3 is not valid on the Lite. So, if anyone else reads this and has Altsteon running on a VeraLite, please chime in and let us know whether it is working for you and whether you had to do anything special.

Alright, as requested, I started altsteon with the -ddd switch and output logging on, and at around 12:57 (I assume), I lost the ability to control my devices.

Here is the full log:
https://www.dropbox.com/s/adghe0ggm2t25d1/putty.log

Here are the last few lines before I started seeing errors:

[5/25/2012 - 12:51:54] - **** Queue before removing item :
[5/25/2012 - 12:51:54] - Priority Queue for 19.5F.2D :
[5/25/2012 - 12:51:54] - Standard Queue for 19.5F.2D :
[5/25/2012 - 12:51:54] - Cmd : 0262195F2D0F1901 Multiresponse : false
[5/25/2012 - 12:51:54] - Cmd : 0262195F2D0F1900 Multiresponse : false
[5/25/2012 - 12:51:54] - **** Queue after removing item :
[5/25/2012 - 12:51:54] - Priority Queue for 19.5F.2D :
[5/25/2012 - 12:51:54] - Standard Queue for 19.5F.2D :
[5/25/2012 - 12:51:54] - Cmd : 0262195F2D0F1900 Multiresponse : false
[DEBUG] [5/25/2012 - 12:51:54] - Cleared active command.
[TRACE] [5/25/2012 - 12:51:54] - There are 0 byte(s) left in the buffer.
[TRACE] [5/25/2012 - 12:51:54] - No active command!?
[TRACE] [5/25/2012 - 12:51:55] - Returning new command.
[DEBUG] [5/25/2012 - 12:51:55] - Sending : 0262195F2D0F1900
[TRACE] [5/25/2012 - 12:51:55] - Found the device : 19.5F.2D (19.5F.2D)
[TRACE] [5/25/2012 - 12:51:55] - Processing SdEcho using base class!
[TRACE] [5/25/2012 - 12:51:55] - Got an ACK.
[TRACE] [5/25/2012 - 12:51:55] - There are 0 byte(s) left in the buffer.
[DEBUG] [5/25/2012 - 12:51:55] - Response : 02 50 19 5F 2D 1C F7 28 2B 02 00
[TRACE] [5/25/2012 - 12:51:55] - Found the device : 19.5F.2D (19.5F.2D)
[DEBUG] [5/25/2012 - 12:51:55] - Flags : Direct_ACK Hops remaining : 2
[TRACE] [5/25/2012 - 12:51:55] - Calling 4 parameter cooked ack.
[5/25/2012 - 12:51:55] - 19.5F.2D’s contact is open.
[TRACE] [5/25/2012 - 12:51:55] - To client : 19.5F.2D:000A,02 - Contact is open.

[5/25/2012 - 12:51:55] - **** Queue before removing item :
[5/25/2012 - 12:51:55] - Priority Queue for 19.5F.2D :
[5/25/2012 - 12:51:55] - Standard Queue for 19.5F.2D :
[5/25/2012 - 12:51:55] - Cmd : 0262195F2D0F1900 Multiresponse : false
[5/25/2012 - 12:51:55] - **** Queue after removing item :
[5/25/2012 - 12:51:55] - Priority Queue for 19.5F.2D :
[5/25/2012 - 12:51:55] - Standard Queue for 19.5F.2D :
[DEBUG] [5/25/2012 - 12:51:55] - Cleared active command.
[TRACE] [5/25/2012 - 12:51:55] - There are 0 byte(s) left in the buffer.
[TRACE] [5/25/2012 - 12:51:55] - No active command!?
[TRACE] [5/25/2012 - 12:56:18] - **** Polling device ‘15.97.D0’.
[TRACE] [5/25/2012 - 12:56:18] - Adding : 02621597D00F1900. Multi = false
[TRACE] [5/25/2012 - 12:56:18] - Queued poll request for 15.97.D0
[TRACE] [5/25/2012 - 12:56:18] - Next poll in 600 second(s).
[TRACE] [5/25/2012 - 12:56:18] - Returning new command.
[DEBUG] [5/25/2012 - 12:56:18] - Sending : 02621597D00F1900
[TRACE] [5/25/2012 - 12:56:19] - Waiting for command response. (Timeout = 1)
[TRACE] [5/25/2012 - 12:56:20] - Waiting for command response. (Timeout = 2)
[TRACE] [5/25/2012 - 12:56:21] - Waiting for command response. (Timeout = 3)
[TRACE] [5/25/2012 - 12:56:22] - Resending…
[DEBUG] [5/25/2012 - 12:56:22] - Sending : 02621597D00F1900
[TRACE] [5/25/2012 - 12:56:23] - Waiting for command response. (Timeout = 1)
[TRACE] [5/25/2012 - 12:56:24] - Waiting for command response. (Timeout = 2)
[TRACE] [5/25/2012 - 12:56:25] - Waiting for command response. (Timeout = 3)
[TRACE] [5/25/2012 - 12:56:26] - Resending…

It just keeps repeating over and over again. Any ideas?

Thank you.
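The hex strings in this log follow the PLM's serial framing: every frame starts with 0x02, 0x62 is a command sent to the PLM (target address, message flags, cmd1, cmd2), and 0x50 is a standard message received from the network (from address, to address, flags, cmd1, cmd2). The following is a minimal Python sketch of a decoder for just those two frame types, based on the Insteon modem documentation rather than on Altsteon's code; the function names are hypothetical. Decoding the 12:51:55 response yields flags 0x2B, whose hops-left field is 2, matching the log's “Hops remaining : 2”. The repeated 12:56 command is a status request (cmd1 0x19) to 15.97.D0, and unlike the earlier command, no echo or ACK from the PLM appears after it.

```python
def format_address(raw):
    """Format three address bytes the way the log does, e.g. 19.5F.2D."""
    return ".".join(f"{b:02X}" for b in raw)


def decode_flags(flags):
    """Split an Insteon message-flags byte into its bit fields."""
    return {
        "type": flags >> 5,                # bits 7-5: 0b001 = ACK of a direct message
        "extended": bool(flags & 0x10),    # bit 4: extended (vs. standard) message
        "hops_left": (flags >> 2) & 0x03,  # bits 3-2
        "max_hops": flags & 0x03,          # bits 1-0
    }


def decode_frame(hex_text):
    """Decode a 0x62 (send) or 0x50 (standard message received) PLM frame."""
    data = bytes.fromhex(hex_text)  # fromhex skips whitespace between byte pairs
    if len(data) < 2 or data[0] != 0x02:
        raise ValueError("PLM frames start with 0x02")
    if data[1] == 0x62 and len(data) >= 8:
        return {"kind": "send", "to": format_address(data[2:5]),
                "flags": decode_flags(data[5]), "cmd1": data[6], "cmd2": data[7]}
    if data[1] == 0x50 and len(data) >= 11:
        return {"kind": "received", "from": format_address(data[2:5]),
                "to": format_address(data[5:8]), "flags": decode_flags(data[8]),
                "cmd1": data[9], "cmd2": data[10]}
    raise ValueError(f"unhandled or truncated frame 0x{data[1]:02X}")
```

For example, decode_frame("02621597D00F1900")["to"] returns "15.97.D0".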

Also, once I reset the Vera (using the command line “reboot”), I have to re-add all of my Insteon devices.

I am having a similar issue where the Insteon devices lose communication with Vera and the PLM. Altsteon does not crash, but I cannot issue any commands to the devices. I get similar messages where it tries to communicate and fails after 3 tries. I am running some tests to find the cause. I think it is a communication issue, but I am seeing odd test results. I just wanted to chime in and say your issue is similar to mine.

  • Garrett

Let me start with the easy one first. There is no way to actually reboot the PLM via the Altsteon command line; to reboot it, you have to manually unplug and replug the PLM. The “plm reset” command actually sends the Insteon Reset command, which resets the PLM to factory defaults. When you reset to factory defaults, the ALDB (All-Link Database) in your PLM gets deleted.

Now, for the crappy part. Resetting the PLM doesn’t reset any of the other devices on your network, so they keep their old links to the PLM, and after relinking, all of your devices probably have multiple ALDB links back to your PLM. From the cli you can enter the command ‘aa.bb.cc dump_aldb’ to see what the ALDB for that device looks like. If there are a bunch of duplicates on the same group number, you may start to get weird behavior. The behavior I have observed is a device not responding or being sluggish. It shouldn’t cause the issue you are seeing, however.
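Checking for duplicates is mechanical once the records are in hand. This minimal Python sketch assumes you have already turned the dump_aldb output into records with a link type, group, and linked address; that record layout (the Link tuple) is hypothetical and is not Altsteon's actual output format. Counting identical records with collections.Counter shows which links appear more than once.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Link(NamedTuple):
    """One ALDB record, reduced to the fields that identify a link.

    Hypothetical layout for illustration; not Altsteon's dump_aldb format.
    """
    controller: bool  # True = controller link, False = responder link
    group: int        # Insteon group number
    address: str      # linked device, e.g. the PLM's address


def duplicate_links(records: Iterable[Link]) -> dict:
    """Return each link that appears more than once, with its count."""
    counts = Counter(records)
    return {link: n for link, n in counts.items() if n > 1}
```

A result like {Link(controller=False, group=1, address='AA.BB.CC'): 3} would mean the same responder link on group 1 is stored three times.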

I’m going to need some time to look over your log file for anything that looks unusual. If I get time, it will be tonight; if not, it will be sometime before Monday night. It would be helpful to understand a little more about how your Insteon setup is laid out. Specifically:

  1. Do all of your devices stop working, or just one or two? (I realize you said all, but I used to work in IT, where a manager would come to me and tell me that “all” the computers were down when in reality one computer displayed an error he didn’t understand. So, forgive the stupid question.)

  2. Are the devices you are using on the same electrical phase? Or on different phases?

  3. Are the devices relatively close together?

  4. Are any of the devices dual-mode?

  5. Where is your PLM relative to the other devices? (Same floor? Different floor? Is there lots of metal in the ‘line of sight’ between the PLM and devices that might be using RF?)

  6. Do you use any other tech that runs in the 900 MHz band? (Baby monitors, 900 MHz cordless phones, etc.)

  7. Does it ever stop working when you are not at home, or only when you are at home? (And if it stops when you are not at home, do you leave any appliances on to keep an animal company, or anything like that?)

It seems there are a few things that could be going on here. Obviously, one could be a bug in Altsteon. However, the logs you posted make a bug either unlikely or VERY subtle. So, the next simple place to look is the phase that your devices are on. If devices on a different phase from the PLM are failing, then you may just need to install another wireless-enabled (“dual-mode”) device to bridge the phases. If the devices are on the same phase, then you will want to look for anything that might be causing noise on the power line. Things like fridges, computers, TVs, A/V equipment, etc. can all cause issues.

Again, I’ll let you know if the logs turn up anything interesting.

fba,
I have the same problem. It seems that I need to restart the daemon about once a day or so, though I have not tried to measure the time between failures. My Z-Wave devices work 100% of the time, so it is likely the Altsteon daemon.

To answer some of your questions…

20+ devices, all within about 2,000 sq ft
all are powerline Insteon devices, no RF, no dual mode at all
when it fails Altsteon will not control any device and status does not update either
PLM is in the same spot I’ve had it for 8 years while I’ve used 6 other apps, all without issue & I have an active repeater bridging the phases/circuits
I have not needed to reboot/reset/unplug the PLM, just reboot the vera and all works again for about a day.
I have had it stop working while I was using it, via the Authomation app and the Vera web portal (I was not making a config change, just controlling lighting)

Maybe a watchdog process to restart the daemon if communications fail… at least until the bug is found?
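One way to frame such a watchdog: in the posted log, the daemon keeps resending a command roughly every four seconds without ever hearing back, so “a command has been waiting for a reply longer than N seconds” is a usable stall signal. This is a minimal Python sketch of that decision logic only; the class, the 60-second default, and the restart callback are all hypothetical. Given the earlier reports that restarting only the daemon leads to “Failed to initialize PLM”, the callback might need to reboot the Vera rather than just restart Altsteon.

```python
import time


class PlmWatchdog:
    """Hypothetical watchdog: trigger a restart when the PLM stops answering.

    Call note_sent() whenever a command goes to the PLM, note_received()
    whenever bytes come back, and check() periodically (e.g. once a second).
    """

    def __init__(self, restart, stall_seconds=60.0, clock=time.monotonic):
        self.restart = restart              # callback that performs the restart
        self.stall_seconds = stall_seconds
        self.clock = clock
        self.waiting_since = None           # time of oldest unanswered command

    def note_sent(self):
        if self.waiting_since is None:
            self.waiting_since = self.clock()

    def note_received(self):
        self.waiting_since = None           # any reply shows the link is alive

    def check(self):
        """Restart and return True if a command has gone unanswered too long."""
        if self.waiting_since is None:
            return False
        if self.clock() - self.waiting_since < self.stall_seconds:
            return False
        self.waiting_since = None
        self.restart()
        return True
```

The clock parameter lets the logic be tested with a fake clock; in real use the default time.monotonic is appropriate because it does not jump when the system time changes.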

Sorry, let me clarify. I am rebooting Vera with the Linux command line “reboot”. It is easier than running to the basement and unplugging and plugging in the Vera. Similar to Aaron, if I reboot Vera, everything works again. No need to reboot the PLM.

To answer your questions:

  1. Yes, I lose all control and updates from my Insteon devices. Z-Wave and other devices continue to function.

  2. They should be on different phases; they are located throughout the house. (I have a hard-wired phase coupler.)

  3. Most of the devices are within about 15 feet of the Vera / PLM. One or two are across the garage (25 feet).

  4. No dual mode.

  5. The PLM is relatively close to the devices (15-25 feet).

  6. I don’t believe so. My phones are in the 5.8 GHz range.

  7. It seems to happen sometime in the middle of the night, between sunset and sunrise, but it can apparently happen at any time.

As a temporary fix, I have set up a nightly reboot within Vera to make sure that the system is ready to turn off the outside lights at sunrise.

Thanks for the help and keep up the good work!

By the way, I have a very similar set up to Aaron. I have been using Insteon for 6 years from the same spot (different PLC/PLM, but same location).

Working for about 1.5 days without a reboot… it seems so random. Usually I must reboot at least once a day.

About 10% of the time, I know it needs a reboot because my kitchen lights (2 dimmers linked together) turn on on their own… even when I manually turn them off, they come back on 2-3 seconds later. They are the only switches that do it, too! Crazy, huh?

But most of the time the lights just don’t respond and the status does not update in Vera.

Looking at the log, the problem appears to be that the PLM just stops sending bytes to Altsteon. Very strange…
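That conclusion can be checked mechanically against a capture: find the last line where the PLM delivered a response and count the resends after it. This minimal Python sketch assumes the log format posted earlier in the thread (lines such as "[DEBUG] [5/25/2012 - 12:51:55] - Response : …" and "Resending…"); the function name is hypothetical.

```python
import re
from datetime import datetime

# Timestamp format used in the posted log, e.g. "[5/25/2012 - 12:56:22]".
# "[DEBUG]" and "[TRACE]" tags do not match because they lack digits.
STAMP = re.compile(r"\[(\d+/\d+/\d{4} - \d{2}:\d{2}:\d{2})\]")


def stall_report(log_lines):
    """Return (time of last PLM response, number of resends since then)."""
    last_response = None
    resends = 0
    for line in log_lines:
        if "Response :" in line:
            match = STAMP.search(line)
            if match:
                last_response = datetime.strptime(
                    match.group(1), "%m/%d/%Y - %H:%M:%S")
            resends = 0  # a response resets the count
        elif "Resending" in line:
            resends += 1
    return last_response, resends
```

On the excerpt posted earlier, it would report 12:51:55 as the last response, with 2 resends after it.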

@Aaron - Are you seeing the same errors in dmesg that RyanAHolland mentioned in his first post? If your errors match, then we should be able to reasonably assume that the problems you guys are seeing are the same.

It really sounds like something is happening that the serial port driver doesn’t like, and it is getting stuck in a weird state. I’ll have to dig around and see if I can figure anything out that will mitigate the issue…

Closing in on 3 solid days now… Still running without a problem. No reboots.

Now closing in on 4 days and still running strong. I’m not sure why, since it needed a reboot every day for a few days.

From what I could tell through some Googling, there is a bug in the FTDI driver that can cause the receive side to lock up under certain circumstances. (What those circumstances may be is a bit fuzzy.) This matches the behavior seen in logs.

I don’t know if there is much that can be done about this, however I have seen strange behavior from PLMs when there is more than one program that has the device open. Looking at my Vera last night I noticed that ser2net is running and has the port open. So, in an effort to try to mitigate this problem, and also to get Altsteon more ready to be put on the app store, I started to work on code to use ser2net instead of talking to the serial hardware directly.

Hopefully this will make things more stable (though I really can’t be sure). Either way, making this change will make Altsteon integrate better with the Vera, and should open the door to using PLMs on the other end of network-to-serial adapters such as the WIZnet or Lantronix ones.
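The basic shape of talking to a PLM through ser2net is an ordinary TCP connection that carries the same bytes the serial port would. This is a minimal Python sketch, assuming ser2net is configured to expose the PLM's port in raw mode; the host, port, and function name are hypothetical, and this is not Altsteon's implementation. For a 0x62 standard message, the PLM echoes the 8 bytes it received followed by 0x06 (ACK) or 0x15 (NAK), so 9 reply bytes are expected.

```python
import socket


def plm_round_trip(host, port, frame_hex, reply_len, timeout=3.0):
    """Send one PLM frame over a raw TCP connection and read the reply.

    Assumes ser2net (or a network-to-serial adapter) exposes the PLM's
    serial port as a raw TCP byte stream at host:port.
    """
    frame = bytes.fromhex(frame_hex)
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(frame)
        reply = b""
        while len(reply) < reply_len:
            chunk = conn.recv(reply_len - len(reply))
            if not chunk:  # remote end closed the connection
                break
            reply += chunk
    return reply
```

For example, plm_round_trip("127.0.0.1", port, "0262195F2D0F1900", 9) should return the echoed command ending in 06 if the PLM accepted it, where port is whatever TCP port ser2net has been configured to use for the PLM.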

I hope to get a first pass of code done before this weekend is out. When I do, I’ll post a new build to this thread along with instructions on what changes need to be made for people to try.

fba - that all sounds good. I don’t use FTDI, so that could be why I’m not seeing the issues as frequently. It could just be a fluke for mine.

Aaron -

Are you using the serial interface then?

I’ve been pretty low-key lately on the forums and with development of my app. I wanted to provide an update on my testing with Altsteon. For the past week, I have been logging Altsteon’s interaction with my KeypadLinc and FanLinc. In the last 7 days I have not had a single communication issue with the devices. Everything was responsive, and the hops remaining was always 2. I’m not sure why I was losing communication with the devices after a few days before. I restarted Altsteon this morning with no logging and will see how things perform. Hopefully it will hold up like it has the last 7 days.

fba, looking forward to seeing the new implementation. Does the new implementation cause any sort of development roadblocks?

  • Garrett

[quote="fba, post:15, topic:171597"]Aaron -

Are you using the serial interface then?[/quote]

yes, a PL, not FTDI.

@garrettwp -

Good to hear it is working. So far, the new implementation only seems to introduce one ‘problem’ (and it isn’t much of a problem). Originally, Altsteon would let you run the daemon on a different Linux box with the PLM connected to it. Because of the way that things fit together on the Vera, it would be a bit of extra work to keep that functionality.

However, there is an interesting trade-off that comes along with the changes. Because ser2net is IP-based, the changes to support it will likely allow things like the WIZnet to work.

From the end-user perspective, the main change will be that you no longer need to specify the serial port that the PLM is connected to when the daemon starts up. Instead, you will just configure the PLM to use one of the serial ports defined in the Vera UI, and everything will “magically” work. (From there, it is a small step to have the PLM Lua code load the daemon on startup, followed by having the PLM Lua code detect which binary to use, and then it should be ready to go on the app store.)

From the perspective of the Lua code used by the various devices to present data to the Vera, no changes are needed. There are one or two new result codes that the PLM can kick out, but nothing that end devices should care about.

@Aaron -

Interesting. That basically kills my theory that it is a bug in the FTDI driver, which leaves me with only two other ideas on what it could be. Either my code is setting up the serial port in a way that sometimes makes the PLM mad, or the ser2net process that is also attempting to use the PLM’s serial port is causing some kind of conflict. If it turns out to be either of those, then the changes I am working on should resolve the issue. So, let’s cross our fingers. :wink:
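For reference, the 2413U's serial interface is documented as 19200 baud, 8 data bits, no parity, 1 stop bit, with no flow control. This minimal Python sketch builds a raw-mode termios configuration with those settings; it is a generic illustration, not the code Altsteon uses, and the function name is hypothetical. The dmesg errors earlier in the thread were the FTDI driver failing to apply this kind of setting (flow control, data bits/stop bits/parity, and baud rate).

```python
import termios


def plm_serial_attrs(attrs):
    """Return tcgetattr()-style attrs set to raw 19200 8N1 with no flow control."""
    iflag, oflag, cflag, lflag, ispeed, ospeed, cc = attrs
    # Input: no software flow control, no CR/NL translation, no parity checks
    iflag &= ~(termios.IXON | termios.IXOFF | termios.IXANY | termios.ICRNL
               | termios.INLCR | termios.IGNCR | termios.ISTRIP
               | termios.INPCK | termios.BRKINT | termios.PARMRK)
    oflag &= ~termios.OPOST                   # output: send bytes unmodified
    cflag &= ~(termios.CSIZE | termios.PARENB | termios.CSTOPB)  # clear size/parity/2 stop bits
    cflag &= ~getattr(termios, "CRTSCTS", 0)  # no hardware (RTS/CTS) flow control
    cflag |= termios.CS8 | termios.CREAD | termios.CLOCAL
    lflag &= ~(termios.ICANON | termios.ECHO | termios.ECHOE
               | termios.ISIG | termios.IEXTEN)  # raw: no line editing or signals
    cc = list(cc)
    cc[termios.VMIN] = 0    # read() returns whatever bytes are available...
    cc[termios.VTIME] = 10  # ...or gives up after 1.0 second (units of 0.1 s)
    return [iflag, oflag, cflag, lflag, termios.B19200, termios.B19200, cc]
```

To apply it, open the port (for example with os.open("/dev/ttyUSB0", os.O_RDWR | os.O_NOCTTY)) and call termios.tcsetattr(fd, termios.TCSANOW, plm_serial_attrs(termios.tcgetattr(fd))).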

I had to unplug the Vera yesterday due to redoing wiring, but it had been up 5+ days… So, it might have been a fluke that I had problems for a bit. Time will tell, but for now all is perfect for me… Your code is working perfectly, and fast!

Thanks for the great work