We experience periodic hangs of our plugin, which we are currently testing on several Ezlo Plus units: roughly one hang every 2-3 days per unit. Periodically, ssh also becomes unavailable. A reboot recovers the unit.
Inspecting one of these hangs while it is ongoing, I observe that our plugin is waiting for an http connection it initiated to close. It is still waiting after 130,000 seconds (about 36 hours). There appear to be 32 bytes sitting in the receive queue, unprocessed.
The server the Ezlo thinks it’s connected to has no open connections from the Ezlo (per netstat).
The server logs show no errors. All responses are 3447 bytes. The 99th-percentile response time is about 1 second, with no response taking over 2 seconds.
On the Ezlo:
# netstat -tpeW
[...]
tcp 32 0 [Ezlo IP redacted --Lee]:52950 [host redacted --Lee]:https CLOSE_WAIT 4978/ha-luad
[...]
I watched this for over an hour, so this entry isn’t transient. I haven’t dug into the /proc/ entries or elsewhere enough to find an actual creation timestamp for the connection.
Our http.request call looks like this:
local request = {
    url = address,
    type = "POST",
    verbose = true,
    content_type = "application/json",
    data = data_json,
    content_length = data_length,
    fail_on_error = true,
    handler = constants.plugin_hub_script_path .. "utils/http_receive"
}
local success, connection_code = pcall(http.request, request)
99%+ of the time (maybe more nines!) everything works fine.
- Is there anything we’re doing or not doing that can be leaving data behind in a buffer?
- Does this represent an OS or plugin error? Or is it in some way a feature?
- We’ll start closing connections from our plugin after a reasonable time. It’s sort of gross if we retry after a successful send because connection closure got gummed up. Is there anything else we can try?
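For reference, the timeout we have in mind for the last point could look roughly like the sketch below. It is plain Lua bookkeeping only: it remembers when each request started and reports the ones that have outlived a deadline. The names Watchdog, track, finish and sweep are ours, and close_fn is a hypothetical callback standing in for whatever call the firmware actually offers to close a connection. We are not assuming any particular Ezlo API here.

```lua
-- Sketch only: tracks in-flight requests so stale ones can be closed.
-- `close_fn` is a hypothetical callback; substitute whatever the firmware
-- really provides for closing a connection.
local Watchdog = {}
Watchdog.__index = Watchdog

function Watchdog.new(timeout_seconds, close_fn, now_fn)
    return setmetatable({
        timeout = timeout_seconds,
        close_fn = close_fn,
        now_fn = now_fn or os.time,  -- injectable clock, handy for testing
        pending = {},                -- connection id -> start time (seconds)
    }, Watchdog)
end

-- Call right after a request has been issued.
function Watchdog:track(id)
    self.pending[id] = self.now_fn()
end

-- Call from the handler once the response is complete or the connection closed.
function Watchdog:finish(id)
    self.pending[id] = nil
end

-- Call periodically; closes every request older than the timeout
-- and returns the list of ids that were expired.
function Watchdog:sweep()
    local now = self.now_fn()
    local expired = {}
    for id, started in pairs(self.pending) do
        if now - started >= self.timeout then
            expired[#expired + 1] = id
        end
    end
    for _, id in ipairs(expired) do
        self.pending[id] = nil
        -- pcall so that one failing close does not abort the whole sweep
        pcall(self.close_fn, id)
    end
    return expired
end
```

With server responses never exceeding 2 seconds, a timeout of, say, 30 seconds (an example value, not something we have tuned) should only ever fire on a connection that is genuinely stuck. Calling finish from the receive handler is what would keep us from retrying a request that actually succeeded.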
Thanks for your help!