Headless IoT edge gateways deployed at remote sites or on locked-down local networks are difficult to maintain in person. If the primary telemetry script leaks memory, deadlocks on a socket read, or stalls during a network outage, the node stops reporting. Without a mechanism to detect and resolve these software hangs, someone has to intervene physically, for example by power-cycling the device. A robust edge design therefore includes a local watchdog that monitors process state and performs self-healing actions.
By configuring systemd watchdogs alongside lightweight bash monitors, edge nodes can recover from application failures automatically.
1. Implementing the Process State Watchdog
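We write a Bash script that runs periodically via cron and checks that the Python client process exists, restarting the service if it does not. Note that pgrep only detects a missing process: a client that is hung but still alive passes this check, which is why the systemd watchdog in the next section is needed as well: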
#!/bin/bash
# /usr/local/bin/client_watchdog.sh
set -euo pipefail

LOG_FILE="/var/log/edge_watchdog.log"
PROCESS_NAME="telemetry_client.py"

# Verify that the process exists (this detects crashes, not hangs)
if ! pgrep -f "$PROCESS_NAME" > /dev/null; then
    echo "$(date '+%Y-%m-%d %H:%M:%S') - CRITICAL: $PROCESS_NAME is not running. Restarting..." >> "$LOG_FILE"

    # Attempt to restart the systemd service
    sudo systemctl restart edge_telemetry.service

    # Dispatch alert hook; cap the wait so a network outage cannot hang the watchdog
    curl -sS --max-time 10 -X POST -H "Content-Type: application/json" \
        -d '{"event":"crash", "device":"pi-edge-01", "details":"Process auto-restarted"}' \
        https://api.telemetry.com/alerts/device-recovery \
        || echo "$(date '+%Y-%m-%d %H:%M:%S') - WARNING: alert hook failed" >> "$LOG_FILE"
fi
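Because a process-existence check cannot see a hang, a monitor could also look at a heartbeat file that the client touches on every successful loop and treat a stale modification time as a failure. The following is a minimal Python sketch of that idea; the heartbeat path, the 60-second threshold, and the function name are hypothetical and not part of the setup described here.

```python
import os
import time

# Hypothetical values for illustration only; not part of the original setup.
HEARTBEAT_FILE = "/run/edge_telemetry.heartbeat"
MAX_AGE_SECONDS = 60


def heartbeat_is_stale(path, max_age, now=None):
    """Return True if the file is missing or was modified more than max_age seconds ago."""
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True  # a missing or unreadable file counts as stale
    return (now - mtime) > max_age


if __name__ == "__main__":
    if heartbeat_is_stale(HEARTBEAT_FILE, MAX_AGE_SECONDS):
        print("heartbeat stale: the client may be hung or stopped")
```

A check like this only helps if the client updates the file from the same loop that does the real work; updating it from a separate thread would hide a hang in the main thread.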
2. Leveraging Native Systemd Watchdog Integration
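Systemd provides a built-in service watchdog: when WatchdogSec= is set, systemd restarts the service if it fails to send a WATCHDOG=1 keep-alive over the notification socket within the configured timeout (subject to the Restart= setting). Type=notify additionally makes systemd wait for READY=1 before it considers the service started. We configure the service file: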
# /etc/systemd/system/edge_telemetry.service
[Unit]
Description=Edge Telemetry Daemon
After=network.target
[Service]
Type=notify
ExecStart=/usr/bin/python3 /usr/local/bin/telemetry_client.py
Restart=always
WatchdogSec=30
NotifyAccess=main
[Install]
WantedBy=multi-user.target
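When WatchdogSec= is set, systemd exports the timeout to the service in the WATCHDOG_USEC environment variable (in microseconds), and the usual guidance is to send keep-alives at roughly half that interval. The sketch below derives a ping period from the environment instead of hard-coding it; the function name and the choice to return None when no watchdog is configured are illustrative assumptions.

```python
import os


def watchdog_ping_interval(environ=None):
    """Return seconds between keep-alives (half the timeout), or None if no watchdog is set."""
    env = os.environ if environ is None else environ
    raw = env.get("WATCHDOG_USEC")
    if not raw:
        return None
    try:
        timeout_usec = int(raw)
    except ValueError:
        return None
    if timeout_usec <= 0:
        return None
    # Convert microseconds to seconds, then halve to leave a safety margin.
    return timeout_usec / 1_000_000 / 2


if __name__ == "__main__":
    # WatchdogSec=30 corresponds to WATCHDOG_USEC=30000000
    print(watchdog_ping_interval({"WATCHDOG_USEC": "30000000"}))  # 15.0
```

With the 30-second timeout above this gives 15 seconds, so each loop iteration, including the time spent sending metrics, should finish within that budget.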
3. Client Heartbeat Implementation
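Inside the Python telemetry client, we import systemd.daemon (from the python-systemd bindings) and send a periodic "WATCHDOG=1" notification from the main execution loop. If the main thread locks up and the pings stop for longer than WatchdogSec, systemd terminates the service and, because of Restart=always, starts it again. This is a service-level restart, not a hardware reset: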
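# telemetry_client.py
import time

import systemd.daemon  # provided by the python-systemd bindings


def send_metrics():
    """Placeholder: collect and transmit one batch of telemetry."""
    # Use explicit network timeouts here so a dead link cannot block past WatchdogSec.


def main_loop():
    print("Edge Telemetry client started...")
    # Inform systemd we are ready (required by Type=notify)
    systemd.daemon.notify("READY=1")
    while True:
        try:
            # Process telemetry read
            send_metrics()
            # Ping systemd watchdog; must happen more often than WatchdogSec=30
            systemd.daemon.notify("WATCHDOG=1")
            time.sleep(10)
        except Exception as e:
            # No ping is sent on failure, so repeated errors eventually trip the watchdog
            print(f"Error in main loop: {e}")
            time.sleep(5)


if __name__ == "__main__":
    main_loop()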
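On images where the python-systemd bindings are not installed, the same notifications can be sent by hand: systemd passes the socket address in the NOTIFY_SOCKET environment variable, and each message is a single datagram on an AF_UNIX socket. The following is a minimal standard-library sketch; the helper name sd_notify and its optional socket_path parameter are illustrative choices, not the bindings' API.

```python
import os
import socket


def sd_notify(message, socket_path=None):
    """Send one notification datagram; return False if no notify socket is configured."""
    path = socket_path or os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # NOTIFY_SOCKET unset, e.g. the script was started by hand
    if path.startswith("@"):
        # A leading '@' denotes a Linux abstract-namespace socket (NUL-prefixed name).
        address = b"\0" + path[1:].encode("utf-8")
    else:
        address = path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), address)
    return True


if __name__ == "__main__":
    # Outside systemd this prints False because NOTIFY_SOCKET is not set.
    print(sd_notify("WATCHDOG=1"))
```

Returning False instead of raising when the variable is absent lets the same client run unchanged during manual testing outside systemd.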