Real-Time System Watchdogs: Designing Self-Healing Processes for IoT Edge Nodes

2021-11-26 15:42:51+00:00

Deploying headless IoT edge gateways at remote sites or on locked-down local networks makes physical maintenance difficult. If the primary telemetry script leaks memory, deadlocks on a socket read, or stalls during a network outage, the node stops reporting. Without a mechanism to detect and resolve these software hangs, someone has to intervene physically, for example by power-cycling the device. A robust edge design therefore includes a local watchdog that monitors process state and executes self-healing actions.

By configuring systemd watchdogs alongside lightweight bash monitors, edge nodes can recover from application failures automatically.

1. Implementing the Process State Watchdog

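We write a bash script that runs periodically via cron, checks that the Python client process still exists, and restarts the service if the process has disappeared. Note that pgrep only confirms the process is present; it cannot tell whether a running process has hung, which is the case the systemd watchdog in section 2 covers: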

#!/bin/bash
# /usr/local/bin/client_watchdog.sh
set -euo pipefail

LOG_FILE="/var/log/edge_watchdog.log"
PROCESS_NAME="telemetry_client.py"

# Verify that the process is active
if ! pgrep -f "$PROCESS_NAME" > /dev/null; then
    echo "$(date '+%Y-%m-%d %H:%M:%S') - CRITICAL: $PROCESS_NAME crashed! Restarting..." >> "$LOG_FILE"
    
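    # Attempt to restart the systemd service (run this script from root's crontab,
    # or grant passwordless sudo for this command)
    sudo systemctl restart edge_telemetry.service

    # Dispatch alert hook; a failed alert must not abort the script under `set -e`
    curl -fsS --max-time 10 -X POST -H "Content-Type: application/json" \
         -d '{"event":"crash", "device":"pi-edge-01", "details":"Process auto-restarted"}' \
         https://api.telemetry.com/alerts/device-recovery || true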
fi
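
Because a process-existence check passes even when the client is alive but stuck, a cron-driven monitor can also look at how recently the client last made progress. The following is a minimal sketch, assuming the client updates the modification time of a heartbeat file on every successful loop iteration; the function name, the file path, and the 60-second threshold are hypothetical and not part of the setup above.

```python
import os
import time


def heartbeat_is_stale(path, max_age_s, now=None):
    """Return True if the heartbeat file is missing or older than max_age_s seconds."""
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        # No heartbeat file at all is treated the same as a stale one
        return True
    return (now - mtime) > max_age_s


if __name__ == "__main__":
    # Hypothetical path and threshold, chosen only for illustration
    if heartbeat_is_stale("/run/edge_telemetry.heartbeat", 60):
        print("heartbeat stale: client may be hung")
```

A monitor built this way would restart the service when the function returns True, in addition to the pgrep check.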

2. Leveraging Native Systemd Watchdog Integration

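Systemd provides built-in watchdog support: it monitors a service and restarts it if the service fails to send a keep-alive notification ("WATCHDOG=1") over the notification socket within a defined timeout. Type=notify makes systemd wait for the service to report readiness, WatchdogSec sets the timeout in seconds, and NotifyAccess=main accepts notifications only from the main process. We configure the service file: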

# /etc/systemd/system/edge_telemetry.service
[Unit]
Description=Edge Telemetry Daemon
After=network.target

[Service]
Type=notify
ExecStart=/usr/bin/python3 /usr/local/bin/telemetry_client.py
Restart=always
WatchdogSec=30
NotifyAccess=main

[Install]
WantedBy=multi-user.target
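
When WatchdogSec is set, systemd exports the timeout to the service in the WATCHDOG_USEC environment variable (in microseconds), and the systemd documentation recommends sending keep-alives at about half that interval. The sketch below derives a ping interval from that variable rather than hard-coding it; the function name and the 10-second fallback are assumptions made for illustration.

```python
import os


def watchdog_ping_interval(environ=None, default_s=10.0):
    """Return a ping interval in seconds: half of WATCHDOG_USEC, or default_s if unset."""
    environ = os.environ if environ is None else environ
    raw = environ.get("WATCHDOG_USEC")
    if not raw:
        # Not started by systemd with a watchdog configured
        return default_s
    try:
        usec = int(raw)
    except ValueError:
        return default_s
    if usec <= 0:
        return default_s
    # Convert microseconds to seconds, then ping at half the timeout
    return usec / 1_000_000 / 2


if __name__ == "__main__":
    print(watchdog_ping_interval())
```

With WatchdogSec=30 this yields a 15-second interval, so a change to the unit file carries over to the client without a code edit.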

3. Client Heartbeat Implementation

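Inside the Python telemetry client, we import the systemd.daemon module (provided by the python-systemd bindings) and send a periodic "WATCHDOG=1" notification from the main execution loop. If the main thread locks up and the notifications stop for longer than WatchdogSec, systemd terminates the service (with SIGABRT by default) and Restart=always starts it again. This is a service-level restart, not a hardware reset; rebooting a fully hung device requires a hardware watchdog, which is configured separately: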

# telemetry_client.py
import time
import systemd.daemon

def send_metrics():
    # Placeholder: collect and transmit telemetry here
    pass

def main_loop():
    print("Edge Telemetry client started...")
    # Inform systemd we are ready
    systemd.daemon.notify("READY=1")
    
    while True:
        try:
            # Process telemetry read
            send_metrics()
            # Ping systemd watchdog
            systemd.daemon.notify("WATCHDOG=1")
            time.sleep(10)
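        except Exception as e:
            # No watchdog ping on this path: if send_metrics() keeps failing,
            # systemd restarts the service once WatchdogSec elapses
            print(f"Error in main loop: {e}")
            time.sleep(5)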

if __name__ == "__main__":
    main_loop()
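
If the python-systemd bindings are not available on the gateway image, the same notification can be sent with the standard library alone, because the notify protocol is a datagram written to the Unix socket named in the NOTIFY_SOCKET environment variable. The following is a minimal sketch under that assumption; the helper name sd_notify is chosen here for illustration and is not the library function used above.

```python
import os
import socket


def sd_notify(message, environ=None):
    """Send one notification datagram to systemd; return False if no socket is configured."""
    environ = os.environ if environ is None else environ
    address = environ.get("NOTIFY_SOCKET")
    if not address:
        # Not running under systemd with notifications enabled
        return False
    if address.startswith("@"):
        # Linux abstract-namespace socket: the leading "@" stands for a NUL byte
        address = "\0" + address[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), address)
    return True


if __name__ == "__main__":
    sd_notify("READY=1")
    sd_notify("WATCHDOG=1")
```

Returning False when NOTIFY_SOCKET is unset lets the same client run unchanged from an interactive shell during development.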