Watchdog: Hardware Problem Detection
Introduction
In computer hardware, a watchdog is an electronic or software mechanism designed to ensure that an automated system hasn’t become stuck at a particular processing step. It’s a protection mechanism designed to restart the system if a defined action is not executed within a given time period.
When implemented in software, it typically consists of a counter that is regularly reset to zero. If the counter exceeds a given value (timeout), then a system reset is triggered. The watchdog often consists of a register that is updated via a regular interrupt. It can also consist of an interrupt routine that must perform certain maintenance tasks before returning control to the main program. If a routine enters an infinite loop, the watchdog counter will no longer be reset to zero, and a reset is ordered. The watchdog also allows a restart if no instruction is provided for this purpose. You simply need to write a value exceeding the counter’s capacity directly into the register. The watchdog will then initiate the reset.
In industrial computing, the watchdog is often implemented as an electronic device, generally a monostable flip-flop. It is based on the principle that each processing step must execute within a maximum time. It is therefore possible to arm a timer before its execution. When the flip-flop returns to its stable state before the task is complete, the watchdog is triggered. It implements a backup system that can either trigger an alarm, restart the device, or activate a redundant system.
Watchdogs are often integrated into microcontrollers and motherboards dedicated to real-time operations.
Installation
Watchdog is simple to install and set up:
apt-get install watchdog
There’s also a small kernel component to implement:
CONFIG_WATCHDOG=y
CONFIG_SOFT_WATCHDOG=y
Configuration
Configuration is done in the /etc/watchdog.conf
file. Let’s look at some different mechanisms.
Networks
Take here for example the IP addresses 192.168.0.138 and 192.168.0.1. This means that we’re going to continuously ping these IP addresses, and if one of them doesn’t respond, it means we have a failure on our machine and therefore need to reboot. This method is quite dangerous in production, so be sure of what you’re doing.
ping = 192.168.0.138
ping = 192.168.0.1
interface = eth0
file = /var/log/messages
The pings are sent from the network card eth0 and are logged in /var/log/messages.
System Load
If you believe your machine contains no bugs and that if the memory load is too high, there’s a problem and you need to reboot, then here’s an option that will appeal to you:
max-load-1 = 24
max-load-5 = 18
max-load-15 = 12
Modify the values according to your needs.
Temperature
If you monitor your machine and want to restart in case of overheating, use this:
temperature-device = /dev/hda
max-temperature = 50
You should normally have already configured sensors beforehand (e.g., hdparm & lm-sensors).
Default Options
You must also set the default options and adapt them to your needs:
# Defaults compiled into the binary
admin = root
interval = 10
logtick = 1
# This greatly decreases the chance that watchdog won't be scheduled before
# your machine is really loaded
realtime = yes
priority = 1
# Check if syslogd is still running by enabling the following line
pidfile = /var/run/syslogd.pid
Once finished, apply the changes by restarting the service:
/etc/init.d/watchdog restart
Last updated 17 Dec 2006, 22:29 +0200.