source: docs/guide/hardware/Hardware Watchdogs.md

Hardware Watchdogs

A hardware watchdog is a timer circuit on the motherboard (or in the CPU chipset) that resets the machine if nothing writes to it within a configured timeout. It exists to recover from hangs that have taken the OS so deep that software-level recovery is impossible: locked kernel, frozen scheduler, deadlocked interrupt handler, memory corruption that walked off the end of a buffer. When the OS is too broken to notice it's broken, the watchdog still ticks down, and eventually the machine power-cycles itself back to a known state.

FortrOS's design leans on this mechanism to get "silent hang" recovery for free. A node that locks up hard still gets back into the boot chain via PXE → bootstrapper → preboot → node, and either successfully rejoins the org or ends up in a recoverable failure state the admin can see. Without a hardware watchdog, a silent hang sits forever until a human notices.

Note on naming. The [boot watchdog] mentioned in 05 Loading the Real OS is a software construct — an s6-rc oneshot that confirms the maintainer reached a healthy state before marking a generation as verified. The hardware watchdog discussed on this page is a physical timer independent of software. They share the name by convention but live at entirely different layers. The software one only functions if the kernel is running; the hardware one functions even when the kernel is dead.

How hardware watchdogs work

A hardware watchdog has two parties: the watchdog hardware (a small timer) and a feeder (some software that periodically writes to it). Firmware typically arms the watchdog early in boot so that a system which never finishes booting still recovers. Once the OS is up, one of three things happens, depending on firmware settings and OS drivers: the OS claims the device and takes over feeding it; nothing claims it and the firmware-armed timer keeps counting down; or the firmware disables it when handing off to the OS.

If nobody feeds it within the configured timeout, the watchdog asserts a hardware reset line. The machine power-cycles the same way it would from the reset button — no Linux shutdown messages, no journal flush, no clean unmount. From the outside it looks indistinguishable from a power glitch.
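The feeder side of this contract is small. Here is a hedged sketch of the Linux userspace watchdog interface (the helper names are ours; the device semantics — arm on first open, pet on any write, "magic close" with 'V' — are the kernel's documented /dev/watchdog behavior):

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

/* Opening the device arms the timer (and claims it: one feeder at a time). */
int watchdog_open(const char *dev) {
    return open(dev, O_WRONLY);
}

/* Ask the driver for a different timeout; the driver may clamp the value. */
int watchdog_set_timeout(int fd, int secs) {
    return ioctl(fd, WDIOC_SETTIMEOUT, &secs);
}

/* Any write counts as a pet; the countdown restarts from the timeout. */
int watchdog_pet(int fd) {
    return write(fd, "\0", 1) == 1 ? 0 : -1;
}

/* "Magic close": writing 'V' before close asks the driver to disarm.
   Ignored when the kernel is built with CONFIG_WATCHDOG_NOWAYOUT. */
int watchdog_disarm(int fd) {
    if (write(fd, "V", 1) != 1) return -1;
    return close(fd);
}
```

A feeder is then just open, set timeout, and pet in a loop well inside the timeout window; whether close disarms or not depends on the NOWAYOUT build option.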

Intel iTCO on Dell Optiplex (our case)

Dell Optiplex and most Intel-chipset workstations include the iTCO (Intel TCO Watchdog Timer). The chipset's PMC (Power Management Controller) has a countdown timer that the BIOS or UEFI firmware arms at boot. The timer continues running once Linux boots. If Linux doesn't claim ownership via the iTCO_wdt driver, firmware continues ticking, and when the timer hits zero — typically 5 to 10 minutes depending on BIOS configuration — the chipset asserts a reset.

This is what caused the silent ~5-minute reboots during FortrOS bring-up before we enabled the driver. The node booted, all services came up, the WireGuard overlay came online, enrollment succeeded, then the kernel went idle as nothing was petting the watchdog. The iTCO timer hit zero and hard-reset the machine. No serial output accompanied the reset because firmware-level reset bypasses Linux's clean shutdown path entirely.

Other chipsets have equivalent watchdogs (hpwdt on HP, ipmi_watchdog for IPMI-accessible hardware, softdog as a pure-software fallback). The interaction pattern is the same: a driver claims ownership, userspace feeds /dev/watchdog, the chipset stays quiet.

How other distros handle this

| Distro/system | Approach |
| --- | --- |
| systemd-based (Fedora, Ubuntu, RHEL) | systemd itself opens /dev/watchdog when RuntimeWatchdogSec= is set in /etc/systemd/system.conf and pets it periodically; if systemd hangs, the feed stops and the hardware resets. A reboot request also re-arms the watchdog, so if the clean shutdown hangs, the hardware rescues it. |
| OpenWrt | procd's watchdog daemon opens /dev/watchdog and feeds it. Integrates with service health (unhealthy services are restarted; if that fails, the system reboots). |
| Kubernetes node (bare metal) | Usually relies on the distro's systemd/watchdogd plus node-level heartbeat loss to drain the node. The hardware watchdog is the lowest floor. |
| NixOS | services.watchdog.enable = true; spawns watchdogd configured via /etc/watchdog.conf. |
| Bare Buildroot (FortrOS base) | Nothing by default. Either your init system (s6 for us) opens /dev/watchdog, or you ship a dedicated feeder service. |
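For concreteness, the systemd approach reduces to a small drop-in. A sketch with illustrative values (RuntimeWatchdogSec= and RebootWatchdogSec= are the relevant options):

```ini
# /etc/systemd/system.conf.d/watchdog.conf — illustrative values
[Manager]
RuntimeWatchdogSec=30s   # systemd opens /dev/watchdog and pets at half this
RebootWatchdogSec=10min  # timeout re-armed to cover a hung clean shutdown
```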

The common thread: whichever process "feeds" the watchdog becomes the liveness oracle for the whole machine. If that process is something deep in the system (init), a hang anywhere — including the kernel itself — often still stops the feed, triggering recovery. If the feeder is a dumb timer shell loop, only that loop's ability to run matters; the rest of the system can be frozen.

FortrOS's approach

Phase 1 (current): a minimal s6-rc longrun service called watchdog-feeder runs busybox's watchdog applet against /dev/watchdog. It pets every 10 seconds with a 30-second timeout (3× safety margin). This catches any hang deep enough to stop s6 from running, and nothing else.

Relevant pieces:

  • buildroot/board/kernel-configs/fortros-node-6.19.config — enables CONFIG_WATCHDOG, CONFIG_WATCHDOG_CORE, CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED, CONFIG_ITCO_WDT.
  • buildroot/board/busybox.fragment — enables the watchdog applet.
  • buildroot/overlay/etc/s6-rc/source/watchdog-feeder/ — the s6-rc service definition (type: longrun, depends on mount-filesystems).
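The longrun's run script could look roughly like this (a sketch in execline; the authoritative definition is the overlay directory above). -F keeps the busybox applet in the foreground so s6 can supervise it, and -t/-T match the 10-second pet / 30-second timeout described above:

```sh
#!/bin/execlineb -P
watchdog -F -t 10 -T 30 /dev/watchdog
```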

Phase 2 (planned): the maintainer opens /dev/watchdog itself and pets it from inside its main loop. A maintainer hang becomes a hardware reset, and recovery is the normal PXE → preboot → node cycle. This makes the maintainer the definitive liveness oracle for the node. Any code path in the maintainer that could wedge (deadlock, infinite retry loop, catastrophic exception) now self-heals.
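A hedged sketch of the intended Phase 2 shape (maintainer_step and the interval parameter are illustrative, not the real maintainer API): the pet happens only after a complete loop iteration, so any wedge upstream of it starves the watchdog.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical: one unit of real maintainer work; false = stuck/fatal. */
extern bool maintainer_step(void);

int maintainer_run(const char *wdt_dev, unsigned interval_secs) {
    int wd = open(wdt_dev, O_WRONLY);    /* first open arms the timer */
    if (wd < 0)
        return -1;
    for (;;) {
        if (!maintainer_step())          /* a deadlock or crash in here   */
            break;                       /* means the pet below never runs */
        write(wd, "\0", 1);              /* pet only after a full iteration */
        sleep(interval_secs);            /* keep well under the HW timeout  */
    }
    /* Deliberately no magic close: exiting the loop leaves the timer
       armed, so the node hard-resets back into the PXE recovery chain. */
    return 0;
}
```

The design point is where the pet sits: at the bottom of the loop, gated on the iteration completing, never in a separate thread that could stay healthy while the real work is stuck.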

This is a deliberate design decision:

  • Making maintainer the feeder ties liveness to the component whose health actually matters. A dumb timer service can be alive while the maintainer is stuck.
  • The recovery mechanism is the same one we already exercise daily (PXE-recovery). No new code paths, no new failure modes to reason about.
  • The cost is that maintainer bugs now cause reboots instead of getting caught by supervision. That cost is worth it because supervision of a broken maintainer is usually ineffective anyway — the whole node is the unit of recovery.

Interaction with FortrOS's BootOrder state machine

Hardware watchdog resets look identical to cold-boot from the firmware's perspective — the machine comes up in its default BootOrder state. Because FortrOS manages BootOrder explicitly (bootstrapper forces PXE-first on entry; preboot self-promotes right before kexec), a watchdog-triggered reset lands deterministically:

  • If the reset happened during bootstrapper or preboot, PXE is already the boot priority, so the firmware re-PXEs and we retry.
  • If the reset happened on the main-OS side (preboot had self-promoted), the firmware boots preboot directly, which re-authenticates and kexecs fresh. No user intervention.

In both cases we end up back in a known boot stage within a handful of seconds to a minute, depending on firmware POST time.

Tradeoffs and rejected alternatives

  • Leave the watchdog untouched (no driver, no feeder). Rejected. On our hardware the firmware-armed iTCO timer keeps counting down, so even a healthy node hard-resets every few minutes — the exact silent-reboot bug from bring-up. On hardware where firmware doesn't arm it, every silent hang instead waits for a human with physical access. Unacceptable for a self-organizing org either way.
  • Disable the watchdog entirely (CONFIG_WATCHDOG=n or per-driver disable). Rejected. We want the safety net, especially for headless nodes where no one sees the machine's state.
  • Use softdog (pure software watchdog). Rejected. It's a kernel thread triggered by the kernel's own timer, so it can't detect a kernel lockup — which is the highest-value hang to recover from.
  • Use NMI watchdog / hardlockup_detector. Rejected as the sole mechanism. These detect specific classes of kernel bugs but not all hangs, and they respond by panicking rather than resetting. Under FortrOS the panic → reboot cycle works, but the hardware watchdog covers a strict superset of hangs.