05 Loading the Real OS

The Problem

The preboot has authenticated, unlocked /persist, and knows which generation to boot (04 Disk Encryption). But the preboot kernel is generic -- it's a one-size-fits-all kernel that boots on any hardware. The real OS needs an optimized kernel: drivers compiled in for this specific hardware, CPU flags matching this machine's architecture, and the full service stack.

The preboot needs to replace itself with the real kernel. Without rebooting. Without going through firmware again. Without losing the LUKS key.

This is what kexec does.

What Is kexec?

kexec (kernel execute) is a Linux mechanism, exposed through the kexec_load and kexec_file_load system calls, that loads a new kernel into memory and jumps to it, bypassing the firmware entirely. If you've done blue/green deployment in web development (two environments, switch which one is live by changing where traffic points), kexec is the OS equivalent: the old kernel is "green," the new kernel is "blue," and kexec is the instant switch. The analogy has one limit: userspace does not survive the jump, so the new kernel starts its own init from scratch. What kexec skips is the firmware and the bootloader, not the kernel boot itself.

Normal reboot:   kernel -> shutdown -> firmware -> bootloader -> new kernel
kexec:           kernel -> new kernel (direct jump, like switching green->blue)

kexec is fast (seconds vs 30-60 seconds for a full reboot on server hardware) and avoids re-running firmware initialization. But it has implications:

The new kernel doesn't get a "fresh start" from firmware. Some hardware may need re-initialization that only firmware provides.
Secrets in memory persist across the transition (useful for passing keys).
The new kernel inherits whatever hardware state the old kernel left.

How Others Do It

ChromeOS: A/B Partitions

ChromeOS stores two complete OS images on disk (partition A and partition B). Updates write to the inactive partition. On reboot, firmware selects the new partition. If boot fails, firmware falls back to the other partition.

Strength: Clean boot from firmware every time. Atomic updates. Weakness: Two full copies of the OS on disk. Full reboot required.

Android: A/B Slots (Same Pattern)

Android uses the same A/B approach. The bootloader (not firmware) selects which slot to boot. A "boot control HAL" tracks boot success -- if the new slot keeps failing to boot (the retry limit is small and set by the device's bootloader), it falls back to the previous slot.

Strength: Well-tested, billions of devices. Weakness: Requires double the disk space for the OS.

NixOS: Generation Boot Entries

Every nixos-rebuild creates a new boot entry. The bootloader lists all generations and the user (or auto-selection) picks one. Old generations are kept on disk until garbage-collected. Rollback is a reboot into a previous boot entry.

Strength: Any number of generations, and rollback needs no rebuild (just a reboot into an older entry). Weakness: Full reboot required for each generation switch. All generations on the same partition.

Talos Linux: Staged Boot

Talos uses a similar flow to FortrOS: a minimal init system loads the real OS from a squashfs image. No kexec -- the init loads the squashfs and pivot_roots into it. The "real OS" runs from a read-only filesystem overlay.

Strength: No kexec complexity. Weakness: Cannot switch to a kernel optimized for specific hardware (same kernel for all roles).

The Tradeoffs

Approach	Kernel per-hardware	Switch speed	Disk usage	Fallback mechanism
A/B partitions	No (one kernel)	Full reboot (30-60s)	2x OS size	Firmware selects slot
Boot entries	No	Full reboot	N x generation size	Bootloader menu
kexec	Yes (per-host optimized)	Fast (seconds)	N x kernel+initramfs	Generation health markers
pivot_root	No	Instant (no reboot)	1x squashfs + overlay	Overlay fallback

FortrOS uses kexec because it enables per-host optimized kernels: each machine can have drivers compiled specifically for its hardware, with CPU architecture flags matching its processor. The generic preboot kernel is a universal boot stub; the real kernel is tuned.

How FortrOS Does It

The Generation Model

A generation is a versioned kernel + initramfs pair stored on /persist. Each generation contains:

An optimized kernel (drivers compiled for specific hardware or role)
An initramfs with the full s6-rc service tree
A content hash (for integrity verification)
A generation ID (for LUKS key derivation)
A health marker using the three-state confirmation pattern: untested (just deployed, not yet verified), ok (boot watchdog confirmed healthy), or failed (boot watchdog timed out). This is the same pattern as enrollment (pending / confirmed / revoked) -- FortrOS is honest about uncertainty rather than treating "deployed" as "working."

Multiple generations coexist on /persist. The preboot selects which one to boot based on the gen-auth's response during unlock, with fallback to the newest healthy generation cached locally.

/persist/
  boot-state/
    current              # Points to the current generation ID
    gen-v1/
      status             # "ok" / "failed" / "untested"
      vmlinuz            # Optimized kernel
      initramfs          # Service tree
      config-diff        # Diff from generic kernel config
      content-hash       # SHA-256 of kernel + initramfs
    gen-v2/
      status
      vmlinuz
      initramfs
      ...

Generation Selection

The gen-auth already told the preboot which generation to boot during the unlock sequence ("send what you have"). The preboot's job is to verify and load it:

Read the generation ID from the unlock response (gen-auth selected it)
Verify the generation: recompute its SHA-256 content hash and check the Ed25519 signature over that hash (see Generation Signature Verification)
If valid and health status is "ok" or "untested" -> boot it
If invalid or "failed" -> fall back to the previous healthy generation still cached on /persist
If no valid generation on /persist -> connect to org, download fresh image
If no network -> show "connect me" screen (preboot stays running)
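The selection steps above can be sketched as a pure function over the current state. This is a minimal illustration, not the preboot's actual code (which the text says is Rust). The `Generation` type, the numeric `gen_id`, and the action names are hypothetical, and the rollback floor described later is not modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Generation:
    gen_id: int            # hypothetical numeric ID; higher means newer
    status: str            # "ok", "untested", or "failed"
    signature_valid: bool  # outcome of the hash + signature check


def bootable(gen: Generation) -> bool:
    """A requested generation may boot if it verifies and has not failed."""
    return gen.signature_valid and gen.status in ("ok", "untested")


def choose_action(requested_id: int,
                  cached: List[Generation],
                  network_up: bool) -> Tuple[str, Optional[int]]:
    """Decide what to do from the state as it is now (level-triggered)."""
    by_id = {g.gen_id: g for g in cached}
    requested = by_id.get(requested_id)
    if requested is not None and bootable(requested):
        return ("boot", requested.gen_id)

    # Fall back to the newest cached generation that is confirmed healthy.
    healthy = [g for g in cached if g.signature_valid and g.status == "ok"]
    if healthy:
        return ("boot", max(healthy, key=lambda g: g.gen_id).gen_id)

    # Nothing usable on /persist: re-provision if possible, otherwise wait.
    return ("download", None) if network_up else ("connect-me", None)
```

Because the function takes no history as input, calling it again after a failed boot naturally yields the fallback: the failed generation's status has changed, so the answer changes with it.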

This is level-triggered: the preboot doesn't track "which generation did I just install" or "was this an upgrade or a normal boot." It reads the current state and acts on it. If the state is corrupt, it falls back. If everything is gone, it re-provisions.

The kexec Transition

Once a generation is selected, the preboot:

Builds the appended initramfs: A minimal cpio archive containing /luks.key (the LUKS encryption key for /persist). This archive is gzip-compressed and concatenated with the generation's initramfs.
Loads via kexec: kexec -l vmlinuz --initrd=combined-initramfs
Executes: kexec -e -- the preboot kernel is replaced by the generation kernel. No reboot, no firmware, direct jump.

The combined initramfs trick works because the Linux kernel supports concatenated cpio archives -- it unpacks them in order, so the appended archive's /luks.key is available in the generation's initramfs filesystem.
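To make the concatenation concrete, here is a minimal sketch that writes a one-file cpio archive in the "newc" format the kernel's initramfs unpacker reads, gzip-compresses it, and appends it to an existing initramfs image. The function names are hypothetical, and the file mode and zeroed ownership fields are illustrative choices rather than anything the source specifies.

```python
import gzip


def _newc_entry(name: str, data: bytes, mode: int, ino: int) -> bytes:
    """Encode one cpio 'newc' entry: 110-byte ASCII header, name, data."""
    name_bytes = name.encode() + b"\x00"  # namesize counts the trailing NUL
    fields = [
        ino, mode, 0, 0,      # inode, mode, uid, gid
        1, 0, len(data),      # nlink, mtime, filesize
        0, 0, 0, 0,           # devmajor, devminor, rdevmajor, rdevminor
        len(name_bytes), 0,   # namesize, check (unused in newc)
    ]
    out = b"070701" + b"".join(b"%08X" % f for f in fields) + name_bytes
    out += b"\x00" * (-len(out) % 4)  # header + name padded to 4 bytes
    out += data
    out += b"\x00" * (-len(out) % 4)  # file data padded to 4 bytes
    return out


def key_archive(key: bytes) -> bytes:
    """Build a gzip-compressed cpio archive holding only luks.key."""
    body = _newc_entry("luks.key", key, 0o100400, 1)  # regular file, 0400
    body += _newc_entry("TRAILER!!!", b"", 0, 0)      # end-of-archive marker
    return gzip.compress(body)


def combined_initramfs(generation_initramfs: bytes, key: bytes) -> bytes:
    """Append the key archive after the generation's own initramfs image."""
    return generation_initramfs + key_archive(key)
```

The result would be written to a file and passed as the `--initrd=` argument shown in the steps above. Keeping that file on a RAM-backed filesystem avoids writing the key to disk.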

The Main OS Boot (After kexec)

The generation kernel boots and runs its s6-rc service tree. One of the first services is persist-mount:

Read /luks.key from the initramfs
cryptsetup luksOpen on /persist using the key
Zero the key in memory
Delete /luks.key from the tmpfs
Mount /persist

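After persist-mount, the key file is gone from the initramfs filesystem and the service's own copy has been zeroed. The only key material left in memory is what the kernel's dm-crypt layer must hold for as long as /persist is open. The main OS accesses /persist normally -- it doesn't know or care how the key got there.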
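A rough sketch of the persist-mount sequence follows. It differs from the listed steps in one respect: the key file is handed straight to cryptsetup with `--key-file`, so the script itself never holds the key in its own memory. The device path, the mapper name `persist`, and the function names are assumptions for illustration. The real service would run under s6-rc rather than as a Python script.

```python
import os
import subprocess
from typing import List


def luks_open_argv(device: str, name: str, key_path: str) -> List[str]:
    """Command line that unlocks a LUKS device using a key file."""
    return ["cryptsetup", "luksOpen", "--key-file", key_path, device, name]


def scrub_and_delete(path: str) -> None:
    """Overwrite the key file with zeros, then unlink it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        f.write(b"\x00" * size)
        f.flush()
        os.fsync(f.fileno())
    os.remove(path)


def persist_mount(device: str,
                  key_path: str = "/luks.key",
                  mountpoint: str = "/persist",
                  run=subprocess.run) -> None:
    """Unlock and mount /persist; the key file is removed even on failure."""
    try:
        run(luks_open_argv(device, "persist", key_path), check=True)
    finally:
        scrub_and_delete(key_path)
    run(["mount", "/dev/mapper/persist", mountpoint], check=True)
```

Scrubbing in a `finally` block is a deliberate choice in this sketch: a failed unlock should not leave the key lying in the root filesystem for whatever runs next.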

The Boot Watchdog

Not the hardware watchdog. The boot-watchdog described here is a software construct that confirms a generation is healthy before marking it verified. A separate hardware watchdog (Hardware Watchdogs) runs at the chipset level and recovers from silent hangs. They share the name by convention but operate at different layers and are not interchangeable.

After kexec, a two-layer watchdog protects against bad generations:

Readiness signal (primary): The boot-watchdog s6-rc service depends on the maintainer service. When the maintainer signals readiness (via s6 notification-fd -- meaning WireGuard is up, gossip mesh joined, initial TreeSync complete), the watchdog writes "ok" to the generation's status file on /persist. The generation is now confirmed healthy (three-state: untested -> ok).

Timeout backstop: A separate watchdog runs independently of s6-rc. If the maintainer does not signal readiness within the configured timeout, the backstop marks the generation as "failed" (three-state: untested -> failed) and reboots. The timeout is configurable per org -- hardware with slow initialization (spinning disks, many NICs) may need longer. The preboot sees the "failed" status and falls back to the previous healthy generation, down to the rollback floor.

The rollback floor is the oldest generation the org considers safe. The org sets this via config to prevent rollback to pre-security-fix generations.
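The three-state transition and the rollback floor can be sketched in a few lines. This illustration collapses the two watchdog layers into one wait-with-timeout, whereas the design above keeps the backstop outside s6-rc. It also assumes numeric generation IDs where a higher number means newer; both function names are hypothetical.

```python
import threading
from typing import Dict, List


def await_readiness(ready: threading.Event, timeout_s: float) -> str:
    """Status to record for an untested generation.

    'ok' if the maintainer signalled readiness in time, 'failed' if the
    timeout expired first (the caller would then reboot).
    """
    return "ok" if ready.wait(timeout_s) else "failed"


def fallback_ids(statuses: Dict[int, str], rollback_floor: int) -> List[int]:
    """Generations the preboot may fall back to, newest first.

    Only confirmed-healthy generations at or above the org's rollback
    floor qualify.
    """
    return sorted(
        (gen for gen, status in statuses.items()
         if status == "ok" and gen >= rollback_floor),
        reverse=True,
    )
```

With statuses `{1: "ok", 2: "ok", 3: "failed", 4: "untested"}` and a floor of 2, only generation 2 is eligible: generation 1 is healthy but below the floor.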

Generation Signature Verification

Before kexec, the preboot verifies the generation image's integrity:

SHA-256 hash of kernel + initramfs -> content hash
Verify Ed25519 signature of the content hash against the org CA's public key (embedded in the preboot's initramfs)
If signature is invalid -> skip this generation, try the next

This prevents a compromised /persist from tricking the preboot into booting a malicious kernel. The org CA's public key is in the preboot UKI (on the ESP), not on /persist, so a /persist compromise cannot substitute a different verification key.
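The verification steps might look like the sketch below. Several details are assumptions, since the source does not pin them down: the hash is taken over the kernel bytes followed by the initramfs bytes, the content-hash file stores hex, and `verify_sig` is a hypothetical callback standing in for Ed25519 verification against the org CA's public key (Python's standard library has no Ed25519 implementation).

```python
import hashlib
import hmac
from typing import Callable


def content_hash(kernel_path: str, initramfs_path: str) -> bytes:
    """SHA-256 over the kernel bytes followed by the initramfs bytes."""
    h = hashlib.sha256()
    for path in (kernel_path, initramfs_path):
        with open(path, "rb") as f:
            # Stream in 1 MiB chunks so large images are not loaded whole.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
    return h.digest()


def generation_is_authentic(kernel_path: str,
                            initramfs_path: str,
                            recorded_hash_hex: str,
                            signature: bytes,
                            verify_sig: Callable[[bytes, bytes], bool]) -> bool:
    """True only if the recomputed hash matches and its signature verifies."""
    digest = content_hash(kernel_path, initramfs_path)
    try:
        recorded = bytes.fromhex(recorded_hash_hex.strip())
    except ValueError:
        return False  # a malformed hash file counts as a failed check
    if not hmac.compare_digest(digest, recorded):
        return False
    # The signature is checked over the recomputed digest, not the stored one.
    return verify_sig(digest, signature)
```

Checking the signature against the recomputed digest matters: the stored content-hash lives on /persist, which is exactly the storage this check is meant not to trust.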

Why the Preboot Disappears

An alternative architecture: keep the preboot running as a persistent hypervisor. The maintainer, org VMs, and user VMs all run as sibling VMs on top. The preboot becomes the node's permanent "firmware" layer -- a software-based BMC that could restart a hung maintainer VM without external hardware (no AMT needed).

This is the Qubes OS model, and it has real security appeal. But follow the responsibilities: if the preboot is the persistent hypervisor, it needs WireGuard (to route overlay traffic between VMs), gossip awareness (to know which VMs to run), storage access (to fetch VM images), and orchestration logic (to manage VM lifecycles). At that point, the preboot IS the maintainer -- you've moved all the complex distributed systems code into the most privileged layer.

FortrOS goes the other direction: the trust anchor should be as small as possible. The preboot is ~300 lines of Rust: authenticate, unlock, kexec, disappear. The maintainer is ~3,000+ lines of complex code (gossip, CRDTs, TreeSync, WireGuard mesh management, workload IPC). Every line of code in the trust anchor is a line that could have a security bug in the most privileged position. Moving the maintainer into the trust anchor makes the attack surface larger, not smaller.

The split is: the preboot extends the chain of trust from hardware and org, then gets out of the way. The maintainer does the complex work at a lower privilege level where a compromise can be wiped out by a reboot. TPM hardware isolation protects the preboot's secrets regardless of what happens to the main OS. On AMD hardware that supports it, SEV-SNP adds hardware memory encryption for VMs, achieving most of the Qubes security model without the persistent hypervisor complexity.

The Qubes-style model remains a valid future option for high-security deployments on modern hardware where the added complexity is justified.

Stage Boundary

What This Stage Produces

After kexec completes:

A new kernel is running (generation-specific, optimized)
The initramfs contains the full s6-rc service tree + LUKS key
/persist is about to be unlocked by the persist-mount service
The boot watchdog is ticking (configurable timeout to prove health)

What Is Handed Off

The preboot is gone. The generation kernel and its s6-rc services take over:

persist-mount unlocks /persist (covered above)
s6-rc starts all services in dependency order (06 Init and Services)
WireGuard comes up (07 Overlay Networking)
The maintainer joins the gossip mesh (08 Cluster Formation)
The boot watchdog marks the generation as "ok" if everything succeeds

What This Stage Does NOT Do

It does not manage services (that's 06 Init and Services)
It does not configure networking (that's 07 Overlay Networking)
It does not join the org (that's 08 Cluster Formation)
It does not run workloads (that's 09 Running Workloads)
It does not handle upgrades (that's 10 Sustaining the Org)

This is the end of the preboot. Everything from here forward is the main OS. The preboot's job was: authenticate, unlock disk, select generation, kexec. It's done. The main OS takes over and never looks back.