source: docs/guide/chapters/08 Cluster Formation.md

08 Cluster Formation

The Problem

Nodes have overlay connectivity (07 Overlay Networking). They can send encrypted packets to each other. But connectivity isn't coordination. The nodes need to:

  1. Discover each other: Who else is in the org? Who just joined? Who left?
  2. Agree on state: What's the org's configuration? Which nodes are healthy? What workloads should run where?
  3. Handle disagreement: What if two partitions of the org made different changes while disconnected?

This is the distributed consensus problem, and it's one of the hardest problems in computer science. The classic solutions (Raft, Paxos) require a majority of nodes to be reachable -- if your cluster of 5 nodes loses 3, the remaining 2 can't make decisions.

FortrOS rejects this model. An org that splits in half should have both halves continue operating. When they reconnect, state merges automatically. This requires a different approach: Gossip Protocols for discovery and CRDTs for state.

The Building Blocks

Gossip: How Nodes Discover Each Other

Gossip Protocols are how nodes learn about each other without a central directory. The idea is borrowed from epidemiology: information spreads like a virus. Each node periodically picks a random peer and exchanges state. After a few rounds, every node knows about every other node.

FortrOS uses the SWIM protocol (Scalable Weakly-consistent Infection-style Membership) via the foca Rust crate. SWIM adds failure detection: nodes periodically probe each other (ping), and if a node doesn't respond, it's suspected and eventually declared dead.
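The epidemic spreading idea can be simulated in a few lines. The following is a minimal sketch, not the SWIM protocol itself (no probing or failure detection, just push-pull membership exchange), showing how quickly random peer exchange converges:

```python
import random

random.seed(7)  # deterministic for the example

def gossip_round(knowledge):
    """One round: every node picks a random peer and the two
    exchange (union) their membership views."""
    nodes = list(knowledge)
    for node in nodes:
        peer = random.choice([n for n in nodes if n != node])
        merged = knowledge[node] | knowledge[peer]
        knowledge[node] = merged
        knowledge[peer] = set(merged)

# Each of 16 nodes initially knows only itself.
knowledge = {n: {n} for n in range(16)}
rounds = 0
while any(view != set(knowledge) for view in knowledge.values()):
    gossip_round(knowledge)
    rounds += 1
print(f"all 16 nodes know the full membership after {rounds} rounds")
```

Because each round roughly doubles how far a piece of information has spread, convergence takes on the order of log(n) rounds, which is why gossip scales to large memberships.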

Gossip provides:

  • Membership: Who's in the org right now
  • Failure detection: Who's unreachable
  • Dissemination: Piggybacking small messages on protocol traffic

Gossip does NOT provide agreement on complex state -- it just ensures everyone eventually learns the same membership information.

CRDTs: How Nodes Agree Without a Leader

CRDTs (Conflict-free Replicated Data Types) are data structures where conflicts are machine-resolvable. Two nodes can make different changes simultaneously, and when those changes meet, deterministic rules resolve the disagreement without human intervention. No leader, no voting, no quorum. Each node modifies its local copy, and when copies merge, the result is mathematically guaranteed to converge to the same state on all nodes.

Conflicts still happen (two nodes changed the same thing during a partition), but the CRDTs carry enough context (causality tracking via logical counters, not wall-clock time) that pre-defined rules always determine the outcome. FortrOS deliberately avoids depending on synchronized clocks -- physical time is fragile in distributed systems (clock drift, NTP failures, adversarial manipulation). CRDTs track which operations have seen which other operations (causality), not when they happened (time). Human intervention is a last resort for genuinely ambiguous org-wide policy conflicts.
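Causality tracking can be illustrated with a toy multi-value register. This is an illustrative sketch, not the `crdts` crate's implementation: each written value carries a version vector of per-actor logical counters, concurrent values are all kept on merge, and a write that has causally seen the others replaces them:

```python
def dominates(vc_a, vc_b):
    """True if version vector vc_a has seen everything in vc_b."""
    return all(vc_a.get(actor, 0) >= n for actor, n in vc_b.items())

def merge_mvreg(entries_a, entries_b):
    """Merge two multi-value register states. Each entry is
    (value, version_vector). Keep every entry that is not
    causally dominated by some other entry."""
    combined = entries_a + entries_b
    return [
        (val, vc) for val, vc in combined
        if not any(dominates(vc2, vc) and vc2 != vc
                   for _, vc2 in combined)
    ]

# Nodes A and B write concurrently during a partition.
a = [("blue",  {"A": 1})]
b = [("green", {"B": 1})]
merged = merge_mvreg(a, b)           # concurrent: both values survive
print(sorted(v for v, _ in merged))  # ['blue', 'green']

# Later, A observes the merge and writes on top of both histories.
resolved = merge_mvreg(merged, [("red", {"A": 2, "B": 1})])
print([v for v, _ in resolved])      # ['red']
```

Note that no wall-clock time appears anywhere: "concurrent" and "happened after" fall out of comparing logical counters, which is exactly the property that makes the merge immune to clock drift.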

FortrOS uses several CRDT types (from the crdts Rust crate):

  • Orswot (Observed-Remove Set Without Tombstones): for org membership
  • MVReg (Multi-Value Register): for per-node metadata that may be concurrently modified
  • GSet (Grow-only Set): for revocation lists (once revoked, always revoked)
  • Map: for structured key-value state
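The grow-only set is the simplest of these. A toy sketch (not the `crdts` crate's `GSet`) shows why a revocation can never be undone by a merge:

```python
class GSet:
    """Grow-only set: the only operation is add; merge is set union,
    so a revoked entry can never disappear from any replica."""
    def __init__(self):
        self.items = set()

    def add(self, item):
        self.items.add(item)

    def merge(self, other):
        self.items |= other.items

# Two partitions revoke different keys while disconnected.
a, b = GSet(), GSet()
a.add("key-1")
b.add("key-2")
a.merge(b)
b.merge(a)
assert a.items == b.items == {"key-1", "key-2"}
```

Union is commutative, associative, and idempotent, so replicas can merge in any order, any number of times, and still converge.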

Merkle Trees: Detecting Divergence

Merkle Trees are hash trees where each leaf is a hash of a data block, and each non-leaf is a hash of its children. The root hash is a fingerprint of the entire dataset. If any leaf changes, the root hash changes.

FortrOS uses Merkle trees to detect state divergence efficiently: two nodes compare root hashes (32 bytes). If they match, the nodes are in sync. If they differ, they walk the tree to find exactly which leaves differ, exchanging only the changed data.
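A minimal sketch of this walk, assuming a binary tree over SHA-256 hashes (the leaf layout and fanout here are illustrative, not FortrOS's actual tree shape):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    """Build a binary Merkle tree. Returns a list of levels:
    levels[0] = leaf hashes, levels[-1] = [root]."""
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [
            h(level[i] + level[i + 1]) if i + 1 < len(level) else h(level[i])
            for i in range(0, len(level), 2)
        ]
        levels.append(level)
    return levels

def diff_leaves(tree_a, tree_b):
    """Walk two same-shape trees top-down, descending only into
    differing subtrees, and return the indices of differing leaves."""
    if tree_a[-1] == tree_b[-1]:
        return []                     # roots match: in sync
    suspects = [0]                    # node indices at the current level
    for depth in range(len(tree_a) - 1, 0, -1):
        next_suspects = []
        for i in suspects:
            for child in (2 * i, 2 * i + 1):
                if (child < len(tree_a[depth - 1])
                        and tree_a[depth - 1][child] != tree_b[depth - 1][child]):
                    next_suspects.append(child)
        suspects = next_suspects
    return suspects

leaves_a = [f"entry-{i}".encode() for i in range(8)]
leaves_b = list(leaves_a)
leaves_b[5] = b"entry-5-changed"
print(diff_leaves(build_tree(leaves_a), build_tree(leaves_b)))  # [5]
```

Matching subtrees are skipped entirely, so the cost of finding the divergence is proportional to the number of changed leaves times the tree depth, not to the total dataset size.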

TreeSync: Efficient State Transfer

TreeSync is FortrOS's protocol for synchronizing state between nodes. When gossip broadcasts indicate a hash mismatch (one node's state has changed), TreeSync handles the actual data transfer:

  1. Client sends its top-level Merkle tree hashes (~4KB)
  2. Server walks divergent subtrees
  3. Server returns only the changed leaves (~1KB for 1 changed leaf, even in a 10,000-node org)
  4. Client merges the received leaves into its local CRDTs

This is far more efficient than sending the full state on every change.

How Others Do It

Kubernetes (etcd): Strong Consensus

Kubernetes stores all cluster state in etcd, which uses the Raft consensus algorithm. Raft requires a majority of etcd nodes to be reachable for any write to succeed. A 3-node etcd cluster tolerates 1 failure; a 5-node cluster tolerates 2.

Strength: Strong consistency -- all reads return the latest written value. Weakness: Cannot operate during a network partition if the majority is unreachable. The cluster is "correct but unavailable."

Consul (Serf + Raft): Two Layers

Consul uses two coordination mechanisms: Serf (gossip, SWIM-based) for membership and failure detection, and Raft for the key-value store and service catalog. Serf handles "who's in the cluster" (eventually consistent). Raft handles "what's the configuration" (strongly consistent).

Strength: Best-of-both: fast membership via gossip, strong config via Raft. Weakness: Raft still requires a majority of server nodes. Consul agents (non-servers) can't participate in Raft.

Riak: CRDTs for Real

Riak is a distributed database that uses CRDTs for conflict resolution. Multiple clients can write to the same key simultaneously, and Riak merges the values using CRDT semantics. No leader election, no quorum for writes (configurable consistency levels).

Strength: Available during partitions. Writes always succeed locally. Weakness: Eventually consistent -- reads may return stale data briefly. CRDT merge semantics can be surprising (concurrent counter increments work, but concurrent "set value to X" creates a multi-value register).

The Tradeoffs

| Approach | Partition tolerance | Consistency | Write availability | Complexity |
|---|---|---|---|---|
| Raft (etcd, Consul servers) | Majority required | Strong | Blocks without majority | Low (single leader) |
| Gossip + CRDTs (FortrOS, Riak) | Both halves operate | Eventual | Always (local writes) | Higher (merge semantics) |
| Gossip only (Serf) | Both halves operate | Eventual (membership) | Always | Low (membership only) |

FortrOS chose gossip + CRDTs because the org must survive arbitrary network partitions. A homelab where one machine is on a laptop (Wi-Fi, intermittent) and another is a VPS (always connected) shouldn't require both to be reachable for either to operate.

How FortrOS Does It

State Trees: A Reusable Primitive

FortrOS's replicated state system is built around state trees -- a reusable primitive that any service can use. A state tree wraps a CRDT dataset with a Merkle tree for efficient sync. Services register a state tree with the maintainer and immediately get: gossip-based change notification, Merkle-efficient sync, and conflict resolution via a chosen schema. No CRDT code in the service itself.

The current state trees in the org:

| Name | Contents | Resolution Schema |
|---|---|---|
| OrgOperational | Membership, per-node metadata, revocations | Self-authoritative / grow-only |
| OrgConfig | Org-wide declarative settings | Precondition-based |
| WorkloadDesired | What workloads should run (specs, placement) | Self-authoritative per owner |
| WorkloadObserved | What workloads are actually running (status) | Self-authoritative per reporter |

Tree registration should use name-based lookup (services register by name, the system assigns an ID) rather than hardcoded numeric IDs. Hardcoded IDs are a collision risk as third-party services and org-specific extensions register their own trees. The grouping of datasets into trees is driven by sync granularity -- data that changes together and is consumed together belongs in one tree.
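A sketch of what name-based registration could look like. The hash-derived ID scheme below is a hypothetical illustration -- the text above specifies only name-based lookup, not how IDs are assigned:

```python
import hashlib

class TreeRegistry:
    """Name-based state-tree registration sketch. Deriving the ID
    from the name means independent services cannot collide by
    hardcoding numbers, and every node derives the same ID without
    coordination."""
    def __init__(self):
        self.trees = {}  # tree_id -> name

    def register(self, name):
        tree_id = int.from_bytes(
            hashlib.sha256(name.encode()).digest()[:4], "big")
        existing = self.trees.get(tree_id)
        if existing is not None and existing != name:
            # Astronomically unlikely with a 32-bit hash space at this
            # scale, but detected rather than silently merged.
            raise ValueError(f"ID collision: {name!r} vs {existing!r}")
        self.trees[tree_id] = name
        return tree_id

reg = TreeRegistry()
a = reg.register("OrgOperational")
assert reg.register("OrgOperational") == a  # idempotent: same ID back
assert reg.register("OrgConfig") != a       # distinct names, distinct IDs
```

The key property is determinism: any node, given only the name, arrives at the same ID, so registration needs no central allocator.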

The Sync Protocol

The protocol has two layers: gossip broadcasts are hints that something changed, and TreeSync over TCP is the reliable merge/resolution mechanism.

Gossip layer (hints):

  1. A node modifies its local CRDT. The state tree recalculates its Merkle root.
  2. The node broadcasts its recent Merkle root hashes (a configurable history ring) via foca gossip (~65 bytes per broadcast).
  3. Receiving nodes compare the broadcast hashes against their own history. If a peer is broadcasting a hash the receiver has already moved past, the receiver knows it has newer data. If the hash is unknown, a sync is needed.
  4. Only the originator of a change broadcasts. Nodes that receive state via TreeSync do not re-broadcast -- this prevents convergence storms where partial merges trigger cascading broadcasts. Every node still converges because gossip is probabilistic (each round picks random peers) and TreeSync pulls are also triggered periodically, not only on hash mismatch. A missed broadcast is caught by the next gossip round or periodic pull.
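The hint classification in step 3 can be sketched as follows. The ring size and the outcome names are illustrative, not FortrOS API:

```python
from collections import deque

class HashHistory:
    """Ring of recent Merkle root hashes for one state tree, used to
    classify a peer's broadcast hash."""
    def __init__(self, size=16):
        self.ring = deque(maxlen=size)

    def record(self, root):
        self.ring.append(root)

    def classify(self, peer_root):
        if self.ring and peer_root == self.ring[-1]:
            return "in-sync"       # peer broadcasts our current root
        if peer_root in self.ring:
            return "peer-behind"   # we moved past this hash; peer should pull
        return "sync-needed"       # unknown hash; we should pull from peer

hist = HashHistory()
for root in (b"r1", b"r2", b"r3"):
    hist.record(root)
print(hist.classify(b"r3"))  # in-sync
print(hist.classify(b"r1"))  # peer-behind
print(hist.classify(b"r9"))  # sync-needed
```

Keeping a short history rather than a single hash is what lets a receiver distinguish "the peer is behind me" from "the peer has something I haven't seen", so only one side initiates the TreeSync pull.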

TreeSync layer (reliable resolution):

  1. Pull: TCP connection over WireGuard. The client sends its top-level Merkle tree hashes (~4KB). The server walks divergent subtrees and returns only changed leaves.
  2. Merge: The client merges received leaves into its local CRDTs using the tree's conflict resolution schema.
  3. Push-notify: After pulling, the client sends its new hash to the server. If hashes still differ (bidirectional changes), the server triggers a reverse pull.
  4. Convergence: After 1-2 rounds of bidirectional TreeSync, all participating nodes have identical state.
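The four steps can be modeled with a toy in which the state is a grow-only set and the set itself stands in for the Merkle root. Real TreeSync exchanges hashes and only the changed leaves over TCP; this sketch shows only the pull / merge / push-notify / reverse-pull flow:

```python
class Node:
    """TreeSync toy: state is a grow-only set, merge is union."""
    def __init__(self, state):
        self.state = set(state)

    def root(self):
        return frozenset(self.state)  # stand-in for a Merkle root hash

def treesync(client, server):
    """One bidirectional TreeSync round."""
    client.state |= server.state           # 1-2: pull + merge changed leaves
    if client.root() != server.root():     # 3: push-notify finds remaining divergence
        server.state |= client.state       #    server triggers a reverse pull
    assert client.root() == server.root()  # 4: converged after the round

a = Node({"x", "y"})
b = Node({"y", "z"})    # both sides changed during a partition
treesync(a, b)
print(sorted(a.state))  # ['x', 'y', 'z']
```

With bidirectional changes, the first pull brings the client up to date and the push-notify hash comparison tells the server it is still missing the client's changes, which is why one to two rounds suffice.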

Enrollment Completion

The enrollment that started in 03 Trust and Identity completes here:

  1. Preboot authenticated, got TPM credentials (chapter 3)
  2. /persist unlocked, keys accessible (chapter 4)
  3. kexec'd into main OS (chapter 5)
  4. s6-rc started all services (chapter 6)
  5. WireGuard overlay connected (chapter 7)
  6. Now: Maintainer joins gossip mesh, presents enrollment nonce
  7. Provisioner validates nonce, promotes pending -> confirmed
  8. Confirmed enrollment gossips to all nodes via CRDT
  9. Node is a full org member

Conflict Resolution Schemas

FortrOS provides a set of resolution schemas that state trees choose from at registration. Services pick a schema and immediately benefit from automatic conflict resolution -- no conflict-handling code in the service itself.

| Schema | How It Works | Current Users |
|---|---|---|
| Self-authoritative | Each actor owns its own entries. Only the actor can modify its data. No conflict possible. | Node metadata, workload observed status |
| Precondition-based | Changes carry a precondition ("set X to 5, if current is 3"). After merge, unmet preconditions are rejected and the originator is informed. | Org config |
| Grow-only | Data can only be added, never removed. Merge is set union. No conflicts possible. | Revocation lists |
| CRDT-native | The CRDT's built-in merge semantics handle it (add-wins for Orswot, multi-value for MVReg). | Membership |
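The precondition-based schema can be sketched in a few lines (a hypothetical helper, not FortrOS code):

```python
def apply_precondition_change(state, key, expected, new_value):
    """Apply 'set key to new_value, if current is expected'. After a
    merge, a change whose precondition no longer holds is rejected so
    the originator can re-apply against the now-current state."""
    if state.get(key) != expected:
        return False  # precondition unmet: reject, inform originator
    state[key] = new_value
    return True

config = {"replicas": 3}
# The confirmed side already changed the value during the partition:
assert apply_precondition_change(config, "replicas", 3, 5)
# The pending side's change arrives after merge; its precondition (3)
# no longer holds, so it is rejected:
assert not apply_precondition_change(config, "replicas", 3, 4)
assert config["replicas"] == 5
```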

The general principle behind all schemas is locality-wins, which ties directly to the three-state confirmation pattern:

  • The local partition's change was confirmed -- it was enacted, the partition could observe the resource and verify the result.
  • The remote partition's change was pending -- it was intended, but the partition couldn't reach the resource to verify.
  • When the partitions reconnect and merge, the pending change's precondition no longer holds (the resource has already been changed by the confirmed side). The pending change is rejected, and the originator is informed so they can decide whether to re-apply against the current state.

This means conflict resolution isn't a special case -- it's the normal three-state lifecycle applied to concurrent changes.

Stage Boundary

What This Stage Produces

After cluster formation:

  • The node is a full org member (confirmed enrollment)
  • Org state is replicated and converging via gossip + TreeSync
  • The node can participate in org decisions (config changes, workload placement)
  • Failure detection is active (SWIM protocol monitors all peers)

What Is Handed Off

The replicated state enables:

  • Workload placement: the WorkloadDesired and WorkloadObserved trees feed scheduling decisions (09 Running Workloads)
  • Org sustainment: config changes and coordinated upgrades operate on the replicated org state (10 Sustaining the Org)

What This Stage Does NOT Do

  • It does not run workloads (that's 09 Running Workloads)
  • It does not handle upgrades (that's 10 Sustaining the Org)
  • It does not provide strong consistency (CRDTs are eventually consistent)
  • It does not handle large data replication (that's the storage layer)

Further Reading

Services:

  • Maintainer -- The gossip participant and CRDT owner on each node