source: docs/guide/concepts/Topology Map.md

Topology Map

What It Is

A topology map is a hierarchical model of the org's physical infrastructure: which nodes are on which network, which networks are in which building, which buildings are in which region. It answers the question "how is the org physically organized?" so that placement decisions (shards, workloads, signing quorums) can account for correlated failure domains.

The concept is adapted from Ceph's CRUSH map -- a hierarchical description of storage topology that drives data placement without a central directory.

Why It Matters

Physical proximity creates correlated failure. If two nodes share a power strip and the power strip fails, both nodes die simultaneously. If three nodes are in the same building and the building loses internet, all three become unreachable. Placement systems that ignore topology will, by chance, put all your shard replicas on the same rack -- and one power outage destroys all copies.

The topology map gives the placement system the information it needs to spread data and workloads across failure domains: "don't put all three replicas in the same rack" requires knowing which rack each node is in.

How It Works

Hierarchy

The topology is a tree of failure domains, from broadest to narrowest:

org
  region (geographic area, data center campus)
    site (building, data center)
      rack (network switch group, power domain)
        node (individual machine)
          storage controller (HBA, USB hub, NVMe controller)
            disk (individual storage device)

The hierarchy extends below the node for storage placement. Two shards on the same node but different disks survive a single disk failure. Two shards on the same HBA share a failure domain -- if the HBA fails, both shards are gone. The placement service uses the full hierarchy when deciding where to put shards: "don't put two shards on the same disk" is as important as "don't put all replicas in the same rack."

Not every org needs every level. A homelab might have: org -> site (the house) -> node -> disk. A multinational data center might use all levels including HBA-level granularity. The hierarchy is flexible -- add levels that represent real failure boundaries in your infrastructure.
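A flexible hierarchy like this can be modeled as a simple tree of failure domains. The sketch below is illustrative, not the actual FortrOS data structure; the class and field names (`TopoNode`, `level`, `leaves`) are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class TopoNode:
    """One failure domain in the topology tree (region, site, rack, node, disk...)."""
    name: str
    level: str  # e.g. "org", "region", "site", "rack", "node", "disk"
    children: list = field(default_factory=list)

    def add(self, child: "TopoNode") -> "TopoNode":
        self.children.append(child)
        return child

    def leaves(self) -> list:
        """All leaf domains (e.g. disks) under this subtree."""
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

# A homelab only adds the levels that map to real failure boundaries:
org = TopoNode("homelab", "org")
house = org.add(TopoNode("house", "site"))
nas = house.add(TopoNode("nas-01", "node"))
nas.add(TopoNode("sda", "disk"))
nas.add(TopoNode("sdb", "disk"))

print([d.name for d in org.leaves()])  # ['sda', 'sdb']
```

Because each level is just a node in the tree, a data center org can insert rack and HBA levels without changing any placement logic that walks the hierarchy.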

Discovery

Topology discovery combines multiple signals. No single source is authoritative -- the system correlates evidence to build confidence:

  • LAN detection: Nodes auto-detect LAN neighbors (same-subnet, low-latency gossip responses). Nodes on the same LAN are inferred to share a rack/site. This requires no manual configuration.
  • Latency patterns: Nodes that consistently show similar latency profiles to the same set of peers are likely co-located. Latency clustering groups nodes without explicit assignment.
  • OOB metadata: If out-of-band management (Intel AMT or IPMI) is available, the management controller often knows chassis/rack identity (asset tags, IPMI SDR records, SMBIOS data). This provides machine-reported topology without any manual input.
  • Downtime correlation: Nodes that lose power and recover simultaneously are likely on the same UPS, PDU, or circuit. Tracking correlated failures passively discovers physical groupings: "nodes A, B, and C all went down at 14:32 and came back at 14:37 -- they probably share a power source."
  • Hardware probing (below-node): The maintainer's hardware probe detects storage controllers, USB hubs, and individual disks. Two disks behind the same HBA share a failure domain. This is reported in the node's CRDT metadata and used by the placement service for shard distribution.
  • Manual override: Admins can explicitly assign nodes to topology levels via the admin service. Useful for cases where auto-detection is wrong (two nodes on the same /24 subnet but in different buildings via VPN).
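The downtime-correlation signal in particular is easy to sketch: nodes whose outages start within a short window of each other probably share a power source. A minimal grouping pass might look like this (the function name, event shape, and 120-second window are assumptions for illustration):

```python
def correlated_outage_groups(events, window=120):
    """Group nodes whose outages began within `window` seconds of each
    other -- a passive hint that they share a UPS, PDU, or circuit.

    events: list of (node, down_ts, up_ts) tuples, timestamps in epoch seconds.
    """
    events = sorted(events, key=lambda e: e[1])  # order by outage start
    groups, current, anchor = [], [], None
    for node, down_ts, up_ts in events:
        if anchor is None or down_ts - anchor <= window:
            current.append(node)
            anchor = down_ts if anchor is None else anchor
        else:
            groups.append(current)
            current, anchor = [node], down_ts
    if current:
        groups.append(current)
    return groups

events = [
    ("A", 1000, 1300), ("B", 1010, 1290), ("C", 1015, 1305),  # same PDU?
    ("D", 5000, 5100),                                        # unrelated
]
print(correlated_outage_groups(events))  # [['A', 'B', 'C'], ['D']]
```

A real implementation would also correlate the recovery times and require the pattern to repeat before raising its confidence, since a single coincidental outage proves little.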

Hierarchical Self-Organization

The topology map is not centrally managed. Each level of the hierarchy is responsible for organizing its own internals:

  • Regions track which sites they contain and how to route between them. They don't care about individual disks in another region.
  • Sites track racks and inter-rack connectivity. They know their nodes but delegate below-node details to the nodes themselves.
  • Nodes track their own storage controllers and disks. They report a summary upward ("I have 4 disks, 3 healthy, 2TB available") but the full disk topology stays local.

This means the admin interface naturally drills down: at the org level you see regions. Click a region, you see sites. Click a site, you see racks and nodes. Click a node, the admin interface asks that node (or a peer in its zone) for its disk layout. The detail lives where it's relevant, not replicated everywhere.
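The "summary upward, detail local" idea can be shown with a tiny sketch. The disk record shape and field names here are hypothetical, not the actual CRDT schema:

```python
def node_summary(disks):
    """The rollup a node reports upward; the full disk topology stays local."""
    healthy = [d for d in disks if d["healthy"]]
    return {
        "disks": len(disks),
        "healthy": len(healthy),
        "free_bytes": sum(d["free_bytes"] for d in healthy),
    }

disks = [
    {"dev": "sda", "healthy": True,  "free_bytes": 10**12},
    {"dev": "sdb", "healthy": True,  "free_bytes": 10**12},
    {"dev": "sdc", "healthy": False, "free_bytes": 0},
]
print(node_summary(disks))
# {'disks': 3, 'healthy': 2, 'free_bytes': 2000000000000}
```

Only this summary propagates to the site and region levels; the per-device detail is fetched from the node on demand when an admin drills down.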

This maps directly to the mesh topology model: nodes maintain strong connections to local peers (who share detailed local topology) and sparse connections to distant peers (who only need coarse regional topology). The topology data and the network connectivity follow the same shape.

Placement Rules

Services and data specify placement constraints in terms of the topology:

  • Spread: "Place replicas in at least 2 different sites" -- ensures a single-site failure doesn't destroy all copies.
  • Affinity: "Place this workload on a node in the same rack as its database" -- minimizes latency for tightly coupled services.
  • Anti-affinity: "Don't place two instances of this service on the same node" -- survives individual node failure.

The placement service evaluates these constraints against the topology map when scoring nodes for shard placement and workload scheduling.
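A spread constraint reduces to counting distinct ancestors at a given topology level. The sketch below assumes a flat node-to-topology lookup; the function name and constraint shape are illustrative, not the placement service's real API:

```python
def satisfies_spread(placement, topo_of, level_min):
    """Check spread constraints like 'replicas in at least 2 different sites'.

    placement: node names holding the replicas.
    topo_of:   node -> {"rack": ..., "site": ...} lookup (hypothetical shape).
    level_min: {"site": 2} means the placement must span >= 2 distinct sites.
    """
    for level, minimum in level_min.items():
        distinct = {topo_of[n][level] for n in placement}
        if len(distinct) < minimum:
            return False
    return True

topo_of = {
    "n1": {"rack": "r1", "site": "s1"},
    "n2": {"rack": "r1", "site": "s1"},
    "n3": {"rack": "r2", "site": "s2"},
}
print(satisfies_spread(["n1", "n2"], topo_of, {"site": 2}))  # False
print(satisfies_spread(["n1", "n3"], topo_of, {"site": 2}))  # True
```

Anti-affinity is the same check at the node level ({"node": k} for k instances), and affinity inverts it: prefer candidates whose rack matches the target's rack.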

Correlated Failure Detection

Gossip Protocols detect individual node failures. The topology map aggregates these into infrastructure events:

| Pattern                                     | Inference                    | Admin Action                   |
|---------------------------------------------|------------------------------|--------------------------------|
| One node in a rack goes down                | Individual failure           | Auto-replace workloads         |
| All nodes in a rack go down simultaneously  | Rack failure (power/switch)  | Alert: check infrastructure    |
| All nodes in a site go down                 | Site failure (network/power) | Alert: major outage            |
| Nodes across multiple sites go down         | Org-wide issue or partition  | Alert: investigate connectivity |

The admin interface distinguishes these. "Rack-3 unreachable" triggers a different response than "node-47 down" -- the former is likely infrastructure, the latter is likely a node issue.
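The classification is a containment check against the topology map: does the down set cover a whole rack, a whole site, or span sites? A minimal sketch, assuming flat node-to-rack and node-to-site lookups (the function name and return strings are illustrative):

```python
def classify_outage(down_nodes, rack_of, site_of):
    """Map a set of simultaneously-down nodes to an infrastructure event."""
    sites = {site_of[n] for n in down_nodes}
    racks = {rack_of[n] for n in down_nodes}
    if len(sites) > 1:
        return "org-wide issue or partition"
    # Single site: did a whole rack, or the whole site, go dark?
    all_in_racks = {n for n, r in rack_of.items() if r in racks}
    all_in_sites = {n for n, s in site_of.items() if s in sites}
    if down_nodes >= all_in_sites:
        return "site failure"
    if down_nodes >= all_in_racks:
        return "rack failure"
    return "individual node failure"

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
site_of = {"n1": "s1", "n2": "s1", "n3": "s1"}
print(classify_outage({"n1"}, rack_of, site_of))              # individual node failure
print(classify_outage({"n1", "n2"}, rack_of, site_of))        # rack failure
print(classify_outage({"n1", "n2", "n3"}, rack_of, site_of))  # site failure
```

A production version would add a time window (simultaneity matters) and tolerate nodes that were already down for unrelated reasons.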

Multisig Topology Spread

The org's K-of-N multisig for critical operations (key rotation, revocation) requires signatures from nodes in different topology branches. This prevents a compromised rack or site from manufacturing enough signatures to act unilaterally. The topology map provides the spread validation: "these K signatures come from K different sites" vs "these K signatures all come from the same rack."
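Spread validation for a quorum is the same distinct-ancestor count applied to signers. A hedged sketch; the function name, policy parameters, and return shape are assumptions, and the real policy would live in the org's multisig configuration:

```python
def validate_quorum_spread(signer_sites, k, min_sites):
    """Accept a K-of-N quorum only if signatures span enough topology branches.

    signer_sites: the site of each signer, e.g. ["s1", "s2", "s1"].
    """
    if len(signer_sites) < k:
        return False, "not enough signatures"
    if len(set(signer_sites)) < min_sites:
        return False, "signatures concentrated in one topology branch"
    return True, "ok"

print(validate_quorum_spread(["s1", "s1", "s1"], k=3, min_sites=2))
# (False, 'signatures concentrated in one topology branch')
print(validate_quorum_spread(["s1", "s2", "s3"], k=3, min_sites=2))
# (True, 'ok')
```

Note that the check must use the topology map's view of each signer, not a signer-reported location, or a compromised rack could simply lie about where its nodes sit.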

How FortrOS Uses It

  • Shard placement: Erasure Coding shards are spread across failure domains per the org's replication policy. The topology map ensures "N effective shards" accounts for correlated failure (shards in the same rack share a failure domain).
  • Workload scheduling: The reconciler uses topology constraints from workload manifests (spread, affinity, anti-affinity) when scoring nodes.
  • Gateway selection: Mesh topology uses the topology map to identify zone boundaries for gateway placement.
  • Multisig validation: Signature collection checks topology spread before accepting a K-of-N quorum.
  • Admin dashboard: Infrastructure health is visualized by topology level, not just as a flat list of nodes.

Alternatives

Flat placement (no topology awareness): Treat all nodes as equal. Simpler but can't prevent correlated failures from destroying all replicas.

Manual rack/zone assignment only: Admin assigns every node to a zone. Accurate but doesn't scale and requires manual updates as infrastructure changes.

DNS-based discovery (Consul): Use DNS SRV records to group services by datacenter. Works for service discovery but doesn't model the physical hierarchy below the datacenter level.

Links