scarecr0w12 edited this page Jun 24, 2026 · 1 revision

Swarm

Distributed agent coordination across multiple CortexPrism instances. The swarm layer allows Cortex instances to form a fleet, discover each other, dispatch work directives, and aggregate resource usage — all over the A2A protocol.

Architecture

┌──────────────────────────────────────────────┐
│                  Swarm Coordinator           │
│  registerSelf · discoverPeers · dispatch     │
│  broadcast · getResourceReport · heartbeat   │
├──────────────────────────────────────────────┤
│              A2A Transport Layer             │
│  connect · disconnect · sendDirective        │
│  ping · fetchRemoteAgentCard                 │
├──────────────────────────────────────────────┤
│          A2A Protocol (JSON-RPC 2.0)         │
└──────────────────────────────────────────────┘

The swarm layer sits above the A2A transport layer (packages/infra/src/swarm/) and provides the primary API for fleet operations. Its contract is defined in packages/infra/contracts/swarm.ts, and it is implemented across five modules:

| File | Purpose |
| --- | --- |
| coordinator.ts | Swarm singleton: self-registration, peer discovery, dispatch, heartbeat, resource reports |
| node-registry.ts | CRUD over the nodes table, heartbeat updates, stale-node eviction, peer discovery via A2A agent cards |
| transport.ts | Maps swarm directives to A2A messages over JSON-RPC 2.0 |
| directive-handler.ts | Receiving side: processes incoming directives (spawn sub-agents, execute tasks, query resources) |
| remote-kernel.ts | Proxies remote processes into the local OsKernel process tree; aggregates cross-node resources |

Directives

Work is dispatched across the swarm via directives — typed task messages sent from one node to another. Each directive has a unique ID, priority level (low / normal / high / critical), a TTL, and is tracked in the swarm_directives table.
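
As a rough illustration of the fields described above, a directive could be modeled along these lines. The field names (`createdAt`, `ttlMs`), the millisecond unit, and both helper functions are assumptions made for this sketch; the actual contract is the one in packages/infra/contracts/swarm.ts.

```typescript
// Hypothetical shape for illustration only; not the real contract.
type DirectivePriority = 'low' | 'normal' | 'high' | 'critical';

interface SwarmDirectiveSketch {
  id: string;                 // unique directive ID
  priority: DirectivePriority;
  createdAt: number;          // epoch milliseconds (assumed field)
  ttlMs: number;              // time-to-live in milliseconds (assumed unit)
}

// Numeric rank so directives can be ordered by urgency.
const PRIORITY_RANK: Record<DirectivePriority, number> = {
  low: 0,
  normal: 1,
  high: 2,
  critical: 3,
};

// A directive counts as expired once its TTL has fully elapsed.
function isExpired(d: SwarmDirectiveSketch, now: number): boolean {
  return now - d.createdAt >= d.ttlMs;
}

// Drop expired directives, then order the rest most urgent first.
function nextDirectives(queue: SwarmDirectiveSketch[], now: number): SwarmDirectiveSketch[] {
  return queue
    .filter((d) => !isExpired(d, now))
    .sort((a, b) => PRIORITY_RANK[b.priority] - PRIORITY_RANK[a.priority]);
}
```

Passing `now` in explicitly, rather than calling `Date.now()` inside the helpers, keeps the expiry logic easy to test.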

5 Directive Kinds

| Kind | Purpose |
| --- | --- |
| spawn_agent | Spawn a sub-agent on a remote node to perform delegated work |
| execute_task | Execute a shell command or tool invocation on a remote node |
| query_resources | Query a remote node's resource accounting (tokens, CPU, memory, sessions) |
| forward_message | Forward a user message or agent output to a remote node |
| sync_state | Synchronize shared state (memory, skills, configuration) between nodes |
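
The five kinds above map naturally onto a TypeScript string-literal union with an exhaustive `switch` on the receiving side. The sketch below is not the code in directive-handler.ts; `describeDirective` is a hypothetical stand-in that returns a label where a real handler would do the work.

```typescript
// The five directive kinds from the table above.
type DirectiveKind =
  | 'spawn_agent'
  | 'execute_task'
  | 'query_resources'
  | 'forward_message'
  | 'sync_state';

// Hypothetical router: returns a description instead of running a handler.
function describeDirective(kind: DirectiveKind): string {
  switch (kind) {
    case 'spawn_agent':
      return 'spawn a sub-agent';
    case 'execute_task':
      return 'run a shell command or tool invocation';
    case 'query_resources':
      return 'report resource accounting';
    case 'forward_message':
      return 'deliver a forwarded message';
    case 'sync_state':
      return 'synchronize shared state';
    default: {
      // Compile-time exhaustiveness check: adding a sixth kind without a
      // matching case makes this assignment a type error.
      const unreachable: never = kind;
      throw new Error('unknown directive kind: ' + String(unreachable));
    }
  }
}
```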

Directive results include status (completed / failed / cancelled / timed_out), output, error details, and execution metrics (tokens in/out, cost, duration, tool calls).
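
One possible way to model a result and tally outcomes per status is sketched below. The property names inside `metrics` and the `tallyByStatus` helper are assumptions for illustration; only the four status values and the list of metrics come from the description above.

```typescript
type DirectiveStatus = 'completed' | 'failed' | 'cancelled' | 'timed_out';

// Hypothetical result shape; property names are assumed.
interface DirectiveResultSketch {
  directiveId: string;
  status: DirectiveStatus;
  output?: string;
  error?: string;
  metrics: {
    tokensIn: number;
    tokensOut: number;
    costUsd: number;
    durationMs: number;
    toolCalls: number;
  };
}

// Count results per status, e.g. to spot a node that keeps timing out.
function tallyByStatus(results: DirectiveResultSketch[]): Record<DirectiveStatus, number> {
  const tally: Record<DirectiveStatus, number> = {
    completed: 0,
    failed: 0,
    cancelled: 0,
    timed_out: 0,
  };
  for (const r of results) {
    tally[r.status] += 1;
  }
  return tally;
}
```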

Node Lifecycle

Registration

When an instance joins the swarm via cortex swarm init, it registers itself in the nodes table with a name, host, port, capability tier (root / sudo / unprivileged), group, and A2A endpoint. The registration is idempotent — re-registering with the same endpoint updates the existing node record.
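
The idempotency rule can be pictured as an upsert keyed on the A2A endpoint. The sketch below uses an in-memory array as a stand-in for the nodes table; `registerNodeSketch`, the record shape, and the ID scheme are all hypothetical and do not reflect the real node-registry.ts.

```typescript
type CapabilityTier = 'root' | 'sudo' | 'unprivileged';

// Hypothetical record shape mirroring the fields listed above.
interface NodeRecordSketch {
  id: string;
  name: string;
  host: string;
  port: number;
  tier: CapabilityTier;
  group: string;
  a2aEndpoint: string;
}

type NodeRegistration = Omit<NodeRecordSketch, 'id'>;

// Upsert keyed on a2aEndpoint: re-registering updates the row and keeps its ID.
function registerNodeSketch(table: NodeRecordSketch[], input: NodeRegistration): NodeRecordSketch {
  for (let i = 0; i < table.length; i++) {
    const existing = table[i];
    if (existing && existing.a2aEndpoint === input.a2aEndpoint) {
      const updated: NodeRecordSketch = {
        id: existing.id,
        name: input.name,
        host: input.host,
        port: input.port,
        tier: input.tier,
        group: input.group,
        a2aEndpoint: input.a2aEndpoint,
      };
      table[i] = updated;
      return updated;
    }
  }
  const created: NodeRecordSketch = {
    id: 'node_' + (table.length + 1), // assumed ID scheme, for the sketch only
    name: input.name,
    host: input.host,
    port: input.port,
    tier: input.tier,
    group: input.group,
    a2aEndpoint: input.a2aEndpoint,
  };
  table.push(created);
  return created;
}
```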

Peer Discovery

Nodes discover each other through three phases:

  1. Explicit endpoints: passed to discoverPeers() directly
  2. Config seed nodes: the config.swarm.seedNodes array in ~/.cortex/config.json
  3. Database refresh: existing connected nodes in the nodes table, whose agent cards are refreshed via A2A fetchAgentCard()
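
The three phases above amount to merging endpoint lists from three sources in a fixed order. The sketch below assumes that duplicates are collapsed and that trailing slashes are ignored when comparing endpoints; neither behavior is confirmed by the source, and `collectCandidateEndpoints` is a hypothetical helper rather than part of discoverPeers().

```typescript
// Merge endpoints from the three discovery sources, preserving phase order.
function collectCandidateEndpoints(
  explicit: string[],
  seedNodes: string[],
  knownFromDb: string[],
): string[] {
  const seen: Record<string, boolean> = Object.create(null);
  const ordered: string[] = [];
  for (const endpoint of explicit.concat(seedNodes, knownFromDb)) {
    // Assumption: endpoints differing only by trailing slashes are the same peer.
    const key = endpoint.replace(/\/+$/, '');
    if (!seen[key]) {
      seen[key] = true;
      ordered.push(key);
    }
  }
  return ordered;
}
```

Each returned endpoint would then be probed for its agent card; that network step is left out here.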

Heartbeat

Every 30 seconds (HEARTBEAT_INTERVAL_MS = 30_000), each node sends a heartbeat containing:

  • CPU percent, memory (used/total), disk (used/total)
  • Active sessions and processes
  • Tokens used today (in/out), cost USD today
  • Uptime seconds

Heartbeat metrics are written to both the nodes table (current values) and the swarm_resource_snapshots table (a time series that keeps up to 1440 snapshots per node).
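
A heartbeat payload and the bounded snapshot history might look roughly like this. The property names are assumptions, and so is the eviction policy (oldest snapshot dropped first once the cap is reached); only the metric list and the 1440 limit come from the text above.

```typescript
const MAX_SNAPSHOTS_PER_NODE = 1440; // retention limit stated above

// Hypothetical payload; property names and units are assumed.
interface HeartbeatSketch {
  cpuPercent: number;
  memoryUsedMb: number;
  memoryTotalMb: number;
  diskUsedMb: number;
  diskTotalMb: number;
  activeSessions: number;
  activeProcesses: number;
  tokensInToday: number;
  tokensOutToday: number;
  costUsdToday: number;
  uptimeSeconds: number;
}

// Append one item and trim from the front so at most `max` items remain.
function appendSnapshot<T>(history: T[], snapshot: T, max: number = MAX_SNAPSHOTS_PER_NODE): T[] {
  const next = history.concat([snapshot]);
  return next.length > max ? next.slice(next.length - max) : next;
}

// Record a heartbeat into a node's time series.
function recordHeartbeat(history: HeartbeatSketch[], hb: HeartbeatSketch): HeartbeatSketch[] {
  return appendSnapshot(history, hb);
}
```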

Stale Node Detection

Nodes that miss heartbeats for 120 seconds (NODE_STALE_MS) are automatically marked as disconnected by markNodesOffline().
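
The staleness rule can be sketched as a sweep over node records. This is not the real markNodesOffline(); `markStaleNodesSketch`, the record shape, and the choice to treat exactly 120 seconds as stale are assumptions.

```typescript
const NODE_STALE_MS = 120_000; // 120 seconds, as stated above

// Hypothetical liveness record.
interface NodeLivenessSketch {
  id: string;
  status: 'connected' | 'disconnected';
  lastHeartbeatAt: number; // epoch milliseconds
}

// Mark connected nodes whose last heartbeat is too old; return their IDs.
function markStaleNodesSketch(nodes: NodeLivenessSketch[], now: number): string[] {
  const marked: string[] = [];
  for (const node of nodes) {
    if (node.status === 'connected' && now - node.lastHeartbeatAt >= NODE_STALE_MS) {
      node.status = 'disconnected';
      marked.push(node.id);
    }
  }
  return marked;
}
```

With a 30-second heartbeat interval, the 120-second threshold corresponds to four consecutive missed heartbeats.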

Draining & Sealing

  • Drain — sets the node to draining status (mapped to connected in DB). The node stops accepting new directives but completes in-flight work.
  • Seal — sets the node to sealed status (mapped to disconnected in DB). Heartbeat stops; the node gracefully shuts down.
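
The mapping between swarm-level and database-level statuses described above can be written as a small pure function. The type and function names are hypothetical; `acceptsNewDirectives` assumes that only a node in the plain connected state takes new work.

```typescript
type SwarmNodeStatus = 'connected' | 'disconnected' | 'draining' | 'sealed';
type DbNodeStatus = 'connected' | 'disconnected';

// draining is stored as connected; sealed is stored as disconnected.
function toDbStatus(status: SwarmNodeStatus): DbNodeStatus {
  if (status === 'draining') return 'connected';
  if (status === 'sealed') return 'disconnected';
  return status;
}

// Assumption: draining and sealed nodes both refuse new directives.
function acceptsNewDirectives(status: SwarmNodeStatus): boolean {
  return status === 'connected';
}
```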

Resource Aggregation

The swarm aggregates resource usage across all nodes:

  • Per-node metrics: tokens in/out, cost, tool calls, CPU ms, peak memory, active sessions/processes
  • Fleet totals: summed from the swarm_resource_snapshots table (last 24 hours)
  • Resource report (SwarmResourceReport): total nodes, online nodes, aggregate token usage, cost, tool calls, CPU time, peak memory
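
A minimal sketch of the roll-up from per-node metrics to a fleet report follows. The interfaces are hypothetical and are not the real SwarmResourceReport type. Counters are summed across nodes, while peak memory is taken as the maximum across nodes, which is an assumption.

```typescript
// Hypothetical per-node usage record.
interface NodeUsageSketch {
  nodeId: string;
  online: boolean;
  tokensIn: number;
  tokensOut: number;
  costUsd: number;
  toolCalls: number;
  cpuMs: number;
  peakMemoryMb: number;
}

// Hypothetical fleet-level report.
interface FleetReportSketch {
  totalNodes: number;
  onlineNodes: number;
  tokensIn: number;
  tokensOut: number;
  costUsd: number;
  toolCalls: number;
  cpuMs: number;
  peakMemoryMb: number;
}

function buildFleetReportSketch(nodes: NodeUsageSketch[]): FleetReportSketch {
  const report: FleetReportSketch = {
    totalNodes: nodes.length,
    onlineNodes: 0,
    tokensIn: 0,
    tokensOut: 0,
    costUsd: 0,
    toolCalls: 0,
    cpuMs: 0,
    peakMemoryMb: 0,
  };
  for (const n of nodes) {
    if (n.online) report.onlineNodes += 1;
    report.tokensIn += n.tokensIn;
    report.tokensOut += n.tokensOut;
    report.costUsd += n.costUsd;
    report.toolCalls += n.toolCalls;
    report.cpuMs += n.cpuMs;
    // Assumed: fleet peak memory is the largest per-node peak, not a sum.
    report.peakMemoryMb = Math.max(report.peakMemoryMb, n.peakMemoryMb);
  }
  return report;
}
```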

Remote Process Proxying

The remote kernel (remote-kernel.ts) extends the local OsKernel process tree to include remote processes. Remote sub-agents appear in the local process tree display with PIDs in the 900,000+ range. Resource accounting from remote nodes is synced into the local kernel so that token usage, cost, and CPU time reflect fleet-wide activity.

// Register a remote sub-agent process
registerRemoteProcess({
  parentPid: 0,
  agentId: 'explorer_1',
  sessionId: 'sess_abc',
  role: 'agent',
  agentType: 'explorer',
  nodeId: 'node_xyz',
});

// Sync remote resource accounting
syncRemoteResources('node_xyz', [
  { agentId: 'coder_1', toolCalls: 42, tokensIn: 5000, tokensOut: 3000, costUsd: 0.12, cpuMs: 8000, peakMemoryMb: 512 },
]);
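
How PIDs in the 900,000+ range are actually assigned is not documented here. One plausible scheme, shown purely as an assumption, is a counter that starts at 900,000 and hands out a stable PID per (node, agent) pair; `createRemotePidAllocator` and `isRemotePid` are hypothetical names.

```typescript
const REMOTE_PID_BASE = 900_000; // lower bound of the remote PID range

// Hypothetical allocator: the same (nodeId, agentId) pair always gets the same PID.
function createRemotePidAllocator() {
  let nextPid = REMOTE_PID_BASE;
  const byKey: Record<string, number> = Object.create(null);
  return {
    pidFor(nodeId: string, agentId: string): number {
      const key = nodeId + '/' + agentId;
      let pid = byKey[key];
      if (pid === undefined) {
        pid = nextPid;
        nextPid += 1;
        byKey[key] = pid;
      }
      return pid;
    },
  };
}

// Local and remote processes can be told apart by range alone.
function isRemotePid(pid: number): boolean {
  return pid >= REMOTE_PID_BASE;
}
```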

Configuration

Add seed nodes and enable the swarm in ~/.cortex/config.json:

{
  "swarm": {
    "seedNodes": ["http://node2:4220/a2a", "http://node3:4220/a2a"],
    "group": "production",
    "enabled": true
  }
}
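
When reading this block from code, a defensive parse helps guard against missing or malformed fields. The sketch below is not the loader CortexPrism uses. `parseSwarmConfig`, the fallback values (`enabled: false`, group `'default'`, no seeds), and the http/https filter on seed nodes are all assumptions.

```typescript
interface SwarmConfigSketch {
  enabled: boolean;
  group: string;
  seedNodes: string[];
}

// Parse the "swarm" section of a config file's JSON text, applying assumed defaults.
function parseSwarmConfig(json: string): SwarmConfigSketch {
  const raw = JSON.parse(json);
  const swarm = raw && typeof raw === 'object' && raw.swarm ? raw.swarm : {};
  const candidates: unknown[] = Array.isArray(swarm.seedNodes) ? swarm.seedNodes : [];
  const seedNodes: string[] = [];
  for (const s of candidates) {
    // Keep only strings that look like http(s) endpoints.
    if (typeof s === 'string' && /^https?:\/\//.test(s)) {
      seedNodes.push(s);
    }
  }
  return {
    enabled: swarm.enabled === true,
    group: typeof swarm.group === 'string' ? swarm.group : 'default',
    seedNodes: seedNodes,
  };
}
```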

CLI Reference

cortex swarm              # Show swarm overview and available sub-commands
cortex swarm init         # Register this instance as a swarm node
cortex swarm nodes        # List all registered swarm nodes with status and metrics
cortex swarm topology     # Show process tree and token usage across all nodes
cortex swarm report       # Aggregated fleet resource report (tokens, cost, CPU, memory)
cortex swarm drain        # Stop accepting new directives (complete in-flight work)
cortex swarm seal         # Graceful shutdown — stop heartbeat, stop accepting work

cortex swarm init

cortex swarm init --name my-node --host 192.168.1.10 --port 4220 --group production --tier sudo

Prompts for a node name if --name is omitted. Registers the A2A server handler and starts the heartbeat loop.

cortex swarm nodes

Displays each node's name, ID, status (color-coded: green=connected, red=disconnected, yellow=other), host:port, tier, group, active sessions, processes, memory usage, and last heartbeat timestamp.

cortex swarm report

Shows fleet-level aggregates: online/total nodes, total tokens in/out, total cost, total tool calls, total CPU ms, peak memory, and per-node breakdowns.

REST API

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/swarm/topology | Process tree and token usage across all nodes |
| GET | /api/swarm/report | Aggregated resource report (tokens, cost, CPU, memory) |
| GET | /api/swarm/directives | Directive history (filterable by ?status= and ?limit=) |
| GET | /api/swarm/nodes/metrics | Raw metrics for all connected nodes |
| GET | /api/swarm/nodes/:id/snapshots | Time-series resource snapshots for a specific node |
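
As an example of the directive-history filters, the helper below builds the request URL. The base URL used in the comment is a placeholder, since the host and port of the REST server are not given here, and `directivesUrl` is a hypothetical helper. The response format is not documented in this page, so the sketch stops at building the URL.

```typescript
interface DirectiveQuery {
  status?: string; // e.g. 'failed' or 'timed_out'
  limit?: number;
}

// Build the /api/swarm/directives URL with optional ?status= and ?limit= filters.
function directivesUrl(baseUrl: string, query: DirectiveQuery): string {
  const parts: string[] = [];
  if (query.status !== undefined) {
    parts.push('status=' + encodeURIComponent(query.status));
  }
  if (query.limit !== undefined) {
    parts.push('limit=' + String(query.limit));
  }
  const path = baseUrl.replace(/\/+$/, '') + '/api/swarm/directives';
  return parts.length > 0 ? path + '?' + parts.join('&') : path;
}

// Usage (placeholder base URL):
//   const res = await fetch(directivesUrl('http://localhost:3000', { status: 'failed', limit: 20 }));
```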

Database Tables

| Table | Purpose |
| --- | --- |
| nodes | Node registry (shared with hub/node system), extended with a2a_endpoint, labels, metrics_json, cpu_percent, memory_used_mb, memory_total_mb |
| swarm_directives | Directive audit trail: ID, kind, source/target nodes, payload, priority, status, metrics |
| swarm_resource_snapshots | Per-node time-series metrics retained for 1440 snapshots per node |
