Swarm

Distributed agent coordination across multiple CortexPrism instances. The swarm layer allows Cortex instances to form a fleet, discover each other, dispatch work directives, and aggregate resource usage — all over the A2A protocol.

Architecture

┌──────────────────────────────────────────────┐
│                  Swarm Coordinator           │
│  registerSelf · discoverPeers · dispatch     │
│  broadcast · getResourceReport · heartbeat   │
├──────────────────────────────────────────────┤
│              A2A Transport Layer             │
│  connect · disconnect · sendDirective        │
│  ping · fetchRemoteAgentCard                 │
├──────────────────────────────────────────────┤
│          A2A Protocol (JSON-RPC 2.0)         │
└──────────────────────────────────────────────┘

The swarm layer lives in packages/infra/src/swarm/, sits above the A2A transport layer, and provides the primary API for fleet operations. Its contract is defined in packages/infra/contracts/swarm.ts, and it is implemented across five modules:

File	Purpose
`coordinator.ts`	`swarm` singleton: self-registration, peer discovery, dispatch, heartbeat, resource reports
`node-registry.ts`	CRUD over the `nodes` table, heartbeat updates, stale-node eviction, peer discovery via A2A agent cards
`transport.ts`	Maps swarm directives to A2A messages over JSON-RPC 2.0
`directive-handler.ts`	Receiving side: processes incoming directives (spawn sub-agents, execute tasks, query resources)
`remote-kernel.ts`	Proxies remote processes into the local `OsKernel` process tree; aggregates cross-node resources

Directives

Work is dispatched across the swarm via directives: typed task messages sent from one node to another. Each directive has a unique ID, a priority level (low / normal / high / critical), and a TTL, and is tracked in the swarm_directives table.

5 Directive Kinds

Kind	Purpose
`spawn_agent`	Spawn a sub-agent on a remote node to perform delegated work
`execute_task`	Execute a shell command or tool invocation on a remote node
`query_resources`	Query a remote node's resource accounting (tokens, CPU, memory, sessions)
`forward_message`	Forward a user message or agent output to a remote node
`sync_state`	Synchronize shared state (memory, skills, configuration) between nodes

Directive results include status (completed / failed / cancelled / timed_out), output, error details, and execution metrics (tokens in/out, cost, duration, tool calls).
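The directive and result shapes described above could be modelled roughly as follows. This is a sketch only: the kind, priority, and status values come from this page, but every field name, the millisecond TTL unit, and the helpers `isExpired` and `byPriority` are hypothetical and are not taken from packages/infra/contracts/swarm.ts.

```typescript
// Illustrative types only; field names are assumptions, not the real contract.
type DirectiveKind =
  | 'spawn_agent'
  | 'execute_task'
  | 'query_resources'
  | 'forward_message'
  | 'sync_state';
type DirectivePriority = 'low' | 'normal' | 'high' | 'critical';
type DirectiveStatus = 'completed' | 'failed' | 'cancelled' | 'timed_out';

interface Directive {
  id: string;
  kind: DirectiveKind;
  priority: DirectivePriority;
  ttlMs: number; // assumed unit: milliseconds
  createdAt: number; // assumed: epoch milliseconds
  sourceNodeId: string;
  targetNodeId: string;
  payload: unknown;
}

interface DirectiveResult {
  directiveId: string;
  status: DirectiveStatus;
  output?: string;
  error?: string;
  metrics: {
    tokensIn: number;
    tokensOut: number;
    costUsd: number;
    durationMs: number;
    toolCalls: number;
  };
}

// A directive counts as expired once its TTL has elapsed since creation.
function isExpired(d: Directive, now: number): boolean {
  return now - d.createdAt >= d.ttlMs;
}

// Comparator that orders higher-priority directives first.
const PRIORITY_RANK: Record<DirectivePriority, number> = {
  critical: 0,
  high: 1,
  normal: 2,
  low: 3,
};
function byPriority(a: Directive, b: Directive): number {
  return PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority];
}
```

A receiving node could, for example, drop expired directives with `isExpired` and sort the remainder with `byPriority` before handling them; whether the real handler does this is not stated here.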

Node Lifecycle

Registration

When an instance joins the swarm via cortex swarm init, it registers itself in the nodes table with a name, host, port, capability tier (root / sudo / unprivileged), group, and A2A endpoint. The registration is idempotent — re-registering with the same endpoint updates the existing node record.
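The idempotent behaviour can be pictured as an upsert keyed on the A2A endpoint. The sketch below uses an in-memory map as a stand-in for the nodes table; the class name, record fields, and ID format are hypothetical.

```typescript
type Tier = 'root' | 'sudo' | 'unprivileged';

interface NodeRecord {
  id: string;
  name: string;
  host: string;
  port: number;
  tier: Tier;
  group: string;
  a2aEndpoint: string;
}

// In-memory stand-in for the nodes table, keyed by A2A endpoint (hypothetical).
class InMemoryNodeRegistry {
  private byEndpoint = new Map<string, NodeRecord>();
  private nextId = 1;

  // Re-registering the same endpoint updates the record and keeps its ID.
  register(input: Omit<NodeRecord, 'id'>): NodeRecord {
    const existing = this.byEndpoint.get(input.a2aEndpoint);
    const id = existing ? existing.id : `node_${this.nextId++}`;
    const record: NodeRecord = { ...input, id };
    this.byEndpoint.set(input.a2aEndpoint, record);
    return record;
  }

  size(): number {
    return this.byEndpoint.size;
  }
}
```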

Peer Discovery

Nodes discover each other through three phases:

Explicit endpoints — passed to discoverPeers() directly
Config seed nodes — config.swarm.seedNodes array in ~/.cortex/config.json
Database refresh — existing connected nodes in the nodes table whose agent cards are refreshed via A2A fetchAgentCard()
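One way to combine the three sources above is to merge them into a single ordered, de-duplicated list of endpoints before fetching agent cards. The function below is a hypothetical helper, not part of the documented API; skipping the node's own endpoint is an assumption.

```typescript
// Merge candidate endpoints from the three discovery sources, preserving order
// (explicit first, then seed nodes, then known nodes) and removing duplicates.
// Excluding the local node's own endpoint is an assumption.
function collectCandidateEndpoints(
  explicit: string[],
  seedNodes: string[],
  knownConnected: string[],
  selfEndpoint: string,
): string[] {
  const seen = new Set<string>();
  seen.add(selfEndpoint);
  const ordered: string[] = [];
  for (const endpoint of explicit.concat(seedNodes, knownConnected)) {
    if (!seen.has(endpoint)) {
      seen.add(endpoint);
      ordered.push(endpoint);
    }
  }
  return ordered;
}
```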

Heartbeat

Every 30 seconds (HEARTBEAT_INTERVAL_MS = 30_000), each node sends a heartbeat containing:

CPU percent, memory (used/total), disk (used/total)
Active sessions and processes
Tokens used today (in/out), cost USD today
Uptime seconds

Heartbeat metrics are written to both the nodes table (current values) and the swarm_resource_snapshots table (time-series, capped at 1440 snapshots per node).
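The heartbeat payload and the per-node snapshot cap might look like the sketch below. The interval and the 1440 figure come from the text; the field names, the constant `MAX_SNAPSHOTS_PER_NODE`, and the helper `appendSnapshot` are hypothetical, and the units (MB, seconds) are assumed.

```typescript
const HEARTBEAT_INTERVAL_MS = 30_000;
const MAX_SNAPSHOTS_PER_NODE = 1440; // cap from the text; constant name is hypothetical

// Hypothetical payload shape covering the metrics listed above.
interface HeartbeatMetrics {
  cpuPercent: number;
  memoryUsedMb: number;
  memoryTotalMb: number;
  diskUsedMb: number;
  diskTotalMb: number;
  activeSessions: number;
  activeProcesses: number;
  tokensInToday: number;
  tokensOutToday: number;
  costUsdToday: number;
  uptimeSeconds: number;
}

// Append a snapshot and keep only the most recent `cap` entries.
function appendSnapshot<T>(history: T[], snapshot: T, cap: number = MAX_SNAPSHOTS_PER_NODE): T[] {
  const next = history.concat([snapshot]);
  return next.length > cap ? next.slice(next.length - cap) : next;
}
```

If one snapshot is written per heartbeat (an assumption), the cap corresponds to 1440 × 30 s = 43,200 s, or 12 hours of history per node.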

Stale Node Detection

Nodes that miss heartbeats for 120 seconds (NODE_STALE_MS) are automatically marked as disconnected by markNodesOffline().
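The staleness check amounts to comparing each node's last heartbeat against the threshold. The following is a minimal sketch of that comparison, not the actual markNodesOffline() implementation; the `TrackedNode` shape and the function name `markStaleNodes` are hypothetical, and treating exactly 120 seconds as stale is an assumption.

```typescript
const NODE_STALE_MS = 120_000;

interface TrackedNode {
  id: string;
  status: 'connected' | 'disconnected';
  lastHeartbeatAt: number; // epoch milliseconds (assumed)
}

// Mark connected nodes whose last heartbeat is at least NODE_STALE_MS old
// as disconnected, and return the IDs that were changed.
function markStaleNodes(nodes: TrackedNode[], now: number): string[] {
  const evicted: string[] = [];
  for (const node of nodes) {
    if (node.status === 'connected' && now - node.lastHeartbeatAt >= NODE_STALE_MS) {
      node.status = 'disconnected';
      evicted.push(node.id);
    }
  }
  return evicted;
}
```

With a 30-second heartbeat interval, a 120-second threshold tolerates roughly three consecutive missed heartbeats before a node is marked offline.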

Draining & Sealing

Drain — sets the node to draining status (mapped to connected in DB). The node stops accepting new directives but completes in-flight work.
Seal — sets the node to sealed status (mapped to disconnected in DB). Heartbeat stops; the node gracefully shuts down.
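The mapping between swarm-level statuses and the two database statuses can be written as a small function. The type and function names below are hypothetical; only the mapping itself (draining to connected, sealed to disconnected) is taken from the text.

```typescript
type SwarmStatus = 'connected' | 'draining' | 'sealed' | 'disconnected';
type DbStatus = 'connected' | 'disconnected';

// draining is stored as connected; sealed is stored as disconnected.
function toDbStatus(status: SwarmStatus): DbStatus {
  return status === 'connected' || status === 'draining' ? 'connected' : 'disconnected';
}

// Only a fully connected node takes on new directives; a draining node
// finishes in-flight work but accepts nothing new.
function acceptsNewDirectives(status: SwarmStatus): boolean {
  return status === 'connected';
}
```

Because draining is persisted as connected, the database status alone cannot distinguish a draining node from an active one; the sketch keeps that distinction in the swarm-level status.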

Resource Aggregation

The swarm aggregates resource usage across all nodes:

Per-node metrics: tokens in/out, cost, tool calls, CPU ms, peak memory, active sessions/processes
Fleet totals: summed from the swarm_resource_snapshots table (last 24 hours)
Resource report (SwarmResourceReport): total nodes, online nodes, aggregate token usage, cost, tool calls, CPU time, peak memory
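A fleet report of this kind can be derived by folding per-node usage into totals. The sketch below is illustrative: `NodeUsage`, `FleetReport`, and `buildFleetReport` are hypothetical names rather than the real SwarmResourceReport type, and taking fleet peak memory as the maximum across nodes (rather than a sum) is an assumption.

```typescript
interface NodeUsage {
  nodeId: string;
  online: boolean;
  tokensIn: number;
  tokensOut: number;
  costUsd: number;
  toolCalls: number;
  cpuMs: number;
  peakMemoryMb: number;
}

interface FleetReport {
  totalNodes: number;
  onlineNodes: number;
  tokensIn: number;
  tokensOut: number;
  costUsd: number;
  toolCalls: number;
  cpuMs: number;
  peakMemoryMb: number;
}

// Sum additive metrics across nodes; peak memory is the largest per-node peak.
function buildFleetReport(nodes: NodeUsage[]): FleetReport {
  const report: FleetReport = {
    totalNodes: nodes.length,
    onlineNodes: 0,
    tokensIn: 0,
    tokensOut: 0,
    costUsd: 0,
    toolCalls: 0,
    cpuMs: 0,
    peakMemoryMb: 0,
  };
  for (const n of nodes) {
    if (n.online) report.onlineNodes += 1;
    report.tokensIn += n.tokensIn;
    report.tokensOut += n.tokensOut;
    report.costUsd += n.costUsd;
    report.toolCalls += n.toolCalls;
    report.cpuMs += n.cpuMs;
    report.peakMemoryMb = Math.max(report.peakMemoryMb, n.peakMemoryMb);
  }
  return report;
}
```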

Remote Process Proxying

The remote kernel (remote-kernel.ts) extends the local OsKernel process tree to include remote processes. Remote sub-agents appear in the local process tree display with PIDs in the 900,000+ range. Resource accounting from remote nodes is synced into the local kernel so that token usage, cost, and CPU time reflect fleet-wide activity.

// Register a remote sub-agent process
registerRemoteProcess({
  parentPid: 0,
  agentId: 'explorer_1',
  sessionId: 'sess_abc',
  role: 'agent',
  agentType: 'explorer',
  nodeId: 'node_xyz',
});

// Sync remote resource accounting
syncRemoteResources('node_xyz', [
  { agentId: 'coder_1', toolCalls: 42, tokensIn: 5000, tokensOut: 3000, costUsd: 0.12, cpuMs: 8000, peakMemoryMb: 512 },
]);
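To illustrate the reserved PID range, remote processes could be assigned PIDs from a counter that starts at 900,000, which keeps them visibly separate from local PIDs. The allocator below is a hypothetical sketch; how remote-kernel.ts actually assigns PIDs is not described here.

```typescript
const REMOTE_PID_BASE = 900_000; // start of the remote range per the text; name is hypothetical

// Hands out sequential PIDs in the remote range.
class RemotePidAllocator {
  private next = REMOTE_PID_BASE;

  allocate(): number {
    return this.next++;
  }
}

// True when a PID falls in the range reserved for proxied remote processes.
function isRemotePid(pid: number): boolean {
  return pid >= REMOTE_PID_BASE;
}
```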

Configuration

Add seed nodes and enable the swarm in ~/.cortex/config.json:

{
  "swarm": {
    "seedNodes": ["http://node2:4220/a2a", "http://node3:4220/a2a"],
    "group": "production",
    "enabled": true
  }
}
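A loader for this section of the config might validate the three keys before use. The sketch below is hypothetical: the function name, the fallback group name `default`, the choice to treat a missing `enabled` as false, and the filtering of non-HTTP seed entries are all assumptions rather than documented behaviour.

```typescript
interface SwarmConfig {
  enabled: boolean;
  group: string;
  seedNodes: string[];
}

// Parse the raw JSON text of a config file and extract a validated swarm section.
function parseSwarmConfig(raw: string): SwarmConfig {
  const parsed = JSON.parse(raw) as { swarm?: Partial<SwarmConfig> };
  const swarm: Partial<SwarmConfig> = parsed.swarm ?? {};
  const seedNodes = Array.isArray(swarm.seedNodes)
    ? swarm.seedNodes.filter((s) => typeof s === 'string' && /^https?:\/\//.test(s))
    : [];
  return {
    enabled: swarm.enabled === true, // assumed default: disabled
    group: typeof swarm.group === 'string' ? swarm.group : 'default', // assumed fallback
    seedNodes,
  };
}
```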

CLI Reference

cortex swarm              # Show swarm overview and available sub-commands
cortex swarm init         # Register this instance as a swarm node
cortex swarm nodes        # List all registered swarm nodes with status and metrics
cortex swarm topology     # Show process tree and token usage across all nodes
cortex swarm report       # Aggregated fleet resource report (tokens, cost, CPU, memory)
cortex swarm drain        # Stop accepting new directives (complete in-flight work)
cortex swarm seal         # Graceful shutdown — stop heartbeat, stop accepting work

`cortex swarm init`

cortex swarm init --name my-node --host 192.168.1.10 --port 4220 --group production --tier sudo

Prompts for a node name if --name is omitted. Registers the A2A server handler and starts the heartbeat loop.

`cortex swarm nodes`

Displays each node's name, ID, status (color-coded: green=connected, red=disconnected, yellow=other), host:port, tier, group, active sessions, processes, memory usage, and last heartbeat timestamp.

`cortex swarm report`

Shows fleet-level aggregates: online/total nodes, total tokens in/out, total cost, total tool calls, total CPU ms, peak memory, and per-node breakdowns.

REST API

Method	Path	Description
`GET`	`/api/swarm/topology`	Process tree and token usage across all nodes
`GET`	`/api/swarm/report`	Aggregated resource report (tokens, cost, CPU, memory)
`GET`	`/api/swarm/directives`	Directive history (filterable by `?status=` and `?limit=`)
`GET`	`/api/swarm/nodes/metrics`	Raw metrics for all connected nodes
`GET`	`/api/swarm/nodes/:id/snapshots`	Time-series resource snapshots for a specific node
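Client code can build the request paths for these endpoints with small helpers like the ones below. The helper names are hypothetical, and the base URL (host and port of the instance serving the API) is not specified in this section, so it is left to the caller.

```typescript
// Build the directive-history path with the optional ?status= and ?limit= filters.
function directivesPath(opts: { status?: string; limit?: number } = {}): string {
  const params: string[] = [];
  if (opts.status !== undefined) params.push(`status=${encodeURIComponent(opts.status)}`);
  if (opts.limit !== undefined) params.push(`limit=${opts.limit}`);
  return params.length > 0
    ? `/api/swarm/directives?${params.join('&')}`
    : '/api/swarm/directives';
}

// Build the time-series snapshot path for one node.
function snapshotsPath(nodeId: string): string {
  return `/api/swarm/nodes/${encodeURIComponent(nodeId)}/snapshots`;
}
```

For example, `directivesPath({ status: 'failed', limit: 20 })` yields `/api/swarm/directives?status=failed&limit=20`, which can be appended to the instance's base URL and requested with any HTTP client.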

Database Tables

Table	Purpose
`nodes`	Node registry (shared with hub/node system), extended with `a2a_endpoint`, `labels`, `metrics_json`, `cpu_percent`, `memory_used_mb`, `memory_total_mb`
`swarm_directives`	Directive audit trail: ID, kind, source/target nodes, payload, priority, status, metrics
`swarm_resource_snapshots`	Per-node time-series metrics, capped at 1440 snapshots per node
