23 changes: 22 additions & 1 deletion plugins/smelt-agent/AGENTS.md
@@ -8,7 +8,8 @@ You are a smelt worker agent executing Assay runs. Your job is to receive a run
| --- | --- |
| `/assay:run-dispatch` | Dispatch a single or multi-session run from a manifest |
| `/assay:backend-status` | Query orchestrator status and interpret results |
| `/assay:peer-message` | Send and receive messages between sessions (mesh/gossip) |
| `/assay:peer-message` | Send and receive messages between sessions (mesh/gossip/signal) |
| `/assay:peer-registry` | Peer discovery, registration, and cross-instance signal forwarding |

## MCP Tools

@@ -23,6 +24,9 @@ You are a smelt worker agent executing Assay runs. Your job is to receive a run
| `cycle_status` | Get active milestone progress |
| `cycle_advance` | Advance the active chunk |
| `chunk_status` | Get gate results for a specific chunk |
| `poll_signals` | Read `PeerUpdate` messages from a session's signal inbox |
| `send_signal` | POST a `SignalRequest` to any signal endpoint URL |
| `merge_propose` | Push branch and create a GitHub PR with gate evidence |

## Workflow

@@ -41,5 +45,22 @@ Not all backends support every feature. Check the `CapabilitySet` before relying
- `supports_gossip_manifest: false` → gossip knowledge manifest may not persist between rounds
- `supports_annotations: false` → run annotations are not stored
- `supports_checkpoints: false` → team checkpoints are not persisted
- `supports_signals: false` → signal endpoint events are not pushed to the backend
- `supports_peer_registry: false` → peer registration/discovery is not available; cross-instance forwarding is disabled

Capability-limited runs degrade gracefully — they are not failures.

### Cross-Instance Signal Forwarding

When the signal endpoint receives a `POST /api/v1/signal` for an unknown local session, it queries the peer registry (`list_peers()`) and forwards the request to known peers. The first peer to return `202 Accepted` wins. An `X-Assay-Forwarded: true` header prevents forwarding loops — forwarded requests that miss locally return `404` immediately.

**Environment variables for the signal endpoint:**

| Variable | Default | Description |
| --- | --- | --- |
| `ASSAY_SIGNAL_PORT` | `7432` | Port for the HTTP signal listener |
| `ASSAY_SIGNAL_BIND` | `127.0.0.1` | Bind address (`0.0.0.0` for all interfaces) |
| `ASSAY_SIGNAL_URL` | _(derived)_ | Override the peer-registered URL — required when `ASSAY_SIGNAL_BIND=0.0.0.0` to provide a routable address |
| `ASSAY_SIGNAL_TOKEN` | _(none)_ | Optional bearer token for auth |

On startup, the MCP server registers itself as a peer in the state backend. On clean shutdown, it unregisters.
38 changes: 38 additions & 0 deletions plugins/smelt-agent/skills/peer-message.md
@@ -61,3 +61,41 @@ In gossip mode, there is no direct messaging between sessions. Instead, a coordi
### Capability Guard

9. **Check `supports_gossip_manifest` before relying on the manifest.** If the backend has `supports_gossip_manifest: false`, the knowledge manifest may not persist between coordinator rounds. Check `gossip_status.sessions_synthesized` via `orchestrate_status` — if it stays at zero despite sessions completing, manifest persistence is disabled.

## Signal-Based Messaging (Cross-Instance)

For multi-machine deployments, sessions communicate via the HTTP signal endpoint instead of filesystem-based mesh routing.

### Receiving Signals

10. **Use the `poll_signals` MCP tool** to read `PeerUpdate` messages from your session's signal inbox:
```json
{ "session_name": "worker-1" }
```
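Returns a `PollSignalsResult` with a `signals` array of `PeerUpdate` objects. Messages are removed from the inbox on read, but delivery is best-effort and at-least-once: callers should tolerate occasional duplicates, and malformed messages may be consumed but skipped.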

Comment on lines +75 to +76
Copilot AI Mar 29, 2026

This says signals are consumed on read with “exactly-once delivery”, but the underlying inbox polling is best-effort and can deliver duplicates if file deletion fails (it logs a warning: “may be delivered twice”). Also, poll_signals can consume-but-skip malformed messages during JSON decode. Please soften this claim (e.g., at-least-once / best-effort) to match actual semantics.

Suggested change
Returns a `PollSignalsResult` with a `signals` array of `PeerUpdate` objects. Messages are consumed on read (exactly-once delivery).
Returns a `PollSignalsResult` with a `signals` array of `PeerUpdate` objects. Messages are removed from the inbox on a best-effort, at-least-once basis; callers should tolerate occasional duplicates and understand that malformed messages may be consumed but skipped.

### Sending Signals

11. **Use the `send_signal` MCP tool** to POST a signal to any Assay signal endpoint:
```json
{
  "url": "http://peer-host:7432/api/v1/signal",
  "target_session": "orchestrator",
  "update": {
    "source_job": "job-abc",
    "source_session": "worker-1",
    "changed_files": ["src/main.rs"],
    "gate_summary": { "passed": 5, "failed": 0, "skipped": 1 },
    "branch": "feature/auth"
  }
}
```
Returns the HTTP status code and response body. Non-2xx responses are returned as the tool result (not a tool-level error) so the agent can decide how to proceed.
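
Because non-2xx responses come back as an ordinary tool result, the caller has to branch on the status code itself. The following is a minimal sketch of that branching, assuming only the two status codes this document names (`202 Accepted` and `404`). `SignalOutcome` and `classify_signal_status` are hypothetical names for illustration, not part of the Assay API.

```rust
/// Hypothetical classification of a `send_signal` result (illustration only).
#[derive(Debug, PartialEq, Eq)]
pub enum SignalOutcome {
    /// 202 Accepted: a local or forwarded session took the signal.
    Delivered,
    /// 404: no reachable instance knows the target session.
    UnknownSession,
    /// Any other status; the caller decides whether to retry or give up.
    Unexpected(u16),
}

/// Map the HTTP status returned by `send_signal` to an outcome.
pub fn classify_signal_status(status: u16) -> SignalOutcome {
    match status {
        202 => SignalOutcome::Delivered,
        404 => SignalOutcome::UnknownSession,
        other => SignalOutcome::Unexpected(other),
    }
}
```

An agent might retry on `Unexpected` and report `UnknownSession` back to its orchestrator; that policy is a design choice left to the caller.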

### Cross-Instance Forwarding

12. **Signals for unknown local sessions are forwarded automatically.** When the signal endpoint receives a request for a session not registered locally, it queries the peer registry and forwards to known peers. The first peer to return `202 Accepted` wins. An `X-Assay-Forwarded: true` header prevents forwarding loops.

### Capability Guard

13. **Check `supports_signals` and `supports_peer_registry`** to determine if signal-based messaging and cross-instance forwarding are available. `SmeltBackend` supports signals but not peer registry (`supports_peer_registry: false` — register_peer is fire-and-forget, forwarding uses Smelt's server-side routing); `LocalFsBackend` supports peer registry but not signal push; `NoopBackend` supports neither.
123 changes: 123 additions & 0 deletions plugins/smelt-agent/skills/peer-registry.md
@@ -0,0 +1,123 @@
---
name: peer-registry
description: >
Peer discovery, registration, and cross-instance signal forwarding.
Use when configuring multi-machine deployments where multiple Assay
instances need to discover each other and forward signals across hosts.
---

# Peer Registry

Register, discover, and forward signals between Assay instances running on different machines.

## Overview

Each Assay MCP server can register itself as a **peer** in the state backend. Other instances query the peer registry to discover where to forward signals for sessions they don't own locally. This enables multi-machine orchestration without a central message broker.

```
┌──────────────┐         ┌──────────────┐
│  Machine A   │         │  Machine B   │
│  assay-mcp   │◄───────►│  assay-mcp   │
│  :7432       │  HTTP   │  :7432       │
│              │ forward │              │
│  worker-1    │         │  worker-2    │
│  worker-3    │         │  orchestrator│
└──────────────┘         └──────────────┘
       │                        │
       └────────┬───────────────┘
    peers.json (or Smelt API)

## PeerInfo Type

Each registered peer is a `PeerInfo` record:

```json
{
  "peer_id": "machine-a",
  "signal_url": "http://192.168.1.10:7432",
  "registered_at": "2026-03-29T12:00:00Z"
}
```

| Field | Type | Description |
| --- | --- | --- |
| `peer_id` | `String` | Unique identifier (typically hostname or UUID) |
| `signal_url` | `String` | HTTP endpoint for the signal server |
| `registered_at` | `DateTime<Utc>` | When this peer was registered |

## Backend Methods

The `StateBackend` trait provides three peer registry methods with default no-op implementations:

| Method | Description |
| --- | --- |
| `register_peer(peer: &PeerInfo)` | Upsert a peer entry (by `peer_id`) |
| `list_peers()` | Return all registered peers |
| `unregister_peer(peer_id: &str)` | Remove a peer entry (idempotent) |
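
To make the "default no-op" contract concrete, here is a simplified sketch of how such a trait could look. It is not the real `StateBackend` definition: the `&mut self` receivers, the `Result<_, String>` return types, the string timestamp, and the `InMemoryBackend` type are all assumptions chosen to keep the example free of external crates.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for `PeerInfo`; the timestamp is a plain string here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PeerInfo {
    pub peer_id: String,
    pub signal_url: String,
    pub registered_at: String,
}

/// Hypothetical reduction of `StateBackend` to its three peer-registry methods.
pub trait StateBackend {
    /// Default: accept and discard the registration.
    fn register_peer(&mut self, _peer: &PeerInfo) -> Result<(), String> {
        Ok(())
    }
    /// Default: no peers are known.
    fn list_peers(&self) -> Result<Vec<PeerInfo>, String> {
        Ok(Vec::new())
    }
    /// Default: nothing to remove.
    fn unregister_peer(&mut self, _peer_id: &str) -> Result<(), String> {
        Ok(())
    }
}

/// A backend that relies entirely on the defaults.
pub struct NoopBackend;
impl StateBackend for NoopBackend {}

/// Illustrative backend showing upsert-by-`peer_id` and idempotent removal.
#[derive(Default)]
pub struct InMemoryBackend {
    peers: BTreeMap<String, PeerInfo>,
}

impl StateBackend for InMemoryBackend {
    fn register_peer(&mut self, peer: &PeerInfo) -> Result<(), String> {
        // Inserting under an existing key replaces the old entry (upsert).
        self.peers.insert(peer.peer_id.clone(), peer.clone());
        Ok(())
    }
    fn list_peers(&self) -> Result<Vec<PeerInfo>, String> {
        Ok(self.peers.values().cloned().collect())
    }
    fn unregister_peer(&mut self, peer_id: &str) -> Result<(), String> {
        // Removing a missing key is not an error, so the call is idempotent.
        self.peers.remove(peer_id);
        Ok(())
    }
}
```

Backends that leave the defaults in place should also report `supports_peer_registry: false`, so callers do not mistake an empty peer list for "no peers registered".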

### LocalFsBackend

Stores peers in `{assay_dir}/peers.json`. Writes are atomic (temp file + rename). Suitable for single-machine multi-process setups where all Assay instances share the same `.assay/` directory.
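
The temp-file-plus-rename pattern mentioned above can be sketched with the standard library alone. This is an illustration of the general technique, not the code `LocalFsBackend` uses: the function name `write_atomic` and the `.tmp` suffix are assumptions.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Write `contents` to `path` by writing a sibling temp file and renaming it.
/// Readers see either the old file or the new one, never a partial write.
pub fn write_atomic(path: &Path, contents: &[u8]) -> io::Result<()> {
    // Assumed naming scheme: "<target>.tmp" in the same directory, so the
    // rename stays on one filesystem.
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    {
        let mut file = fs::File::create(&tmp_path)?;
        file.write_all(contents)?;
        // Flush data to disk before the rename makes it visible.
        file.sync_all()?;
    }

    // Replaces the target if it already exists.
    fs::rename(&tmp_path, path)
}
```

A fixed temp name is fine for a single writer; with several processes writing concurrently, a unique temp name per writer (or a lock) would be needed to avoid one writer clobbering another's temp file.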

### SmeltBackend

Registers peers by POSTing `PeerInfo` JSON to `{smelt_url}/api/v1/peers`. Graceful degradation — registration failure logs a warning but does not abort startup. `list_peers` and `unregister_peer` use the default no-op implementations (Smelt manages peer lifecycle server-side).

### Other Backends

`NoopBackend`, `LinearBackend`, `GitHubBackend`, and `SshSyncBackend` all return `supports_peer_registry: false` and use the default no-op implementations.

## Automatic Lifecycle

The MCP server manages peer registration automatically:

1. **On startup** — after the signal endpoint binds, the server calls `register_peer` with its hostname and `signal_url`. If `ASSAY_SIGNAL_URL` is set, that value is used directly; otherwise `signal_url` is derived from `ASSAY_SIGNAL_BIND` and `ASSAY_SIGNAL_PORT`.
Copilot AI Mar 29, 2026

This step says the server derives signal_url from ASSAY_SIGNAL_BIND and ASSAY_SIGNAL_PORT, but the server code uses ASSAY_SIGNAL_URL as an override (and warns that 0.0.0.0 is unroutable for peer registration). The docs should mention ASSAY_SIGNAL_URL and recommend setting it in multi-machine deployments to a reachable host/IP.

Suggested change
1. **On startup** — after the signal endpoint binds, the server calls `register_peer` with its hostname and `signal_url` derived from `ASSAY_SIGNAL_BIND` and `ASSAY_SIGNAL_PORT`.
1. **On startup** — after the signal endpoint binds, the server calls `register_peer` with its hostname and `signal_url`. If `ASSAY_SIGNAL_URL` is set, that value is used directly. Otherwise, `signal_url` is derived from `ASSAY_SIGNAL_BIND` and `ASSAY_SIGNAL_PORT` (note that `0.0.0.0` is not routable from other machines, so in multi-machine deployments you should set `ASSAY_SIGNAL_URL` to a reachable host or IP).

2. **On clean shutdown** — the server calls `unregister_peer` to remove itself.

No manual registration is needed for normal operation.
Comment on lines +65 to +78
Copilot AI Mar 29, 2026

The docs state that SmeltBackend supports peer registry / cross-instance forwarding, but the code currently reports supports_peer_registry: false for SmeltBackend (and does not implement list_peers/unregister_peer). As written, using SmeltBackend will disable forwarding because the signal server gates forwarding on capabilities().supports_peer_registry. Please update the docs to match current behavior, or (if the intended behavior is support) update the backend capability flag and implementations in code.

Suggested change
Registers peers by POSTing `PeerInfo` JSON to `{smelt_url}/api/v1/peers`. Graceful degradation — registration failure logs a warning but does not abort startup. `list_peers` and `unregister_peer` use the default no-op implementations (Smelt manages peer lifecycle server-side).
### Other Backends
`NoopBackend`, `LinearBackend`, `GitHubBackend`, and `SshSyncBackend` all return `supports_peer_registry: false` and use the default no-op implementations.
## Automatic Lifecycle
The MCP server manages peer registration automatically:
1. **On startup** — after the signal endpoint binds, the server calls `register_peer` with its hostname and `signal_url` derived from `ASSAY_SIGNAL_BIND` and `ASSAY_SIGNAL_PORT`.
2. **On clean shutdown** — the server calls `unregister_peer` to remove itself.
No manual registration is needed for normal operation.
As of this version, `SmeltBackend` reports `supports_peer_registry: false` and relies on the default no-op implementations of `register_peer`, `list_peers`, and `unregister_peer`. Deployments using `SmeltBackend` do not currently participate in the peer registry or cross-instance signal forwarding.
### Other Backends
`SmeltBackend`, `NoopBackend`, `LinearBackend`, `GitHubBackend`, and `SshSyncBackend` all return `supports_peer_registry: false` and use the default no-op implementations.
## Automatic Lifecycle
When using a backend that reports `supports_peer_registry: true` (for example, `LocalFsBackend`), the MCP server manages peer registration automatically:
1. **On startup** — after the signal endpoint binds, the server calls `register_peer` with its hostname and `signal_url` derived from `ASSAY_SIGNAL_BIND` and `ASSAY_SIGNAL_PORT`.
2. **On clean shutdown** — the server calls `unregister_peer` to remove itself (this is a no-op for backends like `SmeltBackend` that do not support the peer registry).
No manual registration is needed for normal operation when the selected backend supports the peer registry.


## Cross-Instance Signal Forwarding

When `POST /api/v1/signal` targets an unknown local session:

1. Check for `X-Assay-Forwarded: true` header — if present, return `404` immediately (loop prevention).
2. Check `capabilities().supports_peer_registry` — if false, return `404`.
3. Call `list_peers()` — iterate peers sequentially.
4. For each peer, POST the original `SignalRequest` to `{peer.signal_url}/api/v1/signal` with:
- `X-Assay-Forwarded: true` header (prevents the receiving peer from re-forwarding)
- `Authorization: Bearer <token>` header (if `ASSAY_SIGNAL_TOKEN` is set)
5. First peer to return `202 Accepted` wins — return `202` to the original caller.
6. If all peers fail or the list is empty, return `404`.
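
The six steps above can be sketched as a pure decision function. This is a hedged illustration rather than the server's actual handler: `forward_unknown_session`, `ForwardResult`, and the `post_to_peer` callback (which stands in for the HTTP POST and returns the peer's status code, or `None` on a transport failure) are hypothetical names.

```rust
/// Outcome returned to the original caller.
#[derive(Debug, PartialEq, Eq)]
pub enum ForwardResult {
    /// Corresponds to `202 Accepted`.
    Accepted,
    /// Corresponds to `404`.
    NotFound,
}

/// Decide what to do with a signal whose target session is not local.
pub fn forward_unknown_session<F>(
    already_forwarded: bool,
    supports_peer_registry: bool,
    peer_urls: &[String],
    mut post_to_peer: F,
) -> ForwardResult
where
    F: FnMut(&str) -> Option<u16>,
{
    // Step 1: a request carrying `X-Assay-Forwarded: true` is never re-forwarded.
    if already_forwarded {
        return ForwardResult::NotFound;
    }
    // Step 2: without a peer registry there is nobody to forward to.
    if !supports_peer_registry {
        return ForwardResult::NotFound;
    }
    // Steps 3-5: try peers in order; the first 202 wins and stops the loop.
    for url in peer_urls {
        if post_to_peer(url.as_str()) == Some(202) {
            return ForwardResult::Accepted;
        }
    }
    // Step 6: every peer failed, or the list was empty.
    ForwardResult::NotFound
}
```

Passing the HTTP call in as a closure keeps the ordering and short-circuit behavior testable without a network.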

### Loop Prevention

The `X-Assay-Forwarded: true` header is the loop-prevention mechanism. A forwarded request that arrives at a peer is never re-forwarded — it either matches a local session (202) or fails (404). This guarantees at most one hop.

## Multi-Machine Setup

To deploy Assay across multiple machines:

1. **Set `ASSAY_SIGNAL_BIND=0.0.0.0`** and **`ASSAY_SIGNAL_URL=http://<machine-ip>:7432`** on each machine. Without `ASSAY_SIGNAL_URL`, the registered peer URL is `http://0.0.0.0:7432` — unroutable by other machines.
2. **Use a shared state backend** — either `LocalFsBackend` on a shared filesystem (NFS) or `SmeltBackend` with a central Smelt server.
3. **Start each Assay instance** — each registers itself as a peer automatically.
4. **Dispatch runs** — sessions on any machine can send signals to sessions on any other machine via `send_signal`. Unknown-session signals are forwarded through the peer registry.
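
The URL precedence in step 1 can be illustrated with a small helper. This is a sketch under the assumption that the override simply wins when present and that the defaults are `127.0.0.1` and `7432` as documented; the function names are hypothetical and the real server may validate or normalize values differently.

```rust
/// Resolve the URL a peer would register, given the three optional settings.
pub fn resolve_signal_url(
    url_override: Option<&str>,
    bind: Option<&str>,
    port: Option<&str>,
) -> String {
    // An explicit ASSAY_SIGNAL_URL takes precedence over the derived value.
    if let Some(url) = url_override {
        return url.to_string();
    }
    let bind = bind.unwrap_or("127.0.0.1");
    let port = port.unwrap_or("7432");
    format!("http://{}:{}", bind, port)
}

/// True when the URL points at the wildcard address, which other machines
/// cannot route to.
pub fn is_unroutable(signal_url: &str) -> bool {
    signal_url.starts_with("http://0.0.0.0:")
}

/// Read the three variables from the process environment.
pub fn signal_url_from_env() -> String {
    let url = std::env::var("ASSAY_SIGNAL_URL").ok();
    let bind = std::env::var("ASSAY_SIGNAL_BIND").ok();
    let port = std::env::var("ASSAY_SIGNAL_PORT").ok();
    resolve_signal_url(url.as_deref(), bind.as_deref(), port.as_deref())
}
```

A startup check along the lines of `is_unroutable(&signal_url_from_env())` would catch the `0.0.0.0` misconfiguration before the peer registers.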

### Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `ASSAY_SIGNAL_PORT` | `7432` | Port for the HTTP signal listener |
| `ASSAY_SIGNAL_BIND` | `127.0.0.1` | Bind address (`0.0.0.0` for multi-machine) |
| `ASSAY_SIGNAL_URL` | _(derived)_ | **Required when `ASSAY_SIGNAL_BIND=0.0.0.0`** — override the peer-registered URL with the machine's reachable address (e.g. `http://192.168.1.10:7432`). Without this, peers register `http://0.0.0.0:7432` which is unroutable. |
| `ASSAY_SIGNAL_TOKEN` | _(none)_ | Optional bearer token for auth (shared across peers) |

Comment on lines +108 to +114
Copilot AI Mar 29, 2026

The environment variable table omits ASSAY_SIGNAL_URL, but the server supports it and it is required for correct peer registration when binding ASSAY_SIGNAL_BIND=0.0.0.0 (otherwise peers get an unroutable URL). Please document ASSAY_SIGNAL_URL here (and its relationship to ASSAY_SIGNAL_BIND/ASSAY_SIGNAL_PORT).

## Capability Guard

Check `supports_peer_registry` before relying on peer discovery:

- `supports_peer_registry: true` — `LocalFsBackend`
- Note: `SmeltBackend` returns `false` — it implements `register_peer` as fire-and-forget but `list_peers` and `unregister_peer` are no-ops, so local signal forwarding is disabled; Smelt handles cross-instance routing server-side
- `supports_peer_registry: false` — `NoopBackend`, `LinearBackend`, `GitHubBackend`, `SshSyncBackend`

When peer registry is unavailable, signals for unknown local sessions return `404` without any forwarding attempt.
1 change: 1 addition & 0 deletions plugins/smelt-agent/skills/run-dispatch.md
@@ -31,6 +31,7 @@ If the controller wants state on a remote backend, set the `state_backend` field
- `{ type = "linear", team_id = "TEAM123" }` — Linear project tracking; requires `LINEAR_API_KEY` env var; `project_id` is optional (M011/S02)
- `{ type = "github", repo = "owner/repo" }` — GitHub Issues via `gh` CLI; requires `gh` installed and authenticated; `label` is optional (M011/S03)
- `{ type = "ssh", host = "worker.example.com", remote_assay_dir = "/home/user/.assay" }` — SCP sync to remote host; `user` and `port` are optional (M011/S04)
- `{ type = "smelt", url = "http://smelt.example.com:9000", job_id = "abc123", token = "secret" }` — Smelt HTTP backend; POSTs orchestrator events to Smelt's `/api/v1/events` endpoint; `token` is optional (bearer auth)
- `{ type = "custom", name = "my-backend", config = { ... } }` — custom third-party backend (falls back to no-op)

**Note:** `linear`, `github`, and `ssh` backends are stub implementations in the current release — configuring them logs a warning and falls back to a no-op backend that discards all state writes. Full implementations land in M011/S02–S04.
105 changes: 105 additions & 0 deletions plugins/smelt-agent/tests/verify-docs.sh
@@ -0,0 +1,105 @@
#!/usr/bin/env bash
# verify-docs.sh — Structural test for smelt-agent plugin documentation.
#
# Checks that MCP tool names referenced in plugin docs exist in the
# assay-mcp router (server.rs). Exits non-zero on any mismatch.
#
# Usage: bash plugins/smelt-agent/tests/verify-docs.sh

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
PLUGIN_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
REPO_ROOT="$(cd "$PLUGIN_DIR/../.." && pwd)"

SERVER_RS="$REPO_ROOT/crates/assay-mcp/src/server.rs"

if [ ! -f "$SERVER_RS" ]; then
  echo "ERROR: server.rs not found at $SERVER_RS"
  exit 1
fi

# Extract MCP tool names from the router (pub async fn declarations in the
# #[tool_router] impl block). These are the canonical tool names.
ROUTER_TOOLS=$(grep 'pub async fn' "$SERVER_RS" \
  | grep -oE 'fn [a-z_]+' \
  | sed 's/fn //' \
  | grep -v '^serve$' \
  | sort -u)

# Extract tool names referenced in the MCP Tools table in AGENTS.md.
# Table rows look like: | `tool_name` | description |
DOC_TOOLS=$(grep -oE '`[a-z_]+`' "$PLUGIN_DIR/AGENTS.md" \
  | tr -d '`' \
  | sort -u)

# Known tools that exist on feature branches but not yet on main.
# These are documented in advance of the M015 merge and will be
# validated once M015 lands.
PENDING_TOOLS=""

Comment on lines +37 to +40
Copilot AI Mar 29, 2026

PENDING_TOOLS includes poll_signals and send_signal, but both tools are already present in crates/assay-mcp/src/server.rs (they are #[tool] pub async fn in the router). Marking them as pending means this test will stop validating these docs against the router and could let future renames/drift slip through unnoticed. Consider removing them from PENDING_TOOLS, or only treating a tool as pending when it is actually absent from the router output.

Suggested change
# These are documented in advance of the M015 merge and will be
# validated once M015 lands.
PENDING_TOOLS="poll_signals send_signal"
# Currently none; add tool names here temporarily if docs lead server.rs.
PENDING_TOOLS=""

# Filter out non-tool identifiers (field names, config keys, etc.)
NON_TOOLS="run_id state_backend"

ERRORS=0

for tool in $DOC_TOOLS; do
  # Skip known non-tool identifiers
  skip=0
  for nt in $NON_TOOLS; do
    if [ "$tool" = "$nt" ]; then
      skip=1
      break
    fi
  done
  [ "$skip" -eq 1 ] && continue

  # Check if it's a pending tool (on a feature branch, not yet merged)
  pending=0
  for pt in $PENDING_TOOLS; do
    if [ "$tool" = "$pt" ]; then
      pending=1
      break
    fi
  done

  if [ "$pending" -eq 1 ]; then
    echo " PENDING: $tool (M015 feature branch — not yet on main)"
    continue
  fi

  # Check if tool exists in router
  if ! echo "$ROUTER_TOOLS" | grep -qx "$tool"; then
    echo " MISSING: $tool (referenced in docs but not in router)"
    ERRORS=$((ERRORS + 1))
  fi
done

# Also check skill files for tool references
for skill_file in "$PLUGIN_DIR"/skills/*.md; do
  SKILL_TOOLS=$(grep -oE '`(poll_signals|send_signal|merge_propose|orchestrate_run|run_manifest|orchestrate_status|gate_run|spec_list|spec_get)`' "$skill_file" 2>/dev/null | tr -d '`' | sort -u || true)
  for tool in $SKILL_TOOLS; do
    pending=0
    for pt in $PENDING_TOOLS; do
      if [ "$tool" = "$pt" ]; then
        pending=1
        break
      fi
    done
    [ "$pending" -eq 1 ] && continue

    if ! echo "$ROUTER_TOOLS" | grep -qx "$tool"; then
      echo " MISSING: $tool (referenced in $(basename "$skill_file") but not in router)"
      ERRORS=$((ERRORS + 1))
    fi
  done
done

if [ "$ERRORS" -gt 0 ]; then
  echo ""
  echo "FAIL: $ERRORS tool name(s) referenced in docs but missing from router"
  exit 1
fi

echo "OK: all documented tool names verified"
exit 0