Move signature aggregation off the tick loop (background worker with snapshot)

## Problem

When a node is an aggregator, `AggregateCommitteeSignatures` runs **synchronously on the tick goroutine** at interval 2 each slot. The `xmss.AggregateWithChildren` call (`node/store_aggregate.go:132`) goes through cgo into the leanMultisig recursive-aggregation prover, which currently takes **1-4 seconds** per group — well above the 800ms interval budget.

The relevant code:

```go
// node/tick.go:45-56
// Interval 2: synchronous aggregation. Blocks the tick loop for 1-4s
// but keeps source consistency — headState doesn't change during proving.
// Async aggregation breaks source alignment because head drifts during
// the background prover run. Acceptable until prover is <800ms.
if currentInterval == 2 && isAgg {
    aggs := AggregateCommitteeSignatures(e.Store)
    for _, agg := range aggs {
        if e.P2P != nil {
            e.P2P.PublishAggregatedAttestation(context.Background(), agg)
        }
    }
}
```

While the prover runs, the tick goroutine cannot:
- accept new gossip blocks
- handle attestation gossip
- advance head / safe-target updates
- service the next slot's tick promptly

This produces missed slots, late attestations from peers piling up, and (under non-finality pressure when there's more than one group to aggregate) compounding stalls.

A separate, related comment misleads readers: `tick.go:42` says *"Aggregation is handled async below to avoid blocking the tick loop"* — that's aspirational; the very next block is synchronous. Either fix or refactor; this issue proposes the fix.

## Why now

Both peer clients have already shipped this work, with public devnet evidence of measurable wins:

- **ethlambda PR #299** — *fix: make signature aggregation asynchronous* (merged 2026-04-21). https://github.com/lambdaclass/ethlambda/pull/299
- **zeam PR #874** — *feat: move aggregate FFI off libxev thread using Io.Threaded worker* (merged 2026-05-14). https://github.com/blockblaz/zeam/pull/874
- **zeam PR #890** — *offload heavy chain mutations to chain-worker, parallelize XMSS verify* (merged 2026-05-17, follow-on architecture). https://github.com/blockblaz/zeam/pull/890

The gean comment's exit criterion — *Acceptable until prover is <800ms* — defers indefinitely. The leanMultisig prover team has not committed to a sub-800ms target on any timeline gean controls.

## The blocker the comment cites — and how peers solved it

`tick.go:47-48` is correct that the naive `go func() { Aggregate(); Publish() }()` is unsafe: head moves during the 1-4s prover run, so the aggregator could publish an attestation whose source/target chain no longer matches what other nodes see at publish time.

Both ethlambda and zeam converged on the **snapshot pattern**: capture head + justified at submission time, hand them to the worker, and the worker aggregates against the frozen snapshot. The aggregate is consistent with *some* recent state even if head has moved by the time it lands. Specifically:

- **ethlambda** (`crates/blockchain/src/store.rs`) exposes pure helpers `snapshot_aggregation_inputs`, `aggregate_job`, `apply_aggregated_group`, `finalize_aggregation_session`. `store::on_tick` no longer calls into aggregation — an actor drives it.
- **zeam** has a dedicated `std.Io.Threaded` worker with `concurrent_limit = .limited(1)`; the worker *owns the snapshot* and runs `forkChoice.aggregate` then `publishProducedAggregations`.

## Two design points where ethlambda and zeam differ

| | ethlambda #299 | zeam #874 |
|---|---|---|
| Concurrency limit | spawn-blocking, but bounded by deadline | dedicated worker, `concurrent_limit=1` |
| Deadline | 750 ms cancellation token | none |
| Behavior at deadline | cancel — but late results still apply + publish | n/a |
| Behavior when previous still in-flight | n/a (cancelled at deadline) | skip with metric `zeam_aggregate_skip_total{reason=\"in_flight\"}` |
| Fence late messages | session id = slot number | not needed (no late path) |
| Lifecycle | `#[stopped]` cancels + joins bounded at 2 s | `aggregate_group.cancel → chain_worker.stop → aggregate_io.deinit` |

ethlambda's design is more complex but doesn't drop work under partition stack-up. zeam's design is simpler and shipped first.

ethlambda's devnet evidence (4 nodes, 200 slots, 3 induced partitions via `docker pause`):
- finality lag 5 → 3 slots
- 24/203 sessions cancelled at deadline (11.8%) — only when `groups_considered > 1`
- 13 missed slots, all inside partition windows
- no straggler warnings

## Proposed approach for gean

Ship in two slices, ordered to minimize blast radius:

**Slice 1 — skip-when-busy (matches zeam #874)**
- New `node/aggregator_worker.go`: a single goroutine owning a `chan aggregateJob` (capacity 1). Sender uses non-blocking send: if full, skip and increment a counter.
- New `snapshotAggregationInputs(s *ConsensusStore)` in `node/store_aggregate.go`: builds the immutable input set the worker needs (Select/Fill outputs, headState reference, slot, justified) without touching mutable store state.
- `tick.go` interval-2 path: replace the synchronous `AggregateCommitteeSignatures` + `PublishAggregatedAttestation` block with `e.AggregatorWorker.SubmitOrSkip(snapshot)`.
- Worker goroutine: receive snapshot → call the Aggregate phase → for each result, send `*types.SignedAggregatedAttestation` on a results channel back to the engine, which publishes via `e.P2P.PublishAggregatedAttestation`. (Worker must NOT call into the store directly; results land on the tick goroutine for ordered publish.)
- Shutdown: engine `Close()` cancels the worker context, drains the results channel up to a 2 s deadline.

**Slice 2 — deadline + late-keep (matches ethlambda #299)**
- Wrap each worker run in `context.WithTimeout` (750 ms; tune after burn-in).
- On timeout: cancel the proof, but if the prover completed *just after* cancellation, still apply + publish the result locally. Fence by slot id so a late-arriving result for slot N can't be applied during slot N+2.
- Defer until slice 1 has burn-in evidence; gean can stay on skip-when-busy if devnet runs don't show partition-induced stack-up.

## Files to touch (slice 1)

- `node/tick.go:42-56` — remove inline aggregation + Publish loop; replace with worker submission. Delete or correct the misleading line-42 comment.
- `node/store_aggregate.go` — extract `snapshotAggregationInputs` from the head of `AggregateCommitteeSignatures`; the existing function becomes pure (snapshot → aggregates).
- `node/aggregator_worker.go` (new) — worker goroutine, job channel, result channel, lifecycle.
- `node/node.go` — wire the worker into `Engine` (alongside `AggregationCh`), start in `Run`, close in shutdown.
- `node/metrics.go` — register `lean_aggregator_worker_duration_seconds` (histogram) and `lean_aggregator_skip_total{reason=\"in_flight|not_aggregator|not_synced|missing_state\"}` (counter), mirroring zeam's metric names so dashboards line up.

## Verification

- **Tick interval duration**: existing `lean_tick_interval_duration_seconds` histogram p99 should drop from multi-second to sub-100ms during aggregator role. This is the headline metric.
- **Devnet rerun**: 4-node devnet, 200+ slots, with at least one induced partition via `docker pause`. Confirm:
  - finality lag improves
  - missed slots concentrated inside the partition window, not outside
  - `lean_aggregator_skip_total{reason=\"in_flight\"}` only fires during partitions
- **Spec tests**: `go test -tags spectests ./spectests/...` must remain green; aggregation paths are exercised by api fixtures + forkchoice fixtures.
- **Race detector**: `go test -race ./node/` must pass; snapshot ownership transfer is the most likely place a data race could appear.

## Related (do NOT confuse)

- **gean PR #223** — *perf(node): parallelize block attestation signature verification* (merged 2026-04-23). This is the **verifier** path on block import (parallel `xmss.VerifyAggregatedSignature` via `errgroup.Group` bounded by `runtime.GOMAXPROCS(0)`). Different problem, already shipped. The two pieces of work were discussed in the same session and are easy to conflate.
- ethlambda PR #274 (parallel verification) is the analogue of gean PR #223.
- zeam PR #890 (chain-worker offload) is what gean would build *after* this slice if we want to follow zeam's full architecture, but that's out of scope here.

## References

- ethlambda PR #299: https://github.com/lambdaclass/ethlambda/pull/299
- zeam PR #874: https://github.com/blockblaz/zeam/pull/874
- zeam PR #890: https://github.com/blockblaz/zeam/pull/890
- gean PR #223 (verification, already done): https://github.com/geanlabs/gean/pull/223

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move signature aggregation off the tick loop (background worker with snapshot) #272

Problem

Why now

The blocker the comment cites — and how peers solved it

Two design points where ethlambda and zeam differ

Proposed approach for gean

Files to touch (slice 1)

Verification

Related (do NOT confuse)

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

	ethlambda #299	zeam #874
Concurrency limit	spawn-blocking, but bounded by deadline	dedicated worker, `concurrent_limit=1`
Deadline	750 ms cancellation token	none
Behavior at deadline	cancel — but late results still apply + publish	n/a
Behavior when previous still in-flight	n/a (cancelled at deadline)	skip with metric `zeam_aggregate_skip_total{reason=\"in_flight\"}`
Fence late messages	session id = slot number	not needed (no late path)
Lifecycle	`#[stopped]` cancels + joins bounded at 2 s	`aggregate_group.cancel → chain_worker.stop → aggregate_io.deinit`

Move signature aggregation off the tick loop (background worker with snapshot) #272

Description

Problem

Why now

The blocker the comment cites — and how peers solved it

Two design points where ethlambda and zeam differ

Proposed approach for gean

Files to touch (slice 1)

Verification

Related (do NOT confuse)

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions