Problem
When a node is an aggregator, AggregateCommitteeSignatures runs synchronously on the tick goroutine at interval 2 each slot. The xmss.AggregateWithChildren call (node/store_aggregate.go:132) goes through cgo into the leanMultisig recursive-aggregation prover, which currently takes 1-4 seconds per group — well above the 800ms interval budget.
The relevant code:
// node/tick.go:45-56
// Interval 2: synchronous aggregation. Blocks the tick loop for 1-4s
// but keeps source consistency — headState doesn't change during proving.
// Async aggregation breaks source alignment because head drifts during
// the background prover run. Acceptable until prover is <800ms.
if currentInterval == 2 && isAgg {
aggs := AggregateCommitteeSignatures(e.Store)
for _, agg := range aggs {
if e.P2P != nil {
e.P2P.PublishAggregatedAttestation(context.Background(), agg)
}
}
}
While the prover runs, the tick goroutine cannot:
- accept new gossip blocks
- handle attestation gossip
- advance head / safe-target updates
- service the next slot's tick promptly
This produces missed slots, late attestations from peers piling up, and (under non-finality pressure when there's more than one group to aggregate) compounding stalls.
A separate, related comment misleads readers: tick.go:42 says "Aggregation is handled async below to avoid blocking the tick loop" — that's aspirational; the very next block is synchronous. Either fix or refactor; this issue proposes the fix.
Why now
Both peer clients have already shipped this work, with public devnet evidence of measurable wins:
The gean comment's exit criterion — Acceptable until prover is <800ms — defers indefinitely. The leanMultisig prover team has not committed to a sub-800ms target on any timeline gean controls.
The blocker the comment cites — and how peers solved it
tick.go:47-48 is correct that the naive go func() { Aggregate(); Publish() }() is unsafe: head moves during the 1-4s prover run, so the aggregator could publish an attestation whose source/target chain no longer matches what other nodes see at publish time.
Both ethlambda and zeam converged on the snapshot pattern: capture head + justified at submission time, hand them to the worker, and the worker aggregates against the frozen snapshot. The aggregate is consistent with some recent state even if head has moved by the time it lands. Specifically:
- ethlambda (
crates/blockchain/src/store.rs) exposes pure helpers snapshot_aggregation_inputs, aggregate_job, apply_aggregated_group, finalize_aggregation_session. store::on_tick no longer calls into aggregation — an actor drives it.
- zeam has a dedicated
std.Io.Threaded worker with concurrent_limit = .limited(1); the worker owns the snapshot and runs forkChoice.aggregate then publishProducedAggregations.
Two design points where ethlambda and zeam differ
|
ethlambda #299 |
zeam #874 |
| Concurrency limit |
spawn-blocking, but bounded by deadline |
dedicated worker, concurrent_limit=1 |
| Deadline |
750 ms cancellation token |
none |
| Behavior at deadline |
cancel — but late results still apply + publish |
n/a |
| Behavior when previous still in-flight |
n/a (cancelled at deadline) |
skip with metric zeam_aggregate_skip_total{reason=\"in_flight\"} |
| Fence late messages |
session id = slot number |
not needed (no late path) |
| Lifecycle |
#[stopped] cancels + joins bounded at 2 s |
aggregate_group.cancel → chain_worker.stop → aggregate_io.deinit |
ethlambda's design is more complex but doesn't drop work under partition stack-up. zeam's design is simpler and shipped first.
ethlambda's devnet evidence (4 nodes, 200 slots, 3 induced partitions via docker pause):
- finality lag 5 → 3 slots
- 24/203 sessions cancelled at deadline (11.8%) — only when
groups_considered > 1
- 13 missed slots, all inside partition windows
- no straggler warnings
Proposed approach for gean
Ship in two slices, ordered to minimize blast radius:
Slice 1 — skip-when-busy (matches zeam #874)
- New
node/aggregator_worker.go: a single goroutine owning a chan aggregateJob (capacity 1). Sender uses non-blocking send: if full, skip and increment a counter.
- New
snapshotAggregationInputs(s *ConsensusStore) in node/store_aggregate.go: builds the immutable input set the worker needs (Select/Fill outputs, headState reference, slot, justified) without touching mutable store state.
tick.go interval-2 path: replace the synchronous AggregateCommitteeSignatures + PublishAggregatedAttestation block with e.AggregatorWorker.SubmitOrSkip(snapshot).
- Worker goroutine: receive snapshot → call the Aggregate phase → for each result, send
*types.SignedAggregatedAttestation on a results channel back to the engine, which publishes via e.P2P.PublishAggregatedAttestation. (Worker must NOT call into the store directly; results land on the tick goroutine for ordered publish.)
- Shutdown: engine
Close() cancels the worker context, drains the results channel up to a 2 s deadline.
Slice 2 — deadline + late-keep (matches ethlambda #299)
- Wrap each worker run in
context.WithTimeout (750 ms; tune after burn-in).
- On timeout: cancel the proof, but if the prover completed just after cancellation, still apply + publish the result locally. Fence by slot id so a late-arriving result for slot N can't be applied during slot N+2.
- Defer until slice 1 has burn-in evidence; gean can stay on skip-when-busy if devnet runs don't show partition-induced stack-up.
Files to touch (slice 1)
node/tick.go:42-56 — remove inline aggregation + Publish loop; replace with worker submission. Delete or correct the misleading line-42 comment.
node/store_aggregate.go — extract snapshotAggregationInputs from the head of AggregateCommitteeSignatures; the existing function becomes pure (snapshot → aggregates).
node/aggregator_worker.go (new) — worker goroutine, job channel, result channel, lifecycle.
node/node.go — wire the worker into Engine (alongside AggregationCh), start in Run, close in shutdown.
node/metrics.go — register lean_aggregator_worker_duration_seconds (histogram) and lean_aggregator_skip_total{reason=\"in_flight|not_aggregator|not_synced|missing_state\"} (counter), mirroring zeam's metric names so dashboards line up.
Verification
- Tick interval duration: existing
lean_tick_interval_duration_seconds histogram p99 should drop from multi-second to sub-100ms during aggregator role. This is the headline metric.
- Devnet rerun: 4-node devnet, 200+ slots, with at least one induced partition via
docker pause. Confirm:
- finality lag improves
- missed slots concentrated inside the partition window, not outside
lean_aggregator_skip_total{reason=\"in_flight\"} only fires during partitions
- Spec tests:
go test -tags spectests ./spectests/... must remain green; aggregation paths are exercised by api fixtures + forkchoice fixtures.
- Race detector:
go test -race ./node/ must pass; snapshot ownership transfer is the most likely place a data race could appear.
Related (do NOT confuse)
References
Problem
When a node is an aggregator,
AggregateCommitteeSignaturesruns synchronously on the tick goroutine at interval 2 each slot. Thexmss.AggregateWithChildrencall (node/store_aggregate.go:132) goes through cgo into the leanMultisig recursive-aggregation prover, which currently takes 1-4 seconds per group — well above the 800ms interval budget.The relevant code:
While the prover runs, the tick goroutine cannot:
This produces missed slots, late attestations from peers piling up, and (under non-finality pressure when there's more than one group to aggregate) compounding stalls.
A separate, related comment misleads readers:
tick.go:42says "Aggregation is handled async below to avoid blocking the tick loop" — that's aspirational; the very next block is synchronous. Either fix or refactor; this issue proposes the fix.Why now
Both peer clients have already shipped this work, with public devnet evidence of measurable wins:
The gean comment's exit criterion — Acceptable until prover is <800ms — defers indefinitely. The leanMultisig prover team has not committed to a sub-800ms target on any timeline gean controls.
The blocker the comment cites — and how peers solved it
tick.go:47-48is correct that the naivego func() { Aggregate(); Publish() }()is unsafe: head moves during the 1-4s prover run, so the aggregator could publish an attestation whose source/target chain no longer matches what other nodes see at publish time.Both ethlambda and zeam converged on the snapshot pattern: capture head + justified at submission time, hand them to the worker, and the worker aggregates against the frozen snapshot. The aggregate is consistent with some recent state even if head has moved by the time it lands. Specifically:
crates/blockchain/src/store.rs) exposes pure helperssnapshot_aggregation_inputs,aggregate_job,apply_aggregated_group,finalize_aggregation_session.store::on_tickno longer calls into aggregation — an actor drives it.std.Io.Threadedworker withconcurrent_limit = .limited(1); the worker owns the snapshot and runsforkChoice.aggregatethenpublishProducedAggregations.Two design points where ethlambda and zeam differ
concurrent_limit=1zeam_aggregate_skip_total{reason=\"in_flight\"}#[stopped]cancels + joins bounded at 2 saggregate_group.cancel → chain_worker.stop → aggregate_io.deinitethlambda's design is more complex but doesn't drop work under partition stack-up. zeam's design is simpler and shipped first.
ethlambda's devnet evidence (4 nodes, 200 slots, 3 induced partitions via
docker pause):groups_considered > 1Proposed approach for gean
Ship in two slices, ordered to minimize blast radius:
Slice 1 — skip-when-busy (matches zeam #874)
node/aggregator_worker.go: a single goroutine owning achan aggregateJob(capacity 1). Sender uses non-blocking send: if full, skip and increment a counter.snapshotAggregationInputs(s *ConsensusStore)innode/store_aggregate.go: builds the immutable input set the worker needs (Select/Fill outputs, headState reference, slot, justified) without touching mutable store state.tick.gointerval-2 path: replace the synchronousAggregateCommitteeSignatures+PublishAggregatedAttestationblock withe.AggregatorWorker.SubmitOrSkip(snapshot).*types.SignedAggregatedAttestationon a results channel back to the engine, which publishes viae.P2P.PublishAggregatedAttestation. (Worker must NOT call into the store directly; results land on the tick goroutine for ordered publish.)Close()cancels the worker context, drains the results channel up to a 2 s deadline.Slice 2 — deadline + late-keep (matches ethlambda #299)
context.WithTimeout(750 ms; tune after burn-in).Files to touch (slice 1)
node/tick.go:42-56— remove inline aggregation + Publish loop; replace with worker submission. Delete or correct the misleading line-42 comment.node/store_aggregate.go— extractsnapshotAggregationInputsfrom the head ofAggregateCommitteeSignatures; the existing function becomes pure (snapshot → aggregates).node/aggregator_worker.go(new) — worker goroutine, job channel, result channel, lifecycle.node/node.go— wire the worker intoEngine(alongsideAggregationCh), start inRun, close in shutdown.node/metrics.go— registerlean_aggregator_worker_duration_seconds(histogram) andlean_aggregator_skip_total{reason=\"in_flight|not_aggregator|not_synced|missing_state\"}(counter), mirroring zeam's metric names so dashboards line up.Verification
lean_tick_interval_duration_secondshistogram p99 should drop from multi-second to sub-100ms during aggregator role. This is the headline metric.docker pause. Confirm:lean_aggregator_skip_total{reason=\"in_flight\"}only fires during partitionsgo test -tags spectests ./spectests/...must remain green; aggregation paths are exercised by api fixtures + forkchoice fixtures.go test -race ./node/must pass; snapshot ownership transfer is the most likely place a data race could appear.Related (do NOT confuse)
xmss.VerifyAggregatedSignatureviaerrgroup.Groupbounded byruntime.GOMAXPROCS(0)). Different problem, already shipped. The two pieces of work were discussed in the same session and are easy to conflate.References