Skip to content

Move signature aggregation off the tick loop (background worker with snapshot) #272

@dimka90

Description

@dimka90

Problem

When a node is an aggregator, AggregateCommitteeSignatures runs synchronously on the tick goroutine at interval 2 each slot. The xmss.AggregateWithChildren call (node/store_aggregate.go:132) goes through cgo into the leanMultisig recursive-aggregation prover, which currently takes 1-4 seconds per group — well above the 800ms interval budget.

The relevant code:

// node/tick.go:45-56
// Interval 2: synchronous aggregation. Blocks the tick loop for 1-4s
// but keeps source consistency — headState doesn't change during proving.
// Async aggregation breaks source alignment because head drifts during
// the background prover run. Acceptable until prover is <800ms.
if currentInterval == 2 && isAgg {
    aggs := AggregateCommitteeSignatures(e.Store)
    for _, agg := range aggs {
        if e.P2P != nil {
            e.P2P.PublishAggregatedAttestation(context.Background(), agg)
        }
    }
}

While the prover runs, the tick goroutine cannot:

  • accept new gossip blocks
  • handle attestation gossip
  • advance head / safe-target updates
  • service the next slot's tick promptly

This produces missed slots, late attestations from peers piling up, and (under non-finality pressure when there's more than one group to aggregate) compounding stalls.

A separate, related comment misleads readers: tick.go:42 says "Aggregation is handled async below to avoid blocking the tick loop" — that's aspirational; the very next block is synchronous. Either fix or refactor; this issue proposes the fix.

Why now

Both peer clients have already shipped this work, with public devnet evidence of measurable wins:

The gean comment's exit criterion — Acceptable until prover is <800ms — defers indefinitely. The leanMultisig prover team has not committed to a sub-800ms target on any timeline gean controls.

The blocker the comment cites — and how peers solved it

tick.go:47-48 is correct that the naive go func() { Aggregate(); Publish() }() is unsafe: head moves during the 1-4s prover run, so the aggregator could publish an attestation whose source/target chain no longer matches what other nodes see at publish time.

Both ethlambda and zeam converged on the snapshot pattern: capture head + justified at submission time, hand them to the worker, and the worker aggregates against the frozen snapshot. The aggregate is consistent with some recent state even if head has moved by the time it lands. Specifically:

  • ethlambda (crates/blockchain/src/store.rs) exposes pure helpers snapshot_aggregation_inputs, aggregate_job, apply_aggregated_group, finalize_aggregation_session. store::on_tick no longer calls into aggregation — an actor drives it.
  • zeam has a dedicated std.Io.Threaded worker with concurrent_limit = .limited(1); the worker owns the snapshot and runs forkChoice.aggregate then publishProducedAggregations.

Two design points where ethlambda and zeam differ

ethlambda #299 zeam #874
Concurrency limit spawn-blocking, but bounded by deadline dedicated worker, concurrent_limit=1
Deadline 750 ms cancellation token none
Behavior at deadline cancel — but late results still apply + publish n/a
Behavior when previous still in-flight n/a (cancelled at deadline) skip with metric zeam_aggregate_skip_total{reason=\"in_flight\"}
Fence late messages session id = slot number not needed (no late path)
Lifecycle #[stopped] cancels + joins bounded at 2 s aggregate_group.cancel → chain_worker.stop → aggregate_io.deinit

ethlambda's design is more complex but doesn't drop work under partition stack-up. zeam's design is simpler and shipped first.

ethlambda's devnet evidence (4 nodes, 200 slots, 3 induced partitions via docker pause):

  • finality lag 5 → 3 slots
  • 24/203 sessions cancelled at deadline (11.8%) — only when groups_considered > 1
  • 13 missed slots, all inside partition windows
  • no straggler warnings

Proposed approach for gean

Ship in two slices, ordered to minimize blast radius:

Slice 1 — skip-when-busy (matches zeam #874)

  • New node/aggregator_worker.go: a single goroutine owning a chan aggregateJob (capacity 1). Sender uses non-blocking send: if full, skip and increment a counter.
  • New snapshotAggregationInputs(s *ConsensusStore) in node/store_aggregate.go: builds the immutable input set the worker needs (Select/Fill outputs, headState reference, slot, justified) without touching mutable store state.
  • tick.go interval-2 path: replace the synchronous AggregateCommitteeSignatures + PublishAggregatedAttestation block with e.AggregatorWorker.SubmitOrSkip(snapshot).
  • Worker goroutine: receive snapshot → call the Aggregate phase → for each result, send *types.SignedAggregatedAttestation on a results channel back to the engine, which publishes via e.P2P.PublishAggregatedAttestation. (Worker must NOT call into the store directly; results land on the tick goroutine for ordered publish.)
  • Shutdown: engine Close() cancels the worker context, drains the results channel up to a 2 s deadline.

Slice 2 — deadline + late-keep (matches ethlambda #299)

  • Wrap each worker run in context.WithTimeout (750 ms; tune after burn-in).
  • On timeout: cancel the proof, but if the prover completed just after cancellation, still apply + publish the result locally. Fence by slot id so a late-arriving result for slot N can't be applied during slot N+2.
  • Defer until slice 1 has burn-in evidence; gean can stay on skip-when-busy if devnet runs don't show partition-induced stack-up.

Files to touch (slice 1)

  • node/tick.go:42-56 — remove inline aggregation + Publish loop; replace with worker submission. Delete or correct the misleading line-42 comment.
  • node/store_aggregate.go — extract snapshotAggregationInputs from the head of AggregateCommitteeSignatures; the existing function becomes pure (snapshot → aggregates).
  • node/aggregator_worker.go (new) — worker goroutine, job channel, result channel, lifecycle.
  • node/node.go — wire the worker into Engine (alongside AggregationCh), start in Run, close in shutdown.
  • node/metrics.go — register lean_aggregator_worker_duration_seconds (histogram) and lean_aggregator_skip_total{reason=\"in_flight|not_aggregator|not_synced|missing_state\"} (counter), mirroring zeam's metric names so dashboards line up.

Verification

  • Tick interval duration: existing lean_tick_interval_duration_seconds histogram p99 should drop from multi-second to sub-100ms during aggregator role. This is the headline metric.
  • Devnet rerun: 4-node devnet, 200+ slots, with at least one induced partition via docker pause. Confirm:
    • finality lag improves
    • missed slots concentrated inside the partition window, not outside
    • lean_aggregator_skip_total{reason=\"in_flight\"} only fires during partitions
  • Spec tests: go test -tags spectests ./spectests/... must remain green; aggregation paths are exercised by api fixtures + forkchoice fixtures.
  • Race detector: go test -race ./node/ must pass; snapshot ownership transfer is the most likely place a data race could appear.

Related (do NOT confuse)

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions