feat(node): move interval-2 aggregation off the tick loop into a worker goroutine (#311)#312
Merged
mananuf merged 1 commit intoMay 25, 2026
Conversation
…er goroutine (#311) Before this PR, AggregateCommitteeSignatures ran SYNCHRONOUSLY on the consensus tick loop at interval 2 (node/tick.go:52). At gean's measured p99 of ~1.4s, the tick couldn't advance to interval 3 (safe_target), interval 4 (accept_new_attestations), or the next slot's interval 0 (proposal) until the FFI finished. The existing comment acknowledged this as "acceptable until prover is <800ms" — we're now under 800ms at p50 but p99 still slips past. ethlambda runs aggregation on a tokio::spawn_blocking worker with a 750ms deadline (aggregation.rs:395-462); zeam runs worker threads. Spec read confirmed off-tick dispatch is permissible (forks/lstar/spec.py aggregate(store) → (store, aggregates) is a pure transformation; the spec doesn't model WHEN execution happens, only what state must eventually exist before the soft deadline at interval 3 ~800ms after dispatch or the hard deadline at interval 4 ~1600ms after dispatch). This PR mirrors that off-tick pattern: Engine struct: + AggregationDispatchCh chan AggregationDispatch (buffered to 1) Engine.Run(): + go e.runAggregationWorker(ctx) tick.go interval-2: - aggs := AggregateCommitteeSignatures(e.Store) - for _, agg := range aggs { e.P2P.PublishAggregatedAttestation(...) } + snap := snapshotAggregationInputs(e.Store) + select { case e.AggregationDispatchCh <- AggregationDispatch{snap, slot}: default: drop counter } New: runAggregationWorker (store_aggregate.go) - Drains AggregationDispatchCh - Runs aggregateFromSnapshot + applyAggregationMutations off the tick - Publishes aggregates to P2P - Records lean_aggregation_worker_total_time_seconds New: AggregationDispatch type (store_aggregate.go) Carries one slot's snapshot + slot number from tick to worker. New metrics (metrics.go): + lean_aggregation_worker_total_time_seconds (STATE_TRANSITION_BUCKETS) + lean_aggregation_dispatch_dropped_total counter Channel capacity is 1 by design: a backlog of more than one slot means the worker is too slow and the new dispatch should drop rather than queue. Drops are spec-permissible (best-effort aggregation per slot) and surface via the dropped counter. With current numbers (p99 = 1.4s, slot = 4s) drops should be vanishingly rare. Store-mutation races: applyAggregationMutations writes to KnownPayloads (mutex'd in PushBatch) and AttestationSignatures (mutex'd in Delete). Both already protected against concurrent gossip-handler writes; the worker's writes serialise correctly without new locking. Behavior change: the tick loop no longer blocks on aggregation. Aggregation result latency unchanged; what changes is that the rest of the tick can proceed while it runs. Spec compliance: zero impact. Same eventual store state, same broadcast set, same FFI calls in the same per-aggregate order. context import removed from tick.go (no longer used after dispatch rewrite; previously imported only for context.Background() at the publish call site, which moved to the worker). go build ./... + go test -count=1 ./... all packages green. Closes #311.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Moves committee aggregation off the consensus tick loop into a single worker goroutine. The tick thread snapshots inputs synchronously (cheap, ms) and hands them to the worker via a buffered channel; the worker runs the FFI + publishes results off-tick. Mirrors ethlambda's
tokio::spawn_blockingpattern and zeam's worker-thread model. Closes #311.This is the architecture change Partha pointed at when he said zeam's measurement was "worker thread execution time" — gean now has the same shape.
Why
Before this PR, the tick comment was honest about the trade-off (
node/tick.go:48-51):We're now under 800ms at p50 (407ms FFI) but p99 still hits 1.4s. While aggregation runs, the tick can't advance to interval 3 (
update_safe_target) or interval 4 (accept_new_attestations), and crucially the next slot's interval 0/1 (block production + attestation duties) gets delayed if the previous slot's interval-2 overruns.ethlambda runs aggregation on
tokio::spawn_blockingwith a 750ms deadline (ethlambda/crates/blockchain/src/aggregation.rs:395-462). Spec read confirmed off-tick dispatch is permissible —forks/lstar/spec.py'saggregate(store) → (store, aggregates)is a pure transformation; the spec doesn't model when execution happens, only what state must eventually exist before:update_safe_targetreadslatest_new_aggregated_payloadsaccept_new_attestationspromoteslatest_new → latest_knownBoth fit comfortably inside gean's current p99 of 1.4s.
What changed
node/node.go(+ field, + worker spawn)type Engine struct { ... + AggregationDispatchCh chan AggregationDispatch // buffered 1 } func (e *Engine) Run(ctx context.Context) { ... go e.runFetchBatcher(ctx) + go e.runAggregationWorker(ctx) }node/tick.go(interval-2 rewrite)contextimport removed fromtick.go(no longer used after the publish moved to the worker).node/store_aggregate.go(+ type, + worker function)node/metrics.go(2 new metrics)lean_aggregation_worker_total_time_seconds(STATE_TRANSITION_BUCKETS shape) — end-to-end worker pass timelean_aggregation_dispatch_dropped_totalcounter — interval-2 dispatches the worker couldn't pick upChannel-capacity design
AggregationDispatchChis buffered to 1 deliberately. A backlog of more than one slot means the worker is too slow for the protocol cadence; queueing more dispatches would only delay the wrong work. Dropping is the right answer:accept_new_attestationsdoesn't get slot N's aggregate; payloads from prior slots are still inlatest_known_aggregated_payloads)Race analysis
applyAggregationMutationswrites to:KnownPayloads.PushBatch— already mutex-protected against concurrent gossip writesAttestationSignatures.Delete— sameBoth are already designed for concurrent access from gossip handlers. The worker's writes serialise correctly without new locking.
The snapshot read on the tick thread is consistent because
s.AttestationSignatures.Snapshot()takes its own mutex ands.NewPayloads.data/s.KnownPayloads.datareads happen on the tick goroutine which is the sole writer for those particular maps (per engine convention).Behavior change
latest_known_aggregated_payloads.Expected impact
Spec compliance
Zero impact. Same eventual store state, same broadcast set, same FFI calls in the same order. Spec permits off-tick execution provided soft/hard deadlines are met, which gean's p99 already does.
Test plan
go build ./...cleango test -count=1 ./node/...greengo test -count=1 ./...full suite greenlean_aggregation_dispatch_dropped_totalstays at 0 under typical devnet load; confirm tick-loop responsiveness improvement (proxy:lean_state_transition_time_secondsp99 should not be inflated by interval-2 sitting on it)Lean-review
dead-code: ✅ removed unusedcontextimport fromtick.gopremature-abstraction: single worker, single channel, single dispatch type — sized to current need, no speculative dispatch surfacedefensive-bloat: N/Aduplicates-existing-util: N/A — follows existingrunFetchBatchergoroutine pattern innode.gocomment-bloat: 4-line comment onAggregationDispatchChdocuments the capacity-1 design choice (non-obvious); worker has 5-line block on the ethlambda/zeam pattern + drop semantics; both earn their keepover-validated-boundary: N/AStack
Stacks on PR #306 (instrumentation) and PR #308 (snapshot refactor — already merged into
perf/instrument-aggregation-phases). Depends on the snapshot type from #308; consumes the per-phase histograms from #306. Pairs naturally with PR #310 (prep-vector pooling) — D + C together give the biggest combined win.Targets
perf/instrument-aggregation-phases. Will land alongside #306 + #310 before the integration branch merges to devnet-4.