perf(node): instrument interval-2 aggregation phases for attribution (#305)#306
Merged
Conversation
…305) After PR #299 brought per-aggregate FFI to p50 ~407 ms, the pass-level histogram (lean_committee_signatures_aggregation_time_seconds) shows p50 ~682 ms — meaning ~275 ms p50 sits outside the FFI in Go-side prep, post-processing, or pass-end commit. We had no attribution on which phase, so the next round of perf work would be guesswork. Split AggregateCommitteeSignatures' per-aggregate body into three phases plus a per-pass commit phase, each timed: prep : select children + collect raw sigs + parse + pubkey-cache lookups (per data root) ffi : xmss.AggregateWithChildren (already covered by lean_pq_sig_aggregated_signatures_building_time_seconds) post : build participants bitlist + proof struct + accumulator append (per data root) commit : KnownPayloads.PushBatch + AttestationSignatures.Delete (once per pass, batched) Three new histograms with the same {0.005..1} bucket shape used by other short-duration histograms on devnet-4: lean_aggregation_prep_time_seconds lean_aggregation_post_time_seconds lean_aggregation_commit_time_seconds Pure instrumentation. No behavior change. Four time.Now() calls per aggregate (~50 ns each) — negligible against the FFI's 400 ms. After a devnet aggregator window, prep + ffi + post + commit should sum within ~5% of the existing pass-level histogram. Attribution will then drive the next perf decision (pool prep vectors, snapshot pattern, async dispatch) instead of guessing. go build ./... + go test ./node/... green. Closes #305.
5 tasks
…e/apply (#307) Mirrors ethlambda's snapshot_aggregation_inputs → aggregate_job worker → publish structural split (ethlambda/crates/blockchain/src/aggregation.rs). Three new helpers + two new types replace the single monolithic function: snapshotAggregationInputs(s) → *AggregationSnapshot Reads head state, attestation signatures, new/known payload entries, and pre-resolves target states. The only function with a store reference; runs synchronously on the caller's thread. aggregateFromSnapshot(snap, cache) → ([]*SignedAggregatedAttestation, *AggregationMutations) Per-data-root prep/FFI/post phases. Pure function of snapshot and pubkey cache; holds no store reference. Returns aggregates plus a batched mutation set the caller is responsible for applying. applyAggregationMutations(s, mut) Applies the prove phase's KnownPayloads.PushBatch + AttestationSignatures.Delete in one call. AggregateCommitteeSignatures is now a thin wrapper that calls the three in sequence, keeping its public signature and behaviour unchanged. selectChildProofs' last parameter changed from *ConsensusStore to *xmss.PubKeyCache — that was the only store dependency in the body, and threading the cache directly removes the snapshot's only escape hatch into the live store. Pure refactor. Same store reads, same FFI calls, same store mutations, in the same order. All four phase-attribution histograms from #305 stay where they were (still wrap the same code spans). Sets up the prerequisite for Phase D async dispatch: with the snapshot type in place, an off-tick worker goroutine has a well-defined, self-contained input to consume. go build ./... + go test ./node/... green. Closes #307.
This was referenced May 25, 2026
…rove refactor(node): split AggregateCommitteeSignatures into snapshot/prove/apply (#307)
…sure (#309) Previously, aggregateFromSnapshot allocated four nil-initialised slices per data root and discarded them at iteration end: var childProofs []xmss.ChildProof // grew via append doubling var rawPubkeys []xmss.CPubKey var rawSigs []xmss.CSig var rawIDs []uint64 At typical 5-20 data roots per pass × 4 slices, that's 20-80 allocations per pass that the young-gen GC has to clean up. A 400 ms FFI call is a generous window for the runtime to schedule a 10-30 ms STW pause inside — the dominant suspected source of gean's p99 tail above 1 s. ethlambda pays nothing equivalent (Rust ownership, stack-allocated buffers dropped at scope exit per crates/blockchain/src/aggregation.rs:238). This PR adds sync.Pool reuse for all four slice types, plus capacity hints in the pool New functions sized to typical working set (8 children, 32 raw sigs). Pattern mirrors the existing xmss/proof_pool.go — pool stores *[]T, typed get/put wrappers, put resets length to 0 preserving capacity. Refactor of aggregateFromSnapshot: - Wraps the per-data-root loop body in func(){}() so defer pool-Put fires per iteration rather than accumulating to function return - Replaces continue with return inside the anonymous func - Slice access via *bufPtr; append targets *bufPtr = append(*bufPtr, x) Secondary win: defer xmss.FreeSignature(parsed) inside the sig loop also now fires per iteration — previously those C-side handles accumulated across ALL data roots in a pass, only freeing at function return. Tighter memory lifetime. Also adds a capacity hint on allIDs (was 0; now rawIDs + covered count). go build ./... + go test ./node/... green. Closes #309.
This was referenced May 25, 2026
perf(node): pool aggregation prep vectors to cut GC pressure (#309)
…er goroutine (#311) Before this PR, AggregateCommitteeSignatures ran SYNCHRONOUSLY on the consensus tick loop at interval 2 (node/tick.go:52). At gean's measured p99 of ~1.4s, the tick couldn't advance to interval 3 (safe_target), interval 4 (accept_new_attestations), or the next slot's interval 0 (proposal) until the FFI finished. The existing comment acknowledged this as "acceptable until prover is <800ms" — we're now under 800ms at p50 but p99 still slips past. ethlambda runs aggregation on a tokio::spawn_blocking worker with a 750ms deadline (aggregation.rs:395-462); zeam runs worker threads. Spec read confirmed off-tick dispatch is permissible (forks/lstar/spec.py aggregate(store) → (store, aggregates) is a pure transformation; the spec doesn't model WHEN execution happens, only what state must eventually exist before the soft deadline at interval 3 ~800ms after dispatch or the hard deadline at interval 4 ~1600ms after dispatch). This PR mirrors that off-tick pattern: Engine struct: + AggregationDispatchCh chan AggregationDispatch (buffered to 1) Engine.Run(): + go e.runAggregationWorker(ctx) tick.go interval-2: - aggs := AggregateCommitteeSignatures(e.Store) - for _, agg := range aggs { e.P2P.PublishAggregatedAttestation(...) } + snap := snapshotAggregationInputs(e.Store) + select { case e.AggregationDispatchCh <- AggregationDispatch{snap, slot}: default: drop counter } New: runAggregationWorker (store_aggregate.go) - Drains AggregationDispatchCh - Runs aggregateFromSnapshot + applyAggregationMutations off the tick - Publishes aggregates to P2P - Records lean_aggregation_worker_total_time_seconds New: AggregationDispatch type (store_aggregate.go) Carries one slot's snapshot + slot number from tick to worker. New metrics (metrics.go): + lean_aggregation_worker_total_time_seconds (STATE_TRANSITION_BUCKETS) + lean_aggregation_dispatch_dropped_total counter Channel capacity is 1 by design: a backlog of more than one slot means the worker is too slow and the new dispatch should drop rather than queue. Drops are spec-permissible (best-effort aggregation per slot) and surface via the dropped counter. With current numbers (p99 = 1.4s, slot = 4s) drops should be vanishingly rare. Store-mutation races: applyAggregationMutations writes to KnownPayloads (mutex'd in PushBatch) and AttestationSignatures (mutex'd in Delete). Both already protected against concurrent gossip-handler writes; the worker's writes serialise correctly without new locking. Behavior change: the tick loop no longer blocks on aggregation. Aggregation result latency unchanged; what changes is that the rest of the tick can proceed while it runs. Spec compliance: zero impact. Same eventual store state, same broadcast set, same FFI calls in the same per-aggregate order. context import removed from tick.go (no longer used after dispatch rewrite; previously imported only for context.Background() at the publish call site, which moved to the worker). go build ./... + go test -count=1 ./... all packages green. Closes #311.
Merged
5 tasks
feat(node): move interval-2 aggregation off the tick loop into a worker goroutine (#311)
After Phase D moved aggregation off-tick (#311), the function-level wrapper has no callers: tick.go calls snapshotAggregationInputs directly; the worker goroutine calls aggregateFromSnapshot directly. grep -rn AggregateCommitteeSignatures across node/, api/, cmd/ returns only the dead definition itself. By extension also dead: - metricCommitteeAggregationTime histogram (the pass-level lean_committee_signatures_aggregation_time_seconds metric) - ObserveCommitteeAggregationTime wrapper (only callsite was the dead wrapper's defer) Both are superseded by lean_aggregation_worker_total_time_seconds (added in #311), which measures the same end-to-end span on the new off-tick code path. The pass-level metric is gean-specific (0 hits in leanSpec registry) so removing it is internal cleanup, not a spec break. Also updates a stale comment reference in xmss/aggregate_test.go:49. Net: -25 LOC of dead code.
Of the 4 phase-attribution histograms added in #305, only prep does non-trivial work (XMSS sig parsing, pubkey cache lookups, child selection). post (build participants bitlist + struct + append) and commit (KnownPayloads.PushBatch + AttestationSignatures.Delete) are mechanical Go-side operations expected to be sub-millisecond and produce dead-bucket distributions. The pass-level metric they were meant to attribute alongside is itself dead after lean-review #1 (removed lean_committee_signatures_aggregation_time_seconds). If real measurements ever show post or commit matter, they're trivial to add back. Keep prep (real attribution value) + worker_total (end-to-end span, added in #311). Net: -16 LOC (2 histograms + 2 Observe wrappers + 2 callsites + 2 time.Now() captures).
The struct was a 2-field tuple (PayloadEntries, KeysToDelete) used in exactly one place. Idiomatic Go uses multi-value returns for this case. aggregateFromSnapshot now returns ([]*SignedAggregatedAttestation, []PayloadKV, []AttestationDeleteKey); applyAggregationMutations takes the two slices directly. The mut.PayloadEntries / mut.KeysToDelete field-access noise becomes direct slice names. Net: -5 LOC (struct + 5-line doc comment removed; one new local var declaration vs the previous struct alloc; field access simplified).
Dropped 5 lines describing generic sync.Pool behavior (capacity growth, length reset on Put) — every reader of sync.Pool already knows this. Kept the WHY (GC pressure during FFI) and the xmss/proof_pool.go cross-reference. Net: -5 LOC.
7-line comment was duplicating what runAggregationWorker's docstring covers. Trimmed to 2 lines pointing at the worker for the rationale. Net: -5 LOC.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Splits
AggregateCommitteeSignaturesper-aggregate body into three phases (prep/ffi/post) plus a per-passcommitphase, each emitting its own histogram. Pure instrumentation — zero behavior change. Closes #305.Why
After PR #299 dropped per-aggregate FFI to p50 = 407 ms, the pass-level
lean_committee_signatures_aggregation_time_secondsreports p50 = 682 ms — meaning ~275 ms sits outside the FFI in Go-side prep / post / commit. We have no attribution on which.Without this PR, the next perf step (pool prep vectors, snapshot pattern, async dispatch) would be guesswork. After this PR, one 10-minute devnet scrape tells us where to invest.
What changed
node/metrics.go(+22 LOC)Three new histograms with the same
{0.005, 0.01, 0.025, 0.05, 0.1, 1}bucket shape used by other short-duration histograms on devnet-4 (e.g.lean_attestation_validation_time_seconds). Same shape means same +Inf-overflow semantics as the rest of the family:```go
metricAggregationPrepTime = ... Name: "lean_aggregation_prep_time_seconds"
metricAggregationPostTime = ... Name: "lean_aggregation_post_time_seconds"
metricAggregationCommitTime = ... Name: "lean_aggregation_commit_time_seconds"
```
Plus three matching one-line
Observe*functions in the existing observer block.node/store_aggregate.go(+7 LOC)Four
time.Sincecaptures around the phase boundaries:prepStartat top of per-data-root loop body → emit ataggStartpostStartimmediately after FFI completes → emit at end of loop bodycommitStartimmediately beforeKnownPayloads.PushBatch→ emit afterAttestationSignatures.DeleteFFI is already covered by the existing
lean_pq_sig_aggregated_signatures_building_time_seconds— left alone.Behavior change
None. Four extra
time.Now()calls per aggregate (~50 ns each) — negligible vs the 400 ms FFI cost. No new locks, no new allocations beyond the histogram buckets themselves.Test plan
Spec compliance
Zero impact. `lean_aggregation_*` is a gean-specific metric family; `grep` of `leanSpec/src/lean_spec/subspecs/metrics/registry.py` returns no matching name. Same posture as the existing `lean_committee_signatures_aggregation_time_seconds` (verified gean-specific in PR #301).
Lean-review
Why land this first
This issue is the must-precede for the rest of the perf roadmap (Issues TBD):
childProofs,rawPubkeys,rawSigs,allIDs)Will queue follow-up issues once we have the attribution data from the first scrape.
Related