perf(node): instrument interval-2 aggregation phases for attribution (#305) by mananuf · Pull Request #306 · geanlabs/gean

mananuf · 2026-05-25T11:53:13Z

Summary

Splits AggregateCommitteeSignatures per-aggregate body into three phases (prep / ffi / post) plus a per-pass commit phase, each emitting its own histogram. Pure instrumentation — zero behavior change. Closes #305.

Why

After PR #299 dropped per-aggregate FFI to p50 = 407 ms, the pass-level lean_committee_signatures_aggregation_time_seconds reports p50 = 682 ms — meaning ~275 ms sits outside the FFI in Go-side prep / post / commit. We have no attribution on which.

Without this PR, the next perf step (pool prep vectors, snapshot pattern, async dispatch) would be guesswork. After this PR, one 10-minute devnet scrape tells us where to invest.

What changed

`node/metrics.go` (+22 LOC)

Three new histograms with the same {0.005, 0.01, 0.025, 0.05, 0.1, 1} bucket shape used by other short-duration histograms on devnet-4 (e.g. lean_attestation_validation_time_seconds). Same shape means same +Inf-overflow semantics as the rest of the family:

```go
metricAggregationPrepTime = ... Name: "lean_aggregation_prep_time_seconds"
metricAggregationPostTime = ... Name: "lean_aggregation_post_time_seconds"
metricAggregationCommitTime = ... Name: "lean_aggregation_commit_time_seconds"
```

Plus three matching one-line Observe* functions in the existing observer block.

`node/store_aggregate.go` (+7 LOC)

Four time.Since captures around the phase boundaries:

prepStart at top of per-data-root loop body → emit at aggStart
postStart immediately after FFI completes → emit at end of loop body
commitStart immediately before KnownPayloads.PushBatch → emit after AttestationSignatures.Delete

FFI is already covered by the existing lean_pq_sig_aggregated_signatures_building_time_seconds — left alone.

Behavior change

None. Four extra time.Now() calls per aggregate (~50 ns each) — negligible vs the 400 ms FFI cost. No new locks, no new allocations beyond the histogram buckets themselves.

Test plan

`go build ./...` clean
`go test -count=1 ./node/...` green
Post-merge: 10-minute devnet aggregator window, scrape:
- All three new histograms present in `/metrics`
- `prep_sum + ffi_sum + post_sum + commit_sum ≈ pass-level sum` within ~5% (a larger drift means the instrumentation has a gap to fix)

Spec compliance

Zero impact. `lean_aggregation_*` is a gean-specific metric family; `grep` of `leanSpec/src/lean_spec/subspecs/metrics/registry.py` returns no matching name. Same posture as the existing `lean_committee_signatures_aggregation_time_seconds` (verified gean-specific in PR #301).

Lean-review

`dead-code`: N/A
`premature-abstraction`: N/A — three histograms because three distinct phases; no helper, no wrapper
`defensive-bloat`: N/A
`duplicates-existing-util`: N/A — FFI already covered, not duplicated
`comment-bloat`: 4-line comment block on the metric var group cross-references existing FFI histogram so future readers don't add a fourth duplicate; earns its keep
`over-validated-boundary`: N/A

Why land this first

This issue is the must-precede for the rest of the perf roadmap (Issues TBD):

If `prep` dominates → next PR is pool/pre-allocate the four prep vectors (childProofs, rawPubkeys, rawSigs, allIDs)
If `commit` dominates → next PR is batched store mutation
If everything is spread → next PR is the snapshot+async pattern (ethlambda-style)

Will queue follow-up issues once we have the attribution data from the first scrape.

Closes perf: instrument interval-2 aggregation path for prep / FFI / post / commit attribution #305
Builds on PR perf(xmss): adopt [profile.multisig-release] + gean-glue shim (#298) #299 (perf profile + cgo-glue shim) — already merged
Builds on PR metric: replace 10 duplicated bucket literals with leanSpec-named constants (#303) #304 (named bucket constants) — open, not blocking

…305) After PR #299 brought per-aggregate FFI to p50 ~407 ms, the pass-level histogram (lean_committee_signatures_aggregation_time_seconds) shows p50 ~682 ms — meaning ~275 ms p50 sits outside the FFI in Go-side prep, post-processing, or pass-end commit. We had no attribution on which phase, so the next round of perf work would be guesswork. Split AggregateCommitteeSignatures' per-aggregate body into three phases plus a per-pass commit phase, each timed: prep : select children + collect raw sigs + parse + pubkey-cache lookups (per data root) ffi : xmss.AggregateWithChildren (already covered by lean_pq_sig_aggregated_signatures_building_time_seconds) post : build participants bitlist + proof struct + accumulator append (per data root) commit : KnownPayloads.PushBatch + AttestationSignatures.Delete (once per pass, batched) Three new histograms with the same {0.005..1} bucket shape used by other short-duration histograms on devnet-4: lean_aggregation_prep_time_seconds lean_aggregation_post_time_seconds lean_aggregation_commit_time_seconds Pure instrumentation. No behavior change. Four time.Now() calls per aggregate (~50 ns each) — negligible against the FFI's 400 ms. After a devnet aggregator window, prep + ffi + post + commit should sum within ~5% of the existing pass-level histogram. Attribution will then drive the next perf decision (pool prep vectors, snapshot pattern, async dispatch) instead of guessing. go build ./... + go test ./node/... green. Closes #305.

…e/apply (#307) Mirrors ethlambda's snapshot_aggregation_inputs → aggregate_job worker → publish structural split (ethlambda/crates/blockchain/src/aggregation.rs). Three new helpers + two new types replace the single monolithic function: snapshotAggregationInputs(s) → *AggregationSnapshot Reads head state, attestation signatures, new/known payload entries, and pre-resolves target states. The only function with a store reference; runs synchronously on the caller's thread. aggregateFromSnapshot(snap, cache) → ([]*SignedAggregatedAttestation, *AggregationMutations) Per-data-root prep/FFI/post phases. Pure function of snapshot and pubkey cache; holds no store reference. Returns aggregates plus a batched mutation set the caller is responsible for applying. applyAggregationMutations(s, mut) Applies the prove phase's KnownPayloads.PushBatch + AttestationSignatures.Delete in one call. AggregateCommitteeSignatures is now a thin wrapper that calls the three in sequence, keeping its public signature and behaviour unchanged. selectChildProofs' last parameter changed from *ConsensusStore to *xmss.PubKeyCache — that was the only store dependency in the body, and threading the cache directly removes the snapshot's only escape hatch into the live store. Pure refactor. Same store reads, same FFI calls, same store mutations, in the same order. All four phase-attribution histograms from #305 stay where they were (still wrap the same code spans). Sets up the prerequisite for Phase D async dispatch: with the snapshot type in place, an off-tick worker goroutine has a well-defined, self-contained input to consume. go build ./... + go test ./node/... green. Closes #307.

…rove refactor(node): split AggregateCommitteeSignatures into snapshot/prove/apply (#307)

…sure (#309) Previously, aggregateFromSnapshot allocated four nil-initialised slices per data root and discarded them at iteration end: var childProofs []xmss.ChildProof // grew via append doubling var rawPubkeys []xmss.CPubKey var rawSigs []xmss.CSig var rawIDs []uint64 At typical 5-20 data roots per pass × 4 slices, that's 20-80 allocations per pass that the young-gen GC has to clean up. A 400 ms FFI call is a generous window for the runtime to schedule a 10-30 ms STW pause inside — the dominant suspected source of gean's p99 tail above 1 s. ethlambda pays nothing equivalent (Rust ownership, stack-allocated buffers dropped at scope exit per crates/blockchain/src/aggregation.rs:238). This PR adds sync.Pool reuse for all four slice types, plus capacity hints in the pool New functions sized to typical working set (8 children, 32 raw sigs). Pattern mirrors the existing xmss/proof_pool.go — pool stores *[]T, typed get/put wrappers, put resets length to 0 preserving capacity. Refactor of aggregateFromSnapshot: - Wraps the per-data-root loop body in func(){}() so defer pool-Put fires per iteration rather than accumulating to function return - Replaces continue with return inside the anonymous func - Slice access via *bufPtr; append targets *bufPtr = append(*bufPtr, x) Secondary win: defer xmss.FreeSignature(parsed) inside the sig loop also now fires per iteration — previously those C-side handles accumulated across ALL data roots in a pass, only freeing at function return. Tighter memory lifetime. Also adds a capacity hint on allIDs (was 0; now rawIDs + covered count). go build ./... + go test ./node/... green. Closes #309.

perf(node): pool aggregation prep vectors to cut GC pressure (#309)

…er goroutine (#311) Before this PR, AggregateCommitteeSignatures ran SYNCHRONOUSLY on the consensus tick loop at interval 2 (node/tick.go:52). At gean's measured p99 of ~1.4s, the tick couldn't advance to interval 3 (safe_target), interval 4 (accept_new_attestations), or the next slot's interval 0 (proposal) until the FFI finished. The existing comment acknowledged this as "acceptable until prover is <800ms" — we're now under 800ms at p50 but p99 still slips past. ethlambda runs aggregation on a tokio::spawn_blocking worker with a 750ms deadline (aggregation.rs:395-462); zeam runs worker threads. Spec read confirmed off-tick dispatch is permissible (forks/lstar/spec.py aggregate(store) → (store, aggregates) is a pure transformation; the spec doesn't model WHEN execution happens, only what state must eventually exist before the soft deadline at interval 3 ~800ms after dispatch or the hard deadline at interval 4 ~1600ms after dispatch). This PR mirrors that off-tick pattern: Engine struct: + AggregationDispatchCh chan AggregationDispatch (buffered to 1) Engine.Run(): + go e.runAggregationWorker(ctx) tick.go interval-2: - aggs := AggregateCommitteeSignatures(e.Store) - for _, agg := range aggs { e.P2P.PublishAggregatedAttestation(...) } + snap := snapshotAggregationInputs(e.Store) + select { case e.AggregationDispatchCh <- AggregationDispatch{snap, slot}: default: drop counter } New: runAggregationWorker (store_aggregate.go) - Drains AggregationDispatchCh - Runs aggregateFromSnapshot + applyAggregationMutations off the tick - Publishes aggregates to P2P - Records lean_aggregation_worker_total_time_seconds New: AggregationDispatch type (store_aggregate.go) Carries one slot's snapshot + slot number from tick to worker. New metrics (metrics.go): + lean_aggregation_worker_total_time_seconds (STATE_TRANSITION_BUCKETS) + lean_aggregation_dispatch_dropped_total counter Channel capacity is 1 by design: a backlog of more than one slot means the worker is too slow and the new dispatch should drop rather than queue. Drops are spec-permissible (best-effort aggregation per slot) and surface via the dropped counter. With current numbers (p99 = 1.4s, slot = 4s) drops should be vanishingly rare. Store-mutation races: applyAggregationMutations writes to KnownPayloads (mutex'd in PushBatch) and AttestationSignatures (mutex'd in Delete). Both already protected against concurrent gossip-handler writes; the worker's writes serialise correctly without new locking. Behavior change: the tick loop no longer blocks on aggregation. Aggregation result latency unchanged; what changes is that the rest of the tick can proceed while it runs. Spec compliance: zero impact. Same eventual store state, same broadcast set, same FFI calls in the same per-aggregate order. context import removed from tick.go (no longer used after dispatch rewrite; previously imported only for context.Background() at the publish call site, which moved to the worker). go build ./... + go test -count=1 ./... all packages green. Closes #311.

feat(node): move interval-2 aggregation off the tick loop into a worker goroutine (#311)

After Phase D moved aggregation off-tick (#311), the function-level wrapper has no callers: tick.go calls snapshotAggregationInputs directly; the worker goroutine calls aggregateFromSnapshot directly. grep -rn AggregateCommitteeSignatures across node/, api/, cmd/ returns only the dead definition itself. By extension also dead: - metricCommitteeAggregationTime histogram (the pass-level lean_committee_signatures_aggregation_time_seconds metric) - ObserveCommitteeAggregationTime wrapper (only callsite was the dead wrapper's defer) Both are superseded by lean_aggregation_worker_total_time_seconds (added in #311), which measures the same end-to-end span on the new off-tick code path. The pass-level metric is gean-specific (0 hits in leanSpec registry) so removing it is internal cleanup, not a spec break. Also updates a stale comment reference in xmss/aggregate_test.go:49. Net: -25 LOC of dead code.

Of the 4 phase-attribution histograms added in #305, only prep does non-trivial work (XMSS sig parsing, pubkey cache lookups, child selection). post (build participants bitlist + struct + append) and commit (KnownPayloads.PushBatch + AttestationSignatures.Delete) are mechanical Go-side operations expected to be sub-millisecond and produce dead-bucket distributions. The pass-level metric they were meant to attribute alongside is itself dead after lean-review #1 (removed lean_committee_signatures_aggregation_time_seconds). If real measurements ever show post or commit matter, they're trivial to add back. Keep prep (real attribution value) + worker_total (end-to-end span, added in #311). Net: -16 LOC (2 histograms + 2 Observe wrappers + 2 callsites + 2 time.Now() captures).

The struct was a 2-field tuple (PayloadEntries, KeysToDelete) used in exactly one place. Idiomatic Go uses multi-value returns for this case. aggregateFromSnapshot now returns ([]*SignedAggregatedAttestation, []PayloadKV, []AttestationDeleteKey); applyAggregationMutations takes the two slices directly. The mut.PayloadEntries / mut.KeysToDelete field-access noise becomes direct slice names. Net: -5 LOC (struct + 5-line doc comment removed; one new local var declaration vs the previous struct alloc; field access simplified).

Dropped 5 lines describing generic sync.Pool behavior (capacity growth, length reset on Put) — every reader of sync.Pool already knows this. Kept the WHY (GC pressure during FFI) and the xmss/proof_pool.go cross-reference. Net: -5 LOC.

7-line comment was duplicating what runAggregationWorker's docstring covers. Trimmed to 2 lines pointing at the worker for the rationale. Net: -5 LOC.

mananuf mentioned this pull request May 25, 2026

refactor(node): split AggregateCommitteeSignatures into snapshot + prove phases #307

Closed

5 tasks

This was referenced May 25, 2026

refactor(node): split AggregateCommitteeSignatures into snapshot/prove/apply (#307) #308

Merged

perf(node): pool + capacity-hint aggregation prep vectors to cut GC pressure #309

Closed

mananuf and others added 2 commits May 25, 2026 13:08

Merge pull request #308 from geanlabs/refactor/aggregation-snapshot-p…

66c712e

…rove refactor(node): split AggregateCommitteeSignatures into snapshot/prove/apply (#307)

This was referenced May 25, 2026

perf(node): pool aggregation prep vectors to cut GC pressure (#309) #310

Merged

feat(node): move interval-2 aggregation off the tick loop into a worker goroutine #311

Closed

mananuf and others added 2 commits May 25, 2026 13:13

Merge pull request #310 from geanlabs/perf/pool-aggregation-prep-vectors

22cea31

perf(node): pool aggregation prep vectors to cut GC pressure (#309)

mananuf mentioned this pull request May 25, 2026

feat(node): move interval-2 aggregation off the tick loop into a worker goroutine (#311) #312

Merged

5 tasks

mananuf and others added 6 commits May 25, 2026 13:21

Merge pull request #312 from geanlabs/feat/aggregation-worker-goroutine

f08a6c0

feat(node): move interval-2 aggregation off the tick loop into a worker goroutine (#311)

lean-review #5: collapse tick.go interval-2 comment block

aa5e204

7-line comment was duplicating what runAggregationWorker's docstring covers. Trimmed to 2 lines pointing at the worker for the rationale. Net: -5 LOC.

mananuf merged commit 5b5f1ff into devnet-4 May 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(node): instrument interval-2 aggregation phases for attribution (#305)#306

perf(node): instrument interval-2 aggregation phases for attribution (#305)#306
mananuf merged 12 commits into
devnet-4from
perf/instrument-aggregation-phases

mananuf commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mananuf commented May 25, 2026

Summary

Why

What changed

node/metrics.go (+22 LOC)

node/store_aggregate.go (+7 LOC)

Behavior change

Test plan

Spec compliance

Lean-review

Why land this first

Related

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`node/metrics.go` (+22 LOC)

`node/store_aggregate.go` (+7 LOC)