Skip to content

perf(node): instrument interval-2 aggregation phases for attribution (#305)#306

Merged
mananuf merged 12 commits into
devnet-4from
perf/instrument-aggregation-phases
May 25, 2026
Merged

perf(node): instrument interval-2 aggregation phases for attribution (#305)#306
mananuf merged 12 commits into
devnet-4from
perf/instrument-aggregation-phases

Conversation

@mananuf
Copy link
Copy Markdown
Collaborator

@mananuf mananuf commented May 25, 2026

Summary

Splits AggregateCommitteeSignatures per-aggregate body into three phases (prep / ffi / post) plus a per-pass commit phase, each emitting its own histogram. Pure instrumentation — zero behavior change. Closes #305.

Why

After PR #299 dropped per-aggregate FFI to p50 = 407 ms, the pass-level lean_committee_signatures_aggregation_time_seconds reports p50 = 682 ms — meaning ~275 ms sits outside the FFI in Go-side prep / post / commit. We have no attribution on which.

Without this PR, the next perf step (pool prep vectors, snapshot pattern, async dispatch) would be guesswork. After this PR, one 10-minute devnet scrape tells us where to invest.

What changed

node/metrics.go (+22 LOC)

Three new histograms with the same {0.005, 0.01, 0.025, 0.05, 0.1, 1} bucket shape used by other short-duration histograms on devnet-4 (e.g. lean_attestation_validation_time_seconds). Same shape means same +Inf-overflow semantics as the rest of the family:

```go
metricAggregationPrepTime = ... Name: "lean_aggregation_prep_time_seconds"
metricAggregationPostTime = ... Name: "lean_aggregation_post_time_seconds"
metricAggregationCommitTime = ... Name: "lean_aggregation_commit_time_seconds"
```

Plus three matching one-line Observe* functions in the existing observer block.

node/store_aggregate.go (+7 LOC)

Four time.Since captures around the phase boundaries:

  • prepStart at top of per-data-root loop body → emit at aggStart
  • postStart immediately after FFI completes → emit at end of loop body
  • commitStart immediately before KnownPayloads.PushBatch → emit after AttestationSignatures.Delete

FFI is already covered by the existing lean_pq_sig_aggregated_signatures_building_time_seconds — left alone.

Behavior change

None. Four extra time.Now() calls per aggregate (~50 ns each) — negligible vs the 400 ms FFI cost. No new locks, no new allocations beyond the histogram buckets themselves.

Test plan

  • `go build ./...` clean
  • `go test -count=1 ./node/...` green
  • Post-merge: 10-minute devnet aggregator window, scrape:
    • All three new histograms present in `/metrics`
    • `prep_sum + ffi_sum + post_sum + commit_sum ≈ pass-level sum` within ~5% (a larger drift means the instrumentation has a gap to fix)

Spec compliance

Zero impact. `lean_aggregation_*` is a gean-specific metric family; `grep` of `leanSpec/src/lean_spec/subspecs/metrics/registry.py` returns no matching name. Same posture as the existing `lean_committee_signatures_aggregation_time_seconds` (verified gean-specific in PR #301).

Lean-review

  • `dead-code`: N/A
  • `premature-abstraction`: N/A — three histograms because three distinct phases; no helper, no wrapper
  • `defensive-bloat`: N/A
  • `duplicates-existing-util`: N/A — FFI already covered, not duplicated
  • `comment-bloat`: 4-line comment block on the metric var group cross-references existing FFI histogram so future readers don't add a fourth duplicate; earns its keep
  • `over-validated-boundary`: N/A

Why land this first

This issue is the must-precede for the rest of the perf roadmap (Issues TBD):

  • If `prep` dominates → next PR is pool/pre-allocate the four prep vectors (childProofs, rawPubkeys, rawSigs, allIDs)
  • If `commit` dominates → next PR is batched store mutation
  • If everything is spread → next PR is the snapshot+async pattern (ethlambda-style)

Will queue follow-up issues once we have the attribution data from the first scrape.

Related

…305)

After PR #299 brought per-aggregate FFI to p50 ~407 ms, the pass-level
histogram (lean_committee_signatures_aggregation_time_seconds) shows
p50 ~682 ms — meaning ~275 ms p50 sits outside the FFI in Go-side
prep, post-processing, or pass-end commit. We had no attribution on
which phase, so the next round of perf work would be guesswork.

Split AggregateCommitteeSignatures' per-aggregate body into three
phases plus a per-pass commit phase, each timed:

  prep   : select children + collect raw sigs + parse + pubkey-cache
           lookups (per data root)
  ffi    : xmss.AggregateWithChildren (already covered by
           lean_pq_sig_aggregated_signatures_building_time_seconds)
  post   : build participants bitlist + proof struct + accumulator
           append (per data root)
  commit : KnownPayloads.PushBatch + AttestationSignatures.Delete
           (once per pass, batched)

Three new histograms with the same {0.005..1} bucket shape used by
other short-duration histograms on devnet-4:

  lean_aggregation_prep_time_seconds
  lean_aggregation_post_time_seconds
  lean_aggregation_commit_time_seconds

Pure instrumentation. No behavior change. Four time.Now() calls per
aggregate (~50 ns each) — negligible against the FFI's 400 ms.

After a devnet aggregator window, prep + ffi + post + commit should
sum within ~5% of the existing pass-level histogram. Attribution
will then drive the next perf decision (pool prep vectors, snapshot
pattern, async dispatch) instead of guessing.

go build ./... + go test ./node/... green.

Closes #305.
…e/apply (#307)

Mirrors ethlambda's snapshot_aggregation_inputs → aggregate_job worker
→ publish structural split (ethlambda/crates/blockchain/src/aggregation.rs).

Three new helpers + two new types replace the single monolithic function:

  snapshotAggregationInputs(s) → *AggregationSnapshot
      Reads head state, attestation signatures, new/known payload entries,
      and pre-resolves target states. The only function with a store
      reference; runs synchronously on the caller's thread.

  aggregateFromSnapshot(snap, cache) → ([]*SignedAggregatedAttestation, *AggregationMutations)
      Per-data-root prep/FFI/post phases. Pure function of snapshot and
      pubkey cache; holds no store reference. Returns aggregates plus a
      batched mutation set the caller is responsible for applying.

  applyAggregationMutations(s, mut)
      Applies the prove phase's KnownPayloads.PushBatch +
      AttestationSignatures.Delete in one call.

AggregateCommitteeSignatures is now a thin wrapper that calls the three
in sequence, keeping its public signature and behaviour unchanged.

selectChildProofs' last parameter changed from *ConsensusStore to
*xmss.PubKeyCache — that was the only store dependency in the body, and
threading the cache directly removes the snapshot's only escape hatch
into the live store.

Pure refactor. Same store reads, same FFI calls, same store mutations,
in the same order. All four phase-attribution histograms from #305 stay
where they were (still wrap the same code spans).

Sets up the prerequisite for Phase D async dispatch: with the snapshot
type in place, an off-tick worker goroutine has a well-defined,
self-contained input to consume.

go build ./... + go test ./node/... green.

Closes #307.
mananuf and others added 2 commits May 25, 2026 13:08
…rove

refactor(node): split AggregateCommitteeSignatures into snapshot/prove/apply (#307)
…sure (#309)

Previously, aggregateFromSnapshot allocated four nil-initialised slices
per data root and discarded them at iteration end:

  var childProofs []xmss.ChildProof  // grew via append doubling
  var rawPubkeys  []xmss.CPubKey
  var rawSigs     []xmss.CSig
  var rawIDs      []uint64

At typical 5-20 data roots per pass × 4 slices, that's 20-80 allocations
per pass that the young-gen GC has to clean up. A 400 ms FFI call is a
generous window for the runtime to schedule a 10-30 ms STW pause inside
— the dominant suspected source of gean's p99 tail above 1 s.

ethlambda pays nothing equivalent (Rust ownership, stack-allocated
buffers dropped at scope exit per crates/blockchain/src/aggregation.rs:238).

This PR adds sync.Pool reuse for all four slice types, plus capacity
hints in the pool New functions sized to typical working set (8 children,
32 raw sigs). Pattern mirrors the existing xmss/proof_pool.go — pool
stores *[]T, typed get/put wrappers, put resets length to 0 preserving
capacity.

Refactor of aggregateFromSnapshot:

  - Wraps the per-data-root loop body in func(){}() so defer pool-Put
    fires per iteration rather than accumulating to function return
  - Replaces continue with return inside the anonymous func
  - Slice access via *bufPtr; append targets *bufPtr = append(*bufPtr, x)

Secondary win: defer xmss.FreeSignature(parsed) inside the sig loop also
now fires per iteration — previously those C-side handles accumulated
across ALL data roots in a pass, only freeing at function return.
Tighter memory lifetime.

Also adds a capacity hint on allIDs (was 0; now rawIDs + covered count).

go build ./... + go test ./node/... green.

Closes #309.
mananuf and others added 2 commits May 25, 2026 13:13
perf(node): pool aggregation prep vectors to cut GC pressure (#309)
…er goroutine (#311)

Before this PR, AggregateCommitteeSignatures ran SYNCHRONOUSLY on the
consensus tick loop at interval 2 (node/tick.go:52). At gean's measured
p99 of ~1.4s, the tick couldn't advance to interval 3 (safe_target),
interval 4 (accept_new_attestations), or the next slot's interval 0
(proposal) until the FFI finished. The existing comment
acknowledged this as "acceptable until prover is <800ms" — we're now
under 800ms at p50 but p99 still slips past.

ethlambda runs aggregation on a tokio::spawn_blocking worker with a
750ms deadline (aggregation.rs:395-462); zeam runs worker threads.
Spec read confirmed off-tick dispatch is permissible (forks/lstar/spec.py
aggregate(store) → (store, aggregates) is a pure transformation; the
spec doesn't model WHEN execution happens, only what state must
eventually exist before the soft deadline at interval 3 ~800ms after
dispatch or the hard deadline at interval 4 ~1600ms after dispatch).

This PR mirrors that off-tick pattern:

  Engine struct:
    + AggregationDispatchCh chan AggregationDispatch (buffered to 1)

  Engine.Run():
    + go e.runAggregationWorker(ctx)

  tick.go interval-2:
    - aggs := AggregateCommitteeSignatures(e.Store)
    - for _, agg := range aggs { e.P2P.PublishAggregatedAttestation(...) }
    + snap := snapshotAggregationInputs(e.Store)
    + select { case e.AggregationDispatchCh <- AggregationDispatch{snap, slot}: default: drop counter }

  New: runAggregationWorker (store_aggregate.go)
    - Drains AggregationDispatchCh
    - Runs aggregateFromSnapshot + applyAggregationMutations off the tick
    - Publishes aggregates to P2P
    - Records lean_aggregation_worker_total_time_seconds

  New: AggregationDispatch type (store_aggregate.go)
    Carries one slot's snapshot + slot number from tick to worker.

  New metrics (metrics.go):
    + lean_aggregation_worker_total_time_seconds (STATE_TRANSITION_BUCKETS)
    + lean_aggregation_dispatch_dropped_total counter

Channel capacity is 1 by design: a backlog of more than one slot
means the worker is too slow and the new dispatch should drop rather
than queue. Drops are spec-permissible (best-effort aggregation per
slot) and surface via the dropped counter. With current numbers
(p99 = 1.4s, slot = 4s) drops should be vanishingly rare.

Store-mutation races: applyAggregationMutations writes to KnownPayloads
(mutex'd in PushBatch) and AttestationSignatures (mutex'd in Delete).
Both already protected against concurrent gossip-handler writes; the
worker's writes serialise correctly without new locking.

Behavior change: the tick loop no longer blocks on aggregation.
Aggregation result latency unchanged; what changes is that the rest
of the tick can proceed while it runs.

Spec compliance: zero impact. Same eventual store state, same broadcast
set, same FFI calls in the same per-aggregate order.

context import removed from tick.go (no longer used after dispatch
rewrite; previously imported only for context.Background() at the
publish call site, which moved to the worker).

go build ./... + go test -count=1 ./... all packages green.

Closes #311.
mananuf and others added 6 commits May 25, 2026 13:21
feat(node): move interval-2 aggregation off the tick loop into a worker goroutine (#311)
After Phase D moved aggregation off-tick (#311), the function-level
wrapper has no callers: tick.go calls snapshotAggregationInputs
directly; the worker goroutine calls aggregateFromSnapshot directly.
grep -rn AggregateCommitteeSignatures across node/, api/, cmd/ returns
only the dead definition itself.

By extension also dead:
  - metricCommitteeAggregationTime histogram (the pass-level
    lean_committee_signatures_aggregation_time_seconds metric)
  - ObserveCommitteeAggregationTime wrapper (only callsite was the
    dead wrapper's defer)

Both are superseded by lean_aggregation_worker_total_time_seconds
(added in #311), which measures the same end-to-end span on the new
off-tick code path. The pass-level metric is gean-specific (0 hits in
leanSpec registry) so removing it is internal cleanup, not a spec break.

Also updates a stale comment reference in xmss/aggregate_test.go:49.

Net: -25 LOC of dead code.
Of the 4 phase-attribution histograms added in #305, only prep does
non-trivial work (XMSS sig parsing, pubkey cache lookups, child
selection). post (build participants bitlist + struct + append) and
commit (KnownPayloads.PushBatch + AttestationSignatures.Delete) are
mechanical Go-side operations expected to be sub-millisecond and
produce dead-bucket distributions.

The pass-level metric they were meant to attribute alongside is itself
dead after lean-review #1 (removed lean_committee_signatures_aggregation_time_seconds).

If real measurements ever show post or commit matter, they're trivial
to add back. Keep prep (real attribution value) + worker_total
(end-to-end span, added in #311).

Net: -16 LOC (2 histograms + 2 Observe wrappers + 2 callsites + 2
time.Now() captures).
The struct was a 2-field tuple (PayloadEntries, KeysToDelete) used in
exactly one place. Idiomatic Go uses multi-value returns for this case.

aggregateFromSnapshot now returns ([]*SignedAggregatedAttestation,
[]PayloadKV, []AttestationDeleteKey); applyAggregationMutations takes
the two slices directly. The mut.PayloadEntries / mut.KeysToDelete
field-access noise becomes direct slice names.

Net: -5 LOC (struct + 5-line doc comment removed; one new local var
declaration vs the previous struct alloc; field access simplified).
Dropped 5 lines describing generic sync.Pool behavior (capacity growth,
length reset on Put) — every reader of sync.Pool already knows this.
Kept the WHY (GC pressure during FFI) and the xmss/proof_pool.go
cross-reference.

Net: -5 LOC.
7-line comment was duplicating what runAggregationWorker's docstring
covers. Trimmed to 2 lines pointing at the worker for the rationale.

Net: -5 LOC.
@mananuf mananuf merged commit 5b5f1ff into devnet-4 May 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant