T1: eBPF profiling capture of a live qfc-node validator #123
Merged
Kernel-level traces from qfc-testnet-node (VPS-B, Ubuntu 24.04 / kernel 6.17 aarch64, EBS-backed nvme) against the running qfc-node-1 container, per ROADMAP-SRE T1.

Deliverables: raw traces (docs/profiling/ebpf/), reproducible bpftrace programs + runner (scripts/profiling/), and a findings note (docs/profiling/T1-eBPF.md).

Findings (steady-state testnet, image staging-sha-8cf3cb0, which predates the SRE branch; read as a production baseline):

1. ~47% of on-CPU time is BLAKE3 on the *portable (scalar)* backend on an aarch64 host; NEON SIMD isn't compiled in. Roughly half of node CPU is recoverable. Highest-value optimization target; feeds a T3 follow-up.
2. ~17% of on-CPU time plus ~20k write()/s is sync-protocol response *logging* (SyncResponse formatted via tracing_subscriber::fmt, mirrored by containerd-shim/dockerd). Chain-side analogue of the Loki disk incident.
3. The deployed build does NOT fsync canonical blocks (only alloy fsyncs; qfc-node off-CPU is futex + tcp, no io_schedule). This is the baseline that T3.2 (PR #103 set_sync) will change, and why RPO=0 matters.
4. Disk is EBS: the cost is write service time of 1-8 ms (tail 64 ms), not IOPS. This supports the T3.1 cache/bloom work and confirms per-block fsync is affordable within the block interval.

Captures: fsync latency, block-IO latency by RWBS, off-CPU stacks, write-path syscalls by thread, on-CPU perf flame (folded stacks committed; render with inferno/flamegraph). No write-stall flame exists in this build (finding 3); under-stress and post-T3.2-deploy captures are noted as follow-ups.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
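The affordability claim in finding 4 comes down to simple arithmetic: one fsync per block costs one write service time out of each block interval. The sketch below makes that explicit. The 1, 8 and 64 ms figures are the latencies quoted above; the block interval is not stated in this PR, so `ASSUMED_BLOCK_INTERVAL_MS` is a placeholder to be replaced with the real value, and `fsync_budget` is a hypothetical helper, not part of the repository.

```python
def fsync_budget(latencies_ms, block_interval_ms):
    """Map each fsync latency to the fraction of one block interval it consumes."""
    if block_interval_ms <= 0:
        raise ValueError("block_interval_ms must be positive")
    return {lat: lat / block_interval_ms for lat in latencies_ms}


# Write service times quoted in finding 4: typical low, typical high, tail.
OBSERVED_MS = (1, 8, 64)

# ASSUMPTION: placeholder only; the actual block interval is not given in this PR.
ASSUMED_BLOCK_INTERVAL_MS = 1000

budget = fsync_budget(OBSERVED_MS, ASSUMED_BLOCK_INTERVAL_MS)
for latency_ms, fraction in budget.items():
    print(f"{latency_ms:>3} ms fsync uses {fraction:.1%} of the assumed block interval")
```

Under the placeholder interval, even the 64 ms tail stays a small fraction of the budget; the conclusion should be rechecked against the chain's actual block interval.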
lai3d added a commit that referenced this pull request Jun 14, 2026
The T1 evidence item from ROADMAP-SRE: kernel-level traces of a live validator, with the optimization targets the rest of the roadmap references. Captured on `qfc-testnet-node` (VPS-B, Ubuntu 24.04 / kernel 6.17 aarch64, EBS-backed nvme) against the running `qfc-node-1` container.

Docs-only: traces, reproducible bpftrace programs + runner, and a findings note. No code changes.

Findings (production build `staging-sha-8cf3cb0`, steady-state testnet)

- ~17% of on-CPU time plus ~20k `write()`/s is sync-protocol logging: `SyncResponse` formatted via `tracing_subscriber::fmt`, mirrored 1:1 by `containerd-shim`/`dockerd` (stdout → Docker json-log). Chain-side analogue of the Loki disk-exhaustion incident.
- The deployed build does not fsync canonical blocks: only `alloy` fsyncs; `qfc-node` off-CPU is futex + tcp, no `io_schedule`. This is the baseline that T3.2 (PR #103, "T3.2: crash-atomic block commit + durability ADR", `set_sync`, merged but not deployed) will change, and concretely why RPO=0 matters.

Deliverables

- `docs/profiling/T1-eBPF.md`: findings note with the breakdown + cross-refs to T3.1/T3.2/T8.
- `docs/profiling/ebpf/`: raw traces (01–05), `00-context.txt`, and `05-oncpu.folded` (flame-graph source; render with `inferno-flamegraph` / `flamegraph.pl`).
- `scripts/profiling/`: `capture.sh` + four standalone `.bt` programs (no bcc dependency) to reproduce.

Notes

- No write-stall flame exists in this build; run `offcpu-qfcnode.bt` post-T3.2-deploy to capture one.
- `/tmp` workspace cleaned up after.

🤖 Generated with Claude Code
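For readers who want to check an on-CPU percentage without rendering a flame graph, the folded-stack file can be summed directly. This is a minimal sketch, assuming the usual folded format of one line per stack (`frame1;frame2;...;frameN count`). `frame_share` is a hypothetical helper, not something shipped in `scripts/profiling/`, and the sample lines are illustrative, not data from the capture.

```python
def frame_share(folded_lines, needle):
    """Return the fraction of samples whose stack contains `needle` in any frame."""
    total = 0
    matched = 0
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # Folded format: semicolon-joined frames, a space, then the sample count.
        stack, _, count = line.rpartition(" ")
        try:
            samples = int(count)
        except ValueError:
            continue  # skip lines that do not end in an integer count
        total += samples
        if needle in stack:
            matched += samples
    return matched / total if total else 0.0


# Illustrative input only; frame names and counts are made up for the example.
sample = [
    "qfc-node;hash_block;blake3::portable::compress_in_place 47",
    "qfc-node;sync;tracing_subscriber::fmt::format_event 17",
    "qfc-node;tokio::park 36",
]
print(f"{frame_share(sample, 'blake3::portable'):.0%}")
```

To apply it to the committed capture, pass an open file handle for `docs/profiling/ebpf/05-oncpu.folded` in place of `sample`. Substring matching counts a stack once even when the needle appears in several frames, so recursive or nested calls do not inflate the share.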