fix(consensus): wall-clock-anchored epoch/slot scheduling (5th fork defect) #129
Merged
…efect)

The testnet reset on the #126 image still forked: nodes that started at different times ran on different epoch/slot numbers, so they elected different producers at the same instant and produced competing blocks.

Root cause: scheduling was anchored to each node's LOCAL start time.
- producer.rs used a per-tick local counter (`slot += 1` from boot).
- maybe_advance_epoch accumulated epochs from each node's `start_time`.

#126 made selection deterministic GIVEN the same (epoch, slot), but nothing made nodes agree on the CURRENT epoch/slot. (Observed live: node-1 epoch 25 vs node-2/3 epoch 15 purely from a ~minute start-time gap.)

Fix (anchor scheduling to wall-clock, the standard time-slot approach):
- producer slot = now_ms / block_interval_ms (global; NTP-synced nodes agree). Processed at most once per slot. Replaces the local counter.
- maybe_advance_epoch: epoch = now_ms / epoch_duration_ms (global), replacing the local-start-time accumulation.
- Epoch seed derived DIRECTLY: blake3(genesis_seed || epoch). This is O(1), which is required because wall-clock epoch numbers are far too large to walk #126's hash chain to. genesis_seed is captured on the first start_epoch (genesis init) and is identical across nodes.

Result: every node computes the same slot/epoch/seed/producer at any instant → one elected producer network-wide per slot → convergence.

Tests: nodes started at DIFFERENT times with opposite validator order now agree on epoch, seed, and producer for 200 slots; epoch seed = blake3(genesis||n). 28 qfc-consensus + 21 qfc-node tests pass.

This is the final consensus fix for the testnet recovery (after #126 selection determinism, #128 forward sync). Rebuild → re-run reset.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
lai3d added a commit that referenced this pull request on Jun 27, 2026
The testnet reset on the #126 image still forked — exposing a 5th defect only visible at multi-node runtime: scheduling was anchored to each node's local start time, so nodes that booted at different times ran on different epoch/slot numbers and elected different producers at the same instant → competing blocks.
Observed live: node-1 on epoch 25 vs node-2/3 on epoch 15, purely from a ~minute start-time gap; block #5 hash differed across nodes.
Root cause
- `producer.rs` used a per-tick local counter (`slot += 1` from boot).
- `maybe_advance_epoch` accumulated epochs from each node's `start_time`.

#126 made selection deterministic given the same (epoch, slot), but nothing made nodes agree on the current epoch/slot.

Fix: anchor scheduling to wall-clock (standard time-slot PoS)
- Producer slot = `now_ms / block_interval_ms` (global; NTP-synced nodes agree), processed at most once per slot.
- Epoch = `now_ms / epoch_duration_ms` (global), replacing local-start accumulation.
- Epoch seed = `blake3(genesis_seed ‖ epoch)`, which is O(1). This is required because wall-clock epoch numbers are far too large to walk #126's hash chain. `genesis_seed` is captured on the first `start_epoch` (genesis init) and is identical across nodes.

Every node now computes the same slot/epoch/seed/producer at any instant → one elected producer network-wide per slot → convergence.
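
The wall-clock anchoring described above can be sketched roughly as follows. This is a minimal illustration, not the code in `producer.rs`: the names `wall_clock_ms`, `slot_at`, `epoch_at`, and `SlotGate` are hypothetical, and the sketch assumes both intervals are non-zero.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical helper: milliseconds since the Unix epoch.
/// Nodes with NTP-synced clocks read (nearly) the same value.
fn wall_clock_ms() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_millis() as u64)
        .unwrap_or(0)
}

/// Global slot number: depends only on wall-clock time, not on boot time.
/// Assumes `block_interval_ms` is non-zero.
fn slot_at(now_ms: u64, block_interval_ms: u64) -> u64 {
    now_ms / block_interval_ms
}

/// Global epoch number, derived the same way.
/// Assumes `epoch_duration_ms` is non-zero.
fn epoch_at(now_ms: u64, epoch_duration_ms: u64) -> u64 {
    now_ms / epoch_duration_ms
}

/// Remembers the last slot handled so that each slot is processed at most
/// once, however often the producer loop ticks.
struct SlotGate {
    last_processed: Option<u64>,
}

impl SlotGate {
    fn new() -> Self {
        SlotGate { last_processed: None }
    }

    /// Returns `Some(slot)` the first time a slot is seen and `None` on
    /// repeat ticks within that slot.
    fn enter(&mut self, now_ms: u64, block_interval_ms: u64) -> Option<u64> {
        let slot = slot_at(now_ms, block_interval_ms);
        if self.last_processed.map_or(true, |last| slot > last) {
            self.last_processed = Some(slot);
            Some(slot)
        } else {
            None
        }
    }
}
```

A producer loop would call something like `gate.enter(wall_clock_ms(), block_interval_ms)` on every tick and act only on `Some(slot)`. Because `slot_at` and `epoch_at` are pure functions of the clock reading, two nodes booted a minute apart get the same numbers at the same instant, which the boot-relative counter could not guarantee.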
Tests (28 qfc-consensus + 21 qfc-node pass)
- `test_nodes_agree_despite_different_start_times`: two engines started ~20 ms apart with opposite validator order agree on epoch, seed, and producer for 200 slots (the decisive property).
- `test_epoch_seed_is_deterministic`: seed = `blake3(genesis ‖ n)`.

Final consensus fix for the testnet recovery (after #126 selection determinism, #128 forward sync). Next: rebuild → re-run the reset.
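
The agreement property these tests check can be illustrated with a small std-only sketch. Several things here are assumptions rather than the PR's code: the epoch is encoded as 8 little-endian bytes (the PR does not state the encoding), `pick_producer` is a hypothetical selection rule and not necessarily the one from #126, and `toy_hash` is a non-cryptographic stand-in so the sketch compiles without external crates. With the `blake3` crate, the closure `|b| *blake3::hash(b).as_bytes()` could be passed instead.

```rust
/// Direct O(1) seed derivation: hash(genesis_seed || epoch).
/// Assumption: the epoch is appended as 8 little-endian bytes.
fn epoch_seed<H>(genesis_seed: &[u8; 32], epoch: u64, hash: H) -> [u8; 32]
where
    H: Fn(&[u8]) -> [u8; 32],
{
    let mut preimage = Vec::with_capacity(40);
    preimage.extend_from_slice(genesis_seed);
    preimage.extend_from_slice(&epoch.to_le_bytes());
    hash(&preimage)
}

/// Hypothetical selection rule: sort the validator ids so that local
/// ordering does not matter, then index by hash(seed || slot).
fn pick_producer<H>(validators: &[&str], seed: &[u8; 32], slot: u64, hash: H) -> Option<String>
where
    H: Fn(&[u8]) -> [u8; 32],
{
    if validators.is_empty() {
        return None;
    }
    let mut sorted: Vec<&str> = validators.to_vec();
    sorted.sort_unstable();

    let mut preimage = Vec::with_capacity(40);
    preimage.extend_from_slice(seed);
    preimage.extend_from_slice(&slot.to_le_bytes());
    let digest = hash(&preimage);

    let mut first8 = [0u8; 8];
    first8.copy_from_slice(&digest[..8]);
    let index = (u64::from_le_bytes(first8) % sorted.len() as u64) as usize;
    Some(sorted[index].to_string())
}

/// Stand-in hash (FNV-1a stretched to 32 bytes). NOT cryptographic; it only
/// keeps this sketch dependency-free.
fn toy_hash(data: &[u8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    let mut state: u64 = 0xcbf29ce484222325;
    for chunk in out.chunks_mut(8) {
        for &byte in data {
            state ^= byte as u64;
            state = state.wrapping_mul(0x100000001b3);
        }
        chunk.copy_from_slice(&state.to_le_bytes());
    }
    out
}
```

Two nodes holding the same `genesis_seed` and reading the same wall-clock epoch derive the same seed without walking any chain, and sorting inside `pick_producer` makes the result independent of each node's local validator order.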
🤖 Generated with Claude Code