fix(sync): forward catch-up so behind/new nodes can sync (deep-fork ④) #128
Merged
A node that fell behind could never catch up: handle_block walked the peer's chain backward by parent hash into a bounded pending buffer (MAX_PENDING_BLOCKS=1000) that dropped blocks as fast as it queued them, so it imported nothing and re-requested forever (also the ~20k writes/s log flood T1 found).

The forward range-sync path already existed but was dead code (sync_with_peer / sync_blocks_from_peer, "Will be used when initial sync is implemented"), and the wire protocol already serves GetBlockByNumber / GetBlockRange. This wires it up:

- run_catch_up_loop: every 5s, if we're > CATCH_UP_LAG_THRESHOLD (2) blocks behind the highest known peer, pick a peer and range-sync the gap forward, importing blocks IN ORDER (no missing-parent step → no pending-buffer overflow → works at any depth, incl. a fresh node from genesis). Runs sequentially so catch-ups never overlap; spawned in main.rs alongside the message/event handlers.
- sync_with_peer un-dead-coded (verifies peer genesis before syncing).
- The gossip backward-walk stays for small reorg gaps only.

Tests: should_catch_up threshold logic (fires only when >2 behind, never when level/1-2 behind; fires for genesis catch-up). qfc-node 21 unit tests pass.

This is fix ④ for the testnet recovery: with consensus convergence (#126) nodes won't fork, and this lets a lagging/new node rejoin and catch up without re-forking, making a testnet reset durable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
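For reviewers, the in-order forward import described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Block`, `Chain`, `catch_up`, and the `fetch_range` callback are hypothetical stand-ins (the real path goes through sync_with_peer / sync_blocks_from_peer and GetBlockRange), and hashes are plain integers for brevity.

```rust
/// Hypothetical minimal block: a height plus a parent link.
#[derive(Clone, Debug, PartialEq)]
pub struct Block {
    pub number: u64,
    pub hash: u64,
    pub parent_hash: u64,
}

/// Hypothetical local chain that only accepts a block extending its tip.
pub struct Chain {
    blocks: Vec<Block>, // index == block number; blocks[0] is genesis
}

impl Chain {
    pub fn new(genesis: Block) -> Self {
        Chain { blocks: vec![genesis] }
    }

    pub fn height(&self) -> u64 {
        self.blocks.len() as u64 - 1
    }

    /// Import succeeds only if the block is the direct child of the tip,
    /// so there is never a "missing parent" to buffer.
    pub fn import(&mut self, block: Block) -> Result<(), String> {
        let tip = &self.blocks[self.blocks.len() - 1];
        if block.number != tip.number + 1 {
            return Err(format!("expected block {}, got {}", tip.number + 1, block.number));
        }
        if block.parent_hash != tip.hash {
            return Err(format!("block {} does not extend the local tip", block.number));
        }
        self.blocks.push(block);
        Ok(())
    }
}

/// Forward catch-up: request ranges starting at local height + 1 and
/// import them in ascending order until the peer's height is reached.
/// `fetch_range(start, end)` stands in for a GetBlockRange request.
pub fn catch_up<F>(
    chain: &mut Chain,
    peer_height: u64,
    batch: u64,
    mut fetch_range: F,
) -> Result<u64, String>
where
    F: FnMut(u64, u64) -> Vec<Block>,
{
    let batch = batch.max(1); // guard against a zero-sized batch
    let mut imported = 0;
    while chain.height() < peer_height {
        let start = chain.height() + 1;
        let end = (start + batch - 1).min(peer_height);
        let blocks = fetch_range(start, end);
        if blocks.is_empty() {
            return Err("peer returned no blocks".to_string());
        }
        for block in blocks {
            chain.import(block)?;
            imported += 1;
        }
    }
    Ok(imported)
}
```

Because every request starts at the local tip, memory use is bounded by the batch size rather than by the depth of the gap, which is why the same loop would cover both a slightly lagging node and a fresh node at genesis.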
lai3d added a commit that referenced this pull request on Jun 18, 2026
Fix ④ from the testnet recovery plan — the last code fix before the reset.
Problem
A node that fell behind could never catch up. handle_block walked the peer's chain backward by parent hash into a bounded pending buffer (MAX_PENDING_BLOCKS = 1000) that dropped blocks as fast as it queued them, so it imported nothing and re-requested forever. (This is the deep-fork sync defect, and the source of the ~20k-writes/s log flood T1 found.)
Fix
The correct forward range-sync path already existed but was dead code (sync_with_peer / sync_blocks_from_peer, "Will be used when initial sync is implemented"), and the wire protocol already serves GetBlockByNumber / GetBlockRange. This wires it up:
- run_catch_up_loop: every 5s, if we're more than CATCH_UP_LAG_THRESHOLD (2) blocks behind the highest known peer, range-sync the gap forward and import blocks in order. In-order import has no missing-parent step, so there's no pending-buffer overflow; it catches up from any depth, including a fresh node from genesis. Sequential (each tick awaits the full sync) so catch-ups never overlap; spawned in main.rs next to the message/event handlers.
- sync_with_peer: un-dead-coded (verifies the peer's genesis hash before syncing).
Tests
should_catch_up_threshold: fires only when > 2 behind (never level / 1–2 behind, avoiding thrash), and fires for a genesis catch-up. qfc-node 21 unit tests pass; fmt + clippy clean.
Context
With consensus convergence (#126) nodes no longer fork; this lets a lagging or newly-joined node rejoin and catch up without re-forking — making the planned testnet reset durable. After this lands → promote to staging → reset (Strategy A).
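As a reading aid, the threshold rule covered by the test can be sketched like this. The constant value (2) and the name should_catch_up come from the description; the signature, the `plan_catch_up` helper, and the use of plain `u64` heights are assumptions made for illustration and may differ from the actual code.

```rust
/// Lag (in blocks) that must be exceeded before a catch-up fires.
/// Value taken from the PR description.
pub const CATCH_UP_LAG_THRESHOLD: u64 = 2;

/// Hypothetical signature: true only when the best known peer is more
/// than CATCH_UP_LAG_THRESHOLD blocks ahead of the local chain.
pub fn should_catch_up(local_height: u64, best_peer_height: u64) -> bool {
    // saturating_sub keeps the lag at 0 when we are ahead of the peer
    best_peer_height.saturating_sub(local_height) > CATCH_UP_LAG_THRESHOLD
}

/// Hypothetical helper for one loop tick: pick the highest known peer
/// height and return the inclusive block range to sync, if any.
pub fn plan_catch_up(local_height: u64, peer_heights: &[u64]) -> Option<(u64, u64)> {
    let best = peer_heights.iter().copied().max()?;
    if should_catch_up(local_height, best) {
        Some((local_height + 1, best))
    } else {
        None
    }
}
```

Under these assumptions a node at height 100 ignores peers at 100 to 102 and starts syncing once a peer reports 103 or higher, while a fresh node at height 0 syncs as soon as any peer is past block 2.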
🤖 Generated with Claude Code