fix(sync): forward catch-up so behind/new nodes can sync (deep-fork ④) #128

Merged
lai3d merged 1 commit into main from claude/sync-deep-catchup on Jun 14, 2026

@lai3d lai3d commented Jun 14, 2026

Fix ④ from the testnet recovery plan — the last code fix before the reset.

Problem

A node that fell behind could never catch up. handle_block walked the peer's chain backward by parent hash into a bounded pending buffer (MAX_PENDING_BLOCKS = 1000) that dropped blocks as fast as it queued them — so it imported nothing and re-requested forever. (This is the deep-fork sync defect, and the source of the ~20k-writes/s log flood T1 found.)

Fix

The correct forward range-sync path already existed but was dead code (sync_with_peer / sync_blocks_from_peer, "Will be used when initial sync is implemented"), and the wire protocol already serves GetBlockByNumber / GetBlockRange. This wires it up:

  • run_catch_up_loop — every 5s, if we're more than CATCH_UP_LAG_THRESHOLD (2) blocks behind the highest known peer, range-sync the gap forward and import blocks in order. In-order import has no missing-parent step, so there's no pending-buffer overflow — it catches up from any depth, including a fresh node from genesis. Sequential (each tick awaits the full sync) so catch-ups never overlap; spawned in main.rs next to the message/event handlers.
  • sync_with_peer is no longer dead code; it verifies the peer's genesis hash before syncing.
  • The gossip backward-walk stays for small reorg gaps only.
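
To illustrate why in-order import needs no pending buffer, here is a minimal, synchronous sketch of forward range-sync. Everything in it is a stand-in and not the qfc-node code: `Block`, `Chain`, `get_block_range`, `sync_forward`, `build_chain`, `toy_hash`, and `BATCH_SIZE` are hypothetical names, the peer is modelled as an in-memory slice rather than a network request, and the batch size of 128 is an assumption.

```rust
/// Minimal stand-in for a block: only the fields needed to show ordering.
#[derive(Clone, Debug, PartialEq)]
struct Block {
    number: u64,
    hash: u64,
    parent_hash: u64,
}

/// Hypothetical local chain that only accepts a block extending its head.
struct Chain {
    head: Block,
}

impl Chain {
    fn import(&mut self, block: Block) -> Result<(), String> {
        // A block is importable only if it is the direct child of the head,
        // so there is never a "missing parent" to park in a buffer.
        if block.number != self.head.number + 1 || block.parent_hash != self.head.hash {
            return Err(format!(
                "block {} does not extend head {}",
                block.number, self.head.number
            ));
        }
        self.head = block;
        Ok(())
    }
}

/// Toy hash for the example only (not a real block hash).
fn toy_hash(number: u64) -> u64 {
    number.wrapping_mul(31).wrapping_add(7)
}

/// Builds a linear chain 0..=tip to play the role of the peer's chain.
fn build_chain(tip: u64) -> Vec<Block> {
    (0..=tip)
        .map(|n| Block {
            number: n,
            hash: toy_hash(n),
            parent_hash: if n == 0 { 0 } else { toy_hash(n - 1) },
        })
        .collect()
}

/// Stand-in for a GetBlockRange request: blocks from..=to, ascending.
fn get_block_range(peer: &[Block], from: u64, to: u64) -> Vec<Block> {
    peer.iter()
        .filter(|b| b.number >= from && b.number <= to)
        .cloned()
        .collect()
}

/// Assumed batch size; the real value is not stated in this PR.
const BATCH_SIZE: u64 = 128;

/// Requests the gap in ascending batches and imports each block in order.
/// Returns the number of blocks imported.
fn sync_forward(chain: &mut Chain, peer: &[Block], target: u64) -> Result<u64, String> {
    let mut imported = 0;
    while chain.head.number < target {
        let from = chain.head.number + 1;
        let to = (from + BATCH_SIZE - 1).min(target);
        let batch = get_block_range(peer, from, to);
        if batch.is_empty() {
            return Err(format!("peer returned no blocks for {}..={}", from, to));
        }
        for block in batch {
            chain.import(block)?;
            imported += 1;
        }
    }
    Ok(imported)
}
```

In this sketch, a chain starting at `build_chain(5000)[0]` can call `sync_forward` with target 5000 and walk the whole gap, because memory use is bounded by one batch rather than by the depth of the gap. The real loop additionally picks a peer, checks its genesis hash, and runs on a 5s tick.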

Tests

  • should_catch_up_threshold: the catch-up fires only when the node is more than 2 blocks behind (never when level or 1–2 behind, which avoids thrash), and it fires for a catch-up from genesis. All 21 qfc-node unit tests pass; fmt and clippy are clean.
  • End-to-end forward sync is exercised by the testnet reset (a node catching up on the fresh chain).
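
The threshold rule tested above can be sketched as a small pure function. The signature is an assumption (the real `should_catch_up` helper may take different arguments); only the threshold value of 2 comes from this PR.

```rust
/// Lag, in blocks, that must be exceeded before a catch-up starts.
/// The value 2 is taken from the PR description.
const CATCH_UP_LAG_THRESHOLD: u64 = 2;

/// Hypothetical signature: returns true when the local node is more than
/// CATCH_UP_LAG_THRESHOLD blocks behind the highest known peer.
fn should_catch_up(local_height: u64, best_peer_height: u64) -> bool {
    // saturating_sub keeps the lag at 0 when we are ahead of every known
    // peer, so the subtraction can never underflow.
    best_peer_height.saturating_sub(local_height) > CATCH_UP_LAG_THRESHOLD
}
```

With this rule, `should_catch_up(100, 102)` is false and `should_catch_up(100, 103)` is true, so ordinary one- or two-block lag is left to the gossip path while a fresh node at height 0 triggers a full catch-up.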

Context

With consensus convergence (#126) nodes no longer fork; this lets a lagging or newly-joined node rejoin and catch up without re-forking — making the planned testnet reset durable. After this lands → promote to staging → reset (Strategy A).

🤖 Generated with Claude Code

A node that fell behind could never catch up: handle_block walked the peer's
chain backward by parent hash into a bounded pending buffer
(MAX_PENDING_BLOCKS=1000) that dropped blocks as fast as it queued them, so it
imported nothing and re-requested forever (also the ~20k writes/s log flood T1
found).

The forward range-sync path already existed but was dead code
(sync_with_peer / sync_blocks_from_peer, "Will be used when initial sync is
implemented") — and the wire protocol already serves GetBlockByNumber /
GetBlockRange. This wires it up:

- run_catch_up_loop: every 5s, if we're > CATCH_UP_LAG_THRESHOLD (2) blocks
  behind the highest known peer, pick a peer and range-sync the gap forward,
  importing blocks IN ORDER (no missing-parent step → no pending-buffer
  overflow → works at any depth, incl. a fresh node from genesis). Runs
  sequentially so catch-ups never overlap; spawned in main.rs alongside the
  message/event handlers.
- sync_with_peer un-dead-coded (verifies peer genesis before syncing).
- The gossip backward-walk stays for small reorg gaps only.

Tests: should_catch_up threshold logic (fires only when >2 behind, never when
level/1-2 behind; fires for genesis catch-up). qfc-node 21 unit tests pass.

This is fix ④ for the testnet recovery: with consensus convergence (#126)
nodes won't fork, and this lets a lagging/new node rejoin and catch up without
re-forking — making a testnet reset durable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@lai3d lai3d merged commit e310cb3 into main Jun 14, 2026
4 checks passed
@lai3d lai3d deleted the claude/sync-deep-catchup branch June 14, 2026 17:24