Develop validator resilience to transient chain disruption and recover from network-wide slice failures#509
Conversation
Per-slice disable was a one-way write with no recovery path. A chain RPC disconnect that forces a metagraph reconnect followed by mass QUIC handshake timeouts caused every dslice in the run to fail with zero verified tiles, which then permanently appended every slice for every active circuit to the disable list. Subsequent runs short-circuit with skipped=N and the validator stops dispatching work until the process is restarted. Two changes restore self-healing: - Refuse to add to the disable list when every slice in the run failed with zero verified tiles. That signal is more consistent with a validator-side network or chain event than per-slice unprovability and is logged as a run-wide failure for visibility. - Track the block height at which each slice was disabled and rehabilitate entries older than DISABLED_SLICE_REHAB_BLOCKS (one tempo) at dispatch time. Entries still get refreshed on repeated failure but no longer persist past a transient disruption.
… bytes Random-sample verification previously rolled at response receipt, so the validator had to retain task_inputs (the local JSON copy of the input tensor) across the entire in-flight window for every dispatched request, even though only ~4% of those requests would be deep-verified. At steady state with thousands of concurrent dispatches, the retained inputs were a dominant contributor to per-request memory footprint and amplified host memory pressure events. Roll the RSV decision in dispatch_requests instead, attach the disposition to both DispatchedRequest and TaskResult as pre_sampled, and drop task_inputs immediately for the non-sampled path before the miner task is spawned. Force-verify paths (PoW, customer RWR, external-hash, API dslice) continue to retain their inputs because pre_decide_sample marks them as pre_sampled regardless of the RSV roll. decide_sample at response time now consults the pre-decision rather than re-rolling, with a defensive no-proof-no-sample short-circuit preserved for empty or failed responses. Existing post-response byte-shedding for proof_content / witness / raw / public_json is extended to also release MinerResponse.inputs on the non-sample path, covering the residual case where a force-verified request completes with an empty proof.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughThis PR moves RSV sampling to dispatch-time with a propagated ChangesDispatch-time sampling and disabled-slice rehabilitation
Sequence Diagram(s)sequenceDiagram
participant Dispatch as dispatch_requests
participant PreDecide as pre_decide_sample
participant Spawn as spawn_miner_task
participant Verification as start_verification
participant Store as IncrementalRunManager
Dispatch->>PreDecide: pre_decide_sample(dispatched, hotkey, block, tempo, rsv)
PreDecide-->>Dispatch: pre_sampled (bool)
Dispatch->>Spawn: pass DispatchedRequest (pre_sampled, maybe cleared task_inputs)
Spawn-->>Verification: TaskResult (includes pre_sampled, proof bytes or cleared)
Verification->>Store: note_slice_skipped / mark_slice_failed (as needed)
Store-->>Dispatch: skipped_slice_count used in finalize flow (run completion)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/sn2-validator/src/validator_loop/dslice.rs`:
- Around line 809-833: The current all_slices_failed logic wrongly treats runs
where every slice is failed+zero-verified the same even if none were actually
dispatched (e.g., preflight rejections or already-disabled skips); change the
condition so the "treat as run-wide failure" branch only triggers when at least
one slice was actually dispatched/executed (or when failure cause indicates
execution-time outage). Concretely, add a check alongside total_slices > 0 that
queries the run state (e.g., a dispatched count or a failure-cause flag from
RunManager) — for example require run_manager.dispatched_slice_count(run_uid) >
0 (or check a per-slice failure cause via
run_manager.failure_cause/run_manager.was_dispatched) before setting
all_slices_failed — so deterministic preflight/disabled skips don’t suppress the
disable-list write.
In `@crates/sn2-validator/src/validator_loop/verification.rs`:
- Around line 250-264: decide_sample currently re-enables sampling whenever
hotkey.is_empty(), which can resurrect sampling after dispatch cleared
task_inputs; to fix, persist the "empty-hotkey at dispatch" state on TaskResult
(e.g. add a dispatched_empty_hotkey / had_empty_hotkey flag) and set that flag
inside dispatch_requests when you clear task_inputs for non-pre_sampled tasks,
then update decide_sample to consult result.dispatched_empty_hotkey instead of
computing hotkey.is_empty() at verify time; ensure dispatch_requests sets the
flag and decide_sample uses result.pre_sampled || result.dispatched_empty_hotkey
(and only treats force-sample when the Response still contains inputs if you
prefer the alternative safeguard).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: b8649a51-b823-4192-96ea-bf7a67809a35
📒 Files selected for processing (4)
crates/sn2-validator/src/validator_loop/dispatch.rscrates/sn2-validator/src/validator_loop/dslice.rscrates/sn2-validator/src/validator_loop/mod.rscrates/sn2-validator/src/validator_loop/verification.rs
…lice The previous guard fired whenever every slice in a benchmark run ended with is_slice_failed=true and zero verified tiles. That signal conflates two distinct outcomes: every slice was deterministically skipped before dispatch (already-disabled entries, preflight rejections) versus every slice that was actually attempted failed during dispatch. Only the latter is the network-wide event the guard is designed to absorb; the former is either a no-op (re-marking already-disabled slices) or a legitimate signal that preflight-rejected slices should be added to the disable list. Track deterministically-skipped slices per run via a new note_slice_skipped / skipped_slice_count pair on IncrementalRunManager, called at the two existing skip call sites (already-disabled and preflight_failed). The guard now requires attempted > 0 and every attempted slice to have failed with zero verified tiles before suppressing the disable-list write. Mixed runs where some slices were skipped and others legitimately failed during dispatch continue to disable the genuinely-failed slices. Skipped-slice bookkeeping is removed alongside other per-run state in remove_run and evict_by_circuit so the map cannot leak across run lifecycles.
decide_sample retained an empty-hotkey force-sample branch carried over from the historical response-time RSV roll. With sampling pre-decided at dispatch, that branch reintroduced a race: if the metagraph reshuffled between dispatch and response (uid deregistered, uid_hotkeys entry cleared) the verifier would revive sampling on a request whose task_inputs had already been released, leading to a verification attempt against a missing input tensor. The equivalent empty-hotkey safety net lives in pre_decide_sample at dispatch time, where the corresponding pre_sampled=true keeps task_inputs retained for the in-flight window. The verify-side check is now strictly result.pre_sampled (gated on has_proof), removing the post-dispatch hotkey lookup from the path entirely.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/sn2-validator/src/incremental_runner.rs (1)
485-506:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winMissing
skipped_slicescleanup ingc_stale.Unlike
remove_runandevict_by_circuit, this function doesn't clean upskipped_slicesentries for evicted runs. This creates a memory leak where stale run entries accumulate over time.🐛 Proposed fix
pub fn gc_stale(&mut self, idle_timeout: Duration) -> Vec<String> { let now = Instant::now(); let stale: Vec<String> = self .runs .iter() .filter(|(_, run)| now.duration_since(run.last_activity) >= idle_timeout) .map(|(uid, _)| uid.clone()) .collect(); if !stale.is_empty() { info!(count = stale.len(), run_uids = ?stale, "evicting idle runs"); } let stale_set: HashSet<&str> = stale.iter().map(|s| s.as_str()).collect(); self.tile_counters .retain(|(run_uid, _), _| !stale_set.contains(run_uid.as_str())); self.verified_tile_counts .retain(|(run_uid, _), _| !stale_set.contains(run_uid.as_str())); + self.skipped_slices + .retain(|run_uid, _| !stale_set.contains(run_uid.as_str())); for uid in stale.iter() { self.runs.remove(uid); self.evicted.insert(uid.clone()); } stale }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/sn2-validator/src/incremental_runner.rs` around lines 485 - 506, The gc_stale function fails to remove entries from skipped_slices for evicted runs, causing a memory leak; update gc_stale to compute the same stale run UIDs (stale) and then remove corresponding entries from skipped_slices just like it already does for tile_counters and verified_tile_counts, and ensure it mirrors the cleanup behavior in remove_run and evict_by_circuit (use the same run UID set/stale_set and call skipped_slices.retain or remove per-uid before inserting into evicted and deleting from runs).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/sn2-validator/src/incremental_runner.rs`:
- Around line 485-506: The gc_stale function fails to remove entries from
skipped_slices for evicted runs, causing a memory leak; update gc_stale to
compute the same stale run UIDs (stale) and then remove corresponding entries
from skipped_slices just like it already does for tile_counters and
verified_tile_counts, and ensure it mirrors the cleanup behavior in remove_run
and evict_by_circuit (use the same run UID set/stale_set and call
skipped_slices.retain or remove per-uid before inserting into evicted and
deleting from runs).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 72e1d23f-55b5-47b2-8bf9-f6fbefa65ee4
📒 Files selected for processing (3)
crates/sn2-validator/src/incremental_runner.rscrates/sn2-validator/src/validator_loop/dslice.rscrates/sn2-validator/src/validator_loop/verification.rs
remove_run and evict_by_circuit already drop their share of skipped_slices when a run leaves the manager, but gc_stale was added before the field and never updated. Idle benchmark runs aged out by gc_stale would leave their per-slice skip records behind, slowly growing the map across long-lived validator processes. Mirror the existing tile_counters and verified_tile_counts retain pattern so the same stale_set drops skipped_slices entries before the runs are removed and inserted into the evicted FIFO.
Brings #510 (trust-on-first-use validator allowlist with stake-coverage gate) into the testnet branch alongside #509 (pre-sampled RSV dispatch and disabled-slice rehab) and #512 (rehab clock reset fix). The next testnet release will ship all three on the testnet miners and validators so operators can smoke-test the allowlist path on a host that runs without --disable-blacklist.
Promotes the validator resilience and rehab-cooldown work that has been running on testnet onto main for the next mainnet patch tag. Brings in #509 (pre-sampled RSV dispatch, disabled-slice rehab with stake-weighted run-wide-failure detection) and #512 (rehab clock-reset fix for slices that re-enter the skip path while still within the 360-block cooldown). #510 (trust-on-first-use validator allowlist with stake-coverage gate and nftables driver) is already on main; this merge collapses the divergence introduced when #510 landed on main while #509 landed on testnet in parallel.
Summary
Two independent improvements to validator state management under transient
disruption. Both target testnet so they can soak before promotion to main.
Rehabilitate
disabled_slicesafter run-wide failuresThe validator currently writes every slice that ends a benchmark run with
zero verified tiles into a per-circuit disable list, and the only path that
ever clears that list is full circuit deactivation. When the chain RPC
forces a reconnect and the metagraph re-sync triggers a concurrent QUIC
handshake to every miner, those handshakes time out together. The next
benchmark run sees every dslice fail with zero verified tiles, the disable
list absorbs the entire slice set, and the validator transitions into a
state where each subsequent run logs
skipped=N / no circuit slices to dispatch / combined run complete failed_count=Nand dispatches no workuntil the process is restarted. vTrust degrades over the next epoch with
no observable cause in the local logs.
Two changes restore self-healing:
finalize_combined_runno longer writes todisabled_sliceswhen everyslice in the run failed with zero verified tiles. That signal is more
consistent with a validator-side network or chain disruption than with
each slice being independently unprovable, and is now logged as a
run-wide failure for operator visibility.
disabled_slicescarries the block height at which each slice wasdisabled, and the dispatch-side filter skips entries older than
DISABLED_SLICE_REHAB_BLOCKS(one tempo). Repeated failures stillrefresh the entry, but transient ones age out cleanly.
Pre-decide RSV sampling at dispatch time
Random-sample verification rolled at response receipt, which meant the
validator retained
task_inputs(the local JSON copy of the input tensor)across the entire in-flight window for every dispatched request even
though only ~4% of those requests would be deep-verified. At steady state
with thousands of concurrent dispatches, the retained inputs were a
dominant contributor to per-request memory footprint and amplified host
memory pressure events.
This change rolls the RSV decision in
dispatch_requests, attaches theresult to both
DispatchedRequestandTaskResultaspre_sampled, andclears
task_inputsimmediately for the non-sampled path before theminer task is spawned. Force-verify paths (PoW, customer RWR,
external-hash, API dslice) continue to retain their inputs because
pre_decide_samplemarks them as pre-sampled regardless of the RSV roll.decide_sampleat response time now consults the pre-decision rather thanre-rolling. The defensive
no-proof / empty-response → no-sampleshort-circuit is preserved so an empty response on a force-verified
request still skips verification cleanly. Post-response byte-shedding for
proof_content/witness/raw/public_jsonis also extended torelease
MinerResponse.inputson the non-sample path for the residualcase where a force-verified request completes with an empty proof.
Sampling rate and skiplist behavior are unchanged: the RSV roll uses the
same
should_sampleover the same constants, just at an earlier point inthe request lifecycle.
Test plan
cargo check -p sn2-validatorcargo test -p sn2-validator --bin sn2-validator(87 tests pass)cargo fmt --checkcargo clippy -p sn2-validator --tests -- -D warningsactive_tasks/dispatch_budgethealth-log metrics over a full epochvalidator continues to dispatch after the reconnect storm rather
than locking into the
skipped=Nno-dispatch loophistorical baseline (the sampling rate and decision logic are
unchanged, only the moment of the roll has moved)
Follow-ups (separate PRs)
btlightning::update_miner_registryso a massreconnect does not synchronize on a single timeout boundary in the
first place. Tracking separately because it lands in
btlightning.Summary by CodeRabbit
New Features
Bug Fixes
Refactor