Resolve disabled-slice rehab clock reset on re-skipped slices#512
Conversation
Investigate cooldown behavior of disabled_slices across consecutive runs and isolate a logic gap that defeats the DISABLED_SLICE_REHAB_BLOCKS contract. finalize_combined_run iterated `candidates` (slices marked failed with zero verified tiles) and called `entry.insert(slice_id, current_block)` on each. HashMap::insert unconditionally replaces the value; the `.is_none()` return check only gated the `inserted` counter, not the mutation. Slices that had been skipped via the disabled_slices filter at dispatch time were also present in candidates (mark_slice_failed at the skip path moves them from pending_slices into failed_slices on each new CombinedRun), so their disabled_at block was overwritten with current_block every run, restarting the 360-block rehab cooldown indefinitely. A slice that ever landed in disabled_slices remained there until validator restart. Resolve by: * Adding IncrementalRunManager::is_slice_skipped(run_uid, slice_id) alongside the existing note_slice_skipped / skipped_slice_count surface introduced in #509. * Excluding skipped slices from the candidates filter in finalize_combined_run so the disable-list write only touches slices that were actually attempted in this run and failed without producing a verified tile. * Simplifying the run-wide failure check now that candidates contains exactly the attempted-failed slices: `candidates.len() == attempted` replaces the saturating-sub bookkeeping. Newly disabled slices still get a fresh disabled_at stamp. Existing disabled entries age out via the unchanged `current_block - disabled_at < DISABLED_SLICE_REHAB_BLOCKS` filter at dispatch time once `current_block - disabled_at >= 360` blocks. The prune_disabled_slices / run-eviction paths are unaffected. Tests in incremental_runner cover the new accessor across the before-noting, after-noting, per-run-uid scoping, and run-removal cleanup paths; cargo test -p sn2-validator passes 17/17 in the incremental_runner module. cargo fmt --check clean.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
WalkthroughIncrementalRunManager receives a new public query method ChangesSkipped Slice Tracking and Finalization
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Brings #510 (trust-on-first-use validator allowlist with stake-coverage gate) into the testnet branch alongside #509 (pre-sampled RSV dispatch and disabled-slice rehab) and #512 (rehab clock reset fix). The next testnet release will ship all three on the testnet miners and validators so operators can smoke-test the allowlist path on a host that runs without --disable-blacklist.
Promotes the validator resilience and rehab-cooldown work that has been running on testnet onto main for the next mainnet patch tag. Brings in #509 (pre-sampled RSV dispatch, disabled-slice rehab with stake-weighted run-wide-failure detection) and #512 (rehab clock-reset fix for slices that re-enter the skip path while still within the 360-block cooldown). #510 (trust-on-first-use validator allowlist with stake-coverage gate and nftables driver) is already on main; this merge collapses the divergence introduced when #510 landed on main while #509 landed on testnet in parallel.
Summary
The
DISABLED_SLICE_REHAB_BLOCKScooldown introduced alongside the disabled-slice rehab logic does not actually age slices out. Once a slice lands indisabled_slices, itsdisabled_atblock is re-stamped tocurrent_blockon every subsequent run, holding it disabled until validator restart.Root cause
In
finalize_combined_run(crates/sn2-validator/src/validator_loop/dslice.rs):Two interacting facts make this re-stamp every disabled slice:
dispatch_work_items_for_circuitputs currently-disabled (within rehab) slices intoskipped, then callsself.run_manager.mark_slice_failed(run_uid, &work.slice_id)on each.mark_slice_failedmoves the slice frompending_slicesintofailed_sliceson the freshCombinedRunfor this run, sois_slice_failed(run_uid, sid)returns true for them.HashMap::insert(k, v)unconditionally replaces the value when the key exists, returning the old value viaSome(old). The.is_none()check only gates theinsertedcounter — the side effect (entry["S"] = current_block) always happens.So every run that skips slice
Sfor being disabled overwritesS'sdisabled_atblock. The dispatch-time filtercurrent_block - disabled_at < 360therefore stays true forever.Fix
IncrementalRunManager::is_slice_skipped(run_uid, slice_id)alongside the existingnote_slice_skipped/skipped_slice_countsurface.candidatesinfinalize_combined_runso the disable-list write only touches slices that were actually attempted in this run.candidates.len()is exactly the attempted-and-failed count:candidates.len() == attemptedreplaces the saturating-sub bookkeeping.Newly attempted-and-failed slices still get a fresh
disabled_atstamp via the sameentry.insertcall. Once their block is older thanDISABLED_SLICE_REHAB_BLOCKS, the dispatch-time filter excludes them from the active disabled set on the next run and they get a retry opportunity.Test plan
is_slice_skipped: false-before-noting, true-after-noting, per-run_uidscoping, cleared onremove_run.cargo test -p sn2-validator incremental_runner::— 17 pass (13 existing + 4 new).cargo build -p sn2-validatorclean.cargo fmt -p sn2-validator --checkclean.-D warningsand full workspace tests.disabled_slicesages out after 360 blocks (~72 min) and re-enters dispatch.Targets
Branched off
origin/testnetbecause the affected code (#509) is on the testnet branch and has not landed onmainyet.Summary by CodeRabbit
New Features
Bug Fixes