Resolve disabled-slice rehab clock reset on re-skipped slices #512

Merged
HudsonGraeme merged 1 commit into testnet from resolve/disabled-slice-rehab-clock-reset on May 25, 2026

Member

@HudsonGraeme HudsonGraeme commented May 25, 2026

Summary

The DISABLED_SLICE_REHAB_BLOCKS cooldown introduced alongside the disabled-slice rehab logic does not actually age slices out. Once a slice lands in disabled_slices, its disabled_at block is re-stamped to current_block on every subsequent run, holding it disabled until validator restart.

Root cause

In finalize_combined_run (crates/sn2-validator/src/validator_loop/dslice.rs):

let candidates: Vec<String> = slice_tiles
    .into_keys()
    .filter(|slice_id| {
        self.run_manager.is_slice_failed(run_uid, slice_id)
            && self.run_manager.verified_tile_count(run_uid, slice_id) == 0
    })
    .collect();

} else if !candidates.is_empty() {
    let block = self.current_block;
    let entry = self.disabled_slices.entry(circuit_id.clone()).or_default();
    let mut inserted = 0usize;
    for slice_id in &candidates {
        if entry.insert(slice_id.clone(), block).is_none() {
            inserted += 1;
        }
    }

Two interacting facts make this re-stamp every disabled slice:

  1. The dispatch-time filter in dispatch_work_items_for_circuit puts currently-disabled (within rehab) slices into skipped, then calls self.run_manager.mark_slice_failed(run_uid, &work.slice_id) on each. mark_slice_failed moves the slice from pending_slices into failed_slices on the fresh CombinedRun for this run, so is_slice_failed(run_uid, sid) returns true for them.
  2. HashMap::insert(k, v) unconditionally replaces the value when the key exists, returning the old value via Some(old). The .is_none() check only gates the inserted counter — the side effect (entry["S"] = current_block) always happens.

So every run that skips slice S for being disabled overwrites S's disabled_at block. The dispatch-time filter current_block - disabled_at < 360 therefore stays true forever.
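
The overwrite can be reproduced in isolation. The following is a minimal sketch, not the validator code: `stamp_with_insert` is a hypothetical helper that mirrors the loop quoted above, and the block numbers are made up for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical helper mirroring the pre-fix loop: stamps every candidate
/// with `block` and returns how many keys were newly inserted.
fn stamp_with_insert(
    entry: &mut HashMap<String, u64>,
    candidates: &[String],
    block: u64,
) -> usize {
    let mut inserted = 0usize;
    for slice_id in candidates {
        // `insert` replaces the stored value even when the key already
        // exists; the `is_none()` check only controls the counter.
        if entry.insert(slice_id.clone(), block).is_none() {
            inserted += 1;
        }
    }
    inserted
}

fn main() {
    let mut disabled: HashMap<String, u64> = HashMap::new();
    let candidates = vec!["S".to_string()];

    // Run 1 at (illustrative) block 1000: "S" is newly disabled.
    assert_eq!(stamp_with_insert(&mut disabled, &candidates, 1000), 1);
    // Run 2 at block 1100: nothing is counted as inserted...
    assert_eq!(stamp_with_insert(&mut disabled, &candidates, 1100), 0);
    // ...yet the stored block moved forward, restarting the cooldown.
    assert_eq!(disabled["S"], 1100);
}
```

The counter stays at zero on the second run, which is why the re-stamp is easy to miss when only `inserted` is inspected.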

Fix

  • Add IncrementalRunManager::is_slice_skipped(run_uid, slice_id) alongside the existing note_slice_skipped / skipped_slice_count surface.
  • Exclude skipped slices from candidates in finalize_combined_run so the disable-list write only touches slices that were actually attempted in this run.
  • Simplify the run-wide failure detection now that candidates.len() is exactly the attempted-and-failed count: candidates.len() == attempted replaces the saturating-sub bookkeeping.

Newly attempted-and-failed slices still get a fresh disabled_at stamp via the same entry.insert call. Once their block is older than DISABLED_SLICE_REHAB_BLOCKS, the dispatch-time filter excludes them from the active disabled set on the next run and they get a retry opportunity.
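
As a rough illustration of the intended behavior after the fix, the sketch below models the candidate filter and the dispatch-time age check. `RunState` and `is_in_rehab` are hypothetical stand-ins, not the real IncrementalRunManager API; the constant value is assumed from the 360-block figure above, and `saturating_sub` is a defensive choice made for this sketch.

```rust
use std::collections::{HashMap, HashSet};

/// Assumed value, taken from the 360-block figure quoted in the description.
const DISABLED_SLICE_REHAB_BLOCKS: u64 = 360;

/// Hypothetical stand-in for the per-run state the run manager tracks.
#[derive(Default)]
struct RunState {
    failed: HashSet<String>,
    skipped: HashSet<String>,
    verified_tiles: HashMap<String, usize>,
}

impl RunState {
    /// Slices that were attempted in this run, failed, and verified no tiles.
    fn candidates(&self, slice_ids: &[&str]) -> Vec<String> {
        slice_ids
            .iter()
            .filter(|id| {
                self.failed.contains(**id)
                    && !self.skipped.contains(**id) // the new exclusion
                    && self.verified_tiles.get(**id).copied().unwrap_or(0) == 0
            })
            .map(|id| (*id).to_owned())
            .collect()
    }
}

/// Dispatch-time check: a slice stays disabled only while inside the cooldown.
fn is_in_rehab(disabled_at: u64, current_block: u64) -> bool {
    current_block.saturating_sub(disabled_at) < DISABLED_SLICE_REHAB_BLOCKS
}

fn main() {
    let mut disabled: HashMap<String, u64> = HashMap::new();
    disabled.insert("S".to_string(), 1000); // disabled at block 1000

    // Run at block 1100: "S" is skipped (still in rehab) and marked failed;
    // "T" is attempted and fails with no verified tiles.
    let mut run = RunState::default();
    run.failed.insert("S".to_string());
    run.skipped.insert("S".to_string());
    run.failed.insert("T".to_string());

    let candidates = run.candidates(&["S", "T"]);
    assert_eq!(candidates, vec!["T".to_string()]);
    for id in &candidates {
        disabled.insert(id.clone(), 1100);
    }

    // "S" keeps its original stamp and ages out exactly 360 blocks later.
    assert_eq!(disabled["S"], 1000);
    assert!(is_in_rehab(disabled["S"], 1359));
    assert!(!is_in_rehab(disabled["S"], 1360));
}
```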

Test plan

  • New unit tests for is_slice_skipped: false-before-noting, true-after-noting, per-run_uid scoping, cleared on remove_run.
  • cargo test -p sn2-validator incremental_runner:: — 17 pass (13 existing + 4 new).
  • cargo build -p sn2-validator clean.
  • cargo fmt -p sn2-validator --check clean.
  • CI to confirm clippy -D warnings and full workspace tests.
  • Operator smoke on testnet: confirm a slice that lands in disabled_slices ages out after 360 blocks (~72 min) and re-enters dispatch.
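
The four properties exercised by the new unit tests can be pictured with a small model. This is only a sketch: `SkipTracker` is a hypothetical type, and the `&str` parameter types are assumptions, since the real IncrementalRunManager signatures are not shown here.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical model of per-run skip tracking, keyed by run_uid.
#[derive(Default)]
struct SkipTracker {
    skipped: HashMap<String, HashSet<String>>,
}

impl SkipTracker {
    /// Records that `slice_id` was skipped during run `run_uid`.
    fn note_slice_skipped(&mut self, run_uid: &str, slice_id: &str) {
        self.skipped
            .entry(run_uid.to_owned())
            .or_default()
            .insert(slice_id.to_owned());
    }

    /// Returns true only if `slice_id` was noted as skipped for this run.
    fn is_slice_skipped(&self, run_uid: &str, slice_id: &str) -> bool {
        self.skipped
            .get(run_uid)
            .map_or(false, |slices| slices.contains(slice_id))
    }

    /// Drops all skip state for a finished run.
    fn remove_run(&mut self, run_uid: &str) {
        self.skipped.remove(run_uid);
    }
}

fn main() {
    let mut tracker = SkipTracker::default();

    // False before noting.
    assert!(!tracker.is_slice_skipped("run-1", "S"));

    // True after noting.
    tracker.note_slice_skipped("run-1", "S");
    assert!(tracker.is_slice_skipped("run-1", "S"));

    // Scoped per run_uid: another run does not see the skip.
    assert!(!tracker.is_slice_skipped("run-2", "S"));

    // Cleared once the run is removed.
    tracker.remove_run("run-1");
    assert!(!tracker.is_slice_skipped("run-1", "S"));
}
```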

Targets

Branched off origin/testnet because the affected code (#509) is on the testnet branch and has not landed on main yet.

Summary by CodeRabbit

  • New Features

    • Validation runs now track which slices were skipped during execution.
  • Bug Fixes

    • Improved detection of true run-wide failures by correctly distinguishing skipped slices from failed execution attempts.

Investigate cooldown behavior of disabled_slices across consecutive runs
and isolate a logic gap that defeats the DISABLED_SLICE_REHAB_BLOCKS
contract. finalize_combined_run iterated `candidates` (slices marked
failed with zero verified tiles) and called
`entry.insert(slice_id, current_block)` on each. HashMap::insert
unconditionally replaces the value; the `.is_none()` return check only
gated the `inserted` counter, not the mutation. Slices that had been
skipped via the disabled_slices filter at dispatch time were also
present in candidates (mark_slice_failed at the skip path moves them
from pending_slices into failed_slices on each new CombinedRun), so
their disabled_at block was overwritten with current_block every run,
restarting the 360-block rehab cooldown indefinitely. A slice that
ever landed in disabled_slices remained there until validator restart.

Resolve by:

* Adding IncrementalRunManager::is_slice_skipped(run_uid, slice_id)
  alongside the existing note_slice_skipped / skipped_slice_count
  surface introduced in #509.
* Excluding skipped slices from the candidates filter in
  finalize_combined_run so the disable-list write only touches slices
  that were actually attempted in this run and failed without
  producing a verified tile.
* Simplifying the run-wide failure check now that candidates contains
  exactly the attempted-failed slices: `candidates.len() == attempted`
  replaces the saturating-sub bookkeeping.

Newly disabled slices still get a fresh disabled_at stamp. Existing
disabled entries age out via the unchanged
`current_block - disabled_at < DISABLED_SLICE_REHAB_BLOCKS` filter at
dispatch time once `current_block - disabled_at >= 360` blocks. The
prune_disabled_slices / run-eviction paths are unaffected.

Tests in incremental_runner cover the new accessor across the
before-noting, after-noting, per-run-uid scoping, and run-removal
cleanup paths; cargo test -p sn2-validator passes 17/17 in the
incremental_runner module. cargo fmt --check clean.

coderabbitai Bot commented May 25, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d2ba1248-48dc-4815-b11a-08395617f483

📥 Commits

Reviewing files that changed from the base of the PR and between 7a73633 and 2f11f54.

📒 Files selected for processing (2)
  • crates/sn2-validator/src/incremental_runner.rs
  • crates/sn2-validator/src/validator_loop/dslice.rs

Walkthrough

IncrementalRunManager receives a new public query method is_slice_skipped() that reports whether a slice was marked as skipped for a given run, with test coverage. The run finalization logic in finalize_combined_run now uses this query to explicitly exclude skipped slices when deciding which slices to write to the disabled-slices list, and simplifies run-wide failure detection accordingly.

Changes

Skipped Slice Tracking and Finalization

| Layer / File(s) | Summary |
| --- | --- |
| Add is_slice_skipped() query API (crates/sn2-validator/src/incremental_runner.rs) | IncrementalRunManager exposes is_slice_skipped(run_uid, slice_id) to query whether a slice was marked skipped. Tests verify false-before-noting, true-after-noting, per-run isolation, and cleanup after remove_run. |
| Refine run finalization to exclude skipped slices (crates/sn2-validator/src/validator_loop/dslice.rs) | finalize_combined_run calls is_slice_skipped() when building candidates for disabled-slice decisions, explicitly excluding skipped slices from consideration. Run-wide failure detection is simplified to check if candidates.len() == attempted. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • inference-labs-inc/subnet-2#509: Both PRs modify IncrementalRunManager/note_slice_skipped tracking and update validator_loop/dslice.rs::finalize_combined_run to exclude skipped slices from computing whether to persist disabled_slices.

  • inference-labs-inc/subnet-2#441: Both PRs modify the same slice-handling flow in validator_loop/dslice.rs to decide which slices are treated as "disabled/failed," with this PR's query logic tightening decisions by excluding skipped rather than attempted slices.

  • inference-labs-inc/subnet-2#432: Both PRs update IncrementalRunManager and finalize_combined_run to make final run behavior depend on per-slice tracking for slice status, overlapping at the same slice-status-driven finalization logic.

Poem

🐰 A rabbit hops through slice-marked logs,
Skipped slices now have queries in the bogs,
No more confusion—are they failed or mere?
is_slice_skipped makes it crystal clear! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the bug being fixed: preventing the disabled-slice rehab clock from being reset when slices are re-skipped, which is the core issue resolved in this PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@HudsonGraeme HudsonGraeme merged commit e86c28a into testnet May 25, 2026
8 checks passed
@HudsonGraeme HudsonGraeme deleted the resolve/disabled-slice-rehab-clock-reset branch May 25, 2026 14:12
HudsonGraeme added a commit that referenced this pull request May 25, 2026
Brings #510 (trust-on-first-use validator allowlist with stake-coverage
gate) into the testnet branch alongside #509 (pre-sampled RSV dispatch
and disabled-slice rehab) and #512 (rehab clock reset fix). The next
testnet release will ship all three on the testnet miners and
validators so operators can smoke-test the allowlist path on a host
that runs without --disable-blacklist.
HudsonGraeme added a commit that referenced this pull request May 25, 2026
Promotes the validator resilience and rehab-cooldown work that has been
running on testnet onto main for the next mainnet patch tag. Brings in
#509 (pre-sampled RSV dispatch, disabled-slice rehab with stake-weighted
run-wide-failure detection) and #512 (rehab clock-reset fix for slices
that re-enter the skip path while still within the 360-block cooldown).
#510 (trust-on-first-use validator allowlist with stake-coverage gate
and nftables driver) is already on main; this merge collapses the
divergence introduced when #510 landed on main while #509 landed on
testnet in parallel.