Resolve disabled-slice rehab clock reset on re-skipped slices by HudsonGraeme · Pull Request #512 · inference-labs-inc/subnet-2

HudsonGraeme · 2026-05-25T13:42:14Z

Summary

The DISABLED_SLICE_REHAB_BLOCKS cooldown introduced alongside the disabled-slice rehab logic does not actually age slices out. Once a slice lands in disabled_slices, its disabled_at block is re-stamped to current_block on every subsequent run, holding it disabled until validator restart.

Root cause

In finalize_combined_run (crates/sn2-validator/src/validator_loop/dslice.rs):

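// Candidates: slices marked failed in this run with no verified tile.
// Slices skipped at dispatch time are also marked failed, so they pass this filter too.
let candidates: Vec<String> = slice_tiles
    .into_keys()
    .filter(|slice_id| {
        self.run_manager.is_slice_failed(run_uid, slice_id)
            && self.run_manager.verified_tile_count(run_uid, slice_id) == 0
    })
    .collect();
…
} else if !candidates.is_empty() {
    let block = self.current_block;
    let entry = self.disabled_slices.entry(circuit_id.clone()).or_default();
    let mut inserted = 0usize;
    for slice_id in &candidates {
        // `insert` overwrites any existing disabled_at block;
        // `.is_none()` only decides whether `inserted` is bumped.
        if entry.insert(slice_id.clone(), block).is_none() {
            inserted += 1;
        }
    }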

Two interacting facts make this re-stamp every disabled slice:

- The dispatch-time filter in dispatch_work_items_for_circuit puts slices that are currently disabled (still within the rehab window) into skipped, then calls self.run_manager.mark_slice_failed(run_uid, &work.slice_id) on each. mark_slice_failed moves the slice from pending_slices into failed_slices on the fresh CombinedRun for this run, so is_slice_failed(run_uid, slice_id) returns true for them.
- HashMap::insert(k, v) unconditionally replaces the value when the key exists and returns the old value as Some(old). The .is_none() check only gates the inserted counter; the side effect (entry["S"] = current_block) always happens.

So every run that skips slice S for being disabled overwrites S's disabled_at block. The dispatch-time filter current_block - disabled_at < 360 therefore stays true forever.
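The interaction can be reproduced outside the validator with a small model. The sketch below is illustrative only: restamp, within_rehab, and still_disabled_after are hypothetical helpers, not functions from sn2-validator, and the model assumes one run per listed block and a single slice "S". It relies only on the documented behavior of std::collections::HashMap::insert.

```rust
use std::collections::HashMap;

/// Cooldown length quoted in the PR text (DISABLED_SLICE_REHAB_BLOCKS).
const REHAB_BLOCKS: u64 = 360;

/// Mirrors the pre-fix write: the return value says whether the key was new,
/// but the stored block is replaced either way.
fn restamp(disabled: &mut HashMap<String, u64>, slice_id: &str, block: u64) -> bool {
    disabled.insert(slice_id.to_string(), block).is_none()
}

/// Dispatch-time check: is the slice still inside its rehab window?
fn within_rehab(disabled: &HashMap<String, u64>, slice_id: &str, current_block: u64) -> bool {
    disabled
        .get(slice_id)
        .map_or(false, |at| current_block.saturating_sub(*at) < REHAB_BLOCKS)
}

/// Replays one run per entry in `run_blocks` for a single slice "S".
/// The first run is the genuine failure; every later run that finds the slice
/// inside its window skips it and, as in the pre-fix code, re-stamps it.
/// Returns whether the slice is still treated as disabled at the last block.
fn still_disabled_after(run_blocks: &[u64]) -> bool {
    let mut disabled: HashMap<String, u64> = HashMap::new();
    let mut last = match run_blocks.first() {
        Some(b) => *b,
        None => return false,
    };
    restamp(&mut disabled, "S", last);
    for &block in &run_blocks[1..] {
        last = block;
        if within_rehab(&disabled, "S", block) {
            // skipped -> marked failed -> candidate -> disabled_at overwritten
            restamp(&mut disabled, "S", block);
        }
    }
    within_rehab(&disabled, "S", last)
}
```

With runs every 100 blocks starting at block 100, the slice is still disabled at block 600 even though 500 blocks have passed since the real failure. The clock only expires in this model when two consecutive runs are at least 360 blocks apart.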

Fix

- Add IncrementalRunManager::is_slice_skipped(run_uid, slice_id) alongside the existing note_slice_skipped / skipped_slice_count surface.
- Exclude skipped slices from candidates in finalize_combined_run so the disable-list write only touches slices that were actually attempted in this run.
- Simplify the run-wide failure detection now that candidates.len() is exactly the attempted-and-failed count: candidates.len() == attempted replaces the saturating-sub bookkeeping.

Newly attempted-and-failed slices still get a fresh disabled_at stamp via the same entry.insert call. Once their block is older than DISABLED_SLICE_REHAB_BLOCKS, the dispatch-time filter excludes them from the active disabled set on the next run and they get a retry opportunity.
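As a rough sketch of the changed filter, the model below adds the skipped check to the candidate predicate. RunState, disable_candidates, and stamp_disabled are hypothetical stand-ins written for illustration; the real IncrementalRunManager takes a run_uid and its internals are not shown in this PR, so treat the shapes here as assumptions.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical per-run state; the real manager keys this by run_uid.
#[derive(Default)]
struct RunState {
    failed: HashSet<String>,
    skipped: HashSet<String>,
    verified_tiles: HashMap<String, usize>,
}

impl RunState {
    fn is_slice_failed(&self, id: &str) -> bool {
        self.failed.contains(id)
    }
    fn is_slice_skipped(&self, id: &str) -> bool {
        self.skipped.contains(id)
    }
    fn verified_tile_count(&self, id: &str) -> usize {
        self.verified_tiles.get(id).copied().unwrap_or(0)
    }
}

/// Slices that were attempted in this run, failed, and produced no verified tile.
/// Skipped slices are excluded even though the skip path marks them failed.
fn disable_candidates(run: &RunState, slice_ids: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = slice_ids
        .iter()
        .copied()
        .filter(|id| {
            run.is_slice_failed(id)
                && !run.is_slice_skipped(id)
                && run.verified_tile_count(id) == 0
        })
        .map(|id| id.to_string())
        .collect();
    out.sort(); // stable order for the example only
    out
}

/// Same unconditional insert as before; it is safe now because the
/// candidate list no longer contains slices that were merely skipped.
fn stamp_disabled(disabled: &mut HashMap<String, u64>, candidates: &[String], block: u64) {
    for id in candidates {
        disabled.insert(id.clone(), block);
    }
}
```

In this model a slice that was disabled at block 100 and skipped in a run at block 200 keeps its stamp of 100, while a slice that was attempted and failed in that run is stamped 200.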

Test plan

- New unit tests for is_slice_skipped: false-before-noting, true-after-noting, per-run_uid scoping, cleared on remove_run.
- cargo test -p sn2-validator incremental_runner:: passes 17 tests (13 existing + 4 new).
- cargo build -p sn2-validator clean.
- cargo fmt -p sn2-validator --check clean.
- CI to confirm clippy -D warnings and full workspace tests.
- Operator smoke on testnet: confirm a slice that lands in disabled_slices ages out after 360 blocks (~72 min) and re-enters dispatch.
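The four accessor properties listed in the test plan can be illustrated with a minimal tracker. SkipTracker below is a hypothetical type, not the real IncrementalRunManager; it borrows the method names from the PR text and assumes run and slice identifiers are strings.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical skipped-slice bookkeeping, keyed by run_uid.
#[derive(Default)]
struct SkipTracker {
    skipped: HashMap<String, HashSet<String>>,
}

impl SkipTracker {
    /// Records that `slice_id` was skipped at dispatch time in `run_uid`.
    fn note_slice_skipped(&mut self, run_uid: &str, slice_id: &str) {
        self.skipped
            .entry(run_uid.to_string())
            .or_default()
            .insert(slice_id.to_string());
    }

    /// True only if the slice was noted as skipped in this specific run.
    fn is_slice_skipped(&self, run_uid: &str, slice_id: &str) -> bool {
        self.skipped
            .get(run_uid)
            .map_or(false, |slices| slices.contains(slice_id))
    }

    /// Number of distinct slices skipped in this run.
    fn skipped_slice_count(&self, run_uid: &str) -> usize {
        self.skipped.get(run_uid).map_or(0, |slices| slices.len())
    }

    /// Drops all bookkeeping for a finished run.
    fn remove_run(&mut self, run_uid: &str) {
        self.skipped.remove(run_uid);
    }
}
```

Keying the set by run is what gives the per-run scoping: a slice skipped in one run is not reported as skipped in another, and removing the run clears its entries.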

Targets

Branched off origin/testnet because the affected code (#509) is on the testnet branch and has not landed on main yet.

Summary by CodeRabbit

New Features
- Validation runs now track which slices were skipped during execution.
Bug Fixes
- Improved detection of true run-wide failures by correctly distinguishing skipped slices from failed execution attempts.

Investigate cooldown behavior of disabled_slices across consecutive runs and isolate a logic gap that defeats the DISABLED_SLICE_REHAB_BLOCKS contract.

finalize_combined_run iterated `candidates` (slices marked failed with zero verified tiles) and called `entry.insert(slice_id, current_block)` on each. HashMap::insert unconditionally replaces the value; the `.is_none()` return check only gated the `inserted` counter, not the mutation. Slices that had been skipped via the disabled_slices filter at dispatch time were also present in candidates (mark_slice_failed at the skip path moves them from pending_slices into failed_slices on each new CombinedRun), so their disabled_at block was overwritten with current_block every run, restarting the 360-block rehab cooldown indefinitely. A slice that ever landed in disabled_slices remained there until validator restart.

Resolve by:

* Adding IncrementalRunManager::is_slice_skipped(run_uid, slice_id) alongside the existing note_slice_skipped / skipped_slice_count surface introduced in #509.
* Excluding skipped slices from the candidates filter in finalize_combined_run so the disable-list write only touches slices that were actually attempted in this run and failed without producing a verified tile.
* Simplifying the run-wide failure check now that candidates contains exactly the attempted-failed slices: `candidates.len() == attempted` replaces the saturating-sub bookkeeping.

Newly disabled slices still get a fresh disabled_at stamp. Existing disabled entries age out via the unchanged `current_block - disabled_at < DISABLED_SLICE_REHAB_BLOCKS` filter at dispatch time once `current_block - disabled_at >= 360` blocks. The prune_disabled_slices / run-eviction paths are unaffected.

Tests in incremental_runner cover the new accessor across the before-noting, after-noting, per-run-uid scoping, and run-removal cleanup paths; cargo test -p sn2-validator passes 17/17 in the incremental_runner module. cargo fmt --check clean.

coderabbitai · 2026-05-25T13:42:26Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d2ba1248-48dc-4815-b11a-08395617f483

📥 Commits

Reviewing files that changed from the base of the PR and between 7a73633 and 2f11f54.

📒 Files selected for processing (2)

crates/sn2-validator/src/incremental_runner.rs
crates/sn2-validator/src/validator_loop/dslice.rs

Walkthrough

IncrementalRunManager receives a new public query method is_slice_skipped() that reports whether a slice was marked as skipped for a given run, with test coverage. The run finalization logic in finalize_combined_run now uses this query to explicitly exclude skipped slices when deciding which slices to write to the disabled-slices list, and simplifies run-wide failure detection accordingly.

Changes

Skipped Slice Tracking and Finalization

Layer / File(s)	Summary
Add is_slice_skipped() query API `crates/sn2-validator/src/incremental_runner.rs`	IncrementalRunManager exposes `is_slice_skipped(run_uid, slice_id)` to query whether a slice was marked skipped. Tests verify false-before-noting, true-after-noting, per-run isolation, and cleanup after `remove_run`.
Refine run finalization to exclude skipped slices `crates/sn2-validator/src/validator_loop/dslice.rs`	`finalize_combined_run` calls `is_slice_skipped()` when building candidates for disabled-slice decisions, explicitly excluding skipped slices from consideration. Run-wide failure detection is simplified to check if `candidates.len() == attempted`.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

inference-labs-inc/subnet-2#509: Both PRs modify IncrementalRunManager/note_slice_skipped tracking and update validator_loop/dslice.rs::finalize_combined_run to exclude skipped slices from computing whether to persist disabled_slices.
inference-labs-inc/subnet-2#441: Both PRs modify the same slice-handling flow in validator_loop/dslice.rs to decide which slices are treated as "disabled/failed," with this PR's query logic tightening decisions by excluding skipped rather than attempted slices.
inference-labs-inc/subnet-2#432: Both PRs update IncrementalRunManager and finalize_combined_run to make final run behavior depend on per-slice tracking for slice status, overlapping at the same slice-status-driven finalization logic.

Poem

🐰 A rabbit hops through slice-marked logs,
Skipped slices now have queries in the bogs,
No more confusion—are they failed or mere?
is_slice_skipped makes it crystal clear! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly addresses the bug being fixed: preventing the disabled-slice rehab clock from being reset when slices are re-skipped, which is the core issue resolved in this PR.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


Brings #510 (trust-on-first-use validator allowlist with stake-coverage gate) into the testnet branch alongside #509 (pre-sampled RSV dispatch and disabled-slice rehab) and #512 (rehab clock reset fix). The next testnet release will ship all three on the testnet miners and validators so operators can smoke-test the allowlist path on a host that runs without --disable-blacklist.

Promotes the validator resilience and rehab-cooldown work that has been running on testnet onto main for the next mainnet patch tag. Brings in #509 (pre-sampled RSV dispatch, disabled-slice rehab with stake-weighted run-wide-failure detection) and #512 (rehab clock-reset fix for slices that re-enter the skip path while still within the 360-block cooldown). #510 (trust-on-first-use validator allowlist with stake-coverage gate and nftables driver) is already on main; this merge collapses the divergence introduced when #510 landed on main while #509 landed on testnet in parallel.

HudsonGraeme merged commit e86c28a into testnet May 25, 2026
8 checks passed

HudsonGraeme deleted the resolve/disabled-slice-rehab-clock-reset branch May 25, 2026 14:12

HudsonGraeme merged 1 commit into testnet from resolve/disabled-slice-rehab-clock-reset
