Develop validator resilience to transient chain disruption and recover from network-wide slice failures by HudsonGraeme · Pull Request #509 · inference-labs-inc/subnet-2

HudsonGraeme · 2026-05-23T16:34:44Z

Summary

Two independent improvements to validator state management under transient
disruption. Both target testnet so they can soak before promotion to main.

Rehabilitate `disabled_slices` after run-wide failures

The validator currently writes every slice that ends a benchmark run with
zero verified tiles into a per-circuit disable list, and the only path that
ever clears that list is full circuit deactivation. When the chain RPC
forces a reconnect and the metagraph re-sync triggers a concurrent QUIC
handshake to every miner, those handshakes time out together. The next
benchmark run sees every dslice fail with zero verified tiles, the disable
list absorbs the entire slice set, and the validator transitions into a
state where each subsequent run logs skipped=N / no circuit slices to dispatch / combined run complete failed_count=N and dispatches no work
until the process is restarted. vTrust degrades over the next epoch with
no observable cause in the local logs.

Two changes restore self-healing:

finalize_combined_run no longer writes to disabled_slices when every
slice in the run failed with zero verified tiles. That signal is more
consistent with a validator-side network or chain disruption than with
each slice being independently unprovable, and is now logged as a
run-wide failure for operator visibility.
disabled_slices carries the block height at which each slice was
disabled, and the dispatch-side filter skips entries older than
DISABLED_SLICE_REHAB_BLOCKS (one tempo). Repeated failures still
refresh the entry, but transient ones age out cleanly.

Pre-decide RSV sampling at dispatch time

Random-sample verification rolled at response receipt, which meant the
validator retained task_inputs (the local JSON copy of the input tensor)
across the entire in-flight window for every dispatched request even
though only ~4% of those requests would be deep-verified. At steady state
with thousands of concurrent dispatches, the retained inputs were a
dominant contributor to per-request memory footprint and amplified host
memory pressure events.

This change rolls the RSV decision in dispatch_requests, attaches the
result to both DispatchedRequest and TaskResult as pre_sampled, and
clears task_inputs immediately for the non-sampled path before the
miner task is spawned. Force-verify paths (PoW, customer RWR,
external-hash, API dslice) continue to retain their inputs because
pre_decide_sample marks them as pre-sampled regardless of the RSV roll.

decide_sample at response time now consults the pre-decision rather than
re-rolling. The defensive no-proof / empty-response → no-sample
short-circuit is preserved so an empty response on a force-verified
request still skips verification cleanly. Post-response byte-shedding for
proof_content / witness / raw / public_json is also extended to
release MinerResponse.inputs on the non-sample path for the residual
case where a force-verified request completes with an empty proof.

Sampling rate and skiplist behavior are unchanged: the RSV roll uses the
same should_sample over the same constants, just at an earlier point in
the request lifecycle.

Test plan

cargo check -p sn2-validator
cargo test -p sn2-validator --bin sn2-validator (87 tests pass)
cargo fmt --check
cargo clippy -p sn2-validator --tests -- -D warnings
Deploy to a testnet validator and observe RSS / active_tasks /
dispatch_budget health-log metrics over a full epoch
Force a chain RPC disconnect on a testnet validator and confirm the
validator continues to dispatch after the reconnect storm rather
than locking into the skipped=N no-dispatch loop
Confirm RSV strike rate and skiplist behavior on testnet match
historical baseline (the sampling rate and decision logic are
unchanged, only the moment of the roll has moved)

Follow-ups (separate PRs)

Jitter / stagger in btlightning::update_miner_registry so a mass
reconnect does not synchronize on a single timeout boundary in the
first place. Tracking separately because it lands in btlightning.

Summary by CodeRabbit

New Features
- Disabled slices now automatically rehabilitate after a configurable block window.
- Validator now tracks skipped slices per run to improve run-level failure handling.
Bug Fixes
- Avoids persisting a disable-list when all attempted slices produce no verified work, preventing permanent lockout.
- Clears unnecessary task inputs before miner execution to reduce memory retention.
Refactor
- Sampling decision moved earlier in dispatch; pre-sampling is propagated so non‑sampled work can drop inputs/proofs before execution.

Per-slice disable was a one-way write with no recovery path. A chain RPC disconnect that forces a metagraph reconnect followed by mass QUIC handshake timeouts caused every dslice in the run to fail with zero verified tiles, which then permanently appended every slice for every active circuit to the disable list. Subsequent runs short-circuit with skipped=N and the validator stops dispatching work until the process is restarted. Two changes restore self-healing: - Refuse to add to the disable list when every slice in the run failed with zero verified tiles. That signal is more consistent with a validator-side network or chain event than per-slice unprovability and is logged as a run-wide failure for visibility. - Track the block height at which each slice was disabled and rehabilitate entries older than DISABLED_SLICE_REHAB_BLOCKS (one tempo) at dispatch time. Entries still get refreshed on repeated failure but no longer persist past a transient disruption.

… bytes Random-sample verification previously rolled at response receipt, so the validator had to retain task_inputs (the local JSON copy of the input tensor) across the entire in-flight window for every dispatched request, even though only ~4% of those requests would be deep-verified. At steady state with thousands of concurrent dispatches, the retained inputs were a dominant contributor to per-request memory footprint and amplified host memory pressure events. Roll the RSV decision in dispatch_requests instead, attach the disposition to both DispatchedRequest and TaskResult as pre_sampled, and drop task_inputs immediately for the non-sampled path before the miner task is spawned. Force-verify paths (PoW, customer RWR, external-hash, API dslice) continue to retain their inputs because pre_decide_sample marks them as pre_sampled regardless of the RSV roll. decide_sample at response time now consults the pre-decision rather than re-rolling, with a defensive no-proof-no-sample short-circuit preserved for empty or failed responses. Existing post-response byte-shedding for proof_content / witness / raw / public_json is extended to also release MinerResponse.inputs on the non-sample path, covering the residual case where a force-verified request completes with an empty proof.

HudsonGraeme · 2026-05-23T17:20:59Z

@coderabbitai review

coderabbitai · 2026-05-23T17:21:05Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai · 2026-05-23T17:21:11Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 13f5267a-4ec6-4de2-bbeb-8e5e74454cda

📥 Commits

Reviewing files that changed from the base of the PR and between 936345b and 8b3227a.

📒 Files selected for processing (1)

crates/sn2-validator/src/incremental_runner.rs

Walkthrough

This PR moves RSV sampling to dispatch-time with a propagated pre_sampled flag (clearing inputs when not pre-sampled), adds run-level skipped-slice tracking, and implements time-windowed rehabilitation for disabled slices to avoid persistent no-dispatch conditions.

Changes

Dispatch-time sampling and disabled-slice rehabilitation

Layer / File(s)	Summary
Data contracts and imports `crates/sn2-validator/src/validator_loop/mod.rs`	Adds `pre_sampled: bool` to `TaskResult` and `DispatchedRequest`, updates std::collections imports, and changes `ValidatorLoop.disabled_slices` to `HashMap<String, HashMap<String, u64>>` to store disabled_at timestamps.
Sampling decision functions `crates/sn2-validator/src/validator_loop/verification.rs`	Adds `pub(super) pre_decide_sample(...)` for dispatch-time decisions (force-verify cases and RSV delegation) and simplifies `decide_sample` to a response-side function that uses `TaskResult.pre_sampled` and proof presence.
Dispatch-time sampling execution `crates/sn2-validator/src/validator_loop/dispatch.rs`	`dispatch_requests` calls `pre_decide_sample` to set `pre_sampled`; when `false`, `task_inputs` are cleared. RWR and DSlice `DispatchedRequest` initialize `pre_sampled:false`. `spawn_miner_task` forwards `pre_sampled` into `TaskResult`.
Incremental run skipped-slice tracking `crates/sn2-validator/src/incremental_runner.rs`	Adds `skipped_slices` per-run state, `note_slice_skipped`, `skipped_slice_count`, and ensures cleanup/eviction keep skipped-slice tracking consistent.
Disabled slice rehab and finalization `crates/sn2-validator/src/validator_loop/dslice.rs`, `crates/sn2-validator/src/validator_loop/mod.rs`	Introduces `DISABLED_SLICE_REHAB_BLOCKS`; `enqueue_all_dslices` filters disabled slices by age and records skipped slices; `finalize_combined_run` distinguishes skipped vs attempted slices and suppresses writing disable lists when every attempted slice failed with zero verified tiles; disabled entries are inserted with current block timestamps.

Sequence Diagram(s)

sequenceDiagram
  participant Dispatch as dispatch_requests
  participant PreDecide as pre_decide_sample
  participant Spawn as spawn_miner_task
  participant Verification as start_verification
  participant Store as IncrementalRunManager

  Dispatch->>PreDecide: pre_decide_sample(dispatched, hotkey, block, tempo, rsv)
  PreDecide-->>Dispatch: pre_sampled (bool)
  Dispatch->>Spawn: pass DispatchedRequest (pre_sampled, maybe cleared task_inputs)
  Spawn-->>Verification: TaskResult (includes pre_sampled, proof bytes or cleared)
  Verification->>Store: note_slice_skipped / mark_slice_failed (as needed)
  Store-->>Dispatch: skipped_slice_count used in finalize flow (run completion)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

inference-labs-inc/subnet-2#441: Related changes to validator disabled-slice tracking and rehab semantics.
inference-labs-inc/subnet-2#469: Overlapping RSV sampling and verification gating modifications.
inference-labs-inc/subnet-2#463: Related enqueue_all_dslices early-skip and failure recording changes.

Poem

🐰 I pre-sample hops at break of dawn,

inputs dropped where proofs are gone,
some slices nap a rehab while,
runs remember skips and file,
then validators hum, the net moves on.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 70.59% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main objectives of the PR: developing validator resilience to transient chain disruption and recovery from network-wide slice failures, which directly aligns with the two key improvements documented in the PR objectives.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch implement/pre-sampled-dispatch-state-reduction

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/sn2-validator/src/validator_loop/dslice.rs`:
- Around line 809-833: The current all_slices_failed logic wrongly treats runs
where every slice is failed+zero-verified the same even if none were actually
dispatched (e.g., preflight rejections or already-disabled skips); change the
condition so the "treat as run-wide failure" branch only triggers when at least
one slice was actually dispatched/executed (or when failure cause indicates
execution-time outage). Concretely, add a check alongside total_slices > 0 that
queries the run state (e.g., a dispatched count or a failure-cause flag from
RunManager) — for example require run_manager.dispatched_slice_count(run_uid) >
0 (or check a per-slice failure cause via
run_manager.failure_cause/run_manager.was_dispatched) before setting
all_slices_failed — so deterministic preflight/disabled skips don’t suppress the
disable-list write.

In `@crates/sn2-validator/src/validator_loop/verification.rs`:
- Around line 250-264: decide_sample currently re-enables sampling whenever
hotkey.is_empty(), which can resurrect sampling after dispatch cleared
task_inputs; to fix, persist the "empty-hotkey at dispatch" state on TaskResult
(e.g. add a dispatched_empty_hotkey / had_empty_hotkey flag) and set that flag
inside dispatch_requests when you clear task_inputs for non-pre_sampled tasks,
then update decide_sample to consult result.dispatched_empty_hotkey instead of
computing hotkey.is_empty() at verify time; ensure dispatch_requests sets the
flag and decide_sample uses result.pre_sampled || result.dispatched_empty_hotkey
(and only treats force-sample when the Response still contains inputs if you
prefer the alternative safeguard).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: b8649a51-b823-4192-96ea-bf7a67809a35

📥 Commits

Reviewing files that changed from the base of the PR and between f1368d5 and 41b5a0b.

📒 Files selected for processing (4)

crates/sn2-validator/src/validator_loop/dispatch.rs
crates/sn2-validator/src/validator_loop/dslice.rs
crates/sn2-validator/src/validator_loop/mod.rs
crates/sn2-validator/src/validator_loop/verification.rs

…lice The previous guard fired whenever every slice in a benchmark run ended with is_slice_failed=true and zero verified tiles. That signal conflates two distinct outcomes: every slice was deterministically skipped before dispatch (already-disabled entries, preflight rejections) versus every slice that was actually attempted failed during dispatch. Only the latter is the network-wide event the guard is designed to absorb; the former is either a no-op (re-marking already-disabled slices) or a legitimate signal that preflight-rejected slices should be added to the disable list. Track deterministically-skipped slices per run via a new note_slice_skipped / skipped_slice_count pair on IncrementalRunManager, called at the two existing skip call sites (already-disabled and preflight_failed). The guard now requires attempted > 0 and every attempted slice to have failed with zero verified tiles before suppressing the disable-list write. Mixed runs where some slices were skipped and others legitimately failed during dispatch continue to disable the genuinely-failed slices. Skipped-slice bookkeeping is removed alongside other per-run state in remove_run and evict_by_circuit so the map cannot leak across run lifecycles.

decide_sample retained an empty-hotkey force-sample branch carried over from the historical response-time RSV roll. With sampling pre-decided at dispatch, that branch reintroduced a race: if the metagraph reshuffled between dispatch and response (uid deregistered, uid_hotkeys entry cleared) the verifier would revive sampling on a request whose task_inputs had already been released, leading to a verification attempt against a missing input tensor. The equivalent empty-hotkey safety net lives in pre_decide_sample at dispatch time, where the corresponding pre_sampled=true keeps task_inputs retained for the in-flight window. The verify-side check is now strictly result.pre_sampled (gated on has_proof), removing the post-dispatch hotkey lookup from the path entirely.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/sn2-validator/src/incremental_runner.rs (1)

485-506: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Missing skipped_slices cleanup in gc_stale.

Unlike remove_run and evict_by_circuit, this function doesn't clean up skipped_slices entries for evicted runs. This creates a memory leak where stale run entries accumulate over time.

🐛 Proposed fix

     pub fn gc_stale(&mut self, idle_timeout: Duration) -> Vec<String> {
         let now = Instant::now();
         let stale: Vec<String> = self
             .runs
             .iter()
             .filter(|(_, run)| now.duration_since(run.last_activity) >= idle_timeout)
             .map(|(uid, _)| uid.clone())
             .collect();
         if !stale.is_empty() {
             info!(count = stale.len(), run_uids = ?stale, "evicting idle runs");
         }
         let stale_set: HashSet<&str> = stale.iter().map(|s| s.as_str()).collect();
         self.tile_counters
             .retain(|(run_uid, _), _| !stale_set.contains(run_uid.as_str()));
         self.verified_tile_counts
             .retain(|(run_uid, _), _| !stale_set.contains(run_uid.as_str()));
+        self.skipped_slices
+            .retain(|run_uid, _| !stale_set.contains(run_uid.as_str()));
         for uid in stale.iter() {
             self.runs.remove(uid);
             self.evicted.insert(uid.clone());
         }
         stale
     }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/sn2-validator/src/incremental_runner.rs` around lines 485 - 506, The
gc_stale function fails to remove entries from skipped_slices for evicted runs,
causing a memory leak; update gc_stale to compute the same stale run UIDs
(stale) and then remove corresponding entries from skipped_slices just like it
already does for tile_counters and verified_tile_counts, and ensure it mirrors
the cleanup behavior in remove_run and evict_by_circuit (use the same run UID
set/stale_set and call skipped_slices.retain or remove per-uid before inserting
into evicted and deleting from runs).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/sn2-validator/src/incremental_runner.rs`:
- Around line 485-506: The gc_stale function fails to remove entries from
skipped_slices for evicted runs, causing a memory leak; update gc_stale to
compute the same stale run UIDs (stale) and then remove corresponding entries
from skipped_slices just like it already does for tile_counters and
verified_tile_counts, and ensure it mirrors the cleanup behavior in remove_run
and evict_by_circuit (use the same run UID set/stale_set and call
skipped_slices.retain or remove per-uid before inserting into evicted and
deleting from runs).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 72e1d23f-55b5-47b2-8bf9-f6fbefa65ee4

📥 Commits

Reviewing files that changed from the base of the PR and between 41b5a0b and 936345b.

📒 Files selected for processing (3)

crates/sn2-validator/src/incremental_runner.rs
crates/sn2-validator/src/validator_loop/dslice.rs
crates/sn2-validator/src/validator_loop/verification.rs

remove_run and evict_by_circuit already drop their share of skipped_slices when a run leaves the manager, but gc_stale was added before the field and never updated. Idle benchmark runs aged out by gc_stale would leave their per-slice skip records behind, slowly growing the map across long-lived validator processes. Mirror the existing tile_counters and verified_tile_counts retain pattern so the same stale_set drops skipped_slices entries before the runs are removed and inserted into the evicted FIFO.

Brings #510 (trust-on-first-use validator allowlist with stake-coverage gate) into the testnet branch alongside #509 (pre-sampled RSV dispatch and disabled-slice rehab) and #512 (rehab clock reset fix). The next testnet release will ship all three on the testnet miners and validators so operators can smoke-test the allowlist path on a host that runs without --disable-blacklist.

Promotes the validator resilience and rehab-cooldown work that has been running on testnet onto main for the next mainnet patch tag. Brings in #509 (pre-sampled RSV dispatch, disabled-slice rehab with stake-weighted run-wide-failure detection) and #512 (rehab clock-reset fix for slices that re-enter the skip path while still within the 360-block cooldown). #510 (trust-on-first-use validator allowlist with stake-coverage gate and nftables driver) is already on main; this merge collapses the divergence introduced when #510 landed on main while #509 landed on testnet in parallel.

HudsonGraeme added 2 commits May 23, 2026 16:20

coderabbitai Bot reviewed May 23, 2026

View reviewed changes

Comment thread crates/sn2-validator/src/validator_loop/dslice.rs

Comment thread crates/sn2-validator/src/validator_loop/verification.rs Outdated

HudsonGraeme added 2 commits May 23, 2026 20:39

coderabbitai Bot reviewed May 23, 2026

View reviewed changes

HudsonGraeme merged commit 7a73633 into testnet May 24, 2026
18 checks passed

HudsonGraeme deleted the implement/pre-sampled-dispatch-state-reduction branch May 24, 2026 19:27

HudsonGraeme mentioned this pull request May 25, 2026

Resolve disabled-slice rehab clock reset on re-skipped slices #512

Merged

6 tasks

HudsonGraeme mentioned this pull request May 25, 2026

Introduce dispatch cooldown for sustained-failure miners #513

Closed

5 tasks

coderabbitai Bot mentioned this pull request May 25, 2026

Develop dispatch main-loop throughput optimizations #514

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Develop validator resilience to transient chain disruption and recover from network-wide slice failures#509

Develop validator resilience to transient chain disruption and recover from network-wide slice failures#509
HudsonGraeme merged 5 commits into
testnetfrom
implement/pre-sampled-dispatch-state-reduction

HudsonGraeme commented May 23, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

HudsonGraeme commented May 23, 2026

Uh oh!

coderabbitai Bot commented May 23, 2026

Uh oh!

coderabbitai Bot commented May 23, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

HudsonGraeme commented May 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Rehabilitate disabled_slices after run-wide failures

Pre-decide RSV sampling at dispatch time

Test plan

Follow-ups (separate PRs)

Summary by CodeRabbit

Uh oh!

HudsonGraeme commented May 23, 2026

Uh oh!

coderabbitai Bot commented May 23, 2026

Uh oh!

coderabbitai Bot commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

HudsonGraeme commented May 23, 2026 •

edited by coderabbitai Bot

Loading

Rehabilitate `disabled_slices` after run-wide failures

coderabbitai Bot commented May 23, 2026 •

edited

Loading