Develop validator resilience to transient chain disruption and recover from network-wide slice failures #509

Merged
HudsonGraeme merged 5 commits into testnet from implement/pre-sampled-dispatch-state-reduction
May 24, 2026

@HudsonGraeme
Member

@HudsonGraeme HudsonGraeme commented May 23, 2026

Summary

Two independent improvements to validator state management under transient
disruption. Both target testnet so they can soak before promotion to main.

Rehabilitate disabled_slices after run-wide failures

The validator currently writes every slice that ends a benchmark run with
zero verified tiles into a per-circuit disable list, and the only path that
ever clears that list is full circuit deactivation. When the chain RPC
forces a reconnect and the metagraph re-sync triggers a concurrent QUIC
handshake to every miner, those handshakes time out together. The next
benchmark run sees every dslice fail with zero verified tiles, the disable
list absorbs the entire slice set, and the validator transitions into a
state where each subsequent run logs skipped=N / no circuit slices to dispatch / combined run complete failed_count=N and dispatches no work
until the process is restarted. vTrust degrades over the next epoch with
no observable cause in the local logs.

Two changes restore self-healing:

  • finalize_combined_run no longer writes to disabled_slices when every
    slice in the run failed with zero verified tiles. That signal is more
    consistent with a validator-side network or chain disruption than with
    each slice being independently unprovable, and is now logged as a
    run-wide failure for operator visibility.
  • disabled_slices carries the block height at which each slice was
    disabled, and the dispatch-side filter ignores entries older than
    DISABLED_SLICE_REHAB_BLOCKS (one tempo), so those slices become
    dispatchable again. Repeated failures still
    refresh the entry, but transient ones age out cleanly.
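
The age-out rule in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the validator's code: the helper names disable_slice and is_slice_disabled are hypothetical, the 360-block value assumes one tempo is 360 blocks, and the exact boundary (strictly younger than the window counts as disabled) is an assumption.

```rust
use std::collections::HashMap;

/// Assumed value: one tempo, taken to be 360 blocks in this sketch.
const DISABLED_SLICE_REHAB_BLOCKS: u64 = 360;

/// circuit id -> (slice id -> block height at which the slice was disabled).
type DisabledSlices = HashMap<String, HashMap<String, u64>>;

/// Record (or refresh) a disable entry at the given block height.
fn disable_slice(disabled: &mut DisabledSlices, circuit: &str, slice: &str, block: u64) {
    disabled
        .entry(circuit.to_string())
        .or_default()
        .insert(slice.to_string(), block);
}

/// A slice stays filtered out of dispatch only while its entry is younger
/// than the rehab window; older entries are treated as rehabilitated.
fn is_slice_disabled(disabled: &DisabledSlices, circuit: &str, slice: &str, block: u64) -> bool {
    disabled
        .get(circuit)
        .and_then(|slices| slices.get(slice))
        .map(|&disabled_at| block.saturating_sub(disabled_at) < DISABLED_SLICE_REHAB_BLOCKS)
        .unwrap_or(false)
}
```

Under these assumptions a slice disabled at block 1000 is skipped through block 1359 and dispatched again from block 1360, unless a repeated failure refreshes its entry first.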

Pre-decide RSV sampling at dispatch time

Random-sample verification (RSV) previously rolled at response receipt, which meant the
validator retained task_inputs (the local JSON copy of the input tensor)
across the entire in-flight window for every dispatched request even
though only ~4% of those requests would be deep-verified. At steady state
with thousands of concurrent dispatches, the retained inputs were a
dominant contributor to per-request memory footprint and amplified host
memory pressure events.
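
To give a feel for the scale involved, here is a back-of-the-envelope sketch. The request count and input size used with it are made-up figures, not measurements from a validator, and the helper name retained_input_bytes is hypothetical.

```rust
/// Bytes of task_inputs held across the in-flight window, assuming every
/// request carries an input of `avg_input_bytes` and only `retained_percent`
/// percent of requests keep it. Integer math; all inputs are illustrative.
fn retained_input_bytes(in_flight: u64, avg_input_bytes: u64, retained_percent: u64) -> u64 {
    in_flight * avg_input_bytes * retained_percent / 100
}
```

With 5,000 in-flight requests of 200,000 bytes each, retaining every input holds 1,000,000,000 bytes, while retaining only the ~4% that are deep-verified would hold 40,000,000 bytes. Force-verified requests keep their inputs too, so the real figure sits somewhat above that lower bound.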

This change rolls the RSV decision in dispatch_requests, attaches the
result to both DispatchedRequest and TaskResult as pre_sampled, and
clears task_inputs immediately for the non-sampled path before the
miner task is spawned. Force-verify paths (PoW, customer RWR,
external-hash, API dslice) continue to retain their inputs because
pre_decide_sample marks them as pre-sampled regardless of the RSV roll.

decide_sample at response time now consults the pre-decision rather than
re-rolling. The defensive no-proof / empty-response → no-sample
short-circuit is preserved so an empty response on a force-verified
request still skips verification cleanly. Post-response byte-shedding for
proof_content / witness / raw / public_json is also extended to
release MinerResponse.inputs on the non-sample path for the residual
case where a force-verified request completes with an empty proof.

Sampling rate and skiplist behavior are unchanged: the RSV roll uses the
same should_sample over the same constants, just at an earlier point in
the request lifecycle.
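
The lifecycle described in this section can be sketched as below. The struct is a trimmed-down, hypothetical stand-in for DispatchedRequest, prepare_for_dispatch is an invented helper name, and the RSV roll is passed in as a plain bool rather than computed by should_sample, so this shows only the shape of the flow, not the actual signatures.

```rust
/// Hypothetical, trimmed-down stand-in for the validator's dispatched request.
struct DispatchedRequest {
    /// PoW, customer RWR, external-hash, or API dslice request.
    force_verify: bool,
    /// Local JSON copy of the input tensor.
    task_inputs: Option<String>,
    /// Decision made at dispatch and carried through to the response.
    pre_sampled: bool,
}

/// Dispatch-time decision: force-verify requests are always pre-sampled;
/// everything else follows the RSV roll supplied by the caller.
fn pre_decide_sample(force_verify: bool, rsv_roll: bool) -> bool {
    force_verify || rsv_roll
}

/// Record the decision on the request and release inputs that will never be
/// verified, before the miner task would be spawned.
fn prepare_for_dispatch(req: &mut DispatchedRequest, rsv_roll: bool) {
    req.pre_sampled = pre_decide_sample(req.force_verify, rsv_roll);
    if !req.pre_sampled {
        req.task_inputs = None;
    }
}

/// Response-time decision: reuse the stored flag, never re-roll, and skip
/// verification when the response carries no proof.
fn decide_sample(pre_sampled: bool, has_proof: bool) -> bool {
    pre_sampled && has_proof
}
```

Passing the roll in as a parameter keeps the sketch deterministic; in the validator the roll comes from the existing should_sample logic.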

Test plan

  • cargo check -p sn2-validator
  • cargo test -p sn2-validator --bin sn2-validator (87 tests pass)
  • cargo fmt --check
  • cargo clippy -p sn2-validator --tests -- -D warnings
  • Deploy to a testnet validator and observe RSS / active_tasks /
    dispatch_budget health-log metrics over a full epoch
  • Force a chain RPC disconnect on a testnet validator and confirm the
    validator continues to dispatch after the reconnect storm rather
    than locking into the skipped=N no-dispatch loop
  • Confirm RSV strike rate and skiplist behavior on testnet match
    historical baseline (the sampling rate and decision logic are
    unchanged, only the moment of the roll has moved)

Follow-ups (separate PRs)

  • Jitter / stagger in btlightning::update_miner_registry so a mass
    reconnect does not synchronize on a single timeout boundary in the
    first place. Tracking separately because it lands in btlightning.
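
For illustration only, the simplest form of staggering spreads the reconnect attempts evenly across a window. This is a generic sketch with a hypothetical helper name, not the planned btlightning change and not its API; a real implementation would probably add random jitter on top.

```rust
/// Evenly spread `n` reconnect attempts across `window_ms` milliseconds.
/// Returns the start delay for each attempt, in dispatch order.
fn staggered_delays_ms(n: u64, window_ms: u64) -> Vec<u64> {
    (0..n).map(|i| i * window_ms / n).collect()
}
```

Four handshakes over a 1,000 ms window start at 0, 250, 500 and 750 ms, so they no longer share a single timeout boundary.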

Summary by CodeRabbit

  • New Features

    • Disabled slices now automatically rehabilitate after a configurable block window.
    • Validator now tracks skipped slices per run to improve run-level failure handling.
  • Bug Fixes

    • Avoids persisting a disable-list when all attempted slices produce no verified work, preventing permanent lockout.
    • Clears unnecessary task inputs before miner execution to reduce memory retention.
  • Refactor

    • Sampling decision moved earlier in dispatch; pre-sampling is propagated so non‑sampled work can drop inputs/proofs before execution.

Per-slice disable was a one-way write with no recovery path. A chain RPC
disconnect that forces a metagraph reconnect followed by mass QUIC handshake
timeouts caused every dslice in the run to fail with zero verified tiles,
which then permanently appended every slice for every active circuit to the
disable list. Subsequent runs short-circuit with skipped=N and the validator
stops dispatching work until the process is restarted.

Two changes restore self-healing:

- Refuse to add to the disable list when every slice in the run failed with
  zero verified tiles. That signal is more consistent with a validator-side
  network or chain event than per-slice unprovability and is logged as a
  run-wide failure for visibility.
- Track the block height at which each slice was disabled and rehabilitate
  entries older than DISABLED_SLICE_REHAB_BLOCKS (one tempo) at dispatch
  time. Entries still get refreshed on repeated failure but no longer
  persist past a transient disruption.
… bytes

Random-sample verification previously rolled at response receipt, so the
validator had to retain task_inputs (the local JSON copy of the input
tensor) across the entire in-flight window for every dispatched request,
even though only ~4% of those requests would be deep-verified. At steady
state with thousands of concurrent dispatches, the retained inputs were a
dominant contributor to per-request memory footprint and amplified host
memory pressure events.

Roll the RSV decision in dispatch_requests instead, attach the disposition
to both DispatchedRequest and TaskResult as pre_sampled, and drop
task_inputs immediately for the non-sampled path before the miner task is
spawned. Force-verify paths (PoW, customer RWR, external-hash, API dslice)
continue to retain their inputs because pre_decide_sample marks them as
pre_sampled regardless of the RSV roll. decide_sample at response time
now consults the pre-decision rather than re-rolling, with a defensive
no-proof-no-sample short-circuit preserved for empty or failed responses.

Existing post-response byte-shedding for proof_content / witness / raw /
public_json is extended to also release MinerResponse.inputs on the
non-sample path, covering the residual case where a force-verified
request completes with an empty proof.
@HudsonGraeme
Member Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented May 23, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 13f5267a-4ec6-4de2-bbeb-8e5e74454cda

📥 Commits

Reviewing files that changed from the base of the PR and between 936345b and 8b3227a.

📒 Files selected for processing (1)
  • crates/sn2-validator/src/incremental_runner.rs

Walkthrough

This PR moves RSV sampling to dispatch-time with a propagated pre_sampled flag (clearing inputs when not pre-sampled), adds run-level skipped-slice tracking, and implements time-windowed rehabilitation for disabled slices to avoid persistent no-dispatch conditions.

Changes

Dispatch-time sampling and disabled-slice rehabilitation

| Layer / File(s) | Summary |
| --- | --- |
| Data contracts and imports (crates/sn2-validator/src/validator_loop/mod.rs) | Adds pre_sampled: bool to TaskResult and DispatchedRequest, updates std::collections imports, and changes ValidatorLoop.disabled_slices to HashMap<String, HashMap<String, u64>> to store disabled_at block heights. |
| Sampling decision functions (crates/sn2-validator/src/validator_loop/verification.rs) | Adds pub(super) pre_decide_sample(...) for dispatch-time decisions (force-verify cases and RSV delegation) and simplifies decide_sample to a response-side function that uses TaskResult.pre_sampled and proof presence. |
| Dispatch-time sampling execution (crates/sn2-validator/src/validator_loop/dispatch.rs) | dispatch_requests calls pre_decide_sample to set pre_sampled; when false, task_inputs are cleared. RWR and DSlice DispatchedRequest initialize pre_sampled:false. spawn_miner_task forwards pre_sampled into TaskResult. |
| Incremental run skipped-slice tracking (crates/sn2-validator/src/incremental_runner.rs) | Adds skipped_slices per-run state, note_slice_skipped, skipped_slice_count, and ensures cleanup/eviction keep skipped-slice tracking consistent. |
| Disabled slice rehab and finalization (crates/sn2-validator/src/validator_loop/dslice.rs, crates/sn2-validator/src/validator_loop/mod.rs) | Introduces DISABLED_SLICE_REHAB_BLOCKS; enqueue_all_dslices filters disabled slices by age and records skipped slices; finalize_combined_run distinguishes skipped vs attempted slices and suppresses writing disable lists when every attempted slice failed with zero verified tiles; disabled entries are inserted with the current block height. |

Sequence Diagram(s)

sequenceDiagram
  participant Dispatch as dispatch_requests
  participant PreDecide as pre_decide_sample
  participant Spawn as spawn_miner_task
  participant Verification as start_verification
  participant Store as IncrementalRunManager

  Dispatch->>PreDecide: pre_decide_sample(dispatched, hotkey, block, tempo, rsv)
  PreDecide-->>Dispatch: pre_sampled (bool)
  Dispatch->>Spawn: pass DispatchedRequest (pre_sampled, maybe cleared task_inputs)
  Spawn-->>Verification: TaskResult (includes pre_sampled, proof bytes or cleared)
  Verification->>Store: note_slice_skipped / mark_slice_failed (as needed)
  Store-->>Dispatch: skipped_slice_count used in finalize flow (run completion)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I pre-sample hops at break of dawn,

inputs dropped where proofs are gone,
some slices nap a rehab while,
runs remember skips and file,
then validators hum, the net moves on.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 70.59% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main objectives of the PR: developing validator resilience to transient chain disruption and recovery from network-wide slice failures, which directly aligns with the two key improvements documented in the PR objectives. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/sn2-validator/src/validator_loop/dslice.rs`:
- Around line 809-833: The current all_slices_failed logic wrongly treats runs
where every slice is failed+zero-verified the same even if none were actually
dispatched (e.g., preflight rejections or already-disabled skips); change the
condition so the "treat as run-wide failure" branch only triggers when at least
one slice was actually dispatched/executed (or when failure cause indicates
execution-time outage). Concretely, add a check alongside total_slices > 0 that
queries the run state (e.g., a dispatched count or a failure-cause flag from
RunManager) — for example require run_manager.dispatched_slice_count(run_uid) >
0 (or check a per-slice failure cause via
run_manager.failure_cause/run_manager.was_dispatched) before setting
all_slices_failed — so deterministic preflight/disabled skips don’t suppress the
disable-list write.

In `@crates/sn2-validator/src/validator_loop/verification.rs`:
- Around line 250-264: decide_sample currently re-enables sampling whenever
hotkey.is_empty(), which can resurrect sampling after dispatch cleared
task_inputs; to fix, persist the "empty-hotkey at dispatch" state on TaskResult
(e.g. add a dispatched_empty_hotkey / had_empty_hotkey flag) and set that flag
inside dispatch_requests when you clear task_inputs for non-pre_sampled tasks,
then update decide_sample to consult result.dispatched_empty_hotkey instead of
computing hotkey.is_empty() at verify time; ensure dispatch_requests sets the
flag and decide_sample uses result.pre_sampled || result.dispatched_empty_hotkey
(and only treats force-sample when the Response still contains inputs if you
prefer the alternative safeguard).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: b8649a51-b823-4192-96ea-bf7a67809a35

📥 Commits

Reviewing files that changed from the base of the PR and between f1368d5 and 41b5a0b.

📒 Files selected for processing (4)
  • crates/sn2-validator/src/validator_loop/dispatch.rs
  • crates/sn2-validator/src/validator_loop/dslice.rs
  • crates/sn2-validator/src/validator_loop/mod.rs
  • crates/sn2-validator/src/validator_loop/verification.rs

Comment thread crates/sn2-validator/src/validator_loop/dslice.rs
Comment thread crates/sn2-validator/src/validator_loop/verification.rs Outdated
…lice

The previous guard fired whenever every slice in a benchmark run ended with
is_slice_failed=true and zero verified tiles. That signal conflates two
distinct outcomes: every slice was deterministically skipped before
dispatch (already-disabled entries, preflight rejections) versus every
slice that was actually attempted failed during dispatch. Only the latter
is the network-wide event the guard is designed to absorb; the former is
either a no-op (re-marking already-disabled slices) or a legitimate signal
that preflight-rejected slices should be added to the disable list.

Track deterministically-skipped slices per run via a new
note_slice_skipped / skipped_slice_count pair on IncrementalRunManager,
called at the two existing skip call sites (already-disabled and
preflight_failed). The guard now requires attempted > 0 and every
attempted slice to have failed with zero verified tiles before suppressing
the disable-list write. Mixed runs where some slices were skipped and
others legitimately failed during dispatch continue to disable the
genuinely-failed slices.

Skipped-slice bookkeeping is removed alongside other per-run state in
remove_run and evict_by_circuit so the map cannot leak across run
lifecycles.
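
The guard described in this commit can be sketched as follows. SliceOutcome and is_run_wide_failure are hypothetical names introduced for illustration; the validator derives the same information from IncrementalRunManager rather than from a flat list like this.

```rust
/// Hypothetical per-slice outcome for one finished benchmark run.
struct SliceOutcome {
    /// Deterministically skipped before dispatch (already disabled or
    /// rejected at preflight).
    skipped: bool,
    failed: bool,
    verified_tiles: u64,
}

/// True when the disable-list write should be suppressed: at least one slice
/// was attempted, and every attempted slice failed with zero verified tiles.
fn is_run_wide_failure(slices: &[SliceOutcome]) -> bool {
    let mut attempted = 0usize;
    for s in slices.iter().filter(|s| !s.skipped) {
        attempted += 1;
        if !(s.failed && s.verified_tiles == 0) {
            return false;
        }
    }
    attempted > 0
}
```

A run in which every slice was skipped has attempted == 0, so it is not classified as a run-wide failure and the disable-list write proceeds as before.
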

decide_sample retained an empty-hotkey force-sample branch carried over
from the historical response-time RSV roll. With sampling pre-decided at
dispatch, that branch reintroduced a race: if the metagraph reshuffled
between dispatch and response (uid deregistered, uid_hotkeys entry
cleared) the verifier would revive sampling on a request whose
task_inputs had already been released, leading to a verification attempt
against a missing input tensor.

The equivalent empty-hotkey safety net lives in pre_decide_sample at
dispatch time, where the corresponding pre_sampled=true keeps task_inputs
retained for the in-flight window. The verify-side check is now strictly
result.pre_sampled (gated on has_proof), removing the post-dispatch
hotkey lookup from the path entirely.
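
A minimal sketch of the race and its fix, under stated assumptions: the parameter lists are simplified, decide_sample_with_hotkey_lookup is a hypothetical name for the earlier response-side shape, and whether that earlier shape was gated on has_proof is assumed here.

```rust
/// Dispatch-time decision with the empty-hotkey safety net: an unknown hotkey
/// forces sampling, so task_inputs stay retained for the in-flight window.
fn pre_decide_sample(force_verify: bool, hotkey_at_dispatch: &str, rsv_roll: bool) -> bool {
    force_verify || hotkey_at_dispatch.is_empty() || rsv_roll
}

/// Earlier response-side shape (assumed): looking the hotkey up again could
/// revive sampling for a request whose inputs were already released.
fn decide_sample_with_hotkey_lookup(
    pre_sampled: bool,
    hotkey_at_response: &str,
    has_proof: bool,
) -> bool {
    has_proof && (pre_sampled || hotkey_at_response.is_empty())
}

/// Response-side shape after the fix: only the stored flag and proof presence.
fn decide_sample(pre_sampled: bool, has_proof: bool) -> bool {
    has_proof && pre_sampled
}
```

If a request is dispatched with a known hotkey and a negative roll, and the uid is deregistered before the response arrives, the earlier shape returns true even though the inputs are gone; the fixed shape returns false.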

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/sn2-validator/src/incremental_runner.rs (1)

485-506: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Missing skipped_slices cleanup in gc_stale.

Unlike remove_run and evict_by_circuit, this function doesn't clean up skipped_slices entries for evicted runs. This creates a memory leak where stale run entries accumulate over time.

🐛 Proposed fix
     pub fn gc_stale(&mut self, idle_timeout: Duration) -> Vec<String> {
         let now = Instant::now();
         let stale: Vec<String> = self
             .runs
             .iter()
             .filter(|(_, run)| now.duration_since(run.last_activity) >= idle_timeout)
             .map(|(uid, _)| uid.clone())
             .collect();
         if !stale.is_empty() {
             info!(count = stale.len(), run_uids = ?stale, "evicting idle runs");
         }
         let stale_set: HashSet<&str> = stale.iter().map(|s| s.as_str()).collect();
         self.tile_counters
             .retain(|(run_uid, _), _| !stale_set.contains(run_uid.as_str()));
         self.verified_tile_counts
             .retain(|(run_uid, _), _| !stale_set.contains(run_uid.as_str()));
+        self.skipped_slices
+            .retain(|run_uid, _| !stale_set.contains(run_uid.as_str()));
         for uid in stale.iter() {
             self.runs.remove(uid);
             self.evicted.insert(uid.clone());
         }
         stale
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/sn2-validator/src/incremental_runner.rs` around lines 485 - 506, The
gc_stale function fails to remove entries from skipped_slices for evicted runs,
causing a memory leak; update gc_stale to compute the same stale run UIDs
(stale) and then remove corresponding entries from skipped_slices just like it
already does for tile_counters and verified_tile_counts, and ensure it mirrors
the cleanup behavior in remove_run and evict_by_circuit (use the same run UID
set/stale_set and call skipped_slices.retain or remove per-uid before inserting
into evicted and deleting from runs).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 72e1d23f-55b5-47b2-8bf9-f6fbefa65ee4

📥 Commits

Reviewing files that changed from the base of the PR and between 41b5a0b and 936345b.

📒 Files selected for processing (3)
  • crates/sn2-validator/src/incremental_runner.rs
  • crates/sn2-validator/src/validator_loop/dslice.rs
  • crates/sn2-validator/src/validator_loop/verification.rs

remove_run and evict_by_circuit already drop their share of skipped_slices
when a run leaves the manager, but gc_stale was added before the field and
never updated. Idle benchmark runs aged out by gc_stale would leave their
per-slice skip records behind, slowly growing the map across long-lived
validator processes.

Mirror the existing tile_counters and verified_tile_counts retain pattern
so the same stale_set drops skipped_slices entries before the runs are
removed and inserted into the evicted FIFO.
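
The bookkeeping and its cleanup can be sketched in isolation as below. RunState and drop_stale are hypothetical names, and the HashSet<String> value type for skipped_slices is an assumption; only the run-uid keying is visible in the proposed fix.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical slice of the run manager's state: run uid -> skipped slice ids.
struct RunState {
    skipped_slices: HashMap<String, HashSet<String>>,
}

impl RunState {
    /// Remember that a slice was deterministically skipped in this run.
    fn note_slice_skipped(&mut self, run_uid: &str, slice: &str) {
        self.skipped_slices
            .entry(run_uid.to_string())
            .or_default()
            .insert(slice.to_string());
    }

    /// Number of skipped slices recorded for a run (0 if the run is unknown).
    fn skipped_slice_count(&self, run_uid: &str) -> usize {
        self.skipped_slices.get(run_uid).map_or(0, |s| s.len())
    }

    /// Drop skip records for every run uid in `stale`, using the same
    /// retain-against-a-set pattern as the other per-run maps.
    fn drop_stale(&mut self, stale: &[String]) {
        let stale_set: HashSet<&str> = stale.iter().map(|s| s.as_str()).collect();
        self.skipped_slices
            .retain(|run_uid, _| !stale_set.contains(run_uid.as_str()));
    }
}
```

Building the set once and calling retain keeps the cleanup to a single pass over the map regardless of how many runs are evicted.
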
@HudsonGraeme HudsonGraeme merged commit 7a73633 into testnet May 24, 2026
18 checks passed
@HudsonGraeme HudsonGraeme deleted the implement/pre-sampled-dispatch-state-reduction branch May 24, 2026 19:27
HudsonGraeme added a commit that referenced this pull request May 25, 2026
Brings #510 (trust-on-first-use validator allowlist with stake-coverage
gate) into the testnet branch alongside #509 (pre-sampled RSV dispatch
and disabled-slice rehab) and #512 (rehab clock reset fix). The next
testnet release will ship all three on the testnet miners and
validators so operators can smoke-test the allowlist path on a host
that runs without --disable-blacklist.
HudsonGraeme added a commit that referenced this pull request May 25, 2026
Promotes the validator resilience and rehab-cooldown work that has been
running on testnet onto main for the next mainnet patch tag. Brings in
#509 (pre-sampled RSV dispatch, disabled-slice rehab with stake-weighted
run-wide-failure detection) and #512 (rehab clock-reset fix for slices
that re-enter the skip path while still within the 360-block cooldown).
#510 (trust-on-first-use validator allowlist with stake-coverage gate
and nftables driver) is already on main; this merge collapses the
divergence introduced when #510 landed on main while #509 landed on
testnet in parallel.