feat(vardiff): add in-process simulation framework + baseline regression tests by gimballock · Pull Request #2154 · stratum-mining/stratum

gimballock · 2026-05-13T21:06:17Z

Adds a deterministic in-process simulation framework that characterizes
any Vardiff implementation across the operational rate range, and
commits the current algorithm's measurements as a baseline for automated
regression testing.

The "vardiff fires too often on noise" / "vardiff doesn't react fast
enough" conversations have circled the same issues (#396 and adjacent)
without a way to settle questions empirically. This PR adds the missing
infrastructure: any future proposal can now produce a quantitative
delta against a fixed reference.

The finding that motivates this

Before the framework, "is the algorithm too noisy or too sluggish?" was
a matter of opinion. With it, the question is a table:

Reaction sensitivity to a -50% step change (probability of firing
within 5 minutes after the change):

| share/min | sensitivity |
| --- | --- |
| 6 | 0.70 |
| 12 | 0.55 |
| 30 | 0.33 |
| 60 | 0.16 |
| 120 | 0.09 |

Higher share rates make the algorithm less responsive, counter to
expectations. The cause is mechanical: after convergence, delta_time
grows indefinitely, so the post-step signal is diluted in the cumulative
window. The effect is obvious in the data, yet months of deployment
observation had only half-articulated it. Full numbers and analysis
are in vardiff_baseline.md.
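
The dilution can be illustrated with a little arithmetic. The sketch below is not the VardiffState code; it assumes constant share rates with no Poisson noise, and the window lengths are illustrative values rather than numbers from the baseline. `observed_ratio` is a hypothetical helper that returns the rate seen over the whole cumulative window, relative to the pre-step rate.

```rust
/// Rate observed over a cumulative window, relative to the pre-step rate.
///
/// `pre_step_minutes`  : window time accumulated before the hashrate step
/// `post_step_minutes` : window time accumulated after the step
/// `step_factor`       : new rate / old rate (0.5 for a -50% step)
pub fn observed_ratio(pre_step_minutes: f64, post_step_minutes: f64, step_factor: f64) -> f64 {
    let total = pre_step_minutes + post_step_minutes;
    // Shares arrive at relative rate 1.0 before the step and `step_factor` after it.
    (pre_step_minutes + step_factor * post_step_minutes) / total
}
```

With 5 minutes of pre-step history, 5 minutes after a -50% step the window shows 75% of the old rate. With 60 minutes of pre-step history, the same 5 minutes only pull the observed rate down to about 96%, a deviation that is easy to mistake for noise.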

This isn't a fix. It's the measurement that lets the fix be evaluated.

What's in the PR

3 commits, ~2000 LOC plus baseline data:

1. feat(vardiff): inject Clock trait + add_shares trait method
   Minimum API additions to channels_sv2 for testability and
   simulation performance. Production behavior is unchanged: existing
   constructors default to SystemClock, and the new trait method has a
   default implementation that keeps existing impls compiling.
2. feat(vardiff_sim): in-process simulation framework
   New crate at sv2/channels-sv2/sim/. Per-tick Poisson share
   sampling, five behavioral metrics with percentile distributions,
   50-cell parameterized sweep (5 share rates × 10 scenarios), CLI
   binary for baseline generation, regression test asserting against
   a committed baseline.
3. data(vardiff_sim): design doc + baseline characterization
   The design proposal documenting metric definitions and tolerance
   policy, plus the measured baseline as both TOML (consumed by the
   regression test) and Markdown (for human review).
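
For readers unfamiliar with clock injection, here is a minimal sketch of the pattern the first commit describes. The names Clock, SystemClock, and MockClock come from the PR; the method name `now_secs`, the `advance` helper, and the `Stopwatch` consumer are assumptions made for illustration and may not match the actual code in vardiff/clock.rs.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{SystemTime, UNIX_EPOCH};

/// Source of "now" in whole seconds. The method name is hypothetical.
pub trait Clock: Send + Sync {
    fn now_secs(&self) -> u64;
}

/// Production clock: reads the operating system's wall clock.
pub struct SystemClock;

impl Clock for SystemClock {
    fn now_secs(&self) -> u64 {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0) // clock set before the epoch: report 0 instead of failing
    }
}

/// Test clock: time only moves when the test advances it.
pub struct MockClock {
    now: AtomicU64,
}

impl MockClock {
    pub fn new(start_secs: u64) -> Self {
        Self { now: AtomicU64::new(start_secs) }
    }

    pub fn advance(&self, secs: u64) {
        self.now.fetch_add(secs, Ordering::SeqCst);
    }
}

impl Clock for MockClock {
    fn now_secs(&self) -> u64 {
        self.now.load(Ordering::SeqCst)
    }
}

/// Hypothetical consumer standing in for a type that holds an Arc<dyn Clock>.
pub struct Stopwatch {
    clock: Arc<dyn Clock>,
    started: u64,
}

impl Stopwatch {
    pub fn new(clock: Arc<dyn Clock>) -> Self {
        let started = clock.now_secs();
        Self { clock, started }
    }

    pub fn elapsed_secs(&self) -> u64 {
        self.clock.now_secs().saturating_sub(self.started)
    }
}
```

Because the consumer only sees `Arc<dyn Clock>`, a simulation can step through hours of simulated time by calling `advance` without sleeping, while production code passes `Arc::new(SystemClock)`.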

What the framework measures

Five behavioral attributes, each as a distribution across 1000
independent trials per cell:

| Metric | Better is | What it tells you |
| --- | --- | --- |
| Convergence time | Smaller | How fast the algorithm settles after cold start |
| Settled accuracy | Smaller | How close to truth the algorithm lands |
| Steady-state jitter | Smaller | How often it fires on noise post-settle |
| Reaction time | Smaller | How fast it responds to genuine load changes |
| Reaction sensitivity | ≈ 1 for real Δ, ≈ 0 for noise | Whether it distinguishes signal from noise |
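
As a rough illustration of what "a distribution across trials" means here, the sketch below summarizes per-trial measurements by percentile. It assumes the nearest-rank percentile definition; the PR does not say which definition the crate's Distribution helper uses, and the type and method names are hypothetical.

```rust
/// Sorted per-trial measurements for one metric in one cell.
pub struct Distribution {
    sorted: Vec<f64>,
}

impl Distribution {
    /// Drops non-finite values, then sorts ascending.
    pub fn new(mut samples: Vec<f64>) -> Self {
        samples.retain(|v| v.is_finite());
        samples.sort_by(|a, b| a.total_cmp(b));
        Self { sorted: samples }
    }

    pub fn count(&self) -> usize {
        self.sorted.len()
    }

    pub fn mean(&self) -> Option<f64> {
        if self.sorted.is_empty() {
            return None;
        }
        Some(self.sorted.iter().sum::<f64>() / self.sorted.len() as f64)
    }

    /// Nearest-rank percentile for `p` in 0..=100 (an assumed definition).
    pub fn percentile(&self, p: f64) -> Option<f64> {
        if self.sorted.is_empty() {
            return None;
        }
        let n = self.sorted.len();
        // Rank is 1-based: the smallest rank covering at least p% of the samples.
        let rank = (p * n as f64 / 100.0).ceil() as usize;
        let idx = rank.clamp(1, n) - 1;
        Some(self.sorted[idx])
    }
}
```

Returning `Option` makes the empty case explicit, which matters for metrics that can fail to produce a sample at all, such as a trial that never converges.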

Per-metric tolerances are asserted automatically against the checked-in
baseline. Failed assertions identify the cell and metric with specific
baseline-vs-current numbers. Mid-range Δ values (10-25%) are reported
but not asserted — that's where legitimate algorithmic tradeoffs live
and a reviewer should judge by looking at the full delta.
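
A tolerance assertion of this kind could look roughly like the following. This is a sketch under assumptions: it uses a single absolute tolerance per metric, whereas the actual tolerance policy is defined in the design doc, and the function name, cell labels, and metric labels are invented for illustration.

```rust
/// Compares one measured value against its baseline.
///
/// Returns an error message naming the cell and metric, with both numbers,
/// when the absolute difference exceeds `tolerance`. A NaN on either side
/// also fails, because the `<=` comparison is false for NaN.
pub fn check_metric(
    cell: &str,
    metric: &str,
    baseline: f64,
    current: f64,
    tolerance: f64,
) -> Result<(), String> {
    let delta = (current - baseline).abs();
    if delta <= tolerance {
        Ok(())
    } else {
        Err(format!(
            "{cell} / {metric}: baseline {baseline:.3}, current {current:.3}, \
             |delta| {delta:.3} exceeds tolerance {tolerance:.3}"
        ))
    }
}
```

Collecting these `Result`s across all cells before failing would let one test run report every regression at once instead of stopping at the first.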

How to run

From sv2/channels-sv2/sim/:

# Fast unit tests (~1 second)
cargo test

# Generate a fresh baseline (~5-15 seconds)
cargo run --release --bin generate-baseline

# Run the slow regression test (~5-15 seconds; #[ignore]-d by default)
cargo test --release --lib -- --ignored

What this enables

For any future vardiff proposal:

1. Implement the new algorithm as a Vardiff impl
2. Run cargo run --release --bin generate-baseline to produce comparable
   measurements
3. Diff against the committed baseline
4. Make the case with numbers

No more "I think this is better." Instead "this changes p50 jitter from
X to Y at 12 spm, at the cost of p90 reaction time going from A to B."

Where to look in this PR

- Design proposal (architecture, metric definitions, tolerance
  rationale): sv2/channels-sv2/sim/VARDIFF_SIMULATION_FRAMEWORK.md
- Crate README (usage, output interpretation, baseline-update
  workflow): sv2/channels-sv2/sim/README.md
- The current algorithm's measured baseline:
  sv2/channels-sv2/sim/vardiff_baseline.md

What this PR is NOT

- Not an algorithm change. VardiffState behavior is unchanged.
  The only public-API additions are Vardiff::add_shares (with a
  default impl) and the Clock trait. Production code defaults to
  SystemClock and behaves identically to before.
- Not a recommendation about share rate defaults. The baseline
  data suggests 12-30 spm is the operational sweet spot, but this
  PR doesn't touch any defaults.
- Not a CI workflow. The regression test works locally but needs
  a GitHub Action to be a true CI gate. Follow-up.

Open follow-ups

- Wire cargo test --release --lib -- --ignored into CI on PRs
  touching vardiff/* or the sim crate.
- Bump channels_sv2 5.0.0 → 5.1.0 once the workspace lockfile
  situation allows (the trait-method addition is technically a
  minor-version semver change). TODO comment in Cargo.toml tracks
  this.
- Investigate the reactivity-degrades-with-rate finding. The framework
  surfaces the problem; fixing it is a separate proposal that this
  framework will be the right tool to evaluate.

Test plan

- cargo test -p channels_sv2 --lib vardiff: 17 tests, all pass
- cargo test from sv2/channels-sv2/sim/: 53 fast unit tests
- cargo test --release --lib -- --ignored from sim/: slow
  regression test passes against committed baseline
- cargo run --release --bin generate-baseline: reproduces the
  committed vardiff_baseline.toml byte-for-byte at the same seed

Two API additions to enable mockable time and bulk share-count operations in the Vardiff trait, prerequisites for the in-process simulation framework added in subsequent commits.

Clock injection:
- New vardiff/clock.rs with Clock trait, SystemClock, and MockClock.
- VardiffState gains an Arc<dyn Clock> field and a new_with_clock constructor. reset_counter and try_vardiff read time via the clock rather than calling SystemTime::now() directly.
- Existing constructors (new, new_with_min) default to SystemClock; production behavior is unchanged.

Bulk share addition:
- Vardiff trait gains add_shares(n: u32) with a default implementation calling increment_shares_since_last_update n times.
- VardiffState overrides with a single saturating add. Required for simulation performance: the harness can bulk-add millions of shares per tick during cold-start scenarios where the default's loop would dominate trial runtime.

VardiffError::TimeError is now unreachable but retained with a doc comment marking it for removal at the next major version bump; removing it now would break downstream exhaustive matches.

Semver note: channels_sv2 should bump from 5.0.0 to 5.1.0 to surface the new add_shares method to downstream consumers, but the project's pinned Rust 1.75 toolchain cannot write the v4 Cargo.lock format that a version change requires. TODO comment in Cargo.toml flags the deferred bump.

Tests: 17 vardiff tests pass (12 existing unchanged, 3 new clock-module unit tests, 2 new tests verifying clock injection propagates through VardiffState).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
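
The default-method pattern described under "Bulk share addition" can be sketched as follows. This uses a stripped-down hypothetical trait, `ShareCounter`, rather than the real Vardiff trait, which has more methods; only the names add_shares and increment_shares_since_last_update are taken from the commit message.

```rust
/// Hypothetical, minimal stand-in for the share-counting part of a vardiff trait.
pub trait ShareCounter {
    fn increment_shares_since_last_update(&mut self);
    fn shares_since_last_update(&self) -> u32;

    /// Default: O(n) loop, so existing implementors keep compiling unchanged.
    fn add_shares(&mut self, n: u32) {
        for _ in 0..n {
            self.increment_shares_since_last_update();
        }
    }
}

/// Implementor that relies on the default `add_shares`.
#[derive(Default)]
pub struct LoopingCounter {
    shares: u32,
}

impl ShareCounter for LoopingCounter {
    fn increment_shares_since_last_update(&mut self) {
        self.shares = self.shares.saturating_add(1);
    }

    fn shares_since_last_update(&self) -> u32 {
        self.shares
    }
}

/// Implementor that overrides `add_shares` with one O(1) saturating add.
#[derive(Default)]
pub struct BulkCounter {
    shares: u32,
}

impl ShareCounter for BulkCounter {
    fn increment_shares_since_last_update(&mut self) {
        self.shares = self.shares.saturating_add(1);
    }

    fn shares_since_last_update(&self) -> u32 {
        self.shares
    }

    fn add_shares(&mut self, n: u32) {
        self.shares = self.shares.saturating_add(n);
    }
}
```

Both implementors end at the same count for the same input; the override only changes the cost, which is what matters when a simulated tick adds millions of shares.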

…terization

New vardiff_sim crate at sv2/channels-sv2/sim/ providing deterministic behavioral characterization of any Vardiff implementation, plus a regression test that asserts the current algorithm against a checked-in baseline.

Purpose: surface the operationally-important attributes of the vardiff algorithm (convergence time, settled accuracy, steady-state jitter, reaction time, reaction sensitivity) in concrete measurable terms so that any future algorithmic improvement (parametric thresholds, EWMA, SPRT, etc.) can be evaluated against a fixed harness and produce a clean delta report.

Components:
- rng.rs: XorShift64 RNG plus exponential and Poisson samplers (Knuth for λ<30, normal approximation for ≥30). Hand-rolled for cross-version reproducibility without depending on the rand crate's RNG-stability guarantees.
- schedule.rs: HashrateSchedule for parameterizing the miner's true hashrate over time. Convenience constructors for stable, step-change, and throttle scenarios.
- trial.rs: run_trial drives any Vardiff implementation through duration_secs of simulated time. Per-tick Poisson sampling: at each 60s tick, samples (true_h / estimated_h) * shares_per_minute, bulk-adds via Vardiff::add_shares, calls try_vardiff. Rate-independent: handles λ from near-zero to millions.
- metrics.rs: Distribution helper (sorted f64s, p10-p99 percentiles, mean, count) plus the five metric functions. Where a metric can fail (non-converging trials, missing reactions) the rate is reported alongside the distribution.
- baseline.rs: Scenario / Cell / CellResult types and run_baseline generic over Vardiff. Default grid is 5 share rates × 10 scenarios = 50 cells. Hand-written TOML and Markdown serialization (avoiding serde + toml dependencies to keep the lockfile minimal).
- bin/generate-baseline.rs: CLI entry point. Configurable via VARDIFF_BASELINE_TRIALS, VARDIFF_BASELINE_SEED, VARDIFF_BASELINE_OUT_DIR.
- regression.rs: baseline-parsing + per-metric tolerance assertions. The classic_algorithm_no_regression test loads the committed baseline via include_str! and asserts current measurements. Marked #[ignore] because it runs the full ~5s baseline; CI should invoke via cargo test --release --lib -- --ignored.
- README.md covering usage, output interpretation, baseline-update workflow, and project-specific notes including the Cargo.lock copy-from-parent rationale.

The crate is declared as its own Cargo workspace (its Cargo.toml has a top-level [workspace] section) so its lockfile is independent of the parent stratum workspace. Required because the parent's pinned 1.75 toolchain cannot write v4 lockfiles, and adding the sim crate as a workspace member would force such a write. The committed Cargo.lock is a copy of the parent's.

Tests: 53 fast unit tests + 1 #[ignore]-d slow regression test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
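
To make the rng.rs and trial.rs descriptions concrete, here is a sketch of a seeded xorshift generator, a Poisson sampler that switches from Knuth's method to a normal approximation at λ = 30, and the per-tick λ formula quoted above. It is written from the commit message alone, so the shift constants, the Box-Muller step, the rounding, and all function names are assumptions rather than the crate's actual code.

```rust
use std::f64::consts::PI;

/// Small deterministic generator (Marsaglia xorshift64, shifts 13/7/17).
pub struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // The all-zero state is a fixed point of xorshift, so remap it.
        let state = if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed };
        Self { state }
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Uniform f64 in [0, 1) built from the top 53 bits.
    pub fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Expected shares in one 60s tick: (true_h / estimated_h) * shares_per_minute.
pub fn tick_lambda(true_h: f64, estimated_h: f64, shares_per_minute: f64) -> f64 {
    (true_h / estimated_h) * shares_per_minute
}

/// Poisson sample: Knuth's product method for small λ, normal approximation otherwise.
pub fn sample_poisson(rng: &mut XorShift64, lambda: f64) -> u64 {
    if lambda <= 0.0 {
        return 0;
    }
    if lambda < 30.0 {
        // Multiply uniforms until the product drops to e^-λ or below.
        let limit = (-lambda).exp();
        let mut k = 0u64;
        let mut product = 1.0;
        loop {
            product *= rng.next_f64();
            if product <= limit {
                return k;
            }
            k += 1;
        }
    }
    // Normal(λ, λ) via Box-Muller; u1 is kept in (0, 1] so ln(u1) is finite.
    let u1 = 1.0 - rng.next_f64();
    let u2 = rng.next_f64();
    let z = (-2.0 * u1.ln()).sqrt() * (2.0 * PI * u2).cos();
    (lambda + lambda.sqrt() * z).round().max(0.0) as u64
}
```

A trial loop would call `tick_lambda` once per simulated minute, draw a count with `sample_poisson`, and pass it to `add_shares`. Because the generator's state depends only on the seed, two runs with the same seed draw identical sequences, which is what makes a byte-for-byte reproducible baseline possible.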

…tate

- VARDIFF_SIMULATION_FRAMEWORK.md: design proposal documenting the framework's five metrics, assertion policy, simulation mechanism, and architectural rationale. Co-located with the crate it describes.
- vardiff_baseline.toml: machine-readable baseline measurements of the classic VardiffState algorithm across the default 50-cell grid (5 share rates × 10 scenarios, 1000 trials each, base seed 0xDEADBEEFCAFEF00D). Consumed by the regression test in the sim crate.
- vardiff_baseline.md: human-readable summary of the same data, organized by metric type for PR review.

Notable findings surfaced by the baseline:
- Convergence: solid across rates (100% at 30+ spm, 95% at 12 spm, 83% at 6 spm). p50 is ~10 minutes everywhere, dominated by the Phase 1 ×3/min ramp clamp.
- Settled accuracy: follows 1/sqrt(N) cleanly. p99 error is 70% at 6 spm, 27% at 12, 15% at 30, 3% at 60, 0% at 120. Low-rate operation is statistically threadbare.
- Steady-state jitter: small everywhere and ~0 above 30 spm. The algorithm's growing delta_time post-convergence narrows the effective noise band as 1/sqrt(N), producing accidental self-stabilization at high rates.
- Reaction sensitivity DEGRADES with share rate, which is counterintuitive but mechanistic. The same property that produces low jitter at high rates (growing delta_time after a Phase 1 fire) produces sluggish response to step changes (post-step shares diluted by long pre-step history). At 60+ spm only 9-16% of trials react to a 50% drop within 5 minutes.

This baseline is the reference point for evaluating any future algorithmic proposal. The regression test in the sim crate asserts each metric is within tolerance of these recorded values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
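
The 1/sqrt(N) scaling mentioned in the findings is the usual Poisson result: a count with mean N has standard deviation sqrt(N), so its relative noise is 1/sqrt(N). The helper below is a hypothetical illustration of that arithmetic; the 10-minute window is an assumed value, not a parameter taken from the baseline.

```rust
/// Relative standard deviation of a Poisson share count over a window.
///
/// The expected count is N = shares_per_minute * window_minutes, and a
/// Poisson count with mean N has standard deviation sqrt(N).
pub fn relative_noise(shares_per_minute: f64, window_minutes: f64) -> f64 {
    let expected_shares = shares_per_minute * window_minutes;
    1.0 / expected_shares.sqrt()
}
```

Over an assumed 10-minute window, 6 spm gives N = 60 and roughly 13% relative noise, while 120 spm gives N = 1200 and roughly 2.9%. A 20x higher share rate therefore narrows the noise band by a factor of sqrt(20), about 4.5.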

gimballock · 2026-05-13T22:23:53Z

The ask to get this simulator off the ground is a new trait to swap out the system clock with a MockClock, and the addition of a new public method to the vardiff trait. These changes are isolated to the first commit.

gimballock · 2026-05-13T23:26:06Z

+| share/min | rate | p10 | p50 | p90 | p99 |
+| --- | --- | --- | --- | --- | --- |
+| 6 | 83.3% | 10m | 12m | 21m | 25m |
+| 12 | 95.4% | 10m | 10m | 20m | 25m |
+| 30 | 99.5% | 10m | 10m | 15m | 25m |
+| 60 | 100.0% | 10m | 10m | 10m | 20m |
+| 120 | 100.0% | 10m | 10m | 10m | 15m |


The first row shows the convergence-time results for the default case (6 spm).
Convergence takes between 10 and 25 minutes, and 17% of trials fail to converge at all (within 5m of simulated time)!

The next rows show the results for faster share rates. The slowest (p99) times fall from 25m to 15m, and failures to converge mostly disappear around 30 spm.

Eric Price and others added 3 commits May 13, 2026 16:53

gimballock force-pushed the vardiff/simulation-framework branch from 11b2560 to 88d8d1d Compare May 13, 2026 22:06

This was referenced May 13, 2026

replace vardiff hardcoded threshold ladder with parametric noise floor #2148

Closed

[experiment] Apply new error estimate math to vardiff algo stratum-mining/sv2-apps#488

Closed

gimballock wants to merge 3 commits into stratum-mining:main from fossatmara:vardiff/simulation-framework
