feat(vardiff): add in-process simulation framework + baseline regression tests#2154

Open
gimballock wants to merge 3 commits into stratum-mining:main from fossatmara:vardiff/simulation-framework

@gimballock

Adds a deterministic in-process simulation framework that characterizes
any Vardiff implementation across the operational rate range, and
commits the current algorithm's measurements as a baseline for automated
regression testing.

The "vardiff fires too often on noise" / "vardiff doesn't react fast
enough" conversations have circled the same issues (#396 and adjacent)
without a way to settle questions empirically. This PR adds the missing
infrastructure: any future proposal can now produce a quantitative
delta against a fixed reference.

The finding that motivates this

Before the framework, "is the algorithm too noisy or too sluggish?" was
a matter of opinion. With it, the question is a table:

Reaction sensitivity to a -50% step change (probability of firing
within 5 minutes after the change):

| share/min | sensitivity |
| --- | --- |
| 6 | 0.70 |
| 12 | 0.55 |
| 30 | 0.33 |
| 60 | 0.16 |
| 120 | 0.09 |

Higher share rates make the algorithm less responsive, counter to
expectations. The cause is mechanical: post-convergence delta_time
grows indefinitely, diluting the post-step signal in the cumulative
window. It surfaced clearly in the data, even though months of
deployment observation had only half-articulated it. Full numbers and
analysis are in vardiff_baseline.md.
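
The dilution effect can be illustrated with a little arithmetic. The sketch below assumes the estimator averages shares over a single cumulative window that started `pre_secs` before the step; the function name and parameters are hypothetical and are not taken from VardiffState.

```rust
/// Ratio of the observed share rate to the pre-step rate when one cumulative
/// window holds `pre_secs` of history at full hashrate followed by
/// `post_secs` at `step_ratio` of that hashrate (0.5 models a -50% step).
/// Hypothetical helper, for illustration only.
fn observed_rate_ratio(pre_secs: f64, post_secs: f64, step_ratio: f64) -> f64 {
    // Shares accrue in proportion to time x hashrate, so the cumulative
    // average is a time-weighted mean of the two regimes.
    (pre_secs + post_secs * step_ratio) / (pre_secs + post_secs)
}
```

Under these assumptions, with 5 minutes of pre-step history the window reports 75% of the old rate 5 minutes after a -50% step. With 60 minutes of history it still reports about 96%, which is much harder to tell apart from noise.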

This isn't a fix. It's the measurement that lets the fix be evaluated.

What's in the PR

3 commits, ~2000 LOC plus baseline data:

  1. feat(vardiff): inject Clock trait + add_shares trait method
    Minimum API additions to channels_sv2 for testability and
    simulation performance. Production behavior unchanged — existing
    constructors default to SystemClock, the new trait method has a
    default implementation that keeps existing impls compiling.

  2. feat(vardiff_sim): in-process simulation framework
    New crate at sv2/channels-sv2/sim/. Per-tick Poisson share
    sampling, five behavioral metrics with percentile distributions,
    50-cell parameterized sweep (5 share rates × 10 scenarios), CLI
    binary for baseline generation, regression test asserting against
    a committed baseline.

  3. data(vardiff_sim): design doc + baseline characterization
    The design proposal documenting metric definitions and tolerance
    policy, plus the measured baseline as both TOML (consumed by the
    regression test) and Markdown (for human review).

What the framework measures

Five behavioral attributes, each as a distribution across 1000
independent trials per cell:

| Metric | Better is | What it tells you |
| --- | --- | --- |
| Convergence time | Smaller | How fast the algorithm settles after cold start |
| Settled accuracy | Smaller | How close to truth the algorithm lands |
| Steady-state jitter | Smaller | How often it fires on noise post-settle |
| Reaction time | Smaller | How fast it responds to genuine load changes |
| Reaction sensitivity | ≈ 1 for real Δ, ≈ 0 for noise | Whether it distinguishes signal from noise |
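
Each metric is reported as a distribution over the trials in a cell. As a rough sketch of what such a summary could look like, the type below sorts the per-trial values and reads percentiles by nearest rank. The struct, its method names, and the nearest-rank rule are assumptions made for illustration; the crate's own helper may differ.

```rust
/// Hypothetical stand-in for a per-cell distribution summary.
struct Distribution {
    sorted: Vec<f64>,
}

impl Distribution {
    fn new(mut samples: Vec<f64>) -> Self {
        // total_cmp gives a total order on f64, so sorting cannot panic.
        samples.sort_by(|a, b| a.total_cmp(b));
        Self { sorted: samples }
    }

    /// Nearest-rank percentile for `p` in (0, 100]; None when there are no samples.
    fn percentile(&self, p: f64) -> Option<f64> {
        if self.sorted.is_empty() {
            return None;
        }
        let n = self.sorted.len();
        let rank = ((p / 100.0) * n as f64).ceil() as usize;
        Some(self.sorted[rank.clamp(1, n) - 1])
    }

    /// Arithmetic mean; None when there are no samples.
    fn mean(&self) -> Option<f64> {
        if self.sorted.is_empty() {
            return None;
        }
        Some(self.sorted.iter().sum::<f64>() / self.sorted.len() as f64)
    }
}
```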

Per-metric tolerances are asserted automatically against the checked-in
baseline. Failed assertions identify the cell and metric with specific
baseline-vs-current numbers. Mid-range Δ values (10-25%) are reported
but not asserted — that's where legitimate algorithmic tradeoffs live
and a reviewer should judge by looking at the full delta.
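
A tolerance assertion of this kind might look like the sketch below. It assumes a simple absolute tolerance and a hypothetical `check_metric` helper; the actual tolerance policy is defined in the design doc and may be relative or per-metric.

```rust
/// Compare one metric of one cell against its baseline value.
/// Hypothetical helper: the name, signature, and absolute-tolerance rule
/// are assumptions, not the sim crate's API.
fn check_metric(
    cell: &str,
    metric: &str,
    baseline: f64,
    current: f64,
    tolerance: f64,
) -> Result<(), String> {
    let delta = (current - baseline).abs();
    if delta <= tolerance {
        Ok(())
    } else {
        // Name the cell and metric and show both numbers, so a failure
        // can be read without re-running the sweep.
        Err(format!(
            "{cell} / {metric}: baseline {baseline:.3}, current {current:.3}, \
             |delta| {delta:.3} exceeds tolerance {tolerance:.3}"
        ))
    }
}
```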

How to run

From sv2/channels-sv2/sim/:

# Fast unit tests (~1 second)
cargo test

# Generate a fresh baseline (~5-15 seconds)
cargo run --release --bin generate-baseline

# Run the slow regression test (~5-15 seconds; #[ignore]-d by default)
cargo test --release --lib -- --ignored

What this enables

For any future vardiff proposal:

  1. Implement the new algorithm as a Vardiff impl
  2. cargo run --release --bin generate-baseline to produce comparable
    measurements
  3. Diff against the committed baseline
  4. Make the case with numbers

No more "I think this is better." Instead "this changes p50 jitter from
X to Y at 12 spm, at the cost of p90 reaction time going from A to B."

Where to look in this PR

What this PR is NOT

  • Not an algorithm change. VardiffState behavior is unchanged.
    The only public-API additions are Vardiff::add_shares (with a
    default impl) and the Clock trait. Production code defaults to
    SystemClock and behaves identically to before.
  • Not a recommendation about share rate defaults. The baseline
    data suggests 12-30 spm is the operational sweet spot, but this
    PR doesn't touch any defaults.
  • Not a CI workflow. The regression test works locally but needs
    a GitHub Action to be a true CI gate. Follow-up.

Open follow-ups

  • Wire cargo test --release --lib -- --ignored into CI on PRs
    touching vardiff/* or the sim crate.
  • Bump channels_sv2 5.0.0 → 5.1.0 once the workspace lockfile
    situation allows (the trait-method addition is technically a
    minor-version semver change). TODO comment in Cargo.toml tracks
    this.
  • Investigate the reactivity-degrades-with-rate finding. The framework
    surfaces the problem; fixing it is a separate proposal that this
    framework will be the right tool to evaluate.

Test plan

  • cargo test -p channels_sv2 --lib vardiff — 17 tests, all pass
  • cargo test from sv2/channels-sv2/sim/ — 53 fast unit tests
  • cargo test --release --lib -- --ignored from sim/ — slow
    regression test passes against committed baseline
  • cargo run --release --bin generate-baseline — reproduces the
    committed vardiff_baseline.toml byte-for-byte at the same seed

Eric Price and others added 3 commits May 13, 2026 16:53
Two API additions to enable mockable time and bulk share-count operations
in the Vardiff trait, prerequisites for the in-process simulation
framework added in subsequent commits.

Clock injection:
- New vardiff/clock.rs with Clock trait, SystemClock, and MockClock.
- VardiffState gains an Arc<dyn Clock> field and a new_with_clock
  constructor. reset_counter and try_vardiff read time via the clock
  rather than calling SystemTime::now() directly.
- Existing constructors (new, new_with_min) default to SystemClock;
  production behavior is unchanged.
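
A minimal sketch of the clock-injection pattern described above. The trait method `now_secs`, the `advance` helper, and the toy `Elapsed` consumer are assumptions for illustration; only the names Clock, SystemClock, MockClock, and new_with_clock come from this commit message.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{SystemTime, UNIX_EPOCH};

/// Assumed shape of the trait; the method name is hypothetical.
trait Clock: Send + Sync {
    fn now_secs(&self) -> u64;
}

/// Production clock: reads wall-clock time.
struct SystemClock;

impl Clock for SystemClock {
    fn now_secs(&self) -> u64 {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0)
    }
}

/// Test clock: time only moves when the test advances it.
struct MockClock {
    now: AtomicU64,
}

impl MockClock {
    fn new(start: u64) -> Self {
        Self { now: AtomicU64::new(start) }
    }

    fn advance(&self, secs: u64) {
        self.now.fetch_add(secs, Ordering::SeqCst);
    }
}

impl Clock for MockClock {
    fn now_secs(&self) -> u64 {
        self.now.load(Ordering::SeqCst)
    }
}

/// Toy consumer standing in for VardiffState: it holds an Arc<dyn Clock>
/// and never calls SystemTime::now() directly.
struct Elapsed {
    clock: Arc<dyn Clock>,
    started: u64,
}

impl Elapsed {
    fn new_with_clock(clock: Arc<dyn Clock>) -> Self {
        let started = clock.now_secs();
        Self { clock, started }
    }

    fn delta_time(&self) -> u64 {
        self.clock.now_secs().saturating_sub(self.started)
    }
}
```

In a test, the harness keeps its own handle to the MockClock and calls `advance(60)` between ticks, so an hour of simulated time costs no wall-clock time.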

Bulk share addition:
- Vardiff trait gains add_shares(n: u32) with a default implementation
  calling increment_shares_since_last_update n times.
- VardiffState overrides with a single saturating add. Required for
  simulation performance — the harness can bulk-add millions of shares
  per tick during cold-start scenarios where the default's loop would
  dominate trial runtime.
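
The default-plus-override arrangement can be sketched as follows. `ShareCounter` is a cut-down, hypothetical stand-in for the Vardiff trait, and the real method signatures may differ; the point is only that the default body keeps existing implementors compiling while one implementor swaps the loop for a single saturating add.

```rust
/// Cut-down stand-in for the Vardiff trait (hypothetical).
trait ShareCounter {
    fn increment_shares_since_last_update(&mut self);

    /// Default body: an O(n) loop over the existing method, so
    /// implementors written before this method existed still compile.
    fn add_shares(&mut self, n: u32) {
        for _ in 0..n {
            self.increment_shares_since_last_update();
        }
    }
}

/// Relies on the default loop.
struct LoopCounter {
    shares: u32,
}

impl ShareCounter for LoopCounter {
    fn increment_shares_since_last_update(&mut self) {
        self.shares = self.shares.saturating_add(1);
    }
}

/// Overrides add_shares with one saturating add: O(1) per tick.
struct BulkCounter {
    shares: u32,
}

impl ShareCounter for BulkCounter {
    fn increment_shares_since_last_update(&mut self) {
        self.shares = self.shares.saturating_add(1);
    }

    fn add_shares(&mut self, n: u32) {
        self.shares = self.shares.saturating_add(n);
    }
}
```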

VardiffError::TimeError is now unreachable but retained with a doc
comment marking it for removal at the next major version bump; removing
it now would break downstream exhaustive matches.

Semver note: channels_sv2 should bump from 5.0.0 to 5.1.0 to surface the
new add_shares method to downstream consumers, but the project's pinned
Rust 1.75 toolchain cannot write the v4 Cargo.lock format that a version
change requires. TODO comment in Cargo.toml flags the deferred bump.

Tests: 17 vardiff tests pass (12 existing unchanged, 3 new clock-module
unit tests, 2 new tests verifying clock injection propagates through
VardiffState).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…terization

New vardiff_sim crate at sv2/channels-sv2/sim/ providing deterministic
behavioral characterization of any Vardiff implementation, plus a
regression test that asserts the current algorithm against a
checked-in baseline.

Purpose: surface the operationally-important attributes of the vardiff
algorithm — convergence time, settled accuracy, steady-state jitter,
reaction time, reaction sensitivity — in concrete measurable terms so
that any future algorithmic improvement (parametric thresholds, EWMA,
SPRT, etc.) can be evaluated against a fixed harness and produce a
clean delta report.

Components:

- rng.rs: XorShift64 RNG plus exponential and Poisson samplers
  (Knuth for λ<30, normal approximation for ≥30). Hand-rolled for
  cross-version reproducibility without depending on the rand crate's
  RNG-stability guarantees.

- schedule.rs: HashrateSchedule for parameterizing the miner's true
  hashrate over time. Convenience constructors for stable, step-change,
  and throttle scenarios.

- trial.rs: run_trial drives any Vardiff implementation through
  duration_secs of simulated time. Per-tick Poisson sampling: at each
  60s tick, it draws a share count from a Poisson distribution with
  λ = (true_h / estimated_h) * shares_per_minute, bulk-adds it via
  Vardiff::add_shares, then calls try_vardiff. Rate-independent:
  handles λ from near-zero to millions.

- metrics.rs: Distribution helper (sorted f64s, p10-p99 percentiles,
  mean, count) plus the five metric functions. Where a metric can fail
  (non-converging trials, missing reactions) the rate is reported
  alongside the distribution.

- baseline.rs: Scenario / Cell / CellResult types and run_baseline
  generic over Vardiff. Default grid is 5 share rates × 10 scenarios =
  50 cells. Hand-written TOML and Markdown serialization (avoiding
  serde + toml dependencies to keep the lockfile minimal).

- bin/generate-baseline.rs: CLI entry point. Configurable via
  VARDIFF_BASELINE_TRIALS, VARDIFF_BASELINE_SEED, VARDIFF_BASELINE_OUT_DIR.

- regression.rs: baseline-parsing + per-metric tolerance assertions.
  The classic_algorithm_no_regression test loads the committed baseline
  via include_str! and asserts current measurements. Marked #[ignore]
  because it runs the full ~5s baseline; CI should invoke via
  cargo test --release --lib -- --ignored.

- README.md covering usage, output interpretation, baseline-update
  workflow, and project-specific notes including the Cargo.lock
  copy-from-parent rationale.
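
As a rough illustration of the rng.rs and trial.rs descriptions above, the sketch below pairs a textbook xorshift64 generator with a Poisson sampler that switches from Knuth's method to a normal approximation at λ = 30, then computes one tick's λ. The shift triple (13, 7, 17), the zero-seed fallback, the Box-Muller transform, and all function names are assumptions; the crate's actual code may differ in each of these.

```rust
/// Marsaglia xorshift64 with the common 13/7/17 shift triple.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // An all-zero state would stay zero forever, so substitute a constant.
        Self(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Uniform in [0, 1), built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Poisson sample: Knuth's method below λ = 30, normal approximation above.
fn poisson(rng: &mut XorShift64, lambda: f64) -> u64 {
    if lambda <= 0.0 {
        return 0;
    }
    if lambda < 30.0 {
        // Knuth: multiply uniforms until the product drops to e^-λ or below.
        let limit = (-lambda).exp();
        let mut k = 0u64;
        let mut p = rng.next_f64();
        while p > limit {
            k += 1;
            p *= rng.next_f64();
        }
        k
    } else {
        // Box-Muller standard normal, scaled to mean λ and variance λ.
        let u1 = 1.0 - rng.next_f64(); // in (0, 1], keeps ln() finite
        let u2 = rng.next_f64();
        let z = (-2.0 * u1.ln()).sqrt() * (std::f64::consts::TAU * u2).cos();
        (lambda + lambda.sqrt() * z).round().max(0.0) as u64
    }
}

/// Expected shares in one 60s tick: the miner's true hashrate relative to
/// the current estimate, times the target share rate.
fn tick_lambda(true_h: f64, estimated_h: f64, shares_per_minute: f64) -> f64 {
    (true_h / estimated_h) * shares_per_minute
}
```

Because the generator is seeded explicitly and depends on no external crate, two runs with the same seed produce the same sequence of share counts, which is the property the byte-for-byte baseline reproduction relies on.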

The crate is declared as its own Cargo workspace (its Cargo.toml has a
top-level [workspace] section) so its lockfile is independent of the
parent stratum workspace. Required because the parent's pinned 1.75
toolchain cannot write v4 lockfiles, and adding the sim crate as a
workspace member would force such a write. The committed Cargo.lock is
a copy of the parent's.

Tests: 53 fast unit tests + 1 #[ignore]-d slow regression test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tate

- VARDIFF_SIMULATION_FRAMEWORK.md: design proposal documenting the
  framework's five metrics, assertion policy, simulation mechanism,
  and architectural rationale. Co-located with the crate it describes.

- vardiff_baseline.toml: machine-readable baseline measurements of
  the classic VardiffState algorithm across the default 50-cell grid
  (5 share rates × 10 scenarios, 1000 trials each, base seed
  0xDEADBEEFCAFEF00D). Consumed by the regression test in the sim
  crate.

- vardiff_baseline.md: human-readable summary of the same data,
  organized by metric type for PR review.

Notable findings surfaced by the baseline:

- Convergence: solid across rates (99.5-100% at 30+ spm, 95% at 12 spm,
  83% at 6 spm). p50 is 10-12 minutes everywhere, dominated by the
  Phase 1 ×3/min ramp clamp.

- Settled accuracy: follows 1/sqrt(N) cleanly. p99 error is 70% at
  6 spm, 27% at 12, 15% at 30, 3% at 60, 0% at 120. Low-rate
  operation is statistically threadbare.

- Steady-state jitter: small everywhere and ~0 above 30 spm. The
  algorithm's growing delta_time post-convergence narrows the
  effective noise band as 1/sqrt(N), producing accidental self-
  stabilization at high rates.

- Reaction sensitivity DEGRADES with share rate — counterintuitive
  but mechanistic. The same property that produces low jitter at
  high rates (growing delta_time after a Phase 1 fire) produces
  sluggish response to step changes (post-step shares diluted by
  long pre-step history). At 60+ spm only 9-16% of trials react to
  a 50% drop within 5 minutes.

This baseline is the reference point for evaluating any future
algorithmic proposal. The regression test in the sim crate asserts
each metric is within tolerance of these recorded values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gimballock gimballock force-pushed the vardiff/simulation-framework branch from 11b2560 to 88d8d1d Compare May 13, 2026 22:06

gimballock commented May 13, 2026

To get this simulator off the ground, the ask is twofold: a new Clock trait, so the system clock can be swapped for a MockClock, and one new public method on the Vardiff trait. Both changes are isolated to the first commit.

Comment on lines +7 to +13
| share/min | rate | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- | --- |
| 6 | 83.3% | 10m | 12m | 21m | 25m |
| 12 | 95.4% | 10m | 10m | 20m | 25m |
| 30 | 99.5% | 10m | 10m | 15m | 25m |
| 60 | 100.0% | 10m | 10m | 10m | 20m |
| 120 | 100.0% | 10m | 10m | 10m | 15m |

The first row here shows the results of the convergence-time test for the default case (6 spm).
Convergence times range from 10 to 25 minutes, and total failures to converge (within 5m of simulated time) occur 17% of the time!

The next few rows show the results for faster share rates. The most extreme (p99) time falls from 25m to 15m, and the total-failure cases all but disappear at around 30 spm.
