feat(vardiff): add in-process simulation framework + baseline regression tests #2154
Open
gimballock wants to merge 3 commits into
Two API additions to enable mockable time and bulk share-count operations in the Vardiff trait, prerequisites for the in-process simulation framework added in subsequent commits.

Clock injection:
- New vardiff/clock.rs with Clock trait, SystemClock, and MockClock.
- VardiffState gains an Arc<dyn Clock> field and a new_with_clock constructor. reset_counter and try_vardiff read time via the clock rather than calling SystemTime::now() directly.
- Existing constructors (new, new_with_min) default to SystemClock; production behavior is unchanged.

Bulk share addition:
- Vardiff trait gains add_shares(n: u32) with a default implementation calling increment_shares_since_last_update n times.
- VardiffState overrides with a single saturating add. Required for simulation performance: the harness can bulk-add millions of shares per tick during cold-start scenarios where the default's loop would dominate trial runtime.

VardiffError::TimeError is now unreachable but retained with a doc comment marking it for removal at the next major version bump; removing it now would break downstream exhaustive matches.

Semver note: channels_sv2 should bump from 5.0.0 to 5.1.0 to surface the new add_shares method to downstream consumers, but the project's pinned Rust 1.75 toolchain cannot write the v4 Cargo.lock format that a version change requires. TODO comment in Cargo.toml flags the deferred bump.

Tests: 17 vardiff tests pass (12 existing unchanged, 3 new clock-module unit tests, 2 new tests verifying clock injection propagates through VardiffState).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
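The clock-injection and bulk-add pattern described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in vardiff/clock.rs: the `now_secs` method, the `ShareCounter` trait, the `State` struct, and `elapsed_secs` are hypothetical stand-ins, and only the names Clock, SystemClock, MockClock, new_with_clock, add_shares, and increment_shares_since_last_update come from the commit message.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{SystemTime, UNIX_EPOCH};

/// Source of "now" in whole seconds since the Unix epoch (assumed signature).
pub trait Clock: Send + Sync {
    fn now_secs(&self) -> u64;
}

/// Production clock: reads the real system time.
pub struct SystemClock;

impl Clock for SystemClock {
    fn now_secs(&self) -> u64 {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0)
    }
}

/// Test clock: time only moves when `advance` is called.
pub struct MockClock {
    now: AtomicU64,
}

impl MockClock {
    pub fn new(start: u64) -> Self {
        Self { now: AtomicU64::new(start) }
    }

    pub fn advance(&self, secs: u64) {
        self.now.fetch_add(secs, Ordering::SeqCst);
    }
}

impl Clock for MockClock {
    fn now_secs(&self) -> u64 {
        self.now.load(Ordering::SeqCst)
    }
}

/// Hypothetical stand-in for the share-counting part of the Vardiff trait.
pub trait ShareCounter {
    fn increment_shares_since_last_update(&mut self);
    fn shares(&self) -> u32;

    /// Default implementation: n single increments, so existing impls keep compiling.
    fn add_shares(&mut self, n: u32) {
        for _ in 0..n {
            self.increment_shares_since_last_update();
        }
    }
}

/// Hypothetical stand-in for VardiffState, holding an injected clock.
pub struct State {
    shares: u32,
    clock: Arc<dyn Clock>,
    last_update: u64,
}

impl State {
    pub fn new_with_clock(clock: Arc<dyn Clock>) -> Self {
        let last_update = clock.now_secs();
        Self { shares: 0, clock, last_update }
    }

    /// Seconds since the last update, read through the injected clock.
    pub fn elapsed_secs(&self) -> u64 {
        self.clock.now_secs().saturating_sub(self.last_update)
    }
}

impl ShareCounter for State {
    fn increment_shares_since_last_update(&mut self) {
        self.shares = self.shares.saturating_add(1);
    }

    fn shares(&self) -> u32 {
        self.shares
    }

    /// Override: one saturating add instead of a loop of n increments.
    fn add_shares(&mut self, n: u32) {
        self.shares = self.shares.saturating_add(n);
    }
}
```

A test would hold its own handle to the MockClock, advance it by a tick, and then observe the elapsed time through the state, with no real waiting involved.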
…terization

New vardiff_sim crate at sv2/channels-sv2/sim/ providing deterministic behavioral characterization of any Vardiff implementation, plus a regression test that asserts the current algorithm against a checked-in baseline.

Purpose: surface the operationally-important attributes of the vardiff algorithm (convergence time, settled accuracy, steady-state jitter, reaction time, reaction sensitivity) in concrete measurable terms so that any future algorithmic improvement (parametric thresholds, EWMA, SPRT, etc.) can be evaluated against a fixed harness and produce a clean delta report.

Components:
- rng.rs: XorShift64 RNG plus exponential and Poisson samplers (Knuth for λ<30, normal approximation for ≥30). Hand-rolled for cross-version reproducibility without depending on the rand crate's RNG-stability guarantees.
- schedule.rs: HashrateSchedule for parameterizing the miner's true hashrate over time. Convenience constructors for stable, step-change, and throttle scenarios.
- trial.rs: run_trial drives any Vardiff implementation through duration_secs of simulated time. Per-tick Poisson sampling: at each 60s tick, samples (true_h / estimated_h) * shares_per_minute, bulk-adds via Vardiff::add_shares, calls try_vardiff. Rate-independent: handles λ from near-zero to millions.
- metrics.rs: Distribution helper (sorted f64s, p10-p99 percentiles, mean, count) plus the five metric functions. Where a metric can fail (non-converging trials, missing reactions) the rate is reported alongside the distribution.
- baseline.rs: Scenario / Cell / CellResult types and run_baseline generic over Vardiff. Default grid is 5 share rates × 10 scenarios = 50 cells. Hand-written TOML and Markdown serialization (avoiding serde + toml dependencies to keep the lockfile minimal).
- bin/generate-baseline.rs: CLI entry point. Configurable via VARDIFF_BASELINE_TRIALS, VARDIFF_BASELINE_SEED, VARDIFF_BASELINE_OUT_DIR.
- regression.rs: baseline-parsing + per-metric tolerance assertions. The classic_algorithm_no_regression test loads the committed baseline via include_str! and asserts current measurements. Marked #[ignore] because it runs the full ~5s baseline; CI should invoke via cargo test --release --lib -- --ignored.
- README.md covering usage, output interpretation, baseline-update workflow, and project-specific notes including the Cargo.lock copy-from-parent rationale.

The crate is declared as its own Cargo workspace (its Cargo.toml has a top-level [workspace] section) so its lockfile is independent of the parent stratum workspace. Required because the parent's pinned 1.75 toolchain cannot write v4 lockfiles, and adding the sim crate as a workspace member would force such a write. The committed Cargo.lock is a copy of the parent's.

Tests: 53 fast unit tests + 1 #[ignore]-d slow regression test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
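For readers unfamiliar with the sampling scheme named under rng.rs and trial.rs, here is a rough sketch of a seeded xorshift generator with exponential and Poisson samplers (Knuth's method below λ = 30, a normal approximation above). It is an illustration under assumptions, not the crate's code: the shift constants (13, 7, 17), the Box-Muller transform, the zero-seed fallback, and the function names `sample_exponential`, `sample_poisson`, and `tick_lambda` are all choices made for this sketch.

```rust
/// Small deterministic PRNG; the same seed always yields the same stream.
pub struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // xorshift must not start from zero; fall back to an arbitrary nonzero constant.
        Self { state: if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed } }
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Uniform value in [0, 1) built from the top 53 bits.
    pub fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Exponential inter-arrival time with the given rate (inverse-CDF method).
pub fn sample_exponential(rng: &mut XorShift64, rate: f64) -> f64 {
    -(1.0 - rng.next_f64()).ln() / rate
}

/// Poisson sample: Knuth's product method for small lambda, normal approximation otherwise.
pub fn sample_poisson(rng: &mut XorShift64, lambda: f64) -> u64 {
    if lambda <= 0.0 {
        return 0;
    }
    if lambda < 30.0 {
        // Count how many uniforms can be multiplied before the product drops to e^-lambda.
        let limit = (-lambda).exp();
        let mut k = 0u64;
        let mut product = rng.next_f64();
        while product > limit {
            k += 1;
            product *= rng.next_f64();
        }
        k
    } else {
        // Approximate Poisson(lambda) by Normal(lambda, lambda) using Box-Muller.
        let u1 = 1.0 - rng.next_f64(); // in (0, 1], so ln(u1) is finite
        let u2 = rng.next_f64();
        let z = (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos();
        (lambda + lambda.sqrt() * z).round().max(0.0) as u64
    }
}

/// Expected shares in one 60s tick, following the formula quoted in the commit message.
pub fn tick_lambda(true_h: f64, estimated_h: f64, shares_per_minute: f64) -> f64 {
    (true_h / estimated_h) * shares_per_minute
}
```

Because each tick draws a single Poisson count instead of simulating individual share arrivals, the cost per tick does not grow with λ, which is what makes the bulk add_shares call worthwhile.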
…tate

- VARDIFF_SIMULATION_FRAMEWORK.md: design proposal documenting the framework's five metrics, assertion policy, simulation mechanism, and architectural rationale. Co-located with the crate it describes.
- vardiff_baseline.toml: machine-readable baseline measurements of the classic VardiffState algorithm across the default 50-cell grid (5 share rates × 10 scenarios, 1000 trials each, base seed 0xDEADBEEFCAFEF00D). Consumed by the regression test in the sim crate.
- vardiff_baseline.md: human-readable summary of the same data, organized by metric type for PR review.

Notable findings surfaced by the baseline:
- Convergence: solid across rates (100% at 30+ spm, 95% at 12 spm, 83% at 6 spm). p50 is ~10 minutes everywhere, dominated by the Phase 1 ×3/min ramp clamp.
- Settled accuracy: follows 1/sqrt(N) cleanly. p99 error is 70% at 6 spm, 27% at 12, 15% at 30, 3% at 60, 0% at 120. Low-rate operation is statistically threadbare.
- Steady-state jitter: small everywhere and ~0 above 30 spm. The algorithm's growing delta_time post-convergence narrows the effective noise band as 1/sqrt(N), producing accidental self-stabilization at high rates.
- Reaction sensitivity DEGRADES with share rate, which is counterintuitive but mechanistic. The same property that produces low jitter at high rates (growing delta_time after a Phase 1 fire) produces sluggish response to step changes (post-step shares diluted by long pre-step history). At 60+ spm only 9-16% of trials react to a 50% drop within 5 minutes.

This baseline is the reference point for evaluating any future algorithmic proposal. The regression test in the sim crate asserts each metric is within tolerance of these recorded values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
11b2560 to
88d8d1d
Compare
Author
The ask to get this simulator off the ground is small: a new Clock trait that lets a MockClock stand in for the system clock, and one new public method on the Vardiff trait. Both changes are isolated to the first commit.
This was referenced May 13, 2026
gimballock
commented
May 13, 2026
Comment on lines +7 to +13

| share/min | rate | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- | --- |
| 6 | 83.3% | 10m | 12m | 21m | 25m |
| 12 | 95.4% | 10m | 10m | 20m | 25m |
| 30 | 99.5% | 10m | 10m | 15m | 25m |
| 60 | 100.0% | 10m | 10m | 10m | 20m |
| 120 | 100.0% | 10m | 10m | 10m | 15m |
Author
The first row here shows the results of the convergence-time test for the default case (6 spm).
Convergence times range from 10 to 25 minutes, and trials fail to converge at all (within 5m of simulated time) 17% of the time!
The next few rows show the results for faster share rates. The most extreme times shrink (p99 falls from 25m to 15m) and the total-failure cases all but disappear from around 30 spm.
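To make the table columns concrete, here is a rough sketch of how a convergence rate and percentiles could be derived from per-trial outcomes. It is an assumption-laden illustration rather than the crate's metrics.rs: the nearest-rank percentile rule, the `summarize` function, and the use of `Option<f64>` for non-converging trials are all choices made for this example.

```rust
/// Sorted sample of per-trial measurements (for example, minutes to converge).
pub struct Distribution {
    sorted: Vec<f64>,
}

impl Distribution {
    pub fn new(mut values: Vec<f64>) -> Self {
        values.sort_by(|a, b| a.total_cmp(b));
        Self { sorted: values }
    }

    pub fn count(&self) -> usize {
        self.sorted.len()
    }

    pub fn mean(&self) -> Option<f64> {
        if self.sorted.is_empty() {
            None
        } else {
            Some(self.sorted.iter().sum::<f64>() / self.sorted.len() as f64)
        }
    }

    /// Nearest-rank percentile (assumed rule); `p` is in percent, e.g. 50.0 for the median.
    pub fn percentile(&self, p: f64) -> Option<f64> {
        if self.sorted.is_empty() {
            return None;
        }
        let rank = ((p / 100.0) * self.sorted.len() as f64).ceil() as usize;
        Some(self.sorted[rank.clamp(1, self.sorted.len()) - 1])
    }
}

/// Splits trial outcomes into a success rate and a distribution over the successes.
/// `None` marks a trial that never converged, so it counts against the rate only.
pub fn summarize(outcomes: &[Option<f64>]) -> (f64, Distribution) {
    let converged: Vec<f64> = outcomes.iter().flatten().copied().collect();
    let rate = if outcomes.is_empty() {
        0.0
    } else {
        converged.len() as f64 / outcomes.len() as f64
    };
    (rate, Distribution::new(converged))
}
```

Under this reading, the "rate" column is the fraction of trials that converged, and the percentile columns describe only those converged trials.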
Adds a deterministic in-process simulation framework that characterizes
any Vardiff implementation across the operational rate range, and
commits the current algorithm's measurements as a baseline for automated
regression testing.
The "vardiff fires too often on noise" / "vardiff doesn't react fast
enough" conversations have circled the same issues (#396 and adjacent)
without a way to settle questions empirically. This PR adds the missing
infrastructure: any future proposal can now produce a quantitative
delta against a fixed reference.
The finding that motivates this
Before the framework, "is the algorithm too noisy or too sluggish?" was
a matter of opinion. With it, the question is a table:
Reaction sensitivity to a -50% step change (probability of firing
within 5 minutes after the change):
Higher share rates produce less responsive algorithms, counter to
expectations. The mechanism is mechanical (post-convergence delta_time
grows indefinitely, diluting the post-step signal in the cumulative
window) and surfaced clearly in the data despite taking months of
deployment observation to half-articulate. Full numbers and analysis
in vardiff_baseline.md.

This isn't a fix. It's the measurement that lets the fix be evaluated.
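The dilution effect behind that finding can be illustrated with simple arithmetic. The sketch below assumes the estimator averages the share rate over one cumulative window that spans both pre-step and post-step time; the function name `observed_ratio` and its parameters are hypothetical, and the actual firing thresholds of the algorithm are not modeled.

```rust
/// Ratio of the observed cumulative share rate to the pre-step rate, when the
/// window contains `pre_minutes` at the old rate followed by `post_minutes`
/// at `step_factor` times the old rate (0.5 models a -50% step).
pub fn observed_ratio(pre_minutes: f64, post_minutes: f64, step_factor: f64) -> f64 {
    (pre_minutes + post_minutes * step_factor) / (pre_minutes + post_minutes)
}
```

With 5 minutes of pre-step history and 5 minutes after a -50% step, the window sees 0.75 of the old rate, a clear deviation. With 60 minutes of pre-step history and the same 5 minutes after the step, it sees 62.5 / 65 ≈ 0.96, which is much harder to distinguish from noise. A longer delta_time before the step therefore means a weaker signal within the first 5 minutes.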
What's in the PR
3 commits, ~2000 LOC plus baseline data:
1. feat(vardiff): inject Clock trait + add_shares trait method
   Minimum API additions to channels_sv2 for testability and
   simulation performance. Production behavior unchanged: existing
   constructors default to SystemClock, the new trait method has a
   default implementation that keeps existing impls compiling.
2. feat(vardiff_sim): in-process simulation framework
   New crate at sv2/channels-sv2/sim/. Per-tick Poisson share
   sampling, five behavioral metrics with percentile distributions,
   50-cell parameterized sweep (5 share rates × 10 scenarios), CLI
   binary for baseline generation, regression test asserting against
   a committed baseline.
3. data(vardiff_sim): design doc + baseline characterization
   The design proposal documenting metric definitions and tolerance
   policy, plus the measured baseline as both TOML (consumed by the
   regression test) and Markdown (for human review).
What the framework measures
Five behavioral attributes, each as a distribution across 1000
independent trials per cell:
Per-metric tolerances are asserted automatically against the checked-in
baseline. Failed assertions identify the cell and metric with specific
baseline-vs-current numbers. Mid-range Δ values (10-25%) are reported
but not asserted — that's where legitimate algorithmic tradeoffs live
and a reviewer should judge by looking at the full delta.
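One way to picture the banded tolerance policy is the sketch below. It is a guess at the shape of the check, not the logic in regression.rs: the PR text only states that 10-25% deltas are reported without being asserted, so treating deltas under 10% as a pass and over 25% as a failure is an assumption, as are the names `Verdict`, `relative_delta`, `classify`, and `failure_message` and the zero-baseline handling.

```rust
#[derive(Debug, PartialEq)]
pub enum Verdict {
    /// Delta under 10%: treated as no change (assumed).
    Pass,
    /// Delta in the 10-25% band: printed for the reviewer, never asserted.
    ReportOnly,
    /// Delta above 25%: assertion fails (assumed).
    Fail,
}

/// Relative change of `current` against `baseline`, as a fraction.
pub fn relative_delta(baseline: f64, current: f64) -> f64 {
    if baseline == 0.0 {
        // Assumed convention: any move away from a zero baseline is an unbounded delta.
        if current == 0.0 { 0.0 } else { f64::INFINITY }
    } else {
        (current - baseline).abs() / baseline.abs()
    }
}

pub fn classify(baseline: f64, current: f64) -> Verdict {
    let delta = relative_delta(baseline, current);
    if delta < 0.10 {
        Verdict::Pass
    } else if delta <= 0.25 {
        Verdict::ReportOnly
    } else {
        Verdict::Fail
    }
}

/// Failure text naming the cell and metric with baseline-vs-current numbers.
pub fn failure_message(cell: &str, metric: &str, baseline: f64, current: f64) -> String {
    format!(
        "{cell} / {metric}: baseline {baseline:.3}, current {current:.3} (delta {:.1}%)",
        relative_delta(baseline, current) * 100.0
    )
}
```

Keeping the middle band out of the assertions leaves room for a deliberate tradeoff, such as slower reaction in exchange for lower jitter, to show up in the delta report without failing the test.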
How to run
From sv2/channels-sv2/sim/:
What this enables
For any future vardiff proposal:
Vardiff impl
cargo run --release --bin generate-baseline to produce comparable measurements
No more "I think this is better." Instead "this changes p50 jitter from
X to Y at 12 spm, at the cost of p90 reaction time going from A to B."
Where to look in this PR
rationale): sv2/channels-sv2/sim/VARDIFF_SIMULATION_FRAMEWORK.md
workflow): sv2/channels-sv2/sim/README.md
sv2/channels-sv2/sim/vardiff_baseline.md
What this PR is NOT
VardiffState behavior is unchanged. The only public-API additions are
Vardiff::add_shares (with a default impl) and the Clock trait.
Production code defaults to SystemClock and behaves identically to before.
data suggests 12-30 spm is the operational sweet spot, but this
PR doesn't touch any defaults.
a GitHub Action to be a true CI gate. Follow-up.
Open follow-ups
cargo test --release --lib -- --ignored into CI on PRs
touching vardiff/* or the sim crate.
channels_sv2 5.0.0 → 5.1.0 once the workspace lockfile
situation allows (the trait-method addition is technically a
minor-version semver change). TODO comment in Cargo.toml tracks
this.
surfaces the problem; fixing it is a separate proposal that this
framework will be the right tool to evaluate.
Test plan
- cargo test -p channels_sv2 --lib vardiff: 17 tests, all pass
- cargo test from sv2/channels-sv2/sim/: 53 fast unit tests
- cargo test --release --lib -- --ignored from sim/: slow regression test passes against committed baseline
- cargo run --release --bin generate-baseline: reproduces the committed vardiff_baseline.toml byte-for-byte at the same seed