[experiment] Apply new error estimate math to vardiff algo #488

@gimballock

Description

tl;dr: I had my agent look at the Garwood interval data posted by @adammwest in the linked issue to see how it might best be integrated into the current algorithm.

Findings so far:

  • The analysis suggests the existing algo tries to force the converged hashrate into a smaller range of acceptable values than is mathematically justifiable, and as a result subsequent difficulty adjustments can be triggered by noise/error.
  • This change therefore increases the variance in where a miner's hashrate will converge, but in exchange it should get there with fewer triggered adjustments.
  • It replaces the fixed stairstep function with a dynamically computed value (though we could rewrite it with fixed rungs if desired).
  • The new function is also parameterized by share rate, so the vardiff triggers get more efficient as the share rate increases.

I'm doing some manual A/B testing now to confirm the reduction in vardiff triggers, but in all honesty this accuracy tweak doesn't have much impact on the total convergence time, which is the bigger issue and will require a larger rewrite.

Curious what others think of these changes.

Reconsider #396: vardiff convergence quality, not poll cadence

Refs: #396

#396 proposes shrinking the hardcoded 60s vardiff polling cycle in pool, jd-client, and translator so sv2-ui updates more frequently. After working through the algorithm and its statistical behaviour at default settings, I've concluded the polling cadence isn't the bottleneck — it isn't even part of the algorithm window. The user-visible problems (slow convergence, "staircase" wobble, a dashboard that feels unresponsive) come from a separate place: the ladder's thresholds were calibrated to a different statistical regime than the default 6 shares/min produces. This is a request to reframe the work and pursue different fixes.

Polling cadence is not the algorithm window

The hardcoded 60s in each app is just the tokio::time::interval cadence at which try_vardiff is called:

// pool-apps/pool/src/lib/channel_manager/mod.rs:611-621 (mirrored in jd-client and translator)
let mut ticker = tokio::time::interval(std::time::Duration::from_secs(60));
loop {
    ticker.tick().await;
    self.run_vardiff().await
}

try_vardiff (stratum/sv2/channels-sv2/src/vardiff/classic.rs) computes deviation against shares_since_last_update, which only resets when a fire applies a new target. Calling it more frequently re-checks the same accumulating window against the same thresholds. There's also an internal early-return at delta_time <= 15, so polling below ~15s is a no-op.

Shrinking the apps-layer ticker on its own buys nothing on the algorithm side. If UI freshness is the real goal, it's better solved at the display layer (smoother interpolation, separate hashrate buckets independent of vardiff fires).
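To make the windowing point concrete, here is a deliberately simplified sketch (hypothetical types and field names, not the actual channels-sv2 code): the estimate is a pure function of the accumulating counters, so polling more often just re-reads the same window.

```rust
// Hypothetical simplification of the vardiff window; not the real
// channels-sv2 types. The estimate depends only on the counters that
// reset when a new target is applied, so extra polls between fires
// re-evaluate the same data.
struct VardiffWindow {
    shares_since_last_update: u64,
    elapsed_secs: u64, // seconds since the last target update
}

impl VardiffWindow {
    /// Returns Some(estimated shares/min), or None if the window is too
    /// young (mirrors the delta_time <= 15 early-return).
    fn estimate(&self) -> Option<f64> {
        if self.elapsed_secs <= 15 {
            return None;
        }
        Some(self.shares_since_last_update as f64 * 60.0 / self.elapsed_secs as f64)
    }
}
```

Calling `estimate` twice on the same window trivially yields the same answer; only a fire (which resets the counters) changes what subsequent checks see.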

What operators actually experience

Three properties matter for vardiff in practice:

Convergence latency — time from a fresh connection to a target near the miner's true hashrate. Today: ~7–10 minutes on typical SV1-bridged connections, presenting as a visible staircase climb.

Convergence stability — how smoothly the staircase rises. Today: a few large discrete jumps, because each fire pulls a single-window hashrate estimate that carries a wide confidence interval.

Steady-state behaviour — once near the true rate, no rungs fire and the system settles. This works correctly.

The complaint that motivated #396 is downstream of latency and stability, not poll frequency. A faster ticker on the same algorithm makes the staircase no shorter and no smoother; it just samples it more often.

Why this happens — sampling noise at default share rate

Every example config uses shares_per_minute = 6.0. One 60s window observes ~6 shares. Share counts are Poisson; the exact 95% Garwood CI on a count of 6 is approximately [2.2, 13.1] — that is, −63%, +118% from the expected value. The single-window hashrate estimate that drives every ladder fire inherits that CI directly.
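The quoted interval can be reproduced without a stats library. The exact Garwood bounds are chi-square quantiles (lower 0.5·χ²(α/2, 2n), upper 0.5·χ²(1−α/2, 2n+2)), and the Wilson–Hilferty approximation to those quantiles is accurate to about 1% at these counts. A standalone sketch, not part of the codebase:

```rust
// Approximate 95% Garwood CI for a Poisson count, using the
// Wilson-Hilferty approximation to the chi-square quantile.
// Exact bounds: lower = 0.5 * chi2(alpha/2, 2n),
//               upper = 0.5 * chi2(1 - alpha/2, 2n + 2).

/// Wilson-Hilferty approximation to the chi-square quantile with `df`
/// degrees of freedom at standard-normal deviate `z`.
fn chi2_quantile_wh(df: f64, z: f64) -> f64 {
    let a = 2.0 / (9.0 * df);
    df * (1.0 - a + z * a.sqrt()).powi(3)
}

/// Approximate 95% Garwood interval for an observed Poisson count `n`.
fn garwood_95(n: f64) -> (f64, f64) {
    const Z: f64 = 1.959964; // two-sided 95% normal deviate
    let lower = 0.5 * chi2_quantile_wh(2.0 * n, -Z);
    let upper = 0.5 * chi2_quantile_wh(2.0 * n + 2.0, Z);
    (lower, upper)
}
```

For n = 6 this gives roughly (2.2, 13.1), i.e. about −63% / +118% relative to the expected value, matching the figures above.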

Cross-referencing the ladder in vardiff/classic.rs:154-162:

let should_update = match hashrate_delta_percentage {
    pct if pct >= 100.0                       => true,
    pct if pct >=  60.0 && delta_time >=  60  => true,
    pct if pct >=  50.0 && delta_time >= 120  => true,
    pct if pct >=  45.0 && delta_time >= 180  => true,
    pct if pct >=  30.0 && delta_time >= 240  => true,
    pct if pct >=  15.0 && delta_time >= 300  => true,
    _ => false,
};

against the noise expected at each elapsed time:

| Rung | Threshold | Expected shares | 95% Garwood CI | Verdict |
| --- | --- | --- | --- | --- |
| ≥100%, any time | 100% | — | — | always outside; OK |
| ≥60%, ≥60s | 60% | 6 | −63%, +118% | inside noise (upper) |
| ≥50%, ≥120s | 50% | 12 | −48%, +75% | inside noise |
| ≥45%, ≥180s | 45% | 18 | −41%, +58% | inside noise |
| ≥30%, ≥240s | 30% | 24 | −36%, +49% | inside noise |
| ≥15%, ≥300s | 15% | 30 | −33%, +43% | inside noise |

(When retuning thresholds, a 99% CI rather than the 95% shown above is the better yardstick: the algorithm re-evaluates each rung repeatedly during convergence, and a 95% bound implies ~5% false-fire probability per check, which is exactly the regime we're trying to leave.)
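One way the dynamically computed threshold could look, as a hypothetical sketch (the PR's actual formula may differ): fire a rung only when the observed deviation exceeds the ~99% relative half-width of the Poisson share-count estimate for that window, using the normal approximation for simplicity.

```rust
// HYPOTHETICAL parametric threshold; illustrative only, not the PR's
// exact math. The trigger sits just above the Poisson noise floor at
// any configured share rate, replacing the fixed stairstep rungs.
const Z_99: f64 = 2.576; // two-sided 99% normal deviate

/// Minimum |hashrate delta| (as a fraction, e.g. 0.45 = 45%) that is
/// distinguishable from Poisson noise after `elapsed_secs` at the
/// configured `shares_per_minute`.
fn noise_floor_threshold(elapsed_secs: f64, shares_per_minute: f64) -> f64 {
    let expected_shares = shares_per_minute * elapsed_secs / 60.0;
    Z_99 / expected_shares.sqrt()
}
```

At the default 6 shares/min and a 300s window (30 expected shares) this gives ~47%, which is why the fixed 15% rung fires well inside noise; doubling the share rate shrinks every threshold by a factor of 1/√2 without retuning the ladder.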

Bumping the default share rate is a defensible secondary change

shares_per_minute = 6.0 produces the noisy single-window estimates above. Doubling to 12 narrows the post-fire CI by ~30% (Poisson noise scales as 1/√N), which makes the first close-to-truth fire land meaningfully closer.

Tradeoffs to weigh:

Per-share work. Pool, JDC, and translator all do per-share validation; doubling the rate roughly doubles that load. For the largest pools this may shift where validation runs (more partial validation in JDC/translator). Probably fine, but worth confirming with operators of large-deployment pools.

Small miners. A USB-stick or low-end ASIC averaging well below 6 shares/min today won't reach 12 either — at very low miner hashrates, the difficulty floor in the protocol takes over and the configured rate is aspirational. This bump shouldn't change anything for them, but it's worth checking that no path assumes the configured rate is achievable.

Diminishing returns. Past ~20 shares/min the CI tightens slowly; sweet spot is 10–15.

Steady-state precision: a real tradeoff

Worth being honest about this. The current ladder fires periodically once "near" truth — those small lower rungs (15%/300s, 30%/240s) keep nudging the displayed value, which both anchors it within roughly ±15% of truth and produces the visible twitch that operators complain about. Thresholds set above the noise floor will sit, by construction, around the noise-floor magnitude — at 12 shares/min and the proposed parametric thresholds, the steady-state band is approximately ±45%.

That's a real cost. A 30% hardware throttling event (failed PSU rail, thermal limiter kicking in) would land inside the new band and not trigger a re-fire. A vardiff dashboard exists partly to surface that kind of event. The right argument here isn't "stability is more important than precision" — it's "vardiff isn't the right surface for change-detection at this granularity." Operators wanting to spot a 30% throttle should be reading a smoothed observed-share-rate line, not the vardiff target. If we agree on that reframing, the precision regression is acceptable. If we don't, this change shouldn't land on its own and should wait for the EWMA restructuring below.

Expected impact

A fresh connection's hashrate trajectory has three phases. Each fix touches a different one.

Phase 1 — orders-of-magnitude climb. When the configured hashrate is far from truth, the system spends most of convergence walking up by ×3 per minute under the clamp at classic.rs:177-181. Share counts in this phase are huge (thousands per cycle), so noise is irrelevant. Neither fix touches Phase 1. Walking from a default 10 GH/s to within an order of magnitude of truth still takes ~9 minutes. This is the dominant cost for fresh-connection latency, and we're not addressing it here.
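The ×3 clamp makes Phase 1 latency logarithmic in the initial gap. A quick sanity check of the ~9-minute figure, assuming a miner around 100 TH/s (my assumption; the issue doesn't state a rig size):

```rust
/// Number of 60s vardiff cycles needed to walk from `start_hashrate`
/// to `target_hashrate` under a x3-per-cycle clamp (the Phase 1 climb).
fn phase1_cycles(start_hashrate: f64, target_hashrate: f64) -> u32 {
    let ratio = target_hashrate / start_hashrate;
    if ratio <= 1.0 {
        return 0;
    }
    // Each cycle multiplies the target by at most 3, so the cycle count
    // is ceil(log3(gap)).
    (ratio.ln() / 3.0_f64.ln()).ceil() as u32
}
```

From the 10 GH/s default to a 100 TH/s miner the gap is 10^4, and log3(10^4) ≈ 8.4, so 9 cycles, i.e. roughly 9 minutes at the 60s cadence.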

Phase 2 — final approach. Once within 100–1000% of truth, the top rung fires using a measured estimate. Bumping shares_per_minute from 6 to 12 narrows the post-fire CI by ~30%; the first close-to-truth fire lands closer.

Phase 3 — settling. Today's lower rungs sit inside the CI of subsequent measurements, so the system keeps firing on noise as small corrections that walk around the true value. This is the visible wobble. Retuned thresholds sit above the noise floor, so once within band, nothing fires.

The chart below is a simulation, not measured data — it's there to illustrate the qualitative shape, not as evidence:

[Figure: simulated trajectory, today's ladder vs. retuned]

In summary:

  • Time-to-flat improves from ~15 minutes to ~10 minutes (the ~9-minute Phase 1 still dominates).
  • Visible jumps at the top of the staircase drop from 2–3 to 1.
  • Steady-state band widens from ~±15% to ~±45%, with the reframing above.

Future work

EWMA-smoothed hashrate. Replace the ladder with a continuously-updated exponentially-weighted moving average. Convergence becomes a smooth exponential ramp, post-fire accuracy is tight by construction, and the implementation is simpler than the current ladder. Time constant becomes the only tunable. This also eliminates the steady-state-band tradeoff above and is the right long-term fix.
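A minimal sketch of what the EWMA estimator could look like (illustrative only, not a proposed patch); deriving the per-update weight from elapsed time handles irregular share arrivals correctly:

```rust
// Sketch of an EWMA hashrate estimator with a single tunable, the time
// constant `tau_secs`. Names and structure are hypothetical.
struct EwmaHashrate {
    estimate: f64, // smoothed shares/min (hashrate proxy at fixed difficulty)
    tau_secs: f64, // time constant of the exponential window
}

impl EwmaHashrate {
    /// Fold in `shares` observed over the last `dt_secs`.
    fn update(&mut self, shares: f64, dt_secs: f64) {
        let instant_rate = shares * 60.0 / dt_secs;
        // Weight decays exponentially with elapsed time, so irregular
        // sampling intervals get proportionally correct influence.
        let alpha = 1.0 - (-dt_secs / self.tau_secs).exp();
        self.estimate += alpha * (instant_rate - self.estimate);
    }
}
```

Convergence is then a smooth exponential ramp: after ~3·tau of steady shares the estimate is within ~5% of a step change, with no discrete rungs to fire on noise.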

SPRT / CI-driven retargeting. Fire only when the observed rate is decisively outside the configured target's confidence band. Best statistical guarantees, fastest convergence on large initial gaps, most code complexity.

Implementation surface

Primary algorithm (parametric thresholds):

  • stratum-mining/stratum: sv2/channels-sv2/src/vardiff/classic.rs:96-196

Polling call sites — no change, keep 60s:

  • pool-apps/pool/src/lib/channel_manager/mod.rs:611-621
  • miner-apps/jd-client/src/lib/channel_manager/mod.rs:1257-1260
  • miner-apps/translator/src/lib/sv1/sv1_server/difficulty_manager.rs:33-43

Default share rate (proposal: 12.0, conditional on agreeing the steady-state reframing above):

  • pool-apps/pool/config-examples/**/*.toml
  • miner-apps/jd-client/config-examples/**/*.toml
  • miner-apps/translator/config-examples/**/*.toml

Methodology note: statistical analysis was developed with LLM assistance against the Garwood interval data linked from #396; code references and config defaults were verified against current sv2-apps HEAD.
