[experiment] Apply new error estimate math to vardiff algo #488

@gimballock

Description

tl;dr: I had my agent look at the Garwood interval data posted by @adammwest in the linked issue to see how it might best be integrated into the current algorithm.

Findings so far:

  • The analysis suggests the existing algo tries to force the converged hashrate into a smaller range of acceptable values than is mathematically justifiable, and as a result subsequent difficulty adjustments can be triggered by noise/error.
  • This change therefore increases the variance in where a miner's hashrate will converge, but in exchange it should get there with fewer triggered adjustments.
  • It replaces the fixed stairstep function with a dynamically computed value (though we could rewrite it with fixed rungs if desired).
  • The new function is also parameterized by share rate, so the vardiff triggers get more efficient as the share rate increases.

I'm doing some manual A/B testing now to confirm the reduction in vardiff triggers, but in all honesty this accuracy tweak doesn't have much impact on the total convergence time, which is the bigger issue and will require a larger rewrite.

Curious what others think of these changes.

Reconsider #396: vardiff convergence quality, not poll cadence

Refs: #396

#396 proposes shrinking the hardcoded 60s vardiff polling cycle in pool, jd-client, and translator so sv2-ui updates more frequently. After working through the algorithm and its statistical behaviour at default settings, I've concluded the polling cadence isn't the bottleneck — it isn't even part of the algorithm window. The user-visible problems (slow convergence, "staircase" wobble, a dashboard that feels unresponsive) come from a separate place: the ladder's thresholds were calibrated to a different statistical regime than the default 6 shares/min produces. This is a request to reframe the work and pursue different fixes.

Polling cadence is not the algorithm window

The hardcoded 60s in each app is just the tokio::time::interval cadence at which try_vardiff is called:

// pool-apps/pool/src/lib/channel_manager/mod.rs:611-621 (mirrored in jd-client and translator)
let mut ticker = tokio::time::interval(std::time::Duration::from_secs(60));
loop {
    ticker.tick().await;
    self.run_vardiff().await
}

try_vardiff (stratum/sv2/channels-sv2/src/vardiff/classic.rs) computes deviation against shares_since_last_update, which only resets when a fire applies a new target. Calling it more frequently re-checks the same accumulating window against the same thresholds. There's also an internal early-return at delta_time <= 15, so polling below ~15s is a no-op.

Shrinking the apps-layer ticker on its own buys nothing on the algorithm side. If UI freshness is the real goal, it's better solved at the display layer (smoother interpolation, separate hashrate buckets independent of vardiff fires).
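To make the windowing point concrete, here is a deliberately simplified sketch (hypothetical types and field names, not the actual channels-sv2 code): the estimate is a pure function of the accumulating counters, so polling more often just re-reads the same window.

```rust
// Hypothetical simplification of the vardiff window; not the real
// channels-sv2 types. The estimate depends only on the counters that
// reset when a new target is applied, so extra polls between fires
// re-evaluate the same data.
struct VardiffWindow {
    shares_since_last_update: u64,
    elapsed_secs: u64, // seconds since the last target update
}

impl VardiffWindow {
    /// Returns Some(estimated shares/min), or None if the window is too
    /// young (mirrors the delta_time <= 15 early-return).
    fn estimate(&self) -> Option<f64> {
        if self.elapsed_secs <= 15 {
            return None;
        }
        Some(self.shares_since_last_update as f64 * 60.0 / self.elapsed_secs as f64)
    }
}
```

Calling `estimate` twice on the same window trivially yields the same answer; only a fire (which resets the counters) changes what subsequent checks see.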

What operators actually experience

Three properties matter for vardiff in practice:

Convergence latency — time from a fresh connection to a target near the miner's true hashrate. Today: ~7–10 minutes on typical SV1-bridged connections, presenting as a visible staircase climb.

Convergence stability — how smoothly the staircase rises. Today: a few large discrete jumps, because each fire pulls a single-window hashrate estimate that carries a wide confidence interval.

Steady-state behaviour — once near the true rate, no rungs fire and the system settles. This works correctly.

The complaint that motivated #396 is downstream of latency and stability, not poll frequency. A faster ticker on the same algorithm makes the staircase no shorter and no smoother; it just samples it more often.

Why this happens — sampling noise at default share rate

Every example config uses shares_per_minute = 6.0. One 60s window observes ~6 shares. Share counts are Poisson; the exact 95% Garwood CI on a count of 6 is approximately [2.2, 13.1] — that is, −63%, +118% from the expected value. The single-window hashrate estimate that drives every ladder fire inherits that CI directly.
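The quoted interval can be reproduced without a stats library. The exact Garwood bounds are chi-square quantiles (lower 0.5·χ²(α/2, 2n), upper 0.5·χ²(1−α/2, 2n+2)), and the Wilson–Hilferty approximation to those quantiles is accurate to about 1% at these counts. A standalone sketch, not part of the codebase:

```rust
// Approximate 95% Garwood CI for a Poisson count, using the
// Wilson-Hilferty approximation to the chi-square quantile.
// Exact bounds: lower = 0.5 * chi2(alpha/2, 2n),
//               upper = 0.5 * chi2(1 - alpha/2, 2n + 2).

/// Wilson-Hilferty approximation to the chi-square quantile with `df`
/// degrees of freedom at standard-normal deviate `z`.
fn chi2_quantile_wh(df: f64, z: f64) -> f64 {
    let a = 2.0 / (9.0 * df);
    df * (1.0 - a + z * a.sqrt()).powi(3)
}

/// Approximate 95% Garwood interval for an observed Poisson count `n`.
fn garwood_95(n: f64) -> (f64, f64) {
    const Z: f64 = 1.959964; // two-sided 95% normal deviate
    let lower = 0.5 * chi2_quantile_wh(2.0 * n, -Z);
    let upper = 0.5 * chi2_quantile_wh(2.0 * n + 2.0, Z);
    (lower, upper)
}
```

For n = 6 this gives roughly (2.2, 13.1), i.e. about −63% / +118% relative to the expected value, matching the figures above.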

Cross-referencing the ladder in vardiff/classic.rs:154-162:

let should_update = match hashrate_delta_percentage {
    pct if pct >= 100.0                       => true,
    pct if pct >=  60.0 && delta_time >=  60  => true,
    pct if pct >=  50.0 && delta_time >= 120  => true,
    pct if pct >=  45.0 && delta_time >= 180  => true,
    pct if pct >=  30.0 && delta_time >= 240  => true,
    pct if pct >=  15.0 && delta_time >= 300  => true,
    _ => false,
};

against the noise expected at each elapsed time:

| Rung | Threshold | Expected shares | 95% Garwood CI | Verdict |
| --- | --- | --- | --- | --- |
| ≥100%, any time | 100% | — | — | always outside; OK |
| ≥60%, ≥60s | 60% | 6 | −63%, +118% | inside noise (upper) |
| ≥50%, ≥120s | 50% | 12 | −48%, +75% | inside noise |
| ≥45%, ≥180s | 45% | 18 | −41%, +58% | inside noise |
| ≥30%, ≥240s | 30% | 24 | −36%, +49% | inside noise |
| ≥15%, ≥300s | 15% | 30 | −33%, +43% | inside noise |

(When retuning thresholds, a 99% CI rather than the 95% shown above is the better yardstick: the algorithm re-evaluates each rung repeatedly during convergence, and a 95% bound implies ~5% false-fire probability per check, which is exactly the regime we're trying to leave.)
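One way the dynamically computed threshold could look, as a hypothetical sketch (the PR's actual formula may differ): fire a rung only when the observed deviation exceeds the ~99% relative half-width of the Poisson share-count estimate for that window, using the normal approximation for simplicity.

```rust
// HYPOTHETICAL parametric threshold; illustrative only, not the PR's
// exact math. The trigger sits just above the Poisson noise floor at
// any configured share rate, replacing the fixed stairstep rungs.
const Z_99: f64 = 2.576; // two-sided 99% normal deviate

/// Minimum |hashrate delta| (as a fraction, e.g. 0.45 = 45%) that is
/// distinguishable from Poisson noise after `elapsed_secs` at the
/// configured `shares_per_minute`.
fn noise_floor_threshold(elapsed_secs: f64, shares_per_minute: f64) -> f64 {
    let expected_shares = shares_per_minute * elapsed_secs / 60.0;
    Z_99 / expected_shares.sqrt()
}
```

At the default 6 shares/min and a 300s window (30 expected shares) this gives ~47%, which is why the fixed 15% rung fires well inside noise; doubling the share rate shrinks every threshold by a factor of 1/√2 without retuning the ladder.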

Bumping the default share rate is a defensible secondary change

shares_per_minute = 6.0 produces the noisy single-window estimates above. Doubling to 12 narrows the post-fire CI by ~30% (Poisson noise scales as 1/√N), which makes the first close-to-truth fire land meaningfully closer.

Tradeoffs to weigh:

Per-share work. Pool, JDC, and translator all do per-share validation; doubling the rate roughly doubles that load. For the largest pools this may shift where validation runs (more partial validation in JDC/translator). Probably fine, but worth confirming with operators of large-deployment pools.

Small miners. A USB-stick or low-end ASIC averaging well below 6 shares/min today won't reach 12 either — at very low miner hashrates, the difficulty floor in the protocol takes over and the configured rate is aspirational. This bump shouldn't change anything for them, but it's worth checking that no path assumes the configured rate is achievable.

Diminishing returns. Past ~20 shares/min the CI tightens slowly; sweet spot is 10–15.

Steady-state precision: a real tradeoff

Worth being honest about this. The current ladder fires periodically once "near" truth — those small lower rungs (15%/300s, 30%/240s) keep nudging the displayed value, which both anchors it within roughly ±15% of truth and produces the visible twitch that operators complain about. Thresholds set above the noise floor will sit, by construction, around the noise-floor magnitude — at 12 shares/min and the proposed parametric thresholds, the steady-state band is approximately ±45%.

That's a real cost. A 30% hardware throttling event (failed PSU rail, thermal limiter kicking in) would land inside the new band and not trigger a re-fire. A vardiff dashboard exists partly to surface that kind of event. The right argument here isn't "stability is more important than precision" — it's "vardiff isn't the right surface for change-detection at this granularity." Operators wanting to spot a 30% throttle should be reading a smoothed observed-share-rate line, not the vardiff target. If we agree on that reframing, the precision regression is acceptable. If we don't, this change shouldn't land on its own and should wait for the EWMA restructuring below.

Expected impact

A fresh connection's hashrate trajectory has three phases. Each fix touches a different one.

Phase 1 — orders-of-magnitude climb. When the configured hashrate is far from truth, the system spends most of convergence walking up by ×3 per minute under the clamp at classic.rs:177-181. Share counts in this phase are huge (thousands per cycle), so noise is irrelevant. Neither fix touches Phase 1. Walking from a default 10 GH/s to within an order of magnitude of truth still takes ~9 minutes. This is the dominant cost for fresh-connection latency, and we're not addressing it here.
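The ×3 clamp makes Phase 1 latency logarithmic in the initial gap. A quick sanity check of the ~9-minute figure, assuming a miner around 100 TH/s (my assumption; the issue doesn't state a rig size):

```rust
/// Number of 60s vardiff cycles needed to walk from `start_hashrate`
/// to `target_hashrate` under a x3-per-cycle clamp (the Phase 1 climb).
fn phase1_cycles(start_hashrate: f64, target_hashrate: f64) -> u32 {
    let ratio = target_hashrate / start_hashrate;
    if ratio <= 1.0 {
        return 0;
    }
    // Each cycle multiplies the target by at most 3, so the cycle count
    // is ceil(log3(gap)).
    (ratio.ln() / 3.0_f64.ln()).ceil() as u32
}
```

From the 10 GH/s default to a 100 TH/s miner the gap is 10^4, and log3(10^4) ≈ 8.4, so 9 cycles, i.e. roughly 9 minutes at the 60s cadence.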

Phase 2 — final approach. Once within 100–1000% of truth, the top rung fires using a measured estimate. Bumping shares_per_minute from 6 to 12 narrows the post-fire CI by ~30%; the first close-to-truth fire lands closer.

Phase 3 — settling. Today's lower rungs sit inside the CI of subsequent measurements, so the system keeps firing on noise as small corrections that walk around the true value. This is the visible wobble. Retuned thresholds sit above the noise floor, so once within band, nothing fires.

The chart below is a simulation, not measured data — it's there to illustrate the qualitative shape, not as evidence:

[Figure: simulated trajectory, today's ladder vs. retuned]

In summary:

  • Time-to-flat improves from ~15 minutes to ~10 minutes (the ~9-minute Phase 1 still dominates).
  • Visible jumps at the top of the staircase drop from 2–3 to 1.
  • Steady-state band widens from ~±15% to ~±45%, with the reframing above.

Future work

EWMA-smoothed hashrate. Replace the ladder with a continuously-updated exponentially-weighted moving average. Convergence becomes a smooth exponential ramp, post-fire accuracy is tight by construction, and the implementation is simpler than the current ladder. Time constant becomes the only tunable. This also eliminates the steady-state-band tradeoff above and is the right long-term fix.
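A minimal sketch of what the EWMA estimator could look like (illustrative only, not a proposed patch); deriving the per-update weight from elapsed time handles irregular share arrivals correctly:

```rust
// Sketch of an EWMA hashrate estimator with a single tunable, the time
// constant `tau_secs`. Names and structure are hypothetical.
struct EwmaHashrate {
    estimate: f64, // smoothed shares/min (hashrate proxy at fixed difficulty)
    tau_secs: f64, // time constant of the exponential window
}

impl EwmaHashrate {
    /// Fold in `shares` observed over the last `dt_secs`.
    fn update(&mut self, shares: f64, dt_secs: f64) {
        let instant_rate = shares * 60.0 / dt_secs;
        // Weight decays exponentially with elapsed time, so irregular
        // sampling intervals get proportionally correct influence.
        let alpha = 1.0 - (-dt_secs / self.tau_secs).exp();
        self.estimate += alpha * (instant_rate - self.estimate);
    }
}
```

Convergence is then a smooth exponential ramp: after ~3·tau of steady shares the estimate is within ~5% of a step change, with no discrete rungs to fire on noise.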

SPRT / CI-driven retargeting. Fire only when the observed rate is decisively outside the configured target's confidence band. Best statistical guarantees, fastest convergence on large initial gaps, most code complexity.

Implementation surface

Primary algorithm (parametric thresholds):

  • stratum-mining/stratum: sv2/channels-sv2/src/vardiff/classic.rs:96-196

Polling call sites — no change, keep 60s:

  • pool-apps/pool/src/lib/channel_manager/mod.rs:611-621
  • miner-apps/jd-client/src/lib/channel_manager/mod.rs:1257-1260
  • miner-apps/translator/src/lib/sv1/sv1_server/difficulty_manager.rs:33-43

Default share rate (proposal: 12.0, conditional on agreeing the steady-state reframing above):

  • pool-apps/pool/config-examples/**/*.toml
  • miner-apps/jd-client/config-examples/**/*.toml
  • miner-apps/translator/config-examples/**/*.toml

Methodology note: statistical analysis was developed with LLM assistance against the Garwood interval data linked from #396; code references and config defaults were verified against current sv2-apps HEAD.
