Rust benchmark harness for speculative decoding draft_length strategies against the Lemonade inference engine.
Target backend: Lemonade, a local inference server exposing a chat/completions-style HTTP API.
Measure latency/throughput + stability tradeoffs across:
- fixed_1 .. fixed_8
- adaptive controller (warmup-calibrated thresholds, SLO-aware scoring)
- optional CPU load injection for contention experiments
Speculative decoding performance is highly sensitive to the chosen draft_length.
A static draft_length may perform well under one backend condition but degrade under contention or workload shifts.
HaloSpec explores whether an adaptive draft controller, calibrated from warmup statistics and evaluated using an SLO-aware scoring function, can maintain stable performance under non-stationary runtime conditions.
Watch the adaptive controller respond to runtime load injection.
Each mode runs through the following stages (a minimal sketch of this loop follows the list):
- Warmup (WARMUP_STEPS, not logged)
- Measured steps (CSV logging)
- Summary stats + SLO-aware score
- Adaptive only: tracks draft_length changes + convergence
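As a rough illustration of that lifecycle, the per-mode loop might look like the sketch below. The step counts and helpers (`run_step`, the CSV line format) are illustrative stand-ins, not the harness's actual API:

```rust
// Illustrative per-mode lifecycle: warmup (unlogged), measured steps (CSV),
// then summary. Constants and helpers here are assumptions, not the real API.
const WARMUP_STEPS: usize = 5;
const MEASURED_STEPS: usize = 20;

// Stand-in for one chat/completions request; returns latency in ms.
fn run_step(_mode: &str) -> f64 {
    42.0
}

fn run_mode(mode: &str) {
    // Warmup: executed but never logged (feeds calibration in adaptive mode).
    for _ in 0..WARMUP_STEPS {
        let _ = run_step(mode);
    }

    // Measured phase: every step lands in the results CSV.
    let mut latencies = Vec::new();
    for step in 0..MEASURED_STEPS {
        let latency_ms = run_step(mode);
        println!("{mode},{step},{latency_ms:.1}"); // stand-in for CSV writing
        latencies.push(latency_ms);
    }

    // Summary stats + SLO-aware score are computed from `latencies` here.
    let avg = latencies.iter().sum::<f64>() / latencies.len() as f64;
    println!("{mode}: avg {avg:.1} ms over {} steps", latencies.len());
}

fn main() {
    for mode in ["fixed_1", "fixed_4", "fixed_8", "adaptive"] {
        run_mode(mode);
    }
}
```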
During warmup, latency percentiles (p50, p95) are measured and used to derive dynamic thresholds:
```
low_thr  = 0.85 * p50
high_thr = 1.05 * p95
```
At runtime:
- If latency < low_thr → increment draft_length
- If latency > high_thr → decrement draft_length
- Otherwise → maintain or gently increase
Latency is smoothed via EMA to prevent oscillation under transient spikes.
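Putting the calibration and update rule together, a minimal controller sketch might look like this. The `alpha` value, the clamp bounds 1..=8, and the initial draft_length are assumptions; the harness's actual logic, including its "gently increase" probing, may differ:

```rust
// Illustrative adaptive controller: EMA-smoothed latency drives draft_length.
struct AdaptiveController {
    draft_length: u32,
    ema_latency_ms: f64,
    low_thr: f64,  // 0.85 * warmup p50
    high_thr: f64, // 1.05 * warmup p95
    alpha: f64,    // EMA weight on the newest sample (assumed value)
}

impl AdaptiveController {
    fn from_warmup(p50_ms: f64, p95_ms: f64) -> Self {
        Self {
            draft_length: 4, // assumed starting point, mid-sweep
            ema_latency_ms: p50_ms,
            low_thr: 0.85 * p50_ms,
            high_thr: 1.05 * p95_ms,
            alpha: 0.3,
        }
    }

    fn observe(&mut self, latency_ms: f64) {
        // EMA smoothing keeps transient spikes from causing oscillation.
        self.ema_latency_ms =
            self.alpha * latency_ms + (1.0 - self.alpha) * self.ema_latency_ms;

        if self.ema_latency_ms < self.low_thr {
            self.draft_length = (self.draft_length + 1).min(8); // fast: speculate more
        } else if self.ema_latency_ms > self.high_thr {
            self.draft_length = self.draft_length.saturating_sub(1).max(1); // slow: back off
        }
        // In between: hold (the real controller may also gently probe upward).
    }
}

fn main() {
    let mut ctl = AdaptiveController::from_warmup(40.0, 90.0);
    for latency in [36.0, 34.0, 33.0, 120.0, 115.0, 50.0] {
        ctl.observe(latency);
        println!("latency {latency:5.1} ms -> draft_length {}", ctl.draft_length);
    }
}
```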
Reported metrics per mode:
- avg / median / p95 / min / max / stddev latency
- throughput (tokens/sec, from completion_tokens)
- success rate
- adaptive only: draft change count, convergence_step(k)
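The latency aggregates are straightforward to compute from the per-step samples. The sketch below shows one possible shape, including a purely illustrative SLO-aware score (the actual scoring function is not specified here, so the penalty form is an assumption):

```rust
// Hypothetical summary-stat helpers over per-step latency samples (ms).
fn percentile(sorted: &[f64], p: f64) -> f64 {
    // Nearest-rank percentile over an ascending-sorted slice.
    let idx = ((p / 100.0) * (sorted.len() as f64 - 1.0)).round() as usize;
    sorted[idx]
}

// Illustrative SLO-aware score: throughput, discounted when p95 breaches the SLO.
fn slo_score(throughput_tps: f64, p95_ms: f64, slo_ms: f64) -> f64 {
    if p95_ms <= slo_ms {
        throughput_tps
    } else {
        throughput_tps * (slo_ms / p95_ms)
    }
}

fn main() {
    let mut lat = vec![40.0, 42.0, 39.0, 95.0, 41.0, 44.0];
    lat.sort_by(|a, b| a.partial_cmp(b).unwrap());

    let n = lat.len() as f64;
    let avg = lat.iter().sum::<f64>() / n;
    let stddev = (lat.iter().map(|x| (x - avg).powi(2)).sum::<f64>() / n).sqrt();
    let p95 = percentile(&lat, 95.0);

    // Throughput derives from completion_tokens reported by the backend.
    let throughput = 640.0_f64 / 12.5; // completion_tokens / wall seconds

    println!("avg {avg:.1} median {:.1} p95 {p95:.1}", percentile(&lat, 50.0));
    println!("min {:.1} max {:.1} stddev {stddev:.1}", lat[0], lat[lat.len() - 1]);
    println!("throughput {throughput:.1} tok/s, score {:.1}",
        slo_score(throughput, p95, 80.0));
}
```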
Enable it with:

```
HALOSPEC_LOAD=1
```

Behavior:
- Not active during warmup
- Starts at measured step 6
- Duration: ~30s
- Goal: test controller stability under contention
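A CPU contention injector can be as simple as pinning busy-spin threads for the load window. A minimal sketch, assuming one spin thread per available core and a fixed ~30s duration (the harness's actual injector may size and schedule threads differently):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Spin worker threads to create CPU contention for the given duration.
fn inject_cpu_load(duration: Duration) {
    let stop = Arc::new(AtomicBool::new(false));
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    let workers: Vec<_> = (0..cores)
        .map(|_| {
            let stop = Arc::clone(&stop);
            thread::spawn(move || {
                let mut x: u64 = 0;
                while !stop.load(Ordering::Relaxed) {
                    // Busy work that the optimizer cannot remove.
                    x = x.wrapping_mul(6364136223846793005).wrapping_add(1);
                }
                std::hint::black_box(x);
            })
        })
        .collect();

    thread::sleep(duration);
    stop.store(true, Ordering::Relaxed);
    for w in workers {
        let _ = w.join();
    }
}

fn main() {
    // In the harness this would fire at measured step 6 when HALOSPEC_LOAD=1.
    if std::env::var("HALOSPEC_LOAD").map_or(false, |v| v == "1") {
        inject_cpu_load(Duration::from_secs(30));
    }
}
```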
Linux / macOS:

```bash
# Fixed sweep + adaptive
cargo run

# With load injection
HALOSPEC_LOAD=1 cargo run

# Optional verbose JSON
HALOSPEC_DEBUG_JSON=1 cargo run
```

Windows (PowerShell):

```powershell
# Fixed sweep + adaptive
cargo run

# With load injection
$env:HALOSPEC_LOAD="1"; cargo run

# Optional verbose JSON
$env:HALOSPEC_DEBUG_JSON="1"; cargo run
```

These plots are generated from results_phase0.csv with HALOSPEC_LOAD=1, where CPU contention is injected starting at measured step 6 for ~30s. Phases are logged as steady, load, and recovery.
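For reference, phase labels can be assigned from the step index relative to the injection window. The sketch below is an assumption about how such labeling could work; the window length in steps, and whether the harness keys off steps or wall-clock time, are not specified here:

```rust
// Hypothetical phase labeling relative to the injection window.
#[derive(Debug)]
enum Phase {
    Steady,   // before load injection
    Load,     // contention window
    Recovery, // after the window ends
}

fn phase_for_step(step: usize, load_start: usize, load_steps: usize) -> Phase {
    if step < load_start {
        Phase::Steady
    } else if step < load_start + load_steps {
        Phase::Load
    } else {
        Phase::Recovery
    }
}

fn main() {
    for step in 0..12 {
        // Load starts at measured step 6; the window length here is assumed.
        println!("step {step}: {:?}", phase_for_step(step, 6, 4));
    }
}
```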
This project was built to explore how speculative decoding behaves under real runtime variability rather than idealized benchmark conditions.
This benchmark frames speculative draft_length selection as a non-stationary control problem.
Rather than assuming a static optimal parameter, HaloSpec evaluates whether runtime-adaptive tuning can maintain stable latency under dynamic backend conditions and injected contention.
The objective is not simply to outperform fixed configurations, but to analyze controller behavior, convergence properties, and stability under perturbation.
- Load injection produces a measurable latency elevation during the `load` phase, followed by stabilization in `recovery`.
- The adaptive controller remains stable (bounded draft_length changes) and converges after the perturbation window.
- Fixed draft_length modes can outperform adaptive in some runs; the project frames this as a non-stationary tuning problem under runtime variability.
HaloSpec is an open-source experiment. I welcome contributions from the AMD developer community, specifically in:
- Telemetry accuracy: implementing real Linux `perf` event monitoring for actual UMA bandwidth tracking.
- Model support: testing performance gains across different GGUF model architectures (Llama, Mistral, Qwen).
- Control theory: refining the `alpha` tuning parameter for smoother draft_length transitions.