This file tracks project structure and changes. Update it whenever new files, features, or behaviors are added.
- Provide a quick map of the project layout.
- Record recent changes for future requests.
- `Makefile`: Unified entrypoints for checks, preset listing, local benchmarks, AutoJudge runs, and Docker CPU/GPU flows.
- `benchmarks/bench_speculative.py`: Benchmark runner for baseline, speculative, AutoJudge, Top-K, and SpecExec methods. Supports HF models, KV cache, quantization, GSM8K/MT-Bench eval modes, resume mode, enriched JSONL logging, and per-run error persistence.
- `jointadaspec/`: JointAdaSpec implementation with MDP state/action spaces, trace collection, sparse value iteration, inference-time policy lookup, baseline decoders, metrics, and utility loaders.
- `sp_samp/hf_topk.py`: HF Top-K lossy verification baseline and runtime stats.
- `configs/datasets/`: Hydra dataset configs for the `jointadaspec` pipeline (gsm8k, livecodebench, mtbench).
- `configs/model_pairs/`: Hydra model-pair configs for JointAdaSpec experiments, including the ungated Qwen2.5 7B -> 1.5B pair and the gated Llama placeholder pair.
- `configs/experiments/`: Hydra experiment configs for JointAdaSpec defaults, sweeps, and dataset-specific profiles.
- `configs/mdp_default.yaml`: Central JointAdaSpec MDP hyperparameters (grid, thresholds, reward weights, VI controls).
- `configs/models.json`: Model presets (target/draft HF models, device, dtype, quantization, tokenizer, chat template).
- `configs/methods.json`: Method presets (baseline/speculative/autojudge/topk/specexec/all/all_paper).
- `configs/experiments.json`: Target/draft pairing presets for common comparisons.
- `configs/method_templates.json`: Templates for AutoJudge and SpecExec configs with parameters/metrics.
- `scripts/01_collect_traces.py`: JointAdaSpec stage 1, exploratory one-step trace collection into Parquet.
- `scripts/02_solve_mdp.py`: JointAdaSpec stage 2, smoothed MDP estimation and sparse value iteration over a kappa sweep.
- `scripts/03_benchmark.py`: JointAdaSpec stage 3, benchmark runner for `jointadaspec`, `vanilla_ar`, `fixed_sd`, `fuzzy_sd`, and `specdecpp`.
- `scripts/run_jointadaspec_qwen_longrun.sh`: Staged orchestration script for the ungated Qwen2.5 7B -> 1.5B JointAdaSpec profile on GSM8K + LiveCodeBench.
- `scripts/validate_configs.py`: Static validator for configs (cross-file references, method constraints, tokenizer compatibility for target/draft pair methods).
- `scripts/validate_results_jsonl.py`: JSONL schema/type validator for benchmark outputs (strict mode supported).
- `scripts/run_autojudge_paper_eval.sh`: Orchestrates paper-style GSM8K sweeps (AutoJudge thresholds + Top-K grid) for Qwen2.5.
- `scripts/write_run_manifest.py`: Writes an environment manifest JSON for run reproducibility (used by the paper-eval runner).
- `scripts/report_autojudge_paper.py`: Aggregates raw JSONL into `.md`/`.csv`/`.json` report artifacts with derived speed/accuracy deltas.
- `scripts/report_yandex_style.py`: Generates Yandex-style threshold/accuracy/speedup report tables (`.md`/`.csv`/`.json`).
- `scripts/run_local_7b_1p5b_eval.sh`: Orchestrates local Qwen2.5 7B/1.5B GSM8K + LiveCodeBench evaluation.
- `scripts/run_autojudge_topk_gsm8k_bg.sh`: Long-run orchestration for local Qwen2.5 7B/1.5B GSM8K AutoJudge-threshold + Top-K sweeps with a GPU preflight wait gate and resume-friendly append output.
- `scripts/install_dependencies.sh`: Safe host bootstrap script (idempotent apt/pip install, optional GPU extras, EOL Ubuntu guard, never touches NVIDIA driver packages).
- `.github/workflows/ci.yml`: GitHub Actions CI pipeline (checks, tests, toy SpecExec smoke run, result-schema validation).
- `.python-version`: Recommended local Python runtime version.
- `datasets/.gitkeep`: Repository placeholder for the local dataset directory (`datasets/`).
- `README.MD`: Project overview, usage, and examples.
- `CONTRIBUTING.md`: Contributor workflow, checks, and benchmark artifact policy.
- `CODE_OF_CONDUCT.md`: Collaboration and community behavior expectations.
- `.github/pull_request_template.md`: Pull request template with validation checklist.
- `.github/ISSUE_TEMPLATE/`: Structured issue templates for bugs and feature requests.
- `docs/RESULTS.md`: Curated benchmark snapshot and reproducibility commands.
- `docs/ROADMAP.md`: Near-term and mid-term engineering roadmap.
- `docs/GITHUB_SETUP.md`: Practical checklist for GitHub About/topics/social preview setup.
- `reports/jointadaspec_qwen_7b_1p5b_2026-04-14.md`: Curated markdown report for the first real JointAdaSpec Qwen 7B -> 1.5B run.
- `sp_samp/autojudge.py`: Paper-aligned AutoJudge implementation (Algorithm 1 token mining, LogisticRegression training, and verification-stage decoding override).
- `sp_samp/gsm8k.py`: GSM8K utilities (dataset loading, final-answer extraction, and equivalence checks).
- `sp_samp/specexec.py`: CPU SpecExec implementation (exact target sampling with draft-tree cache prefill and pruning).
- `sp_samp/hf_specexec.py`: HF SpecExec implementation with KV-cache reuse along prefix-tree edges.
- `sp_samp/cli.py`: Unified CLI runner (`bench`, `autojudge`, `specexec`, preset application).
- `sp_samp/__init__.py`: Public exports with optional lazy behavior for torch-dependent modules.
- `sp_samp/hf_adapter.py`: HF model adapter with KV cache and optional bitsandbytes quantization.
- `sp_samp/hf_sampling.py`: HF sampling (baseline + speculative) using KV cache.
- `sp_samp/sampling.py`: Reference CPU implementations + metrics.
- `sp_samp/mtbench.py`: MT-Bench loader.
- `sp_samp/livecodebench.py`: LiveCodeBench loader and HF hub downloader.
- `sp_samp/methods/`: Method-facing exports (including the SpecExec export).
- `tests/test_sampling.py`: Core correctness tests for baseline/speculative sampling.
- `tests/test_autojudge.py`: Unit tests for AutoJudge features, classifier fitting, and stats.
- `tests/test_specexec.py`: Unit tests for SpecExec behavior and stats.
- `tests/test_topk.py`: Unit tests for Top-K mismatch acceptance/rejection behavior.
- `tests/test_livecodebench.py`: Unit tests for the LiveCodeBench loader.
- `tests/test_features.py`, `tests/test_verification.py`, `tests/test_mdp_solver.py`, `tests/test_inference.py`, `tests/test_end_to_end.py`: JointAdaSpec unit and toy end-to-end coverage.
- `Dockerfile`, `Dockerfile.gpu`: CPU/GPU containers.
- `requirements.txt`, `requirements-gpu.txt`: Dependencies.
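The reference CPU implementations in `sp_samp/sampling.py` follow the standard speculative-sampling verification rule. A minimal NumPy sketch of that rule (function name and shapes are illustrative, not the repository's API):

```python
import numpy as np

def speculative_accept(p_target, p_draft, token, rng):
    """One-token speculative verification: accept with probability
    min(1, p_t[token] / p_d[token]); on rejection, resample from the
    renormalized residual max(0, p_t - p_d)."""
    accept_prob = min(1.0, p_target[token] / max(p_draft[token], 1e-12))
    if rng.random() < accept_prob:
        return True, int(token)
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return False, int(rng.choice(len(p_target), p=residual))

rng = np.random.default_rng(0)
p_t = np.array([0.7, 0.2, 0.1])
p_d = np.array([0.1, 0.6, 0.3])
# Target likes token 0 far more than the draft does, so it is always kept.
accepted, emitted = speculative_accept(p_t, p_d, token=0, rng=rng)
```

This accept/resample rule is what makes speculative decoding output-lossless: the emitted token is distributed exactly as the target.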
- Added the first complete JointAdaSpec implementation:
  - new `jointadaspec/` package with core verification logic, feature extraction, tabular MDP spaces, trace collection, sparse MDP estimation, value iteration, inference-time policy lookup, baseline decoders, metrics, and utility loaders;
  - new stage-based scripts `scripts/01_collect_traces.py`, `scripts/02_solve_mdp.py`, `scripts/03_benchmark.py`;
  - new Hydra configs under `configs/datasets/`, `configs/model_pairs/`, and `configs/experiments/`, plus `configs/mdp_default.yaml`.
- Added JointAdaSpec documentation and experiment workflow:
  - updated `README.MD` and `docs/RESULTS.md` with JointAdaSpec positioning and result snapshots;
  - added `reports/jointadaspec_qwen_7b_1p5b_2026-04-14.md`;
  - added `scripts/run_jointadaspec_qwen_longrun.sh` for reproducible Qwen 7B -> 1.5B runs.
- Added JointAdaSpec test coverage:
  - feature, verification, solver, inference, and toy end-to-end tests under `tests/test_*`.
- Ran the first real JointAdaSpec HF profile on the ungated Qwen2.5 7B -> 1.5B pair: 3000 traces, 479880 one-step transitions;
  - saved policy sweep for kappa = 0.0, 0.5, 1.0, 2.0, 5.0;
  - benchmarked on GSM8K and LiveCodeBench with tracked outputs under `outputs/jointadaspec_qwen_2026-04-14/`.
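The stage-2 solve above can be pictured with a small hypothetical sketch: tabular value iteration over sparse transition estimates, with a greedy policy extracted at the end. The toy state/action/reward layout here is invented for illustration; the real solver additionally sweeps the kappa reward weight.

```python
def value_iteration(transitions, rewards, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Sparse tabular value iteration.

    transitions[s][a] -> {s_next: prob}; rewards[(s, a)] -> float.
    Returns (V, greedy_policy)."""
    V = {s: 0.0 for s in transitions}

    def q(s, a, nxt):
        return rewards[(s, a)] + gamma * sum(p * V[s2] for s2, p in nxt.items())

    for _ in range(max_iters):
        delta = 0.0
        for s, acts in transitions.items():
            best = max(q(s, a, nxt) for a, nxt in acts.items())
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place (Gauss-Seidel) update
        if delta < tol:
            break
    policy = {s: max(acts, key=lambda a: q(s, a, acts[a]))
              for s, acts in transitions.items()}
    return V, policy

# Toy 2-state MDP: "draft" earns more per step but risks a recovery state.
T = {0: {"draft": {0: 0.8, 1: 0.2}, "verify": {0: 1.0}},
     1: {"verify": {0: 1.0}}}
R = {(0, "draft"): 2.0, (0, "verify"): 1.0, (1, "verify"): 0.0}
V, policy = value_iteration(T, R)
```

In this toy instance the discounted value of drafting (2 + risk of a zero-reward recovery step) beats always verifying, so the greedy policy drafts in state 0.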
- GitHub-facing presentation pass for external reviewers:
  - rewrote `README.MD` as a repository landing page (positioning, method matrix, benchmark snapshot, quickstart);
  - added contributor-facing docs: `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, `docs/RESULTS.md`, `docs/ROADMAP.md`;
  - added GitHub collaboration templates: `.github/pull_request_template.md` and `.github/ISSUE_TEMPLATE/*`;
  - added a maintainer checklist for repository presentation: `docs/GITHUB_SETUP.md`.
- Documentation refresh for day-to-day work:
  - `README.MD` reduced to a compact, task-first guide (quick start, key commands, and minimal constraints);
  - `CLAUDE.md` updated with an explicit local Llama eval command and the paper-aligned AutoJudge C-grid policy (1e-7..1e0, 8 values);
  - long-run operations guidance now uses `PYTORCH_ALLOC_CONF=expandable_segments:True` (replaces the deprecated `PYTORCH_CUDA_ALLOC_CONF`);
  - added a practical monitoring command set for long Llama/GSM8K/LiveCodeBench sweeps (log tail, GPU telemetry, output growth checks).
- Documentation-only operational update (no API/algorithm changes):
  - `CLAUDE.md` now includes a canonical 24-48h runbook using `tmux` plus the staged AutoJudge profile;
  - Stage A (checkpoint bootstrap) and Stage B (main AutoJudge + Top-K sweep) are documented as separate steps to avoid mixing bootstrap metrics into the final JSONL.
- Added standardized long-run operations guidance in docs:
  - GPU preflight checks (`nvidia-smi`, conflicting-PID detection);
  - session lifecycle (`tmux` start/detach/reattach), monitoring commands, and post-run validation/report commands;
  - stop/recovery command set for process termination and GPU cleanup.
- Documented operational constraints for stability:
  - single GPU-heavy job rule;
  - resume behavior via a stable `OUT_GSM8K` output file;
  - checkpoint-first flow for AutoJudge (`datasets/autojudge_qwen25_1p5b_to_7b.pt`);
  - OOM mitigation hints (`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, optional quantized run flags).
- Added local Qwen2.5 7B/1.5B model support for offline experiments:
  - new model presets `qwen25_7b_instruct_local` and `qwen25_1p5b_instruct_local` in `configs/models.json`;
  - new experiment presets `qwen25_7b_local_target_1p5b_local_{speculative,autojudge,topk,all_paper}_k4` and `qwen25_7b_local_baseline` in `configs/experiments.json`;
  - `.gitignore` now ignores `models/*` while keeping `models/.gitkeep`.
- Added LiveCodeBench dataset loader (`sp_samp/livecodebench.py`):
  - `load_livecodebench(path, max_samples)` loads prompts from local JSONL;
  - `download_livecodebench(output_path, version_tag)` downloads from the HF hub;
  - exported in `sp_samp/__init__.py`.
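A hedged sketch of the loader contract (only the `load_livecodebench(path, max_samples)` signature comes from this file; the prompt key names and fallback order are assumptions, mirroring the "key fallback" behavior the tests mention):

```python
import json
from pathlib import Path

# Candidate prompt keys, tried in order -- the names are assumptions.
PROMPT_KEYS = ("prompt", "question_content", "question")

def load_livecodebench(path, max_samples=None):
    """Read one JSON object per line, pick the first present prompt-like
    key, and truncate to max_samples."""
    samples = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            prompt = next((record[k] for k in PROMPT_KEYS if k in record), None)
            if prompt is None:
                continue  # skip records with no usable prompt field
            samples.append({"prompt": prompt})
            if max_samples is not None and len(samples) >= max_samples:
                break
    return samples
```

The key-fallback loop is what lets one loader serve JSONL dumps whose prompt field is named differently across dataset versions.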
- Added `livecodebench` as an eval task in the benchmark runner and CLI: `--eval-task livecodebench` loads prompts as throughput-only `EvalSample` records (no `reference_answer`).
- Added Yandex-style report generator (`scripts/report_yandex_style.py`):
  - produces threshold/accuracy/speed/speedup tables as `.md`/`.csv`/`.json`.
- Added local evaluation orchestration script (`scripts/run_local_7b_1p5b_eval.sh`):
  - runs GSM8K + LiveCodeBench sweeps (baseline, speculative, AutoJudge thresholds, Top-K grid);
  - auto-downloads the LiveCodeBench dataset if missing;
  - generates Yandex-style reports for both datasets.
- Added a `make local-eval` target in the Makefile.
- Added `datasets>=2.18.0` to `requirements.txt` (needed for the LiveCodeBench HF hub download).
- Added `tests/test_livecodebench.py` with JSONL parsing, max_samples, and key fallback tests.
- Updated documentation: `README.MD`, `CLAUDE.md`, `CODEX.MD`.
- Switched the paper-style default compute pair to vocabulary-compatible Qwen2.5 (0.5B -> 3B):
  - added model presets `qwen25_0p5b_instruct_compat` and `qwen25_3b_instruct`;
  - added experiment presets `qwen25_3b_target_qwen25_0p5b_{speculative,autojudge,topk,specexec,all_methods,all_paper}_k4`;
  - Make defaults now point to `qwen25_3b_target_qwen25_0p5b_all_methods` and related AutoJudge/SpecExec experiments;
  - paper runner defaults (`scripts/run_autojudge_paper_eval.sh`) now target the 0.5B -> 3B pair and the checkpoint `autojudge_qwen25_0p5b_to_3b.pt`.
- Added paper-eval manifest reliability changes:
  - moved manifest generation to a dedicated script, `scripts/write_run_manifest.py`;
  - defaulted `HF_HUB_DISABLE_XET=1` in `scripts/run_autojudge_paper_eval.sh` to reduce hub backend instability in long runs.
- Kept legacy 0.5B -> 7B presets for reference, but this pair fails strict speculative methods due to a model vocab-size mismatch.
- Added paper-style benchmark extensions for Qwen2.5 (0.5B -> 7B legacy set):
  - new method `topk` and bundle `all_paper` in `benchmarks/bench_speculative.py`;
  - `all` now runs baseline + speculative + autojudge + topk + specexec.
- Added a GSM8K evaluation path in the benchmark:
  - new args `--eval-task` and `--gsm8k-eval-mode`;
  - quality fields `gsm8k_exact_match`, `gsm8k_correct`, and `gsm8k_total` in JSONL run/summary records.
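GSM8K scoring of this kind typically extracts a final numeric answer and compares it numerically. A sketch under the standard GSM8K `#### <answer>` reference convention (helper names are illustrative, not necessarily those in `sp_samp/gsm8k.py`):

```python
import re

NUM_RE = r"[-+]?[\d,]*\.?\d+"

def extract_final_answer(text):
    """Prefer the '#### <answer>' marker; otherwise fall back to the
    last number in the text. Commas are stripped for comparison."""
    m = re.search(r"####\s*(" + NUM_RE + ")", text)
    if m:
        return m.group(1).replace(",", "")
    nums = re.findall(NUM_RE, text)
    return nums[-1].replace(",", "") if nums else None

def gsm8k_equivalent(pred, ref):
    """Numeric equivalence so '1,234' and '1234.0' both match '1234'."""
    try:
        return pred is not None and abs(float(pred) - float(ref)) < 1e-6
    except (TypeError, ValueError):
        return False

pred = extract_final_answer("... so the total is 1,234 apples.")
ref = extract_final_answer("#### 1234")
```

Comparing as floats rather than strings avoids spurious mismatches from formatting (commas, trailing zeros).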
- Added Top-K controls and metrics:
  - new args `--topk-rank` and `--topk-grid`;
  - metrics `topk_accept_rate`, `topk_rank_effective`, `topk_mismatches`, and `topk_accepted_mismatches`.
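The Top-K lossy rule itself is simple: a drafted token that mismatches the target's greedy choice is still accepted when it ranks inside the target's top-k. A minimal illustrative sketch (not the repository's implementation):

```python
import numpy as np

def topk_accepts(target_logits, draft_token, k):
    """Accept a mismatching draft token iff it is within the target's
    top-k ranks by logit."""
    top_k = np.argsort(target_logits)[::-1][:k]
    return int(draft_token) in top_k.tolist()

logits = np.array([2.0, 1.5, 0.1, -1.0])
in_top2 = topk_accepts(logits, draft_token=1, k=2)   # rank 2 -> accepted
out_top2 = topk_accepts(logits, draft_token=2, k=2)  # rank 3 -> rejected
```

A metric like `topk_accept_rate` then falls out as accepted mismatches over total mismatches; larger `k` trades exactness for fewer rejections.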
- Added configs for paper-style runs:
  - `configs/methods.json`: `topk_k4`, `compare_all_paper_k4`;
  - `configs/experiments.json`: `qwen25_7b_target_qwen25_0p5b_topk_k4`, `qwen25_7b_target_qwen25_0p5b_all_paper`.
- Updated CLI/validators for Top-K and paper fields:
  - `sp_samp/cli.py` accepts `topk`, `all_paper`, `eval-task`, `gsm8k-eval-mode`, `topk-rank`, and `topk-grid`;
  - `scripts/validate_configs.py` and `scripts/validate_results_jsonl.py` support the new methods/fields with backward compatibility.
- Added paper evaluation tooling:
  - `scripts/run_autojudge_paper_eval.sh` (preflight manifest + GSM8K download + sweep orchestration);
  - `scripts/report_autojudge_paper.py` (aggregates raw runs and computes `speedup_vs_speculative` and `accuracy_delta_vs_speculative`);
  - a `make paper-eval` target.
- Added Top-K unit tests in `tests/test_topk.py`.
- Migrated AutoJudge presets/defaults to a paper-aligned open-model pairing:
  - added model presets `qwen25_0p5b_instruct`, `qwen25_7b_instruct`, and `qwen25_32b_instruct`;
  - added experiment presets `qwen25_7b_target_qwen25_0p5b_{speculative,autojudge,specexec,all_methods}_k4`;
  - (legacy note) these 7B/0.5B presets remain available for reference; active defaults were later moved to 3B/0.5B due to vocab compatibility.
- Hardened the AutoJudge training dataset gate in `benchmarks/bench_speculative.py`:
  - fail-fast with an actionable command when no checkpoint and no GSM8K train dataset is provided;
  - explicit `path not found` error for a missing train dataset path;
  - explicit incompatibility error when the dataset lacks the GSM8K `question`/`answer` schema (for example MT-Bench files).
- Split JSONL threshold semantics with backward-compatible aliases:
  - new fields `autojudge_threshold_calibrated` and `autojudge_threshold_used`;
  - legacy `autojudge_threshold_selected` kept as an alias of the calibrated value;
  - legacy `autojudge_threshold` kept as an alias of the used value;
  - updated `scripts/validate_results_jsonl.py` to accept the new fields while remaining compatible with older JSONL files.
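The alias scheme can be sketched as a record builder that emits both the new explicit fields and the legacy names pointing at the same values (the numbers here are made up):

```python
import json

def threshold_fields(calibrated, used):
    """New explicit fields plus legacy aliases, so older readers of the
    JSONL keep working unchanged."""
    return {
        "autojudge_threshold_calibrated": calibrated,
        "autojudge_threshold_used": used,
        # legacy aliases
        "autojudge_threshold_selected": calibrated,
        "autojudge_threshold": used,
    }

record = threshold_fields(calibrated=0.62, used=0.50)
line = json.dumps(record)
```

Keeping the legacy keys as pure aliases means old dashboards read the same numbers they always did, while new tooling can distinguish calibration output from the threshold actually applied at decode time.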
- Replaced the legacy synthetic-label AutoJudge with the paper-aligned pipeline:
  - labels mined via semi-greedy mismatch search (Algorithm 1 style) on GSM8K prompts;
  - classifier training switched to `StandardScaler + LogisticRegression` with threshold calibration by target recall;
  - inference integrated at the speculative verification stage (the classifier is called only on would-be-rejected mismatches).
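Recall-target calibration of the kind described above can be sketched as: pick the largest probability threshold whose recall over validation positives still meets the target (a generic sketch, not the repository's exact implementation):

```python
import numpy as np

def calibrate_threshold(val_probs, val_labels, recall_target=0.95):
    """Largest threshold t such that P(prob >= t | label=1) >= recall_target.
    Falls back to the smallest positive score if no threshold qualifies."""
    val_probs = np.asarray(val_probs, dtype=float)
    val_labels = np.asarray(val_labels, dtype=bool)
    positives = val_probs[val_labels]
    if positives.size == 0:
        return 0.5  # arbitrary default when there are no positives
    for thr in np.sort(np.unique(positives))[::-1]:  # highest first
        recall = float(np.mean(positives >= thr))
        if recall >= recall_target:
            return float(thr)
    return float(positives.min())

thr = calibrate_threshold(
    val_probs=[0.9, 0.8, 0.3, 0.2, 0.7],
    val_labels=[1, 1, 1, 0, 0],
    recall_target=0.66,
)
```

Scanning thresholds from high to low and stopping at the first that meets the recall target keeps the judge as conservative as possible while still accepting the required share of genuinely-acceptable mismatches.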
- Extended AutoJudge runtime/config interfaces:
  - new args in benchmark/CLI: `autojudge_task`, `autojudge_train_dataset`, `autojudge_recall_target`, `autojudge_train_split`, `autojudge_c_grid`;
  - legacy args (`autojudge_train_steps`, `autojudge_train_batch_size`, `autojudge_train_lr`, `autojudge_audit_ratio`) are parsed but ignored with a deprecation warning;
  - checkpoint format bumped (`autojudge_version=2`) with explicit legacy-checkpoint fallback to retraining.
- Extended JSONL/schema metrics for AutoJudge calibration:
  - `autojudge_val_auc`, `autojudge_val_recall`, and `autojudge_threshold_selected` in records;
  - updated the strict validator `scripts/validate_results_jsonl.py`.
- Added/updated tests:
  - `tests/test_autojudge.py` now covers GSM8K parsing/equivalence, classifier recall-target calibration, Algorithm-1-style mining behavior, and mismatch-time inference decisions.
- Validated the host + Docker GPU workflow on Ubuntu 24.04.4 with an RTX 5090 (driver 590.48.01):
  - installed host toolchain/dependencies via `scripts/install_dependencies.sh --gpu`;
  - verified local checks/tests (`make check`, `make test`) and host CUDA visibility (torch `2.9.1+cu128`, `sm_120`).
- Installed Docker Engine + NVIDIA Container Toolkit and verified the GPU runtime:
  - `docker run hello-world` passes;
  - `make docker-gpu-check` and `make docker-gpu-check-image` pass.
- Added Makefile smoke flexibility for host GPU validation:
  - new variable `SMOKE_HF_DEVICE` (default `cpu`);
  - new target `smoke-hf-gpu` as an alias to run the tiny HF smoke on CUDA.
- Extended README operational guidance:
  - explicit Ubuntu 24.04 Docker Engine install commands;
  - explicit NVIDIA Container Toolkit setup/config commands;
  - documented the host GPU tiny smoke workflow and expected JSONL output;
  - documented one-off `DOCKER_CMD="sudo docker"` usage;
  - documented a short Docker load run command for `gptoss20b_target_gptoss20b_draft_k4` plus strict JSONL validation and a fallback command.
- Captured bench run artifacts under `datasets/`:
  - `datasets/results_smoke_hf_gpu.jsonl` (validated strict);
  - `datasets/results_gptoss20b_load.jsonl` (validated strict, generated via the deterministic fallback preset `mistral_target_mistral_draft_k4` after a GPT-OSS runtime mismatch).
- Standardized local dataset layout to `datasets/`:
  - added tracked directory placeholder `datasets/.gitkeep`;
  - Makefile defaults now use `DATASET=datasets/mt_bench.jsonl` and `OUT=datasets/results.jsonl`;
  - benchmark runner default output path now resolves to `datasets/results.jsonl`;
  - benchmark targets now fail fast if the dataset file does not exist (clear recovery message);
  - Docker dataset mount now resolves from the absolute host path derived from `DATASET`;
  - README examples updated to use `datasets/` paths.
- Updated ignore rules for local data:
  - `.gitignore` now ignores `datasets/*` while keeping `datasets/.gitkeep`;
  - `.dockerignore` now excludes `datasets/` from the image build context.
- Added a CI pipeline (`.github/workflows/ci.yml`) on Python 3.11:
  - runs `make check`, `make test`, a toy SpecExec smoke benchmark, and JSONL schema validation.
- Pinned dependency versions in `requirements.txt` and `requirements-gpu.txt` for reproducible installs.
- Added `.python-version` with recommended runtime `3.11`.
- Improved bootstrap interpreter selection: `scripts/install_dependencies.sh` now prefers `python3.11` when available, enforces a minimum of Python 3.10, and warns when below 3.11.
- Added benchmark result schema validation:
  - new script `scripts/validate_results_jsonl.py`;
  - new `make validate-results` target with strict mode.
- Removed `autojudge_placeholder` from `configs/methods.json` and simplified `scripts/validate_configs.py` logic accordingly.
- Fixed Docker/Torch version consistency for GPU builds:
  - the Makefile exposes `CUDA_BASE_IMAGE`, `TORCH_INDEX_URL`, and `TORCH_VERSION` variables;
  - defaults updated for a Blackwell-safe stack (`nvidia/cuda:12.8.1-cudnn-runtime-ubuntu22.04`, `cu128`, `torch==2.9.1`);
  - `requirements.txt` torch pin updated to `2.9.1`;
  - README GPU build example updated accordingly.
- Hardened `Dockerfile.gpu` for CUDA runtime base images:
  - if Python/pip are missing, the image now installs `python3`, `python3-pip`, and `python3-venv`;
  - switched pip invocations to `python3 -m pip` to avoid missing-`pip`-binary issues.
- Improved HF quantization compatibility in `sp_samp/hf_adapter.py`:
  - when a checkpoint already has its own quantization class (for example MXFP4), the loader now falls back by dropping the conflicting BitsAndBytes override and retrying with checkpoint defaults.
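The fallback pattern can be sketched generically: attempt the load with the BitsAndBytes override, and on a conflict retry without it so the checkpoint's own quantization wins. The loader function below is a stand-in, not the real HF adapter; note the kwargs are only populated when an override is set, which also avoids ever forwarding `quantization_config=None`.

```python
def load_with_quant_fallback(load_fn, model_id, quant_config=None):
    """Try loading with a quantization override; on a conflict, retry
    with the checkpoint's own defaults."""
    kwargs = {}
    if quant_config is not None:
        kwargs["quantization_config"] = quant_config
    try:
        return load_fn(model_id, **kwargs)
    except ValueError:
        if "quantization_config" not in kwargs:
            raise  # unrelated failure, surface it
        return load_fn(model_id)  # drop the conflicting override

# Stand-in loader that rejects overrides, as a natively quantized
# checkpoint would.
calls = []
def fake_load(model_id, **kw):
    calls.append(dict(kw))
    if "quantization_config" in kw:
        raise ValueError("checkpoint ships its own quantization config")
    return f"model:{model_id}"

model = load_with_quant_fallback(fake_load, "demo", quant_config={"load_in_4bit": True})
```

The retry path keeps a user-supplied `--quant` flag harmless on checkpoints that cannot honor it, instead of aborting the whole run.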
- Added an early quantization-override guard in `benchmarks/bench_speculative.py`:
  - for native-quantized checkpoints (for example GPT-OSS), `--quant`/`--draft-quant` overrides are ignored with explicit warnings;
  - prevents `Mxfp4Config` vs `BitsAndBytesConfig` conflicts before model load.
- Fixed an HF model-loading edge case for native quantized checkpoints:
  - `sp_samp/hf_adapter.py` no longer forwards `quantization_config=None` to `from_pretrained`, avoiding `AttributeError: 'NoneType' object has no attribute 'to_dict'` in recent `transformers`.
- Added Docker build recovery helpers for host-side BuildKit snapshot issues:
  - the Makefile now supports a `DOCKER_CMD` override and a `docker-build-gpu-safe` fallback target (automatic retry with `DOCKER_BUILDKIT=0`);
  - added a `docker-prune-builder` target to clean the Docker builder cache;
  - updated the README with a snapshot/export troubleshooting flow.
- Added Docker GPU diagnostics helpers:
  - `make docker-gpu-check` validates NVIDIA runtime passthrough using a clean CUDA image;
  - `make docker-gpu-check-image` validates `torch.cuda.is_available()` inside the project GPU image.
- Updated `make docker-gpu-check` behavior:
  - first attempts `nvidia-smi` in the CUDA base image;
  - if the NVML path fails, falls back to a torch CUDA runtime check in the built project image.
- Added explicit CUDA preflight checks for the HF benchmark path in `benchmarks/bench_speculative.py`:
  - when `--device cuda` is requested but CUDA is unavailable, the benchmark now exits early with actionable setup guidance.
- Added explicit CUDA architecture compatibility checks for the HF benchmark path:
  - fail-fast on an unsupported torch GPU arch (for example `sm_120` on old torch builds) with rebuild guidance.
- Improved HF loader runtime error clarity in `sp_samp/hf_adapter.py`:
  - maps `no kernel image is available for execution on the device` to a concise torch/GPU arch compatibility hint.
- Updated the `Dockerfile.gpu` install order:
  - requirements are installed first;
  - the optional torch CUDA wheel override is installed after requirements so build args remain authoritative.
- Improved local tooling reliability:
  - the Makefile now prefers `.venv/bin/python` when present and falls back to `python3`;
  - `make test` now validates `pytest` availability first and emits a clear recovery message (`make setup`).
- Fixed SpecExec summary metric correctness:
  - in benchmark stat aggregation, `max_active_branches` is now aggregated as `max(...)` instead of a sum.
- Replaced the stale AutoJudge placeholder module export:
  - `sp_samp/methods/autojudge/__init__.py` now exports the actual AutoJudge APIs when dependencies are installed;
  - when torch/transformers are missing, the module now raises a clear dependency error on attribute access.
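The aggregation fix above is the classic sum-vs-peak distinction: per-run counters combine with a sum, but a peak metric such as `max_active_branches` must combine with a max. A toy sketch (field names follow this file; the record layout is illustrative):

```python
def aggregate(runs):
    """Summarize per-run SpecExec stats: counters sum, peaks take max."""
    return {
        "total_tokens": sum(r["total_tokens"] for r in runs),
        "max_active_branches": max(r["max_active_branches"] for r in runs),
    }

runs = [
    {"total_tokens": 120, "max_active_branches": 4},
    {"total_tokens": 80, "max_active_branches": 7},
]
summary = aggregate(runs)
```

Summing a peak would have reported 11 active branches here, a state that never existed in any single run.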
- Expanded `.dockerignore` to reduce Docker context size and avoid shipping local artifacts (`.venv`, caches, build outputs, logs, `papers/`).
- Added a safe host bootstrap flow:
  - new script `scripts/install_dependencies.sh` installs missing Ubuntu packages and updates Python deps in `.venv`;
  - supports optional GPU extras via `--gpu`;
  - blocks on EOL Ubuntu by default (can be overridden with `--allow-eol-ubuntu`);
  - explicitly avoids NVIDIA driver/CUDA driver package changes.
- Added Make targets for bootstrap:
  - `make setup`, `make setup-gpu`;
  - `ALLOW_EOL_UBUNTU=1` passthrough support.
- Updated dependency baselines in `requirements.txt` and `requirements-gpu.txt` to newer minimum versions.
- Updated `README.MD` with the safe installation workflow and Ubuntu EOL caveats.
- Added benchmark reliability/runtime instrumentation in `benchmarks/bench_speculative.py`:
  - system metadata is written to each JSONL record (`git_sha`, host, GPU/driver, torch/transformers versions);
  - resume mode skips already completed runs via `resume_key`;
  - per-run exceptions are persisted to JSONL (`status=error`, traceback) without aborting the whole method.
- Added a headless execution guard:
  - new CLI/benchmark flag `--require-headless`; Makefile support via `HEADLESS=1`.
- Optimized HF SpecExec cache building in `sp_samp/hf_specexec.py`:
  - draft tree expansion now reuses the KV cache via `prefill + step`;
  - target cache fill now uses depth-wise tree traversal with KV reuse.
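The resume-mode skip described above can be sketched as a set of stable keys read back from the output JSONL, with failed (`status=error`) runs left eligible for retry. The key layout below is an assumption for illustration:

```python
import json
from pathlib import Path

def make_resume_key(method, experiment, seed):
    """Stable identity of one benchmark run (layout is illustrative)."""
    return f"{method}|{experiment}|seed={seed}"

def completed_keys(out_path):
    """Collect resume keys of successful runs already in the JSONL;
    error records stay eligible for retry."""
    done = set()
    path = Path(out_path)
    if not path.exists():
        return done
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        if rec.get("status") != "error" and "resume_key" in rec:
            done.add(rec["resume_key"])
    return done
```

Because completed runs are identified from the output file itself, a long sweep can be re-invoked after a crash and only the missing or failed runs execute again.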
- Reworked SpecExec to paper-aligned exact behavior:
  - SpecExec now samples exactly from the target distribution while using draft-tree cache prefill (no branch-selection sampling bias);
  - added stronger SpecExec diagnostics (`SpecExecError`) with stage/prefix context for faster failure triage.
- Added SpecExec distribution and exactness tests in `tests/test_specexec.py`:
  - empirical distribution match to the target;
  - sequence equivalence to the baseline for equal seeds.
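An empirical-distribution exactness check of this kind usually draws many samples and bounds the maximum absolute error against the target probabilities. A sketch with a stand-in exact sampler (the real tests exercise the SpecExec decoder instead):

```python
import numpy as np

def empirical_distribution(sample_fn, vocab_size, n=20_000, seed=0):
    """Frequency of each token over n draws from sample_fn(rng)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(vocab_size)
    for _ in range(n):
        counts[sample_fn(rng)] += 1
    return counts / n

p_target = np.array([0.5, 0.3, 0.2])
# Stand-in for an output-lossless decoder: samples straight from p_target.
freq = empirical_distribution(lambda rng: rng.choice(3, p=p_target), 3)
max_abs_err = float(np.abs(freq - p_target).max())
```

With 20000 draws the per-token standard error is about 0.0035, so a tolerance of a few hundredths separates an exact method from a biased one reliably.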
- Implemented SpecExec as a first-class method:
  - added `sp_samp/specexec.py` (toy/CPU) and `sp_samp/hf_specexec.py` (HF);
  - added `SpecExecStats` with branch metrics (`branch_prune_rate`, `effective_parallelism`, `max_active_branches`, call rates).
- Integrated SpecExec into the benchmark flow (`benchmarks/bench_speculative.py`):
  - new method option `specexec`; `all` now runs baseline + speculative + autojudge + specexec;
  - new args `--parallel-branches` and `--branch-prune-threshold`;
  - JSONL records now include SpecExec configuration/metrics fields.
- Integrated SpecExec into the CLI (`sp_samp/cli.py`):
  - new subcommand `specexec`;
  - method presets now accept `specexec`;
  - passthrough for `parallel_branches` and `branch_prune_threshold`.
- Updated configs:
  - `configs/methods.json`: added `specexec_k4`, updated `compare_all_k4`, removed the SpecExec placeholder preset;
  - `configs/experiments.json`: added SpecExec experiment presets for Llama3/Mistral/GPT-OSS;
  - `configs/method_templates.json`: aligned SpecExec template metrics with the actual output.
- Updated automation entrypoints:
  - Makefile: added `SPECEXEC_EXPERIMENT`, `specexec`, and `docker-specexec` targets; refreshed the `bench-all` description.
- Added unit tests for SpecExec in `tests/test_specexec.py`.
- Updated `README.MD` with SpecExec usage examples and metrics documentation.
- Added `scripts/validate_configs.py` and integrated it into `make check` and `make validate-configs`.
- Added `make smoke-hf` for quick end-to-end HF pipeline verification with a tiny model.
- Added `TARGET_PRESET`/`DRAFT_PRESET` variables in the Makefile for `bench-method` to avoid hardcoded mismatched pairs.
- Fixed a benchmark import bug by explicitly importing `JudgeMLP` in `benchmarks/bench_speculative.py`.
- Fixed the AutoJudge run determinism path by forwarding the per-run `seed` into `autojudge_sample_hf`.
- Improved the HF architecture in `benchmarks/bench_speculative.py`:
  - separated target/draft runtime settings (`draft_tokenizer`, `draft_device`, `draft_dtype`, `draft_quant`, `draft_bnb_compute_dtype`);
  - stopped forcing the draft to use the target tokenizer path;
  - added a tokenizer compatibility guard for speculative/AutoJudge methods;
  - avoided unnecessary draft-model construction for baseline-only runs.
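The tokenizer guard can be sketched as a fail-fast check that token-level methods only run when target and draft share a vocabulary (the method names follow this file; the vocab comparison is illustrative):

```python
# Methods that compare token ids across models need identical tokenizers.
TOKEN_LEVEL_METHODS = {"speculative", "autojudge", "topk", "specexec"}

def check_tokenizer_compat(target_vocab, draft_vocab, method):
    """Raise before any model work if a token-level method would compare
    ids from two different vocabularies."""
    if method in TOKEN_LEVEL_METHODS and target_vocab != draft_vocab:
        raise ValueError(
            f"method {method!r} needs identical target/draft tokenizers "
            f"(target vocab {len(target_vocab)} != draft vocab {len(draft_vocab)})"
        )

vocab = {"a": 0, "b": 1}
check_tokenizer_compat(vocab, vocab, "speculative")  # shared vocab: ok
```

Failing before model construction turns a confusing mid-run id mismatch into an immediate, explainable config error.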
- Expanded benchmark JSONL records with resolved target/draft runtime fields for reproducibility.
- Expanded CLI run args in `sp_samp/cli.py` to pass draft-specific overrides.
- Added draft-preset mapping in `sp_samp/cli.py` to avoid target-arg overwrites.
- Updated `configs/experiments.json` to correctness-safe target/draft pairings (identical tokenizer presets).
- Updated Makefile defaults to valid experiment IDs (`llama3_all_methods`, `llama3_target_llama3_autojudge_k4`).
- Updated `README.MD` with config validation and HF smoke targets, and refreshed experiment examples.
- Updated `README.MD` HF command examples to tokenizer-compatible target/draft pairs.
- Added preset configs for models and methods in `configs/`.
- Added experiment pair presets in `configs/experiments.json`.
- Added AutoJudge/SpecExec config templates in `configs/method_templates.json`.
- Added a CLI runner in `sp_samp/cli.py`.
- Added CLI support for `--experiment` presets.
- Exposed the benchmark parser/run for programmatic use.
- Added `benchmarks/__init__.py` for an importable benchmark module.
- Added CODEX.MD and README.MD.
- Expanded README with a full from-zero setup guide.
- Fixed GPU requirements to avoid reinstalling CPU-only torch.
- Implemented `sp_samp/autojudge.py` with:
  - synthetic judge-label collection from target/draft models;
  - MLP judge classifier training;
  - an AutoJudge decoding loop with threshold decisions and fallback-to-target.
- Integrated AutoJudge into `benchmarks/bench_speculative.py`:
  - new methods: `autojudge`, `all`;
  - AutoJudge train/inference args;
  - method-specific JSONL metrics.
- Updated the CLI for method selection:
  - `bench --method autojudge|all`; the `autojudge` command now routes to the benchmark runner.
- Updated configs:
  - new method presets `autojudge_k4` and `compare_all_k4`;
  - new experiment presets for AutoJudge and all-method comparisons.
- Added tests in `tests/test_autojudge.py`.
- Updated `README.MD` with end-to-end usage for AutoJudge and all-method runs.
- Added a Makefile with:
  - `make help`, `make check`, `make list-presets`;
  - a quick local run (`make bench-toy`);
  - preset runs (`make bench`, `make autojudge`, `make bench-all`);
  - Docker runs (`make docker-build`, `make docker-build-gpu`, `make docker-bench`, `make docker-autojudge`, `make docker-bench-all`).
- Simplified `sp_samp/cli.py`:
  - removed duplicated parser arguments via a shared helper;
  - switched to a lazy benchmark import so `list-presets` works without ML dependencies;
  - expanded passthrough arguments for toy/HF parity (`vocab_size`, `draft_noise`, `seed`).
- Hardened package import behavior in `sp_samp/__init__.py`:
  - moved HF/AutoJudge exports to optional imports, preserving lightweight commands without `torch`.
- Improved `benchmarks/bench_speculative.py` portability:
  - supports toy-mode execution without `torch`;
  - emits clear dependency errors only for HF/AutoJudge paths;
  - changed the toy default `vocab_size` to `2048` (from `32000`) to avoid pathological memory/time in `RandomModel`;
  - added fail-fast validation for `autojudge` without HF model arguments.
- Standardized benchmark invocations to module mode (`python -m benchmarks.bench_speculative`) in docs and the Makefile.
- Tuned `make bench-toy` defaults for fast smoke runs.