CODEX

This file tracks project structure and changes. Update it whenever new files, features, or behaviors are added.

Purpose

  • Provide a quick map of the project layout.
  • Record recent changes for future requests.

Structure

  • Makefile: Unified entrypoints for checks, preset listing, local benchmarks, AutoJudge runs, and Docker CPU/GPU flows.
  • benchmarks/bench_speculative.py: Benchmark runner for baseline, speculative, AutoJudge, Top-K, and SpecExec methods. Supports HF models, KV cache, quantization, GSM8K/MT-Bench eval modes, resume mode, enriched JSONL logging, and per-run error persistence.
  • jointadaspec/: JointAdaSpec implementation with MDP state/action spaces, trace collection, sparse value iteration, inference-time policy lookup, baseline decoders, metrics, and utility loaders.
  • sp_samp/hf_topk.py: HF Top-K lossy verification baseline and runtime stats.
  • configs/datasets/: Hydra dataset configs for the jointadaspec/ pipeline (gsm8k, livecodebench, mtbench).
  • configs/model_pairs/: Hydra model-pair configs for JointAdaSpec experiments, including the ungated Qwen2.5 7B -> 1.5B pair and the gated Llama placeholder pair.
  • configs/experiments/: Hydra experiment configs for JointAdaSpec defaults, sweeps, and dataset-specific profiles.
  • configs/mdp_default.yaml: Central JointAdaSpec MDP hyperparameters (grid, thresholds, reward weights, VI controls).
  • configs/models.json: Model presets (target/draft HF models, device, dtype, quantization, tokenizer, chat template).
  • configs/methods.json: Method presets (baseline/speculative/autojudge/topk/specexec/all/all_paper).
  • configs/experiments.json: Target/draft pairing presets for common comparisons.
  • configs/method_templates.json: Templates for AutoJudge and SpecExec configs with parameters/metrics.
  • scripts/01_collect_traces.py: JointAdaSpec stage 1, exploratory one-step trace collection into Parquet.
  • scripts/02_solve_mdp.py: JointAdaSpec stage 2, smoothed MDP estimation and sparse value iteration over a kappa sweep.
  • scripts/03_benchmark.py: JointAdaSpec stage 3, benchmark runner for jointadaspec, vanilla_ar, fixed_sd, fuzzy_sd, and specdecpp.
  • scripts/run_jointadaspec_qwen_longrun.sh: Staged orchestration script for the ungated Qwen2.5 7B -> 1.5B JointAdaSpec profile on GSM8K + LiveCodeBench.
  • scripts/validate_configs.py: Static validator for configs (cross-file references, method constraints, tokenizer compatibility for target/draft pair methods).
  • scripts/validate_results_jsonl.py: JSONL schema/type validator for benchmark outputs (strict mode supported).
  • scripts/run_autojudge_paper_eval.sh: Orchestrates paper-style GSM8K sweeps (AutoJudge thresholds + Top-K grid) for Qwen2.5.
  • scripts/write_run_manifest.py: Writes environment manifest JSON for run reproducibility (used by paper-eval runner).
  • scripts/report_autojudge_paper.py: Aggregates raw JSONL into .md/.csv/.json report artifacts with derived speed/accuracy deltas.
  • scripts/report_yandex_style.py: Generates Yandex-style threshold/accuracy/speedup report tables (.md/.csv/.json).
  • scripts/run_local_7b_1p5b_eval.sh: Orchestrates local Qwen2.5 7B/1.5B GSM8K + LiveCodeBench evaluation.
  • scripts/run_autojudge_topk_gsm8k_bg.sh: Long-run orchestration for local Qwen2.5 7B/1.5B GSM8K AutoJudge-threshold + Top-K sweeps with GPU preflight wait gate and resume-friendly append output.
  • scripts/install_dependencies.sh: Safe host bootstrap script (idempotent apt/pip install, optional GPU extras, EOL Ubuntu guard, never touches NVIDIA driver packages).
  • .github/workflows/ci.yml: GitHub Actions CI pipeline (checks, tests, toy SpecExec smoke run, result-schema validation).
  • .python-version: Recommended local Python runtime version.
  • datasets/.gitkeep: Repository placeholder for local dataset directory (datasets/).
  • README.MD: Project overview, usage, and examples.
  • CONTRIBUTING.md: Contributor workflow, checks, and benchmark artifact policy.
  • CODE_OF_CONDUCT.md: Collaboration and community behavior expectations.
  • .github/pull_request_template.md: Pull request template with validation checklist.
  • .github/ISSUE_TEMPLATE/: Structured issue templates for bugs and feature requests.
  • docs/RESULTS.md: Curated benchmark snapshot and reproducibility commands.
  • docs/ROADMAP.md: Near-term and mid-term engineering roadmap.
  • docs/GITHUB_SETUP.md: Practical checklist for GitHub About/topics/social preview setup.
  • reports/jointadaspec_qwen_7b_1p5b_2026-04-14.md: Curated markdown report for the first real JointAdaSpec Qwen 7B -> 1.5B run.
  • sp_samp/autojudge.py: Paper-aligned AutoJudge implementation (Algorithm 1 token mining, LogisticRegression training, and verification-stage decoding override).
  • sp_samp/gsm8k.py: GSM8K utilities (dataset loading, final-answer extraction, and equivalence checks).
  • sp_samp/specexec.py: CPU SpecExec implementation (exact target sampling with draft-tree cache prefill and pruning).
  • sp_samp/hf_specexec.py: HF SpecExec implementation with KV-cache reuse along prefix-tree edges.
  • sp_samp/cli.py: Unified CLI runner (bench, autojudge, specexec, preset application).
  • sp_samp/__init__.py: Public exports with optional lazy behavior for torch-dependent modules.
  • sp_samp/hf_adapter.py: HF model adapter with KV cache and optional bitsandbytes quantization.
  • sp_samp/hf_sampling.py: HF sampling (baseline + speculative) using KV cache.
  • sp_samp/sampling.py: Reference CPU implementations + metrics.
  • sp_samp/mtbench.py: MT-Bench loader.
  • sp_samp/livecodebench.py: LiveCodeBench loader and HF hub downloader.
  • sp_samp/methods/: Method-facing exports (including SpecExec export).
  • tests/test_sampling.py: Core correctness tests for baseline/speculative sampling.
  • tests/test_autojudge.py: Unit tests for AutoJudge features, classifier fitting, and stats.
  • tests/test_specexec.py: Unit tests for SpecExec behavior and stats.
  • tests/test_topk.py: Unit tests for Top-K mismatch acceptance/rejection behavior.
  • tests/test_livecodebench.py: Unit tests for LiveCodeBench loader.
  • tests/test_features.py, tests/test_verification.py, tests/test_mdp_solver.py, tests/test_inference.py, tests/test_end_to_end.py: JointAdaSpec unit and toy end-to-end coverage.
  • Dockerfile, Dockerfile.gpu: CPU/GPU containers.
  • requirements.txt, requirements-gpu.txt: Dependencies.
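
As an illustration of the kind of cross-file check scripts/validate_configs.py performs, here is a minimal hypothetical sketch; the field names (`target`, `draft`, `method`) and the preset schema are assumptions, not the project's actual API:

```python
def validate_cross_refs(models: dict, methods: dict, experiments: dict) -> list:
    """Return human-readable errors for experiment presets that reference
    model or method preset IDs missing from the other config files."""
    errors = []
    for exp_id, exp in experiments.items():
        for role in ("target", "draft"):  # assumed field names
            model_id = exp.get(role)
            if model_id is not None and model_id not in models:
                errors.append(f"{exp_id}: unknown {role} model preset {model_id!r}")
        method_id = exp.get("method")
        if method_id is not None and method_id not in methods:
            errors.append(f"{exp_id}: unknown method preset {method_id!r}")
    return errors
```

The real validator also checks method constraints and tokenizer compatibility; this sketch only shows the reference-resolution idea.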

Recent Changes (2026-04-14)

  • Added the first complete JointAdaSpec implementation:
    • new jointadaspec/ package with core verification logic, feature extraction, tabular MDP spaces, trace collection, sparse MDP estimation, value iteration, inference-time policy lookup, baseline decoders, metrics, and utility loaders;
    • new stage-based scripts scripts/01_collect_traces.py, scripts/02_solve_mdp.py, scripts/03_benchmark.py;
    • new Hydra configs under configs/datasets/, configs/model_pairs/, configs/experiments/, plus configs/mdp_default.yaml.
  • Added JointAdaSpec documentation and experiment workflow:
    • updated README.MD and docs/RESULTS.md with JointAdaSpec positioning and result snapshots;
    • added reports/jointadaspec_qwen_7b_1p5b_2026-04-14.md;
    • added scripts/run_jointadaspec_qwen_longrun.sh for reproducible Qwen 7B -> 1.5B runs.
  • Added JointAdaSpec test coverage:
    • feature, verification, solver, inference, and toy end-to-end tests under tests/test_*.
  • Ran the first real JointAdaSpec HF profile on the ungated Qwen2.5 7B -> 1.5B pair:
    • 3000 traces, 479880 one-step transitions;
    • saved policy sweep for kappa = 0.0, 0.5, 1.0, 2.0, 5.0;
    • benchmarked on GSM8K and LiveCodeBench with tracked outputs under outputs/jointadaspec_qwen_2026-04-14/.
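
The sparse value-iteration stage (scripts/02_solve_mdp.py) can be pictured with a minimal tabular sketch. The toy MDP and function name below are illustrative only, not the project's API; in the real pipeline, kappa would enter through the reward terms of the estimated MDP:

```python
def sparse_value_iteration(transitions, rewards, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Tabular value iteration over a sparse transition model.

    transitions[s][a] is a list of (next_state, prob) pairs;
    rewards[(s, a)] is the immediate reward for taking action a in state s.
    """
    values = {s: 0.0 for s in transitions}
    for _ in range(max_iters):
        delta = 0.0
        for s, actions in transitions.items():
            # Bellman optimality backup over only the stored (sparse) successors.
            best = max(
                rewards[(s, a)] + gamma * sum(p * values[ns] for ns, p in succ)
                for a, succ in actions.items()
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # Greedy policy extraction from the converged value function.
    policy = {
        s: max(
            actions,
            key=lambda a: rewards[(s, a)] + gamma * sum(p * values[ns] for ns, p in actions[a]),
        )
        for s, actions in transitions.items()
    }
    return values, policy
```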

Recent Changes (2026-04-01)

  • GitHub-facing presentation pass for external reviewers:
    • rewrote README.MD as a repository landing page (positioning, method matrix, benchmark snapshot, quickstart);
    • added contributor-facing docs: CONTRIBUTING.md, CODE_OF_CONDUCT.md, docs/RESULTS.md, docs/ROADMAP.md;
    • added GitHub collaboration templates: .github/pull_request_template.md and .github/ISSUE_TEMPLATE/*;
    • added maintainer checklist for repository presentation: docs/GITHUB_SETUP.md.

Recent Changes (2026-03-28)

  • Documentation refresh for day-to-day work:
    • README.MD reduced to a compact, task-first guide (quick start, key commands, and minimal constraints).
    • CLAUDE.md updated with explicit local Llama eval command and paper-aligned AutoJudge C-grid policy (1e-7..1e0, 8 values).
  • Long-run operations guidance now uses PYTORCH_ALLOC_CONF=expandable_segments:True (replaces deprecated PYTORCH_CUDA_ALLOC_CONF).
  • Added practical monitoring command set for long Llama/GSM8K/LiveCodeBench sweeps (log tail, GPU telemetry, output growth checks).
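
A sketch of how such an allocator default might be applied defensively before importing torch; the helper name is invented here, and the project may simply export the variable in its shell scripts instead:

```python
import os

def ensure_alloc_conf(env=None):
    """Default to the current allocator knob, but never override a value the
    caller already set (including the deprecated PYTORCH_CUDA_ALLOC_CONF)."""
    env = os.environ if env is None else env
    if "PYTORCH_ALLOC_CONF" not in env and "PYTORCH_CUDA_ALLOC_CONF" not in env:
        env["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"
    return env.get("PYTORCH_ALLOC_CONF") or env.get("PYTORCH_CUDA_ALLOC_CONF")
```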

Recent Changes (2026-03-10)

  • Documentation-only operational update (no API/algorithm changes):
    • CLAUDE.md now includes a canonical 24-48h runbook using tmux + staged AutoJudge profile.
    • Stage A (checkpoint bootstrap) and Stage B (main AutoJudge + Top-K sweep) are documented as separate steps to avoid mixing bootstrap metrics into final JSONL.
  • Added standardized long-run operations guidance in docs:
    • GPU preflight checks (nvidia-smi, conflicting PID detection).
    • session lifecycle (tmux start/detach/reattach), monitoring commands, and post-run validation/report commands.
    • stop/recovery command set for process termination and GPU cleanup.
  • Documented operational constraints for stability:
    • single GPU-heavy job rule;
    • resume behavior via stable OUT_GSM8K output file;
    • checkpoint-first flow for AutoJudge (datasets/autojudge_qwen25_1p5b_to_7b.pt);
    • OOM mitigation hints (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, optional quantized run flags).
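
The conflicting-PID preflight can be sketched as a pure parsing helper over the text emitted by `nvidia-smi --query-compute-apps=pid --format=csv,noheader`; the function name and allowed-set handling are illustrative assumptions:

```python
def conflicting_gpu_pids(nvidia_smi_output: str, allowed=()):
    """Return PIDs of GPU compute processes not in the allowed set, given the
    output of: nvidia-smi --query-compute-apps=pid --format=csv,noheader"""
    allowed = {int(p) for p in allowed}
    conflicts = []
    for line in nvidia_smi_output.splitlines():
        line = line.strip()
        if not line or not line.isdigit():
            continue  # skip blank lines and non-numeric fields like "[N/A]"
        pid = int(line)
        if pid not in allowed:
            conflicts.append(pid)
    return conflicts
```

A launcher enforcing the single GPU-heavy job rule would refuse to start (or wait) while this list is non-empty.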

Recent Changes (2026-03-06)

  • Added local Qwen2.5 7B/1.5B model support for offline experiments:
    • new model presets qwen25_7b_instruct_local, qwen25_1p5b_instruct_local in configs/models.json;
    • new experiment presets qwen25_7b_local_target_1p5b_local_{speculative,autojudge,topk,all_paper}_k4 and qwen25_7b_local_baseline in configs/experiments.json;
    • .gitignore now ignores models/* while keeping models/.gitkeep.
  • Added LiveCodeBench dataset loader (sp_samp/livecodebench.py):
    • load_livecodebench(path, max_samples) loads prompts from local JSONL;
    • download_livecodebench(output_path, version_tag) downloads from HF hub;
    • exported in sp_samp/__init__.py.
  • Added livecodebench as eval task in benchmark runner and CLI:
    • --eval-task livecodebench loads prompts as throughput-only EvalSample (no reference_answer).
  • Added Yandex-style report generator (scripts/report_yandex_style.py):
    • produces threshold/accuracy/speed/speedup tables as .md/.csv/.json.
  • Added local evaluation orchestration script (scripts/run_local_7b_1p5b_eval.sh):
    • runs GSM8K + LiveCodeBench sweeps (baseline, speculative, AutoJudge thresholds, Top-K grid);
    • auto-downloads LiveCodeBench dataset if missing;
    • generates Yandex-style reports for both datasets.
  • Added make local-eval target in Makefile.
  • Added datasets>=2.18.0 to requirements.txt (needed for LiveCodeBench HF hub download).
  • Added tests/test_livecodebench.py with JSONL parsing, max_samples, and key fallback tests.
  • Updated documentation: README.MD, CLAUDE.md, CODEX.MD.
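
The JSONL loading pattern with key fallback can be sketched as follows; the exact prompt keys tried by sp_samp/livecodebench.py are an assumption here:

```python
import json

# Assumed fallback order; the real loader may try different keys.
PROMPT_KEYS = ("prompt", "question", "question_content")

def load_jsonl_prompts(path, max_samples=None):
    """Read prompts from a local JSONL file, trying several record keys."""
    prompts = []
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            raw = raw.strip()
            if not raw:
                continue
            record = json.loads(raw)
            for key in PROMPT_KEYS:
                if key in record:
                    prompts.append(record[key])
                    break
            if max_samples is not None and len(prompts) >= max_samples:
                break
    return prompts
```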

Recent Changes (2026-02-21)

  • Switched paper-style default compute pair to vocabulary-compatible Qwen2.5 (0.5B -> 3B):
    • added model presets qwen25_0p5b_instruct_compat, qwen25_3b_instruct;
    • added experiment presets qwen25_3b_target_qwen25_0p5b_{speculative,autojudge,topk,specexec,all_methods,all_paper}_k4;
    • Make defaults now point to qwen25_3b_target_qwen25_0p5b_all_methods and related AutoJudge/SpecExec experiments;
    • paper runner defaults (scripts/run_autojudge_paper_eval.sh) now target the 0.5B -> 3B pair and checkpoint autojudge_qwen25_0p5b_to_3b.pt.
  • Added paper-eval manifest reliability changes:
    • moved manifest generation to the dedicated script scripts/write_run_manifest.py;
    • defaulted HF_HUB_DISABLE_XET=1 in scripts/run_autojudge_paper_eval.sh to reduce hub backend instability in long runs.
  • Kept legacy 0.5B -> 7B presets for reference; this pair fails strict speculative methods due to a model vocab-size mismatch.
  • Added paper-style benchmark extensions for Qwen2.5 (0.5B -> 7B legacy set):
    • new method topk and bundle all_paper in benchmarks/bench_speculative.py;
    • all now runs baseline + speculative + autojudge + topk + specexec.
  • Added GSM8K evaluation path in benchmark:
    • new args --eval-task and --gsm8k-eval-mode;
    • quality fields gsm8k_exact_match, gsm8k_correct, gsm8k_total in JSONL run/summary records.
  • Added Top-K controls and metrics:
    • new args --topk-rank, --topk-grid;
    • metrics topk_accept_rate, topk_rank_effective, topk_mismatches, topk_accepted_mismatches.
  • Added configs for paper-style runs:
    • configs/methods.json: topk_k4, compare_all_paper_k4;
    • configs/experiments.json: qwen25_7b_target_qwen25_0p5b_topk_k4, qwen25_7b_target_qwen25_0p5b_all_paper.
  • Updated CLI/validators for Top-K and paper fields:
    • sp_samp/cli.py accepts topk, all_paper, eval-task, gsm8k-eval-mode, topk-rank, topk-grid;
    • scripts/validate_configs.py and scripts/validate_results_jsonl.py support the new methods/fields with backward compatibility.
  • Added paper evaluation tooling:
    • scripts/run_autojudge_paper_eval.sh (preflight manifest + GSM8K download + sweep orchestration);
    • scripts/report_autojudge_paper.py (aggregates raw runs and computes speedup_vs_speculative, accuracy_delta_vs_speculative);
    • make paper-eval target.
  • Added Top-K unit tests in tests/test_topk.py.
  • Migrated AutoJudge presets/defaults to paper-aligned open-model pairing:
    • added model presets qwen25_0p5b_instruct, qwen25_7b_instruct, qwen25_32b_instruct;
    • added experiment presets qwen25_7b_target_qwen25_0p5b_{speculative,autojudge,specexec,all_methods}_k4;
    • (legacy note) these 7B/0.5B presets remain available for reference; active defaults were later moved to 3B/0.5B due to vocab compatibility.
  • Hardened AutoJudge training dataset gate in benchmarks/bench_speculative.py:
    • fail-fast with an actionable command when neither a checkpoint nor a GSM8K train dataset is provided;
    • explicit path-not-found error for a missing train dataset path;
    • explicit incompatibility error when a dataset lacks the GSM8K question/answer schema (for example, MT-Bench files).
  • Split JSONL threshold semantics with backward-compatible aliases:
    • new fields autojudge_threshold_calibrated and autojudge_threshold_used;
    • legacy autojudge_threshold_selected kept as an alias of the calibrated value;
    • legacy autojudge_threshold kept as an alias of the used value;
    • updated scripts/validate_results_jsonl.py to accept the new fields while remaining compatible with older JSONL files.
  • Replaced legacy synthetic-label AutoJudge with a paper-aligned pipeline:
    • labels mined via semi-greedy mismatch search (Algorithm 1 style) on GSM8K prompts;
    • classifier training switched to StandardScaler + LogisticRegression with threshold calibration by target recall;
    • inference integrated at the speculative verification stage (classifier called only on would-be-rejected mismatches).
  • Extended AutoJudge runtime/config interfaces:
    • new args in benchmark/CLI: autojudge_task, autojudge_train_dataset, autojudge_recall_target, autojudge_train_split, autojudge_c_grid;
    • legacy args (autojudge_train_steps, autojudge_train_batch_size, autojudge_train_lr, autojudge_audit_ratio) are parsed but ignored with a deprecation warning;
    • checkpoint format bumped (autojudge_version=2) with explicit legacy-checkpoint fallback to retraining.
  • Extended JSONL/schema metrics for AutoJudge calibration:
    • autojudge_val_auc, autojudge_val_recall, autojudge_threshold_selected in records;
    • updated strict validator scripts/validate_results_jsonl.py.
  • Added/updated tests:
    • tests/test_autojudge.py now covers GSM8K parsing/equivalence, classifier recall-target calibration, Algorithm-1-style mining behavior, and mismatch-time inference decisions.
  • Validated host + Docker GPU workflow on Ubuntu 24.04.4 with RTX 5090 (driver 590.48.01):
    • installed host toolchain/dependencies via scripts/install_dependencies.sh --gpu;
    • verified local checks/tests (make check, make test) and host CUDA visibility (torch 2.9.1+cu128, sm_120).
  • Installed Docker Engine + NVIDIA Container Toolkit and verified the GPU runtime:
    • docker run hello-world passes;
    • make docker-gpu-check and make docker-gpu-check-image pass.
  • Added Makefile smoke flexibility for host GPU validation:
    • new variable SMOKE_HF_DEVICE (default cpu);
    • new target smoke-hf-gpu as an alias to run the tiny HF smoke on CUDA.
  • Extended README operational guidance:
    • explicit Ubuntu 24.04 Docker Engine install commands;
    • explicit NVIDIA Container Toolkit setup/config commands;
    • documented the host GPU tiny smoke workflow and expected JSONL output;
    • documented one-off DOCKER_CMD="sudo docker" usage;
    • documented a short Docker load-run command for gptoss20b_target_gptoss20b_draft_k4 plus strict JSONL validation and a fallback command.
  • Captured bench run artifacts under datasets/:
    • datasets/results_smoke_hf_gpu.jsonl (validated strict);
    • datasets/results_gptoss20b_load.jsonl (validated strict; generated via the deterministic fallback preset mistral_target_mistral_draft_k4 after a GPT-OSS runtime mismatch).
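
The recall-target threshold calibration mentioned above can be illustrated with a small sketch; the StandardScaler + LogisticRegression fitting is omitted, and the function name and tie handling are illustrative:

```python
def calibrate_threshold(val_scores, val_labels, recall_target=0.95):
    """Largest decision threshold that still accepts at least `recall_target`
    of the positive validation examples (score >= threshold means accept)."""
    pos = sorted(s for s, y in zip(val_scores, val_labels) if y == 1)
    if not pos:
        return 0.5  # nothing to calibrate on; neutral fallback
    # Skipping the k lowest positive scores keeps recall >= recall_target.
    k = int((1.0 - recall_target) * len(pos))
    return pos[k]
```

Choosing the largest such threshold keeps the classifier as conservative as possible while still meeting the recall constraint on mismatch acceptance.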

Recent Changes (2026-02-13)

  • Standardized local dataset layout to datasets/:
    • added tracked directory placeholder datasets/.gitkeep;
    • Makefile defaults now use DATASET=datasets/mt_bench.jsonl and OUT=datasets/results.jsonl;
    • benchmark runner default output path now resolves to datasets/results.jsonl;
    • benchmark targets now fail fast if the dataset file does not exist (clear recovery message);
    • Docker dataset mount now resolves from an absolute host path derived from DATASET;
    • README examples updated to use datasets/ paths.
  • Updated ignore rules for local data:
    • .gitignore now ignores datasets/* while keeping datasets/.gitkeep;
    • .dockerignore now excludes datasets/ from the image build context.
  • Added CI pipeline (.github/workflows/ci.yml) on Python 3.11:
    • runs make check, make test, a toy SpecExec smoke benchmark, and JSONL schema validation.
  • Pinned dependency versions in requirements.txt and requirements-gpu.txt for reproducible installs.
  • Added .python-version with recommended runtime 3.11.
  • Improved bootstrap interpreter selection:
    • scripts/install_dependencies.sh now prefers python3.11 when available, enforces a minimum of Python 3.10, and warns when below 3.11.
  • Added benchmark result schema validation:
    • new script scripts/validate_results_jsonl.py;
    • new make validate-results target with strict mode.
  • Removed autojudge_placeholder from configs/methods.json and simplified scripts/validate_configs.py logic accordingly.
  • Fixed Docker/Torch version consistency for GPU builds:
    • Makefile exposes CUDA_BASE_IMAGE, TORCH_INDEX_URL, and TORCH_VERSION variables;
    • defaults updated for a Blackwell-safe stack (nvidia/cuda:12.8.1-cudnn-runtime-ubuntu22.04, cu128, torch==2.9.1);
    • requirements.txt torch pin updated to 2.9.1;
    • README GPU build example updated accordingly.
  • Hardened Dockerfile.gpu for CUDA runtime base images:
    • if Python/pip are missing, the image now installs python3, python3-pip, and python3-venv;
    • switched pip invocations to python3 -m pip to avoid missing-pip-binary issues.
  • Improved HF quantization compatibility in sp_samp/hf_adapter.py:
    • when a checkpoint already carries its own quantization class (for example MXFP4), the loader now falls back by dropping the conflicting BitsAndBytes override and retrying with checkpoint defaults.
  • Added early quantization-override guard in benchmarks/bench_speculative.py:
    • for native-quantized checkpoints (for example GPT-OSS), --quant/--draft-quant overrides are ignored with explicit warnings;
    • prevents Mxfp4Config vs BitsAndBytesConfig conflicts before model load.
  • Fixed HF model-loading edge case for native quantized checkpoints:
    • sp_samp/hf_adapter.py no longer forwards quantization_config=None to from_pretrained, avoiding AttributeError: 'NoneType' object has no attribute 'to_dict' in recent transformers.
  • Added Docker build recovery helpers for host-side BuildKit snapshot issues:
    • Makefile now supports a DOCKER_CMD override and a docker-build-gpu-safe fallback target (automatic retry with DOCKER_BUILDKIT=0);
    • added docker-prune-builder target to clean the Docker builder cache;
    • updated README with the snapshot/export troubleshooting flow.
  • Added Docker GPU diagnostics helpers:
    • make docker-gpu-check validates NVIDIA runtime passthrough using a clean CUDA image;
    • make docker-gpu-check-image validates torch.cuda.is_available() inside the project GPU image.
  • Updated make docker-gpu-check behavior:
    • first attempts nvidia-smi in the CUDA base image;
    • if the NVML path fails, falls back to a torch CUDA runtime check in the built project image.
  • Added explicit CUDA preflight checks for the HF benchmark path in benchmarks/bench_speculative.py:
    • when --device cuda is requested but CUDA is unavailable, the benchmark now exits early with actionable setup guidance.
  • Added explicit CUDA architecture compatibility checks for the HF benchmark path:
    • fail-fast on an unsupported torch GPU arch (for example sm_120 on old torch builds) with rebuild guidance.
  • Improved HF loader runtime error clarity in sp_samp/hf_adapter.py:
    • maps "no kernel image is available for execution on the device" to a concise torch/GPU arch compatibility hint.
  • Updated Dockerfile.gpu install order:
    • requirements are installed first;
    • the optional torch CUDA wheel override is installed after requirements so build args remain authoritative.
  • Improved local tooling reliability:
    • Makefile now prefers .venv/bin/python when present and falls back to python3;
    • make test now validates pytest availability first and emits a clear recovery message (make setup).
  • Fixed SpecExec summary metric correctness:
    • in benchmark stat aggregation, max_active_branches is now aggregated as max(...) instead of sum.
  • Replaced stale AutoJudge placeholder module export:
    • sp_samp/methods/autojudge/__init__.py now exports the actual AutoJudge APIs when dependencies are installed;
    • when torch/transformers are missing, the module now raises a clear dependency error on attribute access.
  • Expanded .dockerignore to reduce Docker context size and avoid shipping local artifacts (.venv, caches, build outputs, logs, papers/).
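
The quantization-override fallback in sp_samp/hf_adapter.py can be sketched generically; the error-matching heuristic and helper name below are assumptions, and the real loader inspects transformers-specific failures:

```python
def load_with_quant_fallback(load_fn, model_id, quantization_config=None, **kwargs):
    """Try loading with an explicit quantization override; on a quantization
    conflict (e.g. a checkpoint that ships its own MXFP4 config), retry with
    checkpoint defaults instead of the override."""
    if quantization_config is not None:
        try:
            return load_fn(model_id, quantization_config=quantization_config, **kwargs)
        except ValueError as exc:
            if "quant" not in str(exc).lower():
                raise  # unrelated failure: surface it unchanged
            # Drop the conflicting override and let the checkpoint decide.
    return load_fn(model_id, **kwargs)
```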

Recent Changes (2026-02-10)

  • Added safe host bootstrap flow:
    • new script scripts/install_dependencies.sh installs missing Ubuntu packages and updates Python deps in .venv;
    • supports optional GPU extras via --gpu;
    • blocks on EOL Ubuntu by default (can be overridden with --allow-eol-ubuntu);
    • explicitly avoids NVIDIA driver/CUDA driver package changes.
  • Added Make targets for bootstrap:
    • make setup, make setup-gpu;
    • ALLOW_EOL_UBUNTU=1 passthrough support.
  • Updated dependency baselines in requirements.txt and requirements-gpu.txt to newer minimum versions.
  • Updated README.MD with the safe installation workflow and Ubuntu EOL caveats.
  • Added benchmark reliability/runtime instrumentation in benchmarks/bench_speculative.py:
    • system metadata is written to each JSONL record (git_sha, host, GPU/driver, torch/transformers versions);
    • resume mode skips already completed runs via resume_key;
    • per-run exceptions are persisted to JSONL (status=error, traceback) without aborting the whole method.
  • Added headless execution guard:
    • new CLI/benchmark flag --require-headless;
    • Makefile support via HEADLESS=1.
  • Optimized HF SpecExec cache building in sp_samp/hf_specexec.py:
    • draft tree expansion now reuses the KV cache via prefill + step;
    • target cache fill now uses depth-wise tree traversal with KV reuse.
  • Reworked SpecExec to paper-aligned exact behavior:
    • SpecExec now samples exactly from the target distribution while using draft-tree cache prefill (no branch-selection sampling bias).
  • Added stronger SpecExec diagnostics (SpecExecError) with stage/prefix context for faster failure triage.
  • Added SpecExec distribution and exactness tests in tests/test_specexec.py:
    • empirical distribution match to target;
    • sequence equivalence to baseline for equal seeds.
  • Implemented SpecExec as a first-class method:
    • added sp_samp/specexec.py (toy/CPU) and sp_samp/hf_specexec.py (HF);
    • added SpecExecStats with branch metrics (branch_prune_rate, effective_parallelism, max_active_branches, call rates).
  • Integrated SpecExec into benchmark flow (benchmarks/bench_speculative.py):
    • new method option specexec;
    • all now runs baseline + speculative + autojudge + specexec;
    • new args --parallel-branches and --branch-prune-threshold;
    • JSONL records now include SpecExec configuration/metrics fields.
  • Integrated SpecExec into CLI (sp_samp/cli.py):
    • new subcommand specexec;
    • method presets now accept specexec;
    • passthrough for parallel_branches and branch_prune_threshold.
  • Updated configs:
    • configs/methods.json: added specexec_k4, updated compare_all_k4, removed the SpecExec placeholder preset;
    • configs/experiments.json: added SpecExec experiment presets for Llama3/Mistral/GPT-OSS;
    • configs/method_templates.json: aligned SpecExec template metrics with actual output.
  • Updated automation entrypoints:
    • Makefile: added SPECEXEC_EXPERIMENT, specexec, and docker-specexec targets; refreshed the bench-all description.
  • Added unit tests for SpecExec in tests/test_specexec.py.
  • Updated README.MD with SpecExec usage examples and metrics documentation.
  • Added scripts/validate_configs.py and integrated it into make check and make validate-configs.
  • Added make smoke-hf for quick end-to-end HF pipeline verification with a tiny model.
  • Added TARGET_PRESET/DRAFT_PRESET variables in Makefile for bench-method to avoid hardcoded mismatched pairs.
  • Fixed a benchmark import bug by explicitly importing JudgeMLP in benchmarks/bench_speculative.py.
  • Fixed the AutoJudge run determinism path by forwarding the per-run seed into autojudge_sample_hf.
  • Improved HF architecture in benchmarks/bench_speculative.py:
    • separated target/draft runtime settings (draft_tokenizer, draft_device, draft_dtype, draft_quant, draft_bnb_compute_dtype);
    • stopped forcing the draft to use the target tokenizer path;
    • added a tokenizer compatibility guard for speculative/AutoJudge methods;
    • avoided unnecessary draft-model construction for baseline-only runs.
  • Expanded benchmark JSONL records with resolved target/draft runtime fields for reproducibility.
  • Expanded CLI run args in sp_samp/cli.py to pass draft-specific overrides.
  • Added draft-preset mapping in sp_samp/cli.py to avoid target-arg overwrites.
  • Updated configs/experiments.json to correctness-safe target/draft pairings (identical tokenizer presets).
  • Updated Makefile defaults to valid experiment IDs (llama3_all_methods, llama3_target_llama3_autojudge_k4).
  • Updated README.MD with config validation and HF smoke targets, and refreshed experiment examples.
  • Updated README.MD HF command examples to tokenizer-compatible target/draft pairs.
  • Added preset configs for models and methods in configs/.
  • Added experiment pair presets in configs/experiments.json.
  • Added AutoJudge/SpecExec config templates in configs/method_templates.json.
  • Added CLI runner in sp_samp/cli.py.
  • Added CLI support for --experiment presets.
  • Exposed benchmark parser/run for programmatic use.
  • Added benchmarks/__init__.py for an importable benchmark module.
  • Added CODEX.MD and README.MD.
  • Expanded README with a full from-zero setup guide.
  • Fixed GPU requirements to avoid reinstalling CPU-only torch.
  • Implemented sp_samp/autojudge.py with:
    • synthetic judge-label collection from target/draft models;
    • MLP judge classifier training;
    • AutoJudge decoding loop with threshold decisions and fallback-to-target.
  • Integrated AutoJudge into benchmarks/bench_speculative.py:
    • new methods: autojudge, all;
    • AutoJudge train/inference args;
    • method-specific JSONL metrics.
  • Updated CLI for method selection:
    • bench --method autojudge|all;
    • the autojudge command now routes to the benchmark runner.
  • Updated configs:
    • new method presets autojudge_k4, compare_all_k4;
    • new experiment presets for AutoJudge and all-method comparisons.
  • Added tests in tests/test_autojudge.py.
  • Updated README.MD with end-to-end usage for AutoJudge and all-method runs.
  • Added Makefile with:
    • make help, make check, make list-presets;
    • quick local run (make bench-toy);
    • preset runs (make bench, make autojudge, make bench-all);
    • Docker runs (make docker-build, make docker-build-gpu, make docker-bench, make docker-autojudge, make docker-bench-all).
  • Simplified sp_samp/cli.py:
    • removed duplicated parser arguments via a shared helper;
    • switched to lazy benchmark import so list-presets works without ML dependencies;
    • expanded passthrough arguments for toy/HF parity (vocab_size, draft_noise, seed).
  • Hardened package import behavior in sp_samp/__init__.py:
    • moved HF/AutoJudge exports to optional imports, preserving lightweight commands without torch.
  • Improved benchmarks/bench_speculative.py portability:
    • supports toy-mode execution without torch;
    • emits clear dependency errors only for HF/AutoJudge paths;
    • changed the toy default vocab_size to 2048 (from 32000) to avoid pathological memory/time in RandomModel;
    • added fail-fast validation for autojudge without HF model arguments.
  • Standardized benchmark invocations to module mode (python -m benchmarks.bench_speculative) in docs and Makefile.
  • Tuned make bench-toy defaults for fast smoke runs.
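
The resume mode described above can be sketched as scanning the existing JSONL output for completed runs; the field names follow the record fields mentioned in this section (resume_key, status), but the helper itself is illustrative:

```python
import json

def completed_resume_keys(out_path):
    """Collect resume keys of successfully finished runs from an existing
    JSONL output file, so those runs can be skipped on restart."""
    done = set()
    try:
        fh = open(out_path, encoding="utf-8")
    except FileNotFoundError:
        return done  # nothing written yet: nothing to skip
    with fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a truncated trailing line from an interrupted run
            if record.get("status") != "error" and "resume_key" in record:
                done.add(record["resume_key"])
    return done
```

Error records are deliberately not counted as done, so failed runs are retried on resume.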