CODEX

This file tracks project structure and changes. Update it whenever new files, features, or behaviors are added.

Purpose

  • Provide a quick map of the project layout.
  • Record recent changes for future requests.

Structure

  • Makefile: Unified entrypoints for checks, preset listing, local benchmarks, AutoJudge runs, and Docker CPU/GPU flows.
  • benchmarks/bench_speculative.py: Benchmark runner for baseline, speculative, AutoJudge, Top-K, and SpecExec methods. Supports HF models, KV cache, quantization, GSM8K/MT-Bench eval modes, resume mode, enriched JSONL logging, and per-run error persistence.
  • jointadaspec/: JointAdaSpec implementation with MDP state/action spaces, trace collection, sparse value iteration, inference-time policy lookup, baseline decoders, metrics, and utility loaders.
  • sp_samp/hf_topk.py: HF Top-K lossy verification baseline and runtime stats.
  • configs/datasets/: Hydra dataset configs for the jointadaspec/ pipeline (gsm8k, livecodebench, mtbench).
  • configs/model_pairs/: Hydra model-pair configs for JointAdaSpec experiments, including the ungated Qwen2.5 7B -> 1.5B pair and the gated Llama placeholder pair.
  • configs/experiments/: Hydra experiment configs for JointAdaSpec defaults, sweeps, and dataset-specific profiles.
  • configs/mdp_default.yaml: Central JointAdaSpec MDP hyperparameters (grid, thresholds, reward weights, VI controls).
  • configs/models.json: Model presets (target/draft HF models, device, dtype, quantization, tokenizer, chat template).
  • configs/methods.json: Method presets (baseline/speculative/autojudge/topk/specexec/all/all_paper).
  • configs/experiments.json: Target/draft pairing presets for common comparisons.
  • configs/method_templates.json: Templates for AutoJudge and SpecExec configs with parameters/metrics.
  • scripts/01_collect_traces.py: JointAdaSpec stage 1, exploratory one-step trace collection into Parquet.
  • scripts/02_solve_mdp.py: JointAdaSpec stage 2, smoothed MDP estimation and sparse value iteration over a kappa sweep.
  • scripts/03_benchmark.py: JointAdaSpec stage 3, benchmark runner for jointadaspec, vanilla_ar, fixed_sd, fuzzy_sd, and specdecpp.
  • scripts/run_jointadaspec_qwen_longrun.sh: Staged orchestration script for the ungated Qwen2.5 7B -> 1.5B JointAdaSpec profile on GSM8K + LiveCodeBench.
  • scripts/validate_configs.py: Static validator for configs (cross-file references, method constraints, tokenizer compatibility for target/draft pair methods).
  • scripts/validate_results_jsonl.py: JSONL schema/type validator for benchmark outputs (strict mode supported).
  • scripts/run_autojudge_paper_eval.sh: Orchestrates paper-style GSM8K sweeps (AutoJudge thresholds + Top-K grid) for Qwen2.5.
  • scripts/write_run_manifest.py: Writes environment manifest JSON for run reproducibility (used by paper-eval runner).
  • scripts/report_autojudge_paper.py: Aggregates raw JSONL into .md/.csv/.json report artifacts with derived speed/accuracy deltas.
  • scripts/report_yandex_style.py: Generates Yandex-style threshold/accuracy/speedup report tables (.md/.csv/.json).
  • scripts/run_local_7b_1p5b_eval.sh: Orchestrates local Qwen2.5 7B/1.5B GSM8K + LiveCodeBench evaluation.
  • scripts/run_autojudge_topk_gsm8k_bg.sh: Long-run orchestration for local Qwen2.5 7B/1.5B GSM8K AutoJudge-threshold + Top-K sweeps with GPU preflight wait gate and resume-friendly append output.
  • scripts/install_dependencies.sh: Safe host bootstrap script (idempotent apt/pip install, optional GPU extras, EOL Ubuntu guard, never touches NVIDIA driver packages).
  • .github/workflows/ci.yml: GitHub Actions CI pipeline (checks, tests, toy SpecExec smoke run, result-schema validation).
  • .python-version: Recommended local Python runtime version.
  • datasets/.gitkeep: Repository placeholder for local dataset directory (datasets/).
  • README.MD: Project overview, usage, and examples.
  • CONTRIBUTING.md: Contributor workflow, checks, and benchmark artifact policy.
  • CODE_OF_CONDUCT.md: Collaboration and community behavior expectations.
  • .github/pull_request_template.md: Pull request template with validation checklist.
  • .github/ISSUE_TEMPLATE/: Structured issue templates for bugs and feature requests.
  • docs/RESULTS.md: Curated benchmark snapshot and reproducibility commands.
  • docs/ROADMAP.md: Near-term and mid-term engineering roadmap.
  • docs/GITHUB_SETUP.md: Practical checklist for GitHub About/topics/social preview setup.
  • reports/jointadaspec_qwen_7b_1p5b_2026-04-14.md: Curated markdown report for the first real JointAdaSpec Qwen 7B -> 1.5B run.
  • sp_samp/autojudge.py: Paper-aligned AutoJudge implementation (Algorithm 1 token mining, LogisticRegression training, and verification-stage decoding override).
  • sp_samp/gsm8k.py: GSM8K utilities (dataset loading, final-answer extraction, and equivalence checks).
  • sp_samp/specexec.py: CPU SpecExec implementation (exact target sampling with draft-tree cache prefill and pruning).
  • sp_samp/hf_specexec.py: HF SpecExec implementation with KV-cache reuse along prefix-tree edges.
  • sp_samp/cli.py: Unified CLI runner (bench, autojudge, specexec, preset application).
  • sp_samp/__init__.py: Public exports with optional lazy behavior for torch-dependent modules.
  • sp_samp/hf_adapter.py: HF model adapter with KV cache and optional bitsandbytes quantization.
  • sp_samp/hf_sampling.py: HF sampling (baseline + speculative) using KV cache.
  • sp_samp/sampling.py: Reference CPU implementations + metrics.
  • sp_samp/mtbench.py: MT-Bench loader.
  • sp_samp/livecodebench.py: LiveCodeBench loader and HF hub downloader.
  • sp_samp/methods/: Method-facing exports (including SpecExec export).
  • tests/test_sampling.py: Core correctness tests for baseline/speculative sampling.
  • tests/test_autojudge.py: Unit tests for AutoJudge features, classifier fitting, and stats.
  • tests/test_specexec.py: Unit tests for SpecExec behavior and stats.
  • tests/test_topk.py: Unit tests for Top-K mismatch acceptance/rejection behavior.
  • tests/test_livecodebench.py: Unit tests for LiveCodeBench loader.
  • tests/test_features.py, tests/test_verification.py, tests/test_mdp_solver.py, tests/test_inference.py, tests/test_end_to_end.py: JointAdaSpec unit and toy end-to-end coverage.
  • Dockerfile, Dockerfile.gpu: CPU/GPU containers.
  • requirements.txt, requirements-gpu.txt: Dependencies.
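
As an illustration of the kind of cross-file check scripts/validate_configs.py performs, here is a minimal hypothetical sketch; the field names (`target`, `draft`, `method`) and the preset schema are assumptions, not the project's actual API:

```python
def validate_cross_refs(models: dict, methods: dict, experiments: dict) -> list:
    """Return human-readable errors for experiment presets that reference
    model or method preset IDs missing from the other config files."""
    errors = []
    for exp_id, exp in experiments.items():
        for role in ("target", "draft"):  # assumed field names
            model_id = exp.get(role)
            if model_id is not None and model_id not in models:
                errors.append(f"{exp_id}: unknown {role} model preset {model_id!r}")
        method_id = exp.get("method")
        if method_id is not None and method_id not in methods:
            errors.append(f"{exp_id}: unknown method preset {method_id!r}")
    return errors
```

The real validator also checks method constraints and tokenizer compatibility; this sketch only shows the reference-resolution idea.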

Recent Changes (2026-04-14)

  • Added the first complete JointAdaSpec implementation:
    • new jointadaspec/ package with core verification logic, feature extraction, tabular MDP spaces, trace collection, sparse MDP estimation, value iteration, inference-time policy lookup, baseline decoders, metrics, and utility loaders;
    • new stage-based scripts scripts/01_collect_traces.py, scripts/02_solve_mdp.py, scripts/03_benchmark.py;
    • new Hydra configs under configs/datasets/, configs/model_pairs/, configs/experiments/, plus configs/mdp_default.yaml.
  • Added JointAdaSpec documentation and experiment workflow:
    • updated README.MD and docs/RESULTS.md with JointAdaSpec positioning and result snapshots;
    • added reports/jointadaspec_qwen_7b_1p5b_2026-04-14.md;
    • added scripts/run_jointadaspec_qwen_longrun.sh for reproducible Qwen 7B -> 1.5B runs.
  • Added JointAdaSpec test coverage:
    • feature, verification, solver, inference, and toy end-to-end tests under tests/test_*.
  • Ran the first real JointAdaSpec HF profile on the ungated Qwen2.5 7B -> 1.5B pair:
    • 3000 traces, 479880 one-step transitions;
    • saved policy sweep for kappa = 0.0, 0.5, 1.0, 2.0, 5.0;
    • benchmarked on GSM8K and LiveCodeBench with tracked outputs under outputs/jointadaspec_qwen_2026-04-14/.
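
The sparse value-iteration stage (scripts/02_solve_mdp.py) can be pictured with a minimal tabular sketch. The toy MDP and function name below are illustrative only, not the project's API; in the real pipeline, kappa would enter through the reward terms of the estimated MDP:

```python
def sparse_value_iteration(transitions, rewards, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Tabular value iteration over a sparse transition model.

    transitions[s][a] is a list of (next_state, prob) pairs;
    rewards[(s, a)] is the immediate reward for taking action a in state s.
    """
    values = {s: 0.0 for s in transitions}
    for _ in range(max_iters):
        delta = 0.0
        for s, actions in transitions.items():
            # Bellman optimality backup over only the stored (sparse) successors.
            best = max(
                rewards[(s, a)] + gamma * sum(p * values[ns] for ns, p in succ)
                for a, succ in actions.items()
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # Greedy policy extraction from the converged value function.
    policy = {
        s: max(
            actions,
            key=lambda a: rewards[(s, a)] + gamma * sum(p * values[ns] for ns, p in actions[a]),
        )
        for s, actions in transitions.items()
    }
    return values, policy
```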

Recent Changes (2026-04-01)

  • GitHub-facing presentation pass for external reviewers:
    • rewrote README.MD as a repository landing page (positioning, method matrix, benchmark snapshot, quickstart);
    • added contributor-facing docs: CONTRIBUTING.md, CODE_OF_CONDUCT.md, docs/RESULTS.md, docs/ROADMAP.md;
    • added GitHub collaboration templates: .github/pull_request_template.md and .github/ISSUE_TEMPLATE/*;
    • added maintainer checklist for repository presentation: docs/GITHUB_SETUP.md.

Recent Changes (2026-03-28)

  • Documentation refresh for day-to-day work:
    • README.MD reduced to a compact, task-first guide (quick start, key commands, and minimal constraints).
    • CLAUDE.md updated with explicit local Llama eval command and paper-aligned AutoJudge C-grid policy (1e-7..1e0, 8 values).
  • Long-run operations guidance now uses PYTORCH_ALLOC_CONF=expandable_segments:True (replaces deprecated PYTORCH_CUDA_ALLOC_CONF).
  • Added practical monitoring command set for long Llama/GSM8K/LiveCodeBench sweeps (log tail, GPU telemetry, output growth checks).
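
A sketch of how such an allocator default might be applied defensively before importing torch; the helper name is invented here, and the project may simply export the variable in its shell scripts instead:

```python
import os

def ensure_alloc_conf(env=None):
    """Default to the current allocator knob, but never override a value the
    caller already set (including the deprecated PYTORCH_CUDA_ALLOC_CONF)."""
    env = os.environ if env is None else env
    if "PYTORCH_ALLOC_CONF" not in env and "PYTORCH_CUDA_ALLOC_CONF" not in env:
        env["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"
    return env.get("PYTORCH_ALLOC_CONF") or env.get("PYTORCH_CUDA_ALLOC_CONF")
```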

Recent Changes (2026-03-10)

  • Documentation-only operational update (no API/algorithm changes):
    • CLAUDE.md now includes a canonical 24-48h runbook using tmux + staged AutoJudge profile.
    • Stage A (checkpoint bootstrap) and Stage B (main AutoJudge + Top-K sweep) are documented as separate steps to avoid mixing bootstrap metrics into final JSONL.
  • Added standardized long-run operations guidance in docs:
    • GPU preflight checks (nvidia-smi, conflicting PID detection).
    • session lifecycle (tmux start/detach/reattach), monitoring commands, and post-run validation/report commands.
    • stop/recovery command set for process termination and GPU cleanup.
  • Documented operational constraints for stability:
    • single GPU-heavy job rule;
    • resume behavior via stable OUT_GSM8K output file;
    • checkpoint-first flow for AutoJudge (datasets/autojudge_qwen25_1p5b_to_7b.pt);
    • OOM mitigation hints (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, optional quantized run flags).
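
The conflicting-PID preflight can be sketched as a pure parsing helper over the text emitted by `nvidia-smi --query-compute-apps=pid --format=csv,noheader`; the function name and allowed-set handling are illustrative assumptions:

```python
def conflicting_gpu_pids(nvidia_smi_output: str, allowed=()):
    """Return PIDs of GPU compute processes not in the allowed set, given the
    output of: nvidia-smi --query-compute-apps=pid --format=csv,noheader"""
    allowed = {int(p) for p in allowed}
    conflicts = []
    for line in nvidia_smi_output.splitlines():
        line = line.strip()
        if not line or not line.isdigit():
            continue  # skip blank lines and non-numeric fields like "[N/A]"
        pid = int(line)
        if pid not in allowed:
            conflicts.append(pid)
    return conflicts
```

A launcher enforcing the single GPU-heavy job rule would refuse to start (or wait) while this list is non-empty.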

Recent Changes (2026-03-06)

  • Added local Qwen2.5 7B/1.5B model support for offline experiments:
    • new model presets qwen25_7b_instruct_local, qwen25_1p5b_instruct_local in configs/models.json;
    • new experiment presets qwen25_7b_local_target_1p5b_local_{speculative,autojudge,topk,all_paper}_k4 and qwen25_7b_local_baseline in configs/experiments.json;
    • .gitignore now ignores models/* while keeping models/.gitkeep.
  • Added LiveCodeBench dataset loader (sp_samp/livecodebench.py):
    • load_livecodebench(path, max_samples) loads prompts from local JSONL;
    • download_livecodebench(output_path, version_tag) downloads from HF hub;
    • exported in sp_samp/__init__.py.
  • Added livecodebench as eval task in benchmark runner and CLI:
    • --eval-task livecodebench loads prompts as throughput-only EvalSample (no reference_answer).
  • Added Yandex-style report generator (scripts/report_yandex_style.py):
    • produces threshold/accuracy/speed/speedup tables as .md/.csv/.json.
  • Added local evaluation orchestration script (scripts/run_local_7b_1p5b_eval.sh):
    • runs GSM8K + LiveCodeBench sweeps (baseline, speculative, AutoJudge thresholds, Top-K grid);
    • auto-downloads LiveCodeBench dataset if missing;
    • generates Yandex-style reports for both datasets.
  • Added make local-eval target in Makefile.
  • Added datasets>=2.18.0 to requirements.txt (needed for LiveCodeBench HF hub download).
  • Added tests/test_livecodebench.py with JSONL parsing, max_samples, and key fallback tests.
  • Updated documentation: README.MD, CLAUDE.md, CODEX.MD.
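
The JSONL loading pattern with key fallback can be sketched as follows; the exact prompt keys tried by sp_samp/livecodebench.py are an assumption here:

```python
import json

# Assumed fallback order; the real loader may try different keys.
PROMPT_KEYS = ("prompt", "question", "question_content")

def load_jsonl_prompts(path, max_samples=None):
    """Read prompts from a local JSONL file, trying several record keys."""
    prompts = []
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            raw = raw.strip()
            if not raw:
                continue
            record = json.loads(raw)
            for key in PROMPT_KEYS:
                if key in record:
                    prompts.append(record[key])
                    break
            if max_samples is not None and len(prompts) >= max_samples:
                break
    return prompts
```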

Recent Changes (2026-02-21)

  • Switched paper-style default compute pair to vocabulary-compatible Qwen2.5 (0.5B -> 3B):
    • added model presets qwen25_0p5b_instruct_compat, qwen25_3b_instruct;
    • added experiment presets qwen25_3b_target_qwen25_0p5b_{speculative,autojudge,topk,specexec,all_methods,all_paper}_k4;
    • Make defaults now point to qwen25_3b_target_qwen25_0p5b_all_methods and related AutoJudge/SpecExec experiments;
    • paper runner defaults (scripts/run_autojudge_paper_eval.sh) now target the 0.5B -> 3B pair and checkpoint autojudge_qwen25_0p5b_to_3b.pt.
  • Added paper-eval manifest reliability changes:
    • moved manifest generation to the dedicated script scripts/write_run_manifest.py;
    • defaulted HF_HUB_DISABLE_XET=1 in scripts/run_autojudge_paper_eval.sh to reduce hub backend instability in long runs.
  • Kept legacy 0.5B -> 7B presets for reference; this pair fails strict speculative methods due to a model vocab-size mismatch.
  • Added paper-style benchmark extensions for Qwen2.5 (0.5B -> 7B legacy set):
    • new method topk and bundle all_paper in benchmarks/bench_speculative.py;
    • all now runs baseline + speculative + autojudge + topk + specexec.
  • Added GSM8K evaluation path in benchmark:
    • new args --eval-task and --gsm8k-eval-mode;
    • quality fields gsm8k_exact_match, gsm8k_correct, gsm8k_total in JSONL run/summary records.
  • Added Top-K controls and metrics:
    • new args --topk-rank, --topk-grid;
    • metrics topk_accept_rate, topk_rank_effective, topk_mismatches, topk_accepted_mismatches.
  • Added configs for paper-style runs:
    • configs/methods.json: topk_k4, compare_all_paper_k4;
    • configs/experiments.json: qwen25_7b_target_qwen25_0p5b_topk_k4, qwen25_7b_target_qwen25_0p5b_all_paper.
  • Updated CLI/validators for Top-K and paper fields:
    • sp_samp/cli.py accepts topk, all_paper, eval-task, gsm8k-eval-mode, topk-rank, topk-grid;
    • scripts/validate_configs.py and scripts/validate_results_jsonl.py support the new methods/fields with backward compatibility.
  • Added paper evaluation tooling:
    • scripts/run_autojudge_paper_eval.sh (preflight manifest + GSM8K download + sweep orchestration);
    • scripts/report_autojudge_paper.py (aggregates raw runs and computes speedup_vs_speculative, accuracy_delta_vs_speculative);
    • make paper-eval target.
  • Added Top-K unit tests in tests/test_topk.py.
  • Migrated AutoJudge presets/defaults to paper-aligned open-model pairing:
    • added model presets qwen25_0p5b_instruct, qwen25_7b_instruct, qwen25_32b_instruct;
    • added experiment presets qwen25_7b_target_qwen25_0p5b_{speculative,autojudge,specexec,all_methods}_k4;
    • (legacy note) these 7B/0.5B presets remain available for reference; active defaults were later moved to 3B/0.5B due to vocab compatibility.
  • Hardened AutoJudge training dataset gate in benchmarks/bench_speculative.py:
    • fail-fast with an actionable command when neither a checkpoint nor a GSM8K train dataset is provided;
    • explicit path-not-found error for a missing train dataset path;
    • explicit incompatibility error when a dataset lacks the GSM8K question/answer schema (for example, MT-Bench files).
  • Split JSONL threshold semantics with backward-compatible aliases:
    • new fields autojudge_threshold_calibrated and autojudge_threshold_used;
    • legacy autojudge_threshold_selected kept as an alias of the calibrated value;
    • legacy autojudge_threshold kept as an alias of the used value;
    • updated scripts/validate_results_jsonl.py to accept the new fields while remaining compatible with older JSONL files.
  • Replaced legacy synthetic-label AutoJudge with a paper-aligned pipeline:
    • labels mined via semi-greedy mismatch search (Algorithm 1 style) on GSM8K prompts;
    • classifier training switched to StandardScaler + LogisticRegression with threshold calibration by target recall;
    • inference integrated at the speculative verification stage (classifier called only on would-be-rejected mismatches).
  • Extended AutoJudge runtime/config interfaces:
    • new args in benchmark/CLI: autojudge_task, autojudge_train_dataset, autojudge_recall_target, autojudge_train_split, autojudge_c_grid;
    • legacy args (autojudge_train_steps, autojudge_train_batch_size, autojudge_train_lr, autojudge_audit_ratio) are parsed but ignored with a deprecation warning;
    • checkpoint format bumped (autojudge_version=2) with explicit legacy-checkpoint fallback to retraining.
  • Extended JSONL/schema metrics for AutoJudge calibration:
    • autojudge_val_auc, autojudge_val_recall, autojudge_threshold_selected in records;
    • updated strict validator scripts/validate_results_jsonl.py.
  • Added/updated tests:
    • tests/test_autojudge.py now covers GSM8K parsing/equivalence, classifier recall-target calibration, Algorithm-1-style mining behavior, and mismatch-time inference decisions.
  • Validated host + Docker GPU workflow on Ubuntu 24.04.4 with RTX 5090 (driver 590.48.01):
    • installed host toolchain/dependencies via scripts/install_dependencies.sh --gpu;
    • verified local checks/tests (make check, make test) and host CUDA visibility (torch 2.9.1+cu128, sm_120).
  • Installed Docker Engine + NVIDIA Container Toolkit and verified the GPU runtime:
    • docker run hello-world passes;
    • make docker-gpu-check and make docker-gpu-check-image pass.
  • Added Makefile smoke flexibility for host GPU validation:
    • new variable SMOKE_HF_DEVICE (default cpu);
    • new target smoke-hf-gpu as an alias to run the tiny HF smoke on CUDA.
  • Extended README operational guidance:
    • explicit Ubuntu 24.04 Docker Engine install commands;
    • explicit NVIDIA Container Toolkit setup/config commands;
    • documented the host GPU tiny smoke workflow and expected JSONL output;
    • documented one-off DOCKER_CMD="sudo docker" usage;
    • documented a short Docker load-run command for gptoss20b_target_gptoss20b_draft_k4 plus strict JSONL validation and a fallback command.
  • Captured bench run artifacts under datasets/:
    • datasets/results_smoke_hf_gpu.jsonl (validated strict);
    • datasets/results_gptoss20b_load.jsonl (validated strict; generated via the deterministic fallback preset mistral_target_mistral_draft_k4 after a GPT-OSS runtime mismatch).
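
The recall-target threshold calibration mentioned above can be illustrated with a small sketch; the StandardScaler + LogisticRegression fitting is omitted, and the function name and tie handling are illustrative:

```python
def calibrate_threshold(val_scores, val_labels, recall_target=0.95):
    """Largest decision threshold that still accepts at least `recall_target`
    of the positive validation examples (score >= threshold means accept)."""
    pos = sorted(s for s, y in zip(val_scores, val_labels) if y == 1)
    if not pos:
        return 0.5  # nothing to calibrate on; neutral fallback
    # Skipping the k lowest positive scores keeps recall >= recall_target.
    k = int((1.0 - recall_target) * len(pos))
    return pos[k]
```

Choosing the largest such threshold keeps the classifier as conservative as possible while still meeting the recall constraint on mismatch acceptance.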

Recent Changes (2026-02-13)

  • Standardized local dataset layout to datasets/:
    • added tracked directory placeholder datasets/.gitkeep;
    • Makefile defaults now use DATASET=datasets/mt_bench.jsonl and OUT=datasets/results.jsonl;
    • benchmark runner default output path now resolves to datasets/results.jsonl;
    • benchmark targets now fail fast if the dataset file does not exist (clear recovery message);
    • Docker dataset mount now resolves from an absolute host path derived from DATASET;
    • README examples updated to use datasets/ paths.
  • Updated ignore rules for local data:
    • .gitignore now ignores datasets/* while keeping datasets/.gitkeep;
    • .dockerignore now excludes datasets/ from the image build context.
  • Added CI pipeline (.github/workflows/ci.yml) on Python 3.11:
    • runs make check, make test, a toy SpecExec smoke benchmark, and JSONL schema validation.
  • Pinned dependency versions in requirements.txt and requirements-gpu.txt for reproducible installs.
  • Added .python-version with recommended runtime 3.11.
  • Improved bootstrap interpreter selection:
    • scripts/install_dependencies.sh now prefers python3.11 when available, enforces a minimum of Python 3.10, and warns when below 3.11.
  • Added benchmark result schema validation:
    • new script scripts/validate_results_jsonl.py;
    • new make validate-results target with strict mode.
  • Removed autojudge_placeholder from configs/methods.json and simplified scripts/validate_configs.py logic accordingly.
  • Fixed Docker/Torch version consistency for GPU builds:
    • Makefile exposes CUDA_BASE_IMAGE, TORCH_INDEX_URL, and TORCH_VERSION variables;
    • defaults updated for a Blackwell-safe stack (nvidia/cuda:12.8.1-cudnn-runtime-ubuntu22.04, cu128, torch==2.9.1);
    • requirements.txt torch pin updated to 2.9.1;
    • README GPU build example updated accordingly.
  • Hardened Dockerfile.gpu for CUDA runtime base images:
    • if Python/pip are missing, the image now installs python3, python3-pip, and python3-venv;
    • switched pip invocations to python3 -m pip to avoid missing-pip-binary issues.
  • Improved HF quantization compatibility in sp_samp/hf_adapter.py:
    • when a checkpoint already carries its own quantization class (for example MXFP4), the loader now falls back by dropping the conflicting BitsAndBytes override and retrying with checkpoint defaults.
  • Added early quantization-override guard in benchmarks/bench_speculative.py:
    • for native-quantized checkpoints (for example GPT-OSS), --quant/--draft-quant overrides are ignored with explicit warnings;
    • prevents Mxfp4Config vs BitsAndBytesConfig conflicts before model load.
  • Fixed HF model-loading edge case for native quantized checkpoints:
    • sp_samp/hf_adapter.py no longer forwards quantization_config=None to from_pretrained, avoiding AttributeError: 'NoneType' object has no attribute 'to_dict' in recent transformers.
  • Added Docker build recovery helpers for host-side BuildKit snapshot issues:
    • Makefile now supports a DOCKER_CMD override and a docker-build-gpu-safe fallback target (automatic retry with DOCKER_BUILDKIT=0);
    • added docker-prune-builder target to clean the Docker builder cache;
    • updated README with the snapshot/export troubleshooting flow.
  • Added Docker GPU diagnostics helpers:
    • make docker-gpu-check validates NVIDIA runtime passthrough using a clean CUDA image;
    • make docker-gpu-check-image validates torch.cuda.is_available() inside the project GPU image.
  • Updated make docker-gpu-check behavior:
    • first attempts nvidia-smi in the CUDA base image;
    • if the NVML path fails, falls back to a torch CUDA runtime check in the built project image.
  • Added explicit CUDA preflight checks for the HF benchmark path in benchmarks/bench_speculative.py:
    • when --device cuda is requested but CUDA is unavailable, the benchmark now exits early with actionable setup guidance.
  • Added explicit CUDA architecture compatibility checks for the HF benchmark path:
    • fail-fast on an unsupported torch GPU arch (for example sm_120 on old torch builds) with rebuild guidance.
  • Improved HF loader runtime error clarity in sp_samp/hf_adapter.py:
    • maps "no kernel image is available for execution on the device" to a concise torch/GPU arch compatibility hint.
  • Updated Dockerfile.gpu install order:
    • requirements are installed first;
    • the optional torch CUDA wheel override is installed after requirements so build args remain authoritative.
  • Improved local tooling reliability:
    • Makefile now prefers .venv/bin/python when present and falls back to python3;
    • make test now validates pytest availability first and emits a clear recovery message (make setup).
  • Fixed SpecExec summary metric correctness:
    • in benchmark stat aggregation, max_active_branches is now aggregated as max(...) instead of sum.
  • Replaced stale AutoJudge placeholder module export:
    • sp_samp/methods/autojudge/__init__.py now exports the actual AutoJudge APIs when dependencies are installed;
    • when torch/transformers are missing, the module now raises a clear dependency error on attribute access.
  • Expanded .dockerignore to reduce Docker context size and avoid shipping local artifacts (.venv, caches, build outputs, logs, papers/).
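
The quantization-override fallback in sp_samp/hf_adapter.py can be sketched generically; the error-matching heuristic and helper name below are assumptions, and the real loader inspects transformers-specific failures:

```python
def load_with_quant_fallback(load_fn, model_id, quantization_config=None, **kwargs):
    """Try loading with an explicit quantization override; on a quantization
    conflict (e.g. a checkpoint that ships its own MXFP4 config), retry with
    checkpoint defaults instead of the override."""
    if quantization_config is not None:
        try:
            return load_fn(model_id, quantization_config=quantization_config, **kwargs)
        except ValueError as exc:
            if "quant" not in str(exc).lower():
                raise  # unrelated failure: surface it unchanged
            # Drop the conflicting override and let the checkpoint decide.
    return load_fn(model_id, **kwargs)
```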

Recent Changes (2026-02-10)

  • Added safe host bootstrap flow:
    • new script scripts/install_dependencies.sh installs missing Ubuntu packages and updates Python deps in .venv;
    • supports optional GPU extras via --gpu;
    • blocks on EOL Ubuntu by default (can be overridden with --allow-eol-ubuntu);
    • explicitly avoids NVIDIA driver/CUDA driver package changes.
  • Added Make targets for bootstrap:
    • make setup, make setup-gpu;
    • ALLOW_EOL_UBUNTU=1 passthrough support.
  • Updated dependency baselines in requirements.txt and requirements-gpu.txt to newer minimum versions.
  • Updated README.MD with the safe installation workflow and Ubuntu EOL caveats.
  • Added benchmark reliability/runtime instrumentation in benchmarks/bench_speculative.py:
    • system metadata is written to each JSONL record (git_sha, host, GPU/driver, torch/transformers versions);
    • resume mode skips already completed runs via resume_key;
    • per-run exceptions are persisted to JSONL (status=error, traceback) without aborting the whole method.
  • Added headless execution guard:
    • new CLI/benchmark flag --require-headless;
    • Makefile support via HEADLESS=1.
  • Optimized HF SpecExec cache building in sp_samp/hf_specexec.py:
    • draft tree expansion now reuses the KV cache via prefill + step;
    • target cache fill now uses depth-wise tree traversal with KV reuse.
  • Reworked SpecExec to paper-aligned exact behavior:
    • SpecExec now samples exactly from the target distribution while using draft-tree cache prefill (no branch-selection sampling bias).
  • Added stronger SpecExec diagnostics (SpecExecError) with stage/prefix context for faster failure triage.
  • Added SpecExec distribution and exactness tests in tests/test_specexec.py:
    • empirical distribution match to target;
    • sequence equivalence to baseline for equal seeds.
  • Implemented SpecExec as a first-class method:
    • added sp_samp/specexec.py (toy/CPU) and sp_samp/hf_specexec.py (HF);
    • added SpecExecStats with branch metrics (branch_prune_rate, effective_parallelism, max_active_branches, call rates).
  • Integrated SpecExec into benchmark flow (benchmarks/bench_speculative.py):
    • new method option specexec;
    • all now runs baseline + speculative + autojudge + specexec;
    • new args --parallel-branches and --branch-prune-threshold;
    • JSONL records now include SpecExec configuration/metrics fields.
  • Integrated SpecExec into CLI (sp_samp/cli.py):
    • new subcommand specexec;
    • method presets now accept specexec;
    • passthrough for parallel_branches and branch_prune_threshold.
  • Updated configs:
    • configs/methods.json: added specexec_k4, updated compare_all_k4, removed the SpecExec placeholder preset;
    • configs/experiments.json: added SpecExec experiment presets for Llama3/Mistral/GPT-OSS;
    • configs/method_templates.json: aligned SpecExec template metrics with actual output.
  • Updated automation entrypoints:
    • Makefile: added SPECEXEC_EXPERIMENT, specexec, and docker-specexec targets; refreshed the bench-all description.
  • Added unit tests for SpecExec in tests/test_specexec.py.
  • Updated README.MD with SpecExec usage examples and metrics documentation.
  • Added scripts/validate_configs.py and integrated it into make check and make validate-configs.
  • Added make smoke-hf for quick end-to-end HF pipeline verification with a tiny model.
  • Added TARGET_PRESET/DRAFT_PRESET variables in Makefile for bench-method to avoid hardcoded mismatched pairs.
  • Fixed a benchmark import bug by explicitly importing JudgeMLP in benchmarks/bench_speculative.py.
  • Fixed the AutoJudge run determinism path by forwarding the per-run seed into autojudge_sample_hf.
  • Improved HF architecture in benchmarks/bench_speculative.py:
    • separated target/draft runtime settings (draft_tokenizer, draft_device, draft_dtype, draft_quant, draft_bnb_compute_dtype);
    • stopped forcing the draft to use the target tokenizer path;
    • added a tokenizer compatibility guard for speculative/AutoJudge methods;
    • avoided unnecessary draft-model construction for baseline-only runs.
  • Expanded benchmark JSONL records with resolved target/draft runtime fields for reproducibility.
  • Expanded CLI run args in sp_samp/cli.py to pass draft-specific overrides.
  • Added draft-preset mapping in sp_samp/cli.py to avoid target-arg overwrites.
  • Updated configs/experiments.json to correctness-safe target/draft pairings (identical tokenizer presets).
  • Updated Makefile defaults to valid experiment IDs (llama3_all_methods, llama3_target_llama3_autojudge_k4).
  • Updated README.MD with config validation and HF smoke targets, and refreshed experiment examples.
  • Updated README.MD HF command examples to tokenizer-compatible target/draft pairs.
  • Added preset configs for models and methods in configs/.
  • Added experiment pair presets in configs/experiments.json.
  • Added AutoJudge/SpecExec config templates in configs/method_templates.json.
  • Added CLI runner in sp_samp/cli.py.
  • Added CLI support for --experiment presets.
  • Exposed benchmark parser/run for programmatic use.
  • Added benchmarks/__init__.py for an importable benchmark module.
  • Added CODEX.MD and README.MD.
  • Expanded README with a full from-zero setup guide.
  • Fixed GPU requirements to avoid reinstalling CPU-only torch.
  • Implemented sp_samp/autojudge.py with:
    • synthetic judge-label collection from target/draft models;
    • MLP judge classifier training;
    • AutoJudge decoding loop with threshold decisions and fallback-to-target.
  • Integrated AutoJudge into benchmarks/bench_speculative.py:
    • new methods: autojudge, all;
    • AutoJudge train/inference args;
    • method-specific JSONL metrics.
  • Updated CLI for method selection:
    • bench --method autojudge|all;
    • the autojudge command now routes to the benchmark runner.
  • Updated configs:
    • new method presets autojudge_k4, compare_all_k4;
    • new experiment presets for AutoJudge and all-method comparisons.
  • Added tests in tests/test_autojudge.py.
  • Updated README.MD with end-to-end usage for AutoJudge and all-method runs.
  • Added Makefile with:
    • make help, make check, make list-presets;
    • quick local run (make bench-toy);
    • preset runs (make bench, make autojudge, make bench-all);
    • Docker runs (make docker-build, make docker-build-gpu, make docker-bench, make docker-autojudge, make docker-bench-all).
  • Simplified sp_samp/cli.py:
    • removed duplicated parser arguments via a shared helper;
    • switched to lazy benchmark import so list-presets works without ML dependencies;
    • expanded passthrough arguments for toy/HF parity (vocab_size, draft_noise, seed).
  • Hardened package import behavior in sp_samp/__init__.py:
    • moved HF/AutoJudge exports to optional imports, preserving lightweight commands without torch.
  • Improved benchmarks/bench_speculative.py portability:
    • supports toy-mode execution without torch;
    • emits clear dependency errors only for HF/AutoJudge paths;
    • changed the toy default vocab_size to 2048 (from 32000) to avoid pathological memory/time in RandomModel;
    • added fail-fast validation for autojudge without HF model arguments.
  • Standardized benchmark invocations to module mode (python -m benchmarks.bench_speculative) in docs and Makefile.
  • Tuned make bench-toy defaults for fast smoke runs.
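
The resume mode described above can be sketched as scanning the existing JSONL output for completed runs; the field names follow the record fields mentioned in this section (resume_key, status), but the helper itself is illustrative:

```python
import json

def completed_resume_keys(out_path):
    """Collect resume keys of successfully finished runs from an existing
    JSONL output file, so those runs can be skipped on restart."""
    done = set()
    try:
        fh = open(out_path, encoding="utf-8")
    except FileNotFoundError:
        return done  # nothing written yet: nothing to skip
    with fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a truncated trailing line from an interrupted run
            if record.get("status") != "error" and "resume_key" in record:
                done.add(record["resume_key"])
    return done
```

Error records are deliberately not counted as done, so failed runs are retried on resume.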