Official implementation of the paper LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation (arXiv: 2507.01449). This repository ships the draft-then-verify decoding algorithm, inference harnesses, and speed benchmarking utilities used in the paper, tailored for the Llama 2 family (Llama 2, Vicuna, CodeLlama, …).
- Next-next-token speculation — reuses n-gram caches built from the prompt to propose multi-branch drafts without training auxiliary draft models (see the retrieval sketch after this list).
- Retrieval-aware draft tree — dynamically grows candidate paths under a configurable capacity, balancing acceptance length against compute.
- Turn-key benchmarking — a FastChat-based evaluation loop plus a Spec-Bench-style speed profiler capture new-token counts, wall-clock latency, and accept-length histograms.
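The retrieval idea behind the draft stage resembles prompt-lookup decoding: match the most recent n-gram of the context against earlier positions and propose the tokens that followed the match as a draft for the target model to verify. The sketch below is purely illustrative (the function name and the single-branch greedy match are assumptions, not the repository's code); the actual pool builder in `model/logitspec/` maintains an n-gram cache and, per the paper, uses next-next-token speculation to grow a multi-branch draft tree.

```python
# Minimal sketch of retrieval-based drafting (illustrative only, not the repository's code):
# match the most recent n-gram against earlier context and propose the tokens that
# followed it as a draft to be verified by the target model.
from typing import List


def retrieve_draft(tokens: List[int], max_ngram_size: int = 3,
                   num_pred_tokens: int = 20) -> List[int]:
    """Return a candidate continuation retrieved from the context itself."""
    for n in range(max_ngram_size, 0, -1):                  # prefer the longest matching n-gram
        if len(tokens) <= n:
            continue
        query = tokens[-n:]                                  # the most recent n-gram
        for start in range(len(tokens) - n - 1, -1, -1):     # newest earlier occurrence first
            if tokens[start:start + n] == query:
                continuation = tokens[start + n:start + n + num_pred_tokens]
                if continuation:
                    return continuation                      # draft tokens for the verifier
    return []                                                # no match: fall back to vanilla decoding
```

A single forward pass of the target model verifies all drafted tokens at once, so every accepted draft token saves one full decoding step; that is where the speedup comes from.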
LogitSpec/
├── data/ # Spec-Bench, CNN/DM, GSM8K, HumanEval question sets
├── evaluation/ # Inference pipelines and speed analysis
├── model/logitspec/ # Pool builder, KV cache, and patched Llama kernels
├── eval.sh # End-to-end reproduction script
├── requirements.txt # Python dependencies
└── README.md
- Create a dedicated environment (Python 3.9 recommended):

      conda create -n logitspec python=3.9
      conda activate logitspec

- Install dependencies:

      pip install -r requirements.txt

  `torch` is intentionally left unpinned. Install the CUDA build that matches your driver and is compatible with `transformers==4.37.1`.
- Hardware guidance:
  - A single 24 GB GPU suffices for 7B models; larger models or multi-run sweeps benefit from 48 GB+.
  - CUDA 11.8 or newer is recommended. Optional extras such as `flash-attn`/`xformers` can further reduce latency if your stack supports them; a quick environment sanity check is sketched below.
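Before launching a long run, a quick check of the stack (plain `torch`/`transformers` calls, nothing repository-specific) can catch version or driver mismatches early:

```python
# Verify the GPU is visible and transformers matches the version pinned in requirements.txt.
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)   # requirements.txt pins 4.37.1
```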
- Model weights: Download your target checkpoint (e.g., `lmsys/vicuna-7b-v1.3`) via `huggingface-cli download` or `git lfs`, and point `--model-path` (or `Vicuna_PATH` in `eval.sh`) to the local directory.
- Benchmarks in-tree:
  - `data/spec_bench/question.jsonl` (Spec-Bench full set)
  - `data/cnndm/question.jsonl`, `data/gsm8k/question.jsonl`, `data/humaneval/question.jsonl`, …
- Custom benchmarks: drop a `question.jsonl` into `data/<bench_name>/` and reference it via `--bench-name <bench_name>`; a sketch of the expected file layout follows this list.
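For a custom benchmark, each line of `question.jsonl` is one question record. The field names used below (`question_id`, `category`, `turns`) follow the FastChat/Spec-Bench convention and are assumed here; compare against `data/spec_bench/question.jsonl` for the authoritative schema.

```python
# Hypothetical helper that writes data/my_bench/question.jsonl.
# Field names (question_id, category, turns) are assumed from the FastChat/Spec-Bench
# convention; check data/spec_bench/question.jsonl for the authoritative schema.
import json
from pathlib import Path

questions = [
    {"question_id": 1, "category": "writing", "turns": ["Summarize the plot of Hamlet in three sentences."]},
    {"question_id": 2, "category": "coding", "turns": ["Write a Python function that reverses a linked list."]},
]

out = Path("data/my_bench/question.jsonl")
out.parent.mkdir(parents=True, exist_ok=True)
with out.open("w") as f:
    for q in questions:
        f.write(json.dumps(q) + "\n")
```

Then run the inference scripts with `--bench-name my_bench`.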
`eval.sh` orchestrates the full workflow. After editing `Vicuna_PATH`:

    sh eval.sh

This sequentially runs the baseline decoder, the LogitSpec decoder, and the speed reporter, writing JSONL traces under `data/<bench_name>/model_answer/`.
1. Baseline decoding
python -m evaluation.inference_baseline \
--model-path /path/to/vicuna-7b-v1.3 \
--model-id vicuna-7b-v1.3-vanilla-float16-temp-0.0 \
--bench-name spec_bench \
--max-new-tokens 1024 \
--temperature 0.0 \
    --dtype float16

2. LogitSpec decoding
python -m evaluation.inference_logitspec \
--model-path /path/to/vicuna-7b-v1.3 \
--model-id vicuna-7b-v1.3-logitspec-float16-temp-0.0 \
--bench-name spec_bench \
--max-new-tokens 1024 \
--temperature 0.0 \
--max_ngram_size 3 \
--num_pred_tokens 20 \
--draft_tree_capacity 64 \
    --dtype float16

3. Throughput & accept-length statistics
python evaluation/speed.py \
--file-path data/spec_bench/model_answer/vicuna-7b-v1.3-logitspec-float16-temp-0.0.jsonl \
--base-path data/spec_bench/model_answer/vicuna-7b-v1.3-vanilla-float16-temp-0.0.jsonl \
    --tokenizer-path /path/to/vicuna-7b-v1.3

Add `--mean-report` when aggregating multiple runs.
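To post-process the traces yourself instead of (or in addition to) `speed.py`, the per-question token counts and wall-clock times in the answer files are enough to compute a speedup. The sketch below assumes Spec-Bench-style fields (`choices[0]["new_tokens"]` and `choices[0]["wall_time"]`); inspect one line of your own `model_answer/*.jsonl` to confirm before relying on it.

```python
# Hypothetical post-processing of model_answer traces: tokens/second and overall speedup.
# Assumes each JSONL line carries choices[0]["new_tokens"] and choices[0]["wall_time"]
# lists (Spec-Bench-style output); verify against your own files.
import json


def tokens_per_second(path: str) -> float:
    total_tokens, total_time = 0, 0.0
    with open(path) as f:
        for line in f:
            choice = json.loads(line)["choices"][0]
            total_tokens += sum(choice["new_tokens"])
            total_time += sum(choice["wall_time"])
    return total_tokens / total_time


logitspec = tokens_per_second("data/spec_bench/model_answer/vicuna-7b-v1.3-logitspec-float16-temp-0.0.jsonl")
baseline = tokens_per_second("data/spec_bench/model_answer/vicuna-7b-v1.3-vanilla-float16-temp-0.0.jsonl")
print(f"speedup: {logitspec / baseline:.2f}x")
```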
| Argument | Description | Typical range |
|---|---|---|
| `--max-new-tokens` | Maximum continuation length per turn | 512–2048 |
| `--temperature` | Sampling temperature (LogitSpec currently optimized for greedy) | 0–0.1 |
| `--max_ngram_size` | Longest n-gram cached in the pool | 2–4 |
| `--num_pred_tokens` | Max tokens retrieved per candidate branch | 8–32 |
| `--draft_tree_capacity` | Total tokens allowed in the draft tree per step | 32–128 |
| `--num-gpus-per-model` | GPUs allocated per process | 1 (default) |
| `--question-begin` / `--question-end` | Evaluate a slice of the benchmark | custom |
Tip: `draft_tree_capacity` and `num_pred_tokens` drive both acceptance length and memory. Smaller models benefit from conservative values to avoid OOM; a back-of-envelope budget is sketched below.
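As a rough budget (illustrative arithmetic only, not a formula from the paper), the tree capacity caps how many full-length retrieval branches can be drafted per step:

```python
# Illustrative budget check: how many full-length retrieval branches fit per step.
draft_tree_capacity = 64   # --draft_tree_capacity: total draft tokens per decoding step
num_pred_tokens = 20       # --num_pred_tokens: tokens retrieved per candidate branch

print(draft_tree_capacity // num_pred_tokens)   # -> 3 full branches; lower either knob on small GPUs
```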
- CUDA OOM — Lower `--max-new-tokens` or `draft_tree_capacity`, or switch to `--dtype bfloat16`; if using `device_map="auto"`, pin the model to a specific GPU.
- Model not found — Ensure weights are downloaded locally and the path is absolute. HF token authentication may be required for gated checkpoints.
- Evaluation stops early — Remove `--question-begin`/`--question-end` overrides or double-check the benchmark length.
- Speed script feels slow — The current implementation re-tokenizes baseline outputs. For large sweeps, subsample the JSONL or adapt the script to cache encodings (a subsampling sketch follows this list).
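For the subsampling route, a throwaway script like the one below is enough (hypothetical helper; adjust the stride and paths, and subsample `--file-path` and `--base-path` identically so the question sets still match):

```python
# Keep every k-th record of a model_answer JSONL so speed analysis finishes faster.
stride = 5
src = "data/spec_bench/model_answer/vicuna-7b-v1.3-logitspec-float16-temp-0.0.jsonl"
dst = src.replace(".jsonl", ".sub.jsonl")

with open(src) as fin, open(dst, "w") as fout:
    for i, line in enumerate(fin):
        if i % stride == 0:
            fout.write(line)

# Point speed.py at the *.sub.jsonl files afterwards.
```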
If LogitSpec helps your research, please cite:
@misc{liu2025logitspecacceleratingretrievalbasedspeculative,
title={LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation},
author={Tianyu Liu and Qitan Lv and Hao Li and Xing Gao and Xiao Sun},
year={2025},
eprint={2507.01449},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.01449},
}

This project builds upon Spec-Bench and FastChat. We are grateful to their authors and the broader open-source community for the foundations enabling LogitSpec.