Asynchronous Salience-Driven Speculative Lookup Framework
ASDSL Framework V2 is a research-oriented CPU inference stack for running large decoder-only models (notably Microsoft Phi-4 multimodal instruct) with optional 4-bit ASDSL quantization, native AVX2/OpenMP GEMV kernels (PyBind11), and quantization-cascade speculative decoding (QCSD). The primary reference inference path lives in Python and PyTorch (experiments/phi4_cpu_run.py); optional C++ extensions accelerate fused GEMV and related kernels when built with setup.py.
- What this repository contains
- The CPU inference bottleneck (context)
- Architecture overview
- Native C++ extensions and build flags
- Phi-4 CPU reference path (`experiments/phi4_cpu_run.py`)
- Full A/B/C benchmark (`scripts/run_full_benchmark.py`)
- QCSD: speculative decoding and verification
- Leviathan guardrail and `.qcsd_history.json`
- Threading and OpenMP tuning
- Project structure (accurate layout)
- Getting started
- Tests
- Roadmap and limitations
This is not a single monolithic “drop a .pyd and forget Python” engine. It combines:
| Layer | Role |
|---|---|
| `experiments/phi4_cpu_run.py` | End-to-end Phi-4 text inference: load local safetensors, ASDSL-quantize projections, KVHistory, RoPE, GQA, MLP, greedy generate, optional QCSD (`generate_qcsd`). |
| `asdsl/kernels/` | Python APIs (`gemv_q4_packed`, etc.) plus optional native modules (`_native_gemv`, `_native_forward`, …) compiled from `asdsl/kernels/native/*.cpp` and `forward_loop.cpp`. |
| `scripts/run_full_benchmark.py` | Simulator mode (no weights) or `--phi4` mode: seven-row throughput table (Profiles A, C, D, E, F, G, B), Leviathan QCSD gate, optional history-backed α, RSS and footprint telemetry. |
| `asdsl/speculative/` | Dual-model / simulated speculative decoding helpers used by benchmarks and tests. |
If the native extensions are not built, ASDSL falls back to NumPy or PyTorch paths where implemented; behavior remains correct but slower.
Scope: The rest of this README is a deep technical reference for the Phi-4 CPU runner, A/B/C benchmark, QCSD + Leviathan + history file, native packed-Q4 GEMV (including batched verify), compiler flags, and thread/OpenMP tuning, with an accurate repo layout. It does not exhaust every module under asdsl/ (e.g. all quantization research utilities, eval scripts, or LUT experiments)—use this file as the spine, then follow imports, setup.py, and tests/ for subsystems not expanded here.
Decoder steps are dominated by memory bandwidth (reading large weight matrices per token) and by framework overhead (allocations, Python dispatch). This codebase mitigates that by:
- Fused packed 4-bit GEMV in C++ (weights stay nibble-packed; dequant + dot in registers) when `_native_gemv` is available.
- Optional OpenMP over output rows / layers in those kernels.
- QCSD to amortize target work when draft acceptance is high enough (guarded analytically).
Exact tok/s depends on hardware, thread counts, and whether native GEMV is linked.
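A rough roofline sanity check for the bandwidth-bound claim: each decode step streams the full weight set once, so peak tok/s is bounded by bandwidth divided by bytes read per token. A minimal sketch (the parameter count and bandwidth figures below are illustrative placeholders, not measurements from this repo):

```python
def roofline_tokens_per_second(param_count: float, bits_per_weight: float,
                               bandwidth_gb_s: float) -> float:
    """Upper bound on decode throughput assuming every generated token
    streams the full quantized weight set through DRAM exactly once."""
    bytes_per_token = param_count * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Illustrative numbers only: a ~5.6B-parameter model at 4 bits/weight
# over ~30 GB/s of effective DRAM bandwidth.
ceiling = roofline_tokens_per_second(5.6e9, 4, 30.0)
```

Framework overhead (Python dispatch, allocations) pushes real throughput well below this ceiling, which is what the native fused kernels try to recover.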
- Weights: read from `models/phi4-multimodal-instruct/` via safetensors + index JSON (`model.safetensors.index.json`).
- WeightStore: holds per-layer projections as packed Q4 (or uint8 paths), scales/biases, an optional draft bank for QCSD, and an `lm_head` tensor.
- Forward:
  - AR (`generate`): one token at a time; each linear uses `matvec` → `gemv_q4_packed` (native) or dequant + PyTorch, depending on flags.
  - Batched verify (`forward_layer_batch` + `matmul_batch`): for QCSD verification, a batch of hidden rows is multiplied by the weights; when `bits == 4` and `_use_native_gemv`, `_matmul_q4_packed_batch` calls `gemv_q4_packed` with a 2D `x` so the entire batch is dispatched in one native call (see below).
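The 1D-vs-2D dispatch can be illustrated with a NumPy reference. The layout here (two nibbles per byte, one float scale per row, zero-point 8) is an assumption for illustration only; the real packed format is defined by the native kernel sources:

```python
import numpy as np

def gemv_q4_packed_ref(packed, scales, x):
    """Reference packed-Q4 GEMV: dequantize nibbles, then multiply.

    packed : (rows, cols // 2) uint8, two 4-bit values per byte
    scales : (rows,) float32 per-row scale (zero-point 8 assumed)
    x      : (cols,) or (batch, cols) float32

    Illustrative layout; the native kernel's actual format lives in
    asdsl/kernels/native/gemv_q4_avx2.cpp.
    """
    lo = (packed & 0x0F).astype(np.int8) - 8          # even columns
    hi = (packed >> 4).astype(np.int8) - 8            # odd columns
    w = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.float32)
    w[:, 0::2] = lo
    w[:, 1::2] = hi
    w *= scales[:, None]                              # dequantize per row
    x2d = np.atleast_2d(x)                            # (batch, cols)
    out = x2d @ w.T                                   # one call covers the batch
    return out[0] if x.ndim == 1 else out

rng = np.random.default_rng(0)
packed = rng.integers(0, 256, size=(4, 8), dtype=np.uint8)
scales = np.ones(4, dtype=np.float32)
x = rng.standard_normal(16).astype(np.float32)
single = gemv_q4_packed_ref(packed, scales, x)                 # 1D path
batch = gemv_q4_packed_ref(packed, scales, np.stack([x, x]))   # 2D path
```

The point of the 2D path is amortization: one Python-to-native crossing per layer for the whole verify batch instead of one per row.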
- `asdsl/kernels/forward_loop.cpp` → module `asdsl.kernels._native_forward`: mmap-oriented / GGUF-style Q4K helpers and related utilities (not the same code path as the `gemv_q4_packed` used by Phi-4's packed ASDSL layout).
- `asdsl/kernels/native/gemv_q4_avx2.cpp` + `gemv_q4_kernel.cpp` → `asdsl.kernels._native_gemv`: packed Q4 fused GEMV (`gemv_q4_packed`, `gemv_q4_unpacked`, `gemv_q4_avx2_gs64`), CPU feature probes, optional `set_num_threads`/`get_num_threads` when built with OpenMP.
Phi-4 Profile C throughput is dominated by gemv_q4_packed → gemv_q4_packed_impl_v2 (AVX2 unpack + FMA), not by forward_loop.cpp’s Q4K GEMV.
File: setup.py (there is no CMakeLists.txt in this repo; all native code is built through setuptools + PyBind11).
Windows (MSVC)
- `/O2` — maximize speed
- `/Ob2` — aggressive inlining
- `/Oi` — intrinsics
- `/arch:AVX2`
- `/fp:fast`
- `/openmp` — compile-time OpenMP (the linker pulls in the OpenMP runtime; do not pass `/openmp` to `link.exe` as a separate link flag)
- `/EHsc` — C++ exception handling
Linux and macOS (GCC/Clang)
- `-O3`, `-mavx2`, `-mfma`, `-mf16c`, `-ffast-math`, `-std=c++17`
- `-fopenmp` on both compile and link
macOS note: Apple’s toolchain often needs Homebrew libomp and appropriate CPPFLAGS / LDFLAGS if -fopenmp fails at link time. See comments at the top of setup.py.
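The per-platform flag selection might be wired up roughly as follows. This is an illustrative sketch, not the repository's actual setup.py (which also adds PyBind11 include directories and builds the other extension modules listed below):

```python
import sys
from setuptools import Extension

# Flag lists mirror the MSVC and GCC/Clang sections above.
if sys.platform == "win32":   # MSVC
    compile_args = ["/O2", "/Ob2", "/Oi", "/arch:AVX2",
                    "/fp:fast", "/openmp", "/EHsc"]
    link_args = []            # /openmp must NOT be passed to link.exe
else:                         # GCC / Clang
    compile_args = ["-O3", "-mavx2", "-mfma", "-mf16c",
                    "-ffast-math", "-std=c++17", "-fopenmp"]
    link_args = ["-fopenmp"]  # OpenMP is needed at link time as well

ext = Extension(
    "asdsl.kernels._native_gemv",
    sources=["asdsl/kernels/native/gemv_q4_avx2.cpp",
             "asdsl/kernels/native/gemv_q4_kernel.cpp"],
    extra_compile_args=compile_args,
    extra_link_args=link_args,
    language="c++",
)
```

A real build would pass `ext_modules=[ext, ...]` to `setup()` along with `pybind11.get_include()` in `include_dirs`.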
| Python module | Main sources |
|---|---|
| `asdsl.kernels._native_forward` | `asdsl/kernels/forward_loop.cpp` |
| `asdsl.kernels._native_gemv` | `native/gemv_q4_avx2.cpp`, `native/gemv_q4_kernel.cpp` |
| `asdsl.kernels._native_gemv_q8` | `native/gemv_q8_avx2.cpp` |
| `asdsl.kernels._native_gemv_q3` | `native/gemv_q3_avx2.cpp` |
| `asdsl.kernels._native_gemv_q2` | `native/gemv_q2_avx2.cpp` |
| `asdsl.kernels._native_sparse_gemv` | `native/gemv_sparse_avx2.cpp` |
| `asdsl.kernels._native_lut` | `native/lut_avx2.cpp` |
| `asdsl.kernels._native_inference` | `native/inference_engine.cpp` |
Build:

```
pip install pybind11
python setup.py build_ext --inplace
```

Phi-4 CPU reference path (`experiments/phi4_cpu_run.py`):

- Loads Phi-4 multimodal instruct weights from disk (see `MODEL_DIR`, `INDEX_FILE`).
- Builds RMSNorm, RoPE (partial rotary dim), GQA, SiLU MLP, LM head.
- Implements `KVHistory` (FP32 K/V per layer) and an optional ASDSL KV tracker for diagnostics.
- `generate`: standard greedy autoregressive decoding; records optional `bench_metrics_out` dicts (`tokens_per_second`, etc.).
- `generate_qcsd`: draft model (e.g. a 2-bit bank) + batched target verify; see QCSD.
- `WeightStore._use_native_gemv`: when `True` and bits/layout match, the 4-bit primary uses `asdsl.kernels.gemv_q4.gemv_q4_packed` (C++ if available).
- `_matmul_q4_packed_batch`: if `_use_native_gemv`, builds NumPy views of the packed weights + `(batch, cols)` activations and calls `gemv_q4_packed` once; applies the same outlier-correction loop as single-vector `matvec` when outlier tables exist. Otherwise falls back to unpack + `torch.mm`.
Location: top of phi4_cpu_run.py.
- Sets `OMP_NUM_THREADS`, `MKL_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, `VECLIB_MAXIMUM_THREADS`, `NUMEXPR_NUM_THREADS` to `str(n)`.
- Calls `torch.set_num_threads(n)` and the CPU float-flush config.
- If `asdsl.kernels._native_gemv` is importable and exposes `has_openmp` and `set_num_threads`, calls `_native_gemv.set_num_threads(n)` so the OpenMP runtime matches the environment before native GEMV runs.
Auto mode (`n <= 0`): `n = max(1, (os.cpu_count() or 4) // 2)` — half of logical CPUs (minimum 1), aimed at reducing hyperthread contention on bandwidth-bound kernels while still allowing `--threads N` to override.

CLI: `python experiments/phi4_cpu_run.py --threads 0` uses auto; a positive N fixes the thread count. After auto, `main` overwrites the displayed `args.threads` with `int(os.environ["OMP_NUM_THREADS"])` for consistent logging.
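The policy above can be sketched as follows (function names here are illustrative; the repo's actual implementation is `set_thread_count` in `phi4_cpu_run.py`):

```python
import os

def resolve_thread_count(requested: int) -> int:
    """Auto policy: n <= 0 means half of logical CPUs, minimum 1."""
    if requested > 0:
        return requested
    return max(1, (os.cpu_count() or 4) // 2)

def apply_thread_count(n: int) -> None:
    """Pin every math backend to the same team size before kernels run."""
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS",
                "VECLIB_MAXIMUM_THREADS", "NUMEXPR_NUM_THREADS"):
        os.environ[var] = str(n)
    # The real function also calls torch.set_num_threads(n) and, when the
    # native module exposes it, _native_gemv.set_num_threads(n).
```

Setting the environment variables before any kernel runs matters: OpenMP runtimes typically size their thread team once, on first parallel region.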
- Default (no `--phi4`) — simulator using `asdsl.speculative.dual_model` and synthetic token lists; analytical footprints and optional `--verify-leviathan-apples`.
- `--phi4` — loads `experiments/phi4_cpu_run` only after `_require_phi4_index_or_exit`: verifies that `models/phi4-multimodal-instruct/model.safetensors.index.json` exists, to avoid expensive imports when weights are missing (exit code 2, message on stderr).
- `phi4.set_thread_count(threads if threads > 0 else 0)` — so `--threads 0` triggers the auto half-logical-CPU policy.
- Load `WeightStore`, `load()`, `warm_cache()`, record RSS after load.
- Leviathan gate α:
  - If `.qcsd_history.json` contains valid `acceptance_rates`, the mean of all entries is `alpha_for_leviathan`.
  - Else `alpha_for_leviathan = phi4_acceptance_estimate` (CLI default 0.40).
  - If `_qcsd_break_even_ok(alpha_for_leviathan, …)` fails, it may set `store._enable_qcsd = False` and fall back to AR for Profile B.
- Profile A: `store._use_native_gemv = False`, `phi4.generate`, capture `bench_metrics_out`, peak RSS.
- Profile C: `store._use_native_gemv = True`, `phi4.generate` again (same AR path as A, native Q4 GEMV on). Footprint model matches A (primary + FP32 KV estimate).
- Profile B: `store._use_native_gemv = True`; `phi4.generate_qcsd` if QCSD is still enabled, else `phi4.generate`.
- If QCSD ran and metrics include `acceptance_rate`, `_append_qcsd_acceptance_rate(root, rate)` appends to `.qcsd_history.json` (cap 256 entries).
- A: AR + PyTorch matvec path + FP32 KV estimate.
- C: AR + native Q4 GEMV + FP32 KV estimate (isolates kernel vs Python/QCSD).
- B: Native GEMV + QCSD + Q4 KV estimate, or native AR if QCSD disabled.
Also prints effective thread count from OMP_NUM_THREADS, QCSD greedy acceptance, verify telemetry (when QCSD), Leviathan S with alpha_gate, and append notice when history was updated.
| Flag | Meaning |
|---|---|
| `--phi4` | Real Phi-4 benchmark (requires local index/weights). |
| `--threads` | Default 0 (auto). Pass N to override. |
| `--phi4-acceptance-estimate` | Prior α when history is empty (default 0.40). |
| `--gamma` / `--draft-k` | Draft width for QCSD and Leviathan g. |
| `--verify-leviathan-apples` | Simulator: compare timed QCSD vs AR to analytical S. |
Entry: generate_qcsd in experiments/phi4_cpu_run.py.
- Draft: `run_forward(..., use_draft=True)` for up to `draft_k` steps on the draft weights (sequential small-model steps). KV is snapshotted and restored so the draft does not corrupt the target cache.
- Verify (target, batched): builds `hidden_batch` from `verify_tokens` (current greedy token + draft prefix aligned for the greedy check), then one stack over `NUM_LAYERS`: `forward_layer_batch` → `matmul_batch` on the primary store. This is not k full separate target forwards for the stacked verify; it is one batched pass per layer.
- LM head: `lm_head_matmul_batch` on the batch.
- Accept/reject: greedy comparison of draft vs target argmax; KV trim + optional `run_forward` for correction, or continuation when all draft tokens match.
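The greedy accept/reject step can be sketched in isolation (a simplified model of the logic, not the repo's exact code; the real path also handles KV trimming):

```python
def greedy_accept(draft_tokens, target_argmax):
    """Accept the longest prefix where draft matches the target argmax.

    draft_tokens  : k tokens proposed by the draft model
    target_argmax : k + 1 target argmax tokens from the batched verify
                    pass (position i verifies draft_tokens[i]; the last
                    entry supplies the correction / bonus token)
    Returns (committed tokens for this cycle, all_accepted flag).
    """
    accepted = []
    for d, t in zip(draft_tokens, target_argmax):
        if d != t:
            accepted.append(t)        # first mismatch: take target's token
            return accepted, False
        accepted.append(d)            # match: keep the draft token
    accepted.append(target_argmax[len(draft_tokens)])  # bonus token
    return accepted, True

# Three drafts, mismatch at position 1: commit [5, 7], cycle not fully accepted.
tokens, all_ok = greedy_accept([5, 9, 2], [5, 7, 2, 4])
```

Because the target's argmax at every position comes from the single batched verify pass, even a fully rejected cycle still commits one correct token.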
Counters (per full decode):
- `_verify_calls`: incremented once per speculative cycle when the batched verify stack starts (expect ≈ the number of speculative cycles, not `draft_k` per cycle).
- `_verify_extra_run_forward`: target `run_forward` after verify (correction / all-accepted tail).

Printed at the end of QCSD and echoed into `bench_metrics_out` as:

`_verify_calls`, `qcsd_verify_batched_passes`, `qcsd_speculative_cycles`, `qcsd_verify_extra_run_forward`, `acceptance_rate`, `tokens_per_second`, etc.
Theory (Leviathan et al., 2023): speculative speedup S as a function of acceptance α, draft length g, and cost ratio c (here modeled as draft_mb / target_mb).
Gate: QCSD is enabled only if S ≥ 1.01 (configurable inside _qcsd_break_even_ok). Failure prints an analytical message including a binary-search hint for min α at 1.05×.
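The analytical speedup from Leviathan et al. (2023) is S = (1 − α^(γ+1)) / ((1 − α)(γc + 1)) for acceptance α, draft length γ, and draft/target cost ratio c. A sketch of the gate under those definitions (here c is modeled as `draft_mb / target_mb` as described above; the repo's `_qcsd_break_even_ok` may differ in details):

```python
def leviathan_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected speedup S from Leviathan et al. (2023):
    S = (1 - alpha^(gamma+1)) / ((1 - alpha) * (gamma * c + 1))."""
    if alpha >= 1.0:
        # Limit as alpha -> 1: every cycle commits gamma + 1 tokens.
        return (gamma + 1) / (gamma * c + 1)
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))

def qcsd_gate(alpha: float, gamma: int, c: float,
              threshold: float = 1.01) -> bool:
    """Enable speculation only when the analytical speedup clears 1.01x."""
    return leviathan_speedup(alpha, gamma, c) >= threshold

# At the measured ~0.14 acceptance with gamma=7 and a mid-sized draft,
# S falls well below 1, so the gate keeps QCSD off.
```

Intuition: the numerator is the expected tokens committed per cycle; the denominator is the cycle's cost relative to one target step, so S < 1 whenever the drafts cost more than the tokens they save.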
Adaptive α (implemented):
- File: repo-root `.qcsd_history.json`, shape `{"acceptance_rates": [ ... ]}`.
- Read: before the gate, if any valid rates exist, `alpha_for_leviathan = mean(rates)`.
- Write: after a Phi-4 run where QCSD actually executed, the measured greedy `acceptance_rate` is appended (bounded list).
- Cold start: if the history is empty, the CLI prior defaults to 0.40 (`--phi4-acceptance-estimate`). If the gate disables QCSD, no acceptance is appended until a run actually completes QCSD.
This prevents a fixed optimistic prior (e.g. 0.70) from keeping QCSD on when measured acceptance is ~0.14.
| Mechanism | Where |
|---|---|
| Environment variables | `set_thread_count` in `phi4_cpu_run.py` |
| PyTorch intra-op threads | `torch.set_num_threads` |
| Native OpenMP team size | `_native_gemv.set_num_threads` when built with OpenMP |
| Benchmark default | `scripts/run_full_benchmark.py` `--threads` default 0 → auto half logical CPUs |
Tuning guidance: for bandwidth-bound GEMV, too many OpenMP threads can hurt; compare Profile C vs A while sweeping --threads.
```
asdsl-framework/
├── asdsl/
│   ├── kernels/
│   │   ├── forward_loop.cpp        # _native_forward (mmap/Q4K-style paths)
│   │   ├── native/
│   │   │   ├── gemv_q4_avx2.cpp    # pybind + gemv_q4_packed_impl_v2, batched API
│   │   │   ├── gemv_q4_kernel.cpp  # gemv_q4_avx2 (gs64) + OpenMP on rows
│   │   │   ├── gemv_q4_kernel.h
│   │   │   ├── gemv_q8_avx2.cpp, gemv_q3_avx2.cpp, gemv_q2_avx2.cpp, …
│   │   │   └── …
│   │   ├── gemv_q4.py              # Python entry: 1D or 2D x, NumPy fallback
│   │   └── __init__.py
│   ├── speculative/                # dual_model / simulators for benchmarks
│   ├── quantization/
│   └── …
├── experiments/
│   ├── phi4_cpu_run.py             # Main Phi-4 CPU reference + QCSD
│   └── phi4_integration.py         # Setup / download guidance (see script)
├── scripts/
│   └── run_full_benchmark.py       # A/B/C benchmark, Leviathan, history
├── tests/
│   ├── test_run_full_benchmark_preflight.py
│   ├── test_leviathan_qcsd.py
│   ├── test_gemv_q4_batched.py     # Batched packed + gs64 vs loop
│   └── …
├── models/                         # Local Phi-4 weights (user-provided)
│   └── phi4-multimodal-instruct/
│       └── model.safetensors.index.json  # required for --phi4 fast-fail
├── setup.py                        # Native extension build flags
├── pyproject.toml
└── .qcsd_history.json              # created at repo root after QCSD benchmark runs (optional)
```
Note: Older README references to run_phi4_benchmark.py / quantize_phi4_mock.py at repo root are obsolete; use experiments/phi4_cpu_run.py and scripts/run_full_benchmark.py instead.
- Python 3.10+
- PyTorch, transformers, safetensors (see `requirements.txt` / `pyproject.toml`)
- For native speedups: MSVC Build Tools (Windows) or GCC/Clang with OpenMP (Linux; macOS may need `libomp`)
```
cd asdsl-framework
pip install -r requirements.txt   # or: pip install -e ".[dev]"
pip install pybind11
python setup.py build_ext --inplace
```

Place checkpoints under `models/phi4-multimodal-instruct/` so that `model.safetensors.index.json` exists. Use `python experiments/phi4_integration.py` (or your own download flow) per that script's instructions.
```
python experiments/phi4_cpu_run.py --bits 4 --prompt "Hello" --max-new-tokens 32 --threads 0
python experiments/phi4_cpu_run.py --qcsd --bits 4 --draft-bits 2 --draft-k 7
python scripts/run_full_benchmark.py --phi4 --max-new-tokens 64 --threads 0
```

| Test file | Focus |
|---|---|
| `tests/test_run_full_benchmark_preflight.py` | Phi-4 index fast-fail; module load |
| `tests/test_leviathan_qcsd.py` | Leviathan S and break-even helpers |
| `tests/test_gemv_q4_batched.py` | Batched `gemv_q4_packed` / `gemv_q4_avx2_gs64` vs sequential |
| `tests/test_fused_gemv.py` | Q8 fused path (when built) |
| `tests/test_speculative_decoding.py` | Speculative decoding Python contracts |
| Others | Q3, cache tiling, STREAM OMP hygiene, Q4 KV, etc. |
```
pytest tests/ -q
```

All benchmark results below use the pinned settings in `benchmark_config.json` so cross-phase comparisons stay apples-to-apples (prompt length drives KV size and throughput on this memory-bound stack).
- Prompt: `The fundamental theorem of calculus states that`
- Max new tokens: 64
- Threads: 8
- `draft_k`: 1 for Profile G EAGLE-3 (chosen by `scripts/choose_draft_k.py` from MTP test_top1; QCSD Profile B uses the same `draft_k` from the config for Leviathan)
- Inter-profile sleep: 3 s (overridable with `ASDSL_PROFILE_SLEEP`)
CLI flags that override these values print [CONFIG] WARNING: unless you pass --override-config.
Hardware: Intel Core (Raptor Lake, Family 6 Model 186), 12 physical cores, 16.9 GB RAM, AVX2, Windows 11, no GPU. Phase 13 (2026-03-30): same canonical command and benchmark_config.json as Phase 11-12. EAGLE-3 change: both reject and all-accept branches now extract _last_final_hidden and logits directly from the verify batch's hidden_norm and all_logits tensors — zero extra target forward passes per cycle (extra_run_forward=0 confirmed). Load+quantize parent process ~1036 s on the recorded run.
| Profile | Configuration | tok/s | vs baseline | vs llama.cpp |
|---|---|---|---|---|
| A | PyTorch baseline | 1.99 | 1.00× | 0.28× |
| C | Native Q4 GEMV (AVX2 FMA) | 2.40 | 1.21× | 0.34× |
| D | LUT vpshufb (slower on Raptor Lake†) | 1.64 | 0.82× | 0.23× |
| E | SliM 2.2-bit + LUT (4/32 layers) | 1.56 | 0.78× | 0.22× |
| F | FATReLU 85% FFN sparsity | 2.86 | 1.44× | 0.41× |
| G | FATReLU + EAGLE-3 MTP (draft_k=1) | 1.43 | 0.72× | 0.20× |
| B | Legacy QCSD (2-bit draft bank) | 0.86 | 0.43× | 0.12× |
†Profile D is slower than Profile C on this hardware because _mm_i32gather_ps latency (~20 cycles) on Raptor Lake often outweighs the vpshufb shuffle path. The LUT approach tends to pay off more on AMD Zen 4 or ARM Neoverse class cores.
llama.cpp Q4_K_M reference (same hardware class): ~7.0 tok/s
EAGLE-3 acceptance rate: ~7.1% (Profile G subprocess, Phase 13 run). Mean tokens/cycle: 1.07 (was 0.44 in Phase 12 — improvement from eliminating the reject-path run_forward). Decode summary prints extra_run_forward=0 confirming zero extra passes per cycle. Leviathan gate for G at draft_k=1: FAIL (break-even α ~22.1%). Profile G remains below Profile F on this run (1.43 vs 2.86 tok/s).
- EAGLE-3 cycle cost (Phase 13 fix): both reject and all-accept branches now extract `_last_final_hidden` and logits directly from the verify batch's `hidden_norm` and `all_logits` tensors. `extra_run_forward = 0` confirmed in subprocess telemetry.
- EAGLE-3 vs llama.cpp ceiling: with zero cycle overhead and `draft_k=1`, at 100% acceptance G = 2×F = ~5.72 tok/s (a ceiling below llama.cpp's 7.0). Beating llama.cpp requires Profile F ≥ ~4.7 tok/s AND acceptance ≥ ~50% (full SliM calibration + regression fix).
- Profile F regression: the Phase 7 peak was 5.19 tok/s; Phase 13 measured 2.86 tok/s (a 44.9% regression). All 32 transposed down_proj layers load correctly; the cause is likely session-to-session variance on quantized AVX2 GEMV workloads.
- QCSD Profile B still reports extra target `run_forward` after verify from `generate_qcsd`; unchanged.
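The ceiling arithmetic above can be checked with an idealized model: with zero per-cycle overhead, a `draft_k=1` cycle costs one target pass and commits one bonus token plus the accepted draft, so throughput scales as `1 + acceptance` over the base profile. (The real pipeline has per-cycle overhead, which is why the measured break-even acceptance is ~22% rather than 0; the function name is illustrative.)

```python
def eagle3_ceiling(base_tok_s: float, draft_k: int, acceptance: float) -> float:
    """Idealized speculative throughput with zero per-cycle overhead:
    each cycle commits 1 bonus token plus acceptance * draft_k drafts
    for the cost of one base-profile step."""
    tokens_per_cycle = 1 + acceptance * draft_k
    return base_tok_s * tokens_per_cycle

# Phase 13 table numbers: Profile F = 2.86 tok/s, draft_k = 1.
full = eagle3_ceiling(2.86, 1, 1.0)   # 100% acceptance: 2 x F = 5.72 tok/s
```

Even at the 5.72 tok/s ceiling, G stays below the llama.cpp reference (~7.0 tok/s), which is why the roadmap pairs acceptance improvements with raising Profile F itself.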
- EAGLE-3 throughput: At ~7% acceptance, G < F; need acceptance ≥ ~22% for G = F, and ≥ 50% to approach llama.cpp. Requires more/better MTP training data or larger draft_k (if acceptance supports it).
- SliM calibration: Quick-mode metadata calibrates only 4/32 layers; full calibration should shrink footprint and may change Profile E quality and speed.
- Full SliM + FATReLU combined: Profiles E and F are separate; stacking both is future work.
- QCSD is guarded by Leviathan + optional history file; low acceptance keeps QCSD off to avoid slowdowns.
- Profile C is the right knob to compare native GEMV vs PyTorch matvec without speculative decoding.
- L2 cache tiling inside packed GEMV is a possible future optimization if profiling still shows DRAM-bound behavior after flags + thread tuning.
- `forward_loop.cpp` Q4K paths are separate from the Phi-4 packed Q4 in `gemv_q4_*`; do not assume an optimization for one applies to the other.
License: Apache-2.0. See LICENSE.