Evgenii Vyaltsev (ORCID 0009-0004-3712-6798), Daniil Vyaltsev — June 2026.
A training-free, exact-ranking query-axis selector for top-K sparse attention, with an explicit measured entropy-concentration scaling exponent.
- Theorem 1 (exact ranking). Ranking keys by the projection onto the unit query
direction
u_Q = Q/‖Q‖is order-identical to ranking by the raw attention score, so top-k_effselection by⟨u_Q, K_i⟩returns the true top-k_eff— no learned parameters, no ranking approximation. (Exact where DSA/NSA are learned and Quest/InfLLM/Loki are approximate.) - Theorem 2 (entropy-concentration scaling, descriptive). Attention entropy
grows logarithmically,
H_N = α·log N + βwith a measured α ≈ 0.31 (Llama-3.2-1B, WikiText-2), so the effective supportK_eff = e^{H_N} ∼ N^αis sublinear and the coverage-floor perplexity gap decays asΔPPL(N) = O(N^{-(1-α)}): the relative advantage of sparsity strengthens with context length. Reported as a descriptive law; the causal test (matched-magnitude intervention) is future work. - Verified system (training-free). A Triton selection kernel validated bitwise-identical to a pure-PyTorch reference; bit-identical flag-off behaviour; an INT8 KV cache at 0.508× footprint, +0.714% PPL.
paper2_qaxis_scaling.tex LaTeX source (arXiv-ready)
paper2_qaxis_scaling.pdf compiled preprint (7 pp)
make_figures.py regenerates the theory figures (Thm 1 schematic, Thm 3 curves)
figures/ fig1 (selector), fig2 (scaling) — .pdf + .png
- Verified and stated without qualification: Theorem 1 (algebra), the exact-kernel
bitwise validation (
max|ΔO| = 0, index agreement 1.0), bit-identical flag-off (max|Δlogit| = 0), and the INT8 KV cache (cache ratio 0.5078, +0.714% PPL), reproduced by the sealed self-check (567 tests pass). - Descriptive: Theorem 2.
α ≈ 0.31is a measured exponent for one model/corpus, not a universal constant; its causal status is untested by design. - Deferred to a companion empirical report (future release): the full multi-seed
ΔPPL(N)sweep and the latency decomposition. Point estimates are seed- and engine-version sensitive; the robust, version-independent claim is the shape (advantage monotone-increasing in N for α<1, Fig. 2). The engine, exact kernel, and sealed self-check live in the project's dcr-attention repository.
python3 -m pip install matplotlib numpy
python3 make_figures.pyFigures are theory/illustrative (the Theorem 1 selector schematic and the Theorem 2 predicted curves for α=0.31); no measured PPL points are fabricated.
pdflatex paper2_qaxis_scaling.tex && pdflatex paper2_qaxis_scaling.texCompanion to No Critical Key Budget in Query-Axis Top-K Sparse Attention (retention; DOI 10.5281/zenodo.20514229), which uses the same exact selector and shows the long-context limit is the base model, not the key budget. This paper supplies the selector's exactness (Thm 1) and the cost-scaling framing (Thm 2).
Text/figures: CC-BY-4.0. Code: MIT.
See CITATION.cff; full citation/BibTeX added once the Zenodo DOI is minted.