Skip to content

Seqev/qaxis-exact-scaling

Repository files navigation

Exact Query-Axis Ranking and Entropy-Concentration Scaling for Training-Free Top-K Sparse Attention

DOI

Evgenii Vyaltsev (ORCID 0009-0004-3712-6798), Daniil Vyaltsev — June 2026.

A training-free, exact-ranking query-axis selector for top-K sparse attention, with an explicit measured entropy-concentration scaling exponent.

TL;DR

  • Theorem 1 (exact ranking). Ranking keys by the projection onto the unit query direction u_Q = Q/‖Q‖ is order-identical to ranking by the raw attention score, so top-k_eff selection by ⟨u_Q, K_i⟩ returns the true top-k_eff — no learned parameters, no ranking approximation. (Exact where DSA/NSA are learned and Quest/InfLLM/Loki are approximate.)
  • Theorem 2 (entropy-concentration scaling, descriptive). Attention entropy grows logarithmically, H_N = α·log N + β with a measured α ≈ 0.31 (Llama-3.2-1B, WikiText-2), so the effective support K_eff = e^{H_N} ∼ N^α is sublinear and the coverage-floor perplexity gap decays as ΔPPL(N) = O(N^{-(1-α)}): the relative advantage of sparsity strengthens with context length. Reported as a descriptive law; the causal test (matched-magnitude intervention) is future work.
  • Verified system (training-free). A Triton selection kernel validated bitwise-identical to a pure-PyTorch reference; bit-identical flag-off behaviour; an INT8 KV cache at 0.508× footprint, +0.714% PPL.

Repository layout

paper2_qaxis_scaling.tex   LaTeX source (arXiv-ready)
paper2_qaxis_scaling.pdf   compiled preprint (7 pp)
make_figures.py            regenerates the theory figures (Thm 1 schematic, Thm 3 curves)
figures/                   fig1 (selector), fig2 (scaling) — .pdf + .png

Status of the numbers (read before citing)

  • Verified and stated without qualification: Theorem 1 (algebra), the exact-kernel bitwise validation (max|ΔO| = 0, index agreement 1.0), bit-identical flag-off (max|Δlogit| = 0), and the INT8 KV cache (cache ratio 0.5078, +0.714% PPL), reproduced by the sealed self-check (567 tests pass).
  • Descriptive: Theorem 2. α ≈ 0.31 is a measured exponent for one model/corpus, not a universal constant; its causal status is untested by design.
  • Deferred to a companion empirical report (future release): the full multi-seed ΔPPL(N) sweep and the latency decomposition. Point estimates are seed- and engine-version sensitive; the robust, version-independent claim is the shape (advantage monotone-increasing in N for α<1, Fig. 2). The engine, exact kernel, and sealed self-check live in the project's dcr-attention repository.

Figures

python3 -m pip install matplotlib numpy
python3 make_figures.py

Figures are theory/illustrative (the Theorem 1 selector schematic and the Theorem 2 predicted curves for α=0.31); no measured PPL points are fabricated.

Rebuild the PDF

pdflatex paper2_qaxis_scaling.tex && pdflatex paper2_qaxis_scaling.tex

Relation to the project

Companion to No Critical Key Budget in Query-Axis Top-K Sparse Attention (retention; DOI 10.5281/zenodo.20514229), which uses the same exact selector and shows the long-context limit is the base model, not the key budget. This paper supplies the selector's exactness (Thm 1) and the cost-scaling framing (Thm 2).

License

Text/figures: CC-BY-4.0. Code: MIT.

Citation

See CITATION.cff; full citation/BibTeX added once the Zenodo DOI is minted.

About

Training-free EXACT-ranking top-K sparse attention. Ranking keys by ⟨u_Q,K⟩ (u_Q=Q/‖Q‖) is order-identical to true attention score. Entropy H_N=α·log N+β (α≈0.31) gives sublinear support N^α and PPL gap decaying as N^-(1-α). Verified system: exact Triton kernel, INT8 KV ×0.508, +0.714% PPL delta..

Topics

Resources

License

Unknown, Unknown licenses found

Licenses found

Unknown
LICENSE-CODE.txt
Unknown
LICENSE-TEXT-DATA.txt

Stars

Watchers

Forks

Packages

 
 
 

Contributors