Add optional sparse covariance backend for matrix memory updates by simeon-kepp · Pull Request #117 · NX-AI/xlstm

simeon-kepp · 2026-04-13T06:40:43Z

What this adds

An optional sparse memory update path for the mLSTM cell, gated behind a config flag. Default behavior is unchanged.

The mLSTM cell's matrix memory update is compute-heavy at inference time: every step writes C_new = fg * C_prev + ig * (k ⊗ v), regardless of how informative the current token is. In practice, a large fraction of key and value projections are near-zero — especially in later layers — meaning most of those outer-product writes are wasted.

This PR adds a Sparse Ternary Covariance (STC) backend that skips low-magnitude writes by adaptively quantizing k and v to {−1, 0, +1} before the outer product. Zero entries in the quantized vectors produce zero-contribution columns/rows in the update, which are skipped entirely. A ternary forget gate replaces the sigmoid gate when the STC backend is active, turning exact preserve (gate=0) and active inhibition (gate=−1) into first-class operations.
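
The write-skip idea described above can be sketched in plain Python. This is a minimal illustration, not the kernel in this PR: `ternarize` and `stc_update` are hypothetical names, the threshold is a fixed constant rather than an adaptive one, and the forget-gate decay is still applied densely.

```python
def ternarize(x, threshold):
    """Map each entry to -1, 0 or +1; entries with |x| <= threshold become 0."""
    return [0 if abs(a) <= threshold else (1 if a > 0 else -1) for a in x]


def stc_update(C_prev, k, v, fg, ig, threshold):
    """Return (C_new, writes) for C_new = fg * C_prev + ig * (kq outer vq).

    kq and vq are the ternarized k and v. Rows where kq is zero and columns
    where vq is zero contribute nothing, so they are never visited.
    """
    kq = ternarize(k, threshold)
    vq = ternarize(v, threshold)
    rows = [i for i, a in enumerate(kq) if a != 0]
    cols = [j for j, b in enumerate(vq) if b != 0]

    # Decay step (kept dense in this sketch).
    C_new = [[fg * c for c in row] for row in C_prev]

    # Sparse write: only active rows x active columns are touched.
    for i in rows:
        for j in cols:
            C_new[i][j] += ig * kq[i] * vq[j]
    return C_new, len(rows) * len(cols)


k = [0.9, 0.01, -0.7, 0.02]
v = [0.03, -0.8, 0.0, 0.6]
C_prev = [[0.0] * 4 for _ in range(4)]
C_new, writes = stc_update(C_prev, k, v, fg=0.5, ig=1.0, threshold=0.1)
print(writes, "of", len(k) * len(v), "entries written")
```

With these toy inputs, two entries of k and two entries of v survive quantization, so only 4 of the 16 memory entries receive a write.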

Changes

New modules

xlstm/modules/ternary_quantizer.py — adaptive quantizer with Straight-Through Estimator for gradient flow through the zero region
xlstm/modules/ternary_gate.py — ternary forget gate logic ({−1, 0, +1})
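
To make the Straight-Through Estimator concrete, here is a hand-written forward/backward pair in plain Python. It is only a sketch of the general STE idea: the function names are hypothetical, the threshold is fixed, and a real module would normally express this through the framework's autograd rather than an explicit backward function.

```python
def ternary_forward(x, threshold):
    """Forward pass: hard-quantize x to {-1, 0, +1}."""
    return [0.0 if abs(a) <= threshold else (1.0 if a > 0 else -1.0) for a in x]


def ternary_backward_ste(grad_output):
    """Backward pass under STE: treat the quantizer as the identity.

    The true derivative of the step function is zero almost everywhere, which
    would block learning. STE passes the upstream gradient through unchanged,
    including for entries that were quantized to zero.
    """
    return list(grad_output)


# Toy loss: L = sum(w_i * q_i), so dL/dq_i = w_i.
x = [0.02, -0.6, 0.4]
w = [1.5, -2.0, 0.5]
q = ternary_forward(x, threshold=0.1)
loss = sum(wi * qi for wi, qi in zip(w, q))
grad_q = w
grad_x = ternary_backward_ste(grad_q)
print(q, loss, grad_x)
```

Note that `x[0]` is quantized to zero in the forward pass but still receives a non-zero gradient, which is what lets it move out of the zero region during training.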

New kernels

xlstm/kernels/stc_sparse_update.py — PyTorch reference implementation
xlstm/kernels/stc_sparse_update.cpp / .cu — C++/CUDA stubs ready for a hardware-accelerated write-skip kernel

mLSTM integration

mLSTMCellConfig / mLSTMLayerConfig — two new opt-in fields: memory_backend ("dense" | "stc_sparse") and gate_mode ("sigmoid" | "ternary")
mLSTMCell — quantizers and ternary gate wired in; dense path is untouched
backends.py — recurrent_step_stabilized_simple updated to dispatch to the STC path
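
One way to picture the opt-in wiring is a config with validated fields and a dispatch on `memory_backend`. The sketch below is a stand-in, not the real `mLSTMCellConfig`: only the field names and their allowed values come from this PR, while `SketchCellConfig`, `select_update_path`, and the returned labels are hypothetical.

```python
from dataclasses import dataclass

MEMORY_BACKENDS = ("dense", "stc_sparse")
GATE_MODES = ("sigmoid", "ternary")


@dataclass
class SketchCellConfig:
    # Defaults reproduce the existing behavior.
    memory_backend: str = "dense"
    gate_mode: str = "sigmoid"

    def __post_init__(self):
        if self.memory_backend not in MEMORY_BACKENDS:
            raise ValueError(f"unknown memory_backend: {self.memory_backend!r}")
        if self.gate_mode not in GATE_MODES:
            raise ValueError(f"unknown gate_mode: {self.gate_mode!r}")


def select_update_path(cfg):
    """Pick the recurrent-step path; the dense branch is the untouched default."""
    if cfg.memory_backend == "stc_sparse":
        return "stc_sparse_step"
    return "dense_step"


print(select_update_path(SketchCellConfig()))
print(select_update_path(SketchCellConfig(memory_backend="stc_sparse", gate_mode="ternary")))
```

Keeping the dense branch as the fall-through means a config that never mentions the new fields takes exactly the old code path.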

Benchmarks

bench/dense_vs_stc.py — wall-clock latency comparison (dense vs STC, warmup-corrected)
bench/flops_saved.py — FLOPs analysis across sparsity levels
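
A warmup-corrected latency measurement generally means discarding the first few calls before timing. The helper below is a generic sketch of that pattern, not the contents of bench/dense_vs_stc.py: `time_fn`, its parameters, and the two placeholder workloads are hypothetical, and a GPU benchmark would also need to synchronize the device before reading the clock.

```python
import time


def time_fn(fn, warmup=5, iters=50):
    """Return the mean wall-clock seconds per call, excluding warmup calls."""
    for _ in range(warmup):
        fn()  # untimed: lets caches, allocators and lazy init settle
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters


def dense_step():
    # Placeholder workload standing in for a dense memory update.
    return sum(i * i for i in range(2000))


def sparse_step():
    # Placeholder workload standing in for an update that skips 90% of entries.
    return sum(i * i for i in range(200))


dense_t = time_fn(dense_step)
sparse_t = time_fn(sparse_step)
print(f"dense: {dense_t * 1e6:.1f} us, sparse: {sparse_t * 1e6:.1f} us")
```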

Opt-in usage

from xlstm import mLSTMLayerConfig

cfg = mLSTMLayerConfig(
    memory_backend="stc_sparse",
    gate_mode="ternary"
)

Default (memory_backend="dense", gate_mode="sigmoid") is identical to current behavior. No existing tests should be affected.

FLOPs at different sparsity levels

Update sparsity	Gate sparsity	Savings vs dense
90%	70%	~73%
90%	90%	~91%
95%	90%	~93%

Sparsity levels depend on input distribution and the EMA threshold in the quantizer. The STE ensures gradients flow through the quantizer during training.
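
How the threshold adapts determines the sparsity you actually get. The class below is one plausible scheme, assumed purely for illustration: it tracks an exponential moving average of the mean absolute input and zeroes everything at or below a fixed fraction of it. `EmaThreshold`, `decay`, `scale`, and `zero_fraction` are hypothetical and may not match the quantizer in this PR.

```python
class EmaThreshold:
    """Threshold = scale * EMA(mean |x|). Hypothetical scheme, for illustration."""

    def __init__(self, decay=0.9, scale=0.5):
        self.decay = decay
        self.scale = scale
        self.ema = None  # initialized from the first input

    def update(self, x):
        mean_abs = sum(abs(a) for a in x) / len(x)
        if self.ema is None:
            self.ema = mean_abs
        else:
            self.ema = self.decay * self.ema + (1.0 - self.decay) * mean_abs
        return self.scale * self.ema


def zero_fraction(x, threshold):
    """Fraction of entries that a ternary quantizer would map to zero."""
    return sum(1 for a in x if abs(a) <= threshold) / len(x)


tracker = EmaThreshold(decay=0.9, scale=0.5)
x = [0.05, -0.1, 0.8, -1.2, 0.02, 0.3]
threshold = tracker.update(x)
print(threshold, zero_fraction(x, threshold))
```

In a scheme like this, raising `scale` pushes more entries into the zero region, while `decay` controls how quickly the threshold follows a shift in the input distribution.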

What's not included

Hardware-accelerated write-skip for the CUDA path — the .cu stub is there but the sparse dispatch logic needs a proper CSC/COO kernel. Happy to implement that in a follow-up if the approach looks good.

github-actions · 2026-04-13T06:40:55Z

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

simeon-kepp · 2026-04-13T10:12:55Z

I have read the CLA Document and I hereby sign the CLA

martinloretzzz · 2026-05-18T12:01:56Z

Did you train any models with it?

simeon-kepp · 2026-05-20T15:23:38Z

Not with xLSTM specifically, but we've validated ternary-thresholded sparse updates in our own training runs. A 17L MoE we're actively training uses the same core idea (adaptive {-1, 0, +1} quantization to skip low-magnitude outer-product writes). The skip pattern and fill rates we observe there are what motivated this PR. Happy to share training curves or benchmark numbers if useful for review.

simeon-kepp · 2026-05-21T05:22:45Z

Following up on the offer to share numbers.

The closest analogue in our training data is albert. — an 18L ternary MoE (12 experts, Top-3 routing, 256H, 32k vocab) currently at epoch 2572. The STE-trained ternary weights do produce real sparsity that scales with training depth:

Layer-wise zero-weight fraction (TELE telemetry, ep2572):

Layer range	Sparsity
L0–L3 (embedding-adjacent)	3.1–3.7%
L4–L8 (mid)	3.9–4.0%
L9–L17 (deep)	4.1–5.5%

This is weight-level sparsity after ~2500 epochs. The gradient flow through the STE zero region stays active (we track per-layer gradient norms and the zero-region pass-through is essential — without it, early layers freeze entirely).

Zero-skip speedup (hardware-verified, x86 AVX2, i7-4800MQ):

Zero fraction	Speedup vs dense
10%	1.01×
25%	1.18×
50%	1.45×
75%	2.01×
90%	3.29×

Crossover point: empirically ~10% — below that, branch misprediction overhead on the skip check costs more than it saves (12–18% branch miss rates measured). The STC backend would want the EMA threshold tuned to stay above that floor in practice.
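
For context, the measured speedups can be compared against an idealized bound. Assuming, as a simplification not stated in the comment, that cost is purely proportional to the non-zero fraction, the best case for zero fraction z is 1/(1 - z). The snippet below does that arithmetic on the table above; `ideal_speedup` is a hypothetical helper.

```python
# (zero fraction, measured speedup) pairs copied from the table above.
measured = [(0.10, 1.01), (0.25, 1.18), (0.50, 1.45), (0.75, 2.01), (0.90, 3.29)]


def ideal_speedup(zero_fraction):
    """Upper bound if skipped work were free and the skip check cost nothing."""
    return 1.0 / (1.0 - zero_fraction)


for z, speedup in measured:
    bound = ideal_speedup(z)
    print(f"z={z:.2f}  measured={speedup:.2f}x  ideal={bound:.2f}x  "
          f"ratio={speedup / bound:.2f}")
```

Under that assumption, the measured-to-ideal ratio shrinks as the zero fraction grows, which would be consistent with per-element overhead (such as the skip check itself) that does not shrink with sparsity.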

For MoE specifically: at 75% routing sparsity (9/12 experts skipped per step), we measure 3.97× end-to-end MLP throughput with output divergence < 1e-4. The mLSTM outer-product skip is a different primitive but the same hardware phenomenon.

Full benchmark suite (reproducible, §8–§11): https://github.com/eriirfos-eng/ternary-intelligence-stack/blob/main/ternlang-root/BENCHMARKS.md

Happy to answer questions about the STE behaviour in the zero region specifically — that's where most of the practical tuning lives.

feat(xlstm): add optional sparse covariance backend for matrix memory…

3bef42e

… updates Introduce Sparse Ternary Covariance (STC) backend with optional write-skip acceleration. Includes ternary quantizer with STE, ternary gate module, and benchmarking suite.

remove: PR_SUMMARY.md and internal bench files

39a3ca6

martinloretzzz closed this May 28, 2026

github-actions Bot locked and limited conversation to collaborators May 28, 2026

simeon-kepp wants to merge 2 commits into NX-AI:main from simeon-kepp:feature/stc-sparse-backend
