
Record: Parallel Muon + Parameter Banking — 81.87ms/step, val_bpb 1.1247 (3-seed mean)#399

Open
abaybektursun wants to merge 4 commits into openai:main from abaybektursun:submission/parallel-muon-82ms

Conversation

abaybektursun commented Mar 22, 2026

Systems Optimization: 81.87 ms/step, all artifacts under 16 MB

Pure training speed optimization. Model architecture and hyperparameters are unchanged — only the optimizer and weight storage layout are modified.

3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128, 600s)

| Seed | step_avg | Steps | int6 sliding val_bpb | Artifact |
| --- | --- | --- | --- | --- |
| 1337 | 81.86 ms | 7,331 | 1.1241 | 15,830,960 bytes |
| 42 | 81.88 ms | 7,328 | 1.1253 | 15,819,728 bytes |
| 2025 | 81.86 ms | 7,330 | 1.1247 | 15,796,052 bytes |
| Mean | 81.87 ms | 7,330 | 1.1247 (std 0.0006) | ~15.8 MB |

What changed

Three optimizer techniques that together replace 66 sequential Newton-Schulz calls with 4 batched operations:

1. Parameter Banking

3D nn.Parameter banks replace 66 separate nn.Linear weights:

  • qo_bank: (22, 512, 512) — Q + Out projections
  • kv_bank: (22, 256, 512) — K + V projections
  • mlp_up_bank: (11, 1536, 512) — MLP up
  • mlp_down_bank: (11, 512, 1536) — MLP down

Forward pass: F.linear(x, bank[layer_idx]). Compiled forward+backward verified identical to the unbanked baseline, with matching timings (72.33ms vs 72.59ms).
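A minimal sketch of the banking idea, assuming the qo_bank shapes above; the class and initialization are illustrative, not the PR's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BankedProjections(nn.Module):
    """Illustrative sketch: one 3D parameter bank replaces many separate
    nn.Linear weights. Slice counts/shapes follow the PR's qo_bank
    (22 slices of 512x512); init scale is an assumption."""
    def __init__(self, n_slices=22, out_features=512, in_features=512):
        super().__init__()
        self.qo_bank = nn.Parameter(
            torch.randn(n_slices, out_features, in_features) * 0.02
        )

    def forward(self, x, layer_idx):
        # Equivalent to the original per-layer nn.Linear(512, 512, bias=False)
        return F.linear(x, self.qo_bank[layer_idx])
```

Because the 66 weights live in four contiguous 3D tensors, the optimizer can treat each bank as a single batched operand instead of looping over individual matrices.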

2. Batched Newton-Schulz

Standard NS coefficients (a=3.4445, b=-4.7750, c=2.0315) applied in a single batched torch.bmm over all bank layers, instead of 66 sequential torch.mm calls.
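The batched iteration can be sketched as follows; this is the standard Muon-style quintic Newton-Schulz applied bank-wide with torch.bmm, kept in float32 here for clarity (a real run would typically cast to bfloat16). The function name and step count are illustrative:

```python
import torch

def batched_newton_schulz(G, steps=5, eps=1e-7):
    """Orthogonalize every slice of a 3D gradient bank G: (L, m, n) at once.
    Standard NS quintic coefficients; one torch.bmm pass covers all L
    slices instead of L sequential torch.mm calls. Illustrative sketch."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()  # real runs typically use bfloat16
    # Normalize each slice so its spectral norm is <= 1
    X = X / (X.norm(dim=(1, 2), keepdim=True) + eps)
    tall = G.size(1) > G.size(2)
    if tall:  # work with the smaller Gram matrix
        X = X.transpose(1, 2)
    for _ in range(steps):
        A = torch.bmm(X, X.transpose(1, 2))
        B = b * A + c * torch.bmm(A, A)
        X = a * X + torch.bmm(B, X)
    if tall:
        X = X.transpose(1, 2)
    return X
```

Each batched matmul launches one kernel for all bank slices, which is where the speedup over 66 per-matrix calls comes from.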

Note: We initially used Polar Express (arXiv:2505.16932) — per-iteration minimax-optimal polynomial coefficients that achieve 35% tighter orthogonalization (0.21 vs 0.32 relative error). However, PE produces weight value distributions that are ~190KB harder to compress, pushing artifacts over 16MB. Reverting to standard NS fixed the artifact size with no measurable quality difference.

3. Parallel Muon (arXiv:2511.07464)

DDP removed for bank params. Post-backward communication scheduled explicitly:

  1. Launch async reduce_scatter for all banks (biggest first)
  2. all_reduce + Adam step on small params (while bank RS is in-flight)
  3. Wait for RS, local batched NS on each GPU's shard, async all_gather

Why DDP doesn't work with banking

Bank gradients aggregate across all 11 layers → available only at end of backward → DDP can't overlap all-reduce with compute. Result: +4ms regression with banks in DDP (88.8ms vs 84.8ms baseline). Fix: remove DDP for banks, schedule communication explicitly in optimizer step. This follows the approach used in modded-nanogpt — no DDP at all, fully manual communication scheduling.

What we tried and learned

| Approach | Result | Lesson |
| --- | --- | --- |
| Non-surgery batching (keep 66 params, batch in optimizer) | 85.73ms (+1ms vs baseline) | Gather/scatter kernel launch overhead offsets PE speedup |
| DDP with banks | 88.8ms (+4ms regression) | Bank grads only available at end of backward, zero overlap |
| DDP with _ddp_params_and_buffers_to_ignore | 88ms | Still no overlap for bank all-reduce |
| Polar Express | 82ms but 16.2MB artifacts | PE weights compress ~190KB worse than NS |
| Parallel Muon + NS | 81.87ms, 15.8MB | Winner |

PRs we tested our optimizer against

| Base PR | Speed win? | Score win? | Why |
| --- | --- | --- | --- |
| #315 (EMA only) | Yes (-3.4%) | Yes (-0.0006) | Extra steps directly improve EMA score |
| #374 (Tight SWA) | Yes (-3.5%) | No (+0.001) | SWA averages warmdown weights; extra steps don't help |
| #401 (EMA+SWA stack) | Yes (-2.8%) | No (+0.0005) | Same SWA dilution |
| #398 (aggressive TTT) | Yes (-2.3%) | No (+0.004) | TTT paradox: more trained model = less TTT gain |
| #332 (12L grad-quant) | Yes (-12%) | Untested | Gradient-guided quant didn't activate with our optimizer |

Key finding: Only PRs using simple EMA (no SWA/TTT) benefit from faster training, because EMA quality improves monotonically with more steps. SWA averages warmdown weights, and TTT paradoxically benefits from less-trained models.

Credits

  • PR #315 by @jfprincz — architecture config (11L Partial RoPE + LN Scale + EMA + XSA4) used for benchmarking
  • PR #287 — base 11L stack (SmearGate, BigramHash, OrthoInit, FA3)
  • Parallel Muon — async reduce-scatter/all-gather scheduling pattern
  • modded-nanogpt — inspiration for DDP-free manual communication

🤖 Generated with Claude Code

Systems optimization built on PR openai#315 by @jfprincz (11L XSA4+EMA, 1.1248 bpb).
Same architecture, same hyperparameters, only optimizer changed.

82.14ms/step vs 84.76ms baseline = 7,306 steps vs 7,079 in 600s.
Pre-quant val_bpb 1.1421 (identical to baseline).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun and others added 2 commits March 22, 2026 00:13
…1.1248)

Unbank state dict before quantization so int6 per-row scales match baseline.
Rebank after dequantization for roundtrip eval.

Results: 82.13ms/step, 7,306 steps, int6 sliding window val_bpb 1.1238.
Artifact: 16.06MB (int6+zstd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Seeds 42, 1337, 2025: mean 82.08ms/step, val_bpb 1.1239 (std 0.0001).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun abaybektursun changed the title Record: Parallel Muon + Parameter Banking — 82.14ms/step (3.1% faster than PR #315) Record: Parallel Muon + Parameter Banking — 82.08ms/step (3.2% faster than PR #315) Mar 22, 2026
@abaybektursun abaybektursun force-pushed the submission/parallel-muon-82ms branch from 5f4d141 to 4db0057 on March 22, 2026 15:24
@abaybektursun abaybektursun changed the title Record: Parallel Muon + Parameter Banking — 82.08ms/step (3.2% faster than PR #315) Record: Parallel Muon + Parameter Banking + Polar Express — 82.14ms/step (3.1% faster than PR #315) Mar 22, 2026
Replaced Polar Express with standard Newton-Schulz + switched to lzma compression.
3-seed results: 81.87ms/step mean, 1.1247 sliding bpb mean, all artifacts ~15.8MB.

Seed 1337: 7331 steps, 1.1241 bpb, 15,830,960 bytes
Seed 42:   7328 steps, 1.1253 bpb, 15,819,728 bytes
Seed 2025: 7330 steps, 1.1247 bpb, 15,796,052 bytes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun abaybektursun changed the title Record: Parallel Muon + Parameter Banking + Polar Express — 82.14ms/step (3.1% faster than PR #315) Record: Parallel Muon + Parameter Banking — 81.87ms/step, val_bpb 1.1247 (3-seed mean) Mar 22, 2026