Record: Parallel Muon + Parameter Banking — 81.87ms/step, val_bpb 1.1247 (3-seed mean)#399
Open
abaybektursun wants to merge 4 commits into openai:main
Conversation
Systems optimization built on PR openai#315 by @jfprincz (11L XSA4+EMA, 1.1248 bpb). Same architecture, same hyperparameters, only optimizer changed. 82.14ms/step vs 84.76ms baseline = 7,306 steps vs 7,079 in 600s. Pre-quant val_bpb 1.1421 (identical to baseline). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…1.1248) Unbank state dict before quantization so int6 per-row scales match baseline. Rebank after dequantization for roundtrip eval. Results: 82.13ms/step, 7,306 steps, int6 sliding window val_bpb 1.1238. Artifact: 16.06MB (int6+zstd). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Seeds 42, 1337, 2025: mean 82.08ms/step, val_bpb 1.1239 (std 0.0001). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5f4d141 to 4db0057
Replaced Polar Express with standard Newton-Schulz and switched to lzma compression. 3-seed results: 81.87ms/step mean, 1.1247 sliding bpb mean, all artifacts ~15.8MB.

- Seed 1337: 7331 steps, 1.1241 bpb, 15,830,960 bytes
- Seed 42: 7328 steps, 1.1253 bpb, 15,819,728 bytes
- Seed 2025: 7330 steps, 1.1247 bpb, 15,796,052 bytes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Systems Optimization: 81.87 ms/step, all artifacts under 16 MB
Pure training-speed optimization. Model architecture and hyperparameters are unchanged; only the optimizer and the weight-storage layout are modified.
3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128, 600s)

| Seed | Steps | val_bpb (sliding) | Artifact bytes |
|------|-------|-------------------|----------------|
| 42   | 7328  | 1.1253            | 15,819,728     |
| 1337 | 7331  | 1.1241            | 15,830,960     |
| 2025 | 7330  | 1.1247            | 15,796,052     |

Mean: 81.87ms/step, val_bpb 1.1247, all artifacts ~15.8MB.
What changed
Three optimizer techniques replace 66 sequential Newton-Schulz calls with 4 batched operations:
1. Parameter Banking
3D `nn.Parameter` banks replace 66 separate `nn.Linear` weights:

- `qo_bank`: (22, 512, 512), Q + Out projections
- `kv_bank`: (22, 256, 512), K + V projections
- `mlp_up_bank`: (11, 1536, 512), MLP up
- `mlp_down_bank`: (11, 512, 1536), MLP down

Forward pass: `F.linear(x, bank[layer_idx])`. Compiled forward+backward verified identical: 72.33ms vs 72.59ms.

2. Batched Newton-Schulz
Standard NS coefficients (a=3.4445, b=-4.7750, c=2.0315) applied in a single batched `torch.bmm` over all bank layers, instead of 66 sequential `torch.mm` calls.

3. Parallel Muon (arXiv:2511.07464)
DDP is removed for bank params. Post-backward communication is scheduled explicitly:

- `reduce_scatter` for all banks (biggest first)
- `all_reduce` + Adam step on small params (while bank reduce_scatter is in flight)
- `all_gather` for the updated banks

Why DDP doesn't work with banking
Bank gradients aggregate across all 11 layers, so they become available only at the end of the backward pass, and DDP cannot overlap its all-reduce with compute. Result: a +4ms regression with banks under DDP (88.8ms vs 84.8ms baseline). Fix: remove DDP for the banks and schedule communication explicitly in the optimizer step. This follows the approach used in modded-nanogpt: no DDP at all, fully manual communication scheduling.
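A minimal, runnable sketch of techniques 1 and 2 above, with toy shapes; the function name and sizes here are illustrative, not the PR's actual code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def batched_newton_schulz(G, steps=5, a=3.4445, b=-4.7750, c=2.0315):
    """Orthogonalize every (m, n) slice of an (L, m, n) gradient bank at once.

    One torch.bmm per iteration replaces L sequential torch.mm calls.
    """
    X = G / (G.norm(dim=(1, 2), keepdim=True) + 1e-7)  # per-slice normalization
    tall = X.shape[1] > X.shape[2]
    if tall:                                            # NS iterates on wide matrices
        X = X.transpose(1, 2)
    for _ in range(steps):
        A = torch.bmm(X, X.transpose(1, 2))             # one bmm covers all layers
        B = b * A + c * torch.bmm(A, A)
        X = a * X + torch.bmm(B, X)
    return X.transpose(1, 2) if tall else X

# Banked forward: one 3D parameter instead of many nn.Linear weights
# (toy sizes: a 4-layer bank with model dim 32).
bank = torch.nn.Parameter(torch.randn(4, 32, 32) / 32 ** 0.5)
x = torch.randn(2, 8, 32)              # (batch, seq, dim)
y = F.linear(x, bank[1])               # layer 1's projection is just a bank slice

grads = torch.randn(4, 32, 32)         # stand-in gradient for the whole bank
update = batched_newton_schulz(grads)  # one batched call for all 4 layers
```

Because these are the standard quintic Newton-Schulz coefficients, each slice of `update` has its singular values pushed toward 1.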
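The explicit post-backward schedule can be sketched as follows. The collective ops here are single-process stand-ins so the snippet runs anywhere; the real optimizer issues asynchronous `torch.distributed` calls instead:

```python
import torch

WORLD = 4  # illustrative world size

def reduce_scatter(per_rank_grads):
    """Stand-in for dist.reduce_scatter_tensor: sum the per-rank gradients
    and hand this rank its 1/WORLD shard (we play rank 0 here)."""
    summed = torch.stack(per_rank_grads).sum(0)
    return summed.chunk(WORLD, dim=0)[0]

def all_gather(shard):
    """Stand-in for dist.all_gather_into_tensor: rebuild the full tensor."""
    return torch.cat([shard] * WORLD, dim=0)

# Bank gradients as seen by each of the WORLD ranks (toy sizes).
bank_grads = {
    "mlp_up_bank": [torch.randn(8, 6) for _ in range(WORLD)],
    "kv_bank":     [torch.randn(4, 6) for _ in range(WORLD)],
}

schedule = []

# 1) Issue reduce_scatter for every bank, biggest first, so the largest
#    transfer overlaps the most subsequent work.
order = sorted(bank_grads, key=lambda k: bank_grads[k][0].numel(), reverse=True)
shards = {}
for name in order:
    shards[name] = reduce_scatter(bank_grads[name])
    schedule.append(f"reduce_scatter:{name}")

# 2) While the bank reduce_scatters are in flight: all_reduce + Adam step
#    on the small (non-bank) params.
schedule.append("all_reduce+adam:small_params")

# 3) Update the owned shard, then all_gather the updated banks.
for name in order:
    full = all_gather(shards[name])
    schedule.append(f"all_gather:{name}")
```

The point of the manual schedule is simply ordering: communication for the big banks is launched first, independent small-param work fills the gap, and the gathers come last.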
What we tried and learned
- `_ddp_params_and_buffers_to_ignore`

PRs we tested our optimizer against
Key finding: Only PRs using simple EMA (no SWA/TTT) benefit from faster training, because EMA quality improves monotonically with more steps. SWA averages warmdown weights, and TTT paradoxically benefits from less-trained models.
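A toy sketch of the simple-EMA update that finding refers to (the function name is illustrative): every extra training step blends one more, newer weight snapshot into the average, which is why EMA benefits directly from running more steps.

```python
def ema_update(ema, weights, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights, elementwise."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]

# The EMA keeps absorbing newer snapshots; faster training means more
# snapshots absorbed within the same wall-clock budget.
ema = [0.0]
for step in range(1, 4):
    ema = ema_update(ema, [float(step)])  # pretend the weights improve each step
```

SWA, by contrast, averages a fixed warmdown window, so extra steps shift the window rather than simply adding information to the running average.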
Credits
🤖 Generated with Claude Code