
11L XSA4 + SmearGate + BigramHash + SWA + RoPE50K (mean val_bpb=1.1565, 3 seeds)#333

Open
mahsumaktas wants to merge 2 commits into openai:main from mahsumaktas:submission/v2-11L-xsa-swa-1.1538

Conversation

@mahsumaktas

Summary

Mean val_bpb: 1.1565 (3 seeds) | Best: 1.1538 (seed 1337) | Artifact: ~15.9 MB

23 GPU runs on 8xH100 SXM5. Systematic exploration of XSA, EMA vs SWA, depth recurrence, seq curriculum, LR/WD sweep, and MLP scaling.

Techniques

  • 11 transformer layers + XSA on last 4 layers
  • SmearGate + BigramHash(2048) + OrthoInit
  • INT6 per-row quantization + zstd-22 + FP16 tied embedding + Late-K FP16
  • SWA every 50 steps (fp32 accumulation) — bf16 causes catastrophic loss
  • Muon WD=0.04 + grad clip 0.3 + RoPE base 50K
  • Overtone SVD init + Phase-transition residual mixing
  • MLP 2.75x — sweet spot (3x exceeds 16MB with SmearGate at 11L)
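The INT6 per-row quantization in the export path can be sketched as follows. This is a minimal numpy sketch, not the repo's implementation: the function names, the signed [-31, 31] code range, and the fp16 per-row scales are assumptions; the real exporter then packs the codes and compresses the artifact with zstd at level 22.

```python
import numpy as np

def quantize_int6_per_row(w):
    """Symmetric per-row INT6: each row gets its own scale (sketch;
    the [-31, 31] code range and fp16 scales are assumptions)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)            # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float16)                  # tiny per-row overhead

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(8, 64)).astype(np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize_int6(q, s)
# In the PR, the packed codes are then compressed with zstd level 22.
max_err = float(np.abs(w - w_hat).max())
```

Per-row scales bound the reconstruction error at roughly half a quantization step of each row's largest entry, and the low-entropy INT6 codes are what give zstd-22 something to compress.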

Results (3 seeds)

Seed   Sliding BPB   Post-quant BPB   Artifact
1337   1.1538        1.1766           15.99 MB
42     1.1565        1.1790           15.87 MB
7      1.1593        1.1820           15.93 MB
Mean   1.1565

Key Findings from 23 Runs

  • EMA(0.997) causes 0.14 BPB quant gap — SWA far better for our stack
  • 11L MLP 3x exceeds 16MB with SmearGate+BigramHash
  • SmearGate removal loses more than MLP 3x gains — bigram context matters
  • XSA needs GQA-compatible v expansion (repeat_interleave, bug found and fixed)
  • Seq curriculum doesn't work — SWA checkpoint incompatibility across seq lengths
  • Depth recurrence works but dim=640 too narrow; dim=768+ exceeds 16MB
  • Higher LR (0.03) improves BPB but worsens compression (larger weights)
  • Late QAT (starting at 75% of training) reduces quant gap (0.023 -> 0.006) but leaves fewer full-precision steps
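The GQA-compatible v expansion mentioned in the XSA finding can be illustrated in isolation. A numpy sketch with hypothetical shapes: `np.repeat` along the head axis behaves like `torch.repeat_interleave`, which is what the fix used.

```python
import numpy as np

# Hypothetical GQA shapes: 8 query heads share 2 KV heads, so n_rep = 4.
batch, n_q_heads, n_kv_heads, seq, head_dim = 2, 8, 2, 16, 32
n_rep = n_q_heads // n_kv_heads

v = np.random.default_rng(0).normal(size=(batch, n_kv_heads, seq, head_dim))

# repeat_interleave along the head axis yields [kv0, kv0, kv0, kv0, kv1, ...],
# so query head i attends over KV head i // n_rep. A plain tile/concat would
# instead interleave heads as [kv0, kv1, kv0, kv1, ...], silently pairing
# query heads with the wrong KV heads -- the bug class the finding refers to.
# (np.repeat matches torch.repeat_interleave here.)
v_expanded = np.repeat(v, n_rep, axis=1)
```

The shape check is cheap insurance: the wrong expansion produces the same output shape, so only inspecting which source head lands at each position catches the bug.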

Run command

NUM_LAYERS=11 XSA_LAST_N=4 BIGRAM_VOCAB_SIZE=2048 \
TRAIN_SEQ_LEN=2048 TRAIN_BATCH_TOKENS=524288 MLP_MULT=2.75 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_WEIGHT_DECAY=0.04 WARMDOWN_ITERS=3000 \
SWA_ENABLED=1 SWA_EVERY=50 ROPE_BASE=50000 EVAL_STRIDE=64 \
python3 -m torch.distributed.run --standalone --nproc_per_node=8 train_gpt.py
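The fp32 accumulation behind SWA_ENABLED=1 / SWA_EVERY=50 amounts to a running average kept in fp32 regardless of the training dtype. A minimal sketch, assuming nothing about the actual hooks in train_gpt.py: averaging directly in bf16 (about 8 mantissa bits) rounds away the small per-checkpoint deltas, which is consistent with the catastrophic-loss note above.

```python
import numpy as np

class SWA:
    """Running average of checkpoints, accumulated in fp32 (sketch;
    the real train_gpt.py integration is assumed, not known)."""
    def __init__(self):
        self.avg = None
        self.n = 0

    def update(self, params):
        # Cast each incoming (possibly bf16) tensor up to fp32 first;
        # the incremental form avg += (p - avg) / n avoids a large sum buffer.
        self.n += 1
        if self.avg is None:
            self.avg = [p.astype(np.float32).copy() for p in params]
        else:
            for a, p in zip(self.avg, params):
                a += (p.astype(np.float32) - a) / self.n

swa = SWA()
for k in range(1, 4):
    # in training this would fire only when step % 50 == 0
    swa.update([np.full((4,), 1.0 + 0.01 * k, dtype=np.float32)])
print(float(swa.avg[0][0]))  # running mean of 1.01, 1.02, 1.03
```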

Test plan

  • Runs reproducibly on 8xH100 SXM in under 10 minutes
  • Artifact under 16 MB (15.87-15.99 MB)
  • 3-seed validation (mean 1.1565, std 0.0028)
  • Sliding window eval completes within 10 minutes
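The stride-64 sliding-window evaluation (EVAL_STRIDE=64) works by advancing a full-length window by `stride` tokens at a time and scoring only the tokens each window adds beyond the previous one, so every scored token keeps near-full left context. Index bookkeeping only; the real eval in train_gpt.py is assumed to differ in details.

```python
def sliding_windows(n_tokens: int, seq_len: int, stride: int):
    """Yield (ctx_start, ctx_end, n_scored) triples covering every token once.
    Assumes seq_len > stride, as with seq_len=2048, stride=64 in this PR."""
    spans = []
    prev_end = 0
    begin = 0
    while prev_end < n_tokens:
        end = min(begin + seq_len, n_tokens)
        spans.append((begin, end, end - prev_end))  # score only the new tokens
        prev_end = end
        begin += stride
    return spans

# Toy numbers (the real run uses longer sequences): after the first window,
# every scored token sees at least seq_len - stride tokens of context.
spans = sliding_windows(n_tokens=200, seq_len=128, stride=64)
total_scored = sum(n for _, _, n in spans)
```

Smaller strides tighten the context guarantee but multiply forward passes, which is why the eval-time budget in the test plan matters.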

Built with Claude Code

Mahsum and others added 2 commits March 20, 2026 10:19
…l_bpb=1.1754)

Combines 10 orthogonal improvements over the naive baseline:
- Per-row INT6 quantization + zstd-22 compression (13.98 MB artifact)
- FP16 tied embedding export (near-zero quantization gap)
- MLP 2.5x expansion
- SmearGate + BigramHash bigram-aware modules
- OrthoInit + muP scaling + phase-transition residual mixing
- Muon weight decay (0.02)
- Stochastic Weight Averaging (4 checkpoints)
- Sliding window evaluation (stride=64)
- Tuned hyperparameters (grad_clip=0.3, warmdown=3000)

8xH100 SXM, 9919 steps in 10 minutes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major upgrade from V1 to V2 with 23 GPU runs on 8xH100:
- 11 layers (was 9) + XSA on last 4 layers
- MLP 2.75x (was 2.5x) — sweet spot for 16MB at 11L
- RoPE base 50K, LR 0.025, SmearGate + BigramHash(2048)
- SWA/50 with fp32 accumulation (bf16 catastrophic fix)
- OrthoInit + Overtone SVD + Phase-transition residual mixing
- INT6 + zstd-22 + FP16 tied embed + Late-K FP16

3-seed validation: 1.1538 / 1.1565 / 1.1593 (mean 1.1565, std 0.0028)
Artifact: 15.87-15.99 MB (all under 16MB)

23 runs tested: EMA, depth recurrence, seq curriculum, LR sweep,
WD sweep, QAT, MLP 3x — documented in README.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
