11L XSA4 + SmearGate + BigramHash + SWA + RoPE50K (mean val_bpb=1.1565, 3 seeds) #333
Open
mahsumaktas wants to merge 2 commits into openai:main from
Conversation
…l_bpb=1.1754)

Combines 10 orthogonal improvements over the naive baseline:
- Per-row INT6 quantization + zstd-22 compression (13.98 MB artifact)
- FP16 tied embedding export (near-zero quantization gap)
- MLP 2.5x expansion
- SmearGate + BigramHash bigram-aware modules
- OrthoInit + muP scaling + phase-transition residual mixing
- Muon weight decay (0.02)
- Stochastic Weight Averaging (4 checkpoints)
- Sliding-window evaluation (stride=64)
- Tuned hyperparameters (grad_clip=0.3, warmdown=3000)

8xH100 SXM, 9919 steps in 10 minutes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
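The per-row INT6 quantization + zstd-22 export can be sketched roughly as follows. All function names here are illustrative (the PR's actual export code is not shown), and stdlib `zlib` stands in for the zstd level-22 compressor the artifact actually uses:

```python
# Sketch of per-row INT6 weight quantization for a small compressed artifact.
# Hypothetical helper names; zlib is a stand-in for zstd -22.
import zlib
import numpy as np

def quantize_int6_per_row(w):
    """Map each row of a float32 matrix to signed 6-bit codes in [-31, 31],
    with one float16 scale per row."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scales[scales == 0] = 1.0                      # guard all-zero rows
    codes = np.clip(np.rint(w / scales), -31, 31).astype(np.int8)
    return codes, scales.astype(np.float16)

def dequantize_int6(codes, scales):
    """Reconstruct float32 weights from codes and per-row scales."""
    return codes.astype(np.float32) * scales.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 256)).astype(np.float32)
codes, scales = quantize_int6_per_row(w)
blob = zlib.compress(codes.tobytes(), level=9)     # zstd -22 in the real export
err = np.abs(dequantize_int6(codes, scales) - w).max()
```

Because the codes use only a 63-symbol alphabet (6 bits of information per stored byte), the entropy coder recovers most of the 2-bit-per-weight slack, which is how the artifact lands well under the raw int8 size.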
Major upgrade from V1 to V2 with 23 GPU runs on 8xH100:
- 11 layers (was 9) + XSA on last 4 layers
- MLP 2.75x (was 2.5x), the sweet spot for 16 MB at 11L
- RoPE base 50K, LR 0.025, SmearGate + BigramHash(2048)
- SWA/50 with fp32 accumulation (fixes catastrophic bf16 accumulation)
- OrthoInit + Overtone SVD + phase-transition residual mixing
- INT6 + zstd-22 + FP16 tied embed + Late-K FP16

3-seed validation: 1.1538 / 1.1565 / 1.1593 (mean 1.1565, std 0.0028)
Artifact: 15.87-15.99 MB (all under 16 MB)

23 runs tested: EMA, depth recurrence, seq curriculum, LR sweep, WD sweep, QAT, MLP 3x; documented in README.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
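The "fp32 accumulation" fix for SWA is the standard running-mean-of-checkpoints trick: keeping the accumulator in a low-precision format loses the small per-step corrections. A minimal numpy sketch (float16 standing in for bf16, which numpy lacks; the PR's training loop is not shown):

```python
# Sketch of Stochastic Weight Averaging as a running mean over checkpoints,
# comparing an fp32 accumulator against a low-precision one (fp16 as a
# stand-in for bf16). Illustrative only.
import numpy as np

def swa_running_mean(checkpoints, dtype):
    """Average checkpoints incrementally: avg += (w - avg) / n."""
    avg = checkpoints[0].astype(dtype)
    for n, w in enumerate(checkpoints[1:], start=2):
        avg = avg + (w.astype(dtype) - avg) / dtype(n)
    return avg

rng = np.random.default_rng(1337)
base = rng.standard_normal(10_000).astype(np.float32)
# 4 checkpoints = small perturbations of a shared solution, as late in training
ckpts = [base + 0.01 * rng.standard_normal(10_000).astype(np.float32)
         for _ in range(4)]

exact = np.mean(ckpts, axis=0)
err32 = np.abs(swa_running_mean(ckpts, np.float32) - exact).max()
err16 = np.abs(swa_running_mean(ckpts, np.float16) - exact).max()
```

With an O(1)-magnitude weight and 0.01-scale differences between checkpoints, the per-step update `(w - avg) / n` is near the rounding step of the 16-bit format, so the low-precision accumulator swallows much of it; fp32 tracks the exact mean.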
Summary
Mean val_bpb: 1.1565 (3 seeds) | Best: 1.1538 (seed 1337) | Artifact: ~15.9 MB
23 GPU runs on 8xH100 SXM5. Systematic exploration of XSA, EMA vs SWA, depth recurrence, seq curriculum, LR/WD sweep, and MLP scaling.
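The sliding-window evaluation mentioned in the commits (stride=64) scores each token with a long left context while only counting loss on the newly advanced tokens. A minimal sketch, where `nll_fn` is a hypothetical stand-in for the model's per-token negative log-likelihood:

```python
# Sketch of sliding-window bits-per-byte evaluation: slide a context window
# over the token stream in steps of `stride`, scoring only the new tokens.
# `nll_fn` is hypothetical; it returns one NLL (in nats) per token of input.
import math

def sliding_window_bpb(tokens, nll_fn, window=512, stride=64, bytes_per_token=1):
    total_nll, counted, pos = 0.0, 0, 0
    while pos < len(tokens):
        start = max(0, pos + stride - window)      # keep up to `window` context
        chunk = tokens[start:pos + stride]
        new = min(stride, len(tokens) - pos)       # tokens scored this step
        nlls = nll_fn(chunk)
        total_nll += sum(nlls[-new:])              # loss only on the new tokens
        counted += new
        pos += stride
    # bits per byte = mean NLL in nats / (ln 2 * bytes per token)
    return total_nll / counted / math.log(2) / bytes_per_token
```

A smaller stride gives every scored token more left context (at more compute), which is why strided evaluation reads lower than chunked evaluation on the same model.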
Techniques
Results (3 seeds)
Key Findings from 23 Runs
Run command
Test plan
Built with Claude Code