Merge pull request #561 from openai/codex/update-readme-leaderboard-merged-records by peterpodj · Pull Request #1 · peterpodj/parameter-golf

peterpodj · 2026-03-24T09:00:33Z

Comparing updates

Update README.md

## Submission: Mixed Quantization (int6 blocks + int8 embeddings) + Sliding Window Eval

**val_bpb: 1.1630** | **Total size: 15,353,490 bytes** (under 16MB)

Four orthogonal improvements over the naive baseline:

1. **Wider MLP (MLP_MULT=3)**: 2x→3x expansion (hidden=1536), enabled by aggressive quantization
2. **Mixed-precision quantization**: int6 per-row (31 levels) on STE-protected block weights, int8 per-row (127 levels) on the token embedding, which lacks STE fake-quant. Reduces quant penalty from +0.048 to +0.0015 BPB.
3. **Optimized throughput**: seq_len=1024 + batch=524K tokens for 48.4ms/step, ~6.5B total tokens in 10 minutes
4. **Sliding window eval (stride=64)**: each scored token gets 960 tokens of context, ~0.034 BPB improvement, zero artifact cost

### Run command

```bash
RUN_ID=v2_int6_qat_mlp3 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=2000 TRAIN_LOG_EVERY=200 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

### Key metrics

| Metric | Value |
|--------|-------|
| Steps (10 min cap) | 12,395 |
| int6/int8 sliding val_bpb | **1.1630** |
| Quantization penalty | +0.0015 BPB |
| Artifact size | 15,353,490 bytes |
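For readers unfamiliar with per-row quantization, the mixed-precision step above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the submission's export code: it assumes symmetric quantization with one scale per row, and it reads "31 levels" and "127 levels" as the maximum integer magnitude. The function names and the sample weights are hypothetical.

```python
# Sketch of symmetric per-row quantization (assumed scheme, not the author's code).
INT6_LEVELS = 31    # assumption: integers in [-31, 31]
INT8_LEVELS = 127   # assumption: integers in [-127, 127]

def quantize_row(row, levels):
    """Map one row of floats to integers in [-levels, levels] plus a scale."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / levels if max_abs > 0 else 1.0
    q = [max(-levels, min(levels, round(x / scale))) for x in row]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate floats from the integers and the row scale."""
    return [v * scale for v in q]

def max_abs_error(rows, levels):
    """Largest round-trip error over all rows at the given precision."""
    worst = 0.0
    for row in rows:
        q, scale = quantize_row(row, levels)
        restored = dequantize_row(q, scale)
        worst = max(worst, max(abs(a - b) for a, b in zip(row, restored)))
    return worst

# Hypothetical weights, only to show the coarser grid at int6.
weights = [[0.50, -0.25, 0.10, 0.03], [1.00, 0.40, -0.90, 0.05]]
print("int6 max error:", max_abs_error(weights, INT6_LEVELS))
print("int8 max error:", max_abs_error(weights, INT8_LEVELS))
```

The round-trip error per row is bounded by half the row scale, which is why the coarser int6 grid is only applied where training has already simulated it (the STE-protected block weights).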

… 1.2129)

10-layer transformer with mixed-precision export achieving mean val_bpb=1.2129 across 5 seeds on 8xH100 SXM, improving on the naive baseline by 0.0248 nats (t=34.12, p<<0.001).

Key changes:
- 10 layers (vs 9 baseline)
- Lower LRs: MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03
- FP16 tied embedding export (reduces quant gap)
- Int6 quantization for middle layers 2-7 (fits under 16MB)

Mean artifact size: 15.36MB (under 16MB cap).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


Keep tok_emb.weight in fp16 during int8 export (kills the quant gap), shrink MLP hidden to 992 to fit under 16MB, and bump warmdown to 3600 and matrix LR to 0.06. Tested on 8xH100 SXM (2 seeds) and 8xH200 SXM (3 seeds).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


Major upgrade from previous 10L submission (1.2129 -> 1.1652 BPB).

Key changes:
- 9L with MLP_MULT=3 (wider MLP, 3x expansion, 21.8M params)
- QAT: STE fake-quantize simulates int6 during training
- Int6 quantization on all block weights (layers 0-8)
- Sliding window eval (stride=64) for ~0.033 BPB free gain
- FP16 tied embedding + lower LRs (carried over)

5-seed results on 8xH100 SXM:
- Mean slide_bpb: 1.1652 (std=0.0017)
- Mean rt_bpb: 1.1985
- t-statistic: 78.93 (p << 0.001)
- All artifacts under 16MB (mean: 15.64MB)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The window_starts filter dropped windows shorter than stride, silently skipping up to (stride-1) tokens at the end of the validation set. Now includes all windows with >= 1 scoreable token, and clamps the score start for short final windows.
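The fix described above can be illustrated with a small sketch of how sliding-window scoring ranges might be laid out. This is not the code from the PR: the function name and the tuple layout are hypothetical, and it assumes the first window scores all of its tokens while every later window scores only the tokens it newly covers.

```python
# Sketch of sliding-window eval bookkeeping (assumed layout, not the PR's code).
def sliding_windows(n_tokens, seq_len, stride):
    """Return (win_start, win_end, score_start, score_end) tuples.

    Positions are absolute token indices with exclusive ends. Each token is
    scored by exactly one window, including a final window that covers fewer
    than `stride` new tokens.
    """
    windows = []
    score_start = 0
    while score_start < n_tokens:
        if score_start == 0:
            # The first window has no earlier context, so it scores everything.
            score_end = min(seq_len, n_tokens)
        else:
            # Later windows score up to `stride` new tokens; the last one may be short.
            score_end = min(score_start + stride, n_tokens)
        win_end = score_end
        win_start = max(0, win_end - seq_len)
        windows.append((win_start, win_end, score_start, score_end))
        score_start = score_end
    return windows

# Validation length chosen so the tail is shorter than one stride.
n = 1024 + 3 * 64 + 17
for w in sliding_windows(n, seq_len=1024, stride=64)[-2:]:
    print(w)
```

With seq_len=1024 and stride=64, each full later window starts scoring 960 tokens after its own start, which matches the context figure quoted in the submission notes. The short final window simply scores whatever remains instead of being filtered out.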


…val_bpb=1.1748) (openai#60)

* Add NTK Eval + Overtone Init submission (1.2160 BPB)

  Train@1024 with overtone embedding init and phase-transition residual mixing, eval@2048 with NTK-aware dynamic RoPE scaling. Mean val_bpb 1.2160 across 3 seeds (p=0.0012 for 0.0194-nat improvement over baseline).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update submission: Muon WD + NTK Eval + Overtone Init (1.2094 BPB, p=0.0002)
* Update submission: 10-Layer + Muon WD + NTK Eval + Overtone Init (1.2029 BPB, p=0.0006)
* Update submission: FP16 Embed + 10L + Muon WD + NTK + Overtone (1.2008 BPB)
* Update submission: 1.2000 BPB - FP16 Embed + 10L + Muon WD + NTK@1408 + Overtone
* Update: 1.1748 BPB - Sliding Window + FP16 Embed + 10L + Muon WD + Overtone

---------

Co-authored-by: notapplica <notapplica@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
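As background for the NTK-aware RoPE scaling mentioned above, the commonly used base-rescaling formula can be sketched as below. This is an illustration under assumptions, not the submission's code: the base of 10000 and head dimension of 64 are placeholder values, and the "dynamic" part is modeled only as computing the scale from the evaluation length at call time.

```python
# Sketch of NTK-aware RoPE base rescaling (common formulation; values are assumptions).
def ntk_scaled_base(base, head_dim, train_len, eval_len):
    """Enlarge the RoPE base so the lowest frequency stretches by eval_len / train_len."""
    scale = max(1.0, eval_len / train_len)
    return base * scale ** (head_dim / (head_dim - 2))

def rope_inv_freq(base, head_dim):
    """Inverse frequencies for each rotary pair, highest frequency first."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

# Train at 1024 tokens, evaluate at 2048 (lengths taken from the commit message).
old = rope_inv_freq(10000.0, 64)
new = rope_inv_freq(ntk_scaled_base(10000.0, 64, 1024, 2048), 64)
print("highest frequency unchanged:", old[0] == new[0])
print("lowest frequency ratio:", old[-1] / new[-1])
```

The exponent head_dim / (head_dim - 2) is chosen so that the highest-frequency pair is left untouched while the lowest-frequency pair is slowed by exactly the length ratio, with the pairs in between interpolated smoothly.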

* Warmdown-quantization co-optimization, val_bpb=1.2154

  Novel finding: aggressive LR decay (WARMDOWN_ITERS=20000) reduces int8 quantization penalty from 0.014 to 0.005 BPB. Combined with FP16 tied embeddings and moderate NTK-RoPE extrapolation (eval@1408). Full warmdown sweep across 10 values and detailed analysis in README.

* breakthrough: 1.1574 BPB via int6 + MLP 3x + sliding window stride=256

---------

Co-authored-by: Sam Larson <saml212@users.noreply.github.com>
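To make the warmdown setting above concrete, here is a rough sketch of a linear warmdown schedule. It assumes a step-based schedule in which the learning-rate multiplier decays linearly to zero over the last `warmdown_iters` steps; the actual training script may define warmdown differently (for example against wallclock time), and the step counts used here are hypothetical.

```python
# Sketch of a linear LR warmdown multiplier (assumed step-based schedule).
def lr_scale(step, total_steps, warmdown_iters):
    """Return the LR multiplier in [0, 1] for the given step."""
    if warmdown_iters <= 0:
        return 1.0
    remaining = total_steps - step
    return max(0.0, min(1.0, remaining / warmdown_iters))

# Hypothetical run of 10,000 steps.
# A warmdown longer than the run means the LR is already decaying at step 0.
print(lr_scale(0, 10_000, 20_000))      # long warmdown: starts below full LR
print(lr_scale(0, 10_000, 3_600))       # short warmdown: full LR at the start
print(lr_scale(8_200, 10_000, 3_600))   # halfway through a 3,600-step warmdown
```

Under this reading, a WARMDOWN_ITERS value larger than the number of steps actually run turns the whole run into one long decay, which is one plausible interpretation of "aggressive LR decay" in the commit message.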


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ebda3af334


chatgpt-codex-connector · 2026-03-24T09:04:54Z

+
+## Included Files
+
+- `train_gpt.py` — code snapshot of the best configuration so far (008)


Add the missing train_gpt.py snapshot

This README says the folder includes a train_gpt.py code snapshot for the best run, but the committed train_gpt.py in this directory is empty (0 bytes). That makes the reported result non-reproducible and conflicts with the repository submission requirement that each run include a runnable training script.


chatgpt-codex-connector · 2026-03-24T09:04:54Z

@@ -0,0 +1,9 @@
+{
+  "name": "10L Int5-MLP + BigramHash(10240) + SWA(frac=0.4) + WD=0.04",
+  "val_loss": 1.14276,


Store the leaderboard metric under val_bpb

This submission metadata puts 1.14276 under val_loss and does not provide a val_bpb field. In this repo, submission.json is expected to carry val_bpb, and this numeric value matches BPB scale rather than the loss scale used in logs, so parsers/readers relying on the documented schema can mis-ingest or skip this run.
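A check along these lines could catch the issue before merge. This is a minimal sketch, not a tool that exists in the repository: `check_submission` and `nats_to_bpb` are hypothetical helpers, and the only schema rule assumed is the one stated in the review comment, namely that `submission.json` should carry a numeric `val_bpb`.

```python
# Sketch of a submission.json sanity check (hypothetical helper, assumed schema).
import json
import math

def nats_to_bpb(loss_nats_per_token, n_tokens, n_bytes):
    """Convert mean cross-entropy in nats per token to bits per byte."""
    total_bits = loss_nats_per_token * n_tokens / math.log(2)
    return total_bits / n_bytes

def check_submission(text):
    """Return a list of problems found in one submission.json payload."""
    data = json.loads(text)
    problems = []
    value = data.get("val_bpb")
    if value is None:
        problems.append("missing val_bpb")
        if "val_loss" in data:
            problems.append("val_loss present without val_bpb")
    elif isinstance(value, bool) or not isinstance(value, (int, float)):
        problems.append("val_bpb is not a number")
    return problems

# The payload flagged in the review, reduced to the relevant field.
print(check_submission('{"name": "example", "val_loss": 1.14276}'))
print(check_submission('{"name": "example", "val_bpb": 1.14276}'))
```

The conversion helper shows why the two fields are not interchangeable: bits per byte depends on the tokens-per-byte ratio of the validation set as well as on the loss, so a value on the BPB scale stored as `val_loss` would be misread by anything that expects nats per token.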


0hq and others added 30 commits March 18, 2026 16:33

Merge pull request openai#35 from openai/0hq-patch-1

0f9518a

Update README.md

Update train_gpt.py

886cc5b

Update train_gpt_mlx.py

de13248

Update README.md

954a158

Record: Seq4096 + Sliding Window Eval, val_bpb=1.1808

9d318e7

Non-record: SwiGLU + warmdown fix + quarter batch (1x5090, 1.3281 bpb)

b8a1426

Add MLX_EAGER_EVAL flag to further reduce memory pressure by force-evaluating the graph after each sub-batch step

6a08c9d

Merge pull request openai#100 from sandsevenone/mlx_eager_eval

2081ba1

Use eager mx.eval() to fix running train script on 16GB Mac devices

Update README.md (openai#105)

b5ea566

clarify torch version

6e3e90d

SOTA attempt (val_bpb=1.2064) (openai#49)

e89fcf8

* SOTA attempt * Improve score on SXM --------- Co-authored-by: spokane-way <spokane@way>

Update README.md

194bb87

Add record: Sliding Window Eval (stride=64), val_bpb=1.1925 (openai#50)

d84a3e8

Update README.md

6b40978

New SOTA attempt (openai#52)

78c24e2

Co-authored-by: spokane-way <spokane@way>

Update README.md

b87b883

Update README.md

2d6e9e0

Update README.md

ce6cf9a

Update README.md

ad7b62c

Update README.md

cfa5726

Update README.md

f3897c1

Update README.md

5353524

Update README.md

d2bd760

cocohearts and others added 29 commits March 20, 2026 11:42

Merge pull request openai#73 from NishantDahal/swiglu-warmdown-1x5090

e821922

Non-record: SwiGLU + warmdown fix + quarter batch (1x5090, 1.3281 bpb)

Merge pull request openai#86 from aruniyer/submission/10L-lowlr-fp16embed-int6

b774930

Update: 11L MLP3x + WD=0.04 + zstd-22 (val_bpb 1.1502)

Merge pull request openai#162 from raahilshah/submission/2026-03-20_Int6_MLP3x_SmearGate_BigramHash_MuonWD_SWA

8b2b17e

Record: Int6 MLP3x + SmearGate + BigramHash + MuonWD + SWA (mean val_bpb=1.1483)

Merge pull request openai#180 from thwu1/10L-int5mlp-wd04-swa50

ee82226

Record: 10L Int5-MLP + BigramHash(10240) + SWA(0.4) + WD=0.04 (val_bpb=1.1428, mean 3 seeds)

Update README.md

4958197

Update README.md

f427d59

Create leaderboard-best-score-over-time.svg

f6e75fe

Update README.md

a9d5c77

Delete assets directory

e721424

Record: 11L + Efficient Partial XSA (val_bpb: 1.1307) — NEW SOTA

a81f85b

Novel: Efficient Partial Exclusive Self Attention on last 3 layers. GQA-aware reshape avoids tensor duplication (<2ms overhead). Beats prior SOTA (1.1318) by 0.0011 BPB. 15.9MB artifact.

Merge pull request openai#255 from openai/valerio-oai-patch-1

2d7199d

Update README.md

Restore train_gpt.py before bd2463a

48404cd

Co-authored-by: Codex <noreply@openai.com>

Merge pull request openai#269 from openai/revert-train-gpt-pre-bd2463a

3c1f8b3

Restore train_gpt.py before bd2463a

Update README.md

2cff08e

Update README.md

9a60d16

Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271)

d758328

Update README.md

fc6332a

Fix grammar in README

8926429

Update the text to reflect the passive voice grammar.

Merge pull request openai#350 from sha-huang/hs-patch-grammar

0f51451

Fix grammar in README

Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)

2951651

Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233, 3-seed mean)

15776db

Merge pull request openai#265 from unnir/submission/v22-XSA3-beats-top1

56a9283

Record: 11L + Efficient Partial XSA (val_bpb: 1.1307)

Merge pull request openai#287 from jfprincz/submission/11l-xsa4-ema-int6-mlp3x-wd04-1.1271

0d44464

Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271)

Merge pull request openai#315 from jfprincz/submission/11l-partialrope-lateqat-1.1248

cdabe13

Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)

Merge pull request openai#414 from signalrush/submission/ema-gptqlite-1.1233

b5ac0de

Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233)

Update README leaderboard with merged records

b82c50d

Use GitHub usernames in new leaderboard rows

d74c0b5

Describe leaderboard entries by base-run diff

8a77849

Merge pull request openai#561 from openai/codex/update-readme-leaderboard-merged-records

ebda3af

Update README leaderboard with merged record submissions

chatgpt-codex-connector Bot reviewed Mar 24, 2026


peterpodj wants to merge 79 commits into 0hq-patch-1 from main


20 participants

