79 commits
0f9518a
Merge pull request #35 from openai/0hq-patch-1
0hq Mar 18, 2026
886cc5b
Update train_gpt.py
0hq Mar 18, 2026
de13248
Update train_gpt_mlx.py
0hq Mar 18, 2026
954a158
Update README.md
0hq Mar 18, 2026
9d318e7
Record: Seq4096 + Sliding Window Eval, val_bpb=1.1808
aquariouseworkman Mar 19, 2026
b8a1426
Non-record: SwiGLU + warmdown fix + quarter batch (1x5090, 1.3281 bpb)
NishantDahal Mar 19, 2026
f85c837
Record: Mixed Quant (int6+int8) + Sliding Window, val_bpb=1.1630
aquariouseworkman Mar 19, 2026
73a4a9d
Submission: 10L + lower LR + fp16 embed + int6 middle layers (val_bpb…
Mar 19, 2026
6a08c9d
Add MLX_EAGER_EVAL flag to further reduce memory pressure by force-ev…
sandsevenone Mar 19, 2026
2081ba1
Merge pull request #100 from sandsevenone/mlx_eager_eval
0hq Mar 19, 2026
a5eb9ed
fp16 tied embedding + lr/warmdown tuning — val_bpb 1.2197 (#42)
chonchiog Mar 19, 2026
b5ea566
Update README.md (#105)
0hq Mar 19, 2026
6e3e90d
clarify torch version
cocohearts Mar 19, 2026
e89fcf8
SOTA attempt (val_bpb=1.2064) (#49)
spokane-way Mar 19, 2026
194bb87
Update README.md
0hq Mar 19, 2026
d84a3e8
Add record: Sliding Window Eval (stride=64), val_bpb=1.1925 (#50)
mattqlf Mar 19, 2026
8e4f5d1
Update submission: MLP 3x + QAT + Int6 + Sliding Window (val_bpb 1.1652)
Mar 19, 2026
3a6fec7
Fix: score final partial window in sliding window eval (#124)
mattqlf Mar 19, 2026
6b40978
Update README.md
0hq Mar 19, 2026
78c24e2
New SOTA attempt (#52)
spokane-way Mar 19, 2026
b87b883
Update README.md
0hq Mar 19, 2026
2d6e9e0
Update README.md
0hq Mar 19, 2026
ce6cf9a
Update README.md
0hq Mar 19, 2026
9fbdf8c
Record: Sliding Window + FP16 Embed + 10L + Muon WD + Overtone Init (…
notapplica Mar 19, 2026
ad7b62c
Update README.md
0hq Mar 19, 2026
555669e
Int6 + MLP 3x + sliding window: val_bpb=1.1574 (#61)
saml212 Mar 19, 2026
cfa5726
Update README.md
0hq Mar 19, 2026
f3897c1
Update README.md
0hq Mar 19, 2026
5353524
Update README.md
0hq Mar 19, 2026
d2bd760
Update README.md
0hq Mar 19, 2026
34fccfb
Add Seq2048 + FP16 Tied Embedding submission (mean val_bpb 1.2067)
Mar 19, 2026
c2b3621
Update submission: 10L + int6 mid + sliding window (mean val_bpb 1.1793)
Mar 19, 2026
9e7a4b8
Update: full int6+zstd, MLP 1344, Muon 0.99 (mean val_bpb 1.1632)
Mar 19, 2026
510e3f6
Update: STE int6 QAT, zero quant gap (mean val_bpb 1.1598)
Mar 19, 2026
ae88208
Update README.md
0hq Mar 19, 2026
9ac12c2
Record: 10L Mixed Precision: val_bpb=1.2147 (10 layers + int6 middle …
nanlliu Mar 19, 2026
bd2463a
commit ttt record (#77)
samacqua Mar 19, 2026
5e29bfd
Update README.md
0hq Mar 19, 2026
45bbccf
Update README.md
cocohearts Mar 19, 2026
e109841
SmearGate + OrthoInit + Muon WD + Int6 STE QAT + MLP 3x + Sliding Window
aquariouseworkman Mar 20, 2026
3aface5
Merge branch 'openai:main' into main
aquariouseworkman Mar 20, 2026
14cdf6f
Add submission: 2026-03-20_Int6_MLP3x_SmearGate_BigramHash_MuonWD_SWA…
Mar 20, 2026
3c458dc
Update submission: 2026-03-20_Int6_MLP3x_SmearGate_BigramHash_MuonWD_…
Mar 20, 2026
ab9ecb2
Record: 10L Int5-MLP + MuonWD=0.04 + SWA/50 (val_bpb=1.1453)
thwu1 Mar 20, 2026
a320f7c
Update records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50…
thwu1 Mar 20, 2026
1641a23
Update: 11L MLP3x + WD=0.04 + zstd-22 (val_bpb 1.1502)
Mar 20, 2026
9f9d533
Update: val_bpb=1.1428 (mean 3 seeds) — bigram=10240 + SWA(0.4) + WD=…
Mar 20, 2026
1a8be36
Add 3-seed training logs (seed=42, 1337, 2024)
Mar 20, 2026
80f7a21
Merge pull request #63 from yahya010/submission/seq2048-fp16emb
cocohearts Mar 20, 2026
5b26c56
Merge pull request #65 from aquariouseworkman/main
cocohearts Mar 20, 2026
e821922
Merge pull request #73 from NishantDahal/swiglu-warmdown-1x5090
cocohearts Mar 20, 2026
b774930
Merge pull request #86 from aruniyer/submission/10L-lowlr-fp16embed-int6
cocohearts Mar 20, 2026
8b2b17e
Merge pull request #162 from raahilshah/submission/2026-03-20_Int6_ML…
cocohearts Mar 20, 2026
ee82226
Merge pull request #180 from thwu1/10L-int5mlp-wd04-swa50
cocohearts Mar 20, 2026
4958197
Update README.md
valerio-oai Mar 20, 2026
f427d59
Update README.md
valerio-oai Mar 20, 2026
f6e75fe
Create leaderboard-best-score-over-time.svg
valerio-oai Mar 20, 2026
a9d5c77
Update README.md
valerio-oai Mar 20, 2026
e721424
Delete assets directory
valerio-oai Mar 20, 2026
a81f85b
Record: 11L + Efficient Partial XSA (val_bpb: 1.1307) — NEW SOTA
unnir Mar 20, 2026
2d7199d
Merge pull request #255 from openai/valerio-oai-patch-1
cocohearts Mar 20, 2026
48404cd
Restore train_gpt.py before bd2463a
yuzhougu-oai Mar 20, 2026
3c1f8b3
Merge pull request #269 from openai/revert-train-gpt-pre-bd2463a
cocohearts Mar 20, 2026
2cff08e
Update README.md
cocohearts Mar 20, 2026
9a60d16
Update README.md
0hq Mar 20, 2026
d758328
Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271)
jfprincz Mar 20, 2026
fc6332a
Update README.md
0hq Mar 20, 2026
8926429
Fix grammar in README
sha-huang Mar 21, 2026
0f51451
Merge pull request #350 from sha-huang/hs-patch-grammar
cocohearts Mar 21, 2026
2951651
Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)
jfprincz Mar 21, 2026
15776db
Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233…
signalrush Mar 22, 2026
56a9283
Merge pull request #265 from unnir/submission/v22-XSA3-beats-top1
cocohearts Mar 23, 2026
0d44464
Merge pull request #287 from jfprincz/submission/11l-xsa4-ema-int6-ml…
cocohearts Mar 23, 2026
cdabe13
Merge pull request #315 from jfprincz/submission/11l-partialrope-late…
cocohearts Mar 23, 2026
b5ac0de
Merge pull request #414 from signalrush/submission/ema-gptqlite-1.1233
cocohearts Mar 23, 2026
b82c50d
Update README leaderboard with merged records
cocohearts Mar 23, 2026
d74c0b5
Use GitHub usernames in new leaderboard rows
cocohearts Mar 23, 2026
8a77849
Describe leaderboard entries by base-run diff
cocohearts Mar 23, 2026
ebda3af
Merge pull request #561 from openai/codex/update-readme-leaderboard-m…
cocohearts Mar 23, 2026
45 changes: 37 additions & 8 deletions README.md
@@ -12,6 +12,7 @@ If you're familiar with [neural scaling laws](https://arxiv.org/abs/2001.08361),
Ideally, we'd allow for submissions to use arbitrary computational resources. But in order to make the challenge not inaccessibly expensive, we're limiting *leaderboard submissions* to 10 minutes on 8xH100s. However, we'd still love to see submissions that don't meet the compute limitation requirements in our 'Non-record Submissions' section: We're excited to see people push the infinite frontier of parameter limited performance as well.

We also know compute is expensive, so **OpenAI is sponsoring $1,000,000 in compute credits** to help people get started training their models. To request a compute grant, use this form: [Request a Compute Grant](https://openai.com/index/parameter-golf/#credit-form).
When requesting compute, please make sure you choose the appropriate level, write sufficient justification, and **submit with an email tied to an OpenAI / ChatGPT account**.

## Participant Form

@@ -27,10 +28,26 @@ Happy training!

## Leaderboard


| Rank | Run | Score | Author | Summary | Date | Info |
|-----:|-----|------:|--------|---------|------|------|
| 1 | Naive Baseline | 1.2244 | Baseline | 9layer 512dim 1024vocab TiedEmbeddings 4 KV heads | 2026-03-18 | [info](records/track_10min_16mb/2026-03-17_NaiveBaseline/README.md) |
| Run | Score | Author | Summary | Date | Info |
|-----|------:|--------|---------|------|------|
| 11L EMA + GPTQ-lite + warmdown3500 | 1.1228 | signalrush | On PR #374: GPTQ-lite clip search + EMA, plus warmdown3500 and QAT@0.15 | 2026-03-22 | [info](records/track_10min_16mb/2026-03-22_11L_EMA_GPTQ-lite_warmdown3500_QAT015_1.1233/README.md) |
| 11L Partial RoPE + LN Scale + EMA + XSA4 | 1.1248 | jfprincz | On PR #287: Partial RoPE (16/64) + layerwise LN scale | 2026-03-21 | [info](records/track_10min_16mb/2026-03-21_11L_XSA4_EMA_PartialRoPE_LateQAT_1.1248/README.md) |
| 11L XSA4 + EMA + Int6 MLP3x | 1.1271 | jfprincz | On PR #198: XSA on the last 4 layers + EMA replacing SWA | 2026-03-20 | [info](records/track_10min_16mb/2026-03-20_11L_XSA4_EMA_Int6_MLP3x_WD04_1.1271/README.md) |
| 11L Efficient Partial XSA | 1.1307 | unnir | On PR #198: Efficient Partial XSA on the deepest 3 layers | 2026-03-20 | [info](records/track_10min_16mb/2026-03-20_11L_EfficientPartialXSA_FA3_SWA120/README.md) |
| 10L Int5-MLP + BigramHash(10240) | 1.1428 | thwu1 | 10 layers, mixed int5/int6 quantization, BigramHash(10240), SWA(0.4), WD=0.04 | 2026-03-20 | [info](records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50/README.md) |
| Int6 MLP3x + SmearGate + BigramHash | 1.1458 | Raahil Shah | 3x MLP + SmearGate + BigramHash + OrthoInit + Muon WD + SWA | 2026-03-20 | [info](records/track_10min_16mb/2026-03-20_Int6_MLP3x_SmearGate_BigramHash_MuonWD_SWA/README.md) |
| 11L MLP3x + Int6 QAT | 1.1502 | aruniyer | 11 layers, 3x MLP, int6 QAT, zstd-22, WD=0.04, sliding eval | 2026-03-20 | [info](records/track_10min_16mb/2026-03-19_MLP3x_QAT_Int6_SlidingWindow/README.md) |
| SmearGate + OrthoInit + Muon WD | 1.1556 | aquariouseworkman | SmearGate + BigramHash + 3x MLP + int6 STE QAT + sliding eval | 2026-03-19 | [info](records/track_10min_16mb/2026-03-19_smeargate_orthoinit_muonwd/README.md) |
| 10L Int6 QAT + Zstd MLP2.6x | 1.1586 | yahya010 | 10 layers, int6 QAT + zstd-22, MLP 1344, Muon 0.99, sliding eval | 2026-03-19 | [info](records/track_10min_16mb/2026-03-19_Seq2048_FP16Emb_TunedLR/README.md) |
| Mixed Quant + Sliding Window Eval | 1.1630 | aquariouseworkman | Int6 block weights + int8 embeddings + 3x MLP + sliding eval | 2026-03-19 | [info](records/track_10min_16mb/2026-03-19_MixedQuant_Int6Int8_SlidingWindow/README.md) |
| Muon WD + 10 layer | 1.1748 | notapplica | Includes prev. wins + Spectral embed init + resid mix | 2026-03-19 | [info](records/track_10min_16mb/2026-03-19_SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit/README.md) |
| Sliding Window Eval | 1.1925 | Matthew Li | Sliding window evaluation at stride=64, increasing context for eval | 2026-03-19 | [info](records/track_10min_16mb/2026-03-19_SlidingWindowEval/README.md) |
| LoRA TTT | 1.1928 | samacqua | Test-time training with LoRAs | 2026-03-19 | [info](records/track_10min_16mb/2026-03-17_LoRA_TTT/README.md) |
| 4k seq length | 1.2014 | Spokane Way | 4k seq length + better hypers | 2026-03-19 | [info](records/track_10min_16mb/2026-03-19_TrainingOptSeq4096/README.md) |
| 2048 seq length | 1.206 | Spokane Way | 2048 seq length (train + val) | 2026-03-18 | [info](records/track_10min_16mb/2026-03-18_LongContextSeq2048/README.md) |
| int6 mixed precision | 1.2147 | Nan Liu | 10 layers, mixed int8/int6 | 2026-03-18 | [info](records/track_10min_16mb/2026-03-19_10L_MixedPrecision/README.md) |
| fp16 Embed | 1.2197 | Renier Velazco | FP16 Tied Embedding + LR/Warmdown Tuning | 2026-03-18 | [info](records/track_10min_16mb/2026-03-18_FP16Embed_WD3600/README.md) |
| Naive Baseline | 1.2244 | Baseline | 9layer 512dim 1024vocab TiedEmbeddings 4 KV heads | 2026-03-18 | [info](records/track_10min_16mb/2026-03-17_NaiveBaseline/README.md) |

#### Notable Non-Record Runs

@@ -89,7 +106,7 @@ You can rent GPUs from anywhere, but OpenAI is partnering with Runpod to make se

1. First, [create a Runpod account](https://console.runpod.io/deploy). You should also set up an SSH key in the Settings tab on the left so you can connect to your remote machine. If you're new to this, ask Codex to help you set it up.

2. Once you've set up your account, create a new GPU Cloud Pod. You can choose whichever GPU SKU you'd like. Final leaderboard submissions must run in under 10 minutes on 8xH100s, but we strongly recommend testing and running experiments on cheaper SKUs first, since an 8xH100 box can cost around $20/hour.
2. Once you've set up your account, create a new GPU Cloud Pod. You can choose whichever GPU SKU you'd like. Final leaderboard submissions must run in under 10 minutes on 8xH100s (specifically the SXM variant), but we strongly recommend testing and running experiments on cheaper SKUs first, since an 8xH100 box can cost around $20/hour.

3. Let's start with a 1xH100 pod. Deploy using the official Parameter Golf template: [Launch Template](https://console.runpod.io/deploy?template=y5cejece4j&ref=nl2r56th). Enable SSH terminal access, leaving the other settings at their defaults. Deploy your pod and SSH into it once it's up. You should land in `/workspace/`.

@@ -125,6 +142,7 @@ By default, this command prints `train_loss` step logs during training and print

For dataset export, tokenizer export, and docs-cache rebuild instructions, see [data/README.md](data/README.md).

Evaluation will run in the Runpod environment with all packages preinstalled. `requirements.txt` is provided as a reference if you want to set up your own environment.

## FAQ

@@ -136,15 +154,26 @@ No external downloads, training dataset access, or network calls are allowed dur

**Are scores independently verified by OpenAI?**

We're not automatically verifying every submission, but we will verify the top leaderboard entries over time. Any non-reproducible results can be disqualified, and issues reproducing submissions should be raised on the PR.
We're not automatically verifying every submission, but we will verify the top leaderboard entries over time. Any non-reproducible results can be disqualified, and issues reproducing submissions should be raised on the PR. If you find an issue with a record on the leaderboard or find that a record isn't reproducible, please let us know by opening a GitHub issue describing your findings.

**What counts as 'external compute'? For example, is it fair to tune my hyperparameters offline?**

There's no perfectly clear answer here and it's hard to draw a clean line around what does or does not count as external compute. For now, we're reserving the right to disqualify runs that are not in the spirit of the challenge. Tuning your Adam hyperparameters across a bunch of runs is fine, but if there's evidence that you're sneaking in additional compute unfairly, such as brute-forcing ridiculous seeds, we won't allow it. Use your best judgment and there's no penalty for asking questions.

**What are the restrictions on evaluation?**

We won't accept submissions that take more than 10 minutes on 8xH100 to evaluate (Note: This limit is in addition to the 10 minutes of training time allowed!), but otherwise you're free to evaluate however. As with modded-nanogpt, we allow evaluation at any sequence length. And, obviously, you aren't allowed to access any training data during evaluation, unless you pay for those bits in the <16MB limit. We encourage competitors to push the bounds of evaluation methods as aggressively as with training methods.
We won't accept submissions that take more than 10 minutes on 8xH100 to evaluate (Note: This limit is in addition to the 10 minutes of training time allowed!), but otherwise you're free to evaluate however. As with modded-nanogpt, we allow evaluation at any sequence length. And, obviously, you aren't allowed to access any training data during evaluation, unless you pay for those bits in the <16MB limit. We encourage competitors to push the bounds of evaluation methods as aggressively as with training methods. You CANNOT access validation data during training, e.g. by compressing it into your 16MB as a "paid prefix".

If it isn't abundantly obvious: you can't cheat on your test loss, and you can't train on the validation set before you evaluate on it. The wording around test-time training has been confusing people, so to be explicit: you are only allowed to test-time train on validation set tokens _you've already evaluated your model on_, since those tokens have already been graded!
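
The ordering rule can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from any submission: `ToyAdaptiveModel` is a smoothed unigram counter standing in for a real model with trainable parameters, and `score_then_adapt` only shows the order of operations (grade a chunk first, then adapt on it).

```python
import math
from collections import Counter


class ToyAdaptiveModel:
    """Hypothetical stand-in for an adaptable model: an add-one smoothed unigram counter."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.counts = Counter()
        self.total = 0

    def nll(self, token):
        # Negative log-likelihood (nats) of one token under the current counts.
        p = (self.counts[token] + 1) / (self.total + self.vocab_size)
        return -math.log(p)

    def update(self, tokens):
        # "Training" step: absorb tokens that have already been graded.
        self.counts.update(tokens)
        self.total += len(tokens)


def score_then_adapt(chunks, vocab_size):
    """Return mean NLL per token, grading each chunk before adapting on it."""
    model = ToyAdaptiveModel(vocab_size)
    total_nll, n_tokens = 0.0, 0
    for chunk in chunks:
        # 1) Grade the chunk with the model as it stands right now.
        total_nll += sum(model.nll(t) for t in chunk)
        n_tokens += len(chunk)
        # 2) Only afterwards adapt on the tokens that were just graded.
        model.update(chunk)
    return total_nll / n_tokens


if __name__ == "__main__":
    print(score_then_adapt([[1, 1, 1, 1], [1, 1, 1, 1]], vocab_size=4))
```

Swapping steps 1 and 2 would let the model see tokens before they are graded, which is the kind of leakage the rule above forbids.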

**What is the process for accepting new submissions?**

Since all submissions are public, we're accepting record submissions chronologically by PR creation time. The leaderboard may take time to update due to verification and review of submissions, so check which PR currently holds the SOTA before submitting. As explained below, submissions should exceed the SOTA record with sufficient statistical significance in order to be accepted for the leaderboard. Otherwise, submissions may be accepted as 'non-record submissions' provided they are sufficiently unique or interesting.

**Can I import XYZ package or library?**

Yes, you're free to import any package or library you want, so long as it does not violate the rules on evaluation, compute, training time, code size, or otherwise. Just include a `requirements.txt` in your records folder and mention setup instructions in your README.md. Since you don't pay for bits imported from Python libraries, some limitations apply: you can't use custom libraries to sneak in extra compute or capabilities, or to massively increase effective code size, but importing FlashAttention, etc. is completely fine.


## Submission Process

@@ -162,7 +191,7 @@ All submissions should be made as a pull request that only adds a new folder to

2. A `submission.json` file (see the example runs) that includes your name, GitHub ID, `val_bpb`, and related metadata.

3. A train log, automatically produced by your script.
3. A train log, automatically produced by your script. Please demonstrate a statistically significant win. Most often, submitting an average over 3 training runs is sufficient.

4. A `train_gpt.py` script and any other dependencies. Note: this must successfully compile and run within the records folder. Broken scripts will not be accepted.

59 changes: 59 additions & 0 deletions records/track_10min_16mb/2026-03-17_LoRA_TTT/README.md
@@ -0,0 +1,59 @@
This record captures `LoRA TTT`: the naive baseline model with document-aware LoRA test-time training at evaluation.

## Method

**Training** is identical to the naive baseline.

**Evaluation** adds per-document LoRA test-time training (TTT). For each document in the validation set:
1. Find document boundaries using BOS tokens
2. Split the document into overlapping chunks (chunk_size=256 within eval_seq_len=1024 context windows)
3. For each chunk, score it (accumulate loss/bytes for BPB), *then* train rank-8 LoRA adapters on that chunk's loss (so you only train on the context -- no leakage)
4. Reset LoRA parameters between documents (no leakage across documents)

Documents are batched (batch_size=64) and sorted by length for efficiency. The LoRA adapters target `lm_head`, `c_q`, and `c_v` projections in all transformer blocks. A single Adam optimizer with `lr=0.01, betas=(0.9, 0.95)` trains all LoRA parameters with one gradient step per chunk.
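
Steps 1 and 2 can be sketched as follows. This is one plausible reading of the description above, not the code in `train_gpt.py`: the helper names `split_documents` and `chunk_windows` are hypothetical, and the rule that each 256-token chunk is scored with up to 1024 tokens of context ending at the chunk's end is an assumption.

```python
def split_documents(tokens, bos_id):
    """Return (start, end) index pairs, one per document.

    Assumes every document begins with a BOS token; any tokens before the
    first BOS are ignored in this sketch.
    """
    starts = [i for i, t in enumerate(tokens) if t == bos_id]
    ends = starts[1:] + [len(tokens)]
    return list(zip(starts, ends))


def chunk_windows(doc_len, chunk_size=256, ctx_len=1024):
    """Yield (ctx_start, chunk_start, chunk_end) offsets within one document.

    Tokens in [chunk_start, chunk_end) are the ones scored; tokens in
    [ctx_start, chunk_start) are earlier tokens of the same document that
    only serve as context, so consecutive windows overlap.
    """
    for chunk_start in range(0, doc_len, chunk_size):
        chunk_end = min(chunk_start + chunk_size, doc_len)
        ctx_start = max(0, chunk_end - ctx_len)
        yield ctx_start, chunk_start, chunk_end


if __name__ == "__main__":
    print(split_documents([0, 5, 6, 0, 7], bos_id=0))
    for window in chunk_windows(1300):
        print(window)
```

Because the context never reaches back past offset 0 of the document, nothing from a previous document is visible, which matches the doc-isolated condition in the ablations.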

## Notes

This is very similar to [a record I submitted to the modded-nanogpt speedrun repo](https://samacquaviva.com/projects/nanogpt/).
The major addition is making the test-time training ~5x faster by using LoRAs: this lets you have per-sequence adaptation (no leaking between validation sequences) while still batching.

This is not a heavily optimized run: I just wanted to plant the TTT seed.
It uses ~1/10th of the evaluation budget.

## Ablations

The majority of this improvement doesn't come from the TTT itself, but from:
1. Only conditioning on the current document
2. Doing strided evaluations

| Condition | val_loss | val_bpb | Delta bpb (vs. baseline) |
| --------- | -------- | ------- | --------- |
| Baseline (cross-doc, flat stream) | 2.0731 | 1.2278 | — |
| + Doc-isolated | 2.0561 | 1.2168 | -0.0110 |
| + Stride (chunk=256) | 2.0177 | 1.1941 | -0.0337 |
| + LoRA TTT | 2.0126 | 1.1910 | -0.0368 |

![ablations](ablations.png)

## Results

Validated on the full 50k-document fineweb_val split. Submitting at `bpb=1.195`.

```text
bpb: [1.1927, 1.1935, 1.1921, 1.1929]
mean: 1.1928
std: 0.0005
p-value < 1.195: 0.00234486
```
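
The summary statistics above can be re-derived with the standard library alone. The README does not say which test produced the p-value; the sketch below assumes a one-sided, one-sample t-test of the four runs against the 1.195 threshold, and uses the closed-form Student-t CDF for 3 degrees of freedom (valid only for exactly four runs). The helper name `t_cdf_df3` is hypothetical.

```python
import math
import statistics


def t_cdf_df3(t):
    """CDF of Student's t distribution with exactly 3 degrees of freedom."""
    u = t / math.sqrt(3.0)
    return 0.5 + (u / (1.0 + u * u) + math.atan(u)) / math.pi


bpb = [1.1927, 1.1935, 1.1921, 1.1929]  # per-run values reported above
threshold = 1.195                       # the submitted bpb

mean = statistics.fmean(bpb)
pop_std = statistics.pstdev(bpb)        # population std (divides by n)
sample_std = statistics.stdev(bpb)      # sample std (divides by n - 1)

# One-sided test of H0: true mean >= threshold, using the sample std.
t_stat = (mean - threshold) / (sample_std / math.sqrt(len(bpb)))
p_value = t_cdf_df3(t_stat)             # df = len(bpb) - 1 = 3

print(f"mean={mean:.4f} pop_std={pop_std:.4f} t={t_stat:.3f} p={p_value:.8f}")
```

Under these assumptions the mean and p-value agree with the reported figures, and the reported std matches the population form rather than the sample form. With a different number of runs, a general t-distribution routine would be needed in place of `t_cdf_df3`.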

## Command

```bash
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Included files

- `train_gpt.py`
- `train_v*.txt` (note that `train_v0.txt` is on 2xH100)
- `submission.json`
11 changes: 11 additions & 0 deletions records/track_10min_16mb/2026-03-17_LoRA_TTT/submission.json
@@ -0,0 +1,11 @@
{
"author": "sam",
"github_id": "samacqua",
"name": "LoRA TTT",
"blurb": "Naive baseline + per-document LoRA test-time training at eval. Rank-8 LoRA on lm_head/Q/V with Adam lr=0.01, overlapping 256-token chunks in 1024-token context windows. Same training, smarter eval.",
"date": "2026-03-19T10:00:00Z",
"val_loss": 2.0142,
"val_bpb": 1.1929,
"bytes_total": 15882446,
"bytes_code": 58509
}
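
A quick local sanity check of a `submission.json` before opening a PR might look like the sketch below. It is not an official validator: the field list simply mirrors the file above, the function name `check_submission` is hypothetical, and treating the 16MB cap as 16,000,000 bytes is an assumption that should be checked against the challenge rules.

```python
import json

# Field names copied from the example file above; not an official schema.
REQUIRED_FIELDS = (
    "author", "github_id", "name", "blurb", "date",
    "val_loss", "val_bpb", "bytes_total", "bytes_code",
)
# Assumption: "16MB" means 16,000,000 bytes (decimal megabytes).
ASSUMED_LIMIT_BYTES = 16_000_000


def check_submission(text, limit_bytes=ASSUMED_LIMIT_BYTES):
    """Return a list of human-readable problems; an empty list means none found."""
    meta = json.loads(text)
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in meta]
    if problems:
        return problems
    if meta["bytes_total"] >= limit_bytes:
        problems.append(f"bytes_total {meta['bytes_total']} is not under {limit_bytes}")
    if meta["bytes_code"] > meta["bytes_total"]:
        problems.append("bytes_code exceeds bytes_total")
    return problems


if __name__ == "__main__":
    example = json.dumps({
        "author": "sam", "github_id": "samacqua", "name": "LoRA TTT",
        "blurb": "Naive baseline + per-document LoRA test-time training at eval.",
        "date": "2026-03-19T10:00:00Z", "val_loss": 2.0142, "val_bpb": 1.1929,
        "bytes_total": 15882446, "bytes_code": 58509,
    })
    print(check_submission(example) or "no problems found")
```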