
LongContext 4096 + Full SOTA Stack & QAT Int4 → 16 Layers #347

Draft
FlashyFlash3011 wants to merge 1 commit into openai:main from FlashyFlash3011:flashyflash3011/long-context-4096-qat-int4-16l

Conversation

@FlashyFlash3011

Submissions

Two experiments targeting sub-1.1698 BPB. Scripts are complete and smoke-tested. Full 3-seed H100 runs are queued (compute grant pending); results and train logs will be added before final review.


1. LongContext 4096 + Full SOTA Stack

Folder: records/track_10min_16mb/2026-03-21_LongContext4096_FullStack/

The 4096-seq training record (1.2014 BPB) was submitted before sliding window eval, FP16 embeddings, Muon WD, or Overtone init existed. This combines all of those with long-context training:

  • Train at seq_len=4096, eval with sliding window stride=256
  • Each eval token sees 3,840 tokens of context vs 960 in SOTA
  • Zero extra eval cost: 64 seqs × 4096 = 256 seqs × 1024 tokens per batch
  • NTK-aware RoPE base=40000 (= 10000 × 4096/1024)
  • Re-tuned LRs (matrix=0.025) and Muon momentum (0.98) for 4096 context
  • All SOTA techniques: 10L, Muon WD=0.02, FP16 tied embed, Overtone spectral init, U-Net skips
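The context-length arithmetic behind these bullets can be sketched as a quick sanity check. This is an illustrative sketch, not the submission's actual code; `ntk_rope_base` is a hypothetical helper name:

```python
def ntk_rope_base(base: float, train_len: int, orig_len: int) -> float:
    """NTK-aware RoPE: scale the rotary base linearly with context length."""
    return base * train_len / orig_len

# base = 40000 (= 10000 x 4096/1024)
assert ntk_rope_base(10000, 4096, 1024) == 40000

# Zero extra eval cost: 64 seqs x 4096 tokens == 256 seqs x 1024 tokens.
assert 64 * 4096 == 256 * 1024

# With a 4096-token window and stride 256, every scored token sees at
# least window - stride = 3840 tokens of context.
assert 4096 - 256 == 3840
```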

Expected: ~1.14–1.16 BPB


2. QAT Int4 → 16 Layers

Folder: records/track_10min_16mb/2026-03-21_QAT_Int4_16L/

Int4 nibble-packing (two weights per byte) fits 16 transformer layers in the same 16 MB budget as the 10-layer SOTA — a 60% parameter increase.

  • Quantization-Aware Training with straight-through estimator (STE) activates at 15% of training iterations
  • Weights gradually cluster near int4 grid points before export, minimising accuracy loss
  • Int4 roundtrip assertion before export verifies zero packing error
  • All SOTA techniques carried forward with stability adjustments: matrix_lr=0.030, Muon momentum 0.97, grad_clip=1.0
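The core int4 mechanics can be sketched in a few lines. This is a minimal pure-Python illustration of fake quantization, nibble packing, and the roundtrip check, not the submission's training code (which would apply the straight-through estimator to tensors in the backward pass):

```python
def fake_quant_int4(w: float, scale: float) -> float:
    """Round a weight to the signed int4 grid [-8, 7] and dequantize.
    In QAT the backward pass would use a straight-through estimator,
    i.e. an identity gradient through the rounding."""
    q = max(-8, min(7, round(w / scale)))
    return q * scale

def pack_int4(vals: list[int]) -> bytes:
    """Nibble-pack signed int4 values, two weights per byte."""
    assert len(vals) % 2 == 0
    out = bytearray()
    for hi, lo in zip(vals[0::2], vals[1::2]):
        out.append(((hi + 8) << 4) | (lo + 8))  # shift [-8,7] to [0,15]
    return bytes(out)

def unpack_int4(packed: bytes) -> list[int]:
    vals = []
    for b in packed:
        vals.append((b >> 4) - 8)
        vals.append((b & 0xF) - 8)
    return vals

# Roundtrip assertion analogous to the zero-packing-error check before export:
vals = [-8, -3, 0, 7]
assert unpack_int4(pack_int4(vals)) == vals
```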

Expected: ~1.14–1.16 BPB


Status

  • Scripts complete and syntax-verified
  • Smoke-tested locally (ROCm, 1 GPU)
  • Full H100 runs (3 seeds each) — pending compute
  • Results tables and train logs — to be added

