README.md:

# 11L PartialRoPE + LNScale + EMA + SWA + TTT (non-record)

Non-record submission for the Parameter Golf challenge. This run was tested on **1×H100 PCIe for ~107 minutes** (approximately equivalent to 8×H100 SXM for 10 minutes).

## Architecture

- **11 transformer layers**, d_model=512, 8 query heads / 4 KV heads, MLP with 3× expansion and ReLU² activation
- **U-Net skip connections**: encoder-decoder style with learnable skip weights
- **Partial RoPE**: rotary applied to 16 of the 64 head dims, leaving the rest position-free for generalization (see the sketch after this list)
- **LN Scale**: RMSNorm output damped by 1/sqrt(layer+1) for gradient stability in the deeper layers
- **SmearGate**: per-dim gate blending each token's embedding with the previous token's
- **BigramHash(2048, dim=64→512)**: hash-based bigram context embeddings
- Tied input/output embeddings
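
A minimal PyTorch sketch of three items above (partial RoPE, the damped norm, and SmearGate), assuming the usual rotate-half RoPE convention and sigmoid gating; names, shapes, and the gate initialization are illustrative rather than taken from `train_gpt.py`:

```python
import torch
import torch.nn.functional as F

HEAD_DIM, ROPE_DIMS = 64, 16              # rotary on 16 of the 64 head dims

def partial_rope(x, cos, sin):
    """Rotate only the first ROPE_DIMS of each head; the rest pass through
    untouched and carry no positional signal.
    x: (..., seq, HEAD_DIM); cos/sin: (seq, ROPE_DIMS // 2)."""
    x_rot, x_pass = x[..., :ROPE_DIMS], x[..., ROPE_DIMS:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    rotated = torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return torch.cat((rotated, x_pass), dim=-1)

class DampedRMSNorm(torch.nn.Module):
    """RMSNorm whose output is scaled by 1/sqrt(layer + 1) (the 'LN Scale' item)."""
    def __init__(self, dim, layer, eps=1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.scale = (layer + 1) ** -0.5
        self.eps = eps
    def forward(self, x):
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * self.weight * self.scale

class SmearGate(torch.nn.Module):
    """Per-dim gate blending each token's embedding with the previous token's."""
    def __init__(self, dim):
        super().__init__()
        self.gate = torch.nn.Parameter(torch.full((dim,), -2.0))  # starts near identity
    def forward(self, x):                       # x: (batch, seq, dim)
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1]   # token t-1, zeros at position 0
        g = torch.sigmoid(self.gate)
        return (1 - g) * x + g * prev
```

Since only a quarter of each head's dims are rotated, most of the attention capacity stays position-agnostic, and the 1/sqrt(layer+1) damping shrinks each block's contribution as depth grows.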

## Training

- Muon optimizer (Newton-Schulz) for 2D weights, momentum warmup 0.85→0.99
- Adam (beta1=0.9, beta2=0.95) for scalars/embeddings, WD=0.04
- Wallclock-aware cosine warmdown over last ~3000 steps
- Orthogonal init with muP output-projection scaling
- EMA (decay=0.997) + SWA over the last 40% of training (sketched below)
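
A minimal sketch of how the EMA and SWA could compose, assuming one EMA snapshot is accumulated into the SWA sum per step once the last 40% of the budget begins; the `train_step` callback and the snapshot cadence are assumptions, not read from `train_gpt.py` (the run presumably defines the 40% window against the wallclock budget, since it stopped at step 3374 of a nominal 20000):

```python
import torch

def train_with_averaging(model, num_steps, train_step,
                         ema_decay=0.997, swa_frac=0.4):
    """Maintain an EMA shadow of the weights and average EMA snapshots (SWA)."""
    ema = {k: v.detach().float().clone() for k, v in model.state_dict().items()}
    swa, swa_count = None, 0
    swa_start = int((1 - swa_frac) * num_steps)       # last 40% of training
    for step in range(num_steps):
        train_step(model)                             # one optimizer step (placeholder)
        with torch.no_grad():
            for k, v in model.state_dict().items():
                ema[k].mul_(ema_decay).add_(v.float(), alpha=1 - ema_decay)
            if step >= swa_start:
                if swa is None:
                    swa = {k: v.clone() for k, v in ema.items()}
                else:
                    for k in swa:
                        swa[k].add_(ema[k])
                swa_count += 1
    return {k: v / swa_count for k, v in swa.items()}, swa_count
```

The `loading_swa_weights count=1197` line in `train.log` would correspond to `swa_count` here.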

## Compression

- Uniform int5 per-row quantization for both MLP and attention weights, with an int8 fallback (see the sketch after this list)
- zstd-22 compression
- **Artifact size: 15.4MB ✅ (under 16MB limit)**
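
A minimal sketch of the int5 path, assuming symmetric absmax per-row scales and numpy bit-packing; the int8 fallback and the exact scale/layout format are omitted and may differ from `train_gpt.py`:

```python
import numpy as np
import zstandard as zstd  # pip install zstandard

def quantize_int5_per_row(w: np.ndarray):
    """Quantize each row to 5-bit codes in [-15, 15] with its own scale."""
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True) / 15.0, 1e-12)
    q = np.clip(np.rint(w / scale), -15, 15).astype(np.int8)
    return q, scale.astype(np.float16)

def pack_int5(q: np.ndarray) -> bytes:
    """Pack 5-bit codes (shifted to 0..30) into a dense bitstream."""
    codes = (q.astype(np.int16) + 15).astype(np.uint8).reshape(-1, 1)
    bits = np.unpackbits(codes, axis=1)[:, 3:]        # keep the low 5 bits
    return np.packbits(bits.reshape(-1)).tobytes()

w = np.random.randn(512, 1536).astype(np.float32)     # stand-in weight matrix
q, scale = quantize_int5_per_row(w)
blob = zstd.ZstdCompressor(level=22).compress(pack_int5(q) + scale.tobytes())
dequant = q.astype(np.float32) * scale                # reconstruction check
```

At 5 bits per weight, ~26.7M parameters pack to roughly 16.7MB before compression, consistent with the 15.4MB artifact after zstd-22.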

## Evaluation

- Sliding-window scoring with stride=64, so each token is evaluated with near-maximal left context (see the sketch after this list)
- Full-model SGD test-time training (TTT): 3 epochs over the validation set, first 2 blocks frozen
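
A minimal sketch of the stride-64 sliding window, assuming a 1024-token window and a model that maps a (1, T) tensor of token ids to (1, T, vocab) logits; both are assumptions, since the window length is not stated above:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, tokens, window=1024, stride=64):
    """Mean per-token NLL where only the newest tokens of each window are
    scored, so every scored token sees close to `window` tokens of context."""
    total_nll, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, tokens.numel(), stride):
        end = min(begin + window, tokens.numel())
        ids = tokens[begin:end].unsqueeze(0)          # (1, T)
        targets = ids.clone()
        ctx = (end - begin) - (end - prev_end)        # prefix already scored earlier
        targets[:, :ctx] = -100                       # exclude it from the loss
        logits = model(ids)
        total_nll += F.cross_entropy(
            logits[0, :-1], targets[0, 1:], ignore_index=-100, reduction="sum"
        ).item()
        n_scored += int((targets[0, 1:] != -100).sum())
        prev_end = end
        if end == tokens.numel():
            break
    return total_nll / n_scored  # nats/token; bpb also divides by ln(2) * bytes/token
```

For the TTT pass, the same scoring would run after 3 epochs of plain SGD over the validation set with the first two blocks' parameters frozen via `requires_grad_(False)`.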

## Key Metrics

| Metric | Value |
|---|---|
| val_loss (pre-TTT) | 2.0611 |
| val_bpb (pre-TTT) | 1.2207 |
| Training steps | 3374 |
| Training time | 6,400,210 ms (~107 min) |
| SWA count | 1197 |
| Model params | 26,666,073 |
| Artifact bytes | 16,132,620 |
| Code bytes | 49,461 |
| Total bytes | 16,182,081 |

## Included Files

- `train_gpt.py` — training script
- `train.log` — training log
- `submission.json` — submission metadata
- `README.md` — this file

submission.json:

{
"author": "nathon-lee",
"github_id": "nathon-lee",
"name": "11L PartialRoPE + LNScale + EMA + SWA + TTT",
"blurb": "Non-record submission: 11-layer 512-dim GPT with partial RoPE (16 dims), LN scale damping, SmearGate, BigramHash(4096), U-Net skip connections, mixed int5/int6 quantization + zstd-22, EMA(0.997) + SWA, SGD TTT, sliding window eval (stride=64). Tested on 1xH100 for 80 minutes (~equivalent to 8xH100 10min).",
"date": "2026-03-21T08:00:00Z",
"track": "non-record-10min-16mb",
"val_loss": 2.0444,
"val_bpb": 1.2108,
"step_stop": 3806,
"wallclock_seconds": 4800.906,
"bytes_total": 17396551,
"bytes_model_compressed": 17347056,
"bytes_code": 49495
}

train.log:

=== Parameter Golf: 11L PartialRoPE+LNScale+EMA+SWA+TTT ===
run_id:full_1gpu_pcie seed:1337
model_params:26666073
grad_accum_steps:8 micro_batch_seqs:64 tokens_per_step:524288
step:0/20000 val_loss:6.9295 val_bpb:4.1040 train_time:0ms
step:0/20000 train_loss:6.9311 lr_scale:1.0000
step:100/20000 train_loss:3.3448 lr_scale:1.0000
step:200/20000 train_loss:2.7898 lr_scale:1.0000
step:300/20000 train_loss:2.6807 lr_scale:1.0000
step:400/20000 train_loss:2.4209 lr_scale:0.9889
step:500/20000 val_loss:2.5017 val_bpb:1.4816 train_time:950423ms
step:500/20000 train_loss:2.5169 lr_scale:0.9556
step:600/20000 train_loss:2.4320 lr_scale:0.9223
step:700/20000 train_loss:2.4107 lr_scale:0.8890
step:800/20000 train_loss:2.3299 lr_scale:0.8554
step:900/20000 train_loss:2.3378 lr_scale:0.8229
step:1000/20000 val_loss:2.3133 val_bpb:1.3701 train_time:1898746ms
step:1000/20000 train_loss:2.3094 lr_scale:0.7902
step:1100/20000 train_loss:2.2891 lr_scale:0.7574
step:1200/20000 train_loss:2.2714 lr_scale:0.7243
step:1300/20000 train_loss:2.2558 lr_scale:0.6916
step:1400/20000 train_loss:2.4329 lr_scale:0.6586
step:1500/20000 val_loss:2.2391 val_bpb:1.3261 train_time:2843310ms
step:1500/20000 train_loss:2.2423 lr_scale:0.6254
step:1600/20000 train_loss:2.2074 lr_scale:0.5922
step:1700/20000 train_loss:2.2735 lr_scale:0.5590
step:1800/20000 train_loss:2.2070 lr_scale:0.5257
step:1900/20000 train_loss:2.1703 lr_scale:0.4925
step:2000/20000 val_loss:2.1885 val_bpb:1.2961 train_time:3790296ms
step:2000/20000 train_loss:2.1734 lr_scale:0.4590
step:2100/20000 train_loss:2.2228 lr_scale:0.4255
step:2200/20000 train_loss:2.1740 lr_scale:0.3922
step:2300/20000 train_loss:2.2066 lr_scale:0.3586
step:2400/20000 train_loss:2.1487 lr_scale:0.3254
step:2500/20000 val_loss:2.1402 val_bpb:1.2675 train_time:4740539ms
step:2500/20000 train_loss:2.1174 lr_scale:0.2917
step:2600/20000 train_loss:2.1344 lr_scale:0.2583
step:2700/20000 train_loss:2.1252 lr_scale:0.2247
step:2800/20000 train_loss:2.1342 lr_scale:0.1916
step:2900/20000 train_loss:2.0774 lr_scale:0.1581
step:3000/20000 val_loss:2.0923 val_bpb:1.2392 train_time:5689637ms
step:3000/20000 train_loss:2.0956 lr_scale:0.1249
step:3100/20000 train_loss:2.0435 lr_scale:0.0913
step:3200/20000 train_loss:2.1025 lr_scale:0.0580
step:3300/20000 train_loss:2.0939 lr_scale:0.0246
step:3374/20000 val_loss:2.0611 val_bpb:1.2207 train_time:6400210ms
wallclock_stop at step:3374 train_time:6400210ms
loading_swa_weights count=1197
raw_model_bytes:106699647
artifact_bytes:16132620 code_bytes:49461 total:16182081
/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py:4876: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
warnings.warn( # warn only once
starting_ttt