README.md:

# 11L PartialRoPE + LNScale + EMA + SWA + TTT (non-record)

Non-record submission for the Parameter Golf challenge. This run was tested on **1×H100 PCIe for ~107 minutes** (approximately equivalent to 8×H100 SXM for 10 minutes).

## Architecture

- **11 transformer layers**, d_model=512, 8 query heads / 4 KV heads, MLP with 3× expansion and ReLU² activation
- **U-Net skip connections**: encoder-decoder style with learnable skip weights
- **Partial RoPE**: rotary applied to 16 of the 64 head dims, leaving the rest position-free for generalization (see the sketch after this list)
- **LN Scale**: RMSNorm output damped by 1/sqrt(layer+1) for gradient stability in the deeper layers
- **SmearGate**: per-dim gate blending each token's embedding with the previous token's
- **BigramHash(2048, dim=64→512)**: hash-based bigram context embeddings
- Tied input/output embeddings
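
A minimal PyTorch sketch of three items above (partial RoPE, the damped norm, and SmearGate), assuming the usual rotate-half RoPE convention and sigmoid gating; names, shapes, and the gate initialization are illustrative rather than taken from `train_gpt.py`:

```python
import torch
import torch.nn.functional as F

HEAD_DIM, ROPE_DIMS = 64, 16              # rotary on 16 of the 64 head dims

def partial_rope(x, cos, sin):
    """Rotate only the first ROPE_DIMS of each head; the rest pass through
    untouched and carry no positional signal.
    x: (..., seq, HEAD_DIM); cos/sin: (seq, ROPE_DIMS // 2)."""
    x_rot, x_pass = x[..., :ROPE_DIMS], x[..., ROPE_DIMS:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    rotated = torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return torch.cat((rotated, x_pass), dim=-1)

class DampedRMSNorm(torch.nn.Module):
    """RMSNorm whose output is scaled by 1/sqrt(layer + 1) (the 'LN Scale' item)."""
    def __init__(self, dim, layer, eps=1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.scale = (layer + 1) ** -0.5
        self.eps = eps
    def forward(self, x):
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * self.weight * self.scale

class SmearGate(torch.nn.Module):
    """Per-dim gate blending each token's embedding with the previous token's."""
    def __init__(self, dim):
        super().__init__()
        self.gate = torch.nn.Parameter(torch.full((dim,), -2.0))  # starts near identity
    def forward(self, x):                       # x: (batch, seq, dim)
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1]   # token t-1, zeros at position 0
        g = torch.sigmoid(self.gate)
        return (1 - g) * x + g * prev
```

Since only a quarter of each head's dims are rotated, most of the attention capacity stays position-agnostic, and the 1/sqrt(layer+1) damping shrinks each block's contribution as depth grows.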

## Training

- Muon optimizer (Newton-Schulz) for 2D weights, momentum warmup 0.85→0.99
- Adam (beta1=0.9, beta2=0.95) for scalars/embeddings, WD=0.04
- Wallclock-aware cosine warmdown over last ~3000 steps
- Orthogonal init with muP output-projection scaling
- EMA (decay=0.997) + SWA over the last 40% of training (sketched below)
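
A minimal sketch of how the EMA and SWA could compose, assuming one EMA snapshot is accumulated into the SWA sum per step once the last 40% of the budget begins; the `train_step` callback and the snapshot cadence are assumptions, not read from `train_gpt.py` (the run presumably defines the 40% window against the wallclock budget, since it stopped at step 3374 of a nominal 20000):

```python
import torch

def train_with_averaging(model, num_steps, train_step,
                         ema_decay=0.997, swa_frac=0.4):
    """Maintain an EMA shadow of the weights and average EMA snapshots (SWA)."""
    ema = {k: v.detach().float().clone() for k, v in model.state_dict().items()}
    swa, swa_count = None, 0
    swa_start = int((1 - swa_frac) * num_steps)       # last 40% of training
    for step in range(num_steps):
        train_step(model)                             # one optimizer step (placeholder)
        with torch.no_grad():
            for k, v in model.state_dict().items():
                ema[k].mul_(ema_decay).add_(v.float(), alpha=1 - ema_decay)
            if step >= swa_start:
                if swa is None:
                    swa = {k: v.clone() for k, v in ema.items()}
                else:
                    for k in swa:
                        swa[k].add_(ema[k])
                swa_count += 1
    return {k: v / swa_count for k, v in swa.items()}, swa_count
```

The `loading_swa_weights count=1197` line in `train.log` would correspond to `swa_count` here.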

## Compression

- Uniform int5 per-row quantization for both MLP and attention weights, with an int8 fallback (see the sketch after this list)
- zstd-22 compression
- **Artifact size: 15.4MB ✅ (under 16MB limit)**
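
A minimal sketch of the int5 path, assuming symmetric absmax per-row scales and numpy bit-packing; the int8 fallback and the exact scale/layout format are omitted and may differ from `train_gpt.py`:

```python
import numpy as np
import zstandard as zstd  # pip install zstandard

def quantize_int5_per_row(w: np.ndarray):
    """Quantize each row to 5-bit codes in [-15, 15] with its own scale."""
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True) / 15.0, 1e-12)
    q = np.clip(np.rint(w / scale), -15, 15).astype(np.int8)
    return q, scale.astype(np.float16)

def pack_int5(q: np.ndarray) -> bytes:
    """Pack 5-bit codes (shifted to 0..30) into a dense bitstream."""
    codes = (q.astype(np.int16) + 15).astype(np.uint8).reshape(-1, 1)
    bits = np.unpackbits(codes, axis=1)[:, 3:]        # keep the low 5 bits
    return np.packbits(bits.reshape(-1)).tobytes()

w = np.random.randn(512, 1536).astype(np.float32)     # stand-in weight matrix
q, scale = quantize_int5_per_row(w)
blob = zstd.ZstdCompressor(level=22).compress(pack_int5(q) + scale.tobytes())
dequant = q.astype(np.float32) * scale                # reconstruction check
```

At 5 bits per weight, ~26.7M parameters pack to roughly 16.7MB before compression, consistent with the 15.4MB artifact after zstd-22.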

## Evaluation

- Sliding-window scoring with stride=64, so each token is evaluated with near-maximal left context (see the sketch after this list)
- Full-model SGD test-time training (TTT): 3 epochs over the validation set, first 2 blocks frozen
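
A minimal sketch of the stride-64 sliding window, assuming a 1024-token window and a model that maps a (1, T) tensor of token ids to (1, T, vocab) logits; both are assumptions, since the window length is not stated above:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, tokens, window=1024, stride=64):
    """Mean per-token NLL where only the newest tokens of each window are
    scored, so every scored token sees close to `window` tokens of context."""
    total_nll, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, tokens.numel(), stride):
        end = min(begin + window, tokens.numel())
        ids = tokens[begin:end].unsqueeze(0)          # (1, T)
        targets = ids.clone()
        ctx = (end - begin) - (end - prev_end)        # prefix already scored earlier
        targets[:, :ctx] = -100                       # exclude it from the loss
        logits = model(ids)
        total_nll += F.cross_entropy(
            logits[0, :-1], targets[0, 1:], ignore_index=-100, reduction="sum"
        ).item()
        n_scored += int((targets[0, 1:] != -100).sum())
        prev_end = end
        if end == tokens.numel():
            break
    return total_nll / n_scored  # nats/token; bpb also divides by ln(2) * bytes/token
```

For the TTT pass, the same scoring would run after 3 epochs of plain SGD over the validation set with the first two blocks' parameters frozen via `requires_grad_(False)`.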

## Key Metrics

| Metric | Value |
|---|---|
| val_loss (pre-TTT) | 2.0611 |
| val_bpb (pre-TTT) | 1.2207 |
| Training steps | 3374 |
| Training time | 6,400,210 ms (~107 min) |
| SWA count | 1197 |
| Model params | 26,666,073 |
| Artifact bytes | 16,132,620 |
| Code bytes | 49,461 |
| Total bytes | 16,182,081 |

## Included Files

- `train_gpt.py` — training script
- `train.log` — training log
- `submission.json` — submission metadata
- `README.md` — this file

submission.json:

{
"author": "nathon-lee",
"github_id": "nathon-lee",
"name": "11L PartialRoPE + LNScale + EMA + SWA + TTT",
"blurb": "Non-record submission: 11-layer 512-dim GPT with partial RoPE (16 dims), LN scale damping, SmearGate, BigramHash(4096), U-Net skip connections, mixed int5/int6 quantization + zstd-22, EMA(0.997) + SWA, SGD TTT, sliding window eval (stride=64). Tested on 1xH100 for 80 minutes (~equivalent to 8xH100 10min).",
"date": "2026-03-21T08:00:00Z",
"track": "non-record-10min-16mb",
"val_loss": 2.0444,
"val_bpb": 1.2108,
"step_stop": 3806,
"wallclock_seconds": 4800.906,
"bytes_total": 17396551,
"bytes_model_compressed": 17347056,
"bytes_code": 49495
}

train.log:

=== Parameter Golf: 11L PartialRoPE+LNScale+EMA+SWA+TTT ===
run_id:full_1gpu_pcie seed:1337
model_params:26666073
grad_accum_steps:8 micro_batch_seqs:64 tokens_per_step:524288
step:0/20000 val_loss:6.9295 val_bpb:4.1040 train_time:0ms
step:0/20000 train_loss:6.9311 lr_scale:1.0000
step:100/20000 train_loss:3.3448 lr_scale:1.0000
step:200/20000 train_loss:2.7898 lr_scale:1.0000
step:300/20000 train_loss:2.6807 lr_scale:1.0000
step:400/20000 train_loss:2.4209 lr_scale:0.9889
step:500/20000 val_loss:2.5017 val_bpb:1.4816 train_time:950423ms
step:500/20000 train_loss:2.5169 lr_scale:0.9556
step:600/20000 train_loss:2.4320 lr_scale:0.9223
step:700/20000 train_loss:2.4107 lr_scale:0.8890
step:800/20000 train_loss:2.3299 lr_scale:0.8554
step:900/20000 train_loss:2.3378 lr_scale:0.8229
step:1000/20000 val_loss:2.3133 val_bpb:1.3701 train_time:1898746ms
step:1000/20000 train_loss:2.3094 lr_scale:0.7902
step:1100/20000 train_loss:2.2891 lr_scale:0.7574
step:1200/20000 train_loss:2.2714 lr_scale:0.7243
step:1300/20000 train_loss:2.2558 lr_scale:0.6916
step:1400/20000 train_loss:2.4329 lr_scale:0.6586
step:1500/20000 val_loss:2.2391 val_bpb:1.3261 train_time:2843310ms
step:1500/20000 train_loss:2.2423 lr_scale:0.6254
step:1600/20000 train_loss:2.2074 lr_scale:0.5922
step:1700/20000 train_loss:2.2735 lr_scale:0.5590
step:1800/20000 train_loss:2.2070 lr_scale:0.5257
step:1900/20000 train_loss:2.1703 lr_scale:0.4925
step:2000/20000 val_loss:2.1885 val_bpb:1.2961 train_time:3790296ms
step:2000/20000 train_loss:2.1734 lr_scale:0.4590
step:2100/20000 train_loss:2.2228 lr_scale:0.4255
step:2200/20000 train_loss:2.1740 lr_scale:0.3922
step:2300/20000 train_loss:2.2066 lr_scale:0.3586
step:2400/20000 train_loss:2.1487 lr_scale:0.3254
step:2500/20000 val_loss:2.1402 val_bpb:1.2675 train_time:4740539ms
step:2500/20000 train_loss:2.1174 lr_scale:0.2917
step:2600/20000 train_loss:2.1344 lr_scale:0.2583
step:2700/20000 train_loss:2.1252 lr_scale:0.2247
step:2800/20000 train_loss:2.1342 lr_scale:0.1916
step:2900/20000 train_loss:2.0774 lr_scale:0.1581
step:3000/20000 val_loss:2.0923 val_bpb:1.2392 train_time:5689637ms
step:3000/20000 train_loss:2.0956 lr_scale:0.1249
step:3100/20000 train_loss:2.0435 lr_scale:0.0913
step:3200/20000 train_loss:2.1025 lr_scale:0.0580
step:3300/20000 train_loss:2.0939 lr_scale:0.0246
step:3374/20000 val_loss:2.0611 val_bpb:1.2207 train_time:6400210ms
wallclock_stop at step:3374 train_time:6400210ms
loading_swa_weights count=1197
raw_model_bytes:106699647
artifact_bytes:16132620 code_bytes:49461 total:16182081
/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py:4876: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
warnings.warn( # warn only once
starting_ttt