
LongContext 4096 + Full SOTA Stack & QAT Int4 → 16 Layers #347

Draft
FlashyFlash3011 wants to merge 1 commit into openai:main from FlashyFlash3011:flashyflash3011/long-context-4096-qat-int4-16l

Conversation

@FlashyFlash3011

Submissions

Two experiments targeting sub-1.1698 BPB. Scripts are complete and smoke-tested. Full 3-seed H100 runs are queued (compute grant pending); results and train logs will be added before final review.


1. LongContext 4096 + Full SOTA Stack

Folder: records/track_10min_16mb/2026-03-21_LongContext4096_FullStack/

The 4096-seq training record (1.2014 BPB) was submitted before sliding window eval, FP16 embeddings, Muon WD, or Overtone init existed. This combines all of those with long-context training:

  • Train at seq_len=4096, eval with sliding window stride=256
  • Each eval token sees 3,840 tokens of context vs 960 in SOTA
  • Zero extra eval cost: 64 seqs × 4096 = 256 seqs × 1024 tokens per batch
  • NTK-aware RoPE base=40000 (= 10000 × 4096/1024)
  • Re-tuned LRs (matrix=0.025) and Muon momentum (0.98) for 4096 context
  • All SOTA techniques: 10L, Muon WD=0.02, FP16 tied embed, Overtone spectral init, U-Net skips
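The context-length arithmetic behind these bullets can be sketched as a quick sanity check. This is an illustrative sketch, not the submission's actual code; `ntk_rope_base` is a hypothetical helper name:

```python
def ntk_rope_base(base: float, train_len: int, orig_len: int) -> float:
    """NTK-aware RoPE: scale the rotary base linearly with context length."""
    return base * train_len / orig_len

# base = 40000 (= 10000 x 4096/1024)
assert ntk_rope_base(10000, 4096, 1024) == 40000

# Zero extra eval cost: 64 seqs x 4096 tokens == 256 seqs x 1024 tokens.
assert 64 * 4096 == 256 * 1024

# With a 4096-token window and stride 256, every scored token sees at
# least window - stride = 3840 tokens of context.
assert 4096 - 256 == 3840
```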

Expected: ~1.14–1.16 BPB


2. QAT Int4 → 16 Layers

Folder: records/track_10min_16mb/2026-03-21_QAT_Int4_16L/

Int4 nibble-packing (two weights per byte) fits 16 transformer layers in the same 16 MB budget as the 10-layer SOTA — a 60% parameter increase.

  • Quantization-Aware Training with straight-through estimator (STE) activates at 15% of training iterations
  • Weights gradually cluster near int4 grid points before export, minimising accuracy loss
  • Int4 roundtrip assertion before export verifies zero packing error
  • All SOTA techniques carried forward with stability adjustments: matrix_lr=0.030, Muon momentum 0.97, grad_clip=1.0
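The core int4 mechanics can be sketched in a few lines. This is a minimal pure-Python illustration of fake quantization, nibble packing, and the roundtrip check, not the submission's training code (which would apply the straight-through estimator to tensors in the backward pass):

```python
def fake_quant_int4(w: float, scale: float) -> float:
    """Round a weight to the signed int4 grid [-8, 7] and dequantize.
    In QAT the backward pass would use a straight-through estimator,
    i.e. an identity gradient through the rounding."""
    q = max(-8, min(7, round(w / scale)))
    return q * scale

def pack_int4(vals: list[int]) -> bytes:
    """Nibble-pack signed int4 values, two weights per byte."""
    assert len(vals) % 2 == 0
    out = bytearray()
    for hi, lo in zip(vals[0::2], vals[1::2]):
        out.append(((hi + 8) << 4) | (lo + 8))  # shift [-8,7] to [0,15]
    return bytes(out)

def unpack_int4(packed: bytes) -> list[int]:
    vals = []
    for b in packed:
        vals.append((b >> 4) - 8)
        vals.append((b & 0xF) - 8)
    return vals

# Roundtrip assertion analogous to the zero-packing-error check before export:
vals = [-8, -3, 0, 7]
assert unpack_int4(pack_int4(vals)) == vals
```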

Expected: ~1.14–1.16 BPB


Status

  • Scripts complete and syntax-verified
  • Smoke-tested locally (ROCm, 1 GPU)
  • Full H100 runs (3 seeds each) — pending compute
  • Results tables and train logs — to be added

