
Record: Dynamic Eval + TTT on SOTA Pipeline (val_bpb=1.1364) #397

Open
translatingthename wants to merge 1 commit into openai:main from translatingthename:submission-dyneval-ttt

@translatingthename

Summary

This PR applies dynamic evaluation (Krause et al., ICML 2018) to the SOTA pipeline without modifying training. During sliding-window scoring, the model takes periodic SGD gradient steps, adapting to the local text distribution. The result is a consistent 2.0% bpb improvement across all three seeds (mean delta of -0.0242 bpb against the Int6 roundtrip score) at zero artifact cost.

3-seed mean: 1.1371 (seeds 42, 7, 2024). Best seed: 1.1364. Merged SOTA: 1.1428.

Results (3-seed, 8xH100 SXM, SDPA backend)

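| Seed | Steps | Int6 Roundtrip (bpb) | + TTT + Dynamic Eval (bpb) | Delta | Artifact |
|------|-------|----------------------|----------------------------|-------|----------|
| 42 | 5,604 | 1.1607 | 1.1364 | -0.0243 | 15.65 MB |
| 7 | 5,590 | 1.1618 | 1.1369 | -0.0249 | 15.80 MB |
| 2024 | 5,620 | 1.1613 | 1.1380 | -0.0233 | 15.35 MB |
| Mean | | 1.1613 | 1.1371 | -0.0242 | |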
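
As a quick sanity check on the numbers above, the following sketch recomputes the Mean row and the relative improvements from the per-seed scores. The helper name `summarize` and the dictionary layout are illustrative assumptions, not part of the submission; the scores and the merged SOTA value (1.1428) are copied from this description.

```python
# Per-seed scores from the results table: (Int6 roundtrip, + TTT + dynamic eval).
RUNS = {
    42: (1.1607, 1.1364),
    7: (1.1618, 1.1369),
    2024: (1.1613, 1.1380),
}
MERGED_SOTA = 1.1428  # quoted in the summary


def summarize(runs, sota):
    """Return mean baseline, mean final, mean delta, and relative gains."""
    base = sum(b for b, _ in runs.values()) / len(runs)
    final = sum(f for _, f in runs.values()) / len(runs)
    return {
        "base": base,
        "final": final,
        "delta": final - base,
        # Relative improvement against the Int6 roundtrip score.
        "rel_vs_base": (base - final) / base,
        # Relative improvement against the merged SOTA.
        "rel_vs_sota": (sota - final) / sota,
    }


if __name__ == "__main__":
    for key, value in summarize(RUNS, MERGED_SOTA).items():
        print(f"{key}: {value:.4f}")
```

The relative gain against the Int6 roundtrip score comes out at roughly 2%, which matches the headline figure; against the merged SOTA the gain is about 0.5%.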

Novel Contribution: Dynamic Evaluation

After TTT adaptation, we score the validation stream using sliding windows (stride=64). Between batches of scored windows, we take an SGD gradient step (lr=0.001) on the model weights. The model adapts to the local distribution as it scores. TTT adapts weights before scoring; dynamic eval adapts during scoring. The two are complementary.

  • Windows scored in batches of 32, gradient step every 4 batches
  • SGD without momentum, rank-local adaptation
  • Dynamic eval takes ~344s; total eval time is ~427s (well under the 600s budget)
  • Zero additional artifact bytes
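
The score-then-update loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the submission's code: a toy bigram softmax model stands in for the real network, the function name `dynamic_eval` and its parameters are hypothetical, the gradient is assumed to be accumulated over all windows since the last step and averaged per token, and the result is in bits per token rather than bits per byte. The defaults mirror the settings listed above (stride=64, batches of 32 windows, a step every 4 batches, lr=0.001).

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_eval(tokens, vocab, stride=64, batch_windows=32,
                 step_every=4, lr=0.001, adapt=True):
    """Return mean bits per scored token for a toy bigram model.

    Every token is scored BEFORE any update that has seen it, so the
    adaptation only uses text that has already been evaluated.
    """
    # Toy stand-in for the network: one logit row per previous token.
    logits = [[0.0] * vocab for _ in range(vocab)]
    grad = [[0.0] * vocab for _ in range(vocab)]
    windows_per_step = batch_windows * step_every
    total_bits, pending = 0.0, 0

    # Each window contributes `stride` newly scored positions.
    for w, start in enumerate(range(1, len(tokens), stride), 1):
        for t in range(start, min(start + stride, len(tokens))):
            prev, target = tokens[t - 1], tokens[t]
            probs = softmax(logits[prev])
            total_bits -= math.log2(probs[target])
            # Cross-entropy gradient w.r.t. logits: probs - one_hot(target).
            for k in range(vocab):
                grad[prev][k] += probs[k] - (1.0 if k == target else 0.0)
            pending += 1
        # Plain SGD step (no momentum) every `step_every` batches.
        if adapt and w % windows_per_step == 0:
            for row, g in zip(logits, grad):
                for k in range(vocab):
                    row[k] -= lr * g[k] / pending
                    g[k] = 0.0
            pending = 0
    return total_bits / (len(tokens) - 1)


if __name__ == "__main__":
    stream = [0, 1, 2, 3] * 64  # highly repetitive toy "validation stream"
    kwargs = dict(vocab=4, stride=4, batch_windows=2, step_every=2, lr=0.5)
    print("static :", dynamic_eval(stream, adapt=False, **kwargs))
    print("dynamic:", dynamic_eval(stream, adapt=True, **kwargs))
```

On the repetitive toy stream, the static model stays at 2 bits per token (a uniform guess over four symbols), while the adapted model scores lower because each step shifts probability toward the transitions it has already seen. With the default settings, one step covers 32 × 4 = 128 windows, or 8,192 newly scored tokens at stride 64.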

Attribution

Built on PR #315 (jfprincz): XSA, EMA, Partial RoPE, LN Scale, Late QAT.
PR #338 (alertcat): TTT integration.
SmearGate/BigramHash/OrthoInit originally by unnir.

Reference: Krause et al., "Dynamic Evaluation of Neural Sequence Models," ICML 2018.

See records/track_10min_16mb/2026-03-22_DynamicEval_TTT_11L/README.md for full details, ablation, what didn't work, and reproduction instructions.

3-seed mean: 1.1371 (seeds 42, 7, 2024)
Dynamic evaluation (Krause et al., ICML 2018) applied during sliding window scoring.
2.0% consistent bpb improvement at zero artifact cost.
Built on PR openai#315 (jfprincz) and PR openai#338 (alertcat).