Update STATUS.md for 390M model and GCP training plan
Reflects current target: 390M params (1.68B dense equiv), Mistral tokenizer,
GCP L4 on-demand ~200hr/$142. Notes training data is lean relative to
Chinchilla scaling but sufficient for delta system validation. Updates
checkpoint frequency documentation and remaining work section.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
docs/STATUS.md: 29 additions, 8 deletions
```diff
@@ -1,12 +1,16 @@
 # LeanFormer Project Status
 
-**Last updated:** 2026-03-29
+**Last updated:** 2026-03-30
 
 ---
 
 ## Current State
 
-All architecture and infrastructure code is complete. The full pipeline (data preparation, training, forging, routing, composition, consolidation, serving) is implemented and tested. **119 tests pass.** The remaining work is one long-running GPU training job (~24-36 hours), followed by full-scale forging and integration validation.
+All architecture and infrastructure code is complete. The full pipeline (data preparation, training, forging, routing, composition, consolidation, serving) is implemented and tested. **119 tests pass.**
+
+The current training target is the **390M parameter model** (1.68B dense equivalent, 4.3x compression) using the Mistral tokenizer (32K vocab). Training runs on a GCP g2-standard-4 (L4 24GB, on-demand) for an estimated ~200 hours (~8 days), costing ~$142 within GCP's $300 free credit.
+
+**Note on training data:** The corpus is 249M tokens. For a 390M model over 3 epochs, that's ~747M token passes — lean relative to the ~8B tokens that Chinchilla scaling would suggest. This is intentional: the base model needs competent internal representations for the delta system, not encyclopedic factual knowledge. Perplexity may plateau above the <50 target, but the Knowledge Plane validation requires meaningful embeddings and layer structure, not a fully converged language model.
```
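As a rough cross-check of the cost figures in the training-target paragraph above, a minimal sketch; the hourly rate is inferred from the stated ~$142 / ~200 hr totals, not quoted from GCP's published pricing, which varies by region:

```python
# Cross-check of the GCP training cost estimate.
# hourly_rate_usd is implied by the quoted totals (assumption),
# not an official g2-standard-4 on-demand price.
est_hours = 200                      # estimated wall-clock training time
hourly_rate_usd = 142 / 200          # implied rate for g2-standard-4 (L4 24GB)
free_credit_usd = 300                # GCP free-trial credit

total_cost = est_hours * hourly_rate_usd
print(f"~${total_cost:.0f} of ${free_credit_usd} credit "
      f"({total_cost / free_credit_usd:.0%} used), ~{est_hours / 24:.1f} days")
# -> ~$142 of $300 credit (47% used), ~8.3 days
```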
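To make the "lean relative to Chinchilla" note concrete, a small sketch of the arithmetic, assuming the usual ~20-tokens-per-parameter rule of thumb (which is where the "~8B tokens" figure comes from):

```python
# Planned token passes vs. a Chinchilla-style ~20 tokens-per-parameter budget.
params = 390e6            # base model parameter count
corpus_tokens = 249e6     # training corpus size
epochs = 3

token_passes = corpus_tokens * epochs      # ~747M token passes
chinchilla_tokens = 20 * params            # ~7.8B tokens ("~8B" above)
print(f"planned: {token_passes / 1e6:.0f}M tokens, "
      f"Chinchilla-optimal: ~{chinchilla_tokens / 1e9:.1f}B "
      f"({token_passes / chinchilla_tokens:.1%} of optimal)")
# -> planned: 747M tokens, Chinchilla-optimal: ~7.8B (9.6% of optimal)
```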