
Commit fa7ce10

brian-c-moore and claude committed
Update STATUS.md for 390M model and GCP training plan
Reflects current target: 390M params (1.68B dense equiv), Mistral tokenizer, GCP L4 on-demand ~200hr/$142. Notes training data is lean relative to Chinchilla scaling but sufficient for delta system validation. Updates checkpoint frequency documentation and remaining work section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 9333e9a commit fa7ce10

1 file changed



docs/STATUS.md

Lines changed: 29 additions & 8 deletions
@@ -1,12 +1,16 @@
 # LeanFormer Project Status
 
-**Last updated:** 2026-03-29
+**Last updated:** 2026-03-30
 
 ---
 
 ## Current State
 
-All architecture and infrastructure code is complete. The full pipeline (data preparation, training, forging, routing, composition, consolidation, serving) is implemented and tested. **119 tests pass.** The remaining work is one long-running GPU training job (~24-36 hours), followed by full-scale forging and integration validation.
+All architecture and infrastructure code is complete. The full pipeline (data preparation, training, forging, routing, composition, consolidation, serving) is implemented and tested. **119 tests pass.**
+
+The current training target is the **390M parameter model** (1.68B dense equivalent, 4.3x compression) using the Mistral tokenizer (32K vocab). Training runs on a GCP g2-standard-4 (L4 24GB, on-demand) for an estimated ~200 hours (~8 days), costing ~$142 within GCP's $300 free credit.
+
+**Note on training data:** The corpus is 249M tokens. For a 390M model over 3 epochs, that's ~747M token passes — lean relative to the ~8B tokens that Chinchilla scaling would suggest. This is intentional: the base model needs competent internal representations for the delta system, not encyclopedic factual knowledge. Perplexity may plateau above the <50 target, but the Knowledge Plane validation requires meaningful embeddings and layer structure, not a fully converged language model.
 
 ---
 
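As a quick sanity check on the token-budget and cost figures in the hunk above, the arithmetic works out as follows. This is a sketch for review purposes only, not part of the committed file: the 20-tokens-per-parameter figure is the commonly cited Chinchilla rule of thumb, used here as an approximation, and the hourly rate is simply implied by the quoted cost and duration.

```python
# Sanity check of the training-budget figures quoted in the diff above.
# The 20 tokens/parameter "Chinchilla-optimal" ratio is a rule of thumb, not exact.

corpus_tokens = 249e6            # ~249M token corpus
epochs = 3
params = 390_214_657             # compressed parameter count from the 390M table

token_passes = corpus_tokens * epochs          # ~7.5e8 -> the "~747M token passes"
chinchilla_budget = 20 * params                # ~7.8e9 -> the "~8B tokens" reference point
shortfall = chinchilla_budget / token_passes   # ~10x less data than compute-optimal

hours, cost_usd = 200, 142
implied_rate = cost_usd / hours                # ~$0.71/hr implied for the on-demand L4 VM

print(f"token passes:      {token_passes:,.0f}")
print(f"chinchilla budget: {chinchilla_budget:,.0f}")
print(f"shortfall:         {shortfall:.1f}x")
print(f"implied $/hour:    {implied_rate:.2f}")
```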
@@ -57,7 +61,7 @@ Scale validation trained on 500K OpenWebText samples.
 | Bit-for-bit restoration | Verified for all 100 beliefs (ordered and random removal) |
 | Base weight integrity | 406 tensors verified unchanged |
 
-### 66M Parameters (d_model=768, 12 layers) - Current
+### 66M Parameters (d_model=768, 12 layers)
 
 Knowledge Plane validated with quick-trained checkpoint (500 steps on WikiText-2).
 
@@ -74,6 +78,22 @@ Knowledge Plane validated with quick-trained checkpoint (500 steps on WikiText-2
 | Base weight integrity | SHA-256 verified after full routing + composition lifecycle |
 | Inference latency overhead | < 2x with knowledge routing active |
 
+### 390M Parameters (d_model=2048, 24 layers) - Current Target
+
+Training on GCP L4 24GB. Mistral tokenizer (32K vocab).
+
+| Metric | Value |
+|--------|-------|
+| Compressed params | 390,214,657 |
+| Dense equivalent | 1,677,197,312 (4.3x compression) |
+| Tokenizer | Mistral (mistralai/Mistral-7B-v0.1, 32K vocab) |
+| Training data | 487,000 samples (~249M tokens), 3 epochs (~747M token passes) |
+| Training hardware | GCP g2-standard-4, NVIDIA L4 24GB, on-demand |
+| Estimated training time | ~200 hours (~8 days) |
+| Estimated cost | ~$142 |
+| Checkpoint frequency | Every 5,000 steps (~1.6 hours) |
+| Inference target | RTX 3060 12GB at FP16 (~0.8GB weights) |
+
 ---
 
 ## Training Data
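The figures in the new 390M table above are internally consistent; a small sketch checking them (not part of the committed file; the per-sample token count is an average inferred from the table, not a stated sequence length):

```python
# Consistency check of the 390M-parameter table in the hunk above.

compressed = 390_214_657
dense_equiv = 1_677_197_312
samples = 487_000
corpus_tokens = 249e6

compression = dense_equiv / compressed        # ~4.30x, matches "4.3x compression"
fp16_gb = compressed * 2 / 1e9                # ~0.78 GB of FP16 weights, matches "~0.8GB"
tokens_per_sample = corpus_tokens / samples   # ~511 tokens per sample on average

print(f"compression:       {compression:.2f}x")
print(f"fp16 weights:      {fp16_gb:.2f} GB")
print(f"tokens per sample: {tokens_per_sample:.0f}")
```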
@@ -118,24 +138,25 @@ Second batches are rejected because the 500-step model produces nearly identical
 
 ## Remaining Work
 
-### 1. Full Training (~24-36 hours GPU)
+### 1. Full Training (~200 hours on GCP L4)
 
 ```bash
+cp configs/reasoning_core_432m.yaml configs/reasoning_core.yaml
 python -m leanformer.scripts.train_reasoning
 ```
 
 The training script handles:
-- 3 epochs with cosine LR schedule and warmup
+- 3 epochs with cosine LR schedule (lr=2e-4) and 3000-step warmup
 - Mixed precision (fp16) with GradScaler
-- Gradient accumulation (effective batch 64)
+- Gradient accumulation (batch_size=2 x 32 = effective batch 64)
 - Validation every 1000 steps
-- Checkpointing every 5000 steps
+- Checkpointing every 5000 steps (~1.6 hours, ~146 checkpoints total)
 - Exit head tuning (1000 steps, frozen base, lr=1e-3)
 - Post-training validation (perplexity, generation samples, orthogonal capacity measurement)
 - Model hash computation and storage
 
 **Expected outcomes:**
-- Perplexity < 50 on held-out reasoning corpus
+- Validation perplexity decreasing steadily (may plateau above 50 due to limited training data relative to model capacity — see note in Current State)
 - FF sparsity climbing toward 0.8
 - Diverse per-layer representations enabling many orthogonal deltas
 
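The bullet list in the final hunk describes the training-loop mechanics: cosine LR with warmup, fp16 with GradScaler, gradient accumulation (micro-batch 2 over 32 accumulation steps), and periodic checkpoints. A minimal PyTorch sketch of that loop shape follows. It is illustrative only, not the project's actual `train_reasoning` script: `model`, `train_loader`, the forward signature, and `save_checkpoint` are placeholders, and whether the script counts micro-batch steps or optimizer updates for warmup and checkpointing is an assumption here.

```python
# Illustrative sketch of the loop structure described in the bullet list above:
# fp16 autocast + GradScaler, gradient accumulation (micro-batch 2, 32 accumulation
# steps = effective batch 64), linear warmup + cosine decay, periodic checkpoints.
# `model`, `train_loader`, and the forward signature are placeholders, not names
# from the LeanFormer codebase; step-counting semantics are assumed.
import math
import os
import torch
import torch.nn.functional as F

BASE_LR = 2e-4
WARMUP_STEPS = 3000
ACCUM_STEPS = 32          # 2-sample micro-batches x 32 = effective batch 64
CHECKPOINT_EVERY = 5000   # the diff quotes ~1.6 hours between checkpoints


def lr_lambda(step: int, total_updates: int) -> float:
    """Linear warmup for WARMUP_STEPS updates, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, total_updates - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def save_checkpoint(model, optimizer, scaler, step, path="checkpoints"):
    """Placeholder checkpoint writer: saves enough state to resume training."""
    os.makedirs(path, exist_ok=True)
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "scaler": scaler.state_dict(),
                "step": step},
               os.path.join(path, f"step_{step:07d}.pt"))


def train(model, train_loader, total_updates, epochs=3, device="cuda"):
    optimizer = torch.optim.AdamW(model.parameters(), lr=BASE_LR)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda s: lr_lambda(s, total_updates))
    scaler = torch.cuda.amp.GradScaler()

    step = 0
    optimizer.zero_grad(set_to_none=True)
    for _ in range(epochs):
        for input_ids in train_loader:            # assumed: (2, seq_len) LongTensor batches
            input_ids = input_ids.to(device)
            with torch.cuda.amp.autocast(dtype=torch.float16):
                logits = model(input_ids[:, :-1])         # assumed forward -> (B, T-1, vocab)
                loss = F.cross_entropy(
                    logits.reshape(-1, logits.size(-1)),  # next-token LM loss
                    input_ids[:, 1:].reshape(-1))
            scaler.scale(loss / ACCUM_STEPS).backward()   # average over the accumulation window

            step += 1
            if step % ACCUM_STEPS == 0:           # one optimizer update per 64 samples
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad(set_to_none=True)
                scheduler.step()

            if step % CHECKPOINT_EVERY == 0:      # periodic resumable checkpoints
                save_checkpoint(model, optimizer, scaler, step)
```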