
Roadmap

Research Questions

Primary: Can an LLM learn to "speak" the way humans do?

Human children acquire language through immersion (hearing language around them), mimicry (attempting to reproduce what they hear), and positive reinforcement (adults responding to successful attempts). Bootstrap Basil tests whether the same loop can bootstrap a language model from random weights:

  • Immersion: The trunk trains on full session transcripts (Tutor, Sophie, Story content) -- Basil "listens" to fluent English without producing it
  • Mimicry: The LoRA adapter trains on Basil's own graded attempts -- reinforcing outputs that match what was modeled
  • Positive reinforcement: The score-to-weight table ensures only above-threshold outputs are reinforced, while poor attempts are discarded

If this works, it suggests that the statistical structure of language exposure, combined with selective reinforcement of successful imitation, is sufficient to bootstrap linguistic competence -- no pre-existing corpus, no human annotation, no distillation from a capable model.

Secondary: Will identity emerge from language acquisition?

Every training round, Basil is asked: "Who are you?" The answer is logged to identities/identity_log.jsonl. The trajectory so far:

| Model | Response to "Who are you?" |
| --- | --- |
| v001 (random init) | "shaft slip slip slip shaft" |
| v002 | "That's correct! A soft and" |
| v005 | "Hello, Sophie and little Basil! I'm" |
| v006 | "Fireflies have long feathers that glow to communicate" |
| v010 | "Hello, Sophie and Basil! Let's" |
| v012 | "Sure, Tutor'm really" |
| v014 | "Keep getting this kind of , do" |
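The log format isn't documented here, so this is only a minimal sketch of reading the trajectory back, assuming one JSON object per line with `model` and `response` fields (both names are assumptions, not confirmed repo schema):

```python
import json
from pathlib import Path

LOG_PATH = Path("identities/identity_log.jsonl")  # one JSON object per line

def read_identity_log(path: Path = LOG_PATH) -> list[dict]:
    """Return all logged 'Who are you?' answers, oldest first."""
    if not path.exists():
        return []
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical usage: print the identity trajectory across model versions.
for entry in read_identity_log():
    print(entry.get("model"), "->", entry.get("response"))
```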

Basil has moved from random tokens to mimicking the conversational patterns it's immersed in (echoing Tutor/Sophie phrases). The question is whether continued training will produce a progression like:

"shaft slip slip slip shaft" → "Good try Sophie" → "I like stories" → "I am Basil" → "I am Basil and I love to hear stories and learn"

If a sense of self emerges purely from language acquisition -- without any explicit identity training data -- that would be a striking parallel to how human children develop self-concept through language.

Tertiary: What can this teach us about human language and identity acquisition?

Regardless of whether Basil reaches fluency, the project generates data about:

  • Critical mass: How much exposure is needed before productive language appears? Is there a phase transition, or is progress continuous?
  • Developmental staging: Does enforcing a curriculum progression (babble → single words → phrases → sentences) actually help, or would unstructured exposure work just as well?
  • Identity and language: If identity emerges, does it require explicit identity-focused training, or does it arise naturally from learning to use first-person language in context?
  • Reinforcement sensitivity: How important is the quality threshold for reinforcement? Would reinforcing all attempts (even garbage) work, or does selective reinforcement matter?
  • Separation of knowledge and behavior: The trunk/LoRA split mirrors a theoretical distinction between linguistic competence (knowing what English sounds like) and linguistic performance (producing appropriate responses). Does this separation help or hinder learning?

These are empirical questions. The codebase is designed to let you run controlled experiments to answer them.


Current Status (February 2026)

Bootstrap Basil is an experiment in training a language model from scratch using only AI-generated curriculum -- no human-written training data, no distillation from a larger model, no pre-existing text corpus. The base model is GPT-2 (124M parameters) initialized with random weights.

This is not proven to work. What we've demonstrated so far:

  • Basil has bootstrapped from random token babble to occasional task compliance and thematic word salads in ~50K graded turns across 14 training rounds
  • At age_band 2 ("first words"), ~18% of graded turns score >= 3 (usable for LoRA training), up from 0% at initialization
  • The LoRA adapter reliably steers output toward domain-relevant vocabulary (e.g., generating "fire" when asked about fire trucks, "leaves" and "plants" when discussing ecosystems)
  • Score 4+ rate (on-topic with some target words) is at ~11% and climbing
  • Score 6+ rate (actual task compliance -- saying the target word) is at ~6%
  • Training dynamics are healthy: both trunk (world knowledge) and LoRA (task behavior) converge independently with measurable improvement each round

What hasn't happened yet:

  • Basil has not progressed beyond age_band 2 (out of 7)
  • Outputs are still word salads, not coherent sentences
  • Compliance is inconsistent -- the same model that hits "fire" perfectly will miss "roots" entirely
  • We don't know if this approach can reach conversational fluency, or if it will plateau

The architecture is designed to scale, but nobody has pushed it far enough to find out.


Levers to Pull

These are the dimensions we believe could accelerate or unlock further progress. They're ordered roughly by expected impact and ease of implementation.

1. Scale: More Data, More Training

The most straightforward lever. Each generation-training cycle currently takes ~16 hours of data generation + 8-12 hours of training on a single GPU. More compute directly translates to more cycles.

  • More parallel workers for content generation (currently 30, limited by API rate limits and GPU memory for Basil inference)
  • Faster/more GPUs for training (currently single-GPU, ~15K optimizer steps per round)
  • More training rounds -- we've done 14 rounds total; the model is still improving each round with no sign of plateau
  • Longer training runs -- the last run hit the 12-hour time limit before WORLD convergence; longer runs may extract more from existing data

2. Session Diversity: New Content Types

The model currently trains on four session types (Classroom, Storytime, HowItWorks, WhyChain). Each exposes Basil to different language patterns. More variety means richer trunk training data and more diverse LoRA examples.

Ideas for new session types:

  • Rhyme Time -- nursery rhymes and word play (phonological patterns)
  • Show and Tell -- describe objects, properties, categories (descriptive language)
  • Conversation Practice -- multi-turn back-and-forth with Sophie (dialogue structure)
  • Reading Aloud -- Tutor reads, pauses, Basil fills in predictable words (cloze tasks)
  • Counting/Listing -- structured enumeration (sequential patterns)
  • Simon Says -- action words and following instructions (imperative comprehension)
  • Identity / Self-Expression -- "Who are you?", "What do you like?", "Tell me about yourself" (first-person language, self-referential output). This directly targets the secondary research question: if identity emerges from language, explicitly practicing self-referential language may accelerate it. Alternatively, not training on identity prompts and watching whether identity emerges anyway would be the stronger finding.

Each new session type requires a session runner (Python class) and prompt templates. The existing infrastructure (grading, training, orchestration) handles new types automatically.
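As a sketch of what a new runner might look like (the real session-runner interface lives in the repo; the class, method, and parameter names here are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class RhymeTimeSession:
    """Hypothetical runner for a 'Rhyme Time' session type.

    Illustrative only -- the repo's actual runner interface may differ.
    """
    age_band: int
    prompt_template: str = (
        "Tutor: Let's rhyme! I say '{word}'. What rhymes with '{word}'?"
    )
    transcript: list[str] = field(default_factory=list)

    def run_turn(self, word: str, basil_generate) -> str:
        """Build the prompt, ask Basil, and record both sides for grading."""
        prompt = self.prompt_template.format(word=word)
        reply = basil_generate(prompt)
        self.transcript.append(f"{prompt}\nBasil: {reply}")
        return reply
```

Grading, training, and orchestration would then pick up the recorded transcript like any other session type.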

3. Base Model Scale

The current base is GPT-2 (124M parameters). Larger models have more capacity to absorb language structure from the AI-generated curriculum. Options:

  • GPT-2 Medium (355M) -- 2.8x larger, same architecture, drop-in replacement
  • GPT-2 Large (774M) -- 6.2x larger, needs more GPU memory
  • GPT-2 XL (1.5B) -- 12x larger, likely needs multi-GPU or quantization
  • Other architectures -- Llama, Mistral, etc. would require adapter changes but may learn faster

The LoRA adapter is currently ~1.4M parameters (0.56% of base). With a larger base, the LoRA rank and alpha may need adjustment to maintain the right balance of adaptation vs stability.

Trade-off: larger models need more data per training round to avoid overfitting, and inference is slower (reducing generation throughput). But they may reach milestones faster per-token.

4. LoRA Configuration Tuning

The LoRA adapter controls how strongly Basil's task-specific behavior diverges from the base trunk. Current settings:

| Parameter | Current | Exploration Range |
| --- | --- | --- |
| Rank | 8 | 4-32 |
| Alpha | 16 | 8-64 |
| Target modules | c_attn, c_proj | Add c_fc, c_proj (MLP layers) |
| Dropout | 0.05 | 0.0-0.15 |
| Strength ramp | Linear 0.0-1.0 over bands 0-7 | S-curve, stepped, faster ramp |
| Epoch cap ramp | Linear 0-100 over bands 0-7 | Front-loaded (more training early) |
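The two ramp rows can be made concrete with illustrative formulas; the exact schedules in the repo may differ, and the `sqrt` variant is just one example of a "faster early ramp":

```python
def lora_strength(age_band: int, bands: int = 7, schedule: str = "linear") -> float:
    """LoRA influence scale over bands 0-7 (illustrative, not the repo's code)."""
    t = age_band / bands            # 0.0 at band 0, 1.0 at band 7
    if schedule == "linear":
        return t
    if schedule == "sqrt":          # rises faster at early bands
        return t ** 0.5
    raise ValueError(schedule)

def epoch_cap(age_band: int, bands: int = 7, max_epochs: int = 100) -> int:
    """Linear 0-100 epoch cap over bands 0-7."""
    return round(max_epochs * age_band / bands)
```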

Key questions:

  • Should the LoRA target MLP layers in addition to attention? This would give it more capacity to learn task-specific behavior at the cost of more parameters
  • Is the linear strength ramp optimal, or should LoRA influence increase faster at early bands (where the signal is most needed)?
  • Should the epoch cap be front-loaded (more epochs at low bands where data is noisy and the model needs more passes)?

5. Score-to-Weight Mapping

Training weights determine how much each graded turn contributes to LoRA learning. The current BASIL_POLICY_SCORE_WEIGHTS_TABLE uses a 2D mapping (score x age_band) with specific design rules:

  • Score <= age_band gets weight 0 (only scores above the current band reinforce)
  • Minimum reinforcing weight is 0.15 with linear ramp to 1.0
  • Score 7 always gets weight 1.0
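The three rules above can be sketched as a function. This is a reconstruction from the stated rules, not the exact BASIL_POLICY_SCORE_WEIGHTS_TABLE values:

```python
def score_weight(score: int, age_band: int, max_score: int = 7) -> float:
    """Training weight for a graded turn (reconstruction of the design rules)."""
    if score <= age_band:
        return 0.0                  # at or below the current band: discarded
    if score >= max_score:
        return 1.0                  # score 7 always gets full weight
    lowest = age_band + 1           # first reinforcing score gets 0.15
    frac = (score - lowest) / (max_score - lowest)
    return 0.15 + (1.0 - 0.15) * frac
```

An exponential variant would replace the linear `frac` with e.g. `frac ** 2` to concentrate weight on the rare high scores.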

Potential improvements:

  • Exponential weighting -- heavily reward the rare high scores instead of linear ramp
  • Negative reinforcement -- currently score 0 is simply ignored; could we use low-scoring examples as negative examples (DPO-style)?
  • Dynamic weighting -- adjust weights based on score distribution, not just fixed table
  • Separate weights for trunk vs LoRA -- the trunk currently uses lora_weight / TRUNK_WEIGHT_DIVISOR for Basil zones; this ratio could be tuned

6. Grading System Calibration

The two-grader architecture (English Grader + Task Grader with gating) is the primary quality signal. Improvements here directly translate to better training signal:

  • Stronger grader model -- upgrade from gpt-4o-mini to gpt-4o for more nuanced scoring (cost vs accuracy trade-off)
  • Grader calibration harness -- systematic testing of grader prompts against known-good/bad examples to measure accuracy
  • Additional programmatic checks -- beyond target-word presence, detect sentence structure, word order, grammaticality
  • Grader ensembling -- run multiple graders and take majority vote for noisy cases
  • Self-consistency -- grade the same response multiple times and use the median
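The self-consistency idea is easy to prototype. Here `grade_once` is a stand-in for a call to the LLM grader returning a 0-7 score:

```python
import statistics

def self_consistent_score(grade_once, response: str, n: int = 5) -> float:
    """Grade the same response n times and take the median, damping grader noise."""
    return statistics.median(grade_once(response) for _ in range(n))
```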

7. Training Architecture Refinements

The dual-objective training (World trunk + Basil LoRA) has room for experimentation:

  • DPO / preference learning -- instead of weighted SFT, pair high-scoring and low-scoring responses to the same prompt and train with Direct Preference Optimization
  • Curriculum-aware data mixing -- weight recent training data more heavily (currently using recency weighting with half-life of 6 runs, tunable)
  • Progressive context expansion -- currently LoRA training uses episode-local context to prevent garbage contamination; as Basil improves, gradually expand to full session context
  • Separate trunk/LoRA learning rate scheduling -- currently using cosine schedules; could experiment with warmup-stable-decay or cyclical schedules
  • Multi-objective balancing -- dynamically adjust the WORLD/BASIL epoch ratio based on which objective has more room to improve
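The DPO idea starts with preference pairs. A sketch of building them from graded turns, where the `min_gap` score margin is an assumed hyperparameter rather than a repo setting:

```python
from collections import defaultdict

def build_preference_pairs(graded_turns, min_gap: int = 3):
    """Pair high- and low-scoring responses to the same prompt for DPO-style
    training. `graded_turns` is an iterable of (prompt, response, score)."""
    by_prompt = defaultdict(list)
    for prompt, response, score in graded_turns:
        by_prompt[prompt].append((score, response))
    pairs = []
    for prompt, scored in by_prompt.items():
        scored.sort(reverse=True)
        best_score, best = scored[0]
        worst_score, worst = scored[-1]
        if best_score - worst_score >= min_gap:   # only clearly separated pairs
            pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs
```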

8. Evaluation and Benchmarks

Currently, progress is measured by score distribution, usable rate, and qualitative inspection of outputs. More rigorous evaluation would help:

  • Held-out test prompts -- a fixed set of prompts evaluated after each training round to track progress consistently
  • Automated benchmarks -- perplexity on a standard English corpus, word-level accuracy on target tasks
  • Human evaluation -- periodic human scoring of Basil outputs to calibrate LLM grader accuracy
  • Developmental milestones -- define specific, measurable criteria for each age band transition
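A held-out harness can be very small. The prompts below and the `basil_generate` callable are hypothetical; the repo does not yet ship a fixed eval set:

```python
HELD_OUT = [  # hypothetical fixed prompts, reused after every training round
    {"prompt": "What do fire trucks spray?", "target": "water"},
    {"prompt": "What part of a plant grows underground?", "target": "roots"},
]

def evaluate_round(basil_generate, eval_set=HELD_OUT) -> float:
    """Fraction of held-out prompts whose response contains the target word."""
    hits = sum(
        item["target"] in basil_generate(item["prompt"]).lower()
        for item in eval_set
    )
    return hits / len(eval_set)
```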

9. Infrastructure and Tooling

  • Dashboard -- real-time visualization of training progress, score distributions, age band history
  • Experiment tracking -- integration with Weights & Biases or similar for hyperparameter sweeps
  • Distributed training -- multi-GPU training for larger models and longer runs
  • Cost optimization -- batch API calls, cache tutor/sophie outputs, reduce redundant LLM calls

Architecture Principles

For anyone picking this up, the core design decisions to understand:

  1. No gold-standard data -- Basil never sees human-written text or distilled outputs. All training data is generated by the system itself (Tutor, Sophie, Basil's own attempts). The trunk learns language from AI-generated teaching sessions; the LoRA learns task behavior from Basil's own graded outputs.

  2. Developmental staging -- Everything scales with age_band (0-7): LoRA strength, training epochs, max output tokens, training thresholds, score weights. This prevents early-stage overfitting while unlocking full capacity at maturity.

  3. Separation of concerns -- The trunk (base model) learns "what English sounds like" from full session transcripts. The LoRA adapter learns "how to respond to tasks" from Basil's own attempts. They train on different data with different objectives and converge independently.

  4. Quality-gated reinforcement -- Not all data is equal. The two-grader architecture, score-to-weight table, and usable-turn thresholds ensure the model only reinforces behavior above its current level. Score <= age_band is discarded, not reinforced.

  5. Self-improving loop -- Better Basil outputs lead to higher-scoring training data, which leads to better training, which leads to better outputs. The question is whether this loop has enough signal to bootstrap all the way to fluency, or if it plateaus.


How to Contribute

If you're interested in pushing this further:

  1. Run it -- Clone the repo, set up an API key, run python orchestrator.py run. Watch the scores. See if they improve over multiple training rounds.

  2. Try a lever -- Pick one of the levers above and experiment. The codebase is modular: new session types, model sizes, LoRA configs, and weight tables can all be changed independently.

  3. Report results -- Whether it works or not, the data is valuable. Open an issue with your config, training curves, and score distributions.

  4. Challenge the architecture -- If you think the approach is fundamentally flawed, explain why. The best contribution might be identifying why this can't work and what would need to change.

The most interesting open questions are the research questions at the top of this document. The codebase is designed to let you run the experiments that answer them.