Can a system learn over time by using conversations with humans as a new dataset?
This plan tests whether GaLore-based continual pretraining enables:
- Knowledge acquisition from conversational data
- Retention of base capabilities (avoiding catastrophic forgetting)
- Memory-efficient training suitable for online/edge deployment
| Model | Parameters | Context | Why |
|---|---|---|---|
| Qwen2.5-0.5B-Instruct | 0.5B | 32K | Best tiny instruction model, multilingual |
| Qwen2.5-1.5B-Instruct | 1.5B | 32K | Sweet spot for quality vs memory |
| Llama-3.2-1B-Instruct | 1B | 128K | Meta's latest small model |
| Gemma-2-2B-it | 2B | 8K | Google's distilled model, strong reasoning |
| Phi-3.5-mini-instruct | 3.8B | 128K | Microsoft's textbook-trained, best reasoning/size |
Primary recommendation: start with Qwen2.5-1.5B-Instruct, the best balance of:
- Small enough for single GPU GaLore training
- Strong instruction-following baseline
- Large context for conversation history
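Beyond the model choice, the GaLore-specific knobs could be sketched as config. Key names here are illustrative, not a fixed schema; the values follow settings commonly reported for GaLore and the experiment variants below:

```yaml
# GaLore hyperparameters (illustrative names; hedged defaults)
galore_rank: 256            # Experiment 1 variant C; use 128 for variant D
galore_update_proj_gap: 200 # steps between projection updates
galore_scale: 0.25
learning_rate: 1.0e-5
replay_ratio: 0.1           # Experiment 4 variant B
```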
```yaml
model_name: Qwen/Qwen2.5-1.5B-Instruct
# Alternative: meta-llama/Llama-3.2-1B-Instruct
# Alternative: google/gemma-2-2b-it
```

| Dataset | Size | Description | Use |
|---|---|---|---|
| LMSYS-Chat-1M | 1M convos | Real human-LLM conversations from Chatbot Arena | Primary training |
| ShareGPT | ~90K convos | Shared ChatGPT conversations | Supplementary |
| ShareChat | 142K convos | Cross-platform (GPT, Claude, Gemini, etc.) | Diversity |
| OpenAssistant | 161K messages | Human-annotated conversation trees | Quality signal |
Primary: lmsys/lmsys-chat-1m (HuggingFace)
- Real user queries, diverse topics
- Multi-turn conversations
- Natural distribution of difficulty
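A minimal sketch of turning one multi-turn conversation into a training string. A real pipeline would use the tokenizer's chat template; the `{"role", "content"}` schema here mirrors lmsys-chat-1m's `conversation` field (an assumption about the dataset layout):

```python
# Flatten a multi-turn conversation into one training string.
# Simplified stand-in for tokenizer.apply_chat_template.

def flatten_conversation(turns, eos="</s>"):
    """Render role-tagged turns as a single training string."""
    parts = []
    for turn in turns:
        parts.append(f"<|{turn['role']}|>\n{turn['content']}{eos}")
    return "\n".join(parts)

convo = [
    {"role": "user", "content": "What is GaLore?"},
    {"role": "assistant", "content": "A memory-efficient training method."},
]
text = flatten_conversation(convo)
```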
| Benchmark | Purpose |
|---|---|
| MMLU (5-shot) | General knowledge retention |
| HellaSwag | Commonsense reasoning |
| ARC-Challenge | Science reasoning |
| TruthfulQA | Factual accuracy |
| GSM8K | Math reasoning |
Learning Signal
- Perplexity on held-out conversation data (↓ = learning)
- Response quality on conversation test set

Forgetting Detection
- MMLU accuracy before/after (Δ < 2% acceptable)
- Perplexity on original pretraining distribution (C4/RedPajama)

Efficiency
- Memory usage vs full AdamW
- Training throughput (tokens/sec)
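The perplexity metrics above reduce to exponentiating the mean per-token negative log-likelihood on the held-out set; a minimal sketch:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Identical per-token NLLs of ln(8) give perplexity exactly 8
ppl = perplexity([math.log(8.0)] * 4)
```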
Step 0: Baseline eval (MMLU, HellaSwag, conversation perplexity)
Step 500: Quick check (conversation perplexity only)
Step 1000: Full eval
Step 2500: Full eval
Step 5000: Final eval + forgetting analysis
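The schedule above can be encoded as a simple lookup keyed by checkpoint step (a sketch; in practice the step would come from a trainer callback):

```python
def eval_plan(step):
    """Map a checkpoint step to the evaluation run at that step."""
    schedule = {
        0:    "baseline: MMLU, HellaSwag, conversation perplexity",
        500:  "quick: conversation perplexity only",
        1000: "full eval",
        2500: "full eval",
        5000: "final eval + forgetting analysis",
    }
    return schedule.get(step)  # None for non-eval steps
```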
Experiment 1: Optimizer comparison
Compare continual-training approaches:
- A: Full AdamW (memory baseline)
- B: LoRA r=64 (parameter-efficient baseline)
- C: GaLore r=256 (our approach)
- D: GaLore r=128 (memory-optimized)
Hypothesis: GaLore achieves similar learning to full AdamW with LoRA-like memory.
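A rough, back-of-envelope accounting of optimizer-state sizes for a single weight matrix motivates the hypothesis. Element counts only; GaLore's projection matrix is included, but exact bookkeeping varies by implementation:

```python
# Rough optimizer-state element counts for one m x n weight matrix
# (fp32 states assumed; counts of elements, not bytes).
m, n = 2048, 2048  # one transformer projection matrix

def adamw_states(m, n):
    # AdamW keeps two full-size moments (exp_avg, exp_avg_sq)
    return 2 * m * n

def lora_states(m, n, r):
    # LoRA trains adapters A (r x n) and B (m x r); moments on those only
    return 2 * (r * n + m * r)

def galore_states(m, n, r):
    # GaLore keeps moments on the projected gradient (r x n)
    # plus the projection matrix P (m x r)
    return 2 * r * n + m * r

full   = adamw_states(m, n)          # 8,388,608
lora   = lora_states(m, n, 64)       # 524,288
galore = galore_states(m, n, 256)    # 1,572,864
```

At rank 256, GaLore's per-matrix state is well under half of full AdamW's, while LoRA at rank 64 is smaller still but updates far fewer directions.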
Experiment 2: Learning curves
Train on LMSYS-Chat-1M with increasing data:
- 10K, 50K, 100K, 500K conversations
- Measure: conversation perplexity, MMLU retention
Hypothesis: Perplexity improves log-linearly; forgetting stays bounded.
Experiment 3: Online ingestion
Simulate real-time conversation ingestion:
- Batch size 1-2, continuous stream
- Measure adaptation speed to new topics
Hypothesis: Model adapts to conversation patterns within ~1000 steps.
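The streaming setup can be sketched as a micro-batcher over an incoming conversation iterator:

```python
from itertools import islice

def stream_batches(stream, batch_size=2):
    """Yield micro-batches of size 1-2 from a continuous stream."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

batches = list(stream_batches(["c1", "c2", "c3"], batch_size=2))
```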
Experiment 4: Replay and forgetting
Test replay strategies:
- A: No replay (pure online)
- B: 10% replay from original data
- C: 20% replay
Hypothesis: Small replay (10%) significantly reduces forgetting.
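A minimal sketch of the replay mixing (variant B, 10%), assuming a pre-sampled buffer of original pretraining examples is available:

```python
import random

def mixed_batch(new_examples, replay_buffer, replay_ratio=0.1, rng=None):
    """Replace ~replay_ratio of each batch with original-data samples."""
    rng = rng or random.Random(0)
    if replay_ratio > 0:
        n_replay = max(1, int(len(new_examples) * replay_ratio))
    else:
        n_replay = 0
    kept = new_examples[: len(new_examples) - n_replay]
    replayed = [rng.choice(replay_buffer) for _ in range(n_replay)]
    return kept + replayed
```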
Task 1: Model support
- Update config for Qwen/Llama/Gemma models
- Handle chat templates properly
- Add bfloat16 support
Task 2: Conversation data pipeline
- Parse multi-turn conversation format
- Convert to prompt/response pairs
- Handle conversation context window
Task 3: Evaluation infrastructure
- Integrate lm-evaluation-harness
- Add perplexity tracking
- Create evaluation checkpoints
- Track per-domain perplexity
- Implement MMLU quick-eval
Task 4: Experiment infrastructure
- Add replay buffer option
- Sweep configs for experiments
- Logging to W&B/TensorBoard
- Result aggregation
| Metric | Target |
|---|---|
| Conversation perplexity | ↓ 15%+ from baseline |
| MMLU retention | Δ < 3% from baseline |
| Memory vs full AdamW | < 50% |
| Training speed | > 80% of full AdamW |
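These targets can be checked mechanically at the final eval. A sketch; the input numbers below are hypothetical placeholders, not results:

```python
def meets_targets(ppl_drop_pct, mmlu_delta_pct, mem_frac, speed_frac):
    """Check measured results against the success-metric table."""
    return (ppl_drop_pct >= 15.0          # conversation perplexity ↓ 15%+
            and abs(mmlu_delta_pct) < 3.0  # MMLU retention within 3%
            and mem_frac < 0.5             # memory < 50% of full AdamW
            and speed_frac > 0.8)          # > 80% of full-AdamW throughput

# Hypothetical measurements for illustration only
ok = meets_targets(ppl_drop_pct=18.2, mmlu_delta_pct=-1.1,
                   mem_frac=0.42, speed_frac=0.87)
```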
| Week | Focus |
|---|---|
| 1 | Tasks 1-2: Model + data support |
| 2 | Tasks 3-4: Evaluation infrastructure |
| 3 | Experiments 1-2: Baselines + learning curves |
| 4 | Experiments 3-4: Online + forgetting |
| 5 | Analysis + writeup |
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (Zhao et al., 2024)
- LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset (Zheng et al., 2023)
- Continual Learning of Large Language Models: A Comprehensive Survey (Wang et al., 2024)
- TiC-LM: A Multi-Year Benchmark for Continual Pretraining (OpenReview, 2025)