
Validation Plan: Continuous Learning from Human Conversations

Research Question

Can a system learn over time by using conversations with humans as a new dataset?

This validates whether GaLore-based continuous pretraining enables:

  1. Knowledge acquisition from conversational data
  2. Retention of base capabilities (avoiding catastrophic forgetting)
  3. Memory-efficient training suitable for online/edge deployment

Phase 1: Model Selection

Recommended Base Models (Modern, Small, Efficient)

| Model | Parameters | Context | Why |
|---|---|---|---|
| Qwen2.5-0.5B-Instruct | 0.5B | 128K | Best tiny instruction model, multilingual |
| Qwen2.5-1.5B-Instruct | 1.5B | 128K | Sweet spot for quality vs. memory |
| Llama-3.2-1B-Instruct | 1B | 128K | Meta's latest small model |
| Gemma-2-2B-it | 2B | 8K | Google's distilled model, strong reasoning |
| Phi-3.5-mini-instruct | 3.8B | 128K | Microsoft's textbook-trained model, best reasoning per size |

Primary recommendation: Start with Qwen2.5-1.5B-Instruct, which offers the best balance of:

  • Small enough for single GPU GaLore training
  • Strong instruction-following baseline
  • Large context for conversation history

Config Update

```yaml
model_name: Qwen/Qwen2.5-1.5B-Instruct
# Alternative: meta-llama/Llama-3.2-1B-Instruct
# Alternative: google/gemma-2-2b-it
```

Phase 2: Datasets

Conversational Data Sources

| Dataset | Size | Description | Use |
|---|---|---|---|
| LMSYS-Chat-1M | 1M conversations | Real human-LLM conversations from Chatbot Arena | Primary training |
| ShareGPT | ~90K conversations | Shared ChatGPT conversations | Supplementary |
| ShareChat | 142K conversations | Cross-platform (GPT, Claude, Gemini, etc.) | Diversity |
| OpenAssistant | 161K messages | Human-annotated conversation trees | Quality signal |

Primary: lmsys/lmsys-chat-1m (HuggingFace)

  • Real user queries, diverse topics
  • Multi-turn conversations
  • Natural distribution of difficulty

Evaluation Datasets (Forgetting Detection)

| Benchmark | Purpose |
|---|---|
| MMLU (5-shot) | General knowledge retention |
| HellaSwag | Commonsense reasoning |
| ARC-Challenge | Science reasoning |
| TruthfulQA | Factual accuracy |
| GSM8K | Math reasoning |

Phase 3: Evaluation Protocol

Metrics

  1. Learning Signal

    • Perplexity on held-out conversation data (↓ = learning)
    • Response quality on conversation test set
  2. Forgetting Detection

    • MMLU accuracy before/after (Δ < 3% acceptable, matching the success criteria)
    • Perplexity on original pretraining distribution (C4/RedPajama)
  3. Efficiency

    • Memory usage vs full AdamW
    • Training throughput (tokens/sec)
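The two scalar learning/forgetting signals above fit in a few lines. A minimal sketch: perplexity is computed from per-token negative log-likelihoods on held-out data, and the forgetting delta is a signed accuracy drop (helper names are illustrative, not from any library):

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood) over held-out tokens.
    Lower is better; a drop over training indicates learning."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def forgetting_delta(baseline_acc, current_acc):
    """Signed drop in benchmark accuracy (positive = forgetting)."""
    return baseline_acc - current_acc

# Sanity check: a uniform distribution over V tokens has NLL = ln(V)
# per token, so perplexity is exactly V.
ppl = perplexity([math.log(50)] * 10)   # -> 50.0
```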

Evaluation Schedule

| Step | Evaluation |
|---|---|
| 0 | Baseline eval (MMLU, HellaSwag, conversation perplexity) |
| 500 | Quick check (conversation perplexity only) |
| 1000 | Full eval |
| 2500 | Full eval |
| 5000 | Final eval + forgetting analysis |
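In the training loop, this schedule reduces to a step-to-eval-type lookup. A minimal sketch (the tier names are illustrative); full evals are kept sparse because MMLU-style benchmarks are expensive, while perplexity-only checks are cheap:

```python
# Step -> evaluation tier, mirroring the schedule above.
EVAL_SCHEDULE = {
    0: "baseline",    # MMLU, HellaSwag, conversation perplexity
    500: "quick",     # conversation perplexity only
    1000: "full",
    2500: "full",
    5000: "final",    # full eval + forgetting analysis
}

def eval_type(step):
    """Return the evaluation tier due at this step, or None."""
    return EVAL_SCHEDULE.get(step)
```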

Phase 4: Experiments

Experiment 1: Baseline Comparison

Compare continuous training approaches:

  • A: Full AdamW (memory baseline)
  • B: LoRA r=64 (parameter-efficient baseline)
  • C: GaLore r=256 (our approach)
  • D: GaLore r=128 (memory-optimized)

Hypothesis: GaLore achieves similar learning to full AdamW with LoRA-like memory.
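A back-of-envelope check of the memory hypothesis: AdamW keeps two moment tensors per weight element, while GaLore keeps them on a rank-r projection of the gradient plus the projection matrix itself. The sketch below is pure arithmetic under that accounting; the matrix shape is roughly a Qwen2.5-1.5B MLP projection, and the function names are illustrative:

```python
def adamw_state_elems(m, n):
    """AdamW stores two moments per element of an m x n weight."""
    return 2 * m * n

def galore_state_elems(m, n, r):
    """GaLore (per the paper, assuming m <= n): two moments on the
    r x n projected gradient, plus the m x r projection matrix."""
    return 2 * r * n + m * r

# Example: a 1536 x 8960 MLP weight at GaLore rank 256.
full = adamw_state_elems(1536, 8960)        # 27,525,120 elements
low = galore_state_elems(1536, 8960, 256)   # 4,980,736 elements
print(f"GaLore optimizer state: {low / full:.1%} of AdamW's")
```

At rank 256 this is under a fifth of AdamW's optimizer state for that layer, consistent with the "LoRA-like memory" target; rank 128 (variant D) roughly halves it again.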

Experiment 2: Conversation Learning Curve

Train on LMSYS-Chat-1M with increasing data:

  • 10K, 50K, 100K, 500K conversations
  • Measure: conversation perplexity, MMLU retention

Hypothesis: Perplexity improves log-linearly; forgetting stays bounded.
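The log-linear hypothesis can be tested by fitting perplexity against log10(conversation count) across the four sweep points; a clearly negative slope with small residuals supports it. A pure-Python least-squares sketch, with made-up numbers for illustration only:

```python
import math

def fit_log_linear(sizes, ppls):
    """Least-squares fit ppl ~ a + b * log10(size); returns (a, b).
    A clearly negative slope b supports log-linear improvement."""
    xs = [math.log10(s) for s in sizes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ppls) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ppls))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Illustrative (made-up) numbers for the 10K -> 500K sweep:
a, b = fit_log_linear([10_000, 50_000, 100_000, 500_000],
                      [9.1, 8.2, 7.9, 7.0])
```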

Experiment 3: Online Simulation

Simulate real-time conversation ingestion:

  • Batch size 1-2, continuous stream
  • Measure adaptation speed to new topics

Hypothesis: Model adapts to conversation patterns within ~1000 steps.

Experiment 4: Forgetting Mitigation

Test replay strategies:

  • A: No replay (pure online)
  • B: 10% replay from original data
  • C: 20% replay

Hypothesis: Small replay (10%) significantly reduces forgetting.
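Strategies B and C amount to stochastically swapping a fraction of the online stream for samples from the original-pretraining data. A minimal sketch, assuming online examples arrive as an iterator and the replay data fits in a list (names and signature are illustrative):

```python
import random

def mixed_stream(online_iter, replay_buffer, replay_frac=0.10, seed=0):
    """Yield training examples, replacing ~replay_frac of the online
    stream with samples drawn uniformly from the replay buffer.
    replay_frac=0.0 recovers strategy A (pure online)."""
    rng = random.Random(seed)
    for example in online_iter:
        if replay_buffer and rng.random() < replay_frac:
            yield rng.choice(replay_buffer)
        else:
            yield example
```

Sampling per-example (rather than per-batch) keeps the replay ratio correct even at the batch sizes of 1-2 used in Experiment 3.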


Phase 5: Implementation Tasks

Task 1: Add Modern Model Support

  • Update config for Qwen/Llama/Gemma models
  • Handle chat templates properly
  • Add bfloat16 support
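For chat templates, the robust path is `tokenizer.apply_chat_template` from transformers. As a reference point, the core ChatML layout that Qwen2.5 models use can be sketched in pure Python; this illustration omits the default system prompt and tool handling of the official template:

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render role/content messages into the ChatML layout used by
    Qwen2.5: <|im_start|>role\\ncontent<|im_end|>\\n per message.
    Prefer tokenizer.apply_chat_template in real training code."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```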

Task 2: Add LMSYS-Chat Data Loader

  • Parse multi-turn conversation format
  • Convert to prompt/response pairs
  • Handle conversation context window
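Assuming each LMSYS-Chat-1M row exposes its turns as a list of `{'role', 'content'}` dicts, the conversion to training pairs might look like the sketch below; `max_context` and the helper name are illustrative, and truncation to the model's context window would still happen at tokenization time:

```python
def to_pairs(conversation, max_context=4):
    """Split a multi-turn conversation (list of {'role', 'content'}
    dicts) into (context, response) pairs: one pair per assistant
    turn, keeping at most max_context preceding messages as context."""
    pairs = []
    for i, msg in enumerate(conversation):
        if msg["role"] == "assistant":
            context = conversation[max(0, i - max_context):i]
            pairs.append((context, msg["content"]))
    return pairs
```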

Task 3: Add Evaluation Harness

  • Integrate lm-evaluation-harness
  • Add perplexity tracking
  • Create evaluation checkpoints

Task 4: Add Forgetting Metrics

  • Track per-domain perplexity
  • Implement MMLU quick-eval
  • Add replay buffer option

Task 5: Experiment Runner

  • Sweep configs for experiments
  • Logging to W&B/TensorBoard
  • Result aggregation
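The sweep runner reduces to expanding a parameter grid into flat config dicts, one per run. A minimal sketch (grid keys and values below are illustrative, loosely following Experiment 1):

```python
from itertools import product

def sweep_configs(grid):
    """Expand a dict of option lists into one config dict per
    combination (Cartesian product)."""
    keys = list(grid)
    return [dict(zip(keys, vals))
            for vals in product(*(grid[k] for k in keys))]

# Illustrative grid: 3 optimizers x 2 ranks = 6 runs.
configs = sweep_configs({
    "optimizer": ["adamw", "lora", "galore"],
    "rank": [128, 256],
})
```

Each resulting dict can then be logged as the run config in W&B/TensorBoard, which makes result aggregation a simple group-by over config keys.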

Success Criteria

| Metric | Target |
|---|---|
| Conversation perplexity | ≥ 15% reduction from baseline |
| MMLU retention | Δ < 3% from baseline |
| Memory vs. full AdamW | < 50% |
| Training speed | > 80% of full AdamW throughput |

Timeline

| Week | Focus |
|---|---|
| 1 | Tasks 1-2: Model + data support |
| 2 | Tasks 3-4: Evaluation infrastructure |
| 3 | Experiments 1-2: Baselines + learning curves |
| 4 | Experiments 3-4: Online simulation + forgetting |
| 5 | Analysis + write-up |

References

  1. GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (Zhao et al., 2024)
  2. LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset (Zheng et al., 2023)
  3. Continual Learning of Large Language Models: A Comprehensive Survey (Wang et al., 2024)
  4. TiC-LM: A Multi-Year Benchmark for Continual Pretraining (OpenReview, 2025)