A hands-on research & engineering playground for learning modern LLM architectures by re-implementing, ablating, and benchmarking key architectural and optimization variants under fixed compute (FLOPs).
Base model: this project uses Qwen3 0.6B as the baseline model, with training/eval tooling built on 🤗 Transformers + Accelerate (and optional FSDP).
Base Architecture: RMSNorm, SwiGLU, RoPE, Multi-head Attention (MHA), Grouped Query Attention (GQA).
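As a reference point for the first of these components, RMSNorm rescales each hidden vector by its root mean square and then applies a learned per-channel gain. The snippet below is a minimal, dependency-free sketch rather than the project's implementation; the function name `rms_norm` and the `eps` default are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def rms_norm(x: Sequence[float], gain: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize `x` by its root mean square, then scale by a per-channel gain."""
    if len(x) != len(gain):
        raise ValueError("x and gain must have the same length")
    # Mean of squared activations; unlike LayerNorm, no mean is subtracted.
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]
```

With a gain of all ones, the output has a mean square of roughly 1, which makes a convenient sanity check when porting the idea to a tensor library.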
Attention
- MHA (baseline)
- MLA (multi-head latent attention)
- Gated attention (attention output gating / QKV gating variants)
- Linear attention
- Sparse attention
- Sliding-window attention
- Hybrid architecture
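To make the difference between the MHA baseline and sliding-window attention concrete, the sketch below builds both masks as plain boolean matrices. It is illustrative only: the function names are hypothetical, and the convention that `window` counts the current position is an assumption (implementations differ on this).

```python
from typing import List


def causal_mask(seq_len: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Causal mask limited to the `window` most recent positions (self included)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

When `window >= seq_len` the sliding-window mask equals the full causal mask, which is one simple equivalence check between the two attention variants.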
MoE
- Top-k routing
- Load balancing losses
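The two MoE items above can be sketched in a few lines of plain Python. This is a hedged illustration, not the project's router: renormalizing the gate weights over the selected experts is one common choice, and the auxiliary loss shown is the Switch-Transformer-style form (`num_experts * sum(f_e * p_e)`), which may differ from the variants the project ends up ablating. All function names are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) for the k most probable experts.

    Weights are softmax probabilities renormalized over the chosen experts.
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    norm = sum(probs[e] for e in chosen)
    return [(e, probs[e] / norm) for e in chosen]


def load_balance_loss(batch_logits: List[List[float]], k: int) -> float:
    """Auxiliary loss: num_experts * sum_e f_e * p_e.

    f_e: fraction of routing assignments sent to expert e.
    p_e: mean router probability of expert e over the batch.
    """
    num_tokens = len(batch_logits)
    num_experts = len(batch_logits[0])
    counts = [0] * num_experts
    prob_sums = [0.0] * num_experts
    for logits in batch_logits:
        for e, p in enumerate(softmax(logits)):
            prob_sums[e] += p
        for e, _ in top_k_route(logits, k):
            counts[e] += 1
    f = [c / (num_tokens * k) for c in counts]
    p = [s / num_tokens for s in prob_sums]
    return num_experts * sum(fe * pe for fe, pe in zip(f, p))
```

Under this definition a perfectly balanced batch gives a loss of 1, and the value grows toward `num_experts` as routing collapses onto a single expert.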
mHC / Hyper-Connections
- multi-stream / hyper residual pathways
Engram-style memory
- Engram: retrieval / hash memory modules
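The README does not pin down the memory design, so the following is only one plausible reading of "retrieval / hash memory": a fixed-size table indexed by a hash of the trailing token n-gram, with the retrieved vector mixed into the hidden state through a scalar gate. The class name, the polynomial hash, and the scalar gate are all assumptions made for this sketch.

```python
from typing import List, Sequence


class HashedNgramMemory:
    """Toy lookup memory: hash the last `n` token ids into a fixed-size table."""

    def __init__(self, num_slots: int, dim: int, n: int = 2) -> None:
        self.num_slots = num_slots
        self.n = n
        # In a real model these rows would be learned parameters; zeros here.
        self.table: List[List[float]] = [[0.0] * dim for _ in range(num_slots)]

    def slot(self, token_ids: Sequence[int]) -> int:
        """Deterministic polynomial hash of the trailing n-gram."""
        h = 0
        for t in token_ids[-self.n:]:
            h = (h * 1_000_003 + t) % self.num_slots
        return h

    def read(self, token_ids: Sequence[int]) -> List[float]:
        """Retrieve the memory row addressed by the trailing n-gram."""
        return self.table[self.slot(token_ids)]


def gated_add(hidden: List[float], memory: List[float], gate: float) -> List[float]:
    """Mix a retrieved memory vector into the hidden state; gate is in [0, 1]."""
    return [h + gate * m for h, m in zip(hidden, memory)]
```

Because the address depends only on the last `n` tokens, two contexts that end in the same n-gram read the same row; hash collisions between different n-grams are expected and are part of what the gate has to compensate for.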
Optimization & Training
- AdamW baseline vs. Muon
- Config system (YAML/OmegaConf or dataclasses)
- Dataset pipeline (streaming + shuffling + packing)
- Tokenizer integration via 🤗 Transformers (reuse Qwen tokenizer)
- Baseline decoder-only model (Qwen3-style)
- Training loop with Accelerate (fp16/bf16, grad accumulation, ckpt)
- Evaluation harness (perplexity + small task suite)
- Logging (W&B or TensorBoard) + run manifest export
- Dense baseline reproduction: stable loss curve, expected PPL
- MoE FFN block (Top-2/Top-1 routing)
- Load-balancing losses (aux loss variants) + metrics (expert usage entropy, overflow)
- Capacity factor + token dropping policy
- Fixed-FLOPs comparison scripts (dense vs MoE at matched compute)
- Sliding-window attention
- Sparse attention (block/global-local)
- MLA-style KV compression family
- Gated attention variants
- Linear attention baseline (1–2 representative forms)
- Hybrid configs (e.g., every N layers global attention)
- Engram memory module (retrieval + gating)
- mHC / hyper connections (n-stream residual)
- Combined ablations (MoE + attention + memory)
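For the fixed-FLOPs comparison item above, a rough way to match compute between a dense and an MoE model is the common `6 * N * D` rule of thumb (about 6 FLOPs per active parameter per training token). The sketch below is a back-of-the-envelope helper under that assumption: it ignores the sequence-length-dependent attention term and the router cost, and all function names and example numbers are hypothetical rather than taken from the project.

```python
def train_flops(active_params: float, tokens: float) -> float:
    """Approximate training FLOPs with the 6 * N * D rule of thumb."""
    return 6.0 * active_params * tokens


def tokens_for_budget(flops_budget: float, active_params: float) -> float:
    """Number of training tokens that fit in `flops_budget` for a given model."""
    return flops_budget / (6.0 * active_params)


def moe_active_params(shared_params: float, expert_params: float, top_k: int) -> float:
    """Parameters used per token: shared weights plus the top_k routed experts."""
    return shared_params + top_k * expert_params


# Hypothetical example: give an MoE model the same budget as a dense run.
dense_budget = train_flops(active_params=0.6e9, tokens=10e9)
moe_active = moe_active_params(shared_params=0.2e9, expert_params=0.1e9, top_k=2)
moe_tokens = tokens_for_budget(dense_budget, moe_active)
```

In this example the MoE model activates fewer parameters per token than the dense model, so the same budget buys it proportionally more training tokens; a proper accounting would add the terms this approximation leaves out.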
nanoLLM/
├── configs/                            # YAML configs (model/train/eval)
│   ├── model/
│   │   ├── baseline_0_6B.yaml          # Dense baseline (Qwen3-style)
│   │   ├── moe_a0_6B.yaml              # MoE baseline (priority)
│   │   └── experimental_mhc.yaml
│   └── train/                          # Training hyperparameters
│       ├── pretrain_adamw.yaml
│       └── pretrain_muon.yaml
│
├── src/                                # Python package root (HF-compatible)
│   └── nanollm/
│       ├── __init__.py
│       ├── configuration_nanollm.py    # HF Config (PretrainedConfig)
│       ├── modeling_nanollm.py         # HF Model (PreTrainedModel)
│       ├── modeling_blocks.py          # TransformerBlock / MoEBlock assembly
│       │
│       ├── components/                 # Pluggable building blocks
│       │   ├── norms.py                # RMSNorm / LayerNorm variants
│       │   ├── rotary.py               # RoPE utilities
│       │   ├── mlp.py                  # FFN / SwiGLU / gated variants
│       │   ├── attention/
│       │   │   ├── mha.py
│       │   │   ├── mla.py
│       │   │   ├── linear.py
│       │   │   ├── sparse.py
│       │   │   ├── sliding_window.py
│       │   │   └── hybrid.py
│       │   ├── moe/
│       │   │   ├── router.py           # top-k routing
│       │   │   ├── experts.py          # expert MLP
│       │   │   └── losses.py           # load-balancing losses (DeepSeek-style)
│       │   ├── memory/
│       │   │   └── engram.py           # retrieval/hash memory + gating
│       │   └── residual/
│       │       └── mhc.py              # hyper-connections / multi-stream residuals
│       │
│       ├── data/
│       │   ├── datasets.py             # HF datasets / streaming loaders
│       │   ├── packing.py              # sequence packing
│       │   └── collate.py              # batch collation
│       │
│       ├── optim/                      # One true place for optim & schedulers
│       │   ├── adamw.py
│       │   ├── muon.py                 # Custom Muon optimizer
│       │   └── schedulers.py           # cosine/warmup/etc.
│       │
│       ├── train/
│       │   ├── trainer.py              # accelerate-based trainer
│       │   ├── losses.py               # LM loss + aux losses (moe, engram, mtp)
│       │   └── hooks.py                # optional: callbacks (log, ckpt, eval)
│       │
│       ├── eval/
│       │   ├── perplexity.py
│       │   ├── harness.py              # downstream eval glue
│       │   └── metrics.py
│       │
│       └── utils/
│           ├── flops.py                # CRITICAL: FLOPs estimator/accounting
│           ├── metrics.py              # training metrics + EMA, etc.
│           ├── logging.py              # W&B/TB logger adapters
│           ├── checkpoint.py
│           └── seed.py
│
├── scripts/                            # CLI entrypoints
│   ├── train.py
│   ├── eval.py
│   ├── estimate_flops.py
│   └── sweep.py                        # optional
│
├── tests/
│   ├── test_shapes.py
│   ├── test_attention_equivalence.py
│   └── test_moe_routing.py
│
├── requirements.txt
├── pyproject.toml                      # (recommended) or setup.cfg
└── README.md
python scripts/train.py \
  --config configs/model/baseline_0_6B.yaml \
  --train-config configs/train/pretrain_adamw.yaml

python scripts/train.py \
  --config configs/model/moe_a0_6B.yaml \
  --train-config configs/train/pretrain_muon.yaml

python scripts/eval.py \
  --config configs/model/baseline_0_6B.yaml \
  --ckpt path/to/checkpoint

python scripts/estimate_flops.py \
  --config configs/model/baseline_0_6B.yaml

Reuse from 🤗 Transformers:
- tokenizer / vocab / special tokens
- dataset loading utilities
- (optionally) weight init conventions / config patterns
Write in nanoLLM:
- a clean, minimal Qwen3-style model (so you understand it)
- attention / MoE / memory variants as components
- fixed-FLOPs accounting + fair benchmarking harness
- RMSNorm + RoPE + SwiGLU FFN
- KV cache support + causal mask correctness tests
- HF-compatible Config + from_pretrained-style loading (optional)
- Router: top-1/top-2, jitter noise, z-loss (optional)
- Losses: balance/importance/load losses (DeepSeek-like variants)
- Capacity factor + dispatch implementation
- Metrics dashboard for expert load
- Sliding window attention with KV cache
- Block sparse attention
- MLA module + ablation knobs (rank, shared projection, etc.)
- Gated attention (output gate, Q/K gate variants)
- Linear attention baseline
- Retrieval table + hashing + gating API
- Plug memory into attention context or FFN residual
- mHC multi-stream residual with fused-friendly layout (later)
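For the "Capacity factor + dispatch implementation" item above, the sketch below shows how a capacity limit turns into dropped tokens. It is a simplified, framework-free illustration: the first-come-first-served drop policy is an assumption (priority by router probability is another option), and the function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def expert_capacity(
    num_tokens: int, num_experts: int, top_k: int, capacity_factor: float
) -> int:
    """Maximum number of assignments each expert accepts per batch."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)


def dispatch(
    assignments: List[List[int]], num_experts: int, capacity: int
) -> Tuple[Dict[int, List[int]], List[Tuple[int, int]]]:
    """Fill experts in token order; assignments past capacity are dropped.

    assignments[t] lists the experts chosen by the router for token t.
    Returns (expert -> accepted token indices, dropped (token, expert) pairs).
    """
    accepted: Dict[int, List[int]] = {e: [] for e in range(num_experts)}
    dropped: List[Tuple[int, int]] = []
    for token, experts in enumerate(assignments):
        for e in experts:
            if len(accepted[e]) < capacity:
                accepted[e].append(token)
            else:
                dropped.append((token, e))
    return accepted, dropped
```

The length of `dropped` relative to the total number of assignments gives the overflow rate, one of the expert-load metrics worth tracking alongside usage entropy.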