HeCheng0625/nanoLLM

nanoLLM 🔬

A hands-on research & engineering playground for learning modern LLM architectures by re-implementing, ablating, and benchmarking key architectural and optimization variants under fixed compute (FLOPs).

Base model: this project uses Qwen3 0.6B as the baseline model, with training and evaluation tooling built on 🤗 Transformers + Accelerate (and optional FSDP).


🧩 Features & Roadmap

Features

  • Base Architecture: RMSNorm, SwiGLU, RoPE, Multi-head Attention (MHA), Grouped Query Attention (GQA).

  • Attention

    • MHA (baseline)
    • MLA (multi-head latent attention)
    • Gated attention (attention output gating / QKV gating variants)
    • Linear attention
    • Sparse attention
    • Sliding-window attention
    • Hybrid architecture
  • MoE

    • Top-k routing
    • Load balancing losses
  • mHC / Hyper-Connections

    • multi-stream / hyper residual pathways
  • Engram-style memory

    • Engram: retrieval / hash memory modules
  • Optimization & Training

    • AdamW baseline vs. Muon
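
For the AdamW vs. Muon comparison, it helps to see what makes Muon different: instead of per-element scaling, it orthogonalizes the momentum of each 2-D weight matrix with a few Newton-Schulz iterations. The following is a minimal pure-Python sketch of that idea, not the code in `optim/muon.py`. The quintic coefficients are taken from the publicly available Muon reference code, and the function names, learning rate, and plain (non-Nesterov) momentum are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def newton_schulz(g: Matrix, steps: int = 5, eps: float = 1e-7) -> Matrix:
    """Push the singular values of g toward 1 without computing an SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315  # assumed from the public Muon reference code
    flip = len(g) > len(g[0])
    x = transpose(g) if flip else g  # iterate on the wide orientation
    norm = math.sqrt(sum(v * v for row in x for v in row)) + eps
    x = [[v / norm for v in row] for row in x]  # Frobenius scaling keeps singular values <= 1
    for _ in range(steps):
        gram = matmul(x, transpose(x))
        gram2 = matmul(gram, gram)
        poly = [[b * p + c * q for p, q in zip(r1, r2)] for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)
        x = [[a * u + v for u, v in zip(r1, r2)] for r1, r2 in zip(x, px)]
    return transpose(x) if flip else x


def muon_step(
    weight: Matrix, grad: Matrix, momentum: Matrix, lr: float = 0.02, beta: float = 0.95
) -> Tuple[Matrix, Matrix]:
    """One simplified Muon-style update for a single 2-D weight matrix."""
    momentum = [[beta * m + g for m, g in zip(mr, gr)] for mr, gr in zip(momentum, grad)]
    update = newton_schulz(momentum)
    weight = [[w - lr * u for w, u in zip(wr, ur)] for wr, ur in zip(weight, update)]
    return weight, momentum
```

The iteration only approximates an orthogonal matrix: singular values end up near 1 rather than exactly 1. Embeddings, norms, and other non-matrix parameters are usually left to AdamW in Muon setups, which is worth keeping in mind when comparing the two optimizers at matched compute.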

Roadmap

Phase 0: Foundations

  • Config system (YAML/OmegaConf or dataclasses)
  • Dataset pipeline (streaming + shuffling + packing)
  • Tokenizer integration via 🤗 Transformers (reuse Qwen tokenizer)
  • Baseline decoder-only model (Qwen3-style)
  • Training loop with Accelerate (fp16/bf16, grad accumulation, ckpt)
  • Evaluation harness (perplexity + small task suite)
  • Logging (W&B or TensorBoard) + run manifest export
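
The "packing" step in the dataset pipeline can be sketched as a generator that concatenates tokenized documents and cuts them into fixed-length blocks. This is a minimal illustration that works with streaming input; the function name `pack_sequences`, the `eos_id` separator, and the choice to drop the trailing partial block are assumptions, not the behavior of `data/packing.py`.

```python
from typing import Iterable, Iterator, List


def pack_sequences(
    docs: Iterable[List[int]], block_size: int, eos_id: int
) -> Iterator[List[int]]:
    """Concatenate tokenized documents, separated by EOS, into fixed-size blocks."""
    buffer: List[int] = []
    for tokens in docs:
        buffer.extend(tokens)
        buffer.append(eos_id)  # mark the document boundary
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # The trailing partial block is dropped here; padding it is another option.


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = list(pack_sequences(docs, block_size=4, eos_id=0))
```

With this simple scheme, tokens in one block can attend across document boundaries unless the attention mask is adjusted, which is a design choice to record in the run manifest.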

Phase 1: Baseline + MoE (priority)

  • Dense baseline reproduction: stable loss curve, expected PPL
  • MoE FFN block (Top-2/Top-1 routing)
  • Load-balancing losses (aux loss variants) + metrics (expert usage entropy, overflow)
  • Capacity factor + token dropping policy
  • Fixed-FLOPs comparison scripts (dense vs MoE at matched compute)
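
To make the Phase 1 items concrete, here is a small pure-Python sketch of top-k routing with a capacity factor, a token-dropping policy, and an auxiliary load-balancing loss. The loss shown is the Switch Transformer-style form (number of experts times the sum of routed fraction times mean router probability), used here as a stand-in; it is not necessarily the variant this repository will adopt. The function name `route_top_k` and its return layout are assumptions for illustration.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(
    router_logits: List[List[float]], k: int, capacity_factor: float
) -> Tuple[List[List[Tuple[int, float]]], float, int]:
    """Return per-token (expert, weight) assignments, the aux loss, and the dropped count."""
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    # Each expert accepts at most `capacity` token slots per batch.
    capacity = math.ceil(capacity_factor * num_tokens * k / num_experts)

    load = [0] * num_experts         # slots accepted per expert
    routed = [0] * num_experts       # slots requested per expert, before dropping
    mean_prob = [0.0] * num_experts  # average router probability per expert
    assignments = []
    dropped = 0

    for logits in router_logits:
        probs = softmax(logits)
        for e, p in enumerate(probs):
            mean_prob[e] += p / num_tokens
        top = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:k]
        norm = sum(probs[e] for e in top)  # renormalize over the selected experts
        kept = []
        for e in top:
            routed[e] += 1
            if load[e] < capacity:
                load[e] += 1
                kept.append((e, probs[e] / norm))
            else:
                dropped += 1  # overflow: this token skips the full expert
        assignments.append(kept)

    # Switch-style auxiliary loss: num_experts * sum_i f_i * P_i.
    frac = [r / (num_tokens * k) for r in routed]
    aux_loss = num_experts * sum(f * p for f, p in zip(frac, mean_prob))
    return assignments, aux_loss, dropped


balanced = [[5.0 if e == t % 4 else 0.0 for e in range(4)] for t in range(8)]
collapsed = [[5.0, 0.0, 0.0, 0.0] for _ in range(8)]
_, aux_balanced, dropped_balanced = route_top_k(balanced, k=1, capacity_factor=1.0)
_, aux_collapsed, dropped_collapsed = route_top_k(collapsed, k=1, capacity_factor=1.0)
```

The balanced batch yields the minimum loss of about 1 and drops nothing, while the collapsed batch scores much higher and overflows the first expert. The `dropped` count and per-expert `routed` totals are the kind of quantities the overflow and usage-entropy metrics would be built from.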

Phase 2: Attention variants

  • Sliding-window attention
  • Sparse attention (block/global-local)
  • MLA-style KV compression family
  • Gated attention variants
  • Linear attention baseline (1–2 representative forms)
  • Hybrid configs (e.g., every N layers global attention)
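
Two of the Phase 2 ideas are easy to pin down in a few lines: the causal sliding-window mask, and a hybrid schedule where every N-th layer uses global attention. The sketch below uses plain Python lists; the names `sliding_window_mask` and `layer_schedule` and the "global"/"local" labels are assumptions for illustration.

```python
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query i may attend to key j."""
    return [
        [j <= i and i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def layer_schedule(num_layers: int, global_every: int) -> List[str]:
    """Mark every `global_every`-th layer as global; the rest use the local window."""
    return [
        "global" if (layer + 1) % global_every == 0 else "local"
        for layer in range(num_layers)
    ]


mask = sliding_window_mask(seq_len=5, window=2)
schedule = layer_schedule(num_layers=6, global_every=3)
```

Here `window` counts the current token, so `window=2` lets each query see itself and one previous position. Setting `window >= seq_len` recovers the ordinary causal mask, which gives a convenient equivalence check against the MHA baseline.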

Phase 3: Memory + residual/path tricks

  • Engram memory module (retrieval + gating)
  • mHC / hyper connections (n-stream residual)
  • Combined ablations (MoE + attention + memory)

Repository layout

nanoLLM/
├── configs/                        # YAML configs (model/train/eval)
│   ├── model/
│   │   ├── baseline_0_6B.yaml      # Dense baseline (Qwen3-style)
│   │   ├── moe_a0_6B.yaml          # MoE baseline (priority)
│   │   └── experimental_mhc.yaml
│   └── train/                      # Training hyperparameters
│       ├── pretrain_adamw.yaml
│       └── pretrain_muon.yaml
│
├── src/                            # Python package root (HF-compatible)
│   └── nanollm/
│       ├── __init__.py
│       ├── configuration_nanollm.py    # HF Config (PretrainedConfig)
│       ├── modeling_nanollm.py         # HF Model (PreTrainedModel)
│       ├── modeling_blocks.py          # TransformerBlock / MoEBlock assembly
│       │
│       ├── components/                 # Pluggable building blocks
│       │   ├── norms.py                # RMSNorm / LayerNorm variants
│       │   ├── rotary.py               # RoPE utilities
│       │   ├── mlp.py                  # FFN / SwiGLU / gated variants
│       │   ├── attention/
│       │   │   ├── mha.py
│       │   │   ├── mla.py
│       │   │   ├── linear.py
│       │   │   ├── sparse.py
│       │   │   ├── sliding_window.py
│       │   │   └── hybrid.py
│       │   ├── moe/
│       │   │   ├── router.py           # top-k routing
│       │   │   ├── experts.py          # expert MLP
│       │   │   └── losses.py           # load-balancing losses (DeepSeek-style)
│       │   ├── memory/
│       │   │   └── engram.py           # retrieval/hash memory + gating
│       │   └── residual/
│       │       └── mhc.py              # hyper-connections / multi-stream residuals
│       │
│       ├── data/
│       │   ├── datasets.py             # HF datasets / streaming loaders
│       │   ├── packing.py              # sequence packing
│       │   └── collate.py              # batch collation
│       │
│       ├── optim/                      # One true place for optim & schedulers
│       │   ├── adamw.py
│       │   ├── muon.py                 # Custom Muon optimizer
│       │   └── schedulers.py           # cosine/warmup/etc.
│       │
│       ├── train/
│       │   ├── trainer.py              # accelerate-based trainer
│       │   ├── losses.py               # LM loss + aux losses (moe, engram, mtp)
│       │   └── hooks.py                # optional: callbacks (log, ckpt, eval)
│       │
│       ├── eval/
│       │   ├── perplexity.py
│       │   ├── harness.py              # downstream eval glue
│       │   └── metrics.py
│       │
│       └── utils/
│           ├── flops.py                # CRITICAL: FLOPs estimator/accounting
│           ├── metrics.py              # training metrics + EMA, etc.
│           ├── logging.py              # W&B/TB logger adapters
│           ├── checkpoint.py
│           └── seed.py
│
├── scripts/                           # CLI entrypoints
│   ├── train.py
│   ├── eval.py
│   ├── estimate_flops.py
│   └── sweep.py                        # optional
│
├── tests/
│   ├── test_shapes.py
│   ├── test_attention_equivalence.py
│   └── test_moe_routing.py
│
├── requirements.txt
├── pyproject.toml                      # (recommended) or setup.cfg
└── README.md

Quickstart

1. Train dense baseline

python scripts/train.py \
  --config configs/model/baseline_0_6B.yaml \
  --train-config configs/train/pretrain_adamw.yaml

2. Train MoE variant

python scripts/train.py \
  --config configs/model/moe_a0_6B.yaml \
  --train-config configs/train/pretrain_muon.yaml

3. Evaluate perplexity

python scripts/eval.py \
  --config configs/model/baseline_0_6B.yaml \
  --ckpt path/to/checkpoint

4. Estimate FLOPs

python scripts/estimate_flops.py \
  --config configs/model/baseline_0_6B.yaml
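
As a rough illustration of the kind of accounting a fixed-FLOPs comparison needs, the sketch below uses the common approximation that training costs about 6 FLOPs per active non-embedding parameter per token. It is not the logic of `scripts/estimate_flops.py`. The `ModelShape` fields are hypothetical names rather than the repository's config schema, and the shape values are illustrative numbers chosen to land near the 0.6B scale. Router weights, norms, embeddings, and the sequence-length-dependent attention score FLOPs are ignored.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    """Illustrative shape fields; hypothetical names, not the repo's config schema."""
    hidden: int
    layers: int
    heads: int
    kv_heads: int
    head_dim: int
    ffn: int
    num_experts: int = 1        # 1 means a dense FFN
    experts_per_token: int = 1


def params_per_layer(m: ModelShape, active_only: bool) -> int:
    q_and_o = 2 * m.hidden * m.heads * m.head_dim
    k_and_v = 2 * m.hidden * m.kv_heads * m.head_dim  # GQA shrinks the K/V projections
    one_ffn = 3 * m.hidden * m.ffn                     # SwiGLU: gate, up, down
    n_ffn = m.experts_per_token if active_only else m.num_experts
    return q_and_o + k_and_v + n_ffn * one_ffn


def train_flops(m: ModelShape, tokens: int) -> int:
    """Approximate training FLOPs as 6 * active non-embedding params * tokens."""
    return 6 * m.layers * params_per_layer(m, active_only=True) * tokens


def tokens_for_budget(m: ModelShape, flops_budget: int) -> int:
    """How many tokens a model can see under a fixed FLOPs budget."""
    return flops_budget // (6 * m.layers * params_per_layer(m, active_only=True))


dense = ModelShape(hidden=1024, layers=28, heads=16, kv_heads=8, head_dim=128, ffn=3072)
# Four active experts of width 768 match the dense FFN's active compute.
moe = ModelShape(hidden=1024, layers=28, heads=16, kv_heads=8, head_dim=128,
                 ffn=768, num_experts=16, experts_per_token=4)
```

Under this approximation `dense` and `moe` cost the same per token while `moe` holds more total parameters, which is the comparison the dense vs. MoE experiments are meant to isolate.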

Implementation notes (what to reuse vs write)

Reuse from 🤗 Transformers:

  • tokenizer / vocab / special tokens
  • dataset loading utilities
  • (optionally) weight init conventions / config patterns

Write in nanoLLM:

  • a clean, minimal Qwen3-style model (so you understand it)
  • attention / MoE / memory variants as components
  • fixed-FLOPs accounting + fair benchmarking harness

TODO (starter list)

Baseline model

  • RMSNorm + RoPE + SwiGLU FFN
  • KV cache support + causal mask correctness tests
  • HF-compatible Config + from_pretrained-style loading (optional)
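
Two of the baseline pieces are small enough to sketch without a tensor library, which also makes their key properties easy to test. The code below is a minimal, unbatched illustration: `rms_norm` rescales a vector by its root-mean-square, and `rope` rotates consecutive pairs of features by a position-dependent angle. The interleaved pair layout is an assumption made for readability; many implementations use a split-half layout instead, and the two are not interchangeable when loading pretrained weights.

```python
import math
from typing import List


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Scale x by the reciprocal of its root-mean-square, then by a learned weight."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rope(x: List[float], pos: int, base: float = 10000.0) -> List[float]:
    """Rotate each pair (x[i], x[i+1]), i even, by the angle pos * base**(-i/d)."""
    d = len(x)
    out: List[float] = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(u * v for u, v in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
# Attention scores under RoPE depend only on the relative offset (2 in both cases).
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 10), rope(k, 8))
```

The relative-position property checked by `score_near` and `score_far` is also a useful invariant for KV cache tests: a cached key rotated at its original position should give the same score as one recomputed from scratch.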

MoE (priority)

  • Router: top-1/top-2, jitter noise, z-loss (optional)
  • Losses: balance/importance/load losses (DeepSeek-like variants)
  • Capacity factor + dispatch implementation
  • Metrics dashboard for expert load
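
The two optional router regularizers named above can be sketched briefly. The z-loss here is the mean squared log-sum-exp of the router logits, and the jitter is multiplicative uniform noise on the router input. Both follow the forms commonly described for sparse MoE training, and the function names and signatures are assumptions for illustration.

```python
import math
import random
from typing import List


def router_z_loss(router_logits: List[List[float]]) -> float:
    """Mean over tokens of logsumexp(logits)**2, which penalizes large router logits."""
    total = 0.0
    for logits in router_logits:
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse * lse
    return total / len(router_logits)


def jitter(x: List[float], eps: float, rng: random.Random) -> List[float]:
    """Multiply router inputs by noise drawn uniformly from [1 - eps, 1 + eps]."""
    return [v * rng.uniform(1.0 - eps, 1.0 + eps) for v in x]


small = router_z_loss([[0.0, 0.0]])
large = router_z_loss([[10.0, 10.0]])
noisy = jitter([1.0, 2.0, 3.0], eps=0.1, rng=random.Random(0))
```

Jitter is typically applied only during training, and the z-loss is added to the language-modeling loss with a small coefficient alongside the load-balancing term.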

Attention variants

  • Sliding window attention with KV cache
  • Block sparse attention
  • MLA module + ablation knobs (rank, shared projection, etc.)
  • Gated attention (output gate, Q/K gate variants)
  • Linear attention baseline
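
For the linear attention baseline, one representative form replaces the softmax with a positive feature map so that causal attention can be computed from a fixed-size running state. The sketch below uses the `elu(x) + 1` feature map and includes a quadratic reference implementation for equivalence testing; it is one possible baseline under those assumptions, not the form this repository has committed to.

```python
import math
from typing import List

Vec = List[float]


def phi(x: Vec) -> Vec:
    """Positive feature map elu(x) + 1."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def causal_linear_attention(q: List[Vec], k: List[Vec], v: List[Vec]) -> List[Vec]:
    """Recurrent form: a fixed-size state per step instead of a growing KV cache."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k_j) v_j^T
    z = [0.0] * d_k                            # running sum of phi(k_j)
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        fk, fq = phi(k_t), phi(q_t)
        for i in range(d_k):
            z[i] += fk[i]
            for j in range(d_v):
                state[i][j] += fk[i] * v_t[j]
        denom = sum(fq[i] * z[i] for i in range(d_k))
        out.append([sum(fq[i] * state[i][j] for i in range(d_k)) / denom
                    for j in range(d_v)])
    return out


def causal_linear_attention_quadratic(q: List[Vec], k: List[Vec], v: List[Vec]) -> List[Vec]:
    """Reference form: explicit weights phi(q_t) . phi(k_j) over all j <= t."""
    out = []
    for t, q_t in enumerate(q):
        fq = phi(q_t)
        w = [sum(a * b for a, b in zip(fq, phi(k[j]))) for j in range(t + 1)]
        total = sum(w)
        out.append([sum(w[j] * v[j][c] for j in range(t + 1)) / total
                    for c in range(len(v[0]))])
    return out


q = [[0.2, -0.5], [1.0, 0.3], [-0.7, 0.8]]
k = [[0.1, 0.4], [-0.3, 0.9], [0.6, -0.2]]
v = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [0.0, 3.0, 1.0]]
recurrent = causal_linear_attention(q, k, v)
reference = causal_linear_attention_quadratic(q, k, v)
```

Comparing `recurrent` against `reference` is the same style of check that `tests/test_attention_equivalence.py` suggests: the fast path and the obvious path should agree to within floating-point tolerance.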

Engram + mHC

  • Retrieval table + hashing + gating API
  • Plug memory into attention context or FFN residual
  • mHC multi-stream residual with fused-friendly layout (later)
