Deep dives into AI infrastructure papers and open-source systems from frontier labs.
This repo collects engineering-focused analyses of papers and open-source releases covering:
- Training systems — parallelism strategies, mixed precision, communication primitives
- Inference systems — KV cache, speculative decoding, serving architectures
- Model architectures with infra implications — MoE routing, attention variants, long context
- Open-source infra components — kernels, schedulers, file systems
Every paper gets both an English (en.md) and Chinese (zh.md) write-up using the same template. See _template/.
- Scaling Laws — Kaplan 2020 + Chinchilla 2022 — power-law compute/data/param tradeoffs; the 20-tokens-per-param rule
- FlashAttention 1 / 2 / 3 — IO-aware exact attention, Hopper async + FP8
- Triton — block-level GPU kernel DSL + MLIR compiler; the productivity multiplier
- PagedAttention / vLLM — OS-style paging for KV cache, continuous batching
- Orca — Continuous Batching — iteration-level scheduling; selective batching; foundation for vLLM/SGLang
- Megatron-LM (TP / PP / SP) — tensor, pipeline, sequence parallelism + selective recompute
- ZeRO / FSDP — sharded data parallelism; orthogonal to Megatron
- Speculative Decoding — draft + verify; lossless 2–4× decode speedup
- Ring Attention / Context Parallelism — exact attention at 1M+ context via sequence sharding
- Grouped Query Attention (GQA) — KV cache shrinks by a factor of H/G (H query heads sharing G KV heads); the default attention variant in Llama 3, Mistral, Gemma
- Rotary Position Embeddings (RoPE) — position-by-rotation; relative, parameter-free, flash-friendly; long-context extensions
- DistServe — prefill/decode disaggregation; goodput as the right metric
- SGLang — RadixAttention prefix caching; frontend DSL for multi-call LLM programs
- Prefix Caching — hash-based vs radix-trie matching; eviction policies; hit rate economics; multi-tier GPU→CPU→disk; disaggregated routing
- Weight Quantization — GPTQ & AWQ — INT4 post-training; Hessian vs activation-aware
- SmoothQuant — W8A8; activation-to-weight outlier migration
- RLHF / InstructGPT — three-stage SFT+RM+PPO; the post-training foundation
- DPO — collapse RLHF into one supervised step; the simpler default
- GRPO — Group Relative Policy Optimization — critic-free RL via group advantage; the algorithm behind DeepSeek-R1
- SimPO — Simple Preference Optimization — reference-free DPO; length-normalized reward + target margin
- LoRA / QLoRA — low-rank weight adaptation; 4-bit fine-tuning on a single GPU
- Mixed Precision Training — FP16 → BF16 → FP8; AMP, loss scaling, per-tensor/block FP8 scaling
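- Chunked Prefill — Sarathi-Serve — interleave prefill chunks with decode; eliminate decode (TBT) stalls behind long prefills on a single GPU
- KV Cache Quantization — KIVI & KVQuant — 2-bit KV cache with per-channel keys and per-token values (KIVI); sub-4-bit with outlier handling (KVQuant); completes the quantization arc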
- verl — HybridFlow — per-model parallelism + CPU offload for PPO/GRPO at 70B+ scale; the infra behind R1-style training
- GPU Interconnect primer — NVLink, NVSwitch, RDMA, IBGDA; the fabric assumed by everything else
- Hopper / H100 Architecture Primer — wgmma, TMA, FP8 Tensor Cores, Thread Block Clusters; the compute primitives behind FlashAttention-3, DeepGEMM, DualPipe
- Mamba and State Space Models — linear-time, constant-memory-decode; hybrids with attention
- Inference-Time Scaling — Test-Time Compute — parallel search (best-of-N, PRM beam search) vs sequential refinement (thinking tokens); compute-optimal strategy by difficulty
- Process Reward Models (PRMs) — step-level verification; PRM800K human labels vs MC rollout auto-labeling; beam search and MCTS integration
- Speculative Decoding Variants — Medusa / EAGLE — tree-structured self-draft; Medusa parallel heads; EAGLE feature-level autoregressive draft; EAGLE-2 adaptive trees
- Blackwell / B200 Architecture Primer — FP4 Tensor Cores, HBM3e 192 GB, NVLink 5 1.8 TB/s, GB200 NVL72 rack-scale fabric
- Tokenization — BPE, SentencePiece, tiktoken — subword algorithms; multilingual token budget economics; fertility by language
- Data Pipelines — FineWeb / MinHash / Quality Filtering — Common Crawl → pretraining corpus; MinHash dedup; quality classifiers at trillion-token scale
- Position Interpolation — YaRN / LongRoPE / NTK-aware — extending pretrained RoPE models to 128k–2M context; NTK-aware base change; YaRN non-uniform scaling
- Streaming LLM & Attention Sinks — bounded KV cache for infinite-length generation; attention sink phenomenon; K_sink + sliding window
- Switch Transformer & GShard — top-1 routing; auxiliary load-balancing loss; expert capacity factor; the foundational MoE papers
- TGI — Text Generation Inference — Hugging Face's Rust + Python serving stack; continuous batching; quantization formats; vLLM comparison
- Multi-Tenant LoRA Serving — S-LoRA / Punica — paged adapter pool (S-LoRA); SGMV batched kernel (Punica); serving thousands of adapters from one base model
- Diffusion Fundamentals — DDPM / DDIM / CFG — forward/reverse diffusion process; DDIM deterministic sampling; classifier-free guidance; NFE as the primary latency driver
- Sequence Parallelism Variants — Ulysses & Megatron-CP — all-to-all on head dim (Ulysses); causal even-odd interleaving (Megatron-CP); completes the sequence-axis parallelism story with Ring Attention
- torch.compile / Inductor — TorchDynamo bytecode tracing; AOTAutograd joint graph; Inductor Triton codegen; operator fusion; CUDA Graphs
- Knowledge Distillation at Scale — soft label loss with temperature; white-box / black-box / sequence-level regimes; R1 distillation pipeline; Gemma 2 logit soft-capping
- Tool Use / Function Calling Infrastructure — JSON schema tool definitions; parallel tool call dispatch; streaming delta parsing; safety layers; context window growth arithmetic
- Agent Framework Landscape — LangGraph stateful graphs; AutoGen multi-agent conversations; OpenAI Swarm handoffs; DSPy prompt compilation; state persistence and p99 latency tradeoffs
- Agent Evaluation Infrastructure — SWE-bench, TAU-bench, WebArena; trajectory metrics beyond binary pass/fail; sandboxed eval environments; LLM-as-judge calibration
- NCCL Internals — ring vs tree AllReduce; LL/LL128 protocols; NVLS in-fabric reduction; SHARP IB offload; IBGDA; topology detection and key debug env vars
- Async Checkpointing & PyTorch DCP — sharded save/reshardable load; async in-memory copy; ZeRO sharded format; checkpoint frequency optimization; recovery bandwidth arithmetic
- MoE Routing Improvements — Expert Choice & Loss-Free Balancing — Expert Choice inverted assignment; Loss-Free Balancing bias update rule; fine-grained+shared pattern; closes the MoE routing arc
- Jamba / Hybrid SSM-Transformer — interleaved Mamba+Attention blocks; KV cache reduction at 256k context; decode economics; why pure SSMs plateaued on recall tasks
- llama.cpp & GGUF — GGUF binary format; Q4_K_M super-block quantization; CPU+GPU hybrid offload; Metal/CUDA backends; the substrate under Ollama
- LMDeploy / TurboMind — W4A16 AWQ custom CUDA kernels; MLA-aware KV cache; FP8 KV; H100 throughput vs vLLM/TGI; first-class Qwen/DeepSeek support
- LLM Evaluation Harness — lm-eval-harness; log-likelihood vs generation scoring; MMLU/GSM8K/HumanEval plumbing; pass@k formula; Open LLM Leaderboard pipeline
- Chatbot Arena & Pairwise Evaluation — Bradley-Terry model; Elo rating; 1M+ human preference votes; MT-Bench LLM-as-judge; Arena Hard; why static benchmarks saturated
- Confidential LLM Inference — H100 CC mode attestation; AWS Nitro Enclaves; Azure SEV-SNP; TEE-gated serving pattern; 5–10% latency overhead; 20–30% cost premium
- CLIP — Contrastive Language-Image Pretraining — dual-encoder contrastive objective; InfoNCE loss over N² pairs; zero-shot classification; web-scale training
- Vision Transformer (ViT) — patch embedding; [CLS] token; ViT-L/14 at 336 px → 576 patch tokens; FlashAttention-compatible; DeiT distillation
- LLaVA / Vision-Language Models — MLP projector; two-stage training; visual token counts (256→576→2880); image prefix caching
- Whisper — Speech Recognition — log-mel spectrogram; 30s chunking; multitask via special tokens; weakly-supervised at 680k hours
- DiT — Diffusion Transformers — latent diffusion + transformer backbone; adaLN conditioning; Sora / SD3 / FLUX lineage; compute-bound inference
- VLM Serving — visual token prefill economics; variable-resolution tiling; image prefix caching; heterogeneous batching
- DeepSeek-V2 — Economical MoE at 236B — MLA + DeepSeekMoE as a system; 21B activated / 236B total; 5.76× throughput over dense predecessor
- DeepSeek-V3 Technical Report — FP8 training, DualPipe, MoE at 671B
- MLA — Multi-head Latent Attention — KV cache compression
- DeepSeekMoE — fine-grained + shared experts
- DeepSeek-R1 — reasoning via rule-based RL; GRPO; R1-Zero emergence
- Open Source Week — FlashMLA walkthrough — seesaw schedule, FP8 sparse decode
- Open Source Week — DeepEP walkthrough — expert-parallel all-to-all, IBGDA low-latency
- Open Source Week — DeepGEMM walkthrough — JIT FP8/BF16 GEMM, MoE layouts, V3.2 indexer
- Open Source Week — DualPipe walkthrough — bidirectional pipeline schedule; halves bubbles
- Open Source Week — 3FS walkthrough — RDMA-native distributed FS; CRAQ + FDB + USRBIO
- DeepSeek-Prover — Lean 4 theorem proving via RL + RMaxTS; compiler as perfect PRM; auto-formalization pipeline
- DeepSeek-V3.2 / Native Sparse Attention (NSA) — compressed + selected + window three-path sparse attention; V3.2 indexer kernel; native 128k context without RoPE interpolation
- Llama 3 Herd of Models — 405B dense, 16k H100s, 4D parallelism
- Llama 4 — first MoE family; Scout 17B×16E / Maverick 17B×128E; iRoPE interleaved attention; native multimodal
- Qwen3 — thinking/non-thinking toggle; MoE 235B total / 22B active + dense 0.6B–32B; RL post-training; top open-weight family
- Mixtral of Experts — 8×7B, top-2 routing, open-weight MoE baseline
- Mooncake — KVCache-centric disaggregated inference; PD-disaggregation; cache pool
- Pathways — async distributed dataflow runtime; single-controller at TPU pod scale
- GSPMD — XLA compiler pass for auto-parallelization; sharding as a type
- Gemma 2 — distillation from 27B teacher; logit soft-capping; alternating local/global attention
- DeepSpeed — MoE, Chat, and Inference Engine — expert parallelism, hybrid RLHF engine, fused INT8 inference
- CUTLASS — C++ GEMM template hierarchy; the kernel library under FlashAttention, DeepGEMM, and cuBLAS
- TensorRT-LLM — AOT-compiled LLM inference; paged KV cache, FP8, continuous batching at H100 peak
- Dynamo — disaggregated inference orchestration; KV-aware routing; prefill/decode pool management; TensorRT-LLM integration
- TransformerEngine — FP8 drop-in modules (te.Linear, te.TransformerLayer); E4M3/E5M2 format split; DelayedScaling amax history; 1.3–1.6× end-to-end training speedup
- Megatron-Core — modular library superseding the 2021 paper; ParallelState, TransformerConfig, mcore DDP; CP integration; TE routing; used by Nemotron, NeMo, Grok
- Grok + Colossus — 314B MoE Grok-1; 100k H100 single-site Memphis cluster; 4D parallelism at frontier scale; single-site AllReduce latency advantage
- Seed / Doubao — MegaScale fault tolerance at 12k GPUs; verl/HybridFlow origin lab; H800 export-control constraints; PD-disaggregated inference at 100M+ QPS
- Apple Foundation Models (AFM) — on-device 3B (4-bit palettized) + Private Cloud Compute; Apple Silicon attestation-based privacy; MLX unified memory; two-tier routing
- Building Effective Agents — workflow vs agent, five workflow patterns
- Model Context Protocol (MCP) — LSP for LLMs; tools / resources / prompts via JSON-RPC
- Constitutional AI — critique-revision loop + RLAIF; harmlessness from AI feedback
- Computer Use & Browser Automation — screenshot-based pixel-level action space; visual grounding; per-step latency arithmetic; container sandboxing
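Many entries above (PagedAttention, GQA, MLA, KV Cache Quantization, Jamba) turn on the size of the KV cache. As a rough orientation, the arithmetic can be sketched as below. The model shape used here (32 layers, 32 query heads, head dimension 128, 8192 tokens, 2-byte elements) is an illustrative assumption, not a figure taken from any of the write-ups, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer.

    The leading factor of 2 counts one key tensor and one value tensor.
    bytes_per_elem=2 assumes FP16/BF16 storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Assumed illustrative shape: 32 layers, 32 query heads, head_dim 128.
H, G = 32, 8  # H query heads sharing G KV heads under GQA

mha = kv_cache_bytes(n_layers=32, n_kv_heads=H, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=G, head_dim=128, seq_len=8192)

print(f"MHA: {mha / 2**30:.1f} GiB per sequence")
print(f"GQA: {gqa / 2**30:.1f} GiB per sequence")
print(f"reduction factor H/G: {mha // gqa}")
```

Under these assumptions the cache drops from 4 GiB to 1 GiB per 8192-token sequence, which is the H/G factor the GQA entry refers to. Quantizing the cache changes only `bytes_per_elem`, so the two savings multiply.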
Career and onboarding guides — engineering-first, minimal ML algorithm prerequisites.
- From DevOps to AI Infrastructure — skills that transfer, gaps to fill, three vertical paths (cluster ops / inference platform / training infra), 6-month milestones
- On-Premise LLM Deployment — from Mac Studio + Ollama to multi-node GPU clusters; hardware sizing, software stack, cost reference, decision flowchart
- Secure On-Prem Agent Deployment — tool call sandboxing (gVisor, NetworkPolicy, seccomp); MCP permission boundaries; secret management in agent loops; minimal secure stack reference
New papers: copy _template/ into the appropriate vendor directory, fill in both zh.md and en.md, update this index.
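The copy step above could be scripted roughly as follows. This is a minimal sketch that assumes a `<vendor>/<paper>/` layout under the repo root and that `_template/` already contains `en.md` and `zh.md`; the function name `scaffold_paper` and the vendor and paper names in the usage comment are hypothetical.

```python
import shutil
from pathlib import Path


def scaffold_paper(repo_root, vendor, slug):
    """Copy _template/ to <repo_root>/<vendor>/<slug>/ and return the new path.

    The directory layout is an assumption; adjust it to match the repo.
    """
    root = Path(repo_root)
    template = root / "_template"
    target = root / vendor / slug

    if not template.is_dir():
        raise FileNotFoundError(f"template directory not found: {template}")
    if target.exists():
        raise FileExistsError(f"refusing to overwrite existing entry: {target}")

    # copytree creates any missing parent directories for the target.
    shutil.copytree(template, target)

    # Both language versions are required for every paper.
    missing = [name for name in ("en.md", "zh.md")
               if not (target / name).is_file()]
    if missing:
        raise FileNotFoundError(f"template is missing: {', '.join(missing)}")
    return target


# Usage (hypothetical names), run from the repo root:
#   scaffold_paper(".", "example-vendor", "example-paper")
```

The script only creates the files; filling in both write-ups and adding the entry to this index remain manual steps.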
Documentation: CC BY 4.0.