ai-infra

Deep dives into AI infrastructure papers and open-source systems from frontier labs.

Chinese version of the README

Scope

This repo collects engineering-focused analyses of papers and open-source releases covering:

Training systems — parallelism strategies, mixed precision, communication primitives
Inference systems — KV cache, speculative decoding, serving architectures
Model architectures with infra implications — MoE routing, attention variants, long context
Open-source infra components — kernels, schedulers, file systems

Every paper gets both an English (en.md) and Chinese (zh.md) write-up using the same template. See _template/.

Index

Foundational

Scaling Laws — Kaplan 2020 + Chinchilla 2022 — power-law compute/data/param tradeoffs; the 20-tokens-per-param rule
FlashAttention 1 / 2 / 3 — IO-aware exact attention, Hopper async + FP8
Triton — block-level GPU kernel DSL + MLIR compiler; the productivity multiplier
PagedAttention / vLLM — OS-style paging for KV cache, continuous batching
Orca — Continuous Batching — iteration-level scheduling; goodput as the right metric; foundation for vLLM/SGLang
Megatron-LM (TP / PP / SP) — tensor, pipeline, sequence parallelism + selective recompute
ZeRO / FSDP — sharded data parallelism; orthogonal to Megatron
Speculative Decoding — draft + verify; lossless 2–4× decode speedup
Ring Attention / Context Parallelism — exact attention at 1M+ context via sequence sharding
Grouped Query Attention (GQA) — H/G KV cache reduction; the default attention variant in Llama 3, Mistral, Gemma
Rotary Position Embeddings (RoPE) — position-by-rotation; relative, parameter-free, flash-friendly; long-context extensions
DistServe — prefill/decode disaggregation; goodput as the right metric
SGLang — RadixAttention prefix caching; frontend DSL for multi-call LLM programs
Prefix Caching — hash-based vs radix-trie matching; eviction policies; hit rate economics; multi-tier GPU→CPU→disk; disaggregated routing
Weight Quantization — GPTQ & AWQ — INT4 post-training; Hessian vs activation-aware
SmoothQuant — W8A8; activation-to-weight outlier migration
RLHF / InstructGPT — three-stage SFT+RM+PPO; the post-training foundation
DPO — collapse RLHF into one supervised step; the simpler default
GRPO — Group Relative Policy Optimization — critic-free RL via group advantage; the algorithm behind DeepSeek-R1
SimPO — Simple Preference Optimization — reference-free DPO; length-normalized reward + target margin
LoRA / QLoRA — low-rank weight adaptation; 4-bit fine-tuning on a single GPU
Mixed Precision Training — FP16 → BF16 → FP8; AMP, loss scaling, per-tensor/block FP8 scaling
Chunked Prefill — Sarathi-Serve — interleave prefill chunks with decode; eliminate decode (time-between-tokens) stalls behind long prefills on a single GPU
KV Cache Quantization — KIVI & KVQuant — INT2 keys + INT4 values; completes the quantization arc
verl — HybridFlow — per-model parallelism + CPU offload for PPO/GRPO at 70B+ scale; the infra behind R1-style training
GPU Interconnect primer — NVLink, NVSwitch, RDMA, IBGDA; the fabric assumed by everything else
Hopper / H100 Architecture Primer — wgmma, TMA, FP8 Tensor Cores, Thread Block Clusters; the compute primitives behind FlashAttention-3, DeepGEMM, DualPipe
Mamba and State Space Models — linear-time, constant-memory-decode; hybrids with attention
Inference-Time Scaling — Test-Time Compute — parallel search (best-of-N, PRM beam search) vs sequential refinement (thinking tokens); compute-optimal strategy by difficulty
Process Reward Models (PRMs) — step-level verification; PRM800K human labels vs MC rollout auto-labeling; beam search and MCTS integration
Speculative Decoding Variants — Medusa / EAGLE — tree-structured self-draft; Medusa parallel heads; EAGLE feature-level autoregressive draft; EAGLE-2 adaptive trees
Blackwell / B200 Architecture Primer — FP4 Tensor Cores, HBM3e 192 GB, NVLink 5 1.8 TB/s, GB200 NVL72 rack-scale fabric
Tokenization — BPE, SentencePiece, Tiktoken — subword algorithms; multilingual token budget economics; fertility by language
Data Pipelines — FineWeb / MinHash / Quality Filtering — Common Crawl → pretraining corpus; MinHash dedup; quality classifiers at trillion-token scale
Position Interpolation — YaRN / LongRoPE / NTK-aware — extending pretrained RoPE models to 128k–2M context; NTK-aware base change; YaRN non-uniform scaling
Streaming LLM & Attention Sinks — bounded KV cache for infinite-length generation; attention sink phenomenon; K_sink + sliding window
Switch Transformer & GShard — top-1 routing; auxiliary load-balancing loss; expert capacity factor; the foundational MoE papers
TGI — Text Generation Inference — HuggingFace's Rust + Python serving stack; continuous batching; quantization formats; vLLM comparison
Multi-Tenant LoRA Serving — SLoRA / Punica — paged adapter pool (SLoRA); SGMV batched kernel (Punica); serving thousands of adapters from one base model
Diffusion Fundamentals — DDPM / DDIM / CFG — forward/reverse diffusion process; DDIM deterministic sampling; classifier-free guidance; NFE as the primary latency driver
Sequence Parallelism Variants — Ulysses & Megatron-CP — all-to-all on head dim (Ulysses); causal even-odd interleaving (Megatron-CP); completes the sequence-axis parallelism story with Ring Attention
torch.compile / Inductor — TorchDynamo bytecode tracing; AOTAutograd joint graph; Inductor Triton codegen; operator fusion; CUDA Graphs
Knowledge Distillation at Scale — soft label loss with temperature; white-box / black-box / sequence-level regimes; R1 distillation pipeline; Gemma 2 logit soft-capping
Tool Use / Function Calling Infrastructure — JSON schema tool definitions; parallel tool call dispatch; streaming delta parsing; safety layers; context window growth arithmetic
Agent Framework Landscape — LangGraph stateful graphs; AutoGen multi-agent conversations; OpenAI Swarm handoffs; DSPy prompt compilation; state persistence and p99 latency tradeoffs
Agent Evaluation Infrastructure — SWE-bench, TAU-bench, WebArena; trajectory metrics beyond binary pass/fail; sandboxed eval environments; LLM-as-judge calibration
NCCL Internals — ring vs tree AllReduce; LL/LL128 protocols; NVLS in-fabric reduction; SHARP IB offload; IBGDA; topology detection and key debug env vars
Async Checkpointing & PyTorch DCP — sharded save/reshardable load; async in-memory copy; ZeRO sharded format; checkpoint frequency optimization; recovery bandwidth arithmetic
MoE Routing Improvements — Expert Choice & Loss-Free Balancing — Expert Choice inverted assignment; Loss-Free Balancing bias update rule; fine-grained+shared pattern; closes the MoE routing arc
Jamba / Hybrid SSM-Transformer — interleaved Mamba+Attention blocks; KV cache reduction at 256k context; decode economics; why pure SSMs plateaued on recall tasks
llama.cpp & GGUF — GGUF binary format; Q4_K_M super-block quantization; CPU+GPU hybrid offload; Metal/CUDA backends; the substrate under Ollama
LMDeploy / TurboMind — W4A16 AWQ custom CUDA kernels; MLA-aware KV cache; FP8 KV; H100 throughput vs vLLM/TGI; first-class Qwen/DeepSeek support
LLM Evaluation Harness — lm-eval-harness; log-likelihood vs generation scoring; MMLU/GSM8K/HumanEval plumbing; pass@k formula; Open LLM Leaderboard pipeline
Chatbot Arena & Pairwise Evaluation — Bradley-Terry model; Elo rating; 1M+ human preference votes; MT-Bench LLM-as-judge; Arena Hard; why static benchmarks saturated
Confidential LLM Inference — H100 CC mode attestation; AWS Nitro Enclaves; Azure SEV-SNP; TEE-gated serving pattern; 5–10% latency overhead; 20–30% cost premium
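
Several entries above name a formula without stating it: the 20-tokens-per-param rule (Scaling Laws), the H/G KV cache reduction (GQA), and the pass@k formula (LLM Evaluation Harness). The sketch below spells out the usual forms of all three. The function names are hypothetical, the model shapes in the example are illustrative rather than taken from any model card, and the linked write-ups remain the authority on details.

```python
from math import comb


def chinchilla_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Training tokens suggested by the ~20-tokens-per-parameter rule of thumb."""
    return params * tokens_per_param


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Per-sequence KV cache size; the leading 2 counts keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative shapes (assumed): H = 32 query heads, G = 8 KV heads.
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192)
reduction = mha // gqa  # equals H / G
```

With these shapes the FP16 cache shrinks from 4 GiB to 1 GiB per 8192-token sequence, which is the H/G = 4 factor the GQA entry refers to.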

Multimodal

CLIP — Contrastive Language-Image Pretraining — dual-encoder contrastive objective; InfoNCE loss over N² pairs; zero-shot classification; web-scale training
Vision Transformer (ViT) — patch embedding; [CLS] token; ViT-L/14 at 336 px → 576 tokens; FlashAttention-compatible; DeiT distillation
LLaVA / Vision-Language Models — MLP projector; two-stage training; visual token counts (256→576→2880); image prefix caching
Whisper — Speech Recognition — log-mel spectrogram; 30s chunking; multitask via special tokens; weakly-supervised at 680k hours
DiT — Diffusion Transformers — latent diffusion + transformer backbone; adaLN conditioning; Sora / SD3 / FLUX lineage; compute-bound inference
VLM Serving — visual token prefill economics; variable-resolution tiling; image prefix caching; heterogeneous batching
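
The visual token counts quoted above (576 for ViT-L/14, and 256→576→2880 for LLaVA) follow from patch arithmetic. The sketch below is one reading of those numbers, assuming square patches, no [CLS] token in the count, and a layout of 4 tiles plus one overview image for the largest case. Both function names are hypothetical.

```python
def patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


def tiled_tokens(image_size: int, patch_size: int, tiles: int,
                 include_overview: bool = True) -> int:
    """Token count when an image is encoded as several tiles, optionally
    plus one downscaled overview image (an assumed layout)."""
    views = tiles + (1 if include_overview else 0)
    return views * patch_tokens(image_size, patch_size)


low_res = patch_tokens(224, 14)      # 16 x 16 patches
high_res = patch_tokens(336, 14)     # 24 x 24 patches
tiled = tiled_tokens(336, 14, tiles=4)
```

Because prefill cost grows with the number of visual tokens, the jump from one view to five views is what makes image prefix caching and variable-resolution tiling matter for serving.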

DeepSeek

V2 — Economical MoE at 236B — MLA + DeepSeekMoE as a system; 21B activated / 236B total; 5.76× throughput over dense predecessor
V3 Technical Report — FP8 training, DualPipe, MoE at 671B
MLA — Multi-head Latent Attention — KV cache compression
DeepSeekMoE — fine-grained + shared experts
R1 — reasoning via rule-based RL; GRPO; R1-Zero emergence
Open Source Week — FlashMLA walkthrough — seesaw schedule, FP8 sparse decode
Open Source Week — DeepEP walkthrough — expert-parallel all-to-all, IBGDA low-latency
Open Source Week — DeepGEMM walkthrough — JIT FP8/BF16 GEMM, MoE layouts, V3.2 indexer
Open Source Week — DualPipe walkthrough — bidirectional pipeline schedule; halves bubbles
Open Source Week — 3FS walkthrough — RDMA-native distributed FS; CRAQ + FDB + USRBIO
DeepSeek-Prover — Lean 4 theorem proving via RL + RMaxTS; compiler as perfect PRM; auto-formalization pipeline
V3.2 / Native Sparse Attention (NSA) — compressed + selected + window three-path sparse attention; V3.2 indexer kernel; native 128k context without RoPE interpolation

Qwen

Qwen3 — thinking/non-thinking toggle; MoE with 235B total / 22B active parameters + dense 0.6B–32B; RL post-training; top open-weight family

Mistral

Mixtral of Experts — 8×7B, top-2 routing, open-weight MoE baseline
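
As a rough illustration of top-2 routing, the sketch below picks the two experts with the largest gate logits and mixes their outputs with softmax weights computed over just those two logits. It is a minimal pure-Python sketch, not Mixtral's implementation; `top_k_route` and `moe_output` are hypothetical names, and the experts are stand-in callables.

```python
import math
from typing import Callable, List, Sequence, Tuple


def top_k_route(gate_logits: Sequence[float], k: int = 2) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) pairs for the k largest gate logits.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    if not 1 <= k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")
    order = sorted(range(len(gate_logits)),
                   key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    peak = max(gate_logits[i] for i in top)  # subtract for numerical stability
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_output(x: float, experts: Sequence[Callable[[float], float]],
               gate_logits: Sequence[float], k: int = 2) -> float:
    """Weighted sum of the selected experts' outputs; other experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))


# Eight toy experts: expert s multiplies its input by s.
toy_experts = [lambda x, s=s: s * x for s in range(8)]
toy_logits = [0.1, 2.0, -1.0, 2.0, 0.5, 0.0, 0.3, -0.2]
routes = top_k_route(toy_logits)
y = moe_output(1.0, toy_experts, toy_logits)
```

Only the two selected experts are evaluated per token, which is why an 8-expert model pays roughly the compute of two experts at inference while holding all eight in memory.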

Moonshot

Mooncake — KVCache-centric disaggregated inference; PD-disaggregation; cache pool

Google

Pathways — async distributed dataflow runtime; single-controller at TPU pod scale
GSPMD — XLA compiler pass for auto-parallelization; sharding as a type
Gemma 2 — distillation from 27B teacher; logit soft-capping; alternating local/global attention

Microsoft

DeepSpeed — MoE, Chat, and Inference Engine — expert parallelism, hybrid RLHF engine, fused INT8 inference

NVIDIA

CUTLASS — C++ GEMM template hierarchy; the kernel library under FlashAttention, DeepGEMM, and cuBLAS
TensorRT-LLM — AOT-compiled LLM inference; paged KV cache, FP8, continuous batching at H100 peak
Dynamo — disaggregated inference orchestration; KV-aware routing; prefill/decode pool management; TensorRT-LLM integration
TransformerEngine — FP8 drop-in modules (te.Linear, te.TransformerLayer); E4M3/E5M2 format split; DelayedScaling amax history; 1.3–1.6× end-to-end training speedup
Megatron-Core — modular library superseding the 2021 paper; ParallelState, TransformerConfig, mcore DDP; CP integration; TE routing; used by Nemotron, NeMo, Grok

xAI

Grok + Colossus — 314B MoE Grok-1; 100k H100 single-site Memphis cluster; 4D parallelism at frontier scale; single-site AllReduce latency advantage

ByteDance

Seed / Doubao — MegaScale fault tolerance at 12k GPUs; verl/HybridFlow origin lab; H800 export-control constraints; PD-disaggregated inference at 100M+ QPS

Apple

Foundation Models (AFM) — on-device 3B (4-bit palettized) + Private Cloud Compute; Apple Silicon attestation-based privacy; MLX unified memory; two-tier routing

Anthropic

Building Effective Agents — workflow vs agent, five workflow patterns
Model Context Protocol (MCP) — LSP for LLMs; tools / resources / prompts via JSON-RPC
Constitutional AI — critique-revision loop + RLAIF; harmlessness from AI feedback
Computer Use & Browser Automation — screenshot-based pixel-level action space; visual grounding; per-step latency arithmetic; container sandboxing
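
To make "tools / resources / prompts via JSON-RPC" concrete, the sketch below builds the JSON-RPC 2.0 envelope for an MCP tool invocation. It only constructs the message and assumes the `tools/call` method with `name` and `arguments` parameters; the helper names and the `get_weather` tool are hypothetical, and transport, initialization, and error handling are omitted.

```python
import itertools
import json
from typing import Any, Dict, Optional

_request_ids = itertools.count(1)  # JSON-RPC requests carry a unique id


def jsonrpc_request(method: str,
                    params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request object."""
    message: Dict[str, Any] = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,
    }
    if params is not None:
        message["params"] = params
    return message


def tool_call(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a tool invocation as a tools/call request."""
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})


# Hypothetical tool name and arguments, for illustration only.
request = tool_call("get_weather", {"city": "Paris"})
wire_format = json.dumps(request)
```

The server replies with a response object carrying the same `id`, which is how a client matches results to requests when several tool calls are in flight.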

Guides

Career and onboarding guides — engineering-first, minimal ML algorithm prerequisites.

From DevOps to AI Infrastructure — skills that transfer, gaps to fill, three vertical paths (cluster ops / inference platform / training infra), 6-month milestones
On-Premise LLM Deployment — from Mac Studio + Ollama to multi-node GPU clusters; hardware sizing, software stack, cost reference, decision flowchart
Secure On-Prem Agent Deployment — tool call sandboxing (gVisor, NetworkPolicy, seccomp); MCP permission boundaries; secret management in agent loops; minimal secure stack reference

Contributing

New papers: copy _template/ into the appropriate vendor directory, fill in both zh.md and en.md, update this index.
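
The copy step can be scripted. The sketch below assumes that `_template/` directly contains `en.md` and `zh.md` and that write-ups live under `<vendor>/<paper>/`; both are assumptions about the layout, and `scaffold_paper` and `missing_writeups` are hypothetical helpers, not scripts shipped with this repo. Updating the index is still a manual step.

```python
import shutil
from pathlib import Path
from typing import List

REQUIRED_WRITEUPS = ("en.md", "zh.md")


def scaffold_paper(repo_root: str, vendor: str, slug: str) -> Path:
    """Copy _template/ to <vendor>/<slug>/ and return the new directory."""
    root = Path(repo_root)
    target = root / vendor / slug
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    # copytree creates missing parent directories for the destination.
    shutil.copytree(root / "_template", target)
    return target


def missing_writeups(paper_dir: Path) -> List[str]:
    """List required write-up files that are absent from a paper directory."""
    return [name for name in REQUIRED_WRITEUPS
            if not (Path(paper_dir) / name).is_file()]
```

Running `missing_writeups` over every paper directory before a commit is one way to enforce the rule that each paper has both language versions.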

License

Documentation: CC BY 4.0.
