qfc-network/ai-infra

ai-infra

Deep dives into AI infrastructure papers and open-source systems from frontier labs.

Chinese README

Scope

This repo collects engineering-focused analyses of papers and open-source releases covering:

  • Training systems — parallelism strategies, mixed precision, communication primitives
  • Inference systems — KV cache, speculative decoding, serving architectures
  • Model architectures with infra implications — MoE routing, attention variants, long context
  • Open-source infra components — kernels, schedulers, file systems

Every paper gets both an English (en.md) and a Chinese (zh.md) write-up, each following the same template. See _template/.

Index

Foundational

Multimodal

DeepSeek

Meta

  • Llama 3 Herd of Models — 405B dense, 16k H100s, 4D parallelism
  • Llama 4 — Meta's first MoE family; Scout 17B×16E / Maverick 17B×128E; iRoPE interleaved attention; native multimodal

Qwen

  • Qwen3 — thinking/non-thinking toggle; MoE with 235B total / 22B active parameters + dense 0.6B–32B; RL post-training; top open-weight family

Mistral

Moonshot

  • Mooncake — KVCache-centric disaggregated inference; PD-disaggregation; cache pool

Google

  • Pathways — async distributed dataflow runtime; single-controller at TPU pod scale
  • GSPMD — XLA compiler pass for auto-parallelization; sharding as a type
  • Gemma 2 — distillation from 27B teacher; logit soft-capping; alternating local/global attention

Microsoft

NVIDIA

  • CUTLASS — C++ GEMM template hierarchy; the kernel library under FlashAttention, DeepGEMM, and cuBLAS
  • TensorRT-LLM — AOT-compiled LLM inference; paged KV cache, FP8, continuous batching at H100 peak
  • Dynamo — disaggregated inference orchestration; KV-aware routing; prefill/decode pool management; TensorRT-LLM integration
  • TransformerEngine — FP8 drop-in modules (te.Linear, te.TransformerLayer); E4M3/E5M2 format split; DelayedScaling amax history; 1.3–1.6× end-to-end training speedup
  • Megatron-Core — modular library superseding the 2021 paper; ParallelState, TransformerConfig, mcore DDP; CP integration; TE routing; used by Nemotron, NeMo, Grok

xAI

  • Grok + Colossus — 314B MoE Grok-1; 100k H100 single-site Memphis cluster; 4D parallelism at frontier scale; single-site AllReduce latency advantage

ByteDance

  • Seed / Doubao — MegaScale fault tolerance at 12k GPUs; verl/HybridFlow origin lab; H800 export-control constraints; PD-disaggregated inference at 100M+ QPS

Apple

  • Foundation Models (AFM) — on-device 3B (4-bit palettized) + Private Cloud Compute; Apple Silicon attestation-based privacy; MLX unified memory; two-tier routing

Anthropic

Guides

Career and onboarding guides — engineering-first, minimal ML algorithm prerequisites.

  • From DevOps to AI Infrastructure — skills that transfer, gaps to fill, three vertical paths (cluster ops / inference platform / training infra), 6-month milestones
  • On-Premise LLM Deployment — from Mac Studio + Ollama to multi-node GPU clusters; hardware sizing, software stack, cost reference, decision flowchart
  • Secure On-Prem Agent Deployment — tool call sandboxing (gVisor, NetworkPolicy, seccomp); MCP permission boundaries; secret management in agent loops; minimal secure stack reference

Contributing

New papers: copy _template/ into the appropriate vendor directory, fill in both zh.md and en.md, and update this index.
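
The copy step can be scripted. The following is a minimal sketch, not tooling that ships with this repo: the function name `scaffold_paper` and the `<vendor>/<slug>/` directory layout are assumptions made for illustration. It only presumes, as stated above, that _template/ contains en.md and zh.md.

```python
import shutil
from pathlib import Path

# File names taken from the README; everything else here is hypothetical.
REQUIRED_FILES = ("en.md", "zh.md")


def scaffold_paper(repo_root: Path, vendor: str, slug: str) -> Path:
    """Copy _template/ to <vendor>/<slug>/ and return the new directory.

    Assumption: write-ups live at <vendor>/<slug>/ under the repo root.
    """
    template = repo_root / "_template"
    if not template.is_dir():
        raise FileNotFoundError(f"template directory not found: {template}")

    # Check the template up front so a broken template is never copied.
    missing = [name for name in REQUIRED_FILES if not (template / name).is_file()]
    if missing:
        raise FileNotFoundError(f"template is missing: {', '.join(missing)}")

    target = repo_root / vendor / slug
    if target.exists():
        raise FileExistsError(f"refusing to overwrite existing write-up: {target}")

    # Create the vendor directory if needed, then copy the whole template tree.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(template, target)
    return target
```

For example, `scaffold_paper(Path("."), "meta", "llama-3")` (hypothetical vendor and slug) would create `meta/llama-3/` containing both template files. Filling them in and updating the index remain manual steps.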

License

Documentation: CC BY 4.0.
