A self-directed study log: ML compilers, GPU kernels, MLIR, and the systems work that makes them fast.
Ride along with me if you like :)
Four months of focused work on getting genuinely good at ML systems: CUDA, Triton, MLIR, distributed GPU programming, and whatever rabbit holes open up along the way. The goal is to ship something real: a kernel competition placement, a contribution to tinygrad or another open-source project, or a measurable improvement on something that matters. I'll pick whichever feels most alive by week six.
This is a public work log, not a tutorial.
- Foundation refresh: LLVM Kaleidoscope, C++ refresher, GPU MODE basics
- GPU fundamentals: CUDA, PMPP, naive and tiled matmul, Nsight profiling
- Triton entry: Python-embedded GPU kernels, PTX inspection
- MLIR proper: IR dialects, lowering passes
- First real contribution: the centerpiece
- Synthesize: FlashAttention, NCCL collectives, CUDA Graphs, second contribution push
/cuda CUDA kernels, benchmarks, Nsight reports
/triton Triton kernels, PTX dumps, fused-op experiments
/mlir MLIR Toy tutorial, dialect experiments, LLVM Kaleidoscope
/notes Reading notes, paper summaries, lecture notes
/docs The plan, references, anything reusable
- Writeups on Substack: dstrbad.substack.com
- Day-to-day on X: @dstrbad
If you're working through similar material, open an issue or reach out. I'm not teaching, but I'm happy to compare notes.
The work draws on:
- Programming Massively Parallel Processors (Hwu, Kirk, El Hajj) — the PMPP book
- GPU MODE lectures and Discord
- Aalto Programming Parallel Computers — structured exercises
- Triton and MLIR official docs and tutorials
- Systems Performance (Brendan Gregg) — methodology reference
- AI Performance Engineering (Fregly) — companion repo, 200-item performance checklist, monthly meetup videos
- Convex Optimization (Boyd) — math reference
May 23, 2026.