Primus (Primus-LM) is a flexible, high-performance training framework designed for large-scale foundation model training and inference on AMD GPUs. It supports pretraining, post-training, and reinforcement learning workflows with multiple backends, including Megatron-LM, TorchTitan, and JAX MaxText, alongside ROCm-optimized components.
Part of the Primus Ecosystem: Primus-LM is the training framework layer of the Primus ecosystem, working together with Primus-Turbo (high-performance operators) and Primus-SaFE (stability & platform).
- 🔄 Multi-Backend Support: Seamlessly switch between Megatron-LM, TorchTitan, and other training frameworks
- 🚀 Unified CLI: One command interface for local development, containers, and Slurm clusters
- ⚡ ROCm Optimized: Deep integration with AMD ROCm stack and optimized kernels from Primus-Turbo
- 📦 Production Ready: Battle-tested on large-scale training with hundreds of GPUs
- 🔌 Extensible Architecture: Plugin-based design for easy integration of custom models and workflows
- 🛡️ Enterprise Features: Built-in fault tolerance, checkpoint management, and monitoring
- Megatron-LM: LLaMA2 / LLaMA3 / LLaMA4 families, DeepSeek-V2/V3, Mixtral-style MoE, and other GPT-style models
- TorchTitan: LLaMA3 / LLaMA4, DeepSeek-V3, and related decoder-only architectures
- MaxText (JAX): LLaMA3.x and other MaxText-supported transformer models (subset; see MaxText docs for details)
For the full and up-to-date model matrix, see Supported Models.
- [2025/12/17] MoE Training Best Practices on AMD GPUs - MoE Package Blog
- [2025/11/14] 🎉 Primus CLI 1.0 Released - Unified command-line interface with comprehensive documentation
- [2025/08/22] Primus introduction blog
- [2025/06/18] Added TorchTitan backend support
- [2025/05/16] Added benchmark suite for performance evaluation
- [2025/04/18] Added Preflight cluster sanity checker
- [2025/04/14] Integrated HipBLASLt autotuning for optimized GPU kernel performance
- [2025/04/09] Extended support for LLaMA2, LLaMA3, DeepSeek-V2/V3 models
- [2025/03/04] Released Megatron trainer module
Primus leverages AMD’s ROCm Docker images to provide a consistent, ready-to-run environment optimized for AMD GPUs, eliminating manual dependency installation and environment configuration.
- AMD ROCm drivers (version ≥ 7.0 recommended)
- Docker (version ≥ 24.0) with ROCm support
- ROCm-compatible AMD GPUs (e.g., Instinct MI300 series)
- Proper permissions for Docker and GPU device access
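As a quick sanity check before pulling the image, the standard ROCm and Docker tools can confirm most of these prerequisites. The commands below are an illustrative sketch; exact output and group names (`video`/`render`) can vary by distribution:

```bash
# List visible AMD GPUs and the installed ROCm driver stack
rocm-smi

# Confirm the Docker version (>= 24.0 recommended)
docker --version

# Confirm the GPU device nodes that Docker maps into the container are accessible
ls -l /dev/kfd /dev/dri

# On most setups the current user should belong to the 'video' and/or 'render' groups
groups
```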
- **Pull the latest Docker image**

  ```bash
  docker pull docker.io/rocm/primus:v25.10
  ```

- **Clone the repository**

  ```bash
  git clone --recurse-submodules https://github.com/AMD-AIG-AIMA/Primus.git
  cd Primus
  ```

- **Run your first training**

  ```bash
  # Run training in container
  # NOTE: If your config downloads weights/tokenizer from Hugging Face Hub,
  # you typically need to pass HF_TOKEN into the container.
  ./runner/primus-cli container --image rocm/primus:v25.10 \
    --env HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
    -- train pretrain --config examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml
  ```
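The `HF_TOKEN` value above is a placeholder. A common variation (a sketch that reuses only the flags shown above) is to export the token in your shell and forward it, rather than hard-coding it on the command line:

```bash
# Export your Hugging Face token once in the current shell...
export HF_TOKEN="<your-hf-token>"

# ...then forward the variable into the container instead of pasting the literal token
./runner/primus-cli container --image rocm/primus:v25.10 \
  --env HF_TOKEN="$HF_TOKEN" \
  -- train pretrain --config examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml
```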
For more detailed usage instructions, see the CLI User Guide.
Comprehensive documentation is available in the docs/ directory:
- Quick Start Guide - Get started in 5 minutes
- Primus CLI User Guide - Complete CLI reference and usage
- CLI Architecture - Technical design and architecture
- Backend Patch Notes - Primus-specific backend arguments
- Full Documentation Index - Browse all available documentation
Primus-LM is part of a comprehensive ecosystem designed to provide end-to-end solutions for large model training on AMD GPUs:
┌─────────────────────────────────────────────────────┐
│ Primus-SaFE │
│ (Stability & Platform Layer) │
│ Cluster Management | Fault Tolerance | Scheduling │
└────────────────────────┬────────────────────────────┘
│
┌────────────────────────▼────────────────────────────┐
│ Primus-LM │
│ (Training Framework) │
│ Megatron | TorchTitan | Unified CLI | Workflows │
└────────────────────────┬────────────────────────────┘
│
┌────────────────────────▼────────────────────────────┐
│ Primus-Turbo │
│ (High-Performance Operators) │
│ FlashAttention | GEMM | Collectives | GroupedGemm │
│ AITER | CK | hipBLASLt | Triton │
└─────────────────────────────────────────────────────┘
| Component | Role | Key Features | Repository |
|---|---|---|---|
| Primus (Primus-LM) | Training Framework | Multi-backend support, unified CLI, production-ready workflows | This repo |
| Primus-Turbo | Performance Layer | Optimized kernels for attention, GEMM, communication, and more | Primus-Turbo |
| Primus-SaFE | Platform Layer | Cluster orchestration, fault tolerance, topology-aware scheduling | Primus-SaFE |
- Primus-LM provides the training framework and workflow orchestration
- Primus-Turbo supplies highly optimized compute kernels for maximum performance
- Primus-SaFE ensures stability and efficient resource utilization at scale
This separation of concerns allows each component to evolve independently while maintaining seamless integration.
- Add support for more model architectures and backends
- Expand documentation with more examples and tutorials
Primus builds on top of several ROCm-native operator libraries and compiler projects—we couldn’t reach current performance levels without them:
- ROCm AITER – AI Tensor Engine kernels (elementwise, attention, KV-cache, fused MoE, etc.)
- Composable Kernel – performance-portable tensor operator generator for GEMM and convolutions
- hipBLASLt – low-level BLAS Lt API with autotuning support for ROCm GPUs
- ROCm Triton – Python-first kernel compiler used for custom attention and MoE paths
If you rely on Primus, please consider starring or contributing to these projects as well—they are foundational to our stack.
We welcome contributions! Please see our Contributing Guide for details.
Primus is released under the Apache 2.0 License.
Built with ❤️ by AMD AI Brain - Training at Scale (TAS) Team