Primus (Primus-LM) is a flexible, high-performance training framework designed for large-scale foundation model training and inference on AMD GPUs. It supports pretraining, post-training, and reinforcement learning workflows with multiple backends, including Megatron-LM, TorchTitan, and JAX MaxText, alongside ROCm-optimized components.
Part of the Primus Ecosystem: Primus-LM is the training framework layer of the Primus ecosystem, working together with Primus-Turbo (high-performance operators) and Primus-SaFE (stability & platform).
- 🔄 Multi-Backend Support: Seamlessly switch between Megatron-LM, TorchTitan, and other training frameworks
- 🚀 Unified CLI: One command interface for local development, containers, and Slurm clusters
- ⚡ ROCm Optimized: Deep integration with AMD ROCm stack and optimized kernels from Primus-Turbo
- 📦 Production Ready: Battle-tested on large-scale training with hundreds of GPUs
- 🔌 Extensible Architecture: Plugin-based design for easy integration of custom models and workflows
- 🛡️ Enterprise Features: Built-in fault tolerance, checkpoint management, and monitoring
- Megatron-LM: LLaMA2 / LLaMA3 / LLaMA4 families, DeepSeek-V2/V3, Mixtral-style MoE, and other GPT-style models
- TorchTitan: LLaMA3 / LLaMA4, DeepSeek-V3, and related decoder-only architectures
- MaxText (JAX): LLaMA3.x and other MaxText-supported transformer models (subset; see MaxText docs for details)
For the full and up-to-date model matrix, see Supported Models.
- [2025/12/17] MoE Training Best Practices on AMD GPUs - MoE Package Blog
- [2025/11/14] 🎉 Primus CLI 1.0 Released - Unified command-line interface with comprehensive documentation
- [2025/08/22] Primus introduction blog
- [2025/06/18] Added TorchTitan backend support
- [2025/05/16] Added benchmark suite for performance evaluation
- [2025/04/18] Added Preflight cluster sanity checker
- [2025/04/14] Integrated HipBLASLt autotuning for optimized GPU kernel performance
- [2025/04/09] Extended support for LLaMA2, LLaMA3, DeepSeek-V2/V3 models
- [2025/03/04] Released Megatron trainer module
Primus leverages AMD’s ROCm Docker images to provide a consistent, ready-to-run environment optimized for AMD GPUs, eliminating manual dependency installation and environment configuration.
- AMD ROCm drivers (version ≥ 7.0 recommended)
- Docker (version ≥ 24.0) with ROCm support
- ROCm-compatible AMD GPUs (e.g., Instinct MI300 series)
- Proper permissions for Docker and GPU device access
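As a quick sanity check before pulling the image, the standard ROCm and Docker tools can confirm most of these prerequisites. The commands below are an illustrative sketch; exact output and group names (`video`/`render`) can vary by distribution:

```bash
# List visible AMD GPUs and the installed ROCm driver stack
rocm-smi

# Confirm the Docker version (>= 24.0 recommended)
docker --version

# Confirm the GPU device nodes that Docker maps into the container are accessible
ls -l /dev/kfd /dev/dri

# On most setups the current user should belong to the 'video' and/or 'render' groups
groups
```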
- **Pull the latest Docker image**

  ```bash
  docker pull docker.io/rocm/primus:v25.10
  ```

- **Clone the repository**

  ```bash
  git clone --recurse-submodules https://github.com/AMD-AIG-AIMA/Primus.git
  cd Primus
  ```

- **Run your first training**

  ```bash
  # Run training in container
  # NOTE: If your config downloads weights/tokenizer from Hugging Face Hub,
  # you typically need to pass HF_TOKEN into the container.
  ./runner/primus-cli container --image rocm/primus:v25.10 \
    --env HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
    -- train pretrain --config examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml
  ```
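The `HF_TOKEN` value above is a placeholder. A common variation (a sketch that reuses only the flags shown above) is to export the token in your shell and forward it, rather than hard-coding it on the command line:

```bash
# Export your Hugging Face token once in the current shell...
export HF_TOKEN="<your-hf-token>"

# ...then forward the variable into the container instead of pasting the literal token
./runner/primus-cli container --image rocm/primus:v25.10 \
  --env HF_TOKEN="$HF_TOKEN" \
  -- train pretrain --config examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml
```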
For more detailed usage instructions, see the CLI User Guide.
Comprehensive documentation is available in the docs/ directory:
- Quick Start Guide - Get started in 5 minutes
- Primus CLI User Guide - Complete CLI reference and usage
- CLI Architecture - Technical design and architecture
- Backend Patch Notes - Primus-specific backend arguments
- Full Documentation Index - Browse all available documentation
Primus-LM is part of a comprehensive ecosystem designed to provide end-to-end solutions for large model training on AMD GPUs:
┌─────────────────────────────────────────────────────┐
│ Primus-SaFE │
│ (Stability & Platform Layer) │
│ Cluster Management | Fault Tolerance | Scheduling │
└────────────────────────┬────────────────────────────┘
│
┌────────────────────────▼────────────────────────────┐
│ Primus-LM │
│ (Training Framework) │
│ Megatron | TorchTitan | Unified CLI | Workflows │
└────────────────────────┬────────────────────────────┘
│
┌────────────────────────▼────────────────────────────┐
│ Primus-Turbo │
│ (High-Performance Operators) │
│ FlashAttention | GEMM | Collectives | GroupedGemm │
│ AITER | CK | hipBLASLt | Triton │
└─────────────────────────────────────────────────────┘
| Component | Role | Key Features | Repository |
|---|---|---|---|
| Primus (Primus-LM) | Training Framework | Multi-backend support, unified CLI, production-ready workflows | This repo |
| Primus-Turbo | Performance Layer | Optimized kernels for attention, GEMM, communication, and more | Primus-Turbo |
| Primus-SaFE | Platform Layer | Cluster orchestration, fault tolerance, topology-aware scheduling | Primus-SaFE |
- Primus-LM provides the training framework and workflow orchestration
- Primus-Turbo supplies highly optimized compute kernels for maximum performance
- Primus-SaFE ensures stability and efficient resource utilization at scale
This separation of concerns allows each component to evolve independently while maintaining seamless integration.
- Add support for more model architectures and backends
- Expand documentation with more examples and tutorials
Primus builds on top of several ROCm-native operator libraries and compiler projects—we couldn’t reach current performance levels without them:
- ROCm AITER – AI Tensor Engine kernels (elementwise, attention, KV-cache, fused MoE, etc.)
- Composable Kernel – performance-portable tensor operator generator for GEMM and convolutions
- hipBLASLt – low-level BLAS Lt API with autotuning support for ROCm GPUs
- ROCm Triton – Python-first kernel compiler used for custom attention and MoE paths
If you rely on Primus, please consider starring or contributing to these projects as well—they are foundational to our stack.
We welcome contributions! Please see our Contributing Guide for details.
Primus is released under the Apache 2.0 License.
Built with ❤️ by AMD AI Brain - Training at Scale (TAS) Team