Orchestrate GPU-pinned AI agent swarms across NVIDIA DGX/HGX infrastructure
30s Demo • What It Does • Install • Full Guide • NVIDIA Integrations
You describe a research goal. NemoSpawn spawns an intelligent leader agent that autonomously orchestrates specialized sub-agents across your GPUs — designing experiments, monitoring performance, reallocating resources, and synthesizing results. No human intervention after launch.
No database. No cloud service. Just a CLI and your GPUs.
```bash
pip install nemospawn

# One command: autonomous research across 8 GPUs
nemospawn launch run autoresearch --gpus 0,1,2,3,4,5,6,7
```

That single command:
- Discovers your GPU topology and NVLink islands
- Spawns an AI leader agent that orchestrates the entire team
- The leader spawns trainers on the available GPUs, plus an evaluator
- Wires up task dependencies (evaluation waits for training)
- The leader monitors GPU performance, kills underperformers, and respawns them with new hyperparameters
- All agents get the full coordination protocol injected automatically
```text
nemospawn launch run autoresearch --gpus 0,1,2,3,4,5,6,7
                       │
             ┌─────────▼─────────┐
             │    orchestrator   │
             │    (AI leader)    │
             │ spawns, monitors, │
             │    reallocates    │
             └─┬──┬──┬──┬──┬──┬──┘
               │  │  │  │  │  │
  ┌──────────┐ │  │  │  │  │  │ ┌──────────┐
  │trainer-0 │─┘  │  │  │  │  └─│evaluator │
  │  GPU 0   │    │  │  │  │    │  GPU 7   │
  └──────────┘    │  │  │  │    └──────────┘
  ┌──────────┐    │  │  │  │    ┌──────────┐
  │trainer-1 │────┘  │  │  └────│trainer-5 │
  │  GPU 1   │       │  │       │  GPU 5   │
  └──────────┘       │  │       └──────────┘
               ... GPUs 2-4 ...

┌────────────────────────────────────────────┐
│ Leader autonomously:                       │
│  - Designs experiments with varied HP      │
│  - Monitors GPU util via DCGM              │
│  - Kills underperformers, respawns fresh   │
│  - Reviews worker plans before execution   │
│  - Merges results, synthesizes findings    │
└────────────────────────────────────────────┘
```
You have 8 H100s. You want to run a hyperparameter sweep, evaluate the best checkpoint, deploy it as a NIM endpoint, and benchmark latency — all in parallel where possible. Today you do this with bash scripts, tmux, and prayer.
NemoSpawn treats your GPU cluster as a programmable agent fabric. An AI leader agent orchestrates everything:
```bash
# Step 1: See what you have
nemospawn gpu discover     # List GPUs
nemospawn gpu topology     # NVLink interconnect map

# Step 2: Launch autonomous research (leader + workers)
nemospawn launch run autoresearch --gpus 0,1,2,3,4,5,6,7

# Step 3: Watch from the dashboard
nemospawn board serve my-team-abc   # Web dashboard at :8080
```

Or build manually with full control:
```bash
# Create team and spawn a leader agent
nemospawn team create llama-sweep --gpus 0,1,2,3,4,5,6,7
nemospawn spawn agent --team llama-sweep-abc --agent-name orchestrator \
    --role leader --task "Design HP sweep, spawn trainers, monitor, reallocate"

# Leader agent then autonomously spawns workers, creates tasks, monitors...
```

The leader agent receives a 10-step autonomous orchestration protocol:
1. Discover GPUs — `nemospawn gpu discover`
2. Spawn worker agents on available GPUs with specialized tasks
3. Create task dependencies — training before evaluation, evaluation before deployment
4. Monitor performance — `nemospawn schedule analyze` checks GPU utilization per agent
5. Review worker plans — approve or reject before major experiments
6. Detect underperformers — `nemospawn schedule suggest` finds low-utilization agents
7. Kill idle agents — `nemospawn spawn kill` frees GPUs
8. Respawn with new parameters — fresh agents with updated hyperparameters
9. Merge results — `nemospawn workspace merge` combines agent branches
10. Synthesize findings — report final results
Workers coordinate autonomously through the CLI:
```bash
# Workers run these from their tmux sessions:
nemospawn task update $NEMOSPAWN_TEAM task-abc --status running
nemospawn inbox send $NEMOSPAWN_TEAM leader "val_loss=0.031 at epoch 42"
nemospawn plan submit --team $NEMOSPAWN_TEAM --agent $NEMOSPAWN_AGENT \
    --title "Switch to cosine LR" --steps "Update config,Retrain,Eval"
nemospawn artifact register $NEMOSPAWN_TEAM ./model.nemo --type nemo-checkpoint --val-loss 0.031
nemospawn lifecycle idle --team $NEMOSPAWN_TEAM --agent $NEMOSPAWN_AGENT --reason "All tasks done"
```

```bash
pip install nemospawn              # Core
pip install nemospawn[hpo]         # + Optuna HPO
pip install nemospawn[transport]   # + ZeroMQ cross-node messaging
pip install nemospawn[all]         # Everything
```

Requires: Python >= 3.10, tmux
For GPU features: NVIDIA drivers + nvidia-smi (CPU-only dev mode works without)
Core Orchestration
| Command | What it does |
|---|---|
| `nemospawn team` | Create GPU-aware teams with NVLink topology discovery |
| `nemospawn spawn` | Spawn agents in tmux or OpenShell sandboxes with GPU pinning |
| `nemospawn task` | Task DAG — `blocked_by` dependencies, auto-unblocking, `val_loss` tracking |
| `nemospawn inbox` | Agent-to-agent messaging — direct, broadcast, atomic JSON delivery |
| `nemospawn plan` | Plan approval — agents submit proposals for leader review before execution |
| `nemospawn lifecycle` | Graceful idle/shutdown request/approve/reject protocol |
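These groups compose into a simple round trip: create a team, spawn an agent into it, then drive tasks and messages. A minimal sketch, using commands and flags shown elsewhere in this README (the `demo-abc` team ID and `task-abc` task ID are illustrative):

```bash
# Create a team, spawn a worker, and exercise the coordination surface.
# demo-abc and task-abc are illustrative IDs.
nemospawn team create demo --gpus 0,1
nemospawn spawn agent --team demo-abc --agent-name worker-0 \
    --role trainer --task "Fine-tune adapter on GPU 0"
nemospawn task update demo-abc task-abc --status running
nemospawn inbox send demo-abc worker-0 "priority: finish eval first"
```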
NVIDIA GPU & AI Stack
| Command | What it does |
|---|---|
| `nemospawn gpu` | GPU discovery, NVLink topology, DCGM health monitoring |
| `nemospawn artifact` | NeMo artifact store — register, promote, `val_loss` ranking |
| `nemospawn nim` | NIM deployment — build containers, benchmark with `perf_analyzer`, rank endpoints |
| `nemospawn ngc` | NGC registry — pull/push models and containers |
| `nemospawn hpo` | Optuna TPE + ASHA pruner for hyperparameter optimization |
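Together these cover the checkpoint-to-endpoint path. A sketch of that path: `artifact register` and its flags appear verbatim earlier in this README, while the arguments to `nim deploy` and `nim benchmark` below are assumptions:

```bash
# From .nemo checkpoint to benchmarked NIM endpoint.
# art-abc and nim-abc are hypothetical IDs; only the subcommand names
# and --profile max-throughput are taken from this README.
nemospawn artifact register $NEMOSPAWN_TEAM ./model.nemo \
    --type nemo-checkpoint --val-loss 0.031
nemospawn nim deploy art-abc --profile max-throughput
nemospawn nim benchmark nim-abc        # perf_analyzer: p50/p95/p99 latency
nemospawn nim list                     # endpoint health
```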
Infrastructure & Operations
| Command | What it does |
|---|---|
| `nemospawn launch` | One-command team launch from TOML templates (autoresearch, nim-deploy, rlhf-swarm, data-curation) |
| `nemospawn cluster` | Cross-cluster federation — SSH remote spawn on DGX/HGX nodes |
| `nemospawn slurm` | SLURM job script generation, submission, status, cancel |
| `nemospawn auth` | API key auth (SHA-256), multi-user namespaces, JSONL audit logging |
Monitoring & Observability
| Command | What it does |
|---|---|
| `nemospawn board` | Web UI kanban (SSE real-time), terminal kanban (Rich), tiled tmux view |
| `nemospawn watch` | Agent health monitoring — dead tmux detection, stuck agent alerts |
| `nemospawn cost` | GPU-hour cost tracking per agent with configurable $/GPU-hour rates |
| `nemospawn schedule` | Adaptive scheduling — analyze GPU util, auto-reassign tasks from underperformers |
| `nemospawn snapshot` | Save and restore full team state |
| `nemospawn workspace` | Git worktree checkpoint, merge, cleanup per agent |
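A typical monitoring pass chains several of these from a second terminal. The sketch below uses only invocations shown elsewhere in this README; the team ID is illustrative:

```bash
# Keep an eye on a running team without touching the agents.
nemospawn board serve my-team-abc   # web kanban at :8080 (SSE)
nemospawn schedule analyze          # per-agent GPU utilization via DCGM
nemospawn schedule suggest          # agents worth reassigning or killing
```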
Configuration
| Command | What it does |
|---|---|
| `nemospawn config` | Dynamic config — env var > config file > default (10 settings) |
| `nemospawn profile` | Agent CLI profiles — wizard, doctor, smoke test for 8 supported agents |
| `nemospawn skill` | Install coordination protocol as a discoverable skill for Claude Code / Codex |
| Agent | Auth | Prompt Injection |
|---|---|---|
| Claude Code | `ANTHROPIC_API_KEY` | CLI flag |
| Codex | `OPENAI_API_KEY` | CLI flag |
| Kimi CLI | `MOONSHOT_API_KEY` | CLI flag |
| aider | `OPENAI_API_KEY` | CLI flag |
| nanobot | — | CLI flag |
| Cursor | — | File |
| OpenCode | — | File |
| GitHub Copilot | `GITHUB_TOKEN` | File |
Plus `--agent-cmd custom` for any unlisted CLI tool. Create profiles with `nemospawn profile wizard`.
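For example, a sketch of wiring an unlisted tool. The `spawn agent` flags appear elsewhere in this README; the value passed to `--agent-cmd` and its exact semantics are assumptions based on the description above:

```bash
# Spawn an agent driven by an arbitrary CLI tool.
# The string given to --agent-cmd is a hypothetical example.
nemospawn spawn agent --team demo-abc --agent-name custom-0 \
    --role trainer --task "Fine-tune with my in-house tool" \
    --agent-cmd "my-agent-cli --workdir ."
nemospawn profile wizard   # or bake it into a reusable profile
```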
Every template spawns an orchestrator agent (role=leader) that autonomously manages the team:
```bash
nemospawn launch run autoresearch   --gpus 0-7   # Leader + trainers + evaluator
nemospawn launch run nim-deploy     --gpus 0-3   # Leader + deployers + benchmarker
nemospawn launch run rlhf-swarm     --gpus 0-3   # Leader + reward + PPO + eval
nemospawn launch run data-curation  --gpus 0,1   # Leader + curator + trainer
```

Write your own in TOML:
```toml
name = "my-pipeline"
min_gpus = 3

[[workers]]
name = "orchestrator"
role = "leader"
gpu_count = 0
task = "Orchestrate team: spawn workers, monitor GPU perf, reallocate, synthesize results"

[[workers]]
name = "trainer"
role = "trainer"
gpu_count = 2
task = "Fine-tune LLaMA-70B with LoRA"
require_nvlink = true

[[workers]]
name = "evaluator"
role = "evaluator"
gpu_count = 1
task = "Run AlpacaEval"
blocked_by = ["trainer"]
```
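To run it, point `launch run` at the template. Whether custom templates are resolved by bare name from a templates directory or by path is not pinned down here, so treat this invocation as an assumption:

```bash
# Launch the custom template above on three GPUs (min_gpus = 3).
# The bare-name resolution of "my-pipeline" is an assumption.
nemospawn launch run my-pipeline --gpus 0,1,2
```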
NemoSpawn directly calls 8 NVIDIA components and runs on 7 more:

| Component | What NemoSpawn does with it |
|---|---|
| nvidia-smi / NVML | gpu discover lists GPUs. gpu topology parses the NVLink interconnect matrix. gpu health reads temperature, utilization, power, and ECC errors via pynvml bindings. |
| NeMo Framework | Manages .nemo checkpoint bundles. Generates YAML config overrides with schema-aware type coercion. NVLink-aware scheduler places multi-GPU training on the same NVLink island for maximum interconnect bandwidth. |
| NIM | nim deploy builds inference containers from checkpoints with tensor parallel profiles (TP1-TP8). nim benchmark runs perf_analyzer and reports p50/p95/p99 latency. nim list tracks endpoint health. |
| Triton Inference Server | Auto-generates config.pbtxt model repository configs. Benchmarks endpoints via perf_analyzer with configurable concurrency levels. |
| DCGM | gpu status polls dcgmi dmon for SM utilization, memory, temperature, power, ECC errors. Exports to Prometheus. Falls back to nvidia-smi gracefully. |
| NGC | ngc pull/push wraps the NGC CLI for model download/upload. ngc push-container pushes to nvcr.io. |
| NIXL | Sub-microsecond inter-agent messaging over NVLink/InfiniBand. Auto-negotiated — falls back to ZeroMQ then file transport. |
| OpenShell | --runtime sandbox spawns agents with kernel-level isolation (Landlock filesystem + seccomp syscalls + network namespaces). GPU passthrough via CUDA. Per-role security policies auto-generated. |
| Component | How it's used |
|---|---|
| NVLink | Topology parsed to detect GPU islands. Multi-GPU tasks placed on same island. Link types tracked: NV12, NV8, NV4, NV2. |
| CUDA Toolkit | Agents use CUDA_VISIBLE_DEVICES for GPU pinning. NeMo/NIM workloads run on CUDA. |
| NCCL | Multi-GPU training uses NCCL for collective comms. NemoSpawn configures TP/PP degrees that NCCL implements. |
| cuDNN | NeMo training uses cuDNN. NemoSpawn configures mixed-precision modes (bf16-mixed, 16-mixed, 32). |
| TensorRT | NIM containers use TensorRT for inference. --profile max-throughput leverages TensorRT compilation. |
| Container Toolkit | Required for GPU access inside NIM Docker containers. |
| DGX / HGX | Cross-cluster federation targets DGX/HGX nodes. Topology features optimized for DGX H100/A100. |
```text
~/.nemospawn/                       All state — atomic JSON, no database
├── teams/{id}/
│   ├── team.json                   GPU list, NVLink topology, islands
│   ├── agents/{id}.json            Status, GPUs, tmux session, lifecycle
│   ├── tasks/{id}.json             DAG: blocked_by deps, val_loss, metadata
│   ├── plans/{id}.json             Submit → pending → approved/rejected
│   ├── inbox/{agent}/              Per-agent message files
│   ├── artifacts/{id}.json         .nemo checkpoints, NIM containers
│   ├── prompts/{id}.md             Auto-injected coordination prompts
│   ├── snapshots/{id}.json         Point-in-time team state
│   ├── costs/cost_record.json      GPU-hour tracking
│   ├── workspaces/                 Git worktrees (one per agent)
│   └── metrics/                    DCGM snapshots
├── config.json                     Dynamic config (env > file > default)
└── audit.jsonl                     Structured audit log
```
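Because everything is plain JSON on disk, state can be inspected or scripted against without the CLI. A sketch, assuming `jq` is installed and with an illustrative team ID:

```bash
# Poke at raw team state: no daemon or database in the way.
jq . ~/.nemospawn/teams/llama-sweep-abc/team.json   # GPUs, topology, islands
tail -f ~/.nemospawn/audit.jsonl                    # structured audit events
```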
Transport negotiation — messaging picks the fastest available backend automatically:
| Condition | Transport | Latency |
|---|---|---|
| Same node + NVLink | NIXL | Sub-microsecond |
| Cross-node | ZeroMQ | TCP |
| Fallback | File (atomic JSON) | Filesystem I/O |
Autonomous coordination flow:
1. `nemospawn launch run autoresearch` spawns leader + workers
2. Each agent gets `NEMOSPAWN_TEAM`, `NEMOSPAWN_AGENT`, `CUDA_VISIBLE_DEVICES`, `PATH`
3. Coordination prompt auto-injected — leader gets the 10-step orchestration protocol
4. Leader spawns additional agents, creates tasks, monitors GPU performance
5. Workers train, report val_loss, submit plans, send messages
6. Leader detects underperformers, kills idle agents, respawns with new params
7. Workers report `lifecycle idle` when done
8. Leader merges results and synthesizes findings
```bash
git clone https://github.com/alokemajumder/nemospawn.git
cd nemospawn
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

pytest tests/ -v          # 201 tests
ruff check src/ tests/    # Lint
```

Project layout:
```text
src/nemospawn/
├── cli/             24 Typer command groups
├── core/            State, models, auth, profiles, config, plan,
│                    lifecycle, costs, snapshot, watcher, adaptive, skill
├── gpu/             Discovery, NVLink topology, DCGM health
├── nemo/            Artifacts, config injection, NVLink-aware scheduling
├── nim/             NIM deployer, Triton benchmarks
├── openshell/       Sandbox integration, security policies, prompts
├── messaging/       File / ZeroMQ / NIXL transport
├── observability/   Prometheus, Grafana, kanban, web UI (SSE)
├── templates/       TOML templates, launch engine
├── federation/      Cross-cluster SSH spawn, git-annex
├── hpo/             Optuna TPE/ASHA, fallback sampler
├── ngc/             NGC model registry
└── runtime/         tmux, git worktree, SLURM
```
Contributions welcome:
- Fork the repo
- Create a feature branch (`git checkout -b feature/my-feature`)
- Run tests (`pytest tests/ -v`) and lint (`ruff check src/ tests/`)
- Submit a PR
See GUIDE.md for architecture details and development patterns.