NemoSpawn

Orchestrate GPU-pinned AI agent swarms across NVIDIA DGX/HGX infrastructure


You describe a research goal. NemoSpawn spawns an intelligent leader agent that autonomously orchestrates specialized sub-agents across your GPUs — designing experiments, monitoring performance, reallocating resources, and synthesizing results. No human intervention after launch.

No database. No cloud service. Just a CLI and your GPUs.

30-Second Demo

pip install nemospawn

# One command: autonomous research across 8 GPUs
nemospawn launch run autoresearch --gpus 0,1,2,3,4,5,6,7

That single command:

Discovers your GPU topology and NVLink islands
Spawns an AI leader agent that orchestrates the entire team
Has the leader spawn trainers on the available GPUs, plus an evaluator
Wires up task dependencies (evaluation waits for training)
Has the leader monitor GPU performance, kill underperformers, and respawn them with new hyperparameters
Injects the full coordination protocol into every agent automatically

            nemospawn launch run autoresearch --gpus 0,1,2,3,4,5,6,7
                                    |
                          ┌─────────▼──────────┐
                          │   orchestrator     │
                          │   (AI leader)      │
                          │   spawns, monitors,│
                          │   reallocates      │
                          └──┬──┬──┬──┬──┬──┬──┘
                             │  │  │  │  │  │
               ┌──────────┐ ┌┘  │  │  │  │  └┐ ┌──────────┐
               │trainer-0 │ │   │  │  │  │   │ │evaluator │
               │  GPU 0   │ │   │  │  │  │   │ │  GPU 7   │
               └──────────┘ │   │  │  │  │   │ └──────────┘
                     ┌──────┘   │  │  │  └───┘
                     │trainer-1 │  │  │ trainer-5│
                     │  GPU 1   │  │  │  GPU 5  │
                     └──────────┘  │  └─────────┘
                          ...  GPUs 2-4  ...

         ┌─────────────────────────────────────────────┐
         │  Leader autonomously:                        │
         │  - Designs experiments with varied HP        │
         │  - Monitors GPU util via DCGM                │
         │  - Kills underperformers, respawns fresh     │
         │  - Reviews worker plans before execution     │
         │  - Merges results, synthesizes findings      │
         └─────────────────────────────────────────────┘

What It Does

The Problem

You have 8 H100s. You want to run a hyperparameter sweep, evaluate the best checkpoint, deploy it as a NIM endpoint, and benchmark latency — all in parallel where possible. Today you do this with bash scripts, tmux, and prayer.

The Solution

NemoSpawn treats your GPU cluster as a programmable agent fabric. An AI leader agent orchestrates everything:

# Step 1: See what you have
nemospawn gpu discover                    # List GPUs
nemospawn gpu topology                    # NVLink interconnect map

# Step 2: Launch autonomous research (leader + workers)
nemospawn launch run autoresearch --gpus 0,1,2,3,4,5,6,7

# Step 3: Watch from the dashboard
nemospawn board serve my-team-abc         # Web dashboard at :8080

Or build manually with full control:

# Create team and spawn a leader agent
nemospawn team create llama-sweep --gpus 0,1,2,3,4,5,6,7
nemospawn spawn agent --team llama-sweep-abc --agent-name orchestrator \
  --role leader --task "Design HP sweep, spawn trainers, monitor, reallocate"

# Leader agent then autonomously spawns workers, creates tasks, monitors...

How Agents Self-Organize

The leader agent receives a 10-step autonomous orchestration protocol:

1. Discover GPUs: `nemospawn gpu discover`
2. Spawn worker agents on available GPUs with specialized tasks
3. Create task dependencies: training before evaluation, evaluation before deployment
4. Monitor performance: `nemospawn schedule analyze` checks GPU utilization per agent
5. Review worker plans: approve or reject before major experiments
6. Detect underperformers: `nemospawn schedule suggest` finds low-utilization agents
7. Kill idle agents: `nemospawn spawn kill` frees GPUs
8. Respawn with new parameters: fresh agents with updated hyperparameters
9. Merge results: `nemospawn workspace merge` combines agent branches
10. Synthesize findings: report final results
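
The monitor, detect, and kill steps boil down to a threshold decision on per-agent GPU utilization. The sketch below shows one way such a decision could look. It is not NemoSpawn's scheduler: the `AgentUtil` record, the `find_underperformers` helper, and the 30% threshold are all hypothetical, and the utilization numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class AgentUtil:
    agent: str
    gpu_util: float  # mean GPU utilization in percent over some sampling window


def find_underperformers(samples, min_util=30.0):
    """Return the names of agents whose mean utilization is below min_util."""
    return sorted(s.agent for s in samples if s.gpu_util < min_util)


# Illustrative numbers only, not measurements.
samples = [
    AgentUtil("trainer-0", 92.0),
    AgentUtil("trainer-1", 11.5),
    AgentUtil("evaluator", 64.0),
]
print(find_underperformers(samples))  # ['trainer-1']
```

A leader would then decide whether to kill and respawn the flagged agents or leave them running, for example when low utilization is expected during data loading.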

Workers coordinate autonomously through the CLI:

# Workers run these from their tmux sessions:
nemospawn task update $NEMOSPAWN_TEAM task-abc --status running
nemospawn inbox send $NEMOSPAWN_TEAM leader "val_loss=0.031 at epoch 42"
nemospawn plan submit --team $NEMOSPAWN_TEAM --agent $NEMOSPAWN_AGENT \
  --title "Switch to cosine LR" --steps "Update config,Retrain,Eval"
nemospawn artifact register $NEMOSPAWN_TEAM ./model.nemo --type nemo-checkpoint --val-loss 0.031
nemospawn lifecycle idle --team $NEMOSPAWN_TEAM --agent $NEMOSPAWN_AGENT --reason "All tasks done"
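
The same commands can be issued from inside a Python training script. This is a minimal sketch, assuming the `nemospawn` CLI is on `PATH` and accepts exactly the `inbox send` arguments shown above. The helper names `inbox_send_argv` and `report_val_loss` are hypothetical and not part of NemoSpawn, and `my-team-abc` is only a placeholder default.

```python
import os
import subprocess


def inbox_send_argv(team: str, recipient: str, message: str) -> list[str]:
    """Build the argv for `nemospawn inbox send`, mirroring the shell example."""
    return ["nemospawn", "inbox", "send", team, recipient, message]


def report_val_loss(val_loss: float, epoch: int, dry_run: bool = False) -> list[str]:
    """Send a val_loss update to the leader and return the argv that was built."""
    team = os.environ.get("NEMOSPAWN_TEAM", "my-team-abc")  # placeholder default
    argv = inbox_send_argv(team, "leader", f"val_loss={val_loss:.3f} at epoch {epoch}")
    if not dry_run:
        # Raises CalledProcessError if the CLI exits with a non-zero status.
        subprocess.run(argv, check=True)
    return argv


# dry_run=True only builds the command, so this line is safe without nemospawn installed.
print(report_val_loss(0.031, 42, dry_run=True))
```

Passing the arguments as a list avoids shell quoting problems when the message contains spaces or special characters.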

Install

pip install nemospawn                     # Core
pip install nemospawn[hpo]                # + Optuna HPO
pip install nemospawn[transport]          # + ZeroMQ cross-node messaging
pip install nemospawn[all]                # Everything

Requires: Python >= 3.10, tmux

For GPU features: NVIDIA drivers and nvidia-smi. CPU-only dev mode works without them.
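
A quick way to see which mode a machine would end up in is to check for the two external tools named above. This is an illustrative sketch using only the standard library; `check_prerequisites` and `describe_mode` are hypothetical helpers, not NemoSpawn commands (the project's own diagnostic is `nemospawn profile doctor`).

```python
import shutil


def check_prerequisites(which=shutil.which) -> dict[str, bool]:
    """Report whether each required external tool is found on PATH."""
    return {tool: which(tool) is not None for tool in ("tmux", "nvidia-smi")}


def describe_mode(found: dict[str, bool]) -> str:
    """Map the tool check onto the modes described in the install notes."""
    if not found["tmux"]:
        return "missing tmux"
    return "gpu" if found["nvidia-smi"] else "cpu-only dev mode"


print(describe_mode(check_prerequisites()))
```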

Feature Overview

24 CLI Command Groups

Core Orchestration

Command	What it does
`nemospawn team`	Create GPU-aware teams with NVLink topology discovery
`nemospawn spawn`	Spawn agents in tmux or OpenShell sandboxes with GPU pinning
`nemospawn task`	Task DAG — `blocked_by` dependencies, auto-unblocking, val_loss tracking
`nemospawn inbox`	Agent-to-agent messaging — direct, broadcast, atomic JSON delivery
`nemospawn plan`	Plan approval — agents submit proposals for leader review before execution
`nemospawn lifecycle`	Graceful idle/shutdown request/approve/reject protocol
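
To make the `blocked_by` auto-unblocking idea concrete, here is a small sketch of how a task DAG could release tasks once their dependencies finish. It assumes an in-memory dict and the status names `blocked`, `pending`, and `done`, all of which are hypothetical; NemoSpawn's actual task schema and status values may differ.

```python
def unblock_ready(tasks: dict[str, dict]) -> list[str]:
    """Flip blocked tasks to 'pending' once every dependency is 'done'.

    `tasks` maps task id -> {"status": str, "blocked_by": [task ids]}.
    Returns the ids that were unblocked, in sorted order.
    """
    unblocked = []
    for task_id in sorted(tasks):
        task = tasks[task_id]
        if task["status"] != "blocked":
            continue
        if all(tasks[dep]["status"] == "done" for dep in task["blocked_by"]):
            task["status"] = "pending"
            unblocked.append(task_id)
    return unblocked


tasks = {
    "train": {"status": "done", "blocked_by": []},
    "evaluate": {"status": "blocked", "blocked_by": ["train"]},
    "deploy": {"status": "blocked", "blocked_by": ["evaluate"]},
}
print(unblock_ready(tasks))  # ['evaluate']
```

`deploy` stays blocked in this pass because `evaluate` has only become runnable, not finished.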

NVIDIA GPU & AI Stack

Command	What it does
`nemospawn gpu`	GPU discovery, NVLink topology, DCGM health monitoring
`nemospawn artifact`	NeMo artifact store — register, promote, val_loss ranking
`nemospawn nim`	NIM deployment — build containers, benchmark with perf_analyzer, rank endpoints
`nemospawn ngc`	NGC registry — pull/push models and containers
`nemospawn hpo`	Optuna TPE + ASHA pruner for hyperparameter optimization

Infrastructure & Operations

Command	What it does
`nemospawn launch`	One-command team launch from TOML templates (autoresearch, nim-deploy, rlhf-swarm, data-curation)
`nemospawn cluster`	Cross-cluster federation — SSH remote spawn on DGX/HGX nodes
`nemospawn slurm`	SLURM job script generation, submission, status, cancel
`nemospawn auth`	API key auth (SHA-256), multi-user namespaces, JSONL audit logging
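
For readers unfamiliar with the pattern behind SHA-256 key auth and JSONL audit logging, the sketch below stores only a digest of the API key and emits one JSON object per audit line. It is a generic illustration, not NemoSpawn's code: the function names and the audit record fields (`ts`, `user`, `action`, `ok`) are assumptions.

```python
import hashlib
import hmac
import json
import secrets
import time


def hash_key(api_key: str) -> str:
    """Return the hex SHA-256 digest that would be stored instead of the key."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def verify_key(candidate: str, stored_digest: str) -> bool:
    """Compare digests in constant time to avoid timing side channels."""
    return hmac.compare_digest(hash_key(candidate), stored_digest)


def audit_line(user: str, action: str, ok: bool) -> str:
    """Serialize one audit event as a single JSON line (JSONL)."""
    record = {"ts": time.time(), "user": user, "action": action, "ok": ok}
    return json.dumps(record, sort_keys=True)


api_key = secrets.token_urlsafe(32)  # high-entropy random key
digest = hash_key(api_key)
print(verify_key(api_key, digest), verify_key("wrong", digest))  # True False
print(audit_line("alice", "team.create", True))
```

A plain SHA-256 digest is reasonable here only because the key is random and long; it would not be appropriate for human-chosen passwords.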

Monitoring & Observability

Command	What it does
`nemospawn board`	Web UI kanban (SSE real-time), terminal kanban (Rich), tiled tmux view
`nemospawn watch`	Agent health monitoring — dead tmux detection, stuck agent alerts
`nemospawn cost`	GPU-hour cost tracking per agent with configurable $/GPU-hour rates
`nemospawn schedule`	Adaptive scheduling — analyze GPU util, auto-reassign tasks from underperformers
`nemospawn snapshot`	Save and restore full team state
`nemospawn workspace`	Git worktree checkpoint, merge, cleanup per agent
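
GPU-hour cost tracking is simple arithmetic: wall-clock hours times the number of GPUs held, times a rate. The sketch below shows that calculation under stated assumptions. The helper names are hypothetical and the $2.50 rate is an arbitrary example, not a real price or a NemoSpawn default.

```python
from datetime import datetime, timedelta


def gpu_hours(start: datetime, end: datetime, gpu_count: int) -> float:
    """GPU-hours consumed by an agent holding gpu_count GPUs from start to end."""
    return (end - start).total_seconds() / 3600.0 * gpu_count


def cost_usd(hours: float, rate_per_gpu_hour: float) -> float:
    """Cost at a configurable $/GPU-hour rate, rounded to cents."""
    return round(hours * rate_per_gpu_hour, 2)


start = datetime(2025, 1, 1, 9, 0)
end = start + timedelta(hours=1, minutes=30)
hours = gpu_hours(start, end, gpu_count=2)
print(hours, cost_usd(hours, rate_per_gpu_hour=2.50))  # 3.0 7.5
```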

Configuration

Command	What it does
`nemospawn config`	Dynamic config — env var > config file > default (10 settings)
`nemospawn profile`	Agent CLI profiles — wizard, doctor, smoke test for 8 supported agents
`nemospawn skill`	Install coordination protocol as a discoverable skill for Claude Code / Codex
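
The precedence rule for `nemospawn config` (env var > config file > default) can be sketched as a three-step lookup. The setting name `board_port`, the `NEMOSPAWN_` environment prefix, and the `resolve` helper are assumptions made for this example; the real setting names are not listed in this README.

```python
import os

# Hypothetical setting; 8080 matches the dashboard port mentioned earlier.
DEFAULTS = {"board_port": "8080"}


def resolve(key, file_config, env=None, prefix="NEMOSPAWN_"):
    """Resolve a setting: environment variable first, then config file, then default."""
    env = os.environ if env is None else env
    env_name = prefix + key.upper()
    if env_name in env:
        return env[env_name]
    if key in file_config:
        return file_config[key]
    return DEFAULTS[key]


print(resolve("board_port", {}, env={}))
print(resolve("board_port", {"board_port": "9000"}, env={}))
print(resolve("board_port", {"board_port": "9000"}, env={"NEMOSPAWN_BOARD_PORT": "7000"}))
```

The three calls print the default, the file value, and the environment override, in that order.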

8 Supported Agent CLIs

Agent	Auth	Prompt Injection
Claude Code	`ANTHROPIC_API_KEY`	CLI flag
Codex	`OPENAI_API_KEY`	CLI flag
Kimi CLI	`MOONSHOT_API_KEY`	CLI flag
aider	`OPENAI_API_KEY`	CLI flag
nanobot	—	CLI flag
Cursor	—	File
OpenCode	—	File
GitHub Copilot	`GITHUB_TOKEN`	File

Plus `--agent-cmd custom` for any unlisted CLI tool. Create profiles with `nemospawn profile wizard`.

4 Built-in Templates (all include AI leader)

Every template spawns an orchestrator agent (role=leader) that autonomously manages the team:

nemospawn launch run autoresearch --gpus 0-7    # Leader + trainers + evaluator
nemospawn launch run nim-deploy --gpus 0-3      # Leader + deployers + benchmarker
nemospawn launch run rlhf-swarm --gpus 0-3      # Leader + reward + PPO + eval
nemospawn launch run data-curation --gpus 0,1   # Leader + curator + trainer
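
The examples in this README pass `--gpus` both as a range (`0-7`) and as a comma-separated list (`0,1`). The sketch below shows one way such a spec could be expanded into GPU indices. It is an illustration only: `parse_gpus` is a hypothetical helper, and whether NemoSpawn accepts mixed forms such as `0,2-3` is an assumption.

```python
def parse_gpus(spec: str) -> list[int]:
    """Expand '0-3' or '0,1,4-5' into a sorted list of unique GPU indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in GPU spec {spec!r}")
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            if lo > hi:
                raise ValueError(f"descending range {part!r}")
            indices.update(range(lo, hi + 1))
        else:
            indices.add(int(part))
    return sorted(indices)


print(parse_gpus("0-7"))
print(parse_gpus("0,1"))
```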

Write your own in TOML:

name = "my-pipeline"
min_gpus = 3

[[workers]]
name = "orchestrator"
role = "leader"
gpu_count = 0
task = "Orchestrate team: spawn workers, monitor GPU perf, reallocate, synthesize results"

[[workers]]
name = "trainer"
role = "trainer"
gpu_count = 2
task = "Fine-tune LLaMA-70B with LoRA"
require_nvlink = true

[[workers]]
name = "evaluator"
role = "evaluator"
gpu_count = 1
task = "Run AlpacaEval"
blocked_by = ["trainer"]
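
Before launching a custom template it can help to sanity-check it. The sketch below parses a trimmed copy of the template above and applies three checks. The checks themselves are assumptions about what a valid template looks like (total `gpu_count` fits within `min_gpus`, every `blocked_by` entry names a defined worker, a leader exists); they are not NemoSpawn's own validation rules. `tomllib` ships with Python 3.11+; on 3.10 the `tomli` backport provides the same API.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # backport with the same API for Python 3.10

TEMPLATE = """
name = "my-pipeline"
min_gpus = 3

[[workers]]
name = "orchestrator"
role = "leader"
gpu_count = 0

[[workers]]
name = "trainer"
role = "trainer"
gpu_count = 2

[[workers]]
name = "evaluator"
role = "evaluator"
gpu_count = 1
blocked_by = ["trainer"]
"""


def validate(template: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the checks passed."""
    problems = []
    workers = template.get("workers", [])
    names = {w["name"] for w in workers}
    min_gpus = template.get("min_gpus", 0)
    needed = sum(w.get("gpu_count", 0) for w in workers)
    if needed > min_gpus:
        problems.append(f"workers need {needed} GPUs but min_gpus is {min_gpus}")
    for w in workers:
        for dep in w.get("blocked_by", []):
            if dep not in names:
                problems.append(f"{w['name']} is blocked by unknown worker {dep!r}")
    if not any(w.get("role") == "leader" for w in workers):
        problems.append("no worker has role = leader")
    return problems


print(validate(tomllib.loads(TEMPLATE)))  # []
```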

NVIDIA Integrations

NemoSpawn directly calls 8 NVIDIA components and runs on 7 more:

Direct (NemoSpawn calls these APIs/CLIs)

Component	What NemoSpawn does with it
nvidia-smi / NVML	`gpu discover` lists GPUs. `gpu topology` parses the NVLink interconnect matrix. `gpu health` reads temperature, utilization, power, and ECC errors via pynvml bindings.
NeMo Framework	Manages `.nemo` checkpoint bundles. Generates YAML config overrides with schema-aware type coercion. NVLink-aware scheduler places multi-GPU training on the same NVLink island for maximum interconnect bandwidth.
NIM	`nim deploy` builds inference containers from checkpoints with tensor parallel profiles (TP1-TP8). `nim benchmark` runs perf_analyzer and reports p50/p95/p99 latency. `nim list` tracks endpoint health.
Triton Inference Server	Auto-generates `config.pbtxt` model repository configs. Benchmarks endpoints via perf_analyzer with configurable concurrency levels.
DCGM	`gpu status` polls `dcgmi dmon` for SM utilization, memory, temperature, power, ECC errors. Exports to Prometheus. Falls back to nvidia-smi gracefully.
NGC	`ngc pull/push` wraps the NGC CLI for model download/upload. `ngc push-container` pushes to `nvcr.io`.
NIXL	Sub-microsecond inter-agent messaging over NVLink/InfiniBand. Auto-negotiated — falls back to ZeroMQ then file transport.
OpenShell	`--runtime sandbox` spawns agents with kernel-level isolation (Landlock filesystem + seccomp syscalls + network namespaces). GPU passthrough via CUDA. Per-role security policies auto-generated.
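
To show the kind of data GPU discovery and health checks work with, here is a sketch that parses `nvidia-smi` CSV output. Note the swap: the table above says NemoSpawn reads health via pynvml bindings, while this example uses the `nvidia-smi --query-gpu` CSV interface instead because it needs no extra package. `parse_gpu_csv` is a hypothetical helper and the sample rows are invented values, not real measurements.

```python
import csv
import io

# A real nvidia-smi invocation; run it with subprocess on a machine with NVIDIA drivers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu,utilization.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse the CSV produced by QUERY into one dict per GPU."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not fields:
            continue
        index, name, temp, util, power = (f.strip() for f in fields)
        rows.append({
            "index": int(index),
            "name": name,
            "temp_c": int(temp),
            "util_pct": int(util),
            "power_w": float(power),
        })
    return rows


# Invented sample in the shape nvidia-smi prints for the query above.
SAMPLE = (
    "0, NVIDIA H100 80GB HBM3, 34, 0, 71.22\n"
    "1, NVIDIA H100 80GB HBM3, 36, 97, 512.80\n"
)
for gpu in parse_gpu_csv(SAMPLE):
    print(gpu)
```

Some GPUs report `[N/A]` for fields such as `power.draw`; a production parser would need to handle that instead of calling `float` directly.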

Indirect (infrastructure NemoSpawn runs on)

Component	How it's used
NVLink	Topology parsed to detect GPU islands. Multi-GPU tasks placed on same island. Link types tracked: NV12, NV8, NV4, NV2.
CUDA Toolkit	Agents use `CUDA_VISIBLE_DEVICES` for GPU pinning. NeMo/NIM workloads run on CUDA.
NCCL	Multi-GPU training uses NCCL for collective comms. NemoSpawn configures TP/PP degrees that NCCL implements.
cuDNN	NeMo training uses cuDNN. NemoSpawn configures mixed-precision modes (bf16-mixed, 16-mixed, 32).
TensorRT	NIM containers use TensorRT for inference. `--profile max-throughput` leverages TensorRT compilation.
Container Toolkit	Required for GPU access inside NIM Docker containers.
DGX / HGX	Cross-cluster federation targets DGX/HGX nodes. Topology features optimized for DGX H100/A100.
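
An NVLink island can be thought of as a connected component of the graph whose edges are NVLink connections. The sketch below computes islands from an adjacency mapping. It is a generic graph traversal under that assumption, not NemoSpawn's topology code; the `links` mapping is hypothetical input that could, for example, be derived from the `NV#` entries printed by `nvidia-smi topo -m`.

```python
def nvlink_islands(links: dict[int, set[int]]) -> list[list[int]]:
    """Group GPUs into islands: connected components of the NVLink graph."""
    seen = set()
    islands = []
    for gpu in sorted(links):
        if gpu in seen:
            continue
        stack, island = [gpu], []
        seen.add(gpu)
        while stack:  # iterative depth-first search from this GPU
            node = stack.pop()
            island.append(node)
            for peer in links.get(node, ()):
                if peer not in seen:
                    seen.add(peer)
                    stack.append(peer)
        islands.append(sorted(island))
    return islands


# Hypothetical topology: GPUs 0-3 fully linked, 4-5 linked, 6 and 7 without NVLink.
links = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2},
    4: {5}, 5: {4}, 6: set(), 7: set(),
}
print(nvlink_islands(links))  # [[0, 1, 2, 3], [4, 5], [6], [7]]
```

A scheduler honoring `require_nvlink = true` would then place a 2-GPU worker inside one island rather than across two.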

Architecture

~/.nemospawn/                             All state — atomic JSON, no database
├── teams/{id}/
│   ├── team.json                         GPU list, NVLink topology, islands
│   ├── agents/{id}.json                  Status, GPUs, tmux session, lifecycle
│   ├── tasks/{id}.json                   DAG: blocked_by deps, val_loss, metadata
│   ├── plans/{id}.json                   Submit → pending → approved/rejected
│   ├── inbox/{agent}/                    Per-agent message files
│   ├── artifacts/{id}.json               .nemo checkpoints, NIM containers
│   ├── prompts/{id}.md                   Auto-injected coordination prompts
│   ├── snapshots/{id}.json               Point-in-time team state
│   ├── costs/cost_record.json            GPU-hour tracking
│   ├── workspaces/                       Git worktrees (one per agent)
│   └── metrics/                          DCGM snapshots
├── config.json                           Dynamic config (env > file > default)
└── audit.jsonl                           Structured audit log
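
"Atomic JSON" state usually means writing to a temporary file and renaming it into place, so readers never observe a half-written file. The sketch below shows that common pattern with the standard library. It is an assumption about how such writes can be done, not a copy of NemoSpawn's implementation, and `write_json_atomic` is a hypothetical name.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, data: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # make sure bytes hit the disk before renaming
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "teams" / "demo" / "tasks" / "task-abc.json"
    write_json_atomic(target, {"status": "running"})
    print(json.loads(target.read_text(encoding="utf-8")))
```

Creating the temp file in the target directory matters: a rename across filesystems is not atomic.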

Transport negotiation — messaging picks the fastest available backend automatically:

Condition	Transport	Latency bound
Same node + NVLink	NIXL	Sub-microsecond
Cross-node	ZeroMQ	TCP network round trip
Fallback	File (atomic JSON)	Filesystem I/O
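
The negotiation in the table can be read as a priority list. The sketch below encodes it as a pure function so the fallback order is easy to see. The function and parameter names are hypothetical; how NemoSpawn actually detects NIXL or NVLink availability is not described here, so those are passed in as booleans. `zmq` is the import name of the pyzmq package.

```python
import importlib.util


def has_module(name: str) -> bool:
    """True if a top-level module called `name` is importable in this environment."""
    return importlib.util.find_spec(name) is not None


def pick_transport(same_node: bool, nvlink: bool,
                   nixl_available: bool, zmq_available: bool) -> str:
    """Pick the fastest usable backend: NIXL, then ZeroMQ, then file transport."""
    if same_node and nvlink and nixl_available:
        return "nixl"
    if zmq_available:
        return "zeromq"
    return "file"


print(pick_transport(same_node=False, nvlink=False,
                     nixl_available=False, zmq_available=has_module("zmq")))
```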

Autonomous coordination flow:

1. `nemospawn launch run autoresearch` spawns the leader and workers
2. Each agent gets `NEMOSPAWN_TEAM`, `NEMOSPAWN_AGENT`, `CUDA_VISIBLE_DEVICES`, and `PATH`
3. The coordination prompt is auto-injected; the leader gets the 10-step orchestration protocol
4. The leader spawns additional agents, creates tasks, and monitors GPU performance
5. Workers train, report val_loss, submit plans, and send messages
6. The leader detects underperformers, kills idle agents, and respawns with new parameters
7. Workers report `lifecycle idle` when done
8. The leader merges results and synthesizes findings
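
GPU pinning rests on the per-agent environment named in the flow above. The sketch below builds such an environment for a child process. `agent_env` is a hypothetical helper, not NemoSpawn's spawn code; only the three variable names come from this README.

```python
import os


def agent_env(team: str, agent: str, gpus: list[int], base=None) -> dict[str, str]:
    """Return a copy of the base environment with the per-agent variables set."""
    env = dict(os.environ if base is None else base)
    env["NEMOSPAWN_TEAM"] = team
    env["NEMOSPAWN_AGENT"] = agent
    # CUDA only exposes the listed devices to the process; an empty value hides all GPUs.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    return env


env = agent_env("my-team-abc", "trainer-0", [0, 1], base={"PATH": "/usr/bin"})
print(env)
```

The result can be passed as `env=` to `subprocess.Popen` when starting the agent process. Inside the agent, the pinned GPUs are renumbered from 0, so GPU 5 on the host appears as device 0 to a process pinned with `CUDA_VISIBLE_DEVICES=5`.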

Development

git clone https://github.com/alokemajumder/nemospawn.git
cd nemospawn
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest tests/ -v           # 201 tests
ruff check src/ tests/     # Lint

Project layout

src/nemospawn/
├── cli/             24 Typer command groups
├── core/            State, models, auth, profiles, config, plan,
│                    lifecycle, costs, snapshot, watcher, adaptive, skill
├── gpu/             Discovery, NVLink topology, DCGM health
├── nemo/            Artifacts, config injection, NVLink-aware scheduling
├── nim/             NIM deployer, Triton benchmarks
├── openshell/       Sandbox integration, security policies, prompts
├── messaging/       File / ZeroMQ / NIXL transport
├── observability/   Prometheus, Grafana, kanban, web UI (SSE)
├── templates/       TOML templates, launch engine
├── federation/      Cross-cluster SSH spawn, git-annex
├── hpo/             Optuna TPE/ASHA, fallback sampler
├── ngc/             NGC model registry
└── runtime/         tmux, git worktree, SLURM

Contributing

Contributions welcome:

1. Fork the repo
2. Create a feature branch (`git checkout -b feature/my-feature`)
3. Run tests (`pytest tests/ -v`) and lint (`ruff check src/ tests/`)
4. Submit a PR

See GUIDE.md for architecture details and development patterns.

License

Apache License 2.0
