LLM agents can search the web, run code, read files, and chain multi-step reasoning together, but getting them to do this reliably is an unsolved problem. When an agent picks the wrong tool, hallucinates a search query, or formats its answer incorrectly, the whole chain falls apart. Nobody has figured out how to make this work perfectly, and the techniques that do work are changing fast.
In this lab you'll get hands-on with the problem. You'll run agents against real questions, watch them fail in interesting ways through traces, change their configs, and measure whether your changes actually helped. By the end of the session you'll have a working intuition for why agents break and how to fix them.
You'll work with three tools:
- NAT (NeMo Agent Toolkit): NVIDIA's open-source library that adds intelligence to AI agents across any framework, enhancing speed, accuracy, and decision-making through enterprise-grade instrumentation, observability, and continuous learning. You define an agent entirely in YAML (model, tools, system prompt, architecture) and NAT handles orchestration, tool execution, and LLM calls.
- GAIA: a benchmark of real-world questions that require multi-step reasoning and tool use. These aren't toy problems. They involve reading spreadsheets, analyzing images, searching the web, running calculations, and combining it all into a precise answer. The repo includes a test set (for benchmarking) and a dev set (with expected answers, for tuning).
- Phoenix: a tracing UI built on OpenTelemetry. Every LLM call, tool invocation, and routing decision shows up as a span tree you can click through. When something goes wrong, you can see exactly what the agent did and where it broke.
What you'll learn:
- How agents actually work under the hood. Not the theory, but the real execution: which tools get called, what the LLM sees at each step, how routing decisions play out.
- Why architecture matters. You'll run the same question through different agent designs and see in traces how a flat agent, a multi-agent orchestrator, and prompt-driven routing each handle it differently.
- How to debug and improve agents. Read traces, spot wasted tool calls or bad formatting, fix them with targeted prompt edits.
- How to measure what you've built. Your agents submit to a public leaderboard so you can see exactly where you stand.
The repo ships with four agents that score 85-90% on the leaderboard. They're good but not perfect. Study their configs, read their traces, find where they fail, and build something better.
Complete the Setup below, then follow the Lab Guide for a full guided walkthrough with Phoenix tracing, failure diagnosis, and benchmarking.
git clone https://github.com/mnajafian-nv/nat-agent-lab.git nat-agent-lab
cd nat-agent-lab
bash setup.sh # ~20 min; prompts for API keys, downloads model
./ask # start chattingSetup takes 20-30 minutes (model download, dependencies, API keys). See Setup for details. There is a GPU path and an Ollama path if you don't have GPUs.
You should see a status line like Agent: ultrafast | vLLM: OK | NAT: OK | Phoenix: OK. If anything looks wrong, type status for diagnostics. Try asking "What is 2+2?" to confirm the agent responds.
Once you see the ask> prompt, work through these steps in order.
ask> What is the tallest building in San Francisco?
The agent searches the web, reasons over the result, and answers. Type a follow-up and the agent remembers context. The prompt shows your turn count (ask [1]> after the first exchange, ask [2]> after the second, etc.).
ask [1]> level dev 1, 1
This runs the first question from GAIA Level 1 (dev set). The agent works through it with tool calls, and then the expected answer is shown so you can compare. Notice the timing: ./ask prints how long the agent took.
After it finishes, you can ask follow-ups like "why did you use that tool?" or "explain your reasoning" since the Q&A is already in memory.
Some useful commands:
level dev 1- list all Level 1 dev questions (with answers for checking)level 1- list Level 1 test questions (no answers, for benchmarking)level- summary of all levels
Start Phoenix before running agent comparisons so traces are captured.
ask [1]> tracing
If you're on a remote machine (Brev, GCP), forward port 6006 first from a separate terminal on your laptop:
ssh -L 6006:localhost:6006 <your-ssh-host>Then open http://localhost:6006 and keep it open in a tab. Your runs from steps 1 and 2 are already there. Phoenix is optional. Agents work fine without it, you just won't see the traces.
Pick a dev question so you have an expected answer to judge against. level dev 1, 1 is a good choice because it needs multiple tool calls and computation, which is where architecture differences show up.
You already have a trace for the default agent (ultrafast) from step 2. Now run the same question on the other two:
ask [1]> switch single
ask> level dev 1, 1
ask [1]> switch multi
ask> level dev 1, 1
After each run, refresh Phoenix. Click into a trace to see the full agent loop: system prompt, tool calls, intermediate results, final answer. Compare latencies per span to see where time is spent.
| Agent | What to look for |
|---|---|
| Ultrafast (default) | Your baseline. How does the system prompt classify the question before any tool call? |
| Single | No routing. Does it call tools that Ultrafast would have skipped? More or fewer LLM round-trips? |
| Multi | Extra LLM call for the orchestrator. Is the routing accurate? Is the specialist's focused context worth the latency? |
Bonus: try a simple factual question like "What is the capital of France?" (zero tools needed). All agents should answer instantly. If Multi is noticeably slower, that's the cost of orchestrator overhead on a question that didn't need routing.
The takeaway: the best architecture depends on the task. Flat agents are faster on simple questions; routing pays off on complex multi-tool questions. Traces make this measurable.
ask [1]> benchmark
Pick an agent and run 20 scored questions (~15 min). Your answers are submitted and your team appears on the public leaderboard within seconds. You'll be prompted for your org and team name. The leaderboard entry follows the format NAT-<org>-<team>-<agent> (e.g., NAT-UCB-TeamAlpha-Ultrafast).
While the benchmark runs, go back to Phoenix and dig into the traces from step 4. This score is your baseline to beat in step 6.
The built-in agents are baselines. Open their YAML configs, read the system prompts, look at traces to see where they fail, then improve on them.
mkdir my-agent
cp ultrafast-agent/gaia_agent_ultrafast.yml my-agent/config.yml
# edit config.yml: refine the system prompt, add/remove tools, change the agent typeLoad it in ./ask:
ask> switch my-agent/config.yml
What to optimize (change one variable at a time, measure before and after):
- Cut unnecessary tool calls. Check traces. Is the agent calling
internet_searchwhen the answer is in the question? Add explicit guidance to the system prompt about when to use each tool. - Remove unused tools. Fewer tools = shorter tool schema in the prompt = fewer input tokens per LLM call = faster inference. Check traces to see which tools actually get called.
- Add prompt-driven routing. The ultrafast agent classifies questions into TYPE A/B/C/D categories in the system prompt before making any tool call. Read its config to see how.
- Try a different agent type.
tool_calling_agentuses the LLM's native function-calling format.react_agentuses ReAct-style Thought/Action/Observation prompting. These need different system prompts, so don't just swap the_typefield. See NAT examples for working configs. - Tune
temperatureandseed. Lower temperature (0.0-0.1) makes tool selection more deterministic. Settingseedimproves reproducibility (local vLLM only). Run the same question 3 times to measure variance. - Tighten answer formatting. GAIA exact-match scoring is strict. If the agent gets the right answer but wraps it in extra text, add formatting rules to the system prompt.
- Reduce latency. Total time = (LLM calls x per-call latency) + tool time. Target fewer LLM calls and shorter prompts.
On swapping models: each LLM has its own max_tokens, temperature range, chat template, and tool-calling format. If you change model in your YAML, adjust these parameters to match. The built-in configs are tuned for MiniMax M2.5 (vLLM) and Qwen3.5 35B-A3B (Ollama). See NAT examples for other models.
The iteration loop: edit YAML, run a level dev question, check traces, repeat. When you're confident, run benchmark custom to submit your score.
Compete on three fronts:
- Accuracy (leaderboard score). The built-ins score 85-90%. Can you beat them?
- Speed (benchmark time). Check
gaia_summary.jsonfor per-question timing. A faster agent at the same accuracy is a better agent. - Design (trace quality). Open your traces and a built-in's side by side. Fewer tool calls, cleaner routing, shorter prompts. Be ready to explain what you changed and why.
The GPU agents (single, multi, ultrafast) share the full tool set: internet_search, wiki_search, read_file, fetch_url, python_executor, describe_image, describe_image_alt, transcribe_audio, get_youtube_transcript, solve_chess, current_datetime. The Ollama-served agent has most of these but skips describe_image_alt to keep context smaller.
| Agent | Config | Architecture | LLM | Key difference |
|---|---|---|---|---|
| Single | single-agent/gaia_agent.yml |
Flat tool_calling_agent with direct access to all tools |
MiniMax M2.5 456B MoE (vLLM serving) | Simplest design. One LLM call per tool step, no routing overhead. |
| Multi | multi-agent/gaia_agent_multi.yml |
Orchestrator dispatches to 3 specialist sub-agents (web, file, multimedia) | MiniMax M2.5 456B MoE (vLLM serving) | Extra LLM call for routing, but specialists get focused prompts and tools. |
| Ultrafast | ultrafast-agent/gaia_agent_ultrafast.yml |
Flat agent with TYPE A/B/C/D routing baked into the system prompt | MiniMax M2.5 456B MoE (vLLM serving) | Same flat architecture as Single, but prompt-driven routing skips the orchestrator call. |
| Ollama Ultrafast | ultrafast-ollama-agent/gaia_agent_ultrafast_ollama.yml |
Same design as Ultrafast, running locally via Ollama | Qwen3.5 35B-A3B (Ollama serving) | No GPU needed, no API keys for inference. Default is a MoE model (35B total / 3B active), needs 32+ GB RAM. |
Each agent is defined entirely by its YAML config. NAT supports additional architectures (react_agent, router_agent, sequential_executor, etc.). See step 6 for how to experiment.
Ollama notes: Qwen3.5 35B-A3B needs ~24 GB disk and ~32 GB RAM. It is a MoE model (35B total / 3B active) from the latest Qwen 3.5 generation, with fast inference and strong tool-calling. Vision and audio tools (describe_image, transcribe_audio) still work because they call external APIs, not the local model. They do require an NGC_API_KEY. For best GAIA accuracy, use the GPU agents.
The repo includes two sets of GAIA questions:
- Test set (
gaia_questions.json, 301 questions) - no expected answers. Uselevelcommands to browse and run these. Thebenchmarkcommand scores against the HuggingFace leaderboard. - Dev set (
gaia_dev_questions.json, 165 questions) - with expected answers. Uselevel devcommands. These are for tuning your agent: run a question, see if the answer matches, check the trace, iterate.
Use the dev set to improve your agent. Use the test set (via benchmark) to measure your final score.
./ask is multi-turn: the agent remembers your conversation and can answer follow-ups. The prompt shows the turn count (e.g., ask [3]>). Memory is kept for up to 20 turns, then oldest turns are trimmed.
Three things clear memory:
clear- manually reset without restartingswitch- changing agents always clears memorylevel/level dev- GAIA questions are sent standalone (no prior context) so the agent can't be confused by earlier turns. After the answer, the Q&A is seeded into memory for follow-ups.
If a question fails (timeout, API error), the failed message is removed from memory so it doesn't affect future turns.
Type help in ./ask for the full list. Key commands:
| Command | What it does |
|---|---|
level |
Show test questions (no answers) |
level <L>, <N> |
Run test question N from Level L |
level dev |
Show dev questions (with expected answers) |
level dev <L>, <N> |
Run dev question with answer checking |
benchmark [agent] |
Run 20-question scored leaderboard |
switch [agent] |
Change agent or load a custom config. Clears memory. |
clear |
Reset conversation memory |
info |
Current agent, model, and tools |
status |
Service and API key health |
tracing |
Start Phoenix and open traces in browser |
verbose on/off |
Show/hide full model reasoning |
help |
Full command reference |
quit |
Exit (Ctrl+C also works) |
This path gives you all three GPU-powered agents, each using MiniMax M2.5 456B MoE served by vLLM:
- Single: flat
tool_calling_agentwith direct access to all tools. Simplest design, no routing overhead. - Multi: orchestrator dispatches to three specialist sub-agents (web, file, multimedia). Extra LLM call for routing, but specialists get focused prompts.
- Ultrafast (default): flat agent with TYPE A/B/C/D routing baked into the system prompt. Same architecture as Single, but prompt-driven classification before each tool call.
The Ollama agent (Path B) is for users without GPUs. If you have a GPU instance, use these three agents.
What you need:
- Linux with 8 GPUs, ~640 GB VRAM total (e.g., 8x H100 80GB, 8x A100 80GB). Tested on GCP
a3-highgpu-8gand Brev8xH100. - 300 GB free disk for MiniMax M2.5 model weights (~220 GB).
- NVIDIA drivers installed (
nvidia-smishould show all 8 GPUs). - Python 3.10+ and
tmux(pre-installed on most GPU cloud images). - 3 API keys (free tiers work). Sign up before running setup so you have them ready: Tavily, NVIDIA Build, HuggingFace. See API Keys below.
Steps:
git clone https://github.com/mnajafian-nv/nat-agent-lab.git nat-agent-lab
cd nat-agent-lab
bash setup.sh # ~20 min; prompts for API keys, downloads model
bash gaia_tools/start_services.sh # ~5-10 min (vLLM loads model into GPU memory)
./ask # verify status line shows all OKYou should see Agent: ultrafast | vLLM: OK | NAT: OK | Phoenix: OK. All agents are available via switch.
This path gives you a single agent: Ollama Ultrafast, which uses the same prompt-driven design as Ultrafast but runs Qwen3.5 35B-A3B locally via Ollama instead of MiniMax M2.5 on GPU. Only the ollama agent is available; the three GPU agents (single, multi, ultrafast) require Path A.
What you need:
- macOS (Apple Silicon M1/M2/M3/M4) or Linux. No GPU required.
- 32 GB RAM required (the default 35B-A3B model uses ~24 GB).
- 2 API keys required: Tavily (search) and HuggingFace (dataset). NVIDIA Build is optional (only needed for vision and audio tools).
- No Ollama pre-install needed.
setup.shinstalls it automatically.
Steps:
git clone https://github.com/mnajafian-nv/nat-agent-lab.git nat-agent-lab
cd nat-agent-lab
bash setup.sh # auto-detects no GPU; installs Ollama, pulls model, prompts for keys
./ask # start chattingThen at the prompt:
switch ollama
setup.sh auto-detects that you have no GPU and handles everything: installs Ollama if needed, pulls qwen3.5:35b-a3b (requires 32+ GB RAM), sets up the Python environment, and prompts for API keys.
You should see Agent: ollama | vLLM: off (not needed) | NAT: OK. The agent runs locally. Only Tavily (web search) and HuggingFace (dataset) need API keys.
Trade-offs: Qwen3.5 35B-A3B is smaller than MiniMax M2.5 456B (Path A), so accuracy on harder questions will be lower. However, the MoE architecture gives fast inference with strong tool-calling, partially closing the gap. Path B is great for experimenting without a GPU; use Path A for best accuracy.
setup.sh prompts for all three keys interactively and saves them to .env. Sign up before running setup so you have the keys ready. Path B (Ollama) only needs Tavily and HuggingFace.
| Key | Sign up | Used for |
|---|---|---|
| Tavily | tavily.com | Internet search tool |
| NVIDIA Build | build.nvidia.com | Vision and audio tools (Path A; optional for Path B) |
| HuggingFace | huggingface.co/settings/tokens | GAIA dataset download and leaderboard submission |
Tavily (TAVILY_API_KEY):
- Go to tavily.com and sign in with Google (or use your university/org account).
- Your API key is on the dashboard after signing in. Copy it.
NVIDIA Build (NGC_API_KEY):
- Go to build.nvidia.com and click Login (top right).
- Create an NVIDIA account if you don't have one (verify by email and phone).
- Once logged in, go to build.nvidia.com/settings/api-keys and click Generate API Key. Copy it.
If you get stuck on phone verification, email help@build.nvidia.com with your registered email and a screenshot of the error.
HuggingFace (HF_TOKEN):
- Go to huggingface.co and sign in (or create a free account).
- Go to Settings > Access Tokens.
- Create a token with Read access. Copy it.
Coming back later? Just run ./ask. vLLM and Phoenix run in background tmux sessions that survive SSH disconnects.
Each benchmark run saves results locally and submits to the leaderboard. Files go to <agent>/runs/runN_<config>/ (auto-numbered):
| File | Contents |
|---|---|
gaia_summary.json |
Score, correct count, time per question |
gaia_results.json |
Every question with your agent's answers |
benchmark.log |
Full terminal output |
nat.log |
NAT server log (for debugging tool call failures) |
config.yml |
Snapshot of the YAML config used |
cat ultrafast-agent/runs/latest/gaia_summary.json # latest score
bash gaia_tools/gaia_run.sh --history # all past runsAfter a benchmark run your answers are automatically submitted and your team appears on the student leaderboard within seconds. Search for your team name (NAT-<org>-<team>-<agent>) to see your score.
The leaderboard shows aggregate scores only. To see which individual questions you got right or wrong, open the local results file:
cat <agent>/runs/latest/gaia_results.jsonEach entry has the question, your submitted answer, the expected answer, and whether it was marked correct.
You can also run benchmarks from the command line:
bash gaia_tools/gaia_run.sh --single # single agent
bash gaia_tools/gaia_run.sh --ultrafast # ultrafast agent
bash gaia_tools/gaia_run.sh -c my-agent/config.yml # your custom configvLLM won't start or crashes
- Check GPU memory:
nvidia-smi - Kill stale processes:
bash gaia_tools/start_services.sh --stop - Check the model is downloaded:
ls .cache/huggingface/hub/models--MiniMaxAI--MiniMax-M2.5/
NAT fails to start
- Check port 8000:
curl localhost:8000/health - Check the NAT log in the run directory or
/tmp/nat_serve/
Phoenix not accessible
- Check it's running:
curl localhost:6006 - If remote, forward the port:
ssh -L 6006:localhost:6006 <your-ssh-host> - Phoenix is optional; agents work without it
Disk space
- The model needs ~300 GB. Clone to a partition with enough room (
/ephemeral/,/data/). - Check space:
df -h .
vLLM shows DOWN after reconnecting
- It runs in tmux. If the machine rebooted:
bash gaia_tools/start_services.sh - Check if it's still loading:
tmux attach -t vllm(Ctrl+B, D to detach)
No GPU / macOS
- Use the Ollama-served agent instead. See Path B setup.
.
├── setup.sh # One-time setup (GPU path)
├── ask # Launch script (activates venv, starts chat)
├── lab/
│ └── lab-guide.md # Full guided lab walkthrough
├── gaia_questions.json # GAIA test questions (no answers)
├── gaia_dev_questions.json # GAIA dev questions (with answers, for tuning)
├── gaia_files/ # Attached files for GAIA questions
├── single-agent/
│ └── gaia_agent.yml # Single agent config
├── multi-agent/
│ └── gaia_agent_multi.yml # Multi-agent orchestrator config
├── ultrafast-agent/
│ └── gaia_agent_ultrafast.yml # Ultrafast agent with prompt-driven routing
├── ultrafast-ollama-agent/
│ └── gaia_agent_ultrafast_ollama.yml # Ollama-served agent (various models, no GPU needed)
└── gaia_tools/
├── ask.py # Interactive chat engine
├── gaia_run.sh # Benchmark runner
├── gaia_run_all.sh # Run all 3 GPU agents sequentially
├── start_services.sh # Start vLLM + Phoenix in tmux
├── gaia_submit.py # Benchmark scoring and submission
├── prep_gaia_data.py # Download GAIA data from HuggingFace
├── tests/ # Test suite (run: python3 -m pytest gaia_tools/tests/)
└── src/gaia_tools/
└── register.py # Custom tool definitions for NAT
- GAIA paper (Mialon et al., 2023)
- Student leaderboard - 20 Level-1 questions, scored
- Official GAIA leaderboard - 300-question test set, answers hidden
- GAIA dataset - terms of use, submission instructions
- NAT docs and source
- Anthropic prompt engineering guide - useful for system prompt design
Tested with: NAT 1.5.0, vLLM 0.18.0, Python 3.12, MiniMax M2.5 (vLLM), Qwen3.5 35B-A3B (Ollama). Hardware: 8x H100 (Brev/GCP), macOS Apple Silicon.