A controlled multi-agent pipeline that turns research briefs into reviewed, tested deliverables.
Six specialized agents. Bounded QA loop. Human-approved code execution. No runaway costs.
📝 Brief
│
▼
┌──────────────┐
│ Orchestrator │ decomposes into sub-goals
└──────┬───────┘
▼
┌──────────────┐
│ Researcher │ gathers & synthesizes info
└──────┬───────┘
▼
┌──────────────┐
│ Architect │ designs the solution
└──────┬───────┘
▼
┌──────────────┐
│ Worker │ produces the deliverable
└──────┬───────┘
▼
┌──────────────┐ ┌──────────┐
│ Critic │──────▶│ Worker │ (up to 3 rounds)
└──────┬───────┘ └──────────┘
▼
┌──────────────┐
│ Sandbox │ isolated execution · human-gated
└──────┬───────┘
▼
✅ Final output
Each stage produces structured artifacts. The QA loop is bounded — if the Critic never approves after 3 rounds, the run ends with needs_human_review. No infinite loops.
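The bounded loop can be sketched roughly as follows. This is a minimal illustration, not the code in pipeline.py: `run_qa_loop`, `Review`, the `worker` and `critic` callables, and the `approved` status string are hypothetical names. Only the 3-round cap and the `needs_human_review` status come from the description above.

```python
from dataclasses import dataclass
from typing import Callable

MAX_QA_ROUNDS = 3  # documented default; the name mirrors the config setting


@dataclass
class Review:
    """Hypothetical Critic verdict: approval flag plus feedback for the Worker."""
    approved: bool
    feedback: str = ""


def run_qa_loop(
    draft: str,
    worker: Callable[[str, str], str],  # (draft, feedback) -> revised draft
    critic: Callable[[str], Review],    # draft -> review
    max_rounds: int = MAX_QA_ROUNDS,
) -> tuple[str, str]:
    """Return (status, final_draft). The loop body runs at most max_rounds times."""
    for _ in range(max_rounds):
        review = critic(draft)
        if review.approved:
            return "approved", draft
        # Not approved: ask the Worker for a revision and try again.
        draft = worker(draft, review.feedback)
    # Cap reached without approval: hand the last revision to a human.
    return "needs_human_review", draft
```

Because the loop is a plain `for` over a fixed range, termination does not depend on the Critic ever approving.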
- Python 3.12+
- A DeepSeek API key
git clone https://github.com/akaradje/agent_lab.git
cd agent_lab
python -m venv .venv
source .venv/bin/activate # macOS / Linux
.venv\Scripts\activate # Windows
pip install -r requirements.txt

Set your API key:
# macOS / Linux
export DEEPSEEK_API_KEY=sk-...
# Windows
set DEEPSEEK_API_KEY=sk-...

python -m agent_lab.main --brief "Write a Python function that checks if a string is a palindrome. Include tests."

The pipeline will decompose, research, design, build, review (up to 3 QA rounds), then pause for your approval before executing in a sandbox.
Output: run_output.json — all artifacts + full transcript.
| Flag | Description |
|---|---|
| `--yes` | Auto-approve all human gates (for testing / automation) |
| `--resume <file>` | Resume a saved run from its state file |
# Auto-approve (testing only)
python -m agent_lab.main --brief "..." --yes
# Resume a stopped run
python -m agent_lab.main --resume run_output.json

These are enforced in code, not conventions:
| # | Feature | Enforced by |
|---|---|---|
| 🔒 | Budget ceiling — run stops at token or cost limit | budget.py |
| 🔁 | Bounded QA loop — Worker↔Critic capped at 3 rounds | pipeline.py |
| 🛡️ | Human approval gates — code never auto-executes | sandbox.py |
| 🔑 | No secrets in code — API key from env var only | config.py |
Changes that weaken any of these are blocked by design.
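As an illustration of the approval gate, here is a minimal sketch of approval-gated subprocess execution. It is not the contents of sandbox.py: `run_with_approval`, its parameters, and the returned dictionary keys are all hypothetical, and a plain subprocess is only a weak form of isolation.

```python
import subprocess
import sys
from typing import Callable


def run_with_approval(
    code: str,
    ask: Callable[[str], str] = input,  # injectable so the prompt can be tested
    auto_approve: bool = False,         # plays the role of the --yes flag
    timeout_s: float = 10.0,
) -> dict:
    """Run `code` in a separate Python process only after explicit approval."""
    if not auto_approve:
        answer = ask("Execute this code in the sandbox? [y/N] ")
        if answer.strip().lower() not in {"y", "yes"}:
            # Anything other than an explicit yes means nothing is executed.
            return {"status": "rejected", "stdout": "", "stderr": ""}
    # -I starts the interpreter in isolated mode (ignores user site-packages
    # and PYTHON* environment variables). subprocess.run raises
    # TimeoutExpired if the child outlives timeout_s.
    proc = subprocess.run(
        [sys.executable, "-I", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return {
        "status": "executed",
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

The default answer is "no": an empty reply or anything other than `y`/`yes` leaves the code unexecuted.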
Edit agent_lab/config.py:
| Setting | Default | What it does |
|---|---|---|
| `MAX_TOTAL_TOKENS` | 250,000 | Token ceiling per run (soft warning at 80%) |
| `MAX_USD` | 1.00 | Cost ceiling per run (USD) |
| `MAX_QA_ROUNDS` | 3 | Max Worker↔Critic revision rounds |
| `PRICING` | per-model | USD per million tokens (input, output) |
| `AGENT_MODELS` | per-agent | Model + reasoning effort per stage |
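To show how these settings could interact, here is a minimal sketch of a budget tracker. It is not the code in budget.py: the `Budget` class, the `BudgetExceeded` exception, and the `example-model` price entry are hypothetical, and the prices are placeholders rather than real DeepSeek rates. The sketch also assumes usage is checked after each call is recorded.

```python
from dataclasses import dataclass

MAX_TOTAL_TOKENS = 250_000   # documented default
MAX_USD = 1.00               # documented default
SOFT_WARNING_FRACTION = 0.8  # soft warning at 80% of the token ceiling

# Placeholder prices for illustration only: USD per million tokens (input, output).
PRICING = {"example-model": (0.50, 1.00)}


class BudgetExceeded(RuntimeError):
    """Raised when a run crosses the token or cost ceiling."""


@dataclass
class Budget:
    tokens: int = 0
    usd: float = 0.0

    def record(self, model: str, input_tokens: int, output_tokens: int) -> bool:
        """Add one call's usage. Return True once the soft warning level is reached."""
        in_price, out_price = PRICING[model]
        self.tokens += input_tokens + output_tokens
        self.usd += (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        if self.tokens > MAX_TOTAL_TOKENS or self.usd > MAX_USD:
            raise BudgetExceeded(f"tokens={self.tokens}, usd={self.usd:.4f}")
        return self.tokens >= SOFT_WARNING_FRACTION * MAX_TOTAL_TOKENS
```

A caller would record usage after every LLM response and stop the run when `BudgetExceeded` is raised.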
| Agent | Model | Reasoning | Why |
|---|---|---|---|
| Orchestrator | `deepseek-v4-flash` | low | Light planning |
| Researcher | `deepseek-v4-flash` | high | Info synthesis |
| Architect | `deepseek-v4-pro` | high | Hard reasoning: design |
| Worker | `deepseek-v4-flash` | high | Production |
| Critic | `deepseek-v4-pro` | high | Hard reasoning: adversarial review |
| Sandbox | `deepseek-v4-flash` | low | Summarization |
DeepSeek recommends defaulting to Flash and escalating to Pro only where it measurably helps. See specs/DEEPSEEK_REFERENCE.md for pricing details.
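The routing table could be expressed in config.py along these lines. The actual shape of `AGENT_MODELS` is not shown here, so the dictionary layout, the lowercase keys, and the `model_for` helper are assumptions. The model names and reasoning levels are taken from the table above.

```python
# Assumed layout: one entry per agent, keyed by lowercase agent name.
AGENT_MODELS = {
    "orchestrator": {"model": "deepseek-v4-flash", "reasoning": "low"},
    "researcher": {"model": "deepseek-v4-flash", "reasoning": "high"},
    "architect": {"model": "deepseek-v4-pro", "reasoning": "high"},
    "worker": {"model": "deepseek-v4-flash", "reasoning": "high"},
    "critic": {"model": "deepseek-v4-pro", "reasoning": "high"},
    "sandbox": {"model": "deepseek-v4-flash", "reasoning": "low"},
}


def model_for(agent: str) -> tuple[str, str]:
    """Hypothetical helper: look up (model, reasoning effort) for an agent."""
    cfg = AGENT_MODELS[agent.lower()]
    return cfg["model"], cfg["reasoning"]
```

Keeping the routing in one mapping means moving a stage from Flash to Pro is a one-line change.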
agent_lab/ # Source
├── main.py # CLI entry point
├── config.py # Settings, pricing, model routing
├── budget.py # Token/cost tracking + ceiling enforcement
├── llm_client.py # DeepSeek API wrapper (OpenAI SDK)
├── agents.py # 6 agent classes
├── pipeline.py # Orchestration, QA loop, gates, resume
├── sandbox.py # Approval-gated subprocess execution
└── state.py # Run state persistence (JSON)
specs/ # Design docs
├── ARCHITECTURE.md # Module breakdown + design decisions
├── BUILD_PLAN.md # Phased build order with acceptance checks
├── DEEPSEEK_REFERENCE.md # Provider, models, pricing
└── prompts/ # Agent system prompts (versioned)
tests/ # pytest suite
# Unit tests (no API calls — fast, free)
pytest
# End-to-end tests (calls DeepSeek API — costs money)
AGENT_LAB_LIVE_TEST=1 pytest
# Lint
ruff check .

Agent Lab is a pipeline orchestration tool, not an autonomous research lab. It runs LLM calls in a controlled sequence with human oversight. It does not train models, self-modify, or operate unattended.
The value is in reliability: bounded cost, reviewable output, and explicit checkpoints.
MIT — use it however you want.