🧪 Agent Lab

A controlled multi-agent pipeline that turns research briefs into reviewed, tested deliverables.

Six specialized agents. Bounded QA loop. Human-approved code execution. No runaway costs.

How it works

  📝 Brief
      │
      ▼
  ┌──────────────┐
  │ Orchestrator │  decomposes into sub-goals
  └──────┬───────┘
         ▼
  ┌──────────────┐
  │  Researcher  │  gathers & synthesizes info
  └──────┬───────┘
         ▼
  ┌──────────────┐
  │  Architect   │  designs the solution
  └──────┬───────┘
         ▼
  ┌──────────────┐
  │   Worker     │  produces the deliverable
  └──────┬───────┘
         ▼
  ┌──────────────┐       ┌──────────┐
  │   Critic     │──────▶│  Worker  │  (up to 3 rounds)
  └──────┬───────┘       └──────────┘
         ▼
  ┌──────────────┐
  │   Sandbox    │  isolated execution · human-gated
  └──────┬───────┘
         ▼
  ✅ Final output

Each stage produces structured artifacts. The QA loop is bounded: if the Critic has still not approved after 3 rounds, the run ends with status needs_human_review instead of looping indefinitely.
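The bounded loop can be sketched roughly as follows. This is an illustration under assumptions, not the code in pipeline.py: `run_qa_loop`, the `critic` and `worker` callables, and the `approved` status string are hypothetical names. Only the 3-round cap and the `needs_human_review` outcome come from the description above.

```python
from typing import Callable, Tuple

MAX_QA_ROUNDS = 3  # cap stated in this README


def run_qa_loop(
    draft: str,
    critic: Callable[[str], Tuple[bool, str]],
    worker: Callable[[str, str], str],
    max_rounds: int = MAX_QA_ROUNDS,
) -> Tuple[str, str]:
    """Return (status, deliverable) after at most max_rounds reviews.

    critic(draft) -> (approved, feedback) and worker(draft, feedback) -> new draft
    are hypothetical stand-ins for the real agents.
    """
    for round_no in range(1, max_rounds + 1):
        approved, feedback = critic(draft)
        if approved:
            return "approved", draft
        # Revise only if another review round is still available.
        if round_no < max_rounds:
            draft = worker(draft, feedback)
    # The cap was reached without approval: hand over to a human.
    return "needs_human_review", draft
```

Because the loop is a `for` over a fixed range rather than a `while not approved`, the number of Critic calls is bounded no matter what the agents return.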

Quick start

Prerequisites

Python 3.12+
A DeepSeek API key

Setup

git clone https://github.com/akaradje/agent_lab.git
cd agent_lab

python -m venv .venv
source .venv/bin/activate      # macOS / Linux
.venv\Scripts\activate         # Windows

pip install -r requirements.txt

Set your API key:

# macOS / Linux
export DEEPSEEK_API_KEY=sk-...

# Windows (Command Prompt)
set DEEPSEEK_API_KEY=sk-...

# Windows (PowerShell)
$env:DEEPSEEK_API_KEY = "sk-..."
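On the Python side, reading the key from the environment and failing fast when it is missing might look like the sketch below. `load_api_key` is a hypothetical helper, not necessarily what config.py does. Only the `DEEPSEEK_API_KEY` variable name comes from this README.

```python
import os


def load_api_key(env_var: str = "DEEPSEEK_API_KEY") -> str:
    """Read the API key from the environment; raise if it is missing or blank."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it in your shell before running."
        )
    return key
```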

Run it

python -m agent_lab.main --brief "Write a Python function that checks if a string is a palindrome. Include tests."

The pipeline will decompose, research, design, build, review (up to 3 QA rounds), then pause for your approval before executing in a sandbox.

Output: run_output.json — all artifacts + full transcript.

Options

| Flag | Description |
| --- | --- |
| `--yes` | Auto-approve all human gates (for testing / automation) |
| `--resume <file>` | Resume a saved run from its state file |

# Auto-approve (testing only)
python -m agent_lab.main --brief "..." --yes

# Resume a stopped run
python -m agent_lab.main --resume run_output.json

Safety features

These are enforced in code, not by convention:

| # | Feature | Enforced by |
| --- | --- | --- |
| 🔒 | Budget ceiling: run stops at token or cost limit | `budget.py` |
| 🔁 | Bounded QA loop: Worker↔Critic capped at 3 rounds | `pipeline.py` |
| 🛡️ | Human approval gates: code never auto-executes | `sandbox.py` |
| 🔑 | No secrets in code: API key from env var only | `config.py` |

Changes that weaken any of these are blocked by design.
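To make the approval gate concrete, here is a small sketch of approval-gated subprocess execution. It is not the contents of sandbox.py: `run_with_approval`, `ask_human`, and the 10-second timeout are assumptions for illustration. Note that a child process started this way is only lightly isolated and is not a security sandbox on its own.

```python
import subprocess
import sys
from typing import Callable, Optional


def run_with_approval(
    code: str,
    approve: Callable[[str], bool],
    timeout_s: float = 10.0,
) -> Optional[subprocess.CompletedProcess]:
    """Run code in a child Python process only if approve(code) returns True."""
    if not approve(code):
        return None  # gate refused: nothing is executed
    return subprocess.run(
        # -I is Python's isolated mode: ignores PYTHON* env vars and the user site dir.
        [sys.executable, "-I", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )


def ask_human(code: str) -> bool:
    """Interactive gate: show the code and require an explicit 'y'."""
    print(code)
    return input("Execute this code? [y/N] ").strip().lower() == "y"
```

Calling `run_with_approval(code, approve=ask_human)` mirrors the default interactive flow; passing `approve=lambda _code: True` corresponds to what `--yes` is described as doing.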

Configuration

Edit agent_lab/config.py:

| Setting | Default | What it does |
| --- | --- | --- |
| `MAX_TOTAL_TOKENS` | 250,000 | Token ceiling per run (soft warning at 80%) |
| `MAX_USD` | 1.00 | Cost ceiling per run (USD) |
| `MAX_QA_ROUNDS` | 3 | Max Worker↔Critic revision rounds |
| `PRICING` | per-model | USD per million tokens (input, output) |
| `AGENT_MODELS` | per-agent | Model + reasoning effort per stage |
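How the two ceilings and the 80% soft warning could interact is sketched below. The defaults come from the table above. The `Budget` class, the `BudgetExceeded` exception, and the choice to check after each recorded call are assumptions, not a description of budget.py.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a run reaches its token or cost ceiling."""


@dataclass
class Budget:
    max_total_tokens: int = 250_000  # MAX_TOTAL_TOKENS default
    max_usd: float = 1.00            # MAX_USD default
    warn_fraction: float = 0.80      # soft warning threshold
    tokens_used: int = 0
    usd_spent: float = 0.0
    warned: bool = False

    def record(self, tokens: int, usd: float) -> None:
        """Add one call's usage, warn once at 80% of tokens, stop at either ceiling."""
        self.tokens_used += tokens
        self.usd_spent += usd
        if self.tokens_used >= self.max_total_tokens or self.usd_spent >= self.max_usd:
            raise BudgetExceeded(
                f"used {self.tokens_used} tokens, ${self.usd_spent:.4f}"
            )
        if not self.warned and self.tokens_used >= self.warn_fraction * self.max_total_tokens:
            self.warned = True
            print(f"warning: {self.tokens_used}/{self.max_total_tokens} tokens used")
```

Checking after each call means the final call can overshoot a ceiling slightly. A stricter variant would estimate usage before the call and refuse to start it.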

Model routing

| Agent | Model | Reasoning | Why |
| --- | --- | --- | --- |
| Orchestrator | `deepseek-v4-flash` | low | Light planning |
| Researcher | `deepseek-v4-flash` | high | Info synthesis |
| Architect | `deepseek-v4-pro` | high | Hard reasoning: design |
| Worker | `deepseek-v4-flash` | high | Production |
| Critic | `deepseek-v4-pro` | high | Hard reasoning: adversarial review |
| Sandbox | `deepseek-v4-flash` | low | Summarization |

DeepSeek recommends defaulting to Flash and escalating to Pro only where it measurably helps. See specs/DEEPSEEK_REFERENCE.md for pricing details.
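The routing table and the per-million-token pricing combine into a per-call cost roughly as in this sketch. The dictionary shapes and `call_cost_usd` are assumptions about how config.py might be laid out, and the prices are placeholders chosen for easy arithmetic, not real DeepSeek prices.

```python
# Routing copied from the table above: agent -> (model, reasoning effort).
AGENT_MODELS = {
    "orchestrator": ("deepseek-v4-flash", "low"),
    "researcher": ("deepseek-v4-flash", "high"),
    "architect": ("deepseek-v4-pro", "high"),
    "worker": ("deepseek-v4-flash", "high"),
    "critic": ("deepseek-v4-pro", "high"),
    "sandbox": ("deepseek-v4-flash", "low"),
}

# PLACEHOLDER prices in USD per million tokens as (input, output).
# See specs/DEEPSEEK_REFERENCE.md for the real numbers.
PRICING = {
    "deepseek-v4-flash": (1.0, 2.0),
    "deepseek-v4-pro": (4.0, 8.0),
}


def call_cost_usd(agent: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call for the given agent, using its routed model's prices."""
    model, _effort = AGENT_MODELS[agent]
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```

Keeping routing and pricing as plain data means moving an agent from Flash to Pro is a one-line change, and the cost tracking follows automatically.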

Project structure

agent_lab/                  # Source
├── main.py                 #   CLI entry point
├── config.py               #   Settings, pricing, model routing
├── budget.py               #   Token/cost tracking + ceiling enforcement
├── llm_client.py           #   DeepSeek API wrapper (OpenAI SDK)
├── agents.py               #   6 agent classes
├── pipeline.py             #   Orchestration, QA loop, gates, resume
├── sandbox.py              #   Approval-gated subprocess execution
└── state.py                #   Run state persistence (JSON)

specs/                      # Design docs
├── ARCHITECTURE.md         #   Module breakdown + design decisions
├── BUILD_PLAN.md           #   Phased build order with acceptance checks
├── DEEPSEEK_REFERENCE.md   #   Provider, models, pricing
└── prompts/                #   Agent system prompts (versioned)

tests/                      # pytest suite

Running tests

# Unit tests (no API calls — fast, free)
pytest

# End-to-end tests (calls DeepSeek API — costs money)
AGENT_LAB_LIVE_TEST=1 pytest

# Lint
ruff check .

Honest scope

Agent Lab is a pipeline orchestration tool, not an autonomous research lab. It runs LLM calls in a controlled sequence with human oversight. It does not train models, self-modify, or operate unattended.

The value is in reliability: bounded cost, reviewable output, and explicit checkpoints.

License

MIT — use it however you want.

Architecture · Build Plan · DeepSeek Reference · Agent Prompts
