SWE-bench Verified (500 tasks)
│
▼ sample 50, split 30/20
config/splits.json
│
▼ orchestrator builds queue + sets up sandbox
harness/orchestrator.py
│
▼ for each (task, arm):
harness/runner.py
├── git clone --bare → git worktree add (host)
├── docker run --rm (sandbox)
│ ├── bind mount worktree → /workspace
│ ├── floop volume → /floop-store (rw train, ro eval)
│ ├── agents/mini_swe_cli.py (entrypoint, JSON stdin/stdout)
│ ├── agent loop (litellm API calls + bash, sandboxed)
│ └── agent calls floop learn/active organically
└── git diff on bind mount (host) → capture patch
│
▼ save
harness/db.py → SQLite (WAL mode)
results/predictions/*.jsonl
results/transcripts/
│
▼ evaluate
harness/swebench_eval.py → SWE-bench Docker
│
▼ import results
harness/db.py → resolved / unresolved
│
▼ analyze
analysis/analyze.py → stats
analysis/charts.py → PNG/SVG
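The "sample 50, split 30/20" step at the top of the pipeline could be produced by a sketch like this (the field names `train`/`eval` and the seeding are assumptions; the real schema lives in `config/splits.json`):

```python
import random

def make_splits(instance_ids, n_sample=50, n_train=30, seed=0):
    """Sample n_sample task IDs and partition them into train/eval splits.

    A fixed seed keeps the split reproducible across orchestrator runs.
    """
    rng = random.Random(seed)
    sampled = rng.sample(sorted(instance_ids), n_sample)
    return {"train": sampled[:n_train], "eval": sampled[n_train:]}

if __name__ == "__main__":
    # Stand-in IDs for the 500 SWE-bench Verified tasks
    ids = [f"repo__task-{i}" for i in range(500)]
    splits = make_splits(ids)
    print(len(splits["train"]), len(splits["eval"]))  # 30 20
```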
| File | Role |
|---|---|
| `config.py` | Loads `arms.toml` and `splits.json`. Agent registry (lazy-loaded). |
| `db.py` | SQLite with WAL mode. Context-managed connections. PK is `(instance_id, arm)`. |
| `runner.py` | Repo checkout (bare clone + worktree), agent dispatch (sandboxed or direct), diff capture, cleanup. `SandboxConfig` dataclass. |
| `orchestrator.py` | Click CLI. Phase dispatch, queue building, resume, budget guard. Docker sandbox lifecycle (image build, volume create/init, leakage audit). |
| `parallel.py` | `ProcessPoolExecutor` wrapper. Cost guard per task submission. Threads `SandboxConfig`. |
| `swebench_eval.py` | Subprocess call to `swebench.harness.run_evaluation`. Imports results. |
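The `SandboxConfig` dataclass that `runner.py` defines and `parallel.py` threads through might look roughly like this (field names and defaults are illustrative, taken from the limits described in the sandbox section):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SandboxConfig:
    """Sketch of the sandbox settings passed from orchestrator to runner.

    Field names are assumptions; see harness/runner.py for the real ones.
    """
    image: str = "swebench-agent"
    memory: str = "2g"
    cpus: float = 2.0
    pids_limit: int = 256
    floop_volume: str = "floop-train"
    floop_readonly: bool = False  # True during the eval phase

    def floop_mount(self) -> str:
        # Docker -v syntax: append :ro when the volume must be read-only
        mode = ":ro" if self.floop_readonly else ""
        return f"{self.floop_volume}:/floop-store{mode}"
```

A frozen dataclass is convenient here because the config is pickled across process boundaries and should never be mutated mid-run.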
| File | Role |
|---|---|
| `base.py` | `RunResult` dataclass and `Agent` protocol (`name` + `run()`). |
| `mini_swe.py` | Litellm agent loop. Extracts bash blocks, executes, feeds output back. Stops on SUBMIT or step limit. |
| `mini_swe_cli.py` | Docker entrypoint. Reads JSON from stdin, runs `MiniSweAgent`, prints `RunResult` JSON to stdout. |
| `claude_code.py` | Claude Code CLI wrapper (`claude -p`). Uses `--allowedTools` to control floop access. |
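The `base.py` contract can be sketched as a `Protocol` plus a dataclass; the exact fields are assumptions beyond what the schema and table above name (`model_patch`, `status`, token counts, cost):

```python
from dataclasses import dataclass
from typing import Optional, Protocol

@dataclass
class RunResult:
    """Per-run payload every agent returns (field names mirror the runs table)."""
    model_patch: Optional[str]
    status: str  # "completed" | "timeout" | "error"
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0

class Agent(Protocol):
    """Structural protocol from base.py: a name plus a run() entry point."""
    name: str
    def run(self, task: dict) -> RunResult: ...

class EchoAgent:
    """Trivial conforming implementation, used here only to show the shape."""
    name = "echo"
    def run(self, task: dict) -> RunResult:
        return RunResult(model_patch=None, status="completed")
```

Because `Agent` is a structural protocol, `mini_swe.py` and `claude_code.py` need no common base class, which keeps the lazy-loaded registry in `config.py` import-light.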
Two integration paths depending on agent:

- mini_swe: `inject.py` builds a prompt preamble with CLI cadence instructions (`floop active`, `floop learn`) and any existing behaviors. The agent uses floop organically via bash commands inside the sandbox.
- claude_code: Floop is accessed via MCP tools. `--allowedTools` includes/excludes floop tool names per arm.
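The mini_swe path's preamble construction could be sketched like this (the wording and function name are hypothetical; the real text lives in `inject.py`):

```python
def build_preamble(behaviors: list[str]) -> str:
    """Sketch of inject.py: cadence instructions plus any learned behaviors.

    The instruction wording here is illustrative, not the shipped prompt.
    """
    lines = [
        "Before starting, run `floop active` to list relevant behaviors.",
        "After solving, run `floop learn` to record anything reusable.",
    ]
    if behaviors:
        lines.append("Existing behaviors:")
        lines.extend(f"- {b}" for b in behaviors)
    return "\n".join(lines)
```

The preamble is plain text prepended to the task prompt, so the agent discovers floop through ordinary bash rather than a bespoke tool API.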
| File | Role |
|---|---|
| `analyze.py` | Resolve rates, bootstrap 95% CIs, McNemar's test, Cohen's h, gap closure. |
| `charts.py` | Grouped bar (resolve rates + CIs), cost scatter, cost per resolved. PNG + SVG. |
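The bootstrap CI in `analyze.py` is presumably a standard percentile bootstrap over per-task resolve outcomes; a minimal sketch (resample count and seed are assumptions):

```python
import random

def bootstrap_ci(outcomes: list[int], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile-bootstrap CI for a resolve rate over 0/1 task outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n  # resample n tasks with replacement
        for _ in range(n_boot)
    )
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

With only 20 eval tasks per arm, these intervals are wide, which is exactly why the CIs are drawn on the grouped bar chart rather than reporting bare rates.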
```sql
CREATE TABLE runs (
    instance_id TEXT NOT NULL,
    arm TEXT NOT NULL,
    model TEXT NOT NULL,
    floop_enabled BOOLEAN NOT NULL,
    model_patch TEXT,
    resolved BOOLEAN,
    status TEXT NOT NULL, -- "completed" | "timeout" | "error"
    duration_seconds REAL,
    input_tokens INTEGER,
    output_tokens INTEGER,
    cost_usd REAL,
    transcript_path TEXT,
    error_message TEXT,
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    PRIMARY KEY (instance_id, arm)
);
```

The orchestrator is safe to interrupt and re-run. `load_completed()` returns all `(instance_id, arm)` pairs with status `completed`, `timeout`, or `error`, and the queue skips them. `save_run()` uses `INSERT ... ON CONFLICT ... DO UPDATE` with `COALESCE` to preserve existing `resolved` values.
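The resume-safe upsert can be demonstrated with `sqlite3` (the column list is trimmed to the relevant fields; the full schema is above):

```python
import sqlite3

UPSERT = """
INSERT INTO runs (instance_id, arm, model, floop_enabled, status, resolved)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT (instance_id, arm) DO UPDATE SET
    status = excluded.status,
    -- keep a previously imported resolved value if the new row has none
    resolved = COALESCE(excluded.resolved, runs.resolved)
"""

def demo() -> int:
    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE runs (
        instance_id TEXT NOT NULL, arm TEXT NOT NULL, model TEXT NOT NULL,
        floop_enabled BOOLEAN NOT NULL, status TEXT NOT NULL, resolved BOOLEAN,
        PRIMARY KEY (instance_id, arm))""")
    con.execute(UPSERT, ("t1", "floop", "gemini", 1, "completed", 1))
    # Re-saving the same (task, arm) without an eval result keeps resolved = 1
    con.execute(UPSERT, ("t1", "floop", "gemini", 1, "completed", None))
    (resolved,) = con.execute("SELECT resolved FROM runs").fetchone()
    return resolved
```

Without the `COALESCE`, re-running the agent phase after `swebench_eval.py` has imported results would silently wipe the `resolved` column back to NULL.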
Agents run inside disposable Docker containers (one per task). This prevents agent-executed bash commands from affecting the host.
Host Container (fresh per task, --rm)
──── ─────────────────────────────────
git worktree setup
bind mount ──────────────→ /workspace (repo files)
Docker volume ───────────→ /floop-store (persistent within phase)
env vars (API keys) ────→ GEMINI_API_KEY, etc.
stdin (JSON) ────────────→ agents/mini_swe_cli.py
agent loop + bash (sandboxed)
floop learn/active via bash
←── stdout (JSON) ─────── RunResult
git diff (host, on bind mount)
Security (beebox pattern): `--cap-drop ALL`, minimal adds (`CHOWN`, `DAC_OVERRIDE`, `FOWNER`), resource limits (`--memory 2g`, `--cpus 2`, `--pids-limit 256`).
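Assembled into a `docker run` argument list (a sketch under assumptions: the image tag, entrypoint path, and `-i` for the JSON-over-stdin handshake; the capability and resource flags are the ones listed above):

```python
def docker_args(worktree: str, readonly_floop: bool = False) -> list[str]:
    """Sketch of the docker run command runner.py assembles per task."""
    floop_mode = ":ro" if readonly_floop else ""  # eval phase mounts read-only
    return [
        "docker", "run", "--rm", "-i",
        "--cap-drop", "ALL",                 # drop everything ...
        "--cap-add", "CHOWN",                # ... then add back the minimum
        "--cap-add", "DAC_OVERRIDE",
        "--cap-add", "FOWNER",
        "--memory", "2g", "--cpus", "2", "--pids-limit", "256",
        "-v", f"{worktree}:/workspace",      # bind mount the task worktree
        "-v", f"floop-train:/floop-store{floop_mode}",
        "swebench-agent",                    # assumed image tag
        "python", "/agents/mini_swe_cli.py",
    ]
```

Returning a list (rather than a shell string) keeps the call safe to pass to `subprocess.run` without quoting concerns.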
Floop volume lifecycle:
- Train: `floop-train` volume, read-write. Behaviors accumulate across tasks.
- Eval: same volume mounted read-only. Agent can query but not learn.

`make clean` removes all volumes for a fresh start.
Graceful fallback: If Docker is unavailable, the orchestrator warns and runs agents directly on the host. Use `--no-sandbox` to opt out explicitly.
Each task needs the repo at a specific commit:
- Bare clone — one per repo in `work/repos/`, fetched once.
- Worktree — one per task at the exact `base_commit`. Fast, fully isolated.
- Cleanup — worktree removed after each run, stale refs pruned.
Parallel workers can run different tasks on the same repo without conflicts.
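The checkout and cleanup steps reduce to a handful of git subprocess calls; a sketch under assumptions (function names and path layout are illustrative, the git plumbing is standard):

```python
import subprocess
from pathlib import Path

def checkout_task(repos_dir: Path, work_dir: Path, repo_url: str,
                  name: str, base_commit: str) -> Path:
    """One bare clone per repo, one detached worktree per task."""
    repos_dir.mkdir(parents=True, exist_ok=True)
    work_dir.mkdir(parents=True, exist_ok=True)
    bare = repos_dir / f"{name}.git"
    if not bare.exists():
        # Fetched once; every task on this repo shares the object store
        subprocess.run(["git", "clone", "--bare", "--quiet", repo_url, str(bare)],
                       check=True)
    worktree = work_dir / f"{name}-{base_commit[:8]}"
    # Detached checkout at the exact base commit, isolated from other workers
    subprocess.run(["git", "--git-dir", str(bare), "worktree", "add", "--detach",
                    str(worktree), base_commit], check=True)
    return worktree

def cleanup_task(bare: Path, worktree: Path) -> None:
    """Remove the worktree after the run and prune stale refs."""
    subprocess.run(["git", "--git-dir", str(bare), "worktree", "remove", "--force",
                    str(worktree)], check=True)
    subprocess.run(["git", "--git-dir", str(bare), "worktree", "prune"], check=True)
```

Since each worktree has its own checkout directory but shares the bare repo's objects, disk cost stays near one clone per repo regardless of task count.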
Orchestrator
├── Worker 1 → own worktrees → SQLite
├── Worker 2 → own worktrees → SQLite
├── Worker 3 → own worktrees → SQLite
└── Worker 4 → own worktrees → SQLite
Workers are processes (ProcessPoolExecutor). Each creates its own worktrees and DB connections. SQLite WAL mode handles concurrent writes. Cost guard is checked before submitting each task.
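The submission loop with its pre-submit cost guard could be sketched as follows (the accounting and names are illustrative; `executor_cls` is parameterized here purely so the sketch is easy to exercise, while `parallel.py` uses processes):

```python
from concurrent.futures import ProcessPoolExecutor

def run_task(task_id: str) -> float:
    """Stand-in for one (task, arm) run; returns its cost in USD."""
    return 0.05

def run_all(task_ids: list[str], budget_usd: float, workers: int = 4,
            executor_cls=ProcessPoolExecutor) -> tuple[int, float]:
    """Submit tasks across workers, checking spend before each submission."""
    spent = 0.0
    done = 0
    with executor_cls(max_workers=workers) as pool:
        pending = []
        for tid in task_ids:
            if spent >= budget_usd:  # cost guard: stop submitting once over budget
                break
            pending.append(pool.submit(run_task, tid))
            # Fold in any costs that have already come back before the next submit
            for fut in [f for f in pending if f.done()]:
                spent += fut.result()
                pending.remove(fut)
                done += 1
        for fut in pending:  # drain the remainder
            spent += fut.result()
            done += 1
    return done, spent
```

Note the guard is necessarily approximate: in-flight tasks have unknown cost, so the guard bounds submissions, not total spend, which matches checking the budget "before submitting each task".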