harness-engineering

A pattern (not a framework) for AI-agent pipelines. Validated across 6+ open-source projects in production.

License: MIT


Most AI-agent projects collapse not because the LLM is bad, but because the pipeline around the LLM is undisciplined. Retries, logs, costs, validation, and state recovery each get re-invented (badly) in every new repo, until the project hits a wall and gets rewritten.

This repo writes down the pipeline pattern that's survived six rewrites across six different problem domains (cross-platform screen capture / voice dictation / domain investing / educational PDFs / multi-voter LLM decisions / file cleanup) in a single-author setting. It is:

  • A pattern, not a library — there's nothing to pip install. Every project implements the pattern independently in its own language with its own dependencies.
  • Opinionated — there is exactly one correct answer to "where does the manifest live?", "how do I shell out to a subprocess?", and "what does the CLI surface look like?". The point of pattern documentation is to stop relitigating these.
  • Earned, not theorized — every rule below was added the day it would have prevented a bug from shipping. The lessons section quotes the specific failure that taught each one.

If you are building anything that looks like "some agent generates output, then I want to verify that output, then I want to ship the output somewhere, and the whole loop must be re-runnable" — this is for you.


TL;DR — The pattern in one diagram

┌─────────────────────────────────────────────────────┐
│  cli.py     status / doctor / run / audit           │  ← single entry point
├─────────────────────────────────────────────────────┤
│  agents/    "Generate" — LLM calls, HTTP, scrapes   │  ← does the work
│  validators/  "Verify" — pure-function gates        │  ← catches mistakes
│  pipelines/   "Orchestrate" — multi-stage flow      │  ← composes the above
├─────────────────────────────────────────────────────┤
│  manifest.json    stage status + counts + cost      │  ← state of the world
│  configs/<artifact>.yaml   per-artifact config      │  ← what to build
│  logs/pipeline_<ts>.log    everything that happened │  ← debug surface
└─────────────────────────────────────────────────────┘

Three rules of the pattern:

  1. Validators are mandatory — every artifact runs through ≥ 1 deterministic gate before it ships. A wrong-sign derivative in a math lesson, a -48 LUFS music track, a 401 from a paid API — these are caught by the validator layer, not by a human who happens to be looking.
  2. Subprocesses are always captured — capture_output=True everywhere. The agent's stdout is a structured tool result for the orchestrator, not a stream into the terminal.
  3. One CLI, four verbs — status (read the manifest), doctor (check API keys / deps), run (execute one or all stages), audit (replay validators against existing artifacts).

Everything else is a consequence of these three.


Why a pattern and not a framework

LangChain, CrewAI, LangGraph, AutoGen, Inspect — there are already excellent frameworks for AI-agent pipelines. They give you a Pipeline object, a @stage decorator, a vendored retry policy, a vendored logger.

The problem with frameworks-as-the-answer:

  • Lock-in to a programming model. Once you @stage your code, you've coupled to a specific framework's ideas about state, retries, and async. Migrating to a different framework two years later is a rewrite.
  • Dependency hell on a per-project basis. A video harness, a math harness, and a trading harness have nothing in common at the dependency level (FFmpeg vs SymPy vs CCXT). Forcing them through one framework drags in transitive deps that none of them needs.
  • Pattern recognition is the actual transferable skill. Six harnesses, six independent codebases, six independent dependency trees, the same architecture. That architecture — agents + validators + manifest, CLI-driven, observable — is what stays the same when the framework underneath doesn't.

The pattern is documented here so that:

  • Someone starting their seventh harness in a fresh language doesn't waste a week relitigating layout.
  • A code review can say "this stage doesn't emit a manifest update, that's a violation of the pattern" without arguing about it.
  • Lessons learned compound across projects instead of being trapped in any single repo's commit history.

When to use this pattern (and when not to)

Use it when your project has:

  • Multiple stages that need ordering — discover → score → buy → list, or extract → render → validate → publish.
  • Cross-stage state recovery — re-running stage 4 after a stage-3 failure should not re-run stages 1–2.
  • A mix of LLM and deterministic logic — some stages call an LLM (creative), others apply rules (verification).
  • Batch artifacts — multiple lessons / domains / videos coming out of the same pipeline, each with their own success/fail status.
  • Cost or correctness gates — you want a "did this artifact pass before we ship it" step that fails the build if not.

Don't use it when the project is one of these:

| You're building | Why this pattern is wrong |
| --- | --- |
| A library (Council(voters), Client(api_key)) | No stages, no batch artifacts, no manifest needed. Just a clean API surface. |
| An interactive tool (push-to-talk hotkey, browser extension) | The event loop runs forever; there's no run --stage all semantics. Use this pattern's logging and config conventions only. |
| A single-shot script (scripts/migrate_db.py) | If it runs once and writes once, there's no pipeline. Just write the script. |
| An evaluation harness (Inspect-style) | Existing eval frameworks (Inspect, lm-eval) are already optimized for this — don't reinvent. |

For lib + tool projects you can still steal the conventions in this repo (quiet logging, subprocess capture, CLI shape) without adopting the full agents/validators/manifest skeleton.


The 7 mandatory features

Every project that calls itself a "harness" must implement all seven. Skipping any one of them is what causes the rewrite-in-month-3 trajectory.

1. Validators that catch what humans miss

Add at least 2–3 deterministic validators per harness. A validator is a pure function: validator(artifact) -> Verdict(pass | fail | warn, evidence). Run all of them before the artifact is allowed to leave the pipeline.
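A minimal sketch of that contract, assuming nothing beyond the standard library (the Verdict dataclass and the validator name here are illustrative, not a published API):

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Verdict:
    status: Literal["pass", "fail", "warn"]
    evidence: str  # human-readable reason, kept for the audit trail

def validate_latex_delimiters(source: str) -> Verdict:
    """Pure-function gate: unbalanced $$ blocks break the PDF build downstream."""
    if source.count("$$") % 2 != 0:
        return Verdict("fail", "odd number of $$ delimiters")
    return Verdict("pass", "all $$ blocks balanced")
```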

Examples that have paid for themselves:

  • Video / image: vision-model verifier (Qwen-VL scores the frame), prompt linter (rule-based scan for known prompt-injection patterns), audio QA (ebur128 LUFS check — a -48 LUFS music track sounds fine to the human ear and is unshippable).
  • Educational / math: SymPy ground-truth check (every claimed derivative is verified against diff(f, x)), LaTeX render check (catches $$...$$ syntax errors before PDF build).
  • Trading: backtest runner (sanity P&L), risk check (per-trade and per-day cap), data freshness check.
  • Cross-cutting: secret scanner (gitleaks regex pass on any artifact about to be published), cost tracker (per-stage USD running total, fails the build if monthly cap is exceeded).

The rule: if a human eye / ear could miss the bug, the validator catches it.

2. Subprocesses always captured

subprocess.run(cmd, capture_output=True, text=True, encoding="utf-8", errors="replace", check=False)

Never inherit stdout. Two reasons:

  1. Context pollution — when the agent is itself driven by an LLM (Claude Code, Cursor, Devin), inherited stdout floods the LLM's context window with hundreds of lines per subprocess call, choking the conversation.
  2. Failure-mode legibility — result.stderr[-800:] on a non-zero exit is structured and inspectable. print() to terminal is gone the moment the agent moves to the next step.

The exception: user-facing orchestrators (run.py that a human runs interactively to see live progress) may inherit stdout deliberately. The rule applies to agent-driven calls.
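A small helper that enforces the rule at one choke point — a hedged sketch; run_captured is a hypothetical name, not something from the repo:

```python
import subprocess

def run_captured(cmd: list[str]) -> subprocess.CompletedProcess[str]:
    """Run a tool with output captured, never inherited by the agent's terminal."""
    result = subprocess.run(
        cmd, capture_output=True, text=True,
        encoding="utf-8", errors="replace", check=False,
    )
    if result.returncode != 0:
        # Surface only the tail of stderr: structured, inspectable, bounded.
        raise RuntimeError(f"{cmd[0]} exited {result.returncode}: {result.stderr[-800:]}")
    return result
```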

3. Quiet logging that doesn't drown the conversation

Logging setup must support a quiet: bool = False parameter:

  • Default (quiet=False): console = INFO + file = INFO. For human dev.
  • Quiet (quiet=True): console = WARNING+, file = INFO. For LLM-driven runs — the console is the LLM's context window; the file is the audit trail.

Whatever you do, both branches log everything to logs/pipeline_<ts>.log. The console is for the human; the file is for the postmortem.
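One way that setup can look with the standard logging module (logger name and timestamp format are illustrative assumptions):

```python
import logging
import time
from pathlib import Path

def setup_logging(quiet: bool = False) -> logging.Logger:
    logger = logging.getLogger("harness")
    logger.setLevel(logging.INFO)

    # The file is the audit trail: it always gets INFO, quiet or not.
    Path("logs").mkdir(exist_ok=True)
    file_handler = logging.FileHandler(
        f"logs/pipeline_{time.strftime('%Y%m%d_%H%M%S')}.log", encoding="utf-8"
    )
    file_handler.setLevel(logging.INFO)

    # The console is the human's (or the LLM's) context window.
    console = logging.StreamHandler()
    console.setLevel(logging.WARNING if quiet else logging.INFO)

    logger.addHandler(file_handler)
    logger.addHandler(console)
    return logger
```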

4. Manifest as the source of truth

manifest.py persists a data/<artifact>/manifest.json for every artifact, tracking:

  • stage_status: {discover: done, value: running, register: pending, ...}
  • count, provider, updated_at, per-stage cost_usd
  • A summary() method that pretty-prints the table

Re-running becomes idempotent: stages marked done are skipped unless --force. Partial failures become visible: status shows exactly where it stopped.
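A minimal sketch of that contract using the field names above (the class and method names are illustrative):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

class Manifest:
    def __init__(self, artifact: str):
        self.path = Path("data") / artifact / "manifest.json"
        self.data = (
            json.loads(self.path.read_text())
            if self.path.exists()
            else {"stage_status": {}, "cost_usd": {}, "updated_at": None}
        )

    def mark(self, stage: str, status: str) -> None:
        self.data["stage_status"][stage] = status
        self.data["updated_at"] = datetime.now(timezone.utc).isoformat()
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.data, indent=2))

    def should_skip(self, stage: str, force: bool = False) -> bool:
        # Idempotent reruns: stages already marked done are skipped unless --force.
        return not force and self.data["stage_status"].get(stage) == "done"
```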

5. API retries + cost tracking, always

Every external API call goes through retry.py:

  • Exponential backoff, max 3 attempts
  • Distinguish 4xx (don't retry, log and surface) from 5xx (retry)
  • Optional provider fallback chain (primary → fallback_a → fallback_b)

Every retry-wrapped call writes its USD cost to the manifest. The manifest's cost_usd total is the source of truth — if you can't answer "how much have I spent today?" from one CLI command, you've already lost.
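One way to write that wrapper — a sketch that assumes the callable raises exceptions carrying a status_code attribute (this depends on your HTTP client); the provider fallback chain is omitted for brevity:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retry(fn: Callable[[], T], max_attempts: int = 3, base_delay: float = 1.0) -> T:
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            if status is not None and 400 <= status < 500:
                raise  # 4xx: the request is wrong; retrying won't fix it
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the 5xx / network error
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError("unreachable")  # satisfies the type checker
```

Cost tracking hooks in at the same choke point: every successful return writes its USD cost to the manifest before handing the result back.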

6. A four-verb CLI

<harness> status   # read manifest, pretty-print stage status
<harness> doctor   # check API keys, runtime deps, external tools on PATH
<harness> run      # run a stage (--stage N) or everything (--stage all)
<harness> audit    # replay validators against existing artifacts, no rerun

typer makes this trivial. Resist the urge to add more verbs — every additional verb is one more surface a maintainer has to remember.
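With typer the whole surface fits on one page; a skeleton under that four-verb constraint (bodies elided):

```python
import typer

app = typer.Typer(help="Harness CLI: status / doctor / run / audit")

@app.command()
def status() -> None:
    """Read the manifest and pretty-print per-stage status."""
    ...

@app.command()
def doctor() -> None:
    """Check API keys, runtime deps, and external tools on PATH."""
    ...

@app.command()
def run(stage: str = typer.Option("all", help="Stage number or 'all'")) -> None:
    """Execute one stage or the whole pipeline."""
    ...

@app.command()
def audit() -> None:
    """Replay validators against existing artifacts; nothing is regenerated."""
    ...

if __name__ == "__main__":
    app()
```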

7. Progress that doesn't flood

Any stage that runs longer than 30s (video generation, batch image, multi-step LLM chain) must emit progress via:

  • JSON-lines on a separate file or stderr (not stdout — that's the artifact channel)
  • A --progress json|tqdm|silent flag
  • The LLM-driven default is silent; humans use tqdm; CI uses json

The agent reads the JSON to know "stage 4 is 60% done", not "stage 4 logged 'Processing batch 38 of 64' to stdout".
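A sketch of the emitter side, assuming stderr as the progress channel (the field names are illustrative):

```python
import json
import sys

def emit_progress(stage: int, done: int, total: int) -> None:
    # One JSON object per line on stderr; stdout stays the artifact channel.
    sys.stderr.write(json.dumps({
        "stage": stage,
        "done": done,
        "total": total,
        "pct": round(100 * done / total, 1),
    }) + "\n")
    sys.stderr.flush()
```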


Upgrading an existing project

If you have a working pipeline already, don't rewrite it. Migrate in five stages, one per commit, in this order:

| Stage | Add | Why first | Effort |
| --- | --- | --- | --- |
| A | quiet logging | Unblocks LLM-driven runs immediately; touches zero business logic | 30 min |
| B | manifest persistence | Enables idempotent reruns; touches stage transitions only | 1–2 h |
| C | validators (≥ 2) | Catches the bugs you're shipping today | 2–4 h |
| D | subprocess capture | Mechanical grep + replace; no business-logic change | 30 min |
| E | retry + cost tracking | Hardest — API code lives in many files; do last when the rest is stable | 4–8 h |

After each stage, run end-to-end and confirm nothing broke. Never combine two stages in one commit — it makes the cause of a regression invisible.


Reference implementations

Six projects following this pattern, all open source, all MIT:

| Repo | Domain | Stage count | Validators |
| --- | --- | --- | --- |
| claude-screen-mcp | Cross-platform screen capture + OCR for AI agents | 10 tools, single stage each | output byte cap, dHash channel assert, OCR allowlist |
| voice2ai | Hands-free push-to-talk dictation (Windows) | Interactive — uses pattern's logging/config only | n/a |
| domain-harness | Automated domain investing | 6 stages: discover → value → acquire → list → negotiate → settle | budget walls, trademark blacklist, dup check, AI Council |
| ai-council | Multi-voter LLM consensus framework | Library — uses pattern's manifest only | n/a |
| methods-harness | SymPy-verified bilingual math lesson pipeline | 5 stages per chapter | derivative / integral / factor / transform / trig-solve |
| cleanup-harness (private) | Reversible disk-cleanup pipeline | 4 stages: scan → classify → quarantine → confirm | whitelist enforce, dry-run gate, undo log |

The three full harnesses (domain / methods / cleanup) use all 7 mandatory features. claude-screen-mcp uses 5 (no pipelines, no batch artifacts — it's a per-call tool server). voice2ai and ai-council adopt only conventions because their shape isn't pipeline-like (see When NOT to use).


Lessons learned (the painful ones)

These are the bugs that earned each rule. Names anonymized; specifics preserved.

L1 · Validators must catch the audible/visible gap. A music track at -48 LUFS sounded fine to the ear during preview. The user reported "audio is broken." The fix was an ebur128 LUFS validator that fails the build on anything below -23. Every harness needs at least one validator that catches what humans miss.

L2 · "It generated something" ≠ "it generated the right thing". A documentary stage used Ken Burns pan-and-zoom on still frames because the video model failed silently. Output looked like video. Was rejected on review. Fix: validator that checks for actual frame-to-frame pixel motion above a threshold.

L3 · LLM color descriptions are not RGB. "Zitan red" (a specific reddish-brown wood color) was translated by an image model as purple. Fix: every color spec in a prompt must be accompanied by an explicit hex code, and the validator checks the rendered image's dominant color against the expected hex within ΔE tolerance.

L4 · Placeholder substitution must be reverse-sorted. _D_1 is a prefix of _D_10. Replacing in forward order corrupts _D_10 into <value-of-D_1>0. Fix: for i in reversed(range(N)). Trivial in hindsight; surprisingly bug-prone in practice.
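The bug in miniature (the template and values here are made up for illustration):

```python
template = "area = _D_1, perimeter = _D_10"
values = {1: "3.5", 10: "42"}

# Forward order corrupts: replacing _D_1 first turns _D_10 into "3.50".
# Replacing in reverse index order consumes the longer placeholder first.
text = template
for i in sorted(values, reverse=True):
    text = text.replace(f"_D_{i}", values[i])

assert text == "area = 3.5, perimeter = 42"
```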

L5 · CLAUDE.md / agent-instruction files have a token budget. A 15k-token project instruction file was being silently truncated by the LLM client at 8k. Project-specific details belong in per-directory CLAUDE.md / AGENTS.md, not the global one. Keep the global file to meta-rules, preferences, and an index pointing into specifics.

L6 · Half-automated > manually-finished is a trap. "Click here at the end" tutorial steps always desync from reality. If a step can be scripted (Playwright, an API call, a shell command), script it — even when "it's easier this once to just do it by hand."

L7 · Store credentials and IDs the moment you receive them. Discovering at 2 a.m. that you don't remember which TTS voice ID was the good one is a special kind of pain. Append to accounts.md (or equivalent) the moment you sign up or pick a voice.

L8 · Don't trust a SKILL just because its name overlaps. Three skills called "trading" turned out to be a framework, a knowledge base, and a tactical playbook — none redundant. Before deleting a "duplicate," read the SKILL.md.

L9 · Memory files have a half-life. A memory written 6 weeks ago about which library was current may be stale. When a memory cites a specific function or path, verify the cite (grep the repo, read the file) before acting on it. Update the memory if it's wrong.

L10 · _lib/ shared code is an over-engineering trap for single-author multi-harness portfolios. Six harnesses across six different domains (video / RAG / education / trading / cleanup / OCR) tried to share a _lib/ and the result was a versioning nightmare. Independent code + shared methodology (i.e. this repo) won. The same 50 lines of logging_setup.py repeated six times is not a problem; dependency hell across six projects is.

L11 · Auto-audit tools can lie — every flagged item needs a human pass. Regex audits over multi-line subprocess calls flag false positives. AST audits flag genuinely-OK calls in user-facing orchestrators (where inheriting stdout is the right call). Treat any auto-audit output as a candidate list, not a bug list.

L12 · git init before the SKILL refactor, not after. Three projects refactored without a baseline commit and had to start from scratch when the refactor went sideways. First commit on any new repo is chore: baseline initial commit with zero business-logic changes. Then everything else.

L13 · Not every project should adopt all 7 features. A library (Council(voters)) doesn't need a manifest. An interactive tool (push-to-talk loop) doesn't need pipelines. Only adopt the pattern when the project has stages + cross-stage state + batch artifacts. Otherwise borrow the conventions you like and leave the rest.

L14 · Verify the real API before writing docs about it. Three projects shipped CONTRIBUTING.md sections describing functions that didn't exist (plausible-but-imaginary API names auto-generated by an LLM). The validator: grep -nE "^def " src/ | sort before mentioning any function name in user-facing prose.


Anti-patterns

A short list of "if you find yourself doing this, stop":

  • No validators, ship directly. You will ship a bug a human eye missed. Asked-and-answered six times.
  • Subprocess without capture_output=True. Floods the LLM context. Floods CI logs. Hides errors.
  • Reading progress from stdout. Stdout is the artifact channel. Progress goes to a different channel.
  • One file with if stage == 1: ... elif stage == 2: .... Stages are independent units; one file each, in a pipelines/ or stages/ folder.
  • A _lib/ shared across multiple unrelated harnesses. See L10.
  • Editing business logic during a logging / config / structure refactor. Each commit does one thing. Always.

Status

v0.1 — pattern documented, 6 reference implementations. Future versions of this repo are docs-only additions:

  • v0.2 — ARCHITECTURE.md long-form deep-dive on each layer
  • v0.3 — WHEN.md decision tree for "pattern vs library vs script"
  • v0.4 — UPGRADE-PATH.md per-stage runnable example (a 200-line "before" repo + diff to "after" repo)
  • v1.0 — a polished public version of audit.py (the self-audit script) as a reference implementation

There is no plan to ship a Python package. The point is that you don't need one.


Contributing

PRs welcome on:

  • More lessons learned — if you've shipped a harness following this pattern and hit a bug not already in §L, send a PR.
  • More reference implementations — open-source harnesses that follow the pattern get linked from §Reference implementations. Open an issue with your repo URL.
  • Translations — README.zh-CN.md (or any other language) is wide open.
  • Typo / clarity fixes — always welcome.

What I won't take:

  • PRs that add a Python package wrapper. The point is that there isn't one.
  • PRs that add "support" for specific frameworks (LangChain, CrewAI). The pattern is framework-agnostic on purpose.

See CONTRIBUTING.md.


License

MIT. Use this pattern, fork it, claim it, internalize it — the goal is propagation, not credit.


Sibling projects

Built by @lfzds4399-cpu — an 18-year-old solo builder, Year 12 student in Australia, validating this pattern across a handful of open-source projects:

| Repo | One line |
| --- | --- |
| claude-screen-mcp | Cross-platform MCP server letting Claude see your screen — OCR + smart vision-diff |
| voice2ai | Hands-free push-to-talk dictation for Windows |
| domain-harness | Automated domain-investing pipeline with hard budget walls |
| ai-council | Multi-voter consensus framework for LLM decisions |
| methods-harness | SymPy-verified bilingual math lesson pipeline |
