harness-engineering

A pattern (not a framework) for AI-agent pipelines. Validated across 6+ open-source projects in production.

License: MIT


Most AI-agent projects collapse not because the LLM is bad, but because the pipeline around the LLM is undisciplined. Retries, logs, costs, validation, and state recovery each get re-invented (badly) in every new repo, until the project hits a wall and gets rewritten.

This repo writes down the pipeline pattern that's survived six rewrites across six different problem domains (cross-platform screen capture / voice dictation / domain investing / educational PDFs / multi-voter LLM decisions / file cleanup) in a single-author setting. It is:

  • A pattern, not a library — there's nothing to pip install. Every project implements the pattern independently in its own language with its own dependencies.
  • Opinionated — there is exactly one correct answer to "where does the manifest live?", "how do I shell out to a subprocess?", and "what does the CLI surface look like?". The point of pattern documentation is to stop relitigating these.
  • Earned, not theorized — every rule below was added the day it would have prevented a bug from shipping. The lessons section quotes the specific failure that taught each one.

If you are building anything that looks like "some agent generates output, then I want to verify that output, then I want to ship the output somewhere, and the whole loop must be re-runnable" — this is for you.


TL;DR — The pattern in one diagram

┌─────────────────────────────────────────────────────┐
│  cli.py     status / doctor / run / audit           │  ← single entry point
├─────────────────────────────────────────────────────┤
│  agents/    "Generate" — LLM calls, HTTP, scrapes   │  ← does the work
│  validators/  "Verify" — pure-function gates        │  ← catches mistakes
│  pipelines/   "Orchestrate" — multi-stage flow      │  ← composes the above
├─────────────────────────────────────────────────────┤
│  manifest.json    stage status + counts + cost      │  ← state of the world
│  configs/<artifact>.yaml   per-artifact config      │  ← what to build
│  logs/pipeline_<ts>.log    everything that happened │  ← debug surface
└─────────────────────────────────────────────────────┘

Three rules of the pattern:

  1. Validators are mandatory — every artifact runs through ≥ 1 deterministic gate before it ships. A wrong-sign derivative in a math lesson, a -48 LUFS music track, a 401 from a paid API — these are caught by the validator layer, not by a human who happens to be looking.
  2. Subprocesses are always captured — capture_output=True everywhere. The agent's stdout is a structured tool result for the orchestrator, not a stream into the terminal.
  3. One CLI, four verbs — status (read the manifest), doctor (check API keys / deps), run (execute one or all stages), audit (replay validators against existing artifacts).

Everything else is a consequence of these three.


Why a pattern and not a framework

LangChain, CrewAI, LangGraph, AutoGen, Inspect — there are already excellent frameworks for AI-agent pipelines. They give you a Pipeline object, a @stage decorator, a vendored retry policy, a vendored logger.

The problem with frameworks-as-the-answer:

  • Lock-in to a programming model. Once you @stage your code, you've coupled to a specific framework's ideas about state, retries, and async. Migrating to a different framework two years later is a rewrite.
  • Dependency hell on a per-project basis. A video harness, a math harness, and a trading harness have nothing in common at the dependency level (FFmpeg vs SymPy vs CCXT). Forcing them through one framework drags in transitive deps that none of them needs.
  • Pattern recognition is the actual transferable skill. Six harnesses, six independent codebases, six independent dependency trees, the same architecture. That architecture — agents + validators + manifest, CLI-driven, observable — is what stays the same when the framework underneath doesn't.

The pattern is documented here so that:

  • Someone starting their seventh harness in a fresh language doesn't waste a week relitigating layout.
  • A code review can say "this stage doesn't emit a manifest update, that's a violation of the pattern" without arguing about it.
  • Lessons learned compound across projects instead of being trapped in any single repo's commit history.

When to use this pattern (and when not to)

Use it when your project has:

  • Multiple stages that need ordering — discover → score → buy → list, or extract → render → validate → publish.
  • Cross-stage state recovery — re-running stage 4 after a stage-3 failure should not re-run stages 1–2.
  • A mix of LLM and deterministic logic — some stages call an LLM (creative), others apply rules (verification).
  • Batch artifacts — multiple lessons / domains / videos coming out of the same pipeline, each with their own success/fail status.
  • Cost or correctness gates — you want a "did this artifact pass before we ship it" step that fails the build if not.

Don't use it when the project is one of these:

| You're building | Why this pattern is wrong |
| --- | --- |
| A library (Council(voters), Client(api_key)) | No stages, no batch artifacts, no manifest needed. Just a clean API surface. |
| An interactive tool (push-to-talk hotkey, browser extension) | The event loop runs forever; there's no run --stage all semantics. Use this pattern's logging and config conventions only. |
| A single-shot script (scripts/migrate_db.py) | If it runs once and writes once, there's no pipeline. Just write the script. |
| An evaluation harness (Inspect-style) | Existing eval frameworks (Inspect, lm-eval) are already optimized for this — don't reinvent. |

For lib + tool projects you can still steal the conventions in this repo (quiet logging, subprocess capture, CLI shape) without adopting the full agents/validators/manifest skeleton.


The 7 mandatory features

Every project that calls itself a "harness" must implement all seven. Skipping any one of them is what causes the rewrite-in-month-3 trajectory.

1. Validators that catch what humans miss

Add at least 2–3 deterministic validators per harness. A validator is a pure function: validator(artifact) -> Verdict(pass | fail | warn, evidence). Run all of them before the artifact is allowed to leave the pipeline.
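A minimal sketch of that contract, assuming nothing beyond the standard library (the Verdict dataclass and the validator name here are illustrative, not a published API):

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Verdict:
    status: Literal["pass", "fail", "warn"]
    evidence: str  # human-readable reason, kept for the audit trail

def validate_latex_delimiters(source: str) -> Verdict:
    """Pure-function gate: unbalanced $$ blocks break the PDF build downstream."""
    if source.count("$$") % 2 != 0:
        return Verdict("fail", "odd number of $$ delimiters")
    return Verdict("pass", "all $$ blocks balanced")
```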

Examples that have paid for themselves:

  • Video / image: vision-model verifier (Qwen-VL scores the frame), prompt linter (rule-based scan for known prompt-injection patterns), audio QA (ebur128 LUFS check — a -48 LUFS music track sounds fine to the human ear and is unshippable).
  • Educational / math: SymPy ground-truth check (every claimed derivative is verified against diff(f, x)), LaTeX render check (catches $$...$$ syntax errors before PDF build).
  • Trading: backtest runner (sanity P&L), risk check (per-trade and per-day cap), data freshness check.
  • Cross-cutting: secret scanner (gitleaks regex pass on any artifact about to be published), cost tracker (per-stage USD running total, fails the build if monthly cap is exceeded).

The rule: if a human eye / ear could miss the bug, the validator catches it.

2. Subprocesses always captured

subprocess.run(cmd, capture_output=True, text=True, encoding="utf-8", errors="replace", check=False)

Never inherit stdout. Two reasons:

  1. Context pollution — when the agent is itself driven by an LLM (Claude Code, Cursor, Devin), inherited stdout floods the LLM's context window with hundreds of lines per subprocess call, choking the conversation.
  2. Failure-mode legibility — result.stderr[-800:] on a non-zero exit is structured and inspectable. print() to terminal is gone the moment the agent moves to the next step.

The exception: user-facing orchestrators (run.py that a human runs interactively to see live progress) may inherit stdout deliberately. The rule applies to agent-driven calls.
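A small helper that enforces the rule at one choke point — a hedged sketch; run_captured is a hypothetical name, not something from the repo:

```python
import subprocess

def run_captured(cmd: list[str]) -> subprocess.CompletedProcess[str]:
    """Run a tool with output captured, never inherited by the agent's terminal."""
    result = subprocess.run(
        cmd, capture_output=True, text=True,
        encoding="utf-8", errors="replace", check=False,
    )
    if result.returncode != 0:
        # Surface only the tail of stderr: structured, inspectable, bounded.
        raise RuntimeError(f"{cmd[0]} exited {result.returncode}: {result.stderr[-800:]}")
    return result
```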

3. Quiet logging that doesn't drown the conversation

Logging setup must support a quiet: bool = False parameter:

  • Default (quiet=False): console = INFO + file = INFO. For human dev.
  • Quiet (quiet=True): console = WARNING+, file = INFO. For LLM-driven runs — the console is the LLM's context window; the file is the audit trail.

Whatever you do, both branches log everything to logs/pipeline_<ts>.log. The console is for the human; the file is for the postmortem.
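One way that setup can look with the standard logging module (logger name and timestamp format are illustrative assumptions):

```python
import logging
import time
from pathlib import Path

def setup_logging(quiet: bool = False) -> logging.Logger:
    logger = logging.getLogger("harness")
    logger.setLevel(logging.INFO)

    # The file is the audit trail: it always gets INFO, quiet or not.
    Path("logs").mkdir(exist_ok=True)
    file_handler = logging.FileHandler(
        f"logs/pipeline_{time.strftime('%Y%m%d_%H%M%S')}.log", encoding="utf-8"
    )
    file_handler.setLevel(logging.INFO)

    # The console is the human's (or the LLM's) context window.
    console = logging.StreamHandler()
    console.setLevel(logging.WARNING if quiet else logging.INFO)

    logger.addHandler(file_handler)
    logger.addHandler(console)
    return logger
```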

4. Manifest as the source of truth

manifest.py persists a data/<artifact>/manifest.json for every artifact, tracking:

  • stage_status: {discover: done, value: running, register: pending, ...}
  • count, provider, updated_at, per-stage cost_usd
  • A summary() method that pretty-prints the table

Re-running becomes idempotent: stages marked done are skipped unless --force. Partial failures become visible: status shows exactly where it stopped.
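A minimal sketch of that contract using the field names above (the class and method names are illustrative):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

class Manifest:
    def __init__(self, artifact: str):
        self.path = Path("data") / artifact / "manifest.json"
        self.data = (
            json.loads(self.path.read_text())
            if self.path.exists()
            else {"stage_status": {}, "cost_usd": {}, "updated_at": None}
        )

    def mark(self, stage: str, status: str) -> None:
        self.data["stage_status"][stage] = status
        self.data["updated_at"] = datetime.now(timezone.utc).isoformat()
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.data, indent=2))

    def should_skip(self, stage: str, force: bool = False) -> bool:
        # Idempotent reruns: stages already marked done are skipped unless --force.
        return not force and self.data["stage_status"].get(stage) == "done"
```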

5. API retries + cost tracking, always

Every external API call goes through retry.py:

  • Exponential backoff, max 3 attempts
  • Distinguish 4xx (don't retry, log and surface) from 5xx (retry)
  • Optional provider fallback chain (primary → fallback_a → fallback_b)

Every retry-wrapped call writes its USD cost to the manifest. The manifest's cost_usd total is the source of truth — if you can't answer "how much have I spent today?" from one CLI command, you've already lost.
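One way to write that wrapper — a sketch that assumes the callable raises exceptions carrying a status_code attribute (this depends on your HTTP client); the provider fallback chain is omitted for brevity:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retry(fn: Callable[[], T], max_attempts: int = 3, base_delay: float = 1.0) -> T:
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            if status is not None and 400 <= status < 500:
                raise  # 4xx: the request is wrong; retrying won't fix it
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the 5xx / network error
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError("unreachable")  # satisfies the type checker
```

Cost tracking hooks in at the same choke point: every successful return writes its USD cost to the manifest before handing the result back.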

6. A four-verb CLI

<harness> status   # read manifest, pretty-print stage status
<harness> doctor   # check API keys, runtime deps, external tools on PATH
<harness> run      # run a stage (--stage N) or everything (--stage all)
<harness> audit    # replay validators against existing artifacts, no rerun

typer makes this trivial. Resist the urge to add more verbs — every additional verb is one more surface a maintainer has to remember.
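With typer the whole surface fits on one page; a skeleton under that four-verb constraint (bodies elided):

```python
import typer

app = typer.Typer(help="Harness CLI: status / doctor / run / audit")

@app.command()
def status() -> None:
    """Read the manifest and pretty-print per-stage status."""
    ...

@app.command()
def doctor() -> None:
    """Check API keys, runtime deps, and external tools on PATH."""
    ...

@app.command()
def run(stage: str = typer.Option("all", help="Stage number or 'all'")) -> None:
    """Execute one stage or the whole pipeline."""
    ...

@app.command()
def audit() -> None:
    """Replay validators against existing artifacts; nothing is regenerated."""
    ...

if __name__ == "__main__":
    app()
```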

7. Progress that doesn't flood

Any stage that runs longer than 30s (video generation, batch image, multi-step LLM chain) must emit progress via:

  • JSON-lines on a separate file or stderr (not stdout — that's the artifact channel)
  • A --progress json|tqdm|silent flag
  • The LLM-driven default is silent; humans use tqdm; CI uses json

The agent reads the JSON to know "stage 4 is 60% done", not "stage 4 logged 'Processing batch 38 of 64' to stdout".
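A sketch of the emitter side, assuming stderr as the progress channel (the field names are illustrative):

```python
import json
import sys

def emit_progress(stage: int, done: int, total: int) -> None:
    # One JSON object per line on stderr; stdout stays the artifact channel.
    sys.stderr.write(json.dumps({
        "stage": stage,
        "done": done,
        "total": total,
        "pct": round(100 * done / total, 1),
    }) + "\n")
    sys.stderr.flush()
```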


Upgrading an existing project

If you have a working pipeline already, don't rewrite it. Migrate in five stages, one per commit, in this order:

| Stage | Add | Why first | Effort |
| --- | --- | --- | --- |
| A | quiet logging | Unblocks LLM-driven runs immediately; touches zero business logic | 30 min |
| B | manifest persistence | Enables idempotent reruns; touches stage transitions only | 1–2 h |
| C | validators (≥ 2) | Catches the bugs you're shipping today | 2–4 h |
| D | subprocess capture | Mechanical grep + replace; no business-logic change | 30 min |
| E | retry + cost tracking | Hardest — API code lives in many files; do last when the rest is stable | 4–8 h |

After each stage, run end-to-end and confirm nothing broke. Never combine two stages in one commit — it makes the cause of a regression invisible.


Reference implementations

Six projects following this pattern, all open source, all MIT:

| Repo | Domain | Stage count | Validators |
| --- | --- | --- | --- |
| claude-screen-mcp | Cross-platform screen capture + OCR for AI agents | 10 tools, single stage each | output byte cap, dHash channel assert, OCR allowlist |
| voice2ai | Hands-free push-to-talk dictation (Windows) | Interactive — uses pattern's logging/config only | n/a |
| domain-harness | Automated domain investing | 6 stages: discover → value → acquire → list → negotiate → settle | budget walls, trademark blacklist, dup check, AI Council |
| ai-council | Multi-voter LLM consensus framework | Library — uses pattern's manifest only | n/a |
| methods-harness | SymPy-verified bilingual math lesson pipeline | 5 stages per chapter | derivative / integral / factor / transform / trig-solve |
| cleanup-harness (private) | Reversible disk-cleanup pipeline | 4 stages: scan → classify → quarantine → confirm | whitelist enforce, dry-run gate, undo log |

The three full harnesses (domain / methods / cleanup) use all 7 mandatory features. claude-screen-mcp uses 5 (no pipelines, no batch artifacts — it's a per-call tool server). voice2ai and ai-council adopt only conventions because their shape isn't pipeline-like (see When NOT to use).


Lessons learned (the painful ones)

These are the bugs that earned each rule. Names anonymized; specifics preserved.

L1 · Validators must catch the audible/visible gap. A music track at -48 LUFS sounded fine to the ear during preview. The user reported "audio is broken." The fix was an ebur128 LUFS validator that fails the build on anything below -23. Every harness needs at least one validator that catches what humans miss.

L2 · "It generated something" ≠ "it generated the right thing". A documentary stage used Ken Burns pan-and-zoom on still frames because the video model failed silently. Output looked like video. Was rejected on review. Fix: validator that checks for actual frame-to-frame pixel motion above a threshold.

L3 · LLM color descriptions are not RGB. "Zitan red" (a specific reddish-brown wood color) was translated by an image model as purple. Fix: every color spec in a prompt must be accompanied by an explicit hex code, and the validator checks the rendered image's dominant color against the expected hex within ΔE tolerance.

L4 · Placeholder substitution must be reverse-sorted. _D_1 is a prefix of _D_10. Replacing in forward order corrupts _D_10 into <value-of-D_1>0. Fix: for i in reversed(range(N)). Trivial in hindsight; surprisingly bug-prone in practice.
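The bug in miniature (the template and values here are made up for illustration):

```python
template = "area = _D_1, perimeter = _D_10"
values = {1: "3.5", 10: "42"}

# Forward order corrupts: replacing _D_1 first turns _D_10 into "3.50".
# Replacing in reverse index order consumes the longer placeholder first.
text = template
for i in sorted(values, reverse=True):
    text = text.replace(f"_D_{i}", values[i])

assert text == "area = 3.5, perimeter = 42"
```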

L5 · CLAUDE.md / agent-instruction files have a token budget. A 15k-token project instruction file was being silently truncated by the LLM client at 8k. Project-specific details belong in per-directory CLAUDE.md / AGENTS.md, not the global one. Keep the global file to meta-rules, preferences, and an index pointing into specifics.

L6 · Half-automated > manually-finished is a trap. "Click here at the end" tutorial steps always desync from reality. If a step can be scripted (Playwright, an API call, a shell command), script it — even when "it's easier this once to just do it by hand."

L7 · Store credentials and IDs the moment you receive them. Discovering at 2 a.m. that you don't remember which TTS voice ID was the good one is a special kind of pain. Append to accounts.md (or equivalent) the moment you sign up or pick a voice.

L8 · Don't trust a SKILL just because its name overlaps. Three skills called "trading" turned out to be a framework, a knowledge base, and a tactical playbook — none redundant. Before deleting a "duplicate," read the SKILL.md.

L9 · Memory files have a half-life. A memory written 6 weeks ago about which library was current may be stale. When a memory cites a specific function or path, verify the cite (grep the repo, read the file) before acting on it. Update the memory if it's wrong.

L10 · _lib/ shared code is an over-engineering trap for single-author multi-harness portfolios. Six harnesses across six different domains (video / RAG / education / trading / cleanup / OCR) tried to share a _lib/ and the result was a versioning nightmare. Independent code + shared methodology (i.e. this repo) won. The same 50 lines of logging_setup.py repeated six times is not a problem; dependency hell across six projects is.

L11 · Auto-audit tools can lie — every flagged item needs a human pass. Regex audits over multi-line subprocess calls flag false positives. AST audits flag genuinely-OK calls in user-facing orchestrators (where inheriting stdout is the right call). Treat any auto-audit output as a candidate list, not a bug list.

L12 · git init before the SKILL refactor, not after. Three projects refactored without a baseline commit and had to start from scratch when the refactor went sideways. First commit on any new repo is chore: baseline initial commit with zero business-logic changes. Then everything else.

L13 · Not every project should adopt all 7 features. A library (Council(voters)) doesn't need a manifest. An interactive tool (push-to-talk loop) doesn't need pipelines. Only adopt the pattern when the project has stages + cross-stage state + batch artifacts. Otherwise borrow the conventions you like and leave the rest.

L14 · Verify the real API before writing docs about it. Three projects shipped CONTRIBUTING.md sections describing functions that didn't exist (plausible-but-imaginary API names auto-generated by an LLM). The validator: grep -nE "^def " src/ | sort before mentioning any function name in user-facing prose.


Anti-patterns

A short list of "if you find yourself doing this, stop":

  • No validators, ship directly. You will ship a bug a human eye missed. Asked-and-answered six times.
  • Subprocess without capture_output=True. Floods the LLM context. Floods CI logs. Hides errors.
  • Reading progress from stdout. Stdout is the artifact channel. Progress goes to a different channel.
  • One file with if stage == 1: ... elif stage == 2: .... Stages are independent units; one file each, in a pipelines/ or stages/ folder.
  • A _lib/ shared across multiple unrelated harnesses. See L10.
  • Editing business logic during a logging / config / structure refactor. Each commit does one thing. Always.

Status

v0.1 — pattern documented, 6 reference implementations. Future versions of this repo are docs-only additions:

  • v0.2 — ARCHITECTURE.md long-form deep-dive on each layer
  • v0.3 — WHEN.md decision tree for "pattern vs library vs script"
  • v0.4 — UPGRADE-PATH.md per-stage runnable example (a 200-line "before" repo + diff to "after" repo)
  • v1.0 — a polished public version of audit.py (the self-audit script) as a reference implementation

There is no plan to ship a Python package. The point is that you don't need one.


Contributing

PRs welcome on:

  • More lessons learned — if you've shipped a harness following this pattern and hit a bug not already in §L, send a PR.
  • More reference implementations — open-source harnesses that follow the pattern get linked from §Reference implementations. Open an issue with your repo URL.
  • Translations — README.zh-CN.md (or any other language) is wide open.
  • Typo / clarity fixes — always welcome.

What I won't take:

  • PRs that add a Python package wrapper. The point is that there isn't one.
  • PRs that add "support" for specific frameworks (LangChain, CrewAI). The pattern is framework-agnostic on purpose.

See CONTRIBUTING.md.


License

MIT. Use this pattern, fork it, claim it, internalize it — the goal is propagation, not credit.


Sibling projects

Built by @lfzds4399-cpu — an 18-year-old solo builder, Year 12 student in Australia, validating this pattern across a handful of open-source projects:

| Repo | One line |
| --- | --- |
| claude-screen-mcp | Cross-platform MCP server letting Claude see your screen — OCR + smart vision-diff |
| voice2ai | Hands-free push-to-talk dictation for Windows |
| domain-harness | Automated domain-investing pipeline with hard budget walls |
| ai-council | Multi-voter consensus framework for LLM decisions |
| methods-harness | SymPy-verified bilingual math lesson pipeline |
