Skip to content

Zane456/harness-check

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

English | 简体中文

harness-check

harness-check — agent harness structure auditor for Claude Code

"Docs describe one system. Code runs another. Find the gap before your users do."

License: MIT Platform: Claude Code Type: Agent Skill Failure patterns: 24


A structure auditor for multi-agent harnesses — does the system your docs describe equal the system your code can actually run?


harness-check audits agent frameworks wired together from prompts, Markdown, and scripts. It hunts the glue — a trigger word, role discovery, a registry, a symlink, a script seam. Where one quiet break kills a whole chain while the docs still claim full functionality. It verifies every finding against your real files (no guessing), and reports findings as plain-language before/after comparisons — not a wall of line numbers.


See Demo · Install · What It Catches · How It Works · Why It's Different


Demo

A real audit run, condensed. You ask in plain language; it answers in a layered report:

❯ "review my agent framework"

VERDICT — the everyday main path (Claude running the test loop) is healthy and
fully wired: preflight green, zero dead links, every script arg lines up, locks
correct. No XHigh/High — a few doc-drift items + 1 "relies on the model paying
attention" robustness note.

📊 Health scores      architecture 5 · trigger 4 · execution 4 · declared-vs-wired 4

CODE LAYER  (the audit trail, skippable)
  🟡 M1  Script failure goes unchecked — dim3 execution reliability
         where: testing/test-runner.md:177-199
         why:   the pseudocode never checks state-script exit codes → a failed
                status flip gets treated as success and the loop walks on
         fix:   add Supervisor rule "any state script non-zero → stop + report"

✅ SAFE TO FIX NOW              🤔 NEEDS YOUR DECISION
  N1 main-flow hardcodes Claude    M1 add a "non-zero → stop" hard check
  N2 compaction threshold mismatch    (edits the shared main flow — your call,
  N3 scenario count off by one         won't change automatically)

It doesn't bury you in line numbers. It tells you what silently breaks, what's safe to fix right now, and what needs your call — because when an LLM wrote the harness, you need a second pair of eyes.


Install

It's a Claude Code skill — drop it in your skills directory:

git clone https://github.com/Zane456/harness-check.git ~/.claude/skills/harness-check

Then just ask, in any wording:

review my agent framework   ·   harness review   ·   审查我的框架   ·   harness 体检

The skill auto-triggers on those phrasings and runs the full six-step audit.


What It Catches

Four dimensions, 24 catalogued failure patterns, each paired with a concrete how-to-verify-on-disk check:

Dimension What silently breaks here
1 · System architecture Two disconnected execution paths wearing one name; docs that promise what the running path doesn't deliver; "drop a file and it works" that secretly needs a script edit.
2 · Trigger reliability A discovery glob that misses subdirectories → empty registry; dead symlinks; a roster name that ≠ the dispatcher's name → lookup 404s; a forced Skill(X) pointing at an undiscoverable X.
3 · Execution reliability State/recursion bugs; script-config seams that don't match; 2>/dev/null || true swallowing a critical failure; a loop cap that truncates silently; shared resources with no lock.
4 · Declared vs wired Documented features the code never implements; a health-check that validates the wrong field (green ✓ ≠ runnable); empty dirs, hardcoded paths, a dependency that exists only as .example.

Full pattern catalogue with per-pattern verification: references/failure-patterns.md.


How It Works

Six steps. Each prints a visible [harness-check] … line so no step is silently skipped.

0 Discovery ─▶ 1 Inventory ─▶ 2 Scan (4 dims) ─▶ 3 Verify ─▶ 4 Report ─▶ 5 Self-check

0. Discovery first — the lifeline: can roles/skills/tools actually be found and invoked? Globs, symlinks, registry completeness, name consistency. 1. Inventory — read everything (config, rules, roster, role defs, scripts) before judging anything. 2. Scan — four dimensions against the 24-pattern catalogue. 3. Verify — the iron rule: no verification, no finding. Anything unverifiable on disk is downgraded to "suspected" or dropped. 4. Report — verdict, health scores, and code-layer evidence; then the ✅ Safe to fix now / 🤔 Needs your decision split, with the top trigger- and execution-reliability silent-break scenarios folded into those items. Written compact (lesstoken), behavior-first. 5. Self-check — every file assessed, every finding verified, every where carries a file:line.


Why It's Different

Most "reviews" harness-check
Target business-logic correctness and security whether the framework runs as wired
Evidence plausible inference verified on your real files, or dropped
Fix advice "here's the bug" severity and fix-risk (🟢 safe now / 🟡 use caution / 🔴 may break a running path)
Report voice file:line everywhere before/after behavior — what changes, how risky

Out of scope (on purpose): secret leaks, business-logic correctness, performance profiling — they distract from the one question: can the framework run?


Repository Structure

harness-check/
├── SKILL.md                          # methodology — the six-step flow
├── references/
│   ├── failure-patterns.md           # 24 patterns × how to verify each
│   ├── fix-risk-tiers.md             # 🟢/🟡/🔴 — risk of each fix
│   └── output-format.md              # the layered report shape
├── assets/hero.png
└── README.md

Docs describe one system. Code runs another. harness-check finds where they split.


Zane456

Platform Link
🐙 GitHub @Zane456

⭐ If harness-check saves you from a silent break, consider starring it.

MIT License © Zane456

About

Agent-harness structure auditor for Claude Code multi-agent systems. Checks whether the system your docs describe equals the system your code can actually run — trigger words, role discovery, registries, symlinks, script seams. 4 dimensions, 24 verified failure patterns, severity x fix-risk tiers.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors