Stometa's public curated Claude Code skillset — a small, opinionated set of skills we use ourselves, published periodically.
What it is: A Claude Code plugin with two engineering skills — review-loop (cross-model code review) and harness (multi-agent task orchestration). Published from Stometa's private stometa-skillset in batches, after internal validation.
What it does: Coordinates a Plan → Generate → Evaluate → Retro pipeline with hard, engine-enforced constraints: two isolated sessions, a fresh sub-agent per checkpoint, an engine script as the sole gatekeeper, and a cross-vendor peer review before any PR lands.
What problems it solves:
| Failure mode | How it shows up | What Harness does |
|---|---|---|
| Context drift | Planning, coding, and review share one growing context — the model drifts further with each turn | Two-session split + fresh sub-agent per checkpoint (eigenbehavior reset) |
| Self-certification | The LLM that wrote the code also judges whether it passes | Engine script rejects any pass-checkpoint where the evaluator session id matches a prior checkpoint in the same task |
| Echo-chamber review | The same model reviews its own work and misses its own blind spots | review-loop enforces a different-vendor peer (Codex or Gemini) and runs a fresh-session final approval before closing |
Two phases, hard context boundary between them. Each phase contains its own fresh-sub-agent iteration loop. The engine script is the only entity that can advance phase state — the LLM cannot self-certify.
flowchart TB
YOU(["You"])
subgraph P["① Plan · Session 1"]
direction LR
PL["Planner"] <-->|"draft ↔ revise"| SE["Spec Evaluator\nfresh sub-agent"]
end
subgraph E["② Execute · Session 2 (one fresh sub-agent per checkpoint)"]
direction TB
GN["Generator\nfresh sub-agent"] <-->|"implement ↔ verify"| EV["Evaluator\nfresh sub-agent"]
EV -->|"all CPs pass"| RL["Cross-model peer review\nCodex ∣ Gemini"]
end
PR[/"Open PR"/]
RT[("Persistent retro\nlearnings → next task")]
YOU -->|"harness plan"| P
P -. "context isolation" .-> E
RL --> PR --> RT
RT -.->|"accumulated learnings"| YOU
classDef fresh stroke:#d97706,stroke-width:2px
class SE,GN,EV fresh
Legend — orange-bordered nodes are fresh independent sub-agents (drift firewall); dashed arrows are cross-session / cross-task information flows that carry no shared context.
The model running each role is decoupled from the model hosting the session — that's why the same pipeline works whether you start in Claude Code or Codex.
| Role | Who plays it | Notes |
|---|---|---|
| Orchestrator host (Session 1 + 2) | Claude Code CLI or Codex CLI | Symmetric. Recommended split: Claude Code for Session 1, Codex for Session 2. |
| Spec Evaluator | Claude (sub-agent or via claude-agent-invoke.sh) | Stable across hosts. |
| Generator | Active host LLM (Claude or Codex) | Inherits the host. |
| Evaluator / E2E / Retro | Claude (sub-agent or via claude-agent-invoke.sh) | Engine rejects same-context self-evaluation. |
| review-loop peer (cross-model gate) | codex CLI or gemini CLI (allowlisted) | Claude is not a peer here by design: same-vendor review would defeat the cross-model purpose. |
Heads-up on the peer allowlist: the bundled `review-loop` skill enforces `peer ∈ {codex, gemini}` in preflight. If Claude is hosting, the peer is naturally a different vendor; if Codex is hosting, picking `codex` still gives you a fresh isolated context (different `CODEX_HOME`, no MCP, stripped credentials), and `gemini` gives you a true cross-vendor read.
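The allowlist rule above can be pictured as a small preflight function. This is a minimal sketch and not the plugin's actual code: the function name `preflight_peer`, the vendor lookup tables, and the returned labels are all assumptions made for illustration. Only the `{codex, gemini}` allowlist comes from the README.

```python
# Hypothetical sketch of a peer preflight check; not the plugin's actual code.
ALLOWED_PEERS = {"codex", "gemini"}  # allowlist stated in the README

# Assumed lookup tables, used only for this illustration.
HOST_VENDOR = {"claude": "anthropic", "codex": "openai"}
PEER_VENDOR = {"codex": "openai", "gemini": "google"}


def preflight_peer(host: str, peer: str) -> str:
    """Reject non-allowlisted peers and report what isolation the pairing gives."""
    if peer not in ALLOWED_PEERS:
        raise ValueError(f"peer must be one of {sorted(ALLOWED_PEERS)}, got {peer!r}")
    if host not in HOST_VENDOR:
        raise ValueError(f"unknown host {host!r}")
    if HOST_VENDOR[host] == PEER_VENDOR[peer]:
        # Same vendor: the review still runs in a fresh, isolated context.
        return "same-vendor, fresh isolated context"
    return "cross-vendor"
```

Under these assumptions, `preflight_peer("codex", "gemini")` reports a cross-vendor pairing, while `preflight_peer("claude", "claude")` is rejected before any review starts.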
| Concern | Typical multi-agent loop | This Harness skill |
|---|---|---|
| Context drift | One growing context across plan → code → review | Two-session split + fresh sub-agent per checkpoint (eigenbehavior reset) |
| Self-certification | LLM judges its own output | harness-engine.sh blocks pass-checkpoint until the latest evaluation.md has verdict: PASS and the evaluator session id was not reused by any prior checkpoint. The same self-certification gate applies to pass-cohort: cohort status (passed / partial-pass) is computed from per-CP evaluation.md verdicts and engine-side state, never from LLM claims about cohort completion. |
| Echo-chamber review | Same model reviews itself | review-loop enforces a different-vendor peer (Codex or Gemini) and runs a fresh-session final approval pass so the closing verdict isn't biased by the iterative repair conversation |
| Black-box state | State implicit in chat history | All state on disk (.harness/<task-id>/, git-state.json), one engine script owns the phase machine, every transition is auditable |
| No memory across tasks | Each task starts cold | Persistent .harness/retro/ (git-tracked) accumulates error patterns, rule proposals, and skill defects — closes the cybernetic feedback loop |
| Tool-use bias | Lock-in to one CLI / one vendor | Orchestrator host and review peer are independently swappable; the same engine and gates run on Claude Code or Codex |
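The self-certification row describes two checks: a checkpoint passes only with a `PASS` verdict from an evaluator session that no prior checkpoint used, and cohort status is derived from per-checkpoint verdicts. The sketch below illustrates that logic in Python under stated assumptions. The real gate lives in harness-engine.sh, and the `Evaluation` record, the function names, and the `"failed"` status are hypothetical; the README names only `passed` and `partial-pass`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluation:
    """Hypothetical record of one parsed evaluation.md."""
    checkpoint: str
    verdict: str               # "PASS" or "FAIL"
    evaluator_session_id: str


def can_pass_checkpoint(latest: Evaluation, prior: list[Evaluation]) -> bool:
    """Allow the transition only for a PASS from a never-before-used session."""
    if latest.verdict != "PASS":
        return False
    used_sessions = {e.evaluator_session_id for e in prior}
    return latest.evaluator_session_id not in used_sessions


def cohort_status(evaluations: list[Evaluation]) -> str:
    """Derive cohort status from recorded verdicts, never from model claims."""
    passed = sum(1 for e in evaluations if e.verdict == "PASS")
    if evaluations and passed == len(evaluations):
        return "passed"
    if passed > 0:
        return "partial-pass"
    return "failed"  # assumed label; not named in the README
```

The point of the design is that both functions read only recorded state, so a model asserting "all checkpoints passed" has no effect on the outcome.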
review-loop: Cross-LLM iterative code review. Spawns a peer reviewer (Codex CLI or Gemini CLI) to independently review your changes. Claude evaluates the peer's findings, implements accepted fixes, and re-submits until both sides agree on the final code state. The human doesn't need to participate; progress can be followed in .review-loop/<session>/summary.md.
harness: Cybernetics-based multi-agent orchestration for complex tasks. Coordinates a Planner → Generator → Evaluator → Retro pipeline with fresh sub-agents per checkpoint (drift prevention) and persistent retro learning across tasks. Recommended flow: Claude Code plans the spec (Session 1), Codex executes autonomously (Session 2), and review-loop (Codex or Gemini CLI as peer) provides the cross-model quality gate before PR.
claude plugin marketplace add https://github.com/stone16/harness-engineering-skills
claude plugin install harness-engineering-skills@stometa
Verify:
claude plugin list | grep harness-engineering-skills
- Required: `git`, `python3`, Claude Code with the `superpowers` plugin installed.
- Peer reviewer (one of): `codex` CLI or `gemini` CLI, only needed if you use `review-loop` or `harness`'s cross-model review.
- Optional: `gh` CLI for PR-scoped review detection.
Inside a Claude Code session, once the plugin is installed:
/review-loop
Variants: "review loop with gemini"; "review loop, max 3 rounds"; "review loop for PR 42"; "review loop for commit abc123".
The peer reviewer is one of codex or gemini — set globally via .review-loop/config.json (peer_reviewer), or per-invocation. The loop iterates until peer and host reach CONSENSUS, then runs a fresh-session final approval pass before writing summary.md.
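The iterate-until-consensus behaviour can be sketched as a bounded loop. This is an illustration under assumptions, not the skill's implementation: the callables stand in for the peer CLI, the host's triage, and the fresh-session approval, and "consensus" is simplified here to mean a round in which the host accepts no further findings. The returned status strings are hypothetical.

```python
from typing import Callable


def run_review_loop(
    peer_review: Callable[[int], list],   # findings the peer reports in a given round
    host_accepts: Callable[[str], bool],  # host decides whether a finding is valid
    final_approval: Callable[[], bool],   # stands in for the fresh-session final pass
    max_rounds: int,
) -> str:
    """Iterate until no accepted findings remain, then ask for final approval."""
    for round_no in range(1, max_rounds + 1):
        findings = peer_review(round_no)
        accepted = [f for f in findings if host_accepts(f)]
        if not accepted:
            # Simplified consensus: nothing left that both sides want changed.
            return "APPROVED" if final_approval() else "REJECTED"
        # A real loop would implement the accepted fixes here before re-submitting.
    return "NO_CONSENSUS"
```

Keeping `final_approval` separate from the loop mirrors the stated intent: the closing verdict comes from a session that did not take part in the repair rounds.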
Two recommended entry patterns — both produce the identical pipeline shown in the diagram above:
Pattern A — Claude Code drives planning, Codex drives execution (recommended):
# Session 1, in Claude Code
harness plan <task-id> # interactive spec creation + spec review
# Session 2, in Codex (fresh process, planning context discarded by design)
harness execute <task-id> # checkpoints → E2E → review-loop → full-verify → PR → retro
Pattern B — single host (Claude Code or Codex) for everything:
harness plan <task-id>
harness continue # same host runs both phases
Pick the cross-model peer once in .harness/config.json:
{ "cross_model_review": true, "cross_model_peer": "gemini" }
harness will not let pass-checkpoint, pass-e2e, pass-review-loop, or pass-full-verify succeed unless the corresponding artifacts exist with the right verdict. The engine is the gatekeeper, not the LLM.
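An artifact-backed gate of this kind reduces to two questions: does the file exist, and does it record the required verdict? The sketch below is a hypothetical Python rendering of that idea. The function name and the exact `verdict: PASS` line format are assumptions, and the real checks are performed by the engine script.

```python
import re
from pathlib import Path

# Assumed format: a line such as "verdict: PASS" somewhere in the artifact.
VERDICT_LINE = re.compile(r"^\s*verdict:\s*(\w+)\s*$", re.IGNORECASE | re.MULTILINE)


def artifact_allows(path: Path, required: str = "PASS") -> bool:
    """True only if the artifact exists and its last verdict line matches."""
    if not path.is_file():
        return False  # a missing artifact never passes the gate
    verdicts = VERDICT_LINE.findall(path.read_text(encoding="utf-8"))
    return bool(verdicts) and verdicts[-1].upper() == required
```

Reading the last verdict line means a later re-evaluation supersedes an earlier one in the same file; that is a choice made for this sketch, not documented behaviour.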
Apache-2.0 — see LICENSE.
This repo is the public publication surface for a subset of Stometa's private stometa-skillset. Future batches will add more skills as they stabilize. Issues and pull requests are welcome on the GitHub tracker.