full-review

Adversarial code review that escapes AI confirmation bias.

You are letting the AI mark its own homework. When you ask the same LLM that wrote your code to review it, you get a rubber stamp — the model is stuck in the same cognitive track that generated the bug in the first place.

What you get

/full-review <paths> — one command, runs on your changed files, emits structured YAML findings (severity S0/S1/S2, file:line, evidence, fix intent)
Round 1: 9-pass unified harness — one persistent LLM session does taxonomy scan → six focused perspectives (async, silent failure, state, validation, cross-file, observability) → sweep → merge. File context loaded once. Each pass sees what earlier passes found.
Round 2: cross-family review — if Claude wrote the code, Codex and Gemini review it. Same-family self-review has documented blind spots; cross-family catches the residual ~20%.
Self-improving taxonomy — ~110 abstract bug patterns at launch. Every bug found becomes a pattern future reviews auto-detect.
Measured 0.80–0.87 recall on a 15-bug production fixture. vs 0.40 for the "5 parallel specialists" pattern most plugins ship. Two replicas per config, semantic-similarity scoring against bug aliases.
Reproducible benchmark harness ships in review_bench/ — bring your own fixtures, run with any LLM via a shell wrapper.
Cross-runtime — Claude Code plugin AND Codex CLI skill. Same skill dir works for both.
Per-family blind-spot log (BLIND_SPOTS.md) — documents what Claude, Codex, and Gemini empirically miss, so you can pick the right cross-family pairing for your work.
MIT licensed, zero runtime dependencies beyond your LLM CLI.

The problem

If you use Claude Code or Codex CLI for anything non-trivial, you probably skip rigorous review. The existing prompts and parallel-agent plugins give you shallow noise. They miss logic errors because the reviewer is stuck in the same cognitive track that generated the bug in the first place. You pay the price in production.

The insight

Spawning five specialist agents in parallel — one for security, one for logic, one for style — feels rigorous. It isn't. Each agent reads the diff in isolation. None of them reason across the whole change. They overlap on the obvious bugs and miss every subtle one.

The fix is not more reviewers. It is stateful adversarial interrogation. One LLM session, nine sequential passes, then a cross-family check by a model that did not write the code.

The numbers

Measured against a 15-bug production fixture (S0/S1 severity, real bugs from a fix commit, not synthetic), two independent replicas per configuration:

Approach	Recall
5 parallel specialists (the status quo)	0.40
Single monolithic prompt	0.35
Single-shot checklist (taxonomy inline)	0.53
2 agents, union of findings	0.67
full-review — unified 9-pass harness	0.80–0.87

The jump from 0.53 to 0.80+ is not a bigger model. It is a persistent session that lets the reviewer build a mental model of the change before judging it.

Install

/plugin marketplace add Oldrich333/full-review
/plugin install full-review

Then, after you've written code:

/full-review <paths>

For Codex CLI:

git clone https://github.com/Oldrich333/full-review ~/.codex/skills/full-review

Output format

Findings are emitted as structured YAML, one entry per issue:

findings:
  - id: F-001
    file: path/to/module.py
    line_range: [128, 142]
    severity: S0           # S0 = correctness, S1 = quality, S2 = hygiene
    confidence: 0.9        # 0.0–1.0, filter by threshold
    symptom: One-line summary of what goes wrong
    invariant_violated: |
      The rule the code breaks, stated positively
    evidence: |
      Concrete reasoning chain — how the failure triggers
    patch_intent: One-line description of the fix
    taxonomy_hit: DOM-001  # which taxonomy pattern matched (when applicable)

Drop findings below your confidence threshold (default 0.7). Full audit log including below-threshold findings stored in the result directory.

The journey

The path from 0.40 to 0.80+ was not obvious. We tried the obvious things first. They failed.

v1 — Five parallel specialists. Recall 0.40. Looks rigorous on paper. In practice each reviewer does a shallow single-shot read, they overlap on obvious bugs, and they miss every subtle one. Five times the cost of one agent, worse recall than one good prompt.
v2 — One monolith prompt. Recall 0.35. Squeezed the five perspectives into one giant instruction. Attention diffused. Model produced a shallow grab bag. Worse.
v3 — Checklist. Recall 0.53. Externalised the bug taxonomy into a separate file, inlined it into the prompt as a mandatory checklist. +50 % from this one change. The taxonomy, not the model, is the leverage.
v4 — Unified harness. Recall 0.80–0.87. Made the checklist pass and the focused perspectives separate turns of the same session, not one monolithic prompt. One LLM. Nine turns. File context loaded once. Each later turn told "you already did X; find new issues in Y." Deduplication at the merge step.

The story the numbers tell: code review quality is a process choice, not a model upgrade.

What we did NOT build

No auto-fixer. Reviewer and fixer are separate roles on purpose. A reviewer without context makes good-intention wrong fixes. Only the author can judge bug vs by-design.
No IDE integration. The CLI is the agent's home. IDE hooks are lipstick.
No fine-tuned model. The taxonomy is the moat, not model weights. Works on any frontier LLM.

Repository layout

plugins/full-review/
  .claude-plugin/plugin.json
  skills/full-review/
    SKILL.md                      entry point (Claude / Codex plugin)
    TAXONOMY.md                   ~110 abstract bug patterns
    BLIND_SPOTS.md                per-LLM-family empirical misses
    HOW_TO_EXTEND.md              rules for adding to the taxonomy
    VERIFY_GREPS.md               grep patterns as pre/post gates
    REVIEWER_FIXER_CONTRACT.md    finding lifecycle
    CHANGELOG.md                  what changed, when, why
    runners/
      codex_cli.sh                R1 self-review runner
      claude_native.md            Claude Code Agent-tool invocation
      cross_family.sh             R2 cross-family orchestrator

review_bench/                     reproducible benchmark harness
  bench.py, unified_harness_run.py, run_reviewer.py, score.py
  lib/, configs/, fixtures/README.md

.claude-plugin/marketplace.json   plugin marketplace entry

Companion: ax-headers

Pairs with ax-headers — a one-line machine-readable header on every Python file that cuts AI context bloat by ~30 %. full-review automatically surfaces header drift during review. Neither depends on the other.

License

MIT. If full-review catches bugs on your codebase, file an issue with the numbers — helps calibrate the taxonomy for everyone.

Built by

@Oldrich333. Part of the Hive AI ecosystem alongside raisin (dense Python for LLMs) and ax-headers.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.claude-plugin		.claude-plugin
assets		assets
plugins/full-review		plugins/full-review
review_bench		review_bench
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

full-review

What you get

The problem

The insight

The numbers

Install

Output format

The journey

What we did NOT build

Repository layout

Companion: ax-headers

License

Built by

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

full-review

What you get

The problem

The insight

The numbers

Install

Output format

The journey

What we did NOT build

Repository layout

Companion: ax-headers

License

Built by

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages