Skip to content

Oldrich333/full-review

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

5 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

full-review β€” adversarial code review for Claude Code and Codex CLI

full-review

Adversarial code review that escapes AI confirmation bias.

Claude Code Skill Codex CLI recall License


You are letting the AI mark its own homework. When you ask the same LLM that wrote your code to review it, you get a rubber stamp β€” the model is stuck in the same cognitive track that generated the bug in the first place.

What you get

  • /full-review <paths> β€” one command, runs on your changed files, emits structured YAML findings (severity S0/S1/S2, file:line, evidence, fix intent)
  • Round 1: 9-pass unified harness β€” one persistent LLM session does taxonomy scan β†’ six focused perspectives (async, silent failure, state, validation, cross-file, observability) β†’ sweep β†’ merge. File context loaded once. Each pass sees what earlier passes found.
  • Round 2: cross-family review β€” if Claude wrote the code, Codex and Gemini review it. Same-family self-review has documented blind spots; cross-family catches the residual ~20%.
  • Self-improving taxonomy β€” ~110 abstract bug patterns at launch. Every bug found becomes a pattern future reviews auto-detect.
  • Measured 0.80–0.87 recall on a 15-bug production fixture. vs 0.40 for the "5 parallel specialists" pattern most plugins ship. Two replicas per config, semantic-similarity scoring against bug aliases.
  • Reproducible benchmark harness ships in review_bench/ β€” bring your own fixtures, run with any LLM via a shell wrapper.
  • Cross-runtime β€” Claude Code plugin AND Codex CLI skill. Same skill dir works for both.
  • Per-family blind-spot log (BLIND_SPOTS.md) β€” documents what Claude, Codex, and Gemini empirically miss, so you can pick the right cross-family pairing for your work.
  • MIT licensed, zero runtime dependencies beyond your LLM CLI.

The problem

If you use Claude Code or Codex CLI for anything non-trivial, you probably skip rigorous review. The existing prompts and parallel-agent plugins give you shallow noise. They miss logic errors because the reviewer is stuck in the same cognitive track that generated the bug in the first place. You pay the price in production.

The insight

Spawning five specialist agents in parallel β€” one for security, one for logic, one for style β€” feels rigorous. It isn't. Each agent reads the diff in isolation. None of them reason across the whole change. They overlap on the obvious bugs and miss every subtle one.

The fix is not more reviewers. It is stateful adversarial interrogation. One LLM session, nine sequential passes, then a cross-family check by a model that did not write the code.

The numbers

Measured against a 15-bug production fixture (S0/S1 severity, real bugs from a fix commit, not synthetic), two independent replicas per configuration:

Approach Recall
5 parallel specialists (the status quo) 0.40
Single monolithic prompt 0.35
Single-shot checklist (taxonomy inline) 0.53
2 agents, union of findings 0.67
full-review β€” unified 9-pass harness 0.80–0.87

The jump from 0.53 to 0.80+ is not a bigger model. It is a persistent session that lets the reviewer build a mental model of the change before judging it.

Install

/plugin marketplace add Oldrich333/full-review
/plugin install full-review

Then, after you've written code:

/full-review <paths>

For Codex CLI:

git clone https://github.com/Oldrich333/full-review ~/.codex/skills/full-review

Output format

Findings are emitted as structured YAML, one entry per issue:

findings:
  - id: F-001
    file: path/to/module.py
    line_range: [128, 142]
    severity: S0           # S0 = correctness, S1 = quality, S2 = hygiene
    confidence: 0.9        # 0.0–1.0, filter by threshold
    symptom: One-line summary of what goes wrong
    invariant_violated: |
      The rule the code breaks, stated positively
    evidence: |
      Concrete reasoning chain β€” how the failure triggers
    patch_intent: One-line description of the fix
    taxonomy_hit: DOM-001  # which taxonomy pattern matched (when applicable)

Drop findings below your confidence threshold (default 0.7). Full audit log including below-threshold findings stored in the result directory.

The journey

The path from 0.40 to 0.80+ was not obvious. We tried the obvious things first. They failed.

  • v1 β€” Five parallel specialists. Recall 0.40. Looks rigorous on paper. In practice each reviewer does a shallow single-shot read, they overlap on obvious bugs, and they miss every subtle one. Five times the cost of one agent, worse recall than one good prompt.
  • v2 β€” One monolith prompt. Recall 0.35. Squeezed the five perspectives into one giant instruction. Attention diffused. Model produced a shallow grab bag. Worse.
  • v3 β€” Checklist. Recall 0.53. Externalised the bug taxonomy into a separate file, inlined it into the prompt as a mandatory checklist. +50 % from this one change. The taxonomy, not the model, is the leverage.
  • v4 β€” Unified harness. Recall 0.80–0.87. Made the checklist pass and the focused perspectives separate turns of the same session, not one monolithic prompt. One LLM. Nine turns. File context loaded once. Each later turn told "you already did X; find new issues in Y." Deduplication at the merge step.

The story the numbers tell: code review quality is a process choice, not a model upgrade.

What we did NOT build

  • No auto-fixer. Reviewer and fixer are separate roles on purpose. A reviewer without context makes good-intention wrong fixes. Only the author can judge bug vs by-design.
  • No IDE integration. The CLI is the agent's home. IDE hooks are lipstick.
  • No fine-tuned model. The taxonomy is the moat, not model weights. Works on any frontier LLM.

Repository layout

plugins/full-review/
  .claude-plugin/plugin.json
  skills/full-review/
    SKILL.md                      entry point (Claude / Codex plugin)
    TAXONOMY.md                   ~110 abstract bug patterns
    BLIND_SPOTS.md                per-LLM-family empirical misses
    HOW_TO_EXTEND.md              rules for adding to the taxonomy
    VERIFY_GREPS.md               grep patterns as pre/post gates
    REVIEWER_FIXER_CONTRACT.md    finding lifecycle
    CHANGELOG.md                  what changed, when, why
    runners/
      codex_cli.sh                R1 self-review runner
      claude_native.md            Claude Code Agent-tool invocation
      cross_family.sh             R2 cross-family orchestrator

review_bench/                     reproducible benchmark harness
  bench.py, unified_harness_run.py, run_reviewer.py, score.py
  lib/, configs/, fixtures/README.md

.claude-plugin/marketplace.json   plugin marketplace entry

Companion: ax-headers

Pairs with ax-headers β€” a one-line machine-readable header on every Python file that cuts AI context bloat by ~30 %. full-review automatically surfaces header drift during review. Neither depends on the other.

License

MIT. If full-review catches bugs on your codebase, file an issue with the numbers β€” helps calibrate the taxonomy for everyone.

Built by

@Oldrich333. Part of the Hive AI ecosystem alongside raisin (dense Python for LLMs) and ax-headers.

About

πŸ” full-review β€” adversarial code review skill for Claude Code / Codex CLI. 0.80+ recall vs 0.40 for parallel specialists. Breaks AI confirmation bias.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors