A methodology for structured cross-agent review — using multiple AI models to catch what no single model catches alone.
A single AI agent analyzing a problem produces reasonable output. Two agents from different providers analyzing the same problem, then reviewing each other's work, produce output that is more thorough and more balanced, and that catches blind spots neither finds alone.
Cross-Talk is the protocol for making that happen reliably.
In a real project, Claude proposed a minimum-velocity filter that would have silently deleted real musical notes (quiet pianissimo passages). Codex caught this as a violation of the project's constitutional guarantees.
- Cost of the review: ~$2 in tokens
- Cost of the bug it prevented: user trust, potentially unrecoverable
That's the value proposition: review that approaches a senior engineer's second opinion, at API prices.
- Phase 1: DIVERGE -- Same task goes to Agent A and Agent B independently (they can't see each other)
- Phase 2: CROSS-REVIEW -- Each reviews the other's output
- Phase 3: CONVERGE -- Resolve disagreements into one final artifact
- Phase 4: IMPLEMENT -- Split the work, and cross-review the code too
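The first three phases above can be sketched in a few lines of Python. This is a minimal illustration, not part of the protocol itself: `Agent`, `CrossTalkResult`, `run_cross_talk`, and the `converge` callback are hypothetical names, and each agent is assumed to be any function that takes a prompt and returns text. Phase 4 (implementation) is left out because it depends on the project.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a model call: takes a prompt, returns text.
Agent = Callable[[str], str]


@dataclass
class CrossTalkResult:
    drafts: dict[str, str]   # Phase 1 outputs, keyed by agent label
    reviews: dict[str, str]  # Phase 2 outputs, keyed by reviewer label
    final: str               # Phase 3 converged artifact


def run_cross_talk(
    task: str,
    agent_a: Agent,
    agent_b: Agent,
    converge: Callable[[dict[str, str], dict[str, str]], str],
) -> CrossTalkResult:
    # Phase 1: DIVERGE. Both agents receive the identical task and nothing else,
    # so neither draft can be anchored on the other.
    drafts = {"A": agent_a(task), "B": agent_b(task)}

    # Phase 2: CROSS-REVIEW. Each agent is shown only the other agent's draft.
    reviews = {
        "A": agent_a(f"Review this analysis of the task {task!r}:\n\n{drafts['B']}"),
        "B": agent_b(f"Review this analysis of the task {task!r}:\n\n{drafts['A']}"),
    }

    # Phase 3: CONVERGE. Resolution is delegated to the caller (a human or a
    # further model call); this sketch does not prescribe how to resolve.
    final = converge(drafts, reviews)
    return CrossTalkResult(drafts=drafts, reviews=reviews, final=final)
```

Keeping `converge` as a parameter is deliberate: in the manual version of the protocol a person makes that call, and the sketch should not hide that step behind an automatic merge.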
Different models have different training data, different reasoning patterns, different blind spots. Agreement builds confidence. Disagreement marks exactly where the interesting engineering decisions live. Both outcomes are valuable. Neither is possible with a single agent.
No framework required. No code to install. You need:
- Two AI agents from different model families (e.g., Claude + Codex, Claude + GPT, Claude + Gemini)
- A shared context folder with project docs both agents can read
- A specific question narrow enough for a concrete artifact (not "make it better" but "classify these 6 gaps as safe-fix vs. editorial-choice")
- An artifact contract defining output format upfront so results are comparable
Then follow the step-by-step protocol.
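One way to make the artifact contract concrete is to write it down as data and check each agent's output against it before comparing results. The sketch below is an assumption about what such a contract could look like: `ArtifactContract`, `missing_headings`, `unknown_labels`, and the heading names in the usage example are hypothetical; only the `safe-fix` and `editorial-choice` labels come from the example question above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactContract:
    question: str                      # the narrow question both agents answer
    required_headings: tuple[str, ...]  # markdown headings every artifact must have
    allowed_labels: tuple[str, ...]     # the only classification labels permitted


def missing_headings(contract: ArtifactContract, markdown_text: str) -> list[str]:
    """Return required headings that do not appear in the artifact."""
    present = {
        line.lstrip("#").strip()
        for line in markdown_text.splitlines()
        if line.startswith("#")
    }
    return [h for h in contract.required_headings if h not in present]


def unknown_labels(contract: ArtifactContract, labels: list[str]) -> list[str]:
    """Return labels an agent used that the contract does not allow."""
    return [label for label in labels if label not in contract.allowed_labels]


# Usage with hypothetical heading names:
contract = ArtifactContract(
    question="Classify these 6 gaps as safe-fix vs. editorial-choice",
    required_headings=("Classification", "Rationale"),
    allowed_labels=("safe-fix", "editorial-choice"),
)
draft = "# Classification\n- gap 1: safe-fix\n"
print(missing_headings(contract, draft))  # the draft lacks a Rationale heading
```

Two artifacts that both pass the same contract can be compared section by section, which is the point of fixing the format upfront.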
Ask: "Would I want a second opinion from a senior engineer before shipping this?"
| Use Cross-Talk | Don't use Cross-Talk |
|---|---|
| Architecture decisions | Typo fixes |
| Safety/trust-sensitive changes | Log line additions |
| Ambiguous requirements | Style/formatting changes |
| Risky PRs before merge | Mechanical refactors |
| Compliance/regulatory review | Well-understood bug fixes |
| Approach | Cost | Quality |
|---|---|---|
| Single agent, one pass | $0.30-0.50 | Good |
| Single agent, self-review | $0.60-1.00 | Better |
| Cross-Talk standard | $1.50-2.50 | Best among low-cost AI-only options |
| Human senior engineer review | $200-400 | Potentially best, context-dependent |
Full cost breakdown and optimization strategies in Cost & Token Economics.
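The ratios implied by the table can be checked with a few lines of arithmetic. The dollar ranges below are copied from the table; the helper names (`midpoint`, `like_for_like`) are illustrative only.

```python
# Cost ranges in USD, (low, high), taken from the table above.
SINGLE_PASS = (0.30, 0.50)
CROSS_TALK = (1.50, 2.50)
HUMAN_REVIEW = (200.0, 400.0)


def midpoint(cost_range: tuple[float, float]) -> float:
    """Middle of a (low, high) cost range."""
    low, high = cost_range
    return (low + high) / 2


def like_for_like(
    numer: tuple[float, float], denom: tuple[float, float]
) -> tuple[float, float]:
    """Cost ratio comparing low end with low end and high end with high end."""
    return (numer[0] / denom[0], numer[1] / denom[1])


cross_vs_single = like_for_like(CROSS_TALK, SINGLE_PASS)
human_vs_cross = like_for_like(HUMAN_REVIEW, CROSS_TALK)
print(f"Cross-Talk vs single pass: {cross_vs_single[0]:.1f}x to {cross_vs_single[1]:.1f}x")
print(f"Human review vs Cross-Talk: {human_vs_cross[0]:.0f}x to {human_vs_cross[1]:.0f}x")
```

With these figures, Cross-Talk comes out at about 5x a single pass at both ends of the range, and a human review at roughly 133x to 160x the cost of Cross-Talk.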
- Diverge Before You Converge -- Get independent analysis before combining. Never let Agent B see Agent A's output during generation.
- Frame Tasks, Not Agents -- Task clarity matters more than model choice. A well-framed task given to any capable model beats a vague task given to "the best" model.
- Review the Disagreements -- Agreements validate. Disagreements illuminate. The places where agents diverge are where the most interesting decisions live.
- Artifacts Over Chat -- Every analysis produces a durable document, not conversation. Documents can be reviewed, diffed, archived, and handed to future agents.
- Know When Not to Use It -- Cross-Talk costs 3-5x a single-agent pass. Use it for judgment calls, not mechanical work.
- Shared Contract -- Both agents get the same project principles. Disagreement stays productive because it's grounded in shared reference material.
- Narrow Question, Structured Output -- Specific questions produce comparable artifacts. Vague questions produce incomparable ones.
- Classification Before Implementation -- Separate "clearly bugs" from "judgment calls" from "feature requests" before writing any code.
- File-Based Handoff -- Agents communicate through markdown files in the repo, not chat. Chat is ephemeral. Files survive session boundaries.
- The "Changed My Mind" Flag -- Agents explicitly flag where they changed position after review. These are the highest-value outputs.
Full patterns and anti-patterns in Patterns.
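The file-based handoff and the "changed my mind" flag from the list above can be combined in one small helper. This is a sketch under assumptions: the `Finding` fields, the `review-<agent>.md` filename, and the `[CHANGED MY MIND]` marker are hypothetical conventions, not a format the methodology prescribes.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Finding:
    item: str                      # what was reviewed, e.g. a gap or a proposal
    position: str                  # the reviewer's current position on it
    changed_my_mind: bool = False  # True if the position changed after review
    reason: str = ""               # optional explanation


def write_review(folder: Path, reviewer: str, findings: list[Finding]) -> Path:
    """Write findings to a markdown file so they survive the chat session."""
    folder.mkdir(parents=True, exist_ok=True)
    lines = [f"# Review by {reviewer}", ""]
    for finding in findings:
        flag = " [CHANGED MY MIND]" if finding.changed_my_mind else ""
        lines.append(f"- {finding.item}: {finding.position}{flag}")
        if finding.reason:
            lines.append(f"  - Reason: {finding.reason}")
    path = folder / f"review-{reviewer.lower()}.md"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

Because the flag is plain text in a committed file, a later session or a different agent can search for it without any access to the original conversation.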
- Same model twice -- Same model = same blind spots. You need actual model diversity.
- Letting agents peek -- If Agent B sees Agent A's work during generation, you get anchoring bias instead of independence.
- Converging by averaging -- "Agent A says 40, Agent B says 60, let's use 50" is avoidance, not convergence. Understand why they disagree.
- Using it for everything -- Not every change needs a $2 review. The second-opinion threshold exists for a reason.
- Review in chat -- Chat is session-scoped. When the implementer starts a new session, the findings are gone.
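To avoid converging by averaging, it helps to separate the items the agents agree on from the items they dispute, and to send only the disputed ones to the convergence step. The sketch below assumes each agent's output has already been reduced to a mapping from item to label; `split_by_agreement` and the `gap-N` keys are hypothetical.

```python
from typing import Optional


def split_by_agreement(
    a: dict[str, str], b: dict[str, str]
) -> tuple[dict[str, str], dict[str, tuple[Optional[str], Optional[str]]]]:
    """Split two agents' classifications into agreed and disputed items.

    An item that only one agent classified counts as disputed, with None
    standing in for the missing position.
    """
    agreed: dict[str, str] = {}
    disputed: dict[str, tuple[Optional[str], Optional[str]]] = {}
    for key in sorted(a.keys() | b.keys()):
        pos_a, pos_b = a.get(key), b.get(key)
        if pos_a is not None and pos_a == pos_b:
            agreed[key] = pos_a
        else:
            # Keep both positions side by side; no merging or averaging here.
            disputed[key] = (pos_a, pos_b)
    return agreed, disputed


agent_a = {"gap-1": "safe-fix", "gap-2": "editorial-choice"}
agent_b = {"gap-1": "safe-fix", "gap-2": "safe-fix"}
agreed, disputed = split_by_agreement(agent_a, agent_b)
print(disputed)  # gap-2 is the item that needs a reasoned decision
```

The function deliberately returns both positions for each disputed item instead of picking one, so the reason for the disagreement still has to be worked out.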
Full 4-phase protocol. Claude and Codex independently analyzed 6 transcription quality gaps in a music notation app. Cross-review caught a velocity filter that violated the project's constitutional guarantees. Two Codex-originated ideas (transformation diagnostics, anti-overcleaning guard) were adopted into the final plan.
Lightweight 2-pass variation. Claude implemented, Codex reviewed. The review caught an octave-leap false positive that the implementer's own tests missed (constitutional severity). It also showed that chat-based review doesn't work, which led to the file-based handoff protocol.
| | Cross-Talk | AutoGen / CrewAI | Academic LLM Debate | adversarial-review |
|---|---|---|---|---|
| Different providers required | Yes | Optional | Usually no | Yes |
| Artifact-based (files, not chat) | Yes | No | No | Yes |
| Prescriptive protocol | Yes | Build your own | Theoretical | Partial |
| Failure modes documented | Yes | No | No | No |
| Cost economics included | Yes | No | No | No |
| Real case studies with costs | Yes | No | Benchmarks only | No |
| No code/framework required | Yes | Code required | Code required | Code required |
| When NOT to use it | Yes | No | No | No |
Cross-Talk is the methodology layer that sits above orchestration frameworks. AutoGen gives you plumbing. CrewAI gives you roles. Cross-Talk tells you what to do, why, and when not to.
| Phase | Often a good fit | Why |
|---|---|---|
| Architecture & tradeoffs | Claude | Principle-based reasoning |
| Implementation | Codex | Token-efficient, precise execution |
| Adversarial review | Claude | Catches conceptual issues |
| Test generation | Codex | Pattern-following, mechanical |
| Safety/constitutional review | Claude | "Should we?" reasoning |
| Parallel coding | Codex | Worktree isolation model |
These are starting heuristics, not rules. Task framing matters more than model choice. Full profiles in Agent Profiles.
Automation is optional. The protocol is manual by default (two terminals, shared folder), but it can be automated with:
- LangGraph -- Define agents as nodes, review loops as edges
- Claude Agent SDK (Anthropic) / OpenAI Agents SDK -- Custom multi-agent pipelines
- MCP -- Shared tool definitions across agent providers
- Claude Code sub-agents -- Fan-out implementation after convergence
See Tooling Landscape for the full ecosystem map.
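Before reaching for any of the tools above, the diverge step can be automated with the Python standard library alone. The sketch below is framework-agnostic and does not use any of the listed SDKs: `diverge` is a hypothetical helper, and each agent is assumed to be a plain function that wraps whatever client you use.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def diverge(task: str, agents: dict[str, Callable[[str], str]]) -> dict[str, str]:
    """Send the same task to every agent in parallel and collect the drafts.

    Each agent function receives only the task string, so no draft can leak
    into another agent's context during generation.
    """
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {name: pool.submit(fn, task) for name, fn in agents.items()}
        # result() blocks until that agent has finished (or re-raises its error).
        return {name: future.result() for name, future in futures.items()}
```

Threads are a reasonable fit here on the assumption that the agent functions spend their time waiting on network calls; `agents` must contain at least one entry.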
- docs/
  - MANIFESTO.md: Core principles -- why cross-agent review works
  - METHODOLOGY.md: Step-by-step protocol with artifact templates
  - PATTERNS.md: Patterns that work + anti-patterns to avoid
  - AGENT_PROFILES.md: Agent strengths and task assignment guide
  - COST_AND_TOKENS.md: Token economics and optimization strategies
  - TOOLING_LANDSCAPE.md: Current tools and ecosystem
- examples/
  - mozartino-*.md: Real-world case studies
- This README (you're here)
- Manifesto -- understand why
- Methodology -- learn the protocol
- Patterns -- learn what works and what doesn't
- Examples -- see it applied to a real project
Cross-Talk was developed independently from hands-on project experience. We later found strong alignment with academic research:
- Irving et al. 2018 — "AI Safety via Debate" (the foundational paper, co-authored by Dario Amodei)
- Du et al. 2023 — Multiagent debate reduces hallucinations and improves reasoning (ICML 2024)
- A-HMAD 2025 — Heterogeneous model debate produces 30%+ fewer errors (directly validates the cross-model thesis)
- ChatEval 2023 — Diverse agent roles outperform single-judge evaluation (ICLR 2024)
- D3 2024 — Cost-aware debate; 3-7 agents is the sweet spot
The Ralph Wiggum Technique influenced the artifact-based persistence approach.
Full discussion in Manifesto — Related Work.
Early-stage methodology, born from real multi-agent work on one production application (2026). N=1, but the patterns are replicable, the academic foundation is strong, and the case study includes real costs and real bugs caught. More case studies welcome — that's the fastest way to strengthen this.
Contributions, case studies, and tooling experiments welcome.
- Keep methodology prescriptive and concise
- All claims should reference real experience or cited sources
- Case studies from real projects are welcome (anonymize if needed)
- Date-stamp research findings -- tooling changes fast
- See CLAUDE.md for agent contribution guidelines
MIT