Skip to content

feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)#451

Open
PureWeen wants to merge 11 commits intomainfrom
fix/mixed-model-review-squad
Open

feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)#451
PureWeen wants to merge 11 commits intomainfrom
fix/mixed-model-review-squad

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

@PureWeen PureWeen commented Mar 29, 2026

Summary

Redesigns the built-in PR Review Squad multi-agent preset and adds fleet command diagnostics.

Architecture: 5x Opus Workers with Internal Multi-Model Dispatch

Each of the 5 workers runs on Opus and internally dispatches 3 parallel sub-agent reviews via the task tool:

  • claude-opus-4.6 -- deep reasoning, architecture, subtle logic bugs
  • claude-sonnet-4.6 -- fast pattern matching, common bug classes, security
  • gpt-5.3-codex -- alternative perspective, edge cases

The worker synthesizes a 2-of-3 consensus report. The orchestrator assigns one worker per PR (round-robin for multiple PRs) -- no fan-out.

Changes

  1. WorkerReviewPrompt -- instructs workers to dispatch 3 sub-agents with different models
  2. RoutingContext -- 1 worker per PR, no fan-out, round-robin distribution
  3. SharedContext -- consensus filter (2+ models agree), fix standards use git merge (not rebase)
  4. Fix process -- git merge instead of git rebase + force-with-lease, includes push verification
  5. Re-review tracking -- structured FIXED / STILL PRESENT / N/A tracking restored
  6. .squad/ deleted -- replaced by built-in preset; Squad integration tracked in Deep Squad Integration: SquadSessionProvider via ISessionProvider Plugin System #436
  7. /fleet diagnostics -- StartFleetAsync returns specific error reason
  8. Tests -- verify all 5 workers are Opus, prompts contain 3 sub-agent models, merge not rebase

Why 5x Opus?

Workers orchestrate sub-agent dispatch -- Opus excels at this. Model diversity happens at the sub-agent level inside each worker.

@PureWeen PureWeen force-pushed the fix/mixed-model-review-squad branch from 32b4d6b to 91041af Compare March 29, 2026 07:08
PureWeen and others added 2 commits March 29, 2026 09:25
….squad/

Replace 5 identical Sonnet workers with 3 diverse-model workers:
- Worker 1: Claude Opus (deep reasoning, subtle bugs)
- Worker 2: Claude Sonnet (fast pattern matching, common bugs)
- Worker 3: GPT Codex (alternative perspective, edge cases)

Model diversity is now achieved at the worker level instead of
relying on workers to internally dispatch sub-agents. The
orchestrator synthesizes consensus (2-of-3 filter) from the
different model perspectives.

Also removes .squad/ directory — the useful content (routing rules,
review standards) is now baked into the built-in preset. Real Squad
integration will come via SquadSessionProvider (issue #436).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
User feedback: 3 workers was too few. Restored to 5 workers with
mixed models for better consensus coverage. Updated consensus
filter references to 2-of-5.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the fix/mixed-model-review-squad branch from baea1ae to ae642e5 Compare March 29, 2026 14:25
PureWeen and others added 2 commits March 29, 2026 09:31
Each worker dispatches 3 sub-agents (Opus, Sonnet, Codex) via the
task tool for consensus. The orchestrator assigns 1 worker per PR
and distributes round-robin — no fan-out of multiple workers to
the same PR. All workers run on Opus (best at orchestrating
sub-agent dispatch).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Return specific error (session not found, not connected, processing,
RPC error) instead of generic failure message.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the fix/mixed-model-review-squad branch from ae642e5 to b3fe6fe Compare March 29, 2026 14:31
@PureWeen
Copy link
Copy Markdown
Owner Author

🤖 Multi-Model Code Review — PR #451

feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)
Reviewed by: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.3-Codex · 2-of-3 consensus applied


🔴 CRITICAL — Worker array contradicts PR description: 5×Opus, not 3 diverse models

File: PolyPilot/Models/ModelCapabilities.cs:327
Flagged by: Opus, Sonnet, Codex (unanimous)

The diff changes workers from 5×claude-sonnet-4.6 to 5×claude-opus-4.6. The PR description claims "Replace 5 identical Sonnet workers with 3 diverse-model workers (Opus + Sonnet + Codex)." The code keeps 5 workers, all Opus. The description string in the code even says "5 reviewers — each does multi-model consensus (Opus + Sonnet + Codex)", confirming the code and PR description describe two different architectures.

This is also a cost regression — 5 Opus workers each dispatching 3 sub-agents = 20 model calls per review, with 10 at premium Opus tier.


🟡 MODERATE — git rebase + --force-with-lease in worker prompt contradicts deleted safety rules

File: PolyPilot/Models/ModelCapabilities.cs:303-304
Flagged by: Opus, Sonnet

The fix instructions tell workers: git rebase origin/main then push with --force-with-lease. The deleted .squad/decisions.md explicitly banned both: "NEVER force push — no --force, --force-with-lease, or any force variant" and "ALWAYS integrate with git merge — never git rebase." These rules existed because rebase+force-push caused commits to land on the wrong remote. The PR deletes the safety documents while restoring the dangerous pattern.


🟡 MODERATE — .squad/ deletion removes push-to-pr.sh safety script with no replacement

File: .squad/push-to-pr.sh (deleted)
Flagged by: Opus, Sonnet, Codex (unanimous)

The 58-line script verified push targets, confirmed remote/branch matching, and validated pushes landed correctly. The embedded prompt's fix instructions are a brief 2-line "checkout, rebase, push" with no verification step. Workers pushing to contributor forks without verification is the exact failure mode decisions.md documented.


🟡 MODERATE — Test coverage insufficient for the change

File: PolyPilot.Tests/SessionOrganizationTests.cs:2531
Flagged by: Opus, Sonnet, Codex (unanimous)

The test only changes WorkerModels[0] from sonnet to opus. It still asserts WorkerModels.Length == 5 (unchanged). It doesn't verify models at indices 1-4, doesn't test model diversity, and doesn't validate the description matches reality. Missing tests:

  • Worker count matches intent (3 vs 5)
  • Model diversity (Opus + Sonnet + Codex all present)
  • Worker prompt content (rebase vs merge)
  • StartFleetAsync new tuple return type

🟡 MODERATE — Cross-worker consensus filter weakened/undefined

File: PolyPilot/Models/ModelCapabilities.cs:280-292
Flagged by: Opus, Sonnet

Old: 5 workers, 2+ agree = 40% threshold. New: 3 sub-agent models, 2+ agree = 67% threshold. The higher threshold means fewer findings survive consensus. Additionally, the orchestrator's routing context says "Assign ONE worker per PR" but has no guidance for merging findings if multiple workers review the same PR.


🟢 MINOR — Raw exception messages surfaced in UI

File: PolyPilot/Services/CopilotService.cs:~1989-2008, Dashboard.razor:~2222-2226
Flagged by: Sonnet, Codex

ex.Message from RPC exceptions is written directly into chat history via ChatMessage.ErrorMessage($"Failed to start fleet mode: {error}"). This can expose internal endpoints, paths, or credential fragments. The message is persisted to events.jsonl. Fix: log full exception but return a generic "RPC error. Check logs." to the caller.


🟢 MINOR — Re-review tracking removed

File: PolyPilot/Models/ModelCapabilities.cs:307-315 (deleted §6)
Flagged by: Opus, Sonnet

The old prompt had a "Re-Review Process" section with structured FIXED / STILL PRESENT / N/A tracking. The replacement is a one-liner: "re-run the 3-model review on the updated diff." Without explicit finding-status tracking, fixed issues may be re-reported and still-present issues may be missed.


📋 Additional Notes

  • CI: ⚠️ No checks configured for this branch
  • Prior review comments: None
  • Positive: The StartFleetAsync refactor from Task<bool> to Task<(bool, string?)> is clean — specific error messages with proper Dashboard integration.

Recommended action: 🔴 Do not merge

Two blockers:

  1. The feature wasn't implemented. Code delivers 5×Opus, not 3×mixed. Description, code string, and implementation are three different architectures. Decide on one and align all three.
  2. Safety regression. The deleted .squad/decisions.md and push-to-pr.sh documented hard-won lessons from wrong-remote pushes. Either keep the safety script or embed equivalent git merge + verification in the prompt. Do not restore rebase + --force-with-lease.

PureWeen and others added 3 commits March 29, 2026 09:49
The fleet review incorrectly flagged these as non-existent because
they aren't in PolyPilot's internal registry, but these are CLI
model IDs passed to the task tool — the CLI supports them.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace git rebase + force-with-lease with git merge in worker
  fix instructions and SharedContext (safety regression)
- Add push verification step (equivalent to deleted push-to-pr.sh)
- Restore structured re-review tracking (FIXED/STILL PRESENT/N/A)
- Sanitize RPC exception in fleet error (don't leak internal paths)
- Improve test coverage: verify all 5 models are Opus, verify worker
  prompts contain all 3 sub-agent model names, verify merge not rebase
  in SharedContext, verify 1-worker-per-PR routing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

🤖 Multi-Model Code Review (Re-Review) — PR #451

feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)
Reviewed by: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.3-Codex · 2-of-3 consensus applied


Previous Findings Status

# Previous Finding Status Models
1 🔴 Worker array 5×Opus vs described 3×mixed FIXED Opus ✅, Sonnet ✅, Codex ✅
2 🟡 git rebase + --force-with-lease contradicting deleted safety rules FIXED Opus ✅, Sonnet ✅, Codex ✅
3 🟡 push-to-pr.sh deleted with no replacement ⚠️ PARTIALLY FIXED Opus ⚠️, Sonnet ⚠️, Codex ❌
4 🟡 Insufficient test coverage ⚠️ PARTIALLY FIXED Opus ⚠️, Sonnet ✅, Codex ⚠️
5 🟡 Consensus filter weakened FIXED Opus ✅, Sonnet ✅, Codex ❌
6 🟢 Raw exception messages in UI FIXED Opus ✅, Sonnet ✅, Codex ✅
7 🟢 Re-review tracking removed FIXED Opus ✅, Sonnet ✅, Codex ✅

Summary: 5 of 7 previous findings are fully resolved. 2 are partially addressed (see details below).


Remaining Issues

⚠️ Finding #3: push-to-pr.sh — Inline verification weaker than script

Flagged by: Opus, Sonnet, Codex (unanimous)

Push verification is now inline in the worker prompt (steps 5–6: git fetch origin <branch> && git log). This replaces the deleted 58-line bash script that did SHA comparison and fork-remote detection. Text instructions are less deterministic than set -e scripted checks, but gh pr checkout handles remote tracking, so the gap is narrower than before. Acceptable for merge — the core safety (merge-not-rebase, gh pr checkout) is preserved.

⚠️ Finding #4: Test coverage — Preset tests good, StartFleetAsync untested

Flagged by: Opus, Codex

Good preset test additions: 5×Opus verification, 3 sub-agent model strings, merge-not-rebase assertions, 1-worker-per-PR routing. However, the StartFleetAsync signature change from Task<bool> to Task<(bool, string?)> has no dedicated tests for the new error tuple propagation or Dashboard integration.


New Issues (2+ model consensus)

🟢 MINOR — Console.WriteLine with full stack trace in production

File: PolyPilot/Services/CopilotService.cs:~2007
Flagged by: Opus, Sonnet, Codex (unanimous)

Changed from Debug.WriteLine (stripped in Release) to Console.WriteLine (always present). Logs {ex} (full stack trace) instead of {ex.Message}. On iOS/Android, Console.WriteLine goes nowhere useful. Should use ILogger or keep Debug.WriteLine for the stack trace.


📋 Additional Notes

  • CI: ⚠️ No checks configured for this branch
  • Prior review comment addressed: Yes — the original review had 2 blockers (architecture mismatch + safety regression). Both are resolved: PR description now explicitly documents 5×Opus architecture with "Why 5x Opus?" section, and all fix instructions use git merge with push verification.
  • Notable single-model finding (Sonnet only, did not meet consensus): StartFleetAsync lacks a IsRemoteMode guard — in remote mode state.Session is always null, so the new error message "Session is not connected" is misleading. Every other CopilotService method that touches state.Session checks IsRemoteMode first. Worth addressing but not a consensus finding.

Recommended action:Approve

Both previous blockers are resolved. The remaining items are minor:

The PR delivers what it promises: 5 Opus workers with internal 3-model consensus, merge-based fix process, and structured re-review tracking.

@PureWeen
Copy link
Copy Markdown
Owner Author

🤖 Multi-Model Code Review (Fresh Review #3) — PR #451

feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)
Reviewed by: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.3-Codex · 2-of-3 consensus applied


Consensus Findings (flagged by 2+ models)

🟡 MODERATE — Console.WriteLine replaces Debug.WriteLine in production path

File: PolyPilot/Services/CopilotService.cs:~2008
Flagged by: Opus 🟢, Sonnet 🟢, Codex 🟡 (unanimous)

Changed from System.Diagnostics.Debug.WriteLine (stripped in Release builds) to Console.WriteLine (always present). Logs full exception via {ex} (message + stack trace). On iOS/Android, Console.WriteLine goes nowhere useful and bypasses the app's ILogger pipeline. Should use ILogger or at minimum keep Debug.WriteLine for the stack trace while returning the sanitized error string to the caller.


🟡 MODERATE — Degraded consensus fallback undefined (1-model scenario)

File: PolyPilot/Models/ModelCapabilities.cs (WorkerReviewPrompt §2-3)
Flagged by: Opus 🟡, Sonnet 🟢

The prompt says "If a model is unavailable, proceed with the remaining models" and "require 2+ models (if only 2 ran, require both)." If 2 of 3 models fail, only 1 remains — the 2+ consensus threshold can never be met, silently filtering every finding and producing an empty report with no warning. Add explicit guidance: "If only 1 model ran, include all its findings with a low-confidence disclaimer."


🟢 MINOR — .squad/ deletion removes deterministic push safety script

File: .squad/push-to-pr.sh (deleted), .squad/decisions.md (deleted)
Flagged by: Sonnet 🟡, Codex 🟢

The 58-line bash script enforced push safety deterministically (branch verification, correct remote detection, SHA comparison). Replaced by prompt text in the worker instructions (steps 5-6: git fetch && git log). Prompt-based instructions are less reliable than set -e scripted checks. However, gh pr checkout handles remote tracking and the new prompt includes merge-not-rebase + push verification steps, so the gap is narrowed. Acceptable for merge but worth noting the regression in enforcement rigor.


Notable Single-Model Findings (did not meet consensus — informational only)

Finding Model Severity
StartFleetAsync has no IsRemoteMode guard — in remote mode state.Session is always null, so /fleet gives misleading "Session not connected" error instead of "not supported in remote mode" Codex 🟡
Force push prohibition (--force) not explicitly banned — old SharedContext had it, new one only removes recommendation Sonnet 🟡
"Verify Claims Against Code" section removed from worker prompt — weakens cross-checking of PR descriptions Codex 🟡

📋 Summary

  • CI: ⚠️ No checks configured for this branch
  • Test coverage: Good preset assertions added (5×Opus, 3 sub-agent models, merge-not-rebase, 1-worker-per-PR). StartFleetAsync tuple return type has no dedicated tests.
  • Prior reviews (2 comments): Review 1 found 7 issues (2 blockers: architecture mismatch + safety regression). Review 2 confirmed both blockers resolved, 5/7 fixed, 2/7 partially fixed. This fresh review confirms the PR is in good shape.
  • Architecture: Code, description, and tests are aligned — 5 Opus workers each dispatching 3 sub-agents (Opus/Sonnet/Codex) with 2-of-3 consensus.
  • Safety: git merge replaces git rebase everywhere. --force-with-lease removed. Push verification added.

Recommended action:Approve

The PR delivers what it promises. Two moderate findings remain (Console.WriteLine logging, degraded consensus fallback) but neither blocks merge. The Console.WriteLine is a minor production hygiene issue, and the 1-model consensus gap is an edge case (2 of 3 model APIs failing simultaneously) that can be addressed in a follow-up.

PureWeen and others added 3 commits March 29, 2026 10:41
Workers now:
- Post exactly ONE comment per PR (edit existing, never add new)
- Use adversarial consensus: when only 1 model flags an issue,
  the other models get a follow-up round to agree/disagree
- Handle degraded mode: if only 1 model ran, include findings
  with low-confidence disclaimer
- Structured re-review updates the existing comment in-place

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a user sends a message to an orchestrator that's already
dispatching workers, the message is now queued with a '📨 New task
queued' system message visible in the orchestrator's chat. After
the current dispatch completes (workers finish + synthesis), all
queued messages are drained and dispatched sequentially.

Previously, messages blocked silently on a semaphore — the user
got no feedback and the message appeared to vanish.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two fixes for stuck orchestrator when a worker goes silent:

1. Add 15-minute OrchestratorCollectionTimeout on Task.WhenAll.
   If any worker is stuck, force-complete it and proceed to
   synthesis with partial results. Previously the orchestrator
   would block for up to 60 minutes (WorkerExecutionTimeout).

2. Don't reset WatchdogCaseBLastFileSize to 0 on each SDK event.
   The stale-file-size detection needs prevSize > 0 to work on
   its first iteration. Resetting to 0 wasted one full 180s
   timeout cycle, compounding to ~540s (9 min) total recovery
   instead of ~360s (6 min).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

🤖 Multi-Model Code Review (Review #4 — New Code) — PR #451

feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)
Reviewed by: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.3-Codex · 2-of-3 consensus applied
Diff: +193/-254 across 11 files (significant new changes since last review)


⚠️ New code since last review

This PR has grown from 84 → 193 additions. Two new files changed:

  • CopilotService.Events.cs — watchdog stale-detection optimization
  • CopilotService.Organization.cs (+92 lines) — orchestrator prompt queue, collection timeout, force-completion of stuck workers, adversarial consensus prompt

🔴 CRITICAL — Force-completion via TrySetResult leaves IsProcessing stuck on timed-out workers

File: PolyPilot/Services/CopilotService.Organization.cs:1927-1931
Flagged by: Opus, Sonnet, Codex (unanimous)

The new orchestrator collection timeout force-completes stuck workers by calling ws.ResponseCompletion?.TrySetResult(...). This unblocks the orchestrator's Task.WhenAll, but does NOT clear IsProcessing or its 9 companion fields (ActiveToolCallCount, HasUsedToolsThisTurn, ProcessingStartedAt, ProcessingPhase, ToolCallCount, watchdog counters, etc.).

Impact: Force-completed workers show a perpetual "Thinking..." spinner. Users cannot send new prompts to them. The processing watchdog eventually cleans up (120-600s later), but this is a long, visible degradation.

Fix: Replace the bare TrySetResult with ForceCompleteProcessingAsync(a.WorkerName, ws, "collection timeout") — which already exists at line 2149 and is used by the watchdog, pre-dispatch cleanup, and other timeout paths. It handles the full INV-1 compliant cleanup: clears all 9+ fields, calls FlushCurrentResponse, fires OnSessionComplete, and resolves the TCS.


🟡 MODERATE — Double-await of allDone re-throws caught exception

File: PolyPilot/Services/CopilotService.Organization.cs:1935-1937
Flagged by: Opus, Sonnet

try { await allDone; } catch (OperationCanceledException) { }  // swallows OCE
// ...
var results = await allDone;  // ← re-throws the same OCE!

A faulted Task always re-throws on await. The first catch swallows the exception, but the second await re-throws it — killing the entire dispatch and skipping the synthesis phase. Additionally, only OperationCanceledException is caught; non-OCE worker failures (e.g., RpcException) propagate unhandled from both awaits.

Fix: Store the result from the first await path, or use allDone.IsCompletedSuccessfully ? allDone.Result : partial.


🟡 MODERATE — Unbounded orchestrator queue drain holds dispatch lock indefinitely

File: PolyPilot/Services/CopilotService.Organization.cs:1687 + 1708-1738
Flagged by: Opus, Codex

DrainOrchestratorQueueAsync runs inside the try/finally that holds dispatchLock. Each drained prompt calls SendViaOrchestratorAsync (up to 15 min each via OrchestratorCollectionTimeout). If prompts arrive faster than drain, the queue grows unbounded and the lock is held for N × 15 minutes. There's no cap on queue depth or drain duration.

Fix: Add a max drain count per cycle (e.g., 3), or release/re-acquire the lock between iterations.


🟡 MODERATE — Console.WriteLine replaces Debug.WriteLine in production path

File: PolyPilot/Services/CopilotService.cs:~2008
Flagged by: Opus, Codex

Changed from Debug.WriteLine (stripped in Release) to Console.WriteLine (always present, goes nowhere useful on iOS/Android). Logs full exception with stack trace via {ex}. Per project conventions, should use ILogger or the project's Debug() helper.


🟢 MINOR — Watchdog file-size preservation has narrow false-positive window

File: PolyPilot/Services/CopilotService.Events.cs:233-237
Flagged by: Opus, Sonnet

Not resetting WatchdogCaseBLastFileSize to 0 is a sound optimization (avoids wasting a 180s stale-detection cycle). However, if an SDK event arrives via JSON-RPC before events.jsonl is flushed to disk, the first watchdog check sees prevSize == currentSize as a false "stale" signal. Since WatchdogCaseBMaxStaleChecks=2 requires 2 consecutive stale checks, a single false positive won't force-complete. Safe in practice — no fix needed.


Notable Single-Model Findings (informational — did not meet consensus)

Finding Model Severity
GetOrchestratorSession reads Organization.Sessions (plain List) off UI thread in queuing + drain paths — race condition Sonnet 🟡
Queued prompts silently lost if cancellation fires between TryDequeue and ThrowIfCancellationRequested Opus 🟡
StartFleetAsync has no IsRemoteMode guard — misleading error in remote mode Opus 🟢
Test assertions coupled to exact prompt wording ("Adversarial Consensus") — brittle Opus 🟢

📋 Summary

  • CI: ⚠️ No checks configured for this branch
  • Test coverage: Preset tests are good. No tests for the new Organization.cs code (orchestrator queue, collection timeout, force-completion). This is the riskiest new code and would benefit from coverage.
  • Previous findings status: All 7 original findings from Review Polish UI, Rename Sessions, Markdown Output Support, Queued Messages #1 remain resolved (architecture aligned, merge-not-rebase, push verification, re-review tracking, error messages sanitized).
  • New in this version: Adversarial consensus prompt, comment deduplication rules, orchestrator prompt queue, collection timeout with force-completion, watchdog optimization.

Recommended action: ⚠️ Request changes

One blocker:

  1. 🔴 Force-completion must use ForceCompleteProcessingAsync — the bare TrySetResult at line 1929 violates the IsProcessing cleanup invariant (18 invariants, 13 PRs of fix cycles). This will cause stuck-spinner regressions for any worker that times out. Replace with the existing ForceCompleteProcessingAsync method.

Should also fix before merge:
2. 🟡 Double-await re-throws — the synthesis phase is skipped on any worker cancellation.
3. 🟡 Unbounded queue drain — add a cap to prevent lock starvation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant