feat: mixed-model PR Review Squad (Opus + Sonnet + Codex) #451
32b4d6b to 91041af
….squad/

Replace 5 identical Sonnet workers with 3 diverse-model workers:
- Worker 1: Claude Opus (deep reasoning, subtle bugs)
- Worker 2: Claude Sonnet (fast pattern matching, common bugs)
- Worker 3: GPT Codex (alternative perspective, edge cases)

Model diversity is now achieved at the worker level instead of relying on workers to internally dispatch sub-agents. The orchestrator synthesizes consensus (2-of-3 filter) from the different model perspectives.

Also removes the .squad/ directory; the useful content (routing rules, review standards) is now baked into the built-in preset. Real Squad integration will come via SquadSessionProvider (issue #436).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
User feedback: 3 workers was too few. Restored to 5 workers with mixed models for better consensus coverage. Updated consensus filter references to 2-of-5. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
baea1ae to ae642e5
Each worker dispatches 3 sub-agents (Opus, Sonnet, Codex) via the task tool for consensus. The orchestrator assigns 1 worker per PR and distributes round-robin — no fan-out of multiple workers to the same PR. All workers run on Opus (best at orchestrating sub-agent dispatch). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
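The one-worker-per-PR, round-robin distribution described in this commit can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the project's code: `assign_round_robin`, the worker names, and the PR numbers are all hypothetical.

```python
from itertools import cycle


def assign_round_robin(prs, workers):
    """Assign each PR to exactly one worker, cycling through workers in order.

    Hypothetical helper: no PR is ever fanned out to more than one worker.
    """
    if not workers:
        raise ValueError("at least one worker is required")
    worker_cycle = cycle(workers)
    return {pr: next(worker_cycle) for pr in prs}


# Five workers, as in the commit message; names are illustrative only.
workers = [f"worker-{i}" for i in range(1, 6)]
plan = assign_round_robin([101, 102, 103, 104, 105, 106, 107], workers)
```

With seven PRs and five workers, the sixth and seventh PRs wrap around to the first and second workers, so every PR still has a single owner.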
Return specific error (session not found, not connected, processing, RPC error) instead of generic failure message. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
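One way to picture the change from a generic failure to a specific reason is a function that returns a success flag together with an error string. The sketch below is a Python illustration only; `SessionState`, `start_fleet`, and the exact message strings are assumptions and do not reproduce the project's actual method or wording.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    """Hypothetical stand-in for per-session state."""
    connected: bool = False
    processing: bool = False


def start_fleet(sessions, name, send):
    """Return (ok, error), where error names the specific failure reason."""
    state = sessions.get(name)
    if state is None:
        return False, "Session not found"
    if not state.connected:
        return False, "Session not connected"
    if state.processing:
        return False, "Session is already processing"
    try:
        send(name)
    except Exception as exc:
        # Report only the exception type so internal details are not exposed.
        return False, f"RPC error ({type(exc).__name__})"
    return True, None


def failing_send(name):
    raise TimeoutError("/internal/path/should/not/appear")


sessions = {
    "idle": SessionState(connected=True),
    "busy": SessionState(connected=True, processing=True),
    "offline": SessionState(),
}
```

Calling `start_fleet(sessions, "busy", print)` would report the processing state rather than a generic failure, which gives the user something actionable.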
ae642e5 to b3fe6fe
🤖 Multi-Model Code Review: PR #451 feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)

🔴 CRITICAL: Worker array contradicts PR description (5×Opus, not 3 diverse models)

File: The diff changes workers from

This is also a cost regression: 5 Opus workers each dispatching 3 sub-agents = 20 model calls per review (5 workers + 15 sub-agents), with 10 at premium Opus tier (5 workers + 5 Opus sub-agents).

🟡 MODERATE:
The fleet review incorrectly flagged these as non-existent because they aren't in PolyPilot's internal registry, but these are CLI model IDs passed to the task tool — the CLI supports them. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace git rebase + force-with-lease with git merge in worker fix instructions and SharedContext (safety regression)
- Add push verification step (equivalent to deleted push-to-pr.sh)
- Restore structured re-review tracking (FIXED/STILL PRESENT/N/A)
- Sanitize RPC exception in fleet error (don't leak internal paths)
- Improve test coverage: verify all 5 models are Opus, verify worker prompts contain all 3 sub-agent model names, verify merge not rebase in SharedContext, verify 1-worker-per-PR routing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🤖 Multi-Model Code Review (Re-Review): PR #451 feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)

Previous Findings Status

Summary: 5 of 7 previous findings are fully resolved. 2 are partially addressed (see details below).

Remaining Issues
🤖 Multi-Model Code Review (Fresh Review #3): PR #451 feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)

Consensus Findings (flagged by 2+ models)

🟡 MODERATE:
| Finding | Model | Severity |
|---|---|---|
| `StartFleetAsync` has no `IsRemoteMode` guard: in remote mode `state.Session` is always null, so /fleet gives a misleading "Session not connected" error instead of "not supported in remote mode" | Codex | 🟡 |
| Force push prohibition (`--force`) not explicitly banned: old SharedContext had it, new one only removes the recommendation | Sonnet | 🟡 |
| "Verify Claims Against Code" section removed from worker prompt, which weakens cross-checking of PR descriptions | Codex | 🟡 |
📋 Summary
- CI: ⚠️ No checks configured for this branch
- Test coverage: Good preset assertions added (5×Opus, 3 sub-agent models, merge-not-rebase, 1-worker-per-PR). The `StartFleetAsync` tuple return type has no dedicated tests.
- Prior reviews (2 comments): Review 1 found 7 issues (2 blockers: architecture mismatch + safety regression). Review 2 confirmed both blockers resolved, 5/7 fixed, 2/7 partially fixed. This fresh review confirms the PR is in good shape.
- Architecture: Code, description, and tests are aligned: 5 Opus workers each dispatching 3 sub-agents (Opus/Sonnet/Codex) with 2-of-3 consensus.
- Safety: `git merge` replaces `git rebase` everywhere. `--force-with-lease` removed. Push verification added.
Recommended action: ✅ Approve
The PR delivers what it promises. Two moderate findings remain (Console.WriteLine logging, degraded consensus fallback) but neither blocks merge. The Console.WriteLine is a minor production hygiene issue, and the 1-model consensus gap is an edge case (2 of 3 model APIs failing simultaneously) that can be addressed in a follow-up.
Workers now:
- Post exactly ONE comment per PR (edit existing, never add new)
- Use adversarial consensus: when only 1 model flags an issue, the other models get a follow-up round to agree/disagree
- Handle degraded mode: if only 1 model ran, include findings with low-confidence disclaimer
- Structured re-review updates the existing comment in-place

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
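The adversarial consensus and degraded-mode rules listed in this commit could look roughly like the sketch below. It is a Python illustration, not the worker prompt's actual logic: `consensus`, the `follow_up` callback, and the finding labels are hypothetical, and dropping a finding that still has only one vote after the follow-up round is an assumption based on the 2-of-3 filter mentioned elsewhere in this PR.

```python
def consensus(findings_by_model, follow_up):
    """Filter findings to those supported by at least two models.

    findings_by_model: dict mapping model name -> set of finding ids.
    follow_up(model, finding) -> bool: asks a model that did not flag the
    finding whether it agrees (the adversarial follow-up round).
    """
    models = list(findings_by_model)
    if len(models) == 1:
        # Degraded mode: only one model ran, so keep its findings but mark them.
        only = models[0]
        return [(f, "low-confidence") for f in sorted(findings_by_model[only])]

    results = []
    all_findings = sorted(set().union(*findings_by_model.values()))
    for finding in all_findings:
        votes = {m for m in models if finding in findings_by_model[m]}
        if len(votes) == 1:
            # Only one model flagged it: give the others a chance to agree.
            for m in models:
                if m not in votes and follow_up(m, finding):
                    votes.add(m)
        if len(votes) >= 2:
            results.append((finding, "consensus"))
    return results


findings = {"opus": {"A", "B"}, "sonnet": {"A"}, "codex": {"C"}}
# Hypothetical follow-up: only Codex agrees with finding "B".
report = consensus(findings, lambda model, f: model == "codex" and f == "B")
degraded = consensus({"opus": {"X"}}, lambda model, f: False)
```

Here "A" passes on the first round, "B" passes after the follow-up, and "C" never gains a second vote, so it is left out under the stated assumption.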
When a user sends a message to an orchestrator that's already dispatching workers, the message is now queued with a '📨 New task queued' system message visible in the orchestrator's chat. After the current dispatch completes (workers finish + synthesis), all queued messages are drained and dispatched sequentially. Previously, messages blocked silently on a semaphore — the user got no feedback and the message appeared to vanish. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
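The queue-then-drain behaviour described above can be illustrated with a small asyncio sketch. It is a simplified Python model, not the project's implementation: the `Orchestrator` class, its attributes, and `_dispatch` are hypothetical, and real dispatch (workers plus synthesis) is replaced by a single yield to the event loop.

```python
import asyncio
from collections import deque


class Orchestrator:
    """Hypothetical model of an orchestrator that queues messages while busy."""

    def __init__(self):
        self.dispatching = False
        self.queue = deque()
        self.chat = []      # system messages visible to the user
        self.handled = []   # prompts in the order they were dispatched

    async def send(self, prompt):
        if self.dispatching:
            # Already busy: queue the prompt and tell the user, instead of blocking.
            self.queue.append(prompt)
            self.chat.append("📨 New task queued")
            return
        self.dispatching = True
        try:
            await self._dispatch(prompt)
            # Drain everything that arrived during the dispatch, in order.
            while self.queue:
                await self._dispatch(self.queue.popleft())
        finally:
            self.dispatching = False

    async def _dispatch(self, prompt):
        await asyncio.sleep(0)  # stand-in for workers finishing + synthesis
        self.handled.append(prompt)


async def demo():
    orch = Orchestrator()
    await asyncio.gather(
        orch.send("review PR 1"), orch.send("review PR 2"), orch.send("review PR 3")
    )
    return orch


orch = asyncio.run(demo())
```

The second and third prompts arrive while the first is in flight, so each produces a visible queued notice and is dispatched afterwards rather than appearing to vanish.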
Two fixes for stuck orchestrator when a worker goes silent:

1. Add 15-minute OrchestratorCollectionTimeout on Task.WhenAll. If any worker is stuck, force-complete it and proceed to synthesis with partial results. Previously the orchestrator would block for up to 60 minutes (WorkerExecutionTimeout).
2. Don't reset WatchdogCaseBLastFileSize to 0 on each SDK event. The stale-file-size detection needs prevSize > 0 to work on its first iteration. Resetting to 0 wasted one full 180s timeout cycle, compounding to ~540s (9 min) total recovery instead of ~360s (6 min).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
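The first fix, collecting worker results under a deadline and continuing with whatever finished, can be sketched with `asyncio.wait`. This is a Python analogue for illustration only, not the project's Task.WhenAll code: `collect_with_timeout`, `worker`, and the demo delays are hypothetical, and cancelling a pending task here merely stands in for the project's force-completion step.

```python
import asyncio

# Mirrors the 15-minute value from the commit message; the name is illustrative.
ORCHESTRATOR_COLLECTION_TIMEOUT_SECONDS = 15 * 60


async def collect_with_timeout(worker_coros, timeout):
    """Wait for all workers up to `timeout` seconds, then return partial results.

    Workers that did not finish (or failed) map to None.
    """
    if not worker_coros:
        return {}
    tasks = {name: asyncio.ensure_future(coro) for name, coro in worker_coros.items()}
    done, pending = await asyncio.wait(list(tasks.values()), timeout=timeout)
    for task in pending:
        task.cancel()  # stand-in for force-completing a stuck worker
    # Let the cancellations settle without raising into the caller.
    await asyncio.gather(*pending, return_exceptions=True)
    results = {}
    for name, task in tasks.items():
        finished_ok = task in done and not task.cancelled() and task.exception() is None
        results[name] = task.result() if finished_ok else None
    return results


async def worker(delay, result):
    await asyncio.sleep(delay)
    return result


async def demo():
    return await collect_with_timeout(
        {"fast": worker(0.01, "review A"), "stuck": worker(10, "review B")},
        timeout=0.2,  # short deadline for the demo instead of 15 minutes
    )


partial = asyncio.run(demo())
```

Synthesis can then run over the non-None entries, which is the partial-results behaviour the commit describes.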
🤖 Multi-Model Code Review (Review #4, New Code): PR #451 feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)
| Finding | Model | Severity |
|---|---|---|
| `GetOrchestratorSession` reads `Organization.Sessions` (plain List) off the UI thread in the queuing and drain paths: race condition | Sonnet | 🟡 |
| Queued prompts silently lost if cancellation fires between `TryDequeue` and `ThrowIfCancellationRequested` | Opus | 🟡 |
| `StartFleetAsync` has no `IsRemoteMode` guard: misleading error in remote mode | Opus | 🟢 |
| Test assertions coupled to exact prompt wording ("Adversarial Consensus"), which makes them brittle | Opus | 🟢 |
📋 Summary
- CI: ⚠️ No checks configured for this branch
- Test coverage: Preset tests are good. No tests for the new Organization.cs code (orchestrator queue, collection timeout, force-completion). This is the riskiest new code and would benefit from coverage.
- Previous findings status: All 7 original findings from Review #1 remain resolved (architecture aligned, merge-not-rebase, push verification, re-review tracking, error messages sanitized).
- New in this version: Adversarial consensus prompt, comment deduplication rules, orchestrator prompt queue, collection timeout with force-completion, watchdog optimization.
Recommended action:
One blocker:
1. 🔴 Force-completion must use `ForceCompleteProcessingAsync`: the bare `TrySetResult` at line 1929 violates the IsProcessing cleanup invariant (18 invariants, 13 PRs of fix cycles). This will cause stuck-spinner regressions for any worker that times out. Replace it with the existing `ForceCompleteProcessingAsync` method.
Should also fix before merge:
2. 🟡 Double-await re-throws — the synthesis phase is skipped on any worker cancellation.
3. 🟡 Unbounded queue drain — add a cap to prevent lock starvation.
Summary
Redesigns the built-in PR Review Squad multi-agent preset and adds fleet command diagnostics.
Architecture: 5x Opus Workers with Internal Multi-Model Dispatch
Each of the 5 workers runs on Opus and internally dispatches 3 parallel sub-agent reviews via the task tool:
The worker synthesizes a 2-of-3 consensus report. The orchestrator assigns one worker per PR (round-robin for multiple PRs) -- no fan-out.
Changes
Why 5x Opus?
Workers orchestrate sub-agent dispatch -- Opus excels at this. Model diversity happens at the sub-agent level inside each worker.