jimstratus · jimstratus · Jun 10, 2026 · Jun 10, 2026 · Jun 10, 2026 · Jun 10, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -37,11 +37,11 @@ argus/
 │   └── adapters/
 │       ├── __init__.py        # ROUTE_ADAPTERS registry
 │       ├── aichat.py          # universal aichat wrapper (all 8 provider routes)
-│       ├── gemini_cli.py      # gemini --yolo -p "<prompt>"
+│       ├── gemini_cli.py      # gemini --yolo -p ""  (prompt via stdin)
 │       ├── codex_cli.py       # codex exec --skip-git-repo-check -  (stdin)
 │       ├── claude_cli.py      # claude -p --output-format text --bare  (stdin, nested-from-claude blocked)
 │       ├── opencode_cli.py    # opencode run -  (stdin)
-│       └── copilot_cli.py     # copilot -p "<prompt>" --model X --allow-all-tools
+│       └── copilot_cli.py     # copilot -p "<stdin ptr>" --model X  (prompt via stdin)
 ├── fixtures/                  # benchmark inputs (diff.patch + ground-truth.json pairs)
 │   ├── sql-injection/
 │   ├── race-refund/
@@ -119,15 +119,21 @@ py -3.12 "$ARGUS_HOME/scripts/benchmark.py" --runs 3 --profile panel
 
 ### Host-CLI awareness
 `detect_host.py` inspects env + parent process tree. Returns one of
-`claude | codex | gemini | opencode | unknown`. `host_rules` in config.yaml
-removes the matching reviewer from the roster and (for non-claude hosts) adds
-`claude` in. **When running inside Claude Code, the `claude` reviewer is
-auto-skipped** — Claude's nested-invocation policy would block it anyway.
+`claude | codex | gemini | opencode | unknown`. `resolve_roster` (called by
+dispatch.py and benchmark.py) applies `host_rules` from config.yaml: the
+matching host's reviewer is always skipped (**inside Claude Code the `claude`
+reviewer is auto-skipped** — the nested-invocation policy would block it).
+`--roster` names are explicit: disabled/custom_only gates are waived and
+host_rules `add` does not inject extra reviewers; profile-based rosters get
+the full policy (disabled, tier, privacy, custom_only, host add).
 
 ### Confidence filter + corroboration
 - Drop findings with effective confidence < 80.
-- Bonus: +15 points when ≥2 reviewers flag the same `file:line` (capped at 100).
-- Merge dedupes by exact `file:line`. Nearby-line corroboration is NOT done today — reviewers often disagree on line numbers by ±3 within the same diff hunk; this under-counts corroboration. Candidate v2 fix: bucket by file + ±3-line range.
+- Bonus: +15 points when ≥2 reviewers flag the same issue (capped at 100).
+- Merge clusters findings by file + anchor-based ±3-line proximity
+  (`defaults.merge_line_tolerance`, `--line-tolerance`); the reported line is
+  the cluster median. Do NOT re-implement "±3 bucketing" — it already exists
+  in `merge.py:_merge`.
 
 ### Cost gates
 - Review: warn $0.50, hard block $2.00 — overridable with `--yes-cost`
@@ -137,7 +143,10 @@ auto-skipped** — Claude's nested-invocation policy would block it anyway.
 ### JSON extraction
 `_common.extract_json` handles: raw JSON, `<think>` reasoning blocks
 (Qwen3 / DeepSeek-R1 style), fenced code blocks, and embedded objects by
-"findings" or "ok" key. Every aichat reviewer we tested returns usable JSON.
+"findings" or "ok" key. The balanced-brace scanners track string/escape
+state (braces inside descriptions are safe) and the last-resort fallback is
+a single O(n) pass capped at 50 parse attempts. Every aichat reviewer we
+tested returns usable JSON.
 
 ## Benchmark run protocol (learned from first full bench)
 
@@ -163,9 +172,6 @@ auto-skipped** — Claude's nested-invocation policy would block it anyway.
   slow expensive frontier when F1 is comparable.
 
 ## v2 ideas (captured for later)
-- **Nearby-line corroboration in merge.** Currently bucket by exact
-  `file:line`; reviewers often differ by ±3 within the same hunk. Bucket
-  by file + ±3-line range to increase corroboration accuracy.
 - **User-action capture.** After merged.md is shown, collect accepted/rejected
   per finding into history.db. Use this for usage-driven preference weights.
 - **`--refresh` command.** Query OpenRouter `/models` and diff against pinned
@@ -180,9 +186,8 @@ auto-skipped** — Claude's nested-invocation policy would block it anyway.
 | MiniMax-M2.7 direct → works (slow, 20s for fixture) | — | — |
 | Kimi `KIMI_API_KEY` is consumer-only | forced OR primary (`moonshotai/kimi-k2.5`) | need Moonshot Platform dev key for k2.6-preview |
 | Nous Hermes direct → `NOUSRESEARCH_API_KEY` not in env | OR fallback works | set env var if wanted |
-| Gemini CLI slow (60-70s/call) | works, use sparingly for quick reviews | no fix — CLI startup time |
-| OpenCode CLI slow (42+s/call) | works, barely fits 45s verify timeout | set longer timeout in dispatch (already 180s) |
-| Merge doesn't apply line-tolerance to corroboration | under-counts corroborated findings | v2: ±3-line bucketing |
+| Gemini CLI disabled (timeout hang) | run_subprocess now tree-kills on timeout | re-test on Windows, then re-enable in config |
+| OpenCode CLI slow (42+s/call) | works, barely fits 45s verify timeout | dispatch timeout already 360s |
 | Fixtures only cover 4 scenarios | leaderboard is noisy at N=4 | author more fixtures (auth bypass, XSS, null-deref, async races, TOCTOU, etc.) |
 
 ## Reviewer recommendations (initial, pre-benchmark)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -24,11 +24,20 @@ New fixtures are valuable — they sharpen the benchmark signal.
 - Include `scripts/verify.py --json --roster <name>` output for reachability issues
 - Report `aichat --version`, Python version, OS
 
+## Tests
+
+- `python -m pytest tests/ -q` must pass — CI runs it on every push/PR.
+  Tests need no network or API keys.
+- Bug fixes in `_common.py` (extract_json, run_subprocess, merge scoring)
+  should come with a regression test alongside the existing ones in `tests/`.
+
 ## Code style
 
 - Python 3.12+. Type hints where they clarify; not exhaustive.
 - No new dependencies without justification. pyyaml + psutil is the current bar.
 - Shell scripts must work on both Git Bash (Windows) and POSIX.
+- Roster policy lives only in `_common.resolve_roster` — don't add inline
+  roster filters to individual scripts.
 
 ## Don't break
 

diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md
@@ -75,6 +75,8 @@ argus/
 │       ├── opencode_cli.py
 │       └── copilot_cli.py
 ├── fixtures/                   # Benchmark inputs (diff.patch + ground-truth.json)
+├── tests/                      # Unit tests (pytest; run in CI)
+├── .github/workflows/ci.yml    # GitHub Actions — pytest on push/PR
 ├── runs/                       # Per-invocation artifacts (gitignored)
 ├── benchmarks/                 # Leaderboard outputs (gitignored)
 └── history.db                  # SQLite — runs, findings, benchmarks (gitignored)
@@ -141,31 +143,157 @@ python scripts/stats.py
 
 ## Architecture Notes
 
+### Roster Policy — single source of truth
+
+All roster policy lives in `_common.resolve_roster` — dispatch.py and
+benchmark.py both call it; **never add inline roster filters elsewhere**.
+Semantics:
+
+| | explicit `--roster` names | profile-based roster |
+|---|---|---|
+| `disabled: true` | kept (explicit naming = intent) | dropped |
+| `custom_only: true` | kept | dropped |
+| `tier: free` | needs `--allow-free` | needs `--allow-free` |
+| `privacy: LOGS` | needs `--allow-logging` | needs `--allow-logging` |
+| host_rules `skip` | applied | applied |
+| host_rules `add` | **not** applied | applied |
+| not in registry | dropped | dropped |
+
+Drops are written to `dispatch_summary.json` under `"dropped"` and noted on
+stderr. `estimate_cost.py` rejects unknown reviewer names outright (exit 2,
+labeled `INVALID ROSTER` so it can't be mistaken for a cost block).
+
 ### Host-CLI Awareness
 
-`detect_host.py` inspects env + parent process tree. Returns `claude | codex | gemini | opencode | unknown`. The matching reviewer is removed from the roster; non-claude hosts auto-add `claude`.
+`detect_host.py` inspects env markers + the parent process tree. Returns
+`claude | codex | gemini | opencode | unknown`. Env markers are specific
+variables (e.g. `CLAUDECODE=1`, `CODEX_SANDBOX`) — credential-shaped vars
+like `CODEX_API_KEY` don't trigger detection. The process walk (psutil, up
+to 8 levels) matches argv[0]/argv[1] basenames only, never full command
+lines, so a wrapping shell containing `--roster codex` can't misdetect.
 
 ### Confidence Filter + Corroboration
 
-- Findings with effective confidence < 80 are dropped
-- +15 confidence bonus (cap 100) when >=2 reviewers flag the same file:line
-- Merge dedupes by exact file:line
+- Findings with effective confidence < 80 are dropped (`defaults.confidence_threshold`)
+- +15 confidence bonus, cap 100, when >=2 reviewers agree (`defaults.corroboration_boost`)
+- Merge clusters findings per file by anchor-based line proximity within
+  `defaults.merge_line_tolerance` (default ±3, `--line-tolerance` to override);
+  the cluster reports its median line, worst severity, and concatenated
+  descriptions. See README "Merge logic" for the dual-tolerance walk.
 
 ### Cost Gates
 
 | Mode | Warn | Hard block | Override |
 |------|------|-----------|----------|
 | review | $0.50 | $2.00 | `--yes-cost` |
-| benchmark | $10 | $30 | `--yes-cost` |
-| OR balance | (auto) | `available < safety × estimate` | `--skip-balance-check` |
+| benchmark | $10 | $30 | `--yes-cost` / `ARGUS_YES_COST=1` |
+| OR balance | warn in review mode | blocks in benchmark mode | `--skip-balance-check` |
+
+`estimate_cost.py` exit codes: `0` OK, `1` warn, `2` block or invalid roster
+(check stderr). Paid-CLI reviewers (`cost_per_m: null`) count as $0.
+
+### Subprocess Layer
+
+`_common.run_subprocess` resolves argv[0] via a cached `shutil.which`, pipes
+the prompt on **stdin** (never argv — Windows ARG_MAX), and on timeout kills
+the entire process tree: children run in their own session, `os.killpg` on
+POSIX, `taskkill /T /F` on Windows. Adapters return a uniform dict:
+`{route, cmd, exit_code, stdout, stderr, latency_sec}`.
+
+`load_config()` is `lru_cache`d and returns a **shared dict — treat it as
+read-only**. The prompt template read is cached too.
 
 ### JSON Extraction
 
-`_common.extract_json` handles: raw JSON, `<think>` reasoning blocks (Qwen3/DeepSeek-R1), fenced code blocks, and embedded objects by "findings" or "ok" key.
+`_common.extract_json` tries, in order: raw parse → strip `<think>` blocks
+(Qwen3/DeepSeek-R1) → fenced code blocks → balanced object containing
+`"findings"`/`"ok"` → a single O(n) string-aware pass over all balanced
+`{...}` spans (max 50 parse attempts). The scanner tracks string/escape
+state, so braces inside description values don't truncate the object.
+`normalize_findings` clamps off-enum severities to `medium`.
+
+### Benchmark Scoring
+
+```mermaid
+flowchart TB
+    A["dispatch reviewer x fixture x run"] --> B{"exit_code == 0?"}
+    B -->|no| Z["zero-score<br>P=R=F1=0, fn=len(issues)"]
+    B -->|yes| C{"JSON parsed?"}
+    C -->|"no (parse_error)"| Z
+    C -->|yes| D{"fixture has issues?"}
+    D -->|"no (clean baseline)"| E{"findings empty?"}
+    E -->|yes| P["perfect<br>P=R=F1=1.0"]
+    E -->|no| Z2["all false positives<br>P=R=F1=0"]
+    D -->|yes| F["match by file +<br>line within tolerance"]
+    F --> G["tp/fp/fn -> P, R, F1"]
+
+    style Z fill:#ffb6c1,stroke:#c71585
+    style Z2 fill:#ffb6c1,stroke:#c71585
+    style P fill:#98fb98,stroke:#2e8b57
+```
+
+The clean-baseline `1.0` applies **only to successful, parseable runs** —
+a reviewer whose call failed or returned prose is zero-scored everywhere,
+so broken reviewers can't rank above working ones. `aggregate_bench.py`
+applies the same rule when re-scoring per-reviewer JSONs from
+parallel-shell runs. Benchmark timestamps are UTC.
+
+### history.db Schema
+
+```mermaid
+erDiagram
+    runs ||--o{ reviewer_runs : "run_id"
+    runs ||--o{ findings : "run_id"
+    runs {
+        TEXT run_id PK
+        TEXT ts "UTC ISO-8601"
+        TEXT roster
+        TEXT diff_sha256
+        REAL latency_sec
+    }
+    reviewer_runs {
+        TEXT run_id PK,FK
+        TEXT reviewer PK
+        TEXT route
+        INTEGER exit_code
+        INTEGER fallback_used
+        REAL latency_sec
+        INTEGER n_findings
+        TEXT error
+    }
+    findings {
+        TEXT run_id PK,FK
+        INTEGER idx PK
+        TEXT file
+        INTEGER line
+        TEXT severity
+        INTEGER confidence
+        INTEGER n_reviewers
+        TEXT reviewers
+    }
+    benchmarks {
+        TEXT ts PK "UTC compact YYYYMMDDTHHMMSS"
+        TEXT reviewer PK
+        TEXT fixture PK
+        INTEGER run_idx PK
+        REAL precision
+        REAL recall
+        REAL f1
+        REAL latency_sec
+        TEXT error
+    }
+```
+
+`benchmarks` is keyed independently (no FK to `runs`) — review runs and
+benchmark runs are separate timelines. `stats.py --since` normalizes both
+timestamp formats to a common `YYYYMMDDTHHMMSS` prefix before comparing.
 
 ## Testing
 
 ```bash
+# Unit tests (no network, no API keys; also run by CI on every push/PR)
+python -m pytest tests/ -q
+
 # Verify all routes are reachable
 python scripts/verify.py --all
 
@@ -177,6 +305,12 @@ python scripts/estimate_cost.py --roster "glm-5.1,minimax-m2.7,gemini-or,codex"
   --diff <(git diff HEAD)
 ```
 
+Unit coverage: `extract_json` edge cases (think-blocks, fences,
+braces-in-strings, O(n) garbage handling), merge clustering/corroboration,
+benchmark scoring, and `run_subprocess` timeout + process-tree kill.
+Always dry-run a benchmark (`--runs 1 --fixtures <one>`) with a new roster
+before a full run — it catches provider-config bugs in ~30s instead of 40min.
+
 ## Code Style
 
 - Python 3.12+. Type hints where they clarify; not exhaustive.
@@ -185,7 +319,7 @@ python scripts/estimate_cost.py --roster "glm-5.1,minimax-m2.7,gemini-or,codex"
 
 ## Known Gotchas
 
-- **Windows .cmd shim tree-kill**: Node CLIs (gemini, copilot) don't tree-kill children on subprocess timeout. Use `gemini-or` instead.
+- **Windows .cmd shim tree-kill**: fixed — `run_subprocess` kills the whole process tree on timeout (killpg on POSIX, `taskkill /T /F` on Windows). gemini-direct stays disabled until re-tested on Windows; `gemini-or` remains the default route.
 - **OpenRouter reasoning providers**: `z-ai/glm-5.1` and `minimax/minimax-m2.7` may route to providers returning `{content: null}`. Mitigation: aichat patch applies reasoning-exclude + provider-ignore.
-- **Argv length on Windows (~32KB)**: gemini-cli and copilot-cli embed prompts as CLI args. Diffs >30KB fail. Use `--files` to scope.
+- **Argv length on Windows (~32KB)**: fixed — all CLI adapters pipe the prompt via stdin; no adapter embeds it in argv.
 - **Full-codebase audit prompt mismatch**: Default prompt optimized for PR review. Use `--overlay audit` for empty-tree→HEAD diffs.