Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
35 changes: 20 additions & 15 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -37,11 +37,11 @@ argus/
│ └── adapters/
│ ├── __init__.py # ROUTE_ADAPTERS registry
│ ├── aichat.py # universal aichat wrapper (all 8 provider routes)
│ ├── gemini_cli.py # gemini --yolo -p "<prompt>"
│ ├── gemini_cli.py # gemini --yolo -p "" (prompt via stdin)
│ ├── codex_cli.py # codex exec --skip-git-repo-check - (stdin)
│ ├── claude_cli.py # claude -p --output-format text --bare (stdin, nested-from-claude blocked)
│ ├── opencode_cli.py # opencode run - (stdin)
│ └── copilot_cli.py # copilot -p "<prompt>" --model X --allow-all-tools
│ └── copilot_cli.py # copilot -p "<stdin ptr>" --model X (prompt via stdin)
├── fixtures/ # benchmark inputs (diff.patch + ground-truth.json pairs)
│ ├── sql-injection/
│ ├── race-refund/
Expand Down Expand Up @@ -119,15 +119,21 @@ py -3.12 "$ARGUS_HOME/scripts/benchmark.py" --runs 3 --profile panel

### Host-CLI awareness
`detect_host.py` inspects env + parent process tree. Returns one of
`claude | codex | gemini | opencode | unknown`. `host_rules` in config.yaml
removes the matching reviewer from the roster and (for non-claude hosts) adds
`claude` in. **When running inside Claude Code, the `claude` reviewer is
auto-skipped** — Claude's nested-invocation policy would block it anyway.
`claude | codex | gemini | opencode | unknown`. `resolve_roster` (called by
dispatch.py and benchmark.py) applies `host_rules` from config.yaml: the
matching host's reviewer is always skipped (**inside Claude Code the `claude`
reviewer is auto-skipped** — the nested-invocation policy would block it).
`--roster` names are explicit: disabled/custom_only gates are waived and
host_rules `add` does not inject extra reviewers; profile-based rosters get
the full policy (disabled, tier, privacy, custom_only, host add).

### Confidence filter + corroboration
- Drop findings with effective confidence < 80.
- Bonus: +15 points when ≥2 reviewers flag the same `file:line` (capped at 100).
- Merge dedupes by exact `file:line`. Nearby-line corroboration is NOT done today — reviewers often disagree on line numbers by ±3 within the same diff hunk; this under-counts corroboration. Candidate v2 fix: bucket by file + ±3-line range.
- Bonus: +15 points when ≥2 reviewers flag the same issue (capped at 100).
- Merge clusters findings by file + anchor-based ±3-line proximity
(`defaults.merge_line_tolerance`, `--line-tolerance`); the reported line is
the cluster median. Do NOT re-implement "±3 bucketing" — it already exists
in `merge.py:_merge`.

### Cost gates
- Review: warn $0.50, hard block $2.00 — overridable with `--yes-cost`
Expand All @@ -137,7 +143,10 @@ auto-skipped** — Claude's nested-invocation policy would block it anyway.
### JSON extraction
`_common.extract_json` handles: raw JSON, `<think>` reasoning blocks
(Qwen3 / DeepSeek-R1 style), fenced code blocks, and embedded objects by
"findings" or "ok" key. Every aichat reviewer we tested returns usable JSON.
"findings" or "ok" key. The balanced-brace scanners track string/escape
state (braces inside descriptions are safe) and the last-resort fallback is
a single O(n) pass capped at 50 parse attempts. Every aichat reviewer we
tested returns usable JSON.

## Benchmark run protocol (learned from first full bench)

Expand All @@ -163,9 +172,6 @@ auto-skipped** — Claude's nested-invocation policy would block it anyway.
slow expensive frontier when F1 is comparable.

## v2 ideas (captured for later)
- **Nearby-line corroboration in merge.** Currently bucket by exact
`file:line`; reviewers often differ by ±3 within the same hunk. Bucket
by file + ±3-line range to increase corroboration accuracy.
- **User-action capture.** After merged.md is shown, collect accepted/rejected
per finding into history.db. Use this for usage-driven preference weights.
- **`--refresh` command.** Query OpenRouter `/models` and diff against pinned
Expand All @@ -180,9 +186,8 @@ auto-skipped** — Claude's nested-invocation policy would block it anyway.
| MiniMax-M2.7 direct → works (slow, 20s for fixture) | — | — |
| Kimi `KIMI_API_KEY` is consumer-only | forced OR primary (`moonshotai/kimi-k2.5`) | need Moonshot Platform dev key for k2.6-preview |
| Nous Hermes direct → `NOUSRESEARCH_API_KEY` not in env | OR fallback works | set env var if wanted |
| Gemini CLI slow (60-70s/call) | works, use sparingly for quick reviews | no fix — CLI startup time |
| OpenCode CLI slow (42+s/call) | works, barely fits 45s verify timeout | set longer timeout in dispatch (already 180s) |
| Merge doesn't apply line-tolerance to corroboration | under-counts corroborated findings | v2: ±3-line bucketing |
| Gemini CLI disabled (timeout hang) | run_subprocess now tree-kills on timeout | re-test on Windows, then re-enable in config |
| OpenCode CLI slow (42+s/call) | works, barely fits 45s verify timeout | dispatch timeout already 360s |
| Fixtures only cover 4 scenarios | leaderboard is noisy at N=4 | author more fixtures (auth bypass, XSS, null-deref, async races, TOCTOU, etc.) |

## Reviewer recommendations (initial, pre-benchmark)
Expand Down
9 changes: 9 additions & 0 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,11 +24,20 @@ New fixtures are valuable — they sharpen the benchmark signal.
- Include `scripts/verify.py --json --roster <name>` output for reachability issues
- Report `aichat --version`, Python version, OS

## Tests

- `python -m pytest tests/ -q` must pass — CI runs it on every push/PR.
Tests need no network or API keys.
- Bug fixes in `_common.py` (extract_json, run_subprocess, merge scoring)
should come with a regression test alongside the existing ones in `tests/`.

## Code style

- Python 3.12+. Type hints where they clarify; not exhaustive.
- No new dependencies without justification. pyyaml + psutil is the current bar.
- Shell scripts must work on both Git Bash (Windows) and POSIX.
- Roster policy lives only in `_common.resolve_roster` — don't add inline
roster filters to individual scripts.

## Don't break

Expand Down
152 changes: 143 additions & 9 deletions DEVELOPMENT.md
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,8 @@ argus/
│ ├── opencode_cli.py
│ └── copilot_cli.py
├── fixtures/ # Benchmark inputs (diff.patch + ground-truth.json)
├── tests/ # Unit tests (pytest; run in CI)
├── .github/workflows/ci.yml # GitHub Actions — pytest on push/PR
├── runs/ # Per-invocation artifacts (gitignored)
├── benchmarks/ # Leaderboard outputs (gitignored)
└── history.db # SQLite — runs, findings, benchmarks (gitignored)
Expand Down Expand Up @@ -141,31 +143,157 @@ python scripts/stats.py

## Architecture Notes

### Roster Policy — single source of truth

All roster policy lives in `_common.resolve_roster` — dispatch.py and
benchmark.py both call it; **never add inline roster filters elsewhere**.
Semantics:

| | explicit `--roster` names | profile-based roster |
|---|---|---|
| `disabled: true` | kept (explicit naming = intent) | dropped |
| `custom_only: true` | kept | dropped |
| `tier: free` | needs `--allow-free` | needs `--allow-free` |
| `privacy: LOGS` | needs `--allow-logging` | needs `--allow-logging` |
| host_rules `skip` | applied | applied |
| host_rules `add` | **not** applied | applied |
| not in registry | dropped | dropped |

Drops are written to `dispatch_summary.json` under `"dropped"` and noted on
stderr. `estimate_cost.py` rejects unknown reviewer names outright (exit 2,
labeled `INVALID ROSTER` so it can't be mistaken for a cost block).

### Host-CLI Awareness

`detect_host.py` inspects env + parent process tree. Returns `claude | codex | gemini | opencode | unknown`. The matching reviewer is removed from the roster; non-claude hosts auto-add `claude`.
`detect_host.py` inspects env markers + the parent process tree. Returns
`claude | codex | gemini | opencode | unknown`. Env markers are specific
variables (e.g. `CLAUDECODE=1`, `CODEX_SANDBOX`) — credential-shaped vars
like `CODEX_API_KEY` don't trigger detection. The process walk (psutil, up
to 8 levels) matches argv[0]/argv[1] basenames only, never full command
lines, so a wrapping shell containing `--roster codex` can't misdetect.

### Confidence Filter + Corroboration

- Findings with effective confidence < 80 are dropped
- +15 confidence bonus (cap 100) when >=2 reviewers flag the same file:line
- Merge dedupes by exact file:line
- Findings with effective confidence < 80 are dropped (`defaults.confidence_threshold`)
- +15 confidence bonus, cap 100, when >=2 reviewers agree (`defaults.corroboration_boost`)
- Merge clusters findings per file by anchor-based line proximity within
`defaults.merge_line_tolerance` (default ±3, `--line-tolerance` to override);
the cluster reports its median line, worst severity, and concatenated
descriptions. See README "Merge logic" for the dual-tolerance walk.

### Cost Gates

| Mode | Warn | Hard block | Override |
|------|------|-----------|----------|
| review | $0.50 | $2.00 | `--yes-cost` |
| benchmark | $10 | $30 | `--yes-cost` |
| OR balance | (auto) | `available < safety × estimate` | `--skip-balance-check` |
| benchmark | $10 | $30 | `--yes-cost` / `ARGUS_YES_COST=1` |
| OR balance | warn in review mode | blocks in benchmark mode | `--skip-balance-check` |

`estimate_cost.py` exit codes: `0` OK, `1` warn, `2` block or invalid roster
(check stderr). Paid-CLI reviewers (`cost_per_m: null`) count as $0.

### Subprocess Layer

`_common.run_subprocess` resolves argv[0] via a cached `shutil.which`, pipes
the prompt on **stdin** (never argv — Windows ARG_MAX), and on timeout kills
the entire process tree: children run in their own session, `os.killpg` on
POSIX, `taskkill /T /F` on Windows. Adapters return a uniform dict:
`{route, cmd, exit_code, stdout, stderr, latency_sec}`.

`load_config()` is `lru_cache`d and returns a **shared dict — treat it as
read-only**. The prompt template read is cached too.

### JSON Extraction

`_common.extract_json` handles: raw JSON, `<think>` reasoning blocks (Qwen3/DeepSeek-R1), fenced code blocks, and embedded objects by "findings" or "ok" key.
`_common.extract_json` tries, in order: raw parse → strip `<think>` blocks
(Qwen3/DeepSeek-R1) → fenced code blocks → balanced object containing
`"findings"`/`"ok"` → a single O(n) string-aware pass over all balanced
`{...}` spans (max 50 parse attempts). The scanner tracks string/escape
state, so braces inside description values don't truncate the object.
`normalize_findings` clamps off-enum severities to `medium`.

### Benchmark Scoring

```mermaid
flowchart TB
A["dispatch reviewer x fixture x run"] --> B{"exit_code == 0?"}
B -->|no| Z["zero-score<br>P=R=F1=0, fn=len(issues)"]
B -->|yes| C{"JSON parsed?"}
C -->|"no (parse_error)"| Z
C -->|yes| D{"fixture has issues?"}
D -->|"no (clean baseline)"| E{"findings empty?"}
E -->|yes| P["perfect<br>P=R=F1=1.0"]
E -->|no| Z2["all false positives<br>P=R=F1=0"]
D -->|yes| F["match by file +<br>line within tolerance"]
F --> G["tp/fp/fn -> P, R, F1"]

style Z fill:#ffb6c1,stroke:#c71585
style Z2 fill:#ffb6c1,stroke:#c71585
style P fill:#98fb98,stroke:#2e8b57
```

The clean-baseline `1.0` applies **only to successful, parseable runs** —
a reviewer whose call failed or returned prose is zero-scored everywhere,
so broken reviewers can't rank above working ones. `aggregate_bench.py`
applies the same rule when re-scoring per-reviewer JSONs from
parallel-shell runs. Benchmark timestamps are UTC.

### history.db Schema

```mermaid
erDiagram
runs ||--o{ reviewer_runs : "run_id"
runs ||--o{ findings : "run_id"
runs {
TEXT run_id PK
TEXT ts "UTC ISO-8601"
TEXT roster
TEXT diff_sha256
REAL latency_sec
}
reviewer_runs {
TEXT run_id PK,FK
TEXT reviewer PK
TEXT route
INTEGER exit_code
INTEGER fallback_used
REAL latency_sec
INTEGER n_findings
TEXT error
}
findings {
TEXT run_id PK,FK
INTEGER idx PK
TEXT file
INTEGER line
TEXT severity
INTEGER confidence
INTEGER n_reviewers
TEXT reviewers
}
benchmarks {
TEXT ts PK "UTC compact YYYYMMDDTHHMMSS"
TEXT reviewer PK
TEXT fixture PK
INTEGER run_idx PK
REAL precision
REAL recall
REAL f1
REAL latency_sec
TEXT error
}
```

`benchmarks` is keyed independently (no FK to `runs`) — review runs and
benchmark runs are separate timelines. `stats.py --since` normalizes both
timestamp formats to a common `YYYYMMDDTHHMMSS` prefix before comparing.

## Testing

```bash
# Unit tests (no network, no API keys; also run by CI on every push/PR)
python -m pytest tests/ -q

# Verify all routes are reachable
python scripts/verify.py --all

Expand All @@ -177,6 +305,12 @@ python scripts/estimate_cost.py --roster "glm-5.1,minimax-m2.7,gemini-or,codex"
--diff <(git diff HEAD)
```

Unit coverage: `extract_json` edge cases (think-blocks, fences,
braces-in-strings, O(n) garbage handling), merge clustering/corroboration,
benchmark scoring, and `run_subprocess` timeout + process-tree kill.
Always dry-run a benchmark (`--runs 1 --fixtures <one>`) with a new roster
before a full run — it catches provider-config bugs in ~30s instead of 40min.

## Code Style

- Python 3.12+. Type hints where they clarify; not exhaustive.
Expand All @@ -185,7 +319,7 @@ python scripts/estimate_cost.py --roster "glm-5.1,minimax-m2.7,gemini-or,codex"

## Known Gotchas

- **Windows .cmd shim tree-kill**: Node CLIs (gemini, copilot) don't tree-kill children on subprocess timeout. Use `gemini-or` instead.
- **Windows .cmd shim tree-kill**: fixed — `run_subprocess` kills the whole process tree on timeout (killpg on POSIX, `taskkill /T /F` on Windows). gemini-direct stays disabled until re-tested on Windows; `gemini-or` remains the default route.
- **OpenRouter reasoning providers**: `z-ai/glm-5.1` and `minimax/minimax-m2.7` may route to providers returning `{content: null}`. Mitigation: aichat patch applies reasoning-exclude + provider-ignore.
- **Argv length on Windows (~32KB)**: gemini-cli and copilot-cli embed prompts as CLI args. Diffs >30KB fail. Use `--files` to scope.
- **Argv length on Windows (~32KB)**: fixed — all CLI adapters pipe the prompt via stdin; no adapter embeds it in argv.
- **Full-codebase audit prompt mismatch**: Default prompt optimized for PR review. Use `--overlay audit` for empty-tree→HEAD diffs.
Loading
Loading