reviewer-upgrade: add gpt-5.5 PRICING (address PR #522 review P2/P3)

igerber · claude · igerber · commit e05a104b67a2 · 2026-06-01T12:58:05.000-04:00
Add gpt-5.5 ($5/$30) and gpt-5.5-pro ($30/$180), priced per 1M
input/output tokens, to PRICING at OpenAI's confirmed standard rates
(developers.openai.com/api/docs/pricing). The production reviewer uses
the flat-rate codex backend, but `--backend auto` falls back to the
metered API path when the codex CLI is unavailable. On that path,
estimate_cost("gpt-5.5") previously returned None, so the cost lines
were silently dropped. gpt-5.5-pro gets its own entry so it doesn't
prefix-fall-back to the gpt-5.5 standard rate. Also bumps the
api-backend cost range note (gpt-5.5 is ~2x gpt-5.4) and updates the
CHANGELOG wording.
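
The prefix fallback described above can be illustrated with a minimal
sketch. It is not the code in openai_review.py: the function name
`estimate_cost_sketch`, its signature, and the longest-prefix rule are
assumptions made for illustration. Only the rate pairs are taken from
the PRICING entries in this commit.

```python
from typing import Dict, Optional, Tuple

# (input, output) USD per 1M tokens, copied from the PRICING entries in this commit.
PRICING: Dict[str, Tuple[float, float]] = {
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.4-pro": (30.00, 180.00),
    "gpt-5.5": (5.00, 30.00),
    "gpt-5.5-pro": (30.00, 180.00),
}


def estimate_cost_sketch(
    model: str,
    input_tokens: int,
    output_tokens: int,
    pricing: Dict[str, Tuple[float, float]] = PRICING,
) -> Optional[float]:
    """Hypothetical stand-in for estimate_cost; the real lookup may differ."""
    rates = pricing.get(model)
    if rates is None:
        # Assumed fallback: use the longest known name that prefixes the model.
        candidates = [name for name in pricing if model.startswith(name)]
        if not candidates:
            return None  # unknown model: the caller has no cost to print
        rates = pricing[max(candidates, key=len)]
    in_rate, out_rate = rates
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Table as it would look without the dedicated gpt-5.5-pro entry.
without_pro = {k: v for k, v in PRICING.items() if k != "gpt-5.5-pro"}
# Table as it would look before this commit (no gpt-5.5 entries at all).
before_commit = {k: v for k, v in PRICING.items() if not k.startswith("gpt-5.5")}
```

Under these assumptions, a lookup against `without_pro` bills
gpt-5.5-pro at the gpt-5.5 rate, and a lookup against `before_commit`
returns None for gpt-5.5, which is the case where the cost lines were
dropped.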

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/.claude/commands/ai-review-local.md b/.claude/commands/ai-review-local.md
@@ -16,7 +16,7 @@ Two backends are supported:
 
 | Backend | Latency | Cost | Quality |
 |---|---|---|---|
-| `api` (`gpt-5.5`) | 30-60s | $0.05-0.50/run, metered via `OPENAI_API_KEY` | Single-shot — won't grep, can't load files on its own initiative |
+| `api` (`gpt-5.5`) | 30-60s | ~$0.10-1.00/run, metered via `OPENAI_API_KEY` | Single-shot — won't grep, can't load files on its own initiative |
 | `codex` (any auth) | 3-15 min | depends on your `codex login` mode (subscription vs API key) — see codex docs | Agentic — matches CI Codex reviewer, can grep / load files / multi-turn |
 
 Choose with `--backend {auto,codex,api}` (default `auto`):
diff --git a/.claude/scripts/openai_review.py b/.claude/scripts/openai_review.py
@@ -871,6 +871,8 @@ def apply_token_budget(
 PRICING = {
     "gpt-5.4": (2.50, 15.00),
     "gpt-5.4-pro": (30.00, 180.00),
+    "gpt-5.5": (5.00, 30.00),
+    "gpt-5.5-pro": (30.00, 180.00),
     "gpt-4.1": (2.00, 8.00),
     "gpt-4.1-mini": (0.40, 1.60),
     "o3": (2.00, 8.00),
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -11,7 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **Covariate names that collide with reserved structural terms now raise `ValueError` instead of silently corrupting the coefficient dict (`DifferenceInDifferences`, `MultiPeriodDiD`, `TwoWayFixedEffects`).** These estimators build their `coefficients` dict by zipping a variable-name list -- structural term names PLUS the user covariate column names appended verbatim -- with the fitted coefficient vector. A covariate whose name equaled a reserved structural name (`const`; the treatment/time column names; the `{treatment}:{time}` interaction; MultiPeriodDiD `period_{p}` dummies and `{treatment}:period_{p}` interactions; `TwoWayFixedEffects` `ATT`; fixed-effect / unit / time dummy names; or an internal `_`-prefixed working column such as `_treat_time` / `_did_treatment` / `_treatment_post`) silently **overwrote** that structural coefficient via Python dict last-write-wins -- e.g. a covariate named `const` dropped the intercept -- with no error or warning. A new shared `validate_covariate_names` helper (`diff_diff/utils.py`) is now called in each of the three `fit()` methods before the design matrix is built; it raises `ValueError` on a collision (the comparison is case-sensitive, so e.g. `Const` is still allowed) **and** on duplicate names within `covariates` (which collapse to a single dict entry the same way). Fixed-effect/unit/time dummy reserved names are taken from the same `pd.get_dummies(..., drop_first=True)` call used to build them, so they match exactly (including for pandas `Categorical` columns with a non-default category order). For `TwoWayFixedEffects` the guard fires on **all** variance paths: the default within-transform path returns only `{"ATT": att}` (no covariate is a dict key there), but a covariate named `_treatment_post` would still clobber the internal interaction column, so guarding both paths is uniform and forward-compatible. 
**Potentially breaking:** a fit that previously *succeeded* with a colliding (or duplicated) covariate name -- silently returning a corrupted coefficient dict -- now raises; rename the covariate column(s). The staggered / influence-function estimators (CallawaySantAnna, SunAbraham, StaggeredTripleDifference, EfficientDiD, TwoStageDiD, ImputationDiD, WooldridgeDiD, dCDH, StackedDiD) key results by `(g, t)` tuples / relative-time indices, never covariate names, and `TripleDifference` / `SyntheticControl` / `SyntheticDiD` do not expose covariates by name, so none are affected. New tests in `tests/test_utils.py`, `tests/test_estimators.py`, and `tests/test_estimators_vcov_type.py`.
 
 ### Changed
-- **CI + local AI PR-reviewer model upgraded `gpt-5.4` → `gpt-5.5`.** The CI Codex reviewer (`.github/workflows/ai_pr_review.yml`) and the local `/ai-review-local` default (`.claude/scripts/openai_review.py` `DEFAULT_MODEL`) now run `gpt-5.5` @ `xhigh` effort / `read-only` sandbox (all other invocation settings unchanged). Validated empirically before the swap via the `tools/reviewer-eval/` A/B harness: on a real-bug corpus plus a k=6 big-diff de-risk, `gpt-5.5` matched-or-beat `gpt-5.4` on every test-backed recall case (including a bug buried in a ~3k-line methodology diff), added zero false positives, and ran faster; an end-to-end CI canary confirmed the action environment (`openai/codex-action@v1`, codex CLI 0.135.0) runs `gpt-5.5` and catches a planted P0. `gpt-5.5` is also added to the reasoning-model set (`_is_reasoning_model`), so the rarely-used api backend would apply the reasoning-model timeout/token limits if invoked with it; no `PRICING` entry is added because the CI + local reviewer run `gpt-5.5` via the **codex backend** (subscription/flat-rate), which does not use the api-backend cost table. `gpt-5.4` remains accepted.
+- **CI + local AI PR-reviewer model upgraded `gpt-5.4` → `gpt-5.5`.** The CI Codex reviewer (`.github/workflows/ai_pr_review.yml`) and the local `/ai-review-local` default (`.claude/scripts/openai_review.py` `DEFAULT_MODEL`) now run `gpt-5.5` @ `xhigh` effort / `read-only` sandbox (all other invocation settings unchanged). Validated empirically before the swap via the `tools/reviewer-eval/` A/B harness: on a real-bug corpus plus a k=6 big-diff de-risk, `gpt-5.5` matched-or-beat `gpt-5.4` on every test-backed recall case (including a bug buried in a ~3k-line methodology diff), added zero false positives, and ran faster; an end-to-end CI canary confirmed the action environment (`openai/codex-action@v1`, codex CLI 0.135.0) runs `gpt-5.5` and catches a planted P0. `gpt-5.5` is also added to the reasoning-model set (`_is_reasoning_model`, for the api-backend timeout/token limits) and to the `PRICING` table at OpenAI's confirmed standard rates (`gpt-5.5` $5/$30, `gpt-5.5-pro` $30/$180 per 1M input/output tokens) — the production CI + local reviewer run `gpt-5.5` via the flat-rate **codex backend**, but `--backend auto` falls back to the metered API path when the codex CLI is unavailable, so the cost estimate must stay accurate there. `gpt-5.4` remains accepted.
 
 ## [3.5.0] - 2026-06-01