Skip to content

SyntheticControl: confidence sets by test inversion (Firpo-Possebom 2018, PR-B)#527

Open
igerber wants to merge 2 commits into
mainfrom
feature/sc-ci-test-inversion
Open

SyntheticControl: confidence sets by test inversion (Firpo-Possebom 2018, PR-B)#527
igerber wants to merge 2 commits into
mainfrom
feature/sc-ci-test-inversion

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Jun 2, 2026

Summary

  • PR-B of the SCM Firpo & Possebom (2018, §4) methodology track (consumes the PR-A review docs/methodology/papers/firpo-possebom-2018-review.md, merged in docs: Firpo & Possebom (2018) paper review — SCM CI by test inversion (PR-A) #524). Gives classic SyntheticControl the uncertainty quantification it lacked — a confidence set for the treatment-effect path by test inversion — without changing its always-NaN analytical inference contract.
  • Two opt-in SyntheticControlResults methods, both a pure re-ranking of the in-space placebo gaps (no synthetic-control refits):
    • test_sharp_null(effect, gamma=0.1) — test H_0: α_1t = f(t) (Eqs 12–13, φ=0); test_sharp_null(0) is bit-for-bit placebo_p_value.
    • confidence_set(family="constant"|"linear", gamma=0.1, bounds=None, n_grid=200) — invert that test (Eqs 14/16/18, strict p^f > γ). With bounds=None the set is recovered exactly: p^c is piecewise-constant, so the placebo-comparison quadratic breakpoints partition the line; p is evaluated per interval and at each breakpoint (a tie under can spike p, so isolated singletons are captured) — no centering/monotonicity assumption, so accepted tails / disjoint components / unbounded / empty are all handled.
  • Result lives in a dedicated effect_confidence_set summary + get_confidence_set_df(); analytical conf_int/se/t_stat/p_value stay NaN (a permutation set at level 1−γ, γ-granular in 1/(J+1), possibly a set/unbounded/non-contiguous — kept separate exactly as placebo_p_value is kept off p_value). Surfaced opt-in under estimator_native_diagnostics.confidence_set.
  • Fail-closed: γ < 1/(J+1) or treated-not-best-pre-fit → unbounded; nothing accepted → empty; disjoint/singleton → contiguous=False + warning; <2 donors / non-converged treated / unpickled → ValueError.

Methodology references

  • Method: SCM confidence sets / CI by test inversion (sharp-null RMSPE^f permutation), Firpo & Possebom (2018), J. Causal Inference 6(2), Eqs 11–18.
  • Source: docs/methodology/papers/firpo-possebom-2018-review.md (PR-A, docs: Firpo & Possebom (2018) paper review — SCM CI by test inversion (PR-A) #524); new ## SyntheticControl methodology block + 4 **Note:** labels in docs/methodology/REGISTRY.md (boundary convention, exact-breakpoint set construction, non-analytical conf_int, no-R-anchor validation).
  • Deviations: the set-construction method (exact breakpoint inversion / optional fixed grid) is OUR choice — the paper leaves the grid unspecified (documented). Strict p^f > γ boundary per Eq 14 (documented; the discrete p-value makes p=γ reachable). Deferred (flagged in the review checklist): sensitivity weights (φ≠0), the general-θ menu (Eq 19), one-sided (§7), multiple-outcome/treated (§6).

Validation

  • Tests added: tests/test_methodology_synthetic_control.py (numpy oracle for Eqs 12–14 incl. the strict p=γ boundary + the per-unit floor; the test_sharp_null(0)==placebo_p_value self-consistency anchor incl. a near-perfect-pre-fit floor-biting case; center-rejected/tails-accepted and isolated-breakpoint-singleton regressions; invariants; fail-closed; input validation; a @slow coverage simulation) + tests/test_diagnostic_report.py (DR surfacing). No R anchor — R Synth has no test inversion; validated by self-consistency (transitively R-anchored via the Basque placebo parity), the numpy oracle, and the coverage MC.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

…018, PR-B)

Add two opt-in SyntheticControlResults methods that re-rank the in-space placebo
gaps into a confidence set for the treatment-effect path (no synthetic-control refits):
- test_sharp_null(effect, gamma): test H_0: alpha_1t = f(t) (Eqs 12-13, phi=0, v=1);
  test_sharp_null(0) is identically placebo_p_value.
- confidence_set(family="constant"|"linear", gamma): invert that test (Eqs 14/16/18,
  strict p^f > gamma). p^param is piecewise-constant, so the set is recovered EXACTLY via
  the placebo breakpoints (real roots of the pairwise RMSPE-equality quadratics) -- no
  shape/centering assumption, so accepted tails, disjoint components, empty and unbounded
  sets are all handled. effect_confidence_set summary + get_confidence_set_df() grid;
  analytical conf_int/se/t_stat/p_value stay NaN.

Persist per-unit floored pre-denominators in in_space_placebo so the f=0 anchor holds
bit-for-bit. Wire an opt-in confidence_set block into _scm_native (numeric endpoints only
for a bounded set; JSON-safe). Fail-closed for unbounded (gamma<1/(J+1) or
treated-not-best-pre-fit) / empty / non-contiguous / <2 donors / non-converged / unpickled.

Docs: REGISTRY methodology block + checklist + Notes (boundary/grid/non-analytical/
no-R-anchor); review checklist flip; LLM guides; api rst; CHANGELOG; doc-deps.
Tests: numpy oracle (Eqs 12-14 incl strict p=gamma boundary + per-unit floor), the f=0
self-consistency anchor, a center-rejected/tails-accepted regression, invariants,
fail-closed paths, and a coverage simulation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 2, 2026

⚠️ Needs changes

Executive summary

  • Methodology formulas for Firpo-Possebom §4 mostly match the Registry and paper-review equations: Eq. 12/13 re-ranking, strict p > gamma, and NaN analytical inference are documented and implemented.
  • One unmitigated P1 state-consistency bug can silently surface a stale confidence set after the placebo reference set is rebuilt.
  • No new inline Wald inference / se > 0 else 0.0 anti-pattern found in the changed SCM paths.
  • Deferred scope items are documented in REGISTRY.md / paper-review checklist, so they are not defects.
  • git diff --check passed. I did not run the test suite.

Methodology

Finding 1

Severity: P1
Location: diff_diff/synthetic_control_results.py:L653-L900, diff_diff/synthetic_control_results.py:L1787-L1800, diff_diff/diagnostic_report.py:L2466-L2494
Impact: confidence_set() is defined as a re-ranking of the current in-space placebo reference set. However, after confidence_set() populates effect_confidence_set and _confidence_set_df, a later explicit in_space_placebo(...) rebuild overwrites placebo_p_value, _placebo_gaps, and _placebo_pre_denoms without invalidating the cached confidence-set outputs. The diagnostic report then emits the old effect_confidence_set as if it belongs to the current placebo reference set. This can silently report the wrong confidence set, especially because _require_placebo_reference() tells users to rebuild via in_space_placebo(n_starts=...).

Concrete fix: In in_space_placebo(), clear dependent confidence-set state whenever the placebo reference is recomputed or fail-closed:

self.effect_confidence_set = None
self._confidence_set_df = None

Do this on all exits that set _placebo_gaps / _placebo_status. Add a regression test: run confidence_set(), then rerun in_space_placebo(n_starts=...), assert effect_confidence_set is None, get_confidence_set_df() raises, and DiagnosticReport(...).to_dict()["estimator_native_diagnostics"]["confidence_set"]["status"] == "not_run" until confidence_set() is called again.

Code Quality

No P1+ code-quality issues beyond the cache invalidation bug above.

Performance

No blocking performance findings. The exact breakpoint inversion is a documented implementation choice in REGISTRY.md, not a methodology defect.

Maintainability

Finding 2

Severity: P3
Location: docs/methodology/papers/firpo-possebom-2018-review.md:L8-L14, docs/methodology/papers/firpo-possebom-2018-review.md:L135-L138
Impact: The checklist now says PR-B shipped, but the intro still describes PR-B as forthcoming. This is documentation drift only.

Concrete fix: Update the intro/scope note to say the review was consumed by the shipped PR-B implementation.

Tech Debt

No untracked blocking tech debt found. Deferred sensitivity weights, general theta menu, one-sided tests, and multiple-outcome/treated extensions are explicitly documented as deferred in REGISTRY.md and the paper review, so they are not defects.

Security

No security findings in the changed code/docs.

Documentation/Tests

Finding 3

Severity: P2
Location: tests/test_methodology_synthetic_control.py:L3407-L3420, diff_diff/synthetic_control_results.py:L653-L900
Impact: Tests cover lazy creation and n_starts ignored when a reference exists, but not the lifecycle where the reference set is explicitly rebuilt after a confidence set was computed. That gap allowed the P1 stale-cache issue.

Concrete fix: Add the regression test described in Finding 1.

Path to Approval

  1. Invalidate effect_confidence_set and _confidence_set_df on every in_space_placebo() path that rebuilds or invalidates the placebo reference.
  2. Add a regression test for confidence_set() followed by in_space_placebo(n_starts=...), including diagnostic-report status.
  3. Update the stale paper-review intro language from “forthcoming PR-B” to shipped/consumed wording.

CI codex (gpt-5.5) findings on PR #527:
- P1: confidence_set() caches effect_confidence_set / _confidence_set_df against the
  CURRENT in-space placebo reference set, but a later explicit in_space_placebo() rebuild
  (which _require_placebo_reference suggests via n_starts) overwrote the reference without
  invalidating the cache -> a stale set could be reported by summary()/to_dict()/_scm_native.
  Now clear both at the start of in_space_placebo() (after the snapshot check) so every
  rebuild drops the stale cache.
- P2: add a regression test (confidence_set -> in_space_placebo(n_starts=) -> assert
  effect_confidence_set is None, get_confidence_set_df() raises, DR status "not_run").
- P3: update the Firpo-Possebom review intro from "forthcoming PR-B" to shipped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 2, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 3e98f262a1be2e2bd3db55fd78048580a783ec15


✅ Looks good

Executive summary

  • Previous P1 stale confidence-set cache finding is resolved: in_space_placebo() now clears effect_confidence_set / _confidence_set_df, with regression coverage.
  • Methodology matches the Registry: sharp-null re-ranking, strict p > gamma, per-unit denominator reuse, and separate non-analytical confidence-set output.
  • Deferred items and deviations are documented in REGISTRY.md, so they are informational only.
  • No new inline inference / partial NaN-guard anti-pattern found in changed SCM paths.
  • git diff --check passed. Targeted tests could not be run because pytest is not installed.

Methodology

Finding 1

Severity: P3 informational
Location: docs/methodology/REGISTRY.md:L1999-L2023, diff_diff/synthetic_control.py:L1801-L2119, diff_diff/synthetic_control_results.py:L1580-L1808
Impact: The implemented confidence-set path follows the documented Firpo-Possebom PR-B contract: test_sharp_null() re-ranks stored placebo gaps, confidence_set() uses strict p^f > gamma, analytical conf_int remains NaN, and exact breakpoint inversion is documented as an implementation choice. The paper describes sharp-null RMSPE testing and test inversion for confidence sets, and the published article metadata/abstract confirms this is the intended contribution. (degruyterbrill.com)
Concrete fix: No action required.

Finding 2

Severity: P3 informational
Location: docs/methodology/REGISTRY.md:L2020-L2023, docs/methodology/papers/firpo-possebom-2018-review.md:L135-L141
Impact: Deferred sensitivity weights, general test-statistic menu, one-sided tests, and multiple-outcome/treated extensions are explicitly documented as out of scope. Under the review rules, these are not defects.
Concrete fix: No action required.

Code Quality

No findings. The previous stale-cache issue is addressed at diff_diff/synthetic_control_results.py:L725-L730, and the regression test is at tests/test_methodology_synthetic_control.py:L3429-L3447.

Performance

No findings. The exact breakpoint inversion is documented in REGISTRY.md as an implementation choice rather than a paper-mandated algorithm.

Maintainability

No findings. Public surfaces are documented in docs/api/synthetic_control.rst and the LLM guides, and result serialization/reporting paths were updated consistently.

Tech Debt

No untracked blocking tech debt found. Deferred Firpo-Possebom scope is documented in the Registry/paper review, not silently left ambiguous.

Security

No security findings in the changed code/docs.

Documentation/Tests

Finding 3

Severity: P3 informational
Location: tests/test_methodology_synthetic_control.py:L3032-L3538, tests/test_diagnostic_report.py:L2114-L2129
Impact: Added tests cover the prior stale-cache lifecycle, strict boundary behavior, singleton/disjoint/unbounded cases, lazy placebo creation, invalid inputs, and diagnostic-report surfacing. I could not execute them locally because this runner does not have pytest installed (No module named pytest).
Concrete fix: No PR action required; run the targeted SCM tests in CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant