SyntheticControl: confidence sets by test inversion (Firpo-Possebom 2018, PR-B) by igerber · Pull Request #527 · igerber/diff-diff

igerber · 2026-06-02T12:10:12Z

Summary

PR-B of the SCM Firpo & Possebom (2018, §4) methodology track (consumes the PR-A review docs/methodology/papers/firpo-possebom-2018-review.md, merged in docs: Firpo & Possebom (2018) paper review — SCM CI by test inversion (PR-A) #524). Gives classic SyntheticControl the uncertainty quantification it lacked — a confidence set for the treatment-effect path by test inversion — without changing its always-NaN analytical inference contract.
Two opt-in SyntheticControlResults methods, both a pure re-ranking of the in-space placebo gaps (no synthetic-control refits):
- test_sharp_null(effect, gamma=0.1) — test H_0: α_1t = f(t) (Eqs 12–13, φ=0); test_sharp_null(0) is bit-for-bit placebo_p_value.
- confidence_set(family="constant"|"linear", gamma=0.1, bounds=None, n_grid=200) — invert that test (Eqs 14/16/18, strict p^f > γ). With bounds=None the set is recovered exactly: p^c is piecewise-constant, so the placebo-comparison quadratic breakpoints partition the line; p is evaluated per interval and at each breakpoint (a tie under ≥ can spike p, so isolated singletons are captured) — no centering/monotonicity assumption, so accepted tails / disjoint components / unbounded / empty are all handled.
Result lives in a dedicated effect_confidence_set summary + get_confidence_set_df(); analytical conf_int/se/t_stat/p_value stay NaN (a permutation set at level 1−γ, γ-granular in 1/(J+1), possibly a set/unbounded/non-contiguous — kept separate exactly as placebo_p_value is kept off p_value). Surfaced opt-in under estimator_native_diagnostics.confidence_set.
Fail-closed: γ < 1/(J+1) or treated-not-best-pre-fit → unbounded; nothing accepted → empty; disjoint/singleton → contiguous=False + warning; <2 donors / non-converged treated / unpickled → ValueError.

Methodology references

Method: SCM confidence sets / CI by test inversion (sharp-null RMSPE^f permutation), Firpo & Possebom (2018), J. Causal Inference 6(2), Eqs 11–18.
Source: docs/methodology/papers/firpo-possebom-2018-review.md (PR-A, docs: Firpo & Possebom (2018) paper review — SCM CI by test inversion (PR-A) #524); new ## SyntheticControl methodology block + 4 **Note:** labels in docs/methodology/REGISTRY.md (boundary convention, exact-breakpoint set construction, non-analytical conf_int, no-R-anchor validation).
Deviations: the set-construction method (exact breakpoint inversion / optional fixed grid) is OUR choice — the paper leaves the grid unspecified (documented). Strict p^f > γ boundary per Eq 14 (documented; the discrete p-value makes p=γ reachable). Deferred (flagged in the review checklist): sensitivity weights (φ≠0), the general-θ menu (Eq 19), one-sided (§7), multiple-outcome/treated (§6).

Validation

Tests added: tests/test_methodology_synthetic_control.py (numpy oracle for Eqs 12–14 incl. the strict p=γ boundary + the per-unit floor; the test_sharp_null(0)==placebo_p_value self-consistency anchor incl. a near-perfect-pre-fit floor-biting case; center-rejected/tails-accepted and isolated-breakpoint-singleton regressions; invariants; fail-closed; input validation; a @slow coverage simulation) + tests/test_diagnostic_report.py (DR surfacing). No R anchor — R Synth has no test inversion; validated by self-consistency (transitively R-anchored via the Basque placebo parity), the numpy oracle, and the coverage MC.

Security / privacy

Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

…018, PR-B) Add two opt-in SyntheticControlResults methods that re-rank the in-space placebo gaps into a confidence set for the treatment-effect path (no synthetic-control refits): - test_sharp_null(effect, gamma): test H_0: alpha_1t = f(t) (Eqs 12-13, phi=0, v=1); test_sharp_null(0) is identically placebo_p_value. - confidence_set(family="constant"|"linear", gamma): invert that test (Eqs 14/16/18, strict p^f > gamma). p^param is piecewise-constant, so the set is recovered EXACTLY via the placebo breakpoints (real roots of the pairwise RMSPE-equality quadratics) -- no shape/centering assumption, so accepted tails, disjoint components, empty and unbounded sets are all handled. effect_confidence_set summary + get_confidence_set_df() grid; analytical conf_int/se/t_stat/p_value stay NaN. Persist per-unit floored pre-denominators in in_space_placebo so the f=0 anchor holds bit-for-bit. Wire an opt-in confidence_set block into _scm_native (numeric endpoints only for a bounded set; JSON-safe). Fail-closed for unbounded (gamma<1/(J+1) or treated-not-best-pre-fit) / empty / non-contiguous / <2 donors / non-converged / unpickled. Docs: REGISTRY methodology block + checklist + Notes (boundary/grid/non-analytical/ no-R-anchor); review checklist flip; LLM guides; api rst; CHANGELOG; doc-deps. Tests: numpy oracle (Eqs 12-14 incl strict p=gamma boundary + per-unit floor), the f=0 self-consistency anchor, a center-rejected/tails-accepted regression, invariants, fail-closed paths, and a coverage simulation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-02T12:13:20Z

⚠️ Needs changes

Executive summary

Methodology formulas for Firpo-Possebom §4 mostly match the Registry and paper-review equations: Eq. 12/13 re-ranking, strict p > gamma, and NaN analytical inference are documented and implemented.
One unmitigated P1 state-consistency bug can silently surface a stale confidence set after the placebo reference set is rebuilt.
No new inline Wald inference / se > 0 else 0.0 anti-pattern found in the changed SCM paths.
Deferred scope items are documented in REGISTRY.md / paper-review checklist, so they are not defects.
git diff --check passed. I did not run the test suite.

Methodology

Finding 1

Severity: P1
Location: diff_diff/synthetic_control_results.py:L653-L900, diff_diff/synthetic_control_results.py:L1787-L1800, diff_diff/diagnostic_report.py:L2466-L2494
Impact: confidence_set() is defined as a re-ranking of the current in-space placebo reference set. However, after confidence_set() populates effect_confidence_set and _confidence_set_df, a later explicit in_space_placebo(...) rebuild overwrites placebo_p_value, _placebo_gaps, and _placebo_pre_denoms without invalidating the cached confidence-set outputs. The diagnostic report then emits the old effect_confidence_set as if it belongs to the current placebo reference set. This can silently report the wrong confidence set, especially because _require_placebo_reference() tells users to rebuild via in_space_placebo(n_starts=...).

Concrete fix: In in_space_placebo(), clear dependent confidence-set state whenever the placebo reference is recomputed or fail-closed:

self.effect_confidence_set = None
self._confidence_set_df = None

Do this on all exits that set _placebo_gaps / _placebo_status. Add a regression test: run confidence_set(), then rerun in_space_placebo(n_starts=...), assert effect_confidence_set is None, get_confidence_set_df() raises, and DiagnosticReport(...).to_dict()["estimator_native_diagnostics"]["confidence_set"]["status"] == "not_run" until confidence_set() is called again.

Code Quality

No P1+ code-quality issues beyond the cache invalidation bug above.

Performance

No blocking performance findings. The exact breakpoint inversion is a documented implementation choice in REGISTRY.md, not a methodology defect.

Maintainability

Finding 2

Severity: P3
Location: docs/methodology/papers/firpo-possebom-2018-review.md:L8-L14, docs/methodology/papers/firpo-possebom-2018-review.md:L135-L138
Impact: The checklist now says PR-B shipped, but the intro still describes PR-B as forthcoming. This is documentation drift only.

Concrete fix: Update the intro/scope note to say the review was consumed by the shipped PR-B implementation.

Tech Debt

No untracked blocking tech debt found. Deferred sensitivity weights, general theta menu, one-sided tests, and multiple-outcome/treated extensions are explicitly documented as deferred in REGISTRY.md and the paper review, so they are not defects.

Security

No security findings in the changed code/docs.

Documentation/Tests

Finding 3

Severity: P2
Location: tests/test_methodology_synthetic_control.py:L3407-L3420, diff_diff/synthetic_control_results.py:L653-L900
Impact: Tests cover lazy creation and n_starts ignored when a reference exists, but not the lifecycle where the reference set is explicitly rebuilt after a confidence set was computed. That gap allowed the P1 stale-cache issue.

Concrete fix: Add the regression test described in Finding 1.

Path to Approval

Invalidate effect_confidence_set and _confidence_set_df on every in_space_placebo() path that rebuilds or invalidates the placebo reference.
Add a regression test for confidence_set() followed by in_space_placebo(n_starts=...), including diagnostic-report status.
Update the stale paper-review intro language from “forthcoming PR-B” to shipped/consumed wording.

CI codex (gpt-5.5) findings on PR #527: - P1: confidence_set() caches effect_confidence_set / _confidence_set_df against the CURRENT in-space placebo reference set, but a later explicit in_space_placebo() rebuild (which _require_placebo_reference suggests via n_starts) overwrote the reference without invalidating the cache -> a stale set could be reported by summary()/to_dict()/_scm_native. Now clear both at the start of in_space_placebo() (after the snapshot check) so every rebuild drops the stale cache. - P2: add a regression test (confidence_set -> in_space_placebo(n_starts=) -> assert effect_confidence_set is None, get_confidence_set_df() raises, DR status "not_run"). - P3: update the Firpo-Possebom review intro from "forthcoming PR-B" to shipped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-02T12:20:51Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 3e98f262a1be2e2bd3db55fd78048580a783ec15

✅ Looks good

Executive summary

Previous P1 stale confidence-set cache finding is resolved: in_space_placebo() now clears effect_confidence_set / _confidence_set_df, with regression coverage.
Methodology matches the Registry: sharp-null re-ranking, strict p > gamma, per-unit denominator reuse, and separate non-analytical confidence-set output.
Deferred items and deviations are documented in REGISTRY.md, so they are informational only.
No new inline inference / partial NaN-guard anti-pattern found in changed SCM paths.
git diff --check passed. Targeted tests could not be run because pytest is not installed.

Methodology

Finding 1

Severity: P3 informational
Location: docs/methodology/REGISTRY.md:L1999-L2023, diff_diff/synthetic_control.py:L1801-L2119, diff_diff/synthetic_control_results.py:L1580-L1808
Impact: The implemented confidence-set path follows the documented Firpo-Possebom PR-B contract: test_sharp_null() re-ranks stored placebo gaps, confidence_set() uses strict p^f > gamma, analytical conf_int remains NaN, and exact breakpoint inversion is documented as an implementation choice. The paper describes sharp-null RMSPE testing and test inversion for confidence sets, and the published article metadata/abstract confirms this is the intended contribution. (degruyterbrill.com)
Concrete fix: No action required.

Finding 2

Severity: P3 informational
Location: docs/methodology/REGISTRY.md:L2020-L2023, docs/methodology/papers/firpo-possebom-2018-review.md:L135-L141
Impact: Deferred sensitivity weights, general test-statistic menu, one-sided tests, and multiple-outcome/treated extensions are explicitly documented as out of scope. Under the review rules, these are not defects.
Concrete fix: No action required.

Code Quality

No findings. The previous stale-cache issue is addressed at diff_diff/synthetic_control_results.py:L725-L730, and the regression test is at tests/test_methodology_synthetic_control.py:L3429-L3447.

Performance

No findings. The exact breakpoint inversion is documented in REGISTRY.md as an implementation choice rather than a paper-mandated algorithm.

Maintainability

No findings. Public surfaces are documented in docs/api/synthetic_control.rst and the LLM guides, and result serialization/reporting paths were updated consistently.

Tech Debt

No untracked blocking tech debt found. Deferred Firpo-Possebom scope is documented in the Registry/paper review, not silently left ambiguous.

Security

No security findings in the changed code/docs.

Documentation/Tests

Finding 3

Severity: P3 informational
Location: tests/test_methodology_synthetic_control.py:L3032-L3538, tests/test_diagnostic_report.py:L2114-L2129
Impact: Added tests cover the prior stale-cache lifecycle, strict boundary behavior, singleton/disjoint/unbounded cases, lazy placebo creation, invalid inputs, and diagnostic-report surfacing. I could not execute them locally because this runner does not have pytest installed (No module named pytest).
Concrete fix: No PR action required; run the targeted SCM tests in CI.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SyntheticControl: confidence sets by test inversion (Firpo-Possebom 2018, PR-B)#527

SyntheticControl: confidence sets by test inversion (Firpo-Possebom 2018, PR-B)#527
igerber wants to merge 2 commits into
mainfrom
feature/sc-ci-test-inversion

igerber commented Jun 2, 2026

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

github-actions Bot commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented Jun 2, 2026

Summary

Methodology references

Validation

Security / privacy

Uh oh!

github-actions Bot commented Jun 2, 2026

⚠️ Needs changes

Executive summary

Methodology

Finding 1

Code Quality

Performance

Maintainability

Finding 2

Tech Debt

Security

Documentation/Tests

Finding 3

Path to Approval

Uh oh!

github-actions Bot commented Jun 2, 2026

✅ Looks good

Executive summary

Methodology

Finding 1

Finding 2

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Finding 3

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant