diff --git a/CHANGELOG.md b/CHANGELOG.md
index 27c7aeb9..bb7ffed8 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added
- **`SyntheticControl` cross-validation + inverse-variance `V`-selection (ADH 2015 §; Abadie 2021 §3.2(a), Eq. 9).** Two new `v_method` values complete the ADH-2015/Abadie-2021 `V`-selection menu (joining `"nested"` / `"custom"`), each threaded through the in-space / leave-one-out / in-time placebo refits so a diagnostic uses the **same** estimator as the headline fit. **`v_method="cv"`** selects the diagonal predictor-importance `V` by out-of-sample cross-validation: the pre-period is split positionally at `v_cv_t0` (new constructor param; default `len(pre)//2`, Abadie 2021's `t0 = T0/2`) into a training and a validation window, `V` is chosen to minimize the validation-window outcome MSPE of the training-fit weights (`mspe_v` now reports this validation MSPE under cv), and the final reported weights are re-estimated on the validation-window predictors (ADH 2015 step 4). Each predictor spec is **re-aggregated** over each window (its mean/sum/identity recomputed over only the periods that fall in that window — a separate `dataprep` per window, exactly as ADH 2015's CV does, since R `Synth` has no built-in CV function), so the V-search is genuinely out-of-sample for every predictor type and the same `V*` drives both fits with no zeroed coordinate (`v_weights` reproduce `donor_weights` on the validation-window predictors, and `predictor_balance` is reported on that validation-window basis). **Fully-spanning precondition (fail-closed):** re-aggregating a predictor on each window requires it to be observed in **both** windows, so `cv` **requires every predictor to span both the training and validation windows** and raises `ValueError` otherwise — satisfied by ADH 2015's shared covariate / multi-period `special_predictors` (which span the windows) but NOT by the default per-period outcome lags (each is single-period and lives in one window only), so `cv` with the bare default predictors is rejected with guidance to pass spanning predictors. 
In-time-placebo truncation that breaks the fully-spanning precondition (a kept spec stops spanning both windows at the truncated split) marks that date `infeasible`. A second fail-closed gate covers windows that span but carry **no cross-donor variation** (every re-aggregated predictor constant across the donors, so `X0·W` is constant in `W` → a flat, unidentified weight solve that would otherwise return arbitrary "converged" weights — even when the treated unit differs, since donor distinguishability, not treated-vs-donor variation, identifies `W`): the headline fit raises `ValueError`, in-space placebo refits whose donor pool is indistinguishable in a window are dropped from the reference set, and such in-time-truncated dates are marked `infeasible`. Abadie 2021 footnote 7's CV non-uniqueness is handled by a **deterministic tie-break** (prefer the `V` closest to uniform among ties), making the selected `V*` among equally-good optima independent of the multistart evaluation order. The cv fit is reproducible for a fixed `seed` (like `nested`) but is not seed-independent — the multistart fills any slots beyond the distinct heuristic starts with seed-dependent random Dirichlet draws, so the tie-break removes start-order dependence among ties, not seed dependence. The tie-break is convergence-aware (a non-converged optimizer candidate cannot displace a converged incumbent on an objective tie). If the training-window solve that defines `mspe_v` truncates (e.g. `inner_max_iter` too small), the fit fails closed — `mspe_v=NaN` and the fit is marked non-converged — rather than reporting an invalid Eq. 9 criterion. 
**`v_method="inverse_variance"`** uses the closed form `v_h = 1/Var(X_h)` (variance over donors+treated on the unstandardized predictors), applied to the **raw** predictors so the effective objective is the unit-variance-rescaled `Σ_h diff_h²/Var_h` (Abadie 2021 §3.2(a)); the `standardize` pre-scaling is intentionally bypassed on this branch (inverse-variance weighting *is* the unit-variance rescaling — applying it on already-standardized rows would double-rescale to `Σ_h diff_h²/Var_h²`), so it is equivalent to uniform `V` on standardized predictors. No search (`mspe_v=None`); a zero-variance row gets 0 weight and an all-zero-variance panel falls back to uniform `V` with a warning. `custom_v` is rejected (fail-closed) for both methods and `v_cv_t0` is rejected unless `v_method="cv"`. On the degenerate **single-donor** path (`J=1` forces `w=[1]`) `V` is unidentified — every `V` yields the same synthetic — so `v_weights` is **uniform** and `mspe_v=None` for ALL `v_method`s (cv / inverse_variance included; their selected / closed-form `V` would be inert), with a `UserWarning`; the donor weights / gap / ATT are unaffected. An explicitly pinned `v_cv_t0` that no longer fits the truncated pre-fake window is nulled to the `//2` default for the placebo refit (a pinned value that still fits the truncated window is kept). **Validation:** R `Synth` has no built-in CV function (ADH 2015's CV is a manual `dataprep`+`synth` re-run), so cv is anchored by deterministic equivalence to the R-anchored `custom_v` path (the step-3 validation MSPE of the training-window fit and the step-4 validation-window weights each match a `custom_v=V*` fit on the correspondingly re-aggregated predictors) plus cv self-consistency (`in_time_placebo` under cv == a fresh cv fit on the backdated panel to 1e-7); inverse-variance is anchored bit-for-bit to a `custom_v=1/Var(X)` fit. 
Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (new `**Note:**` labels for the per-window re-aggregation convention, the flat-MSPE tie-break, and inverse-variance), `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`. The remaining ADH-2015 items (`W^reg` extrapolation diagnostic, sparse-SC subset search) stay tracked in `TODO.md`.
- **Firpo & Possebom (2018) SCM inference paper review on file (PR-A).** Added `docs/methodology/papers/firpo-possebom-2018-review.md`, a faithful, paper-sourced fidelity review of Firpo & Possebom (2018, *Journal of Causal Inference* 6(2), DOI 10.1515/jci-2016-0026) — the Step-1 artifact for the forthcoming SCM **confidence-set / CI-by-test-inversion** track (PR-B) layered on the existing `SyntheticControl` estimator (classic SCM has no analytical SE; `se`/`p_value`/`conf_int` are NaN). Transcribes (paper-sourced only, no code-deviation verdicts) the benchmark RMSPE-ratio permutation test (Eqs. 4–6), the sensitivity-analysis parametric p-value weights with worst/best-case `φ̲`/`φ̄` (Eqs. 7–9), the sharp-null `RMSPE^f` test (Eqs. 10–13), the **confidence sets by test inversion** (Eq. 14) with the operational constant-effect CI (Eqs. 15–16) and linear-effect CS (Eqs. 17–18), the general test-statistic framework + Monte Carlo size/power of five statistics (Eq. 19, Section 5), and the multiple-outcome FWER (Eqs. 23–24) and multiple-treated-unit pooled (Eqs. 25–26) extensions; the requirements checklist flags the PR-B target (sharp-null test + constant/linear CI + benchmark + one-sided) versus the deferred sensitivity-analysis and multi-outcome/treated extensions. Docs-only; no code change. Registered in `docs/references.rst` (Synthetic Control Method section) and `docs/doc-deps.yaml`; REGISTRY `## SyntheticControl` gains a `firpo-possebom-2018-review.md` reviews-on-file pointer.
+- **`SyntheticControl` confidence sets by test inversion (Firpo & Possebom 2018 §4, PR-B).** Classic SCM gains the uncertainty quantification it has lacked — a confidence set for the treatment-effect *path* — without changing its always-NaN analytical inference contract. Two opt-in `SyntheticControlResults` methods built ON TOP of the in-space placebo: `test_sharp_null(effect, gamma=0.1)` tests a sharp null `H_0: α_1t = f(t)` (Eq 11; `effect` a scalar constant effect or a length-`n_post` post-period path) by subtracting `f(t)` from every unit's post-period gaps and re-ranking the modified RMSPE ratio `RMSPE^f` (Eqs 12–13 at `φ=0`, `v=(1,…,1)`), and `confidence_set(family="constant"|"linear", gamma=0.1, bounds=None, n_grid=200)` inverts that test into a confidence set — a constant-in-time interval (Eqs 15–16) or a linear-in-time slope set (Eqs 17–18) — keeping every value whose sharp null is not rejected at the paper's **strict** `p^f > γ` boundary (Eq 14). The whole computation is a **pure re-ranking of the gap paths `in_space_placebo()` already computes** (no synthetic-control refits): under a common-effect null the donor synthetics and the pre-period MSPE denominators are unchanged — only the post gaps shift by `f(t)` — so each grid value costs an `O(J)` rank, not a refit. With `bounds=None` the set is recovered **EXACTLY** by piecewise-constant breakpoint inversion: `p^c` is constant between the real roots of the placebo-vs-treated comparison quadratics, so `p` is evaluated once per induced interval AND at each breakpoint (a tie under `≥` can lift `p` above γ there, yielding an isolated accepted point) — NO centering/monotonicity assumption, so accepted tails, disjoint components, and unbounded/empty sets are all handled (a poor-pre-fit treated unit can have its accepted region in the tails). `bounds=(lo,hi)` instead scans a fixed grid (grid-limited); `n_grid` controls only the returned inspection table when `bounds=None`. 
Results: a pickle-surviving `effect_confidence_set` summary (`{family, parameter, gamma, lower, upper, contiguous, status, …}`, `status ∈ {"ran","empty","unbounded"}`) + a `get_confidence_set_df()` grid table, surfaced under `estimator_native_diagnostics.confidence_set`. **The analytical `conf_int`/`se`/`t_stat`/`p_value` stay NaN** — this is a permutation set at level `1−γ` (γ granular in `1/(J+1)`), possibly a set / unbounded / non-contiguous, so it cannot be coerced into the Wald-interval `conf_int` tuple; it is kept separate exactly as `placebo_p_value` is kept off `p_value`. **Fail-closed:** `γ < 1/(J+1)` (no value rejectable — fn 8) or a treated unit lacking the best pre-fit → `"unbounded"` (`±inf` + warning); no interval or breakpoint accepted → `"empty"` (NaN endpoints); a non-contiguous accepted region (disjoint components / an isolated singleton) → the `[lower, upper]` hull with `contiguous=False` + warning; `< 2` donors / a non-converged treated fit / an unpickled result (no placebo reference set) → `ValueError`. `test_sharp_null(0)` is held bit-for-bit equal to `placebo_p_value` (Eq 5 = Eq 13) by reusing each unit's **per-unit** floored pre-period denominator persisted from the placebo run. **Scope:** the sensitivity-analysis weights (`φ≠0`, Eqs 7–9), the general test-statistic menu (Eq 19), one-sided (§7's signed-`t` statistic), and the multiple-outcome/treated extensions (§6) are deferred (flagged in the paper review checklist). **Validation:** no R anchor (R `Synth` has no test inversion; the authors' Code Ocean capsule was not consulted) — self-consistency to the (Basque-R-anchored) `placebo_p_value`, a numpy oracle on Eqs 12–14 (incl. the strict `p=γ` boundary and the per-unit floor), invariants (the point estimate lies in the constant set for a well-posed fit; a center-rejected/tails-accepted regression; an isolated-breakpoint singleton; monotone-in-γ), and a coverage simulation. 
Consumes the PR-A `firpo-possebom-2018-review.md`; documented in `docs/methodology/REGISTRY.md` §SyntheticControl (new methodology block + `**Note:**` labels for the boundary convention, the grid choice, the non-analytical `conf_int` contract, and the no-R-anchor validation), `docs/api/synthetic_control.rst`, and the LLM guides.
- **`HeterogeneousAdoptionDiD.fit()` fit-time extensive-margin warning + `covariates=` not-implemented pointer.** Two UX additions to the HAD `fit()` surface, with **no change to any estimate or standard error**. (1) The **overall** path now emits a `UserWarning` when a non-trivial fraction (`>= 10%`, a library-convention cutoff in `_HAD_EXTENSIVE_MARGIN_ZERO_DOSE_FRAC`) of units have an exactly-zero post-period dose — a genuine untreated mass for which a standard DiD using those units as controls may be more appropriate (de Chaisemartin et al. 2026, Section 2 / Assumption 3). The paper retains *small* untreated shares (e.g. 12/2954 in Garrett et al., with close-to-nominal coverage), so the 10% cutoff sits ~25× above that; the warning is **overall-path-only** because the event-study path *requires* never-treated units per Appendix B.2. Previously the recommendation surfaced only via `qug_test()`'s zero-dose warning when the user ran the pre-tests. (2) `HeterogeneousAdoptionDiD.fit(covariates=...)` now raises `NotImplementedError` with a pointer to the deferred Appendix B.1 / Theorem 6 covariate-adjusted extension (via an explicit keyword-only `covariates=` param) instead of a bare `TypeError` from an unknown kwarg; pre-residualize the outcome on the covariates as a workaround. Documented in `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD; new tests in `tests/test_had.py` and `tests/test_methodology_had.py`.
### Fixed
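
The re-ranking described in the `SyntheticControl` confidence-set entry above can be illustrated with a small standalone sketch. It is not the library's code: it uses plain lists in place of the per-unit gap dictionaries, and the names `rmspe_f_ratio`, `sharp_null_pvalue`, and all toy numbers are assumptions made for illustration. It only shows how subtracting `f(t)` from every unit's post-period gaps and counting ties with `>=` produces `p^f`, with the pre-period denominators left untouched.

```python
import math


def rmspe_f_ratio(post_gaps, f_post, pre_denom):
    """sqrt(mean_t((g_t - f_t)**2) / pre_denom) for one unit (hypothetical helper)."""
    resid_sq = [(g - f) ** 2 for g, f in zip(post_gaps, f_post)]
    return math.sqrt(sum(resid_sq) / len(resid_sq) / pre_denom)


def sharp_null_pvalue(treated, placebos, f_post):
    """Permutation p-value; `treated` and each placebo are (post_gaps, pre_denom) pairs."""
    r1 = rmspe_f_ratio(treated[0], f_post, treated[1])
    # Ties are counted with >=, so the p-value is conservative.
    n_ge = sum(1 for gaps, denom in placebos if rmspe_f_ratio(gaps, f_post, denom) >= r1)
    # The treated unit itself contributes the "+1" in numerator and denominator.
    return (1 + n_ge) / (len(placebos) + 1)


# Toy data (assumed): two post periods, three placebo units, unit pre-period denominators.
treated = ([2.0, 2.0], 1.0)
placebos = [([0.5, -0.5], 1.0), ([1.0, 1.0], 1.0), ([0.0, 0.2], 1.0)]

p_no_effect = sharp_null_pvalue(treated, placebos, [0.0, 0.0])   # H_0: f(t) = 0
p_effect_two = sharp_null_pvalue(treated, placebos, [2.0, 2.0])  # H_0: f(t) = 2
```

With three placebos the p-value moves in steps of `1/4`, which is the `1/(J+1)` granularity the entry mentions: under `f(t) = 0` no placebo ratio reaches the treated one, and under `f(t) = 2` the treated residual is zero so every placebo ties or exceeds it.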
diff --git a/diff_diff/diagnostic_report.py b/diff_diff/diagnostic_report.py
index fd65ea61..c48cdfe6 100644
--- a/diff_diff/diagnostic_report.py
+++ b/diff_diff/diagnostic_report.py
@@ -2458,6 +2458,49 @@ def _scm_native(self, r: Any) -> Dict[str, Any]:
"placebo (opt-in; refits per backdated date)."
),
}
+
+        # Test-inversion confidence set (Firpo & Possebom 2018 §4): opt-in, surfaced once
+        # the user has run results.confidence_set() (it reuses the in-space placebo
+        # reference set — no refits). The analytical conf_int stays NaN; this is a SEPARATE
+        # permutation set at level 1 - gamma, possibly unbounded or non-contiguous.
+        ecs = getattr(r, "effect_confidence_set", None)
+        if ecs is not None:
+            ecs_status = ecs.get("status")
+            _lo, _hi = ecs.get("lower"), ecs.get("upper")
+            block = {
+                "status": ecs_status,
+                "family": ecs.get("family"),
+                "parameter": ecs.get("parameter"),
+                "gamma": _to_python_float(ecs.get("gamma")),
+                # Emit each endpoint independently: a finite float, else None for a non-finite
+                # side (NaN for an empty set, +/-inf for an unbounded tail) -- keeps the dict
+                # JSON-safe while preserving the FINITE side of a one-sided unbounded set.
+                "lower": float(_lo) if isinstance(_lo, (int, float)) and np.isfinite(_lo) else None,
+                "upper": float(_hi) if isinstance(_hi, (int, float)) and np.isfinite(_hi) else None,
+                "contiguous": bool(ecs.get("contiguous")),
+                "n_placebos": _to_python_scalar(ecs.get("n_placebos")),
+            }
+            if ecs_status == "unbounded":
+                block["reason"] = (
+                    "confidence_set() ran but the set is unbounded (gamma below the "
+                    "1/(J+1) permutation granularity, or the treated unit lacks the best "
+                    "pre-treatment fit); endpoint(s) are +/-inf."
+                )
+            elif ecs_status == "empty":
+                block["reason"] = (
+                    "confidence_set() ran but the set is empty (every effect in the "
+                    "family is rejected at gamma); endpoints are NaN."
+                )
+            out["confidence_set"] = block
+        else:
+            out["confidence_set"] = {
+                "status": "not_run",
+                "reason": (
+                    "Call results.confidence_set() for a test-inversion confidence set of "
+                    "the effect path (Firpo-Possebom 2018; opt-in, reuses the in-space "
+                    "placebo reference set)."
+                ),
+            }
         return out
# -- Heterogeneity helpers --------------------------------------------
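
The endpoint handling in the hunk above (a finite float is kept, anything non-finite becomes `None`) can be tried in isolation. This is a minimal sketch under stated assumptions: `json_safe_endpoint` is a hypothetical helper name, `math.isfinite` stands in for the `np.isfinite` call used in the patch, and the `summary` dict is invented toy data shaped like the `effect_confidence_set` summary.

```python
import json
import math


def json_safe_endpoint(value):
    """Return a plain float for a finite number, else None (NaN, +/-inf, or missing)."""
    if isinstance(value, (int, float)) and math.isfinite(value):
        return float(value)
    return None


# Hypothetical summary for a one-sided unbounded set: finite lower, infinite upper.
summary = {"status": "unbounded", "lower": 1.5, "upper": float("inf")}

block = {
    "status": summary["status"],
    "lower": json_safe_endpoint(summary.get("lower")),
    "upper": json_safe_endpoint(summary.get("upper")),
}

# allow_nan=False makes json.dumps raise on NaN/inf, so this succeeds only
# because the non-finite side was already mapped to None.
encoded = json.dumps(block, allow_nan=False)
```

Converting each side independently is what preserves the finite endpoint of a one-sided set; mapping the whole pair to `None` whenever either side is non-finite would discard it.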
diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt
index b4a89596..b64b532a 100644
--- a/diff_diff/guides/llms-full.txt
+++ b/diff_diff/guides/llms-full.txt
@@ -617,7 +617,7 @@ scm.fit(
) -> SyntheticControlResults
```
-**Inference:** NONE analytical — `se`/`t_stat`/`p_value`/`conf_int` are always NaN. `att` is the mean post-period gap. Significance via in-space placebo permutation inference: `results.in_space_placebo()` reassigns treatment to each donor, refits against the other J-1 donors (the real treated unit is excluded from every placebo pool), and sets `placebo_p_value = rank/(n_placebos+1)` from the post/pre RMSPE-ratio. The permutation `placebo_p_value` is a SEPARATE field from the (NaN) `p_value`; `is_significant` stays bound to `p_value`. **ADH-2015 §4 robustness (opt-in, analytical inference unchanged):** `results.leave_one_out()` drops each reportably-weighted donor (weight > 1e-6) and re-fits (per-drop ATT/`delta_att` — large `delta_att` ⇒ single-donor dependence); `results.in_time_placebo()` backdates the intervention and checks for a spurious pre-period gap (TRUNCATE windowing — predictor windows in the held-out region are dropped). Predictor periods must lie within the pre window; `post_periods` must be a contiguous suffix cross-checked against `D` (no anticipation).
+**Inference:** NONE analytical — `se`/`t_stat`/`p_value`/`conf_int` are always NaN. `att` is the mean post-period gap. Significance via in-space placebo permutation inference: `results.in_space_placebo()` reassigns treatment to each donor, refits against the other J-1 donors (the real treated unit is excluded from every placebo pool), and sets `placebo_p_value = rank/(n_placebos+1)` from the post/pre RMSPE-ratio. The permutation `placebo_p_value` is a SEPARATE field from the (NaN) `p_value`; `is_significant` stays bound to `p_value`. **ADH-2015 §4 robustness (opt-in, analytical inference unchanged):** `results.leave_one_out()` drops each reportably-weighted donor (weight > 1e-6) and re-fits (per-drop ATT/`delta_att` — large `delta_att` ⇒ single-donor dependence); `results.in_time_placebo()` backdates the intervention and checks for a spurious pre-period gap (TRUNCATE windowing — predictor windows in the held-out region are dropped). **Confidence sets by test inversion (Firpo-Possebom 2018 §4, opt-in, also non-analytical):** `results.test_sharp_null(effect, gamma=0.1)` tests `H_0: α_1t = f(t)` by re-ranking the in-space placebo gaps (no refits; `test_sharp_null(0)` == `placebo_p_value`), and `results.confidence_set(family="constant"|"linear", gamma=0.1)` inverts it into a confidence set for the effect path (constant-effect interval / linear-slope set, strict `p>gamma`), surfaced on `effect_confidence_set` / `get_confidence_set_df()` — the analytical `conf_int` still stays NaN. Predictor periods must lie within the pre window; `post_periods` must be a contiguous suffix cross-checked against `D` (no anticipation).
**Usage:**
@@ -1300,8 +1300,9 @@ Returned by `SyntheticControl.fit()`.
| `pre_periods`, `post_periods` | `list` | Calendar-sorted periods |
| `v_method`, `standardize` | `str` | Echoed configuration |
| `v_cv_t0` | `int \| None` | Resolved cv train/validation split index (None unless v_method="cv") |
+| `effect_confidence_set` | `dict \| None` | Test-inversion confidence-set summary (Firpo-Possebom 2018 §4): `{family, parameter, gamma, lower, upper, contiguous, status ("ran"/"empty"/"unbounded"), ...}`; None until `confidence_set()` runs. SEPARATE from the always-NaN analytical `conf_int` (a permutation set at level 1−gamma, possibly a set/unbounded). |
-**Methods:** `in_space_placebo()` (opt-in permutation inference; refits one synthetic control per donor), `get_placebo_df()` (per-unit RMSPE-ratio table incl. the treated row), `leave_one_out()` (ADH-2015 §4 donor robustness; drops each reportably-weighted donor (weight > 1e-6) → per-drop ATT/`delta_att` table) + `get_leave_one_out_df()`/`get_leave_one_out_gaps()`, `in_time_placebo()` (ADH-2015 §4 backdating placebo; reassigns the intervention earlier, TRUNCATE windowing, placebo ATT ~0 if no real pre-effect) + `get_in_time_placebo_df()`/`get_in_time_placebo_gaps()`, `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()`, `get_gap_df()`, `get_weights_df()`
+**Methods:** `in_space_placebo()` (opt-in permutation inference; refits one synthetic control per donor), `get_placebo_df()` (per-unit RMSPE-ratio table incl. the treated row), `leave_one_out()` (ADH-2015 §4 donor robustness; drops each reportably-weighted donor (weight > 1e-6) → per-drop ATT/`delta_att` table) + `get_leave_one_out_df()`/`get_leave_one_out_gaps()`, `in_time_placebo()` (ADH-2015 §4 backdating placebo; reassigns the intervention earlier, TRUNCATE windowing, placebo ATT ~0 if no real pre-effect) + `get_in_time_placebo_df()`/`get_in_time_placebo_gaps()`, `test_sharp_null(effect, gamma=0.1)` (Firpo-Possebom 2018 §4: test a sharp null α_1t=f(t) by re-ranking the in-space placebo gaps — `effect` is a scalar or a post-period array; `test_sharp_null(0)` is identically `placebo_p_value`), `confidence_set(family="constant"|"linear", gamma=0.1, bounds=None, n_grid=200)` (invert that test for a confidence set of the effect path — a constant-effect interval / linear-slope set; strict p>gamma membership; exact piecewise-constant breakpoint inversion when bounds=None, else a fixed grid; `conf_int` stays NaN) + `get_confidence_set_df()`, `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()`, `get_gap_df()`, `get_weights_df()`
### TripleDifferenceResults
diff --git a/diff_diff/guides/llms.txt b/diff_diff/guides/llms.txt
index 250068de..50faa40b 100644
--- a/diff_diff/guides/llms.txt
+++ b/diff_diff/guides/llms.txt
@@ -60,7 +60,7 @@ Full practitioner guide: call `diff_diff.get_llm_guide("practitioner")`
- [TwoStageDiD](https://diff-diff.readthedocs.io/en/stable/api/two_stage.html): Gardner (2022) two-stage estimator with GMM sandwich variance
- [SpilloverDiD](https://diff-diff.readthedocs.io/en/stable/api/spillover.html): Butts (2021) ring-indicator spillover-aware DiD identifying direct effect on treated + per-ring spillover-on-control; reuses `conley_coords` for ring construction; handles non-staggered and staggered timing; supports `SurveyDesign(weights, strata, psu, fpc)` under `vcov_type="hc1"` with optional `cluster=` for CR1 via Gerber (2026) Binder TSL (Wave E.1) and under `vcov_type="conley"` via a panel-aware stratified-Conley sandwich on per-period PSU totals (Wave E.2 cross-sectional `conley_lag_cutoff=0`) extended in Wave E.2 follow-up to `conley_lag_cutoff > 0` via panel-block composition with within-PSU serial Bartlett HAC (Newey-West 1987 separable form; `lag>0` requires an effective PSU via explicit `survey_design.psu` or injected `cluster=`), both composed with the Wave D Gardner GMM correction; `SurveyDesign.subpopulation()` preserves full-design `n_psu` / `df_survey` via zero-padded scores at the meat-helper boundary (Wave E.3, R `svyrecvar(subset())` form) (replicate weights queued as follow-up)
- [SyntheticDiD](https://diff-diff.readthedocs.io/en/stable/api/estimators.html): Synthetic DiD combining standard DiD and synthetic control methods for few treated units
-- [SyntheticControl](https://diff-diff.readthedocs.io/en/stable/api/synthetic_control.html): Abadie, Diamond & Hainmueller (2010) classic synthetic control for ONE treated unit — donor-weight counterfactual, predictor-importance V via nested / cv (out-of-sample, ADH 2015; needs predictors spanning both train/val windows, so default single-period lags are rejected) / inverse-variance (1/Var on raw predictors, bypasses standardize) / custom, gap path + pre-RMSPE; no analytical SE (inference fields NaN), significance via in-space placebo permutation inference (`in_space_placebo()`, post/pre RMSPE-ratio, p = rank/(n_placebos+1)); ADH-2015 §4 robustness: `leave_one_out()` donor-robustness + `in_time_placebo()` backdating placebo
+- [SyntheticControl](https://diff-diff.readthedocs.io/en/stable/api/synthetic_control.html): Abadie, Diamond & Hainmueller (2010) classic synthetic control for ONE treated unit — donor-weight counterfactual, predictor-importance V via nested / cv (out-of-sample, ADH 2015; needs predictors spanning both train/val windows, so default single-period lags are rejected) / inverse-variance (1/Var on raw predictors, bypasses standardize) / custom, gap path + pre-RMSPE; no analytical SE (inference fields NaN), significance via in-space placebo permutation inference (`in_space_placebo()`, post/pre RMSPE-ratio, p = rank/(n_placebos+1)); ADH-2015 §4 robustness: `leave_one_out()` donor-robustness + `in_time_placebo()` backdating placebo; confidence sets by test inversion (Firpo-Possebom 2018 §4): `test_sharp_null()` + `confidence_set(family="constant"|"linear")` re-rank the placebo gaps into a confidence set for the effect path (the analytical `conf_int` stays NaN)
- [TripleDifference](https://diff-diff.readthedocs.io/en/stable/api/triple_diff.html): Triple difference (DDD) estimator for designs requiring two criteria for treatment eligibility
- [ContinuousDiD](https://diff-diff.readthedocs.io/en/stable/api/continuous_did.html): Callaway, Goodman-Bacon & Sant'Anna (2024) continuous treatment DiD with dose-response curves
- [HeterogeneousAdoptionDiD](https://diff-diff.readthedocs.io/en/stable/api/had.html): de Chaisemartin, Ciccia, D'Haultfœuille & Knau (2026) for designs where **no unit remains untreated**; local-linear estimator at the dose support boundary returning Weighted Average Slope (WAS) on Design 1' (`d̲=0` / QUG) or `WAS_{d̲}` on Design 1 (`d̲>0`, continuous-near-d̲ or mass-point), with multi-period event-study extension (last-treatment cohort, pointwise CIs). **Panel-only** in this release (repeated cross-sections rejected by the validator). Alias `HAD`.
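
The `synthetic_control.py` changes that follow rest on one algebraic fact: for a one-parameter family, each unit's post-period criterion is a quadratic `A_u(param) = S2*param**2 - 2*P_u*param + Q_u`, so a placebo's `>=` indicator can only flip at the real roots of a quadratic in `param`. The sketch below illustrates that idea on toy moments. It is an assumption-laden illustration, not the patch's implementation: the helper names and numbers are invented, and it evaluates `p` only on the open intervals between breakpoints, leaving out the tie-aware evaluation at the breakpoints themselves.

```python
import math


def real_roots(a, b, c, eps=1e-12):
    """Real roots of a*x**2 + b*x + c = 0; falls back to the linear case when a ~ 0."""
    if abs(a) < eps:
        return [] if abs(b) < eps else [-c / b]
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return []
    r = math.sqrt(disc)
    return [(-b - r) / (2.0 * a), (-b + r) / (2.0 * a)]


def p_value(param, s2, treated, placebos):
    """(1 + #{j : A_j/D_j >= A_1/D_1}) / (n_ref + 1) for units given as (P, Q, D)."""
    def ratio(unit):
        p_u, q_u, d_u = unit
        return (s2 * param * param - 2.0 * p_u * param + q_u) / d_u

    r1 = ratio(treated)
    return (1 + sum(1 for u in placebos if ratio(u) >= r1)) / (len(placebos) + 1)


def interval_pvalues(s2, treated, placebos):
    """p on each open interval between consecutive breakpoints, plus both tails."""
    p1, q1, d1 = treated
    cuts = set()
    for pj, qj, dj in placebos:
        # Roots of A_j/D_j - A_1/D_1 = 0, written as a quadratic in param.
        cuts.update(real_roots(s2 / dj - s2 / d1, -2.0 * (pj / dj - p1 / d1), qj / dj - q1 / d1))
    cuts = sorted(cuts)
    if not cuts:
        return [(-math.inf, math.inf, p_value(0.0, s2, treated, placebos))]
    edges = [-math.inf] + cuts + [math.inf]
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if math.isinf(lo):
            probe = hi - 1.0          # any point in the lower tail
        elif math.isinf(hi):
            probe = lo + 1.0          # any point in the upper tail
        else:
            probe = 0.5 * (lo + hi)   # interior point of a bounded interval
        rows.append((lo, hi, p_value(probe, s2, treated, placebos)))
    return rows


# Toy (P_u, Q_u, D_u) moments for the constant family (s_t = 1, so S2 = 1).
treated = (2.0, 4.0, 0.25)  # smallest pre-period denominator, i.e. the best pre-fit
placebos = [(0.0, 0.25, 1.0), (1.0, 1.0, 1.0), (0.1, 0.02, 1.0)]
rows = interval_pvalues(1.0, treated, placebos)
accepted = [(lo, hi) for lo, hi, p in rows if p > 0.3]  # strict p > gamma
```

Because the toy treated unit has the smallest denominator, both tails are rejected and the accepted intervals form a bounded region around the treated unit's own minimizer. Giving the treated unit a denominator no smaller than the placebos' is one way to see an accepted tail appear instead.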
diff --git a/diff_diff/synthetic_control.py b/diff_diff/synthetic_control.py
index 5a04fa75..7b396047 100644
--- a/diff_diff/synthetic_control.py
+++ b/diff_diff/synthetic_control.py
@@ -1761,6 +1761,22 @@ def _mspe(gap_path: Dict[Any, float], periods: List[Any]) -> float:
     return float(np.mean(g**2))
+def _floored_pre_mspe(pre_gaps: np.ndarray, scale: float) -> float:
+    """Pre-period MSPE with the scale-aware floor used as the RMSPE-ratio denominator.
+
+    ``= max(mean(pre_gaps**2), 1e-8 * max(scale, 1)**2)``. Factored out of
+    ``_rmspe_ratio`` so the sharp-null test inversion (Firpo & Possebom 2018) can reuse
+    the SAME per-unit floored denominator the in-space placebo used — the floor scale is
+    PER-UNIT (``scale`` = ``max|Z1|`` of that unit's pre-period outcomes), so this is
+    what guarantees ``test_sharp_null(0) == placebo_p_value`` bit-for-bit even when the
+    floor bites (a near-perfect pre-fit). The denominator is GRID-INVARIANT under a sharp
+    null whose ``f`` is zero in the pre-period (the operational constant/linear families).
+    """
+    pre_mspe = float(np.mean(pre_gaps**2)) if pre_gaps.size else float("nan")
+    floor = 1e-8 * max(float(scale), 1.0) ** 2
+    return max(pre_mspe, floor)
+
+
 def _rmspe_ratio(pre_gaps: np.ndarray, post_gaps: np.ndarray, scale: float) -> float:
     """Post/pre RMSPE ratio — the in-space placebo test statistic (ADH 2010 §2.4).
@@ -1777,10 +1793,339 @@ def _rmspe_ratio(pre_gaps: np.ndarray, post_gaps: np.ndarray, scale: float) -> f
     ``scale`` is the magnitude of the unit's pre-period outcomes. Mirrors the
     ``_fit_tol`` poor-fit guard in ``fit()``.
     """
-    pre_mspe = float(np.mean(pre_gaps**2)) if pre_gaps.size else float("nan")
     post_mspe = float(np.mean(post_gaps**2)) if post_gaps.size else float("nan")
-    floor = 1e-8 * max(float(scale), 1.0) ** 2
-    return float(np.sqrt(post_mspe / max(pre_mspe, floor)))
+    return float(np.sqrt(post_mspe / _floored_pre_mspe(pre_gaps, scale)))
+
+
+# =============================================================================
+# Sharp-null test inversion (Firpo & Possebom 2018, "Synthetic Control Method:
+# Inference, Sensitivity Analysis and Confidence Sets," J. Causal Inference 6(2)).
+# These helpers RE-RANK the gap paths the in-space placebo already computed (Eqs
+# 12-13, φ=0, v=(1,...,1)); no synthetic-control refits. The benchmark f≡0 case is
+# identically the existing in-space placebo permutation (Eq 5 = Eq 13 at f≡0).
+# =============================================================================
+
+
+def _constant_f_post(c: float, n_post: int) -> np.ndarray:
+    """Constant-in-time post-period effect path f(t)=c (Firpo & Possebom Eq 15)."""
+    return np.full(n_post, float(c), dtype=float)
+
+
+def _linear_f_post(c_tilde: float, n_post: int) -> np.ndarray:
+    """Linear-in-time post-period effect path f(t)=c̃·(t−T0) (Firpo & Possebom Eq 17).
+
+    ``(t−T0)`` is the 1-based post-period index (1 for the first post period), so the
+    path is ``c̃·[1, 2, …, n_post]`` aligned to calendar-sorted ``post_periods``.
+    """
+    return float(c_tilde) * np.arange(1, n_post + 1, dtype=float)
+
+
+def _rmspe_f_ratio(
+    gap_path: Dict[Any, float],
+    post_periods: List[Any],
+    f_post: np.ndarray,
+    pre_denom: float,
+) -> float:
+    """RMSPE^f ratio under a common-effect sharp null (Firpo & Possebom Eq 12).
+
+    Subtracts the post-period effect path ``f_post`` (length ``len(post_periods)``,
+    aligned to calendar-sorted ``post_periods``) from the unit's post gaps, then divides
+    by the GRID-INVARIANT floored pre-period denominator ``pre_denom`` (the pre window is
+    ``f``-free for the operational constant/linear families, Eqs 15/17, so the
+    denominator is unchanged across the inversion grid). Returns
+    ``sqrt(mean_post((g_t − f_t)**2) / pre_denom)``.
+    """
+    post = np.array([gap_path[p] for p in post_periods], dtype=float)
+    resid = post - np.asarray(f_post, dtype=float)
+    post_mspe = float(np.mean(resid**2)) if resid.size else float("nan")
+    return float(np.sqrt(post_mspe / pre_denom))
+
+
+def _sharp_null_pvalue(
+    treated_gap: Dict[Any, float],
+    placebo_gaps: Dict[Any, Dict[Any, float]],
+    post_periods: List[Any],
+    f_post: np.ndarray,
+    pre_denoms: Dict[Any, float],
+    treated_id: Any,
+) -> Tuple[float, float, int]:
+    """Permutation p-value for a common-effect sharp null (Firpo & Possebom Eq 13, φ=0, v=1).
+
+    ``p^f = (1 + #{converged placebos j : RMSPE^f_j ≥ RMSPE^f_1}) / (n_ref + 1)`` — the
+    treated unit is the "+1" and ``n_ref`` is the number of converged placebos (the SAME
+    reference set the in-space placebo built). ``placebo_gaps`` maps each converged donor
+    to its gap path; ``pre_denoms`` maps each unit (treated + those donors) to its floored
+    pre-denominator. Ties counted via ``>=`` so the p-value is conservative, matching
+    ``in_space_placebo``. Returns ``(p, rmspe_f_treated, n_ref)``.
+    """
+    f_post = np.asarray(f_post, dtype=float)
+    r1 = _rmspe_f_ratio(treated_gap, post_periods, f_post, pre_denoms[treated_id])
+    n_ref = 0
+    n_ge = 0
+    for j, g in placebo_gaps.items():
+        rj = _rmspe_f_ratio(g, post_periods, f_post, pre_denoms[j])
+        n_ref += 1
+        if rj >= r1:
+            n_ge += 1
+    p = (1 + n_ge) / (n_ref + 1)
+    return p, r1, n_ref
+
+
+def _invert_sharp_null(
+    treated_gap: Dict[Any, float],
+    placebo_gaps: Dict[Any, Dict[Any, float]],
+    post_periods: List[Any],
+    pre_denoms: Dict[Any, float],
+    treated_id: Any,
+    family: str,
+    gamma: float,
+    *,
+    bounds: Optional[Tuple[float, float]] = None,
+    n_grid: int = 200,
+) -> Dict[str, Any]:
+    """Invert the sharp-null RMSPE^f test over a one-parameter effect family.
+
+    Returns the confidence set ``{ param : p^param > gamma }`` (Firpo & Possebom Eqs
+    14/16/18; STRICT inequality -- ``p == gamma`` is excluded). ``family`` is ``"constant"``
+    (``f(t)=param``, Eq 15) or ``"linear"`` (``f(t)=param*(t-T0)``, Eq 17).
+
+    ``p^param`` is a piecewise-constant step function of the scalar ``param`` (each placebo's
+    indicator flips only at the real roots of a quadratic in ``param``), so with
+    ``bounds=None`` the EXACT set is recovered with NO shape assumption (no "centered
+    interval" / monotonicity): the placebo breakpoints partition the line, ``p`` is evaluated
+    once per induced interval (including the two unbounded tails) and at each breakpoint
+    (a tie under ``>=`` can lift ``p`` above ``gamma`` there), and the union of intervals
+    and breakpoints with ``p > gamma`` is the set -- correctly handling accepted tails,
+    disjoint components, isolated accepted points, and the empty / unbounded cases. With
+    explicit ``bounds`` a fixed ``linspace(*bounds, n_grid)`` grid is scanned instead
+    (grid-limited membership). Each ``p`` is a pure O(J) re-ranking.
+
+    Result dict: ``{lower, upper, contiguous, status, n_ref, point_estimate, grid}`` where
+    ``status`` is ``"ran"`` / ``"empty"`` / ``"unbounded"``, ``point_estimate`` is the treated
+    unit's own RMSPE-minimizing ``param`` (the ATT for constant, the no-intercept OLS slope
+    for linear -- reported for reference, NOT assumed to maximize ``p``), and ``grid`` is a
+    list of ``(param, p_value, in_set)`` rows for the returned table.
+    """
+ if family not in ("constant", "linear"):
+ raise ValueError(f"family must be 'constant' or 'linear', got {family!r}")
+ n_post = len(post_periods)
+ # Family shape s_t and per-unit moments of A_u(param) = mean_t((g_ut - param*s_t)**2)
+ # = S2*param**2 - 2*P_u*param + Q_u (s_t = 1 constant / (t - T0) linear; S2 = mean(s**2)).
+ s = (
+ np.ones(n_post, dtype=float)
+ if family == "constant"
+ else np.arange(1, n_post + 1, dtype=float)
+ )
+ S2 = float(np.mean(s**2)) if n_post else 0.0
+
+ def moments(gap_path: Dict[Any, float], unit: Any) -> Tuple[float, float, float]:
+ g = np.array([gap_path[p] for p in post_periods], dtype=float)
+ return float(np.mean(g * s)), float(np.mean(g**2)), pre_denoms[unit]
+
+ P1, Q1, D1 = moments(treated_gap, treated_id)
+ placebo_mom = [moments(gp, j) for j, gp in placebo_gaps.items()]
+ n_ref = len(placebo_mom)
+ # The treated unit's own RMSPE-minimizing param: reported as the point estimate ONLY
+ # (it is NOT assumed to maximize p -- see the exact-inversion comment below).
+ center = float(P1 / S2) if S2 > 0 else 0.0
+
+ def pval(param: float) -> float:
+ # p^param = (1 + #{placebos j : A_j/D_j >= A_1/D_1}) / (n_ref + 1). Comparing A/D is
+ # monotone-equivalent to comparing the RMSPE ratios sqrt(A/D), so this is identical to
+ # _sharp_null_pvalue / test_sharp_null (the squared form just avoids the sqrt).
+ r1 = ((S2 * param - 2.0 * P1) * param + Q1) / D1
+ n_ge = sum(
+ 1 for (Pj, Qj, Dj) in placebo_mom if ((S2 * param - 2.0 * Pj) * param + Qj) / Dj >= r1
+ )
+ return (1 + n_ge) / (n_ref + 1)
+
+ def grid_rows(lo: float, hi: float) -> List[Tuple[float, float, bool]]:
+ if not (np.isfinite(lo) and np.isfinite(hi)) or hi <= lo:
+ return []
+ out: List[Tuple[float, float, bool]] = []
+ for x in np.linspace(lo, hi, max(int(n_grid), 2)):
+ p = pval(float(x))
+ out.append((float(x), float(p), bool(p > gamma)))
+ return out
+
+ # ----- fixed-grid path (user-supplied bounds) -----
+ if bounds is not None:
+ lo_b, hi_b = float(bounds[0]), float(bounds[1])
+ if hi_b <= lo_b:
+ raise ValueError(f"bounds must be (lo, hi) with hi > lo, got {bounds!r}")
+ rows = grid_rows(lo_b, hi_b)
+ accepted = [x for x, _, ok in rows if ok]
+ if not accepted:
+ return {
+ "lower": np.nan,
+ "upper": np.nan,
+ "contiguous": True,
+ "status": "empty",
+ "n_ref": n_ref,
+ "point_estimate": center,
+ "grid": rows,
+ }
+ lower, upper = min(accepted), max(accepted)
+ contiguous = all(ok for x, _, ok in rows if lower <= x <= upper)
+ return {
+ "lower": lower,
+ "upper": upper,
+ "contiguous": contiguous,
+ "status": "ran",
+ "n_ref": n_ref,
+ "point_estimate": center,
+ "grid": rows,
+ }
+
+ # ----- exact piecewise-constant inversion (bounds=None; no shape assumption) -----
+ # p^param is a step function of param: each placebo's indicator flips only at the real
+ # roots of A_j(param)*D1 - A_1(param)*Dj = 0, a quadratic in param. So the EXACT set is
+ # recovered with NO "centered interval" / monotonicity assumption -- collect every placebo
+ # breakpoint, evaluate p once on each open interval they induce (plus the two unbounded
+ # tails), and union the intervals with p > gamma. This correctly handles accepted tails,
+ # disjoint components, and the empty / unbounded cases: a poor-pre-fit treated unit can
+ # have its accepted region in the TAILS, not around the point estimate.
+ if gamma < 1.0 / (n_ref + 1):
+ # p >= 1/(J+1) everywhere (the treated ranks itself), so nothing is ever rejected
+ # (Firpo & Possebom fn 8 -- the discrete granularity); the set is all of R.
+ return {
+ "lower": -np.inf,
+ "upper": np.inf,
+ "contiguous": True,
+ "status": "unbounded",
+ "n_ref": n_ref,
+ "point_estimate": center,
+ "grid": [],
+ }
+
+ def quad_roots(a: float, b: float, c: float) -> List[float]:
+ # Real roots of a*x**2 + b*x + c, degenerate-safe (linear / constant fall through).
+ if abs(a) <= 1e-15:
+ return [] if abs(b) <= 1e-15 else [-c / b]
+ disc = b * b - 4.0 * a * c
+ if disc < 0.0:
+ return []
+ sq = disc**0.5
+ return [(-b - sq) / (2.0 * a), (-b + sq) / (2.0 * a)]
+
+ # g_j(param) = A_j(param)*D1 - A_1(param)*Dj = a*param^2 + b*param + c; the indicator
+ # 1[g_j(param) >= 0] == 1[RMSPE^param_j >= RMSPE^param_1], and breakpoints (where an
+ # indicator flips) are the real roots of each g_j.
+ coeffs = [
+ (S2 * (D1 - Dj), -2.0 * (Pj * D1 - P1 * Dj), Qj * D1 - Q1 * Dj)
+ for (Pj, Qj, Dj) in placebo_mom
+ ]
+ raw_breaks: List[float] = []
+ for a, b, c in coeffs:
+ raw_breaks.extend(quad_roots(a, b, c))
+ breaks: List[float] = []
+ for r in sorted(raw_breaks):
+ # dedup near-equal roots so each interval midpoint stays strictly interior
+ if not breaks or abs(r - breaks[-1]) > 1e-12 + 1e-9 * abs(r):
+ breaks.append(r)
+
+ def pval_break(r: float) -> float:
+ # p AT a breakpoint: strict-`>` membership (Eq 14) combined with the conservative
+ # tie-counting `>=` (Eqs 5/13) means a placebo that EXACTLY ties the treated at a root
+ # is counted, so p can spike above gamma there even when both neighboring open
+ # intervals are rejected (a tangent / co-located root => an isolated accepted point).
+ # A relative tolerance registers the tie robustly in floating point.
+ n_ge = 0
+ for a, b, c in coeffs:
+ gj = (a * r + b) * r + c
+ if gj >= -(1e-9 * (abs(a) * r * r + abs(b) * abs(r) + abs(c)) + 1e-12):
+ n_ge += 1
+ return (1 + n_ge) / (n_ref + 1)
+
+ if not breaks:
+ # p is constant in param (e.g. every placebo shares the treated's moments + denom).
+ if pval(center) > gamma:
+ return {
+ "lower": -np.inf,
+ "upper": np.inf,
+ "contiguous": True,
+ "status": "unbounded",
+ "n_ref": n_ref,
+ "point_estimate": center,
+ "grid": [],
+ }
+ return {
+ "lower": np.nan,
+ "upper": np.nan,
+ "contiguous": True,
+ "status": "empty",
+ "n_ref": n_ref,
+ "point_estimate": center,
+ "grid": [],
+ }
+
+ # Atoms along the line: cell_0, b_0, cell_1, b_1, ..., b_{m-1}, cell_m (2m+1 atoms). p is
+ # constant on each open cell (sampled at an interior point) and is evaluated WITH the tie
+ # tolerance at each breakpoint. The set = union of accepted atoms; consecutive accepted
+ # atoms share a boundary => one connected component, a rejected atom splits components
+ # (so an accepted breakpoint whose neighbor cells are both rejected is a singleton).
+ m = len(breaks)
+ cell_pts = (
+ [breaks[0] - 1.0]
+ + [0.5 * (breaks[k] + breaks[k + 1]) for k in range(m - 1)]
+ + [breaks[-1] + 1.0]
+ )
+ cell_in = [pval(x) > gamma for x in cell_pts]
+ break_in = [pval_break(r) > gamma for r in breaks]
+ acc = [cell_in[i // 2] if i % 2 == 0 else break_in[(i - 1) // 2] for i in range(2 * m + 1)]
+ if not any(acc):
+ return {
+ "lower": np.nan,
+ "upper": np.nan,
+ "contiguous": True,
+ "status": "empty",
+ "n_ref": n_ref,
+ "point_estimate": center,
+ "grid": [],
+ }
+
+ def atom_extent(i: int) -> Tuple[float, float]:
+ if i % 2 == 0:
+ k = i // 2 # open cell k = (left boundary, right boundary)
+ return (-np.inf if k == 0 else breaks[k - 1], np.inf if k == m else breaks[k])
+ b = breaks[(i - 1) // 2] # a breakpoint singleton
+ return (b, b)
+
+ components: List[Tuple[float, float]] = []
+ i = 0
+ while i < len(acc):
+ if not acc[i]:
+ i += 1
+ continue
+ j = i
+ while j + 1 < len(acc) and acc[j + 1]:
+ j += 1
+ components.append((atom_extent(i)[0], atom_extent(j)[1]))
+ i = j + 1
+
+ lower = min(comp[0] for comp in components)
+ upper = max(comp[1] for comp in components)
+ contiguous = len(components) == 1
+ status = "ran" if (np.isfinite(lower) and np.isfinite(upper)) else "unbounded"
+ # Display grid: the finite hull, else a padded breakpoint range (for inspection only).
+ if np.isfinite(lower) and np.isfinite(upper):
+ if upper <= lower:
+ # A pure singleton set (lower == upper, a lone accepted breakpoint): a zero-width
+ # range yields an empty grid_rows(), so emit the single accepted point explicitly
+ # (with its tie-spike p) — the inspection table must reflect the non-empty set.
+ grid = [(float(lower), float(pval_break(lower)), True)]
+ else:
+ grid = grid_rows(lower, upper)
+ else:
+ pad = max(1.0, breaks[-1] - breaks[0])
+ grid = grid_rows(breaks[0] - pad, breaks[-1] + pad)
+ return {
+ "lower": float(lower),
+ "upper": float(upper),
+ "contiguous": bool(contiguous),
+ "status": status,
+ "n_ref": n_ref,
+ "point_estimate": center,
+ "grid": grid,
+ }
def _placebo_fit_unit(
diff --git a/diff_diff/synthetic_control_results.py b/diff_diff/synthetic_control_results.py
index 60ef44ad..55866f3d 100644
--- a/diff_diff/synthetic_control_results.py
+++ b/diff_diff/synthetic_control_results.py
@@ -200,6 +200,16 @@ class SyntheticControlResults:
rmspe_ratio: float = np.nan
n_placebos: int = 0
n_failed: int = 0
+ # Confidence set for the treatment-effect path by test inversion (Firpo & Possebom
+ # 2018, "Synthetic Control Method: Inference, Sensitivity Analysis and Confidence
+ # Sets," J. Causal Inference 6(2), §4), populated by ``confidence_set()``. A small
+ # summary dict ``{family, parameter, gamma, lower, upper, contiguous, boundary,
+ # point_estimate, n_grid, n_placebos, status}``; None until ``confidence_set()`` runs.
+ # DELIBERATELY SEPARATE from the always-NaN analytical ``conf_int`` (the Wald interval
+ # classic SCM does not have): this is a PERMUTATION set at level ``1-gamma`` (the
+ # p-value is granular in ``1/(J+1)``) and may be empty, unbounded, or non-contiguous
+ # rather than one interval; mirrors how ``placebo_p_value`` stays distinct from ``p_value``.
+ effect_confidence_set: Optional[Dict[str, Any]] = None
def __post_init__(self) -> None:
# Internal state set per instance by ``fit()`` / ``in_space_placebo()``.
@@ -223,6 +233,13 @@ def __post_init__(self) -> None:
# None (not run), "ran", "treated_fit_nonconverged", "too_few_donors",
# "all_placebos_failed". A small string, so it survives pickling.
self._placebo_status: Optional[str] = None
+ # Per-unit floored pre-period denominators (treated + each converged placebo),
+ # captured by in_space_placebo() so the sharp-null test inversion
+ # (test_sharp_null / confidence_set, Firpo & Possebom 2018) re-ranks against the
+ # SAME denominators the placebo run used (the test_sharp_null(0) == placebo_p_value
+ # anchor). Each value uses that unit's OWN pre-outcome scale; the pre window is
+ # f-free so the denominator is grid-invariant. Small dict → survives pickling.
+ self._placebo_pre_denoms: Optional[Dict[Any, float]] = None
# --- ADH 2015 §4 robustness diagnostics (opt-in, populated by ---
# --- leave_one_out() / in_time_placebo()). Same panel-vs-scalar split as ---
@@ -256,6 +273,11 @@ def __post_init__(self) -> None:
# periods, all predictors dropped, or a zero-mass surviving custom_v). Surfaced
# alongside _in_time_n_failed so a mixed no-success run reports an accurate mix.
self._in_time_n_infeasible: int = 0
+ # Firpo & Possebom (2018) §4 test-inversion confidence set (opt-in, populated by
+ # confidence_set()). The grid table {param, p_value, in_set} is small / NOT
+ # panel-derived, so it survives pickling by default (NOT nulled by __getstate__);
+ # the public ``effect_confidence_set`` summary dataclass field likewise survives.
+ self._confidence_set_df: Optional[pd.DataFrame] = None
def __getstate__(self) -> Dict[str, Any]:
"""Exclude panel-derived internal state from pickling.
@@ -379,6 +401,45 @@ def summary(self, alpha: Optional[float] = None) -> str:
"",
]
)
+ # Test-inversion confidence set (Firpo & Possebom 2018, §4), if computed. Like the
+ # placebo p-value this is permutation-based; the analytical conf_int stays n/a.
+ ecs = self.effect_confidence_set
+ if ecs is not None:
+ fam = ecs["family"]
+ param = ecs["parameter"]
+ conf_pct = 100.0 * (1.0 - ecs["gamma"])
+ lines.append(
+ f"Confidence set by test inversion (Firpo-Possebom 2018; {fam} effect "
+ f"f(t), parameter {param}):"
+ )
+ if ecs["status"] == "ran":
+ note = "" if ecs["contiguous"] else " (non-contiguous; [lower, upper] hull)"
+ lines.append(
+ f" {conf_pct:.1f}% set:".ljust(34)
+ + f"[{ecs['lower']:.4f}, {ecs['upper']:.4f}]{note}"
+ )
+ elif ecs["status"] == "unbounded":
+ tail = (
+ " and NON-CONTIGUOUS (hull shown; see get_confidence_set_df())"
+ if not ecs["contiguous"]
+ else ""
+ )
+ lines.append(
+ " Unbounded (gamma below the 1/(J+1) granularity, or the treated "
+ f"unit is not the best pre-fit){tail}."
+ )
+ else: # "empty"
+ lines.append(
+ " Empty: every effect in this family is rejected at "
+ f"gamma={ecs['gamma']:.3g}."
+ )
+ lines.extend(
+ [
+ "(Permutation-based; the analytical conf_int above stays n/a.)",
+ "-" * 75,
+ "",
+ ]
+ )
# Three states: (1) placebo never run -> point to in_space_placebo();
# (2) run with a valid reference set -> show the permutation p-value;
# (3) run but infeasible (no placebo entered the rank, e.g. J<2 or all
@@ -483,6 +544,17 @@ def to_dict(self) -> Dict[str, Any]:
"n_placebos": self.n_placebos,
"n_failed": self.n_failed,
}
+ # Test-inversion confidence set (Firpo & Possebom 2018), flattened to scalars so
+ # to_dataframe() stays a single row of scalars; all None until confidence_set()
+ # runs. The analytical conf_int_lower/upper above stay NaN (no Wald interval).
+ ecs = self.effect_confidence_set
+ result["effect_ci_family"] = ecs["family"] if ecs else None
+ result["effect_ci_parameter"] = ecs["parameter"] if ecs else None
+ result["effect_ci_gamma"] = ecs["gamma"] if ecs else None
+ result["effect_ci_lower"] = ecs["lower"] if ecs else None
+ result["effect_ci_upper"] = ecs["upper"] if ecs else None
+ result["effect_ci_contiguous"] = ecs["contiguous"] if ecs else None
+ result["effect_ci_status"] = ecs["status"] if ecs else None
if self.survey_metadata is not None:
sm = self.survey_metadata
result["weight_type"] = sm.weight_type
@@ -646,9 +718,16 @@ def in_space_placebo(
"older estimator version. Re-fit to enable in-space placebo "
"inference."
)
- from diff_diff.synthetic_control import _mspe, _placebo_fit_unit
+ from diff_diff.synthetic_control import _floored_pre_mspe, _mspe, _placebo_fit_unit
snap = self._fit_snapshot
+ # A rebuilt placebo reference set invalidates any previously computed confidence set
+ # (test_sharp_null / confidence_set re-rank against THIS reference set), so drop the
+ # cached confidence-set outputs up front — a stale set must never be reported after an
+ # explicit in_space_placebo() re-run (e.g. with a different n_starts). The snapshot
+ # check above has already passed, so the reference IS about to be rebuilt on every exit.
+ self.effect_confidence_set = None
+ self._confidence_set_df = None
donors = list(snap.donor_ids)
n_donors = len(donors)
if n_starts is None:
@@ -697,6 +776,7 @@ def in_space_placebo(
self.n_placebos = 0
self.n_failed = 0
self._placebo_gaps = {}
+ self._placebo_pre_denoms = {}
self._placebo_status = "treated_fit_nonconverged"
self._placebo_df = pd.DataFrame(rows, columns=self._PLACEBO_COLS)
return self._placebo_df.copy()
@@ -713,6 +793,7 @@ def in_space_placebo(
self.n_placebos = 0
self.n_failed = 0
self._placebo_gaps = {}
+ self._placebo_pre_denoms = {}
self._placebo_status = "too_few_donors"
self._placebo_df = pd.DataFrame(rows, columns=self._PLACEBO_COLS)
return self._placebo_df.copy()
@@ -803,6 +884,20 @@ def in_space_placebo(
stacklevel=2,
)
+ # Persist each unit's floored pre-period denominator (treated + every converged
+ # placebo) so the sharp-null test inversion (test_sharp_null / confidence_set,
+ # Firpo & Possebom 2018) re-ranks against the SAME denominators this run used —
+ # the test_sharp_null(0) == placebo_p_value anchor. The pre window is f-free so the
+ # denominator is grid-invariant; each unit's floor uses its OWN pre-outcome scale.
+ outcome_pivot = snap.pivots[snap.outcome]
+ pre_denoms: Dict[Any, float] = {}
+ for unit, gp in [(snap.treated_id, self.gap_path), *placebo_gaps.items()]:
+ pre_gaps_u = np.array([gp[p] for p in snap.pre_periods], dtype=float)
+ z1_u = outcome_pivot.loc[snap.pre_periods, unit].to_numpy(dtype=float)
+ scale_u = float(np.max(np.abs(z1_u))) if z1_u.size else 0.0
+ pre_denoms[unit] = _floored_pre_mspe(pre_gaps_u, scale_u)
+ self._placebo_pre_denoms = pre_denoms
+
self.placebo_p_value = float(p_value)
self.n_placebos = int(n_placebos)
self.n_failed = int(n_failed)
@@ -1401,3 +1496,328 @@ def get_in_time_placebo_gaps(self) -> pd.DataFrame:
}
)
return pd.DataFrame(rows, columns=["placebo_period", "period", "gap", "phase"])
+
+ # =====================================================================
+ # Confidence sets by test inversion (Firpo & Possebom 2018, §4)
+ # =====================================================================
+
+ def _require_placebo_reference(self, n_starts: Optional[int]) -> None:
+ """Ensure an in-space placebo reference set is available for test inversion.
+
+ Lazily runs :meth:`in_space_placebo` when no reference set has been built yet
+ (raising the same ValueError as that method if the fit snapshot is missing, e.g.
+ on an unpickled result). If a reference set already exists, a non-None ``n_starts``
+ is **ignored with a UserWarning** — the test inversion reuses the single stored set
+ (every sharp null re-ranks the SAME gaps), so honouring ``n_starts`` would mean an
+ expensive O(J) re-fit that the caller did not ask for. Raises ValueError when no
+ valid reference set could be produced (fewer than 2 donors, a non-converged treated
+ fit, or all donor refits failed) — there is then no permutation distribution to
+ invert.
+ """
+ if self._placebo_gaps is None:
+ # Builds the reference set; raises ValueError if the snapshot is unavailable.
+ self.in_space_placebo(n_starts=n_starts)
+ elif n_starts is not None:
+ warnings.warn(
+ "n_starts is ignored: the in-space placebo reference set was already "
+ "computed and is reused (every sharp null / grid value re-ranks the same "
+ "placebo gaps). Re-run in_space_placebo(n_starts=...) explicitly to rebuild "
+ "it with a different multistart count.",
+ UserWarning,
+ stacklevel=3,
+ )
+ if not self._placebo_gaps or self._placebo_status != "ran":
+ reasons = {
+ "treated_fit_nonconverged": (
+ "the treated unit's own SCM fit did not converge at fit time, so its "
+ "RMSPE ratio is not a valid optimum to rank against placebos"
+ ),
+ "too_few_donors": (
+ "fewer than 2 donors are available (each placebo is fit against the "
+ "other donors)"
+ ),
+ "all_placebos_failed": (
+ "every donor refit failed to converge, so no placebo entered the "
+ "reference set"
+ ),
+ }
+ default_reason = "no valid in-space placebo reference set was produced"
+ status = self._placebo_status
+ reason = reasons.get(status, default_reason) if status is not None else default_reason
+ raise ValueError(
+ "Confidence set / sharp-null test requires a valid in-space placebo "
+ f"reference set, but {reason}. (See the in_space_placebo() warning above.)"
+ )
+
+ @staticmethod
+ def _coerce_effect_path(effect: Any, n_post: int) -> np.ndarray:
+ """Coerce ``effect`` to a length-``n_post`` post-period effect path ``f(t)``.
+
+ A scalar broadcasts to a constant path (Eq 11 with ``f(t) = c``); a 1-D array must
+ have one finite value per post period, aligned to the calendar-sorted
+ ``post_periods``. Fails closed on a wrong length or any non-finite value.
+ """
+ arr = np.asarray(effect, dtype=float)
+ if arr.ndim == 0:
+ f = np.full(n_post, float(arr), dtype=float)
+ elif arr.ndim == 1:
+ if arr.shape[0] != n_post:
+ raise ValueError(
+ f"effect path has length {arr.shape[0]} but there are {n_post} "
+ "post-treatment periods; pass a scalar (a constant-in-time effect) or "
+ f"a length-{n_post} array aligned to post_periods (calendar order)."
+ )
+ f = arr
+ else:
+ raise ValueError(
+ "effect must be a scalar (constant effect) or a 1-D array (one value per "
+ f"post period), got a {arr.ndim}-D array."
+ )
+ if not np.all(np.isfinite(f)):
+ raise ValueError("effect contains non-finite (NaN/inf) values.")
+ return f
+
+ def test_sharp_null(
+ self,
+ effect: Any,
+ *,
+ gamma: float = 0.1,
+ n_starts: Optional[int] = None,
+ ) -> pd.Series:
+ """Test a sharp null hypothesis on the treatment-effect path (Firpo & Possebom 2018, §4.1).
+
+ Tests ``H_0^f: α_{1,t} = f(t)`` for every post period (Eq 11) by subtracting the
+ hypothesized effect path ``f(t)`` from the post-period gaps of EVERY unit and
+ re-ranking the treated unit's modified RMSPE ratio against the placebo distribution
+ (Eqs 12–13 at ``φ = 0``, ``v = (1,…,1)`` — the equal-weights benchmark). The
+ synthetic controls are NOT refit: this reuses the gap paths and per-unit
+ denominators :meth:`in_space_placebo` already computed (run lazily here if needed).
+ At ``effect = 0`` the p-value is identically the benchmark ``placebo_p_value``
+ (Eq 5 = Eq 13 with ``f ≡ 0``).
+
+ Parameters
+ ----------
+ effect : float or array-like
+ The hypothesized post-period effect ``f(t)``: a scalar (a constant-in-time
+ effect, Eq 11), or a length-``n_post_periods`` array aligned to ``post_periods``
+ in calendar order (an arbitrary path — e.g. an intervention cost path or a
+ theory-predicted shape).
+ gamma : float, default 0.1
+ Test level; the null is rejected when ``p^f < gamma``, so a tie ``p^f == gamma`` is NOT
+ rejected here even though :meth:`confidence_set` excludes it (strict ``p > gamma``). The
+ p-value is granular in ``1/(J+1)`` (Firpo & Possebom fn 8), so not every level is attainable.
+ n_starts : int, optional
+ Multistart count for the lazy :meth:`in_space_placebo` run; ignored (with a
+ warning) if the reference set already exists.
+
+ Returns
+ -------
+ pandas.Series
+ ``p_value`` (``p^f``), ``reject`` (``p^f < gamma``), ``gamma``,
+ ``rmspe_f_treated`` (the treated unit's modified RMSPE ratio), ``n_placebos``
+ (reference-set size), ``n_failed``.
+
+ Raises
+ ------
+ ValueError
+ If ``gamma`` is not in ``(0, 1)``, ``effect`` has the wrong shape / non-finite
+ values, or no valid placebo reference set is available (see
+ :meth:`in_space_placebo`).
+ """
+ from diff_diff.synthetic_control import _sharp_null_pvalue
+
+ if not (0.0 < float(gamma) < 1.0):
+ raise ValueError(f"gamma must be in (0, 1), got {gamma!r}")
+ self._require_placebo_reference(n_starts)
+ post_periods = list(self.post_periods)
+ f_post = self._coerce_effect_path(effect, len(post_periods))
+ assert self._placebo_gaps is not None and self._placebo_pre_denoms is not None
+ p, r1, n_ref = _sharp_null_pvalue(
+ self.gap_path,
+ self._placebo_gaps,
+ post_periods,
+ f_post,
+ self._placebo_pre_denoms,
+ self.treated_unit,
+ )
+ return pd.Series(
+ {
+ "p_value": float(p),
+ "reject": bool(p < float(gamma)),
+ "gamma": float(gamma),
+ "rmspe_f_treated": float(r1),
+ "n_placebos": int(n_ref),
+ "n_failed": int(self.n_failed),
+ }
+ )
+
+ def confidence_set(
+ self,
+ *,
+ family: str = "constant",
+ gamma: float = 0.1,
+ bounds: Optional[Tuple[float, float]] = None,
+ n_grid: int = 200,
+ n_starts: Optional[int] = None,
+ ) -> pd.DataFrame:
+ """Confidence set for the treatment-effect path by test inversion (Firpo & Possebom 2018, §4.2).
+
+ Inverts the sharp-null test (:meth:`test_sharp_null`) over a one-parameter effect
+ family: the confidence set is every parameter value whose sharp null is **not
+ rejected**, ``{ param : p^param > gamma }`` (Eq 14, **strict** inequality). Two
+ families are supported:
+
+ - ``family="constant"`` — ``f(t) = c`` (Eq 15); the set is a confidence **interval**
+ for a constant-in-time effect (Eq 16). The parameter column is ``c``.
+ - ``family="linear"`` — ``f(t) = c̃·(t − T0)`` with the 1-based post-period index
+ ``(t − T0)`` (Eq 17); the set is a confidence **set** over the slope ``c̃``
+ (Eq 18). The parameter column is ``c_tilde``.
+
+ The inversion is a pure re-ranking of the stored placebo gaps (no synthetic-control
+ refits): :meth:`in_space_placebo` is run lazily if needed, then each value only
+ recomputes ``p^param``. With ``bounds=None`` the set is recovered **exactly**:
+ ``p^param`` is piecewise-constant (each placebo's indicator flips only at the real
+ roots of a quadratic in ``param``), so the placebo breakpoints partition the line,
+ ``p`` is evaluated once per induced interval AND at each breakpoint (where a tie
+ under ``≥`` can lift ``p`` above ``gamma``), and the union of accepted
+ intervals/points is the set — with NO centering or monotonicity assumption (accepted
+ tails and disjoint components are handled). With explicit ``bounds`` a fixed
+ ``linspace(*bounds, n_grid)`` grid is scanned instead (grid-limited membership).
+
+ **Boundary convention (paper-sourced, Eq 14):** membership is the *strict* inequality
+ ``p^param > gamma``. The permutation p-value is discrete (a multiple of ``1/(J+1)``),
+ so ``p = gamma`` is reachable and is **excluded** from the set.
+
+ The result is stored on the object: the summary on
+ :attr:`effect_confidence_set` (``{family, parameter, gamma, lower, upper,
+ contiguous, boundary, point_estimate, n_grid, n_placebos, status}``, surviving
+ pickling) and the full grid on :meth:`get_confidence_set_df`. The analytical
+ ``conf_int`` / ``se`` stay NaN — this is a separate permutation object.
+
+ Parameters
+ ----------
+ family : {"constant", "linear"}, default "constant"
+ The one-parameter effect family to invert over.
+ gamma : float, default 0.1
+ Confidence level is ``1 − gamma``; ``p^param > gamma`` defines membership.
+ bounds : (float, float), optional
+ Fixed ``(lo, hi)`` grid for the parameter. Default None uses exact breakpoint
+ inversion (a fixed grid is used only when ``bounds`` is supplied).
+ n_grid : int, default 200
+ Number of grid points evaluated for the returned table (>= 2).
+ n_starts : int, optional
+ Multistart count for the lazy :meth:`in_space_placebo` run; ignored (with a
+ warning) if the reference set already exists.
+
+ Returns
+ -------
+ pandas.DataFrame
+ Columns ``param`` (``c`` for constant, ``c̃`` for linear), ``p_value``
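+ (``p^param``), ``in_set`` (``p^param > gamma``). With ``bounds=None`` the table is empty
+ for an ``"empty"`` set and for an ``"unbounded"`` set decided without breakpoints (``gamma``
+ below ``1/(J+1)``, or none exist); any other ``"unbounded"`` set returns an inspection grid
+ over a padded breakpoint range. With ``bounds`` the scanned grid is always returned (see ``status``).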
+
+ Raises
+ ------
+ ValueError
+ If ``family`` is unknown, ``gamma`` not in ``(0, 1)``, ``n_grid < 2``, ``bounds``
+ malformed, or no valid placebo reference set is available.
+ """
+ from diff_diff.synthetic_control import _invert_sharp_null
+
+ if family not in ("constant", "linear"):
+ raise ValueError(f"family must be 'constant' or 'linear', got {family!r}")
+ if not (0.0 < float(gamma) < 1.0):
+ raise ValueError(f"gamma must be in (0, 1), got {gamma!r}")
+ if not isinstance(n_grid, (int, np.integer)) or n_grid < 2:
+ raise ValueError(f"n_grid must be an integer >= 2, got {n_grid!r}")
+ if bounds is not None:
+ # Guard the type/length BEFORE indexing so a malformed scalar raises the
+ # documented ValueError (not a bare TypeError from len()/subscription).
+ if (
+ not isinstance(bounds, (tuple, list, np.ndarray))
+ or len(bounds) != 2
+ or not all(isinstance(b, (int, float, np.integer, np.floating)) for b in bounds)
+ or not all(np.isfinite(float(b)) for b in bounds)
+ ):
+ raise ValueError(f"bounds must be a finite (lo, hi) pair, got {bounds!r}")
+ if float(bounds[1]) <= float(bounds[0]):
+ raise ValueError(f"bounds must satisfy hi > lo, got {bounds!r}")
+ self._require_placebo_reference(n_starts)
+ assert self._placebo_gaps is not None and self._placebo_pre_denoms is not None
+ res = _invert_sharp_null(
+ self.gap_path,
+ self._placebo_gaps,
+ list(self.post_periods),
+ self._placebo_pre_denoms,
+ self.treated_unit,
+ family,
+ float(gamma),
+ bounds=(None if bounds is None else (float(bounds[0]), float(bounds[1]))),
+ n_grid=int(n_grid),
+ )
+ status = res["status"]
+ if status == "unbounded":
+ extra = (
+ " The accepted set is ALSO non-contiguous (e.g. two accepted tails with a "
+ "rejected middle, NOT the whole line), so [lower, upper] is only the hull — "
+ "inspect get_confidence_set_df() for the structure."
+ if not res["contiguous"]
+ else ""
+ )
+ warnings.warn(
+ "Confidence set is unbounded: either gamma is below the permutation "
+ "granularity 1/(J+1) (so no effect is ever rejected — Firpo & Possebom "
+ "fn 8), or the treated unit does not have the best pre-treatment fit (so "
+ "the RMSPE ratio does not grow without bound on one side). Reported "
+ "endpoint(s) are +/-inf." + extra,
+ UserWarning,
+ stacklevel=2,
+ )
+ elif status == "empty":
+ warnings.warn(
+ f"Confidence set is empty: every evaluated effect in the {family} family is "
+ f"rejected at gamma={gamma:.3g} (the largest attainable p-value does not "
+ "exceed gamma). Endpoints are NaN.",
+ UserWarning,
+ stacklevel=2,
+ )
+ elif not res["contiguous"]:
+ warnings.warn(
+ "Confidence set is non-contiguous (the discrete permutation p-value falls to "
+ "or below gamma at an interior parameter value); [lower, upper] is reported as "
+ "the hull. Inspect get_confidence_set_df() for the full grid.",
+ UserWarning,
+ stacklevel=2,
+ )
+ self.effect_confidence_set = {
+ "family": family,
+ "parameter": "c" if family == "constant" else "c_tilde",
+ "gamma": float(gamma),
+ "lower": float(res["lower"]),
+ "upper": float(res["upper"]),
+ "contiguous": bool(res["contiguous"]),
+ "boundary": "strict",
+ "point_estimate": float(res["point_estimate"]),
+ "n_grid": int(n_grid),
+ "n_placebos": int(res["n_ref"]),
+ "status": status,
+ }
+ self._confidence_set_df = pd.DataFrame(res["grid"], columns=["param", "p_value", "in_set"])
+ return self._confidence_set_df.copy()
+
+ def get_confidence_set_df(self) -> pd.DataFrame:
+ """Get the test-inversion confidence-set grid table (see :meth:`confidence_set`).
+
+ Columns: ``param`` (``c`` constant / ``c̃`` linear), ``p_value`` (``p^param``),
+ ``in_set`` (``p^param > gamma``). Survives pickling. Raises if
+ :meth:`confidence_set` has not been run.
+
+ Returns
+ -------
+ pandas.DataFrame
+ """
+ if self._confidence_set_df is None:
+ raise ValueError("No confidence set yet; call confidence_set() first.")
+ return self._confidence_set_df.copy()
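
Reviewer note: the test-inversion logic added above can be illustrated with a small standalone sketch. This is not the PR's implementation. It assumes a constant effect family and a fixed grid, skips the pre-period denominator flooring, and every name in it (`Unit`, `ratio_sq`, `sharp_null_p`, `grid_confidence_set`, the toy data) is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical container: (post-period gaps, pre-period MSPE) for one unit.
Unit = Tuple[Sequence[float], float]


def ratio_sq(post_gaps: Sequence[float], effect: float, pre_mspe: float) -> float:
    """Squared RMSPE ratio after subtracting a constant hypothesized effect."""
    post_mspe = sum((g - effect) ** 2 for g in post_gaps) / len(post_gaps)
    return post_mspe / pre_mspe


def sharp_null_p(treated: Unit, placebos: Dict[str, Unit], effect: float) -> float:
    """Permutation p-value: ties count against the treated unit (>=)."""
    r1 = ratio_sq(treated[0], effect, treated[1])
    n_ge = sum(1 for gaps, pre in placebos.values() if ratio_sq(gaps, effect, pre) >= r1)
    return (1 + n_ge) / (len(placebos) + 1)


def grid_confidence_set(
    treated: Unit, placebos: Dict[str, Unit], gamma: float, grid: Sequence[float]
) -> List[Tuple[float, float, bool]]:
    """Rows of (param, p_value, in_set) with strict p > gamma membership."""
    rows = []
    for c in grid:
        p = sharp_null_p(treated, placebos, c)
        rows.append((c, p, p > gamma))
    return rows


# Toy data (made up for illustration): three placebos, so p is a multiple of 1/4.
treated = ([2.0, 2.0], 1.0)
placebos = {"a": ([0.5, -0.5], 1.0), "b": ([0.1, 0.1], 1.0), "c": ([-0.3, 0.3], 1.0)}
rows = grid_confidence_set(treated, placebos, gamma=0.25, grid=[0.0, 2.0])
```

In this toy case the zero effect gets `p = 0.25`, which ties `gamma` and is therefore excluded by the strict inequality, while the effect `2.0` removes the treated unit's post-period gap entirely and gets `p = 1.0`.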
diff --git a/docs/api/synthetic_control.rst b/docs/api/synthetic_control.rst
index b50e840a..3ffd197d 100644
--- a/docs/api/synthetic_control.rst
+++ b/docs/api/synthetic_control.rst
@@ -36,6 +36,21 @@ earlier pre-date and checks for a spurious gap before the true treatment date (t
backdating placebo; ``placebo_att`` should be ~0). Both re-run the validated solver and
leave the analytical inference fields NaN.
+**Confidence sets by test inversion (Firpo & Possebom 2018 §4, opt-in):**
+:meth:`~diff_diff.SyntheticControlResults.test_sharp_null` tests a sharp null
+``H_0: alpha_1t = f(t)`` (a scalar constant effect, or a post-period effect path) by
+re-ranking the stored in-space placebo gaps — no refits, and ``test_sharp_null(0)`` is
+identically ``placebo_p_value`` — and
+:meth:`~diff_diff.SyntheticControlResults.confidence_set` (``family="constant"`` or
+``"linear"``) inverts that test into a confidence set for the effect path: a
+constant-effect interval (Eqs. 15–16) or a linear-slope set (Eqs. 17–18), with the
+paper's strict ``p > gamma`` membership (Eq. 14), computed by exact piecewise-constant
+breakpoint inversion (or a fixed grid when ``bounds=`` is supplied). The set is
+summarized on ``effect_confidence_set`` and returned by
+:meth:`~diff_diff.SyntheticControlResults.get_confidence_set_df`; the analytical
+``conf_int`` stays NaN (this is a separate permutation set at level ``1 - gamma``,
+possibly empty, unbounded, or non-contiguous rather than a single interval).
+
**Distinct from** :class:`~diff_diff.SyntheticDiD` (Arkhangelsky et al. 2021), which adds
time weights and ridge regularization; classic SCM uses **donor weights only** plus the
outer ``V`` search.
@@ -88,6 +103,9 @@ Results container for synthetic control estimation.
~SyntheticControlResults.in_time_placebo
~SyntheticControlResults.get_in_time_placebo_df
~SyntheticControlResults.get_in_time_placebo_gaps
+ ~SyntheticControlResults.test_sharp_null
+ ~SyntheticControlResults.confidence_set
+ ~SyntheticControlResults.get_confidence_set_df
~SyntheticControlResults.summary
~SyntheticControlResults.print_summary
~SyntheticControlResults.to_dict
diff --git a/docs/doc-deps.yaml b/docs/doc-deps.yaml
index 5f0857bc..a24cb239 100644
--- a/docs/doc-deps.yaml
+++ b/docs/doc-deps.yaml
@@ -635,6 +635,8 @@ sources:
- path: docs/methodology/REGISTRY.md
section: "SyntheticControl"
type: methodology
+ - path: docs/methodology/papers/firpo-possebom-2018-review.md
+ type: methodology
- path: docs/api/synthetic_control.rst
type: api_reference
- path: diff_diff/guides/llms-full.txt
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
index 8da65b5a..4a537bbf 100644
--- a/docs/methodology/REGISTRY.md
+++ b/docs/methodology/REGISTRY.md
@@ -1996,6 +1996,10 @@ Classic synthetic control (donor/unit weights only) for a single treated unit, d
- **Leave-one-out donor robustness** (`leave_one_out()`): drops each **reportably-weighted** donor and re-fits the treated unit against the reduced pool, returning a per-drop ATT / `delta_att` table (a `status="baseline"` row first, then one row per dropped donor sorted by `|delta_att|`). A large `delta_att` flags single-donor dependence (a single *dominant* donor is still dropped — the others absorb its mass — and its large `delta_att` is the intended signal, not a failure). The reporting stack's headline donor-sensitivity number is `max_abs_delta_att` = `max |delta_att|` over the drops (baseline-relative, so a uniform shift of every drop away from the full-fit ATT is not masked the way a raw ATT range would be). `get_leave_one_out_gaps()` returns the per-drop trajectories for the overlay plot. Fails closed on a non-converged treated fit or `< 2` donors.
- **In-time (backdating) placebo** (`in_time_placebo()`): reassigns the intervention to an earlier pre-date `t_f`, re-fits using ONLY pre-`t_f` information (TRUNCATE convention — see Note), and reports the placebo "effect" over the held-out window `[t_f, T0)` — ~0 if there is no real pre-period effect (ADH 2015 Fig. 4, German reunification backdated to 1975). Sweeps every feasible interior pre-date by default (≥2 pre-fake + ≥1 post-fake); an explicit post-period / non-pre date raises, a valid-but-dimensionally-infeasible date yields a `status="infeasible"` row (no raise).
+**Confidence sets by test inversion (Firpo & Possebom 2018, §4, opt-in):** A confidence set for the treatment-effect path, built ON TOP of the in-space placebo and surfaced under `estimator_native_diagnostics` (the analytical inference contract is unchanged — see the non-analytical Note below). Under a common-effect sharp null `H_0^f: α_1t = f(t)` (Eq 11) the donor synthetic controls and the pre-period MSPE denominators do not change — only the post-period gaps shift by `f(t)` — so the test is a pure **re-ranking** of the gap paths `in_space_placebo()` already computed (no synthetic-control refits):
+- **`test_sharp_null(effect, gamma=0.1)`** forms the modified RMSPE ratio `RMSPE^f_j = sqrt(mean_post((α̂_jt − f(t))²) / pre_denom_j)` for every unit (Eq 12) and the permutation p-value `p^f = (1 + #{converged placebos j : RMSPE^f_j ≥ RMSPE^f_1}) / (n_ref + 1)` (Eq 13 at `φ=0`, `v=(1,…,1)` — the equal-weights benchmark). `effect` is a scalar (a constant-in-time effect) or a length-`n_post` array (an arbitrary post-period path `f(t)` — e.g. an intervention cost path or a theory-predicted shape). At `f≡0` this is identically the in-space `placebo_p_value` (Eq 5 = Eq 13), held bit-for-bit by reusing each unit's floored pre-period denominator persisted from the placebo run (the pre window is `f`-free, so the denominator is grid-invariant; the floor scale is **per-unit** `max|Z1_j|`).
+- **`confidence_set(family="constant"|"linear", gamma=0.1, bounds=None, n_grid=200)`** inverts that test over a one-parameter family: `"constant"` → `f(t)=c` (Eq 15), a confidence **interval** for a constant effect (Eq 16); `"linear"` → `f(t)=c̃·(t−T0)` with the 1-based post-period index (Eq 17), a confidence **set** over the slope `c̃` (Eq 18). Membership is the paper's **strict** `p^c > γ` (Eq 14; see the boundary Note). With `bounds=None` the set is recovered **EXACTLY**: `p^c` is a piecewise-constant step function (each placebo's indicator flips only at the real roots of `A_j(c)·D_1 = A_1(c)·D_j`, a quadratic in `c`), so the placebo breakpoints partition the line and `p` is evaluated once per induced open interval AND at each breakpoint, where a tie under `≥` can lift `p` above γ, so that an isolated accepted singleton (a tangent / co-located root) is captured. The accepted set is the union of accepted intervals/points, with **no centering or monotonicity assumption** (a poor-pre-fit treated unit can have its accepted region in the tails, not around the point estimate). `bounds=(lo,hi)` instead scans a fixed grid (grid-limited membership). The summary is on `effect_confidence_set` (`status ∈ {"ran","empty","unbounded"}`) and the full grid on `get_confidence_set_df()`. **Fail-closed:** `γ < 1/(J+1)` ⇒ `p^c > γ` for every `c` ⇒ `"unbounded"` (`±inf` endpoints + warning; the discrete-granularity point, fn 8); a treated unit lacking the best pre-fit can give a one-sided unbounded edge; if no interval or breakpoint is accepted the set is `"empty"` (NaN endpoints); a non-contiguous accepted region (disjoint components / an isolated singleton) reports the `[lower, upper]` hull with `contiguous=False` + a warning; and `< 2` donors / a non-converged treated fit / an unpickled result (no placebo reference set) raise `ValueError`. **Scope:** sensitivity weights (`φ≠0`, Eqs 7–9), the general test-statistic menu `θ¹–θ⁵` (Eq 19), the one-sided test (§7's signed-`t` statistic), and the multiple-outcome / multiple-treated extensions (§6) are **deferred** (flagged in the paper review checklist).
+
**Notes / deviations:**
- **Note:** The standardization divisor `divisor = sqrt(apply(cbind(X0,X1), 1, var))` (per-predictor SD over donors+treated, ddof=1) and the inner/outer optimizer are **not specified in ADH 2010** (which defers these numerics to Abadie & Gardeazabal 2003 App. B / the `Synth` software). The divisor is pinned from the R `Synth::synth` source; `solution.v` lives in this scaled predictor space, so the deterministic R-parity test feeds `custom_v` in the same scaled space.
- **Note:** The outer objective minimizes the pre-period outcome MSPE over **all** pre periods, whereas R `Synth` uses a `time.optimize.ssr` window (1960–1969 in the Basque example). The nested `V` therefore differs from R by an efficiency-only choice (the paper notes inferential validity holds for *any* `V`), so end-to-end nested parity is a tolerance band, not equality.
@@ -2013,6 +2017,10 @@ Classic synthetic control (donor/unit weights only) for a single treated unit, d
- **Note (placebo failure handling):** a placebo is **excluded from both the numerator and the denominator** of the rank (never penalized into it) and tallied in `n_failed` when its fit is not a valid optimum — EITHER its **inner Frank-Wolfe weight solve** did not converge (a truncated `W` is unusable) OR its **outer `V` search** did not converge (an under-optimized `V` fits the pre-period worse, shrinking the RMSPE ratio and biasing the p-value anti-conservatively, so it must not silently enter the rank). The reported p-value uses the **effective** count `rank / (n_placebos + 1)`, where `n_placebos` is the number of placebos that entered the reference set. Failed donors still appear in `get_placebo_df()` (`status="failed"`, NaN metrics), so once a reference set is produced the table is the full treated + every-donor unit set (`n_donors + 1` rows). In the fail-closed cases the placebo loop does not run and only the treated row is returned: `J < 2` → `placebo_p_value` is NaN with a warning (no placebo distribution; `J == 2` warns the distribution is coarse), and a treated fit whose own **inner OR outer** search did not converge also fails closed (ranking a truncated / under-optimized treated statistic would not be a valid permutation). **Caveat:** each placebo refit inherits the original fit's `optimizer_options` / `n_starts`, so valid inference requires settings adequate for the outer `V` search to converge to a comparable-quality synthetic (production defaults do; cheap settings under-optimize placebo `V` and those placebos are dropped as failed — raise `n_starts` on `in_space_placebo()` or re-fit with a larger `optimizer_options['maxiter']`).
- **Note (RMSPE-ratio floor):** the reported `rmspe_ratio = sqrt(MSPE_post / MSPE_pre)` floors the pre-period MSPE denominator at a scale-aware `1e-8 · max(|pre-outcomes|, 1)²` (before the square root) so a (near-)perfect pre-fit (`pre-MSPE → 0`) yields a large-but-FINITE ratio rather than `inf`/`nan` (which would corrupt the rank). Ties (`ratio_j ≥ treated_ratio`) are counted, making the p-value conservative. Mirrors the `_fit_tol` poor-fit guard.
- **Note (placebo p-value is non-analytical):** `placebo_p_value` is deliberately a SEPARATE field from `p_value` (which stays NaN) — it is a permutation p-value with no SE / t-stat, so it does not flow through `safe_inference`. `is_significant` likewise stays bound to the (NaN) `p_value`, NOT `placebo_p_value`; a tool gating on `is_significant` will see `False` even when `placebo_p_value` is small. The reporting stack surfaces the placebo p-value through `estimator_native_diagnostics`, never the analytical headline.
+- **Note (confidence set is non-analytical — `conf_int` stays NaN):** the Firpo & Possebom test-inversion `effect_confidence_set` is a permutation set at level `1−γ`, kept DELIBERATELY SEPARATE from the analytical `conf_int` (which stays `(NaN, NaN)` — classic SCM has no Wald interval). It is parametrized by `γ` (not the estimator's `alpha`, and granular in `1/(J+1)`), can be a *set* rather than an interval (the linear family) and can be unbounded or non-contiguous, so it cannot be coerced into the `(lo, hi)` `conf_int` tuple without breaking the `safe_inference` NaN-consistency contract. `se`/`t_stat`/`p_value`/`conf_int`/`is_significant` all stay at their NaN state; the set is surfaced only via `confidence_set()` / `get_confidence_set_df()` / `effect_confidence_set` (and `estimator_native_diagnostics`) — mirroring how `placebo_p_value` is kept off `p_value`.
+- **Note (test-inversion boundary convention — strict `p^f > γ`):** Firpo & Possebom's inequalities are non-uniform at `p = γ` — the RMSPE-based tests reject at `p < γ` (Eqs 5/9/13), the general-statistic test rejects at `p ≤ γ` (Eq 19), and the confidence set is the **strict** `p^f > γ` (Eq 14), so Eq 14's set is NOT the exact complement of the Eq 13 rejection region (they differ at `p^f = γ`). Because the permutation p-value is discrete (a multiple of `1/(J+1)`), `p = γ` is reachable, so this implementation pins Eq 14's strict `p^f > γ` for set membership (a `p = γ` value is **excluded**) and documents it (matches the `firpo-possebom-2018-review.md` §4.2 boundary note).
+- **Note (test-inversion set construction is an implementation choice):** Firpo & Possebom §4.2 gives the set definitions (Eqs 14/16/18) but does NOT prescribe how to enumerate the set. The default is **exact piecewise-constant breakpoint inversion** — `p^c` is constant between the real roots of the placebo-vs-treated comparison quadratics, so evaluating `p` per induced interval and at each breakpoint recovers the set exactly (no grid resolution / shape assumption); the optional fixed `bounds=` grid is the grid-limited alternative. Either is OUR choice (the paper leaves it unspecified) — a documented deviation. (`n_grid` controls only the returned inspection table, not membership, when `bounds=None`.)
+- **Note (test-inversion validation — no R anchor):** R `Synth` has no test-inversion confidence-set function, and the authors' Code Ocean replication capsule was not consulted (paper-sourced only). The confidence sets are validated by (a) **self-consistency** — `test_sharp_null(0)` equals the in-space `placebo_p_value` exactly (transitively R-anchored via the Basque placebo parity), (b) a **numpy oracle** re-implementing Eqs 12–14 on hand-built gap paths (incl. the strict `p = γ` boundary and the per-unit floor), and (c) a **coverage simulation** (a constant-effect DGP; the `1−γ` set covers the truth at ≈ `1−γ`).
- **Note (in-time placebo windowing — TRUNCATE):** ADH 2015 §4 says to re-estimate the in-time placebo "with the same predictors lagged accordingly." Because `diff_diff`'s predictor specs reference **absolute** periods, the in-time placebo re-cuts them by TRUNCATION: pre-period-outcome predictors become the pre-`t_f` outcomes, and covariate / special-predictor windows are intersected with the pre-`t_f` window; a window lying ENTIRELY in the held-out region `[t_f, T0)` is **dropped** (surfaced in the `n_dropped_specs` column + an aggregated warning), and `custom_v` is subset in lockstep with the surviving specs. For an outcome-predictor fit (the R-anchorable case) TRUNCATE is identical to ADH's "lag" — both equal a manual `Synth::synth` re-run with `time.optimize.ssr` cut at `t_f`. The held-out window never enters the fit (the placebo's `all_periods` is the pre-fake + post-fake span; the true post-treatment periods are excluded entirely), so there is no "peeking." This concrete convention is NOT spelled out in ADH 2015 (which gives only the qualitative "lag accordingly").
- **Note (in-time placebo requires ≥2 pre-fake periods):** the in-time placebo treats a date with fewer than 2 pre-fake periods as `status="infeasible"` (the default sweep starts at the 3rd pre-period). This is DELIBERATELY stricter than the base estimator's `T0 ≥ 1` allowance (which permits a single-pre-period fit but warns that nested-`V` selection is unreliable): an auto-swept placebo date with a single pre-fake period is a trivially-matchable, non-credible pre-fit, so it is dropped rather than surfaced as a `ran` placebo (mirrors `SyntheticDiD.in_time_placebo`'s `i ≥ 2` rule). A date whose surviving `custom_v` has zero mass after truncation is likewise infeasible (not a convergence failure).
- **Note (leave-one-out weight floor):** ADH 2015 §4 leave-one-out omits "each donor that received positive weight." This implementation drops each donor with **reportable** weight — above the `1e-6` interpretability floor (`synthetic_control._MIN_REPORT_WEIGHT`), i.e. exactly the donors in `donor_weights` — rather than every strictly-positive weight. A donor with `0 < w ≤ 1e-6` is numerical dust whose removal moves the ATT by ~its weight (its `delta_att` would be ~0, an uninformative row), and the floor keeps the LOO table aligned with the reported donor support. The drop-set is **frozen at fit time** on the fit snapshot (`weighted_donor_ids`), so `leave_one_out()` is immune to post-fit mutation of the presentation-level `donor_weights` dict.
@@ -2031,6 +2039,7 @@ Classic synthetic control (donor/unit weights only) for a single treated unit, d
- [x] No analytical SE (NaN inference); in-space placebo permutation inference (`in_space_placebo()`, `rank/(n_placebos+1)`) with the real treated unit excluded from every placebo pool, effective-count denominator, and a scale-aware RMSPE-ratio floor.
- [x] Leave-one-out donor robustness (`leave_one_out()`, ADH 2015 §4): per-drop ATT / `delta_att` table + overlay gaps; fail-closed.
- [x] In-time (backdating) placebo (`in_time_placebo()`, ADH 2015 §4): TRUNCATE windowing (drop held-out-window predictors + lockstep `custom_v` subset), feasible-date sweep, fail-closed.
+- [x] Confidence sets by test inversion (`test_sharp_null()` + `confidence_set()`, Firpo & Possebom 2018 §4): sharp-null `RMSPE^f` re-ranking of the in-space placebo gaps (Eqs 12–13) + constant/linear one-parameter sets (Eqs 14/16/18) with the strict `p^f > γ` boundary, EXACT piecewise-constant breakpoint inversion (no shape assumption; isolated/disjoint/unbounded sets handled), and fail-closed unbounded/empty/non-contiguous handling. *Deferred:* sensitivity weights (φ≠0), the general-θ menu (Eq 19), one-sided (§7), multiple-outcome/treated (§6).
- [ ] *Deferred (ADH 2015):* regression-weight `W^reg` extrapolation diagnostic, sparse-SC subset search (see `TODO.md`).
- [x] Predictor-leakage, absorbing-suffix/no-anticipation, empty-window, duplicate-label, and inner-non-convergence validation gates.
diff --git a/docs/methodology/papers/firpo-possebom-2018-review.md b/docs/methodology/papers/firpo-possebom-2018-review.md
index f2fae1f0..cd34181c 100644
--- a/docs/methodology/papers/firpo-possebom-2018-review.md
+++ b/docs/methodology/papers/firpo-possebom-2018-review.md
@@ -5,13 +5,13 @@
**PDF reviewed:** https://doi.org/10.1515/jci-2016-0026 (published *Journal of Causal Inference* version, open access; received 15 Nov 2016, revised 6 Aug 2018, accepted 11 Aug 2018, 26 pp). Per the project's PDFs-never-committed convention the local PDF is kept outside the repository; the published J. Causal Inference version (DOI 10.1515/jci-2016-0026) is the authoritative source. All equation, section, and footnote numbers below are pinned to that version.
**Review date:** 2026-06-01
-> Scope note: this paper extends the **permutation / placebo inference** procedure of Abadie, Diamond & Hainmueller (the SCM benchmark) in two ways — (1) a **sensitivity analysis** that parametrically re-weights the placebo p-value away from the equal-weights benchmark, and (2) testing **any sharp null hypothesis** (not only "no effect whatsoever") via a modified RMSPE statistic, which it **inverts to construct confidence sets** for the treatment-effect path. It also generalizes to arbitrary test statistics, multiple outcomes (familywise error control), and multiple treated units (a pooled effect). This review is the **Step-1 fidelity artifact** for a forthcoming SCM **confidence-set / CI-by-test-inversion** implementation (PR-B) layered on the existing `SyntheticControl` estimator; the sensitivity-analysis and multiple-outcome / multiple-treated extensions are documented here but flagged **deferred**. The estimator itself (donor weights `W`, predictor importance `V`) is taken as given from ADH 2010/2015 — already implemented as `SyntheticControl` — and is recapped only as the paper frames it. Nothing here is sourced from outside this paper.
+> Scope note: this paper extends the **permutation / placebo inference** procedure of Abadie, Diamond & Hainmueller (the SCM benchmark) in two ways — (1) a **sensitivity analysis** that parametrically re-weights the placebo p-value away from the equal-weights benchmark, and (2) testing **any sharp null hypothesis** (not only "no effect whatsoever") via a modified RMSPE statistic, which it **inverts to construct confidence sets** for the treatment-effect path. It also generalizes to arbitrary test statistics, multiple outcomes (familywise error control), and multiple treated units (a pooled effect). This review is the **Step-1 fidelity artifact** for the SCM **confidence-set / CI-by-test-inversion** implementation (PR-B, **shipped** — `SyntheticControlResults.test_sharp_null()` / `confidence_set()`) layered on the existing `SyntheticControl` estimator; the sensitivity-analysis and multiple-outcome / multiple-treated extensions are documented here but flagged **deferred**. The estimator itself (donor weights `W`, predictor importance `V`) is taken as given from ADH 2010/2015 — already implemented as `SyntheticControl` — and is recapped only as the paper frames it. Nothing here is sourced from outside this paper.
---
## Methodology Registry Entry
-*Formatted to match docs/methodology/REGISTRY.md. This documents an **inference procedure on the existing `SyntheticControl` estimator**, not a new estimator — the `## SyntheticControl` heading mirrors `abadie-2021-review.md`. The REGISTRY implementation contract (`docs/methodology/REGISTRY.md` §SyntheticControl) is unchanged by this docs-only PR-A; PR-B will add the confidence-set methodology subsection and flip the relevant checklist items.*
+*Formatted to match docs/methodology/REGISTRY.md. This documents an **inference procedure on the existing `SyntheticControl` estimator**, not a new estimator — the `## SyntheticControl` heading mirrors `abadie-2021-review.md`. The REGISTRY implementation contract (`docs/methodology/REGISTRY.md` §SyntheticControl) was unchanged by the docs-only PR-A; PR-B (shipped) added the confidence-set methodology subsection there and flipped the relevant checklist items below.*
## SyntheticControl
@@ -132,10 +132,10 @@ A re-analysis of ETA terrorism on Basque Country GDP per capita (Abadie & Gardea
- Built on the authors' `Synth` package (R / MATLAB / Stata) for the underlying SCM fit.
**Requirements checklist** (features this paper adds beyond ADH 2010/2015; **PR-B** = the planned next implementation target, **deferred** = later):
-- [ ] (PR-B) Sharp-null `RMSPE^f` test (Eqs. 12–13) reusing the in-space placebo permutation — subtract the hypothesized `f(t)` from the post-period gaps and re-rank.
-- [ ] (PR-B) Confidence **interval** for a constant-in-time effect (Eqs. 15–16) by test inversion over a `c`-grid.
-- [ ] (PR-B) Confidence **set** for a linear-in-time effect (Eqs. 17–18) by test inversion over a `c̃`-grid.
-- [ ] (PR-B) Benchmark `(φ = 0, v = (1,…,1))` p-value (reuse `in_space_placebo`'s RMSPE-ratio) + a one-sided variant (Section 7).
+- [x] (PR-B) Sharp-null `RMSPE^f` test (Eqs. 12–13) reusing the in-space placebo permutation — subtract the hypothesized `f(t)` from the post-period gaps and re-rank. **Shipped:** `SyntheticControlResults.test_sharp_null(effect, gamma=...)`.
+- [x] (PR-B) Confidence **interval** for a constant-in-time effect (Eqs. 15–16) by test inversion over a `c`-grid. **Shipped:** `confidence_set(family="constant")`.
+- [x] (PR-B) Confidence **set** for a linear-in-time effect (Eqs. 17–18) by test inversion over a `c̃`-grid. **Shipped:** `confidence_set(family="linear")`.
+- [x] (PR-B) Benchmark `(φ = 0, v = (1,…,1))` p-value (reuse `in_space_placebo`'s RMSPE-ratio): shipped — `test_sharp_null(0)` is identically `placebo_p_value`. **One-sided variant (Section 7): still `[ ]` deferred** — §7 uses the signed-`t` statistic `θ³` from the deferred general-`θ` menu (Eq. 19), so it ships with that menu, not here.
- [ ] (deferred) Sensitivity-analysis parametric weights `π_(j)(φ, v)` (Eqs. 7–9) + worst/best-case `φ̲`/`φ̄` robustness curve (Section 3).
- [ ] (deferred) General test-statistic menu `θ¹`–`θ⁵` (Eq. 19, Section 5).
- [ ] (deferred) Multiple-outcome FWER control (Eqs. 23–24) and multiple-treated-unit pooled confidence sets (Eqs. 25–26, Section 6).
diff --git a/tests/test_diagnostic_report.py b/tests/test_diagnostic_report.py
index a0a78fcd..8501c426 100644
--- a/tests/test_diagnostic_report.py
+++ b/tests/test_diagnostic_report.py
@@ -2111,6 +2111,23 @@ def test_scm_native_surfaces_in_time_placebo_after_optin_run(self, scm_fit):
block = native["in_time_placebo"]
assert block["status"] == "ran" and block["n_dates"] >= 1
+ def test_scm_native_confidence_set_not_run_stub(self, scm_fit):
+ # Firpo-Possebom test-inversion confidence set is opt-in, like the placebos.
+ res, _ = scm_fit
+ native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"]
+ assert native["confidence_set"]["status"] == "not_run"
+
+ def test_scm_native_surfaces_confidence_set_after_optin_run(self, scm_fit):
+ res, _ = scm_fit
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res.confidence_set(family="constant", gamma=0.34) # J=4 -> gamma > 1/(J+1)
+ native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"]
+ block = native["confidence_set"]
+ assert block["status"] in ("ran", "empty", "unbounded")
+ assert block["family"] == "constant" and block["parameter"] == "c"
+ assert block["gamma"] == pytest.approx(0.34)
+
def test_scm_does_not_call_honest_did(self, scm_fit):
"""HonestDiD sensitivity should NOT run on SCM (fit-based / native path)."""
res, _ = scm_fit
diff --git a/tests/test_methodology_synthetic_control.py b/tests/test_methodology_synthetic_control.py
index 02eeb494..00c2a9ff 100644
--- a/tests/test_methodology_synthetic_control.py
+++ b/tests/test_methodology_synthetic_control.py
@@ -38,6 +38,15 @@
SyntheticControlResults,
synthetic_control,
)
+from diff_diff.synthetic_control import (
+ _constant_f_post,
+ _floored_pre_mspe,
+ _invert_sharp_null,
+ _linear_f_post,
+ _rmspe_f_ratio,
+ _rmspe_ratio,
+ _sharp_null_pvalue,
+)
from tests.conftest import assert_nan_inference
DATA_DIR = Path(__file__).parent / "data"
@@ -3013,3 +3022,519 @@ def spy(*a, **k):
inner_min_decrease=1e-3,
)
assert loo_att == pytest.approx(fresh.att, abs=1e-7)
+
+
+# ===========================================================================
+# Confidence sets by test inversion (Firpo & Possebom 2018, Section 4)
+# ===========================================================================
+#
+# Two opt-in SyntheticControlResults methods built ON TOP of the in-space placebo:
+# test_sharp_null(effect) tests H_0^f: alpha_1t = f(t) by re-ranking the stored
+# placebo gaps (Eqs 12-13, phi=0, v=(1,...,1)); confidence_set(family=...) inverts
+# that test over a one-parameter family (Eqs 14/16/18, strict p^f > gamma). No SCM
+# refits. The benchmark f==0 case is identically the existing placebo_p_value
+# (Eq 5 = Eq 13 at f==0). No R anchor (Synth has no test inversion): validated by
+# self-consistency to the placebo p-value, a numpy oracle, and a coverage MC.
+
+
+def _gp(pre_vals, post_vals, pre_periods, post_periods):
+ """Build a {period: gap} path from pre/post value lists."""
+ d = {p: float(v) for p, v in zip(pre_periods, pre_vals)}
+ d.update({p: float(v) for p, v in zip(post_periods, post_vals)})
+ return d
+
+
+# Hand-built scenario for the helper-level oracle tests: the treated unit has the
+# best pre-fit and a constant +2 post effect; 4 placebos have worse pre-fit + scattered
+# post gaps (so for large |c| the treated becomes the most-deviant unit -> bounded set).
+_ORACLE_PRE = [0, 1, 2]
+_ORACLE_POST = [3, 4, 5]
+_ORACLE_TREATED = _gp([0.08, -0.06, 0.05], [2.0, 2.0, 2.0], _ORACLE_PRE, _ORACLE_POST)
+_ORACLE_PLACEBOS = {
+ "a": _gp([0.3, -0.3, 0.3], [0.6, -0.4, 0.2], _ORACLE_PRE, _ORACLE_POST),
+ "b": _gp([-0.3, 0.3, 0.2], [-0.5, 0.4, 0.3], _ORACLE_PRE, _ORACLE_POST),
+ "c": _gp([0.2, 0.3, -0.3], [0.3, 0.5, -0.4], _ORACLE_PRE, _ORACLE_POST),
+ "d": _gp([-0.2, 0.3, 0.3], [0.4, -0.3, 0.5], _ORACLE_PRE, _ORACLE_POST),
+}
+
+
+def _oracle_pre_denoms(scale=1.0):
+ units = {0: _ORACLE_TREATED, **_ORACLE_PLACEBOS}
+ return {
+ u: _floored_pre_mspe(np.array([gp[p] for p in _ORACLE_PRE]), scale)
+ for u, gp in units.items()
+ }
+
+
+def test_sharp_null_pvalue_matches_independent_oracle():
+ pre_denoms = _oracle_pre_denoms()
+ for c in (0.0, 1.0, 2.0, 3.5, -1.0):
+ f = _constant_f_post(c, len(_ORACLE_POST))
+ p, r1, n_ref = _sharp_null_pvalue(
+ _ORACLE_TREATED, _ORACLE_PLACEBOS, _ORACLE_POST, f, pre_denoms, 0
+ )
+ # Independent re-implementation of Eqs 12-13.
+ resid_t = np.array([_ORACLE_TREATED[q] for q in _ORACLE_POST]) - f
+ r1_o = float(np.sqrt(np.mean(resid_t**2) / pre_denoms[0]))
+ rj = []
+ for u, gp in _ORACLE_PLACEBOS.items():
+ resid = np.array([gp[q] for q in _ORACLE_POST]) - f
+ rj.append(float(np.sqrt(np.mean(resid**2) / pre_denoms[u])))
+ p_o = (1 + sum(1 for r in rj if r >= r1_o)) / (len(rj) + 1)
+ assert n_ref == 4
+ assert r1 == pytest.approx(r1_o)
+ assert p == pytest.approx(p_o), (c, p, p_o)
+
+
+def test_sharp_null_zero_equals_rmspe_ratio_helper():
+ # Eq 12 at f==0 reduces to the ADH RMSPE-ratio statistic (Eq 4).
+ scale = 1.0
+ pre_denoms = _oracle_pre_denoms(scale)
+ f0 = _constant_f_post(0.0, len(_ORACLE_POST))
+ _, r1, _ = _sharp_null_pvalue(
+ _ORACLE_TREATED, _ORACLE_PLACEBOS, _ORACLE_POST, f0, pre_denoms, 0
+ )
+ pre = np.array([_ORACLE_TREATED[p] for p in _ORACLE_PRE])
+ post = np.array([_ORACLE_TREATED[p] for p in _ORACLE_POST])
+ assert r1 == pytest.approx(_rmspe_ratio(pre, post, scale))
+
+
+def test_floored_pre_denominator_is_per_unit_not_global():
+ # M1: the RMSPE floor scale is PER-UNIT max|Z1|. For a near-perfect pre-fit (the
+ # floor bites), a wrong GLOBAL scale would change the denominator and break the
+ # f==0 == placebo_p_value anchor. Assert the per-unit denom == the _rmspe_ratio
+ # denom and differs from a global-scale denom.
+ pre = np.array([1e-9, -1e-9, 5e-10]) # near-perfect pre-fit -> the floor dominates
+ post = np.array([2.0, 2.0, 2.0])
+ scale_unit = 10.0
+ denom_unit = _floored_pre_mspe(pre, scale_unit) # 1e-8 * 10**2 = 1e-6
+ denom_global = _floored_pre_mspe(pre, 1.0) # 1e-8
+ assert denom_unit == pytest.approx(1e-6)
+ assert not np.isclose(denom_unit, denom_global)
+ gp = _gp(list(pre), list(post), _ORACLE_PRE, _ORACLE_POST)
+ f0 = _constant_f_post(0.0, 3)
+ assert _rmspe_f_ratio(gp, _ORACLE_POST, f0, denom_unit) == pytest.approx(
+ _rmspe_ratio(pre, post, scale_unit)
+ )
+ # The wrong global denom yields a materially different statistic.
+ assert not np.isclose(
+ _rmspe_f_ratio(gp, _ORACLE_POST, f0, denom_unit),
+ _rmspe_f_ratio(gp, _ORACLE_POST, f0, denom_global),
+ )
+
+
+def test_invert_constant_set_brackets_true_effect():
+ pre_denoms = _oracle_pre_denoms()
+ res = _invert_sharp_null(
+ _ORACLE_TREATED, _ORACLE_PLACEBOS, _ORACLE_POST, pre_denoms, 0, "constant", 0.25, n_grid=120
+ )
+ assert res["status"] == "ran"
+ assert res["point_estimate"] == pytest.approx(2.0) # center = att = mean post gap
+ assert res["lower"] <= 2.0 <= res["upper"]
+ assert res["contiguous"]
+
+
+def test_invert_strict_boundary_excludes_p_equals_gamma():
+ # Eq 14 membership is STRICT (p^f > gamma). Use fixed wide bounds so the grid spans
+ # rejected points too; with 4 placebos gamma=0.4 (=2/5) is attainable, so some grid
+ # points have p == gamma and MUST be excluded.
+ pre_denoms = _oracle_pre_denoms()
+ res = _invert_sharp_null(
+ _ORACLE_TREATED,
+ _ORACLE_PLACEBOS,
+ _ORACLE_POST,
+ pre_denoms,
+ 0,
+ "constant",
+ 0.4,
+ bounds=(-6.0, 10.0),
+ n_grid=401,
+ )
+ grid = res["grid"]
+ assert all(in_set == (p > 0.4) for _, p, in_set in grid) # strict operator
+ at_gamma = [in_set for _, p, in_set in grid if np.isclose(p, 0.4)]
+ assert at_gamma, "expected the grid to include a p == gamma point"
+ assert not any(at_gamma) # every p == gamma point excluded (strict)
+
+
+def test_invert_unbounded_when_gamma_below_granularity():
+ # p^f >= 1/(J+1) always (the treated ranks itself); gamma below that -> nothing is
+ # rejected -> the set is all of R (Firpo & Possebom fn 8).
+ pre_denoms = _oracle_pre_denoms()
+ res = _invert_sharp_null(
+ _ORACLE_TREATED, _ORACLE_PLACEBOS, _ORACLE_POST, pre_denoms, 0, "constant", 0.1, n_grid=10
+ )
+ assert res["status"] == "unbounded"
+ assert res["lower"] == -np.inf and res["upper"] == np.inf
+
+
+def test_invert_empty_set_when_family_cannot_fit():
+ # A constant treated effect cannot be matched by the LINEAR family; with a near-
+ # perfect pre-fit (tiny denom) the treated stays the most-deviant unit at every
+ # slope, so p == 1/(J+1) everywhere -> rejected at gamma > 1/(J+1) -> empty.
+ treated = _gp([1e-9, -1e-9, 1e-9], [2.0, 2.0, 2.0], _ORACLE_PRE, _ORACLE_POST)
+ units = {0: treated, **_ORACLE_PLACEBOS}
+ pre_denoms = {
+ u: _floored_pre_mspe(np.array([gp[p] for p in _ORACLE_PRE]), 1.0) for u, gp in units.items()
+ }
+ res = _invert_sharp_null(
+ treated, _ORACLE_PLACEBOS, _ORACLE_POST, pre_denoms, 0, "linear", 0.25, n_grid=60
+ )
+ assert res["status"] == "empty"
+ assert np.isnan(res["lower"]) and np.isnan(res["upper"])
+
+
+def test_invert_accepts_tails_when_center_is_rejected():
+ # Regression for the centered-bracket bug (reviewer M1): when the treated unit has a
+ # WORSE pre-fit than the placebos, its accepted region is in the TAILS, not around the
+ # point estimate, and the central band is REJECTED. The exact breakpoint inversion must
+ # return the unbounded, non-contiguous (two-tail) set — NOT "empty" (the old bug).
+ # Treated post gaps [0, 2] with pre-MSPE 100 (poor fit); 4 placebos post [1, 1] pre-MSPE 1.
+ pre, post = [0, 1], [2, 3]
+ treated = {0: 10.0, 1: -10.0, 2: 0.0, 3: 2.0} # pre-MSPE=100 -> D=100; post gaps [0, 2]
+ placebos = {
+ f"p{k}": {0: 1.0, 1: -1.0, 2: 1.0, 3: 1.0} for k in range(4)
+ } # pre-MSPE=1; post [1,1]
+ pden = {
+ u: _floored_pre_mspe(np.array([gp[p] for p in pre]), 1.0)
+ for u, gp in {0: treated, **placebos}.items()
+ }
+ assert pden[0] == pytest.approx(100.0) and pden["p0"] == pytest.approx(1.0)
+ res = _invert_sharp_null(treated, placebos, post, pden, 0, "constant", 0.25, n_grid=50)
+ assert res["status"] == "unbounded" # was wrongly "empty" under the centered bracket
+ assert res["lower"] == -np.inf and res["upper"] == np.inf
+ assert res["contiguous"] is False # central rejected band -> two disjoint accepted tails
+ # The point estimate (att = 1) is itself rejected; far-out values are accepted.
+ p_center, _, _ = _sharp_null_pvalue(treated, placebos, post, _constant_f_post(1.0, 2), pden, 0)
+ p_tail, _, _ = _sharp_null_pvalue(treated, placebos, post, _constant_f_post(50.0, 2), pden, 0)
+ assert p_center <= 0.25 < p_tail
+ # Under a user-bounded grid the same scenario is grid-limited "ran" but still flagged
+ # non-contiguous (accepted at both edges, rejected through the middle).
+ res_b = _invert_sharp_null(
+ treated, placebos, post, pden, 0, "constant", 0.25, bounds=(-50.0, 50.0), n_grid=201
+ )
+ assert res_b["status"] == "ran" and res_b["contiguous"] is False
+
+
+def test_invert_includes_accepted_breakpoint_singleton():
+ # Reviewer round-2 (M1/DT1): membership is strict (p > gamma) but ranking counts ties
+ # (>=), so a placebo that EXACTLY ties the treated at a (tangent) breakpoint can lift p
+ # above gamma at that one point while both neighbouring open intervals are rejected.
+ # The exact inversion must include that isolated singleton (NOT report "empty"). Construct
+ # a placebo whose RMSPE ratio touches the treated's only at c=0; 3 others stay strictly below.
+ pre, post = [0, 1], [2, 3]
+ treated = {0: 1.0, 1: -1.0, 2: 1.0, 3: -1.0} # post [1,-1], pre-MSPE 1 -> D=1
+ placebos = {"tie": {0: 2.0, 1: -2.0, 2: 2.0, 3: -2.0}} # post [2,-2], pre-MSPE 4: ties at c=0
+ for k in range(3):
+ placebos[f"lo{k}"] = {0: 2.0, 1: -2.0, 2: 0.5, 3: -0.5} # always strictly below treated
+ pden = {
+ u: _floored_pre_mspe(np.array([gp[p] for p in pre]), 1.0)
+ for u, gp in {0: treated, **placebos}.items()
+ }
+ # At c=0 the tie places p at 2/5; just off c=0 only the treated ranks (p=1/5).
+ p_at0, _, _ = _sharp_null_pvalue(treated, placebos, post, _constant_f_post(0.0, 2), pden, 0)
+ p_near, _, _ = _sharp_null_pvalue(treated, placebos, post, _constant_f_post(0.5, 2), pden, 0)
+ assert p_at0 == pytest.approx(0.4) and p_near == pytest.approx(0.2)
+ res = _invert_sharp_null(treated, placebos, post, pden, 0, "constant", 0.25, n_grid=20)
+ assert res["status"] == "ran" # NOT "empty" -- the singleton is included
+ assert res["lower"] == pytest.approx(0.0) and res["upper"] == pytest.approx(0.0)
+ # the returned inspection grid reflects the non-empty singleton (one accepted row, not [])
+ assert len(res["grid"]) == 1
+ g_param, g_p, g_in = res["grid"][0]
+ assert g_param == pytest.approx(0.0) and g_p > 0.25 and g_in is True
+
+
+def test_linear_f_post_is_one_based_and_czero_equals_constant_zero():
+ assert np.allclose(_linear_f_post(1.0, 4), [1.0, 2.0, 3.0, 4.0]) # (t - T0), 1-based
+ pre_denoms = _oracle_pre_denoms()
+ p_lin0, _, _ = _sharp_null_pvalue(
+ _ORACLE_TREATED, _ORACLE_PLACEBOS, _ORACLE_POST, _linear_f_post(0.0, 3), pre_denoms, 0
+ )
+ p_con0, _, _ = _sharp_null_pvalue(
+ _ORACLE_TREATED, _ORACLE_PLACEBOS, _ORACLE_POST, _constant_f_post(0.0, 3), pre_denoms, 0
+ )
+ assert p_lin0 == pytest.approx(p_con0)
+
+
+def test_invert_monotone_in_gamma():
+ # A larger gamma rejects more -> a narrower (or equal) confidence set.
+ pre_denoms = _oracle_pre_denoms()
+
+ def width(g):
+ r = _invert_sharp_null(
+ _ORACLE_TREATED,
+ _ORACLE_PLACEBOS,
+ _ORACLE_POST,
+ pre_denoms,
+ 0,
+ "constant",
+ g,
+ n_grid=120,
+ )
+ return (r["upper"] - r["lower"]) if r["status"] == "ran" else None
+
+ w_lo, w_hi = width(0.25), width(0.45)
+ assert w_lo is not None and w_hi is not None
+ assert w_lo >= w_hi - 1e-9
+
+
+# --- end-to-end (real fit): custom V skips the outer search for speed/determinism ---
+
+
+def _fit_with_placebos(n_donors=6, T=10, T0=6, effect=3.0, seed=0, run_placebo=True):
+ df, years, t0 = _make_panel(n_donors=n_donors, T=T, T0=T0, effect=effect, seed=seed)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res = SyntheticControl(v_method="custom", custom_v=np.ones(T0), seed=seed).fit(
+ df, "y", "treated", "unit", "year", post_periods=years[T0:], treated_unit="treated"
+ )
+ if run_placebo:
+ res.in_space_placebo()
+ return res, years, t0
+
+
+def _exact_combo_fit(effect=3.0, T=10, T0=6, n_donors=5):
+ """Deterministic panel where the treated is an EXACT convex combo of two donors.
+
+ The donors carry distinct sinusoidal idiosyncrasies, so no single donor is a convex
+ combination of the others (placebos fit poorly -> larger pre-denominators), while the
+ treated reproduces 0.5*d0 + 0.5*d1 (near-perfect pre-fit -> the smallest denominator).
+ The treated is therefore uniquely the most-deviant unit in the tails, so the constant
+ confidence set is BOUNDED ("ran") -- the end-to-end analogue of the helper oracle.
+ """
+ years = list(range(2000, 2000 + T))
+ t = np.arange(T, dtype=float)
+ donors = {
+ j: 10.0 + 2.0 * j + (0.3 + 0.1 * j) * t + (0.5 + 0.3 * j) * np.sin(t)
+ for j in range(n_donors)
+ }
+ treated = (0.5 * donors[0] + 0.5 * donors[1]).copy()
+ treated[T0:] += effect
+ rows = []
+ for j in range(n_donors):
+ for i in range(T):
+ rows.append({"unit": f"d{j}", "year": years[i], "y": float(donors[j][i]), "treated": 0})
+ for i in range(T):
+ rows.append(
+ {"unit": "treated", "year": years[i], "y": float(treated[i]), "treated": int(i >= T0)}
+ )
+ df = pd.DataFrame(rows)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res = SyntheticControl(v_method="custom", custom_v=np.ones(T0), seed=0).fit(
+ df, "y", "treated", "unit", "year", post_periods=years[T0:], treated_unit="treated"
+ )
+ res.in_space_placebo()
+ return res
+
+
+def test_test_sharp_null_zero_equals_placebo_p_value_end_to_end():
+ res, _, _ = _fit_with_placebos()
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ s0 = res.test_sharp_null(0.0)
+ assert s0["p_value"] == pytest.approx(res.placebo_p_value)
+ assert s0["rmspe_f_treated"] == pytest.approx(res.rmspe_ratio)
+ assert s0["n_placebos"] == res.n_placebos
+
+
+def test_confidence_set_constant_contains_att_and_excludes_zero():
+ res = _exact_combo_fit(effect=3.0)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ grid = res.confidence_set(family="constant", gamma=0.25)
+ ecs = res.effect_confidence_set
+ assert ecs["status"] == "ran"
+ assert ecs["lower"] <= res.att <= ecs["upper"] # point estimate inside the set
+ assert not (ecs["lower"] <= 0.0 <= ecs["upper"]) # a real +3 effect -> 0 excluded
+ assert list(grid.columns) == ["param", "p_value", "in_set"]
+ assert ecs["boundary"] == "strict" and ecs["parameter"] == "c"
+
+
+def test_test_sharp_null_array_path_matches_scalar_and_validates():
+ res, _, _ = _fit_with_placebos()
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ s_scalar = res.test_sharp_null(3.0)
+ s_array = res.test_sharp_null(np.array([3.0, 3.0, 3.0, 3.0]))
+ assert s_array["p_value"] == pytest.approx(s_scalar["p_value"])
+ for bad in (
+ np.array([1.0, 2.0]), # wrong length
+ np.array([[1.0, 2.0, 3.0, 4.0]]), # 2-D
+ np.array([np.nan, 1.0, 1.0, 1.0]), # non-finite
+ ):
+ with pytest.raises(ValueError):
+ res.test_sharp_null(bad)
+
+
+def test_confidence_set_conf_int_stays_nan():
+ res, _, _ = _fit_with_placebos()
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res.confidence_set(family="constant", gamma=0.25)
+ # The analytical fields stay NaN: the confidence set is a SEPARATE permutation object.
+ assert np.isnan(res.se) and np.isnan(res.p_value) and np.isnan(res.t_stat)
+ assert np.isnan(res.conf_int[0]) and np.isnan(res.conf_int[1])
+ assert res.effect_confidence_set is not None
+
+
+def test_confidence_set_to_dict_flattened_and_summary_renders():
+ res = _exact_combo_fit(effect=3.0)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res.confidence_set(family="constant", gamma=0.25)
+ d = res.to_dict()
+ assert d["effect_ci_status"] == "ran"
+ assert d["effect_ci_family"] == "constant" and d["effect_ci_parameter"] == "c"
+ assert np.isnan(d["conf_int_lower"]) # analytical interval still NaN
+ assert np.isfinite(d["effect_ci_lower"]) and np.isfinite(d["effect_ci_upper"])
+ row = res.to_dataframe() # stays a single row of scalars
+ assert len(row) == 1 and "effect_ci_lower" in row.columns
+ assert "Confidence set by test inversion" in res.summary()
+
+
+def test_confidence_set_linear_runs_and_sets_field():
+ res, _, _ = _fit_with_placebos()
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ grid = res.confidence_set(family="linear", gamma=0.25)
+ ecs = res.effect_confidence_set
+ assert ecs["family"] == "linear" and ecs["parameter"] == "c_tilde"
+ assert ecs["status"] in ("ran", "empty", "unbounded")
+ assert list(grid.columns) == ["param", "p_value", "in_set"]
+
+
+def test_confidence_set_lazy_runs_in_space_placebo():
+ res, _, _ = _fit_with_placebos(run_placebo=False)
+ assert res._placebo_gaps is None # placebo NOT run yet
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res.confidence_set(family="constant", gamma=0.25)
+ assert res._placebo_gaps is not None # lazily built
+ assert res.effect_confidence_set is not None
+
+
+def test_confidence_set_n_starts_ignored_with_warning_when_reference_exists():
+ res, _, _ = _fit_with_placebos(run_placebo=True) # reference already built
+ with pytest.warns(UserWarning, match="n_starts is ignored"):
+ res.confidence_set(family="constant", gamma=0.25, n_starts=3)
+
+
+def test_get_confidence_set_df_requires_run():
+ res, _, _ = _fit_with_placebos()
+ with pytest.raises(ValueError, match="No confidence set"):
+ res.get_confidence_set_df()
+
+
+def test_in_space_placebo_rerun_invalidates_confidence_set():
+ # CI-review P1: a confidence set is computed against the CURRENT placebo reference set,
+ # so an explicit in_space_placebo() rebuild (which _require_placebo_reference even
+ # suggests, via n_starts) must INVALIDATE the cached set rather than report a stale one.
+ res = _exact_combo_fit(effect=3.0)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res.confidence_set(family="constant", gamma=0.25)
+ assert res.effect_confidence_set is not None
+ native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"]
+ assert native["confidence_set"]["status"] == "ran"
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res.in_space_placebo(n_starts=2) # rebuild the reference set
+ assert res.effect_confidence_set is None
+ with pytest.raises(ValueError, match="No confidence set"):
+ res.get_confidence_set_df()
+ native2 = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"]
+ assert native2["confidence_set"]["status"] == "not_run"
+
+
+def test_confidence_set_too_few_donors_raises():
+ # One donor -> no placebo reference set -> confidence_set() and test_sharp_null() raise.
+ df, years, T0 = _make_panel(n_donors=1, T=10, T0=6)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res = SyntheticControl(v_method="custom", custom_v=np.ones(T0), seed=0).fit(
+ df, "y", "treated", "unit", "year", post_periods=years[T0:], treated_unit="treated"
+ )
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ with pytest.raises(ValueError, match="reference set"):
+ res.confidence_set(family="constant", gamma=0.25)
+ with pytest.raises(ValueError, match="reference set"):
+ res.test_sharp_null(0.0)
+
+
+def test_confidence_set_unpickled_raises():
+ res, _, _ = _fit_with_placebos()
+ restored = pickle.loads(pickle.dumps(res)) # snapshot + placebo gaps dropped
+ with pytest.raises(ValueError):
+ restored.confidence_set(family="constant", gamma=0.25)
+
+
+@pytest.mark.parametrize(
+ "kwargs",
+ [
+ {"family": "quadratic"},
+ {"gamma": 0.0},
+ {"gamma": 1.0},
+ {"n_grid": 1},
+ {"bounds": (1.0, 1.0)},
+ {"bounds": (2.0, 1.0)},
+ {"bounds": 5.0}, # scalar -> ValueError, not a bare TypeError from len()
+ {"bounds": (1.0,)}, # wrong length
+ {"bounds": (np.inf, 1.0)}, # non-finite
+ ],
+)
+def test_confidence_set_input_validation(kwargs):
+ res, _, _ = _fit_with_placebos()
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ with pytest.raises(ValueError):
+ res.confidence_set(**kwargs)
+
+
+@pytest.mark.parametrize("bad", [0.0, 1.0, -0.1, 1.5])
+def test_test_sharp_null_gamma_validation(bad):
+ res, _, _ = _fit_with_placebos()
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ with pytest.raises(ValueError):
+ res.test_sharp_null(0.0, gamma=bad)
+
+
+@pytest.mark.slow
+def test_confidence_set_coverage_simulation():
+ # Behavioral coverage check: under a constant-effect DGP the (1 - gamma) confidence
+ # set should cover the true effect at roughly 1 - gamma. A looser inner tolerance keeps
+ # the refits converging cleanly; reps with ANY dropped placebo (n_failed > 0) are
+ # EXCLUDED from the coverage count so dropped placebos cannot bias it (M5), and we
+ # assert that the large majority of reps are clean (the settings are adequate).
+ # J = 9 -> attainable p in multiples of 1/10; gamma = 0.1 rejects only p == 1/10 (strict).
+ gamma = 0.1
+ c_true = 2.0
+ reps = 100
+ clean = 0
+ covered = 0
+ for s in range(reps):
+ df, years, T0 = _make_panel(n_donors=9, T=10, T0=6, effect=c_true, seed=1000 + s)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res = SyntheticControl(
+ v_method="custom", custom_v=np.ones(T0), seed=s, inner_min_decrease=1e-3
+ ).fit(
+ df, "y", "treated", "unit", "year", post_periods=years[T0:], treated_unit="treated"
+ )
+ res.in_space_placebo()
+ if res.n_failed != 0:
+ continue # exclude biased reps from the coverage count (M5)
+ clean += 1
+ res.confidence_set(family="constant", gamma=gamma)
+ ecs = res.effect_confidence_set
+ if ecs["status"] == "unbounded":
+ covered += 1 # an unbounded set trivially covers the truth
+ elif ecs["status"] == "ran" and ecs["lower"] <= c_true <= ecs["upper"]:
+ covered += 1
+ assert clean >= 0.8 * reps, f"only {clean}/{reps} reps converged cleanly"
+ coverage = covered / clean
+ # Permutation inference is finite-sample valid under exchangeability; the 0.70 floor is
+ # loose (the convex-combo treated is not perfectly exchangeable with single donors).
+ assert coverage >= 0.70, f"coverage {coverage:.3f} too low (target ~{1 - gamma})"