diff --git a/CHANGELOG.md b/CHANGELOG.md
index 27c7aeb9..bb7ffed8 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Added
 - **`SyntheticControl` cross-validation + inverse-variance `V`-selection (ADH 2015 §; Abadie 2021 §3.2(a), Eq. 9).** Two new `v_method` values complete the ADH-2015/Abadie-2021 `V`-selection menu (joining `"nested"` / `"custom"`), each threaded through the in-space / leave-one-out / in-time placebo refits so a diagnostic uses the **same** estimator as the headline fit. **`v_method="cv"`** selects the diagonal predictor-importance `V` by out-of-sample cross-validation: the pre-period is split positionally at `v_cv_t0` (new constructor param; default `len(pre)//2`, Abadie 2021's `t0 = T0/2`) into a training and a validation window, `V` is chosen to minimize the validation-window outcome MSPE of the training-fit weights (`mspe_v` now reports this validation MSPE under cv), and the final reported weights are re-estimated on the validation-window predictors (ADH 2015 step 4). Each predictor spec is **re-aggregated** over each window (its mean/sum/identity recomputed over only the periods that fall in that window — a separate `dataprep` per window, exactly as ADH 2015's CV does, since R `Synth` has no built-in CV function), so the V-search is genuinely out-of-sample for every predictor type and the same `V*` drives both fits with no zeroed coordinate (`v_weights` reproduce `donor_weights` on the validation-window predictors, and `predictor_balance` is reported on that validation-window basis). **Fully-spanning precondition (fail-closed):** re-aggregating a predictor on each window requires it to be observed in **both** windows, so `cv` **requires every predictor to span both the training and validation windows** and raises `ValueError` otherwise — satisfied by ADH 2015's shared covariate / multi-period `special_predictors` (which span the windows) but NOT by the default per-period outcome lags (each is single-period and lives in one window only), so `cv` with the bare default predictors is rejected with guidance to pass spanning predictors. In-time-placebo truncation that breaks the fully-spanning precondition (a kept spec stops spanning both windows at the truncated split) marks that date `infeasible`. A second fail-closed gate covers windows that span but carry **no cross-donor variation** (every re-aggregated predictor constant across the donors, so `X0·W` is constant in `W` → a flat, unidentified weight solve that would otherwise return arbitrary "converged" weights — even when the treated unit differs, since donor distinguishability, not treated-vs-donor variation, identifies `W`): the headline fit raises `ValueError`, in-space placebo refits whose donor pool is indistinguishable in a window are dropped from the reference set, and such in-time-truncated dates are marked `infeasible`. Abadie 2021 footnote 7's CV non-uniqueness is handled by a **deterministic tie-break** (prefer the `V` closest to uniform among ties), making the selected `V*` among equally-good optima independent of the multistart evaluation order. The cv fit is reproducible for a fixed `seed` (like `nested`) but is not seed-independent — the multistart fills any slots beyond the distinct heuristic starts with seed-dependent random Dirichlet draws, so the tie-break removes start-order dependence among ties, not seed dependence. The tie-break is convergence-aware (a non-converged optimizer candidate cannot displace a converged incumbent on an objective tie). If the training-window solve that defines `mspe_v` truncates (e.g. `inner_max_iter` too small), the fit fails closed — `mspe_v=NaN` and the fit is marked non-converged — rather than reporting an invalid Eq. 9 criterion. **`v_method="inverse_variance"`** uses the closed form `v_h = 1/Var(X_h)` (variance over donors+treated on the unstandardized predictors), applied to the **raw** predictors so the effective objective is the unit-variance-rescaled `Σ_h diff_h²/Var_h` (Abadie 2021 §3.2(a)); the `standardize` pre-scaling is intentionally bypassed on this branch (inverse-variance weighting *is* the unit-variance rescaling — applying it on already-standardized rows would double-rescale to `Σ_h diff_h²/Var_h²`), so it is equivalent to uniform `V` on standardized predictors. No search (`mspe_v=None`); a zero-variance row gets 0 weight and an all-zero-variance panel falls back to uniform `V` with a warning. `custom_v` is rejected (fail-closed) for both methods and `v_cv_t0` is rejected unless `v_method="cv"`. On the degenerate **single-donor** path (`J=1` forces `w=[1]`) `V` is unidentified — every `V` yields the same synthetic — so `v_weights` is **uniform** and `mspe_v=None` for ALL `v_method`s (cv / inverse_variance included; their selected / closed-form `V` would be inert), with a `UserWarning`; the donor weights / gap / ATT are unaffected. An explicitly pinned `v_cv_t0` that no longer fits the truncated pre-fake window is nulled to the `//2` default for the placebo refit (a pinned value that still fits the truncated window is kept). **Validation:** R `Synth` has no built-in CV function (ADH 2015's CV is a manual `dataprep`+`synth` re-run), so cv is anchored by deterministic equivalence to the R-anchored `custom_v` path (the step-3 validation MSPE of the training-window fit and the step-4 validation-window weights each match a `custom_v=V*` fit on the correspondingly re-aggregated predictors) plus cv self-consistency (`in_time_placebo` under cv == a fresh cv fit on the backdated panel to 1e-7); inverse-variance is anchored bit-for-bit to a `custom_v=1/Var(X)` fit. Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (new `**Note:**` labels for the per-window re-aggregation convention, the flat-MSPE tie-break, and inverse-variance), `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`. The remaining ADH-2015 items (`W^reg` extrapolation diagnostic, sparse-SC subset search) stay tracked in `TODO.md`.
 - **Firpo & Possebom (2018) SCM inference paper review on file (PR-A).** Added `docs/methodology/papers/firpo-possebom-2018-review.md`, a faithful, paper-sourced fidelity review of Firpo & Possebom (2018, *Journal of Causal Inference* 6(2), DOI 10.1515/jci-2016-0026) — the Step-1 artifact for the forthcoming SCM **confidence-set / CI-by-test-inversion** track (PR-B) layered on the existing `SyntheticControl` estimator (classic SCM has no analytical SE; `se`/`p_value`/`conf_int` are NaN). Transcribes (paper-sourced only, no code-deviation verdicts) the benchmark RMSPE-ratio permutation test (Eqs. 4–6), the sensitivity-analysis parametric p-value weights with worst/best-case `φ̲`/`φ̄` (Eqs. 7–9), the sharp-null `RMSPE^f` test (Eqs. 10–13), the **confidence sets by test inversion** (Eq. 14) with the operational constant-effect CI (Eqs. 15–16) and linear-effect CS (Eqs. 17–18), the general test-statistic framework + Monte Carlo size/power of five statistics (Eq. 19, Section 5), and the multiple-outcome FWER (Eqs. 23–24) and multiple-treated-unit pooled (Eqs. 25–26) extensions; the requirements checklist flags the PR-B target (sharp-null test + constant/linear CI + benchmark + one-sided) versus the deferred sensitivity-analysis and multi-outcome/treated extensions. Docs-only; no code change. Registered in `docs/references.rst` (Synthetic Control Method section) and `docs/doc-deps.yaml`; REGISTRY `## SyntheticControl` gains a `firpo-possebom-2018-review.md` reviews-on-file pointer.
+- **`SyntheticControl` confidence sets by test inversion (Firpo & Possebom 2018 §4, PR-B).** Classic SCM gains the uncertainty quantification it has lacked — a confidence set for the treatment-effect *path* — without changing its always-NaN analytical inference contract. Two opt-in `SyntheticControlResults` methods built ON TOP of the in-space placebo: `test_sharp_null(effect, gamma=0.1)` tests a sharp null `H_0: α_1t = f(t)` (Eq 11; `effect` a scalar constant effect or a length-`n_post` post-period path) by subtracting `f(t)` from every unit's post-period gaps and re-ranking the modified RMSPE ratio `RMSPE^f` (Eqs 12–13 at `φ=0`, `v=(1,…,1)`), and `confidence_set(family="constant"|"linear", gamma=0.1, bounds=None, n_grid=200)` inverts that test into a confidence set — a constant-in-time interval (Eqs 15–16) or a linear-in-time slope set (Eqs 17–18) — keeping every value whose sharp null is not rejected at the paper's **strict** `p^f > γ` boundary (Eq 14). The whole computation is a **pure re-ranking of the gap paths `in_space_placebo()` already computes** (no synthetic-control refits): under a common-effect null the donor synthetics and the pre-period MSPE denominators are unchanged — only the post gaps shift by `f(t)` — so each grid value costs an `O(J)` rank, not a refit. With `bounds=None` the set is recovered **EXACTLY** by piecewise-constant breakpoint inversion: `p^c` is constant between the real roots of the placebo-vs-treated comparison quadratics, so `p` is evaluated once per induced interval AND at each breakpoint (a tie under `≥` can lift `p` above γ there, yielding an isolated accepted point) — NO centering/monotonicity assumption, so accepted tails, disjoint components, and unbounded/empty sets are all handled (a poor-pre-fit treated unit can have its accepted region in the tails). `bounds=(lo,hi)` instead scans a fixed grid (grid-limited); `n_grid` controls only the returned inspection table when `bounds=None`. Results: a pickle-surviving `effect_confidence_set` summary (`{family, parameter, gamma, lower, upper, contiguous, status, …}`, `status ∈ {"ran","empty","unbounded"}`) + a `get_confidence_set_df()` grid table, surfaced under `estimator_native_diagnostics.confidence_set`. **The analytical `conf_int`/`se`/`t_stat`/`p_value` stay NaN** — this is a permutation set at level `1−γ` (γ granular in `1/(J+1)`), possibly a set / unbounded / non-contiguous, so it cannot be coerced into the Wald-interval `conf_int` tuple; it is kept separate exactly as `placebo_p_value` is kept off `p_value`. **Fail-closed:** `γ < 1/(J+1)` (no value rejectable — fn 8) or a treated unit lacking the best pre-fit → `"unbounded"` (`±inf` + warning); no interval or breakpoint accepted → `"empty"` (NaN endpoints); a non-contiguous accepted region (disjoint components / an isolated singleton) → the `[lower, upper]` hull with `contiguous=False` + warning; `< 2` donors / a non-converged treated fit / an unpickled result (no placebo reference set) → `ValueError`. `test_sharp_null(0)` is held bit-for-bit equal to `placebo_p_value` (Eq 5 = Eq 13) by reusing each unit's **per-unit** floored pre-period denominator persisted from the placebo run. **Scope:** the sensitivity-analysis weights (`φ≠0`, Eqs 7–9), the general test-statistic menu (Eq 19), one-sided (§7's signed-`t` statistic), and the multiple-outcome/treated extensions (§6) are deferred (flagged in the paper review checklist). **Validation:** no R anchor (R `Synth` has no test inversion; the authors' Code Ocean capsule was not consulted) — self-consistency to the (Basque-R-anchored) `placebo_p_value`, a numpy oracle on Eqs 12–14 (incl. the strict `p=γ` boundary and the per-unit floor), invariants (the point estimate lies in the constant set for a well-posed fit; a center-rejected/tails-accepted regression; an isolated-breakpoint singleton; monotone-in-γ), and a coverage simulation. Consumes the PR-A `firpo-possebom-2018-review.md`; documented in `docs/methodology/REGISTRY.md` §SyntheticControl (new methodology block + `**Note:**` labels for the boundary convention, the grid choice, the non-analytical `conf_int` contract, and the no-R-anchor validation), `docs/api/synthetic_control.rst`, and the LLM guides.
 - **`HeterogeneousAdoptionDiD.fit()` fit-time extensive-margin warning + `covariates=` not-implemented pointer.** Two UX additions to the HAD `fit()` surface, with **no change to any estimate or standard error**. (1) The **overall** path now emits a `UserWarning` when a non-trivial fraction (`>= 10%`, a library-convention cutoff in `_HAD_EXTENSIVE_MARGIN_ZERO_DOSE_FRAC`) of units have an exactly-zero post-period dose — a genuine untreated mass for which a standard DiD using those units as controls may be more appropriate (de Chaisemartin et al. 2026, Section 2 / Assumption 3). The paper retains *small* untreated shares (e.g. 12/2954 in Garrett et al., with close-to-nominal coverage), so the 10% cutoff sits ~25× above that; the warning is **overall-path-only** because the event-study path *requires* never-treated units per Appendix B.2. Previously the recommendation surfaced only via `qug_test()`'s zero-dose warning when the user ran the pre-tests. (2) `HeterogeneousAdoptionDiD.fit(covariates=...)` now raises `NotImplementedError` with a pointer to the deferred Appendix B.1 / Theorem 6 covariate-adjusted extension (via an explicit keyword-only `covariates=` param) instead of a bare `TypeError` from an unknown kwarg; pre-residualize the outcome on the covariates as a workaround. Documented in `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD; new tests in `tests/test_had.py` and `tests/test_methodology_had.py`.
 
 ### Fixed
diff --git a/diff_diff/diagnostic_report.py b/diff_diff/diagnostic_report.py
index fd65ea61..c48cdfe6 100644
--- a/diff_diff/diagnostic_report.py
+++ b/diff_diff/diagnostic_report.py
@@ -2458,6 +2458,49 @@ def _scm_native(self, r: Any) -> Dict[str, Any]:
                     "placebo (opt-in; refits per backdated date)."
                 ),
             }
+
+        # Test-inversion confidence set (Firpo & Possebom 2018 §4): opt-in, surfaced once
+        # the user has run results.confidence_set() (it reuses the in-space placebo
+        # reference set — no refits). The analytical conf_int stays NaN; this is a SEPARATE
+        # permutation set at level 1 - gamma, possibly unbounded or non-contiguous.
+        ecs = getattr(r, "effect_confidence_set", None)
+        if ecs is not None:
+            ecs_status = ecs.get("status")
+            _lo, _hi = ecs.get("lower"), ecs.get("upper")
+            block = {
+                "status": ecs_status,
+                "family": ecs.get("family"),
+                "parameter": ecs.get("parameter"),
+                "gamma": _to_python_float(ecs.get("gamma")),
+                # Emit each endpoint independently: a finite float, else None for a non-finite
+                # side (NaN for an empty set, +/-inf for an unbounded tail) -- keeps the dict
+                # JSON-safe while preserving the FINITE side of a one-sided unbounded set.
+                "lower": float(_lo) if isinstance(_lo, (int, float)) and np.isfinite(_lo) else None,
+                "upper": float(_hi) if isinstance(_hi, (int, float)) and np.isfinite(_hi) else None,
+                "contiguous": bool(ecs.get("contiguous")),
+                "n_placebos": _to_python_scalar(ecs.get("n_placebos")),
+            }
+            if ecs_status == "unbounded":
+                block["reason"] = (
+                    "confidence_set() ran but the set is unbounded (gamma below the "
+                    "1/(J+1) permutation granularity, or the treated unit lacks the best "
+                    "pre-treatment fit); endpoint(s) are +/-inf."
+                )
+            elif ecs_status == "empty":
+                block["reason"] = (
+                    "confidence_set() ran but the set is empty (every effect in the "
+                    "family is rejected at gamma); endpoints are NaN."
+                )
+            out["confidence_set"] = block
+        else:
+            out["confidence_set"] = {
+                "status": "not_run",
+                "reason": (
+                    "Call results.confidence_set() for a test-inversion confidence set of "
+                    "the effect path (Firpo-Possebom 2018; opt-in, reuses the in-space "
+                    "placebo reference set)."
+                ),
+            }
         return out
 
     # -- Heterogeneity helpers --------------------------------------------
diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt
index b4a89596..b64b532a 100644
--- a/diff_diff/guides/llms-full.txt
+++ b/diff_diff/guides/llms-full.txt
@@ -617,7 +617,7 @@ scm.fit(
 ) -> SyntheticControlResults
 ```
 
-**Inference:** NONE analytical — `se`/`t_stat`/`p_value`/`conf_int` are always NaN. `att` is the mean post-period gap. Significance via in-space placebo permutation inference: `results.in_space_placebo()` reassigns treatment to each donor, refits against the other J-1 donors (the real treated unit is excluded from every placebo pool), and sets `placebo_p_value = rank/(n_placebos+1)` from the post/pre RMSPE-ratio. The permutation `placebo_p_value` is a SEPARATE field from the (NaN) `p_value`; `is_significant` stays bound to `p_value`. **ADH-2015 §4 robustness (opt-in, analytical inference unchanged):** `results.leave_one_out()` drops each reportably-weighted donor (weight > 1e-6) and re-fits (per-drop ATT/`delta_att` — large `delta_att` ⇒ single-donor dependence); `results.in_time_placebo()` backdates the intervention and checks for a spurious pre-period gap (TRUNCATE windowing — predictor windows in the held-out region are dropped). Predictor periods must lie within the pre window; `post_periods` must be a contiguous suffix cross-checked against `D` (no anticipation).
+**Inference:** NONE analytical — `se`/`t_stat`/`p_value`/`conf_int` are always NaN. `att` is the mean post-period gap. Significance via in-space placebo permutation inference: `results.in_space_placebo()` reassigns treatment to each donor, refits against the other J-1 donors (the real treated unit is excluded from every placebo pool), and sets `placebo_p_value = rank/(n_placebos+1)` from the post/pre RMSPE-ratio. The permutation `placebo_p_value` is a SEPARATE field from the (NaN) `p_value`; `is_significant` stays bound to `p_value`. **ADH-2015 §4 robustness (opt-in, analytical inference unchanged):** `results.leave_one_out()` drops each reportably-weighted donor (weight > 1e-6) and re-fits (per-drop ATT/`delta_att` — large `delta_att` ⇒ single-donor dependence); `results.in_time_placebo()` backdates the intervention and checks for a spurious pre-period gap (TRUNCATE windowing — predictor windows in the held-out region are dropped). **Confidence sets by test inversion (Firpo-Possebom 2018 §4, opt-in, also non-analytical):** `results.test_sharp_null(effect, gamma=0.1)` tests `H_0: α_1t = f(t)` by re-ranking the in-space placebo gaps (no refits; `test_sharp_null(0)` == `placebo_p_value`), and `results.confidence_set(family="constant"|"linear", gamma=0.1)` inverts it into a confidence set for the effect path (constant-effect interval / linear-slope set, strict `p>gamma`), surfaced on `effect_confidence_set` / `get_confidence_set_df()` — the analytical `conf_int` still stays NaN. Predictor periods must lie within the pre window; `post_periods` must be a contiguous suffix cross-checked against `D` (no anticipation).
 
 **Usage:**
 
@@ -1300,8 +1300,9 @@ Returned by `SyntheticControl.fit()`.
 | `pre_periods`, `post_periods` | `list` | Calendar-sorted periods |
 | `v_method`, `standardize` | `str` | Echoed configuration |
 | `v_cv_t0` | `int \| None` | Resolved cv train/validation split index (None unless v_method="cv") |
+| `effect_confidence_set` | `dict \| None` | Test-inversion confidence-set summary (Firpo-Possebom 2018 §4): `{family, parameter, gamma, lower, upper, contiguous, status ("ran"/"empty"/"unbounded"), ...}`; None until `confidence_set()` runs. SEPARATE from the always-NaN analytical `conf_int` (a permutation set at level 1−gamma, possibly a set/unbounded). |
 
-**Methods:** `in_space_placebo()` (opt-in permutation inference; refits one synthetic control per donor), `get_placebo_df()` (per-unit RMSPE-ratio table incl. the treated row), `leave_one_out()` (ADH-2015 §4 donor robustness; drops each reportably-weighted donor (weight > 1e-6) → per-drop ATT/`delta_att` table) + `get_leave_one_out_df()`/`get_leave_one_out_gaps()`, `in_time_placebo()` (ADH-2015 §4 backdating placebo; reassigns the intervention earlier, TRUNCATE windowing, placebo ATT ~0 if no real pre-effect) + `get_in_time_placebo_df()`/`get_in_time_placebo_gaps()`, `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()`, `get_gap_df()`, `get_weights_df()`
+**Methods:** `in_space_placebo()` (opt-in permutation inference; refits one synthetic control per donor), `get_placebo_df()` (per-unit RMSPE-ratio table incl. the treated row), `leave_one_out()` (ADH-2015 §4 donor robustness; drops each reportably-weighted donor (weight > 1e-6) → per-drop ATT/`delta_att` table) + `get_leave_one_out_df()`/`get_leave_one_out_gaps()`, `in_time_placebo()` (ADH-2015 §4 backdating placebo; reassigns the intervention earlier, TRUNCATE windowing, placebo ATT ~0 if no real pre-effect) + `get_in_time_placebo_df()`/`get_in_time_placebo_gaps()`, `test_sharp_null(effect, gamma=0.1)` (Firpo-Possebom 2018 §4: test a sharp null α_1t=f(t) by re-ranking the in-space placebo gaps — `effect` is a scalar or a post-period array; `test_sharp_null(0)` is identically `placebo_p_value`), `confidence_set(family="constant"|"linear", gamma=0.1, bounds=None, n_grid=200)` (invert that test for a confidence set of the effect path — a constant-effect interval / linear-slope set; strict p>gamma membership; exact piecewise-constant breakpoint inversion when bounds=None, else a fixed grid; `conf_int` stays NaN) + `get_confidence_set_df()`, `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()`, `get_gap_df()`, `get_weights_df()`
 
 ### TripleDifferenceResults
 
diff --git a/diff_diff/guides/llms.txt b/diff_diff/guides/llms.txt
index 250068de..50faa40b 100644
--- a/diff_diff/guides/llms.txt
+++ b/diff_diff/guides/llms.txt
@@ -60,7 +60,7 @@ Full practitioner guide: call `diff_diff.get_llm_guide("practitioner")`
 - [TwoStageDiD](https://diff-diff.readthedocs.io/en/stable/api/two_stage.html): Gardner (2022) two-stage estimator with GMM sandwich variance
 - [SpilloverDiD](https://diff-diff.readthedocs.io/en/stable/api/spillover.html): Butts (2021) ring-indicator spillover-aware DiD identifying direct effect on treated + per-ring spillover-on-control; reuses `conley_coords` for ring construction; handles non-staggered and staggered timing; supports `SurveyDesign(weights, strata, psu, fpc)` under `vcov_type="hc1"` with optional `cluster=<col>` for CR1 via Gerber (2026) Binder TSL (Wave E.1) and under `vcov_type="conley"` via a panel-aware stratified-Conley sandwich on per-period PSU totals (Wave E.2 cross-sectional `conley_lag_cutoff=0`) extended in Wave E.2 follow-up to `conley_lag_cutoff > 0` via panel-block composition with within-PSU serial Bartlett HAC (Newey-West 1987 separable form; `lag>0` requires an effective PSU via explicit `survey_design.psu` or injected `cluster=<col>`), both composed with the Wave D Gardner GMM correction; `SurveyDesign.subpopulation()` preserves full-design `n_psu` / `df_survey` via zero-padded scores at the meat-helper boundary (Wave E.3, R `svyrecvar(subset())` form) (replicate weights queued as follow-up)
 - [SyntheticDiD](https://diff-diff.readthedocs.io/en/stable/api/estimators.html): Synthetic DiD combining standard DiD and synthetic control methods for few treated units
-- [SyntheticControl](https://diff-diff.readthedocs.io/en/stable/api/synthetic_control.html): Abadie, Diamond & Hainmueller (2010) classic synthetic control for ONE treated unit — donor-weight counterfactual, predictor-importance V via nested / cv (out-of-sample, ADH 2015; needs predictors spanning both train/val windows, so default single-period lags are rejected) / inverse-variance (1/Var on raw predictors, bypasses standardize) / custom, gap path + pre-RMSPE; no analytical SE (inference fields NaN), significance via in-space placebo permutation inference (`in_space_placebo()`, post/pre RMSPE-ratio, p = rank/(n_placebos+1)); ADH-2015 §4 robustness: `leave_one_out()` donor-robustness + `in_time_placebo()` backdating placebo
+- [SyntheticControl](https://diff-diff.readthedocs.io/en/stable/api/synthetic_control.html): Abadie, Diamond & Hainmueller (2010) classic synthetic control for ONE treated unit — donor-weight counterfactual, predictor-importance V via nested / cv (out-of-sample, ADH 2015; needs predictors spanning both train/val windows, so default single-period lags are rejected) / inverse-variance (1/Var on raw predictors, bypasses standardize) / custom, gap path + pre-RMSPE; no analytical SE (inference fields NaN), significance via in-space placebo permutation inference (`in_space_placebo()`, post/pre RMSPE-ratio, p = rank/(n_placebos+1)); ADH-2015 §4 robustness: `leave_one_out()` donor-robustness + `in_time_placebo()` backdating placebo; confidence sets by test inversion (Firpo-Possebom 2018 §4): `test_sharp_null()` + `confidence_set(family="constant"|"linear")` re-rank the placebo gaps into a confidence set for the effect path (the analytical `conf_int` stays NaN)
 - [TripleDifference](https://diff-diff.readthedocs.io/en/stable/api/triple_diff.html): Triple difference (DDD) estimator for designs requiring two criteria for treatment eligibility
 - [ContinuousDiD](https://diff-diff.readthedocs.io/en/stable/api/continuous_did.html): Callaway, Goodman-Bacon & Sant'Anna (2024) continuous treatment DiD with dose-response curves
 - [HeterogeneousAdoptionDiD](https://diff-diff.readthedocs.io/en/stable/api/had.html): de Chaisemartin, Ciccia, D'Haultfœuille & Knau (2026) for designs where **no unit remains untreated**; local-linear estimator at the dose support boundary returning Weighted Average Slope (WAS) on Design 1' (`d̲=0` / QUG) or `WAS_{d̲}` on Design 1 (`d̲>0`, continuous-near-d̲ or mass-point), with multi-period event-study extension (last-treatment cohort, pointwise CIs). **Panel-only** in this release (repeated cross-sections rejected by the validator). Alias `HAD`.
diff --git a/diff_diff/synthetic_control.py b/diff_diff/synthetic_control.py
index 5a04fa75..7b396047 100644
--- a/diff_diff/synthetic_control.py
+++ b/diff_diff/synthetic_control.py
@@ -1761,6 +1761,22 @@ def _mspe(gap_path: Dict[Any, float], periods: List[Any]) -> float:
     return float(np.mean(g**2))
 
 
+def _floored_pre_mspe(pre_gaps: np.ndarray, scale: float) -> float:
+    """Pre-period MSPE with the scale-aware floor used as the RMSPE-ratio denominator.
+
+    ``= max(mean(pre_gaps**2), 1e-8 * max(scale, 1)**2)``. Factored out of
+    ``_rmspe_ratio`` so the sharp-null test inversion (Firpo & Possebom 2018) can reuse
+    the SAME per-unit floored denominator the in-space placebo used — the floor scale is
+    PER-UNIT (``scale`` = ``max|Z1|`` of that unit's pre-period outcomes), so this is
+    what guarantees ``test_sharp_null(0) == placebo_p_value`` bit-for-bit even when the
+    floor bites (a near-perfect pre-fit). The denominator is GRID-INVARIANT under a sharp
+    null whose ``f`` is zero in the pre-period (the operational constant/linear families).
+    """
+    pre_mspe = float(np.mean(pre_gaps**2)) if pre_gaps.size else float("nan")
+    floor = 1e-8 * max(float(scale), 1.0) ** 2
+    return max(pre_mspe, floor)
+
+
 def _rmspe_ratio(pre_gaps: np.ndarray, post_gaps: np.ndarray, scale: float) -> float:
     """Post/pre RMSPE ratio — the in-space placebo test statistic (ADH 2010 §2.4).
 
@@ -1777,10 +1793,339 @@ def _rmspe_ratio(pre_gaps: np.ndarray, post_gaps: np.ndarray, scale: float) -> f
     ``scale`` is the magnitude of the unit's pre-period outcomes. Mirrors the
     ``_fit_tol`` poor-fit guard in ``fit()``.
     """
-    pre_mspe = float(np.mean(pre_gaps**2)) if pre_gaps.size else float("nan")
     post_mspe = float(np.mean(post_gaps**2)) if post_gaps.size else float("nan")
-    floor = 1e-8 * max(float(scale), 1.0) ** 2
-    return float(np.sqrt(post_mspe / max(pre_mspe, floor)))
+    return float(np.sqrt(post_mspe / _floored_pre_mspe(pre_gaps, scale)))
+
+
+# =============================================================================
+# Sharp-null test inversion (Firpo & Possebom 2018, "Synthetic Control Method:
+# Inference, Sensitivity Analysis and Confidence Sets," J. Causal Inference 6(2)).
+# These helpers RE-RANK the gap paths the in-space placebo already computed (Eqs
+# 12-13, φ=0, v=(1,...,1)); no synthetic-control refits. The benchmark f≡0 case is
+# identically the existing in-space placebo permutation (Eq 5 = Eq 13 at f≡0).
+# =============================================================================
+
+
+def _constant_f_post(c: float, n_post: int) -> np.ndarray:
+    """Constant-in-time post-period effect path f(t)=c (Firpo & Possebom Eq 15)."""
+    return np.full(n_post, float(c), dtype=float)
+
+
+def _linear_f_post(c_tilde: float, n_post: int) -> np.ndarray:
+    """Linear-in-time post-period effect path f(t)=c̃·(t−T0) (Firpo & Possebom Eq 17).
+
+    ``(t−T0)`` is the 1-based post-period index (1 for the first post period), so the
+    path is ``c̃·[1, 2, …, n_post]`` aligned to calendar-sorted ``post_periods``.
+    """
+    return float(c_tilde) * np.arange(1, n_post + 1, dtype=float)
+
+
+def _rmspe_f_ratio(
+    gap_path: Dict[Any, float],
+    post_periods: List[Any],
+    f_post: np.ndarray,
+    pre_denom: float,
+) -> float:
+    """RMSPE^f ratio under a common-effect sharp null (Firpo & Possebom Eq 12).
+
+    Subtracts the post-period effect path ``f_post`` (length ``len(post_periods)``,
+    aligned to calendar-sorted ``post_periods``) from the unit's post gaps, then divides
+    by the GRID-INVARIANT floored pre-period denominator ``pre_denom`` (the pre window is
+    ``f``-free for the operational constant/linear families, Eqs 15/17, so the
+    denominator is unchanged across the inversion grid). Returns
+    ``sqrt(mean_post((g_t − f_t)**2) / pre_denom)``.
+    """
+    post = np.array([gap_path[p] for p in post_periods], dtype=float)
+    resid = post - np.asarray(f_post, dtype=float)
+    post_mspe = float(np.mean(resid**2)) if resid.size else float("nan")
+    return float(np.sqrt(post_mspe / pre_denom))
+
+
+def _sharp_null_pvalue(
+    treated_gap: Dict[Any, float],
+    placebo_gaps: Dict[Any, Dict[Any, float]],
+    post_periods: List[Any],
+    f_post: np.ndarray,
+    pre_denoms: Dict[Any, float],
+    treated_id: Any,
+) -> Tuple[float, float, int]:
+    """Permutation p-value for a common-effect sharp null (Firpo & Possebom Eq 13, φ=0, v=1).
+
+    ``p^f = (1 + #{converged placebos j : RMSPE^f_j ≥ RMSPE^f_1}) / (n_ref + 1)`` — the
+    treated unit is the "+1" and ``n_ref`` is the number of converged placebos (the SAME
+    reference set the in-space placebo built). ``placebo_gaps`` maps each converged donor
+    to its gap path; ``pre_denoms`` maps each unit (treated + those donors) to its floored
+    pre-denominator. Ties counted via ``>=`` so the p-value is conservative, matching
+    ``in_space_placebo``. Returns ``(p, rmspe_f_treated, n_ref)``.
+    """
+    f_post = np.asarray(f_post, dtype=float)
+    r1 = _rmspe_f_ratio(treated_gap, post_periods, f_post, pre_denoms[treated_id])
+    n_ref = 0
+    n_ge = 0
+    for j, g in placebo_gaps.items():
+        rj = _rmspe_f_ratio(g, post_periods, f_post, pre_denoms[j])
+        n_ref += 1
+        if rj >= r1:
+            n_ge += 1
+    p = (1 + n_ge) / (n_ref + 1)
+    return p, r1, n_ref
+
+
+def _invert_sharp_null(
+    treated_gap: Dict[Any, float],
+    placebo_gaps: Dict[Any, Dict[Any, float]],
+    post_periods: List[Any],
+    pre_denoms: Dict[Any, float],
+    treated_id: Any,
+    family: str,
+    gamma: float,
+    *,
+    bounds: Optional[Tuple[float, float]] = None,
+    n_grid: int = 200,
+) -> Dict[str, Any]:
+    """Invert the sharp-null RMSPE^f test over a one-parameter effect family.
+
+    Returns the confidence set ``{ param : p^param > gamma }`` (Firpo & Possebom Eqs
+    14/16/18; STRICT inequality -- ``p == gamma`` is excluded). ``family`` is ``"constant"``
+    (``f(t)=param``, Eq 15) or ``"linear"`` (``f(t)=param*(t-T0)``, Eq 17).
+
+    ``p^param`` is a piecewise-constant step function of the scalar ``param`` (each placebo's
+    indicator flips only at the real roots of a quadratic in ``param``), so with
+    ``bounds=None`` the EXACT set is recovered with NO shape assumption (no "centered
+    interval" / monotonicity): the placebo breakpoints partition the line, ``p`` is evaluated
+    once per induced interval (and the two unbounded tails), and the union of intervals with
+    ``p > gamma`` is the set -- correctly handling accepted tails, disjoint components, and the
+    empty / unbounded cases. With explicit ``bounds`` a fixed ``linspace(*bounds, n_grid)``
+    grid is scanned instead (grid-limited membership). Each ``p`` is a pure O(J) re-ranking.
+
+    Result dict: ``{lower, upper, contiguous, status, n_ref, point_estimate, grid}`` where
+    ``status`` is ``"ran"`` / ``"empty"`` / ``"unbounded"``, ``point_estimate`` is the treated
+    unit's own RMSPE-minimizing ``param`` (the ATT for constant, the no-intercept OLS slope
+    for linear -- reported for reference, NOT assumed to maximize ``p``), and ``grid`` is a
+    list of ``(param, p_value, in_set)`` rows for the returned table.
+    """
+    if family not in ("constant", "linear"):
+        raise ValueError(f"family must be 'constant' or 'linear', got {family!r}")
+    n_post = len(post_periods)
+    # Family shape s_t and per-unit moments of A_u(param) = mean_t((g_ut - param*s_t)**2)
+    # = S2*param**2 - 2*P_u*param + Q_u  (s_t = 1 constant / (t - T0) linear; S2 = mean(s**2)).
+    s = (
+        np.ones(n_post, dtype=float)
+        if family == "constant"
+        else np.arange(1, n_post + 1, dtype=float)
+    )
+    S2 = float(np.mean(s**2)) if n_post else 0.0
+
+    def moments(gap_path: Dict[Any, float], unit: Any) -> Tuple[float, float, float]:
+        g = np.array([gap_path[p] for p in post_periods], dtype=float)
+        return float(np.mean(g * s)), float(np.mean(g**2)), pre_denoms[unit]
+
+    P1, Q1, D1 = moments(treated_gap, treated_id)
+    placebo_mom = [moments(gp, j) for j, gp in placebo_gaps.items()]
+    n_ref = len(placebo_mom)
+    # The treated unit's own RMSPE-minimizing param: reported as the point estimate ONLY
+    # (it is NOT assumed to maximize p -- see the exact-inversion comment below).
+    center = float(P1 / S2) if S2 > 0 else 0.0
+
+    def pval(param: float) -> float:
+        # p^param = (1 + #{placebos j : A_j/D_j >= A_1/D_1}) / (n_ref + 1). Comparing A/D is
+        # monotone-equivalent to comparing the RMSPE ratios sqrt(A/D), so this is identical to
+        # _sharp_null_pvalue / test_sharp_null (the squared form just avoids the sqrt).
+        r1 = ((S2 * param - 2.0 * P1) * param + Q1) / D1
+        n_ge = sum(
+            1 for (Pj, Qj, Dj) in placebo_mom if ((S2 * param - 2.0 * Pj) * param + Qj) / Dj >= r1
+        )
+        return (1 + n_ge) / (n_ref + 1)
+
+    def grid_rows(lo: float, hi: float) -> List[Tuple[float, float, bool]]:
+        if not (np.isfinite(lo) and np.isfinite(hi)) or hi <= lo:
+            return []
+        out: List[Tuple[float, float, bool]] = []
+        for x in np.linspace(lo, hi, max(int(n_grid), 2)):
+            p = pval(float(x))
+            out.append((float(x), float(p), bool(p > gamma)))
+        return out
+
+    # ----- fixed-grid path (user-supplied bounds) -----
+    if bounds is not None:
+        lo_b, hi_b = float(bounds[0]), float(bounds[1])
+        if hi_b <= lo_b:
+            raise ValueError(f"bounds must be (lo, hi) with hi > lo, got {bounds!r}")
+        rows = grid_rows(lo_b, hi_b)
+        accepted = [x for x, _, ok in rows if ok]
+        if not accepted:
+            return {
+                "lower": np.nan,
+                "upper": np.nan,
+                "contiguous": True,
+                "status": "empty",
+                "n_ref": n_ref,
+                "point_estimate": center,
+                "grid": rows,
+            }
+        lower, upper = min(accepted), max(accepted)
+        contiguous = all(ok for x, _, ok in rows if lower <= x <= upper)
+        return {
+            "lower": lower,
+            "upper": upper,
+            "contiguous": contiguous,
+            "status": "ran",
+            "n_ref": n_ref,
+            "point_estimate": center,
+            "grid": rows,
+        }
+
+    # ----- exact piecewise-constant inversion (bounds=None; no shape assumption) -----
+    # p^param is a step function of param: each placebo's indicator flips only at the real
+    # roots of A_j(param)*D1 - A_1(param)*Dj = 0, a quadratic in param. So the EXACT set is
+    # recovered with NO "centered interval" / monotonicity assumption -- collect every placebo
+    # breakpoint, evaluate p once on each open interval they induce (plus the two unbounded
+    # tails), and union the intervals with p > gamma. This correctly handles accepted tails,
+    # disjoint components, and the empty / unbounded cases: a poor-pre-fit treated unit can
+    # have its accepted region in the TAILS, not around the point estimate.
+    if gamma < 1.0 / (n_ref + 1):
+        # p >= 1/(J+1) everywhere (the treated ranks itself), so nothing is ever rejected
+        # (Firpo & Possebom fn 8 -- the discrete granularity); the set is all of R.
+        return {
+            "lower": -np.inf,
+            "upper": np.inf,
+            "contiguous": True,
+            "status": "unbounded",
+            "n_ref": n_ref,
+            "point_estimate": center,
+            "grid": [],
+        }
+
+    def quad_roots(a: float, b: float, c: float) -> List[float]:
+        # Real roots of a*x**2 + b*x + c, degenerate-safe (linear / constant fall through).
+        if abs(a) <= 1e-15:
+            return [] if abs(b) <= 1e-15 else [-c / b]
+        disc = b * b - 4.0 * a * c
+        if disc < 0.0:
+            return []
+        sq = disc**0.5
+        return [(-b - sq) / (2.0 * a), (-b + sq) / (2.0 * a)]
+
+    # g_j(param) = A_j(param)*D1 - A_1(param)*Dj = a*param^2 + b*param + c; the indicator
+    # 1[g_j(param) >= 0] == 1[RMSPE^param_j >= RMSPE^param_1], and breakpoints (where an
+    # indicator flips) are the real roots of each g_j.
+    coeffs = [
+        (S2 * (D1 - Dj), -2.0 * (Pj * D1 - P1 * Dj), Qj * D1 - Q1 * Dj)
+        for (Pj, Qj, Dj) in placebo_mom
+    ]
+    raw_breaks: List[float] = []
+    for a, b, c in coeffs:
+        raw_breaks.extend(quad_roots(a, b, c))
+    breaks: List[float] = []
+    for r in sorted(raw_breaks):
+        # dedup near-equal roots so each interval midpoint stays strictly interior
+        if not breaks or abs(r - breaks[-1]) > 1e-12 + 1e-9 * abs(r):
+            breaks.append(r)
+
+    def pval_break(r: float) -> float:
+        # p AT a breakpoint: strict-`>` membership (Eq 14) combined with the conservative
+        # tie-counting `>=` (Eqs 5/13) means a placebo that EXACTLY ties the treated at a root
+        # is counted, so p can spike above gamma there even when both neighboring open
+        # intervals are rejected (a tangent / co-located root => an isolated accepted point).
+        # A relative tolerance registers the tie robustly in floating point.
+        n_ge = 0
+        for a, b, c in coeffs:
+            gj = (a * r + b) * r + c
+            if gj >= -(1e-9 * (abs(a) * r * r + abs(b) * abs(r) + abs(c)) + 1e-12):
+                n_ge += 1
+        return (1 + n_ge) / (n_ref + 1)
+
+    if not breaks:
+        # p is constant in param (e.g. every placebo shares the treated's moments + denom).
+        if pval(center) > gamma:
+            return {
+                "lower": -np.inf,
+                "upper": np.inf,
+                "contiguous": True,
+                "status": "unbounded",
+                "n_ref": n_ref,
+                "point_estimate": center,
+                "grid": [],
+            }
+        return {
+            "lower": np.nan,
+            "upper": np.nan,
+            "contiguous": True,
+            "status": "empty",
+            "n_ref": n_ref,
+            "point_estimate": center,
+            "grid": [],
+        }
+
+    # Atoms along the line: cell_0, b_0, cell_1, b_1, ..., b_{m-1}, cell_m (2m+1 atoms). p is
+    # constant on each open cell (sampled at an interior point) and is evaluated WITH the tie
+    # tolerance at each breakpoint. The set = union of accepted atoms; consecutive accepted
+    # atoms share a boundary => one connected component, a rejected atom splits components
+    # (so an accepted breakpoint whose neighbor cells are both rejected is a singleton).
+    m = len(breaks)
+    cell_pts = (
+        [breaks[0] - 1.0]
+        + [0.5 * (breaks[k] + breaks[k + 1]) for k in range(m - 1)]
+        + [breaks[-1] + 1.0]
+    )
+    cell_in = [pval(x) > gamma for x in cell_pts]
+    break_in = [pval_break(r) > gamma for r in breaks]
+    acc = [cell_in[i // 2] if i % 2 == 0 else break_in[(i - 1) // 2] for i in range(2 * m + 1)]
+    if not any(acc):
+        return {
+            "lower": np.nan,
+            "upper": np.nan,
+            "contiguous": True,
+            "status": "empty",
+            "n_ref": n_ref,
+            "point_estimate": center,
+            "grid": [],
+        }
+
+    def atom_extent(i: int) -> Tuple[float, float]:
+        if i % 2 == 0:
+            k = i // 2  # open cell k = (left boundary, right boundary)
+            return (-np.inf if k == 0 else breaks[k - 1], np.inf if k == m else breaks[k])
+        b = breaks[(i - 1) // 2]  # a breakpoint singleton
+        return (b, b)
+
+    components: List[Tuple[float, float]] = []
+    i = 0
+    while i < len(acc):
+        if not acc[i]:
+            i += 1
+            continue
+        j = i
+        while j + 1 < len(acc) and acc[j + 1]:
+            j += 1
+        components.append((atom_extent(i)[0], atom_extent(j)[1]))
+        i = j + 1
+
+    lower = min(comp[0] for comp in components)
+    upper = max(comp[1] for comp in components)
+    contiguous = len(components) == 1
+    status = "ran" if (np.isfinite(lower) and np.isfinite(upper)) else "unbounded"
+    # Display grid: the finite hull, else a padded breakpoint range (for inspection only).
+    if np.isfinite(lower) and np.isfinite(upper):
+        if upper <= lower:
+            # A pure singleton set (lower == upper, a lone accepted breakpoint): a zero-width
+            # range yields an empty grid_rows(), so emit the single accepted point explicitly
+            # (with its tie-spike p) — the inspection table must reflect the non-empty set.
+            grid = [(float(lower), float(pval_break(lower)), True)]
+        else:
+            grid = grid_rows(lower, upper)
+    else:
+        pad = max(1.0, breaks[-1] - breaks[0])
+        grid = grid_rows(breaks[0] - pad, breaks[-1] + pad)
+    return {
+        "lower": float(lower),
+        "upper": float(upper),
+        "contiguous": bool(contiguous),
+        "status": status,
+        "n_ref": n_ref,
+        "point_estimate": center,
+        "grid": grid,
+    }
 
 
 def _placebo_fit_unit(
diff --git a/diff_diff/synthetic_control_results.py b/diff_diff/synthetic_control_results.py
index 60ef44ad..55866f3d 100644
--- a/diff_diff/synthetic_control_results.py
+++ b/diff_diff/synthetic_control_results.py
@@ -200,6 +200,16 @@ class SyntheticControlResults:
     rmspe_ratio: float = np.nan
     n_placebos: int = 0
     n_failed: int = 0
+    # Confidence set for the treatment-effect path by test inversion (Firpo & Possebom
+    # 2018, "Synthetic Control Method: Inference, Sensitivity Analysis and Confidence
+    # Sets," J. Causal Inference 6(2), §4), populated by ``confidence_set()``. A small
+    # summary dict ``{family, parameter, gamma, lower, upper, contiguous, boundary,
+    # point_estimate, n_grid, n_placebos, status}``; None until ``confidence_set()`` runs.
+    # DELIBERATELY SEPARATE from the always-NaN analytical ``conf_int`` (the Wald interval
+    # classic SCM does not have): this is a PERMUTATION set at level ``1-gamma`` (with
+    # ``gamma`` granular in ``1/(J+1)``), and may be a set / unbounded / non-contiguous —
+    # mirrors how ``placebo_p_value`` is kept distinct from the (NaN) ``p_value``.
+    effect_confidence_set: Optional[Dict[str, Any]] = None
 
     def __post_init__(self) -> None:
         # Internal state set per instance by ``fit()`` / ``in_space_placebo()``.
@@ -223,6 +233,13 @@ def __post_init__(self) -> None:
         # None (not run), "ran", "treated_fit_nonconverged", "too_few_donors",
         # "all_placebos_failed". A small string, so it survives pickling.
         self._placebo_status: Optional[str] = None
+        # Per-unit floored pre-period denominators (treated + each converged placebo),
+        # captured by in_space_placebo() so the sharp-null test inversion
+        # (test_sharp_null / confidence_set, Firpo & Possebom 2018) re-ranks against the
+        # SAME denominators the placebo run used (the test_sharp_null(0) == placebo_p_value
+        # anchor). Each value uses that unit's OWN pre-outcome scale; the pre window is
+        # f-free so the denominator is grid-invariant. Small dict → survives pickling.
+        self._placebo_pre_denoms: Optional[Dict[Any, float]] = None
 
         # --- ADH 2015 §4 robustness diagnostics (opt-in, populated by ---
         # --- leave_one_out() / in_time_placebo()). Same panel-vs-scalar split as ---
@@ -256,6 +273,11 @@ def __post_init__(self) -> None:
         # periods, all predictors dropped, or a zero-mass surviving custom_v). Surfaced
         # alongside _in_time_n_failed so a mixed no-success run reports an accurate mix.
         self._in_time_n_infeasible: int = 0
+        # Firpo & Possebom (2018) §4 test-inversion confidence set (opt-in, populated by
+        # confidence_set()). The grid table {param, p_value, in_set} is small / NOT
+        # panel-derived, so it survives pickling by default (NOT nulled by __getstate__);
+        # the public ``effect_confidence_set`` summary dataclass field likewise survives.
+        self._confidence_set_df: Optional[pd.DataFrame] = None
 
     def __getstate__(self) -> Dict[str, Any]:
         """Exclude panel-derived internal state from pickling.
@@ -379,6 +401,45 @@ def summary(self, alpha: Optional[float] = None) -> str:
                 "",
             ]
         )
+        # Test-inversion confidence set (Firpo & Possebom 2018, §4), if computed. Like the
+        # placebo p-value this is permutation-based; the analytical conf_int stays n/a.
+        ecs = self.effect_confidence_set
+        if ecs is not None:
+            fam = ecs["family"]
+            param = ecs["parameter"]
+            conf_pct = 100.0 * (1.0 - ecs["gamma"])
+            lines.append(
+                f"Confidence set by test inversion (Firpo-Possebom 2018; {fam} effect "
+                f"f(t), parameter {param}):"
+            )
+            if ecs["status"] == "ran":
+                note = "" if ecs["contiguous"] else "  (non-contiguous; [lower, upper] hull)"
+                lines.append(
+                    f"  {conf_pct:.1f}% set:".ljust(34)
+                    + f"[{ecs['lower']:.4f}, {ecs['upper']:.4f}]{note}"
+                )
+            elif ecs["status"] == "unbounded":
+                tail = (
+                    " and NON-CONTIGUOUS (hull shown; see get_confidence_set_df())"
+                    if not ecs["contiguous"]
+                    else ""
+                )
+                lines.append(
+                    "  Unbounded (gamma below the 1/(J+1) granularity, or the treated "
+                    f"unit is not the best pre-fit){tail}."
+                )
+            else:  # "empty"
+                lines.append(
+                    "  Empty: every effect in this family is rejected at "
+                    f"gamma={ecs['gamma']:.3g}."
+                )
+            lines.extend(
+                [
+                    "(Permutation-based; the analytical conf_int above stays n/a.)",
+                    "-" * 75,
+                    "",
+                ]
+            )
         # Three states: (1) placebo never run -> point to in_space_placebo();
         # (2) run with a valid reference set -> show the permutation p-value;
         # (3) run but infeasible (no placebo entered the rank, e.g. J<2 or all
@@ -483,6 +544,17 @@ def to_dict(self) -> Dict[str, Any]:
             "n_placebos": self.n_placebos,
             "n_failed": self.n_failed,
         }
+        # Test-inversion confidence set (Firpo & Possebom 2018), flattened to scalars so
+        # to_dataframe() stays a single row of scalars; all None until confidence_set()
+        # runs. The analytical conf_int_lower/upper above stay NaN (no Wald interval).
+        ecs = self.effect_confidence_set
+        result["effect_ci_family"] = ecs["family"] if ecs else None
+        result["effect_ci_parameter"] = ecs["parameter"] if ecs else None
+        result["effect_ci_gamma"] = ecs["gamma"] if ecs else None
+        result["effect_ci_lower"] = ecs["lower"] if ecs else None
+        result["effect_ci_upper"] = ecs["upper"] if ecs else None
+        result["effect_ci_contiguous"] = ecs["contiguous"] if ecs else None
+        result["effect_ci_status"] = ecs["status"] if ecs else None
         if self.survey_metadata is not None:
             sm = self.survey_metadata
             result["weight_type"] = sm.weight_type
@@ -646,9 +718,16 @@ def in_space_placebo(
                 "older estimator version. Re-fit to enable in-space placebo "
                 "inference."
             )
-        from diff_diff.synthetic_control import _mspe, _placebo_fit_unit
+        from diff_diff.synthetic_control import _floored_pre_mspe, _mspe, _placebo_fit_unit
 
         snap = self._fit_snapshot
+        # A rebuilt placebo reference set invalidates any previously computed confidence set
+        # (test_sharp_null / confidence_set re-rank against THIS reference set), so drop the
+        # cached confidence-set outputs up front — a stale set must never be reported after an
+        # explicit in_space_placebo() re-run (e.g. with a different n_starts). The snapshot
+        # check above has already passed, so the reference IS about to be rebuilt on every exit.
+        self.effect_confidence_set = None
+        self._confidence_set_df = None
         donors = list(snap.donor_ids)
         n_donors = len(donors)
         if n_starts is None:
@@ -697,6 +776,7 @@ def in_space_placebo(
             self.n_placebos = 0
             self.n_failed = 0
             self._placebo_gaps = {}
+            self._placebo_pre_denoms = {}
             self._placebo_status = "treated_fit_nonconverged"
             self._placebo_df = pd.DataFrame(rows, columns=self._PLACEBO_COLS)
             return self._placebo_df.copy()
@@ -713,6 +793,7 @@ def in_space_placebo(
             self.n_placebos = 0
             self.n_failed = 0
             self._placebo_gaps = {}
+            self._placebo_pre_denoms = {}
             self._placebo_status = "too_few_donors"
             self._placebo_df = pd.DataFrame(rows, columns=self._PLACEBO_COLS)
             return self._placebo_df.copy()
@@ -803,6 +884,20 @@ def in_space_placebo(
                 stacklevel=2,
             )
 
+        # Persist each unit's floored pre-period denominator (treated + every converged
+        # placebo) so the sharp-null test inversion (test_sharp_null / confidence_set,
+        # Firpo & Possebom 2018) re-ranks against the SAME denominators this run used —
+        # the test_sharp_null(0) == placebo_p_value anchor. The pre window is f-free so the
+        # denominator is grid-invariant; each unit's floor uses its OWN pre-outcome scale.
+        outcome_pivot = snap.pivots[snap.outcome]
+        pre_denoms: Dict[Any, float] = {}
+        for unit, gp in [(snap.treated_id, self.gap_path), *placebo_gaps.items()]:
+            pre_gaps_u = np.array([gp[p] for p in snap.pre_periods], dtype=float)
+            z1_u = outcome_pivot.loc[snap.pre_periods, unit].to_numpy(dtype=float)
+            scale_u = float(np.max(np.abs(z1_u))) if z1_u.size else 0.0
+            pre_denoms[unit] = _floored_pre_mspe(pre_gaps_u, scale_u)
+        self._placebo_pre_denoms = pre_denoms
+
         self.placebo_p_value = float(p_value)
         self.n_placebos = int(n_placebos)
         self.n_failed = int(n_failed)
@@ -1401,3 +1496,328 @@ def get_in_time_placebo_gaps(self) -> pd.DataFrame:
                         }
                     )
         return pd.DataFrame(rows, columns=["placebo_period", "period", "gap", "phase"])
+
+    # =====================================================================
+    # Confidence sets by test inversion (Firpo & Possebom 2018, §4)
+    # =====================================================================
+
+    def _require_placebo_reference(self, n_starts: Optional[int]) -> None:
+        """Ensure an in-space placebo reference set is available for test inversion.
+
+        Lazily runs :meth:`in_space_placebo` when no reference set has been built yet
+        (raising the same ValueError as that method if the fit snapshot is missing, e.g.
+        on an unpickled result). If a reference set already exists, a non-None ``n_starts``
+        is **ignored with a UserWarning** — the test inversion reuses the single stored set
+        (every sharp null re-ranks the SAME gaps), so honouring ``n_starts`` would mean an
+        expensive O(J) re-fit that the caller did not ask for. Raises ValueError when no
+        valid reference set could be produced (fewer than 2 donors, a non-converged treated
+        fit, or all donor refits failed) — there is then no permutation distribution to
+        invert.
+        """
+        if self._placebo_gaps is None:
+            # Builds the reference set; raises ValueError if the snapshot is unavailable.
+            self.in_space_placebo(n_starts=n_starts)
+        elif n_starts is not None:
+            warnings.warn(
+                "n_starts is ignored: the in-space placebo reference set was already "
+                "computed and is reused (every sharp null / grid value re-ranks the same "
+                "placebo gaps). Re-run in_space_placebo(n_starts=...) explicitly to rebuild "
+                "it with a different multistart count.",
+                UserWarning,
+                stacklevel=3,
+            )
+        if not self._placebo_gaps or self._placebo_status != "ran":
+            reasons = {
+                "treated_fit_nonconverged": (
+                    "the treated unit's own SCM fit did not converge at fit time, so its "
+                    "RMSPE ratio is not a valid optimum to rank against placebos"
+                ),
+                "too_few_donors": (
+                    "fewer than 2 donors are available (each placebo is fit against the "
+                    "other donors)"
+                ),
+                "all_placebos_failed": (
+                    "every donor refit failed to converge, so no placebo entered the "
+                    "reference set"
+                ),
+            }
+            default_reason = "no valid in-space placebo reference set was produced"
+            status = self._placebo_status
+            reason = reasons.get(status, default_reason) if status is not None else default_reason
+            raise ValueError(
+                "Confidence set / sharp-null test requires a valid in-space placebo "
+                f"reference set, but {reason}. (See the in_space_placebo() warning above.)"
+            )
+
+    @staticmethod
+    def _coerce_effect_path(effect: Any, n_post: int) -> np.ndarray:
+        """Coerce ``effect`` to a length-``n_post`` post-period effect path ``f(t)``.
+
+        A scalar broadcasts to a constant path (Eq 11 with ``f(t) = c``); a 1-D array must
+        have one finite value per post period, aligned to the calendar-sorted
+        ``post_periods``. Fails closed on a wrong length or any non-finite value.
+        """
+        arr = np.asarray(effect, dtype=float)
+        if arr.ndim == 0:
+            f = np.full(n_post, float(arr), dtype=float)
+        elif arr.ndim == 1:
+            if arr.shape[0] != n_post:
+                raise ValueError(
+                    f"effect path has length {arr.shape[0]} but there are {n_post} "
+                    "post-treatment periods; pass a scalar (a constant-in-time effect) or "
+                    f"a length-{n_post} array aligned to post_periods (calendar order)."
+                )
+            f = arr
+        else:
+            raise ValueError(
+                "effect must be a scalar (constant effect) or a 1-D array (one value per "
+                f"post period), got a {arr.ndim}-D array."
+            )
+        if not np.all(np.isfinite(f)):
+            raise ValueError("effect contains non-finite (NaN/inf) values.")
+        return f
+
+    def test_sharp_null(
+        self,
+        effect: Any,
+        *,
+        gamma: float = 0.1,
+        n_starts: Optional[int] = None,
+    ) -> pd.Series:
+        """Test a sharp null hypothesis on the treatment-effect path (Firpo & Possebom 2018, §4.1).
+
+        Tests ``H_0^f: α_{1,t} = f(t)`` for every post period (Eq 11) by subtracting the
+        hypothesized effect path ``f(t)`` from the post-period gaps of EVERY unit and
+        re-ranking the treated unit's modified RMSPE ratio against the placebo distribution
+        (Eqs 12–13 at ``φ = 0``, ``v = (1,…,1)`` — the equal-weights benchmark). The
+        synthetic controls are NOT refit: this reuses the gap paths and per-unit
+        denominators :meth:`in_space_placebo` already computed (run lazily here if needed).
+        At ``effect = 0`` the p-value is identically the benchmark ``placebo_p_value``
+        (Eq 5 = Eq 13 with ``f ≡ 0``).
+
+        Parameters
+        ----------
+        effect : float or array-like
+            The hypothesized post-period effect ``f(t)``: a scalar (a constant-in-time
+            effect, Eq 11), or a length-``n_post_periods`` array aligned to ``post_periods``
+            in calendar order (an arbitrary path — e.g. an intervention cost path or a
+            theory-predicted shape).
+        gamma : float, default 0.1
+            Test level; the null is rejected when ``p^f < gamma``. The permutation p-value
+            is granular in ``1/(J+1)`` (Firpo & Possebom fn 8), so not every nominal level
+            is attainable.
+        n_starts : int, optional
+            Multistart count for the lazy :meth:`in_space_placebo` run; ignored (with a
+            warning) if the reference set already exists.
+
+        Returns
+        -------
+        pandas.Series
+            ``p_value`` (``p^f``), ``reject`` (``p^f < gamma``), ``gamma``,
+            ``rmspe_f_treated`` (the treated unit's modified RMSPE ratio), ``n_placebos``
+            (reference-set size), ``n_failed``.
+
+        Raises
+        ------
+        ValueError
+            If ``gamma`` is not in ``(0, 1)``, ``effect`` has the wrong shape / non-finite
+            values, or no valid placebo reference set is available (see
+            :meth:`in_space_placebo`).
+        """
+        from diff_diff.synthetic_control import _sharp_null_pvalue
+
+        if not (0.0 < float(gamma) < 1.0):
+            raise ValueError(f"gamma must be in (0, 1), got {gamma!r}")
+        self._require_placebo_reference(n_starts)
+        post_periods = list(self.post_periods)
+        f_post = self._coerce_effect_path(effect, len(post_periods))
+        assert self._placebo_gaps is not None and self._placebo_pre_denoms is not None
+        p, r1, n_ref = _sharp_null_pvalue(
+            self.gap_path,
+            self._placebo_gaps,
+            post_periods,
+            f_post,
+            self._placebo_pre_denoms,
+            self.treated_unit,
+        )
+        return pd.Series(
+            {
+                "p_value": float(p),
+                "reject": bool(p < float(gamma)),
+                "gamma": float(gamma),
+                "rmspe_f_treated": float(r1),
+                "n_placebos": int(n_ref),
+                "n_failed": int(self.n_failed),
+            }
+        )
+
+    def confidence_set(
+        self,
+        *,
+        family: str = "constant",
+        gamma: float = 0.1,
+        bounds: Optional[Tuple[float, float]] = None,
+        n_grid: int = 200,
+        n_starts: Optional[int] = None,
+    ) -> pd.DataFrame:
+        """Confidence set for the treatment-effect path by test inversion (Firpo & Possebom 2018, §4.2).
+
+        Inverts the sharp-null test (:meth:`test_sharp_null`) over a one-parameter effect
+        family: the confidence set is every parameter value whose sharp null is **not
+        rejected**, ``{ param : p^param > gamma }`` (Eq 14, **strict** inequality). Two
+        families are supported:
+
+        - ``family="constant"`` — ``f(t) = c`` (Eq 15); the set is a confidence **interval**
+          for a constant-in-time effect (Eq 16). The parameter column is ``c``.
+        - ``family="linear"`` — ``f(t) = c̃·(t − T0)`` with the 1-based post-period index
+          ``(t − T0)`` (Eq 17); the set is a confidence **set** over the slope ``c̃``
+          (Eq 18). The parameter column is ``c_tilde``.
+
+        The inversion is a pure re-ranking of the stored placebo gaps (no synthetic-control
+        refits): :meth:`in_space_placebo` is run lazily if needed, then each value only
+        recomputes ``p^param``. With ``bounds=None`` the set is recovered **exactly**:
+        ``p^param`` is piecewise-constant (each placebo's indicator flips only at the real
+        roots of a quadratic in ``param``), so the placebo breakpoints partition the line,
+        ``p`` is evaluated once per induced interval AND at each breakpoint (where a tie
+        under ``≥`` can lift ``p`` above ``gamma``), and the union of accepted
+        intervals/points is the set — with NO centering or monotonicity assumption (accepted
+        tails and disjoint components are handled). With explicit ``bounds`` a fixed
+        ``linspace(*bounds, n_grid)`` grid is scanned instead (grid-limited membership).
+
+        **Boundary convention (paper-sourced, Eq 14):** membership is the *strict* inequality
+        ``p^param > gamma``. The permutation p-value is discrete (a multiple of ``1/(J+1)``),
+        so ``p = gamma`` is reachable and is **excluded** from the set.
+
+        The result is stored on the object: the summary on
+        :attr:`effect_confidence_set` (``{family, parameter, gamma, lower, upper,
+        contiguous, boundary, point_estimate, n_grid, n_placebos, status}``, surviving
+        pickling) and the full grid on :meth:`get_confidence_set_df`. The analytical
+        ``conf_int`` / ``se`` stay NaN — this is a separate permutation object.
+
+        Parameters
+        ----------
+        family : {"constant", "linear"}, default "constant"
+            The one-parameter effect family to invert over.
+        gamma : float, default 0.1
+            Confidence level is ``1 − gamma``; ``p^param > gamma`` defines membership.
+        bounds : (float, float), optional
+            Fixed ``(lo, hi)`` grid for the parameter. Default None uses exact breakpoint
+            inversion (a fixed grid is used only when ``bounds`` is supplied).
+        n_grid : int, default 200
+            Number of grid points evaluated for the returned table (>= 2).
+        n_starts : int, optional
+            Multistart count for the lazy :meth:`in_space_placebo` run; ignored (with a
+            warning) if the reference set already exists.
+
+        Returns
+        -------
+        pandas.DataFrame
+            Columns ``param`` (``c`` for constant, ``c̃`` for linear), ``p_value``
+            (``p^param``), ``in_set`` (``p^param > gamma``). Empty for an ``"empty"`` set;
+            an ``"unbounded"`` exact set with finite breakpoints still returns an inspection
+            grid over a padded breakpoint range (see :attr:`effect_confidence_set`
+            ``status``).
+
+        Raises
+        ------
+        ValueError
+            If ``family`` is unknown, ``gamma`` not in ``(0, 1)``, ``n_grid < 2``, ``bounds``
+            malformed, or no valid placebo reference set is available.
+        """
+        from diff_diff.synthetic_control import _invert_sharp_null
+
+        if family not in ("constant", "linear"):
+            raise ValueError(f"family must be 'constant' or 'linear', got {family!r}")
+        if not (0.0 < float(gamma) < 1.0):
+            raise ValueError(f"gamma must be in (0, 1), got {gamma!r}")
+        if not isinstance(n_grid, (int, np.integer)) or n_grid < 2:
+            raise ValueError(f"n_grid must be an integer >= 2, got {n_grid!r}")
+        if bounds is not None:
+            # Guard the type/length BEFORE indexing so a malformed scalar raises the
+            # documented ValueError (not a bare TypeError from len()/subscription).
+            if (
+                not isinstance(bounds, (tuple, list, np.ndarray))
+                or len(bounds) != 2
+                or not all(isinstance(b, (int, float, np.integer, np.floating)) for b in bounds)
+                or not all(np.isfinite(float(b)) for b in bounds)
+            ):
+                raise ValueError(f"bounds must be a finite (lo, hi) pair, got {bounds!r}")
+            if float(bounds[1]) <= float(bounds[0]):
+                raise ValueError(f"bounds must satisfy hi > lo, got {bounds!r}")
+        self._require_placebo_reference(n_starts)
+        assert self._placebo_gaps is not None and self._placebo_pre_denoms is not None
+        res = _invert_sharp_null(
+            self.gap_path,
+            self._placebo_gaps,
+            list(self.post_periods),
+            self._placebo_pre_denoms,
+            self.treated_unit,
+            family,
+            float(gamma),
+            bounds=(None if bounds is None else (float(bounds[0]), float(bounds[1]))),
+            n_grid=int(n_grid),
+        )
+        status = res["status"]
+        if status == "unbounded":
+            extra = (
+                " The accepted set is ALSO non-contiguous (e.g. two accepted tails with a "
+                "rejected middle, NOT the whole line), so [lower, upper] is only the hull — "
+                "inspect get_confidence_set_df() for the structure."
+                if not res["contiguous"]
+                else ""
+            )
+            warnings.warn(
+                "Confidence set is unbounded: either gamma is below the permutation "
+                "granularity 1/(J+1) (so no effect is ever rejected — Firpo & Possebom "
+                "fn 8), or the treated unit does not have the best pre-treatment fit (so "
+                "the RMSPE ratio does not grow without bound on one side). Reported "
+                "endpoint(s) are +/-inf." + extra,
+                UserWarning,
+                stacklevel=2,
+            )
+        elif status == "empty":
+            warnings.warn(
+                f"Confidence set is empty: every {family} effect in this family is "
+                f"rejected at gamma={gamma:.3g} (the largest attainable p-value does not "
+                "exceed gamma). Endpoints are NaN.",
+                UserWarning,
+                stacklevel=2,
+            )
+        elif not res["contiguous"]:
+            warnings.warn(
+                "Confidence set is non-contiguous (the discrete permutation p-value dips "
+                "below gamma at an interior grid point); [lower, upper] is reported as the "
+                "hull. Inspect get_confidence_set_df() for the full grid.",
+                UserWarning,
+                stacklevel=2,
+            )
+        self.effect_confidence_set = {
+            "family": family,
+            "parameter": "c" if family == "constant" else "c_tilde",
+            "gamma": float(gamma),
+            "lower": float(res["lower"]),
+            "upper": float(res["upper"]),
+            "contiguous": bool(res["contiguous"]),
+            "boundary": "strict",
+            "point_estimate": float(res["point_estimate"]),
+            "n_grid": int(n_grid),
+            "n_placebos": int(res["n_ref"]),
+            "status": status,
+        }
+        self._confidence_set_df = pd.DataFrame(res["grid"], columns=["param", "p_value", "in_set"])
+        return self._confidence_set_df.copy()
+
+    def get_confidence_set_df(self) -> pd.DataFrame:
+        """Get the test-inversion confidence-set grid table (see :meth:`confidence_set`).
+
+        Columns: ``param`` (``c`` constant / ``c̃`` linear), ``p_value`` (``p^param``),
+        ``in_set`` (``p^param > gamma``). Survives pickling. Raises if
+        :meth:`confidence_set` has not been run.
+
+        Returns
+        -------
+        pandas.DataFrame
+        """
+        if self._confidence_set_df is None:
+            raise ValueError("No confidence set yet; call confidence_set() first.")
+        return self._confidence_set_df.copy()
diff --git a/docs/api/synthetic_control.rst b/docs/api/synthetic_control.rst
index b50e840a..3ffd197d 100644
--- a/docs/api/synthetic_control.rst
+++ b/docs/api/synthetic_control.rst
@@ -36,6 +36,21 @@ earlier pre-date and checks for a spurious gap before the true treatment date (t
 backdating placebo; ``placebo_att`` should be ~0). Both re-run the validated solver and
 leave the analytical inference fields NaN.
 
+**Confidence sets by test inversion (Firpo & Possebom 2018 §4, opt-in):**
+:meth:`~diff_diff.SyntheticControlResults.test_sharp_null` tests a sharp null
+``H_0: alpha_1t = f(t)`` (a scalar constant effect, or a post-period effect path) by
+re-ranking the stored in-space placebo gaps — no refits, and ``test_sharp_null(0)`` is
+identically ``placebo_p_value`` — and
+:meth:`~diff_diff.SyntheticControlResults.confidence_set` (``family="constant"`` or
+``"linear"``) inverts that test into a confidence set for the effect path: a
+constant-effect interval (Eqs. 15–16) or a linear-slope set (Eqs. 17–18), with the
+paper's strict ``p > gamma`` membership (Eq. 14), computed by exact piecewise-constant
+breakpoint inversion (or a fixed grid when ``bounds=`` is supplied). The set is
+summarized on ``effect_confidence_set`` and returned by
+:meth:`~diff_diff.SyntheticControlResults.get_confidence_set_df`; the analytical
+``conf_int`` stays NaN (this is a separate permutation set at level ``1 - gamma``,
+possibly a set / unbounded / non-contiguous).
+
 **Distinct from** :class:`~diff_diff.SyntheticDiD` (Arkhangelsky et al. 2021), which adds
 time weights and ridge regularization; classic SCM uses **donor weights only** plus the
 outer ``V`` search.
@@ -88,6 +103,9 @@ Results container for synthetic control estimation.
       ~SyntheticControlResults.in_time_placebo
       ~SyntheticControlResults.get_in_time_placebo_df
       ~SyntheticControlResults.get_in_time_placebo_gaps
+      ~SyntheticControlResults.test_sharp_null
+      ~SyntheticControlResults.confidence_set
+      ~SyntheticControlResults.get_confidence_set_df
       ~SyntheticControlResults.summary
       ~SyntheticControlResults.print_summary
       ~SyntheticControlResults.to_dict
diff --git a/docs/doc-deps.yaml b/docs/doc-deps.yaml
index 5f0857bc..a24cb239 100644
--- a/docs/doc-deps.yaml
+++ b/docs/doc-deps.yaml
@@ -635,6 +635,8 @@ sources:
       - path: docs/methodology/REGISTRY.md
         section: "SyntheticControl"
         type: methodology
+      - path: docs/methodology/papers/firpo-possebom-2018-review.md
+        type: methodology
       - path: docs/api/synthetic_control.rst
         type: api_reference
       - path: diff_diff/guides/llms-full.txt
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
index 8da65b5a..4a537bbf 100644
--- a/docs/methodology/REGISTRY.md
+++ b/docs/methodology/REGISTRY.md
@@ -1996,6 +1996,10 @@ Classic synthetic control (donor/unit weights only) for a single treated unit, d
 - **Leave-one-out donor robustness** (`leave_one_out()`): drops each **reportably-weighted** donor and re-fits the treated unit against the reduced pool, returning a per-drop ATT / `delta_att` table (a `status="baseline"` row first, then one row per dropped donor sorted by `|delta_att|`). A large `delta_att` flags single-donor dependence (a single *dominant* donor is still dropped — the others absorb its mass — and its large `delta_att` is the intended signal, not a failure). The reporting stack's headline donor-sensitivity number is `max_abs_delta_att` = `max |delta_att|` over the drops (baseline-relative, so a uniform shift of every drop away from the full-fit ATT is not masked the way a raw ATT range would be). `get_leave_one_out_gaps()` returns the per-drop trajectories for the overlay plot. Fails closed on a non-converged treated fit or `< 2` donors.
 - **In-time (backdating) placebo** (`in_time_placebo()`): reassigns the intervention to an earlier pre-date `t_f`, re-fits using ONLY pre-`t_f` information (TRUNCATE convention — see Note), and reports the placebo "effect" over the held-out window `[t_f, T0)` — ~0 if there is no real pre-period effect (ADH 2015 Fig. 4, German reunification backdated to 1975). Sweeps every feasible interior pre-date by default (≥2 pre-fake + ≥1 post-fake); an explicit post-period / non-pre date raises, a valid-but-dimensionally-infeasible date yields a `status="infeasible"` row (no raise).
 
+**Confidence sets by test inversion (Firpo & Possebom 2018, §4, opt-in):** A confidence set for the treatment-effect path, built ON TOP of the in-space placebo and surfaced under `estimator_native_diagnostics` (the analytical inference contract is unchanged — see the non-analytical Note below). Under a common-effect sharp null `H_0^f: α_1t = f(t)` (Eq 11) the donor synthetic controls and the pre-period MSPE denominators do not change — only the post-period gaps shift by `f(t)` — so the test is a pure **re-ranking** of the gap paths `in_space_placebo()` already computed (no synthetic-control refits):
+- **`test_sharp_null(effect, gamma=0.1)`** forms the modified RMSPE ratio `RMSPE^f_j = sqrt(mean_post((α̂_jt − f(t))²) / pre_denom_j)` for every unit (Eq 12) and the permutation p-value `p^f = (1 + #{converged placebos j : RMSPE^f_j ≥ RMSPE^f_1}) / (n_ref + 1)` (Eq 13 at `φ=0`, `v=(1,…,1)` — the equal-weights benchmark). `effect` is a scalar (a constant-in-time effect) or a length-`n_post` array (an arbitrary post-period path `f(t)` — e.g. an intervention cost path or a theory-predicted shape). At `f≡0` this is identically the in-space `placebo_p_value` (Eq 5 = Eq 13), held bit-for-bit by reusing each unit's floored pre-period denominator persisted from the placebo run (the pre window is `f`-free, so the denominator is grid-invariant; the floor scale is **per-unit** `max|Z1_j|`).
+- **`confidence_set(family="constant"|"linear", gamma=0.1, bounds=None, n_grid=200)`** inverts that test over a one-parameter family: `"constant"` → `f(t)=c` (Eq 15), a confidence **interval** for a constant effect (Eq 16); `"linear"` → `f(t)=c̃·(t−T0)` with the 1-based post-period index (Eq 17), a confidence **set** over the slope `c̃` (Eq 18). Membership is the paper's **strict** `p^c > γ` (Eq 14 — see the boundary Note). With `bounds=None` the set is recovered **EXACTLY**: `p^c` is a piecewise-constant step function (each placebo's indicator flips only at the real roots of `A_j(c)·D_1 = A_1(c)·D_j`, a quadratic in `c`), so the placebo breakpoints partition the line and `p` is evaluated once per induced open interval AND at each breakpoint — where a tie under `≥` can lift `p` above γ, so an isolated accepted singleton (a tangent / co-located root) is captured. The accepted set is the union of accepted intervals/points, with **no centering or monotonicity assumption** (a poor-pre-fit treated unit can have its accepted region in the tails, not around the point estimate). `bounds=(lo,hi)` instead scans a fixed grid (grid-limited membership). The summary is on `effect_confidence_set` (`status ∈ {"ran","empty","unbounded"}`) and the full grid on `get_confidence_set_df()`. **Fail-closed:** `γ < 1/(J+1)` ⇒ `p^c > γ` for every `c` ⇒ `"unbounded"` (`±inf` endpoints + warning — the discrete-granularity point, fn 8); a treated unit lacking the best pre-fit can give a one-sided unbounded edge; if no interval or breakpoint is accepted the set is `"empty"` (NaN endpoints); a non-contiguous accepted region (disjoint components / an isolated singleton) reports the `[lower, upper]` hull with `contiguous=False` + a warning; and `< 2` donors / a non-converged treated fit / an unpickled result (no placebo reference set) raise `ValueError`. **Scope:** sensitivity weights (`φ≠0`, Eqs 7–9), the general test-statistic menu `θ¹–θ⁵` (Eq 19), one-sided (§7's signed-`t` statistic), and the multiple-outcome / multiple-treated extensions (§6) are **deferred** (flagged in the paper review checklist).
+
 **Notes / deviations:**
 - **Note:** The standardization divisor `divisor = sqrt(apply(cbind(X0,X1), 1, var))` (per-predictor SD over donors+treated, ddof=1) and the inner/outer optimizer are **not specified in ADH 2010** (which defers these numerics to Abadie & Gardeazabal 2003 App. B / the `Synth` software). The divisor is pinned from the R `Synth::synth` source; `solution.v` lives in this scaled predictor space, so the deterministic R-parity test feeds `custom_v` in the same scaled space.
 - **Note:** The outer objective minimizes the pre-period outcome MSPE over **all** pre periods, whereas R `Synth` uses a `time.optimize.ssr` window (1960–1969 in the Basque example). The nested `V` therefore differs from R by an efficiency-only choice (the paper notes inferential validity holds for *any* `V`), so end-to-end nested parity is a tolerance band, not equality.
@@ -2013,6 +2017,10 @@ Classic synthetic control (donor/unit weights only) for a single treated unit, d
 - **Note (placebo failure handling):** a placebo is **excluded from both the numerator and the denominator** of the rank (never penalized into it) and tallied in `n_failed` when its fit is not a valid optimum — EITHER its **inner Frank-Wolfe weight solve** did not converge (a truncated `W` is unusable) OR its **outer `V` search** did not converge (an under-optimized `V` fits the pre-period worse, shrinking the RMSPE ratio and biasing the p-value anti-conservatively, so it must not silently enter the rank). The reported p-value uses the **effective** count `rank / (n_placebos + 1)`, where `n_placebos` is the number of placebos that entered the reference set. Failed donors still appear in `get_placebo_df()` (`status="failed"`, NaN metrics), so once a reference set is produced the table is the full treated + every-donor unit set (`n_donors + 1` rows). In the fail-closed cases the placebo loop does not run and only the treated row is returned: `J < 2` → `placebo_p_value` is NaN with a warning (no placebo distribution; `J == 2` warns the distribution is coarse), and a treated fit whose own **inner OR outer** search did not converge also fails closed (ranking a truncated / under-optimized treated statistic would not be a valid permutation). **Caveat:** each placebo refit inherits the original fit's `optimizer_options` / `n_starts`, so valid inference requires settings adequate for the outer `V` search to converge to a comparable-quality synthetic (production defaults do; cheap settings under-optimize placebo `V` and those placebos are dropped as failed — raise `n_starts` on `in_space_placebo()` or re-fit with a larger `optimizer_options['maxiter']`).
 - **Note (RMSPE-ratio floor):** the reported `rmspe_ratio = sqrt(MSPE_post / MSPE_pre)` floors the pre-period MSPE denominator at a scale-aware `1e-8 · max(|pre-outcomes|, 1)²` (before the square root) so a (near-)perfect pre-fit (`pre-MSPE → 0`) yields a large-but-FINITE ratio rather than `inf`/`nan` (which would corrupt the rank). Ties (`ratio_j ≥ treated_ratio`) are counted, making the p-value conservative. Mirrors the `_fit_tol` poor-fit guard.
 - **Note (placebo p-value is non-analytical):** `placebo_p_value` is deliberately a SEPARATE field from `p_value` (which stays NaN) — it is a permutation p-value with no SE / t-stat, so it does not flow through `safe_inference`. `is_significant` likewise stays bound to the (NaN) `p_value`, NOT `placebo_p_value`; a tool gating on `is_significant` will see `False` even when `placebo_p_value` is small. The reporting stack surfaces the placebo p-value through `estimator_native_diagnostics`, never the analytical headline.
+- **Note (confidence set is non-analytical — `conf_int` stays NaN):** the Firpo & Possebom test-inversion `effect_confidence_set` is a permutation set at level `1−γ`, kept DELIBERATELY SEPARATE from the analytical `conf_int` (which stays `(NaN, NaN)` — classic SCM has no Wald interval). It is parametrized by `γ` (not the estimator's `alpha`, and granular in `1/(J+1)`), can be a *set* rather than an interval (the linear family) and can be unbounded or non-contiguous, so it cannot be coerced into the `(lo, hi)` `conf_int` tuple without breaking the `safe_inference` NaN-consistency contract. `se`/`t_stat`/`p_value`/`conf_int`/`is_significant` all stay at their NaN state; the set is surfaced only via `confidence_set()` / `get_confidence_set_df()` / `effect_confidence_set` (and `estimator_native_diagnostics`) — mirroring how `placebo_p_value` is kept off `p_value`.
+- **Note (test-inversion boundary convention — strict `p^f > γ`):** Firpo & Possebom's inequalities are non-uniform at `p = γ` — the RMSPE-based tests reject at `p < γ` (Eqs 5/9/13), the general-statistic test rejects at `p ≤ γ` (Eq 19), and the confidence set is the **strict** `p^f > γ` (Eq 14), so Eq 14's set is NOT the exact complement of the Eq 13 rejection region (they differ at `p^f = γ`). Because the permutation p-value is discrete (a multiple of `1/(J+1)`), `p = γ` is reachable, so this implementation pins Eq 14's strict `p^f > γ` for set membership (a `p = γ` value is **excluded**) and documents it (matches the `firpo-possebom-2018-review.md` §4.2 boundary note).
+- **Note (test-inversion set construction is an implementation choice):** Firpo & Possebom §4.2 gives the set definitions (Eqs 14/16/18) but does NOT prescribe how to enumerate the set. The default is **exact piecewise-constant breakpoint inversion** — `p^c` is constant between the real roots of the placebo-vs-treated comparison quadratics, so evaluating `p` per induced interval and at each breakpoint recovers the set exactly (no grid resolution / shape assumption); the optional fixed `bounds=` grid is the grid-limited alternative. Either is OUR choice (the paper leaves it unspecified) — a documented deviation. (`n_grid` controls only the returned inspection table, not membership, when `bounds=None`.)
+- **Note (test-inversion validation — no R anchor):** R `Synth` has no test-inversion confidence-set function, and the authors' Code Ocean replication capsule was not consulted (paper-sourced only). The confidence sets are validated by (a) **self-consistency** — `test_sharp_null(0)` equals the in-space `placebo_p_value` exactly (transitively R-anchored via the Basque placebo parity), (b) a **numpy oracle** re-implementing Eqs 12–14 on hand-built gap paths (incl. the strict `p = γ` boundary and the per-unit floor), and (c) a **coverage simulation** (a constant-effect DGP; the `1−γ` set covers the truth at ≈ `1−γ`).
 - **Note (in-time placebo windowing — TRUNCATE):** ADH 2015 §4 says to re-estimate the in-time placebo "with the same predictors lagged accordingly." Because `diff_diff`'s predictor specs reference **absolute** periods, the in-time placebo re-cuts them by TRUNCATION: pre-period-outcome predictors become the pre-`t_f` outcomes, and covariate / special-predictor windows are intersected with the pre-`t_f` window; a window lying ENTIRELY in the held-out region `[t_f, T0)` is **dropped** (surfaced in the `n_dropped_specs` column + an aggregated warning), and `custom_v` is subset in lockstep with the surviving specs. For an outcome-predictor fit (the R-anchorable case) TRUNCATE is identical to ADH's "lag" — both equal a manual `Synth::synth` re-run with `time.optimize.ssr` cut at `t_f`. The held-out window never enters the fit (the placebo's `all_periods` is the pre-fake + post-fake span; the true post-treatment periods are excluded entirely), so there is no "peeking." This concrete convention is NOT spelled out in ADH 2015 (which gives only the qualitative "lag accordingly").
 - **Note (in-time placebo requires ≥2 pre-fake periods):** the in-time placebo treats a date with fewer than 2 pre-fake periods as `status="infeasible"` (the default sweep starts at the 3rd pre-period). This is DELIBERATELY stricter than the base estimator's `T0 ≥ 1` allowance (which permits a single-pre-period fit but warns that nested-`V` selection is unreliable): an auto-swept placebo date with a single pre-fake period is a trivially-matchable, non-credible pre-fit, so it is dropped rather than surfaced as a `ran` placebo (mirrors `SyntheticDiD.in_time_placebo`'s `i ≥ 2` rule). A date whose surviving `custom_v` has zero mass after truncation is likewise infeasible (not a convergence failure).
 - **Note (leave-one-out weight floor):** ADH 2015 §4 leave-one-out omits "each donor that received positive weight." This implementation drops each donor with **reportable** weight — above the `1e-6` interpretability floor (`synthetic_control._MIN_REPORT_WEIGHT`), i.e. exactly the donors in `donor_weights` — rather than every strictly-positive weight. A donor with `0 < w ≤ 1e-6` is numerical dust whose removal moves the ATT by ~its weight (its `delta_att` would be ~0, an uninformative row), and the floor keeps the LOO table aligned with the reported donor support. The drop-set is **frozen at fit time** on the fit snapshot (`weighted_donor_ids`), so `leave_one_out()` is immune to post-fit mutation of the presentation-level `donor_weights` dict.
@@ -2031,6 +2039,7 @@ Classic synthetic control (donor/unit weights only) for a single treated unit, d
 - [x] No analytical SE (NaN inference); in-space placebo permutation inference (`in_space_placebo()`, `rank/(n_placebos+1)`) with the real treated unit excluded from every placebo pool, effective-count denominator, and a scale-aware RMSPE-ratio floor.
 - [x] Leave-one-out donor robustness (`leave_one_out()`, ADH 2015 §4): per-drop ATT / `delta_att` table + overlay gaps; fail-closed.
 - [x] In-time (backdating) placebo (`in_time_placebo()`, ADH 2015 §4): TRUNCATE windowing (drop held-out-window predictors + lockstep `custom_v` subset), feasible-date sweep, fail-closed.
+- [x] Confidence sets by test inversion (`test_sharp_null()` + `confidence_set()`, Firpo & Possebom 2018 §4): sharp-null `RMSPE^f` re-ranking of the in-space placebo gaps (Eqs 12–13) + constant/linear one-parameter sets (Eqs 14/16/18) with the strict `p^f > γ` boundary, EXACT piecewise-constant breakpoint inversion (no shape assumption; isolated/disjoint/unbounded sets handled), and fail-closed unbounded/empty/non-contiguous handling. *Deferred:* sensitivity weights (φ≠0), the general-θ menu (Eq 19), one-sided (§7), multiple-outcome/treated (§6).
 - [ ] *Deferred (ADH 2015):* regression-weight `W^reg` extrapolation diagnostic, sparse-SC subset search (see `TODO.md`).
 - [x] Predictor-leakage, absorbing-suffix/no-anticipation, empty-window, duplicate-label, and inner-non-convergence validation gates.
 
diff --git a/docs/methodology/papers/firpo-possebom-2018-review.md b/docs/methodology/papers/firpo-possebom-2018-review.md
index f2fae1f0..cd34181c 100644
--- a/docs/methodology/papers/firpo-possebom-2018-review.md
+++ b/docs/methodology/papers/firpo-possebom-2018-review.md
@@ -5,13 +5,13 @@
 **PDF reviewed:** https://doi.org/10.1515/jci-2016-0026 (published *Journal of Causal Inference* version, open access; received 15 Nov 2016, revised 6 Aug 2018, accepted 11 Aug 2018, 26 pp). Per the project's PDFs-never-committed convention the local PDF is kept outside the repository; the published J. Causal Inference version (DOI 10.1515/jci-2016-0026) is the authoritative source. All equation, section, and footnote numbers below are pinned to that version.
 **Review date:** 2026-06-01
 
-> Scope note: this paper extends the **permutation / placebo inference** procedure of Abadie, Diamond & Hainmueller (the SCM benchmark) in two ways — (1) a **sensitivity analysis** that parametrically re-weights the placebo p-value away from the equal-weights benchmark, and (2) testing **any sharp null hypothesis** (not only "no effect whatsoever") via a modified RMSPE statistic, which it **inverts to construct confidence sets** for the treatment-effect path. It also generalizes to arbitrary test statistics, multiple outcomes (familywise error control), and multiple treated units (a pooled effect). This review is the **Step-1 fidelity artifact** for a forthcoming SCM **confidence-set / CI-by-test-inversion** implementation (PR-B) layered on the existing `SyntheticControl` estimator; the sensitivity-analysis and multiple-outcome / multiple-treated extensions are documented here but flagged **deferred**. The estimator itself (donor weights `W`, predictor importance `V`) is taken as given from ADH 2010/2015 — already implemented as `SyntheticControl` — and is recapped only as the paper frames it. Nothing here is sourced from outside this paper.
+> Scope note: this paper extends the **permutation / placebo inference** procedure of Abadie, Diamond & Hainmueller (the SCM benchmark) in two ways — (1) a **sensitivity analysis** that parametrically re-weights the placebo p-value away from the equal-weights benchmark, and (2) testing **any sharp null hypothesis** (not only "no effect whatsoever") via a modified RMSPE statistic, which it **inverts to construct confidence sets** for the treatment-effect path. It also generalizes to arbitrary test statistics, multiple outcomes (familywise error control), and multiple treated units (a pooled effect). This review is the **Step-1 fidelity artifact** for the SCM **confidence-set / CI-by-test-inversion** implementation (PR-B, **shipped** — `SyntheticControlResults.test_sharp_null()` / `confidence_set()`) layered on the existing `SyntheticControl` estimator; the sensitivity-analysis and multiple-outcome / multiple-treated extensions are documented here but flagged **deferred**. The estimator itself (donor weights `W`, predictor importance `V`) is taken as given from ADH 2010/2015 — already implemented as `SyntheticControl` — and is recapped only as the paper frames it. Nothing here is sourced from outside this paper.
 
 ---
 
 ## Methodology Registry Entry
 
-*Formatted to match docs/methodology/REGISTRY.md. This documents an **inference procedure on the existing `SyntheticControl` estimator**, not a new estimator — the `## SyntheticControl` heading mirrors `abadie-2021-review.md`. The REGISTRY implementation contract (`docs/methodology/REGISTRY.md` §SyntheticControl) is unchanged by this docs-only PR-A; PR-B will add the confidence-set methodology subsection and flip the relevant checklist items.*
+*Formatted to match docs/methodology/REGISTRY.md. This documents an **inference procedure on the existing `SyntheticControl` estimator**, not a new estimator — the `## SyntheticControl` heading mirrors `abadie-2021-review.md`. The REGISTRY implementation contract (`docs/methodology/REGISTRY.md` §SyntheticControl) was unchanged by the docs-only PR-A; PR-B (shipped) added the confidence-set methodology subsection there and flipped the relevant checklist items below.*
 
 ## SyntheticControl
 
@@ -132,10 +132,10 @@ A re-analysis of ETA terrorism on Basque Country GDP per capita (Abadie & Gardea
 - Built on the authors' `Synth` package (R / MATLAB / Stata) for the underlying SCM fit.
 
 **Requirements checklist** (features this paper adds beyond ADH 2010/2015; **PR-B** = the planned next implementation target, **deferred** = later):
-- [ ] (PR-B) Sharp-null `RMSPE^f` test (Eqs. 12–13) reusing the in-space placebo permutation — subtract the hypothesized `f(t)` from the post-period gaps and re-rank.
-- [ ] (PR-B) Confidence **interval** for a constant-in-time effect (Eqs. 15–16) by test inversion over a `c`-grid.
-- [ ] (PR-B) Confidence **set** for a linear-in-time effect (Eqs. 17–18) by test inversion over a `c̃`-grid.
-- [ ] (PR-B) Benchmark `(φ = 0, v = (1,…,1))` p-value (reuse `in_space_placebo`'s RMSPE-ratio) + a one-sided variant (Section 7).
+- [x] (PR-B) Sharp-null `RMSPE^f` test (Eqs. 12–13) reusing the in-space placebo permutation — subtract the hypothesized `f(t)` from the post-period gaps and re-rank. **Shipped:** `SyntheticControlResults.test_sharp_null(effect, gamma=...)`.
+- [x] (PR-B) Confidence **interval** for a constant-in-time effect (Eqs. 15–16) by test inversion over a `c`-grid. **Shipped:** `confidence_set(family="constant")`.
+- [x] (PR-B) Confidence **set** for a linear-in-time effect (Eqs. 17–18) by test inversion over a `c̃`-grid. **Shipped:** `confidence_set(family="linear")`.
+- [x] (PR-B) Benchmark `(φ = 0, v = (1,…,1))` p-value (reuse `in_space_placebo`'s RMSPE-ratio): shipped — `test_sharp_null(0)` is identically `placebo_p_value`. **One-sided variant (Section 7): still `[ ]` deferred** — §7 uses the signed-`t` statistic `θ³` from the deferred general-`θ` menu (Eq. 19), so it ships with that menu, not here.
 - [ ] (deferred) Sensitivity-analysis parametric weights `π_(j)(φ, v)` (Eqs. 7–9) + worst/best-case `φ̲`/`φ̄` robustness curve (Section 3).
 - [ ] (deferred) General test-statistic menu `θ¹`–`θ⁵` (Eq. 19, Section 5).
 - [ ] (deferred) Multiple-outcome FWER control (Eqs. 23–24) and multiple-treated-unit pooled confidence sets (Eqs. 25–26, Section 6).
diff --git a/tests/test_diagnostic_report.py b/tests/test_diagnostic_report.py
index a0a78fcd..8501c426 100644
--- a/tests/test_diagnostic_report.py
+++ b/tests/test_diagnostic_report.py
@@ -2111,6 +2111,23 @@ def test_scm_native_surfaces_in_time_placebo_after_optin_run(self, scm_fit):
         block = native["in_time_placebo"]
         assert block["status"] == "ran" and block["n_dates"] >= 1
 
+    def test_scm_native_confidence_set_not_run_stub(self, scm_fit):
+        # Firpo-Possebom test-inversion confidence set is opt-in, like the placebos.
+        res, _ = scm_fit
+        native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"]
+        assert native["confidence_set"]["status"] == "not_run"
+
+    def test_scm_native_surfaces_confidence_set_after_optin_run(self, scm_fit):
+        res, _ = scm_fit
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore")
+            res.confidence_set(family="constant", gamma=0.34)  # J=4 -> gamma > 1/(J+1)
+            native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"]
+        block = native["confidence_set"]
+        assert block["status"] in ("ran", "empty", "unbounded")
+        assert block["family"] == "constant" and block["parameter"] == "c"
+        assert block["gamma"] == pytest.approx(0.34)
+
     def test_scm_does_not_call_honest_did(self, scm_fit):
         """HonestDiD sensitivity should NOT run on SCM (fit-based / native path)."""
         res, _ = scm_fit
diff --git a/tests/test_methodology_synthetic_control.py b/tests/test_methodology_synthetic_control.py
index 02eeb494..00c2a9ff 100644
--- a/tests/test_methodology_synthetic_control.py
+++ b/tests/test_methodology_synthetic_control.py
@@ -38,6 +38,15 @@
     SyntheticControlResults,
     synthetic_control,
 )
+from diff_diff.synthetic_control import (
+    _constant_f_post,
+    _floored_pre_mspe,
+    _invert_sharp_null,
+    _linear_f_post,
+    _rmspe_f_ratio,
+    _rmspe_ratio,
+    _sharp_null_pvalue,
+)
 from tests.conftest import assert_nan_inference
 
 DATA_DIR = Path(__file__).parent / "data"
@@ -3013,3 +3022,519 @@ def spy(*a, **k):
             inner_min_decrease=1e-3,
         )
     assert loo_att == pytest.approx(fresh.att, abs=1e-7)
+
+
+# ===========================================================================
+# Confidence sets by test inversion (Firpo & Possebom 2018, Section 4)
+# ===========================================================================
+#
+# Two opt-in SyntheticControlResults methods built ON TOP of the in-space placebo:
+# test_sharp_null(effect) tests H_0^f: alpha_1t = f(t) by re-ranking the stored
+# placebo gaps (Eqs 12-13, phi=0, v=(1,...,1)); confidence_set(family=...) inverts
+# that test over a one-parameter family (Eqs 14/16/18, strict p^f > gamma). No SCM
+# refits. The benchmark f==0 case is identically the existing placebo_p_value
+# (Eq 5 = Eq 13 at f==0). No R anchor (Synth has no test inversion): validated by
+# self-consistency to the placebo p-value, a numpy oracle, and a coverage MC.
+
+
+def _gp(pre_vals, post_vals, pre_periods, post_periods):
+    """Build a {period: gap} path from pre/post value lists."""
+    d = {p: float(v) for p, v in zip(pre_periods, pre_vals)}
+    d.update({p: float(v) for p, v in zip(post_periods, post_vals)})
+    return d
+
+
+# Hand-built scenario for the helper-level oracle tests: the treated unit has the
+# best pre-fit and a constant +2 post effect; 4 placebos have worse pre-fit + scattered
+# post gaps (so for large |c| the treated becomes the most-deviant unit -> bounded set).
+_ORACLE_PRE = [0, 1, 2]
+_ORACLE_POST = [3, 4, 5]
+_ORACLE_TREATED = _gp([0.08, -0.06, 0.05], [2.0, 2.0, 2.0], _ORACLE_PRE, _ORACLE_POST)
+_ORACLE_PLACEBOS = {
+    "a": _gp([0.3, -0.3, 0.3], [0.6, -0.4, 0.2], _ORACLE_PRE, _ORACLE_POST),
+    "b": _gp([-0.3, 0.3, 0.2], [-0.5, 0.4, 0.3], _ORACLE_PRE, _ORACLE_POST),
+    "c": _gp([0.2, 0.3, -0.3], [0.3, 0.5, -0.4], _ORACLE_PRE, _ORACLE_POST),
+    "d": _gp([-0.2, 0.3, 0.3], [0.4, -0.3, 0.5], _ORACLE_PRE, _ORACLE_POST),
+}
+
+
+def _oracle_pre_denoms(scale=1.0):
+    units = {0: _ORACLE_TREATED, **_ORACLE_PLACEBOS}
+    return {
+        u: _floored_pre_mspe(np.array([gp[p] for p in _ORACLE_PRE]), scale)
+        for u, gp in units.items()
+    }
+
+
+def test_sharp_null_pvalue_matches_independent_oracle():
+    pre_denoms = _oracle_pre_denoms()
+    for c in (0.0, 1.0, 2.0, 3.5, -1.0):
+        f = _constant_f_post(c, len(_ORACLE_POST))
+        p, r1, n_ref = _sharp_null_pvalue(
+            _ORACLE_TREATED, _ORACLE_PLACEBOS, _ORACLE_POST, f, pre_denoms, 0
+        )
+        # Independent re-implementation of Eqs 12-13.
+        resid_t = np.array([_ORACLE_TREATED[q] for q in _ORACLE_POST]) - f
+        r1_o = float(np.sqrt(np.mean(resid_t**2) / pre_denoms[0]))
+        rj = []
+        for u, gp in _ORACLE_PLACEBOS.items():
+            resid = np.array([gp[q] for q in _ORACLE_POST]) - f
+            rj.append(float(np.sqrt(np.mean(resid**2) / pre_denoms[u])))
+        p_o = (1 + sum(1 for r in rj if r >= r1_o)) / (len(rj) + 1)
+        assert n_ref == 4
+        assert r1 == pytest.approx(r1_o)
+        assert p == pytest.approx(p_o), (c, p, p_o)
+
+
+def test_sharp_null_zero_equals_rmspe_ratio_helper():
+    # Eq 13 at f==0 reduces to the ADH RMSPE-ratio statistic (Eq 4).
+    scale = 1.0
+    pre_denoms = _oracle_pre_denoms(scale)
+    f0 = _constant_f_post(0.0, len(_ORACLE_POST))
+    _, r1, _ = _sharp_null_pvalue(
+        _ORACLE_TREATED, _ORACLE_PLACEBOS, _ORACLE_POST, f0, pre_denoms, 0
+    )
+    pre = np.array([_ORACLE_TREATED[p] for p in _ORACLE_PRE])
+    post = np.array([_ORACLE_TREATED[p] for p in _ORACLE_POST])
+    assert r1 == pytest.approx(_rmspe_ratio(pre, post, scale))
+
+
+def test_floored_pre_denominator_is_per_unit_not_global():
+    # M1: the RMSPE floor scale is PER-UNIT max|Z1|. For a near-perfect pre-fit (the
+    # floor bites), a wrong GLOBAL scale would change the denominator and break the
+    # f==0 == placebo_p_value anchor. Assert the per-unit denom == the _rmspe_ratio
+    # denom and differs from a global-scale denom.
+    pre = np.array([1e-9, -1e-9, 5e-10])  # near-perfect pre-fit -> the floor dominates
+    post = np.array([2.0, 2.0, 2.0])
+    scale_unit = 10.0
+    denom_unit = _floored_pre_mspe(pre, scale_unit)  # 1e-8 * 10**2 = 1e-6
+    denom_global = _floored_pre_mspe(pre, 1.0)  # 1e-8
+    assert denom_unit == pytest.approx(1e-6)
+    assert not np.isclose(denom_unit, denom_global)
+    gp = _gp(list(pre), list(post), _ORACLE_PRE, _ORACLE_POST)
+    f0 = _constant_f_post(0.0, 3)
+    assert _rmspe_f_ratio(gp, _ORACLE_POST, f0, denom_unit) == pytest.approx(
+        _rmspe_ratio(pre, post, scale_unit)
+    )
+    # The wrong global denom yields a materially different statistic.
+    assert not np.isclose(
+        _rmspe_f_ratio(gp, _ORACLE_POST, f0, denom_unit),
+        _rmspe_f_ratio(gp, _ORACLE_POST, f0, denom_global),
+    )
+
+
+def test_invert_constant_set_brackets_true_effect():
+    pre_denoms = _oracle_pre_denoms()
+    res = _invert_sharp_null(
+        _ORACLE_TREATED, _ORACLE_PLACEBOS, _ORACLE_POST, pre_denoms, 0, "constant", 0.25, n_grid=120
+    )
+    assert res["status"] == "ran"
+    assert res["point_estimate"] == pytest.approx(2.0)  # center = att = mean post gap
+    assert res["lower"] <= 2.0 <= res["upper"]
+    assert res["contiguous"]
+
+
+def test_invert_strict_boundary_excludes_p_equals_gamma():
+    # Eq 14 membership is STRICT (p^f > gamma). Use fixed wide bounds so the grid spans
+    # rejected points too; with 4 placebos gamma=0.4 (=2/5) is attainable, so some grid
+    # points have p == gamma and MUST be excluded.
+    pre_denoms = _oracle_pre_denoms()
+    res = _invert_sharp_null(
+        _ORACLE_TREATED,
+        _ORACLE_PLACEBOS,
+        _ORACLE_POST,
+        pre_denoms,
+        0,
+        "constant",
+        0.4,
+        bounds=(-6.0, 10.0),
+        n_grid=401,
+    )
+    grid = res["grid"]
+    assert all(in_set == (p > 0.4) for _, p, in_set in grid)  # strict operator
+    at_gamma = [in_set for _, p, in_set in grid if np.isclose(p, 0.4)]
+    assert at_gamma, "expected the grid to include a p == gamma point"
+    assert not any(at_gamma)  # every p == gamma point excluded (strict)
+
+
+def test_invert_unbounded_when_gamma_below_granularity():
+    # p^f >= 1/(J+1) always (the treated ranks itself); gamma below that -> nothing is
+    # rejected -> the set is all of R (Firpo & Possebom fn 8).
+    pre_denoms = _oracle_pre_denoms()
+    res = _invert_sharp_null(
+        _ORACLE_TREATED, _ORACLE_PLACEBOS, _ORACLE_POST, pre_denoms, 0, "constant", 0.1, n_grid=10
+    )
+    assert res["status"] == "unbounded"
+    assert res["lower"] == -np.inf and res["upper"] == np.inf
+
+
+def test_invert_empty_set_when_family_cannot_fit():
+    # A constant treated effect cannot be matched by the LINEAR family; with a near-
+    # perfect pre-fit (tiny denom) the treated stays the most-deviant unit at every
+    # slope, so p == 1/(J+1) everywhere -> rejected at gamma > 1/(J+1) -> empty.
+    treated = _gp([1e-9, -1e-9, 1e-9], [2.0, 2.0, 2.0], _ORACLE_PRE, _ORACLE_POST)
+    units = {0: treated, **_ORACLE_PLACEBOS}
+    pre_denoms = {
+        u: _floored_pre_mspe(np.array([gp[p] for p in _ORACLE_PRE]), 1.0) for u, gp in units.items()
+    }
+    res = _invert_sharp_null(
+        treated, _ORACLE_PLACEBOS, _ORACLE_POST, pre_denoms, 0, "linear", 0.25, n_grid=60
+    )
+    assert res["status"] == "empty"
+    assert np.isnan(res["lower"]) and np.isnan(res["upper"])
+
+
+def test_invert_accepts_tails_when_center_is_rejected():
+    # Regression for the centered-bracket bug (reviewer M1): when the treated unit has a
+    # WORSE pre-fit than the placebos, its accepted region is in the TAILS, not around the
+    # point estimate, and the central band is REJECTED. The exact breakpoint inversion must
+    # return the unbounded, non-contiguous (two-tail) set — NOT "empty" (the old bug).
+    # Treated post gaps [0, 2] with pre-MSPE 100 (poor fit); 4 placebos post [1, 1] pre-MSPE 1.
+    pre, post = [0, 1], [2, 3]
+    treated = {0: 10.0, 1: -10.0, 2: 0.0, 3: 2.0}  # pre-MSPE=100 -> D=100; post gaps [0, 2]
+    placebos = {
+        f"p{k}": {0: 1.0, 1: -1.0, 2: 1.0, 3: 1.0} for k in range(4)
+    }  # pre-MSPE=1; post [1,1]
+    pden = {
+        u: _floored_pre_mspe(np.array([gp[p] for p in pre]), 1.0)
+        for u, gp in {0: treated, **placebos}.items()
+    }
+    assert pden[0] == pytest.approx(100.0) and pden["p0"] == pytest.approx(1.0)
+    res = _invert_sharp_null(treated, placebos, post, pden, 0, "constant", 0.25, n_grid=50)
+    assert res["status"] == "unbounded"  # was wrongly "empty" under the centered bracket
+    assert res["lower"] == -np.inf and res["upper"] == np.inf
+    assert res["contiguous"] is False  # central rejected band -> two disjoint accepted tails
+    # The point estimate (att = 1) is itself rejected; far-out values are accepted.
+    p_center, _, _ = _sharp_null_pvalue(treated, placebos, post, _constant_f_post(1.0, 2), pden, 0)
+    p_tail, _, _ = _sharp_null_pvalue(treated, placebos, post, _constant_f_post(50.0, 2), pden, 0)
+    assert p_center <= 0.25 < p_tail
+    # Under a user-bounded grid the same scenario is grid-limited "ran" but still flagged
+    # non-contiguous (accepted at both edges, rejected through the middle).
+    res_b = _invert_sharp_null(
+        treated, placebos, post, pden, 0, "constant", 0.25, bounds=(-50.0, 50.0), n_grid=201
+    )
+    assert res_b["status"] == "ran" and res_b["contiguous"] is False
+
+
+def test_invert_includes_accepted_breakpoint_singleton():
+    # Reviewer round-2 (M1/DT1): strict p>gamma membership + tie-counting >= means a placebo
+    # that EXACTLY ties the treated at a (tangent) breakpoint can push p above gamma THERE
+    # while both neighbouring open intervals are rejected -> an isolated accepted singleton
+    # the exact inversion must include (NOT report "empty"). Construct a placebo whose RMSPE
+    # ratio touches the treated's only at c=0; 3 others stay strictly below.
+    pre, post = [0, 1], [2, 3]
+    treated = {0: 1.0, 1: -1.0, 2: 1.0, 3: -1.0}  # post [1,-1], pre-MSPE 1 -> D=1
+    placebos = {"tie": {0: 2.0, 1: -2.0, 2: 2.0, 3: -2.0}}  # post [2,-2], pre-MSPE 4: ties at c=0
+    for k in range(3):
+        placebos[f"lo{k}"] = {0: 2.0, 1: -2.0, 2: 0.5, 3: -0.5}  # always strictly below treated
+    pden = {
+        u: _floored_pre_mspe(np.array([gp[p] for p in pre]), 1.0)
+        for u, gp in {0: treated, **placebos}.items()
+    }
+    # At c=0 the tie places p at 2/5; just off c=0 only the treated ranks (p=1/5).
+    p_at0, _, _ = _sharp_null_pvalue(treated, placebos, post, _constant_f_post(0.0, 2), pden, 0)
+    p_near, _, _ = _sharp_null_pvalue(treated, placebos, post, _constant_f_post(0.5, 2), pden, 0)
+    assert p_at0 == pytest.approx(0.4) and p_near == pytest.approx(0.2)
+    res = _invert_sharp_null(treated, placebos, post, pden, 0, "constant", 0.25, n_grid=20)
+    assert res["status"] == "ran"  # NOT "empty" -- the singleton is included
+    assert res["lower"] == pytest.approx(0.0) and res["upper"] == pytest.approx(0.0)
+    # the returned inspection grid reflects the non-empty singleton (one accepted row, not [])
+    assert len(res["grid"]) == 1
+    g_param, g_p, g_in = res["grid"][0]
+    assert g_param == pytest.approx(0.0) and g_p > 0.25 and g_in is True
+
+
+def test_linear_f_post_is_one_based_and_czero_equals_constant_zero():
+    assert np.allclose(_linear_f_post(1.0, 4), [1.0, 2.0, 3.0, 4.0])  # (t - T0), 1-based
+    pre_denoms = _oracle_pre_denoms()
+    p_lin0, _, _ = _sharp_null_pvalue(
+        _ORACLE_TREATED, _ORACLE_PLACEBOS, _ORACLE_POST, _linear_f_post(0.0, 3), pre_denoms, 0
+    )
+    p_con0, _, _ = _sharp_null_pvalue(
+        _ORACLE_TREATED, _ORACLE_PLACEBOS, _ORACLE_POST, _constant_f_post(0.0, 3), pre_denoms, 0
+    )
+    assert p_lin0 == pytest.approx(p_con0)
+
+
+def test_invert_monotone_in_gamma():
+    # A larger gamma rejects more -> a narrower (or equal) confidence set.
+    pre_denoms = _oracle_pre_denoms()
+
+    def width(g):
+        r = _invert_sharp_null(
+            _ORACLE_TREATED,
+            _ORACLE_PLACEBOS,
+            _ORACLE_POST,
+            pre_denoms,
+            0,
+            "constant",
+            g,
+            n_grid=120,
+        )
+        return (r["upper"] - r["lower"]) if r["status"] == "ran" else None
+
+    w_lo, w_hi = width(0.25), width(0.45)
+    assert w_lo is not None and w_hi is not None
+    assert w_lo >= w_hi - 1e-9
+
+
+# --- end-to-end (real fit): custom V skips the outer search for speed/determinism ---
+
+
+def _fit_with_placebos(n_donors=6, T=10, T0=6, effect=3.0, seed=0, run_placebo=True):
+    df, years, t0 = _make_panel(n_donors=n_donors, T=T, T0=T0, effect=effect, seed=seed)
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        res = SyntheticControl(v_method="custom", custom_v=np.ones(T0), seed=seed).fit(
+            df, "y", "treated", "unit", "year", post_periods=years[T0:], treated_unit="treated"
+        )
+        if run_placebo:
+            res.in_space_placebo()
+    return res, years, t0
+
+
+def _exact_combo_fit(effect=3.0, T=10, T0=6, n_donors=5):
+    """Deterministic panel where the treated is an EXACT convex combo of two donors.
+
+    The donors carry distinct sinusoidal idiosyncrasies, so no single donor is a convex
+    combination of the others (placebos fit poorly -> larger pre-denominators), while the
+    treated reproduces 0.5*d0 + 0.5*d1 (near-perfect pre-fit -> the smallest denominator).
+    The treated is therefore uniquely the most-deviant unit in the tails, so the constant
+    confidence set is BOUNDED ("ran") -- the end-to-end analogue of the helper oracle.
+    """
+    years = list(range(2000, 2000 + T))
+    t = np.arange(T, dtype=float)
+    donors = {
+        j: 10.0 + 2.0 * j + (0.3 + 0.1 * j) * t + (0.5 + 0.3 * j) * np.sin(t)
+        for j in range(n_donors)
+    }
+    treated = (0.5 * donors[0] + 0.5 * donors[1]).copy()
+    treated[T0:] += effect
+    rows = []
+    for j in range(n_donors):
+        for i in range(T):
+            rows.append({"unit": f"d{j}", "year": years[i], "y": float(donors[j][i]), "treated": 0})
+    for i in range(T):
+        rows.append(
+            {"unit": "treated", "year": years[i], "y": float(treated[i]), "treated": int(i >= T0)}
+        )
+    df = pd.DataFrame(rows)
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        res = SyntheticControl(v_method="custom", custom_v=np.ones(T0), seed=0).fit(
+            df, "y", "treated", "unit", "year", post_periods=years[T0:], treated_unit="treated"
+        )
+        res.in_space_placebo()
+    return res
+
+
+def test_test_sharp_null_zero_equals_placebo_p_value_end_to_end():
+    res, _, _ = _fit_with_placebos()
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        s0 = res.test_sharp_null(0.0)
+    assert s0["p_value"] == pytest.approx(res.placebo_p_value)
+    assert s0["rmspe_f_treated"] == pytest.approx(res.rmspe_ratio)
+    assert s0["n_placebos"] == res.n_placebos
+
+
+def test_confidence_set_constant_contains_att_and_excludes_zero():
+    res = _exact_combo_fit(effect=3.0)
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        grid = res.confidence_set(family="constant", gamma=0.25)
+    ecs = res.effect_confidence_set
+    assert ecs["status"] == "ran"
+    assert ecs["lower"] <= res.att <= ecs["upper"]  # point estimate inside the set
+    assert not (ecs["lower"] <= 0.0 <= ecs["upper"])  # a real +3 effect -> 0 excluded
+    assert list(grid.columns) == ["param", "p_value", "in_set"]
+    assert ecs["boundary"] == "strict" and ecs["parameter"] == "c"
+
+
+def test_test_sharp_null_array_path_matches_scalar_and_validates():
+    res, _, _ = _fit_with_placebos()
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        s_scalar = res.test_sharp_null(3.0)
+        s_array = res.test_sharp_null(np.array([3.0, 3.0, 3.0, 3.0]))
+    assert s_array["p_value"] == pytest.approx(s_scalar["p_value"])
+    for bad in (
+        np.array([1.0, 2.0]),  # wrong length
+        np.array([[1.0, 2.0, 3.0, 4.0]]),  # 2-D
+        np.array([np.nan, 1.0, 1.0, 1.0]),  # non-finite
+    ):
+        with pytest.raises(ValueError):
+            res.test_sharp_null(bad)
+
+
+def test_confidence_set_conf_int_stays_nan():
+    res, _, _ = _fit_with_placebos()
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        res.confidence_set(family="constant", gamma=0.25)
+    # The analytical fields stay NaN: this is a SEPARATE permutation object.
+    assert np.isnan(res.se) and np.isnan(res.p_value) and np.isnan(res.t_stat)
+    assert np.isnan(res.conf_int[0]) and np.isnan(res.conf_int[1])
+    assert res.effect_confidence_set is not None
+
+
+def test_confidence_set_to_dict_flattened_and_summary_renders():
+    res = _exact_combo_fit(effect=3.0)
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        res.confidence_set(family="constant", gamma=0.25)
+    d = res.to_dict()
+    assert d["effect_ci_status"] == "ran"
+    assert d["effect_ci_family"] == "constant" and d["effect_ci_parameter"] == "c"
+    assert np.isnan(d["conf_int_lower"])  # analytical interval still NaN
+    assert np.isfinite(d["effect_ci_lower"]) and np.isfinite(d["effect_ci_upper"])
+    row = res.to_dataframe()  # stays a single row of scalars
+    assert len(row) == 1 and "effect_ci_lower" in row.columns
+    assert "Confidence set by test inversion" in res.summary()
+
+
+def test_confidence_set_linear_runs_and_sets_field():
+    res, _, _ = _fit_with_placebos()
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        grid = res.confidence_set(family="linear", gamma=0.25)
+    ecs = res.effect_confidence_set
+    assert ecs["family"] == "linear" and ecs["parameter"] == "c_tilde"
+    assert ecs["status"] in ("ran", "empty", "unbounded")
+    assert list(grid.columns) == ["param", "p_value", "in_set"]
+
+
+def test_confidence_set_lazy_runs_in_space_placebo():
+    res, _, _ = _fit_with_placebos(run_placebo=False)
+    assert res._placebo_gaps is None  # placebo NOT run yet
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        res.confidence_set(family="constant", gamma=0.25)
+    assert res._placebo_gaps is not None  # lazily built
+    assert res.effect_confidence_set is not None
+
+
+def test_confidence_set_n_starts_ignored_with_warning_when_reference_exists():
+    res, _, _ = _fit_with_placebos(run_placebo=True)  # reference already built
+    with pytest.warns(UserWarning, match="n_starts is ignored"):
+        res.confidence_set(family="constant", gamma=0.25, n_starts=3)
+
+
+def test_get_confidence_set_df_requires_run():
+    res, _, _ = _fit_with_placebos()
+    with pytest.raises(ValueError, match="No confidence set"):
+        res.get_confidence_set_df()
+
+
+def test_in_space_placebo_rerun_invalidates_confidence_set():
+    # CI-review P1: a confidence set is computed against the CURRENT placebo reference set,
+    # so an explicit in_space_placebo() rebuild (which _require_placebo_reference even
+    # suggests, via n_starts) must INVALIDATE the cached set rather than report a stale one.
+    res = _exact_combo_fit(effect=3.0)
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        res.confidence_set(family="constant", gamma=0.25)
+    assert res.effect_confidence_set is not None
+    native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"]
+    assert native["confidence_set"]["status"] == "ran"
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        res.in_space_placebo(n_starts=2)  # rebuild the reference set
+    assert res.effect_confidence_set is None
+    with pytest.raises(ValueError, match="No confidence set"):
+        res.get_confidence_set_df()
+    native2 = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"]
+    assert native2["confidence_set"]["status"] == "not_run"
+
+
+def test_confidence_set_too_few_donors_raises():
+    # One donor -> in_space_placebo cannot form a reference set -> CI / test raise.
+    df, years, T0 = _make_panel(n_donors=1, T=10, T0=6)
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        res = SyntheticControl(v_method="custom", custom_v=np.ones(T0), seed=0).fit(
+            df, "y", "treated", "unit", "year", post_periods=years[T0:], treated_unit="treated"
+        )
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        with pytest.raises(ValueError, match="reference set"):
+            res.confidence_set(family="constant", gamma=0.25)
+        with pytest.raises(ValueError, match="reference set"):
+            res.test_sharp_null(0.0)
+
+
+def test_confidence_set_unpickled_raises():
+    res, _, _ = _fit_with_placebos()
+    restored = pickle.loads(pickle.dumps(res))  # snapshot + placebo gaps dropped
+    with pytest.raises(ValueError):
+        restored.confidence_set(family="constant", gamma=0.25)
+
+
+@pytest.mark.parametrize(
+    "kwargs",
+    [
+        {"family": "quadratic"},
+        {"gamma": 0.0},
+        {"gamma": 1.0},
+        {"n_grid": 1},
+        {"bounds": (1.0, 1.0)},
+        {"bounds": (2.0, 1.0)},
+        {"bounds": 5.0},  # scalar -> ValueError, not a bare TypeError from len()
+        {"bounds": (1.0,)},  # wrong length
+        {"bounds": (np.inf, 1.0)},  # non-finite
+    ],
+)
+def test_confidence_set_input_validation(kwargs):
+    res, _, _ = _fit_with_placebos()
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        with pytest.raises(ValueError):
+            res.confidence_set(**kwargs)
+
+
+@pytest.mark.parametrize("bad", [0.0, 1.0, -0.1, 1.5])
+def test_test_sharp_null_gamma_validation(bad):
+    res, _, _ = _fit_with_placebos()
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        with pytest.raises(ValueError):
+            res.test_sharp_null(0.0, gamma=bad)
+
+
+@pytest.mark.slow
+def test_confidence_set_coverage_simulation():
+    # Behavioral coverage check: under a constant-effect DGP the (1 - gamma) confidence
+    # set should cover the true effect at roughly 1 - gamma. A looser inner tolerance keeps
+    # the refits converging cleanly; reps with ANY dropped placebo (n_failed > 0) are
+    # EXCLUDED from the coverage count so dropped placebos cannot bias it (M5), and we
+    # assert that the large majority of reps are clean (the settings are adequate).
+    # J = 9 -> attainable p in multiples of 1/10, gamma = 0.1.
+    gamma = 0.1
+    c_true = 2.0
+    reps = 100
+    clean = 0
+    covered = 0
+    for s in range(reps):
+        df, years, T0 = _make_panel(n_donors=9, T=10, T0=6, effect=c_true, seed=1000 + s)
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore")
+            res = SyntheticControl(
+                v_method="custom", custom_v=np.ones(T0), seed=s, inner_min_decrease=1e-3
+            ).fit(
+                df, "y", "treated", "unit", "year", post_periods=years[T0:], treated_unit="treated"
+            )
+            res.in_space_placebo()
+            if res.n_failed != 0:
+                continue  # exclude biased reps from the coverage count (M5)
+            clean += 1
+            res.confidence_set(family="constant", gamma=gamma)
+        ecs = res.effect_confidence_set
+        if ecs["status"] == "unbounded":
+            covered += 1  # an unbounded set trivially covers the truth
+        elif ecs["status"] == "ran" and ecs["lower"] <= c_true <= ecs["upper"]:
+            covered += 1
+    assert clean >= 0.8 * reps, f"only {clean}/{reps} reps converged cleanly"
+    coverage = covered / clean
+    # Permutation inference is finite-sample valid under exchangeability; allow a wide
+    # band (the convex-combo treated is not perfectly exchangeable with single donors).
+    assert coverage >= 0.70, f"coverage {coverage:.3f} too low (target ~{1 - gamma})"