From e9550092b84877f6200b778246744e5bcf2c7264 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Mon, 1 Jun 2026 18:02:46 -0400
Subject: [PATCH] =?UTF-8?q?docs:=20Firpo=20&=20Possebom=20(2018)=20paper?=
 =?UTF-8?q?=20review=20=E2=80=94=20SCM=20CI=20by=20test=20inversion=20(PR-?=
 =?UTF-8?q?A)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add docs/methodology/papers/firpo-possebom-2018-review.md, a faithful,
paper-sourced fidelity review (Step-1 artifact) of Firpo & Possebom (2018,
Journal of Causal Inference 6(2), DOI 10.1515/jci-2016-0026) for the
forthcoming SCM confidence-set / CI-by-test-inversion track (PR-B) layered on
the existing SyntheticControl estimator (classic SCM has no analytical SE).

Full-paper coverage (paper-sourced only, no code-deviation verdicts): the
benchmark RMSPE-ratio permutation test (Eqs 4-6), sensitivity-analysis
parametric p-value weights with worst/best-case phi (Eqs 7-9), the sharp-null
RMSPE^f test (Eqs 10-13), confidence sets by test inversion (Eq 14) with the
operational constant-effect CI (Eqs 15-16) and linear-effect CS (Eqs 17-18),
the general test-statistic framework + Monte Carlo size/power (Eq 19, Sec 5),
and the multiple-outcome FWER / multiple-treated-unit pooled extensions
(Eqs 23-26). Requirements checklist flags the PR-B target vs deferred items.
A one-time boundary/equality-convention note documents the paper's own mixed
reject-at-<gamma (Eqs 5/9/13) vs <=gamma (Eq 19) and the strict CS >gamma
(Eq 14), recommending a single convention for PR-B's discrete permutation p-value.

Docs-only; no code change. Registered in docs/references.rst (Synthetic
Control Method) and docs/doc-deps.yaml; REGISTRY ## SyntheticControl gains a
firpo-possebom-2018-review.md reviews-on-file pointer; CHANGELOG [Unreleased]
documents the PR-A.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 CHANGELOG.md                                  |   1 +
 docs/doc-deps.yaml                            |   2 +
 docs/methodology/REGISTRY.md                  |   2 +-
 .../papers/firpo-possebom-2018-review.md      | 180 ++++++++++++++++++
 docs/references.rst                           |   2 +
 5 files changed, 186 insertions(+), 1 deletion(-)
 create mode 100644 docs/methodology/papers/firpo-possebom-2018-review.md
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 7c279d34..09f669db 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 - **`SyntheticControl` cross-validation + inverse-variance `V`-selection (ADH 2015 §; Abadie 2021 §3.2(a), Eq. 9).** Two new `v_method` values complete the ADH-2015/Abadie-2021 `V`-selection menu (joining `"nested"` / `"custom"`), each threaded through the in-space / leave-one-out / in-time placebo refits so a diagnostic uses the **same** estimator as the headline fit. **`v_method="cv"`** selects the diagonal predictor-importance `V` by out-of-sample cross-validation: the pre-period is split positionally at `v_cv_t0` (new constructor param; default `len(pre)//2`, Abadie 2021's `t0 = T0/2`) into a training and a validation window, `V` is chosen to minimize the validation-window outcome MSPE of the training-fit weights (`mspe_v` now reports this validation MSPE under cv), and the final reported weights are re-estimated on the validation-window predictors (ADH 2015 step 4). Each predictor spec is **re-aggregated** over each window (its mean/sum/identity recomputed over only the periods that fall in that window — a separate `dataprep` per window, exactly as ADH 2015's CV does, since R `Synth` has no built-in CV function), so the V-search is genuinely out-of-sample for every predictor type and the same `V*` drives both fits with no zeroed coordinate (`v_weights` reproduce `donor_weights` on the validation-window predictors, and `predictor_balance` is reported on that validation-window basis). **Fully-spanning precondition (fail-closed):** re-aggregating a predictor on each window requires it to be observed in **both** windows, so `cv` **requires every predictor to span both the training and validation windows** and raises `ValueError` otherwise — satisfied by ADH 2015's shared covariate / multi-period `special_predictors` (which span the windows) but NOT by the default per-period outcome lags (each is single-period and lives in one window only), so `cv` with the bare default predictors is rejected with guidance to pass spanning predictors. 
In-time-placebo truncation that breaks the fully-spanning precondition (a kept spec stops spanning both windows at the truncated split) marks that date `infeasible`. A second fail-closed gate covers windows that span but carry **no cross-donor variation** (every re-aggregated predictor constant across the donors, so `X0·W` is constant in `W` → a flat, unidentified weight solve that would otherwise return arbitrary "converged" weights — even when the treated unit differs, since donor distinguishability, not treated-vs-donor variation, identifies `W`): the headline fit raises `ValueError`, in-space placebo refits whose donor pool is indistinguishable in a window are dropped from the reference set, and such in-time-truncated dates are marked `infeasible`. Abadie 2021 footnote 7's CV non-uniqueness is handled by a **deterministic tie-break** (prefer the `V` closest to uniform among ties), making the selected `V*` among equally-good optima independent of the multistart evaluation order. The cv fit is reproducible for a fixed `seed` (like `nested`) but is not seed-independent — the multistart fills any slots beyond the distinct heuristic starts with seed-dependent random Dirichlet draws, so the tie-break removes start-order dependence among ties, not seed dependence. The tie-break is convergence-aware (a non-converged optimizer candidate cannot displace a converged incumbent on an objective tie). If the training-window solve that defines `mspe_v` truncates (e.g. `inner_max_iter` too small), the fit fails closed — `mspe_v=NaN` and the fit is marked non-converged — rather than reporting an invalid Eq. 9 criterion. 
**`v_method="inverse_variance"`** uses the closed form `v_h = 1/Var(X_h)` (variance over donors+treated on the unstandardized predictors), applied to the **raw** predictors so the effective objective is the unit-variance-rescaled `Σ_h diff_h²/Var_h` (Abadie 2021 §3.2(a)); the `standardize` pre-scaling is intentionally bypassed on this branch (inverse-variance weighting *is* the unit-variance rescaling — applying it on already-standardized rows would double-rescale to `Σ_h diff_h²/Var_h²`), so it is equivalent to uniform `V` on standardized predictors. No search (`mspe_v=None`); a zero-variance row gets 0 weight and an all-zero-variance panel falls back to uniform `V` with a warning. `custom_v` is rejected (fail-closed) for both methods and `v_cv_t0` is rejected unless `v_method="cv"`. On the degenerate **single-donor** path (`J=1` forces `w=[1]`) `V` is unidentified — every `V` yields the same synthetic — so `v_weights` is **uniform** and `mspe_v=None` for ALL `v_method`s (cv / inverse_variance included; their selected / closed-form `V` would be inert), with a `UserWarning`; the donor weights / gap / ATT are unaffected. An explicitly pinned `v_cv_t0` that no longer fits the truncated pre-fake window is nulled to the `//2` default for the placebo refit (a pinned value that still fits the truncated window is kept). **Validation:** R `Synth` has no built-in CV function (ADH 2015's CV is a manual `dataprep`+`synth` re-run), so cv is anchored by deterministic equivalence to the R-anchored `custom_v` path (the step-3 validation MSPE of the training-window fit and the step-4 validation-window weights each match a `custom_v=V*` fit on the correspondingly re-aggregated predictors) plus cv self-consistency (`in_time_placebo` under cv == a fresh cv fit on the backdated panel to 1e-7); inverse-variance is anchored bit-for-bit to a `custom_v=1/Var(X)` fit. 
Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (new `**Note:**` labels for the per-window re-aggregation convention, the flat-MSPE tie-break, and inverse-variance), `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`. The remaining ADH-2015 items (`W^reg` extrapolation diagnostic, sparse-SC subset search) stay tracked in `TODO.md`.
+- **Firpo & Possebom (2018) SCM inference paper review on file (PR-A).** Added `docs/methodology/papers/firpo-possebom-2018-review.md`, a faithful, paper-sourced fidelity review of Firpo & Possebom (2018, *Journal of Causal Inference* 6(2), DOI 10.1515/jci-2016-0026) — the Step-1 artifact for the forthcoming SCM **confidence-set / CI-by-test-inversion** track (PR-B) layered on the existing `SyntheticControl` estimator (classic SCM has no analytical SE; `se`/`p_value`/`conf_int` are NaN). Transcribes (paper-sourced only, no code-deviation verdicts) the benchmark RMSPE-ratio permutation test (Eqs. 4–6), the sensitivity-analysis parametric p-value weights with worst/best-case `φ̲`/`φ̄` (Eqs. 7–9), the sharp-null `RMSPE^f` test (Eqs. 10–13), the **confidence sets by test inversion** (Eq. 14) with the operational constant-effect CI (Eqs. 15–16) and linear-effect CS (Eqs. 17–18), the general test-statistic framework + Monte Carlo size/power of five statistics (Eq. 19, Section 5), and the multiple-outcome FWER (Eqs. 23–24) and multiple-treated-unit pooled (Eqs. 25–26) extensions; the requirements checklist flags the PR-B target (sharp-null test + constant/linear CI + benchmark + one-sided) versus the deferred sensitivity-analysis and multi-outcome/treated extensions. Docs-only; no code change. Registered in `docs/references.rst` (Synthetic Control Method section) and `docs/doc-deps.yaml`; REGISTRY `## SyntheticControl` gains a `firpo-possebom-2018-review.md` reviews-on-file pointer.
 
 ### Fixed
 - **Covariate names that collide with reserved structural terms now raise `ValueError` instead of silently corrupting the coefficient dict (`DifferenceInDifferences`, `MultiPeriodDiD`, `TwoWayFixedEffects`).** These estimators build their `coefficients` dict by zipping a variable-name list -- structural term names PLUS the user covariate column names appended verbatim -- with the fitted coefficient vector. A covariate whose name equaled a reserved structural name (`const`; the treatment/time column names; the `{treatment}:{time}` interaction; MultiPeriodDiD `period_{p}` dummies and `{treatment}:period_{p}` interactions; `TwoWayFixedEffects` `ATT`; fixed-effect / unit / time dummy names; or an internal `_`-prefixed working column such as `_treat_time` / `_did_treatment` / `_treatment_post`) silently **overwrote** that structural coefficient via Python dict last-write-wins -- e.g. a covariate named `const` dropped the intercept -- with no error or warning. A new shared `validate_covariate_names` helper (`diff_diff/utils.py`) is now called in each of the three `fit()` methods before the design matrix is built; it raises `ValueError` on a collision (the comparison is case-sensitive, so e.g. `Const` is still allowed) **and** on duplicate names within `covariates` (which collapse to a single dict entry the same way). Fixed-effect/unit/time dummy reserved names are taken from the same `pd.get_dummies(..., drop_first=True)` call used to build them, so they match exactly (including for pandas `Categorical` columns with a non-default category order). For `TwoWayFixedEffects` the guard fires on **all** variance paths: the default within-transform path returns only `{"ATT": att}` (no covariate is a dict key there), but a covariate named `_treatment_post` would still clobber the internal interaction column, so guarding both paths is uniform and forward-compatible. 
**Potentially breaking:** a fit that previously *succeeded* with a colliding (or duplicated) covariate name -- silently returning a corrupted coefficient dict -- now raises; rename the covariate column(s). The staggered / influence-function estimators (CallawaySantAnna, SunAbraham, StaggeredTripleDifference, EfficientDiD, TwoStageDiD, ImputationDiD, WooldridgeDiD, dCDH, StackedDiD) key results by `(g, t)` tuples / relative-time indices, never covariate names, and `TripleDifference` / `SyntheticControl` / `SyntheticDiD` do not expose covariates by name, so none are affected. New tests in `tests/test_utils.py`, `tests/test_estimators.py`, and `tests/test_estimators_vcov_type.py`.
diff --git a/docs/doc-deps.yaml b/docs/doc-deps.yaml
index 2b471847..5f0857bc 100644
--- a/docs/doc-deps.yaml
+++ b/docs/doc-deps.yaml
@@ -613,6 +613,8 @@ sources:
       - path: docs/methodology/REGISTRY.md
         section: "SyntheticControl"
         type: methodology
+      - path: docs/methodology/papers/firpo-possebom-2018-review.md
+        type: methodology
       - path: docs/api/synthetic_control.rst
         type: api_reference
       - path: README.md
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
index aac897e9..859541fc 100644
--- a/docs/methodology/REGISTRY.md
+++ b/docs/methodology/REGISTRY.md
@@ -1974,7 +1974,7 @@ Convergence criterion: stop when objective decrease < min_decrease² (default mi
 
 ## SyntheticControl
 
-**Primary source:** [Abadie, A., Diamond, A., & Hainmueller, J. (2010). "Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program." *JASA*, 105(490), 493–505.](https://doi.org/10.1198/jasa.2009.ap08746) Method originates in Abadie & Gardeazabal (2003). Paper reviews on file: `docs/methodology/papers/abadie-diamond-hainmueller-2010-review.md` (primary), `...-2015-review.md`, `abadie-2021-review.md`, `chernozhukov-wuthrich-zhu-2021-review.md`.
+**Primary source:** [Abadie, A., Diamond, A., & Hainmueller, J. (2010). "Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program." *JASA*, 105(490), 493–505.](https://doi.org/10.1198/jasa.2009.ap08746) Method originates in Abadie & Gardeazabal (2003). Paper reviews on file: `docs/methodology/papers/abadie-diamond-hainmueller-2010-review.md` (primary), `...-2015-review.md`, `abadie-2021-review.md`, `chernozhukov-wuthrich-zhu-2021-review.md`, `firpo-possebom-2018-review.md` (inference: sensitivity analysis + confidence sets by test inversion).
 
 Classic synthetic control (donor/unit weights only) for a single treated unit, distinct from `SyntheticDiD` (Arkhangelsky et al. 2021), which adds time weights and ridge regularization. Equation (1) of ADH 2010 shows classic SCM **generalizes the TWFE/DiD model** (recovered when the factor loadings `λ_t` are constant in time).
 
diff --git a/docs/methodology/papers/firpo-possebom-2018-review.md b/docs/methodology/papers/firpo-possebom-2018-review.md
new file mode 100644
index 00000000..f2fae1f0
--- /dev/null
+++ b/docs/methodology/papers/firpo-possebom-2018-review.md
@@ -0,0 +1,180 @@
+# Paper Review: Synthetic Control Method: Inference, Sensitivity Analysis and Confidence Sets
+
+**Authors:** Sergio Firpo (Insper), Vitor Possebom (Yale University)
+**Citation:** Firpo, S., & Possebom, V. (2018). "Synthetic Control Method: Inference, Sensitivity Analysis and Confidence Sets." *Journal of Causal Inference*, 6(2), 20160026.
+**PDF reviewed:** https://doi.org/10.1515/jci-2016-0026 (published *Journal of Causal Inference* version, open access; received 15 Nov 2016, revised 6 Aug 2018, accepted 11 Aug 2018, 26 pp). Per the project's PDFs-never-committed convention the local PDF is kept outside the repository; the published J. Causal Inference version (DOI 10.1515/jci-2016-0026) is the authoritative source. All equation, section, and footnote numbers below are pinned to that version.
+**Review date:** 2026-06-01
+
+> Scope note: this paper extends the **permutation / placebo inference** procedure of Abadie, Diamond & Hainmueller (the SCM benchmark) in two ways — (1) a **sensitivity analysis** that parametrically re-weights the placebo p-value away from the equal-weights benchmark, and (2) testing **any sharp null hypothesis** (not only "no effect whatsoever") via a modified RMSPE statistic, which it **inverts to construct confidence sets** for the treatment-effect path. It also generalizes to arbitrary test statistics, multiple outcomes (familywise error control), and multiple treated units (a pooled effect). This review is the **Step-1 fidelity artifact** for a forthcoming SCM **confidence-set / CI-by-test-inversion** implementation (PR-B) layered on the existing `SyntheticControl` estimator; the sensitivity-analysis and multiple-outcome / multiple-treated extensions are documented here but flagged **deferred**. The estimator itself (donor weights `W`, predictor importance `V`) is taken as given from ADH 2010/2015 — already implemented as `SyntheticControl` — and is recapped only as the paper frames it. Nothing here is sourced from outside this paper.
+
+---
+
+## Methodology Registry Entry
+
+*Formatted to match docs/methodology/REGISTRY.md. This documents an **inference procedure on the existing `SyntheticControl` estimator**, not a new estimator — the `## SyntheticControl` heading mirrors `abadie-2021-review.md`. The REGISTRY implementation contract (`docs/methodology/REGISTRY.md` §SyntheticControl) is unchanged by this docs-only PR-A; PR-B will add the confidence-set methodology subsection and flip the relevant checklist items.*
+
+## SyntheticControl
+
+**Primary source (this document):** Firpo, S., & Possebom, V. (2018). "Synthetic Control Method: Inference, Sensitivity Analysis and Confidence Sets." *Journal of Causal Inference*, 6(2), 20160026. https://doi.org/10.1515/jci-2016-0026
+
+**Key implementation requirements:**
+
+*Notation (Section 2.1):*
+- `J+1` regions over `T` periods; region 1 is treated from `T0+1` to `T` (`T0 ∈ (1,T) ∩ ℕ`); regions `2..J+1` are the never-treated donor pool. `Y^N_{j,t}` / `Y^I_{j,t}` = potential outcomes without / with the intervention; `D_{j,t}` = treatment dummy (`= 1` iff `j = 1` and `t > T0`). Observed outcome `Y_{j,t} = Y^N_{j,t} + α_{j,t}·D_{j,t}`.
+- `Y_j` = `(T0×1)` pre-period outcome vector; `X_j` = `(K×1)` predictors; `X_0` = `(K×J)` donor predictor matrix, `X_1` = `(K×1)` treated predictors. Some rows of `X_j` may be linear combinations of `Y_j` (footnote 6).
+
+*Target — the intervention-effect path (Equation 1):*
+
+    (1)  α_{j,t} := Y^I_{j,t} − Y^N_{j,t}
+
+The object of interest is the treated path `(α_{1,T0+1}, …, α_{1,T})`. Since `Y^I_{1,t}` is observed for `t > T0`, only `Y^N_{1,t}` must be estimated.
+
+*Estimator — the synthetic control (taken from ADH; Equations 2–3):*
+
+    Ŷ^N_{1,t} = Σ_{j=2}^{J+1} ŵ_j · Y_{j,t}
+    (2)  Ŵ(V) := argmin_{W∈𝒲} (X_1 − X_0 W)' V (X_1 − X_0 W),   𝒲 = {W : w_j ≥ 0, Σ_{j=2}^{J+1} w_j = 1}
+    (3)  V̂   := argmin_{V∈𝒱} (Y_1 − Y_0 Ŵ(V))' (Y_1 − Y_0 Ŵ(V)),   V diagonal PSD, trace = 1
+    estimated gap:  α̂_{1,t} := Y_{1,t} − Ŷ^N_{1,t}
+
+**Footnote 7 (V-selection — cross-reference to PR-1 #523):** besides the nested pre-period-MSPE choice of `V` (Eq. 3), the authors note two alternatives — (a) subjective / prior weights (discouraged, as it undermines SCM's objectivity), and (b) **cross-validation**: split the pre-period into a *training* and a *validation* window, solve Eq. 2 on the training window, and pick `V` minimizing the validation-window outcome MSPE (Eq. 3 evaluated on the validation window) — *exactly the `v_method="cv"` procedure shipped in PR-1 (#523)*. The Stata `Synth` command also offers a **regression-based** `V`, `v_k = |β_k| / (Σ_k |β_k|)` from regressing `Y_1` on `X_1` (not implemented in this library). The choice presented in the main text (nested MSPE) is the most common in the empirical literature.
+
+### Inference (Section 2.2) — the benchmark permutation test
+
+Following Fisher's Exact Hypothesis Testing Procedure (Fisher 1935; Imbens & Rubin; Rosenbaum), ADH permute which region is treated: for each `j ∈ {2,…,J+1}` re-estimate `α̂_{j,t}` and compare the treated unit's effect vector to the placebo distribution.
+
+*RMSPE-ratio test statistic (Equation 4):*
+
+    (4)  RMSPE_j := [ Σ_{t=T0+1}^{T} (Y_{j,t} − Ŷ^N_{j,t})² / (T − T0) ]  /  [ Σ_{t=1}^{T0} (Y_{j,t} − Ŷ^N_{j,t})² / T0 ]
+
+(post-intervention MSPE ÷ pre-intervention MSPE — controls for imperfect pre-fit; ADH 2010 introduce this to handle abnormally large `|α̂_{1,t}|` driven by poor pre-fit rather than a real effect.)
+
+*Benchmark p-value (Equation 5) and the exact null (Equation 6):*
+
+    (5)  p := ( Σ_{j=1}^{J+1} 𝟙[ RMSPE_j ≥ RMSPE_1 ] ) / (J + 1)
+    (6)  H_0:  Y^I_{j,t} = Y^N_{j,t}   for each region j and period t   (the "exact null" / no effect whatsoever)
+
+Reject `H_0` if `p < γ` (e.g. `γ = 0.1`). Rejecting the exact null implies *some* region has a non-zero effect in *some* period. **Footnote 8:** `γ` must be chosen carefully given the **discrete, finite** number of regions — the p-value granularity is `1/(J+1)`, which may preclude the usual 5% / 10% levels. The exact null is also known as the *sharp* null of no effect (it is stronger than "the average/typical effect is zero").
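+
+As a worked illustration of footnote 8's granularity point (arithmetic added in this review, using only the pool sizes quoted elsewhere in this document), the attainable benchmark p-values and the smallest one are:
+
+```latex
+\[
+p \in \left\{ \tfrac{1}{J+1}, \tfrac{2}{J+1}, \dots, 1 \right\},
+\qquad
+p_{\min} = \frac{1}{J+1} =
+\begin{cases}
+1/20 = 0.05 & (J+1 = 20,\ \text{Section 5 Monte Carlo}) \\
+1/17 \approx 0.059 & (J+1 = 17,\ \text{Section 7}) \\
+1/14 \approx 0.071 & (J+1 = 14,\ \text{Section 7, restricted pool})
+\end{cases}
+\]
+```
+
+With the strict rule `p < γ`, none of these pools can reject at `γ = 0.05`, and each can reject at `γ = 0.10` only when the treated unit has the single largest `RMSPE`.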
+
+### Contribution 1 — Sensitivity analysis (Section 3)
+
+The benchmark Eq. 5 weights all units equally; that choice is restrictive and the decision may depend on it. Generalize to `p := Σ_{j=1}^{J+1} π_j · 𝟙[RMSPE_j ≥ RMSPE_1]` (Equation 7) and impose a **parametric weight family** that distorts the uniform weights as little as possible (à la Rosenbaum / Cattaneo et al.). Step-by-step in the SCM framework (Section 3):
+
+1. Estimate `RMSPE_1,…,RMSPE_{J+1}` for all placebo assignments; `RMSPE_1 = RMSPE^obs`.
+2. Rename in decreasing order: `RMSPE_(1) > RMSPE_(2) > … > RMSPE_(J+1)`.
+3. Define `j̄ ∈ Ω := {(1),…,(J+1)}` with `RMSPE_j̄ = RMSPE^obs` (the largest such index on ties).
+4. Parametric weights (Equation 8):
+
+       (8)  π_(j)(φ, v) = exp(φ·v_(j)) / Σ_{j'∈Ω} exp(φ·v_(j'))
+
+   with sensitivity parameter `φ ∈ ℝ₊`, indicators `v_(j') ∈ {0,1}`, `v = (v_1,…,v_{J+1})`. At `φ = 0` all weights are equal (recovers the benchmark Eq. 5). Interpretation: a region with `v = 1` carries `exp(φ)` times the weight of a region with `v = 0`, i.e. its weight is larger by the proportion `Φ := exp(φ) − 1`.
+5. Generalized p-value (Equation 9):
+
+       (9)  p(φ, v) = Σ_{(j)∈Ω} [ exp(φ·v_(j)) / Σ_{j'∈Ω} exp(φ·v_(j')) ] · 𝟙[ RMSPE_(j) ≥ RMSPE_j̄ ]
+
+   reject the exact null if `p(φ, v) < γ`; `φ = 0` reduces to Eq. 5.
+6. If the exact null is **rejected**: the worst-case scenario sets `v_(j) = 1 if (j) ≤ j̄, else 0`; define `φ̲ ∈ ℝ₊` solving `p(φ̲, v) = γ`. A **small** `φ̲` ⇒ the rejection is **not robust** (small deviations from equal weights flip the decision).
+7. If the exact null is **not rejected**: the best-case scenario sets `v_(j) = 0 if (j) ≤ j̄, else 1`; define `φ̄ ∈ ℝ₊` solving `p(φ̄, v) = γ`.
+8. Plot `φ` (x-axis) vs `p(φ, v)` (y-axis); a curve on which `p(φ, v)` changes steeply with `φ` ⇒ the test decision is highly sensitive to the weight choice.
+
+Large `φ̲` / `φ̄` boost confidence that the conclusion is robust to deviations from the equal-weights benchmark, in the same spirit as ADH's benchmark.
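+
+The worst-case bound in step 6 has a closed form. A minimal derivation (added in this review, not taken from the paper), assuming the weights of Eq. 8 and writing `k` for the number of units whose statistic is at least the observed one, so that the benchmark p-value is `k/(J+1)` (`k` is a label introduced here):
+
+```latex
+\[
+p(\phi, v)
+= \frac{k\,e^{\phi}}{k\,e^{\phi} + (J + 1 - k)}
+\quad\Longrightarrow\quad
+p(\underline{\phi}, v) = \gamma
+\iff
+\underline{\phi} = \ln\!\left[ \frac{\gamma\,(J + 1 - k)}{(1 - \gamma)\,k} \right]
+\]
+```
+
+Plugging in the Section 7 values `J+1 = 14`, `k = 2`, `γ = 3/14` gives `φ̲ = ln(36/22) ≈ 0.49`, in line with the reported `φ̲ = 0.495` and its "~64% larger" reading (`36/22 ≈ 1.64`). By the same algebra the best-case bound of step 7 is `φ̄ = ln[(1 − γ)·k / (γ·(J + 1 − k))]`.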
+
+### Contribution 2 — Sharp nulls and confidence sets (Section 4)
+
+*Testing sharp nulls (Section 4.1).* Generalize the exact null to any sharp null:
+
+    (10)  H_0^f:  Y^I_{j,t} = Y^N_{j,t} + f_j(t)        (region-specific effect function f_j, j ∈ {1,…,J+1})
+    (11)  H_0^f:  Y^I_{j,t} = Y^N_{j,t} + f(t)          (common effect function f : {1,…,T} → ℝ; the practical case)
+
+Under a sharp null all potential outcomes are known, so `Y^N` is recoverable from the observed data. The RMSPE statistic becomes (Equation 12):
+
+    (12)  RMSPE^f_j := [ Σ_{t=T0+1}^{T} (Y_{j,t} − Ŷ^N_{j,t} − f(t))² / (T − T0) ]  /  [ Σ_{t=1}^{T0} (Y_{j,t} − Ŷ^N_{j,t} − f(t))² / T0 ]
+
+`f(t)` appears in **both** windows because Eq. 11 defines `f` over all `t ∈ {1,…,T}`; for the operational constant (Eq. 15) and linear (Eq. 17) families `f` carries a `𝟙[t ≥ T0+1]` factor, so `f(t) = 0` in the pre-period and the denominator reduces to the plain pre-period MSPE of Eq. 4. The p-value (Equation 13):
+
+    (13)  p^f(φ, v) := Σ_{j=1}^{J+1} [ exp(φ·v_j) / Σ_{j'=1}^{J+1} exp(φ·v_j') ] · 𝟙[ RMSPE^f_j ≥ RMSPE^f_1 ]
+
+reject the sharp null Eq. 11 if `p^f(φ, v) < γ`. The exact null (Eq. 6) is the special case `f ≡ 0`. Three highlighted `(φ, v)` choices: `(φ = 0, v = (1,…,1))` = the ADH benchmark; the worst-case `φ̲` if rejected; the best-case `φ̄` if not. Choice of `f`: a linear / quadratic / exponential fit to the estimated `(α̂_{1,t})` (to predict future effects); the **cost path** of the intervention (test "effect = cost" for cost-benefit analysis); or a theory-predicted shape (e.g. the inverted-U / U / decreasing shapes for natural-disaster GDP effects).
+
+*Confidence sets by test inversion (Section 4.2).* Inverting the test over effect functions gives the confidence set (Equation 14):
+
+    (14)  CS_{(1−γ)}(φ, v) := { f ∈ ℝ^{{1,…,T}} : p^f(φ, v) > γ }
+
+= every effect function whose associated sharp null is **not rejected**. The general `ℝ^T` set is computationally infeasible and "too general to be informative," so the paper restricts to two **one-parameter** families:
+
+    (15)  H_0^c:  Y^I_{j,t} = Y^N_{j,t} + c · 𝟙[t ≥ T0+1]                     (constant-in-time effect, c ∈ ℝ)
+    (16)  CI_{(1−γ)}(φ, v) := { f : f = c and p^c(φ) > γ } ⊆ CS_{(1−γ)}(φ, v)    (confidence INTERVAL)
+
+    (17)  H_0^c̃:  Y^I_{j,t} = Y^N_{j,t} + c̃ · (t − T0) · 𝟙[t ≥ T0+1]           (linear-in-time, zero intercept)
+    (18)  C̃S_{(1−γ)}(φ, v) := { f : f = c̃·(t−T0)·𝟙[t ≥ T0+1] and p^c̃(φ) > γ }   (confidence SET)
+
+Operationally: grid over the scalar `c` (or `c̃`), test each value via Eqs. 12–13, and keep those satisfying the set's defining **strict** inequality `p^c(φ) > γ` (Eqs. 14/16/18). **Boundary/equality convention (paper-sourced, stated once).** The paper's inequalities are not uniform at the boundary `p = γ`: the RMSPE-based tests *reject* at `p < γ` (Eqs. 5/9/13), the general-statistic test rejects at `p ≤ γ` (Eq. 19), and the confidence set is the *strict* `p^f > γ` (Eq. 14). Eq. 14's set is therefore **not** the exact complement of the Eq. 13 rejection region — they differ at `p^f = γ` (Eq. 14 *excludes* it, while Eq. 13 does *not* reject it; Eq. 19, by contrast, *would* reject at `p = γ`). This matters because the permutation p-value is **discrete** (a multiple of `1/(J+1)`), so `p = γ` is reachable. A PR-B implementation should pin a single boundary convention — we recommend Eq. 14's strict `p^f > γ` for confidence-set membership (i.e. exclude `p^f = γ`) — and document it. Extending to two-parameter functions (quadratic / exponential / logarithmic) is "theoretically straightforward" from Eq. 14 but computationally heavier; the paper restricts its main examples to one parameter. Confidence sets summarize **significance** (is `f ≡ 0` excluded?), **precision** (narrower ⇒ stronger conclusions), and **robustness** (compare set areas across `φ`). They are **uniform** over time (they combine information across all post periods to describe which effect *functions* are not rejected); a **point-wise** per-period CI instead uses `α̂_{1,t'}` as the test statistic separately for each `t' > T0` (Section 6.1 cautions that a point-wise interval may be inadequate).
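+
+One consequence of Eq. 12 for the constant family (algebra added in this review, not taken from the paper): because `f(t) = 0` in the pre-period, each unit's statistic is a quadratic in `c`. Writing `n = T − T0`, `m_j > 0` for unit `j`'s pre-period MSPE, and `A_j`, `B_j` for the post-period means of `α̂²_{j,t}` and `α̂_{j,t}` (all labels introduced here):
+
+```latex
+\[
+\mathrm{RMSPE}^{c}_{j}
+= \frac{\tfrac{1}{n}\sum_{t=T_0+1}^{T} \left(\hat{\alpha}_{j,t} - c\right)^{2}}{m_j}
+= \frac{A_j - 2 B_j\,c + c^{2}}{m_j},
+\]
+\[
+\mathrm{RMSPE}^{c}_{j} \ge \mathrm{RMSPE}^{c}_{1}
+\iff
+(m_1 - m_j)\,c^{2} - 2\,(m_1 B_j - m_j B_1)\,c + (m_1 A_j - m_j A_1) \ge 0 .
+\]
+```
+
+Each donor's indicator therefore switches at no more than two values of `c`, so `p^c(φ)` is piecewise constant in `c` with at most `2J` breakpoints. A grid is one way to locate them; a coarse grid can miss a narrow piece, and the retained set need not, in general, be a single interval.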
+
+### Other test statistics + Monte Carlo (Section 5)
+
+The sensitivity analysis and confidence sets work with *any* test statistic `θ^f(ι, τ, Y, X, f)` (Equation 19), where `ι` = treatment-assignment vector, `τ` = post-period indicator, `Y` = observed-outcome matrix, `X` = predictors, `f` = the sharp-null function; permutation replaces region 1's assignment with each canonical basis vector `e_j`:
+
+    (19)  p_{θ^f}(φ, v) := Σ_{j=1}^{J+1} [ exp(φ·v_j) / Σ_{j'} exp(φ·v_j') ] · 𝟙[ θ(e_j, τ, Y, X, f) ≥ θ^{f,obs} ] ≤ γ
+
+Five test statistics are compared: `θ¹ = mean(|α̂_{j,t}| : t > T0)` (ADH); **`θ² = RMSPE` (Eq. 4, ADH-recommended)**; `θ³ = |t-stat of the mean post effect vs 0|` (Mideksa); `θ⁴` = simple post-period difference-in-means, treated − controls (Imbens & Rubin); `θ⁵` = the interaction coefficient in a DiD regression (Equation 20). **Monte Carlo (T = 25, T0 = 15, K = 10, J+1 = 20; factor-model DGP Eq. 21; linear intervention effect Eq. 22 with `λ ∈ {0,.05,.1,.25,.5,1,2}`; 21,000 reps):** all five permutation tests have the correct size (0.10); **RMSPE (`θ²`) is uniformly more powerful than the simple `θ⁴`/`θ⁵`** and out-powers the Conley–Taber asymptotic test (which is mis-sized at this small `N`); the t-test `θ³` is the most powerful **but** fails when positive and negative effects cancel in the post-period mean — for sign-varying effects, use the multiple-outcome framework (§6.1). Excluding poor-pre-fit donors (pre-period MSPE > 5× the treated unit's) raises `θ¹`'s power but slightly over-rejects, and makes `θ²`/`θ³` slightly conservative. No single statistic dominates — match it to the research question (Eudey et al.).
+
+### Extensions (Section 6)
+
+*Multiple outcomes (Section 6.1) — familywise error control.* For `M` outcomes `Y^1,…,Y^M` with sharp null `H_0^f: Y^{m,I}_{j,t} = Y^{m,N}_{j,t} + f_m(t)` (Equation 23), compute a per-outcome observed p-value, then a **FWER-controlled** p-value (adapting Anderson 2008): order outcomes by observed p-value, take running minima, apply the parametric weights (Equation 24), enforce monotonicity, and reject outcome `m` if `p^{fwer}_m ≤ γ`. A single-outcome study where each post-period is treated as a separate "outcome" reduces to this; Anderson's **summary index test** is more powerful for "is there *any* effect?", whereas FWER control is for the *timing* of the effect.
+
+*Multiple treated units (Section 6.2) — pooled effect.* For `G` similar interventions (region `1^g` treated in intervention `g`), the pooled estimator is `ᾱ_{1,t} := Σ_{g=1}^{G} α̂_{1^g,t} / G` with sharp null Eq. 25. A single pooled test statistic `θ_{pld,f}` summarizes all time periods (to avoid over-rejection); placebo assignments permute which region is treated in each intervention via canonical bases, giving `Q := Π_{g=1}^{G} (J^g + 1)` pooled placebo assignments and the p-value Equation 26. Confidence sets (§4.2) extend by using Eq. 26.
+
+### Empirical application (Section 7)
+
+A re-analysis of ETA terrorism on Basque Country GDP per capita (Abadie & Gardeazabal 2003; `J+1 = 17`, `T0 = 1969`, post 1970–1997). A **one-sided** statistic (only negative effects are of interest) `θ = −ᾱ_post/(T−T0) ÷ (σ̂/√(T−T0))` gives `p = 3/17` (a marginal rejection of the exact null); excluding poor-pre-fit regions (Madrid, Extremadura, Balearic) → `p = 2/14`. Sensitivity: `φ̲ = 0.495` suffices to stop rejecting at the `3/14` level ⇒ the Basque region's weight need only be ~64% larger than `v = 0` units to overturn the result ⇒ **not very robust** (small sample). One-sided `12/14`-confidence sets (constant Eq. 16 and linear Eq. 18) lie below zero ⇒ economically relevant negative effects. A quadratic effect is not rejected (`p_quadratic = 6/14`, robust at `φ̄ = 1.905`) ⇒ the impact is initially negative but **attenuates toward zero in the long run**. Treating each year as a separate outcome (§6.1) localizes the negative impact to the 1980s with a recovery in the late 1990s.
+
+**Reference implementation(s):**
+- Authors' R and Stata code for the confidence sets in Eqs. 16 & 18 (footnote 15; the `goo.gl/RBYomh` short-link is stale) and a **Code Ocean** replication capsule (DOI `10.24433/CO.23bd238f-38c5-4b3e-82f4-3a1624fd8a33`).
+- Built on the `Synth` package of Abadie, Diamond & Hainmueller (R / MATLAB / Stata) for the underlying SCM fit.
+
+**Requirements checklist** (features this paper adds beyond ADH 2010/2015; **PR-B** = the planned next implementation target, **deferred** = later):
+- [ ] (PR-B) Sharp-null `RMSPE^f` test (Eqs. 12–13) reusing the in-space placebo permutation — subtract the hypothesized `f(t)` from the post-period gaps and re-rank.
+- [ ] (PR-B) Confidence **interval** for a constant-in-time effect (Eqs. 15–16) by test inversion over a `c`-grid.
+- [ ] (PR-B) Confidence **set** for a linear-in-time effect (Eqs. 17–18) by test inversion over a `c̃`-grid.
+- [ ] (PR-B) Benchmark `(φ = 0, v = (1,…,1))` p-value (reuse `in_space_placebo`'s RMSPE-ratio) + a one-sided variant (Section 7).
+- [ ] (deferred) Sensitivity-analysis parametric weights `π_(j)(φ, v)` (Eqs. 7–9) + worst/best-case `φ̲`/`φ̄` robustness curve (Section 3).
+- [ ] (deferred) General test-statistic menu `θ¹`–`θ⁵` (Eq. 19, Section 5).
+- [ ] (deferred) Multiple-outcome FWER control (Eqs. 23–24) and multiple-treated-unit pooled confidence sets (Eqs. 25–26, Section 6).
+
+---
+
+## Implementation Notes
+
+### Data Structure Requirements
+- Same as `SyntheticControl`: a balanced aggregate panel (one treated unit + a curated donor pool), a long pre-period, and an absorbing block-treatment suffix. The inference layer adds **no new data requirements** — it consumes the fitted gap path `(α̂_{j,t})` and the per-unit pre/post MSPEs the estimator already computes.
+- The sharp-null test and the confidence sets need the **full placebo reference set** (one synthetic-control refit per donor) — exactly the object the existing `in_space_placebo()` builds.
+
+### Computational Considerations
+- The benchmark test (Eq. 5) is `O(J)` synthetic-control refits (the permutation reference set). The sensitivity analysis (Eqs. 8–9) is a **closed-form re-weighting** of the *already-computed* `RMSPE_j` plus a one-dimensional root-find for `φ̲`/`φ̄` — no refits.
+- **Test-inversion CI = a grid search × the permutation test.** For each grid value `c` (or `c̃`): subtract `f(t)` from the relevant post-period outcomes, recompute `RMSPE^f` for all `J+1` units (Eq. 12), and evaluate Eq. 13. Because the donor synthetic controls and the pre-period denominators are unchanged across the grid (only the post-period gap shifts by `f(t)`), the per-grid-value cost is dominated by re-ranking, not refitting. Cost scales with grid resolution × `J`.
+- The general `ℝ^T` confidence set (Eq. 14) is computationally infeasible — an implementation must restrict to the constant / linear (or a small parametric) family and choose a finite grid.
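+
+The grid inversion described above can be sketched as follows. This is an illustration under stated assumptions, not PR-B's implementation: `gaps` is a hypothetical `(J+1) × T` array of fitted gaps with the treated unit in row 0, the benchmark equal weights (`φ = 0`) are used, `f(t) = c` is subtracted from every unit's post-period gap (one reading of "for all `J+1` units"; confirm against Eq. 12), and a grid value stays in the set when its p-value is strictly greater than `γ`.
+
+```python
+import numpy as np
+
+
+def sharp_null_p_value(gaps, t0, f_post):
+    """Equal-weight permutation p-value of the sharp null with effect path f_post.
+
+    gaps   : (J+1, T) array of fitted gaps, treated unit in row 0 (assumption).
+    t0     : number of pre-treatment periods.
+    f_post : length T - t0 array, the hypothesized f(t) in each post period.
+    """
+    gaps = np.asarray(gaps, dtype=float)
+    pre_mspe = np.mean(gaps[:, :t0] ** 2, axis=1)  # unchanged across the grid
+    post_mspe = np.mean((gaps[:, t0:] - f_post) ** 2, axis=1)  # shifted by f(t)
+    ratio = post_mspe / pre_mspe
+    # Share of units whose statistic is at least as large as the treated unit's.
+    return float(np.mean(ratio >= ratio[0]))
+
+
+def constant_effect_confidence_set(gaps, t0, c_grid, gamma):
+    """Grid values c whose sharp null f(t) = c is not rejected (p > gamma)."""
+    n_post = np.asarray(gaps).shape[1] - t0
+    return [
+        float(c)
+        for c in c_grid
+        if sharp_null_p_value(gaps, t0, np.full(n_post, c)) > gamma
+    ]
+```
+
+No synthetic control is refit inside the loop, so each grid value costs one pass over the `J+1` ratios. A linear family would replace `np.full(n_post, c)` with the corresponding post-period path.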
+
+### Tuning Parameters
+
+| Parameter | Type | Default (this paper) | Selection method |
+|-----------|------|----------------------|------------------|
+| `φ` (sensitivity) | `≥ 0` | `0` (equal-weights benchmark) | swept to report `φ̲`/`φ̄`; `φ = 0` reproduces ADH |
+| `v` (weight indicators) | `{0,1}^{J+1}` | `(1,…,1)` | worst / best-case patterns (steps 6–7) for the robustness bound |
+| `γ` (significance level) | `∈ (0,1)` | `0.1` | chosen given the discrete `1/(J+1)` granularity (fn 8) |
+| effect family `f` | constant / linear (/ parametric) | — | constant (Eq. 16) or linear (Eq. 18); two-parameter possible but costly |
+| grid bounds + resolution | scalar grid | **unspecified by the paper** | implementation choice (documented deviation) |
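+
+A minimal sketch of how `φ` and `v` enter the p-value, assuming the parametric weights take the form `π_j(φ, v) ∝ exp(φ · v_j)`. That form is an assumption here; it is consistent with the application's `exp(0.495) ≈ 1.64` ("~64% larger") but should be verified against Eqs. 7–8. The function name is hypothetical, and the treated unit's statistic is assumed to come first.
+
+```python
+import numpy as np
+
+
+def weighted_p_value(stats, phi, v):
+    """Weighted permutation p-value under assumed weights exp(phi * v_j).
+
+    stats : per-unit test statistics (e.g. RMSPE ratios), treated unit first.
+    phi   : sensitivity parameter, phi >= 0; phi = 0 gives equal weights.
+    v     : 0/1 indicators selecting which units receive the larger weight.
+    """
+    stats = np.asarray(stats, dtype=float)
+    v = np.asarray(v, dtype=float)
+    weights = np.exp(phi * v)
+    weights /= weights.sum()  # normalize so the weights sum to one
+    # Total weight on units at least as extreme as the treated unit.
+    return float(np.sum(weights * (stats >= stats[0])))
+```
+
+The statistics are computed once and only the weights change with `φ`, which is why the sensitivity sweep needs no refits.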
+
+### Relation to Existing diff-diff Estimators
+- This is the **inference layer for the existing `SyntheticControl`** estimator (`diff_diff/synthetic_control.py`); it introduces **no new estimator**. PR-B would reuse `SyntheticControlResults.in_space_placebo`, `_placebo_fit_unit`, and the `_rmspe_ratio` / `_mspe` helpers: the benchmark test (`φ = 0`) is literally the existing in-space placebo (ADH 2010 §2.4, already shipped), and the CI adds the sharp-null `f(t)` subtraction + the grid inversion on top.
+- The paper's **footnote 7** cross-validation `V` selection is the `v_method="cv"` shipped in PR-1 (#523); the sensitivity analysis is orthogonal to (and composes with) the existing `placebo_p_value`.
+- Complements **conformal inference (Chernozhukov–Wüthrich–Zhu 2021)** — the other SCM inference track on the roadmap (review already on file). Firpo–Possebom is permutation / Fisher-randomization-based (finite-sample, valid under exchangeability of placebo assignments); CWZ is residual-exchangeability conformal. They are alternative routes to SCM uncertainty quantification.
+
+---
+
+## Gaps and Uncertainties
+
+- **Grid bounds and resolution for the test-inversion CI are not specified.** Section 4.2 gives the set definitions (Eqs. 14/16/18) but not how to grid `c`/`c̃` or locate the interval endpoints — an implementation choice for PR-B (a documented deviation), e.g. a bracketing search on `p^c(φ) − γ`.
+- **The general confidence set `CS ⊆ ℝ^T` (Eq. 14) is computationally infeasible** and "too general to be informative" (the paper's own framing); only the one-parameter constant (Eq. 16) and linear (Eq. 18) subsets are operationalized. Two-parameter families are called "theoretically straightforward" but are not demonstrated.
+- **`γ` and finite-sample granularity:** with `J+1` regions the permutation p-value is a multiple of `1/(J+1)`, so not every conventional level is attainable (fn 8). The empirical application reports `3/17`, `2/14`, `12/14`, etc., rather than 0.05 / 0.10.
+- **Point-wise vs uniform confidence sets:** the constructed sets are uniform over the post-period; a per-period point-wise interval (using `α̂_{1,t'}`) is mentioned but the paper cautions (Section 6.1) it may be inadequate without a multiplicity correction.
+- **Sensitivity `v` worst/best-case patterns** (steps 6–7) define `φ̲`/`φ̄`, but selecting among multiple `v` that achieve a given decision rests on the "distort the uniform weights as little as possible" heuristic — a deterministic tie-break is left to the implementer.
+- **Replication code:** the `goo.gl/RBYomh` short-link (fn 15) is stale; the live artifact is the Code Ocean capsule (DOI `10.24433/CO.23bd238f-38c5-4b3e-82f4-3a1624fd8a33`). Not consulted for this review (paper-sourced only).
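+
+The granularity point above can be made concrete with a short sketch. The helper name is hypothetical; it assumes the equal-weight benchmark, where the p-value is a multiple of `1/(J+1)`, and takes `gamma` as a string or `Fraction` so that decimal levels are represented exactly.
+
+```python
+from fractions import Fraction
+from math import floor
+
+
+def largest_attainable_level(gamma, n_units):
+    """Largest multiple of 1/n_units that does not exceed gamma, or None.
+
+    gamma   : nominal level as a str or Fraction (e.g. "0.10").
+    n_units : number of regions J + 1 in the permutation reference set.
+    """
+    k = floor(Fraction(gamma) * n_units)
+    # With k = 0 no attainable p-value is at or below gamma.
+    return Fraction(k, n_units) if k >= 1 else None
+```
+
+With `n_units = 14`, a nominal `"0.05"` returns `None` because the smallest attainable p-value is `1/14`, which is why the application reports levels such as `2/14` rather than 0.05.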
diff --git a/docs/references.rst b/docs/references.rst
index d4610700..7d70134e 100644
--- a/docs/references.rst
+++ b/docs/references.rst
@@ -108,6 +108,8 @@ Synthetic Control Method
 
 - **Abadie, A., Diamond, A., & Hainmueller, J. (2015).** "Comparative Politics and the Synthetic Control Method." *American Journal of Political Science*, 59(2), 495-510. https://doi.org/10.1111/ajps.12116
 
+- **Firpo, S., & Possebom, V. (2018).** "Synthetic Control Method: Inference, Sensitivity Analysis and Confidence Sets." *Journal of Causal Inference*, 6(2), 20160026. https://doi.org/10.1515/jci-2016-0026
+
 Synthetic Difference-in-Differences
 -----------------------------------