Commit 6c49acb

igerber and claude committed
SyntheticControl: full in-time coverage breakdown + LOO wording (CI codex review)
P2: _scm_native's in_time_placebo "ran" block now surfaces n_ran / n_infeasible (not just n_dates / n_failed), so a partially-usable sweep (some dates ran, some infeasible) is not summarized as full coverage. Regression test added.

P3: aligned the remaining "positively-weighted" copy (docs/api/synthetic_control.rst, CHANGELOG, llms-full) to the documented "reportably-weighted (>1e-6)" contract, and refreshed the _check_estimator_native SCM summary to mention the leave_one_out / in_time_placebo blocks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent eb3ee68 commit 6c49acb

5 files changed

Lines changed: 28 additions & 7 deletions

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added
- **PowerAnalysis methodology-review-tracker promotion: In Progress → Complete, with a panel-variance correction (behavior change).** Closes the Bloom (1995) + Burlig, Preonas & Woerman (2020) source audits on the tracker (PR-A #506 added both paper reviews + under-review Notes; this PR validates the source against the code and reconciles the discrepancies). **Behavior change:** the analytical *panel* DiD variance was the Moulton design-effect factor `(1+(T−1)·rho)/T`, wrong two ways versus the source — wrong period-scaling (~4× too small at `rho=0`, `m=r=5` versus the iid DiD benchmark) and the **opposite `rho`-sign** (it *raised* the MDE as within-unit correlation grew). It is replaced by the within-unit equicorrelated special case of Burlig et al. Eq. 2, `Var(ATT) = sigma² · (1/n_T + 1/n_C) · (1/n_pre + 1/n_post) · (1 − rho)`, in which within-unit (serial) correlation *lowers* the MDE because the difference-in-differences cancels the shared within-unit component. So `PowerAnalysis.mde` / `power` / `sample_size` (and the `compute_*` wrappers) now return a **smaller** MDE / required N as `rho` rises for **all** designs; the 2×2 path matches Bloom's `2σ²` at the default `rho = 0` and is continuous with the panel form at `n_pre = n_post = 1`. New input validation, enforced for **all** designs *before* the 2×2-vs-panel router: `n_pre >= 1`, `n_post >= 1`, `rho ∈ [−1/(T−1), 1)` (`T = n_pre + n_post`), finite `sigma >= 0`, positive group counts, and `treat_frac ∈ (0, 1)` now raise `ValueError` (previously invalid two-period shapes and out-of-range `rho` fell through to `basic_did` silently). The `(1 − rho)` factor applies at `T = 2` too — the 2×2 path is Burlig's `m = r = 1` special case (footnote 11), so a nonzero `rho` is no longer silently ignored there, while `rho = 0` still recovers Bloom's `2σ²`. The MDE multiplier stays the **normal (z)** Bloom multiplier (a deliberate large-sample approximation to Burlig's t, documented as `**Deviation from R:**`) — unchanged. 
New `tests/test_methodology_power.py` (Bloom Table 1 multipliers; 2×2 + panel closed forms; a literal-equicorrelated Monte-Carlo validation of the panel variance; `sample_size`↔`mde` round-trip; input-guard + `rho`-at-`T=2` + `compute_*` wrapper validation; base-R `qnorm` parity at `benchmarks/data/r_power_golden.json`, generator `benchmarks/R/generate_power_golden.R`); the two `tests/test_power.py` ICC-direction tests were inverted to Burlig's sign. REGISTRY `## PowerAnalysis` equation block rewritten (z not t; corrected 2×2 / panel SE + sample-size; removed the cluster-`m` and inverted-`R²` terms that matched neither code nor source); `docs/references.rst` adds Frison & Pocock (1992) + McKenzie (2012) as the equicorrelated lineage; tutorial `06_power_analysis.ipynb` corrected. `METHODOLOGY_REVIEW.md` row promoted to **Complete** (`Last Review = 2026-05-31`); priority queue pruned; the PR-A under-review Notes removed across REGISTRY / `power.py` / `references.rst`.
- **`WooldridgeDiD` outcome-fit hint:** `WooldridgeDiD(method="ols")` now emits a `UserWarning` when the outcome is binary (`{0, 1}`) or a non-negative integer count, noting that a matching nonlinear model (`method="logit"` / `method="poisson"`) is often the **more appropriate specification** for such outcomes. Following Wooldridge (2023): the nonlinear paths impose parallel trends on the link/index scale rather than in levels (level-PT is only valid for continuous/unbounded outcomes), and the paper's Section 5 simulations show the linear model both biased and less precise where the nonlinear mean holds. It is a **different identifying assumption** than linear OLS — which one fits depends on which parallel-trends restriction holds — so the warning frames it as a recommended comparison, not an automatic switch or free efficiency upgrade. OLS remains a valid QMLE for *any* response (Table 1). Always-on (suppress via `warnings.filterwarnings`); detection is high-signal (binary requires exactly `{0, 1}`; the count branch suggests Poisson — the natural unbounded-count model — for *any* non-negative integers with >2 distinct values, so bounded binomial / known-upper-bound integer outcomes are not separately distinguished from unbounded counts; fractional / continuous outcomes are not flagged).
13-
- **`SyntheticControl` leave-one-out + in-time placebo robustness diagnostics (ADH 2015 §4).** Two opt-in `SyntheticControlResults` methods, each a thin re-run of the validated solver (analytical `se`/`t_stat`/`p_value`/`conf_int`/`is_significant` stay bound to the NaN analytical `p_value`). **`leave_one_out()`** drops each positively-weighted donor in turn and re-fits the treated unit against the reduced pool, returning a per-drop ATT / `delta_att` table (a `status="baseline"` row first, then one row per dropped donor sorted by `|delta_att|`; non-converged refits → `status="failed"` with NaN metrics); a large `delta_att` flags single-donor dependence (a single *dominant* donor is still dropped — the others absorb its mass — and its large `delta_att` is the intended signal). **`in_time_placebo()`** reassigns the intervention to an earlier pre-date `t_f`, re-fits using only pre-`t_f` information, and reports the placebo "effect" over the held-out window `[t_f, T0)` — ~0 when there is no real pre-period effect (ADH 2015 Fig. 4). It sweeps every feasible interior pre-date by default (≥2 pre-fake + ≥1 post-fake); an explicit post-period / non-pre date raises, a dimensionally-infeasible valid date yields a `status="infeasible"` row. **Windowing = TRUNCATE** (documented `**Note:**` in REGISTRY): predictor specs are re-cut to the pre-`t_f` window (pre-period-outcome predictors become the pre-`t_f` outcomes; covariate/special windows are intersected), a window lying entirely in the held-out region is **dropped** (surfaced in `n_dropped_specs` + an aggregated warning) and `custom_v` is subset in lockstep with the surviving specs; the true post-treatment periods are excluded from the placebo fit entirely (no peeking). Both fail closed on a non-converged treated fit (and `leave_one_out` on `<2` donors). 
New accessors `get_leave_one_out_df()` / `get_in_time_placebo_df()` (survive pickling) and long-form `get_leave_one_out_gaps()` / `get_in_time_placebo_gaps()` for the overlay/backdating plots (panel-derived, dropped on pickle). **Validation:** R `Synth` has no in-time/LOO function (verified against its full CRAN function index), so — beyond the solver's existing Basque R parity — the diagnostics are anchored by deterministic self-consistency tests proving each equals a from-scratch `synthetic_control()` fit on the equivalent sub-problem (reduced donor pool / backdated panel) to 1e-7. **Reporting-stack integration:** `_scm_native` surfaces opt-in `leave_one_out` + `in_time_placebo` blocks (`status="not_run"` stub until run), `BusinessReport` lifts them into the SCM native robustness block, and `practitioner_next_steps` emits both as steps (non-`STEPS` tags so a caller's `completed_steps` cannot suppress them). The remaining ADH-2015 items (CV `V`-selection, `W^reg` extrapolation diagnostic, sparse-SC) are tracked in `TODO.md`. Documented in `docs/methodology/REGISTRY.md` §SyntheticControl, `docs/methodology/REPORTING.md`, `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`.
13+
- **`SyntheticControl` leave-one-out + in-time placebo robustness diagnostics (ADH 2015 §4).** Two opt-in `SyntheticControlResults` methods, each a thin re-run of the validated solver (analytical `se`/`t_stat`/`p_value`/`conf_int`/`is_significant` stay bound to the NaN analytical `p_value`). **`leave_one_out()`** drops each reportably-weighted donor (weight above the 1e-6 floor — the donors in `donor_weights`) in turn and re-fits the treated unit against the reduced pool, returning a per-drop ATT / `delta_att` table (a `status="baseline"` row first, then one row per dropped donor sorted by `|delta_att|`; non-converged refits → `status="failed"` with NaN metrics); a large `delta_att` flags single-donor dependence (a single *dominant* donor is still dropped — the others absorb its mass — and its large `delta_att` is the intended signal). **`in_time_placebo()`** reassigns the intervention to an earlier pre-date `t_f`, re-fits using only pre-`t_f` information, and reports the placebo "effect" over the held-out window `[t_f, T0)` — ~0 when there is no real pre-period effect (ADH 2015 Fig. 4). It sweeps every feasible interior pre-date by default (≥2 pre-fake + ≥1 post-fake); an explicit post-period / non-pre date raises, a dimensionally-infeasible valid date yields a `status="infeasible"` row. **Windowing = TRUNCATE** (documented `**Note:**` in REGISTRY): predictor specs are re-cut to the pre-`t_f` window (pre-period-outcome predictors become the pre-`t_f` outcomes; covariate/special windows are intersected), a window lying entirely in the held-out region is **dropped** (surfaced in `n_dropped_specs` + an aggregated warning) and `custom_v` is subset in lockstep with the surviving specs; the true post-treatment periods are excluded from the placebo fit entirely (no peeking). Both fail closed on a non-converged treated fit (and `leave_one_out` on `<2` donors). 
New accessors `get_leave_one_out_df()` / `get_in_time_placebo_df()` (survive pickling) and long-form `get_leave_one_out_gaps()` / `get_in_time_placebo_gaps()` for the overlay/backdating plots (panel-derived, dropped on pickle). **Validation:** R `Synth` has no in-time/LOO function (verified against its full CRAN function index), so — beyond the solver's existing Basque R parity — the diagnostics are anchored by deterministic self-consistency tests proving each equals a from-scratch `synthetic_control()` fit on the equivalent sub-problem (reduced donor pool / backdated panel) to 1e-7. **Reporting-stack integration:** `_scm_native` surfaces opt-in `leave_one_out` + `in_time_placebo` blocks (`status="not_run"` stub until run), `BusinessReport` lifts them into the SCM native robustness block, and `practitioner_next_steps` emits both as steps (non-`STEPS` tags so a caller's `completed_steps` cannot suppress them). The remaining ADH-2015 items (CV `V`-selection, `W^reg` extrapolation diagnostic, sparse-SC) are tracked in `TODO.md`. Documented in `docs/methodology/REGISTRY.md` §SyntheticControl, `docs/methodology/REPORTING.md`, `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`.
- **New tutorial: `docs/tutorials/24_staggered_vs_collapsed_power.ipynb` — "Staggered Rollout or a Simple 2×2? A Power-Analysis Decision Guide".** A practitioner walkthrough for geo experiments (framed on a 50-state staggered rollout) on when to reach for Callaway-Sant'Anna vs collapsing to a familiar pre/post 2×2. Shows, with live paired Monte Carlo on `generate_staggered_data`, that the collapsed 2×2 silently targets a *diluted* estimand (reports ~60–94% of the true effect-on-treated as the rollout staggers, with near-zero CI coverage of the truth under a slow rollout), and that CS's minimum-detectable-lift penalty is a *fast-rollout* phenomenon that shrinks to parity as the rollout becomes more staggered. Fully self-contained (runs live, no committed data files); ends with a CS-vs-2×2 decision guide.
- **`SyntheticControl` in-space placebo permutation inference + reporting-stack integration (ADH 2010 §2.4).** New `SyntheticControlResults.in_space_placebo()` provides the significance test classic SCM lacks an analytical SE for: it reassigns treatment to each donor, refits a synthetic control for that pseudo-treated donor against the **other `J−1` donors** (the real treated unit is excluded from every placebo pool — its post-period is treatment-contaminated; matches `SCtools::generate.placebos`), and ranks the treated unit's post/pre **RMSPE ratio** among the `J+1` units. New fields `placebo_p_value` (`= rank/(n_placebos+1)`, an upper-tail rank test on the unsigned RMSPE ratio — direction-agnostic, so it detects an effect of *either* sign rather than a signed/one-directional hypothesis; ties counted via `≥`), `rmspe_ratio` (the treated statistic, set at fit), and `n_placebos`/`n_failed` (effective reference-set sizes; non-converged placebos are excluded from BOTH numerator and denominator, never penalized into the rank). `placebo_p_value` is a **separate field** from the (always-NaN) `p_value` — it is a permutation p-value with no SE/t-stat and does not flow through `safe_inference`; `is_significant` stays bound to `p_value`. Edge cases fail closed: scale-aware RMSPE-ratio floor (a perfect pre-fit gives a finite ratio, not `inf`), `J<2` → NaN+warn, `J==2` → degenerate+coarse warn, deterministic given `seed`. New `get_placebo_df()` returns the per-unit RMSPE-ratio summary table (incl. the treated row and any failed donors) used for the rank. The design keeps the placebo *compute* opt-in — the per-donor refit loop runs only on the explicit `in_space_placebo()` call. To support that opt-in call, every fit retains a `_SyntheticControlFitSnapshot` of the pivoted panel (memory O(units × periods × predictor-vars), like `SyntheticDiD`'s snapshot for `in_time_placebo`; excluded from pickling). A compact/lazy snapshot representation is tracked as a follow-up in `TODO.md`. 
**Reporting-stack integration:** `SyntheticControlResults` is now routed through `DiagnosticReport` (fit-based `scm_fit` parallel-trends analogue → verdict `design_enforced_pt` reading `pre_rmspe`; `_scm_native` surfaces `pre_rmspe` + donor-weight concentration + the placebo p-value when already computed — never triggering the refit loop implicitly), `practitioner_next_steps` (`_handle_synthetic_control` with the placebo as the headline significance step), and `BusinessReport` (fit-based assumption block, ADH 2010 attribution, robustness via `estimator_native_diagnostics`; HonestDiD passthrough rejected like SDiD/TROP). Also fixes a latent BR bug where the headline `is_significant` was a non-JSON-serializable numpy `bool_` when `p_value` is a numpy `NaN`. Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (new `**Note:**` labels for the donor-pool construction, failure handling, RMSPE-ratio floor, and the non-analytical-p-value split), `docs/methodology/REPORTING.md`, `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`.
- **New estimator: `SyntheticControl` — classic Synthetic Control Method (Abadie, Diamond & Hainmueller 2010; Abadie & Gardeazabal 2003).** Standalone estimator (`diff_diff/synthetic_control.py`) + `SyntheticControlResults` (`diff_diff/synthetic_control_results.py`) + `synthetic_control()` convenience function, exported from `diff_diff`. Builds a single treated unit's counterfactual as a convex combination of never-treated donor units — **donor (unit) weights only**, no time weights or ridge, distinct from `SyntheticDiD`. The inner simplex-constrained weighted-LS solve `W*(V)` reuses `utils._sc_weight_fw` (folding `V^½` into the predictor matrix, `intercept=False`, `zeta=0`); the diagonal predictor-importance matrix `V` is selected data-driven by minimizing pre-period outcome MSPE (`v_method="nested"`, softmax-on-simplex multistart Nelder-Mead + Powell polish) or supplied by the user (`v_method="custom"`). Predictors are built from `predictors`/`predictor_window`/`predictors_op`, `special_predictors`, and per-period outcome lags (`pre_period_outcomes`), in the R `Synth::dataprep` row order; per-row standardization (SD over donors+treated, ddof=1) matches the R `Synth::synth` source. Reports the gap path (`α̂_1t = Y_1t − Σ_j w_j Y_jt`), `att` (mean post-period gap), `pre_rmspe`, donor weights, `v_weights`, and a predictor-balance table. **No analytical standard error** — `se`/`t_stat`/`p_value`/`conf_int` are NaN; significance comes from in-space placebo permutation inference via `in_space_placebo()` (see the dedicated entry below). 
Ten validation gates baked in: predictor-period leakage, absorbing post-period suffix + no-anticipation cross-check against the treatment column, post-period canonicalization, donor-pool filtering before period derivation, empty-window rejection, poor-pre-fit `UserWarning` (RMSPE > SD of treated pre-outcomes), duplicate-predictor-label rejection, inner-solve non-convergence warning, order-independent gap-path rebuild, and the `standardize="none"` deviation; plus fail-closed `custom_v` cross-field rules and degenerate single-donor / single-pre-period handling. **R-`Synth` parity** (`tests/test_methodology_synthetic_control.py`, fixtures generated by `benchmarks/R/generate_synth_basque_golden.R` into `tests/data/`): two-tier on the Basque Country study — Tier-1 feeds R's `solution.v` via `custom_v` and reproduces the published donor weights (region 10 Cataluña 0.851 + region 14 Madrid 0.149) to `atol=1e-3` deterministically; Tier-2 (`@pytest.mark.slow`) checks the data-driven nested fit lands in a tolerance band (the nested `V` legitimately differs because the outer objective uses all pre periods, not R's `time.optimize.ssr` window). Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (with `**Deviation from R:** standardize="none"` and `**Note:**` labels for the standardization formula, objective window, softmax `V` parametrization, and 1×SD poor-fit threshold), `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`.
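The wording change in this hunk (from "positively-weighted" to "reportably-weighted") matters for donors with tiny but nonzero weights. A minimal sketch of the distinction follows, assuming donor weights are held in a plain mapping; the helper name `reportable_donors` and the example weights are hypothetical, and only the 1e-6 floor comes from the changelog entry.

```python
from typing import Dict, List

REPORT_FLOOR = 1e-6  # floor quoted in the changelog entry


def reportable_donors(weights: Dict[str, float], floor: float = REPORT_FLOOR) -> List[str]:
    """Return donors whose weight is strictly above the floor, in input order."""
    return [name for name, w in weights.items() if w > floor]


# Hypothetical weights: "C" is positive but numerically negligible.
weights = {"A": 0.851, "B": 0.149, "C": 3e-9, "D": 0.0}

positively_weighted = [name for name, w in weights.items() if w > 0]
reportably_weighted = reportable_donors(weights)
```

Under these assumptions a "positively-weighted" rule would also select donor `C`, while the "reportably-weighted" rule selects only `A` and `B`, which is the set the updated entry describes as the leave-one-out candidates.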

diff_diff/diagnostic_report.py

Lines changed: 10 additions & 3 deletions
@@ -2152,8 +2152,10 @@ def _check_estimator_native(self) -> Dict[str, Any]:
  selected ``lambda_*``).

  SyntheticControl: pre-treatment fit (``pre_rmspe``), donor-weight
- concentration, and — when already computed — the in-space placebo
- permutation p-value (``in_space_placebo``).
+ concentration, and — each surfaced only when already computed — the
+ in-space placebo permutation p-value (``in_space_placebo``), the ADH-2015
+ leave-one-out donor robustness (``leave_one_out``), and the in-time
+ backdating placebo (``in_time_placebo``).
  """
  r = self._results
  name = type(r).__name__
@@ -2398,9 +2400,14 @@ def _scm_native(self, r: Any) -> Dict[str, Any]:
  max_abs_att = float(ran["placebo_att"].abs().max()) if len(ran) else None
  out["in_time_placebo"] = {
  "status": "ran",
+ # Full coverage breakdown so a partially-usable sweep is not
+ # overstated: n_dates is the requested grid; n_ran are the usable
+ # placebos; n_failed / n_infeasible are the dropped remainder.
  "n_dates": _to_python_scalar(int(len(itp))),
- "max_abs_placebo_att": _to_python_float(max_abs_att),
+ "n_ran": _to_python_scalar(int(len(ran))),
  "n_failed": _to_python_scalar(getattr(r, "_in_time_n_failed", None)),
+ "n_infeasible": _to_python_scalar(getattr(r, "_in_time_n_infeasible", None)),
+ "max_abs_placebo_att": _to_python_float(max_abs_att),
  }
  else:
  _it_reasons = {
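The coverage breakdown added in this hunk can be sketched in isolation. The example below is a minimal illustration, assuming the sweep is available as one row per requested date with a `status` field; the helper name `summarize_in_time_sweep` and the row layout are hypothetical, and it counts failed and infeasible rows directly rather than reading them from the results object as the diff does. Only the output keys are taken from the diff.

```python
from typing import Any, Dict, List


def summarize_in_time_sweep(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize a placebo sweep given one row per requested date.

    Assumed layout: each row has ``status`` in {"ran", "failed", "infeasible"}
    and, for "ran" rows, a numeric ``placebo_att``.
    """
    ran = [row for row in rows if row["status"] == "ran"]
    n_failed = sum(1 for row in rows if row["status"] == "failed")
    n_infeasible = sum(1 for row in rows if row["status"] == "infeasible")
    # None when no placebo ran, mirroring the guard in the diff.
    max_abs_att = max((abs(row["placebo_att"]) for row in ran), default=None)
    return {
        "n_dates": len(rows),          # requested grid
        "n_ran": len(ran),             # usable placebos
        "n_failed": n_failed,          # refits that did not converge
        "n_infeasible": n_infeasible,  # dates that could not be fit at all
        "max_abs_placebo_att": max_abs_att,
    }


# Hypothetical sweep: two usable dates, one failed, one infeasible.
sweep = [
    {"date": 2001, "status": "infeasible", "placebo_att": None},
    {"date": 2002, "status": "ran", "placebo_att": 0.12},
    {"date": 2003, "status": "ran", "placebo_att": -0.30},
    {"date": 2004, "status": "failed", "placebo_att": None},
]
summary = summarize_in_time_sweep(sweep)
```

With only `n_dates` and `n_failed` reported, this sweep would read as four dates with one failure; adding `n_ran` and `n_infeasible` makes it explicit that only two of the four dates produced a usable placebo.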
