
Commit ef8b9e1

igerber and claude committed
tutorial: add staggered-rollout vs collapsed-2x2 power decision guide
New self-contained tutorial (docs/tutorials/24_staggered_vs_collapsed_power.ipynb) framing a 50-state staggered geo rollout, with power analysis showing:

- the collapsed 2x2 silently targets a diluted estimand (reports ~60-94% of the true effect-on-treated as the rollout staggers; its 95% CI covers the truth ~0% of the time under a slow rollout), while the Callaway-Sant'Anna (CS) overall ATT stays on target;
- CS's minimum-detectable-lift penalty is a fast-rollout phenomenon: the 2x2's MDE climbs as the rollout staggers while CS's stays flat, closing to near parity;
- a clean-tail 2x2 is unbiased only under flat effects;

plus a CS-vs-2x2 decision guide.

Runs live (no committed data files), nbmake-clean in pure-Python (~65s). Registered in the docs toctree, tutorials README, and CHANGELOG; drift-test follow-up tracked in TODO.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
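The dilution in the first bullet can be reproduced in a toy setting. This is a minimal sketch, assuming parallel trends, a flat effect `tau` on every treated unit-period, equal-sized adoption cohorts, and a collapsed 2x2 whose "post" window starts at the first adoption date. `collapsed_2x2_share` is a hypothetical helper, not part of diff-diff, and this is a deterministic calculation rather than the tutorial's Monte Carlo on `generate_staggered_data`. Under these assumptions the collapsed 2x2 targets `tau` times the share of post-window unit-periods that are actually treated. A CS-style estimator that averages only treated (g, t) cells still targets `tau`, because every ATT(g, t) equals `tau` when effects are flat.

```python
import numpy as np


def collapsed_2x2_share(adoption_periods, first_post, last_period):
    """Share of post-window unit-periods in which rollout units are actually treated.

    Hypothetical helper: under parallel trends and a flat effect tau, the collapsed
    2x2 (every rollout unit labelled 'treated' from first_post onward) targets
    tau * share, while the effect-on-treated is tau.
    """
    g = np.asarray(adoption_periods)                # adoption period of each rollout unit
    post = np.arange(first_post, last_period + 1)   # collapsed "post" window
    treated = post[None, :] >= g[:, None]           # unit x period treatment indicator
    return float(treated.mean())


tau = 2.0  # illustrative flat effect on treated unit-periods
for last_cohort in (5, 8, 12):  # fast -> slow rollout
    cohorts = np.arange(5, last_cohort + 1)  # one equal-sized cohort per period
    share = collapsed_2x2_share(cohorts, first_post=5, last_period=12)
    print(f"cohorts 5..{last_cohort}: 2x2 targets {share:.4f} * tau = {share * tau:.4f}")
```

With `first_post=5` and `last_period=12`, one cohort per period from 5 through 8 gives a share of 26/32 = 0.8125. Cohorts at every period from 5 through 12 give 36/64 = 0.5625. The slower the rollout, the smaller the share.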
1 parent dd18dc5 commit ef8b9e1

5 files changed

Lines changed: 889 additions & 0 deletions


CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]

### Added
+- **New tutorial: `docs/tutorials/24_staggered_vs_collapsed_power.ipynb` — "Staggered Rollout or a Simple 2×2? A Power-Analysis Decision Guide".** A practitioner walkthrough for geo experiments (framed on a 50-state staggered rollout) on when to reach for Callaway-Sant'Anna vs collapsing to a familiar pre/post 2×2. Shows, with live paired Monte Carlo on `generate_staggered_data`, that the collapsed 2×2 silently targets a *diluted* estimand (reports ~60–94% of the true effect-on-treated as the rollout staggers, with near-zero CI coverage of the truth under a slow rollout), and that CS's minimum-detectable-lift penalty is a *fast-rollout* phenomenon that shrinks to near parity as the rollout becomes more staggered. Fully self-contained (runs live, no committed data files); ends with a CS-vs-2×2 decision guide.
- **New estimator: `SyntheticControl` — classic Synthetic Control Method (Abadie, Diamond & Hainmueller 2010; Abadie & Gardeazabal 2003).** Standalone estimator (`diff_diff/synthetic_control.py`) + `SyntheticControlResults` (`diff_diff/synthetic_control_results.py`) + `synthetic_control()` convenience function, exported from `diff_diff`. Builds a single treated unit's counterfactual as a convex combination of never-treated donor units — **donor (unit) weights only**, no time weights or ridge, distinct from `SyntheticDiD`. The inner simplex-constrained weighted-LS solve `W*(V)` reuses `utils._sc_weight_fw` (folding `V^½` into the predictor matrix, `intercept=False`, `zeta=0`); the diagonal predictor-importance matrix `V` is selected data-driven by minimizing pre-period outcome MSPE (`v_method="nested"`, softmax-on-simplex multistart Nelder-Mead + Powell polish) or supplied by the user (`v_method="custom"`). Predictors are built from `predictors`/`predictor_window`/`predictors_op`, `special_predictors`, and per-period outcome lags (`pre_period_outcomes`), in the R `Synth::dataprep` row order; per-row standardization (SD over donors+treated, ddof=1) matches the R `Synth::synth` source. Reports the gap path (`α̂_1t = Y_1t − Σ_j w_j Y_jt`), `att` (mean post-period gap), `pre_rmspe`, donor weights, `v_weights`, and a predictor-balance table. **No analytical standard error** — `se`/`t_stat`/`p_value`/`conf_int` are NaN (in-space placebo permutation inference with the post/pre RMSPE-ratio statistic is planned for a follow-up release; `_placebo_gaps`/`_rmspe_ratio`/`_fit_snapshot` are reserved on the results object). 
Ten validation gates baked in: predictor-period leakage, absorbing post-period suffix + no-anticipation cross-check against the treatment column, post-period canonicalization, donor-pool filtering before period derivation, empty-window rejection, poor-pre-fit `UserWarning` (RMSPE > SD of treated pre-outcomes), duplicate-predictor-label rejection, inner-solve non-convergence warning, order-independent gap-path rebuild, and the `standardize="none"` deviation; plus fail-closed `custom_v` cross-field rules and degenerate single-donor / single-pre-period handling. **R-`Synth` parity** (`tests/test_methodology_synthetic_control.py`, fixtures generated by `benchmarks/R/generate_synth_basque_golden.R` into `tests/data/`): two-tier on the Basque Country study — Tier-1 feeds R's `solution.v` via `custom_v` and reproduces the published donor weights (region 10 Cataluña 0.851 + region 14 Madrid 0.149) to `atol=1e-3` deterministically; Tier-2 (`@pytest.mark.slow`) checks the data-driven nested fit lands in a tolerance band (the nested `V` legitimately differs because the outer objective uses all pre periods, not R's `time.optimize.ssr` window). Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (with `**Deviation from R:** standardize="none"` and `**Note:**` labels for the standardization formula, objective window, softmax `V` parametrization, and 1×SD poor-fit threshold), `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`.
- **StaggeredTripleDifference methodology-review-tracker promotion: In Progress → Complete**, plus a new opt-in Eq-4.14 overall ATT. Closes the Ortiz-Villavicencio & Sant'Anna (2025, arXiv:2505.09942v3) primary-source review on the tracker (PR-A #499 added the paper review on file; this PR validates the source against it). New paper-equation-anchored Verified Components in `tests/test_methodology_staggered_triple_diff.py` (Theorem 4.1 / Eq. 4.5 RA=IPW=DR identification; Eq. 4.1 three-term DDD decomposition; Eqs. 4.11-4.12 optimal-GMM weight normalization + single-group reduction; Eq. 4.13 event-study cohort-share weighting; Eq. 4.14 / Cor. 4.2 overall) alongside the existing R cross-validation against `triplediff::ddd(panel=TRUE)` + `agg_ddd()`. **New feature — opt-in `overall_att_es` (paper Eq. 4.14 overall):** the unweighted mean of the post-treatment event-study effects ES(e), exposed on `StaggeredTripleDiffResults` (with `overall_se_es` / `overall_t_stat_es` / `overall_p_value_es` / `overall_conf_int_es`) and populated only when `aggregate="event_study"` / `"all"`. The default `overall_att` is unchanged (the Callaway-Sant'Anna simple post-treatment (g,t) average — the library-wide convention). Its analytical SE is the influence function of that mean (the average of the per-event-time combined IFs, routed through the same survey-aware variance estimator as the per-e effects via a new `_se_from_psi` helper); a multiplier-bootstrap SE replaces it under `n_bootstrap>0`. Computed via a side-channel stash on the shared `CallawaySantAnnaAggregationMixin._aggregate_event_study` (no return-signature change; CallawaySantAnna unaffected), over post-treatment `e >= -anticipation` (the library convention, matching `overall_att`). Cross-validated against R `agg_ddd(type="eventstudy")$overall.att` / `overall.se` (SE matches to ~0.1%). 
REGISTRY `## StaggeredTripleDifference`: the previously-unlabeled overall-aggregation prose is formalized under a `**Note:**` documenting both overalls, and the duplicate aggregation-weight deviation is consolidated (fixing a `P(G=g)` vs R `P(S=g)` mislabel). `METHODOLOGY_REVIEW.md` row L69 promoted to **Complete** (`Last Review = 2026-05-30`) with a Verified Components / R Comparison Results detail block; priority queue pruned. `docs/references.rst` Ortiz-Villavicencio entry pinned to arXiv:2505.09942v3.
- **SunAbraham + WooldridgeDiD-OLS `vcov_type="conley"` (Conley 1999 spatial-HAC) threading.** Both estimators now accept `vcov_type="conley"` with the five `conley_*` constructor params (`conley_coords`, `conley_cutoff_km`, `conley_metric`, `conley_kernel`, `conley_lag_cutoff`), reusing the already-`conleyreg`-validated `solve_ols` / `conley.py` machinery — within-period spatial HAC at `conley_lag_cutoff=0`, plus the within-unit Bartlett serial term at `conley_lag_cutoff>0` (the panel-aware path, since `conley_time`/`conley_unit` are always supplied — not pooled cross-sectional), no new variance code. Conley routes through each estimator's within-transform path; the unit auto-cluster is dropped on the conley path (an explicit `cluster=` enables the spatial+cluster product kernel); `survey_design=` / `weights` / `n_bootstrap>0` are rejected, and WooldridgeDiD conley is OLS-path-only (`method ∈ {logit, poisson}` + conley still rejected via the `method != "ols"` guard). `SunAbrahamResults` / `WooldridgeDiDResults` gain a `conley_lag_cutoff` field plus a Conley variance-label line in `summary()` (`SunAbrahamResults` also gains `cluster_name`). FWL-composability — the within-transform conley SE equals the full-dummy conley SE — is pinned in `tests/test_conley_vcov.py` (`TestConleySunAbraham` / `TestConleyWooldridge`). **`StackedDiD` conley remains deferred for a methodology reason** (the stacked design replicates units across sub-experiments, so Conley would see same-unit copies at distance 0; no `conleyreg` anchor; paper-gated) — its prior "same shape as the SunAbraham follow-up" framing is corrected in REGISTRY / TODO / the rejection message.
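Once donor weights are fixed, the quantities the `SyntheticControl` entry lists (gap path, `att`, `pre_rmspe`, the poor-pre-fit check) are simple functions of the outcome panel. This is a minimal NumPy sketch under stated assumptions. `sc_gap_summary` is a hypothetical helper, not the `SyntheticControl` API. It takes the weights as given instead of solving the nested `V` / `W*(V)` problem. The `ddof=1` in the poor-fit SD is also an assumption, because the entry does not say which SD convention the warning uses.

```python
import numpy as np


def sc_gap_summary(y_treated, y_donors, weights, n_pre):
    """Gap path, post-period ATT, pre-period RMSPE and a poor-fit flag (hypothetical helper)."""
    y1 = np.asarray(y_treated, dtype=float)  # shape (T,): treated unit's outcomes
    Y0 = np.asarray(y_donors, dtype=float)   # shape (T, J): periods x donor units
    w = np.asarray(weights, dtype=float)     # shape (J,): donor weights
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("donor weights must be non-negative and sum to 1")
    gap = y1 - Y0 @ w                        # alpha_hat_1t = Y_1t - sum_j w_j Y_jt
    pre, post = gap[:n_pre], gap[n_pre:]
    att = post.mean()                        # mean post-period gap
    pre_rmspe = np.sqrt(np.mean(pre ** 2))   # root mean squared pre-period gap
    # Poor fit: pre-period RMSPE exceeds the SD of the treated pre-period outcomes
    # (ddof=1 is an assumption of this sketch).
    poor_fit = pre_rmspe > np.std(y1[:n_pre], ddof=1)
    return gap, float(att), float(pre_rmspe), bool(poor_fit)


# Toy panel: 4 periods (2 pre, 2 post), 2 donors, equal weights.
gap, att, pre_rmspe, poor_fit = sc_gap_summary(
    y_treated=[2, 3, 6, 8],
    y_donors=[[1, 3], [2, 4], [3, 5], [4, 6]],
    weights=[0.5, 0.5],
    n_pre=2,
)
```

Here the synthetic path is `[2, 3, 4, 5]`, so the gap is `[0, 0, 2, 3]` and `att` is 2.5 with a perfect pre-period fit.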

TODO.md

Lines changed: 1 addition & 0 deletions
@@ -168,6 +168,7 @@ Deferred items from PR reviews that were not addressed before merge.

| Issue | Location | PR | Priority |
|-------|----------|----|----------|
+| Drift test for tutorial 24 qualitative power claims (monotonic dilution fast→slow; CS-vs-2×2 MDE crossover/near-parity at slow rollout) — pins the prose against estimator-default/simulation drift | `docs/tutorials/24_staggered_vs_collapsed_power.ipynb` | staggered-analysis-2x2 | Low |
| R comparison tests spawn separate `Rscript` per test (slow CI) | `tests/test_methodology_twfe.py:294` | #139 | Low |
| CS R helpers hard-code `xformla = ~ 1`; no covariate-adjusted R benchmark for IRLS path | `tests/test_methodology_callaway.py` | #202 | Low |
| Doc-snippet smoke tests only cover `.rst` files; `.txt` AI guides outside CI validation | `tests/test_doc_snippets.py` | #239 | Low |
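The drift test in the new TODO row could pin the qualitative claims without freezing exact Monte Carlo numbers. This is a sketch under two assumptions. First, the per-scenario dilution ratios and MDEs are collected from the notebook in fast-to-slow order. Second, `check_tutorial_24_claims`, `rel_tol` and `band` are hypothetical names and thresholds, not an existing test in the repository.

```python
from typing import Sequence


def check_tutorial_24_claims(
    dilution: Sequence[float],
    mde_cs: Sequence[float],
    mde_2x2: Sequence[float],
    rel_tol: float = 0.02,  # hypothetical slack for Monte Carlo noise
    band: float = 1.15,     # hypothetical ratio for "flat" and "near parity"
) -> None:
    """Assert the qualitative power claims for scenarios ordered fast -> slow rollout.

    dilution[k]: 2x2 estimate divided by the true effect-on-treated in scenario k.
    mde_cs[k], mde_2x2[k]: minimum detectable lift of each estimator in scenario k.
    """
    n = len(dilution)
    if n < 2 or len(mde_cs) != n or len(mde_2x2) != n:
        raise ValueError("need >= 2 scenarios with matching lengths")
    steps = range(n - 1)
    # Dilution worsens (the ratio falls) as the rollout slows.
    assert all(dilution[k + 1] <= dilution[k] * (1 + rel_tol) for k in steps), "dilution not monotone"
    # The 2x2's MDE climbs as the rollout staggers.
    assert all(mde_2x2[k + 1] >= mde_2x2[k] * (1 - rel_tol) for k in steps), "2x2 MDE not climbing"
    # CS's MDE stays roughly flat across scenarios.
    assert max(mde_cs) <= band * min(mde_cs), "CS MDE not flat"
    # CS pays an MDE penalty under the fastest rollout ...
    assert mde_cs[0] > mde_2x2[0], "no fast-rollout CS penalty"
    # ... that closes to near parity, or crosses over, under the slowest rollout.
    assert mde_cs[-1] <= band * mde_2x2[-1], "no slow-rollout near-parity"
```

Relative tolerances keep the check meaningful whether the MDEs are reported in outcome units or as percentage lifts.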

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -119,6 +119,7 @@ Quick Links
tutorials/05_honest_did
tutorials/06_power_analysis
tutorials/07_pretrends_power
+tutorials/24_staggered_vs_collapsed_power

.. toctree::
:maxdepth: 1

docs/tutorials/24_staggered_vs_collapsed_power.ipynb

Lines changed: 878 additions & 0 deletions
Large diffs are not rendered by default.

docs/tutorials/README.md

Lines changed: 8 additions & 0 deletions
@@ -127,6 +127,14 @@ Practitioner workflow for `SpilloverDiD` (Butts 2021 ring-indicator estimator +
- Conley spatial-HAC variance under `vcov_type="conley", conley_cutoff_km=100, conley_lag_cutoff in {0, 1}` — the cutoff = `d_bar` choice follows Butts §3.1, while the `conley_lag_cutoff` serial extension is the library's documented Wave E.2 follow-up synthesis with Newey-West-style serial Bartlett HAC (per REGISTRY "Variance (Wave E.2 follow-up)")
- Companion drift-test file (`tests/test_t23_spillover_tva_drift.py`)

+### 24. Staggered Rollout vs a Collapsed 2×2 (`24_staggered_vs_collapsed_power.ipynb`)
+Power-analysis decision guide for geo experiments (framed on a 50-state staggered rollout) on when to use Callaway-Sant'Anna vs collapsing to a familiar pre/post 2×2:
+- Why the collapsed 2×2 silently targets a *diluted* estimand (and how often its CI misses the true effect-on-treated)
+- The CS event study vs the 2×2's single diluted number
+- How the minimum detectable lift (MDE) changes for each estimator as the rollout gets more staggered — the power gap is a *fast-rollout* phenomenon that closes to near parity as staggering increases
+- When a clean-tail 2×2 is unbiased, the small-holdout and few-clusters caveats, and a CS-vs-2×2 decision guide
+- Fully self-contained: runs live (no committed data files)
+
## Running the Notebooks

1. Install diff-diff with dependencies:
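The MDE comparison in the README bullets rests on the usual normal-approximation formula, MDE = (z_{1-α/2} + z_{power}) × SE. The sketch below uses only the standard library. The tutorial itself may estimate MDEs by simulation instead. `mde_on_treated_scale` is a hypothetical helper that illustrates one possible mechanism behind the 2x2's climbing MDE: a diluted estimator that targets `share * tau` needs a larger true effect-on-treated to clear the same MDE.

```python
from statistics import NormalDist


def minimum_detectable_effect(se: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Normal-approximation MDE for a two-sided level-alpha test with the given power."""
    z = NormalDist()  # standard normal distribution
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * se


def mde_on_treated_scale(se: float, share: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """MDE in units of the true effect-on-treated for an estimator diluted by `share`.

    If the estimator targets share * tau instead of tau, detection requires
    share * tau >= MDE, i.e. tau >= MDE / share.
    """
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    return minimum_detectable_effect(se, alpha, power) / share
```

At `alpha=0.05` and 80% power the multiplier is about 2.80, so an SE of 1 gives an MDE of about 2.8. With a dilution share of 0.5, the same SE implies an MDE of about 5.6 on the effect-on-treated scale.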
