spillover-tva: address CI codex R3 P3s — README + CHANGELOG count

igerber · claude · igerber · commit e1e0ba59624c · 2026-05-25T13:34:27.000-04:00
Two cosmetic P3s from CI codex R3: - docs/tutorials/README.md: catalog stopped at T22; added a T23 entry in the same format (overview + companion drift-test pointer). - CHANGELOG.md: said "17 function-level tests" — file now has 18 after the d_bar≥100 delta_1 identity companion test landed in commit 1248c44. Bumped to 18 and also surfaced the new exact- identity claim ("tau_total AND delta_1 ... per Butts §4 'once d_bar covers the true horizon, widening is benign'"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -11,7 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **`CallawaySantAnna.cluster=` silent no-op (Phase 1b interstitial).** `CallawaySantAnna(cluster="state").fit(...)` previously accepted the argument, stored it, returned it from `get_params()`, but never consumed it anywhere in the fit / aggregator / bootstrap pipeline (`staggered.py:154-156` docstring claimed "Defaults to unit-level clustering" — but for bare `cluster=X`, the aggregator at `staggered_aggregation.py:193-213` computed per-unit IF variance regardless, and the bootstrap at `staggered_bootstrap.py:323-347` drew per-unit multiplier weights regardless). Users who explicitly set `cluster="state"` got per-unit inference with no warning — typically SE too small under intra-cluster correlation. **Survey-PSU clustering via `survey_design=SurveyDesign(psu="state")` was NOT affected** and continued to cluster correctly via `_compute_stratified_psu_meat`. The fix synthesizes a minimal `SurveyDesign(psu=self.cluster, weight_type="pweight")` when bare `cluster=` is set without an explicit survey design, threading the synthesized PSU through the existing survey-PSU machinery (aggregator + bootstrap). A new dedicated `df_inference` field on `CallawaySantAnnaResults` carries the cluster-level df for the bare-cluster-synthesize path ONLY (where `survey_metadata` is intentionally `None` to preserve the `DiagnosticReport.survey_metadata is not None` skip at `diagnostic_report.py:848-856` + `:1150-1158` for "Original fit used a survey design" reasoning, and the `summary()` survey block render at `staggered_results.py:235-238`). `HonestDiD` at `honest_did.py` prefers `survey_metadata.df_survey` first (the actual CS-internal df, which may be tightened post-resolve for replicate designs) and falls back to `df_inference` for bare-cluster fits — so downstream consumers always see the cluster df without overriding the post-recompute survey df. When `survey_design=SurveyDesign(weights=Y)` without PSU is provided AND `cluster=X` is also set, `_inject_cluster_as_psu` injects the bare cluster as the effective PSU AND an `effective_survey_design = replace(survey_design, psu=self.cluster)` is constructed so the downstream `_validate_unit_constant_survey` catches movers (units crossing clusters across periods) on panel data via the now-PSU-bearing design; `survey_metadata` is recomputed to reflect the injected PSU. When both `cluster=X` AND `survey_design.psu=Y` are set, the explicit PSU wins via `_resolve_effective_cluster` (emits `UserWarning` if partitions differ). **`cluster= + SurveyDesign(replicate_weights=[...])` raises `NotImplementedError`**: replicate-weight variance is computed by replicate reweighting (BRR / Fay / JK1 / JKn / SDR) and ignores PSU/cluster entirely (`survey.py:104-109` enforces replicate_weights are mutually exclusive with strata/psu/fpc); honoring bare `cluster=` would silently have no effect while populating `cluster_name`/`n_clusters` on Results dishonestly. Assertive regression tests pin the fix on both panel and repeated-cross-section paths plus the survey/non-survey contract boundaries: `test_cluster_robust_ses_differ_from_unit_level`, `test_bare_cluster_works_with_panel_false_rcs`, `test_bare_cluster_synthesizes_survey_design`, `test_inject_branch_panel_mover_raises`, `test_replicate_weight_plus_cluster_rejected`, `test_bare_cluster_populates_df_inference` (asserts the dedicated cluster-df carrier is set), `test_bare_cluster_does_not_set_survey_metadata` (asserts the survey/non-survey contract is preserved — DiagnosticReport / summary() must not treat a bare-cluster fit as survey-backed), `test_explicit_survey_design_does_populate_survey_metadata` (asserts the inject-branch path still populates survey_metadata for legitimate user-provided SurveyDesign), and `test_bare_cluster_honest_did_uses_df_inference` (end-to-end: HonestDiD threads df_inference into HonestDiDResults.df_survey, preventing silent normal-theory regression on a future refactor). When `cluster=None` (default), behavior is bit-equal to pre-PR (wiring guarded by `if self.cluster is not None:`). Audit verified the no-op was CS-specific — the other 7 Phase 1b estimators (SunAbraham, StackedDiD, WooldridgeDiD, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD) handle bare `cluster=` correctly.
 
 ### Added
-- **New tutorial: SpilloverDiD on synthetic TVA-style spillover panel (Butts 2021 §4 analogue) with spillover-bandwidth sensitivity and Conley spatial-HAC inference.** `docs/tutorials/23_spillover_tva.ipynb` walks a practitioner through the SpilloverDiD workflow on a 4-period 200-unit panel laid out as treated cluster + near-control band + far-control band. The DGP is tuned at locked seed 23 (`n_treated=25, n_near=120, n_far=55, tau_total=-7.4, delta_1=-4.5, d_bar=100 km`) so naive multi-period TWFE on the full sample understates the direct effect by ~42% — matching the Butts (2021) §4 Table 1 Panel A bias-correction direction documented at `docs/methodology/papers/butts-2021-review.md:257`. SpilloverDiD with `rings=[0.0, 100.0]` cleanly recovers both `tau_total = -7.34` and `delta_1 = -4.53`. The tutorial covers (1) problem framing with TVA / Kline-Moretti (2014) citation, (2) panel construction with the DGP equation inline, (3) naive headline and the bias mechanism, (4) `rings` sensitivity grid at outer edges 50/100/150/200 km (estimates stabilize once `d_bar` covers the true spillover horizon), (5) the headline `SpilloverDiD` fit, (6) Conley spatial-HAC variance via `vcov_type="conley", conley_cutoff_km=100, conley_lag_cutoff∈{0,1}` per Butts §3.1 — showcasing the Spillover-Conley work shipped through PRs #468 / #474 / #477 / #482 / #485 / #489. Drift detection in `tests/test_t23_spillover_tva_drift.py` (17 function-level tests, T19 pattern) pins panel composition, geographic bands, true parameters, naive coefficient endpoint (round-to-1), sensitivity grid endpoints (round-to-1 per reviewer guidance for BLAS safety on borderline-rank-deficient `rings=[0,50]`), Conley SE divergence from HC1 (direction-pinned: Conley < HC1 on this DGP), and the platform-agnostic post-filter warning surface (T19 pattern: mirrors the notebook's narrow `.*encountered in matmul` `RuntimeWarning` filter inside the capture block and asserts no warnings remain; on Apple Silicon M4 + numpy<2.3 the three known Accelerate BLAS matmul warnings — documented at `TODO.md` "RuntimeWarnings in Linear Algebra Operations" — fire and are filtered, on M3 / Intel / Linux or numpy>=2.3 the filter is a no-op, EITHER WAY any UserWarning / FutureWarning / non-matmul RuntimeWarning surfaces immediately). Per `feedback_t19_drift_guards_test_file_only`, ZERO in-notebook asserts — all numerical guards live in the test file. Per `feedback_notebook_workflow`, the DGP was developed and locked in a temporary `_scratch/` script (gitignored) before being pasted into the notebook §2 cell and duplicated into the drift-test `panel` fixture. `docs/index.rst` toctree updated; `docs/references.rst` gains the Kline-Moretti (2014) entry; `docs/methodology/papers/butts-2021-review.md:257` cross-reference updated from "T22 tutorial" to "T23 tutorial" (slot 22 was occupied by `22_had_survey_design.ipynb`). The `SpilloverDiD T22 TVA tutorial` row in `TODO.md` (renumbered to T23 at delivery) is dropped.
+- **New tutorial: SpilloverDiD on synthetic TVA-style spillover panel (Butts 2021 §4 analogue) with spillover-bandwidth sensitivity and Conley spatial-HAC inference.** `docs/tutorials/23_spillover_tva.ipynb` walks a practitioner through the SpilloverDiD workflow on a 4-period 200-unit panel laid out as treated cluster + near-control band + far-control band. The DGP is tuned at locked seed 23 (`n_treated=25, n_near=120, n_far=55, tau_total=-7.4, delta_1=-4.5, d_bar=100 km`) so naive multi-period TWFE on the full sample understates the direct effect by ~42% — matching the Butts (2021) §4 Table 1 Panel A bias-correction direction documented at `docs/methodology/papers/butts-2021-review.md:257`. SpilloverDiD with `rings=[0.0, 100.0]` cleanly recovers both `tau_total = -7.34` and `delta_1 = -4.53`. The tutorial covers (1) problem framing with TVA / Kline-Moretti (2014) citation, (2) panel construction with the DGP equation inline, (3) naive headline and the bias mechanism, (4) `rings` sensitivity grid at outer edges 50/100/150/200 km (estimates stabilize once `d_bar` covers the true spillover horizon), (5) the headline `SpilloverDiD` fit, (6) Conley spatial-HAC variance via `vcov_type="conley", conley_cutoff_km=100, conley_lag_cutoff∈{0,1}` per Butts §3.1 — showcasing the Spillover-Conley work shipped through PRs #468 / #474 / #477 / #482 / #485 / #489. Drift detection in `tests/test_t23_spillover_tva_drift.py` (18 function-level tests, T19 pattern) pins panel composition, geographic bands, true parameters, naive coefficient endpoint (round-to-1), sensitivity grid endpoints (round-to-1 per reviewer guidance for BLAS safety on borderline-rank-deficient `rings=[0,50]`), exact `tau_total` AND `delta_1` identity across the d_bar ∈ {100, 150, 200} grid (per Butts §4 "once `d_bar` covers the true horizon, widening is benign"), Conley SE divergence from HC1 (direction-pinned: Conley < HC1 on this DGP), and the platform-agnostic post-filter warning surface (T19 pattern: mirrors the notebook's narrow `.*encountered in matmul` `RuntimeWarning` filter inside the capture block and asserts no warnings remain; on Apple Silicon M4 + numpy<2.3 the three known Accelerate BLAS matmul warnings — documented at `TODO.md` "RuntimeWarnings in Linear Algebra Operations" — fire and are filtered, on M3 / Intel / Linux or numpy>=2.3 the filter is a no-op, EITHER WAY any UserWarning / FutureWarning / non-matmul RuntimeWarning surfaces immediately). Per `feedback_t19_drift_guards_test_file_only`, ZERO in-notebook asserts — all numerical guards live in the test file. Per `feedback_notebook_workflow`, the DGP was developed and locked in a temporary `_scratch/` script (gitignored) before being pasted into the notebook §2 cell and duplicated into the drift-test `panel` fixture. `docs/index.rst` toctree updated; `docs/references.rst` gains the Kline-Moretti (2014) entry; `docs/methodology/papers/butts-2021-review.md:257` cross-reference updated from "T22 tutorial" to "T23 tutorial" (slot 22 was occupied by `22_had_survey_design.ipynb`). The `SpilloverDiD T22 TVA tutorial` row in `TODO.md` (renumbered to T23 at delivery) is dropped.
 - **TripleDifference `vcov_type` input contract (Phase 1b interstitial #2, permanently narrow).** `TripleDifference(vcov_type=...)` now accepts `{"hc1"}` only (default). The analytical-sandwich families `{classical, hc2, hc2_bm}` and `conley` spatial-HAC are REJECTED at `__init__` with methodology-rooted messages mirroring the CS interstitial. The rejection is **library-architectural, not paper-prescribed**: TripleDifference uses influence-function-based variance per Ortiz-Villavicencio & Sant'Anna (2025) arXiv:2505.09942 — the 3-pairwise-DiD decomposition `inf = w3·IF_3 + w2·IF_2 - w1·IF_1` has no single design matrix to compute hat-matrix leverage `1/(1-h_ii)` or Bell-McCaffrey Satterthwaite DOF on. The narrow contract is permanent and applies to the remaining IF-based estimators (`ImputationDiD`, `EfficientDiD`) when their `vcov_type` threading PRs land. `hc1` with `cluster=None` ≡ per-unit IF variance (`std(inf)/sqrt(n)`); `hc1` with `cluster=X` ≡ CR1 Liang-Zeger on the combined IF (`(G/(G-1)) · Σ_c (Σ_{i∈c} ψ_i)² / n²`, plain CR1 — no Stata-style `(n-1)/(n-p)` finite-sample factor because the IF has no design-matrix `p` in the OLS sense); `hc1` with `survey_design=` ≡ TSL on the combined IF (analytical or replicate). All three paths are unchanged at machine precision (default behavior bit-equal across all 3 estimation methods `{dr, reg, ipw}`). `vcov_type` and `cluster_name` fields added to `TripleDifferenceResults`, threaded through `to_dict()`. `summary()` routes the variance-family label through the shared `_format_vcov_label` (`results.py:49-89`): bare fits render `"HC1 heteroskedasticity-robust"`, clustered fits render `"CR1 cluster-robust at <cluster_name>, G=<n>"` (since the actual algebra is Liang-Zeger CR1 on the combined IF), and survey-backed fits suppress the variance-estimator line entirely (the Survey Design block already names design + n_psu + df, and the analytical SE is TSL on the combined IF — a raw "hc1" label would misstate the inference path). **`cluster= + SurveyDesign(replicate_weights=[...])` raises `NotImplementedError`** at `fit()`: replicate-weight variance is computed by replicate reweighting (BRR / Fay / JK1 / JKn / SDR) and ignores PSU/cluster entirely; honoring bare `cluster=` would silently have no effect on the variance estimate while populating `cluster_name`/`n_clusters` on Results dishonestly. Mirrors the `CallawaySantAnna` guard from PR #487. Under `survey_design.psu` (non-replicate path) `cluster_name`/`n_clusters` on Results are suppressed (set to None) so they can't misreport the raw cluster argument when the resolver picks the survey PSU instead. `set_params(vcov_type=...)` mirrors CS pattern (mutate-then-validate-at-use, no atomic validation); `fit()` re-validates `vcov_type` at use time so a `set_params(vcov_type="hc4")` mutation surfaces a clear error at fit-time rather than silently propagating to Results metadata. **Interstitial PR #2** (after CS PR #487) rather than full Phase 1b PR 4/8 vcov_type threading — the narrow surface is methodologically dictated by TripleDifference's IF-based variance, not a deferral. New `TestTripleDifferenceVcovType` class in `tests/test_triple_diff.py` covers the 5-surface contract (default/cluster/survey bit-equal, `__init__` rejection per family, `fit()`-time revalidation) plus 8 introspection / convenience-function tests. REGISTRY.md "IF-based variance estimators vs analytical-sandwich estimators" cross-reference section updated to list `TripleDifference` alongside `CallawaySantAnna` in the "Enforced today" tier. Phase 1b PR 4/8 (full `{classical, hc1, hc2, hc2_bm}` threading) resumes on a different estimator (TwoStageDiD) post-merge; the two remaining IF-based estimators (`ImputationDiD`, `EfficientDiD`) follow the same narrow-contract template.
 - **CallawaySantAnna `vcov_type` input contract (Phase 1b interstitial, permanently narrow).** `CallawaySantAnna(vcov_type=...)` now accepts `{"hc1"}` only (default). The analytical-sandwich families `{classical, hc2, hc2_bm}` and `conley` spatial-HAC are REJECTED at `__init__` with methodology-rooted messages. The rejection is **library-architectural, not paper-prescribed**: CS uses influence-function-based variance per Callaway & Sant'Anna (2021) — per-(g,t) doubly-robust / IPW / outcome-regression structure — and has no single design matrix to compute hat-matrix leverage `1/(1-h_ii)` or Bell-McCaffrey Satterthwaite DOF on. The narrow contract is permanent and applies to other IF-based estimators (ImputationDiD, EfficientDiD) when their `vcov_type` threading PRs land. `hc1` with `cluster=None` ≡ per-unit IF variance (Williams 2000 form); `hc1` with `cluster=X` ≡ CR1 Liang-Zeger on the IF activated via the cluster= wiring fix above. Documentation in `docs/methodology/REGISTRY.md` "IF-based variance estimators vs analytical-sandwich estimators" subsection. `vcov_type`, `cluster_name`, `n_clusters`, `df_inference` added to `CallawaySantAnnaResults` (the canonical PSU column wins for `cluster_name` reporting — `survey_design.psu` when explicit PSU is provided, `self.cluster` when bare cluster synthesizes/injects). `set_params(vcov_type=...)` mirrors SA pattern (mutate-then-refresh `_vcov_type_explicit`, no atomic validation); `fit()` re-validates `vcov_type` at use time so a `set_params(vcov_type="hc4")` mutation surfaces a clear error at fit-time rather than silently propagating to Results metadata. **Interstitial PR** rather than full Phase 1b PR 4/8 vcov_type threading — the narrow surface is methodologically dictated by CS's IF-based variance, not a deferral. Phase 1b PR 4/8 (full {classical, hc1, hc2, hc2_bm} threading) resumes on a different estimator post-merge.
 - **TripleDifference cluster-changes-SE defensive regression test.** Added `tests/test_triple_diff.py::TestTripleDifferenceClusterDefensive::test_cluster_changes_ses` asserting that `TripleDifference(cluster="state")` produces SE differing from `cluster=None` SE by `>1e-6` on a fixed-seed panel with state-level random effects. Defensive coverage closes a test gap identified during the Phase 1b cluster-wiring audit; TripleDifference's bare-cluster code path (`triple_diff.py:1245-1259`) was already correct but lacked a positive regression test. Mirrors `tests/test_two_stage.py::test_cluster_changes_ses`.
diff --git a/docs/tutorials/README.md b/docs/tutorials/README.md
@@ -119,6 +119,14 @@ End-to-end HAD walkthrough on a BRFSS-shape stratified survey design (5 strata x
 - `did_had_pretest_workflow` on both overall and event-study paths under `survey_design=`, walking the Phase 4.5 C0 QUG-deferred verdict suffix and the stratified-clustered Stute multiplier bootstrap
 - Companion drift-test file (`tests/test_t22_had_survey_design_drift.py`)
 
+### 23. SpilloverDiD on a TVA-style Spillover Panel (`23_spillover_tva.ipynb`)
+Practitioner workflow for `SpilloverDiD` (Butts 2021 ring-indicator estimator + Gardner 2022 two-stage residualize-then-fit) on a synthetic TVA-style panel (4 periods, 200 units = 25 treated + 120 near-control + 55 far-control) reproducing the Butts §4 Table 1 Panel A ~40% understatement direction:
+- When place-based interventions cause geographic spillovers, naive multi-period TWFE on the full sample understates the direct effect because near-controls absorb the spillover (here: ATT recovers as -4.29 vs true `tau_total = -7.4`, a 42% understatement)
+- `SpilloverDiD(rings=[0.0, 100.0], conley_coords=("lat", "lon"))` cleanly recovers both `tau_total` (-7.34) and the near-band spillover coefficient `delta_1` (-4.53)
+- Choosing the spillover bandwidth via a `rings` sensitivity grid at outer edges 50 / 100 / 150 / 200 km, with the documented "undershooting `d_bar`" failure mode at 50 km
+- Conley spatial-HAC variance under `vcov_type="conley", conley_cutoff_km=100, conley_lag_cutoff in {0, 1}` per Butts §3.1
+- Companion drift-test file (`tests/test_t23_spillover_tva_drift.py`)
+
 ## Running the Notebooks
 
 1. Install diff-diff with dependencies: