spillover-tva: address CI codex R6 P2 — notebook constant-sync guard

igerber · claude · igerber · commit 948627bdefbc · 2026-05-25T14:08:17.000-04:00
R6 noted that my AST sync test only compared the `build_t23_panel`
function body, but the notebook §2 cell also has 10 cell-level
constant assignments above the function (MAIN_SEED, N_TREATED,
N_NEAR, N_FAR, T_PERIODS, FIRST_TREAT, TAU_TOTAL, DELTA_1, D_BAR_KM,
NOISE_SD). A notebook-only edit to any of those would change tutorial
behavior without failing either `test_notebook_dgp_ast_matches_test_fixture`
(only compares function body) or `test_dgp_true_parameters_match_quoted`
(only reasserts the test module's own constants).

Adds `test_notebook_dgp_constants_match_test_module_constants`: parses
the notebook §2 cell, walks the top-level `ast.Assign` nodes,
literal-evaluates each RHS, and asserts the 10 expected constants
have values matching the test module. Any notebook-only constant
drift now fails the guard with a (notebook_value, test_value) diff
in the error message.

CHANGELOG count bumped 20 → 21.

Co-Authored-By: Claude Opus 4.7 (1M context) &lt;noreply@anthropic.com&gt;
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -11,7 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **`CallawaySantAnna.cluster=` silent no-op (Phase 1b interstitial).** `CallawaySantAnna(cluster="state").fit(...)` previously accepted the argument, stored it, returned it from `get_params()`, but never consumed it anywhere in the fit / aggregator / bootstrap pipeline (`staggered.py:154-156` docstring claimed "Defaults to unit-level clustering" — but for bare `cluster=X`, the aggregator at `staggered_aggregation.py:193-213` computed per-unit IF variance regardless, and the bootstrap at `staggered_bootstrap.py:323-347` drew per-unit multiplier weights regardless). Users who explicitly set `cluster="state"` got per-unit inference with no warning — typically SE too small under intra-cluster correlation. **Survey-PSU clustering via `survey_design=SurveyDesign(psu="state")` was NOT affected** and continued to cluster correctly via `_compute_stratified_psu_meat`. The fix synthesizes a minimal `SurveyDesign(psu=self.cluster, weight_type="pweight")` when bare `cluster=` is set without an explicit survey design, threading the synthesized PSU through the existing survey-PSU machinery (aggregator + bootstrap). A new dedicated `df_inference` field on `CallawaySantAnnaResults` carries the cluster-level df for the bare-cluster-synthesize path ONLY (where `survey_metadata` is intentionally `None` to preserve the `DiagnosticReport.survey_metadata is not None` skip at `diagnostic_report.py:848-856` + `:1150-1158` for "Original fit used a survey design" reasoning, and the `summary()` survey block render at `staggered_results.py:235-238`). `HonestDiD` at `honest_did.py` prefers `survey_metadata.df_survey` first (the actual CS-internal df, which may be tightened post-resolve for replicate designs) and falls back to `df_inference` for bare-cluster fits — so downstream consumers always see the cluster df without overriding the post-recompute survey df. When `survey_design=SurveyDesign(weights=Y)` without PSU is provided AND `cluster=X` is also set, `_inject_cluster_as_psu` injects the bare cluster as the effective PSU AND an `effective_survey_design = replace(survey_design, psu=self.cluster)` is constructed so the downstream `_validate_unit_constant_survey` catches movers (units crossing clusters across periods) on panel data via the now-PSU-bearing design; `survey_metadata` is recomputed to reflect the injected PSU. When both `cluster=X` AND `survey_design.psu=Y` are set, the explicit PSU wins via `_resolve_effective_cluster` (emits `UserWarning` if partitions differ). **`cluster= + SurveyDesign(replicate_weights=[...])` raises `NotImplementedError`**: replicate-weight variance is computed by replicate reweighting (BRR / Fay / JK1 / JKn / SDR) and ignores PSU/cluster entirely (`survey.py:104-109` enforces replicate_weights are mutually exclusive with strata/psu/fpc); honoring bare `cluster=` would silently have no effect while populating `cluster_name`/`n_clusters` on Results dishonestly. Assertive regression tests pin the fix on both panel and repeated-cross-section paths plus the survey/non-survey contract boundaries: `test_cluster_robust_ses_differ_from_unit_level`, `test_bare_cluster_works_with_panel_false_rcs`, `test_bare_cluster_synthesizes_survey_design`, `test_inject_branch_panel_mover_raises`, `test_replicate_weight_plus_cluster_rejected`, `test_bare_cluster_populates_df_inference` (asserts the dedicated cluster-df carrier is set), `test_bare_cluster_does_not_set_survey_metadata` (asserts the survey/non-survey contract is preserved — DiagnosticReport / summary() must not treat a bare-cluster fit as survey-backed), `test_explicit_survey_design_does_populate_survey_metadata` (asserts the inject-branch path still populates survey_metadata for legitimate user-provided SurveyDesign), and `test_bare_cluster_honest_did_uses_df_inference` (end-to-end: HonestDiD threads df_inference into HonestDiDResults.df_survey, preventing silent normal-theory regression on a future refactor). When `cluster=None` (default), behavior is bit-equal to pre-PR (wiring guarded by `if self.cluster is not None:`). Audit verified the no-op was CS-specific — the other 7 Phase 1b estimators (SunAbraham, StackedDiD, WooldridgeDiD, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD) handle bare `cluster=` correctly.
 
 ### Added
-- **New tutorial: SpilloverDiD on synthetic TVA-style spillover panel (Butts 2021 §4 analogue) with spillover-bandwidth sensitivity and Conley spatial-HAC inference.** `docs/tutorials/23_spillover_tva.ipynb` walks a practitioner through the SpilloverDiD workflow on a 4-period 200-unit panel laid out as treated cluster + near-control band + far-control band. The DGP is tuned at locked seed 23 (`n_treated=25, n_near=120, n_far=55, tau_total=-7.4, delta_1=-4.5, d_bar=100 km`) so naive multi-period TWFE on the full sample understates the direct effect by ~42% — matching the Butts (2021) §4 Table 1 Panel A bias-correction direction documented at `docs/methodology/papers/butts-2021-review.md:257`. SpilloverDiD with `rings=[0.0, 100.0]` cleanly recovers both `tau_total = -7.34` and `delta_1 = -4.53`. The tutorial covers (1) problem framing with TVA / Kline-Moretti (2014) citation, (2) panel construction with the DGP equation inline, (3) naive headline and the bias mechanism, (4) `rings` sensitivity grid at outer edges 50/100/150/200 km (estimates stabilize once `d_bar` covers the true spillover horizon), (5) the headline `SpilloverDiD` fit, (6) Conley spatial-HAC variance via `vcov_type="conley", conley_cutoff_km=100, conley_lag_cutoff∈{0,1}` per Butts §3.1 — showcasing the Spillover-Conley work shipped through PRs #468 / #474 / #477 / #482 / #485 / #489. Drift detection in `tests/test_t23_spillover_tva_drift.py` (20 function-level tests, T19 pattern) pins panel composition, geographic bands, true parameters, naive coefficient endpoint (round-to-1), sensitivity grid endpoints (round-to-1 per reviewer guidance for BLAS safety on borderline-rank-deficient `rings=[0,50]`), exact `tau_total` AND `delta_1` identity across the d_bar ∈ {100, 150, 200} grid (per Butts §4 "once `d_bar` covers the true horizon, widening is benign"), Conley SE divergence from HC1 (direction-pinned: Conley < HC1 on this DGP), and the platform-agnostic post-filter warning surface (T19 pattern: mirrors the notebook's narrow `.*encountered in matmul` `RuntimeWarning` filter inside the capture block and asserts no warnings remain; on Apple Silicon M4 + numpy<2.3 the three known Accelerate BLAS matmul warnings — documented at `TODO.md` "RuntimeWarnings in Linear Algebra Operations" — fire and are filtered, on M3 / Intel / Linux or numpy>=2.3 the filter is a no-op, EITHER WAY any UserWarning / FutureWarning / non-matmul RuntimeWarning surfaces immediately). Per `feedback_t19_drift_guards_test_file_only`, ZERO in-notebook asserts — all numerical guards live in the test file. Per `feedback_notebook_workflow`, the DGP was developed and locked in a temporary `_scratch/` script (gitignored) before being pasted into the notebook §2 cell and duplicated into the drift-test `panel` fixture. `docs/index.rst` toctree updated; `docs/references.rst` gains the Kline-Moretti (2014) entry; `docs/methodology/papers/butts-2021-review.md:257` cross-reference updated from "T22 tutorial" to "T23 tutorial" (slot 22 was occupied by `22_had_survey_design.ipynb`). The `SpilloverDiD T22 TVA tutorial` row in `TODO.md` (renumbered to T23 at delivery) is dropped.
+- **New tutorial: SpilloverDiD on synthetic TVA-style spillover panel (Butts 2021 §4 analogue) with spillover-bandwidth sensitivity and Conley spatial-HAC inference.** `docs/tutorials/23_spillover_tva.ipynb` walks a practitioner through the SpilloverDiD workflow on a 4-period 200-unit panel laid out as treated cluster + near-control band + far-control band. The DGP is tuned at locked seed 23 (`n_treated=25, n_near=120, n_far=55, tau_total=-7.4, delta_1=-4.5, d_bar=100 km`) so naive multi-period TWFE on the full sample understates the direct effect by ~42% — matching the Butts (2021) §4 Table 1 Panel A bias-correction direction documented at `docs/methodology/papers/butts-2021-review.md:257`. SpilloverDiD with `rings=[0.0, 100.0]` cleanly recovers both `tau_total = -7.34` and `delta_1 = -4.53`. The tutorial covers (1) problem framing with TVA / Kline-Moretti (2014) citation, (2) panel construction with the DGP equation inline, (3) naive headline and the bias mechanism, (4) `rings` sensitivity grid at outer edges 50/100/150/200 km (estimates stabilize once `d_bar` covers the true spillover horizon), (5) the headline `SpilloverDiD` fit, (6) Conley spatial-HAC variance via `vcov_type="conley", conley_cutoff_km=100, conley_lag_cutoff∈{0,1}` per Butts §3.1 — showcasing the Spillover-Conley work shipped through PRs #468 / #474 / #477 / #482 / #485 / #489. Drift detection in `tests/test_t23_spillover_tva_drift.py` (21 function-level tests, T19 pattern) pins panel composition, geographic bands, true parameters, naive coefficient endpoint (round-to-1), sensitivity grid endpoints (round-to-1 per reviewer guidance for BLAS safety on borderline-rank-deficient `rings=[0,50]`), exact `tau_total` AND `delta_1` identity across the d_bar ∈ {100, 150, 200} grid (per Butts §4 "once `d_bar` covers the true horizon, widening is benign"), Conley SE divergence from HC1 (direction-pinned: Conley < HC1 on this DGP), and the platform-agnostic post-filter warning surface (T19 pattern: mirrors the notebook's narrow `.*encountered in matmul` `RuntimeWarning` filter inside the capture block and asserts no warnings remain; on Apple Silicon M4 + numpy<2.3 the three known Accelerate BLAS matmul warnings — documented at `TODO.md` "RuntimeWarnings in Linear Algebra Operations" — fire and are filtered, on M3 / Intel / Linux or numpy>=2.3 the filter is a no-op, EITHER WAY any UserWarning / FutureWarning / non-matmul RuntimeWarning surfaces immediately). Per `feedback_t19_drift_guards_test_file_only`, ZERO in-notebook asserts — all numerical guards live in the test file. Per `feedback_notebook_workflow`, the DGP was developed and locked in a temporary `_scratch/` script (gitignored) before being pasted into the notebook §2 cell and duplicated into the drift-test `panel` fixture. `docs/index.rst` toctree updated; `docs/references.rst` gains the Kline-Moretti (2014) entry; `docs/methodology/papers/butts-2021-review.md:257` cross-reference updated from "T22 tutorial" to "T23 tutorial" (slot 22 was occupied by `22_had_survey_design.ipynb`). The `SpilloverDiD T22 TVA tutorial` row in `TODO.md` (renumbered to T23 at delivery) is dropped.
 - **TripleDifference `vcov_type` input contract (Phase 1b interstitial #2, permanently narrow).** `TripleDifference(vcov_type=...)` now accepts `{"hc1"}` only (default). The analytical-sandwich families `{classical, hc2, hc2_bm}` and `conley` spatial-HAC are REJECTED at `__init__` with methodology-rooted messages mirroring the CS interstitial. The rejection is **library-architectural, not paper-prescribed**: TripleDifference uses influence-function-based variance per Ortiz-Villavicencio & Sant'Anna (2025) arXiv:2505.09942 — the 3-pairwise-DiD decomposition `inf = w3·IF_3 + w2·IF_2 - w1·IF_1` has no single design matrix to compute hat-matrix leverage `1/(1-h_ii)` or Bell-McCaffrey Satterthwaite DOF on. The narrow contract is permanent and applies to the remaining IF-based estimators (`ImputationDiD`, `EfficientDiD`) when their `vcov_type` threading PRs land. `hc1` with `cluster=None` ≡ per-unit IF variance (`std(inf)/sqrt(n)`); `hc1` with `cluster=X` ≡ CR1 Liang-Zeger on the combined IF (`(G/(G-1)) · Σ_c (Σ_{i∈c} ψ_i)² / n²`, plain CR1 — no Stata-style `(n-1)/(n-p)` finite-sample factor because the IF has no design-matrix `p` in the OLS sense); `hc1` with `survey_design=` ≡ TSL on the combined IF (analytical or replicate). All three paths are unchanged at machine precision (default behavior bit-equal across all 3 estimation methods `{dr, reg, ipw}`). `vcov_type` and `cluster_name` fields added to `TripleDifferenceResults`, threaded through `to_dict()`. `summary()` routes the variance-family label through the shared `_format_vcov_label` (`results.py:49-89`): bare fits render `"HC1 heteroskedasticity-robust"`, clustered fits render `"CR1 cluster-robust at <cluster_name>, G=<n>"` (since the actual algebra is Liang-Zeger CR1 on the combined IF), and survey-backed fits suppress the variance-estimator line entirely (the Survey Design block already names design + n_psu + df, and the analytical SE is TSL on the combined IF — a raw "hc1" label would misstate the inference path). **`cluster= + SurveyDesign(replicate_weights=[...])` raises `NotImplementedError`** at `fit()`: replicate-weight variance is computed by replicate reweighting (BRR / Fay / JK1 / JKn / SDR) and ignores PSU/cluster entirely; honoring bare `cluster=` would silently have no effect on the variance estimate while populating `cluster_name`/`n_clusters` on Results dishonestly. Mirrors the `CallawaySantAnna` guard from PR #487. Under `survey_design.psu` (non-replicate path) `cluster_name`/`n_clusters` on Results are suppressed (set to None) so they can't misreport the raw cluster argument when the resolver picks the survey PSU instead. `set_params(vcov_type=...)` mirrors CS pattern (mutate-then-validate-at-use, no atomic validation); `fit()` re-validates `vcov_type` at use time so a `set_params(vcov_type="hc4")` mutation surfaces a clear error at fit-time rather than silently propagating to Results metadata. **Interstitial PR #2** (after CS PR #487) rather than full Phase 1b PR 4/8 vcov_type threading — the narrow surface is methodologically dictated by TripleDifference's IF-based variance, not a deferral. New `TestTripleDifferenceVcovType` class in `tests/test_triple_diff.py` covers the 5-surface contract (default/cluster/survey bit-equal, `__init__` rejection per family, `fit()`-time revalidation) plus 8 introspection / convenience-function tests. REGISTRY.md "IF-based variance estimators vs analytical-sandwich estimators" cross-reference section updated to list `TripleDifference` alongside `CallawaySantAnna` in the "Enforced today" tier. Phase 1b PR 4/8 (full `{classical, hc1, hc2, hc2_bm}` threading) resumes on a different estimator (TwoStageDiD) post-merge; the two remaining IF-based estimators (`ImputationDiD`, `EfficientDiD`) follow the same narrow-contract template.
 - **CallawaySantAnna `vcov_type` input contract (Phase 1b interstitial, permanently narrow).** `CallawaySantAnna(vcov_type=...)` now accepts `{"hc1"}` only (default). The analytical-sandwich families `{classical, hc2, hc2_bm}` and `conley` spatial-HAC are REJECTED at `__init__` with methodology-rooted messages. The rejection is **library-architectural, not paper-prescribed**: CS uses influence-function-based variance per Callaway & Sant'Anna (2021) — per-(g,t) doubly-robust / IPW / outcome-regression structure — and has no single design matrix to compute hat-matrix leverage `1/(1-h_ii)` or Bell-McCaffrey Satterthwaite DOF on. The narrow contract is permanent and applies to other IF-based estimators (ImputationDiD, EfficientDiD) when their `vcov_type` threading PRs land. `hc1` with `cluster=None` ≡ per-unit IF variance (Williams 2000 form); `hc1` with `cluster=X` ≡ CR1 Liang-Zeger on the IF activated via the cluster= wiring fix above. Documentation in `docs/methodology/REGISTRY.md` "IF-based variance estimators vs analytical-sandwich estimators" subsection. `vcov_type`, `cluster_name`, `n_clusters`, `df_inference` added to `CallawaySantAnnaResults` (the canonical PSU column wins for `cluster_name` reporting — `survey_design.psu` when explicit PSU is provided, `self.cluster` when bare cluster synthesizes/injects). `set_params(vcov_type=...)` mirrors SA pattern (mutate-then-refresh `_vcov_type_explicit`, no atomic validation); `fit()` re-validates `vcov_type` at use time so a `set_params(vcov_type="hc4")` mutation surfaces a clear error at fit-time rather than silently propagating to Results metadata. **Interstitial PR** rather than full Phase 1b PR 4/8 vcov_type threading — the narrow surface is methodologically dictated by CS's IF-based variance, not a deferral. Phase 1b PR 4/8 (full {classical, hc1, hc2, hc2_bm} threading) resumes on a different estimator post-merge.
 - **TripleDifference cluster-changes-SE defensive regression test.** Added `tests/test_triple_diff.py::TestTripleDifferenceClusterDefensive::test_cluster_changes_ses` asserting that `TripleDifference(cluster="state")` produces SE differing from `cluster=None` SE by `>1e-6` on a fixed-seed panel with state-level random effects. Defensive coverage closes a test gap identified during the Phase 1b cluster-wiring audit; TripleDifference's bare-cluster code path (`triple_diff.py:1245-1259`) was already correct but lacked a positive regression test. Mirrors `tests/test_two_stage.py::test_cluster_changes_ses`.
diff --git a/tests/test_t23_spillover_tva_drift.py b/tests/test_t23_spillover_tva_drift.py
@@ -363,6 +363,74 @@ def test_summary_renders_without_warning(spillover_fit):
     assert len(out) > 0
 
 
+def test_notebook_dgp_constants_match_test_module_constants():
+    """P2 sync guard: codex R6 caught that `test_notebook_dgp_ast_matches_test_fixture`
+    only compares the function body, and `test_dgp_true_parameters_match_quoted`
+    only reasserts the test module's own constants. A notebook-only edit
+    to `MAIN_SEED`, `N_TREATED`, `N_NEAR`, `N_FAR`, `T_PERIODS`,
+    `FIRST_TREAT`, `TAU_TOTAL`, `DELTA_1`, `D_BAR_KM`, or `NOISE_SD`
+    would change tutorial behavior without failing either of those
+    tests.
+
+    Parses the notebook §2 cell, walks the top-level
+    ``Assign`` nodes, and asserts the value of each expected constant
+    matches the test module's value. Any notebook-only constant edit
+    now fails this guard."""
+    import ast
+    import json
+    from pathlib import Path
+
+    nb_path = Path(__file__).resolve().parents[1] / "docs" / "tutorials" / "23_spillover_tva.ipynb"
+    with nb_path.open() as f:
+        nb = json.load(f)
+
+    matches = [
+        c
+        for c in nb["cells"]
+        if c["cell_type"] == "code" and any("def build_t23_panel" in s for s in c["source"])
+    ]
+    assert len(matches) == 1, (
+        f"Expected exactly one notebook code cell defining `build_t23_panel`; "
+        f"found {len(matches)}."
+    )
+    nb_cell_src = "".join(matches[0]["source"])
+
+    # Walk top-level Assigns in the cell and collect a constant -> value dict.
+    nb_consts: dict = {}
+    tree = ast.parse(nb_cell_src)
+    for node in tree.body:
+        if isinstance(node, ast.Assign) and len(node.targets) == 1:
+            target = node.targets[0]
+            if isinstance(target, ast.Name):
+                try:
+                    nb_consts[target.id] = ast.literal_eval(node.value)
+                except (ValueError, SyntaxError):
+                    pass  # non-literal RHS; skip
+
+    expected = {
+        "MAIN_SEED": MAIN_SEED,
+        "N_TREATED": N_TREATED,
+        "N_NEAR": N_NEAR,
+        "N_FAR": N_FAR,
+        "T_PERIODS": T_PERIODS,
+        "FIRST_TREAT": FIRST_TREAT,
+        "TAU_TOTAL": TAU_TOTAL,
+        "DELTA_1": DELTA_1,
+        "D_BAR_KM": D_BAR_KM,
+        "NOISE_SD": NOISE_SD,
+    }
+    missing = [k for k in expected if k not in nb_consts]
+    assert not missing, (
+        f"Notebook §2 cell missing expected constant assignments: {missing}. "
+        f"If a constant was renamed or moved, update both notebook and test."
+    )
+    mismatched = {k: (nb_consts[k], expected[k]) for k in expected if nb_consts[k] != expected[k]}
+    assert not mismatched, (
+        f"Notebook §2 constants drifted from test module constants: {mismatched}. "
+        f"Each entry is (notebook_value, test_value). Update both to match."
+    )
+
+
 def test_notebook_dgp_ast_matches_test_fixture():
     """P2 sync guard: enforces the "verbatim" duplication claim by
     parsing the notebook's §2 ``build_t23_panel`` definition and