Merge pull request #439 from igerber/fix-audit-409-r2

igerber · web-flow · commit 931feb24839d · 2026-05-15T06:03:56.000-04:00
Fix #409 holistic audit residuals: T20+T21 notebook cross-check + TODO status
diff --git a/TODO.md b/TODO.md
@@ -112,7 +112,7 @@ Deferred items from PR reviews that were not addressed before merge.
 | `HeterogeneousAdoptionDiD` Phase 3 R-parity: Phase 3 ships coverage-rate validation on synthetic DGPs (not tight point parity against `chaisemartin::stute_test` / `yatchew_test`). Tight numerical parity requires aligning bootstrap seed semantics and `B` across numpy/R and is deferred. | `tests/test_had_pretests.py` | Phase 3 | Low |
 | `HeterogeneousAdoptionDiD` Phase 3 nprobust bandwidth for Stute: some Stute variants on continuous regressors use nprobust-style optimal bandwidth selection. Phase 3 uses OLS residuals from a 2-parameter linear fit (no bandwidth selection). nprobust integration is a future enhancement; not in paper scope. | `diff_diff/had_pretests.py::stute_test` | Phase 3 | Low |
 | `HeterogeneousAdoptionDiD` Phase 4: Pierce-Schott (2016) replication harness; reproduce paper Figure 2 values and Table 1 coverage rates. | `benchmarks/`, `tests/` | Phase 2a | Low |
-| `HeterogeneousAdoptionDiD` Phase 5 follow-up tutorial (T22 weighted/survey HAD tutorial). T21 HAD pretest workflow notebook landed (PR-pending); `practitioner_next_steps()` HAD handlers + `llms-full.txt` HeterogeneousAdoptionDiD section + Choosing-an-Estimator row landed in Phase 5 wave 1 (PR #402). | `tutorials/`, `tests/test_t22_*_drift.py` | Phase 2a | Low |
+| `HeterogeneousAdoptionDiD` Phase 5 follow-up tutorial (T22 weighted/survey HAD tutorial). T21 HAD pretest workflow notebook landed in PR #409; `practitioner_next_steps()` HAD handlers + `llms-full.txt` HeterogeneousAdoptionDiD section + Choosing-an-Estimator row landed in Phase 5 wave 1 (PR #402). | `tutorials/`, `tests/test_t22_*_drift.py` | Phase 2a | Low |
 | `HeterogeneousAdoptionDiD` time-varying dose on event study: Phase 2b REJECTS panels where `D_{g,t}` varies within a unit for `t >= F` (the aggregation uses `D_{g, F}` as the single regressor for all horizons, paper Appendix B.2 constant-dose convention). A follow-up PR could add a time-varying-dose estimator for these panels; current behavior is front-door rejection with a redirect to `ChaisemartinDHaultfoeuille`. | `diff_diff/had.py::_validate_had_panel_event_study` | Phase 2b | Low |
 | `HeterogeneousAdoptionDiD` repeated-cross-section support: paper Section 2 defines HAD on panel OR repeated cross-section, but Phase 2a is panel-only. RCS inputs (disjoint unit IDs between periods) are rejected by the balanced-panel validator with the generic "unit(s) do not appear in both periods" error. A follow-up PR will add an RCS identification path based on pre/post cell means (rather than unit-level first differences), with its own validator and a distinct `data_mode` / API surface. | `diff_diff/had.py::_validate_had_panel`, `diff_diff/had.py::_aggregate_first_difference` | Phase 2a | Medium |
 | SyntheticDiD: bootstrap cross-language parity anchor against R's default `synthdid::vcov(method="bootstrap")` (refit; rebinds `opts` per draw) or Julia `Synthdid.jl::src/vcov.jl::bootstrap_se` (refit by construction). Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; a cross-language anchor is desirable to bolster the methodology contract. Julia is the cleanest target — minimal wrapping work and refit-native vcov. Tolerance target: 1e-6 on Monte Carlo samples (different BLAS + RNG paths preclude 1e-10). The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path. | `benchmarks/R/`, `benchmarks/julia/`, `tests/` | follow-up | Low |
diff --git a/tests/_tutorial_drift.py b/tests/_tutorial_drift.py
@@ -0,0 +1,129 @@
+"""Shared helpers for tutorial-drift tests (T20, T21, ...).
+
+The HAD tutorial drift tests pin numbers / verdict strings against the
+locked DGP + seed. Before these helpers existed, each drift test
+re-derived numbers but never verified that the rendered notebook surface
+(markdown prose + executed output cells) actually quoted those values. Because
+``nbsphinx_execute = "never"`` in ``docs/conf.py``, CI cannot detect
+drift between the pinned constants and the committed tutorial via
+notebook re-execution; the constants and the notebook can diverge
+silently. These helpers parse the .ipynb JSON directly so each
+tutorial-drift test file can cross-check its pins against the
+rendered surface it claims to protect.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Iterable
+
+
+def _read_notebook(nb_relpath: str) -> dict:
+    """Load a notebook by repo-relative path (e.g. ``docs/tutorials/X.ipynb``).
+
+    Skips the calling test via ``pytest.skip(...)`` when the notebook file
+    is not present. The Rust-test and isolated-install CI jobs copy only
+    ``tests/`` to ``/tmp/tests`` and run from there, so ``docs/`` is not
+    available. The repo convention is to skip cleanly when
+    artifacts are absent rather than fail (see e.g.
+    ``tests/test_notebook_md_extract.py`` and ``tests/test_nprobust_port.py``).
+    """
+    import pytest
+
+    nb_path = Path(__file__).resolve().parents[1] / nb_relpath
+    if not nb_path.exists():
+        pytest.skip(
+            f"Notebook {nb_relpath!r} not available in this environment "
+            "(some CI jobs copy only tests/, not docs/); "
+            "rendered-surface cross-check requires a full repo checkout."
+        )
+    return json.loads(nb_path.read_text(encoding="utf-8"))
+
+
+def notebook_markdown(nb_relpath: str) -> str:
+    """Return all markdown cells concatenated into one string."""
+    nb = _read_notebook(nb_relpath)
+    parts = []
+    for cell in nb["cells"]:
+        if cell["cell_type"] != "markdown":
+            continue
+        src = cell["source"]
+        if isinstance(src, list):
+            src = "".join(src)
+        parts.append(src)
+    return "\n".join(parts)
+
+
+def notebook_output_text(nb_relpath: str) -> str:
+    """Return all executed-output text (``stream`` and ``execute_result``
+    text/plain) from every code cell, concatenated.
+
+    Covers the rendered numeric surface that markdown alone misses —
+    e.g. printed verdict strings, formatted summary tables, p-values.
+    """
+    nb = _read_notebook(nb_relpath)
+    parts = []
+    for cell in nb["cells"]:
+        if cell["cell_type"] != "code":
+            continue
+        for out in cell.get("outputs", []):
+            # stream-style outputs (print / stdout / stderr)
+            text = out.get("text")
+            if text is not None:
+                parts.append("".join(text) if isinstance(text, list) else text)
+            # execute_result / display_data with text/plain
+            data = out.get("data") or {}
+            plain = data.get("text/plain")
+            if plain is not None:
+                parts.append("".join(plain) if isinstance(plain, list) else plain)
+    return "\n".join(parts)
+
+
+def notebook_rendered_text(nb_relpath: str) -> str:
+    """Return markdown + executed-output text together — the full
+    rendered surface a reader sees on RTD."""
+    return notebook_markdown(nb_relpath) + "\n" + notebook_output_text(nb_relpath)
+
+
+def assert_quotes_in_rendered(
+    nb_relpath: str,
+    expected_quotes: Iterable[str],
+    *,
+    surface: str = "rendered",
+) -> None:
+    """Assert each expected substring appears in the chosen rendered surface.
+
+    Parameters
+    ----------
+    nb_relpath
+        Notebook path relative to repo root (e.g.
+        ``"docs/tutorials/21_had_pretest_workflow.ipynb"``).
+    expected_quotes
+        Iterable of substrings that MUST appear in the chosen rendered
+        surface. Each is checked independently; the assertion message
+        lists every missing quote so a single failure surfaces all of
+        them.
+    surface
+        Which slice of the notebook to check: ``"markdown"`` (prose
+        only), ``"output"`` (executed output cells only), or
+        ``"rendered"`` (both — default; matches what a reader sees
+        on RTD).
+    """
+    if surface == "markdown":
+        text = notebook_markdown(nb_relpath)
+    elif surface == "output":
+        text = notebook_output_text(nb_relpath)
+    elif surface == "rendered":
+        text = notebook_rendered_text(nb_relpath)
+    else:
+        raise ValueError(f"surface must be 'markdown' / 'output' / 'rendered'; got {surface!r}")
+    missing = [q for q in expected_quotes if q not in text]
+    assert not missing, (
+        f"Tutorial {nb_relpath!r} ({surface=}) is missing load-bearing "
+        f"quoted values that the pinned drift constants assume are "
+        f"rendered verbatim. Either the notebook drifted from the "
+        f"locked DGP output (rerun the tutorial against the pinned "
+        f"seed) or the drift-test constants were updated without "
+        f"updating the tutorial. Missing: {missing}"
+    )
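
The extraction logic in `tests/_tutorial_drift.py` can be illustrated on an in-memory notebook. The sketch below is not the helper itself: it assumes the nbformat v4 layout (`cells`, `cell_type`, `source`, `outputs`, `text`, `data["text/plain"]`), uses a made-up toy notebook, and the names `TOY_NOTEBOOK`, `_join`, and `rendered_text` are hypothetical. Unlike the helper, which concatenates all markdown first and all outputs second, this version keeps cell order.

```python
# Toy notebook in the nbformat v4 JSON layout; the contents are invented
# for illustration and do not come from the real tutorials.
TOY_NOTEBOOK = {
    "cells": [
        {
            "cell_type": "markdown",
            # "source" may be a list of line fragments ...
            "source": ["The estimate is ", "about 100 weekly visits."],
        },
        {
            "cell_type": "code",
            # ... or a single string.
            "source": "print(design)",
            "outputs": [
                {"output_type": "stream", "name": "stdout",
                 "text": ["continuous_near_d_lower\n"]},
                {"output_type": "execute_result",
                 "data": {"text/plain": "0.2059"}},
            ],
        },
    ]
}


def _join(value):
    """Normalize a str-or-list-of-str field to a single string."""
    return "".join(value) if isinstance(value, list) else value


def rendered_text(nb: dict) -> str:
    """Collect markdown prose and executed-output text, skipping code source."""
    parts = []
    for cell in nb["cells"]:
        if cell["cell_type"] == "markdown":
            parts.append(_join(cell["source"]))
        elif cell["cell_type"] == "code":
            for out in cell.get("outputs", []):
                if "text" in out:  # stream outputs (stdout / stderr)
                    parts.append(_join(out["text"]))
                plain = out.get("data", {}).get("text/plain")
                if plain is not None:  # execute_result / display_data
                    parts.append(_join(plain))
    return "\n".join(parts)


text = rendered_text(TOY_NOTEBOOK)
```

Code-cell source is deliberately excluded, so a value that only appears in the tutorial's input code does not count as rendered.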
diff --git a/tests/test_t20_had_brand_campaign_drift.py b/tests/test_t20_had_brand_campaign_drift.py
@@ -270,3 +270,53 @@ def test_event_study_pre_placebos_cover_zero(event_study_result):
         i = event_times.index(e)
         assert abs(atts[i]) < 0.1, (e, atts[i])
         assert ci_lows[i] <= 0.0 <= ci_highs[i], (e, ci_lows[i], ci_highs[i])
+
+
+# =============================================================================
+# Notebook-narrative cross-check
+# =============================================================================
+#
+# Mirror of the T21 notebook cross-check (PR #409 holistic re-audit).
+# The asserts above re-derive numbers from the locked DGP+seed but
+# never verify that the rendered T20 tutorial actually quotes those
+# same numbers. Because ``nbsphinx_execute = "never"`` in
+# ``docs/conf.py``, CI cannot detect drift between the pinned drift
+# constants and the committed tutorial via notebook re-execution.
+# Use the shared helper from ``tests/_tutorial_drift.py`` to assert
+# each load-bearing quote is present in the rendered notebook
+# surface (markdown prose + executed output cells).
+
+
+T20_NOTEBOOK = "docs/tutorials/20_had_brand_campaign.ipynb"
+
+
+def test_notebook_quotes_match_pinned_constants():
+    """Every load-bearing T20 verdict / number must appear verbatim
+    in the rendered notebook (markdown prose + executed output cells).
+
+    Closes the same gap fixed for T21 in the PR #409 holistic
+    re-audit: the file-level docstring claimed "check against the
+    values quoted in the tutorial markdown" but every prior assert
+    re-derived numbers from the DGP and compared them to a hardcoded
+    constant, leaving the notebook completely uncross-checked.
+    """
+    from tests._tutorial_drift import assert_quotes_in_rendered
+
+    expected_quotes = [
+        # Headline WAS estimate quoted in cell 10 narrative.
+        "100 weekly visits",
+        # CI quoted alongside the headline.
+        "98.6 to 101.4",
+        # Design auto-detect outcome.
+        "continuous_near_d_lower",
+        # Target parameter label used across cells 8 / 12 / 13.
+        "WAS_d_lower",
+        # Placebo-magnitude prose claim (locked analytically above by
+        # test_event_study_pre_placebos_cover_zero with the ±0.1 envelope).
+        "±0.06",
+        # Sample-summary phrase in the design-fit narrative. Use the
+        # exact tilde-prefixed form so a future drift in the sentence
+        # (e.g. "median around $25K") would surface here.
+        "median ~$25K",
+    ]
+    assert_quotes_in_rendered(T20_NOTEBOOK, expected_quotes, surface="rendered")
diff --git a/tests/test_t21_had_pretest_workflow_drift.py b/tests/test_t21_had_pretest_workflow_drift.py
@@ -366,3 +366,85 @@ def test_yatchew_side_panel_mean_independence_passes(yatchew_side_panel_inputs):
     assert res_mi.sigma2_lin > res_lin.sigma2_lin
     # And the differencing variance (sigma2_diff) is shared across modes.
     assert round(res_mi.sigma2_diff, 4) == round(res_lin.sigma2_diff, 4)
+
+
+# =============================================================================
+# Notebook-narrative cross-check
+# =============================================================================
+#
+# The asserts above re-derive numbers from the locked DGP+seed but do NOT
+# verify that the rendered tutorial actually quotes those same numbers.
+# Without this layer, the notebook prose can drift independently of the
+# library numerics (or vice versa) and CI stays green because
+# `nbsphinx_execute = "never"` in `docs/conf.py` (CI doesn't re-execute
+# notebooks during build). Use the shared tutorial-drift helper that
+# parses the notebook JSON and checks both markdown prose AND executed
+# output cells (since the load-bearing verdict strings appear in
+# print()-rendered output blocks, not just markdown prose).
+
+
+T21_NOTEBOOK = "docs/tutorials/21_had_pretest_workflow.ipynb"
+
+
+def test_notebook_quotes_match_pinned_constants():
+    """Every load-bearing verdict/value this file pins must appear
+    verbatim in the rendered T21 notebook surface (markdown prose +
+    executed output cells).
+
+    The file-level docstring claims to "check against the values
+    quoted in the tutorial markdown", but the rest of the file never
+    exercised that claim: every prior assert re-derives numbers from
+    the DGP and compares them to a hardcoded constant, leaving the
+    notebook itself unchecked. This test closes that gap.
+    Without this test, the notebook can drift independently of the
+    library numerics (or vice versa) and CI stays green because
+    ``nbsphinx_execute = "never"`` in ``docs/conf.py``.
+    """
+    from tests._tutorial_drift import assert_quotes_in_rendered
+
+    expected_quotes = [
+        # ---- Verdict-string anchors ----
+        # Overall verdict substring (also pinned in test_overall_workflow_*).
+        # Appears in markdown prose AND in the verdict-print output cell.
+        "paper step 2 deferred to Phase 3 follow-up",
+        # Event-study verdict substring (rendered output of the
+        # aggregate='event_study' workflow + markdown reading-cell).
+        "TWFE admissible under Section 4 assumptions",
+        # Event-study output cell anchor — full verdict header.
+        "QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject",
+        # ---- Structural-field anchors ----
+        "aggregate = 'event_study'",
+        "pretrends_joint populated? True",
+        "homogeneity_joint populated? True",
+        "aggregate = 'overall'",
+        "pretrends_joint populated? False",
+        # ---- Verdict-reading markdown anchors (cell 6) ----
+        "T = D_(1) / (D_(2) - D_(1)) ~ 3.86",
+        "1/alpha - 1 = 19",
+        # ---- Numeric anchors pinned analytically above ----
+        # Every value pinned via round(..., 4) == 0.NNNN in this file
+        # must also appear in the rendered notebook (otherwise the
+        # tutorial prose / output is showing a different number than
+        # the test claims to lock).
+        "0.2059",  # QUG p-value (test_overall_workflow_*)
+        "0.6860",  # Stute p-value tolerance band anchor
+        "0.0720",  # joint-pretrends Stute p-value (event-study)
+        "0.7630",  # joint-homogeneity Stute p-value (event-study)
+        "0.4917",  # Yatchew side-panel null=linearity p-value
+        "0.2899",  # Yatchew side-panel null=mean_independence p-value
+        # Design auto-detect outcome (also pinned by overall-path tests).
+        "continuous_at_zero",
+        # Use the exact paper-step-1 phrasing with target=`WAS` so we
+        # don't false-pass on the many incidental occurrences of "WAS"
+        # elsewhere in the prose.
+        "target = `WAS`",
+        # Overall Yatchew p-value (analytical short-circuit on this DGP).
+        "1.0000",
+        # Overall Yatchew sigma2_lin in the rendered output.
+        "6250.2569",
+        # Side-panel Yatchew sigma2_lin under null='linearity'.
+        "6.5340",
+        # Side-panel Yatchew sigma2_lin under null='mean_independence'.
+        "7.0076",
+    ]
+    assert_quotes_in_rendered(T21_NOTEBOOK, expected_quotes, surface="rendered")
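
The numeric anchors above are exact substrings such as `"0.6860"`, so trailing zeros matter. The sketch below shows one way a quote could be derived from a numeric pin rather than typed by hand. It assumes the notebook prints values with fixed four-decimal formatting, which the pinned strings suggest but this diff does not confirm; `pin_to_quote`, `missing_quotes`, and the sample text are hypothetical.

```python
def pin_to_quote(value: float, places: int = 4) -> str:
    """Format a numeric pin as fixed-precision text, keeping trailing zeros."""
    return f"{value:.{places}f}"


def missing_quotes(text: str, quotes) -> list:
    """Return every expected substring that is absent from the text."""
    return [q for q in quotes if q not in text]


# Invented stand-in for a notebook's rendered output.
rendered = "Stute p-value: 0.6860\nQUG p-value: 0.2059\n"

quotes = [pin_to_quote(0.686), pin_to_quote(0.2059)]

# str(round(...)) drops the trailing zero and would not match the rendered text.
naive = str(round(0.686, 4))

ok = missing_quotes(rendered, quotes)

# Simulate prose drift: the notebook now shows a different QUG p-value.
drifted = rendered.replace("0.2059", "0.2061")
stale = missing_quotes(drifted, quotes)
```

Collecting all misses before asserting, as `assert_quotes_in_rendered` does, lets one test run report every stale quote at once.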