Skip to content

Commit 931feb2

Browse files
authored
Merge pull request #439 from igerber/fix-audit-409-r2
Fix #409 holistic audit residuals: T20+T21 notebook cross-check + TODO status
2 parents 6311508 + a01600d commit 931feb2

4 files changed

Lines changed: 262 additions & 1 deletion

File tree

TODO.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -112,7 +112,7 @@ Deferred items from PR reviews that were not addressed before merge.
112112
| `HeterogeneousAdoptionDiD` Phase 3 R-parity: Phase 3 ships coverage-rate validation on synthetic DGPs (not tight point parity against `chaisemartin::stute_test` / `yatchew_test`). Tight numerical parity requires aligning bootstrap seed semantics and `B` across numpy/R and is deferred. | `tests/test_had_pretests.py` | Phase 3 | Low |
113113
| `HeterogeneousAdoptionDiD` Phase 3 nprobust bandwidth for Stute: some Stute variants on continuous regressors use nprobust-style optimal bandwidth selection. Phase 3 uses OLS residuals from a 2-parameter linear fit (no bandwidth selection). nprobust integration is a future enhancement; not in paper scope. | `diff_diff/had_pretests.py::stute_test` | Phase 3 | Low |
114114
| `HeterogeneousAdoptionDiD` Phase 4: Pierce-Schott (2016) replication harness; reproduce paper Figure 2 values and Table 1 coverage rates. | `benchmarks/`, `tests/` | Phase 2a | Low |
115-
| `HeterogeneousAdoptionDiD` Phase 5 follow-up tutorial (T22 weighted/survey HAD tutorial). T21 HAD pretest workflow notebook landed (PR-pending); `practitioner_next_steps()` HAD handlers + `llms-full.txt` HeterogeneousAdoptionDiD section + Choosing-an-Estimator row landed in Phase 5 wave 1 (PR #402). | `tutorials/`, `tests/test_t22_*_drift.py` | Phase 2a | Low |
115+
| `HeterogeneousAdoptionDiD` Phase 5 follow-up tutorial (T22 weighted/survey HAD tutorial). T21 HAD pretest workflow notebook landed in PR #409; `practitioner_next_steps()` HAD handlers + `llms-full.txt` HeterogeneousAdoptionDiD section + Choosing-an-Estimator row landed in Phase 5 wave 1 (PR #402). | `tutorials/`, `tests/test_t22_*_drift.py` | Phase 2a | Low |
116116
| `HeterogeneousAdoptionDiD` time-varying dose on event study: Phase 2b REJECTS panels where `D_{g,t}` varies within a unit for `t >= F` (the aggregation uses `D_{g, F}` as the single regressor for all horizons, paper Appendix B.2 constant-dose convention). A follow-up PR could add a time-varying-dose estimator for these panels; current behavior is front-door rejection with a redirect to `ChaisemartinDHaultfoeuille`. | `diff_diff/had.py::_validate_had_panel_event_study` | Phase 2b | Low |
117117
| `HeterogeneousAdoptionDiD` repeated-cross-section support: paper Section 2 defines HAD on panel OR repeated cross-section, but Phase 2a is panel-only. RCS inputs (disjoint unit IDs between periods) are rejected by the balanced-panel validator with the generic "unit(s) do not appear in both periods" error. A follow-up PR will add an RCS identification path based on pre/post cell means (rather than unit-level first differences), with its own validator and a distinct `data_mode` / API surface. | `diff_diff/had.py::_validate_had_panel`, `diff_diff/had.py::_aggregate_first_difference` | Phase 2a | Medium |
118118
| SyntheticDiD: bootstrap cross-language parity anchor against R's default `synthdid::vcov(method="bootstrap")` (refit; rebinds `opts` per draw) or Julia `Synthdid.jl::src/vcov.jl::bootstrap_se` (refit by construction). Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; a cross-language anchor is desirable to bolster the methodology contract. Julia is the cleanest target — minimal wrapping work and refit-native vcov. Tolerance target: 1e-6 on Monte Carlo samples (different BLAS + RNG paths preclude 1e-10). The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path. | `benchmarks/R/`, `benchmarks/julia/`, `tests/` | follow-up | Low |

tests/_tutorial_drift.py

Lines changed: 129 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,129 @@
1+
"""Shared helpers for tutorial-drift tests (T20, T21, ...).
2+
3+
The HAD tutorial drift tests pin numbers / verdict strings against the
4+
locked DGP + seed. Without these helpers each drift test re-derived
5+
numbers but never verified that the rendered notebook surface (markdown
6+
prose + executed output cells) actually quotes those values. Because
7+
``nbsphinx_execute = "never"`` in ``docs/conf.py``, CI cannot detect
8+
drift between the pinned constants and the committed tutorial via
9+
notebook re-execution; the constants and the notebook can diverge
10+
silently. These helpers parse the .ipynb JSON directly so each
11+
tutorial-drift test file can cross-check its pins against the
12+
rendered surface it claims to protect.
13+
"""
14+
15+
from __future__ import annotations
16+
17+
import json
18+
from pathlib import Path
19+
from typing import Iterable
20+
21+
22+
def _read_notebook(nb_relpath: str) -> dict:
23+
"""Load a notebook by repo-relative path (e.g. ``docs/tutorials/X.ipynb``).
24+
25+
Skips the calling test via ``pytest.skip(...)`` when the notebook file
26+
is not present. The Rust-test CI job (and the isolated-install job)
27+
copies only ``tests/`` to ``/tmp/tests`` and runs from there, without
28+
``docs/`` available. The repo convention is to skip cleanly when
29+
artifacts are absent rather than fail (see e.g.
30+
``tests/test_notebook_md_extract.py`` and ``tests/test_nprobust_port.py``).
31+
"""
32+
import pytest
33+
34+
nb_path = Path(__file__).resolve().parents[1] / nb_relpath
35+
if not nb_path.exists():
36+
pytest.skip(
37+
f"Notebook {nb_relpath!r} not available in this CI environment "
38+
"(isolated-install job copies only tests/, not docs/); "
39+
"rendered-surface cross-check requires a full repo checkout."
40+
)
41+
return json.loads(nb_path.read_text())
42+
43+
44+
def notebook_markdown(nb_relpath: str) -> str:
45+
"""Return all markdown cells concatenated into one string."""
46+
nb = _read_notebook(nb_relpath)
47+
parts = []
48+
for cell in nb["cells"]:
49+
if cell["cell_type"] != "markdown":
50+
continue
51+
src = cell["source"]
52+
if isinstance(src, list):
53+
src = "".join(src)
54+
parts.append(src)
55+
return "\n".join(parts)
56+
57+
58+
def notebook_output_text(nb_relpath: str) -> str:
59+
"""Return all executed-output text (``stream`` and ``execute_result``
60+
text/plain) from every code cell, concatenated.
61+
62+
Covers the rendered numeric surface that markdown alone misses —
63+
e.g. printed verdict strings, formatted summary tables, p-values.
64+
"""
65+
nb = _read_notebook(nb_relpath)
66+
parts = []
67+
for cell in nb["cells"]:
68+
if cell["cell_type"] != "code":
69+
continue
70+
for out in cell.get("outputs", []):
71+
# stream-style outputs (print / stdout / stderr)
72+
text = out.get("text")
73+
if text is not None:
74+
parts.append("".join(text) if isinstance(text, list) else text)
75+
# execute_result / display_data with text/plain
76+
data = out.get("data") or {}
77+
plain = data.get("text/plain")
78+
if plain is not None:
79+
parts.append("".join(plain) if isinstance(plain, list) else plain)
80+
return "\n".join(parts)
81+
82+
83+
def notebook_rendered_text(nb_relpath: str) -> str:
84+
"""Return markdown + executed-output text together — the full
85+
rendered surface a reader sees on RTD."""
86+
return notebook_markdown(nb_relpath) + "\n" + notebook_output_text(nb_relpath)
87+
88+
89+
def assert_quotes_in_rendered(
90+
nb_relpath: str,
91+
expected_quotes: Iterable[str],
92+
*,
93+
surface: str = "rendered",
94+
) -> None:
95+
"""Assert each expected substring appears in the chosen rendered surface.
96+
97+
Parameters
98+
----------
99+
nb_relpath
100+
Notebook path relative to repo root (e.g.
101+
``"docs/tutorials/21_had_pretest_workflow.ipynb"``).
102+
expected_quotes
103+
Iterable of substrings that MUST appear in the chosen rendered
104+
surface. Each is checked independently; the assertion message
105+
lists every missing quote so a single failure surfaces all of
106+
them.
107+
surface
108+
Which slice of the notebook to check: ``"markdown"`` (prose
109+
only), ``"output"`` (executed output cells only), or
110+
``"rendered"`` (both — default; matches what a reader sees
111+
on RTD).
112+
"""
113+
if surface == "markdown":
114+
text = notebook_markdown(nb_relpath)
115+
elif surface == "output":
116+
text = notebook_output_text(nb_relpath)
117+
elif surface == "rendered":
118+
text = notebook_rendered_text(nb_relpath)
119+
else:
120+
raise ValueError(f"surface must be 'markdown' / 'output' / 'rendered'; got {surface!r}")
121+
missing = [q for q in expected_quotes if q not in text]
122+
assert not missing, (
123+
f"Tutorial {nb_relpath!r} ({surface=}) is missing load-bearing "
124+
f"quoted values that the pinned drift constants assume are "
125+
f"rendered verbatim. Either the notebook drifted from the "
126+
f"locked DGP output (rerun the tutorial against the pinned "
127+
f"seed) or the drift-test constants were updated without "
128+
f"updating the tutorial. Missing: {missing}"
129+
)

tests/test_t20_had_brand_campaign_drift.py

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -270,3 +270,53 @@ def test_event_study_pre_placebos_cover_zero(event_study_result):
270270
i = event_times.index(e)
271271
assert abs(atts[i]) < 0.1, (e, atts[i])
272272
assert ci_lows[i] <= 0.0 <= ci_highs[i], (e, ci_lows[i], ci_highs[i])
273+
274+
275+
# =============================================================================
276+
# Notebook-narrative cross-check
277+
# =============================================================================
278+
#
279+
# Mirror of the T21 notebook cross-check (PR #409 holistic re-audit).
280+
# The asserts above re-derive numbers from the locked DGP+seed but
281+
# never verify that the rendered T20 tutorial actually quotes those
282+
# same numbers. Because ``nbsphinx_execute = "never"`` in
283+
# ``docs/conf.py``, CI cannot detect drift between the pinned drift
284+
# constants and the committed tutorial via notebook re-execution.
285+
# Use the shared helper from ``tests/_tutorial_drift.py`` to assert
286+
# each load-bearing quote is present in the rendered notebook
287+
# surface (markdown prose + executed output cells).
288+
289+
290+
T20_NOTEBOOK = "docs/tutorials/20_had_brand_campaign.ipynb"
291+
292+
293+
def test_notebook_quotes_match_pinned_constants():
294+
"""Every load-bearing T20 verdict / number must appear verbatim
295+
in the rendered notebook (markdown prose + executed output cells).
296+
297+
Closes the same gap fixed for T21 in the PR #409 holistic
298+
re-audit: the file-level docstring claimed "check against the
299+
values quoted in the tutorial markdown" but every prior assert
300+
re-derived numbers from the DGP and compared them to a hardcoded
301+
constant, leaving the notebook completely uncross-checked.
302+
"""
303+
from tests._tutorial_drift import assert_quotes_in_rendered
304+
305+
expected_quotes = [
306+
# Headline WAS estimate quoted in cell 10 narrative.
307+
"100 weekly visits",
308+
# CI quoted alongside the headline.
309+
"98.6 to 101.4",
310+
# Design auto-detect outcome.
311+
"continuous_near_d_lower",
312+
# Target parameter label used across cells 8 / 12 / 13.
313+
"WAS_d_lower",
314+
# Placebo-magnitude prose claim (locked analytically above by
315+
# test_event_study_pre_atts_near_zero with the ±0.1 envelope).
316+
"±0.06",
317+
# Sample-summary phrase in the design-fit narrative. Use the
318+
# exact tilde-prefixed form so a future drift in the sentence
319+
# (e.g. "median around $25K") would surface here.
320+
"median ~$25K",
321+
]
322+
assert_quotes_in_rendered(T20_NOTEBOOK, expected_quotes, surface="rendered")

tests/test_t21_had_pretest_workflow_drift.py

Lines changed: 82 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -366,3 +366,85 @@ def test_yatchew_side_panel_mean_independence_passes(yatchew_side_panel_inputs):
366366
assert res_mi.sigma2_lin > res_lin.sigma2_lin
367367
# And the differencing variance (sigma2_diff) is shared across modes.
368368
assert round(res_mi.sigma2_diff, 4) == round(res_lin.sigma2_diff, 4)
369+
370+
371+
# =============================================================================
372+
# Notebook-narrative cross-check
373+
# =============================================================================
374+
#
375+
# The asserts above re-derive numbers from the locked DGP+seed but do NOT
376+
# verify that the rendered tutorial actually quotes those same numbers.
377+
# Without this layer, the notebook prose can drift independently of the
378+
# library numerics (or vice versa) and CI stays green because
379+
# `nbsphinx_execute = "never"` in `docs/conf.py` (CI doesn't re-execute
380+
# notebooks during build). Use the shared tutorial-drift helper that
381+
# parses the notebook JSON and checks both markdown prose AND executed
382+
# output cells (since the load-bearing verdict strings appear in
383+
# print()-rendered output blocks, not just markdown prose).
384+
385+
386+
T21_NOTEBOOK = "docs/tutorials/21_had_pretest_workflow.ipynb"
387+
388+
389+
def test_notebook_quotes_match_pinned_constants():
390+
"""Every load-bearing verdict/value this file pins must appear
391+
verbatim in the rendered T21 notebook surface (markdown prose +
392+
executed output cells).
393+
394+
Closes the gap the file-level docstring claims to cover ("check
395+
against the values quoted in the tutorial markdown") but the rest
396+
of the file did not actually exercise — every prior assert
397+
re-derives numbers from the DGP and compares them to a hardcoded
398+
constant, leaving the notebook completely uncross-checked.
399+
Without this test, the notebook can drift independently of the
400+
library numerics (or vice versa) and CI stays green because
401+
``nbsphinx_execute = "never"`` in ``docs/conf.py``.
402+
"""
403+
from tests._tutorial_drift import assert_quotes_in_rendered
404+
405+
expected_quotes = [
406+
# ---- Verdict-string anchors ----
407+
# Overall verdict substring (also pinned in test_overall_workflow_*).
408+
# Appears in markdown prose AND in the verdict-print output cell.
409+
"paper step 2 deferred to Phase 3 follow-up",
410+
# Event-study verdict substring (rendered output of the
411+
# aggregate='event_study' workflow + markdown reading-cell).
412+
"TWFE admissible under Section 4 assumptions",
413+
# Event-study output cell anchor — full verdict header.
414+
"QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject",
415+
# ---- Structural-field anchors ----
416+
"aggregate = 'event_study'",
417+
"pretrends_joint populated? True",
418+
"homogeneity_joint populated? True",
419+
"aggregate = 'overall'",
420+
"pretrends_joint populated? False",
421+
# ---- Verdict-reading markdown anchors (cell 6) ----
422+
"T = D_(1) / (D_(2) - D_(1)) ~ 3.86",
423+
"1/alpha - 1 = 19",
424+
# ---- Numeric anchors pinned analytically above ----
425+
# Every value pinned via round(..., 4) == 0.NNNN in this file
426+
# must also appear in the rendered notebook (otherwise the
427+
# tutorial prose / output is showing a different number than
428+
# the test claims to lock).
429+
"0.2059", # QUG p-value (test_overall_workflow_*)
430+
"0.6860", # Stute p-value tolerance band anchor
431+
"0.0720", # joint-pretrends Stute p-value (event-study)
432+
"0.7630", # joint-homogeneity Stute p-value (event-study)
433+
"0.4917", # Yatchew side-panel null=linearity p-value
434+
"0.2899", # Yatchew side-panel null=mean_independence p-value
435+
# Design auto-detect outcome (also pinned by overall-path tests).
436+
"continuous_at_zero",
437+
# Use the exact paper-step-1 phrasing with target=`WAS` so we
438+
# don't false-pass on the many incidental occurrences of "WAS"
439+
# elsewhere in the prose.
440+
"target = `WAS`",
441+
# Overall Yatchew p-value (analytical short-circuit on this DGP).
442+
"1.0000",
443+
# Overall Yatchew sigma2_lin in the rendered output.
444+
"6250.2569",
445+
# Side-panel Yatchew sigma2_lin under null='linearity'.
446+
"6.5340",
447+
# Side-panel Yatchew sigma2_lin under null='mean_independence'.
448+
"7.0076",
449+
]
450+
assert_quotes_in_rendered(T21_NOTEBOOK, expected_quotes, surface="rendered")

0 commit comments

Comments
 (0)