Address PR #409 R4 review (1 P1, 1 P2)

igerber · claude · igerber · commit c5a0ee3aa608 · 2026-05-10T10:45:27.000-04:00
P1 — HAD design label convention was reversed across T21. Per
REGISTRY:2267 + had.py:7-33, the convention is:
  - Design 1' = continuous_at_zero (d_lower = 0, QUG case) — that's T21
  - Design 1  = continuous_near_d_lower (d_lower > 0)     — that's T20
T21 had Design 1 / Design 1' swapped throughout. Fixed in the build
script (Section 1 paper-step taxonomy, Section 2 panel framing,
Section 3 reading-the-verdict, Section 7 Extensions). Notebook
re-executed and review extract regenerated.

Two residual "QUG selects/picks the identification path" leakages from
the original prose also surfaced (Section 7 + Summary checklist). Both
contradicted the explicit QUG-vs-_detect_design separation locked by
test_had_design_auto_lands_on_continuous_at_zero. Reworded to keep the
two rules independent ("QUG fail-to-reject and `design="auto"`
heuristic both pointed independently"; "QUG is a statistical test on
H0; `design="auto"` calls _detect_design() which uses a min/median
heuristic — both pointed to continuous_at_zero on this panel").
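
For reviewers, a minimal sketch of why the two rules are independent.
It uses only the formulas quoted in the T21 prose: the QUG statistic
T = D_(1) / (D_(2) - D_(1)) with critical value 1/alpha - 1, and the
rule d.min() < 0.01 * median(|d|). The function names are hypothetical
and this is not the code in had.py or had_pretests.py; the fallback
branch in particular is a simplifying assumption.

```python
import numpy as np


def qug_rejects_sketch(d, alpha=0.05):
    """Hypothetical helper: QUG decision from the two smallest doses.

    Rejects H0: d_lower = 0 when T = D_(1) / (D_(2) - D_(1)) exceeds
    1/alpha - 1. Assumes the two smallest doses are distinct.
    """
    d_sorted = np.sort(np.asarray(d, dtype=float))
    t_stat = d_sorted[0] / (d_sorted[1] - d_sorted[0])
    return bool(t_stat > 1.0 / alpha - 1.0)


def auto_design_sketch(d):
    """Hypothetical stand-in for the min/median heuristic.

    Assumption: anything that does not trigger continuous_at_zero falls
    back to continuous_near_d_lower; the real selector may have more
    branches.
    """
    d = np.asarray(d, dtype=float)
    if d.min() < 0.01 * np.median(np.abs(d)):
        return "continuous_at_zero"
    return "continuous_near_d_lower"


# Illustrative dose vectors (made up, not the T21 panel).
agree = [0.05, 0.5, 10.0, 25.0, 40.0]
disagree = [0.2, 0.5, 10.0, 25.0, 40.0]

for doses in (agree, disagree):
    print(qug_rejects_sketch(doses), auto_design_sketch(doses))
```

On the second vector QUG fails to reject while the heuristic does not
pick continuous_at_zero, so neither outcome implies the other.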

P2 (MT1) — T21 was mapped under had_pretests.py in doc-deps.yaml but
the drift test now also locks HAD(design="auto") / _detect_design()
behavior from had.py via test_had_design_auto_lands_on_continuous_at_zero.
Added a T21 entry to the had.py docs block with a note on the
_detect_design() drift coverage so a future had.py design-selection
change does not miss T21 in the manual docs-impact map.

All 16 drift tests still pass on Rust; nbmake clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/docs/_review/t21_notebook_extract.md b/docs/_review/t21_notebook_extract.md
@@ -28,7 +28,7 @@ This tutorial picks up where T20 left off. We re-run the brand campaign on a pan
 
 de Chaisemartin et al. (2026) Section 4.2 lays out a four-step workflow for HAD identification:
 
-1. **Step 1 - QUG support-infimum test (paper Theorem 4):** is the support of the dose distribution consistent with `d_lower = 0` (Design 1, `continuous_at_zero`, target = `WAS`)? Or is the support strictly above zero (Design 1', `continuous_near_d_lower`, target = `WAS_d_lower`)? The two designs identify different estimands; getting this right matters.
+1. **Step 1 - QUG support-infimum test (paper Theorem 4):** is the support of the dose distribution consistent with `d_lower = 0` (Design 1', `continuous_at_zero`, target = `WAS`)? Or is the support strictly above zero (Design 1, `continuous_near_d_lower`, target = `WAS_d_lower`)? The two designs identify different estimands; getting this right matters.
 2. **Step 2 - Parallel pre-trends (paper Assumption 7):** does the differenced outcome behave the same way across dose groups in the *pre-treatment* periods? Same identifying logic as classic DiD.
 3. **Step 3 - Linearity / homogeneity (paper Assumption 8):** is `E[dY | D]` linear in `D`, so that the WAS reading reflects the average per-dose marginal effect rather than masking heterogeneity bias?
 4. **Step 4 - Boundary continuity (paper Assumptions 5, 6):** local-linearity of the dose-response near the boundary `d_lower`. **Non-testable**; argued from domain knowledge.
@@ -37,7 +37,7 @@ The library bundles the testable steps into one entry point: `did_had_pretest_wo
 
 ## 2. The Panel
 
-We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial is drawn from `Uniform[$0.01K, $50K]` instead of T20's `Uniform[$5K, $50K]`. The true support of the dose distribution is therefore strictly positive (down to about $10), but very near zero - some markets barely participated in the regional add-on. Two independent things follow from that small `D_(1)`. (a) The QUG test in Step 1 will fail to reject `H0: d_lower = 0`, which means the data are **statistically consistent with** the `continuous_at_zero` (Design 1) identification path even though the true simulation lower bound is positive. (b) Independently, HAD's `design="auto"` detection - which uses a separate min/median heuristic, NOT the QUG p-value (`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`) - also lands on `continuous_at_zero` here, because `D_(1) / median(D)` is below 0.01 on this panel. Both checks point to the same identification path on this panel, but they are independent rules; the workflow's `_detect_design` does not consume the pre-test outcomes. The point of this tutorial is not to assert that the data is Design 1 from the DGP up; the point is to read what the workflow concludes from the data and what it leaves open.
+We use a panel close in shape to T20's brand campaign (60 DMAs over 8 weeks, regional add-on spend on top of a national TV blast at week 5, true per-$1K lift = 100 weekly visits). The one difference: regional spend in this tutorial is drawn from `Uniform[$0.01K, $50K]` instead of T20's `Uniform[$5K, $50K]`. The true support of the dose distribution is therefore strictly positive (down to about $10), but very near zero - some markets barely participated in the regional add-on. Two independent things follow from that small `D_(1)`. (a) The QUG test in Step 1 will fail to reject `H0: d_lower = 0`, which means the data are **statistically consistent with** the `continuous_at_zero` (Design 1') identification path even though the true simulation lower bound is positive. (b) Independently, HAD's `design="auto"` detection - which uses a separate min/median heuristic, NOT the QUG p-value (`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`) - also lands on `continuous_at_zero` here, because `D_(1) / median(D)` is below 0.01 on this panel. Both checks point to the same identification path on this panel, but they are independent rules; the workflow's `_detect_design` does not consume the pre-test outcomes. The point of this tutorial is not to assert that the data is Design 1' from the DGP up; the point is to read what the workflow concludes from the data and what it leaves open.
 
 ```python
 import numpy as np
@@ -152,7 +152,7 @@ homogeneity_joint populated? False
 
 **Reading the overall verdict.** Three things to note.
 
-- **Step 1 (QUG) fails to reject:** the test statistic `T = D_(1) / (D_(2) - D_(1)) ~ 3.86` lands well below its critical value (`1/alpha - 1 = 19` at alpha = 0.05); the data are statistically consistent with `d_lower = 0`. (Failing to reject is non-rejection, not proof - the true support could still be slightly above zero in finite samples; here it is, by construction of the DGP. QUG's outcome supports interpreting the data as Design 1, but the QUG test is independent of HAD's `design="auto"` selector - which uses the min/median heuristic described in Section 2 to reach the same `continuous_at_zero` decision on this panel.)
+- **Step 1 (QUG) fails to reject:** the test statistic `T = D_(1) / (D_(2) - D_(1)) ~ 3.86` lands well below its critical value (`1/alpha - 1 = 19` at alpha = 0.05); the data are statistically consistent with `d_lower = 0`. (Failing to reject is non-rejection, not proof - the true support could still be slightly above zero in finite samples; here it is, by construction of the DGP. QUG's outcome supports interpreting the data as Design 1', but the QUG test is independent of HAD's `design="auto"` selector - which uses the min/median heuristic described in Section 2 to reach the same `continuous_at_zero` decision on this panel.)
 - **Step 3 (linearity) fails to reject** on both Stute (CvM) and Yatchew-HR. The diagnostics do not flag heterogeneity bias on the dose dimension, so reading the WAS as an average per-dose marginal effect is supported by these tests (subject to finite-sample power).
 - **Step 2 (Assumption 7 pre-trends) is not run on this path.** The verdict says so verbatim: `"Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)"`. With a single pre-period (the avg over weeks 1-4), there is nothing to compare against - we need at least two pre-periods to run a parallel-trends test on the dose dimension. The structural fields back this up: `pretrends_joint` and `homogeneity_joint` on the report are both `None` (the joint-Stute output containers don't get populated on the two-period path).
 
@@ -423,7 +423,7 @@ Pre-test results travel awkwardly to non-technical audiences. The template below
 
 ## 7. Extensions
 
-This tutorial covered the composite pre-test workflow on a single panel where QUG led the workflow to select the `continuous_at_zero` (Design 1) identification path. A few directions we did not exercise here:
+This tutorial covered the composite pre-test workflow on a single panel where QUG fail-to-reject and HAD's `design="auto"` heuristic both pointed independently to the `continuous_at_zero` (Design 1') identification path. A few directions we did not exercise here:
 
 - **Survey-weighted / population-weighted inference** - HAD's pre-test workflow accepts `survey_design=` (or the deprecated `survey=` / `weights=` aliases) for design-based inference. The QUG step is permanently deferred under survey weighting (extreme-value theory under complex sampling is not a settled toolkit); the linearity family runs with PSU-level Mammen multiplier bootstrap (Stute and joint variants) and weighted OLS + weighted variance components (Yatchew). A follow-up tutorial covers this path end-to-end.
 - **`trends_lin=True` (Pierce-Schott Eq 17 / 18 detrending)** - mirrors R `DIDHAD::did_had(..., trends_lin=TRUE)`. Forwards into both joint pre-trends and joint homogeneity wrappers; consumes the placebo at `base_period - 1` and skips Step 2 if no earlier placebo survives the drop. Useful when you suspect linear time trends correlated with dose but want to keep the joint-Stute machinery.
@@ -444,5 +444,5 @@ See the [`HeterogeneousAdoptionDiD` API reference](../api/had.html) and the [`HA
 - Upgrade to the multi-period (`aggregate="event_study"`) path to add the joint Stute pre-trends and joint homogeneity diagnostics. The verdict then reads "TWFE admissible under Section 4 assumptions" when none of the three testable diagnostics rejects - that is non-rejection evidence under finite-sample power and test specification, not proof.
 - Step 4 (paper Assumptions 5 / 6, boundary continuity) is **non-testable** from data - argue from domain knowledge.
 - The Yatchew-HR test exposes two null modes: `null="linearity"` (paper Theorem 7, default; what the workflow calls under the hood) and `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`, useful on placebo pre-period data).
-- QUG fail-to-reject means the data are statistically consistent with `d_lower = 0`; it does not prove the true support starts at zero. The workflow uses the QUG outcome to pick the identification path (`continuous_at_zero` vs `continuous_near_d_lower`); finite-sample uncertainty in that decision is a remaining caveat.
+- QUG fail-to-reject means the data are statistically consistent with `d_lower = 0`; it does not prove the true support starts at zero. The QUG test and HAD's `design="auto"` selector are independent rules: QUG is a statistical test on `H0: d_lower = 0`; `design="auto"` calls `_detect_design()` which uses a min/median heuristic on the dose vector. Both pointed to `continuous_at_zero` on this panel; finite-sample uncertainty in either decision is a remaining caveat.
 - Bootstrap p-values are RNG-dependent. The drift test for this notebook lives in `tests/test_t21_had_pretest_workflow_drift.py` and uses tolerance bands per backend (Rust vs pure-Python).
diff --git a/docs/doc-deps.yaml b/docs/doc-deps.yaml
@@ -388,6 +388,9 @@ sources:
       - path: diff_diff/guides/llms-full.txt
         section: "HeterogeneousAdoptionDiD"
         type: user_guide
+      - path: docs/tutorials/21_had_pretest_workflow.ipynb
+        type: tutorial
+        note: "Drift-locks `HAD(design=\"auto\")` resolution to `continuous_at_zero` on T21's panel via `tests/test_t21_had_pretest_workflow_drift.py::test_had_design_auto_lands_on_continuous_at_zero`; changes to `_detect_design()` heuristic should re-validate T21"
 
   diff_diff/had_pretests.py:
     drift_risk: medium
diff --git a/docs/tutorials/21_had_pretest_workflow.ipynb b/docs/tutorials/21_had_pretest_workflow.ipynb