Address PR #402 R4 review (1 P1, 1 P3)

igerber · claude · igerber · commit d4b909157a6a · 2026-05-09T11:46:55.000-04:00
P1 HAD Step-3 overstated pretest coverage on weighted/survey fits:
practitioner_next_steps() said did_had_pretest_workflow runs QUG on
both the overall and event-study paths without noting that the workflow
explicitly skips QUG whenever survey_design= / survey= / weights= is
supplied (Phase 4.5 C0 deferral, had_pretests.py:4488-4495 + REGISTRY
§ "QUG Null Test" Note (Phase 4.5 C0)). On weighted fits the workflow
emits a UserWarning and returns a linearity-conditional verdict only.
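
A minimal sketch of the skip rule described above. This is not the code in
had_pretests.py: sketch_pretest_verdict and its return keys are hypothetical
stand-ins, and only the UserWarning plus the linearity-conditional outcome
come from the behavior this commit documents.

```python
import warnings


def sketch_pretest_verdict(weighted: bool) -> dict:
    """Toy stand-in (hypothetical) for the QUG skip rule on weighted fits."""
    steps_run = ["linearity"]  # step 3 runs on weighted and unweighted fits
    if weighted:
        # Weighted / survey fits: step 1 (QUG) is skipped, with a warning.
        warnings.warn(
            "QUG skipped under survey weighting; verdict is "
            "linearity-conditional",
            UserWarning,
            stacklevel=2,
        )
    else:
        # Unweighted fits: QUG runs first, then the linearity tests.
        steps_run.insert(0, "qug")
    return {"steps_run": steps_run, "linearity_conditional": weighted}
```

In this sketch a weighted call never lists "qug" among the steps run, which is
the gap the old Step-3 text failed to mention.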

Both _handle_had and _handle_had_event_study Step-3 why-text + code
snippets now explicitly state that survey-weighted fits skip QUG and
yield a linearity-conditional verdict (the weighted verdict is
conditional on QUG holding by assumption). The event-study text also
notes that joint Stute pre-trends and joint homogeneity-linearity
themselves remain available under survey weighting via the PSU-level
Mammen multiplier bootstrap.

P3 REGISTRY § HeterogeneousAdoptionDiD requirements checklist was
stale: marked "Phase 5: practitioner_next_steps() integration" and
"Phase 5 (remaining): llms-full.txt section" as pending. Updated to
reflect this PR landing wave 1 of Phase 5; only T21 (HAD pretest
workflow tutorial) and T22 (weighted/survey HAD tutorial) remain
queued, both tracked in TODO.md.

Tests added (1 new, 89 total):
- test_had_step_3_flags_qug_under_survey_deferral: asserts both HAD
  handler variants surface the QUG-under-survey skip and the
  linearity-conditional-verdict caveat. Without this caveat agents may
  assume step 1 / Design 1' vs Design 1 was checked on weighted fits
  when the library deliberately does not check it there.
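
The check this test performs can be sketched as a standalone predicate over a
single step dict. The helper name step_3_flags_qug_skip and the example dict
are hypothetical; the substrings mirror the ones the test asserts on.

```python
def step_3_flags_qug_skip(step: dict) -> bool:
    """Return True if a Step-3 entry surfaces both required caveats."""
    text = (step.get("why", "") + " " + step.get("code", "")).lower()
    # Caveat 1: survey-weighted fits skip QUG.
    mentions_skip = "skip" in text and "qug" in text
    # Caveat 2: the weighted verdict is linearity-conditional.
    mentions_conditional = (
        "linearity-conditional" in text
        or "linearity conditional" in text
        or "qug holding by assumption" in text
    )
    return mentions_skip and mentions_conditional


# Hypothetical step entry shaped like the handler output.
example_step = {
    "baker_step": 3,
    "why": (
        "On survey-weighted fits the workflow skips QUG and returns a "
        "linearity-conditional verdict only."
    ),
    "code": "",
}
```

A step whose text only says the workflow runs QUG fails the predicate, which
is the regression the test guards against.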

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/diff_diff/practitioner.py b/diff_diff/practitioner.py
@@ -857,18 +857,26 @@ def _handle_had(results: Any):
             baker_step=3,
             label="Run the HAD pretest battery",
             why=(
-                "On a two-period panel did_had_pretest_workflow runs "
-                "paper Section 4.2 step 1 (QUG support-infimum test - "
+                "On a two-period unweighted panel did_had_pretest_workflow "
+                "runs paper Section 4.2 step 1 (QUG support-infimum test - "
                 "decides Design 1' vs Design 1) and step 3 (Stute / "
                 "Yatchew-HR Assumption 8 linearity tests). Step 2 "
                 "(Assumption 7 pre-trends) is NOT covered on the overall "
                 "path - a single pre-period cannot support the joint "
                 "Stute variant - and the returned verdict explicitly "
                 "flags that gap. To close step 2, refit on a multi-period "
-                "panel with aggregate='event_study'. Assumptions 3 / 5 / 6 "
-                "(uniform continuity at the boundary, Design 1 sign / "
-                "WAS_d_lower identification) are NOT testable via "
-                "pre-trends - the workflow vets only what can be vetted."
+                "panel with aggregate='event_study'. On survey-weighted "
+                "fits (survey_design= / survey= / weights=) the workflow "
+                "skips QUG with a UserWarning (permanent Phase 4.5 C0 "
+                "deferral - extreme order statistics are not smooth "
+                "functionals of the empirical CDF) and returns a "
+                "linearity-conditional verdict only - so step 1 coverage "
+                "is unweighted-only and the reported verdict on weighted "
+                "fits is conditional on QUG holding by assumption. "
+                "Assumptions 3 / 5 / 6 (uniform continuity at the "
+                "boundary, Design 1 sign / WAS_d_lower identification) "
+                "are NOT testable via pre-trends - the workflow vets only "
+                "what can be vetted."
             ),
             code=(
                 "from diff_diff import did_had_pretest_workflow\n"
@@ -879,7 +887,9 @@ def _handle_had(results: Any):
                 "print(report.summary())\n"
                 "# verdict explicitly flags the Assumption 7 gap on the\n"
                 "# overall path; aggregate='event_study' on a multi-period\n"
-                "# panel adds joint Stute pre-trends + joint homogeneity-linearity."
+                "# panel adds joint Stute pre-trends + joint homogeneity-linearity.\n"
+                "# Passing survey_design= / weights= skips QUG (Phase 4.5 C0)\n"
+                "# and returns a linearity-conditional verdict only."
             ),
             step_name="parallel_trends",
         ),
@@ -997,11 +1007,21 @@ def _handle_had_event_study(results: Any):
             baker_step=3,
             label="Run the HAD pretest battery (event-study mode)",
             why=(
-                "On multi-period panels, did_had_pretest_workflow with "
-                "aggregate='event_study' runs QUG plus joint Stute "
+                "On multi-period unweighted panels, did_had_pretest_workflow "
+                "with aggregate='event_study' runs QUG plus joint Stute "
                 "pre-trends plus joint homogeneity-linearity Stute. The "
                 "joint Stute variants close the paper Section 4.2 step-2 "
-                "gap that the overall path explicitly flags as deferred."
+                "gap that the overall path explicitly flags as deferred. "
+                "On survey-weighted fits (survey_design= / survey= / "
+                "weights=) the workflow skips QUG with a UserWarning "
+                "(permanent Phase 4.5 C0 deferral) and returns a "
+                "linearity-conditional verdict only - so step 1 coverage "
+                "is unweighted-only on the event-study path too, and the "
+                "weighted verdict is conditional on QUG holding by "
+                "assumption. The joint Stute pre-trends and joint "
+                "homogeneity-linearity tests themselves remain available "
+                "under survey weighting via PSU-level Mammen multiplier "
+                "bootstrap."
             ),
             code=(
                 "from diff_diff import did_had_pretest_workflow\n"
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -2548,9 +2548,10 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in
 - [x] Phase 3: `did_had_pretest_workflow()` composite helper. Two-period `data`-only entry point (Phase 2a overall-path dispatch); reduces panel via `_aggregate_first_difference` and runs all three IMPLEMENTED tests at a shared `alpha`. `seed` forwards to `stute_test` only (QUG and Yatchew are deterministic). Returns `HADPretestReport` with priority-ordered verdict string. Because Phase 3 ships steps 1 + 3 of the paper's four-step workflow but **not** step 2 (Assumption 7 pre-trends test via Equation 18), the fail-to-reject verdict explicitly flags the Assumption 7 gap rather than claiming unconditional TWFE safety: `"QUG and linearity diagnostics fail-to-reject; Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)"`. Verdict priority follows the paper's one-way rule (TWFE admissible only if NO test rejects): **conclusive rejections are the primary verdict and are NEVER hidden by inconclusive status** — any unresolved-step note is appended via `"; additional steps unresolved: ..."` rather than replacing the rejection. The pure `"inconclusive - QUG NaN"` / `"inconclusive - both Stute and Yatchew linearity tests NaN"` forms only fire when NO conclusive test rejects AND a required step is unresolved. The partial-workflow fail-to-reject verdict may carry a `"(Yatchew NaN - skipped)"` (or Stute) suffix when one linearity test is NaN but the other is conclusive (step 3 resolved via the paper's "Stute OR Yatchew" wording). Bundled rejection-reason strings name each failed assumption in the conclusive-rejection case. `all_pass` is `True` iff QUG is conclusive AND at least one of Stute/Yatchew is conclusive AND no conclusive test rejects. **Non-negative-dose contract**: all three raw linearity helpers (`qug_test`, `stute_test`, `yatchew_hr_test`) raise a front-door `ValueError` on any `d < 0`, mirroring the `_validate_had_panel` guard (paper Section 2 HAD support restriction). Multi-period panels pre-slice to `(F-1, F)` before calling; joint-horizon dispatch deferred to Phase 3 follow-up.
 - [ ] Phase 4: Pierce-Schott (2016) replication harness reproduces Figure 2 values.
 - [ ] Phase 4: Full DGP 1/2/3 coverage-rate reproduction from Table 1.
-- [ ] Phase 5: `practitioner_next_steps()` integration for HAD results.
+- [x] Phase 5 (wave 1, PR #402): `practitioner_next_steps()` integration for HAD results - `_handle_had` and `_handle_had_event_study` route both result classes through HAD-specific Baker et al. (2025) step guidance with bidirectional HAD ↔ ContinuousDiD Step-4 routing closure. The `_check_nan_att` helper extends to ndarray `att` (HAD event-study) via `np.all(np.isnan(arr))` semantics; scalar path bit-exact preserved.
+- [x] Phase 5 (wave 1, PR #402): `llms-full.txt` HeterogeneousAdoptionDiD section + result-class blocks + `## HAD Pretests` index + Choosing-an-Estimator row landed; constructor / fit() signatures match the real API (regression-tested via `inspect.signature`); result-class field tables enumerate every public dataclass field (regression-tested via `dataclasses.fields()`); `llms-practitioner.txt` Step 4 decision tree distinguishes ContinuousDiD (per-dose ATT(d), needs never-treated) from HeterogeneousAdoptionDiD (WAS, universal-rollout-compatible).
 - [x] Phase 5 (partial): README catalog one-liner, bundled `llms.txt` `## Estimators` entry, `docs/api/had.rst` (autoclass for the three classes), and `docs/references.rst` citation landed in PR #372 docs refresh.
-- [ ] Phase 5 (remaining): Tutorial notebook + `llms-full.txt` HeterogeneousAdoptionDiD section (preserving the UTF-8 fingerprint).
+- [ ] Phase 5 (remaining): T21 HAD pretest workflow tutorial + T22 weighted/survey HAD tutorial - tracked in `TODO.md`.
 - [ ] Documentation of non-testability of Assumptions 5 and 6.
 - [ ] Warnings for staggered treatment timing (redirect to `ChaisemartinDHaultfoeuille`).
 - [ ] `NotImplementedError` phase pointer when `covariates=` is passed (Theorem 6 future work).
diff --git a/tests/test_practitioner.py b/tests/test_practitioner.py
@@ -690,6 +690,43 @@ def test_handle_continuous_step_4_snippet_is_valid_python(self, mock_continuous_
             if code.strip():
                 ast.parse(code)  # raises SyntaxError on failure
 
+    def test_had_step_3_flags_qug_under_survey_deferral(
+        self, mock_had_results, mock_had_event_study_results
+    ):
+        # Per diff_diff/had_pretests.py:4488-4495 + REGISTRY § "QUG Null
+        # Test" Note (Phase 4.5 C0): when survey_design= / survey= /
+        # weights= is supplied, did_had_pretest_workflow skips the QUG
+        # step with a UserWarning and returns a linearity-conditional
+        # verdict only. Both HAD handler variants must surface this
+        # caveat so agents do not assume step 1 / Design 1' vs Design 1
+        # was checked on weighted fits when the library deliberately
+        # does not check it there.
+        for fixture in (mock_had_results, mock_had_event_study_results):
+            output = practitioner_next_steps(fixture, verbose=False)
+            step_3_steps = [s for s in output["next_steps"] if s["baker_step"] == 3]
+            assert len(step_3_steps) == 1
+            text = (step_3_steps[0].get("why", "") + " " + step_3_steps[0].get("code", "")).lower()
+            # Must mention that survey-weighted fits skip QUG.
+            assert "skip" in text and "qug" in text, (
+                "Step-3 text must explicitly say survey-weighted fits "
+                "skip QUG (Phase 4.5 C0 deferral). Without this caveat "
+                "agents may assume step 1 / Design 1' vs Design 1 was "
+                "checked on weighted fits when the library deliberately "
+                "does not check it there."
+            )
+            # Must mention "linearity-conditional" verdict OR equivalent
+            # framing so agents know the weighted verdict is conditional
+            # on QUG holding by assumption.
+            assert (
+                "linearity-conditional" in text
+                or "linearity conditional" in text
+                or "qug holding by assumption" in text
+            ), (
+                "Step-3 text must describe the weighted verdict as "
+                "linearity-conditional / conditional on QUG holding by "
+                "assumption."
+            )
+
     def test_had_step_3_pretest_assumption_labels_correct(self, mock_had_results):
         # Per docs/methodology/REGISTRY.md and diff_diff/had_pretests.py
         # docstrings: