From f632acb9948566167d10b2a9fd47bfc85aec77d7 Mon Sep 17 00:00:00 2001 From: igerber Date: Mon, 1 Jun 2026 15:18:44 -0400 Subject: [PATCH] EfficientDiD methodology validation (PR-B): sieve outcome-regression upgrade + zero-weight-invariant nuisance selectors Source-validation pass for Chen, Sant'Anna & Xie (2025, arXiv:2506.17729v1); promotes the METHODOLOGY_REVIEW.md row to Complete. - Covariate doubly-robust outcome regression m_hat(X): linear OLS -> growing polynomial sieve (dimension-penalized AIC/BIC, no fixed K<=5 ceiling), so the covariate path attains the semiparametric efficiency bound under Assumption C.1; the K<=5 cap is removed from the two pre-existing propensity sieves too. - Theorem A.1 Hausman statistic extracted into a behavior-preserving _hausman_quadratic_form helper; criterion guard on estimate_outcome_regression. - Zero-weight (survey-subpopulation / padded) rows are now fully inert for the DR point estimate: all three sieve order-selectors key off the positive-weight support (n_pos = sum(w>0)) for auto-k_max / the n_basis cap / the IC sample-size terms, and the auto Silverman bandwidth for the kernel Omega*(X) is evaluated on the positive-weight support (this last only bites overidentified H>1 cells where Omega* feeds the efficient weights). Both were CI-codex P0s, both fail-under-bug verified. - New tests/test_methodology_efficient_did.py: paper-equation Verified Components (weights, generated-outcome telescoping, no-cov closed form, PT-Post=CS, SE, Hausman) + growing-sieve / propensity K>5 / zero-weight invariance (just- identified H=1 sieve-order AND overidentified H>1 kernel-bandwidth). - HRS Table 6 anchor tightened 0.1*SE -> 0.05*SE + data license / 656-vs-652 docs; REGISTRY / CHANGELOG / paper-review / docstrings reconciled to the growing-sieve + positive-weight-support contract; two P3 follow-ups (basis-rebuild cache, survey-vcov zero-weight-PSU DOF correction) tracked in TODO.md. Co-Authored-By: Claude Opus 4.8 (1M context) --- CHANGELOG.md | 1 + METHODOLOGY_REVIEW.md | 45 +- TODO.md | 2 + diff_diff/efficient_did.py | 128 ++- diff_diff/efficient_did_covariates.py | 300 ++++-- diff_diff/efficient_did_results.py | 7 +- diff_diff/guides/llms-full.txt | 2 +- docs/api/efficient_did.rst | 32 +- docs/choosing_estimator.rst | 18 +- docs/methodology/REGISTRY.md | 7 +- .../papers/chen-santanna-xie-2025-review.md | 2 +- tests/data/README.md | 16 + tests/test_efficient_did_validation.py | 11 +- tests/test_methodology_efficient_did.py | 940 ++++++++++++++++++ 14 files changed, 1369 insertions(+), 142 deletions(-) create mode 100644 tests/test_methodology_efficient_did.py diff --git a/CHANGELOG.md b/CHANGELOG.md index d4fc92d0..ec6e31f6 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -12,6 +12,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Changed - **CI + local AI PR-reviewer model upgraded `gpt-5.4` → `gpt-5.5`.** The CI Codex reviewer (`.github/workflows/ai_pr_review.yml`) and the local `/ai-review-local` default (`.claude/scripts/openai_review.py` `DEFAULT_MODEL`) now run `gpt-5.5` @ `xhigh` effort / `read-only` sandbox (all other invocation settings unchanged). Validated empirically before the swap via the `tools/reviewer-eval/` A/B harness: on a real-bug corpus plus a k=6 big-diff de-risk, `gpt-5.5` matched-or-beat `gpt-5.4` on every test-backed recall case (including a bug buried in a ~3k-line methodology diff), added zero false positives, and ran faster; an end-to-end CI canary confirmed the action environment (`openai/codex-action@v1`, codex CLI 0.135.0) runs `gpt-5.5` and catches a planted P0. `gpt-5.5` is also added to the reasoning-model set (`_is_reasoning_model`, for the api-backend timeout/token limits) and to the `PRICING` table at OpenAI's confirmed standard rates (`gpt-5.5` $5/$30, `gpt-5.5-pro` $30/$180 per 1M input/output tokens) — the production CI + local reviewer run `gpt-5.5` via the flat-rate **codex backend**, but `--backend auto` falls back to the metered API path when the codex CLI is unavailable, so the cost estimate must stay accurate there. `gpt-5.4` remains accepted. +- **EfficientDiD methodology-review-tracker promotion: In Progress → Complete, with a covariate outcome-regression upgrade (behavior change).** Completes the source-validation pass (PR-B) of the Chen, Sant'Anna & Xie (2025, arXiv:2506.17729v1) audit — PR-A (#515) added the paper review on file; this PR validates the source against the code, eliminates the one real deviation, adds paper-equation Verified Components, and flips the tracker. **Behavior change:** the covariate doubly-robust path's outcome regression `m̂(X)` was a **linear OLS** working model — consistent (doubly robust) but attaining the semiparametric efficiency bound only when the conditional mean is linear in the covariates. It is replaced by a **polynomial sieve** (total degree up to K, AIC/BIC order selection, the same basis family as the propensity-ratio sieve), so with the sieve propensity ratio and the kernel-smoothed conditional `Ω*(X)` all nuisances are estimated nonparametrically and the covariate path attains the bound under the paper's regularity conditions (Section 4 / Theorem 4.1). The order is chosen by an OLS information criterion `IC = n·ln(RSS/n) + c_n·p_K`, where `p_K = comb(K+d, d)` is the sieve basis dimension (number of fitted coefficients; `c_n = 2` AIC, `ln(n)` BIC), on the within-group (survey-weighted) residual sum of squares, using the within-group **positive-weight support** count for both `n` and the penalty (the raw row count when unweighted) so the selected order — and hence `m̂` — is invariant both to the survey-weight scale and to zero-weight (survey-subpopulation / padded) rows. The weighted RSS / Gram / loss totals are already inert to zero-weight rows; keying order selection (auto-k_max, the `n_basis` admissibility cap, and the IC sample-size terms) off the positive-weight support — in the outcome regression **and** both propensity sieves — keeps selection inert too. The auto Silverman bandwidth for the kernel-smoothed conditional `Ω*(X)` is likewise evaluated on the positive-weight support, so the overidentified (`H > 1`) DR path — where `Ω*(X)` feeds the per-unit efficient weights — is invariant as well (otherwise a zero-weight row with an extreme covariate would inflate the unweighted std and move the bandwidth). So a zero-weight-padded survey fit returns a bit-identical point estimate to the positive-weight-only fit (verified end-to-end in `tests/test_methodology_efficient_did.py::TestSurveyZeroWeightInvariance`, both a just-identified `H=1` case pinning sieve-order selection and an overidentified `H>1` case pinning the kernel bandwidth, plus a helper-level auto-order regression test; each fails under the respective raw-count / all-row selector; the existing `test_survey_phase3.py` scale-invariance asserts still hold to `atol=1e-8`). **Degree 1 reproduces the prior linear OLS up to floating point**, so AIC/BIC degrades to linear when the conditional mean is linear and covariate-fit numbers change only when a higher order is selected (i.e. when linear was inadequate); `sieve_k_max=1` forces every covariate-path sieve to degree 1 (it recovers the linear outcome-regression component but also degree-1-constrains the propensity sieves, so it does **not** reproduce the exact pre-PR estimator). The sieve is a *growing* sieve — the candidate degree is `floor(n_pos^{1/5})` (the positive-weight support `n_pos`, the raw group size when unweighted) with **no fixed ceiling**, giving a basis dimension `p_K = comb(K+d,d)` bounded by `n_basis < n_pos` (so `p_K/n → 0` for the low-dimensional covariate settings typical of DiD; Assumption C.1's rate is on the dimension, not the degree). This satisfies C.1's growing-sieve uniform-consistency / `o_p(n^{-1/2})` product-rate conditions (Theorem 4.1) under which the bound is attained asymptotically; a frozen finite-order sieve would not. (High-dimensional `X` faces the usual curse of dimensionality, where the paper's ML-nuisance option applies.) This also removes the prior hard `K≤5` cap from the two pre-existing propensity-ratio / inverse-propensity sieves (a no-op for groups under ~3,125 units, where `floor(n^{1/5}) < 5` anyway; it only activates higher orders at large n). The small-group overfit cap (`n_basis < n_group`), the rank-guard + partial-skip warnings, and the WLS survey path mirror the propensity sieve; if every degree is rank-skipped the estimator falls back to the intercept-only within-group mean (distinct from the propensity sieve's constant-ratio-1 fallback). The no-covariate path, weights, generated outcomes, `Ω*`, SE, aggregation, and Hausman are **unchanged** — the audit verified them correct against the paper (no other corrections). The Theorem A.1 Hausman statistic computation was extracted into a behavior-preserving `_hausman_quadratic_form` helper for unit-testability. New `tests/test_methodology_efficient_did.py` with paper-equation-numbered Verified Components (Eq 3.5/3.13 inverse-covariance weights + the min-variance property; Eq 3.9 generated-outcome telescoping; Eq 3.13/§4.1 no-covariate closed form; Corollary 3.1/3.2 PT-Post = Callaway-Sant'Anna; Theorem 4.1 SE = `sqrt(mean(EIF²)/n)`; Theorem A.1 / Eq A.2 Hausman with the restricted−efficient covariance direction, the effective-rank DOF safeguard on a rank-deficient `V`, and the covariance-direction guard; plus the sieve nonlinear-recovery / linear-degradation / efficiency-gain checks). The HRS Table 6 anchor (`tests/test_efficient_did_validation.py::TestHRSReplication`, a derived openICPSR 116186 subset) is tightened from 0.1·SE to **0.05·SE** (the fit is deterministic; all cells are < 0.03·SE), with the data license/redistribution and the 656-vs-652 sample difference documented in `tests/data/README.md`. REGISTRY `## EfficientDiD` Notes updated (outcome regression now sieve + bound-attainment under Assumption C.1; new K=1-fallback edge-case Note); module/class docstrings and the paper review's "open working-model choice" pointer reconciled; `METHODOLOGY_REVIEW.md` row promoted to **Complete** (`Last Review = 2026-06-01`) with a Verified Components / Corrections Made / Deviations detail block; priority queue pruned. ## [3.5.0] - 2026-06-01 diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md index d07e5dac..9555517b 100644 --- a/METHODOLOGY_REVIEW.md +++ b/METHODOLOGY_REVIEW.md @@ -50,7 +50,7 @@ The catalog grew incrementally over several quarters, so formats vary across the | ImputationDiD | `imputation.py` | `didimputation` | **In Progress** | — | | TwoStageDiD | `two_stage.py` | `did2s` | **In Progress** | — | | WooldridgeDiD (ETWFE) | `wooldridge.py` | `etwfe` (R) / `jwdid` (Stata) | **Complete** | 2026-05-22 | -| EfficientDiD | `efficient_did.py` | (no canonical R package) | **In Progress** | — | +| EfficientDiD | `efficient_did.py` | (no canonical R package) | **Complete** | 2026-06-01 | ### Continuous & Universal-Treatment Estimators @@ -631,20 +631,36 @@ and covariate-adjusted specifications.) | Module | `efficient_did.py`, `efficient_did_bootstrap.py`, `efficient_did_covariates.py`, `efficient_did_weights.py` | | Primary Reference | Chen, Sant'Anna & Xie (2025), *Efficient Difference-in-Differences and Event Study Estimators* | | R Reference | (no canonical R package; paper compares against `did` / `DIDmultiplegt` / BJS / Gardner / Wooldridge as benchmarks rather than providing a reference implementation) | -| Status | **In Progress** | -| Last Review | — | +| Status | **Complete** | +| Last Review | 2026-06-01 | **Documentation in place:** -- REGISTRY.md section: `## EfficientDiD` (full Theorem 4.1 EIF, sieve-based propensity-ratio estimation with AIC/BIC, kernel-smoothed conditional covariance, Hausman pretest for PT-All vs PT-Post, survey support) -- Implementation: 130 unit tests in `tests/test_efficient_did.py` + 12 validation tests in `tests/test_efficient_did_validation.py` -- Hausman pretest: implemented per Theorem A.1 with Moore-Penrose pseudoinverse for finite-sample non-PSD variance-difference matrix -- Survey support: pweight + strata/PSU/FPC via TSL on EIF scores; covariates DR path with WLS outcome regression and weighted sieve normal equations +- REGISTRY.md section: `## EfficientDiD` (full Theorem 4.1 EIF, sieve-based propensity-ratio and outcome-regression estimation with AIC/BIC, kernel-smoothed conditional covariance, Hausman pretest for PT-All vs PT-Post, survey support) - Paper review on file: `docs/methodology/papers/chen-santanna-xie-2025-review.md` (PR-A, 2026-05-31) — faithful paper-sourced transcription of arXiv:2506.17729v1 (assumptions S/O/NA/PT-Post/PT-All; Theorem 3.1/3.2 EIFs + Corollaries 3.1/3.2; §4 sieve/kernel DR estimation; Theorem 4.1 SEs; Theorem A.1 Hausman; HRS Table 6 anchor) - -**Outstanding for promotion (PR-B source validation; paper review now on file):** -- Dedicated `tests/test_methodology_efficient_did.py` with Theorem 3.2 / Equation 3.5 / Equation 4.3 numbered Verified Components walk-through -- Cross-language anchor: the paper's empirical replication uses HRS data following Sun-Abraham (2021); a same-data benchmark against the paper's reported numbers (or a same-DGP MC against R alternatives) would substantiate the EIF construction -- Documented deviations: linear OLS working models for outcome regressions vs. paper's general nonparametric specification (DR safety net acknowledged but not separately validated); fixed-weight bootstrap aggregation vs. WIF-corrected analytical aggregation +- Tests: `tests/test_efficient_did.py` (unit/API), `tests/test_efficient_did_validation.py` (HRS Table 6 + Compustat MC), and `tests/test_methodology_efficient_did.py` (PR-B paper-equation Verified Components) + +**Corrections Made (PR-B source validation):** (None — implementation verified correct.) The PR-B walk-through traced each paper result against the source and found the no-covariate path (multi-baseline efficiency recovered via the `g'=g` same-cohort pairs), the generated outcome (Eq 3.9), the optimal weights (Eq 3.5/3.13), the conditional covariance Ω*(X) (Eq 3.12), the analytical SE (Theorem 4.1), the cohort-size event-study aggregation (with the `(G_g − π_g)` WIF correction), and the Hausman covariance direction (Eq A.2, restricted − efficient) all correct. + +**Implementation change (deliberate, decided with the maintainer — eliminates a deviation rather than fixing a bug):** +- Covariate outcome regression upgraded from linear OLS to a **polynomial sieve** (AIC/BIC order selection, same basis family as the propensity-ratio sieve; a *growing* sieve with no fixed order ceiling — `floor(n_pos^{1/5})` over the positive-weight support `n_pos` (raw group size when unweighted; zero-weight survey rows are inert for order selection), bounded by `n_basis < n_pos` — which, since C.1's rate is on the sieve *dimension* `p_K = comb(K+d,d)` (not the polynomial degree, which differ once `d > 1`), satisfies Assumption C.1's uniform-consistency / `o_p(n^{-1/2})` product-rate conditions for the low-dimensional covariate settings typical of DiD) so the doubly-robust covariate path attains the semiparametric efficiency bound asymptotically under the paper's nonparametric-nuisance specification (Section 4 / Theorem 4.1), not only when the conditional mean is linear. Degree 1 reproduces the prior linear OLS *outcome regression*; `sieve_k_max=1` forces all covariate-path sieves to degree 1 (it recovers the linear outcome component but also degree-1-constrains the propensity sieves, so it does **not** reproduce the exact pre-PR estimator). Removing the hard `K≤5` cap also updates the two pre-existing propensity-ratio / inverse-propensity sieves (a no-op for groups under ~3,125 units). `diff_diff/efficient_did_covariates.py::estimate_outcome_regression`. +- Extracted `_hausman_quadratic_form` (behaviour-preserving) so the Theorem A.1 statistic and effective-rank DOF logic are unit-testable in isolation. + +**Verified Components** (`tests/test_methodology_efficient_did.py`, paper-equation-numbered): +- [x] Inverse-covariance optimal weights `1'Ω*⁻¹/(1'Ω*⁻¹1)` (Eq 3.5 / 3.13) + the min-variance property + the singular-Ω* pseudoinverse path. +- [x] No-covariate generated outcome (Eq 3.9): `g'=g` telescopes to the per-baseline DiD (Eq 3.3); `g'=∞` to the period-1 long-difference. +- [x] No-covariate efficient ATT = `weights @ generated_outcomes` (Eq 3.13 / §4.1), rebuilt independently from within-group sample means/covariances. +- [x] PT-Post just-identified reduction = Callaway-Sant'Anna single-baseline (Corollary 3.1 single-date exact at 1e-9; Corollary 3.2 staggered). +- [x] Analytical SE = `sqrt(mean(EIF²)/n)` (Theorem 4.1). +- [x] Hausman statistic (Theorem A.1 / Eq A.2): `H = Δ'V⁺Δ`, `V = aCov(ẼS) − aCov(ÊS)` (restricted − efficient); `df = |E|` (well-conditioned) and `df = effective_rank < |E|` (rank-deficient safeguard); covariance-direction guard. +- [x] Covariate sieve outcome regression: recovers a nonlinear-in-X conditional mean (K≥2), reproduces linear OLS on linear data (K=1), and beats a forced-linear working model under a nonlinear-nuisance conditional-PT DGP. +- [x] Empirical anchor: HRS Table 6 (`tests/test_efficient_did_validation.py::TestHRSReplication`) on the paper's data (a derived openICPSR 116186 subset) matches all six ATT(g,t) + ES(0)/ES(1)/ES(2) + ES_avg to < 0.03 published SE; the Compustat MC confirms unbiasedness, the CS efficiency gain, coverage, and SE calibration. + +**Deviations (ratified; each carries a recognized REGISTRY label):** +- `overall_att` is the cohort-size-weighted post-treatment average (Callaway-Sant'Anna convention), not the paper's uniform `ES_avg` (Eq 2.3); `ES_avg` is recoverable from the event-study output. +- The multiplier bootstrap re-aggregates with fixed cohort-size weights (matching the CS bootstrap); the analytical path carries the `(G_g − π_g)` WIF weight-estimation correction. +- The Hausman χ² uses the effective rank of `V` as degrees of freedom (a finite-sample safeguard equal to `|E|` when `V` is well-conditioned), rather than a fixed `|E|`. +- `vcov_type` is permanently narrow to `{"hc1"}` (the IF-based variance has no single design matrix for analytical-sandwich families); the polynomial sieve basis and the Silverman kernel bandwidth are working-model choices the paper leaves open. +- SEs are i.i.d.-asymptotics (`sqrt(mean(EIF²)/n)` or multiplier bootstrap); cluster/survey EIF variants are documented library extensions beyond the paper's stated scope. --- @@ -1410,11 +1426,10 @@ more graceful handling of edge cases while still signaling invalid inference to Promotion priority for the **In Progress** entries, ordered by what's blocked on substantive review work (top of list = needs review next) vs. consolidation pass (bottom of list = mostly tracker walk-through): -**Substantive-review-blocked (still missing a methodology test file / R parity — and, except for EfficientDiD, a paper review):** +**Substantive-review-blocked (still missing a methodology test file / R parity and a paper review):** 1. **PlaceboTests** — decide first whether to keep standalone or absorb into per-estimator diagnostic sections; methodologically lightweight either way. -2. **EfficientDiD** — **paper review on file** (PR-A, `chen-santanna-xie-2025-review.md`); remaining PR-B work is the source-validation pass — `tests/test_methodology_efficient_did.py` (Theorem 3.1/3.2 / Eq 3.5 / Eq 4.3 Verified Components), the HRS Table 6 cross-language anchor, and the documented deviations against Chen, Sant'Anna & Xie (2025). -3. **ImputationDiD / TwoStageDiD** — natural pair (both single-treatment-effect-imputation methods). Each needs paper review, methodology file, R parity fixture against `didimputation` / `did2s`. +2. **ImputationDiD / TwoStageDiD** — natural pair (both single-treatment-effect-imputation methods). Each needs paper review, methodology file, R parity fixture against `didimputation` / `did2s`. **Consolidation-pass-blocked (already has paper review or methodology file or R parity; mostly Verified Components walk-through):** diff --git a/TODO.md b/TODO.md index fe7971cd..f4c49b59 100644 --- a/TODO.md +++ b/TODO.md @@ -87,6 +87,7 @@ Deferred items from PR reviews that were not addressed before merge. | SyntheticControl: the remaining ADH-2015 §4 items are out of scope of the leave-one-out + in-time-placebo PR — out-of-sample cross-validation `V`-selection (training/validation split), the regression-weight `W^reg = X_0'(X_0 X_0')^{-1} X_1` extrapolation diagnostic (flag implied OLS weights outside `[0,1]`), and sparse-SC subset search (`l < J`, holding `V` fixed). Leave-one-out (`leave_one_out()`) and the in-time placebo (`in_time_placebo()`) landed and are surfaced under `estimator_native_diagnostics`; these three are the deferred tail. | `synthetic_control.py`, `synthetic_control_results.py` | ADH-2015 follow-up | Low | | ContinuousDiD deferred CGBS 2024 extensions: (a) `covariates=` kwarg not implemented (matches R `contdid` v0.1.0); (b) discrete-treatment saturated regression deferred (integer-valued dose currently warned, not routed to per-level coefficients); (c) lowest-dose-as-control per CGBS 2024 Remark 3.1 (when `P(D=0) = 0`) not implemented — estimator requires never-treated controls. REGISTRY `## ContinuousDiD` → Implementation Checklist marks these as deferred `[ ]` items. | `diff_diff/continuous_did.py` | — | Low | | Survey-weighted Silverman bandwidth in EfficientDiD conditional Omega* — `_silverman_bandwidth()` uses unweighted mean/std for bandwidth selection; survey-weighted statistics would better reflect the population distribution but is a second-order refinement | `efficient_did_covariates.py` | — | Low | +| Survey sandwich SE is not exactly invariant to zero-weight (subpopulation / padded) rows: the shared `_compute_stratified_psu_meat` finite-sample correction counts zero-weight units as PSUs (an `n_psu/(n_psu-1)`-style factor), so adding zero-weight rows shifts the SE by a second-order amount (~2e-4 relative in the EfficientDiD e2e). The point estimate is exactly invariant and the weighted scores of zero-weight rows are already zero — only the DOF correction's PSU count includes them. Cross-cutting across all survey-enabled estimators; fix by counting only positive-weight PSUs in the correction. | `survey.py` (`_compute_stratified_psu_meat`) | PR-B follow-up | Low | | TROP: extend Wave 4's `_setup_trop_data` helper to also cover the duplicated bootstrap resampling loop in `_bootstrap_variance` / `_bootstrap_variance_global` (~40 LoC dedup; mirrors the data-setup helper pattern with a `fit_callable` parameter for the per-draw refit step). | `trop_local.py`, `trop_global.py` | follow-up | Low | | TripleDifference power auto-routing: `power.simulate_power` ignores `n_periods` for DDD because `_ddd_dgp_kwargs` is hard-coded to the cross-sectional `generate_ddd_data`. Now that `generate_ddd_panel_data` exists (Wave 4), add a new `_EstimatorProfile` registry entry (or extend the existing one) to route to the panel DGP when `n_periods > 2`. | `power.py`, `prep_dgp.py` | follow-up | Low | | StaggeredTripleDifference R cross-validation: CSV fixtures not committed (gitignored); tests skip without local R + triplediff. Commit fixtures or generate deterministically. | `tests/test_methodology_staggered_triple_diff.py` | #245 | Medium | @@ -162,6 +163,7 @@ Deferred items from PR reviews that were not addressed before merge. | Rust-backend HC2 implementation. Current Rust path only supports HC1; HC2 and CR2 Bell-McCaffrey fall through to the NumPy backend. For large-n fits this is noticeable. | `rust/src/linalg.rs` | Phase 1a | Low | | CR2 Bell-McCaffrey DOF uses a naive `O(n² k)` per-coefficient loop over cluster pairs. Pustejovsky-Tipton (2018) Appendix B has a scores-based formulation that avoids the full `n × n` `M` matrix. Switch when a user hits a large-`n` cluster-robust design. | `linalg.py::_compute_cr2_bm` | Phase 1a | Low | | `SyntheticControl` retains a full `_SyntheticControlFitSnapshot` (pivoted outcome/predictor panels) on EVERY fit to support the opt-in `in_space_placebo()`, so callers who never run the placebo still pay O(units × periods × predictor-vars) memory (same as `SyntheticDiD`'s always-on snapshot for `in_time_placebo`). Store a compact array/index representation instead of per-variable DataFrames, or build the snapshot lazily on first placebo call (would need to retain the source data, ~same cost). | `synthetic_control.py` snapshot build, `synthetic_control_results.py::_SyntheticControlFitSnapshot` | follow-up | Low | +| EfficientDiD DR (covariate) path rebuilds the full polynomial sieve basis `_polynomial_sieve_basis(X, K)` for every candidate `K` inside each of the three nuisance fits (outcome regression, propensity ratio, inverse propensity), per `fit()`. After the growing-sieve cap removal (PR-B), large covariate-adjusted fits at large `n` pay more avoidable basis-construction cost. Cache the basis per `(X, K)` within a `fit()` and share it across the nuisance helpers. | `diff_diff/efficient_did_covariates.py` (the three sieve helpers) | PR-B follow-up | Low | #### Testing/Docs diff --git a/diff_diff/efficient_did.py b/diff_diff/efficient_did.py index 8760c2e4..878b2f94 100644 --- a/diff_diff/efficient_did.py +++ b/diff_diff/efficient_did.py @@ -4,8 +4,8 @@ Implements the ATT estimator from Chen, Sant'Anna & Xie (2025). Without covariates, achieves the semiparametric efficiency bound via closed-form within-group covariances. With covariates, uses a doubly -robust path with OLS outcome regression, sieve propensity ratios, and -kernel-smoothed conditional Omega*(X) (see class docstring for caveats). +robust path with sieve outcome regressions, sieve propensity ratios, and +kernel-smoothed conditional Omega*(X) (see class docstring for details). Under PT-All the model is overidentified and EDiD exploits this for tighter inference; under PT-Post it reduces to the standard @@ -139,6 +139,69 @@ def _compute_se_from_eif( return float(np.sqrt(np.mean(eif**2) / n_units)) +def _hausman_quadratic_form( + delta: np.ndarray, + cov_post: np.ndarray, + cov_all: np.ndarray, +) -> Tuple[float, int, float, int, bool]: + """Hausman statistic from the event-study delta and the two ES covariances. + + Implements the Theorem A.1 test statistic of Chen, Sant'Anna & Xie (2025, + arXiv:2506.17729v1). The variance-difference matrix is + + V = aCov(ES_post) - aCov(ES_all) = cov_post - cov_all + + (restricted minus efficient, PSD under H0 because the efficient estimator has + the smaller variance), and the statistic is ``H = delta' V^+ delta`` with + ``delta = ES_post - ES_all``. ``V`` is inverted by Moore-Penrose pseudoinverse + and the number of strictly positive eigenvalues is used as the chi-square + degrees of freedom -- a finite-sample safeguard for a non-PSD ``V`` that equals + ``|E|`` (the number of post-treatment horizons) when ``V`` is well-conditioned. + + Parameters + ---------- + delta : ndarray, shape (|E|,) + Event-study difference ``ES_post - ES_all`` (restricted minus efficient). + cov_post, cov_all : ndarray, shape (|E|, |E|) + Estimator-scale covariances of the restricted (PT-Post) and efficient + (PT-All) event-study vectors. + + Returns + ------- + H : float + The Hausman statistic (``max(delta' V^+ delta, 0)``); NaN if ``V`` is + non-finite or has no positive eigenvalues. + effective_rank : int + Number of positive eigenvalues of ``V`` (the chi-square degrees of freedom). + p_value : float + Upper-tail ``chi2(effective_rank)`` p-value; NaN when ``H`` is NaN. + n_negative : int + Number of substantially negative eigenvalues of ``V`` (efficiency-reversal + diagnostic). + finite_ok : bool + False when ``V`` contains non-finite entries. + """ + from scipy.stats import chi2 + + V = cov_post - cov_all + if not np.all(np.isfinite(V)): + return np.nan, 0, np.nan, 0, False + + eigvals = np.linalg.eigvalsh(V) + max_eigval = float(np.max(np.abs(eigvals))) if len(eigvals) > 0 else 0.0 + tol = max(1e-10 * max_eigval, 1e-15) + + n_negative = int(np.sum(eigvals < -tol)) + effective_rank = int(np.sum(eigvals > tol)) + if effective_rank == 0: + return np.nan, 0, np.nan, n_negative, True + + V_pinv = np.linalg.pinv(V, rcond=tol / max_eigval if max_eigval > 0 else 1e-10) + H = max(float(delta @ V_pinv @ delta), 0.0) + p_value = float(chi2.sf(H, df=effective_rank)) + return H, effective_rank, p_value, n_negative, True + + class EfficientDiD(EfficientDiDBootstrapMixin): """Efficient DiD estimator (Chen, Sant'Anna & Xie 2025). @@ -147,13 +210,15 @@ class EfficientDiD(EfficientDiDBootstrapMixin): means and covariances. With covariates, uses a doubly robust path: sieve-based propensity - score ratios (Eq 4.1-4.2), OLS outcome regression, sieve-estimated - inverse propensities (algorithm step 4), and kernel-smoothed - conditional Omega*(X) with per-unit efficient weights (Eq 3.12). - The DR property ensures consistency if either the OLS outcome model - or the sieve propensity ratio is correctly specified. The OLS - working model for outcome regressions does not generically guarantee - the semiparametric efficiency bound (see REGISTRY.md). + score ratios (Eq 4.1-4.2), sieve outcome regressions (polynomial + basis, AIC/BIC order selection), sieve-estimated inverse propensities + (algorithm step 4), and kernel-smoothed conditional Omega*(X) with + per-unit efficient weights (Eq 3.12). The DR property ensures + consistency if either the outcome regression or the sieve propensity + ratio is correctly specified; because all nuisances are sieves / + kernel smoothers (the paper's flexible-nuisance specification), the + covariate path attains the semiparametric efficiency bound under the + paper's regularity conditions (see REGISTRY.md). Parameters ---------- @@ -199,10 +264,21 @@ class EfficientDiD(EfficientDiDBootstrapMixin): ``control_group="last_cohort"``, also trims the pseudo-control period set at ``t >= last_g - anticipation`` (see REGISTRY.md). sieve_k_max : int or None - Maximum polynomial degree for sieve ratio estimation. None = auto - (``min(floor(n_gp^{1/5}), 5)``). Only used with covariates. + Maximum polynomial degree for the covariate-path sieves — the + propensity-ratio, inverse-propensity, AND outcome-regression fits all + use it. None = auto (``floor(n_pos^{1/5})`` over each group's + positive-weight support ``n_pos`` — the raw group size when unweighted — + a growing sieve with no fixed ceiling, bounded by ``n_basis < n_pos``; + zero-weight survey rows do not affect order selection). Only + used with covariates. ``sieve_k_max=1`` forces every covariate-path + sieve (outcome regression and both propensity sieves) to degree 1: it + recovers the pre-sieve linear-OLS *outcome regression* but also + degree-1-constrains the propensity sieves, so it does not reproduce the + exact pre-sieve estimator. sieve_criterion : str, default ``"bic"`` - Information criterion for sieve degree selection: ``"aic"`` or ``"bic"``. + Information criterion (``"aic"`` or ``"bic"``) for the order selection + of all covariate-path sieves (propensity ratio, inverse propensity, and + outcome regression). ratio_clip : float, default 20.0 Clip sieve propensity ratios to ``[1/ratio_clip, ratio_clip]``. kernel_bandwidth : float or None @@ -855,6 +931,8 @@ def fit( never_treated_mask, t_col_val, tpre_col_val, + k_max=self.sieve_k_max, + criterion=self.sieve_criterion, unit_weights=unit_level_weights, ) # m_{g', tpre, 1}(X) @@ -869,6 +947,8 @@ def fit( gp_mask_for_reg, tpre_col_val, effective_p1_col, + k_max=self.sieve_k_max, + criterion=self.sieve_criterion, unit_weights=unit_level_weights, ) # r_{g, inf}(X) and r_{g, g'}(X) via sieve (Eq 4.1-4.2) @@ -1672,8 +1752,6 @@ def hausman_pretest( ------- HausmanPretestResult """ - from scipy.stats import chi2 - # Fit under both assumptions (analytical SEs only, no bootstrap) common_kwargs = dict( cluster=cluster, @@ -1836,22 +1914,16 @@ def _eif_cov(eif_mat: np.ndarray) -> np.ndarray: cov_all = (eif_all_mat.T @ eif_all_mat) / (n_units**2) cov_post = (eif_post_mat.T @ eif_post_mat) / (n_units**2) - V = cov_post - cov_all - - if not np.all(np.isfinite(V)): + H, effective_rank, p_value, n_negative, finite_ok = _hausman_quadratic_form( + delta, cov_post, cov_all + ) + if not finite_ok: warnings.warn( "Hausman covariance matrix contains non-finite values. " "The test is unreliable.", UserWarning, stacklevel=2, ) return _nan_result() - - # Eigendecompose V — check for non-PSD - eigvals = np.linalg.eigvalsh(V) - max_eigval = np.max(np.abs(eigvals)) if len(eigvals) > 0 else 0.0 - tol = max(1e-10 * max_eigval, 1e-15) - - n_negative = int(np.sum(eigvals < -tol)) if n_negative > 0: warnings.warn( f"Hausman variance-difference matrix V has {n_negative} " @@ -1860,16 +1932,8 @@ def _eif_cov(eif_mat: np.ndarray) -> np.ndarray: UserWarning, stacklevel=2, ) - - effective_rank = int(np.sum(eigvals > tol)) if effective_rank == 0: return _nan_result() - - V_pinv = np.linalg.pinv(V, rcond=tol / max_eigval if max_eigval > 0 else 1e-10) - H = float(delta @ V_pinv @ delta) - H = max(H, 0.0) - - p_value = float(chi2.sf(H, df=effective_rank)) reject = p_value < alpha es_details = pd.DataFrame( diff --git a/diff_diff/efficient_did_covariates.py b/diff_diff/efficient_did_covariates.py index 86052aee..2c3976c7 100644 --- a/diff_diff/efficient_did_covariates.py +++ b/diff_diff/efficient_did_covariates.py @@ -2,16 +2,18 @@ Doubly robust math for the Efficient DiD estimator (with covariates). Implements the with-covariates path from Chen, Sant'Anna & Xie (2025): -OLS outcome regression (linear working model), sieve-based propensity -score ratios (Eq 4.1-4.2), sieve-based inverse propensities (step 4), -kernel-smoothed conditional Omega*(X) for per-unit efficient weights, -doubly robust generated outcomes (Eq 4.4), and the efficient influence -function for analytical standard errors. - -The DR property ensures consistency if either the OLS outcome model or -the sieve propensity ratio is correctly specified. The OLS working model -does not generically guarantee the semiparametric efficiency bound unless -the conditional mean is linear in covariates (see REGISTRY.md). +sieve outcome regressions (polynomial basis, AIC/BIC order selection), +sieve-based propensity score ratios (Eq 4.1-4.2), sieve-based inverse +propensities (step 4), kernel-smoothed conditional Omega*(X) for per-unit +efficient weights, doubly robust generated outcomes (Eq 4.4), and the +efficient influence function for analytical standard errors. + +The DR property ensures consistency if either the outcome regression or +the sieve propensity ratio is correctly specified. All three nuisances are +polynomial sieves / a kernel smoother (the paper's flexible-nuisance +specification, Section 4), so the doubly robust path attains the +semiparametric efficiency bound under the paper's regularity conditions +(see REGISTRY.md). All functions are pure (no state), operating on pre-pivoted numpy arrays. """ @@ -37,15 +39,51 @@ def estimate_outcome_regression( group_mask: np.ndarray, t_col: int, tpre_col: int, + k_max: Optional[int] = None, + criterion: str = "bic", unit_weights: Optional[np.ndarray] = None, ) -> np.ndarray: - """Estimate conditional mean outcome change m_hat(X) for a comparison group. + r"""Estimate conditional mean outcome change m_hat(X) via a polynomial sieve. + + Regresses ``(Y_t - Y_{tpre})`` on a polynomial sieve basis ``psi^K(X)`` within + the units identified by ``group_mask`` (WLS when ``unit_weights`` is given), + selects the sieve degree ``K`` by an OLS information criterion, and returns + predicted values ``m_hat(X_i)`` for **all** units (extrapolated from the + within-group fit). This implements ``m_hat_{g',t,tpre}(X) = E[Y_t - Y_{tpre} + | G=g', X]`` as a nonparametric (sieve) regression, matching the paper's + flexible-nuisance specification (Section 4). Together with the sieve + propensity ratio and kernel-smoothed Omega*(X) the doubly robust path attains + the semiparametric efficiency bound under the paper's regularity conditions. + + Order selection mirrors :func:`estimate_propensity_ratio_sieve`. For each + degree ``K = 1, ..., k_max`` (capped so the basis dimension stays below the + group size), the within-group (W)LS fit is scored by - Regresses ``(Y_t - Y_{tpre})`` on ``X`` within the units identified by - ``group_mask`` using OLS. Returns predicted values ``m_hat(X_i)`` for - **all** units (extrapolated from the within-group fit). - - This implements ``m_hat_{g',t,tpre}(X) = E[Y_t - Y_{tpre} | G=g', X]``. + .. math:: + \mathrm{IC}(K) = n \, \log(\mathrm{RSS}_w / n) + c_n \, p_K + + where ``n`` is the within-group **positive-weight support** count (the raw + row count when unweighted), ``RSS_w`` the (survey-)weighted residual sum of + squares, ``p_K`` the basis dimension, and ``c_n = 2`` (AIC) or ``log(n)`` + (BIC). Keying ``n`` and the penalty off the positive-weight support — only + ``RSS_w`` is weighted — makes ``IC(K)`` shift by a ``K``-independent constant + ``n*log(c)`` under survey-weight rescaling ``w -> c*w`` (the WLS fit itself is + weight-scale invariant) **and** makes zero-weight rows fully inert for order + selection, so the selected order and ``m_hat`` are invariant both to the + survey-weight scale and to zero-weight (e.g. survey-padded) rows. + + A degree whose (weighted) design Gram matrix has condition number above + ``1/sqrt(eps)`` (or that yields a non-finite fit) is skipped; if at least one + degree succeeds while others are skipped a ``UserWarning`` lists them. If + **every** degree is skipped (e.g. the group is too small for even the linear + basis), the estimator falls back to the intercept-only within-group mean of + ``Y_t - Y_{tpre}`` (the unconditional outcome regression) with a + ``UserWarning`` — distinct from the propensity sieve's constant-ratio fallback. + An empty comparison group (``n_group == 0``) returns zeros for all units (no + covariate adjustment). + Degree 1 (``[1, X]``) reproduces the previous linear-OLS working model up to + floating point. Per-cache-miss cost rises from one OLS to up to ``k_max`` OLS + solves, negligible against the kernel-Omega* term. Parameters ---------- @@ -57,36 +95,136 @@ def estimate_outcome_regression( Mask selecting units in the comparison group. t_col, tpre_col : int Column indices in ``outcome_wide`` for the two time periods. + k_max : int or None + Maximum polynomial degree. None = ``floor(n_pos^{1/5})`` where ``n_pos`` + is the within-group positive-weight support count (the raw group size + when unweighted) — a growing sieve with no fixed ceiling (the candidate + order grows with the support size and is bounded only by + ``n_basis < n_pos``), matching the propensity-ratio sieve. + criterion : str + ``"aic"`` or ``"bic"`` order selection. unit_weights : ndarray, shape (n_units,), optional - Survey weights at the unit level. When provided, uses WLS - instead of OLS for the within-group regression. + Survey weights at the unit level. When provided, uses WLS for the + within-group regression and a weighted RSS in the criterion. Returns ------- m_hat : ndarray, shape (n_units,) Predicted ``E[Y_t - Y_{tpre} | X]`` for every unit. """ + n_units = len(covariate_matrix) Y_group = outcome_wide[group_mask] delta_y = Y_group[:, t_col] - Y_group[:, tpre_col] + n_group = int(np.sum(group_mask)) - X_group = covariate_matrix[group_mask] - X_design = np.column_stack([np.ones(len(X_group)), X_group]) + if criterion not in ("aic", "bic"): + raise ValueError(f"criterion must be 'aic' or 'bic', got {criterion!r}") + + if n_group == 0: + return np.zeros(n_units) w_group = unit_weights[group_mask] if unit_weights is not None else None - coef, _, _ = solve_ols( - X_design, - delta_y, - weights=w_group, - weight_type="pweight" if w_group is not None else None, - return_vcov=False, - rank_deficient_action="warn", - ) + # Positive-weight support. Zero-weight rows contribute nothing to the WLS + # fit, the weighted Gram, or the weighted RSS, so order selection (auto-k_max, + # the ``n_basis`` admissibility cap, and the IC sample-size terms) must key + # off the positive-weight support count — otherwise padding the panel with + # zero-weight (e.g. survey-subpopulation) rows could silently change the + # selected ``K`` and hence the DR estimate even though ``m_hat`` is unchanged. + if w_group is not None: + support = w_group > 0 + n_pos = int(np.sum(support)) + delta_y_pos = delta_y[support] + else: + n_pos = n_group + delta_y_pos = delta_y + + # Intercept-only fallback: the unconditional within-group mean of Δy. + if w_group is not None and float(np.sum(w_group)) > 0: + fallback_mean = float(np.average(delta_y, weights=w_group)) + else: + fallback_mean = float(np.mean(delta_y)) + + d = covariate_matrix.shape[1] + if k_max is None: + k_max = int(n_pos**0.2) + k_max = max(k_max, 1) + + c_n = 2.0 if criterion == "aic" else np.log(max(n_pos, 2)) + cond_threshold = 1.0 / np.sqrt(np.finfo(float).eps) + + # Floor RSS so a (near-)perfect in-sample fit cannot drive log -> -inf and + # spuriously select a high degree; ties then break on the K-penalty toward + # the simpler order. + support_var = float(np.var(delta_y_pos)) if n_pos > 0 else 0.0 + rss_floor = max(1e-300, 1e-12 * n_pos * support_var) + + best_ic = np.inf + best_m_hat = np.full(n_units, fallback_mean) + singular_K: List[int] = [] + + for K in range(1, k_max + 1): + n_basis = comb(K + d, d) + # Cap so basis dimension stays below the support size (overfit guard). + if n_basis >= n_pos: + break + + basis_all = _polynomial_sieve_basis(covariate_matrix, K) + basis_group = basis_all[group_mask] + + # Rank guard on the (weighted) design Gram, mirroring the propensity sieve. + if w_group is not None: + gram = basis_group.T @ (w_group[:, None] * basis_group) + else: + gram = basis_group.T @ basis_group + with np.errstate(invalid="ignore", over="ignore"): + gram_cond = float(np.linalg.cond(gram)) + if not np.isfinite(gram_cond) or gram_cond > cond_threshold: + singular_K.append(K) + continue + + coef, _, _ = solve_ols( + basis_group, + delta_y, + weights=w_group, + weight_type="pweight", + return_vcov=False, + rank_deficient_action="warn", + ) + if not np.all(np.isfinite(coef)): + singular_K.append(K) + continue + + resid = delta_y - basis_group @ coef + if w_group is not None: + rss = float(np.sum(w_group * resid**2)) + else: + rss = float(np.sum(resid**2)) + rss = max(rss, rss_floor) + ic_val = n_pos * np.log(rss / n_pos) + c_n * n_basis - X_all = np.column_stack([np.ones(len(covariate_matrix)), covariate_matrix]) - coef_safe = np.where(np.isfinite(coef), coef, 0.0) - m_hat = X_all @ coef_safe + if ic_val < best_ic: + best_ic = ic_val + best_m_hat = basis_all @ coef + + if best_ic == np.inf: + warnings.warn( + "Outcome regression sieve estimation failed for all K values " + "(group too small or design rank-deficient at every degree). " + "Falling back to the intercept-only within-group mean.", + UserWarning, + stacklevel=2, + ) + elif singular_K: + warnings.warn( + f"Outcome regression sieve: skipped K={singular_K} due to " + f"rank-deficient or non-finite design. Selected basis used the " + f"remaining K values; this may indicate limited covariate variation.", + UserWarning, + stacklevel=2, + ) + m_hat = best_m_hat non_finite = ~np.isfinite(m_hat) if non_finite.any(): n_bad = int(non_finite.sum()) @@ -96,6 +234,7 @@ def estimate_outcome_regression( UserWarning, stacklevel=2, ) + m_hat = m_hat.copy() m_hat[non_finite] = 0.0 return m_hat @@ -190,7 +329,10 @@ def estimate_propensity_ratio_sieve( mask_gp : ndarray of bool, shape (n_units,) Comparison group mask. k_max : int or None - Maximum polynomial degree. None = ``min(floor(n_gp^{1/5}), 5)``. + Maximum polynomial degree. None = ``floor(n_gp^{1/5})`` where ``n_gp`` is + the comparison-group positive-weight support count (raw size when + unweighted) — a growing sieve with no fixed ceiling (bounded only by + ``n_basis < n_gp``). Zero-weight rows do not affect order selection. criterion : str ``"aic"`` or ``"bic"``. ratio_clip : float @@ -213,25 +355,32 @@ def estimate_propensity_ratio_sieve( d = covariate_matrix.shape[1] - # Default k_max: use comparison group size, not total n - if k_max is None: - k_max = min(int(n_gp**0.2), 5) - k_max = max(k_max, 1) - - # Penalty multiplier for IC - # BIC penalty uses observation count (not weighted) — complexity vs distinct obs - n_total = int(np.sum(mask_g)) + n_gp - c_n = 2.0 if criterion == "aic" else np.log(max(n_total, 2)) - - # Weighted totals for loss normalization (raw probability weights) + # Survey weights and positive-weight support counts. Zero-weight rows are + # inert in the weighted normal equations and the weighted loss total + # ``n_total_w``, so sieve selection (auto-k_max, the n_basis admissibility + # cap, and the IC sample-size terms) must key off the positive-weight support + # — otherwise padding the panel with zero-weight rows could silently change + # the selected K and the DR estimate. Unweighted: the raw row counts. if unit_weights is not None: w_g = unit_weights[mask_g] w_gp = unit_weights[mask_gp] + n_gp_pos = int(np.sum(w_gp > 0)) + n_total_pos = int(np.sum(w_g > 0)) + n_gp_pos n_total_w = float(np.sum(w_g)) + float(np.sum(w_gp)) else: w_g = None w_gp = None - n_total_w = float(n_total) + n_gp_pos = n_gp + n_total_pos = int(np.sum(mask_g)) + n_gp + n_total_w = float(n_total_pos) + + # Default k_max: grow with the comparison-group support size. + if k_max is None: + k_max = int(n_gp_pos**0.2) + k_max = max(k_max, 1) + + # BIC penalty uses the positive-weight support count (complexity vs distinct obs) + c_n = 2.0 if criterion == "aic" else np.log(max(n_total_pos, 2)) best_ic = np.inf best_ratio = np.ones(n_units) # fallback: constant ratio 1 @@ -243,8 +392,8 @@ def estimate_propensity_ratio_sieve( for K in range(1, k_max + 1): n_basis = comb(K + d, d) - # Cap K so basis dimension < n_gp (avoid singular system) - if n_basis >= n_gp: + # Cap K so basis dimension < n_gp_pos (avoid singular system) + if n_basis >= n_gp_pos: break basis_all = _polynomial_sieve_basis(covariate_matrix, K) @@ -287,9 +436,9 @@ def estimate_propensity_ratio_sieve( # Derivation: L(beta) = (1/n_w)(beta'A*beta - 2*b'beta). # At optimum A*beta = b, so beta'A*beta = b'beta. # Therefore L = (1/n_w)(b'beta - 2*b'beta) = -(1/n_w)*b'beta. - # Loss uses weighted totals; BIC penalty uses observation count. + # Loss uses weighted totals; BIC penalty uses the positive-weight support. loss = -float(b @ beta) / n_total_w - ic_val = 2.0 * loss + c_n * n_basis / n_total + ic_val = 2.0 * loss + c_n * n_basis / n_total_pos if ic_val < best_ic: best_ic = ic_val @@ -396,14 +545,13 @@ def estimate_inverse_propensity_sieve( if n_group == 0: return np.ones(n_units) - if k_max is None: - k_max = min(int(n_group**0.2), 5) - k_max = max(k_max, 1) - - # BIC penalty uses observation count (not weighted) - c_n = 2.0 if criterion == "aic" else np.log(max(n_units, 2)) - - # Weighted loss normalization and fallback + # Survey weights, fallback, and positive-weight support counts. Zero-weight + # rows are inert in the weighted normal equations and the weighted loss total + # ``n_units_w``, so sieve selection (auto-k_max, the n_basis admissibility + # cap, and the IC sample-size terms) must key off the positive-weight support + # rather than the raw row counts (see the outcome-regression docstring) — + # padding with zero-weight rows then cannot change the selected K or the DR + # estimate. Unweighted: the raw row counts. if unit_weights is not None: w_group = unit_weights[group_mask] sum_w_group = float(np.sum(w_group)) @@ -412,10 +560,21 @@ def estimate_inverse_propensity_sieve( return np.ones(n_units) n_units_w = float(np.sum(unit_weights)) fallback_ratio = n_units_w / sum_w_group + n_group_pos = int(np.sum(w_group > 0)) + n_units_pos = int(np.sum(unit_weights > 0)) else: w_group = None n_units_w = float(n_units) fallback_ratio = n_units / n_group + n_group_pos = n_group + n_units_pos = n_units + + if k_max is None: + k_max = int(n_group_pos**0.2) + k_max = max(k_max, 1) + + # BIC penalty uses the positive-weight support count + c_n = 2.0 if criterion == "aic" else np.log(max(n_units_pos, 2)) best_ic = np.inf best_s = np.full(n_units, fallback_ratio) # fallback: unconditional @@ -424,7 +583,7 @@ def estimate_inverse_propensity_sieve( for K in range(1, k_max + 1): n_basis = comb(K + d, d) - if n_basis >= n_group: + if n_basis >= n_group_pos: break basis_all = _polynomial_sieve_basis(covariate_matrix, K) @@ -460,9 +619,9 @@ def estimate_inverse_propensity_sieve( s_hat = basis_all @ beta # IC: loss = -(1/n_w) * b'beta (same derivation as ratio estimator) - # Loss uses weighted totals; BIC penalty uses observation count. + # Loss uses weighted totals; BIC penalty uses the positive-weight support. loss = -float(b @ beta) / n_units_w - ic_val = 2.0 * loss + c_n * n_basis / n_units + ic_val = 2.0 * loss + c_n * n_basis / n_units_pos if ic_val < best_ic: best_ic = ic_val @@ -605,11 +764,26 @@ def compute_generated_outcomes_cov( # --------------------------------------------------------------------------- -def _silverman_bandwidth(X: np.ndarray) -> float: +def _silverman_bandwidth(X: np.ndarray, unit_weights: Optional[np.ndarray] = None) -> float: """Silverman's rule-of-thumb bandwidth for d-dimensional X. ``h = (4 / (d + 2))^{1/(d+4)} * median_std * n^{-1/(d+4)}`` + + When ``unit_weights`` is provided, the rule is evaluated on the + **positive-weight support** (rows with ``w > 0``) only. The dispersion + statistic stays unweighted within that support (a documented second-order + simplification — survey-weighted bandwidth is deferred), but dropping + zero-weight rows keeps the bandwidth — and hence the kernel-smoothed + ``Omega*(X)`` and the per-unit efficient weights it feeds — invariant to + zero-weight (survey-subpopulation / padded) rows: such rows carry no + information yet would otherwise move both ``n`` and the standard deviation + (e.g. a zero-weight row with an extreme covariate inflates ``median_std``). + Falls back to the full matrix if the support is empty. """ + if unit_weights is not None: + support = unit_weights > 0 + if np.any(support): + X = X[support] n, d = X.shape stds = np.std(X, axis=0) stds[stds < 1e-10] = 1.0 @@ -737,7 +911,7 @@ def compute_omega_star_conditional( return np.empty((n_units, 0, 0)) if bandwidth is None: - bandwidth = _silverman_bandwidth(covariate_matrix) + bandwidth = _silverman_bandwidth(covariate_matrix, unit_weights) t_col = period_to_col[target_t] y1_col = period_1_col @@ -828,7 +1002,9 @@ def compute_omega_star_conditional( ) if gp_j not in W_gp_cache: X_gp = covariate_matrix[cohort_masks[gp_j]] - w_gp_j = unit_weights[cohort_masks[gp_j]] if unit_weights is not None else None + w_gp_j = ( + unit_weights[cohort_masks[gp_j]] if unit_weights is not None else None + ) W_gp_cache[gp_j] = _kernel_weights_matrix( covariate_matrix, X_gp, bandwidth, group_weights=w_gp_j ) diff --git a/diff_diff/efficient_did_results.py b/diff_diff/efficient_did_results.py index 8e91b9fa..83fdd2fc 100644 --- a/diff_diff/efficient_did_results.py +++ b/diff_diff/efficient_did_results.py @@ -124,9 +124,12 @@ class EfficientDiDResults: estimation_path : str ``"nocov"`` or ``"dr"`` — which estimation path was used. sieve_k_max : int or None - Maximum polynomial degree for sieve ratio estimation. + Maximum polynomial degree for the covariate-path sieves (propensity + ratio, inverse propensity, and outcome regression); ``1`` forces a + linear outcome-regression working model. sieve_criterion : str - Information criterion used (``"aic"`` or ``"bic"``). + Information criterion used (``"aic"`` or ``"bic"``) for all + covariate-path sieve order selection. ratio_clip : float Clipping bound for sieve propensity ratios. kernel_bandwidth : float or None diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt index 5298bd0f..97c37ce7 100644 --- a/diff_diff/guides/llms-full.txt +++ b/diff_diff/guides/llms-full.txt @@ -848,7 +848,7 @@ plot_event_study(results) ### EfficientDiD -Efficient DiD estimator (Chen, Sant'Anna & Xie 2025). Achieves the semiparametric efficiency bound for ATT(g,t) on the **no-covariate path**. Also supports an optional doubly-robust covariate path (sieve-based propensity score ratios + linear OLS outcome regression): the DR property gives consistency if either the OR or the PS is correctly specified, but the linear OLS outcome regression does not generically attain the efficiency bound unless the conditional mean is linear in the covariates. Pass column names to the `covariates` parameter on `fit()`. +Efficient DiD estimator (Chen, Sant'Anna & Xie 2025). Achieves the semiparametric efficiency bound for ATT(g,t) on the **no-covariate path**. Also supports an optional doubly-robust covariate path with all nuisances estimated nonparametrically (sieve-based propensity score ratios + a sieve outcome regression with AIC/BIC order selection + kernel-smoothed conditional covariance): the DR property gives consistency if either the outcome regression or the PS is correctly specified, and the covariate path attains the efficiency bound asymptotically under the paper's regularity conditions (a *growing* polynomial sieve — no fixed order ceiling — whose basis dimension satisfies Assumption C.1's rate for low-dimensional covariates; degree 1 reproduces a linear working model, and `sieve_k_max=1` forces all covariate-path sieves to degree 1). Pass column names to the `covariates` parameter on `fit()`. ```python EfficientDiD( diff --git a/docs/api/efficient_did.rst b/docs/api/efficient_did.rst index ba58e5c3..3c2681b3 100644 --- a/docs/api/efficient_did.rst +++ b/docs/api/efficient_did.rst @@ -16,15 +16,18 @@ This module implements the efficiency-bound-attaining estimator that: .. note:: - EfficientDiD supports a doubly-robust covariate path: sieve-based - propensity score ratios combined with a linear OLS outcome regression. - The DR property ensures consistency if either the OR or the PS ratio is - correctly specified, but the linear OLS working model for the outcome - regression does not generically attain the semiparametric efficiency - bound unless the conditional mean is linear in the covariates. The - unqualified efficiency-bound claim applies to the no-covariate path - only. Pass column names to the ``covariates`` parameter on ``fit()``. - See ``docs/methodology/REGISTRY.md`` for the full contract. + EfficientDiD supports a doubly-robust covariate path with all nuisances + estimated nonparametrically: sieve-based propensity score ratios and a + sieve outcome regression (polynomial basis, AIC/BIC order selection), + plus the kernel-smoothed conditional covariance. The DR property ensures + consistency if either the outcome regression or the PS ratio is correctly + specified, and because the nuisances are growing sieves/kernels the + covariate path attains the semiparametric efficiency bound asymptotically + under the paper's regularity conditions (degree 1 reproduces a linear + working model, and ``sieve_k_max=1`` forces all covariate-path sieves to + degree 1). Pass column names to the ``covariates`` + parameter on ``fit()``. See ``docs/methodology/REGISTRY.md`` for the full + contract. **When to use EfficientDiD:** @@ -34,9 +37,10 @@ This module implements the efficiency-bound-attaining estimator that: - You need a formal efficiency benchmark for comparing estimators For covariate-adjusted designs, the doubly-robust path is consistent under -either outcome-regression or propensity-ratio correctness but does not -generically attain the efficiency bound under the shipped linear OLS -outcome regression. +either outcome-regression or propensity-ratio correctness and attains the +efficiency bound under the paper's regularity conditions, with all nuisances +(sieve propensity ratio, sieve outcome regression, kernel covariance) +estimated nonparametrically. **Reference:** Chen, X., Sant'Anna, P. H. C., & Xie, H. (2025). Efficient Difference-in-Differences and Event Study Estimators. @@ -148,11 +152,11 @@ Comparison with Other Staggered Estimators - Conditional PT - Strict exogeneity * - Efficiency - - Achieves semiparametric bound on the no-covariate path; DR covariate path is consistent but does not generically attain the bound under a linear OLS outcome regression + - Achieves the semiparametric bound on the no-covariate path; the doubly-robust covariate path attains it too (under regularity conditions) via sieve nuisances - Not efficient - Efficient under homogeneity * - Covariates - - Supported (doubly robust, sieve-based PS + linear OLS OR) + - Supported (doubly robust, sieve-based PS ratio + sieve outcome regression) - Supported (OR, IPW, DR) - Supported * - Bootstrap diff --git a/docs/choosing_estimator.rst b/docs/choosing_estimator.rst index 574c83f4..15df86cf 100644 --- a/docs/choosing_estimator.rst +++ b/docs/choosing_estimator.rst @@ -435,14 +435,16 @@ Use :class:`~diff_diff.EfficientDiD` when: .. note:: - EfficientDiD supports covariate adjustment via a doubly-robust path: - sieve-based propensity score ratios combined with a linear OLS outcome - regression. The DR property gives consistency if either the OR or the - PS is correctly specified, but the linear OLS outcome regression does - not generically attain the semiparametric efficiency bound unless the - conditional mean is linear in the covariates. The unqualified efficiency - claim applies to the no-covariate path only. Pass column names to the - ``covariates`` parameter on ``fit()``. See + EfficientDiD supports covariate adjustment via a doubly-robust path with + all nuisances estimated nonparametrically: sieve-based propensity score + ratios, a sieve outcome regression (polynomial basis, AIC/BIC order + selection), and a kernel-smoothed conditional covariance. The DR property + gives consistency if either the outcome regression or the PS is correctly + specified, and the covariate path attains the semiparametric efficiency + bound asymptotically under the paper's regularity conditions (a growing + sieve; degree 1 reproduces a linear working model, and ``sieve_k_max=1`` + forces all covariate-path sieves to degree 1). Pass column + names to the ``covariates`` parameter on ``fit()``. See ``docs/methodology/REGISTRY.md`` for the full contract. .. code-block:: python diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 5bf5cc1e..c70387aa 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -1057,12 +1057,13 @@ where `q_{g,e} = pi_g / sum_{g' in G_{trt,e}} pi_{g'}`. - [x] Absorbing treatment validation - [x] Overlap diagnostics for propensity score ratios - [x] Survey design support (Phase 3): survey-weighted means/covariances in Omega*, TSL on EIF scores; bootstrap+survey supported (Phase 6) -- **Note:** Sieve ratio estimation uses polynomial basis functions (total degree up to K) with AIC/BIC model selection. The paper describes sieve estimators generally without specifying a particular basis family; polynomial sieves are a standard choice (Section 4, Eq 4.2). Negative sieve ratio predictions are clipped to a small positive value since the population ratio p_g(X)/p_{g'}(X) is non-negative. +- **Note:** Sieve ratio estimation uses polynomial basis functions (total degree up to K) with AIC/BIC model selection. The paper describes sieve estimators generally without specifying a particular basis family; polynomial sieves are a standard choice (Section 4, Eq 4.2). Negative sieve ratio predictions are clipped to a small positive value since the population ratio p_g(X)/p_{g'}(X) is non-negative. The outcome regression m_hat(X) uses the same polynomial sieve basis family (see the outcome-regression Note below). - **Note:** Kernel-smoothed conditional covariance Omega*(X) uses Gaussian kernel with Silverman's rule-of-thumb bandwidth by default. The paper specifies kernel smoothing (step 5, Section 4) without mandating a particular kernel or bandwidth selection method. - **Note:** Conditional covariance Omega*(X) scales each term by per-unit sieve-estimated inverse propensities s_hat_{g'}(X) = 1/p_{g'}(X) (algorithm step 4), matching Eq 3.12. The inverse propensity estimation uses the same polynomial sieve convex minimization as the ratio estimator. Estimated s_hat values are clipped to [1, n] with a UserWarning when clipping binds, mirroring the ratio path's overlap diagnostics. -- **Note:** Outcome regressions m_hat_{g',t,tpre}(X) use linear OLS working models. The paper's Section 4 describes flexible nonparametric nuisance estimation (sieve regression, kernel smoothing, or ML methods). The DR property ensures consistency if either the OLS outcome model or the sieve propensity ratio is correctly specified, but the linear OLS specification does not generically guarantee attainment of the semiparametric efficiency bound unless the conditional mean is linear in the covariates. +- **Note:** Outcome regressions m_hat_{g',t,tpre}(X) use a polynomial sieve (total degree up to K) with AIC/BIC order selection — the same basis family as the propensity-ratio sieve — matching the paper's flexible nonparametric nuisance specification (Section 4). The sieve order is selected by an OLS information criterion `IC = n*ln(RSS/n) + c_n*p_K`, where `p_K = comb(K+d, d)` is the **sieve basis dimension** (the number of fitted coefficients; `d` = covariate count) and `c_n = 2` (AIC) / `ln(n)` (BIC), on the within-group (survey-weighted) residual sum of squares; the within-group **positive-weight support** count is used for both `n` and the penalty (the raw row count when unweighted) so the selected order — and hence m_hat — is invariant both to the survey-weight scale and to zero-weight (survey-subpopulation / padded) rows. Zero-weight observations are inert in the weighted RSS, weighted Gram, and weighted loss totals; keying order selection (auto-k_max, the `n_basis` admissibility cap, and the IC sample-size terms) off the positive-weight support keeps them inert for selection too — otherwise padding the panel with zero-weight rows could push `floor(n^{1/5})` to a higher candidate degree and silently change the selected K (hence the DR estimate). This applies identically to the two sieve propensity nuisances. Degree 1 reproduces a linear working model up to floating point, so AIC/BIC degrades to linear when the conditional mean is linear and grows the order only when the data support it. There is **no fixed order ceiling**: the candidate maximum degree grows as `floor(n_pos^{1/5})` with the (positive-weight) group support `n_pos`, giving a sieve dimension `p_K = comb(K+d, d)` bounded by `n_basis = p_K < n_pos` — a *growing* sieve. Assumption C.1's regularity conditions are stated in terms of the sieve **dimension** (not the polynomial degree, which differ once `d > 1`): uniform consistency (C.1(5), `||m_hat - m||_inf = o_p(1)`) and the `o_p(n^{-1/2})` product rate (C.1(6)) require `p_K -> ∞` with `p_K = o(n)`. The growing sieve satisfies this for the low-dimensional covariate settings typical of DiD — with the `floor(n^{1/5})` degree rule, `p_K = comb(floor(n^{1/5})+d, d) = o(n)` for small `d`; for high-dimensional `X` a polynomial sieve faces the curse of dimensionality (`p_K` can outpace `o(n)`), where the paper's ML-nuisance option (Remark 4.2) is preferable. Under C.1 the doubly robust covariate path attains the semiparametric efficiency bound asymptotically (Theorem 4.1); a frozen finite-order sieve (fixed `p_K`) would violate C.1(5)/(6). The DR property still ensures consistency, regardless of `d`, if either the outcome regression or the propensity ratio is correctly specified. +- **Note:** If every sieve degree is rank-skipped for an outcome regression (a comparison group too small for even the linear basis, or a degenerate/constant covariate), the estimator falls back to the intercept-only within-group mean of `Y_t - Y_{tpre}` (the unconditional outcome regression) with a UserWarning — distinct from the propensity-ratio sieve's constant-ratio-1 fallback. - **Note:** EfficientDiD bootstrap with survey weights supported (Phase 6) via PSU-level multiplier weights -- **Note:** EfficientDiD covariates (DR path) with survey weights supported — WLS outcome regression, weighted sieve normal equations for propensity ratios/inverse propensities, survey-weighted Nadaraya-Watson kernel for conditional Omega*(X), and survey-weighted ATT averaging. Silverman bandwidth uses unweighted statistics (survey-weighted bandwidth deferred as second-order refinement). +- **Note:** EfficientDiD covariates (DR path) with survey weights supported — WLS outcome regression, weighted sieve normal equations for propensity ratios/inverse propensities, survey-weighted Nadaraya-Watson kernel for conditional Omega*(X), and survey-weighted ATT averaging. The auto Silverman bandwidth for the conditional Omega*(X) kernel is evaluated on the **positive-weight support** (rows with `w > 0`): the dispersion statistic stays unweighted within that support (a fully survey-weighted bandwidth is deferred as a second-order refinement), but excluding zero-weight rows keeps the bandwidth — and hence Omega*(X) and the per-unit efficient weights it feeds in overidentified (H>1) cells — invariant to zero-weight (survey-subpopulation / padded) rows. Together with the positive-weight-support sieve order selection, this makes the DR point estimate exactly invariant to zero-weight padding (a zero-weight row with an extreme covariate would otherwise inflate the unweighted std and move the bandwidth). - **Note:** Cluster-robust SEs use the standard Liang-Zeger clustered sandwich estimator applied to EIF values: aggregate EIF within clusters, center, and compute variance with G/(G-1) small-sample correction. Cluster bootstrap generates multiplier weights at the cluster level (all units in a cluster share the same weight). Analytical clustered SEs are the default when `cluster` is set; cluster bootstrap is opt-in via `n_bootstrap > 0`. - **Note:** Hausman pretest operates on the post-treatment event-study vector ES(e) per Theorem A.1. Both PT-All and PT-Post fits are aggregated to ES(e) using cohort-size weights before computing the test statistic H = delta' V^{-1} delta where delta = ES_post - ES_all and V = Cov(ES_post) - Cov(ES_all). Covariance is computed from aggregated ES(e)-level EIF values. The variance-difference matrix V is inverted via Moore-Penrose pseudoinverse to handle finite-sample non-positive-definiteness. Effective rank of V (number of positive eigenvalues) is used as degrees of freedom. - **Note:** Last-cohort-as-control (`control_group="last_cohort"`) reclassifies the latest treatment cohort as pseudo-never-treated and drops time periods at `t >= last_g - anticipation`, excluding anticipation-contaminated periods from the pseudo-control's pre-treatment window. This is distinct from CallawaySantAnna's `not_yet_treated` option which dynamically selects not-yet-treated units per (g,t) pair. diff --git a/docs/methodology/papers/chen-santanna-xie-2025-review.md b/docs/methodology/papers/chen-santanna-xie-2025-review.md index 9e934a3a..215b1288 100644 --- a/docs/methodology/papers/chen-santanna-xie-2025-review.md +++ b/docs/methodology/papers/chen-santanna-xie-2025-review.md @@ -187,7 +187,7 @@ Table 7 reports asymptotic relative efficiency (ARE, EDiD = 1.00): the alternati Two DGPs: single-treatment-date built on Arkhangelsky et al. (2021) calibrated to CPS; staggered built on Baker et al. (2022) calibrated to Compustat (covariates play no role in these setups). For `ES_avg`, the paper compares the efficient plug-in (EDiD, Eq 3.6) against OLS-TWFE (`β`, Eq 2.6), the average of post-treatment event-study TWFE coefficients (DTWFE, Eq 2.4), and Synthetic DiD (Arkhangelsky et al. 2021). Headline: the efficient estimators deliver markedly lower RMSE and narrower CIs, often exceeding 40% precision gains, with no loss in bias. ### Where the paper leaves a working-model choice open (neutral pointers for PR-B; no verdict here) -- **Outcome regression `m̂(X)`** — the paper admits sieve/kernel/ML and does **not** mandate a parametric form (the DR property covers misspecification of one nuisance). The library's choice of working model is PR-B's to evaluate against the bound. +- **Outcome regression `m̂(X)`** — the paper admits sieve/kernel/ML and does **not** mandate a parametric form (the DR property covers misspecification of one nuisance). **Resolved in PR-B:** the library implements `m̂(X)` as a polynomial sieve (AIC/BIC order selection), the same basis family as the propensity-ratio sieve, so all nuisances are nonparametric and the covariate doubly-robust path attains the bound under the paper's regularity conditions (degree 1 reproduces a linear working model). - **Sieve basis `ψ^K`** — the paper gives *(tensor products of) cubic B-splines* as an **example**; it does not mandate a specific basis family. - **Kernel / bandwidth for `Ω̂*`** — the paper specifies a generic kernel `Ker` and bandwidth `h` (Nadaraya–Watson) without mandating Gaussian/Silverman. - **Overall scalar summary** — the paper's scalar is `ES_avg` (Eq 2.3), a uniform average over post-treatment horizons; any cohort-size-weighted overall ATT is a different aggregation (PR-B to reconcile). diff --git a/tests/data/README.md b/tests/data/README.md index 1b361598..0f8be59b 100644 --- a/tests/data/README.md +++ b/tests/data/README.md @@ -6,6 +6,15 @@ "The Economic Consequences of Hospital Admissions." *American Economic Review*, 108(2), 308-352. Replication kit: https://www.openicpsr.org/openicpsr/project/116186/version/V1/view +**License / redistribution:** This fixture is a derived four-column subset (de-identified HRS +public-use person id, wave, out-of-pocket spending, and first-hospitalization wave) of the +**publicly available** Dobkin et al. (2018) replication package deposited in the AEA's openICPSR +repository (project 116186, distributed by ICPSR for replication of the published article). Only +the derived subset is committed — the full source `HRS_long.dta` is **not** redistributed here +(`.gitignore`d as `replication_data/`; regenerate from the openICPSR deposit via the snippet +below). It is included solely as a regression-test fixture to replicate the paper's Table 6. +Consult the openICPSR deposit page for the deposit's exact Terms of Use. + **Sample selection:** Follows Sun & Abraham (2021), as used by Chen, Sant'Anna & Xie (2025) Section 6: @@ -30,6 +39,13 @@ Section 6: **Columns:** `unit` (hhidpn), `time` (wave), `outcome` (oop_spend, 2005 dollars), `first_treat` (first_hosp) +**Note on sample size (656 vs 652):** Chen, Sant'Anna & Xie (2025) Table 6 reports **652** +individuals; this fixture yields **656**. The four-individual difference reflects a minor +sample-selection nuance (e.g. exact age-window or first-hospitalization tie handling) not fully +pinned down by the paper text. It is immaterial to the validation: every EDiD point estimate in +`tests/test_efficient_did_validation.py::TestHRSReplication` matches the published Table 6 value +to within **0.03 of the published standard error**. + **Regeneration:** Requires the Dobkin et al. replication kit (`.gitignore`d as `replication_data/`). ```python diff --git a/tests/test_efficient_did_validation.py b/tests/test_efficient_did_validation.py index 62b56b75..9a29ec22 100644 --- a/tests/test_efficient_did_validation.py +++ b/tests/test_efficient_did_validation.py @@ -67,12 +67,15 @@ def _get_effect(effects_dict, g, t): raise KeyError(f"ATT({g},{t}) not found in results") -def _assert_close(actual, expected, label, se=None, se_frac=0.1): +def _assert_close(actual, expected, label, se=None, se_frac=0.05): """Assert actual is close to expected, tolerance based on published SE. - Default tolerance is 0.1 * SE (10% of one standard error). Our actual - diffs are all < 0.03 SE, so this catches real drift while absorbing the - 4-individual sample difference (656 vs paper's 652). + Default tolerance is 0.05 * SE (5% of one published standard error). The HRS + fit is deterministic — no covariates, so the closed-form no-covariate path + runs with no RNG — and every cell's deviation from the published Table 6 + value is < 0.03 SE, so this is a tight-but-safe anchor that still absorbs the + documented 4-individual sample difference (656 vs the paper's 652; see + ``tests/data/README.md``). """ if se is not None: tol = se_frac * se diff --git a/tests/test_methodology_efficient_did.py b/tests/test_methodology_efficient_did.py new file mode 100644 index 00000000..d61ce757 --- /dev/null +++ b/tests/test_methodology_efficient_did.py @@ -0,0 +1,940 @@ +"""EfficientDiD methodology test file — Chen, Sant'Anna & Xie (2025) walkthrough. + +Companion to ``tests/test_efficient_did.py`` (API/unit surface) and +``tests/test_efficient_did_validation.py`` (HRS Table 6 + Compustat MC anchors): +this file validates the EfficientDiD *math* against the specific paper +equations/theorems, with paper-equation-numbered assertions. Mirrors the +structure of ``tests/test_methodology_power.py``. + +Paper: Chen, X., Sant'Anna, P. H. C., & Xie, H. (2025). *Efficient +Difference-in-Differences and Event Study Estimators.* arXiv:2506.17729v1. +Paper review on file: ``docs/methodology/papers/chen-santanna-xie-2025-review.md``. + +Decision pinned in PR-B (the source-validation pass): + +- **D1 (sieve outcome regression):** the covariate doubly-robust path estimates + the outcome regression ``m_hat(X)`` with a polynomial sieve (AIC/BIC order + selection), matching the paper's flexible-nuisance specification (Section 4). + Together with the sieve propensity ratio (Eq 4.1-4.2) and kernel-smoothed + ``Omega*(X)`` this lets the covariate path attain the semiparametric efficiency + bound under the paper's regularity conditions; it ELIMINATES the prior + linear-OLS working-model deviation. Degree 1 reproduces the previous linear OLS + up to floating point. ``TestCovariateSieveOutcomeRegression`` pins it. + +Class structure (Verified Components): + +- ``TestEfficientWeights`` — inverse-covariance optimal weights + ``w = 1' Omega*^-1 / (1' Omega*^-1 1)`` (Eq 3.5 single-date / Eq 3.13 staggered), + the min-variance property, and the singular-``Omega*`` pseudoinverse path. +- ``TestGeneratedOutcomeNoCov`` — the no-covariate generated outcome (Eq 3.9): the + ``g'=g`` same-cohort pair telescopes to the per-baseline DiD + ``E[Y_t-Y_tpre|g] - E[Y_t-Y_tpre|inf]`` (Eq 3.3); the ``g'=inf`` pairs telescope + to the period-1 long-difference and are equal across ``t_pre``. +- ``TestNoCovariateClosedForm`` — Section 4.1 / Eq 3.13: the fitted efficient + ``ATT(g,t)`` equals ``weights @ generated_outcomes`` rebuilt independently from + the within-group sample means/covariances. +- ``TestPTPostReducesToCS`` — Corollary 3.1 / 3.2: under PT-Post EfficientDiD is + just-identified (single never-treated, ``g-1`` baseline) and reduces to the + standard single-baseline Callaway-Sant'Anna ``ATT(g,t)``. +- ``TestAnalyticalSE`` — Theorem 4.1: the analytical SE equals + ``sqrt(mean(EIF^2)/n)``. +- ``TestHausmanStatistic`` — Theorem A.1 (Eq A.2): ``H = delta' V^+ delta`` with + ``V = cov_post - cov_all`` (restricted MINUS efficient, footnote 21); + well-conditioned ``V`` gives ``df = |E|``, a rank-deficient ``V`` gives + ``df = effective_rank < |E|`` (the documented finite-sample DOF safeguard), + reversing the covariance order flags an efficiency reversal, and the end-to-end + pretest is internally consistent. +- ``TestCovariateSieveOutcomeRegression`` — D1: the sieve recovers a nonlinear-in-X + conditional mean (selects ``K >= 2``), reproduces linear OLS on linear data + (``K = 1``), and the growing sieve is more accurate than degree-1-capped + sieves when the nuisances are nonlinear. +""" + +import warnings + +import numpy as np +import pandas as pd +import pytest + +from diff_diff import CallawaySantAnna, EfficientDiD +from diff_diff.efficient_did import _hausman_quadratic_form +from diff_diff.efficient_did_covariates import ( + estimate_inverse_propensity_sieve, + estimate_outcome_regression, + estimate_propensity_ratio_sieve, +) +from diff_diff.efficient_did_weights import ( + compute_efficient_weights, + compute_generated_outcomes_nocov, + compute_omega_star_nocov, + enumerate_valid_triples, +) + +# ============================================================================= +# Helpers +# ============================================================================= + + +def _staggered_panel( + groups=(2, 3), n_per_group=120, n_control=120, n_periods=4, effect=2.0, sigma=0.4, seed=42 +): + """Balanced staggered panel with parallel trends (PT-All holds).""" + rng = np.random.default_rng(seed) + n_units = n_per_group * len(groups) + n_control + ft = np.full(n_units, np.inf) + for j, g in enumerate(groups): + ft[j * n_per_group : (j + 1) * n_per_group] = g + units = np.repeat(np.arange(n_units), n_periods) + times = np.tile(np.arange(1, n_periods + 1), n_units) + ft_col = np.repeat(ft, n_periods) + unit_fe = np.repeat(rng.normal(0, 1, n_units), n_periods) + time_fe = np.tile(np.arange(1, n_periods + 1) * 0.3, n_units) + tau = np.where((ft_col < np.inf) & (times >= ft_col), effect, 0.0) + y = unit_fe + time_fe + tau + rng.normal(0, sigma, len(units)) + return pd.DataFrame({"unit": units, "time": times, "first_treat": ft_col, "y": y}) + + +def _single_date_panel( + n_treated=150, n_control=150, n_periods=3, treat_period=2, effect=2.0, sigma=0.4, seed=7 +): + """Two-group single-treatment-date panel (G in {treat_period, inf}).""" + rng = np.random.default_rng(seed) + n_units = n_treated + n_control + ft = np.full(n_units, np.inf) + ft[:n_treated] = treat_period + units = np.repeat(np.arange(n_units), n_periods) + times = np.tile(np.arange(1, n_periods + 1), n_units) + ft_col = np.repeat(ft, n_periods) + unit_fe = np.repeat(rng.normal(0, 1, n_units), n_periods) + time_fe = np.tile(np.arange(1, n_periods + 1) * 0.5, n_units) + tau = np.where((ft_col < np.inf) & (times >= ft_col), effect, 0.0) + y = unit_fe + time_fe + tau + rng.normal(0, sigma, len(units)) + return pd.DataFrame({"unit": units, "time": times, "first_treat": ft_col, "y": y}) + + +def _pivot_nocov(df): + """Rebuild the no-covariate pivoted structures the fit derives internally. + + Means/covariances are permutation-invariant, so this reproduces the exact + ``Omega*`` / generated outcomes regardless of internal unit ordering. + """ + units = np.sort(df["unit"].unique()) + times = np.sort(df["time"].unique()) + period_to_col = {float(t): i for i, t in enumerate(times)} + wide = df.pivot(index="unit", columns="time", values="y").loc[units, times].to_numpy() + ft = df.groupby("unit")["first_treat"].first().loc[units].to_numpy(dtype=float) + n = len(units) + treatment_groups = sorted({float(g) for g in ft if np.isfinite(g)}) + cohort_masks = {float(g): (ft == g) for g in treatment_groups} + never_treated_mask = ~np.isfinite(ft) + cohort_fractions = {float(g): float((ft == g).sum()) / n for g in treatment_groups} + cohort_fractions[np.inf] = float(never_treated_mask.sum()) / n + return { + "wide": wide, + "period_to_col": period_to_col, + "period_1_col": period_to_col[float(times[0])], + "period_1": float(times[0]), + "cohort_masks": cohort_masks, + "never_treated_mask": never_treated_mask, + "cohort_fractions": cohort_fractions, + "time_periods": [float(t) for t in times], + "treatment_groups": treatment_groups, + } + + +def _cs_effect(cs_result, g, t): + for (gg, tt), info in cs_result.group_time_effects.items(): + if int(gg) == int(g) and int(tt) == int(t): + return info["effect"] + return None + + +def _delta_y_arrays(n, x1, dy): + """Pack a single-group outcome-change problem for estimate_outcome_regression. + + Returns (outcome_wide[n,2] with col0=Y_tpre=0, col1=Y_t=dy), covariate matrix + [n,1]=x1, and an all-True group mask. + """ + ow = np.column_stack([np.zeros(n), dy]) + return ow, x1.reshape(-1, 1), np.ones(n, dtype=bool) + + +# ============================================================================= +# Eq 3.5 / 3.13 — inverse-covariance optimal weights +# ============================================================================= + + +class TestEfficientWeights: + """Optimal weights ``w = 1' Omega*^-1 / (1' Omega*^-1 1)`` (Eq 3.5 / 3.13).""" + + def test_inverse_covariance_weights_hand_computed(self): + omega = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 3.0]]) + w, used_pinv, _ = compute_efficient_weights(omega) + inv = np.linalg.inv(omega) + ones = np.ones(3) + expected = (ones @ inv) / (ones @ inv @ ones) + assert not used_pinv + np.testing.assert_allclose(w, expected, atol=1e-10) + assert abs(w.sum() - 1.0) < 1e-12 + + def test_weights_are_minimum_variance(self): + """The inverse-covariance weights minimize ``v' Omega v`` over ``1'v = 1`` + (the semiparametric-efficiency property of Theorem 3.1/3.2).""" + omega = np.array([[1.0, 0.3], [0.3, 2.0]]) + w, _, _ = compute_efficient_weights(omega) + var_w = float(w @ omega @ w) + rng = np.random.default_rng(0) + for _ in range(500): + a = rng.uniform(-1.0, 2.0) + v = np.array([a, 1.0 - a]) + assert v @ omega @ v >= var_w - 1e-12 + + def test_singular_omega_uses_pseudoinverse(self): + """Two identical rows -> singular Omega* -> pinv path; weights still sum to 1.""" + omega = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + w, used_pinv, _ = compute_efficient_weights(omega) + assert used_pinv + assert abs(w.sum() - 1.0) < 1e-8 + + +# ============================================================================= +# Eq 3.9 — no-covariate generated outcome (telescoping) +# ============================================================================= + + +class TestGeneratedOutcomeNoCov: + """The no-covariate generated outcome ``Y_hat_j`` (Eq 3.9), built from + within-group sample means, telescopes as the paper describes.""" + + def _setup(self): + # 5 units in cohort g=3, 5 never-treated; periods {1,2,3}. + # g=3 outcomes [1,2,5]; never-treated [0,3,4] (deliberately non-parallel + # so the tpre=2 per-baseline DiD differs from the period-1 long-difference). + wide = np.array([[1.0, 2.0, 5.0]] * 5 + [[0.0, 3.0, 4.0]] * 5) + cohort_masks = {3.0: np.array([True] * 5 + [False] * 5)} + never = np.array([False] * 5 + [True] * 5) + period_to_col = {1.0: 0, 2.0: 1, 3.0: 2} + return wide, cohort_masks, never, period_to_col + + def test_telescoping_to_per_baseline_and_long_difference(self): + wide, cohort_masks, never, p2c = self._setup() + pairs = enumerate_valid_triples(3.0, [3.0], [1.0, 2.0, 3.0], 1.0, "all") + y_hat = compute_generated_outcomes_nocov( + 3.0, 3.0, pairs, wide, cohort_masks, never, p2c, period_1_col=0 + ) + gen = dict(zip(pairs, y_hat)) + + # g'=g=3, t_pre=2: per-baseline DiD (Eq 3.3) = (Y3-Y2|g) - (Y3-Y2|inf) + per_baseline = (5.0 - 2.0) - (4.0 - 3.0) # = 2.0 + assert gen[(3.0, 2.0)] == pytest.approx(per_baseline, abs=1e-12) + + # g'=inf telescopes to the period-1 long-difference (Y3-Y1|g) - (Y3-Y1|inf), + # independent of t_pre. + long_diff = (5.0 - 1.0) - (4.0 - 0.0) # = 0.0 + assert gen[(np.inf, 2.0)] == pytest.approx(long_diff, abs=1e-12) + assert gen[(np.inf, 3.0)] == pytest.approx(long_diff, abs=1e-12) + + # The two mechanisms genuinely differ here (non-trivial telescoping). + assert abs(gen[(3.0, 2.0)] - gen[(np.inf, 2.0)]) > 1.0 + + +# ============================================================================= +# Section 4.1 / Eq 3.13 — no-covariate closed form +# ============================================================================= + + +class TestNoCovariateClosedForm: + """The fitted efficient ATT equals ``weights @ generated_outcomes`` (Eq 3.13) + rebuilt independently from within-group sample means/covariances (Section 4.1).""" + + def test_efficient_att_equals_weighted_generated_outcomes(self): + df = _staggered_panel( + groups=(2, 3), n_per_group=120, n_control=120, n_periods=4, sigma=0.4, seed=11 + ) + res = EfficientDiD(pt_assumption="all").fit(df, "y", "unit", "time", "first_treat") + P = _pivot_nocov(df) + + # Pick an overidentified (g,t) (|H| >= 2). + target, valid_pairs = None, None + for g, t in res.group_time_effects: + if t < g: + continue + pairs = enumerate_valid_triples( + float(g), P["treatment_groups"], P["time_periods"], P["period_1"], "all" + ) + if len(pairs) >= 2: + target, valid_pairs = (float(g), float(t)), pairs + break + assert target is not None, "no overidentified (g,t) found" + assert valid_pairs is not None + g, t = target + + omega = compute_omega_star_nocov( + g, + t, + valid_pairs, + P["wide"], + P["cohort_masks"], + P["never_treated_mask"], + P["period_to_col"], + P["period_1_col"], + P["cohort_fractions"], + ) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + w, _, _ = compute_efficient_weights(omega) + y_hat = compute_generated_outcomes_nocov( + g, + t, + valid_pairs, + P["wide"], + P["cohort_masks"], + P["never_treated_mask"], + P["period_to_col"], + P["period_1_col"], + ) + att_manual = float(w @ y_hat) + + fit_effect = None + for (gg, tt), info in res.group_time_effects.items(): + if float(gg) == g and float(tt) == t: + fit_effect = info["effect"] + break + assert fit_effect == pytest.approx(att_manual, abs=1e-9) + + +# ============================================================================= +# Corollary 3.1 / 3.2 — PT-Post reduces to Callaway-Sant'Anna +# ============================================================================= + + +class TestPTPostReducesToCS: + """Under PT-Post EfficientDiD is just-identified and matches CS single-baseline.""" + + def test_pt_post_is_just_identified_single_baseline(self): + # Corollary 3.2: PT-Post -> exactly one moment (never-treated, g-1 baseline). + pairs = enumerate_valid_triples(3.0, [3.0, 5.0], [1.0, 2.0, 3.0, 4.0, 5.0], 1.0, "post") + assert pairs == [(np.inf, 2.0)] + + def test_pt_post_matches_cs_single_date(self): + # Corollary 3.1: two-group single-date PT-Post == CS, exactly. + df = _single_date_panel(n_treated=200, n_control=200, seed=3) + edid = EfficientDiD(pt_assumption="post").fit(df, "y", "unit", "time", "first_treat") + cs = CallawaySantAnna(control_group="never_treated").fit( + df, "y", "unit", "time", "first_treat" + ) + checked = 0 + for (g, t), info in edid.group_time_effects.items(): + if t >= g: + c = _cs_effect(cs, g, t) + assert c is not None + assert info["effect"] == pytest.approx(c, abs=1e-9) + checked += 1 + assert checked >= 1 + + def test_pt_post_matches_cs_staggered(self): + # Corollary 3.2: staggered PT-Post == CS single-baseline (post-treatment). + df = _staggered_panel(groups=(2, 3), n_per_group=150, n_control=150, seed=5) + edid = EfficientDiD(pt_assumption="post").fit(df, "y", "unit", "time", "first_treat") + cs = CallawaySantAnna(control_group="never_treated").fit( + df, "y", "unit", "time", "first_treat" + ) + for (g, t), info in edid.group_time_effects.items(): + if t >= g and np.isfinite(info["effect"]): + c = _cs_effect(cs, g, t) + if c is not None: + assert info["effect"] == pytest.approx(c, abs=1e-8) + + +# ============================================================================= +# Theorem 4.1 — analytical SE = sqrt(mean(EIF^2)/n) +# ============================================================================= + + +class TestAnalyticalSE: + """The default analytical SE is ``sqrt(mean(EIF^2)/n)`` (Theorem 4.1, p.21).""" + + def test_se_equals_sqrt_mean_eif_squared_over_n(self): + df = _staggered_panel(groups=(2, 3), n_per_group=120, n_control=120, seed=9) + res = EfficientDiD(pt_assumption="all").fit( + df, "y", "unit", "time", "first_treat", store_eif=True + ) + assert res.influence_functions is not None + checked = 0 + for (g, t), info in res.group_time_effects.items(): + se = info["se"] + if not (np.isfinite(se) and se > 0): + continue + eif = res.influence_functions[(g, t)] + expected = float(np.sqrt(np.mean(eif**2) / len(eif))) + assert se == pytest.approx(expected, rel=1e-9) + checked += 1 + assert checked >= 1 + + +# ============================================================================= +# Theorem A.1 (Eq A.2) — Hausman statistic +# ============================================================================= + + +class TestHausmanStatistic: + """``H = delta' V^+ delta`` with ``V = cov_post - cov_all`` (restricted minus + efficient); effective-rank degrees of freedom; covariance-direction guard.""" + + def test_well_conditioned_v_df_equals_number_of_horizons(self): + from scipy.stats import chi2 + + cov_all = np.array([[0.04, 0.01], [0.01, 0.05]]) # efficient (smaller) + cov_post = np.array([[0.09, 0.02], [0.02, 0.10]]) # restricted (larger) + delta = np.array([0.3, -0.2]) + H, rank, p, n_neg, ok = _hausman_quadratic_form(delta, cov_post, cov_all) + V = cov_post - cov_all + assert ok and n_neg == 0 and rank == 2 + assert H == pytest.approx(float(delta @ np.linalg.inv(V) @ delta), rel=1e-10) + assert p == pytest.approx(float(chi2.sf(H, df=2)), rel=1e-10) + + def test_rank_deficient_v_uses_effective_rank(self): + from scipy.stats import chi2 + + # Second coordinate has equal restricted/efficient variance -> V[1,1] = 0 + # -> a zero eigenvalue -> df = effective_rank = 1 < |E| = 2. + cov_all = np.array([[0.04, 0.0], [0.0, 0.05]]) + cov_post = np.array([[0.09, 0.0], [0.0, 0.05]]) + delta = np.array([0.3, 0.2]) + H, rank, p, n_neg, ok = _hausman_quadratic_form(delta, cov_post, cov_all) + assert ok and rank == 1 and n_neg == 0 + # Only the positive-variance direction contributes: 0.3^2 / (0.09 - 0.04). + assert H == pytest.approx(0.3**2 / (0.09 - 0.04), rel=1e-9) + assert p == pytest.approx(float(chi2.sf(H, df=1)), rel=1e-9) + + def test_covariance_direction_restricted_minus_efficient(self): + # V must be cov_post - cov_all. Reversing the order gives a negative-definite + # V (efficiency reversal): n_negative > 0 and no positive eigenvalues. + cov_all = np.array([[0.04]]) + cov_post = np.array([[0.09]]) + delta = np.array([0.2]) + H1, r1, _, nneg1, _ = _hausman_quadratic_form(delta, cov_post, cov_all) + H2, r2, _, nneg2, _ = _hausman_quadratic_form(delta, cov_all, cov_post) + assert nneg1 == 0 and r1 == 1 + assert H1 == pytest.approx(0.2**2 / (0.09 - 0.04), rel=1e-9) + assert nneg2 == 1 and r2 == 0 and np.isnan(H2) + + def test_end_to_end_pretest_consistent(self): + df = _staggered_panel(groups=(3, 5), n_per_group=150, n_control=200, n_periods=7, seed=21) + pretest = EfficientDiD.hausman_pretest(df, "y", "unit", "time", "first_treat") + assert pretest.gt_details is not None + n_horizons = len(pretest.gt_details) + assert n_horizons >= 1 + assert 1 <= pretest.df <= n_horizons # effective rank <= |E| + assert pretest.statistic >= 0 + assert 0.0 <= pretest.p_value <= 1.0 + + +# ============================================================================= +# D1 — sieve outcome regression (the PR-B upgrade) +# ============================================================================= + + +class TestCovariateSieveOutcomeRegression: + """The outcome regression m_hat(X) is a polynomial sieve (Section 4).""" + + def test_sieve_recovers_nonlinear_conditional_mean(self): + # E[dY|x1] = x1^2 : the sieve (K>=2) tracks the curvature; a forced-linear + # (K=1) fit cannot. A pure-linear m_hat would have corr(m, x1^2) ~ 0. + rng = np.random.default_rng(0) + n = 400 + x1 = rng.normal(size=n) + dy = x1**2 + rng.normal(scale=0.3, size=n) + ow, cov, mask = _delta_y_arrays(n, x1, dy) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + m_sieve = estimate_outcome_regression(ow, cov, mask, 1, 0, k_max=4, criterion="bic") + m_linear = estimate_outcome_regression(ow, cov, mask, 1, 0, k_max=1, criterion="bic") + truth = x1**2 + mse_sieve = float(np.mean((m_sieve - truth) ** 2)) + mse_linear = float(np.mean((m_linear - truth) ** 2)) + assert np.corrcoef(m_sieve, truth)[0, 1] > 0.9 + assert mse_sieve < 0.5 * mse_linear # sieve clearly better + assert not np.allclose(m_sieve, m_linear) # K>=2 was actually selected + + def test_sieve_degrades_to_linear_on_linear_data(self): + # E[dY|x1] = 3*x1 (+ noise): BIC selects K=1 and m_hat matches linear OLS. + rng = np.random.default_rng(1) + n = 400 + x1 = rng.normal(size=n) + dy = 3.0 * x1 + rng.normal(scale=0.5, size=n) + ow, cov, mask = _delta_y_arrays(n, x1, dy) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + m_sieve = estimate_outcome_regression(ow, cov, mask, 1, 0, k_max=4, criterion="bic") + Xd = np.column_stack([np.ones(n), x1]) + beta = np.linalg.lstsq(Xd, dy, rcond=None)[0] + m_ols = Xd @ beta + np.testing.assert_allclose(m_sieve, m_ols, atol=1e-8) + + def test_all_degrees_skipped_falls_back_to_group_mean(self): + # A constant covariate makes every design rank-deficient -> fall back to the + # intercept-only within-group mean of dY (the documented fallback semantic). + rng = np.random.default_rng(2) + n = 50 + x1 = np.ones(n) # zero variance -> [1, X] is singular at every K + dy = 2.0 + rng.normal(scale=0.1, size=n) + ow, cov, mask = _delta_y_arrays(n, x1, dy) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + m = estimate_outcome_regression(ow, cov, mask, 1, 0, k_max=4, criterion="bic") + np.testing.assert_allclose(m, np.full(n, dy.mean()), atol=1e-9) + + def test_sieve_order_selection_invariant_to_survey_weight_scale(self): + # Rescaling survey weights w -> c*w leaves the selected order and m̂ + # unchanged: the OLS IC uses the positive-weight support count for both n + # and the penalty, and only RSS is weighted, so the criterion shifts by a + # K-independent constant (REGISTRY outcome-regression Note). + rng = np.random.default_rng(7) + n = 300 + x1 = rng.normal(size=n) + dy = x1**2 + rng.normal(scale=0.4, size=n) # nonlinear -> K>=2 selected + ow, cov, mask = _delta_y_arrays(n, x1, dy) + w = rng.uniform(0.5, 2.0, size=n) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + m1 = estimate_outcome_regression( + ow, cov, mask, 1, 0, k_max=4, criterion="bic", unit_weights=w + ) + m2 = estimate_outcome_regression( + ow, cov, mask, 1, 0, k_max=4, criterion="bic", unit_weights=100.0 * w + ) + np.testing.assert_allclose(m1, m2, atol=1e-9) + + def test_weighted_all_degrees_skipped_uses_weighted_group_mean(self): + # Survey weights + a constant covariate (every design singular) -> the + # fallback is the WEIGHTED within-group mean of dY. + rng = np.random.default_rng(8) + n = 40 + x1 = np.ones(n) + dy = 1.0 + rng.normal(scale=0.1, size=n) + ow, cov, mask = _delta_y_arrays(n, x1, dy) + w = rng.uniform(0.5, 2.0, size=n) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + m = estimate_outcome_regression( + ow, cov, mask, 1, 0, k_max=4, criterion="bic", unit_weights=w + ) + expected = float(np.average(dy, weights=w)) + np.testing.assert_allclose(m, np.full(n, expected), atol=1e-9) + + def test_zero_weight_padding_does_not_change_auto_selected_order(self): + # Zero-weight (survey-subpopulation / padded) rows must be INERT for sieve + # ORDER SELECTION, not merely for the WLS coefficients at a fixed K. The + # auto order, the n_basis admissibility cap, and the IC sample-size terms + # all key off the positive-weight support; if they used the raw row count + # instead, padding the panel with zero-weight rows would push floor(n^{1/5}) + # to a higher candidate degree and could silently change the selected K + # (hence the DR estimate). This exercises the AUTO path (NO explicit k_max) + # with a genuinely cubic conditional mean: 240 positive-weight units give + # floor(240^{1/5}) = 2, but raw-count logic on the 300 padded rows would + # give floor(300^{1/5}) = 3 and select the cubic term -- diverging from the + # positive-weight-only fit. This test FAILS under raw-count selection and + # passes once selection uses the positive-weight support. + rng = np.random.default_rng(101) + n_real, n_pad = 240, 60 # floor(240^.2)=2 ; floor(300^.2)=3 + x1_real = rng.uniform(-2.0, 2.0, size=n_real) + dy_real = 2.0 * x1_real**3 + rng.normal(scale=0.3, size=n_real) # cubic -> K=3 if reachable + # Padded rows: arbitrary (deliberately wild) covariates/outcomes, weight 0. + x1_pad = rng.uniform(-2.0, 2.0, size=n_pad) + dy_pad = rng.normal(scale=5.0, size=n_pad) + x1 = np.concatenate([x1_real, x1_pad]) + dy = np.concatenate([dy_real, dy_pad]) + ow = np.column_stack([np.zeros(n_real + n_pad), dy]) + cov = x1.reshape(-1, 1) + w = np.concatenate([rng.uniform(0.5, 2.0, n_real), np.zeros(n_pad)]) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + # group_mask spans all rows; 60 carry zero weight -> support = 240. + m_padded = estimate_outcome_regression( + ow, cov, np.ones(n_real + n_pad, dtype=bool), 1, 0, criterion="bic", unit_weights=w + ) + # group_mask = positive-weight rows only (the analytically-equivalent fit). + m_support = estimate_outcome_regression( + ow, cov, w > 0, 1, 0, criterion="bic", unit_weights=w + ) + assert np.all(np.isfinite(m_padded)) + # Strict invariance: zero-weight padding changes neither the auto-selected + # order nor the fitted m_hat anywhere. + np.testing.assert_allclose(m_padded, m_support, atol=1e-9) + + def test_sieve_recovers_multivariate_nonlinear_conditional_mean(self): + # d=2 covariates with a nonlinear conditional mean including a cross + # term: E[dY|x1,x2] = x1^2 + x1*x2. The multivariate sieve basis + # (dimension p_K = comb(K+2, 2)) captures it at K>=2; a degree-1 fit + # (linear in x1, x2, no quadratic/cross terms) cannot. Exercises the + # degree-vs-dimension distinction (p_K = comb(K+d, d), not K) that + # matters once d > 1. + rng = np.random.default_rng(13) + n = 600 # floor(n^(1/5)) = 3 -> candidate degrees 1..3 + x1 = rng.normal(size=n) + x2 = rng.normal(size=n) + cov = np.column_stack([x1, x2]) + truth = x1**2 + x1 * x2 + dy = truth + rng.normal(scale=0.4, size=n) + ow = np.column_stack([np.zeros(n), dy]) + mask = np.ones(n, dtype=bool) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + m_sieve = estimate_outcome_regression(ow, cov, mask, 1, 0, criterion="bic") + m_lin = estimate_outcome_regression(ow, cov, mask, 1, 0, k_max=1, criterion="bic") + assert np.corrcoef(m_sieve, truth)[0, 1] > 0.9 + assert np.mean((m_sieve - truth) ** 2) < 0.5 * np.mean((m_lin - truth) ** 2) + + def test_growing_sieve_allows_order_above_5_for_large_group(self): + # The sieve is a *growing* sieve: the candidate order is floor(n^(1/5)) + # with NO fixed ceiling, which is what satisfies Assumption C.1's + # uniform-consistency / product-rate conditions (Theorem 4.1). For a + # large group (n=8000 -> floor(n^(1/5))=6) a genuinely degree-6 + # conditional mean is captured, which the previous hard K<=5 cap could + # not represent. + rng = np.random.default_rng(11) + n = 8000 # floor(n^(1/5)) = 6 + x1 = rng.uniform(-2.0, 2.0, size=n) + truth = x1**6 - 3.0 * x1**4 # needs degree 6 + dy = truth + rng.normal(scale=0.5, size=n) + ow, cov, mask = _delta_y_arrays(n, x1, dy) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + m_grow = estimate_outcome_regression(ow, cov, mask, 1, 0, criterion="bic") + m_cap5 = estimate_outcome_regression(ow, cov, mask, 1, 0, k_max=5, criterion="bic") + # The growing sieve (auto k_max=6) captures the degree-6 truth that a + # K<=5-capped fit structurally cannot. + assert np.mean((m_grow - truth) ** 2) < np.mean((m_cap5 - truth) ** 2) + assert np.corrcoef(m_grow, truth)[0, 1] > 0.95 + + def test_outcome_regression_rejects_bad_criterion(self): + # The helper validates its criterion for direct callers (the estimator + # surface validates upstream, but the helper must fail closed too). + ow, cov, mask = _delta_y_arrays(10, np.arange(10.0), np.arange(10.0)) + with pytest.raises(ValueError, match="criterion"): + estimate_outcome_regression(ow, cov, mask, 1, 0, criterion="xic") + + def test_covariate_dr_path_last_cohort_with_anticipation(self): + # DR (covariate) path under control_group="last_cohort" (all units + # eventually treated -> last cohort is the pseudo-control) with + # anticipation trimming. Mirrors the no-covariate last_cohort test on + # the new sieve outcome-regression path. + rng = np.random.default_rng(17) + groups = (3, 5, 7) + n_per, n_periods = 50, 7 + rows, uid = [], 0 + for g in groups: + for _ in range(n_per): + x1 = rng.normal() + ufe = rng.normal(0, 0.5) + for t in range(1, n_periods + 1): + y = ufe + 0.4 * x1 + 0.2 * t + (1.5 if t >= g else 0.0) + rng.normal(0, 0.3) + rows.append((uid, t, float(g), y, x1)) + uid += 1 + df = pd.DataFrame(rows, columns=["unit", "time", "first_treat", "y", "x1"]) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = EfficientDiD( + pt_assumption="all", control_group="last_cohort", anticipation=1 + ).fit(df, "y", "unit", "time", "first_treat", covariates=["x1"], aggregate="all") + # anticipation=1, last_g=7 -> effective last cohort at 6 -> time_periods 1..5 + assert max(res.time_periods) == 5 + assert 7 not in res.groups # last cohort reclassified as pseudo-control + assert np.isfinite(res.overall_att) + assert np.isfinite(res.overall_se) + + @pytest.mark.slow + def test_covariate_path_beats_forced_linear_under_nonlinear_nuisance(self, ci_params): + """Conditional-PT DGP with nonlinear selection AND nonlinear Y(0) trend: + the auto growing sieve recovers ATT (both nuisances captured), while + sieve_k_max=1 -- which degree-1-constrains ALL covariate-path sieves + (the outcome regression AND both propensity sieves) -- misspecifies both + nuisances and so has larger error. This compares growing vs degree-1 + sieves; it does not isolate the outcome regression alone.""" + tau = 2.0 + n_sims = ci_params.bootstrap(60, min_n=20) + err_sieve, err_linear = [], [] + for s in range(n_sims): + rng = np.random.default_rng(5000 + s) + n = 320 + x1 = rng.normal(size=n) + # Nonlinear selection on x1^2 (kept within (0,1) for overlap). + p_treat = 1.0 / (1.0 + np.exp(-0.8 * (x1**2 - 1.0))) + treated = rng.uniform(size=n) < p_treat + ft = np.where(treated, 2.0, np.inf) + slope = 0.5 * x1 + 0.6 * x1**2 # nonlinear Y(0) trend in x1 + rows = [] + unit_fe = rng.normal(0, 0.5, n) + for i in range(n): + for t in (1, 2, 3): + y0 = unit_fe[i] + slope[i] * (t - 1) + rng.normal(0, 0.3) + y = y0 + (tau if (np.isfinite(ft[i]) and t >= ft[i]) else 0.0) + rows.append((i, t, ft[i], y, x1[i])) + df = pd.DataFrame(rows, columns=["unit", "time", "first_treat", "y", "x1"]) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + a_sieve = ( + EfficientDiD(pt_assumption="all") + .fit(df, "y", "unit", "time", "first_treat", covariates=["x1"], aggregate="all") + .overall_att + ) + a_linear = ( + EfficientDiD(pt_assumption="all", sieve_k_max=1) + .fit(df, "y", "unit", "time", "first_treat", covariates=["x1"], aggregate="all") + .overall_att + ) + if np.isfinite(a_sieve): + err_sieve.append(abs(a_sieve - tau)) + if np.isfinite(a_linear): + err_linear.append(abs(a_linear - tau)) + mae_sieve = float(np.mean(err_sieve)) + mae_linear = float(np.mean(err_linear)) + # The growing sieve recovers ATT and is no worse than the all-degree-1 + # fit; under this nonlinear-nuisance DGP it is materially better. + assert mae_sieve < 0.5 + assert mae_sieve <= mae_linear + 0.05 + + +# ============================================================================= +# Growing sieve (order > 5) — the two propensity-path sieves +# ============================================================================= + + +def _cheb_t6(u): + """6th Chebyshev polynomial: its energy is purely degree 6 (orthogonal to all + lower degrees over [-1, 1]), so a degree-5 sieve captures none of it and a + degree-6 sieve is required -- a clean way to force order-6 selection.""" + return 32 * u**6 - 48 * u**4 + 18 * u**2 - 1 + + +class TestPropensitySieveGrowingOrder: + """The two propensity-path sieves are growing sieves with no K<=5 cap. + + Companion to ``test_growing_sieve_allows_order_above_5_for_large_group`` (which + covers the outcome regression): the hard ``K<=5`` ceiling was removed from ALL + THREE nuisance sieves, so for a large support (n with floor(n^{1/5}) = 6) a + genuinely degree-6 nuisance is selected. At that support the auto k_max is + exactly 6, so ``auto != k_max=5 fit`` is a sharp proof that order 6 was + selected (and ``auto == k_max=6 fit`` confirms it). The fitted target is made + pure degree-6 via the 6th Chebyshev polynomial so degree 5 cannot represent it. + """ + + def test_propensity_ratio_sieve_selects_order_six(self): + # r_{g,g'}(X) = p_g/p_{g'} is built as a pure degree-6 polynomial (0.3 + + # 0.18*T6, in [0.12, 0.48] so well inside the ratio clip). The comparison + # group g' is the majority (~78%), so its support clears 7776 and the auto + # k_max reaches 6. + rng = np.random.default_rng(0) + n = 20000 # comparison-group support ~15.6k -> floor(.^{1/5}) = 6 + x1 = rng.uniform(-2.0, 2.0, size=n) + r_true = 0.3 + 0.18 * _cheb_t6(x1 / 2.0) + p_g = r_true / (1.0 + r_true) + g = rng.uniform(size=n) < p_g + cov = x1.reshape(-1, 1) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + auto = estimate_propensity_ratio_sieve(cov, g, ~g, criterion="bic") + cap5 = estimate_propensity_ratio_sieve(cov, g, ~g, k_max=5, criterion="bic") + cap6 = estimate_propensity_ratio_sieve(cov, g, ~g, k_max=6, criterion="bic") + assert int((~g).sum() ** 0.2) == 6 # comparison support gives auto k_max = 6 + assert not np.allclose(auto, cap5, atol=1e-9) # order > 5 was selected + np.testing.assert_allclose(auto, cap6, atol=1e-9) # specifically order 6 + + def test_inverse_propensity_sieve_selects_order_six(self): + # s_{g'}(X) = 1/p_{g'}(X) is built as a pure degree-6 polynomial + # (3.0 + 1.8*T6, >= 1.2 so p in (0, 1]); the group support (~8.5k) clears + # 7776 so the auto k_max reaches 6. + rng = np.random.default_rng(0) + n = 20000 + x1 = rng.uniform(-2.0, 2.0, size=n) + s_true = 3.0 + 1.8 * _cheb_t6(x1 / 2.0) + p = 1.0 / s_true + g = rng.uniform(size=n) < p + cov = x1.reshape(-1, 1) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + auto = estimate_inverse_propensity_sieve(cov, g, criterion="bic") + cap5 = estimate_inverse_propensity_sieve(cov, g, k_max=5, criterion="bic") + cap6 = estimate_inverse_propensity_sieve(cov, g, k_max=6, criterion="bic") + assert int(g.sum() ** 0.2) == 6 # group support gives auto k_max = 6 + assert not np.allclose(auto, cap5, atol=1e-9) # order > 5 was selected + np.testing.assert_allclose(auto, cap6, atol=1e-9) # specifically order 6 + + +# ============================================================================= +# End-to-end survey invariance: zero-weight padding must not change the estimate +# ============================================================================= + + +class TestSurveyZeroWeightInvariance: + """EfficientDiD's DR (covariate) estimate is invariant to zero-weight padding. + + Adding survey rows with weight 0 carries no information, so the point estimate + must not change. Both auto-selectors on the covariate path key off the + positive-weight support, so two distinct paths are covered: + + - **Just-identified (H=1)** — ``test_..._invariant``: pins the sieve ORDER + selection. The never-treated comparison group has 240 positive-weight units + (floor(240^{1/5}) = 2); padding it with 60 zero-weight units would push the + raw count to 300 (floor = 3) under the buggy raw-count selection and select + the genuinely-cubic conditional mean (CI codex P0, round 1). At H=1 + ``compute_per_unit_weights`` short-circuits to 1, so the kernel/bandwidth + path is NOT exercised here. + - **Overidentified (H>1)** — ``test_..._invariant_overidentified``: pins the + auto Silverman BANDWIDTH for the kernel-smoothed ``Omega*(X)``. With H>1 the + efficient weights depend on ``Omega*(X)``, whose auto bandwidth must ignore + zero-weight rows — else extreme-covariate padding inflates the unweighted std + and silently moves ``ATT(g,t)`` (CI codex P0, round 2). + """ + + def _panel(self, seed=0): + rng = np.random.default_rng(seed) + rows = [] + uid = 0 + + def add(n, ft, weight, fe_sd, noise_sd): + nonlocal uid + for _ in range(n): + x1 = rng.uniform(-2.0, 2.0) + ufe = rng.normal(0.0, fe_sd) + cubic = 1.3 * x1**3 # nonlinear conditional trend (needs degree 3) + for t in (1, 2): + treated_now = np.isfinite(ft) and t >= ft + y = ( + ufe + + 0.3 * (t - 1) + + cubic * (t - 1) + + (2.0 if treated_now else 0.0) + + rng.normal(0.0, noise_sd) + ) + rows.append((uid, t, ft, y, x1, weight)) + uid += 1 + + add(120, 2.0, 1.0, 0.5, 0.3) # treated cohort + add(240, np.inf, 1.0, 0.5, 0.3) # real never-treated (positive weight) + base = pd.DataFrame(rows, columns=["unit", "time", "first_treat", "y", "x1", "w"]) + n_base_rows = len(rows) + add(60, np.inf, 0.0, 3.0, 3.0) # zero-weight padded controls (wild outcomes) + padded = pd.DataFrame(rows, columns=["unit", "time", "first_treat", "y", "x1", "w"]) + # ``base`` is exactly the positive-weight subset of ``padded``. + assert (padded["w"].to_numpy()[:n_base_rows] > 0).all() + return base, padded + + def test_zero_weight_padding_leaves_att_invariant(self): + from diff_diff.survey import SurveyDesign + + base, padded = self._panel(seed=0) + + def fit(df): + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + return EfficientDiD(pt_assumption="all").fit( + df, + "y", + "unit", + "time", + "first_treat", + covariates=["x1"], + survey_design=SurveyDesign(weights="w"), + aggregate="all", + ) + + r_filtered = fit(base) + r_padded = fit(padded) + # The DR point estimate is exactly invariant to zero-weight padding (the + # sieve-selection fix): the WLS fit, the weighted RSS, and now the order + # selection are all keyed off the positive-weight support. NOTE this + # single-date fixture is just-identified (H=1 per cell), so + # ``compute_per_unit_weights`` short-circuits to 1 and the kernel/bandwidth + # path is NOT exercised here — the overidentified test below covers that. + np.testing.assert_allclose( + r_padded.overall_att, r_filtered.overall_att, atol=1e-9, rtol=0.0 + ) + # The SE is invariant up to a tiny finite-sample DOF correction: the shared + # survey sandwich (compute_survey_vcov / _compute_stratified_psu_meat) counts + # zero-weight units as PSUs in its n_psu/(n_psu-1)-style correction. The + # weighted scores themselves are zero for zero-weight rows, so this is a + # second-order (~2e-4 relative) cross-cutting survey-infra effect, NOT the + # sieve selection (tracked in TODO.md). + np.testing.assert_allclose(r_padded.overall_se, r_filtered.overall_se, rtol=2e-3, atol=0.0) + + def _overid_panel(self, seed=0): + """Staggered panel (cohorts 2 & 3 + never-treated, 4 periods) whose PT-All + cells are OVERIDENTIFIED (H > 1): each ATT(g,t) has several admissible + (g', t_pre) comparison moments, so ``compute_per_unit_weights`` forms the + inverse-Omega* efficient combination rather than short-circuiting to 1 — + i.e. the kernel-smoothed Omega*(X) and its auto Silverman bandwidth ARE on + the path. The zero-weight padded units carry EXTREME covariates so that any + bandwidth dependence on zero-weight rows is forced to show up (an extreme + x1 inflates the unweighted std -> a larger auto bandwidth -> different + Omega*).""" + rng = np.random.default_rng(seed) + rows = [] + uid = 0 + + def add(n, ft, weight, xlo, xhi, fe_sd, noise_sd): + nonlocal uid + for _ in range(n): + x1 = rng.uniform(xlo, xhi) + ufe = rng.normal(0.0, fe_sd) + for t in (1, 2, 3, 4): + trend = 0.3 * (t - 1) + (0.5 * x1 + 0.3 * x1**2) * (t - 1) # nonlinear Y(0) + treated_now = np.isfinite(ft) and t >= ft + y = ufe + trend + (2.0 if treated_now else 0.0) + rng.normal(0.0, noise_sd) + rows.append((uid, t, ft, y, x1, weight)) + uid += 1 + + add(90, 2.0, 1.0, -2.0, 2.0, 0.5, 0.3) # cohort 2 + add(90, 3.0, 1.0, -2.0, 2.0, 0.5, 0.3) # cohort 3 + add(140, np.inf, 1.0, -2.0, 2.0, 0.5, 0.3) # never-treated + base = pd.DataFrame(rows, columns=["unit", "time", "first_treat", "y", "x1", "w"]) + n_base_rows = len(rows) + # zero-weight padded never-treated units with EXTREME covariates (x1 ~ [40, 60]) + add(40, np.inf, 0.0, 40.0, 60.0, 3.0, 3.0) + padded = pd.DataFrame(rows, columns=["unit", "time", "first_treat", "y", "x1", "w"]) + assert (padded["w"].to_numpy()[:n_base_rows] > 0).all() + return base, padded + + def test_zero_weight_padding_leaves_att_invariant_overidentified(self): + # Overidentified (H>1) DR survey path with the AUTO Silverman bandwidth + # (kernel_bandwidth=None): the auto bandwidth must ignore zero-weight rows, + # else extreme-covariate padding inflates the unweighted std, widens the + # kernel, changes Omega*(X) and the per-unit efficient weights, and silently + # moves ATT(g,t). This FAILS under a raw (all-row) bandwidth and passes once + # the bandwidth is keyed off the positive-weight support. + from diff_diff.survey import SurveyDesign + + base, padded = self._overid_panel(seed=0) + + def fit(df): + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + return EfficientDiD(pt_assumption="all").fit( + df, + "y", + "unit", + "time", + "first_treat", + covariates=["x1"], + survey_design=SurveyDesign(weights="w"), + aggregate="all", + ) + + r_filtered = fit(base) + r_padded = fit(padded) + # Staggered multi-cohort design with a never-treated comparison -> PT-All + # cells are overidentified (the kernel/bandwidth path is live). + assert len(r_filtered.groups) >= 2 + # Exact point-estimate invariance per (g,t) and overall. + for k, info in r_filtered.group_time_effects.items(): + np.testing.assert_allclose( + r_padded.group_time_effects[k]["effect"], info["effect"], atol=1e-9, rtol=0.0 + ) + np.testing.assert_allclose( + r_padded.overall_att, r_filtered.overall_att, atol=1e-9, rtol=0.0 + ) + # SE invariant up to the same second-order survey-vcov DOF correction. + np.testing.assert_allclose(r_padded.overall_se, r_filtered.overall_se, rtol=2e-3, atol=0.0)