Internal tracking for technical debt, known limitations, and maintenance tasks.
For the public feature roadmap, see ROADMAP.md.
Current limitations that may affect users:
| Issue | Location | Priority | Notes |
|---|---|---|---|
| MultiPeriodDiD wild bootstrap not supported (falls back to analytical) | estimators.py:1647 | Low | Edge case |
| predict() raises NotImplementedError | estimators.py:890-911 | Low | Rarely needed |
For survey-specific limitations (NotImplementedError paths), see the Current Limitations section of survey-roadmap.md.
Target: ideally < 1000 lines per module; modules ≥3000 lines are candidates for splitting, 2000-3000 are monitored, 1000-2000 are accepted as a cohesion / scope trade-off. Updated 2026-05-15.
| File | Lines | Action |
|---|---|---|
| chaisemartin_dhaultfoeuille.py | 8636 | Consider splitting (per-path / placebos / survey IF / aggregation) |
| had_pretests.py | 4951 | Consider splitting (Stute / Yatchew / QUG / joint pretests) |
| had.py | 4593 | Consider splitting (continuous / mass-point / event-study / survey paths) |
| staggered.py | 3963 | Consider splitting — grew through survey + aggregation features |
| linalg.py | 3601 | Consider splitting (vcov surfaces) only if cohesion can be preserved — unified backend; vcov / solver paths are tightly coupled |
| diagnostic_report.py | 3380 | Consider splitting (per-method renderers + provenance) |
| power.py | 3196 | Consider splitting (power analysis + MDE + sample size) |
| synthetic_did.py | 2819 | Monitor — variance methods + survey paths |
| honest_did.py | 2785 | Monitor |
| business_report.py | 2653 | Monitor — per-method narrative renderers |
| imputation.py | 2475 | Monitor |
| survey.py | 2466 | Monitor — grew with Phase 6 features |
| utils.py | 2396 | Monitor |
| prep_dgp.py | 2057 | Monitor |
| triple_diff.py | 2053 | Monitor |
| estimators.py | 1991 | Acceptable |
| two_stage.py | 1985 | Acceptable |
| chaisemartin_dhaultfoeuille_results.py | 1981 | Acceptable |
| prep.py | 1876 | Acceptable |
| efficient_did.py | 1793 | Acceptable |
| sun_abraham.py | 1713 | Acceptable |
| continuous_did.py | 1682 | Acceptable |
| results.py | 1676 | Acceptable |
| staggered_triple_diff.py | 1619 | Acceptable |
| _nprobust_port.py | 1412 | Acceptable |
| practitioner.py | 1402 | Acceptable |
| trop_global.py | 1350 | Acceptable |
| trop_local.py | 1339 | Acceptable |
| local_linear.py | 1332 | Acceptable |
| wooldridge.py | 1305 | Acceptable |
| chaisemartin_dhaultfoeuille_bootstrap.py | 1175 | Acceptable |
| bacon.py | 1144 | Acceptable |
| pretrends.py | 1133 | Acceptable |
| stacked_did.py | 1050 | Acceptable |
| conley.py | 1006 | Acceptable |
| visualization/ | 4316 | Subpackage (split across 7 files) — OK |
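The size policy behind this table can be sketched as a small script. This is a minimal illustration and not tooling that ships with the project: the function names, the "Within target" label, and the half-open boundaries (for example, exactly 2000 lines counting as "Monitor") are assumptions. It also treats every `.py` file individually, so a subpackage such as visualization/ would still need to be summed by hand.

```python
from pathlib import Path


def classify_module(line_count: int) -> str:
    """Map a line count to the action used in the table above.

    Boundary handling (>= on the lower edge of each band) is an assumption;
    the policy text only gives the ranges.
    """
    if line_count >= 3000:
        return "Consider splitting"
    if line_count >= 2000:
        return "Monitor"
    if line_count >= 1000:
        return "Acceptable"
    return "Within target"  # hypothetical label for modules under 1000 lines


def count_lines(path: Path) -> int:
    """Count physical lines in a file (blank lines and comments included)."""
    with path.open("r", encoding="utf-8", errors="replace") as fh:
        return sum(1 for _ in fh)


def size_report(package_dir: str, min_lines: int = 1000):
    """Return (file name, line count, action) for top-level modules, largest first."""
    rows = []
    for path in Path(package_dir).glob("*.py"):
        n = count_lines(path)
        if n >= min_lines:
            rows.append((path.name, n, classify_module(n)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


def to_markdown(rows) -> str:
    """Render the report in the same three-column layout as the table above."""
    lines = ["| File | Lines | Action |", "|---|---|---|"]
    lines += [f"| {name} | {n} | {action} |" for name, n, action in rows]
    return "\n".join(lines)
```

Calling `print(to_markdown(size_report("diff_diff")))` from the repository root would regenerate a draft of the table, assuming the package lives in a `diff_diff/` directory. The parenthetical notes in the Action column are editorial and would still be written by hand.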
Deferred items from PR reviews that were not addressed before merge.
| Issue | Location | PR | Priority |
|---|---|---|---|
| SyntheticControl cv: in_space_placebo() / leave_one_out() report a cv refit excluded for STRUCTURAL infeasibility (donor-indistinguishable re-aggregated window) with the generic status="failed" — same machine-readable status as a genuine inner-solver non-convergence. The failure warnings now distinguish the two causes (and the correct remediation) under cv, and in_time_placebo() already splits structural→"infeasible" vs "failed", but in-space/LOO do not yet emit a separate machine-readable status/reason-code. Thread a reason code from _outer_solve_V_cv()/_placebo_fit_unit() and add an "infeasible" status + count to the in-space/LOO outputs (mirror the in-time split). | synthetic_control.py, synthetic_control_results.py | follow-up | Low |
| dCDH: Phase 1 per-period placebo DID_M^pl has NaN SE (no IF derivation for the per-period aggregation path). Multi-horizon placebos (L_max >= 1) have valid SE. | chaisemartin_dhaultfoeuille.py | #294 | Low |
| dCDH: Survey cell-period allocator's post-period attribution is a library convention, not derived from the observation-level survey linearization. MC coverage is empirically close to nominal on the test DGP; a formal derivation (or a covariance-aware two-cell alternative) is deferred. Documented in REGISTRY.md survey IF expansion Note. | chaisemartin_dhaultfoeuille.py, docs/methodology/REGISTRY.md | #408 | Medium |
| dCDH: Parity test SE/CI assertions only cover pure-direction scenarios; mixed-direction SE comparison is structurally apples-to-oranges (cell-count vs obs-count weighting). | test_chaisemartin_dhaultfoeuille_parity.py | #294 | Low |
| dCDH by_path: survey-aware backward-horizon (placebo + predict_het + survey_design) raises NotImplementedError because the Binder TSL cell-period allocator's REGISTRY justification is tied to post-period attribution. Backward horizons would put ψ_g mass on a pre-period cell. Deriving the pre-period cell allocator (or adding a covariance-aware two-cell alternative) is deferred to a follow-up methodology PR. | diff_diff/chaisemartin_dhaultfoeuille.py, docs/methodology/REGISTRY.md | follow-up | Medium |
| CallawaySantAnna: consider materializing NaN entries for non-estimable (g,t) cells in group_time_effects dict (currently omitted with consolidated warning); would require updating downstream consumers (event study, balance_e, aggregation) | staggered.py | #256 | Low |
| CallawaySantAnna and StaggeredTripleDifference fit their covariate outcome-regression nuisance via estimator-local cho_solve(X'X) / scipy.lstsq(cond=1e-7) that bypass solve_ols, so they are NOT scale-equilibrated — a large-scale covariate can in principle perturb the nuisance fit (TripleDifference's OR fit already routes through solve_ols and is covered). Route the local OR fits through the shared scale-robust solver (or equilibrate locally). | staggered.py, staggered_triple_diff.py | covariate-review | Medium |
| Adopt the shared _rank_guarded_inv for the structural (non-covariate) matrix inverses that share the LinAlgError-only fallback pattern and can become near-singular: continuous_did.py:1056 (dose B-spline basis), spillover.py:3371 (ring-solve, partially guarded via kept_cols), two_stage.py:3154 (TSL Stage-2 variance), imputation.py:2403, had.py:2413, conley.py:1109. These invert internal bases users cannot perturb with covariates= (so not the covariate-triggered SE bug already fixed by the DR/OR rank-guard) — lower priority; the _rank_guarded_inv helper is the seam. | continuous_did.py, spillover.py, two_stage.py, imputation.py, had.py, conley.py | dr-or-se-rank-guard | Low |
| ImputationDiD dense (A0'A0).toarray() scales O((U+T+K)^2), OOM risk on large panels | imputation.py | #141 | Medium (deferred — only triggers when sparse solver fails) |
| Multi-absorb weighted demeaning needs iterative alternating projections for N > 1 absorbed FE with survey weights; unweighted multi-absorb also uses single-pass (pre-existing, exact only for balanced panels) | estimators.py | #218 | Medium |
| Survey design resolution/collapse patterns are inconsistent across panel estimators — ContinuousDiD rebuilds unit-level design in SE code, EfficientDiD builds once in fit(), StackedDiD re-resolves on stacked data; extract shared helpers for panel-to-unit collapse, post-filter re-resolution, and metadata recomputation | continuous_did.py, efficient_did.py, stacked_did.py | #226 | Low |
| SyntheticControl: the remaining ADH-2015 §4 items — the regression-weight W^reg = X_0'(X_0 X_0')^{-1} X_1 extrapolation diagnostic (flag implied OLS weights outside [0,1]) and sparse-SC subset search (l < J, holding V fixed). Leave-one-out (leave_one_out()), the in-time placebo (in_time_placebo()), out-of-sample CV V-selection (v_method="cv"), and inverse-variance V (v_method="inverse_variance") have landed; these two are the deferred tail. | synthetic_control.py, synthetic_control_results.py | ADH-2015 follow-up | Low |
| ContinuousDiD deferred CGBS 2024 extensions: (a) covariates= kwarg not implemented (matches R contdid v0.1.0); (b) discrete-treatment saturated regression deferred (integer-valued dose currently warned, not routed to per-level coefficients); (c) lowest-dose-as-control per CGBS 2024 Remark 3.1 (when P(D=0) = 0) not implemented — estimator requires never-treated controls. REGISTRY ## ContinuousDiD → Implementation Checklist marks these as deferred [ ] items. | diff_diff/continuous_did.py | — | Low |
| Survey-weighted Silverman bandwidth in EfficientDiD conditional Omega* — _silverman_bandwidth() uses unweighted mean/std for bandwidth selection; survey-weighted statistics would better reflect the population distribution, but this is a second-order refinement | efficient_did_covariates.py | — | Low |
| Survey sandwich SE is not exactly invariant to zero-weight (subpopulation / padded) rows: the shared _compute_stratified_psu_meat finite-sample correction counts zero-weight units as PSUs (an n_psu/(n_psu-1)-style factor), so adding zero-weight rows shifts the SE by a second-order amount (~2e-4 relative in the EfficientDiD e2e). The point estimate is exactly invariant and the weighted scores of zero-weight rows are already zero — only the DOF correction's PSU count includes them. Cross-cutting across all survey-enabled estimators; fix by counting only positive-weight PSUs in the correction. | survey.py (_compute_stratified_psu_meat) | PR-B follow-up | Low |
| TROP: extend Wave 4's _setup_trop_data helper to also cover the duplicated bootstrap resampling loop in _bootstrap_variance / _bootstrap_variance_global (~40 LoC dedup; mirrors the data-setup helper pattern with a fit_callable parameter for the per-draw refit step). | trop_local.py, trop_global.py | follow-up | Low |
| TripleDifference power auto-routing: power.simulate_power ignores n_periods for DDD because _ddd_dgp_kwargs is hard-coded to the cross-sectional generate_ddd_data. Now that generate_ddd_panel_data exists (Wave 4), add a new _EstimatorProfile registry entry (or extend the existing one) to route to the panel DGP when n_periods > 2. | power.py, prep_dgp.py | follow-up | Low |
| StaggeredTripleDifference R cross-validation: CSV fixtures not committed (gitignored); tests skip without local R + triplediff. Commit fixtures or generate deterministically. | tests/test_methodology_staggered_triple_diff.py | #245 | Medium |
| StaggeredTripleDifference R parity: benchmark only tests no-covariate path (xformla=~1). Add covariate-adjusted scenarios and aggregation SE parity assertions. | benchmarks/R/benchmark_staggered_triplediff.R | #245 | Medium |
| StaggeredTripleDifference: per-cohort group-effect SEs include WIF (conservative vs R's wif=NULL). Documented in REGISTRY. Could override mixin for exact R match. | staggered_triple_diff.py | #245 | Low |
| HonestDiD Delta^RM: uses naive FLCI instead of paper's ARP conditional/hybrid confidence sets (Sections 3.2.1-3.2.2). ARP infrastructure exists but moment inequality transformation needs calibration. CIs are conservative (wider, valid coverage). | honest_did.py | #248 | Medium |
| Replicate weight tests use Fay-like BRR perturbations (0.5/1.5), not true half-sample BRR. Add true BRR regressions per estimator family. Existing test_survey_phase6.py covers true BRR at the helper level. | tests/test_replicate_weight_expansion.py | #253 | Low |
| WooldridgeDiD: QMLE sandwich uses aweight cluster-robust adjustment (G/(G-1))*(n-1)/(n-k) vs Stata's G/(G-1) only. Conservative (inflates SEs). Add qmle weight type if Stata golden values confirm material difference. | wooldridge.py, linalg.py | #216 | Medium |
| WooldridgeDiD: response-scale APE / log-link coefficient bridge for R etwfe(family="poisson") + etwfe(family="logit") cell-level numerical parity. diff-diff WooldridgeDiD(method="poisson"\|"logit") returns ATT on the response scale (counterfactual μ_1 − μ_0 / p_1 − p_0 per paper W2023 ASF / APE framework); R etwfe returns the cell-level log-link coefficient. PR-B Stage D ships log-link goldens at benchmarks/data/wooldridge_golden.json and surface tests (fit completes + goldens well-formed); cell-level numerical parity requires either emfx()-based APE extraction on the R side or link-function inversion with baseline-mean adjustment. | benchmarks/R/generate_wooldridge_golden.R, tests/test_methodology_wooldridge.py::TestWooldridgeParityRPoisson/TestWooldridgeParityRLogit | PR-B follow-up | Medium |
| WooldridgeDiD: design-consistent cohort totals for aggregate(weights="cohort_share") on survey-weighted fits. Current impl populates _n_g_per_cohort from unit.nunique() (raw counts); composing these unweighted cohort shares with the design-weighted ATTs targets a mixed estimand inconsistent with paper W2025 Section 7's design-population cohort-share form. PR-B Stage E fail-closes the surface (raises ValueError when survey_design is not None); the follow-up implements survey-weighted unit totals per cohort and re-enables the surface. | wooldridge.py _n_g_per_cohort population, wooldridge_results.py::aggregate survey gate | PR-B follow-up | Medium |
| WooldridgeDiD: unconditional inference for aggregate(weights="cohort_share") accounting for sampling uncertainty in the cohort shares ω̂_g / ω̂_{ge} (paper W2025 Section 7.5). Current impl fail-closes the t-stat / p-value / conf-int fields to NaN under cohort-share aggregation because the analytical SE is conditional-on-shares. Proper APE/GMM-style aggregate inference (Wooldridge 2023 Section 4 framework) re-enables full inference. | wooldridge_results.py::aggregate cohort_share inference branch | PR-B follow-up | Medium |
| WooldridgeDiD: cohort_trends=True + survey_design composition. PR-B Stage E fail-closes the cross-product with NotImplementedError at fit() because the full-dummy dg_i · t design composed with the survey TSL variance hasn't been validated against R-parity goldens. Follow-up: validate the composition (or implement a survey-aware alternative) and re-enable the surface. | wooldridge.py fit guard, wooldridge_results.py::aggregate (if survey-aware cohort_trends variance plumbing is added) | PR-B follow-up | Low |
| WooldridgeDiD: cohort_trends=True + control_group="never_treated" composition. PR-B Stage E (codex R9 P1 fix) fail-closes the cross-product with NotImplementedError at fit() because the OLS + never_treated branch emits ALL (g, t) cells as treatment-cell dummies (paper Section 4.4 placebo coverage); the appended dg_i · t trend columns are linearly spanned by the per-cohort sum of those cell dummies, so the Section 8 trend specification is unidentified. Follow-up: implement a separate design-matrix branch that drops the pre-treatment placebo dummies (or restricts the trend interaction to post-treatment cells) under the trend specification, then re-enable the combination. | wooldridge.py fit guard + _build_interaction_matrix redesign for the cohort_trends path | PR-B follow-up | Low |
| WooldridgeDiD: Stata jwdid golden value tests — add R/Stata reference script and TestReferenceValues class. | tests/test_wooldridge.py | #216 | Medium |
| PreTrendsPower: CS/SA anticipation=1 R-parity fixture. The PR-C R-parity goldens cover NIS power + γ_p MDV at atol=1e-4 on four shifted-grid / regular / irregular / K=1 fixtures, but R pretrends has no anticipation parameter so the Python-side _extract_pre_period_params anticipation filter (if t < _pre_cutoff in pretrends.py lines 1138-1150 for CS; mirror in SA branch) is not R-parity-locked. Build a synthetic CallawaySantAnnaResults (or SunAbrahamResults) with anticipation=1 and a t=-1 event-study entry that should be filtered before reaching _compute_power_nis, then assert the resulting γ_p matches R's slope_for_power() on the K=4 shifted-grid fixture. Existing PR-B MC-based tests (TestPretrendsPropositions) and full-VCV tests (TestPretrendsCovarianceSource) already cover the filter mechanically; this would close the loop against R. | tests/test_methodology_pretrends.py::TestPretrendsParityR, benchmarks/R/generate_pretrends_golden.R | PR-C follow-up | Low |
| StackedDiD vcov_type="conley" — deferred for a methodology reason, NOT plumbing (unlike the now-shipped SunAbraham / WooldridgeDiD-OLS conley threading): the stacked design replicates each control unit across every sub-experiment it qualifies for (_build_sub_experiment), so one geographic unit occupies many stacked rows. Conley's pairwise distance matrix would see those same-unit copies at distance 0 (K(0)=1, perfectly correlated), conflating the stacking-replication device with real spatial correlation, and there is no conleyreg analogue for stacked DiD to anchor parity. A correct treatment needs a per-stack spatial identifier and is paper-gated. | diff_diff/stacked_did.py | follow-up | Low |
| Extend WooldridgeDiD method ∈ {"logit","poisson"} paths with vcov_type ∈ {classical, hc2, hc2_bm}. The GLM QMLE sandwich uses pseudo-residuals (weights=p(1-p) for logit, weights=μ_i for Poisson, aweight semantics); composing HC2 leverage and Bell-McCaffrey Satterthwaite DOF with QMLE on canonical-link pseudo-residuals needs derivation + R parity against clubSandwich::vcovCR(glm(...), type="CR2"). Phase 1b PR 3/8 rejects method != "ols" + vcov_type != "hc1" at __init__ with a deferral pointer here. | diff_diff/wooldridge.py (_fit_logit, _fit_poisson) | follow-up | Medium |
| Extend CallawaySantAnna with vcov_type="conley" — would require deriving a spatial-HAC composition for per-unit influence functions (Conley 1999 spatial kernel × per-(g,t) IF aggregation); no reference implementation exists today. Phase 1b interstitial PR rejected this at __init__ with a deferral pointer here. | diff_diff/staggered.py | follow-up | Low |
| Extend TripleDifference with vcov_type="conley" — would require deriving a spatial-HAC composition for the 3-pairwise-DiD influence-function decomposition (Conley 1999 spatial kernel × inf = w3·IF_3 + w2·IF_2 - w1·IF_1 aggregation); no reference implementation exists today. Phase 1b interstitial #2 PR rejected this at __init__ with a deferral pointer here. | diff_diff/triple_diff.py | follow-up | Low |
| Extend ImputationDiD with vcov_type="conley" — would require deriving a spatial-HAC composition with the Theorem 3 per-unit IF aggregation (Conley 1999 spatial kernel × sigma_sq = (cluster_psi_sums**2).sum() reduction); no reference implementation exists today. Phase 1b interstitial #3 PR rejected this at __init__ with a deferral pointer here. | diff_diff/imputation.py | follow-up | Low |
| Extend EfficientDiD with vcov_type="conley" — would require deriving a spatial-HAC composition with the per-unit EIF aggregation (Conley 1999 spatial kernel × _compute_se_from_eif reduction); no reference implementation exists today. Phase 1b interstitial #4 PR rejected this at __init__ with a deferral pointer here. | diff_diff/efficient_did.py | follow-up | Low |
| Extend TwoStageDiD with vcov_type="conley" — thread a spatial-HAC composition into the GMM sandwich meat (_compute_gmm_variance); the Conley machinery already exists in the sibling SpilloverDiD _compute_gmm_corrected_meat (same module) and could be adapted to TwoStageDiD's per-cluster GMM score S_g = gamma_hat' c_g - X'_{2g} eps_{2g}, but two-stage GMM × Conley has no reference implementation. Phase 1b interstitial #5 PR rejected this at __init__/fit() with a deferral pointer here. | diff_diff/two_stage.py | follow-up | Low |
| Decide whether to formally deprecate CallawaySantAnna.cluster=X in favor of survey_design=SurveyDesign(psu=X). Both APIs are first-class today (the bare-cluster path synthesizes a minimal SurveyDesign internally), but having two equivalent paths to express the same intent creates redundant surface. Mirrors a similar question for ImputationDiD / EfficientDiD / TwoStageDiD if those estimators ever face the same review. | diff_diff/staggered.py | follow-up | Low |
| Harmonize SunAbraham's HC1 within-transform finite-sample correction with fixest::sunab(). SA's solve_ols applies n / (n - k_dm) (within-transform columns only); fixest applies n / (n - k_total) (counts absorbed FE). SE values differ by ~1-2% on typical panel sizes (documented in REGISTRY.md "Deviation from R"; pinned at atol=5e-3 in tests/test_methodology_sun_abraham.py). Either thread df_adjustment into the vcov scaling or document as an intentional difference. | diff_diff/sun_abraham.py, diff_diff/linalg.py::compute_robust_vcov | follow-up | Low |
| LinearRegression.fit() pays the CR2 cost twice on the weighted hc2_bm path: once inside solve_ols(..., return_vcov=True) and again via compute_robust_vcov(..., return_dof=True) to populate _bm_dof. Correct but redundant. Fix: thread return_dof through solve_ols so the same CR2 computation produces both vcov + DOF, or cache the per-cluster A_g / MUWTWUM precomputes between calls. CI codex P3 on PR #475. | linalg.py::LinearRegression.fit, linalg.py::solve_ols | PR #475 follow-up | Low |
| TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) with replicate-weight survey designs raises NotImplementedError (twfe.py:~233). The replicate path re-demeans per replicate (re-demeaning depends on the per-replicate weight vector), which doesn't compose with the full-dummy HC2/HC2-BM build — a correct implementation would need per-replicate full-dummy refit. Workaround: use vcov_type="hc1" for replicate-weight CR1. | twfe.py::fit | follow-up | Low |
| TWFE's HC2/HC2-BM inline full-dummy build (twfe.py:280-315) duplicates the dummy-construction logic in DifferenceInDifferences(fixed_effects=...) (estimators.py:478-486). Extract a shared helper (or delegate TWFE's HC2/HC2-BM path to DiD's fixed_effects= branch, with TWFE-specific cluster default threading) to reduce drift risk on FE naming, survey behavior, and result-surface conventions. Substantive refactor — touches both estimators. | twfe.py::fit, estimators.py::DifferenceInDifferences.fit | follow-up | Low |
| Unify Rust local-method estimate_model solver path to solve_wls_svd (the same SVD helper used by the global-method since PR #348) for sub-1e-14 bootstrap SE parity. Current local-method bootstrap parity test (tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_local) passes at atol=1e-5 — the residual ~1e-7 gap is roundoff between Rust's estimate_model matrix factorization and numpy's lstsq, which accumulates differently across per-replicate bootstrap fits. Main-fit ATT parity is regime-dependent (atol=1e-14 for lambda_nn=inf, atol=1e-10 for finite lambda_nn — see test_local_method_main_fit_parity); the bootstrap gap is a same-solver-path roundoff concern and not a user-visible correctness bug. | rust/src/trop.rs::estimate_model, rust/src/linalg.rs::solve_wls_svd | follow-up | Low |
| Rust multiplier-bootstrap weight RNG (generate_bootstrap_weights_batch in rust/src/bootstrap.rs:9-10, 57-75) uses Xoshiro256PlusPlus::seed_from_u64(seed + i) per row for Rademacher/Mammen/Webb generation. If any Python caller (SDID / efficient-DiD multiplier bootstrap) has a numpy-canonical equivalent, the two backends likely diverge under the same seed. Audit Python callers (diff_diff/sdid.py, diff_diff/efficient_did_bootstrap.py, diff_diff/bootstrap_utils.py::generate_bootstrap_weights_batch_numpy) for parity-test gaps. Same fix shape as TROP RNG parity (PR #354): pre-generate weights in Python via numpy and pass them to Rust through PyO3. | rust/src/bootstrap.rs, diff_diff/bootstrap_utils.py | follow-up | Medium |
| bias_corrected_local_linear: extend golden parity to kernel="triangular" and kernel="uniform" (currently epa-only; all three kernels share kernel_W and the lprobust math, so parity is expected but not separately asserted). | benchmarks/R/generate_nprobust_lprobust_golden.R, tests/test_bias_corrected_lprobust.py | Phase 1c | Low |
| bias_corrected_local_linear: expose vce in {"hc0", "hc1", "hc2", "hc3"} on the public wrapper once R parity goldens exist (currently raises NotImplementedError). The port-level lprobust and lprobust_res already support all four; expanding the public surface requires a golden generator for each hc mode and a decision on hc2/hc3 q-fit leverage (R reuses p-fit hii for q-fit residuals; whether to match that or stage-match deserves a derivation before the wrapper advertises CCT-2014 conformance). | diff_diff/local_linear.py::bias_corrected_local_linear, benchmarks/R/generate_nprobust_lprobust_golden.R, tests/test_bias_corrected_lprobust.py | Phase 1c | Medium |
| bias_corrected_local_linear: support multi-eval grid (neval > 1) with cross-covariance (covgrid=TRUE branch of lprobust.R:253-378). Not needed for HAD but useful for multi-dose diagnostics. | diff_diff/_nprobust_port.py::lprobust | Phase 1c | Low |
| Clustered-DGP parity: Phase 1c's DGP 4 uses manual h=b=0.3 to sidestep an nprobust-internal singleton-cluster bug in lpbwselect.mse.dpi's pilot fits. Once nprobust ships a fix (or we derive one independently), add a clustered-auto-bandwidth parity test. | benchmarks/R/generate_nprobust_lprobust_golden.R | Phase 1c | Low |
| HeterogeneousAdoptionDiD joint cross-horizon covariance on event study: per-horizon SEs use INDEPENDENT sandwiches in Phase 2b (paper-faithful pointwise CIs per Pierce-Schott Figure 2). A follow-up could derive an IF-based stacking of per-horizon scores for joint cross-horizon inference (needed for joint hypothesis tests across event-time horizons). Block-bootstrap is a reasonable alternative. | diff_diff/had.py::_fit_event_study | Phase 2b | Low |
| HeterogeneousAdoptionDiD event-study staggered-timing beyond last cohort: Phase 2b auto-filters staggered panels to the last cohort per paper Appendix B.2. Earlier-cohort treatment effects are not identified by HAD; redirecting to ChaisemartinDHaultfoeuille / did_multiplegt_dyn is the paper's prescription. A full staggered HAD would require a different identification path (out of paper scope). | diff_diff/had.py::_validate_had_panel_event_study | Phase 2b | Low |
| HeterogeneousAdoptionDiD joint cross-horizon analytical covariance on the weighted event-study path: Phase 4.5 B ships multiplier-bootstrap sup-t simultaneous CIs on the weighted event-study path but pointwise analytical variance is still independent across horizons. A follow-up could derive the full H × H analytical covariance from the per-horizon IF matrix (Psi.T @ Psi under survey weighting) for an analytical alternative to the bootstrap. Would also let the unweighted event-study path ship a sup-t band. | diff_diff/had.py::_fit_event_study | follow-up | Low |
| HeterogeneousAdoptionDiD unweighted event-study sup-t band: Phase 4.5 B ships sup-t only on the WEIGHTED event-study path (to preserve pre-PR bit-exact output on unweighted). Extending sup-t to unweighted event-study (either via the multiplier bootstrap with unit-level iid multipliers or via analytical joint cross-horizon covariance) is a symmetric follow-up. | diff_diff/had.py::_fit_event_study | follow-up | Low |
| HeterogeneousAdoptionDiD survey-aware support-endpoint test (research, not engineering): if the academic literature ever publishes a calibrated support-infimum test under complex sampling — combining endpoint-estimation EVT (Hall 1982, Aarssen-de Haan 1994, Hall-Wang 1999) with survey-aware functional CLTs for the empirical process (Boistard-Lopuhaä-Ruiz-Gazen 2017, Bertail-Chautru-Clémençon 2017) and tail-empirical-process theory (Drees 2003) — Phase 4.5 C0's permanent NotImplementedError on qug_test(..., survey=...) / weights= can be revisited and the bridge implemented against the published recipe. See docs/methodology/REGISTRY.md § "QUG Null Test" — Note (Phase 4.5 C0) for the decision rationale and the research-direction sketch. | diff_diff/had_pretests.py::qug_test | Phase 4.5 C0 (2026-04, decision shipped) | Low |
| HeterogeneousAdoptionDiD survey-aware pretests Phase 4.5 C still-open follow-ups (pweight + PSU + FPC + strata already shipped via bootstrap_utils.apply_stratum_centering + Yatchew closed-form weighted variance): (a) replicate-weight designs (BRR/Fay/JK1/JKn/SDR) — the per-replicate weight-ratio rescaling for the OLS-on-residuals refit step is not covered by the multiplier-bootstrap composition; each linearity-family helper raises NotImplementedError on survey.replicate_weights is not None. (b) lonely_psu='adjust' + singleton-strata on the Stute family — the pseudo-stratum centering transform has not been derived for the Stute CvM functional (same pseudo-stratum centering gap as the HAD sup-t deviation; see REGISTRY § "Note (Stute stratified survey-bootstrap calibration)"). | diff_diff/had_pretests.py | Phase 4.5 C follow-up | Low |
| HeterogeneousAdoptionDiD Phase 4.5: weight-aware auto-bandwidth MSE-DPI selector. Phase 4.5 A ships weighted lprobust with an unweighted DPI selector; users who want a weight-aware bandwidth must pass h/b explicitly. Extending lpbwselect_mse_dpi to propagate weights through density, second-derivative, and variance stages is ~300 LoC of methodology and was out of scope. | diff_diff/_nprobust_port.py::lpbwselect_mse_dpi | Phase 4.5 | Low |
| HeterogeneousAdoptionDiD Phase 4.5 C: replicate-weight SurveyDesigns (BRR / Fay / JK1 / JKn / SDR) on the continuous-dose paths. Phase 4.5 A raises NotImplementedError on replicate designs in _aggregate_unit_resolved_survey. Rao-Wu-style replicate bootstrap for HAD paths requires deriving the per-replicate weight-ratio rescaling for the local-linear intercept IF. | diff_diff/had.py::_aggregate_unit_resolved_survey | Phase 4.5 C | Low |
| HeterogeneousAdoptionDiD mass-point: vcov_type in {"hc2", "hc2_bm"} raises NotImplementedError pending a 2SLS-specific leverage derivation. The OLS leverage x_i' (X'X)^{-1} x_i is wrong for 2SLS; the correct finite-sample correction uses x_i' (Z'X)^{-1} (...) (X'Z)^{-1} x_i. Needs derivation plus an R / Stata (ivreg2 small robust) parity anchor. | diff_diff/had.py::_fit_mass_point_2sls | Phase 2a | Medium |
| HeterogeneousAdoptionDiD survey-design API consolidation, next minor bump: drop the deprecated survey= and weights= kwargs on all 8 HAD surfaces (HeterogeneousAdoptionDiD.fit, did_had_pretest_workflow, qug_test, stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test); only survey_design= remains. Also fold the legacy back-end weights= paths (e.g. _aggregate_unit_weights ad-hoc routing) into the unified _resolve_survey_for_fit-driven path. The _make_trivial_resolved underscore alias on survey.py stays (one-line, harmless). DeprecationWarning ships in this PR; the removal PR is ~50 LoC of cleanup. | diff_diff/had.py, diff_diff/had_pretests.py | next minor bump | Medium |
| HeterogeneousAdoptionDiD continuous paths: thread cluster= through bias_corrected_local_linear (Phase 1c's wrapper already supports cluster; Phase 2a ignores it with a UserWarning on the continuous path to keep scope tight). | diff_diff/had.py, diff_diff/local_linear.py | Phase 2a | Low |
| HeterogeneousAdoptionDiD trends_lin × survey_design follow-up: per-group linear-trend slope under survey weighting (weighted slope estimator? per-PSU slope?) is not derived from the paper. PR #389 raises NotImplementedError on the combination across all 3 trends_lin surfaces. If user demand emerges, derive the weighted variant and lift the gate. | diff_diff/had.py::HeterogeneousAdoptionDiD.fit, diff_diff/had_pretests.py::joint_pretrends_test, diff_diff/had_pretests.py::joint_homogeneity_test | follow-up | Low |
HeterogeneousAdoptionDiD Stute family Stata-bridge parity: PR #389 R-parity covers the full HAD fit + Yatchew surfaces but skips Stute family (stute_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test) because no R Stutetest package exists publicly (chaisemartinPackages publishes only the Stata stute_test module; the paper cites a 2024c R Stutetest module that is not on GitHub or CRAN). Stata-bridge parity would add benchmarks/stata/generate_stute_golden.do + a Stata installation requirement. Low priority unless user demand emerges. |
benchmarks/stata/, tests/test_stute_test_parity.py |
follow-up | Low |
| HeterogeneousAdoptionDiD Phase 3 Stute performance: Appendix D vectorized matrix form replaces the per-iteration OLS refit with a single precomputed M = I - X(X'X)^{-1}X' applied to eps * eta. Functionally identical, ~2x faster. Shipped literal-refit form in Phase 3 to match paper text and keep reviewer surface small. | diff_diff/had_pretests.py::stute_test | Phase 3 | Low |
| HeterogeneousAdoptionDiD Phase 3 R-parity: Phase 3 ships coverage-rate validation on synthetic DGPs (not tight point parity against chaisemartin::stute_test / yatchew_test). Tight numerical parity requires aligning bootstrap seed semantics and B across numpy/R and is deferred. | tests/test_had_pretests.py | Phase 3 | Low |
| HeterogeneousAdoptionDiD Phase 3 nprobust bandwidth for Stute: some Stute variants on continuous regressors use nprobust-style optimal bandwidth selection. Phase 3 uses OLS residuals from a 2-parameter linear fit (no bandwidth selection). nprobust integration is a future enhancement; not in paper scope. | diff_diff/had_pretests.py::stute_test | Phase 3 | Low |
| HeterogeneousAdoptionDiD Phase 4: Pierce-Schott (2016) replication harness; reproduce paper Figure 2 values and Table 1 coverage rates. Waived in tracker-promotion PR (2026-05-20): R parity at atol=1e-8 on the same 3 DGPs (tests/test_did_had_parity.py) is a strictly stronger correctness anchor than reproducing Figure 2's pointwise CIs on the LBD-restricted PNTR panel; paper Section 5.2 self-acknowledges NP estimators too noisy to be informative there. Table 1 coverage-rate MC would re-verify the CCF asymptotic coverage already pinned by R parity (Python ≡ R ≡ paper). See REGISTRY HAD Deviations Notes #3 / #4 for full scope-caveat statements. Re-open if user demand emerges for an empirical-application replication harness. | benchmarks/, tests/ | Phase 2a | Low |
| HeterogeneousAdoptionDiD time-varying dose on event study: Phase 2b REJECTS panels where D_{g,t} varies within a unit for t >= F (the aggregation uses D_{g, F} as the single regressor for all horizons, paper Appendix B.2 constant-dose convention). A follow-up PR could add a time-varying-dose estimator for these panels; current behavior is front-door rejection with a redirect to ChaisemartinDHaultfoeuille. | diff_diff/had.py::_validate_had_panel_event_study | Phase 2b | Low |
| HeterogeneousAdoptionDiD repeated-cross-section support: paper Section 2 defines HAD on panel OR repeated cross-section, but Phase 2a is panel-only. RCS inputs (disjoint unit IDs between periods) are rejected by the balanced-panel validator with the generic "unit(s) do not appear in both periods" error. A follow-up PR will add an RCS identification path based on pre/post cell means (rather than unit-level first differences), with its own validator and a distinct data_mode / API surface. | diff_diff/had.py::_validate_had_panel, diff_diff/had.py::_aggregate_first_difference | Phase 2a | Medium |
| SyntheticDiD: bootstrap cross-language parity anchor against R's default synthdid::vcov(method="bootstrap") (refit; rebinds opts per draw) or Julia Synthdid.jl::src/vcov.jl::bootstrap_se (refit by construction). Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; a cross-language anchor is desirable to bolster the methodology contract. Julia is the cleanest target: minimal wrapping work and refit-native vcov. Tolerance target: 1e-6 on Monte Carlo samples (different BLAS + RNG paths preclude 1e-10). The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path. | benchmarks/R/, benchmarks/julia/, tests/ | follow-up | Low |
| Conley + survey weights / survey_design. Score-reweighted meat s_i = w_i · X_i · ε_i is mechanical, but PSU clustering interaction with the spatial kernel and replicate-weights variance under spatial correlation are non-trivial (Bertanha-Imbens 2014 covers cluster-sample but not the explicit Conley case). Phase 5 of the spillover-conley initiative; paper review prerequisite. Currently raises NotImplementedError at the linalg validator. | linalg.py::_validate_vcov_args | Phase 5 (spillover-conley) | Medium |
| SyntheticDiD(vcov_type="conley") support. Currently raises TypeError at __init__ because SyntheticDiD uses variance_method ∈ {bootstrap, jackknife, placebo} rather than the analytical sandwich that Conley plugs into. Wiring would require either reimplementing an analytical sandwich path for SyntheticDiD or designing a spatial-block bootstrap (new methodology, Politis-Romano 1994 territory). | synthetic_did.py::SyntheticDiD | follow-up (spillover-conley) | Low |
| SpilloverDiD(survey_design=...) replicate-weight variance (BRR / Fay / JK1 / JKn / SDR). Wave E.1 ships Taylor-linearization only. Per Gerber (2026) Appendix A, the IF-reweighting shortcut does NOT apply to TwoStageDiD-class estimators because gamma_hat is weight-sensitive; correct support requires per-replicate full re-fit of stage 1 and stage 2 (200+ LoC of test surface beyond E.1). | spillover.py::SpilloverDiD.fit, survey.py::compute_replicate_refit_variance | follow-up | Low |
| SpilloverDiD(vcov_type="conley", conley_lag_cutoff > 0, survey_design=...) no-effective-PSU serial Bartlett HAC. Wave E.2 follow-up ships the panel-block composition when an effective PSU exists (explicit survey_design.psu OR injected via cluster=<col> per _inject_cluster_as_psu). Weights-only / strata-only survey designs WITHOUT a cluster fallback raise NotImplementedError at SpilloverDiD.fit post-resolution because under the pseudo-PSU = obs-index fallback each pseudo-PSU appears in exactly one period, so the per-PSU serial cross-period loop would silently contribute zero. Fix would either derive a unit-level serial fallback for no-PSU designs (mixes IF allocators with the pseudo-PSU spatial term; needs methodology work) or route the serial loop through conley_unit with explicit documentation of the IF-allocator asymmetry. Regression goldens vs the effective-PSU shipped path. | spillover.py::SpilloverDiD.fit, two_stage.py::_compute_stratified_serial_bartlett_meat | follow-up (Wave E.2 follow-up tail) | Low |
| SpilloverDiD(ring_method="count") extension. Currently only the nearest-treated-ring specification is exposed. Count-of-treated-in-ring (paper Section 3.2 end) is methodologically supported by Butts but re-introduces functional-form dependence; expose with an explicit kwarg gate and documentation warning. | spillover.py::SpilloverDiD.fit | follow-up | Low |
| SpilloverDiD data-driven d_bar selection (Butts 2021b / Butts 2023 JUE Insight cross-validation). | spillover.py::SpilloverDiD | follow-up | Low |
| SpilloverDiD sparse cKDTree path for the staggered nearest-treated-distance helper (mirrors the static helper's sparse branch). Currently _compute_nearest_treated_distance_staggered always builds dense (n_units, n_treated_by_onset) pairwise distance matrices per cohort; on large staggered panels with many cohorts this is avoidable memory/runtime. Add a sparse k-d-tree branch analogous to _compute_nearest_treated_distance_sparse, gated on n > _CONLEY_SPARSE_N_THRESHOLD. | spillover.py::_compute_nearest_treated_distance_staggered | follow-up (Wave B) | Low |
| SpilloverDiDResults in DiagnosticReport dispatch tables. Wave C event-study emits a TwoStageDiD-compatible event_study_effects: Dict[int, Dict] alias that plot_event_study consumes via the new reference_period attribute fallback in _extract_plot_data, but SpilloverDiDResults is NOT registered in DiagnosticReport's _APPLICABILITY / _PT_METHOD tables, so DiagnosticReport(spillover_result) doesn't currently route to event-study diagnostics. Registering requires (a) deciding which diagnostics apply (parallel trends, pre-trends power, heterogeneity, design-effect) AND (b) adding an end-to-end test. | diff_diff/diagnostic_report.py::_APPLICABILITY, _PT_METHOD | follow-up (Wave C) | Low |
| Issue | Location | PR | Priority |
|---|---|---|---|
| ImputationDiD event-study SEs recompute full conservative variance per horizon (should cache A0/A1 factorization) | imputation.py | #141 | Low |
| Rust faer SVD ndarray-to-faer conversion overhead (minimal vs SVD cost) | rust/src/linalg.rs:67 | #115 | Low |
| Won't-fix: not bit-identically achievable (verified 2026-06-01). The proposed bread_inv reuse / "factor (X'WX) once, reuse across HC2/HC2-BM" cannot be done bit-identically, which is the bar for a pure perf refactor of the inference path (no SE may move at all). The internal bread operations solve against different right-hand sides (X.T for hat-diagonals, eye for classical/CR2, meat+temp.T for the sandwich, contrasts for BM-DOF); only same-RHS results are bit-reusable. Measured (numpy 2.4.5): scipy.linalg.lu_solve(lu_factor(A), B) differs from np.linalg.solve(A, B) by up to 6.4e-15 (only 32/900 bit-equal; dgesv fuses factor+solve and rounds differently from a separate dgetrf+dgetrs); the inv(A) @ meat @ inv(A) sandwich differs from the current double-solve by up to 1.24e-14. Both are nonzero → not bit-identical. Note that both sit below the affected goldens' actual tolerances (the HC2/HC2-BM/CR2 asserts are atol 1e-12/1e-10, e.g. test_methodology_wls_cr2.py::TestUnweightedRegressionStillBitEqual at 1e-12; the atol=1e-14 checks in test_linalg_hc2_bm.py are HC1 default-vs-explicit dispatch-equality sentinels, not this path), so a broad reuse would silently shift SEs at ~1e-14 without tripping the suite, which is exactly what the bit-identity bar exists to prevent. np.linalg.solve(A, eye) == np.linalg.inv(A) IS bit-identical (and raises the same LinAlgError) but swapping saves nothing. The only genuine bit-identical redundancy is one duplicated solve(bread, X.T) (hat-diagonals + the DOF H-build) in the unweighted one-way hc2_bm+return_dof path, an O(k²n) solve dwarfed by that path's dense (n×n) M=I−H construction and its O(n²k)/O(n²m) per-contrast quadratic forms (per the _compute_bm_dof_from_contrasts unweighted-branch cost model), so the achievable saving is negligible. | linalg.py::compute_robust_vcov | Phase 1a | Low |
| MPD cluster+hc2_bm path builds the CR2 precomputes twice: once via solve_ols → _compute_cr2_bm for vcov + per-coefficient DOF, then again via _compute_cr2_bm_contrast_dof from MultiPeriodDiD.fit() for the post-period-average contrast DOF. Both rebuild H = X bread_inv X', the residual-maker M, and the per-cluster A_g = (I - H_gg)^{-1/2} matrices. O(n²k) redundant work; acceptable for typical cluster-robust DiD panel sizes (n ≤ a few thousand). Fix would plumb the contrast DOF through the existing CR2 vcov path (intrusive API change) or share the precomputes via a cached helper. | linalg.py::_compute_cr2_bm_contrast_dof, estimators.py::MultiPeriodDiD.fit | follow-up | Low |
| Rust-backend HC2 implementation. Current Rust path only supports HC1; HC2 and CR2 Bell-McCaffrey fall through to the NumPy backend. For large-n fits this is noticeable. | rust/src/linalg.rs | Phase 1a | Low |
| CR2 Bell-McCaffrey DOF uses a naive O(n² k) per-coefficient loop over cluster pairs. Pustejovsky-Tipton (2018) Appendix B has a scores-based formulation that avoids the full n × n M matrix. Switch when a user hits a large-n cluster-robust design. | linalg.py::_compute_cr2_bm | Phase 1a | Low |
| SyntheticControl retains a full _SyntheticControlFitSnapshot (pivoted outcome/predictor panels) on EVERY fit to support the opt-in in_space_placebo(), so callers who never run the placebo still pay O(units × periods × predictor-vars) memory (same as SyntheticDiD's always-on snapshot for in_time_placebo). Store a compact array/index representation instead of per-variable DataFrames, or build the snapshot lazily on first placebo call (would need to retain the source data, ~same cost). | synthetic_control.py snapshot build, synthetic_control_results.py::_SyntheticControlFitSnapshot | follow-up | Low |
| EfficientDiD DR (covariate) path rebuilds the full polynomial sieve basis _polynomial_sieve_basis(X, K) for every candidate K inside each of the three nuisance fits (outcome regression, propensity ratio, inverse propensity), per fit(). After the growing-sieve cap removal (PR-B), large covariate-adjusted fits at large n pay more avoidable basis-construction cost. Cache the basis per (X, K) within a fit() and share it across the nuisance helpers. | diff_diff/efficient_did_covariates.py (the three sieve helpers) | PR-B follow-up | Low |
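
The per-(X, K) cache proposed in the EfficientDiD row above could look roughly like the sketch below. Everything here is hypothetical: `toy_polynomial_basis` is a toy stand-in (not the real `_polynomial_sieve_basis`), `PerFitBasisCache` and `run_three_nuisance_fits` are illustrative names, and keying on `id(X)` assumes X is not mutated during a single fit().

```python
from typing import Callable, Dict, List, Sequence, Tuple

Basis = List[List[float]]


def toy_polynomial_basis(x: Sequence[float], k: int) -> Basis:
    """Toy stand-in for a sieve basis: columns 1, x, ..., x**k (one row per obs)."""
    return [[xi ** p for p in range(k + 1)] for xi in x]


class PerFitBasisCache:
    """Memoise basis construction per (X, K) for the lifetime of one fit()."""

    def __init__(self, builder: Callable[[Sequence[float], int], Basis]) -> None:
        self._builder = builder
        # Keep a reference to X next to the basis so id(X) cannot be recycled.
        self._store: Dict[Tuple[int, int], Tuple[Sequence[float], Basis]] = {}
        self.builds = 0

    def get(self, x: Sequence[float], k: int) -> Basis:
        key = (id(x), k)
        if key not in self._store:
            self._store[key] = (x, self._builder(x, k))
            self.builds += 1
        return self._store[key][1]


def run_three_nuisance_fits(x, candidate_ks, cache) -> None:
    """Each hypothetical nuisance fit scans every candidate K."""
    for _nuisance in ("outcome_regression", "propensity_ratio", "inverse_propensity"):
        for k in candidate_ks:
            cache.get(x, k)  # a real helper would fit on this basis


x = [0.0, 0.5, 1.0, 1.5]
cache = PerFitBasisCache(toy_polynomial_basis)
run_three_nuisance_fits(x, candidate_ks=(1, 2, 3), cache=cache)
print(cache.builds)  # 3 builds instead of 3 nuisance fits x 3 candidates = 9
```

Creating the cache at the top of fit() and discarding it on return keeps the memory cost bounded to one fit.
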
| Issue | Location | PR | Priority |
|---|---|---|---|
| Drift test for tutorial 24 qualitative power claims (monotonic dilution fast→slow; CS-vs-2×2 MDE crossover/near-parity at slow rollout); pins the prose against estimator-default/simulation drift | docs/tutorials/24_staggered_vs_collapsed_power.ipynb | staggered-analysis-2x2 | Low |
| Port the CI <notebook-prose> extraction into the reviewer-eval harness so docs/tutorials/*.ipynb cases (currently guarded out of verify-corpus/run) can be reviewed with CI-equivalent context | tools/reviewer-eval/adapters/ci_prompt.py | local-review | Low |
| R comparison tests spawn separate Rscript per test (slow CI) | tests/test_methodology_twfe.py:294 | #139 | Low |
| CS R helpers hard-code xformla = ~ 1; no covariate-adjusted R benchmark for IRLS path | tests/test_methodology_callaway.py | #202 | Low |
| Validating the .txt AI guides (diff_diff/guides/llms-full.txt, llms-practitioner.txt) as executable snippets is not low-lift (re-scoped 2026-06-01): of their ~112 fenced Python blocks only ~20% are standalone-runnable; the rest are API-signature references (Foo(param: type = default) pseudo-signatures that are SyntaxError by design), context fragments (e.g. results.att on an undefined results), or dataset-shape-specific blocks. The guides are reference documentation, not runnable examples; a real implementation needs signature-block detection + a context/data skip-allowlist + per-snippet fixtures (multi-round curation), unlike the curated .rst files the existing smoke test covers. | tests/test_doc_snippets.py | #239 | Low |
| SyntheticDiD: rename internal placebo_effects variable to variance_effects (or resampled_effects). Misleading name across the placebo/bootstrap/jackknife dispatch paths: it holds three different contents depending on variance method. Low-risk refactor; user-facing field rename should preserve placebo_effects as a deprecated alias for one release. | synthetic_did.py, results.py | follow-up | Medium |
| AI review CI: pin workflow contract via test (uses openai/codex-action@v1, passes prompt-file, reads steps.run_codex.outputs.final-message, preserves diff-exclude paths and comment markers). Currently only the wrapper-tag and closing-tag-escape strings are asserted. | tests/test_openai_review.py, .github/workflows/ai_pr_review.yml | #416 | Low |
| TestWorkflowDoesNotExecutePRHeadCode (CodeQL #14 dismissal guard) does not model: bash <script> / sh <script> / ./<script> / source <script> / . <script> direct shell-script execution; multi-line python3 -c bodies (line-by-line shlex can't reassemble across newlines; the workflow's 5 sanitizer bodies are exempt by invisibility); shell-variable-expansion indirection (SCRIPT="$X"; python3 "$SCRIPT"); eval; find -exec; xargs -I {}. Each represents a path by which PR-head bytes COULD execute without the test failing. The guard catches accidental regressions of common forms (16 tests covering pip/npm/cargo/maturin/etc. installs, python file exec, bash -c indirection with compound flags, env-var prefixes, line continuations, subshells/brace groups, single-line python -c, write-overwrites of allowlisted /tmp paths). Closing the residuals would require multi-line shell parsing with command-substitution awareness + script-execution allowlists, which is significant work for diminishing return given the dismissal's primary defense is the documented threat model on the alert and in .github/workflows/ai_pr_review.yml comment block. | tests/test_openai_review.py, .github/workflows/ai_pr_review.yml | #436 | Low |
| Render docs/methodology/REPORTING.md and docs/methodology/REGISTRY.md as in-site Sphinx pages so cross-references can use :doc: instead of off-site GitHub blob/main URLs. Current state (#410 fix-audit-r2) restores navigable links via blob/main, but stable-docs readers can land on a different revision than the package version they are reading. Two viable paths: (a) add myst-parser to docs/conf.py extensions + docs extras and link with :doc:, or (b) convert both files to .rst. | docs/conf.py, docs/api/business_report.rst, docs/api/diagnostic_report.rst, docs/tutorials/18_geo_experiments.ipynb, docs/tutorials/19_dcdh_marketing_pulse.ipynb | follow-up | Low |
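
The "line-by-line shlex can't reassemble across newlines" residual noted in the TestWorkflowDoesNotExecutePRHeadCode row can be illustrated with the standard library alone. This is a minimal sketch, not the guard's actual code; `SCRIPT` and `tokenize_per_line` are hypothetical names made up for the illustration.

```python
import shlex

# A multi-line `python3 -c "..."` body as it might appear in a workflow step.
SCRIPT = "\n".join([
    'python3 -c "',
    "import os",
    "os.system('ls')",
    '"',
])


def tokenize_per_line(script: str):
    """Tokenize each line independently; None marks a line shlex cannot parse."""
    tokens = []
    for line in script.splitlines():
        try:
            tokens.append(shlex.split(line))
        except ValueError:  # raised on an unbalanced quote
            tokens.append(None)
    return tokens


per_line = tokenize_per_line(SCRIPT)
whole = shlex.split(SCRIPT)

print(per_line[0])  # None: the opening quote never closes on this line
print(whole[:2])    # ['python3', '-c'], with the full body in whole[2]
```

Only whole-script tokenization recovers the `-c` body as a single argument, which is why closing this gap needs multi-line parsing rather than a per-line check.
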
Ordered paydown view across the tables above. Tier A → D is by effort × risk, not severity — every item here already carries its own Low / Medium priority in the source-of-truth tables. The intent is to give a flat ordering to draw from wave-by-wave without re-litigating priority each time. Anchors point to the location reference of the originating row.
(No active items. The sole prior entry — the WooldridgeDiD method/outcome efficiency hint — has shipped; see CHANGELOG ## [Unreleased] and REGISTRY §WooldridgeDiD "Nonlinear extensions".)
(SyntheticDiD placebo_effects → variance_effects rename moved to Tier B — the user-facing field rename + one-release deprecation alias is too large for ≤1 day / ≤3 CI rounds.)
- SyntheticDiD: rename internal placebo_effects → variance_effects AND public placebo_effects field with deprecation alias retained for one release (synthetic_did.py, results.py)
- StaggeredTripleDifference R parity: commit CSV fixtures + add covariate-adjusted scenarios + aggregation-SE assertions (tests/test_methodology_staggered_triple_diff.py, benchmarks/R/benchmark_staggered_triplediff.R)
- StaggeredTripleDifference: per-cohort group-effect SE WIF override for exact R triplediff match (staggered_triple_diff.py)
- WooldridgeDiD: QMLE Stata-parity qmle weight type + Stata golden values (wooldridge.py, linalg.py, tests/test_wooldridge.py)
- WooldridgeDiD: optional weights="cohort_share" on aggregate() (wooldridge_results.py)
- HAD survey-design API consolidation: drop deprecated survey= / weights= kwargs (had.py, had_pretests.py; gated on next minor bump)
- Survey-design resolution / collapse helper extraction across continuous_did.py, efficient_did.py, stacked_did.py
- dCDH survey + backward-horizon predict_het allocator derivation: lift the warn-and-skip fallback at _compute_heterogeneity_test once the pre-period Binder TSL cell-period allocator is derived (currently the gate emits a UserWarning and falls back to forward-horizon-only heterogeneity under survey_design + placebo + heterogeneity) (chaisemartin_dhaultfoeuille.py, docs/methodology/REGISTRY.md)
- Rust local-method solver path unification to solve_wls_svd + bootstrap-weight RNG parity audit (rust/src/trop.rs, rust/src/bootstrap.rs)
- AI review CI workflow-contract pin test expansion (tests/test_openai_review.py)
- In-site Sphinx render of REPORTING.md and REGISTRY.md (docs/conf.py + :doc: link migration)
- HonestDiD Δ^RM ARP conditional/hybrid confidence sets (honest_did.py)
- Multi-absorb weighted demeaning: alternating-projection iteration for N>1 absorb + weights (estimators.py)
- ImputationDiD dense (A0'A0).toarray() OOM: alternative dense fallback or richer sparse strategy (imputation.py:1531)
- HAD mass-point vcov_type ∈ {hc2, hc2_bm}: 2SLS-specific leverage derivation (had.py::_fit_mass_point_2sls)
- HAD repeated-cross-section identification path (had.py::_validate_had_panel)
- HAD time-varying-dose event study estimator (had.py::_validate_had_panel_event_study)
- Conley + survey_design (linalg.py::_validate_vcov_args, conley.py)
- SyntheticDiD vcov_type="conley" (synthetic_did.py::SyntheticDiD; new analytical sandwich path OR spatial-block bootstrap)
- HAD survey-aware support-endpoint test (had_pretests.py::qug_test; waits on literature: endpoint EVT × survey-aware functional CLT)
- HAD joint cross-horizon analytical covariance / unweighted event-study sup-t band (low user demand)
- HAD Phase 4.5 replicate-weight pretests (BRR/Fay/JK1/JKn/SDR composition derivation)
- HAD Stute family Stata-bridge parity (no R Stutetest package exists publicly)
- HAD trends_lin × survey_design weighted-slope derivation
- Phase 1c lprobust follow-ups (vce modes, weight-aware auto-bandwidth DPI, multi-eval grid, clustered-DGP auto-bandwidth); deferred to Phase 2+ of bias_corrected_local_linear
- TestWorkflowDoesNotExecutePRHeadCode (CodeQL #14) residual bypass paths; diminishing return given documented threat model
- All remaining Low-priority Performance and Testing/Docs rows (R-script-per-test, CS R covariate-adjusted IRLS benchmark, doc-deps integrity CI, Rust faer SVD overhead, etc.)
vcov_type has subsumed the previously-proposed se_type knob. DifferenceInDifferences and TwoWayFixedEffects accept vcov_type ∈ {"classical", "hc1", "hc2", "hc2_bm", "conley"} (the validated set in linalg.py::_VALID_VCOV_TYPES); cluster-robust variance is obtained by passing cluster= alongside the heteroscedasticity kind (hc1 + cluster ⇒ CR1 Liang-Zeger; hc2_bm + cluster ⇒ CR2 Bell-McCaffrey, including the weighted path landed via the clubSandwich WLS-CR2 port; the N>1 absorbed-FE + weights composition remains gated by the open multi-absorb row in the table above); wild cluster bootstrap is a separate inference="wild_bootstrap" path on the same estimator. Threading vcov_type through the 8 standalone estimators (CallawaySantAnna, SunAbraham, ImputationDiD, TwoStageDiD, TripleDifference, StackedDiD, WooldridgeDiD, EfficientDiD) is complete as of Phase 1b; four of them (CallawaySantAnna, TripleDifference, ImputationDiD, EfficientDiD) are permanently narrow to {"hc1"} per their influence-function variance, and TwoStageDiD is likewise narrow because its Gardner GMM-corrected meat has no single cross-stage hat matrix for classical/hc2/hc2_bm. The per-estimator vcov_type="conley" extensions are tracked as follow-up rows in the table above: SunAbraham + WooldridgeDiD-OLS are shipped (within-transform conley via solve_ols); StackedDiD is deferred for a methodology reason (unit replication × spatial distance); the IF-based / GMM estimators have no reference implementation.
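
The pairing rules in the paragraph above can be summarised as a small lookup. This is only an illustrative sketch: `describe_variance` and `CLUSTERED_LABELS` are hypothetical names, not part of the library, and only the two cluster pairings the paragraph spells out are encoded; any other vcov_type + cluster combination is deliberately left undescribed.

```python
from typing import Optional

# The validated set quoted in the paragraph above.
VALID_VCOV_TYPES = frozenset({"classical", "hc1", "hc2", "hc2_bm", "conley"})

# Only the two cluster pairings the paragraph states explicitly.
CLUSTERED_LABELS = {
    "hc1": "CR1 (Liang-Zeger)",
    "hc2_bm": "CR2 (Bell-McCaffrey)",
}


def describe_variance(vcov_type: str, cluster: Optional[str] = None) -> str:
    """Return a human-readable label for a (vcov_type, cluster) request."""
    if vcov_type not in VALID_VCOV_TYPES:
        raise ValueError(
            f"vcov_type must be one of {sorted(VALID_VCOV_TYPES)}, got {vcov_type!r}"
        )
    if cluster is None:
        return vcov_type
    return CLUSTERED_LABELS.get(
        vcov_type, f"{vcov_type} + cluster (pairing not described here)"
    )


print(describe_variance("hc1", cluster="unit"))     # CR1 (Liang-Zeger)
print(describe_variance("hc2_bm", cluster="unit"))  # CR2 (Bell-McCaffrey)
```
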
Mypy reports 0 errors. All mixin attr-defined errors resolved via TYPE_CHECKING-guarded method stubs in bootstrap mixin classes.
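
For readers unfamiliar with the pattern, a TYPE_CHECKING-guarded stub looks roughly like the sketch below. The class and method names (`RepeatMixin`, `MeanEstimator`, `_point_estimate`, `n_repeats`) are hypothetical and do not come from the codebase.

```python
from typing import TYPE_CHECKING, List


class RepeatMixin:
    """Mixin that relies on an attribute and a method supplied by the host class."""

    if TYPE_CHECKING:
        # Visible to the type checker only. Nothing is defined at runtime,
        # so the host class's real implementation is never shadowed.
        n_repeats: int

        def _point_estimate(self, data: List[float]) -> float: ...

    def repeated_estimates(self, data: List[float]) -> List[float]:
        # Without the stubs above, mypy reports attr-defined errors here.
        return [self._point_estimate(data) for _ in range(self.n_repeats)]


class MeanEstimator(RepeatMixin):
    def __init__(self, n_repeats: int = 3) -> None:
        self.n_repeats = n_repeats

    def _point_estimate(self, data: List[float]) -> float:
        return sum(data) / len(data)


print(MeanEstimator(2).repeated_estimates([1.0, 3.0]))  # [2.0, 2.0]
```
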
Deprecated parameters still present for backward compatibility:
- lambda_reg and zeta in SyntheticDiD (synthetic_did.py)
  - Deprecated in favor of zeta_omega / zeta_lambda parameters
  - Remove in v4.0.0 (SemVer-safe: public kwarg removal requires a major bump)
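
A deprecated keyword is typically kept alive with a small shim like the one sketched here. `resolve_deprecated_kwarg` and the `old_param` / `new_param` names are hypothetical; the sketch deliberately does not assume how lambda_reg and zeta map onto zeta_omega and zeta_lambda, since that mapping is not stated here.

```python
import warnings


def resolve_deprecated_kwarg(new_value, old_value, *, new_name, old_name, remove_in):
    """Return the effective value, warning when the deprecated spelling is used."""
    if old_value is None:
        return new_value
    if new_value is not None:
        raise TypeError(f"Pass either {new_name}= or {old_name}=, not both.")
    warnings.warn(
        f"{old_name}= is deprecated and will be removed in {remove_in}; "
        f"use {new_name}= instead.",
        DeprecationWarning,
        stacklevel=3,  # point at the caller of the public function using this shim
    )
    return old_value


# Usage sketch: capture the warning to show that the old spelling still works.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = resolve_deprecated_kwarg(
        None, 0.5, new_name="new_param", old_name="old_param", remove_in="v4.0.0"
    )

print(value, caught[0].category.__name__)
```
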
Visualization tests skip when matplotlib / plotly are not installed (see pytest.importorskip markers in tests/test_visualization*.py).
Enhancements for honest_did.py:
- Improved C-LF implementation with direct optimization instead of grid search (current implementation uses simplified FLCI approach with estimation uncertainty adjustment; see honest_did.py:947)
- Support for CallawaySantAnnaResults (implemented in honest_did.py:612-653; requires aggregate='event_study' when calling CallawaySantAnna.fit())
- Event-study-specific bounds for each post-period
- Hybrid inference methods
- Simulation-based power analysis for honest bounds
- Consider aligning p-value computation with R did package (symmetric percentile method)
Spurious RuntimeWarnings ("divide by zero", "overflow", "invalid value") are emitted by np.matmul/@ on Apple Silicon M4 + macOS Sequoia with numpy < 2.3. The warnings appear for matrices with ≥260 rows but do not affect result correctness — coefficients and fitted values are valid (no NaN/Inf), and the design matrices are full rank.
Root cause: Apple's BLAS SME (Scalable Matrix Extension) kernels corrupt the floating-point status register, causing spurious FPE signals. Tracked in numpy#28687 and numpy#29820. Fixed in numpy ≥ 2.3 via PR #29223.
Not reproducible on M3, Intel, or Linux.
- linalg.py:162 - Warnings in fitted value computation (X @ coefficients)
  - Caused by M4 BLAS bug, not extreme coefficient values
  - Seen in test_prep.py during treatment effect recovery tests (n > 260)
- triple_diff.py:307,323 - Warnings in propensity score computation
  - Occurs in IPW and DR estimation methods with covariates
  - Related to logistic regression overflow in edge cases (separate from BLAS bug)
- Long-term: Revert to @ operator when numpy ≥ 2.3 becomes the minimum supported version.
Features in R's did package that block porting additional tests:
| Feature | R tests blocked | Priority | Status |
|---|---|---|---|
| Calendar time aggregation | 1 test in test-att_gt.R | Low |
Potential future optimizations:
- JIT compilation for bootstrap loops (numba)
- Sparse matrix handling for large fixed effects
Background: The current solve_ols() implementation performs both QR (for rank detection) and SVD (for solving) decompositions on rank-deficient matrices. This is technically redundant since SVD can determine rank directly.
Current approach (R-style, chosen for robustness):
- QR with pivoting for rank detection (_detect_rank_deficiency())
- scipy's lstsq with 'gelsd' driver (SVD-based) for solving
Why we use QR for rank detection:
- QR with pivoting provides the canonical ordering of linearly dependent columns
- R's lm() uses this approach for consistent dropped-column reporting
- Ensures consistent column dropping across runs (SVD column selection can vary)
Potential optimization (future work):
- Skip QR when rank_deficient_action="silent" since we don't need column names
- Use SVD rank directly in the Rust backend (already implemented)
- Add skip_rank_check parameter for hot paths where matrix is known to be full-rank (implemented in v2.2.0)
Priority: Low - the QR overhead is minimal compared to SVD solve, and correctness is more important than micro-optimization.
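
To show why a pivoted factorization gives a stable answer to "which column gets dropped", here is a dependency-free sketch. It uses greedy column-pivoted Gram-Schmidt as a stand-in for LAPACK's pivoted QR, so it is not what `_detect_rank_deficiency()` does internally; `pivoted_rank` and its tolerance rule are assumptions made for the illustration.

```python
import math


def pivoted_rank(columns, tol=1e-10):
    """Return (rank, pivot_order) via greedy column-pivoted Gram-Schmidt.

    `columns` is a list of equal-length column vectors (lists of floats).
    Columns listed after position `rank` in pivot_order are linearly dependent
    on the pivots chosen before them.
    """
    work = [list(col) for col in columns]  # residual columns, deflated in place
    remaining = list(range(len(work)))
    pivots = []
    scale = max((math.sqrt(sum(v * v for v in col)) for col in work), default=0.0)
    while remaining:
        # Pick the remaining column with the largest residual norm.
        norms = {j: math.sqrt(sum(v * v for v in work[j])) for j in remaining}
        j_best = max(remaining, key=lambda j: norms[j])
        if norms[j_best] <= tol * max(scale, 1.0):
            break  # everything left is numerically in the span of the pivots
        q = [v / norms[j_best] for v in work[j_best]]
        pivots.append(j_best)
        remaining.remove(j_best)
        # Deflate: remove the q-component from every remaining column.
        for j in remaining:
            proj = sum(a * b for a, b in zip(q, work[j]))
            work[j] = [b - proj * a for a, b in zip(q, work[j])]
    return len(pivots), pivots + remaining


intercept = [1.0, 1.0, 1.0, 1.0]
x = [0.0, 1.0, 2.0, 3.0]
x_doubled = [0.0, 2.0, 4.0, 6.0]  # exact multiple of x

rank, order = pivoted_rank([intercept, x, x_doubled])
dropped = order[rank:]
print(rank, dropped)  # 2 [1]
```

Because the pivot rule is deterministic (largest residual norm, first index on ties), the same column is reported as dropped on every run, which is the property the list above attributes to pivoted QR.
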
Background: The solve_ols() function accepts a check_finite=False parameter intended to skip NaN/Inf validation for performance in hot paths where data is known to be clean.
Current limitation: When check_finite=False, our explicit validation is skipped, but scipy's internal QR decomposition in _detect_rank_deficiency() still validates finite values. This means callers cannot fully bypass all finite checks.
Impact: Minimal - the scipy check is fast and only affects edge cases where users explicitly pass check_finite=False with non-finite data (which would be a bug in their code anyway).
Potential fix (future work):
- Pass check_finite=False through to scipy's QR call (requires scipy >= 1.9.0)
- Or skip _detect_rank_deficiency() entirely when check_finite=False and _skip_rank_check=True
Priority: Low - this is an edge case optimization that doesn't affect correctness.
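
The first fix amounts to forwarding the flag to the inner helper. The stand-in below shows the shape of that change with stdlib-only stubs; `solve_stub` and `_rank_check_stub` are hypothetical names, and the inner finite check merely imitates the validation that scipy performs by default.

```python
import math


def _all_finite(rows) -> bool:
    return all(math.isfinite(v) for row in rows for v in row)


def _rank_check_stub(rows, check_finite=True):
    """Hypothetical inner helper with its own finite check (imitating scipy's default)."""
    if check_finite and not _all_finite(rows):
        raise ValueError("array must not contain infs or NaNs")
    return len(rows[0]) if rows else 0  # placeholder: pretend full column rank


def solve_stub(rows, check_finite=True):
    """Hypothetical outer entry point that forwards the flag."""
    if check_finite and not _all_finite(rows):
        raise ValueError("X contains NaN or Inf")
    # Forwarding the flag is the whole fix: without it, the inner check
    # would still run even though the caller opted out.
    return _rank_check_stub(rows, check_finite=check_finite)


dirty = [[1.0, float("nan")], [2.0, 3.0]]
print(solve_stub(dirty, check_finite=False))  # 2: no finite check ran at either level
```
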