Development TODO

Internal tracking for technical debt, known limitations, and maintenance tasks.

For the public feature roadmap, see ROADMAP.md.

Known Limitations

Current limitations that may affect users:

Issue	Location	Priority	Notes
MultiPeriodDiD wild bootstrap not supported (falls back to analytical)	`estimators.py:1647`	Low	Edge case
`predict()` raises NotImplementedError	`estimators.py:890-911`	Low	Rarely needed

For survey-specific limitations (NotImplementedError paths), see the Current Limitations section of survey-roadmap.md.

Code Quality

Large Module Files

Target: ideally < 1000 lines per module. Modules with ≥ 3000 lines are candidates for splitting, 2000-2999 are monitored, and 1000-1999 are accepted as a cohesion / scope trade-off. Updated 2026-05-15.

File	Lines	Action
`chaisemartin_dhaultfoeuille.py`	8636	Consider splitting (per-path / placebos / survey IF / aggregation)
`had_pretests.py`	4951	Consider splitting (Stute / Yatchew / QUG / joint pretests)
`had.py`	4593	Consider splitting (continuous / mass-point / event-study / survey paths)
`staggered.py`	3963	Consider splitting — grew through survey + aggregation features
`linalg.py`	3601	Consider splitting (vcov surfaces) only if cohesion can be preserved — unified backend; vcov / solver paths are tightly coupled
`diagnostic_report.py`	3380	Consider splitting (per-method renderers + provenance)
`power.py`	3196	Consider splitting (power analysis + MDE + sample size)
`synthetic_did.py`	2819	Monitor — variance methods + survey paths
`honest_did.py`	2785	Monitor
`business_report.py`	2653	Monitor — per-method narrative renderers
`imputation.py`	2475	Monitor
`survey.py`	2466	Monitor — grew with Phase 6 features
`utils.py`	2396	Monitor
`prep_dgp.py`	2057	Monitor
`triple_diff.py`	2053	Monitor
`estimators.py`	1991	Acceptable
`two_stage.py`	1985	Acceptable
`chaisemartin_dhaultfoeuille_results.py`	1981	Acceptable
`prep.py`	1876	Acceptable
`efficient_did.py`	1793	Acceptable
`sun_abraham.py`	1713	Acceptable
`continuous_did.py`	1682	Acceptable
`results.py`	1676	Acceptable
`staggered_triple_diff.py`	1619	Acceptable
`_nprobust_port.py`	1412	Acceptable
`practitioner.py`	1402	Acceptable
`trop_global.py`	1350	Acceptable
`trop_local.py`	1339	Acceptable
`local_linear.py`	1332	Acceptable
`wooldridge.py`	1305	Acceptable
`chaisemartin_dhaultfoeuille_bootstrap.py`	1175	Acceptable
`bacon.py`	1144	Acceptable
`pretrends.py`	1133	Acceptable
`stacked_did.py`	1050	Acceptable
`conley.py`	1006	Acceptable
`visualization/`	4316	Subpackage (split across 7 files) — OK
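
The thresholds stated in the Target line can be checked mechanically. The following is a minimal sketch, not an existing script in this repository: the function names `classify_module`, `count_lines`, and `audit` are hypothetical, and the "OK" label for modules under 1000 lines is an assumption, since no such module appears in the table.

```python
from pathlib import Path


def classify_module(line_count: int) -> str:
    """Map a module's line count to the Action labels used in the table."""
    if line_count >= 3000:
        return "Consider splitting"
    if line_count >= 2000:
        return "Monitor"
    if line_count >= 1000:
        return "Acceptable"
    return "OK"  # assumed label: under the 1000-line target


def count_lines(path: Path) -> int:
    """Count physical lines in a file (blank lines and comments included)."""
    with path.open(encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def audit(package_dir: str) -> list[tuple[str, int, str]]:
    """List top-level modules of 1000+ lines, largest first."""
    rows = []
    for path in Path(package_dir).glob("*.py"):
        n = count_lines(path)
        if n >= 1000:
            rows.append((path.name, n, classify_module(n)))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Calling `audit("diff_diff")` (assuming that is the package directory) would return rows shaped like the table. Subpackages such as `visualization/` are not covered by the top-level glob and would need a separate per-directory sum.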

Tech Debt from Code Reviews

Deferred items from PR reviews that were not addressed before merge.

Methodology/Correctness

Issue	Location	PR	Priority
`SyntheticControl` cv: `in_space_placebo()` / `leave_one_out()` report a cv refit excluded for STRUCTURAL infeasibility (donor-indistinguishable re-aggregated window) with the generic `status="failed"` — same machine-readable status as a genuine inner-solver non-convergence. The failure warnings now distinguish the two causes (and the correct remediation) under cv, and `in_time_placebo()` already splits structural→`"infeasible"` vs `"failed"`, but in-space/LOO do not yet emit a separate machine-readable status/reason-code. Thread a reason code from `_outer_solve_V_cv()`/`_placebo_fit_unit()` and add an `"infeasible"` status + count to the in-space/LOO outputs (mirror the in-time split).	`synthetic_control.py`, `synthetic_control_results.py`	follow-up	Low
dCDH: Phase 1 per-period placebo DID_M^pl has NaN SE (no IF derivation for the per-period aggregation path). Multi-horizon placebos (L_max >= 1) have valid SE.	`chaisemartin_dhaultfoeuille.py`	#294	Low
dCDH: Survey cell-period allocator's post-period attribution is a library convention, not derived from the observation-level survey linearization. MC coverage is empirically close to nominal on the test DGP; a formal derivation (or a covariance-aware two-cell alternative) is deferred. Documented in REGISTRY.md survey IF expansion Note.	`chaisemartin_dhaultfoeuille.py`, `docs/methodology/REGISTRY.md`	#408	Medium
dCDH: Parity test SE/CI assertions only cover pure-direction scenarios; mixed-direction SE comparison is structurally apples-to-oranges (cell-count vs obs-count weighting).	`test_chaisemartin_dhaultfoeuille_parity.py`	#294	Low
dCDH by_path: survey-aware backward-horizon (`placebo + predict_het + survey_design`) raises `NotImplementedError` because the Binder TSL cell-period allocator's REGISTRY justification is tied to post-period attribution. Backward horizons would put ψ_g mass on a pre-period cell. Deriving the pre-period cell allocator (or adding a covariance-aware two-cell alternative) is deferred to a follow-up methodology PR.	`diff_diff/chaisemartin_dhaultfoeuille.py`, `docs/methodology/REGISTRY.md`	follow-up	Medium
CallawaySantAnna: consider materializing NaN entries for non-estimable (g,t) cells in group_time_effects dict (currently omitted with consolidated warning); would require updating downstream consumers (event study, balance_e, aggregation)	`staggered.py`	#256	Low
CallawaySantAnna and StaggeredTripleDifference fit their covariate outcome-regression nuisance via estimator-local `cho_solve(X'X)` / `scipy.lstsq(cond=1e-7)` that bypass `solve_ols`, so they are NOT scale-equilibrated — a large-scale covariate can in principle perturb the nuisance fit (TripleDifference's OR fit already routes through `solve_ols` and is covered). Route the local OR fits through the shared scale-robust solver (or equilibrate locally).	`staggered.py`, `staggered_triple_diff.py`	covariate-review	Medium
Adopt the shared `_rank_guarded_inv` for the structural (non-covariate) matrix inverses that share the `LinAlgError`-only fallback pattern and can become near-singular: `continuous_did.py:1056` (dose B-spline basis), `spillover.py:3371` (ring-solve, partially guarded via `kept_cols`), `two_stage.py:3154` (TSL Stage-2 variance), `imputation.py:2403`, `had.py:2413`, `conley.py:1109`. These invert internal bases users cannot perturb with `covariates=` (so not the covariate-triggered SE bug already fixed by the DR/OR rank-guard) — lower priority; the `_rank_guarded_inv` helper is the seam.	`continuous_did.py`, `spillover.py`, `two_stage.py`, `imputation.py`, `had.py`, `conley.py`	dr-or-se-rank-guard	Low
ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels	`imputation.py`	#141	Medium (deferred — only triggers when sparse solver fails)
Multi-absorb weighted demeaning needs iterative alternating projections for N > 1 absorbed FE with survey weights; unweighted multi-absorb also uses single-pass (pre-existing, exact only for balanced panels)	`estimators.py`	#218	Medium
Survey design resolution/collapse patterns are inconsistent across panel estimators — ContinuousDiD rebuilds unit-level design in SE code, EfficientDiD builds once in fit(), StackedDiD re-resolves on stacked data; extract shared helpers for panel-to-unit collapse, post-filter re-resolution, and metadata recomputation	`continuous_did.py`, `efficient_did.py`, `stacked_did.py`	#226	Low
SyntheticControl: the remaining ADH-2015 §4 items — the regression-weight `W^reg = X_0'(X_0 X_0')^{-1} X_1` extrapolation diagnostic (flag implied OLS weights outside `[0,1]`) and sparse-SC subset search (`l < J`, holding `V` fixed). Leave-one-out (`leave_one_out()`), the in-time placebo (`in_time_placebo()`), out-of-sample CV `V`-selection (`v_method="cv"`), and inverse-variance `V` (`v_method="inverse_variance"`) have landed; these two are the deferred tail.	`synthetic_control.py`, `synthetic_control_results.py`	ADH-2015 follow-up	Low
ContinuousDiD deferred CGBS 2024 extensions: (a) `covariates=` kwarg not implemented (matches R `contdid` v0.1.0); (b) discrete-treatment saturated regression deferred (integer-valued dose currently warned, not routed to per-level coefficients); (c) lowest-dose-as-control per CGBS 2024 Remark 3.1 (when `P(D=0) = 0`) not implemented — estimator requires never-treated controls. REGISTRY `## ContinuousDiD` → Implementation Checklist marks these as deferred `[ ]` items.	`diff_diff/continuous_did.py`	—	Low
Survey-weighted Silverman bandwidth in EfficientDiD conditional Omega*: `_silverman_bandwidth()` uses unweighted mean/std for bandwidth selection; survey-weighted statistics would better reflect the population distribution, but this is a second-order refinement	`efficient_did_covariates.py`	—	Low
Survey sandwich SE is not exactly invariant to zero-weight (subpopulation / padded) rows: the shared `_compute_stratified_psu_meat` finite-sample correction counts zero-weight units as PSUs (an `n_psu/(n_psu-1)`-style factor), so adding zero-weight rows shifts the SE by a second-order amount (~2e-4 relative in the EfficientDiD e2e). The point estimate is exactly invariant and the weighted scores of zero-weight rows are already zero — only the DOF correction's PSU count includes them. Cross-cutting across all survey-enabled estimators; fix by counting only positive-weight PSUs in the correction.	`survey.py` (`_compute_stratified_psu_meat`)	PR-B follow-up	Low
TROP: extend Wave 4's `_setup_trop_data` helper to also cover the duplicated bootstrap resampling loop in `_bootstrap_variance` / `_bootstrap_variance_global` (~40 LoC dedup; mirrors the data-setup helper pattern with a `fit_callable` parameter for the per-draw refit step).	`trop_local.py`, `trop_global.py`	follow-up	Low
TripleDifference power auto-routing: `power.simulate_power` ignores `n_periods` for DDD because `_ddd_dgp_kwargs` is hard-coded to the cross-sectional `generate_ddd_data`. Now that `generate_ddd_panel_data` exists (Wave 4), add a new `_EstimatorProfile` registry entry (or extend the existing one) to route to the panel DGP when `n_periods > 2`.	`power.py`, `prep_dgp.py`	follow-up	Low
StaggeredTripleDifference R cross-validation: CSV fixtures not committed (gitignored); tests skip without local R + triplediff. Commit fixtures or generate deterministically.	`tests/test_methodology_staggered_triple_diff.py`	#245	Medium
StaggeredTripleDifference R parity: benchmark only tests no-covariate path (xformla=~1). Add covariate-adjusted scenarios and aggregation SE parity assertions.	`benchmarks/R/benchmark_staggered_triplediff.R`	#245	Medium
StaggeredTripleDifference: per-cohort group-effect SEs include WIF (conservative vs R's wif=NULL). Documented in REGISTRY. Could override mixin for exact R match.	`staggered_triple_diff.py`	#245	Low
HonestDiD Delta^RM: uses naive FLCI instead of paper's ARP conditional/hybrid confidence sets (Sections 3.2.1-3.2.2). ARP infrastructure exists but moment inequality transformation needs calibration. CIs are conservative (wider, valid coverage).	`honest_did.py`	#248	Medium
Replicate weight tests use Fay-like BRR perturbations (0.5/1.5), not true half-sample BRR. Add true BRR regressions per estimator family. Existing `test_survey_phase6.py` covers true BRR at the helper level.	`tests/test_replicate_weight_expansion.py`	#253	Low
WooldridgeDiD: QMLE sandwich uses `aweight` cluster-robust adjustment `(G/(G-1))*(n-1)/(n-k)` vs Stata's `G/(G-1)` only. Conservative (inflates SEs). Add `qmle` weight type if Stata golden values confirm material difference.	`wooldridge.py`, `linalg.py`	#216	Medium
WooldridgeDiD: response-scale APE / log-link coefficient bridge for R `etwfe(family="poisson")` + `etwfe(family="logit")` cell-level numerical parity. diff-diff `WooldridgeDiD(method="poisson"|"logit")` returns ATT on the response scale (counterfactual μ_1 − μ_0 / p_1 − p_0 per paper W2023 ASF / APE framework); R `etwfe` returns the cell-level log-link coefficient. PR-B Stage D ships log-link goldens at `benchmarks/data/wooldridge_golden.json` and surface tests (fit completes + goldens well-formed); cell-level numerical parity requires either `emfx()`-based APE extraction on the R side or link-function inversion with baseline-mean adjustment.	`benchmarks/R/generate_wooldridge_golden.R`, `tests/test_methodology_wooldridge.py::TestWooldridgeParityRPoisson/TestWooldridgeParityRLogit`	PR-B follow-up	Medium
WooldridgeDiD: design-consistent cohort totals for `aggregate(weights="cohort_share")` on survey-weighted fits. Current impl populates `_n_g_per_cohort` from `unit.nunique()` (raw counts); composing these unweighted cohort shares with the design-weighted ATTs targets a mixed estimand inconsistent with paper W2025 Section 7's design-population cohort-share form. PR-B Stage E fail-closes the surface (raises `ValueError` when `survey_design is not None`); the follow-up implements survey-weighted unit totals per cohort and re-enables the surface.	`wooldridge.py` `_n_g_per_cohort` population, `wooldridge_results.py::aggregate` survey gate	PR-B follow-up	Medium
WooldridgeDiD: unconditional inference for `aggregate(weights="cohort_share")` accounting for sampling uncertainty in the cohort shares ω̂_g / ω̂_{ge} (paper W2025 Section 7.5). Current impl fail-closes the t-stat / p-value / conf-int fields to NaN under cohort-share aggregation because the analytical SE is conditional-on-shares. Proper APE/GMM-style aggregate inference (Wooldridge 2023 Section 4 framework) re-enables full inference.	`wooldridge_results.py::aggregate` cohort_share inference branch	PR-B follow-up	Medium
WooldridgeDiD: `cohort_trends=True` + `survey_design` composition. PR-B Stage E fail-closes the cross-product with `NotImplementedError` at `fit()` because the full-dummy `dg_i · t` design composed with the survey TSL variance hasn't been validated against R-parity goldens. Follow-up: validate the composition (or implement a survey-aware alternative) and re-enable the surface.	`wooldridge.py` fit guard, `wooldridge_results.py::aggregate` (if survey-aware cohort_trends variance plumbing is added)	PR-B follow-up	Low
WooldridgeDiD: `cohort_trends=True` + `control_group="never_treated"` composition. PR-B Stage E (codex R9 P1 fix) fail-closes the cross-product with `NotImplementedError` at `fit()` because the OLS + never_treated branch emits ALL `(g, t)` cells as treatment-cell dummies (paper Section 4.4 placebo coverage); the appended `dg_i · t` trend columns are linearly spanned by the per-cohort sum of those cell dummies, so the Section 8 trend specification is unidentified. Follow-up: implement a separate design-matrix branch that drops the pre-treatment placebo dummies (or restricts the trend interaction to post-treatment cells) under the trend specification, then re-enable the combination.	`wooldridge.py` fit guard + `_build_interaction_matrix` redesign for the cohort_trends path	PR-B follow-up	Low
WooldridgeDiD: Stata `jwdid` golden value tests — add R/Stata reference script and `TestReferenceValues` class.	`tests/test_wooldridge.py`	#216	Medium
PreTrendsPower: CS/SA `anticipation=1` R-parity fixture. The PR-C R-parity goldens cover NIS power + γ_p MDV at `atol=1e-4` on four shifted-grid / regular / irregular / K=1 fixtures, but R `pretrends` has no anticipation parameter so the Python-side `_extract_pre_period_params` anticipation filter (`if t < _pre_cutoff` in `pretrends.py` lines 1138-1150 for CS; mirror in SA branch) is not R-parity-locked. Build a synthetic `CallawaySantAnnaResults` (or `SunAbrahamResults`) with `anticipation=1` and a t=-1 event-study entry that should be filtered before reaching `_compute_power_nis`, then assert the resulting γ_p matches R's `slope_for_power()` on the K=4 shifted-grid fixture. Existing PR-B MC-based tests (`TestPretrendsPropositions`) and full-VCV tests (`TestPretrendsCovarianceSource`) already cover the filter mechanically; this would close the loop against R.	`tests/test_methodology_pretrends.py::TestPretrendsParityR`, `benchmarks/R/generate_pretrends_golden.R`	PR-C follow-up	Low
`StackedDiD` `vcov_type="conley"` — deferred for a methodology reason, NOT plumbing (unlike the now-shipped SunAbraham / WooldridgeDiD-OLS conley threading): the stacked design replicates each control unit across every sub-experiment it qualifies for (`_build_sub_experiment`), so one geographic unit occupies many stacked rows. Conley's pairwise distance matrix would see those same-unit copies at distance 0 (`K(0)=1`, perfectly correlated), conflating the stacking-replication device with real spatial correlation, and there is no `conleyreg` analogue for stacked DiD to anchor parity. A correct treatment needs a per-stack spatial identifier and is paper-gated.	`diff_diff/stacked_did.py`	follow-up	Low
Extend `WooldridgeDiD` `method ∈ {"logit","poisson"}` paths with `vcov_type ∈ {classical, hc2, hc2_bm}`. The GLM QMLE sandwich uses pseudo-residuals (`weights=p(1-p)` for logit, `weights=μ_i` for Poisson, aweight semantics); composing HC2 leverage and Bell-McCaffrey Satterthwaite DOF with QMLE on canonical-link pseudo-residuals needs derivation + R parity against `clubSandwich::vcovCR(glm(...), type="CR2")`. Phase 1b PR 3/8 rejects `method != "ols" + vcov_type != "hc1"` at `__init__` with a deferral pointer here.	`diff_diff/wooldridge.py` (`_fit_logit`, `_fit_poisson`)	follow-up	Medium
Extend `CallawaySantAnna` with `vcov_type="conley"` — would require deriving a spatial-HAC composition for per-unit influence functions (Conley 1999 spatial kernel × per-(g,t) IF aggregation); no reference implementation exists today. Phase 1b interstitial PR rejected this at `__init__` with a deferral pointer here.	`diff_diff/staggered.py`	follow-up	Low
Extend `TripleDifference` with `vcov_type="conley"` — would require deriving a spatial-HAC composition for the 3-pairwise-DiD influence-function decomposition (Conley 1999 spatial kernel × `inf = w3·IF_3 + w2·IF_2 - w1·IF_1` aggregation); no reference implementation exists today. Phase 1b interstitial #2 PR rejected this at `__init__` with a deferral pointer here.	`diff_diff/triple_diff.py`	follow-up	Low
Extend `ImputationDiD` with `vcov_type="conley"` — would require deriving a spatial-HAC composition with the Theorem 3 per-unit IF aggregation (Conley 1999 spatial kernel × `sigma_sq = (cluster_psi_sums**2).sum()` reduction); no reference implementation exists today. Phase 1b interstitial #3 PR rejected this at `__init__` with a deferral pointer here.	`diff_diff/imputation.py`	follow-up	Low
Extend `EfficientDiD` with `vcov_type="conley"` — would require deriving a spatial-HAC composition with the per-unit EIF aggregation (Conley 1999 spatial kernel × `_compute_se_from_eif` reduction); no reference implementation exists today. Phase 1b interstitial #4 PR rejected this at `__init__` with a deferral pointer here.	`diff_diff/efficient_did.py`	follow-up	Low
Extend `TwoStageDiD` with `vcov_type="conley"` — thread a spatial-HAC composition into the GMM sandwich meat (`_compute_gmm_variance`); the Conley machinery already exists in the sibling SpilloverDiD `_compute_gmm_corrected_meat` (same module) and could be adapted to TwoStageDiD's per-cluster GMM score `S_g = gamma_hat' c_g - X'_{2g} eps_{2g}`, but two-stage GMM × Conley has no reference implementation. Phase 1b interstitial #5 PR rejected this at `__init__`/`fit()` with a deferral pointer here.	`diff_diff/two_stage.py`	follow-up	Low
Decide whether to formally deprecate `CallawaySantAnna.cluster=X` in favor of `survey_design=SurveyDesign(psu=X)`. Both APIs are first-class today (the bare-cluster path synthesizes a minimal SurveyDesign internally), but having two equivalent paths to express the same intent creates redundant surface. Mirrors a similar question for ImputationDiD / EfficientDiD / TwoStageDiD if those estimators ever face the same review.	`diff_diff/staggered.py`	follow-up	Low
Harmonize SunAbraham's HC1 within-transform finite-sample correction with `fixest::sunab()`. SA's `solve_ols` applies `n / (n - k_dm)` (within-transform columns only); fixest applies `n / (n - k_total)` (counts absorbed FE). SE values differ by ~1-2% on typical panel sizes (documented in REGISTRY.md "Deviation from R"; pinned at `atol=5e-3` in `tests/test_methodology_sun_abraham.py`). Either thread `df_adjustment` into the vcov scaling or document as an intentional difference.	`diff_diff/sun_abraham.py`, `diff_diff/linalg.py::compute_robust_vcov`	follow-up	Low
`LinearRegression.fit()` pays the CR2 cost twice on the weighted `hc2_bm` path: once inside `solve_ols(..., return_vcov=True)` and again via `compute_robust_vcov(..., return_dof=True)` to populate `_bm_dof`. Correct but redundant. Fix: thread `return_dof` through `solve_ols` so the same CR2 computation produces both vcov + DOF, or cache the per-cluster `A_g` / `MUWTWUM` precomputes between calls. CI codex P3 on PR #475.	`linalg.py::LinearRegression.fit`, `linalg.py::solve_ols`	PR #475 follow-up	Low
`TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"})` with replicate-weight survey designs raises `NotImplementedError` (`twfe.py:~233`). The replicate path re-demeans per replicate (re-demeaning depends on the per-replicate weight vector), which doesn't compose with the full-dummy HC2/HC2-BM build — a correct implementation would need per-replicate full-dummy refit. Workaround: use `vcov_type="hc1"` for replicate-weight CR1.	`twfe.py::fit`	follow-up	Low
TWFE's HC2/HC2-BM inline full-dummy build (`twfe.py:280-315`) duplicates the dummy-construction logic in `DifferenceInDifferences(fixed_effects=...)` (`estimators.py:478-486`). Extract a shared helper (or delegate TWFE's HC2/HC2-BM path to DiD's `fixed_effects=` branch, with TWFE-specific cluster default threading) to reduce drift risk on FE naming, survey behavior, and result-surface conventions. Substantive refactor — touches both estimators.	`twfe.py::fit`, `estimators.py::DifferenceInDifferences.fit`	follow-up	Low
Unify Rust local-method `estimate_model` solver path to `solve_wls_svd` (the same SVD helper used by the global-method since PR #348) for sub-1e-14 bootstrap SE parity. Current local-method bootstrap parity test (`tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_local`) passes at `atol=1e-5` — the residual ~1e-7 gap is roundoff between Rust's `estimate_model` matrix factorization and numpy's `lstsq`, which accumulates differently across per-replicate bootstrap fits. Main-fit ATT parity is regime-dependent (`atol=1e-14` for `lambda_nn=inf`, `atol=1e-10` for finite `lambda_nn` — see `test_local_method_main_fit_parity`); the bootstrap gap is a same-solver-path roundoff concern and not a user-visible correctness bug.	`rust/src/trop.rs::estimate_model`, `rust/src/linalg.rs::solve_wls_svd`	follow-up	Low
Rust multiplier-bootstrap weight RNG (`generate_bootstrap_weights_batch` in `rust/src/bootstrap.rs:9-10, 57-75`) uses `Xoshiro256PlusPlus::seed_from_u64(seed + i)` per row for Rademacher/Mammen/Webb generation. If any Python caller (SDID / efficient-DiD multiplier bootstrap) has a numpy-canonical equivalent, the two backends likely diverge under the same seed. Audit Python callers (`diff_diff/sdid.py`, `diff_diff/efficient_did_bootstrap.py`, `diff_diff/bootstrap_utils.py::generate_bootstrap_weights_batch_numpy`) for parity-test gaps. Same fix shape as TROP RNG parity (PR #354): pre-generate weights in Python via numpy and pass them to Rust through PyO3.	`rust/src/bootstrap.rs`, `diff_diff/bootstrap_utils.py`	follow-up	Medium
`bias_corrected_local_linear`: extend golden parity to `kernel="triangular"` and `kernel="uniform"` (currently epa-only; all three kernels share `kernel_W` and the `lprobust` math, so parity is expected but not separately asserted).	`benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py`	Phase 1c	Low
`bias_corrected_local_linear`: expose `vce in {"hc0", "hc1", "hc2", "hc3"}` on the public wrapper once R parity goldens exist (currently raises `NotImplementedError`). The port-level `lprobust` and `lprobust_res` already support all four; expanding the public surface requires a golden generator for each hc mode and a decision on hc2/hc3 q-fit leverage (R reuses p-fit `hii` for q-fit residuals; whether to match that or stage-match deserves a derivation before the wrapper advertises CCT-2014 conformance).	`diff_diff/local_linear.py::bias_corrected_local_linear`, `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py`	Phase 1c	Medium
`bias_corrected_local_linear`: support multi-eval grid (`neval > 1`) with cross-covariance (`covgrid=TRUE` branch of `lprobust.R:253-378`). Not needed for HAD but useful for multi-dose diagnostics.	`diff_diff/_nprobust_port.py::lprobust`	Phase 1c	Low
Clustered-DGP parity: Phase 1c's DGP 4 uses manual `h=b=0.3` to sidestep an nprobust-internal singleton-cluster bug in `lpbwselect.mse.dpi`'s pilot fits. Once nprobust ships a fix (or we derive one independently), add a clustered-auto-bandwidth parity test.	`benchmarks/R/generate_nprobust_lprobust_golden.R`	Phase 1c	Low
`HeterogeneousAdoptionDiD` joint cross-horizon covariance on event study: per-horizon SEs use INDEPENDENT sandwiches in Phase 2b (paper-faithful pointwise CIs per Pierce-Schott Figure 2). A follow-up could derive an IF-based stacking of per-horizon scores for joint cross-horizon inference (needed for joint hypothesis tests across event-time horizons). Block-bootstrap is a reasonable alternative.	`diff_diff/had.py::_fit_event_study`	Phase 2b	Low
`HeterogeneousAdoptionDiD` event-study staggered-timing beyond last cohort: Phase 2b auto-filters staggered panels to the last cohort per paper Appendix B.2. Earlier-cohort treatment effects are not identified by HAD; redirecting to `ChaisemartinDHaultfoeuille` / `did_multiplegt_dyn` is the paper's prescription. A full staggered HAD would require a different identification path (out of paper scope).	`diff_diff/had.py::_validate_had_panel_event_study`	Phase 2b	Low
`HeterogeneousAdoptionDiD` joint cross-horizon analytical covariance on the weighted event-study path: Phase 4.5 B ships multiplier-bootstrap sup-t simultaneous CIs on the weighted event-study path but pointwise analytical variance is still independent across horizons. A follow-up could derive the full H × H analytical covariance from the per-horizon IF matrix (`Psi.T @ Psi` under survey weighting) for an analytical alternative to the bootstrap. Would also let the unweighted event-study path ship a sup-t band.	`diff_diff/had.py::_fit_event_study`	follow-up	Low
`HeterogeneousAdoptionDiD` unweighted event-study sup-t band: Phase 4.5 B ships sup-t only on the WEIGHTED event-study path (to preserve pre-PR bit-exact output on unweighted). Extending sup-t to unweighted event-study (either via the multiplier bootstrap with unit-level iid multipliers or via analytical joint cross-horizon covariance) is a symmetric follow-up.	`diff_diff/had.py::_fit_event_study`	follow-up	Low
`HeterogeneousAdoptionDiD` survey-aware support-endpoint test (research, not engineering): if the academic literature ever publishes a calibrated support-infimum test under complex sampling — combining endpoint-estimation EVT (Hall 1982, Aarssen-de Haan 1994, Hall-Wang 1999) with survey-aware functional CLTs for the empirical process (Boistard-Lopuhaä-Ruiz-Gazen 2017, Bertail-Chautru-Clémençon 2017) and tail-empirical-process theory (Drees 2003) — Phase 4.5 C0's permanent NotImplementedError on `qug_test(..., survey=...)` / `weights=` can be revisited and the bridge implemented against the published recipe. See `docs/methodology/REGISTRY.md` § "QUG Null Test" — Note (Phase 4.5 C0) for the decision rationale and the research-direction sketch.	`diff_diff/had_pretests.py::qug_test`	Phase 4.5 C0 (2026-04, decision shipped)	Low
`HeterogeneousAdoptionDiD` survey-aware pretests Phase 4.5 C still-open follow-ups (pweight + PSU + FPC + strata already shipped via `bootstrap_utils.apply_stratum_centering` + Yatchew closed-form weighted variance): (a) replicate-weight designs (BRR/Fay/JK1/JKn/SDR) — the per-replicate weight-ratio rescaling for the OLS-on-residuals refit step is not covered by the multiplier-bootstrap composition; each linearity-family helper raises `NotImplementedError` on `survey.replicate_weights is not None`. (b) `lonely_psu='adjust'` + singleton-strata on the Stute family — the pseudo-stratum centering transform has not been derived for the Stute CvM functional (same pseudo-stratum centering gap as the HAD sup-t deviation; see REGISTRY § "Note (Stute stratified survey-bootstrap calibration)").	`diff_diff/had_pretests.py`	Phase 4.5 C follow-up	Low
`HeterogeneousAdoptionDiD` Phase 4.5: weight-aware auto-bandwidth MSE-DPI selector. Phase 4.5 A ships weighted `lprobust` with an unweighted DPI selector; users who want a weight-aware bandwidth must pass `h`/`b` explicitly. Extending `lpbwselect_mse_dpi` to propagate weights through density, second-derivative, and variance stages is ~300 LoC of methodology and was out of scope.	`diff_diff/_nprobust_port.py::lpbwselect_mse_dpi`	Phase 4.5	Low
`HeterogeneousAdoptionDiD` Phase 4.5 C: replicate-weight SurveyDesigns (BRR / Fay / JK1 / JKn / SDR) on the continuous-dose paths. Phase 4.5 A raises `NotImplementedError` on replicate designs in `_aggregate_unit_resolved_survey`. Rao-Wu-style replicate bootstrap for HAD paths requires deriving the per-replicate weight-ratio rescaling for the local-linear intercept IF.	`diff_diff/had.py::_aggregate_unit_resolved_survey`	Phase 4.5 C	Low
`HeterogeneousAdoptionDiD` mass-point: `vcov_type in {"hc2", "hc2_bm"}` raises `NotImplementedError` pending a 2SLS-specific leverage derivation. The OLS leverage `x_i' (X'X)^{-1} x_i` is wrong for 2SLS; the correct finite-sample correction uses `x_i' (Z'X)^{-1} (...) (X'Z)^{-1} x_i`. Needs derivation plus an R / Stata (`ivreg2 small robust`) parity anchor.	`diff_diff/had.py::_fit_mass_point_2sls`	Phase 2a	Medium
`HeterogeneousAdoptionDiD` survey-design API consolidation, next minor bump: drop the deprecated `survey=` and `weights=` kwargs on all 8 HAD surfaces (`HeterogeneousAdoptionDiD.fit`, `did_had_pretest_workflow`, `qug_test`, `stute_test`, `yatchew_hr_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`); only `survey_design=` remains. Also fold the legacy back-end `weights=` paths (e.g. `_aggregate_unit_weights` ad-hoc routing) into the unified `_resolve_survey_for_fit`-driven path. The `_make_trivial_resolved` underscore alias on `survey.py` stays (one-line, harmless). DeprecationWarning ships in this PR; the removal PR is ~50 LoC of cleanup.	`diff_diff/had.py`, `diff_diff/had_pretests.py`	next minor bump	Medium
`HeterogeneousAdoptionDiD` continuous paths: thread `cluster=` through `bias_corrected_local_linear` (Phase 1c's wrapper already supports cluster; Phase 2a ignores it with a `UserWarning` on the continuous path to keep scope tight).	`diff_diff/had.py`, `diff_diff/local_linear.py`	Phase 2a	Low
`HeterogeneousAdoptionDiD` `trends_lin × survey_design` follow-up: per-group linear-trend slope under survey weighting (weighted slope estimator? per-PSU slope?) is not derived from the paper. PR #389 raises `NotImplementedError` on the combination across all 3 trends_lin surfaces. If user demand emerges, derive the weighted variant and lift the gate.	`diff_diff/had.py::HeterogeneousAdoptionDiD.fit`, `diff_diff/had_pretests.py::joint_pretrends_test`, `diff_diff/had_pretests.py::joint_homogeneity_test`	follow-up	Low
`HeterogeneousAdoptionDiD` Stute family Stata-bridge parity: PR #389 R-parity covers the full HAD fit + Yatchew surfaces but skips Stute family (`stute_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`) because no R `Stutetest` package exists publicly (chaisemartinPackages publishes only the Stata `stute_test` module; the paper cites a 2024c R Stutetest module that is not on GitHub or CRAN). Stata-bridge parity would add `benchmarks/stata/generate_stute_golden.do` + a Stata installation requirement. Low priority unless user demand emerges.	`benchmarks/stata/`, `tests/test_stute_test_parity.py`	follow-up	Low
`HeterogeneousAdoptionDiD` Phase 3 Stute performance: Appendix D vectorized matrix form replaces the per-iteration OLS refit with a single precomputed `M = I - X(X'X)^{-1}X'` applied to `eps * eta`. Functionally identical, ~2x faster. Shipped literal-refit form in Phase 3 to match paper text and keep reviewer surface small.	`diff_diff/had_pretests.py::stute_test`	Phase 3	Low
`HeterogeneousAdoptionDiD` Phase 3 R-parity: Phase 3 ships coverage-rate validation on synthetic DGPs (not tight point parity against `chaisemartin::stute_test` / `yatchew_test`). Tight numerical parity requires aligning bootstrap seed semantics and `B` across numpy/R and is deferred.	`tests/test_had_pretests.py`	Phase 3	Low
`HeterogeneousAdoptionDiD` Phase 3 nprobust bandwidth for Stute: some Stute variants on continuous regressors use nprobust-style optimal bandwidth selection. Phase 3 uses OLS residuals from a 2-parameter linear fit (no bandwidth selection). nprobust integration is a future enhancement; not in paper scope.	`diff_diff/had_pretests.py::stute_test`	Phase 3	Low
`HeterogeneousAdoptionDiD` Phase 4: Pierce-Schott (2016) replication harness; reproduce paper Figure 2 values and Table 1 coverage rates. Waived in tracker-promotion PR (2026-05-20): R parity at `atol=1e-8` on the same 3 DGPs (`tests/test_did_had_parity.py`) is a strictly stronger correctness anchor than reproducing Figure 2's pointwise CIs on the LBD-restricted PNTR panel; paper Section 5.2 self-acknowledges NP estimators too noisy to be informative there. Table 1 coverage-rate MC would re-verify the CCF asymptotic coverage already pinned by R parity (Python ≡ R ≡ paper). See REGISTRY HAD Deviations Notes #3 / #4 for full scope-caveat statements. Re-open if user demand emerges for an empirical-application replication harness.	`benchmarks/`, `tests/`	Phase 2a	Low
`HeterogeneousAdoptionDiD` time-varying dose on event study: Phase 2b REJECTS panels where `D_{g,t}` varies within a unit for `t >= F` (the aggregation uses `D_{g, F}` as the single regressor for all horizons, paper Appendix B.2 constant-dose convention). A follow-up PR could add a time-varying-dose estimator for these panels; current behavior is front-door rejection with a redirect to `ChaisemartinDHaultfoeuille`.	`diff_diff/had.py::_validate_had_panel_event_study`	Phase 2b	Low
`HeterogeneousAdoptionDiD` repeated-cross-section support: paper Section 2 defines HAD on panel OR repeated cross-section, but Phase 2a is panel-only. RCS inputs (disjoint unit IDs between periods) are rejected by the balanced-panel validator with the generic "unit(s) do not appear in both periods" error. A follow-up PR will add an RCS identification path based on pre/post cell means (rather than unit-level first differences), with its own validator and a distinct `data_mode` / API surface.	`diff_diff/had.py::_validate_had_panel`, `diff_diff/had.py::_aggregate_first_difference`	Phase 2a	Medium
SyntheticDiD: bootstrap cross-language parity anchor against R's default `synthdid::vcov(method="bootstrap")` (refit; rebinds `opts` per draw) or Julia `Synthdid.jl::src/vcov.jl::bootstrap_se` (refit by construction). Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; a cross-language anchor is desirable to bolster the methodology contract. Julia is the cleanest target — minimal wrapping work and refit-native vcov. Tolerance target: 1e-6 on Monte Carlo samples (different BLAS + RNG paths preclude 1e-10). The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path.	`benchmarks/R/`, `benchmarks/julia/`, `tests/`	follow-up	Low
Conley + survey weights / `survey_design`. Score-reweighted meat `s_i = w_i · X_i · ε_i` is mechanical, but PSU clustering interaction with the spatial kernel and replicate-weights variance under spatial correlation are non-trivial (Bertanha-Imbens 2014 covers cluster-sample but not the explicit Conley case). Phase 5 of the spillover-conley initiative; paper review prerequisite. Currently raises `NotImplementedError` at the linalg validator.	`linalg.py::_validate_vcov_args`	Phase 5 (spillover-conley)	Medium
`SyntheticDiD(vcov_type="conley")` support. Currently raises `TypeError` at `__init__` because SyntheticDiD uses `variance_method ∈ {bootstrap, jackknife, placebo}` rather than the analytical sandwich that Conley plugs into. Wiring would require either reimplementing an analytical sandwich path for SyntheticDiD or designing a spatial-block bootstrap (new methodology, Politis-Romano 1994 territory).	`synthetic_did.py::SyntheticDiD`	follow-up (spillover-conley)	Low
`SpilloverDiD(survey_design=...)` replicate-weight variance (BRR / Fay / JK1 / JKn / SDR). Wave E.1 ships Taylor-linearization only. Per Gerber (2026) Appendix A, the IF-reweighting shortcut does NOT apply to TwoStageDiD-class estimators because `gamma_hat` is weight-sensitive; correct support requires per-replicate full re-fit of stage 1 and stage 2 (200+ LoC of test surface beyond E.1).	`spillover.py::SpilloverDiD.fit`, `survey.py::compute_replicate_refit_variance`	follow-up	Low
`SpilloverDiD(vcov_type="conley", conley_lag_cutoff > 0, survey_design=...)` no-effective-PSU serial Bartlett HAC. Wave E.2 follow-up ships the panel-block composition when an effective PSU exists (explicit `survey_design.psu` OR injected via `cluster=<col>` per `_inject_cluster_as_psu`). Weights-only / strata-only survey designs WITHOUT a cluster fallback raise `NotImplementedError` at `SpilloverDiD.fit` post-resolution because under the pseudo-PSU = obs-index fallback each pseudo-PSU appears in exactly one period — the per-PSU serial cross-period loop would silently contribute zero. Fix would either derive a unit-level serial fallback for no-PSU designs (mixes IF allocators with the pseudo-PSU spatial term — needs methodology work) or route the serial loop through `conley_unit` with explicit documentation of the IF-allocator asymmetry. Regression goldens vs the effective-PSU shipped path.	`spillover.py::SpilloverDiD.fit`, `two_stage.py::_compute_stratified_serial_bartlett_meat`	follow-up (Wave E.2 follow-up tail)	Low
`SpilloverDiD(ring_method="count")` extension. Currently only the nearest-treated-ring specification is exposed. Count-of-treated-in-ring (paper Section 3.2 end) is methodologically supported by Butts but re-introduces functional-form dependence; expose with an explicit kwarg gate and documentation warning.	`spillover.py::SpilloverDiD.fit`	follow-up	Low
`SpilloverDiD` data-driven `d_bar` selection (Butts 2021b / Butts 2023 JUE Insight cross-validation).	`spillover.py::SpilloverDiD`	follow-up	Low
`SpilloverDiD` sparse cKDTree path for the staggered nearest-treated-distance helper (mirrors the static helper's sparse branch). Currently `_compute_nearest_treated_distance_staggered` always builds dense `(n_units, n_treated_by_onset)` pairwise distance matrices per cohort; on large staggered panels with many cohorts this is avoidable memory/runtime. Add a sparse k-d-tree branch analogous to `_compute_nearest_treated_distance_sparse`, gated on `n > _CONLEY_SPARSE_N_THRESHOLD`.	`spillover.py::_compute_nearest_treated_distance_staggered`	follow-up (Wave B)	Low
`SpilloverDiDResults` in `DiagnosticReport` dispatch tables. Wave C event-study emits a TwoStageDiD-compatible `event_study_effects: Dict[int, Dict]` alias that `plot_event_study` consumes via the new `reference_period` attribute fallback in `_extract_plot_data`, but `SpilloverDiDResults` is NOT registered in `DiagnosticReport`'s `_APPLICABILITY` / `_PT_METHOD` tables — so `DiagnosticReport(spillover_result)` doesn't currently route to event-study diagnostics. Registering requires (a) deciding which diagnostics apply (parallel trends, pre-trends power, heterogeneity, design-effect) AND (b) adding an end-to-end test.	`diff_diff/diagnostic_report.py::_APPLICABILITY`, `_PT_METHOD`	follow-up (Wave C)	Low
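
The Stute performance row above claims that the per-iteration OLS refit and the precomputed residual-maker `M = I - X(X'X)^{-1}X'` applied to `eps * eta` are functionally identical. A minimal numerical check of that claim follows. It assumes a wild bootstrap in which each draw regresses `fitted + eps * eta` on `X`. The function names and the Rademacher multipliers are illustrative assumptions, not the `stute_test` internals.

```python
import numpy as np


def bootstrap_residuals_refit(X, fitted, eps, eta):
    """Literal form: refit OLS once per bootstrap draw (one column of eta each)."""
    out = np.empty_like(eta, dtype=float)
    for b in range(eta.shape[1]):
        y_star = fitted + eps * eta[:, b]
        beta_star, *_ = np.linalg.lstsq(X, y_star, rcond=None)
        out[:, b] = y_star - X @ beta_star
    return out


def bootstrap_residuals_matrix(X, eps, eta):
    """Matrix form: apply M = I - X (X'X)^{-1} X' to all draws at once."""
    n = X.shape[0]
    M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
    # M annihilates the fitted values (M @ X = 0), so only eps * eta survives.
    return M @ (eps[:, None] * eta)


rng = np.random.default_rng(0)
n, B = 200, 50
X = np.column_stack([np.ones(n), rng.uniform(size=n)])  # 2-parameter linear fit
y = 1.0 + 2.0 * X[:, 1] + rng.normal(size=n)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
eps = y - fitted
eta = rng.choice([-1.0, 1.0], size=(n, B))  # assumed multiplier distribution

refit = bootstrap_residuals_refit(X, fitted, eps, eta)
matrix = bootstrap_residuals_matrix(X, eps, eta)
max_abs_diff = float(np.max(np.abs(refit - matrix)))
```

The two forms should agree to floating-point tolerance rather than bit for bit, since they accumulate rounding differently.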

Performance

Issue	Location	PR	Priority
ImputationDiD event-study SEs recompute full conservative variance per horizon (should cache A0/A1 factorization)	`imputation.py`	#141	Low
Rust faer SVD ndarray-to-faer conversion overhead (minimal vs SVD cost)	`rust/src/linalg.rs:67`	#115	Low
`bread_inv` reuse across HC2/HC2-BM: won't fix, not bit-identically achievable (verified 2026-06-01). The proposed "factor `(X'WX)` once, reuse across HC2/HC2-BM" refactor cannot be done bit-identically, which is the bar for a pure perf refactor of the inference path (no SE may move at all). The internal bread operations solve against different right-hand sides (`X.T` for hat-diagonals, `eye` for classical/CR2, `meat`+`temp.T` for the sandwich, `contrasts` for BM-DOF); only same-RHS results are bit-reusable. Measured (numpy 2.4.5): `scipy.linalg.lu_solve(lu_factor(A), B)` differs from `np.linalg.solve(A, B)` by up to 6.4e-15 (only 32/900 bit-equal, since `dgesv` fuses factor+solve and rounds differently from a separate `dgetrf`+`dgetrs`); the `inv(A) @ meat @ inv(A)` sandwich differs from the current double-`solve` by up to 1.24e-14. Both differences are nonzero, so neither variant is bit-identical. Both also sit below the affected goldens' actual tolerances (the HC2/HC2-BM/CR2 asserts are atol 1e-12/1e-10, e.g. `test_methodology_wls_cr2.py::TestUnweightedRegressionStillBitEqual` at 1e-12; the atol=1e-14 checks in `test_linalg_hc2_bm.py` are HC1 default-vs-explicit dispatch-equality sentinels, not this path), so a broad reuse would silently shift SEs at ~1e-14 without tripping the suite, which is exactly what the bit-identity bar exists to prevent. `np.linalg.solve(A, eye) == np.linalg.inv(A)` IS bit-identical (and raises the same `LinAlgError`) but swapping saves nothing. The only genuine bit-identical redundancy is one duplicated `solve(bread, X.T)` (hat-diagonals + the DOF `H`-build) in the unweighted one-way `hc2_bm`+`return_dof` path: an O(k²n) solve dwarfed by that path's dense `(n×n)` `M=I−H` construction and its O(n²k)/O(n²m) per-contrast quadratic forms (per the `_compute_bm_dof_from_contrasts` unweighted-branch cost model), so the achievable saving is negligible.	`linalg.py::compute_robust_vcov`	Phase 1a	Low
MPD cluster+hc2_bm path builds the CR2 precomputes twice: once via `solve_ols` → `_compute_cr2_bm` for vcov + per-coefficient DOF, then again via `_compute_cr2_bm_contrast_dof` from `MultiPeriodDiD.fit()` for the post-period-average contrast DOF. Both calls rebuild `H = X bread_inv X'`, the residual-maker `M`, and the per-cluster `A_g = (I - H_gg)^{-1/2}` matrices. The redundant work is O(n²k), which is acceptable for typical cluster-robust DiD panel sizes (n ≤ a few thousand). A fix would either plumb the contrast DOF through the existing CR2 vcov path (intrusive API change) or share the precomputes via a cached helper.	`linalg.py::_compute_cr2_bm_contrast_dof`, `estimators.py::MultiPeriodDiD.fit`	follow-up	Low
Rust-backend HC2 implementation. Current Rust path only supports HC1; HC2 and CR2 Bell-McCaffrey fall through to the NumPy backend. For large-n fits this is noticeable.	`rust/src/linalg.rs`	Phase 1a	Low
CR2 Bell-McCaffrey DOF uses a naive `O(n² k)` per-coefficient loop over cluster pairs. Pustejovsky-Tipton (2018) Appendix B has a scores-based formulation that avoids the full `n × n` `M` matrix. Switch when a user hits a large-`n` cluster-robust design.	`linalg.py::_compute_cr2_bm`	Phase 1a	Low
`SyntheticControl` retains a full `_SyntheticControlFitSnapshot` (pivoted outcome/predictor panels) on EVERY fit to support the opt-in `in_space_placebo()`, so callers who never run the placebo still pay O(units × periods × predictor-vars) memory (same as `SyntheticDiD`'s always-on snapshot for `in_time_placebo`). Store a compact array/index representation instead of per-variable DataFrames, or build the snapshot lazily on first placebo call (would need to retain the source data, ~same cost).	`synthetic_control.py` snapshot build, `synthetic_control_results.py::_SyntheticControlFitSnapshot`	follow-up	Low
EfficientDiD DR (covariate) path rebuilds the full polynomial sieve basis `_polynomial_sieve_basis(X, K)` for every candidate `K` inside each of the three nuisance fits (outcome regression, propensity ratio, inverse propensity), per `fit()`. After the growing-sieve cap removal (PR-B), large covariate-adjusted fits at large `n` pay more avoidable basis-construction cost. Cache the basis per `(X, K)` within a `fit()` and share it across the nuisance helpers.	`diff_diff/efficient_did_covariates.py` (the three sieve helpers)	PR-B follow-up	Low
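
The `bread_inv` row above turns on the difference between agreeing to tolerance and agreeing bit for bit. A minimal sketch comparing the two sandwich forms it mentions, `inv(A) @ meat @ inv(A)` and a double `solve`, is shown here. The data, the HC0-style meat, and the exact double-solve arrangement are assumptions for illustration, not the code in `compute_robust_vcov`.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 4
X = rng.normal(size=(n, k))
resid = rng.normal(size=n)

bread = X.T @ X                               # k x k, symmetric positive definite
meat = (X * resid[:, None] ** 2).T @ X        # HC0-style meat, symmetric

# Form 1: explicit inverse on both sides.
bread_inv = np.linalg.inv(bread)
vcov_inv = bread_inv @ meat @ bread_inv

# Form 2: two solves, never forming the inverse.
temp = np.linalg.solve(bread, meat)           # bread^{-1} meat
vcov_solve = np.linalg.solve(bread, temp.T)   # bread^{-1} meat bread^{-1}

max_abs_diff = float(np.max(np.abs(vcov_inv - vcov_solve)))
bit_identical = bool(np.array_equal(vcov_inv, vcov_solve))
print(f"max |diff| = {max_abs_diff:.3e}, bit-identical = {bit_identical}")
```

The printed values depend on the BLAS/LAPACK build, so none are quoted here. The point is that `np.allclose` passing says nothing about `np.array_equal`, which is the stricter bar the row applies.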

Testing/Docs

Issue	Location	PR	Priority
Drift test for tutorial 24 qualitative power claims (monotonic dilution fast→slow; CS-vs-2×2 MDE crossover/near-parity at slow rollout) — pins the prose against estimator-default/simulation drift	`docs/tutorials/24_staggered_vs_collapsed_power.ipynb`	staggered-analysis-2x2	Low
Port the CI `<notebook-prose>` extraction into the reviewer-eval harness so `docs/tutorials/*.ipynb` cases (currently guarded out of `verify-corpus`/`run`) can be reviewed with CI-equivalent context	`tools/reviewer-eval/adapters/ci_prompt.py`	local-review	Low
R comparison tests spawn separate `Rscript` per test (slow CI)	`tests/test_methodology_twfe.py:294`	#139	Low
CS R helpers hard-code `xformla = ~ 1`; no covariate-adjusted R benchmark for IRLS path	`tests/test_methodology_callaway.py`	#202	Low
Validating the `.txt` AI guides (`diff_diff/guides/llms-full.txt`, `llms-practitioner.txt`) as executable snippets is not low-lift (re-scoped 2026-06-01): of their ~112 fenced Python blocks only ~20% are standalone-runnable — the rest are API-signature references (`Foo(param: type = default)` pseudo-signatures that are `SyntaxError` by design), context fragments (e.g. `results.att` on an undefined `results`), or dataset-shape-specific blocks. The guides are reference documentation, not runnable examples; a real implementation needs signature-block detection + a context/data skip-allowlist + per-snippet fixtures (multi-round curation), unlike the curated `.rst` files the existing smoke test covers.	`tests/test_doc_snippets.py`	#239	Low
SyntheticDiD: rename internal `placebo_effects` variable to `variance_effects` (or `resampled_effects`). Misleading name across the placebo/bootstrap/jackknife dispatch paths — holds three different contents depending on variance method. Low-risk refactor; user-facing field rename should preserve `placebo_effects` as a deprecated alias for one release.	`synthetic_did.py`, `results.py`	follow-up	Medium
AI review CI: pin workflow contract via test (uses `openai/codex-action@v1`, passes `prompt-file`, reads `steps.run_codex.outputs.final-message`, preserves diff-exclude paths and comment markers). Currently only the wrapper-tag and closing-tag-escape strings are asserted.	`tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml`	#416	Low
`TestWorkflowDoesNotExecutePRHeadCode` (CodeQL #14 dismissal guard) does not model: `bash <script>` / `sh <script>` / `./<script>` / `source <script>` / `. <script>` direct shell-script execution; multi-line `python3 -c` bodies (line-by-line shlex can't reassemble across newlines — the workflow's 5 sanitizer bodies are exempt by invisibility); shell-variable-expansion indirection (`SCRIPT="$X"; python3 "$SCRIPT"`); `eval`; `find -exec`; `xargs -I {}`. Each represents a path by which PR-head bytes COULD execute without the test failing. The guard catches accidental regressions of common forms (16 tests covering pip/npm/cargo/maturin/etc. installs, python file exec, bash -c indirection with compound flags, env-var prefixes, line continuations, subshells/brace groups, single-line python -c, write-overwrites of allowlisted /tmp paths). Closing the residuals would require multi-line shell parsing with command-substitution awareness + script-execution allowlists — significant work for diminishing return given the dismissal's primary defense is the documented threat model on the alert and in `.github/workflows/ai_pr_review.yml` comment block.	`tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml`	#436	Low
Render `docs/methodology/REPORTING.md` and `docs/methodology/REGISTRY.md` as in-site Sphinx pages so cross-references can use `:doc:` instead of off-site GitHub `blob/main` URLs. Current state (#410 fix-audit-r2) restores navigable links via `blob/main`, but stable-docs readers can land on a different revision than the package version they are reading. Two viable paths: (a) add `myst-parser` to `docs/conf.py` extensions + docs extras and link with `:doc:`, or (b) convert both files to `.rst`.	`docs/conf.py`, `docs/api/business_report.rst`, `docs/api/diagnostic_report.rst`, `docs/tutorials/18_geo_experiments.ipynb`, `docs/tutorials/19_dcdh_marketing_pulse.ipynb`	follow-up	Low

Prioritized Tech-Debt Backlog

Ordered paydown view across the tables above. Tier A → D is by effort × risk, not severity — every item here already carries its own Low / Medium priority in the source-of-truth tables. The intent is to give a flat ordering to draw from wave-by-wave without re-litigating priority each time. Anchors point to the location reference of the originating row.

Tier A — Quick wins (≤1 day, ≤3 CI rounds expected)

(No active items. The sole prior entry — the WooldridgeDiD method/outcome efficiency hint — has shipped; see CHANGELOG ## [Unreleased] and REGISTRY §WooldridgeDiD "Nonlinear extensions".)

(SyntheticDiD placebo_effects → variance_effects rename moved to Tier B — the user-facing field rename + one-release deprecation alias is too large for ≤1 day / ≤3 CI rounds.)

Tier B — Mid-size methodology (5-10 CI rounds expected, per memory cascade priors)

SyntheticDiD: rename internal placebo_effects → variance_effects AND public placebo_effects field with deprecation alias retained for one release (synthetic_did.py, results.py)
StaggeredTripleDifference R parity: commit CSV fixtures + add covariate-adjusted scenarios + aggregation-SE assertions (tests/test_methodology_staggered_triple_diff.py, benchmarks/R/benchmark_staggered_triplediff.R)
StaggeredTripleDifference: per-cohort group-effect SE WIF override for exact R triplediff match (staggered_triple_diff.py)
WooldridgeDiD: QMLE Stata-parity qmle weight type + Stata golden values (wooldridge.py, linalg.py, tests/test_wooldridge.py)
WooldridgeDiD: optional weights="cohort_share" on aggregate() (wooldridge_results.py)
HAD survey-design API consolidation: drop deprecated survey=/weights= kwargs (had.py, had_pretests.py; gated on next minor bump)
Survey-design resolution / collapse helper extraction across continuous_did.py, efficient_did.py, stacked_did.py
dCDH survey + backward-horizon predict_het allocator derivation: lift the warn-and-skip fallback at _compute_heterogeneity_test once the pre-period Binder TSL cell-period allocator is derived (currently the gate emits a UserWarning and falls back to forward-horizon-only heterogeneity under survey_design + placebo + heterogeneity) (chaisemartin_dhaultfoeuille.py, docs/methodology/REGISTRY.md)
Rust local-method solver path unification to solve_wls_svd + bootstrap-weight RNG parity audit (rust/src/trop.rs, rust/src/bootstrap.rs)
AI review CI workflow-contract pin test expansion (tests/test_openai_review.py)
In-site Sphinx render of REPORTING.md and REGISTRY.md (docs/conf.py + :doc: link migration)

Tier C — Heavy / derivation required

HonestDiD Δ^RM ARP conditional/hybrid confidence sets (honest_did.py)
Multi-absorb weighted demeaning: alternating-projection iteration for N>1 absorb + weights (estimators.py)
ImputationDiD dense (A0'A0).toarray() OOM: alternative dense fallback or richer sparse strategy (imputation.py:1531)
HAD mass-point vcov_type ∈ {hc2, hc2_bm}: 2SLS-specific leverage derivation (had.py::_fit_mass_point_2sls)
HAD repeated-cross-section identification path (had.py::_validate_had_panel)
HAD time-varying-dose event study estimator (had.py::_validate_had_panel_event_study)
Conley + survey_design (linalg.py::_validate_vcov_args, conley.py)
SyntheticDiD vcov_type="conley" (synthetic_did.py::SyntheticDiD — new analytical sandwich path OR spatial-block bootstrap)

Tier D — Deferred / research (no active action planned)

HAD survey-aware support-endpoint test (had_pretests.py::qug_test; waits on literature — endpoint EVT × survey-aware functional CLT)
HAD joint cross-horizon analytical covariance / unweighted event-study sup-t band (low user demand)
HAD Phase 4.5 replicate-weight pretests (BRR/Fay/JK1/JKn/SDR composition derivation)
HAD Stute family Stata-bridge parity (no R Stutetest package exists publicly)
HAD trends_lin × survey_design weighted-slope derivation
Phase 1c lprobust follow-ups (vce modes, weight-aware auto-bandwidth DPI, multi-eval grid, clustered-DGP auto-bandwidth) — deferred to Phase 2+ of bias_corrected_local_linear
TestWorkflowDoesNotExecutePRHeadCode (CodeQL #14) residual bypass paths — diminishing return given documented threat model
All remaining Low-priority Performance and Testing/Docs rows (R-script-per-test, CS R covariate-adjusted IRLS benchmark, doc-deps integrity CI, Rust faer SVD overhead, etc.)

Standard Error Consistency

vcov_type has subsumed the previously proposed se_type knob. DifferenceInDifferences and TwoWayFixedEffects accept vcov_type ∈ {"classical", "hc1", "hc2", "hc2_bm", "conley"} (the validated set in linalg.py::_VALID_VCOV_TYPES). Cluster-robust variance is obtained by passing cluster= alongside the heteroscedasticity kind: hc1 + cluster ⇒ CR1 Liang-Zeger; hc2_bm + cluster ⇒ CR2 Bell-McCaffrey, including the weighted path landed via the clubSandwich WLS-CR2 port. The N>1 absorbed-FE + weights composition remains gated by the open multi-absorb row in the table above. Wild cluster bootstrap is a separate inference="wild_bootstrap" path on the same estimator.

Threading vcov_type through the 8 standalone estimators (CallawaySantAnna, SunAbraham, ImputationDiD, TwoStageDiD, TripleDifference, StackedDiD, WooldridgeDiD, EfficientDiD) is complete as of Phase 1b. Four of them (CallawaySantAnna, TripleDifference, ImputationDiD, EfficientDiD) are permanently restricted to {"hc1"} per their influence-function variance, and TwoStageDiD is likewise restricted because its Gardner GMM-corrected meat has no single cross-stage hat matrix for classical/hc2/hc2_bm.

The per-estimator vcov_type="conley" extensions are tracked as follow-up rows in the table above: SunAbraham + WooldridgeDiD-OLS are shipped (within-transform conley via solve_ols); StackedDiD is deferred for a methodology reason (unit replication × spatial distance); the IF-based / GMM estimators have no reference implementation.
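
To make the hc1 + cluster ⇒ CR1 pairing concrete, here is a minimal NumPy sketch of the two sandwiches. It is not the linalg.py implementation. The function names are hypothetical, and the G/(G-1) · (n-1)/(n-k) finite-sample factor is an assumed convention that may differ from the one the library applies.

```python
import numpy as np


def hc1_vcov(X, resid):
    """HC1: n/(n-k) * (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}."""
    n, k = X.shape
    bread_inv = np.linalg.inv(X.T @ X)
    meat = (X * resid[:, None] ** 2).T @ X
    return n / (n - k) * bread_inv @ meat @ bread_inv


def cr1_vcov(X, resid, cluster):
    """CR1 (Liang-Zeger) with an assumed G/(G-1) * (n-1)/(n-k) correction."""
    n, k = X.shape
    labels, codes = np.unique(np.asarray(cluster), return_inverse=True)
    G = labels.size
    scores = X * resid[:, None]                    # n x k per-observation scores
    cluster_scores = np.zeros((G, k))
    np.add.at(cluster_scores, codes.ravel(), scores)  # sum scores within cluster
    meat = cluster_scores.T @ cluster_scores
    bread_inv = np.linalg.inv(X.T @ X)
    adj = G / (G - 1) * (n - 1) / (n - k)
    return adj * bread_inv @ meat @ bread_inv


rng = np.random.default_rng(4)
n, k = 400, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(size=n)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

cluster = np.repeat(np.arange(40), 10)             # 40 clusters of 10 units
V_hc1 = hc1_vcov(X, resid)
V_cr1 = cr1_vcov(X, resid, cluster)
V_singletons = cr1_vcov(X, resid, np.arange(n))    # one cluster per observation
```

Under this convention, CR1 with one cluster per observation reduces to HC1, because the adjustment collapses to n/(n-k) and the cluster meat becomes the per-observation meat. That gives a cheap sanity check for any implementation of the pairing.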

Type Annotations

Mypy reports 0 errors. All mixin attr-defined errors were resolved via TYPE_CHECKING-guarded method stubs in the bootstrap mixin classes.

Deprecated Code

Deprecated parameters still present for backward compatibility:

lambda_reg and zeta in SyntheticDiD (synthetic_did.py)
- Deprecated in favor of zeta_omega/zeta_lambda parameters
- Remove in v4.0.0 (SemVer-safe: public kwarg removal requires a major bump)
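
A minimal sketch of the kind of warning shim that keeps deprecated kwargs working until removal. The helper name is hypothetical and this is not the code in synthetic_did.py. No one-to-one mapping from lambda_reg / zeta onto zeta_omega / zeta_lambda is assumed, because this section does not state one.

```python
import warnings

# Deprecated SyntheticDiD kwargs named in this section.
_DEPRECATED_KWARGS = ("lambda_reg", "zeta")


def warn_on_deprecated_kwargs(kwargs, removal_version="4.0.0"):
    """Emit one DeprecationWarning per deprecated kwarg in use; return their names."""
    used = [name for name in _DEPRECATED_KWARGS if kwargs.get(name) is not None]
    for name in used:
        warnings.warn(
            f"`{name}` is deprecated and will be removed in v{removal_version}; "
            "use `zeta_omega` / `zeta_lambda` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
    return used


# Record warnings so the behaviour can be asserted in a test.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    used = warn_on_deprecated_kwargs({"lambda_reg": 0.1, "zeta_omega": None})
```

Returning the list of names lets a removal PR assert that no internal call site still passes the deprecated kwargs.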

Test Coverage

Visualization tests skip when matplotlib / plotly are not installed (see pytest.importorskip markers in tests/test_visualization*.py).

Honest DiD Improvements

Enhancements for honest_did.py:

Improved C-LF implementation with direct optimization instead of grid search (current implementation uses simplified FLCI approach with estimation uncertainty adjustment; see honest_did.py:947)
Support for CallawaySantAnnaResults (implemented in honest_did.py:612-653; requires aggregate='event_study' when calling CallawaySantAnna.fit())
Event-study-specific bounds for each post-period
Hybrid inference methods
Simulation-based power analysis for honest bounds

CallawaySantAnna Bootstrap Improvements

Consider aligning p-value computation with R did package (symmetric percentile method)

RuntimeWarnings in Linear Algebra Operations

Apple Silicon M4 BLAS Bug (numpy < 2.3)

Spurious RuntimeWarnings ("divide by zero", "overflow", "invalid value") are emitted by np.matmul/@ on Apple Silicon M4 + macOS Sequoia with numpy < 2.3. The warnings appear for matrices with ≥260 rows but do not affect result correctness — coefficients and fitted values are valid (no NaN/Inf), and the design matrices are full rank.

Root cause: Apple's BLAS SME (Scalable Matrix Extension) kernels corrupt the floating-point status register, causing spurious FPE signals. Tracked in numpy#28687 and numpy#29820. Fixed in numpy ≥ 2.3 via PR #29223.

Not reproducible on M3, Intel, or Linux.

linalg.py:162 - Warnings in fitted value computation (X @ coefficients)
- Caused by M4 BLAS bug, not extreme coefficient values
- Seen in test_prep.py during treatment effect recovery tests (n > 260)
triple_diff.py:307,323 - Warnings in propensity score computation
- Occurs in IPW and DR estimation methods with covariates
- Related to logistic regression overflow in edge cases (separate from BLAS bug)
Long-term: Revert to @ operator when numpy ≥ 2.3 becomes the minimum supported version.
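
One way to contain spurious floating-point warnings without masking real problems is to silence them around the product and then validate the output explicitly. A minimal sketch follows. `quiet_matmul` is a hypothetical helper, and this is not necessarily the workaround currently used in linalg.py.

```python
import numpy as np


def quiet_matmul(a, b):
    """Compute a @ b with FP warnings silenced, then check the result explicitly."""
    with np.errstate(divide="ignore", over="ignore", invalid="ignore"):
        out = np.matmul(a, b)
    # The explicit check stands in for the silenced warnings: a genuine
    # overflow or NaN still surfaces as an error instead of passing unnoticed.
    if not np.all(np.isfinite(out)):
        raise FloatingPointError("matmul produced non-finite values")
    return out


rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))      # >= 260 rows, the range where warnings appear
coefficients = rng.normal(size=5)
fitted = quiet_matmul(X, coefficients)
```

The finiteness check costs one pass over the output, which is small next to the product itself.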

Feature Gaps (from R `did` package comparison)

Features in R's did package that block porting additional tests:

Feature	R tests blocked	Priority	Status
Calendar time aggregation	1 test in test-att_gt.R	Low

Performance Optimizations

Potential future optimizations:

JIT compilation for bootstrap loops (numba)
Sparse matrix handling for large fixed effects

QR+SVD Redundancy in Rank Detection

Background: The current solve_ols() implementation performs both QR (for rank detection) and SVD (for solving) decompositions on rank-deficient matrices. This is technically redundant since SVD can determine rank directly.

Current approach (R-style, chosen for robustness):

QR with pivoting for rank detection (_detect_rank_deficiency())
scipy's lstsq with 'gelsd' driver (SVD-based) for solving

Why we use QR for rank detection:

QR with pivoting provides the canonical ordering of linearly dependent columns
R's lm() uses this approach for consistent dropped-column reporting
Ensures consistent column dropping across runs (SVD column selection can vary)

Potential optimization (future work):

Skip QR when rank_deficient_action="silent" since we don't need column names
Use SVD rank directly in the Rust backend (already implemented)
Add skip_rank_check parameter for hot paths where matrix is known to be full-rank (implemented in v2.2.0)

Priority: Low - the QR overhead is minimal compared to SVD solve, and correctness is more important than micro-optimization.
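
The two-step approach described above can be sketched with SciPy directly. This is an illustration under assumptions, not the body of solve_ols() or _detect_rank_deficiency(). The helper name, the relative tolerance on the diagonal of R, and the `cond` cutoff are all chosen for the example.

```python
import numpy as np
from scipy import linalg


def detect_rank_qr(X, rtol=1e-7):
    """Pivoted-QR rank detection; returns (rank, kept columns, dropped columns)."""
    _, R, pivot = linalg.qr(X, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))
    # With column pivoting |R[i, i]| is non-increasing, so count the entries
    # above a tolerance relative to the largest one.
    rank = int(np.sum(diag > rtol * diag[0]))
    return rank, np.sort(pivot[:rank]), np.sort(pivot[rank:])


rng = np.random.default_rng(3)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, x1 + x2])   # last column is redundant
y = 1.0 + 2.0 * x1 - x2 + rng.normal(size=n)

rank, kept, dropped = detect_rank_qr(X)

# SVD-based solve on the full matrix; `cond` sets the singular-value cutoff.
coef_svd, _, svd_rank, _ = linalg.lstsq(X, y, cond=1e-10, lapack_driver="gelsd")

# Solve on the QR-selected columns; both fits span the same column space.
coef_kept, *_ = linalg.lstsq(X[:, kept], y, lapack_driver="gelsd")
```

Both decompositions report the same rank here, which is the redundancy this section describes. What the pivoted QR adds is the `dropped` index set, which the SVD does not provide in terms of original columns.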

Incomplete `check_finite` Bypass

Background: The solve_ols() function accepts a check_finite=False parameter intended to skip NaN/Inf validation for performance in hot paths where data is known to be clean.

Current limitation: When check_finite=False, our explicit validation is skipped, but scipy's internal QR decomposition in _detect_rank_deficiency() still validates finite values. This means callers cannot fully bypass all finite checks.

Impact: Minimal - the scipy check is fast and only affects edge cases where users explicitly pass check_finite=False with non-finite data (which would be a bug in their code anyway).

Potential fix (future work):

Pass check_finite=False through to scipy's QR call (requires scipy >= 1.9.0)
Or skip _detect_rank_deficiency() entirely when check_finite=False and _skip_rank_check=True

Priority: Low - this is an edge case optimization that doesn't affect correctness.
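
The first potential fix amounts to forwarding the flag to SciPy's QR. A minimal sketch of the two behaviours on `scipy.linalg.qr` follows. The arrays are illustrative, and how the flag would be threaded through _detect_rank_deficiency() is not shown.

```python
import numpy as np
from scipy import linalg

X_clean = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_bad = X_clean.copy()
X_bad[0, 0] = np.nan

# Default behaviour: scipy validates the input and rejects non-finite values,
# regardless of any validation the caller already skipped.
try:
    linalg.qr(X_bad, mode="economic", pivoting=True)
    scipy_rejected = False
except ValueError:
    scipy_rejected = True

# Forwarding check_finite=False skips that validation. Only do this on data
# known to be finite; scipy documents possible crashes or non-termination
# when non-finite values reach LAPACK.
_, R, pivot = linalg.qr(X_clean, mode="economic", pivoting=True, check_finite=False)
```

The non-finite case is deliberately not run with `check_finite=False`, since its behaviour is not guaranteed.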
