Merge pull request #535 from igerber/perf/rscript-consolidation

igerber · web-flow · commit ea5d4058374e · 2026-06-08T07:28:10.000-04:00
docs: correct stale R-comparison-tests TODO row (R not in CI)
diff --git a/TODO.md b/TODO.md
@@ -174,7 +174,7 @@ Deferred items from PR reviews that were not addressed before merge.
 | Drift test for tutorial 24 qualitative power claims (monotonic dilution fast→slow; CS-vs-2×2 MDE crossover/near-parity at slow rollout) — pins the prose against estimator-default/simulation drift | `docs/tutorials/24_staggered_vs_collapsed_power.ipynb` | staggered-analysis-2x2 | Low |
 | ImputationDiD covariate-path variance lacks dedicated R `didimputation` parity / hand-calc. The PR-B FE-design correction (keep all unit dummies) affects the covariate projection too, but only the no-covariate staggered panel is R-parity'd (the covariate path shares the same validated projection code and passes the full suite). Add a covariate (time-varying X) R golden asserting overall/event-study SE parity, or a small dense-design hand-calc for the covariate projection. | `tests/test_methodology_imputation.py`, `benchmarks/R/generate_didimputation_golden.R` | imputation-validation follow-up | Low |
 | Port the CI `<notebook-prose>` extraction into the reviewer-eval harness so `docs/tutorials/*.ipynb` cases (currently guarded out of `verify-corpus`/`run`) can be reviewed with CI-equivalent context | `tools/reviewer-eval/adapters/ci_prompt.py` | local-review | Low |
-| R comparison tests spawn separate `Rscript` per test (slow CI) | `tests/test_methodology_twfe.py:294` | #139 | Low |
+| **Premise corrected — no CI impact (verified 2026-06-07).** The "slow CI" motivation does not hold: no CI workflow installs R (no `setup-r` / `r-lib/actions` / `fixest` / `r-base` install anywhere in `.github/workflows/`), so every R-parity test skips in CI behind a per-file availability gate (`fixest_available` in twfe, `_check_r_contdid()` in continuous_did, `require_r` / `r_available` in `conftest.py`, etc.) — consolidating `Rscript` spawns yields zero CI speedup. The originally-cited file already session-caches its R fits: `test_methodology_twfe.py` exposes `r_twfe_results` / `r_twfe_results_with_covariate` as `scope="session"` fixtures, so each R model runs once per session, not once per test. The only residual is a LOCAL-dev micro-optimization for developers who have R installed: `test_methodology_continuous_did.py` (the `_run_r_contdid` helper plus three standalone inline `Rscript` calls) and `test_methodology_callaway.py` (`_run_r_estimation` called inline in three test methods, plus `_get_r_mpdta_and_results` re-run by the MPDTA R-parity tests) re-spawn `library(...)` per call with no session-level result cache. Applying the twfe session-fixture pattern there would speed local R-parity runs only. Low value; retained as a local-dev note. | `tests/test_methodology_continuous_did.py`, `tests/test_methodology_callaway.py` | #139 | Low |
 | CS R helpers hard-code `xformla = ~ 1`; no covariate-adjusted R benchmark for IRLS path | `tests/test_methodology_callaway.py` | #202 | Low |
 | Validating the `.txt` AI guides (`diff_diff/guides/llms-full.txt`, `llms-practitioner.txt`) as executable snippets is **not low-lift** (re-scoped 2026-06-01): of their ~112 fenced Python blocks only ~20% are standalone-runnable — the rest are API-signature references (`Foo(param: type = default)` pseudo-signatures that are `SyntaxError` by design), context fragments (e.g. `results.att` on an undefined `results`), or dataset-shape-specific blocks. The guides are reference documentation, not runnable examples; a real implementation needs signature-block detection + a context/data skip-allowlist + per-snippet fixtures (multi-round curation), unlike the curated `.rst` files the existing smoke test covers. | `tests/test_doc_snippets.py` | #239 | Low |
 | `TestWorkflowDoesNotExecutePRHeadCode` (CodeQL #14 dismissal guard) does not model: `bash <script>` / `sh <script>` / `./<script>` / `source <script>` / `. <script>` direct shell-script execution; multi-line `python3 -c` bodies (line-by-line shlex can't reassemble across newlines — the workflow's 5 sanitizer bodies are exempt by invisibility); shell-variable-expansion indirection (`SCRIPT="$X"; python3 "$SCRIPT"`); `eval`; `find -exec`; `xargs -I {}`. Each represents a path by which PR-head bytes COULD execute without the test failing. The guard catches accidental regressions of common forms (16 tests covering pip/npm/cargo/maturin/etc. installs, python file exec, bash -c indirection with compound flags, env-var prefixes, line continuations, subshells/brace groups, single-line python -c, write-overwrites of allowlisted /tmp paths). Closing the residuals would require multi-line shell parsing with command-substitution awareness + script-execution allowlists — significant work for diminishing return given the dismissal's primary defense is the documented threat model on the alert and in `.github/workflows/ai_pr_review.yml` comment block. | `tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml` | #436 | Low |
@@ -221,7 +221,7 @@ _(No active items. The sole prior entry — the WooldridgeDiD method/outcome eff
 - HAD `trends_lin × survey_design` weighted-slope derivation
 - Phase 1c lprobust follow-ups (`vce` modes, weight-aware auto-bandwidth DPI, multi-eval grid, clustered-DGP auto-bandwidth) — deferred to Phase 2+ of `bias_corrected_local_linear`
 - TestWorkflowDoesNotExecutePRHeadCode (CodeQL #14) residual bypass paths — diminishing return given documented threat model
-- All remaining `Low`-priority Performance and Testing/Docs rows (R-script-per-test, CS R covariate-adjusted IRLS benchmark, doc-deps integrity CI, Rust faer SVD overhead, etc.)
+- All remaining `Low`-priority Performance and Testing/Docs rows (R-script-per-test local-dev caching, CS R covariate-adjusted IRLS benchmark, Rust faer SVD overhead, etc.)
 
 ---