chore(test): node-suite differential runner + regression guard#5097

Merged
proggeramlug merged 2 commits into main from chore/node-suite-regression-guard
Jun 13, 2026

@proggeramlug proggeramlug commented Jun 13, 2026

What

Adds tooling to make the print-and-diff node-suite (test-parity/node-suite/) trustworthy and to stop silent regressions from merging green.

scripts/node_suite_run.py — the runner, with two correctness measures

  • Pre-warm pass: compile one test per module serially first, so a crypto-feature module's cold auto-optimize runtime/stdlib rebuild (~4 min) doesn't blow the per-test compile timeout and score the whole module as perry_err. This artifact had made dns look 0% and http 47% when both are actually 100%.
  • Fast/slow lanes: server/timing modules (http, http2, https, net, dgram, tls, cluster, dns, stream, child_process, worker_threads, inspector, repl, diagnostics_channel, timers, fetch) run strictly sequentially to avoid false negatives from port contention and timing races. http measured 78.9% in parallel vs a verified 100% sequentially.
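
The two measures above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/node_suite_run.py: `schedule`, `warm`, `run_one`, and the `(module, path)` tuple shape are hypothetical names chosen for the example. Only the slow-module list, the one-test-per-module pre-warm, and the 6-worker pool (mentioned in the review walkthrough) come from this PR.

```python
from concurrent.futures import ThreadPoolExecutor

# Modules the PR description lists as server/timing-sensitive.
SLOW_MODULES = {
    "http", "http2", "https", "net", "dgram", "tls", "cluster", "dns",
    "stream", "child_process", "worker_threads", "inspector", "repl",
    "diagnostics_channel", "timers", "fetch",
}


def schedule(tests, run_one, warm, workers=6):
    """Pre-warm once per module, then run the fast lane and the slow lane.

    `tests` is a list of (module, path) pairs. `warm(module, path)` and
    `run_one(module, path)` are hypothetical callbacks: the first compiles a
    test to pay any cold-rebuild cost, the second returns an outcome string.
    """
    # Pre-warm: one test per module, serially, before any timed run.
    seen = set()
    for module, path in tests:
        if module not in seen:
            seen.add(module)
            warm(module, path)

    fast = [t for t in tests if t[0] not in SLOW_MODULES]
    slow = [t for t in tests if t[0] in SLOW_MODULES]

    results = []
    # Fast lane: independent modules share a thread pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results.extend(pool.map(lambda t: (t[0], run_one(*t)), fast))
    # Slow lane: strictly one at a time to avoid port and timing races.
    for module, path in slow:
        results.append((module, run_one(module, path)))
    return results
```

Keeping the slow lane after the pool has drained means no parallel test is still holding a port or CPU while a timing-sensitive test runs, which is the failure mode the description attributes to the 78.9% http figure.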

scripts/node_suite_regression_check.py + test-parity/node_suite_baseline.json — the guard

A floor check: fails (exit 1) if any module drops below its baseline pass count; improvements are always accepted and reported as ratchet candidates.
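
As a rough sketch of the floor check described above (not the actual contents of scripts/node_suite_regression_check.py): `check_floors` and `exit_code` are hypothetical helpers, and the plain `module -> pass count` mapping is a simplification of whatever shape the baseline file really uses.

```python
def check_floors(baseline, observed):
    """Compare observed per-module pass counts against baseline floors.

    Both arguments map module name -> pass count (an assumed, simplified
    shape). Returns (regressions, improvements), each a list of
    (module, floor, observed_count) tuples.
    """
    regressions, improvements = [], []
    for module, floor in sorted(baseline.items()):
        got = observed.get(module, 0)  # a missing module counts as zero passes
        if got < floor:
            regressions.append((module, floor, got))
        elif got > floor:
            improvements.append((module, floor, got))  # ratchet candidate
    return regressions, improvements


def exit_code(regressions):
    # 0 = no module below its floor, 1 = at least one regression.
    return 1 if regressions else 0
```

With a baseline of `{"dns": 6}` and an observed `{"dns": 0}`, this reports one regression and yields exit code 1, which mirrors the simulated dns 6→0 case mentioned under Why.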

Why

The node-suite is not part of the per-PR CI gate (the parity job is opt-in and runs node 22, while the real oracle is node 26). A module can silently regress and still merge green — which is exactly how node:dns once went 83% → 0% behind a green build. This guard runs in the node-26 environment (the box) on a schedule or before a release and catches that class of regression. Validated: it flags a simulated dns 6→0 and passes the current tree clean.

Numbers

Floors captured at 2781/2863 (97.1%) on node v26.3.0. Racy timing modules (http2, net, stream, diagnostics_channel, fs-promises) carry a small margin so ordinary flake doesn't false-alarm.
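
For illustration only, the arithmetic behind these numbers might look like the sketch below. The helper names and the idea of subtracting a fixed per-module margin are assumptions; the PR does not state how large each margin is or how it was chosen.

```python
def pass_rate(passed, total):
    """Pass percentage rounded to one decimal place."""
    return round(100.0 * passed / total, 1)


def floor_with_margin(observed_pass, margin=0):
    """Baseline floor for a module: observed passes minus a flake margin.

    `margin` is a hypothetical knob; stable modules would use 0 so any
    drop fails, racy timing modules a small positive value.
    """
    return max(0, observed_pass - margin)


print(pass_rate(2781, 2863))  # the overall figure quoted above
```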

Notes

  • Tooling only — no runtime/product code touched. Per repo policy I did not bump the version or edit CHANGELOG/CLAUDE.md; please fold those in at merge.
  • Does not wire into PR CI (would need node 26 + the corpus in CI, and the version-sensitivity of timing modules makes a node-22 gate brittle). Recommend running on the box nightly; happy to add that wiring in a follow-up if wanted.

Summary by CodeRabbit

  • Chores
    • Added a CLI regression-guard that compares recent test-suite results against stored baselines and fails CI on regressions.
    • Added a differential test runner that executes and compares runtime outputs in parallel, with a sequential path for timing-sensitive modules.
    • Added a baseline configuration file containing per-module pass floors and overall expectations.

Add scripts/node_suite_run.py (the print-and-diff node-suite runner) with two
correctness measures baked in:
- pre-warm pass: compile one test per module serially first so a crypto-feature
  module's cold auto-optimize rebuild doesn't blow the per-test compile timeout
  and score the whole module as perry_err (this made dns look 0% / http 47%).
- fast/slow lanes: server/timing modules (http, http2, net, stream, ...) run
  strictly sequentially to avoid port contention + timing-race false negatives
  (http measured 78.9% parallel vs a verified 100% sequential).

Add scripts/node_suite_regression_check.py + test-parity/node_suite_baseline.json:
a floor guard that fails if any module drops below baseline (improvements always
accepted). The node-suite is not part of the per-PR gate (the parity job is
opt-in and runs node 22 while the real oracle is node 26), so a module can
silently regress and still merge green -- which is how node:dns once went
83% -> 0% behind a green build. Run this in the node-26 environment on a
schedule or before cutting a release.

Floors captured at 2781/2863 (97.1%) on node v26.3.0; racy timing modules carry
a small margin so ordinary flake doesn't false-alarm.
@coderabbitai

coderabbitai Bot commented Jun 13, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 96760d30-e422-41ef-8f8b-1c0a3e268991

📥 Commits

Reviewing files that changed from the base of the PR and between 283b7e2 and 4fecc93.

📒 Files selected for processing (2)
  • scripts/node_suite_regression_check.py
  • scripts/node_suite_run.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • scripts/node_suite_regression_check.py
  • scripts/node_suite_run.py

Walkthrough

Adds a differential node-suite test runner, a CLI regression checker that compares per-module pass counts against a JSON baseline, and the initial node-26 baseline data.

Changes

Node-suite regression testing

| Layer | File(s) | Summary |
| --- | --- | --- |
| Test runner with parallel execution and failure classification | scripts/node_suite_run.py | Discovers non-.d.ts TypeScript tests, pre-warms one test per module, runs node and perry per test with timeouts, classifies outcomes (pass, diff, node_err, compile_fail, perry_err), schedules most modules on a 6-worker ThreadPoolExecutor while running SLOW_MODULES sequentially, and prints per-module and overall pass/total summaries. |
| Regression comparison and CLI interface | scripts/node_suite_regression_check.py | Parses CLI args, loads test-parity/node_suite_baseline.json (or provided path), invokes the test runner via subprocess.run, parses the runner's per-module pass/total table with a regex, compares counts against baseline floors, lists improvements (+) and regressions (!), and exits with 0 (ok), 1 (regressions), or 2 (harness/baseline error). |
| Initial node-26 regression baseline | test-parity/node_suite_baseline.json | Adds baseline JSON with _schema metadata and per-module pass/total floor values for the node-26 environment used by the regression checker. |
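
The regex-parsing step in the second row could look something like this sketch. The line shape it matches (`<module> <pass>/<total>`) is an assumption made for the example; the runner's real table format is not shown in this PR, and `ROW` and `parse_summary` are hypothetical names.

```python
import re

# Assumed line shape: "<module> <pass>/<total>", e.g. "dns 6/6".
# The real runner output may differ; adjust the pattern accordingly.
ROW = re.compile(r"^[ \t]*([A-Za-z_][\w-]*)[ \t]+(\d+)/(\d+)\b", re.MULTILINE)


def parse_summary(stdout):
    """Return {module: (passed, total)} from the runner's printed table."""
    return {m: (int(p), int(t)) for m, p, t in ROW.findall(stdout)}
```

Lines that do not fit the pattern are ignored rather than raising, so a caller would still want to treat an empty result as a harness error instead of a clean run.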

🎯 3 (Moderate) | ⏱️ ~20 minutes


🐰 I hopped through tests at break of day,

Node and Perry chased diffs away,
Baseline tucked safely in a file so neat,
Pass counts checked, regressions face defeat,
My whiskers twitch — the suite runs sweet.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely describes the main changes: adding a differential test runner and regression guard for the node-suite. |
| Description check | ✅ Passed | The description covers all required sections: Summary, Changes, Related issue (n/a), Test plan checklist items, and commits follow the loose prefix convention. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/node_suite_regression_check.py`:
- Around line 49-51: The check in scripts/node_suite_regression_check.py should
fail closed whenever the runner exits non‑zero; update the conditional that
currently reads "if proc.returncode != 0 and not proc.stdout:" to simply "if
proc.returncode != 0:" so any non‑zero proc.returncode triggers printing the
error and returning 2 (remove reliance on proc.stdout presence so
crashes/timeouts don't lead to parsing incomplete output).

In `@scripts/node_suite_run.py`:
- Around line 57-69: The current flow ignores non-zero child exits and uses
strip() which removes leading whitespace; change the logic so after running NODE
(variable n) you check n.returncode and return (mod, "node_err") if non-zero,
and after running the generated binary (variable p) check p.returncode and
return (mod, "run_err") if non-zero (keep the existing check for c.returncode !=
0 returning "compile_fail"); and when comparing outputs use a
trailing-newline-only trim (e.g., compare n.stdout.rstrip('\n') to
p.stdout.rstrip('\n')) instead of strip() so leading whitespace is preserved.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: c707e7ac-1167-4575-9bf0-5c2ee2cfb374

📥 Commits

Reviewing files that changed from the base of the PR and between 12ae086 and 283b7e2.

📒 Files selected for processing (3)
  • scripts/node_suite_regression_check.py
  • scripts/node_suite_run.py
  • test-parity/node_suite_baseline.json

Comment thread scripts/node_suite_regression_check.py Outdated
Comment thread scripts/node_suite_run.py Outdated
Comment on lines +57 to +69
        n = subprocess.run([NODE, path], capture_output=True, text=True, timeout=30)
    except Exception:
        return (mod, "node_err")
    with tempfile.TemporaryDirectory() as td:
        out = os.path.join(td, "o")
        try:
            c = subprocess.run([PERRY, path, "-o", out], capture_output=True, text=True, timeout=120)
            if c.returncode != 0:
                return (mod, "compile_fail")
            p = subprocess.run([out], capture_output=True, text=True, timeout=30)
        except Exception:
            return (mod, "perry_err")
    return (mod, "pass" if n.stdout.strip() == p.stdout.strip() else "diff")

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail on child-process errors and preserve leading whitespace.

strip() removes leading whitespace too, and the current code still compares output even when node or the generated out process exits non-zero. That can turn real failures into false pass/diff results.

🐛 Proposed fix
         n = subprocess.run([NODE, path], capture_output=True, text=True, timeout=30)
+        if n.returncode != 0:
+            return (mod, "node_err")
@@
             c = subprocess.run([PERRY, path, "-o", out], capture_output=True, text=True, timeout=120)
             if c.returncode != 0:
                 return (mod, "compile_fail")
             p = subprocess.run([out], capture_output=True, text=True, timeout=30)
+            if p.returncode != 0:
+                return (mod, "perry_err")
         except Exception:
             return (mod, "perry_err")
-    return (mod, "pass" if n.stdout.strip() == p.stdout.strip() else "diff")
+    return (mod, "pass" if n.stdout.rstrip() == p.stdout.rstrip() else "diff")
🧰 Tools
🪛 ast-grep (0.43.0)

[error] 62-62: Use of unsanitized data to create processes
Context: subprocess.run([PERRY, path, "-o", out], capture_output=True, text=True, timeout=120)
Note: [CWE-78].

(os-system-unsanitized-data)


[error] 65-65: Use of unsanitized data to create processes
Context: subprocess.run([out], capture_output=True, text=True, timeout=30)
Note: [CWE-78].

(os-system-unsanitized-data)


[error] 62-62: Command coming from incoming request
Context: subprocess.run([PERRY, path, "-o", out], capture_output=True, text=True, timeout=120)
Note: [CWE-20].

(subprocess-from-request)


[error] 65-65: Command coming from incoming request
Context: subprocess.run([out], capture_output=True, text=True, timeout=30)
Note: [CWE-20].

(subprocess-from-request)

🪛 Ruff (0.15.15)

[error] 57-57: subprocess call: check for execution of untrusted input

(S603)


[warning] 58-58: Do not catch blind exception: Exception

(BLE001)


[error] 63-63: subprocess call: check for execution of untrusted input

(S603)


[error] 66-66: subprocess call: check for execution of untrusted input

(S603)


[warning] 67-67: Do not catch blind exception: Exception

(BLE001)

- node_suite_regression_check.py: fail closed on any non-zero runner exit
  (was: only when stdout also empty) so a crashed/timed-out runner can't lead
  to parsing partial output.
- node_suite_run.py: compare stdout with rstrip('\n') instead of strip() so
  leading/trailing spaces are significant (byte-for-byte intent), and require
  matching process exit code, so a Perry crash that printed matching output
  before dying is a diff, not a false pass. (Did not bucket node-nonzero as
  node_err per the literal suggestion — that would wrongly drop legitimate
  error-path tests where node intentionally exits non-zero and Perry matches.)
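
A minimal sketch of the fail-closed behaviour from the first bullet, assuming the checker shells out to the runner with `subprocess.run`. The function name `run_runner` and its return shape are hypothetical; only the rule "any non-zero exit is a harness error (2)" comes from this PR.

```python
import subprocess
import sys


def run_runner(cmd, timeout=None):
    """Run the suite runner and return (status, stdout).

    status is 2 (harness error) whenever the runner cannot start, times
    out, or exits non-zero, so partial stdout is never parsed as a result.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"runner failed to complete: {exc}", file=sys.stderr)
        return 2, ""
    if proc.returncode != 0:  # fail closed, even if stdout is non-empty
        print(f"runner exited {proc.returncode}", file=sys.stderr)
        return 2, ""
    return 0, proc.stdout
```

Discarding stdout on the error paths is a deliberate choice in this sketch: it makes it impossible for a caller to parse a truncated table by accident.
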
@proggeramlug proggeramlug merged commit 0dde211 into main Jun 13, 2026
14 checks passed
@proggeramlug proggeramlug deleted the chore/node-suite-regression-guard branch June 13, 2026 18:10
Thanks @coderabbitai — both addressed in 4fecc93:

  1. Fail closed on non-zero runner exit — adopted as suggested. node_suite_regression_check.py now returns 2 on any non-zero exit so a crashed/timed-out runner can't lead to parsing partial output.

  2. Exit codes + rstrip('\n') — adopted the spirit, adapted the letter:

    • Switched strip() → rstrip('\n') so leading/trailing spaces are significant (byte-for-byte intent).
    • A test now passes only if stdout matches and the process exit codes match — so a Perry crash that printed matching output before dying is a diff, not a false pass. This covers the masked-failure concern behind the run_err suggestion.
    • I did not bucket node-nonzero exits as node_err. In this hand-written suite a non-zero node exit is often intentional (the test exercises an error path), and Perry matching it — same stdout, same exit code — is real parity that should count as pass. Bucketing it as node_err would wrongly drop those.
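
The comparison rule described in these bullets can be illustrated with the sketch below. `Result` is a stand-in for `subprocess.CompletedProcess` and `classify` is a hypothetical name; the actual function in scripts/node_suite_run.py may be structured differently.

```python
from collections import namedtuple

# Minimal stand-in for subprocess.CompletedProcess (only the fields used here).
Result = namedtuple("Result", "stdout returncode")


def classify(node, perry):
    """Return "pass" only if stdout and exit code both match, else "diff"."""
    # Trim trailing newlines only, so leading and trailing spaces still count.
    same_out = node.stdout.rstrip("\n") == perry.stdout.rstrip("\n")
    same_exit = node.returncode == perry.returncode
    return "pass" if same_out and same_exit else "diff"


# An intentional non-zero exit that Perry reproduces is still a pass.
print(classify(Result("err\n", 1), Result("err\n", 1)))
# Matching output followed by a crash is a diff, not a false pass.
print(classify(Result("ok\n", 0), Result("ok\n", 139)))
```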

Re-ran the full suite on the node-26 oracle with the stricter comparison: no module dropped below its floor (overall 2779/2863; the 2-test shortfall against the committed 2781 is ordinary stream/http2 timing flake, within the margins). So the stricter match confirmed the existing numbers weren't hiding masked crashes or whitespace diffs, and the baseline floors stand unchanged.

@coderabbitai

coderabbitai Bot commented Jun 13, 2026

@proggeramlug: Thanks for the thorough write-up. Both items look solid:

  1. Fail-closed — returning exit 2 on any non-zero runner exit is the right call; it prevents partial-output parsing from masking crashes or timeouts.

  2. Exit codes + rstrip('\n') — the adaptation makes sense. Requiring both stdout and exit-code agreement already closes the masked-crash loophole, and keeping node-nonzero exits as first-class pass candidates (when Perry replicates them exactly) correctly reflects the intent of the hand-written error-path tests. Bucketing them as node_err would indeed have been a false negative there.

The re-run confirming 2779/2863 within the timing-flake margin is a good sanity check that the stricter comparison didn't expose any hidden whitespace or exit-code mismatches in the baseline.

(ᴗ˳ᴗ)
