chore(test): node-suite differential runner + regression guard by proggeramlug · Pull Request #5097 · PerryTS/perry

proggeramlug · 2026-06-13T17:09:50Z

What

Adds tooling to make the print-and-diff node-suite (test-parity/node-suite/) trustworthy and to stop silent regressions from merging green.

`scripts/node_suite_run.py` — the runner, with two correctness measures

Pre-warm pass: compile one test per module serially first, so a crypto-feature module's cold auto-optimize runtime/stdlib rebuild (~4 min) doesn't blow the per-test compile timeout and score the whole module as perry_err. This artifact had made dns look 0% and http 47% when both are actually 100%.
Fast/slow lanes: server/timing modules (http, http2, https, net, dgram, tls, cluster, dns, stream, child_process, worker_threads, inspector, repl, diagnostics_channel, timers, fetch) run strictly sequentially to avoid port-contention + timing-race false negatives. http measured 78.9% parallel vs a verified 100% sequential.
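
The two measures above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `scripts/node_suite_run.py`: `prewarm_targets`, `split_lanes`, `run_all`, and the `run_one`/`warm` callbacks are hypothetical names, and tests are assumed to be `(module, path)` pairs. The module list and the 6-worker pool come from this PR's description and walkthrough.

```python
from concurrent.futures import ThreadPoolExecutor

# Slow-lane modules as listed in the PR description.
SLOW_MODULES = {
    "http", "http2", "https", "net", "dgram", "tls", "cluster", "dns",
    "stream", "child_process", "worker_threads", "inspector", "repl",
    "diagnostics_channel", "timers", "fetch",
}


def prewarm_targets(tests):
    """Pick the first test of each module (hypothetical helper)."""
    first = {}
    for mod, path in tests:
        first.setdefault(mod, path)  # keep only the first path seen per module
    return list(first.items())


def split_lanes(tests):
    """Partition tests into a parallel lane and a sequential lane."""
    fast = [t for t in tests if t[0] not in SLOW_MODULES]
    slow = [t for t in tests if t[0] in SLOW_MODULES]
    return fast, slow


def run_all(tests, run_one, warm=None, workers=6):
    """Pre-warm serially, run the fast lane on a pool, then the slow lane one by one."""
    if warm is not None:
        for target in prewarm_targets(tests):
            warm(target)  # serial, so a cold rebuild is paid once per module
    fast, slow = split_lanes(tests)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_one, fast))
    for t in slow:
        results.append(run_one(t))  # strictly sequential: no port contention
    return results
```

Keeping the slow lane outside the pool means a server test never shares the machine with another test that might bind the same port or skew its timers.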

`scripts/node_suite_regression_check.py` + `test-parity/node_suite_baseline.json` — the guard

A floor check: fails (exit 1) if any module drops below its baseline pass count; improvements are always accepted and reported as ratchet candidates.
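
A floor check of this kind might look like the sketch below. The data shapes are assumptions: `baseline` and `current` are plain `{module: pass_count}` dicts here, whereas the real `node_suite_baseline.json` also carries `_schema` metadata and totals. `check_floors` and `exit_code` are hypothetical names.

```python
def check_floors(baseline, current):
    """Compare per-module pass counts with baseline floors.

    Returns (regressions, improvements), each a list of (module, floor, got).
    """
    regressions, improvements = [], []
    for mod, floor in sorted(baseline.items()):
        got = current.get(mod, 0)  # a module missing from the run counts as 0 passes
        if got < floor:
            regressions.append((mod, floor, got))
        elif got > floor:
            improvements.append((mod, floor, got))  # ratchet candidate
    return regressions, improvements


def exit_code(regressions):
    """1 if any module fell below its floor, else 0."""
    return 1 if regressions else 0
```

With a baseline of `{"dns": 6}` and a run that reports `{"dns": 0}`, the check reports one regression and the exit code is 1, which matches the simulated dns 6→0 case described in this PR.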

Why

The node-suite is not part of the per-PR CI gate (the parity job is opt-in and runs node 22, while the real oracle is node 26). A module can silently regress and still merge green — which is exactly how node:dns once went 83% → 0% behind a green build. This guard runs in the node-26 environment (the box) on a schedule or before a release and catches that class of regression. Validated: it flags a simulated dns 6→0 and passes the current tree clean.

Numbers

Floors captured at 2781/2863 (97.1%) on node v26.3.0. Racy timing modules (http2, net, stream, diagnostics_channel, fs-promises) carry a small margin so ordinary flake doesn't false-alarm.

Notes

Tooling only — no runtime/product code touched. Per repo policy I did not bump the version or edit CHANGELOG/CLAUDE.md; please fold those in at merge.
Does not wire into PR CI (would need node 26 + the corpus in CI, and the version-sensitivity of timing modules makes a node-22 gate brittle). Recommend running on the box nightly; happy to add that wiring in a follow-up if wanted.

Summary by CodeRabbit

Chores
- Added a CLI regression-guard that compares recent test-suite results against stored baselines and exits non-zero on regressions.
- Added a differential test runner that executes and compares runtime outputs in parallel, with a sequential path for timing-sensitive modules.
- Added a baseline configuration file containing per-module pass floors and overall expectations.

Add scripts/node_suite_run.py (the print-and-diff node-suite runner) with two correctness measures baked in:

- pre-warm pass: compile one test per module serially first so a crypto-feature module's cold auto-optimize rebuild doesn't blow the per-test compile timeout and score the whole module as perry_err (this made dns look 0% / http 47%).
- fast/slow lanes: server/timing modules (http, http2, net, stream, ...) run strictly sequentially to avoid port contention + timing-race false negatives (http measured 78.9% parallel vs a verified 100% sequential).

Add scripts/node_suite_regression_check.py + test-parity/node_suite_baseline.json: a floor guard that fails if any module drops below baseline (improvements always accepted).

The node-suite is not part of the per-PR gate (the parity job is opt-in and runs node 22 while the real oracle is node 26), so a module can silently regress and still merge green -- which is how node:dns once went 83% -> 0% behind a green build. Run this in the node-26 environment on a schedule or before cutting a release.

Floors captured at 2781/2863 (97.1%) on node v26.3.0; racy timing modules carry a small margin so ordinary flake doesn't false-alarm.

coderabbitai · 2026-06-13T17:10:07Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 96760d30-e422-41ef-8f8b-1c0a3e268991

📥 Commits

Reviewing files that changed from the base of the PR and between 283b7e2 and 4fecc93.

📒 Files selected for processing (2)

scripts/node_suite_regression_check.py
scripts/node_suite_run.py

🚧 Files skipped from review as they are similar to previous changes (2)

scripts/node_suite_regression_check.py
scripts/node_suite_run.py

📝 Walkthrough

Adds a differential node-suite test runner, a CLI regression checker that compares per-module pass counts against a JSON baseline, and the initial node-26 baseline data.

Changes

Node-suite regression testing

| Layer / File(s) | Summary |
|---|---|
| Test runner with parallel execution and failure classification `scripts/node_suite_run.py` | Discovers non-.d.ts TypeScript tests, pre-warms one test per module, runs `node` and `perry` per test with timeouts, classifies outcomes (`pass`, `diff`, `node_err`, `compile_fail`, `perry_err`), schedules most modules on a 6-worker ThreadPoolExecutor while running `SLOW_MODULES` sequentially, and prints per-module and overall pass/total summaries. |
| Regression comparison and CLI interface `scripts/node_suite_regression_check.py` | Parses CLI args, loads `test-parity/node_suite_baseline.json` (or provided path), invokes the test runner via `subprocess.run`, parses the runner's per-module pass/total table with a regex, compares counts against baseline floors, lists improvements (`+`) and regressions (`!`), and exits with 0 (ok), 1 (regressions), or 2 (harness/baseline error). |
| Initial node-26 regression baseline `test-parity/node_suite_baseline.json` | Adds baseline JSON with `_schema` metadata and per-module pass/total floor values for the node-26 environment used by the regression checker. |
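
The regex parsing step in the walkthrough could be sketched as below. The runner's actual table format is not shown in this PR, so the `module  pass/total` row layout, the `ROW` pattern, and the `parse_table` name are all assumptions made for illustration.

```python
import re

# Assumed row layout: "<module>   <pass>/<total>", optionally followed by extra text.
ROW = re.compile(
    r"^[ \t]*(?P<mod>[A-Za-z_][\w-]*)[ \t]+(?P<passed>\d+)/(?P<total>\d+)\b",
    re.MULTILINE,
)


def parse_table(text):
    """Return {module: (passed, total)} for every line that looks like a table row."""
    return {
        m["mod"]: (int(m["passed"]), int(m["total"]))
        for m in ROW.finditer(text)
    }
```

Lines that do not match the row shape are ignored, so a checker built this way should treat an empty result as a harness error rather than as "no regressions".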

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🐰 I hopped through tests at break of day,

Node and Perry chased diffs away,
Baseline tucked safely in a file so neat,
Pass counts checked, regressions face defeat,
My whiskers twitch — the suite runs sweet.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately and concisely describes the main changes: adding a differential test runner and regression guard for the node-suite. |
| Description check | ✅ Passed | The description covers all required sections: Summary, Changes, Related issue (n/a), Test plan checklist items, and commits follow the loose prefix convention. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/node_suite_regression_check.py`:
- Around line 49-51: The check in scripts/node_suite_regression_check.py should
fail closed whenever the runner exits non‑zero; update the conditional that
currently reads "if proc.returncode != 0 and not proc.stdout:" to simply "if
proc.returncode != 0:" so any non‑zero proc.returncode triggers printing the
error and returning 2 (remove reliance on proc.stdout presence so
crashes/timeouts don't lead to parsing incomplete output).

In `@scripts/node_suite_run.py`:
- Around line 57-69: The current flow ignores non-zero child exits and uses
strip() which removes leading whitespace; change the logic so after running NODE
(variable n) you check n.returncode and return (mod, "node_err") if non-zero,
and after running the generated binary (variable p) check p.returncode and
return (mod, "run_err") if non-zero (keep the existing check for c.returncode !=
0 returning "compile_fail"); and when comparing outputs use a
trailing-newline-only trim (e.g., compare n.stdout.rstrip('\n') to
p.stdout.rstrip('\n')) instead of strip() so leading whitespace is preserved.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: c707e7ac-1167-4575-9bf0-5c2ee2cfb374

📥 Commits

Reviewing files that changed from the base of the PR and between 12ae086 and 283b7e2.

📒 Files selected for processing (3)

scripts/node_suite_regression_check.py
scripts/node_suite_run.py
test-parity/node_suite_baseline.json

coderabbitai · 2026-06-13T17:16:57Z

+        n = subprocess.run([NODE, path], capture_output=True, text=True, timeout=30)
+    except Exception:
+        return (mod, "node_err")
+    with tempfile.TemporaryDirectory() as td:
+        out = os.path.join(td, "o")
+        try:
+            c = subprocess.run([PERRY, path, "-o", out], capture_output=True, text=True, timeout=120)
+            if c.returncode != 0:
+                return (mod, "compile_fail")
+            p = subprocess.run([out], capture_output=True, text=True, timeout=30)
+        except Exception:
+            return (mod, "perry_err")
+    return (mod, "pass" if n.stdout.strip() == p.stdout.strip() else "diff")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail on child-process errors and preserve leading whitespace.

strip() removes leading whitespace too, and the current code still compares output even when node or the generated out process exits non-zero. That can turn real failures into false pass/diff results.
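
To make the whitespace point concrete, here is a small sketch of how the three trimming variants behave on the same pair of outputs. The helper names are hypothetical; only the standard `str.strip` and `str.rstrip` methods are used.

```python
def eq_strip(a, b):
    """Compare after removing leading AND trailing whitespace."""
    return a.strip() == b.strip()


def eq_rstrip(a, b):
    """Compare after removing all trailing whitespace (spaces, tabs, newlines)."""
    return a.rstrip() == b.rstrip()


def eq_rstrip_newline(a, b):
    """Compare after removing trailing newlines only."""
    return a.rstrip("\n") == b.rstrip("\n")


# Outputs that differ only in leading indentation.
node_out, perry_out = "  indented\n", "indented\n"
print(eq_strip(node_out, perry_out))           # True: the difference is hidden
print(eq_rstrip_newline(node_out, perry_out))  # False: the difference is kept

# Outputs that differ only in trailing spaces before the newline.
print(eq_rstrip("a  \n", "a\n"))               # True: trailing spaces are dropped
print(eq_rstrip_newline("a  \n", "a\n"))       # False: trailing spaces still count
```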

🐛 Proposed fix

         n = subprocess.run([NODE, path], capture_output=True, text=True, timeout=30)
+        if n.returncode != 0:
+            return (mod, "node_err")
@@
             c = subprocess.run([PERRY, path, "-o", out], capture_output=True, text=True, timeout=120)
             if c.returncode != 0:
                 return (mod, "compile_fail")
             p = subprocess.run([out], capture_output=True, text=True, timeout=30)
+            if p.returncode != 0:
+                return (mod, "perry_err")
         except Exception:
             return (mod, "perry_err")
-    return (mod, "pass" if n.stdout.strip() == p.stdout.strip() else "diff")
+    return (mod, "pass" if n.stdout.rstrip() == p.stdout.rstrip() else "diff")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

        n = subprocess.run([NODE, path], capture_output=True, text=True, timeout=30)
        if n.returncode != 0:
            return (mod, "node_err")
    except Exception:
        return (mod, "node_err")
    with tempfile.TemporaryDirectory() as td:
        out = os.path.join(td, "o")
        try:
            c = subprocess.run([PERRY, path, "-o", out], capture_output=True, text=True, timeout=120)
            if c.returncode != 0:
                return (mod, "compile_fail")
            p = subprocess.run([out], capture_output=True, text=True, timeout=30)
            if p.returncode != 0:
                return (mod, "perry_err")
        except Exception:
            return (mod, "perry_err")
    return (mod, "pass" if n.stdout.rstrip() == p.stdout.rstrip() else "diff")

🧰 Tools

🪛 ast-grep (0.43.0)

[error] 62-62: Use of unsanitized data to create processes
Context: subprocess.run([PERRY, path, "-o", out], capture_output=True, text=True, timeout=120)
Note: [CWE-78].

(os-system-unsanitized-data)

[error] 65-65: Use of unsanitized data to create processes
Context: subprocess.run([out], capture_output=True, text=True, timeout=30)
Note: [CWE-78].

(os-system-unsanitized-data)

[error] 62-62: Command coming from incoming request
Context: subprocess.run([PERRY, path, "-o", out], capture_output=True, text=True, timeout=120)
Note: [CWE-20].

(subprocess-from-request)

[error] 65-65: Command coming from incoming request
Context: subprocess.run([out], capture_output=True, text=True, timeout=30)
Note: [CWE-20].

(subprocess-from-request)

🪛 Ruff (0.15.15)

[error] 57-57: subprocess call: check for execution of untrusted input

(S603)

[warning] 58-58: Do not catch blind exception: Exception

(BLE001)

[error] 63-63: subprocess call: check for execution of untrusted input

(S603)

[error] 66-66: subprocess call: check for execution of untrusted input

(S603)

[warning] 67-67: Do not catch blind exception: Exception

(BLE001)

- node_suite_regression_check.py: fail closed on any non-zero runner exit (was: only when stdout also empty) so a crashed/timed-out runner can't lead to parsing partial output.
- node_suite_run.py: compare stdout with rstrip('\n') instead of strip() so leading/trailing spaces are significant (byte-for-byte intent), and require matching process exit code, so a Perry crash that printed matching output before dying is a diff, not a false pass.

(Did not bucket node-nonzero as node_err per the literal suggestion: that would wrongly drop legitimate error-path tests where node intentionally exits non-zero and Perry matches.)

proggeramlug · 2026-06-13T18:30:25Z

Thanks @coderabbitai — both addressed in 4fecc93:

Fail closed on non-zero runner exit — adopted as suggested. node_suite_regression_check.py now returns 2 on any non-zero exit so a crashed/timed-out runner can't lead to parsing partial output.
Exit codes + rstrip('\n') — adopted the spirit, adapted the letter:
- Switched strip() → rstrip('\n') so leading/trailing spaces are significant (byte-for-byte intent).
- A test now passes only if stdout matches and the process exit codes match — so a Perry crash that printed matching output before dying is a diff, not a false pass. This covers the masked-failure concern behind the run_err suggestion.
- I did not bucket node-nonzero exits as node_err. In this hand-written suite a non-zero node exit is often intentional (the test exercises an error path), and Perry matching it — same stdout, same exit code — is real parity that should count as pass. Bucketing it as node_err would wrongly drop those.
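
The comparison rule described in these bullets can be sketched as a pure function. `Run` and `outcome` are hypothetical names for illustration, and the real runner also has `compile_fail`, `node_err`, and `perry_err` paths that are omitted here.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """Captured result of one process (stand-in for subprocess.CompletedProcess)."""
    stdout: str
    returncode: int


def outcome(node: Run, perry: Run) -> str:
    """'pass' only if stdout (minus trailing newlines) and exit codes both match."""
    same_stdout = node.stdout.rstrip("\n") == perry.stdout.rstrip("\n")
    same_exit = node.returncode == perry.returncode
    return "pass" if same_stdout and same_exit else "diff"
```

Under this rule an intentional error-path test where both sides print the same text and exit 1 still passes, while a Perry binary that prints matching output and then dies with a different exit code is scored as a diff.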

Re-ran the full suite on the node-26 oracle with the stricter comparison: no module dropped below its floor (overall 2779/2863; the 2-test shortfall against the committed 2781 is ordinary stream/http2 timing flake, within the margins). So the stricter match confirmed that the existing numbers weren't hiding masked crashes or whitespace diffs, and the baseline floors stand unchanged.

coderabbitai · 2026-06-13T18:30:52Z

@proggeramlug: Thanks for the thorough write-up. Both items look solid:

Fail-closed — returning exit 2 on any non-zero runner exit is the right call; it prevents partial-output parsing from masking crashes or timeouts.
Exit codes + rstrip('\n') — the adaptation makes sense. Requiring both stdout and exit-code agreement already closes the masked-crash loophole, and keeping node-nonzero exits as first-class pass candidates (when Perry replicates them exactly) correctly reflects the intent of the hand-written error-path tests. Bucketing them as node_err would indeed have been a false negative there.
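
The fail-closed behaviour in the first item might be sketched as follows. `run_runner` is a hypothetical helper, not the function in `scripts/node_suite_regression_check.py`; the exit code 2 for harness errors follows the walkthrough's description.

```python
import subprocess
import sys


def run_runner(cmd, timeout=None):
    """Run the suite runner and return (guard_exit_code, stdout).

    Any non-zero exit, timeout, or launch failure yields (2, ""), so partial
    output is never handed to the table parser.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"harness error: {exc}", file=sys.stderr)
        return 2, ""
    if proc.returncode != 0:
        # Fail closed: ignore whatever was printed before the runner died.
        print(f"runner exited {proc.returncode}", file=sys.stderr)
        return 2, ""
    return 0, proc.stdout
```

Discarding stdout on failure is the point of the change: a runner that printed half a table before crashing would otherwise look like a run in which the remaining modules scored zero, or worse, like a clean run of fewer modules.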

The re-run confirming 2779/2863 within the timing-flake margin is a good sanity check that the stricter comparison didn't expose any hidden whitespace or exit-code mismatches in the baseline.

(ᴗ˳ᴗ)

coderabbitai Bot reviewed Jun 13, 2026

proggeramlug merged commit 0dde211 into main Jun 13, 2026
14 checks passed

proggeramlug deleted the chore/node-suite-regression-guard branch June 13, 2026 18:10

This was referenced Jun 13, 2026

test(node-suite): normalize env-variant tokens so console scores 119/119 #5101

Merged

chore(test): ratchet node-suite baseline to clean 98.1% run #5166

Merged

ci: wire node-suite regression guard (nightly + merge-queue-ready) #5255

Merged

proggeramlug merged 2 commits into main from chore/node-suite-regression-guard
