Skip to content

fix(runtime): node:http2 — fix session/stream hang from timer-id ↔ handle-id collision#5083

Merged
proggeramlug merged 1 commit into
mainfrom
node-http2-fix-parity
Jun 13, 2026
Merged

fix(runtime): node:http2 — fix session/stream hang from timer-id ↔ handle-id collision#5083
proggeramlug merged 1 commit into
mainfrom
node-http2-fix-parity

Conversation

@proggeramlug

@proggeramlug proggeramlug commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Summary

`node:http2` was at 8/9 in the in-repo node-suite (the mandate's "44%" was stale; the suite already passed 8 cases). The one failure — `session/controls.ts` — produced byte-identical output but hung (exit 124), which also hung the print-and-diff harness on that case.

Root cause

Perry's Node timer ids (`NEXT_TIMER_ID`, counting from 1) and the perry-ffi handle registry ids (`NEXT_HANDLE`, counting from 1) share the pointer-tagged small-integer value space, and both start at 1. So a JS value like `POINTER_TAG | 1` is bit-ambiguous between an HTTP/2 server handle (handle 1) and a `setTimeout` id (timer 1).

`js_native_call_method`'s timer-method dispatch arm fires on `is_known_timer_id(id)` alone. When `controls.ts` had both an HTTP/2 server (handle 1) and a `setTimeout` guard (timer 1) alive, `server.close()` (`0x7ffd…0001`) was swallowed by the timer arm → `clearTimeout(1)` instead of `js_node_http2_server_close`. The server never closed, `base.listening` stayed `true`, `js_node_http_server_has_active` kept returning 1, and the main event loop parked on the pump condvar forever.

The handle-registry docs already anticipate this class of bug ("handle families that participate in generic property/method dispatch reserve disjoint visible id ranges") — timers and the FFI registry simply both used `[1, 0x40000)`.

Fix

An authoritative FFI-handle-existence probe:

  • perry-runtime exposes `js_register_ffi_handle_exists_probe` + `ffi_handle_exists(id)` (mirrors the existing net-socket/stream probe pattern).
  • perry-ffi registers `ffi_handle_exists_probe` (`HANDLES.contains_key`) on first `register_handle`.
  • The timer arm now gates on `is_known_timer_id(id) && !ffi_handle_exists(id)`, so a live registered handle wins over the ambiguous timer-id heuristic. A genuine timer whose id is not also a live handle still resolves as a timer.

Provably inert for pure-JS / test262: with no `register_handle` the probe is never registered and `ffi_handle_exists` returns `false`, leaving the timer dispatch arm byte-identical to before.

Results (local, node v26 print-and-diff)

  • http2: the suite no longer hangs; `controls.ts` now exits 0.
  • Collateral fixes (same root cause — server handles + timers): `http/client-request/header-state-controls`, `http/incoming-message/client-set-encoding`, `http/incoming-message/server-headers-body` now pass (verified 3/3 stable). net/http/https went 29→32 pass.
  • 0 regressions: `timers` (identical 7 pre-existing flaky delay-warning fails on baseline and fix), `net`/`http`/`https` (no fix-only fails). perry-ffi + perry-runtime unit tests pass.

Known limitation: `controls.ts` is Node-side flaky

After the hang fix, `controls.ts` still diffs intermittently because the server `'session'` vs client `'connect'` event ordering is a genuine loopback race in node v26 (measured server-first 12 : client-first 8 over 20 node runs). `sequential-session-cleanup.ts` deterministically requires the opposite order (client-first), so no single deterministic Perry ordering can match both. Perry keeps client-first, which keeps the other 8 http2 cases deterministically green; `controls` passes whenever Node's coin-flip lands client-first. This is Node nondeterminism, not a Perry correctness gap.

Files

  • `crates/perry-runtime/src/object/class_handles.rs` — probe registration + `ffi_handle_exists`
  • `crates/perry-runtime/src/object/native_call_method.rs` — gate the timer dispatch arm
  • `crates/perry-ffi/src/handle.rs` — probe impl + lazy registration in `register_handle`

Summary by CodeRabbit

  • Bug Fixes
    • Fixed method dispatch on FFI-based objects to prevent incorrect routing to timer methods, ensuring proper method execution (e.g., .close() calls work as expected).

…r id

Root cause: Perry's Node timer ids (NEXT_TIMER_ID, from 1) and the perry-ffi
handle registry ids (NEXT_HANDLE, from 1) share the pointer-tagged
small-integer value space and both count from 1, so a value like
`POINTER_TAG | 1` is ambiguous between an HTTP/2 server handle and a
`setTimeout` id. js_native_call_method's timer-method arm fires on
is_known_timer_id(id) alone, so when a server (handle 1) and a timer (id 1)
were alive at once, `server.close()` was routed to clearTimeout(1) instead of
js_node_http2_server_close. The server never closed, base.listening stayed
true, js_node_http_server_has_active kept returning 1, and the event loop
parked forever (node:http2 session/controls.ts hung; exit 124).

Fix: an authoritative FFI-handle-existence probe. perry-ffi registers
ffi_handle_exists_probe (HANDLES.contains_key) with perry-runtime on first
register_handle; the timer arm now yields (is_known_timer_id(id) &&
!ffi_handle_exists(id)) so a live registered handle wins over the ambiguous
timer-id heuristic. Provably inert for pure-JS/test262: with no register_handle
the probe is unregistered and ffi_handle_exists returns false, leaving the
timer arm byte-identical.

node-suite http2: the suite no longer hangs on case 9 (controls.ts now exits 0).
Collateral: 3 http tests that combine server handles + timers now pass
(client-request/header-state-controls, incoming-message/client-set-encoding,
incoming-message/server-headers-body). 0 regressions across timers/net/http/https.

controls.ts remains flaky for a Node-side reason: the server 'session' vs client
'connect' event ordering is a genuine loopback race in node v26 (measured
server-first 12 : client-first 8 over 20 runs), and sequential-session-cleanup
deterministically requires the opposite order — no single deterministic Perry
ordering matches both, so Perry keeps client-first (the other 8 cases green).
@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Copy link
Copy Markdown

Review Change Stack

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

This PR introduces a callback-based probe system that allows the Perry runtime to query whether a handle ID is live in the FFI registry. The FFI layer registers a probe that checks the local registry, and the runtime uses it to disambiguate method dispatch when an ID could match both a timer and an FFI handle, ensuring FFI-backed method calls are not incorrectly routed to timer handlers.

Changes

Handle ID Disambiguation Between FFI and Timer Domains

Layer / File(s) Summary
Probe callback contract and runtime APIs
crates/perry-runtime/src/object/class_handles.rs
Defines FfiHandleExistsProbeFn callback type, adds global AtomicPtr storage for the registered probe, and implements ffi_handle_exists_probe() getter, ffi_handle_exists() wrapper predicate, and js_register_ffi_handle_exists_probe() registration entrypoint.
FFI handle probe implementation and lazy registration
crates/perry-ffi/src/handle.rs
Implements native ffi_handle_exists_probe() that checks HANDLES.contains_key(), guards registration with std::sync::Once, and wires register_handle() to ensure the probe is registered before any handle is produced.
Timer dispatch disambiguation using probe
crates/perry-runtime/src/object/native_call_method.rs
Adds !ffi_handle_exists(id) condition to timer method dispatch, preventing timer routing when the ID is actively registered as an FFI handle and ensuring FFI method calls are not swallowed.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🐰 A probe hops through the FFI veil,
To whisper: "This handle is real!"
No more ambiguous calls astray,
Timer and FFI now play fair today!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the root cause (timer-id ↔ handle-id collision) and the fix domain (node:http2), matching the core technical change in this PR.
Description check ✅ Passed The description is comprehensive, covering summary, root cause analysis, fix implementation, results, and known limitations. It aligns well with the template sections (Summary, Changes, Related issue, Test plan) and provides sufficient technical context.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch node-http2-fix-parity

Comment @coderabbitai help to get the list of available commands and usage tips.

@proggeramlug proggeramlug merged commit e092362 into main Jun 13, 2026
12 of 14 checks passed
@proggeramlug proggeramlug deleted the node-http2-fix-parity branch June 13, 2026 10:37
proggeramlug pushed a commit that referenced this pull request Jun 13, 2026
…ter_ffi_handle_exists_probe)

#5083 added a `js_register_ffi_handle_exists_probe(...)` call inside
`register_handle` (handle.rs) for the runtime's handle-vs-timer disambiguation
probe. That symbol is defined in perry-runtime, but perry-ffi's `runtime-link`
feature is off by default and CI runs `cargo test -p perry-ffi` per-package in
isolation (no --workspace feature unification). The handle-registry tests
(bare `#[cfg(test)]`) call `register_handle`, so the test binary retained a
reference to an unlinked symbol and failed at `cc` with `undefined reference`.

Gating the handle tests on `runtime-link` (the pattern other perry-ffi test
modules use) would drop them from CI entirely (the per-package loop never
enables the feature). Instead add a test-only no-op stub gated
`#[cfg(all(test, not(feature = "runtime-link")))]` so the registry tests keep
running and the binary links; the real extern declaration is gated out of that
exact config to avoid an E0428 name clash. The stub can never collide with
perry-runtime's real definition (it exists only in perry-ffi's own test binary
with runtime-link off).

Verified: `cargo test -p perry-ffi` links + passes (13); `cargo test -p
perry-ffi --features runtime-link` uses the real symbol + passes (35).
proggeramlug added a commit that referenced this pull request Jun 13, 2026
…ter_ffi_handle_exists_probe) (#5088)

#5083 added a `js_register_ffi_handle_exists_probe(...)` call inside
`register_handle` (handle.rs) for the runtime's handle-vs-timer disambiguation
probe. That symbol is defined in perry-runtime, but perry-ffi's `runtime-link`
feature is off by default and CI runs `cargo test -p perry-ffi` per-package in
isolation (no --workspace feature unification). The handle-registry tests
(bare `#[cfg(test)]`) call `register_handle`, so the test binary retained a
reference to an unlinked symbol and failed at `cc` with `undefined reference`.

Gating the handle tests on `runtime-link` (the pattern other perry-ffi test
modules use) would drop them from CI entirely (the per-package loop never
enables the feature). Instead add a test-only no-op stub gated
`#[cfg(all(test, not(feature = "runtime-link")))]` so the registry tests keep
running and the binary links; the real extern declaration is gated out of that
exact config to avoid an E0428 name clash. The stub can never collide with
perry-runtime's real definition (it exists only in perry-ffi's own test binary
with runtime-link off).

Verified: `cargo test -p perry-ffi` links + passes (13); `cargo test -p
perry-ffi --features runtime-link` uses the real symbol + passes (35).

Co-authored-by: Ralph Küpper <ralph2@skelpo.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant