fix(runtime): node:http2 — fix session/stream hang from timer-id ↔ handle-id collision#5083
Conversation
…r id Root cause: Perry's Node timer ids (NEXT_TIMER_ID, from 1) and the perry-ffi handle registry ids (NEXT_HANDLE, from 1) share the pointer-tagged small-integer value space and both count from 1, so a value like `POINTER_TAG | 1` is ambiguous between an HTTP/2 server handle and a `setTimeout` id. js_native_call_method's timer-method arm fires on is_known_timer_id(id) alone, so when a server (handle 1) and a timer (id 1) were alive at once, `server.close()` was routed to clearTimeout(1) instead of js_node_http2_server_close. The server never closed, base.listening stayed true, js_node_http_server_has_active kept returning 1, and the event loop parked forever (node:http2 session/controls.ts hung; exit 124). Fix: an authoritative FFI-handle-existence probe. perry-ffi registers ffi_handle_exists_probe (HANDLES.contains_key) with perry-runtime on first register_handle; the timer arm now yields (is_known_timer_id(id) && !ffi_handle_exists(id)) so a live registered handle wins over the ambiguous timer-id heuristic. Provably inert for pure-JS/test262: with no register_handle the probe is unregistered and ffi_handle_exists returns false, leaving the timer arm byte-identical. node-suite http2: the suite no longer hangs on case 9 (controls.ts now exits 0). Collateral: 3 http tests that combine server handles + timers now pass (client-request/header-state-controls, incoming-message/client-set-encoding, incoming-message/server-headers-body). 0 regressions across timers/net/http/https. controls.ts remains flaky for a Node-side reason: the server 'session' vs client 'connect' event ordering is a genuine loopback race in node v26 (measured server-first 12 : client-first 8 over 20 runs), and sequential-session-cleanup deterministically requires the opposite order — no single deterministic Perry ordering matches both, so Perry keeps client-first (the other 8 cases green).
|
Caution Review failedPull request was closed or merged during review 📝 WalkthroughWalkthroughThis PR introduces a callback-based probe system that allows the Perry runtime to query whether a handle ID is live in the FFI registry. The FFI layer registers a probe that checks the local registry, and the runtime uses it to disambiguate method dispatch when an ID could match both a timer and an FFI handle, ensuring FFI-backed method calls are not incorrectly routed to timer handlers. ChangesHandle ID Disambiguation Between FFI and Timer Domains
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
…ter_ffi_handle_exists_probe) #5083 added a `js_register_ffi_handle_exists_probe(...)` call inside `register_handle` (handle.rs) for the runtime's handle-vs-timer disambiguation probe. That symbol is defined in perry-runtime, but perry-ffi's `runtime-link` feature is off by default and CI runs `cargo test -p perry-ffi` per-package in isolation (no --workspace feature unification). The handle-registry tests (bare `#[cfg(test)]`) call `register_handle`, so the test binary retained a reference to an unlinked symbol and failed at `cc` with `undefined reference`. Gating the handle tests on `runtime-link` (the pattern other perry-ffi test modules use) would drop them from CI entirely (the per-package loop never enables the feature). Instead add a test-only no-op stub gated `#[cfg(all(test, not(feature = "runtime-link")))]` so the registry tests keep running and the binary links; the real extern declaration is gated out of that exact config to avoid an E0428 name clash. The stub can never collide with perry-runtime's real definition (it exists only in perry-ffi's own test binary with runtime-link off). Verified: `cargo test -p perry-ffi` links + passes (13); `cargo test -p perry-ffi --features runtime-link` uses the real symbol + passes (35).
…ter_ffi_handle_exists_probe) (#5088) #5083 added a `js_register_ffi_handle_exists_probe(...)` call inside `register_handle` (handle.rs) for the runtime's handle-vs-timer disambiguation probe. That symbol is defined in perry-runtime, but perry-ffi's `runtime-link` feature is off by default and CI runs `cargo test -p perry-ffi` per-package in isolation (no --workspace feature unification). The handle-registry tests (bare `#[cfg(test)]`) call `register_handle`, so the test binary retained a reference to an unlinked symbol and failed at `cc` with `undefined reference`. Gating the handle tests on `runtime-link` (the pattern other perry-ffi test modules use) would drop them from CI entirely (the per-package loop never enables the feature). Instead add a test-only no-op stub gated `#[cfg(all(test, not(feature = "runtime-link")))]` so the registry tests keep running and the binary links; the real extern declaration is gated out of that exact config to avoid an E0428 name clash. The stub can never collide with perry-runtime's real definition (it exists only in perry-ffi's own test binary with runtime-link off). Verified: `cargo test -p perry-ffi` links + passes (13); `cargo test -p perry-ffi --features runtime-link` uses the real symbol + passes (35). Co-authored-by: Ralph Küpper <ralph2@skelpo.com>
Summary
`node:http2` was at 8/9 in the in-repo node-suite (the mandate's "44%" was stale; the suite already passed 8 cases). The one failure — `session/controls.ts` — produced byte-identical output but hung (exit 124), which also hung the print-and-diff harness on that case.
Root cause
Perry's Node timer ids (`NEXT_TIMER_ID`, counting from 1) and the perry-ffi handle registry ids (`NEXT_HANDLE`, counting from 1) share the pointer-tagged small-integer value space, and both start at 1. So a JS value like `POINTER_TAG | 1` is bit-ambiguous between an HTTP/2 server handle (handle 1) and a `setTimeout` id (timer 1).
`js_native_call_method`'s timer-method dispatch arm fires on `is_known_timer_id(id)` alone. When `controls.ts` had both an HTTP/2 server (handle 1) and a `setTimeout` guard (timer 1) alive, `server.close()` (`0x7ffd…0001`) was swallowed by the timer arm → `clearTimeout(1)` instead of `js_node_http2_server_close`. The server never closed, `base.listening` stayed `true`, `js_node_http_server_has_active` kept returning 1, and the main event loop parked on the pump condvar forever.
The handle-registry docs already anticipate this class of bug ("handle families that participate in generic property/method dispatch reserve disjoint visible id ranges") — timers and the FFI registry simply both used `[1, 0x40000)`.
Fix
An authoritative FFI-handle-existence probe:
Provably inert for pure-JS / test262: with no `register_handle` the probe is never registered and `ffi_handle_exists` returns `false`, leaving the timer dispatch arm byte-identical to before.
Results (local, node v26 print-and-diff)
Known limitation: `controls.ts` is Node-side flaky
After the hang fix, `controls.ts` still diffs intermittently because the server `'session'` vs client `'connect'` event ordering is a genuine loopback race in node v26 (measured server-first 12 : client-first 8 over 20 node runs). `sequential-session-cleanup.ts` deterministically requires the opposite order (client-first), so no single deterministic Perry ordering can match both. Perry keeps client-first, which keeps the other 8 http2 cases deterministically green; `controls` passes whenever Node's coin-flip lands client-first. This is Node nondeterminism, not a Perry correctness gap.
Files
Summary by CodeRabbit
.close()calls work as expected).