Skip to content

fix(cli): verify forward listener before success#1880

Merged
johntmyers merged 4 commits into
NVIDIA:mainfrom
shiju-nv:fix/forward-listener-health
Jun 22, 2026
Merged

fix(cli): verify forward listener before success#1880
johntmyers merged 4 commits into
NVIDIA:mainfrom
shiju-nv:fix/forward-listener-health

Conversation

@shiju-nv

@shiju-nv shiju-nv commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR makes openshell forward report success only after the local SSH forward listener is actually reachable. It also makes background forwards fail closed when the forked SSH process cannot be tracked, so OpenShell does not leave users with an unusable or unmanageable forward.

Related Issue

Fixes #1878

Changes

  • Wait for a connectable local listener before printing foreground forward success.
  • Fail background forwarding when OpenShell cannot discover the forked SSH process PID.
  • Probe the local listener before writing the background forward PID file.
  • Terminate the tracked background SSH process if listener health never becomes reachable.
  • Use connectable loopback probe hosts for wildcard binds such as 0.0.0.0 and ::.
  • Add focused unit coverage for reachable and missing forward listeners.
  • Document that OpenShell prints forwarded URLs only after listener health is proven, and that background forwards must be tracked for forward list and forward stop.

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@shiju-nv shiju-nv requested review from a team, derekwaynecarr and mrunalp as code owners June 11, 2026 22:48
@johntmyers johntmyers self-assigned this Jun 11, 2026
@johntmyers

Copy link
Copy Markdown
Collaborator

gator-agent

PR Review Status

Validation: This PR is project-valid because it directly fixes reproducible openshell forward behavior from #1878: success should only be reported after a reachable local listener exists, and background forwards should fail closed when they cannot be tracked.
Head SHA: c7ac0b929c9aa61fada4e522ba0abffc84e5a090

Review findings:

  • crates/openshell-cli/src/ssh.rs:513: terminate_forward_pid shells out to kill through PATH. Please use nix::sys::signal::kill directly and guard against unsafe PID values before signaling.
  • crates/openshell-cli/tests/sandbox_create_lifecycle_integration.rs:1417: the happy-path test was updated, but the PR still needs the direct regression test from Sandbox forward reports success without a reachable local listener #1878 where fake ssh exits 0, no background PID is discoverable, and no listener exists. Please assert sandbox_forward(..., background = true, ...) returns an error.

Docs: Fern docs were updated under docs/sandboxes/manage-sandboxes.mdx; no navigation update appears necessary for this inline behavior note.

Next state: gator:in-review

@johntmyers johntmyers added the gator:in-review Gator is reviewing or awaiting PR review feedback label Jun 11, 2026
@johntmyers

Copy link
Copy Markdown
Collaborator

gator-agent

Re-check After Author Update

I re-evaluated latest head 301909f13c11d6dce811c7c0b8f005bf54a08774 after shiju-nv's June 12, 2026 follow-up commit.

Disposition: partially resolved.

Resolved items:

  • terminate_forward_pid no longer shells out to kill; it now uses nix::sys::signal::kill with PID range guards.
  • The background regression from Sandbox forward reports success without a reachable local listener #1878 is now covered: fake ssh exits 0, fake pgrep finds no PID, no listener exists, and sandbox_forward(..., background = true, ...) must return an error without writing a PID file.

Remaining items:

  • crates/openshell-cli/src/ssh.rs:381: before terminating the PID returned by find_ssh_forward_pid, please re-validate that it still matches the expected SSH forward, similar to the existing pid_matches_forward guard used by stop_forward. Otherwise a stale or spoofed same-user PID could be signaled. Security mapping: CWE-20.
  • crates/openshell-cli/src/ssh.rs:37: the old 2s foreground grace period is now a hard listener-readiness deadline for both foreground and background startup. Please either raise/rename this as an explicit readiness timeout, for example 10s, or explain why 2s is sufficient for slow SSH auth/proxy startup.

Non-blocking suggestion: consider adding a direct foreground regression test where fake ssh exits 0 without opening a listener, since the foreground path changed materially too.

Docs: Fern docs were updated under docs/sandboxes/manage-sandboxes.mdx; no navigation update appears necessary.

Next state: gator:in-review

@shiju-nv shiju-nv force-pushed the fix/forward-listener-health branch from 1678fe7 to 4583b5d Compare June 12, 2026 07:48
@johntmyers

Copy link
Copy Markdown
Collaborator

gator-agent

Re-check After Author Update

I re-evaluated latest head 4583b5d23c2fccab9b87eb78d3212633ae424cd2 after shiju-nv's June 12, 2026 follow-up commit.

Disposition: partially resolved.

Resolved items:

  • terminate_forward_pid now re-validates the discovered PID with pid_matches_forward immediately before signaling.
  • The listener-readiness timeout is now separated from PID discovery and raised to 10s for foreground and background startup.
  • The direct foreground regression test was added.

Remaining items:

  • crates/openshell-core/src/forward.rs:52 and crates/openshell-core/src/forward.rs:123: PID discovery and validation still use substring-style matching for the -L forward argument. A requested port such as 80 can match an existing 8080:127.0.0.1:8080 forward, and this PR can now terminate the discovered PID on listener-readiness failure. Please make the forward-argument match exact before cleanup can signal the process, for example by checking parsed/split command arguments for either 80:127.0.0.1:80 or a bind-prefixed argument ending in :80:127.0.0.1:80, and have find_ssh_forward_pid return only candidates that pass the exact matcher. Security mapping: CWE-20.

Docs: Fern docs remain updated under docs/sandboxes/manage-sandboxes.mdx; no navigation update appears necessary.

Checks: required gates are currently green, but review feedback remains unresolved.

Next state: gator:in-review

@johntmyers

Copy link
Copy Markdown
Collaborator

gator-agent

Author Follow-Up Nudge

This PR has been in gator:in-review for more than 48 business hours with unresolved review feedback.

@shiju-nv, please respond to the review comments or push an update. The remaining blocking item is the exact -L forward-argument matching in crates/openshell-core/src/forward.rs: PID discovery and validation should not use substring-style matching before the new listener-failure cleanup path can signal a discovered process.

Current head SHA: 4583b5d23c2fccab9b87eb78d3212633ae424cd2

Next state: gator:in-review

@johntmyers

Copy link
Copy Markdown
Collaborator

gator-agent

Author Follow-Up Nudge

This PR is still in gator:in-review with unresolved review feedback after the prior June 15, 2026 nudge.

@shiju-nv, please update the exact -L forward-argument matching in crates/openshell-core/src/forward.rs: PID discovery and validation should not use substring-style matching before the listener-failure cleanup path can signal a discovered process.

Current head SHA: 4583b5d23c2fccab9b87eb78d3212633ae424cd2

Checks are currently green, but gator cannot move this PR to gator:watch-pipeline or gator:approval-needed until the remaining review item is resolved or explicitly waived by a maintainer.

Next state: gator:in-review

@johntmyers

Copy link
Copy Markdown
Collaborator

gator-agent

PR Review Status

Validation: This PR remains project-valid because it fixes the openshell forward readiness and background tracking failure described in #1878.
Head SHA: 4583b5d23c2fccab9b87eb78d3212633ae424cd2

Review findings:

  • Blocking: crates/openshell-core/src/forward.rs:55 and crates/openshell-core/src/forward.rs:123 still match SSH -L forward arguments by substring. A requested port such as 80 can match an existing 8080:127.0.0.1:8080 forward, and the new listener-failure cleanup path can then terminate the wrong process. Please make PID discovery/validation inspect the exact -L argument before any cleanup signal is sent, and add a regression test for 80 not matching 8080. Security mapping: CWE-697.
  • Warning: crates/openshell-cli/src/ssh.rs:395 writes the PID file after listener readiness. If that write fails, sandbox_forward returns an error but leaves the background SSH process running and untracked. Please terminate the validated PID before returning the write error so background mode stays fail-closed.

Docs: Fern docs are adequate for this UX change; docs/sandboxes/manage-sandboxes.mdx now states that URLs print only after listener reachability and that background forwards must be locally tracked.

Checks: required gates are currently green, but review feedback remains unresolved.

Next state: gator:in-review

@johntmyers

Copy link
Copy Markdown
Collaborator

gator-agent

PR Review Status

Validation: This PR remains project-valid because it fixes the openshell forward readiness and background tracking failure described in #1878.
Head SHA: 4583b5d23c2fccab9b87eb78d3212633ae424cd2

Review findings:

  • Blocking: crates/openshell-core/src/forward.rs:55 and crates/openshell-core/src/forward.rs:123 still match SSH -L forward arguments by substring. A requested port such as 80 can match an existing 8080:127.0.0.1:8080 forward for the same sandbox, and the new listener-failure cleanup can then terminate the wrong forward. Please make PID discovery and pid_matches_forward verify an exact -L argument match, and add a regression test that 80 does not match 8080. Security mapping: CWE-697 / CWE-20.
  • Blocking: crates/openshell-cli/src/ssh.rs:395 can return a write_forward_pid(...) error after listener readiness succeeds while leaving the background SSH process running and untracked. Please terminate the revalidated PID before returning the write error, and add a focused test for PID-file write failure after listener readiness.

Docs: Fern docs were updated in docs/sandboxes/manage-sandboxes.mdx; no navigation update appears necessary for this inline behavior note.

Next state: gator:in-review

shiju-nv added 3 commits June 22, 2026 12:59
Wait for a connectable local forward listener before reporting
foreground forwarding success.

Fail background forwarding when the forked SSH process cannot be
tracked, probe the listener before writing the PID file, and
terminate the tracked SSH process if the listener never opens.

Document that forwarded URLs are printed only after listener
health is proven.

Signed-off-by: Shiju <shiju@nvidia.com>
Signed-off-by: Shiju <shiju@nvidia.com>
Signed-off-by: Shiju <shiju@nvidia.com>
@shiju-nv shiju-nv force-pushed the fix/forward-listener-health branch from 4583b5d to e4b2cb8 Compare June 22, 2026 11:03
@shiju-nv shiju-nv requested a review from maxamillion as a code owner June 22, 2026 11:03
@johntmyers

Copy link
Copy Markdown
Collaborator

gator-agent

Re-check After Author Update

I re-evaluated latest head e4b2cb8c056dc0dd072c8e72241198354978337a after shiju-nv's June 22, 2026 follow-up commit.

Disposition: partially resolved.

Resolved items:

  • crates/openshell-core/src/forward.rs no longer uses substring-style matching for SSH -L ownership checks. PID discovery now validates a candidate command line against the exact local-forward argument shape and the expected sandbox id, so a request for 80 does not match an existing 8080 forward before cleanup can signal a process.
  • Background listener-failure cleanup and PID-file write-failure cleanup both revalidate the PID before signaling.
  • The new focused tests cover port/sandbox collisions, command spoofing, legacy PID records, foreground listener readiness, listener-failure teardown, and PID-file write-failure teardown.

Remaining item:

  • Blocking: crates/openshell-cli/src/ssh.rs:356-385 still starts background mode with ssh -f and then reconstructs the long-lived process from the process table. If ssh -f successfully leaves a live listener behind but PID discovery fails because pgrep is unavailable, process command-line reads are blocked, or the matcher fails closed, sandbox_forward returns an error without an owned child handle to terminate. That can leave a reachable but untracked forward that forward list and forward stop cannot manage. Please make managed background forwarding own the SSH child PID directly, or otherwise add a cleanup path and regression test for the case where a real listener starts but PID discovery returns no match. Security mapping: CWE-404 / CWE-269.

Docs: Fern docs remain adequate for this UX change; docs/sandboxes/manage-sandboxes.mdx states that URLs print only after listener reachability and that background forwards must be locally tracked.

Checks: OpenShell / Branch Checks is currently pending, and review feedback remains unresolved.

Next state: gator:in-review

Require background forward cleanup to prove a candidate PID is the
OpenShell-generated SSH forwarding process before reporting it active
or signaling it. Match the exact -L forward argument and sandbox ID
instead of by substring, so a request for port 80 cannot collide with
an existing 8080 forward and terminate the wrong process. Reject legacy
PID records as non-authoritative and route port-only lookup through
validated live forward state.

Recover the exact process argv from /proc/<pid>/cmdline on Linux so a
ProxyCommand whose executable path contains whitespace is parsed
correctly; other platforms keep the best-effort ps command-line parse.

After listener readiness succeeds, terminate the revalidated background
ssh process if the PID-file write fails, instead of returning early and
leaving a reachable but untracked forward that forward stop/list cannot
manage.

Cover adversarial matcher cases (port and sandbox-id collisions,
command spoofing, remote-command token confusion, whitespace executable
paths) and the PID-file write-failure teardown path with focused tests.

Signed-off-by: Shiju <shiju@nvidia.com>
@shiju-nv shiju-nv force-pushed the fix/forward-listener-health branch from e4b2cb8 to e3e1d55 Compare June 22, 2026 11:57
@johntmyers

Copy link
Copy Markdown
Collaborator

gator-agent

Re-check After Author Update

I re-evaluated latest head e3e1d5523c5d539072ac565312122a6a037e72f0 after shiju-nv's June 22, 2026 follow-up commit that replaced ssh -f background forwarding with an owned ssh -N child.

Disposition: resolved.

Resolved items:

  • Background forwarding now records the owned SSH child PID directly instead of reporting failure after a separate process-table PID discovery miss, so the prior reachable-but-untracked forward blocker is resolved.
  • Listener-readiness failure and PID-file write failure both clean up the owned background child.
  • Exact -L port matching, sandbox-id matching, legacy PID record handling, and the previous PID-file write cleanup feedback remain addressed.
  • The independent reviewer found no blocking issues on the latest diff.

Remaining items:

  • No blocking review items remain.

Docs: Fern docs remain adequate for this UX change; docs/sandboxes/manage-sandboxes.mdx documents listener readiness and local background tracking. No navigation update appears necessary.

Checks: test:e2e is being applied because this changes runtime forwarding behavior. Branch checks are still pending.

Next state: gator:watch-pipeline

@johntmyers johntmyers added test:e2e Requires end-to-end coverage gator:watch-pipeline Gator is monitoring PR CI/CD status and removed gator:in-review Gator is reviewing or awaiting PR review feedback labels Jun 22, 2026
@github-actions

Copy link
Copy Markdown

Label test:e2e applied for e3e1d55. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

@johntmyers johntmyers added gator:approval-needed Gator completed review; maintainer approval needed and removed gator:watch-pipeline Gator is monitoring PR CI/CD status labels Jun 22, 2026
@johntmyers

Copy link
Copy Markdown
Collaborator

gator-agent

Maintainer Approval Needed

Gator validation and PR monitoring are complete.

Validation: This PR is project-valid because it fixes the openshell forward readiness and background tracking failure described in #1878.
Review: No blocking gator review findings remain on head e3e1d5523c5d539072ac565312122a6a037e72f0; the latest author update resolved the owned background SSH child and cleanup concerns.
Docs: Fern docs were updated in docs/sandboxes/manage-sandboxes.mdx; no navigation update appears necessary.
Checks: Required gates are green: OpenShell / Branch Checks, OpenShell / Helm Lint, and OpenShell / E2E.
E2E: test:e2e was applied for the runtime forwarding behavior change, and the E2E gate passed.

Human maintainer approval or merge decision is now required.

@johntmyers johntmyers merged commit ffc102a into NVIDIA:main Jun 22, 2026
64 of 66 checks passed
@johntmyers

Copy link
Copy Markdown
Collaborator

gator-agent

Monitoring Complete

Monitoring is complete because this PR has merged.

Final status: Gator had moved this PR to gator:approval-needed after validation, review, docs, Branch Checks, Helm Lint, and E2E were complete on head e3e1d5523c5d539072ac565312122a6a037e72f0.

I removed the active gator:* label because there is nothing left for gator to monitor on this PR.

@johntmyers johntmyers removed the gator:approval-needed Gator completed review; maintainer approval needed label Jun 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:e2e Requires end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Sandbox forward reports success without a reachable local listener

2 participants