fix: ignore SIGPIPE to prevent server crash on S3 idle connection timeout #8768

Open
Its-Tanay wants to merge 2 commits into triton-inference-server:main from Its-Tanay:fix/sigpipe-crash

Conversation

Its-Tanay commented May 6, 2026

What does the PR do?

This PR globally sets the SIGPIPE disposition to SIG_IGN inside the tritonserver executable.

This prevents an unhandled SIGPIPE from terminating the entire server with exit code 141 when the AWS C++ SDK writes to an S3 keep-alive connection that timed out during a slow model initialization. With the signal ignored, the write() system call instead fails with errno set to EPIPE, which lets the AWS SDK handle the error and reconnect to S3.
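To illustrate the mechanism described above, here is a minimal sketch (not the code from this PR) that ignores SIGPIPE and then writes to a pipe whose read end is already closed. The helper names `IgnoreSigpipe` and `WriteToClosedPipe` are hypothetical and exist only for this example; a local pipe stands in for the timed-out S3 connection.

```cpp
#include <cerrno>
#include <csignal>
#include <unistd.h>

// Hypothetical helper (not from the PR): set the process-wide SIGPIPE
// disposition to SIG_IGN. Returns true on success.
bool IgnoreSigpipe()
{
  struct sigaction sa = {};
  sa.sa_handler = SIG_IGN;
  sigemptyset(&sa.sa_mask);
  return sigaction(SIGPIPE, &sa, nullptr) == 0;
}

// Hypothetical demo: write one byte to a pipe with no reader.
// Returns the errno of the failed write, 0 if the write unexpectedly
// succeeded, or -1 if the pipe could not be created.
// Call IgnoreSigpipe() first; with the default disposition this write
// would terminate the process instead of returning.
int WriteToClosedPipe()
{
  int fds[2];
  if (pipe(fds) != 0) {
    return -1;
  }
  close(fds[0]);  // no reader left, so the next write hits a broken pipe

  const char byte = 'x';
  errno = 0;
  const ssize_t n = write(fds[1], &byte, 1);
  const int err = (n == -1) ? errno : 0;
  close(fds[1]);
  return err;
}
```

Calling `IgnoreSigpipe()` and then `WriteToClosedPipe()` should yield `EPIPE`, which the caller can handle like any other I/O error. The reported exit code 141 follows the common shell convention of 128 plus the signal number, and SIGPIPE is signal 13 on Linux.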

Checklist

  • I have read the Contribution guidelines and signed the Contributor License Agreement
  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • I ran pre-commit locally (pre-commit install, pre-commit run --all)
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

  • fix

Related PRs:

N/A

Where should the reviewer start?

src/triton_signal.cc -> RegisterSignalHandler()

Test plan:

Because this changes an OS-level signal disposition to handle a network idle timeout, standard unit tests cannot reliably cover it. I have verified the fix manually. The deterministic reproduction steps, which use a dummy Python model to force the required 300s S3 idle timeout, are documented in the linked GitHub issue.
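The S3 idle timeout itself is hard to reproduce in a unit test, but the underlying signal behavior can be observed locally. The following is a rough sketch under that assumption, not part of this PR: a forked child writes to a pipe with no reader, once with the default SIGPIPE disposition and once with SIG_IGN. The helper name `ChildTerminationSignal` is hypothetical.

```cpp
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not from the PR): fork a child that writes to a pipe
// with no reader. Returns the signal number that killed the child, 0 if the
// child exited cleanly after write() failed, or -1 on any other outcome.
int ChildTerminationSignal(bool ignore_sigpipe)
{
  const pid_t pid = fork();
  if (pid < 0) {
    return -1;
  }
  if (pid == 0) {
    // Child: make sure SIGPIPE is deliverable, then pick the disposition.
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGPIPE);
    sigprocmask(SIG_UNBLOCK, &set, nullptr);
    signal(SIGPIPE, ignore_sigpipe ? SIG_IGN : SIG_DFL);

    int fds[2];
    if (pipe(fds) != 0) {
      _exit(2);
    }
    close(fds[0]);  // drop the only reader
    const char byte = 'x';
    const ssize_t n = write(fds[1], &byte, 1);
    _exit(n == -1 ? 0 : 1);  // reached only if SIGPIPE did not kill us
  }

  // Parent: inspect how the child ended.
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) {
    return -1;
  }
  if (WIFSIGNALED(status)) {
    return WTERMSIG(status);
  }
  return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

With the default disposition the child is expected to be killed by SIGPIPE; with SIG_IGN it should survive the failed write and exit normally. This checks only the signal semantics and does not exercise the AWS SDK reconnect path.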

Caveats:

None. The fix is explicitly placed in the standalone tritonserver executable (triton_signal.cc) rather than triton-core to ensure there are no global signal side-effects for users embedding Triton as a C++ library.

Background

This was discovered while loading a sequence of models from an S3 model repository, where one model took > 5 minutes to initialize on the GPU, causing the AWS connection to sit idle and be killed by the network layer.

Related Issues:

pskiran1 requested review from pskiran1, whoisj and yinggeh May 14, 2026 09:23
whoisj added the bug and PR: fix labels May 14, 2026

Labels

bug (Something isn't working), PR: fix (A bug fix)

Development

Successfully merging this pull request may close these issues.

Bug: tritonserver crashes with Exit Code 141 (SIGPIPE) when loading models from S3 due to idle connection timeout
