Guard NetworkTransport connect continuation against stale reconnectio…#178
Open
algal wants to merge 1 commit intomodelcontextprotocol:mainfrom
Open
Guard NetworkTransport connect continuation against stale reconnectio…#178algal wants to merge 1 commit intomodelcontextprotocol:mainfrom
algal wants to merge 1 commit intomodelcontextprotocol:mainfrom
Conversation
|
This issue, along with many others, has already been resolved in my fork, which @movetz, @stallent, and I are discussing merging into this repository. Instead of using a checked continuation with callback-based state handling (which requires the UUID-based guard proposed in this PR), my fork uses private func waitForConnectionReady() async throws {
let stateStream = AsyncStream<NWConnection.State> { continuation in
connection.stateUpdateHandler = { state in
continuation.yield(state)
switch state {
case .ready, .failed, .cancelled:
continuation.finish()
default:
break
}
}
}
for await state in stateStream {
switch state {
case .ready: return
case .failed(let error): throw error
case .cancelled: throw MCPError.internalError("Connection cancelled")
// ...
}
}
}This approach inherently avoids the double-resume race condition because:
|
Author
|
@DePasqualeOrg Cool! It sounds like your solution is better. I hope it is merged. :) |
|
It won't be, unless it's replicated by someone else, because the new maintainers of this package want to take a patchwork approach to improvements. My full-featured fork will continue to be developed separately here: DePasqualeOrg/swift-mcp |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR fixes a crash in
NetworkTransport.connect()where a delayed reconnection task could resume a stale continuation after a new connect attempt started, causing a crash.Motivation and Context
The bug occurs when a delayed reconnection task from a previous connect attempt runs after a new connect attempt has started.
While using iMCP.app, I saw that app crash when a client (
imcp-server) started and then died quickly. Crash logs showedCheckedContinuation.resume(throwing:)fromNetworkTransport.handleReconnection. What was happening was that whenNetworkTransport.connect()is called, it uses a checked continuation and schedules reconnection work with backoff on failure/cancellation. If a reconnect attempt is scheduled and a newconnect()call happens before that delayed task fires, the old task can still resume the previous continuation. That double‑resumes aCheckedContinuationand traps (EXC_BREAKPOINT / SIGTRAP).How Has This Been Tested?
Trying this with the iMCP.app is the most easy way to repro the crash, and what I used to ensure that the fix removed the crash. You just run iMCP, then try running its
imcp-serverand CTRL-C it immmediately a few times. Eventually, this brings down the iMCP process as well.I have not produced a test which reproduces the bug in isolation, outside of iMCP, but I can do so if that would help. The sequence would work as follows
Repro (before fix):
NetworkTransport(default reconnection config).connect(), then cancel the underlying connection quickly enough to enterhandleReconnection.connect()again before the reconnection backoff delay elapses.EXC_BREAKPOINT / SIGTRAP).I found that
swift testfails on Swift 6.2.3 due to existing strict-concurrency errors inTests/MCPTests/ClientTests.swift.How this fix works
The continuation‑resume guard was a single boolean reset on each
connect()call. Delayed reconnection tasks from a previous connect attempt could still run after a new connect had reset the guard, allowing a second resume on the old continuation.This fix works as follows:
connectContinuationID).handleConnectionReady/Failed/CancelledandhandleReconnection.Breaking Changes
None.
Besides the fix, behavior is unchanged; this only prevents stale tasks from resolving an old continuation.
This is an internal-only change; the only observable difference is that stale reconnection tasks no longer resolve a previous
connect()attempt.Types of changes
Checklist
Additional context on use of AI
I (algal) have reviewed this PR, performed the before/after tests above, and I believe the analysis I have stated above is correct. I also have a background in Swift and of native development. However, I am not deeply familiar with this code base, and I did rely on AI in the developing this fix and drafting this PR text.
I know there's a lot of slop PRs flying around these days, so I wanted to be transparent about that. I am happy to be corrected or steered