feat: add MCP server reconnect lifecycle on repeated failures #71
chaizhenhua merged 10 commits into awakenworks:main from
Conversation
This is the MCP client side, not the MCP server side. The intended retry flow seems to be:

That model is reasonable for a shared MCP client, but the implementation does not actually realize that retry budget: after the first failed reconnect, the slot is never retried again.

I also have a larger shared-state concern.

There is also a snapshot regression here. The existing behavior is effectively "keep the last good snapshot on refresh failure", but once reconnect is involved that guarantee is no longer upheld.

I also think the stdio transport path regresses cleanup behavior; this PR removes some of the existing teardown.

For this architecture, I think the better model is:

At minimum, I think this needs more tests before merge:

If we want to reduce risk, I would strongly prefer splitting this into two PRs:
Good direction overall, but I think this still needs a few fixes before merge:

- Auto-reconnect stops after the first failed reconnect because inactive slots are skipped by refresh_state().
- Non-blocking: old McpTool handles may still point to the old transport after reconnect.
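A minimal sketch of the skip bug, with hypothetical names (the actual refresh_state() signature may differ): if refresh_state() filters to active slots before any retry logic runs, a slot parked as inactive after one failed reconnect is never visited again, so the remaining retry budget is never spent.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum SlotState {
    Active,
    Inactive,
}

pub struct Slot {
    pub state: SlotState,
    pub retries_left: u32,
}

/// Buggy shape: inactive slots are filtered out before any retry logic runs.
pub fn refresh_state_buggy(slots: &mut [Slot]) {
    for slot in slots.iter_mut().filter(|s| s.state == SlotState::Active) {
        // ...health checks run here; inactive slots never reach this point.
        let _ = slot;
    }
}

/// Fixed shape: inactive slots with budget left still get a reconnect attempt.
pub fn refresh_state_fixed(slots: &mut [Slot]) {
    for slot in slots.iter_mut() {
        if slot.state == SlotState::Inactive && slot.retries_left > 0 {
            slot.retries_left -= 1;
            // attempt_reconnect() would run here; assume it succeeds.
            slot.state = SlotState::Active;
        }
    }
}
```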
Thanks, this review was correct. The original reconnect implementation was not safe enough. I addressed this in the follow-up commit. What changed:
I also added coverage for the failure modes you called out:
I reran the reconnect-focused tests. If you still prefer this to be split further, I can separate the lifecycle-state hardening into its own PR.
Thanks for the refactor; the lifecycle + health model makes sense overall 👍 After going through the code, I see one correctness issue and a testing gap:

In attempt_reconnect() → disconnect_server(), runtime is taken before transport.close().await?, so an early return on a close() error can leave the slot in a partially transitioned state.

Current tests don't seem to cover: reconnect after the failure threshold. Given this is core lifecycle logic, these should be covered.

Summary:
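A minimal sketch of one way to avoid the partial transition, with stand-in types (the real code is async and lock-protected): detach the runtime and commit the state transition first, then treat close() as best-effort teardown, so a close() error is surfaced but can no longer leave the slot half transitioned.

```rust
#[derive(Debug, PartialEq)]
pub enum State {
    Connected,
    Disconnected,
}

pub struct Runtime {
    pub close_fails: bool, // test hook standing in for a flaky transport
}

impl Runtime {
    // Stand-in for transport.close().await.
    pub fn close(self) -> Result<(), String> {
        if self.close_fails {
            Err("close failed".into())
        } else {
            Ok(())
        }
    }
}

pub struct Slot {
    pub state: State,
    pub runtime: Option<Runtime>,
}

pub fn disconnect_server(slot: &mut Slot) -> Result<(), String> {
    // Phase 1: detach and transition atomically; the slot is now consistent.
    let runtime = slot.runtime.take();
    slot.state = State::Disconnected;
    // Phase 2: best-effort teardown. An error is surfaced to the caller,
    // but the slot cannot be left in a partially transitioned state.
    match runtime {
        Some(rt) => rt.close(),
        None => Ok(()),
    }
}
```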
Fixed the reconnect failure path so a close() error no longer leaves the server slot in a partially transitioned state. Also added lifecycle coverage for threshold-triggered reconnect and repeated failures to reconnect.
Thanks, this is much closer now, and I agree with the overall direction of moving MCP servers to an explicit lifecycle model with reconnect and per-server health. I still recommend "request changes" before merge for two reasons:
With those in place, I think this PR will be in a much safer shape for a shared MCP client.
Thanks. Addressed. toggle(server, false) is now atomic on failure: the disable only commits after a successful close(). I also added/verified regression coverage for:
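A sketch of the atomic disable described above, with assumed names and a stand-in transport: the Disabled state only commits after close() succeeds, and a close() error is reported as an explicit failure state rather than a half-applied disable.

```rust
#[derive(Debug, PartialEq)]
pub enum State {
    Connected,
    Disabled,
    CloseFailed,
}

pub struct Runtime {
    pub close_fails: bool, // stand-in for a transport whose close() errors
}

impl Runtime {
    pub fn close(self) -> Result<(), String> {
        if self.close_fails {
            Err("close failed".into())
        } else {
            Ok(())
        }
    }
}

pub struct Slot {
    pub state: State,
    pub runtime: Option<Runtime>,
}

pub fn toggle_off(slot: &mut Slot) -> Result<(), String> {
    let Some(rt) = slot.runtime.take() else {
        slot.state = State::Disabled; // nothing to tear down
        return Ok(());
    };
    match rt.close() {
        Ok(()) => {
            slot.state = State::Disabled; // commit only after a clean close
            Ok(())
        }
        Err(e) => {
            // Do not commit the disable; surface the failure explicitly.
            slot.state = State::CloseFailed;
            Err(e)
        }
    }
}
```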
The current PR is directionally correct, but I would still optimize the design around stable server handles, generation-bound runtime leases, explicit transitional lifecycle states, and a separate published catalog.

The remaining issue is that health updates are still keyed only by server_name, while runtime instances have no generation, so late results from an old transport can still affect a new connection. In addition, disable currently relies on state-level rollback even though close() is side-effecting at the transport layer, especially for stdio.

A cleaner design is to detach the runtime, treat close as non-transactional teardown, keep publication separate from connectivity, and make all lifecycle outcomes explicit manager states. That gives you stale-result immunity, honest teardown semantics, stable last-good snapshots, and consistent tool/prompt/resource behavior.
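The generation-gating part can be sketched in a few lines (hypothetical names): each attached runtime gets a generation from the slot, and a health report only lands if its generation matches the slot's current one, so late results from an old transport cannot touch a new connection.

```rust
pub struct ServerSlot {
    pub generation: u64,
    pub consecutive_failures: u32,
}

impl ServerSlot {
    /// Bump the generation when a new runtime is attached; the returned
    /// value is what that runtime must present with every health report.
    pub fn attach_new_runtime(&mut self) -> u64 {
        self.generation += 1;
        self.consecutive_failures = 0;
        self.generation
    }

    /// Record a tool-call failure, keyed by the reporting runtime's generation.
    pub fn record_failure(&mut self, reporter_generation: u64) {
        if reporter_generation == self.generation {
            self.consecutive_failures += 1;
        }
        // Otherwise: stale report from a previous runtime; drop it.
    }
}
```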
The way out of the current clone-and-swap maze is to replace it with four clear layers: a stable server slot, a generation-bound runtime, a published snapshot, and an explicit lifecycle state machine.

Right now the PR still has no generation on McpServerRuntime, records tool-call success and failure only by server_name, clones slots that still share the same Arc transport, and relies on rollback even though stdio close() already mutates the transport by marking it dead, draining pending requests, clearing progress subscribers, and taking or terminating the child process. The code also already splits "published tools" from "live connectivity", because snapshot rebuild keeps cached published tools for anything except Disabled and PermanentlyFailed, while activity checks still require Connected plus runtime.is_some().

The better design is:

- Keep every server permanently in shared state.
- Add a generation/epoch to each runtime so only the current generation can update health.
- Separate the last-good published catalog from the current live runtime.
- Model lifecycle explicitly with states such as Connecting, Connected, Reconnecting, Disabling, Disabled, Disconnected, and PermanentlyFailed.

On disable or reconnect, first detach the old runtime from the slot, then close it outside the lock, instead of treating close() as if it were transactional. That gives you three things at once: no stale old-call results corrupting a new connection, honest teardown semantics, and stable last-good tools during reconnects.
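The lifecycle states named above could be modeled as an enum with a guard on legal transitions. The exact transition table below is an assumption for illustration, not taken from the PR:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Lifecycle {
    Connecting,
    Connected,
    Reconnecting,
    Disabling,
    Disabled,
    Disconnected,
    PermanentlyFailed,
}

/// Reject impossible transitions (e.g. Disabled -> Connected without
/// passing through Connecting) instead of silently clobbering state.
pub fn may_transition(from: Lifecycle, to: Lifecycle) -> bool {
    use Lifecycle::*;
    matches!(
        (from, to),
        (Connecting, Connected)
            | (Connecting, Disconnected)
            | (Connected, Reconnecting)
            | (Connected, Disabling)
            | (Reconnecting, Connected)
            | (Reconnecting, PermanentlyFailed)
            | (Disabling, Disabled)
            | (Disabled, Connecting)
            | (Disconnected, Connecting)
    )
}
```

Making every outcome an explicit state means the manager never has to infer "is this slot usable?" from a combination of flags and `runtime.is_some()`.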
Introduce McpPublishedSnapshot to bind published catalogs to a runtime generation, preventing stale tool-call results from corrupting health on a newer connection. Detach runtime before close on both reconnect and disable paths so close() is non-transactional teardown. Separate close failures from connect failures in reconnect budget accounting.
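The McpPublishedSnapshot idea from the commit message can be sketched like this (field names are assumptions): the last-good catalog survives reconnects, while tool-call outcomes only count toward health when the snapshot's generation matches the current runtime generation.

```rust
pub struct McpPublishedSnapshot {
    pub generation: u64,
    pub tools: Vec<String>,
}

pub struct Server {
    pub current_generation: u64,
    pub published: Option<McpPublishedSnapshot>,
    pub failures: u32,
}

impl Server {
    /// Reconnect bumps the generation but keeps the last-good catalog published.
    pub fn begin_reconnect(&mut self) {
        self.current_generation += 1;
    }

    /// Publish a fresh catalog bound to the current generation.
    pub fn publish(&mut self, tools: Vec<String>) {
        self.published = Some(McpPublishedSnapshot {
            generation: self.current_generation,
            tools,
        });
    }

    /// A failure only counts if it came from the current generation.
    pub fn record_tool_failure(&mut self, snapshot_generation: u64) {
        if snapshot_generation == self.current_generation {
            self.failures += 1;
        }
    }
}
```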
Approving for merge. The latest revisions address the earlier correctness blockers, and the remaining concerns are follow-up design improvements rather than reasons to hold this PR. I'm comfortable merging this now and tracking the rest in separate issues.
Follow-up PR: #107
Summary
This PR adds per-server runtime lifecycle management to the MCP extension so failed servers can be disconnected, refreshed, and reconnected without rebuilding the whole registry.
Changes
Behavior
Notes
Validation