
feat: add MCP server reconnect lifecycle on repeated failures#71

Merged
chaizhenhua merged 10 commits into awakenworks:main from zadawq:feat-mcp-auto-reconnect
Apr 11, 2026
Conversation

@zadawq
Contributor

@zadawq zadawq commented Apr 5, 2026

Summary

This PR adds per-server runtime lifecycle management to the MCP extension so failed servers can be disconnected, refreshed, and reconnected without rebuilding the whole registry.

Changes

  • refactored MCP manager state from a static server list to managed per-server slots
  • added per-server health tracking, failure counters, reconnect attempts, and permanent-failure state
  • added runtime server controls:
    • reconnect(server_name)
    • toggle(server_name, enabled)
    • server_health(server_name)
  • changed refresh flow to:
    • refresh servers individually
    • keep healthy server snapshots when another server fails
    • rebuild the registry snapshot from cached tool definitions
  • added automatic reconnect attempts after consecutive refresh failures
  • added transport lifecycle support with McpToolTransport::close()
  • implemented stdio transport shutdown handling and connection-closed propagation for pending requests
  • implemented HTTP transport close() by clearing MCP session state and cached capabilities
  • updated MCP tests to use per-server health semantics
  • added runtime management tests for disable/idempotency/reconnect-disabled behavior

Behavior

  • a single failing MCP server no longer forces the whole registry refresh to fail
  • disabled servers are explicitly rejected for manual reconnect
  • permanently failed servers are tracked explicitly
  • stdio requests waiting on a dead connection now receive ConnectionClosed instead of hanging on dropped channels
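The ConnectionClosed propagation can be sketched as follows: each pending request holds a channel receiver, and transport shutdown sends an explicit reply to every waiter instead of just dropping the senders. The types and names here are illustrative assumptions, not the crate's actual API.

```rust
// Hypothetical sketch of shutdown propagation for pending requests.
// On transport death, every waiter receives ConnectionClosed explicitly
// rather than a RecvError from a dropped channel.
use std::collections::HashMap;
use std::sync::mpsc;

#[derive(Debug, PartialEq)]
enum Reply {
    Ok(String),          // normal response payload (unused in this sketch)
    ConnectionClosed,    // explicit "transport is dead" signal
}

struct Pending {
    waiters: HashMap<u64, mpsc::Sender<Reply>>, // request id -> reply channel
}

impl Pending {
    /// On transport shutdown, notify every pending request explicitly.
    fn shutdown(&mut self) {
        for (_, tx) in self.waiters.drain() {
            let _ = tx.send(Reply::ConnectionClosed);
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let mut pending = Pending { waiters: HashMap::from([(1u64, tx)]) };
    pending.shutdown();
    assert_eq!(rx.recv().unwrap(), Reply::ConnectionClosed);
    assert!(pending.waiters.is_empty());
}
```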

Notes

  • snapshot rebuild now uses cached tool definitions instead of performing transport I/O during rebuild
  • current reconnect behavior updates newly rebuilt registry tools, but previously retained tool handles may still point at the old transport until they are refreshed by consumers

Validation

  • cargo test -p awaken-ext-mcp manager_toggle -- --nocapture
  • cargo test -p awaken-ext-mcp manager_reconnect_rejects_disabled_server -- --nocapture

@zadawq zadawq force-pushed the feat-mcp-auto-reconnect branch from b75272d to 56ea538 on April 5, 2026 at 15:22
@zadawq zadawq closed this Apr 5, 2026
@zadawq zadawq reopened this Apr 5, 2026
@zadawq zadawq force-pushed the feat-mcp-auto-reconnect branch from 56ea538 to 15c1438 on April 5, 2026 at 15:28
@zadawq zadawq force-pushed the feat-mcp-auto-reconnect branch from 15c1438 to 067ec7e on April 6, 2026 at 06:50
@chaizhenhua
Contributor

This is the MCP client side, not the MCP server side, and in the awaken-server integration this registry/client is shared across agents. Because of that, I think the reconnect
lifecycle needs a stricter correctness bar before merge.

I think ServerDisabled belongs in McpError. It is not a transport failure; it is a manager-level state rejection. The server exists, but the requested operation cannot be
satisfied in the current lifecycle state. ServerPermanentlyFailed can also live there for the same reason, although right now its public semantics still feel incomplete because
only some APIs surface it while list-style APIs silently skip inactive servers.
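A minimal sketch of how those manager-level rejections could sit alongside transport failures in one error type; the two lifecycle variant names follow the review, everything else (fields, helper) is an assumption for illustration.

```rust
// Hypothetical sketch: lifecycle rejections as McpError variants,
// distinct from transport failures. Not the crate's actual API.
#[derive(Debug)]
enum McpError {
    Transport(String),                          // I/O-level failure; may warrant retry
    ServerDisabled { server: String },          // manager-level state rejection
    ServerPermanentlyFailed { server: String }, // reconnect budget exhausted
}

impl McpError {
    /// Transport errors may warrant retry; lifecycle rejections do not.
    fn is_retryable(&self) -> bool {
        matches!(self, McpError::Transport(_))
    }
}

fn main() {
    let e = McpError::ServerDisabled { server: "docs".into() };
    assert!(!e.is_retryable());
    println!("{:?}", e);
}
```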

The intended retry flow seems to be:

  • periodic refresh runs tools/list
  • refresh failures increment consecutive_failures
  • after FAILURE_THRESHOLD = 3, attempt_reconnect() runs
  • reconnect backoff is 1s, 2s, 4s, 8s, 16s
  • after MAX_RECONNECT_ATTEMPTS = 5, the server becomes permanently failed
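The retry budget above can be sketched with the constants the review quotes (FAILURE_THRESHOLD = 3, MAX_RECONNECT_ATTEMPTS = 5, backoff doubling from 1s); the helper names are illustrative, not the actual code.

```rust
// Sketch of the intended retry budget, under the constants quoted above.
const FAILURE_THRESHOLD: u32 = 3;
const MAX_RECONNECT_ATTEMPTS: u32 = 5;

/// Backoff for the nth reconnect attempt (0-based): 1s, 2s, 4s, 8s, 16s.
fn backoff_secs(attempt: u32) -> u64 {
    1u64 << attempt.min(4)
}

/// Should this refresh failure trigger a reconnect attempt? This predicate
/// must keep returning true across future refresh cycles until the budget
/// is exhausted -- the bug described below is that it effectively stops
/// after the first failed reconnect.
fn should_attempt_reconnect(consecutive_failures: u32, reconnect_attempts: u32) -> bool {
    consecutive_failures >= FAILURE_THRESHOLD && reconnect_attempts < MAX_RECONNECT_ATTEMPTS
}

fn main() {
    assert_eq!((0..5).map(backoff_secs).collect::<Vec<_>>(), vec![1, 2, 4, 8, 16]);
    assert!(!should_attempt_reconnect(2, 0)); // below threshold: no reconnect yet
    assert!(should_attempt_reconnect(3, 4));  // last attempt in the budget
    assert!(!should_attempt_reconnect(3, 5)); // budget exhausted -> permanent failure
}
```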

That model is reasonable for a shared MCP client, but the implementation does not actually realize that retry budget. After the first failed reconnect, runtime becomes None, and
future refreshes skip the slot because server_is_active() requires runtime.is_some(). In practice, automatic retry stops after one reconnect failure.

I also have a larger shared-state concern: refresh_state(), reconnect(), and toggle() all remove server state from the shared collection and then await I/O. Since this registry
is shared, concurrent readers can temporarily observe the server as missing or even see an empty server list. That is not a safe pattern for shared registry state.

There is also a snapshot regression here. The existing behavior is effectively “keep the last good snapshot on refresh failure”, but once reconnect is involved that guarantee is
broken: disconnect_server() clears tools_cache, and if reconnect fails, rebuild_snapshot() republishes without that server’s tools.

I also think the stdio transport path regresses cleanup behavior. This PR removes kill_on_drop(true), but the initialization failure paths do not call close(). That can leave
spawned MCP server processes behind.

For this architecture, I think the better model is:

  • keep each server slot in shared state at all times; do not take() / remove() it during async work
  • represent lifecycle explicitly, e.g. Disabled / Connected / Disconnected / PermanentlyFailed
  • separate shared connection lifecycle from tool snapshot publication
  • preserve the last good snapshot until a new successful snapshot is ready
  • route tool execution through a live server handle / manager lookup instead of storing a potentially stale transport inside McpTool
  • feed tool-call transport failures into the same health/reconnect state machine instead of relying only on tools/list refreshes
  • restore explicit stdio cleanup for initialization failures
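The slot-stays-resident model above can be sketched roughly like this; every type, field, and method name here is an illustrative assumption, not the actual implementation.

```rust
// Sketch: the slot stays in shared state at all times; lifecycle is an
// explicit enum, and the published tools are kept separately from the
// live connection so a failed reconnect never drops the last good snapshot.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ServerLifecycle {
    Disabled,
    Connected,
    Disconnected,
    PermanentlyFailed,
}

struct ServerSlot {
    name: String,
    state: ServerLifecycle,
    published_tools: Vec<String>, // last good snapshot, independent of connectivity
}

impl ServerSlot {
    /// A failed reconnect only changes the state field; it never removes
    /// the slot from shared state or drops the published tools.
    fn on_reconnect_failed(&mut self, budget_exhausted: bool) {
        self.state = if budget_exhausted {
            ServerLifecycle::PermanentlyFailed
        } else {
            ServerLifecycle::Disconnected
        };
    }
}

fn main() {
    let mut slot = ServerSlot {
        name: "docs".into(),
        state: ServerLifecycle::Connected,
        published_tools: vec!["search".into()],
    };
    slot.on_reconnect_failed(false);
    assert_eq!(slot.state, ServerLifecycle::Disconnected);
    assert_eq!(slot.published_tools, vec!["search".to_string()]); // snapshot preserved
    println!("{} -> {:?}", slot.name, slot.state);
}
```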

At minimum, I think this needs more tests before merge:

  • reconnect triggers only after the failure threshold
  • reconnect success resets health and republishes tools
  • reconnect failure budget continues across future refresh cycles until permanent failure
  • reconnect failure does not drop the last good snapshot
  • one failed server does not affect other servers
  • concurrent reads during refresh/reconnect/toggle do not observe missing servers
  • stdio initialization failure does not leak child processes
  • close() is idempotent

If we want to reduce risk, I would strongly prefer splitting this into two PRs:

  • first land per-server health plus explicit transport close / cleanup
  • then land reconnect lifecycle once the shared-state model is corrected

@chaizhenhua
Contributor

Good direction overall, but I think this still needs a few fixes before merge:

  • Auto-reconnect stops after the first failed reconnect because inactive slots are skipped by refresh_state().
  • refresh_state() / reconnect() move servers out of shared state across await, which can expose inconsistent state to concurrent readers.
  • Reconnect should keep the last known good snapshot until the new connection is fully ready; currently a failed reconnect can drop previously discovered tools.
  • stdio child cleanup should also be guaranteed on early initialize/connect failures.

Non-blocking: old McpTool handles may still point to the old transport after reconnect.

@zadawq
Contributor Author

zadawq commented Apr 9, 2026

Thanks, this review was correct. The original reconnect implementation was not safe enough
for the shared client/registry model used by awaken-server.

I addressed this in the follow-up commit: 🛠️ fix(mcp): harden shared client reconnect lifecycle.

What changed:

  • ServerDisabled and ServerPermanentlyFailed now remain manager-level McpError states
    rather than transport failures.
  • The registry no longer removes/takes server slots out of shared state during async
    lifecycle work. Server entries stay present and lifecycle is modeled explicitly (Disabled / Connected / Disconnected / PermanentlyFailed).
  • Refresh/reconnect/toggle are serialized with a lifecycle lock so concurrent reads do not
    observe missing servers or an empty registry.
  • Automatic reconnect now continues across future refresh cycles until the retry budget is
    exhausted, instead of stopping after the first failed reconnect.
  • Snapshot publication is separated from connection lifecycle. We keep the last good
    published tools until a new successful refresh is ready, so a failed reconnect does not drop
    the last good snapshot.
  • McpTool no longer stores a potentially stale transport handle. Tool execution resolves
    the live transport from shared registry state at call time.
  • Tool-call transport failures now feed the same health/reconnect state machine instead of
    relying only on periodic tools/list refresh.
  • The stdio path restores explicit cleanup on initialization failure, and close() is
    covered for idempotency.
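The call-time transport resolution described above can be sketched roughly like this; all types and names are illustrative stand-ins for the actual crate API.

```rust
// Sketch: the tool holds only a server name plus a handle to the shared
// registry, and resolves the live runtime at call time instead of storing
// a transport captured at build time (which could go stale on reconnect).
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// name -> live runtime handle (None while disconnected)
type Registry = Arc<Mutex<HashMap<String, Option<String>>>>;

struct McpToolRef {
    server: String,
    registry: Registry,
}

impl McpToolRef {
    /// Resolve the current transport from shared state; fail explicitly if
    /// the server is disconnected instead of silently using a stale handle.
    fn call(&self, _args: &str) -> Result<String, String> {
        let reg = self.registry.lock().unwrap();
        match reg.get(&self.server) {
            Some(Some(rt)) => Ok(format!("called via {}", rt)),
            _ => Err(format!("server {} has no live runtime", self.server)),
        }
    }
}

fn main() {
    let registry: Registry = Arc::new(Mutex::new(HashMap::new()));
    registry.lock().unwrap().insert("docs".into(), Some("transport#1".into()));
    let tool = McpToolRef { server: "docs".into(), registry: Arc::clone(&registry) };
    assert_eq!(tool.call("{}").unwrap(), "called via transport#1");

    // A reconnect swaps the runtime; the same tool handle now uses it.
    registry.lock().unwrap().insert("docs".into(), Some("transport#2".into()));
    assert_eq!(tool.call("{}").unwrap(), "called via transport#2");
}
```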

I also added coverage for the failure modes you called out:

  • reconnect starts only after the failure threshold
  • reconnect budget continues across future refresh cycles until permanent failure
  • failed reconnect preserves the last good snapshot
  • one failing server does not affect other servers
  • concurrent reads during refresh do not observe missing servers
  • tool-call transport failures update health/reconnect state
  • stdio initialization failure cleans up the child process
  • close() is idempotent

I reran the reconnect-focused awaken-ext-mcp tests locally after the changes and they pass.

If you still prefer this to be split further, I can separate the lifecycle-state hardening
from the reconnect behavior, but the current patch is intended to make the reconnect feature
safe for the shared-client architecture before merge.

@chaizhenhua
Contributor

Thanks for the refactor — the lifecycle + health model makes sense overall 👍

After going through the code, I see one correctness issue and a testing gap:

  1. Reconnect path can leave inconsistent state on close() failure

In attempt_reconnect() → disconnect_server():

  • runtime is taken before transport.close().await?
  • if close() fails:
    • runtime is already gone
    • reconnect_attempts is not incremented
    • health.reconnecting is not reset

This can leave the slot in a partially transitioned state.

  2. Missing tests for lifecycle transitions

Current tests don’t seem to cover:

  • reconnect after failure threshold
  • repeated reconnect attempts → permanent failure
  • reconnect success resetting health
  • close-failure path

Given this is core lifecycle logic, these should be covered.

Summary:
Design looks good, but I’d recommend fixing the reconnect failure path and adding tests before merging.

@zadawq
Contributor Author

zadawq commented Apr 9, 2026

Fixed the reconnect failure path so a close() error no longer leaves the server slot
partially transitioned. runtime is now only cleared after a successful close, and reconnect
failure bookkeeping is handled consistently across all failure paths.
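That ordering fix can be sketched minimally as: close first, and only detach the runtime once teardown has succeeded, so a close() error leaves the slot unchanged. The types here are illustrative stand-ins.

```rust
// Sketch of close-before-detach ordering: a failed close() leaves the
// slot exactly as it was, avoiding the partially transitioned state.
struct Slot {
    runtime: Option<String>, // stand-in for the live transport handle
    reconnect_attempts: u32, // failure bookkeeping, elided in this sketch
}

// Stand-in for transport.close(); `should_fail` simulates a close error.
fn close(_runtime: &str, should_fail: bool) -> Result<(), &'static str> {
    if should_fail { Err("close failed") } else { Ok(()) }
}

fn disconnect(slot: &mut Slot, close_fails: bool) -> Result<(), &'static str> {
    // Close first; the `?` returns early on error with the slot untouched.
    if let Some(rt) = slot.runtime.as_deref() {
        close(rt, close_fails)?;
    }
    // Only clear the runtime once teardown succeeded.
    slot.runtime = None;
    Ok(())
}

fn main() {
    let mut slot = Slot { runtime: Some("stdio#1".into()), reconnect_attempts: 0 };
    assert!(disconnect(&mut slot, true).is_err());
    assert!(slot.runtime.is_some()); // close failure leaves the slot intact
    assert!(disconnect(&mut slot, false).is_ok());
    assert!(slot.runtime.is_none());
}
```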

Also added lifecycle coverage for threshold-triggered reconnect, repeated failures to
permanent failure, reconnect-state reset, close-failure handling, and manual reconnect
failure behavior. cargo test -p awaken-ext-mcp passes.

@chaizhenhua
Contributor

Thanks — this is much closer now, and I agree with the overall direction of moving MCP servers to an explicit lifecycle model with reconnect and per-server health.

I still recommend requesting changes before merge for two reasons:

  • toggle(server, false) should be atomic on failure. If disable hits a close() error, we should not leave the slot in a partially transitioned state where lifecycle, runtime, and published snapshot can diverge. Please make disable all-or-nothing: either fully close and transition to Disabled, or restore the previous state on error.
  • Please add a few regression tests around the lifecycle state machine. The most important ones are: disable-on-close-failure preserves previous state; reconnect retries continue across refresh cycles until permanent failure; reconnect/refresh failure preserves the last good snapshot; one failed server does not affect other servers; and close() remains idempotent.

With those in place, I think this PR will be in a much safer shape for a shared MCP client.

@zadawq
Contributor Author

zadawq commented Apr 10, 2026

Thanks. Addressed.

toggle(server, false) is now atomic on failure: disable only commits after a successful
close(), and a close() error preserves the previous slot state unchanged.

I also added/verified regression coverage for:

  • disable on close failure preserving prior state
  • reconnect retries continuing across refresh cycles until permanent failure
  • reconnect/refresh failure preserving the last good snapshot
  • one failed server not affecting other servers
  • close/disable idempotence

@chaizhenhua
Contributor

The current PR is directionally correct, but I would still optimize the design around stable server handles, generation-bound runtime leases, explicit transitional lifecycle states, and a separate published catalog. The remaining issue is that health updates are still keyed only by server_name, while runtime instances have no generation, so late results from an old transport can still affect a new connection. In addition, disable currently relies on state-level rollback even though close() is side-effecting at the transport layer, especially for stdio. A cleaner design is to detach the runtime, treat close as non-transactional teardown, keep publication separate from connectivity, and make all lifecycle outcomes explicit manager states. That gives you stale-result immunity, honest teardown semantics, stable last-good snapshots, and consistent tool/prompt/resource behavior.

@chaizhenhua
Contributor

The way out of the current clone-and-swap maze is to replace it with four clear layers: a stable server slot, a generation-bound runtime, a published snapshot, and an explicit lifecycle state machine. Right now the PR still has no generation on McpServerRuntime, records tool-call success and failure only by server_name, clones slots that still share the same Arc transport, and relies on rollback even though stdio close() already mutates the transport by marking it dead, draining pending requests, clearing progress subscribers, and taking or terminating the child process. The code also already splits “published tools” from “live connectivity,” because snapshot rebuild keeps cached published tools for anything except Disabled and PermanentlyFailed, while activity checks still require Connected plus runtime.is_some().

The better design is: keep every server permanently in shared state, add a generation/epoch to each runtime so only the current generation can update health, separate the last-good published catalog from the current live runtime, and model lifecycle explicitly with states such as Connecting, Connected, Reconnecting, Disabling, Disabled, Disconnected, and PermanentlyFailed. On disable or reconnect, first detach the old runtime from the slot, then close it outside the lock, instead of treating close() as if it were transactional. That gives you three things at once: no stale old-call results corrupting a new connection, honest teardown semantics, and stable last-good tools during reconnects.
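The generation/epoch idea can be sketched as follows: each runtime carries a generation, and only results from the current generation may update health, so a late result from a replaced transport is ignored. Names and fields are assumptions for illustration.

```rust
// Sketch: generation-bound health updates. A reconnect bumps the
// generation, so late results from the old transport cannot corrupt
// the health of the new connection.
struct ServerHealth {
    generation: u64,           // generation of the current runtime
    consecutive_failures: u32, // reset on success or reconnect
}

impl ServerHealth {
    /// Record a tool-call outcome; returns false if the result came from
    /// a stale (replaced) runtime generation and was dropped.
    fn record(&mut self, generation: u64, ok: bool) -> bool {
        if generation != self.generation {
            return false; // result from an old transport: ignore
        }
        if ok {
            self.consecutive_failures = 0;
        } else {
            self.consecutive_failures += 1;
        }
        true
    }

    /// A successful reconnect installs a new runtime generation.
    fn reconnected(&mut self) {
        self.generation += 1;
        self.consecutive_failures = 0;
    }
}

fn main() {
    let mut h = ServerHealth { generation: 1, consecutive_failures: 0 };
    h.record(1, false);
    h.reconnected(); // generation is now 2
    assert!(!h.record(1, false)); // late failure from gen 1 is ignored
    assert_eq!(h.consecutive_failures, 0);
}
```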

@zadawq zadawq force-pushed the feat-mcp-auto-reconnect branch from b452dfb to 9dd7d10 on April 11, 2026 at 07:10
Introduce McpPublishedSnapshot to bind published catalogs to a runtime
generation, preventing stale tool-call results from corrupting health
on a newer connection. Detach runtime before close on both reconnect
and disable paths so close() is non-transactional teardown. Separate
close failures from connect failures in reconnect budget accounting.
@zadawq zadawq force-pushed the feat-mcp-auto-reconnect branch from 9dd7d10 to ac9cc6f on April 11, 2026 at 07:16
@chaizhenhua
Contributor

Approving for merge. The latest revisions address the earlier correctness blockers, and the remaining concerns are follow-up design improvements rather than reasons to hold this PR. I’m comfortable merging this now and tracking the rest in separate issues.

@chaizhenhua chaizhenhua merged commit a35ea9f into awakenworks:main Apr 11, 2026
7 checks passed
@chaizhenhua
Contributor

Follow-up PR: #107

@zadawq zadawq deleted the feat-mcp-auto-reconnect branch April 17, 2026 04:26