
feat: add MCP server reconnect lifecycle on repeated failures#71

Merged
chaizhenhua merged 10 commits into awakenworks:main from zadawq:feat-mcp-auto-reconnect
Apr 11, 2026
Conversation

@zadawq
Contributor

@zadawq zadawq commented Apr 5, 2026

Summary

This PR adds per-server runtime lifecycle management to the MCP extension so failed servers can be disconnected, refreshed, and reconnected without rebuilding the whole registry.

Changes

  • refactored MCP manager state from a static server list to managed per-server slots
  • added per-server health tracking, failure counters, reconnect attempts, and permanent-failure state
  • added runtime server controls:
    • reconnect(server_name)
    • toggle(server_name, enabled)
    • server_health(server_name)
  • changed refresh flow to:
    • refresh servers individually
    • keep healthy server snapshots when another server fails
    • rebuild the registry snapshot from cached tool definitions
  • added automatic reconnect attempts after consecutive refresh failures
  • added transport lifecycle support with McpToolTransport::close()
  • implemented stdio transport shutdown handling and connection-closed propagation for pending requests
  • implemented HTTP transport close() by clearing MCP session state and cached capabilities
  • updated MCP tests to use per-server health semantics
  • added runtime management tests for disable/idempotency/reconnect-disabled behavior

Behavior

  • a single failing MCP server no longer forces the whole registry refresh to fail
  • disabled servers are explicitly rejected for manual reconnect
  • permanently failed servers are tracked explicitly
  • stdio requests waiting on a dead connection now receive ConnectionClosed instead of hanging on dropped channels
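The ConnectionClosed propagation can be sketched as follows: each pending request holds a channel receiver, and transport shutdown sends an explicit reply to every waiter instead of just dropping the senders. The types and names here are illustrative assumptions, not the crate's actual API.

```rust
// Hypothetical sketch of shutdown propagation for pending requests.
// On transport death, every waiter receives ConnectionClosed explicitly
// rather than a RecvError from a dropped channel.
use std::collections::HashMap;
use std::sync::mpsc;

#[derive(Debug, PartialEq)]
enum Reply {
    Ok(String),          // normal response payload (unused in this sketch)
    ConnectionClosed,    // explicit "transport is dead" signal
}

struct Pending {
    waiters: HashMap<u64, mpsc::Sender<Reply>>, // request id -> reply channel
}

impl Pending {
    /// On transport shutdown, notify every pending request explicitly.
    fn shutdown(&mut self) {
        for (_, tx) in self.waiters.drain() {
            let _ = tx.send(Reply::ConnectionClosed);
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let mut pending = Pending { waiters: HashMap::from([(1u64, tx)]) };
    pending.shutdown();
    assert_eq!(rx.recv().unwrap(), Reply::ConnectionClosed);
    assert!(pending.waiters.is_empty());
}
```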

Notes

  • snapshot rebuild now uses cached tool definitions instead of performing transport I/O during rebuild
  • current reconnect behavior updates newly rebuilt registry tools, but previously retained tool handles may still point at the old transport until they are refreshed by consumers

Validation

  • cargo test -p awaken-ext-mcp manager_toggle -- --nocapture
  • cargo test -p awaken-ext-mcp manager_reconnect_rejects_disabled_server -- --nocapture

@zadawq zadawq force-pushed the feat-mcp-auto-reconnect branch from b75272d to 56ea538 on April 5, 2026 at 15:22
@zadawq zadawq closed this Apr 5, 2026
@zadawq zadawq reopened this Apr 5, 2026
@zadawq zadawq force-pushed the feat-mcp-auto-reconnect branch from 56ea538 to 15c1438 on April 5, 2026 at 15:28
@zadawq zadawq force-pushed the feat-mcp-auto-reconnect branch from 15c1438 to 067ec7e on April 6, 2026 at 06:50
@chaizhenhua
Contributor

This is the MCP client side, not the MCP server side, and in the awaken-server integration this registry/client is shared across agents. Because of that, I think the reconnect
lifecycle needs a stricter correctness bar before merge.

I think ServerDisabled belongs in McpError. It is not a transport failure; it is a manager-level state rejection. The server exists, but the requested operation cannot be
satisfied in the current lifecycle state. ServerPermanentlyFailed can also live there for the same reason, although right now its public semantics still feel incomplete because
only some APIs surface it while list-style APIs silently skip inactive servers.
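A minimal sketch of how those manager-level rejections could sit alongside transport failures in one error type; the two lifecycle variant names follow the review, everything else (fields, helper) is an assumption for illustration.

```rust
// Hypothetical sketch: lifecycle rejections as McpError variants,
// distinct from transport failures. Not the crate's actual API.
#[derive(Debug)]
enum McpError {
    Transport(String),                          // I/O-level failure; may warrant retry
    ServerDisabled { server: String },          // manager-level state rejection
    ServerPermanentlyFailed { server: String }, // reconnect budget exhausted
}

impl McpError {
    /// Transport errors may warrant retry; lifecycle rejections do not.
    fn is_retryable(&self) -> bool {
        matches!(self, McpError::Transport(_))
    }
}

fn main() {
    let e = McpError::ServerDisabled { server: "docs".into() };
    assert!(!e.is_retryable());
    println!("{:?}", e);
}
```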

The intended retry flow seems to be:

  • periodic refresh runs tools/list
  • refresh failures increment consecutive_failures
  • after FAILURE_THRESHOLD = 3, attempt_reconnect() runs
  • reconnect backoff is 1s, 2s, 4s, 8s, 16s
  • after MAX_RECONNECT_ATTEMPTS = 5, the server becomes permanently failed
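The retry budget above can be sketched with the constants the review quotes (FAILURE_THRESHOLD = 3, MAX_RECONNECT_ATTEMPTS = 5, backoff doubling from 1s); the helper names are illustrative, not the actual code.

```rust
// Sketch of the intended retry budget, under the constants quoted above.
const FAILURE_THRESHOLD: u32 = 3;
const MAX_RECONNECT_ATTEMPTS: u32 = 5;

/// Backoff for the nth reconnect attempt (0-based): 1s, 2s, 4s, 8s, 16s.
fn backoff_secs(attempt: u32) -> u64 {
    1u64 << attempt.min(4)
}

/// Should this refresh failure trigger a reconnect attempt? This predicate
/// must keep returning true across future refresh cycles until the budget
/// is exhausted -- the bug described below is that it effectively stops
/// after the first failed reconnect.
fn should_attempt_reconnect(consecutive_failures: u32, reconnect_attempts: u32) -> bool {
    consecutive_failures >= FAILURE_THRESHOLD && reconnect_attempts < MAX_RECONNECT_ATTEMPTS
}

fn main() {
    assert_eq!((0..5).map(backoff_secs).collect::<Vec<_>>(), vec![1, 2, 4, 8, 16]);
    assert!(!should_attempt_reconnect(2, 0)); // below threshold: no reconnect yet
    assert!(should_attempt_reconnect(3, 4));  // last attempt in the budget
    assert!(!should_attempt_reconnect(3, 5)); // budget exhausted -> permanent failure
}
```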

That model is reasonable for a shared MCP client, but the implementation does not actually realize that retry budget. After the first failed reconnect, runtime becomes None, and
future refreshes skip the slot because server_is_active() requires runtime.is_some(). In practice, automatic retry stops after one reconnect failure.

I also have a larger shared-state concern: refresh_state(), reconnect(), and toggle() all remove server state from the shared collection and then await I/O. Since this registry
is shared, concurrent readers can temporarily observe the server as missing or even see an empty server list. That is not a safe pattern for shared registry state.

There is also a snapshot regression here. The existing behavior is effectively “keep the last good snapshot on refresh failure”, but once reconnect is involved that guarantee is
broken: disconnect_server() clears tools_cache, and if reconnect fails, rebuild_snapshot() republishes without that server’s tools.

I also think the stdio transport path regresses cleanup behavior. This PR removes kill_on_drop(true), but the initialization failure paths do not call close(). That can leave
spawned MCP server processes behind.

For this architecture, I think the better model is:

  • keep each server slot in shared state at all times; do not take() / remove() it during async work
  • represent lifecycle explicitly, e.g. Disabled / Connected / Disconnected / PermanentlyFailed
  • separate shared connection lifecycle from tool snapshot publication
  • preserve the last good snapshot until a new successful snapshot is ready
  • route tool execution through a live server handle / manager lookup instead of storing a potentially stale transport inside McpTool
  • feed tool-call transport failures into the same health/reconnect state machine instead of relying only on tools/list refreshes
  • restore explicit stdio cleanup for initialization failures
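The slot-stays-resident model above can be sketched roughly like this; every type, field, and method name here is an illustrative assumption, not the actual implementation.

```rust
// Sketch: the slot stays in shared state at all times; lifecycle is an
// explicit enum, and the published tools are kept separately from the
// live connection so a failed reconnect never drops the last good snapshot.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ServerLifecycle {
    Disabled,
    Connected,
    Disconnected,
    PermanentlyFailed,
}

struct ServerSlot {
    name: String,
    state: ServerLifecycle,
    published_tools: Vec<String>, // last good snapshot, independent of connectivity
}

impl ServerSlot {
    /// A failed reconnect only changes the state field; it never removes
    /// the slot from shared state or drops the published tools.
    fn on_reconnect_failed(&mut self, budget_exhausted: bool) {
        self.state = if budget_exhausted {
            ServerLifecycle::PermanentlyFailed
        } else {
            ServerLifecycle::Disconnected
        };
    }
}

fn main() {
    let mut slot = ServerSlot {
        name: "docs".into(),
        state: ServerLifecycle::Connected,
        published_tools: vec!["search".into()],
    };
    slot.on_reconnect_failed(false);
    assert_eq!(slot.state, ServerLifecycle::Disconnected);
    assert_eq!(slot.published_tools, vec!["search".to_string()]); // snapshot preserved
    println!("{} -> {:?}", slot.name, slot.state);
}
```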

At minimum, I think this needs more tests before merge:

  • reconnect triggers only after the failure threshold
  • reconnect success resets health and republishes tools
  • reconnect failure budget continues across future refresh cycles until permanent failure
  • reconnect failure does not drop the last good snapshot
  • one failed server does not affect other servers
  • concurrent reads during refresh/reconnect/toggle do not observe missing servers
  • stdio initialization failure does not leak child processes
  • close() is idempotent

If we want to reduce risk, I would strongly prefer splitting this into two PRs:

  • first land per-server health plus explicit transport close / cleanup
  • then land reconnect lifecycle once the shared-state model is corrected

@chaizhenhua
Contributor

Good direction overall, but I think this still needs a few fixes before merge:

  • Auto-reconnect stops after the first failed reconnect because inactive slots are skipped by refresh_state().
  • refresh_state() / reconnect() move servers out of shared state across await, which can expose inconsistent state to concurrent readers.
  • Reconnect should keep the last known good snapshot until the new connection is fully ready; currently a failed reconnect can drop previously discovered tools.
  • stdio child cleanup should also be guaranteed on early initialize/connect failures.

Non-blocking: old McpTool handles may still point to the old transport after reconnect.

@zadawq
Contributor Author

zadawq commented Apr 9, 2026

Thanks, this review was correct. The original reconnect implementation was not safe enough
for the shared client/registry model used by awaken-server.

I addressed this in the follow-up commit: 🛠️ fix(mcp): harden shared client reconnect lifecycle.

What changed:

  • ServerDisabled and ServerPermanentlyFailed now remain manager-level McpError states
    rather than transport failures.
  • The registry no longer removes/takes server slots out of shared state during async
    lifecycle work. Server entries stay present and lifecycle is modeled explicitly (Disabled / Connected / Disconnected / PermanentlyFailed).
  • Refresh/reconnect/toggle are serialized with a lifecycle lock so concurrent reads do not
    observe missing servers or an empty registry.
  • Automatic reconnect now continues across future refresh cycles until the retry budget is
    exhausted, instead of stopping after the first failed reconnect.
  • Snapshot publication is separated from connection lifecycle. We keep the last good
    published tools until a new successful refresh is ready, so a failed reconnect does not drop
    the last good snapshot.
  • McpTool no longer stores a potentially stale transport handle. Tool execution resolves
    the live transport from shared registry state at call time.
  • Tool-call transport failures now feed the same health/reconnect state machine instead of
    relying only on periodic tools/list refresh.
  • The stdio path restores explicit cleanup on initialization failure, and close() is
    covered for idempotency.
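The call-time transport resolution described above can be sketched roughly like this; all types and names are illustrative stand-ins for the actual crate API.

```rust
// Sketch: the tool holds only a server name plus a handle to the shared
// registry, and resolves the live runtime at call time instead of storing
// a transport captured at build time (which could go stale on reconnect).
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// name -> live runtime handle (None while disconnected)
type Registry = Arc<Mutex<HashMap<String, Option<String>>>>;

struct McpToolRef {
    server: String,
    registry: Registry,
}

impl McpToolRef {
    /// Resolve the current transport from shared state; fail explicitly if
    /// the server is disconnected instead of silently using a stale handle.
    fn call(&self, _args: &str) -> Result<String, String> {
        let reg = self.registry.lock().unwrap();
        match reg.get(&self.server) {
            Some(Some(rt)) => Ok(format!("called via {}", rt)),
            _ => Err(format!("server {} has no live runtime", self.server)),
        }
    }
}

fn main() {
    let registry: Registry = Arc::new(Mutex::new(HashMap::new()));
    registry.lock().unwrap().insert("docs".into(), Some("transport#1".into()));
    let tool = McpToolRef { server: "docs".into(), registry: Arc::clone(&registry) };
    assert_eq!(tool.call("{}").unwrap(), "called via transport#1");

    // A reconnect swaps the runtime; the same tool handle now uses it.
    registry.lock().unwrap().insert("docs".into(), Some("transport#2".into()));
    assert_eq!(tool.call("{}").unwrap(), "called via transport#2");
}
```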

I also added coverage for the failure modes you called out:

  • reconnect starts only after the failure threshold
  • reconnect budget continues across future refresh cycles until permanent failure
  • failed reconnect preserves the last good snapshot
  • one failing server does not affect other servers
  • concurrent reads during refresh do not observe missing servers
  • tool-call transport failures update health/reconnect state
  • stdio initialization failure cleans up the child process
  • close() is idempotent

I reran the reconnect-focused awaken-ext-mcp tests locally after the changes and they pass.

If you still prefer this to be split further, I can separate the lifecycle-state hardening
from the reconnect behavior, but the current patch is intended to make the reconnect feature
safe for the shared-client architecture before merge.

@chaizhenhua
Contributor

Thanks for the refactor — the lifecycle + health model makes sense overall 👍

After going through the code, I see one correctness issue and a testing gap:

  1. Reconnect path can leave inconsistent state on close() failure

In attempt_reconnect() → disconnect_server():

  • runtime is taken before transport.close().await?
  • if close() fails:
    • runtime is already gone
    • reconnect_attempts is not incremented
    • health.reconnecting is not reset

This can leave the slot in a partially transitioned state.

  2. Missing tests for lifecycle transitions

Current tests don’t seem to cover:

  • reconnect after failure threshold
  • repeated reconnect attempts → permanent failure
  • reconnect success resetting health
  • close-failure path

Given this is core lifecycle logic, these should be covered.

Summary:
Design looks good, but I’d recommend fixing the reconnect failure path and adding tests before merging.

@zadawq
Contributor Author

zadawq commented Apr 9, 2026

Fixed the reconnect failure path so a close() error no longer leaves the server slot
partially transitioned. runtime is now only cleared after a successful close, and reconnect
failure bookkeeping is handled consistently across all failure paths.
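That ordering fix can be sketched minimally as: close first, and only detach the runtime once teardown has succeeded, so a close() error leaves the slot unchanged. The types here are illustrative stand-ins.

```rust
// Sketch of close-before-detach ordering: a failed close() leaves the
// slot exactly as it was, avoiding the partially transitioned state.
struct Slot {
    runtime: Option<String>, // stand-in for the live transport handle
    reconnect_attempts: u32, // failure bookkeeping, elided in this sketch
}

// Stand-in for transport.close(); `should_fail` simulates a close error.
fn close(_runtime: &str, should_fail: bool) -> Result<(), &'static str> {
    if should_fail { Err("close failed") } else { Ok(()) }
}

fn disconnect(slot: &mut Slot, close_fails: bool) -> Result<(), &'static str> {
    // Close first; the `?` returns early on error with the slot untouched.
    if let Some(rt) = slot.runtime.as_deref() {
        close(rt, close_fails)?;
    }
    // Only clear the runtime once teardown succeeded.
    slot.runtime = None;
    Ok(())
}

fn main() {
    let mut slot = Slot { runtime: Some("stdio#1".into()), reconnect_attempts: 0 };
    assert!(disconnect(&mut slot, true).is_err());
    assert!(slot.runtime.is_some()); // close failure leaves the slot intact
    assert!(disconnect(&mut slot, false).is_ok());
    assert!(slot.runtime.is_none());
}
```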

Also added lifecycle coverage for threshold-triggered reconnect, repeated failures to
permanent failure, reconnect-state reset, close-failure handling, and manual reconnect
failure behavior. cargo test -p awaken-ext-mcp passes.

@chaizhenhua
Contributor

Thanks — this is much closer now, and I agree with the overall direction of moving MCP servers to an explicit lifecycle model with reconnect and per-server health.

I still recommend requesting changes before merge for two reasons:

  • toggle(server, false) should be atomic on failure. If disable hits a close() error, we should not leave the slot in a partially transitioned state where lifecycle, runtime, and published snapshot can diverge. Please make disable all-or-nothing: either fully close and transition to Disabled, or restore the previous state on error.
  • Please add a few regression tests around the lifecycle state machine. The most important ones are: disable-on-close-failure preserves previous state; reconnect retries continue across refresh cycles until permanent failure; reconnect/refresh failure preserves the last good snapshot; one failed server does not affect other servers; and close() remains idempotent.

With those in place, I think this PR will be in a much safer shape for a shared MCP client.

@zadawq
Contributor Author

zadawq commented Apr 10, 2026

Thanks. Addressed.

toggle(server, false) is now atomic on failure: disable only commits after a successful
close(), and a close() error preserves the previous slot state unchanged.

I also added/verified regression coverage for:

  • disable on close failure preserving prior state
  • reconnect retries continuing across refresh cycles until permanent failure
  • reconnect/refresh failure preserving the last good snapshot
  • one failed server not affecting other servers
  • close/disable idempotence

@chaizhenhua
Contributor

The current PR is directionally correct, but I would still optimize the design around stable server handles, generation-bound runtime leases, explicit transitional lifecycle states, and a separate published catalog. The remaining issue is that health updates are still keyed only by server_name, while runtime instances have no generation, so late results from an old transport can still affect a new connection. In addition, disable currently relies on state-level rollback even though close() is side-effecting at the transport layer, especially for stdio. A cleaner design is to detach the runtime, treat close as non-transactional teardown, keep publication separate from connectivity, and make all lifecycle outcomes explicit manager states. That gives you stale-result immunity, honest teardown semantics, stable last-good snapshots, and consistent tool/prompt/resource behavior.

@chaizhenhua
Contributor

The way out of the current clone-and-swap maze is to replace it with four clear layers: a stable server slot, a generation-bound runtime, a published snapshot, and an explicit lifecycle state machine. Right now the PR still has no generation on McpServerRuntime, records tool-call success and failure only by server_name, clones slots that still share the same Arc transport, and relies on rollback even though stdio close() already mutates the transport by marking it dead, draining pending requests, clearing progress subscribers, and taking or terminating the child process. The code also already splits “published tools” from “live connectivity,” because snapshot rebuild keeps cached published tools for anything except Disabled and PermanentlyFailed, while activity checks still require Connected plus runtime.is_some().

The better design is: keep every server permanently in shared state, add a generation/epoch to each runtime so only the current generation can update health, separate the last-good published catalog from the current live runtime, and model lifecycle explicitly with states such as Connecting, Connected, Reconnecting, Disabling, Disabled, Disconnected, and PermanentlyFailed. On disable or reconnect, first detach the old runtime from the slot, then close it outside the lock, instead of treating close() as if it were transactional. That gives you three things at once: no stale old-call results corrupting a new connection, honest teardown semantics, and stable last-good tools during reconnects.
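The generation/epoch idea can be sketched as follows: each runtime carries a generation, and only results from the current generation may update health, so a late result from a replaced transport is ignored. Names and fields are assumptions for illustration.

```rust
// Sketch: generation-bound health updates. A reconnect bumps the
// generation, so late results from the old transport cannot corrupt
// the health of the new connection.
struct ServerHealth {
    generation: u64,           // generation of the current runtime
    consecutive_failures: u32, // reset on success or reconnect
}

impl ServerHealth {
    /// Record a tool-call outcome; returns false if the result came from
    /// a stale (replaced) runtime generation and was dropped.
    fn record(&mut self, generation: u64, ok: bool) -> bool {
        if generation != self.generation {
            return false; // result from an old transport: ignore
        }
        if ok {
            self.consecutive_failures = 0;
        } else {
            self.consecutive_failures += 1;
        }
        true
    }

    /// A successful reconnect installs a new runtime generation.
    fn reconnected(&mut self) {
        self.generation += 1;
        self.consecutive_failures = 0;
    }
}

fn main() {
    let mut h = ServerHealth { generation: 1, consecutive_failures: 0 };
    h.record(1, false);
    h.reconnected(); // generation is now 2
    assert!(!h.record(1, false)); // late failure from gen 1 is ignored
    assert_eq!(h.consecutive_failures, 0);
}
```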

@zadawq zadawq force-pushed the feat-mcp-auto-reconnect branch from b452dfb to 9dd7d10 on April 11, 2026 at 07:10
Introduce McpPublishedSnapshot to bind published catalogs to a runtime
generation, preventing stale tool-call results from corrupting health
on a newer connection. Detach runtime before close on both reconnect
and disable paths so close() is non-transactional teardown. Separate
close failures from connect failures in reconnect budget accounting.
@zadawq zadawq force-pushed the feat-mcp-auto-reconnect branch from 9dd7d10 to ac9cc6f on April 11, 2026 at 07:16
@chaizhenhua
Contributor

Approving for merge. The latest revisions address the earlier correctness blockers, and the remaining concerns are follow-up design improvements rather than reasons to hold this PR. I’m comfortable merging this now and tracking the rest in separate issues.

@chaizhenhua chaizhenhua merged commit a35ea9f into awakenworks:main Apr 11, 2026
7 checks passed
@chaizhenhua
Contributor

Follow-up PR: #107

@zadawq zadawq deleted the feat-mcp-auto-reconnect branch April 17, 2026 04:26