feat(mcp): self-healing tool execution with LLM argument correction #40

Merged
Mathews-Tom merged 3 commits into main from feat/mcp-retry on Mar 25, 2026

@Mathews-Tom

Owner

Summary

Add self-healing tool execution to the MCP server, inspired by SivaRamSV/paaw's self-healing tool pattern. When an MCP tool fails with a retryable error, the system asks the LLM to diagnose the failure and suggest corrected arguments, then retries once. Security and auth errors are never retried.

How It Works

Tool call → _dispatch_tool()
  ├─ Success → return result
  └─ Failure
       ├─ Non-retryable (ProfileError, PathTraversalError, TypeError, KeyError, AttributeError) → raise immediately
       └─ Retryable (ValueError, LLMError, TimeoutError, ConnectionError)
            → LLM diagnoses error + suggests corrected args (JSON)
            → _dispatch_tool() with corrected args
                 ├─ Success → return result (logged as retry success)
                 └─ Failure → raise ToolRetryExhaustedError (both errors preserved)
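
The flow above can be sketched roughly as follows. This is a minimal synchronous illustration, not the code in `retry.py`: the function name `execute_with_retry`, the `dispatch` and `correct_args` callables, and their signatures are assumptions made for the example, and the project-specific `LLMError` is left out of the retryable tuple so the block stays self-contained.

```python
from typing import Any, Callable, Optional

Args = dict[str, Any]


class ToolRetryExhaustedError(Exception):
    """Raised when the corrected retry also fails; keeps both errors."""

    def __init__(self, original_error: Exception, retry_error: Exception) -> None:
        super().__init__(f"retry failed: {retry_error} (original: {original_error})")
        self.original_error = original_error
        self.retry_error = retry_error


# Assumption: stand-in for the retryable set named in the PR (LLMError omitted).
RETRYABLE = (ValueError, TimeoutError, ConnectionError)


def execute_with_retry(
    dispatch: Callable[[str, Args], Any],
    correct_args: Callable[[str, Args, Exception], Optional[Args]],
    tool: str,
    args: Args,
) -> Any:
    """Dispatch once; on a retryable error, ask for corrected args and retry once."""
    try:
        return dispatch(tool, args)
    except RETRYABLE as original:
        corrected = correct_args(tool, args, original)
        if corrected is None:
            raise  # no correction available: surface the original error
        try:
            return dispatch(tool, corrected)
        except Exception as retry_error:
            raise ToolRetryExhaustedError(original, retry_error) from retry_error


# Hypothetical tool and corrector, for illustration only.
def search(tool: str, args: Args) -> str:
    if len(args["query"]) > 5:
        raise ValueError("query too long")
    return f"results for {args['query']}"


def shorten(tool: str, args: Args, error: Exception) -> Optional[Args]:
    return {**args, "query": args["query"][:5]}


result = execute_with_retry(search, shorten, "search", {"query": "self-healing"})
```

Any exception outside `RETRYABLE` (for example `TypeError`) is not caught by the outer `except` and so propagates immediately, which mirrors the "raise immediately" branches of the diagram.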

Error Classification

| Category | Error Types | Behavior |
| --- | --- | --- |
| Non-retryable | ProfileError, PathTraversalError, TypeError, KeyError, AttributeError | Raise immediately, no retry |
| Retryable | ValueError, LLMError, TimeoutError, ConnectionError | LLM correction + 1 retry |
| Unknown | Any error not in either list | Raise immediately (safe default) |
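
One way the three-way classification could be expressed is shown below. It is a sketch under assumptions: the exception classes are local stand-ins for the project's own types, `_NON_RETRYABLE_TYPES` is assumed to hold classes, and the retryable list is assumed to be matched by class name. The real `_should_retry()` may differ on all three points.

```python
class ProfileError(Exception):
    """Stand-in for the project's profile/auth error (assumption)."""


class PathTraversalError(Exception):
    """Stand-in for the project's path-safety error (assumption)."""


class LLMError(Exception):
    """Stand-in for the project's LLM client error (assumption)."""


# Hard-blocked types: checked first, so configuration can never re-enable them.
_NON_RETRYABLE_TYPES = frozenset(
    {ProfileError, PathTraversalError, TypeError, KeyError, AttributeError}
)

# Assumed to mirror the retryable_errors config field (matched by class name).
RETRYABLE_NAMES = frozenset(
    {"ValueError", "LLMError", "TimeoutError", "ConnectionError"}
)


def should_retry(error: Exception, retryable_names: frozenset = RETRYABLE_NAMES) -> bool:
    """Return True only for errors explicitly listed as retryable."""
    if isinstance(error, tuple(_NON_RETRYABLE_TYPES)):
        return False  # security, auth, and programming errors
    # Anything not listed is "unknown" and is not retried (safe default).
    return type(error).__name__ in retryable_names


decisions = {
    "ValueError": should_retry(ValueError("bad query")),
    "ProfileError": should_retry(ProfileError("denied")),
    "RuntimeError": should_retry(RuntimeError("unexpected")),
}
```

Because this sketch matches on the exact class name, a subclass such as `ConnectionResetError` would fall into the unknown category and not be retried.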

LLM Correction

When a retryable error occurs, the executor sends a structured prompt to the fast model:

  • Includes tool name, original arguments, error message, and error type
  • Lists common fixes (query length, path format, missing fields)
  • Expects JSON response: {"arguments": {...corrected...}} or {"corrected": false, "reason": "..."}
  • Invalid JSON, LLM errors, or declined corrections → original error raised (no retry)
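
Handling the two response shapes listed above might look like the following. This is a hedged sketch: `parse_correction` is a hypothetical helper name, and returning `None` is used here to mean "do not retry, raise the original error".

```python
import json
from typing import Any, Optional


def parse_correction(raw: str) -> Optional[dict[str, Any]]:
    """Return corrected arguments, or None if the reply is unusable or declined."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None  # invalid JSON: caller raises the original error
    if not isinstance(payload, dict):
        return None  # valid JSON but not an object
    if payload.get("corrected") is False:
        return None  # the model declined to suggest a fix
    arguments = payload.get("arguments")
    return arguments if isinstance(arguments, dict) else None


accepted = parse_correction('{"arguments": {"query": "short"}}')
declined = parse_correction('{"corrected": false, "reason": "path is invalid"}')
garbage = parse_correction("Sure! Here are the corrected arguments:")
```

Treating every malformed or unexpected reply as a declined correction keeps the failure mode simple: the caller always sees either corrected arguments or the original error.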

Changes

New Files

  • src/vaultmind/mcp/retry.py (168 lines) — ToolRetryExecutor class with execute(), _should_retry(), and _correct_args() methods. ToolRetryExhaustedError exception preserving both original_error and retry_error. Module-level _NON_RETRYABLE_TYPES frozenset for hard-blocked error types
  • tests/test_mcp_retry.py (313 lines) — 19 tests across 4 classes

Modified Files

  • src/vaultmind/config.py — Added MCPRetryConfig class with enabled, max_retries, use_llm_correction, correction_model, timeout_seconds, retryable_errors fields. Added mcp_retry field to Settings
  • config/default.toml — Added [mcp_retry] section with all config entries
  • src/vaultmind/mcp/server.py — Added retry_executor optional parameter to create_mcp_server(). When provided, tool dispatch routes through ToolRetryExecutor.execute() with a closure over all dependencies. When absent, dispatch is direct — preserving existing behavior
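
For orientation, the configuration fields listed above could be modelled like this. The field names come from the PR; everything else is an assumption, including the use of a plain dataclass, the field types, the helper `retry_config_from_mapping`, and every default value, which is a placeholder rather than the value shipped in `config/default.toml`.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class MCPRetryConfig:
    # All defaults below are illustrative placeholders, not the project's values.
    enabled: bool = True
    max_retries: int = 1
    use_llm_correction: bool = True
    correction_model: str = "fast"
    timeout_seconds: float = 30.0
    retryable_errors: tuple[str, ...] = (
        "ValueError",
        "LLMError",
        "TimeoutError",
        "ConnectionError",
    )


def retry_config_from_mapping(section: Mapping[str, Any]) -> MCPRetryConfig:
    """Build a config from a parsed [mcp_retry] section, rejecting unknown keys."""
    known = {f.name for f in fields(MCPRetryConfig)}
    unknown = set(section) - known
    if unknown:
        raise ValueError(f"unknown mcp_retry keys: {sorted(unknown)}")
    values = dict(section)
    if "retryable_errors" in values:
        # TOML arrays parse to lists; store them as an immutable tuple.
        values["retryable_errors"] = tuple(values["retryable_errors"])
    return MCPRetryConfig(**values)


cfg = retry_config_from_mapping({"enabled": False, "retryable_errors": ["ValueError"]})
```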

Backward Compatibility

  • retry_executor defaults to None in create_mcp_server() — existing callers unaffected
  • When MCPRetryConfig.enabled is False, executor is a passthrough (direct dispatch)
  • When use_llm_correction is False or no LLM client is provided, retryable errors are still raised (no silent swallowing)
  • ToolRetryExhaustedError inherits from Exception, preserving standard error handling chains
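
The opt-in wiring described in these points can be illustrated with a small sketch. The names `make_tool_handler` and `RecordingExecutor` and the `execute(dispatch, tool, args)` signature are assumptions for the example; the actual `create_mcp_server()` and `ToolRetryExecutor.execute()` signatures are not shown in this PR description.

```python
from typing import Any, Callable, Optional, Protocol

Args = dict[str, Any]
Dispatch = Callable[[str, Args], Any]


class RetryExecutor(Protocol):
    """Assumed shape of the executor; the real execute() may take other arguments."""

    def execute(self, dispatch: Dispatch, tool: str, args: Args) -> Any: ...


def make_tool_handler(
    dispatch_tool: Dispatch, retry_executor: Optional[RetryExecutor] = None
) -> Dispatch:
    """Return a handler that routes through the executor only when one is given."""

    def handle(tool: str, args: Args) -> Any:
        if retry_executor is None:
            return dispatch_tool(tool, args)  # existing behavior, unchanged
        return retry_executor.execute(dispatch_tool, tool, args)

    return handle


class RecordingExecutor:
    """Hypothetical executor that records calls and then dispatches directly."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def execute(self, dispatch: Dispatch, tool: str, args: Args) -> Any:
        self.calls.append(tool)
        return dispatch(tool, args)


def echo(tool: str, args: Args) -> str:
    return f"{tool}:{args['q']}"


direct = make_tool_handler(echo)("search", {"q": "a"})
executor = RecordingExecutor()
routed = make_tool_handler(echo, executor)("search", {"q": "b"})
```

Defaulting the executor to `None` means callers that never pass one take exactly the same code path as before the change.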

Test plan

  • 19 new tests in test_mcp_retry.py across 4 classes:
    • Basic execution (3): success passthrough, disabled config, no LLM on success
    • Retry logic (6): retryable triggers retry, non-retryable errors (ProfileError, PathTraversalError, TypeError) raise immediately, exhausted error with preserved errors
    • LLM correction (6): valid JSON, invalid JSON, declined, LLM error, disabled, missing client
    • Should retry (4): ValueError/TimeoutError retryable, ProfileError non-retryable, unknown not retried
  • Full suite: 867/867 tests pass, 0 regressions
  • ruff check — clean
  • mypy --ignore-missing-imports — clean
  • Integration: verify retry triggers on real MCP tool call with transient failure
  • Manual: confirm LLM correction produces valid argument corrections for common error types

New module mcp/retry.py with ToolRetryExecutor that wraps tool dispatch
with a single retry on retryable errors (ValueError, LLMError,
TimeoutError, ConnectionError). On failure, asks the LLM to diagnose
and suggest corrected arguments via structured JSON. Security and auth
errors (ProfileError, PathTraversalError) are never retried.

Add MCPRetryConfig with enabled, max_retries, use_llm_correction,
correction_model, timeout_seconds, and retryable_errors fields.

Accept optional retry_executor parameter in create_mcp_server(). When
provided, tool dispatch routes through ToolRetryExecutor.execute() with
a closure over all dependencies. When absent, dispatch is direct,
preserving existing behavior for callers that don't configure retry.

19 tests across 4 classes: basic execution (3), retry logic (6),
LLM correction (6), and should_retry classification (4). Covers
success passthrough, retryable vs non-retryable error routing,
LLM JSON parsing (valid/invalid/declined), exhausted error
preservation, and disabled/missing client edge cases.
@Mathews-Tom Mathews-Tom merged commit f4152b6 into main Mar 25, 2026
3 checks passed
@Mathews-Tom Mathews-Tom deleted the feat/mcp-retry branch March 25, 2026 12:56