🇬🇧 English · 🇪🇸 Español
MCP server that delegates Claude Code subagents to alternative backends — local models (LM Studio, llama.cpp, Ollama, vLLM, LiteLLM), DeepSeek, AWS Bedrock, or any OpenAI/Anthropic-compatible endpoint — without losing your Claude Code orchestrator session.
Built for users who want to keep their main Claude Code session on Anthropic (Max plan or API) for orchestration, while offloading specific subagents to cheaper, faster, or HIPAA-safe local backends.
- What it solves
- Features
- Quick install
- Configuration
- Tools exposed
- 3-tier agent lookup
- Dual-format backend routing
- Thinking-mode support
- Example: LiteLLM proxy
- Tested with
- Best practices
- Further reading
- Caveats
- License
You're working with Claude Code on a project and you want to:
- Send a specific subagent (e.g.,
security-engineer) to a local model to save tokens from your Max plan, or because you're handling sensitive data that can't leave your machine. - Route another subagent to DeepSeek because it's 10× cheaper and faster for large tasks.
- Keep your main Claude Code session exactly as it is — no swapping commands, no separate CLI, no losing the Max plan.
That's what delegate-local does. It's an MCP server you install once that exposes tools the orchestrator can invoke to route specific subagents to whatever backend you've configured.
- ✅ Your Anthropic Max plan stays intact. No need to launch a separate CLI like
ccr codeor swap commands. - ✅ 3-tier agent lookup. Same command works in any project — finds
.claude/agents/<name>.mdin the project first, then.claude/skills/<name>/SKILL.md, then global~/.claude/agents/<name>.md. - ✅ Dual-format backend. Auto-routes to
/v1/messages(Anthropic format) or/v1/chat/completions(OpenAI format) based on model prefix. Works with DeepSeek'sreasoning_contentthinking mode out of the box. - ✅ Full tool calling. Delegated agents get
read_file,write_file, andrun_bashwith the same loop semantics as Claude Code's native subagents.
Requires uv and Claude Code.
git clone https://github.com/fegone/claude-code-delegate-local.git
cd claude-code-delegate-local
uv sync
# Register as Claude Code MCP (user scope = global across projects)
claude mcp add delegate-local \
--scope user \
--env DELEGATE_LOCAL_URL=http://localhost:4000/v1/messages \
--env DELEGATE_LOCAL_KEY=your-backend-api-key \
--env DELEGATE_LOCAL_MODEL=local-qwen-3-6-35b \
-- uv run --directory $(pwd) python server.pyRestart Claude Code. The MCP exposes 4 tools (see below).
All env vars are optional; defaults assume a LiteLLM proxy on localhost:4000.
| Env var | Default | Description |
|---|---|---|
DELEGATE_LOCAL_URL |
http://localhost:4000/v1/messages |
Anthropic-format endpoint. For OpenAI-format models, the server auto-converts the URL to /v1/chat/completions. |
DELEGATE_LOCAL_KEY |
"" |
Bearer token / API key. Sent as both x-api-key and Authorization: Bearer. |
DELEGATE_LOCAL_MODEL |
local-qwen-3-6-35b |
Default model alias if the caller doesn't specify one. |
DELEGATE_LOCAL_AGENTS_DIR |
~/.claude/agents |
Where to look for global agent definitions. |
See docs/CONFIGURATION.md for full details and example setups with LiteLLM, llama.cpp, Ollama, DeepSeek direct, and AWS Bedrock.
| Tool | Purpose |
|---|---|
delegate_to_local_agent(agent_name, task, workdir, max_turns, model) |
Run a .md-defined agent on the default backend with full tool calling. max_turns defaults to 15 (validated sweet spot for MoE-A3B local backends; pass 25-30 explicit for cloud). Hard cap 40. |
delegate_batch(tasks) |
NEW v0.5.0 — Dispatch up to 4 agent tasks in parallel via asyncio.gather. Each task is a dict {agent_name, task, workdir?, max_turns?, model?, max_tokens?}. Returns per-task results in input order. Reuses same agent_name across tasks for KV-cache prefix benefit (~30-50% prompt savings on local llama.cpp). |
delegate_to_provider(provider_url, api_key, model, agent_name, task, ...) |
Run an agent on any arbitrary endpoint (DeepSeek, OpenRouter, etc.) |
list_local_agents() |
List agents found in DELEGATE_LOCAL_AGENTS_DIR with their frontmatter metadata |
local_backend_status() |
Health check + list of models available on the configured backend |
Claude Code sub-agents launched via the native Agent/Task tool do not inherit the parent session's MCP servers. This means delegate_batch (and any other MCP tool) is only callable from the main orchestrator session. Sub-agents that need parallel local-backend dispatch should use httpx.AsyncClient + asyncio.gather directly against the LiteLLM endpoint instead. This is a Claude Code architecture constraint, not a delegate-local limitation.
When you call delegate_to_local_agent("webdev", ...) with a workdir, the server looks for the agent definition in this order:
<workdir>/.claude/agents/webdev.md— project agent (highest priority)<workdir>/.claude/skills/webdev/SKILL.md— project skill (alternative location)~/.claude/agents/webdev.md— global agent (fallback)
This means the same delegate call works in any project, using whichever scope owns the agent. The response includes agent_source so the orchestrator knows which one was loaded.
Models with these prefixes are routed to OpenAI-format /v1/chat/completions:
deepseek-*openai-*gpt-*qwen-*(external Qwen APIs — note thatlocal-qwen-*aliases route via Anthropic/v1/messages)
All other models go to Anthropic-format /v1/messages. Inside the server everything is normalized to Anthropic-style content blocks (text / tool_use / thinking) so the agent loop stays uniform.
For models that emit reasoning_content (DeepSeek V4, OpenAI o1-style), the server preserves it as a {"type": "thinking", "thinking": "..."} content block between turns. This is required by LiteLLM and most providers — if you drop reasoning_content from the assistant message in multi-turn, the next request fails with 400 Bad Request.
max_tokens defaults to 65536 (parameter of the tool — caller can override). High default is intentional so thinking-mode models have budget for both reasoning and content output, and so large monolithic outputs (e.g., complete HTML files with embedded JS) don't get truncated. Lower it explicitly only if your backend has a stricter cap.
A minimal litellm/config.yaml to use with this MCP:
model_list:
- model_name: local-qwen-3-6-35b
litellm_params:
model: openai/Qwen3-6-35B
api_base: http://localhost:8000/v1 # your llama.cpp / vLLM server
api_key: sk-no-key-required
- model_name: deepseek-v4-flash
litellm_params:
model: deepseek/deepseek-chat
api_key: os.environ/DEEPSEEK_API_KEY
- model_name: bedrock-sonnet-4-6
litellm_params:
model: bedrock/anthropic.claude-sonnet-4-6-20260101-v1:0
aws_region_name: us-east-1Then run litellm --config config.yaml --port 4000 and point this MCP at it.
| Backend | Model | Single-turn | Multi-turn |
|---|---|---|---|
| LiteLLM + llama.cpp | local-qwen-3-6-35b (Qwen3.6 35B-A3B) |
✅ | ✅ |
| LiteLLM + DeepSeek API | deepseek-v4-pro |
✅ | ✅ |
| LiteLLM + DeepSeek API | deepseek-v4-flash |
✅ | ✅ |
| LiteLLM + AWS Bedrock | bedrock-sonnet-4-6, bedrock-llama4-* |
✅ | ✅ |
Validation tasks: SQL injection review (security-engineer agent), HTML calculator (creative agent, 500-800 LOC monolithic), Pac-Man game (884 LOC monolithic single-shot).
ReadTimeout at high turn counts as context saturates the slot. Splitting the work and reusing the same agent name across parallel workers can cut wall-clock time by ~60% and tokens by ~78%.
- 🎯 docs/BEST-PRACTICES.md — empirical thresholds for when to split work, KV-cache prefix reuse for parallel dispatches, scope-bounded prompts, estimated savings table
- 📐 docs/ARCHITECTURE.md — how it works internally, diagrams, design decisions
- ⚙️ docs/CONFIGURATION.md — full env var reference, LiteLLM setup from scratch, how to add new providers
- 💡 docs/EXAMPLES.md — 7 end-to-end use cases with copy-pasteable code
- 🔧 docs/TROUBLESHOOTING.md — common errors, lessons learned, and a dedicated section for AI agents helping with setup
- 📋 examples/litellm.example.yaml — ready-to-use LiteLLM config with 9 providers (local + cloud)
- 🤝 CONTRIBUTING.md — how to contribute
- 📝 CHANGELOG.md — version history
run_bashruns shell commands insideworkdirwithout sandboxing. Trust the agents you delegate. If you delegate to an unvetted public agent, the tool can read/write anywhere the calling user has access. There is no Docker isolation by default.- Caps:
read_filereturns the first 8KB,run_bashtruncates stdout to 4KB and stderr to 2KB, timeout 120s. max_turnshard cap is 40. Long-running orchestrations should be designed as multiple delegate calls rather than one huge loop.
MIT. See LICENSE.