claude-code-delegate-local

🇬🇧 English · 🇪🇸 Español

MCP server that delegates Claude Code subagents to alternative backends — local models (LM Studio, llama.cpp, Ollama, vLLM, LiteLLM), DeepSeek, AWS Bedrock, or any OpenAI/Anthropic-compatible endpoint — without losing your Claude Code orchestrator session.

Built for users who want to keep their main Claude Code session on Anthropic (Max plan or API) for orchestration, while offloading specific subagents to cheaper, faster, or HIPAA-safe local backends.

What it solves

You're working with Claude Code on a project and you want to:

Send a specific subagent (e.g., security-engineer) to a local model to save tokens from your Max plan, or because you're handling sensitive data that can't leave your machine.
Route another subagent to DeepSeek because it's 10× cheaper and faster for large tasks.
Keep your main Claude Code session exactly as it is — no swapping commands, no separate CLI, no losing the Max plan.

That's what delegate-local does. It's an MCP server you install once that exposes tools the orchestrator can invoke to route specific subagents to whatever backend you've configured.

Features

✅ Your Anthropic Max plan stays intact. No need to launch a separate CLI like ccr code or swap commands.
✅ 3-tier agent lookup. Same command works in any project — finds .claude/agents/<name>.md in the project first, then .claude/skills/<name>/SKILL.md, then global ~/.claude/agents/<name>.md.
✅ Dual-format backend. Auto-routes to /v1/messages (Anthropic format) or /v1/chat/completions (OpenAI format) based on model prefix. Works with DeepSeek's reasoning_content thinking mode out of the box.
✅ Full tool calling. Delegated agents get read_file, write_file, and run_bash with the same loop semantics as Claude Code's native subagents.

Quick install

Requires uv and Claude Code.

git clone https://github.com/fegone/claude-code-delegate-local.git
cd claude-code-delegate-local
uv sync

# Register as Claude Code MCP (user scope = global across projects)
claude mcp add delegate-local \
  --scope user \
  --env DELEGATE_LOCAL_URL=http://localhost:4000/v1/messages \
  --env DELEGATE_LOCAL_KEY=your-backend-api-key \
  --env DELEGATE_LOCAL_MODEL=local-qwen-3-6-35b \
  -- uv run --directory $(pwd) python server.py

Restart Claude Code. The MCP exposes 4 tools (see below).

Configuration

All env vars are optional; defaults assume a LiteLLM proxy on localhost:4000.

Env var	Default	Description
`DELEGATE_LOCAL_URL`	`http://localhost:4000/v1/messages`	Anthropic-format endpoint. For OpenAI-format models, the server auto-converts the URL to `/v1/chat/completions`.
`DELEGATE_LOCAL_KEY`	`""`	Bearer token / API key. Sent as both `x-api-key` and `Authorization: Bearer`.
`DELEGATE_LOCAL_MODEL`	`local-qwen-3-6-35b`	Default model alias if the caller doesn't specify one.
`DELEGATE_LOCAL_AGENTS_DIR`	`~/.claude/agents`	Where to look for global agent definitions.

See docs/CONFIGURATION.md for full details and example setups with LiteLLM, llama.cpp, Ollama, DeepSeek direct, and AWS Bedrock.

Tools exposed

Tool	Purpose
`delegate_to_local_agent(agent_name, task, workdir, max_turns, model)`	Run a `.md`-defined agent on the default backend with full tool calling. `max_turns` defaults to 15 (validated sweet spot for MoE-A3B local backends; pass 25-30 explicit for cloud). Hard cap 40.
`delegate_batch(tasks)`	NEW v0.5.0 — Dispatch up to 4 agent tasks in parallel via `asyncio.gather`. Each task is a dict `{agent_name, task, workdir?, max_turns?, model?, max_tokens?}`. Returns per-task results in input order. Reuses same agent_name across tasks for KV-cache prefix benefit (~30-50% prompt savings on local llama.cpp).
`delegate_to_provider(provider_url, api_key, model, agent_name, task, ...)`	Run an agent on any arbitrary endpoint (DeepSeek, OpenRouter, etc.)
`list_local_agents()`	List agents found in `DELEGATE_LOCAL_AGENTS_DIR` with their frontmatter metadata
`local_backend_status()`	Health check + list of models available on the configured backend

Note on `delegate_batch` and sub-agents

Claude Code sub-agents launched via the native Agent/Task tool do not inherit the parent session's MCP servers. This means delegate_batch (and any other MCP tool) is only callable from the main orchestrator session. Sub-agents that need parallel local-backend dispatch should use httpx.AsyncClient + asyncio.gather directly against the LiteLLM endpoint instead. This is a Claude Code architecture constraint, not a delegate-local limitation.

3-tier agent lookup

When you call delegate_to_local_agent("webdev", ...) with a workdir, the server looks for the agent definition in this order:

<workdir>/.claude/agents/webdev.md — project agent (highest priority)
<workdir>/.claude/skills/webdev/SKILL.md — project skill (alternative location)
~/.claude/agents/webdev.md — global agent (fallback)

This means the same delegate call works in any project, using whichever scope owns the agent. The response includes agent_source so the orchestrator knows which one was loaded.

Dual-format backend routing

Models with these prefixes are routed to OpenAI-format /v1/chat/completions:

deepseek-*
openai-*
gpt-*
qwen-* (external Qwen APIs — note that local-qwen-* aliases route via Anthropic /v1/messages)

All other models go to Anthropic-format /v1/messages. Inside the server everything is normalized to Anthropic-style content blocks (text / tool_use / thinking) so the agent loop stays uniform.

Thinking-mode support

For models that emit reasoning_content (DeepSeek V4, OpenAI o1-style), the server preserves it as a {"type": "thinking", "thinking": "..."} content block between turns. This is required by LiteLLM and most providers — if you drop reasoning_content from the assistant message in multi-turn, the next request fails with 400 Bad Request.

max_tokens defaults to 65536 (parameter of the tool — caller can override). High default is intentional so thinking-mode models have budget for both reasoning and content output, and so large monolithic outputs (e.g., complete HTML files with embedded JS) don't get truncated. Lower it explicitly only if your backend has a stricter cap.

Example: LiteLLM proxy

A minimal litellm/config.yaml to use with this MCP:

model_list:
  - model_name: local-qwen-3-6-35b
    litellm_params:
      model: openai/Qwen3-6-35B
      api_base: http://localhost:8000/v1   # your llama.cpp / vLLM server
      api_key: sk-no-key-required

  - model_name: deepseek-v4-flash
    litellm_params:
      model: deepseek/deepseek-chat
      api_key: os.environ/DEEPSEEK_API_KEY

  - model_name: bedrock-sonnet-4-6
    litellm_params:
      model: bedrock/anthropic.claude-sonnet-4-6-20260101-v1:0
      aws_region_name: us-east-1

Then run litellm --config config.yaml --port 4000 and point this MCP at it.

Tested with

Backend	Model	Single-turn	Multi-turn
LiteLLM + llama.cpp	`local-qwen-3-6-35b` (Qwen3.6 35B-A3B)	✅	✅
LiteLLM + DeepSeek API	`deepseek-v4-pro`	✅	✅
LiteLLM + DeepSeek API	`deepseek-v4-flash`	✅	✅
LiteLLM + AWS Bedrock	`bedrock-sonnet-4-6`, `bedrock-llama4-*`	✅	✅

Validation tasks: SQL injection review (security-engineer agent), HTML calculator (creative agent, 500-800 LOC monolithic), Pac-Man game (884 LOC monolithic single-shot).

Best practices

⚠️ If you dispatch multi-file sprints to local backends, read this first. Naive single-dispatch of 6+ files at once causes ReadTimeout at high turn counts as context saturates the slot. Splitting the work and reusing the same agent name across parallel workers can cut wall-clock time by ~60% and tokens by ~78%.

🎯 docs/BEST-PRACTICES.md — empirical thresholds for when to split work, KV-cache prefix reuse for parallel dispatches, scope-bounded prompts, estimated savings table

Caveats

run_bash runs shell commands inside workdir without sandboxing. Trust the agents you delegate. If you delegate to an unvetted public agent, the tool can read/write anywhere the calling user has access. There is no Docker isolation by default.
Caps: read_file returns the first 8KB, run_bash truncates stdout to 4KB and stderr to 2KB, timeout 120s.
max_turns hard cap is 40. Long-running orchestrations should be designed as multiple delegate calls rather than one huge loop.

License

MIT. See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
docs		docs
examples		examples
.gitignore		.gitignore
.python-version		.python-version
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.es.md		README.es.md
README.md		README.md
main.py		main.py
pyproject.toml		pyproject.toml
server.py		server.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

claude-code-delegate-local

Table of contents

What it solves

Features

Quick install

Configuration

Tools exposed

Note on `delegate_batch` and sub-agents

3-tier agent lookup

Dual-format backend routing

Thinking-mode support

Example: LiteLLM proxy

Tested with

Best practices

Further reading

Caveats

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

claude-code-delegate-local

Table of contents

What it solves

Features

Quick install

Configuration

Tools exposed

Note on delegate_batch and sub-agents

3-tier agent lookup

Dual-format backend routing

Thinking-mode support

Example: LiteLLM proxy

Tested with

Best practices

Further reading

Caveats

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Note on `delegate_batch` and sub-agents

Packages