Prune oversized LLM tool catalogs before they reach your agent, with local fallback and zero router cost by default.
Context Janitor is a dependency-free CLI and Python library for pruning oversized LLM tool catalogs. Give it a user prompt and a JSON list of tools, and it returns only the tools the agent is likely to need.
It is built for agent systems where sending every available tool can become expensive, slow, and noisy. If an API-backed router fails, times out, or is missing credentials, Context Janitor can fall back to a local heuristic so the pipeline keeps moving.
Context Janitor is MCP-compatible by design. MCP servers expose structured tool definitions, and
Context Janitor can sit between those JSON tool catalogs and your agent runtime with janitor mcp-proxy.
Large tool catalogs can make agents worse in two ways:
- They inflate every request with thousands of extra prompt tokens.
- They can increase the chance that the model picks a plausible but wrong tool.
Context Janitor keeps the tool surface small before the main model sees it.
Use Context Janitor when your agent has dozens of tools, tool schema tokens are inflating requests,
or the model is choosing from too many similar tools. It is most useful when you want a small,
inspectable pruning layer before a larger model call, middleware request, or MCP tools/list
response.
You may not need Janitor for tiny tool catalogs, one-off scripts, or workflows where every tool must always be visible to the model. If missing a tool would be risky, build an eval pack from your own agent logs before using pruning in production.
Context Janitor reduces the tool-selection surface before inference. It does not evaluate argument correctness or replace observability tools. Pair it with agent eval tools when validating production tool-calling behavior.
| Setup | Tools sent | Tool overhead | Expected effect |
|---|---|---|---|
| Without Janitor | 50 | High | More prompt cost and more tool confusion |
| With Janitor | 5 | Low | Smaller payloads and clearer tool choice |
Run locally:

```powershell
python scripts\benchmark.py --providers heuristic
```

Current output on the included 100-prompt synthetic benchmark and `examples/tools.json`:

```text
+-----------------------+--------------------+---------------+-----------+--------+-----------------+------------------+-------------+----------------------------------+
| Mode                  | Selection accuracy | Agent success | Median ms | p95 ms | Router cost/run | Tool payload/run | Compression | Notes                            |
+-----------------------+--------------------+---------------+-----------+--------+-----------------+------------------+-------------+----------------------------------+
| No Janitor (baseline) | 100.0%             | not measured  | 0         | 0      | $0.000000       | $0.001060        | 0.0%        | all 8 tools sent for 100 prompts |
| heuristic             | 100.0%             | not measured  | 0         | 0      | $0.000000       | $0.000280        | 73.6%       | ok                               |
+-----------------------+--------------------+---------------+-----------+--------+-----------------+------------------+-------------+----------------------------------+
```
Benchmark notes:

- `Selection accuracy` means the expected tool was present in the pruned selection.
- `No Janitor (baseline)` has 100% selection accuracy because every tool is sent.
- `Agent success` is intentionally `not measured` unless you provide real agent eval data.
- `Tool payload/run` uses `--payload-price-per-million`, which defaults to `$5.00`.
- `Router cost/run` uses `--router-price-per-million`, which defaults to `$0.15`.
- The included benchmark is a small synthetic sanity check. Run it against your own catalog before making production claims.
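
The cost and compression columns can be re-derived by hand. The sketch below is not the benchmark script itself; the function names are illustrative, and it only assumes that payload cost is tokens times the per-million price. At the default `$5.00`, the baseline `$0.001060` works back to 212 tokens per run.

```python
def payload_cost_usd(tokens: int, price_per_million: float = 5.0) -> float:
    """Cost of sending `tokens` tool-schema tokens at a per-million-token price."""
    return tokens * price_per_million / 1_000_000


def compression(baseline_cost: float, pruned_cost: float) -> float:
    """Fraction of the baseline tool payload removed by pruning."""
    return 1.0 - pruned_cost / baseline_cost


# Values copied from the benchmark table above.
baseline = 0.001060
pruned = 0.000280

print(f"{payload_cost_usd(212):.6f}")             # 0.001060
print(f"{compression(baseline, pruned):.1%}")     # 73.6%
```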
A more realistic release gate is examples/messy_production_evals.jsonl, a 100-case prompt pack
with informal, ambiguous workplace phrasing, plus examples/messy_aliases.janitor.yaml for team
slang. The release gate requires the local heuristic to keep the expected tool for every messy case
against the simulated production catalog when that alias config is provided.
To display measured agent success rates:

```powershell
python scripts\benchmark.py --providers heuristic --agent-success-file examples\agent_success.example.json
```

From a local checkout:

```powershell
pip install -e .
```

With test dependencies:

```powershell
pip install -e ".[test]"
```

With contributor tooling:

```powershell
pip install -e ".[dev]"
```

The package exposes two console scripts:

- `janitor`
- `context-janitor`

Most examples use the shorter janitor command.
On Windows, pip may install those scripts outside your current PATH. If janitor is not
recognized in cmd.exe, use:

```bat
set PATH=%PATH%;%APPDATA%\Python\Python314\Scripts
```

Or run the module directly:

```powershell
python -m context_janitor.cli --help
```

Prune the example catalog for a prompt:

```powershell
janitor prune --prompt "Search GitHub issues and make a PR" --tools examples\tools.json --limit 2
```

Output:

```json
{
  "selected": [
    {
      "name": "github_search_issues",
      "description": "Search issues in a GitHub repository by text, label, state, or assignee."
    },
    {
      "name": "github_create_pr",
      "description": "Open a pull request with a title, body, source branch, and target branch."
    }
  ],
  "metadata": {
    "requested_provider": "heuristic",
    "provider": "heuristic",
    "fallback_used": false,
    "cache_hit": false,
    "duration_ms": 0,
    "limit": 2,
    "available_tools": 8,
    "original_tokens": 212,
    "selected_tokens": 60,
    "reduced_tokens": 152,
    "estimated_savings_usd": 0.00076
  }
}
```

Names-only output:

```powershell
janitor prune --prompt "Search GitHub issues and make a PR" --tools examples\tools.json --limit 2 --format names
```

```text
github_search_issues
github_create_pr
```

middleware reads an OpenAI-compatible request JSON from stdin, prunes the tools field, and
writes the modified payload to stdout.

```powershell
Get-Content request.json | janitor middleware --limit 5
```

In cmd.exe, use `type` instead of `Get-Content`:

```bat
type examples\request.example.json | janitor middleware --limit 2
```

Input shape:

```json
{
  "messages": [
    { "role": "user", "content": "Create a calendar event" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "calendar_create",
        "description": "Create events."
      }
    },
    {
      "type": "function",
      "function": {
        "name": "web_search",
        "description": "Search the web."
      }
    }
  ]
}
```

Logs go to stderr, so stdout remains safe to pipe into another command.

Plain tool objects:

```json
[
  {
    "name": "github_create_pr",
    "description": "Open a pull request."
  }
]
```

OpenAI-style function tools:

```json
[
  {
    "type": "function",
    "function": {
      "name": "github_create_pr",
      "description": "Open a pull request.",
      "parameters": {
        "type": "object"
      }
    }
  }
]
```

Object wrappers are also accepted:

```json
{
  "tools": [
    { "name": "web_search", "description": "Search the web." }
  ]
}
```

Context Janitor supports four provider values:

| Provider | Uses network | Required environment |
|---|---|---|
| `heuristic` | No | None |
| `openai` | Yes | `OPENAI_API_KEY` |
| `anthropic` | Yes | `ANTHROPIC_API_KEY` |
| `gemini` | Yes | `GEMINI_API_KEY` or `GOOGLE_API_KEY` |

Provider calls use only the Python standard library and default to an 800 ms timeout.
Example with OpenAI:

```powershell
$env:OPENAI_API_KEY = "..."
janitor prune `
  --provider openai `
  --model gpt-4o-mini `
  --prompt "Summarize this PDF" `
  --tools tools.json `
  --timeout-ms 800 `
  --fallback heuristic `
  --log-level INFO
```

If the provider errors, rate-limits, times out, or is missing credentials, `--fallback heuristic`
logs a warning and returns a heuristic selection instead of crashing the pipeline.
Set `--fallback none` if you want provider failures to exit with an error.

The local selector is not just a keyword set. It is a compact TF-IDF-style ranker:

- Tokenizes the prompt and each tool's `name + description`
- Splits names like `github_search_issues` into useful terms
- Removes common stop words
- Expands common intent aliases like `meeting -> calendar event`
- Scores term frequency in the tool text
- Weighs rare terms more heavily with inverse document frequency
- Adds a small bonus for longer substring matches

Distinctive terms like stripe, github, postgres, or pdf usually beat generic words like
create, get, or send.
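
The steps above can be sketched in a few lines. This is an illustration of the general TF-IDF idea, not Context Janitor's actual selector: the stop-word list, alias table, and smoothed IDF formula are assumptions, and the substring bonus is left out for brevity.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "for", "by", "or", "with"}
ALIASES = {"meeting": ["calendar", "event"]}  # illustrative alias table


def tokenize(text: str) -> list[str]:
    """Lowercase, split on non-alphanumerics (so snake_case names split), drop stop words."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t and t not in STOP_WORDS]


def rank(prompt: str, tools: list[dict], limit: int = 5) -> list[str]:
    """Return up to `limit` tool names, highest TF-IDF score first."""
    docs = [Counter(tokenize(f"{t['name']} {t.get('description', '')}")) for t in tools]
    # Document frequency: in how many tools does each term appear?
    df = Counter(term for doc in docs for term in doc)
    n = len(tools)
    # Smoothed inverse document frequency: rarer terms get a larger weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

    terms = []
    for term in tokenize(prompt):
        terms.append(term)
        terms.extend(ALIASES.get(term, []))  # expand intent aliases

    scored = []
    for tool, doc in zip(tools, docs):
        score = sum(doc[term] * idf.get(term, 0.0) for term in set(terms))
        scored.append((score, tool["name"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:limit] if score > 0]


TOOLS = [
    {"name": "github_search_issues", "description": "Search issues in a GitHub repository."},
    {"name": "github_create_pr", "description": "Open a pull request."},
    {"name": "calendar_create", "description": "Create calendar events."},
    {"name": "web_search", "description": "Search the web."},
]

print(rank("Search GitHub issues", TOOLS, limit=2))  # ['github_search_issues', 'web_search']
print(rank("Schedule a meeting", TOOLS, limit=1))    # ['calendar_create']
```

In the second call the prompt shares no word with any tool; only the `meeting` alias expansion brings in `calendar`, which is why alias tables matter for informal phrasing.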
Context Janitor searches upward from the current directory for .janitor.yaml.
Example:

```yaml
provider: anthropic
model: claude-3-haiku-20240307
limit: 5
fallback: heuristic
cache: true
timeout_ms: 800
log_level: INFO
format: json
price_per_million_tokens: 5.0
keep: log_error,notify_admin
aliases:
  bq: bigquery,query,warehouse
  blast: email,send
  prio: priority
```

The config parser intentionally supports simple top-level `key: value` settings plus the `aliases`
mapping shown above. It is not a full YAML implementation.

CLI flags override config values.
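
To make the supported subset concrete, here is a rough sketch of a parser for top-level `key: value` lines plus one indented mapping, followed by a CLI-over-config merge. It is not the project's parser: `parse_simple_config` and `merge_cli` are hypothetical names, values are left as strings, and treating lines that start with `#` as comments is an assumption.

```python
def parse_simple_config(text: str) -> dict:
    """Parse top-level `key: value` lines plus an indented mapping such as `aliases:`."""
    config: dict = {}
    section = None  # name of the mapping currently being filled, if any
    for raw in text.splitlines():
        stripped = raw.strip()
        if not stripped or stripped.startswith("#"):  # assumption: '#' lines are comments
            continue
        key, _, value = stripped.partition(":")
        key, value = key.strip(), value.strip()
        if raw[:1].isspace() and section is not None:
            # Indented entry inside the open mapping, e.g. "bq: bigquery,query,warehouse".
            config[section][key] = [v.strip() for v in value.split(",") if v.strip()]
        elif not value:
            # A bare "aliases:" line opens a nested mapping.
            section = key
            config[section] = {}
        else:
            section = None
            config[key] = value  # kept as a string; no type coercion in this sketch
    return config


def merge_cli(config: dict, cli_flags: dict) -> dict:
    """CLI flags that were actually passed (not None) win over config values."""
    return {**config, **{k: v for k, v in cli_flags.items() if v is not None}}


EXAMPLE = """\
provider: anthropic
model: claude-3-haiku-20240307
limit: 5
keep: log_error,notify_admin
aliases:
  bq: bigquery,query,warehouse
  prio: priority
"""

config = parse_simple_config(EXAMPLE)
print(config["aliases"]["bq"])                                # ['bigquery', 'query', 'warehouse']
print(merge_cli(config, {"limit": "3", "provider": None})["limit"])  # 3
```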
For safety in untrusted repositories, an auto-discovered .janitor.yaml cannot silently switch the
selector from heuristic to a network provider. If a discovered config sets provider: openai,
provider: anthropic, or provider: gemini, pass --config path\to\.janitor.yaml or
--provider ... explicitly to confirm that you trust the project and intend to send prompts/tool
metadata to that provider.
| Key | Default | Description |
|---|---|---|
| `provider` | `heuristic` | Selection backend: `heuristic`, `openai`, `anthropic`, or `gemini` |
| `model` | `null` | Model name for API-backed providers |
| `limit` | `5` | Maximum number of tools to keep |
| `fallback` | `heuristic` | Use `heuristic` or `none` after provider failure |
| `cache` | `false` | Reuse previous selections from local cache |
| `timeout_ms` | `800` | Provider timeout in milliseconds |
| `log_level` | `WARNING` | `DEBUG`, `INFO`, `WARNING`, `ERROR`, or `CRITICAL` |
| `format` | `json` | `prune` output format: `json`, `names`, or `raw` |
| `price_per_million_tokens` | `5.0` | Price used for savings estimates |
| `keep` | empty | Comma-separated tool names that must stay selected |
| `aliases` | empty | Team-specific prompt slang to expand before ranking |

Use aliases for vocabulary your tool descriptions do not already cover. For example, if your team
types bq but your tool says BigQuery, map bq to bigquery,query,warehouse in config instead
of hardcoding that slang into Context Janitor.
Some production agents need safety, audit, or notification tools in every request. Use --keep
to force those tools into the selected set:

```powershell
janitor prune --prompt "Search the web" --tools tools.json --limit 5 --keep log_error,notify_admin
```

Kept tools reserve slots inside the limit. With `--limit 5` and two kept tools, Janitor ranks
the catalog for the remaining three slots.
Selections modified by keep are not written into the normal semantic cache, because required
tools are policy rather than prompt relevance.
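
The slot arithmetic looks roughly like this. `select_with_keep` is a hypothetical helper, not the library API. Two behaviors here are assumptions the text above does not specify: kept tools are listed first, and if more tools are kept than the limit allows, all kept tools are still returned.

```python
def select_with_keep(ranked: list[str], keep: list[str], limit: int) -> list[str]:
    """Reserve slots for `keep`, then fill the rest from the relevance ranking.

    `ranked` holds every tool name in the catalog, most relevant first.
    """
    kept = [name for name in keep if name in ranked]    # ignore names not in the catalog
    remaining = max(limit - len(kept), 0)               # slots left for ranked tools
    fill = [name for name in ranked if name not in kept][:remaining]
    return kept + fill


ranked = [
    "web_search", "github_search_issues", "calendar_create",
    "pdf_summarize", "log_error", "notify_admin", "stripe_refund",
]
print(select_with_keep(ranked, ["log_error", "notify_admin"], limit=5))
# ['log_error', 'notify_admin', 'web_search', 'github_search_issues', 'calendar_create']
```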
Enable prompt caching:

```powershell
janitor prune --cache --prompt "Summarize the daily logs" --tools tools.json
```

Cache file:

```text
~/.janitor_cache/cache.json
```

The cache stores selections by prompt, provider, model, limit, and catalog hash. It can also reuse highly similar prompts. If the cache cannot be read or written, Janitor ignores the cache and keeps running. Cache updates are written through a temporary file and atomically replaced, so interrupted writes should not leave partial JSON behind.
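
A minimal sketch of the keying and write pattern described above, using only the standard library. The function names and key layout are assumptions for illustration; the real cache format may differ.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def cache_key(prompt: str, provider: str, model: Optional[str], limit: int, catalog: list) -> str:
    """Hash prompt, provider, model, limit, and a hash of the tool catalog into one key."""
    catalog_hash = hashlib.sha256(json.dumps(catalog, sort_keys=True).encode()).hexdigest()
    raw = json.dumps([prompt, provider, model, limit, catalog_hash])
    return hashlib.sha256(raw.encode()).hexdigest()


def write_cache_atomically(path: Path, entries: dict) -> bool:
    """Write JSON to a temp file next to `path`, then swap it into place.

    Returns False instead of raising when the filesystem refuses, so a broken
    cache never stops the pipeline.
    """
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as handle:
                json.dump(entries, handle)
            # os.replace swaps the finished file in as a single rename, so readers
            # see either the old cache or the new one, never a partial write.
            os.replace(tmp_name, path)
        except Exception:
            if os.path.exists(tmp_name):
                os.unlink(tmp_name)  # do not leave stray temp files behind
            raise
        return True
    except OSError:
        return False
```

Writing the temp file in the same directory as the target matters: a rename across filesystems is not a single step.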
Privacy note: --cache stores prompt previews and prompt tokens in a local plaintext file. The
cache is not encrypted or obfuscated at rest. Keep it off for sensitive prompts unless local
plaintext storage is acceptable for your environment. Janitor ignores oversized cache files and
trims old entries so the cache cannot grow without bound.
Clear the local cache while iterating on prompts or tool descriptions:

```powershell
janitor clear-cache
```

Inspect cache metadata:

```powershell
janitor cache-info
```

Use `--explain` to see why tools were kept or pruned.

```powershell
janitor prune --prompt "Search GitHub issues" --tools examples\tools.json --limit 2 --explain
```

JSON output includes an `explain` array with entries like:

```json
{
  "name": "github_search_issues",
  "selected": true,
  "score": 14.0026,
  "matched_terms": ["github", "issues", "search"],
  "top_terms": ["issues", "search", "github", "substring_match"]
}
```

For `--format names` or `--format raw`, explanations are printed to stderr.

Use --dry-run to audition Janitor without changing the middleware request payload or touching the
local prune cache:

```powershell
janitor prune --cache --dry-run --prompt "Search GitHub issues" --tools examples\tools.json --limit 2
Get-Content request.json | janitor middleware --limit 5 --dry-run --log-level INFO
```

For `middleware`, the original JSON is written back to stdout. Janitor logs what it would have kept
and pruned to stderr.

Select tools for a prompt and a tool catalog.

```powershell
janitor prune --prompt PROMPT --tools tools.json [options]
```

Options:
| Option | Description |
|---|---|
| `--prompt TEXT` | User prompt. If omitted, stdin is used |
| `--tools PATH` | Required path to a JSON tool catalog |
| `--limit N` | Maximum tools to keep |
| `--provider NAME` | `heuristic`, `openai`, `anthropic`, or `gemini` |
| `--model NAME` | Model for API-backed providers |
| `--fallback NAME` | `heuristic` or `none` |
| `--timeout-ms N` | Provider timeout in milliseconds |
| `--cache / --no-cache` | Enable or disable local cache |
| `--log-level LEVEL` | Structured stderr logging level |
| `--price-per-million-tokens N` | Cost estimate price |
| `--keep a,b` | Required tools to keep |
| `--explain` | Include or print ranking explanations |
| `--dry-run` | Run selection without reading or writing the local cache |
| `--format json` | Default structured output |
| `--format names` | Print selected tool names |
| `--format raw` | Print original selected tool objects |
| `--config PATH` | Explicit config file path |

Read a request payload from stdin and prune its tools field.

```powershell
janitor middleware [options] < request.json
```

Most options match prune. middleware --dry-run logs the pruning decision without modifying the
request payload.
Proxy an MCP stdio server and prune tools/list responses before they reach the client:

```powershell
janitor mcp-proxy --prompt "Find GitHub issues" --limit 5 -- python -m your_mcp_server
```

MCP tools/list does not include the user's chat prompt, so pass a scoped task prompt with
--prompt or JANITOR_PROMPT. Use --keep with prune or middleware for hidden policy tools;
for MCP proxy sessions, configure the downstream server around one narrow workflow when possible.
Validate a tool catalog and report quality warnings before using it in production:

```powershell
janitor lint --tools tools.json
```

The linter checks the catalog shape, duplicate names, empty descriptions, and very short descriptions.
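
The checks listed above are simple enough to sketch. This is not the `janitor lint` implementation: `lint_catalog`, the warning wording, and the 20-character threshold for "very short" are all hypothetical.

```python
from collections import Counter

MIN_DESCRIPTION_CHARS = 20  # hypothetical threshold for "very short"


def lint_catalog(tools: object) -> list[str]:
    """Return human-readable warnings for a parsed tool catalog."""
    if not isinstance(tools, list):
        return ["catalog must be a JSON list of tool objects"]
    warnings = []
    names = []
    for index, tool in enumerate(tools):
        # Accept plain objects and OpenAI-style {"type": "function", "function": {...}}.
        spec = tool.get("function", tool) if isinstance(tool, dict) else None
        if not isinstance(spec, dict) or not spec.get("name"):
            warnings.append(f"tool #{index}: missing name")
            continue
        name = spec["name"]
        names.append(name)
        description = (spec.get("description") or "").strip()
        if not description:
            warnings.append(f"{name}: empty description")
        elif len(description) < MIN_DESCRIPTION_CHARS:
            warnings.append(f"{name}: very short description ({len(description)} chars)")
    for name, count in Counter(names).items():
        if count > 1:
            warnings.append(f"{name}: duplicate name ({count} entries)")
    return warnings


catalog = [
    {"name": "web_search", "description": ""},
    {"type": "function", "function": {"name": "web_search", "description": "Search the web."}},
]
for warning in lint_catalog(catalog):
    print(warning)
# web_search: empty description
# web_search: very short description (15 chars)
# web_search: duplicate name (2 entries)
```

Short descriptions are worth flagging because a description-based ranker has little text to match against.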
Delete the local semantic-selection cache:

```powershell
janitor clear-cache
```

Show cache path, entry count, providers, models, and creation timestamps:

```powershell
janitor cache-info
```

Synchronous API:

```python
from context_janitor.models import load_tools
from context_janitor.selection import select_resilient

# tool_json holds the tool catalog in one of the accepted JSON shapes.
tools = load_tools(tool_json)
result = select_resilient(
    provider="openai",
    model="gpt-4o-mini",
    prompt="Find GitHub issues about auth",
    tools=tools,
    limit=5,
    fallback="heuristic",
    timeout_ms=800,
    cache_enabled=True,
    keep=("log_error", "notify_admin"),
)
selected_tools = result.selected
```

Async wrapper:

```python
from context_janitor.selection import select_resilient_async

# Call this from inside a coroutine; `tools` comes from load_tools as above.
result = await select_resilient_async(
    provider="heuristic",
    prompt="Create a calendar event",
    tools=tools,
    limit=3,
)
```

`select_resilient_async` runs the same implementation in a worker thread. The current provider
clients use the Python standard library rather than native async HTTP.
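
For readers unfamiliar with the worker-thread pattern, here is one common way to wrap a blocking call for async code, using `asyncio.to_thread`. Whether Context Janitor uses this exact mechanism is an assumption; `select_blocking` is a hypothetical stand-in for a synchronous selector.

```python
import asyncio
import time


def select_blocking(prompt: str, tools: list[str], limit: int) -> list[str]:
    """Hypothetical stand-in for a synchronous selector that may block on HTTP."""
    time.sleep(0.05)
    return tools[:limit]


async def select_async(prompt: str, tools: list[str], limit: int) -> list[str]:
    # asyncio.to_thread runs the blocking call in a worker thread,
    # so the event loop stays free for other coroutines.
    return await asyncio.to_thread(select_blocking, prompt, tools, limit)


async def main() -> list[list[str]]:
    # The two selections overlap instead of running back to back.
    return await asyncio.gather(
        select_async("Create a calendar event", ["calendar_create", "web_search"], 1),
        select_async("Search the web", ["web_search", "calendar_create"], 1),
    )


results = asyncio.run(main())
print(results)  # [['calendar_create'], ['web_search']]
```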
Use --log-level INFO to emit production-friendly logs to stderr:

```text
[Janitor] INFO event=pruned requested_provider=openai provider=heuristic fallback=true cache_hit=false tools_before=50 tools_after=5 tokens_before=12000 tokens_after=1200 tokens_saved=10800 estimated_savings_usd=0.054000 duration_ms=7
```

Token counts use a lightweight estimate of roughly four characters per token. Savings are useful for quick comparisons, not invoice-grade accounting.
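
As a worked check of those numbers, the sketch below estimates tokens at four characters each and recomputes the savings in the sample log line. The helper names are illustrative, and rounding the token estimate up is an assumption.

```python
def estimate_tokens(text: str) -> int:
    """Rough token count: about four characters per token (rounded up here)."""
    return (len(text) + 3) // 4


def estimated_savings_usd(tokens_saved: int, price_per_million: float = 5.0) -> float:
    """Dollar value of the tokens removed from the tool payload."""
    return tokens_saved * price_per_million / 1_000_000


def parse_log_line(line: str) -> dict[str, str]:
    """Collect the key=value fields from a structured log line."""
    return dict(part.split("=", 1) for part in line.split() if "=" in part)


LOG_LINE = (
    "[Janitor] INFO event=pruned requested_provider=openai provider=heuristic "
    "fallback=true tokens_before=12000 tokens_after=1200 tokens_saved=10800 "
    "estimated_savings_usd=0.054000 duration_ms=7"
)

fields = parse_log_line(LOG_LINE)
saved = int(fields["tokens_before"]) - int(fields["tokens_after"])
print(saved)                                  # 10800
print(f"{estimated_savings_usd(saved):.6f}")  # 0.054000
```

A parser like `parse_log_line` is also enough to feed these fields into a dashboard or a quick fallback-rate count.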
Run the included benchmark:

```powershell
python scripts\benchmark.py --providers heuristic openai anthropic gemini --openai-model gpt-4o-mini --anthropic-model claude-3-haiku-20240307 --gemini-model gemini-1.5-flash
```

Useful benchmark options:

| Option | Default | Description |
|---|---|---|
| `--providers` | `heuristic` | Providers to compare |
| `--limit` | `5` | Tools kept per prompt |
| `--timeout-ms` | `800` | Provider timeout in milliseconds |
| `--router-price-per-million` | `0.15` | Router model input price estimate |
| `--payload-price-per-million` | `5.0` | Main model tool payload price estimate |
| `--agent-success-file` | none | JSON map of measured agent success rates |

Model pricing moves quickly, so treat the defaults as placeholders and set these values to your current provider prices when calculating ROI.
Example agent success file:

```json
{
  "baseline": 0.85,
  "heuristic": 0.99
}
```

The benchmark skips API providers when their API keys are missing.

Use scripts/evaluate.py to check Janitor against prompts from your own product instead of the
bundled synthetic benchmark:

```powershell
python scripts\evaluate.py --tools examples\tools.json --evals examples\evals.example.json --providers heuristic --limit 2
```

To report the production-facing Distraction Delta, pass measured agent success rates:

```powershell
python scripts\evaluate.py --tools examples\tools.json --evals examples\evals.example.json --providers heuristic --limit 2 --agent-success-file examples\agent_success.example.json
```

Distraction Delta is `Success_with_Janitor - Success_baseline`, which helps separate "the right
tool was present" from "the agent actually completed the task more often."
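
As a quick worked example, using the illustrative rates from the example agent success file (0.85 baseline, 0.99 with the heuristic), the formula gives a delta of +0.14. The helper name is hypothetical.

```python
def distraction_delta(success_rates: dict[str, float], mode: str = "heuristic") -> float:
    """Success rate with Janitor minus the success rate with the full catalog."""
    return success_rates[mode] - success_rates["baseline"]


rates = {"baseline": 0.85, "heuristic": 0.99}
print(f"{distraction_delta(rates):+.2f}")  # +0.14
```

A positive delta means the agent completed more tasks with the pruned catalog; a negative delta suggests pruning is removing tools the agent needed.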
Eval files may be JSON or JSONL. Each case needs a prompt and one of expected_tool,
expected_tools, or expected:

```json
[
  {
    "id": "github-triage",
    "prompt": "Find open GitHub issues about billing and summarize the blockers.",
    "expected_tool": "github_search_issues"
  }
]
```

For production rollout, replace `examples/evals.example.json` with real tasks from your agent logs
and track the resulting accuracy alongside downstream agent success.
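
If you generate eval packs from your own logs, a small loader helps confirm the file has the expected shape before running the scripts. The sketch below is not the loader in `scripts/evaluate.py`; the function names are hypothetical, and treating a case as a hit only when every expected tool survives is an assumption.

```python
import json


def load_eval_cases(text: str) -> list[dict]:
    """Parse eval cases from a JSON array or from JSONL (one object per line)."""
    stripped = text.strip()
    if stripped.startswith("["):
        return json.loads(stripped)
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]


def expected_tools(case: dict) -> list[str]:
    """Normalize expected_tool / expected_tools / expected into a list of names."""
    for key in ("expected_tools", "expected_tool", "expected"):
        if key in case:
            value = case[key]
            return [value] if isinstance(value, str) else list(value)
    raise ValueError(f"case {case.get('id', '?')} has no expected tool")


def is_hit(case: dict, selected: list[str]) -> bool:
    """Assumption: a case passes only if every expected tool survived pruning."""
    return all(name in selected for name in expected_tools(case))


JSONL = """\
{"id": "github-triage", "prompt": "Find open GitHub issues about billing.", "expected_tool": "github_search_issues"}
{"id": "pr", "prompt": "Open a PR for the fix.", "expected_tools": ["github_create_pr"]}
"""

cases = load_eval_cases(JSONL)
print([expected_tools(case) for case in cases])
# [['github_search_issues'], ['github_create_pr']]
```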
Use scripts/eval_agent.py when you want to measure the whole agent loop, not just whether the
expected tool survived pruning.
The harness runs your agent command once with the full catalog and once with Janitor-pruned tools. Each run receives a JSON payload on stdin:

```json
{
  "id": "github-triage",
  "mode": "janitor",
  "provider": "heuristic",
  "prompt": "Find open GitHub issues about billing and summarize the blockers.",
  "expected_tools": ["github_search_issues"],
  "tools": [{ "name": "github_search_issues", "description": "Search GitHub issues." }]
}
```

The agent command should print JSON with a boolean `success` field:

```json
{ "success": true, "used_tools": ["github_search_issues"] }
```

Run the bundled deterministic mock agent:

```powershell
python scripts\eval_agent.py --tools examples\tools.json --evals examples\evals.example.json --providers heuristic --limit 2 -- python examples\agent_runner_mock.py
```

For a local model or real agent, replace the command after `--` with your runner. The runner can
call Ollama, llama.cpp, a LangGraph app, or any process that accepts the JSON payload on stdin.
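
A custom runner only has to honor that stdin/stdout contract. The skeleton below is a hypothetical starting point, not the bundled `agent_runner_mock.py`: its `run_case` simply reports success when every expected tool is still offered, which is where a real runner would call its model instead.

```python
import io
import json
import sys


def run_case(payload: dict) -> dict:
    """Placeholder agent: 'succeeds' when every expected tool is still offered."""
    offered = set()
    for tool in payload.get("tools", []):
        spec = tool.get("function") or tool  # accept plain and OpenAI-style tools
        offered.add(spec.get("name"))
    expected = payload.get("expected_tools", [])
    used = [name for name in expected if name in offered]
    return {"success": bool(expected) and len(used) == len(expected), "used_tools": used}


def main(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Read one JSON payload from stdin and print one JSON result to stdout."""
    payload = json.loads(stdin.read())
    stdout.write(json.dumps(run_case(payload)) + "\n")


# Demonstration with in-memory streams instead of a real pipe.
demo_in = io.StringIO(json.dumps({
    "id": "github-triage",
    "mode": "janitor",
    "expected_tools": ["github_search_issues"],
    "tools": [{"name": "github_search_issues", "description": "Search GitHub issues."}],
}))
demo_out = io.StringIO()
main(demo_in, demo_out)
print(demo_out.getvalue().strip())
# {"success": true, "used_tools": ["github_search_issues"]}
```

To use it as a real runner, save it as a script, drop the demonstration, and call `main()` under an `if __name__ == "__main__":` guard.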
The repository includes a small local-model example that prunes a noisy 21-tool catalog before sending the remaining tool definitions to Ollama:

```powershell
pip install ollama
python examples\ollama_agent.py
```

Small local models sometimes return tool calls as plain text or fenced JSON instead of native tool calls. The example handles all three shapes so you can confirm the pruned catalog is still usable before wiring Janitor into a larger agent loop.
For thresholded rollout gates, see Production Rollout.
To draft a real eval pack from agent logs:

```powershell
python scripts\prepare_evals.py --logs agent-logs.jsonl --success-field success --output production-evals.draft.json
```

Before you have real logs, generate a deterministic production-like dataset:

```powershell
python scripts\generate_simulated_data.py
```

This creates:

- `examples\simulated_production_tools.json`: 100 OpenAI-style tools across realistic domains.
- `examples\simulated_production_evals.json`: 100 labeled prompts.
- `examples\simulated_agent_logs.jsonl`: 100 JSONL agent-log rows.

Run selection accuracy:

```powershell
python scripts\evaluate.py --tools examples\simulated_production_tools.json --evals examples\simulated_production_evals.json --providers heuristic --limit 5 --min-accuracy 0.95
```

Run the full agent-success harness with the mock runner:

```powershell
python scripts\eval_agent.py --tools examples\simulated_production_tools.json --evals examples\simulated_production_evals.json --providers heuristic --limit 5 --min-janitor-success-rate 0.95 --min-distraction-delta 0.50 -- python examples\agent_runner_mock.py
```

The repository includes a VHS tape at `docs/demo.tape`.
Render it with VHS:

```powershell
vhs docs/demo.tape
```

On Windows, ScreenToGif is also a practical option for recording the terminal benchmark.

Set up:

```powershell
pip install -e ".[dev]"
```

Run tests:

```powershell
python -m pytest
```

Run lint and type checks:

```powershell
python -m ruff check .
python -m mypy src scripts
```

Validate package metadata:

```powershell
python -c "import tomllib; tomllib.load(open('pyproject.toml','rb')); print('pyproject ok')"
```

Run the benchmark:

```powershell
python scripts\benchmark.py --providers heuristic
```

Build distributable artifacts:

```powershell
Remove-Item -Recurse -Force dist,build -ErrorAction SilentlyContinue
python -m build
```

Run the full release gate:

```powershell
python scripts\release_check.py
```

- Confirm the release version in `pyproject.toml`.
- Run Release Checklist.
- Create a matching GitHub release tag, for example `v1.0.0rc4`.
- Run the tests and benchmark.
- Run thresholded selection and agent-success evals.
- Clean stale build artifacts, then build the wheel and source distribution.
- Render or update the terminal GIF.
- Verify the README examples still match CLI output.
Context Janitor is at v1.0.0rc4: the CLI, config shape, heuristic selector, fallback behavior,
cache path, MCP proxy, eval tooling, and packaging flow are ready for final release-candidate
validation. Before the final v1.0.0 release, the remaining validation target is real-world testing
against external tool catalogs and at least one real-log eval pack.