We tested 30+ AI APIs. Designed routing from the data. Then Anthropic published a paper describing the same architecture.
Three articles. Each one builds on the previous. Together, they're the complete story of how we built intelligent API routing — from raw data to production architecture.
Article 1: OBSERVE — Benchmark 30+ APIs (who's fast, who's accurate, who lies)
↓
Article 2: DESIGN — Use exam data to design multi-provider routing
↓
Article 3: ARCHITECT — Build L1→L4, then Anthropic publishes the same patterns
This is Observability-Driven Routing: measure first, decide from data, automate decisions, verify results. We didn't know it had a name. We just kept solving the next problem.
30+ AI APIs Tested from Tokyo — February 2026
15 LLMs, 3 search engines, 5 translation APIs, 3 voice APIs, 6 data APIs. Four rounds of real HTTP requests, not synthetic benchmarks.
Key findings:
| Discovery | Why it matters |
|---|---|
| GPT-4o-mini can't do basic math (30/100 reasoning) | Don't trust brand names |
| Free 8B model beats GPT (Cerebras 92 vs GPT 82) | Cost ≠ quality |
| Groq is 8x faster but scores 30/100 on Chinese | Speed ≠ quality across languages |
| Free LLM translation beats DeepL (94 vs 93) | Dedicated APIs aren't always better |
If we'd picked providers by reputation instead of data, we'd have chosen wrong on every one of these.
31 Providers Tested, 10 Best Paths Found
The benchmark data revealed a critical insight: no single provider is best at everything. Groq is fastest but fails on Chinese. Cerebras scores highest but can't do Japanese. GPT-4o-mini handles all languages but can't do math.
So we designed language-aware, quality-driven routing — different provider chains for different languages and tasks, all driven by exam data.
Key insight: A provider scoring 100 in English and 30 in Chinese isn't a bug in our test. It's a 70-point quality collapse that's completely invisible to monitoring (valid 200 OK, well-formed JSON). Our routing catches this. No existing API gateway does.
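The mechanics can be sketched as a score table plus a quality floor. This is a minimal illustration, not our production code: the provider names and numbers below are made up to mirror the patterns described above, not actual exam output.

```python
# Sketch of exam-driven, language-aware provider selection.
# Provider names and scores are illustrative, not real exam data.
EXAM_SCORES = {
    # (provider, language) -> quality score from offline exam runs
    ("groq-llama", "en"): 95,
    ("groq-llama", "zh"): 30,   # fast, but collapses on Chinese
    ("cerebras-8b", "en"): 92,
    ("cerebras-8b", "ja"): 40,
    ("gpt-4o-mini", "en"): 82,
    ("gpt-4o-mini", "zh"): 85,
    ("gpt-4o-mini", "ja"): 84,
}

def rank_providers(language: str, min_score: int = 60) -> list[str]:
    """Rank providers by exam score for this language,
    dropping anything below the quality floor."""
    scored = [
        (score, provider)
        for (provider, lang), score in EXAM_SCORES.items()
        if lang == language and score >= min_score
    ]
    return [provider for _, provider in sorted(scored, reverse=True)]

# A 200 OK with well-formed JSON from groq-llama on Chinese would pass
# any HTTP-level monitor; the exam table is what catches the collapse.
print(rank_providers("zh"))  # ['gpt-4o-mini']
print(rank_providers("en"))  # ['groq-llama', 'cerebras-8b', 'gpt-4o-mini']
```

The point of the quality floor is exactly the 70-point collapse: a provider can stay in the English chain while being filtered out of the Chinese one entirely.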
Available in: English · 繁體中文 · 日本語
Tokens Down 76%, Cost Down 96%, 4.6x Faster — Reading Anthropic's Tool Use Paper, 4 Commits the Same Day
Anthropic published "Advanced Tool Use" describing three techniques: Tool Search, Tool Use Examples, and Programmatic Tool Calling.
We read it. Recognized the same patterns we'd been building. Implemented two features the same day. And realized we'd independently built four things they hadn't described.
| What Anthropic published | What we already had | What we built after reading |
|---|---|---|
| Tool Search (85% token reduction) | — | Defer loading: 10.8KB→2.5KB (76%↓), same day |
| Tool Use Examples (72%→90% accuracy) | — | 11 endpoints with real JSON examples, same day |
| Dynamic Filtering | L2 Smart Gateway (strategy routing) | — |
| Programmatic Tool Calling (37% token↓) | L4 Task Engine | PTC mode: $0.02 vs $0.49 (96%↓), same day |
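What "defer loading" means in practice can be shown with a toy example: send the model lightweight name-plus-description stubs up front, and attach a tool's full JSON schema only once that tool is selected. The tool names and schemas below are invented for illustration; the 10.8KB→2.5KB figure comes from the real endpoint set, not this sketch.

```python
import json

# Illustrative deferred tool loading: the model first sees only
# name + one-line description; the full parameter schema is
# expanded on demand. Tool definitions here are hypothetical.
FULL_SCHEMAS = {
    "search_web": {
        "name": "search_web",
        "description": "Search the web and return top results.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    "translate": {
        "name": "translate",
        "description": "Translate text between languages.",
        "parameters": {
            "type": "object",
            "properties": {
                "text": {"type": "string"},
                "target": {"type": "string"},
            },
            "required": ["text", "target"],
        },
    },
}

def tool_stubs() -> list[dict]:
    """Cheap prompt payload: names and descriptions only."""
    return [
        {"name": s["name"], "description": s["description"]}
        for s in FULL_SCHEMAS.values()
    ]

def expand(tool_name: str) -> dict:
    """Load the full schema only once the model picks a tool."""
    return FULL_SCHEMAS[tool_name]

stub_bytes = len(json.dumps(tool_stubs()))
full_bytes = len(json.dumps(list(FULL_SCHEMAS.values())))
assert stub_bytes < full_bytes  # the token reduction lives in this gap
```

With dozens of endpoints, the stub payload grows linearly in descriptions while the full payload grows in schemas, which is where a 76% cut becomes plausible.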
What we have that they didn't describe:
- Multi-provider fallback chains (L2) — because the right tool can still go down
- Exam-driven routing (P1-P4) — because static examples go stale
- Intent routing (L3) — because agents don't always know which tool to call
- Post-execution quality signals (L4 Phase 3) — because knowing it ran ≠ knowing it worked
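Two of the items above — fallback chains and post-execution quality signals — combine naturally into one loop. Here is a hedged sketch under assumed names: the providers, timeouts, and quality threshold are placeholders, and a real implementation would enforce the per-provider timeout and use a real scorer.

```python
# Sketch: multi-provider fallback chain with per-provider timeouts
# and a post-execution quality gate. All names/values illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Provider:
    name: str
    timeout_s: float  # fast providers get short timeouts, slow ones longer

def call_with_fallback(
    chain: list[Provider],
    call: Callable[[Provider], str],
    score: Callable[[str], float],  # post-execution quality signal, 0-1
    min_quality: float = 0.7,
) -> tuple[str, str]:
    """Try each provider in order; accept the first answer that both
    returns and passes the quality gate. 'It ran' != 'it worked'."""
    last_error = "empty chain"
    for provider in chain:
        try:
            answer = call(provider)  # real impl would enforce timeout_s
        except Exception as exc:
            last_error = f"{provider.name}: {exc}"
            continue                 # transport failure -> next hop
        if score(answer) >= min_quality:
            return provider.name, answer
        last_error = f"{provider.name}: low quality"  # 200 OK, bad answer
    raise RuntimeError(f"all providers failed ({last_error})")

# A 4-deep chain: fastest first, most reliable last.
chain = [
    Provider("groq", 3.0),
    Provider("cerebras", 5.0),
    Provider("gpt-4o-mini", 10.0),
    Provider("local-llama", 30.0),
]
```

The key difference from simple retry is the second `continue` path: an answer that arrives but scores badly is treated the same as a timeout, and the chain keeps going.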
Available in: English · 繁體中文 · 日本語
Companies working on similar problems: Martian (LLM Router), Not Diamond (Model Router), Portkey, LiteLLM (AI Gateway). They all solve pieces of this puzzle.
What makes our approach different:
| | Existing AI Gateways | Our Approach |
|---|---|---|
| Routing basis | Static config or cost | Exam data + live performance |
| Language awareness | None | Per-language provider ranking |
| Fallback | Simple retry | 4-deep provider chain with different timeouts |
| Quality check | None | Post-execution scoring (0-1) |
| Intent routing | None | Natural language → auto tool selection |
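The last table row, natural-language intent routing, can be illustrated with a deliberately naive sketch. Keyword matching stands in for whatever classifier a real system would use (embeddings or an LLM); the rules and tool names are hypothetical.

```python
# Naive intent-routing sketch: map a natural-language request to a
# tool without the caller naming one. Keyword rules keep the idea
# visible; all tool names and keywords are illustrative.
INTENT_RULES = [
    ({"translate", "japanese", "chinese", "english"}, "translate"),
    ({"search", "find", "look", "news"}, "search_web"),
    ({"say", "speak", "voice", "audio"}, "text_to_speech"),
]

def route_intent(request: str) -> str:
    """Pick the tool whose keyword set best overlaps the request;
    fall back to plain chat when nothing matches."""
    words = set(request.lower().split())
    best_tool, best_hits = "llm_chat", 0
    for keywords, tool in INTENT_RULES:
        hits = len(words & keywords)
        if hits > best_hits:
            best_tool, best_hits = tool, hits
    return best_tool

print(route_intent("find the latest news about cats"))  # search_web
print(route_intent("hello there"))                      # llm_chat
```

However crude the matcher, the routing contract is the point: the agent sends a request, not a tool name, and the gateway decides.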
We're not a company. We're an animal sanctuary in rural Japan that needed APIs to work reliably. The architecture emerged from solving real problems, one layer at a time.
Washin Village (和心村) — an animal sanctuary on Japan's Boso Peninsula. 28 cats & dogs. Zero engineering background. Built a production API platform with AI coding agents in 7 months.
The benchmarks, the routing design, and the architecture — all open. Because the problems are universal, and the solutions should be too.
| Project | What it does |
|---|---|
| Zero Engineer | The full story: animal sanctuary → production API platform |
| 112 Claude Code Skills | Every production bug fix as a reusable skill |
| crawl-share | 200+ Apify actors battle-tested, community-shared |
| Confucius Debug | Debug AI with 4,300+ shared solutions |
Data and reports: CC BY 4.0. Share and adapt with attribution.