From Benchmarks to Architecture

We tested 30+ AI APIs. Designed routing from the data. Then Anthropic published a paper describing the same architecture.

The Stack

Three articles. Each one builds on the previous. Together, they're the complete story of how we built intelligent API routing — from raw data to production architecture.

Article 1: OBSERVE     — Benchmark 30+ APIs (who's fast, who's accurate, who lies)
    ↓
Article 2: DESIGN      — Use exam data to design multi-provider routing
    ↓
Article 3: ARCHITECT   — Build L1→L4, then Anthropic publishes the same patterns

This is Observability-Driven Routing: measure first, decide from data, automate decisions, verify results. We didn't know it had a name. We just kept solving the next problem.

Article 1: The Benchmark

30+ AI APIs Tested from Tokyo — February 2026

15 LLMs, 3 search engines, 5 translation APIs, 3 voice APIs, 6 data APIs. Four rounds of real HTTP requests, not synthetic benchmarks.
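
A minimal sketch of what a multi-round timing harness could look like, not the code behind the benchmark. The `benchmark` helper and the `fake_provider` stand-in are hypothetical; a real run would pass a function that sends an actual HTTP request to a provider.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(call: Callable[[], str], rounds: int = 4) -> Dict[str, float]:
    """Run `call` several times and summarize latency in milliseconds."""
    latencies: List[float] = []
    failures = 0
    for _ in range(rounds):
        start = time.perf_counter()
        try:
            call()
        except Exception:
            # A failed round counts against the provider but is not timed.
            failures += 1
            continue
        latencies.append((time.perf_counter() - start) * 1000.0)
    return {
        "rounds": rounds,
        "failures": failures,
        "median_ms": statistics.median(latencies) if latencies else float("nan"),
    }


def fake_provider() -> str:
    # Hypothetical stand-in for a real HTTP request to a provider.
    time.sleep(0.01)
    return "ok"


result = benchmark(fake_provider, rounds=4)
print(result)
```

Latency is only half of the picture: the scores in the table below come from grading the answers, which this sketch leaves out.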

Key findings:

Discovery	Why it matters
GPT-4o-mini can't do basic math (30/100 reasoning)	Don't trust brand names
Free 8B model beats GPT (Cerebras 92 vs GPT 82)	Cost ≠ quality
Groq is 8x faster but scores 30/100 on Chinese	Speed ≠ quality across languages
Free LLM translation beats DeepL (94 vs 93)	Dedicated APIs aren't always better

If we'd picked providers by reputation instead of data, we'd have chosen wrong on every one of these.

Article 2: The Routing Design

31 Providers Tested, 10 Best Paths Found

The benchmark data revealed a critical insight: no single provider is best at everything. Groq is fastest but fails on Chinese. Cerebras scores highest but can't do Japanese. GPT-4o-mini handles all languages but can't do math.

So we designed language-aware, quality-driven routing — different provider chains for different languages and tasks, all driven by exam data.

Key insight: A provider scoring 100 in English and 30 in Chinese isn't a bug in our test. It's a 70-point quality collapse that conventional monitoring can't see, because every response is still a valid 200 OK with well-formed JSON. Our routing catches this. No API gateway we know of does.
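
The idea of language-aware routing can be sketched in a few lines. This is an illustration under assumptions, not the production router: the provider names, the scores in `EXAM_SCORES`, and the `min_score` threshold are all made up for the example.

```python
from typing import Dict, List

# Illustrative exam scores (0-100) per provider and language.
# These are placeholder values, not results from the benchmark.
EXAM_SCORES: Dict[str, Dict[str, int]] = {
    "provider_a": {"en": 100, "zh": 30},
    "provider_b": {"en": 85, "zh": 88},
    "provider_c": {"en": 70, "zh": 75},
}


def provider_chain(language: str, min_score: int = 60) -> List[str]:
    """Return providers for `language`, best exam score first.

    Providers below `min_score`, or never examined in that language,
    are left out of the chain entirely.
    """
    eligible = [
        (scores.get(language, 0), name)
        for name, scores in EXAM_SCORES.items()
        if scores.get(language, 0) >= min_score
    ]
    eligible.sort(reverse=True)
    return [name for _, name in eligible]


print(provider_chain("en"))  # provider_a leads for English
print(provider_chain("zh"))  # provider_a is dropped for Chinese
```

With these placeholder scores, `provider_chain("zh")` returns `["provider_b", "provider_c"]`: the provider that tops the English chain never receives Chinese traffic, even though its Chinese responses would pass any status-code check.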

Available in: English · 繁體中文 · 日本語

Article 3: The Architecture ⭐ NEW

Token 76% Down, Cost 96% Down, 4.6x Faster — Reading Anthropic's Tool Use Paper, 4 Commits Same Day

Anthropic published "Advanced Tool Use" describing three techniques: Tool Search, Tool Use Examples, and Programmatic Tool Calling.

We read it. Recognized the same patterns we'd been building. Implemented two features the same day. And realized we'd independently built four things they hadn't described.

What Anthropic published	What we already had	What we built after reading
Tool Search (85% token reduction)	—	Defer loading: 10.8KB→2.5KB (76%↓), same day
Tool Use Examples (72%→90% accuracy)	—	11 endpoints with real JSON examples, same day
Dynamic Filtering	L2 Smart Gateway (strategy routing)	—
Programmatic Tool Calling (37% token↓)	L4 Task Engine	PTC mode: $0.02 vs $0.49 (96%↓), same day
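
The deferred-loading row above rests on a simple idea: send a compact index of tools up front and fetch a full definition only when a tool is chosen. A rough sketch, assuming a hypothetical two-tool catalog (`FULL_TOOLS`, `tool_index`, and `load_tool` are illustrative names, not the project's API, and the byte counts it prints are not the 10.8KB and 2.5KB figures from the table):

```python
import json
from typing import Any, Dict, List

# Hypothetical tool definitions; a real catalog would hold many more.
FULL_TOOLS: Dict[str, Dict[str, Any]] = {
    "translate": {
        "name": "translate",
        "description": "Translate text between languages.",
        "input_schema": {
            "type": "object",
            "properties": {
                "text": {"type": "string"},
                "target_lang": {"type": "string"},
            },
            "required": ["text", "target_lang"],
        },
        # A concrete example call, in the spirit of Tool Use Examples.
        "example": {"text": "hello", "target_lang": "ja"},
    },
    "web_search": {
        "name": "web_search",
        "description": "Search the web and return result snippets.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "limit": {"type": "integer"},
            },
            "required": ["query"],
        },
        "example": {"query": "Boso Peninsula weather", "limit": 3},
    },
}


def tool_index() -> List[Dict[str, str]]:
    """Compact listing sent up front: names and descriptions only."""
    return [
        {"name": tool["name"], "description": tool["description"]}
        for tool in FULL_TOOLS.values()
    ]


def load_tool(name: str) -> Dict[str, Any]:
    """Full definition, fetched only when the agent selects the tool."""
    return FULL_TOOLS[name]


full_bytes = len(json.dumps(list(FULL_TOOLS.values())).encode("utf-8"))
index_bytes = len(json.dumps(tool_index()).encode("utf-8"))
print(f"up-front payload: {full_bytes} bytes -> {index_bytes} bytes")
```

The saving grows with the size of the schemas and examples, since those are the parts the index leaves out.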

What we have that they didn't describe:

Multi-provider fallback chains (L2) — because the right tool can still go down
Exam-driven routing (P1-P4) — because static examples go stale
Intent routing (L3) — because agents don't always know which tool to call
Post-execution quality signals (L4 Phase 3) — because knowing it ran ≠ knowing it worked
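
Two of these ideas, fallback chains and post-execution quality signals, combine naturally. The sketch below is a simplified illustration, not the L2 or L4 code: `call_with_fallback`, the toy `non_empty_score` scorer, and the three fake providers are hypothetical, and per-provider timeouts are omitted for brevity.

```python
from typing import Callable, List, Optional, Tuple

Provider = Tuple[str, Callable[[str], str]]


def non_empty_score(response: str) -> float:
    """Toy quality signal in [0, 1]: an empty answer scores 0."""
    return 1.0 if response.strip() else 0.0


def call_with_fallback(
    prompt: str,
    chain: List[Provider],
    score: Callable[[str], float] = non_empty_score,
    min_quality: float = 0.5,
) -> Tuple[Optional[str], Optional[str], List[str]]:
    """Try providers in order; accept the first answer that scores well.

    Returns (provider name, response, providers tried). Both an exception
    and a low-quality answer move on to the next provider.
    """
    tried: List[str] = []
    for name, call in chain:
        tried.append(name)
        try:
            response = call(prompt)
        except Exception:
            continue  # provider is down or errored
        if score(response) >= min_quality:
            return name, response, tried
    return None, None, tried


def primary(prompt: str) -> str:
    raise TimeoutError("provider down")


def secondary(prompt: str) -> str:
    return ""  # succeeds technically, but the answer is useless


def tertiary(prompt: str) -> str:
    return "bonjour"


chain: List[Provider] = [
    ("primary", primary),
    ("secondary", secondary),
    ("tertiary", tertiary),
]
winner, answer, tried = call_with_fallback("Translate 'hello' to French", chain)
print(winner, answer, tried)
```

Here `secondary` returns without raising, so a plain retry-on-error policy would have accepted its empty answer. The quality check is what pushes the request on to `tertiary`.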

Available in: English · 繁體中文 · 日本語

Why This Matters

Companies and projects working on similar problems: Martian (LLM router), Not Diamond (model router), and Portkey and LiteLLM (AI gateways). They all solve pieces of this puzzle.

What makes our approach different:

	Existing AI Gateways	Our Approach
Routing basis	Static config or cost	Exam data + live performance
Language awareness	None	Per-language provider ranking
Fallback	Simple retry	4-deep provider chain with different timeouts
Quality check	None	Post-execution scoring (0-1)
Intent routing	None	Natural language → auto tool selection
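
To make the last row concrete, here is a deliberately naive sketch of intent routing: map a natural-language request to a tool by keyword overlap. The `TOOL_KEYWORDS` table and `route_intent` function are hypothetical, and a production router would likely use a model or embeddings rather than word matching.

```python
from typing import Dict, Optional, Set

# Hypothetical keyword sets per tool, for illustration only.
TOOL_KEYWORDS: Dict[str, Set[str]] = {
    "translate": {"translate", "translation", "japanese", "english"},
    "search": {"search", "find", "lookup", "news"},
    "speech": {"speak", "voice", "audio", "transcribe"},
}


def route_intent(request: str) -> Optional[str]:
    """Pick the tool whose keywords overlap most with the request.

    Returns None when no keyword matches, so the caller can ask the
    user to clarify instead of guessing.
    """
    words = set(request.lower().split())
    best_tool: Optional[str] = None
    best_hits = 0
    for tool, keywords in TOOL_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_tool, best_hits = tool, hits
    return best_tool


print(route_intent("Translate this sentence into Japanese"))
print(route_intent("hello there"))
```

The first call returns `"translate"` and the second returns `None`, which is the case where the agent cannot tell which tool fits.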

We're not a company. We're an animal sanctuary in rural Japan that needed APIs to work reliably. The architecture emerged from solving real problems, one layer at a time.

Who We Are

Washin Village (和心村) — an animal sanctuary on Japan's Boso Peninsula. 28 cats & dogs. Zero engineering background. Built a production API platform with AI coding agents in 7 months.

The benchmarks, the routing design, and the architecture — all open. Because the problems are universal, and the solutions should be too.

License

Data and reports: CC BY 4.0. Share and adapt with attribution.

Related

Project	What it does
Zero Engineer	The full story: animal sanctuary → production API platform
112 Claude Code Skills	Every production bug fix as a reusable skill
crawl-share	200+ Apify actors battle-tested, community-shared
Confucius Debug	Debug AI with 4,300+ shared solutions
