feat(ai-proxy): /v1/models aggregator + ai-gateway spec fragment (ADR-0030 §4) #82

Merged

ndreno merged 2 commits into main from feat/ai-proxy-models-endpoint on May 5, 2026

Conversation

ndreno (Contributor) commented on May 5, 2026

Summary

Adds the third and final protocol surface from ADR-0030 §1 — GET /v1/models — plus the operator-facing spec fragment that ties all three AI gateway operations (chat completions, responses, models) together.

This is PR-5 in the implementation plan. Branched off the merged main, no stacking.

What landed

/v1/models aggregator

Walks every unique provider in routes / targets / flat config, hits each upstream's /v1/models (or /api/tags for Ollama, then translates), and returns an OpenAI-compatible { object: "list", data: [...] } payload.

Per-provider failure is non-fatal: returns 200 with partial: true + warnings: [{provider, status, detail}]. A single flaky upstream doesn't take the discovery endpoint down. Each failure increments barbacane_plugin_ai_proxy_models_provider_failures_total{provider}.
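
For reference, a minimal sketch of the aggregated body in the partial-failure case (model IDs, the warning detail text, and the helper name are illustrative, not the plugin's actual identifiers):

```rust
use serde_json::{json, Value};

// Sketch only: the /v1/models aggregation result when one provider failed.
// `partial` and `warnings` appear only if at least one upstream failed.
fn example_partial_models_response() -> Value {
    json!({
        "object": "list",
        "data": [
            { "id": "gpt-4o",    "object": "model", "owned_by": "openai" },
            { "id": "llama3:8b", "object": "model", "owned_by": "ollama" }
        ],
        "partial": true,
        "warnings": [
            { "provider": "anthropic", "status": 503, "detail": "upstream returned 503" }
        ]
    })
}
```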

schemas/ai-gateway.yaml spec fragment

Drops three operations into one importable spec file — operators put it in their specs/ folder and the multi-file compiler discovers it automatically. Provider creds via env:// references; YAML anchors dedupe the dispatch config across the three ops.

Out of scope (deferred per ADR-0030 §4)

  • Caching the aggregated response. v1 hits upstreams on every call (discovery is rare, off the data plane critical path).
  • Route-based filtering of advertised models. Ergonomic, not security — denied models still 403 on actual dispatch. Future PR.
  • Root-level x-barbacane-dispatch-defaults to deduplicate the fragment's three dispatch blocks. v1 uses YAML anchors (&ai_proxy_config) which is idiomatic OpenAPI.

Test plan

  • Plugin unit tests: 109 passed (+12: collect_unique_providers, translate_models_response, handle 405/500)
  • Integration tests: 19 passed (+2: 3-provider aggregation roundtrip, partial-failure response)
  • Compilation smoke: test_shipped_ai_gateway_spec_fragment_compiles proves the fragment loads through the standard pipeline
  • End-to-end fragment test: test_shipped_fragment_chat_completions_and_models_via_ollama exercises the actual shipped fragment with OLLAMA_BASE_URL → wiremock; verifies both chat completions through the catch-all route and /v1/models partial response with the realistic first-deploy state (only Ollama configured)
  • cargo build --target wasm32-unknown-unknown --release — clean
  • cargo clippy --lib --bins — zero warnings
  • cargo fmt --all -- --check — clean
  • bash docs/rulesets/tests/run-tests.sh — 15/15 pass
  • CI green

ndreno added 2 commits May 5, 2026 09:14
…-0030 §4)

GET /v1/models walks every unique provider declared in routes / targets /
flat, queries each upstream's /v1/models (or /api/tags for Ollama, then
translates the shape), and returns an OpenAI-compatible
{ object: "list", data: [...] } payload.
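
A minimal sketch of the Ollama reshape, assuming /api/tags returns
{ "models": [ { "name": "llama3:8b", ... } ] } (helper name and exact field
handling are illustrative):

```rust
use serde_json::{json, Value};

// Sketch only: map Ollama's /api/tags listing onto the OpenAI model-list
// shape. The real translate_models_response also covers OpenAI/Anthropic.
fn translate_ollama_tags(tags: &Value) -> Value {
    let data: Vec<Value> = tags["models"]
        .as_array()
        .map(|models| {
            models
                .iter()
                .filter_map(|m| m["name"].as_str())
                .map(|name| json!({ "id": name, "object": "model", "owned_by": "ollama" }))
                .collect()
        })
        .unwrap_or_default();
    json!({ "object": "list", "data": data })
}
```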

On per-provider failure (5xx / timeout / connection error) the dispatcher
returns 200 OK with `partial: true` + warnings: [{provider, status,
detail}] — a single flaky upstream doesn't take the discovery endpoint
down. Each failure increments
barbacane_plugin_ai_proxy_models_provider_failures_total{provider} so
operators see degradations without polling clients.

A 405 with content-type application/problem+json + Allow: GET fires for
non-GET methods so a future spec that accidentally binds /v1/models to
multiple methods produces a clear error.
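
Roughly what the guard looks like (a sketch with bare tuples; the real
handler goes through the plugin's own request/response types):

```rust
// Sketch only: non-GET requests are rejected before the aggregator runs.
fn reject_non_get(method: &str) -> Option<(u16, Vec<(&'static str, &'static str)>, &'static str)> {
    if method.eq_ignore_ascii_case("GET") {
        return None; // fall through to the aggregator
    }
    Some((
        405,
        vec![
            ("content-type", "application/problem+json"),
            ("allow", "GET"),
        ],
        r#"{"title":"Method Not Allowed","status":405,"detail":"/v1/models only supports GET"}"#,
    ))
}
```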

Spec fragment
=============

schemas/ai-gateway.yaml — shipped operator-facing OpenAPI fragment that
declares all three AI gateway operations (/v1/chat/completions,
/v1/responses, /v1/models) bound to ai-proxy. Provider keys come from
OPENAI_API_KEY / ANTHROPIC_API_KEY / OLLAMA_BASE_URL via env://
references; default routes are claude-* → Anthropic, gpt-* / o[1-4]* →
OpenAI, * → Ollama.
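
The default route table, written out as a plain function for illustration
only (the shipped fragment drives this through dispatch config and glob
patterns, not hard-coded Rust):

```rust
// Sketch only: which provider the fragment's default routes select.
fn default_provider(model: &str) -> &'static str {
    // o[1-4]* — an "o" followed by a digit 1-4, e.g. o1-preview, o3-mini.
    let is_o_series = model
        .strip_prefix('o')
        .and_then(|rest| rest.chars().next())
        .map_or(false, |c| ('1'..='4').contains(&c));
    if model.starts_with("claude-") {
        "anthropic"
    } else if model.starts_with("gpt-") || is_o_series {
        "openai"
    } else {
        "ollama" // the * catch-all
    }
}
```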

Multi-file spec discovery (manifest.rs:268-322 + artifact.rs:417-426)
picks the fragment up automatically when an operator drops it into
their specs/ folder. No new compiler mechanism required for v1.

Out of scope (explicit deferrals per ADR-0030 §4)
==================================================

- Caching the aggregated response via host_cache_*. ADR specifies a
  per-instance cache with 5-min TTL + single-flight against a
  thundering herd. v1 hits upstreams on every call — the endpoint
  isn't on the data plane critical path. Cost is one extra HTTP RTT
  per /v1/models call, and discovery clients call it rarely. The
  caching slot is empty, not contested.
- Filtering advertised models by route allow/deny. Filtering is
  ergonomic (don't show clients models they can't call), not security
  (denied models still 403 on actual dispatch). Future PR.
- Root-level x-barbacane-dispatch-defaults to deduplicate the dispatch
  config across the fragment's three operations. v1 uses YAML anchors
  (&ai_proxy_config / *ai_proxy_config) which is idiomatic OpenAPI and
  doesn't require a new compiler feature.

Tests
=====

Plugin unit tests: 97 → 109 (+12).
- collect_unique_providers (5 tests covering dedup, ordering, defaults)
- translate_models_response (5 tests covering OpenAI/Anthropic/Ollama
  shapes + missing-data graceful handling)
- handle dispatch (2 tests: 405 on non-GET, 500 on no providers)

Integration tests: 17 → 19 (+2).
- test_ai_proxy_models_aggregates_three_providers — wiremock for OpenAI
  (/v1/models), Anthropic (/v1/models), Ollama (/api/tags); GET
  /v1/models returns the union, with Anthropic's owned_by stamped and
  Ollama's /api/tags reshape verified.
- test_ai_proxy_models_partial_response_on_provider_failure —
  Anthropic returns 503, OpenAI + Ollama succeed; response is 200 with
  partial: true + a single warning naming Anthropic.

Compilation smoke test:
- test_shipped_ai_gateway_spec_fragment_compiles — copies
  schemas/ai-gateway.yaml + a synthesized barbacane.yaml into a temp
  dir and verifies TestGateway::from_spec startup succeeds.

End-to-end fragment test:
- test_shipped_fragment_chat_completions_and_models_via_ollama —
  exercises the actual shipped fragment with OLLAMA_BASE_URL pointed
  at a wiremock; verifies (1) chat completions reach the wiremock via
  the * catch-all route, (2) /v1/models returns 200 with partial: true
  + warnings for the unreachable OpenAI/Anthropic providers (the
  realistic first-time-deploy operator experience). Combined into a
  single test because env vars are global state and parallel tests
  would race on the OLLAMA_BASE_URL setting.

R1 (load-bearing) — `/v1/models` was using `plugin.timeout` (default 120s),
sized for LLM completions. Sequential aggregation across N providers
meant a single hung upstream blocked the response for ~120s; three
hung upstreams = ~360s. Add a dedicated `models_timeout_ms` config
(default 5000ms) used only by the discovery aggregator. The worst
case for a single hung upstream is now bounded at 5s, and the
aggregator stays responsive even when one provider is degraded.
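
A sketch of the bound (the `fetch` parameter stands in for whatever HTTP
call the host exposes; its signature is hypothetical):

```rust
use std::time::Duration;

// Sketch only: each upstream call in the sequential aggregation is bounded
// by models_timeout_ms (default 5000 ms), so N hung providers cost at most
// N * 5 s rather than N * 120 s.
const DEFAULT_MODELS_TIMEOUT_MS: u64 = 5_000;

fn aggregate_models<F>(
    providers: &[String],
    configured_ms: Option<u64>,
    mut fetch: F,
) -> Vec<Result<String, String>>
where
    F: FnMut(&str, Duration) -> Result<String, String>,
{
    let per_call = Duration::from_millis(configured_ms.unwrap_or(DEFAULT_MODELS_TIMEOUT_MS));
    providers.iter().map(|p| fetch(p.as_str(), per_call)).collect()
}
```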

R2 — `UpstreamProvider`'s dedup key drops `api_key`. Documented the
trade-off explicitly: in multi-tenant configs where two routes
deliberately use different keys against the same upstream account
(e.g. for billing splits), the aggregator picks whichever key sorted
first into the dedup. If that key is revoked or rate-limited and the
other isn't, `/v1/models` warnings name the provider but not the
offending key — operators correlate via the upstream's logs.
Acceptable for v1; partial-response shape makes the failure visible.
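
A sketch of the dedup (struct and field names are approximations of
UpstreamProvider, not the actual definition):

```rust
use std::collections::BTreeMap;

// Sketch only: the dedup key is provider kind + base URL, deliberately
// excluding api_key, so two routes hitting the same upstream with different
// keys collapse to one entry and only one key survives.
#[derive(Debug, Clone)]
struct ProviderEntry {
    kind: String, // "openai" | "anthropic" | "ollama"
    base_url: String,
    api_key: Option<String>,
}

fn dedup_providers(entries: Vec<ProviderEntry>) -> Vec<ProviderEntry> {
    let mut by_key: BTreeMap<(String, String), ProviderEntry> = BTreeMap::new();
    for e in entries {
        by_key
            .entry((e.kind.clone(), e.base_url.clone()))
            .or_insert(e); // later entries (and their api_keys) are dropped
    }
    by_key.into_values().collect()
}
```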

R3 — Added two tests that close the upstream-edge-case gap:
- translate_handles_empty_arrays — Ollama / OpenAI returning empty
  data[] (operator just deployed, no models yet).
- test_ai_proxy_models_handles_empty_body_from_upstream — upstream
  returns 200 with Content-Length: 0 (some misbehaving proxies do
  this). Aggregator records an "invalid JSON from upstream" warning,
  doesn't crash, and other providers still contribute (see the sketch
  after this list).
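
A sketch of the lenient parse (function name is illustrative):

```rust
use serde_json::Value;

// Sketch only: an empty or otherwise non-JSON 200 body becomes a warning
// string instead of a hard failure, so the other providers' models still
// make it into the aggregated response.
fn parse_models_body(provider: &str, body: &[u8]) -> Result<Value, String> {
    serde_json::from_slice(body)
        .map_err(|_| format!("invalid JSON from upstream (provider: {provider})"))
}
```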

Style cleanup:
- Collapsed the `if .. { /* happy */ } else { return ... }` early-return
  in `handle()` to a single negation + early return.
- Derived Debug on UpstreamProvider (and on Provider, which it nests)
  so test panics print legible values.
- Made UpstreamProvider, UpstreamFailure, collect_unique_providers,
  and fetch_provider_models module-private. None are used outside
  models.rs and the `pub(crate)` was leaking surface area.

Spec fragment:
- Added a doc-comment paragraph on the Ollama catch-all (`pattern: "*"`).
  Convenient for local-dev (Ollama-as-default) but in production
  silently routes typos like `gtp-4o` to Ollama instead of returning
  a clean `400 no_route`. Operators wanting strict validation drop
  the catch-all from their copy.

Schema:
- New `models_timeout_ms` field documented in config-schema.json with
  the rationale (separate from LLM `timeout` because discovery doesn't
  need 120s patience).

Test count: plugin 109 → 110 (+1 empty-arrays test). Integration
19 → 19 (the empty-body test added; partial-response test renumbered
implicitly).
ndreno merged commit 0f766cb into main on May 5, 2026
9 checks passed
ndreno deleted the feat/ai-proxy-models-endpoint branch on May 5, 2026 07:54