| 1 | +# LLM Provider Standards (bench) |
| 2 | + |
| 3 | +> **Audience**: bstack maintainers + anyone wiring a new live LLM backend |
| 4 | +> into `bstack bench`. Agent-readable substrate (markdown, per P18). |
| 5 | + |
| 6 | +## The contract bstack adopts |
| 7 | + |
| 8 | +**OpenAI Chat Completions API v1** — the de facto LLM provider contract in 2026. |
| 9 | + |
| 10 | +``` |
| 11 | +POST {base_url}/chat/completions |
| 12 | +Authorization: Bearer <token> |
| 13 | +Content-Type: application/json |
| 14 | +
|
| 15 | +{ |
| 16 | + "model": "<model-id>", |
| 17 | + "messages": [ |
| 18 | + {"role": "system", "content": "..."}, |
| 19 | + {"role": "user", "content": "..."} |
| 20 | + ], |
| 21 | + "max_tokens": 4096, |
| 22 | + "temperature": 0.0 |
| 23 | +} |
| 24 | +``` |
| 25 | + |
| 26 | +Response: |
| 27 | + |
| 28 | +``` |
| 29 | +{ |
| 30 | + "id": "...", |
| 31 | + "model": "<resolved-model-id>", |
| 32 | + "choices": [ |
| 33 | + { |
| 34 | + "index": 0, |
| 35 | + "message": { "role": "assistant", "content": "..." }, |
| 36 | + "finish_reason": "stop" |
| 37 | + } |
| 38 | + ], |
| 39 | + "usage": { |
| 40 | + "prompt_tokens": 19, |
| 41 | + "completion_tokens": 6, |
| 42 | + "total_tokens": 25 |
| 43 | + } |
| 44 | +} |
| 45 | +``` |
| 46 | + |
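The request and response shapes above can be exercised offline. The following is a minimal sketch using only the standard library; `build_chat_request` and `parse_chat_response` are hypothetical helper names (not part of bstack), and the base URL and token are placeholders. Nothing is sent over the network.

```python
import json
import urllib.request


def build_chat_request(base_url, token, model, messages,
                       max_tokens=4096, temperature=0.0):
    """Build (but do not send) a Chat Completions request."""
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_chat_response(body):
    """Extract the fields bench reads from a decoded response body."""
    choice = body["choices"][0]
    usage = body.get("usage", {})
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "model": body.get("model"),
        "total_tokens": usage.get("total_tokens"),
    }


# Placeholder endpoint and token; the sample mirrors the response shape above.
request = build_chat_request(
    "https://example.invalid/v1/",
    "not-a-real-token",
    "some-model",
    [{"role": "user", "content": "hi"}],
)
sample_response = {
    "id": "abc",
    "model": "some-model-resolved",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "hello"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 19, "completion_tokens": 6, "total_tokens": 25},
}
parsed = parse_chat_response(sample_response)
```

Sending the request would be a matter of passing `request` to `urllib.request.urlopen`; that step is left out so the sketch stays runnable without credentials.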
| 47 | +## Why this contract |
| 48 | + |
| 49 | +1. **Industry alignment** — the same JSON shape is served by: |
| 50 | + - **Databricks Model Serving** (Anthropic Claude, Meta Llama, etc. behind one gateway) |
| 51 | + - **OpenAI** itself |
| 52 | + - **Together**, **Fireworks**, **Anyscale**, **Groq**, **Perplexity** |
| 53 | + - **vLLM**, **llama.cpp**, **TGI** (self-hosted) |
| 54 | + - **Anthropic via AWS Bedrock** (with a thin adapter) |
| 55 | + - **Vertex AI** model garden (with a thin adapter) |
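| 56 | +2. **Token semantics are uniform**: OpenAI-compatible backends report `usage.{prompt,completion,total}_tokens`, so bench's per-task cost accounting reads the same fields for every provider. |
| 57 | +3. **Anthropic's `messages.create` carries the same information in a slightly different layout**: the system prompt is a top-level `system` parameter, and the response reports `usage.input_tokens` / `usage.output_tokens` and `stop_reason`. Providers that don't expose an OpenAI front-end (raw Anthropic SDK, Vertex) translate inside the provider class, so bench code above the provider layer stays unchanged. |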
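To illustrate the translation mentioned in point 3, here is a rough sketch of mapping between the OpenAI-style shape and the Anthropic Messages API shape using plain dicts. The function names are hypothetical, and the `stop_reason` mapping is one reasonable choice, not necessarily what a bstack provider class does.

```python
# One possible mapping from Anthropic stop reasons to OpenAI finish reasons.
_STOP_REASON_TO_FINISH_REASON = {
    "end_turn": "stop",
    "stop_sequence": "stop",
    "max_tokens": "length",
    "tool_use": "tool_calls",
}


def to_anthropic_request(model, messages, max_tokens=4096, temperature=0.0):
    """Convert OpenAI-style chat messages into Messages API keyword arguments."""
    # Anthropic takes the system prompt as a top-level field, not as a message.
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    kwargs = {
        "model": model,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "messages": [{"role": m["role"], "content": m["content"]} for m in turns],
    }
    if system_parts:
        kwargs["system"] = "\n\n".join(system_parts)
    return kwargs


def from_anthropic_response(body):
    """Reshape a decoded Messages API response into the OpenAI-style layout."""
    # Content arrives as a list of blocks; keep only the text blocks.
    text = "".join(
        block.get("text", "")
        for block in body.get("content", [])
        if block.get("type") == "text"
    )
    usage = body.get("usage", {})
    prompt_tokens = usage.get("input_tokens", 0)
    completion_tokens = usage.get("output_tokens", 0)
    stop_reason = body.get("stop_reason")
    return {
        "id": body.get("id"),
        "model": body.get("model"),
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": _STOP_REASON_TO_FINISH_REASON.get(
                    stop_reason, stop_reason
                ),
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            # Anthropic does not report a total, so it is computed here.
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```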
| 58 | + |
| 59 | +## Provider abstraction (Python) |
| 60 | + |
| 61 | +```python |
| 62 | +from bench.providers import Provider, ChatMessage, ChatCompletion, get_provider |
| 63 | + |
| 64 | +provider = get_provider("databricks") # or "mock", future: "anthropic", "openai", ... |
| 65 | + |
| 66 | +response: ChatCompletion = provider.chat( |
| 67 | + messages=[ |
| 68 | + ChatMessage(role="system", content="..."), |
| 69 | + ChatMessage(role="user", content="..."), |
| 70 | + ], |
| 71 | + model="databricks-claude-haiku-4-5", |
| 72 | + max_tokens=4096, |
| 73 | + temperature=0.0, |
| 74 | +) |
| 75 | + |
| 76 | +print(response.content) |
| 77 | +print(response.usage.total_tokens) |
| 78 | +print(response.model) # resolved model ID (richer than the alias) |
| 79 | +print(response.finish_reason) |
| 80 | +``` |
| 81 | + |
| 82 | +Three types are public: |
| 83 | + |
| 84 | +- `ChatMessage` — role-tagged message |
| 85 | +- `Usage` — `prompt_tokens` + `completion_tokens` + `total_tokens` |
| 86 | +- `ChatCompletion` — `content` + `model` + `usage` + `finish_reason` (+ `raw` for forward-compat) |
| 87 | + |
| 88 | +Three exception types: |
| 89 | + |
| 90 | +- `ProviderNotInstalled` — optional SDK missing (e.g. `pip install openai`) |
| 91 | +- `ProviderNotConfigured` — required env vars missing (e.g. `DATABRICKS_TOKEN`) |
| 92 | +- `ProviderError` — wraps upstream SDK errors uniformly |
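For orientation, the public types and exceptions listed above could look roughly like the sketch below. The field names come from this document; everything else (frozen dataclasses, the exception hierarchy, the `completion_from_openai` helper) is an assumption and may differ from the real definitions in `bench.providers`.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ChatMessage:
    role: str       # "system", "user", or "assistant"
    content: str


@dataclass(frozen=True)
class Usage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


@dataclass(frozen=True)
class ChatCompletion:
    content: str
    model: str
    usage: Usage
    finish_reason: str
    raw: Optional[Any] = None   # may be None for stub/mock providers


# Assumption: the real exceptions may share a common base class.
class ProviderNotInstalled(Exception):
    """Optional SDK is missing."""


class ProviderNotConfigured(Exception):
    """Required env vars are missing."""


class ProviderError(Exception):
    """Wraps an upstream SDK error."""


def completion_from_openai(body):
    """Hypothetical helper: build a ChatCompletion from a decoded response body."""
    choice = body["choices"][0]
    usage = body["usage"]
    return ChatCompletion(
        content=choice["message"]["content"],
        model=body["model"],
        usage=Usage(
            prompt_tokens=usage["prompt_tokens"],
            completion_tokens=usage["completion_tokens"],
            total_tokens=usage["total_tokens"],
        ),
        finish_reason=choice["finish_reason"],
        raw=body,
    )
```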
| 93 | + |
| 94 | +## Built-in providers (v0.11.0) |
| 95 | + |
| 96 | +| Name | Backend | Required env | Models | |
| 97 | +|---|---|---|---| |
| 98 | +| `databricks` | Databricks Model Serving (OpenAI-compatible) | `DATABRICKS_HOST`, `DATABRICKS_TOKEN` | `databricks-claude-haiku-4-5`, `databricks-claude-sonnet-4`, `databricks-claude-opus-4-5`, `databricks-meta-llama-4-maverick` | |
| 99 | +| `mock` | In-process deterministic stub | — | `mock-small`, `mock-large` | |
| 100 | + |
| 101 | +Future providers (planned, not yet shipped): |
| 102 | + |
| 103 | +| Name | Backend | Required env | |
| 104 | +|---|---|---| |
| 105 | +| `anthropic` | Anthropic API direct | `ANTHROPIC_API_KEY` | |
| 106 | +| `openai` | OpenAI API direct | `OPENAI_API_KEY` | |
| 107 | +| `openai-compat` | Generic OpenAI-compatible endpoint | `OPENAI_BASE_URL`, `OPENAI_API_KEY` | |
| 108 | +| `bedrock` | Anthropic via AWS Bedrock | `AWS_*` | |
| 109 | + |
| 110 | +## How to add a new provider |
| 111 | + |
| 112 | +1. Create `scripts/bench/providers/<name>.py` with a class that subclasses `Provider` and implements `configured()`, `list_models()`, and `chat()`. |
| 113 | +2. Soft-import the SDK in `chat()` (or `__init__`) and raise `ProviderNotInstalled` if missing — never make the SDK a hard dep. |
| 114 | +3. Read credentials from env vars in `__init__`; raise `ProviderNotConfigured` if missing. |
| 115 | +4. Map upstream errors to `ProviderError` (preserve original via `raise ... from exc`). |
| 116 | +5. Add an entry in `scripts/bench/providers/registry.py:_BUILTIN_PROVIDERS` of the form `"<name>": "bench.providers.<name>:<ClassName>"`. |
| 117 | +6. Add a row to the "Built-in providers" table above and an entry to the cost table in `base.py:_COST_TABLE_USD_PER_MILLION`. |
| 118 | +7. Add provider-specific tests in `tests/bench-providers.test.sh`. |
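Steps 1 through 5 can be sketched as follows. This is a skeleton under stated assumptions, not the actual bstack code: `Provider` and the exceptions are local stand-ins so the block runs on its own, and `ExampleProvider`, `EXAMPLE_API_KEY`, `example_llm_sdk_not_real`, `sdk.complete`, and `load_provider_class` are all hypothetical names.

```python
import importlib
import os


class ProviderNotInstalled(Exception):
    pass


class ProviderNotConfigured(Exception):
    pass


class ProviderError(Exception):
    pass


class Provider:
    """Stand-in for bench.providers.Provider; the real base class may differ."""

    def configured(self):
        raise NotImplementedError

    def list_models(self):
        raise NotImplementedError

    def chat(self, messages, model, max_tokens=4096, temperature=0.0):
        raise NotImplementedError


class ExampleProvider(Provider):
    ENV_VAR = "EXAMPLE_API_KEY"              # hypothetical env var
    SDK_MODULE = "example_llm_sdk_not_real"  # hypothetical optional SDK

    def __init__(self):
        # Step 3: read credentials from the environment only.
        self._api_key = os.environ.get(self.ENV_VAR)
        if not self._api_key:
            raise ProviderNotConfigured(f"{self.ENV_VAR} is not set")

    def configured(self):
        return bool(self._api_key)

    def list_models(self):
        return ["example-small"]

    def _load_sdk(self):
        # Step 2: soft-import so the SDK never becomes a hard dependency.
        try:
            return importlib.import_module(self.SDK_MODULE)
        except ImportError as exc:
            raise ProviderNotInstalled(
                f"optional SDK '{self.SDK_MODULE}' is not installed"
            ) from exc

    def chat(self, messages, model, max_tokens=4096, temperature=0.0):
        sdk = self._load_sdk()
        try:
            # Hypothetical SDK call; a real provider would use its SDK's own API.
            return sdk.complete(
                api_key=self._api_key,
                model=model,
                messages=messages,
                max_tokens=max_tokens,
                temperature=temperature,
            )
        except Exception as exc:
            # Step 4: wrap upstream errors, keeping the original as __cause__.
            raise ProviderError(f"{type(exc).__name__}: {exc}") from exc


def load_provider_class(spec):
    """Step 5: resolve a "<module>:<ClassName>" registry entry to a class."""
    module_name, _, class_name = spec.partition(":")
    return getattr(importlib.import_module(module_name), class_name)
```

Keeping the import inside `_load_sdk` means a missing SDK only surfaces when the provider is actually used, which is what allows it to stay an optional dependency.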
| 119 | + |
| 120 | +## Recommended invocation patterns |
| 121 | + |
| 122 | +### Direct (env already exported) |
| 123 | + |
| 124 | +```bash |
| 125 | +export DATABRICKS_HOST=https://...azuredatabricks.net |
| 126 | +export DATABRICKS_TOKEN=dapi... |
| 127 | +bstack bench run --runner live --provider databricks \ |
| 128 | + --model databricks-claude-haiku-4-5 \ |
| 129 | + --judge-model databricks-claude-opus-4-5 \ |
| 130 | + --phase 1 |
| 131 | +``` |
| 132 | + |
| 133 | +### Railway as credential broker (recommended for shared dev envs) |
| 134 | + |
| 135 | +When credentials live in Railway (the bstack-broomva-stimulus convention), use `railway run` to inject env vars without writing them to disk: |
| 136 | + |
| 137 | +```bash |
| 138 | +railway run --service stimulus-api -- bstack bench run \ |
| 139 | + --runner live --provider databricks \ |
| 140 | + --model databricks-claude-haiku-4-5 \ |
| 141 | + --judge-model databricks-claude-opus-4-5 \ |
| 142 | + --phase 1 |
| 143 | +``` |
| 144 | + |
| 145 | +### 1Password / sops / direnv / vault |
| 146 | + |
| 147 | +Any tool that exports env vars works. The provider class never sees the credential storage system — it only reads `os.environ`. |
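Because the provider only reads the environment, the credential check reduces to a lookup over a mapping. A minimal sketch, assuming a hypothetical `require_env` helper and a local stand-in for `ProviderNotConfigured`:

```python
import os


class ProviderNotConfigured(Exception):
    """Stand-in for the bench exception of the same name."""


def require_env(names, environ=None):
    """Return the requested variables, or raise listing every missing one."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise ProviderNotConfigured(
            "missing required env vars: " + ", ".join(missing)
        )
    return {name: environ[name] for name in names}
```

Accepting an optional mapping keeps the helper testable with a plain dict, whichever tool (Railway, 1Password, sops, direnv, vault) populated the real environment.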
| 148 | + |
| 149 | +## P20 model-isolation enforcement |
| 150 | + |
| 151 | +Bench enforces **Cross-Review (P20)** at the layer where it matters most: the LLM judge. |
| 152 | + |
| 153 | +**Rule:** when `--evaluator llm-judge` is selected, the judge model MUST differ from the agent model. A model judging its own output is exactly the single-model echo-chamber failure that P20 exists to prevent. |
| 154 | + |
| 155 | +```bash |
| 156 | +# ❌ Rejected — same model agent + judge |
| 157 | +bstack bench run --runner live --provider databricks \ |
| 158 | + --model databricks-claude-haiku-4-5 \ |
| 159 | + --evaluator llm-judge --judge-model databricks-claude-haiku-4-5 |
| 160 | +# → exit 8: "judge model equals agent model" |
| 161 | + |
| 162 | +# ✅ Accepted — distinct models |
| 163 | +bstack bench run --runner live --provider databricks \ |
| 164 | + --model databricks-claude-haiku-4-5 \ |
| 165 | + --evaluator llm-judge --judge-model databricks-claude-opus-4-5 |
| 166 | + |
| 167 | +# ⚠️ Override only with --allow-same-judge-model (must pass rationale) |
| 168 | +bstack bench run --runner live --provider databricks \ |
| 169 | + --model databricks-claude-haiku-4-5 \ |
| 170 | + --evaluator llm-judge --judge-model databricks-claude-haiku-4-5 \ |
| 171 | + --allow-same-judge-model "smoke test only — not a quality measurement" |
| 172 | +# → warning logged + rationale captured in run config.json |
| 173 | +``` |
| 174 | + |
| 175 | +Same-provider, different-model is the cheapest path to compliance. Cross-provider (e.g. agent on Databricks Claude, judge on OpenAI GPT-4o) is the strongest. Bench logs both paths. |
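The rule and its override can be sketched as a small pure function. This is an illustration, not bench's actual implementation: `check_judge_isolation` is a hypothetical name, and comparing model IDs as exact strings is an assumption (two aliases that resolve to the same model would not be caught).

```python
EXIT_SAME_JUDGE_MODEL = 8  # exit code shown in the rejected example above


def check_judge_isolation(agent_model, judge_model, allow_same_rationale=None):
    """Return (exit_code, message) for a llm-judge run."""
    if judge_model != agent_model:
        return 0, None
    rationale = (allow_same_rationale or "").strip()
    if rationale:
        # Override path: allowed, but the rationale must be recorded.
        return 0, (
            "warning: judge model equals agent model "
            f"(override rationale: {rationale})"
        )
    return EXIT_SAME_JUDGE_MODEL, "judge model equals agent model"
```

An empty or whitespace-only rationale is treated here as no rationale at all, so the override cannot be passed as a bare flag.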
| 176 | + |
| 177 | +## Anti-patterns |
| 178 | + |
| 179 | +- **Don't** hardcode credentials anywhere in bstack. The provider class reads `os.environ`; how those env vars get there is the caller's concern. |
| 180 | +- **Don't** make any SDK a hard dependency of bstack. Soft-import only. |
| 181 | +- **Don't** bypass the registry — always call `get_provider(name)`, never instantiate `DatabricksGatewayProvider(...)` directly from bench code (tests are the exception). |
| 182 | +- **Don't** treat `raw` as part of the public contract. It exists for one-off forward-compat reads (logprobs, structured outputs) and may be `None` for stub/mock providers. |
| 183 | +- **Don't** let a same-model judge pass silently. Overriding P20 requires an explicit `--allow-same-judge-model` opt-out with a rationale. |
| 184 | + |
| 185 | +## References |
| 186 | + |
| 187 | +- OpenAI Chat Completions API: https://platform.openai.com/docs/api-reference/chat |
| 188 | +- Databricks Foundation Model APIs: https://docs.databricks.com/en/machine-learning/foundation-models/index.html |
| 189 | +- Stimulus reference implementation: `apps/api/src/utils/databricks_openai.py` (in the stimulus repo) |
| 190 | +- bstack bench spec: `specs/bench-skill-evolution.md` |
| 191 | +- P20 Cross-Review primitive: `SKILL.md` § Bstack Core Automation Primitives |