broomva
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 53 additions & 0 deletions
diff --git a/SKILL.md b/SKILL.md
Lines changed: 2 additions & 0 deletions
diff --git a/VERSION b/VERSION
Lines changed: 1 addition & 1 deletion
diff --git a/bin/bstack-bench b/bin/bstack-bench
Lines changed: 18 additions & 2 deletions
diff --git a/references/provider-standards.md b/references/provider-standards.md
Lines changed: 191 additions & 0 deletions
@@ -1,5 +1,58 @@
 # Changelog
 
+## 0.11.0 — 2026-05-20
+
+### Live mode for `bstack bench` — Databricks Gateway provider + OpenAI-compatible abstraction (BRO-1211)
+
+Closes the live-mode gap left open by v0.10.0 (BRO-1205). v0.10.0 shipped `StubLiveRunner` + `StubLLMJudgeEvaluator`, both of which raised `NotImplementedError`. v0.11.0 ships the real thing: an industry-standard provider abstraction with **Databricks Model Serving Gateway** as the first concrete provider. Live mode was validated end-to-end against real Databricks Anthropic Claude endpoints (5/5 live tests green; 221 real tokens on a `databricks-claude-haiku-4-5` call; the Haiku-agent + Sonnet-judge run reached the quality cliff at rc=0).
+
+The contract bstack adopts is **OpenAI Chat Completions API v1** — the de facto LLM standard in 2026, served by Databricks, OpenAI, Anthropic-via-Bedrock, Together, Fireworks, Anyscale, vLLM, llama.cpp, etc. Future providers (anthropic, openai, openai-compat, bedrock) plug in by implementing `Provider.chat()`.
+
+- **NEW** `scripts/bench/providers/` package (stdlib + optional `openai` SDK):
+  - `base.py` — `Provider` ABC + OpenAI-compatible `ChatMessage` / `Usage` / `ChatCompletion` types + `ProviderError` / `ProviderNotConfigured` / `ProviderNotInstalled` taxonomy + `estimate_cost_usd()` with per-model pricing table.
+  - `databricks.py` — `DatabricksGatewayProvider`: wraps the OpenAI SDK with `base_url = {DATABRICKS_HOST}/serving-endpoints` + `api_key = DATABRICKS_TOKEN`. Mirrors Stimulus's `apps/api/src/utils/databricks_openai.py` pattern. Known models hardcoded: `databricks-claude-{haiku-4-5, sonnet-4, opus-4-5}` + `databricks-meta-llama-4-maverick`.
+  - `registry.py` — `get_provider(name, **kwargs)` factory with lazy module loading. Built-in providers: `databricks`, `mock`. Runtime extension via `register_provider()`.
+  - `__init__.py` — public API exports.
+- **NEW** `references/provider-standards.md` — documents the OpenAI-compatible contract bstack adopts, how to add a new provider, the P20 model-isolation enforcement rules, and the Railway credential-broker invocation pattern.
+- **NEW** `tests/bench-providers.test.sh` — 10 offline tests covering unknown-provider error, missing `--model`, mock provider end-to-end (real `chat()` call), P20 violation (rc=8), P20 override rationale captured in config, P20 distinct-models accepted, `DATABRICKS_TOKEN` absent (rc=9), `list_providers()` shape, provider-standards doc presence, public API symbol coverage. All green.
+- **NEW** `tests/bench-live.test.sh` — 5 live integration tests (gated by `BSTACK_BENCH_LIVE=1` + `DATABRICKS_HOST` + `DATABRICKS_TOKEN`). Validates: `DatabricksGatewayProvider` instantiates with real creds, minimal `chat()` returns PONG with parseable usage stats, Phase 1 run produces non-canned token counts (real Databricks usage), `LLMJudgeEvaluator` end-to-end (Haiku agent + Sonnet judge), P20 enforcement holds in live mode. **All 5 passed against real Databricks at ship time** — this PR ships proven-working live mode, not stubs.
+- **CHANGED** `scripts/bench/agent_runner.py` — `LiveProviderRunner` replaces `StubLiveRunner` (which is kept as legacy fallback). Delegates to a Provider, captures real token usage from `completion.usage`, writes deliverables, estimates cost from per-model pricing table.
+- **CHANGED** `scripts/bench/evaluator.py` — `LLMJudgeEvaluator` replaces `StubLLMJudgeEvaluator`. Builds structured judge prompt from rubric criteria, parses JSON verdict (with fallback for prose-wrapped output), computes weighted pass rate, applies 0.6 cliff. Handles judge-side provider errors + parse failures gracefully (no traceback).
+- **CHANGED** `scripts/bench/orchestrator.py` — adds `--provider`, `--model`, `--judge-model`, `--allow-same-judge-model RATIONALE` flags. Enforces P20 model isolation: judge model MUST differ from agent model unless explicit override with rationale (captured in `config.json` for audit). New exit codes 8 (P20 violation), 9 (provider not configured), 10 (SDK not installed).
+- **CHANGED** `bin/bstack-bench` — surfaces new flags in `--help`; documents new exit codes; adds live-mode invocation examples (direct + Railway credential broker pattern).
+- **CHANGED** `tests/bench-mvp.test.sh` — tests #13 + #15 updated for v0.11.0 semantics: live runner / llm-judge without `--provider` now fails *fast* at the CLI layer with rc=2 + "--provider required" instead of reaching the stub. Cleaner error, surfaces missing config before any task runs.
+
+### Design choices
+
+- **OpenAI Chat Completions API is the contract.** Picked because Databricks, OpenAI, vLLM, Together, Fireworks, Anyscale, and llama.cpp all serve the same request/response JSON, and Anthropic-via-Bedrock reaches it through a thin adapter. Providers that already speak this shape ship with no translation layer.
+- **`openai` SDK is a soft dependency.** Imported lazily inside `DatabricksGatewayProvider.__init__`; raises `ProviderNotInstalled` with install hint when missing. CI doesn't install it (mock provider covers offline tests). Live runs need it.
+- **Railway as credential broker.** Recommended invocation: `railway run --service stimulus-api -- bstack bench run ...`. Credentials never written to disk in the bstack tree. Direct env export works identically.
+- **P20 enforcement at CLI layer.** Same model for agent + judge is the same-model-echo-chamber failure mode P20 exists for. Rejected with rc=8 unless `--allow-same-judge-model "rationale"` is passed; rationale is captured in `config.json` for audit.
+- **Mock provider is built in.** Deterministic in-process provider; tests + CI never need network or credentials. Same registry, same factory, same API as `databricks`.
+- **No `.env` file loading at runtime.** Bstack reads `os.environ`; how vars get there is the caller's concern (Railway, direnv, sops, 1Password, manual export — all work).
+- **Stimulus pattern mirrored.** `DatabricksGatewayProvider` directly mirrors `apps/api/src/utils/databricks_openai.py` — same base_url construction (`{HOST}/serving-endpoints`), same auth (token as `api_key`), same model name conventions.
+
+### What this enables (next BRO-1205 followups, now unblocked)
+
+- Per-skill telemetry counters can land — substrate now produces real token usage to populate them.
+- Crystallize (P16) FIX/DERIVED/RETIRE sub-modes can read from real bench runs, not synthetic numbers.
+- Cross-provider benchmarking — agent on Databricks Claude, judge on OpenAI GPT-4o (when `openai` provider lands).
+- Cost-per-quality measurement — bench reports now include real `cost_usd` from `estimate_cost_usd()`.
+
+### Test counts
+
+- `tests/bench-mvp.test.sh`: 18 → 18 (unchanged, two assertions retargeted for v0.11.0 semantics)
+- `tests/bench-providers.test.sh`: NEW, 10 assertions
+- `tests/bench-live.test.sh`: NEW, 5 assertions (gated, ran green at ship time)
+- Total: 28 offline + 5 live (gated) = 33 assertions
+
+### Linked artifacts
+
+- Linear: BRO-1211 (this PR); BRO-1205 (predecessor — MVP)
+- Spec: `specs/bench-skill-evolution.md` (updated)
+- Reference: `references/provider-standards.md` (NEW)
+- Stimulus mirror: `apps/api/src/utils/databricks_openai.py` (reference implementation)
+
 ## 0.10.0 — 2026-05-20
 
 ### Skill-evolution benchmark substrate (BRO-1205)
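
The `LLMJudgeEvaluator` behaviour listed in the 0.11.0 entry above (parse a JSON verdict, fall back when the judge wraps it in prose, compute a weighted pass rate, apply the 0.6 cliff) could be sketched roughly as follows. The verdict schema (`{"criteria": {name: bool}}`), the function names, the weights, and the choice of `>=` at the cliff are all assumptions made for illustration; this is not the code in `scripts/bench/evaluator.py`.

```python
import json
import re

QUALITY_CLIFF = 0.6  # threshold named in the changelog entry


def parse_verdict(text):
    """Return the judge's JSON object, or None if nothing parseable is found."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fallback for prose-wrapped output: take the outermost {...} span.
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match is None:
            return None
        try:
            parsed = json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return parsed if isinstance(parsed, dict) else None


def weighted_pass_rate(criteria, weights):
    """Weight of the passed criteria divided by the total weight."""
    total = sum(weights.values())
    if total <= 0:
        return 0.0
    passed = sum(w for name, w in weights.items() if criteria.get(name) is True)
    return passed / total


def judge(text, weights):
    """Score one judge reply; parse failures yield a result, not a traceback."""
    verdict = parse_verdict(text)
    if verdict is None or not isinstance(verdict.get("criteria"), dict):
        return {"parsed": False, "pass_rate": 0.0, "passed": False}
    rate = weighted_pass_rate(verdict["criteria"], weights)
    # Assumption: a run sitting exactly on the cliff counts as a pass.
    return {"parsed": True, "pass_rate": rate, "passed": rate >= QUALITY_CLIFF}
```

Returning a structured "not parsed" result instead of raising is one way to get the "no traceback" handling the entry describes; the real evaluator may report failures differently.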
 
@@ -39,6 +39,8 @@ Then, in your agent session:
 /bstack bench run                 → two-phase skill-evolution benchmark (P11)
 /bstack bench compare             → Phase 1 vs Phase 2 REPORT.md
 /bstack bench tasks list          → registered task sets
+/bstack bench run --runner live --provider databricks --model ...
+                                  → real LLM via OpenAI-compatible provider (≥ 0.11.0)
 ```
 
 ## What bstack enforces
 
@@ -1 +1 @@
-0.10.0
+0.11.0
@@ -6,6 +6,7 @@
 #
 # Subcommands:
 #   run [--tasks SET] [--runner R] [--evaluator E] [--phase {1|2|both}]
+#       [--provider P] [--model M] [--judge-model M] [--allow-same-judge-model RATIONALE]
 #       [--budget-usd N] [--resume RUN_ID] [--no-dry-run]
 #                                Two-phase bench against a task set.
 #   compare [--run-id RUN_ID]    Build REPORT.md from existing phase results.
@@ -25,6 +26,9 @@
 #   5  resume / status run-id not found
 #   6  all task runs failed (structurally broken — e.g. stub runner without SDK)
 #   7  compare requires both phase 1 + phase 2 results
+#   8  P20 violation: judge model equals agent model without --allow-same-judge-model
+#   9  provider not configured (missing DATABRICKS_TOKEN, etc.)
+#  10  provider SDK not installed (e.g. `pip install openai`)
 
 set -euo pipefail
 
@@ -72,6 +76,8 @@ bstack-bench — skill-evolution benchmark dispatcher
 
 Usage:
   bstack bench run [--tasks SET] [--runner R] [--evaluator E]
+                   [--provider P] [--model M] [--judge-model M]
+                   [--allow-same-judge-model RATIONALE]
                    [--phase {1|2|both}] [--budget-usd N]
                    [--resume RUN_ID] [--no-dry-run]
   bstack bench compare [--run-id RUN_ID]
@@ -81,12 +87,22 @@ Usage:
 
 Defaults:
   --tasks bstack-smoke   --runner dry-run   --evaluator rubric-match
-  --phase both           --dry-run (live mode is a stub in v0.10.0)
+  --phase both           --dry-run
+
+Live mode (v0.11.0+):
+  bstack bench run --runner live --evaluator llm-judge \\
+      --provider databricks \\
+      --model databricks-claude-haiku-4-5 \\
+      --judge-model databricks-claude-opus-4-5
+  # Or, with Railway as credential broker:
+  railway run --service stimulus-api -- bstack bench run --runner live ...
 
 State:
   ~/.config/bstack/bench/runs/<run-id>/  (override via BSTACK_BENCH_HOME)
 
-Spec: specs/bench-skill-evolution.md   Ticket: BRO-1205
+Spec:       specs/bench-skill-evolution.md
+Providers:  references/provider-standards.md
+Tickets:    BRO-1205 (MVP), BRO-1211 (live mode)
 EOF
 }
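
The P20 pre-flight check behind exit code 8 could be sketched as below. Only the exit code, the `llm-judge` condition, and the fact that the rationale lands in `config.json` come from this PR; the function name, the config key, and the treatment of a blank rationale are hypothetical.

```python
EXIT_OK = 0
EXIT_P20_VIOLATION = 8  # from the exit-code table in bin/bstack-bench


def check_model_isolation(evaluator, model, judge_model, rationale=None):
    """Return (exit_code, config_fragment) for a P20 pre-flight check.

    The config key below is a hypothetical name, not the one the
    orchestrator actually writes to config.json.
    """
    if evaluator != "llm-judge":
        # The rule only applies when an LLM judge is selected.
        return EXIT_OK, {}
    if judge_model != model:
        return EXIT_OK, {}
    if rationale and rationale.strip():
        # Same model allowed, but the rationale is recorded for audit.
        return EXIT_OK, {"allow_same_judge_model_rationale": rationale.strip()}
    return EXIT_P20_VIOLATION, {}
```

Running the check before any task starts keeps a misconfigured run from spending tokens, which matches the fail-fast behaviour the changelog describes for the CLI layer.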
 
 
@@ -0,0 +1,191 @@
+# LLM Provider Standards (bench)
+
+> **Audience**: bstack maintainers + anyone wiring a new live LLM backend
+> into `bstack bench`. Agent-readable substrate (markdown, per P18).
+
+## The contract bstack adopts
+
+**OpenAI Chat Completions API v1** — the de facto LLM provider contract in 2026.
+
+```
+POST {base_url}/chat/completions
+Authorization: Bearer <token>
+Content-Type: application/json
+
+{
+  "model": "<model-id>",
+  "messages": [
+    {"role": "system", "content": "..."},
+    {"role": "user",   "content": "..."}
+  ],
+  "max_tokens": 4096,
+  "temperature": 0.0
+}
+```
+
+Response:
+
+```
+{
+  "id": "...",
+  "model": "<resolved-model-id>",
+  "choices": [
+    {
+      "index": 0,
+      "message": { "role": "assistant", "content": "..." },
+      "finish_reason": "stop"
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 19,
+    "completion_tokens": 6,
+    "total_tokens": 25
+  }
+}
+```
+
+## Why this contract
+
+1. **Industry alignment** — the same JSON shape is served by:
+   - **Databricks Model Serving** (Anthropic Claude, Meta Llama, etc. behind one gateway)
+   - **OpenAI** itself
+   - **Together**, **Fireworks**, **Anyscale**, **Groq**, **Perplexity**
+   - **vLLM**, **llama.cpp**, **TGI** (self-hosted)
+   - **Anthropic via AWS Bedrock** (with a thin adapter)
+   - **Vertex AI** model garden (with a thin adapter)
+2. **Token semantics are uniform** — `usage.{prompt,completion,total}_tokens` is universal; bench's per-task cost accounting reads consistently.
+3. **Anthropic's `messages.create` is a close variant of the same shape** — providers that don't expose an OpenAI front-end (raw Anthropic SDK, Vertex) translate inside the provider class, so the bench code above them stays unchanged.
+
+## Provider abstraction (Python)
+
+```python
+from bench.providers import Provider, ChatMessage, ChatCompletion, get_provider
+
+provider = get_provider("databricks")  # or "mock", future: "anthropic", "openai", ...
+
+response: ChatCompletion = provider.chat(
+    messages=[
+        ChatMessage(role="system", content="..."),
+        ChatMessage(role="user",   content="..."),
+    ],
+    model="databricks-claude-haiku-4-5",
+    max_tokens=4096,
+    temperature=0.0,
+)
+
+print(response.content)
+print(response.usage.total_tokens)
+print(response.model)         # resolved model ID (richer than the alias)
+print(response.finish_reason)
+```
+
+Three types are public:
+
+- `ChatMessage` — role-tagged message
+- `Usage` — `prompt_tokens` + `completion_tokens` + `total_tokens`
+- `ChatCompletion` — `content` + `model` + `usage` + `finish_reason` (+ `raw` for forward-compat)
+
+Three exception types:
+
+- `ProviderNotInstalled` — optional SDK missing (e.g. `pip install openai`)
+- `ProviderNotConfigured` — required env vars missing (e.g. `DATABRICKS_TOKEN`)
+- `ProviderError` — wraps upstream SDK errors uniformly
+
+## Built-in providers (v0.11.0)
+
+| Name | Backend | Required env | Models |
+|---|---|---|---|
+| `databricks` | Databricks Model Serving (OpenAI-compatible) | `DATABRICKS_HOST`, `DATABRICKS_TOKEN` | `databricks-claude-haiku-4-5`, `databricks-claude-sonnet-4`, `databricks-claude-opus-4-5`, `databricks-meta-llama-4-maverick` |
+| `mock` | In-process deterministic stub | — | `mock-small`, `mock-large` |
+
+Future providers (planned, not yet shipped):
+
+| Name | Backend | Required env |
+|---|---|---|
+| `anthropic` | Anthropic API direct | `ANTHROPIC_API_KEY` |
+| `openai` | OpenAI API direct | `OPENAI_API_KEY` |
+| `openai-compat` | Generic OpenAI-compatible endpoint | `OPENAI_BASE_URL`, `OPENAI_API_KEY` |
+| `bedrock` | Anthropic via AWS Bedrock | `AWS_*` |
+
+## How to add a new provider
+
+1. Create `scripts/bench/providers/<name>.py` with a class that subclasses `Provider` and implements `configured()`, `list_models()`, and `chat()`.
+2. Soft-import the SDK in `chat()` (or `__init__`) and raise `ProviderNotInstalled` if missing — never make the SDK a hard dep.
+3. Read credentials from env vars in `__init__`; raise `ProviderNotConfigured` if missing.
+4. Map upstream errors to `ProviderError` (preserve original via `raise ... from exc`).
+5. Add an entry in `scripts/bench/providers/registry.py:_BUILTIN_PROVIDERS` of the form `"<name>": "bench.providers.<name>:<ClassName>"`.
+6. Add a row to the table above + the cost table in `base.py:_COST_TABLE_USD_PER_MILLION`.
+7. Add provider-specific tests in `tests/bench-providers.test.sh`.
+
+## Recommended invocation patterns
+
+### Direct (env already exported)
+
+```bash
+export DATABRICKS_HOST=https://...azuredatabricks.net
+export DATABRICKS_TOKEN=dapi...
+bstack bench run --runner live --provider databricks \
+    --model databricks-claude-haiku-4-5 \
+    --judge-model databricks-claude-opus-4-5 \
+    --phase 1
+```
+
+### Railway as credential broker (recommended for shared dev envs)
+
+When credentials live in Railway (the bstack-broomva-stimulus convention), use `railway run` to inject env vars without writing them to disk:
+
+```bash
+railway run --service stimulus-api -- bstack bench run \
+    --runner live --provider databricks \
+    --model databricks-claude-haiku-4-5 \
+    --judge-model databricks-claude-opus-4-5 \
+    --phase 1
+```
+
+### 1Password / sops / direnv / vault
+
+Any tool that exports env vars works. The provider class never sees the credential storage system — it only reads `os.environ`.
+
+## P20 model-isolation enforcement
+
+Bench enforces **Cross-Review (P20)** at the layer where it matters most: the LLM judge.
+
+**Rule:** when `--evaluator llm-judge` is selected, the judge model MUST differ from the agent model. Same model judging itself is exactly the single-model-echo-chamber failure P20 exists to prevent.
+
+```bash
+# ❌ Rejected — same model agent + judge
+bstack bench run --runner live --provider databricks \
+    --model databricks-claude-haiku-4-5 \
+    --evaluator llm-judge --judge-model databricks-claude-haiku-4-5
+# → exit 8: "judge model equals agent model"
+
+# ✅ Accepted — distinct models
+bstack bench run --runner live --provider databricks \
+    --model databricks-claude-haiku-4-5 \
+    --evaluator llm-judge --judge-model databricks-claude-opus-4-5
+
+# ⚠️  Override only with --allow-same-judge-model (must pass rationale)
+bstack bench run --runner live --provider databricks \
+    --model databricks-claude-haiku-4-5 \
+    --evaluator llm-judge --judge-model databricks-claude-haiku-4-5 \
+    --allow-same-judge-model "smoke test only — not a quality measurement"
+# → warning logged + rationale captured in run config.json
+```
+
+Same-provider, different-model is the cheapest path to compliance. Cross-provider (e.g. agent on Databricks Claude, judge on OpenAI GPT-4o) is the strongest. Bench logs both paths.
+
+## Anti-patterns
+
+- **Don't** hardcode credentials anywhere in bstack. The provider class reads `os.environ`; how those env vars get there is the caller's concern.
+- **Don't** make any SDK a hard dependency of bstack. Soft-import only.
+- **Don't** bypass the registry — always call `get_provider(name)`, never instantiate `DatabricksGatewayProvider(...)` directly from bench code (tests are the exception).
+- **Don't** treat `raw` as part of the public contract. It exists for one-off forward-compat reads (logprobs, structured outputs) and may be `None` for stub/mock providers.
+- **Don't** allow same-model judge silently. P20 violation requires explicit `--allow-same-judge-model` opt-out with rationale.
+
+## References
+
+- OpenAI Chat Completions API: https://platform.openai.com/docs/api-reference/chat
+- Databricks Foundation Model APIs: https://docs.databricks.com/en/machine-learning/foundation-models/index.html
+- Stimulus reference implementation: `apps/api/src/utils/databricks_openai.py` (in the stimulus repo)
+- bstack bench spec: `specs/bench-skill-evolution.md`
+- P20 Cross-Review primitive: `SKILL.md` § Bstack Core Automation Primitives
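
The "How to add a new provider" steps in the reference doc above can be illustrated with a self-contained sketch. Because `bench.providers` is not importable outside the repo, the base types here are minimal stand-ins written from the doc's description; their exact fields, the exception hierarchy, the `EchoProvider` class, the `ECHO_TOKEN` variable, and the injectable `env` argument are all hypothetical.

```python
import os
from abc import ABC, abstractmethod
from dataclasses import dataclass


class ProviderError(Exception):
    """Stand-in for bench.providers.ProviderError."""


class ProviderNotConfigured(ProviderError):
    """Stand-in; subclassing ProviderError is an assumption."""


@dataclass
class ChatMessage:
    role: str
    content: str


@dataclass
class Usage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


@dataclass
class ChatCompletion:
    content: str
    model: str
    usage: Usage
    finish_reason: str
    raw: object = None  # may be None for stub/mock providers


class Provider(ABC):
    @abstractmethod
    def configured(self): ...

    @abstractmethod
    def list_models(self): ...

    @abstractmethod
    def chat(self, messages, model, max_tokens=4096, temperature=0.0): ...


class EchoProvider(Provider):
    """Hypothetical provider: echoes the last user message, no network."""

    def __init__(self, env=None):
        env = os.environ if env is None else env
        self._token = env.get("ECHO_TOKEN")  # hypothetical variable name
        if not self._token:
            raise ProviderNotConfigured("ECHO_TOKEN is not set")

    def configured(self):
        return bool(self._token)

    def list_models(self):
        return ["echo-small"]

    def chat(self, messages, model, max_tokens=4096, temperature=0.0):
        if model not in self.list_models():
            raise ProviderError(f"unknown model: {model}")
        try:
            last_user = next(m for m in reversed(messages) if m.role == "user")
        except StopIteration as exc:
            # Step 4: map the underlying error, keeping the original cause.
            raise ProviderError("no user message supplied") from exc
        # Whitespace word counts stand in for real token counts.
        prompt = sum(len(m.content.split()) for m in messages)
        completion = len(last_user.content.split())
        return ChatCompletion(
            content=last_user.content,
            model=model,
            usage=Usage(prompt, completion, prompt + completion),
            finish_reason="stop",
        )
```

In the real tree the class would also need its registry entry and cost-table row (steps 5 and 6), which this sketch leaves out.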