From 91de528a253fedc5245894a007c9133482153c07 Mon Sep 17 00:00:00 2001 From: Federico Kamelhar Date: Fri, 1 May 2026 20:26:17 -0400 Subject: [PATCH] docs(providers): rewrite OpenAI / Anthropic / Ollama / OCI pages MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The previous provider pages were correct but terse — capability tree dumps with no narrative, weak "when to pick" guidance, and no practical-workflow worked examples. This rewrite restructures every page around the same easy-to-read template: 1. One-paragraph "what this provider is + why you'd pick it" 2. "When to pick X" table with cross-links to alternatives 3. Getting started — three numbered steps 4. What you get out of the box — each capability as natural prose plus a runnable code example, not a single-line bullet 5. Practical workflow / gotcha section 6. Common gotchas table 7. Source links + See also Per-page highlights ------------------- - OpenAI: separated chat / o-series reasoning / streaming / tool-calling / structured-output as their own subsections; gateway tour (Azure / Portkey / LiteLLM / vLLM / together) explains when each one is the right pick. - Anthropic: the prompt-caching cost story spelled out (1/10× input on cached span; ~5-min ephemeral window; auto when system > ~1024 tokens). Extended-thinking ThinkEvent example. Cross-link to Claude-on-OCI as a no-API-key alternative. - Ollama: positioned as "develop offline / test deterministically / iterate before paying"; per-model tool-calling support table; the laptop→hosted swap-one-line workflow demonstrated. - OCI: now leads with the value prop ("90+ models, day-0 model coverage, one auth surface"). Three-transport story (V1 / SDK / DAC) with a single ascii table showing routing by model-id pattern. DAC section integrates the new how-to. The "laptop dev → OKE production" workflow shows the value of one auth surface concretely. Page sizes: 318 → 719 lines total. 
- openai: 77 → 165 - anthropic: 71 → 186 - ollama: 73 → 167 - oci: 97 → 201 Validation ---------- - ``hatch run check`` clean. - ``mkdocs build --strict`` not re-run here; docs.yml CI workflow catches link rot on the next push to main. Signed-off-by: Federico Kamelhar --- docs/concepts/providers/anthropic.md | 185 +++++++++++++++++++++----- docs/concepts/providers/oci.md | 186 +++++++++++++++++++++------ docs/concepts/providers/ollama.md | 172 +++++++++++++++++++------ docs/concepts/providers/openai.md | 176 ++++++++++++++++++------- 4 files changed, 560 insertions(+), 159 deletions(-) diff --git a/docs/concepts/providers/anthropic.md b/docs/concepts/providers/anthropic.md index 212a81c1..ab9a543b 100644 --- a/docs/concepts/providers/anthropic.md +++ b/docs/concepts/providers/anthropic.md @@ -1,71 +1,186 @@ # Anthropic -Direct calls to `api.anthropic.com` via `AnthropicModel`. +The Anthropic provider connects locus directly to Anthropic's API +(`api.anthropic.com`). Use it when you want **the Claude family** — +Opus for the hardest problems, Sonnet as the everyday workhorse, +Haiku for high-volume cheap calls — and want to talk to Anthropic +without going through an intermediary. -```python -agent = Agent(model="anthropic:claude-sonnet", ...) -``` +Two things make this provider distinct: **prompt caching** (long +system prompts and tool blocks pay 1/10th the input cost on repeat +turns) and **extended thinking** (Claude 4 surfaces its reasoning as +a stream of typed events your UI can render). 
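As a rough illustration of the 1/10th figure above, the sketch below compares billable input over a multi-turn session with and without a cached prefix. `input_token_units` is a hypothetical helper, not part of locus or the Anthropic SDK. It assumes every turn lands inside the cache window, and it ignores output tokens and any surcharge for writing the cache.

```python
def input_token_units(turns: int, cached_tokens: int, fresh_tokens: int,
                      cached_divisor: int = 10) -> tuple[float, float]:
    """Return (uncached_units, cached_units) of billable input.

    Hypothetical helper for illustration only. One unit is one
    full-price input token, so multiply by your per-token rate.
    """
    if turns <= 0:
        return 0.0, 0.0
    per_turn_full = cached_tokens + fresh_tokens
    # Without caching, every turn pays full price for the whole prompt.
    uncached = float(turns * per_turn_full)
    # With caching, turn 1 pays full price; later turns pay 1/10th
    # on the cached span and full price on the fresh tokens.
    later_turn = cached_tokens / cached_divisor + fresh_tokens
    cached = per_turn_full + (turns - 1) * later_turn
    return uncached, float(cached)


uncached, cached = input_token_units(turns=10, cached_tokens=10_000, fresh_tokens=500)
print(f"without caching: {uncached:.0f} units, with caching: {cached:.0f} units")
```

Under these assumptions, a 10,000-token system prompt reused over ten turns costs roughly a quarter of the uncached input bill. Real invoices will differ, so treat the numbers as an order-of-magnitude guide.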
+ +## When to pick Anthropic + +| You want… | This is the right provider | +|---|---| +| Claude Opus / Sonnet / Haiku from Anthropic directly | ✓ | +| Long system prompts amortised across many turns | ✓ — automatic prompt caching | +| Extended-thinking models with visible reasoning | ✓ — `ThinkEvent` stream | +| Claude on Oracle infrastructure (no separate API key) | use [OCI](oci.md) — `oci:anthropic.claude-sonnet` | +| GPT or Llama | use [OpenAI](openai.md) or [OCI](oci.md) instead | + +## Getting started + +### 1. Set your API key ```bash export ANTHROPIC_API_KEY=sk-ant-... ``` -## Capabilities - -```text -anthropic: -│ -├── Claude family — opus · sonnet · haiku -├── streaming — real SSE, token-level -├── tool calling — Anthropic tool-use protocol -├── structured output — tool-as-schema pattern -│ -├── prompt caching — long system + tool blocks marked cacheable; -│ subsequent turns pay 1/10th input cost -│ -└── extended thinking — passes thinking blocks through as ThinkEvent +### 2. Pick a Claude model + +```python +from locus import Agent + +agent = Agent( + model="anthropic:claude-sonnet-4-20250514", + system_prompt="You are a helpful assistant.", +) +``` + +The string `"anthropic:claude-sonnet-4-20250514"` tells locus the +provider (`anthropic:`) and the exact model id. Any model id +Anthropic accepts, locus accepts — including the dated revision +suffixes (`-20250514`). + +### 3. Run it + +```python +result = agent.run_sync("Summarise the design doc in three bullets.") +print(result.message) +``` + +That's the full setup. Streaming, tool calling, prompt caching, and +extended thinking work without extra configuration. 
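The `provider:model-id` convention from step 2 can be pictured as a split on the first colon. The sketch below is an assumption about the shape of that convention, not the parser locus actually ships; `split_model_string` is a hypothetical name.

```python
def split_model_string(model: str) -> tuple[str, str]:
    """Split 'provider:model-id' on the FIRST colon only.

    Hypothetical helper for illustration; not locus's real parser.
    Splitting once keeps model ids that contain their own colons
    (for example Ollama tags) intact.
    """
    provider, sep, model_id = model.partition(":")
    if not sep or not provider or not model_id:
        raise ValueError(f"expected 'provider:model-id', got {model!r}")
    return provider, model_id


print(split_model_string("anthropic:claude-sonnet-4-20250514"))
```

Splitting only once matters for ids such as `ollama:qwen2.5-coder:32b`, where the part after the provider prefix has a colon of its own.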
+ +## What you get out of the box + +### The whole Claude family + +Whatever Anthropic ships, you can address by name: + +| Model | When to pick it | +|---|---| +| `claude-opus-4-…` | Hardest problems — code archaeology, deep research, multi-step reasoning | +| `claude-sonnet-4-…` | Everyday workhorse — fast enough, smart enough, cheap enough | +| `claude-haiku-4-…` | High-volume cheap calls — classification, routing, simple summaries | + +### Real SSE streaming + +Token-level streaming. The model emits content deltas; locus +converts them to `ModelChunkEvent`s; your `async for` loop reads +them as they arrive. + +```python +async for event in agent.run("Write a haiku about latency."): + if isinstance(event, ModelChunkEvent) and event.content: + print(event.content, end="", flush=True) +``` + +### Tool calling — the Anthropic tool-use protocol + +`@tool` functions are translated into Anthropic's `tools` schema; the +model's structured `tool_use` blocks are parsed back into locus +`ToolCall`s. Parallel tool calls are supported (the model can +request multiple tools per turn; locus runs them concurrently via +the `ConcurrentExecutor`). + +### Structured output — tool-as-schema + +Anthropic doesn't expose a `response_format` field, so locus uses +the standard "single-tool" trick: define the schema as a tool, force +the model to call it. From your side, the API is identical to the +other providers: + +```python +from pydantic import BaseModel + +class Triage(BaseModel): + severity: str + needs_human: bool + +agent = Agent( + model="anthropic:claude-sonnet-4-20250514", + output_schema=Triage, +) +result = agent.run_sync("This page is broken!") +print(result.parsed) # Triage(severity='high', needs_human=True) ``` -## Prompt caching +### Prompt caching — automatic for long prompts + +This is the biggest cost saver if your system prompt or tool block is +long (skills, playbooks, RAG context). 
Anthropic's prompt-caching +mechanism marks a span of the request as cacheable; subsequent turns +within the cache window pay **1/10th** the input cost on the cached +span. -locus marks long system prompts and tool blocks as cacheable -automatically. Subsequent turns within the cache window pay 1/10th -the input cost. +locus reads the request shape and applies `cache_control` to anything +beyond a small threshold automatically. You don't opt in. -You don't have to opt in — locus reads the request shape and applies -`cache_control` to anything beyond a small threshold. To force or -suppress it, set `prompt_cache=True|False` on the model config. +```python +# Force or suppress caching explicitly: +agent = Agent( + model="anthropic:claude-sonnet-4-20250514", + model_config={"prompt_cache": True}, # or False to opt out +) +``` -## Extended thinking +When it kicks in: -When the model returns thinking blocks (Claude 4 models with -`thinking_enabled`), locus emits a `ThinkEvent` per block in the -event stream. Pipe it straight to your UI: +- A 5-minute "ephemeral" cache (rolling window) — the default. +- Subsequent turns reusing the same prefix pay `0.1× input rate` on + the cached portion. +- Effective when system prompts > ~1024 tokens, or you've loaded a + big skill / playbook / RAG block. + +### Extended thinking — visible reasoning + +Claude 4 models with `thinking_enabled` think before answering, the +way the OpenAI o-series does. 
Anthropic surfaces those thinking +blocks in the response; locus emits a `ThinkEvent` for each one so +your UI can show what the model is working on: ```python async for event in agent.run("..."): match event: case ThinkEvent(reasoning=r) if r: print(f"💭 {r}") + case ModelChunkEvent(content=c) if c: + print(c, end="", flush=True) ``` -## Claude on OCI +## Claude on OCI — same model, different provider -For Claude without a separate Anthropic API key, use the OCI -transport instead — same `Agent`, different prefix: +Don't have an Anthropic API key? Want Claude billed through your +Oracle account on Oracle infrastructure? Switch the prefix: ```python agent = Agent(model="oci:anthropic.claude-sonnet", ...) ``` -This routes through `OCIOpenAIModel` and inherits OCI auth, so no -`ANTHROPIC_API_KEY` is needed. +That routes through `OCIOpenAIModel` against the OCI Generative AI +endpoint (uses `OCI_PROFILE` for auth, no `ANTHROPIC_API_KEY` needed). +Same model behind it; different billing surface. + +## Common gotchas + +| Symptom | Likely cause | +|---|---| +| `401 authentication_error` | `ANTHROPIC_API_KEY` not set, or set to a key without console access | +| `404 not_found_error` on the model id | Dated revision suffix is wrong; check `https://docs.anthropic.com/en/docs/about-claude/models/all-models` | +| `529 overloaded_error` | Anthropic capacity; the `ModelRetryHook` retries with backoff if installed | +| Prompt caching not visible in usage stats | Cache window expired (5 min ephemeral) or prompt below the threshold | +| `ThinkEvent`s never fire | Model not in the extended-thinking subset, or `thinking_enabled` not set in `model_config` | ## Source -[`AnthropicModel` in `models/native/anthropic.py`](https://github.com/oracle-samples/locus/blob/main/src/locus/models/native/anthropic.py). 
+[`AnthropicModel` in `src/locus/models/native/anthropic.py`](https://github.com/oracle-samples/locus/blob/main/src/locus/models/native/anthropic.py) ## See also - [Models overview](../models.md) — the full provider tree. - [OCI Generative AI](oci.md) — Claude via OCI. +- [OpenAI](openai.md) — GPT family direct. diff --git a/docs/concepts/providers/oci.md b/docs/concepts/providers/oci.md index 40cb6876..e1761391 100644 --- a/docs/concepts/providers/oci.md +++ b/docs/concepts/providers/oci.md @@ -1,14 +1,41 @@ # OCI Generative AI -OCI is the day-1 target. **90+ models, two transports under one -class hierarchy, day-0 model support.** When OCI ships a new model -id, locus already supports it. +OCI Generative AI is locus's **day-1 target** and the most capable +provider in the box. It exposes 90+ models — OpenAI commercial +families, Meta Llama, Anthropic Claude, Google Gemini, xAI Grok, +Mistral, and Cohere — through Oracle's hosted inference service. +**When OCI ships a new model id, locus already supports it** — you +just pass the new id. + +The headline value over the direct providers: + +- **One auth surface.** Same `OCI_PROFILE` mechanism on a laptop, in + CI, or running on OCI Compute / OKE / Functions. +- **Day-0 model coverage.** New OpenAI / Anthropic / Llama models + reach OCI on the day they're released. +- **No per-provider API keys.** GPT, Claude, Llama all bill through + your OCI tenancy. +- **Dedicated AI Cluster (DAC) endpoints** for predictable latency + and isolation when on-demand isn't enough. 
+ +## When to pick OCI + +| You want… | This is the right provider | +|---|---| +| GPT, Claude, Llama, Cohere, Gemini, Grok, Mistral all in one place | ✓ | +| Production inference on Oracle infrastructure (OKE / Compute / Functions) | ✓ | +| One auth surface across laptop, CI, OCI workloads | ✓ | +| Provisioned-capacity inference via [DAC](../../how-to/oci-dac.md) | ✓ | +| To avoid managing per-provider API keys | ✓ | +| Bleeding-edge OpenAI features the day they ship | use [OpenAI](openai.md) direct — OCI sometimes lags by hours/days | +| Local development without auth setup | use [Ollama](ollama.md) instead | -OCI exposes its inference service in two ways. locus speaks both -and picks the right one automatically from the model id. You do not -have to know which transport a model uses to call it. +## Two transports under one prefix -## Model families +OCI Generative AI exposes its inference service in two ways. locus +speaks both and **picks the right one automatically from the model +id** — you don't have to know which transport a model uses to call +it. ```text oci: (one prefix · two transports) @@ -21,77 +48,154 @@ oci: (one prefix · two transports) │ ├─ google.* — Google Gemini family │ └─ anthropic.* — Anthropic Claude on OCI (no separate API key) │ -└── SDK transport · OCIModel OCI Generative AI Python SDK - └─ cohere.command-r* — Cohere R-series only (native API only) +├── SDK transport · OCIModel OCI Generative AI Python SDK +│ └─ cohere.command-r* — Cohere R-series only (native API only) +│ +└── DAC endpoints · OCIModel DedicatedServingMode + └─ ocid1.generativeaiendpoint.... — provisioned capacity ``` -## V1 transport — `/openai/v1` (OpenAI-compatible) +### V1 transport — `/openai/v1` (OpenAI-compatible) `OCIOpenAIModel` calls `https://inference.generativeai..oci.oraclecloud.com/openai/v1/chat/completions`. -Used for the majority of OCI models: OpenAI commercial, Meta Llama, -xAI Grok, Mistral, Google Gemini, and Anthropic on OCI. 
Real SSE -streaming, OpenAI-style function calling, structured output. The -wire format is identical to OpenAI's — anything you know about -prompting OpenAI directly carries over. +This is the **default path for the majority of OCI models**: +OpenAI commercial, Meta Llama, xAI Grok, Mistral, Google Gemini, and +Claude on OCI. The wire format is identical to OpenAI's, so anything +you know about prompting OpenAI carries over: real SSE streaming, +OpenAI-style function calling, structured output, vision input. + +```python +agent = Agent(model="oci:openai.gpt-5.5") # OpenAI commercial +agent = Agent(model="oci:meta.llama-3.3-70b-instruct") # Meta Llama +agent = Agent(model="oci:anthropic.claude-sonnet") # Claude — no Anthropic key needed +``` -## SDK transport — OCI native API +### SDK transport — OCI native API -`OCIModel` calls the OCI Generative AI Python SDK directly. Used -**only for Cohere R-series** (`cohere.command-r-*`), which OCI +`OCIModel` calls the OCI Generative AI Python SDK directly. It's +used **only for Cohere R-series** (`cohere.command-r-*`), which OCI exposes through the native API rather than the OpenAI-compatible -gateway. +gateway. Cohere R has its own request shape (separate `message` + +`chat_history` instead of a flat `messages` array). -## Transport selection — automatic +```python +agent = Agent(model="oci:cohere.command-r-plus-08-2024") # SDK transport +``` + +### DAC endpoints — dedicated capacity + +When you've provisioned a Dedicated AI Cluster (DAC), OCI gives you +a **generative AI endpoint OCID**. 
Pass it as the model id and locus +auto-routes through the SDK transport with `DedicatedServingMode`: ```python -# Both work; the transport is picked from the model id: -agent = Agent(model="oci:openai.gpt-5.5") # → V1 (OCIOpenAIModel) -agent = Agent(model="oci:cohere.command-r-plus") # → SDK (OCIModel) +agent = Agent( + model=get_model( + "oci:ocid1.generativeaiendpoint.oc1.....", + compartment_id="ocid1.compartment.oc1...", + profile_name="DEFAULT", + ), +) ``` -Override with `LOCUS_OCI_TRANSPORT=v1` or `=sdk` if you ever need to -force one path. +[Full DAC how-to →](../../how-to/oci-dac.md) — covers Qwen-on-DAC, +streaming, tool-call quirks per model. -## Auth — one surface for every environment +## Transport selection — automatic -Same `OCI_PROFILE` mechanism on the laptop, in CI, and on OCI -workloads. `OCI_AUTH_TYPE` selects the signer: +You don't pick the transport. locus looks at the model id and +chooses: -| Auth type | Where it works | +| Model id pattern | Transport | |---|---| -| `api_key` | Laptop with `~/.oci/config` profile | -| `session_token` | Federated SSO laptop · `oci session authenticate` | -| `instance_principal` | OCI Compute · OKE pods | -| `resource_principal` | OCI Functions · serverless | +| `ocid1.generativeaiendpoint....` | SDK + `DedicatedServingMode` (DAC) | +| `cohere.command-r-*` | SDK + `OnDemandServingMode` | +| `openai.*` / `meta.*` / `xai.*` / `mistral.*` / `google.*` / `anthropic.*` | V1 (OpenAI-compatible) | + +Need to override? Set `LOCUS_OCI_TRANSPORT=v1` or `LOCUS_OCI_TRANSPORT=sdk`. + +## One auth surface — laptop, CI, OCI workloads + +Same `OCI_PROFILE` env var everywhere. 
`OCI_AUTH_TYPE` selects the +signer: + +| Auth type | Where it works | What you set | +|---|---|---| +| `api_key` | Laptop with `~/.oci/config` profile | `OCI_AUTH_TYPE=api_key`, `OCI_PROFILE=DEFAULT` | +| `session_token` | Federated SSO laptop | `oci session authenticate` first; then `OCI_AUTH_TYPE=session_token` | +| `instance_principal` | OCI Compute · OKE pods | `OCI_AUTH_TYPE=instance_principal` (no key file needed) | +| `resource_principal` | OCI Functions · serverless | `OCI_AUTH_TYPE=resource_principal` (provider-injected) | ```bash export OCI_PROFILE=DEFAULT -export OCI_AUTH_TYPE=api_key # or session_token / instance_principal / resource_principal +export OCI_AUTH_TYPE=api_key ``` -No code change between environments — only the env var differs. +**No code change between environments — only the env var differs.** +That's the value: prototype on your laptop, deploy to OKE, route +through Compute. Same `Agent` instance, same model id, three +different signers. ## Region OCI Generative AI is offered in `us-chicago-1`, `eu-frankfurt-1`, -`uk-london-1`, `sa-saopaulo-1`, and a growing list. Pass `OCI_REGION` -to override the region baked into your profile: +`uk-london-1`, `sa-saopaulo-1`, and a growing list. The region baked +into your profile is the default; override with `OCI_REGION`: ```bash export OCI_REGION=us-chicago-1 ``` +## Practical wiring — laptop dev → OKE production + +```python +# Same code on your laptop and on OKE: +from locus import Agent + +agent = Agent( + model="oci:openai.gpt-5.5", + system_prompt="You are a helpful assistant.", +) +``` + +```bash +# Laptop: +export OCI_PROFILE=DEFAULT +export OCI_AUTH_TYPE=api_key + +# OKE pod: +export OCI_AUTH_TYPE=instance_principal +# (no profile / key file — OKE injects the principal at runtime) +``` + +The agent doesn't care. That's the OCI provider's whole pitch. 
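The routing table under "Transport selection" can be read as a short prefix check. The sketch below mirrors that table only; it is not the code in `_make_oci()`, and `pick_transport` along with its return labels is hypothetical. The `override` argument stands in for a value read from `LOCUS_OCI_TRANSPORT`.

```python
from typing import Optional

# Model-id prefixes the table routes to the OpenAI-compatible V1 transport.
V1_PREFIXES = ("openai.", "meta.", "xai.", "mistral.", "google.", "anthropic.")


def pick_transport(model_id: str, override: Optional[str] = None) -> str:
    """Return a transport label for an OCI model id.

    Hypothetical sketch of the documented routing table; the real
    selection logic in locus may differ.
    """
    if override in ("v1", "sdk"):
        return override  # explicit override wins over pattern matching
    if model_id.startswith("oci:"):
        model_id = model_id[len("oci:"):]  # tolerate the provider prefix
    if model_id.startswith("ocid1.generativeaiendpoint."):
        return "sdk-dedicated"  # DAC endpoint OCID
    if model_id.startswith("cohere.command-r"):
        return "sdk-on-demand"  # Cohere R-series, native API only
    if model_id.startswith(V1_PREFIXES):
        return "v1"
    raise ValueError(f"no routing rule for model id {model_id!r}")


for mid in ("oci:openai.gpt-5.5", "oci:cohere.command-r-plus-08-2024"):
    print(mid, "->", pick_transport(mid))
```

Checking the endpoint OCID and Cohere patterns first keeps the two SDK cases from being shadowed by the broader V1 prefix list.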
+ +## Common gotchas + +| Symptom | Likely cause | +|---|---| +| `404 Not Authorized` (yes, 404 not 403) | OCI's standard permission-denied disguise. Your principal lacks `inspect generative-ai-endpoints` policy in the compartment. | +| `model_id not found` | Model id doesn't exist in your tenancy's region. Check `oci generative-ai model list --region `. | +| `compartment_id is required` | DAC endpoints enforce it even when on-demand wouldn't. Pass `compartment_id=` on the model. | +| Streaming yields one big chunk | DAC endpoint rejected `is_stream`. The fall-back path swallows the failure and emits the full response as one chunk; check `OCI_LOG_REQUESTS=1`. | +| Cohere R model fails on V1 | Force the SDK transport: `LOCUS_OCI_TRANSPORT=sdk`. | + ## Source | | | |---|---| -| `OCIOpenAIModel` (V1) | [`models/providers/oci/openai_compat.py:163`](https://github.com/oracle-samples/locus/blob/main/src/locus/models/providers/oci/openai_compat.py#L163) | -| `OCIModel` (SDK) | [`models/providers/oci/__init__.py`](https://github.com/oracle-samples/locus/blob/main/src/locus/models/providers/oci/__init__.py) | -| Submodel providers (Cohere, Generic) | [`models/providers/oci/models/`](https://github.com/oracle-samples/locus/tree/main/src/locus/models/providers/oci/models) | +| `OCIOpenAIModel` (V1) | [`src/locus/models/providers/oci/openai_compat.py`](https://github.com/oracle-samples/locus/blob/main/src/locus/models/providers/oci/openai_compat.py) | +| `OCIModel` (SDK + DAC) | [`src/locus/models/providers/oci/__init__.py`](https://github.com/oracle-samples/locus/blob/main/src/locus/models/providers/oci/__init__.py) | +| Per-family request builders | [`src/locus/models/providers/oci/models/`](https://github.com/oracle-samples/locus/tree/main/src/locus/models/providers/oci/models) | +| Routing | [`src/locus/models/registry.py`](https://github.com/oracle-samples/locus/blob/main/src/locus/models/registry.py) — `_make_oci()` | ## See also -- [OCI GenAI models 
how-to](../../how-to/oci-models.md) — auth setup, region selection, debugging. - [Models overview](../models.md) — the full provider tree. +- [OCI GenAI models how-to](../../how-to/oci-models.md) — auth setup, region selection, debugging. +- [OCI Dedicated AI Cluster (DAC)](../../how-to/oci-dac.md) — provisioned-capacity endpoints. +- [OpenAI](openai.md) — direct OpenAI when OCI lags. +- [Anthropic](anthropic.md) — Claude direct when OCI lags. +- [Ollama](ollama.md) — local development before swapping to OCI. diff --git a/docs/concepts/providers/ollama.md b/docs/concepts/providers/ollama.md index f7114a38..48ca4069 100644 --- a/docs/concepts/providers/ollama.md +++ b/docs/concepts/providers/ollama.md @@ -1,73 +1,167 @@ # Ollama -Local model runtime. `OllamaModel` calls a local Ollama server. +The Ollama provider is **locus pointed at a local model runtime**. +Ollama runs open-weight models on your laptop or a shared GPU box; +locus calls it over HTTP exactly the way it would call OpenAI or +Anthropic. No API key, no network egress, no per-token billing. + +This is the right pick for **offline development**, **deterministic +tests**, and **iterating on agent design before you spend a dollar +on hosted inference**. + +## When to pick Ollama + +| You want… | This is the right provider | +|---|---| +| To develop offline — laptop, plane, isolated network | ✓ | +| Deterministic tests — same prompt + seed → same output | ✓ | +| Cost-free agent iteration before swapping to a paid API | ✓ | +| Privacy-sensitive prototyping where data can't leave the machine | ✓ | +| A frontier model (GPT-5, Claude Opus 4) | use [OpenAI](openai.md) or [Anthropic](anthropic.md) | +| Production-scale concurrency | use [OCI](oci.md), [OpenAI](openai.md), [Anthropic](anthropic.md) | + +## Getting started + +### 1. Install Ollama and pull a model + +Ollama itself isn't a Python package — it's a small binary that runs +a local HTTP server. 
+ +```bash +# macOS (Homebrew) — or download from ollama.com +brew install ollama + +# Start the server (it backgrounds itself): +ollama serve & + +# Pull a model with native tool-calling support: +ollama pull llama3.3 +``` + +`ollama list` will show what you've pulled. Anything in that list is +addressable from locus immediately. + +### 2. Wire locus ```python -agent = Agent(model="ollama:llama3.2", ...) +from locus import Agent + +agent = Agent(model="ollama:llama3.3", system_prompt="You are helpful.") ``` -```bash -export OLLAMA_HOST=http://localhost:11434 # default +That's it. No env vars, no auth — Ollama is local-first by default. + +### 3. Run it + +```python +result = agent.run_sync("Sum 7 plus 35 in one word.") +print(result.message) +# → '42.' ``` -No API key — Ollama is local-first. +Done. Streaming and tool calling work the same as for any other +provider — provided the model you pulled supports them. -## Capabilities +## What you get out of the box -```text -ollama: -│ -├── any pulled local model — run `ollama list` to see what is installed -├── streaming — token-level via the local SSE stream -├── tool calling — works for any model that supports it -│ (llama3.1+, qwen2.5, mistral, deepseek-r1) -└── auth — none · OLLAMA_HOST=http://localhost:11434 (default) +### Any pulled local model — no locus change needed + +The `model_id` after `ollama:` is whatever appears in `ollama list`. +locus doesn't maintain an allow-list; if Ollama can run it, locus +can address it. + +```bash +ollama list +# llama3.3:latest +# qwen2.5-coder:32b +# deepseek-r1:14b ``` -Whatever you `ollama pull` is available immediately. No locus change -needed. +```python +agent_a = Agent(model="ollama:llama3.3") +agent_b = Agent(model="ollama:qwen2.5-coder:32b") +agent_c = Agent(model="ollama:deepseek-r1:14b") +``` -## When to use Ollama +### Real local streaming -- **Offline development** — laptops, planes, isolated networks. 
-- **Deterministic tests** — no network egress; same model, same - prompt, same seed → same output. -- **Cost-control sandboxing** — iterate on agent design with a - free local model before swapping to a paid API. -- **Privacy-sensitive prototyping** — data never leaves the machine. +Ollama emits SSE-shaped chunks; locus reads them as `ModelChunkEvent`s +just like any other provider. Token-level streaming over localhost +is fast — typically <5 ms per chunk. -## Tool calling +```python +async for event in agent.run("Write a haiku about caching."): + if isinstance(event, ModelChunkEvent) and event.content: + print(event.content, end="", flush=True) +``` -Tool calling support is per-model in Ollama. As of writing: +### Tool calling — model-dependent + +Ollama supports tool calling for models that emit it natively. As of +writing: | Model family | Tool calling | |---|---| -| llama3.1+, llama3.2, llama3.3 | ✅ | -| llama4 | ✅ | -| qwen2.5, qwen2.5-coder | ✅ | -| mistral, mixtral | ✅ | -| deepseek-r1 | ✅ (with reasoning) | -| phi3 | ❌ (no native tool calling) | +| `llama3.1` / `llama3.2` / `llama3.3` | ✓ | +| `llama4` | ✓ | +| `qwen2.5` / `qwen2.5-coder` / `qwen3` | ✓ | +| `mistral` / `mixtral` | ✓ | +| `deepseek-r1` | ✓ (with reasoning) | +| `phi3` | ✗ — no native tool calling | -If a model doesn't support native tool calling, the agent will -still run but the model can't invoke tools — the loop terminates -on the first turn. +If a model doesn't support tool calling, the agent will still **run** — +it just won't be able to invoke any `@tool` you defined. The loop +then terminates after the first turn (no tools called, no follow-up +needed). -## Custom Ollama server +### No auth — by design -Override the host for a remote Ollama (e.g., a shared GPU box): +Ollama listens on `localhost:11434` with no authentication. That's +intentional for the local-first use case. 
To run against a shared +remote Ollama: ```bash export OLLAMA_HOST=http://gpu-box.internal:11434 ``` -The same `OllamaModel` class works against any HTTP-reachable Ollama -endpoint. +The same `OllamaModel` class talks to any HTTP-reachable Ollama +endpoint. (If you're exposing a remote Ollama, put it behind a VPN +or auth proxy yourself — Ollama doesn't ship one.) + +## Practical workflow — develop local, ship hosted + +A common pattern: prototype an agent against Ollama for free, then +swap one line to point at OCI / OpenAI / Anthropic for production. + +```python +# Development: +agent = Agent(model="ollama:llama3.3", tools=[...], system_prompt="...") + +# Production — same agent, swap the model id: +agent = Agent(model="oci:openai.gpt-5.5", tools=[...], system_prompt="...") +``` + +Everything else — tools, hooks, checkpointers, termination, RAG — +stays identical. You're not coupled to the local runtime; Ollama is +just a model address. + +## Common gotchas + +| Symptom | Likely cause | +|---|---| +| `Connection refused` on `localhost:11434` | Ollama server isn't running. `ollama serve &` in another terminal. | +| `model 'X' not found` | Haven't pulled it yet. `ollama pull X`. | +| Slow first response after hours of idle | Ollama unloads models from VRAM after inactivity. The first call after a long pause re-loads (a few seconds). | +| Tool calls never fire | The model you pulled doesn't support tools (e.g. `phi3`). Switch to `llama3.3` or `qwen2.5`. | +| `tool_calls` parsed as text instead of structured | Some Ollama versions emit XML-style `{...}` blocks. Update Ollama (`brew upgrade ollama`) or use a model with stable structured tool-call output. | +| Different output every run despite the same prompt | Set `temperature=0` and pin `seed` in `model_config`. | ## Source -[`OllamaModel` in `models/native/ollama.py`](https://github.com/oracle-samples/locus/blob/main/src/locus/models/native/ollama.py). 
+[`OllamaModel` in `src/locus/models/native/ollama.py`](https://github.com/oracle-samples/locus/blob/main/src/locus/models/native/ollama.py) ## See also - [Models overview](../models.md) — the full provider tree. +- [OpenAI](openai.md) — GPT family direct. +- [OCI Generative AI](oci.md) — production-scale OCI inference. diff --git a/docs/concepts/providers/openai.md b/docs/concepts/providers/openai.md index 3109c8b5..d3c21c8e 100644 --- a/docs/concepts/providers/openai.md +++ b/docs/concepts/providers/openai.md @@ -1,77 +1,165 @@ # OpenAI -Direct calls to `api.openai.com` via `OpenAIModel`. +The OpenAI provider connects locus directly to OpenAI's API +(`api.openai.com`). It's what you reach for when you want **the latest +OpenAI model the day it ships** without going through any gateway, +translation layer, or middleware. -```python -agent = Agent(model="openai:gpt-5.5", ...) -agent = Agent(model="openai:o3", ...) # reasoning model -``` +It's also the **fastest way to try locus** — one env var, one line of +code, you're talking to GPT-5 or the o-series reasoning models. + +## When to pick OpenAI + +| You want… | This is the right provider | +|---|---| +| GPT-5, GPT-4o, or any latest OpenAI release | ✓ | +| The o-series reasoning models (`o3`, `o4-mini`) | ✓ | +| To go through Azure / Portkey / LiteLLM / vLLM | ✓ — same class, different `base_url` | +| Claude or Llama | use [Anthropic](anthropic.md) or [OCI](oci.md) instead | +| To run on Oracle infrastructure | use [OCI](oci.md) — you'll get the same OpenAI models without a separate key | + +## Getting started + +### 1. Set your API key ```bash export OPENAI_API_KEY=sk-... 
``` -## Capabilities - -```text -openai: -│ -├── chat completions — gpt-* family (vision, audio, structured output) -├── reasoning models — o-series (adds reasoning_effort: low | medium | high) -├── streaming — real SSE, token-level -├── tool calling — OpenAI tool-call protocol -├── structured output — response_model / JSON schema -│ -└── base_url override — any OpenAI-compatible gateway - ├─ Azure OpenAI - ├─ Portkey - ├─ LiteLLM proxy - ├─ vLLM (self-hosted) - └─ together.ai · fireworks · groq · any /v1-shaped endpoint +That's the only setup. locus reads the env var automatically. + +### 2. Pick a model + +```python +from locus import Agent + +agent = Agent(model="openai:gpt-5.5", system_prompt="You are helpful.") ``` -## Custom base URL — Azure, Portkey, LiteLLM, vLLM +The string `"openai:gpt-5.5"` does two things: tells locus to use the +OpenAI provider (`openai:` prefix), and which model id to call +(`gpt-5.5`). Any model id OpenAI accepts, locus accepts. -`base_url` overrides the API endpoint. Any OpenAI-compatible gateway -works under the same `OpenAIModel` class: +### 3. Run it + +```python +result = agent.run_sync("What is two plus two?") +print(result.message) +# → 'Four.' +``` + +Done. Streaming, tool calls, structured output — all of it works +without further configuration. + +## What you get out of the box + +### Chat completions across the GPT family + +Every chat-shaped OpenAI model: `gpt-4o`, `gpt-4.1`, `gpt-5`, `gpt-5.5`, +`gpt-image-1`. Vision input (image URLs / base64), audio input, and +function calling work the same way you'd use them on the OpenAI SDK +directly — locus just normalises the events the model emits. + +### Reasoning models — the o-series + +`o1`, `o3`, `o4-mini` route through the same `Agent(model="openai:o3")` +call. They're slower and more expensive but think before they answer. +locus surfaces the model's thinking blocks as `ThinkEvent`s so your +UI can show "thinking…" without parsing the response yourself. 
 ```python
 agent = Agent(
-    model="openai:gpt-4o",
-    model_config={"base_url": "https://api.portkey.ai/v1"},
+    model="openai:o3",
+    model_config={"reasoning_effort": "high"}, # low | medium | high
 )
 ```
-| Gateway | `base_url` |
-|---|---|
-| Azure OpenAI | `https://.openai.azure.com/openai/deployments/` |
-| Portkey | `https://api.portkey.ai/v1` |
-| LiteLLM Proxy | `https:///v1` |
-| vLLM (self-hosted) | `http://localhost:8000/v1` |
-| together.ai / fireworks / groq | their published `/v1` URL |
+`reasoning_effort` is OpenAI's knob for how long the model spends
+thinking. Default is `medium`.
+
+### Real SSE streaming
+
+Token-level streaming over Server-Sent Events. The model emits
+deltas, locus turns them into `ModelChunkEvent`s, your `async for`
+loop reads them as they arrive — no buffering, no fake chunking.
+
+```python
+async for event in agent.run("Write a haiku about latency."):
+    if isinstance(event, ModelChunkEvent) and event.content:
+        print(event.content, end="", flush=True)
+```
-For Azure, `api_key` carries the Azure key. For Portkey and LiteLLM,
-their virtual-key system applies.
+### Tool calling — the OpenAI protocol
-## Reasoning models
+`@tool` functions are converted to OpenAI's tool-call schema and
+the structured `tool_calls` field in the response is parsed back into
+locus `ToolCall` objects. Parallel tool calls are supported (the
+model can request multiple tools per turn; locus runs them
+concurrently via the `ConcurrentExecutor`).
-Reasoning models (`o1`, `o3`, `o4-mini`) route through the same
-class.
locus adds `reasoning_effort` to the request when set:
+### Structured output — Pydantic models in, validated objects out
 ```python
+from pydantic import BaseModel
+
+class Answer(BaseModel):
+    summary: str
+    confidence: float
+
 agent = Agent(
-    model="openai:o3",
-    model_config={"reasoning_effort": "high"},
+    model="openai:gpt-5.5",
+    output_schema=Answer,
+    system_prompt="Reply as JSON matching the schema.",
 )
+result = agent.run_sync("Was the meeting productive?")
+print(result.parsed) # Answer(summary='...', confidence=0.83)
 ```
-The model's thinking blocks come through as `ThinkEvent`s in the
-event stream so you can show "thinking…" in your UI.
+Under the hood, locus sends an OpenAI `response_format` with the
+schema and a strict-mode flag; if the model produces invalid JSON,
+locus retries with the validation errors in the prompt
+(`output_schema_retries=2` by default).
+
+## Going through a gateway
+
+A `base_url` override turns `OpenAIModel` into a client for any
+OpenAI-compatible endpoint:
+
+| Gateway | When to use it | `base_url` |
+|---|---|---|
+| **Azure OpenAI** | Enterprise / regulated workloads, Azure billing | `https://.openai.azure.com/openai/deployments/` |
+| **Portkey** | Virtual keys, request routing across providers, retries | `https://api.portkey.ai/v1` |
+| **LiteLLM Proxy** | Self-hosted control plane in front of N providers | `https:///v1` |
+| **vLLM** | Self-hosted inference for open models with the OpenAI shape | `http://localhost:8000/v1` |
+| **together.ai / fireworks / groq** | Hosted open-model inference at OpenAI-shape | their published `/v1` |
+
+```python
+agent = Agent(
+    model="openai:gpt-4o",
+    model_config={"base_url": "https://api.portkey.ai/v1"},
+)
+```
+
+The `api_key` your `OPENAI_API_KEY` provides is forwarded — for Azure
+that's the Azure resource key, for Portkey it's the Portkey virtual
+key, etc.
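
The gateway table above can be read as a simple lookup from gateway name to `base_url`. The following is a minimal sketch of that idea, assuming only the `model_config={"base_url": ...}` shape shown in the patch. `GATEWAY_BASE_URLS` and `gateway_model_config` are hypothetical names for illustration and are not part of locus. Only the two gateways whose URL is fully spelled out in the table are included.

```python
# Hypothetical helper, not part of locus: it only builds the dict that the
# documentation passes as `model_config`.
GATEWAY_BASE_URLS = {
    "portkey": "https://api.portkey.ai/v1",
    "vllm": "http://localhost:8000/v1",  # local self-hosted address from the table
}


def gateway_model_config(gateway: str) -> dict[str, str]:
    """Return a model_config dict pointing at the named gateway."""
    key = gateway.lower()
    if key not in GATEWAY_BASE_URLS:
        known = ", ".join(sorted(GATEWAY_BASE_URLS))
        raise ValueError(f"unknown gateway {gateway!r}; known: {known}")
    return {"base_url": GATEWAY_BASE_URLS[key]}


if __name__ == "__main__":
    # The resulting dict is what would be passed as Agent(..., model_config=...).
    print(gateway_model_config("portkey"))
```

Keeping the URLs in one mapping means switching gateways is a one-word change at the call site, and an unknown name fails loudly instead of silently falling back to `api.openai.com`.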
+
+## Common gotchas
+
+| Symptom | Likely cause |
+|---|---|
+| `401 Unauthorized` | `OPENAI_API_KEY` not set, or set to the wrong project's key |
+| `429 Rate limit exceeded` | OpenAI quota; locus retries automatically with `ModelRetryHook` if installed |
+| `model_not_found` | Model id doesn't exist for your tier — check `https://platform.openai.com/docs/models` |
+| Empty `tool_calls` | Model decided not to call a tool; check the system prompt |
+| `reasoning_effort` rejected | Only valid for o-series models, not GPT-4o / GPT-5 |
 ## Source
-[`OpenAIModel` in `models/native/openai.py`](https://github.com/oracle-samples/locus/blob/main/src/locus/models/native/openai.py).
+[`OpenAIModel` in `src/locus/models/native/openai.py`](https://github.com/oracle-samples/locus/blob/main/src/locus/models/native/openai.py)
 ## See also
 - [Models overview](../models.md) — the full provider tree.
+- [Anthropic](anthropic.md) — Claude family direct.
+- [OCI Generative AI](oci.md) — same OpenAI models without a separate key, on Oracle infrastructure.