diff --git a/how-to/integrating-cycles-with-async-openai.md b/how-to/integrating-cycles-with-async-openai.md index 8e748a8..f3c2047 100644 --- a/how-to/integrating-cycles-with-async-openai.md +++ b/how-to/integrating-cycles-with-async-openai.md @@ -20,6 +20,8 @@ The same lifecycle composes against other Rust LLM clients (Anthropic, Bedrock, - Error-aware patterns using `ReservationGuard` that preserve typed `OpenAIError` for the caller (`with_cycles()` wraps closure errors as `Error::Validation` and loses the original type) - Token-to-USD conversion at commit time for spend-denominated budgets +**Loud-failure stance.** The examples on this page error out on missing `usage`, missing `content`, or non-positive `caps.max_tokens` rather than silently committing zero or sending `max_completion_tokens=0` to OpenAI. This matches the shipped [`examples/async_openai_completion.rs`](https://github.com/runcycles/cycles-client-rust/blob/main/examples/async_openai_completion.rs) in the runcycles crate. Production code that prefers a fallback (e.g. commit the reservation estimate on missing usage) should opt into that fallback explicitly — the default in a teaching example should not be silent under-billing. + ## Cargo.toml ```toml @@ -28,6 +30,7 @@ runcycles = "0.2" async-openai = { version = "0.38", default-features = false, features = ["chat-completion", "rustls"] } tokio = { version = "1", features = ["full"] } futures = "0.3" # for stream consumption +thiserror = "2" # for the error-aware section ``` `async-openai` 0.31+ splits its surface behind per-API features — the `chat-completion` feature is what makes `Client` and the chat-completion types available. The 0.30.x line bundled everything by default; if you're upgrading from there, the example uses `async_openai::types::chat::` paths (the chat types moved out of the top-level `types::` module in 0.31). 
The 0.30.x line also pulled `backoff` transitively, which has been replaced with `tower` in 0.31+ — worth the version bump for the cleaner dependency tree alone. @@ -64,7 +67,9 @@ async fn main() -> Result<(), Box> { |_ctx| async move { let request = CreateChatCompletionRequestArgs::default() .model("gpt-4o-mini") - .max_tokens(800u32) + // max_completion_tokens is the current field; max_tokens is + // deprecated upstream for chat completions. + .max_completion_tokens(800u32) .messages([ChatCompletionRequestUserMessageArgs::default() .content(prompt) .build()? @@ -73,17 +78,23 @@ async fn main() -> Result<(), Box> { let response = openai.chat().create(request).await?; + // Loud-failure stance: a successful HTTP response with no choices + // / no content is a malformed result. Surfacing it as `Err` lets + // `with_cycles` release the reservation rather than commit on an + // empty reply. let text = response .choices .first() .and_then(|c| c.message.content.clone()) - .unwrap_or_default(); + .ok_or("OpenAI response had no message content")?; - // usage is `Option`; treat missing as zero - let actual = response + // Same stance for missing usage: committing zero tokens against a + // successful-looking call silently under-bills the budget. Error + // out and let the caller decide whether to fall back. + let usage = response .usage - .map(|u| u.total_tokens as i64) - .unwrap_or(0); + .ok_or("OpenAI response omitted usage — refusing to commit a guessed amount")?; + let actual = i64::from(usage.total_tokens); Ok((text, Amount::tokens(actual))) }, @@ -117,17 +128,24 @@ let reply = with_cycles( .action("llm.completion", "gpt-4o-mini") .subject(Subject { tenant: Some("acme-corp".into()), ..Default::default() }), |ctx| async move { - // Default ceiling; override if Cycles capped lower + // Default ceiling; override if Cycles capped lower. 
A non-positive + // cap is treated as an explicit refusal — sending max_completion_tokens=0 + // would charge the request for zero output, which is never the intent. let mut max_tokens: u32 = 800; if let Some(caps) = &ctx.caps { if let Some(cap) = caps.max_tokens { - max_tokens = (cap as u32).min(max_tokens); + let cap_u32 = u32::try_from(cap) + .map_err(|_| "caps.max_tokens is negative — refusing to call OpenAI")?; + if cap_u32 == 0 { + return Err("caps.max_tokens is 0 — refusing to call OpenAI".into()); + } + max_tokens = cap_u32.min(max_tokens); } } let request = CreateChatCompletionRequestArgs::default() .model("gpt-4o-mini") - .max_tokens(max_tokens) + .max_completion_tokens(max_tokens) .messages([ChatCompletionRequestUserMessageArgs::default() .content(prompt) .build()? @@ -135,9 +153,12 @@ let reply = with_cycles( .build()?; let response = openai.chat().create(request).await?; - let actual = response.usage.map(|u| u.total_tokens as i64).unwrap_or(0); - let text = response.choices.first().and_then(|c| c.message.content.clone()).unwrap_or_default(); - Ok((text, Amount::tokens(actual))) + let text = response.choices.first() + .and_then(|c| c.message.content.clone()) + .ok_or("OpenAI response had no message content")?; + let usage = response.usage + .ok_or("OpenAI response omitted usage")?; + Ok((text, Amount::tokens(i64::from(usage.total_tokens)))) }, ).await?; ``` @@ -178,24 +199,40 @@ let guard = cycles.reserve( .build() ).await?; -// Apply caps before building the request +// Apply caps before building the request. Non-positive caps are an explicit +// refusal — release the guard and bail rather than send max_completion_tokens=0. +// Note `let _ = ... .await` on release: if the release itself errors (rare — +// network failure between the agent and the Cycles server), the caller still +// sees the original zero-cap error rather than the release error swallowing +// it. 
let mut max_tokens: u32 = 1_500; if let Some(caps) = guard.caps() { if let Some(cap) = caps.max_tokens { - max_tokens = (cap as u32).min(max_tokens); + let cap_u32 = u32::try_from(cap) + .map_err(|_| "caps.max_tokens is negative — refusing to call OpenAI")?; + if cap_u32 == 0 { + let _ = guard.release("caps.max_tokens is 0".to_string()).await; + return Err("caps.max_tokens is 0 — refusing to call OpenAI".into()); + } + max_tokens = cap_u32.min(max_tokens); } } let request = CreateChatCompletionRequestArgs::default() .model("gpt-4o-mini") - .max_tokens(max_tokens) + .max_completion_tokens(max_tokens) .messages([ChatCompletionRequestUserMessageArgs::default() .content(prompt) .build()? .into()]) .stream(true) - // Required for the stream to emit a final usage chunk. - .stream_options(ChatCompletionStreamOptions { include_usage: true }) + // Required for the stream to emit a final usage chunk. The struct's + // fields are `Option` in async-openai 0.38.x — `include_obfuscation` + // is set to `None` to keep the upstream default. + .stream_options(ChatCompletionStreamOptions { + include_usage: Some(true), + include_obfuscation: None, + }) .build()?; let mut stream = openai.chat().create_stream(request).await?; @@ -212,14 +249,33 @@ while let Some(chunk_result) = stream.next().await { } // The final chunk carries usage when include_usage was set. if let Some(usage) = chunk.usage { - final_usage_tokens = usage.total_tokens as i64; + final_usage_tokens = i64::from(usage.total_tokens); } } -// Defensive fallback: if usage didn't arrive (some OpenAI-compatible -// providers don't honor include_usage), estimate locally. +// Two edge cases at end-of-stream: +// +// - `full_text` is empty: the stream produced no content chunks. Treat as +// a malformed result and release the guard rather than commit on a +// zero-output response. +// - `final_usage_tokens` is zero: the stream completed but the provider +// didn't honor `include_usage`. 
Some OpenAI-compatible servers (Ollama, +// vLLM, certain LiteLLM configs) silently drop the usage chunk. Either +// estimate locally with a tokenizer, or release and error. +// +// The example below takes the loud path (release + error) to match the +// non-streaming sections' stance. For production code that prefers a +// fallback, plug in the `tiktoken-rs` crate's `o200k_base()` encoder and +// commit the estimate — see the snippet at the end of this section. +if full_text.is_empty() { + let _ = guard.release("openai_stream_no_content".to_string()).await; + return Err("OpenAI stream produced no content".into()); +} if final_usage_tokens == 0 { - final_usage_tokens = estimate_tokens_with_tiktoken(&prompt, &full_text); + let _ = guard.release("openai_stream_no_usage".to_string()).await; + return Err( + "OpenAI stream omitted usage — set stream_options.include_usage or estimate locally".into(), + ); } guard.commit( @@ -239,6 +295,24 @@ guard.commit( If the stream errors midway (network failure, rate limit, content policy violation), call `guard.release(...).await?` — the reservation is returned to the pool with a reason code. The guard's `Drop` implementation provides best-effort release on panic / early `?` return, but explicit release with a reason code is preferred for clean audit records. +### Optional: tokenizer fallback for missing-usage chunks + +If the loud-failure path on missing usage is too pessimistic for your deployment — for instance, you're routing through an OpenAI-compatible proxy that doesn't honor `include_usage` and you can't change the proxy — plug in a real tokenizer instead of erroring out. 
The `tiktoken-rs` crate's `o200k_base` encoder matches the tokenizer used by gpt-4o-family models: + +```rust +// Add to Cargo.toml: tiktoken-rs = "0.6" (check crates.io for current) +use tiktoken_rs::o200k_base; + +fn estimate_tokens(prompt: &str, output: &str) -> Result<i64, Box<dyn std::error::Error>> { + let bpe = o200k_base()?; + let input = i64::try_from(bpe.encode_with_special_tokens(prompt).len())?; + let out = i64::try_from(bpe.encode_with_special_tokens(output).len())?; + Ok(input + out) +} +``` + +Then commit the value returned by `estimate_tokens(&prompt, &full_text)?` instead of releasing the guard on the missing-usage branch. The estimate is approximate (it doesn't account for system prompts, tool definitions, or the model's actual tokenization of formatting tokens), but it beats committing zero. + ## Error handling: preserving the OpenAI error type `async-openai` returns `OpenAIError`; Cycles returns `runcycles::Error`. Callers usually want to act on these differently: @@ -282,7 +356,7 @@ async fn run_completion( let request = CreateChatCompletionRequestArgs::default() .model("gpt-4o-mini") - .max_tokens(800u32) + .max_completion_tokens(800u32) .messages([ChatCompletionRequestUserMessageArgs::default() .content(prompt) .build()? @@ -298,14 +372,31 @@ async fn run_completion( } }; - let text = response.choices.first() - .and_then(|c| c.message.content.clone()) - .unwrap_or_default(); - let actual = response.usage.map(|u| u.total_tokens as i64).unwrap_or(0); + // Loud failure on malformed-but-successful responses: missing content or + // missing usage releases the reservation and surfaces as a typed error, + // rather than committing zero and silently under-billing. 
+ let text = match response.choices.first().and_then(|c| c.message.content.clone()) { + Some(t) => t, + None => { + let _ = guard.release("openai_no_content".to_string()).await; + return Err(CompletionError::Cycles(CyclesError::Validation( + "OpenAI response had no message content".into(), + ))); + } + }; + let usage = match response.usage { + Some(u) => u, + None => { + let _ = guard.release("openai_no_usage".to_string()).await; + return Err(CompletionError::Cycles(CyclesError::Validation( + "OpenAI response omitted usage".into(), + ))); + } + }; guard.commit( CommitRequest::builder() - .actual(Amount::tokens(actual)) + .actual(Amount::tokens(i64::from(usage.total_tokens))) .build() ).await?; @@ -356,9 +447,12 @@ fn tokens_to_microcents(prompt_tokens: u32, completion_tokens: u32, model: &str) } // Inside the with_cycles closure: -let usage = response.usage.unwrap_or_default(); +let usage = response.usage + .ok_or("OpenAI response omitted usage — refusing to commit a guessed amount")?; let microcents = tokens_to_microcents(usage.prompt_tokens, usage.completion_tokens, "gpt-4o-mini"); -Ok((text, Amount::usd_microcents(microcents as i64))) +let amount = i64::try_from(microcents) + .map_err(|_| "microcents overflow when converting to i64")?; +Ok((text, Amount::usd_microcents(amount))) ``` Keeping the rate table in one helper makes provider rate changes a single-edit fix. For multi-provider deployments, hoist it to your shared `costs` module. @@ -371,7 +465,7 @@ The reserve-commit shape is the same for any Rust LLM client. The four things yo 1. **The request builder type** — `CreateChatCompletionRequestArgs` for async-openai, `MessageCreateBuilder` / `MessageCreateParams` for Anthropic's `anthropic-sdk-rust`, the provider-specific equivalent elsewhere. 2. **The call method** — `client.chat().create(req)` for async-openai; consult the provider crate's docs for the equivalent. -3. 
**The response usage extraction** — `response.usage.map(|u| u.total_tokens as i64)` for async-openai; Anthropic returns `input_tokens` + `output_tokens` separately on its response usage object; check the crate. +3. **The response usage extraction** — `response.usage.ok_or(...)?` then `i64::from(usage.total_tokens)` for async-openai (loud failure on missing usage, no `as` cast). Anthropic returns `input_tokens` + `output_tokens` separately on its response usage object; the same `ok_or(...)?` / `i64::from(...)` pattern applies, except that you sum the two fields. 4. **The model name in the action label** — `.action("llm.completion", "claude-3-5-sonnet-20241022")` rather than `"gpt-4o-mini"`. Pin to the specific crate version you're using and verify each of those four points against its current docs before copy-pasting. The Rust Anthropic ecosystem in particular has churn across crate names and major versions; the reserve-commit lifecycle is unchanged, but the provider-side type paths are not portable. @@ -380,16 +474,18 @@ The [`Error Handling in Rust`](/how-to/error-handling-patterns-in-rust) patterns ## Common gotchas -1. **Streaming without `include_usage` reports zero tokens.** OpenAI's official streaming endpoint emits usage only when `stream_options.include_usage = true` is set on the request. Without it, you'll commit zero tokens and the budget will not reflect actual spend. Set the option, or fall back to a tokenizer estimate. +1. **Streaming without `include_usage` reports zero tokens.** OpenAI's official streaming endpoint emits usage only when `stream_options.include_usage` is set to `true` on the request. Without it, no usage chunk arrives, so code that defaults to zero commits zero tokens and the budget does not reflect actual spend. Set the option, and have a tokenizer fallback for OpenAI-compatible providers that don't honor it. -2. **`response.usage` is `Option`.** Some compatible servers (Ollama, vLLM, certain LiteLLM configs) don't return usage. Treat `None` as "estimate it locally" rather than "no spend." +2. 
**`response.usage` is `Option`.** Some compatible servers (Ollama, vLLM, certain LiteLLM configs) don't return usage. For **non-streaming** calls, the cleanest pattern is loud failure: return `Err`, let `with_cycles` release the reservation, and surface the issue to the caller (the examples above follow this stance, matching the shipped `cycles-client-rust/examples/async_openai_completion.rs`). Streaming is the one case where a fallback is worth considering: you've already consumed the stream, so re-issuing is expensive, and a tokenizer estimate beats committing zero. -3. **`response.choices[0].message.content` can be `None`** when the model returns a tool-call or refusal. Handle the `None` case (commit zero or commit the prompt-token cost only) rather than unwrapping. +3. **`response.choices[0].message.content` can be `None`** when the model returns only a tool call or a refusal, or finishes with `length` on a misconfigured request. Treat that as a malformed result (fail loud and release) rather than committing on an empty reply. 4. **Don't include the OpenAI API key in the Cycles reservation metadata.** Cycles records actions, not credentials. If you're tagging the reservation with provider info, use the action name (`gpt-4o-mini`) — never the key. 5. **Mismatched async runtimes.** `async-openai` uses `tokio`; the blocking `runcycles` variant requires not being inside a Tokio runtime. Pick one — for most LLM workloads, the async client is correct. +6. **`as u32` / `as i64` on values you got from elsewhere.** `cap as u32` silently wraps when `caps.max_tokens` is negative; `microcents as i64` silently wraps on overflow. Use `u32::try_from(...)` / `i64::try_from(...)` and surface a typed error instead. + ## Next steps - [Rust Client Quickstart](/quickstart/getting-started-with-the-rust-client) — the lifecycle this page composes against