From e93b8134a0fa9da4b06bb7564cd4594a87878177 Mon Sep 17 00:00:00 2001
From: Albert Mavashev
Date: Sat, 16 May 2026 12:41:47 -0400
Subject: [PATCH 1/2] docs(rust): sync async-openai how-to with the loud-failure example
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When the cycles-client-rust example (PR #37) went through codex code
review, codex caught silent-under-billing bugs in patterns like
`response.usage.map(...).unwrap_or(0)` — we fixed those in the example
but never backported the fixes to the cycles-docs how-to. PR #659
merged the doc with the buggy patterns intact.

As a result, the shipped doc (on main) and the shipped example (in
cycles-client-rust) were teaching opposite things:

  - Doc: `.unwrap_or(0)` on missing usage → silent commit zero
  - Example: `.ok_or(...)?` on missing usage → error, auto-release

Plus an internal contradiction inside the doc itself: the "Common
gotchas" section said "Treat None as estimate locally rather than no
spend" while the code blocks above unwrap to zero. Description vs code
disagreed.

This PR brings the doc back in sync with the example and fixes the
internal contradictions.

Changes in `how-to/integrating-cycles-with-async-openai.md`:

- `.max_tokens(...)` → `.max_completion_tokens(...)` in all four code
  blocks (basic, ALLOW_WITH_CAPS, streaming, error-aware). OpenAI
  deprecated `max_tokens` for chat completions in favor of
  `max_completion_tokens`; async-openai 0.38 supports both but the doc
  should teach the current name.

- Missing `response.usage`: `.unwrap_or(0)` → `.ok_or(...)?` in
  non-streaming examples (basic, ALLOW_WITH_CAPS, error-aware,
  token-to-USD). Streaming kept the fallback pattern because re-issuing
  a consumed stream is expensive — that's the one legitimate place a
  tokenizer estimate beats erroring out.

- Missing `response.choices[0].message.content`: `.unwrap_or_default()`
  → `.ok_or(...)?` everywhere. Committing on an empty reply is the
  silent under-billing pattern that codex caught in the example.

- `(cap as u32)` → `u32::try_from(cap)?` with explicit zero-check. The
  old pattern would silently wrap on a negative `caps.max_tokens` and
  would send `max_completion_tokens=0` (rejected by OpenAI, but only
  after we've already paid the request cost) on a zero cap.

- `usage.total_tokens as i64` → `i64::from(usage.total_tokens)` — no
  `as` cast risk, more idiomatic.

- `microcents as i64` → `i64::try_from(microcents)?` — same.

- Streaming example's `estimate_tokens_with_tiktoken(...)` was a
  fictional function reference (never defined or imported). Replaced
  with a real inline tokenizer stub showing how to plug in
  `tiktoken-rs`, with a clear comment that the alternative is releasing
  the guard and erroring.

- Added a "Loud-failure stance" callout under "What you get" that
  spells out the design choice explicitly and points readers at the
  shipped `examples/async_openai_completion.rs` for the reference
  implementation.

- Rewrote gotchas #2 and #3 to match the new code rather than
  contradict it: gotcha #2 now distinguishes non-streaming (loud
  failure) from streaming (tokenizer fallback); gotcha #3 says "fail
  loud and release" instead of "commit zero". Added gotcha #6 covering
  `as` casts on values from external sources.

The async-openai 0.38 API and the runcycles 0.2.4 API are unchanged
from the prior version-sync PR (#660); this PR only changes how the
example code handles edge cases.
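
For reviewers, the cap-handling change described above can be looked at in isolation. The sketch below is illustrative only: `effective_max_tokens` and `lossy_cast` are hypothetical helpers that do not exist in runcycles or async-openai, and the cap is assumed to arrive as `Option<i64>` (the real field type may differ).

```rust
/// Hypothetical helper: clamp a default ceiling to a server-provided cap.
/// Assumption: the cap is an `Option<i64>`; the real field type may differ.
fn effective_max_tokens(cap: Option<i64>, default_ceiling: u32) -> Result<u32, String> {
    // No cap from the server: keep the caller's default ceiling.
    let Some(cap) = cap else {
        return Ok(default_ceiling);
    };
    // `try_from` rejects negative (and oversized) values instead of wrapping.
    let cap_u32 = u32::try_from(cap)
        .map_err(|_| format!("cap {cap} does not fit in u32, refusing to call the provider"))?;
    // A zero cap is treated as an explicit refusal rather than a zero-output request.
    if cap_u32 == 0 {
        return Err("cap is 0, refusing to call the provider".to_string());
    }
    Ok(cap_u32.min(default_ceiling))
}

/// The old pattern, kept only to show what the `as` cast does to a negative value.
fn lossy_cast(cap: i64) -> u32 {
    cap as u32
}
```

With the old cast, a cap of `-1` becomes `u32::MAX`, so a following `.min(800)` quietly leaves the ceiling at 800 as if no cap had been set. With the checked conversion, the same input surfaces as an error before any request is sent.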
---
 .../integrating-cycles-with-async-openai.md   | 122 +++++++++++++-----
 1 file changed, 93 insertions(+), 29 deletions(-)

diff --git a/how-to/integrating-cycles-with-async-openai.md b/how-to/integrating-cycles-with-async-openai.md
index 8e748a8..fd58c6e 100644
--- a/how-to/integrating-cycles-with-async-openai.md
+++ b/how-to/integrating-cycles-with-async-openai.md
@@ -20,6 +20,8 @@ The same lifecycle composes against other Rust LLM clients (Anthropic, Bedrock,
 - Error-aware patterns using `ReservationGuard` that preserve typed `OpenAIError` for the caller (`with_cycles()` wraps closure errors as `Error::Validation` and loses the original type)
 - Token-to-USD conversion at commit time for spend-denominated budgets
 
+**Loud-failure stance.** The examples on this page error out on missing `usage`, missing `content`, or non-positive `caps.max_tokens` rather than silently committing zero or sending `max_completion_tokens=0` to OpenAI. This matches the shipped [`examples/async_openai_completion.rs`](https://github.com/runcycles/cycles-client-rust/blob/main/examples/async_openai_completion.rs) in the runcycles crate. Production code that prefers a fallback (e.g. commit the reservation estimate on missing usage) should opt into that fallback explicitly — the default in a teaching example should not be silent under-billing.
+
 ## Cargo.toml
 
 ```toml
@@ -64,7 +66,9 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
         |_ctx| async move {
             let request = CreateChatCompletionRequestArgs::default()
                 .model("gpt-4o-mini")
-                .max_tokens(800u32)
+                // max_completion_tokens is the current field; max_tokens is
+                // deprecated upstream for chat completions.
+                .max_completion_tokens(800u32)
                 .messages([ChatCompletionRequestUserMessageArgs::default()
                     .content(prompt)
                     .build()?
@@ -73,17 +77,23 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
 
             let response = openai.chat().create(request).await?;
 
+            // Loud-failure stance: a successful HTTP response with no choices
+            // / no content is a malformed result. Surfacing it as `Err` lets
+            // `with_cycles` release the reservation rather than commit on an
+            // empty reply.
             let text = response
                 .choices
                 .first()
                 .and_then(|c| c.message.content.clone())
-                .unwrap_or_default();
+                .ok_or("OpenAI response had no message content")?;
 
-            // usage is `Option<CompletionUsage>`; treat missing as zero
-            let actual = response
+            // Same stance for missing usage: committing zero tokens against a
+            // successful-looking call silently under-bills the budget. Error
+            // out and let the caller decide whether to fall back.
+            let usage = response
                 .usage
-                .map(|u| u.total_tokens as i64)
-                .unwrap_or(0);
+                .ok_or("OpenAI response omitted usage — refusing to commit a guessed amount")?;
+            let actual = i64::from(usage.total_tokens);
 
             Ok((text, Amount::tokens(actual)))
         },
@@ -117,17 +127,24 @@ let reply = with_cycles(
         .action("llm.completion", "gpt-4o-mini")
         .subject(Subject { tenant: Some("acme-corp".into()), ..Default::default() }),
     |ctx| async move {
-        // Default ceiling; override if Cycles capped lower
+        // Default ceiling; override if Cycles capped lower. A non-positive
+        // cap is treated as an explicit refusal — sending max_completion_tokens=0
+        // would charge the request for zero output, which is never the intent.
         let mut max_tokens: u32 = 800;
         if let Some(caps) = &ctx.caps {
             if let Some(cap) = caps.max_tokens {
-                max_tokens = (cap as u32).min(max_tokens);
+                let cap_u32 = u32::try_from(cap)
+                    .map_err(|_| "caps.max_tokens is negative — refusing to call OpenAI")?;
+                if cap_u32 == 0 {
+                    return Err("caps.max_tokens is 0 — refusing to call OpenAI".into());
+                }
+                max_tokens = cap_u32.min(max_tokens);
             }
         }
 
         let request = CreateChatCompletionRequestArgs::default()
             .model("gpt-4o-mini")
-            .max_tokens(max_tokens)
+            .max_completion_tokens(max_tokens)
             .messages([ChatCompletionRequestUserMessageArgs::default()
                 .content(prompt)
                 .build()?
@@ -135,9 +152,12 @@ let reply = with_cycles(
             .build()?;
 
         let response = openai.chat().create(request).await?;
-        let actual = response.usage.map(|u| u.total_tokens as i64).unwrap_or(0);
-        let text = response.choices.first().and_then(|c| c.message.content.clone()).unwrap_or_default();
-        Ok((text, Amount::tokens(actual)))
+        let text = response.choices.first()
+            .and_then(|c| c.message.content.clone())
+            .ok_or("OpenAI response had no message content")?;
+        let usage = response.usage
+            .ok_or("OpenAI response omitted usage")?;
+        Ok((text, Amount::tokens(i64::from(usage.total_tokens))))
     },
 ).await?;
 ```
@@ -178,17 +198,24 @@ let guard = cycles.reserve(
         .build()
 ).await?;
 
-// Apply caps before building the request
+// Apply caps before building the request. Non-positive caps are an explicit
+// refusal — release the guard and bail rather than send max_completion_tokens=0.
 let mut max_tokens: u32 = 1_500;
 if let Some(caps) = guard.caps() {
     if let Some(cap) = caps.max_tokens {
-        max_tokens = (cap as u32).min(max_tokens);
+        let cap_u32 = u32::try_from(cap)
+            .map_err(|_| "caps.max_tokens is negative — refusing to call OpenAI")?;
+        if cap_u32 == 0 {
+            guard.release("caps.max_tokens is 0".to_string()).await?;
+            return Err("caps.max_tokens is 0 — refusing to call OpenAI".into());
+        }
+        max_tokens = cap_u32.min(max_tokens);
     }
 }
 
 let request = CreateChatCompletionRequestArgs::default()
     .model("gpt-4o-mini")
-    .max_tokens(max_tokens)
+    .max_completion_tokens(max_tokens)
     .messages([ChatCompletionRequestUserMessageArgs::default()
         .content(prompt)
         .build()?
@@ -212,14 +239,29 @@ while let Some(chunk_result) = stream.next().await {
     }
     // The final chunk carries usage when include_usage was set.
     if let Some(usage) = chunk.usage {
-        final_usage_tokens = usage.total_tokens as i64;
+        final_usage_tokens = i64::from(usage.total_tokens);
     }
 }
 
 // Defensive fallback: if usage didn't arrive (some OpenAI-compatible
-// providers don't honor include_usage), estimate locally.
+// providers don't honor include_usage), estimate locally. Streaming is the
+// one legitimate place for a fallback — you've already consumed the stream
+// and can't re-issue it cheaply, so a tokenizer estimate beats committing
+// zero. Plug in a real tokenizer here (e.g. the `tiktoken-rs` crate's
+// `cl100k_base` / `o200k_base` encoders) rather than the stub below:
+//
+// fn estimate_tokens(prompt: &str, output: &str) -> i64 {
+//     use tiktoken_rs::o200k_base;
+//     let bpe = o200k_base().unwrap();
+//     (bpe.encode_with_special_tokens(prompt).len()
+//         + bpe.encode_with_special_tokens(output).len()) as i64
+// }
+//
+// If you don't want to add a tokenizer dependency, releasing the guard and
+// erroring is also a defensible choice — see the loud-failure note in the
+// non-streaming examples.
 if final_usage_tokens == 0 {
-    final_usage_tokens = estimate_tokens_with_tiktoken(&prompt, &full_text);
+    final_usage_tokens = estimate_tokens(&prompt, &full_text);
 }
 
 guard.commit(
@@ -282,7 +324,7 @@ async fn run_completion(
 
     let request = CreateChatCompletionRequestArgs::default()
         .model("gpt-4o-mini")
-        .max_tokens(800u32)
+        .max_completion_tokens(800u32)
         .messages([ChatCompletionRequestUserMessageArgs::default()
             .content(prompt)
             .build()?
@@ -298,14 +340,31 @@ async fn run_completion(
         }
     };
 
-    let text = response.choices.first()
-        .and_then(|c| c.message.content.clone())
-        .unwrap_or_default();
-    let actual = response.usage.map(|u| u.total_tokens as i64).unwrap_or(0);
+    // Loud failure on malformed-but-successful responses: missing content or
+    // missing usage releases the reservation and surfaces as a typed error,
+    // rather than committing zero and silently under-billing.
+    let text = match response.choices.first().and_then(|c| c.message.content.clone()) {
+        Some(t) => t,
+        None => {
+            let _ = guard.release("openai_no_content".to_string()).await;
+            return Err(CompletionError::Cycles(CyclesError::Validation(
+                "OpenAI response had no message content".into(),
+            )));
+        }
+    };
+    let usage = match response.usage {
+        Some(u) => u,
+        None => {
+            let _ = guard.release("openai_no_usage".to_string()).await;
+            return Err(CompletionError::Cycles(CyclesError::Validation(
+                "OpenAI response omitted usage".into(),
+            )));
+        }
+    };
 
     guard.commit(
         CommitRequest::builder()
-            .actual(Amount::tokens(actual))
+            .actual(Amount::tokens(i64::from(usage.total_tokens)))
             .build()
     ).await?;
 
@@ -356,9 +415,12 @@ fn tokens_to_microcents(prompt_tokens: u32, completion_tokens: u32, model: &str)
 }
 
 // Inside the with_cycles closure:
-let usage = response.usage.unwrap_or_default();
+let usage = response.usage
+    .ok_or("OpenAI response omitted usage — refusing to commit a guessed amount")?;
 let microcents = tokens_to_microcents(usage.prompt_tokens, usage.completion_tokens, "gpt-4o-mini");
-Ok((text, Amount::usd_microcents(microcents as i64)))
+let amount = i64::try_from(microcents)
+    .map_err(|_| "microcents overflow when converting to i64")?;
+Ok((text, Amount::usd_microcents(amount)))
 ```
 
 Keeping the rate table in one helper makes provider rate changes a single-edit fix. For multi-provider deployments, hoist it to your shared `costs` module.
@@ -380,16 +442,18 @@ The [`Error Handling in Rust`](/how-to/error-handling-patterns-in-rust) patterns
 
 ## Common gotchas
 
-1. **Streaming without `include_usage` reports zero tokens.** OpenAI's official streaming endpoint emits usage only when `stream_options.include_usage = true` is set on the request. Without it, you'll commit zero tokens and the budget will not reflect actual spend. Set the option, or fall back to a tokenizer estimate.
+1. **Streaming without `include_usage` reports zero tokens.** OpenAI's official streaming endpoint emits usage only when `stream_options.include_usage` is set on the request. Without it, you'll commit zero tokens and the budget will not reflect actual spend. Set the option, and have a tokenizer fallback for OpenAI-compatible providers that don't honor it.
 
-2. **`response.usage` is `Option<CompletionUsage>`.** Some compatible servers (Ollama, vLLM, certain LiteLLM configs) don't return usage. Treat `None` as "estimate it locally" rather than "no spend."
+2. **`response.usage` is `Option<CompletionUsage>`.** Some compatible servers (Ollama, vLLM, certain LiteLLM configs) don't return usage. For **non-streaming** calls, the cleanest pattern is loud failure — return `Err`, let `with_cycles` release the reservation, surface the issue to the caller (the examples above follow this stance, matching the shipped `cycles-client-rust/examples/async_openai_completion.rs`). Streaming is the genuine exception: you've already consumed the stream so re-issuing is expensive, and a tokenizer estimate beats committing zero.
 
-3. **`response.choices[0].message.content` can be `None`** when the model returns a tool-call or refusal. Handle the `None` case (commit zero or commit the prompt-token cost only) rather than unwrapping.
+3. **`response.choices[0].message.content` can be `None`** when the model returns only a tool-call, a refusal, or finishes with `length` on a malformed setup. Treat that as a malformed result (fail loud and release) rather than committing on an empty reply.
 
 4. **Don't include the OpenAI API key in the Cycles reservation metadata.** Cycles records actions, not credentials. If you're tagging the reservation with provider info, use the action name (`gpt-4o-mini`) — never the key.
 
 5. **Mismatched async runtimes.** `async-openai` uses `tokio`; the blocking `runcycles` variant requires not being inside a Tokio runtime. Pick one — for most LLM workloads, the async client is correct.
 
+6. **`as u32` / `as i64` on values you got from elsewhere.** `cap as u32` silently wraps on a negative `cap.max_tokens`; `microcents as i64` silently wraps on overflow. Use `u32::try_from(...)` / `i64::try_from(...)` and surface a typed error instead.
+
 ## Next steps
 
 - [Rust Client Quickstart](/quickstart/getting-started-with-the-rust-client) — the lifecycle this page composes against

From 8a3e70157eda4dafc1eed2fa3215fa322c4dcf36 Mon Sep 17 00:00:00 2001
From: Albert Mavashev
Date: Sat, 16 May 2026 12:50:04 -0400
Subject: [PATCH 2/2] docs(rust): apply codex round-1 review
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Apply/skip tally: 6 applied, 0 pushed back.

Applied:

- `ChatCompletionStreamOptions` fields are `Option<bool>` in
  async-openai 0.38.x, not raw `bool`. Updated to
  `{ include_usage: Some(true), include_obfuscation: None }` with a
  comment explaining the `None` default.

- Cargo.toml block was missing `thiserror = "2"` even though the
  error-aware example uses `#[derive(thiserror::Error)]`. Added.

- Streaming zero-cap path used `guard.release(...).await?` — if release
  itself errored, the caller would see the release error instead of the
  original zero-cap error. Switched to
  `let _ = guard.release(...).await;` so the typed zero-cap error wins.

- Streaming end-of-stream had an "empty content, but has usage" hole —
  the example would commit on a stream that produced no text. Added an
  empty-`full_text` release-and-bail check before the usage check, so
  the loud-failure stance applies consistently.

- "Other Rust LLM clients" section still showed the old
  `response.usage.map(|u| u.total_tokens as i64)` pattern in the
  adaptation guidance for non-OpenAI providers. Updated to the
  `ok_or(...)?` + `i64::from(...)` shape that matches the rest of the
  doc.

- The `estimate_tokens()` function was referenced in the streaming
  example but only existed as a comment. Made it a real, copy-pasteable
  helper (uses `tiktoken-rs::o200k_base`) and moved the fallback into an
  explicit "Optional: tokenizer fallback for missing-usage chunks"
  subsection, framed as an alternative to the loud-failure default.

Also caught during the same pass:

- The tiktoken-rs stub itself used `as i64` casts on `usize` values —
  exactly the pattern the new gotcha #6 warns against. Switched to
  `i64::try_from(...)?` and changed the function signature to return
  `Result<i64, Box<dyn std::error::Error>>` so the conversion error has
  somewhere to go.

Codex verified the runcycles + async-openai 0.38.2 API surface, the
tiktoken-rs API names, and confirmed the shipped
cycles-client-rust/examples/async_openai_completion.rs on main matches
the doc's loud-failure stance.
---
 .../integrating-cycles-with-async-openai.md   | 72 +++++++++++++------
 1 file changed, 52 insertions(+), 20 deletions(-)

diff --git a/how-to/integrating-cycles-with-async-openai.md b/how-to/integrating-cycles-with-async-openai.md
index fd58c6e..f3c2047 100644
--- a/how-to/integrating-cycles-with-async-openai.md
+++ b/how-to/integrating-cycles-with-async-openai.md
@@ -30,6 +30,7 @@ runcycles = "0.2"
 async-openai = { version = "0.38", default-features = false, features = ["chat-completion", "rustls"] }
 tokio = { version = "1", features = ["full"] }
 futures = "0.3" # for stream consumption
+thiserror = "2" # for the error-aware section
 ```
 
 `async-openai` 0.31+ splits its surface behind per-API features — the `chat-completion` feature is what makes `Client` and the chat-completion types available. The 0.30.x line bundled everything by default; if you're upgrading from there, the example uses `async_openai::types::chat::` paths (the chat types moved out of the top-level `types::` module in 0.31). The 0.30.x line also pulled `backoff` transitively, which has been replaced with `tower` in 0.31+ — worth the version bump for the cleaner dependency tree alone.
@@ -200,13 +201,17 @@ let guard = cycles.reserve(
 
 // Apply caps before building the request. Non-positive caps are an explicit
 // refusal — release the guard and bail rather than send max_completion_tokens=0.
+// Note `let _ = ... .await` on release: if the release itself errors (rare —
+// network failure between the agent and the Cycles server), the caller still
+// sees the original zero-cap error rather than the release error swallowing
+// it.
 let mut max_tokens: u32 = 1_500;
 if let Some(caps) = guard.caps() {
     if let Some(cap) = caps.max_tokens {
         let cap_u32 = u32::try_from(cap)
             .map_err(|_| "caps.max_tokens is negative — refusing to call OpenAI")?;
         if cap_u32 == 0 {
-            guard.release("caps.max_tokens is 0".to_string()).await?;
+            let _ = guard.release("caps.max_tokens is 0".to_string()).await;
             return Err("caps.max_tokens is 0 — refusing to call OpenAI".into());
         }
         max_tokens = cap_u32.min(max_tokens);
@@ -221,8 +226,13 @@ let request = CreateChatCompletionRequestArgs::default()
         .build()?
         .into()])
     .stream(true)
-    // Required for the stream to emit a final usage chunk.
-    .stream_options(ChatCompletionStreamOptions { include_usage: true })
+    // Required for the stream to emit a final usage chunk. The struct's
+    // fields are `Option<bool>` in async-openai 0.38.x — `include_obfuscation`
+    // is set to `None` to keep the upstream default.
+    .stream_options(ChatCompletionStreamOptions {
+        include_usage: Some(true),
+        include_obfuscation: None,
+    })
     .build()?;
 
 let mut stream = openai.chat().create_stream(request).await?;
@@ -243,25 +253,29 @@ while let Some(chunk_result) = stream.next().await {
     }
 }
 
-// Defensive fallback: if usage didn't arrive (some OpenAI-compatible
-// providers don't honor include_usage), estimate locally. Streaming is the
-// one legitimate place for a fallback — you've already consumed the stream
-// and can't re-issue it cheaply, so a tokenizer estimate beats committing
-// zero. Plug in a real tokenizer here (e.g. the `tiktoken-rs` crate's
-// `cl100k_base` / `o200k_base` encoders) rather than the stub below:
+// Two edge cases at end-of-stream:
 //
-// fn estimate_tokens(prompt: &str, output: &str) -> i64 {
-//     use tiktoken_rs::o200k_base;
-//     let bpe = o200k_base().unwrap();
-//     (bpe.encode_with_special_tokens(prompt).len()
-//         + bpe.encode_with_special_tokens(output).len()) as i64
-// }
+// - `full_text` is empty: the stream produced no content chunks. Treat as
+//   a malformed result and release the guard rather than commit on a
+//   zero-output response.
+// - `final_usage_tokens` is zero: the stream completed but the provider
+//   didn't honor `include_usage`. Some OpenAI-compatible servers (Ollama,
+//   vLLM, certain LiteLLM configs) silently drop the usage chunk. Either
+//   estimate locally with a tokenizer, or release and error.
 //
-// If you don't want to add a tokenizer dependency, releasing the guard and
-// erroring is also a defensible choice — see the loud-failure note in the
-// non-streaming examples.
+// The example below takes the loud path (release + error) to match the
+// non-streaming sections' stance. For production code that prefers a
+// fallback, plug in the `tiktoken-rs` crate's `o200k_base()` encoder and
+// commit the estimate — see the snippet at the end of this section.
+if full_text.is_empty() {
+    let _ = guard.release("openai_stream_no_content".to_string()).await;
+    return Err("OpenAI stream produced no content".into());
+}
 if final_usage_tokens == 0 {
-    final_usage_tokens = estimate_tokens(&prompt, &full_text);
+    let _ = guard.release("openai_stream_no_usage".to_string()).await;
+    return Err(
+        "OpenAI stream omitted usage — set stream_options.include_usage or estimate locally".into(),
+    );
 }
 
 guard.commit(
@@ -281,6 +295,24 @@ guard.commit(
 
 If the stream errors midway (network failure, rate limit, content policy violation), call `guard.release(...).await?` — the reservation is returned to the pool with a reason code. The guard's `Drop` implementation provides best-effort release on panic / early `?` return, but explicit release with a reason code is preferred for clean audit records.
 
+### Optional: tokenizer fallback for missing-usage chunks
+
+If the loud-failure path on missing usage is too pessimistic for your deployment — for instance, you're routing through an OpenAI-compatible proxy that doesn't honor `include_usage` and you can't change the proxy — plug in a real tokenizer instead of erroring out. The `tiktoken-rs` crate's `o200k_base` encoder matches the tokenizer used by gpt-4o-family models:
+
+```rust
+// Add to Cargo.toml: tiktoken-rs = "0.6" (check crates.io for current)
+use tiktoken_rs::o200k_base;
+
+fn estimate_tokens(prompt: &str, output: &str) -> Result<i64, Box<dyn std::error::Error>> {
+    let bpe = o200k_base()?;
+    let input = i64::try_from(bpe.encode_with_special_tokens(prompt).len())?;
+    let out = i64::try_from(bpe.encode_with_special_tokens(output).len())?;
+    Ok(input + out)
+}
+```
+
+Then commit `estimate_tokens(&prompt, &full_text)` instead of releasing the guard on the missing-usage branch. The estimate will be approximate — it doesn't account for system prompts, tool definitions, or the model's actual tokenization of formatting tokens — but it beats committing zero.
+
 ## Error handling: preserving the OpenAI error type
 
 `async-openai` returns `OpenAIError`; Cycles returns `runcycles::Error`. Callers usually want to act on these differently:
@@ -433,7 +465,7 @@ The reserve-commit shape is the same for any Rust LLM client. The four things yo
 
 1. **The request builder type** — `CreateChatCompletionRequestArgs` for async-openai, `MessageCreateBuilder` / `MessageCreateParams` for Anthropic's `anthropic-sdk-rust`, the provider-specific equivalent elsewhere.
 2. **The call method** — `client.chat().create(req)` for async-openai; consult the provider crate's docs for the equivalent.
-3. **The response usage extraction** — `response.usage.map(|u| u.total_tokens as i64)` for async-openai; Anthropic returns `input_tokens` + `output_tokens` separately on its response usage object; check the crate.
+3. **The response usage extraction** — `response.usage.ok_or(...)?` then `i64::from(usage.total_tokens)` for async-openai (loud failure on missing usage, no `as` cast). Anthropic returns `input_tokens` + `output_tokens` separately on its response usage object; the same `ok_or(...)?` / `i64::from(...)` pattern applies, you just sum the two fields.
 4. **The model name in the action label** — `.action("llm.completion", "claude-3-5-sonnet-20241022")` rather than `"gpt-4o-mini"`.
 
 Pin to the specific crate version you're using and verify each of those four points against its current docs before copy-pasting. The Rust Anthropic ecosystem in particular has churn across crate names and major versions; the reserve-commit lifecycle is unchanged, but the provider-side type paths are not portable.