Skip to content

Conversation

@bradheitmann
Copy link

Fix UTF-8 streaming in OpenAI provider 🎭

tl;dr

Fixed a bug where emoji and Chinese characters disappear in streaming responses. It's actually pretty fetch! 👑

(Scroll down for technical details)


The Problem

The OpenAI provider silently loses multi-byte UTF-8 characters when they're split across Server-Sent Event (SSE) chunk boundaries.

Current behavior:

let chunk_str = match std::str::from_utf8(&chunk) {
    Ok(s) => s,
    Err(e) => {
        error!("Failed to parse chunk as UTF-8: {}", e);
        continue;  // ❌ Silently skips incomplete bytes - DATA LOST
    }
};

What happens when 🎭 (U+1F3AD = [F0 9F 8E AD]) splits across chunks:

  • Chunk 1: [F0 9F 8E] (incomplete) → fails UTF-8 validation → entire chunk skipped
  • Chunk 2: [AD ...] (orphaned byte) → corrupted or lost
  • Result: "Regina 👑 says: That's so fetch! 💖✨" becomes "Regina says: That's so fetch! "

Silent corruption - no error thrown, characters just disappear.


The Solution

Use the existing decode_utf8_streaming() utility that the Anthropic provider already uses correctly (see anthropic.rs:404):

let mut byte_buffer = Vec::new(); // Add buffer for incomplete sequences

while let Some(chunk_result) = stream.next().await {
    match chunk_result {
        Ok(chunk) => {
            byte_buffer.extend_from_slice(&chunk);  // Accumulate bytes

            let Some(chunk_str) = decode_utf8_streaming(&mut byte_buffer) else {
                continue; // ✅ Preserves incomplete sequences for next chunk
            };

            buffer.push_str(&chunk_str);
            // ... rest of processing

The decode_utf8_streaming() function (already in your codebase at streaming.rs:14-32):

  • Processes valid UTF-8 bytes immediately
  • Preserves incomplete multi-byte sequences in the buffer
  • Waits for next chunk to complete the character
  • Zero data loss

Testing

Added 10 comprehensive tests in crates/g3-providers/tests/streaming_utf8_test.rs:

  1. 4-byte emoji split across chunks (🎭 = [F0 9F 8E AD])
  2. 3-byte Chinese split across chunks (中 = [E4 B8 AD])
  3. Mixed content: ASCII + emoji + Chinese
  4. Multiple consecutive emoji: 🎭🎯👑💖✨
  5. Persona-style emoji: "Regina 👑 says: That's so fetch! 💖✨"
  6. Real-world scenario: Long text with scattered multibyte chars
  7. ASCII regression test: No behavior change for ASCII-only
  8. Empty buffer: Edge case handling
  9. Single byte: Basic case
  10. Partial emoji at end: Robustness test

All tests pass. Zero regressions in full test suite (634 tests).


Impact

Fixes:

  • Silent data corruption for all OpenAI-compatible providers (OpenAI, OpenRouter, Groq, Together, etc.)
  • Critical for applications using emoji or non-ASCII characters in streaming
  • Matches the pattern you already use in anthropic.rs (consistency improvement)

Performance:

  • Zero overhead for ASCII-only content
  • Minimal buffering cost for multi-byte characters
  • Same validation logic, better preservation

Compatibility:

  • Fully backward compatible
  • No public API changes
  • No new dependencies
  • Drop-in replacement

Why From GB?

GB (Glitter Bomb) is a theatrical fork of G3 that adds Mean Girls-inspired personas to code review. We found this bug because Regina 👑 and Gretchen 💖 kept losing their emoji in streaming responses.

We're not just taking from G3 - we're giving back. This fix has zero GB-specific code. It's a pure improvement that benefits the entire G3 ecosystem.

GB maintains full compatibility with G3 by:

  • Using the same core architecture
  • Preserving all G3 features
  • Following G3 code conventions
  • Contributing improvements upstream (like this one!)

Changes

Modified:

  • crates/g3-providers/src/openai.rs (9 lines)
    • Import decode_utf8_streaming
    • Add byte_buffer: Vec<u8>
    • Replace unsafe UTF-8 conversion with buffered approach

Created:

  • crates/g3-providers/tests/streaming_utf8_test.rs (215 lines)
    • 10 comprehensive UTF-8 streaming tests

Total: 9 production lines changed (minimal change principle)


Verification

Tests:

$ cargo test --all
634 tests passed, 0 failed

Clippy:

$ cargo clippy --package g3-providers
0 new warnings

Build:

$ cargo build --release --package g3-providers
✅ Success

Scope:

$ git diff --name-only
crates/g3-providers/src/openai.rs
crates/g3-providers/tests/streaming_utf8_test.rs

Checklist

  • All existing tests pass (no regressions)
  • New tests added and passing (10 UTF-8 tests)
  • Clippy clean (no new warnings)
  • Release build succeeds
  • Minimal code changes (9 lines production code)
  • Backward compatible
  • No new dependencies
  • Follows existing code patterns (matches anthropic.rs)
  • Conventional Commits message format

About This Contribution

This fix was developed using edge-agentic methodology with formal QA review:

  • 7-agent swarm analysis identified the root cause
  • Independent QA verification (15/16 score)
  • Complete evidence bundles with reproducible verification
  • Adversarial review to ensure quality

Quality metrics:

  • 634 tests pass (100%)
  • 0 regressions
  • 9 production lines changed
  • 10 comprehensive tests added
  • QA approved with evidence

About GB (Glitter Bomb)

GB is a theatrical fork of G3 that demonstrates that personality and professional code quality can coexist. We maintain full G3 compatibility while adding 8 Mean Girls-inspired personas for code review.

What makes GB different:

  • 8 theatrical personas (Regina 👑, Gretchen 💖, Monica 🧹, etc.)
  • Gen-Z dialect with varying mastery levels
  • Inter-agent dialogue system
  • Still 100% compatible with G3's architecture

Our commitment: Contribute improvements back to G3. This UTF-8 fix is the first of hopefully many contributions.


GB Team - Where Theatricality Meets Technical Excellence 🎭

P.S. - If emoji in PR descriptions aren't your thing, that's totally valid! The code is solid either way. We just like to have fun while we work. 💖


References

The OpenAI provider was using unsafe direct UTF-8 conversion that silently
lost data when multi-byte characters (emoji, Chinese, etc.) were split across
SSE chunk boundaries.

This change:
- Adds byte buffer to parse_streaming_response
- Uses existing decode_utf8_streaming() utility (already used by Anthropic provider)
- Preserves incomplete UTF-8 sequences across chunks
- Maintains backward compatibility for ASCII-only streams

Testing:
- Added 10 comprehensive UTF-8 tests covering emoji, Chinese, mixed content
- All existing tests pass (no regressions)
- Tested 1-byte through 4-byte UTF-8 sequences
- Verified persona emoji preservation in streaming responses

Impact:
- Fixes silent data corruption for OpenAI-compatible providers (OpenRouter, Groq, Together)
- Critical for applications using emoji or non-ASCII characters in streaming
- Zero performance impact for ASCII-only content

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant