Engine: Fix UTF-8 mojibake (乱码) during streaming #129

Open
leo71233826 wants to merge 1 commit into jjang-ai:dev from leo71233826:fix/streaming-utf8-decoding

@leo71233826

Summary

This PR fixes a critical streaming bug in which multi-byte UTF-8 characters (e.g., Chinese, Japanese, emoji) are occasionally rendered as replacement characters (U+FFFD, \ufffd) when a token boundary splits a character's byte sequence.

Motivation

In LLM streaming, a single token may contain only part of the byte sequence of a multi-byte character.

  • The Bug: The previous logic truncated the buffer at a fixed length (based on tool-call markers) and decoded the resulting bytes immediately. When the cut landed in the middle of a UTF-8 sequence, the split character decoded as \ufffd.
  • The Fix: A trailing-byte validator now detects an incomplete UTF-8 sequence at the end of the buffer and holds those bytes back until the next chunk arrives to complete the sequence.
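
For readers unfamiliar with the technique, a trailing-byte check of this kind can be sketched as follows. This is a minimal illustration, not the code in this PR: `completeUTF8PrefixLength` and `UTF8StreamBuffer` are hypothetical names, and the sketch assumes the stream is available as raw `[UInt8]` chunks.

```swift
/// Hypothetical helper: returns how many leading bytes of `bytes` can be
/// decoded without cutting a trailing multi-byte UTF-8 sequence in half.
func completeUTF8PrefixLength(_ bytes: [UInt8]) -> Int {
    let count = bytes.count
    var i = count - 1
    var seen = 0
    // A UTF-8 sequence is at most 4 bytes long, so look back at most 4 bytes.
    while i >= 0 && seen < 4 {
        let b = bytes[i]
        if b & 0b1100_0000 == 0b1000_0000 {
            // Continuation byte (10xxxxxx): keep walking back to the lead byte.
            i -= 1
            seen += 1
            continue
        }
        // Lead byte (or ASCII): work out how long its sequence should be.
        let expected: Int
        if b & 0b1000_0000 == 0 { expected = 1 }                      // 0xxxxxxx
        else if b & 0b1110_0000 == 0b1100_0000 { expected = 2 }       // 110xxxxx
        else if b & 0b1111_0000 == 0b1110_0000 { expected = 3 }       // 1110xxxx
        else if b & 0b1111_1000 == 0b1111_0000 { expected = 4 }       // 11110xxx
        else { return count } // Invalid lead byte: leave it to the decoder.
        let available = count - i
        // Hold back the trailing sequence only if it is still incomplete.
        return available < expected ? i : count
    }
    return count
}

/// Hypothetical buffer that emits only complete UTF-8 text per chunk.
struct UTF8StreamBuffer {
    private var pending: [UInt8] = []

    mutating func append(_ chunk: [UInt8]) -> String {
        pending += chunk
        let n = completeUTF8PrefixLength(pending)
        let text = String(decoding: pending[..<n], as: UTF8.self)
        pending.removeFirst(n) // Keep the incomplete tail for the next chunk.
        return text
    }
}
```

Feeding the six bytes of "你好" two at a time through `append` would yield an empty string, then "你", then "好", instead of replacement characters at the split points.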

Changes

  • Stream.swift: Added a bitwise check on trailing bytes in the performOneGenerationPass loop to ensure we only emit complete UTF-8 sequences.
  • Tokenizer.swift: Updated NaiveStreamingDetokenizer to avoid emitting partial segments when \ufffd is detected at the end of the decoded string.
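
The detokenizer-side change can be pictured with a small sketch like the one below. It does not reproduce `NaiveStreamingDetokenizer`; `HoldBackDetokenizer` and its `decode` closure are hypothetical stand-ins for a tokenizer's decode call. It also assumes that decoding a longer token list only extends the previously decoded text.

```swift
/// Hypothetical streaming detokenizer that withholds output while the
/// decoded text ends in U+FFFD, i.e. while a character may be incomplete.
struct HoldBackDetokenizer {
    let decode: ([Int]) -> String       // Assumed stand-in for tokenizer.decode
    private var tokens: [Int] = []
    private var emittedByteCount = 0    // UTF-8 bytes already handed out

    init(decode: @escaping ([Int]) -> String) {
        self.decode = decode
    }

    /// Returns newly completed text, or nil if nothing is safe to emit yet.
    mutating func next(token: Int) -> String? {
        tokens.append(token)
        let text = decode(tokens)
        // A trailing U+FFFD suggests the last character is still partial.
        if let last = text.unicodeScalars.last, last.value == 0xFFFD {
            return nil
        }
        let utf8 = Array(text.utf8)
        guard utf8.count > emittedByteCount else { return nil }
        // Emit only the part of the text that has not been emitted before.
        let delta = String(decoding: utf8[emittedByteCount...], as: UTF8.self)
        emittedByteCount = utf8.count
        return delta
    }
}
```

One trade-off of this approach is that a genuine U+FFFD produced by the model is also held back until later tokens arrive, so a caller would likely need to flush any remaining text when generation ends.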

Verification

  • Tested with Qwen3.6-35B-A3B on Chinese prompts.
  • Confirmed that "mojibake" (random \ufffd inside words) is eliminated during high-speed streaming.
  • Verified zero performance regression on ASCII-only text.
