Engine: Fix UTF-8 mojibake (乱码) during streaming by leo71233826 · Pull Request #129 · jjang-ai/vmlx

leo71233826 · 2026-05-01T17:29:52Z

Summary

This PR fixes a critical streaming issue where multi-byte UTF-8 characters (e.g., Chinese, Japanese, emoji) would occasionally render as the replacement character (\ufffd) when a token boundary split the character's byte sequence.

Motivation

In LLM streaming, a single token may contain only part of the byte sequence of a multi-byte character.

The Bug: The previous logic truncated the buffer at a fixed length (based on tool-call markers) and decoded the resulting bytes immediately. If the cut landed in the middle of a UTF-8 sequence, the incomplete bytes decoded to \ufffd.
The Fix: A trailing-byte validator now detects an incomplete UTF-8 sequence at the end of the buffer and holds those bytes back until the next chunk arrives to complete the sequence.
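
For reviewers unfamiliar with the technique, the trailing-byte check described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: `incompleteUTF8SuffixLength` and `UTF8ChunkBuffer` are hypothetical names, and the sketch assumes the stream is available as raw `[UInt8]` chunks.

```swift
/// Hypothetical helper: returns how many bytes at the end of `bytes` belong to
/// a UTF-8 sequence whose remaining bytes have not arrived yet.
func incompleteUTF8SuffixLength(_ bytes: [UInt8]) -> Int {
    var index = bytes.count - 1
    var steps = 0
    // A UTF-8 scalar is at most 4 bytes long, so inspect at most 4 trailing bytes.
    while index >= 0 && steps < 4 {
        let byte = bytes[index]
        if byte & 0b1100_0000 == 0b1000_0000 {
            // Continuation byte (10xxxxxx): keep walking back to find the lead byte.
            index -= 1
            steps += 1
            continue
        }
        // `byte` is ASCII or a lead byte; derive the expected sequence length.
        let expected: Int
        if byte & 0b1000_0000 == 0 {
            expected = 1
        } else if byte & 0b1110_0000 == 0b1100_0000 {
            expected = 2
        } else if byte & 0b1111_0000 == 0b1110_0000 {
            expected = 3
        } else if byte & 0b1111_1000 == 0b1111_0000 {
            expected = 4
        } else {
            return 0 // Invalid lead byte: do not hold anything back.
        }
        let available = bytes.count - index
        return available < expected ? available : 0
    }
    return 0
}

/// Hypothetical buffer that emits only complete UTF-8 sequences.
struct UTF8ChunkBuffer {
    var pending: [UInt8] = []

    mutating func append(_ chunk: [UInt8]) -> String {
        pending += chunk
        let held = incompleteUTF8SuffixLength(pending)
        let text = String(decoding: pending.dropLast(held), as: UTF8.self)
        pending = Array(pending.suffix(held)) // Keep the partial tail for the next chunk.
        return text
    }
}

var buffer = UTF8ChunkBuffer()
let bytes = Array("乱码".utf8)                      // 6 bytes, 3 per character
let first = buffer.append(Array(bytes[0..<4]))     // "乱"; the lead byte of "码" is held back
let second = buffer.append(Array(bytes[4...]))     // "码"
print(first, second)
```

Bytes that are genuinely invalid (rather than merely incomplete) are passed through in this sketch, so the decoder still substitutes \ufffd for them instead of stalling the stream.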

Changes

Stream.swift: Added a bitwise check on trailing bytes in the performOneGenerationPass loop to ensure we only emit complete UTF-8 sequences.
Tokenizer.swift: Updated NaiveStreamingDetokenizer to avoid emitting partial segments when \ufffd is detected at the end of the decoded string.
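
The detokenizer-side behavior can be illustrated with a minimal sketch. It does not reproduce `NaiveStreamingDetokenizer`; `HoldBackDetokenizer`, the toy `vocab`, and the `decode` closure are hypothetical stand-ins, assuming a decode step that substitutes \ufffd for incomplete byte sequences.

```swift
/// Hypothetical detokenizer that withholds output while the decoded text
/// ends in U+FFFD, i.e. while the last token left a character incomplete.
struct HoldBackDetokenizer {
    let decode: ([Int]) -> String   // Stand-in for the tokenizer's decode call.
    var tokens: [Int] = []
    var emittedScalars = 0          // Number of Unicode scalars already emitted.

    init(decode: @escaping ([Int]) -> String) {
        self.decode = decode
    }

    mutating func append(token: Int) -> String? {
        tokens.append(token)
        let text = decode(tokens)
        let replacement: Unicode.Scalar = "\u{FFFD}"
        // A trailing U+FFFD suggests a partial character: wait for the next token.
        if text.unicodeScalars.last == replacement {
            return nil
        }
        let fresh = text.unicodeScalars.dropFirst(emittedScalars)
        emittedScalars = text.unicodeScalars.count
        return String(String.UnicodeScalarView(fresh))
    }
}

// Toy vocabulary: tokens 0 and 1 together spell "乱" (E4 B9 B1); token 2 is "!".
let vocab: [[UInt8]] = [[0xE4, 0xB9], [0xB1], Array("!".utf8)]
var detokenizer = HoldBackDetokenizer { ids in
    String(decoding: ids.flatMap { vocab[$0] }, as: UTF8.self)
}
for id in [0, 1, 2] {
    if let piece = detokenizer.append(token: id) {
        print(piece)
    }
}
```

One trade-off of this approach: a model that legitimately outputs U+FFFD as its final character would have that character withheld until another token arrives, so a real implementation would likely need a flush step at end of generation.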

Verification

Tested with Qwen3.6-35B-A3B on Chinese prompts.
Confirmed that "mojibake" (random \ufffd inside words) is eliminated during high-speed streaming.
Verified zero performance regression on ASCII-only text.

Engine: fix UTF-8 mojibake during streaming by ensuring byte boundary…

8a7ef2a

… alignment

leo71233826 wants to merge 1 commit into jjang-ai:dev from leo71233826:fix/streaming-utf8-decoding
