fix(tts): Fix initial noise in streaming playback #27

Draft
biggsun wants to merge 1 commit into groxaxo:main from biggsun:main

biggsun commented May 15, 2026

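On the first emit, the codec buffer holds far fewer frames than the decode window,
so the zero-padded codes pass through the decoder's attention layers and contaminate the output for the real frames.
Use decode() to decode the actual frames directly instead, avoiding zero-padding artifacts.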
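
The mechanism described above can be illustrated with a toy model. This is only a sketch, not the project's decoder: the 16-entry codebook, the 3-tap mixing kernel, the right-side padding, and the name `toy_decode` are all assumptions made for illustration. The point it shows is that code index 0 is a real codebook entry rather than silence, so padding with it changes the output of real frames that fall within the decoder's receptive field.

```python
import numpy as np

def toy_decode(codes, table, kernel):
    """Toy stand-in for a codec decoder: look up one value per code, then mix neighbours."""
    emb = table[codes]
    return np.convolve(emb, kernel, mode="same")

rng = np.random.default_rng(0)
table = rng.normal(size=16)           # hypothetical codebook
table[0] = 2.0                        # entry 0 is an ordinary code, not silence
kernel = np.array([0.25, 0.5, 0.25])  # receptive field of 3 frames

real = np.array([5, 9, 3, 7])         # 4 real frames in the buffer
window = 10                           # fixed decode window (assumed size)
padded = np.concatenate([real, np.zeros(window - len(real), dtype=int)])

direct = toy_decode(real, table, kernel)
via_padding = toy_decode(padded, table, kernel)[: len(real)]

# Per-frame difference between the two paths for the real frames.
diff = np.abs(direct - via_padding)
print(diff)
```

In this toy only the frame adjacent to the padding differs, because the kernel spans three frames. A decoder with attention layers has no such locality, so the padded codes could plausibly affect every real frame in the window.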

xsploit commented May 24, 2026

@codex review

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e32945f50b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +2780 to 2783
window_codes = torch.stack(codes_buffer, dim=0)
wavs, sr = self.speech_tokenizer.decode(
    [{"audio_codes": window_codes.to(self.talker.device)}]
)

P2: Preserve ref-code context in first emitted chunk

This first-emit path decodes codes_buffer directly and bypasses _add_ref_code_context(...). In voice-clone ICL mode (ref_code_context is set) and the common case where the first emit has fewer than decode_window_frames, the first chunk is produced without reference-speaker context while subsequent chunks include it, which can cause an audible speaker/prosody jump at the first chunk boundary and regresses the earlier "stable from start" behavior.
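
One possible way to keep reference context on the first emit is sketched below. The signature and behavior of `_add_ref_code_context(...)` are not shown in this thread, so this does not reproduce it. The helper `decode_first_chunk`, the `decode_fn` callback, and the fixed `samples_per_frame` ratio are hypothetical, and the sketch assumes the decoder emits a constant number of samples per codec frame.

```python
import numpy as np

def decode_first_chunk(codes, ref_codes, decode_fn, samples_per_frame):
    """Decode with reference frames prepended, then drop the reference audio.

    Hypothetical helper: assumes decode_fn maps N frames to N * samples_per_frame samples.
    """
    if ref_codes is None or len(ref_codes) == 0:
        return decode_fn(codes)
    joined = np.concatenate([ref_codes, codes], axis=0)
    wav = decode_fn(joined)
    # Keep only the samples that belong to the newly generated frames.
    return wav[len(ref_codes) * samples_per_frame:]

def toy_decode(codes):
    # Toy decoder for illustration: 4 samples per frame, each equal to the code value.
    return np.repeat(codes.astype(np.float32), 4)

ref = np.array([1, 2])
new = np.array([7, 8, 9])
with_ref = decode_first_chunk(new, ref, toy_decode, samples_per_frame=4)
without_ref = decode_first_chunk(new, None, toy_decode, samples_per_frame=4)
```

Both calls return audio for the three new frames only. With a real decoder the two outputs would differ, since the reference frames condition the decoded audio, which is the continuity the review comment is concerned about.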

Useful? React with 👍 / 👎.

Copilot AI left a comment

Pull request overview

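This PR aims to fix the noise heard at the start of TTS streaming playback. When the streaming decode window is much larger than the number of codec frames accumulated so far, the zero-padding of the fixed window propagates through the decoder's attention and convolutional receptive fields and contaminates the output for the real frames. The first chunk therefore uses the decode() path, which does no zero-padding.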

Changes:

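  • On the first emit of stream_generate_pcm(), when the frame count is smaller than the decode window, decode the real frames directly with speech_tokenizer.decode() to avoid zero-padding artifacts.
  • Keep the fixed-window optimized streaming decode path (decode_streaming(... pad_to_size=decode_window_frames)) for subsequent emits.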
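
The dispatch rule in these two bullets can be isolated into a small pure function, which makes the boundary cases easy to check. This is a sketch only: `choose_decode_path` and its return labels are hypothetical names, and the condition is taken from the diff excerpt in this review.

```python
def choose_decode_path(total_frames_emitted: int,
                       buffered_frames: int,
                       decode_window_frames: int) -> str:
    """Return "direct" for a short first emit, otherwise "windowed"."""
    if total_frames_emitted == 0 and buffered_frames < decode_window_frames:
        # First emit with a partially filled window: decode real frames only.
        return "direct"
    # Later emits, or a first emit that already fills the window.
    return "windowed"

print(choose_decode_path(0, 4, 80))    # short first emit
print(choose_decode_path(0, 80, 80))   # first emit, window already full
print(choose_decode_path(12, 4, 80))   # later emit
```

Note that a first emit whose buffer already fills the window still takes the windowed path, since no padding would be needed in that case.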

Comment on lines +2771 to 2786
if total_frames_emitted == 0 and len(codes_buffer) < decode_window_frames:
    # First emit: decode without zero-padding to avoid decoder artifacts.
    # When the codec buffer is smaller than the decode window, zero-padding
    # (codebook index 0) introduces artifacts because the neural decoder's
    # transformer attention and convolutional receptive fields propagate
    # the zero-code context into the real audio output, causing ~0.1-0.2s of
    # noise at the beginning of streamed audio. Using the regular decode()
    # path produces clean output at the cost of a slightly higher latency for
    # the first chunk only.
    window_codes = torch.stack(codes_buffer, dim=0)
    wavs, sr = self.speech_tokenizer.decode(
        [{"audio_codes": window_codes.to(self.talker.device)}]
    )
    wav = wavs[0].astype(np.float32)
    chunk = wav
else:
Comment on lines +2795 to +2800
if use_optimized_decode and hasattr(self.speech_tokenizer, 'decode_streaming'):
    wavs, sr = self.speech_tokenizer.decode_streaming(
        window.to(self.talker.device),
        use_optimized=True,
        pad_to_size=decode_window_frames,
    )

groxaxo marked this pull request as draft June 4, 2026 20:56