fix(tts): Fix initial noise in streaming playback #27
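On the first emit, the codec buffer holds far fewer frames than the decode window, so the zero-padded codes pass through the decoder's attention layers and contaminate the output of the real frames. This change decodes the actual frames directly with decode(), avoiding zero-padding artifacts.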
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e32945f50b
window_codes = torch.stack(codes_buffer, dim=0)
wavs, sr = self.speech_tokenizer.decode(
    [{"audio_codes": window_codes.to(self.talker.device)}]
)
Preserve ref-code context in first emitted chunk
This first-emit path decodes codes_buffer directly and bypasses _add_ref_code_context(...). In voice-clone ICL mode (ref_code_context is set) and the common case where the first emit has fewer than decode_window_frames, the first chunk is produced without reference-speaker context while subsequent chunks include it, which can cause an audible speaker/prosody jump at the first chunk boundary and regresses the earlier "stable from start" behavior.
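One possible way to keep reference context on the first-emit path is to prepend the reference codes before the unpadded decode and then trim the corresponding samples from the output. The sketch below is only an illustration under stated assumptions: `toy_decode`, `decode_first_chunk`, `SAMPLES_PER_FRAME`, and the array shapes are hypothetical, and the actual behaviour of `_add_ref_code_context(...)` is not shown in this PR.

```python
import numpy as np

SAMPLES_PER_FRAME = 4  # hypothetical: samples produced per codec frame by the toy decoder


def toy_decode(codes):
    """Toy stand-in for speech_tokenizer.decode(); NOT the real decoder.

    codes has shape (frames, codebooks). Each frame is collapsed to one value
    and repeated SAMPLES_PER_FRAME times to mimic upsampling to a waveform.
    """
    per_frame = codes.sum(axis=1).astype(np.float32)
    return np.repeat(per_frame, SAMPLES_PER_FRAME)


def decode_first_chunk(codes_buffer, ref_codes=None):
    """Decode the first chunk without zero-padding.

    If ref_codes is given (assumed shape: (ref_frames, codebooks)), it is
    prepended as context and the samples it produces are trimmed afterwards.
    """
    codes = np.stack(codes_buffer, axis=0)
    n_ref = 0
    if ref_codes is not None:
        codes = np.concatenate([ref_codes, codes], axis=0)
        n_ref = len(ref_codes)
    wav = toy_decode(codes)
    # Drop the audio that corresponds to the reference frames.
    return wav[n_ref * SAMPLES_PER_FRAME:]


buffer = [np.array([1, 2]), np.array([3, 4])]
ref = np.array([[9, 9], [8, 8], [7, 7]])
first_chunk = decode_first_chunk(buffer, ref_codes=ref)
```

In a real decoder the reference frames would influence the generated frames through attention, which is the point of keeping them; the toy decoder here has no cross-frame mixing, so it only demonstrates the prepend-and-trim bookkeeping.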
Pull request overview
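This PR aims to fix the noise heard at the start of TTS streaming playback. When the streaming decode window is much larger than the number of codec frames accumulated so far, the zero-padding used to fill the fixed window propagates through the decoder's attention layers and convolutional receptive fields and contaminates the output of the real frames. The first chunk is therefore decoded through the decode() path, which does no zero-padding.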
Changes:
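- On the first emit in stream_generate_pcm(), when the frame count is below the decode window, decode the real frames directly with speech_tokenizer.decode() to avoid zero-padding artifacts.
- Keep the fixed-window streaming decode optimization path (decode_streaming(... pad_to_size=decode_window_frames)) for subsequent emits.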
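The contamination mechanism can be illustrated with a toy model. The sketch below is not the real codec decoder: `toy_decode`, the 16-entry `table`, the 3-frame `kernel`, and the choice of left-side padding are all assumptions made for illustration. It only shows that once a decoder mixes neighbouring frames, padding with code index 0 changes the output of real frames near the padding boundary.

```python
import numpy as np


def toy_decode(codes, table, kernel):
    """Toy stand-in for a neural codec decoder (hypothetical).

    Each code index is looked up in `table`, then neighbouring frames are
    mixed by a small convolution, mimicking a receptive field.
    """
    emb = table[codes]
    return np.convolve(emb, kernel, mode="same")


table = np.linspace(-1.0, 1.0, 16)     # hypothetical codebook; index 0 maps to -1.0
kernel = np.array([0.25, 0.5, 0.25])   # each output frame sees 3 input frames

real_codes = np.array([5, 9, 3, 12])   # frames actually generated so far
window = 8                             # fixed decode window, larger than the buffer

# Assumption: the window is filled by padding with code index 0 on the left.
padded = np.concatenate([np.zeros(window - len(real_codes), dtype=int), real_codes])

direct = toy_decode(real_codes, table, kernel)
via_padding = toy_decode(padded, table, kernel)[-len(real_codes):]

# True where the padded decode disagrees with decoding the real frames only.
contaminated = ~np.isclose(direct, via_padding)
print(contaminated)
```

In this toy the disagreement is confined to frames within the kernel's reach of the padding. A decoder with attention layers has a much wider effective receptive field, which is consistent with the roughly 0.1-0.2 s of noise described in the code comment.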
if total_frames_emitted == 0 and len(codes_buffer) < decode_window_frames:
    # First emit: decode without zero-padding to avoid decoder artifacts.
    # When the codec buffer is smaller than the decode window, zero-padding
    # (codebook index 0) introduces artifacts because the neural decoder's
    # transformer attention and convolutional receptive fields propagate
    # the zero-code context into the real audio output, causing ~0.1-0.2s of
    # noise at the beginning of streamed audio. Using the regular decode()
    # path produces clean output at the cost of a slightly higher latency for
    # the first chunk only.
    window_codes = torch.stack(codes_buffer, dim=0)
    wavs, sr = self.speech_tokenizer.decode(
        [{"audio_codes": window_codes.to(self.talker.device)}]
    )
    wav = wavs[0].astype(np.float32)
    chunk = wav
else:
    if use_optimized_decode and hasattr(self.speech_tokenizer, 'decode_streaming'):
        wavs, sr = self.speech_tokenizer.decode_streaming(
            window.to(self.talker.device),
            use_optimized=True,
            pad_to_size=decode_window_frames,
        )