Feat/realtime voice vad p2#20
Two fixes for false VAD triggers from ambient noise:
1. SPEECH_THRESHOLD: 0.02 → 0.03, MIN_SPEECH_FRAMES: 5 → 8 (~1s). A higher bar for what counts as "speech", to filter out ambient noise.
2. Empty pipeline loop protection: when handleProcessComplete fires with no TTS played (likely a noise-triggered empty pipeline), delay resumeRecording by 2s instead of 300ms. This prevents the rapid loop: noise → empty STT → complete → resume → noise → repeat.
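The two changes in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: only the constant values and the 300ms/2s delays come from the commit message. The RMS helper, the rule that loud frames must be consecutive, and the names `SpeechStartDetector` and `resumeDelayMs` are assumptions made for the example.

```typescript
// Values taken from the commit message; everything else is hypothetical.
const SPEECH_THRESHOLD = 0.03; // RMS level above which a frame counts as speech (was 0.02)
const MIN_SPEECH_FRAMES = 8;   // loud frames needed before speech is considered started (was 5)

// Root-mean-square amplitude of one audio frame (samples assumed to lie in [-1, 1]).
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  const sumOfSquares = frame.reduce((acc, s) => acc + s * s, 0);
  return Math.sqrt(sumOfSquares / frame.length);
}

// Hypothetical detector: sets hasSpeechStarted once enough loud frames arrive.
class SpeechStartDetector {
  private speechFrameCount = 0;
  hasSpeechStarted = false;

  push(frame: Float32Array): boolean {
    if (rms(frame) > SPEECH_THRESHOLD) {
      this.speechFrameCount += 1;
    } else if (!this.hasSpeechStarted) {
      // Assumption: loud frames must be consecutive to count as speech onset.
      this.speechFrameCount = 0;
    }
    if (this.speechFrameCount >= MIN_SPEECH_FRAMES) {
      this.hasSpeechStarted = true;
    }
    return this.hasSpeechStarted;
  }
}

// Delay before resuming recording after a pipeline completes:
// 300ms after real TTS playback, 2s when no TTS was played (likely a noise-triggered empty pipeline).
function resumeDelayMs(ttsPlayed: boolean): number {
  return ttsPlayed ? 300 : 2000;
}
```

With these values, a steady noise floor at RMS 0.025 would have passed the old 0.02 threshold but no longer counts as speech, and the longer delay slows down any empty-pipeline loop that still occurs.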
Root cause: ambient noise exceeded SPEECH_THRESHOLD, causing a false hasSpeechStarted=true. When the noise subsided, VAD silence triggered even though the user never actually spoke. Raising the thresholds helped but didn't eliminate the issue in noisy environments.

Fix: add an sttConfirmedSpeech flag. VAD silence detection now requires BOTH local RMS silence AND at least one non-empty STT intermediate result from the server. This ensures the user actually spoke before the system considers ending the recording.

Flow: recording starts → local RMS detects audio → sends to server → server STT returns partial text → confirmSpeechFromSTT() → NOW VAD silence detection is armed → user stops speaking → 1.3s silence → auto audio_end

Without STT confirmation, ambient noise can trigger local speech detection but VAD will never fire audio_end.
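The gating described in this commit could look roughly like the sketch below. The field names `silenceFrameCount`, `monitoringOnly`, and `sttConfirmedSpeech`, the constant name `SILENCE_FRAMES_REQUIRED`, and `confirmSpeechFromSTT()` appear in the PR. The class `VadSilenceGate`, the `onFrame` method, and the value 10 are hypothetical and chosen only for illustration.

```typescript
// Hypothetical value; the PR only states that about 1.3s of silence is required.
const SILENCE_FRAMES_REQUIRED = 10;

// Minimal sketch of the silence gate: local silence alone is not enough,
// the server must also have returned non-empty STT text in this round.
class VadSilenceGate {
  private silenceFrameCount = 0;
  private sttConfirmedSpeech = false;
  monitoringOnly = false;

  // Called when the server returns a non-empty STT result.
  confirmSpeechFromSTT(): void {
    this.sttConfirmedSpeech = true;
  }

  // Feed one frame's local silence decision; returns true when audio_end should be sent.
  onFrame(isSilent: boolean): boolean {
    this.silenceFrameCount = isSilent ? this.silenceFrameCount + 1 : 0;
    const shouldFire =
      this.silenceFrameCount >= SILENCE_FRAMES_REQUIRED &&
      !this.monitoringOnly &&
      this.sttConfirmedSpeech;
    if (shouldFire) {
      this.silenceFrameCount = 0; // start counting afresh for the next round
    }
    return shouldFire;
  }
}
```

In this sketch, any amount of silence before `confirmSpeechFromSTT()` is called leaves the gate closed, which matches the stated intent that ambient noise alone can never end a recording.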
After TTS playback, resumeRecording() re-enabled audio frame sending on the frontend but never sent audio_start to the server. The server's audioSink was null (cleared by doFinally after the previous pipeline), so all audio frames were silently dropped in handleBinaryMessage.

Fix: send wsClient.startAudioRecording() (audio_start) alongside resumeRecording() in both the playback state listener (normal TTS end) and handleProcessComplete (empty pipeline case). This creates a new server-side audioSink + pipeline for the next conversation round.
…tion
Root cause: startAudioCall() calls clearQueue(), which triggers notifyPlaybackState(false). The playback listener sees that the call is active and schedules resumeRecording() + startAudioRecording() after 300ms. But startRecording() has already sent audio_start. The server rejects the second audio_start with "已有进行中的音频会话" ("an audio session is already in progress"), handleError stops recording, and the session enters a broken state where subsequent pipelines receive no audio (Xunfei timeout).

Fix: resumeRecording() now returns a boolean (true only when actually resuming from monitoring mode). Callers send audio_start only when resume returns true. This prevents:
- Double audio_start on initial connection (monitoringOnly=false → no-op)
- Stale audio_start after error recovery (recordingState=idle → no-op)

Also adds a fallback in handleProcessComplete: if the recording hardware was stopped by an error, it falls back to a full startRecording() instead of resumeRecording().
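A minimal sketch of the boolean-returning resume and its caller, under stated assumptions: `resumeRecording()`, `startAudioRecording()`, `monitoringOnly`, and `recordingState` are names from the PR, while `AudioManagerSketch`, `AudioStartSender`, and `resumeRound` are hypothetical stand-ins, and the real AudioManager certainly does more than this.

```typescript
// Hypothetical stand-in for the WebSocket client; only the method named in the PR is modeled.
interface AudioStartSender {
  startAudioRecording(): void; // sends audio_start to the server
}

class AudioManagerSketch {
  private recordingState: "idle" | "recording";
  private monitoringOnly: boolean;

  constructor(recordingState: "idle" | "recording", monitoringOnly: boolean) {
    this.recordingState = recordingState;
    this.monitoringOnly = monitoringOnly;
  }

  // Returns true only when recording actually switches from monitoring mode back to sending.
  resumeRecording(): boolean {
    if (this.recordingState !== "recording" || !this.monitoringOnly) {
      return false; // nothing to resume: either idle, or already sending frames
    }
    this.monitoringOnly = false;
    return true;
  }
}

// Caller side: audio_start is sent only if a resume really happened.
function resumeRound(audio: AudioManagerSketch, ws: AudioStartSender): boolean {
  const resumed = audio.resumeRecording();
  if (resumed) {
    ws.startAudioRecording();
  }
  return resumed;
}
```

Tying audio_start to the return value keeps the decision in one place, so neither the playback listener nor handleProcessComplete needs its own copy of the state checks.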
…trigger
After the user speaks and VAD pauses recording, sttConfirmedSpeech stayed true from the previous STT result. When TTS ended and recording resumed, the stale sttConfirmedSpeech allowed VAD silence to trigger immediately (ambient noise met all other conditions). This sent an empty audio_end, creating an empty pipeline that took 15s to time out on Xunfei, which appeared as "mic stuck".

Fix: reset sttConfirmedSpeech=false in pauseRecording(). Each new recording round must get fresh STT confirmation before VAD can fire.
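The effect of the reset can be illustrated with a small state sketch. `pauseRecording()`, `resumeRecording()`, `confirmSpeechFromSTT()`, and the two flags are names from the PR. `RecordingRoundState` and `canFireSilence` are hypothetical, and the silence frame count is reduced to a boolean to keep the example short.

```typescript
// Hypothetical per-round state; models only the two flags discussed in this commit.
class RecordingRoundState {
  monitoringOnly = false;
  sttConfirmedSpeech = false;

  confirmSpeechFromSTT(): void {
    this.sttConfirmedSpeech = true;
  }

  // Pausing ends the current round, so the STT confirmation is discarded with it.
  pauseRecording(): void {
    this.monitoringOnly = true;
    this.sttConfirmedSpeech = false;
  }

  resumeRecording(): boolean {
    if (!this.monitoringOnly) return false;
    this.monitoringOnly = false;
    return true;
  }

  // True when VAD silence may send audio_end, given that enough silent frames were seen.
  canFireSilence(silenceFramesReached: boolean): boolean {
    return silenceFramesReached && !this.monitoringOnly && this.sttConfirmedSpeech;
  }
}
```

Without the reset in `pauseRecording()`, the flag from the previous round would still be true after `resumeRecording()`, and the first stretch of silence would end the new round before the user had spoken.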
Pull request overview
This PR tunes the client-side realtime voice VAD behavior in VocaTaAIChat/AudioManager to reduce environment-noise false triggers by tightening VAD thresholds and gating “silence stop” on an STT-confirmed speech signal, while also preventing redundant audio_start messages when resuming.
Changes:
- Adjust VAD thresholds and minimum speech frame requirements to be more noise-resistant.
- Add an `sttConfirmedSpeech` gate so VAD silence only triggers after STT has produced valid text.
- Change `AudioManager.resumeRecording()` to return a boolean and only send `audio_start` when a resume actually occurred; add a longer delay on `complete` recovery.
// VAD silence trigger condition: sending mode + STT has confirmed speech
if (this.silenceFrameCount >= SILENCE_FRAMES_REQUIRED
    && !this.monitoringOnly
    && this.sttConfirmedSpeech) {
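VAD silence triggering now depends on sttConfirmedSpeech. If STT produces no text during this round (for example, recognition fails, the result is empty, or results only arrive after audio_end), the silence callback will never fire here, so in continuous mode the client keeps sending audio and never sends audio_end. Consider adding a fallback: for example, also allow silence detection to arm and trigger a stop once sustained speech has been detected for a certain duration, or unlock silence detection on receiving any STT result (including final) or a server-side status event.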
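One possible reading of the duration-based fallback suggested in this comment is sketched below. It is not part of the PR. The function `isSilenceDetectionArmed`, the constant `FALLBACK_SPEECH_FRAMES`, and its value of 24 are all hypothetical, and a real fallback would need tuning so that sustained noise does not reintroduce the false triggers the PR is fixing.

```typescript
// Hypothetical: number of locally detected speech frames after which silence
// detection is armed even though STT has not returned any text.
const FALLBACK_SPEECH_FRAMES = 24;

// Silence detection is armed by STT confirmation, or, as a fallback,
// by a sufficiently long run of locally detected speech.
function isSilenceDetectionArmed(
  sttConfirmedSpeech: boolean,
  speechFrameCount: number,
): boolean {
  return sttConfirmedSpeech || speechFrameCount >= FALLBACK_SPEECH_FRAMES;
}
```

The trade-off is that the fallback threshold should sit well above MIN_SPEECH_FRAMES, so that a short burst of noise can still only arm the gate through STT confirmation.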
// STT returned valid text → confirms the user is actually speaking; enable VAD silence detection
if (message.text && message.text.trim().length > 0 && !message.isFinal) {
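confirmSpeechFromSTT() is only called when !message.isFinal, yet silence detection depends on that confirmation flag. If the server sends only final results (or the first non-empty result is already final), VAD silence detection can never be enabled for the entire round. Consider confirming on any non-empty text (including final), or at least setting the confirmation flag when a final result is non-empty.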
- // STT returned valid text → confirms the user is actually speaking; enable VAD silence detection
- if (message.text && message.text.trim().length > 0 && !message.isFinal) {
+ // STT returned any valid text (including final) → confirms the user is actually speaking; enable VAD silence detection
+ if (message.text && message.text.trim().length > 0) {