Feat/realtime voice vad p2 #20

Merged: ailuckly merged 5 commits into develop from feat/realtime-voice-vad-p2 on Apr 19, 2026

Conversation

@ailuckly
Owner

No description provided.

Two fixes for false VAD triggers from ambient noise:

1. SPEECH_THRESHOLD: 0.02 → 0.03, MIN_SPEECH_FRAMES: 5 → 8 (~1s)
   Higher bar for what counts as "speech" to filter ambient noise.

2. Empty pipeline loop protection: when handleProcessComplete fires
   with no TTS played (likely a noise-triggered empty pipeline),
   delay resumeRecording by 2s instead of 300ms. Prevents the rapid
   loop: noise → empty STT → complete → resume → noise → repeat.
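A minimal sketch of the two fixes above, assuming frames are Float32Array samples in [-1, 1]. Only the constant values (0.03, 8, 300ms, 2s) come from the commit message. The helper names `rms`, `SpeechStartDetector` and `resumeDelayMs` are hypothetical, and requiring the loud frames to be consecutive is an assumption, not something the commit states.

```typescript
// Values quoted from the commit message; everything else is illustrative.
const SPEECH_THRESHOLD = 0.03; // RMS level above which a frame counts as speech
const MIN_SPEECH_FRAMES = 8; // loud frames required before speech "starts"
const NORMAL_RESUME_DELAY_MS = 300;
const EMPTY_PIPELINE_RESUME_DELAY_MS = 2000;

/** Root-mean-square amplitude of one frame of samples. */
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  const sumSquares = frame.reduce((acc, s) => acc + s * s, 0);
  return Math.sqrt(sumSquares / frame.length);
}

/** Hypothetical detector: speech starts after MIN_SPEECH_FRAMES consecutive loud frames. */
class SpeechStartDetector {
  private loudFrames = 0;
  private started = false;

  /** Feeds one frame and returns whether speech has started. */
  feed(frame: Float32Array): boolean {
    if (rms(frame) >= SPEECH_THRESHOLD) {
      this.loudFrames++;
      if (this.loudFrames >= MIN_SPEECH_FRAMES) this.started = true;
    } else if (!this.started) {
      // Assumption: a quiet frame before speech has started resets the run.
      this.loudFrames = 0;
    }
    return this.started;
  }
}

/** Picks the resume delay: longer when the pipeline finished without any TTS playback. */
function resumeDelayMs(ttsWasPlayed: boolean): number {
  return ttsWasPlayed ? NORMAL_RESUME_DELAY_MS : EMPTY_PIPELINE_RESUME_DELAY_MS;
}
```

The longer delay does not stop noise from triggering a pipeline; it only slows the loop down so an empty round is not immediately followed by another.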

Root cause: ambient noise exceeded SPEECH_THRESHOLD, causing false
hasSpeechStarted=true. When noise subsided, VAD silence triggered
even though the user never actually spoke. Raising thresholds helped
but didn't eliminate the issue for noisy environments.

Fix: add sttConfirmedSpeech flag. VAD silence detection now requires
BOTH local RMS silence AND at least one non-empty STT intermediate
result from the server. This ensures the user actually spoke before
the system considers ending the recording.

Flow:
  recording starts → local RMS detects audio → sends to server
  → server STT returns partial text → confirmSpeechFromSTT()
  → NOW VAD silence detection is armed
  → user stops speaking → 1.3s silence → auto audio_end

Without STT confirmation, ambient noise can trigger local speech
detection but VAD will never fire audio_end.
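The flow above can be sketched as a small gate. The three-part condition mirrors the one quoted later in the review (`silenceFrameCount`, `monitoringOnly`, `sttConfirmedSpeech`); the class name `SttGatedVad`, the `onFrame` method and the constructor parameters are hypothetical, and the number of frames that corresponds to 1.3s is left as a parameter because the source does not state it.

```typescript
/** Hypothetical sketch: VAD silence only fires after STT has confirmed speech. */
class SttGatedVad {
  monitoringOnly = false;
  private silenceFrameCount = 0;
  private sttConfirmedSpeech = false;
  private readonly silenceFramesRequired: number;
  private readonly speechThreshold: number;

  constructor(silenceFramesRequired: number, speechThreshold: number = 0.03) {
    this.silenceFramesRequired = silenceFramesRequired;
    this.speechThreshold = speechThreshold;
  }

  /** Called for each intermediate STT result; whitespace-only text does not count. */
  confirmSpeechFromSTT(text: string): void {
    if (text.trim().length > 0) this.sttConfirmedSpeech = true;
  }

  /** Feeds one frame's RMS; returns true when audio_end should be sent. */
  onFrame(frameRms: number): boolean {
    this.silenceFrameCount =
      frameRms < this.speechThreshold ? this.silenceFrameCount + 1 : 0;
    return (
      this.silenceFrameCount >= this.silenceFramesRequired &&
      !this.monitoringOnly &&
      this.sttConfirmedSpeech
    );
  }
}
```

With this gate, a burst of noise followed by quiet accumulates silence frames but never produces audio_end, because no STT text arrived to arm it.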

After TTS playback, resumeRecording() re-enabled audio frame sending
on the frontend but never sent audio_start to the server. The server's
audioSink was null (cleared by doFinally after the previous pipeline),
so all audio frames were silently dropped in handleBinaryMessage.

Fix: send wsClient.startAudioRecording() (audio_start) alongside
resumeRecording() in both the playback state listener (normal TTS end)
and handleProcessComplete (empty pipeline case). This creates a new
server-side audioSink + pipeline for the next conversation round.
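To make the failure mode concrete, here is a toy model of the server behaviour the commit describes, written in TypeScript purely for illustration (the real server is not shown in this PR and is likely not TypeScript). `ServerSessionModel` and its method names other than `handleBinaryMessage` are hypothetical. Rejecting a second audio_start while a sink is active is taken from a later commit message in this PR.

```typescript
/** Toy model only: an audio sink that exists between audio_start and pipeline end. */
class ServerSessionModel {
  droppedFrames = 0;
  private audioSink: Uint8Array[] | null = null;

  /** audio_start: creates a sink, or rejects if a session is already in progress. */
  handleAudioStart(): boolean {
    if (this.audioSink !== null) return false;
    this.audioSink = [];
    return true;
  }

  /** Binary audio frame: silently dropped when no sink exists. */
  handleBinaryMessage(frame: Uint8Array): void {
    if (this.audioSink === null) {
      this.droppedFrames++;
      return;
    }
    this.audioSink.push(frame);
  }

  /** Pipeline finished: returns how many frames it received and clears the sink. */
  finishPipeline(): number {
    const received = this.audioSink?.length ?? 0;
    this.audioSink = null;
    return received;
  }
}
```

In this model, frames sent after `finishPipeline()` are counted as dropped until `handleAudioStart()` is called again, which is the gap the fix closes on the client side.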
…tion

Root cause: startAudioCall() calls clearQueue() which triggers
notifyPlaybackState(false). The playback listener sees the call
is active and schedules resumeRecording() + startAudioRecording()
after 300ms. But startRecording() already sent audio_start. The
second audio_start is rejected by the server with "已有进行中的音频会话"
("an audio session is already in progress"),
handleError stops recording, and the session enters a broken state
where subsequent pipelines receive no audio (Xunfei timeout).

Fix: resumeRecording() now returns boolean (true only when actually
resuming from monitoring mode). Callers only send audio_start when
resume returns true. This prevents:
- Double audio_start on initial connection (monitoringOnly=false → no-op)
- Stale audio_start after error recovery (recordingState=idle → no-op)

Also adds fallback in handleProcessComplete: if recording hardware
was stopped by an error, falls back to full startRecording() instead
of resumeRecording().
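A minimal sketch of the boolean-returning resume, assuming the recorder tracks a `recordingState` and a `monitoringOnly` flag as the commit message implies. `RecorderSketch`, `WsClientLike` and `resumeAndSendStart` are hypothetical names; only `resumeRecording()` and `startAudioRecording()` appear in the source. The fallback to `startRecording()` is not modelled here.

```typescript
/** Hypothetical stand-in for the WebSocket client; only the one call is modelled. */
interface WsClientLike {
  startAudioRecording(): void; // sends audio_start
}

class RecorderSketch {
  recordingState: "idle" | "recording" = "idle";
  monitoringOnly = false;

  startRecording(): void {
    this.recordingState = "recording";
    this.monitoringOnly = false;
  }

  /** Keeps the hardware running but stops sending frames (monitoring mode). */
  pauseRecording(): void {
    if (this.recordingState === "recording") this.monitoringOnly = true;
  }

  stopRecording(): void {
    this.recordingState = "idle";
    this.monitoringOnly = false;
  }

  /** Returns true only when actually leaving monitoring mode. */
  resumeRecording(): boolean {
    if (this.recordingState !== "recording" || !this.monitoringOnly) return false;
    this.monitoringOnly = false;
    return true;
  }
}

/** Sends audio_start only when the resume really happened. */
function resumeAndSendStart(recorder: RecorderSketch, wsClient: WsClientLike): boolean {
  const resumed = recorder.resumeRecording();
  if (resumed) wsClient.startAudioRecording();
  return resumed;
}
```

Tying audio_start to the return value keeps the decision in one place: the recorder knows whether a resume occurred, and callers no longer have to guess from their own state.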
…trigger

After user speaks and VAD pauses recording, sttConfirmedSpeech stayed
true from the previous STT result. When TTS ended and recording
resumed, the stale sttConfirmedSpeech allowed VAD silence to trigger
immediately (ambient noise met all other conditions). This sent an
empty audio_end, creating an empty pipeline that took 15s to time out
on Xunfei, appearing as "mic stuck".

Fix: reset sttConfirmedSpeech=false in pauseRecording(). Each new
recording round must get fresh STT confirmation before VAD can fire.
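The reset can be illustrated with a small state holder. This is a sketch under the assumption that the flag lives next to `monitoringOnly`; `RoundStateSketch` and `canFireOnSilence` are hypothetical names, and the silence-frame count is left out to keep the focus on the stale flag.

```typescript
/** Hypothetical per-round state: STT confirmation must not survive a pause. */
class RoundStateSketch {
  private sttConfirmedSpeech = false;
  private monitoringOnly = false;

  confirmSpeechFromSTT(): void {
    this.sttConfirmedSpeech = true;
  }

  pauseRecording(): void {
    this.monitoringOnly = true;
    // The fix: drop the previous round's confirmation when pausing.
    this.sttConfirmedSpeech = false;
  }

  resumeRecording(): void {
    this.monitoringOnly = false;
  }

  /** Whether silence is allowed to end the round (silence count omitted here). */
  canFireOnSilence(): boolean {
    return !this.monitoringOnly && this.sttConfirmedSpeech;
  }
}
```

Without the reset line, the state after pause and resume would still report a confirmed round, which matches the immediate empty audio_end described above.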
Copilot AI review requested due to automatic review settings April 19, 2026 07:03
@ailuckly ailuckly merged commit f0e48e9 into develop Apr 19, 2026
6 checks passed

Copilot AI left a comment


Pull request overview

This PR tunes the client-side realtime voice VAD behavior in VocaTaAIChat/AudioManager to reduce environment-noise false triggers by tightening VAD thresholds and gating “silence stop” on an STT-confirmed speech signal, while also preventing redundant audio_start messages when resuming.

Changes:

  • Adjust VAD thresholds and minimum speech frame requirements to be more noise-resistant.
  • Add an sttConfirmedSpeech gate so VAD silence only triggers after STT has produced valid text.
  • Change AudioManager.resumeRecording() to return a boolean and only send audio_start when a resume actually occurred; add a longer delay on complete recovery.


Comment on lines +545 to +548
// VAD silence trigger conditions: sending mode + speech confirmed by STT
if (this.silenceFrameCount >= SILENCE_FRAMES_REQUIRED
&& !this.monitoringOnly
&& this.sttConfirmedSpeech) {

Copilot AI Apr 19, 2026


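VAD silence triggering now depends on sttConfirmedSpeech. If STT produces no text at all in this round (for example, recognition fails, the result is empty, or text only arrives after audio_end), the silence callback will never fire here, so in continuous mode the client keeps sending audio and never sends audio_end. Consider adding a fallback: for example, also allow silence detection and stop triggering once sustained speech has lasted a certain duration, or unlock silence detection on receiving any STT result (including final) or a server-side status event.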
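One way the fallback suggested in this comment could look, as a hedged sketch rather than anything merged in this PR: arm silence detection either on STT confirmation or after a sustained run of locally detected speech frames. The function names and the `fallbackSpeechFrames` parameter are hypothetical, and the reviewer does not propose a specific value for it.

```typescript
/** Hypothetical: silence detection is armed by STT text OR by sustained local speech. */
function isSilenceDetectionArmed(
  sttConfirmedSpeech: boolean,
  speechFrameCount: number,
  fallbackSpeechFrames: number,
): boolean {
  return sttConfirmedSpeech || speechFrameCount >= fallbackSpeechFrames;
}

/** Combines the arming rule with the existing silence and monitoring conditions. */
function shouldSendAudioEnd(
  silenceFrameCount: number,
  silenceFramesRequired: number,
  monitoringOnly: boolean,
  armed: boolean,
): boolean {
  return silenceFrameCount >= silenceFramesRequired && !monitoringOnly && armed;
}
```

The trade-off is that a large enough `fallbackSpeechFrames` still filters short noise bursts, while long sustained noise would again be able to end a round, so the value would need tuning against the false triggers this PR set out to fix.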

Comment on lines +1043 to +1044
// STT returned valid text → confirms the user is actually speaking; enable VAD silence detection
if (message.text && message.text.trim().length > 0 && !message.isFinal) {

Copilot AI Apr 19, 2026


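confirmSpeechFromSTT() is only triggered when !message.isFinal, yet silence detection depends on that confirmation flag. If the server only sends final results (or the first non-empty result is already final), VAD silence detection can never be enabled for the entire round. Consider confirming on any non-empty text (including final), or at least also setting the confirmation flag when a final result is non-empty.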

Suggested change
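- // STT returned valid text → confirms the user is actually speaking; enable VAD silence detection
- if (message.text && message.text.trim().length > 0 && !message.isFinal) {
+ // STT returned any valid text (including final) → confirms the user is actually speaking; enable VAD silence detection
+ if (message.text && message.text.trim().length > 0) {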

@ailuckly ailuckly deleted the feat/realtime-voice-vad-p2 branch May 7, 2026 12:03