voice_loop: Phase 1 barge-in (duck mic + cancel/truncate)#6

Open
zdql wants to merge 1 commit into main from feature/realtime-barge-in

@zdql zdql commented May 27, 2026

Summary

The realtime voice loop currently drops every microphone chunk whenever the assistant is speaking (response_active || playback_depth > 50ms, src/voice_loop/mod.rs). That fully disables barge-in: OpenAI's semantic_vad never sees user audio mid-response, so it can never trigger an interruption — even though the session is configured to use it.

This PR is Phase 1 of fixing that. It moves from "fully gate the mic" to "duck the mic + handle server barge-in events".

What changed

  • Forward the mic continuously. While the assistant is speaking, attenuate samples by mic_ducking_gain (default 0.25) instead of dropping them. Speaker leakage stays below the server VAD threshold; deliberate user speech rises above it.
  • Track the in-flight assistant item. Capture item.id from response.output_item.added, and accumulate enqueued audio duration (ms at 24 kHz) from response.output_audio.delta.
  • Handle input_audio_buffer.speech_started. While a response is active, emit response.cancel + conversation.item.truncate (with audio_end_ms = enqueued_ms - playback_buffer_depth_ms, so the assistant transcript matches what the user actually heard), and clear the local playback buffer so barge-in is instant.
  • Settings plumbing. New mic_ducking_gain: <f32> in settings.json, clamped to [0.0, 1.0]. Setting it to 0.0 restores the legacy fully-gated behavior; setting it to 1.0 disables ducking entirely.
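
The ducking and clamping described in the list above can be sketched roughly as follows. This is an illustrative sketch, not the code in src/voice_loop/mod.rs: the `f32` sample type, the function signatures, the `assistant_speaking` flag, and the NaN fallback are all assumptions made for the example.

```rust
/// Default gain applied to mic samples while the assistant is speaking
/// (the PR's stated default).
const DEFAULT_MIC_DUCKING_GAIN: f32 = 0.25;

/// Clamp a user-supplied gain to [0.0, 1.0].
/// Assumption: a NaN value falls back to the default rather than propagating.
fn clamp_ducking_gain(raw: f32) -> f32 {
    if raw.is_nan() {
        DEFAULT_MIC_DUCKING_GAIN
    } else {
        raw.clamp(0.0, 1.0)
    }
}

/// Attenuate a mic chunk in place while the assistant is speaking.
/// With gain 0.0 the chunk becomes silence (legacy gating);
/// with gain 1.0 the chunk is forwarded unchanged.
fn duck_samples(samples: &mut [f32], gain: f32, assistant_speaking: bool) {
    if !assistant_speaking {
        return; // mic is forwarded at full level between responses
    }
    for sample in samples.iter_mut() {
        *sample *= gain;
    }
}
```

Under these assumptions the chunk is always forwarded, so the server VAD keeps receiving audio; only its level changes while a response is playing.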

response.done remains the source of truth for "the response is over", so cancelled responses unwind through the existing path with no additional state machine.
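
For reference, the two client events sent on speech_started might be built along these lines. This is a hedged sketch: the `InFlight` struct and `barge_in_events` function are hypothetical names, `content_index` is assumed to be 0, and the JSON is assembled with `format!` only to keep the example dependency-free (it assumes item ids need no escaping; a real implementation would more likely use a JSON serializer).

```rust
/// Hypothetical record of the assistant item currently being played.
struct InFlight {
    item_id: String,  // captured from response.output_item.added
    enqueued_ms: u64, // audio enqueued so far, in ms
}

/// Build the `response.cancel` and `conversation.item.truncate` payloads.
/// `playback_depth_ms` is the audio still sitting unplayed in the local buffer.
fn barge_in_events(in_flight: &InFlight, playback_depth_ms: u64) -> [String; 2] {
    // Saturate at zero in case the buffer depth briefly exceeds the counter.
    let audio_end_ms = in_flight.enqueued_ms.saturating_sub(playback_depth_ms);
    let cancel = String::from(r#"{"type":"response.cancel"}"#);
    let truncate = format!(
        r#"{{"type":"conversation.item.truncate","item_id":"{}","content_index":0,"audio_end_ms":{}}}"#,
        in_flight.item_id, audio_end_ms
    );
    [cancel, truncate]
}
```

After sending both events, the loop would clear its local playback buffer and then wait for response.done as usual.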

Risks / things to verify on real hardware

  • Ducking constant is empirical. 0.25 is a reasonable default for a MacBook with built-in speakers + mic, but it's a guess. Quieter speakers or headphones might benefit from a higher value (0.5+); very loud speakers or a sensitive mic may need 0.15. Tunable per-user via settings.
  • No echo cancellation. Without AEC, the model's own speech can occasionally cross the VAD threshold and trigger a self-interrupt. semantic_vad is text-aware and tends to ignore that, but it isn't perfect. If false barge-ins are common in practice, Phase 2 (software AEC via webrtc-audio-processing) is the follow-up.
  • Transcript fidelity on barge-in. audio_end_ms is computed as enqueued_assistant_ms − playback_depth_ms, both measured in milliseconds of audio at 24 kHz. That should match what the user actually heard to within one playback callback (~10 ms).
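
One way to keep the audio_end_ms arithmetic free of rounding drift is to count samples and convert to milliseconds only at the end. The sketch below assumes 16-bit mono PCM at 24 kHz and that each response.output_audio.delta has already been base64-decoded; `EnqueuedAudio` and `heard_ms` are hypothetical names, not identifiers from this PR.

```rust
const SAMPLE_RATE_HZ: u64 = 24_000;
const BYTES_PER_SAMPLE: u64 = 2; // assumption: 16-bit mono PCM

/// Running total of assistant audio handed to the playback buffer.
#[derive(Default)]
struct EnqueuedAudio {
    samples: u64,
}

impl EnqueuedAudio {
    /// Record one decoded audio delta, given its length in bytes.
    fn push_decoded(&mut self, decoded_bytes: usize) {
        self.samples += decoded_bytes as u64 / BYTES_PER_SAMPLE;
    }

    /// Total enqueued duration in milliseconds.
    fn ms(&self) -> u64 {
        self.samples * 1000 / SAMPLE_RATE_HZ
    }
}

/// Milliseconds of audio the user has actually heard: enqueued minus
/// what is still waiting in the playback buffer, never below zero.
fn heard_ms(enqueued: &EnqueuedAudio, playback_depth_samples: u64) -> u64 {
    enqueued.samples.saturating_sub(playback_depth_samples) * 1000 / SAMPLE_RATE_HZ
}
```

For example, two 4800-byte deltas amount to 4800 samples (200 ms); with 1200 samples still buffered, the truncate point would be 150 ms.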

Rollback

Setting mic_ducking_gain: 0.0 in ~/.config/gamechat/settings.json reproduces the previous "drop mic while speaking" behavior exactly, so users who don't like the new feel have a one-line escape hatch.

Test plan

  • cargo test — 53 tests pass (45 existing + 8 new for duck_samples, clear_playback, and mic_ducking_gain parsing/clamping).
  • cargo check — clean.
  • Manual: gamechat --realtime on a real laptop. Start a long assistant response; interrupt mid-sentence. Confirm the model stops within ~200 ms and the next user turn is picked up cleanly.
  • Manual: confirm the assistant doesn't interrupt itself in a quiet room with default ducking.
  • Manual: try mic_ducking_gain: 0.0 and confirm legacy behavior (no barge-in possible) is preserved.
  • Manual: try mic_ducking_gain: 0.5 in a noisy room and see whether echo-induced false barge-ins increase.

🤖 Generated with Claude Code

Previously the realtime loop dropped every microphone chunk whenever the
assistant was speaking (response_active || playback_depth > 50ms). That
fully disabled barge-in: OpenAI's semantic_vad never saw user audio
during a response, so it could never interrupt one.

Phase 1 fix:

- Forward mic continuously. While the assistant is speaking, attenuate
  the mic by `mic_ducking_gain` (default 0.25) so speaker leakage stays
  below the server VAD threshold but a deliberate user utterance still
  rises above it. Setting `mic_ducking_gain: 0.0` in settings.json
  restores the legacy fully-gated behavior.
- Track the in-flight assistant item id (from response.output_item.added)
  and enqueued audio duration (from response.output_audio.delta).
- On input_audio_buffer.speech_started while a response is active:
  emit response.cancel, conversation.item.truncate (with audio_end_ms =
  enqueued_ms - playback_buffer_depth_ms), and clear the local playback
  buffer so the assistant's audio stops immediately.

Server `response.done` remains the source of truth for "response over",
so cancelled responses unwind through the existing path with no extra
state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>