voice_loop: Phase 1 barge-in (duck mic + cancel/truncate) #6
Open
zdql wants to merge 1 commit into
Previously the realtime loop dropped every microphone chunk whenever the assistant was speaking (response_active || playback_depth > 50ms). That fully disabled barge-in: OpenAI's semantic_vad never saw user audio during a response, so it could never interrupt one.

Phase 1 fix:

- Forward mic continuously. While the assistant is speaking, attenuate the mic by `mic_ducking_gain` (default 0.25) so speaker leakage stays below the server VAD threshold but a deliberate user utterance still rises above it. Setting `mic_ducking_gain: 0.0` in settings.json restores the legacy fully-gated behavior.
- Track the in-flight assistant item id (from response.output_item.added) and enqueued audio duration (from response.output_audio.delta).
- On input_audio_buffer.speech_started while a response is active: emit response.cancel, conversation.item.truncate (with audio_end_ms = enqueued_ms - playback_buffer_depth_ms), and clear the local playback buffer so the user hears their barge-in immediately.

Server `response.done` remains the source of truth for "response over", so cancelled responses unwind through the existing path with no extra state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
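For reviewers, the truncation arithmetic in the commit message can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `AssistantAudio` struct, its method names, and the PCM16 mono assumption are hypothetical, and the JSON is built by hand on the assumption that item ids need no escaping. The two event shapes (`response.cancel`, and `conversation.item.truncate` with `item_id`, `content_index`, `audio_end_ms`) follow the Realtime API's client events.

```rust
/// Hypothetical bookkeeping for the in-flight assistant response.
#[derive(Debug, Default)]
struct AssistantAudio {
    /// Item id taken from `response.output_item.added`.
    item_id: Option<String>,
    /// Assistant audio enqueued for playback so far, in samples.
    enqueued_samples: u64,
}

/// Assumption: 24 kHz mono PCM16 output audio (2 bytes per sample).
const SAMPLE_RATE_HZ: u64 = 24_000;

impl AssistantAudio {
    /// Record one decoded `response.output_audio.delta` chunk.
    fn on_audio_delta(&mut self, pcm16_bytes: usize) {
        self.enqueued_samples += (pcm16_bytes / 2) as u64;
    }

    /// Total enqueued assistant audio in milliseconds.
    fn enqueued_ms(&self) -> u64 {
        self.enqueued_samples * 1000 / SAMPLE_RATE_HZ
    }

    /// Milliseconds of assistant audio the user has actually heard:
    /// everything enqueued minus what is still waiting in the playback buffer.
    /// Saturates at zero so a stale depth reading cannot underflow.
    fn audio_end_ms(&self, playback_buffer_depth_ms: u64) -> u64 {
        self.enqueued_ms().saturating_sub(playback_buffer_depth_ms)
    }

    /// Build the two client events sent on barge-in,
    /// or `None` when no assistant item is in flight.
    fn barge_in_events(&self, playback_buffer_depth_ms: u64) -> Option<[String; 2]> {
        let item_id = self.item_id.as_ref()?;
        let cancel = r#"{"type":"response.cancel"}"#.to_string();
        let truncate = format!(
            r#"{{"type":"conversation.item.truncate","item_id":"{}","content_index":0,"audio_end_ms":{}}}"#,
            item_id,
            self.audio_end_ms(playback_buffer_depth_ms)
        );
        Some([cancel, truncate])
    }
}
```

With one second of audio enqueued and 120 ms still buffered, this sketch would truncate the item at 880 ms. The saturating subtraction is a defensive choice of the sketch; the PR itself only states the plain difference.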
Summary
The realtime voice loop currently drops every microphone chunk whenever the assistant is speaking (`response_active || playback_depth > 50ms`, `src/voice_loop/mod.rs`). That fully disables barge-in: OpenAI's `semantic_vad` never sees user audio mid-response, so it can never trigger an interruption, even though the session is configured to use it.

This PR is Phase 1 of fixing that. It moves from "fully gate the mic" to "duck the mic + handle server barge-in events".
What changed
- While the assistant is speaking, attenuate mic chunks by `mic_ducking_gain` (default `0.25`) instead of dropping them. Speaker leakage stays below the server VAD threshold; deliberate user speech rises above it.
- Track the in-flight assistant `item.id` from `response.output_item.added`, and accumulate enqueued audio duration (ms at 24 kHz) from `response.output_audio.delta`.
- Handle `input_audio_buffer.speech_started`. While a response is active, emit `response.cancel` + `conversation.item.truncate` (with `audio_end_ms = enqueued_ms - playback_buffer_depth_ms`, so the assistant transcript matches what the user actually heard), and clear the local playback buffer so barge-in is instant.
- Configure via `mic_ducking_gain: <f32>` in `settings.json`, clamped to `[0.0, 1.0]`. Setting it to `0.0` restores the legacy fully-gated behavior; setting it to `1.0` disables ducking entirely.
- Server `response.done` remains the source of truth for "the response is over", so cancelled responses unwind through the existing path with no additional state machine.

Risks / things to verify on real hardware
- `0.25` is a reasonable default for a MacBook with built-in speakers + mic, but it's a guess. Quieter speakers or headphones might benefit from a higher value (`0.5`+); very loud speakers or a sensitive mic may need `0.15`. Tunable per-user via settings.
- `semantic_vad` is text-aware and tends to ignore residual speaker echo, but it isn't perfect. If false barge-ins are common in practice, Phase 2 (software AEC via `webrtc-audio-processing`) is the follow-up.
- `audio_end_ms` is computed as `enqueued_assistant_ms - playback_depth_ms` (both wall-clock ms). That should track what the user actually heard to within one playback callback (~10 ms).

Rollback
Setting `mic_ducking_gain: 0.0` in `~/.config/gamechat/settings.json` reproduces the previous "drop mic while speaking" behavior exactly, so users who don't like the new feel have a one-line escape hatch.

Test plan
- `cargo test`: 53 tests pass (45 existing + 8 new for `duck_samples`, `clear_playback`, and `mic_ducking_gain` parsing/clamping).
- `cargo check`: clean.
- Run `gamechat --realtime` on a real laptop. Start a long assistant response; interrupt mid-sentence. Confirm the model stops within ~200 ms and the next user turn is picked up cleanly.
- Set `mic_ducking_gain: 0.0` and confirm legacy behavior (no barge-in possible) is preserved.
- Try `mic_ducking_gain: 0.5` in a noisy room and see whether echo-induced false barge-ins increase.

🤖 Generated with Claude Code
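To make the unit-tested behavior concrete, here is a minimal sketch of what gain clamping and sample ducking could look like. The name `duck_samples` comes from the test plan, but its signature here is a guess; `clamp_ducking_gain`, the `i16` sample type, and the NaN fallback to the default are assumptions of this sketch rather than details taken from the PR.

```rust
/// Default gain applied to the mic while the assistant is speaking.
const DEFAULT_MIC_DUCKING_GAIN: f32 = 0.25;

/// Clamp a user-supplied gain to [0.0, 1.0].
/// Assumption: NaN falls back to the default instead of propagating.
fn clamp_ducking_gain(raw: f32) -> f32 {
    if raw.is_nan() {
        DEFAULT_MIC_DUCKING_GAIN
    } else {
        raw.clamp(0.0, 1.0)
    }
}

/// Scale PCM16 samples in place by `gain` (hypothetical signature).
/// With `gain` in [0.0, 1.0] the result always fits in an i16.
fn duck_samples(samples: &mut [i16], gain: f32) {
    for sample in samples.iter_mut() {
        *sample = (f32::from(*sample) * gain).round() as i16;
    }
}
```

In this sketch a gain of `0.0` zeroes every sample and `1.0` leaves the chunk untouched. The real loop may instead skip sending the chunk at `0.0`, which would match the legacy gate more literally.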