voice_loop: Phase 1 barge-in (duck mic + cancel/truncate) by zdql · Pull Request #6 · zdql/gamechat

zdql · 2026-05-27T03:14:13Z

Summary

The realtime voice loop currently drops every microphone chunk whenever the assistant is speaking (response_active || playback_depth > 50ms, src/voice_loop/mod.rs). That fully disables barge-in: OpenAI's semantic_vad never sees user audio mid-response, so it can never trigger an interruption — even though the session is configured to use it.

This PR is Phase 1 of fixing that. It moves from "fully gate the mic" to "duck the mic + handle server barge-in events".

What changed

Forward the mic continuously. While the assistant is speaking, attenuate samples by mic_ducking_gain (default 0.25) instead of dropping them. Speaker leakage stays below the server VAD threshold; deliberate user speech rises above it.
Track the in-flight assistant item. Capture item.id from response.output_item.added, and accumulate enqueued audio duration (ms at 24 kHz) from response.output_audio.delta.
Handle input_audio_buffer.speech_started. While a response is active, emit response.cancel + conversation.item.truncate (with audio_end_ms = enqueued_ms - playback_buffer_depth_ms, so the assistant transcript matches what the user actually heard), and clear the local playback buffer so barge-in is instant.
Settings plumbing. New mic_ducking_gain: <f32> in settings.json, clamped to [0.0, 1.0]. Setting it to 0.0 restores the legacy fully-gated behavior; setting it to 1.0 disables ducking entirely.

response.done remains the source of truth for "the response is over", so cancelled responses unwind through the existing path with no additional state machine.

Risks / things to verify on real hardware

Ducking constant is empirical. 0.25 is a reasonable default for a MacBook with built-in speakers + mic, but it's a guess. Quieter speakers or headphones might benefit from a higher value (0.5+); very loud speakers or a sensitive mic may need 0.15. Tunable per-user via settings.
No echo cancellation. Without AEC, the model's own speech can occasionally cross the VAD threshold and trigger a self-interrupt. semantic_vad is text-aware and tends to ignore that, but it isn't perfect. If false barge-ins are common in practice, Phase 2 (software AEC via webrtc-audio-processing) is the follow-up.
Transcript fidelity on barge-in. audio_end_ms is computed as enqueued_assistant_ms − playback_depth_ms (both wall-clock ms). That should track what the user actually heard within one playback callback (~10 ms).

Rollback

Setting mic_ducking_gain: 0.0 in ~/.config/gamechat/settings.json reproduces the previous "drop mic while speaking" behavior exactly, so users who don't like the new feel have a one-line escape hatch.

Test plan

cargo test — 53 tests pass (45 existing + 8 new for duck_samples, clear_playback, and mic_ducking_gain parsing/clamping).
cargo check — clean.
Manual: gamechat --realtime on a real laptop. Start a long assistant response; interrupt mid-sentence. Confirm the model stops within ~200 ms and the next user turn is picked up cleanly.
Manual: confirm the assistant doesn't interrupt itself in a quiet room with default ducking.
Manual: try mic_ducking_gain: 0.0 and confirm legacy behavior (no barge-in possible) is preserved.
Manual: try mic_ducking_gain: 0.5 in a noisy room and see whether echo-induced false barge-ins increase.

🤖 Generated with Claude Code

Previously the realtime loop dropped every microphone chunk whenever the assistant was speaking (response_active || playback_depth > 50ms). That fully disabled barge-in: OpenAI's semantic_vad never saw user audio during a response, so it could never interrupt one. Phase 1 fix: - Forward mic continuously. While the assistant is speaking, attenuate the mic by `mic_ducking_gain` (default 0.25) so speaker leakage stays below the server VAD threshold but a deliberate user utterance still rises above it. Setting `mic_ducking_gain: 0.0` in settings.json restores the legacy fully-gated behavior. - Track the in-flight assistant item id (from response.output_item.added) and enqueued audio duration (from response.output_audio.delta). - On input_audio_buffer.speech_started while a response is active: emit response.cancel, conversation.item.truncate (with audio_end_ms = enqueued_ms - playback_buffer_depth_ms), and clear the local playback buffer so the user hears their barge-in immediately. Server `response.done` remains the source of truth for "response over", so cancelled responses unwind through the existing path with no extra state. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

voice_loop: Phase 1 barge-in (duck mic + cancel/truncate)#6

voice_loop: Phase 1 barge-in (duck mic + cancel/truncate)#6
zdql wants to merge 1 commit into
mainfrom
feature/realtime-barge-in

zdql commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

zdql commented May 27, 2026

Summary

What changed

Risks / things to verify on real hardware

Rollback

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant