feat: Add Grok realtime voice CLI #36
base: main
Conversation
Add voice command for real-time voice conversations with Grok AI via XAI's realtime WebSocket API. Supports voice mode (microphone/speaker via sox) and text mode for typing messages.

Key components:
- GrokVoiceClient: WebSocket client for XAI realtime voice API
- AudioCapture: Microphone input using sox CLI
- AudioPlayback: Speaker output using sox CLI
- Voice CLI command with --voice, --text, --instructions options

Requires XAI_API_KEY env var and sox for voice mode (brew install sox).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Bypass Schema encoding in favor of direct JSON construction for WebSocket messages to reduce complexity. Update session config to match XAI API requirements: pcm16 format, Whisper transcription, and enhanced VAD settings (threshold, padding, silence detection).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
```ts
  }
  return Chunk.fromIterable(buffers)
}),
```
Audio chunking loses partial data between stream chunks
Medium Severity
The Stream.mapChunks callback creates a fresh accumulated buffer on each invocation (line 57), but state is not preserved between calls. When upstream audio data doesn't align with chunkSize boundaries, partial data at the end of each chunk is emitted immediately as an undersized buffer, then lost. The next upstream chunk starts accumulation from zero instead of continuing with the leftover bytes. This causes inconsistent audio chunk sizes to be sent to the WebSocket API, potentially causing audio quality issues or inefficient network usage. A stateful approach like Stream.mapAccum would be needed to properly accumulate across chunk boundaries.
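The fix the comment suggests — carrying leftover bytes across upstream chunks — can be sketched outside Effect as a small stateful chunker. This is a minimal illustration, not the PR's code; `makeChunker` and its method names are hypothetical:

```typescript
// Hypothetical sketch: accumulate partial data across upstream chunks so
// every emitted buffer is exactly chunkSize bytes; the remainder is kept
// for the next call instead of being emitted undersized or dropped.
const makeChunker = (chunkSize: number) => {
  let leftover = Buffer.alloc(0)
  return {
    // Feed an upstream chunk; returns zero or more full-size buffers.
    push(data: Buffer): Array<Buffer> {
      let acc = Buffer.concat([leftover, data])
      const out: Array<Buffer> = []
      while (acc.length >= chunkSize) {
        out.push(acc.subarray(0, chunkSize))
        acc = acc.subarray(chunkSize)
      }
      leftover = acc // preserved for the next push
      return out
    },
    // Emit whatever remains once the stream ends.
    flush(): Buffer | undefined {
      return leftover.length > 0 ? leftover : undefined
    }
  }
}
```

In the Effect version, the same state threading is what `Stream.mapAccum` provides: the leftover buffer is the accumulator passed between invocations.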
```ts
ws.on("error", (error) => {
  Effect.runSync(Effect.logError(`WebSocket error: ${error.message}`))
  resume(Effect.fail(error as Error))
})
```
WebSocket errors after connection silently ignored
Medium Severity
The Effect.async callback's resume function can only be called once effectively. When the WebSocket connects successfully, resume(Effect.void) is called on the "open" event (line 145). If a WebSocket error occurs after the connection is established, the "error" handler calls resume(Effect.fail(error)) but this has no effect since resume was already invoked. Errors during an active session (network failures, authentication issues, server errors) are only logged but not propagated to the caller, leaving the application in a confusing state where streams silently stop working without proper error handling.
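One way to address this — sketched here in plain TypeScript with hypothetical names, not the PR's code — is to wrap the once-only `resume` so that the first settlement wins and any later error is routed to a dedicated handler (e.g. one that fails the event stream) instead of being dropped:

```typescript
// Hypothetical sketch: `resume` may only settle the effect once, so track
// whether it has fired and route post-connection errors elsewhere.
type Result = { ok: true } | { ok: false; error: Error }

const makeSafeResume = (
  resume: (result: Result) => void,
  onLateError: (error: Error) => void
) => {
  let settled = false
  return {
    succeed(): void {
      if (settled) return
      settled = true
      resume({ ok: true })
    },
    fail(error: Error): void {
      if (!settled) {
        settled = true
        resume({ ok: false, error })
      } else {
        // Connection already established: surface the error to the
        // session (e.g. shut down queues) rather than silently logging.
        onLateError(error)
      }
    }
  }
}
```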
```ts
  yield* connection.close
}).pipe(
  Effect.provide(VoiceLayer),
  Effect.catchAll((error) =>
    Console.error(`Error: ${error instanceof Error ? error.message : String(error)}`)
  )
```
Resources not cleaned up when errors occur
Medium Severity
The cleanup code (lines 125-129) that interrupts fibers and closes the player/connection only executes if the text or voice mode block completes without error. If runTextMode or the mic stream throws (e.g., sox crashes, WebSocket disconnects), the error is caught by Effect.catchAll but the cleanup code is skipped entirely. The forked fibers (audioPlaybackFiber, transcriptFiber, userTranscriptFiber), the player process, and the WebSocket connection will remain open, causing resource leaks. The cleanup logic needs to be wrapped in Effect.ensuring or similar to guarantee execution.
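The guarantee `Effect.ensuring` gives is the same one a `try`/`finally` gives in promise-based code. A minimal sketch of the pattern (illustrative helper name, not the PR's code):

```typescript
// Hypothetical sketch: run cleanup on success AND failure, mirroring
// Effect.ensuring, so fibers, the player process, and the WebSocket
// connection are always released.
const withCleanup = async <A>(
  run: () => Promise<A>,
  cleanup: () => Promise<void>
): Promise<A> => {
  try {
    return await run()
  } finally {
    // Runs whether `run` resolved or rejected.
    await cleanup()
  }
}
```

In the PR, wrapping the text/voice mode block this way (via `Effect.ensuring` or `Effect.acquireRelease`) would make the fiber interrupts and `connection.close` unconditional instead of only running on the happy path.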
```ts
  Effect.runSync(Queue.shutdown(transcriptQueue))
  Effect.runSync(Queue.shutdown(userTranscriptQueue))
  Effect.runSync(Queue.shutdown(eventQueue))
})
```
readyQueue not shutdown causes infinite hang on early close
High Severity
When the WebSocket closes, the "close" handler shuts down audioQueue, transcriptQueue, userTranscriptQueue, and eventQueue, but readyQueue is not shutdown. If the WebSocket connects but then closes before session.updated is received (e.g., authentication failure, server rejection, or network issue), waitForReady on line 206 (Queue.take(readyQueue)) will block forever. The CLI will hang indefinitely with no error message or way to recover. The readyQueue needs to be shutdown in the close handler.
Additional Locations (1)
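The hang mechanism is easy to see with a toy queue whose pending `take` rejects on shutdown — if shutdown is never called, the waiter blocks forever. A minimal sketch (not Effect's `Queue`, just an illustration of why the close handler must also shut down `readyQueue`):

```typescript
// Hypothetical sketch: a queue whose pending takes reject on shutdown,
// so a waiter like waitForReady fails fast instead of hanging forever.
class MiniQueue<T> {
  private items: T[] = []
  private waiters: Array<{ resolve: (v: T) => void; reject: (e: Error) => void }> = []
  private closed = false

  offer(item: T): void {
    const waiter = this.waiters.shift()
    if (waiter) waiter.resolve(item)
    else this.items.push(item)
  }

  take(): Promise<T> {
    if (this.items.length > 0) return Promise.resolve(this.items.shift() as T)
    if (this.closed) return Promise.reject(new Error("queue shut down"))
    // Without a shutdown call, this promise never settles.
    return new Promise((resolve, reject) => this.waiters.push({ resolve, reject }))
  }

  shutdown(): void {
    this.closed = true
    for (const waiter of this.waiters.splice(0)) {
      waiter.reject(new Error("queue shut down"))
    }
  }
}
```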
Introduces a transport-agnostic conversation interface that works with both:
- HTTP-based chat completions (OpenAI-compatible APIs)
- WebSocket-based voice APIs (Grok realtime)

Key components:
- domain.ts: Core types (ConversationEvent union, LlmTransport service)
- http-transport.ts: Wraps OpenAiChatClient as stateless transport
- ws-transport.ts: Wraps GrokVoiceClient as stateful transport
- demo.ts: CLI demo with YAML event logging on exit

The unified session provides a consistent API (sendText, sendAudio, events stream) regardless of transport. The demo supports multiple providers (openrouter, xai, groq, etc).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
```ts
process.on("exit", () => {
  writeEventLog()
})
```
Event log written twice on SIGINT exit
Low Severity
The SIGINT handler calls writeEventLog() and then process.exit(0). Because process.exit triggers the 'exit' event, the 'exit' handler runs writeEventLog() a second time. When the user presses Ctrl+C, both handlers therefore execute, and two YAML log files with slightly different timestamps are written.
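A common fix is to wrap the write in a once-guard so both handlers can safely call it. Sketch with a hypothetical `once` helper (not the PR's code):

```typescript
// Hypothetical sketch: make a function idempotent so the SIGINT handler
// and the 'exit' handler together write the event log exactly once.
const once = (fn: () => void): (() => void) => {
  let done = false
  return () => {
    if (done) return
    done = true
    fn()
  }
}
```

Usage would look like `const writeOnce = once(writeEventLog)`, registered for both `process.on("SIGINT", ...)` and `process.on("exit", writeOnce)`.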
```ts
    return player.write(event.chunk)
  }
  return handleEvent(event)
}),
```
AudioDelta events not logged in voice mode
Low Severity
In voiceDemo, the Stream.tap callback returns player.write(event.chunk) for AudioDelta events without calling handleEvent. Since logEvent is only called inside handleEvent, audio delta events are never logged to the YAML event file. All other event types are logged via handleEvent, but audio playback events are silently skipped from the log output.
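One resolution is to log unconditionally before branching on the event type. This sketch uses illustrative type and function names, not the PR's actual definitions:

```typescript
// Hypothetical sketch: log every event first, then branch, so AudioDelta
// events reach both the player and the YAML event log.
type VoiceEvent =
  | { _tag: "AudioDelta"; chunk: Uint8Array }
  | { _tag: "Transcript"; text: string }

const makeTap = (
  logEvent: (event: VoiceEvent) => void,
  writeAudio: (chunk: Uint8Array) => void,
  handleEvent: (event: VoiceEvent) => void
) => (event: VoiceEvent): void => {
  logEvent(event) // every event is logged, including audio deltas
  if (event._tag === "AudioDelta") {
    writeAudio(event.chunk)
    return
  }
  handleEvent(event)
}
```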
```ts
  Options.withAlias("r"),
  Options.withDescription("Audio sample rate in Hz"),
  Options.withDefault(DEFAULT_SAMPLE_RATE)
)
```
Sample rate option causes audio mismatch with fixed-rate API
Medium Severity
The --sample-rate CLI option affects local audio capture and playback rates but the XAI realtime API uses a fixed 24kHz sample rate (as noted in the client.ts comment). When a user specifies a non-default rate like 16000, the microphone captures at 16kHz while the API expects 24kHz input, and the API returns 24kHz audio that plays at 16kHz. This causes audio to be pitched/sped incorrectly in both directions. The option is exposed to users but functionally broken for any value other than the default.
Additional Locations (1)
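Until resampling is implemented, one option is to validate the flag against the API's fixed rate so users get an immediate error instead of pitched audio. The constant and helper names below are illustrative, not the PR's code:

```typescript
// Hypothetical sketch: the XAI realtime API uses a fixed 24 kHz rate,
// so reject any other value rather than capturing/playing mismatched audio.
const XAI_REALTIME_SAMPLE_RATE = 24_000

const validateSampleRate = (rate: number): number => {
  if (rate !== XAI_REALTIME_SAMPLE_RATE) {
    throw new Error(
      `XAI realtime API requires ${XAI_REALTIME_SAMPLE_RATE} Hz audio; got ${rate} Hz`
    )
  }
  return rate
}
```

Alternatively, the option could simply be removed until local resampling (e.g. via sox's `rate` effect) bridges the gap.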
```ts
  AudioCapture.Default,
  AudioPlayback.Default,
  BunCommandExecutor.layer
)
```
Missing BunFileSystem layer dependency for command executor
Medium Severity
The VoiceLayer uses BunCommandExecutor.layer directly without providing BunFileSystem.layer as a dependency. In contrast, demo.ts correctly composes these as BunCommandExecutor.layer.pipe(Layer.provide(BunFileSystem.layer)). The checkSoxAvailable function uses Command.string() which requires the command executor. If BunCommandExecutor depends on BunFileSystem, this layer composition would fail at runtime when the voice command is executed.
Add Grok realtime voice CLI for real-time voice conversations via XAI's WebSocket API. Users can speak into a microphone, receive spoken responses, and optionally type messages instead.
Features
Usage
- `bun run mini-agent voice --voice ara` — Voice mode with the Ara voice
- `bun run mini-agent voice --text` — Text mode (type messages)
- `bun run mini-agent voice --instructions "..."` — Custom system instructions

Requires XAI_API_KEY env var and sox for voice mode.
🤖 Generated with Claude Code
Note
Adds end-to-end voice support and a transport-agnostic session layer.
- `src/voice/*` (`client.ts`, `audio-capture.ts`, `audio-playback.ts`, `cli.ts`, `domain.ts`, `index.ts`) implementing the Grok WS client, mic capture (sox), speaker playback (sox), and a `voice` CLI command (text or voice modes)
- `src/unified/*` (`domain.ts`, `http-transport.ts`, `ws-transport.ts`, `index.ts`, `demo.ts`) providing a single conversation interface across OpenAI-compatible HTTP and Grok WS transports, plus a demo that logs events to YAML
- Registers the `voice` command in `src/cli/commands.ts`
- Adds `ws` and `@types/ws` to `package.json`; updates `bun.lock`
- Adds `PLAN.md` with architecture and usage details

Written by Cursor Bugbot for commit 459c280. This will update automatically on new commits.