
feat: multi-provider voice transcription (Parakeet local GPU, Mistral, OpenAI) #132

Open
BasilPadre wants to merge 1 commit into RichardAtCT:main from BasilPadre:feature/multi-provider-voice-transcription

Conversation

@BasilPadre

Summary

  • Adds voice message transcription with three pluggable backends: Parakeet (local GPU, default), Mistral Voxtral (cloud), OpenAI Whisper (cloud)
  • New VOICE_PROVIDER setting selects the backend; cloud providers require their respective API keys
  • Parakeet uses NVIDIA NeMo Parakeet TDT 0.6B v3 — runs entirely on-device with no API cost; model is downloaded and cached automatically on first use
  • Optional dependency groups [voice] and [parakeet] keep heavy GPU deps out of the default install
  • New agentic_voice handler in the orchestrator transcribes the message and passes it to Claude as text, preserving session context
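The "pluggable backends" idea could be sketched as a small provider registry keyed by the `VOICE_PROVIDER` value. This is a hypothetical illustration — the function and class names below are not taken from the PR:

```python
# Hypothetical sketch of pluggable transcription backends selected by
# VOICE_PROVIDER. Real names in the PR may differ.
from typing import Callable, Dict

PROVIDERS: Dict[str, Callable[[], object]] = {}


def register_provider(name: str):
    """Register a backend factory under a VOICE_PROVIDER name."""
    def deco(factory: Callable[[], object]):
        PROVIDERS[name] = factory
        return factory
    return deco


@register_provider("parakeet")
def make_parakeet():
    ...  # would lazily load the NeMo Parakeet model here


@register_provider("mistral")
def make_mistral():
    ...  # would build a Mistral client from MISTRAL_API_KEY


@register_provider("openai")
def make_openai():
    ...  # would build an OpenAI client from OPENAI_API_KEY


def get_provider(name: str):
    """Instantiate the configured backend, failing loudly on a typo."""
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(
            f"Unknown VOICE_PROVIDER: {name!r}; expected one of {sorted(PROVIDERS)}"
        ) from None
```

A registry like this keeps the handler code provider-agnostic: the orchestrator only ever calls `get_provider(settings.VOICE_PROVIDER)`.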

New settings

| Variable | Default | Description |
| --- | --- | --- |
| ENABLE_VOICE_PROCESSING | false | Enable voice transcription |
| VOICE_PROVIDER | parakeet | parakeet, mistral, or openai |
| FFMPEG_PATH | (PATH) | Explicit path to ffmpeg binary |
| VOICE_MAX_FILE_SIZE_MB | 20 | Reject files larger than this |
| MISTRAL_API_KEY | (none) | Required for Mistral provider |
| OPENAI_API_KEY | (none) | Required for OpenAI provider |
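For example, a minimal `.env` enabling the Mistral cloud backend might look like this (the API key value is a placeholder):

```
ENABLE_VOICE_PROCESSING=true
VOICE_PROVIDER=mistral
MISTRAL_API_KEY=your-mistral-api-key
VOICE_MAX_FILE_SIZE_MB=20
```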

Install

# Local GPU (Parakeet)
pip install "claude-code-telegram[parakeet]"

# Cloud providers
pip install "claude-code-telegram[voice]"

Test plan

  • Send voice message with VOICE_PROVIDER=parakeet — transcription appears, then Claude responds
  • Send voice message with VOICE_PROVIDER=mistral (requires MISTRAL_API_KEY)
  • Send voice message with VOICE_PROVIDER=openai (requires OPENAI_API_KEY)
  • File exceeding VOICE_MAX_FILE_SIZE_MB is rejected with a clear error
  • ENABLE_VOICE_PROCESSING=false — voice messages are silently ignored (no handler registered)
  • Missing optional deps raise a helpful RuntimeError with install instructions

🤖 Generated with Claude Code

Adds voice message transcription support with three backends:
- `parakeet` (default): local NVIDIA NeMo Parakeet TDT 0.6B v3, runs on GPU,
  no API key or cloud cost required
- `mistral`: Mistral Voxtral cloud API
- `openai`: OpenAI Whisper cloud API

New settings:
- ENABLE_VOICE_PROCESSING (bool, default false)
- VOICE_PROVIDER (mistral | openai | parakeet, default parakeet)
- FFMPEG_PATH (optional explicit path, falls back to PATH)
- VOICE_MAX_FILE_SIZE_MB (default 20)
- MISTRAL_API_KEY / OPENAI_API_KEY (for cloud providers)

Optional dependency groups added to pyproject.toml:
- `[voice]` for mistral + openai cloud providers
- `[parakeet]` for local GPU transcription via NeMo

The Parakeet model (~600 MB) is downloaded and cached automatically on
first use. Audio is converted ogg→wav via ffmpeg before transcription.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@FridayOpenClawBot

PR Review
Reviewed head: 14d665b64dc714c9718a59dfee86e0e05e8d6ee8

Summary

  • Adds voice transcription with three backends (Parakeet local GPU, Mistral Voxtral, OpenAI Whisper) behind a feature flag. Good structure overall, but has a broken packaging pattern, a surprising default, and conflicts with existing CLAUDE.md conventions that need resolving before merge.

What looks good

  • Lazy client/model initialisation — no import cost unless the provider is actually used
  • Triple file-size guard (pre-download metadata, post-get-file, post-download bytes) is thorough
  • run_in_executor correctly offloads the blocking Parakeet inference off the event loop
  • Graceful error wrapping with structured logging on cloud provider failures
  • Feature flag defaults to False, so existing deployments are unaffected on upgrade

Issues / questions

  1. [Blocker] pyproject.toml — The optional deps are declared inside [tool.poetry.group.voice.dependencies] and [tool.poetry.group.parakeet.dependencies], not in [tool.poetry.dependencies]. Poetry only exports pip extras from the main dependencies table. As written, pip install "claude-code-telegram[voice]" and pip install "claude-code-telegram[parakeet]" will silently install nothing extra — extras groups are a Poetry-dev concept, not a packaging artifact. Move mistralai, openai, nemo_toolkit, and torch into [tool.poetry.dependencies] with optional = true, then keep the [tool.poetry.extras] block as-is.

  2. [Blocker] src/config/settings.py — VOICE_PROVIDER defaults to "parakeet", but Parakeet requires a CUDA GPU and a ~600 MB NeMo model download. Most cloud-deployed instances of this bot have no GPU. CLAUDE.md also explicitly documents the default as mistral. This will fail on the first voice message for the majority of users. Change the default to "mistral" to match documented behaviour, or at minimum add a startup validation that raises a clear error if parakeet is selected without CUDA available.

  3. [Blocker] src/config/settings.py — This PR renames ENABLE_VOICE_MESSAGES → ENABLE_VOICE_PROCESSING and drops VOICE_TRANSCRIPTION_MODEL, silently breaking existing deployments that set those env vars. CLAUDE.md documents the old names. Is this intentional? If so, CLAUDE.md must be updated in this PR and a migration note added to the README. If not, revert to the documented names.

  4. [Important] src/bot/features/voice_handler.py:_parakeet (property) — No lock around the lazy model load. If two voice messages arrive concurrently, both threads entering _run_parakeet via the executor could race through if self._parakeet_model is None and load the model twice. Add a threading.Lock acquired before the None check.

  5. [Important] src/bot/features/voice_handler.py:process_voice_message — The pre-download voice.file_size check is best-effort only: Telegram doesn't always populate that field. The real guard is the post-download byte-length check, which works correctly. Worth adding a comment so the next reader doesn't wonder why the triple check is needed.

  6. [Nit] src/bot/features/voice_handler.py:_run_parakeet — Missing return type annotation (-> str). CLAUDE.md requires type hints on all functions and mypy strict is enforced.
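For blocker 1, the suggested pyproject.toml layout would look roughly like the following. The version constraints are illustrative, not taken from the PR:

```toml
# Optional deps must live in the main dependencies table for Poetry to
# export them as pip extras (versions below are illustrative).
[tool.poetry.dependencies]
mistralai = { version = "^1.0", optional = true }
openai = { version = "^1.0", optional = true }
nemo_toolkit = { version = "^2.0", optional = true }
torch = { version = "^2.0", optional = true }

[tool.poetry.extras]
voice = ["mistralai", "openai"]
parakeet = ["nemo_toolkit", "torch"]
```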
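For blocker 2, a fail-fast startup guard could look like the sketch below. The `cuda_available` argument stands in for `torch.cuda.is_available()` so the logic stays testable without torch; the function name is hypothetical:

```python
def validate_voice_provider(provider: str, cuda_available: bool) -> None:
    """Fail at startup, not on the first voice message, if the configured
    provider cannot run. `cuda_available` would be fed from
    torch.cuda.is_available() when torch is installed (sketch only)."""
    if provider == "parakeet" and not cuda_available:
        raise RuntimeError(
            "VOICE_PROVIDER=parakeet requires a CUDA GPU. Install on a GPU host "
            "with: pip install 'claude-code-telegram[parakeet]', or set "
            "VOICE_PROVIDER=mistral or VOICE_PROVIDER=openai instead."
        )
```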
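One way to fix the lazy-load race from issue 4 is double-checked locking. The loader below is a stand-in (the real code would call `nemo_asr.models.ASRModel.from_pretrained`); `load_count` exists only to make the guarantee observable:

```python
import threading


class ParakeetLoader:
    """Thread-safe lazy model loader (sketch with a stand-in load step)."""

    def __init__(self) -> None:
        self._model: object | None = None
        self._lock = threading.Lock()
        self.load_count = 0  # instrumentation for the sketch only

    def _load(self) -> object:
        # Stand-in for the expensive NeMo from_pretrained() call.
        self.load_count += 1
        return object()

    @property
    def model(self) -> object:
        if self._model is None:           # fast path, no lock once loaded
            with self._lock:
                if self._model is None:   # re-check under the lock
                    self._model = self._load()
        return self._model
```

The second `None` check under the lock is what prevents two executor threads from both loading the ~600 MB model.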

Suggested tests (if needed)

  • Unit test _check_file_size with boundary values (exactly at limit, one byte over)
  • Mock nemo_asr.models.ASRModel.from_pretrained and assert the property only calls it once across two concurrent _run_parakeet calls (regression for the race once fixed)
  • Integration smoke: VOICE_PROVIDER=openai with a mocked AsyncOpenAI client returns a ProcessedVoice with correct transcription and prompt fields
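The boundary-value test suggested above might target a size-guard helper along these lines (a simplified, hypothetical signature — the PR's `_check_file_size` may differ):

```python
def check_file_size(num_bytes: int, max_mb: int = 20) -> None:
    """Reject payloads over the VOICE_MAX_FILE_SIZE_MB limit.
    A file exactly at the limit is accepted; one byte over is rejected."""
    limit = max_mb * 1024 * 1024
    if num_bytes > limit:
        raise ValueError(
            f"Voice file is {num_bytes} bytes; limit is {max_mb} MB ({limit} bytes)"
        )
```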

Verdict

  • ⚠️ Merge after fixes (blockers 1–3 need resolving; 4 is straightforward)

Friday, AI assistant to @RichardAtCT

