feat(parse): implement audio resource parser with Whisper transcription#707
mvanhorn wants to merge 3 commits into volcengine:main
Conversation
Replace the audio parser stub with a working implementation that:
- Extracts metadata (duration, sample rate, channels, bitrate) via mutagen
- Transcribes speech via Whisper API with timestamped segments
- Builds structured ResourceNode tree with L0/L1/L2 content tiers
- Falls back to metadata-only output when Whisper is unavailable
- Adds mutagen as optional dependency under [audio] extra
- Adds audio_summary prompt template for semantic indexing
- Includes unit tests with mocked Whisper API and mutagen
qin-ctx
left a comment
Review Summary
The core audio parser implementation is well-structured with good graceful degradation design. However, there are critical issues with the Whisper API integration that make transcription non-functional under default configuration.
Blocking Issues
- Default model name incompatible with OpenAI Whisper API
- No base_url support for custom Whisper endpoints
- audio_summary.yaml prompt template added but never used
Non-blocking
- Duplicate boilerplate across transcription methods
- ~13 files have unrelated ruff formatting changes mixed into this feature PR — consider splitting formatting into a separate PR for easier review and bisect
Note
This PR currently has merge conflicts that need to be resolved.
```python
audio_file.name = f"audio{ext}"
...
response = await client.audio.transcriptions.create(
    model=model or "whisper-1",
```
[Bug] (blocking) The default transcription_model in AudioConfig is "whisper-large-v3", which is a HuggingFace/local model name. OpenAI's Whisper API only accepts "whisper-1". Since model is "whisper-large-v3" (truthy), model or "whisper-1" evaluates to "whisper-large-v3", and the API call will fail with an invalid model error.
The error is silently caught by the except block at line 344, so users get metadata-only output without any clear indication that transcription is broken.
Suggested fix: either change the AudioConfig default to "whisper-1", or add model name mapping logic for different Whisper providers.
```python
import openai
...
client = openai.AsyncOpenAI(
    api_key=config.llm.api_key if hasattr(config, "llm") else None,
```
[Bug] (blocking) openai.AsyncOpenAI is created without a base_url parameter, so it always connects to OpenAI's default endpoint. The project's ProviderConfig supports custom API endpoints via api_base, but this code ignores it entirely. Users with custom Whisper deployments (Azure OpenAI, local Whisper server, etc.) cannot use transcription.
[Suggestion] (non-blocking) This direct client creation is also inconsistent with the project's patterns — ImageParser uses get_openviking_config().vlm for model calls, integrating with the provider infrastructure. Consider either reusing an existing provider client or at least reading api_base from the config to construct the client.
```diff
@@ -0,0 +1,44 @@
metadata:
```
[Design] (blocking) This prompt template (parsing.audio_summary) is defined but never referenced anywhere in the code. For comparison, ImageParser uses render_prompt("parsing.image_summary", ...) to generate semantic summaries via LLM, but AudioParser._generate_semantic_info() at audio.py:456 does simple string truncation instead of LLM-powered summarization.
Either integrate this template with render_prompt() in _generate_semantic_info (following the ImageParser pattern), or remove this file to avoid shipping dead code.
```python
        Transcript with timestamps in markdown format, or None if not available
        List of segment dicts with keys: start, end, text
        """
        try:
```
[Design] (non-blocking) _asr_transcribe_with_timestamps (lines 348-406) contains identical boilerplate to _asr_transcribe (lines 309-346): config loading, openai.AsyncOpenAI client creation, and BytesIO wrapping. Consider extracting shared helpers like _get_whisper_client() and _prepare_audio_file() to reduce duplication and make the API integration easier to update.
The audio parser feature is unrelated to memory health stats and belongs in its own PR (volcengine#707). Reverts audio.py to its pre-rewrite state and removes the unused audio_summary.yaml template and the audio parser tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cation

- Change default transcription_model from "whisper-large-v3" (HuggingFace name) to "whisper-1" (OpenAI API compatible)
- Add base_url support via ProviderConfig.api_base so custom Whisper deployments (Azure, local server) work correctly
- Extract _get_whisper_client() and _prepare_audio_file() helpers to eliminate duplicate boilerplate between _asr_transcribe and _asr_transcribe_with_timestamps
- Remove unused audio_summary.yaml template (will integrate with render_prompt in a follow-up if LLM-powered summarization is desired)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addressed all feedback in a6a57ce:
Summary

- Replaces openviking/parse/parsers/media/audio.py with a working implementation
- Metadata extraction via mutagen (optional dependency)

This addresses the audio parser stub and aligns with the Q2 multimodal roadmap mentioned in #372. The TODO comments at audio.py:172 ("Integrate with actual ASR API (Whisper, etc.)") and audio.py:190 ("Integrate with ASR API") are resolved by this implementation.

Changes

- openviking/parse/parsers/media/audio.py
- openviking/prompts/templates/parsing/audio_summary.yaml
- tests/unit/parse/test_audio_parser.py
- pyproject.toml: adds mutagen>=1.47.0 as an optional [audio] dependency

Design

- Follows the BaseParser interface and ImageParser patterns exactly
- Metadata via mutagen (graceful fallback if not installed)
- Uses openai.AsyncOpenAI for Whisper API calls, reusing the existing provider config

Evidence

Test plan

- pytest tests/unit/parse/test_audio_parser.py -v
- ruff check openviking/parse/parsers/media/audio.py

Submitted via osc-newfeature