Implement Semantic Audio Integration for Audio and Video files #7
ShrutiBachal wants to merge 17 commits into
…odal image, text, and audio search
- Integrated Faster-Whisper (CTranslate2) for accelerated audio transcription and background processing.
Pull request overview
This PR adds semantic audio support to Semantixel’s indexing + retrieval pipeline by introducing audio transcription (Whisper variants), ambient sound embeddings (CLAP), and UI support for displaying audio results and improved video playback syncing.
Changes:
- Extend scanning/indexing to include common audio formats and create an ambient-audio vector collection.
- Add audio model providers (HF Whisper + Faster-Whisper) and CLAP provider; integrate them into ModelManager, indexing, and unified semantic search.
- Update WebUI to support “Audio Only” filtering, render audio results, and synchronize thumbnail/enlarged video playback.
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| UI/Semantixel WebUI/styles.css | Adds styling for audio result cards and an audio badge. |
| UI/Semantixel WebUI/index.html | Adds “Audio Only” filter, renders audio results, and adds thumbnail/enlarged video sync logic. |
| semantixel/utils/scan_utils.py | Expands scanned media extensions to include audio formats. |
| semantixel/sources/google_drive_source.py | Stores OAuth PKCE code verifier and passes it into token exchange when available. |
| semantixel/services/search_service.py | Adds unified semantic search across visual + text + ambient-audio collections and updates ID/type handling. |
| semantixel/services/model_manager.py | Adds lazy-loading accessors for audio transcription and CLAP models. |
| semantixel/services/index_service.py | Adds an ambient-audio collection and a second-phase audio indexing pass for audio/video files. |
| semantixel/services/bm25_service.py | Updates media-type filtering logic to account for audio transcript doc IDs. |
| semantixel/providers/text/hf_provider.py | Modifies local-only loading behavior for MiniLM embeddings. |
| semantixel/providers/base.py | Introduces an AudioProvider abstract base class. |
| semantixel/providers/audio/hf_audio_provider.py | New HF Whisper transcription provider using transformers.pipeline. |
| semantixel/providers/audio/faster_whisper_provider.py | New Faster-Whisper transcription provider. |
| semantixel/providers/audio/clap_provider.py | New CLAP provider for ambient audio + text embeddings. |
| semantixel/providers/audio/__init__.py | Exposes audio providers via package init. |
| semantixel/core/config.py | Adds AudioConfig into the main settings model. |
| requirements.txt | Adds Google API dependencies (but currently missing required audio deps used by new providers). |
KarunyaChavan left a comment
@ShrutiBachal Please ensure that all of the Copilot feedback above is addressed, and once the changes are done, please update the PR description accordingly. Also remember to update the docstrings to match the new implementation.
- update indexing progress logic
- update audio indexing skip check logic to use specific document ids
- resolve silent video indexing issue
KarunyaChavan left a comment
@ShrutiBachal I have a few comments regarding the coding standards in this implementation. There are concerns about potential redundancy, as well as whether the SOLID and DRY principles have been properly followed. Could you please review and address these comments?

    from semantixel.providers.audio.clap_provider import HFAudioCLAPProvider


    class ModelManager:
        """

I didn't understand the comment on line 12.
…st processing logic
KarunyaChavan left a comment
There are unused imports, inconsistent indentation (for example in _infer_media_type), eager imports, broad except Exception clauses, and comments like “instantly!” / “VRAM/RAM” that feel too casual.

    @@ -0,0 +1,77 @@
    import torch
    import librosa

librosa and faster-whisper are used but not added to requirements.txt

    ambient_results = self._query_collection(
        model_manager.clap.get_text_embeddings,
        self.audio_collection,
        query,
        query_k
    )

semantic_text_search() always queries audio_collection using model_manager.clap.get_text_embeddings.
As a result, a normal caption search now loads a large audio model and fails if CLAP or the audio dependencies are unavailable. Audio search should be optional, guarded, or isolated so that existing image/text search still works.
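
One way to isolate the ambient-audio query is sketched below. It is a minimal sketch, not the PR's code: `query_optional_collection`, `load_embedder`, and `run_query` are hypothetical names, and it assumes that a missing audio dependency or missing model file surfaces as `ImportError` or `OSError` when the model is first loaded.

```python
import logging
from typing import Any, Callable, List, Sequence

logger = logging.getLogger(__name__)

# Hypothetical alias: a function that turns a text query into one embedding vector.
Embedder = Callable[[str], Sequence[float]]


def query_optional_collection(
    load_embedder: Callable[[], Embedder],
    run_query: Callable[[Sequence[float], int], List[Any]],
    query: str,
    k: int,
    enabled: bool = True,
) -> List[Any]:
    """Query an optional collection; return [] when it is disabled or unavailable."""
    if not enabled:
        # The embedder is never loaded, so no audio model is pulled into memory.
        return []
    try:
        embed = load_embedder()
    except (ImportError, OSError) as exc:
        # Assumption: missing packages or model files raise one of these types.
        logger.warning("Ambient audio search skipped: %s", exc)
        return []
    return run_query(embed(query), k)
```

With a wrapper like this, the image and text branches of the search can run unchanged, and the ambient-audio branch simply contributes an empty list when it is switched off or cannot be loaded.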

    def _infer_media_type(self, doc_id: str) -> str:
        """
        Infer media type from document ID.
        """
        if ":::" not in doc_id:
            return "image"

        postfix = doc_id.split(":::")[-1]

        if postfix in {"audio", "video"}:
            return postfix

        return "unknown"

I see that we're treating IDs without ::: as images and only recognizing the postfixes audio or video. Existing video frame IDs use numeric timestamps like local|...:::12.000000, so they become unknown, right? Can we please fix this?
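
A possible fix is sketched below as a standalone function. It assumes that any finite, non-negative numeric postfix is a video frame timestamp, which matches the ID format quoted in the comment but is not confirmed elsewhere in the PR. The function name and the example IDs in the usage note are illustrative only.

```python
import math


def infer_media_type(doc_id: str) -> str:
    """Infer media type from a document ID of the form '<media_id>:::<postfix>'."""
    if ":::" not in doc_id:
        return "image"

    postfix = doc_id.rsplit(":::", 1)[-1]
    if postfix in {"audio", "video"}:
        return postfix

    # Assumption: video frame IDs carry a numeric timestamp, e.g. '12.000000'.
    try:
        timestamp = float(postfix)
    except ValueError:
        return "unknown"

    # float() also accepts 'nan' and 'inf'; those are not valid timestamps.
    if math.isfinite(timestamp) and timestamp >= 0:
        return "video"
    return "unknown"
```

For instance, a hypothetical ID such as `local|clip.mp4:::12.000000` would resolve to `video`, while `local|clip.mp4:::audio` would still resolve to `audio`.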

    metadatas=[{
        "source": media.source,
        "source_media_id": media.media_id,
        "locator": media.locator,
        "display_path": media.display_path,
        "type": "audio"
    }]

Both video transcripts and pure audio files are stored as type="audio", and later _process_item_id() infers video via file extension. That makes behavior depend on locator/path formatting instead of explicit metadata.
Maybe store these separately instead:
- media_type → video / audio
- match_modality → audio_transcript / audio
- source_media_id for derived assets
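
The suggested split could look roughly like the sketch below. The field names `media_type`, `match_modality`, and `source_media_id` come from the comment; the helper `build_audio_metadata` and its validation are assumptions added for illustration.

```python
from typing import Dict

# Allowed values, taken from the review suggestion.
MEDIA_TYPES = {"audio", "video"}
MATCH_MODALITIES = {"audio_transcript", "audio"}


def build_audio_metadata(
    source: str,
    source_media_id: str,
    locator: str,
    display_path: str,
    media_type: str,
    match_modality: str,
) -> Dict[str, str]:
    """Build explicit metadata so search never has to guess from a file extension."""
    if media_type not in MEDIA_TYPES:
        raise ValueError(f"unsupported media_type: {media_type!r}")
    if match_modality not in MATCH_MODALITIES:
        raise ValueError(f"unsupported match_modality: {match_modality!r}")
    return {
        "source": source,
        "source_media_id": source_media_id,  # the original file a derived asset came from
        "locator": locator,
        "display_path": display_path,
        "media_type": media_type,            # what the file is
        "match_modality": match_modality,    # how the hit was matched
    }
```

A transcript extracted from a video would then be stored with `media_type="video"` and `match_modality="audio_transcript"`, so result handling can read the type directly instead of inspecting the locator.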

    s = 1.0 - raw_distance
    if model_type == "clip":
        s_min, s_max = 0.12, 0.30
    elif model_type == "minilm":
        s_min, s_max = 0.20, 0.70
    elif model_type == "clap":
        s_min, s_max = 0.10, 0.28
    else:
        s_min, s_max = 0.0, 1.0

    s_norm = (s - s_min) / (s_max - s_min) if s_max > s_min else 0.0
    s_norm = max(0.0, min(1.0, s_norm))
    return 1.0 - s_norm

_normalize_distance() uses hardcoded empirical ranges and converts scores back into a distance-like value, which _filter_results() then converts again. This makes the scoring pipeline hard to reason about and can silently skew rankings across different embedding models (CLIP, MiniLM, CLAP).
It might be cleaner to normalize once into a single canonical similarity score and keep downstream logic model-agnostic. I think this double conversion might be what is distorting the top-k ordering you were seeing.
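
A single-pass normalization along those lines might look like the sketch below. The per-model ranges are copied from the snippet under review; the names `SIMILARITY_RANGES`, `to_similarity`, and `rank_hits` are hypothetical, and the sketch assumes the raw value is a cosine-style distance where `1 - distance` is the similarity.

```python
from typing import Dict, List, Tuple

# Empirical similarity ranges per model, as hardcoded in the reviewed code.
SIMILARITY_RANGES: Dict[str, Tuple[float, float]] = {
    "clip": (0.12, 0.30),
    "minilm": (0.20, 0.70),
    "clap": (0.10, 0.28),
}


def to_similarity(raw_distance: float, model_type: str) -> float:
    """Map a raw distance to a similarity in [0, 1]; higher means a better match."""
    s = 1.0 - raw_distance
    s_min, s_max = SIMILARITY_RANGES.get(model_type, (0.0, 1.0))
    if s_max <= s_min:
        return 0.0
    return max(0.0, min(1.0, (s - s_min) / (s_max - s_min)))


def rank_hits(hits: List[Tuple[str, float, str]]) -> List[Tuple[str, float]]:
    """Rank (doc_id, raw_distance, model_type) hits by canonical similarity."""
    scored = [(doc_id, to_similarity(dist, model)) for doc_id, dist, model in hits]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Because the score is converted once and never flipped back into a distance, any later thresholding or merging step only needs to know that larger values are better, regardless of which model produced the hit.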

    # PHASE 2: Process Audio Constraints Sequentially
    for media in audio_items:
        # 1. Transcript indexing
        transcript_id = f"{media.media_id}:::audio"
        transcript_results = self.text_collection.get(ids=[transcript_id])

        if not transcript_results["ids"]:
            transcript = model_manager.audio.transcribe(media.locator)

Every scan now transcribes and embeds ambient audio for all videos/audio files. There is no config flag, no max duration setting, no error isolation per provider, and no clear way to disable CLAP/transcription.
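
The controls this comment asks for could be sketched as follows. This is an illustration only: the fields of the PR's `AudioConfig` are not shown here, so `AudioIndexingOptions`, its defaults, and `run_audio_steps` are hypothetical, and the set of exception types treated as recoverable is an assumption.

```python
import logging
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class AudioIndexingOptions:
    """Hypothetical switches for the audio indexing phase."""
    enable_transcription: bool = True
    enable_ambient: bool = True
    max_duration_seconds: Optional[float] = 600.0  # None disables the limit


def run_audio_steps(
    duration_seconds: float,
    options: AudioIndexingOptions,
    transcribe: Callable[[], Any],
    embed_ambient: Callable[[], Any],
) -> Dict[str, Any]:
    """Run the enabled audio steps for one file, isolating failures per provider."""
    results: Dict[str, Any] = {"transcript": None, "ambient": None}

    limit = options.max_duration_seconds
    if limit is not None and duration_seconds > limit:
        logger.info("Skipping audio indexing: %.1fs exceeds limit", duration_seconds)
        return results

    steps = [
        ("transcript", options.enable_transcription, transcribe),
        ("ambient", options.enable_ambient, embed_ambient),
    ]
    for name, enabled, step in steps:
        if not enabled:
            continue
        try:
            results[name] = step()
        except (RuntimeError, OSError, ValueError) as exc:
            # One failing provider must not prevent the other from running.
            logger.warning("Audio step %s failed: %s", name, exc)
    return results
```

Passing the two providers as zero-argument callables keeps the sketch independent of how the transcription and CLAP models are actually loaded.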
If we're providing transcripts, shouldn't we also use BM25 for them? That would be more useful, right?
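
To make the BM25 suggestion concrete, the sketch below scores transcripts with a plain Okapi BM25 written from scratch. It does not reflect how bm25_service.py works internally; the function `bm25_scores`, the whitespace tokenizer, and the `k1`/`b` defaults are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict


def bm25_scores(
    query: str,
    transcripts: Dict[str, str],
    k1: float = 1.5,
    b: float = 0.75,
) -> Dict[str, float]:
    """Score each transcript against the query with Okapi BM25."""
    # Naive lowercase whitespace tokenization; a real index would do better.
    tokenized = {doc_id: text.lower().split() for doc_id, text in transcripts.items()}
    n_docs = len(tokenized)
    if n_docs == 0:
        return {}

    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many transcripts each term appears.
    doc_freq: Counter = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    query_terms = query.lower().split()
    scores: Dict[str, float] = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = 1.0 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores[doc_id] = score
    return scores
```

Keyword scores like these could complement the MiniLM transcript embeddings when a query contains an exact spoken phrase or name.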

    const audio = document.createElement('audio');
    audio.src = `http://${window.location.hostname}:${port}/images/${encodeURIComponent(item.path)}`;

Audio playback uses encodeURIComponent(item.path), while image/video paths usually use media_id || path. This bypasses the normalized media-id flow and is less consistent with local/remote source handling.
1. Description
This PR introduces semantic audio retrieval and video-audio retrieval support into the Semantixel platform, enabling users to search and retrieve content using audio semantics across multiple modalities.
The implementation extends the existing semantic retrieval architecture by integrating:
The system now supports semantic search over:
Videos without audio streams are automatically skipped during audio-based retrieval processing.
2. Problem Statement
Before the introduction of audio integration, the Semantixel retrieval pipeline primarily focused on semantic retrieval for image and video modalities. While the system was capable of understanding contextual meaning for these supported formats, it lacked dedicated support for audio content and audio-aware video retrieval.
Existing limitations included:
As a result, multimedia retrieval capabilities were limited to existing supported modalities, leaving audio-rich content semantically inaccessible within the retrieval pipeline.
3. Goal
To extend the multimodal retrieval capabilities of Semantixel by incorporating semantic understanding of speech and ambient audio across audio and video files.
4. Design Overview
Hybrid Audio & Video Pipeline
The system utilizes a modular extraction and embedding workflow:
5. System Flow
flowchart TD
    A[User Query] --> B[Semantic Query Encoder]
    subgraph "Indexing Phase"
        C[Visual Frames / Images] --> D[CLIP Embedding Collection]
        E[Audio Files] --> F[Faster-Whisper Transcript Generation]
        F --> G[MiniLM Transcript Embeddings]
        E --> H[CLAP Ambient Sound Embeddings]
        I[Video Files] --> J[FFmpeg Audio Extraction]
        J --> K{Audio Track Available?}
        K -->|Yes| F
        K -->|No| L[Skip Audio-Based Video Retrieval]
    end
    G --> M[Unified Vector Database]
    H --> M
    D --> M
    B --> N[Semantic Vector Search]
    N --> M
    M --> O[Cross-Modal Retrieval Results]
    O --> P[UI Playback / Retrieval Layer]

6. Key Changes in Architecture
Introduced Components
- semantixel/providers/audio/faster_whisper_provider.py
- semantixel/providers/audio/clap_provider.py
- Audio Preprocessing: Implemented optimized audio-loading buffers using librosa for efficient in-memory feature extraction.
Updated Components
- semantixel/services/index_service.py
- semantixel/services/search_service.py
- UI/Semantixel WebUI/index.html
Key Functionalities to Validate
Tech Stack / Libraries Used