Implement Semantic Audio Integration for Audio and Video files by ShrutiBachal · Pull Request #7 · KarunyaChavan/Semantixel-Semantic_Image_Retrieval

ShrutiBachal · 2026-05-06T18:45:23Z

1. Description

This PR introduces semantic audio retrieval and video-audio retrieval support into the Semantixel platform, enabling users to search and retrieve content using audio semantics across multiple modalities.

The implementation extends the existing semantic retrieval architecture by integrating:

Audio transcript generation using Faster-Whisper
Text embedding generation using MiniLM
Ambient Intelligence: CLAP embeddings for non-speech sound recognition (e.g., "rain", "applause").
Unified Video Processing: Automatic audio stream extraction from videos using FFmpeg/PyAV.

The system now supports semantic search over:

Spoken audio content
Ambient/non-speech audio patterns
Videos containing audio tracks

Videos without audio streams are automatically skipped during audio-based retrieval processing.
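
The skip rule above can be sketched as a small check on probe output. This is only an illustration, assuming the stream list comes from `ffprobe -show_streams -of json`; the helper names `ffprobe_command` and `has_audio_stream` are hypothetical and are not taken from this PR, which may detect audio tracks differently (for example through PyAV).

```python
import json


def ffprobe_command(path: str) -> list[str]:
    """Build an ffprobe invocation that prints stream metadata as JSON."""
    return ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path]


def has_audio_stream(probe_output: str) -> bool:
    """Return True if the ffprobe JSON output lists at least one audio stream."""
    try:
        data = json.loads(probe_output)
    except json.JSONDecodeError:
        # Unreadable probe output is treated as "no audio" so the file is skipped.
        return False
    if not isinstance(data, dict):
        return False
    return any(s.get("codec_type") == "audio" for s in data.get("streams", []))


# Hand-written sample outputs (assumed shape), not captured from a real run.
silent = '{"streams": [{"codec_type": "video"}]}'
with_sound = '{"streams": [{"codec_type": "video"}, {"codec_type": "audio"}]}'
print(has_audio_stream(silent), has_audio_stream(with_sound))
```

Keeping the parsing separate from the subprocess call makes the skip decision testable without FFmpeg installed.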

2. Problem Statement

Before the introduction of audio integration, the Semantixel retrieval pipeline primarily focused on semantic retrieval for image and video modalities. While the system was capable of understanding contextual meaning for these supported formats, it lacked dedicated support for audio content and audio-aware video retrieval.

Existing limitations included:

No support for semantic retrieval from audio files
Inability to search using spoken content from multimedia data
Lack of ambient sound understanding for non-speech audio
No extraction or processing of audio streams from video files
Videos could not participate in semantic retrieval through their audio context

As a result, multimedia retrieval capabilities were limited to existing supported modalities, leaving audio-rich content semantically inaccessible within the retrieval pipeline.

3. Goal

To extend the multimodal retrieval capabilities of Semantixel by incorporating semantic understanding of speech and ambient audio across audio and video files.

4. Design Overview

Hybrid Audio & Video Pipeline

The system utilizes a modular extraction and embedding workflow:

Extraction: Videos are scanned; audio tracks are extracted into memory-buffered streams to avoid disk clutter.
Speech Path: Faster-Whisper generates transcripts, which are then vectorized via MiniLM.
Ambient Path: The raw audio signal is passed to CLAP for environmental sound classification.
Unified ID System: All fragments (visual, speech, sound) are linked back to a single Base64-encoded MediaDescriptor, ensuring that a match in a transcript correctly highlights the parent video file in the UI.
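
A minimal sketch of how fragment IDs might link back to one parent descriptor. The `source|locator` payload layout and every function name here are assumptions made for illustration; only the `:::` separator and the `audio` / numeric-timestamp suffixes appear in the review comments on this PR.

```python
import base64

SEPARATOR = ":::"  # fragment separator seen in the IDs quoted in review comments


def encode_media_id(source: str, locator: str) -> str:
    """Encode a source/locator pair as a URL-safe Base64 token (assumed layout)."""
    raw = f"{source}|{locator}".encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_media_id(media_id: str) -> tuple[str, str]:
    """Recover (source, locator) from a token produced by encode_media_id."""
    raw = base64.urlsafe_b64decode(media_id.encode("ascii")).decode("utf-8")
    source, _, locator = raw.partition("|")
    return source, locator


def fragment_id(media_id: str, suffix: str) -> str:
    """Attach a fragment suffix such as 'audio' or a frame timestamp."""
    return f"{media_id}{SEPARATOR}{suffix}"


def parent_media_id(doc_id: str) -> str:
    """Strip the fragment suffix, if any, to get the parent media ID."""
    return doc_id.split(SEPARATOR, 1)[0]


video = encode_media_id("local", "/media/talk.mp4")
transcript = fragment_id(video, "audio")
frame = fragment_id(video, "12.000000")
print(parent_media_id(transcript) == parent_media_id(frame) == video)
```

The URL-safe Base64 alphabet contains no colon, so splitting on `:::` cannot cut into the encoded token.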

5. System Flow

flowchart TD
    A[User Query] --> B[Semantic Query Encoder]

    subgraph "Indexing Phase"
        C[Visual Frames / Images] --> D[CLIP Embedding Collection]

        E[Audio Files] --> F[Faster-Whisper Transcript Generation]
        F --> G[MiniLM Transcript Embeddings]

        E --> H[CLAP Ambient Sound Embeddings]

        I[Video Files] --> J[FFmpeg Audio Extraction]
        J --> K{Audio Track Available?}

        K -->|Yes| F
        K -->|No| L[Skip Audio-Based Video Retrieval]
    end

    G --> M[Unified Vector Database]
    H --> M
    D --> M

    B --> N[Semantic Vector Search]
    N --> M

    M --> O[Cross-Modal Retrieval Results]
    O --> P[UI Playback / Retrieval Layer]

6. Key Changes in Architecture

Introduced Components

semantixel/providers/audio/faster_whisper_provider.py

Optimized Inference: Integrated the Faster-Whisper (CTranslate2) engine for high-performance speech-to-text transcription.

semantixel/providers/audio/clap_provider.py

Ambient Sound Understanding: Integrated the CLAP (Contrastive Language-Audio Pretraining) model for semantic audio embedding.
Non-Speech Retrieval: Enabled the system to retrieve videos based on sound categories like "music", "applause", or "nature".
Audio Preprocessing: Implemented optimized audio-loading buffers using librosa for efficient in-memory feature extraction.

Updated Components

semantixel/services/index_service.py

Refactored Indexing Pipeline: Separated visual and audio processing workflows to allow for multi-stage media ingestion.
Second-Pass Processing: Implemented a mechanism to re-scan video files specifically for audio-track extraction and semantic indexing.
Unified Multimedia Support: Integrated Whisper (speech) and CLAP (ambient sound) generation into the core indexing loop.

semantixel/services/search_service.py

Multi-Modal Retrieval: Enhanced search logic to query the new Ambient Sound and Audio Transcript collections simultaneously.
ID Normalization: Implemented mapping logic to translate raw Base64-encoded locators into human-readable paths for the UI.
Fragment Mapping: Added logic to reconnect specific audio segments or timestamps back to their parent video files.

UI/Semantixel WebUI/index.html

State Synchronization: Implemented real-time playback synchronization between gallery thumbnails and the enlarged video view.
Bidirectional Controls: Added listeners to mirror Play, Pause, and Seek events across the background and foreground players.
Global Media Policy: Developed a "Single-Media" playback rule that automatically silences background audio when a new video is played.

Key Functionalities to Validate

Semantic retrieval from speech transcripts
Ambient sound retrieval using CLAP embeddings
FFmpeg-based audio extraction from videos
Correct skipping of videos without audio streams
Mapping of transcript/sound fragments back to original files

Tech Stack / Libraries Used

Faster-Whisper
Sentence Transformers (MiniLM)
CLAP
FFmpeg


Copilot

Pull request overview

This PR adds semantic audio support to Semantixel’s indexing + retrieval pipeline by introducing audio transcription (Whisper variants), ambient sound embeddings (CLAP), and UI support for displaying audio results and improved video playback syncing.

Changes:

Extend scanning/indexing to include common audio formats and create an ambient-audio vector collection.
Add audio model providers (HF Whisper + Faster-Whisper) and CLAP provider; integrate them into ModelManager, indexing, and unified semantic search.
Update WebUI to support “Audio Only” filtering, render audio results, and synchronize thumbnail/enlarged video playback.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 8 comments.

File	Description
UI/Semantixel WebUI/styles.css	Adds styling for audio result cards and an audio badge.
UI/Semantixel WebUI/index.html	Adds “Audio Only” filter, renders audio results, and adds thumbnail/enlarged video sync logic.
semantixel/utils/scan_utils.py	Expands scanned media extensions to include audio formats.
semantixel/sources/google_drive_source.py	Stores OAuth PKCE code verifier and passes it into token exchange when available.
semantixel/services/search_service.py	Adds unified semantic search across visual + text + ambient-audio collections and updates ID/type handling.
semantixel/services/model_manager.py	Adds lazy-loading accessors for audio transcription and CLAP models.
semantixel/services/index_service.py	Adds an ambient-audio collection and a second-phase audio indexing pass for audio/video files.
semantixel/services/bm25_service.py	Updates media-type filtering logic to account for audio transcript doc IDs.
semantixel/providers/text/hf_provider.py	Modifies local-only loading behavior for MiniLM embeddings.
semantixel/providers/base.py	Introduces an `AudioProvider` abstract base class.
semantixel/providers/audio/hf_audio_provider.py	New HF Whisper transcription provider using `transformers.pipeline`.
semantixel/providers/audio/faster_whisper_provider.py	New Faster-Whisper transcription provider.
semantixel/providers/audio/clap_provider.py	New CLAP provider for ambient audio + text embeddings.
semantixel/providers/audio/__init__.py	Exposes audio providers via package init.
semantixel/core/config.py	Adds `AudioConfig` into the main settings model.
requirements.txt	Adds Google API dependencies (but currently missing required audio deps used by new providers).

KarunyaChavan

@ShrutiBachal Please ensure that all of the Copilot feedback above is addressed, and once the changes are done, update the PR description accordingly. Also remember to update the docstrings to match the new implementation.

KarunyaChavan

@ShrutiBachal I'm leaving a few comments regarding the coding standards in this implementation. There are concerns about potential redundancy, and about whether the SOLID and DRY principles have been properly followed. Could you please review and address these comments?

KarunyaChavan · 2026-05-19T07:17:21Z

+from semantixel.providers.audio.clap_provider import HFAudioCLAPProvider

 class ModelManager:
    """


Empty string?

Didn't understand the comment on line 12.

KarunyaChavan

There are unused imports, inconsistent indentation like in _infer_media_type, eager imports, broad except Exception, and comments like “instantly!” / “VRAM/RAM” that feel casual.

KarunyaChavan · 2026-05-20T11:08:33Z

@@ -0,0 +1,77 @@
+import torch
+import librosa


librosa and faster-whisper are used but not added to requirements.txt

KarunyaChavan · 2026-05-20T11:10:20Z

+        ambient_results = self._query_collection(
+            model_manager.clap.get_text_embeddings,
+            self.audio_collection,
+            query,
+            query_k
        )


semantic_text_search() always queries audio_collection using model_manager.clap.get_text_embeddings.
A normal caption search now loads a large audio model and fails if CLAP/audio dependencies are unavailable. Audio search should be optional, guarded, or isolated so that existing image/text search still works.

KarunyaChavan · 2026-05-20T11:12:03Z

+    def _infer_media_type(self, doc_id: str) -> str:
+            """
+            Infer media type from document ID.
+            """
+            if ":::" not in doc_id:
+                return "image"
+
+            postfix = doc_id.split(":::")[-1]
+
+            if postfix in {"audio", "video"}:
+                return postfix
+
+            return "unknown"


I see that we're treating IDs without ::: as images and only recognizing the postfixes audio or video. Existing video frame IDs use numeric timestamps like local|...:::12.000000, so they become unknown, right? Can we please fix this?
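
A possible shape for the fix, sketched as a standalone function. It assumes that a purely numeric postfix is always a video frame timestamp in seconds, as described in the comment above; the regex and the sample IDs are illustrative.

```python
import re

# Matches non-negative decimal timestamps such as "12" or "12.000000".
_TIMESTAMP = re.compile(r"\d+(\.\d+)?")


def infer_media_type(doc_id: str) -> str:
    """Infer the media type from a document ID."""
    if ":::" not in doc_id:
        return "image"

    postfix = doc_id.rsplit(":::", 1)[-1]

    if postfix in {"audio", "video"}:
        return postfix
    if _TIMESTAMP.fullmatch(postfix):
        # Assumption: numeric postfixes are video frame timestamps.
        return "video"
    return "unknown"


print(infer_media_type("local|abc:::12.000000"))
```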

KarunyaChavan · 2026-05-20T11:13:41Z

+                            metadatas=[{
+                                "source": media.source,
+                                "source_media_id": media.media_id,
+                                "locator": media.locator,
+                                "display_path": media.display_path,
+                                "type": "audio"
+                            }]


Both video transcripts and pure audio files are stored as type="audio", and later _process_item_id() infers video via file extension. That makes behavior depend on locator/path formatting instead of explicit metadata.

Maybe store these separately instead:

- media_type → video / audio
- match_modality → audio_transcript / audio
- source_media_id for derived assets
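
As a sketch of what that explicit metadata could look like: the builder function and its validation are hypothetical, while the existing keys (`source`, `source_media_id`, `locator`, `display_path`) are copied from the diff above.

```python
MEDIA_TYPES = {"video", "audio"}
MATCH_MODALITIES = {"audio_transcript", "audio"}


def build_audio_metadata(
    source: str,
    media_id: str,
    locator: str,
    display_path: str,
    media_type: str,
    match_modality: str,
) -> dict:
    """Build metadata that records what the file is and how it was matched."""
    if media_type not in MEDIA_TYPES:
        raise ValueError(f"unsupported media_type: {media_type!r}")
    if match_modality not in MATCH_MODALITIES:
        raise ValueError(f"unsupported match_modality: {match_modality!r}")
    return {
        "source": source,
        "source_media_id": media_id,
        "locator": locator,
        "display_path": display_path,
        "media_type": media_type,          # what the parent file is
        "match_modality": match_modality,  # which signal produced the entry
    }


meta = build_audio_metadata(
    "local", "abc123", "/media/talk.mp4", "talk.mp4", "video", "audio_transcript"
)
print(meta["media_type"], meta["match_modality"])
```

With both fields stored, a transcript hit on a video no longer needs to be reclassified from its file extension at query time.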

KarunyaChavan · 2026-05-20T11:15:55Z

+        s = 1.0 - raw_distance
+        if model_type == "clip":
+            s_min, s_max = 0.12, 0.30
+        elif model_type == "minilm":
+            s_min, s_max = 0.20, 0.70
+        elif model_type == "clap":
+            s_min, s_max = 0.10, 0.28
+        else:
+            s_min, s_max = 0.0, 1.0
+
+        s_norm = (s - s_min) / (s_max - s_min) if s_max > s_min else 0.0
+        s_norm = max(0.0, min(1.0, s_norm))
+        return 1.0 - s_norm


_normalize_distance() uses hardcoded empirical ranges and converts scores back into a distance-like value, which _filter_results() then converts again. This makes the scoring pipeline hard to reason about and can silently skew rankings across different embedding models (CLIP, MiniLM, CLAP).

Might be cleaner to normalize once into a single canonical similarity score and keep downstream logic model-agnostic. I think this may be what is distorting the top-k retrieval order you were seeing.
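
A rough sketch of the single-normalization idea. It reuses the empirical ranges from the diff above and assumes, as the diff does, that similarity is `1.0 - raw_distance`; the function names and the `(doc_id, raw_distance, model_type)` hit shape are hypothetical.

```python
# Empirical similarity ranges copied from the diff; they remain tunable guesses.
SIMILARITY_RANGES = {
    "clip": (0.12, 0.30),
    "minilm": (0.20, 0.70),
    "clap": (0.10, 0.28),
}


def to_similarity(raw_distance: float, model_type: str) -> float:
    """Map a raw distance to a similarity in [0, 1], where higher is better."""
    s_min, s_max = SIMILARITY_RANGES.get(model_type, (0.0, 1.0))
    if s_max <= s_min:
        return 0.0
    s = 1.0 - raw_distance
    return max(0.0, min(1.0, (s - s_min) / (s_max - s_min)))


def merge_ranked(hits: list[tuple[str, float, str]], top_k: int) -> list[tuple[str, float]]:
    """Normalize each hit once, then rank all collections on the same scale."""
    scored = [(doc_id, to_similarity(dist, model)) for doc_id, dist, model in hits]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


hits = [("frame-1", 0.79, "clip"), ("talk-1", 0.40, "minilm"), ("rain-1", 0.95, "clap")]
print([doc_id for doc_id, _ in merge_ranked(hits, top_k=2)])
```

Downstream filtering can then apply one threshold to the similarity value without knowing which model produced it.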

KarunyaChavan · 2026-05-20T11:17:17Z

+            # PHASE 2: Process Audio Constraints Sequentially
+            for media in audio_items:
+                # 1. Transcript indexing
+                transcript_id = f"{media.media_id}:::audio"
+                transcript_results = self.text_collection.get(ids=[transcript_id])
+
+                if not transcript_results["ids"]:
+                    transcript = model_manager.audio.transcribe(media.locator)


Every scan now transcribes and embeds ambient audio for all videos/audio files. There is no config flag, no max duration setting, no error isolation per provider, and no clear way to disable CLAP/transcription.

If we're producing transcripts, shouldn't we also index them with BM25? That would likely be more useful.

KarunyaChavan · 2026-05-20T11:18:46Z

+                            const audio = document.createElement('audio');
+                            audio.src = `http://${window.location.hostname}:${port}/images/${encodeURIComponent(item.path)}`;


Audio playback uses encodeURIComponent(item.path), while image/video paths usually use media_id || path. This bypasses the normalized media-id flow and is less consistent with local/remote source handling.

ShrutiBachal added 5 commits May 1, 2026 23:20

feat: implement semantic audio ingestion pipeline for unified cross-m…

c265ea8

…odal image, text, and audio search

Merge branch 'main' into feature/audio-integration

88c6151

feat: optimize multimodal audio indexing

dcf4ede

- Integrated Faster-Whisper (CTranslate2) for accelerated audio transcription and background processing.

feat: implement video retrieval based on it's audio

fe08c91

fix : synchronize state, audio for background video and enlarged video

26d1ad3

Copilot AI review requested due to automatic review settings May 6, 2026 18:45

ShrutiBachal added the enhancement label May 6, 2026

Copilot started reviewing on behalf of ShrutiBachal May 6, 2026 18:46

ShrutiBachal assigned rockers2004, KarunyaChavan and taslim121 and unassigned rockers2004, KarunyaChavan and taslim121 May 6, 2026

Copilot AI reviewed May 6, 2026

KarunyaChavan requested changes May 7, 2026

KarunyaChavan marked this pull request as draft May 7, 2026 08:00

ShrutiBachal added 4 commits May 18, 2026 20:34

fix (services) : restore fallback download logic if model not found l…

ae4625d

…ocally

fix (index_service) :

c715dea

- update indexing progress logic - update audio indexing skip check logic to use specific document ids - resolve silent video indexing issue

fix : resolve the synchronization issue of thumbnail video and enlarg…

9665a8f

…ed video

fix: improve media parsing and cross-collection sorting

e076586

KarunyaChavan marked this pull request as ready for review May 19, 2026 07:06

KarunyaChavan requested review from KarunyaChavan, rockers2004 and taslim121 May 19, 2026 07:06

KarunyaChavan requested changes May 19, 2026

ShrutiBachal added 4 commits May 19, 2026 23:45

fix: replace hardcoded default parameter and remove AI signatures

077faa1

fix: replace hardcoded default parameter and add clean comment for fa…

472637b

…st processing logic

fix: remove inline comment and add clean comment above logic

f01275f

fix:

0a3e273

- encapsulate collection query logic in a function - add DocComment for _normalize_distance function - restructure parameters for path_val in multiple lines using 'or' operator

ShrutiBachal added 2 commits May 20, 2026 12:35

fix: add DocComment for transcription logic

01658a2

fix: add descriptive DocComments for classes

d4205d5

rockers2004 reviewed May 20, 2026

Comment thread semantixel/services/bm25_service.py Outdated

ShrutiBachal and others added 2 commits May 20, 2026 15:48

fix: abstract inline parsing logic into reusable helper

ab0c0f7

Merge branch 'main' into feature/audio-integration

15984ae

ShrutiBachal requested a review from KarunyaChavan May 20, 2026 10:46

KarunyaChavan requested changes May 20, 2026

KarunyaChavan assigned ShrutiBachal May 20, 2026

KarunyaChavan added the bug and dependencies labels May 20, 2026

ShrutiBachal marked this pull request as draft May 22, 2026 07:02

5 participants