Implement Semantic Audio Integration for Audio and Video files #7

Draft
ShrutiBachal wants to merge 17 commits into KarunyaChavan:main from ShrutiBachal:feature/audio-integration

@ShrutiBachal

Collaborator

1. Description

This PR introduces semantic audio retrieval and video-audio retrieval support into the Semantixel platform, enabling users to search and retrieve content using audio semantics across multiple modalities.

The implementation extends the existing semantic retrieval architecture by integrating:

  • Audio transcript generation using Faster-Whisper
  • Text embedding generation using MiniLM
  • Ambient Intelligence: CLAP embeddings for non-speech sound recognition (e.g., "rain", "applause").
  • Unified Video Processing: Automatic audio stream extraction from videos using FFmpeg/PyAV.

The system now supports semantic search over:

  • Spoken audio content
  • Ambient/non-speech audio patterns
  • Videos containing audio tracks

Videos without audio streams are automatically skipped during audio-based retrieval processing.
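A minimal sketch of how the skip decision could be made, assuming the check is done by reading `ffprobe`'s JSON report. The helper names `has_audio_stream` and `probe_media` are hypothetical and do not come from this PR; the actual implementation may use PyAV instead.

```python
import json
import subprocess


def has_audio_stream(probe: dict) -> bool:
    """Return True if an ffprobe JSON report lists at least one audio stream."""
    return any(
        stream.get("codec_type") == "audio"
        for stream in probe.get("streams", [])
    )


def probe_media(path: str) -> dict:
    """Run ffprobe (assumed to be on PATH) and parse its JSON stream report."""
    completed = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)
```

Keeping the decision in a pure function (`has_audio_stream`) makes the "skip silent videos" rule testable without invoking FFmpeg.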


2. Problem Statement

Before the introduction of audio integration, the Semantixel retrieval pipeline primarily focused on semantic retrieval for image and video modalities. While the system was capable of understanding contextual meaning for these supported formats, it lacked dedicated support for audio content and audio-aware video retrieval.

Existing limitations included:

  • No support for semantic retrieval from audio files
  • Inability to search using spoken content from multimedia data
  • Lack of ambient sound understanding for non-speech audio
  • No extraction or processing of audio streams from video files
  • Videos could not participate in semantic retrieval through their audio context

As a result, multimedia retrieval capabilities were limited to existing supported modalities, leaving audio-rich content semantically inaccessible within the retrieval pipeline.


3. Goal

To extend the multimodal retrieval capabilities of Semantixel by incorporating semantic understanding of speech and ambient audio across audio and video files.


4. Design Overview

Hybrid Audio & Video Pipeline

The system utilizes a modular extraction and embedding workflow:

  • Extraction: Videos are scanned; audio tracks are extracted into memory-buffered streams to avoid disk clutter.
  • Speech Path: Faster-Whisper generates transcripts, which are then vectorized via MiniLM.
  • Ambient Path: The raw audio signal is passed to CLAP, which produces embeddings for environmental (non-speech) sounds.
  • Unified ID System: All fragments (visual, speech, sound) are linked back to a single Base64-encoded MediaDescriptor, ensuring that a match in a transcript correctly highlights the parent video file in the UI.
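The unified ID idea can be sketched as follows. This is an illustration only: the real MediaDescriptor layout is not shown in this PR, so the Base64 encoding of a bare path and the helper names below are assumptions. The `:::` separator matches the document IDs quoted in the review (for example `:::audio`).

```python
import base64

FRAGMENT_SEP = ":::"  # separator seen in the document IDs quoted in the review


def encode_locator(path: str) -> str:
    """Encode a locator as URL-safe Base64 (assumed descriptor format)."""
    return base64.urlsafe_b64encode(path.encode("utf-8")).decode("ascii")


def decode_locator(token: str) -> str:
    """Recover the human-readable locator from its Base64 form."""
    return base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")


def make_fragment_id(media_id: str, fragment: str) -> str:
    """Build a fragment ID such as '<media_id>:::audio'."""
    return f"{media_id}{FRAGMENT_SEP}{fragment}"


def parent_media_id(doc_id: str) -> str:
    """Map any fragment ID back to the ID of its parent media file."""
    return doc_id.split(FRAGMENT_SEP, 1)[0]
```

With this shape, a transcript hit and a visual-frame hit for the same video resolve to the same parent ID, which is what lets the UI highlight a single file.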

5. System Flow

flowchart TD
    A[User Query] --> B[Semantic Query Encoder]

    subgraph "Indexing Phase"
        C[Visual Frames / Images] --> D[CLIP Embedding Collection]

        E[Audio Files] --> F[Faster-Whisper Transcript Generation]
        F --> G[MiniLM Transcript Embeddings]

        E --> H[CLAP Ambient Sound Embeddings]

        I[Video Files] --> J[FFmpeg Audio Extraction]
        J --> K{Audio Track Available?}

        K -->|Yes| F
        K -->|No| L[Skip Audio-Based Video Retrieval]
    end

    G --> M[Unified Vector Database]
    H --> M
    D --> M

    B --> N[Semantic Vector Search]
    N --> M

    M --> O[Cross-Modal Retrieval Results]
    O --> P[UI Playback / Retrieval Layer]

6. Key Changes in Architecture

Introduced Components

semantixel/providers/audio/faster_whisper_provider.py

  • Optimized Inference: Integrated the Faster-Whisper (CTranslate2) engine for high-performance speech-to-text transcription.

semantixel/providers/audio/clap_provider.py

  • Ambient Sound Understanding: Integrated the CLAP (Contrastive Language-Audio Pretraining) model for semantic audio embedding.
  • Non-Speech Retrieval: Enabled the system to retrieve videos based on sound categories like "music", "applause", or "nature".
  • Audio Preprocessing: Implemented optimized audio-loading buffers using librosa for efficient in-memory feature extraction.

Updated Components

semantixel/services/index_service.py

  • Refactored Indexing Pipeline: Separated visual and audio processing workflows to allow for multi-stage media ingestion.
  • Second-Pass Processing: Implemented a mechanism to re-scan video files specifically for audio-track extraction and semantic indexing.
  • Unified Multimedia Support: Integrated Whisper (speech) and CLAP (ambient sound) generation into the core indexing loop.

semantixel/services/search_service.py

  • Multi-Modal Retrieval: Enhanced search logic to query the new Ambient Sound and Audio Transcript collections simultaneously.
  • ID Normalization: Implemented mapping logic to translate raw Base64-encoded locators into human-readable paths for the UI.
  • Fragment Mapping: Added logic to reconnect specific audio segments or timestamps back to their parent video files.

UI/Semantixel WebUI/index.html

  • State Synchronization: Implemented real-time playback synchronization between gallery thumbnails and the enlarged video view.
  • Bidirectional Controls: Added listeners to mirror Play, Pause, and Seek events across the background and foreground players.
  • Global Media Policy: Developed a "Single-Media" playback rule that automatically silences background audio when a new video is played.

Key Functionalities to Validate

  • Semantic retrieval from speech transcripts
  • Ambient sound retrieval using CLAP embeddings
  • FFmpeg-based audio extraction from videos
  • Correct skipping of videos without audio streams
  • Mapping of transcript/sound fragments back to original files

Tech Stack / Libraries Used

  • Faster-Whisper
  • Sentence Transformers (MiniLM)
  • CLAP
  • FFmpeg

Copilot AI review requested due to automatic review settings May 6, 2026 18:45
@ShrutiBachal ShrutiBachal added the enhancement New feature or request label May 6, 2026

Copilot AI left a comment

Pull request overview

This PR adds semantic audio support to Semantixel’s indexing + retrieval pipeline by introducing audio transcription (Whisper variants), ambient sound embeddings (CLAP), and UI support for displaying audio results and improved video playback syncing.

Changes:

  • Extend scanning/indexing to include common audio formats and create an ambient-audio vector collection.
  • Add audio model providers (HF Whisper + Faster-Whisper) and CLAP provider; integrate them into ModelManager, indexing, and unified semantic search.
  • Update WebUI to support “Audio Only” filtering, render audio results, and synchronize thumbnail/enlarged video playback.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 8 comments.

Show a summary per file
File | Description
--- | ---
UI/Semantixel WebUI/styles.css | Adds styling for audio result cards and an audio badge.
UI/Semantixel WebUI/index.html | Adds “Audio Only” filter, renders audio results, and adds thumbnail/enlarged video sync logic.
semantixel/utils/scan_utils.py | Expands scanned media extensions to include audio formats.
semantixel/sources/google_drive_source.py | Stores OAuth PKCE code verifier and passes it into token exchange when available.
semantixel/services/search_service.py | Adds unified semantic search across visual + text + ambient-audio collections and updates ID/type handling.
semantixel/services/model_manager.py | Adds lazy-loading accessors for audio transcription and CLAP models.
semantixel/services/index_service.py | Adds an ambient-audio collection and a second-phase audio indexing pass for audio/video files.
semantixel/services/bm25_service.py | Updates media-type filtering logic to account for audio transcript doc IDs.
semantixel/providers/text/hf_provider.py | Modifies local-only loading behavior for MiniLM embeddings.
semantixel/providers/base.py | Introduces an AudioProvider abstract base class.
semantixel/providers/audio/hf_audio_provider.py | New HF Whisper transcription provider using transformers.pipeline.
semantixel/providers/audio/faster_whisper_provider.py | New Faster-Whisper transcription provider.
semantixel/providers/audio/clap_provider.py | New CLAP provider for ambient audio + text embeddings.
semantixel/providers/audio/__init__.py | Exposes audio providers via package init.
semantixel/core/config.py | Adds AudioConfig into the main settings model.
requirements.txt | Adds Google API dependencies (but currently missing required audio deps used by new providers).

Comment thread semantixel/providers/text/hf_provider.py Outdated
Comment thread semantixel/services/index_service.py Outdated
Comment thread semantixel/services/index_service.py
Comment thread semantixel/services/index_service.py Outdated
Comment thread semantixel/services/search_service.py Outdated
Comment thread semantixel/services/search_service.py
Comment thread UI/Semantixel WebUI/index.html
Comment thread semantixel/utils/scan_utils.py Outdated

@KarunyaChavan KarunyaChavan left a comment

Owner

@ShrutiBachal Please make sure all of the Copilot feedback above is addressed, and update the PR description once the changes are done. Also remember to update the docstrings to match the new implementation.

@KarunyaChavan KarunyaChavan marked this pull request as draft May 7, 2026 08:00
- update indexing progress logic

- update audio indexing skip check logic to use specific document ids

- resolve silent video indexing issue
@KarunyaChavan KarunyaChavan marked this pull request as ready for review May 19, 2026 07:06

@KarunyaChavan KarunyaChavan left a comment

Owner

@ShrutiBachal I'm leaving a few comments about the coding standards in this implementation. There are concerns about potential redundancy, and about whether the SOLID and DRY principles have been followed. Could you please review and address these comments?

Comment thread semantixel/core/config.py
Comment thread semantixel/providers/audio/clap_provider.py Outdated
Comment thread semantixel/providers/audio/clap_provider.py Outdated
Comment thread semantixel/providers/audio/faster_whisper_provider.py Outdated
Comment thread semantixel/providers/audio/faster_whisper_provider.py Outdated
Comment thread semantixel/services/index_service.py Outdated
from semantixel.providers.audio.clap_provider import HFAudioCLAPProvider

class ModelManager:
    """

Owner

Empty string?

Collaborator Author

I didn't understand the comment on line 12.

Comment thread semantixel/services/search_service.py Outdated
Comment thread semantixel/services/search_service.py
Comment thread semantixel/services/search_service.py
- encapsulate collection query logic in a function

- add DocComment for _normalize_distance function

- restructure parameters for path_val in multiple lines using 'or' operator
Comment thread semantixel/services/bm25_service.py Outdated

@KarunyaChavan KarunyaChavan left a comment

Owner

There are unused imports, inconsistent indentation like in _infer_media_type, eager imports, broad except Exception, and comments like “instantly!” / “VRAM/RAM” that feel casual.
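One way to address the eager imports and the broad `except Exception` together is to import optional dependencies on first use and catch only `ImportError`. The sketch below is an assumption about how that could look; `require_optional` is a hypothetical helper, not part of this PR.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency on first use, with a clear error if absent.

    Only ImportError is caught, so unrelated failures inside the module
    still surface instead of being swallowed.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{feature} requires the '{module_name}' package, which is not installed"
        ) from exc
```

A provider could then call `require_optional("librosa", "CLAP audio loading")` inside its load method rather than importing at module level, so importing the provider package stays cheap.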

@@ -0,0 +1,77 @@
import torch
import librosa

Owner

librosa and faster-whisper are used but not added to requirements.txt

Comment on lines +192 to 197
ambient_results = self._query_collection(
    model_manager.clap.get_text_embeddings,
    self.audio_collection,
    query,
    query_k
)

Owner

semantic_text_search() always queries audio_collection using model_manager.clap.get_text_embeddings.
A normal caption search now loads a large audio model and fails if CLAP/audio dependencies are unavailable. Audio search should be optional, guarded, or isolated so existing image/text search still works
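A possible shape for the guard, sketched under assumptions: `query_optional` and the `enabled` flag are hypothetical names, and the exception types that a missing CLAP dependency raises are assumed to be `ImportError` or `RuntimeError`.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)


def query_optional(
    label: str,
    enabled: bool,
    run_query: Callable[[], List[dict]],
) -> List[dict]:
    """Run an optional search branch; return [] when disabled or unavailable.

    The query is passed as a callable so that nothing model-related is
    touched (or loaded) unless the branch is enabled.
    """
    if not enabled:
        return []
    try:
        return run_query()
    except (ImportError, RuntimeError) as exc:
        logger.warning("Skipping %s search: %s", label, exc)
        return []
```

In `semantic_text_search()` the ambient branch could then be wrapped as `query_optional("ambient audio", audio_enabled, lambda: self._query_collection(...))`, leaving the image and text branches unaffected when audio support is missing.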

Comment on lines +70 to +82
def _infer_media_type(self, doc_id: str) -> str:
    """
    Infer media type from document ID.
    """
    if ":::" not in doc_id:
        return "image"

    postfix = doc_id.split(":::")[-1]

    if postfix in {"audio", "video"}:
        return postfix

    return "unknown"

Owner

I see that we treat IDs without ::: as images and only recognize the postfixes audio and video. Existing video frame IDs use numeric timestamps like local|...:::12.000000, so they become unknown, right? Can we please fix this?
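A minimal sketch of one possible fix, assuming frame IDs always end in a plain decimal timestamp as in the example above. It is written as a free function for illustration; the regular expression is an assumption about the timestamp format.

```python
import re

# Assumed format for video-frame postfixes, e.g. "12.000000" or "7"
_TIMESTAMP_RE = re.compile(r"\d+(\.\d+)?")


def infer_media_type(doc_id: str) -> str:
    """Infer media type from a document ID of the form '<media_id>[:::<postfix>]'."""
    if ":::" not in doc_id:
        return "image"

    postfix = doc_id.rsplit(":::", 1)[-1]

    if postfix in {"audio", "video"}:
        return postfix

    # Numeric postfixes are frame timestamps, so the parent is a video.
    if _TIMESTAMP_RE.fullmatch(postfix):
        return "video"

    return "unknown"
```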

Comment on lines +193 to +199
metadatas=[{
    "source": media.source,
    "source_media_id": media.media_id,
    "locator": media.locator,
    "display_path": media.display_path,
    "type": "audio"
}]

Owner

Both video transcripts and pure audio files are stored as type="audio", and later _process_item_id() infers video via file extension. That makes behavior depend on locator/path formatting instead of explicit metadata.

Maybe store these separately instead:

  • media_type: video / audio
  • match_modality: audio_transcript / audio
  • source_media_id for derived assets
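The suggested metadata split could look roughly like this. `build_audio_metadata` is a hypothetical helper; the existing keys are taken from the diff above, and the two new keys follow the names proposed in this comment.

```python
ALLOWED_MEDIA_TYPES = {"audio", "video"}
ALLOWED_MODALITIES = {"audio_transcript", "audio"}


def build_audio_metadata(
    source: str,
    media_id: str,
    locator: str,
    display_path: str,
    media_type: str,
    match_modality: str,
) -> dict:
    """Build metadata that states the media type explicitly instead of
    leaving it to be inferred later from the file extension."""
    if media_type not in ALLOWED_MEDIA_TYPES:
        raise ValueError(f"unsupported media_type: {media_type!r}")
    if match_modality not in ALLOWED_MODALITIES:
        raise ValueError(f"unsupported match_modality: {match_modality!r}")
    return {
        "source": source,
        "source_media_id": media_id,
        "locator": locator,
        "display_path": display_path,
        "media_type": media_type,          # what the parent file is
        "match_modality": match_modality,  # which signal produced the match
    }
```

With explicit fields, a transcript derived from a video carries `media_type="video"` and `match_modality="audio_transcript"`, so `_process_item_id()` would not need to look at the path at all.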

Comment on lines +142 to +154
s = 1.0 - raw_distance
if model_type == "clip":
    s_min, s_max = 0.12, 0.30
elif model_type == "minilm":
    s_min, s_max = 0.20, 0.70
elif model_type == "clap":
    s_min, s_max = 0.10, 0.28
else:
    s_min, s_max = 0.0, 1.0

s_norm = (s - s_min) / (s_max - s_min) if s_max > s_min else 0.0
s_norm = max(0.0, min(1.0, s_norm))
return 1.0 - s_norm

Owner

_normalize_distance() uses hardcoded empirical ranges and converts scores back into a distance-like value, which _filter_results() then converts again. This makes the scoring pipeline hard to reason about and can silently skew rankings across different embedding models (CLIP, MiniLM, CLAP).

It might be cleaner to normalize once into a single canonical similarity score and keep the downstream logic model-agnostic. I think the double conversion may be behind the distorted top-k ordering you were seeing.
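A sketch of the "normalize once" idea, reusing the empirical ranges from the diff above but returning a similarity in [0, 1] directly. The names `SIMILARITY_RANGES`, `to_similarity`, and `rank_by_similarity` are hypothetical, and the ranges themselves would arguably belong in configuration.

```python
from typing import List, Tuple

# Per-model (min, max) raw similarity ranges, copied from the diff above.
SIMILARITY_RANGES = {
    "clip": (0.12, 0.30),
    "minilm": (0.20, 0.70),
    "clap": (0.10, 0.28),
}


def to_similarity(raw_distance: float, model_type: str) -> float:
    """Convert a raw distance into one canonical similarity in [0, 1]."""
    s_min, s_max = SIMILARITY_RANGES.get(model_type, (0.0, 1.0))
    similarity = 1.0 - raw_distance
    normalized = (similarity - s_min) / (s_max - s_min)
    return max(0.0, min(1.0, normalized))


def rank_by_similarity(hits: List[Tuple[str, float, str]]) -> List[Tuple[str, float]]:
    """Rank (doc_id, raw_distance, model_type) hits, most similar first."""
    scored = [(doc_id, to_similarity(dist, model)) for doc_id, dist, model in hits]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because every downstream step sees the same "higher is better" score, filtering and merging across CLIP, MiniLM, and CLAP results need no further per-model conversion.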

Comment on lines +179 to +186
# PHASE 2: Process Audio Constraints Sequentially
for media in audio_items:
    # 1. Transcript indexing
    transcript_id = f"{media.media_id}:::audio"
    transcript_results = self.text_collection.get(ids=[transcript_id])

    if not transcript_results["ids"]:
        transcript = model_manager.audio.transcribe(media.locator)

Owner

Every scan now transcribes and embeds ambient audio for all videos/audio files. There is no config flag, no max duration setting, no error isolation per provider, and no clear way to disable CLAP/transcription.
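The missing controls could be sketched as below. Everything here is an assumption for illustration: `AudioIndexOptions` is not the PR's `AudioConfig`, the option names are hypothetical, and the set of exceptions treated as provider failures is a guess.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Dict, Optional

logger = logging.getLogger(__name__)

# Errors treated as a single provider failing (an assumption).
PROVIDER_ERRORS = (RuntimeError, OSError, ValueError)


@dataclass(frozen=True)
class AudioIndexOptions:
    enable_transcription: bool = True
    enable_ambient: bool = True
    max_duration_seconds: Optional[float] = None  # None means no limit


def index_audio(
    locator: str,
    duration_seconds: float,
    options: AudioIndexOptions,
    transcribe: Callable[[str], object],
    embed_ambient: Callable[[str], object],
) -> Dict[str, object]:
    """Run the enabled audio stages, isolating failures per provider."""
    limit = options.max_duration_seconds
    if limit is not None and duration_seconds > limit:
        return {}

    stages = []
    if options.enable_transcription:
        stages.append(("transcript", transcribe))
    if options.enable_ambient:
        stages.append(("ambient", embed_ambient))

    results: Dict[str, object] = {}
    for name, stage in stages:
        try:
            results[name] = stage(locator)
        except PROVIDER_ERRORS as exc:
            # One failing provider must not block the other stage.
            logger.warning("Audio stage %r failed for %s: %s", name, locator, exc)
    return results
```

Passing the providers in as callables keeps the loop independent of `model_manager`, so a disabled stage never triggers a model load.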

Owner

If we're providing transcripts, shouldn't we also use BM25 for them? That would be more useful, right?
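For reference, here is a self-contained sketch of Okapi BM25 scoring over transcript text. It is independent of how `bm25_service.py` is actually implemented; the whitespace tokenizer, the default `k1` and `b` values, and the example document IDs are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (deliberately simplistic)."""
    return text.lower().split()


def bm25_scores(
    query: str,
    transcripts: Dict[str, str],
    k1: float = 1.5,
    b: float = 0.75,
) -> Dict[str, float]:
    """Score each transcript against the query with Okapi BM25."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in transcripts.items()}
    n_docs = len(tokenized)
    total_len = sum(len(tokens) for tokens in tokenized.values())
    avg_len = total_len / n_docs if n_docs else 0.0

    # Number of documents containing each term.
    doc_freq: Counter = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    query_terms = tokenize(query)
    scores: Dict[str, float] = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = 1.0 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores[doc_id] = score
    return scores
```

Exact keyword hits in a transcript (names, numbers, jargon) tend to be where lexical scoring complements embedding search, which is the case this comment is pointing at.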

Comment on lines +310 to +311
const audio = document.createElement('audio');
audio.src = `http://${window.location.hostname}:${port}/images/${encodeURIComponent(item.path)}`;

Owner

Audio playback uses encodeURIComponent(item.path), while image/video paths usually use media_id || path. This bypasses the normalized media-id flow and is less consistent with local/remote source handling.

@KarunyaChavan KarunyaChavan added bug Something isn't working dependencies Pull requests that update a dependency file labels May 20, 2026
@ShrutiBachal ShrutiBachal marked this pull request as draft May 22, 2026 07:02
Labels

bug Something isn't working dependencies Pull requests that update a dependency file enhancement New feature or request

5 participants