AudioVoxBench: Multimodal Embedding Benchmark

AudioVoxBench is a standalone Swift command-line suite for evaluating semantic search relevance with the Gemini Embedding 2 model. It lets developers compare indexing strategies (text-only vs. interleaved multimodal) to find the one that yields the best search recall. For more background, see "Multimodal Ground Truth: Building AudioVoxBench".

Goals

Objective Measurement: Calculate Mean Reciprocal Rank (MRR) for various "ensemble" embedding strategies.
Data-Driven Roadmap: Determine whether adding raw audio and image data to embeddings improves the user search experience compared to rich text metadata alone.
Reproducibility: Provide a consistent "Golden Set" of tracks and queries to track semantic search improvements over time.
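
As a point of reference for the metric named above, here is a minimal sketch of Mean Reciprocal Rank in Swift. It assumes each query has exactly one relevant track; the function name `meanReciprocalRank` and the track IDs are hypothetical and are not taken from the AudioVoxBench sources.

```swift
import Foundation

/// Mean Reciprocal Rank over a set of queries.
/// `rankedResults[i]` is the ordered list of track IDs returned for query i;
/// `expected[i]` is the single relevant track ID for that query (an assumption of this sketch).
func meanReciprocalRank(rankedResults: [[String]], expected: [String]) -> Double {
    precondition(rankedResults.count == expected.count, "one expected ID per query")
    guard !expected.isEmpty else { return 0 }

    var total = 0.0
    for (results, target) in zip(rankedResults, expected) {
        // Reciprocal rank is 1 / (1-based position of the target), or 0 if it is missing.
        if let index = results.firstIndex(of: target) {
            total += 1.0 / Double(index + 1)
        }
    }
    return total / Double(expected.count)
}

// Three hypothetical queries whose target "t1" lands at ranks 1, 2, and 3.
let mrr = meanReciprocalRank(
    rankedResults: [["t1", "t2", "t3"], ["t2", "t1", "t3"], ["t3", "t2", "t1"]],
    expected: ["t1", "t1", "t1"]
)
print(mrr)
```

An MRR of 1.0 means every query returned its target in first place; the example above averages 1, 1/2, and 1/3.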

Components

TrackSeeder: A tool to generate audio (via Lyria 3) and images (via Gemini 3.1 Flash Image / Nano Banana 2) for a set of synthetic prompts defined in golden_set.json, and upload them to Google Cloud Storage.
TrackIngestor: A tool to import existing tracks from a Firestore collection (e.g., production history) into the benchmark format, featuring automated 80/20 Corpus Splitting for self-retrieval testing.
AudioVoxBench: The main evaluation tool. It iterates through strategies, indexes the "Target Set" into a local vector database (sqlite-vec), and runs cross-modal queries to score recall.
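
The 80/20 corpus split can be pictured with the sketch below. This README does not describe how TrackIngestor actually partitions tracks, so everything here is an assumption: a seeded shuffle, 80% of tracks going to the indexed set, and 20% held out as probes. The names `SplitMix64` and `splitCorpus` are hypothetical.

```swift
import Foundation

/// Small deterministic generator so the same seed always yields the same split.
struct SplitMix64: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { state = seed }

    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

/// Splits track IDs into an indexed set and a held-out probe set.
/// Sorting first makes the result independent of the input order.
func splitCorpus(_ ids: [String],
                 indexFraction: Double = 0.8,
                 seed: UInt64 = 42) -> (index: [String], probes: [String]) {
    var rng = SplitMix64(seed: seed)
    let shuffled = ids.sorted().shuffled(using: &rng)
    let cut = Int((Double(shuffled.count) * indexFraction).rounded())
    return (Array(shuffled[..<cut]), Array(shuffled[cut...]))
}

// Ten hypothetical track IDs: 8 end up indexed, 2 are held out.
let ids = (1...10).map { "track-\($0)" }
let split = splitCorpus(ids)
print(split.index.count, split.probes.count)
```

Fixing the seed keeps the split stable between runs, which matters if scores are to be compared over time.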

Prerequisites

To run this suite independently, you need:

Google Cloud Project: With the Vertex AI API enabled.
Permissions: Your account (or service account) needs the roles/aiplatform.user and roles/storage.objectAdmin IAM roles.
Lyria API Access: Specifically the interactions endpoint for audio generation.
GCS Bucket: A bucket to host the multimodal assets (audio/mp3 and images/jpg).
Environment:
- GCP_ACCESS_TOKEN: A valid OAuth2 token (run gcloud auth print-access-token).
- Swift 5.9+ / macOS 14+.
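
A tool could fail fast when the token is missing, for example with a check like the following. This is a hedged sketch rather than the suite's actual startup code; `PreflightError` and `loadAccessToken` are hypothetical names, and only the `GCP_ACCESS_TOKEN` variable comes from this README.

```swift
import Foundation

enum PreflightError: Error {
    case missingToken
}

/// Reads GCP_ACCESS_TOKEN from the given environment (the process environment by default)
/// and rejects a missing or blank value.
func loadAccessToken(
    from environment: [String: String] = ProcessInfo.processInfo.environment
) throws -> String {
    guard let token = environment["GCP_ACCESS_TOKEN"]?
            .trimmingCharacters(in: .whitespacesAndNewlines),
          !token.isEmpty else {
        throw PreflightError.missingToken
    }
    return token
}

do {
    let token = try loadAccessToken()
    print("Token found (\(token.count) characters)")
} catch {
    print("GCP_ACCESS_TOKEN is not set; run: gcloud auth print-access-token")
}
```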

Quick Start (Synthetic Data)

The fastest way to validate the system is using the included synthetic "Golden Set".

Initial Setup:

```
cp config.json.sample config.json
mkdir -p tests
cp golden_set.json.sample tests/golden_set.json
```

Prepare Data: Edit config.json with your GCP Project details.

Seed Assets:

```
export GCP_ACCESS_TOKEN=$(gcloud auth print-access-token)
swift run TrackSeeder
```

Run Benchmark:
```
swift run AudioVoxBench
```

Production Data Benchmarking

To evaluate a large, existing Firestore corpus (e.g., 600+ tracks) using the 80/20 Corpus Split and Self-Retrieval Evaluation methodology, see the Production Benchmarking Guide.

Pricing Note: Running AudioVoxBench against existing production data is inexpensive because it skips all Lyria and Gemini image generation and incurs embedding costs only. A 60-track benchmark costs under $0.01 USD, and a 600-track benchmark costs about $0.07 USD.

Results & Reports

Benchmark results are written automatically to docs/benchmarks/run_[date].md.

Current "Winner": Strategy C (Semantic Text-Augmentation), which consistently achieves an MRR of 1.0 even when queried with purely non-text media probes.

Pricing Estimates (15-Asset Synthetic Run)

Running the synthetic benchmark with 10 database tracks and 5 hold-out probes (13 audio clips, 12 images) costs approximately $5.50 USD on Vertex AI.

| Component | Quantity | Est. Unit Cost | Subtotal |
| --- | --- | --- | --- |
| Audio (Lyria 3) | 13 clips (30 s each) | $0.36 / clip | $4.68 |
| Images (Gemini 3.1 Flash Image) | 12 images (1K) | $0.067 / image | $0.80 |
| Embeddings (Gemini Embedding 2) | 200+ calls | $0.025 / 1M tokens | <$0.01 |
| TOTAL | | | ~$5.49 |

Note: Strategy C (Text-only) is the most cost-effective strategy because subsequent searches need no raw image or audio generation. Running the benchmark against existing production data is nearly free, as it incurs only the <$0.01 embedding cost.
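
The arithmetic behind the estimate can be checked with a short sketch. The quantities and unit costs are copied from the table above; the `LineItem` type is hypothetical, and the embedding line is entered as its stated upper bound of $0.01 because the README gives no exact token count.

```swift
import Foundation

struct LineItem {
    let name: String
    let quantity: Int
    let unitCostUSD: Double
    var subtotal: Double { Double(quantity) * unitCostUSD }
}

// Generation costs from the pricing table.
let generation = [
    LineItem(name: "Audio (Lyria 3)", quantity: 13, unitCostUSD: 0.36),
    LineItem(name: "Images (Gemini 3.1 Flash)", quantity: 12, unitCostUSD: 0.067),
]

// Embeddings are listed as "<$0.01", so the upper bound is used here.
let embeddingUpperBoundUSD = 0.01

let generationTotal = generation.reduce(0) { $0 + $1.subtotal }
for item in generation {
    print(item.name, String(format: "$%.2f", item.subtotal))
}
print(String(format: "Total: ~$%.2f", generationTotal + embeddingUpperBoundUSD))
```

Audio generation accounts for most of the total, which is why runs that reuse existing assets cost so little.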
