AudioVoxBench is a standalone Swift command-line suite for evaluating semantic search relevance with the Gemini Embedding 2 model. It lets developers compare indexing strategies (text-only vs. interleaved multimodal) to determine which yields the highest search recall. For more background, see Multimodal Ground Truth: Building AudioVoxBench.
- Objective Measurement: Calculate Mean Reciprocal Rank (MRR) for various "ensemble" embedding strategies.
- Data-Driven Roadmap: Identify if adding raw audio and image data to embeddings actually improves user search experience compared to rich text metadata.
- Reproducibility: Provide a consistent "Golden Set" of tracks and queries to track semantic search improvements over time.
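Mean Reciprocal Rank rewards strategies that surface the correct track near the top: each query contributes 1/rank of its first relevant result, averaged over all queries. A minimal sketch of the metric (illustrative only, not the benchmark's actual scoring code):

```swift
import Foundation

/// Computes Mean Reciprocal Rank: the average of 1/rank of the first
/// relevant result per query (0 when the expected item is not returned).
/// `results` maps each query to its ranked list of returned track IDs;
/// `expected` holds the ground-truth track ID for each query.
func meanReciprocalRank(results: [[String]], expected: [String]) -> Double {
    precondition(results.count == expected.count)
    let reciprocals = zip(results, expected).map { (ranked, truth) -> Double in
        guard let index = ranked.firstIndex(of: truth) else { return 0 }
        return 1.0 / Double(index + 1)   // ranks are 1-based
    }
    return reciprocals.reduce(0, +) / Double(reciprocals.count)
}

// Example: first query hits at rank 1, second at rank 2, third misses.
let mrr = meanReciprocalRank(
    results: [["a", "b"], ["c", "a"], ["d"]],
    expected: ["a", "a", "a"]
)
print(mrr)  // (1 + 0.5 + 0) / 3 = 0.5
```

An MRR of 1.0 therefore means every query returned its expected track at rank 1.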
- TrackSeeder: A tool to generate audio (via Lyria 3) and images (via Gemini 3.1 Flash Image / Nano Banana 2) for a set of synthetic prompts defined in `golden_set.json`, and upload them to Google Cloud Storage.
- TrackIngestor: A tool to import existing tracks from a Firestore collection (e.g., production history) into the benchmark format, featuring automated 80/20 Corpus Splitting for self-retrieval testing.
- AudioVoxBench: The main evaluation tool. It iterates through strategies, indexes the "Target Set" into a local vector database (sqlite-vec), and runs cross-modal queries to score recall.
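The exact schema of `golden_set.json` is not reproduced in this README; an entry might look roughly like the following (all field names here are illustrative guesses, not the tool's actual schema — consult `golden_set.json.sample` for the real format):

```json
{
  "tracks": [
    {
      "id": "synth-001",
      "audio_prompt": "lo-fi jazz with vinyl crackle and brushed drums",
      "image_prompt": "rain-streaked neon cafe window at night",
      "metadata": { "genre": "lo-fi", "mood": "nocturnal" }
    }
  ],
  "queries": [
    { "text": "rainy night study music", "expected_track": "synth-001" }
  ]
}
```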
To run this suite independently, you need:
- Google Cloud Project: With the Vertex AI API enabled.
- Permissions: Your account (or service account) needs `aiplatform.user` and `storage.objectAdmin`.
- Lyria API Access: Specifically the `interactions` endpoint for audio generation.
- GCS Bucket: A bucket to host the multimodal assets (audio/mp3 and images/jpg).
- Environment: `GCP_ACCESS_TOKEN`: A valid OAuth2 token (run `gcloud auth print-access-token`).
- Swift 5.9+ / macOS 14+.
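A tool reading that token at startup might look like this (an illustrative helper under assumed conventions; the suite's actual lookup code is not shown here):

```swift
import Foundation

/// Reads the OAuth2 token the benchmark tools expect from the environment,
/// treating a missing or blank value as absent. (Illustrative sketch only.)
func gcpAccessToken(
    environment: [String: String] = ProcessInfo.processInfo.environment
) -> String? {
    guard let token = environment["GCP_ACCESS_TOKEN"],
          !token.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty else {
        return nil
    }
    return token
}

if gcpAccessToken() == nil {
    print("Missing token. Run: export GCP_ACCESS_TOKEN=$(gcloud auth print-access-token)")
}
```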
The fastest way to validate the system is using the included synthetic "Golden Set".
- Initial Setup:

```sh
cp config.json.sample config.json
mkdir -p tests
cp golden_set.json.sample tests/golden_set.json
```

- Prepare Data: Edit `config.json` with your GCP Project details.
- Seed Assets:

```sh
export GCP_ACCESS_TOKEN=$(gcloud auth print-access-token)
swift run TrackSeeder
```

- Run Benchmark:

```sh
swift run AudioVoxBench
```
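The `config.json` edited in the Prepare Data step might look something like this (field names are illustrative assumptions; `config.json.sample` is the authoritative template):

```json
{
  "gcp_project_id": "your-project-id",
  "gcp_region": "us-central1",
  "gcs_bucket": "your-benchmark-assets-bucket"
}
```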
For instructions on how to evaluate a large, existing Firestore corpus (e.g., 600+ tracks) using the 80/20 Corpus Split and Self-Retrieval Evaluation methodology, please read the Production Benchmarking Guide.
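A deterministic 80/20 split is what makes self-retrieval runs reproducible: the same seed must always yield the same target/probe partition. A sketch of how such a split could be implemented (TrackIngestor's actual code is not shown in this README; the seed and RNG choice here are assumptions):

```swift
import Foundation

/// A tiny deterministic RNG (SplitMix64) so the split is reproducible.
struct SeededGenerator: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { state = seed }
    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

/// Splits track IDs into an 80% "target set" (indexed into the vector DB)
/// and a 20% hold-out "probe set" used as queries. Illustrative sketch.
func corpusSplit(trackIDs: [String], seed: UInt64 = 42) -> (target: [String], probes: [String]) {
    var rng = SeededGenerator(seed: seed)
    let shuffled = trackIDs.shuffled(using: &rng)
    let cut = Int((Double(shuffled.count) * 0.8).rounded())
    return (Array(shuffled[..<cut]), Array(shuffled[cut...]))
}
```

For a 600-track corpus this yields 480 indexed tracks and 120 hold-out probes, and re-running with the same seed reproduces the exact partition.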
Pricing Note: Running AudioVoxBench against existing production data is far cheaper because it skips all Lyria and Gemini image generation. A 60-track benchmark costs less than $0.01 USD, and a 600-track benchmark costs only ~$0.07 USD (embedding costs only).
Benchmark results are automatically stored in docs/benchmarks/run_[date].md.
Current "Winner": Strategy C (Semantic Text-Augmentation). Strategy C consistently achieves a 1.0 MRR even when queried with purely non-text media probes.
Running the synthetic benchmark with 10 database tracks and 5 hold-out probes (13 audio clips, 12 images) costs approximately $5.50 USD on Vertex AI.
| Component | Quantity | Est. Unit Cost | Subtotal |
|---|---|---|---|
| Audio (Lyria 3) | 13 clips (30s ea) | $0.36 / clip | $4.68 |
| Images (Gemini 3.1 Flash) | 12 images (1K) | $0.067 / image | $0.80 |
| Embeddings (Gemini 2) | 200+ calls | $0.025 / 1M tokens | <$0.01 |
| TOTAL | | | ~$5.49 |
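The subtotals above multiply out as follows (a quick sanity check of the table's arithmetic, treating the sub-cent embedding cost as an upper bound of $0.01):

```swift
let audio = 13 * 0.36        // 13 clips at $0.36 → $4.68
let images = 12 * 0.067      // 12 images at $0.067 → ≈ $0.80
let embeddings = 0.01        // upper bound; actual cost is below one cent
let total = audio + images + embeddings
print(total)                 // ≈ $5.49
```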
Note: Strategy C (Semantic Text-Augmentation) is the most cost-effective, as it bypasses raw image/audio generation costs for subsequent searches. Running the benchmark against existing production data is nearly free, incurring only the <$0.01 embedding cost.
