Add hybrid triple-path retrieval pipeline with cross-encoder reranking#6
Conversation
- Add run_pipeline.py: standalone pipeline with Snowflake arctic-embed-l-v2.0 dense embeddings, dual BM25 (full-text + title-only), RRF fusion, and cross-encoder reranking (ms-marco-MiniLM-L-6-v2); runs 92 test queries in ~2 min
- Update devrev_search.ipynb: refactored Section 9 as infrastructure setup, added Section 10 Multi-Query Triple-Path Retrieval strategy
- Add submission outputs: test_queries_results.json/.parquet (latest run), plus enhanced and old variants for comparison
- Ignore large embeddings_*.npy files via .gitignore

Made-with: Cursor
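As a rough sketch of the fusion step (the function name and document IDs below are illustrative, not the actual run_pipeline.py code): each of the three paths (dense, full-text BM25, title-only BM25) returns a ranked list of document IDs, and weighted Reciprocal Rank Fusion merges them into one ranking before the cross-encoder sees anything.

```python
def rrf_fuse(ranked_lists, weights, k=60):
    """Weighted Reciprocal Rank Fusion: each retrieval path contributes
    weight / (k + rank) for every document it returns; documents found
    by multiple paths accumulate score and rise to the top."""
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy ranked lists standing in for the dense, BM25, and title paths.
dense = ["d3", "d1", "d7"]
bm25  = ["d1", "d3", "d9"]
title = ["d1", "d8", "d3"]

fused = rrf_fuse([dense, bm25, title], weights=[1.0, 1.0, 1.0], k=60)
# d1 ranks high in all three paths, so it wins the fused ranking.
```

The fused list is then truncated to a candidate set and passed to the cross-encoder for final reranking.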
Hey!

Can you also try without the cross-encoder? And try various other boost configs? Also, did you experiment with RRF_K values?
I am trying it without the cross-encoder, testing around 14 different weight combinations for the dense, BM25, and title paths (e.g., dense-only, bm25-only, title-only, equal weights, title-boost-3x/4x, dense-boost-2x, etc.), both with and without the reranker. I am also experimenting with RRF_K values of 10, 20, 30, 40, 60, 80, and 100 to find the sweet spot for rank fusion, again tested both with and without reranking. For this submission I only used
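A sweep like the one described above can be enumerated as a simple grid. This is a hypothetical sketch (the config names and the weight tuples are illustrative, covering only a subset of the ~14 combinations mentioned):

```python
from itertools import product

# (dense, bm25_full, bm25_title) path weights; names are illustrative.
weight_configs = {
    "equal":          (1.0, 1.0, 1.0),
    "dense_only":     (1.0, 0.0, 0.0),
    "bm25_only":      (0.0, 1.0, 0.0),
    "title_only":     (0.0, 0.0, 1.0),
    "title_boost_3x": (1.0, 1.0, 3.0),
    "dense_boost_2x": (2.0, 1.0, 1.0),
}
rrf_k_values = [10, 20, 30, 40, 60, 80, 100]

# Every (weights, RRF_K, reranker on/off) combination becomes one run.
runs = [
    (name, k, rerank)
    for name, k, rerank in product(weight_configs, rrf_k_values, (False, True))
]
```

With 6 weight configs, 7 RRF_K values, and the reranker toggled, this grid alone is 84 runs, which is why tracking results per config matters.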
Did you use any framework for benchmarking all these different configs? |
No, I didn't use a benchmarking framework. Everything (config management, metric computation, result tracking) is inline Python.
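The inline metric computation can be as small as a single function. A minimal sketch, assuming per-query ranked results and a relevance-judgment dict (both names hypothetical), computing Mean Reciprocal Rank:

```python
def mrr(results, qrels):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant
    document per query; queries with no relevant hit contribute 0."""
    total = 0.0
    for qid, ranking in results.items():
        relevant = qrels.get(qid, set())
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results)

# Toy data: q1's relevant doc is at rank 2, q2's at rank 1.
results = {"q1": ["d2", "d5"], "q2": ["d9", "d1"]}
qrels   = {"q1": {"d5"}, "q2": {"d9"}}
score = mrr(results, qrels)  # (0.5 + 1.0) / 2 = 0.75
```

Each sweep run's config and its metrics can then be appended to a plain list or DataFrame for comparison.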
Built a custom hybrid retrieval pipeline for DevRev Search, replacing the baseline FAISS-only approach with a multi-signal retrieval system.
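The final stage of the pipeline reranks the fused candidates with a cross-encoder that scores (query, document) pairs. A sketch of that stage, using a toy token-overlap scorer in place of the actual ms-marco-MiniLM-L-6-v2 model (in the real pipeline, `score_fn` would be the cross-encoder's predict call over the pairs):

```python
def rerank(query, candidates, score_fn, top_n=10):
    """Score every (query, candidate) pair and keep the top_n
    candidates by descending score."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

# Toy stand-in for the cross-encoder: count shared tokens.
def overlap(query, doc):
    return len(set(query.split()) & set(doc.split()))

top = rerank(
    "reset account password",
    ["billing invoice help", "how to reset a password"],
    overlap,
    top_n=1,
)
```

Because the cross-encoder reads the query and document together, it is far more precise than the retrieval paths, but also far slower, which is why it only sees the small fused candidate set.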
System Details:
Open Source: Fully open source - all models, retrieval infrastructure, and pipeline code are open source.
ISS-1