Add hybrid triple-path retrieval pipeline with cross-encoder reranking #6

Open
i-anishR-droid wants to merge 3 commits into main from enhanced-pipeline

Conversation

@i-anishR-droid commented Mar 12, 2026

Built a custom hybrrid retrieval pipeline for DevRev Search, replacing the baseline FAISS-only approach with a multi-signal retrieval system.

  • Dense Retrieval: Snowflake/snowflake-arctic-embed-l-v2.0 (1024-dim) embeddings indexed with FAISS IndexFlatIP (cosine similarity on normalized vectors) over ~65K knowledge base documents
  • Sparse Retrieval (dual): BM25Okapi on full-text (title + cleaned body) and a separate BM25Okapi on titles only (2x boosted RRF weight) for high-precision title matching
  • Fusion: Reciprocal Rank Fusion (RRF, k=60) across all three retrieval paths, run over up to 3 rule-based query expansions per input query
  • Reranking: cross-encoder/ms-marco-MiniLM-L-6-v2 cross-encoder on the top-60 fused candidates, returning the top-10
  • Text Cleaning: Strips b'...' byte-string artifacts, normalizes escaped unicode, collapses whitespace
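The fusion step above can be sketched in plain Python. This is a minimal illustration of weighted Reciprocal Rank Fusion (score contribution `w / (k + rank)` per path), not the PR's actual code; the function name and example document IDs are made up for the demo, with the title path given the 2x weight described above.

```python
def rrf_fuse(ranked_lists, weights=None, k=60):
    """Weighted Reciprocal Rank Fusion: score(d) = sum_i w_i / (k + rank_i(d))."""
    weights = weights or [1.0] * len(ranked_lists)
    scores = {}
    for ranking, w in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Toy rankings from the three retrieval paths (illustrative IDs)
dense  = ["d3", "d1", "d2"]
bm25   = ["d1", "d3", "d4"]
titles = ["d2", "d1", "d5"]

# Title path weighted 2x, as in the pipeline description
fused = rrf_fuse([dense, bm25, titles], weights=[1.0, 1.0, 2.0], k=60)
```

With `k=60` the rank differences are small relative to `k`, so RRF rewards documents that appear on multiple paths more than documents that top a single path.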

System Details:

  • System Description: Hybrid search pipeline combining dense semantic embeddings (Snowflake/snowflake-arctic-embed-l-v2.0, 1024-dim) via FAISS IndexFlatIP, with dual sparse lexical retrieval (full-text BM25 + title-only BM25 with 2x boosted weight), fused using Reciprocal Rank Fusion across rule-based query expansions (up to 3 variants), then reranked with a cross-encoder (cross-encoder/ms-marco-MiniLM-L-6-v2).
  • System Type: Hybrid / RAG Retriever
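The reranking stage could be structured roughly as below. This is a sketch, not the PR's implementation: `rerank` is a hypothetical helper, and the scoring callable is injected so the logic is testable without downloading the model. The commented lines assume the `CrossEncoder.predict` API from the sentence-transformers library (a list of (query, passage) pairs in, a list of scores out).

```python
def rerank(query, candidates, score_fn, top_k=10):
    """Score each (query, doc) pair and keep the top_k by descending score.

    `score_fn` is expected to behave like sentence_transformers
    CrossEncoder.predict: pairs in, relevance scores out.
    """
    scores = score_fn([(query, doc) for doc in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]

# In the real pipeline (assumed usage, not verified here):
# from sentence_transformers import CrossEncoder
# ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# top10 = rerank(query, fused_top60, ce.predict, top_k=10)
```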

Open Source: Fully open source; all models, retrieval infrastructure, and pipeline code are publicly available.

ISS-1

- Add run_pipeline.py: standalone pipeline with Snowflake arctic-embed-l-v2.0
  dense embeddings, dual BM25 (full-text + title-only), RRF fusion, and
  cross-encoder reranking (ms-marco-MiniLM-L-6-v2); runs 92 test queries in ~2 min
- Update devrev_search.ipynb: refactored Section 9 as infrastructure setup,
  added Section 10 Multi-Query Triple-Path Retrieval strategy
- Add submission outputs: test_queries_results.json/.parquet (latest run),
  enhanced and old variants for comparison
- Ignore large embeddings_*.npy files via .gitignore

Made-with: Cursor
@prakhar7651
Contributor

Hey!
These are your scores.
Recall@10: 0.3549
Precision@10: 0.3434

@prakhar7651
Contributor

Can you also try without cross encoder? And try with various other boost configs? Also did you experiment with RRF_K values?

@i-anishR-droid
Author

> Can you also try without cross encoder? And try with various other boost configs? Also did you experiment with RRF_K values?

I'm trying it without the cross-encoder now, testing around 14 different weight combinations for the dense, BM25, and title paths (e.g., dense-only, BM25-only, title-only, equal weights, title-boost-3x/4x, dense-boost-2x), both with and without the reranker, and experimenting with RRF_K values of 10, 20, 30, 40, 60, 80, and 100 to find the sweet spot for rank fusion, again both with and without reranking.

For this submission I only used rrf_k=60; no RRF_K experimentation was done.
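One way to enumerate that sweep is a simple config grid. This is an illustrative sketch, not the author's code: the preset names and (dense, bm25, title) weight tuples are guesses based on the combinations mentioned above, and each dict would be passed to the pipeline runner.

```python
from itertools import product

RRF_KS = [10, 20, 30, 40, 60, 80, 100]

# (dense, bm25_fulltext, bm25_title) weight presets; names are illustrative
WEIGHT_PRESETS = {
    "dense_only":     (1.0, 0.0, 0.0),
    "bm25_only":      (0.0, 1.0, 0.0),
    "title_only":     (0.0, 0.0, 1.0),
    "equal":          (1.0, 1.0, 1.0),
    "title_boost_2x": (1.0, 1.0, 2.0),
    "title_boost_3x": (1.0, 1.0, 3.0),
    "title_boost_4x": (1.0, 1.0, 4.0),
    "dense_boost_2x": (2.0, 1.0, 1.0),
}

# Every preset x every RRF_K x with/without the cross-encoder
configs = [
    {"name": name, "weights": w, "rrf_k": k, "rerank": r}
    for (name, w), k, r in product(WEIGHT_PRESETS.items(), RRF_KS, [False, True])
]
# 8 presets x 7 k values x 2 rerank settings = 112 runs
```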

@prakhar7651
Contributor

Did you use any framework for benchmarking all these different configs?

@i-anishR-droid
Author

> Did you use any framework for benchmarking all these different configs?

No, I haven't used any benchmarking framework. Everything (config management, metric computation, result tracking) is inline Python.
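The inline metric computation amounts to a few lines per query. A minimal sketch of per-query Precision@k and Recall@k (function name is illustrative; the reported Recall@10 of 0.3549 and Precision@10 of 0.3434 would be these values averaged over the 92 test queries):

```python
def precision_recall_at_k(retrieved, relevant, k=10):
    """Per-query Precision@k and Recall@k for one ranked result list."""
    top = retrieved[:k]
    hits = len(set(top) & set(relevant))
    precision = hits / k                                  # hits among top-k slots
    recall = hits / len(relevant) if relevant else 0.0    # hits among all relevant docs
    return precision, recall
```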
