
[feat]: add evaluation metrics + multi-model benchmark #5

Open
Aksshay88 wants to merge 1 commit into devrev:main from Aksshay88:aksshay88/model

Conversation

Member

@Aksshay88 Aksshay88 commented Mar 11, 2026

  • Add evaluate.py with standard IR evaluation metrics (NDCG@k, MRR, MAP, Recall@k, Precision@k) to measure search quality
    against the 291 annotated queries
  • Add compare_models.py to benchmark multiple Ollama embedding models with auto-pull, task prefixes, text truncation, embedding
    caching, and leaderboard generation
  • Add evaluation section (Section 8) to the notebook with per-query breakdown
  • Update README with model comparison docs, supported models table, and metrics reference

Why this is needed

The repo could embed docs and retrieve results, but had no way to measure search quality. The devrev/search dataset includes
291 annotated queries with golden retrievals specifically for evaluation — but no evaluation code existed. Additionally, only 2
embedding providers were supported (OpenAI and one Ollama model) with no way to compare them.

What's added

evaluate.py — IR Evaluation Metrics

Computes standard information retrieval metrics by comparing predicted retrievals against golden annotations:

| Metric | What it measures |
| --- | --- |
| NDCG@k | Are correct documents ranked near the top? |
| MRR | How early does the first correct document appear? |
| MAP | Average precision across all relevant documents |
| Recall@k | What fraction of correct documents were found in the top-k? |
| Precision@k | What fraction of top-k results are actually correct? |
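The five metrics above can be sketched with plain-Python reference implementations. This is a minimal illustration of the formulas, not the code in evaluate.py; it assumes binary relevance (a retrieved ID is either in the golden set or not), which matches the golden-ID annotation scheme described below.

```python
import math

def precision_at_k(retrieved, golden, k):
    """Fraction of the top-k retrieved IDs that are golden."""
    return sum(1 for d in retrieved[:k] if d in golden) / k

def recall_at_k(retrieved, golden, k):
    """Fraction of golden IDs that appear in the top-k."""
    return sum(1 for d in golden if d in retrieved[:k]) / len(golden)

def mrr(retrieved, golden):
    """Reciprocal rank of the first golden ID (0 if none is retrieved)."""
    for rank, d in enumerate(retrieved, start=1):
        if d in golden:
            return 1.0 / rank
    return 0.0

def average_precision(retrieved, golden):
    """MAP building block: mean precision at each rank where a golden ID appears."""
    hits, total = 0, 0.0
    for rank, d in enumerate(retrieved, start=1):
        if d in golden:
            hits += 1
            total += hits / rank
    return total / len(golden) if golden else 0.0

def ndcg_at_k(retrieved, golden, k):
    """Binary-relevance NDCG@k: discounted gain normalized by the ideal ranking."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, d in enumerate(retrieved[:k], start=1) if d in golden)
    ideal = sum(1.0 / math.log2(r + 1)
                for r in range(1, min(len(golden), k) + 1))
    return dcg / ideal if ideal else 0.0
```

For example, with `retrieved = ["a", "b", "c", "d"]` and `golden = {"b", "d"}`, MRR is 0.5 (first hit at rank 2) and a perfect ranking of the golden IDs yields NDCG@k of 1.0.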

https://app.devrev.ai/devrev/works/ISS-269510

  • evaluate.py with NDCG, MRR, MAP, Recall, Precision
  • compare_models.py with 6 Ollama models, task prefixes, caching, and leaderboard
@Aksshay88 Aksshay88 requested a review from nimit2801 March 11, 2026 17:34
@prakhar7651
Contributor

What are we evaluating against?

@Aksshay88
Member Author

Aksshay88 commented Mar 16, 2026

> What are we evaluating against?

@prakhar7651
We evaluate against the 291 annotated queries in the annotated_queries split of the devrev/search HuggingFace dataset.

  • Each query has a set of golden document IDs — human-labeled correct chunks from the 65,224-doc knowledge base that actually answer that query.

The evaluation flow:

  1. For each annotated query, run FAISS similarity search → get top-K retrieved doc IDs
  2. Compare retrieved IDs vs golden IDs
  3. Compute NDCG@k, MRR, MAP, Recall@k, Precision@k
  • The 92 test_queries have no labels (they are held out for leaderboard submission), so they can't be evaluated locally; only the 291 annotated queries have ground truth.
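The three-step flow above can be sketched end to end. The data here (document vectors, query vectors, golden IDs) is hypothetical stand-in data, and a brute-force cosine top-k replaces the FAISS search of step 1 so the sketch stays self-contained; the real pipeline searches the 65,224-doc index and loops over all 291 annotated queries.

```python
import numpy as np

# Hypothetical stand-ins: in the real flow, doc vectors live in a FAISS index
# and queries come from the annotated_queries split of devrev/search.
doc_ids = ["doc_a", "doc_b", "doc_c", "doc_d"]
doc_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
annotated = [
    {"query_vec": np.array([1.0, 0.05]), "golden_ids": {"doc_a", "doc_b"}},
    {"query_vec": np.array([0.05, 1.0]), "golden_ids": {"doc_c"}},
]

def top_k_ids(query_vec, k=2):
    """Step 1: retrieve top-k doc IDs by cosine similarity (FAISS stand-in)."""
    sims = doc_vecs @ query_vec
    sims = sims / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
    return [doc_ids[i] for i in np.argsort(-sims)[:k]]

def recall_at_k(retrieved, golden, k):
    """Step 2: compare retrieved IDs against golden IDs."""
    return sum(1 for d in golden if d in retrieved[:k]) / len(golden)

# Step 3: compute a metric per query, then average across queries.
recalls = [recall_at_k(top_k_ids(q["query_vec"]), q["golden_ids"], 2)
           for q in annotated]
mean_recall = sum(recalls) / len(recalls)
```

The per-query scores are averaged into a single leaderboard number per metric, which is what lets different embedding models be compared on equal footing.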

@prakhar7651
Contributor

Do you have a file to evaluate?
