
[feat]: add evaluation metrics + multi-model benchmark #5

Open
Aksshay88 wants to merge 1 commit into devrev:main from Aksshay88:aksshay88/model

Conversation

Member

@Aksshay88 Aksshay88 commented Mar 11, 2026

  • Add evaluate.py with standard IR evaluation metrics (NDCG@k, MRR, MAP, Recall@k, Precision@k) to measure search quality
    against the 291 annotated queries
  • Add compare_models.py to benchmark multiple Ollama embedding models with auto-pull, task prefixes, text truncation, embedding
    caching, and leaderboard generation
  • Add evaluation section (Section 8) to the notebook with per-query breakdown
  • Update README with model comparison docs, supported models table, and metrics reference

Why this is needed

The repo could embed docs and retrieve results, but had no way to measure search quality. The devrev/search dataset includes
291 annotated queries with golden retrievals specifically for evaluation — but no evaluation code existed. Additionally, only 2
embedding providers were supported (OpenAI and one Ollama model) with no way to compare them.

What's added

evaluate.py — IR Evaluation Metrics

Computes standard information retrieval metrics by comparing predicted retrievals against golden annotations:

| Metric | What it measures |
| --- | --- |
| NDCG@k | Are correct documents ranked near the top? |
| MRR | How early does the first correct document appear? |
| MAP | Average precision across all relevant documents |
| Recall@k | What fraction of correct documents were found in the top-k? |
| Precision@k | What fraction of top-k results are actually correct? |
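The five metrics above can be sketched with plain-Python reference implementations. This is a minimal illustration of the formulas, not the code in evaluate.py; it assumes binary relevance (a retrieved ID is either in the golden set or not), which matches the golden-ID annotation scheme described below.

```python
import math

def precision_at_k(retrieved, golden, k):
    """Fraction of the top-k retrieved IDs that are golden."""
    return sum(1 for d in retrieved[:k] if d in golden) / k

def recall_at_k(retrieved, golden, k):
    """Fraction of golden IDs that appear in the top-k."""
    return sum(1 for d in golden if d in retrieved[:k]) / len(golden)

def mrr(retrieved, golden):
    """Reciprocal rank of the first golden ID (0 if none is retrieved)."""
    for rank, d in enumerate(retrieved, start=1):
        if d in golden:
            return 1.0 / rank
    return 0.0

def average_precision(retrieved, golden):
    """MAP building block: mean precision at each rank where a golden ID appears."""
    hits, total = 0, 0.0
    for rank, d in enumerate(retrieved, start=1):
        if d in golden:
            hits += 1
            total += hits / rank
    return total / len(golden) if golden else 0.0

def ndcg_at_k(retrieved, golden, k):
    """Binary-relevance NDCG@k: discounted gain normalized by the ideal ranking."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, d in enumerate(retrieved[:k], start=1) if d in golden)
    ideal = sum(1.0 / math.log2(r + 1)
                for r in range(1, min(len(golden), k) + 1))
    return dcg / ideal if ideal else 0.0
```

For example, with `retrieved = ["a", "b", "c", "d"]` and `golden = {"b", "d"}`, MRR is 0.5 (first hit at rank 2) and a perfect ranking of the golden IDs yields NDCG@k of 1.0.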

https://app.devrev.ai/devrev/works/ISS-269510

  • evaluate.py with NDCG, MRR, MAP, Recall, Precision
  • compare_models.py with 6 Ollama models, task prefixes, caching, and leaderboard
@Aksshay88 Aksshay88 requested a review from nimit2801 March 11, 2026 17:34
@prakhar7651
Contributor

What are we evaluating against?

@Aksshay88
Member Author

Aksshay88 commented Mar 16, 2026

> What are we evaluating against?

@prakhar7651
We evaluate against the 291 annotated queries in the annotated_queries split of the devrev/search HuggingFace dataset.

  • Each query has a set of golden document IDs — human-labeled correct chunks from the 65,224-doc knowledge base that actually answer that query.

The evaluation flow:

  1. For each annotated query, run FAISS similarity search → get top-K retrieved doc IDs
  2. Compare retrieved IDs vs golden IDs
  3. Compute NDCG@k, MRR, MAP, Recall@k, Precision@k
  • The 92 test_queries have no labels (they are held out for leaderboard submission), so they can't be evaluated locally; only the 291 annotated queries have ground truth.
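The three-step flow above can be sketched end to end. The data here (document vectors, query vectors, golden IDs) is hypothetical stand-in data, and a brute-force cosine top-k replaces the FAISS search of step 1 so the sketch stays self-contained; the real pipeline searches the 65,224-doc index and loops over all 291 annotated queries.

```python
import numpy as np

# Hypothetical stand-ins: in the real flow, doc vectors live in a FAISS index
# and queries come from the annotated_queries split of devrev/search.
doc_ids = ["doc_a", "doc_b", "doc_c", "doc_d"]
doc_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
annotated = [
    {"query_vec": np.array([1.0, 0.05]), "golden_ids": {"doc_a", "doc_b"}},
    {"query_vec": np.array([0.05, 1.0]), "golden_ids": {"doc_c"}},
]

def top_k_ids(query_vec, k=2):
    """Step 1: retrieve top-k doc IDs by cosine similarity (FAISS stand-in)."""
    sims = doc_vecs @ query_vec
    sims = sims / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
    return [doc_ids[i] for i in np.argsort(-sims)[:k]]

def recall_at_k(retrieved, golden, k):
    """Step 2: compare retrieved IDs against golden IDs."""
    return sum(1 for d in golden if d in retrieved[:k]) / len(golden)

# Step 3: compute a metric per query, then average across queries.
recalls = [recall_at_k(top_k_ids(q["query_vec"]), q["golden_ids"], 2)
           for q in annotated]
mean_recall = sum(recalls) / len(recalls)
```

The per-query scores are averaged into a single leaderboard number per metric, which is what lets different embedding models be compared on equal footing.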

@prakhar7651
Contributor

Do you have a file to evaluate?
