spanish-ir-bench

About

This project implements and evaluates an Information Retrieval (IR) system in Spanish using the MessIRve dataset and a subset of Spanish Wikipedia, eswiki_20240401_corpus.

Two retrieval approaches are compared:

TF-IDF (sparse lexical retrieval)
Sentence-Transformer embeddings (dense semantic retrieval)

The goal is to analyze performance differences between traditional term-based methods and modern semantic models in a large-scale Spanish IR setting.

Project Structure

The notebook follows this pipeline:

Data Loading
- Load train/test splits and corpus
- Convert to pandas
- Remove unused columns
- Filter corpus to relevant documents
Preproessing
- Heavy preprocessing for TF-IDF (stopword removal, normalization)
- Light preprocessing for embeddings (lowercasing + punctuation removal)
- Parallelized with joblib
TF-IDF Retrieval
- TfidfVectorizer with:
  - unigrams + bigrams
  - max 50,000 features
  - sublinear TF scaling
- Cosine similarity
- Top-100 ranking per query
Embedding-Based Retrieval
- Model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- 384-dimensional dense vectors
- Normalized embeddings
- Dot product for fast similarity computation
- Top-100 ranking per query
Evaluation
- Precision@k
- Recall@k
- nDCG@k
- Evaluated on test split only

Results

TF-IDF

Metric	Value
Precision@1	0.100
Recall@10	0.343
nDCG@10	0.209

Embeddings

Metric	Value
Precision@1	0.175
Recall@10	0.419
nDCG@10	0.290

Key Findings

Embeddings significantly outperform TF-IDF across all metrics.
Precision@1 improves from 10% to 17.5%.
Recall@10 increases from 34.3% to 41.9%.
nDCG@10 improves substantially, indicating better ranking quality.

These results confirm that dense semantic representations capture query-document similarity more effectively than purely lexical term matching, especially in scenarios with lexical variation.

Possible Improvements

Future work could include:

Hybrid sparse + dense retrieval
More advanced embedding models
Cross-encoder re-ranking
Fine-tuning on the Messirve dataset
Spanish-specific transformer models
Approximate nearest neighbor search for scalability

Motivation

This project serves as an experimental benchmark comparing sparse and dense retrieval methods in Spanish. It is intended as an educational and exploratory implementation rather than a production-grade search engine.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.gitignore		.gitignore
README.md		README.md
notebook.ipynb		notebook.ipynb
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

spanish-ir-bench

About

Project Structure

Results

TF-IDF

Embeddings

Key Findings

Possible Improvements

Motivation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

spanish-ir-bench

About

Project Structure

Results

TF-IDF

Embeddings

Key Findings

Possible Improvements

Motivation

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages