
Semantic Search Engines over Big Data Using LLMs

Survey Paper | Big Data Analytics Course | Tunis Business School, University of Tunis
Authors: Khouloud BEN YOUNES · Montaha GHABRI
Supervisor: Pr. Manel ABDELKADER | Academic Year: 2025–2026


📄 Abstract

This survey explores semantic search over big data using large language models (LLMs), focusing on dense retrieval techniques and retrieval-augmented generation (RAG) systems. Semantic search has evolved from traditional keyword-based methods to advanced embedding-based approaches that capture meaning and context, enabling more accurate information retrieval from massive corpora.

Key findings include:

  • Dense retrievers like DPR and ColBERT improve semantic understanding but face scalability and adversarial challenges
  • RAG systems like SELF-RAG enhance LLM reliability by integrating external knowledge
  • Emerging trends: unsupervised LLM adaptation (Llama2Vec), hardware acceleration (Chameleon), and query enhancement (HyDE, Query2doc)

📚 Table of Contents

  1. Introduction
  2. Literature Selection
  3. Background
  4. Review of Existing Work
  5. Critical Analysis
  6. Trends & Future Directions
  7. Conclusion
  8. References

Introduction

The exponential growth of digital data presents unprecedented challenges for information retrieval. Traditional term-based methods like BM25 suffer from vocabulary mismatch — relevant documents may use different terminology than queries. Neural semantic search addresses three core limitations:

| Limitation | Example |
|---|---|
| Vocabulary Mismatch | Understanding "physician" and "doctor" as equivalent |
| Semantic Understanding | Matching "bad guy" to "villain" in context |
| Compositionality | Interpreting multi-word queries as coherent semantic units |
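The vocabulary-mismatch limitation can be made concrete with a toy sketch: exact term matching scores zero for synonyms, while embedding similarity does not. The 3-d vectors below are illustrative values, not outputs of a real model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def keyword_match(q, d):
    """BM25-style exact term overlap on single tokens: 1 if identical, else 0."""
    return 1.0 if q == d else 0.0

# Toy 3-d "embeddings" (hand-picked for illustration): synonyms end up
# close in vector space even with zero lexical overlap.
emb = {
    "physician": [0.9, 0.1, 0.2],
    "doctor":    [0.85, 0.15, 0.25],
    "banana":    [0.1, 0.9, 0.3],
}

print(keyword_match("physician", "doctor"))            # exact match fails
print(cosine(emb["physician"], emb["doctor"]))         # high similarity
print(cosine(emb["physician"], emb["banana"]))         # low similarity
```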

Research Questions

| # | Question |
|---|---|
| RQ1 | How do sparse, dense, and hybrid retrieval paradigms compare in effectiveness, efficiency, and scalability? |
| RQ2 | What training methodologies most impact retrieval quality? |
| RQ3 | How do neural retrievers handle billion-scale deployment challenges? |
| RQ4 | What security vulnerabilities exist in dense retrieval systems? |
| RQ5 | How do retrieval systems integrate with LLM generation in RAG architectures? |

Literature Selection

  • Sources: IEEE Xplore, ACM Digital Library, SpringerLink, Elsevier, arXiv (cs.CL, cs.IR, cs.LG)
  • Time Window: 2018–2025
  • Selection: 800+ initial results → 180 shortlisted → 35 highly relevant → 16 core papers
  • Citation Thresholds: 100+ (2020–2022), 20+ (2023), 10+ (2024–2025)
  • Top Venues: SIGIR, ACL, EMNLP, NeurIPS, ICML, VLDB, WWW

Background

Core Concepts

| Term | Definition |
|---|---|
| Dense Retrieval | Dual-encoder architectures mapping queries and passages to a shared vector space |
| RAG | Framework integrating retrieval with generative LLMs for factual grounding |
| Contrastive Learning | Training that pulls positive pairs closer while pushing negatives apart |
| Hard Negatives | Semantically similar but non-relevant passages used to sharpen training |
| Late Interaction | Token-level query-passage matching (e.g., ColBERT) |
| ANN Search | Approximate nearest neighbor search, e.g., HNSW graphs and the FAISS library |
| Vector Quantization | Compression of full-precision vectors into discrete codes |
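The contrastive-learning and in-batch-negatives ideas above can be sketched as an InfoNCE-style loss: each query's matching passage sits on the diagonal of a batch similarity matrix, and every other passage in the batch acts as a negative. This is a minimal NumPy sketch, not a training loop; the temperature value is an illustrative choice.

```python
import numpy as np

def in_batch_contrastive_loss(q, p, temperature=0.05):
    """InfoNCE loss with in-batch negatives, in the style of DPR dual
    encoders: row i of q pairs with row i of p; other rows are negatives."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    sims = q @ p.T / temperature                 # (batch, batch) similarities
    sims -= sims.max(axis=1, keepdims=True)      # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # diagonal = positive pairs

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
aligned = queries + 0.01 * rng.normal(size=(4, 8))  # near-identical positives
shuffled = rng.normal(size=(4, 8))                  # unrelated passages

# Aligned pairs should yield a much lower loss than unrelated pairs.
print(in_batch_contrastive_loss(queries, aligned))
print(in_batch_contrastive_loss(queries, shuffled))
```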

Key Evaluation Benchmarks

  • MS MARCO — Primary benchmark for passage ranking and retrieval
  • BEIR — Zero-shot generalization across medical, legal, financial domains
  • Natural Questions (NQ) — Open-domain QA over Wikipedia

Evaluation Metrics

  • MRR@k — Mean Reciprocal Rank of first relevant document
  • nDCG — Normalized Discounted Cumulative Gain (accounts for graded relevance)
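Both metrics above are simple to compute from ranked relevance judgments; a minimal sketch (binary relevance for MRR, graded relevance for nDCG):

```python
import math

def mrr_at_k(relevances, k=10):
    """Mean Reciprocal Rank: 1/rank of the first relevant result within
    the top-k, averaged over queries. `relevances` holds one binary
    relevance list per query, in ranked order."""
    total = 0.0
    for rels in relevances:
        for rank, rel in enumerate(rels[:k], start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(relevances)

def ndcg_at_k(rels, ideal, k=10):
    """Normalized DCG for one ranked list of graded relevance scores,
    divided by the DCG of the ideal (descending) ordering."""
    def dcg(scores):
        return sum(s / math.log2(i + 2) for i, s in enumerate(scores[:k]))
    return dcg(rels) / dcg(ideal)

# First relevant doc at rank 2 for query 1, rank 1 for query 2:
print(mrr_at_k([[0, 1, 0], [1, 0, 0]]))     # (1/2 + 1) / 2 = 0.75
# Graded relevances 2,0,1 against the ideal ordering 2,1,0:
print(ndcg_at_k([2, 0, 1], [2, 1, 0]))
```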

Review of Existing Work

1. Dense Retrieval Foundations

| Paper | Contribution | Key Result |
|---|---|---|
| DPR [Karpukhin et al., 2020] | Dual-encoder with in-batch negatives | 78.4% top-20 accuracy on NQ (vs. 59.1% BM25) |
| SIMLM [Wang et al., 2023] | Bottleneck pre-training (128-dim compression) | 41% MRR@10 on MS MARCO, strong zero-shot |
| Survey [Guo et al., 2022] | 30-year taxonomy of semantic models | Identifies hybrid approaches as best balance |
| Poisoning Attacks [Zhong et al., 2023] | Corpus poisoning via gradient-based token replacement | 99.4% attack success on unsupervised retrievers |

2. Scalability & Vector Search

| Paper | Contribution | Key Result |
|---|---|---|
| SPFresh [Xu et al., 2023] | Incremental in-place updates (LIRE algorithm) | 2–5× faster updates, 95% recall on 1B vectors |
| Distill-VQ [Xiao et al., 2022] | Ranking-aware vector quantization | 2–5% MRR/recall gain at same compression ratio |
| Compressed Concat. [Ayoub et al., 2025] | Compressed concatenation of small models | 89% performance at 48× compression |
| HAKES [Hu et al., 2025] | Disaggregated vector database architecture | 16× throughput vs. Weaviate/Milvus |
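The vector-quantization idea underlying systems like Distill-VQ can be sketched as toy product quantization: split each vector into subvectors and store only the index of the nearest codebook centroid per subvector. Real systems learn codebooks with k-means; random codebooks are used here purely for illustration.

```python
import numpy as np

def pq_encode(x, codebooks):
    """Product quantization: split x into len(codebooks) subvectors and
    record, for each, the index of its nearest codebook centroid."""
    subs = np.split(x, len(codebooks))
    return [int(np.argmin(np.linalg.norm(cb - s, axis=1)))
            for cb, s in zip(codebooks, subs)]

def pq_decode(codes, codebooks):
    """Reconstruct an approximate vector from the stored centroid indices."""
    return np.concatenate([cb[c] for cb, c in zip(codebooks, codes)])

rng = np.random.default_rng(1)
# Two subspaces of dim 4, each with 256 centroids (illustrative random
# codebooks; normally trained with k-means on the corpus).
codebooks = [rng.normal(size=(256, 4)) for _ in range(2)]
x = rng.normal(size=8)

codes = pq_encode(x, codebooks)       # 8 floats compressed to 2 byte-sized codes
x_hat = pq_decode(codes, codebooks)   # lossy reconstruction
print(codes, float(np.linalg.norm(x - x_hat)))
```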

3. RAG Architectures & Optimization

| Paper | Contribution | Key Result |
|---|---|---|
| RAG Survey [Zhao et al., 2026] | Comprehensive RAG taxonomy | Identifies hallucination reduction as central motivation |
| Chameleon [Jiang et al., 2023] | FPGA+GPU heterogeneous acceleration | 2.16× latency reduction, 3.18× throughput |
| SELF-RAG [Asai et al., 2024] | Self-reflection tokens for adaptive retrieval | 50% fewer retrievals, improved factuality |
| Query2doc [Wang et al., 2023] | LLM-based zero-shot query expansion | Up to 15% improvement on BM25 |
| Llama2Vec [Liu et al., 2023] | Unsupervised LLM-to-retriever adaptation | Matches supervised retrievers on BEIR zero-shot |
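The retrieve-then-generate pattern shared by these systems reduces to a short pipeline: embed the query, fetch the top-k passages by similarity, and condition generation on them. A minimal sketch with toy 3-d embeddings; `generate` is a hypothetical stand-in for an LLM call, not a real API.

```python
import numpy as np

def retrieve(query_vec, passage_vecs, passages, k=2):
    """Dense retrieval step: top-k passages by inner-product similarity."""
    scores = passage_vecs @ query_vec
    top = np.argsort(-scores)[:k]
    return [passages[i] for i in top]

def generate(prompt):
    """Hypothetical stand-in for an LLM; a real system would invoke a
    generation model on the assembled prompt here."""
    return f"[LLM answer grounded in]\n{prompt}"

def rag_answer(query, query_vec, passage_vecs, passages):
    """Retrieve-then-generate: condition the LLM on retrieved evidence."""
    context = "\n".join(retrieve(query_vec, passage_vecs, passages))
    return generate(f"Context:\n{context}\n\nQuestion: {query}")

# Toy corpus with illustrative embeddings (not from a real encoder):
passages = ["DPR uses dual encoders.", "BM25 matches exact terms.",
            "Bananas are yellow."]
passage_vecs = np.array([[0.9, 0.1, 0.0], [0.7, 0.3, 0.0], [0.0, 0.0, 1.0]])
query_vec = np.array([1.0, 0.0, 0.0])

print(rag_answer("How does DPR work?", query_vec, passage_vecs, passages))
```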

4. Hybrid & Advanced Methods

| Paper | Contribution | Key Result |
|---|---|---|
| SPLADE++ [Formal et al., 2022] | Enhanced sparse neural retrieval | 50.7 nDCG@10 on BEIR (exceeds ColBERTv2) |
| HyDE [Gao et al., 2023] | Hypothetical document embeddings | 5–15% zero-shot improvement |
| ColBERTv2 [Santhanam et al., 2022] | Late interaction + residual compression | 6–10× storage reduction, maintained accuracy |
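ColBERT's late interaction is the MaxSim operator: each query token takes its maximum similarity over all document tokens, and these maxima are summed. A minimal sketch on tiny hand-made token-embedding matrices:

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction: for each query token embedding,
    take its maximum similarity over all document token embeddings,
    then sum across query tokens."""
    sims = query_tokens @ doc_tokens.T     # (|q|, |d|) token similarities
    return float(sims.max(axis=1).sum())

q = np.array([[1.0, 0.0], [0.0, 1.0]])                 # 2 query tokens
d_good = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])  # matching doc
d_bad = np.array([[-0.9, 0.1], [0.2, -0.7]])             # unrelated doc

# The matching document scores higher under token-level matching.
print(maxsim_score(q, d_good), maxsim_score(q, d_bad))
```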

Critical Analysis

Strengths

  • Dense retrieval establishes strong semantic matching (DPR outperforms BM25 by 19+ points)
  • Billion-scale deployment is feasible via compression and disaggregation (HAKES: 16× throughput)
  • RAG reduces hallucinations through dynamic retrieval and self-reflection (SELF-RAG: 50% fewer redundant retrievals)
  • Sparse neural methods (SPLADE++) achieve dense-competitive performance with inverted index compatibility
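The sparse/dense complementarity noted above is often exploited by fusing the two ranked lists; reciprocal rank fusion (RRF) is a common, parameter-light choice. A minimal sketch with hypothetical document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists (e.g., one from BM25, one from a dense retriever)
    by summing 1/(k + rank) per document; k=60 is the constant suggested
    in the original RRF paper (Cormack et al., 2009)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d2", "d1", "d3"]   # hypothetical BM25 ranking
dense = ["d1", "d3", "d2"]    # hypothetical dense-retriever ranking
print(reciprocal_rank_fusion([sparse, dense]))
```

Documents ranked well by both systems rise to the top even when neither list alone ranks them first.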

Weaknesses & Limitations

  • Hard negative mining and large contrastive training are computationally expensive
  • RAG systems can propagate retrieval noise into generation
  • Security: 0.02% adversarial passages can mislead dense retrievers
  • Multimodal retrieval remains largely underdeveloped
  • Ethical issues (bias, transparency) are insufficiently addressed

Research Gaps

🔒 Robustness       — Certified defenses against corpus poisoning
🖼️  Multimodal      — Cross-modal retrieval beyond text
⚖️  Ethics          — Bias measurement, fairness constraints in embeddings
🌱 Sustainability   — Energy-efficient trillion-scale systems
📱 Edge Deployment  — On-device semantic search and RAG
📄 Long Context     — Multi-hop reasoning over large documents

Trends & Future Directions

Emerging Technologies

| Trend | Example |
|---|---|
| Hardware Acceleration | Chameleon (FPGA+GPU), HAKES (disaggregated architecture) |
| Unsupervised Learning | Llama2Vec synthetic training, HyDE zero-shot expansion |
| Compression | ColBERTv2 residual compression (6–10×), 48× compressed concatenation |
| Self-Reflective RAG | SELF-RAG reflection tokens for adaptive retrieval |

Future Applications

  • 🏥 Healthcare — Clinical decision support with verifiable citations
  • ⚖️ Legal — Cross-jurisdiction case law synthesis
  • 💹 Finance — Real-time market intelligence and fraud detection
  • 🎓 Education — Personalized intelligent tutoring systems
  • 🏭 Enterprise — Unified semantic search across documents, email, and code
  • 🔬 Scientific Research — Systematic review acceleration

Conclusion

Semantic search over big data with LLMs has progressed from early neural ranking to sophisticated retrieval-generation systems. Key takeaways:

  1. Training methodology (negative sampling, distillation) matters more than architectural complexity
  2. Sample efficiency is remarkable — DPR with 1,000 examples outperforms classical BM25
  3. Security vulnerabilities in dense retrieval are severe, especially for unsupervised models
  4. Hybrid approaches combining sparse and dense retrieval consistently achieve best trade-offs
  5. RAG + hardware acceleration enables production-scale factually-grounded generation

As these systems mediate information access for billions of users, attention to effectiveness, robustness, fairness, and transparency becomes paramount.


References

| # | Citation |
|---|---|
| [1] | Karpukhin et al., "Dense Passage Retrieval for Open-Domain QA," EMNLP 2020 |
| [2] | Santhanam et al., "ColBERTv2," NAACL-HLT 2022 |
| [3] | Wang et al., "SIMLM," ACL 2023 |
| [4] | Liu et al., "Llama2Vec," arXiv 2023 |
| [5] | Asai et al., "SELF-RAG," arXiv 2024 |
| [6] | Jiang et al., "Chameleon," VLDB 2023 |
| [7] | Hu et al., "HAKES," arXiv 2025 |
| [8] | Guo et al., "Semantic Models for First-Stage Retrieval," ACM TOIS 2022 |
| [9] | Zhao et al., "RAG for AI-Generated Content: A Survey," 2026 |
| [10] | Zhong et al., "Poisoning Retrieval Corpora," EMNLP 2023 |
| [11] | Xu et al., "SPFresh," SOSP 2023 |
| [12] | Xiao et al., "Distill-VQ," SIGIR 2022 |
| [13] | Ayoub et al., "Compressed Concatenation of Small Embedding Models," CIKM 2025 |
| [14] | Wang et al., "Query2doc," arXiv 2023 |
| [15] | Formal et al., "SPLADE++," SIGIR 2022 |
| [16] | Gao et al., "HyDE," ACL 2023 |

Tunis Business School · University of Tunis · 2025–2026
