
Semantic Search Engines over Big Data Using LLMs

Survey Paper | Big Data Analytics Course | Tunis Business School, University of Tunis
Authors: Khouloud BEN YOUNES · Montaha GHABRI
Supervisor: Pr. Manel ABDELKADER | Academic Year: 2025–2026


📄 Abstract

This survey explores semantic search over big data using large language models (LLMs), focusing on dense retrieval techniques and retrieval-augmented generation (RAG) systems. Semantic search has evolved from traditional keyword-based methods to advanced embedding-based approaches that capture meaning and context, enabling more accurate information retrieval from massive corpora.

Key findings include:

  • Dense retrievers like DPR and ColBERT improve semantic understanding but face scalability and adversarial challenges
  • RAG systems like SELF-RAG enhance LLM reliability by integrating external knowledge
  • Emerging trends: unsupervised LLM adaptation (Llama2Vec), hardware acceleration (Chameleon), and query enhancement (HyDE, Query2doc)

📚 Table of Contents

  1. Introduction
  2. Literature Selection
  3. Background
  4. Review of Existing Work
  5. Critical Analysis
  6. Trends & Future Directions
  7. Conclusion
  8. References

Introduction

The exponential growth of digital data presents unprecedented challenges for information retrieval. Traditional term-based methods like BM25 suffer from vocabulary mismatch — relevant documents may use different terminology than queries. Neural semantic search addresses three core limitations:

| Limitation | Example |
|---|---|
| Vocabulary Mismatch | Understanding "physician" and "doctor" as equivalent |
| Semantic Understanding | Matching "bad guy" to "villain" in context |
| Compositionality | Interpreting multi-word queries as coherent semantic units |
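The vocabulary-mismatch limitation can be made concrete with a toy sketch: exact term matching scores zero for synonyms, while embedding similarity does not. The 3-d vectors below are illustrative values, not outputs of a real model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def keyword_match(q, d):
    """BM25-style exact term overlap on single tokens: 1 if identical, else 0."""
    return 1.0 if q == d else 0.0

# Toy 3-d "embeddings" (hand-picked for illustration): synonyms end up
# close in vector space even with zero lexical overlap.
emb = {
    "physician": [0.9, 0.1, 0.2],
    "doctor":    [0.85, 0.15, 0.25],
    "banana":    [0.1, 0.9, 0.3],
}

print(keyword_match("physician", "doctor"))            # exact match fails
print(cosine(emb["physician"], emb["doctor"]))         # high similarity
print(cosine(emb["physician"], emb["banana"]))         # low similarity
```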

Research Questions

| # | Question |
|---|---|
| RQ1 | How do sparse, dense, and hybrid retrieval paradigms compare in effectiveness, efficiency, and scalability? |
| RQ2 | What training methodologies most impact retrieval quality? |
| RQ3 | How do neural retrievers handle billion-scale deployment challenges? |
| RQ4 | What security vulnerabilities exist in dense retrieval systems? |
| RQ5 | How do retrieval systems integrate with LLM generation in RAG architectures? |

Literature Selection

  • Sources: IEEE Xplore, ACM Digital Library, SpringerLink, Elsevier, arXiv (cs.CL, cs.IR, cs.LG)
  • Time Window: 2018–2025
  • Selection: 800+ initial results → 180 shortlisted → 35 highly relevant → 16 core papers
  • Citation Thresholds: 100+ (2020–2022), 20+ (2023), 10+ (2024–2025)
  • Top Venues: SIGIR, ACL, EMNLP, NeurIPS, ICML, VLDB, WWW

Background

Core Concepts

| Term | Definition |
|---|---|
| Dense Retrieval | Dual-encoder architectures mapping queries and passages to a shared vector space |
| RAG | Framework integrating retrieval with generative LLMs for factual grounding |
| Contrastive Learning | Training that pulls positive pairs closer while pushing negatives apart |
| Hard Negatives | Semantically similar but non-relevant passages used to sharpen training |
| Late Interaction | Token-level query-passage matching (e.g., ColBERT) |
| ANN Search | Approximate nearest neighbor search, e.g., HNSW graphs and the FAISS library |
| Vector Quantization | Compression of full-precision vectors into discrete codes |
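The contrastive-learning and in-batch-negatives ideas above can be sketched as an InfoNCE-style loss: each query's matching passage sits on the diagonal of a batch similarity matrix, and every other passage in the batch acts as a negative. This is a minimal NumPy sketch, not a training loop; the temperature value is an illustrative choice.

```python
import numpy as np

def in_batch_contrastive_loss(q, p, temperature=0.05):
    """InfoNCE loss with in-batch negatives, in the style of DPR dual
    encoders: row i of q pairs with row i of p; other rows are negatives."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    sims = q @ p.T / temperature                 # (batch, batch) similarities
    sims -= sims.max(axis=1, keepdims=True)      # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # diagonal = positive pairs

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
aligned = queries + 0.01 * rng.normal(size=(4, 8))  # near-identical positives
shuffled = rng.normal(size=(4, 8))                  # unrelated passages

# Aligned pairs should yield a much lower loss than unrelated pairs.
print(in_batch_contrastive_loss(queries, aligned))
print(in_batch_contrastive_loss(queries, shuffled))
```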

Key Evaluation Benchmarks

  • MS MARCO — Primary benchmark for passage ranking and retrieval
  • BEIR — Zero-shot generalization across medical, legal, financial domains
  • Natural Questions (NQ) — Open-domain QA over Wikipedia

Evaluation Metrics

  • MRR@k — Mean Reciprocal Rank of first relevant document
  • nDCG — Normalized Discounted Cumulative Gain (accounts for graded relevance)
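Both metrics above are simple to compute from ranked relevance judgments; a minimal sketch (binary relevance for MRR, graded relevance for nDCG):

```python
import math

def mrr_at_k(relevances, k=10):
    """Mean Reciprocal Rank: 1/rank of the first relevant result within
    the top-k, averaged over queries. `relevances` holds one binary
    relevance list per query, in ranked order."""
    total = 0.0
    for rels in relevances:
        for rank, rel in enumerate(rels[:k], start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(relevances)

def ndcg_at_k(rels, ideal, k=10):
    """Normalized DCG for one ranked list of graded relevance scores,
    divided by the DCG of the ideal (descending) ordering."""
    def dcg(scores):
        return sum(s / math.log2(i + 2) for i, s in enumerate(scores[:k]))
    return dcg(rels) / dcg(ideal)

# First relevant doc at rank 2 for query 1, rank 1 for query 2:
print(mrr_at_k([[0, 1, 0], [1, 0, 0]]))     # (1/2 + 1) / 2 = 0.75
# Graded relevances 2,0,1 against the ideal ordering 2,1,0:
print(ndcg_at_k([2, 0, 1], [2, 1, 0]))
```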

Review of Existing Work

1. Dense Retrieval Foundations

| Paper | Contribution | Key Result |
|---|---|---|
| DPR [Karpukhin et al., 2020] | Dual-encoder with in-batch negatives | 78.4% top-20 accuracy on NQ (vs. 59.1% BM25) |
| SIMLM [Wang et al., 2023] | Bottleneck pre-training (128-dim compression) | 41% MRR@10 on MS MARCO, strong zero-shot |
| Survey [Guo et al., 2022] | 30-year taxonomy of semantic models | Identifies hybrid approaches as best balance |
| Poisoning Attacks [Zhong et al., 2023] | Corpus poisoning via gradient-based token replacement | 99.4% attack success on unsupervised retrievers |

2. Scalability & Vector Search

| Paper | Contribution | Key Result |
|---|---|---|
| SPFresh [Xu et al., 2023] | Incremental in-place updates (LIRE algorithm) | 2–5× faster updates, 95% recall on 1B vectors |
| Distill-VQ [Xiao et al., 2022] | Ranking-aware vector quantization | 2–5% MRR/recall gain at same compression ratio |
| Compressed Concat. [Ayoub et al., 2025] | Compressed concatenation of small models | 89% performance at 48× compression |
| HAKES [Hu et al., 2025] | Disaggregated vector database architecture | 16× throughput vs. Weaviate/Milvus |
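The vector-quantization idea underlying systems like Distill-VQ can be sketched as toy product quantization: split each vector into subvectors and store only the index of the nearest codebook centroid per subvector. Real systems learn codebooks with k-means; random codebooks are used here purely for illustration.

```python
import numpy as np

def pq_encode(x, codebooks):
    """Product quantization: split x into len(codebooks) subvectors and
    record, for each, the index of its nearest codebook centroid."""
    subs = np.split(x, len(codebooks))
    return [int(np.argmin(np.linalg.norm(cb - s, axis=1)))
            for cb, s in zip(codebooks, subs)]

def pq_decode(codes, codebooks):
    """Reconstruct an approximate vector from the stored centroid indices."""
    return np.concatenate([cb[c] for cb, c in zip(codebooks, codes)])

rng = np.random.default_rng(1)
# Two subspaces of dim 4, each with 256 centroids (illustrative random
# codebooks; normally trained with k-means on the corpus).
codebooks = [rng.normal(size=(256, 4)) for _ in range(2)]
x = rng.normal(size=8)

codes = pq_encode(x, codebooks)       # 8 floats compressed to 2 byte-sized codes
x_hat = pq_decode(codes, codebooks)   # lossy reconstruction
print(codes, float(np.linalg.norm(x - x_hat)))
```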

3. RAG Architectures & Optimization

| Paper | Contribution | Key Result |
|---|---|---|
| RAG Survey [Zhao et al., 2026] | Comprehensive RAG taxonomy | Identifies hallucination reduction as central motivation |
| Chameleon [Jiang et al., 2023] | FPGA+GPU heterogeneous acceleration | 2.16× latency reduction, 3.18× throughput |
| SELF-RAG [Asai et al., 2024] | Self-reflection tokens for adaptive retrieval | 50% fewer retrievals, improved factuality |
| Query2doc [Wang et al., 2023] | LLM-based zero-shot query expansion | Up to 15% improvement on BM25 |
| Llama2Vec [Liu et al., 2023] | Unsupervised LLM-to-retriever adaptation | Matches supervised retrievers on BEIR zero-shot |
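The retrieve-then-generate pattern shared by these systems reduces to a short pipeline: embed the query, fetch the top-k passages by similarity, and condition generation on them. A minimal sketch with toy 3-d embeddings; `generate` is a hypothetical stand-in for an LLM call, not a real API.

```python
import numpy as np

def retrieve(query_vec, passage_vecs, passages, k=2):
    """Dense retrieval step: top-k passages by inner-product similarity."""
    scores = passage_vecs @ query_vec
    top = np.argsort(-scores)[:k]
    return [passages[i] for i in top]

def generate(prompt):
    """Hypothetical stand-in for an LLM; a real system would invoke a
    generation model on the assembled prompt here."""
    return f"[LLM answer grounded in]\n{prompt}"

def rag_answer(query, query_vec, passage_vecs, passages):
    """Retrieve-then-generate: condition the LLM on retrieved evidence."""
    context = "\n".join(retrieve(query_vec, passage_vecs, passages))
    return generate(f"Context:\n{context}\n\nQuestion: {query}")

# Toy corpus with illustrative embeddings (not from a real encoder):
passages = ["DPR uses dual encoders.", "BM25 matches exact terms.",
            "Bananas are yellow."]
passage_vecs = np.array([[0.9, 0.1, 0.0], [0.7, 0.3, 0.0], [0.0, 0.0, 1.0]])
query_vec = np.array([1.0, 0.0, 0.0])

print(rag_answer("How does DPR work?", query_vec, passage_vecs, passages))
```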

4. Hybrid & Advanced Methods

| Paper | Contribution | Key Result |
|---|---|---|
| SPLADE++ [Formal et al., 2022] | Enhanced sparse neural retrieval | 50.7 nDCG@10 on BEIR (exceeds ColBERTv2) |
| HyDE [Gao et al., 2023] | Hypothetical document embeddings | 5–15% zero-shot improvement |
| ColBERTv2 [Santhanam et al., 2022] | Late interaction + residual compression | 6–10× storage reduction, maintained accuracy |
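ColBERT's late interaction is the MaxSim operator: each query token takes its maximum similarity over all document tokens, and these maxima are summed. A minimal sketch on tiny hand-made token-embedding matrices:

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction: for each query token embedding,
    take its maximum similarity over all document token embeddings,
    then sum across query tokens."""
    sims = query_tokens @ doc_tokens.T     # (|q|, |d|) token similarities
    return float(sims.max(axis=1).sum())

q = np.array([[1.0, 0.0], [0.0, 1.0]])                 # 2 query tokens
d_good = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])  # matching doc
d_bad = np.array([[-0.9, 0.1], [0.2, -0.7]])             # unrelated doc

# The matching document scores higher under token-level matching.
print(maxsim_score(q, d_good), maxsim_score(q, d_bad))
```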

Critical Analysis

Strengths

  • Dense retrieval establishes strong semantic matching (DPR outperforms BM25 by 19+ points)
  • Billion-scale deployment is feasible via compression and disaggregation (HAKES: 16× throughput)
  • RAG reduces hallucinations through dynamic retrieval and self-reflection (SELF-RAG: 50% fewer redundant retrievals)
  • Sparse neural methods (SPLADE++) achieve dense-competitive performance with inverted index compatibility
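The sparse/dense complementarity noted above is often exploited by fusing the two ranked lists; reciprocal rank fusion (RRF) is a common, parameter-light choice. A minimal sketch with hypothetical document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists (e.g., one from BM25, one from a dense retriever)
    by summing 1/(k + rank) per document; k=60 is the constant suggested
    in the original RRF paper (Cormack et al., 2009)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d2", "d1", "d3"]   # hypothetical BM25 ranking
dense = ["d1", "d3", "d2"]    # hypothetical dense-retriever ranking
print(reciprocal_rank_fusion([sparse, dense]))
```

Documents ranked well by both systems rise to the top even when neither list alone ranks them first.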

Weaknesses & Limitations

  • Hard negative mining and large contrastive training are computationally expensive
  • RAG systems can propagate retrieval noise into generation
  • Security: 0.02% adversarial passages can mislead dense retrievers
  • Multimodal retrieval remains largely underdeveloped
  • Ethical issues (bias, transparency) are insufficiently addressed

Research Gaps

🔒 Robustness       — Certified defenses against corpus poisoning
🖼️  Multimodal      — Cross-modal retrieval beyond text
⚖️  Ethics          — Bias measurement, fairness constraints in embeddings
🌱 Sustainability   — Energy-efficient trillion-scale systems
📱 Edge Deployment  — On-device semantic search and RAG
📄 Long Context     — Multi-hop reasoning over large documents

Trends & Future Directions

Emerging Technologies

| Trend | Example |
|---|---|
| Hardware Acceleration | Chameleon (FPGA+GPU), HAKES (disaggregated architecture) |
| Unsupervised Learning | Llama2Vec synthetic training, HyDE zero-shot expansion |
| Compression | ColBERTv2 residual compression (6–10×), 48× compressed concatenation |
| Self-Reflective RAG | SELF-RAG reflection tokens for adaptive retrieval |

Future Applications

  • 🏥 Healthcare — Clinical decision support with verifiable citations
  • ⚖️ Legal — Cross-jurisdiction case law synthesis
  • 💹 Finance — Real-time market intelligence and fraud detection
  • 🎓 Education — Personalized intelligent tutoring systems
  • 🏭 Enterprise — Unified semantic search across documents, email, and code
  • 🔬 Scientific Research — Systematic review acceleration

Conclusion

Semantic search over big data with LLMs has progressed from early neural ranking to sophisticated retrieval-generation systems. Key takeaways:

  1. Training methodology (negative sampling, distillation) matters more than architectural complexity
  2. Sample efficiency is remarkable — DPR with 1,000 examples outperforms classical BM25
  3. Security vulnerabilities in dense retrieval are severe, especially for unsupervised models
  4. Hybrid approaches combining sparse and dense retrieval consistently achieve best trade-offs
  5. RAG + hardware acceleration enables production-scale factually-grounded generation

As these systems mediate information access for billions of users, attention to effectiveness, robustness, fairness, and transparency becomes paramount.


References

| # | Citation |
|---|---|
| [1] | Karpukhin et al., "Dense Passage Retrieval for Open-Domain QA," EMNLP 2020 |
| [2] | Santhanam et al., "ColBERTv2," NAACL-HLT 2022 |
| [3] | Wang et al., "SIMLM," ACL 2023 |
| [4] | Liu et al., "Llama2Vec," arXiv 2023 |
| [5] | Asai et al., "SELF-RAG," arXiv 2024 |
| [6] | Jiang et al., "Chameleon," VLDB 2023 |
| [7] | Hu et al., "HAKES," arXiv 2025 |
| [8] | Guo et al., "Semantic Models for First-Stage Retrieval," ACM TOIS 2022 |
| [9] | Zhao et al., "RAG for AI-Generated Content: A Survey," 2026 |
| [10] | Zhong et al., "Poisoning Retrieval Corpora," EMNLP 2023 |
| [11] | Xu et al., "SPFresh," SOSP 2023 |
| [12] | Xiao et al., "Distill-VQ," SIGIR 2022 |
| [13] | Ayoub et al., "Compressed Concatenation of Small Embedding Models," CIKM 2025 |
| [14] | Wang et al., "Query2doc," arXiv 2023 |
| [15] | Formal et al., "SPLADE++," SIGIR 2022 |
| [16] | Gao et al., "HyDE," ACL 2023 |

Tunis Business School · University of Tunis · 2025–2026
