Survey Paper | Big Data Analytics Course | Tunis Business School, University of Tunis

Authors: Khouloud BEN YOUNES · Montaha GHABRI
Supervisor: Pr. Manel ABDELKADER | Academic Year: 2025–2026
📄 Abstract
This survey explores semantic search over big data using large language models (LLMs), focusing on dense retrieval techniques and retrieval-augmented generation (RAG) systems. Semantic search has evolved from traditional keyword-based methods to advanced embedding-based approaches that capture meaning and context, enabling more accurate information retrieval from massive corpora.
Key findings include:

- Dense retrievers like DPR and ColBERT improve semantic understanding but face scalability and adversarial challenges.
- RAG systems like SELF-RAG enhance LLM reliability by integrating external knowledge.
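To make the first finding concrete: a dense retriever encodes queries and passages into a shared vector space and ranks passages by similarity. The sketch below assumes precomputed toy embeddings (in a real system a bi-encoder such as DPR would produce them) and shows only the first-stage scoring step.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # Cosine similarity: dot product of the vectors, normalized by length.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical precomputed passage embeddings; a dense retriever would
# obtain these from a trained encoder, not hand-written values.
passage_embeddings = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.8, 0.3],
    "p3": [0.0, 0.2, 0.9],
}
query_embedding = [0.85, 0.15, 0.05]

# First-stage dense retrieval: rank all passages by similarity to the query.
ranked = sorted(passage_embeddings,
                key=lambda p: cosine(query_embedding, passage_embeddings[p]),
                reverse=True)
print(ranked)  # p1 is most similar to the query
```

At billion-passage scale this exhaustive scan is replaced by approximate nearest-neighbor indexes, which is where the scalability challenges noted above arise.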
The exponential growth of digital data presents unprecedented challenges for information retrieval. Traditional term-based methods like BM25 suffer from vocabulary mismatch — relevant documents may use different terminology than queries. Neural semantic search addresses three core limitations:
| Limitation | Example |
|---|---|
| Vocabulary mismatch | Understanding "physician" and "doctor" as equivalent |
| Semantic understanding | Matching "bad guy" to "villain" in context |
| Compositionality | Interpreting multi-word queries as coherent semantic units |
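The vocabulary-mismatch row can be demonstrated directly: exact term matching scores zero when query and document use different words for the same concept. The toy synonym map below is a hypothetical stand-in for the geometry a dense embedding model learns from data, not how any real retriever works.

```python
def term_overlap(query, doc):
    # Sparse, term-based matching reduces to shared surface terms;
    # the score is 0 when vocabularies differ, even if meaning is identical.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

query = "find a physician nearby"
doc = "list of doctors in your area"
print(term_overlap(query, doc))  # 0: "physician" never matches "doctors"

# Hypothetical synonym map standing in for learned semantic equivalence.
SYNONYMS = {"physician": "doctor", "doctors": "doctor"}

def normalize(text):
    return {SYNONYMS.get(t, t) for t in text.lower().split()}

print(len(normalize(query) & normalize(doc)))  # 1: both map to "doctor"
```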
Research Questions

| # | Question |
|---|---|
| RQ1 | How do sparse, dense, and hybrid retrieval paradigms compare in effectiveness, efficiency, and scalability? |
| RQ2 | What training methodologies most impact retrieval quality? |
| RQ3 | How do neural retrievers handle billion-scale deployment challenges? |
| RQ4 | What security vulnerabilities exist in dense retrieval systems? |
| RQ5 | How do retrieval systems integrate with LLM generation in RAG architectures? |
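The hybrid paradigm in RQ1 typically combines a sparse and a dense ranked list. One common, model-agnostic way to do this is reciprocal rank fusion (RRF); the sketch below uses hypothetical document IDs and ranked lists purely for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each document's score is the sum of
    1 / (k + rank) over the lists it appears in. k=60 follows the
    constant used in the original RRF formulation."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked lists from a sparse (BM25-style) and a dense retriever.
sparse = ["d3", "d1", "d2"]
dense = ["d1", "d4", "d3"]
print(reciprocal_rank_fusion([sparse, dense]))
```

Because RRF uses only ranks, not raw scores, it sidesteps the problem that sparse and dense scores live on incomparable scales; documents ranked well by both retrievers rise to the top.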
As these systems mediate information access for billions of users, attention to effectiveness, robustness, fairness, and transparency becomes paramount.
References

[1] Karpukhin et al., "Dense Passage Retrieval for Open-Domain QA," EMNLP 2020.
[2] Santhanam et al., "ColBERTv2," NAACL-HLT 2022.
[3] Wang et al., "SIMLM," ACL 2023.
[4] Liu et al., "Llama2Vec," arXiv 2023.
[5] Asai et al., "SELF-RAG," arXiv 2024.
[6] Jiang et al., "Chameleon," VLDB 2023.
[7] Hu et al., "HAKES," arXiv 2025.
[8] Guo et al., "Semantic Models for First-Stage Retrieval," ACM TOIS 2022.
[9] Zhao et al., "RAG for AI-Generated Content: A Survey," 2026.
[10] Zhong et al., "Poisoning Retrieval Corpora," EMNLP 2023.
[11] Xu et al., "SPFresh," SOSP 2023.
[12] Xiao et al., "Distill-VQ," SIGIR 2022.
[13] Ayoub et al., "Compressed Concatenation of Small Embedding Models," CIKM 2025.
[14] Wang et al., "Query2doc," arXiv 2023.
[15] Formal et al., "SPLADE++," SIGIR 2022.
[16] Gao et al., "HyDE," ACL 2023.