A complete RAG (Retrieval-Augmented Generation) system for React documentation that includes ingestion, hybrid retrieval, and answer generation with support for multiple LLM providers.
This project implements a full RAG pipeline that:
- Ingests React markdown documentation using LlamaIndex for parsing and chunking
- Stores embeddings and metadata in Qdrant vector database
- Retrieves relevant chunks using hybrid search (dense + sparse + reranking)
- Generates answers using LLMs (OpenAI or Ollama) with intelligent context assembly
- Heading-aware chunking: Splits documents based on markdown headings (`#`, `##`, `###`)
- Token-based splitting: Chunks large sections into ~800-token pieces with 120-token overlap
- Rich metadata: Each chunk includes file path, section slug, heading hierarchy (h1/h2/h3), code-heavy flag, chunk index, and full text content
- Code block preservation: Ensures code blocks are kept intact during chunking
- Flexible embeddings: Switch between embedding models (sentence-transformers or OpenAI)
- Hybrid search: Combines dense vector search (Qdrant) and sparse keyword search (Whoosh BM25)
- Cross-encoder reranking: Uses sentence-transformers cross-encoder for final result ordering
- FastAPI API: RESTful endpoint for querying the knowledge base
- Intelligent context assembly: Expands retrieved chunks to include all chunks from matching files, ordered by chunk index
- Multiple LLM providers: Switch between OpenAI and Ollama (DeepSeek R1, etc.)
- Temperature control: Configurable creativity/randomness for LLM responses
- Token counting: Displays prompt token count before generation
- Context dump: Saves the question and assembled context to a markdown file for debugging
- Progress indicators: Visual feedback during retrieval and generation
```
rag-pipeline-react/
├── app/
│   ├── __init__.py
│   ├── config.py                  # Configuration management (env vars, settings)
│   ├── models.py                  # Pydantic models for data validation
│   ├── embeddings.py              # Embedding model abstraction layer
│   ├── ingestion/
│   │   ├── __init__.py
│   │   ├── ingest_react_docs.py   # Main ingestion CLI script
│   │   ├── llama_ingestion.py     # LlamaIndex pipeline: load → parse → chunk
│   │   └── qdrant_store.py        # Qdrant connection and upsert helpers
│   ├── retrieval/
│   │   ├── __init__.py
│   │   ├── dense_retriever.py     # Qdrant dense vector search
│   │   ├── sparse_retriever.py    # Whoosh BM25 keyword search
│   │   ├── reranker.py            # Cross-encoder reranker
│   │   ├── hybrid_retriever.py    # Merge dense+sparse and rerank
│   │   └── api.py                 # FastAPI app exposing /query
│   ├── answer/
│   │   ├── __init__.py
│   │   ├── answer_service.py      # Answer generation service
│   │   └── api.py                 # FastAPI app exposing /answer
│   └── llm/
│       ├── __init__.py
│       └── adapters.py            # LLM adapter abstraction (OpenAI, Ollama)
├── react-docs/                    # React documentation markdown files
├── rag.py                         # Main CLI entry point
├── pyproject.toml                 # Python project dependencies
├── .env.example                   # Environment variables template
└── README.md
```
- Python 3.10+
- Qdrant instance running (default: `localhost:6333`)
- (Optional) Ollama running locally if using the Ollama LLM provider
- Clone the repository (if applicable) or navigate to the project directory
- Install dependencies:

```
pip install -e .
```

- Configure environment variables: create a `.env` file in the project root (see `.env.example` for a template):
```
# Embedding Model Configuration (match Qdrant vector size)
EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2

# OpenAI Configuration (optional, only needed if using OpenAI)
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-4o-mini
OPENAI_BASE_URL=https://api.openai.com/v1

# Qdrant Configuration
QDRANT_HOST=localhost
QDRANT_PORT=6333
QDRANT_COLLECTION_NAME=react-docs

# React Docs Path
REACT_DOCS_PATH=./react-docs

# Retrieval settings
RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
WHOOSH_INDEX_PATH=./whoosh_index
TOP_K=10
DENSE_LIMIT=30
SPARSE_LIMIT=30

# LLM Provider (openai or ollama)
LLM_PROVIDER=ollama

# Ollama Configuration (if using Ollama)
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=deepseek-r1:8b

# LLM Generation Parameters
TEMPERATURE=0.7  # 0.0-2.0, higher = more creative

# Answer API
RETRIEVAL_URL=http://localhost:8000/query
CONTEXT_DUMP_PATH=./context_dump.md
```

- Start Qdrant:
Using Docker:
```
docker run -p 6333:6333 qdrant/qdrant
```

Or install Qdrant locally following the official documentation.
- Start Ollama (if using Ollama LLM):
```
# Install Ollama from https://ollama.ai
# Pull your desired model
ollama pull deepseek-r1:8b
```

Use the main `rag.py` script in the root directory:

```
python rag.py [COMMAND]
```

Available commands:

- `ingest` - Ingest React documentation into Qdrant
- `retrieve` - Query React documentation using hybrid search
- `answer` - Generate answers using retrieval + LLM
Run the ingestion command to process and store React documentation:
```
python rag.py ingest
```

Or with custom options:

```
python rag.py ingest --docs-path ./custom-docs --collection-name my-collection
```

Command-line Options:

- `--docs-path` / `-d`: Override the default docs path
- `--collection-name` / `-c`: Override the default Qdrant collection name
What happens:
- Loads all markdown files from the docs directory
- Parses markdown into heading-aware nodes
- Splits nodes into token-bounded chunks (800 tokens, 120 overlap)
- Generates embeddings for each chunk
- Stores chunks + embeddings + metadata in Qdrant
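The heading-aware split in these steps can be sketched in plain Python. The real pipeline uses LlamaIndex's `MarkdownNodeParser`; this regex version is illustrative only and ignores the h1/h2/h3 hierarchy:

```python
import re

def split_by_headings(markdown: str) -> list[dict]:
    """Split markdown into sections at h1-h3 headings, skipping headings
    that appear inside fenced code blocks (illustrative sketch only)."""
    sections, current, in_fence = [], {"heading": None, "lines": []}, False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # never treat fence contents as headings
        if not in_fence and re.match(r"#{1,3} ", line):
            if current["heading"] or current["lines"]:
                sections.append(current)
            current = {"heading": line.lstrip("#").strip(), "lines": []}
        else:
            current["lines"].append(line)
    sections.append(current)
    return [{"heading": s["heading"], "text": "\n".join(s["lines"]).strip()}
            for s in sections]
```

Each resulting section would then be handed to the token splitter, which enforces the 800-token / 120-overlap bounds.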
Query the knowledge base using hybrid search:
```
python rag.py retrieve "How do I use useEffect?" --top-k 5
```

Command-line Options:

- `--top-k` / `-k`: Number of results to return (default: 5)
- `--collection-name` / `-c`: Override the default Qdrant collection name
What happens:
- Embeds the query
- Performs dense vector search in Qdrant
- Performs sparse keyword search using BM25 (Whoosh)
- Merges and normalizes candidate sets
- Reranks using cross-encoder
- Returns top-k chunks with scores and metadata
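The merge-and-normalize step above can be sketched as a weighted min-max fusion. The exact logic in `hybrid_retriever.py` may differ; all names and the 50/50 weighting here are illustrative:

```python
def minmax(scores: dict) -> dict:
    """Scale a {chunk_id: raw_score} dict to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def merge_candidates(dense: dict, sparse: dict, dense_weight: float = 0.5) -> list:
    """Fuse dense and sparse candidate sets: normalize each score range
    separately, then take a weighted sum; chunks found by both retrievers
    naturally score higher."""
    d, s = minmax(dense), minmax(sparse)
    merged = {cid: dense_weight * d.get(cid, 0.0)
                   + (1 - dense_weight) * s.get(cid, 0.0)
              for cid in set(d) | set(s)}
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```

The fused candidate list is then passed to the cross-encoder, which re-scores each (query, chunk) pair for the final ordering.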
Generate comprehensive answers using retrieval + LLM:
```
python rag.py answer "Why does useEffect run twice in React 18?" --top-k 5
```

Command-line Options:

- `--top-k` / `-k`: Number of chunks to retrieve initially (default: 5)
- `--collection-name` / `-c`: Override the default Qdrant collection name
What happens:
- Retrieves top-k chunks using hybrid search
- Expands context: For each file in the retrieved set, fetches ALL chunks from that file
- Orders chunks: Sorts by file_path and chunk_index to preserve document flow
- Formats context for LLM
- Displays prompt token count
- Generates answer using configured LLM
- Saves the question and assembled context to `context_dump.md` for review
- Returns answer and source chunks
LLM Provider Configuration:
Set `LLM_PROVIDER` in `.env`:

- `openai` - Use OpenAI (requires `OPENAI_API_KEY`); configure `OPENAI_MODEL` and `OPENAI_BASE_URL`
- `ollama` - Use Ollama (requires Ollama running locally); configure `OLLAMA_BASE_URL` and `OLLAMA_MODEL`
Temperature Control:
Adjust creativity/randomness via `TEMPERATURE` in `.env`:

- `0.0-0.3`: More deterministic, focused responses
- `0.7`: Balanced (default)
- `0.7-2.0`: More creative, varied responses
Run the retrieval API server:
```
uvicorn app.retrieval.api:app --reload --port 8000
```

Query the API:

```
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "How do I add React to an existing project?", "top_k": 5}'
```

Response:
```
{
  "results": [
    {
      "chunk_id": "...",
      "score": 0.85,
      "text": "...",
      "file_path": "learn/add-react-to-an-existing-project.md",
      "section_slug": "add-react-to-an-existing-project",
      "h1": "Adding React to an Existing Project",
      "chunk_index": 0,
      ...
    }
  ]
}
```

Run the answer API server:

```
uvicorn app.answer.api:app --reload --port 8002
```

Query the API:
```
curl -X POST http://localhost:8002/answer \
  -H "Content-Type: application/json" \
  -d '{"question": "Why does useEffect run twice in React 18?", "top_k": 5}'
```

Response:
```
{
  "answer": "In React 18, useEffect runs twice in development...",
  "sources": [
    {
      "chunk_id": "...",
      "text": "...",
      "file_path": "learn/lifecycle-of-reactive-effects.md",
      ...
    }
  ]
}
```

- Markdown Parsing: Uses `MarkdownNodeParser` to split documents into nodes based on headings
  - `#` → top-level sections
  - `##`, `###` → subsections
- Token-based Chunking: Uses `TokenTextSplitter` to further split large sections
  - 800 tokens max per chunk
  - ~120 token overlap between adjacent chunks
  - Code blocks are preserved intact
- Metadata per Chunk:
  - `file_path`: Relative path to source markdown file
  - `section_slug`: URL-friendly slug (e.g., `learn/use-effect#cleaning-up`)
  - `h1`, `h2`, `h3`: Heading hierarchy
  - `is_code_heavy`: True if >30% of content is in code fences
  - `chunk_index`: 0-based index within the section
  - `text`: Full chunk text content (stored for inference)
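Two of these fields can be sketched as small helpers. These are illustrative, not the project's actual implementations: the slug rule and the character-based >30% measure are assumptions:

```python
import re

def slugify(heading: str) -> str:
    """URL-friendly slug, e.g. 'Cleaning up an Effect!' -> 'cleaning-up-an-effect'."""
    return re.sub(r"[^a-z0-9]+", "-", heading.lower()).strip("-")

def is_code_heavy(text: str, threshold: float = 0.3) -> bool:
    """True when more than `threshold` of the characters fall inside
    fenced code blocks (the >30% rule described above)."""
    fenced = re.findall(r"```.*?```", text, flags=re.DOTALL)
    code_chars = sum(len(block) for block in fenced)
    return bool(text) and code_chars / len(text) > threshold
```

A code-heavy flag like this lets downstream consumers treat API-reference chunks differently from narrative prose.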
The pipeline supports flexible embedding models:
- sentence-transformers/all-mpnet-base-v2 (default): 768 dimensions, local model
- all-MiniLM-L6-v2: 384 dimensions, faster, smaller
- OpenAI models: `text-embedding-3-small` (1536 dims), `text-embedding-3-large` (3072 dims), `text-embedding-ada-002` (1536 dims)
Configure via EMBEDDING_MODEL environment variable:
- `sentence-transformers/all-mpnet-base-v2` - Sentence-transformers model (default)
- `openai:text-embedding-3-small` - OpenAI embeddings
- `all-MiniLM-L6-v2` - Smaller sentence-transformers model
Important: The embedding model dimension must match your Qdrant collection's vector size. If you change models, you may need to recreate the collection.
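A fail-fast guard for this mismatch might look like the following. The dimension table and function are illustrative, not part of the project:

```python
# Known output dimensions for the models listed above.
KNOWN_DIMS = {
    "sentence-transformers/all-mpnet-base-v2": 768,
    "all-MiniLM-L6-v2": 384,
    "openai:text-embedding-3-small": 1536,
    "openai:text-embedding-3-large": 3072,
    "openai:text-embedding-ada-002": 1536,
}

def check_dims(embedding_model: str, collection_dim: int) -> None:
    """Raise before ingestion if the configured model cannot match the
    existing Qdrant collection's vector size; unknown models pass through."""
    expected = KNOWN_DIMS.get(embedding_model)
    if expected is not None and expected != collection_dim:
        raise ValueError(
            f"{embedding_model} produces {expected}-dim vectors, "
            f"but the collection expects {collection_dim}"
        )
```

Calling a check like this at startup turns the runtime "expected dim" error from Qdrant into an immediate, explicit configuration error.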
The retrieval system uses a three-stage approach:
- Dense Search: Semantic vector search in Qdrant using query embeddings
- Sparse Search: Keyword-based BM25 search using Whoosh index
- Reranking: Cross-encoder reranking using `cross-encoder/ms-marco-MiniLM-L-6-v2`
Results are merged, normalized, deduplicated, and reranked before returning top-k.
When generating answers, the system:
- Retrieves top-k most relevant chunks
- Extracts unique file paths from these chunks
- For each file, fetches all chunks associated with that file
- Orders chunks by `(file_path, chunk_index)` to preserve document flow
- Formats the ordered chunks into context for the LLM
This ensures the LLM receives complete, ordered context from relevant files rather than just isolated chunks.
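The expansion-and-ordering logic can be sketched like this. In the real service the per-file fetch is a filtered Qdrant query; here `all_chunks` stands in for that lookup and every name is illustrative:

```python
def assemble_context(top_chunks: list[dict], all_chunks: list[dict]) -> str:
    """Expand top-k hits to whole-file context: keep every chunk whose
    file appears among the hits, then sort by (file_path, chunk_index)
    so each document reads in its original order."""
    hit_files = {c["file_path"] for c in top_chunks}
    expanded = [c for c in all_chunks if c["file_path"] in hit_files]
    expanded.sort(key=lambda c: (c["file_path"], c["chunk_index"]))
    return "\n\n".join(c["text"] for c in expanded)
```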
The pipeline creates a Qdrant collection with:
- Vector size: Automatically determined by embedding model (768 for all-mpnet-base-v2, 384 for all-MiniLM-L6-v2, etc.)
- Distance metric: Cosine
- Payload schema: All metadata fields from the `ChunkMetadata` model, including the full `text` field for inference
- Point IDs: Deterministic UUIDs generated from `file_path`, `chunk_index`, and `section_slug`
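One way to derive such IDs is `uuid5` over the three fields. The key format and namespace used in `qdrant_store.py` are assumptions here:

```python
import uuid

def point_id(file_path: str, chunk_index: int, section_slug: str) -> str:
    """Deterministic point ID: uuid5 hashes the key under a fixed
    namespace, so re-ingesting the same chunk always yields the same ID
    and overwrites the existing Qdrant point instead of duplicating it."""
    key = f"{file_path}:{chunk_index}:{section_slug}"  # assumed key format
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))
```

Deterministic IDs make ingestion idempotent: running `rag.py ingest` twice leaves one point per chunk rather than two.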
- `fastapi`: HTTP API framework
- `uvicorn[standard]`: ASGI server
- `sentence-transformers`: Embeddings + cross-encoder reranker
- `qdrant-client`: Vector database client
- `whoosh`: BM25 sparse index
- `python-dotenv`: Environment variable management
- `pydantic` / `pydantic-settings`: Data validation and settings
- `typer`: CLI framework
- `rich`: Terminal formatting and progress indicators
- `httpx`: Async HTTP client
- `tiktoken`: Token counting for OpenAI models
- `llama-index-core`: Document loading and chunking (ingestion only)
- `llama-index-embeddings-openai`: OpenAI embeddings (optional)
- `openai`: OpenAI SDK
(Add test instructions if tests exist)
- Ingestion: Uses LlamaIndex only for ingestion (not at query time)
- Retrieval: Pure Python implementation using Qdrant and Whoosh
- Answer Generation: LLM adapter pattern for easy provider switching
Ensure Qdrant is running and accessible:
```
curl http://localhost:6333/collections
```

If you see errors like "expected dim: 768, got 384":
- Check that your `EMBEDDING_MODEL` setting matches the collection's vector size
- Either recreate the collection with the correct dimension, or change the embedding model to match
If using Ollama, ensure it's running:
```
curl http://localhost:11434/api/tags
```

If you encounter import errors:

```
pip install -e .
```

If the sparse retriever fails, try deleting the `whoosh_index` directory and re-running ingestion. The index will be rebuilt automatically.
MIT License