A complete RAG (Retrieval-Augmented Generation) system for React documentation that includes ingestion, hybrid retrieval, and answer generation with support for multiple LLM providers.
This project implements a full RAG pipeline that:
- Ingests React markdown documentation using LlamaIndex for parsing and chunking
- Stores embeddings and metadata in Qdrant vector database
- Retrieves relevant chunks using hybrid search (dense + sparse + reranking)
- Generates answers using LLMs (OpenAI or Ollama) with intelligent context assembly
- Heading-aware chunking: Splits documents based on markdown headings (`#`, `##`, `###`)
- Token-based splitting: Chunks large sections into ~800-token pieces with 120-token overlap
- Rich metadata: Each chunk includes file path, section slug, heading hierarchy (h1/h2/h3), code-heavy flag, chunk index, and full text content
- Code block preservation: Ensures code blocks are kept intact during chunking
- Flexible embeddings: Switch between embedding models (sentence-transformers or OpenAI)
- Hybrid search: Combines dense vector search (Qdrant) and sparse keyword search (Whoosh BM25)
- Cross-encoder reranking: Uses sentence-transformers cross-encoder for final result ordering
- FastAPI API: RESTful endpoint for querying the knowledge base
- Intelligent context assembly: Expands retrieved chunks to include all chunks from matching files, ordered by chunk index
- Multiple LLM providers: Switch between OpenAI and Ollama (DeepSeek R1, etc.)
- Temperature control: Configurable creativity/randomness for LLM responses
- Token counting: Displays prompt token count before generation
- Context dump: Saves the question and assembled context to a markdown file for debugging
- Progress indicators: Visual feedback during retrieval and generation
```
rag-pipeline-react/
├── app/
│   ├── __init__.py
│   ├── config.py                  # Configuration management (env vars, settings)
│   ├── models.py                  # Pydantic models for data validation
│   ├── embeddings.py              # Embedding model abstraction layer
│   ├── ingestion/
│   │   ├── __init__.py
│   │   ├── ingest_react_docs.py   # Main ingestion CLI script
│   │   ├── llama_ingestion.py     # LlamaIndex pipeline: load → parse → chunk
│   │   └── qdrant_store.py        # Qdrant connection and upsert helpers
│   ├── retrieval/
│   │   ├── __init__.py
│   │   ├── dense_retriever.py     # Qdrant dense vector search
│   │   ├── sparse_retriever.py    # Whoosh BM25 keyword search
│   │   ├── reranker.py            # Cross-encoder reranker
│   │   ├── hybrid_retriever.py    # Merge dense+sparse and rerank
│   │   └── api.py                 # FastAPI app exposing /query
│   ├── answer/
│   │   ├── __init__.py
│   │   ├── answer_service.py      # Answer generation service
│   │   └── api.py                 # FastAPI app exposing /answer
│   └── llm/
│       ├── __init__.py
│       └── adapters.py            # LLM adapter abstraction (OpenAI, Ollama)
├── react-docs/                    # React documentation markdown files
├── rag.py                         # Main CLI entry point
├── pyproject.toml                 # Python project dependencies
├── .env.example                   # Environment variables template
└── README.md
```
- Python 3.10+
- Qdrant instance running (default: `localhost:6333`)
- (Optional) Ollama running locally if using the Ollama LLM provider
- Clone the repository (if applicable) or navigate to the project directory
- Install dependencies:

```
pip install -e .
```

- Configure environment variables: create a `.env` file in the project root (see `.env.example` for a template):
```
# Embedding Model Configuration (match Qdrant vector size)
EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2

# OpenAI Configuration (optional, only needed if using OpenAI)
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-4o-mini
OPENAI_BASE_URL=https://api.openai.com/v1

# Qdrant Configuration
QDRANT_HOST=localhost
QDRANT_PORT=6333
QDRANT_COLLECTION_NAME=react-docs

# React Docs Path
REACT_DOCS_PATH=./react-docs

# Retrieval settings
RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
WHOOSH_INDEX_PATH=./whoosh_index
TOP_K=10
DENSE_LIMIT=30
SPARSE_LIMIT=30

# LLM Provider (openai or ollama)
LLM_PROVIDER=ollama

# Ollama Configuration (if using Ollama)
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=deepseek-r1:8b

# LLM Generation Parameters
TEMPERATURE=0.7  # 0.0-2.0, higher = more creative

# Answer API
RETRIEVAL_URL=http://localhost:8000/query
CONTEXT_DUMP_PATH=./context_dump.md
```

- Start Qdrant:
Using Docker:
```
docker run -p 6333:6333 qdrant/qdrant
```

Or install Qdrant locally following the official documentation.
- Start Ollama (if using Ollama LLM):
```
# Install Ollama from https://ollama.ai
# Pull your desired model
ollama pull deepseek-r1:8b
```

Use the main `rag.py` script in the root directory:

```
python rag.py [COMMAND]
```

Available commands:

- `ingest` - Ingest React documentation into Qdrant
- `retrieve` - Query React documentation using hybrid search
- `answer` - Generate answers using retrieval + LLM
Run the ingestion command to process and store React documentation:
```
python rag.py ingest
```

Or with custom options:

```
python rag.py ingest --docs-path ./custom-docs --collection-name my-collection
```

Command-line Options:

- `--docs-path` / `-d`: Override the default docs path
- `--collection-name` / `-c`: Override the default Qdrant collection name
What happens:
- Loads all markdown files from the docs directory
- Parses markdown into heading-aware nodes
- Splits nodes into token-bounded chunks (800 tokens, 120 overlap)
- Generates embeddings for each chunk
- Stores chunks + embeddings + metadata in Qdrant
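The heading-aware split in these steps can be sketched in plain Python. The real pipeline uses LlamaIndex's `MarkdownNodeParser`; this regex version is illustrative only and ignores the h1/h2/h3 hierarchy:

```python
import re

def split_by_headings(markdown: str) -> list[dict]:
    """Split markdown into sections at h1-h3 headings, skipping headings
    that appear inside fenced code blocks (illustrative sketch only)."""
    sections, current, in_fence = [], {"heading": None, "lines": []}, False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # never treat fence contents as headings
        if not in_fence and re.match(r"#{1,3} ", line):
            if current["heading"] or current["lines"]:
                sections.append(current)
            current = {"heading": line.lstrip("#").strip(), "lines": []}
        else:
            current["lines"].append(line)
    sections.append(current)
    return [{"heading": s["heading"], "text": "\n".join(s["lines"]).strip()}
            for s in sections]
```

Each resulting section would then be handed to the token splitter, which enforces the 800-token / 120-overlap bounds.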
Query the knowledge base using hybrid search:
```
python rag.py retrieve "How do I use useEffect?" --top-k 5
```

Command-line Options:

- `--top-k` / `-k`: Number of results to return (default: 5)
- `--collection-name` / `-c`: Override the default Qdrant collection name
What happens:
- Embeds the query
- Performs dense vector search in Qdrant
- Performs sparse keyword search using BM25 (Whoosh)
- Merges and normalizes candidate sets
- Reranks using cross-encoder
- Returns top-k chunks with scores and metadata
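The merge-and-normalize step above can be sketched as a weighted min-max fusion. The exact logic in `hybrid_retriever.py` may differ; all names and the 50/50 weighting here are illustrative:

```python
def minmax(scores: dict) -> dict:
    """Scale a {chunk_id: raw_score} dict to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def merge_candidates(dense: dict, sparse: dict, dense_weight: float = 0.5) -> list:
    """Fuse dense and sparse candidate sets: normalize each score range
    separately, then take a weighted sum; chunks found by both retrievers
    naturally score higher."""
    d, s = minmax(dense), minmax(sparse)
    merged = {cid: dense_weight * d.get(cid, 0.0)
                   + (1 - dense_weight) * s.get(cid, 0.0)
              for cid in set(d) | set(s)}
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```

The fused candidate list is then passed to the cross-encoder, which re-scores each (query, chunk) pair for the final ordering.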
Generate comprehensive answers using retrieval + LLM:
```
python rag.py answer "Why does useEffect run twice in React 18?" --top-k 5
```

Command-line Options:

- `--top-k` / `-k`: Number of chunks to retrieve initially (default: 5)
- `--collection-name` / `-c`: Override the default Qdrant collection name
What happens:
- Retrieves top-k chunks using hybrid search
- Expands context: For each file in the retrieved set, fetches ALL chunks from that file
- Orders chunks: Sorts by file_path and chunk_index to preserve document flow
- Formats context for LLM
- Displays prompt token count
- Generates answer using configured LLM
- Saves the question and assembled context to `context_dump.md` for review
- Returns answer and source chunks
LLM Provider Configuration:
Set `LLM_PROVIDER` in `.env`:

- `openai` - Use OpenAI (requires `OPENAI_API_KEY`); configure `OPENAI_MODEL` and `OPENAI_BASE_URL`
- `ollama` - Use Ollama (requires Ollama running locally); configure `OLLAMA_BASE_URL` and `OLLAMA_MODEL`
Temperature Control:
Adjust creativity/randomness via `TEMPERATURE` in `.env`:

- `0.0-0.3`: More deterministic, focused responses
- `0.7`: Balanced (default)
- `0.7-2.0`: More creative, varied responses
Run the retrieval API server:
```
uvicorn app.retrieval.api:app --reload --port 8000
```

Query the API:

```
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "How do I add React to an existing project?", "top_k": 5}'
```

Response:
```
{
  "results": [
    {
      "chunk_id": "...",
      "score": 0.85,
      "text": "...",
      "file_path": "learn/add-react-to-an-existing-project.md",
      "section_slug": "add-react-to-an-existing-project",
      "h1": "Adding React to an Existing Project",
      "chunk_index": 0,
      ...
    }
  ]
}
```

Run the answer API server:

```
uvicorn app.answer.api:app --reload --port 8002
```

Query the API:
```
curl -X POST http://localhost:8002/answer \
  -H "Content-Type: application/json" \
  -d '{"question": "Why does useEffect run twice in React 18?", "top_k": 5}'
```

Response:
```
{
  "answer": "In React 18, useEffect runs twice in development...",
  "sources": [
    {
      "chunk_id": "...",
      "text": "...",
      "file_path": "learn/lifecycle-of-reactive-effects.md",
      ...
    }
  ]
}
```

- Markdown Parsing: Uses `MarkdownNodeParser` to split documents into nodes based on headings
  - `#` → top-level sections
  - `##`, `###` → subsections
- Token-based Chunking: Uses `TokenTextSplitter` to further split large sections
  - 800 tokens max per chunk
  - ~120 token overlap between adjacent chunks
  - Code blocks are preserved intact
- Metadata per Chunk:
  - `file_path`: Relative path to source markdown file
  - `section_slug`: URL-friendly slug (e.g., `learn/use-effect#cleaning-up`)
  - `h1`, `h2`, `h3`: Heading hierarchy
  - `is_code_heavy`: True if >30% of content is in code fences
  - `chunk_index`: 0-based index within the section
  - `text`: Full chunk text content (stored for inference)
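Two of these fields can be sketched as small helpers. These are illustrative, not the project's actual implementations: the slug rule and the character-based >30% measure are assumptions:

```python
import re

def slugify(heading: str) -> str:
    """URL-friendly slug, e.g. 'Cleaning up an Effect!' -> 'cleaning-up-an-effect'."""
    return re.sub(r"[^a-z0-9]+", "-", heading.lower()).strip("-")

def is_code_heavy(text: str, threshold: float = 0.3) -> bool:
    """True when more than `threshold` of the characters fall inside
    fenced code blocks (the >30% rule described above)."""
    fenced = re.findall(r"```.*?```", text, flags=re.DOTALL)
    code_chars = sum(len(block) for block in fenced)
    return bool(text) and code_chars / len(text) > threshold
```

A code-heavy flag like this lets downstream consumers treat API-reference chunks differently from narrative prose.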
The pipeline supports flexible embedding models:
- sentence-transformers/all-mpnet-base-v2 (default): 768 dimensions, local model
- all-MiniLM-L6-v2: 384 dimensions, faster, smaller
- OpenAI models: `text-embedding-3-small` (1536 dims), `text-embedding-3-large` (3072 dims), `text-embedding-ada-002` (1536 dims)
Configure via EMBEDDING_MODEL environment variable:
- `sentence-transformers/all-mpnet-base-v2` - Sentence-transformers model (default)
- `openai:text-embedding-3-small` - OpenAI embeddings
- `all-MiniLM-L6-v2` - Smaller sentence-transformers model
Important: The embedding model dimension must match your Qdrant collection's vector size. If you change models, you may need to recreate the collection.
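A fail-fast guard for this mismatch might look like the following. The dimension table and function are illustrative, not part of the project:

```python
# Known output dimensions for the models listed above.
KNOWN_DIMS = {
    "sentence-transformers/all-mpnet-base-v2": 768,
    "all-MiniLM-L6-v2": 384,
    "openai:text-embedding-3-small": 1536,
    "openai:text-embedding-3-large": 3072,
    "openai:text-embedding-ada-002": 1536,
}

def check_dims(embedding_model: str, collection_dim: int) -> None:
    """Raise before ingestion if the configured model cannot match the
    existing Qdrant collection's vector size; unknown models pass through."""
    expected = KNOWN_DIMS.get(embedding_model)
    if expected is not None and expected != collection_dim:
        raise ValueError(
            f"{embedding_model} produces {expected}-dim vectors, "
            f"but the collection expects {collection_dim}"
        )
```

Calling a check like this at startup turns the runtime "expected dim" error from Qdrant into an immediate, explicit configuration error.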
The retrieval system uses a three-stage approach:
- Dense Search: Semantic vector search in Qdrant using query embeddings
- Sparse Search: Keyword-based BM25 search using Whoosh index
- Reranking: Cross-encoder reranking using `cross-encoder/ms-marco-MiniLM-L-6-v2`
Results are merged, normalized, deduplicated, and reranked before returning top-k.
When generating answers, the system:
- Retrieves top-k most relevant chunks
- Extracts unique file paths from these chunks
- For each file, fetches all chunks associated with that file
- Orders chunks by `(file_path, chunk_index)` to preserve document flow
- Formats the ordered chunks into context for the LLM
This ensures the LLM receives complete, ordered context from relevant files rather than just isolated chunks.
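The expansion-and-ordering logic can be sketched like this. In the real service the per-file fetch is a filtered Qdrant query; here `all_chunks` stands in for that lookup and every name is illustrative:

```python
def assemble_context(top_chunks: list[dict], all_chunks: list[dict]) -> str:
    """Expand top-k hits to whole-file context: keep every chunk whose
    file appears among the hits, then sort by (file_path, chunk_index)
    so each document reads in its original order."""
    hit_files = {c["file_path"] for c in top_chunks}
    expanded = [c for c in all_chunks if c["file_path"] in hit_files]
    expanded.sort(key=lambda c: (c["file_path"], c["chunk_index"]))
    return "\n\n".join(c["text"] for c in expanded)
```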
The pipeline creates a Qdrant collection with:
- Vector size: Automatically determined by embedding model (768 for all-mpnet-base-v2, 384 for all-MiniLM-L6-v2, etc.)
- Distance metric: Cosine
- Payload schema: All metadata fields from the `ChunkMetadata` model, including the full `text` field for inference
- Point IDs: Deterministic UUIDs generated from `file_path`, `chunk_index`, and `section_slug`
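One way to derive such IDs is `uuid5` over the three fields. The key format and namespace used in `qdrant_store.py` are assumptions here:

```python
import uuid

def point_id(file_path: str, chunk_index: int, section_slug: str) -> str:
    """Deterministic point ID: uuid5 hashes the key under a fixed
    namespace, so re-ingesting the same chunk always yields the same ID
    and overwrites the existing Qdrant point instead of duplicating it."""
    key = f"{file_path}:{chunk_index}:{section_slug}"  # assumed key format
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))
```

Deterministic IDs make ingestion idempotent: running `rag.py ingest` twice leaves one point per chunk rather than two.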
- `fastapi`: HTTP API framework
- `uvicorn[standard]`: ASGI server
- `sentence-transformers`: Embeddings + cross-encoder reranker
- `qdrant-client`: Vector database client
- `whoosh`: BM25 sparse index
- `python-dotenv`: Environment variable management
- `pydantic` / `pydantic-settings`: Data validation and settings
- `typer`: CLI framework
- `rich`: Terminal formatting and progress indicators
- `httpx`: Async HTTP client
- `tiktoken`: Token counting for OpenAI models
- `llama-index-core`: Document loading and chunking (ingestion only)
- `llama-index-embeddings-openai`: OpenAI embeddings (optional)
- `openai`: OpenAI SDK
(Add test instructions if tests exist)
- Ingestion: Uses LlamaIndex only for ingestion (not at query time)
- Retrieval: Pure Python implementation using Qdrant and Whoosh
- Answer Generation: LLM adapter pattern for easy provider switching
Ensure Qdrant is running and accessible:
```
curl http://localhost:6333/collections
```

If you see errors like "expected dim: 768, got 384":
- Check that your `EMBEDDING_MODEL` setting matches the collection's vector size
- Either recreate the collection with the correct dimension, or change the embedding model to match
If using Ollama, ensure it's running:
```
curl http://localhost:11434/api/tags
```

If you encounter import errors:

```
pip install -e .
```

If the sparse retriever fails, try deleting the `whoosh_index` directory and re-running ingestion. The index will be rebuilt automatically.
MIT License