souravrane/rag-pipeline-react
RAG Pipeline for React Documentation

A complete RAG (Retrieval-Augmented Generation) system for React documentation that includes ingestion, hybrid retrieval, and answer generation with support for multiple LLM providers.

Overview

This project implements a full RAG pipeline that:

  1. Ingests React markdown documentation using LlamaIndex for parsing and chunking
  2. Stores embeddings and metadata in Qdrant vector database
  3. Retrieves relevant chunks using hybrid search (dense + sparse + reranking)
  4. Generates answers using LLMs (OpenAI or Ollama) with intelligent context assembly

Features

Ingestion

  • Heading-aware chunking: Splits documents based on markdown headings (#, ##, ###)
  • Token-based splitting: Chunks large sections into ~800 token pieces with 120 token overlap
  • Rich metadata: Each chunk includes file path, section slug, heading hierarchy (h1/h2/h3), code-heavy flag, chunk index, and full text content
  • Code block preservation: Ensures code blocks are kept intact during chunking
  • Flexible embeddings: Switch between embedding models (sentence-transformers or OpenAI)
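The code-heavy flag can be computed with a simple heuristic. A minimal sketch (the `is_code_heavy` name and the 30% threshold come from the metadata description under Technical Details; the exact implementation in `llama_ingestion.py` may differ):

```python
import re

def is_code_heavy(text: str, threshold: float = 0.30) -> bool:
    """Flag a chunk as code-heavy when more than `threshold` of its
    characters sit inside fenced code blocks."""
    fenced = re.findall(r"```.*?```", text, flags=re.DOTALL)
    code_chars = sum(len(block) for block in fenced)
    # Empty text short-circuits to False, avoiding division by zero.
    return bool(text) and code_chars / len(text) > threshold
```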

Retrieval

  • Hybrid search: Combines dense vector search (Qdrant) and sparse keyword search (Whoosh BM25)
  • Cross-encoder reranking: Uses sentence-transformers cross-encoder for final result ordering
  • FastAPI service: RESTful endpoint for querying the knowledge base

Answer Generation

  • Intelligent context assembly: Expands retrieved chunks to include all chunks from matching files, ordered by chunk index
  • Multiple LLM providers: Switch between OpenAI and Ollama (DeepSeek R1, etc.)
  • Temperature control: Configurable creativity/randomness for LLM responses
  • Token counting: Displays prompt token count before generation
  • Context dump: Saves question and assembled context to markdown file for debugging
  • Progress indicators: Visual feedback during retrieval and generation

Project Structure

rag-pipeline-react/
├── app/
│   ├── __init__.py
│   ├── config.py              # Configuration management (env vars, settings)
│   ├── models.py              # Pydantic models for data validation
│   ├── embeddings.py          # Embedding model abstraction layer
│   ├── ingestion/
│   │   ├── __init__.py
│   │   ├── ingest_react_docs.py   # Main ingestion CLI script
│   │   ├── llama_ingestion.py     # LlamaIndex pipeline: load → parse → chunk
│   │   └── qdrant_store.py        # Qdrant connection and upsert helpers
│   ├── retrieval/
│   │   ├── __init__.py
│   │   ├── dense_retriever.py     # Qdrant dense vector search
│   │   ├── sparse_retriever.py    # Whoosh BM25 keyword search
│   │   ├── reranker.py            # Cross-encoder reranker
│   │   ├── hybrid_retriever.py    # Merge dense+sparse and rerank
│   │   └── api.py                 # FastAPI app exposing /query
│   ├── answer/
│   │   ├── __init__.py
│   │   ├── answer_service.py      # Answer generation service
│   │   └── api.py                 # FastAPI app exposing /answer
│   └── llm/
│       ├── __init__.py
│       └── adapters.py            # LLM adapter abstraction (OpenAI, Ollama)
├── react-docs/                # React documentation markdown files
├── rag.py                     # Main CLI entry point
├── pyproject.toml             # Python project dependencies
├── .env.example               # Environment variables template
└── README.md

Setup

Prerequisites

  • Python 3.10+
  • Qdrant instance running (default: localhost:6333)
  • (Optional) Ollama running locally if using Ollama LLM provider

Installation

  1. Clone the repository (if applicable) or navigate to the project directory

  2. Install dependencies:

pip install -e .

  3. Configure environment variables:

Create a .env file in the project root (see .env.example for template):

# Embedding Model Configuration (match Qdrant vector size)
EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2

# OpenAI Configuration (optional, only needed if using OpenAI)
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-4o-mini
OPENAI_BASE_URL=https://api.openai.com/v1

# Qdrant Configuration
QDRANT_HOST=localhost
QDRANT_PORT=6333
QDRANT_COLLECTION_NAME=react-docs

# React Docs Path
REACT_DOCS_PATH=./react-docs

# Retrieval settings
RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
WHOOSH_INDEX_PATH=./whoosh_index
TOP_K=10
DENSE_LIMIT=30
SPARSE_LIMIT=30

# LLM Provider (openai or ollama)
LLM_PROVIDER=ollama

# Ollama Configuration (if using Ollama)
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=deepseek-r1:8b

# LLM Generation Parameters
TEMPERATURE=0.7  # 0.0-2.0, higher = more creative

# Answer API
RETRIEVAL_URL=http://localhost:8000/query
CONTEXT_DUMP_PATH=./context_dump.md

  4. Start Qdrant:

Using Docker:

docker run -p 6333:6333 qdrant/qdrant

Or install Qdrant locally following the official documentation.

  5. Start Ollama (if using the Ollama LLM provider):
# Install Ollama from https://ollama.ai
# Pull your desired model
ollama pull deepseek-r1:8b

Usage

Main CLI Runner

Use the main rag.py script in the root directory:

python rag.py [COMMAND]

Available commands:

  • ingest - Ingest React documentation into Qdrant
  • retrieve - Query React documentation using hybrid search
  • answer - Generate answers using retrieval + LLM

Ingestion

Run the ingestion command to process and store React documentation:

python rag.py ingest

Or with custom options:

python rag.py ingest --docs-path ./custom-docs --collection-name my-collection

Command-line Options:

  • --docs-path / -d: Override the default docs path
  • --collection-name / -c: Override the default Qdrant collection name

What happens:

  1. Loads all markdown files from the docs directory
  2. Parses markdown into heading-aware nodes
  3. Splits nodes into token-bounded chunks (800 tokens, 120 overlap)
  4. Generates embeddings for each chunk
  5. Stores chunks + embeddings + metadata in Qdrant

Retrieval

Query the knowledge base using hybrid search:

python rag.py retrieve "How do I use useEffect?" --top-k 5

Command-line Options:

  • --top-k / -k: Number of results to return (default: 5)
  • --collection-name / -c: Override the default Qdrant collection name

What happens:

  1. Embeds the query
  2. Performs dense vector search in Qdrant
  3. Performs sparse keyword search using BM25 (Whoosh)
  4. Merges and normalizes candidate sets
  5. Reranks using cross-encoder
  6. Returns top-k chunks with scores and metadata

Answer Generation

Generate comprehensive answers using retrieval + LLM:

python rag.py answer "Why does useEffect run twice in React 18?" --top-k 5

Command-line Options:

  • --top-k / -k: Number of chunks to retrieve initially (default: 5)
  • --collection-name / -c: Override the default Qdrant collection name

What happens:

  1. Retrieves top-k chunks using hybrid search
  2. Expands context: For each file in the retrieved set, fetches ALL chunks from that file
  3. Orders chunks: Sorts by file_path and chunk_index to preserve document flow
  4. Formats context for LLM
  5. Displays prompt token count
  6. Generates answer using configured LLM
  7. Saves context dump to context_dump.md for review
  8. Returns answer and source chunks
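Step 7 (the context dump) can be as simple as writing the prompt material to disk; a minimal sketch with an assumed markdown layout (the real `answer_service.py` may format the dump differently):

```python
from pathlib import Path

def dump_context(question: str, context: str, path: str = "./context_dump.md") -> Path:
    """Persist the question and assembled context as markdown so a bad
    answer can be traced back to the exact material the LLM saw."""
    out = Path(path)
    out.write_text(
        f"# Question\n\n{question}\n\n# Context\n\n{context}\n",
        encoding="utf-8",
    )
    return out
```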

LLM Provider Configuration:

Set LLM_PROVIDER in .env:

  • openai - Use OpenAI (requires OPENAI_API_KEY)
    • Configure: OPENAI_MODEL, OPENAI_BASE_URL
  • ollama - Use Ollama (requires Ollama running locally)
    • Configure: OLLAMA_BASE_URL, OLLAMA_MODEL

Temperature Control:

Adjust creativity/randomness via TEMPERATURE in .env:

  • 0.0-0.3: More deterministic, focused responses
  • 0.7: Balanced (default)
  • 1.0-2.0: More creative, varied responses

FastAPI APIs

Retrieval API

Run the retrieval API server:

uvicorn app.retrieval.api:app --reload --port 8000

Query the API:

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "How do I add React to an existing project?", "top_k": 5}'

Response:

{
  "results": [
    {
      "chunk_id": "...",
      "score": 0.85,
      "text": "...",
      "file_path": "learn/add-react-to-an-existing-project.md",
      "section_slug": "add-react-to-an-existing-project",
      "h1": "Adding React to an Existing Project",
      "chunk_index": 0,
      ...
    }
  ]
}

Answer API

Run the answer API server:

uvicorn app.answer.api:app --reload --port 8002

Query the API:

curl -X POST http://localhost:8002/answer \
  -H "Content-Type: application/json" \
  -d '{"question": "Why does useEffect run twice in React 18?", "top_k": 5}'

Response:

{
  "answer": "In React 18, useEffect runs twice in development...",
  "sources": [
    {
      "chunk_id": "...",
      "text": "...",
      "file_path": "learn/lifecycle-of-reactive-effects.md",
      ...
    }
  ]
}

Technical Details

Chunking Strategy

  1. Markdown Parsing: Uses MarkdownNodeParser to split documents into nodes based on headings

    • # → top-level sections
    • ##, ### for subsections
  2. Token-based Chunking: Uses TokenTextSplitter to further split large sections

    • 800 tokens max per chunk
    • ~120 token overlap between adjacent chunks
    • Code blocks are preserved intact
  3. Metadata per Chunk:

    • file_path: Relative path to source markdown file
    • section_slug: URL-friendly slug (e.g., learn/use-effect#cleaning-up)
    • h1, h2, h3: Heading hierarchy
    • is_code_heavy: True if >30% of content is in code fences
    • chunk_index: 0-based index within the section
    • text: Full chunk text content (stored for inference)
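As an example of how `section_slug` values like `learn/use-effect#cleaning-up` could be derived (hypothetical helpers; the real parser may slugify headings differently):

```python
import re

def slugify(heading: str) -> str:
    """Turn a heading like 'Cleaning up an Effect' into 'cleaning-up-an-effect'."""
    slug = heading.lower().strip()
    slug = re.sub(r"[^a-z0-9\s-]", "", slug)   # drop punctuation
    return re.sub(r"[\s-]+", "-", slug).strip("-")

def section_slug(file_path: str, heading: str) -> str:
    """Build a 'page#fragment' slug from the source file and its heading."""
    page = file_path.removesuffix(".md")
    return f"{page}#{slugify(heading)}"
```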

Embedding Models

The pipeline supports flexible embedding models:

  • sentence-transformers/all-mpnet-base-v2 (default): 768 dimensions, local model
  • all-MiniLM-L6-v2: 384 dimensions, faster, smaller
  • OpenAI models: text-embedding-3-small (1536 dims), text-embedding-3-large (3072 dims), text-embedding-ada-002 (1536 dims)

Configure via EMBEDDING_MODEL environment variable:

  • sentence-transformers/all-mpnet-base-v2 - Use sentence transformer (default)
  • openai:text-embedding-3-small - Use OpenAI embeddings
  • all-MiniLM-L6-v2 - Use smaller sentence transformer

Important: The embedding model dimension must match your Qdrant collection's vector size. If you change models, you may need to recreate the collection.
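One way to guard against this mismatch is a fail-fast check before ingestion; the lookup table below is hand-maintained from the dimensions listed above, not queried from the libraries at runtime, and `check_dimension` is a hypothetical helper:

```python
# Known vector sizes for the embedding models mentioned above.
EMBEDDING_DIMS = {
    "sentence-transformers/all-mpnet-base-v2": 768,
    "all-MiniLM-L6-v2": 384,
    "openai:text-embedding-3-small": 1536,
    "openai:text-embedding-3-large": 3072,
    "openai:text-embedding-ada-002": 1536,
}

def check_dimension(model_name: str, collection_dim: int) -> None:
    """Raise before any points are written if model and collection disagree."""
    expected = EMBEDDING_DIMS.get(model_name)
    if expected is not None and expected != collection_dim:
        raise ValueError(
            f"{model_name} emits {expected}-dim vectors, "
            f"but the collection expects {collection_dim}"
        )
```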

Hybrid Retrieval

The retrieval system uses a three-stage approach:

  1. Dense Search: Semantic vector search in Qdrant using query embeddings
  2. Sparse Search: Keyword-based BM25 search using Whoosh index
  3. Reranking: Cross-encoder reranking using cross-encoder/ms-marco-MiniLM-L-6-v2

Results are merged, normalized, deduplicated, and reranked before returning top-k.
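The merge-and-normalize step can be sketched as follows. Min-max normalization and the 50/50 weighting are assumptions for illustration, since cosine similarities and BM25 scores live on different scales and must be brought into a common range before blending:

```python
def min_max(scores):
    """Scale a list of raw scores into [0, 1]; a constant list maps to 0.5."""
    lo, hi = min(scores), max(scores)
    return [0.5 if hi == lo else (s - lo) / (hi - lo) for s in scores]

def merge_candidates(dense, sparse, alpha=0.5):
    """dense, sparse: lists of (chunk_id, raw_score) from each retriever.
    Normalize each list separately, blend with weight `alpha` on the dense
    side, and deduplicate by chunk_id. The merged pool would then go to the
    cross-encoder for final ordering."""
    fused = {}
    for hits, weight in ((dense, alpha), (sparse, 1.0 - alpha)):
        if not hits:
            continue
        norms = min_max([score for _, score in hits])
        for (chunk_id, _), norm in zip(hits, norms):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + weight * norm
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```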

Context Assembly

When generating answers, the system:

  1. Retrieves top-k most relevant chunks
  2. Extracts unique file paths from these chunks
  3. For each file, fetches all chunks associated with that file
  4. Orders chunks by (file_path, chunk_index) to preserve document flow
  5. Formats the ordered chunks into context for the LLM

This ensures the LLM receives complete, ordered context from relevant files rather than just isolated chunks.
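The expansion logic above can be sketched as a toy in-memory version (the real service fetches a file's chunks from Qdrant rather than from a dict):

```python
from collections import OrderedDict

def assemble_context(retrieved, all_chunks_by_file):
    """retrieved: top-k chunk dicts, each with a 'file_path'.
    all_chunks_by_file: file_path -> list of chunk dicts with
    'chunk_index' and 'text'. Expand to every chunk of each matched
    file, then order by (file_path, chunk_index)."""
    files = list(OrderedDict.fromkeys(c["file_path"] for c in retrieved))
    expanded = [c for f in files for c in all_chunks_by_file[f]]
    expanded.sort(key=lambda c: (c["file_path"], c["chunk_index"]))
    return "\n\n".join(c["text"] for c in expanded)
```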

Qdrant Collection

The pipeline creates a Qdrant collection with:

  • Vector size: Automatically determined by embedding model (768 for all-mpnet-base-v2, 384 for all-MiniLM-L6-v2, etc.)
  • Distance metric: Cosine
  • Payload schema: All metadata fields from ChunkMetadata model, including full text field for inference
  • Point IDs: Deterministic UUIDs generated from file_path, chunk_index, and section_slug
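Deterministic IDs make ingestion idempotent: re-running it upserts over existing points instead of duplicating them. A sketch using the standard library (assuming UUIDv5 over a joined key; the actual key format in `qdrant_store.py` may differ):

```python
import uuid

def point_id(file_path: str, chunk_index: int, section_slug: str) -> str:
    """Same chunk identity -> same UUID, on every run."""
    key = f"{file_path}|{chunk_index}|{section_slug}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))
```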

Development

Project Dependencies

  • fastapi: HTTP API framework
  • uvicorn[standard]: ASGI server
  • sentence-transformers: Embeddings + cross-encoder reranker
  • qdrant-client: Vector database client
  • whoosh: BM25 sparse index
  • python-dotenv: Environment variable management
  • pydantic / pydantic-settings: Data validation and settings
  • typer: CLI framework
  • rich: Terminal formatting and progress indicators
  • httpx: Async HTTP client
  • tiktoken: Token counting for OpenAI models
  • llama-index-core: Document loading and chunking (ingestion only)
  • llama-index-embeddings-openai: OpenAI embeddings (optional)
  • openai: OpenAI SDK

Running Tests

The project does not yet ship an automated test suite.

Code Structure

  • Ingestion: Uses LlamaIndex only for ingestion (not at query time)
  • Retrieval: Pure Python implementation using Qdrant and Whoosh
  • Answer Generation: LLM adapter pattern for easy provider switching
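The adapter seam can be sketched with a `typing.Protocol`; the `EchoAdapter` below is a hypothetical stand-in for illustration, not one of the real providers in `adapters.py`:

```python
from typing import Protocol

class LLMAdapter(Protocol):
    """Anything with a generate() method can serve as a provider."""
    def generate(self, prompt: str, temperature: float) -> str: ...

class EchoAdapter:
    """Stand-in provider showing the seam; the real adapters would wrap
    the OpenAI SDK and Ollama's HTTP API respectively."""
    def generate(self, prompt: str, temperature: float) -> str:
        return f"echo: {prompt}"

def make_adapter(provider: str) -> LLMAdapter:
    """Pick an adapter by name, mirroring the LLM_PROVIDER setting."""
    adapters = {"echo": EchoAdapter}
    return adapters[provider]()
```

Swapping providers then only touches configuration, never the answer service itself.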

Troubleshooting

Qdrant Connection Error

Ensure Qdrant is running and accessible:

curl http://localhost:6333/collections

Embedding Model Dimension Mismatch

If you see errors like "expected dim: 768, got 384":

  1. Check your EMBEDDING_MODEL setting matches the collection's vector size
  2. Either recreate the collection with the correct dimension, or change the embedding model to match

Ollama Connection Error

If using Ollama, ensure it's running:

curl http://localhost:11434/api/tags

Missing Dependencies

If you encounter import errors:

pip install -e .

Whoosh Index Issues

If the sparse retriever fails, try deleting the whoosh_index directory and re-running ingestion. The index will be rebuilt automatically.

License

MIT License
