A Retrieval-Augmented Generation (RAG) system for querying Encompass API documentation with natural language, featuring hybrid retrieval and multiple LLM support.
Unofficial. Not affiliated with ICE Mortgage Technology. A community-built retrieval index and RAG pipeline over the publicly available Encompass Developer Connect documentation, intended as a developer reference and for educational/research use. For canonical, up-to-date documentation, always defer to the official source: https://developer.icemortgagetechnology.com/.
The Encompass RAG Assistant is a tool that allows users to query Encompass API documentation using natural language. It leverages a hybrid retrieval system combining semantic and keyword search over the publicly available Encompass Developer Connect documentation and the Postman collection to provide contextual answers to API-related questions.
The application pairs this retrieval pipeline with a FastAPI backend and a Streamlit interface to make the Encompass API documentation easier to search and navigate.
- Hybrid Retrieval System: Combines FAISS semantic search with BM25 keyword search, fused via Reciprocal Rank Fusion (RRF)
- Multi-LLM Support: Choose between Ollama (qwen2.5-coder:7b) and Google Gemini (gemini-1.5-flash)
- Jina v3 Embeddings: `jinaai/jina-embeddings-v3` (1024-d) with task-specific prompts for retrieval
- Postman Endpoint Gate: Token-overlap + score-ratio gate surfaces 0–3 relevant API endpoints separately from prose chunks
- Two Data Sources: Crawled Encompass Developer Connect documentation + Postman collection
- Environment Configuration: Full .env support for easy deployment and configuration
- Enhanced FastAPI Backend: Robust API with health checks, configuration endpoints, and CORS support
- Streamlit UI: Clean, intuitive interface with enhanced source display
- Source Attribution: Detailed source tracking with metadata and full content access
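
As a rough illustration of how Reciprocal Rank Fusion can merge the FAISS and BM25 rankings, the sketch below scores each chunk by the sum of `1 / (k + rank)` over the lists it appears in. The function name `rrf_fuse`, the `top_k` default, and the chunk IDs are hypothetical, and this is not the project's actual code; `k=60` is the common default for RRF.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=5):
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each ID scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. A higher fused score means a better match.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_k]


semantic = ["c3", "c1", "c7"]  # hypothetical FAISS ranking, best first
lexical = ["c1", "c9", "c3"]   # hypothetical BM25 ranking, best first
fused = rrf_fuse([semantic, lexical])
print(fused)
```

Because RRF uses only ranks, it needs no score normalization between the inner-product and BM25 scales, which is the usual reason to prefer it over a weighted sum.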
The system has a multi-component architecture:
- Jina v3 Embeddings: `jinaai/jina-embeddings-v3` (1024-d); `task=retrieval.passage` at ingest, `retrieval.query` at query time
- FAISS-IP Vector Store: Inner-product similarity search over Jina vectors
- BM25 Index: `rank_bm25.BM25Okapi` over the same chunks for lexical matching
- Metadata Store: Per-chunk title, breadcrumb, kind, and source URL
- HybridRetriever: FAISS semantic + BM25 lexical over the prose chunk corpus
- RRF Fusion: Reciprocal Rank Fusion (`k=60`) merges semantic and lexical rankings into one top-k list
- Postman Endpoint Gate: Separate BM25 over Postman entries → token-overlap filter (≥0.3) → score-ratio split returns 0–3 endpoints
- Ollama Support: Local LLM hosting with qwen2.5-coder:7b
- Google Gemini: Cloud-based gemini-1.5-flash model option
- Direct Prompt Construction: `## Documentation` and `## Relevant API endpoint(s)` are built as separate prompt sections (no RetrievalQA chain)
- FastAPI Backend: Serves the query, health-check, and configuration endpoints
- Environment Configuration: Full .env support for deployment flexibility
- Streamlit UI: Enhanced interface with categorized source display
- CORS Support: Ready for web application integration
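
The Postman endpoint gate could be sketched along these lines. This is a minimal sketch under stated assumptions, not the project's implementation: the BM25 scores are taken as given, token overlap is assumed to mean the fraction of query tokens found in the entry, and the score-ratio cutoff (`min_ratio=0.6`) is a made-up value. Only the 0.3 overlap threshold and the cap of three endpoints come from the description above. The entries use placeholder paths.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def gate_endpoints(query, scored_entries, min_overlap=0.3, min_ratio=0.6, max_results=3):
    """Return between 0 and max_results entries that pass both gates.

    scored_entries: list of (entry_text, bm25_score) pairs.
    min_overlap:    required fraction of query tokens present in the entry.
    min_ratio:      assumed cutoff relative to the best surviving score.
    """
    q_tokens = tokenize(query)
    if not q_tokens:
        return []
    survivors = []
    for text, score in scored_entries:
        overlap = len(q_tokens & tokenize(text)) / len(q_tokens)
        if overlap >= min_overlap and score > 0:
            survivors.append((text, score))
    if not survivors:
        return []
    # Keep only entries whose score is close enough to the best one.
    survivors.sort(key=lambda pair: pair[1], reverse=True)
    best = survivors[0][1]
    kept = [text for text, score in survivors if score >= min_ratio * best]
    return kept[:max_results]


entries = [  # hypothetical Postman entries with hypothetical BM25 scores
    ("GET /example/loans/{id} Get loan", 9.0),
    ("POST /example/loans Create loan", 7.5),
    ("GET /example/users List users", 2.0),
]
result = gate_endpoints("how do I create a loan", entries)
print(result)
```

Returning an empty list is a valid outcome: when nothing passes the gate, the prompt can simply carry documentation chunks and no endpoint section.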
flowchart TD
subgraph Ingest["Build pipeline (offline)"]
direction TB
Crawler["scripts/crawler/<br/>(skips auth-gated)"] --> JSONL["developer_connect.jsonl"]
JSONL --> FDC["filter → dedupe → chunk<br/>page-wise ≤ 2K, else 1500/200"]
FDC --> Embed["Jina v3 embed<br/>task=retrieval.passage"]
Postman["Encompass_Developer_Connect_<br/>postman_collection.json"]
end
Embed --> FAISS[("FAISS-IP<br/>jsonl_faiss/")]
Embed --> BM25C[("BM25 over chunks<br/>jsonl_bm25.pkl")]
Postman --> PEntries[("Postman BM25 + entries<br/>postman_*.pkl")]
HF[("HF dataset (fallback)<br/>Richie-rk/encompass-developer-connect-index")] -. "snapshot_download<br/>if local missing" .-> FAISS
subgraph Runtime["Query pipeline (runtime)"]
direction TB
Q["User query"] --> QE["Embed via Jina v3<br/>task=retrieval.query"]
QE --> Sem["FAISS top-N"]
Q --> Lex["BM25 top-N over chunks"]
Sem --> RRF["RRF fuse → top-K chunks"]
Lex --> RRF
Q --> PG["Postman BM25 +<br/>token-overlap gate"]
RRF --> Prompt["Prompt builder<br/>## Documentation + ## Endpoints"]
PG --> Prompt
Prompt --> LLM["LLM<br/>Ollama qwen2.5-coder · Gemini 1.5 Flash"]
LLM --> Ans["Answer + sources + endpoints"]
end
FAISS --> Sem
BM25C --> Lex
PEntries --> PG
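
The prompt-builder step in the flowchart might look roughly like the sketch below. The two section headings are the ones named in this README; the chunk dictionary shape, the `## Question` heading, and the function name `build_prompt` are assumptions made for illustration.

```python
def build_prompt(question, chunks, endpoints):
    """Assemble a prompt with separate documentation and endpoint sections.

    chunks:    list of dicts with 'title' and 'text' keys (hypothetical shape).
    endpoints: list of endpoint description strings; may be empty.
    """
    parts = ["## Documentation"]
    for i, chunk in enumerate(chunks, start=1):
        parts.append(f"[{i}] {chunk['title']}\n{chunk['text']}")
    if endpoints:
        # Skip the endpoint section entirely when the gate returned nothing.
        parts.append("## Relevant API endpoint(s)")
        parts.extend(f"- {endpoint}" for endpoint in endpoints)
    parts.append("## Question")  # assumed heading, not taken from the project
    parts.append(question)
    return "\n\n".join(parts)


prompt = build_prompt(
    "How do I create a loan?",
    [{"title": "Loans overview", "text": "Example chunk text."}],
    ["POST /example/loans Create loan"],
)
print(prompt)
```

Keeping the sections separate lets the model tell retrieved prose apart from candidate endpoints, which is harder when everything is concatenated into one context block.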
- Python 3.11+ (Python 3.13 compatible)
- Ollama (for local LLM hosting) OR Google Gemini API key
- CUDA-compatible GPU recommended for embeddings
- 16GB+ RAM recommended for optimal performance
- Optional: uv for faster package installation
1. Clone the repository:

       git clone https://github.com/richie-rk/encompass-rag-assistant.git
       cd encompass-rag-assistant

2. Create a virtual environment:

       python -m venv venv
       source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install dependencies. Pick the path that matches your hardware. The only thing that differs is which `torch` build gets installed; the application code dispatches to CPU or GPU at runtime via `JINA_DEVICE` in `.env`.

   Option A: uv with pyproject.toml (recommended)

   CPU only (works everywhere; embedding pipeline runs on CPU):

       uv pip install -e .

   NVIDIA GPU (CUDA 12.1), which pulls CUDA-enabled `torch` + `faiss-gpu`:

       uv pip install -e ".[gpu]"

   With dev extras:

       uv pip install -e ".[dev]"
       uv pip install -e ".[dev,gpu]"  # GPU + dev tools

   Everything:

       uv pip install -e ".[all]"

   Option B: plain pip (no uv)

   CPU:

       pip install -e .  # or: pip install -r requirements.txt

   NVIDIA GPU (CUDA 12.1). pip can't auto-route `torch` to the PyTorch index from `pyproject.toml`, so the GPU build takes a few manual steps. Quote the `faiss-gpu` requirement so the shell does not treat `>` as a redirect:

       pip install -e .
       pip uninstall -y torch
       pip install torch --index-url https://download.pytorch.org/whl/cu121
       pip install "faiss-gpu>=1.7.4"

   💡 Which method to choose?

   - uv (Option A): significantly faster installs; auto-routes `torch` to the right index when you pick `[gpu]`. Recommended.
   - pip (Option B): works without extra tooling; needs the manual torch swap above for GPU.

   💡 Hardware setup: after install, set `JINA_DEVICE=cuda` (or `mps` on Apple silicon) in `.env` to actually use the GPU at embedding time. The default is `cpu`.

4. LLM setup (choose one):

   Option A: Ollama (local)

       # Install Ollama from https://ollama.ai/
       ollama pull qwen2.5-coder:7b

   Option B: Google Gemini (cloud)

       # Get API key from https://ai.google.dev/
       # Set in .env file:
       GEMINI_API_KEY=your_key_here

5. Environment configuration:

       # Create .env file with your settings
       cp .env.example .env
       # Edit .env with your preferred configuration

6. Crawl the Developer Connect documentation:

       # From the repo root: fetches all guide / API-reference / changelog pages
       # for /developer-connect/ and writes scripts/data/developer_connect.jsonl
       python -m scripts.crawler

   The crawler enumerates the page frontier from the ReadMe sidebar (no BFS) and extracts each page's authored Markdown, OpenAPI spec, and metadata directly from the embedded `ssr-props` JSON. First run takes ~5–7 minutes (216 pages at 0.5 s delay); HTML is cached to `scripts/data/cache/`, so re-runs are instant.

   Useful flags:

       python -m scripts.crawler --dry-run          # enumerate frontier, no fetches
       python -m scripts.crawler --limit 10         # debug: only first 10 pages
       python -m scripts.crawler --no-cache         # ignore cache, refetch all
       python -m scripts.crawler --include-hidden   # include hidden:true pages
       python -m scripts.crawler --delay 1.0        # slower pace for politeness
       python -m scripts.crawler --help             # full option list

   Outputs:

   - `scripts/data/developer_connect.jsonl`: one record per page (slug, title, URL, breadcrumb, `body_md`, OpenAPI `oas`, `updated_at`, …)
   - `scripts/data/developer_connect_manifest.jsonl`: per-URL fetch log with status, cache-hit, body hash, and any error reason

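
For a quick sanity check of the crawl output, something like the following can list what was captured. It is a sketch that assumes the record keys are spelled `slug`, `title`, and `body_md` as in the field list above; the helper names `load_pages` and `summarize` are hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def load_pages(jsonl_path):
    """Yield one dict per crawled page, skipping blank lines."""
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(pages):
    """Return (slug, title, body length) tuples, tolerating missing keys."""
    return [
        (page.get("slug"), page.get("title"), len(page.get("body_md") or ""))
        for page in pages
    ]


if __name__ == "__main__":
    path = Path("scripts/data/developer_connect.jsonl")
    if path.exists():
        for slug, title, size in summarize(load_pages(path)):
            print(f"{slug}\t{title}\t{size} chars")
    else:
        print(f"{path} not found; run the crawler first")
```

Pages with an empty body are worth a second look before ingesting, since they usually point to an auth-gated or hidden page.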
7. Create the vector store:

       python scripts/create_vector_store.py

   💡 Skip steps 6–7 if you just want to run the API. If `./vector_store/` is missing on first boot, the API automatically fetches the published index from Hugging Face into the HF cache (`~/.cache/huggingface/hub/...`) and uses it. Configurable via `VECTOR_STORE_HF_REPO` and `VECTOR_STORE_HF_REVISION` in `.env` (defaults point at `Richie-rk/encompass-developer-connect-index` on `main`). Build locally only when you want to re-ingest your own crawl.

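
The local-first, Hugging Face fallback behaviour described above could be approximated as follows. This is a sketch, not the application's code: the function name `resolve_vector_store` and the "directory exists and is non-empty" check are assumptions, and `repo_type="dataset"` is inferred from the flowchart labelling the index as an HF dataset. `snapshot_download` is the real `huggingface_hub` call.

```python
import os
from pathlib import Path

DEFAULT_REPO = "Richie-rk/encompass-developer-connect-index"


def resolve_vector_store(local_path=None):
    """Return a directory holding the index, preferring a local build."""
    local = Path(local_path or os.environ.get("VECTOR_STORE_PATH", "vector_store"))
    if local.is_dir() and any(local.iterdir()):
        return local
    # Imported lazily so a local index works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    cached = snapshot_download(
        repo_id=os.environ.get("VECTOR_STORE_HF_REPO", DEFAULT_REPO),
        repo_type="dataset",  # assumption based on the flowchart label
        revision=os.environ.get("VECTOR_STORE_HF_REVISION", "main"),
    )
    return Path(cached)
```

Pinning `VECTOR_STORE_HF_REVISION` to a commit hash instead of `main` keeps deployments reproducible when the published index is updated.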
8. Start the FastAPI backend:

       cd rag_docs/src
       python -m rag_docs.rag_app

9. In a separate terminal, start the Streamlit UI:

       cd rag_docs/src
       streamlit run rag_docs/web_ui.py

10. Access the application:
    - Streamlit UI: http://localhost:8501
    - FastAPI docs: http://localhost:8000/docs
    - Health check: http://localhost:8000/api/health

- `POST /api/query` - Submit questions about the Encompass API
- `GET /api/health` - Check system status
- `GET /api/config` - View current configuration

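
A minimal client for the query endpoint might look like this, using only the standard library. The request body shape is an assumption: the `question` field name is hypothetical, so check the generated schema at `/docs` for the real one. `build_query_request` and `ask` are illustrative helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_query_request(question, base_url=BASE_URL):
    """Build a POST /api/query request; the 'question' field name is assumed."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, timeout=60):
    """Send the question to a running backend and return the decoded JSON."""
    request = build_query_request(question)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend running, `ask("How do I create a loan?")` returns the parsed response; a generous timeout is sensible because local Ollama generation can take a while.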
The application is fully configurable through environment variables or a .env file:
# LLM Configuration
OLLAMA_MODEL=qwen2.5-coder:7b
USE_GEMINI=false
GEMINI_API_KEY=your_gemini_api_key_here
TEMPERATURE=0.1
# Vector Store Configuration
VECTOR_STORE_PATH=vector_store
# Embeddings (Jina v3)
JINA_BACKEND=local # `local` (sentence-transformers) or `api`
JINA_API_KEY= # required when JINA_BACKEND=api
JINA_MODEL=jinaai/jina-embeddings-v3
JINA_DEVICE=cpu # `cpu`, `cuda`, or `mps`
# API Configuration
HOST=0.0.0.0
PORT=8000
# Logging Configuration
LOG_LEVEL=INFO
LOG_FORMAT=%(asctime)s - %(levelname)s - [%(filename)s:%(lineno)d] - %(message)s
LOG_FILE=logs/app.log
LOG_MAX_SIZE=10485760
LOG_BACKUP_COUNT=5
# Retrieval Configuration
HYBRID_ALPHA=0.6 # Weight for semantic vs keyword search
RETRIEVAL_K=5 # Number of documents to retrieve

A shorter example `.env`:

# Choose your LLM provider
USE_GEMINI=false
OLLAMA_MODEL=qwen2.5-coder:7b
# GEMINI_API_KEY=your_key_here
# Vector store settings
VECTOR_STORE_PATH=vector_store
TEMPERATURE=0.1
# API settings
HOST=0.0.0.0
PORT=8000
# Logging
LOG_LEVEL=INFO

# Build wheel and source distribution
python -m build
# Or with uv (faster)
uv build

Built with ❤️ for Encompass API users