Local-First Legal Research & Intelligence Platform (RAG-based)
Lexa is a work-in-progress legal research and intelligence system built around Retrieval-Augmented Generation (RAG). My long-term goal is for the project to become a full-fledged legal research platform, exposing a robust API and a frontend interface backed by a comprehensive legal document database.
At its current stage, Lexa focuses on:
- reliable document ingestion and normalization
- embeddings and semantic vector search
- grounded answer generation with citations
- modular, extensible backend architecture
Development is temporarily paused due to hardware constraints, but the foundation is stable and extensible.
Lexa is designed to evolve into:
- A large-scale legal document repository (cases, statutes, judgments, reports)
- A high-quality legal search engine powered by hybrid retrieval (vector + keyword)
- A citation-aware legal reasoning assistant
- A backend API serving a modern web frontend
- A practical research tool for students, practitioners, and researchers
The current repository represents the core backend and RAG pipeline that future API and frontend layers will build upon. Implemented so far:
- PDF parsing and preprocessing
- Document merging and cleaning pipeline
- Intelligent document chunking
- Local embeddings using Ollama (nomic-embed-text)
- Vector storage and retrieval with ChromaDB
- Prompt construction with strict source grounding
- Local LLM-based answer generation (mistral)
- Inline source citations
- Modular ingestion → chunking → embedding → retrieval pipeline
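The chunking stage above can be sketched as a simple sliding window with overlap. This is a minimal illustration, not Lexa's actual implementation; the `Chunk` dataclass and the size/overlap parameters are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """One retrievable unit of a source document."""
    doc_id: str   # identifier of the source document
    index: int    # position of this chunk within the document
    text: str     # chunk contents

def chunk_text(doc_id: str, text: str, size: int = 500, overlap: int = 100) -> list[Chunk]:
    """Split text into overlapping windows of `size` characters.

    Overlap preserves context that would otherwise be cut at chunk
    boundaries, which helps retrieval for queries spanning two chunks.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start, index = [], 0, 0
    while start < len(text):
        chunks.append(Chunk(doc_id, index, text[start:start + size]))
        start += size - overlap
        index += 1
    return chunks
```

A production chunker would also respect paragraph and sentence boundaries, which matters for legal text where clauses lose meaning when split mid-sentence.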
PDF Documents
↓
Parsing & Cleaning
↓
Chunking
↓
Embeddings (Ollama: nomic-embed-text)
↓
ChromaDB Vector Store
↓
Top-k Retrieval
↓
Prompt Construction
↓
LLM Answer Generation (Ollama: mistral)
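The retrieval and prompt-construction stages of the pipeline above can be illustrated in plain Python. This is a hedged sketch: Lexa delegates similarity search to ChromaDB, and the prompt wording here is invented for illustration:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Rank (text, vector) pairs by similarity to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question: str, sources: list[str]) -> str:
    """Construct a strictly grounded prompt with numbered inline sources."""
    numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    return (
        "Answer using ONLY the sources below. "
        "Cite each claim with its source number, e.g. [1].\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )
```

Numbering the sources in the prompt is what makes inline citations checkable: each bracketed number in the model's answer can be mapped back to a specific retrieved chunk.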
- Python 3.11+
- Ollama - local LLM and embedding runtime
- ChromaDB - vector database for embeddings
- LangChain (select utilities) - document loading and orchestration
- PyMuPDF - PDF parsing and text extraction
- Custom chunking and cleaning pipeline
- Dataclasses & typing for structured document handling
- Docker / Docker Compose
- Requests / tqdm for networking and ingestion utilities
Planned additions:
- FastAPI - backend API layer
- PostgreSQL / SQLite - metadata & structured storage
- Next.js / React - frontend interface
- Hybrid retrieval (BM25 + embeddings)
- GPU inference (CUDA / ROCm)
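The planned hybrid retrieval could combine lexical (BM25) and vector rankings with reciprocal rank fusion, for example. This is a sketch of one common fusion approach, not a committed design:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant from the original RRF paper and damps the
    influence of any single ranker.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF is attractive here because it needs no score normalization: BM25 scores and cosine similarities live on different scales, but ranks are directly comparable.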
Lexa is compute-intensive.
- CPU-only laptop (dual-core, low-power)
- ~8 GB RAM
- Embeddings and retrieval perform well
- LLM generation (7B models) is slow on CPU
- Cold-start model loading can take 10+ minutes
- Long prompts may be truncated due to context limits
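The prompt-truncation issue above is typically handled by trimming retrieved context to a budget before generation. A rough character-based sketch follows; real token counts depend on the model's tokenizer, and the budget value is hypothetical:

```python
def fit_context(sources: list[str], max_chars: int = 4000) -> list[str]:
    """Keep the highest-ranked sources that fit within a character budget.

    Sources are assumed to arrive in relevance order, so we drop whole
    sources from the tail rather than cutting any source mid-sentence.
    """
    kept, used = [], 0
    for src in sources:
        if used + len(src) > max_chars:
            break
        kept.append(src)
        used += len(src)
    return kept
```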
Because of this, active development is paused until more capable hardware (a multi-core CPU or a GPU) is available.
The codebase remains functional, documented, and ready to resume.
docker compose up -d
python -m scripts.ingest
python -m scripts.retrieve

Lexa currently manages dependencies using a requirements.txt file. Environment variable configuration has not yet been implemented.
Create and activate a virtual environment, then install dependencies:
python -m venv venv
./venv/Scripts/activate   # Windows; on macOS/Linux: source venv/bin/activate
pip install -r requirements.txt

The requirements.txt file defines all core libraries required to run Lexa, including:
- PDF parsing and document processing libraries
- Vector database and embedding dependencies
- RAG pipeline utilities
For demonstration and testing purposes, Lexa includes 5 sample legal case files located in the /data/sample directory.
These files are used to:
- validate the ingestion and preprocessing pipeline
- test chunking and embedding behavior
- evaluate retrieval accuracy and citation grounding
- provide a minimal working dataset for local experimentation
Directory structure:
data/
sample/
case_1.pdf
case_2.pdf
case_3.pdf
case_4.pdf
case_5.pdf
You can replace or expand this folder with your own legal documents to build a larger corpus.
Note: The sample files are included solely for development and testing purposes and do not represent a comprehensive legal dataset.
✅ Core RAG pipeline implemented
✅ Ingestion, cleaning, and chunking pipeline working
✅ Embedding and retrieval validated
✅ Chroma schema and dimension handling stabilized
⏸️ API layer planned
⏸️ Frontend planned
⏸️ Scaling deferred pending hardware upgrade
- REST API for retrieval and generation (FastAPI)
- Authentication and multi-user support
- Frontend interface (Next.js)
- Dataset expansion and automated indexing pipeline
- Hybrid retrieval (vector + lexical search)
- Advanced citation handling and filtering
- Performance optimizations (GPU inference, batching, caching)
- Evaluation framework for retrieval and answer quality
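A starting point for the planned evaluation framework is recall@k over a set of labeled query/document pairs. This is a minimal sketch; the query IDs and relevance labels are hypothetical:

```python
def recall_at_k(results: dict[str, list[str]],
                relevant: dict[str, set[str]],
                k: int = 5) -> float:
    """Average, over queries, the fraction of relevant documents
    that appear in the top-k retrieved results."""
    total = 0.0
    for query, ranked in results.items():
        rel = relevant[query]
        total += len(rel & set(ranked[:k])) / len(rel)
    return total / len(results)
```

Tracking this metric while swapping chunk sizes, embedding models, or retrieval strategies would turn tuning decisions into measurable comparisons rather than guesswork.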
MIT License
Built by Ezra Minty
Early-stage legal tech research and experimentation project.
Lexa is shared in its current state to document the design, architecture, and engineering decisions behind a local-first legal research system. Contributions, ideas, and future continuation are welcome.