Upload documents, ask questions -- get cited answers with a prompt engineering lab.
Live Demo -- try it without installing anything.
Get production-ready features with the Pro version:
| Feature | GitHub (Free) | Pro ($25) |
|---|---|---|
| Basic RAG pipeline | ✓ | ✓ |
| Document upload & chunking | ✓ | ✓ |
| Hybrid retrieval (BM25 + dense) | ✓ | ✓ |
| Prompt engineering lab | ✓ | ✓ |
| Extended documentation | - | ✓ |
| Docker deployment files | - | ✓ |
| CI/CD workflows | - | ✓ |
| Cloud deployment guides | - | ✓ |
| Priority email support | - | ✓ |
- RAG pipeline from upload to answer -- Ingest documents (PDF, DOCX, TXT, MD, CSV), chunk them with pluggable strategies, embed with TF-IDF, and retrieve using BM25 + dense hybrid search with Reciprocal Rank Fusion
- Prompt engineering lab for A/B testing -- Create prompt templates, run the same question through different strategies side-by-side, compare outputs
- Citation accuracy matters -- Faithfulness, coverage, and redundancy scoring for every generated citation
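The hybrid retrieval described above merges the BM25 and dense rankings with Reciprocal Rank Fusion. A minimal sketch of RRF (standalone, not the engine's actual `retriever.py` code): each document scores the sum of `1 / (k + rank)` over every ranked list it appears in, with `k = 60` as in the original RRF paper.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d2"]   # sparse (keyword) ranking
dense_ranking = ["d3", "d4", "d1"]  # dense (semantic) ranking
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# d3 leads (top of both lists); d1 beats d4 and d2 (appears in both)
```

Because RRF only uses ranks, not raw scores, it fuses BM25 and cosine similarities without any score normalization.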
Quantified outcomes from production document intelligence deployments:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Document review time | 3 days | 3 minutes | 99% faster |
| Contract analysis accuracy | 72% | 94% | 31% improvement |
| Research hours per case | 8 hours | 45 minutes | 91% reduction |
| API costs per 1K queries | $180 | $24 | 87% reduction |
- Hybrid retrieval outperforms BM25-only by 15-25% -- Combining sparse + dense vectors finds relevant passages traditional search misses
- Citation scoring ensures answer reliability -- Faithfulness, coverage, and redundancy metrics for every generated citation
- No external embedding API required -- Local TF-IDF embeddings eliminate vendor lock-in and reduce costs
- <100ms query latency -- Sub-100ms responses for a 10K-document corpus
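To make the faithfulness metric concrete, here is a toy lexical-overlap version: the fraction of answer tokens that appear in the cited passage. This is an illustrative stand-in, not the scoring used by `citation_scorer.py`.

```python
import re

def faithfulness(answer_sentence, source_passage):
    """Fraction of answer tokens supported by the cited source
    (simple lexical overlap; a stand-in for the real scorer)."""
    tokenize = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    answer_tokens = tokenize(answer_sentence)
    if not answer_tokens:
        return 0.0
    supported = answer_tokens & tokenize(source_passage)
    return len(supported) / len(answer_tokens)

score = faithfulness(
    "The contract term is 24 months",
    "This agreement runs for a term of 24 months from signing.",
)
# 3 of 6 answer tokens ("term", "24", "months") are supported -> 0.5
```

Coverage and redundancy follow the same pattern in the other direction: how much of the cited passage the answer actually uses, and how much cited material overlaps across citations.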
| Industry | Use Case | Outcome |
|---|---|---|
| Legal | Contract analysis and discovery | 99% faster document review |
| Finance | SEC filing analysis and due diligence | 91% reduction in research time |
| Healthcare | Medical literature review and compliance | 87% cost reduction on document Q&A |
| Enterprise | Internal knowledge base search | 15-25% better retrieval accuracy |
- Service 3: Custom RAG Conversational Agents
- Service 5: Prompt Engineering and System Optimization
- IBM Generative AI Engineering with PyTorch, LangChain & Hugging Face
- IBM RAG and Agentic AI Professional Certificate
- Vanderbilt ChatGPT Personal Automation
- Duke University LLMOps Specialization
```mermaid
flowchart TB
    Upload["Document Upload\n(PDF, DOCX, TXT, MD, CSV)"]
    Chunk["Chunking Engine\n(semantic, fixed, sliding window)"]
    Embed["Embedding Layer\n(TF-IDF, BM25, Dense)"]
    VStore["Vector Store\n(FAISS / in-memory)"]
    Hybrid["Hybrid Retrieval\n(BM25 + Dense + RRF fusion)"]
    Rerank["Cross-Encoder Re-Ranker"]
    QExpand["Query Expansion\n(synonym, PRF, decompose)"]
    Citation["Citation Scoring\n(faithfulness, coverage, redundancy)"]
    Answer["Answer Generation"]
    Convo["Conversation Manager\n(multi-turn context)"]
    API["REST API\n(JWT auth, rate limiting, metering)"]
    UI["Streamlit Demo UI\n(4-tab interface)"]

    Upload --> Chunk --> Embed --> VStore
    QExpand --> Hybrid
    VStore --> Hybrid --> Rerank --> Answer
    Answer --> Citation
    Answer --> Convo
    API --> Answer
    UI --> API
```
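The Chunking Engine in the diagram supports fixed, semantic, and sliding-window strategies. A minimal sliding-window sketch over whitespace tokens (illustrative only; `chunking.py` has its own implementations, and the `size`/`overlap` parameter names here are assumptions):

```python
def sliding_window_chunks(text, size=200, overlap=50):
    """Split text into overlapping fixed-size chunks of words.
    Assumes overlap < size so the window always advances."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        if window:
            chunks.append(" ".join(window))
        if start + size >= len(words):
            break  # last window already reached the end
    return chunks

# 10 words, window of 5, overlap of 2 -> windows start at 0, 3, 6
demo = sliding_window_chunks(" ".join(str(i) for i in range(10)),
                             size=5, overlap=2)
```

Overlap trades a little index size for recall: a sentence split across a chunk boundary still appears whole in the neighboring window.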
| Metric | Value |
|---|---|
| Test Suite | 550+ automated tests |
| Retrieval Accuracy | Hybrid > BM25-only by 15-25% |
| Re-Ranking Boost | +8-12% relevance improvement |
| Query Latency | <100ms for 10K document corpus |
| Citation Accuracy | Faithfulness + coverage scoring |
| API Rate Limit | Configurable per-user metering |
| Module | File | Description |
|---|---|---|
| Ingest | `ingest.py` | Multi-format document loading (PDF, DOCX, TXT, MD, CSV) |
| Chunking | `chunking.py` | Pluggable chunking strategies: fixed-size, sentence-boundary, semantic |
| Embedder | `embedder.py` | TF-IDF embedding (5,000 features, no external API calls) |
| Retriever | `retriever.py` | BM25 + dense cosine + hybrid RRF fusion |
| Answer | `answer.py` | Context-aware answer generation with source citations |
| Prompt Lab | `prompt_lab.py` | Prompt versioning and A/B comparison framework |
| Citation Scorer | `citation_scorer.py` | Citation faithfulness, coverage, and redundancy scoring |
| Evaluator | `evaluator.py` | Retrieval metrics: MRR, NDCG@K, Precision@K, Recall@K, Hit Rate |
| Batch | `batch.py` | Parallel batch ingestion and query processing |
| Exporter | `exporter.py` | JSON/CSV export for results and metrics |
| Cost Tracker | `cost_tracker.py` | Per-query token and cost tracking |
| Pipeline | `pipeline.py` | End-to-end DocQAPipeline class |
| REST API | `api.py` | FastAPI wrapper with JWT auth, rate limiting, metering |
| Vector Store | `vector_store.py` | Pluggable vector store backends (FAISS, in-memory) |
| Re-Ranker | `reranker.py` | Cross-encoder TF-IDF re-ranking with Kendall tau |
| Query Expansion | `query_expansion.py` | Synonym, pseudo-relevance feedback, decomposition |
| Answer Quality | `answer_quality.py` | Multi-axis answer quality scoring |
| Summarizer | `summarizer.py` | Extractive and abstractive document summarization |
| Document Graph | `document_graph.py` | Cross-document entity and relationship graph |
| Multi-Hop | `multi_hop.py` | Multi-hop reasoning across document chains |
| Conversation Manager | `conversation_manager.py` | Multi-turn context tracking and query rewriting |
| Context Compressor | `context_compressor.py` | Token-budget context window compression |
| Benchmark Runner | `benchmark_runner.py` | Automated retrieval and performance benchmarking |
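The evaluator's metrics are standard IR measures. As a reference point, here is a self-contained sketch of MRR and Precision@K (not the `evaluator.py` code itself):

```python
def mrr(rankings, relevant_sets):
    """Mean reciprocal rank: average of 1/rank of the first
    relevant document per query (0 if none retrieved)."""
    total = 0.0
    for ranking, relevant in zip(rankings, relevant_sets):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k

rankings = [["a", "b", "c"], ["c", "a"]]
relevant_sets = [{"b"}, {"a", "c"}]
score = mrr(rankings, relevant_sets)  # (1/2 + 1/1) / 2 = 0.75
p2 = precision_at_k(["a", "b", "c"], {"b", "c"}, k=2)  # 1/2
```

NDCG@K extends Precision@K with rank-discounted, graded relevance; Hit Rate is Precision@K collapsed to a binary "any relevant result in the top K".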
```bash
git clone https://github.com/ChunkyTortoise/docqa-engine.git
cd docqa-engine
pip install -r requirements.txt
make test
make demo
```

The fastest way to run DocQA Engine with Docker:
```bash
# Clone and start
git clone https://github.com/ChunkyTortoise/docqa-engine.git
cd docqa-engine
docker-compose up -d
# Open http://localhost:8501
```

| Command | Description |
|---|---|
| `docker-compose up -d` | Start demo in background |
| `docker-compose down` | Stop and remove containers |
| `docker-compose logs -f` | View logs |
| `docker-compose build` | Rebuild image |
```bash
# Build the image
docker build -t docqa-engine .

# Run the container
docker run -p 8501:8501 -v ./uploads:/app/uploads docqa-engine
# Open http://localhost:8501
```

To enable LLM-powered answer generation:

```bash
# Create a .env file with your API keys
echo "ANTHROPIC_API_KEY=your_key_here" > .env

# Start with environment variables
docker-compose --env-file .env up -d
```

The optimized multi-stage build produces images under 500MB:
- Base: Python 3.11 slim (~150MB)
- Dependencies: scikit-learn, Streamlit, etc. (~200MB)
- Application: ~50MB
| Document | Topic | Content |
|---|---|---|
| `python_guide.md` | Python Basics | Variables, control flow, functions, classes, error handling |
| `machine_learning.md` | ML Concepts | Supervised/unsupervised, regression, classification, neural networks |
| `startup_playbook.md` | Startup Advice | Product-market fit, MVP, fundraising, team building, metrics |
| Layer | Technology |
|---|---|
| UI | Streamlit (4 tabs) |
| Embeddings | scikit-learn (TF-IDF) |
| Retrieval | BM25 (Okapi) + Dense (cosine) + RRF |
| Document Parsing | PyPDF2, python-docx |
| Testing | pytest, pytest-asyncio (550+ tests) |
| CI | GitHub Actions (Python 3.11, 3.12) |
| Linting | Ruff |
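The embedding layer is plain scikit-learn: a `TfidfVectorizer` fit on the corpus, with cosine similarity for dense retrieval. A minimal sketch under that stack (toy corpus; the `max_features=5000` cap mirrors the engine's stated 5,000-feature limit):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Supervised learning maps inputs to labels",
    "Startups should find product-market fit early",
    "Python functions are defined with def",
]

# Fit TF-IDF on the corpus; no external embedding API needed
vectorizer = TfidfVectorizer(max_features=5000)
doc_vecs = vectorizer.fit_transform(docs)

# Embed the query in the same space and rank by cosine similarity
query_vec = vectorizer.transform(["how do I define a python function"])
scores = cosine_similarity(query_vec, doc_vecs)[0]
best = int(scores.argmax())  # the Python document wins
```

Because the vectorizer is fit locally, index and query embedding both run offline, which is what removes the per-query embedding API cost.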
```
docqa-engine/
├── app.py                   # Streamlit application (4 tabs)
├── docqa_engine/
│   ├── ingest.py            # Document loading + parsing
│   ├── chunking.py          # Pluggable chunking strategies
│   ├── embedder.py          # TF-IDF embedding
│   ├── retriever.py         # BM25 + Dense + Hybrid (RRF)
│   ├── answer.py            # LLM answer generation + citations
│   ├── prompt_lab.py        # Prompt versioning + A/B testing
│   ├── citation_scorer.py   # Citation accuracy scoring
│   ├── evaluator.py         # Retrieval metrics (MRR, NDCG, P@K)
│   ├── batch.py             # Parallel batch processing
│   ├── exporter.py          # JSON/CSV export
│   ├── cost_tracker.py      # Token + cost tracking
│   └── pipeline.py          # End-to-end pipeline
├── demo_docs/               # 3 sample documents
├── tests/                   # 26 test files, 550+ tests
├── .github/workflows/ci.yml # CI pipeline
├── Makefile                 # demo, test, lint, setup
└── requirements.txt
```
| ADR | Title | Status |
|---|---|---|
| ADR-0001 | Hybrid Retrieval Strategy | Accepted |
| ADR-0002 | TF-IDF Local Embeddings | Accepted |
| ADR-0003 | Citation Scoring Framework | Accepted |
| ADR-0004 | REST API Wrapper Design | Accepted |
```bash
make test                              # Full suite (550+ tests)
python -m pytest tests/ -v             # Verbose output
python -m pytest tests/test_ingest.py  # Single module
```

See BENCHMARKS.md for detailed performance data.

```bash
python -m benchmarks.run_all
```

See CHANGELOG.md for release history.
- EnterpriseHub -- Real estate AI platform with BI dashboards and CRM integration
- insight-engine -- Upload CSV/Excel, get instant dashboards, predictive models, and reports
- ai-orchestrator -- AgentForge: unified async LLM interface (Claude, Gemini, OpenAI, Perplexity)
- scrape-and-serve -- Web scraping, price monitoring, Excel-to-web apps, and SEO tools
- prompt-engineering-lab -- 8 prompt patterns, A/B testing, TF-IDF evaluation
- llm-integration-starter -- Production LLM patterns: completion, streaming, function calling, RAG, hardening
- Portfolio -- Project showcase and services
If DocQA Engine has been useful to you, consider sponsoring its continued development:
See SPONSORS.md for sponsorship tiers and benefits.
MIT -- see LICENSE for details.
Building RAG pipelines or document intelligence systems? I help teams ship production-ready document Q&A:
- 📄 Consulting — RAG architecture, retrieval strategies, cost optimization
- 🚀 Implementation — Hybrid retrieval, citation systems, production hardening
- 📧 Enterprise — Custom integrations, SLAs, dedicated support
See what clients say about working with me: TESTIMONIALS.md
"We went from manual contract review (3 days) to automated analysis (3 minutes) with better accuracy."
— Legal Operations Manager, Enterprise
