Smart Document Q&A

An AI-powered Retrieval-Augmented Generation (RAG) system that lets you chat with your documents. Upload PDF, DOCX, or TXT files and ask questions. Answers are grounded in your document content with source references.

Try it live: eugen-goebel-smart-doc-qa-app-av3twb.streamlit.app. Runs in Demo Mode (no API key required) so you can test the full RAG retrieval flow. Add your own Anthropic key in the sidebar for AI-generated answers.

Screenshots

Demo Mode: clean landing view; runs without an API key using raw retrieval results

Question Answered: asking about 2025 revenue returns the most relevant chunk with source reference

Retrieved Chunks: similarity search surfaces multiple ranked matches across the document

How It Works

┌─────────────┐    ┌──────────┐    ┌──────────────┐    ┌───────────┐
│  Document    │───▶│  Text    │───▶│  Vector      │───▶│  Stored   │
│  Upload      │    │  Chunker │    │  Store       │    │  Chunks   │
│  (PDF/DOCX)  │    │  (split) │    │  (ChromaDB)  │    │  (embed)  │
└─────────────┘    └──────────┘    └──────────────┘    └─────┬─────┘
                                                             │
┌─────────────┐    ┌──────────┐    ┌──────────────┐          │
│  Answer +   │◀───│  LLM     │◀───│  Relevant    │◀─────────┘
│  Sources    │    │  API     │    │  Chunks      │   (similarity
└─────────────┘    └──────────┘    └──────────────┘    search)

RAG Pipeline

Document Loading: Reads PDF, DOCX, or TXT files and extracts plain text
Chunking: Splits text into overlapping ~500-character pieces
Embedding & Storage: Each chunk is converted to a vector and stored in ChromaDB
Retrieval: When you ask a question, the most relevant chunks are found via similarity search
Generation: The LLM answers your question using only the retrieved context

Quick Start

# Clone and setup
git clone https://github.com/eugen-goebel/smart-doc-qa.git
cd smart-doc-qa
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Configure API key
cp .env.example .env
# Edit .env and add your Anthropic API key

# Run the app
streamlit run app.py

The app opens in your browser. Upload a document and start asking questions.

Try with Sample Data

A sample company report is included at data/sample_company_report.txt. Upload it in the app and try questions like:

"What was the company's revenue in 2025?"
"Who are the main competitors?"
"What are the strategic priorities for 2026?"
"Tell me about the BMW case study"

Architecture

smart-doc-qa/
├── app.py                          # Streamlit web interface
├── agents/
│   ├── document_loader.py          # Reads PDF/DOCX/TXT files
│   ├── chunker.py                  # Splits text into overlapping chunks
│   ├── vectorstore.py              # ChromaDB wrapper for similarity search
│   └── qa_agent.py                 # RAG pipeline: retrieve + generate
├── data/
│   └── sample_company_report.txt   # Sample document for testing
├── tests/
│   ├── test_document_loader.py     # 10 tests
│   ├── test_chunker.py             # 15 tests
│   ├── test_vectorstore.py         # 16 tests
│   └── test_qa_agent.py            # 14 tests
├── requirements.txt
└── README.md

Agent Roles

Agent	Purpose	API Call?
DocumentLoader	Reads PDF, DOCX, TXT files and extracts text	No
TextChunker	Splits text into overlapping chunks for search	No
VectorStore	Stores chunks and finds relevant ones via similarity	No (local embeddings)
QAAgent	Sends relevant chunks + question to the LLM for answers	Yes (Anthropic API)

Key Concepts

What is RAG?

Retrieval-Augmented Generation combines search with AI generation:

Instead of sending an entire document to the AI (expensive, limited by context window)
We first search for the most relevant parts, then send only those to the AI
This is faster, cheaper, and produces more accurate answers

What are Embeddings?

Text is converted into lists of numbers (vectors) that capture meaning. Similar texts have similar vectors. ChromaDB uses the all-MiniLM-L6-v2 model to generate these embeddings locally, no API key needed.

What is Chunking?

Documents are split into overlapping pieces (~500 chars each). The overlap ensures no sentence is cut without context at chunk boundaries.

Tech Stack

Component	Technology	Purpose
AI	Anthropic API	Answer generation from context
Vector DB	ChromaDB	Embedding storage and similarity search
Embeddings	all-MiniLM-L6-v2	Local text-to-vector conversion
Data Models	Pydantic v2	Type-safe data validation
Web UI	Streamlit	Interactive chat interface
PDF Reading	pypdf	PDF text extraction
DOCX Reading	python-docx	Word document text extraction
Testing	pytest	55+ unit and integration tests

Testing

# Run all tests
pytest tests/ -v

# Run tests for a specific agent
pytest tests/test_vectorstore.py -v

All tests run without an API key. The QA agent tests use mocked API responses.

Deployment

This app is designed to deploy in one click on Streamlit Community Cloud (free tier).

Steps:

Sign in at share.streamlit.io with your GitHub account.
Click New app and pick this repository / branch / app.py.
(Optional) In Advanced settings → Secrets, paste:
```
ANTHROPIC_API_KEY = "sk-ant-..."
```
See .streamlit/secrets.toml.example.
Click Deploy. The app builds in ~2 minutes.

API key handling:

The app reads the key from three places, in this order:

os.environ["ANTHROPIC_API_KEY"]: set via .env for local runs
st.secrets["ANTHROPIC_API_KEY"]: set in Streamlit Cloud dashboard
Manual entry in the sidebar, fallback for end users

If no key is provided, the app runs in Demo Mode: vector search still works, but the model-generated answer step is skipped and the raw retrieved chunks are shown instead.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
.github		.github
.streamlit		.streamlit
agents		agents
data		data
docs/screenshots		docs/screenshots
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
app.py		app.py
requirements.txt		requirements.txt
runtime.txt		runtime.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Smart Document Q&A

Screenshots

How It Works

RAG Pipeline

Quick Start

Try with Sample Data

Architecture

Agent Roles

Key Concepts

What is RAG?

What are Embeddings?

What is Chunking?

Tech Stack

Testing

Deployment

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Smart Document Q&A

Screenshots

How It Works

RAG Pipeline

Quick Start

Try with Sample Data

Architecture

Agent Roles

Key Concepts

What is RAG?

What are Embeddings?

What is Chunking?

Tech Stack

Testing

Deployment

License

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages