A retrieval-augmented generation (RAG) pipeline for querying research papers. Ask questions and get answers with citations pointing to the exact source passage. It runs fully locally using Ollama and Llama 3.2.
Instead of asking an LLM a question and hoping it knows the answer from training data, this system:
- Splits your PDFs into overlapping text chunks
- Converts each chunk into a vector embedding that captures its meaning
- Stores those embeddings locally in ChromaDB
- At query time, embeds your question and retrieves the most semantically similar chunks
- Passes those chunks to Llama 3.2 as context
- Returns an answer grounded only in the retrieved text, along with source citations
This means the model answers from the uploaded documents only, not from its general training knowledge.
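A minimal sketch of the query path using the stack listed below (the `papers` collection name and the top-k value are assumptions; query.py is the actual implementation):

```python
# Query-path sketch (illustrative; see query.py for the real implementation)
import chromadb
from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.chroma import ChromaVectorStore

# Use the same embedding model as at ingest time, so query vectors are comparable
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.llm = Ollama(model="llama3.2", request_timeout=120.0)

# Open the persisted collection ("papers" is an assumed name)
collection = chromadb.PersistentClient(path="chroma_db").get_or_create_collection("papers")
index = VectorStoreIndex.from_vector_store(ChromaVectorStore(chroma_collection=collection))

# Retrieve the top-k most similar chunks and answer from them alone
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What dataset does GraphMetaMat train on?")
print(response)
for node in response.source_nodes:  # citations back to the source passages
    print(node.node.metadata.get("file_name"), node.score)
```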
| Component | Tool |
|---|---|
| LLM | Llama 3.2 3B via Ollama |
| Embeddings | BAAI/bge-small-en-v1.5 via HuggingFace |
| Vector store | ChromaDB (persistent, on-disk) |
| Orchestration | LlamaIndex |
| Interface | Streamlit |
Requirements: Python 3.9+, Ollama installed (ollama.com)
```bash
# Clone the repo
git clone https://github.com/aeesh/paper-qa-system.git
cd paper-qa-system

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate    # Mac/Linux
venv\Scripts\activate       # Windows

# Install dependencies
pip install llama-index llama-index-vector-stores-chroma llama-index-embeddings-huggingface llama-index-llms-ollama chromadb streamlit pymupdf

# Pull the local model
ollama pull llama3.2
```

Step 1 — Add the PDFs
Drop the PDF papers into the papers/ folder.
Step 2 — Ingest (run once)
```bash
# In a separate terminal, start Ollama
ollama serve

# Back in your main terminal
python ingest.py
```

This reads your PDFs, chunks them, generates embeddings, and stores everything in chroma_db/.
You only need to run this again if you add new papers.
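In outline, ingest.py does something like the following (chunk sizes and the collection name are assumptions; the repo file is authoritative, and it uses pymupdf for PDF parsing, while this sketch uses LlamaIndex's default reader for brevity):

```python
# Ingest sketch (illustrative; ingest.py is the real implementation)
import chromadb
from llama_index.core import Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Read every PDF in papers/ and split into overlapping chunks (sizes are assumed)
documents = SimpleDirectoryReader("papers").load_data()
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)

# Persist embeddings to chroma_db/ so ingest only has to run once
collection = chromadb.PersistentClient(path="chroma_db").get_or_create_collection("papers")
storage_context = StorageContext.from_defaults(
    vector_store=ChromaVectorStore(chroma_collection=collection)
)
VectorStoreIndex.from_documents(
    documents, transformations=[splitter], storage_context=storage_context
)
```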
Step 3 — Launch the interface
```bash
streamlit run app.py
```

A browser window opens. Type your question, click "Get Answer", and the answer appears with its source citations.
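A stripped-down version of what app.py wires together (`build_query_engine` is a hypothetical helper standing in for the query-path setup sketched earlier):

```python
# Minimal Streamlit front end, sketched; app.py is the real version
import streamlit as st

from query import build_query_engine  # hypothetical helper; see the query-path sketch above

@st.cache_resource  # build the engine once per session, not on every rerun
def load_engine():
    return build_query_engine()

st.title("Paper Q&A")
question = st.text_input("Ask a question about the papers")

if st.button("Get Answer") and question:
    with st.spinner("Retrieving and generating..."):
        response = load_engine().query(question)
    st.write(str(response))
    for node in response.source_nodes:  # show the retrieved passages as citations
        with st.expander(node.node.metadata.get("file_name", "source")):
            st.write(node.node.get_text())
```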
Or query from terminal
```bash
python query.py
```

The system was evaluated on 34 domain-specific questions across 5 research papers in materials science and AI: GraphMetaMat, DiffuMeta, two high-entropy wolframite oxide papers, and a Quantum ESPRESSO tutorial.
| Setting | Automated Accuracy |
|---|---|
| top_k = 3 (retrieve 3 chunks) | 44.1% (15/34) |
| top_k = 5 (retrieve 5 chunks) | 50.0% (17/34) |
The automated scorer checks for key numbers, acronyms, and phrases from the expected answer. Manual review of the full results puts real accuracy slightly higher, since the scorer misses paraphrased correct answers.
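For illustration, such a scorer can be approximated as below (the regex and the pass threshold are assumptions; evaluate.py holds the actual logic):

```python
# Illustrative keyword-match scorer; evaluate.py is the real implementation
import re

def keyword_score(answer: str, expected: str, threshold: float = 0.5) -> bool:
    """Pass if enough key tokens (numbers, acronyms, longer words) from the
    expected answer appear in the generated answer."""
    keys = set(re.findall(r"\d+(?:\.\d+)?|[A-Z]{2,}|\w{6,}", expected))
    if not keys:  # no extractable key tokens: fall back to substring match
        return expected.lower() in answer.lower()
    hits = sum(1 for k in keys if k.lower() in answer.lower())
    return hits / len(keys) >= threshold
```

A scorer like this is deliberately strict about surface forms, which is why paraphrased but correct answers get marked wrong.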
Main failure modes:
- Retrieval misses: For questions about specific methods or numerical details, the relevant passage sometimes isn't in the top-k chunks retrieved. Increasing k helps but doesn't fully solve it.
- Cross-paper ambiguity: Two papers cover similar wolframite materials with different measurements. The model sometimes retrieves from the wrong one.
- Model size: Llama 3.2 3B will sometimes hallucinate rather than say "I don't know." A larger model would significantly improve factual accuracy on these technical questions.
Full evaluation results are in eval_results.json. Questions and expected answers are in eval_dataset.json.
```
paper-qa-system/
├── papers/             # PDF files go here
├── ingest.py           # Read PDFs, chunk, embed, store in ChromaDB
├── query.py            # Single question from terminal
├── evaluate.py         # Run full eval dataset, compute accuracy
├── app.py              # Streamlit web interface
├── eval_dataset.json   # Evaluation questions with expected answers
├── eval_results.json   # Evaluation results
├── util.py             # Helper functions
└── .gitignore
```
Future improvements:
- Semantic chunking instead of fixed-size splitting — keeping sentences and paragraphs intact would improve retrieval precision
- Reranking — after retrieving top-k chunks, a cross-encoder reranker could reorder them by relevance before passing them to the LLM (see the sketch after this list)
- Larger model — swapping Llama 3.2 3B for a 13B or 70B model (given sufficient hardware) would substantially improve factual accuracy on technical questions
- Hybrid search — combining dense vector search with keyword (BM25) search would help for questions about specific numbers or named methods
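As a sketch of the reranking idea using LlamaIndex's built-in cross-encoder postprocessor (requires sentence-transformers; the model name and top_n are example choices, and `index` is the VectorStoreIndex from the query-path sketch above):

```python
from llama_index.core.postprocessor import SentenceTransformerRerank

# Over-retrieve with dense search, then let a cross-encoder pick the best chunks
rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",  # example cross-encoder
    top_n=3,                                       # chunks actually sent to the LLM
)
query_engine = index.as_query_engine(
    similarity_top_k=10,            # cast a wider net before reranking
    node_postprocessors=[rerank],
)
```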