Ask questions about Hindi cinema in everyday English and get grounded, factual answers. Under the hood this is a GraphRAG (Graph Retrieval-Augmented Generation) system: a Neo4j knowledge graph of films, actors, directors, composers, studios and awards, queried by walking relationships rather than matching loose text — with GPT-4o turning the retrieved facts into a readable reply.
Plain vector RAG is great at finding passages, but it struggles with questions that require following connections — "which composers scored a film that won both a National and a Filmfare award?" That's a chain of relationships, not a similarity lookup.
This project stores entities and the edges between them in a graph, so the system can locate the relevant nodes by embedding similarity and then traverse outward to collect connected facts. Every fact handed to the LLM comes straight from a real graph edge, so answers stay verifiable.
| Building block | Where it shows up here |
|---|---|
| Property graph database | Neo4j with a purpose-built Bollywood schema |
| Cypher | Data loading, traversal and aggregate queries |
| Vector embeddings | One OpenAI embedding stored per graph node |
| Retrieval pipeline | Vector search → graph walk → LLM synthesis |
| REST API | FastAPI exposing every pipeline step |
| Web UI | Streamlit chat + entity explorer |
| Orchestration | A single Docker Compose stack |
NODES
──────────────────────────────────────────────────────────
Person {name, born, profession, hometown}
Movie {title, year, genre, box_office_crore, description}
ProductionHouse {name, founded, founder, hq}
Award {name, category, year}
RELATIONSHIPS
──────────────────────────────────────────────────────────
(Person) -[:ACTED_IN {character, lead_role}]-> (Movie)
(Person) -[:DIRECTED]-> (Movie)
(Person) -[:COMPOSED_MUSIC_FOR]-> (Movie)
(Person) -[:WON]-> (Award)
(Movie) -[:WON]-> (Award)
(ProductionHouse) -[:PRODUCED]-> (Movie)
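The relationship list above can also be written down as data, which makes it easy to check an edge before loading it. This is a minimal sketch rather than code from the repository: `SCHEMA` and `is_valid_edge` are hypothetical names, and only the six edge shapes come from the listing.

```python
# Allowed (source label, relationship type, target label) triples,
# copied from the schema listing above.
SCHEMA = {
    ("Person", "ACTED_IN", "Movie"),
    ("Person", "DIRECTED", "Movie"),
    ("Person", "COMPOSED_MUSIC_FOR", "Movie"),
    ("Person", "WON", "Award"),
    ("Movie", "WON", "Award"),
    ("ProductionHouse", "PRODUCED", "Movie"),
}


def is_valid_edge(source_label: str, rel_type: str, target_label: str) -> bool:
    """Return True if this edge shape appears in the schema."""
    return (source_label, rel_type, target_label) in SCHEMA


print(is_valid_edge("Person", "DIRECTED", "Movie"))  # True
print(is_valid_edge("Movie", "DIRECTED", "Person"))  # False (direction is reversed)
```

Note that `WON` is the only relationship type shared by two source labels, so a check keyed on the type alone would not be enough.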
| Tool | Version | Notes |
|---|---|---|
| Docker Desktop | Latest | Keep it running in the background |
| Python | 3.11+ | For the loader / embedding scripts |
| OpenAI API key | — | Needed for embeddings and GPT-4o |
git clone <repo-url>
cd bollywood-graphrag
cp .env.example .env
# open .env and paste in your OpenAI key
A complete .env looks like:
OPENAI_API_KEY=sk-...
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=bollywood2024!
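For reference, one way a script could read these four values is sketched below. It assumes the variables are already present in the process environment (for example, exported by your shell or injected by Docker Compose); how `db.py` actually loads them is not shown here, and `Settings` and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    neo4j_uri: str
    neo4j_user: str
    neo4j_password: str


def load_settings(env=os.environ) -> Settings:
    """Build Settings from a mapping of environment variables."""
    # The two secrets have no sensible default, so fail early if absent.
    missing = [k for k in ("OPENAI_API_KEY", "NEO4J_PASSWORD") if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7687"),
        neo4j_user=env.get("NEO4J_USER", "neo4j"),
        neo4j_password=env["NEO4J_PASSWORD"],
    )


# Demo with an explicit mapping instead of the real environment.
demo = load_settings({"OPENAI_API_KEY": "sk-test", "NEO4J_PASSWORD": "secret"})
print(demo.neo4j_uri)  # bolt://localhost:7687
```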
docker compose up neo4j -d
# give it ~15s, then confirm it booted:
docker compose logs neo4j | grep "Started"
The browser console lives at http://localhost:7474 (login neo4j / bollywood2024!).
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
cd src
python loader.py
You should see something like:
[1/2] Loading nodes...
✓ Constraints active
✓ 35 Person nodes
✓ 26 Movie nodes
✓ 10 ProductionHouse nodes
✓ 25 Award nodes
[2/2] Loading relationships...
✓ 41 ACTED_IN relationships
✓ 27 DIRECTED relationships
...
python embeddings.py
This sends each node to the OpenAI embeddings API and writes the resulting vector back onto the node in Neo4j. It only needs to run once, and re-running is harmless (it overwrites).
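Before a node can be embedded it has to be turned into text. The exact format `embeddings.py` uses is not shown here, so the following is only a sketch of one plausible approach; `node_to_text` is a hypothetical helper, and the resulting string is what would be sent to the embeddings API.

```python
def node_to_text(label: str, props: dict) -> str:
    """Flatten one node into a single string suitable for embedding."""
    # Movies are identified by title, every other label by name.
    name = props.get("title") or props.get("name") or "unknown"
    details = ", ".join(
        f"{key}: {value}"
        for key, value in sorted(props.items())
        if key not in ("title", "name") and value is not None
    )
    return f"{label}: {name}" + (f" ({details})" if details else "")


print(node_to_text("Movie", {"title": "Dangal", "year": 2016}))
# Movie: Dangal (year: 2016)
```

Sorting the property keys keeps the text stable between runs, so the same node always produces the same input string.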
The Streamlit UI talks to FastAPI over HTTP, so the API has to be live first.
# from inside src/
uvicorn api:app --reload --port 8000
Keep that terminal open and check http://localhost:8000/docs to confirm it's serving.
In a second terminal, re-activate the venv and run:
cd src
streamlit run app.py
Then open http://localhost:8501
Heads up: three things must be up at once — Neo4j (Docker), FastAPI, and Streamlit. If juggling terminals is annoying, skip straight to step 8.
docker compose up --build
| Service | URL |
|---|---|
| Neo4j Browser | http://localhost:7474 |
| FastAPI docs | http://localhost:8000/docs |
| Streamlit chat | http://localhost:8501 |
The 💬 Chat tab takes plain-English questions. A few to try:
- "Which films has Shah Rukh Khan done with Yash Raj Films?"
- "Which music composers have worked with Aamir Khan productions?"
- "List the National Award winning films in the graph."
- "Which actors directed by Rajkumar Hirani also worked with AR Rahman?"
The 🔍 Explore tab looks up any entity by name and renders its surrounding graph neighbourhood.
# Ask a question
curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{"question": "Which films did Aamir Khan direct?", "top_k": 3, "hops": 2}'
# Vector search
curl "http://localhost:8000/search?q=revenge+thriller&label=Movie"
# An entity's neighbourhood
curl "http://localhost:8000/graph/Dangal?label=Movie&hops=2"
# Someone's filmography
curl "http://localhost:8000/person/AR%20Rahman/filmography"
# Graph-wide stats
curl "http://localhost:8000/stats"
bollywood-graphrag/
├── docker-compose.yml ← Neo4j + API + Streamlit in one stack
├── requirements.txt
├── .env.example
├── Dockerfile.api ← FastAPI image
├── Dockerfile.streamlit ← Streamlit image
└── src/
├── db.py ← Neo4j connection wrapper
├── loader.py ← Loads schema + data into Neo4j
├── embeddings.py ← Builds + stores node embeddings
├── graphrag.py ← The end-to-end retrieval pipeline
├── api.py ← FastAPI service
├── app.py ← Streamlit UI
└── data/
└── bollywood_data.py ← All nodes + relationships
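The `/ask` endpoint served by `api.py` can also be called from Python instead of curl. The sketch below uses only the standard library and mirrors the JSON body from the curl example; `build_ask_request` and `ask` are hypothetical helper names, and the shape of the JSON response is not documented here, so it is returned as-is.

```python
import json
import urllib.request


def build_ask_request(question, top_k=3, hops=2, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /ask endpoint."""
    payload = json.dumps({"question": question, "top_k": top_k, "hops": hops})
    return urllib.request.Request(
        f"{base_url}/ask",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, **kwargs):
    """Send the question to a running API and return the decoded JSON body."""
    with urllib.request.urlopen(build_ask_request(question, **kwargs)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires the FastAPI service to be running on port 8000.
    print(ask("Which films did Aamir Khan direct?"))
```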
User question (text)
│
▼
┌──────────────────────────────┐
│ 1. Vector Search │ Embed the question, find the top-k closest nodes
│ (embeddings.py) │ e.g. "Who directed PK?" → [Rajkumar Hirani, PK]
└──────────────────┬───────────┘
│ matched node identifiers
▼
┌──────────────────────────────┐
│ 2. Graph Traversal │ Walk N hops out from each matched node
│ (graphrag.py) │ Gather connected facts as triples
└──────────────────┬───────────┘
│ subgraph rendered as text
▼
┌──────────────────────────────┐
│ 3. Answer Generation │ GPT-4o reasons over the graph context
│ (OpenAI GPT-4o) │ Returns a grounded, checkable answer
└──────────────────────────────┘
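The first two stages of the diagram can be illustrated without Neo4j or OpenAI. The sketch below is a toy, in-memory version: the two-dimensional vectors are made up for the example, the function names are hypothetical, and the real `graphrag.py` may traverse and format facts differently. The final string is the kind of context that would be placed in the GPT-4o prompt.

```python
import math
from collections import deque


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query_vec, node_vecs, top_k=2):
    """Stage 1: return the top_k node names closest to the query vector."""
    ranked = sorted(node_vecs, key=lambda n: cosine(query_vec, node_vecs[n]), reverse=True)
    return ranked[:top_k]


def traverse(seeds, edges, hops=1):
    """Stage 2: collect triples within `hops` steps, ignoring edge direction."""
    seen, triples = set(seeds), []
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == hops:
            continue
        for src, rel, dst in edges:
            if node in (src, dst):
                if (src, rel, dst) not in triples:
                    triples.append((src, rel, dst))
                other = dst if node == src else src
                if other not in seen:
                    seen.add(other)
                    queue.append((other, depth + 1))
    return triples


def build_context(triples):
    """Render triples as text for the LLM prompt."""
    return "\n".join(f"({s}) -[:{r}]-> ({d})" for s, r, d in triples)


# Toy data: made-up vectors, a handful of edges.
NODE_VECS = {
    "PK": [1.0, 0.0],
    "Rajkumar Hirani": [0.9, 0.1],
    "Aamir Khan": [0.5, 0.5],
    "Dangal": [0.0, 1.0],
}
EDGES = [
    ("Rajkumar Hirani", "DIRECTED", "PK"),
    ("Aamir Khan", "ACTED_IN", "PK"),
    ("Aamir Khan", "ACTED_IN", "Dangal"),
]

seeds = vector_search([1.0, 0.05], NODE_VECS, top_k=2)
print(seeds)  # ['PK', 'Rajkumar Hirani']
print(build_context(traverse(seeds, EDGES, hops=1)))
```

With `hops=1` the context holds only the two edges touching PK; raising it to `hops=2` also pulls in the Dangal edge through Aamir Khan, which is how a larger hop count widens the set of facts the model sees.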
Append entries to MOVIES, ACTED_IN, DIRECTED, etc. in src/data/bollywood_data.py, then re-run loader.py and embeddings.py.
- Add the tuples to the right list in `bollywood_data.py`
- Write a matching loader function in `loader.py`
- Wire it into `load_all()`
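The three steps above might look roughly like this for an invented `CHOREOGRAPHED` relationship. Everything here is hypothetical: the relationship type, the placeholder data, and the loader signature. The existing functions in `loader.py` may be structured differently, so treat this as a pattern, not a drop-in.

```python
# Step 1: hypothetical new data, as (person name, movie title) pairs.
CHOREOGRAPHED = [
    ("Example Choreographer", "Example Movie"),
]

# One batched statement: match both endpoints, then MERGE the edge so
# that re-running the loader does not create duplicates.
CHOREOGRAPHED_QUERY = """
UNWIND $rows AS row
MATCH (p:Person {name: row.person})
MATCH (m:Movie {title: row.movie})
MERGE (p)-[:CHOREOGRAPHED]->(m)
"""


# Step 2: the loader function. `run` is any callable taking (query, params),
# for example `lambda q, p: session.run(q, p)` with a Neo4j session.
def load_choreographed(run, pairs=CHOREOGRAPHED):
    rows = [{"person": person, "movie": movie} for person, movie in pairs]
    run(CHOREOGRAPHED_QUERY, {"rows": rows})
    return len(rows)


# Step 3 would be a call to load_choreographed(...) from load_all().
# Here a recording stub stands in for the database:
calls = []
count = load_choreographed(lambda query, params: calls.append((query, params)))
print(f"{count} CHOREOGRAPHED relationships")
```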
The OpenAI calls live in graphrag.py — swap them for any OpenAI-compatible endpoint (Azure OpenAI, Gemini, Groq, etc.).
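One low-friction way to make the provider swappable is to read the endpoint from configuration. The sketch below is an assumption, not how `graphrag.py` is written: `LLM_BASE_URL` and `LLM_MODEL` are hypothetical variable names, while `api_key` and `base_url` are constructor arguments of the official `openai` Python client (version 1.x).

```python
import os


def llm_client_kwargs(env=os.environ):
    """Return constructor kwargs for an OpenAI-compatible client."""
    kwargs = {"api_key": env["OPENAI_API_KEY"]}
    base_url = env.get("LLM_BASE_URL")  # hypothetical; unset means api.openai.com
    if base_url:
        kwargs["base_url"] = base_url
    return kwargs


def llm_model(env=os.environ):
    """Model name to request; other providers use their own model names."""
    return env.get("LLM_MODEL", "gpt-4o")  # hypothetical variable


# Usage with the openai package installed:
#   from openai import OpenAI
#   client = OpenAI(**llm_client_kwargs())
#   client.chat.completions.create(model=llm_model(), messages=[...])
```

Embeddings are a separate concern: vectors from different embedding models are not comparable, so switching the embedding provider means re-running `embeddings.py`.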
// Aamir Khan films that grossed over 200 crore
MATCH (p:Person {name: 'Aamir Khan'})-[:ACTED_IN]->(m:Movie)
WHERE m.box_office_crore > 200
RETURN m.title, m.year, m.box_office_crore ORDER BY m.box_office_crore DESC
// Directors who also acted in a film they directed
MATCH (p:Person)-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(p)
RETURN p.name, m.title
// AR Rahman scores that went on to win a National Award
MATCH (ar:Person {name:'AR Rahman'})-[:COMPOSED_MUSIC_FOR]->(m:Movie)-[:WON]->(a:Award)
WHERE a.category = 'National'
RETURN m.title, a.name
// The shortest connection between Shah Rukh Khan and AR Rahman
MATCH path = shortestPath(
(a:Person {name: 'Shah Rukh Khan'})-[*]-(b:Person {name: 'AR Rahman'})
)
RETURN [n IN nodes(path) | coalesce(n.name, n.title)] AS path, length(path) AS hops
// Yash Raj productions that crossed 500 crore
MATCH (ph:ProductionHouse {name: 'Yash Raj Films'})-[:PRODUCED]->(m:Movie)
WHERE m.box_office_crore > 500
RETURN m.title, m.year, m.box_office_crore ORDER BY m.box_office_crore DESC
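Any of these queries can be run from Python as well as from the Neo4j Browser. The sketch below parameterizes the first query and assumes the official `neo4j` driver, whose sessions expose `run(query, **params)`; `top_grossers` is a hypothetical helper and is not part of `db.py`.

```python
TOP_GROSSERS = """
MATCH (p:Person {name: $actor})-[:ACTED_IN]->(m:Movie)
WHERE m.box_office_crore > $min_crore
RETURN m.title AS title, m.year AS year, m.box_office_crore AS crore
ORDER BY crore DESC
"""


def top_grossers(session, actor, min_crore):
    """Run the query and return plain dicts, one per matching film."""
    result = session.run(TOP_GROSSERS, actor=actor, min_crore=min_crore)
    return [
        {"title": r["title"], "year": r["year"], "crore": r["crore"]}
        for r in result
    ]


# Usage against the local stack (requires `pip install neo4j`):
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("bolt://localhost:7687",
#                                 auth=("neo4j", "<your NEO4J_PASSWORD>"))
#   with driver.session() as session:
#       print(top_grossers(session, "Aamir Khan", 200))
#   driver.close()
```

Passing the actor name and threshold as `$actor` and `$min_crore` instead of formatting them into the string avoids quoting problems and lets Neo4j reuse the query plan.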