🎬 Bollywood GraphRAG

Ask questions about Hindi cinema in everyday English and get grounded, factual answers. Under the hood this is a GraphRAG (Graph Retrieval-Augmented Generation) system: a Neo4j knowledge graph of films, actors, directors, composers, studios and awards, queried by walking relationships rather than matching loose text — with GPT-4o turning the retrieved facts into a readable reply.

Why GraphRAG?

Plain vector RAG is great at finding passages, but it struggles with questions that require following connections — "which composers scored a film that won both a National and a Filmfare award?" That's a chain of relationships, not a similarity lookup.

This project stores entities and the edges between them in a graph, so the system can locate the relevant nodes by embedding similarity and then traverse outward to collect connected facts. Every fact handed to the LLM comes straight from a real graph edge, so answers stay verifiable.
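To make the contrast concrete, here is a minimal in-memory sketch of the multi-hop question above. The triples are made-up placeholders (hypothetical names, not rows from this project's graph), a plain Python list stands in for Neo4j, and the helper name is an assumption for illustration only.

```python
# Toy illustration only: placeholder triples, not data from the real graph.
TRIPLES = [
    ("Composer A", "COMPOSED_MUSIC_FOR", "Film X"),
    ("Composer B", "COMPOSED_MUSIC_FOR", "Film Y"),
    ("Film X", "WON", "National Award 1"),
    ("Film X", "WON", "Filmfare Award 1"),
    ("Film Y", "WON", "Filmfare Award 2"),
]

# Award category lookup, mirroring the Award.category property in the schema.
AWARD_CATEGORY = {
    "National Award 1": "National",
    "Filmfare Award 1": "Filmfare",
    "Filmfare Award 2": "Filmfare",
}


def composers_of_double_winners(triples, award_category):
    """Follow COMPOSED_MUSIC_FOR -> WON and keep films that won both categories."""
    categories_by_winner = {}
    for subject, relation, obj in triples:
        if relation == "WON" and obj in award_category:
            categories_by_winner.setdefault(subject, set()).add(award_category[obj])

    double_winners = {
        winner for winner, categories in categories_by_winner.items()
        if {"National", "Filmfare"} <= categories
    }
    return sorted(
        subject for subject, relation, obj in triples
        if relation == "COMPOSED_MUSIC_FOR" and obj in double_winners
    )


print(composers_of_double_winners(TRIPLES, AWARD_CATEGORY))
```

The answer falls out of two edge hops and a set check. No amount of text similarity between the question and any single passage would produce it.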

Building block	Where it shows up here
Property graph database	Neo4j with a purpose-built Bollywood schema
Cypher	Data loading, traversal and aggregate queries
Vector embeddings	One OpenAI embedding stored per graph node
Retrieval pipeline	Vector search → graph walk → LLM synthesis
REST API	FastAPI exposing every pipeline step
Web UI	Streamlit chat + entity explorer
Orchestration	A single Docker Compose stack

Graph Schema

NODES
──────────────────────────────────────────────────────────
Person          {name, born, profession, hometown}
Movie           {title, year, genre, box_office_crore, description}
ProductionHouse {name, founded, founder, hq}
Award           {name, category, year}

RELATIONSHIPS
──────────────────────────────────────────────────────────
(Person)          -[:ACTED_IN {character, lead_role}]-> (Movie)
(Person)          -[:DIRECTED]->                        (Movie)
(Person)          -[:COMPOSED_MUSIC_FOR]->              (Movie)
(Person)          -[:WON]->                             (Award)
(Movie)           -[:WON]->                             (Award)
(ProductionHouse) -[:PRODUCED]->                        (Movie)
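
The relationship patterns above can also be written down as data, which is handy for sanity-checking new tuples before they are loaded. This is only a sketch: `is_valid_edge` is a hypothetical helper and not part of loader.py.

```python
# Allowed (start label, relationship type, end label) patterns from the schema.
ALLOWED_PATTERNS = {
    ("Person", "ACTED_IN", "Movie"),
    ("Person", "DIRECTED", "Movie"),
    ("Person", "COMPOSED_MUSIC_FOR", "Movie"),
    ("Person", "WON", "Award"),
    ("Movie", "WON", "Award"),
    ("ProductionHouse", "PRODUCED", "Movie"),
}

# Only ACTED_IN carries relationship properties in the schema.
RELATIONSHIP_PROPERTIES = {"ACTED_IN": {"character", "lead_role"}}


def is_valid_edge(start_label, rel_type, end_label, properties=None):
    """Return True if the edge matches the schema (hypothetical helper)."""
    if (start_label, rel_type, end_label) not in ALLOWED_PATTERNS:
        return False
    allowed_props = RELATIONSHIP_PROPERTIES.get(rel_type, set())
    # Every supplied property must be one the schema defines for this type.
    return set(properties or {}) <= allowed_props
```

For example, `is_valid_edge("Movie", "DIRECTED", "Person")` is rejected because the direction is reversed.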

Getting Started

What you'll need

Tool	Version	Notes
Docker Desktop	Latest	Keep it running in the background
Python	3.11+	For the loader / embedding scripts
OpenAI API key	—	Needed for embeddings and GPT-4o

1. Grab the code and set your secrets

git clone <repo-url>
cd bollywood-graphrag

cp .env.example .env
# open .env and paste in your OpenAI key

A complete .env looks like:

OPENAI_API_KEY=sk-...
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=bollywood2024!
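
If a later step fails with an authentication or connection error, a missing key in .env is a common cause. The sketch below checks for the four keys using only the standard library. How the project itself reads .env is not shown here, so `parse_env` and `missing_keys` are hypothetical helpers.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


REQUIRED_KEYS = ("OPENAI_API_KEY", "NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")


def missing_keys(settings):
    """List required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not settings.get(key)]


example = """
OPENAI_API_KEY=sk-...
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=bollywood2024!
"""
print(missing_keys(parse_env(example)))
```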

2. Bring up Neo4j

docker compose up neo4j -d

# give it ~15s, then confirm it booted:
docker compose logs neo4j | grep "Started"

The browser console lives at http://localhost:7474 (login neo4j / bollywood2024!).
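
Instead of waiting a fixed 15 seconds, a script can poll the Bolt port until it accepts connections. This is a rough standard-library sketch. `wait_for_port` is a hypothetical helper, and an open port only shows the socket is listening, not that Neo4j has finished starting.

```python
import socket
import time


def wait_for_port(host, port, timeout_seconds=30.0, interval_seconds=1.0):
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_seconds):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_seconds)
```

Calling `wait_for_port("localhost", 7687)` before running the loader uses the Bolt port from NEO4J_URI in .env.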

3. Set up the Python environment

python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate
pip install -r requirements.txt

4. Populate the graph

cd src
python loader.py

You should see something like:

[1/2] Loading nodes...
  ✓ Constraints active
  ✓ 35 Person nodes
  ✓ 26 Movie nodes
  ✓ 10 ProductionHouse nodes
  ✓ 25 Award nodes

[2/2] Loading relationships...
  ✓ 41 ACTED_IN relationships
  ✓ 27 DIRECTED relationships
  ...

5. Generate node embeddings

python embeddings.py

This sends each node to the OpenAI embeddings API and writes the resulting vector back onto the node in Neo4j. It only needs to run once per data load; re-running is harmless because each run overwrites the stored vectors.
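
The shape of that step can be sketched as follows. None of this is taken from embeddings.py: the text format, the `embedding` property name, the lookup by `elementId` and the helper names are all assumptions, and `embed_fn` stands in for the real OpenAI call so the sketch runs offline.

```python
def node_to_text(label, properties):
    """Flatten a node into one line of text to embed (assumed format)."""
    parts = [
        f"{key}: {value}"
        for key, value in sorted(properties.items())
        if value is not None
    ]
    return f"{label} | " + ", ".join(parts)


# SET replaces any existing vector, so running the step twice is idempotent.
# The property name `embedding` is an assumption.
WRITE_EMBEDDING_CYPHER = (
    "MATCH (n) WHERE elementId(n) = $node_id "
    "SET n.embedding = $vector"
)


def embed_nodes(nodes, embed_fn):
    """Yield (cypher, params) pairs; embed_fn stands in for the OpenAI call."""
    for node in nodes:
        text = node_to_text(node["label"], node["properties"])
        yield WRITE_EMBEDDING_CYPHER, {
            "node_id": node["id"],
            "vector": embed_fn(text),
        }
```

Passing the embedding function in as an argument keeps the text-building and write logic testable without an API key.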

6. Launch the API

The Streamlit UI talks to FastAPI over HTTP, so the API has to be live first.

# from inside src/
uvicorn api:app --reload --port 8000

Keep that terminal open and check http://localhost:8000/docs to confirm it's serving.

7. Launch the UI

In a second terminal, re-activate the venv and run:

cd src
streamlit run app.py

Then open http://localhost:8501

Heads up: three things must be up at once — Neo4j (Docker), FastAPI, and Streamlit. If juggling terminals is annoying, skip straight to step 8.

8. Or just run the whole stack

docker compose up --build

Service	URL
Neo4j Browser	http://localhost:7474
FastAPI docs	http://localhost:8000/docs
Streamlit chat	http://localhost:8501

Using It

The chat UI

The 💬 Chat tab takes plain-English questions. A few to try:

"Which films has Shah Rukh Khan done with Yash Raj Films?"
"Which music composers have worked with Aamir Khan productions?"
"List the National Award winning films in the graph."
"Which actors directed by Rajkumar Hirani also worked with AR Rahman?"

The 🔍 Explore tab looks up any entity by name and renders its surrounding graph neighbourhood.

Hitting the API directly

# Ask a question
curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "Which films did Aamir Khan direct?", "top_k": 3, "hops": 2}'

# Vector search
curl "http://localhost:8000/search?q=revenge+thriller&label=Movie"

# An entity's neighbourhood
curl "http://localhost:8000/graph/Dangal?label=Movie&hops=2"

# Someone's filmography
curl "http://localhost:8000/person/AR%20Rahman/filmography"

# Graph-wide stats
curl "http://localhost:8000/stats"
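
The same /ask call can be made from Python with only the standard library. This is a minimal sketch: the request body mirrors the curl example above, but the response schema is not documented here, so `ask` simply returns whatever JSON comes back. Both function names are hypothetical.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"


def build_ask_request(question, top_k=3, hops=2):
    """Build the POST /ask request shown in the curl example."""
    payload = {"question": question, "top_k": top_k, "hops": hops}
    return urllib.request.Request(
        f"{API_BASE}/ask",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, top_k=3, hops=2, timeout=60):
    """Send the question and return the decoded JSON body."""
    request = build_ask_request(question, top_k, hops)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running, `ask("Which films did Aamir Khan direct?")` sends the same request as the first curl command.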

Layout

bollywood-graphrag/
├── docker-compose.yml          ← Neo4j + API + Streamlit in one stack
├── requirements.txt
├── .env.example
├── Dockerfile.api              ← FastAPI image
├── Dockerfile.streamlit        ← Streamlit image
└── src/
    ├── db.py                   ← Neo4j connection wrapper
    ├── loader.py               ← Loads schema + data into Neo4j
    ├── embeddings.py           ← Builds + stores node embeddings
    ├── graphrag.py             ← The end-to-end retrieval pipeline
    ├── api.py                  ← FastAPI service
    ├── app.py                  ← Streamlit UI
    └── data/
        └── bollywood_data.py   ← All nodes + relationships

The Pipeline at a Glance

User question (text)
       │
       ▼
┌──────────────────────────────┐
│  1. Vector Search            │  Embed the question, find the top-k closest nodes
│     (embeddings.py)          │  e.g. "Who directed PK?" → [Rajkumar Hirani, PK]
└──────────────────┬───────────┘
                   │  matched node identifiers
                   ▼
┌──────────────────────────────┐
│  2. Graph Traversal          │  Walk N hops out from each matched node
│     (graphrag.py)            │  Gather connected facts as triples
└──────────────────┬───────────┘
                   │  subgraph rendered as text
                   ▼
┌──────────────────────────────┐
│  3. Answer Generation        │  GPT-4o reasons over the graph context
│     (OpenAI GPT-4o)          │  Returns a grounded, checkable answer
└──────────────────────────────┘
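
Step 2 and the hand-off to step 3 can be sketched in plain Python. This is not the code in graphrag.py: the edge list is a tiny illustrative stand-in for Neo4j, the walk ignores edge direction, and `collect_triples` and `render_context` are hypothetical names.

```python
from collections import deque

# Illustrative edges only; the real ones are read from Neo4j.
EDGES = [
    ("Rajkumar Hirani", "DIRECTED", "PK"),
    ("Aamir Khan", "ACTED_IN", "PK"),
    ("Aamir Khan", "ACTED_IN", "Dangal"),
]


def collect_triples(edges, seeds, hops):
    """Breadth-first walk up to `hops` edges out from the seeds, ignoring direction."""
    neighbours = {}
    for edge in edges:
        start, _, end = edge
        neighbours.setdefault(start, []).append((end, edge))
        neighbours.setdefault(end, []).append((start, edge))

    seen_nodes = set(seeds)
    collected = []
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == hops:
            continue  # hop budget used up, do not expand this node
        for other, edge in neighbours.get(node, []):
            if edge not in collected:
                collected.append(edge)
            if other not in seen_nodes:
                seen_nodes.add(other)
                queue.append((other, depth + 1))
    return collected


def render_context(triples):
    """Render triples as the text block handed to the LLM."""
    return "\n".join(f"({s}) -[{r}]-> ({o})" for s, r, o in triples)


print(render_context(collect_triples(EDGES, ["PK"], hops=2)))
```

Raising `hops` widens the context: with one hop from PK only the first two edges are collected, and the Dangal edge appears at two hops.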

Making It Your Own

More movies

Append entries to MOVIES, ACTED_IN, DIRECTED, etc. in src/data/bollywood_data.py, then re-run loader.py and embeddings.py.

A new relationship type

Add the tuples to the right list in bollywood_data.py
Write a matching loader function in loader.py
Wire it into load_all()
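
As a rough sketch of the second step, a loader function might build one parameterised Cypher statement and hand the tuples over as `$rows`. The existing loader functions are not shown in this README, so the function name, the row shape (`start`, `end`) and the `WROTE` relationship used below are all hypothetical.

```python
import re


def build_relationship_query(rel_type, start_label="Person", end_label="Movie",
                             start_key="name", end_key="title"):
    """Return a Cypher statement that MERGEs rel_type edges for all $rows."""
    # Labels and relationship types cannot be query parameters in Cypher,
    # so validate them before interpolating into the query text.
    for identifier in (rel_type, start_label, end_label, start_key, end_key):
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", identifier):
            raise ValueError(f"unsafe identifier: {identifier!r}")
    return (
        "UNWIND $rows AS row\n"
        f"MATCH (a:{start_label} {{{start_key}: row.start}})\n"
        f"MATCH (b:{end_label} {{{end_key}: row.end}})\n"
        f"MERGE (a)-[:{rel_type}]->(b)"
    )


print(build_relationship_query("WROTE"))
```

MERGE rather than CREATE keeps the load repeatable, so running loader.py twice does not duplicate edges.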

A different model

The OpenAI calls live in graphrag.py — swap them for any OpenAI-compatible endpoint (Azure OpenAI, Gemini, Groq, etc.).

Fun Queries for the Neo4j Browser

// Aamir Khan films that grossed over 200 crore
MATCH (p:Person {name: 'Aamir Khan'})-[:ACTED_IN]->(m:Movie)
WHERE m.box_office_crore > 200
RETURN m.title, m.year, m.box_office_crore ORDER BY m.box_office_crore DESC

// Directors who also acted in a film they directed
MATCH (p:Person)-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(p)
RETURN p.name, m.title

// AR Rahman scores that went on to win a National Award
MATCH (ar:Person {name:'AR Rahman'})-[:COMPOSED_MUSIC_FOR]->(m:Movie)-[:WON]->(a:Award)
WHERE a.category = 'National'
RETURN m.title, a.name

// The shortest connection between Shah Rukh Khan and AR Rahman
MATCH path = shortestPath(
    (a:Person {name: 'Shah Rukh Khan'})-[*]-(b:Person {name: 'AR Rahman'})
)
RETURN [n IN nodes(path) | coalesce(n.name, n.title)] AS path, length(path) AS hops

// Yash Raj productions that crossed 500 crore
MATCH (ph:ProductionHouse {name: 'Yash Raj Films'})-[:PRODUCED]->(m:Movie)
WHERE m.box_office_crore > 500
RETURN m.title, m.year, m.box_office_crore ORDER BY m.box_office_crore DESC
