Skip to content

Ranu92/CineVerse-GraphRAG

Repository files navigation

🎬 Bollywood GraphRAG

Ask questions about Hindi cinema in everyday English and get grounded, factual answers. Under the hood this is a GraphRAG (Graph Retrieval-Augmented Generation) system: a Neo4j knowledge graph of films, actors, directors, composers, studios and awards, queried by walking relationships rather than matching loose text — with GPT-4o turning the retrieved facts into a readable reply.


Why GraphRAG?

Plain vector RAG is great at finding passages, but it struggles with questions that require following connections — "which composers scored a film that won both a National and a Filmfare award?" That's a chain of relationships, not a similarity lookup.

This project stores entities and the edges between them in a graph, so the system can locate the relevant nodes by embedding similarity and then traverse outward to collect connected facts. Every fact handed to the LLM comes straight from a real graph edge, so answers stay verifiable.

Building block Where it shows up here
Property graph database Neo4j with a purpose-built Bollywood schema
Cypher Data loading, traversal and aggregate queries
Vector embeddings One OpenAI embedding stored per graph node
Retrieval pipeline Vector search → graph walk → LLM synthesis
REST API FastAPI exposing every pipeline step
Web UI Streamlit chat + entity explorer
Orchestration A single Docker Compose stack

Graph Schema

NODES
──────────────────────────────────────────────────────────
Person          {name, born, profession, hometown}
Movie           {title, year, genre, box_office_crore, description}
ProductionHouse {name, founded, founder, hq}
Award           {name, category, year}

RELATIONSHIPS
──────────────────────────────────────────────────────────
(Person)          -[:ACTED_IN {character, lead_role}]-> (Movie)
(Person)          -[:DIRECTED]->                        (Movie)
(Person)          -[:COMPOSED_MUSIC_FOR]->              (Movie)
(Person)          -[:WON]->                             (Award)
(Movie)           -[:WON]->                             (Award)
(ProductionHouse) -[:PRODUCED]->                        (Movie)

Getting Started

What you'll need

Tool Version Notes
Docker Desktop Latest Keep it running in the background
Python 3.11+ For the loader / embedding scripts
OpenAI API key Needed for embeddings and GPT-4o

1. Grab the code and set your secrets

git clone <repo-url>
cd bollywood-graphrag

cp .env.example .env
# open .env and paste in your OpenAI key

A complete .env looks like:

OPENAI_API_KEY=sk-...
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=bollywood2024!

2. Bring up Neo4j

docker compose up neo4j -d

# give it ~15s, then confirm it booted:
docker compose logs neo4j | grep "Started"

The browser console lives at http://localhost:7474 (login neo4j / bollywood2024!).

3. Set up the Python environment

python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate
pip install -r requirements.txt

4. Populate the graph

cd src
python loader.py

You should see something like:

[1/2] Loading nodes...
  ✓ Constraints active
  ✓ 35 Person nodes
  ✓ 26 Movie nodes
  ✓ 10 ProductionHouse nodes
  ✓ 25 Award nodes

[2/2] Loading relationships...
  ✓ 41 ACTED_IN relationships
  ✓ 27 DIRECTED relationships
  ...

5. Generate node embeddings

python embeddings.py

This sends each node to the OpenAI embeddings API and writes the resulting vector back onto the node in Neo4j. It only needs to run once, and re-running is harmless (it overwrites).

6. Launch the API

The Streamlit UI talks to FastAPI over HTTP, so the API has to be live first.

# from inside src/
uvicorn api:app --reload --port 8000

Keep that terminal open and check http://localhost:8000/docs to confirm it's serving.

7. Launch the UI

In a second terminal, re-activate the venv and run:

cd src
streamlit run app.py

Then open http://localhost:8501

Heads up: three things must be up at once — Neo4j (Docker), FastAPI, and Streamlit. If juggling terminals is annoying, skip straight to step 8.

8. Or just run the whole stack

docker compose up --build
Service URL
Neo4j Browser http://localhost:7474
FastAPI docs http://localhost:8000/docs
Streamlit chat http://localhost:8501

Using It

The chat UI

The 💬 Chat tab takes plain-English questions. A few to try:

  • "Which films has Shah Rukh Khan done with Yash Raj Films?"
  • "Which music composers have worked with Aamir Khan productions?"
  • "List the National Award winning films in the graph."
  • "Which actors directed by Rajkumar Hirani also worked with AR Rahman?"

The 🔍 Explore tab looks up any entity by name and renders its surrounding graph neighbourhood.

Hitting the API directly

# Ask a question
curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "Which films did Aamir Khan direct?", "top_k": 3, "hops": 2}'

# Vector search
curl "http://localhost:8000/search?q=revenge+thriller&label=Movie"

# An entity's neighbourhood
curl "http://localhost:8000/graph/Dangal?label=Movie&hops=2"

# Someone's filmography
curl "http://localhost:8000/person/AR%20Rahman/filmography"

# Graph-wide stats
curl "http://localhost:8000/stats"

Layout

bollywood-graphrag/
├── docker-compose.yml          ← Neo4j + API + Streamlit in one stack
├── requirements.txt
├── .env.example
├── Dockerfile.api              ← FastAPI image
├── Dockerfile.streamlit        ← Streamlit image
└── src/
    ├── db.py                   ← Neo4j connection wrapper
    ├── loader.py               ← Loads schema + data into Neo4j
    ├── embeddings.py           ← Builds + stores node embeddings
    ├── graphrag.py             ← The end-to-end retrieval pipeline
    ├── api.py                  ← FastAPI service
    ├── app.py                  ← Streamlit UI
    └── data/
        └── bollywood_data.py   ← All nodes + relationships

The Pipeline at a Glance

User question (text)
       │
       ▼
┌──────────────────────────────┐
│  1. Vector Search            │  Embed the question, find the top-k closest nodes
│     (embeddings.py)          │  e.g. "Who directed PK?" → [Rajkumar Hirani, PK]
└──────────────────┬───────────┘
                   │  matched node identifiers
                   ▼
┌──────────────────────────────┐
│  2. Graph Traversal          │  Walk N hops out from each matched node
│     (graphrag.py)            │  Gather connected facts as triples
└──────────────────┬───────────┘
                   │  subgraph rendered as text
                   ▼
┌──────────────────────────────┐
│  3. Answer Generation        │  GPT-4o reasons over the graph context
│     (OpenAI GPT-4o)          │  Returns a grounded, checkable answer
└──────────────────────────────┘

Making It Your Own

More movies

Append entries to MOVIES, ACTED_IN, DIRECTED, etc. in src/data/bollywood_data.py, then re-run loader.py and embeddings.py.

A new relationship type

  1. Add the tuples to the right list in bollywood_data.py
  2. Write a matching loader function in loader.py
  3. Wire it into load_all()

A different model

The OpenAI calls live in graphrag.py — swap them for any OpenAI-compatible endpoint (Azure OpenAI, Gemini, Groq, etc.).


Fun Queries for the Neo4j Browser

-- Aamir Khan films that grossed over 200 crore
MATCH (p:Person {name: 'Aamir Khan'})-[:ACTED_IN]->(m:Movie)
WHERE m.box_office_crore > 200
RETURN m.title, m.year, m.box_office_crore ORDER BY m.box_office_crore DESC

-- Directors who also acted in a film they directed
MATCH (p:Person)-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(p)
RETURN p.name, m.title

-- AR Rahman scores that went on to win a National Award
MATCH (ar:Person {name:'AR Rahman'})-[:COMPOSED_MUSIC_FOR]->(m:Movie)-[:WON]->(a:Award)
WHERE a.category = 'National'
RETURN m.title, a.name

-- The shortest connection between Shah Rukh Khan and AR Rahman
MATCH path = shortestPath(
    (a:Person {name: 'Shah Rukh Khan'})-[*]-(b:Person {name: 'AR Rahman'})
)
RETURN [n IN nodes(path) | coalesce(n.name, n.title)] AS path, length(path) AS hops

-- Yash Raj productions that crossed 500 crore
MATCH (ph:ProductionHouse {name: 'Yash Raj Films'})-[:PRODUCED]->(m:Movie)
WHERE m.box_office_crore > 500
RETURN m.title, m.year, m.box_office_crore ORDER BY m.box_office_crore DESC

About

GraphRAG over a Bollywood knowledge graph — ask Hindi-cinema questions in plain English, answered by Neo4j graph traversal + an LLM. FastAPI backend, Streamlit UI.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages