A natural language movie search engine prototype that understands complex queries and returns relevant results from a dataset of ~9,700 movies.
It bridges the gap between everyday language and structured movie data, so users can find movies with queries like:
- "sci-fi movies from the 90s with Tom Hanks"
- "war movies about love"
- "funny films from the early 2000s"
- "comedy films in the 80s starring Eddie Murphy"
- Natural Language Understanding: LLM-based query parsing to extract search intent and filters
- Hybrid Search: Combines BM25 keyword search with semantic search for better relevance
- Intelligent Filtering: Supports filtering by year ranges, genres, actors, directors, and ratings
- Multiple Search Strategies: Configurable search strategies (BM25, semantic, or fusion)
- Flexible Parsing: Separate parser configuration for BM25 and semantic (regex or LLM)
- Rich Console Output: Formatted output with colors, tables, and metadata
- Source Tracking: Results show which search method found them (BM25, Semantic, or Both)
- Fast Performance: Optimized indexing and search algorithms for sub-2-second response times
- CLI Interface: Interactive command-line interface for easy querying
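As a rough illustration of the regex-based parsing and year-range filtering listed above, the sketch below pulls a decade filter out of a query and strips stop words from what remains. Everything in it is an assumption made for illustration: the stop-word list, the `parse_query` name, the `year_min`/`year_max` keys, and the convention that "early" means the first five years of a decade are hypothetical and may not match the project's actual parser. Actor, director, and genre extraction are left out.

```python
import re

# Hypothetical stop-word list; the project's real list is not shown in this README.
STOP_WORDS = {"a", "an", "the", "from", "in", "with", "about",
              "movie", "movies", "film", "films"}

# Matches "90s", "1990s", "2000s", optionally preceded by "early" or "late".
DECADE_RE = re.compile(r"\b(?:(early|late)\s+)?((?:19|20)?\d0)s\b", re.IGNORECASE)


def parse_query(query):
    """Return cleaned search text plus any year filter found in the query."""
    filters = {}
    match = DECADE_RE.search(query)
    if match:
        qualifier, digits = match.group(1), match.group(2)
        if len(digits) == 4:
            start = int(digits)
        else:
            # Assumed heuristic: two-digit decades below 30 belong to the 2000s.
            start = (1900 if int(digits) >= 30 else 2000) + int(digits)
        end = start + 9
        if qualifier:
            # Assumed convention: "early" = first five years, "late" = last five.
            if qualifier.lower() == "early":
                end = start + 4
            else:
                start += 5
        filters["year_min"], filters["year_max"] = start, end
        # Remove the decade phrase so it does not pollute the keyword search.
        query = query[:match.start()] + query[match.end():]
    tokens = re.findall(r"[a-z0-9'-]+", query.lower())
    terms = [t for t in tokens if t not in STOP_WORDS]
    return {"text": " ".join(terms), "filters": filters}


result = parse_query("sci-fi movies from the 90s with Tom Hanks")
print(result)
```

A rule-based parser like this is fast and needs no API key, but it only recognises the patterns it was written for, which is the trade-off against an LLM-based parser.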
The system consists of several key components:
- Query Parser: Configurable parser (regex or LLM) that extracts search terms and filters
  - Regex Parser: Fast rule-based parsing with stop word removal
  - LLM Parser: Gemini-based natural language understanding
  - Separate parsers for BM25 (cleaned text) and semantic (full text) processing
- BM25 Indexer: Whoosh-based keyword search index for exact matches and fuzzy matching
- Semantic Indexer: FAISS-based semantic search using sentence transformers
- Fusion Engine: RRF (Reciprocal Rank Fusion) to combine results from multiple search methods
- Search Engine: Orchestrates all components based on configured strategy
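To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion with source tracking. Each document scores `weight / (k + rank)` in every ranked list it appears in, and the scores are summed. The function name `rrf_fuse`, the default `k=60` (a commonly used constant, not necessarily this project's setting), and the per-source weights are assumptions for illustration, not the contents of `rrf_fusion.py`.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, weights=None):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    rankings: dict mapping a source name to a list of ids, best first.
    Returns (doc_id, score, source_label) tuples sorted by fused score.
    """
    weights = weights or {}
    scores = defaultdict(float)
    sources = defaultdict(list)
    for source, ranked_ids in rankings.items():
        weight = weights.get(source, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
            sources[doc_id].append(source)
    ordered = sorted(scores, key=lambda d: scores[d], reverse=True)
    return [
        (d, scores[d], "Both" if len(sources[d]) > 1 else sources[d][0])
        for d in ordered
    ]


fused = rrf_fuse({"BM25": ["a", "b", "c"], "Semantic": ["b", "d"]})
for doc_id, score, label in fused:
    print(doc_id, round(score, 5), label)
```

Because RRF uses only ranks, it avoids having to normalise BM25 scores and vector similarities onto a common scale; a document found by both methods naturally rises to the top.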
The system uses the Movie Search Ranking Dataset (MSRD) which contains:
- ~9,700 movies with rich metadata (titles, descriptions, genres, cast, crew, tags, ratings, etc.)
- Tab-separated CSV format with fields: id, title, overview, tags, genres, director, actors, characters, year, votes, rating, popularity, budget, poster_url
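Since the file is tab-separated despite its `.csv` extension, it has to be read with an explicit tab delimiter. The sketch below does this with the standard library; the `load_movies` helper and the assumption that `year` and `rating` hold numeric text are illustrative, and the sample rows are made up so the snippet runs without the real dataset.

```python
import csv
import io


def load_movies(fileobj):
    """Read tab-separated movie rows into dicts, converting numeric fields."""
    reader = csv.DictReader(fileobj, delimiter="\t")
    movies = []
    for row in reader:
        # Assumed: year and rating are numeric strings and may be empty.
        row["year"] = int(float(row["year"])) if row.get("year") else None
        row["rating"] = float(row["rating"]) if row.get("rating") else None
        movies.append(row)
    return movies


# Made-up sample using a subset of the documented columns.
sample = io.StringIO(
    "id\ttitle\tgenres\tyear\trating\n"
    "1\tExample Movie\tComedy\t1994\t7.5\n"
    "2\tUndated Movie\tDrama\t\t\n"
)
movies = load_movies(sample)
print(movies[0]["title"], movies[0]["year"], movies[0]["rating"])

# With the real file (path taken from this README):
# with open("../msrd/dataset/movies.csv", newline="", encoding="utf-8") as f:
#     movies = load_movies(f)
```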
- Python 3.8+
- Required packages (see requirements.txt)
- Gemini API key (for LLM-based query parsing)
- Clone the repository and navigate to the project directory:
    cd fusion-reel/src
- Install dependencies:
    pip install -r requirements.txt
- Set up environment variables:
    # Create a .env file in the src directory
    echo "GEMINI_API_KEY=your_api_key_here" > .env
- Prepare the dataset:
  - The dataset should be located at ../msrd/dataset/movies.csv
  - Ensure the CSV file is tab-separated
- Build indices (if not already built):
  - BM25 index: Run the BM25 indexer to create Whoosh indices
  - Semantic index: Run the semantic indexer to create FAISS indices
    python app.py
    python -m src.cli search "sci-fi movies from the 90s"
    python -m src.cli search "romantic comedies" --format json
The system is configured via config.yaml. Key settings include:
- Search Strategy: Choose from bm25, semantic, or fusion
- Parser Configuration: Separate parser strategies for BM25 and semantic (regex or llm)
- LLM Settings: Configure model, temperature, and API provider
- Indexer Paths: Specify paths to BM25 and semantic indices
- Search Limits: Configure result limits for BM25, FAISS, and final output
- Fusion Parameters: Adjust RRF fusion weights and constants
- Embedding Model: Choose embedding model (default: all-MiniLM-L6-v2, alternative: nomic-ai/nomic-embed-text-v1)
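The allowed values for the strategy settings can be checked before the engine starts. The sketch below validates an already-loaded configuration dictionary. Only `parser.bm25_strategy` and `parser.semantic_strategy` are key names taken from this README; the `search.strategy` key and the `validate_config` helper are hypothetical, and the real layout of config.yaml may differ.

```python
VALID_SEARCH_STRATEGIES = {"bm25", "semantic", "fusion"}
VALID_PARSER_STRATEGIES = {"regex", "llm"}


def validate_config(cfg):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    # "search.strategy" is an assumed key name for the search strategy.
    strategy = cfg.get("search", {}).get("strategy")
    if strategy not in VALID_SEARCH_STRATEGIES:
        errors.append(f"search.strategy must be one of "
                      f"{sorted(VALID_SEARCH_STRATEGIES)}, got {strategy!r}")
    parser = cfg.get("parser", {})
    for key in ("bm25_strategy", "semantic_strategy"):
        if parser.get(key) not in VALID_PARSER_STRATEGIES:
            errors.append(f"parser.{key} must be one of "
                          f"{sorted(VALID_PARSER_STRATEGIES)}, got {parser.get(key)!r}")
    return errors


config = {
    "search": {"strategy": "fusion"},
    "parser": {"bm25_strategy": "regex", "semantic_strategy": "llm"},
}
print(validate_config(config))
```

Collecting all problems instead of raising on the first one lets a user fix a broken configuration in a single pass.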
- bm25: BM25 keyword search with cleaned text preprocessing
  - Uses configured parser (parser.bm25_strategy) for query cleaning
  - Stop words removed, filters extracted
  - Fast and precise for exact matches
- semantic: Semantic search with full query context
  - Uses full original query text (no modification)
  - Filters extracted separately using parser.semantic_strategy
  - Better semantic understanding
- fusion: Hybrid search combining BM25 and semantic (recommended)
  - BM25 uses cleaned text, semantic uses full text
  - Results include source tracking (BM25, Semantic, or Both)
  - Best accuracy by combining both methods
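The difference between the three strategies comes down to which retrievers run and which version of the query each one receives. The sketch below captures that routing; `plan_search` and the stand-in `clean` function are hypothetical names for illustration and are not taken from `search_engine.py`.

```python
def plan_search(query, strategy, clean):
    """Return (retriever, query_text) pairs for the chosen strategy.

    clean: callable that produces the stop-word-free text used by BM25.
    """
    if strategy == "bm25":
        return [("bm25", clean(query))]
    if strategy == "semantic":
        # Semantic search keeps the original query unmodified.
        return [("semantic", query)]
    if strategy == "fusion":
        # Both retrievers run; their results would then be merged with RRF.
        return [("bm25", clean(query)), ("semantic", query)]
    raise ValueError(f"unknown strategy: {strategy!r}")


def simple_clean(text):
    """Stand-in cleaner: lowercase and drop a few example stop words."""
    stop_words = {"the", "from", "about", "movies"}
    return " ".join(w for w in text.lower().split() if w not in stop_words)


for step in plan_search("war movies about love", "fusion", simple_clean):
    print(step)
```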
fusion-reel/
├── src/
│   ├── app.py                    # Interactive CLI application
│   ├── cli.py                    # Command-line interface
│   ├── search_engine.py          # Main search engine orchestrator
│   ├── config.yaml               # Configuration file
│   ├── requirements.txt          # Python dependencies
│   ├── indexer/                  # Search indexers
│   │   ├── bm25_indexer.py       # BM25/Whoosh indexer
│   │   └── semantic_indexer.py   # FAISS semantic indexer
│   ├── query_parser/             # Query parsing components
│   │   ├── regex_parser.py       # Regex-based query parser
│   │   └── llm_parser.py         # LLM-based query parser
│   ├── utils/                    # Utility functions
│   │   └── formatters.py         # Rich console formatters
│   ├── llm/                      # LLM handlers
│   │   └── gemini_handler.py     # Gemini API handler
│   ├── fusion/                   # Result fusion algorithms
│   │   └── rrf_fusion.py         # Reciprocal Rank Fusion
│   └── config/                   # Configuration management
│       └── loader.py             # Config loader
├── msrd/                         # Dataset directory
│   └── dataset/
│       └── movies.csv            # Movie metadata (tab-separated)
└── docs/                         # Documentation files
Comprehensive documentation is available in the docs/ directory:
- Query Understanding - How natural language queries are parsed
- Content Search - BM25 and semantic search implementation
- Intelligent Filtering - Filtering capabilities and logic
- Data Management - Dataset processing and indexing
- Performance & Scalability - Performance characteristics and optimizations
- User Interface - CLI and interface details
- Architecture Overview - System architecture and components
- Design Choices & Trade-offs - Design decisions and rationale
- Drawbacks & Limitations - Known limitations and issues
- Future Roadmap - Planned improvements and enhancements
- Query Response Time: Typically under 2 seconds
- Index Size:
  - BM25 index: a few MB (Whoosh)
  - Semantic index: tens of MB (FAISS + embeddings)
- Memory Usage: Moderate (depends on embedding model and index size)
- Output Formatting: Rich console formatting with colors, tables, and panels
- LLM parser requires API key (costs apply) - regex parser available as alternative
- Semantic search requires pre-computed embeddings
- Filtering capabilities are limited to predefined fields
- No web UI (CLI only, but with Rich console formatting)
- Limited to English language queries
- Rich formatting requires a terminal that supports Rich console output (most modern terminals do)
See Drawbacks & Limitations for detailed information.
The dataset is shared under the CC-BY-SA 4.0 license. See msrd/LICENSE.md for details.
This is a prototype implementation. For production use, consider:
- Adding caching mechanisms
- Implementing query rewriting
- Adding evaluation metrics
- Building a web frontend
- Supporting multiple languages
See Future Roadmap for planned improvements.
For questions or issues, please refer to the documentation files in the docs/ directory.