The crawl was seeded mostly with football-related websites.
Falcony provides powerful text search capabilities with support for:
- Standard keyword searching (e.g., "premier league standings")
- Relevance-based result ranking
- PageRank implementation for determining page importance
- Search suggestions based on popular football queries
Search for exact phrases using quotation marks:
- Example: "Cristiano Ronaldo goals" (results include only pages containing the exact phrase)
Combine phrases with logical operators for advanced searching:
- AND: "Lionel Messi" AND "Barcelona"
- OR: "Premier League" OR "La Liga"
- NOT: "Real Madrid" NOT "Champions League"
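
The boolean operators above could be handled roughly as follows. This is a minimal sketch, not Falcony's actual query processor: the class `BooleanQuerySketch`, its method names, and the assumption that a query holds at most one operator between two quoted phrases are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for queries such as "Lionel Messi" AND "Barcelona". */
final class BooleanQuerySketch {
    private static final Pattern PHRASE = Pattern.compile("\"([^\"]+)\"");
    private static final Pattern OPERATOR = Pattern.compile("\"\\s+(AND|OR|NOT)\\s+\"");

    /** Returns the quoted phrases in the order they appear in the query. */
    static List<String> phrases(String query) {
        List<String> found = new ArrayList<>();
        Matcher m = PHRASE.matcher(query);
        while (m.find()) {
            found.add(m.group(1).trim());
        }
        return found;
    }

    /** Returns AND, OR or NOT when one joins two quoted phrases, otherwise null. */
    static String operator(String query) {
        Matcher m = OPERATOR.matcher(query);
        return m.find() ? m.group(1) : null;
    }

    /** Combines the document ids matched by each phrase according to the operator. */
    static Set<String> combine(Set<String> left, Set<String> right, String operator) {
        Set<String> result = new TreeSet<>(left);
        switch (operator) {
            case "AND": result.retainAll(right); break; // documents containing both phrases
            case "OR":  result.addAll(right);    break; // documents containing either phrase
            case "NOT": result.removeAll(right); break; // left phrase without the right one
            default: throw new IllegalArgumentException("Unknown operator: " + operator);
        }
        return result;
    }
}
```

With this sketch, a plain phrase query has no operator (`operator` returns `null`), so only the single phrase lookup would be needed.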
Search by uploading an image to find visually similar football images:
- Uses DinoV2 ONNX model for feature extraction
- Vector similarity search for efficient image matching
- Supports various image formats
- Collects web pages and images from the internet
- Respects robots.txt rules
- Normalizes URLs to avoid duplicates
- Stores documents in MongoDB
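
To illustrate the URL normalization step of the crawler, here is a small sketch built on `java.net.URI`. The specific rules (lowercasing scheme and host, dropping default ports, fragments and trailing slashes) are assumptions about what "normalize" might cover, and `UrlNormalizerSketch` is a hypothetical name.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical normalizer; the crawler's real rules may differ. */
final class UrlNormalizerSketch {
    static String normalize(String rawUrl) {
        // URI.create throws IllegalArgumentException on malformed input;
        // normalize() resolves "." and ".." path segments.
        URI uri = URI.create(rawUrl.trim()).normalize();
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Absolute URL required: " + rawUrl);
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);

        int port = uri.getPort();
        boolean defaultPort = port == -1
                || (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);

        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        } else if (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1); // treat /b/ and /b as the same page
        }

        String query = uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery();
        // The fragment (#...) is dropped because it does not change the fetched document.
        return scheme + "://" + host + (defaultPort ? "" : ":" + port) + path + query;
    }
}
```

Under these assumptions, `HTTP://Example.com:80/a/../b/#top` and `http://example.com/b` map to the same key, so the crawler would store the page only once.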
- TextIndexer: Processes web page content, tokenizes text, removes stop words, and creates an inverted index
- ImageIndexer: Extracts image features using DinoV2 model and stores vector representations
- Handles user queries and routes them to the appropriate ranker
- Supports suggestion generation for autocomplete
- Handles pagination of results
- TokenBasedRanker: Ranks results for keyword searches using TF-IDF and popularity
- PhraseBasedRanker: Specialized ranking for phrase searches with boolean operators
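
As a rough illustration of how a keyword ranker might combine TF-IDF with popularity, consider the sketch below. The exact TF-IDF variant and the linear blend with a `popularityWeight` parameter are assumptions made for the example; the source only states that TF-IDF and popularity are used, and `TokenScoreSketch` is a hypothetical name.

```java
/** Hypothetical scoring helper, not the actual TokenBasedRanker. */
final class TokenScoreSketch {

    /**
     * Classic TF-IDF for one term in one document:
     * tf = termCount / docLength, idf = ln(totalDocs / docsWithTerm).
     */
    static double tfIdf(int termCount, int docLength, int totalDocs, int docsWithTerm) {
        if (docLength <= 0 || docsWithTerm <= 0) {
            return 0.0; // term or document carries no usable statistics
        }
        double tf = (double) termCount / docLength;
        double idf = Math.log((double) totalDocs / docsWithTerm);
        return tf * idf;
    }

    /**
     * Blends summed TF-IDF relevance with a popularity score such as PageRank.
     * popularityWeight is an assumed tuning parameter in [0, 1].
     */
    static double score(double tfIdfSum, double pageRank, double popularityWeight) {
        return (1.0 - popularityWeight) * tfIdfSum + popularityWeight * pageRank;
    }
}
```

A term that occurs in every document gets an IDF of zero in this variant, so it contributes nothing to relevance and only popularity separates the candidates.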
- Uses MongoDB for document and image storage
- Separate collections for documents, tokens, images, and queries
- Vector search capabilities using MongoDB Atlas
- React-based user interface
- Real-time search suggestions
- Responsive design for various devices
- Support for both text and image search interfaces
- Backend: Java
- Frontend: React, TailwindCSS
- Database: MongoDB
- Machine Learning: ONNX Runtime, DinoV2 model
- Text Processing: OpenNLP TokenizerME, Porter Stemmer
- Web Crawling: JSoup
- Build Tool: Gradle
- User inputs a query like "Champions League final highlights"
- Query processor analyzes the query to determine if it's a keyword search, phrase search, or boolean search
- Tokenization and stemming are applied to the query
- Candidate documents are retrieved from the inverted index
- Results are ranked based on term frequency, document popularity (PageRank), and other relevance factors
- Snippets are generated highlighting query terms in context
- Results are returned to the user interface
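
The snippet-generation step in the text search flow above can be sketched as follows. This is a simplified illustration: it matches lowercased words rather than stems, uses a fixed word window, and marks hits with `<b>` tags, all of which are assumptions; `SnippetSketch` is a hypothetical name.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical snippet builder that highlights query terms in context. */
final class SnippetSketch {

    /** Returns up to `radius` words on each side of the first matching word. */
    static String snippet(String text, Set<String> queryTerms, int radius) {
        String[] words = text.trim().split("\\s+");

        // Locate the first word that matches a query term; fall back to the start.
        int hit = 0;
        for (int i = 0; i < words.length; i++) {
            if (queryTerms.contains(normalize(words[i]))) {
                hit = i;
                break;
            }
        }

        int from = Math.max(0, hit - radius);
        int to = Math.min(words.length, hit + radius + 1);
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < to; i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            boolean match = queryTerms.contains(normalize(words[i]));
            sb.append(match ? "<b>" + words[i] + "</b>" : words[i]);
        }
        return sb.toString();
    }

    /** Lowercases a word and strips punctuation so "Final," matches "final". */
    private static String normalize(String word) {
        return word.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]", "");
    }
}
```

In a pipeline that stems queries, the same normalization and stemming would have to be applied to each word before the comparison.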
- User uploads an image of a football moment through the interface
- Image features are extracted using the DinoV2 ONNX model
- The feature vector is compared against the database of indexed images using vector similarity
- Similar football images are ranked by cosine similarity
- Results are returned to the user interface with source documents
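
The comparison and ranking steps of the image search flow above come down to cosine similarity between feature vectors. The sketch below shows a brute-force version for clarity; it assumes features are plain `float[]` arrays and is not how MongoDB Atlas performs vector search internally. `CosineSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical brute-force similarity ranking over image feature vectors. */
final class CosineSketch {

    /** Scales a vector to unit length; a zero vector is returned unchanged as zeros. */
    static float[] l2Normalize(float[] v) {
        double sumSquares = 0.0;
        for (float x : v) {
            sumSquares += x * x;
        }
        double norm = Math.sqrt(sumSquares);
        float[] out = new float[v.length];
        if (norm == 0.0) {
            return out;
        }
        for (int i = 0; i < v.length; i++) {
            out[i] = (float) (v[i] / norm);
        }
        return out;
    }

    /** Cosine similarity: dot(a, b) / (|a| * |b|), or 0 if either vector is zero. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns indices of the indexed vectors, most similar to the query first. */
    static List<Integer> rank(float[] query, List<float[]> indexed) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < indexed.size(); i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingDouble((Integer i) -> -cosine(query, indexed.get(i))));
        return order;
    }
}
```

When vectors are L2-normalized at indexing time, cosine similarity reduces to a dot product, which is one reason normalization is commonly applied before storage.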
- Crawler collects web pages and their images from seed URLs
- TextIndexer processes textual content:
  - Tokenization and stemming
  - Removal of stop words
  - Creation of inverted index with position information
- ImageIndexer processes images:
  - Feature extraction with DinoV2
  - Vector normalization
  - Storage in MongoDB with vector indexing
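
To make "inverted index with position information" concrete, here is an in-memory sketch. It assumes a tiny hard-coded stop-word list and a regex tokenizer, and it omits stemming; the real TextIndexer uses OpenNLP and a Porter stemmer and persists to MongoDB. `PositionalIndexSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory positional inverted index. */
final class PositionalIndexSketch {
    // Assumed stop-word list, far shorter than a production one.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "in", "and", "is", "to"));

    // term -> (document id -> word positions of the term in that document)
    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    void addDocument(String docId, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        int position = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            // Stop words still consume a position so phrase distances stay intact.
            int current = position++;
            if (STOP_WORDS.contains(token)) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(current);
        }
    }

    /** Positions of a term in a document, or an empty list if it does not occur. */
    List<Integer> positions(String term, String docId) {
        Map<String, List<Integer>> byDoc = postings.get(term);
        if (byDoc == null) {
            return Collections.emptyList();
        }
        List<Integer> found = byDoc.get(docId);
        return found == null ? Collections.emptyList() : found;
    }
}
```

Storing positions is what makes exact-phrase search possible: a phrase matches when its terms occur at consecutive positions in the same document.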
- Graph representation of web pages and their links
- Iterative calculation of importance scores
- Integration of scores into the document ranking process
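
The iterative calculation above can be sketched with the standard power-iteration form of PageRank. The adjacency-array representation, the damping factor passed as a parameter, and the fixed iteration count are assumptions for illustration; `PageRankSketch` is a hypothetical name and not the project's actual implementation.

```java
import java.util.Arrays;

/** Hypothetical power-iteration PageRank over a small link graph. */
final class PageRankSketch {

    /**
     * @param outLinks   outLinks[p] lists the pages that page p links to
     * @param damping    probability of following a link (0.85 is a common choice)
     * @param iterations number of update rounds
     * @return one importance score per page; the scores sum to 1
     */
    static double[] pageRank(int[][] outLinks, double damping, int iterations) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // rank held by pages without outgoing links

            for (int p = 0; p < n; p++) {
                if (outLinks[p].length == 0) {
                    dangling += rank[p];
                    continue;
                }
                double share = rank[p] / outLinks[p].length;
                for (int q : outLinks[p]) {
                    next[q] += share; // each link passes on an equal share
                }
            }

            for (int p = 0; p < n; p++) {
                // Random jump term plus damped link score; dangling rank is spread evenly.
                next[p] = (1.0 - damping) / n + damping * (next[p] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }
}
```

For a three-page graph where pages 1 and 2 both link to page 0 and page 0 links back to page 1, page 0 ends up with the highest score and page 2, which has no inbound links, with the lowest.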






