- Designed and developed a search system using Python and spaCy that processes a user-given query within 1 s
- Scaled the project with SAX parsing and multi-level indexing. Reduced page retrieval time by 40% with binary search
- Built a search module that retrieves and ranks the top-10 relevant wiki titles using a vector space model with a TF-IDF weighting scheme
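
The ranking step can be illustrated with a minimal sketch. It assumes log-scaled term frequency, a plain `log(N / df)` inverse document frequency, and cosine similarity; the project's actual weighting variant is not stated here, and the names `tfidf_vectors`, `cosine`, and `search` as well as the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each title to a {term: weight} vector and return the idf table."""
    n_docs = len(docs)
    term_counts = {title: Counter(tokens) for title, tokens in docs.items()}
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = {
        title: {term: (1 + math.log(tf)) * idf[term] for term, tf in counts.items()}
        for title, counts in term_counts.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_tokens, vectors, idf, k=10):
    """Return up to k titles with non-zero similarity, best match first."""
    counts = Counter(t for t in query_tokens if t in idf)
    query_vec = {t: (1 + math.log(tf)) * idf[t] for t, tf in counts.items()}
    scored = [(cosine(query_vec, vec), title) for title, vec in vectors.items()]
    scored = [(score, title) for score, title in scored if score > 0]
    return [title for score, title in sorted(scored, key=lambda p: -p[0])[:k]]


# Toy corpus (hypothetical); real input would be tokenized wiki pages.
docs = {
    "Binary search": "binary search algorithm sorted array".split(),
    "Inverted index": "inverted index maps terms to documents".split(),
    "Sorting algorithm": "sorting algorithm orders array elements".split(),
}
vectors, idf = tfidf_vectors(docs)
results = search("binary search".split(), vectors, idf)
```

This sketch scores every document for clarity. A system at the scale of a full wiki dump would more likely read only the posting lists of the query terms from the on-disk index.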
- Python >=3.7.3
- NLTK (`pip install --user -U nltk`)
- spaCy and pre-trained models for English Language (`pip install -U spacy && pip install -U spacy-lookups-data && python -m spacy download en_core_web_sm`)
- Sorted Containers (`pip install sortedcontainers`)
- FTFY (`pip install ftfy`)
- Unidecode (`pip install Unidecode`)
- Dill (`pip install dill`)
- `index`: Global index created by running `index.py`
- `index.py`: Index creation code, which takes a Wikipedia dump (e.g. `./enwiki_sample`) and an index folder (e.g. `./index`)
- `search.py`: Search operation, which takes an index folder (e.g. `./index`), a query file (e.g. `./queryfile`) and an output file (e.g. `./resultfile`)
- `enwiki_sample`: Wiki dump sample (100 MB)
- `queryfile`: Input query file
- `resultfile`: Search output file
- Index files for original wiki dump (75 GB) can be downloaded: https://iiitaphyd-my.sharepoint.com/:f:/g/personal/priyansh_agrawal_research_iiit_ac_in/EgpyrelWNKBFsysv_TMF_Y8BG34ltJgL6Vz-iJ4_97E53g?e=LTMwGu.
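
Since indexing relies on SAX parsing to handle large dumps, the following is a minimal sketch of how a dump might be streamed page by page without loading it into memory. It uses Python's standard `xml.sax` module and assumes the usual `<page>`, `<title>`, `<text>` element layout of a MediaWiki export; the `PageHandler` class and `on_page` callback are hypothetical names, not taken from `index.py`.

```python
import io
import xml.sax


class PageHandler(xml.sax.ContentHandler):
    """Collect (title, text) pairs from <page> elements as they stream by."""

    def __init__(self, on_page):
        super().__init__()
        self.on_page = on_page  # callback invoked once per completed page
        self.current = None     # name of the field being read, if any
        self.fields = {}

    def startElement(self, name, attrs):
        if name == "page":
            self.fields = {"title": [], "text": []}
        elif name in self.fields:
            self.current = name

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate them.
        if self.current:
            self.fields[self.current].append(content)

    def endElement(self, name):
        if name == self.current:
            self.current = None
        elif name == "page":
            self.on_page("".join(self.fields["title"]), "".join(self.fields["text"]))
            self.fields = {}


# Tiny in-memory stand-in for a dump file (hypothetical content).
sample = b"""<mediawiki>
  <page><title>Binary search</title><revision><text>Finds a key in a sorted array.</text></revision></page>
  <page><title>Inverted index</title><revision><text>Maps terms to documents.</text></revision></page>
</mediawiki>"""

pages = []
xml.sax.parse(io.BytesIO(sample), PageHandler(lambda title, text: pages.append((title, text))))
```

For a real dump, the `io.BytesIO` object would be replaced by the path to the dump file, and the callback would tokenize each page and update the index instead of appending to a list.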