This project is a basic web search engine created as part of the DSAI 301 - Introduction to Python Programming course at Bogazici University. It demonstrates key concepts in web crawling, indexing, and ranking, with Python as the programming language.
Special thanks to Asst. Prof. Dr. Huseyin Oktay Altun for his teaching.
- Web Crawling:
  - Starts with a seed URL and recursively visits pages to build a list of links.
  - Extracts and cleans the content of visited pages.
- Indexing:
  - Builds a searchable index of keywords from the crawled pages.
- Page Ranking:
  - Implements a simple ranking algorithm using the concept of in-links and out-links.
- Search Functionality:
  - Allows keyword searches with or without ranking.
The crawlWeb function begins with a seed URL and uses the following steps:
- Fetches page content using getPage.
- Extracts links using get_all_links.
- Recursively visits links and avoids duplicates.
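The crawling steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the names fetch_page, extract_links, crawl_sketch, and the FAKE_WEB dictionary are hypothetical stand-ins, and the sketch uses an explicit stack instead of recursion so it runs without network access.

```python
import re

# Hypothetical in-memory "web" so the sketch runs offline (assumption).
FAKE_WEB = {
    "a": '<a href="b">B</a> <a href="c">C</a>',
    "b": '<a href="a">A</a>',
    "c": '<a href="b">B</a> <a href="c">self</a>',
}

def fetch_page(url):
    """Stand-in for getPage: return the page's HTML, or '' if unknown."""
    return FAKE_WEB.get(url, "")

def extract_links(html):
    """Stand-in for get_all_links: collect href targets in document order."""
    return re.findall(r'href="([^"]+)"', html)

def crawl_sketch(seed_url, max_pages=100):
    """Visit pages reachable from seed_url, skipping ones already seen."""
    to_crawl = [seed_url]   # pages waiting to be visited
    crawled = []            # pages already visited, in visit order
    graph = {}              # page -> list of pages it links to
    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.pop()
        if url in crawled:
            continue
        links = extract_links(fetch_page(url))
        graph[url] = links
        crawled.append(url)
        for link in links:
            # Avoid duplicates: queue a link only if it is new.
            if link not in crawled and link not in to_crawl:
                to_crawl.append(link)
    return crawled, graph
```

The max_pages cap is an added safeguard for illustration; a crawl of the real web needs some limit like it to terminate.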
- The content is cleaned using getclearpage.
- Keywords are added to the index using add_to_index and addPageToIndex.
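As a rough sketch of the cleaning and indexing steps, assuming the index is a dictionary that maps each keyword to a list of URLs (the notebook's actual data structure may differ). The names clean_page, add_keyword, and add_page are hypothetical and only mirror the roles of getclearpage, add_to_index, and addPageToIndex.

```python
import re

def clean_page(html):
    """Strip tags, lowercase the text, and return its alphanumeric words."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z0-9]+", text.lower())

def add_keyword(index, keyword, url):
    """Record that keyword appears on url, without duplicate entries."""
    urls = index.setdefault(keyword, [])
    if url not in urls:
        urls.append(url)

def add_page(index, url, html):
    """Index every word found in the cleaned page content."""
    for word in clean_page(html):
        add_keyword(index, word, url)
```

For example, indexing `<p>Hello world</p>` under the URL `u1` would leave the index with the keys `hello` and `world`, each pointing to `["u1"]`.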
- A graph of interconnections between pages is generated.
- Page ranks are computed using a damping factor.
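One common way to compute such ranks is an iterative PageRank-style update over the link graph. The sketch below assumes the graph is a dictionary mapping each page to its list of out-links; the function name rank_sketch, the damping value of 0.8, and the iteration count of 10 are illustrative assumptions, not values taken from the notebook.

```python
def rank_sketch(graph, damping=0.8, iterations=10):
    """Iteratively estimate page ranks from a page -> out-links graph."""
    n = len(graph)
    # Start with rank spread evenly over all pages.
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_ranks = {}
        for page in graph:
            # Baseline share every page receives regardless of links.
            rank = (1 - damping) / n
            # Add a share from each page that links here (in-links),
            # divided by how many out-links that page has.
            for other, out_links in graph.items():
                if page in out_links:
                    rank += damping * ranks[other] / len(out_links)
            new_ranks[page] = rank
        ranks = new_ranks
    return ranks
```

With this update, a page gains rank when many pages link to it, and a link counts for more when it comes from a page that itself ranks highly and has few out-links.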
- The lookup function searches the index for keywords and ranks results using computeRanks.
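A search with optional ranking might look like the sketch below. It is deliberately simplified: the hypothetical lookup_sketch takes a precomputed dictionary of ranks, whereas the project's lookup receives the graph and the computeRanks function. The index is assumed to map keywords to lists of URLs.

```python
def lookup_sketch(index, keyword, ranks=None):
    """Return URLs containing keyword, ordered by rank when ranks are given."""
    urls = index.get(keyword, [])
    if ranks is None:
        # Unranked search: keep the order in which pages were indexed.
        return list(urls)
    # Ranked search: highest-ranked pages first; unknown pages count as 0.
    return sorted(urls, key=lambda url: ranks.get(url, 0.0), reverse=True)
```

Returning an empty list for a keyword that is not in the index keeps the caller from having to handle a missing-key error.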
To run this project:
- Open the MyFirstWebSearchEngine.ipynb file in Google Colab.
- Modify the seed_url variable to your desired starting point.
- Run all cells sequentially to crawl, index, rank, and search.
Example:
seed_url = "https://example.com"
index, graph = crawlWeb(seed_url)
lookup(index, "keyword", graph, computeRanks)