Code submitted by : SAUMYA GUPTA | CS6200 | 002299840
This assignment involves four tasks:
- Task 1 - Crawler
- Task 2 - Link Graph
- Task 3 - Merging team indexes
- Task 4 - Vertical Search (MS students only)
This homework combines individual responsibility with team collaboration. While you will work closely with your team members, your individual effort in developing the crawler, collecting documents, and following the politeness policy is crucial for the project's success.
Homework 3: Crawling, Merging, Vertical Search
In this assignment, you will work with a team to create a vertical search engine using Elasticsearch. Please read these instructions carefully: although you are working with teammates, you will be graded individually for most of the assignment.
You will write a web crawler, and crawl Internet documents to construct a document collection focused on a particular topic. Your crawler must conform strictly to a particular politeness policy. Once the documents are crawled, you will pool them together.
Form a team of three students with your classmates. Your team will be assigned a single query with a few associated seed URLs. Each of you will crawl web pages starting from a set of common seeds plus your own individual seed URLs. When you have each collected your individual documents, you will pool them together, index them, and implement search.
Each individual is responsible for writing their own crawler and crawling from their own seed URLs.
- Set up Elasticsearch with your teammates to have the same cluster name and index name
- Your crawler will manage a frontier of URLs to be crawled, initially containing just your seed URLs
- Crawl at least 30,000 documents (10,000 for undergrads), starting from seed URLs
- Choose the next URL using a best-first strategy (see Frontier Management below)
- Your crawler must strictly conform to the politeness policy detailed below
- Only crawl HTML documents
- Find all outgoing links, canonicalize them, and add new ones to the frontier
- For each page, store the following fields in Elasticsearch:
  - ID
  - URL
  - HTTP headers
  - Page contents cleaned (with term positions)
  - Raw HTML
  - List of all in-links and out-links
Once crawling is done, merge indexes with teammates so all members end up with the merged index.
Your crawler must strictly observe this policy at all times, including during development and testing.
- Make no more than one HTTP request per second from any given domain
- You may crawl multiple pages from different domains simultaneously
- Before crawling any domain, fetch its robots.txt file and strictly obey it
- Use a third-party library to parse robots.txt
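The policy above can be sketched as a small gate that remembers when each host was last contacted, plus a robots.txt check. This is a minimal sketch, not a required design: the names `PolitenessGate` and `allowed_by_robots` are hypothetical, the host name stands in for "domain", and the standard library's `urllib.robotparser` is used only to keep the example self-contained (substitute whichever robots.txt library you choose).

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Enforces a minimum delay between requests to the same host."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.clock = clock      # injectable so the gate can be tested without waiting
        self.sleep = sleep
        self.last_request = {}  # host -> time of the most recent request

    def wait_for(self, url):
        """Block until the host of `url` may be contacted; return seconds waited."""
        host = urlsplit(url).netloc.lower()
        now = self.clock()
        last = self.last_request.get(host)
        waited = 0.0
        if last is not None:
            waited = max(0.0, self.min_delay - (now - last))
        if waited > 0:
            self.sleep(waited)
        # Record the moment the request is actually allowed to go out.
        self.last_request[host] = now + waited
        return waited


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In a crawler loop, the robots.txt text would be fetched once per host and cached, and `wait_for(url)` would be called immediately before every request, including the robots.txt request itself. A multi-threaded crawler would also need a lock around `wait_for`.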
The frontier stores pages to be crawled, including the canonicalized URL and in-link count.
Prioritization criteria:
- Seed URLs crawled first
- Must use BFS wave number as baseline graph traversal
- Prefer pages with higher in-link counts
- Prefer URLs with matching keywords in link or anchor text
- Prefer URLs extracted from a relevant page
- Prefer certain domains
- Prefer recent URLs
- If multiple pages have maximal in-link counts, choose the one that has been in the queue longest
- If the next page is at a recently crawled domain, crawl from a different domain instead
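One way to combine a few of these criteria is a heap keyed on (wave number, negated in-link count, arrival order). The sketch below is an assumption about how the frontier could be built, not the required design: the class name `Frontier` is hypothetical, it covers only the wave, in-link, and time-in-queue rules, and it leaves out keyword, relevance, and domain preferences. Stale heap entries are skipped lazily when a URL's in-link count has changed since it was pushed.

```python
import heapq
import itertools


class Frontier:
    """Best-first frontier: lowest wave first, then most in-links, then oldest."""

    def __init__(self):
        self._heap = []                    # entries: (wave, -inlinks, seq, url)
        self._live = {}                    # url -> (wave, inlinks, seq)
        self._counter = itertools.count()  # arrival order, used as the tie-breaker

    def add(self, url, wave):
        """Queue a URL, or count one more in-link if it is already queued."""
        if url in self._live:
            old_wave, inlinks, seq = self._live[url]
            entry = (min(old_wave, wave), inlinks + 1, seq)
        else:
            entry = (wave, 1, next(self._counter))
        self._live[url] = entry
        heapq.heappush(self._heap, (entry[0], -entry[1], entry[2], url))

    def pop(self):
        """Remove and return the best URL, ignoring outdated heap entries."""
        while self._heap:
            wave, neg_inlinks, seq, url = heapq.heappop(self._heap)
            if self._live.get(url) == (wave, -neg_inlinks, seq):
                del self._live[url]
                return url
        raise IndexError("pop from an empty frontier")

    def __len__(self):
        return len(self._live)
```

Seeds would be added with wave 0 so they come out first. A real crawler would also keep a set of already crawled URLs so that a popped URL is not queued again, and would defer a popped URL whose domain was contacted too recently.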
Apply the following rules to all URLs encountered:
- Convert scheme and host to lowercase
- Remove port 80 from HTTP URLs and port 443 from HTTPS URLs
- Make relative URLs absolute
- Remove fragments beginning with `#`
- Remove duplicate slashes
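The canonicalization rules above can be sketched with the standard library's `urllib.parse`. The function name `canonicalize` and its `base` parameter are hypothetical, and the sketch makes simplifying assumptions: duplicate slashes are collapsed only in the path, and any user-info part of the URL is dropped.

```python
import re
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url, base=None):
    """Apply the canonicalization rules to `url`, resolving it against `base` if given."""
    if base is not None:
        url = urljoin(base, url)            # make relative URLs absolute
    parts = urlsplit(url)
    scheme = parts.scheme.lower()           # lowercase the scheme
    host = (parts.hostname or "").lower()   # lowercase the host
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{port}"             # keep only non-default ports
    path = re.sub(r"/{2,}", "/", parts.path)  # collapse duplicate slashes
    # An empty last component drops the fragment.
    return urlunsplit((scheme, host, path, parts.query, ""))
```

For example, `canonicalize("../c.html", base="http://www.example.com/a/b.html")` resolves the relative link against the page it was found on before the other rules are applied.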
Once a page is downloaded:
- Extract all links in `<a>` tags, canonicalize them, and update the frontier
- Extract the document text, stripped of HTML, JavaScript, and CSS
- Store entire HTTP response separately
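The extraction steps above can be sketched with the standard library's `html.parser`. This is an illustrative sketch only: `PageParser` and `parse_page` are hypothetical names, and a production crawler may prefer a more forgiving HTML library. Links are made absolute here; full canonicalization would be applied afterwards.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects <a href> targets, the <title>, and visible text."""

    SKIPPED = {"script", "style"}  # content of these tags is not document text

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []
        self.title_parts = []
        self.text_parts = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # inside <script> or <style>
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.text_parts.append(data)


def parse_page(html, base_url):
    """Return (title, text, absolute_links) for one downloaded page."""
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    title = " ".join(" ".join(parser.title_parts).split())
    text = " ".join(" ".join(parser.text_parts).split())
    return title, text, parser.links
```

The returned title, text, and links correspond to the `<HEAD>`, `<TEXT>`, and `<OUTLINK>` parts of a stored document.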
Document format:
<DOC>
<DOCNO>http://www.example1.com/something.html</DOCNO>
<HEAD>The page title</HEAD>
<TEXT>The body text from the document</TEXT>
<OUTLINK>http://www.example2.com/something.html</OUTLINK>
</DOC>

Write a link graph reporting all out-links and in-links from each crawled URL.
Option 1: Store canonical links as inlinks and outlinks fields in Elasticsearch per document
Option 2: Maintain a separate links file where each line is a tab-separated list: first URL is the crawled document, remaining URLs are out-links
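For Option 2, the tab-separated layout and the derivation of in-links from out-links can be sketched as follows. The function names `format_links`, `parse_links`, and `invert` are hypothetical, and the sketch assumes URLs are already canonicalized and contain no tab characters.

```python
from collections import defaultdict


def format_links(outlinks):
    """Serialize {url: [out-links]} as one tab-separated line per crawled URL."""
    lines = []
    for url in sorted(outlinks):
        lines.append("\t".join([url, *outlinks[url]]))
    return "".join(line + "\n" for line in lines)


def parse_links(text):
    """Parse the tab-separated format back into {url: [out-links]}."""
    outlinks = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0]:
            outlinks[fields[0]] = fields[1:]
    return outlinks


def invert(outlinks):
    """Derive {url: [in-links]} from the out-link map."""
    inlinks = defaultdict(set)
    for source, targets in outlinks.items():
        for target in targets:
            inlinks[target].add(source)
    return {url: sorted(sources) for url, sources in inlinks.items()}
```

Only out-links need to be written during the crawl. In-links can be computed in one pass with `invert` once crawling has finished.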
- Merge individual crawls into one Elasticsearch index
- Merging must happen as independent agents, not in a master-slave manner
- All team members must be connected and run merging code simultaneously
- Add all 90,000 documents to an Elasticsearch index using canonical URL as document ID
- Create a simple HTML page to run queries against the index
- Result list must contain at minimum the URL of the crawled page
- Run several queries and evaluate result quality
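Because the canonical URL is the document ID, pages crawled by more than one teammate collapse into a single entry when the crawls are merged. The sketch below shows one possible merge rule on plain dictionaries, without any Elasticsearch calls: link lists are unioned and other fields keep the first non-empty value. The function names and the field names `inlinks`, `outlinks`, and `text` are assumptions for illustration.

```python
LINK_FIELDS = ("inlinks", "outlinks")


def merge_documents(existing, incoming):
    """Merge two versions of the same URL's document into one."""
    if existing is None:
        merged = dict(incoming)
    else:
        merged = dict(existing)
        for key, value in incoming.items():
            # Fill in non-link fields only where the existing copy has nothing.
            if key not in LINK_FIELDS and not merged.get(key):
                merged[key] = value
    for field in LINK_FIELDS:
        old = existing.get(field, []) if existing else []
        merged[field] = sorted(set(old) | set(incoming.get(field, [])))
    return merged


def merge_crawls(*crawls):
    """Merge several {canonical_url: document} maps into one index-like map."""
    index = {}
    for crawl in crawls:
        for url, doc in crawl.items():
            index[url] = merge_documents(index.get(url), doc)
    return index
```

Taking the union of the link lists makes the result independent of the order in which teammates contribute the same page, which suits a merge run by independent agents. Against a live index, the same rule would have to be applied as a read-modify-write or scripted update per document.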
| # | Task | Description |
|---|---|---|
| EC1 | Crawl more documents | Expand team crawl to 180,000 documents |
| EC2 | Crawl into merged index | Crawl directly into a distributed ES index dynamically |
| EC3 | Frontier Management | Experiment with different URL selection techniques |
| EC4 | Speed Improvements | Optimize crawler speed without violating politeness policy |
| EC5 | Search Interface | Improve search UI with snippets, layout, or custom operators |