Web Scraper is a command-line interface (CLI) tool for analyzing websites for SEO purposes. This project demonstrates the implementation of a concurrent web crawler using TypeScript and Node.js, capable of traversing web pages, extracting key information, and generating structured reports.
- Language: TypeScript
- Runtime: Node.js
- Testing: Vitest
- Tooling: npm
- Key Libraries: `jsdom` (HTML parsing), `p-limit` (concurrency control)
- Concurrent Crawling: Efficiently traverses multiple pages simultaneously with a configurable concurrency limit to respect server load.
- Data Extraction: Automatically extracts key SEO metrics from each visited page:
- Links: Collects all outgoing links to build a map of the site structure.
- Images: Scrapes all image URLs.
- Content: Captures the main heading (`<h1>`) and the first paragraph for content analysis.
- Domain Containment: Strictly follows links within the same domain as the starting URL, preventing the crawler from wandering off to external sites.
- CSV Reporting: Generates a detailed `report.csv` file containing the scraped data for easy analysis in spreadsheet software.
- Configurable: Simple CLI arguments to control the starting URL, maximum concurrency, and maximum pages to crawl.
Building this Web Scraper provided valuable experience in:
- Concurrency Patterns: Implementing `p-limit` to manage concurrent asynchronous operations, ensuring efficient crawling without overwhelming the target server or the local machine.
- Recursion & Graph Traversal: Designing algorithms to traverse the web graph (pages and links), handling cycles (visited pages), and managing the depth/breadth of the crawl.
- DOM Manipulation: Utilizing `jsdom` to parse raw HTML strings server-side and interact with the DOM API to query elements and attributes.
- Data Normalization: Developing robust URL normalization logic to handle relative paths, trailing slashes, and different URL formats to avoid duplicate work.
- CLI Development: Creating a user-friendly command-line interface that accepts and validates arguments.
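The core of the `p-limit` pattern is that `limit(fn)` returns a promise while guaranteeing at most N wrapped functions run at once. The behavior can be illustrated with a minimal stdlib-only limiter (the project itself uses the `p-limit` package, which exposes the same callable shape):

```typescript
// Minimal sketch of the p-limit pattern: at most `concurrency` wrapped
// async functions execute simultaneously; the rest wait in a FIFO queue.
function createLimit(concurrency: number) {
  let active = 0;
  const queue: Array<() => void> = [];

  const next = () => {
    active--;
    queue.shift()?.(); // start the oldest waiting task, if any
  };

  return function limit<T>(fn: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const run = () => {
        active++;
        fn().then(resolve, reject).finally(next);
      };
      if (active < concurrency) run();
      else queue.push(run);
    });
  };
}
```

Because each call still returns a promise immediately, the crawler can fire off `limit(() => crawlPage(url))` for every discovered link and simply `Promise.all` the results: the cap throttles the fetches without any explicit batching.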
- Node.js (v20+ recommended)
- npm
1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/web-scraper.git
   cd web-scraper
   ```

2. Install dependencies:

   ```bash
   npm install
   ```
To start the crawler, use `npm start` followed by the required arguments:

```bash
npm start -- <baseURL> <maxConcurrency> <maxPages>
```

- `baseURL`: The starting URL for the crawler (e.g., `https://example.com`).
- `maxConcurrency`: The maximum number of concurrent requests (e.g., `5`).
- `maxPages`: The maximum number of pages to crawl (e.g., `100`).
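Validating these arguments up front keeps a bad invocation from starting a crawl. A hypothetical sketch of the parsing step (the project's real entry point may differ):

```typescript
// Hypothetical CLI argument parsing/validation; `argv` is process.argv.slice(2),
// i.e. everything after `npm start --`.
interface CrawlConfig {
  baseURL: string;
  maxConcurrency: number;
  maxPages: number;
}

function parseArgs(argv: string[]): CrawlConfig {
  if (argv.length !== 3) {
    throw new Error("usage: npm start -- <baseURL> <maxConcurrency> <maxPages>");
  }
  const [baseURL, concurrencyRaw, pagesRaw] = argv;

  new URL(baseURL); // throws on an invalid URL

  const maxConcurrency = Number(concurrencyRaw);
  const maxPages = Number(pagesRaw);
  if (!Number.isInteger(maxConcurrency) || maxConcurrency < 1) {
    throw new Error("maxConcurrency must be a positive integer");
  }
  if (!Number.isInteger(maxPages) || maxPages < 1) {
    throw new Error("maxPages must be a positive integer");
  }
  return { baseURL, maxConcurrency, maxPages };
}
```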
Example:
```bash
npm start -- https://wagslane.dev 3 10
```

This command will:

- Start crawling at `https://wagslane.dev`.
- Use a maximum of 3 concurrent requests.
- Crawl up to 10 unique pages.
- Generate a `report.csv` file in the current directory.
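Counting "unique" pages hinges on URL normalization, so that `https://wagslane.dev/path/` and `http://wagslane.dev/path` land on the same visited-set key. A minimal sketch, with illustrative rules that may differ from the project's own logic:

```typescript
// Collapse scheme differences and trailing slashes into one canonical key
// for the visited set. Rules here are illustrative, not the project's exact ones.
function normalizeURL(input: string): string {
  const url = new URL(input);
  let path = url.pathname;
  if (path.endsWith("/")) path = path.slice(0, -1); // drop trailing slash
  return `${url.hostname}${path}`; // scheme and fragment are ignored
}
```

Comparing `url.hostname` of a candidate link against the hostname of the starting URL is also all the domain-containment check needs.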
To run the test suite:

```bash
npm test
```

The tool generates a `report.csv` file with the following columns:

- `page_url`: The normalized URL of the visited page.
- `h1`: The text content of the first `<h1>` tag found.
- `first_paragraph`: The text content of the first `<p>` tag found (prioritizing those inside `<main>`).
- `outgoing_link_urls`: A semicolon-separated list of all links found on the page.
- `image_urls`: A semicolon-separated list of all image sources found on the page.
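Assembling those columns into CSV mostly comes down to quoting: fields are wrapped in double quotes with embedded quotes doubled (RFC 4180 style), so commas in headings or paragraphs don't break the row. A sketch with an illustrative helper, not the project's actual writer:

```typescript
// One scraped page, with field names matching the report columns above.
interface PageRecord {
  page_url: string;
  h1: string;
  first_paragraph: string;
  outgoing_link_urls: string[];
  image_urls: string[];
}

function toCSV(records: PageRecord[]): string {
  // RFC 4180-style escaping: wrap in quotes, double any embedded quotes.
  const quote = (s: string) => `"${s.replace(/"/g, '""')}"`;
  const header = "page_url,h1,first_paragraph,outgoing_link_urls,image_urls";
  const rows = records.map((r) =>
    [
      r.page_url,
      r.h1,
      r.first_paragraph,
      r.outgoing_link_urls.join(";"), // lists flattened with semicolons
      r.image_urls.join(";"),
    ]
      .map(quote)
      .join(","),
  );
  return [header, ...rows].join("\n");
}
```

Semicolons work as the inner list separator precisely because commas are reserved for the column separator and quoting already protects free text.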
This project was built as part of the backend engineering curriculum at Boot.dev.