The E-commerce Product URL Crawler is a web scraping tool designed to discover and list all product URLs across multiple e-commerce websites. This tool is built to handle variations in URL structures and can scale to process hundreds of domains efficiently.
- URL Discovery: Intelligently identifies product pages using regex patterns.
- Scalability: Modular design and parallel processing for handling large websites.
- Performance: Optimized for efficient scraping with configurable settings.
- Robustness: Handles edge cases and variations in URL structures.
- Python 3.x
- Selenium
- BeautifulSoup
- ChromeDriver
- Clone the Repository:
    git clone https://github.com/akhil298/Web-Crawler.git
- Install Dependencies:
    pip install -r requirements.txt
Configure the crawler settings in config/config.py:
- SHOW_BROWSER: Set to True to show the browser during scraping.
- MAX_SCROLLS: Maximum number of scrolls per page.
- MAX_WORKERS: Number of parallel workers for scraping.
- SCROLL_PAUSE_TIME: Time to wait between scrolls.
- WAIT_TIME: Initial wait time after loading a page.
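As a rough illustration, a config/config.py carrying these five settings might look like the sketch below. The values are assumptions chosen for the example, not the project's actual defaults, and treating the two time settings as seconds is also an assumption.

```python
# config/config.py (illustrative values only; the project's real defaults may differ)

SHOW_BROWSER = False      # set to True to show the browser window during scraping
MAX_SCROLLS = 10          # maximum number of scrolls per page (assumed value)
MAX_WORKERS = 4           # number of parallel workers for scraping (assumed value)
SCROLL_PAUSE_TIME = 2.0   # time to wait between scrolls, assumed to be in seconds
WAIT_TIME = 5.0           # initial wait after loading a page, assumed to be in seconds
```

Lower values for MAX_SCROLLS and the two wait times make a run faster but risk missing content that loads late; raising MAX_WORKERS helps mainly when many domains are processed in one run.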
Run the crawler with the following command:
    python main.py
The crawler will process the specified domains and save the discovered product URLs to JSON files.
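To make the run step more concrete, here is a minimal sketch of how domains could be processed in parallel and written to one JSON file per domain. It is not the project's main.py: `crawl_domain`, `save_urls`, `run`, the output directory, and the JSON layout are all hypothetical names and choices made for this example, and the crawl itself is replaced by a placeholder.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

MAX_WORKERS = 4  # assumed value; the real setting lives in config/config.py


def crawl_domain(domain):
    """Hypothetical placeholder for the Selenium-based crawl of one domain."""
    return [f"https://{domain}/product/example-item"]


def save_urls(domain, urls, out_dir):
    """Write the URLs for one domain to a JSON file and return the file path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{domain.replace('.', '_')}.json"
    payload = {"domain": domain, "product_urls": urls}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def run(domains, out_dir="output", crawl=crawl_domain):
    """Crawl all domains with a thread pool, then save each result as JSON."""
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        # pool.map returns results in the same order as the input domains
        results = list(pool.map(crawl, domains))
    return [save_urls(d, urls, out_dir) for d, urls in zip(domains, results)]


if __name__ == "__main__":
    for saved in run(["example.com", "shop.example.org"]):
        print(f"saved {saved}")
```

Threads suit this kind of work because each worker spends most of its time waiting on the browser and the network rather than on computation.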
- config/: Configuration settings.
- utils/: Utility functions for URL processing.
- scraping/: Web scraping logic using Selenium.
- processing/: Functions for extracting and processing URLs.
- saving/: Functions for saving data to files.
- requirements.txt: List of Python dependencies.
- README.md: Project documentation.
The crawler uses a combination of regex patterns and URL cleaning to identify product URLs. The process involves the following steps:
- Homepage Scraping: The crawler first scrapes the homepage of each domain to extract potential product URLs.
- URL Filtering: The extracted URLs are filtered using regex patterns to identify those that are likely product pages.
- Endpoint Processing: Each filtered URL is then visited and its page is scanned for product links. The crawler scrolls down the page so that dynamically loaded content is rendered before the links are extracted.
- Data Saving: The discovered product URLs are saved to JSON files, and the status of each domain's processing is logged to a CSV file.
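The filtering and scrolling steps above can be sketched roughly as follows. The regex patterns, function names, and default arguments are assumptions made for illustration; the project's own patterns and helpers may differ. The scroll helper only relies on the WebDriver `execute_script` method, so it accepts any Selenium driver instance.

```python
import re
import time
from urllib.parse import urljoin, urlsplit, urlunsplit

# Assumed example patterns; real sites usually need their own rules.
PRODUCT_PATTERNS = [
    re.compile(r"/products?/[^/?#]+"),
    re.compile(r"/dp/[A-Za-z0-9]+"),
    re.compile(r"/p/[^/?#]+"),
    re.compile(r"/item/[^/?#]+"),
]


def clean_url(base_url, href):
    """Resolve a link against the page URL and drop query string and fragment."""
    parts = urlsplit(urljoin(base_url, href))
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))


def is_product_url(url):
    """Return True if the URL path matches any of the product patterns."""
    path = urlsplit(url).path
    return any(pattern.search(path) for pattern in PRODUCT_PATTERNS)


def filter_product_urls(base_url, hrefs):
    """Return cleaned, de-duplicated, sorted product URLs on the same domain."""
    domain = urlsplit(base_url).netloc
    found = set()
    for href in hrefs:
        url = clean_url(base_url, href)
        if urlsplit(url).netloc == domain and is_product_url(url):
            found.add(url)
    return sorted(found)


def scroll_to_bottom(driver, max_scrolls=10, pause=2.0):
    """Scroll until the page height stops growing or max_scrolls is reached.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to render
        scrolls += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so stop early
        last_height = new_height
    return scrolls
```

Stripping the query string and fragment before de-duplicating means that tracking parameters do not produce several entries for the same product. The trade-off is that sites which encode the product ID in the query string would need a different cleaning rule.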
Contributions are welcome! Please open an issue or submit a pull request.
For any questions or suggestions, please contact akhilakhi298@gmail.com.