E-commerce Product URL Crawler

The E-commerce Product URL Crawler is a web scraping tool that discovers and lists product URLs across multiple e-commerce websites. It handles variations in URL structure between sites and is designed to scale to hundreds of domains.

Features

URL Discovery: Identifies product pages by matching URLs against regex patterns.
Scalability: Modular design with parallel workers, so many domains can be processed at once.
Performance: Scroll limits, wait times, and worker count are configurable (see Configuration).
Robustness: Handles edge cases and variations in URL structures across sites.
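
As a rough illustration of the parallel processing mentioned above, the sketch below runs one crawl per domain in a thread pool. The names crawl_domain and crawl_all and the value of MAX_WORKERS are assumptions made for this example, not the project's actual code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 4  # assumed value; the real setting lives in config/config.py


def crawl_domain(domain):
    """Hypothetical stand-in for the per-domain crawl; returns product URLs."""
    return [f"https://{domain}/product/example"]


def crawl_all(domains, worker=crawl_domain, max_workers=MAX_WORKERS):
    """Run one crawl per domain in a thread pool and collect results by domain."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its domain so results can be labeled.
        futures = {pool.submit(worker, d): d for d in domains}
        for future in as_completed(futures):
            domain = futures[future]
            try:
                results[domain] = future.result()
            except Exception as exc:
                # One failing domain should not stop the others.
                results[domain] = []
                print(f"{domain} failed: {exc}")
    return results
```

Threads are a reasonable fit here because browser-driven scraping spends most of its time waiting on the network. With Selenium, each worker would typically need its own browser instance, since a single WebDriver session is not meant to be shared across threads.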

Requirements

Python 3.x
Selenium
BeautifulSoup
ChromeDriver

Setup

Clone the Repository:

git clone https://github.com/akhil298/Web-Crawler.git

Install Dependencies:
```
pip install -r requirements.txt
```

Configuration

Configure the crawler settings in config/config.py:

SHOW_BROWSER: Set to True to show the browser during scraping.
MAX_SCROLLS: Maximum number of scrolls per page.
MAX_WORKERS: Number of parallel workers for scraping.
SCROLL_PAUSE_TIME: Time to wait between scrolls.
WAIT_TIME: Initial wait time after loading a page.
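
To show how the scroll-related settings might fit together, here is a minimal sketch of a scroll loop. The values and the function scroll_page are assumptions for illustration (including the choice of seconds as the time unit); only driver.execute_script is a real Selenium WebDriver call.

```python
import time

# Assumed example values; the real ones are set in config/config.py.
MAX_SCROLLS = 10
SCROLL_PAUSE_TIME = 2.0  # assumed to be seconds
WAIT_TIME = 5.0          # assumed to be seconds


def scroll_page(driver, max_scrolls=MAX_SCROLLS, pause=SCROLL_PAUSE_TIME, wait=WAIT_TIME):
    """Scroll to the bottom until the page stops growing or the cap is reached.

    `driver` is any object with Selenium's execute_script method.
    Returns the number of scrolls performed.
    """
    time.sleep(wait)  # let the initial page load settle
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        time.sleep(pause)  # give lazily loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return scrolls
```

Under these assumptions, MAX_SCROLLS bounds the time spent on pages that load content indefinitely, while the height check ends the loop early on short pages.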

Usage

Run the crawler with the following command:

python main.py

The crawler will process the specified domains and save the discovered product URLs to JSON files.
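
The output step could look roughly like the sketch below, which writes URLs to a JSON file and appends a row to a CSV status log. The function names, file paths, and CSV columns are hypothetical and chosen only for this example.

```python
import csv
import json
from pathlib import Path


def save_urls(urls, path):
    """Write de-duplicated, sorted URLs to a JSON file; return how many were saved."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    unique = sorted(set(urls))
    with path.open("w", encoding="utf-8") as f:
        json.dump(unique, f, indent=2)
    return len(unique)


def log_status(domain, status, count, path):
    """Append one row per domain to a CSV log, writing a header for a new file."""
    path = Path(path)
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["domain", "status", "url_count"])  # assumed columns
        writer.writerow([domain, status, count])
```

A call such as save_urls(found, "output/example.com.json") followed by log_status("example.com", "ok", n, "output/status.csv") would record both the results and the outcome for one domain.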

Project Structure

config/: Configuration settings.
utils/: Utility functions for URL processing.
scraping/: Web scraping logic using Selenium.
processing/: Functions for extracting and processing URLs.
saving/: Functions for saving data to files.
requirements.txt: List of Python dependencies.
README.md: Project documentation.

Approach to Finding Product URLs

The crawler uses a combination of regex patterns and URL cleaning to identify product URLs. The process involves the following steps:

Homepage Scraping: The crawler first scrapes the homepage of each domain to extract potential product URLs.
URL Filtering: The extracted URLs are filtered using regex patterns to identify those that are likely product pages.
Endpoint Processing: Each filtered URL is then visited to extract product links. The crawler scrolls the page so that lazily loaded content is rendered before links are collected.
Data Saving: The discovered product URLs are saved to JSON files, and the status of each domain's processing is logged to a CSV file.
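
The filtering and cleaning steps above can be sketched as follows. The regex patterns, the decision to drop query strings and fragments, and the function names are all assumptions for illustration; the project's real rules may differ, and some sites identify products through query parameters that this sketch would discard.

```python
import re
from urllib.parse import urljoin, urlsplit, urlunsplit

# Assumed patterns for illustration; real sites need their own rules.
PRODUCT_PATTERNS = [
    re.compile(r"/products?/[^/]+"),
    re.compile(r"/p/[^/]+"),
    re.compile(r"/dp/[A-Za-z0-9]+"),
]


def clean_url(base_url, href):
    """Resolve href against base_url, then drop query string, fragment, and trailing slash."""
    parts = urlsplit(urljoin(base_url, href))
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))


def is_product_url(url):
    """Return True if the URL path matches any of the product patterns."""
    path = urlsplit(url).path
    return any(p.search(path) for p in PRODUCT_PATTERNS)


def extract_product_urls(base_url, hrefs):
    """Clean each href and keep same-host URLs that look like product pages."""
    base_host = urlsplit(base_url).netloc
    found = set()
    for href in hrefs:
        url = clean_url(base_url, href)
        if urlsplit(url).netloc == base_host and is_product_url(url):
            found.add(url)
    return sorted(found)
```

Cleaning before matching means that links differing only in tracking parameters or anchors collapse to a single entry, which keeps the saved list free of near-duplicates.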

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

Contact

For any questions or suggestions, please contact akhilakhi298@gmail.com.
