A robust, scalable web scraping project built with Python that extracts product data from a real-world eCommerce demo site.
This repository demonstrates two approaches:
- 🧵 Synchronous Scraping (requests) — simple, readable, beginner-friendly
- ⚡ Asynchronous Scraping (aiohttp + asyncio) — fast, concurrent, production-grade
✔ Handles pagination automatically
✔ Extracts complete product metadata
✔ Implements polite scraping practices
✔ Exports clean data into CSV format
✨ What makes this project stand out:
- 🔄 Automatic Pagination Detection
- 🌐 Session-Based Requests
- 🧠 Fault-Tolerant Data Extraction
- 📦 Structured Data Storage
- 📊 CSV Export via pandas
- ⚡ Async Scraping (Concurrency with asyncio)
- 🚀 Parallel Page Fetching (Massive Speed Boost)
- ⏱️ Polite Scraping (Delays & Headers)
- 🔗 Extracts:
  - Product Title
  - Price
  - Image URL
  - Product URL
```
Initialize Session → Fetch Page → Parse → Repeat → Save CSV
```
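A minimal sketch of that loop, assuming the demo site paginates as `/shop/page/N/` and renders products as `li.product` cards — the URL pattern and CSS selectors are assumptions for illustration, not copied from `scraper.py`:

```python
import time

import requests
from bs4 import BeautifulSoup

# Assumed URL pattern for the demo site; the page number fills the slot.
BASE_URL = "https://scrapeme.live/shop/page/{}/"

def parse_products(html: str) -> list[dict]:
    """Extract title, price, image URL and product URL from one listing page."""
    soup = BeautifulSoup(html, "lxml")
    products = []
    for card in soup.select("li.product"):  # selector is an assumption about the markup
        link = card.select_one("a")
        title = card.select_one("h2")
        price = card.select_one(".price")
        img = card.select_one("img")
        products.append({
            "title": title.get_text(strip=True) if title else None,
            "price": price.get_text(strip=True) if price else None,
            "image_url": img["src"] if img else None,
            "product_url": link["href"] if link else None,
        })
    return products

def scrape(pages: int = 3, delay: float = 1.0) -> list[dict]:
    """Initialize Session -> Fetch Page -> Parse -> Repeat."""
    all_products = []
    with requests.Session() as session:
        session.headers["User-Agent"] = "Mozilla/5.0 (educational scraper)"
        for page in range(1, pages + 1):
            resp = session.get(BASE_URL.format(page), timeout=10)
            resp.raise_for_status()
            all_products.extend(parse_products(resp.text))
            time.sleep(delay)  # polite delay between requests
    return all_products
```

Reusing one `Session` keeps the TCP connection alive across pages, which is both faster and gentler on the server than opening a fresh connection per request.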
```
Fetch First Page → Detect Total Pages
        ↓
Create Async Tasks (All Pages)
        ↓
Execute Concurrent Requests (asyncio.gather)
        ↓
Parse HTML → Store Data → Export CSV
```
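The concurrent stage might be sketched with aiohttp like this (the URL pattern is an assumption; `return_exceptions=True` is the detail that lets one failed page not cancel the rest of the batch):

```python
import asyncio

import aiohttp

# Assumed URL pattern for the demo site.
BASE_URL = "https://scrapeme.live/shop/page/{}/"

async def fetch_page(session: aiohttp.ClientSession, page: int) -> str:
    """Fetch a single listing page and return its HTML."""
    async with session.get(BASE_URL.format(page)) as resp:
        resp.raise_for_status()
        return await resp.text()

async def fetch_all(total_pages: int) -> list:
    """Create one task per page, then run them concurrently with asyncio.gather."""
    headers = {"User-Agent": "Mozilla/5.0 (educational scraper)"}
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(headers=headers, timeout=timeout) as session:
        tasks = [fetch_page(session, page) for page in range(1, total_pages + 1)]
        # return_exceptions=True: a failed page yields an exception object in the
        # result list instead of aborting every other in-flight request.
        return await asyncio.gather(*tasks, return_exceptions=True)

if __name__ == "__main__":
    pages_html = asyncio.run(fetch_all(3))
```

Because `asyncio.gather` preserves task order, result `i` always corresponds to page `i + 1`, so page numbers never get shuffled.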
| Feature | Sync (requests) 🧵 | Async (aiohttp) ⚡ |
|---|---|---|
| Execution | Sequential | Concurrent |
| Speed | Slower | Much Faster 🚀 |
| Complexity | Easy | Intermediate |
| Scalability | Limited | High |
| Use Case | Small projects | Large-scale scraping |
| Category | Tools Used |
|---|---|
| Language | Python 🐍 |
| Sync HTTP | requests |
| Async HTTP | aiohttp + asyncio |
| Parsing | BeautifulSoup (bs4) |
| Parser Engine | lxml |
| Data Handling | pandas |
```
scrapeme-scraper/
│
├── scraper.py                # Sync version (requests)
├── scraper_async.py          # Async version (aiohttp)
├── products_info.csv         # Output (sync)
├── products_info_async.csv   # Output (async)
└── README.md                 # Documentation
```
```bash
git clone https://github.com/your-username/scrapeme-scraper.git
cd scrapeme-scraper
pip install requests aiohttp beautifulsoup4 lxml pandas
```

Run either version:

```bash
python scraper.py        # sync
python scraper_async.py  # async
```

Expected output:

```
Fetched: page 1
Fetched: page 2
Fetched: page 3
...
Total products scraped: 755
CSV saved successfully!
```
| Title | Price | Image URL | Product URL |
|---|---|---|---|
| Bulbasaur | £63 | ... | ... |
| Ivysaur | £87 | ... | ... |
✔ Encoding: utf-8-sig (Excel-ready)
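The export step with pandas might look like this — the function name `save_csv` and the column layout are illustrative, not lifted from the repo. `utf-8-sig` prepends a BOM, which is what makes Excel decode the `£` signs correctly:

```python
import pandas as pd

def save_csv(products: list[dict], path: str = "products_info.csv") -> None:
    """Write scraped rows to CSV; utf-8-sig adds a BOM so Excel renders £ correctly."""
    df = pd.DataFrame(products)
    df.to_csv(path, index=False, encoding="utf-8-sig")
```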
- ✔ Handles network failures
- ✔ Prevents crashes from missing HTML elements
- ✔ Uses safe parsing patterns
- ✔ Async version handles partial failures gracefully
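The "missing HTML elements" case comes down to never calling `.get_text()` on a `None` result. One safe pattern, sketched with illustrative selector names:

```python
from bs4 import BeautifulSoup

def safe_text(node):
    """Return stripped text, or None when the element was not found."""
    return node.get_text(strip=True) if node is not None else None

def parse_cards(html: str) -> list[dict]:
    """Missing sub-elements become None; a malformed card never aborts the page."""
    rows = []
    for card in BeautifulSoup(html, "lxml").select("li.product"):
        try:
            rows.append({
                "title": safe_text(card.select_one("h2")),
                "price": safe_text(card.select_one(".price")),
            })
        except Exception:
            continue  # skip one broken card instead of crashing the whole run
    return rows
```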
This project follows best practices:
- ⏳ Uses delays / controlled concurrency
- 🤝 Avoids aggressive request patterns
- 📜 Built for educational purposes
Always respect `robots.txt` and each website's terms of service.
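In the async version, "controlled concurrency" can be enforced with an `asyncio.Semaphore`. A stdlib-only sketch — the cap of 5 and the jittered delay are illustrative values, and `asyncio.sleep` stands in for the real HTTP call:

```python
import asyncio
import random

MAX_CONCURRENT = 5  # illustrative cap, not a value taken from the repo

async def polite_fetch(semaphore: asyncio.Semaphore, page: int) -> int:
    """Cap in-flight requests and add a small jittered pause before each one."""
    async with semaphore:
        await asyncio.sleep(random.uniform(0.1, 0.3))  # stand-in for the HTTP call
        return page

async def run(pages: int) -> list[int]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    tasks = [polite_fetch(semaphore, p) for p in range(1, pages + 1)]
    return await asyncio.gather(*tasks)  # at most MAX_CONCURRENT run at once
```

The semaphore gives you concurrency without a thundering herd: all tasks are created up front, but only `MAX_CONCURRENT` of them hold a slot at any moment.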
Planned enhancements:
- 🔁 Retry logic + exponential backoff
- 🌍 Proxy / IP rotation
- 📦 Export to JSON / Database
- 🧾 Logging system (production-grade)
- ⚙️ CLI tool support
- ☁️ Deploy as API / microservice
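The first of those roadmap items, retry with exponential backoff, could be sketched like this (`fetch_with_retry` is a hypothetical helper, not part of the repo):

```python
import random
import time

def fetch_with_retry(fetch, retries: int = 3, base_delay: float = 0.5):
    """Call fetch(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            # Exponential backoff with jitter so retries don't arrive in lockstep.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

The jitter term matters when many scrapers retry at once: without it, every failed client retries on the same schedule and hammers the server again simultaneously.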
Mohammad Mustak Absar Khan
🔗 GitHub: https://github.com/MustakAbsarKhan
If you found this useful:
⭐ Star the repository 🍴 Fork it 🚀 Build your own version