# Web Scraping Project - Leather Goods & Handicrafts

A Python web scraping project designed to extract product data from various e-commerce websites specializing in leather goods and handicrafts.

## 📁 Project Structure

```text
Web-Scrabing/
├── README.md                          # This file
├── requirements.txt                   # Python dependencies
│
├── 🐍 SCRAPERS (Main Scripts)
│   ├── curomania.py                  # Scrapes cuiromania.com leather products
│   ├── simple.py                     # General-purpose web scraper
│   ├── MinAjiliki.py                 # Scrapes MinAjiliki handicrafts
│   └── cleaned.py                    # Data cleaning utility
│
├── 📊 DATA & OUTPUTS
│   ├── data/                         # Directory for organized data files
│   ├── EXCEL/                        # Excel export folder
│   ├── cleaned_data.csv              # Cleaned and processed data
│   ├── cuiromania_products.csv       # Raw cuiromania.com products
│   ├── maleatherdesign_products.csv  # Leather design products (empty)
│   ├── maleatherdesign_products_100.csv  # Leather design products (100 items)
│   ├── leather_goods_cuiromania_*.csv    # Dated extracts
│   ├── leather_goods_morocco_*.csv       # Morocco products
│   └── scraped_product_urls.json     # URLs JSON export
│
├── 💻 WEB CONTENT
│   └── product_page.html             # Sample/cached product page HTML
│
└── ⚙️ UTILITIES & CONFIG
    ├── utils/                        # Utility functions directory
    ├── scrapers/                     # Scrapers modules directory
    ├── blog                          # Blog/documentation file
    └── .vscode/                      # VS Code settings
```
## 🎯 What This Project Does

This project scrapes product data from multiple Moroccan and international leather goods/handicrafts websites:

- **Cuiromania** (`curomania.py`) - Premium leather goods
- **MaLeatherDesign** - Leather product catalog
- **Other sources** - Via `simple.py` and `MinAjiliki.py`

Data extracted:

- Product name & description
- Prices & availability
- Product URLs
- Images/media
- Categories & references
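The extraction step can be sketched with the standard library alone. Note that the repo's scrapers use `requests` and BeautifulSoup, and the CSS class names below (`product-name`, `price`) are invented for illustration, not taken from any target site's markup:

```python
# Minimal sketch of per-product field extraction, stdlib only.
# The class names "product-name" and "price" are assumptions.
from html.parser import HTMLParser

class ProductParser(HTMLParser):
    """Collects text from tags whose class attribute matches a field of interest."""
    FIELDS = {"product-name": "name", "price": "price"}

    def __init__(self):
        super().__init__()
        self.product = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        self._current = self.FIELDS.get(cls)

    def handle_data(self, data):
        if self._current and data.strip():
            self.product[self._current] = data.strip()
            self._current = None

sample = '<div class="product-name">Leather Bag</div><span class="price">450 MAD</span>'
parser = ProductParser()
parser.feed(sample)
print(parser.product)  # {'name': 'Leather Bag', 'price': '450 MAD'}
```

In the real scrapers the same idea is a couple of BeautifulSoup `select_one` calls per product card.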

## 🚀 Getting Started

### Prerequisites

- Python 3.x
- Dependencies listed in `requirements.txt`

### Installation

```bash
# Clone the repository
git clone https://github.com/DohaSK/Web-Scrabing.git
cd Web-Scrabing

# Install dependencies
pip install -r requirements.txt
```

### Running Scrapers

```bash
# Scrape Cuiromania products
python curomania.py

# Run general scraper
python simple.py

# Clean data
python cleaned.py
```

## 📋 Data Files Guide

### Raw Data (Input)

| File | Source | Purpose |
| --- | --- | --- |
| `cuiromania_products.csv` | cuiromania.com | Raw product listings |
| `maleatherdesign_products_100.csv` | MaLeatherDesign | 100-product sample |
| `leather_goods_*.csv` | Various | Dated extractions |

### Processed Data (Output)

| File | Purpose |
| --- | --- |
| `cleaned_data.csv` | Deduplicated & cleaned dataset |
| `scraped_product_urls.json` | All product URLs in JSON format |

### Cache Files

| File | Purpose |
| --- | --- |
| `product_page.html` | Sample HTML page for testing/reference |
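As an illustration of the `scraped_product_urls.json` export, here is a hedged sketch; the actual schema in the repo may differ (for instance, the real file may not be deduplicated or sorted):

```python
# Sketch of the step that produces scraped_product_urls.json.
# A flat, sorted JSON array is an assumption about the schema.
import json

def export_urls(urls, path="scraped_product_urls.json"):
    """Write collected product URLs as a JSON array, deduplicated and
    sorted so repeated runs produce stable diffs."""
    unique = sorted(set(urls))
    with open(path, "w", encoding="utf-8") as f:
        json.dump(unique, f, indent=2, ensure_ascii=False)
    return unique

urls = export_urls(
    ["https://cuiromania.com/p/1",
     "https://cuiromania.com/p/2",
     "https://cuiromania.com/p/1"],
)
print(len(urls))  # 2
```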

## 📝 Requirements

See `requirements.txt` for the complete list. Common dependencies:

- `requests` - HTTP requests
- `beautifulsoup4` - HTML parsing
- `selenium` (optional) - JavaScript rendering
- `pandas` - Data processing
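If you need to recreate `requirements.txt` from scratch, a minimal version matching the dependencies above would look like the following (unpinned; the repo's actual file may pin versions or include extras such as `openpyxl` for Excel export):

```text
requests
beautifulsoup4
pandas
selenium
```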

## 🔄 Workflow

```text
1. Run scraper (curomania.py / simple.py)
   ↓
2. Extract product data (name, price, URL, images)
   ↓
3. Save to CSV/JSON
   ↓
4. Clean & deduplicate (cleaned.py)
   ↓
5. Export to Excel (EXCEL/)
```
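The clean & deduplicate step (step 4) might look like the stdlib-only sketch below. The real column names and filtering rules live in `cleaned.py` and are not shown in this README, so `name` and `url` here are assumptions:

```python
# Sketch of the clean & deduplicate step (cleaned.py itself may use pandas).
# Columns "name" and "url" are assumed, not taken from the real CSVs.
def clean_rows(rows):
    """Strip whitespace, drop rows without a name, deduplicate on URL."""
    seen, cleaned = set(), []
    for row in rows:
        row = {k: (v or "").strip() for k, v in row.items()}
        if not row.get("name") or row.get("url") in seen:
            continue
        seen.add(row["url"])
        cleaned.append(row)
    return cleaned

raw = [
    {"name": " Leather Bag ", "url": "https://example.com/1", "price": "450"},
    {"name": "Leather Bag",   "url": "https://example.com/1", "price": "450"},
    {"name": "",              "url": "https://example.com/2", "price": "300"},
]
print(len(clean_rows(raw)))  # 1
```

With pandas the same step collapses to `df.drop_duplicates(subset="url")` plus a `dropna` on the name column.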

## 📌 Important Notes

- **Data updates**: Each scraper run overwrites or appends data with timestamps
- **Dated CSV files**: `leather_goods_*_[timestamp].csv` files are dated backups
- **Empty files**: Some CSVs may be empty if scraping failed or no data was found
- **Rate limiting**: Be respectful to target websites; add delays between requests
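The delay-between-requests advice can be sketched as a small helper; the base delay and jitter values below are arbitrary examples, not tuned for any particular site:

```python
# Polite-delay sketch: pause between requests with a little random
# jitter so scraper traffic doesn't arrive in a fixed rhythm.
import random
import time

def polite_sleep(base=1.0, jitter=0.5):
    """Sleep for base ± jitter seconds (never negative); return the delay used."""
    delay = max(base + random.uniform(-jitter, jitter), 0.0)
    time.sleep(delay)
    return delay

# Typical usage inside a scraping loop (fetch() is a placeholder):
# for url in product_urls:
#     fetch(url)
#     polite_sleep(base=2.0)
```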

## 🛠️ Troubleshooting

### No data being scraped?

- Check the website URLs in the scripts (sites may change their structure)
- Verify your internet connection
- Check for JavaScript-rendered content (may require Selenium)

### Data cleaning issues?

- Review the filtering logic in `cleaned.py`
- Check that the input CSV format matches the expected structure

### Large HTML files?

- `product_page.html` is a cached page; it is safe to delete if needed

## 📂 Recommended Next Steps

To improve organization:

1. Move all scrapers into the `scrapers/` folder
2. Move utility functions into the `utils/` folder
3. Organize raw data under `data/raw/`
4. Organize processed data under `data/processed/`
5. Create a `config.py` for URLs and settings

## 📝 License

Specify your license here (e.g., MIT, GPL, etc.)

## 👤 Author

Doha Skouf - Created April 2026

## 📞 Support

For issues or questions, create an issue on GitHub or contact the project maintainer.
