
🚀 Zyte-Scrape-to-S3

A Scrapy-based web scraping project that integrates with Zyte (formerly Scrapinghub) to extract job listings and automatically upload the scraped data to AWS S3 for storage and analysis.

📋 Overview

This project automates the process of scraping job listings from various sources and storing them in AWS S3 buckets. It leverages Zyte's cloud-based scraping infrastructure for reliable and scalable data extraction.

✨ Features

  • Multi-Spider Architecture: Run multiple Scrapy spiders simultaneously to scrape different job boards
  • Zyte Integration: Deploy and manage spiders on Zyte's cloud platform
  • AWS S3 Storage: Automatic upload of scraped data to S3 buckets
  • SQLite Databases: Local storage of job data for quick access and analysis
  • Batch Processing: Execute all spiders with a single command using run_all_spiders.py
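A batch runner like run_all_spiders.py could be sketched as below. The actual script's contents aren't shown in this README, and the spider names are placeholders; the sketch simply shells out to `scrapy crawl` once per spider:

```python
import subprocess
import sys

# Placeholder spider names; the real project defines its own.
SPIDERS = ["jobs", "python_jobs", "remote"]


def build_crawl_command(spider_name):
    """Build the `scrapy crawl` invocation for one spider."""
    return [sys.executable, "-m", "scrapy", "crawl", spider_name]


def run_all_spiders(dry_run=False):
    """Run each spider sequentially; with dry_run, only print the commands."""
    for spider in SPIDERS:
        cmd = build_crawl_command(spider)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Running the spiders sequentially keeps logs readable; a real implementation might instead use Scrapy's CrawlerProcess to run them in one Python process.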

📁 Project Structure

Zyte-Scrape-to-S3/
├── CombinedJobs/              # Main Scrapy project directory
│   ├── CombinedJobs/          # Scrapy spiders and settings
│   ├── run_all_spiders.py     # Script to execute all spiders
│   ├── scrapy.cfg             # Scrapy configuration
│   ├── setup.py               # Project setup file
│   └── scrapinghub.yml        # Zyte deployment configuration
├── Jobs.db                    # SQLite database for job listings
├── python_jobs.db             # SQLite database for Python-specific jobs
└── remote.db                  # SQLite database for remote job listings

📦 Prerequisites

  • Python 3.x
  • Scrapy
  • AWS Account with S3 access
  • Zyte Account (for cloud deployment)
  • Required Python packages (see installation)

🔧 Installation

  1. Clone the repository:
     git clone https://github.com/suhaasd/Zyte-Scrape-to-S3.git
     cd Zyte-Scrape-to-S3
  2. Navigate to the project directory:
     cd CombinedJobs
  3. Install dependencies:
     pip install scrapy scrapy-zyte-api boto3

⚙️ Configuration

AWS S3 Setup

  1. Create an S3 bucket in your AWS account
  2. Configure AWS credentials with S3 write permissions
  3. Update the S3 bucket name in your Scrapy settings
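In Scrapy, S3 upload is typically wired up through the FEEDS setting in settings.py. The sketch below uses placeholder bucket, path, and credential values; this project's actual settings may differ:

```python
# settings.py (sketch; bucket name, path, and keys are placeholders)
AWS_ACCESS_KEY_ID = "your-access-key"      # or rely on standard AWS credentials
AWS_SECRET_ACCESS_KEY = "your-secret-key"

# Scrapy substitutes %(name)s with the spider name and %(time)s with a timestamp.
FEEDS = {
    "s3://your-bucket-name/jobs/%(name)s-%(time)s.json": {
        "format": "json",
        "encoding": "utf8",
    },
}
```

With this in place, every `scrapy crawl` writes its items straight to the bucket, so no separate upload step is needed.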

Zyte Setup

  1. Sign up for a Zyte account at https://www.zyte.com/
  2. Configure your API key in scrapinghub.yml
  3. Deploy your spiders to Zyte using the Scrapy Cloud CLI
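A minimal scrapinghub.yml might look like the following (the project ID is a placeholder; note that the API key itself is usually supplied interactively via `shub login` rather than committed to the repository):

```yaml
# scrapinghub.yml (sketch; replace the project ID with your own)
project: 000000
requirements:
  file: requirements.txt
```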

💻 Usage

Running Spiders Locally

To run a specific spider:

cd CombinedJobs
scrapy crawl <spider_name>

Running All Spiders

Execute all spiders at once:

python run_all_spiders.py

Deploying to Zyte

Deploy your project to Zyte Cloud:

shub deploy

💾 Data Storage

The project stores scraped job data in two places:

  1. SQLite Databases: Local storage for quick queries

    • Jobs.db: General job listings
    • python_jobs.db: Python-specific positions
    • remote.db: Remote job opportunities
  2. AWS S3: Cloud storage for long-term retention and analysis
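Once a crawl finishes, the local databases can be queried with Python's built-in sqlite3 module. The table and column names below are hypothetical (the actual schema isn't documented here), so the sketch builds an in-memory table of the same shape to stay self-contained:

```python
import sqlite3

# Hypothetical schema; the real tables in Jobs.db may differ.
conn = sqlite3.connect(":memory:")  # swap in "Jobs.db" to query the real file
conn.execute("CREATE TABLE jobs (title TEXT, company TEXT, location TEXT)")
conn.execute(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    ("Data Engineer", "Acme Corp", "Remote"),
)

# Count listings per location, most common first.
rows = conn.execute(
    "SELECT location, COUNT(*) AS n FROM jobs GROUP BY location ORDER BY n DESC"
).fetchall()
print(rows)
conn.close()
```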

📸 Screenshots

The project includes visual documentation:

  • AWS S3 bucket configuration
  • Zyte job execution dashboard
  • Data items structure

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

👥 Authors

  • SUHAAS D
  • SRIRAM K
  • VAISHNAV P S

🙏 Acknowledgments

  • Scrapy - Web scraping framework
  • Zyte - Cloud scraping platform
  • AWS S3 - Cloud storage solution

📝 License

This project is available for educational and analytical purposes.


For questions or issues, please open an issue in the GitHub repository.
