A Scrapy-based web scraping project that integrates with Zyte (formerly Scrapinghub) to extract job listings and automatically uploads the scraped data to AWS S3 for storage and analysis.
This project automates the process of scraping job listings from various sources and storing them in AWS S3 buckets. It leverages Zyte's cloud-based scraping infrastructure for reliable and scalable data extraction.
- Multi-Spider Architecture: Run multiple Scrapy spiders simultaneously to scrape different job boards
- Zyte Integration: Deploy and manage spiders on Zyte's cloud platform
- AWS S3 Storage: Automatic upload of scraped data to S3 buckets
- SQLite Databases: Local storage of job data for quick access and analysis
- Batch Processing: Execute all spiders with a single command using `run_all_spiders.py` (a sketch follows this list)
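For orientation, a batch runner like `run_all_spiders.py` can be built on Scrapy's `CrawlerProcess`. The sketch below is a minimal, assumed version rather than the repository's exact script:

```python
# Minimal sketch of a run_all_spiders.py-style batch runner (assumed,
# not this repository's exact script). It schedules every spider
# registered in the Scrapy project and runs them in one process.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def main():
    process = CrawlerProcess(get_project_settings())
    # spider_loader.list() returns the names of all spiders in the project.
    for spider_name in process.spider_loader.list():
        process.crawl(spider_name)
    process.start()  # blocks until every scheduled crawl finishes

if __name__ == "__main__":
    main()
```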
```
Zyte-Scrape-to-S3/
├── CombinedJobs/             # Main Scrapy project directory
│   ├── CombinedJobs/         # Scrapy spiders and settings
│   ├── run_all_spiders.py    # Script to execute all spiders
│   ├── scrapy.cfg            # Scrapy configuration
│   ├── setup.py              # Project setup file
│   └── scrapinghub.yml       # Zyte deployment configuration
├── Jobs.db                   # SQLite database for job listings
├── python_jobs.db            # SQLite database for Python-specific jobs
└── remote.db                 # SQLite database for remote job listings
```
- Python 3.x
- Scrapy
- AWS Account with S3 access
- Zyte Account (for cloud deployment)
- Required Python packages (see installation)
- Clone the repository:

```bash
git clone https://github.com/suhaasd/Zyte-Scrape-to-S3.git
cd Zyte-Scrape-to-S3
```

- Navigate to the project directory:

```bash
cd CombinedJobs
```

- Install dependencies:

```bash
pip install scrapy scrapy-zyte-api boto3
```

- Create an S3 bucket in your AWS account
- Configure AWS credentials with S3 write permissions
- Update the S3 bucket name in your Scrapy settings (a sketch follows below)
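The project's exact settings aren't reproduced in this README. As a minimal sketch, assuming Scrapy's built-in feed exports handle the S3 upload, the relevant `settings.py` entries could look like this (the bucket name and credential values are placeholders):

```python
# Illustrative excerpt for CombinedJobs/settings.py (bucket name and
# credentials are placeholders, not this project's real configuration).
# Scrapy's feed exports can write scraped items directly to S3.
FEEDS = {
    "s3://my-job-listings-bucket/%(name)s/%(time)s.json": {
        "format": "json",
        "encoding": "utf8",
    },
}

# Alternatively, leave these unset and rely on the standard AWS
# environment variables or ~/.aws/credentials.
AWS_ACCESS_KEY_ID = "YOUR_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_ACCESS_KEY"
```

The `%(name)s` and `%(time)s` placeholders expand to the spider name and crawl timestamp, so each run writes to its own S3 key.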
- Sign up for a Zyte account at https://www.zyte.com/
- Configure your API key in `scrapinghub.yml` (see the example below)
- Deploy your spiders to Zyte using the Scrapy Cloud CLI (`shub`)
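A typical `scrapinghub.yml` only needs the Scrapy Cloud project ID; running `shub login` stores the API key globally in `~/.scrapinghub.yml`. The values below are placeholders, not this project's real configuration:

```yaml
# scrapinghub.yml (illustrative; the project ID is a placeholder).
# The API key itself is usually written to ~/.scrapinghub.yml by
# `shub login` rather than committed to the project file.
project: 123456
requirements:
  file: requirements.txt
```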
To run a specific spider:

```bash
cd CombinedJobs
scrapy crawl <spider_name>
```

Execute all spiders at once:

```bash
python run_all_spiders.py
```

Deploy your project to Zyte Cloud:

```bash
shub deploy
```

The project stores scraped job data in two ways:
- SQLite Databases: Local storage for quick queries (a pipeline sketch follows this list)
  - Jobs.db: General job listings
  - python_jobs.db: Python-specific positions
  - remote.db: Remote job opportunities
- AWS S3: Cloud storage for long-term retention and analysis
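The repository's pipeline code isn't shown in this README; a minimal sketch of a Scrapy item pipeline writing to one of these databases might look like the following (the table and item field names are hypothetical):

```python
# Hypothetical item pipeline persisting scraped jobs to Jobs.db.
# Table and field names are illustrative, not taken from this repo.
import sqlite3

class SQLiteJobsPipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("Jobs.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            "title TEXT, company TEXT, location TEXT, url TEXT UNIQUE)"
        )

    def process_item(self, item, spider):
        # INSERT OR IGNORE keeps reruns from duplicating listings by URL.
        self.conn.execute(
            "INSERT OR IGNORE INTO jobs (title, company, location, url) "
            "VALUES (?, ?, ?, ?)",
            (item.get("title"), item.get("company"),
             item.get("location"), item.get("url")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

A pipeline like this would be enabled through the `ITEM_PIPELINES` setting in `settings.py`; the unique URL column keeps repeated crawls from duplicating listings.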
The project includes visual documentation:
- AWS S3 bucket configuration
- Zyte job execution dashboard
- Data items structure
Contributions are welcome! Please feel free to submit a Pull Request.
- SUHAAS D
- SRIRAM K
- VAISHNAV P S
This project is available for educational and analytical purposes.
For questions or issues, please open an issue in the GitHub repository.