A Scrapy-based web scraping project that integrates with Zyte (formerly Scrapinghub) to extract job listings and automatically uploads the scraped data to AWS S3 for storage and analysis.
This project automates the process of scraping job listings from various sources and storing them in AWS S3 buckets. It leverages Zyte's cloud-based scraping infrastructure for reliable and scalable data extraction.
- Multi-Spider Architecture: Run multiple Scrapy spiders simultaneously to scrape different job boards
- Zyte Integration: Deploy and manage spiders on Zyte's cloud platform
- AWS S3 Storage: Automatic upload of scraped data to S3 buckets
- SQLite Databases: Local storage of job data for quick access and analysis
- Batch Processing: Execute all spiders with a single command using `run_all_spiders.py` (a sketch follows this list)
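For orientation, a batch runner like `run_all_spiders.py` can be built on Scrapy's `CrawlerProcess`. The sketch below is a minimal, assumed version rather than the repository's exact script:

```python
# Minimal sketch of a run_all_spiders.py-style batch runner (assumed,
# not this repository's exact script). It schedules every spider
# registered in the Scrapy project and runs them in one process.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def main():
    process = CrawlerProcess(get_project_settings())
    # spider_loader.list() returns the names of all spiders in the project.
    for spider_name in process.spider_loader.list():
        process.crawl(spider_name)
    process.start()  # blocks until every scheduled crawl finishes

if __name__ == "__main__":
    main()
```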
```
Zyte-Scrape-to-S3/
├── CombinedJobs/             # Main Scrapy project directory
│   ├── CombinedJobs/         # Scrapy spiders and settings
│   ├── run_all_spiders.py    # Script to execute all spiders
│   ├── scrapy.cfg            # Scrapy configuration
│   ├── setup.py              # Project setup file
│   └── scrapinghub.yml       # Zyte deployment configuration
├── Jobs.db                   # SQLite database for job listings
├── python_jobs.db            # SQLite database for Python-specific jobs
└── remote.db                 # SQLite database for remote job listings
```
- Python 3.x
- Scrapy
- AWS Account with S3 access
- Zyte Account (for cloud deployment)
- Required Python packages (see installation)
- Clone the repository:

```bash
git clone https://github.com/suhaasd/Zyte-Scrape-to-S3.git
cd Zyte-Scrape-to-S3
```

- Navigate to the project directory:

```bash
cd CombinedJobs
```

- Install dependencies:

```bash
pip install scrapy scrapy-zyte-api boto3
```

- Create an S3 bucket in your AWS account
- Configure AWS credentials with S3 write permissions
- Update the S3 bucket name in your Scrapy settings (a sketch follows below)
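The project's exact settings aren't reproduced in this README. As a minimal sketch, assuming Scrapy's built-in feed exports handle the S3 upload, the relevant `settings.py` entries could look like this (the bucket name and credential values are placeholders):

```python
# Illustrative excerpt for CombinedJobs/settings.py (bucket name and
# credentials are placeholders, not this project's real configuration).
# Scrapy's feed exports can write scraped items directly to S3.
FEEDS = {
    "s3://my-job-listings-bucket/%(name)s/%(time)s.json": {
        "format": "json",
        "encoding": "utf8",
    },
}

# Alternatively, leave these unset and rely on the standard AWS
# environment variables or ~/.aws/credentials.
AWS_ACCESS_KEY_ID = "YOUR_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_ACCESS_KEY"
```

The `%(name)s` and `%(time)s` placeholders expand to the spider name and crawl timestamp, so each run writes to its own S3 key.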
- Sign up for a Zyte account at https://www.zyte.com/
- Configure your API key in `scrapinghub.yml` (see the example below)
- Deploy your spiders to Zyte using the Scrapy Cloud CLI (`shub`)
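A typical `scrapinghub.yml` only needs the Scrapy Cloud project ID; running `shub login` stores the API key globally in `~/.scrapinghub.yml`. The values below are placeholders, not this project's real configuration:

```yaml
# scrapinghub.yml (illustrative; the project ID is a placeholder).
# The API key itself is usually written to ~/.scrapinghub.yml by
# `shub login` rather than committed to the project file.
project: 123456
requirements:
  file: requirements.txt
```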
To run a specific spider:

```bash
cd CombinedJobs
scrapy crawl <spider_name>
```

Execute all spiders at once:

```bash
python run_all_spiders.py
```

Deploy your project to Zyte Cloud:

```bash
shub deploy
```

The project stores scraped job data in two ways:
- SQLite Databases: Local storage for quick queries (a pipeline sketch follows this list)
  - Jobs.db: General job listings
  - python_jobs.db: Python-specific positions
  - remote.db: Remote job opportunities
- AWS S3: Cloud storage for long-term retention and analysis
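The repository's pipeline code isn't shown in this README; a minimal sketch of a Scrapy item pipeline writing to one of these databases might look like the following (the table and item field names are hypothetical):

```python
# Hypothetical item pipeline persisting scraped jobs to Jobs.db.
# Table and field names are illustrative, not taken from this repo.
import sqlite3

class SQLiteJobsPipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("Jobs.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            "title TEXT, company TEXT, location TEXT, url TEXT UNIQUE)"
        )

    def process_item(self, item, spider):
        # INSERT OR IGNORE keeps reruns from duplicating listings by URL.
        self.conn.execute(
            "INSERT OR IGNORE INTO jobs (title, company, location, url) "
            "VALUES (?, ?, ?, ?)",
            (item.get("title"), item.get("company"),
             item.get("location"), item.get("url")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

A pipeline like this would be enabled through the `ITEM_PIPELINES` setting in `settings.py`; the unique URL column keeps repeated crawls from duplicating listings.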
The project includes visual documentation:
- AWS S3 bucket configuration
- Zyte job execution dashboard
- Data items structure
Contributions are welcome! Please feel free to submit a Pull Request.
- SUHAAS D
- SRIRAM K
- VAISHNAV P S
This project is available for educational and analytical purposes.
For questions or issues, please open an issue in the GitHub repository.