A comprehensive web scraping and data analytics platform that analyzes job market trends for technology professionals. This project scrapes job listings from the.protocol, a Polish IT job board, to identify the most in-demand technologies for Python developers in Poland, providing actionable insights for career development and skill prioritization.
This platform combines web scraping, data analysis, and visualization to deliver market intelligence for job seekers and professionals looking to understand current technology trends. By analyzing job postings, it helps users make informed decisions about which skills to develop for maximum career impact.
- Multi-technology scraping using Scrapy and Selenium for dynamic content
- Custom pipelines and middlewares for robust data extraction
- CSS and XPath selectors for precise data targeting (see the spider sketch after this feature list)
- Configurable position targeting (Python, Java, JavaScript, etc.)
- Real-time data analysis using Pandas and Jupyter
- Interactive visualizations with Matplotlib
- Market trend identification and skill demand analysis
- Custom analytics pipeline for job market insights
- Microservices architecture with Docker containerization
- Asynchronous task processing with Celery and Redis
- RESTful API with Flask backend
- Nginx reverse proxy for production deployment
- MySQL database for metadata management
- Docker Compose for easy deployment
- Automated scraping schedules with Celery Beat
- Environment-based configuration management
- Production-ready setup with proper logging and monitoring
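The scraping features listed above boil down to Scrapy spiders that combine CSS and XPath selectors, with Selenium stepping in for dynamic content. Below is a minimal, hypothetical spider sketch; the start URL, selectors, and field names are illustrative assumptions, not the project's actual spider code.

```python
# Hypothetical sketch of a job-listing spider mixing CSS and XPath selectors.
# URL, selectors, and item fields are placeholders, not the real site's markup.
import scrapy


class JobSpider(scrapy.Spider):
    name = "jobs"
    start_urls = ["https://example.com/jobs?keyword=python"]  # placeholder URL

    def parse(self, response):
        # CSS selector for each job card on the listing page
        for card in response.css("div.job-card"):
            yield {
                "title": card.css("h2::text").get(),
                # XPath selector for a nested element
                "location": card.xpath(".//span[@class='location']/text()").get(),
                "skills": card.css("li.skill::text").getall(),
            }
        # Follow pagination links, if any
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```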
| Category | Technology | Version |
|---|---|---|
| Web Scraping | Scrapy, Selenium | 2.11.2, 4.27.1 |
| Backend | Flask | 3.0.3 |
| Data Processing | Pandas, NLTK | 2.2.2 |
| Visualization | Matplotlib | 3.8.4 |
| Database | MySQL, SQLAlchemy | 9.0.0, 2.0.31 |
| Task Queue | Celery, Redis | 5.2.1 |
| Containerization | Docker, Docker Compose | Latest |
| Web Server | Nginx | Latest |
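The Flask backend and the Celery/Redis task queue listed above typically cooperate through a dispatch-and-poll pattern: one endpoint enqueues a task and returns its ID, and a second endpoint reports status and results. The sketch below illustrates that pattern only; the broker URLs, task body, and response fields are assumptions, not the project's actual wiring.

```python
# Minimal dispatch-and-poll sketch with Flask and Celery (assumed configuration).
from celery import Celery
from celery.result import AsyncResult
from flask import Flask, jsonify

flask_app = Flask(__name__)
celery_app = Celery(
    "main_celery",
    broker="redis://redis:6379/0",    # assumed Redis service name
    backend="redis://redis:6379/1",
)


@celery_app.task
def build_diagrams():
    # Placeholder for the real scraping/analytics pipeline
    return {"status": "done"}


@flask_app.route("/scraping/diagrams")
def start_analytics():
    task = build_diagrams.delay()          # enqueue the task
    return jsonify({"task_id": task.id})   # hand the ID back to the client


@flask_app.route("/scraping/diagrams/<task_id>")
def get_analytics(task_id):
    result = AsyncResult(task_id, app=celery_app)
    return jsonify({
        "status": result.status,
        "result": result.result if result.ready() else None,
    })
```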
The platform provides comprehensive analytics including:
- Skill Demand Analysis: Most requested technologies by experience level
- Market Trends: Employment types and contract preferences
- Geographic Insights: Job distribution across locations
- Technology Stack Analysis: Required vs. optional skills breakdown
- Career Path Guidance: Experience level requirements and progression
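As a rough illustration of the skill-demand analysis, the sketch below counts technology mentions across scraped listings with Pandas and plots the top entries with Matplotlib. The input file and the `skills` column are assumptions about the scraped data, not the project's actual schema.

```python
# Hypothetical skill-demand count over scraped job listings.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("job_listings.csv")  # placeholder input file

# Explode comma-separated skill lists into one row per skill, then count them.
skill_counts = (
    df["skills"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
    .head(15)
)

skill_counts.plot(kind="barh", title="Most requested technologies")
plt.tight_layout()
plt.show()
```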
- Docker and Docker Compose installed
- Git for version control
- Clone the repository

  ```bash
  git clone https://github.com/bigdata5911/Jobsite-Scraper-and-Analyzer.git
  cd Jobsite-Scraper-and-Analyzer
  ```

- Configure environment variables

  ```bash
  cp .env.sample .env
  # Edit the .env file with your configuration
  ```

- Launch the application

  ```bash
  docker-compose up --build
  ```

- Access the application
  - Web Interface: http://localhost:8000/scraping/diagrams
  - API Documentation: Available in the web interface
| Endpoint | Method | Description |
|---|---|---|
| `/scraping/diagrams` | GET | Initiates the analytics task and returns a task ID |
| `/scraping/diagrams/{task_id}` | GET | Retrieves analytics results by task ID |
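A hedged client-side sketch of that flow: hit the first endpoint to start the analytics task, then poll the second with the returned ID. The JSON field names (`task_id`, `status`) are assumptions about the response payload.

```python
# Hypothetical client for the dispatch-and-poll API (field names assumed).
import json
import time
from urllib.request import urlopen

BASE = "http://localhost:8000"

# Start the analytics task; the response is assumed to carry a task ID.
with urlopen(f"{BASE}/scraping/diagrams") as resp:
    task_id = json.load(resp)["task_id"]

# Poll for the result until the task leaves the pending state.
while True:
    with urlopen(f"{BASE}/scraping/diagrams/{task_id}") as resp:
        result = json.load(resp)
    if result.get("status") != "PENDING":
        break
    time.sleep(5)

print(result)
```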
Modify the `POSITION` variable in `config.py` to target different job roles:

```python
POSITION = "python"  # Options: python, java, javascript, dev, etc.
```

Adjust the scraping frequency in your environment configuration:
```
SCRAPING_EVERY_DAYS=7  # Scrape every 7 days
```
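For reference, a hypothetical sketch of how `SCRAPING_EVERY_DAYS` could feed the Celery Beat schedule mentioned in the feature list; the app name, broker URL, and task path are assumptions, not the project's actual configuration.

```python
# Assumed wiring of SCRAPING_EVERY_DAYS into a Celery Beat schedule.
import os
from datetime import timedelta

from celery import Celery

app = Celery("main_celery", broker="redis://redis:6379/0")  # assumed broker

app.conf.beat_schedule = {
    "scrape-job-listings": {
        "task": "scraping.tasks.run_spiders",  # assumed task path
        "schedule": timedelta(days=int(os.getenv("SCRAPING_EVERY_DAYS", "7"))),
    },
}
```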
Repository layout:

```
Jobsite-Scraper-and-Analyzer/
├── analyzing/      # Data analysis and visualization modules
├── scraping/       # Web scraping spiders and pipelines
├── web_server/     # Flask web application
├── db/             # Database models and connections
├── main_celery/    # Celery task queue configuration
├── static/         # Static assets and visualizations
├── migrations/     # Database migration files
├── nginx/          # Nginx configuration
└── scripts/        # Utility scripts
```
- Web Scraping: Advanced techniques with Selenium, Scrapy, CSS/XPath selectors
- Data Analysis: Comprehensive data processing with Pandas, NLTK, and visualization
- Task Management: Distributed task processing with Celery and Redis
- Containerization: Production-ready Docker deployment
- API Development: RESTful API design with Flask
- DevOps: Nginx configuration and container orchestration
- Cloud Storage Integration: Migrate from local file storage to cloud solutions
- Database Optimization: Implement proper database storage for scraped data
- Container Optimization: Reduce Docker image sizes and improve build efficiency
- Code Quality: Implement comprehensive testing and SOLID principles
- Performance Monitoring: Add application performance monitoring and logging
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
bigdata5911 - [GitHub Profile](https://github.com/bigdata5911)
Built with ❤️ for the developer community
Empowering developers with data-driven career insights







