This repository contains Python scripts that scrape the websites of prestigious and highly prestigious academic awards. The goal is to collect information about award recipients, categories, and other relevant details.
- Scrapes data from multiple academic award websites
- Extracts information such as award names, recipients, categories, and years
- Stores data in a structured format (CSV, JSON, etc.)
- Handles website navigation and pagination
- Includes error handling and logging
- Python 3.8+
- BeautifulSoup4
- Requests
- Pandas
- Selenium (for dynamic content)
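The pagination, error handling, and logging features listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `scraper.py`: the function `scrape_all_pages`, the `fetch_page` callable, the `max_pages` limit, and the logger name are all hypothetical. The sketch assumes each page yields a list of records plus the URL of the next page.

```python
import logging
from typing import Callable, Dict, List, Optional, Tuple

logger = logging.getLogger("awards_scraper")  # hypothetical logger name

Record = Dict[str, str]
# A page fetcher returns the records found on one page and the URL of the
# next page, or None when there is no further page.
PageFetcher = Callable[[str], Tuple[List[Record], Optional[str]]]


def scrape_all_pages(
    start_url: str, fetch_page: PageFetcher, max_pages: int = 100
) -> List[Record]:
    """Follow next-page links from start_url and collect every record."""
    records: List[Record] = []
    seen = set()  # guards against pagination loops
    url: Optional[str] = start_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        try:
            page_records, next_url = fetch_page(url)
        except Exception:
            # Log the failure and stop this site instead of crashing the run.
            logger.exception("Failed to scrape %s", url)
            break
        logger.info("Scraped %d records from %s", len(page_records), url)
        records.extend(page_records)
        url = next_url
    return records
```

In a real run, `fetch_page` would presumably wrap an HTTP request and the HTML parsing for one award site; passing it in as a callable keeps the loop independent of any particular site layout.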
- Clone the repository:

      git clone https://github.com/yourusername/academic-awards-scraper.git
      cd academic-awards-scraper

- Create and activate a virtual environment:

      python -m venv venv
      source venv/bin/activate  # On Windows use `venv\Scripts\activate`

- Install the required packages:

      pip install -r requirements.txt

- Update the config.json file with the URLs of the award websites you want to scrape and other configuration details.

- Run the scraper:

      python scraper.py

- The scraped data will be saved in the output directory in the specified format.
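As a rough idea of what saving to the output directory in the configured format involves, here is a minimal sketch. It assumes records are dictionaries of strings and uses only the standard library `csv` and `json` modules to stay self-contained; the actual scripts may use Pandas instead. The function name `save_records` and the default file name `awards` are hypothetical.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List


def save_records(
    records: List[Dict[str, str]],
    output_dir: str,
    output_format: str,
    name: str = "awards",  # hypothetical base file name
) -> Path:
    """Write records to <output_dir>/<name>.<format> and return the path."""
    fmt = output_format.lower()
    if fmt not in ("csv", "json"):
        raise ValueError(f"Unsupported output format: {output_format!r}")

    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.{fmt}"

    if fmt == "json":
        with path.open("w", encoding="utf-8") as f:
            json.dump(records, f, ensure_ascii=False, indent=2)
    else:
        # Header row: the union of all keys, in first-seen order.
        fieldnames = list(dict.fromkeys(key for rec in records for key in rec))
        with path.open("w", encoding="utf-8", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(records)  # missing keys become empty cells
    return path
```

For example, `save_records(records, "output", "csv")` would write `output/awards.csv` under these assumptions.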
The config.json file should contain the following fields:
- urls: List of award website URLs to scrape
- output_format: Format to save the scraped data (e.g., CSV, JSON)
- log_level: Logging level (e.g., DEBUG, INFO, WARNING)
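A loader that reads and checks these three fields might look like the sketch below. The function name `load_config`, the validation rules, and the set of accepted log levels are assumptions for illustration; the actual scripts may treat missing or malformed fields differently.

```python
import json
from typing import Any, Dict

REQUIRED_FIELDS = ("urls", "output_format", "log_level")
# Standard level names from Python's logging module.
VALID_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def load_config(path: str = "config.json") -> Dict[str, Any]:
    """Read the config file and fail early if a field is missing or malformed."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)

    missing = [key for key in REQUIRED_FIELDS if key not in config]
    if missing:
        raise ValueError(f"config is missing fields: {', '.join(missing)}")

    if not isinstance(config["urls"], list) or not config["urls"]:
        raise ValueError("'urls' must be a non-empty list of URLs")

    if str(config["log_level"]).upper() not in VALID_LOG_LEVELS:
        raise ValueError(f"unknown log_level: {config['log_level']!r}")

    return config
```

Validating up front means a typo in the config surfaces before any website is contacted.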
Example config.json:
{
  "urls": [
    "https://example.com/award1",
    "https://example.com/award2"
  ],
  "output_format": "csv",
  "log_level": "INFO"
}

Contributions are welcome! Please fork the repository and submit a pull request with your changes.
This project is licensed under the MIT License. See the LICENSE file for details.