These scripts scrape DC Department of Health restaurant inspection report data from https://dc.healthinspections.us/?a=Inspections and store the results in the accompanying data files.
Here is the workflow that produces files in the output directory:
1. Run `01_scrape_inspection_links.py` to generate or update the `scraped_inspection_links.csv` file. Because this downloads the complete set of active links from the page above, running this script can take a little while.
2. Run `02_extract_inspection_data.py` to process those links in the `scraped_inspection_links.csv` file that have not already had their data extracted. This generates a local cache of HTML files, one for each link, and either creates the `inspection_summary_data.csv` and `violation_details_data.csv` files or appends the new data to them.
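
The "only process links that have not already been extracted, then create or append" behaviour of step 2 can be sketched roughly as follows. This is a minimal illustration and not the code in `02_extract_inspection_data.py`: the column name `inspection_link` and the helper names `read_column`, `pending_links` and `append_rows` are assumptions made for the example.

```python
import csv
from pathlib import Path


def read_column(csv_path, column):
    """Return the set of values found in `column`; empty if the file is missing."""
    path = Path(csv_path)
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as f:
        return {row[column] for row in csv.DictReader(f) if row.get(column)}


def pending_links(links_csv, summary_csv, column="inspection_link"):
    """List links from links_csv that do not yet appear in summary_csv.

    `column` is a hypothetical column name shared by both files.
    """
    done = read_column(summary_csv, column)
    with open(links_csv, newline="", encoding="utf-8") as f:
        return [row[column] for row in csv.DictReader(f) if row[column] not in done]


def append_rows(csv_path, rows, fieldnames):
    """Create csv_path with a header row if needed, otherwise append to it."""
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

Under these assumptions, a run would call `pending_links(...)`, fetch and parse each returned link, and pass the parsed rows to `append_rows(...)`, so re-running the script only touches links that are new since the last run.
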
Experimental alternative/additional steps:
1. Run `02alt_cache_potential_inspections.py` to sequentially scrape the range of known possible values of 'inspection_id' and generate a local cache of possible inspection reports. This generates or updates the `potential_inspection_ids.csv` file. Note that some of these may not be valid reports (for example, there are known broken duplicates on the server).
2. Run `03alt_extract_potential_inspection_data.py` to process all such potential inspection reports (including those cached by #1 above) as in #2 above. This produces the `potential_inspection_summary_data.csv` and `potential_violation_details_data.csv` files. The first of these has an additional column indicating whether the given id is known to be valid, meaning it has been linked to by the dc.healthinspections.us site, either in this scraping effort or in previous efforts.
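
The "known to be valid" column could be derived along these lines. This is only a sketch, not the logic in the `alt` scripts: it assumes that 'inspection_id' appears as a query-string parameter in the scraped links, the example URLs are placeholders, and the function names and the `known_valid` column name are hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def inspection_id_from_link(link, param="inspection_id"):
    """Return the integer id from the link's query string, or None if absent.

    Assumes the id is passed as a query parameter named `param`.
    """
    values = parse_qs(urlparse(link).query).get(param)
    if not values or not values[0].isdecimal():
        return None
    return int(values[0])


def flag_candidates(first_id, last_id, known_links):
    """Enumerate ids in [first_id, last_id] and mark those seen in known_links."""
    known = {inspection_id_from_link(link) for link in known_links}
    known.discard(None)  # links without a usable id carry no information
    return [
        {"inspection_id": i, "known_valid": i in known}
        for i in range(first_id, last_id + 1)
    ]


# Placeholder links; the real site's URL layout may differ.
example_links = [
    "https://example.invalid/report?inspection_id=3",
    "https://example.invalid/report?other=1",
]
rows = flag_candidates(2, 4, example_links)
```

Here only the row for id 3 is flagged as known valid; ids 2 and 4 remain candidates whose cached reports may or may not turn out to be real inspections.
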
Future versions of these scripts and data will resolve issues relating to duplicates and other invalid inspection reports.