Scrape, clean and structure United States campaign documents and debate transcripts from the UC Santa Barbara American Presidency Project.
This project gathers publicly available political content from presidency.ucsb.edu and turns it into tidy CSV datasets ready for analysis. It is built around two Jupyter notebooks, each handling a different kind of source material.
- `documents.ipynb` collects campaign documents: speeches, remarks, statements, interviews and addresses, along with their metadata and full text.
- `debates.ipynb` collects presidential and vice presidential debate transcripts, with participants, moderators and venue details pulled out into their own fields.
Everything here is sourced from material that is already public. The scrapers are rate limited and use polite request patterns so they sit lightly on the source server.
- Two focused collectors so campaign documents and debates each get extraction logic suited to their page layout.
- Metadata plus full text captured in one pass: dates, titles, speakers, document types, word counts and the complete content.
- Concurrent scraping with a thread pool to move through paginated results quickly without hammering the server.
- Resumable runs via a pickle cache and a JSON checkpoint, so a long collection can stop and pick up where it left off.
- Date normalisation that standardises mixed date formats into ISO timestamps.
- Retry and backoff built into the HTTP session to ride out the occasional network hiccup.
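
The retry and backoff behaviour mentioned in the last bullet could be configured roughly as follows. This is a minimal sketch rather than the notebooks' actual code: the function name `build_session`, the retry count, the backoff factor, the list of status codes and the User-Agent string are all assumptions chosen for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries=3, backoff_factor=0.5, pool_size=10):
    """Return a requests.Session that retries transient failures.

    All default values here are illustrative assumptions.
    """
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,  # wait grows between successive attempts
        status_forcelist=[429, 500, 502, 503, 504],  # also retry on these statuses
    )
    adapter = HTTPAdapter(
        max_retries=retry,
        pool_connections=pool_size,  # connection pooling, as listed in the stack
        pool_maxsize=pool_size,
    )
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    # Hypothetical identifying header; replace with your own contact details.
    session.headers.update({"User-Agent": "research-scraper (contact: you@example.org)"})
    return session
```

A session built this way can be shared by all requests in a run, so connection reuse and retries apply uniformly.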
| Layer | Tool |
|---|---|
| Language | Python 3.9+ |
| Environment | Jupyter Notebook |
| HTTP | requests with retry adapter and connection pooling |
| HTML parsing | BeautifulSoup4 (+ lxml) |
| Data handling | pandas |
| Concurrency | concurrent.futures.ThreadPoolExecutor |
| Progress | tqdm |
| Caching | pickle cache + JSON checkpoint |
Install the dependencies:

`pip install jupyter pandas requests beautifulsoup4 lxml tqdm`

Launch Jupyter and open whichever notebook you need:

`jupyter notebook`

Campaign documents (`documents.ipynb`)
- Run the cells in order. The first pass scrapes document metadata across the paginated category listing and writes `campaign_documents.csv`.
- The second pass reads that CSV back in and extracts the full text for each document, saving progress to `document_cache.pkl` and `extraction_checkpoint.json` as it goes.
- If a run is interrupted, just re-run the cells. The checkpoint lets it resume rather than start over.
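
One way the resume behaviour might work is sketched below. The file names come from this README, but everything else is an assumption: the helper names (`load_state`, `save_state`, `extract_all`), the `completed_urls` key, the `save_every` interval and the idea that the cache maps each URL to its extracted text are all hypothetical.

```python
import json
import pickle
from pathlib import Path


def load_state(cache_path, checkpoint_path):
    """Load previously extracted texts and the set of finished URLs."""
    cache_path, checkpoint_path = Path(cache_path), Path(checkpoint_path)
    cache = pickle.loads(cache_path.read_bytes()) if cache_path.exists() else {}
    done = set()
    if checkpoint_path.exists():
        done = set(json.loads(checkpoint_path.read_text())["completed_urls"])
    return cache, done


def save_state(cache, done, cache_path, checkpoint_path):
    """Persist the cache (pickle) and the checkpoint (JSON)."""
    Path(cache_path).write_bytes(pickle.dumps(cache))
    Path(checkpoint_path).write_text(json.dumps({"completed_urls": sorted(done)}))


def extract_all(urls, extract_one, cache_path="document_cache.pkl",
                checkpoint_path="extraction_checkpoint.json", save_every=50):
    """Extract every URL not already checkpointed, saving periodically."""
    cache, done = load_state(cache_path, checkpoint_path)
    pending = [u for u in urls if u not in done]  # skip work from earlier runs
    for i, url in enumerate(pending, start=1):
        cache[url] = extract_one(url)
        done.add(url)
        if i % save_every == 0:
            save_state(cache, done, cache_path, checkpoint_path)
    save_state(cache, done, cache_path, checkpoint_path)  # final flush
    return cache
```

Calling `extract_all` a second time with the same paths only processes URLs that are missing from the checkpoint, which is what makes re-running the cells cheap.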
Debates (`debates.ipynb`)
- Run the cells in order to scrape the debate listing into `debates_data.csv`.
- The processing cells then pull each transcript apart into participants, moderators and content, writing `debates_data_processed.csv`.
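
To give a feel for what pulling a transcript apart can involve, here is a small sketch that splits text into speaker turns. It assumes each turn starts on a new line with an upper-case speaker label followed by a colon (for example `MODERATOR:`). Real transcripts on the site may be laid out differently, and the names `TURN_RE`, `split_turns` and `unique_speakers` are hypothetical, not taken from the notebook.

```python
import re

# Assumed layout: a turn begins a line with an upper-case label and a colon.
TURN_RE = re.compile(r"^([A-Z][A-Z .'\-]+):\s*", re.MULTILINE)


def split_turns(transcript):
    """Return a list of (speaker, text) tuples in order of appearance."""
    parts = TURN_RE.split(transcript)
    # parts looks like [preamble, speaker1, text1, speaker2, text2, ...]
    turns = []
    for speaker, text in zip(parts[1::2], parts[2::2]):
        # Collapse internal whitespace so multi-line turns become one string.
        turns.append((speaker.strip(), " ".join(text.split())))
    return turns


def unique_speakers(turns):
    """Speaker labels in first-appearance order, without duplicates."""
    return list(dict.fromkeys(speaker for speaker, _ in turns))
```

Separating participants from moderators would need an extra rule on top of this, such as matching labels against names listed on the debate page.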
You can tune the worker count, rate-limit delays and page size near the top of each notebook to match your machine and how gently you want to treat the source.
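
As a rough illustration of how a worker count and a rate-limit delay can interact, the sketch below spaces out requests across a thread pool. The constant names `MAX_WORKERS` and `MIN_DELAY_SECONDS`, their values, the `RateLimiter` class and the `fetch_pages` helper are assumptions for this example; the notebooks may use different names and a different throttling scheme.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tuning knobs; the notebooks' real names and defaults may differ.
MAX_WORKERS = 4
MIN_DELAY_SECONDS = 0.5


class RateLimiter:
    """Space out calls across all threads by at least `min_delay` seconds."""

    def __init__(self, min_delay):
        self.min_delay = min_delay
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            sleep_for = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_delay
        if sleep_for:
            time.sleep(sleep_for)


def fetch_pages(page_numbers, fetch_page, max_workers=MAX_WORKERS,
                min_delay=MIN_DELAY_SECONDS):
    """Fetch pages concurrently, returning results in input order."""
    limiter = RateLimiter(min_delay)

    def task(page):
        limiter.wait()
        return fetch_page(page)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, page_numbers))
```

With this design, raising the worker count hides network latency, while the delay still caps how often requests start, so the two settings can be tuned independently.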
| File | What it holds |
|---|---|
| `campaign_documents.csv` | Campaign documents with metadata and full text |
| `documents_processed_optimized.csv` | Processed campaign documents |
| `debates_data.csv` | Raw debate listing (date, title, links) |
| `debates_data_processed.csv` | Debate transcripts split into participants, moderators and content |
Campaign document fields include the URL, publication date (ISO), title, speaker and speaker title, document type, location, full content, word count and extraction status.
Debate fields include the date, title, participant and moderator lists, full transcript text and HTML, and video availability.
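
The ISO dates in these fields come from normalising whatever date strings appear on the source pages. A minimal sketch of that idea is shown below. The list of input formats and the function name `to_iso` are assumptions, and month-name parsing relies on an English locale.

```python
from datetime import datetime

# Assumed input formats; extend the tuple if the source uses others.
CANDIDATE_FORMATS = ("%B %d, %Y", "%b %d, %Y", "%Y-%m-%d", "%m/%d/%Y")


def to_iso(raw):
    """Return an ISO 8601 timestamp string, or None if no format matches."""
    text = " ".join(str(raw).split())  # collapse stray whitespace
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None
```

Returning `None` for unparseable values, rather than raising, keeps a long run going and leaves the failures easy to find afterwards.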
The resulting datasets suit a range of political science and computational social science work, such as political speech and rhetoric analysis, tracking how campaign messaging shifts over time, debate analysis, and natural language processing on political text.
- All data is collected from publicly available sources.
- Rate limiting is in place to keep the load on the source server light.
- The datasets are intended for research and educational use. Please follow the source site's terms of service and any relevant institutional policies.