Political Data Collection System

Scrape, clean and structure United States campaign documents and debate transcripts from the UC Santa Barbara American Presidency Project.

Overview

This project gathers publicly available political content from presidency.ucsb.edu and turns it into tidy CSV datasets ready for analysis. It is built around two Jupyter notebooks, each handling a different kind of source material.

documents.ipynb collects campaign documents: speeches, remarks, statements, interviews and addresses, along with their metadata and full text.
debates.ipynb collects presidential and vice presidential debate transcripts, with participants, moderators and venue details pulled out into their own fields.

Everything here is sourced from material that is already public. The scrapers are rate limited and use polite request patterns so they sit lightly on the source server.

Highlights

Two focused collectors so campaign documents and debates each get extraction logic suited to their page layout.
Metadata and full text captured together: dates, titles, speakers, document types, word counts and the complete content.
Concurrent scraping with a thread pool to move through paginated results quickly without hammering the server.
Resumable runs via a pickle cache and a JSON checkpoint, so a long collection can stop and pick up where it left off.
Date normalisation that standardises mixed date formats into ISO timestamps.
Retry and backoff built into the HTTP session to ride out the occasional network hiccup.
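
As a rough illustration of the retry and backoff idea, here is a minimal sketch of a `requests` session with a retry adapter and connection pooling. The function name `build_session`, the retry counts, the status codes and the User-Agent string are all assumptions made for this example; the notebooks may configure their session differently.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries: int = 3, backoff: float = 0.5) -> requests.Session:
    """Return a Session that retries transient failures with exponential backoff."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff,  # the wait between attempts grows with each retry
        status_forcelist=(429, 500, 502, 503, 504),  # assumed list of retryable statuses
    )
    # Mounting one adapter also gives connection pooling for repeated requests.
    adapter = HTTPAdapter(max_retries=retry, pool_connections=10, pool_maxsize=10)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    # Hypothetical User-Agent: replace with your own project name and contact.
    session.headers.update({"User-Agent": "research-scraper (contact: you@example.org)"})
    return session
```

A session built this way can be shared by every request in a run, so retries and pooled connections apply everywhere without extra code at each call site.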

Tech Stack

| Layer | Tool |
| --- | --- |
| Language | Python 3.9+ |
| Environment | Jupyter Notebook |
| HTTP | `requests` with retry adapter and connection pooling |
| HTML parsing | `BeautifulSoup4` (+ `lxml`) |
| Data handling | `pandas` |
| Concurrency | `concurrent.futures.ThreadPoolExecutor` |
| Progress | `tqdm` |
| Caching | `pickle` cache + JSON checkpoint |

Getting Started

Prerequisites

Install the dependencies:

pip install jupyter pandas requests beautifulsoup4 lxml tqdm

Running the collectors

Launch Jupyter and open whichever notebook you need:

jupyter notebook

Campaign documents (documents.ipynb)

Run the cells in order. The first pass scrapes document metadata across the paginated category listing and writes campaign_documents.csv.
The second pass reads that CSV back in and extracts the full text for each document, saving progress to document_cache.pkl and extraction_checkpoint.json as it goes.
If a run is interrupted, just re-run the cells. The checkpoint lets it resume rather than start over.
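
The resume behaviour described above can be sketched roughly as follows. Only the file names `document_cache.pkl` and `extraction_checkpoint.json` come from this project; the function names, the `{"done": [...]}` checkpoint layout and the `save_every` interval are assumptions for illustration, not the notebook's actual format.

```python
import json
import pickle
from pathlib import Path


def load_state(cache_path: Path, checkpoint_path: Path):
    """Load cached texts and the set of finished URLs, or start empty."""
    cache = pickle.loads(cache_path.read_bytes()) if cache_path.exists() else {}
    done = set()
    if checkpoint_path.exists():
        done = set(json.loads(checkpoint_path.read_text())["done"])
    return cache, done


def save_state(cache, done, cache_path: Path, checkpoint_path: Path) -> None:
    """Persist the text cache as a pickle and the finished URLs as JSON."""
    cache_path.write_bytes(pickle.dumps(cache))
    checkpoint_path.write_text(json.dumps({"done": sorted(done)}))


def extract_all(urls, fetch_text, cache_path: Path, checkpoint_path: Path,
                save_every: int = 25):
    """Fetch text for each URL, skipping anything an earlier run finished."""
    cache, done = load_state(cache_path, checkpoint_path)
    pending = [u for u in urls if u not in done]
    for count, url in enumerate(pending, start=1):
        cache[url] = fetch_text(url)  # fetch_text is supplied by the caller
        done.add(url)
        if count % save_every == 0:
            save_state(cache, done, cache_path, checkpoint_path)
    save_state(cache, done, cache_path, checkpoint_path)  # final flush
    return cache
```

Calling `extract_all` a second time with the same paths only fetches URLs missing from the checkpoint, which is what makes re-running the cells safe after an interruption.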

Debates (debates.ipynb)

Run the cells in order to scrape the debate listing into debates_data.csv.
The processing cells then pull each transcript apart into participants, moderators and content, writing debates_data_processed.csv.
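
To give a feel for the processing step, here is a small sketch that splits a transcript into participants, moderators and content. It assumes a layout where the transcript opens with `PARTICIPANTS:` and `MODERATOR:` labels and where speaker turns begin with an upper-case name and a colon. That layout, the regular expressions and the function name `split_transcript` are assumptions for this example; real transcripts vary and the notebook's logic may differ.

```python
import re

# Assumed header labels; a name may follow on the same line or on later lines.
LABEL = re.compile(r"^(PARTICIPANTS?|MODERATORS?):\s*(.*)$")
# Assumed speaker-turn marker, for example "SMITH:" at the start of a line.
TURN = re.compile(r"^[A-Z][A-Z .'\-]+:")
# Trailing separators left over when names are listed across several lines.
TRAILER = re.compile(r"(,|;|\s+and)+$")


def split_transcript(text: str) -> dict:
    """Split a transcript into participant names, moderator names and content."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    fields = {"participants": [], "moderators": []}
    current = None           # header field currently being filled
    body_start = len(lines)  # index where the spoken content begins
    for i, line in enumerate(lines):
        label = LABEL.match(line)
        if label:
            is_participant = label.group(1).startswith("PART")
            current = "participants" if is_participant else "moderators"
            line = label.group(2)  # a name may share the label's line
        elif current is None or TURN.match(line):
            body_start = i  # first line that is not part of the header
            break
        name = TRAILER.sub("", line).strip()
        if name:
            fields[current].append(name)
    fields["content"] = "\n".join(lines[body_start:])
    return fields
```

If a transcript has no recognisable header, both lists stay empty and the whole text ends up in `content`, so nothing is silently dropped.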

You can tune the worker count, rate-limit delays and page size near the top of each notebook to match your machine and how gently you want to treat the source.
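
For orientation, the sketch below shows one way a worker count and a rate-limit delay can work together in a thread pool. The names `MAX_WORKERS`, `REQUEST_DELAY`, `RateLimiter` and `scrape_pages`, and their default values, are hypothetical and chosen for this example; check the top of each notebook for the actual settings.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4      # hypothetical setting: number of threads
REQUEST_DELAY = 0.5  # hypothetical setting: minimum seconds between request starts


class RateLimiter:
    """Space out calls across all threads by at least `delay` seconds."""

    def __init__(self, delay: float):
        self.delay = delay
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def wait(self) -> None:
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.delay
        time.sleep(max(0.0, slot - now))


def scrape_pages(page_numbers, fetch_page, max_workers=MAX_WORKERS,
                 delay=REQUEST_DELAY):
    """Fetch pages concurrently while keeping request starts `delay` apart."""
    limiter = RateLimiter(delay)

    def task(page):
        limiter.wait()
        return fetch_page(page)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, whatever order they finish in.
        return list(pool.map(task, page_numbers))
```

With this design, raising the worker count helps only while responses are slower than the delay; the delay alone caps how often the source server is contacted.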

Datasets

| File | What it holds |
| --- | --- |
| `campaign_documents.csv` | Campaign documents with metadata and full text |
| `documents_processed_optimized.csv` | Processed campaign documents |
| `debates_data.csv` | Raw debate listing (date, title, links) |
| `debates_data_processed.csv` | Debate transcripts split into participants, moderators and content |

Campaign document fields include the URL, publication date (ISO), title, speaker and speaker title, document type, location, full content, word count and extraction status.

Debate fields include the date, title, participant and moderator lists, full transcript text and HTML, and video availability.
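
The ISO dates in these fields come from normalising mixed date formats. A minimal sketch of that idea, using only the standard library, is shown below. The list of input formats and the function name `normalise_date` are assumptions for illustration; the formats the notebooks actually handle are not listed in this README.

```python
from datetime import datetime
from typing import Optional

# Assumed input formats, tried in order.
KNOWN_FORMATS = ("%B %d, %Y", "%b %d, %Y", "%Y-%m-%d", "%m/%d/%Y")


def normalise_date(raw: str) -> Optional[str]:
    """Return an ISO 8601 timestamp for a recognised date string, else None."""
    text = " ".join(raw.split())  # collapse stray whitespace from the HTML
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).isoformat()
        except ValueError:
            continue  # not this format, try the next one
    return None
```

Returning `None` for unrecognised input, rather than raising, lets a long run continue and leaves the unparsed rows easy to find afterwards.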

Research Applications

The resulting datasets suit a range of political science and computational social science work, such as political speech and rhetoric analysis, tracking how campaign messaging shifts over time, debate analysis, and natural language processing on political text.

Legal and Ethical Notes

All data is collected from publicly available sources.
Rate limiting is in place to keep the load on the source server light.
The datasets are intended for research and educational use. Please follow the source site's terms of service and any relevant institutional policies.

License

Intended for research and educational use. Please ensure compliance with relevant institutional policies and the source site's data usage guidelines.
