A framework for creating and storing FollowTheMoney entities, used by OpenLobbying.
Warning: This is a work in progress. Expect breaking changes and incomplete features.
Muckrake is the data pipeline behind OpenLobbying. It is partially inspired by zavod and other FollowTheMoney tools.
Run uv run muckrake --help for a full list of available commands.
You can find crawlers for various datasets in datasets/. At a minimum, each dataset consists of a config.yml with metadata and a crawl.py script that outputs FollowTheMoney statements in CSV format.
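As a rough illustration of the crawl.py half of a dataset, the sketch below writes a few statement rows to CSV using only the Python standard library. The column names, the write_statements and example_statements helpers, and the example entity are assumptions made for illustration. The columns Muckrake actually expects, and any helper API it provides to crawlers, are not described here and may differ.

```python
import csv
import io

# Hypothetical column set; the real statement format may carry more fields.
COLUMNS = ["entity_id", "schema", "prop", "value", "dataset"]


def write_statements(rows, out):
    """Write statement dicts to `out` as CSV, starting with a header row."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


def example_statements(dataset):
    """Yield one statement per property value of a single made-up entity."""
    entity_id = f"{dataset}-acme-ltd"
    base = {"entity_id": entity_id, "schema": "Company", "dataset": dataset}
    yield {**base, "prop": "name", "value": "Acme Ltd"}
    yield {**base, "prop": "jurisdiction", "value": "gb"}


if __name__ == "__main__":
    buffer = io.StringIO()
    write_statements(example_statements("example_dataset"), buffer)
    print(buffer.getvalue(), end="")
```

The point of the statement shape is that each row carries one property value for one entity, so an entity with several properties becomes several rows.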
To crawl a dataset, run uv run muckrake crawl {dataset_name}. Run uv run muckrake list to see available datasets.
Each crawl now creates a dataset_runs record in Postgres and stores immutable artifacts under MUCKRAKE_ARTIFACT_PATH (defaults to data/artifacts). The latest successful run remains mirrored into data/datasets/{name}/statements.pack.csv for local compatibility.
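The two locations mentioned above can be resolved with a few lines of Python. This is a minimal sketch, not Muckrake's own code: the function names are hypothetical, and treating an empty MUCKRAKE_ARTIFACT_PATH as unset is an assumption. The layout inside the artifact directory is not documented here, so the sketch only resolves its root.

```python
import os
from pathlib import Path

# Default stated in this README.
DEFAULT_ARTIFACT_PATH = "data/artifacts"


def artifact_root(env=None):
    """Return the artifact root, honouring MUCKRAKE_ARTIFACT_PATH if set.

    Assumption: an empty value is treated the same as an unset one.
    """
    env = os.environ if env is None else env
    return Path(env.get("MUCKRAKE_ARTIFACT_PATH") or DEFAULT_ARTIFACT_PATH)


def mirrored_statements_path(name):
    """Return the local mirror of the latest successful run for a dataset."""
    return Path("data/datasets") / name / "statements.pack.csv"


if __name__ == "__main__":
    print(artifact_root())
    print(mirrored_statements_path("open_access"))
```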
Many data sources have composite fields that contain multiple entities. We use LLMs to extract unique entities and relationships from these fields, and store them as candidates in the database for review and approval. See NER docs for details.
# Create extraction candidates for one dataset
uv run muckrake ner-extract open_access --extractor llm --limit 50
# Review candidates in a terminal UI
uv run muckrake ner-review
Our goal is to link entities across datasets to provide a unified view of lobbying and political finance for any given person, company, or organisation.
# Create dedupe candidates across all datasets
uv run muckrake xref
# Review candidates in a terminal UI
uv run muckrake dedupe
We also want to collapse duplicate relationship edges across datasets, especially for ORCL and PRCA. This is done automatically; no review step is required.
uv run muckrake dedupe-edges
Statements are loaded into Postgres with uv run muckrake load. This reads the statements CSV files and applies any approved NER candidates before materialising entities and relationships.
To load from a specific immutable crawl snapshot instead of the local workspace copy:
uv run muckrake load gb_political_finance --run-id 123
For the published site, prefer the release workflow instead of loading directly into the serving database:
uv run muckrake release-build
uv run muckrake release-publish 1
The primary user of Muckrake data is OpenLobbying, an open database of lobbying and political finance data.
Start the API server:
uv run muckrake server
Start the Svelte frontend:
cd openlobbying
npm run dev
In development, frontend requests to /api/* are proxied to http://127.0.0.1:8000 via Vite.
- Set MUCKRAKE_DATABASE_URL. .env is loaded automatically from the repo root.
- Optional: set MUCKRAKE_ARTIFACT_PATH to control where immutable run artifacts are stored locally.
- Set MUCKRAKE_PUBLISHED_DATABASE_URL to a separate published Postgres database used by the API.
- Example:
export MUCKRAKE_DATABASE_URL="postgresql+psycopg://muckrake:password@127.0.0.1:5432/muckrake"
export MUCKRAKE_PUBLISHED_DATABASE_URL="postgresql+psycopg://muckrake:password@127.0.0.1:5432/muckrake_published"
export MUCKRAKE_ARTIFACT_PATH="data/artifacts"
- VPS guide and templates: docs/deploy/README.md
- One-command deploy (code + data): ./scripts/deploy_to_vps.sh {ip_address}
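Before deploying or starting the API, it can help to sanity-check the environment variables described in the configuration list. The sketch below is a hypothetical standalone helper, not part of Muckrake: it only inspects the process environment (it does not read .env itself) and assumes that both database URLs must be set and must point at different databases, as the configuration notes suggest.

```python
import os

# Variable names taken from this README; both are assumed to be required.
REQUIRED = ("MUCKRAKE_DATABASE_URL", "MUCKRAKE_PUBLISHED_DATABASE_URL")


def check_environment(env=None):
    """Return a list of problems; an empty list means the settings look sane."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    # Only compare the URLs once we know both are present.
    if not problems and env[REQUIRED[0]] == env[REQUIRED[1]]:
        problems.append("working and published database URLs are identical")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(f"config problem: {problem}")
```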