A Techblurbs initiative to liberate Kenyan government financial data from "dirty" PDFs and turn it into actionable, machine-readable insights.
Public financial data in Kenya is often published in complex, inconsistently formatted PDF reports across various government portals. GovSpend-KE is a civic tech project designed to automate the collection, extraction, and visualization of this data, providing a clear view of how national and county budgets are actually implemented.
The project is structured into four main phases:
- Ingestion: Automated scrapers that monitor and download reports from the National Treasury and the Office of the Controller of Budget (OCOB).
- Extraction & Cleaning: OCR and table-parsing libraries that pull data from "stubborn" PDFs and standardize it into a unified format.
- Storage & Versioning: A relational database to store historical records, allowing us to track changes and revisions over time.
- Presentation: A web dashboard for visualizations and an API/Export tool for CSV access.
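As a rough illustration of the Storage & Versioning phase, the sketch below appends a new revision whenever a figure is re-published instead of overwriting the old value. It uses Python's built-in `sqlite3` as a stand-in for the project's relational database, and the table name, column names, helper functions, and sample values are all hypothetical, not the project's actual schema or data.

```python
import sqlite3

# Hypothetical schema for illustration only; not the project's real schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS budget_figures (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    entity      TEXT    NOT NULL,
    fiscal_year TEXT    NOT NULL,
    line_item   TEXT    NOT NULL,
    amount_kes  INTEGER NOT NULL,
    revision    INTEGER NOT NULL,
    UNIQUE (entity, fiscal_year, line_item, revision)
);
"""


def record_figure(conn, entity, fiscal_year, line_item, amount_kes):
    """Store a figure as a new revision and return its revision number."""
    row = conn.execute(
        "SELECT COALESCE(MAX(revision), 0) FROM budget_figures "
        "WHERE entity = ? AND fiscal_year = ? AND line_item = ?",
        (entity, fiscal_year, line_item),
    ).fetchone()
    revision = row[0] + 1
    conn.execute(
        "INSERT INTO budget_figures "
        "(entity, fiscal_year, line_item, amount_kes, revision) "
        "VALUES (?, ?, ?, ?, ?)",
        (entity, fiscal_year, line_item, amount_kes, revision),
    )
    conn.commit()
    return revision


def history(conn, entity, fiscal_year, line_item):
    """Return every stored revision of a figure, oldest first."""
    return conn.execute(
        "SELECT revision, amount_kes FROM budget_figures "
        "WHERE entity = ? AND fiscal_year = ? AND line_item = ? "
        "ORDER BY revision",
        (entity, fiscal_year, line_item),
    ).fetchall()


# Made-up values, purely to show the revision trail.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
record_figure(conn, "Example County", "FY-X", "Health", 1000)
record_figure(conn, "Example County", "FY-X", "Health", 1250)
print(history(conn, "Example County", "FY-X", "Health"))  # [(1, 1000), (2, 1250)]
```

Keeping old rows rather than updating them in place is one simple way to make revisions auditable; the real schema may track changes differently.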
Accuracy is the backbone of this project. Because PDF extraction can be "noisy" and government reports can contain internal errors, the following protocols are mandatory:
- Multi-Source Reconciliation: Data must be cross-referenced between Treasury "Exchequer Releases" and OCOB "Implementation Reports" to identify discrepancies.
- Vertical/Horizontal Totals: Automated scripts will verify that the sum of sub-categories matches the "Total" figures printed in the source documents.
- Human-in-the-Loop: High-variance figures or "broken" tables will be flagged for manual audit by the data team before being committed to the production database.
- Source Linking: Every data point in the DB will maintain a link back to the original source PDF and the specific page number for transparency.
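The totals check and the cross-source reconciliation above could be sketched roughly as follows. This is a minimal illustration, not the project's audit code: the function names, the tolerance defaults, and the shape of the inputs (plain dictionaries keyed by line item) are all assumptions.

```python
from decimal import Decimal


def check_total(parts, printed_total, tolerance=Decimal("0")):
    """Compare the sum of sub-category figures with the printed total.

    Returns (ok, difference), where difference = sum(parts) - printed_total.
    """
    difference = sum(parts, Decimal("0")) - printed_total
    return abs(difference) <= tolerance, difference


def reconcile(treasury, ocob, rel_tolerance=Decimal("0.01")):
    """Flag line items where two sources disagree or one source lacks the item.

    Both arguments map a line-item key to a Decimal amount. Returns a list of
    (key, treasury_value, ocob_value, reason) tuples meant for manual audit.
    """
    flagged = []
    for key in sorted(set(treasury) | set(ocob)):
        a = treasury.get(key)
        b = ocob.get(key)
        if a is None or b is None:
            flagged.append((key, a, b, "missing in one source"))
            continue
        baseline = max(abs(a), abs(b))
        if baseline == 0:
            continue  # both figures are zero, nothing to compare
        if abs(a - b) / baseline > rel_tolerance:
            flagged.append((key, a, b, "figures differ"))
    return flagged


# Made-up figures, purely to show the return values.
ok, diff = check_total([Decimal("40"), Decimal("59")], Decimal("100"))
print(ok, diff)  # False -1
```

`Decimal` is used instead of `float` so that sums of currency amounts compare exactly. Anything returned by `reconcile` would go to the manual audit queue rather than straight into the production database.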
| Team | Responsibilities |
|---|---|
| Data Engineering | PDF scrapers, ETL pipelines, and DB schema management. |
| Data Audit | Verification logic, reconciliation, and domain mapping. |
| Front End | Visualization (D3.js/Chart.js), UI/UX, and CSV export tools. |
| DevOps | Automation of data harvests and hosting infrastructure. |
- Clone the repository:

      git clone <repository-url>
      cd gov-spend-KE

- Set up the virtual environment:

      python3 -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt

- Configure environment variables: Copy .env.example to .env and update the database credentials.

      cp .env.example .env

- Run the initial harvester:

      python3 run_harvester.py

If you have Docker installed, you can skip the manual setup:
- Configure environment variables:

      cp .env.example .env

- Start the services:

      docker compose up -d

This will automatically spin up a PostgreSQL database and the harvester application with all system dependencies (OCR, etc.) pre-installed.
.
├── data/ # Data storage
│ ├── raw/ # Original PDF reports (ignored by git)
│ └── processed/ # Extracted CSV/JSON data
├── src/ # Source code
│ ├── ingestion/ # Scrapers and downloaders
│ ├── extraction/ # PDF parsing and OCR logic
│ ├── storage/ # DB schema and models
│ ├── audit/ # Verification and reconciliation
│ ├── presentation/ # API and Dashboard code
│ └── utils/ # Shared utilities
├── tests/ # Unit and integration tests
├── .env.example # Environment template
├── .gitignore # Git exclusion rules
├── Dockerfile # Docker image configuration
├── docker-compose.yml # Multi-container orchestration
├── requirements.txt # Project dependencies
└── run_harvester.py # Main pipeline entry point
Disclaimer: GovSpend-KE is an independent project by Techblurbs and is not an official government platform. We aim for 100% accuracy through rigorous auditing.