OGA-budget-lens
OGA-budget-lens is an AI-assisted tool for turning complex and unstructured government budget documents (PDFs) into structured, auditable, and human-verifiable data, with a focus on African public finance and accountability.
The project prioritizes trust, provenance, and reviewability over opaque automation.
This repository is at an early stage and focuses on laying strong technical and governance foundations.
Across many African countries, government budgets are published as:
- Scanned or poorly formatted PDFs
- Documents with weak or inconsistent structure
- Multi-language texts (English, French, Portuguese, and others)
- Files that are difficult to compare across years or countries
Most existing extraction approaches:
- Lose links to original sources
- Hide uncertainty or errors
- Rely on AI systems that infer missing fiscal data
Budget Lens takes a different approach:
every extracted number must be traceable, reviewable, and correctable by a human.
- Provenance is non-negotiable
- Human verification is first-class
- AI assistance is constrained and auditable
- Ambiguity must be surfaced, not hidden
- Cross-country comparison is a design requirement
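One way to picture these principles is a record that keeps every extracted number tied to its source location, its confidence, and its review state. The sketch below is illustrative only: the project's schemas are still being defined, and every name in it (`Provenance`, `ExtractedAmount`, `ReviewStatus`, the 0.9 threshold, the example URL) is a hypothetical choice, not part of the repository.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReviewStatus(Enum):
    UNREVIEWED = "unreviewed"
    CONFIRMED = "confirmed"
    CORRECTED = "corrected"


@dataclass(frozen=True)
class Provenance:
    """Where a value came from in the source document (hypothetical fields)."""
    source_url: str
    page: int                                 # 1-based page number
    bbox: tuple[float, float, float, float]   # (x0, y0, x1, y1) on that page
    method: str                               # e.g. "digital" or "ocr"
    confidence: float                         # 0.0 to 1.0


@dataclass
class ExtractedAmount:
    label: str
    raw_text: str                  # text exactly as extracted, never overwritten
    value: Optional[float]         # None when parsing failed; nothing is inferred
    provenance: Provenance
    status: ReviewStatus = ReviewStatus.UNREVIEWED
    corrected_value: Optional[float] = None
    reviewer: Optional[str] = None

    def needs_review(self, threshold: float = 0.9) -> bool:
        """Surface ambiguity: unparsed or low-confidence values go to a human."""
        return self.status is ReviewStatus.UNREVIEWED and (
            self.value is None or self.provenance.confidence < threshold
        )

    def correct(self, new_value: float, reviewer: str) -> None:
        """Record a human correction without discarding the original extraction."""
        self.corrected_value = new_value
        self.reviewer = reviewer
        self.status = ReviewStatus.CORRECTED

    @property
    def final_value(self) -> Optional[float]:
        if self.status is ReviewStatus.CORRECTED:
            return self.corrected_value
        return self.value


# Example: an OCR misread ("O" for "0") is flagged, then fixed by a reviewer.
prov = Provenance("https://example.org/budget.pdf", 12, (72.0, 300.0, 140.0, 312.0), "ocr", 0.62)
item = ExtractedAmount("Health", "1,2O0", None, prov)
print(item.needs_review())   # True
item.correct(1200.0, "reviewer-a")
print(item.final_value)      # 1200.0
```

Keeping `raw_text` and `value` untouched after a correction means the audit trail shows both what the pipeline extracted and what the reviewer changed.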
This repository will evolve to include:
- A provenance-aware parsing pipeline for budget PDFs
- A canonical budget line item data model
- Validation and quality assurance tooling
- Human-in-the-loop review workflows
- Standardized export formats for reuse
At present, the repository focuses on:
- Architecture and data model design
- Defining safety and trust constraints
- Preparing for initial implementation work
- No production pipeline yet
- No fixed development environment
- Active design and early implementation phase
Early contributors will help define:
- Core schemas
- Tooling choices
- Validation rules
- Repository structure
This project follows an individual-ownership, collaborative-development model across all phases of work and maintains explicit attribution for all contributors. Contributors are encouraged to collaborate through discussion, reviews, and coordination at every stage of the project. However, all implemented work must have clearly attributable ownership.
Each contributor is credited with the specific components, tasks, or deliverables they owned or led. Participation, discussion, or review alone does not imply ownership.
| Contributor | Role / Focus Area | Owned Deliverables |
|---|---|---|
| Lochit-Vinay (@Lochit-Vinay) | Data / Research | Added publicly available African government budget PDFs for testing; documented official source URLs (PR #8) |
| Lochit-Vinay (@Lochit-Vinay) | Backend / Data Processing | Implemented PDF type detection (scanned vs digital) with confidence scoring, structured logging, error handling improvements, and unit tests; addressed review feedback and hardened the implementation (PR #10) |
| Lochit-Vinay (@Lochit-Vinay) | Backend / Data Processing | Implemented page-level text extraction pipeline with OCR fallback for mixed-format PDFs; added digital extraction using PyMuPDF, OCR extraction using pytesseract, token-level spatial metadata (bounding boxes & confidence), provenance metadata, and structured JSON outputs for downstream layout/table analysis (PR #28) |
| Lochit-Vinay (@Lochit-Vinay) | Backend / Data Processing | Designed, raised, proposed, and implemented an advanced table detection enhancement leveraging token-level spatial metadata; introduced multi-stage structural validation (row clustering via vertical proximity, column alignment via horizontal consistency, and semantic filtering) to eliminate paragraph misclassification; significantly improved precision of table extraction across diverse PDF formats (scanned + digital); evaluated on real-world African government budget datasets; extended the OCR + layout-aware pipeline (PR #32) |
| Arvinder-Singh-Dhoul (@arvinder004) | Dependency/Infrastructure | Resolved camelot-py dependency conflicts and added system packages (PR #22) |
| Arvinder-Singh-Dhoul (@arvinder004) | Dependency/Infrastructure (backend/infra) | Fixed a cross-platform dependency bug (PR #25) |
| Name / GitHub | Backend, Frontend, Data, Infra, Research | Clearly scoped features, services, or setup tasks |
| Divyanshu-Off | Infra | CI/CD pipeline for the project |
| Divyanshu-Off | Backend, Infra | Containerization of parsing environment (Docker & Compose), FastAPI scaffolding, and contributor documentation |
This table must be kept up to date as the project evolves, from Phase 0 through final delivery. Phase-level credit is insufficient on its own; ownership must always be traceable to concrete deliverables, from initial scaffolding (Phase 0) through final handover.
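For readers unfamiliar with the "row clustering via vertical proximity" step named in the table above, here is a rough illustration of the general idea. It is not the code from PR #32: the `Token` class, the `cluster_rows` function, and the 3-point tolerance are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """A word with its bounding box on the page (hypothetical structure)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def y_center(self) -> float:
        return (self.y0 + self.y1) / 2


def cluster_rows(tokens: list[Token], y_tolerance: float = 3.0) -> list[list[Token]]:
    """Group tokens into rows when their vertical centers are close together."""
    rows: list[list[Token]] = []
    for tok in sorted(tokens, key=lambda t: t.y_center):
        # Compare against the most recently added token in the current row.
        if rows and abs(tok.y_center - rows[-1][-1].y_center) <= y_tolerance:
            rows[-1].append(tok)
        else:
            rows.append([tok])
    # Order each row left to right so columns can be compared afterwards.
    return [sorted(row, key=lambda t: t.x0) for row in rows]


tokens = [
    Token("1,200", 300, 101, 340, 111),
    Token("Health", 50, 100, 90, 110),
    Token("900", 300, 121, 330, 131),
    Token("Education", 50, 120, 110, 130),
]
for row in cluster_rows(tokens):
    print([t.text for t in row])
# ['Health', '1,200']
# ['Education', '900']
```

A real pipeline would still need the column-alignment and semantic checks the table mentions, since paragraphs also form neat rows.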
Clarification on Collaboration and Ownership (All Phases)
From Phase 0 through the final phase, contributors may not jointly claim the same implementation output unless responsibilities are explicitly separated and documented. Collaboration should strengthen implementation quality, not dilute accountability.
A task or feature is considered complete only when all of the following are satisfied:
- Code is clean, readable, and compliant with project linting and formatting rules.
- Appropriate unit and/or integration tests are included and CI passes.
- Relevant documentation is updated (README, ARCHITECTURE, API docs where applicable).
- Database migrations are provided and reviewed if schema changes are involved.
- The Pull Request has received at least one peer review and maintainer approval.
This project may be developed in part through tech programs. If you are contributing through GSoC, MLH, Outreachy, etc., please find your project standard here and roadmap here. If this becomes obsolete, please raise an issue.
Contributors are expected to:
- Build reusable, well-documented components
- Respect long-term maintenance needs
- Treat programs as an entry point, not a finish line
The roadmap and contribution guidelines are designed for continuity beyond any single program.
GSoC compatibility: Contributors may collaborate through discussion and peer review, but all submitted work must have clear individual ownership and be attributable to a single contributor for evaluation.
These guidelines apply from Phase 0 through final project delivery.
Maintainers are responsible for ensuring clear ownership and accountability throughout the project lifecycle. When reviewing work, maintainers should verify that:
- Every pull request has a clearly identifiable primary owner.
- Each deliverable, regardless of phase, is attributable to a specific contributor.
- The README “Contributors & Roles” section reflects actual implementation ownership, not participation alone.
- Multiple contributors are not credited for the same deliverable unless roles and responsibilities are explicitly differentiated.
- Collaboration is demonstrated through reviews, discussions, and coordination — not shared ownership of identical outputs.
If ownership is unclear at any stage, maintainers should request clarification or restructuring before merging. Clear ownership is required for all phases to ensure sustainability, accountability, and long-term project health.
Expected top-level documents include:
- TECHNICAL_OVERVIEW.md
- ROADMAP.md
- CONTRIBUTING.md
- DATA_MODEL_DECISIONS.md
- ARCHITECTURE.md
Structure will stabilize as implementation progresses.
- Python 3.10 or higher
- System dependencies (for PDF processing):
  - Tesseract OCR: Installation guide
  - Ghostscript: Installation guide
- Clone the repository
git clone https://github.com/OpenGovAfrica/oga-budget-lens.git
cd oga-budget-lens
- Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies
pip install -r requirements.txt
- Verify your setup
python setup_check.py
You should see: ✓ All checks passed! Environment ready.
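If the check fails, it can help to test the prerequisites by hand. The snippet below is a minimal sketch of that kind of check, based only on the requirements listed above (Python 3.10+, Tesseract, Ghostscript). It is not the repository's `setup_check.py`; the function name `check_environment` and its parameters are hypothetical.

```python
import shutil
import sys


def check_environment(min_python=(3, 10), tools=("tesseract", "gs")) -> list[str]:
    """Return a list of human-readable problems; an empty list means all good.

    Assumption: Ghostscript is installed as `gs`, which is typical on Linux
    and macOS. On Windows the executable is usually named differently
    (for example `gswin64c`), so pass a different `tools` tuple there.
    """
    problems = []
    if sys.version_info < min_python:
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found {found}"
        )
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


problems = check_environment()
print("Environment looks ready" if not problems else "\n".join(problems))
```

Returning a list of problems instead of stopping at the first one lets a new contributor fix everything in a single pass.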
Once your environment is set up, you can explore the project structure and review open issues to start contributing.
- Read the roadmap to understand project direction
- Check issues labeled `good first issue` or `design`
- Join discussions around data models and validation
- Propose improvements via issues or pull requests
See CONTRIBUTING.md for details.
This project is maintained under the OpenGovAfrica ecosystem.
Design decisions are expected to:
- Be documented
- Favor clarity over cleverness
- Support reuse by journalists, researchers, and civic technologists