Feat: Implement Vector-Semantic PDF Alignment to replace brittle geometric mapping by NEHAJAKATE · Pull Request #393 · fireform-core/FireForm

NEHAJAKATE · 2026-03-30T20:54:52Z

Motivation

Currently, the document generation logic in src/filler.py relies on sorting widgets by their Y/X coordinates or on exact matches against the PDF's internal /T (field name) entries. Both are brittle. Government agency PDFs frequently contain malformed or auto-generated widget names (e.g., TextField_42 instead of Incident_Address). If a new agency uploads a PDF with a slightly different visual layout or generic metadata, the current coordinate mapping can silently write JSON values into the wrong boxes of a legal form.

Changes Proposed

This PR introduces a Vector-Semantic PDF Alignment engine to mathematically align extracted JSON data with the visual PDF fields.
Visual Context Extraction: Integrated PyMuPDF (fitz) to scan each page and extract the visible printed text located immediately adjacent to interactive PDF widgets.
Quantized Embeddings: Integrated the edge-optimized all-MiniLM-L6-v2 embedding model (under 100 MB) to generate vectors for both the JSON keys and the extracted visual text.
Cosine Similarity Mapping: Replaced rigid coordinate mapping with a cosine-similarity threshold (e.g., similarity > 0.75), allowing the system to pair extracted data with the correct PDF box regardless of internal metadata names.
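To make the mapping step concrete, here is a minimal sketch of threshold-based cosine matching. It is not the code in this PR: the names `align_fields` and `toy_embed` are hypothetical, the greedy one-to-one assignment is an assumption about how ties could be resolved, and `toy_embed` is a bag-of-words stand-in so the example runs without downloading all-MiniLM-L6-v2. In practice the `embed` callable would wrap the real embedding model, and `widget_labels` would come from the PyMuPDF extraction step.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def align_fields(
    json_keys: List[str],
    widget_labels: Dict[str, str],
    embed: Callable[[str], Vector],
    threshold: float = 0.75,
) -> Dict[str, str]:
    """Greedily pair each JSON key with at most one widget.

    widget_labels maps a widget's internal name to the printed text
    found next to it. Pairs scoring at or below the threshold are dropped.
    """
    key_vecs = {k: embed(k) for k in json_keys}
    label_vecs = {w: embed(t) for w, t in widget_labels.items()}

    # Score every (key, widget) pair, best first.
    scored = sorted(
        (
            (cosine_similarity(kv, lv), key, widget)
            for key, kv in key_vecs.items()
            for widget, lv in label_vecs.items()
        ),
        reverse=True,
    )

    mapping: Dict[str, str] = {}
    used_widgets = set()
    for score, key, widget in scored:
        if score <= threshold:
            break  # everything after this is below the cut-off
        if key in mapping or widget in used_widgets:
            continue
        mapping[key] = widget
        used_widgets.add(widget)
    return mapping


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in embedder: word counts over a tiny fixed vocabulary."""
    vocab = ["incident", "address", "date", "officer", "name"]
    tokens = text.lower().replace("_", " ").replace(":", " ").split()
    return [float(tokens.count(word)) for word in vocab]


# Widget names are generic, but the printed labels carry the meaning.
widget_labels = {
    "TextField_42": "Incident Address:",
    "TextField_7": "Name of Officer",
    "TextField_9": "Signature",
}
mapping = align_fields(["incident_address", "officer_name"], widget_labels, toy_embed)
print(mapping)
```

Keys with no label above the threshold are simply left unmapped, which lets the caller flag them for review instead of filling a box on a weak match.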

Impact

Zero-Config Onboarding: Eliminates the need for custom YAML mapping scripts for every new agency.
Fault Tolerance: Makes FireForm tolerant of poorly formatted or heterogeneous government PDFs, moving the platform toward a true Digital Public Good.

(Note: I am submitting this architectural exploration as part of my active research and contribution for GSoC 2026. I would love the maintainers' feedback on this vector-based approach!)

Refactor: Implement Async Semantic Routing to eliminate O(N) LLM bottleneck

c607b5b

NEHAJAKATE wants to merge 1 commit into fireform-core:main from NEHAJAKATE:feature/vector-pdf-mapper
