This repository contains (mostly) Belgian-related templates to extract
structured data from invoices given as (unstructured) PDF files. The templates
are intended for the invoice2data open source Python library.
It also contains some code around invoice2data to make it easier to create
and test templates, and to show extracted data in a web page (including as a
SEPA Credit Transfer QR code).
We're using a Nix shell to bring the necessary dependencies (e.g. Python
libraries or the pdftotext binary) during development:
$ nix-shell
$ invoice2data --help
This provides for instance invoice2data and pdftotext.
When adding a new template, it is easier to use the invoice2data command-line
tool in debug mode. See below. The invoice2data repository has a tutorial too.
We're using invoice2data.
"Templates" are YAML files describing how to extract data from text using
regexes. The text itself is obtained using e.g. texttopdf. Other ways to
obtain the text is possible (e.g. using Tesseract).
Multiple templates are tried on a given PDF. The output can be saved to a file (for instance in JSON).
$ invoice2data \
--debug \
--exclude-built-in-templates \
--template-folder ./templates/ \
pdfs/2023-04-09-ovh.pdf
Note: Example PDFs are not included in the repository.
TODO Try Tesseract on (possibly scanned) PDFs.
invoice2dataseems quite fragile, e.g. it tries to read a PDF after it failed to generate it.- Monetary amounts are using the
floattype. - Some fields are mandatory (
date,amount,invoice_number) but other fields that are important to us can fail to be extracted with simply a warning. Code exploiting the data should check for those cases. - We're using the non-official
comment:field to try to add some information to our templates.
$ sqlite3 invoices.db < sql/create-invoices-table.sql
$ sqlite3 invoices.db ".schema invoices"
$ sqlite3 invoices.db "select * from invoices"
A Docker image running invoiced serve with Gunicorn is provided:
$ nix-build docker.nix
$ docker load < result
$ docker run -p 8000:8000 invoiced:xxxx
Note: I didn't manage to get convert to run properly. The error when run
within the container:
$ docker run -v$(pwd):/src invoiced:qqyikanwalr3y0x07aq936896zagjpi7 convert -quality 100 -density 200 -colorspace sRGB '/src/pdfs/2023-01-09-proximus.pdf[0]' -flatten /src/output.png
convert: FailedToExecuteCommand `'gs' -sstdout=%stderr -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 '-sDEVICE=pngalpha' -dTextAlphaBits=4 -dGraphicsAlphaBits=4 '-r200x200' -dPrinted=false -dFirstPage=1 -dLastPage=1 '-sOutputFile=magick-YMX-n5OYS7EpIgG9fu04gS4I7nGNnCNl%d' '-fmagick-OwWeXniaYxuw2iAhuXsz-k1nZxBrwoWG' '-fmagick-QvORRBCAbGmU4AKh7VoxaEDlubJkI2qk'' (32512) @ error/ghostscript-private.h/ExecuteGhostscriptCommand/74.
convert: no images defined `/src/output.png' @ error/convert.c/ConvertImageCommand/3342.
When using convert on my host machine, but omitting gs, is very similar,
but has an additional sh: line 1: gs: command not found line:
$ convert -quality 100 -density 200 -colorspace sRGB 'pdfs/2023-01-09-proximus.pdf[0]' -flatten output.png
sh: line 1: gs: command not found
convert: FailedToExecuteCommand `'gs' -sstdout=%stderr -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 '-sDEVICE=pngalpha' -dTextAlphaBits=4 -dGraphicsAlphaBits=4 '-r200x200' -dPrinted=false -dFirstPage=1 -dLastPage=1 '-sOutputFile=/run/user/1000/magick-_f97AT5-L2DxpZYeSrj-LNLOgB8QW3b8%d' '-f/run/user/1000/magick-arJJP49wUA61JM7BoRTyPPj6kApH5gGZ' '-f/run/user/1000/magick-sLB3aFbLHxrGl-LUdJLCApxv9H6ugSt0'' (32512) @ error/ghostscript-private.h/ExecuteGhostscriptCommand/74.
convert: no images defined `output.png' @ error/convert.c/ConvertImageCommand/3342.
(Calling gs directly in the container does work.)