GitHub - SFBioinformaticsGroup/paper2table: Extract tables from papers

paper2table

Extract tables from papers

paper2table is a toolchain for extracting tabular information from scientific papers. It is composed of various command-line programs:

paper2table: the main command, which is used to extract data
filenorm: a command for preparing papers
tablemerge: a command for merging result of multiple paper2table runs
tablestats: a command for querying paper2table and tablemerge results
table2html: a command for generating simple extracted data visualizations
table2csv: a command for exporting tables to csv files.
tablevalidate: a command for validating tables files.

Installing

# install base dependencies
$ pip install -e .

# install testing dependencies
$ pip install -e .[testing]

# install tox build tool
$ pip install tox

File preparation

Before running paper2table, it is recommended that you normalize your input papers files first, so that you avoid duplicate work. In order to do so, a small program filenorm is provided that will remove duplicate files and normalize filenames.

# normalize all the given files
# will ask for confirmation before each change
filenorm -q PATH [PATH ...]

# don't ask for confirmation. a log with each change will be printed
filenorm -y PATH [PATH ...]

Running

paper2table can read paper's table using three different backends:

the pdfplumber package (this is the default option)

the camelot package (this is the default option)

an external generative agent. This option is usually more robust, but slower, less deterministic and presents additional costs

# basic usage
$ paper2table -p SCHEMA PATH [PATH ...]

# e.g. use the default pdfplumber reader backend
$ paper2table -q tests/data/demo_table.pdf

# e.g. use the pdfplumber reader specifying column name hints
paper2table -r pdfplumber -c tests/data/demo_column_hints.txt  tests/data/demo_table.pdf

# e.g. use the camelot reader backend
$ paper2table -r camelot -q tests/data/demo_table.pdf

# by default paper2table outputs data to stdout
# but you can specify an output directory
$ paper2table -o .  tests/data/demo_table.pdf
# result will be stored in demo_table.tables.json

# e.g. use the agent backend with the Gemini API
$ GEMINI_API_KEY=... paper2table -r agent -m google-gla:gemini-2.5-flash -p tests/data/demo_schema.txt tests/data/demo_table.pdf

Hybrid mode

Hybrid mode combines an LLM agent with a traditional reader backend. The agent analyses the PDF once to detect which tables are relevant and how their columns map to your schema. That mapping is then passed to the reader (pdfplumber, camelot, pymupdf) which performs the actual row extraction. This is usually more accurate and stable than running either approach alone.

Enable hybrid mode with -H together with a schema (-p or -s) and, optionally, -r to choose the underlying reader (default: pdfplumber).

# run hybrid mode with the default pdfplumber reader
$ GEMINI_API_KEY=... paper2table -H -m google-gla:gemini-2.5-flash \
    -p tests/data/demo_schema.txt \
    tests/data/demo_table.pdf

# use camelot as the underlying reader instead
$ GEMINI_API_KEY=... paper2table -H -r camelot -m google-gla:gemini-2.5-flash \
    -p tests/data/demo_schema.txt \
    tests/data/demo_table.pdf

# save mappings to a custom directory (default: ./mappings)
$ GEMINI_API_KEY=... paper2table -H -m google-gla:gemini-2.5-flash \
    -p tests/data/demo_schema.txt \
    -M tests/data/mappings \
    tests/data/demo_table.pdf

The generated mapping is cached in the mappings directory (<paper_name>.mapping.json). On subsequent runs for the same PDF the agent step is skipped automatically. Use -F to force regeneration of the mapping:

# regenerate the mapping even if one already exists
$ GEMINI_API_KEY=... paper2table -H -F -m google-gla:gemini-2.5-flash \
    -p tests/data/demo_schema.txt \
    tests/data/demo_table.pdf

Each mapping file records which model produced it and when, under a metadata field:

{
  "tables": [ ... ],
  "citation": "...",
  "metadata": {
    "model": "google-gla:gemini-2.5-flash",
    "date": "2026-03-23T10:00:00+00:00"
  }
}pylint $(git ls-files 'src/**/*.py')

Merging

paper2table also provides a table merging program called tablemerge. In order to be able to use it, you'll need to first generate some metadata. You can produce it using the same paper2table command:

# this command will create a new directory with the resultset, adding a metadata file
# suitable for use with tablemerge command
$ paper2table -t -o tests/data/tables tests/data/demo_table.pdf

After doing this, you can merge tables like this:

$ tablemerge -o tests/data/merges tests/data/demo_resultsets/*

Generating stats

A tool tablestats is provided for getting some stats about the extracted tables. It can be used to query both the direct output of a paper2table run or the results of a tablemerge output.

# generate a json file with stats
tablestats -o tests/data/stats.json tests/data/demo_resultsets/08ba0033-8b20-4dbb-bf4a-e2be1f194bc7/

# pretty print stats to stdout
# you can optionally sort results by number of extracted tables
tablestats --sort desc tests/data/merges

# if you only need to output empty files, use --empty
# this is useful for debugging your results
tablestats --empty tests/data/merges

Visualizing data

A tool table2html is provided for displaying a resultset:

# it can be used both with the raw resultset of a paper2table run
# or with the output of tablemerge
table2html tests/data/merges

Running tests

$ tox

`TablesFile` format

paper2table and tablemerge command output the the extracted tables data in a TablesFile file format, (with extension .tables.json). You can validate that those files follow the exact format using tablevalidate:

tablevalidate tests/data/demo_resultsets/*/*

The format is informally specified this way:

{
  "tables": [
    {
      "rows": [
        {
          "COLUMN_NAME_1": string | [{ "value": string, "agreement_level": integer }],
          "COLUMN_NAME_2": string | [{ "value": string, "agreement_level": integer }],
          "COLUMN_NAME_3": string | [{ "value": string, "agreement_level": integer }],
          "agreement_level_": integer // this is optional
        }
      ],
      "page": integer,
    },
    {
      "table_fragments": [
        {
          "rows": ..., // same schema as previous "rows" attribute
          "page": integer
        }
      ]
    }
  ],
  "citation": string | [{ "value": string, "agreement_level": integer }],
  "metadata": { // optional
    "filename": string,
  }
}

You can also find a proper json schema definition in tablesfile.schema.json

Name		Name	Last commit message	Last commit date
Latest commit History 154 Commits
.github/workflows		.github/workflows
docs		docs
src		src
tests		tests
.coveragerc		.coveragerc
.gitignore		.gitignore
.pylintrc		.pylintrc
.python-version		.python-version
.readthedocs.yml		.readthedocs.yml
AUTHORS.rst		AUTHORS.rst
CHANGELOG.rst		CHANGELOG.rst
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.rst		CONTRIBUTING.rst
LICENSE		LICENSE
PRINCIPLES		PRINCIPLES
README.rst		README.rst
agpl-3.0.txt		agpl-3.0.txt
pyproject.toml		pyproject.toml
setup.cfg		setup.cfg
setup.py		setup.py
tablesfile.schema.json		tablesfile.schema.json
tox.ini		tox.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

paper2table

Installing

File preparation

Running

Hybrid mode

Merging

Generating stats

Visualizing data

Running tests

`TablesFile` format

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

paper2table

Installing

File preparation

Running

Hybrid mode

Merging

Generating stats

Visualizing data

Running tests

TablesFile format

About

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`TablesFile` format

Packages