Automated Data Extraction for Plant Traits
This repository accompanies the ADEPT research paper.
ADEPT is a workflow for the automated extraction of plant trait information from biodiversity literature, producing structured matrices of trait terms derived from taxonomic descriptions.
Descriptions are sourced from the Biodiversity Heritage Library (BHL), eFloras (Flora of North America; Flora of China; Flora of Pakistan), and EcoFlora. Because BHL literature contains a mixture of narrative text and formal species descriptions, ADEPT uses a binary text classifier to identify candidate descriptive passages before downstream processing.
Trait extraction is then performed using rule-based matching approaches designed for botanical descriptive language, enabling consistent identification of morphological and ecological trait terms from heterogeneous historical and contemporary texts.
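To make the idea of rule-based matching concrete, the following is a minimal sketch using only the Python standard library. It is not ADEPT's implementation: the vocabulary `TRAIT_TERMS` and the functions `extract_traits` and `to_matrix` are hypothetical names introduced for illustration, and ADEPT's actual rules and vocabularies are considerably richer.

```python
import re

# Hypothetical mini-vocabulary: trait name -> accepted terms.
# ADEPT's real rules and vocabularies are not reproduced here.
TRAIT_TERMS = {
    "leaf_shape": ["ovate", "lanceolate", "linear"],
    "flower_colour": ["white", "yellow", "purple"],
    "habit": ["herb", "shrub", "tree"],
}


def extract_traits(description: str) -> dict[str, list[str]]:
    """Return the vocabulary terms found in a description, grouped by trait."""
    text = description.lower()
    found = {}
    for trait, terms in TRAIT_TERMS.items():
        # Whole-word matching: "tree" does not match inside "trees".
        hits = [t for t in terms if re.search(rf"\b{re.escape(t)}\b", text)]
        if hits:
            found[trait] = hits
    return found


def to_matrix(descriptions: dict[str, str]) -> list[dict[str, str]]:
    """Build one row per taxon with one column per trait."""
    rows = []
    for taxon, description in descriptions.items():
        traits = extract_traits(description)
        row = {"taxon": taxon}
        for trait in TRAIT_TERMS:
            # Empty string when no term for this trait was found.
            row[trait] = "; ".join(traits.get(trait, []))
        rows.append(row)
    return rows
```

Calling `to_matrix` with a mapping of taxon names to description text returns a list of rows that could be written to CSV with `csv.DictWriter`. Plural and inflected forms are deliberately not handled here, which is one reason a real rule set needs more than a word list.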
The resulting outputs are structured trait matrices suitable for downstream ecological, evolutionary, and biodiversity informatics analyses. Output matrices for Angiosperms are available at https://doi.org/10.5519/p3dm31kc
ADEPT requires Python 3.11+ and uses uv for dependency management.
curl -Ls https://astral.sh/uv/install.sh | sh
uv venv --python 3.11
source .venv/bin/activate
uv pip install git+https://github.com/NaturalHistoryMuseum/ADEPT.git
For reproducible analyses, installing a tagged release rather than the main branch is recommended.
ADEPT requires Biodiversity Heritage Library data downloaded locally.
Creates a local index of taxonomic occurrences across BHL pages derived from BHL data archives.
adept assets bhl-names
Downloads a local copy of the BHL Optical Character Recognition (OCR) full-text export:
Richard, Joel & Dearborn, Jacqueline (2022). BHL Optical Character Recognition (OCR) – Full Text Export. Smithsonian Libraries and Archives. https://doi.org/10.25573/data.21422193.v22
If the BHL OCR archive is not available locally, ADEPT will retrieve page text via the BHL API, which may significantly slow processing.
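The local-first lookup with an API fallback can be sketched as follows. This is an illustration under stated assumptions, not ADEPT's code: the function `get_page_text`, the one-file-per-page layout (`<page_id>.txt`), and the `fetch_from_api` callable are all hypothetical, and the real archive layout and API client may differ.

```python
from pathlib import Path
from typing import Callable


def get_page_text(
    page_id: int,
    ocr_dir: Path,
    fetch_from_api: Callable[[int], str],
) -> str:
    """Return OCR text for a page, preferring a local file over the API.

    Assumes (hypothetically) one UTF-8 text file per page named
    '<page_id>.txt' under ocr_dir.
    """
    local_file = ocr_dir / f"{page_id}.txt"
    if local_file.is_file():
        # Fast path: read straight from disk, no network access.
        return local_file.read_text(encoding="utf-8")
    # Slow path: one remote request per missing page.
    return fetch_from_api(page_id)
```

Because the slow path costs one request per page, a run over many pages is dominated by network latency whenever the archive is absent.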
adept assets bhl-ocr
Pretrained ADEPT machine-learning models used in the paper are available via Hugging Face:
- Taxonomic named-entity recognition model: https://huggingface.co/benhartley/en_adept_ner_trf
- Description classification model: https://huggingface.co/benhartley/en_description_classifier
These enable reproducibility of the text-mining workflow described in the manuscript.
ADEPT provides a command-line interface for running extraction workflows.
Generate a trait matrix for a species:
adept traits --taxa "Leersia hexandra" --group angiosperm
Use Tesseract OCR instead of BHL OCR:
adept traits --taxa "Leersia hexandra" --group angiosperm --ocr TESSERACT
Process species from an input spreadsheet:
adept traits --file ../data/examples/angiosperm-10.xlsx --limit 4 --group angiosperm
Retrieve botanical descriptions from BHL, EcoFlora, and eFloras sources:
adept descriptions --taxa "Achillea millefolium"
If you encounter bugs or unexpected behaviour, or have feature requests, please open an issue in this repository.
By default, ADEPT uses OCR text provided by BHL, which may contain OCR errors affecting downstream trait extraction.
To use locally generated OCR via Tesseract instead:
.env configuration
BHL_OCR_SOURCE=TESSERACT
or via CLI:
adept traits --ocr TESSERACT
Requests to BHL services are cached. Supported backends:
- SQLite (default)
- Redis (recommended for large-scale processing)
Start Redis:
docker compose up -d redis
Configure:
CACHE_BACKEND=REDIS
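To show what request caching buys, here is a minimal sketch of a SQLite-backed cache using only the standard library. It is not ADEPT's caching layer: the class `SQLiteCache`, its table schema, and the `fetch` callable are hypothetical, chosen only to illustrate that a repeated request is served from the cache rather than from the remote service.

```python
import sqlite3
from typing import Callable


class SQLiteCache:
    """Tiny key-value cache backed by SQLite (illustrative only)."""

    def __init__(self, path: str = ":memory:") -> None:
        # ":memory:" keeps the example self-contained; pass a file path
        # to persist entries between runs.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    def get_or_fetch(self, key: str, fetch: Callable[[str], str]) -> str:
        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: no request is made
        value = fetch(key)  # cache miss: call the slow service once
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
            (key, value),
        )
        self.conn.commit()
        return value
```

A single-file SQLite store suits one process on one machine. A networked store such as Redis can be shared by many workers, which fits the recommendation above for large-scale processing.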
ADEPT workflows are orchestrated using Luigi, which manages task dependencies, execution order, and pipeline monitoring. By default the pipeline runs with Luigi’s local scheduler, which is suitable for single-machine execution and simple workflows.
You can override this behaviour from the command line:
--local-scheduler False
or:
local_scheduler=False
Setting local_scheduler=False enables connection to a central Luigi scheduler, which is useful for distributed execution, multi-user environments, or long-running production pipelines.
The Luigi central scheduler runs as a lightweight web service built on the Tornado web framework. This provides:
- A task coordination endpoint for workers
- A browser-based monitoring interface
- Visibility into task status, failures, and dependencies
To use this mode, start the Luigi scheduler service separately (typically via luigid) before running ADEPT tasks.
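Before launching tasks against a central scheduler, it can help to confirm that the service is actually listening. The sketch below is a generic TCP reachability check, not part of ADEPT or Luigi. The function name `scheduler_reachable` is hypothetical, and the default port of 8082 assumes `luigid` was started with its default settings.

```python
import socket


def scheduler_reachable(
    host: str = "localhost",
    port: int = 8082,  # assumed luigid default; adjust if configured otherwise
    timeout: float = 1.0,
) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

If the check returns False, either start the scheduler service or run with the local scheduler instead. A successful connection only shows that something is listening on the port, not that it is a Luigi scheduler.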
If you use ADEPT in your research, please cite the ADEPT paper.
A BibTeX entry will be added once the paper is formally published.
This project is released under the MIT Licence.
You are free to use, modify, and distribute the software with minimal restrictions. See the LICENSE file for full details.