Small command-line utility that parses .eml messages from a DOI request form (HTML body) and produces a DataCite Kernel-4 metadata XML file suitable for import into DataCite Fabrica.
- Python 3.8+
- Dependencies: lxml, requests (see requirements.txt)
Create a virtualenv and install dependencies:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Run the converter:
python src/main.py path/to/request.eml
This writes path/to/request.xml (same directory as the input .eml).
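The output location described above can be sketched with pathlib. This is only an illustration of the naming rule; the helper name `output_path_for` is hypothetical and is not taken from src/main.py.

```python
from pathlib import Path


def output_path_for(eml_path: str) -> Path:
    """Return the .xml path that sits next to the input .eml (hypothetical helper)."""
    # with_suffix swaps only the final extension and keeps the directory part.
    return Path(eml_path).with_suffix(".xml")


print(output_path_for("path/to/request.eml"))
```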
The script uses simple regex patterns against the email’s HTML body to extract:
- Requester name and email
- URL
- Creator(s)
- Title
- Publisher
- Publication year
- Resource type
- Description/abstract
It prints a readable “Extracted data” table to the terminal and then generates DataCite XML.
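The extraction step can be sketched roughly as follows. This is a minimal sketch, not the code in src/main.py: the `<b>Label:</b> value<br>` form layout, the `FIELD_PATTERNS` table, and the function names are all assumptions made for illustration, and only three of the fields are shown.

```python
import re
from email import message_from_string
from email.policy import default

# Assumed form layout: each field is rendered as "<b>Label:</b> value<br>".
# The real request form and the patterns in src/main.py may differ.
FIELD_PATTERNS = {
    "title": re.compile(r"<b>Title:</b>\s*(.*?)\s*<br", re.IGNORECASE | re.DOTALL),
    "publisher": re.compile(r"<b>Publisher:</b>\s*(.*?)\s*<br", re.IGNORECASE | re.DOTALL),
    "publication_year": re.compile(r"<b>Publication year:</b>\s*(\d{4})", re.IGNORECASE),
}


def html_body(raw_message: str) -> str:
    """Return the HTML body of an email message, or "" if it has none."""
    msg = message_from_string(raw_message, policy=default)
    part = msg.get_body(preferencelist=("html",))
    return part.get_content() if part is not None else ""


def extract_fields(html: str) -> dict:
    """Apply each pattern to the HTML; fields that do not match map to None."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(html)
        result[name] = match.group(1).strip() if match else None
    return result


# Made-up sample message, for illustration only.
SAMPLE_EML = (
    "From: requester@example.org\n"
    "Subject: DOI request\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<html><body>"
    "<b>Title:</b> Example dataset<br>"
    "<b>Publisher:</b> Example University<br>"
    "<b>Publication year:</b> 2024<br>"
    "</body></html>\n"
)

fields = extract_fields(html_body(SAMPLE_EML))
for name, value in fields.items():
    print(f"{name:<18}{value}")
```

For a file on disk, `email.message_from_binary_file(fh, policy=default)` would play the same role as `message_from_string` here.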
- The <identifier> is a placeholder (10.XXXX/XXXXX). DataCite assigns the real DOI when you register it.
- Creator parsing:
  - Multiple creators are separated by semicolons (;).
  - Names containing a comma are treated as Personal (e.g., Last, First).
  - Names without a comma are treated as Organizational.
- The <resourceType> is currently emitted with resourceTypeGeneral="Dataset".
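The creator rules above can be sketched as below. The function names are hypothetical, and the sketch uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies; it also omits the DataCite namespace and every other element of the record.

```python
import xml.etree.ElementTree as ET


def parse_creators(raw: str) -> list:
    """Split on semicolons and classify each name by whether it contains a comma."""
    creators = []
    for name in raw.split(";"):
        name = name.strip()
        if not name:
            continue  # ignore empty pieces such as a trailing semicolon
        name_type = "Personal" if "," in name else "Organizational"
        creators.append((name, name_type))
    return creators


def creators_element(raw: str) -> ET.Element:
    """Build a <creators> element with one <creator>/<creatorName> per parsed name."""
    root = ET.Element("creators")
    for name, name_type in parse_creators(raw):
        creator = ET.SubElement(root, "creator")
        creator_name = ET.SubElement(creator, "creatorName", nameType=name_type)
        creator_name.text = name
    return root


element = creators_element("Doe, Jane; Example University")
print(ET.tostring(element, encoding="unicode"))
```

One consequence of the comma rule is that an organization whose name contains a comma would be classified as Personal, so such names may need manual correction in the generated XML.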
For organizational creators and the publisher, the script attempts to add a ROR ID:
- Checks a small local mapping (ROR_MAPPINGS) in src/main.py
- Falls back to the public ROR API (https://api.ror.org/organizations)
If the lookup fails (offline, timeout, no matches), it continues without a ROR ID.
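The lookup order and the failure handling can be sketched as follows. This is an assumption-laden sketch rather than the script's code: the mapping entry is a placeholder and not a real ROR ID, the `query` parameter and the `items`/`id` response fields reflect the public ROR API as commonly documented, and urllib stands in for requests so the block has no third-party imports. The `fetch` parameter is a hypothetical hook that makes the fallback testable offline.

```python
import json
import urllib.parse
import urllib.request

ROR_API = "https://api.ror.org/organizations"

# Stand-in for the ROR_MAPPINGS table in src/main.py.
# The ID below is a placeholder, not a real ROR identifier.
ROR_MAPPINGS = {"Example University": "https://ror.org/00example0"}


def fetch_ror_json(name: str, timeout: float = 5.0) -> dict:
    """Query the ROR API for an organization name and return the decoded JSON."""
    url = ROR_API + "?" + urllib.parse.urlencode({"query": name})
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def lookup_ror(name: str, fetch=fetch_ror_json):
    """Return a ROR ID from the local mapping, then the API, else None."""
    if name in ROR_MAPPINGS:
        return ROR_MAPPINGS[name]
    try:
        items = fetch(name).get("items", [])
    except (OSError, ValueError):
        # Offline, timeout, HTTP error, or malformed JSON: continue without a ROR ID.
        return None
    return items[0].get("id") if items else None


# Resolved from the local mapping, so no network request is made here.
print(lookup_ror("Example University"))
```

Taking the first search result is the simplest possible choice; a stricter version might compare the returned organization name with the requested one before accepting the ID.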
- src/main.py: CLI entrypoint and conversion logic
- data/: sample .eml/.xml files
- output/: spare output folder (not currently used by the script)
See LICENSE.