Convert FAIRe-formatted metadata to ENA (European Nucleotide Archive) XML format for submission.
This toolset automates the conversion of metadata from the FAIRe (Findable, Accessible, Interoperable, Reusable) standard to ENA's XML submission format.
The toolset consists of two scripts:
- faire2ena_sample.py - Converts sample metadata to ENA SAMPLE XML (ERC000024 GSC MIxS water checklist)
- faire2ena_run.py - Converts experiment/run metadata to ENA EXPERIMENT and RUN XML files
There is also a helper script written by Olivia Nguyen, upload_reads_to_ena.py, which uploads fastq.gz files to ENA.
pip install pandas openpyxl

The typical ENA submission workflow is:
- Submit samples using faire2ena_sample.py, generating an ENA receipt file
- Get sample accessions from the ENA receipt
- Upload FASTQ files to ENA FTP using upload_reads_to_ena.py
- Submit experiments and runs using faire2ena_run.py with the receipt file

python faire2ena_sample.py \
-i <input_excel_file> \
-c <center_name> \
-o <output_xml_file>

| Argument | Short | Description | Required |
|---|---|---|---|
| --input_file | -i | Path to FAIRe-formatted Excel file | Yes |
| --center_name | -c | Name of the sequencing center | Yes |
| --output_file | -o | Output XML filename | Yes |

python faire2ena_sample.py \
-i rowley_shoals_metadata.xlsx \
-c "OceanOmics" \
-o ena_samples.xml

This uploads fastq.gz files in the working directory to ENA.

python upload_reads_to_ena.py \
--host <host_name> \
--subdir <folder_name> \
--user <username> \
--passw <userpassword>

| Argument | Short | Description | Required | Default |
|---|---|---|---|---|
| --host | - | ENA FTP host URL | Yes | webin2.ebi.ac.uk (test) |
| --subdir | - | Subdirectory path on FTP server | Yes | - |
| --user | - | Webin username | Yes | - |
| --passw | - | Webin password | Yes | - |

The script:
- Searches for all *.fastq.gz files in the current directory
- Uploads each file to the specified ENA FTP location using curl
- Provides progress information for each upload
- Exits with an error if any upload fails

Important: Run this script from the directory containing your FASTQ files.
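
The behaviour described above can be sketched roughly as follows. This is not the code of upload_reads_to_ena.py: the function names are hypothetical, and the exact curl options the real script passes are an assumption (only `-T` for upload and `--user` for credentials are used here).

```python
import glob
import subprocess
import sys


def build_curl_command(path, host, subdir, user, passw):
    """Return the curl argument list that uploads one file over FTP.

    Assumption: the real script may pass different or additional options.
    """
    return [
        "curl",
        "-T", path,                    # upload this local file
        "--user", f"{user}:{passw}",   # Webin credentials
        f"ftp://{host}/{subdir}/",     # trailing slash keeps the original filename
    ]


def upload_all(host, subdir, user, passw, pattern="*.fastq.gz"):
    """Upload every file matching `pattern` in the current directory."""
    files = sorted(glob.glob(pattern))
    print(f"Uploading {len(files)} file(s) to {host}")
    for path in files:
        print(f"Uploading {path}")
        result = subprocess.run(build_curl_command(path, host, subdir, user, passw))
        if result.returncode != 0:
            # Stop at the first failed transfer, as the README describes.
            sys.exit(f"Upload failed: {path}")
        print(f"Uploaded: {path}")
```

Building the command as a list rather than a shell string avoids quoting problems with passwords that contain spaces or shell metacharacters.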
# For TEST server (files auto-deleted after 24 hours)
cd /path/to/fastq/files
python upload_reads_to_ena.py \
--host webin2.ebi.ac.uk \
--subdir rowley_shoals_2019 \
--user Webin-12345 \
--passw your_password

# For PRODUCTION server
python upload_reads_to_ena.py \
--host webin.ebi.ac.uk \
--subdir rowley_shoals_2019 \
--user Webin-12345 \
--passw your_password

The script will show progress for each file:
Assuming files end in fastq.gz
Found files: RS19_C13_A.R1.fq.gz
RS19_C13_A.R2.fq.gz
RS19_C13_B.R1.fq.gz
RS19_C13_B.R2.fq.gz
Uploading 4 file(s) to ENA TEST FTP (webin2.ebi.ac.uk)
Uploading RS19_C13_A.R1.fq.gz → ftp://webin2.ebi.ac.uk/rowley_shoals_2019/
Uploaded: RS19_C13_A.R1.fq.gz
Uploading RS19_C13_A.R2.fq.gz → ftp://webin2.ebi.ac.uk/rowley_shoals_2019/
Uploaded: RS19_C13_A.R2.fq.gz
...
All uploads complete.
Files are now in your ENA TEST upload area:
ftp://webin2.ebi.ac.uk//rowley_shoals_2019
ℹ️ Note: Files on the TEST server are automatically deleted within 24 hours.
- Test vs Production:
  - Test server: webin2.ebi.ac.uk (files deleted after 24 hours)
  - Production server: webin.ebi.ac.uk (permanent storage)
- File naming: The filenames in your experimentRunMetadata sheet must match exactly what you upload
- MD5 checksums: These are taken from experimentRunMetadata as well. Make sure they match!
- Upload time: Large files may take considerable time to upload

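
One way to check that a local file matches the checksum recorded in experimentRunMetadata before uploading is sketched below. The function names are hypothetical and not part of the toolset; the sketch only uses Python's standard hashlib module.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare a file's MD5 with the value recorded in the metadata sheet."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Running `checksum_matches` over every filename/checksum pair in the sheet before upload should catch stale checksums before ENA rejects the run.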
After uploading FASTQ files and receiving the sample submission receipt from ENA, use this script to submit experiment and run metadata.
python faire2ena_run.py \
-i <input_excel_file> \
-r <receipt_xml_file> \
-s <study_accession> \
-c <center_name> \
-e <experiment_output_xml> \
-o <run_output_xml>

| Argument | Short | Description | Required | Default |
|---|---|---|---|---|
| --input_file | -i | Path to FAIRe-formatted Excel file | Yes | - |
| --receipt_file | -r | ENA sample submission receipt XML | Yes | - |
| --study_accession | -s | ENA study accession (e.g., PRJEB12345) | Yes | - |
| --center_name | -c | Name of the sequencing center | Yes | - |
| --experiment_output | -e | Output file for EXPERIMENT XML | No | ena_experiments.xml |
| --run_output | -o | Output file for RUN XML | No | ena_runs.xml |
| --instrument_model | -m | Sequencing instrument model | No | Illumina NextSeq 2000 |
| --assay | -a | Assay name to append to experiment/run aliases | No | None |

The script:
- Parses the ENA sample receipt XML to extract sample alias → accession mappings (e.g., RS19_C13_A_2 → ERS32025180)
- Reads experiment/run metadata from the experimentRunMetadata sheet
- Matches each library/run to its corresponding sample accession
- Optionally appends an assay suffix to all experiment/run names (if --assay is provided)
- Generates two separate XML files:
  - EXPERIMENT XML: Contains library preparation metadata (library strategy, source, selection, platform)
  - RUN XML: Contains sequencing run metadata (FASTQ filenames, MD5 checksums)
Each experiment is linked to a sample via the sample accession, and each run is linked to its experiment.
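
The receipt-parsing and matching steps can be sketched as below. This is an illustration, not the code of faire2ena_run.py: the function names are hypothetical, and it assumes the receipt lists each sample as a SAMPLE element carrying alias and accession attributes.

```python
import xml.etree.ElementTree as ET


def load_sample_accessions(receipt_xml):
    """Map sample alias -> accession from the text of an ENA receipt.

    Assumption: samples appear as <SAMPLE alias="..." accession="..."/>.
    """
    root = ET.fromstring(receipt_xml)
    return {
        sample.get("alias"): sample.get("accession")
        for sample in root.iter("SAMPLE")
        if sample.get("alias") and sample.get("accession")
    }


def match_runs(samp_names, accessions):
    """Split sample names into (matched, skipped) by presence in the receipt."""
    matched = {name: accessions[name] for name in samp_names if name in accessions}
    skipped = [name for name in samp_names if name not in accessions]
    return matched, skipped
```

The `skipped` list corresponds to the samples reported in the "Skipped ... samples without accessions" warning.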
Multiple Assays per Sample: If you have multiple assays (e.g., 16S, COI, 18S, ITS) for the same BioSample:
- Prepare separate FASTQ files for each assay
- Run the script separately for each assay using the --assay parameter
- This appends the assay name to experiment aliases (e.g., RS19_C13_A becomes RS19_C13_A_16S)
- Generate separate XML files for each assay (e.g., ena_experiments_16S.xml, ena_experiments_COI.xml)
- Submit each assay separately to ENA
This ensures each BioSample can have multiple experiments (one per assay) without alias conflicts.
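
The suffixing rule is simple enough to state as a one-line function. The name `with_assay_suffix` is hypothetical; the sketch only reproduces the naming pattern shown above (RS19_C13_A becomes RS19_C13_A_16S).

```python
def with_assay_suffix(alias, assay=None):
    """Append '_<assay>' to an alias; leave it unchanged when no assay is given."""
    return f"{alias}_{assay}" if assay else alias


# The same library name yields a distinct alias for each assay.
aliases = [with_assay_suffix("RS19_C13_A", assay) for assay in ("16S", "COI")]
```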
# Submit all runs without assay suffix
python faire2ena_run.py \
-i rowley_shoals_metadata.xlsx \
-r sample_receipt.xml \
-s PRJEB12345 \
-c "OceanOmics" \
-e ena_experiments.xml \
-o ena_runs.xml
# OR submit with assay suffix (for multiple assays per sample)
# First submission: 16S data
python faire2ena_run.py \
-i rowley_shoals_16S_metadata.xlsx \
-r sample_receipt.xml \
-s PRJEB12345 \
-c "OceanOmics" \
-a 16S \
-e ena_experiments_16S.xml \
-o ena_runs_16S.xml
# Second submission: COI data (same samples, different files)
python faire2ena_run.py \
-i rowley_shoals_COI_metadata.xlsx \
-r sample_receipt.xml \
-s PRJEB12345 \
-c "OceanOmics" \
-a COI \
-e ena_experiments_COI.xml \
-o ena_runs_COI.xml

The script will generate two files and print summary information:
INFO: Loaded 245 sample accessions from receipt
INFO: Adding assay suffix '_16S' to all experiment and run names
INFO: Generated EXPERIMENT XML with 245 experiments -> ena_experiments_16S.xml
INFO: Generated RUN XML with 245 runs -> ena_runs_16S.xml
If any samples are missing from the receipt, you'll see warnings:
WARNING: Skipped 3 samples without accessions:
- RS19_C20_E_2
- RS19_M12_C_2
- RS19_M15_A_1
Both scripts expect a FAIRe-formatted Excel file with multiple sheets:
- projectMetadata - Contains project-level information including project_id
- sampleMetadata - Starting at row 3, contains FAIRe-formatted sample data
- experimentRunMetadata - Starting at row 3, contains sequencing run and library preparation data

Mandatory sampleMetadata fields:
- eventDate - Collection date (ISO 8601 format)
- decimalLatitude - Latitude in decimal degrees
- decimalLongitude - Longitude in decimal degrees
- geo_loc_name - Geographic location name
- env_broad_scale - Broad environmental context (with ENVO terms)
- env_local_scale - Local environmental context
- env_medium - Environmental medium (with ENVO terms)
- minimumDepthInMeters - Sampling depth

The tool supports mapping for 50+ optional fields including:
- Water chemistry (salinity, pH, dissolved oxygen, nutrients)
- Physical parameters (temperature, turbidity, conductivity)
- Sample collection details (device, method, volume)
- Sample processing (storage, extraction methods)
See the Field Mapping section for complete details.
- samp_name - Sample name (must match samp_name from sampleMetadata)
- lib_id - Library identifier
- filename - Forward read FASTQ filename
- filename2 - Reverse read FASTQ filename
- checksum_filename - MD5 checksum for forward read
- checksum_filename2 - MD5 checksum for reverse read

- assay_name - Assay or marker name
- pcr_plate_id - PCR plate identifier
- seq_run_id - Sequencing run identifier
- lib_conc - Library concentration value
- lib_conc_unit - Library concentration unit
- lib_conc_meth - Library quantification method
- phix_perc - PhiX spike-in percentage
- mid_forward - Forward index/barcode
- mid_reverse - Reverse index/barcode
- input_read_count - Number of raw reads
- output_read_count - Number of processed reads
- output_otu_num - Number of OTUs/ASVs
- otu_num_tax_assigned - Number of taxonomically assigned OTUs

The tool generates an ENA-compliant SAMPLE XML file structured as:
<?xml version="1.0" encoding="UTF-8"?>
<SAMPLE_SET>
<SAMPLE alias="RS19_RS1_1_A" center_name="OceanOmics">
<SAMPLE_NAME>
<TAXON_ID>408172</TAXON_ID>
</SAMPLE_NAME>
<SAMPLE_ATTRIBUTES>
<SAMPLE_ATTRIBUTE>
<TAG>amount or size of sample collected</TAG>
<VALUE>1.0 L</VALUE>
<UNITS>L</UNITS>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>broad-scale environmental context</TAG>
<VALUE>ocean biome [ENVO:01000048]</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>collection date</TAG>
<VALUE>2019-10-16</VALUE>
</SAMPLE_ATTRIBUTE>
<!-- More attributes... -->
</SAMPLE_ATTRIBUTES>
</SAMPLE>
<!-- More samples... -->
</SAMPLE_SET>

The tool generates an ENA-compliant EXPERIMENT XML file:
<?xml version="1.0" encoding="UTF-8"?>
<EXPERIMENT_SET>
<EXPERIMENT alias="RS19_C13_A" center_name="OceanOmics">
<TITLE>RS19_C13_A</TITLE>
<STUDY_REF accession="PRJEB12345"/>
<DESIGN>
<DESIGN_DESCRIPTION>eDNA metabarcoding</DESIGN_DESCRIPTION>
<SAMPLE_DESCRIPTOR accession="ERS32025180"/>
<LIBRARY_DESCRIPTOR>
<LIBRARY_NAME>RS19_C13_A</LIBRARY_NAME>
<LIBRARY_STRATEGY>AMPLICON</LIBRARY_STRATEGY>
<LIBRARY_SOURCE>METAGENOMIC</LIBRARY_SOURCE>
<LIBRARY_SELECTION>PCR</LIBRARY_SELECTION>
<LIBRARY_LAYOUT>
<PAIRED/>
</LIBRARY_LAYOUT>
</LIBRARY_DESCRIPTOR>
</DESIGN>
<PLATFORM>
<ILLUMINA>
<INSTRUMENT_MODEL>Illumina NovaSeq 6000</INSTRUMENT_MODEL>
</ILLUMINA>
</PLATFORM>
</EXPERIMENT>
<!-- More experiments... -->
</EXPERIMENT_SET>

The tool generates an ENA-compliant RUN XML file:
<?xml version="1.0" encoding="UTF-8"?>
<RUN_SET>
<RUN alias="RS19_C13_A_run" center_name="OceanOmics">
<EXPERIMENT_REF refname="RS19_C13_A"/>
<DATA_BLOCK>
<FILES>
<FILE filename="RS19_C13_A.R1.fq.gz" filetype="fastq"
checksum_method="MD5" checksum="674097d23b8497452c223a933325cbf3"/>
<FILE filename="RS19_C13_A.R2.fq.gz" filetype="fastq"
checksum_method="MD5" checksum="0f4a6a2dc433b8da4269b864d8d9a314"/>
</FILES>
</DATA_BLOCK>
</RUN>
<!-- More runs... -->
</RUN_SET>

You can submit these XML files to ENA via curl - see the ENA manual.

curl -u 'your_email@office.com':'please_dont_steal_my_password_I_WILL_cry' \
-F "SUBMISSION=@submission.xml" \
-F "SAMPLE=@ena_samples.xml" \
https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit

This will return a receipt XML file with sample accessions (ERS...).

curl -u 'your_email@office.com':'your_password' \
-F "SUBMISSION=@submission.xml" \
-F "EXPERIMENT=@ena_experiments.xml" \
-F "RUN=@ena_runs.xml" \
https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit

Note: Use wwwdev.ebi.ac.uk for testing. For production submissions, use www.ebi.ac.uk.
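
Both curl commands reference a submission.xml that these scripts do not generate. A minimal sketch for writing one is shown below, assuming the ADD action layout described in ENA's programmatic submission documentation. The function name is hypothetical, and the structure should be confirmed against the ENA manual before use.

```python
import xml.etree.ElementTree as ET


def write_submission_xml(path="submission.xml", action="ADD"):
    """Write a minimal submission file requesting a single action.

    Assumed layout: <SUBMISSION><ACTIONS><ACTION><ADD/></ACTION></ACTIONS></SUBMISSION>
    """
    root = ET.Element("SUBMISSION")
    actions = ET.SubElement(root, "ACTIONS")
    ET.SubElement(ET.SubElement(actions, "ACTION"), action)
    ET.ElementTree(root).write(path, encoding="UTF-8", xml_declaration=True)
    return path
```

Calling `write_submission_xml()` in the directory that holds ena_samples.xml produces the file passed to curl as `SUBMISSION=@submission.xml`.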
The FAIRe fields differ a bit from the ENA checklist fields. Here's the mapping.
| FAIRe Field | ENA Field | Mandatory |
|---|---|---|
| materialSampleID | source material identifiers | Optional |
| eventDate | collection date | Yes |
| decimalLatitude | geographic location (latitude) | Yes |
| decimalLongitude | geographic location (longitude) | Yes |
| geo_loc_name | geographic location (country and/or sea) | Yes |
| env_broad_scale | broad-scale environmental context | Yes |
| env_local_scale | local environmental context | Yes |
| env_medium | environmental medium | Yes |
| minimumDepthInMeters | depth | Yes |

| FAIRe Field | ENA Field |
|---|---|
| samp_collect_device | sample collection device |
| samp_collect_method | sample collection method |
| samp_size + samp_size_unit | amount or size of sample collected |
| samp_store_temp | sample storage temperature |
| samp_store_loc | sample storage location |
| samp_store_dur | sample storage duration |
| samp_category | control_sample |

| FAIRe Field | ENA Field |
|---|---|
| temp | temperature |
| salinity | salinity |
| ph | ph |
| diss_oxygen | dissolved oxygen |
| chlorophyll | chlorophyll |
| turbidity | turbidity |

| FAIRe Field | ENA Field |
|---|---|
| nitrate | nitrate |
| nitrite | nitrite |
| diss_org_carb | dissolved organic carbon |
| diss_inorg_carb | dissolved inorganic carbon |
| tot_nitro | total nitrogen concentration |

[See full mapping in source code]
faire2ena_sample.py validates that all mandatory ENA fields are present. If any are missing, default values are applied:
WARNING: Sample name RS19_RS1_1_A missing mandatory field 'depth', setting to default '0'
Default Values:
- env_local_scale: marine pelagic zone [ENVO:00000208]
- minimumDepthInMeters: 0 (most OceanOmics samples are surface)

Collection dates are validated against ENA's required pattern. Invalid dates (such as 2019-00-00) are automatically replaced:
WARNING: Sample name RS19_RS1_1_A has invalid date 2019-00-00T00:00:00. Replacing with 'not provided'.
Valid date formats:
- Full date: 2019-10-16
- Year-month: 2019-10
- Year only: 2019
- Date with time: 2019-10-16T00:00:00
- Missing values: not provided, not collected
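
A simplified version of this check is sketched below. It is not the tool's implementation: ENA's actual pattern is more permissive than the four formats tried here, and the names `clean_collection_date`, `MISSING_VALUES`, and `DATE_FORMATS` are hypothetical.

```python
from datetime import datetime

MISSING_VALUES = {"not provided", "not collected"}
# Only the formats listed above; ENA's real pattern accepts more variants.
DATE_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d", "%Y-%m", "%Y")


def clean_collection_date(value):
    """Return the date unchanged if it parses, otherwise 'not provided'."""
    text = str(value).strip()
    if text in MISSING_VALUES:
        return text
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)  # raises ValueError for month 00, day 00, etc.
            return text
        except ValueError:
            continue
    return "not provided"
```

Because `strptime` rejects a month or day of zero, a placeholder such as 2019-00-00T00:00:00 falls through every format and is replaced.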
Control samples (where samp_category is not 'sample') are handled specially:
- control_sample field is set to TRUE
- Missing mandatory fields are set to 'missing: control sample'
- Optional fields with no data are omitted from the XML
The default taxon ID is set to 408172 (marine metagenome). Modify this in the code if working with different organisms:
process_faire_df(df, args.output_file, project_name,
taxon_id='YOUR_TAXON_ID',
center_name=args.center_name)

If you see validation warnings, check that your FAIRe file contains:
- Collection date in ISO 8601 format (YYYY-MM-DD)
- Geographic coordinates in decimal degrees
- ENVO ontology terms for environmental contexts
- Depth measurements in meters
The tool will apply sensible defaults for OceanOmics samples if these are missing.
Dates with invalid months or days (e.g., 2019-00-00) will be automatically set to 'not provided'. Ensure dates follow ISO 8601 format or use year-only precision if exact dates are unknown.
Units are also added to specific fields via the <UNITS> XML tag. They come from a hardcoded lookup table, which you may need to change:
- depth → units: m
- geographic location (latitude/longitude) → units: DD (decimal degrees)
- amount or size of sample collected → units: L
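
A lookup of this kind might look like the sketch below. The dictionary name `UNITS` and the helper `add_attribute` are hypothetical; the entries are the units listed above, and the element names follow the SAMPLE XML example in this README.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table mirroring the units listed above.
UNITS = {
    "depth": "m",
    "geographic location (latitude)": "DD",
    "geographic location (longitude)": "DD",
    "amount or size of sample collected": "L",
}


def add_attribute(parent, tag, value):
    """Append a SAMPLE_ATTRIBUTE, adding UNITS only for tags in the lookup."""
    attribute = ET.SubElement(parent, "SAMPLE_ATTRIBUTE")
    ET.SubElement(attribute, "TAG").text = tag
    ET.SubElement(attribute, "VALUE").text = str(value)
    if tag in UNITS:
        ET.SubElement(attribute, "UNITS").text = UNITS[tag]
    return attribute
```

Keeping the units in one dictionary means a different checklist or unit convention only requires editing that table.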
The geo_loc_name field is automatically parsed to extract the country/sea name:
- Input: Indian Ocean: Rowley Shoals, Mermaid
- Output: Indian Ocean (text before the first colon)
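
The rule "text before the first colon" can be written as below. The function name `country_or_sea` is hypothetical, and stripping surrounding whitespace is an assumption about the tool's behaviour.

```python
def country_or_sea(geo_loc_name):
    """Return the text before the first colon, with surrounding spaces removed."""
    return str(geo_loc_name).split(":", 1)[0].strip()
```

A value without a colon is returned as-is, so a plain "Indian Ocean" passes through unchanged.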
Empty or NaN values are handled as follows:
- For control samples: set to 'missing: control sample'
- For regular samples with missing mandatory fields: replaced with defaults
- Optional empty fields: omitted from the XML output
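
These three rules can be combined into a single decision function, sketched below. The names are hypothetical, and the sketch assumes, as the control-sample section states, that the control placeholder applies only to mandatory fields while empty optional fields are always omitted.

```python
import math

CONTROL_PLACEHOLDER = "missing: control sample"


def is_empty(value):
    """Treat None, NaN, and blank strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip() == ""


def resolve_value(value, mandatory, is_control, default=None):
    """Return the string to write to the XML, or None to omit the field."""
    if not is_empty(value):
        return str(value)
    if not mandatory:
        return None  # optional empty fields are omitted
    if is_control:
        return CONTROL_PLACEHOLDER
    return default  # regular sample: fall back to the field's default
```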
If samples are skipped during run submission, check:
- The receipt XML file contains all sample accessions
- The samp_name values match exactly between sampleMetadata and experimentRunMetadata sheets
- All samples were successfully submitted in the first step
The script will warn you about skipped samples:
WARNING: Skipped 3 samples without accessions:
- RS19_C20_E_2
- RS19_M12_C_2
Ensure your experimentRunMetadata sheet contains:
- filename and filename2 - full FASTQ filenames with extensions
- checksum_filename and checksum_filename2 - valid MD5 checksums
You can generate MD5 checksums with:
md5sum your_file.fastq.gz

Make sure you have a valid ENA study accession (PRJEB...) before submitting experiments and runs. You need to create a study separately through the ENA Webin portal or API.