
Add GTDB taxonomy transform #488

Merged
realmarcin merged 7 commits into master from gtdb
Feb 12, 2026

@realmarcin
Collaborator

Summary

Implements a new transform to ingest GTDB (Genome Taxonomy Database) taxonomy data into the kg-microbe knowledge graph.

Features

  • Hierarchical taxonomy structure: 181,959 GTDB taxa with rdfs:subClassOf relationships
  • Genome nodes: 732,475 genomes (715,230 bacterial + 17,245 archaeal)
  • GTDB→NCBITaxon mappings: 245,471 skos:closeMatch edges (78.9% coverage)
  • Dual accession support: Both RefSeq (GCF_) and GenBank (GCA_) accessions
  • Robust error handling: Graceful handling of missing metadata files
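To illustrate how a hierarchy of rdfs:subClassOf edges can be derived from GTDB lineage strings, here is a minimal sketch. It assumes the usual semicolon-separated `d__…;p__…;…;s__…` lineage format. The helper names `split_lineage` and `parent_child_pairs` are hypothetical and are not the functions in `utils.py`.

```python
# Hypothetical helpers for illustration; not the PR's parse_taxonomy_string.
RANK_PREFIXES = ("d", "p", "c", "o", "f", "g", "s")


def split_lineage(lineage: str) -> list[tuple[str, str]]:
    """Split a GTDB lineage string into (rank_prefix, name) pairs."""
    pairs = []
    for part in lineage.split(";"):
        prefix, sep, name = part.strip().partition("__")
        if not sep or prefix not in RANK_PREFIXES:
            raise ValueError(f"unexpected lineage component: {part!r}")
        pairs.append((prefix, name))
    return pairs


def parent_child_pairs(lineage: str) -> list[tuple[str, str]]:
    """Return (child, parent) labels, one pair per subclass_of edge."""
    labels = [f"{prefix}__{name}" for prefix, name in split_lineage(lineage)]
    return list(zip(labels[1:], labels[:-1]))


example = "d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria"
for child, parent in parent_child_pairs(example):
    print(f"{child} subclass_of {parent}")
```

Each adjacent pair of ranks yields one edge, so a full seven-rank lineage contributes at most six taxon-to-taxon edges before deduplication.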

Data Model

Node Types:

  • GTDB:* - GTDB taxon nodes (biolink:OrganismTaxon)
  • GenBank:GCF_* / GenBank:GCA_* - Genome nodes (biolink:Genome)

Edge Types:

  • Taxonomy hierarchy: biolink:subclass_of (rdfs:subClassOf)
  • Genome→Taxon: biolink:subclass_of (rdfs:subClassOf)
  • GTDB→NCBI: biolink:close_match (skos:closeMatch)

Transform Output

  • Total nodes: 914,434
  • Total edges: 1,159,903
  • File size: ~180 MB (91 MB nodes + 89 MB edges)
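The reported totals can be cross-checked against the feature counts above. The sketch below assumes that every taxon except two roots (presumably d__Bacteria and d__Archaea) has exactly one parent edge and that every genome has exactly one edge to its taxon. These are assumptions about the model, not statements taken from the QC report.

```python
# Counts copied from the PR description.
taxa = 181_959
genomes = 715_230 + 17_245          # bacterial + archaeal
close_match_edges = 245_471

# Assumption: two root taxa (the bacterial and archaeal domains) have no parent.
root_taxa = 2

total_nodes = taxa + genomes
total_edges = (taxa - root_taxa) + genomes + close_match_edges

assert genomes == 732_475
assert total_nodes == 914_434
assert total_edges == 1_159_903
```

Under those assumptions the node and edge totals add up exactly.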

Files Changed

New Files

  • kg_microbe/transform_utils/gtdb/gtdb.py - Main transform class
  • kg_microbe/transform_utils/gtdb/utils.py - Helper functions
  • kg_microbe/transform_utils/gtdb/__init__.py - Module initialization
  • GTDB_IMPLEMENTATION_SUMMARY.md - Complete implementation documentation
  • GTDB_TRANSFORM_QC_REPORT.md - QC results and validation
  • download_gtdb.yaml - Helper file to download only GTDB sources

Modified Files

  • download.yaml - Added GTDB data sources
  • kg_microbe/transform.py - Registered GTDB transform
  • kg_microbe/transform_utils/constants.py - Added GTDB constants and paths
  • merge.yaml - Added GTDB to merge configuration

Testing

✅ Tested with full GTDB dataset (732K genomes)
✅ QC validation passed (see GTDB_TRANSFORM_QC_REPORT.md)
✅ All node categories correct (OrganismTaxon, Genome)
✅ All edge types correct (subclass_of, close_match)
✅ Taxonomy hierarchy complete (genome → domain)
✅ Deduplication working (no duplicate NCBI mappings)
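For the deduplication check, one plausible approach is to track each (GTDB taxon, NCBI taxon) pair in a set, so that a repeated pair is skipped while a GTDB taxon can still map to several NCBI IDs. This is a sketch only: `add_close_match` is a hypothetical helper, and the IDs below are placeholders rather than real mappings.

```python
def add_close_match(
    created: set[tuple[str, str]],
    edges: list[tuple[str, str, str]],
    gtdb_id: str,
    ncbi_taxid: str,
) -> bool:
    """Append a close_match edge unless this exact pair was already emitted."""
    key = (gtdb_id, f"NCBITaxon:{ncbi_taxid}")
    if key in created:
        return False
    created.add(key)
    edges.append((key[0], "biolink:close_match", key[1]))
    return True


created: set[tuple[str, str]] = set()
edges: list[tuple[str, str, str]] = []
add_close_match(created, edges, "GTDB:1", "100")  # placeholder IDs
add_close_match(created, edges, "GTDB:1", "100")  # duplicate pair, skipped
add_close_match(created, edges, "GTDB:1", "200")  # second NCBI ID, kept
```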

Usage

# Download GTDB data
poetry run kg download -y download_gtdb.yaml -o data/raw

# Run transform
poetry run kg transform -s gtdb

# Include in merge
poetry run kg merge -y merge.yaml

Example Query

Complete taxonomy path for E. coli genome GCF_000005845:

GenBank:GCF_000005845 (genome)
  └─ GTDB:126910 (s__Escherichia_coli)
    └─ GTDB:147 (g__Escherichia)
      └─ GTDB:281 (f__Enterobacteriaceae)
        └─ GTDB:205 (o__Enterobacterales)
          └─ GTDB:295 (c__Gammaproteobacteria)
            └─ GTDB:209 (p__Pseudomonadota)
              └─ GTDB:3 (d__Bacteria)
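
The path above can be recovered by following subclass_of edges upward until a node has no parent. A minimal sketch, with the edges from this example hard-coded in a dictionary rather than read from the transform output:

```python
# child -> parent, copied from the example path above.
SUBCLASS_OF = {
    "GenBank:GCF_000005845": "GTDB:126910",
    "GTDB:126910": "GTDB:147",
    "GTDB:147": "GTDB:281",
    "GTDB:281": "GTDB:205",
    "GTDB:205": "GTDB:295",
    "GTDB:295": "GTDB:209",
    "GTDB:209": "GTDB:3",
}


def path_to_root(node: str, parents: dict[str, str]) -> list[str]:
    """Follow parent links from node until a node without a parent is reached."""
    path = [node]
    seen = {node}
    while path[-1] in parents:
        parent = parents[path[-1]]
        if parent in seen:
            raise ValueError(f"cycle detected at {parent}")
        path.append(parent)
        seen.add(parent)
    return path


print(" -> ".join(path_to_root("GenBank:GCF_000005845", SUBCLASS_OF)))
```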

Documentation

  • Implementation Summary: GTDB_IMPLEMENTATION_SUMMARY.md
  • QC Report: GTDB_TRANSFORM_QC_REPORT.md

Notes

  • Initial archaeal taxonomy file (ar53_taxonomy.tsv) was incomplete and was re-downloaded
  • Transform works with or without metadata files (NCBI mappings require metadata)
  • ~21% of GTDB taxa lack NCBI mappings (expected for newly-described/uncultured taxa)

🤖 Generated with Claude Code

Implements transform for GTDB (Genome Taxonomy Database) with:
- Hierarchical taxonomy structure (181,959 taxa) using rdfs:subClassOf
- Genome nodes (732,475 genomes: 715K bacterial, 17K archaeal)
- GTDB→NCBITaxon mappings (245,471 edges, 78.9% coverage)
- Support for both RefSeq (GCF) and GenBank (GCA) accessions
- Graceful handling of missing metadata files

Output: 914,434 nodes and 1,159,903 edges

Includes comprehensive documentation (GTDB_IMPLEMENTATION_SUMMARY.md,
GTDB_TRANSFORM_QC_REPORT.md) and helper download file (download_gtdb.yaml).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Adds a new GTDB (Genome Taxonomy Database) ingestion path to kg-microbe, enabling GTDB taxonomy + genome nodes (and optional GTDB→NCBI mappings) to be produced in KGX TSV form and included in the merged graph.

Changes:

  • Introduces a new GTDBTransform plus parsing helpers to produce GTDB taxonomy + genome nodes/edges.
  • Registers the new transform and wires GTDB into download + merge configurations.
  • Adds implementation/QC documentation and a GTDB-only download manifest.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| kg_microbe/transform_utils/gtdb/gtdb.py | Core GTDB transform that parses taxonomy/metadata and writes KGX TSV nodes/edges. |
| kg_microbe/transform_utils/gtdb/utils.py | Helper parsing/normalization utilities for GTDB taxonomy/accessions. |
| kg_microbe/transform_utils/gtdb/__init__.py | Exposes GTDBTransform from the package. |
| kg_microbe/transform_utils/constants.py | Adds GTDB constants (source name, prefixes, file names/paths, genome category, close_match constants). |
| kg_microbe/transform.py | Registers GTDB as a selectable transform source. |
| download.yaml | Adds GTDB release URLs to the standard download manifest. |
| download_gtdb.yaml | Adds a GTDB-only download manifest. |
| merge.yaml | Adds GTDB nodes/edges files to the merged graph configuration. |
| GTDB_IMPLEMENTATION_SUMMARY.md | Documents intended model/implementation decisions for GTDB ingestion. |
| GTDB_TRANSFORM_QC_REPORT.md | Captures QC results and validation notes for a full GTDB run. |

- Add PROVIDED_BY_COLUMN import and use it for nodes (instead of primary_knowledge_source)
- Add file existence check before opening taxonomy files with helpful error message
- Sort unique taxa for deterministic ID assignment across runs
- Respect CLI-provided input_dir parameter (check input_base_dir first)
- Initialize _created_mappings set in __init__ and clarify dedup comment
- Remove unused hasattr check for _created_mappings

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
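
The file existence check in this commit could look roughly like the sketch below. `require_file` is a hypothetical name and the message wording is an assumption. The download command in the hint is the one given in the Usage section of the PR description.

```python
from pathlib import Path

# Hint text is illustrative; the command itself comes from the PR's Usage section.
DOWNLOAD_HINT = "Run: poetry run kg download -y download_gtdb.yaml -o data/raw"


def require_file(path: Path, hint: str = DOWNLOAD_HINT) -> Path:
    """Return path if it is an existing file, else raise with a helpful hint."""
    if not path.is_file():
        raise FileNotFoundError(f"Required input file not found: {path}. {hint}")
    return path
```

Failing before `open()` lets the error message name the missing file and the command that fetches it, instead of surfacing a bare traceback from deep inside the parser.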
Collaborator Author

@realmarcin left a comment

Addressed Copilot feedback:

✅ Fixed node provenance to use provided_by (was using primary_knowledge_source)
✅ Added file existence check with helpful error message
✅ Made taxon ID assignment deterministic by sorting unique taxa
✅ Fixed input directory to respect CLI-provided input_dir parameter
✅ Clarified dedup comment to match actual behavior
✅ Initialized _created_mappings set in __init__

Remaining items noted but not addressed in this PR:

  • Memory usage optimization (streaming writes) - can be addressed in future PR if needed
  • Unit tests - can be added in future PR

Changes pushed in commit 3b4e4e0.
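
As a sketch of the deterministic ID fix: sorting the unique taxon names before numbering makes the assignment independent of set iteration order and of the input file's row order. `assign_taxon_ids` is a hypothetical helper. The exact sort key and starting index used in the transform are not stated in this PR, so both are assumptions here.

```python
from typing import Iterable


def assign_taxon_ids(taxa: Iterable[str], start: int = 1) -> dict[str, str]:
    """Map each unique taxon name to a GTDB:<n> ID in sorted-name order."""
    unique_sorted = sorted(set(taxa))
    return {name: f"GTDB:{i}" for i, name in enumerate(unique_sorted, start=start)}


# Same taxa in a different order (and with a duplicate) give the same IDs.
first = assign_taxon_ids(["g__Escherichia", "d__Bacteria", "g__Escherichia"])
second = assign_taxon_ids(["d__Bacteria", "g__Escherichia"])
assert first == second
```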

- Test utility functions (parse_taxonomy_string, extract_accession_type, clean_taxon_name)
- Test transform initialization and configuration
- Test taxonomy parsing and hierarchy building
- Test node and edge creation
- Test genome node creation with/without NCBI mappings
- Test deterministic ID assignment
- Test deduplication
- Add minimal test data files in tests/resources/gtdb/

All 19 tests pass successfully.

Addresses Copilot review feedback requesting unit tests.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@realmarcin
Collaborator Author

Unit Tests Added ✅

Added comprehensive unit tests for GTDB transform (commit 021b729):

Test Coverage:

  • ✅ Utility functions (parse_taxonomy_string, extract_accession_type, clean_taxon_name)
  • ✅ Transform initialization and configuration
  • ✅ Taxonomy file parsing (including error handling for missing files)
  • ✅ Taxonomy hierarchy building
  • ✅ Node and edge creation (including deduplication)
  • ✅ Genome node creation with/without NCBI mappings
  • ✅ Deterministic taxon ID assignment
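
For readers unfamiliar with the accession handling being tested, the sketch below shows one way to tell RefSeq (GCF_) from GenBank (GCA_) accessions. It assumes GTDB rows may carry an `RS_` or `GB_` prefix and a version suffix, and it drops the version to match the `GCF_000005845` form used in the PR description. `classify_accession` is a hypothetical name, not the `extract_accession_type` under test.

```python
import re

# Optional RS_/GB_ prefix, GCA or GCF, digits, optional ".version" suffix.
_ACCESSION = re.compile(r"^(?:RS_|GB_)?(GC[AF])_(\d+)(?:\.\d+)?$")


def classify_accession(raw: str) -> tuple[str, str]:
    """Return (source, unversioned accession) for an assembly accession."""
    match = _ACCESSION.match(raw.strip())
    if match is None:
        raise ValueError(f"unrecognized assembly accession: {raw!r}")
    source = "RefSeq" if match.group(1) == "GCF" else "GenBank"
    return source, f"{match.group(1)}_{match.group(2)}"


print(classify_accession("RS_GCF_000005845.2"))
```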

Test Data:
Created minimal test fixtures in tests/resources/gtdb/:

  • bac120_taxonomy.tsv (3 bacterial genomes)
  • bac120_metadata.tsv (with NCBI taxid mappings)
  • ar53_taxonomy.tsv (1 archaeal genome)

Test Results:
All 19 tests pass successfully.

This addresses the Copilot review feedback requesting unit tests.

@realmarcin
Collaborator Author

Summary of Copilot Review Feedback

✅ Resolved (7/8 issues):

  1. ✅ Provenance field - Changed nodes to use provided_by instead of primary_knowledge_source
  2. ✅ File existence check - Added helpful FileNotFoundError with download instructions
  3. ✅ Deterministic ID assignment - Sort unique_taxa before ID assignment
  4. ✅ Input directory override - Respect CLI-provided input_dir parameter
  5. ✅ Deduplication comment - Clarified allows multiple NCBI IDs per GTDB taxon
  6. ✅ Unused import - Removed unused Path import
  7. ✅ Unit tests - Added comprehensive test suite with 19 tests (commit 021b729)

⏭️ Deferred for future work (1/8 issues):

  1. 🔄 Memory optimization (streaming writes) - Valid concern but requires significant refactoring. Current memory usage for full dataset is estimated at ~400MB (914K nodes + 1.16M edges), which is manageable but could be optimized. Suggest addressing in a separate PR focused on memory optimization across all transforms.
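
For context on the deferred item, streaming writes would mean emitting each row as it is produced instead of accumulating all nodes and edges in lists first. A minimal sketch under that assumption follows. The function names are hypothetical, and the three-column header is a simplification of the real KGX edge columns.

```python
import csv
from typing import Iterable, Iterator, TextIO


def iter_genome_edges(rows: Iterable[tuple[str, str]]) -> Iterator[list[str]]:
    """Lazily yield one genome -> taxon edge row per (genome_id, taxon_id) pair."""
    for genome_id, taxon_id in rows:
        yield [genome_id, "biolink:subclass_of", taxon_id]


def write_edges(out: TextIO, edges: Iterable[list[str]]) -> int:
    """Write edge rows to out one at a time and return how many were written."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["subject", "predicate", "object"])  # simplified header
    count = 0
    for edge in edges:
        writer.writerow(edge)
        count += 1
    return count
```

Because `iter_genome_edges` is a generator, only one row is held in memory at a time, so peak memory no longer grows with the number of genomes.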

Test Results:

  • All 19 new unit tests pass ✅
  • Transform successfully processes 732,475 genomes with correct output ✅
  • All QC checks pass (see GTDB_TRANSFORM_QC_REPORT.md) ✅

@realmarcin
Collaborator Author

All Copilot Review Conversations Resolved ✅

All 8 review threads have been addressed and marked as resolved:

Implemented fixes (7 issues):

  1. ✅ Provenance field correction
  2. ✅ File existence check with helpful error
  3. ✅ Deterministic ID assignment
  4. ✅ Input directory parameter handling
  5. ✅ Deduplication comment clarity
  6. ✅ Removed unused import
  7. ✅ Comprehensive unit test suite (19 tests)

Acknowledged for future optimization (1 issue):

  8. ✅ Memory usage optimization - Deferred to future PR. Current implementation is functional with ~400MB memory usage for full dataset.

Status: Ready for human review 🚀

realmarcin and others added 4 commits February 10, 2026 20:54
Resolved conflicts in:
1. download.yaml - Keep both GTDB entries AND Biolink/KGX entries
2. merge.yaml - Keep COG uncommented and add GTDB source

Changes from master include:
- Biolink model and KGX format downloads
- Multiple transform updates (bacdive, bakta, kegg, etc.)
- New utility modules and documentation
- Updated dependencies

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add docstring to __init__ method
- Fix run() docstring to imperative mood
- Use r""" for docstring with backslashes
- Rename unused loop variable to _accession
- Fix import sorting in transform.py
- Fix line-too-long in test_gtdb.py
- Remove unused test variables

All ruff checks now pass.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Keep gtdb branch focused on GTDB transform only.
COG can be enabled separately in a future PR.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Auto-formatted by ruff for consistency with style guide.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@realmarcin merged commit 1d2f90f into master on Feb 12, 2026
3 checks passed
@realmarcin deleted the gtdb branch on February 12, 2026 05:06
crocodile27 pushed a commit that referenced this pull request Mar 4, 2026
crocodile27 pushed a commit that referenced this pull request Mar 4, 2026
2 participants