Implements transform for GTDB (Genome Taxonomy Database) with:
- Hierarchical taxonomy structure (181,959 taxa) using rdfs:subClassOf
- Genome nodes (732,475 genomes: 715K bacterial, 17K archaeal)
- GTDB→NCBITaxon mappings (245,471 edges, 78.9% coverage)
- Support for both RefSeq (GCF) and GenBank (GCA) accessions
- Graceful handling of missing metadata files

Output: 914,434 nodes and 1,159,903 edges

Includes comprehensive documentation (GTDB_IMPLEMENTATION_SUMMARY.md, GTDB_TRANSFORM_QC_REPORT.md) and helper download file (download_gtdb.yaml).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull request overview
Adds a new GTDB (Genome Taxonomy Database) ingestion path to kg-microbe, enabling GTDB taxonomy + genome nodes (and optional GTDB→NCBI mappings) to be produced in KGX TSV form and included in the merged graph.
Changes:
- Introduces a new GTDBTransform plus parsing helpers to produce GTDB taxonomy + genome nodes/edges.
- Registers the new transform and wires GTDB into download + merge configurations.
- Adds implementation/QC documentation and a GTDB-only download manifest.
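As a rough illustration of what "KGX TSV form" means here, the sketch below writes one node file and one edge file as tab-separated text. The column sets, the `GTDB:1` identifier, and the `write_tsv` helper are assumptions made for the example, not the transform's actual code or header.

```python
import csv
import io

# Assumed, simplified column sets; the real transform may emit more columns.
NODE_COLUMNS = ["id", "category", "name", "provided_by"]
EDGE_COLUMNS = ["subject", "predicate", "object", "relation", "primary_knowledge_source"]


def write_tsv(handle, columns, rows):
    """Write a header line followed by the rows as tab-separated values."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)


# Illustrative rows only; "GTDB:1" is a placeholder identifier.
nodes = [
    ["GTDB:1", "biolink:OrganismTaxon", "Escherichia coli", "GTDB"],
    ["GenBank:GCF_000005845", "biolink:Genome", "GCF_000005845", "GTDB"],
]
edges = [
    ["GenBank:GCF_000005845", "biolink:subclass_of", "GTDB:1", "rdfs:subClassOf", "GTDB"],
]

# In-memory buffers stand in for nodes.tsv and edges.tsv.
node_buf, edge_buf = io.StringIO(), io.StringIO()
write_tsv(node_buf, NODE_COLUMNS, nodes)
write_tsv(edge_buf, EDGE_COLUMNS, edges)
```

Nodes carry their source in `provided_by`, while edges carry it in `primary_knowledge_source`, which is the distinction the review feedback in this thread turns on.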
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| kg_microbe/transform_utils/gtdb/gtdb.py | Core GTDB transform that parses taxonomy/metadata and writes KGX TSV nodes/edges. |
| kg_microbe/transform_utils/gtdb/utils.py | Helper parsing/normalization utilities for GTDB taxonomy/accessions. |
| kg_microbe/transform_utils/gtdb/__init__.py | Exposes GTDBTransform from the package. |
| kg_microbe/transform_utils/constants.py | Adds GTDB constants (source name, prefixes, file names/paths, genome category, close_match constants). |
| kg_microbe/transform.py | Registers GTDB as a selectable transform source. |
| download.yaml | Adds GTDB release URLs to the standard download manifest. |
| download_gtdb.yaml | Adds a GTDB-only download manifest. |
| merge.yaml | Adds GTDB nodes/edges files to the merged graph configuration. |
| GTDB_IMPLEMENTATION_SUMMARY.md | Documents intended model/implementation decisions for GTDB ingestion. |
| GTDB_TRANSFORM_QC_REPORT.md | Captures QC results and validation notes for a full GTDB run. |
- Add PROVIDED_BY_COLUMN import and use it for nodes (instead of primary_knowledge_source)
- Add file existence check before opening taxonomy files with helpful error message
- Sort unique taxa for deterministic ID assignment across runs
- Respect CLI-provided input_dir parameter (check input_base_dir first)
- Initialize _created_mappings set in __init__ and clarify dedup comment
- Remove unused hasattr check for _created_mappings

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
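The deterministic ID assignment mentioned in this commit can be sketched as follows. Sorting the unique taxon names before numbering means the same input always yields the same IDs, whatever order the file was read in. The function name, the `GTDB:` prefix with integer suffixes, and the sample names are assumptions for illustration; the transform's real ID scheme is not shown in this thread.

```python
def assign_taxon_ids(taxon_names, prefix="GTDB:"):
    """Map each unique taxon name to an ID by numbering names in sorted order."""
    unique_sorted = sorted(set(taxon_names))
    return {name: f"{prefix}{index}" for index, name in enumerate(unique_sorted, start=1)}


# Two runs that see the same taxa in a different order (and with repeats)
# end up with identical mappings.
run_a = assign_taxon_ids(["g__Escherichia", "d__Bacteria", "g__Escherichia"])
run_b = assign_taxon_ids(["d__Bacteria", "g__Escherichia"])
```

Numbering an unsorted `set` directly would make the IDs depend on iteration order, which can differ between interpreter runs for strings.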
realmarcin
left a comment
Addressed Copilot feedback:
✅ Fixed node provenance to use provided_by (was using primary_knowledge_source)
✅ Added file existence check with helpful error message
✅ Made taxon ID assignment deterministic by sorting unique taxa
✅ Fixed input directory to respect CLI-provided input_dir parameter
✅ Clarified dedup comment to match actual behavior
✅ Initialized _created_mappings set in __init__
Remaining items noted but not addressed in this PR:
- Memory usage optimization (streaming writes) - can be addressed in future PR if needed
- Unit tests - can be added in future PR
Changes pushed to gtdb branch in commit 3b4e4e0.
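For readers unfamiliar with the file existence and input directory fixes listed above, a minimal sketch of the pattern might look like this. The function name, the `default_dir` value, and the wording of the error message are hypothetical; only the idea (prefer the caller-supplied directory, fail early with an actionable message) comes from the comment.

```python
from pathlib import Path


def resolve_input_file(filename, input_dir=None, default_dir="data/raw"):
    """Return the path to an input file, preferring a caller-supplied directory.

    `default_dir` is a hypothetical fallback used only in this sketch.
    """
    base = Path(input_dir) if input_dir else Path(default_dir)
    path = base / filename
    if not path.is_file():
        # Fail before any parsing starts, and say which path was tried.
        raise FileNotFoundError(
            f"GTDB input file not found: {path}. "
            "Run the download step first or pass a different input directory."
        )
    return path
```

Checking up front turns a bare open() failure deep inside parsing into a message that names the missing file and the likely remedy.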
realmarcin
left a comment
Addressed Copilot feedback:
✅ Fixed node provenance to use provided_by (was using primary_knowledge_source)
✅ Added file existence check with helpful error message
✅ Made taxon ID assignment deterministic by sorting unique taxa
✅ Fixed input directory to respect CLI-provided input_dir parameter
✅ Clarified dedup comment to match actual behavior
✅ Initialized _created_mappings set in __init__
Remaining items noted but not addressed in this PR:
- Memory usage optimization (streaming writes) - can be addressed in future PR if needed
- Unit tests - can be added in future PR
Changes pushed in commit 3b4e4e0.
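The deferred memory optimization (streaming writes) could take roughly the following shape: each genome row is written as soon as its metadata line is read, so memory use stays flat regardless of how many genomes the file holds. The function name, the assumption that the accession is the first tab-separated field, and the output columns are all illustrative.

```python
import csv
import io


def stream_genome_nodes(metadata_lines, out_handle):
    """Write one node row per metadata line without building a list in memory."""
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "category", "name", "provided_by"])
    count = 0
    for line in metadata_lines:
        # Assumption for this sketch: the accession is the first column.
        accession = line.rstrip("\n").split("\t", 1)[0]
        if not accession:
            continue
        writer.writerow([f"GenBank:{accession}", "biolink:Genome", accession, "GTDB"])
        count += 1
    return count


# Any iterable of lines works, including an open file handle.
out = io.StringIO()
written = stream_genome_nodes(["GCF_000005845\tother\n", "\n", "GCA_000008085\tother\n"], out)
```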
- Test utility functions (parse_taxonomy_string, extract_accession_type, clean_taxon_name)
- Test transform initialization and configuration
- Test taxonomy parsing and hierarchy building
- Test node and edge creation
- Test genome node creation with/without NCBI mappings
- Test deterministic ID assignment
- Test deduplication
- Add minimal test data files in tests/resources/gtdb/

All 19 tests pass successfully.

Addresses Copilot review feedback requesting unit tests.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
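To give a sense of what utility tests like these exercise, here is a small sketch of two helpers and pytest-style tests for them. `split_lineage` and `accession_kind` are hypothetical stand-ins, not the signatures of parse_taxonomy_string or extract_accession_type. The sketch assumes GTDB's semicolon-separated lineage strings with rank prefixes such as `d__` and `s__`, and accessions that may carry an `RS_` or `GB_` prefix.

```python
RANK_PREFIXES = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def split_lineage(taxonomy):
    """Split 'd__Bacteria;p__...' into (rank, name) pairs, skipping empty ranks."""
    ranks = []
    for part in taxonomy.split(";"):
        prefix, sep, name = part.strip().partition("__")
        if sep and prefix in RANK_PREFIXES and name:
            ranks.append((RANK_PREFIXES[prefix], name))
    return ranks


def accession_kind(accession):
    """Classify an accession as RefSeq (GCF) or GenBank (GCA), else None."""
    core = accession.split("_", 1)[1] if accession.startswith(("RS_", "GB_")) else accession
    if core.startswith("GCF_"):
        return "RefSeq"
    if core.startswith("GCA_"):
        return "GenBank"
    return None


def test_split_lineage_skips_empty_ranks():
    parsed = split_lineage("d__Bacteria;g__Escherichia;s__")
    assert parsed == [("domain", "Bacteria"), ("genus", "Escherichia")]


def test_accession_kind():
    assert accession_kind("RS_GCF_000005845.2") == "RefSeq"
    assert accession_kind("GCA_000008085.1") == "GenBank"
```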
Unit Tests Added ✅
Added comprehensive unit tests for GTDB transform (commit 021b729):
Test Coverage:
Test Data:
Test Results:
This addresses the Copilot review feedback requesting unit tests.
Summary of Copilot Review Feedback
✅ Resolved (7/8 issues):
⏭️ Deferred for future work (1/8 issues):
Test Results:
All Copilot Review Conversations Resolved ✅
All 8 review threads have been addressed and marked as resolved:
Implemented fixes (7 issues):
Acknowledged for future optimization (1 issue):
Status: Ready for human review 🚀
Resolved conflicts in:
1. download.yaml - Keep both GTDB entries AND Biolink/KGX entries
2. merge.yaml - Keep COG uncommented and add GTDB source

Changes from master include:
- Biolink model and KGX format downloads
- Multiple transform updates (bacdive, bakta, kegg, etc.)
- New utility modules and documentation
- Updated dependencies

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add docstring to __init__ method
- Fix run() docstring to imperative mood
- Use r""" for docstring with backslashes
- Rename unused loop variable to _accession
- Fix import sorting in transform.py
- Fix line-too-long in test_gtdb.py
- Remove unused test variables

All ruff checks now pass.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Keep gtdb branch focused on GTDB transform only. COG can be enabled separately in a future PR.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Auto-formatted by ruff for consistency with style guide.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add GTDB taxonomy transform
Summary
Implements a new transform to ingest GTDB (Genome Taxonomy Database) taxonomy data into the kg-microbe knowledge graph.
Features
- rdfs:subClassOf relationships
- skos:closeMatch edges (78.9% coverage)

Data Model
Node Types:
- GTDB:* - GTDB taxon nodes (biolink:OrganismTaxon)
- GenBank:GCF_* / GenBank:GCA_* - Genome nodes (biolink:Genome)

Edge Types:
- biolink:subclass_of (rdfs:subClassOf)
- biolink:subclass_of (rdfs:subClassOf)
- biolink:close_match (skos:closeMatch)

Transform Output
Files Changed
New Files
- kg_microbe/transform_utils/gtdb/gtdb.py - Main transform class
- kg_microbe/transform_utils/gtdb/utils.py - Helper functions
- kg_microbe/transform_utils/gtdb/__init__.py - Module initialization
- GTDB_IMPLEMENTATION_SUMMARY.md - Complete implementation documentation
- GTDB_TRANSFORM_QC_REPORT.md - QC results and validation
- download_gtdb.yaml - Helper file to download only GTDB sources

Modified Files
- download.yaml - Added GTDB data sources
- kg_microbe/transform.py - Registered GTDB transform
- kg_microbe/transform_utils/constants.py - Added GTDB constants and paths
- merge.yaml - Added GTDB to merge configuration

Testing
✅ Tested with full GTDB dataset (732K genomes)
✅ QC validation passed (see GTDB_TRANSFORM_QC_REPORT.md)
✅ All node categories correct (OrganismTaxon, Genome)
✅ All edge types correct (subclass_of, close_match)
✅ Taxonomy hierarchy complete (genome → domain)
✅ Deduplication working (no duplicate NCBI mappings)
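The deduplication check in the last item can be pictured with a small sketch: a set remembers which (GTDB taxon, NCBI taxon) pairs already have an edge, so a pair seen again through another genome is skipped. The helper name, the edge column order, and the IDs are assumptions for illustration; the thread only confirms that a _created_mappings set is kept on the transform.

```python
def add_mapping(created, edges, gtdb_id, ncbi_taxid):
    """Append a close_match edge unless this pair has already been emitted."""
    key = (gtdb_id, ncbi_taxid)
    if key in created:
        return False
    created.add(key)
    edges.append(
        [gtdb_id, "biolink:close_match", f"NCBITaxon:{ncbi_taxid}", "skos:closeMatch", "GTDB"]
    )
    return True


created_mappings = set()  # plays the role of the transform's _created_mappings
mapping_edges = []
# "GTDB:1" is a placeholder ID; 562 is the NCBI taxon ID for Escherichia coli.
first = add_mapping(created_mappings, mapping_edges, "GTDB:1", "562")
second = add_mapping(created_mappings, mapping_edges, "GTDB:1", "562")
```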
Usage
Example Query
Complete taxonomy path for E. coli genome GCF_000005845:
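One way such a path could be recovered from the edge file is to follow biolink:subclass_of edges upward from the genome node until no parent remains. The sketch below does this on a few toy edges; the `GTDB:species`, `GTDB:genus`, and `GTDB:domain` identifiers are placeholders, not the IDs the transform assigns, and the real hierarchy has more ranks.

```python
def taxonomy_path(start, edges):
    """Follow subclass_of edges from `start` to the root and return the visited IDs."""
    parent = {subj: obj for subj, pred, obj in edges if pred == "biolink:subclass_of"}
    path, seen = [start], {start}
    node = start
    # The `seen` guard stops the walk if the data ever contains a cycle.
    while node in parent and parent[node] not in seen:
        node = parent[node]
        path.append(node)
        seen.add(node)
    return path


# Toy (subject, predicate, object) edges with placeholder taxon IDs.
toy_edges = [
    ("GenBank:GCF_000005845", "biolink:subclass_of", "GTDB:species"),
    ("GTDB:species", "biolink:subclass_of", "GTDB:genus"),
    ("GTDB:genus", "biolink:subclass_of", "GTDB:domain"),
    ("GTDB:species", "biolink:close_match", "NCBITaxon:562"),
]
path = taxonomy_path("GenBank:GCF_000005845", toy_edges)
```

The close_match edge is ignored by the walk, so NCBI mappings do not interfere with the GTDB hierarchy.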
Documentation
- GTDB_IMPLEMENTATION_SUMMARY.md
- GTDB_TRANSFORM_QC_REPORT.md

Notes
🤖 Generated with Claude Code