Comprehensive Microbial Culture Media Knowledge Graph
A production-ready knowledge base containing 10,657 culture media recipes from 10 major international repositories, with LinkML schema validation, ontology grounding, and browser-based exploration.
Total Recipes: 10,657 culture media formulations
| Category | Recipes | Sources |
|---|---|---|
| Bacterial | 10,134 | MediaDive, TOGO, BacDive, ATCC, NBRC, KOMODO, MediaDB |
| Algae | 242 | UTEX, CCAP, SAG |
| Fungal | 119 | MediaDive, TOGO |
| Specialized | 99 | KOMODO |
| Archaea | 63 | MediaDive, TOGO |
| Source | Recipes | Type | Description |
|---|---|---|---|
| KOMODO | 3,637 | Bacterial | Korean microbial media database |
| MediaDive | 3,327 | Multi-kingdom | DSMZ comprehensive collection |
| TOGO Medium | 2,917 | Multi-kingdom | Japanese BRCs curated database |
| MediaDB | 469 | Defined | Chemically defined media |
| CCAP | 113 | Algae | UK algae & protozoa collection |
| UTEX | 99 | Algae | University of Texas algae |
| SAG | 30 | Algae | German algae culture collection |
| NBRC | 2 | Bacterial | Japanese biological resources |
| BacDive | 1 | Bacterial | DSMZ cultivation conditions |
| Medium Type | Recipes | Percentage |
|---|---|---|
| Complex | 8,399 | 79.3% |
| Defined | 2,196 | 20.7% |
Complex media contain undefined components (e.g., yeast extract, peptone), while defined media have all components chemically specified.
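As a rough illustration of that distinction, the sketch below labels a recipe COMPLEX as soon as one ingredient name matches a short list of undefined components. The `UNDEFINED_COMPONENTS` set and the `classify_medium_type` helper are hypothetical and are not part of the CultureMech package; the real classification may rely on curated metadata rather than name matching.

```python
# Hypothetical helper for illustration only; not CultureMech's actual logic.
UNDEFINED_COMPONENTS = {"yeast extract", "peptone", "tryptone", "beef extract", "serum"}


def classify_medium_type(ingredient_names):
    """Return 'COMPLEX' if any ingredient is an undefined component, else 'DEFINED'."""
    for name in ingredient_names:
        if name.strip().lower() in UNDEFINED_COMPONENTS:
            return "COMPLEX"
    return "DEFINED"


print(classify_medium_type(["Tryptone", "Yeast extract", "NaCl"]))  # COMPLEX
print(classify_medium_type(["NaNO3", "K2HPO4", "MgSO4"]))           # DEFINED
```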
| State | Recipes | Percentage |
|---|---|---|
| Liquid | 10,593 | 99.98% |
| Solid (Agar) | 2 | 0.02% |
| Metric | Value | Percentage |
|---|---|---|
| Recipes with ingredients | 6,815 | 64.3% |
| CHEBI-grounded ingredients | 3,548 | 33.5% |
| Average ingredients/recipe | 15.7 | - |
| LinkML validated | 10,595 | 100% |
Ontology Grounding:
- Chemicals: CHEBI (Chemical Entities of Biological Interest)
- Biological materials: FOODON (Food Ontology), UBERON (Anatomy), ENVO (Environment)
- Organisms: NCBITaxon (NCBI Taxonomy)
- Media databases: DSMZ, TOGO, ATCC prefixes
Unmapped Ingredients Tracking System (2026-03-05):
- 136 unmapped ingredients identified across 522 media (4.9% of total)
- 3,084 total instances requiring ontology term mapping
- Automated detection of numeric placeholders ('1', '2', '3'), generic terms, and empty values
- Chemical name extraction from notes fields for mapping assistance
- Priority-based mapping recommendations (critical: 51+ occurrences)
- See UNMAPPED_INGREDIENTS_SUMMARY.md and docs/unmapped_ingredients_guide.md
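The detection and prioritization steps listed above could be sketched roughly as follows. This is a minimal sketch, not the code in scripts/aggregate_unmapped_ingredients.py: the function names, the `GENERIC_TERMS` examples, and every priority cut-off other than "critical: 51+ occurrences" are assumptions made for illustration.

```python
from collections import Counter

# Assumed examples of generic terms; the real list is not given in this README.
GENERIC_TERMS = {"unknown", "other", "solution", "see notes"}


def unmapped_reason(term):
    """Return why a term looks unmapped, or None if it looks usable."""
    text = (term or "").strip()
    if not text:
        return "empty"
    if text.isdigit():
        return "numeric_placeholder"
    if text.lower() in GENERIC_TERMS:
        return "generic"
    return None


def priority(count):
    """Only the 'critical' threshold (51+) comes from the text; 'high' is assumed."""
    if count >= 51:
        return "critical"
    if count >= 11:
        return "high"
    return "low"


terms = ["1"] * 60 + ["", "unknown", "NaCl", "2", "2"]
counts = Counter(t for t in terms if unmapped_reason(t))
report = {t: (n, priority(n)) for t, n in counts.items()}
print(report)
```

Counting occurrences before ranking keeps the mapping effort focused on the few placeholders that affect the most recipes.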
Advanced Normalization & SSSOM Enrichment (2026-02):
- Integrated MicroMediaParam's production-grade 16-step chemical normalization pipeline
- 100+ curated biological products (yeast extract, peptone, serum, DNA, agar, etc.)
- 100+ chemical formula mappings (Fe2(SO4)3 → iron(III) sulfate)
- 15+ buffer abbreviations (HEPES, MES, Tris)
- 11 common laboratory gases (CO2, N2, O2, H2, CH4, etc.)
- Unicode dot normalization (5 variants: ·, ・, •, ∙, ⋅)
- Coverage achieved: 45.6% (2,302 / 5,048 ingredients), +935 new mappings from baseline
- 68.4% relative increase in coverage (27.1% → 45.6%)
- See PROJECT_STATUS_SUMMARY.md and GAS_MAPPING_SUMMARY.md for details
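Two of the normalization steps listed above, dot normalization and formula mapping, can be sketched in a few lines. This is an illustrative sketch only, not the MicroMediaParam pipeline: the helper names are hypothetical, the five code points are an assumed reading of the dot variants in the list, and collapsing them to the middle dot (U+00B7) is an assumed canonical form.

```python
import re

# Assumed code points for the five dot variants: middle dot, katakana middle
# dot, bullet, bullet operator, dot operator.
DOT_VARIANTS = "\u00b7\u30fb\u2022\u2219\u22c5"
_DOT_RE = re.compile(f"\\s*[{DOT_VARIANTS}]\\s*")

# Single example taken from the list above; the real table has 100+ entries.
FORMULA_TO_NAME = {"Fe2(SO4)3": "iron(III) sulfate"}


def normalize_dots(name):
    """Collapse any dot variant (and surrounding spaces) to one middle dot."""
    return _DOT_RE.sub("\u00b7", name)


def normalize_ingredient(raw):
    """Strip whitespace, unify hydrate dots, then map known formulas to names."""
    cleaned = normalize_dots(raw.strip())
    return FORMULA_TO_NAME.get(cleaned, cleaned)


print(normalize_ingredient("MgSO4 \u2022 7H2O"))  # MgSO4·7H2O
print(normalize_ingredient(" Fe2(SO4)3 "))        # iron(III) sulfate
```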
Enum Normalization (2026-02-20):
- Normalized 10,657 YAML files for schema compliance
- Fixed capitalization: medium_type (COMPLEX, DEFINED), physical_state (LIQUID, SOLID_AGAR)
- Recategorized all "imported" files to proper organism types (bacterial, fungal, archaea, algae, specialized)
- 100% schema compliance across all enum fields
- ✅ 10,657 recipes - Production-ready dataset from 10 authoritative sources
- ✅ Four-tier architecture - Clean separation: raw → raw_yaml → normalized_yaml → merge_yaml
- ✅ Recipe deduplication - Merge recipes with same ingredient sets (~344 unique base formulations)
- ✅ LinkML schema validation - Comprehensive data quality enforcement
- ✅ Ontology grounding - CHEBI for chemicals, NCBITaxon for organisms
- ✅ Full provenance tracking - Complete source attribution and curation history
- ✅ Automated pipelines - Fetchers, converters, and importers for all sources
- ✅ Browser interface - Faceted search and filtering
- ✅ UMAP visualization - Interactive 2D embedding visualization for exploring media similarity
- ✅ Knowledge graph export - Biolink-compliant KGX format
- ✅ Literature verification - 6-tier cascading PDF retrieval for cross-reference validation
- ✅ ATCC cross-references - Automated equivalency detection with DSMZ media
- ✅ Unmapped ingredients tracking - Automated detection and prioritization of ingredients needing ontology mapping
- ✅ Comprehensive documentation - 30+ guides in docs/
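The recipe deduplication feature merges recipes that share the same ingredient set. One way to picture this is to fingerprint each recipe by an order-insensitive set of ingredient names, as in the sketch below. The helper names are hypothetical, the record layout follows the agent_term / preferred_term structure of the recipe YAML, and the actual merge step may also compare amounts or ontology IDs.

```python
from collections import defaultdict


def ingredient_key(recipe):
    """Order-insensitive fingerprint: the set of lower-cased ingredient names."""
    return frozenset(
        ing["agent_term"]["preferred_term"].strip().lower()
        for ing in recipe.get("ingredients", [])
    )


def group_by_ingredients(recipes):
    """Map each ingredient fingerprint to the names of the recipes sharing it."""
    groups = defaultdict(list)
    for recipe in recipes:
        groups[ingredient_key(recipe)].append(recipe["name"])
    return groups


def make(name, *terms):
    """Build a tiny recipe record for the demo."""
    return {"name": name,
            "ingredients": [{"agent_term": {"preferred_term": t}} for t in terms]}


recipes = [
    make("Medium A", "Tryptone", "Yeast extract", "NaCl"),
    make("Medium B", "nacl", "tryptone", "yeast extract"),
    make("Medium C", "NaNO3", "K2HPO4"),
]
merged = [names for names in group_by_ingredients(recipes).values() if len(names) > 1]
print(merged)  # [['Medium A', 'Medium B']]
```

Because only names are compared here, two recipes that differ solely in concentrations land in the same group, which is why a grouping like this yields base formulations rather than exact duplicates.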
# Clone the repository
git clone https://github.com/CultureBotAI/CultureMech.git
cd CultureMech
# Install dependencies (requires uv)
just install
# Optional: Install Koza for KG export
just install-koza

# Generate browser data from recipes
just gen-browser-data
# Serve locally
just serve-browser
# Open http://localhost:8000/app/

# Generate interactive UMAP visualization (requires KG-Microbe embeddings)
just gen-media-umap /path/to/embeddings.tsv.gz
# View locally
open docs/media_umap.html
# See docs/MEDIA_UMAP_GUIDE.md for detailed instructions

just count-recipes
# Output:
# algae: 242
# bacterial: 10,072
# fungal: 119
# archaea: 63
# specialized: 99
# Total: 10,595

# Validate a single recipe
just validate data/normalized_yaml/bacterial/LB_Broth.yaml
# Validate all recipes
just validate-all
# Schema validation only
just validate-schema data/normalized_yaml/bacterial/LB_Broth.yaml

CultureMech integrates culture media recipes from 10 major international repositories:
| Source | Recipes | Description | Status |
|---|---|---|---|
| KOMODO | 3,637 | Korean microbial media database | Complete |
| MediaDive (DSMZ) | 3,327 | German Collection, comprehensive bacterial/fungal media | Complete |
| TOGO Medium | 2,917 | Japanese BRCs, curated media database | Complete |
| MediaDB | 469 | Chemically defined media database | Complete |
| CCAP | 113 | UK Culture Collection of Algae and Protozoa | Complete |
| UTEX | 99 | University of Texas algae collection | Complete |
| SAG | 30 | German algae culture collection (Göttingen) | Complete |
| NBRC | 2 | Japanese NITE Biological Resource Center | Initial |
| BacDive | 1 | DSMZ cultivation conditions database | Initial |
| Source | Potential | Description | Notes |
|---|---|---|---|
| BacDive | ~2,500+ | Additional organism-specific cultivation conditions | Requires API access |
| ATCC | ~900 | American Type Culture Collection media | Web scraping needed |
| NBRC | ~420 | Additional NITE media formulations | Incremental import |
Three major algae culture collections fully integrated:
- UTEX (Austin, TX): 99 recipes - Full composition details
- CCAP (Oban, Scotland): 113 recipes - Metadata + PDF references
- SAG (Göttingen, Germany): 30 recipes - Metadata + PDF references
Total: 242 algae media recipes covering:
- Freshwater algae (BG-11, Bold's Basal, TAP)
- Marine phytoplankton (f/2, Erdschreiber's)
- Cyanobacteria (Spirulina, BG-11 variants)
- Specialized media (diatoms, euglenoids, volvocales)
See docs/ALGAE_PIPELINE_COMPLETE.md for details.
# Fetch all available sources
just fetch-algae-collections # UTEX, CCAP, SAG
just fetch-bacdive 100 # BacDive (requires registration)
just fetch-nbrc 50 # NBRC web scraping
# Import to normalized format
just import-algae-collections
just import-bacdive
just import-nbrc

CultureMech/
├── src/culturemech/                  # Python package
│   ├── schema/                       # LinkML schema definitions
│   │   ├── culturemech.yaml          # Main schema (1800+ lines)
│   │   └── unmapped_ingredients_schema.yaml  # Unmapped ingredients schema
│   ├── fetch/                        # Data fetchers (10 sources)
│   │   ├── utex_fetcher.py           # UTEX algae media
│   │   ├── ccap_fetcher.py           # CCAP algae media
│   │   ├── sag_fetcher.py            # SAG algae media
│   │   └── ... (7 more fetchers)
│   ├── convert/                      # Raw YAML converters
│   ├── import/                       # Normalized importers (11 total)
│   │   ├── utex_importer.py          # Full UTEX pipeline
│   │   ├── ccap_importer.py          # CCAP metadata importer
│   │   ├── sag_importer.py           # SAG metadata importer
│   │   └── ... (8 more importers)
│   ├── export/                       # Export modules
│   │   ├── browser_export.py         # Browser data generator
│   │   └── kgx_export.py             # Knowledge graph export
│   └── render.py                     # HTML page generator
│
├── scripts/                          # Utility scripts
│   ├── aggregate_unmapped_ingredients.py  # Aggregate unmapped ingredients
│   └── unmapped_ingredients_stats.py      # Generate statistics reports
│
├── output/                           # Generated outputs
│   └── unmapped_ingredients.yaml     # Aggregated unmapped ingredients (502KB)
│
├── data/                             # Three-tier data architecture
│   ├── raw/                          # Layer 1: Source files (git ignored)
│   │   ├── utex/                     # UTEX raw data
│   │   ├── ccap/                     # CCAP raw data
│   │   ├── sag/                      # SAG raw data
│   │   └── ... (10+ sources)
│   ├── raw_yaml/                     # Layer 2: Unnormalized YAML (git ignored)
│   └── normalized_yaml/              # Layer 3: Curated recipes (in git)
│       ├── algae/                    # 242 algae recipes
│       ├── bacterial/                # 10,072 bacterial recipes
│       ├── fungal/                   # 119 fungal recipes
│       ├── archaea/                  # 63 archaeal recipes
│       └── specialized/              # 99 specialized recipes
│
├── docs/                             # Comprehensive documentation
│   ├── QUICK_START.md                # 5-minute getting started
│   ├── DATA_LAYERS.md                # Three-tier architecture
│   ├── ALGAE_PIPELINE_COMPLETE.md    # Algae integration guide
│   └── ... (27 more docs)
│
├── app/                              # Browser interface
│   ├── index.html                    # Faceted search UI
│   └── schema.js                     # Browser configuration
│
├── tests/                            # Test suite
├── conf/                             # Configuration files
├── project.justfile                  # Build automation (80+ commands)
└── pyproject.toml                    # Python project config
Comprehensive documentation is available in the docs/ directory:
- Quick Start Guide - Get up and running in 5 minutes
- Quick Reference - Command cheat sheet
- Contributing Guide - How to contribute
- Data Layers - Three-tier architecture explained
- Migration Guide - Directory structure reference
- Implementation Status - Integration progress
- Algae Pipeline - UTEX/CCAP/SAG integration (242 recipes)
- UTEX Deployment - Full UTEX pipeline details
- CCAP/SAG Deployment - Metadata import details
- Data Sources Summary - All source repositories
- Enrichment Guide - Data quality improvement workflow
- Implementation Summary - Literature verification & enum normalization
- Unmapped Ingredients Guide - System for tracking ingredients needing ontology mapping
- Unmapped Ingredients Summary - Executive summary with statistics and priorities
Recipes are stored as YAML files following the LinkML schema:
name: BG-11 Medium
category: algae
medium_type: COMPLEX
physical_state: LIQUID
description: Standard cyanobacteria medium from UTEX Culture Collection
ingredients:
  - agent_term:
      preferred_term: NaNO3
    amount: 1.5 g/L
  - agent_term:
      preferred_term: K2HPO4
    amount: 0.04 g/L
preparation_steps:
  - step_number: 1
    instruction: Dissolve all ingredients in distilled water
  - step_number: 2
    instruction: Autoclave at 121°C for 20 minutes
# Algae-specific fields
light_intensity: 50-100 µmol photons m⁻² s⁻¹
light_cycle: 12:12 or 16:8 light:dark
temperature_range: 15-30°C depending on species
applications:
  - Algae cultivation
  - Cyanobacteria culture
  - Phytoplankton research
curation_history:
  - curator: utex-import
    date: '2026-01-28'
    action: Imported from UTEX Culture Collection
references:
  - reference_id: UTEX:bg-11-medium
  - reference_id: https://utex.org/products/bg-11-medium

See data/normalized_yaml/ for complete examples.
The schema (src/culturemech/schema/culturemech.yaml) defines:
Key Classes:
- MediaRecipe - Root entity (one per YAML file)
- IngredientDescriptor - Chemicals with CHEBI terms
- OrganismDescriptor - Target organisms with NCBITaxon IDs
- SolutionDescriptor - Stock solutions
- PreparationStep - Ordered protocol steps
- MediaVariant - Related formulations
Ontology Bindings:
- CHEBI - Chemical ingredients
- NCBITaxon - Target organisms
- UO - Units of measurement
- Source databases - DSMZ, TOGO, ATCC, UTEX, CCAP, SAG
Enums:
- MediumTypeEnum: DEFINED, COMPLEX, MINIMAL, SELECTIVE, DIFFERENTIAL, ENRICHMENT
- PhysicalStateEnum: LIQUID, SOLID_AGAR, SEMISOLID, BIPHASIC
- PreparationActionEnum: DISSOLVE, MIX, HEAT, AUTOCLAVE, FILTER_STERILIZE
- SterilizationMethodEnum: AUTOCLAVE, FILTER, DRY_HEAT, TYNDALLIZATION
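To make the effect of these enums concrete, the sketch below mirrors two of them as Python `Enum` classes and reports values that are not permissible, such as the lower-case spellings that enum normalization corrects. It is a stand-in for illustration only: real validation runs through the LinkML schema via `just validate`, and the `enum_errors` helper is hypothetical.

```python
from enum import Enum


class MediumType(Enum):
    DEFINED = "DEFINED"
    COMPLEX = "COMPLEX"
    MINIMAL = "MINIMAL"
    SELECTIVE = "SELECTIVE"
    DIFFERENTIAL = "DIFFERENTIAL"
    ENRICHMENT = "ENRICHMENT"


class PhysicalState(Enum):
    LIQUID = "LIQUID"
    SOLID_AGAR = "SOLID_AGAR"
    SEMISOLID = "SEMISOLID"
    BIPHASIC = "BIPHASIC"


def enum_errors(recipe):
    """Return one message per enum field whose value is not a permissible name."""
    errors = []
    for field, enum_cls in (("medium_type", MediumType),
                            ("physical_state", PhysicalState)):
        value = recipe.get(field)
        if value is not None and value not in enum_cls.__members__:
            allowed = ", ".join(enum_cls.__members__)
            errors.append(f"{field}: {value!r} is not one of {allowed}")
    return errors


print(enum_errors({"medium_type": "COMPLEX", "physical_state": "LIQUID"}))  # []
print(enum_errors({"medium_type": "complex", "physical_state": "Solid"}))
```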
Added fields for algae culture conditions:
- light_intensity - µmol photons m⁻² s⁻¹
- light_cycle - Photoperiod (e.g., "16:8 light:dark")
- light_quality - Light source type
- temperature_range - Cultivation temperature
- salinity - Marine vs freshwater
- aeration - CO₂ supplementation
- culture_vessel - Flask, tube, bioreactor
Layer 1: raw/ → Raw source files (JSON, TSV, SQL)
    ↓
Layer 2: raw_yaml/ → Unnormalized YAML (preserves original structure)
    ↓
Layer 3: normalized_yaml/ → LinkML-validated, ontology-grounded recipes
Benefits:
- Reproducible pipeline from source to curated data
- Easy to re-import with schema changes
- Clear separation of concerns
- Version control on curated layer only
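Because the curated layer is a plain directory tree with one sub-directory per category, simple tooling can work on it directly. The sketch below counts YAML files per category in the spirit of `just count-recipes`; it is not that command's implementation, `count_recipes` is a hypothetical helper, and the demo runs against a temporary directory instead of data/normalized_yaml/.

```python
from collections import Counter
from pathlib import Path
import tempfile


def count_recipes(normalized_dir):
    """Count *.yaml files in each category sub-directory."""
    root = Path(normalized_dir)
    return Counter({
        d.name: sum(1 for _ in d.glob("*.yaml"))
        for d in sorted(root.iterdir())
        if d.is_dir()
    })


# Demo on a throwaway tree that mimics the category layout.
layout = {"algae": ["BG-11_Medium"], "bacterial": ["LB_Broth", "Your_Medium"]}
with tempfile.TemporaryDirectory() as tmp:
    for category, names in layout.items():
        (Path(tmp) / category).mkdir()
        for name in names:
            (Path(tmp) / category / f"{name}.yaml").write_text("name: placeholder\n")
    demo_counts = count_recipes(tmp)

print(dict(demo_counts))  # {'algae': 1, 'bacterial': 2}
```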
# Full validation (schema + ontologies)
just validate data/normalized_yaml/algae/BG-11_Medium.yaml
# Schema validation only
just validate-schema data/normalized_yaml/algae/BG-11_Medium.yaml
# Validate all recipes
just validate-all

Every recipe includes:
- Source database attribution
- Fetch date and version
- Import date and curator
- Cross-references to original sources
- PDF URLs for detailed protocols (CCAP/SAG)
NEW (2026-02-20): CultureMech now includes a comprehensive literature verification system for validating cross-references through scientific papers.
The system attempts to retrieve PDFs from multiple sources in order:
- Direct Publisher Access - ASM, PLOS, Frontiers, MDPI, Nature, Science, Elsevier
- PubMed Central (PMC) - NCBI idconv API
- Unpaywall API - Open access aggregator
- Semantic Scholar - Open PDF endpoint
- Sci-Hub Fallback - Optional, disabled by default (requires explicit opt-in)
- Web Search - arXiv, bioRxiv, Europe PMC
- ✅ Legal sources first - Always tries publisher, PMC, Unpaywall, and Semantic Scholar before fallback
- ✅ Sci-Hub opt-in only - Disabled by default, requires the --enable-scihub-fallback flag
- ✅ Full provenance - Tracks which tier successfully retrieved each PDF
- ✅ Evidence extraction - 8 regex patterns for detecting media equivalencies
- ✅ Batch processing - Verify multiple candidates efficiently
- ✅ Caching layer - Metadata and PDFs cached locally to avoid repeated requests
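The cascading behaviour described above (try each tier in order, skip opt-in tiers unless enabled, and record which tier succeeded) can be sketched as below. This is a minimal sketch under stated assumptions: `retrieve_pdf`, the tier tuples, and the stub fetchers are hypothetical, no network access is performed, and the real module's interfaces may differ.

```python
from typing import Callable, List, Optional, Tuple

Fetcher = Callable[[str], Optional[bytes]]
Tier = Tuple[str, Fetcher, bool]  # (tier name, fetch function, is opt-in fallback)


def retrieve_pdf(doi: str, tiers: List[Tier], enable_fallback: bool = False):
    """Try tiers in order; return (pdf_bytes, tier_name) or (None, None)."""
    for name, fetch, is_fallback in tiers:
        if is_fallback and not enable_fallback:
            continue  # opt-in tiers are skipped unless explicitly enabled
        try:
            pdf = fetch(doi)
        except Exception:
            pdf = None  # a failing tier should not stop the cascade
        if pdf:
            return pdf, name  # tier name is kept for provenance
    return None, None


# Stub fetchers standing in for real HTTP clients.
tiers = [
    ("publisher", lambda doi: None, False),
    ("pmc", lambda doi: None, False),
    ("unpaywall", lambda doi: b"%PDF-1.7 stub", False),
    ("fallback", lambda doi: b"%PDF-1.7 other", True),
]
print(retrieve_pdf("10.1234/example", tiers))  # (b'%PDF-1.7 stub', 'unpaywall')
```

Returning the tier name alongside the bytes is what makes per-PDF provenance possible without a separate lookup.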
# Generate ATCC-DSMZ cross-reference candidates (name-based matching only)
python -m culturemech.enrich.atcc_crossref_builder generate
# Verify candidates using legal sources only (no Sci-Hub)
python -m culturemech.enrich.atcc_crossref_builder generate \
--verify-literature
# Verify with Sci-Hub fallback enabled (explicit opt-in)
python -m culturemech.enrich.atcc_crossref_builder generate \
--verify-literature \
--enable-scihub-fallback
# Configure via environment variables
export ENABLE_SCIHUB_FALLBACK=true
export LITERATURE_EMAIL="your@email.com"
export FALLBACK_PDF_MIRRORS="https://sci-hub.se,https://sci-hub.st"
python -m culturemech.enrich.atcc_crossref_builder generate --verify-literature

Safety features:
- Default: use_fallback_pdf=False
- Legal sources exhausted first
- Clear warnings when Sci-Hub is enabled
- Full provenance tracking
- No auto-distribution of PDFs
See IMPLEMENTATION_SUMMARY.md for complete documentation.
The faceted search browser (app/index.html) provides:
- Full-text search - Name, organism, ingredient, application
- Faceted filtering - Category, type, state, organisms, sterilization
- Real-time filtering - Instant results from 10,595 recipes
- External links - CHEBI, NCBITaxon, source databases
- Mobile responsive - Works on all devices
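The browser itself runs client-side from app/index.html, but the filtering idea is easy to show in Python. The sketch below combines a case-insensitive name search with exact-match facets and computes facet counts; the function names and the three sample records are illustrative assumptions, not the browser's actual code or data.

```python
from collections import Counter


def filter_recipes(recipes, query="", **facets):
    """Keep recipes whose name contains the query and whose fields match every facet."""
    q = query.lower()
    return [
        r for r in recipes
        if q in r["name"].lower()
        and all(r.get(field) == value for field, value in facets.items())
    ]


def facet_counts(recipes, field):
    """Number of recipes per value of one facet field."""
    return Counter(r.get(field) for r in recipes)


RECIPES = [
    {"name": "BG-11 Medium", "category": "algae", "physical_state": "LIQUID"},
    {"name": "LB Broth", "category": "bacterial", "physical_state": "LIQUID"},
    {"name": "LB Agar", "category": "bacterial", "physical_state": "SOLID_AGAR"},
]

hits = filter_recipes(RECIPES, query="lb", category="bacterial", physical_state="LIQUID")
print([r["name"] for r in hits])          # ['LB Broth']
print(facet_counts(RECIPES, "category"))  # Counter({'bacterial': 2, 'algae': 1})
```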
Generate browser data:
just gen-browser-data
just serve-browser
# Open http://localhost:8000/app/

just --list # Show all 80+ commands
just count-recipes # Count recipes by category
just fetch-utex # Fetch UTEX algae media
just import-utex # Import UTEX to normalized format
just validate-all # Validate all recipes
just gen-browser-data # Generate browser search data
just test # Run test suite

# Generate ATCC-DSMZ cross-reference candidates
python -m culturemech.enrich.atcc_crossref_builder generate \
--output data/curation/atcc_candidates.json
# Verify candidates via literature search (legal sources only)
python -m culturemech.enrich.atcc_crossref_builder generate \
--verify-literature
# Verify with Sci-Hub fallback (opt-in, requires explicit flag)
python -m culturemech.enrich.atcc_crossref_builder generate \
--verify-literature \
--enable-scihub-fallback
# Normalize enum values (medium_type, physical_state, category)
python -m culturemech.enrich.normalize_enums --dry-run # Preview changes
python -m culturemech.enrich.normalize_enums # Apply changes
# Aggregate unmapped ingredients for mapping prioritization
python scripts/aggregate_unmapped_ingredients.py --verbose --min-occurrences 2
# View unmapped ingredients statistics
python scripts/unmapped_ingredients_stats.py --top 20
# View full aggregated data
less output/unmapped_ingredients.yaml
# Read the comprehensive guide
cat docs/unmapped_ingredients_guide.md
# Read the executive summary
cat UNMAPPED_INGREDIENTS_SUMMARY.md

- Create a YAML file in the appropriate category:
  cp data/normalized_yaml/bacterial/LB_Broth.yaml \
     data/normalized_yaml/bacterial/Your_Medium.yaml
- Edit the file so that it follows the schema structure
- Validate:
  just validate data/normalized_yaml/bacterial/Your_Medium.yaml
- Regenerate browser:
  just gen-browser-data
# All tests
just test
# With coverage
just test-cov
# Specific test
pytest tests/test_kgx_export.py

- Find media recipes for specific organisms
- Compare formulations across culture collections
- Access detailed protocols with preparation steps
- Discover alternatives through variant relationships
- Standardize media recipe formats
- Cross-reference with other collections
- Track provenance and curation history
- Export to knowledge graphs for integration
- Query via KG using Biolink model
- Link organisms to cultivation conditions
- Analyze ingredients with CHEBI ontology
- Build applications on structured data
$ just count-recipes
Recipe count by category:
algae: 242
archaea: 63
bacterial: 10,134
fungal: 119
specialized: 99
Total recipes: 10,657

Data Quality:
- ✅ 100% schema-validated
- ✅ 100% enum compliance (10,657 files normalized)
- ✅ Full source attribution
- ✅ Comprehensive provenance tracking
- ✅ LinkML compliance
Pipeline Coverage:
- ✅ 10 data sources integrated
- ✅ 11 import pipelines operational
- ✅ 3 algae collections (UTEX, CCAP, SAG)
- ✅ Automated fetch → convert → import workflow
Enrichment Features:
- ✅ Literature verification with 6-tier PDF retrieval
- ✅ ATCC-DSMZ cross-reference detection
- ✅ Automated enum normalization
- ✅ Evidence extraction from scientific papers
- ✅ Unmapped ingredients aggregation and tracking (136 ingredients, 3,084 instances)
We welcome contributions! Ways to contribute:
- Add recipes - Create YAML files following the schema
- Enhance existing recipes - Add ontology terms, preparation details
- Report issues - Found errors or have suggestions?
- Improve documentation - Help make guides clearer
- Add data sources - Know of other culture media databases?
See CONTRIBUTING.md for detailed guidelines.
- Fork the repository
- Create a feature branch
- Make your changes
- Validate: just validate-all
- Test: just test
- Submit pull request
- DSMZ MediaDive: https://mediadive.dsmz.de/
- TOGO Medium: http://togodb.org/db/medium/
- ATCC: https://www.atcc.org/
- UTEX: https://utex.org/
- CCAP: https://www.ccap.ac.uk/
- SAG: https://sagdb.uni-goettingen.de/
- CHEBI: https://www.ebi.ac.uk/chebi/
- NCBITaxon: https://www.ncbi.nlm.nih.gov/taxonomy
- UO (Units): https://github.com/bio-ontology-research-group/unit-ontology
- KG-Hub: https://github.com/Knowledge-Graph-Hub
- LinkML: https://linkml.io/
- Biolink Model: https://biolink.github.io/biolink-model/
This work is dedicated to the public domain under CC0 1.0 Universal.
You are free to:
- Use for any purpose
- Modify and distribute
- Use commercially
- No attribution required (but appreciated!)
If you use CultureMech in your research, please cite:
@software{culturemech2026,
title = {CultureMech: A Comprehensive Microbial Culture Media Knowledge Graph},
author = {CultureBotAI},
year = {2026},
url = {https://github.com/CultureBotAI/CultureMech},
note = {10,595 culture media recipes from 10 international repositories}
}

Data Sources: DSMZ, TOGO, ATCC, NBRC, BacDive, KOMODO, UTEX, CCAP, SAG, MediaDB
Architecture: Inspired by the dismech project
Ontologies: CHEBI, NCBITaxon, UO
Community: KG-Hub, LinkML, Biolink Model
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Built with ❤️ for microbiology research
10,657 recipes • 10 sources • Production ready • Public domain