A KDE-based framework for tracking sequence-discrete microbial populations across metagenomes
This repository implements the full analysis workflow described in:
Conrad, R. et al.
An ANIr-based methodology to determine whether two sequence-discrete populations are identical or distinct in metagenomic datasets.
(in review)
Zenodo archive (code + data snapshot):
https://doi.org/10.5281/zenodo.15231818
This repository provides a complete, end-to-end computational workflow for tracking microbial populations across metagenomic datasets using Average Nucleotide Identity of recruited reads (ANIr). The framework is designed to determine whether two metagenomic samples contain:
- The same population (shared strain-level lineage), or
- Distinct but closely related populations, even when 16S identity or species-level resolution fails.
The pipeline is built around three core concepts:
-
Recruitment-based population definition
Metagenomic reads are recruited to reference genomes or MAGs using BLAST+ or MagicBlast. The distribution of percent identity values (pident) is used to separate target populations from non-target background recruitment. -
KDE-based threshold detection
A kernel density estimate (KDE) is applied to thepidentdistribution, and local minima in the density curve are used as objective sequence-identity cutoffs separating true population members from non-members. -
Statistical population comparison
Once target populations are isolated, the framework applies:- ANIr and mANIr (mean and median percent identity)
- Bootstrap resampling to estimate confidence intervals
- Permutation testing to assess whether two samples contain statistically indistinguishable populations
- Differential gene-content testing using normalized per-gene coverage
Together, these tools allow you to answer:
Are two metagenomes recruiting the same microbial population, or two distinct ones?
At a high level, the population-tracking workflow follows these stages:
-
Input data preparation
- Reference genomes or MAGs (FASTA)
- Gene annotations (e.g., Prodigal CDS)
- Metagenomic read sets (FASTQ)
- Read recruitment tables (BLAST+ or MagicBlast)
-
KDE-based population boundary detection
- KDE applied to
pidentdistributions - Local minima used to define recruitment threshold
- Target vs non-target reads separated objectively
- KDE applied to
-
Population coverage & diversity metrics
- Breadth and depth of coverage
- ANIr and mANIr calculated from thresholded reads
-
Across-sample population matrices
- Presence/absence across samples
- ANIr matrices across environments
- Heatmaps and clustering
-
Statistical population comparison
- Bootstrap testing of ANIr distributions
- Permutation testing of ANIr differences
- Identifies identical vs distinct populations
-
Differential gene content
- Per-gene normalized coverage
- Quantile-based detection of enriched/depleted genes
- Functional differentiation between populations
This repository is organized into three main script families:
-
scripts-gene-coverage/
Raw recruitment → coverage, ANIr, and per-gene depth calculations. -
scripts-populations/
KDE thresholding, population-level statistics, bootstrap/permutation testing, and differential gene analysis.
This is the primary focus of the population-tracking workflow. -
scripts-general/
Utility scripts, plotting helpers, and data-wrangling functions.
A legacy old/ directory is preserved for transparency but is not recommended for new analyses.
A full breakdown of all major scripts is provided in later sections of this README.
- Python ≥ 3.7.10
- NumPy
- Pandas
- Matplotlib
- Seaborn
- SciPy
- Statsmodels
- BLAST+ (for
blastnrecruitment) - MagicBlast (optional alternative recruiter)
- Prodigal or equivalent CDS caller (for gene-level analyses)
Exact versions used in the manuscript are documented in the Zenodo archive.
This is the minimal outline of a complete run:
# 1. Recruit reads to MAGs or references
blastn -query reads.fq -db MAGs.fna -outfmt 6 > sampleA.MAG001.blast.tsv
# 2. Run KDE + recruitment plotting + ANIr estimation
python scripts-populations/06f_TabBlast_RecPlot_Mini_Auto_vX.py \
--blast sampleA.MAG001.blast.tsv \
--fasta MAG001.fna \
--outdir results/MAG001/sampleA/
# 3. Build across-sample ANIr matrices
# 4. Run bootstrap and permutation tests between samples
# 5. Run differential gene-content analysisA fully reproducible, step-by-step tutorial for every script in scripts-populations/ is provided in the next major section of this README.
If you use this repository in your work, please cite:
- The associated manuscript (once published)
- The Zenodo software archive: https://doi.org/10.5281/zenodo.15231818
This repository is organized into three main script families plus supporting files:
Metagenomic_Population_Tracking/
├── scripts-gene-coverage/ # Read recruitment → coverage, ANIr, RecPlot minis, gene-level depth
├── scripts-populations/ # Population thresholding, ANIr comparisons, bootstrap/permutation tests
├── scripts-general/ # Shared utilities, plotting helpers, data-wrangling functions
├── old/ # Legacy scripts kept for transparency (not recommended for new use)
├── LICENSE # GPL-3.0
└── README.md # This document
In practice:
- Use
scripts-gene-coverage/when you’re starting from raw recruitment tables and want coverage / ANIr / RecPlot minis / gene depth. - Use
scripts-populations/when you’re doing population-level comparisons (same vs different population across samples, differential gene content). - Use
scripts-general/for shared helper functions that some of the above scripts call. - The
old/directory contains earlier versions and exploratory scripts and is mostly for archival / provenance.
To make the workflow easier to follow, it helps to think in terms of functional “blocks” rather than individual scripts first:
-
Recruitment & KDE / RecPlot minis
- Parse BLAST+/MagicBlast recruitment output.
- Fit KDE to
%identity(seqIDs) and identify local minima. - Apply a valley modifier and optional KDE bandwidth tweaks.
- Compute breadth, depth, ANIr, and mANIr.
- Generate RecPlot minis that overlay coverage and identity patterns.
-
Population presence / absence and matrices
- Aggregate per-MAG×sample metrics into matrices (ANIr, breadth, depth).
- Build presence/absence calls and sample × population tables.
- Provide inputs for downstream clustering and metadata analyses.
-
Population comparison tests (ANIr-based)
- Use bootstrap to generate ANIr distributions per sample.
- Use permutation tests to evaluate significance of ANIr differences.
- Return p-values and confidence intervals for “same vs different population”.
-
Differential gene content
- Compute per-gene depth using positional information (
sstart,send, gene coordinates). - Normalize gene coverage by genome-level abundance.
- Identify genes in the extreme quantiles (e.g., bottom 2.5% / top 97.5%) of abundance differences between samples.
- Compute per-gene depth using positional information (
-
Metadata integration & plotting
- Correlate ANIr and population presence/absence with environmental or experimental metadata.
- Produce publication-ready figures (e.g., ANIr vs depth, basin, time).
💡 Script names below correspond to those referenced in the manuscript. Depending on how you organize the repo, they may live under
scripts-gene-coverage/orscripts-populations/.
| Script / family | Main purpose | Typical inputs | Key outputs |
|---|---|---|---|
TabBlast_RecPlot_Mini_Auto_v2.py / v3.py |
KDE-based thresholding, RecPlot minis, coverage & ANIr metrics | BLAST+ tabular recruitment file, reference genome/MAG FASTA | KDE plots, local-minimum threshold, RecPlot minis, per-MAG×sample table of breadth, depth, ANIr, mANIr, read counts |
MagicBlast_RecPlotMini-Auto_subspecies.py |
Same as above, tuned for MagicBlast output | MagicBlast tabular recruitment file, reference genome/MAG FASTA | KDE + RecPlot minis + coverage / ANIr summary as above |
06h_PopX_Bootstrap_Test.py |
Bootstrap test for ANIr distributions between two samples | SeqID arrays (or recruitment tables) for sample A and B mapped to the same MAG / reference | Bootstrap ANIr distributions, confidence intervals, p-values for “same population?” |
06h_PopX_Permutation_Test.py |
Permutation test for ANIr differences between two samples | Same as above (seqIDs for two samples for a single MAG) | Null distribution of ANIr differences, observed difference, p-values |
08b_BlastPlus_CoverageMagic_Basic.py |
Per-gene coverage and normalized gene abundance | BLAST+ recruitment tables, gene (CDS) coordinates on contigs, genome abundance estimates | Per-gene depth / coverage tables, gene-level normalized abundance per sample |
08d_MagicBlast_Differential_Genes.py |
Differential gene content between two samples | Per-gene coverage / normalized abundance tables for sample pairs | Gene list with difference metrics; genes flagged in bottom 2.5% / top 97.5% quantiles (significant abundance changes) |
| RecPlot aggregation scripts (e.g. “minis”) | Arrange many RecPlot minis into composite figures | Individual RecPlot outputs, sample metadata (e.g., basin, depth) | Multi-panel PDF/PNG grids of RecPlot minis arranged by geography / depth / other metadata |
| ANIr / metadata plotting scripts | Relate ANIr and population metrics to environmental metadata | ANIr matrices, presence/absence tables, sample metadata (depth, basin, time, etc.) | Scatterplots, timeseries, and faceted plots linking population metrics to environment |
You can think of the execution path like this:
- Start with
TabBlast_RecPlot_Mini_Auto_*/MagicBlast_RecPlotMini-*to define populations and compute coverage + ANIr. - Aggregate per-MAG×sample results into multi-sample matrices.
- Apply
06h_PopX_*scripts to test “same vs different” populations between sample pairs. - Optionally run
08b_*and08d_*for differential gene content within those same sample pairs. - Finish with plotting/metadata scripts to put everything in ecological / experimental context.
These scripts were developed and tested under:
-
Python: 3.7.10+
-
Core libraries:
numpypandasmatplotlibseabornscipystatsmodels
External tools:
- BLAST+ (
blastn) – read recruitment against MAGs / reference genomes. - MagicBlast – optional alternative recruiter.
- Prodigal (or equivalent) – gene prediction (CDS coordinates) for per-gene coverage / differential gene tests.
- Common command-line tools (
bash,gzip, etc.) for file management and pre-processing.
A simple conda environment example:
conda create -n anir-populations python=3.8 \
numpy pandas matplotlib seaborn scipy statsmodels
conda activate anir-populations
# Install BLAST+ and (optionally) MagicBlast via bioconda
conda install -c bioconda blast
conda install -c bioconda magicblast # optional
conda install -c bioconda prodigal # for gene predictionThis folder contains the population-level analysis tools used in the manuscript, including:
-
Population detection and RecPlot mini-figures (ANIr, breadth, depth)
06f_TabBlast_RecPlot_Mini_Auto_v4.pyMagicBlast_RecPlotMini-Auto_subspecies.py
-
Population ANIr significance tests
06h_PopX_Bootstrap_Test.py06h_PopX_Permutation_Test.py
-
Gene-level coverage and differential gene content
08c_Low_Coverage_Genes.py08d_Differential_Genes.py
Below is a step-by-step guide to using these scripts, in the typical order you would run them.
Most scripts assume:
-
One reference genome / rMAG FASTA per run (can be multi-contig).
-
Tabular BLAST+ or MagicBlast read recruitment files:
-
One file per sample × reference genome.
-
“Outfmt 6”-style with at least:
% identityin column 3 (pident)- subject start/stop in columns 9–10 (
sstart,send) – 0-based vs 1-based is handled consistently in the scripts.
-
-
Consistent naming convention for sample IDs encoded in the filenames.
For the GoM case, filenames look like:
GoM_1200m_2016-01-01_MAG001.tab
GoM_80m_2016-01-01_MAG001.tab
...
Internally, the scripts parse:
Sample= everything before the first.and before the first-depth= the third_-separated field (e.g.1200m)loca= the first two_-separated fields (e.g.GoM_1200m)
If your sample naming scheme is different, you must edit the lines that set Sample, depth, and loca in the plotting scripts.
Purpose
-
Reads a reference FASTA and a folder of filtered BLAST tabular files (one per sample for that reference).
-
Identifies a population cutoff in the seqID distribution via KDE and valley detection.
-
Filters alignments above the population cutoff.
-
Calculates:
- Breadth of coverage (fraction of genome positions with ≥1 population read).
- Depth of coverage (truncated mean coverage).
- ANIr and MNIr (mean and median percent identity of the population reads).
-
Produces:
- Per-sample RecPlot mini PDFs (pident vs genome position).
- A summary
*_stats.tsvfile. - A
*_popx_readmap_values.tsvfile used downstream by the bootstrap and permutation tests.
Key inputs
-f / --input_fasta_fileReference genome (or rMAG) FASTA used for read recruitment.-b / --input_blast_dirDirectory containing tabular BLAST result files (one per sample).-o / --out_dirOutput directory (created if it does not exist).-k / --kde_bandwidth(optional, float, default ~0.3) KDE bandwidth for smoothing seqID distributions.-v / --valley_modifier(optional, float, default 3.0) Value added to the KDE valley to define the population cutoff (in % identity).-d / --draw_threshold(flag) If present, draw the horizontal population cutoff line on RecPlots.-t / --breadth_threshold(optional, float, default 10.0) Minimum breadth (%) required to call a population “detected”.
Example command
# Example: one rMAG, BLAST-based read recruitment files per sample
python 06f_TabBlast_RecPlot_Mini_Auto_v4.py \
-f refs/MAG001.fasta \
-b blasts/MAG001/ \
-o results/MAG001_popx/ \
-k 0.25 \
-v 3.0 \
-d \
-t 10.0Outputs
From the example above, you’ll get (under results/MAG001_popx/):
-
MAG001_stats.tsv- Columns:
Sample,Detected(0/1),Depth,Breadth,ANIr,MNIr,PopCutoff.
- Columns:
-
MAG001_popx_readmap_values.tsv- Contains per-sample population-level metrics and the raw seqID array for the population reads:
- Columns (for each sample):
Sample,PopCutoff,ExactMatchRatio(EMR),BreadthTotal,DepthTotal,Breadth100,Depth100,Breadth99,Depth99,pident above population threshold(comma-separated list of seqIDs). - This file is the input for
06h_PopX_Bootstrap_Test.pyand06h_PopX_Permutation_Test.py.
-
Per-sample plots:
-
SampleX_RecPlot_Mini.pdf- KDE-based population cutoff (optional line).
- SeqID vs genome position; ANIr, MNIr, depth, and breadth annotated in the plot.
-
SampleX_Pop_Cutoff.png- Histogram and KDE of seqIDs with the chosen cutoff line.
-
This script is the MagicBlast-friendly counterpart of 06f_TabBlast_RecPlot_Mini_Auto_v4.py. It:
- Expects MagicBlast-style tabular recruitment instead of BLAST+.
- Performs the same KDE valley-based population cutoff search, breadth/depth/ANIr/MNIr calculations, and RecPlot mini generation.
The CLI interface mirrors 06f_TabBlast_RecPlot_Mini_Auto_v4.py:
-f / --input_fasta_file-b / --input_blast_dir-o / --out_dir-k / --kde_bandwidth-v / --valley_modifier-d / --draw_threshold-t / --breadth_threshold
You can inspect python MagicBlast_RecPlotMini-Auto_subspecies.py -h for the exact help text, but usage is effectively identical: point it at a reference FASTA and a directory of MagicBlast tabular results, and it will produce the same types of outputs (including the *_popx_readmap_values.tsv file used by downstream tests).
Purpose
Test the stability of ANIr for each detected population by:
- Bootstrapping subsamples of the seqID array (population reads) from
*_popx_readmap_values.tsv. - Building a bootstrap distribution of ANIr for each sample.
- Estimating how extreme the observed ANIr is relative to the bootstrap distribution.
The underlying method (as described in the manuscript):
-
For each sample, repeatedly sample a fraction of the seqID array (
bootsize, default 2% of reads) with replacement. -
Compute the mean seqID (ANIr) for each bootstrap.
-
From the bootstrap distribution, compute:
- mean, standard deviation, min, max.
- probability of observing the measured ANIr (p-value) under this distribution.
Required input
*_popx_readmap_values.tsvoutput from06f_TabBlast_RecPlot_Mini_Auto_v4.pyorMagicBlast_RecPlotMini-Auto_subspecies.py.
Arguments
-i / --input_popx_filePopX file (e.g.,MAG001_popx_readmap_values.tsv).-t / --target_sampleName of the target sample (must match aSamplein the input file).-o / --output_prefixPrefix for all output files (e.g.,MAG001_bootstrap).-b / --bootstrap_boot_size(optional, default 0.02) Fraction of the sample’s reads per bootstrap.0.02≈ 2% of the reads.-n / --bootstrap_number_boots(optional, default 10000) Number of bootstrap replicates.
Example command
python 06h_PopX_Bootstrap_Test.py \
-i results/MAG001_popx/MAG001_popx_readmap_values.tsv \
-t GoM_1200m_2016-01-01 \
-o results/MAG001_popx/MAG001_bootstrap \
-b 0.02 \
-n 10000Outputs
-
MAG001_bootstrap_<Sample>_PopX_Bootstrap.png(per sample)- Histogram of bootstrapped ANIr values with fitted normal PDF.
- Annotated with mean, 3×SD, measured ANIr, and p-value relative to the bootstrap distribution.
-
MAG001_bootstrap_bootstrap_data.tsv- Columns (per sample):
Sample,Boot_Mean,Boot_Stdev,Boot_Min,Boot_Max,Boot_pvalue. - A placeholder row for the target sample is included with
"Target"entries.
- Columns (per sample):
Purpose
Test whether the ANIr difference between two samples (for the same reference genome) is larger than expected if both samples came from the same underlying population.
- Implements a permutation test on seqIDs (pidents) drawn from
*_popx_readmap_values.tsv.
Conceptually (also described in the manuscript):
-
Choose a target sample A.
-
For each other sample B:
- Combine the seqID arrays from A and B.
- Randomly shuffle (permute) the combined array.
- Split into two new arrays of sizes matching A and B.
- Compute ANIr for each array and record their difference.
- Repeat many times (default 10,000) to obtain a null distribution of ANIr differences under the hypothesis A and B are the same population.
- Compare the observed |ANIr(A) − ANIr(B)| to this null distribution and compute a p-value.
Required input
- Same
*_popx_readmap_values.tsvfile used by the bootstrap script.
Arguments
-i / --input_popx_filePopX file (e.g.,MAG001_popx_readmap_values.tsv).-t / --target_sampleTarget sample name (the reference population).-o / --output_prefixPrefix for all outputs (e.g.,MAG001_permutation).-p / --number_of_permutations(optional, default 10000) Number of permutations to run.
Example command
python 06h_PopX_Permutation_Test.py \
-i results/MAG001_popx/MAG001_popx_readmap_values.tsv \
-t GoM_1200m_2016-01-01 \
-o results/MAG001_popx/MAG001_perm \
-p 10000Outputs
-
MAG001_perm_permutation_data.tsv- Per sample (relative to target sample):
Sample,ANIr,MNIr,EMR,BreadthTotal,DepthTotal,Breadth100,Depth100,Breadth99,Depth99,sample_size,Measured_Diff,Perm_Mean_Diff,Perm_diff_Stdev,pvalue_Measured_Diff.
- Per sample (relative to target sample):
-
MAG001_perm_<Sample>_permutation_test.png(one per non-target sample)-
Histogram of simulated ANIr differences from permutations.
-
Fitted normal PDF.
-
Lines for:
- 3×SD of simulated differences.
- Observed ANIr difference.
-
Annotated p-value for the observed difference.
-
Purpose
Identify significantly low coverage genes within a sample, based on the distribution of normalized gene coverage values (TAD, typically from *_gene_tad.tsv produced by the coverage-magic script).
-
Uses a non-parametric approach:
- Computes the quantile (default 0.025) of the gene coverage distribution.
- Flags all genes with coverage ≤ that quantile as significantly low (α = 0.025).
Inputs
-
*_gene_tad.tsvfile (one per sample) from coverage magic, with at least:GeneID.TAD(normalized coverage depth per gene).
-
Optional:
genes.faaFASTA containing gene sequences.*.annotationsfile from MicrobeAnnotator (tabular, gene as first column).
Arguments
-i / --input_gene_tad_fileInput gene coverage file (e.g.,MAG001_GoM_1200m_gene_tad.tsv).-o / --output_prefixPrefix for output files.-s / --significance_alpha(optional, default 0.025) Quantile used as the low-coverage cutoff.-a / --annotation_file(optional) MicrobeAnnotator annotations file.-f / --fasta_file(optional) Gene FASTA file.
Example command
python 08c_Low_Coverage_Genes.py \
-i coverage_magic/MAG001_GoM_1200m_gene_tad.tsv \
-o results/MAG001_lowcov/MAG001_GoM_1200m \
-s 0.025 \
-a annotations/MAG001.annotations \
-f genes/MAG001_genes.faaOutputs
-
MAG001_GoM_1200m_gene_coverage.png-
Histogram of gene coverage values.
-
Vertical line at the low-coverage cutoff q (e.g., 2.5th percentile), annotated with:
- α (significance level),
- cutoff value,
-
-
MAG001_GoM_1200m_sig_genes.tsv- Significantly low coverage genes (≤ cutoff).
- Columns:
Gene,TAD, and optional annotation fields.
-
MAG001_GoM_1200m.fasta(optional)- FASTA of significantly low coverage genes if
--fasta_filewas provided.
- FASTA of significantly low coverage genes if
Purpose
Identify genes whose relative coverage differs significantly between samples, for a given reference genome.
As described in the manuscript:
-
For each gene and each pair of samples, compute the difference in normalized gene coverage.
-
Use the distribution of differences across all genes to find:
- genes in the bottom 2.5% (underenriched),
- genes in the top 2.5% (overenriched), corresponding to a two-sided α = 0.05.
Inputs
-
A directory containing
*_gene_tad.tsvfiles from coverage magic:- One file per sample for the same reference genome; they must share a consistent basename suffix (e.g.,
*_gene_tad.tsv).
- One file per sample for the same reference genome; they must share a consistent basename suffix (e.g.,
-
Optional:
- Annotations and gene FASTA, as above.
Arguments
-i / --input_directoryDirectory with the*_gene_tad.tsvfiles for this reference.-t / --target_sampleThe sample name to treat as the reference in pairwise comparisons (must match the prefix used in the gene TAD filenames).-o / --output_directoryOutput directory to store results.-s / --significance_alpha(optional, default 0.025) Lower quantile for significance; the script usesαand1 − αto define low and high cutoff percentiles (e.g., 0.025 and 0.975 for a 5% two-sided test).-a / --annotation_directory(optional) Annotations file from MicrobeAnnotator.-f / --fasta_directory(optional) Gene FASTA file.
Example command
python 08d_Differential_Genes.py \
-i coverage_magic/MAG001/ \
-t GoM_1200m_2016-01-01 \
-o results/MAG001_diffgenes/ \
-s 0.025 \
-a annotations/MAG001.annotations \
-f genes/MAG001_genes.faaWhat the script does (high level)
-
Parse all
*_gene_tad.tsvfiles ininput_directory.-
For each sample:
- Reads gene IDs.
- Reads normalized TAD values and raw TAD.
-
Stores:
- log-transformed normalized TADs (
lgnmtd) for statistical comparisons, - normalized TADs (
nmtd) and raw TADs (td) for reporting.
- log-transformed normalized TADs (
-
-
Select the target sample and compute per-gene differences:
-
Difference = log(norm TAD_target) − log(norm TAD_sample).
-
For each non-target sample, compute:
- the low quantile (
lq) atα, - the high quantile (
hq) at1 − α, - counts of genes below
lqand abovehq.
- the low quantile (
-
-
Plot differential coverage distributions:
-
For each sample vs target:
- Histogram of per-gene differential log(TAD/AVG).
- Vertical lines for
lqandhq. - Text annotation summarizing α, cutoffs, and counts of significant genes.
-
-
Write out results:
-
For each sample vs target, four main outputs:
<target>-<sample>_diffcov_plot.pngHistogram of log(TAD/AVG) differences with quantile lines.<target>-<sample>_diffcov_high.tsvGenes with differential coverage ≥hq. Columns:Gene,log(TAD/AVG) Diff,Target_Norm_TAD,Test_Norm_TAD,Target_TAD,Test_TAD, and optional annotations.<target>-<sample>_diffcov_low.tsvGenes with differential coverage ≤lq, same columns as above.- Optional FASTA files:
<target>-<sample>_diffcov_high.faand<target>-<sample>_diffcov_low.facontaining coding sequences for high and low differential genes if a gene FASTA was provided.
-
For a single reference genome / rMAG:
-
RecPlot + population detection
-
Use BLAST+ or MagicBlast to recruit reads.
-
Run
06f_TabBlast_RecPlot_Mini_Auto_v4.pyorMagicBlast_RecPlotMini-Auto_subspecies.pyon each reference genome to:- generate RecPlot minis per sample,
- compute breadth/depth/ANIr/MNIr,
- produce
*_popx_readmap_values.tsv.
-
-
Population ANIr statistics
- Run
06h_PopX_Bootstrap_Test.pyto characterize within-sample ANIr uncertainty. - Run
06h_PopX_Permutation_Test.pyto test between-sample ANIr differences for the target sample against all others.
- Run
-
Gene-level coverage and content differences
- Use coverage-magic (upstream script) to generate per-sample
*_gene_tad.tsvfiles. - Optionally run
08c_Low_Coverage_Genes.pyon each sample to identify significantly undercovered genes. - Run
08d_Differential_Genes.pyon the folder of gene TAD files to identify genes that differ significantly in relative coverage between the target sample and each other sample.
- Use coverage-magic (upstream script) to generate per-sample
This worked example demonstrates a complete population-tracking analysis for a single reference genome (rMAG) across multiple metagenomic samples, starting from BLAST recruitment and ending with differential gene content.
We assume the following directory layout:
project/
├── refs/
│ └── MAG001.fasta
├── blasts/
│ └── MAG001/
│ ├── GoM_1200m_2016-01-01_MAG001.tab
│ ├── GoM_80m_2016-01-01_MAG001.tab
│ └── GoM_10m_2016-01-01_MAG001.tab
├── coverage_magic/
│ └── MAG001/
│ ├── MAG001_GoM_1200m_gene_tad.tsv
│ ├── MAG001_GoM_80m_gene_tad.tsv
│ └── MAG001_GoM_10m_gene_tad.tsv
├── annotations/
│ └── MAG001.annotations
├── genes/
│ └── MAG001_genes.faa
└── results/
We first identify the target population in each metagenomic sample using KDE-based valley detection.
python scripts-populations/06f_TabBlast_RecPlot_Mini_Auto_v4.py \
-f refs/MAG001.fasta \
-b blasts/MAG001/ \
-o results/MAG001_popx/ \
-k 0.25 \
-v 3.0 \
-d \
-t 10.0results/MAG001_popx/
├── MAG001_stats.tsv
├── MAG001_popx_readmap_values.tsv
├── GoM_1200m_2016-01-01_RecPlot_Mini.pdf
├── GoM_80m_2016-01-01_RecPlot_Mini.pdf
├── GoM_10m_2016-01-01_RecPlot_Mini.pdf
├── GoM_1200m_2016-01-01_Pop_Cutoff.png
├── GoM_80m_2016-01-01_Pop_Cutoff.png
└── GoM_10m_2016-01-01_Pop_Cutoff.png
At this stage you now have:
- Population cutoff for each sample
- ANIr, MNIr, depth, and breadth
- Publication-ready RecPlot minis
- A master file used for all downstream population tests:
MAG001_popx_readmap_values.tsv
We now quantify within-sample ANIr uncertainty for one focal population (e.g., the deep GoM sample):
python scripts-populations/06h_PopX_Bootstrap_Test.py \
-i results/MAG001_popx/MAG001_popx_readmap_values.tsv \
-t GoM_1200m_2016-01-01 \
-o results/MAG001_popx/MAG001_bootstrap \
-b 0.02 \
-n 10000results/MAG001_popx/
├── MAG001_bootstrap_bootstrap_data.tsv
└── MAG001_bootstrap_GoM_1200m_2016-01-01_PopX_Bootstrap.png
This gives you:
- The bootstrapped ANIr distribution
- The mean, SD, and p-value of the observed ANIr
- A PDF+histogram figure used to evaluate population stability
We now test whether other samples represent the same population as the 1200 m target sample.
python scripts-populations/06h_PopX_Permutation_Test.py \
-i results/MAG001_popx/MAG001_popx_readmap_values.tsv \
-t GoM_1200m_2016-01-01 \
-o results/MAG001_popx/MAG001_perm \
-p 10000results/MAG001_popx/
├── MAG001_perm_permutation_data.tsv
├── MAG001_perm_GoM_80m_2016-01-01_permutation_test.png
└── MAG001_perm_GoM_10m_2016-01-01_permutation_test.png
This tells you, for each comparison sample:
- Observed ANIr difference
- Mean and SD of the null permutation distribution
- The p-value testing if the populations are distinguishable
At this point you can formally say:
“These two samples contain the same population” or “These samples contain statistically distinct populations.”
If you want to identify genes that are significantly underrepresented in a single sample:
python scripts-populations/08c_Low_Coverage_Genes.py \
-i coverage_magic/MAG001/MAG001_GoM_1200m_gene_tad.tsv \
-o results/MAG001_lowcov/MAG001_GoM_1200m \
-s 0.025 \
-a annotations/MAG001.annotations \
-f genes/MAG001_genes.faaresults/MAG001_lowcov/
├── MAG001_GoM_1200m_gene_coverage.png
├── MAG001_GoM_1200m_sig_genes.tsv
└── MAG001_GoM_1200m.fasta
These are genes in the lowest 2.5% of normalized coverage.
Now we compare gene content differences between the 1200 m sample and all other samples:
python scripts-populations/08d_Differential_Genes.py \
-i coverage_magic/MAG001/ \
-t GoM_1200m_2016-01-01 \
-o results/MAG001_diffgenes/ \
-s 0.025 \
-a annotations/MAG001.annotations \
-f genes/MAG001_genes.faaresults/MAG001_diffgenes/
├── GoM_1200m-GoM_80m_diffcov_plot.png
├── GoM_1200m-GoM_10m_diffcov_plot.png
├── GoM_1200m-GoM_80m_diffcov_high.tsv
├── GoM_1200m-GoM_80m_diffcov_low.tsv
├── GoM_1200m-GoM_10m_diffcov_high.tsv
├── GoM_1200m-GoM_10m_diffcov_low.tsv
├── GoM_1200m-GoM_80m_diffcov_high.fa
├── GoM_1200m-GoM_80m_diffcov_low.fa
├── GoM_1200m-GoM_10m_diffcov_high.fa
└── GoM_1200m-GoM_10m_diffcov_low.fa
Each comparison yields:
- A distribution plot of per-gene coverage differences
- A list of over-enriched genes (top 2.5%)
- A list of under-enriched genes (bottom 2.5%)
- Optional FASTA outputs for downstream functional analysis
| Stage | Script | Output |
|---|---|---|
| Population detection | 06f_TabBlast_RecPlot_Mini_Auto_v4.py |
ANIr, MNIr, depth, breadth, RecPlots |
| ANIr stability | 06h_PopX_Bootstrap_Test.py |
Bootstrap distributions |
| Population comparison | 06h_PopX_Permutation_Test.py |
Same vs different population tests |
| Low-coverage genes | 08c_Low_Coverage_Genes.py |
Underrepresented genes |
| Differential genes | 08d_Differential_Genes.py |
Enriched & depleted genes |
This workflow reproduces the population-tracking, statistical validation, and functional differentiation results described in the manuscript.