
Commit 8a8766a

Mike Lee authored and committed

modularized bit-normalize-table and added tests

1 parent 78d6a98 commit 8a8766a

9 files changed

Lines changed: 298 additions & 114 deletions


.coveragerc

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ concurrency = multiprocessing

 source =
     bit.modules
+    bit.cli.normalize_table
     bit.cli.summarize_assembly
     bit.cli.summarize_column

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@
 ### Changed
 - `bit-summarize-column` no longer has an `-i` flag, it expects the input table/file to be given as a positional argument
 - `bit-count-bases` can now take STDIN
+- `bit-normalize-table` modularized and tests added

 ---

README.md

Lines changed: 3 additions & 2 deletions
@@ -41,13 +41,14 @@ Some of the helper programs/scripts in _bit_ include:
 | ------- | ------- |
 | `bit-dl-ncbi-assemblies` | download NCBI assemblies in different formats by just providing accessions |
 | `bit-get-accessions-from-GTDB` | search the (stellar) [Genome Taxonomy Database](https://gtdb.ecogenomic.org/) by taxonomy and get their NCBI accessions |
-| `bit-gen-reads` | quickly generate paired-end reads from fasta files in desired proportions |
 | `bit-mutate-seqs` | introduce point mutations (substitutions/indels) in nucleotide or amino acid fasta files |
+| `bit-gen-reads` | quickly generate reads from fasta files in desired proportions |
+| `bit-cov-analyzer` | analyze coverage patterns by providing a bam file and reference fasta to identify regions of relatively higher or lower coverage |
 | `bit-get-cov-stats` | get detection, coverage, and mean percent ID information for a single or multiple references given the fasta(s) and a bam file |
 | `bit-get-mapped-reads-pid` | get percent ID information for mapped reads in a bam file |
-| `bit-cov-analyzer` | analyze coverage patterns by providing a bam file and reference fasta to identify regions of relatively higher or lower coverage |
 | `bit-summarize-assembly` | quickly summarize nucleotide assemblies |
 | `bit-ez-screen` | search for nucleotide targets in nucleotide input fastas based on blast or in reads based on mapping, filtered based on tunable target-coverage and percent-ID thresholds and summarized in a simple table |
+| `bit-normalize-table` | normalize a table to counts- or coverage-per million (CPM) or with the median-ratio method from DESeq2 (allowing for floats) |
 | `bit-calc-variation-in-msa` | calculate [variation](http://scikit-bio.org/docs/0.5.3/generated/skbio.alignment.TabularMSA.conservation.html) in each column of a multiple-sequence alignment |
 | `bit-summarize-column` | summarize a numeric column in a table |
 | `bit-filter-table` | filter a table based on wanted IDs |

bit/cli/colnames.py

Lines changed: 2 additions & 2 deletions
@@ -16,7 +16,7 @@ def build_parser():
         description=desc,
         epilog="Ex. usage: `bit-colnames input.tsv`",
         formatter_class=CustomRichHelpFormatter,
-        add_help=False,
+        add_help=False
     )

     required = parser.add_argument_group("Required Parameters")
@@ -25,7 +25,7 @@ def build_parser():
     required.add_argument(
         "input_file",
         metavar="<FILE>",
-        help="Input delimited file",
+        help="Input delimited file"
     )

     add_help(optional)

bit/cli/normalize_table.py

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
"""
Expects rows to be units (e.g. genes/KOs/etc.), and columns to be samples.

This script normalizes a table by either coverage per million (CPM) or based on the median-ratio
method as performed in DESeq2. But unlike DESeq2, we don't care here if there are floats in there.

I initially wrote this for normalizing metagenomic coverage data, like gene-level coverage, or summed KO coverages.
These are normalized for gene-length already because they are "coverages", but they are not yet normalized
for sampling depth - which is where this script comes in.

I also found myself wanting this because I wanted to do differential abundance testing of coverages
of KO terms. DESeq2 doesn't require normalizing for gene-length because it is the same unit being analyzed
across all samples - the same gene, so the same size. However, after grouping genes into their KO annotations
(which we may need to compare across samples that don't all share the same underlying assembly or genes),
they no longer all represent the same units across all samples. It is because of this I decided to stick with
gene-level coverages (which are normalized for gene-length), and then sum those values based on KO annotations.

The CPM (coverage per million) normalization is just like a percent, except scaled to 1 million instead of 100.
So each row's entry (e.g. gene/KO/etc.) is the proportion out of 1 million for that column (sample),
and each column will sum to 1 million.

The median-ratio normalization method (MR) was initially described in this paper
(http://dx.doi.org/10.1186/gb-2010-11-10-r106; eq. 5), and this site is super-informative in general
about the DESeq2 process overall, and helped me understand the normalization process better to implement it:
https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html. Columns will not sum to
the same amount when the median-ratio method is applied.
"""

import sys
import argparse
import pandas as pd  # type: ignore
import numpy as np  # type: ignore
from scipy.stats.mstats import gmean  # type: ignore
from bit.modules.general import check_files_are_found
from bit.cli.common import CustomRichHelpFormatter, add_help


def build_parser():

    desc = """
        This script normalizes a table by either counts- or coverage-per-million (CPM) or with the median-ratio
        method as performed in DESeq2. See note at top of module for more info. It expects a
        tab-delimited table with samples as columns and units (e.g. genes/KOs/OTUs/etc.) as rows.
        For version info, run `bit-version`.
    """

    parser = argparse.ArgumentParser(
        description=desc,
        epilog="Ex. usage: `bit-normalize-table -i input-table.tsv -n CPM -o output-table.tsv`",
        formatter_class=CustomRichHelpFormatter,
        add_help=False
    )

    required = parser.add_argument_group("Required Parameters")
    optional = parser.add_argument_group("Optional Parameters")

    required.add_argument(
        "-i",
        "--input-table",
        metavar="<FILE>",
        help="Input tab-delimited table",
        required=True,
    )

    optional.add_argument(
        "-n",
        "--normalization",
        help='Desired normalization method of either "CPM" for counts- or coverage-per-million or "MR" for median-ratio (default: "CPM")',
        choices=["CPM", "MR"],
        default="CPM",
    )

    optional.add_argument(
        "-o",
        "--output-table",
        metavar="<FILE>",
        help='Output filename (default: "normalized.tsv")',
        default="normalized.tsv",
    )

    add_help(optional)

    return parser


def main(args=None):

    parser = build_parser()

    if len(sys.argv) == 1:
        parser.print_help(sys.stderr)
        sys.exit(0)

    args = parser.parse_args(args)

    check_files_are_found([args.input_table])

    tab = pd.read_csv(args.input_table, sep="\t", index_col=0, low_memory=False)

    # removing columns with all zeroes prior to normalization (will be restored after)
    tab, zero_column_names, ordered_columns = remove_zero_columns(tab)

    if args.normalization == "CPM":
        norm_tab = normalize_cpm(tab)
    else:
        norm_tab = normalize_median_ratio(tab)

    # restoring zero columns and original column order
    norm_tab = restore_zero_columns(norm_tab, zero_column_names, ordered_columns)

    norm_tab.to_csv(args.output_table, sep="\t")


def remove_zero_columns(tab):
    ordered_columns = tab.columns.tolist()
    column_sums = tab.sum()
    zero_column_names = column_sums[column_sums == 0].index.tolist()
    tab = tab.drop(zero_column_names, axis=1)
    return tab, zero_column_names, ordered_columns


def restore_zero_columns(tab, zero_column_names, ordered_columns):
    for col in zero_column_names:
        tab[col] = 0.0
    return tab[ordered_columns]


def normalize_cpm(tab):
    return tab / tab.sum() * 1000000


def normalize_median_ratio(tab):

    # getting geometric means for each row
    with np.errstate(divide='ignore'):
        geomeans = gmean(tab, axis=1)

    # getting ratios of values to geometric means
    ratios_tab = (tab.T / geomeans).T

    # calculating size factors from rows with non-zero geometric means
    size_factors = ratios_tab[geomeans > 0].median().to_list()

    return tab / size_factors


if __name__ == "__main__":
    main()
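To illustrate what the two normalizations in this commit do, here is a minimal sketch applying the same CPM and median-ratio logic to a toy two-sample table. The function bodies mirror `normalize_cpm` and `normalize_median_ratio` from `bit/cli/normalize_table.py`; the table values and the `sample-1`/`gene-1` style names are made up for the example.

```python
# Sketch of the two normalizations on a toy table (rows = units such as
# genes/KOs, columns = samples). Function bodies mirror the commit's code;
# the input data is hypothetical.
import numpy as np
import pandas as pd
from scipy.stats.mstats import gmean

def normalize_cpm(tab):
    # scale each column so it sums to 1 million
    return tab / tab.sum() * 1_000_000

def normalize_median_ratio(tab):
    # geometric mean of each row across samples
    with np.errstate(divide="ignore"):
        geomeans = gmean(tab, axis=1)
    # ratio of each value to its row's geometric mean
    ratios_tab = (tab.T / geomeans).T
    # per-sample size factor: median ratio over rows with non-zero geomeans
    size_factors = ratios_tab[np.asarray(geomeans) > 0].median().to_list()
    return tab / size_factors

# sample-2 has exactly twice the sampling "depth" of sample-1
tab = pd.DataFrame({"sample-1": [10, 30], "sample-2": [20, 60]},
                   index=["gene-1", "gene-2"])

cpm = normalize_cpm(tab)          # each column now sums to 1,000,000
mr = normalize_median_ratio(tab)
```

After CPM, every column sums to 1 million as the docstring describes; after median-ratio, the two-fold depth difference is divided out, so in this toy case the two sample columns become identical (columns do not, in general, sum to the same amount under MR).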

bit/scripts/bit-normalize-table

Lines changed: 0 additions & 107 deletions
This file was deleted.
