
Bacterial growth rate calculation

Computational pipeline to calculate bacterial growth rates from metagenomic samples.

By: Ashok K. Sharma; June 18, 2021

For any questions about this pipeline, please email ashoks773@gmail.com

Packages to be installed:

Before starting the tutorial, check that the following are installed: kneaddata (or Trimmomatic, Bowtie2, and Samtools), MEGAHIT, BBMap, MaxBin2, MetaBAT2, CheckM, DAS Tool (optional), CAT/BAT (optional), DEMIC, iRep, CoPTR, and CD-HIT (for individual-sample assemblies).

Step 1: Patient-specific co-assembly to generate contigs using MEGAHIT

Before running this step, raw metagenomic reads need to be processed for (1) quality filtering, using Trimmomatic, and (2) host-contamination removal, using Bowtie2 and Samtools. Alternatively, kneaddata (https://huttenhower.sph.harvard.edu/kneaddata/) can be used for both purposes and can be run with the following command.

kneaddata --input ${SAMPLE}_1.fastq.gz --input ${SAMPLE}_2.fastq.gz -db /home/sharmaa4/Databases/Human_Genome/human_GRCh38.p13 --output /home/sharmaa4/IBD_datasets/HMP2/step1_WGS_rawReads_QC/${SAMPLE}_kneaddata_output --bypass-trf

Output of kneaddata: multiple files are generated per sample. We need the following four files per sample: paired_1.fastq, paired_2.fastq, unmatched_1.fastq, unmatched_2.fastq. All four files can be merged into a single file per sample if you want to use HUMAnN 3.0 (https://huttenhower.sph.harvard.edu/humann/) or MetaPhlAn 3.0 (https://huttenhower.sph.harvard.edu/metaphlan/) for functional and taxonomic profiling, respectively. Otherwise, paired-end files can be generated by concatenating paired_1.fastq with unmatched_1.fastq, and paired_2.fastq with unmatched_2.fastq:

cat ${SAMPLE}_paired_1.fastq ${SAMPLE}_unmatched_1.fastq > ${SAMPLE}_1.fastq
cat ${SAMPLE}_paired_2.fastq ${SAMPLE}_unmatched_2.fastq > ${SAMPLE}_2.fastq
gzip *fastq

We are starting from here: for this tutorial, quality-filtered paired-end reads (_1.fastq.gz and _2.fastq.gz) will be used. We have a total of 20 fastq.gz files for 10 samples, which will be used directly for downstream processing. Note: only a limited amount of data can be uploaded to GitHub, so a very small subset of reads (50,000 reads/sample) was uploaded. This may cause errors during the binning steps, and you might end up with zero bins. It is therefore recommended to run this tutorial with full-size data, which can be downloaded from any publicly available metagenomic dataset. The metagenomic samples provided here were downloaded from https://ibdmdb.org/tunnel/public/summary.html
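Before launching the co-assembly, it can help to confirm that every forward read file has its reverse mate; a minimal sketch, assuming the Samples/ layout and file naming used below:

```shell
# Check that every *_1.fastq.gz in Samples/ has a matching *_2.fastq.gz;
# a missing mate would break the paired-end assembly and mapping steps.
missing=0
for fwd in Samples/*_1.fastq.gz; do
  rev=${fwd%_1.fastq.gz}_2.fastq.gz
  if [ ! -f "$rev" ]; then
    echo "Missing mate for ${fwd}: expected ${rev}" >&2
    missing=1
  fi
done
[ "$missing" -eq 0 ] && echo "All samples are properly paired."
```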

module load bowtie/2.3.2
module load samtools/1.9

cd ~/IBD_datasets/HMP2/Bacterial_replication/Sample_Dataset # Set your directory accordingly, i.e., wherever you keep your quality-filtered samples.

~/Softwares/MEGAHIT-1.2.9-Linux-x86_64-static/bin/megahit -1 Samples/SRR5936130_1.fastq.gz,Samples/SRR5936131_1.fastq.gz,Samples/SRR5936132_1.fastq.gz,Samples/SRR5936133_1.fastq.gz,Samples/SRR5936134_1.fastq.gz,Samples/SRR5936135_1.fastq.gz,Samples/SRR5936136_1.fastq.gz,Samples/SRR5936137_1.fastq.gz,Samples/SRR5936138_1.fastq.gz,Samples/SRR5936139_1.fastq.gz -2 Samples/SRR5936130_2.fastq.gz,Samples/SRR5936131_2.fastq.gz,Samples/SRR5936132_2.fastq.gz,Samples/SRR5936133_2.fastq.gz,Samples/SRR5936134_2.fastq.gz,Samples/SRR5936135_2.fastq.gz,Samples/SRR5936136_2.fastq.gz,Samples/SRR5936137_2.fastq.gz,Samples/SRR5936138_2.fastq.gz,Samples/SRR5936139_2.fastq.gz -m 100000000000 -o MEGAHIT_assembly -t 120

Or concatenate files before running MEGAHIT

cat *_1.fastq.gz > allsample_1.fastq.gz
cat *_2.fastq.gz > allsample_2.fastq.gz
~/Softwares/MEGAHIT-1.2.9-Linux-x86_64-static/bin/megahit -1 allsample_1.fastq.gz -2 allsample_2.fastq.gz -m 100000000000 -o MEGAHIT_assembly -t 120

Step 2: Alignment of reads to assembled contigs

This step indexes the contigs file, then maps the reads to generate sam and bam files, which need to be sorted and indexed for use by other packages such as MetaBAT2 and MaxBin2 (for construction of MAGs) and DEMIC (to calculate bacterial growth rates).

bowtie2-build MEGAHIT_assembly/final.contigs.fa MEGAHIT_assembly/final.contigs.fa
mkdir SAM # Make a folder named SAM to store alignment outputs
cd Samples
for file in *_1.fastq.gz;
do
  SAMPLE=$(echo ${file} | sed "s/_1.fastq.gz//")
  echo ${SAMPLE}_1.fastq.gz ${SAMPLE}_2.fastq.gz
  bowtie2 -q -x ../MEGAHIT_assembly/final.contigs.fa -1 ${SAMPLE}_1.fastq.gz -2 ${SAMPLE}_2.fastq.gz --no-unal -p 120 -S ../SAM/${SAMPLE}.sam
  samtools view -b -o ../SAM/${SAMPLE}.bam ../SAM/${SAMPLE}.sam
  samtools sort -o ../SAM/${SAMPLE}_sort.bam ../SAM/${SAMPLE}.bam
  samtools index ../SAM/${SAMPLE}_sort.bam
  samtools view -h ../SAM/${SAMPLE}_sort.bam > ../SAM/${SAMPLE}_sort.sam # We need sorted sam files to run DEMIC in Step 4
done
cd ..
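As a quick sanity check on the alignments, the overall fraction of mapped reads can be pulled from `samtools flagstat`; a sketch, assuming flagstat's standard text output (the `... mapped (XX.XX% ...)` line):

```shell
# Report the mapping rate per sample from the sorted BAMs.
# Assumes samtools flagstat's usual text format; the awk pulls the
# percentage out of the first "... mapped (XX.XX% ...)" line.
for bam in SAM/*_sort.bam; do
  sample=$(basename "$bam" _sort.bam)
  rate=$(samtools flagstat "$bam" | awk -F'[(%]' '/ mapped \(/ {print $2; exit}')
  echo "${sample}: ${rate}% reads mapped"
done
```

A very low mapping rate usually means the co-assembly does not represent that sample well.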

Step 3: Binning - construction of MAGs

For binning we mainly need two types of files: (1) the contigs file, and (2) contig abundances, which can be obtained from the mapped bam/sam files generated in the previous step.

Step 3.1 - Using MaxBin2

module load java/1.8.0_221
mkdir Abundance # Make a folder named Abundance to store contig abundance information
cd Samples
for file in *_1.fastq.gz;
do
  SAMPLE=$(echo ${file} | sed "s/_1.fastq.gz//")
  echo ${SAMPLE}_1.fastq.gz ${SAMPLE}_2.fastq.gz
  ~/Softwares/bbmap/pileup.sh in=../SAM/${SAMPLE}.sam out=../Abundance/${SAMPLE}.cov.txt
  awk '{print $1"\t"$5}' ../Abundance/${SAMPLE}.cov.txt | grep -v '^#' > ../Abundance/${SAMPLE}.abundance.txt
done
cd ..
ls Abundance/*abundance.txt > abund_list_file
module load perl/5.30.1
perl ~/Softwares/MaxBin-2.2.7/run_MaxBin.pl -contig MEGAHIT_assembly/final.contigs.fa -abund_list abund_list_file -out myout -prob_threshold 0.5 -markerset 40
mkdir Maxbin_bin
mv myout* Maxbin_bin/

Step 3.2 - Using MetaBAT2

mkdir Metabat_bin
module load metabat/2.12.1 
jgi_summarize_bam_contig_depths SAM/*bam --outputDepth depth.txt 
metabat2 -i MEGAHIT_assembly/final.contigs.fa -a depth.txt -o Metabat_bin/bins_dir

Step 3.3 - Check quality of bins and taxonomic assignments using CheckM

Bin quality (% completeness, % contamination) and taxonomic assignments based on lineage-specific marker genes can be assessed using CheckM.

Load modules:

module load python3/3.8.0
module load prodigal/2.6.3
module load pplacer/1.1.19
module load hmmer/3.3.2

Download the CheckM database and set its path. This is a one-time task. Before running CheckM, check the location where CheckM is installed.

mkdir ~/Databases/check_data
cd ~/Databases/check_data
wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
tar -zxvf checkm_data_2015_01_16.tar.gz
~/.local/bin/checkm data setRoot ~/Databases/check_data

Run CheckM to check the quality of Metabat Bins

~/.local/bin/checkm tree Metabat_bin Metabat_bin_checkM -x .fa -t 24
~/.local/bin/checkm tree_qa Metabat_bin_checkM -f marker_file
~/.local/bin/checkm lineage_set Metabat_bin_checkM marker_file
~/.local/bin/checkm analyze marker_file Metabat_bin Metabat_bin_checkM_analyse -x .fa -t 24
~/.local/bin/checkm qa marker_file Metabat_bin_checkM_analyse -o 1 -t 24 > metabat_bin_stats.txt

Run CheckM to check the quality of MaxBin Bins

~/.local/bin/checkm tree Maxbin_bin Maxbin_bin_checkM -x .fasta -t 24
~/.local/bin/checkm tree_qa Maxbin_bin_checkM -f marker_file_maxbin
~/.local/bin/checkm lineage_set Maxbin_bin_checkM marker_file_maxbin
~/.local/bin/checkm analyze marker_file_maxbin Maxbin_bin Maxbin_bin_checkM_analyse -x .fasta -t 24
~/.local/bin/checkm qa marker_file_maxbin Maxbin_bin_checkM_analyse -o 1 -t 24 > max_bin_stats.txt

After running CheckM we will have a lot of information about bin quality and taxonomic affiliations. There are also other methods for comparing binning results and/or assigning taxonomy; we will try those before selecting the bins (reconstructed genomes) to be used for growth rate calculations.
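For example, bins passing common completeness/contamination thresholds can be shortlisted from a tab-separated CheckM table (e.g. one produced with `checkm qa --tab_table`); the thresholds and file name below are illustrative:

```shell
# List bins with >=90% completeness and <=5% contamination from a
# tab-separated CheckM qa table; column positions are looked up from
# the header row, so the exact column order does not matter.
# "metabat_bin_stats.tsv" is a placeholder file name.
awk -F'\t' 'NR == 1 {
    for (i = 1; i <= NF; i++) {
      if ($i == "Completeness")  comp = i
      if ($i == "Contamination") cont = i
    }
    next
  }
  $comp >= 90 && $cont <= 5 { print $1 }' metabat_bin_stats.tsv
```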

Step 3.4 (Optional) - Integrate results from different binning methods to generate a non-redundant set of bins using DAS Tool

# See https://github.com/cmks/DAS_Tool for more details

mkdir DAS_tools
cut -d" " -f1 MEGAHIT_assembly/final.contigs.fa > DAS_tools/contigs_formatDastools.fa

Generate scaffold-to-bin files for both methods and format them manually for use with DAS Tool.

grep ">" Metabat_bin/bins_dir.*fa > DAS_tools/metabat_Scaffolds2bin
grep ">" Maxbin_bin/myout.*fasta > DAS_tools/maxbin_Scaffolds2bin

These two files need to be formatted: the first column should be the scaffold name and the second column the bin ID (tab-delimited). DAS Tool will then generate a non-redundant set of bins (genomes).
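The reformatting can also be scripted; a sketch for the MetaBAT2 bins (the MaxBin2 file is analogous, with the .fasta suffix), assuming the `file:>header` lines produced by the grep commands above:

```shell
# Turn `grep ">"` output (path/bin.fa:>contig ...) into the
# tab-delimited scaffold<TAB>bin_id table that DAS Tool expects.
grep ">" Metabat_bin/bins_dir.*fa | awk -F':>' '{
    split($2, h, " ")                   # scaffold name = first header token
    n = split($1, p, "/"); bin = p[n]   # drop the directory from the path
    sub(/\.fa$/, "", bin)               # drop the .fa suffix
    print h[1] "\t" bin
  }' > DAS_tools/metabat_Scaffolds2bin.tsv
```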

./DAS_Tool -i DAS_tools/metabat_Scaffolds2bin.tsv,DAS_tools/maxbin_Scaffolds2bin.tsv \
           -l metabat,maxbin \
           -c DAS_tools/contigs_formatDastools.fa \
           -o DAS_tools/DASToolRun1

Step 3.5 (Optional) - Taxonomic classification of contigs and metagenome-assembled genomes (MAGs/bins) using CAT and BAT

# See https://github.com/dutilh/CAT for more details

cd ~/Databases
wget tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20210107.tar.gz
tar -zxvf CAT_prepare_20210107.tar.gz

Database preparation is done; this is a one-time task. Now we can run CAT/BAT to assign taxonomy to contigs and bins.

module load python3/3.8.0
module load prodigal/2.6.3
cd ~/IBD_datasets/HMP2/Bacterial_replication/Sample_Dataset

#--- Run CAT on contigs
~/Softwares/CAT-master/CAT_pack/CAT contigs -c MEGAHIT_assembly/final.contigs.fa -d ~/Databases/CAT_prepare_20210107/2021-01-07_CAT_database -t ~/Databases/CAT_prepare_20210107/2021-01-07_taxonomy --path_to_diamond ~/Databases/CAT_prepare_20210107/Diamond_2.0.6/diamond -p out.CAT.predicted_proteins.faa
~/Softwares/CAT-master/CAT_pack/CAT add_names -i out.CAT.contig2classification.txt -o out.CAT.contig2classification_Names.txt -t ~/Databases/CAT_prepare_20210107/2021-01-07_taxonomy --only_official --exclude_scores
~/Softwares/CAT-master/CAT_pack/CAT summarise -c MEGAHIT_assembly/final.contigs.fa -i out.CAT.contig2classification_Names.txt -o CAT_summary

#--- Check contamination in a selected bin
~/Softwares/CAT-master/CAT_pack/CAT bin -b Maxbin_bin/myout.061.fasta -d ~/Databases/CAT_prepare_20210107/2021-01-07_CAT_database -t ~/Databases/CAT_prepare_20210107/2021-01-07_taxonomy -o BAT_myout061MAG --path_to_diamond ~/Databases/CAT_prepare_20210107/Diamond_2.0.6/diamond
~/Softwares/CAT-master/CAT_pack/CAT contigs -c Maxbin_bin/myout.061.fasta -d ~/Databases/CAT_prepare_20210107/2021-01-07_CAT_database -t ~/Databases/CAT_prepare_20210107/2021-01-07_taxonomy -p BAT.myout061MAG.predicted_proteins.faa -a BAT.myout061MAG.alignment.diamond -o CAT.myout061MAG --path_to_diamond ~/Databases/CAT_prepare_20210107/Diamond_2.0.6/diamond

#--- Run CAT on Bins
~/Softwares/CAT-master/CAT_pack/CAT bins -b Metabat_bin --bin_suffix .fa -d ~/Databases/CAT_prepare_20210107/2021-01-07_CAT_database -t ~/Databases/CAT_prepare_20210107/2021-01-07_taxonomy --path_to_diamond ~/Databases/CAT_prepare_20210107/Diamond_2.0.6/diamond
~/Softwares/CAT-master/CAT_pack/CAT add_names -i out.BAT.bin2classification.txt -o out.BAT.bin2classification_Names.txt -t ~/Databases/CAT_prepare_20210107/2021-01-07_taxonomy --only_official --exclude_scores
~/Softwares/CAT-master/CAT_pack/CAT summarise -i out.BAT.bin2classification_Names.txt -o BAT_summary

Note: Steps 3.4 and 3.5 are completely optional. The selected bins (genomes) obtained after Step 3.3 can be used for the final step, "Bacterial Growth Rate Calculation".

Step 4: Use iRep, CoPTR, and DEMIC to calculate bacterial growth rates (final step)

See https://sourceforge.net/projects/demic/ for installation and detailed usage. DEMIC needs two folders as input: the first contains the sorted sam files (generated in Step 2); the second is a bins directory, which can be Metabat_bin, Maxbin_bin, or any directory containing high-quality, non-redundant bins, as decided after running CheckM or DAS Tool.

perl ~/Softwares/DEMIC.pl -S SAM -F Metabat_bin -O Demic_output

For iRep and CoPTR, separate script files (Part 4) are provided.

Note: the same steps can be used for individual-sample assemblies. For that, the assembled files from multiple samples can be concatenated and the redundancy removed using CD-HIT. This final contig set can then be used for downstream processing from Step 2 onward (there is no need to run a co-assembly; alternatively, both approaches can be used separately to generate bins).
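When merging individual-sample assemblies, contig names can collide across samples (MEGAHIT reuses IDs such as k141_1 in every run), so it is worth prefixing each header with the sample name before concatenation and CD-HIT. A sketch, assuming one MEGAHIT output directory per sample under a hypothetical Assemblies/ folder:

```shell
# Make contig names unique across samples by prefixing each FASTA
# header with the sample ID, then concatenate for CD-HIT dereplication.
for dir in Assemblies/*/; do
  sample=$(basename "$dir")
  awk -v s="$sample" '/^>/ { sub(/^>/, ">" s "_") } { print }' \
    "${dir}final.contigs.fa"
done > all_contigs.fa
```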
