Computational pipeline to calculate bacterial growth rates from metagenomic samples.
By: Ashok K. Sharma; June 18th, 2021;
For any questions about this pipeline, please email ashoks773@gmail.com
- Check the following modules on your HPC and install them manually if they are not available.
- Kneaddata: https://huttenhower.sph.harvard.edu/kneaddata/ (If reads are not quality filtered)
- Trimmomatic: http://www.usadellab.org/cms/?page=trimmomatic
- Bowtie2: https://www.metagenomics.wiki/tools/bowtie2/install
- Samtools: https://gist.github.com/adefelicibus/f6fd06df1b4bb104ceeaccdd7325b856
- Megahit: https://github.com/voutcn/megahit
- MetaBAT2: https://bitbucket.org/berkeleylab/metabat/src/master/
- MaxBin2: https://sourceforge.net/projects/maxbin2/
- Prodigal: https://github.com/hyattpd/Prodigal/wiki/installation
- pplacer: https://github.com/matsen/pplacer
- hmmer: http://hmmer.org/documentation.html
- CheckM: https://github.com/Ecogenomics/CheckM
- DAS_Tools: https://github.com/cmks/DAS_Tool
- CAT and BAT: https://github.com/dutilh/CAT
- DEMIC: https://sourceforge.net/projects/demic/
- Cdhit: https://github.com/weizhongli/cdhit/releases
Before starting the tutorial, check the installation of the following:
- Python3: https://www.python.org/downloads/
- Perl: https://www.perl.org/get.html
- Java: https://docs.oracle.com/en/java/javase/11/install/installation-jdk-linux-platforms.html#GUID-737A84E4-2EFF-4D38-8E60-3E29D1B884B8
Before running this step, raw metagenomic reads need to be processed for: 1. Quality filtering, using Trimmomatic; and 2. Host contamination removal, using Bowtie2 and Samtools. Alternatively, kneaddata (https://huttenhower.sph.harvard.edu/kneaddata/) can be used for both purposes. Kneaddata can be run using the following command.
kneaddata --input ${SAMPLE}_1.fastq.gz --input ${SAMPLE}_2.fastq.gz -db /home/sharmaa4/Databases/Human_Genome/human_GRCh38.p13 --output /home/sharmaa4/IBD_datasets/HMP2/step1_WGS_rawReads_QC/${SAMPLE}_kneaddata_output --bypass-trf
Output of kneaddata: multiple files are generated per sample. We need the following four files per sample: paired_1.fastq, paired_2.fastq, unmatched_1.fastq, unmatched_2.fastq. All four files can be merged into one file per sample if you want to use HUMAnN3 (https://huttenhower.sph.harvard.edu/humann/) or MetaPhlAn 3.0 (https://huttenhower.sph.harvard.edu/metaphlan/) for functional and taxonomic classification, respectively. Otherwise, paired-end files can be generated by concatenating paired_1.fastq with unmatched_1.fastq, and paired_2.fastq with unmatched_2.fastq:
cat ${SAMPLE}_paired_1.fastq ${SAMPLE}_unmatched_1.fastq > ${SAMPLE}_1.fastq
cat ${SAMPLE}_paired_2.fastq ${SAMPLE}_unmatched_2.fastq > ${SAMPLE}_2.fastq
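If you plan to run HUMAnN3 or MetaPhlAn instead, the four kneaddata outputs can be pooled into a single file per sample. A minimal sketch (the temp directory, demo sample name, and demo read contents are hypothetical; substitute your real kneaddata output files):

```shell
# Demo of the pooling step in a temporary directory; the file names mirror
# kneaddata's outputs, but the sample name and reads here are made up.
cd "$(mktemp -d)"
SAMPLE=demo
for f in paired_1 paired_2 unmatched_1 unmatched_2; do
  printf '@read_%s\nACGT\n+\nIIII\n' "$f" > ${SAMPLE}_${f}.fastq  # one dummy read each
done
# Pool all four files into one per-sample file for HUMAnN3 / MetaPhlAn:
cat ${SAMPLE}_paired_1.fastq ${SAMPLE}_paired_2.fastq \
    ${SAMPLE}_unmatched_1.fastq ${SAMPLE}_unmatched_2.fastq > ${SAMPLE}_merged.fastq
wc -l ${SAMPLE}_merged.fastq   # 16 lines = 4 reads x 4 FASTQ lines per read
```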
gzip *fastq
We are starting from here: for this tutorial, quality-filtered paired-end reads (_1.fastq.gz and _2.fastq.gz) will be used. We have a total of 20 fastq.gz files for 10 samples. These will be used directly for downstream processing. Note: only a limited file size can be uploaded to GitHub, so a very small number of reads (50,000 reads/sample) was uploaded. This may cause errors during the binning steps, and you might end up with zero bins. It is therefore recommended to start this tutorial with full-size data, which can be downloaded from any publicly available metagenomic dataset. The metagenomic samples provided here were downloaded from https://ibdmdb.org/tunnel/public/summary.html
module load bowtie/2.3.2
module load samtools/1.9
cd ~/IBD_datasets/HMP2/Bacterial_replication/Sample_Dataset # Set your directory accordingly, i.e., wherever you put your quality-filtered samples.
~/Softwares/MEGAHIT-1.2.9-Linux-x86_64-static/bin/megahit -1 Samples/SRR5936130_1.fastq.gz,Samples/SRR5936131_1.fastq.gz,Samples/SRR5936132_1.fastq.gz,Samples/SRR5936133_1.fastq.gz,Samples/SRR5936134_1.fastq.gz,Samples/SRR5936135_1.fastq.gz,Samples/SRR5936136_1.fastq.gz,Samples/SRR5936137_1.fastq.gz,Samples/SRR5936138_1.fastq.gz,Samples/SRR5936139_1.fastq.gz -2 Samples/SRR5936130_2.fastq.gz,Samples/SRR5936131_2.fastq.gz,Samples/SRR5936132_2.fastq.gz,Samples/SRR5936133_2.fastq.gz,Samples/SRR5936134_2.fastq.gz,Samples/SRR5936135_2.fastq.gz,Samples/SRR5936136_2.fastq.gz,Samples/SRR5936137_2.fastq.gz,Samples/SRR5936138_2.fastq.gz,Samples/SRR5936139_2.fastq.gz -m 100000000000 -o MEGAHIT_assembly -t 120
Or concatenate the files before running MEGAHIT:
cat *_1.fastq.gz > allsample_1.fastq.gz
cat *_2.fastq.gz > allsample_2.fastq.gz
~/Softwares/MEGAHIT-1.2.9-Linux-x86_64-static/bin/megahit -1 allsample_1.fastq.gz -2 allsample_2.fastq.gz -m 100000000000 -o MEGAHIT_assembly -t 120
This step includes indexing of the contigs file, followed by mapping of the reads to generate SAM and BAM files, which need to be sorted and indexed so they can be used by other packages such as MetaBAT2 and MaxBin2 (for construction of MAGs) and DEMIC (to calculate bacterial growth rates).
bowtie2-build MEGAHIT_assembly/final.contigs.fa MEGAHIT_assembly/final.contigs.fa
mkdir SAM # Make a folder named SAM to store alignment outputs
cd Samples
for file in *_1.fastq.gz;
do
SAMPLE=$(echo ${file} | sed "s/_1.fastq.gz//")
echo ${SAMPLE}_1.fastq.gz ${SAMPLE}_2.fastq.gz
bowtie2 -q -x MEGAHIT_assembly/final.contigs.fa -1 ${SAMPLE}_1.fastq.gz -2 ${SAMPLE}_2.fastq.gz --no-unal -p 120 -S SAM/${SAMPLE}.sam
samtools view -b -o SAM/${SAMPLE}.bam SAM/${SAMPLE}.sam
samtools sort -o SAM/${SAMPLE}_sort.bam SAM/${SAMPLE}.bam
samtools index SAM/${SAMPLE}_sort.bam
samtools view -h SAM/${SAMPLE}_sort.bam > SAM/${SAMPLE}_sort.sam #-- We need sorted SAM files to run DEMIC in Step 4
done
For binning we mainly need two types of files: 1) the contigs file, and 2) contig abundances, which can be obtained from the mapped BAM/SAM files generated in the previous step.
module load java/1.8.0_221
mkdir Abundance # Make a folder named Abundance to store contig abundance information
cd Samples
for file in *_1.fastq.gz;
do
SAMPLE=$(echo ${file} | sed "s/_1.fastq.gz//")
echo ${SAMPLE}_1.fastq.gz ${SAMPLE}_2.fastq.gz
~/Softwares/bbmap/pileup.sh in=SAM/${SAMPLE}.sam out=Abundance/${SAMPLE}.cov.txt
awk '{print $1"\t"$5}' Abundance/${SAMPLE}.cov.txt | grep -v '^#' > Abundance/${SAMPLE}.abundance.txt
done
ls Abundance/*abundance.txt > abund_list_file
module load perl/5.30.1
perl ~/Softwares/MaxBin-2.2.7/run_MaxBin.pl -contig MEGAHIT_assembly/final.contigs.fa -abund_list abund_list_file -out myout -prob_threshold 0.5 -markerset 40
mkdir Maxbin_bin
mv myout* Maxbin_bin/
mkdir Metabat_bin
module load metabat/2.12.1
jgi_summarize_bam_contig_depths SAM/*bam --outputDepth depth.txt
metabat2 -i MEGAHIT_assembly/final.contigs.fa -a depth.txt -o Metabat_bin/bins_dir
Quality assessment of the bins (% completeness, % contamination, and taxonomic assignment using lineage-specific marker genes) can be done using CheckM. Load the modules:
module load python3/3.8.0
module load prodigal/2.6.3
module load pplacer/1.1.19
module load hmmer/3.3.2
CheckM database download and path setup: this is a one-time task. Before running CheckM, check the location where CheckM is installed.
mkdir ~/Databases/checkm_data
cd ~/Databases/checkm_data
wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
tar -zxvf checkm_data_2015_01_16.tar.gz
~/.local/bin/checkm data setRoot /home/sharmaa4/Databases/checkm_data
Run CheckM to check the quality of the MetaBAT2 bins:
~/.local/bin/checkm tree Metabat_bin Metabat_bin_checkM -x .fa -t 24
~/.local/bin/checkm tree_qa Metabat_bin_checkM -f marker_file
~/.local/bin/checkm lineage_set Metabat_bin_checkM marker_file
~/.local/bin/checkm analyze marker_file Metabat_bin Metabat_bin_checkM_analyse -x .fa -t 24
~/.local/bin/checkm qa marker_file Metabat_bin_checkM_analyse -o 1 -t 24 > metabat_bin_stats.txt
Run CheckM to check the quality of the MaxBin2 bins:
~/.local/bin/checkm tree Maxbin_bin Maxbin_bin_checkM -x .fasta -t 24
~/.local/bin/checkm tree_qa Maxbin_bin_checkM -f marker_file_maxbin
~/.local/bin/checkm lineage_set Maxbin_bin_checkM marker_file_maxbin
~/.local/bin/checkm analyze marker_file_maxbin Maxbin_bin Maxbin_bin_checkM_analyse -x .fasta -t 24
~/.local/bin/checkm qa marker_file_maxbin Maxbin_bin_checkM_analyse -o 1 -t 24 > max_bin_stats.txt
After running CheckM we will have a lot of information about bin quality and taxonomic affiliations. There are also other methods for comparing binning results and/or assigning taxonomy. We will try those methods before selecting the bins (reconstructed genomes) to be used for growth rate calculations.
Step3.4 (Optional) - Integrate results from different binning methods to generate a non-redundant set of bins using DAS_Tool
#https://github.com/cmks/DAS_Tool Check here for more details
mkdir DAS_tools
cut -d" " -f1 MEGAHIT_assembly/final.contigs.fa > DAS_tools/contigs_formatDastools.fa
Generate scaffold-to-bin files for both methods and format them manually to be used by DAS_Tool:
grep ">" Metabat_bin/bins_dir.*fa > DAS_tools/metabat_Scaffolds2bin
grep ">" Maxbin_bin/myout.*fasta > DAS_tools/maxbin_Scaffolds2bin
These two files need to be reformatted: the first column should be the scaffold name and the second column the bin ID (tab delimited). DAS_Tool will then generate a non-redundant set of bins (genomes).
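One way to do that reformatting is sketched below. It assumes the grep output has the form path/to/bin.fa:>scaffold_name (which is what grep produces when it matches across multiple files) and that the bin file name minus its extension is the bin ID; both are assumptions, so adjust to your actual file naming.

```shell
# Sketch: convert "Metabat_bin/bins_dir.1.fa:>k141_123 ..." lines (the grep
# output above) into the tab-delimited "scaffold<TAB>binID" table DAS_Tool expects.
to_scaffolds2bin () {
  awk -F':>' '{
    split($2, h, " ")              # scaffold name = first token after ">"
    sub(/^.*\//, "", $1)           # drop the directory from the bin file path
    sub(/\.(fa|fasta)$/, "", $1)   # drop the extension, leaving the bin ID
    print h[1] "\t" $1
  }'
}
# Usage:
# to_scaffolds2bin < DAS_tools/metabat_Scaffolds2bin > DAS_tools/metabat_Scaffolds2bin.tsv
# to_scaffolds2bin < DAS_tools/maxbin_Scaffolds2bin  > DAS_tools/maxbin_Scaffolds2bin.tsv
```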
./DAS_Tool -i DAS_tools/metabat_Scaffolds2bin.tsv,DAS_tools/maxbin_Scaffolds2bin.tsv \
  -l metabat,maxbin \
  -c DAS_tools/contigs_formatDastools.fa \
  -o DAS_tools/DASToolRun1
Step3.5 (Optional) - Taxonomic classification of contigs and metagenome-assembled genomes (MAGs/bins) using CAT and BAT.
#https://github.com/dutilh/CAT # Visit this site for more details
cd ~/Databases
wget tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20210107.tar.gz
tar -zxvf CAT_prepare_20210107.tar.gz
Database preparation is done. This is a one-time task. Now we can run CAT/BAT to assign taxonomy to contigs and bins.
module load python3/3.8.0
module load prodigal/2.6.3
cd ~/IBD_datasets/HMP2/Bacterial_replication/Sample_Dataset
#--- Run CAT on contigs
~/Softwares/CAT-master/CAT_pack/CAT contigs -c MEGAHIT_assembly/final.contigs.fa -d ~/Databases/CAT_prepare_20210107/2021-01-07_CAT_database -t ~/Databases/CAT_prepare_20210107/2021-01-07_taxonomy --path_to_diamond ~/Databases/CAT_prepare_20210107/Diamond_2.0.6/diamond -p out.CAT.predicted_proteins.faa
~/Softwares/CAT-master/CAT_pack/CAT add_names -i out.CAT.contig2classification.txt -o out.CAT.contig2classification_Names.txt -t ~/Databases/CAT_prepare_20210107/2021-01-07_taxonomy --only_official --exclude_scores
~/Softwares/CAT-master/CAT_pack/CAT summarise -c MEGAHIT_assembly/final.contigs.fa -i out.CAT.contig2classification_Names.txt -o CAT_summary
#-- Check contamination in a selected bin
~/Softwares/CAT-master/CAT_pack/CAT bin -b Maxbin_bin/myout.061.fasta -d ~/Databases/CAT_prepare_20210107/2021-01-07_CAT_database -t ~/Databases/CAT_prepare_20210107/2021-01-07_taxonomy -o BAT_myout061MAG --path_to_diamond ~/Databases/CAT_prepare_20210107/Diamond_2.0.6/diamond
~/Softwares/CAT-master/CAT_pack/CAT contigs -c Maxbin_bin/myout.061.fasta -d ~/Databases/CAT_prepare_20210107/2021-01-07_CAT_database -t ~/Databases/CAT_prepare_20210107/2021-01-07_taxonomy -p BAT.myout061MAG.predicted_proteins.faa -a BAT.myout061MAG.alignment.diamond -o CAT.myout061MAG --path_to_diamond ~/Databases/CAT_prepare_20210107/Diamond_2.0.6/diamond
#--- Run CAT on Bins
~/Softwares/CAT-master/CAT_pack/CAT bins -b Metabat_bin --bin_suffix .fa -d ~/Databases/CAT_prepare_20210107/2021-01-07_CAT_database -t ~/Databases/CAT_prepare_20210107/2021-01-07_taxonomy --path_to_diamond ~/Databases/CAT_prepare_20210107/Diamond_2.0.6/diamond
~/Softwares/CAT-master/CAT_pack/CAT add_names -i out.BAT.bin2classification.txt -o out.BAT.bin2classification_Names.txt -t ~/Databases/CAT_prepare_20210107/2021-01-07_taxonomy --only_official --exclude_scores
~/Softwares/CAT-master/CAT_pack/CAT summarise -i out.BAT.bin2classification_Names.txt -o BAT_summary
Note: Steps 3.4 and 3.5 are completely optional. The selected bins (genomes) obtained after Step 3.3 can be used for the final step, "Bacterial Growth Rate Calculation".
https://sourceforge.net/projects/demic/ # Check this link for installation and detailed usage. DEMIC needs two folders as input: one containing the sorted SAM files (generated in Step 2), and a second containing the constructed bins; this can be Metabat_bin, Maxbin_bin, or any directory containing high-quality, non-redundant bins. Which one to use can be decided after running CheckM or DAS_Tool.
perl ~/Softwares/DEMIC.pl -S SAM -F Metabat_bin -O Demic_output
For iRep and CoPTR, separate script files (Part4) are added.
Note: The same steps can be used for individual-sample assemblies. For that, the multiple assembled files can be concatenated and the redundancy removed using CD-HIT. This final contig set can then be used for downstream processing from Step 2 onwards (there is then no need to run a co-assembly; alternatively, both approaches can be used separately to generate bins).
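A sketch of that individual-assembly route follows. The Assemblies/<sample>/final.contigs.fa layout, output file names, and cd-hit-est being on PATH are all assumptions; the clustering parameters are the cd-hit-est defaults commonly suggested for a 95% identity threshold.

```shell
# Sketch: pool the per-sample assemblies, then remove redundancy with CD-HIT-EST
# before rerunning Step 2 onward against the merged, dereplicated contig set.
# Assumes assemblies at Assemblies/<sample>/final.contigs.fa (hypothetical layout).
dereplicate_assemblies () {
  cat Assemblies/*/final.contigs.fa > all_contigs.fa
  # -c 0.95: cluster contigs at 95% identity; -n 10: word size recommended for -c >= 0.90
  # -M 16000: memory limit in MB; -T 24: threads
  cd-hit-est -i all_contigs.fa -o nr_contigs.fa -c 0.95 -n 10 -M 16000 -T 24
}
# dereplicate_assemblies   # then use nr_contigs.fa in place of final.contigs.fa from Step 2
```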