• Database Preparation: The db_prep directory contains an automated pipeline that downloads, normalizes, and cleans four major 16S rRNA reference databases (SILVA, Greengenes2, RefSeq, and MIMt). Run db_prep/db_prep.sh to start the complete cleaning pipeline.
• Benchmarking: The benchmark directory contains scripts that compare the cleaned databases against the original databases, using DADA2 on mock community datasets to evaluate taxonomic classification accuracy.
• Database Cleaning Pipeline: Removes nested and duplicate sequences and corrects missing taxonomic nomenclature in 16S rRNA reference databases
• Multi-Database Comparison: Benchmarking framework comparing four major 16S databases (SILVA, Greengenes2, RefSeq, and MIMt) using 69 mock communities
• DADA2 Integration: Optimized for use with the DADA2 analysis pipeline for microbial genus-level profiling
• SILVA: Reduced from 452,055 to 291,733 sequences (35% reduction)
• Greengenes2: Reduced from 337,506 to 277,982 sequences (18% reduction)
• MIMt: Reduced from 48,749 to 34,734 sequences (29% reduction)
• RefSeq: Reduced from 27,376 to 25,970 sequences (5% reduction)
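
The reduction percentages above follow directly from the before and after counts. As a quick check, the short Python sketch below recomputes them; the dictionary layout and the `percent_reduction` helper are illustrative names, not part of the repository.

```python
# Sequence counts before and after cleaning, copied from the list above.
counts = {
    "SILVA": (452_055, 291_733),
    "Greengenes2": (337_506, 277_982),
    "MIMt": (48_749, 34_734),
    "RefSeq": (27_376, 25_970),
}


def percent_reduction(before: int, after: int) -> float:
    """Return the share of sequences removed, as a percentage of the original."""
    return 100.0 * (before - after) / before


# Hypothetical helper structure: database name -> percentage removed.
reductions = {name: percent_reduction(b, a) for name, (b, a) in counts.items()}

for name, pct in reductions.items():
    before, after = counts[name]
    print(f"{name}: removed {before - after:,} sequences ({pct:.1f}%)")
```

Rounded to whole numbers, the computed values match the 35%, 18%, 29%, and 5% quoted in the list.
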
• Greengenes2, MIMt, and RefSeq consistently outperform SILVA in recall, precision, and abundance estimation accuracy
• Database preprocessing did not change the taxonomic classification results but improved computational efficiency
• Provides cleaned databases as ready-to-use resources for accurate 16S rRNA analysis
• Microbial profiling and taxonomic assignment
• 16S rRNA gene analysis workflows
• Microbiome research and analysis pipelines
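
To give a feel for the duplicate and nested-sequence removal mentioned in the feature list, here is a minimal Python sketch. It is not the code run by db_prep/db_prep.sh. It assumes that a sequence counts as nested when it is an exact substring of a longer retained sequence, and it ignores taxonomy labels entirely. The function name `remove_duplicates_and_nested` and the toy records are hypothetical.

```python
def remove_duplicates_and_nested(records: dict[str, str]) -> dict[str, str]:
    """Drop exact duplicates and sequences contained in a longer kept sequence.

    `records` maps a sequence ID to its nucleotide string. This is a quadratic
    illustration under the assumptions stated above, not the repository's
    implementation.
    """
    kept: dict[str, str] = {}
    # Visit the longest sequences first so that a containing sequence is always
    # kept before the sequences nested inside it are examined.
    by_length = sorted(records.items(), key=lambda item: len(item[1]), reverse=True)
    for seq_id, seq in by_length:
        seq = seq.upper()
        # An exact duplicate is also a substring of an already kept sequence,
        # so one containment test covers both duplicates and nested sequences.
        if any(seq in longer for longer in kept.values()):
            continue
        kept[seq_id] = seq
    return kept


# Toy input (made-up sequences, for illustration only).
toy = {
    "a": "ACGTACGTGG",
    "b": "ACGTACGTGG",  # exact duplicate of "a"
    "c": "GTACGT",      # nested inside "a"
    "d": "TTTTCCCC",    # unrelated to the others
}
cleaned = remove_duplicates_and_nested(toy)
print(sorted(cleaned))  # ['a', 'd']
```

The pairwise containment test is fine for a toy input but would be slow on databases with hundreds of thousands of sequences; a real pipeline would likely rely on a dedicated clustering or indexing tool.
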
The preprint is available on bioRxiv: https://doi.org/10.1101/2025.11.04.686545
@article {Baghbanzadeh2025.11.04.686545,
author = {Baghbanzadeh, Mahdi and Mahangade, Vedant and Crandall, Keith A and Rahnavard, Ali},
title = {Curating 16S rRNA databases enhances taxonomic accuracy and computational efficiency in microbial profiling},
elocation-id = {2025.11.04.686545},
year = {2025},
doi = {10.1101/2025.11.04.686545},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2025/11/05/2025.11.04.686545},
eprint = {https://www.biorxiv.org/content/early/2025/11/05/2025.11.04.686545.full.pdf},
journal = {bioRxiv}
}
