omicsEye/16S_DB: This repository hosts scripts to generate a clean database of 16S references from NCBI RefSeq and SILVA.

Usage

• Database Preparation: The db_prep directory contains an automated pipeline that downloads, normalizes, and cleans four major 16S rRNA reference databases (SILVA, Greengenes2, RefSeq, and MIMt). Run db_prep/db_prep.sh to start the complete cleaning pipeline.
• Benchmarking: The benchmark directory contains scripts that compare the cleaned databases against the original databases. The scripts run DADA2 on mock community datasets to evaluate taxonomic classification accuracy.

Key Features

• Database Cleaning Pipeline: Removes nested sequences and duplicates, and corrects missing taxonomic nomenclature in 16S rRNA reference databases
• Multi-Database Comparison: Benchmarking framework comparing four major 16S databases (SILVA, Greengenes2, RefSeq, and MIMt) using 69 mock communities
• DADA2 Integration: Optimized for use with the DADA2 analysis pipeline for microbial genus-level profiling
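
The duplicate and nested-sequence removal described above can be illustrated with a minimal sketch. It assumes that "nested" means one sequence is an exact substring of a longer one, and that sequences are held in a plain dictionary of ID to sequence. The function name `remove_duplicates_and_nested` and the toy records are hypothetical; the actual logic in db_prep may differ, for example by also taking taxonomy labels into account.

```python
def remove_duplicates_and_nested(records):
    """Return the records that are neither exact duplicates of, nor
    contained within, another kept sequence.

    records: dict mapping sequence ID -> nucleotide sequence.
    Assumption: "nested" means exact substring containment.
    """
    # Visit the longest sequences first so that any container is already
    # kept before the shorter sequences nested inside it are examined.
    # Ties are broken by ID so the result is deterministic.
    ordered = sorted(records.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    kept = {}
    for seq_id, seq in ordered:
        seq = seq.upper()
        # An exact duplicate is also a substring, so one check covers both.
        if any(seq in kept_seq for kept_seq in kept.values()):
            continue
        kept[seq_id] = seq
    return kept


# Toy example with hypothetical IDs and sequences.
toy = {
    "a": "ACGTACGT",
    "b": "ACGTACGT",  # exact duplicate of "a"
    "c": "GTAC",      # nested inside "a"
    "d": "TTTT",      # unrelated, should be kept
}
cleaned = remove_duplicates_and_nested(toy)
print(sorted(cleaned))  # ['a', 'd']
```

This pairwise check is quadratic in the number of sequences, so it is only suitable for illustration; a database with hundreds of thousands of sequences would need an indexed or clustering-based approach.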

Database Processing Results

• SILVA: Reduced from 452,055 to 291,733 sequences (35% reduction)
• Greengenes2: Reduced from 337,506 to 277,982 sequences (18% reduction)
• MIMt: Reduced from 48,749 to 34,734 sequences (29% reduction)
• RefSeq: Reduced from 27,376 to 25,970 sequences (5% reduction)
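
The percentages above follow directly from the before and after counts. The short check below recomputes them; the counts are taken from the list above, and the helper name `percent_reduction` is only illustrative.

```python
# Sequence counts before and after cleaning, as reported above.
counts = {
    "SILVA": (452_055, 291_733),
    "Greengenes2": (337_506, 277_982),
    "MIMt": (48_749, 34_734),
    "RefSeq": (27_376, 25_970),
}


def percent_reduction(before, after):
    """Percentage of sequences removed relative to the original count."""
    return 100 * (before - after) / before


for name, (before, after) in counts.items():
    removed = before - after
    print(f"{name}: {removed:,} sequences removed "
          f"({percent_reduction(before, after):.0f}% reduction)")
```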

Performance Insights

• Greengenes2, MIMt, and RefSeq consistently outperform SILVA in recall, precision, and abundance estimation accuracy
• Database preprocessing did not change the taxonomic classification results but improved computational efficiency
• The cleaned databases are provided as ready-to-use resources for accurate 16S rRNA analysis
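
As an illustration of the recall and precision metrics mentioned above, the sketch below scores a set of observed genera against the known composition of a mock community. It assumes the simplest presence/absence definitions at genus level; the exact metric definitions used in the benchmark scripts are not stated here, and the function name and genus lists are hypothetical.

```python
def precision_recall(expected, observed):
    """Genus-level precision and recall for one mock community.

    expected: genera known to be in the mock community.
    observed: genera reported by the classifier.
    Assumption: presence/absence scoring, ignoring abundance.
    """
    expected, observed = set(expected), set(observed)
    true_positives = len(expected & observed)
    precision = true_positives / len(observed) if observed else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical mock community and classifier output.
expected_genera = ["Escherichia", "Bacillus", "Staphylococcus", "Listeria"]
observed_genera = ["Escherichia", "Bacillus", "Staphylococcus",
                   "Pseudomonas", "Salmonella"]

precision, recall = precision_recall(expected_genera, observed_genera)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.60 recall=0.75
```

Three of the five observed genera are correct (precision 0.60), and three of the four expected genera are recovered (recall 0.75).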

Applications

• Microbial profiling and taxonomic assignment
• 16S rRNA gene analysis workflows
• Microbiome research and analysis pipelines

Citation:

The preprint is available on bioRxiv: https://doi.org/10.1101/2025.11.04.686545

@article {Baghbanzadeh2025.11.04.686545,
	author = {Baghbanzadeh, Mahdi and Mahangade, Vedant and Crandall, Keith A and Rahnavard, Ali},
	title = {Curating 16S rRNA databases enhances taxonomic accuracy and computational efficiency in microbial profiling},
	elocation-id = {2025.11.04.686545},
	year = {2025},
	doi = {10.1101/2025.11.04.686545},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/11/05/2025.11.04.686545},
	eprint = {https://www.biorxiv.org/content/early/2025/11/05/2025.11.04.686545.full.pdf},
	journal = {bioRxiv}
}

