This repository hosts scripts to generate a clean database of 16S references from NCBI RefSeq and SILVA.

omicsEye/16S_DB

Usage

Database Preparation: The db_prep directory contains an automated pipeline that downloads, normalizes, and cleans four major 16S rRNA reference databases (SILVA, Greengenes2, RefSeq, and MIMt). Run db_prep/db_prep.sh to start the complete cleaning pipeline.
Benchmarking: The benchmark directory contains scripts that compare the cleaned databases against the original databases. The scripts run DADA2 on mock community datasets to evaluate taxonomic classification accuracy.

Key Features

Database Cleaning Pipeline: Removes nested sequences and duplicates, and corrects missing taxonomic nomenclature in 16S rRNA reference databases
Multi-Database Comparison: Benchmarking framework comparing four major 16S databases (SILVA, Greengenes2, RefSeq, and MIMt) using 69 mock communities
DADA2 Integration: Optimized for use with the DADA2 analysis pipeline for microbial genus-level profiling
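
To make the duplicate and nested-sequence removal described above concrete, here is a minimal Python sketch. It is not the code in db_prep; the function name `remove_duplicates_and_nested` and the input format (a dict of ID to sequence) are assumptions for illustration. It treats a sequence as nested when it is an exact substring of a longer retained sequence, and it ignores taxonomy labels, which the actual pipeline may take into account.

```python
def remove_duplicates_and_nested(records):
    """Return a dict of ID -> sequence with duplicates and nested sequences dropped.

    `records` maps a sequence ID to its sequence (hypothetical input format).
    """
    # Step 1: collapse exact duplicates, keeping the first ID seen per sequence.
    unique = {}
    for seq_id, seq in records.items():
        unique.setdefault(seq.upper(), seq_id)

    # Step 2: visit sequences from longest to shortest and drop any sequence
    # that is fully contained in a longer sequence already kept.
    # This pairwise check is quadratic, so it only suits small inputs.
    kept = []
    for seq in sorted(unique, key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)

    return {unique[seq]: seq for seq in kept}


example = {
    "a": "ACGTACGT",
    "b": "acgtacgt",  # duplicate of "a" after case normalization
    "c": "GTAC",      # nested inside "a"
    "d": "TTTT",      # unrelated, so it is kept
}
cleaned = remove_duplicates_and_nested(example)
print(cleaned)
```

A production pipeline over hundreds of thousands of sequences would need an indexed or clustering-based approach rather than this pairwise comparison.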

Database Processing Results

SILVA: Reduced from 452,055 to 291,733 sequences (35% reduction)
Greengenes2: Reduced from 337,506 to 277,982 sequences (18% reduction)
MIMt: Reduced from 48,749 to 34,734 sequences (29% reduction)
RefSeq: Reduced from 27,376 to 25,970 sequences (5% reduction)
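
The reduction percentages above follow directly from the before and after counts. The short Python check below recomputes them; the counts are taken from the list above, while the helper name `percent_reduction` is only illustrative.

```python
# Sequence counts (before, after) as reported in the list above.
counts = {
    "SILVA": (452_055, 291_733),
    "Greengenes2": (337_506, 277_982),
    "MIMt": (48_749, 34_734),
    "RefSeq": (27_376, 25_970),
}


def percent_reduction(before, after):
    """Percentage of sequences removed, relative to the original count."""
    return 100 * (before - after) / before


for name, (before, after) in counts.items():
    removed = before - after
    print(f"{name}: {removed:,} removed ({percent_reduction(before, after):.0f}%)")
```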

Performance Insights

• Greengenes2, MIMt, and RefSeq consistently outperform SILVA in recall, precision, and abundance estimation accuracy
• Database preprocessing did not change the taxonomic classification results but improved computational efficiency
• The cleaned databases are provided as ready-to-use resources for accurate 16S rRNA analysis
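
For readers unfamiliar with how recall and precision apply to a mock community, the sketch below uses the common set-based definitions at the genus level. This README does not state the exact metric definitions used by the benchmark scripts, so treat this as an assumption; the function name and the genus lists are made up for illustration.

```python
def precision_recall(expected, observed):
    """Set-based precision and recall of observed genera against a known mock community."""
    expected, observed = set(expected), set(observed)
    true_positives = len(expected & observed)
    # Precision: fraction of reported genera that are truly in the community.
    precision = true_positives / len(observed) if observed else 0.0
    # Recall: fraction of community genera that were reported.
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical mock community and classifier output.
expected_genera = ["Bacillus", "Escherichia", "Listeria", "Staphylococcus"]
observed_genera = ["Bacillus", "Escherichia", "Listeria", "Shigella", "Pseudomonas"]

precision, recall = precision_recall(expected_genera, observed_genera)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here three of the five reported genera are correct and three of the four expected genera are recovered. Abundance estimation accuracy would need a separate measure that compares relative abundances rather than presence alone.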

Applications

• Microbial profiling and taxonomic assignment
• 16S rRNA gene analysis workflows
• Microbiome research and analysis pipelines

Citation:

The preprint is available on bioRxiv: https://www.biorxiv.org/content/early/2025/11/05/2025.11.04.686545

@article {Baghbanzadeh2025.11.04.686545,
	author = {Baghbanzadeh, Mahdi and Mahangade, Vedant and Crandall, Keith A and Rahnavard, Ali},
	title = {Curating 16S rRNA databases enhances taxonomic accuracy and computational efficiency in microbial profiling},
	elocation-id = {2025.11.04.686545},
	year = {2025},
	doi = {10.1101/2025.11.04.686545},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/11/05/2025.11.04.686545},
	eprint = {https://www.biorxiv.org/content/early/2025/11/05/2025.11.04.686545.full.pdf},
	journal = {bioRxiv}
}
