🧬 eDNA Biodiversity AI Explorer

Zero-Shot Species Detection | DNABERT-2-117M | Streamlit + CLI

Upload eDNA sequences from hair, blood, soil, or water and detect species using DNABERT-2-117M, a foundation model pre-trained on genomes from 135 species.

Architecture

Raw DNA String
     │
     ▼
DNABERT-2 Tokenizer  (BPE tokenization, internal to model)
     │
     ▼
DNABERT-2 Encoder  (117M params — MosaicBERT, 12 layers, 768-dim)
     │   pre-trained on 135 species genomes from NCBI
     ▼
[CLS] Token Embedding  (768-dim contextual sequence representation)
     │
     ▼
Classification Head  (LayerNorm → Dropout → Linear(768→256) → GELU → Linear(256→N))
     │
     ▼
Softmax → Confidence Threshold Gate
     │                │
     ▼                ▼
Top species      < 75% prob → "Inconclusive / Mixed Sample"
+ confidence
+ top-3 list
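
The last two stages of the diagram (softmax followed by the confidence threshold gate) can be sketched in plain Python. This is an illustration only, not the code in model.py: the function names `softmax` and `gate`, the returned dictionary layout, and the example logits are all assumptions. Only the 75% threshold, the "Inconclusive / Mixed Sample" label, and the top-3 list come from the diagram above.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate(logits, labels, threshold=0.75, top_k=3):
    """Hypothetical helper: rank species and flag low-confidence predictions."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    best_label, best_prob = ranked[0]
    # Below the threshold the sample is reported as inconclusive.
    status = best_label if best_prob >= threshold else "Inconclusive / Mixed Sample"
    return {"status": status, "confidence": best_prob, "top": ranked[:top_k]}

labels = ["Human", "Chimpanzee", "Dog", "Unknown"]
confident = gate([4.0, 0.5, -1.0, -2.0], labels)   # made-up logits
uncertain = gate([1.0, 1.0, 1.0, 1.0], labels)     # uniform scores, 25% each
print(confident["status"])
print(uncertain["status"])
```

With the made-up logits above, the first call passes the gate and the second one (25% for every class) is flagged as inconclusive.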

File Structure

Biodiversity_AI/
├── config.json          ← species map, hyperparameters, model name
├── model.py             ← DNABERT2Classifier wrapper + predict()
├── app.py               ← Streamlit web UI
├── predict.py           ← CLI inference tool (sequence / FASTA / CSV)
├── fasta_to_csv.py      ← FASTA → labeled CSV converter
├── test_meta.py         ← quick model initialization test
├── requirements.txt     ← pip dependencies
├── data.csv             ← sample labeled data
├── data/                ← sample FASTA files
│   ├── human.fasta
│   ├── chimp.fasta
│   ├── dog.fasta
│   └── unknown.fasta
└── DNABERT-2-117M/      ← local model weights (from HuggingFace)
    ├── config.json
    ├── pytorch_model.bin
    ├── tokenizer.json
    ├── bert_layers.py
    ├── bert_padding.py
    ├── configuration_bert.py
    └── flash_attn_triton.py

Quick Start

1. Install dependencies

pip install -r requirements.txt

2. Run the Streamlit app

streamlit run app.py

Upload a .fasta, .csv, or .txt file to detect species.

3. CLI inference (alternative)

# Single sequence
python predict.py --seq "ATCGATCGATCGATCGATCGATCGATCG..."

# FASTA file
python predict.py --fasta data/human.fasta

# CSV batch
python predict.py --csv data.csv --output results.csv

# Interactive mode
python predict.py
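
For orientation, the relationship between the FASTA and CSV inputs can be sketched as follows. This is a minimal sketch, not the contents of fasta_to_csv.py: the function names and the CSV column names (`id`, `sequence`, `label`) are assumptions, and the real schema of data.csv may differ.

```python
import csv
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

def fasta_to_rows(handle, label):
    """Attach one integer label to every record (column names are hypothetical)."""
    return [{"id": h, "sequence": s, "label": label} for h, s in read_fasta(handle)]

def write_csv(rows, out):
    writer = csv.DictWriter(out, fieldnames=["id", "sequence", "label"])
    writer.writeheader()
    writer.writerows(rows)

# In-memory demo with a toy two-record FASTA; label 0 stands for Human.
demo = io.StringIO(">rec1\nATCG\natcg\n>rec2\nGGCC\n")
rows = fasta_to_rows(demo, label=0)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

Multi-line sequences are joined and upper-cased so that each CSV row carries one contiguous sequence string.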

Supported Species

Species	Label
Human	0
Chimpanzee	1
Dog	2
Unknown	3

To add a new species, edit config.json → species_map and provide corresponding training data.
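
As a rough illustration of that edit, the snippet below appends a species to an in-memory config. The layout of `species_map` (species name mapped to an integer label) is an assumption based on the table above, and `add_species` is a hypothetical helper, not part of this repository. Note that a new label also changes N, the output size of the classification head, which is why training data for the new class is needed.

```python
import json

def add_species(config, name):
    """Return a copy of config with `name` assigned the next free label."""
    species_map = dict(config.get("species_map", {}))
    if name in species_map:
        raise ValueError(f"{name!r} is already in species_map")
    next_label = max(species_map.values(), default=-1) + 1
    species_map[name] = next_label
    return {**config, "species_map": species_map}

# Assumed shape of the relevant part of config.json.
config = {"species_map": {"Human": 0, "Chimpanzee": 1, "Dog": 2, "Unknown": 3}}
updated = add_species(config, "Cat")
print(json.dumps(updated, indent=2))
```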

Sample Output

────────────────────────────────────────────────────
  Record : Human_mitochondria_CYB
  Status : ✅  Human
  Species: Human  (97.43%)

  ┌─ Ranked Predictions ─────────────────────┐
  │  🥇  Human         97.43%
  │  🥈  Chimpanzee     2.11%
  │  🥉  Dog            0.29%
  └───────────────────────────────────────────┘
────────────────────────────────────────────────────

Key Design Decisions

Zero-shot inference: the pre-trained DNABERT-2 backbone is used as-is, without fine-tuning; the classification head on top of it maps the embedding to species labels
Confidence threshold gate: predictions whose top probability is below 75% are flagged as "Inconclusive / Mixed Sample"
Local model: weights are loaded from the DNABERT-2-117M/ directory, so inference works offline
Apple Silicon (Mac M1) compatible: automatically uses MPS acceleration when available
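
The automatic device choice mentioned in the last point might look roughly like this. It is a sketch under assumptions: `select_device` is a hypothetical name, preferring CUDA over MPS is a guess, and the actual logic in model.py is not shown here. `torch.cuda.is_available()` and `torch.backends.mps.is_available()` are the standard PyTorch checks.

```python
def select_device() -> str:
    """Pick the best available backend name, falling back to CPU."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed there is nothing to accelerate.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps exists in recent PyTorch builds; guard for older ones.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

device = select_device()
print(f"Running inference on: {device}")
```

The returned string can be passed to `torch.device(...)` or to a module's `.to(...)` call.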

Requirements

Python 3.10+
PyTorch ≥ 2.0.0
transformers ≥ 4.40.0
streamlit ≥ 1.28.0
einops (required by DNABERT-2 MosaicBERT architecture)
