Upload eDNA sequences from hair, blood, soil, or water and detect species using DNABERT-2-117M — a foundation model pre-trained on 135 species genomes.
Raw DNA String
│
▼
DNABERT-2 Tokenizer (BPE tokenization, internal to model)
│
▼
DNABERT-2 Encoder (117M params — MosaicBERT, 12 layers, 768-dim)
│ pre-trained on 135 species genomes from NCBI
▼
[CLS] Token Embedding (768-dim contextual sequence representation)
│
▼
Classification Head (LayerNorm → Dropout → Linear(768→256) → GELU → Linear(256→N))
│
▼
Softmax → Confidence Threshold Gate
│                        │
▼                        ▼
Top species              < 75% prob → "Inconclusive / Mixed Sample"
+ confidence
+ top-3 list
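The last stage of the pipeline above (softmax followed by the 75% confidence gate) can be sketched in plain Python. This is a minimal illustration, not the project's `predict()` code: the names `softmax`, `gate`, and `CONFIDENCE_THRESHOLD` are hypothetical, and the logits are made-up numbers standing in for the classification head's output.

```python
import math

CONFIDENCE_THRESHOLD = 0.75  # the 75% gate from the diagram (name is hypothetical)


def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate(logits, labels, threshold=CONFIDENCE_THRESHOLD, top_k=3):
    """Return (status, confidence, ranked top-k list) for one sequence."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    best_label, best_prob = ranked[0]
    # Below the threshold, the top label is not trusted.
    status = best_label if best_prob >= threshold else "Inconclusive / Mixed Sample"
    return status, best_prob, ranked[:top_k]


labels = ["Human", "Chimpanzee", "Dog", "Unknown"]
confident = gate([5.0, 1.0, 0.0, -1.0], labels)  # illustrative logits only
uncertain = gate([1.0, 1.0, 1.0, 1.0], labels)   # flat distribution, 25% each
print(confident[0], uncertain[0])
```

A sequence whose top probability reaches the threshold keeps its species label; anything below is reported as inconclusive, and the ranked list is still available for inspection either way.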
Biodiversity_AI/
├── config.json ← species map, hyperparameters, model name
├── model.py ← DNABERT2Classifier wrapper + predict()
├── app.py ← Streamlit web UI
├── predict.py ← CLI inference tool (sequence / FASTA / CSV)
├── fasta_to_csv.py ← FASTA → labeled CSV converter
├── test_meta.py ← quick model initialization test
├── requirements.txt ← pip dependencies
├── data.csv ← sample labeled data
├── data/ ← sample FASTA files
│ ├── human.fasta
│ ├── chimp.fasta
│ ├── dog.fasta
│ └── unknown.fasta
└── DNABERT-2-117M/ ← local model weights (from HuggingFace)
    ├── config.json
    ├── pytorch_model.bin
    ├── tokenizer.json
    ├── bert_layers.py
    ├── bert_padding.py
    ├── configuration_bert.py
    └── flash_attn_triton.py
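As an illustration of what a FASTA → labeled CSV conversion like `fasta_to_csv.py` involves, here is a minimal standard-library sketch. It is not the script itself: the function names and the `sequence`/`label` column headers are assumptions, since the actual CSV schema is not shown here.

```python
import csv
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())  # join wrapped lines, normalize case
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_rows(handle, label):
    """Attach one integer label to every record in the FASTA stream."""
    return [(seq, label) for _, seq in read_fasta(handle)]


def write_csv(rows, out):
    """Write rows with an assumed two-column header."""
    writer = csv.writer(out)
    writer.writerow(["sequence", "label"])
    writer.writerows(rows)


# In-memory demo; a real run would open a file such as data/human.fasta.
demo = io.StringIO(">rec1\nATCG\natcg\n>rec2\nGGCC\n")
rows = fasta_to_rows(demo, label=0)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

Each FASTA file holds one species, so a single label per file is enough; the label values would come from the species table in `config.json`.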
pip install -r requirements.txt
streamlit run app.py
Upload a .fasta, .csv, or .txt file to detect species.
# Single sequence
python predict.py --seq "ATCGATCGATCGATCGATCGATCGATCG..."
# FASTA file
python predict.py --fasta data/human.fasta
# CSV batch
python predict.py --csv data.csv --output results.csv
# Interactive mode
python predict.py

| Species | Label |
|---|---|
| Human | 0 |
| Chimpanzee | 1 |
| Dog | 2 |
| Unknown | 3 |
To add a new species, edit config.json → species_map and provide corresponding training data.
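The species-map edit can be sketched as follows. The layout of `config.json` is an assumption here: the sketch supposes `species_map` is a name → label dictionary matching the table above, which may not match the real file. `add_species` and the example species `"Cat"` are hypothetical.

```python
import json


def add_species(config, name):
    """Return a copy of config with `name` assigned the next free label."""
    species_map = dict(config.get("species_map", {}))
    if name in species_map:
        raise ValueError(f"{name!r} is already in species_map")
    species_map[name] = max(species_map.values(), default=-1) + 1
    return {**config, "species_map": species_map}


# Assumed shape of the relevant part of config.json.
config = json.loads(
    '{"species_map": {"Human": 0, "Chimpanzee": 1, "Dog": 2, "Unknown": 3}}'
)
updated = add_species(config, "Cat")
print(json.dumps(updated["species_map"], indent=2))
```

Because the final layer of the classification head is `Linear(256→N)`, adding a label changes N, which is why labeled training data for the new species is needed alongside the config change.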
────────────────────────────────────────────────────
Record : Human_mitochondria_CYB
Status : ✅ Human
Species: Human (97.43%)
┌─ Ranked Predictions ─────────────────────┐
│ 🥇 Human 97.43%
│ 🥈 Chimpanzee 2.11%
│ 🥉 Dog 0.29%
└───────────────────────────────────────────┘
────────────────────────────────────────────────────
- Zero-shot inference: uses the pre-trained DNABERT-2 backbone without fine-tuning
- Confidence threshold gate: sequences below 75% confidence are flagged as "Inconclusive / Mixed Sample"
- Local model: weights live in the DNABERT-2-117M/ directory for offline use
- Mac M1 compatible: automatically uses MPS acceleration when available
- Python 3.10+
- PyTorch ≥ 2.0.0
- transformers ≥ 4.40.0
- streamlit ≥ 1.28.0
- einops (required by DNABERT-2 MosaicBERT architecture)
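The automatic MPS selection mentioned in the feature list can be approximated like this. It is a sketch rather than the project's code: `pick_device` is a hypothetical helper, and the CUDA fallback and the MPS-first ordering are assumptions added for completeness.

```python
def pick_device():
    """Return "mps", "cuda", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed, so no accelerator to query
    # torch.backends.mps is present in recent PyTorch builds; guard for older ones.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon GPU
    if torch.cuda.is_available():
        return "cuda"  # assumption: CUDA fallback is not stated in this README
    return "cpu"


device = pick_device()
print(f"Running inference on: {device}")
```

The returned string can be passed to `model.to(device)` and used when moving tokenized inputs, so the same code path works on Apple Silicon and on CPU-only machines.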