Amanjha112113/eDNA-Biodiversity-AI-Explorer

🧬 eDNA Biodiversity AI Explorer

Zero-Shot Species Detection | DNABERT-2-117M | Streamlit + CLI

Upload eDNA sequences recovered from hair, blood, soil, or water samples and detect the source species with DNABERT-2-117M, a genomic foundation model pre-trained on the genomes of 135 species.


Architecture

Raw DNA String
     │
     ▼
DNABERT-2 Tokenizer  (BPE tokenization, internal to model)
     │
     ▼
DNABERT-2 Encoder  (117M params — MosaicBERT, 12 layers, 768-dim)
     │   pre-trained on 135 species genomes from NCBI
     ▼
[CLS] Token Embedding  (768-dim contextual sequence representation)
     │
     ▼
Classification Head  (LayerNorm → Dropout → Linear(768→256) → GELU → Linear(256→N))
     │
     ▼
Softmax → Confidence Threshold Gate
     │                │
     ▼                ▼
Top species      < 75% prob → "Inconclusive / Mixed Sample"
+ confidence
+ top-3 list
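
The last two stages of the diagram (softmax, then the confidence gate) can be sketched in plain Python. This is a minimal illustration, not the code in model.py: the names `softmax`, `gate`, and `CONFIDENCE_THRESHOLD` are hypothetical, and the logits are assumed to arrive as a plain list with one value per species label.

```python
import math

# Threshold taken from the diagram: below 75% the sample is flagged.
CONFIDENCE_THRESHOLD = 0.75


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate(logits, labels, threshold=CONFIDENCE_THRESHOLD, top_k=3):
    """Return the top label (or an inconclusive flag) plus a ranked top-k list.

    Hypothetical helper: the real predict() may return a different structure.
    """
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    best_label, best_prob = ranked[0]
    if best_prob >= threshold:
        status = best_label
    else:
        status = "Inconclusive / Mixed Sample"
    return {"status": status, "confidence": best_prob, "top": ranked[:top_k]}


if __name__ == "__main__":
    labels = ["Human", "Chimpanzee", "Dog", "Unknown"]
    print(gate([5.0, 1.0, 0.0, -1.0], labels))
```

Gating on the top probability alone keeps the rule easy to explain. The trade-off is that a sample split evenly between two species looks the same as one the model simply cannot place.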

File Structure

Biodiversity_AI/
├── config.json          ← species map, hyperparameters, model name
├── model.py             ← DNABERT2Classifier wrapper + predict()
├── app.py               ← Streamlit web UI
├── predict.py           ← CLI inference tool (sequence / FASTA / CSV)
├── fasta_to_csv.py      ← FASTA → labeled CSV converter
├── test_meta.py         ← quick model initialization test
├── requirements.txt     ← pip dependencies
├── data.csv             ← sample labeled data
├── data/                ← sample FASTA files
│   ├── human.fasta
│   ├── chimp.fasta
│   ├── dog.fasta
│   └── unknown.fasta
└── DNABERT-2-117M/      ← local model weights (from HuggingFace)
    ├── config.json
    ├── pytorch_model.bin
    ├── tokenizer.json
    ├── bert_layers.py
    ├── bert_padding.py
    ├── configuration_bert.py
    └── flash_attn_triton.py
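
For orientation, here is a rough sketch of what a FASTA → labeled CSV conversion like the one in fasta_to_csv.py might look like. It uses only the standard library. The function names, the `sequence`/`label` column names, and the choice to upper-case sequences are all assumptions, and the actual script may differ.

```python
import csv
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; join them into one string.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_csv(fasta_handle, csv_handle, label):
    """Write one CSV row per FASTA record, tagged with the given integer label."""
    writer = csv.writer(csv_handle)
    writer.writerow(["sequence", "label"])  # hypothetical column names
    count = 0
    for _header, sequence in read_fasta(fasta_handle):
        writer.writerow([sequence, label])
        count += 1
    return count


if __name__ == "__main__":
    demo = io.StringIO(">rec1\nATCG\nGGTA\n>rec2\nttaa\n")
    out = io.StringIO()
    fasta_to_csv(demo, out, label=0)  # 0 is the Human label in this README
    print(out.getvalue())
```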

Quick Start

1. Install dependencies

pip install -r requirements.txt

2. Run the Streamlit app

streamlit run app.py

Upload a .fasta, .csv, or .txt file to detect species.

3. CLI inference (alternative)

# Single sequence
python predict.py --seq "ATCGATCGATCGATCGATCGATCGATCG..."

# FASTA file
python predict.py --fasta data/human.fasta

# CSV batch
python predict.py --csv data.csv --output results.csv

# Interactive mode
python predict.py
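
The four invocations above map naturally onto an argparse layout like the sketch below. Only the flag names `--seq`, `--fasta`, `--csv`, and `--output` come from this README. The mutually exclusive grouping, the help strings, and the `select_mode` helper are assumptions about how predict.py could be organized, not a description of its actual code.

```python
import argparse


def build_parser():
    """Build a parser exposing the flags shown in the Quick Start commands."""
    parser = argparse.ArgumentParser(
        description="Species inference from DNA sequences"
    )
    # Assumption: only one input source is accepted per run.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--seq", help="a single raw DNA sequence")
    source.add_argument("--fasta", help="path to a FASTA file")
    source.add_argument("--csv", help="path to a CSV file for batch inference")
    parser.add_argument("--output", help="path for the batch results CSV")
    return parser


def select_mode(args):
    """Pick the inference mode; no input flag means interactive mode."""
    if args.seq:
        return "sequence"
    if args.fasta:
        return "fasta"
    if args.csv:
        return "csv"
    return "interactive"


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(select_mode(args))
```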

Supported Species

Species      Label
Human        0
Chimpanzee   1
Dog          2
Unknown      3

To add a new species, edit the species_map entry in config.json and provide corresponding training data.
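
As a sketch of how the species map might be read and checked, assume config.json stores `species_map` as a name → integer label object. That schema is a guess based on the table above, so check the real file before relying on it. The helper names are hypothetical.

```python
import json

# Hypothetical excerpt of config.json; the real file also holds
# hyperparameters and the model name.
CONFIG_TEXT = """
{
  "species_map": {"Human": 0, "Chimpanzee": 1, "Dog": 2, "Unknown": 3}
}
"""


def load_species_map(text):
    """Parse the species map and check that labels run 0..N-1 without gaps."""
    species_map = json.loads(text)["species_map"]
    labels = sorted(species_map.values())
    if labels != list(range(len(labels))):
        raise ValueError("labels must be unique and contiguous from 0")
    return species_map


def index_to_name(species_map):
    """Invert the map so a predicted class index can be shown as a name."""
    return {label: name for name, label in species_map.items()}


if __name__ == "__main__":
    names = index_to_name(load_species_map(CONFIG_TEXT))
    print(names[1])
```

Adding an entry changes N, the number of classes, so the final Linear(256→N) layer of the classification head has to be sized to match.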


Sample Output

────────────────────────────────────────────────────
  Record : Human_mitochondria_CYB
  Status : ✅  Human
  Species: Human  (97.43%)

  ┌─ Ranked Predictions ─────────────────────┐
  │  🥇  Human         97.43%
  │  🥈  Chimpanzee     2.11%
  │  🥉  Dog            0.29%
  └───────────────────────────────────────────┘
────────────────────────────────────────────────────

Key Design Decisions

  • Zero-shot inference: the pre-trained DNABERT-2 backbone is used as-is, with no fine-tuning
  • Confidence threshold gate: predictions whose top probability is below 75% are flagged as "Inconclusive / Mixed Sample"
  • Local model: weights are loaded from the DNABERT-2-117M/ directory, so inference works offline
  • Apple Silicon compatible (M1 and later): MPS acceleration is used automatically when available
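
Automatic device selection of the kind described in the last bullet is commonly written as below. This is a sketch, not the project's code: the helper name `pick_device` is hypothetical, and the CUDA check and the CPU fallback when PyTorch is missing are assumptions beyond what this README states.

```python
def pick_device():
    """Return the name of the best available torch device as a string."""
    try:
        import torch
    except ImportError:
        # Assumption: fall back to CPU so the sketch runs without PyTorch.
        return "cpu"
    if torch.cuda.is_available():  # assumption: prefer CUDA if present
        return "cuda"
    # torch.backends.mps is present in recent PyTorch releases;
    # getattr guards against older builds that lack it.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```

The returned string can be passed to `torch.device(...)` or to a model's `.to(...)` call.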

Requirements

  • Python 3.10+
  • PyTorch ≥ 2.0.0
  • transformers ≥ 4.40.0
  • streamlit ≥ 1.28.0
  • einops (required by DNABERT-2 MosaicBERT architecture)
