An Open Machine Learning Platform for Blood–Brain Barrier Permeability Prediction with Neurodegenerative Disease Applications. This project serves as a resource for advancing research in central nervous system (CNS) therapeutics and supporting the development of novel treatment strategies.
🚀 Live Demo • 📖 Documentation • 💬 Discussions • 🐛 Report Bug
- Overview
- Key Features
- Scientific Background
- Installation
- Quick Start
- Workflow
- Models & Performance
- Platform Features
- Case Study: mTOR Inhibitors
- Results & Benchmarking
- API Reference
- Troubleshooting
- Contributing
- Citation
- Acknowledgments
- License
BrainRoute is an open-source, AI-powered computational platform designed to predict blood-brain barrier (BBB) permeability of small molecules, addressing one of the most critical challenges in central nervous system (CNS) drug discovery.
The blood-brain barrier acts as a highly selective physiological interface that restricts ~98% of small molecules and nearly all large molecules from entering the brain. This creates a significant bottleneck in developing therapeutics for neurodegenerative diseases, CNS infections, brain tumors, and other neurological conditions.
- 🎯 High Accuracy: KNN model achieves 92% F1-score, outperforming many deep learning approaches
- 🔬 Uncertainty Quantification: Ensemble-based confidence scoring for reliable predictions
- 🤖 AI-Augmented Insights: Integrated Llama 3 LLM for contextual molecular information
- 📊 Batch Processing: Analyze hundreds of compounds simultaneously
- 🌐 User-Friendly Interface: No coding required - accessible via web browser
- 📖 Open Science: Fully reproducible with curated datasets and transparent methods
- 💊 Clinically Relevant: Applied to real-world case studies (mTOR inhibitors, Alzheimer's disease)
- Multiple ML Algorithms: KNN, XGBoost, SVM, Random Forest, Logistic Regression
- Deep Learning: BERT-based SMILES encoder with transfer learning
- Ensemble Predictions: Combines multiple models for robust outputs
- Uncertainty Estimation: Model agreement metrics and confidence intervals
- Single Molecule Analysis: Input by compound name or SMILES string
- Batch Processing: Upload CSV files with multiple compounds
- Real-time Predictions: Fast inference with pre-trained models
- Molecular Visualization: 2D structure rendering with RDKit
- Property Calculation: Automatic computation of MW, LogP, TPSA, HBA/HBD
- LLM Integration: Chat with Llama 3 about your molecules
- Contextual Insights: Get mechanism of action, drug potential, side effects
- Literature Integration: Links to PubMed, ChEMBL, and PubChem
- Export Capabilities: Download predictions and chat histories
- React-based Interface: Interactive molecular database browser
- 9,857+ Molecules: Annotated with BBB permeability data
- Real-time Structure Rendering: Client-side RDKit.js integration
- Search & Filter: Find molecules by properties or predictions
The blood-brain barrier (BBB) is formed by specialized brain microvascular endothelial cells connected by tight junctions, supported by pericytes and astrocytes. It serves as a critical neuroprotective mechanism but simultaneously represents the most significant obstacle in CNS therapeutics development.
Key Statistics:
- ~98% of small molecule drugs cannot cross the BBB
- Nearly 100% of large molecule biologics are excluded
- BBB penetration failure is a leading cause of late-stage drug development attrition
- Estimated cost: $2-3 billion per failed CNS drug candidate
Traditional BBB prediction methods rely on:
- Physicochemical heuristics (Lipinski's rules, polar surface area)
- In vitro assays (PAMPA-BBB, MDCK-MDR1)
- In vivo methods (brain/plasma ratio, microdialysis)
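The first item, physicochemical heuristics, can be illustrated with a small Rule-of-Five check. This is a minimal sketch, not part of the BrainRoute codebase: `MolProps` and `lipinski_violations` are hypothetical names, the properties are assumed to be precomputed (for example with RDKit), and the input values are the ones this README reports for Donepezil in its example output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MolProps:
    """Precomputed properties (hypothetical container, not a BrainRoute class)."""
    mw: float    # molecular weight, g/mol
    logp: float  # octanol-water partition coefficient
    hbd: int     # hydrogen-bond donors
    hba: int     # hydrogen-bond acceptors


def lipinski_violations(p: MolProps) -> List[str]:
    """Return the Rule-of-Five criteria that the molecule fails."""
    checks = {
        "MW <= 500": p.mw <= 500,
        "LogP <= 5": p.logp <= 5,
        "HBD <= 5": p.hbd <= 5,
        "HBA <= 10": p.hba <= 10,
    }
    return [name for name, ok in checks.items() if not ok]


# Values taken from the Donepezil example output in this README.
donepezil = MolProps(mw=379.5, logp=4.32, hbd=0, hba=3)
violations = lipinski_violations(donepezil)

# The rule is commonly applied with at most one violation allowed.
print("PASS" if len(violations) <= 1 else "FAIL", violations)
```

Heuristics like this are fast but coarse, which is why they are listed here as a baseline rather than as a predictor.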
Limitations:
- Resource-intensive and time-consuming
- Low throughput for early-stage screening
- High inter-assay variability
- Limited predictive accuracy for novel scaffolds
- Expensive (~$10,000-50,000 per compound tested)
BrainRoute addresses these limitations by:
- Leveraging heterogeneous datasets (B3DB, MoleculeNet) for robust training
- Implementing uncertainty-aware predictions with ensemble methods
- Providing interpretable features and molecular property analysis
- Offering open-source, reproducible workflows
- Integrating LLM for knowledge augmentation and mechanism exploration
Clinical Relevance: Applied to neurodegenerative diseases (Alzheimer's, Parkinson's), neuro-oncology, infectious diseases (meningitis, encephalitis), and toxicology assessment.
- Python 3.8+ (3.12 recommended)
- pip or conda package manager
- 8GB RAM minimum (16GB recommended for batch processing)
- Modern web browser (Chrome, Firefox, Safari, Edge)
# Clone the repository
git clone https://github.com/omicscodeathon/brainroute.git
cd brainroute
# Create virtual environment (recommended)
python -m venv brainroute_env
source brainroute_env/bin/activate # On Windows: brainroute_env\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Clone and navigate
git clone https://github.com/omicscodeathon/brainroute.git
cd brainroute
# Create conda environment (recommended for M-series chips)
conda create -n brainroute python=3.12
conda activate brainroute
# Install dependencies
pip install -r requirements_macos.txt
# Pull the Docker image
docker pull ghcr.io/omicscodeathon/brainroute:latest
# Run the container
docker run -p 8501:8501 ghcr.io/omicscodeathon/brainroute:latest
# Test imports
python -c "import rdkit; import streamlit; import sklearn; print('✓ Installation successful!')"
# Run tests
pytest tests/
# Navigate to project directory
cd brainroute
# Start Streamlit app
streamlit run scripts/webapp/main.py
The application will open automatically in your browser at http://localhost:8501
from rdkit import Chem
from scripts.webapp.prediction import predict_bbb_penetration_with_uncertainty
from scripts.webapp.utils import load_ml_models
# Load models
models, _ = load_ml_models()
# Predict BBB permeability
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O" # Aspirin
mol = Chem.MolFromSmiles(smiles)
result, error = predict_bbb_penetration_with_uncertainty(mol, models)
print(f"Prediction: {result['prediction']}")
print(f"Confidence: {result['confidence']:.2f}%")
print(f"Uncertainty: {result['uncertainty']:.2f}%")
print(f"Model Agreement: {result['agreement']:.2f}%")
import pandas as pd
# Prepare data
molecules = pd.DataFrame({
    'name': ['Aspirin', 'Caffeine', 'Donepezil'],
    'smiles': ['CC(=O)OC1=CC=CC=C1C(=O)O',
               'CN1C=NC2=C1C(=O)N(C(=O)N2C)C',
               'COC1=C(C=C2C(=C1)CC(C2=O)CC3CCN(CC3)CC4=CC=CC=C4)OC']
})
# Save to CSV
molecules.to_csv('molecules_to_predict.csv', index=False)
# Upload via web interface or use batch API
from scripts.webapp.prediction import process_batch_molecules
results, error = process_batch_molecules(molecules, 'csv', models)
Data Sources:
- B3DB - 7,807 molecules with experimental BBB data
- MoleculeNet BBBP - 2,059 binary classifications
- Additional curated datasets from literature
Initial Dataset:
- Total molecules: 9,857
- BBB+ (permeable): 6,523 (66.2%)
- BBB- (non-permeable): 3,334 (33.8%)
- Representation: SMILES strings
- Molecular descriptors were calculated with RDKit version 2025.03.6 using the scripts/RDKIT_descriptors.py script.
# Extract SMILES from source files
python scripts/extract_smiles.py
# Calculate RDKit descriptors (217 features)
python scripts/RDKIT_descriptors.py
Descriptors Computed (RDKit - 217 features):
- Physicochemical: MW, LogP, MR, TPSA
- Topological: Balaban J, Bertz CT, Chi indices
- Electronic: Partial charges, atom type counts
- Structural: Ring counts, H-bond donors/acceptors
- Lipinski's descriptors: Rule of 5 compliance
- Graph-based: Molecular connectivity indices
Quality Control:
- Duplicate SMILES were removed
- Entries with completely missing descriptors were dropped
# Remove duplicates
df = df.drop_duplicates(subset=['smiles'])
# Handle missing values
df = df.dropna(thresh=len(df.columns) * 0.5) # Drop if >50% missing
df = df.fillna(0) # Fill remaining with 0
Normalization:
- Features were standardized to zero mean and unit standard deviation with StandardScaler() to help the models converge quickly.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# μ = 0, σ = 1 for all features
Class Balancing:
- The BBB+ and BBB- classes were balanced by generating synthetic samples for the minority class (BBB-) with SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42, k_neighbors=5)
X_resampled, y_resampled = smote.fit_resample(X_scaled, y)
Final Dataset:
- Total samples: 12,716 (after SMOTE)
- BBB+: 6,358 (50%)
- BBB-: 6,358 (50%)
- Features: 217 (RDKit descriptors)
- Train/Test split: 80/20 (10,172 / 2,544)
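The class balancing step and the dataset sizes above can be sketched as follows. This is an illustration of the core SMOTE idea (a synthetic sample placed on the line segment between a minority sample and one of its same-class neighbors), not the imbalanced-learn implementation; `smote_interpolate` is a hypothetical helper and the two toy points are made up. The size arithmetic uses only the counts reported above.

```python
import math
import random


def smote_interpolate(x, neighbor, rng):
    """Create one synthetic point on the segment between x and a same-class neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(42)
synthetic = smote_interpolate([0.0, 1.0], [2.0, 3.0], rng)  # toy 2-D points

# Sizes reported above: both classes hold 6,358 samples after oversampling.
per_class = 6358
total = 2 * per_class
n_test = math.ceil(0.2 * total)  # scikit-learn rounds the test share up
n_train = total - n_test
print(total, n_train, n_test)  # 12716 10172 2544
```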
Model Selection Strategy:
- Initial Screening: LazyPredict for rapid benchmarking
- Hyperparameter Tuning: GridSearchCV with 5-fold CV
- Final Evaluation: Stratified test set
Models Implemented:
| Model | Algorithm | Key Parameters | Training Time |
|---|---|---|---|
| KNN | K-Nearest Neighbors | k=3, weights='distance' | ~5 min |
| XGBoost | Gradient Boosting | n_estimators=100, max_depth=6 | ~15 min |
| SVM | Support Vector Machine | kernel='rbf', C=1.0, gamma='scale' | ~30 min |
| Random Forest | Ensemble (Bagging) | n_estimators=100, max_depth=8 | ~20 min |
| Logistic Reg | Linear Classifier | max_iter=1000, solver='lbfgs' | ~2 min |
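To make the KNN row concrete, the sketch below shows what `k=3, weights='distance'` means: each of the three nearest neighbors votes with weight 1/distance, so a single very close neighbor can outweigh two distant ones. It is a pure-Python illustration on made-up 2-D points, not the scikit-learn implementation; `knn_predict` is a hypothetical name.

```python
import math
from collections import defaultdict


def knn_predict(train_X, train_y, query, k=3):
    """Distance-weighted k-NN vote: each neighbor contributes 1/distance."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = defaultdict(float)
    for d, label in dists[:k]:
        if d == 0.0:
            return label  # an exact match dominates the vote
        votes[label] += 1.0 / d
    return max(votes, key=votes.get)


# Toy data: one BBB+ point near the query, two BBB- points farther away.
train_X = [[0.0, 0.0], [3.0, 0.0], [3.1, 0.0]]
train_y = ["BBB+", "BBB-", "BBB-"]

prediction = knn_predict(train_X, train_y, [0.5, 0.0])
print(prediction)
```

With uniform weights the two BBB- neighbors would win the majority vote; with distance weighting the nearby BBB+ neighbor decides the call.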
Training Code:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
import joblib
# Initialize model
knn = KNeighborsClassifier(n_neighbors=3, weights='distance')
# Cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(knn, X_train, y_train, cv=skf, scoring='f1')
# Train final model
knn.fit(X_train, y_train)
# Save model
joblib.dump(knn, 'output/models/KNN_model.pkl')
Architecture: SPMM (Structure-Property Multi-Modal) Model
- Base: BERT (Bidirectional Encoder Representations from Transformers)
- Input: SMILES tokenization
- Pre-training: 10M+ molecules from PubChem
- Fine-tuning: BBB-specific dataset
Model Architecture:
Input (SMILES) → Tokenizer → BERT Encoder → [CLS] Token
↓
Linear(768 → 256) + GELU
↓
Dropout(0.1)
↓
Linear(256 → 2)
↓
Softmax → [BBB+, BBB-]
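The last step of the diagram, turning the two output logits into class probabilities, can be sketched in plain Python. The logits here are made-up numbers standing in for the output of the final linear layer; the label order follows the diagram.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["BBB+", "BBB-"]
probs = softmax([2.0, 0.5])  # hypothetical logits
prediction = labels[probs.index(max(probs))]
print(prediction, [round(p, 3) for p in probs])
```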
Training Configuration:
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.nn import CrossEntropyLoss
optimizer = AdamW(model.parameters(), lr=3e-5, weight_decay=0.02)
scheduler = CosineAnnealingLR(optimizer, T_max=10)
loss_fn = CrossEntropyLoss()
# Training hyperparameters
batch_size_train = 16
batch_size_eval = 64
num_epochs = 10
warmup_epochs = 1
Technology Stack:
- Frontend: Streamlit + Custom CSS
- Backend: Python 3.12
- ML Framework: scikit-learn, XGBoost, PyTorch
- Cheminformatics: RDKit 2025.03.6
- LLM: Llama 3 8B (via Hugging Face Inference API)
- Database: React + Google Sheets API + RDKit.js
- Deployment: Hugging Face Spaces
API Integrations:
- ChEMBL API: Drug information, bioactivity data
- PubChem PUG REST API: Chemical properties, synonyms
- Hugging Face Router: LLM inference
- Google Sheets API: Collaborative database backend
Cross-Validation (5-fold Stratified):
External Test Set (Held-out 20%):
Top 2 performing models (KNN and XGBoost)

| Study | Year | Best Model | Accuracy/F1 | Dataset Size |
|---|---|---|---|---|
| BrainRoute (Ours) | 2025 | XGB | Acc: 0.93, F1: 0.93, AUC: 0.98 | 12,716 |
| BrainRoute (Ours) | 2025 | KNN | Acc: 0.92, F1: 0.92, AUC: 0.96 | 12,716 |
| Wang et al. | 2018 | SVM Consensus | Acc: 96.6% | 2,358 |
| Liu et al. | 2021 | Ensemble | Acc: 93.0% | 1,757 |
| Lim et al. | 2023 | GCNN | Acc: 88.0% | 8,000 |
| Shaker et al. (LightBBB) | 2021 | LightGBM | Acc: 89.0% | 7,162 |
| Atallah et al. | 2024 | Voting Ensemble | AUC: 96.0% | 7,807 |
Key Insights:
- Classical ML (KNN, XGBoost) achieves competitive or superior performance compared to deep learning approaches in terms of AUC and F1 scores
- Ensemble methods provide robust uncertainty quantification
- Data quality and preprocessing are more critical than model complexity for this task
- BrainRoute's uncertainty-aware predictions enable risk-stratified decision making
Input Methods:
- Compound Name: e.g., "Donepezil", "Caffeine", "Aspirin"
- SMILES String: e.g.,
CC(=O)OC1=CC=CC=C1C(=O)O
Outputs:
- BBB Prediction: BBB+ or BBB- with confidence score
- Uncertainty Metrics: Standard deviation across models
- Model Agreement: Percentage of models in consensus
- Molecular Properties: MW, LogP, TPSA, HBA, HBD, rotatable bonds
- 2D Structure: Interactive molecular visualization
- ChEMBL Data: Bioactivity, mechanism of action, clinical phase
- Individual Model Predictions: See how each model voted
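One way the confidence, uncertainty, and agreement outputs listed above could be derived from per-model probabilities is sketched below. The formulas are assumptions made for illustration (mean probability, population standard deviation, share of models voting with the consensus) and may differ from what `predict_bbb_penetration_with_uncertainty` actually computes; `summarize_ensemble` is a hypothetical name.

```python
from statistics import mean, pstdev


def summarize_ensemble(probs_bbb_plus, threshold=0.5):
    """Aggregate per-model P(BBB+) values into one call plus agreement metrics."""
    avg = mean(probs_bbb_plus)
    label = "BBB+" if avg >= threshold else "BBB-"
    votes = ["BBB+" if p >= threshold else "BBB-" for p in probs_bbb_plus]
    return {
        "prediction": label,
        # Confidence is the mean probability of the predicted class, in percent.
        "confidence": 100 * (avg if label == "BBB+" else 1 - avg),
        # Uncertainty is the spread of the model outputs, in percent.
        "uncertainty": 100 * pstdev(probs_bbb_plus),
        # Agreement is the share of models that vote for the consensus label.
        "agreement": 100 * votes.count(label) / len(votes),
    }


# Hypothetical P(BBB+) values from five models.
result = summarize_ensemble([0.91, 0.85, 0.78, 0.88, 0.62])
print(result)
```

The dictionary keys mirror the `result` fields used in the Python examples elsewhere in this README.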
Example Results:
Compound: Donepezil
SMILES: COC1=C(C=C2C(=C1)CC(C2=O)CC3CCN(CC3)CC4=CC=CC=C4)OC
🎯 Prediction: BBB+ (Permeable)
📊 Confidence: 87.3%
⚠️ Uncertainty: 8.2% (Low)
🤝 Model Agreement: 100%
Molecular Properties:
- Molecular Weight: 379.5 g/mol
- LogP: 4.32
- TPSA: 38.8 Ų
- H-Bond Donors: 0
- H-Bond Acceptors: 3
- Rotatable Bonds: 5
✓ Lipinski's Rule of 5: PASS
✓ BBB Permeability Rules: PASS
Supported Formats:
- CSV Upload: Must contain smiles and/or name columns
- Text Input: One molecule per line (names or SMILES)
Features:
- Process hundreds of molecules simultaneously
- Summary Statistics: BBB+ rate, average confidence, success rate
- Interactive Visualizations:
- Prediction distribution (pie chart)
- Confidence vs. Uncertainty scatter plot
- Molecular property distributions
- Export Options: CSV, Excel, JSON formats
- Filtering: By status, prediction, confidence threshold
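Before uploading, a CSV can be checked against the format requirement above (a `smiles` and/or `name` column). The following is a minimal sketch using only the standard library; `read_batch_csv` is a hypothetical helper, not a BrainRoute function, and it does not validate the SMILES strings themselves.

```python
import csv
import io


def read_batch_csv(text):
    """Parse CSV text and return the rows that have a SMILES string or a name."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {c.strip().lower() for c in (reader.fieldnames or [])}
    if not columns & {"smiles", "name"}:
        raise ValueError("CSV needs a 'smiles' and/or 'name' column")
    rows = []
    for row in reader:
        # Normalize header case and strip whitespace; skip overflow fields (key None).
        clean = {k.strip().lower(): (v or "").strip() for k, v in row.items() if k}
        if clean.get("smiles") or clean.get("name"):
            rows.append(clean)
    return rows


sample = "name,smiles\nAspirin,CC(=O)OC1=CC=CC=C1C(=O)O\nCaffeine,\n,\n"
rows = read_batch_csv(sample)
print(len(rows), "usable rows")
```

Rows with neither a name nor a SMILES string are dropped, so the row count passed on for prediction matches what can actually be processed.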
Example Batch Results:
📊 Batch Summary:
- Total Molecules: 150
- Successful Predictions: 147 (98%)
- BBB+ Predictions: 54 (37%)
- BBB- Predictions: 93 (63%)
- Average Confidence: 84.2%
- Average Uncertainty: 11.5%
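The percentages in this example are consistent with rates computed over the 147 successful predictions (54/147 ≈ 37%, 93/147 ≈ 63%). A summary of that kind could be computed as sketched below; `batch_summary` is a hypothetical helper, and marking failed molecules with `prediction=None` is an assumption about the result format.

```python
def batch_summary(results):
    """Summarise a list of result dicts; failed entries carry prediction=None."""
    ok = [r for r in results if r.get("prediction") in ("BBB+", "BBB-")]
    n_pos = sum(r["prediction"] == "BBB+" for r in ok)
    return {
        "total": len(results),
        "successful": len(ok),
        "success_rate": 100 * len(ok) / len(results) if results else 0.0,
        # Class rates are taken over successful predictions only.
        "bbb_plus_rate": 100 * n_pos / len(ok) if ok else 0.0,
        "avg_confidence": sum(r["confidence"] for r in ok) / len(ok) if ok else 0.0,
    }


# Hypothetical results for four molecules, one of which failed to parse.
demo = [
    {"prediction": "BBB+", "confidence": 90.0},
    {"prediction": "BBB-", "confidence": 80.0},
    {"prediction": "BBB-", "confidence": 70.0},
    {"prediction": None, "confidence": None},
]
summary = batch_summary(demo)
print(summary)
```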
AI Chat Interface (Llama3-8B-Instruct Integration)
Capabilities:
- Contextual Q&A: Ask about your molecule's properties
- Drug Discovery Insights: CNS potential, mechanism, side effects
- Literature Guidance: Research directions and references
- Comparison: How does it compare to similar compounds?
Example Conversations:
User: What makes Donepezil effective for Alzheimer's?
🦙 Llama 3: Donepezil is an acetylcholinesterase inhibitor with
excellent BBB penetration (predicted BBB+ with 87% confidence).
Its moderate molecular weight (379.5 g/mol) and optimal LogP (4.32)
enable effective CNS entry. It enhances cholinergic neurotransmission
by preventing acetylcholine breakdown in synapses, which helps
compensate for cholinergic deficits in AD.
Quick Questions:
- 💊 Drug Potential
- 🧪 Key Properties
- ⚠️ Side Effects
- 🔬 Current Research
Access: Separate React-based web interface
Features:
- 9,857+ Annotated Molecules: All with BBB predictions
- Real-time Structure Rendering: Client-side RDKit.js
- Advanced Search: By name, SMILES, properties
- Property Filters: MW range, LogP, TPSA thresholds
- Prediction Filters: BBB+/-, confidence threshold
- Export: Selected molecules to CSV
- Collaborative: Google Sheets backend for easy updates
mTOR (Mechanistic Target of Rapamycin):
- Serine/threonine protein kinase
- Master regulator of cell growth, metabolism, autophagy
- Two complexes: mTORC1 (growth) and mTORC2 (survival)
Relevance to Alzheimer's Disease:
- Hyperactivation linked to tau hyperphosphorylation
- Contributes to amyloid-β plaque formation
- Inhibits autophagy → toxic protein accumulation
- Rapamycin shows neuroprotective effects in AD models
Clinical Challenge:
- Most mTOR inhibitors designed for cancer/immunosuppression
- BBB penetration rarely optimized
- Research Question: Which approved mTOR inhibitors can cross the BBB?
Dataset:
- 25 FDA-approved mTOR inhibitors analyzed
- Includes: Rapamycin (Sirolimus), Everolimus, Temsirolimus, etc.
- Data: SMILES, ChEMBL IDs, approved indications
Methodology:
# Load mTOR inhibitors dataset
import pandas as pd
mtor_data = pd.read_csv('data/case_study/mtor_inhibitors.csv')
# Batch prediction using BrainRoute
from scripts.webapp.prediction import process_batch_molecules
results, error = process_batch_molecules(mtor_data, 'csv', models)
# Analyze results
bbb_positive = [r for r in results if r['prediction'] == 'BBB+']
print(f"BBB+ inhibitors: {len(bbb_positive)} / {len(results)}")
| Category | Count | Percentage |
|---|---|---|
| Total Analyzed | 25 | 100% |
| BBB+ (Permeable) | 9 | 36% |
| BBB- (Non-permeable) | 16 | 64% |
BBB+ Predicted Compounds (Examples):
- Rapamycin - Confidence: 78% (literature-confirmed)
- Temsirolimus - Confidence: 72%
- Compound X - Confidence: 81%
BBB- Predicted Compounds (Examples):
- Everolimus - Confidence: 89%
- Ridaforolimus - Confidence: 85%
Property Analysis:
BBB+ Compounds:
- Avg MW: 892 ± 156 g/mol
- Avg LogP: 5.2 ± 1.1
- Avg TPSA: 168 ± 34 Ų
BBB- Compounds:
- Avg MW: 1,124 ± 203 g/mol
- Avg LogP: 4.1 ± 0.9
- Avg TPSA: 245 ± 52 Ų
- Minority BBB Penetration: Only 36% predicted to cross BBB
- MW Threshold: BBB+ compounds generally <1,000 Da
- TPSA Correlation: BBB+ compounds have lower TPSA (<200 Ų)
- Clinical Implications:
  - Most existing mTOR inhibitors unsuitable for CNS disorders
  - Need for structure optimization or alternative delivery
  - Rapamycin's BBB+ prediction aligns with clinical AD trials
- Future Directions:
  - Design CNS-optimized mTOR inhibitors
  - Explore drug delivery strategies (nanoparticles, intranasal)
  - Validate predictions with experimental BBB assays
Notebook: Full analysis available in Case study notebook
Best Performers:
- Accuracy: XGBoost (93%), KNN (92%)
- F1-Score: XGBoost (0.93), KNN (0.92)
- ROC-AUC: XGBoost (0.98), KNN (0.96)
- Robustness: Ensemble (lowest variance across folds)
Key Takeaways:
- Classical ML outperforms deep learning for this dataset size
- Ensemble methods provide best uncertainty quantification
- KNN's strong performance highlights the importance of molecular similarity
- Deep learning may improve with 10x larger datasets (>100,000 molecules)
| Model | Training Time | Inference (1 mol) | Inference (1000 mols) |
|---|---|---|---|
| KNN | 5 min | <0.1s | ~30s |
| XGBoost | 15 min | <0.1s | ~45s |
| SVM | 30 min | <0.1s | ~60s |
| BERT | 2 hours | 0.5s | ~8 min |
Hardware: Intel Core i7 CPU @ 1.5 GHz, 16GB RAM. Note: BERT requires a GPU to run.
from scripts.webapp.prediction import predict_bbb_penetration_with_uncertainty
from scripts.webapp.utils import load_ml_models
from rdkit import Chem
# Load models once
models, errors = load_ml_models()
# Single prediction
mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")
result, error = predict_bbb_penetration_with_uncertainty(mol, models)
if not error:
    print(f"Prediction: {result['prediction']}")
    print(f"Confidence: {result['confidence']:.2f}%")
    print(f"Uncertainty: {result['uncertainty']:.2f}%")
    print(f"Agreement: {result['agreement']:.2f}%")
# Batch prediction
from scripts.webapp.lewis.prediction import process_batch_molecules
import pandas as pd
batch_data = pd.DataFrame({
    'name': ['Aspirin', 'Caffeine'],
    'smiles': ['CC(=O)OC1=CC=CC=C1C(=O)O', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C']
})
results, error = process_batch_molecules(batch_data, 'csv', models)
for result in results:
    print(f"{result['name']}: {result['prediction']} ({result['confidence']:.1f}%)")
We plan to develop a REST API for programmatic access:
# Predict single molecule
curl -X POST https://api.brainroute.io/predict \
-H "Content-Type: application/json" \
-d '{"smiles": "CC(=O)OC1=CC=CC=C1C(=O)O"}'
# Response
{
"prediction": "BBB-",
"confidence": 85.2,
"uncertainty": 12.3,
"agreement": 80.0,
"properties": {
"mw": 180.16,
"logp": 1.19,
"tpsa": 63.6
}
}
Problem: FileNotFoundError: [Errno 2] No such file or directory: 'output/models/KNN_model.pkl'
Solution:
# Download pre-trained models
wget https://github.com/omicscodeathon/brainroute/releases/download/v1.0/models.zip
unzip models.zip -d output/
# Or train models from scratch
jupyter nbconvert --to notebook --execute notebooks/model_training.ipynb
Problem: ModuleNotFoundError: No module named 'rdkit'
Solution:
# Conda installation (recommended)
conda install -c conda-forge rdkit
# Pip installation
pip install rdkit
# Mac M1/M2/M3 specific
conda install -c conda-forge rdkit python=3.12
Problem: Address already in use
Solution:
# Use different port
streamlit run scripts/webapp/lewis/main.py --server.port 8502
# Or kill existing process
lsof -ti:8501 | xargs kill -9 # Mac/Linux
netstat -ano | findstr :8501 # Windows (find PID and kill)
Problem: Hugging Face API token not found
Solution:
# Set environment variable
export HUGGINGFACE_API_TOKEN="your_token_here"
# Or add to .streamlit/secrets.toml
mkdir -p .streamlit
echo 'HF_TOKEN = "your_token_here"' > .streamlit/secrets.toml
# Get free token from: https://huggingface.co/settings/tokens
Problem: MemoryError when processing large batches
Solution:
# Process in smaller chunks and collect the results
chunk_size = 100
all_results = []
for i in range(0, len(molecules), chunk_size):
    chunk = molecules.iloc[i:i + chunk_size]
    results, error = process_batch_molecules(chunk, 'csv', models)
    if not error:
        all_results.extend(results)
Problem: Could not process molecule
Solution:
from rdkit import Chem
# Validate SMILES before prediction
smiles = "invalid_smiles"
mol = Chem.MolFromSmiles(smiles)
if mol is None:
    print("Invalid SMILES string")
    # Try sanitization
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol:
        Chem.SanitizeMol(mol)
# Cache model loading
import streamlit as st
@st.cache_resource
def load_models():
    return load_ml_models()
# Use batch processing for multiple molecules
# ~10x faster than individual predictions
# Enable GPU for deep learning (if available)
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
- 📖 Documentation
- 💬 GitHub Discussions
- 🐛 Report a Bug
- ✨ Feature Request
- 📧 Email: sohamshirolkar24@gmail.com, leahcerere@gmail.com, lewistem@gmail.com, nemase00@gmail.com
We welcome contributions from the community! BrainRoute is an open-science project that thrives on collaboration.
- Fork the repository
  git clone https://github.com/yourusername/brainroute.git
  cd brainroute
  git checkout -b feature/your-feature-name
- Make your changes
  - Add new features or fix bugs
  - Update documentation
  - Add tests for new functionality
- Run tests
  pytest tests/
  python -m pylint scripts/
- Submit a Pull Request
  - Clear description of changes
  - Reference related issues
  - Include screenshots for UI changes
- 🧪 Add new models: Implement additional ML algorithms
- 📊 Improve visualizations: Enhance plots and charts
- 🗄️ Expand database: Curate additional BBB datasets
- 📝 Documentation: Improve tutorials and examples
- 🐛 Bug fixes: Report and fix issues
- 🌍 Translations: Internationalization support
- 🧬 Case studies: Apply to new disease areas
# Use Black formatter
black scripts/
# Follow PEP 8
pylint scripts/
# Add docstrings
def predict_bbb_penetration(mol, models):
    """
    Predict BBB permeability of a molecule.

    Args:
        mol (rdkit.Chem.Mol): RDKit molecule object
        models (dict): Dictionary of trained models

    Returns:
        dict: Prediction results with confidence scores
    """
    pass
Contributors will be:
- Listed in CONTRIBUTORS.md
- Acknowledged in publications
- Invited to co-author future papers (for significant contributions)
If you use BrainRoute in your research, please cite:
@article{shirolkar2026brainroute,
title={BrainRoute: An Open Machine Learning Platform for Blood-Brain Barrier Permeability Prediction with Neurodegenerative Disease Applications},
author={Shirolkar, Soham and Cerere, Leah W. and Tem, Lewis and Ahmed, Noura E. and Some, Georges and Awe, Olaitan I.},
journal={Springernature},
year={2026},
doi={10.1101/2026.xxx},
url={https://github.com/omicscodeathon/brainroute}
}
APA Format: Shirolkar, S., Cerere, L. W., Tem, L., Ahmed, N. E., Some, G., & Awe, O. I. (2026). BrainRoute: An Open Machine Learning Platform for Blood-Brain Barrier Permeability Prediction with Neurodegenerative Disease Applications. Springernature. https://doi.org/10.1101/2026.xxx
Vancouver Format: Shirolkar S, Cerere LW, Tem L, Ahmed NE, Some G, Awe OI. BrainRoute: An Open Machine Learning Platform for Blood-Brain Barrier Permeability Prediction with Neurodegenerative Disease Applications. Springernature. 2026. doi:10.1101/2026.xxx
This work was supported by:
- National Institutes of Health (NIH) - Office of Data Science Strategy (ODSS)
- Institute for Genomic Medicine Research - West Hartford, CT, USA
- African Society for Bioinformatics and Computational Biology (ASBCB)
- Omics Codeathon - October 2025
We gratefully acknowledge:
- B3DB - Curated BBB permeability database
- MoleculeNet - Benchmark datasets for molecular ML
- ChEMBL - European Bioinformatics Institute (EBI)
- PubChem - National Center for Biotechnology Information (NCBI)
- Therapeutics Data Commons - Harvard Medical School
BrainRoute builds upon:
- RDKit - Cheminformatics toolkit
- scikit-learn - Machine learning library
- Streamlit - Web application framework
- PyTorch - Deep learning framework
- Hugging Face - LLM infrastructure and model hosting
- Plotly - Interactive visualizations
- Soham Shirolkar: Project Lead, Lead Developer
- Leah W. Cerere: Visualization & Documentation
- Lewis Tem: Lead Developer
- Noura E. Ahmed: Visualization & Documentation
- Olaitan I. Awe: Project Supervision
- Omics Codeathon Organizers - For providing the platform and resources
- Peer Reviewers - For valuable feedback and suggestions
- Open-Source Community - For tools and inspiration
- Beta Testers - For helping refine the platform
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) 2025 BrainRoute Team
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
BrainRoute is committed to:
- ✅ Open-source code (GitHub)
- ✅ Open data (curated datasets publicly available)
- ✅ Open access publications (preprints on bioRxiv)
- ✅ Reproducible workflows (documented pipelines)
- ✅ Community contributions (welcoming pull requests)
- Core prediction models (KNN, XGBoost, SVM, RF, LR)
- Streamlit web interface
- Llama 3 LLM integration
- Batch processing capability
- ChEMBL/PubChem API integration
- Uncertainty quantification
- mTOR case study
- REST API for programmatic access
- Docker containerization
- Expanded descriptor sets (Mordred, PaDEL)
- Model explainability (SHAP values)
- Additional case studies (Parkinson's, brain tumors)
- User authentication system
- Molecule sketcher integration
- Graph Neural Networks (GNN) models
- Multi-task learning (BBB + toxicity + bioavailability)
- Active learning for data-efficient training
- P-glycoprotein efflux prediction
- BBB dysfunction modeling (disease states)
- Integration with molecular docking tools
- Mobile application (iOS/Android)
- Federated learning for privacy-preserving data sharing
- Generative models for BBB-permeable molecule design
- Clinical trial integration
- Regulatory approval pathway documentation
- Partnerships with pharmaceutical companies
- Educational modules for drug discovery courses
Soham Shirolkar
- 📧 Email: sohamshirolkar24@gmail.com
- 🔗 ORCID: 0009-0004-4798-899X
- 🏛️ Affiliation: University of South Florida
Olaitan I. Awe
- 📧 Email: laitanawe@gmail.com
- 🔗 ORCID: 0000-0002-4257-3611
- 🏛️ Affiliation: Institute for Genomic Medicine Research & ASBCB
- 🌐 Curated Database: brainroutedb
- 💻 GitHub: github.com/omicscodeathon/brainroute
- 🚀 Live Demo: BrainRoute Deployment
- 📖 Documentation: GitHub Wiki
- 💬 Discussions: GitHub Discussions
- 🐛 Bug Reports: GitHub Issues
Made with ❤️ by the BrainRoute Team
Accelerating CNS Drug Discovery Through Open Science
© 2026 BrainRoute Team. All rights reserved.
If you find this project useful, please consider giving it a ⭐ on GitHub!




