Leveraging machine learning and computational chemistry to identify promising drug candidates for glioblastoma
20 Drugs Analyzed • 13 Promising Candidates • 78 Synergistic Combinations • 5 ML Models Compared
Our analysis pipeline generates comprehensive visualizations revealing patterns in drug efficacy, chemical similarity, and therapeutic potential.
What this shows: Machine learning predictions ranking 13 promising drug candidates based on their effectiveness against GBM cell lines. Dasatinib, Etoposide, and Afatinib emerge as the top candidates with favorable IC50 profiles and pathway relevance. The One-Class SVM model identifies these drugs as most similar to known effective treatments in high-dimensional molecular space.
Tanimoto Similarity (Molecular Fingerprints): This heatmap reveals structural relationships between drugs using chemical fingerprint comparison. Darker colors indicate higher similarity (up to 96% for some pairs). Drugs with high Tanimoto scores share similar molecular scaffolds, suggesting they may have comparable mechanisms of action or could be substituted for each other in treatment protocols.
Maximum Common Substructure (MCS) Analysis: Identifies the largest shared molecular fragments between drug pairs. This metric is particularly valuable for understanding pharmacophore relationships: drugs with large common substructures likely bind to similar protein targets despite overall structural differences.
Graph Neural Network Embeddings: Deep learning captures subtle structural patterns that traditional methods can miss. This GCN-based similarity is computed from learned embeddings of each molecular graph (atoms as nodes, bonds as edges), so it can surface non-obvious relationships in connectivity and atom-level properties, offering a complementary perspective to fingerprint-based approaches.
Distribution Analysis: These histograms reveal how similarity scores cluster across the drug library. Tanimoto shows a bimodal distribution (structurally distinct drug classes), MCS reveals moderate substructure sharing, while GCN captures a wider range of relationships through learned representations. Together, they provide a multi-dimensional view of chemical space.
K-Means Clustering (UMAP Projection): Drugs are grouped into distinct clusters based on molecular features and efficacy profiles. UMAP dimensionality reduction projects high-dimensional chemistry into 2D space while preserving local and global structure. Each cluster represents drugs with similar therapeutic potential and mechanisms.
Hierarchical Clustering: Reveals nested relationships and drug families. This dendrogram-based approach shows how drugs progressively merge into groups, useful for understanding evolutionary relationships in drug design and identifying backup candidates within the same therapeutic class.
DBSCAN (Density-Based Clustering): Automatically identifies outliers and noise in the drug space. Unlike K-means, DBSCAN doesn't force every drug into a cluster; it discovers natural groupings and highlights unique compounds that don't fit conventional patterns. These outliers may represent novel mechanisms or require further investigation.
Glioblastoma (GBM) remains one of the most formidable challenges in oncology. Despite decades of research, this aggressive brain tumor has a median survival time of just 15 months. Traditional trial-and-error approaches to drug discovery are too slow and too expensive.
What if we could use data science to accelerate the search for effective treatments?
This project represents a computational journey to answer that question, combining large-scale drug sensitivity data with modern machine learning to identify drugs that show promise against GBM.
- Visual Results
- The Challenge
- Our Approach
- Key Accomplishments
- Technology Stack
- Getting Started
- How It Works
- Results & Insights
- Interactive Dashboard
- Project Architecture
- Customization & Extension
- What's Next
We built an intelligent analysis pipeline that processes data from the Genomics of Drug Sensitivity in Cancer (GDSC) project, examining how hundreds of drugs interact with GBM cell lines. But we didn't stop at simple statistics.
The system employs three complementary perspectives on drug similarity:
- Chemical fingerprinting to understand molecular structures
- Substructure analysis to identify common pharmacophores
- Graph neural networks to capture deep structural relationships
By combining these approaches with unsupervised clustering and one-class machine learning, we can identify drugs that don't just look similar on paper but share genuine therapeutic potential.
- Integrated and harmonized GDSC1 and GDSC2 datasets
- Focused analysis on GBM-specific cell lines (U-87, U-251, SNB-19)
- Processed 76 drug-cell line combinations across 20 therapeutic compounds
- Extracted 17 molecular descriptors from chemical structures
- Computed multi-dimensional similarity matrices using three distinct methods
- Generated graph-based embeddings using deep learning
- Identified 13 promising drug candidates with favorable IC50 profiles
- Clustered drugs by mechanism and structure using 3 algorithms
- Mapped candidates to relevant biological pathways (EGFR, VEGFR, PI3K/AKT/mTOR)
- Combination Therapy: Identified 78 synergistic drug pairs based on pathway complementarity
- Model Comparison: Benchmarked 5 ML algorithms (KNN, RF, XGBoost, SVM, Neural Networks)
- Interaction Safety: Automated drug-drug interaction checking with severity classification
- Interactive Dashboard: Real-time exploration via Streamlit web interface
- Top Candidates: Dasatinib, Etoposide, Afatinib emerged as most promising
- Drug Pairs: Afatinib+Gefitinib showed highest synergy potential (score: 0.582)
- Best Model: XGBoost achieved 0.631 cross-validation accuracy
- Safety: 16/20 top combinations classified as safe for co-administration
This project bridges multiple disciplines:
Chemistry & Biology
- RDKit for molecular descriptor extraction and SMILES processing
- PubChem integration for chemical structures
- Enrichr for pathway enrichment analysis
Machine Learning
- PyTorch 2.8.0 with Metal Performance Shaders (MPS) GPU acceleration
- PyTorch Geometric 2.6.1 for end-to-end graph neural networks (GCN/GAT)
- Scikit-learn for clustering and classification
- XGBoost for gradient boosting models
- Comprehensive model comparison framework with cross-validation
Data Engineering & Deployment
- Automated ETL pipeline for GDSC datasets
- Docker & docker-compose for containerized deployment
- Reproducible workflow with comprehensive logging
- Modular architecture for extensibility
- Interactive Streamlit dashboard for exploration
- Multi-model evaluation with performance metrics
You'll need:
- Python 3.9+ (recent PyTorch releases, including the 2.8.0 listed above, do not support Python 3.8)
- 8GB RAM minimum (16GB recommended for large datasets)
- macOS, Linux, or Windows (optimized for Mac with Metal GPU acceleration)
# 1. Clone the repository and navigate to project directory
cd GBM_drug_analysis_and_recommendation
# 2. Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# 3. Install all dependencies
pip install -r requirements.txt
That's it! The virtual environment now contains everything needed: PyTorch with GPU acceleration, RDKit for chemistry, and all scientific computing libraries.
For a containerized environment with all dependencies pre-configured:
# Build and run with Docker Compose
docker-compose up gbm-analysis
# Or build the Docker image manually
docker build -t gbm-drug-analysis .
# Run the analysis
docker run -v "$(pwd)/data:/app/data" -v "$(pwd)/results:/app/results" gbm-drug-analysis
Available Docker Services:
# Run main analysis pipeline
docker-compose up gbm-analysis
# Train GNN model
docker-compose --profile gnn up gnn-training
# Launch interactive dashboard
docker-compose --profile dashboard up dashboard
# Access at http://localhost:8501
# Start Jupyter notebook server
docker-compose --profile notebook up notebook
# Access at http://localhost:8888
Benefits:
- No manual dependency installation
- Consistent environment across machines
- Isolated from system Python
- Easy scaling and deployment
Option 1: Complete Pipeline (All 10 Stages)
./venv/bin/python main.py
The system executes:
- Load and merge GDSC datasets
- Extract molecular features from drug structures
- Compute similarity matrices (Tanimoto, MCS, GCN)
- Cluster drugs and identify patterns
- Predict promising candidates with One-Class SVM
- Perform pathway enrichment analysis
- Generate visualizations and reports
- Analyze drug combinations for synergy
- Compare multiple ML models
- Check drug-drug interactions
Option 2: Core Pipeline Only (Stages 1-7)
./venv/bin/python main.py --skip-combination --skip-model-comparison --skip-interactions
Option 3: Custom Workflow
# Run specific analyses
./venv/bin/python main.py --skip-clustering --skip-pathway --top-n-drugs 50
Explore results visually with the web-based dashboard:
# Activate virtual environment
source venv/bin/activate
# Launch dashboard (opens in browser automatically)
./venv/bin/streamlit run dashboard.py
Dashboard Features:
- Overview: Summary of all analyses with key metrics
- Drug Predictions: Interactive filtering, sorting, and visualization
- Drug Similarity: Heatmaps for all three similarity methods
- Combination Therapy: Top synergistic pairs with detailed scores
- Pathway Analysis: Enrichment results by database (KEGG, Reactome, GO, BioPlanet)
- Model Comparison: Performance metrics across all ML algorithms
- Drug Interactions: Safety analysis with severity classifications
All visualizations are interactive with zoom, pan, and download capabilities.
Train an end-to-end GNN model that learns directly from molecular SMILES:
# Activate virtual environment
source venv/bin/activate
# Train GNN for IC50 regression (default)
python train_gnn.py --task regression --epochs 100 --batch-size 32
# Train GNN for effectiveness classification
python train_gnn.py --task classification --epochs 100
# Use GAT instead of GCN
python train_gnn.py --gnn-type gat --hidden-channels 256
# Full options
python train_gnn.py \
--task regression \
--gnn-type gcn \
--hidden-channels 128 \
--num-gnn-layers 3 \
--num-mlp-layers 2 \
--dropout 0.2 \
--epochs 100 \
--batch-size 32 \
--learning-rate 0.001
GNN Features:
- No manual feature engineering: learns from molecular graphs
- GPU acceleration (CUDA/MPS) for faster training
- Supports both regression (IC50) and classification (effective/not)
- Automatic train/validation split with early stopping
- Saves trained models and predictions to results/models/
- Generates training history and prediction plots
Output Files:
- results/models/gnn_regression_model.pt: Trained GNN model
- results/models/gnn_regression_predictions.csv: Test set predictions
- results/figures/gnn_regression_training_history.png: Loss curves
- results/figures/gnn_regression_predictions.png: Prediction scatter plot
This project uses drug sensitivity data from the Genomics of Drug Sensitivity in Cancer (GDSC) project.
The repository includes sample data (76 drug-cell combinations) in CSV format, ready to use out of the box. This allows you to run the complete pipeline immediately without downloading large datasets.
For comprehensive analysis with the complete GDSC datasets:
Option 1: Download Fitted Dose Response Data (Recommended)
Visit the GDSC bulk download page and download the Excel files:
- GDSC1: GDSC1 Fitted Dose Response
- GDSC2: GDSC2 Fitted Dose Response
Then convert to CSV and place in data/raw/:
# Convert Excel to CSV (using pandas or Excel)
# Save as GDSC1.csv and GDSC2.csv in data/raw/
Option 2: Main GDSC Portal
Access the complete data portal:
- GDSC Downloads: https://www.cancerrxgene.org/downloads/bulk_download
- Cell Line Details: https://www.cancerrxgene.org/downloads
The pipeline expects CSV files in data/raw/ with these columns:
- CELL_LINE_NAME: Cancer cell line identifier
- DRUG_NAME: Drug/compound name
- LN_IC50: Natural log of IC50 value
- IC50: Half-maximal inhibitory concentration (μM)
- AUC: Area under the dose-response curve
- RMSE: Root mean square error of fit
- Z_SCORE: Standardized drug response
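Before running the pipeline on your own files, it can help to confirm that the header matches the column list above. The following is a minimal sketch using only the standard library; the helper name `missing_columns` and the sample row are illustrative assumptions, not part of the project's code.

```python
import csv
import io

# Column names taken from the list above.
REQUIRED_COLUMNS = {
    "CELL_LINE_NAME", "DRUG_NAME", "LN_IC50", "IC50", "AUC", "RMSE", "Z_SCORE",
}


def missing_columns(csv_file):
    """Return the required columns absent from the CSV header, sorted by name."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return sorted(REQUIRED_COLUMNS - present)


# In-memory example with made-up values; for a real file use
# open("data/raw/GDSC1.csv", newline="") instead of io.StringIO.
sample = io.StringIO("CELL_LINE_NAME,DRUG_NAME,LN_IC50,AUC\nU-87,Dasatinib,-1.2,0.81\n")
print(missing_columns(sample))  # ['IC50', 'RMSE', 'Z_SCORE']
```

An empty result means every expected column is present.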
If you use GDSC data in your research, please cite:
Yang, W., Soares, J., Greninger, P. et al. (2013). Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Research, 41(D1), D955-D961.
Iorio, F., Knijnenburg, T.A., Vis, D.J. et al. (2016). A Landscape of Pharmacogenomic Interactions in Cancer. Cell, 166(3), 740-754.
Stage 1: Data Harmonization
We start by integrating two massive drug screening datasets (GDSC1 and GDSC2), filtering for GBM-specific cell lines. The system automatically:
- Converts logarithmic IC50 values to micromolar concentrations
- Identifies and removes statistical outliers
- Labels drugs as "effective" based on therapeutic thresholds (IC50 < 10 μM)
Output: 76 drug-cell line combinations ready for analysis
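The conversion and labeling steps above can be sketched as follows, assuming LN_IC50 is the natural log of IC50 in μM. The function names and the two sample rows are hypothetical; only the 10 μM threshold comes from this README.

```python
import math

IC50_THRESHOLD_EFFECTIVE = 10.0  # μM, the threshold quoted above


def ln_ic50_to_um(ln_ic50):
    """Invert the natural log to recover IC50 in μM."""
    return math.exp(ln_ic50)


def label_effective(ln_ic50, threshold_um=IC50_THRESHOLD_EFFECTIVE):
    """True when the recovered IC50 is below the effectiveness threshold."""
    return ln_ic50_to_um(ln_ic50) < threshold_um


# Made-up rows: (drug placeholder, LN_IC50)
rows = [("DrugA", 0.0), ("DrugB", 3.0)]
for name, ln_ic50 in rows:
    print(name, round(ln_ic50_to_um(ln_ic50), 2), label_effective(ln_ic50))
```

Here DrugA (IC50 of 1 μM) is labeled effective, while DrugB (about 20 μM) is not.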
Stage 2: Molecular Profiling
Every drug's chemical structure is analyzed to extract meaningful descriptors:
- Molecular weight and lipophilicity
- Hydrogen bonding capacity
- Topological features
- Lipinski's Rule of Five compliance
Think of this as creating a "fingerprint" for each drug molecule.
Output: 17 molecular features per drug
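As an illustration of one descriptor-derived feature, here is a minimal sketch of a Lipinski Rule of Five check. In the pipeline the descriptor values would come from RDKit; the `Descriptors` container, the helper names, and the numbers below are hypothetical stand-ins so the example runs without dependencies.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mol_weight: float   # Daltons
    logp: float         # octanol-water partition coefficient
    h_donors: int       # hydrogen bond donors
    h_acceptors: int    # hydrogen bond acceptors


def lipinski_violations(d):
    """Count how many of the four Rule of Five limits are exceeded."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def passes_rule_of_five(d, max_violations=1):
    """A common reading of the rule tolerates at most one violation."""
    return lipinski_violations(d) <= max_violations


# Made-up descriptor values for two placeholder compounds.
small = Descriptors(mol_weight=350.0, logp=2.5, h_donors=2, h_acceptors=6)
bulky = Descriptors(mol_weight=720.0, logp=6.1, h_donors=6, h_acceptors=9)
print(lipinski_violations(small), passes_rule_of_five(small))  # 0 True
print(lipinski_violations(bulky), passes_rule_of_five(bulky))  # 3 False
```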
Stage 3: Multi-Angle Similarity
Here's where it gets interesting. We compute how similar drugs are using three different lenses:
- Tanimoto Similarity: Compares molecular fingerprints (like comparing pixel patterns)
- Maximum Common Substructure (MCS): Finds the largest shared molecular fragment
- Graph Neural Networks: Uses deep learning to understand structural relationships
Why three methods? Because drugs can be similar in different ways, and each perspective reveals unique insights.
Output: Three 20×20 similarity matrices
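To make the first of these methods concrete, here is a small sketch of Tanimoto similarity on fingerprints represented as sets of "on" bit positions: the size of the intersection divided by the size of the union. The pipeline presumably computes fingerprints with RDKit; the bit sets and drug placeholders below are made up, and returning 0.0 for two empty fingerprints is a convention chosen for this sketch.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: |A ∩ B| / |A ∪ B| over sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention for two empty fingerprints (assumption)
    return len(a & b) / union


def similarity_matrix(fingerprints):
    """Pairwise scores keyed by (name, name); symmetric, with 1.0 on the diagonal."""
    names = list(fingerprints)
    return {(m, n): tanimoto(fingerprints[m], fingerprints[n])
            for m in names for n in names}


# Made-up fingerprints for three placeholder drugs.
fps = {"DrugA": {1, 4, 9, 16}, "DrugB": {1, 4, 9, 25}, "DrugC": {2, 3}}
matrix = similarity_matrix(fps)
print(matrix[("DrugA", "DrugB")])  # 0.6 (3 shared bits out of 5 distinct)
print(matrix[("DrugA", "DrugC")])  # 0.0 (no shared bits)
```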
Stage 4: Pattern Discovery
Unsupervised clustering reveals natural groupings in the drug space:
- KMeans for clear partitioning
- DBSCAN for outlier detection
- Hierarchical clustering for relationship trees
We also use UMAP for 2D visualization, compressing high-dimensional chemistry into human-readable plots.
Output: Drug clusters and visual maps
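To show how density-based clustering separates outliers from groups, here is a dependency-free sketch of DBSCAN. The project lists scikit-learn for clustering, so this is an illustration of the idea rather than the pipeline's code; the 2D points, `eps`, and `min_samples` values are made up.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or NOISE (-1) for outliers."""
    def neighbors(i):
        # Indices within eps of point i (the point itself is included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep expanding
    return labels


# Two tight groups and one isolated point, as if taken from a 2D projection.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_samples=3))  # [0, 0, 0, 1, 1, 1, -1]
```

The last point is labeled -1, which is how a compound that fits no group would show up.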
Stage 5: Candidate Identification
A One-Class SVM learns what "effective drugs" look like in the feature space. Trained on the 13 drugs with favorable IC50 values, it then scores all candidates.
This isn't traditional classificationโit's asking "how similar is this drug to known effective treatments?"
Output: Ranked list of 13 promising drugs
Stage 6: Biological Validation
Connect computational predictions to biological reality using the Enrichr database:
- Map drugs to their target genes
- Identify enriched pathways
- Filter for GBM-relevant mechanisms (EGFR, VEGFR, mTOR)
Output: Pathway enrichment tables with p-values
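For intuition about where an enrichment p-value comes from, here is a sketch of the standard over-representation calculation (a one-sided hypergeometric tail). Enrichr computes its own statistics on the server side, so this is not the Enrichr API or necessarily its exact method; the function name and the gene counts are illustrative assumptions.

```python
from math import comb


def enrichment_p_value(overlap, targets, pathway_size, background):
    """P(X >= overlap) when drawing `targets` genes from `background` genes,
    of which `pathway_size` belong to the pathway (hypergeometric tail)."""
    upper = min(targets, pathway_size)
    total = comb(background, targets)
    tail = sum(
        comb(pathway_size, k) * comb(background - pathway_size, targets - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Made-up counts: 4 target genes, 2 of them in a 5-gene pathway,
# against a background of 20 genes.
p = enrichment_p_value(overlap=2, targets=4, pathway_size=5, background=20)
print(round(p, 4))  # 0.2487
```

A small p-value means the overlap between the drug's targets and the pathway is larger than chance alone would typically produce.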
Stage 7: Combination Therapy Analysis
Beyond single drugs, we analyze pairs for synergistic potential:
- Pathway Complementarity: Different but related biological targets provide multi-pronged attack
- Target Diversity: Non-overlapping molecular mechanisms reduce resistance
- Optimal Similarity: "Goldilocks zone" - not too similar (redundant) or too different (incompatible)
- Synergy Scoring: Integrated weighted assessment from pathway, target, and similarity scores
Output: Top 78 drug combinations ranked by synergy potential
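The scoring idea above can be sketched as a weighted sum in which the similarity term is rewarded for falling inside a middle band. The weights, the band limits, and the function names are hypothetical; the project's actual values live in its own code. The count of 78 is consistent with scoring every unordered pair of 13 candidates.

```python
from itertools import combinations

# Hypothetical weights; they sum to 1 so scores stay in [0, 1].
WEIGHTS = {"pathway": 0.4, "target": 0.3, "similarity": 0.3}


def goldilocks(similarity, low=0.3, high=0.7):
    """1.0 inside [low, high]; falls linearly to 0.0 at similarity 0 and 1."""
    if similarity < low:
        return similarity / low
    if similarity > high:
        return (1.0 - similarity) / (1.0 - high)
    return 1.0


def synergy_score(pathway, target, similarity):
    """Weighted combination of pathway, target, and similarity components."""
    return (WEIGHTS["pathway"] * pathway
            + WEIGHTS["target"] * target
            + WEIGHTS["similarity"] * goldilocks(similarity))


candidates = [f"Drug{i}" for i in range(1, 14)]  # 13 placeholder names
pairs = list(combinations(candidates, 2))
print(len(pairs))  # 78

# Made-up component scores for one pair.
print(round(synergy_score(pathway=0.8, target=0.5, similarity=0.5), 3))  # 0.77
```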
Stage 8: Model Comparison
Rigorous benchmarking of multiple machine learning approaches:
- K-Nearest Neighbors (KNN) - instance-based learning
- Random Forest - ensemble decision trees
- XGBoost - gradient boosting
- Support Vector Machines - kernel-based classification
- Neural Networks - deep learning
Each model undergoes 5-fold cross-validation with comprehensive metrics (accuracy, precision, recall, F1-score, ROC AUC).
Output: Model performance comparison table and best model selection
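As a rough picture of the evaluation loop, the sketch below splits sample indices into five folds and computes accuracy, precision, recall, and F1 from predictions. It uses only the standard library and does no shuffling or stratification, which a real run (for example with scikit-learn) typically would; the sample count of 76 and the label vectors are used purely for illustration.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def classification_metrics(y_true, y_pred):
    """Binary metrics with 1 as the positive ("effective") class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


folds = list(k_fold_indices(76, k=5))
print([len(test) for _, test in folds])  # [16, 15, 15, 15, 15]

# Made-up labels and predictions for one fold.
print(classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Averaging each metric over the five folds gives the cross-validated score reported for a model.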
Stage 9: Drug Visualization
Generate publication-ready figures and comprehensive reports:
- Similarity heatmaps for all three methods
- Clustering visualizations (UMAP projections)
- Drug prediction rankings
- Pathway enrichment bar charts
Output: Complete analysis report and figure gallery
Stage 10: Safety Validation
Automated drug-drug interaction checking for recommended combinations:
- Structural Alerts: Detection of reactive functional groups that may interact
- CYP450 Profiling: Metabolic pathway conflicts (substrate-inhibitor pairs)
- Physicochemical Analysis: Formulation compatibility via LogP differences
- Risk Stratification: High/moderate/low severity classification with rationales
This step checks that predictions are not just mathematically sound but also biologically plausible, and it flags combinations with likely interaction risks.
Output: Safety-filtered combinations with interaction reports and recommendations
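Two of the rules above (CYP450 substrate-inhibitor conflicts and LogP differences) can be sketched as a simple pairwise check. Everything here is a hypothetical illustration: the `DrugProfile` structure, the LogP gap limit of 4.0, the severity mapping, and the placeholder profiles are assumptions, not the project's rules or real pharmacology data.

```python
from dataclasses import dataclass, field


@dataclass
class DrugProfile:
    name: str
    logp: float
    cyp_substrate_of: set = field(default_factory=set)
    cyp_inhibitor_of: set = field(default_factory=set)


LOGP_GAP_LIMIT = 4.0  # hypothetical formulation-compatibility threshold


def check_pair(a, b):
    """Return (severity, reasons) for a drug pair using two simplified rules."""
    reasons = []
    # One drug inhibits an enzyme that metabolizes the other.
    conflicts = ((a.cyp_substrate_of & b.cyp_inhibitor_of)
                 | (b.cyp_substrate_of & a.cyp_inhibitor_of))
    if conflicts:
        reasons.append("CYP450 substrate-inhibitor conflict: "
                       + ", ".join(sorted(conflicts)))
    if abs(a.logp - b.logp) > LOGP_GAP_LIMIT:
        reasons.append("large LogP difference")
    if conflicts:
        severity = "high"
    elif reasons:
        severity = "moderate"
    else:
        severity = "low"
    return severity, reasons


# Placeholder profiles with made-up values.
drug_a = DrugProfile("DrugA", logp=3.0, cyp_substrate_of={"CYP3A4"})
drug_b = DrugProfile("DrugB", logp=2.5, cyp_inhibitor_of={"CYP3A4"})
drug_c = DrugProfile("DrugC", logp=-2.0)
print(check_pair(drug_a, drug_b))  # ('high', ['CYP450 substrate-inhibitor conflict: CYP3A4'])
print(check_pair(drug_a, drug_c))  # ('moderate', ['large LogP difference'])
print(check_pair(drug_b, drug_c))  # ('moderate', ['large LogP difference'])
```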
Every analysis run generates a comprehensive set of outputs in the results/ directory:
Three complementary views of chemical space:
- similarity/tanimoto_similarity_matrix.csv: Fingerprint-based relationships
- similarity/mcs_similarity_matrix.csv: Structural overlap scores
- similarity/gcn_similarity_matrix.csv: Deep learning embeddings
The machine learning model's recommendations:
- drug_predictions.csv: All 20 drugs ranked by promise
- Top 5 Candidates: Dasatinib, Etoposide, Afatinib, Lapatinib, Imatinib
Drug groupings and relationships:
- clustering/clustering_results.csv: Cluster assignments and metrics
- Visual maps showing drug neighborhoods in chemical space
Synergistic drug pairs for enhanced efficacy:
- combination_therapy/drug_combinations.csv: All 78 pairs with synergy scores
- combination_therapy/drug_combinations_matrix.csv: Pairwise synergy matrix
- Top Combination: Afatinib + Gefitinib (synergy score: 0.582)
Performance metrics across all ML algorithms:
- models/model_comparison_results.csv: Comprehensive evaluation table
- models/best_model.pkl: Trained XGBoost model (best performer)
- Winner: XGBoost with 0.631 CV accuracy
Safety analysis for recommended combinations:
- interactions/drug_interactions.csv: DDI severity classifications
- interactions/safe_combinations.csv: Pre-filtered safe pairs
- interactions/summary.txt: Safety summary report
Biological context for predictions:
- pathways/pathway_enrichment_GO-*.csv: Gene Ontology terms
- pathways/pathway_enrichment_Reactome-*.csv: Reactome pathways
- pathways/pathway_enrichment_KEGG-*.csv: KEGG pathways
- GBM-relevant mechanisms highlighted (EGFR, VEGFR, mTOR)
Publication-ready figures:
- Similarity heatmaps (Tanimoto, MCS, GCN)
- Similarity distributions
- Clustering UMAP projections (K-means, DBSCAN, Hierarchical)
- Top predicted drugs bar chart
The codebase follows a modular design philosophy:
GBM_drug_analysis_and_recommendation/
├── data/
│   ├── raw/                     # GDSC datasets (CSV format)
│   ├── processed/               # Cleaned drug-cell combinations
│   └── smiles/                  # Chemical structure strings
│
├── src/
│   ├── config.py                # Global settings and hyperparameters
│   ├── data_processing.py       # ETL pipeline
│   ├── feature_extraction.py    # Molecular descriptors
│   ├── pathway_analysis.py      # Enrichr integration
│   ├── combination_therapy.py   # Synergy analysis
│   ├── drug_interactions.py     # DDI checking
│   ├── similarity/              # Three similarity algorithms
│   ├── models/                  # Clustering, prediction, model comparison
│   └── utils/                   # Visualization helpers
│
├── results/                     # All outputs and artifacts
├── notebooks/                   # Jupyter exploration
├── main.py                      # Pipeline orchestrator (10 stages)
├── dashboard.py                 # Interactive Streamlit dashboard
├── requirements.txt             # Dependencies (including XGBoost, Streamlit)
└── venv/                        # Isolated Python environment
Each module is self-contained and can be imported independently, which suits custom workflows and integration into larger projects.
Edit src/config.py to tune the analysis:
# Therapeutic thresholds
IC50_THRESHOLD_EFFECTIVE = 10.0 # μM - adjust based on clinical context
# Similarity cutoffs
TANIMOTO_THRESHOLD = 0.7 # 0.0 to 1.0
# Clustering
KMEANS_N_CLUSTERS = 5 # Number of drug groups
# Machine learning
SVM_NU = 0.1 # Outlier sensitivity (must satisfy 0 < nu <= 1)
Extend the analysis with new compounds:
from src.feature_extraction import MolecularFeatureExtractor
extractor = MolecularFeatureExtractor()
# Add small-molecule drugs by name (automatically fetches SMILES from PubChem).
# Antibodies such as bevacizumab have no SMILES, so they cannot be processed this way.
new_drugs = ['Temozolomide', 'Lomustine', 'Regorafenib']
features = extractor.process_drug_list(new_drugs)
# Or provide SMILES directly
custom_smiles = {
    'ExperimentalDrug-1': 'CC(C)Cc1ccc(cc1)C(C)C(=O)O',  # ibuprofen, used as a stand-in
    'ExperimentalDrug-2': 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'  # caffeine, used as a stand-in
}
The modular structure makes it easy to build specialized workflows:
# Example: Focus only on kinase inhibitors
from src.data_processing import GDSCDataLoader
from src.similarity import TanimotoSimilarityAnalyzer
loader = GDSCDataLoader()
data = loader.process_pipeline()
# Filter by the "-tinib" suffix shared by many kinase inhibitors;
# na=False treats rows with a missing drug name as non-matches
kinase_inhibitors = data[data['drug_name'].str.contains('tinib', na=False)]
# Analyze this subset
analyzer = TanimotoSimilarityAnalyzer()
# ... continue with custom analysis
GPU Acceleration
- macOS: Automatically uses Metal Performance Shaders (MPS)
- Linux/Windows: CUDA-enabled GPUs supported
- Falls back to CPU if no GPU available
Data Sources
- GDSC data: Sample data included; full datasets from GDSC Downloads
- Chemical structures: PubChem API (rate-limited; large queries may take time)
- Pathway data: Enrichr API (fair use applies)
Memory Considerations
- Full GDSC datasets: ~2-3GB RAM
- GCN training: ~1-2GB RAM
- Similarity matrices: O(n²) scaling where n = number of drugs
Reproducibility
- All random seeds set in src/config.py
- Results identical across runs with same data
- Minor variations may occur due to GPU non-determinism
- End-to-End Graph Neural Networks: Direct learning from molecular SMILES
- GNN model with GCN/GAT support for drug efficacy prediction
- GPU acceleration (CUDA/MPS) for faster training
- Docker containerization for reproducible deployments
- Combination Therapy Analysis with pathway complementarity
- Multi-Model Comparison (5 ML algorithms with cross-validation)
- Drug-Drug Interaction Safety checking (CYP450 profiling)
- Interactive Streamlit Dashboard for visual exploration
- Multi-omics Integration: Gene expression, mutation, proteomics data
- Clinical Correlation: Patient outcome data and clinical trial integration
- 3D Molecular Docking: Protein-ligand binding simulations
- Extended Drug Library: ChEMBL, PubChem Bioassay integration
- Precision Targeting: Patient-specific treatment recommendations
- Attention Mechanisms: Explainable GNN predictions with attention visualization
- Sample data represents a subset of GBM cell lines
- In silico predictions require in vitro and in vivo validation
- Pathway analysis is correlative, not causal
- The IC50-based model may not capture all aspects of efficacy
- The DDI checker uses simplified rules (full DrugBank integration pending)
This is a research tool, not a clinical recommendation system. Results require experimental validation before therapeutic application.
Contributions welcome for:
- Additional similarity metrics and clustering algorithms
- Database integrations (ChEMBL, PubChem Bioassay, DrugBank)
- Enhanced visualization options
- Clinical trial matching algorithms
This project builds on:
- GDSC Project: Comprehensive drug sensitivity data
- RDKit: Cheminformatics infrastructure
- PyTorch Geometric: Graph neural network capabilities
- Enrichr: Pathway enrichment analysis
- The broader open-source scientific Python ecosystem
This project is provided for research and educational purposes.
Questions or Improvements?
This pipeline represents one approach to computational drug discovery. We encourage experimentation, modification, and extension.
Happy discovering!









