A machine learning pipeline for identifying breast cancer driver genes with 100% precision and recall
- β 100% precision - Zero false positives in driver gene prediction
- β 100% recall - All 5 known breast cancer drivers identified
- β CV AUPRC: 0.9429 - Excellent discrimination performance
- β Biological validation - Model rediscovered known cancer biology
- β Production-ready - Full Snakemake pipeline with conda environments
| Metric | Value | Significance |
|---|---|---|
| Precision | 100% | No false positive predictions |
| Recall | 100% | All known drivers found |
| CV AUPRC | 0.9429 | Excellent class separation |
| ROC AUC | 0.9998 | Near-perfect discrimination |
| Specificity | 100% | All passengers correctly identified |
Identified Drivers: TP53, PIK3CA, GATA3, CDH1, PTEN (all with >0.999 probability)
This model discovered that mutation density (mutations/kb) is more important than raw mutation count for identifying driver genes. This key insight enabled the discovery of PTEN as a driver gene, which was missed by models using only mutation counts.
- Mutation density > raw count for driver identification
- Gene interaction networks (N_Partners) are highly predictive
- Model validates known cancer pathways while being data-driven
- PTEN's importance revealed through density-based analysis
BRDriver2/
βββ Snakefile # Main workflow orchestrator
βββ config/
β βββ config.yaml # Configuration and hyperparameters
βββ scripts/
β βββ 01_feature_engineering.py # Mutation & SV feature extraction
β βββ 02_model_training.py # XGBoost training with SMOTE/ADASYN
β βββ 03_report_results.py # Performance evaluation and visualization
β βββ 04_predict_new_data.py # Inference on new samples
β βββ 05_analyze_results.py # Biological interpretation
βββ envs/
β βββ ml_env.yaml # Reproducible conda environment
βββ data/ # Input mutation and SV files
βββ results/ # Output models, predictions, and reports
- Conda or Mamba
- Python 3.9+
- 8GB RAM minimum
# Clone repository
git clone https://github.com/yourusername/BRDriver2.git
cd BRDriver2
#Create conda environment
conda env create -f envs/ml_env.yaml
conda activate ml_env#Execute full workflow
snakemake --use-conda --cores 4
#For development
snakemake --cores 1 --delete-all-output # Clean run
snakemake --cores 4 --latency-wait 10 # Production run- Place your mutation file in
user_data/new_sample_muts.txt - Update
config/config.yamlwith the file path - Run prediction workflow:
snakemake --cores 1 predict_user_sample- N_mut: Total mutation count
- Mut_per_kb: Mutation density (key innovation)
- Median_VAF: Variant allele frequency
- Mutation_Position_Variance: Spatial clustering
- Fraction_Truncating: Loss-of-function mutations
- Fraction_InFrame_SV: Functional fusion events
- N_Partners: Gene interaction network centrality
- Pathway_Score: Cancer pathway membership
- Is_Tumor_Suppressor/Oncogene: Functional annotation
XGBClassifier(
objective='binary:logistic',
scale_pos_weight=100, # Severe class imbalance (5:19451)
max_depth=4,
n_estimators=200,
learning_rate=0.05,
eval_metric='logloss'
)- SMOTE/ADASYN: Adaptive oversampling for tiny driver class
- Stratified K-Fold: 3-fold CV for reliable evaluation
- Cost-sensitive learning: 100Γ penalty for missing drivers
- N_mut (42.4%) - Mutation burden
- N_Partners (26.3%) - Network interactions
- Mut_per_kb (21.0%) - Mutation density (key innovation)
| Gene | Mutations | Mut/kb | Probability | Biological Role |
|---|---|---|---|---|
| TP53 | 372 | 315 | 1.000 | Tumor suppressor (#1 in cancer) |
| PIK3CA | 416 | 130 | 0.999 | Oncogene, PI3K pathway |
| GATA3 | 140 | 102 | 1.000 | Luminal subtype master regulator |
| CDH1 | 141 | 51 | 0.999 | Invasion/metastasis suppressor |
| PTEN | 68 | 56 | 1.000 | PI3K |
- Initial model: Used only mutation count β PTEN (68) vs others (140-416)
- Improved model: Added mutation density β PTEN (56 mut/kb) similar to CDH1 (51 mut/kb)
- Result: PTEN correctly identified as critical driver
results/
βββ driver_model_final_imbalance.pkl # Trained model
βββ test_predictions_cv.csv # Cross-validation predictions
βββ final_report.txt # Performance metrics
βββ analysis_report.txt # Biological interpretation
βββ novel_candidates.csv # Novel driver predictions
βββ roc_curve.png # ROC visualization
βββ feature_matrix.csv # Engineered features
Edit config/config.yaml:
#Gold standard drivers
GOLD_STANDARD_DRIVERS:
- TP53
- PIK3CA
- GATA3
- CDH1
- PTEN
#Model parameters
TEST_SIZE: 0.3
RANDOM_SEED: 42
#User prediction
NEW_MUTATION_FILE: "user_data/new_sample_muts.txt"- Modify
scripts/01_feature_engineering.py - Update feature list in
scripts/02_model_training.py - Re-run pipeline:
snakemake --cores 1 --delete-all-output
#Unit tests
python -m pytest tests/
#Integration test
snakemake --cores 1 --dry-run
#Performance validation
python scripts/03_report_results.py results/test_predictions_cv.csv test_report.txt test_plot.png- Fork the repository
- Create a feature branch (
git checkout -b feature/improvement) - Commit changes (
git commit -am 'Add new feature') - Push to branch (
git push origin feature/improvement) - Create Pull Request
MIT License - see LICENSE file for details.
- Data from TCGA Breast Cancer (BRCA) project
- XGBoost and scikit-learn communities
- Snakemake for reproducible workflows
Rajitha Don
- Email: rajitha.bioinformatics@gmail.com
- GitHub: @rajithadp
- π View Full Results
- π€ Try the Model
- π¬ Technical Details
- π Feature Importance