A machine learning project focused on predicting solvation free energy of molecules using various ML models and analyzing their feature importance.
This project aims to predict solvation free energy using both molecular fingerprints and graph-based representations. Multiple machine learning models are implemented and optimized, including Graph Convolutional Networks (GCN), Random Forest (RF), XGBoost (XGB), and Feed-Forward Neural Networks (FFNN).
The project uses Poetry for dependency management. Main dependencies include:
- Python ^3.11
- PyTorch ^2.2.1
- RDKit ^2024.3.5
- XGBoost ^2.1.2
- Pandas ^2.2.3
- Scikit-learn ^1.5.2
- Ax-platform ^0.4.3 (for hyperparameter optimization)
For a complete list of dependencies, see pyproject.toml.
```
solvation-free-energy/
├── data/
│   ├── SAMPL.csv
│   ├── train.csv
│   ├── test.csv
│   ├── train_fp.csv               # Training data with Morgan fingerprints
│   ├── test_fp.csv                # Test data with Morgan fingerprints
│   └── optimization_results/
│       ├── GCN_optimization.csv
│       ├── RFR_optimization.csv
│       ├── XGB_optimization.csv
│       └── FFNN_optimization.csv
├── notebooks/
│   ├── EDA.ipynb                  # Exploratory Data Analysis
│   ├── gcn_optimization.ipynb     # Graph Convolutional Network optimization
│   ├── rfr_optimization.ipynb     # Random Forest optimization
│   ├── xgbr_optimization.ipynb    # XGBoost optimization
│   └── ffnn_optimization.ipynb    # Feed-Forward Neural Network optimization
└── plots/
    └── optimization_plots/        # Model optimization visualizations
```
- Initial EDA shows the distribution of molecular properties and baseline performance
- Two main approaches are used for molecular representation:
  - Morgan fingerprints (ECFP4) for the traditional ML models (see the sketch after this list)
  - Graph representations for the GCN
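A minimal sketch of the fingerprint generation with RDKit is shown below; the CSV path and SMILES column name are assumptions about the project's data files:

```python
# Minimal sketch: generating Morgan (ECFP4) fingerprints with RDKit.
# The CSV path and the "smiles" column name are assumptions.
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

df = pd.read_csv("data/SAMPL.csv")
# Radius 2 corresponds to ECFP4 (diameter 4).
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
mols = [Chem.MolFromSmiles(s) for s in df["smiles"]]
fps = [gen.GetFingerprintAsNumPy(m) for m in mols]
```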
Multiple models are implemented and optimized:
- GCN: custom implementation with configurable architecture (see the layer sketch below)
- Features:
- Convolution layers with adjacency matrix normalization
- Pooling layers
- Dropout for regularization
- Batch normalization
- Optimized parameters:
- hidden_nodes
- n_conv_layers
- n_hidden_layers
- learning_rate
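As an illustration of the adjacency-normalized convolution, a single layer might look like the sketch below; class and argument names are assumptions, not the project's actual code:

```python
# Illustrative sketch of one graph convolution with symmetric
# adjacency normalization, as used in a GCN layer.
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Add self-loops, then normalize: D^{-1/2} (A + I) D^{-1/2}
        a_hat = adj + torch.eye(adj.size(-1), device=adj.device)
        d_inv_sqrt = a_hat.sum(-1).pow(-0.5)
        a_norm = d_inv_sqrt.unsqueeze(-1) * a_hat * d_inv_sqrt.unsqueeze(-2)
        return torch.relu(self.linear(a_norm @ x))
```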
- Random Forest (RF) optimized parameters (see the search-space sketch below):
- n_estimators (100-1000)
- max_depth (10-300)
- max_features ('sqrt', 'log2')
- min_samples_leaf (0.001-0.5)
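These ranges translate directly into an Ax search space. The following definition is hypothetical (and is reused in the optimization sketch further down):

```python
# Hypothetical Ax search-space definition mirroring the RF ranges above.
rf_params = [
    {"name": "n_estimators", "type": "range", "bounds": [100, 1000]},
    {"name": "max_depth", "type": "range", "bounds": [10, 300]},
    {"name": "max_features", "type": "choice", "values": ["sqrt", "log2"]},
    {"name": "min_samples_leaf", "type": "range", "bounds": [0.001, 0.5]},
]
```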
- XGBoost (XGB) optimized parameters (see the evaluation sketch below):
- learning_rate (0.001-1)
- max_depth (1-6)
- colsample_bytree (0-1)
- reg_alpha (1e-6 to 10)
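A configuration drawn from this space might be scored as in the sketch below; the data-split arguments are placeholders for the project's actual data:

```python
# Sketch: scoring one XGBoost configuration by validation RMSE.
import xgboost as xgb
from sklearn.metrics import mean_squared_error

def eval_xgb(params, X_tr, y_tr, X_val, y_val):
    model = xgb.XGBRegressor(**params)
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_val, model.predict(X_val)) ** 0.5  # RMSE
```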
- FFNN architecture (see the builder sketch below):
- Multiple hidden layers with ReLU activation
- Batch normalization
- Dropout regularization
- Optimized parameters:
- learning_rate
- patience (for early stopping)
- dropout_rate
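A minimal builder for such a network, assuming PyTorch and illustrative layer sizes, could look like:

```python
# Illustrative builder for the FFNN described above: hidden layers with
# ReLU, batch normalization, and dropout; sizes are assumptions.
import torch.nn as nn

def make_ffnn(in_dim: int, hidden_dim: int, n_hidden: int,
              dropout_rate: float) -> nn.Sequential:
    layers, dim = [], in_dim
    for _ in range(n_hidden):
        layers += [
            nn.Linear(dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
        ]
        dim = hidden_dim
    layers.append(nn.Linear(dim, 1))  # single regression output
    return nn.Sequential(*layers)

net = make_ffnn(2048, 256, n_hidden=3, dropout_rate=0.2)  # example sizes
```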
All models use Bayesian optimization via the Ax Platform for hyperparameter tuning (see the sketch after this list), with:
- RMSE as the optimization metric
- 20-25 trials per model
- Train/test split of 80/20
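Putting the pieces together, a managed Ax optimization loop for the RF model might look like the following sketch; the random data stands in for the fingerprint features, and the evaluation function is illustrative rather than the project's exact code:

```python
# Minimal sketch of the Ax managed loop, minimizing validation RMSE.
import numpy as np
from ax import optimize
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 2048)).astype(float)  # placeholder fingerprints
y = rng.normal(size=500)                                # placeholder targets
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

def eval_rf(params):
    model = RandomForestRegressor(**params, random_state=0).fit(X_tr, y_tr)
    return mean_squared_error(y_val, model.predict(X_val)) ** 0.5  # RMSE

best_params, best_vals, _, _ = optimize(
    parameters=rf_params,        # search space sketched earlier
    evaluation_function=eval_rf,
    minimize=True,
    total_trials=25,
)
```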
The dataset comprises over 600 compounds with experimentally determined solvation free energies. Each sample also includes a solvation free energy calculated with molecular dynamics. For details of the dataset, refer to Mobley & Guthrie (2014).
The dataset was split into a training and a test set; the test set was used only for the final model comparison, not during training. As a baseline, the mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R²) between the calculated and experimental solvation free energies are shown below.
| Dataset | Samples | MAE (kcal/mol) | RMSE (kcal/mol) | R² |
|---|---|---|---|---|
| Complete dataset | 642 | 1.11 | 1.54 | 0.84 |
| Train split | 514 | 1.10 | 1.50 | 0.85 |
| Test split | 128 | 1.18 | 1.71 | 0.79 |
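For reference, these metrics can be computed with scikit-learn as in the sketch below; `y_expt` and `y_calc` stand in for the experimental and calculated values:

```python
# Sketch: comparison metrics with scikit-learn (placeholder values).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_expt = np.array([-5.1, -3.2, -7.4])   # placeholder experimental values
y_calc = np.array([-4.8, -2.9, -8.0])   # placeholder calculated values

mae = mean_absolute_error(y_expt, y_calc)
rmse = mean_squared_error(y_expt, y_calc) ** 0.5
r2 = r2_score(y_expt, y_calc)
```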
The computed fingerprints were used to quantify the similarity between the training and test sets:
- Cosine similarity: 0.9658
- Wasserstein distance: 38.1324
- Jensen–Shannon divergence: 0.1843
- Bhattacharyya coefficient: 0.8284
- Hellinger distance: 0.4143
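As an illustration, such metrics can be computed from the mean fingerprint vector of each split; the random arrays below stand in for the real train/test fingerprints, and the project's exact procedure may differ:

```python
# Sketch of one way to compute split-similarity metrics.
import numpy as np
from scipy.spatial.distance import cosine, jensenshannon

rng = np.random.default_rng(0)
train_fps = rng.integers(0, 2, size=(514, 2048))  # placeholder fingerprints
test_fps = rng.integers(0, 2, size=(128, 2048))

p, q = train_fps.mean(axis=0), test_fps.mean(axis=0)
cosine_similarity = 1 - cosine(p, q)
js_distance = jensenshannon(p, q)  # square this value for the divergence
```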
Performance metrics and optimization results for each model are stored in data/optimization_results/. Each model's optimization process includes:
- Optimization traces showing RMSE improvement over trials
- Feature importance analysis where applicable
- Contour plots showing parameter interactions
- Final optimized parameters and performance metrics
Comparing the Ax optimization results across models, the GCN achieves better performance on the training set than the other models.

SHAP analysis identified several features that influence the predicted solvation free energy; polar groups contributed most strongly toward negative values, as shown below:
Feature importance with corresponding Morgan fingerprints.
Beeswarm plot of the top features. Since the features are binary, High = 1 and Low = 0.
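For reference, a typical SHAP workflow for a fitted tree model looks like the sketch below; the tiny random dataset and model are placeholders for the real fingerprints:

```python
# Sketch of the SHAP workflow for a tree model (placeholder data).
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 32)).astype(float)  # placeholder fingerprints
y = rng.normal(size=100)                              # placeholder targets
model = RandomForestRegressor(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)  # beeswarm-style plot of top features
```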
- Comparative analysis of model performance
- Feature importance comparison across different models
- Ensemble methods exploration
- Clone the repository
- Install Poetry if not already installed:

  ```bash
  pip install poetry
  ```

- Install dependencies:

  ```bash
  poetry install
  ```

- Run the notebooks in the following order:
  - EDA.ipynb for initial data analysis
  - Individual model optimization notebooks as needed
- Integration of models into an ensemble
- Comparative analysis of all implemented models
- Implementation of additional architectures
- Cross-validation for more robust performance estimates
- Web interface for predictions