The main goal of this project is to analyze the relationship between medullary thyroid carcinoma (MTC) and associated gene/clinical data using machine learning. We employ a RandomForestClassifier to classify and evaluate models on various test datasets, and we use SHAP to interpret feature importance.
- Data Loading: Load gene features and clinical data from multiple Excel files.
- Data Preprocessing: Fill missing values, convert categorical variables (e.g.,
tumor_grade,sex,Heredity,T,N,M,TNM stage) into numerical codes. - Feature Selection: Use RandomForestClassifier together with
SelectFromModelto identify the most discriminative features. - Model Training & Evaluation:
- Run K-fold cross-validation to choose the best model on the training set.
- Evaluate on independent test sets to output performance metrics (precision, recall, confusion matrix, AUC, etc.).
- SHAP Interpretability: Compute SHAP values to evaluate the contribution and impact of each feature on model predictions.
your_project/
│
├─ main.py # Python script containing the core logic
└─ README.md
Note: The actual names of the Excel files depend on your local files.
The script assumes they reside in thedata/folder.
- Python 3.7+
- Key Libraries:
pandasnumpyscikit-learnshapargparse
Install the dependencies via pip:
pip install pandas numpy scikit-learn shap argparse-
Clone or Download
Clone or download this project to your local environment. Ensure thedata/directory contains the required Excel files. -
Install Dependencies
pip install -r requirements.txt
Or manually:
pip install pandas numpy scikit-learn shap argparse
-
Run the Script
From the terminal, navigate to the project folder and execute:python main.py --seed 42
Argument details:
--seed: Random seed (default is 10).
-
Output
The script outputs:- Cross-validation scores and the best validation score.
- Test set performance metrics (precision, recall, confusion matrix, AUC).
- Prediction results (predicted labels, probabilities, true labels).
- Selected important features.
- SHAP values for feature interpretation.
-
load_data()
Loads data from multiple Excel files into dictionaries for further processing. -
print_metrics(y_true, y_pred, y_pred_proba)
Calculates and displays classification metrics including precision, recall, confusion matrix, and AUC. -
print_metrics_conf_matrix(y_test, y_pred, y_pred_proba)
Similar toprint_metricsbut emphasizes the confusion matrix. -
train(gene_discovery_data, gene_test1_data, gene_test2_data)
Trains a RandomForestClassifier with K-fold cross-validation, selects the best model, and evaluates on test datasets. -
Clinical_train_with_slectfeature(gene_discovery_data, gene_test1_data, gene_test2_data, max_features=10, random_state=1)
Performs feature selection, model training, evaluation, and SHAP-based feature importance analysis.
-
Cross-validation
- Outputs accuracy scores for each fold in K-fold CV.
- Selects the model with the best overall cross-validation accuracy.
-
Test Set Performance
Metrics include precision, recall, F1 Score, confusion matrix, and AUC. -
Feature Importance
Evaluates feature importance using both RandomForest’s built-in feature importances and SHAP values. Higher SHAP values indicate more influential features.
-
Excel File Names or Versions Differ
Update file paths and sheet names inload_data()accordingly. -
Library Compatibility Issues
Use virtual environments (e.g., conda, venv) to resolve potential dependency conflicts. -
Missing Data Issues
Consider alternative strategies (e.g., median, mean, interpolation) if current filling strategy is unsuitable. -
Categorical Variable Encoding
Update the dictionaryr_dictin the script if category mappings need to be modified.
You are encouraged to modify the code by trying different models, feature selection strategies, or visualizations. Contributions, suggestions, and improvements are welcome via issues or pull requests.
You can release this project under your laboratory’s or company’s preferred license. If not specified, consider using the MIT License.
We hope this README helps you and others easily understand and use the project. For further questions or clarifications, please reach out.