The Cancer Genome Atlas (TCGA), a cancer genomics reference program, has molecularly characterized more than 20,000 primary cancer samples and paired normal samples covering 33 types of cancer. This joint effort between the NCI and the National Human Genome Research Institute began in 2006. In the twelve years since, TCGA has generated more than 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data. These data have led to improvements in the ability to diagnose, treat, and prevent cancer by helping to establish the importance of cancer genomics.
During the experiments, the dataset was significantly enlarged to improve the diversity and representativeness of the data. This allowed the model to learn from a wider variety of examples and improved its generalization. Both models involved in this study, the classification model and the Siamese model, were also adjusted. A key element of the optimization was the use of custom weights, which assign different weights to instances of the dataset based on the number of samples present. Finally, the types of mutations were specified explicitly, which allowed for greater precision in the analysis of genetic information. Numerous studies aimed at identifying a distinctive genomic signature for different types of cancer are being conducted in the current research landscape.
pip install -r requirements.txt
To install TensorFlow, follow this guide: link
To install and set up CUDA and cuDNN, follow this guide:
The cancer patient dataset is built from data obtained from cBioPortal for Cancer Genomics and consists of several files:

- `data_clinical_patient`, which contains clinical data on the patient (such as Patient ID, gender, and tumor status)
- `data_clinical_sample`, which contains data regarding the tumor samples
- `data_cna`, which contains information about changes in the copy number of specific DNA segments
- `data_methylation`, which contains information on DNA methylation
- `data_mrna_seq_v2_rsem`, which contains the mRNA sequencing of tumor samples
- `data_mutations`, which contains mutation data obtained by whole-exome sequencing
- `data_rppa`, which contains data on protein expression
Within the project, the `pre-processing` folder contains the .py files used to preprocess the datasets.
The folder is organized into subfolders by function:

- The `data_cleaning` folder, which contains the files for cleaning and formatting the dataset, as well as the files for calculating the normalization and standard deviation of the values
- The `merged_data` folder, which contains the files for merging the files that will make up the final dataset
- The `utils` folder, which contains helper functions for changes the dataset may need, such as changing the CSV delimiter or deleting columns
- `python3 traspose.py`: takes the `data_mrna_seq_v2_rsem` file as input and returns the transposed CSV
- `python3 normalize.py`: takes the result of the previous script as input and returns a new CSV file with the normalized values
- `python3 deviazione.py`: takes the result of the previous script as input and returns a new CSV file with the standard deviation applied
- `python3 add_variantType.py`: takes the result of the previous script and the files `data_mutations.csv` and `data_cna.csv` as input and returns the dataset with the gene variants
- `python3 cna_scaling.py`: takes the result of the previous script as input and returns the dataset with normalized gene variants
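The first steps of this pipeline can be sketched with pandas. This is only an illustration, not the code in `traspose.py` or `normalize.py`: it assumes a tab-separated expression table with a `Hugo_Symbol` gene column and one column per sample, and it uses a z-score as a stand-in for the normalization and standard deviation steps, whose exact formulas are not stated here. The function names and the `SAMPLE_ID` index name are hypothetical.

```python
import io

import pandas as pd


def transpose_expression(df: pd.DataFrame) -> pd.DataFrame:
    """Turn a genes-by-samples table into a samples-by-genes table."""
    # Assumed layout: a 'Hugo_Symbol' column, an optional 'Entrez_Gene_Id'
    # column, then one column per sample.
    genes = df.set_index("Hugo_Symbol")
    genes = genes.drop(columns=["Entrez_Gene_Id"], errors="ignore")
    transposed = genes.T
    transposed.index.name = "SAMPLE_ID"  # hypothetical index name
    return transposed


def zscore(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize each gene column (assumed stand-in for the real scripts)."""
    std = df.std(ddof=0).replace(0, 1.0)  # constant genes would divide by zero
    return (df - df.mean()) / std


# Tiny in-memory example instead of the real data_mrna_seq_v2_rsem file.
raw = pd.read_csv(
    io.StringIO(
        "Hugo_Symbol\tEntrez_Gene_Id\tS1\tS2\tS3\n"
        "TP53\t7157\t1.0\t2.0\t3.0\n"
        "BRCA1\t672\t10.0\t10.0\t10.0\n"
    ),
    sep="\t",
)
samples = transpose_expression(raw)
scaled = zscore(samples)
print(scaled)
```

After the transposition each row is a sample and each column a gene, which is the orientation the later steps (adding variant columns per gene) operate on.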
Files should be downloaded into a folder named `Dataset`.
Exactly as with the previously mentioned dataset, we need the files:
- `data_clinical_patient`, which contains clinical data on the patient (such as Patient ID, gender, and tumor status)
- `data_clinical_sample`, which contains data regarding the tumor samples
- `data_cna`, which contains information about changes in the copy number of specific DNA segments
- `data_methylation`, which contains information on DNA methylation
- `data_mrna_seq_v2_rsem`, which contains the mRNA sequencing of tumor samples
- `data_mutations`, which contains mutation data obtained by whole-exome sequencing
- `data_rppa`, which contains data on protein expression
However, given the current absence of normal patient data within the cBioPortal for Cancer Genomics platform, the dataset of normal patients was formed from data found on the GitHub of cBioPortal in the DataHub section.
More specifically, the cesc_tcga dataset containing data on normal patients was used.
Link to the normals dataset used: LINK
Files related to the normals dataset should be placed in a folder called `Normals` in the root folder of the project.
To work on normal patients, pre-processing follows a slightly different procedure, since their dataset contains a variety of parameters that are not used by the network and can therefore be dropped.
- First, as with standard pre-processing, transpose the dataset with `python3 traspose.py`.
- Second, save the indexes and the cancer status of each normal with `python3 normals_statusAndindex.py`; they are used later.
- Third, perform normalization with `python3 normalize.py`.
- Fourth, calculate the deviation with `python3 deviazione.py`.
- Fifth, remove the columns with information not needed by the network with `python3 data_cleaning_normals.py`.
- Sixth, calculate the variants of the genes with `python3 add_variantType.py`.
- Seventh, normalize the variant column `_cna` with `python3 cna_scaling.py`.
- Eighth, add the columns for the genes that were not found with `python3 add_missing_variants.py`.
To run the program, run `python3 main.py`.
Before running it, you need to know how its settings work.
The script contains the following settings and paths:

- `dataset_path`: path to the dataset to use
- `encoded_path`: path to the encoded version of the dataset
- `data_encoded = False`: boolean flag that controls whether the encoded dataset is generated (leave the default value the first time you run the code)
  - `False`: the encoded dataset is generated
  - `True`: an existing encoded dataset is loaded
- `only_variant = False`: set this to `True` if you use the dataset that contains only variations in gene mutations
- `model_path`: where the model will be saved or loaded from
- `risultati_classification`: path for the results of the classification
- `classification = True`: boolean flag to run the classification
- `classification_normals = True`: boolean flag to run the classification on the normals dataset
- `siamese_path`: where the model of the Siamese network will be saved or loaded from
- `risultati_siamese`: path for the results of the Siamese network
- `siamese_net = True`: boolean flag to run the Siamese network
- `siamese_variants = True`: set this to `True` if you use the dataset that contains the variations in gene mutations
- `siamese_normals = True`: boolean flag to run the Siamese network with the normals dataset
- `normals_max_epsilon = False`: type of comparison range for normals (has an effect only if `siamese_normals` is `True`)
- `normals_param_epsilon = True`: type of comparison range for normals (has an effect only if `siamese_normals` is `True`)
- `normals_path`: the dataset that contains people with and without the disease
The Siamese Network can only be launched if it has a classification model already trained and saved. In the project the classification model has already been trained. If you want to use the models in this project and not start experimenting again set the parameters in this way:
only_variant = False
data_encoded = True
classification = False
classification_normals = False
siamese_net = True
siamese_normals = False
siamese_variants = True
normals_max_epsilon = False
normals_param_epsilon = False
To perform the operations on the normals datasets, the pre-trained siamese network found in the Detection-signature-cancer/code/models/0005/siamese/espressione_genomica_con_varianti_2LAYER/ folder was used.
When `siamese_normals` is `True`, the network computes a set of similarity thresholds for each cancer type and, for each normal patient, checks whether the patient is predisposed to developing that type of cancer.
The comparison is made between a normal patient and a configurable number, k, of patients with each type of cancer.
Each normal patient is therefore compared with each cancer type k times to estimate their propensity to develop that type of cancer.
A normal patient is considered more likely to develop a type of cancer if their similarity value falls in one of the two ranges described here.
The first range is defined by the average threshold of a cancer type and the deviation of that threshold. If a patient falls within this range, they are considered likely to develop that type of cancer.
The flag for this type of range is `normals_max_epsilon`.
The second range is also built around the average threshold of a cancer type, but this time combined with a configurable value.
The flag for this type of range is `normals_param_epsilon`.
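A minimal sketch of the two range checks follows. It assumes both ranges are symmetric intervals around the mean threshold, with the standard deviation as half-width for `normals_max_epsilon` and a user-supplied value for `normals_param_epsilon`; the project may define the bounds differently. The function names `threshold_stats` and `in_range` are hypothetical.

```python
import statistics


def threshold_stats(similarities):
    """Summarize the similarity thresholds observed for one cancer type."""
    return {
        "min": min(similarities),
        "max": max(similarities),
        "mean": statistics.fmean(similarities),
        "std": statistics.pstdev(similarities),
    }


def in_range(similarity, stats, epsilon=None):
    """Check whether a similarity lies within mean +/- half-width.

    epsilon=None uses the standard deviation as half-width (assumed behaviour
    of normals_max_epsilon); otherwise the given value is used (assumed
    behaviour of normals_param_epsilon).
    """
    half_width = stats["std"] if epsilon is None else epsilon
    return abs(similarity - stats["mean"]) <= half_width


# Made-up thresholds for a single cancer type.
stats = threshold_stats([0.70, 0.80, 0.90])
print(in_range(0.85, stats))               # inside mean +/- std
print(in_range(0.95, stats))               # outside mean +/- std
print(in_range(0.95, stats, epsilon=0.2))  # inside mean +/- 0.2
```

The same statistics dictionary holds the four values (minimum, maximum, mean, standard deviation) that the threshold files report per cancer type.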
The results of each threshold calculation are saved in the `threshold_nameOfCancer.txt` file inside the `Threshold` folder.
In addition, the `threshold.txt` file, containing the values of all calculated thresholds, is generated in the root folder of the project.
Each of its rows corresponds to a different type of cancer and contains the minimum threshold, maximum threshold, average of thresholds, and standard deviation of thresholds, respectively.
They are laid out like this:
| Cancer_Type | Min | Max | Mean | Std |
|---|---|---|---|---|
The results of the comparisons with normal patients are saved in the `Results_Comparison` folder. This folder in turn contains the `Over_Comparison` folder, which stores all normal patients who fell within the established epsilon range, and the `Over_Percentage` folder, which stores the normal patients who fell within the established epsilon range in more than 50% of the comparisons.
The files inside `Over_Comparison` contain on each row:
| Patient identifier | Calculated similarity | Type of cancer | Threshold of cancer |
|---|---|---|---|
The files inside `Over_Percentage` contain on each row:
| Patient identifier | Type of cancer |
|---|---|
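The 50% rule behind `Over_Percentage` can be illustrated as follows. This sketch assumes that the percentage is the share of a patient's k comparisons with one cancer type that fall inside a symmetric range around the mean threshold; the function names and the sample data are hypothetical.

```python
def fraction_in_range(similarities, mean, half_width):
    """Share of comparisons whose similarity lies within mean +/- half_width."""
    hits = [s for s in similarities if abs(s - mean) <= half_width]
    return len(hits) / len(similarities)


def flag_over_percentage(comparisons, mean, half_width, cutoff=0.5):
    """Return the patients whose in-range share exceeds the cutoff.

    comparisons maps a patient identifier to the k similarity scores
    obtained against patients of one cancer type.
    """
    return [
        patient_id
        for patient_id, sims in comparisons.items()
        if fraction_in_range(sims, mean, half_width) > cutoff
    ]


# Made-up similarities for two normal patients, k = 4 comparisons each.
comparisons = {
    "N1": [0.78, 0.82, 0.95, 0.81],
    "N2": [0.50, 0.79, 0.40, 0.30],
}
flagged = flag_over_percentage(comparisons, mean=0.8, half_width=0.05)
print(flagged)
```

Here `N1` falls in range in three of four comparisons and is flagged, while `N2` falls in range only once and is not.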
This repository provides a framework for integrating SHAP values into a Siamese Neural Network (SNN) for cancer-type prediction. The SNN computes a similarity score between pairs of samples, reflecting their likelihood of belonging to the same cancer type. SHAP values are used to quantify the contribution of each feature to the similarity score, providing insights into feature importance in the context of cancer classification.
- Similarity Score: Given a pair of input samples `x_i` and `x_j`, the SNN computes a similarity score `S(fv(x_i), fv(x_j)) ∈ [0, 1]`, where `fv(x)` represents the feature vector of sample `x`. This score indicates the likelihood of the samples belonging to the same cancer type.
- SHAP Value Integration: SHAP values quantify the contribution of individual features to the similarity score. However, since features for a pair of samples can assume different values (`fv_i(x)` and `fv_i(y)`), two SHAP values (`φ_i(x)` and `φ_i(y)`) are computed independently for each feature.
- Unified SHAP Value: To summarize the importance of a feature for a sample pair `p = (x, y)`, the unified SHAP value is defined as:

  `φ_i(p) = (|φ_i(x)| + |φ_i(y)|) / 2`

  This value measures the combined contribution of the feature across both samples, capturing the overall influence on the similarity score.
- Global Feature Importance: For a set of sample pairs `P`, the global SHAP importance of a feature `i` is the mean unified SHAP value across all pairs in `P`, defined as:

  `Φ_i(P) = (Σ φ_i(p) for p ∈ P) / |P|`
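The two formulas translate directly into NumPy. The sketch below assumes the per-sample SHAP values are already available as two arrays of shape `(n_pairs, n_features)`; the function names and the numbers are illustrative and not taken from the notebook.

```python
import numpy as np


def unified_shap(phi_x, phi_y):
    """phi_i(p) = (|phi_i(x)| + |phi_i(y)|) / 2 for every pair and feature."""
    return (np.abs(phi_x) + np.abs(phi_y)) / 2.0


def global_importance(phi_x, phi_y):
    """Phi_i(P): mean unified SHAP value of each feature over all pairs."""
    return unified_shap(phi_x, phi_y).mean(axis=0)


# Two pairs, two features; made-up SHAP values.
phi_x = np.array([[0.2, -0.4],
                  [0.0, 0.6]])
phi_y = np.array([[-0.6, 0.0],
                  [0.4, -0.2]])

per_pair = unified_shap(phi_x, phi_y)
importance = global_importance(phi_x, phi_y)
print(per_pair)
print(importance)
```

Taking absolute values before averaging means that a feature pushing the similarity up for one sample and down for the other still counts as important rather than cancelling out.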
To identify the most important features for each cancer type c, the following methodology is applied:
- Extract all samples corresponding to `c`.
- Generate pairs:
  - Positive Pair: Two samples from the same cancer type.
  - Negative Pair: A sample from `c` paired with one from a different cancer type.
- Compute `Φ_i(P_c)`, the cancer-specific global feature importance, using the pairs `P_c`.
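The pair-generation step can be sketched as below. How many pairs the project draws is not stated, so this example makes an assumption: it enumerates all positive pairs and samples an equal number of negative pairs. The function name `build_pairs` and the labels are hypothetical.

```python
import random
from itertools import combinations


def build_pairs(labels, cancer_type, n_negative=None, seed=0):
    """Return (positive_pairs, negative_pairs) of sample indices for one type."""
    rng = random.Random(seed)
    own = [i for i, lab in enumerate(labels) if lab == cancer_type]
    other = [i for i, lab in enumerate(labels) if lab != cancer_type]

    # Positive pairs: two distinct samples of the same cancer type.
    positive = list(combinations(own, 2))

    # Negative pairs: one sample of the type, one sample of any other type.
    all_negative = [(i, j) for i in own for j in other]
    if n_negative is None:
        n_negative = len(positive)  # assumption: balance the two kinds
    n_negative = min(n_negative, len(all_negative))
    negative = rng.sample(all_negative, n_negative)
    return positive, negative


labels = ["BRCA", "BRCA", "LUAD", "BRCA", "CESC"]
positive, negative = build_pairs(labels, "BRCA")
print(positive)
print(negative)
```

The union of both lists plays the role of `P_c`: SHAP values are computed for each pair and then averaged per feature.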
This approach considers both the similarity and dissimilarity contributions of features, leveraging all available data. By creating both positive and negative pairs, it accounts for the feature's role in distinguishing between cancer types.
This technique offers the following improvements:
- Granular Insights: Feature importance is calculated specifically for each cancer type, as opposed to the dataset-wide approach proposed in Mostavi et al. (2021).
- Gene-Level Analysis: The method enables identifying gene-associated feature importance (e.g., gene expression, genomic mutations) for each of the 24 cancer types described in the dataset.
The Jupyter notebook `aggregated_cancer_shap_analysis.ipynb` extracts the most important features from the trained Siamese network using SHAP values.
Make sure to modify the following path to match the model path:
model = load_model('path/to/siames_model', safe_mode=False, custom_objects={'initialize_bias': initialize_bias})
and load the dataset:
dataset_df, n_classes, classes, y = pre_processing('path/to/dataset.zip')
| Name | Description |
|---|---|
| Rocco Zaccagnino | Email - rzaccagnino@unisa.it |
| Gerardo Benevento | Email - gbenevento@unisa.it |
| Delfina Malandrino | Email - dmalandrino@unisa.it |