Leveraging gene expression and genomic variation for cancer prediction using one-shot learning

The Cancer Genome Atlas (TCGA), a cancer genomics reference program, has molecularly characterized more than 20,000 primary cancer samples and paired normal samples covering 33 types of cancer. This joint effort between the National Cancer Institute (NCI) and the National Human Genome Research Institute began in 2006. In the twelve years since, TCGA has generated more than 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data. These data have led to improvements in the ability to diagnose, treat, and prevent cancer by helping to establish the importance of cancer genomics.

Contribution of this work

During the experimental process, the dataset was significantly enlarged to improve the diversity and representativeness of the data. This allowed the model to learn from a wider variety of examples, improving its generalization. In addition, adjustments were made to both models involved in this study: the classification model and the Siamese model. A key element of the optimization was the use of custom weights, which allowed different weights to be assigned to different instances of the dataset based on the number of samples present. Finally, the types of mutations were specified explicitly, which allowed for greater precision in the analysis of genetic information. Numerous ongoing studies aim to identify a distinctive genomic signature for different types of cancer.
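
As a rough illustration of weighting by sample count, the sketch below computes one weight per cancer type, inversely proportional to how many samples that type has. The inverse-frequency formula and the function name inverse_frequency_weights are assumptions made for illustration; the exact weighting used in this project may differ.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per class, inversely proportional to its sample count.

    Assumption: weight = n_samples / (n_classes * class_count), a common
    balancing heuristic, not necessarily the one used in this repository.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels: three samples of one type, one of another.
labels = ["BRCA", "BRCA", "BRCA", "LUAD"]
weights = inverse_frequency_weights(labels)
print(weights)
```

Rare classes receive larger weights, so their samples count more during training. With Keras, a dictionary like this (re-keyed by integer class index) could be passed as the class_weight argument of Model.fit.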

Requirements (tested)

pip install -r requirements.txt

To install TensorFlow, follow this guide: link
To install and set up CUDA and cuDNN, follow this guide:

Dataset of cancer patients

The dataset of cancer patients is built from data obtained from cBioPortal for Cancer Genomics and consists of several files, namely:

data_clinical_patient, which contains clinical data on the patient (such as Patient ID, gender, and tumor status)
data_clinical_sample, which contains data regarding the tumor samples
data_cna, which contains information about changes in the copy number of specific DNA segments
data_methylation, which contains information on the DNA methylation
data_mrna_seq_v2_rsem, which contains the mRNA sequencing data of tumor samples
data_mutations, which contains mutation data obtained by whole-exome sequencing
data_rppa, which contains data on the expression of the proteins

Pre-Processing

The project contains the pre-processing folder, which holds the .py files used to pre-process the datasets.
The folder contains subfolders for different functions:

The data_cleaning folder, which contains the files for cleaning and formatting the dataset, as well as the files for calculating the normalization and standard deviation of the values;

The merged_data folder, which contains the files for merging the files that make up the final dataset;

The utils folder, which contains useful functions for any necessary changes to the dataset, such as changing the CSV delimiter or deleting columns.

To pre-process the dataset correctly, run at least the following commands, in this order:

python3 traspose.py: takes the data_mrna_seq_v2_rsem file as input and returns the transposed CSV

python3 normalize.py: takes the result of the previous script as input and returns a new CSV file with the normalized values

python3 deviazione.py: takes the result of the previous script as input and returns a new CSV file with the standard deviation applied

python3 add_variantType.py: takes the result of the previous script together with the files data_mutations.csv and data_cna.csv and returns the dataset with the gene variants

python3 cna_scaling.py: takes the result of the previous script as input and returns the dataset with normalized gene variants
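
To make the first two steps more concrete, here is a minimal sketch of what transposing and normalizing an expression table can look like. It is not the code of traspose.py or normalize.py: the min-max scaling, the helper names, and the tiny tab-separated input (which omits columns such as the Entrez gene ID) are assumptions chosen for illustration.

```python
import csv
import io


def transpose_rows(rows):
    """Swap rows and columns: genes-by-samples becomes samples-by-genes."""
    return [list(col) for col in zip(*rows)]


def min_max_normalize(values):
    """Scale floats to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Tiny stand-in for data_mrna_seq_v2_rsem: one row per gene, one column per sample.
raw = "Hugo_Symbol\tS1\tS2\tS3\nTP53\t10\t20\t30\nBRCA1\t5\t5\t5\n"
rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))

transposed = transpose_rows(rows)
header, samples = transposed[0], transposed[1:]  # header now lists the genes

# Normalize each gene column independently across samples.
gene_columns = list(zip(*[[float(x) for x in s[1:]] for s in samples]))
scaled_columns = [min_max_normalize(list(col)) for col in gene_columns]
normalized = [
    [s[0]] + [col[i] for col in scaled_columns] for i, s in enumerate(samples)
]
print(header)
print(normalized)
```

After transposition each row is a sample and each column a gene, which is the layout a classifier usually expects.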

Alternatively, if you do not want to perform the above steps, you can download the dataset already used for this experiment: LINK.
The files should be downloaded into a folder named Dataset.

Dataset of people without cancer (Normals)

Exactly as with the previously mentioned dataset, we need the files:

data_clinical_patient, which contains clinical data on the patient (such as Patient ID, gender, and tumor status)
data_clinical_sample, which contains data regarding the tumor samples
data_cna, which contains information about changes in the copy number of specific DNA segments
data_methylation, which contains information on the DNA methylation
data_mrna_seq_v2_rsem, which contains the mRNA sequencing data of tumor samples
data_mutations, which contains mutation data obtained by whole-exome sequencing
data_rppa, which contains data on the expression of the proteins

However, given the current absence of normal patient data within the cBioPortal for Cancer Genomics platform, the dataset of normal patients was formed from data found on the GitHub of cBioPortal in the DataHub section. More specifically, the cesc_tcga dataset containing data on normal patients was used.

Link to the normals dataset used: LINK
Files related to the normals dataset should be placed within a folder called Normals in the root folder of the project

Pre-Processing normals

To work on normal patients, pre-processing has to follow a somewhat different procedure, since their dataset contains a variety of parameters that are not used by the network and are therefore negligible.

As with standard pre-processing, the first step is to transpose the dataset with python3 traspose.py.
Second step: save the indexes and the cancer status of each normal with python3 normals_statusAndindex.py; they are used later on.
Third step: perform normalization with python3 normalize.py.
Fourth step: calculate the deviation with python3 deviazione.py.
Fifth step: remove the columns with information not needed by the network with python3 data_cleaning_normals.py.
Sixth step: calculate the variants of the genes with python3 add_variantType.py.
Seventh step: normalize the variant column _cna with python3 cna_scaling.py.
Eighth step: add the columns for the genes that were not found with python3 add_missing_variants.py.
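
Since the order of the eight steps matters, a small driver script can help avoid running them out of sequence. The sketch below is a hypothetical helper, not part of the repository: it assumes every script sits in the current working directory and needs no command-line arguments, which may not match the actual folder layout.

```python
import subprocess

# Script names taken from the steps above, in the required order.
NORMALS_STEPS = [
    "traspose.py",
    "normals_statusAndindex.py",
    "normalize.py",
    "deviazione.py",
    "data_cleaning_normals.py",
    "add_variantType.py",
    "cna_scaling.py",
    "add_missing_variants.py",
]


def build_commands(steps, python="python3"):
    """Turn script names into argument lists suitable for subprocess.run."""
    return [[python, script] for script in steps]


def run_pipeline(steps, dry_run=True):
    """Print the commands (dry run) or execute them, stopping at the first failure."""
    for cmd in build_commands(steps):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Dry run only: shows the order without executing anything.
run_pipeline(NORMALS_STEPS)
```

Calling run_pipeline(NORMALS_STEPS, dry_run=False) would execute the scripts one after another; check=True makes the run stop as soon as one of them fails.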

Main.py

To run the program, run python3 main.py.
Before running it, however, you need some information about how the script works.

Configuration

The script contains the following settings and paths:

Dataset

dataset_path: path for the dataset that we want to use
encoded_path: path for the encoded version of the dataset;
data_encoded = False: boolean flag that controls whether the encoded version of the dataset is generated or loaded (if this is the first time you run the code, leave the default value)
- False: the encoding is generated;
- True: an existing encoding is loaded;
only_variant = False: set this to True if you use the dataset that contains only the variations in gene mutations;

Classification

model_path: where the model will be saved or loaded from;
risultati_classification: path for the results of the classification;
classification = True: boolean flag to run the classification;
classification_normals = True: boolean flag to run the classification on normals dataset;

Siamese

siamese_path: where the model of the siamese network will be saved or loaded from;
risultati_siamese: path for the results of the siamese network;
siamese_net = True: boolean flag to run the siamese network;
siamese_variants = True: set this to True if you use the dataset that contains the variations in gene mutations;
siamese_normals = True: boolean flag to run the siamese network with the normals dataset;
normals_max_epsilon = False: type of comparison range for normals (has an effect only if siamese_normals is True);
normals_param_epsilon = True: type of comparison range for normals (has an effect only if siamese_normals is True);

Normals

normals_path: path to the dataset that contains people with and without the disease;

Keep in mind

The Siamese network can only be launched if a classification model has already been trained and saved. In this project the classification model has already been trained. If you want to use the models in this project without starting the experiments again, set the parameters as follows:

only_variant = False
data_encoded = True
classification = False
classification_normals = False
siamese_net = True
siamese_normals = False
siamese_variants = True
normals_max_epsilon = False
normals_param_epsilon = False
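
The dependencies between these flags can be checked before launching a run. The sketch below is a hypothetical helper, not part of main.py: the RunConfig class and check_config function are names invented for illustration, and the rules encode only what is stated in this README.

```python
import os
from dataclasses import dataclass


@dataclass
class RunConfig:
    """Subset of the main.py settings; field names follow the README."""
    model_path: str
    classification: bool = True
    siamese_net: bool = True
    siamese_normals: bool = True
    normals_max_epsilon: bool = False
    normals_param_epsilon: bool = True


def check_config(cfg, exists=os.path.exists):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # The siamese network needs a classification model, either trained in
    # this run (classification=True) or already saved at model_path.
    if cfg.siamese_net and not cfg.classification and not exists(cfg.model_path):
        problems.append("siamese_net needs a trained classification model at model_path")
    # The epsilon flags only matter when the normals comparison is enabled.
    if (cfg.normals_max_epsilon or cfg.normals_param_epsilon) and not cfg.siamese_normals:
        problems.append("epsilon flags have no effect unless siamese_normals is True")
    return problems


cfg = RunConfig(
    model_path="path/to/model",
    classification=False,
    siamese_normals=False,
    normals_param_epsilon=False,
)
print(check_config(cfg, exists=lambda path: False))
```

The exists parameter is injectable so the check can be exercised without touching the file system.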

Siamese model with normal patients

To perform the operations on the normals datasets, the pre-trained siamese network found in the Detection-signature-cancer/code/models/0005/siamese/espressione_genomica_con_varianti_2LAYER/ folder was used.

When siamese_normals is True, the program creates a set of similarity thresholds for each cancer type and, for each normal, checks whether the patient is inclined to contract a disease of that type.

The comparison is made between a normal and a parameterizable number, k, of patients with each type of cancer.
Thus each normal is compared with each cancer type k times to calculate their propensity to get that type of cancer.

A normal patient is considered more likely to get a type of cancer if his or her similarity value falls within one of the ranges presented below.
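
The comparison of one normal against k patients of each cancer type can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, patients are drawn at random, and toy_similarity is a simple stand-in for the score that the trained Siamese network would produce.

```python
import random


def compare_with_cancer_types(normal, patients_by_type, similarity, k, seed=0):
    """Compare one normal with k randomly chosen patients of every cancer type."""
    rng = random.Random(seed)
    scores = {}
    for cancer_type, patients in patients_by_type.items():
        chosen = rng.sample(patients, min(k, len(patients)))
        scores[cancer_type] = [similarity(normal, p) for p in chosen]
    return scores


def toy_similarity(a, b):
    """Stand-in for the Siamese output: 1 for identical vectors, lower otherwise."""
    distance = sum(abs(x - y) for x, y in zip(a, b))
    return 1.0 / (1.0 + distance)


# Made-up feature vectors, for illustration only.
patients_by_type = {
    "BRCA": [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]],
    "LUAD": [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]],
}
normal = [0.2, 0.8]
scores = compare_with_cancer_types(normal, patients_by_type, toy_similarity, k=2)
print(scores)
```

The result holds k similarity values per cancer type, which can then be tested against that type's range.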

Different ranges for comparisons

To determine whether a normal patient was likely to get a certain type of cancer, we decided to rely on two distinct ranges.

Max Epsilon

The first range is built from the average threshold of a cancer type and the standard deviation of that threshold. If a patient's similarity falls within this range, he or she is considered likely to contract that type of cancer. The flag for this type of range is normals_max_epsilon.

Param Epsilon

The second range is again based on the average threshold of a cancer type, but this time combined with a parameterizable value. The flag for this type of range is normals_param_epsilon.
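
One possible reading of the two ranges is sketched below. It assumes both ranges are symmetric intervals around the mean threshold, with a half-width equal to the standard deviation (normals_max_epsilon) or to a user-chosen epsilon (normals_param_epsilon). The exact bounds used by the project are not stated here, so treat this as an assumption, not as the actual implementation.

```python
def in_max_epsilon_range(similarity, mean, std):
    """Assumed Max Epsilon test: mean - std <= similarity <= mean + std."""
    return mean - std <= similarity <= mean + std


def in_param_epsilon_range(similarity, mean, epsilon):
    """Assumed Param Epsilon test: mean - epsilon <= similarity <= mean + epsilon."""
    return mean - epsilon <= similarity <= mean + epsilon


# Made-up threshold statistics for one cancer type, for illustration only.
mean_threshold, std_threshold = 0.8, 0.1

print(in_max_epsilon_range(0.75, mean_threshold, std_threshold))
print(in_param_epsilon_range(0.75, mean_threshold, epsilon=0.02))
```

A smaller epsilon gives a stricter test than the standard deviation, so fewer normal patients are flagged.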

Results

The results of each threshold calculation are saved within the threshold_nameOfCancer.txt file inside the Threshold folder.
In addition, in the root folder of the project, the threshold.txt file containing the values of all calculated thresholds is also generated

Each of its rows is a different type of cancer and contains minimum threshold, maximum threshold, average of thresholds, and standard deviation of thresholds, respectively.
So they will be displayed like this:

Cancer_Type	Min	Max	Mean	Std
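
A file in this layout can be read back with a few lines of Python. The sketch below assumes whitespace- or tab-separated rows with the cancer type first and the four numbers last, and it skips any row whose last four fields are not numeric (such as a header). The Threshold type and parse_threshold_lines are hypothetical names, and the sample values are invented.

```python
from typing import NamedTuple


class Threshold(NamedTuple):
    cancer_type: str
    minimum: float
    maximum: float
    mean: float
    std: float


def parse_threshold_lines(lines):
    """Map each cancer type to its (min, max, mean, std) statistics."""
    thresholds = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 5:
            continue  # blank or malformed row
        name = " ".join(parts[:-4])
        try:
            values = [float(x) for x in parts[-4:]]
        except ValueError:
            continue  # non-numeric row, e.g. a header
        thresholds[name] = Threshold(name, *values)
    return thresholds


# Invented example rows, not real results.
sample_lines = [
    "Cancer_Type\tMin\tMax\tMean\tStd",
    "BRCA\t0.61\t0.97\t0.84\t0.07",
]
parsed = parse_threshold_lines(sample_lines)
print(parsed["BRCA"])
```

To read the real file, pass an open file object instead of the list, for example parse_threshold_lines(open("threshold.txt")).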

The results of the comparisons with normal patients, on the other hand, are saved within the Results_Comparison folder.
This folder in turn contains the Over_Comparison folder, where all normal patients who have fallen within the established range of epsilon are saved, and the Over_Percentage folder, where normal patients who have fallen within the established range of epsilon in more than 50% of the comparisons are saved.

The files inside Over_Comparison contain on each row:

Patient identifier	Calculated similarity	Type of cancer	Threshold of cancer

The files inside Over_Percentage contain on each row:

Patient identifier	Type of cancer
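
The split between Over_Comparison and Over_Percentage can be sketched as below for a single cancer type. The function names and the toy scores are assumptions; only the two rules (at least one comparison in range, and more than 50% of comparisons in range) come from the description above.

```python
def fraction_in_range(scores, low, high):
    """Share of similarity scores that fall inside [low, high]."""
    if not scores:
        return 0.0
    hits = sum(1 for s in scores if low <= s <= high)
    return hits / len(scores)


def split_results(scores_by_patient, low, high, cutoff=0.5):
    """Return (over_comparison, over_percentage) patient lists."""
    over_comparison, over_percentage = [], []
    for patient_id, scores in scores_by_patient.items():
        frac = fraction_in_range(scores, low, high)
        if frac > 0:
            over_comparison.append(patient_id)  # in range at least once
        if frac > cutoff:
            over_percentage.append(patient_id)  # in range more than 50% of the time
    return over_comparison, over_percentage


# Invented scores: k = 3 comparisons per normal patient.
scores_by_patient = {
    "N1": [0.80, 0.82, 0.50],
    "N2": [0.50, 0.81, 0.40],
    "N3": [0.10, 0.20, 0.30],
}
over_comparison, over_percentage = split_results(scores_by_patient, low=0.75, high=0.90)
print(over_comparison, over_percentage)
```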

SHAP-enhanced SNNs: a novel mathematical perspective

Overview

This repository provides a framework for integrating SHAP values into a Siamese Neural Network (SNN) for cancer-type prediction. The SNN computes a similarity score between pairs of samples, reflecting their likelihood of belonging to the same cancer type. SHAP values are used to quantify the contribution of each feature to the similarity score, providing insights into feature importance in the context of cancer classification.

Key Concepts

Similarity Score: Given a pair of input samples x_i and x_j, the SNN computes a similarity score S(fv(x_i), fv(x_j)) ∈ [0, 1], where fv(x) represents the feature vector of sample x. This score indicates the likelihood of the samples belonging to the same cancer type.
SHAP Value Integration: SHAP values quantify the contribution of individual features to the similarity score. However, since features for a pair of samples can assume different values (fv_i(x) and fv_i(y)), two SHAP values (φ_i(x) and φ_i(y)) are computed independently for each feature.
Unified SHAP Value: To summarize the importance of a feature for a sample pair p = (x, y), the unified SHAP value is defined as:
φ_i(p) = (|φ_i(x)| + |φ_i(y)|) / 2
This value measures the combined contribution of the feature across both samples, capturing the overall influence on the similarity score.
Global Feature Importance: For a set of sample pairs P, the global SHAP importance of a feature i is the mean unified SHAP value across all pairs in P, defined as:
Φ_i(P) = (Σ_{p ∈ P} φ_i(p)) / |P|
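
The two definitions translate directly into code. The sketch below assumes the per-sample SHAP values have already been computed and are given as plain lists of floats, one entry per feature; the function names and the toy numbers are illustrative only.

```python
def unified_shap(phi_x, phi_y):
    """φ_i(p) = (|φ_i(x)| + |φ_i(y)|) / 2 for every feature i of a pair p = (x, y)."""
    if len(phi_x) != len(phi_y):
        raise ValueError("both samples must have the same number of features")
    return [(abs(a) + abs(b)) / 2 for a, b in zip(phi_x, phi_y)]


def global_importance(pairs):
    """Φ_i(P): mean unified SHAP value of each feature over all pairs in P."""
    if not pairs:
        raise ValueError("at least one pair is required")
    unified = [unified_shap(phi_x, phi_y) for phi_x, phi_y in pairs]
    return [sum(column) / len(unified) for column in zip(*unified)]


# Two pairs, two features each; invented SHAP values.
pairs = [
    ([0.25, -0.5], [-0.75, 0.0]),
    ([0.0, 1.0], [0.5, -0.5]),
]
importance = global_importance(pairs)
print(importance)
```

Taking absolute values before averaging means a feature counts as important whether it pushes the similarity score up or down.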

Cancer-Specific Feature Importance

To identify the most important features for each cancer type c, the following methodology is applied:

Extract all samples corresponding to c.
Generate pairs:
- Positive Pair: Two samples from the same cancer type.
- Negative Pair: A sample from c paired with one from a different cancer type.
Compute Φ_i(P_c), the cancer-specific global feature importance, using the pairs in P_c.

This approach considers both the similarity and dissimilarity contributions of features, leveraging all available data. By creating both positive and negative pairs, it accounts for the feature's role in distinguishing between cancer types.
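
Pair generation for a cancer type c can be sketched as follows. The random sampling strategy, the number of pairs, and the function name make_pairs are assumptions; the notebook may build its pairs differently.

```python
import random


def make_pairs(samples, labels, cancer_type, n_pairs, seed=0):
    """Build n_pairs positive and n_pairs negative pairs for one cancer type."""
    rng = random.Random(seed)
    same = [s for s, label in zip(samples, labels) if label == cancer_type]
    other = [s for s, label in zip(samples, labels) if label != cancer_type]
    if len(same) < 2 or not other:
        raise ValueError("need at least two samples of the type and one of another type")
    # Positive pair: two distinct samples of the same cancer type.
    positives = [tuple(rng.sample(same, 2)) for _ in range(n_pairs)]
    # Negative pair: one sample of the type, one of a different type.
    negatives = [(rng.choice(same), rng.choice(other)) for _ in range(n_pairs)]
    return positives, negatives


# Sample identifiers and labels are invented for illustration.
samples = ["a", "b", "c", "d", "e"]
labels = ["BRCA", "BRCA", "BRCA", "LUAD", "KIRC"]
positives, negatives = make_pairs(samples, labels, "BRCA", n_pairs=3)
print(positives, negatives)
```

The union of both lists plays the role of P_c when computing the cancer-specific importance.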

Advantages

This technique offers the following improvements:

Granular Insights: Feature importance is calculated specifically for each cancer type, as opposed to the dataset-wide approach proposed in Mostavi et al. (2021).
Gene-Level Analysis: The method enables identifying gene-associated feature importance (e.g., gene expression, genomic mutations) for each of the 24 cancer types described in the dataset.

Code

The Jupyter notebook aggregated_cancer_shap_analysis.ipynb lets you extract the most important features using SHAP values and the trained siamese network. Take care to modify the following path to match the model path:

model = load_model('path/to/siames_model', safe_mode=False, custom_objects={'initialize_bias': initialize_bias})

and load the dataset:

dataset_df, n_classes, classes, y = pre_processing('path/to/dataset.zip')

Author & Contacts

Name	Description
Rocco Zaccagnino	Email - rzaccagnino@unisa.it
Gerardo Benevento	Email - gbenevento@unisa.it
Delfina Malandrino	Email - dmalandrino@unisa.it
