This repository contains Jupyter notebooks to investigate the impact of O-GlcNAcylation dysregulation on cancer through data-driven methodologies. The analysis uses gene expression data from TCGA (The Cancer Genome Atlas) to quantify dysregulation.
- 1_data_download.ipynb
  This notebook provides step-by-step instructions to download RNA-seq gene expression data from the GDC portal using the GDC Data Transfer Tool. It specifies how to structure and organize downloaded files for further analysis.
- 2_KDE_generation.ipynb
  This notebook generates Kernel Density Estimates (KDEs) for visualizing and analyzing OGT and OGA expression distributions across different cancer types.
- 3_simulation_runs.ipynb
  This notebook simulates data to compare different metrics, including KDE-based measures and other regulation measures, for distinguishing between healthy and cancerous tissues.
- 4_modeling.ipynb
  This notebook performs modeling using real cancer datasets, quantifying the relationship between OGT and OGA expression. It provides statistical insights into O-GlcNAcylation dysregulation across cancers.
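As a rough illustration of the KDE step, the sketch below fits a joint two-dimensional Gaussian KDE to paired OGT and OGA values with `scipy.stats.gaussian_kde` and evaluates it on a grid. It is not taken from the notebooks: the synthetic data, the choice of a joint (rather than per-gene) KDE, the default bandwidth, and the helper name `joint_kde_grid` are all assumptions made for the example.

```python
import numpy as np
from scipy.stats import gaussian_kde


def joint_kde_grid(ogt, oga, n_points=50):
    """Fit a 2-D Gaussian KDE to paired values and evaluate it on a regular grid.

    Hypothetical helper for illustration; the notebooks may compute KDEs differently.
    """
    # gaussian_kde expects an array of shape (n_dimensions, n_samples).
    kde = gaussian_kde(np.vstack([ogt, oga]))
    grid_x, grid_y = np.meshgrid(
        np.linspace(ogt.min(), ogt.max(), n_points),
        np.linspace(oga.min(), oga.max(), n_points),
    )
    grid_points = np.vstack([grid_x.ravel(), grid_y.ravel()])
    density = kde(grid_points).reshape(grid_x.shape)
    return grid_x, grid_y, density


# Synthetic stand-ins for log-scale expression values (assumed, not TCGA data).
rng = np.random.default_rng(0)
ogt = rng.normal(loc=5.0, scale=0.5, size=200)
oga = 0.8 * ogt + rng.normal(loc=1.0, scale=0.3, size=200)

grid_x, grid_y, density = joint_kde_grid(ogt, oga)
print(density.shape)
```

The returned `density` array can be passed to a plotting call such as `matplotlib.pyplot.contourf(grid_x, grid_y, density)` to visualize the estimated distribution.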
- Python 3.8 or later
- Jupyter Notebook
- Required Python libraries: numpy, pandas, matplotlib, seaborn, scipy, scikit-learn (imported as sklearn); refer to each notebook for specific imports
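Before opening the notebooks, it may help to confirm that the environment meets these requirements. The sketch below uses only the standard library; the helper names `missing_modules` and `python_is_supported` are hypothetical and not part of the repository.

```python
import sys
from importlib.util import find_spec

# Import names of the libraries listed above (scikit-learn is imported as "sklearn").
REQUIRED_MODULES = ["numpy", "pandas", "matplotlib", "seaborn", "scipy", "sklearn"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


def python_is_supported(minimum=(3, 8)):
    """Check that the running interpreter is at least the given (major, minor) version."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("Python version OK:", python_is_supported())
    print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty list of missing modules means every listed library can be imported from the active environment.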
For the analysis, RNA-seq gene expression data must be downloaded and organized as follows:
- Go to the GDC Portal.
- Select the cancer type of interest by tissue (e.g., Breast, Blood and Bone Marrow).
- Under the Repository, apply the following filters:
  - Experimental Strategy: RNA-seq
  - Data Category: Transcriptome Profiling
  - Data Type: Gene Expression Quantification
- Use the GDC Data Transfer Tool to download the sample gene expression data.
- Organize the downloaded `.tsv` files in the following structure: `/data/TCGA_GeneExpression/{cancer}/gene_expression/`. Replace `{cancer}` with the specific cancer type.
data/
└── TCGA_GeneExpression/
    ├── Kidney/
    │   └── gene_expression/
    │       ├── sample_1.tsv
    │       ├── sample_2.tsv
    │       └── ...
    ├── Lung/
    │   └── gene_expression/
    │       ├── sample_1.tsv
    │       ├── sample_2.tsv
    │       └── ...
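One way to check that the downloaded files follow this layout is to walk the directory and count the `.tsv` files found for each cancer type. This is a minimal sketch using only `pathlib`; the helper name `list_expression_files` and the relative root `data/TCGA_GeneExpression` are assumptions for the example, so adjust the path to match your setup.

```python
from pathlib import Path


def list_expression_files(root):
    """Map each cancer-type folder under `root` to its sorted .tsv sample files.

    Hypothetical helper; folders without a gene_expression/ subfolder are skipped.
    """
    root = Path(root)
    samples = {}
    for cancer_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        expression_dir = cancer_dir / "gene_expression"
        if expression_dir.is_dir():
            samples[cancer_dir.name] = sorted(expression_dir.glob("*.tsv"))
    return samples


if __name__ == "__main__":
    # Assumed location of the data root, relative to the working directory.
    data_root = Path("data/TCGA_GeneExpression")
    if data_root.is_dir():
        for cancer, files in list_expression_files(data_root).items():
            print(f"{cancer}: {len(files)} sample file(s)")
    else:
        print(f"{data_root} not found; adjust the path to your data directory.")
```

A cancer type that reports zero files, or that is missing from the output, usually indicates that its files were placed outside the expected `gene_expression/` folder.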
- Data Download: Run the `1_data_download.ipynb` notebook to confirm data requirements and download the necessary files.
- KDE Generation: Execute `2_KDE_generation.ipynb` to compute and visualize KDEs for the selected cancer types.
- Simulation Runs: Use `3_simulation_runs.ipynb` to simulate data and compare different measures of regulation.
- Modeling: Run `4_modeling.ipynb` to apply the methodology to TCGA datasets.
- Ensure the data is preprocessed as described in `1_data_download.ipynb` before proceeding to the analysis notebooks.
- Modify paths and cancer types in each notebook as necessary.