Authored by Masha Tvile, Gabriel Loyaza, Marianna Gordin, Marcel Skumantz
Alternative splicing (AS) is a key regulatory mechanism that contributes to functional complexity by generating multiple transcript isoforms from a single gene. Over 95% of multi-exon human genes undergo AS, often producing proteins with distinct functional domains [1]. This isoform variability is essential for fine-tuned cellular control and has been implicated in diseases such as cancer and neurodegeneration [2]. The clinical significance of AS has driven interest in methods to study isoform expression. While recent advances in long-read ribonucleic acid (RNA) sequencing have improved our ability to profile transcript isoforms with high resolution, its high cost and computational demands limit its scalability [3]. This presents a valuable opportunity to explore deep learning approaches as an efficient alternative for inferring isoform-level expression from more accessible data.
AS produces functionally distinct transcript isoforms that determine cell fate and disease. However, most available RNA-sequencing technologies provide only gene-level counts, masking insights into isoform expression patterns. Existing computational tools for isoform quantification have limited accuracy, particularly in single-cell data, due to sparsity and coverage constraints [4]. To address the gap between these technical limitations and cell-type-specific transcriptional diversity, our project aims to predict isoform expression from raw gene expression profiles. Building on recent advances in representation learning [5], we will compare diverse feature representations derived from principal component analysis (PCA), variational autoencoders (VAEs), and pretrained transformer models to identify the optimal strategy for capturing isoform-level biological signals [6]. The models will be trained on bulk RNA-seq data, which are denser and better validated, and then evaluated for generalization to single-cell RNA-sequencing data, enabling robust isoform-level prediction across diverse transcriptomics datasets.
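To make the planned pipeline concrete, here is a minimal sketch of the PCA branch only: fit PCA on bulk gene expression, regress isoform expression on the embeddings, then apply the same projection to sparser single-cell-like data. Everything in it is an assumption for illustration: the data are synthetic, the function names (`fit_pca`, `embed`, `fit_ridge`, `simulate`) are hypothetical, and the ridge regression is a simple stand-in rather than the project's actual predictor.

```python
import numpy as np

def fit_pca(x, n_components):
    """Fit PCA on the rows of x via SVD; return (feature mean, components)."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]

def embed(x, mean, components):
    """Project samples into the PCA space fitted on the training data."""
    return (x - mean) @ components.T

def fit_ridge(z, y, alpha=1.0):
    """Closed-form ridge regression from embeddings z to isoform targets y."""
    z_mean, y_mean = z.mean(axis=0), y.mean(axis=0)
    zc = z - z_mean
    w = np.linalg.solve(zc.T @ zc + alpha * np.eye(z.shape[1]), zc.T @ (y - y_mean))
    return w, y_mean - z_mean @ w

def simulate(rng, n_samples, gene_w, iso_w, keep=1.0):
    """Synthetic log1p gene and isoform matrices driven by shared latent factors.

    keep < 1 randomly thins the gene counts to mimic single-cell sparsity.
    """
    latent = rng.normal(size=(n_samples, gene_w.shape[0]))
    gene_counts = rng.poisson(np.exp(1.0 + 0.3 * latent @ gene_w))
    iso_counts = rng.poisson(np.exp(1.0 + 0.3 * latent @ iso_w))
    gene_counts = rng.binomial(gene_counts, keep)
    return np.log1p(gene_counts), np.log1p(iso_counts)

rng = np.random.default_rng(0)
gene_w = rng.normal(size=(5, 50))   # 5 latent factors -> 50 genes (toy sizes)
iso_w = rng.normal(size=(5, 120))   # same factors -> 120 isoforms

bulk_genes, bulk_iso = simulate(rng, 200, gene_w, iso_w)
sc_genes, sc_iso = simulate(rng, 100, gene_w, iso_w, keep=0.3)

# Fit the representation and the predictor on bulk data only.
mean, components = fit_pca(bulk_genes, n_components=10)
w, b = fit_ridge(embed(bulk_genes, mean, components), bulk_iso)

# Reuse the bulk-fitted projection on the single-cell-like data.
bulk_pred = embed(bulk_genes, mean, components) @ w + b
sc_pred = embed(sc_genes, mean, components) @ w + b
print("bulk MSE:", float(((bulk_pred - bulk_iso) ** 2).mean()))
print("single-cell MSE:", float(((sc_pred - sc_iso) ** 2).mean()))
```

The VAE and transformer branches would replace only `fit_pca` and `embed`; keeping the downstream predictor fixed is what makes the representations comparable.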
As part of the Deep Learning course, you have been given a scratch directory on a temporary, high-performance storage system on the HPC. The path to this directory is stored in the environment variable BLACKHOLE for convenience.
So, for example, use cd $BLACKHOLE to move to that directory, or ls $BLACKHOLE to list its content. For further details, please have a look at /dtu/blackhole/readme.txt.
The content of this directory is automatically deleted at each service window. The data will be deleted at the end of January 2026, after the 3-week courses. So if you need the data after that time, you have to make sure you copy it somewhere else. The data, once deleted, will not be recoverable.
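Since deleted data cannot be recovered, it may be worth scripting the copy. The sketch below is one possible way to do it with the standard library; the function name `backup_scratch` and the destination directory in the usage comment are hypothetical, and only the BLACKHOLE environment variable comes from the notice above.

```python
import os
import shutil
from pathlib import Path

def backup_scratch(subdir, dest_root, scratch_root=None):
    """Copy one sub-directory of the scratch area to a permanent location.

    scratch_root defaults to the BLACKHOLE environment variable.
    Existing files at the destination are overwritten.
    """
    scratch_root = Path(scratch_root or os.environ["BLACKHOLE"])
    src = scratch_root / subdir
    dest = Path(dest_root) / subdir
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest

# Example call (hypothetical destination, adjust before running):
# backup_scratch("pca_outputs", Path.home() / "isoform_backup")
```

For very large directories a tool such as rsync on the HPC may be a better fit; this version is only meant for the small artifact folders.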
- Start with .ipynb to understand everything and work on small subsets
- Split and modularize functionality into actual functions
  - Standard splitting workflow; it should work with subsets in the beginning to speed up development
  - Core training loop as a function that takes the model and the input/output data, to unify the process
  - Standardized evaluation loop
- Write HPC scripts to execute the various steps
  - I would suggest one script to transform the data and save it with a specific postfix, so that the file name records which transformation was used
  - We could try varying the transformation from log1p to nll binomial (not a transformation), as suggested by Edir. Would it be more biologically accurate? The TA said: yes, it could also be interesting to investigate the differences, and there is also the zero-inflated variant
- Outputs, metrics, data, etc. ("artifacts") should be readable on the login node (in terms of size), so we can generate most of the analysis information there; no batch scripts necessary
  - /dtu/blackhole/17/187991/pca_outputs/embeddings_bulk.csv
  - /dtu/blackhole/17/187991/pca_outputs/embeddings_sc.csv
  - /dtu/blackhole/17/187991/vae_outputs/embeddings_bulk.csv
  - /dtu/blackhole/17/187991/vae_outputs/embeddings_sc.csv # does not exist yet
  - /dtu/blackhole/00/223872/geneformer_outputs/bulk_log1p/embeddings.csv
  - /dtu/blackhole/00/223872/geneformer_outputs/sc_log1p/embeddings.csv
  - /dtu/blackhole/00/223872/log1p/bulk_processed_genes.h5ad
  - /dtu/blackhole/00/223872/log1p/bulk_processed_transcripts.h5ad
  - /dtu/blackhole/00/223872/log1p/sc_processed_genes.h5ad
  - /dtu/blackhole/00/223872/log1p/sc_processed_transcripts.h5ad
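The modular workflow from the list above (subset-aware splitting, one shared training loop, one standardized evaluation) could be laid out roughly as follows. This is a sketch under assumptions, not the project's code: all names are hypothetical, the data are synthetic, and the small NumPy `LinearModel` only stands in for whatever model exposes `step` and `predict`.

```python
import numpy as np

def split_indices(n_samples, val_frac=0.1, test_frac=0.1, subset=None, seed=0):
    """Shuffle sample indices and split them; subset limits the total for fast runs."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    if subset is not None:
        idx = idx[:subset]
    n_test = int(round(len(idx) * test_frac))
    n_val = int(round(len(idx) * val_frac))
    return {"test": idx[:n_test],
            "val": idx[n_test:n_test + n_val],
            "train": idx[n_test + n_val:]}

class LinearModel:
    """Stand-in model; anything with step(x, y) and predict(x) fits the loops below."""
    def __init__(self, n_in, n_out, lr=0.1):
        self.w = np.zeros((n_in, n_out))
        self.b = np.zeros(n_out)
        self.lr = lr

    def predict(self, x):
        return x @ self.w + self.b

    def step(self, x, y):
        # One gradient step on the squared error (constant factors folded into lr).
        err = self.predict(x) - y
        self.w -= self.lr * x.T @ err / len(x)
        self.b -= self.lr * err.mean(axis=0)
        return float((err ** 2).mean())

def train(model, x, y, epochs=20, batch_size=32, seed=0):
    """Shared mini-batch training loop; returns the mean loss per epoch."""
    rng = np.random.default_rng(seed)
    history = []
    for _ in range(epochs):
        order = rng.permutation(len(x))
        losses = [model.step(x[order[i:i + batch_size]], y[order[i:i + batch_size]])
                  for i in range(0, len(x), batch_size)]
        history.append(float(np.mean(losses)))
    return history

def evaluate(model, x, y):
    """Standardized evaluation: overall MSE and mean per-target Pearson correlation."""
    pred = model.predict(x)
    pc, yc = pred - pred.mean(axis=0), y - y.mean(axis=0)
    denom = np.sqrt((pc ** 2).sum(axis=0) * (yc ** 2).sum(axis=0))
    r = np.divide((pc * yc).sum(axis=0), denom,
                  out=np.full(y.shape[1], np.nan), where=denom > 0)
    return {"mse": float(((pred - y) ** 2).mean()),
            "mean_pearson": float(np.nanmean(r))}

# Toy run on synthetic data (8 input features, 3 targets).
rng = np.random.default_rng(1)
x = rng.normal(size=(300, 8))
y = x @ rng.normal(size=(8, 3)) + 0.1 * rng.normal(size=(300, 3))

splits = split_indices(len(x))
model = LinearModel(n_in=8, n_out=3)
history = train(model, x[splits["train"]], y[splits["train"]])
metrics = evaluate(model, x[splits["test"]], y[splits["test"]])
print(metrics)
```

Passing `subset=50` to `split_indices` is the intended hook for quick notebook experiments before the full runs are submitted on the HPC.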
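On the log1p versus "nll binomial" question in the list above: assuming this refers to a negative binomial negative log-likelihood (the usual count model for RNA-seq, with the zero-inflated version adding an extra probability of observing zero), the difference can be sketched with the standard library alone. The mean/inverse-dispersion parameterisation and the function names below are choices made for this illustration.

```python
import math

def log1p_squared_error(y, mu):
    """Squared error after the log1p transformation of observed and predicted counts."""
    return (math.log1p(y) - math.log1p(mu)) ** 2

def nb_nll(y, mu, theta):
    """Negative log-likelihood of count y under a negative binomial
    with mean mu > 0 and inverse dispersion theta > 0."""
    log_p = (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
             + theta * math.log(theta / (theta + mu))
             + y * math.log(mu / (theta + mu)))
    return -log_p

def zinb_nll(y, mu, theta, pi):
    """Zero-inflated variant: with probability pi (0 <= pi < 1) the count is forced to zero."""
    nb_log_p = -nb_nll(y, mu, theta)
    if y == 0:
        return -math.log(pi + (1.0 - pi) * math.exp(nb_log_p))
    return -(math.log(1.0 - pi) + nb_log_p)

# Same prediction (mu = 3) scored against a few observed raw counts.
for count in (0, 3, 20):
    print(count,
          round(log1p_squared_error(count, 3.0), 3),
          round(nb_nll(count, 3.0, 2.0), 3),
          round(zinb_nll(count, 3.0, 2.0, 0.2), 3))
```

With the likelihood-based losses the model is trained on raw counts and predicts distribution parameters, so no transformed copy of the data needs to be stored; only the loss changes.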