Predicting isoform expression through learned gene representations from bulk and scRNA-seq

Authored by Masha Tvile, Gabriel Loyaza, Marianna Gordin, Marcel Skumantz

1 Motivation

Alternative splicing (AS) is a key regulatory mechanism that contributes to functional complexity by generating multiple transcript isoforms from a single gene. Over 95% of multi-exon human genes undergo AS, often producing proteins with distinct functional domains [1]. This isoform variability is essential for fine-tuned cellular control and has been implicated in diseases such as cancer and neurodegeneration [2]. The clinical significance of AS has driven interest in methods to study isoform expression. While recent advances in long-read ribonucleic acid (RNA) sequencing have improved our ability to profile transcript isoforms at high resolution, their high cost and computational demands limit their scalability [3]. This presents a valuable opportunity to explore deep learning approaches as an efficient alternative for inferring isoform-level expression from more accessible data.

2 Background

AS produces functionally distinct transcript isoforms that determine cell fate and disease. However, most available RNA-sequencing technologies provide only gene-level counts, masking insights into isoform expression patterns. Existing computational tools for isoform quantification have limited accuracy, particularly in single-cell data, due to sparsity and coverage constraints [4]. To address the gap between these technical limitations and cell-type-specific transcriptional diversity, our project aims to predict isoform expression from raw gene expression profiles. Building on recent advances in representation learning [5], we will compare diverse feature representations derived from Principal Component Analysis (PCA), variational autoencoders (VAEs), and pretrained transformer models to identify the optimal strategy for capturing isoform-level biological signals [6]. The models will be trained on bulk RNA-seq data, which is denser and better validated, and then evaluated for generalization to single-cell RNA-sequencing data, enabling robust isoform-level prediction across diverse transcriptomics datasets.
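To make the bulk-to-single-cell idea concrete, here is a minimal sketch of the simplest of the three representations, PCA. It assumes log1p-transformed gene matrices with samples in rows and uses synthetic Poisson counts instead of the project data. The helper names `fit_pca` and `project` are hypothetical, not taken from the repository. The components are fitted on bulk data only and then reused to embed the single-cell data.

```python
import numpy as np


def fit_pca(X, n_components):
    """Fit PCA via SVD; returns the feature means and the top components."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(X, mean, components):
    """Embed samples using a previously fitted mean and components."""
    return (X - mean) @ components.T


# Synthetic stand-ins (assumption): 50 bulk samples, 30 cells, 200 genes.
rng = np.random.default_rng(0)
bulk = np.log1p(rng.poisson(5.0, size=(50, 200)).astype(float))
sc = np.log1p(rng.poisson(0.5, size=(30, 200)).astype(float))

# Fit on bulk only, then apply the same projection to single-cell data.
mean, components = fit_pca(bulk, n_components=10)
bulk_emb = project(bulk, mean, components)
sc_emb = project(sc, mean, components)
```

Reusing the bulk mean and components for the single-cell matrix keeps both embeddings in the same coordinate system, which a downstream isoform predictor trained on bulk data would likely need.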

Additional information for the project

As part of the Deep Learning course, you have been given a scratch directory on a temporary, high-performance storage system on the HPC. The path to this directory is stored in the environment variable BLACKHOLE for convenience.

For example, run

cd $BLACKHOLE

to move to that directory, or

ls $BLACKHOLE

to list its content. For further details, please have a look at /dtu/blackhole/readme.txt.

The content of this directory is automatically deleted at each service window. The data will be deleted at the end of January 2026, after the 3-week courses. So if you need the data after that time, you have to make sure you copy it somewhere else. The data, once deleted, will not be recoverable.
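Since the scratch data is not recoverable once deleted, a small backup helper may be useful. The sketch below reads the scratch root from the BLACKHOLE environment variable and copies one subdirectory to a destination of your choice. The function name `backup_scratch` and the example destination are hypothetical. It assumes Python 3.8 or newer, which `dirs_exist_ok` requires.

```python
import os
import shutil
from pathlib import Path


def backup_scratch(subdir, destination, scratch_root=None):
    """Copy <scratch_root>/<subdir> to <destination>/<subdir>.

    scratch_root defaults to the BLACKHOLE environment variable.
    """
    root = Path(scratch_root or os.environ["BLACKHOLE"])
    source = root / subdir
    if not source.is_dir():
        raise FileNotFoundError(f"nothing to back up at {source}")
    target = Path(destination) / subdir
    # dirs_exist_ok=True lets the backup be re-run into an existing folder.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target
```

A call such as `backup_scratch("pca_outputs", Path.home() / "dl_backup")` would copy that folder into the home directory. Check the available quota first, since the scratch directory can hold far more than a home directory.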

HPC JOB Documentation

Tasks distribution

Trello Invite

Project Architecture and Workflow:

Start with .ipynb notebooks to understand everything and work on small subsets
Split and modularize the functionality into actual functions
- Standard splitting workflow; it should work with subsets at first to speed up development
- Core training loop as a function that takes the model and the input/output data, to unify the process
- Standardized evaluation loop
Write HPC scripts to execute the various steps
- I would suggest one script to transform the data and save it with a specific postfix, so that the file name tells us which transformation was applied
- We could try replacing the log1p transformation with a binomial negative log-likelihood (NLL) loss (not a transformation), as suggested by Edir. Would that be more biologically accurate? The TA said it could be interesting to investigate the differences, and that there is also a zero-inflated variant
Outputs, metrics, data, etc. ("artifacts") should be small enough to read on the login node, so that we can generate most of the analysis there; no batch scripts necessary
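The modular pieces listed above could look roughly like the following sketch. It is not the code in `core`. The function names, the placeholder `LinearModel`, and the `step`/`predict` interface are all assumptions made for illustration, and the data is synthetic. The point is the shape of the workflow: a transformation that returns its postfix, a split that can run on a subset, one training function and one evaluation function that accept any model with the same two methods.

```python
import numpy as np


def log1p_transform(counts):
    """Return log(1 + counts) and the postfix naming the transformation."""
    return np.log1p(counts), "log1p"


def split_indices(n_samples, test_fraction=0.2, subset=None, seed=0):
    """Shuffle indices, optionally keep a small subset, then split train/test."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    if subset is not None:
        idx = idx[:subset]
    n_test = max(1, int(round(len(idx) * test_fraction)))
    return idx[n_test:], idx[:n_test]


class LinearModel:
    """Placeholder model (assumption) exposing predict() and step()."""

    def __init__(self, n_in, n_out):
        self.W = np.zeros((n_in, n_out))
        self.b = np.zeros(n_out)

    def predict(self, X):
        return X @ self.W + self.b

    def step(self, X, Y, lr):
        """One full-batch gradient step; returns the MSE before the update."""
        err = self.predict(X) - Y
        self.W -= lr * X.T @ err / len(X)
        self.b -= lr * err.mean(axis=0)
        return float((err ** 2).mean())


def train(model, X, Y, epochs=300, lr=0.02):
    """Core training loop: works with any model that implements step()."""
    return [model.step(X, Y, lr) for _ in range(epochs)]


def evaluate(model, X, Y):
    """Standardized evaluation: works with any model that implements predict()."""
    pred = model.predict(X)
    return {"mse": float(((pred - Y) ** 2).mean())}


# Synthetic stand-in data (assumption): 200 samples, 8 genes, 3 isoforms.
rng = np.random.default_rng(0)
X, postfix = log1p_transform(rng.poisson(5.0, size=(200, 8)).astype(float))
Y = X @ rng.normal(size=(8, 3))

# Develop on a 100-sample subset first, as suggested in the workflow.
train_idx, test_idx = split_indices(len(X), subset=100)
model = LinearModel(n_in=8, n_out=3)
losses = train(model, X[train_idx], Y[train_idx])
metrics = evaluate(model, X[test_idx], Y[test_idx])
```

Because `train` and `evaluate` only depend on `step` and `predict`, swapping in a different model would not require changing either loop. The returned `postfix` can be appended to output file names to record the transformation.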

Artifacts location list

/dtu/blackhole/17/187991/pca_outputs/embeddings_bulk.csv
/dtu/blackhole/17/187991/pca_outputs/embeddings_sc.csv
/dtu/blackhole/17/187991/vae_outputs/embeddings_bulk.csv
/dtu/blackhole/17/187991/vae_outputs/embeddings_sc.csv #does not exist yet
/dtu/blackhole/00/223872/geneformer_outputs/bulk_log1p/embeddings.csv
/dtu/blackhole/00/223872/geneformer_outputs/sc_log1p/embeddings.csv
/dtu/blackhole/00/223872/log1p/bulk_processed_genes.h5ad
/dtu/blackhole/00/223872/log1p/bulk_processed_transcripts.h5ad
/dtu/blackhole/00/223872/log1p/sc_processed_genes.h5ad
/dtu/blackhole/00/223872/log1p/sc_processed_transcripts.h5ad
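Because some of these files may not exist yet or may be deleted with the scratch space, a quick existence check can save a failed job. The sketch below uses the paths listed above. The dictionary keys and the function name `artifact_status` are hypothetical labels chosen for illustration.

```python
from pathlib import Path

# Paths copied from the list above; the keys are illustrative labels only.
ARTIFACTS = {
    "pca_bulk": "/dtu/blackhole/17/187991/pca_outputs/embeddings_bulk.csv",
    "pca_sc": "/dtu/blackhole/17/187991/pca_outputs/embeddings_sc.csv",
    "vae_bulk": "/dtu/blackhole/17/187991/vae_outputs/embeddings_bulk.csv",
    "vae_sc": "/dtu/blackhole/17/187991/vae_outputs/embeddings_sc.csv",
    "geneformer_bulk": "/dtu/blackhole/00/223872/geneformer_outputs/bulk_log1p/embeddings.csv",
    "geneformer_sc": "/dtu/blackhole/00/223872/geneformer_outputs/sc_log1p/embeddings.csv",
    "bulk_genes": "/dtu/blackhole/00/223872/log1p/bulk_processed_genes.h5ad",
    "bulk_transcripts": "/dtu/blackhole/00/223872/log1p/bulk_processed_transcripts.h5ad",
    "sc_genes": "/dtu/blackhole/00/223872/log1p/sc_processed_genes.h5ad",
    "sc_transcripts": "/dtu/blackhole/00/223872/log1p/sc_processed_transcripts.h5ad",
}


def artifact_status(artifacts=ARTIFACTS):
    """Map each artifact label to True if its file currently exists."""
    return {name: Path(path).is_file() for name, path in artifacts.items()}


if __name__ == "__main__":
    for name, present in artifact_status().items():
        print(f"{name:18s} {'ok' if present else 'MISSING'}")
```

Running this on the login node prints one line per artifact, so a missing embedding file shows up before a batch job is submitted.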

