Vanthys/02456_project

Predicting isoform expression through learned gene representations from bulk and scRNA-seq

Authored by Masha Tvile, Gabriel Loyaza, Marianna Gordin, Marcel Skumantz

1 Motivation

Alternative splicing (AS) is a key regulatory mechanism that contributes to functional complexity by generating multiple transcript isoforms from a single gene. Over 95% of multi-exon human genes undergo AS, often producing proteins with distinct functional domains [1]. This isoform variability is essential for fine-tuned cellular control and has been implicated in diseases such as cancer and neurodegeneration [2]. The clinical significance of AS has driven interest in methods to study isoform expression. While recent advances in long-read ribonucleic acid (RNA) sequencing have improved our ability to profile transcript isoforms at high resolution, their high cost and computational demands limit their scalability [3]. This presents a valuable opportunity to explore deep learning approaches as an efficient alternative for inferring isoform-level expression from more accessible data.

2 Background

AS produces functionally distinct transcript isoforms that influence cell fate and disease. However, most available RNA-sequencing technologies provide only gene-level counts, masking insights into isoform expression patterns. Existing computational tools for isoform quantification have limited accuracy, particularly in single-cell data, due to sparsity and coverage constraints [4]. To address the gap between these technical limitations and cell-type-specific transcriptional diversity, our project aims to predict isoform expression from raw gene expression profiles. Building on recent advances in representation learning [5], we will compare diverse feature representations derived from Principal Component Analysis (PCA), Variational Autoencoders (VAEs), and pretrained transformer models to identify the optimal strategy for capturing isoform-level biological signals [6]. The models will be trained on bulk RNA-seq data, which is denser and better validated, and then evaluated for generalization to single-cell RNA-sequencing data, enabling robust isoform-level prediction across diverse transcriptomics datasets.

Additional information for the project

As part of the Deep Learning course, you have been given a scratch directory on a temporary, high-performance storage system on the HPC. The path to this directory is stored in the environment variable BLACKHOLE for convenience.

For example, use

cd $BLACKHOLE

to move to that directory, or

ls $BLACKHOLE

to list its content. For further details, please have a look at /dtu/blackhole/readme.txt.

The content of this directory is automatically deleted at each service window. The data will be deleted at the end of January 2026, after the 3-week courses. So if you need the data after that time, you have to make sure you copy it somewhere else. The data, once deleted, will not be recoverable.
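From Python code, the scratch directory described above can be resolved through the same environment variable instead of hard-coding a user-specific path. The following is a minimal sketch, not part of the repository: the helper names `scratch_dir` and `artifact_path` are hypothetical, and only the variable name `BLACKHOLE` comes from the text above.

```python
import os
from pathlib import Path


def scratch_dir(env_var: str = "BLACKHOLE") -> Path:
    """Return the scratch directory named by the environment variable.

    Raises a clear error when the variable is unset, for example when
    the code runs outside the HPC environment.
    """
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return Path(value)


def artifact_path(*parts: str, env_var: str = "BLACKHOLE") -> Path:
    """Build a path below the scratch directory (hypothetical helper)."""
    return scratch_dir(env_var).joinpath(*parts)
```

On the HPC, `artifact_path("log1p", "bulk_processed_genes.h5ad")` would then point to a file inside your own scratch directory, so notebooks and scripts stay portable between group members.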

HPC JOB Documentation

Tasks distribution

Trello Invite

Project Architecture and Workflow:

  1. Start with .ipynb notebooks to understand the data and models, working on small subsets

  2. Split and modularize the functionality into actual functions

    • Standard data-splitting workflow; it should work with subsets at first to speed up development
    • Core training loop as a function that takes the model and the input/output data, to unify the process
    • Standardized evaluation loop
  3. Write HPC scripts to execute the various steps

    • I would suggest one script that transforms the data and saves it with a specific postfix, so that we know which transformation was applied
    • We could try varying the approach from log1p to NLL binomial (a loss, not a transformation), as suggested by Edir. Would this be more biologically accurate? The TA said yes, it could also be interesting to investigate the differences, and there is also a zero-inflated variant
  4. Outputs, metrics, data, and other "artifacts" should be small enough to read on the login node, so we can generate most of the analysis there without batch scripts
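The modular structure in step 2 could look roughly like the sketch below. It is framework-agnostic, pure Python, and only illustrates the shape of the interfaces. All names (`split_indices`, `train_model`, `evaluate_model`, `MeanBaseline`) and the assumed model interface (`update(x, y)` returning a loss, `predict(x)` returning a prediction) are hypothetical, and the toy baseline stands in for the real PCA, VAE, or transformer-based models.

```python
import random
from typing import Dict, List, Sequence, Tuple

# One sample: (gene-level features, isoform-level targets)
Sample = Tuple[Sequence[float], Sequence[float]]


def split_indices(n: int, test_frac: float = 0.2, subset: float = 1.0,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle indices reproducibly, optionally keep a subset, then split."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    idx = idx[: max(2, int(n * subset))]  # small subsets speed up development
    n_test = max(1, int(len(idx) * test_frac))
    return idx[n_test:], idx[:n_test]  # (train indices, test indices)


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between one prediction and one target vector."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


class MeanBaseline:
    """Toy stand-in model: predicts the running mean of the targets seen."""

    def __init__(self, n_outputs: int) -> None:
        self.sums = [0.0] * n_outputs
        self.count = 0

    def predict(self, x: Sequence[float]) -> List[float]:
        if self.count == 0:
            return [0.0] * len(self.sums)
        return [s / self.count for s in self.sums]

    def update(self, x: Sequence[float], y: Sequence[float]) -> float:
        loss = mse(self.predict(x), y)  # loss before seeing this sample
        self.sums = [s + v for s, v in zip(self.sums, y)]
        self.count += 1
        return loss


def train_model(model, data: Sequence[Sample], train_idx: Sequence[int],
                epochs: int = 1) -> List[float]:
    """Shared training loop; returns the mean training loss per epoch."""
    history = []
    for _ in range(epochs):
        losses = [model.update(*data[i]) for i in train_idx]
        history.append(sum(losses) / len(losses))
    return history


def evaluate_model(model, data: Sequence[Sample],
                   test_idx: Sequence[int]) -> Dict[str, float]:
    """Score any model exposing predict() on held-out samples."""
    errors = [mse(model.predict(data[i][0]), data[i][1]) for i in test_idx]
    return {"mse": sum(errors) / len(errors)}
```

Because the loops only rely on `update` and `predict`, each representation (PCA, VAE, Geneformer embeddings) could be wrapped in a class with the same two methods and passed through identical training and evaluation code. Setting `subset=0.1` during development keeps iterations fast before launching full runs on the HPC.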

Artifacts location list
  • /dtu/blackhole/17/187991/pca_outputs/embeddings_bulk.csv
  • /dtu/blackhole/17/187991/pca_outputs/embeddings_sc.csv
  • /dtu/blackhole/17/187991/vae_outputs/embeddings_bulk.csv
  • /dtu/blackhole/17/187991/vae_outputs/embeddings_sc.csv #does not exist yet
  • /dtu/blackhole/00/223872/geneformer_outputs/bulk_log1p/embeddings.csv
  • /dtu/blackhole/00/223872/geneformer_outputs/sc_log1p/embeddings.csv
  • /dtu/blackhole/00/223872/log1p/bulk_processed_genes.h5ad
  • /dtu/blackhole/00/223872/log1p/bulk_processed_transcripts.h5ad
  • /dtu/blackhole/00/223872/log1p/sc_processed_genes.h5ad
  • /dtu/blackhole/00/223872/log1p/sc_processed_transcripts.h5ad
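
Since some of the listed artifacts may not exist yet, and artifacts are meant to stay small enough for the login node, a quick check before an analysis session can help. This is a minimal sketch using only the standard library: the function names and the `max_mb` threshold of 500 MB are assumptions, not project settings.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Optional


def artifact_sizes(paths: Iterable[str]) -> Dict[str, Optional[int]]:
    """Map each path to its size in bytes, or None if the file is missing."""
    sizes: Dict[str, Optional[int]] = {}
    for p in paths:
        path = Path(p)
        sizes[p] = path.stat().st_size if path.is_file() else None
    return sizes


def report(sizes: Dict[str, Optional[int]], max_mb: float = 500.0) -> List[str]:
    """Return one status line per artifact; max_mb is an assumed limit."""
    lines = []
    for p, size in sizes.items():
        if size is None:
            status = "missing"
        elif size > max_mb * 1024 ** 2:
            status = f"{size / 1024 ** 2:.1f} MB (large for the login node)"
        else:
            status = f"{size / 1024 ** 2:.1f} MB"
        lines.append(f"{p}: {status}")
    return lines
```

For example, `print("\n".join(report(artifact_sizes(["/dtu/blackhole/17/187991/vae_outputs/embeddings_sc.csv"]))))` would report the file as missing until it has been generated.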
