yzhong36/DeepEnsemble


Genome-wide annotation of functional branchpoints in the human genome

Introduction

We developed DeepEnsemble, an ensemble-based deep learning framework that predicts intronic branchpoints, which are essential for RNA splicing, in the human genome. The model integrates sequence features and genomic distances to identify branchpoints within 70-nucleotide regions upstream of 3' splice sites. We also prioritized ClinVar branchpoint variants and extended the framework (DeepEnsemble-LR) to quantify the effects of SNVs on branchpoint function.

Branchpoint annotation

Pre-computed annotation files can be downloaded from the data/bp_annotation folder. Two versions are currently available: one based on the GENCODE V19 (hg19) reference and one based on the GENCODE V44 (hg38) reference. Each version includes both predicted (cbp) and experimentally derived (ebp) branchpoints. The files are stored as serialized R objects (.rds) and can be read once the GenomicRanges package is installed. Usage examples are provided below:

library(GenomicRanges)

# Load the predicted branchpoints (cbp) for GENCODE V44 (hg38) and show the first rows
cbp_v44 <- readRDS("data/bp_annotation/gencode_v44_cbp.rds")
head(cbp_v44, 3)

GRanges object with 3 ranges and 7 metadata columns:
      seqnames    ranges strand |       model_window   BP_prob     transcript_id     tx_type intron_type intron_length          intron_gr
         <Rle> <IRanges>  <Rle> |          <GRanges> <numeric>       <character> <character> <character>     <integer>          <GRanges>
  [1]     chr1     12572      + | chr1:12543-12612:+  0.520336 ENST00000456328.2      lncRNA          U2           385 chr1:12228-12612:+
  [2]     chr1     12584      + | chr1:12543-12612:+  0.419528 ENST00000456328.2      lncRNA          U2           385 chr1:12228-12612:+
  [3]     chr1     12593      + | chr1:12543-12612:+  0.656099 ENST00000456328.2      lncRNA          U2           385 chr1:12228-12612:+
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths
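
As a minimal sketch of how these columns might be used, the example below rebuilds the three rows printed above as a toy object (so it runs without the repository files), keeps predictions above a probability cutoff, and computes each branchpoint's distance to the 3' splice site, taken here to be the 3' end of intron_gr. The 0.5 cutoff and the names cbp_toy, high_conf, and dist_to_3ss are illustrative assumptions, not part of DeepEnsemble.

```r
library(GenomicRanges)

# Toy object mirroring the three rows printed above (values copied from that output).
cbp_toy <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(12572, 12584, 12593), width = 1),
  strand   = "+"
)
cbp_toy$BP_prob   <- c(0.520336, 0.419528, 0.656099)
cbp_toy$intron_gr <- rep(GRanges("chr1", IRanges(12228, 12612), strand = "+"), 3)

# Keep predictions at or above a cutoff. ASSUMPTION: 0.5 is an arbitrary
# illustrative threshold, not a value recommended by the authors.
high_conf <- cbp_toy[cbp_toy$BP_prob >= 0.5]

# Distance from each branchpoint to the 3' splice site, assumed to be the
# intron end on the + strand and the intron start on the - strand.
intron  <- cbp_toy$intron_gr
on_plus <- as.character(strand(cbp_toy)) == "+"
dist_to_3ss <- ifelse(on_plus,
                      end(intron) - start(cbp_toy),
                      start(cbp_toy) - start(intron))

length(high_conf)
dist_to_3ss
```

On the real annotation, the same expressions should apply with cbp_v44 in place of cbp_toy, provided the columns have the types shown in the printed output.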

# Load the experimentally derived branchpoints (ebp) for GENCODE V44 (hg38)
ebp_v44 <- readRDS("data/bp_annotation/gencode_v44_ebp.rds")
head(ebp_v44, 3)

GRanges object with 3 ranges and 7 metadata columns:
      seqnames    ranges strand |              source  n_source     transcript_id     tx_type intron_type intron_length            intron_gr
         <Rle> <IRanges>  <Rle> |         <character> <integer>       <character> <character> <character>     <integer>            <GRanges>
  [1]     chr1    781910      + | eBP_Mercer,eBP_Zeng         2 ENST00000434264.6      lncRNA          U2          2844 chr1:779093-781936:+
  [2]     chr1    783082      + |            eBP_Zeng         1 ENST00000589899.5      lncRNA          U2          1067 chr1:782044-783110:+
  [3]     chr1    786361      + |            eBP_Zeng         1 ENST00000586928.5      lncRNA          U2          1899 chr1:784494-786392:+
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths
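
The source and n_source columns can be used to select experimental branchpoints by supporting dataset. The sketch below again uses a toy object copied from the three rows printed above. It assumes that source is a comma-separated list of dataset labels and that n_source counts those labels, which is inferred from the printed rows rather than documented. The names ebp_toy, sources, multi_source, and per_source are hypothetical.

```r
library(GenomicRanges)

# Toy object mirroring the three rows printed above (values copied from that output).
ebp_toy <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(781910, 783082, 786361), width = 1),
  strand   = "+"
)
ebp_toy$source   <- c("eBP_Mercer,eBP_Zeng", "eBP_Zeng", "eBP_Zeng")
ebp_toy$n_source <- c(2L, 1L, 1L)

# Split the comma-separated labels. ASSUMPTION: "," is the only separator.
sources <- strsplit(ebp_toy$source, ",", fixed = TRUE)

# Sanity check of the assumed relationship between the two columns.
stopifnot(all(lengths(sources) == ebp_toy$n_source))

# Branchpoints reported by at least two datasets.
multi_source <- ebp_toy[ebp_toy$n_source >= 2]

# Number of branchpoints supported by each dataset.
per_source <- table(unlist(sources))

start(multi_source)
per_source
```

Restricting to multi-source branchpoints is one possible way to build a stricter experimental set. Whether that is appropriate depends on the analysis.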

Visualization in UCSC genome browser

Branchpoint annotations can also be viewed in the UCSC Genome Browser. Click here for the hg19 version and here for the hg38 version. Users can bookmark these links for easy access in the future.

Contact

This work is in progress, and the manuscript is in preparation. For questions or support, please contact yu_zhong@brown.edu.
