---
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
```

# RNAsum

Transforms RNA-sequencing data into actionable clinical insights with automated reports.

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17353511.svg)](https://doi.org/10.5281/zenodo.17353511)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**Documentation** | [umccr.github.io/RNAsum](https://umccr.github.io/RNAsum/)

## What is RNAsum?

`RNAsum` is an R package that integrates whole-genome sequencing (WGS) and whole-transcriptome sequencing (WTS) data to generate comprehensive, interactive HTML reports for cancer patient samples.

## Quick start

RNAsum can be installed using one of the following two methods.

### Installation

#### Option 1: from GitHub

`RNAsum` depends on `pdftools`, which requires system-level libraries (poppler, cairo, etc.) to be installed before installing the R package.

---

<details><summary><strong>System dependencies installation</strong></summary>

**Ubuntu/Debian:**

```bash
sudo apt-get install libpoppler-cpp-dev libharfbuzz-dev libfribidi-dev \
                     libfreetype6-dev libcairo2-dev libpango1.0-dev
```

**macOS:**

```bash
brew install poppler
```

**HPC/Cluster (without root):**

If you do not have root access (e.g., on a cluster), creating a fresh Conda environment is the most reliable way to provide the necessary system libraries:

```bash
conda create -n rnasum_env -c conda-forge -c bioconda \
  r-base=4.1 poppler harfbuzz fribidi freetype pkg-config \
  cairo openssl pango make gxx_linux-64
conda activate rnasum_env
```

</details>

---
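
Before installing the R package, it can help to confirm that the libraries above are discoverable. The following is a minimal sketch, assuming `pkg-config` is available and that the libraries register under the module names `poppler-cpp`, `harfbuzz`, `fribidi`, `freetype2`, `cairo` and `pango`. The helper name `check_sys_deps` is hypothetical and not part of `RNAsum`.

```bash
# Hypothetical helper (not part of RNAsum): report which pkg-config
# modules can be found. Returns 0 only if every module is found.
check_sys_deps() {
  missing=0
  for mod in "$@"; do
    if pkg-config --exists "$mod" 2>/dev/null; then
      echo "found:   $mod"
    else
      echo "missing: $mod"
      missing=1
    fi
  done
  return "$missing"
}

# Module names are assumptions based on the libraries listed above.
check_sys_deps poppler-cpp harfbuzz fribidi freetype2 cairo pango \
  || echo "Install the missing libraries before installing RNAsum."
```

If `pkg-config` itself is not installed, every module is reported as missing, so a fully "missing" list may simply mean that `pkg-config` needs installing first.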

Once the system dependencies are met, you can install the package directly from GitHub from within an R console.

```{r install-github}
# 1. Increase timeout to prevent download failure for RNAsum.data
options(timeout = 600)

# 2. Install via remotes
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("umccr/RNAsum")
```

#### Option 2: from Conda

A Conda package is available from the Anaconda `umccr` channel:

```bash
conda create -n rnasum -c umccr -c conda-forge -c bioconda r-rnasum
conda activate rnasum
```

## Workflow

The pipeline consists of five main components.

![](man/figures/RNAsum_workflow_updated.png)

1. **WTS data collection**: ingests per-gene read counts and gene fusions.
2. **Reference integration**: normalises against [reference cohorts](https://umccr.github.io/RNAsum/articles/reference_cohorts.html).
3. **WGS data integration**: links genomic alterations with expression data.
4. **Knowledge enrichment**: annotates with clinically relevant databases.
5. **Report generation**: prioritises findings and creates interactive visualisations.

[Detailed workflow documentation](https://umccr.github.io/RNAsum/articles/workflow.html)

## Usage

Add the `rnasum` command-line script to the `PATH` environment variable.

```bash
rnasum_cli=$(Rscript -e 'cat(system.file("cli", package="RNAsum"))')
ln -sf "$rnasum_cli/rnasum.R" "$rnasum_cli/rnasum"
export PATH="$rnasum_cli:$PATH"
```

```bash
rnasum --version
```
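
If `rnasum --version` reports that the command cannot be found, the `PATH` change has not taken effect in the current shell (the `export` above only applies to the session in which it was run). A minimal sketch of a guard for use in scripts, relying only on the standard `command -v` builtin; the function name `require_cmd` is hypothetical.

```bash
# Hypothetical helper: fail early with a clear message if a command
# cannot be resolved on PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "error: '$1' not found on PATH" >&2
  return 1
}

# Only query the version when the CLI is reachable.
if require_cmd rnasum; then
  rnasum --version
fi
```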

### Common options

| Option | Description | Default |
|--------|-------------|---------|
| `--sample_name` | Sample identifier | Required |
| `--dataset` | TCGA reference cohort | `PANCAN` |
| `--salmon` | Salmon quantification file | - |
| `--kallisto` | Kallisto abundance file | - |
| `--arriba_tsv` | Arriba fusion detection output | - |
| `--pcgr_tiers_tsv` | PCGR variant calls (tier 1-4) | - |
| `--cn_gene_tsv` | Copy number by gene | - |
| `--filter` | Filter low-expressed genes | `TRUE` |

Run `rnasum --help` to get the complete list of options.

For format and minimal content of input files (e.g. `--pcgr_tiers_tsv`, `--cn_gene_tsv`, `--sv_tsv`), see [Input file formats](https://umccr.github.io/RNAsum/articles/input_files.html).
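
Because most of these options take file paths, it can be worth verifying the paths before launching a run. The following is a minimal sketch of such a pre-flight check; the helper `check_inputs` is hypothetical and not part of `RNAsum`, and it only tests that each file exists, is readable and is non-empty (it does not validate file contents).

```bash
# Hypothetical helper: verify that every given path is a readable,
# non-empty regular file. Prints each problem to stderr and returns 1
# if any problem was found.
check_inputs() {
  status=0
  for f in "$@"; do
    if [ ! -f "$f" ] || [ ! -r "$f" ]; then
      echo "missing or unreadable: $f" >&2
      status=1
    elif [ ! -s "$f" ]; then
      echo "empty file: $f" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage sketch (file names are placeholders, not real test data):
# check_inputs sample.quant.genes.sf fusions.tsv && echo "inputs look OK"
```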

**Note**: the human reference genome [GRCh38](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39) (Ensembl-based annotation, version 105) is used for gene annotation by default. GRCh37 is no longer supported.

## Examples

- **Test data**: in the `inst/rawdata/test_data` folder of the GitHub repo
- **Runtime**: < 15 minutes (16 GB RAM, 1 CPU)

### Scenario 1: WGS + WTS (recommended)

Comprehensive reporting, in which WGS-based findings are used as a primary source for expression profile prioritisation.

```bash
cd "$rnasum_cli"

rnasum \
  --sample_name test_sample_WTS \
  --dataset TEST \
  --salmon "$PWD/../rawdata/test_data/dragen/TEST.quant.genes.sf" \
  --arriba_pdf "$PWD/../rawdata/test_data/dragen/arriba/fusions.pdf" \
  --arriba_tsv "$PWD/../rawdata/test_data/dragen/arriba/fusions.tsv"  \
  --dragen_fusions "$PWD/../rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final"  \
  --pcgr_tiers_tsv "$PWD/../rawdata/test_data/small_variants/TEST-snvs_indels.tiers.tsv" \
  --cn_gene_tsv "$PWD/../rawdata/test_data/copy_number/TEST.cnv.gene.tsv" \
  --sv_tsv "$PWD/../rawdata/test_data/structural/TEST-sv.tsv" \
  --report_dir "$PWD/../rawdata/test_data/RNAsum" \
  --save_tables FALSE \
  --filter TRUE
```

The HTML report `test_sample_WTS.RNAsum.html` will be created in the folder given by `--report_dir` (here, `rawdata/test_data/RNAsum` under the installed package directory).

### Scenario 2: WTS only

Basic reporting including information about detected gene fusions and expression levels of key genes.

```bash
cd "$rnasum_cli"

rnasum \
  --sample_name test_sample_WTS \
  --dataset TEST \
  --salmon "$PWD/../rawdata/test_data/dragen/TEST.quant.genes.sf" \
  --arriba_pdf "$PWD/../rawdata/test_data/dragen/arriba/fusions.pdf" \
  --arriba_tsv "$PWD/../rawdata/test_data/dragen/arriba/fusions.tsv"  \
  --report_dir "$PWD/../rawdata/test_data/RNAsum" \
  --save_tables FALSE \
  --filter TRUE
```

The HTML report `test_sample_WTS.RNAsum.html` will be created in the folder given by `--report_dir` (here, `rawdata/test_data/RNAsum` under the installed package directory).
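
The two scenarios share the same WTS inputs; Scenario 1 additionally passes the WGS-derived files (`--pcgr_tiers_tsv`, `--cn_gene_tsv`, `--sv_tsv`). A single wrapper can cover both cases by adding each WGS option only when its file exists. This is a minimal sketch under that assumption: the function name `run_rnasum` and the `DRY_RUN` variable are hypothetical, the flags and file names are taken from the examples above, and `--dragen_fusions` is left out for brevity.

```bash
# Hypothetical wrapper: build the rnasum argument list for a test-data
# directory, adding each WGS option only if its file exists.
# With DRY_RUN=1 the command is printed (one argument per line)
# instead of being executed.
run_rnasum() {
  data_dir=$1

  # WTS inputs common to both scenarios.
  set -- \
    --sample_name test_sample_WTS \
    --dataset TEST \
    --salmon "$data_dir/dragen/TEST.quant.genes.sf" \
    --arriba_pdf "$data_dir/dragen/arriba/fusions.pdf" \
    --arriba_tsv "$data_dir/dragen/arriba/fusions.tsv" \
    --report_dir "$data_dir/RNAsum" \
    --save_tables FALSE \
    --filter TRUE

  # WGS-derived inputs (Scenario 1); skipped when absent (Scenario 2).
  snv="$data_dir/small_variants/TEST-snvs_indels.tiers.tsv"
  cnv="$data_dir/copy_number/TEST.cnv.gene.tsv"
  sv="$data_dir/structural/TEST-sv.tsv"
  if [ -f "$snv" ]; then set -- "$@" --pcgr_tiers_tsv "$snv"; fi
  if [ -f "$cnv" ]; then set -- "$@" --cn_gene_tsv "$cnv"; fi
  if [ -f "$sv" ]; then set -- "$@" --sv_tsv "$sv"; fi

  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' rnasum "$@"
  else
    rnasum "$@"
  fi
}

# Print the command that would be run, without executing it.
DRY_RUN=1 run_rnasum "$PWD/../rawdata/test_data"
```

Running with `DRY_RUN=1` first makes it easy to see which scenario a given directory will trigger before committing to a full run.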

## What's in the report?

`RNAsum` generates an interactive HTML report with the following core sections:

- **Findings summary**: summary of genes listed across various report sections
- **Mutated genes**: expression of genes with somatic mutations (requires WGS)
- **Fusion genes**: detected gene fusions with functional annotations
- **Structural variants**: expression of genes located within structural variants (requires WGS)
- **CN altered genes**: expression in CN-gained/lost regions (requires WGS)
- **Cancer genes**: expression of cancer-associated genes

[View example reports](https://doi.org/10.5281/zenodo.17353511).

## Available reference datasets

`RNAsum` includes 33 TCGA cancer-type cohorts for comparative analysis. Some examples:

| Cancer Type | Dataset Code | Samples |
|-------------|--------------|---------|
| Pan-Cancer | `PANCAN` | 330 |
| Breast Invasive Carcinoma | `BRCA` | 300 |
| Lung Adenocarcinoma | `LUAD` | 300 |
| Pancreatic Adenocarcinoma | `PAAD` | 150 |

See the complete [TCGA projects summary table](https://umccr.github.io/RNAsum/articles/tcga_projects_summary.html).

## Documentation

| Resource | Link |
|----------|------|
| Full documentation | [umccr.github.io/RNAsum](https://umccr.github.io/RNAsum/) |
| Workflow details | [Workflow details](https://umccr.github.io/RNAsum/articles/workflow.html) |
| Report structure | [Report structure](https://umccr.github.io/RNAsum/articles/report_structure.html) |
| TCGA datasets | [TCGA projects summary](https://umccr.github.io/RNAsum/articles/tcga_projects_summary.html) |

## Contributing

We welcome contributions! Please see our [Code of Conduct](./CODE_OF_CONDUCT.md) and contribution guidelines.

### Reporting Issues

Found a bug or have a feature request? [Open an issue](https://github.com/umccr/RNAsum/issues/new).

## Citation

If you use `RNAsum`, please cite:

> Kanwal S, Marzec J, Diakumis P, Hofmann O, Grimmond S (2024). "RNAsum: An R package to comprehensively post-process, summarise and visualise genomics and transcriptomics data." version 1.1.0, https://umccr.github.io/RNAsum/

A BibTeX entry for LaTeX users is:

```
@Unpublished{,
  title = {RNAsum: An R package to comprehensively post-process, summarise and visualise genomics and transcriptomics data},
  author = {Sehrish Kanwal and Jacek Marzec and Peter Diakumis and Oliver Hofmann and Sean Grimmond},
  year = {2024},
  note = {version 1.1.0},
  url = {https://umccr.github.io/RNAsum/},
}
```