Edi-Set-Flow - Detecting RNA editing at scale

A robust pipeline for RNA editing detection and differential analysis in bulk RNA-seq

This is a beta release - please try it and report any issues here

Overview

Edi-Set-Flow runs in five stages:

Setup — Downloads reference genome, GTF, REDIportal (hg38 only), and common variant exclusion lists (dbSNP/gnomAD), then builds aligner indices. All resources are cached in esf_resources/ and reused across runs.
Alignment — Optionally trims reads with fastp, aligns to the reference (STAR or BWA-MEM2), filters BAMs, estimates per-sample coverage with mosdepth, and automatically detects library strandedness.
Site Discovery & Counting — JACUSA2 scans for candidate A-to-I (A>G) RNA editing sites in two passes: a loose discovery pass to identify candidate sites, followed by a stricter counting pass to quantify allele counts. Both passes run in parallel across samples.
Annotation — Ensembl VEP annotates variant consequences (gene, transcript, impact); vcfanno adds REDIportal catalog membership and repeat-masker context.
Reporting & Statistics — MultiQC aggregates alignment and QC metrics; the companion edisetr R package fits per-site GLMs across experimental groups and generates an interactive HTML report.

Getting Started

1) Requirements

Nextflow (≥ 22.03.0)
Container engine, any of:

2) Sample Manifest (CSV format)

Required Columns:
- sample_id — unique identifier for each sample
- group — experimental condition used as the primary contrast in the GLM (e.g. tissue, treatment)
- One of the following input options per sample:
  - fastq1 (+ optional fastq2) — paths to local FASTQ files; omit fastq2 or leave blank for single-end data
  - run_accession — SRA or ENA run accession (e.g. SRR..., ERR...); reads are downloaded automatically using fasterq-dump
Optional Columns:
- Arbitrary covariates to include in the GLM as fixed effects (specified via --report_fixed_effects)
- Columns containing numbers are treated as numeric; all others are treated as factors

Example (local FASTQs):

sample_id,fastq1,fastq2,group,sex,age
SRR1311086,/PATH/TO/SRR1311086_1.fastq.gz,/PATH/TO/SRR1311086_2.fastq.gz,cortex,male,50
SRR1477080,/PATH/TO/SRR1477080_1.fastq.gz,/PATH/TO/SRR1477080_2.fastq.gz,cerebellum,female,60
SRR1085825,/PATH/TO/SRR1085825_1.fastq.gz,/PATH/TO/SRR1085825_2.fastq.gz,hippocampus,male,50
...

Example (SRA/ENA accessions):

sample_id,run_accession,group,age
SRR5961804,SRR5961804,CTRL,38
SRR5961807,SRR5961807,CTRL,55
SRR5961808,SRR5961808,MDD,38
...

3) Run Edi-Set-Flow

Example:

nextflow run bahlolab/Edi-Set-Flow \
    -revision 26.03-beta.2 \
    -profile hg38,singularity \
    -resume \
    --input sample_manifest.csv \
    --report_fixed_effects sex,age \
    --outdir esf_results

Quick test run:

The test profile provides a built-in manifest of 10 SRA samples (5 CTRL / 5 MDD, human prefrontal cortex). Reads are downloaded automatically via fasterq-dump, so no input files are needed:
```
nextflow run bahlolab/Edi-Set-Flow \
    -revision 26.03-beta.2 \
    -profile hg38,singularity,test \
    -resume \
    --outdir esf_test_results
```
Notes:
- Profiles:
  - Genome: hg38, mm10 or mm39 are supported. Other genome builds require custom specification — see nextflow.config
  - Container Engine: singularity, apptainer, or docker
  - Examples: -profile hg38,singularity or -profile mm10,docker
- Resources (e.g. reference genome) are downloaded and stored in esf_resources/; use --resource_dir /path/to/resource_dir to share a single cache across multiple runs or users
- REDIportal annotation is only supported for hg38

Parameters

Key parameters grouped by function. Defaults are set by the -profile where noted.

Input / Output

Parameter	Default	Description
`--input`	required	Path to sample manifest CSV
`--outdir`	`output`	Directory for pipeline outputs
`--resource_dir`	`esf_resources`	Reference data cache directory; share across runs to avoid re-downloading

Preprocessing

Parameter	Default	Description
`--aligner`	`BWAMEM2`	Aligner to use: `BWAMEM2` or `STAR`. STAR is recommended when splice-aware alignment is important
`--fastp_mode`	off	Read QC/trimming: omit to skip, `MINIMAL` (adapter trimming only), `STRICT` (quality filtering + adapter trimming)
`--drop_duplicates`	`true`	Remove PCR duplicates from BAMs before site detection

Site Detection

Parameter	Default	Description
`--min_mapq`	`20`	Minimum mapping quality passed to JACUSA2
`--infer_enzyme`	`true`	Infer editing enzyme type from data (ADAR: A>G; APOBEC: C>T)
`--n_intervals`	`10`	Number of genomic intervals for parallelisation — increase for larger datasets or more available CPUs

GLM / Report

Parameter	Default	Description
`--report_fixed_effects`	—	Comma-separated covariate columns from the manifest to include in the GLM (e.g. `sex,age`). `group` is always included automatically
`--report_model`	`quasibinomial`	GLM family: `quasibinomial` (recommended — handles overdispersion), `binomial`, `linear`, or `arcsine`
`--report_min_med_dp`	`10`	Minimum median depth across samples for a site to appear in the report
`--report_adar_only`	`true`	Restrict the report to A>G (ADAR-type) sites only

Output Files

File	Description
`EdiSetFlow.report.html`	Interactive Edi-Set-Flow HTML report — see example (GTEx brain)
`EdiSetFlow.sample_counts.csv.gz`	Reference and alternate allele counts per site and sample
`EdiSetFlow.sample_summary.csv.gz`	Summary statistics (median depth & editing rate) per sample
`EdiSetFlow.site_summary.csv.gz`	Per-site statistics and annotation (e.g. gene, region, REDIportal membership, VEP consequence)
`EdiSetFlow.glm_summary.csv.gz`	GLM coefficients, standard errors, and significance per site
`EdiSetFlow.glm_anova.csv.gz`	ANOVA test results for each model term per site
`EdiSetFlow.glm_contrasts.csv.gz`	Pairwise group comparisons for differential editing
`EdiSetFlow.glm_margins.csv.gz`	Estimated marginal editing rates per group
`multiqc_report.html`	MultiQC summary report (fastp, STAR/BWA-MEM2, mosdepth, etc.)

Running on HPC & Cloud

Nextflow handles job submission and resource management, but a few settings are worth configuring for cluster or cloud environments.

Executor (SLURM, PBS, AWS, etc.)

By default Nextflow runs processes locally. To submit jobs to a cluster scheduler or cloud provider, add an executor block to a local nextflow.config in your working directory, or pass it with -c custom.config:

process {
    executor = 'slurm'       // or 'pbspro', 'awsbatch', 'google-lifesciences', etc.
    queue    = 'your-queue'
}

See the Nextflow executor documentation for all supported platforms and options.

Overriding CPU / Memory per Process

Each tool runs under a named label (e.g. jacusa2, star, vep). Override resources for any label in your config:

process {
    withLabel: 'jacusa2' { cpus = 16; memory = '80 GB' }
    withLabel: 'star'    { cpus = 8;  memory = '40 GB' }
}

Resuming Failed Runs

Always pass -resume when restarting a run. Nextflow caches completed tasks and will skip them, restarting only from the point of failure.

Container Cache

Set the appropriate environment variable so containers are pulled once and shared across runs:

export NXF_APPTAINER_CACHEDIR=/shared/path/apptainer_cache   # Apptainer / Singularity
export NXF_SINGULARITY_CACHEDIR=/shared/path/singularity_cache

Sharing Reference Resources

Pass --resource_dir /shared/path/esf_resources to point all runs at a shared reference cache, avoiding repeated downloads of the genome, GTF, and annotation databases.

Debugging Failed Runs

Pipeline-level errors (e.g. invalid parameters, missing files, config problems) are logged to .nextflow.log in your working directory. Check the bottom of this file first.
Task-level errors (e.g. a tool crashing mid-run) are captured in the task's work/ subdirectory. Nextflow prints the work directory path in the error output — navigate there and inspect:
- .command.sh — the exact shell command that was run
- .command.log — stdout/stderr from the task
- .exitcode — the exit code
Common fixes are increasing the CPU/memory or walltime allocation for the failing process (see Overriding CPU / Memory above).

Getting Help

Nextflow documentation: nextflow.io/docs
Pipeline issues and questions: GitHub Issues

References

Piechotta, M., Naarmann-de Vries, I. S., Wang, Q., Altmüller, J. & Dieterich, C. (2022) RNA modification mapping with JACUSA2. Genome Biology, 23(1), 115.

D'Addabbo, P., Cohen-Fultheim, R., Twersky, I., Fonzino, A., Silvestris, D. A., Prakash, A., … & Picardi, E. (2025) REDIportal: toward an integrated view of the A-to-I editing. Nucleic Acids Research, 53(D1), D233–D242.

McLaren, W., Gil, L., Hunt, S. E., Riat, H. S., Ritchie, G. R., Thormann, A., … & Cunningham, F. (2016) The Ensembl Variant Effect Predictor. Genome Biology, 17, 1–14.

da Veiga Leprevost, F., Grüning, B. A., Alves Aflitos, S., Röst, H. L., Uszkoreit, J., Barsnes, H., … & Perez-Riverol, Y. (2017) BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics, 33(16), 2580–2582.

Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E. & Notredame, C. (2017) Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316–319.

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
.github/workflows		.github/workflows
bin		bin
docker		docker
edisetr		edisetr
img		img
lib		lib
misc		misc
modules/local		modules/local
subworkflows/local		subworkflows/local
workflows		workflows
LICENSE		LICENSE
README.md		README.md
main.nf		main.nf
nextflow.config		nextflow.config

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Edi-Set-Flow - Detecting RNA editing at scale

Overview

Getting Started

1) Requirements

2) Sample Manifest (CSV format)

3) Run Edi-Set-Flow

Parameters

Input / Output

Preprocessing

Site Detection

GLM / Report

Output Files

Running on HPC & Cloud

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Edi-Set-Flow - Detecting RNA editing at scale

Overview

Getting Started

1) Requirements

2) Sample Manifest (CSV format)

3) Run Edi-Set-Flow

Parameters

Input / Output

Preprocessing

Site Detection

GLM / Report

Output Files

Running on HPC & Cloud

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages