Sylvan is a comprehensive genome annotation pipeline that combines EVM/PASA, GETA, and Helixer with semi-supervised random forest filtering for generating high-quality gene models from raw genome assemblies.
- Multi-evidence integration: RNA-seq, protein homology, neighbor species annotations
- Dual RNA-seq alignment pathways: STAR and HiSat2 with StringTie/PsiCLASS
- Multiple ab initio predictors: Helixer (GPU-accelerated), Augustus
- Semi-supervised filtering: Random forest-based spurious gene removal
- Score-based filtering: Alternative logistic regression + random forest scoring pipeline
- HPC-ready: SLURM cluster support with Singularity containers
- Local execution: Run without SLURM on any Linux machine with `bin/annotate_local.sh`
- Customizable cluster command: `sbatch` template lives in the config YAML — no shell script edits needed
- TidyGFF: Format annotations for public distribution
- Cleanup utility: Remove intermediate files after pipeline completion
- Complete Installation (conda environment, Singularity image, git clone)
- Run with toy data:
# Dry-run first
snakemake -n --snakefile bin/Snakefile_annotate
# Run annotation
./bin/annotate_toydata.sh

The toy data experiment uses A. thaliana chromosome 4 with 12 paired-end RNA-seq samples, 3 neighbor species, and the land_plant Helixer model. For a detailed walkthrough, see the Wiki.
- Linux (tested on CentOS/RHEL, Ubuntu)
- Singularity/Apptainer 3.x+
- Conda/Mamba
- SLURM for cluster execution (optional — see Local Execution for running without HPC)
- Git LFS (for toy data)
- GPU (optional): NVIDIA GPU with driver >= 525.60.13 for Helixer acceleration. See GPU / CUDA Compatibility for details.
Most bioinformatics tools (STAR, Augustus, GeneWise, PASA, EVM, BLAST, BUSCO, etc.) are bundled inside the Singularity container. The host environment needs:
| Package | Purpose |
|---|---|
| Python 3.10+ | Pipeline orchestration |
| Snakemake 7 | Workflow engine |
| pandas | Data manipulation |
| scikit-learn | Random forest classifier |
| NumPy | Numerical operations |
| PyYAML | Config parsing |
| rich | Logging (optional) |
Perl and R scripts (fillingEndsOfGeneModels.pl, filter_distributions.R) run inside the Singularity container and do not require host installation.
# Create conda environment
conda create -n sylvan -c conda-forge -c bioconda python=3.11 snakemake=7 -y
conda activate sylvan
# Download Singularity image (latest = v4, GPU-capable TensorFlow)
singularity pull --arch amd64 sylvan.sif library://wyim/sylvan/sylvan:latest
# Or a specific version: library://wyim/sylvan/sylvan:v3 (CPU-only TF, smaller)
# Clone repository (with Git LFS for toy data)
git lfs install
git clone https://github.com/plantgenomicslab/Sylvan.git
cd Sylvan/singularity
sudo singularity build sylvan.sif Sylvan.def
# Or without root (requires user namespaces):
singularity build --fakeroot sylvan.sif Sylvan.def

Sylvan uses Helixer (TensorFlow-based deep learning gene predictor), which benefits significantly from GPU acceleration. The container is designed to work across different GPU hardware without CUDA version conflicts.
How it works:
The Singularity container bundles tensorflow with individual NVIDIA CUDA pip packages (nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cublas-cu12, etc.). This means:
- No host CUDA toolkit required — only the NVIDIA driver is needed on the host
- No GPU-model-specific builds — the same container works on V100, A100, H100, etc.
- Automatic CPU fallback — if no GPU is detected, TensorFlow runs on CPU transparently (slower but functional)
| Component | Location | Required |
|---|---|---|
| NVIDIA driver | Host (>= 525.60.13) | For GPU only |
| CUDA runtime | Container (pip: nvidia-cuda-runtime-cu12) | Bundled |
| cuDNN | Container (pip: nvidia-cudnn-cu12) | Bundled |
| TensorFlow 2.15 | Container (helixer conda env) | Bundled |
Singularity --nv flag:
All entry scripts pass --nv to Singularity, which bind-mounts the host's NVIDIA driver libraries into the container. This is safe to include even on CPU-only nodes — Singularity silently skips --nv if no GPU is found.
# Default: GPU passthrough enabled (falls back to CPU if no GPU)
./bin/annotate.sh
# Override singularity args if needed (e.g., custom bind paths, no --nv)
SYLVAN_SINGULARITY_ARGS="--nv -B /scratch" ./bin/annotate.sh

SLURM GPU configuration:
For HPC clusters with separate CPU and GPU partitions, configure the helixer rule in config_annotate.yml to request GPU resources:
helixer:
account: gpu-account # GPU-specific SLURM account (if different)
partition: gpu-partition # GPU partition (e.g., gpu-s1-pgl-0)
extra_args: "--gres=gpu:1" # Request 1 GPU
ncpus: 12
  memory: 48g

All other rules run on CPU nodes using the __default__ account/partition. Only the helixer rule needs GPU access.
Compatibility matrix (tested):
| Host GPU | Host Driver | Container CUDA | Status |
|---|---|---|---|
| NVIDIA A100 | >= 525.60.13 | 12.x (bundled) | Supported |
| NVIDIA V100 | >= 525.60.13 | 12.x (bundled) | Supported |
| No GPU | N/A | N/A | CPU fallback (slower) |
Note: The minimum driver version 525.60.13 corresponds to CUDA 12.0 forward compatibility. Older drivers will trigger CPU fallback. Run `nvidia-smi` on the host to check your driver version.
The Sylvan pipeline consists of two main phases — annotation and filtration — with configurable modules that process evidence from multiple sources and combine them into a unified gene model. The following describes the available tools and modules. Users configure which components to enable and how to parameterize them via config_annotate.yml and config_filter.yml.
The annotation phase generates gene models by integrating multiple configurable evidence sources.
- Repeat Masking
  - Runs RepeatMasker with a user-specified species library (e.g. `Embryophyta`, `Viridiplantae`, `Metazoa` — configured via `geta.RM_species`)
  - Can optionally run RepeatModeler for de novo repeat identification
  - Supports user-supplied custom repeat libraries (e.g. from EDTA, configured via `geta.RM_lib`)
- RNA-seq Processing
  - Quality-trims reads with fastp
  - Aligns reads via STAR (default) or HiSat2 (alternative pathway — both are available in the pipeline; the active pathway depends on the Snakemake rule graph)
  - Assembles transcripts with StringTie and PsiCLASS
  - Optionally performs de novo transcript assembly with SPAdes + Evigene clustering
  - Refines and clusters transcripts with PASA
- Protein Homology (sequential pipeline)
  - Miniprot performs fast protein-to-genome alignment to identify candidate gene regions
  - GeneWise refines gene structures on Miniprot-identified regions
  - GMAP provides exonerate-style exon-level alignments
- Ab Initio Prediction
  - Helixer: deep learning–based gene prediction (optionally GPU-accelerated; model selected via `helixer_model` — `land_plant`, `vertebrate`, or `fungi`)
  - Augustus: HMM-based prediction, either trained de novo on the target genome or initialized from an existing species model (via `augustus_start_from`), or skipped entirely if a pre-trained model is supplied (via `use_augustus`)
- Liftover
  - LiftOff transfers annotations from one or more neighbor species (configured via `liftoff.neighbor_gff` and `liftoff.neighbor_fasta`)
- GETA Pipeline
  - TransDecoder predicts ORFs from assembled transcripts
  - Gene models are combined and filtered; repeat-overlapping genes are removed
- Portcullis
  - Filters splice junctions from transcript evidence
- EvidenceModeler (EVM)
  - Integrates all evidence sources using configurable weights (`evm_weights.txt`)
  - Generates consensus gene models
  - Genome is partitioned into overlapping segments for parallel execution (partition count configured via `num_evm_files`)
- PASA Post-processing
  - PASA operates at two stages in the pipeline: (1) initial transcript assembly and clustering before EVM, and (2) post-EVM refinement for UTR addition and alternative isoform incorporation
- PASA–EVM Merge
  - PASA annotation comparison only outputs gene models with transcript evidence overlap, silently dropping EVM genes without transcript support. The merge step (`merge_pasa_evm.py`) rescues these dropped EVM genes by adding them back alongside PASA-updated models, preserving conserved genes that lack RNA-seq coverage.
- Gene Boundary Refinement
  - Detects truncated gene models by comparing against Helixer and Augustus predictions at the same locus
  - Replaces truncated models only when supported by independent evidence: RNA-seq splice junctions (primary authority for exon boundaries), miniprot protein alignment coverage (detection only, not used for replacement), and cross-source CDS boundary agreement
  - Miniprot is explicitly excluded as a replacement source — its protein-level alignments give approximate, not exact, exon boundaries
- AGAT
  - Final GFF3 format cleaning and validation
Output: results/PREFILTER/Sylvan.gff3
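The rescue logic of the PASA–EVM merge step can be sketched as a set difference over gene IDs. This is a minimal illustration only; the real `merge_pasa_evm.py` carries the full feature records, not just IDs:

```python
def gene_ids(gff3_text):
    """Collect gene feature IDs from GFF3 text (column 9 ID= attribute)."""
    ids = set()
    for line in gff3_text.splitlines():
        cols = line.split("\t")
        if len(cols) == 9 and cols[2] == "gene":
            for field in cols[8].split(";"):
                if field.startswith("ID="):
                    ids.add(field[3:])
    return ids

def rescue_dropped(evm_gff3, pasa_gff3):
    """Return EVM gene IDs absent from the PASA-updated annotation."""
    return gene_ids(evm_gff3) - gene_ids(pasa_gff3)

# evm.g2 has no transcript support, so PASA drops it; the merge adds it back.
evm = ("chr1\tEVM\tgene\t100\t900\t.\t+\t.\tID=evm.g1\n"
       "chr1\tEVM\tgene\t2000\t2900\t.\t-\t.\tID=evm.g2\n")
pasa = "chr1\tPASA\tgene\t100\t950\t.\t+\t.\tID=evm.g1\n"
print(sorted(rescue_dropped(evm, pasa)))  # ['evm.g2']
```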
The filter phase computes additional evidence features for each gene model and applies a semi-supervised random forest classifier to separate high-quality genes from spurious predictions.
The following features are computed for every gene model in the draft annotation:
- PfamScan — identifies conserved protein domains using the Pfam-A HMM database
- RSEM — quantifies transcript expression (TPM) from re-aligned RNA-seq reads; bedtools computes read coverage
- BLASTp (homolog) — measures similarity to a user-supplied protein database (parallelized across 20 split peptide files)
- BLASTp (RexDB) — measures similarity to a repeat element protein database (e.g. RepeatExplorer Viridiplantae)
- Ab initio overlap — computes the fraction of each gene model overlapping with Augustus predictions, Helixer predictions, and RepeatMasker annotations
- Miniprot overlap — computes the fraction of each gene model overlapping with Miniprot protein-to-genome alignments (used as RF feature only — not as a rescue condition)
- lncDC — classifies transcripts as protein-coding or long non-coding RNA using an XGBoost model with plant-specific pre-trained parameters
- BUSCO — identifies conserved single-copy orthologs (used to monitor the filtration process and as a safety net to prevent discarding conserved genes)
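Several of these features reduce to "what fraction of the gene model is covered by intervals from another source". A minimal sketch of that computation, assuming half-open coordinates (the pipeline's own implementation may differ, e.g. by delegating to bedtools):

```python
def overlap_fraction(gene, intervals):
    """Fraction of the gene span covered by (start, end) intervals.
    Coordinates are half-open [start, end); intervals may overlap."""
    gs, ge = gene
    merged = []
    for s, e in sorted(intervals):
        s, e = max(s, gs), min(e, ge)  # clip to the gene span
        if s >= e:
            continue
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))  # extend run
        else:
            merged.append((s, e))
    return sum(e - s for s, e in merged) / (ge - gs)

# Gene of length 100; two overlapping predictions cover 100..180 of it.
print(overlap_fraction((100, 200), [(90, 150), (140, 180)]))  # 0.8
```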
- Initial gene set selection: A data-driven heuristic selects high-confidence positive genes (strong homolog/Pfam/expression evidence) and high-confidence negative genes (repeat-like, no expression) using configurable cutoff thresholds (TPM, coverage, BLAST identity/coverage, repeat overlap)
- Random forest training: A binary classifier is trained on the initial gene set
- Iterative refinement: High-confidence predictions (above the `--recycle` threshold, default 0.95) are added back to the training set, and the model is retrained. This repeats for up to `--max-iter` iterations (default 5) or until convergence
- Three-tier rescue for undecided genes: (1) RF Keep probability > 0.6 excluding TE-only genes, (2) Pfam domain present without repeat/RexDB contamination, (3) BUSCO safety net — genes with Complete BUSCO hits are never discarded
- Discard classification: Discarded genes are categorized as `TE_related` (RexDB hit), `lncRNA` (lncDC prediction), or `pseudogene` (low/no evidence)
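The recycle/retrain loop above can be illustrated with a toy one-feature classifier standing in for the random forest. Everything here is illustrative (the pipeline trains a scikit-learn random forest on the full feature matrix, and `train`/`prob_keep`/`self_train` are hypothetical helpers):

```python
import math

def train(labeled):
    """Toy stand-in for RF training: a 1-D decision threshold at the
    midpoint between class means (assumes both classes are present)."""
    keep = [x for x, y in labeled if y == 1]
    drop = [x for x, y in labeled if y == 0]
    return (sum(keep) / len(keep) + sum(drop) / len(drop)) / 2

def prob_keep(threshold, x):
    """Squash distance from the threshold into a pseudo-probability."""
    return 1.0 / (1.0 + math.exp(-(x - threshold)))

def self_train(labeled, unlabeled, recycle=0.95, max_iter=5):
    """Add confident pseudo-labels back to the training set and retrain,
    up to max_iter rounds or until nothing clears the recycle threshold."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_iter):
        t = train(labeled)
        confident = [x for x in pool
                     if prob_keep(t, x) > recycle or prob_keep(t, x) < 1 - recycle]
        if not confident:
            break  # convergence
        for x in confident:
            labeled.append((x, 1 if prob_keep(t, x) > 0.5 else 0))
            pool.remove(x)
    return train(labeled), pool

# 9.5 and 0.5 are recycled as pseudo-labels; 5.0 stays undecided.
threshold, undecided = self_train([(8.0, 1), (2.0, 0)], [9.5, 0.5, 5.0])
print(threshold, undecided)  # 5.0 [5.0]
```

Undecided genes like the `5.0` example are exactly the ones handed to the three-tier rescue described above.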
Output files:
- `results/FILTER/filtered.gff3` — Kept gene models
- `results/FILTER/discard.gff3` — Discarded gene models (each feature annotated with `discard_reason=TE_related|lncRNA|pseudogene`)
- `results/FILTER/data.tsv` — Feature matrix used by random forest
- `results/FILTER/keep_data.tsv` — Evidence data for kept genes
- `results/FILTER/discard_data.tsv` — Evidence data for discarded genes with `discard_reason` column
- `results/FILTER/{prefix}.cdna` — Extracted transcript sequences
- `results/FILTER/{prefix}.pep` — Extracted peptide sequences
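As a quick sanity check on a finished run, the `discard_reason` attributes in `discard.gff3` can be tallied in a few lines of Python (`discard_reasons` is a hypothetical helper, not part of the pipeline):

```python
from collections import Counter

def discard_reasons(gff3_text):
    """Tally discard_reason attribute values across GFF3 features."""
    counts = Counter()
    for line in gff3_text.splitlines():
        if line.startswith("#") or "\t" not in line:
            continue
        attrs = line.rsplit("\t", 1)[-1]  # column 9
        for field in attrs.split(";"):
            if field.startswith("discard_reason="):
                counts[field.split("=", 1)[1]] += 1
    return counts

gff = ("chr1\tSylvan\tgene\t1\t500\t.\t+\t.\tID=g1;discard_reason=TE_related\n"
       "chr1\tSylvan\tgene\t900\t1200\t.\t-\t.\tID=g2;discard_reason=pseudogene\n"
       "chr1\tSylvan\tgene\t2000\t2300\t.\t+\t.\tID=g3;discard_reason=TE_related\n")
print(discard_reasons(gff))  # Counter({'TE_related': 2, 'pseudogene': 1})
```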
An alternative scoring pipeline (Snakefile_filter_score) uses logistic regression and random forest scoring with pseudo-labels instead of the iterative semi-supervised approach. This requires the same feature generation outputs and produces:
- `results/FILTER/scores.csv` — Per-gene scores and features
- `results/FILTER/scores.metrics.txt` — AUC/PR/F1 and chosen thresholds
export SYLVAN_FILTER_CONFIG="toydata/config/config_filter.yml"
./bin/filter_score_toydata.sh

This section describes the inputs, configuration, and commands needed to run the annotation pipeline on your data.
| Input | Description | Config Field |
|---|---|---|
| Genome assembly | FASTA file (`.fa`, `.fasta`, `.fna`, `.fa.gz`, `.fasta.gz`, `.fna.gz`) | `genome` |
| RNA-seq data | Paired-end gzipped FASTQ files (`*_1.fastq.gz`/`*_2.fastq.gz` or `*_R1.fastq.gz`/`*_R2.fastq.gz`) in a folder | `rna_seq` |
| Protein sequences | FASTA from UniProt, OrthoDB, etc. (comma-separated for multiple files) | `proteins` |
| Neighbor species | Directories containing GFF3 and genome FASTA (`.fa`, `.fasta`, `.fna`, `.fas`, `.fsa`, `.seq`) files, one per species | `liftoff.neighbor_gff`, `liftoff.neighbor_fasta` |
| Repeat library | EDTA output (`.TElib.fa`) | `geta.RM_lib` |
| Singularity image | Path to `sylvan.sif` | `singularity` |
# Set config (required)
export SYLVAN_CONFIG="toydata/config/config_annotate.yml"
# Dry run
snakemake -n --snakefile bin/Snakefile_annotate
# Submit to SLURM
sbatch -A [account] -p [partition] -c 1 --mem=1g \
-J annotate -o annotate.out -e annotate.err \
--wrap="./bin/annotate_toydata.sh"
# Or run directly
./bin/annotate_toydata.sh

# Set config for local execution
export SYLVAN_CONFIG="toydata/config/config_annotate_local.yml"
# Dry run
snakemake -n --snakefile bin/Snakefile_annotate
# Run locally (uses --cores instead of --cluster)
./bin/annotate_local.sh

See Local Execution for details.
Output: results/PREFILTER/Sylvan.gff3
This section describes the inputs and commands for the filter pipeline. All inputs below are specified in config_filter.yml.
| Input | Description | Config Field |
|---|---|---|
| Annotated GFF | Output from Annotate phase (`results/PREFILTER/Sylvan.gff3`) | `anot_gff` |
| Genome | Same as Annotate phase | `genome` |
| RNA-seq data | Same as Annotate phase | `rna_seq` |
| Protein sequences | Same as Annotate phase | `protein` |
| Augustus GFF | Augustus predictions (`results/GETA/Augustus/augustus.gff3`) | `augustus_gff` |
| Helixer GFF | Helixer predictions (`results/AB_INITIO/Helixer/helixer.gff3`) | `helixer_gff` |
| Repeat GFF | RepeatMasker output (`results/GETA/RepeatMasker/genome.repeat.gff3`) | `repeat_gff` |
| HmmDB | Pfam database directory (default: `/usr/local/src` inside container) | `HmmDB` |
| RexDB | RepeatExplorer protein DB (e.g. `Viridiplantae_v4.0.fasta` from rexdb) | `RexDB` |
| BUSCO lineage | e.g., `eudicots_odb10` | `busco_lin` |
| Chromosome regex | Regex to match chromosome prefixes (e.g. `(^Chr)|(^chr)|(^LG)`) | `chrom_regex` |
Filter cutoff thresholds (in config_filter.yml under Cutoff):
| Parameter | Description | Default |
|---|---|---|
| `tpm` | TPM threshold for initial gene selection | 3 |
| `rsem_cov` | RNA-seq coverage threshold | 0.5 |
| `blast_pident` / `blast_qcovs` | BLASTp identity / coverage | 0.6 / 0.6 |
| `rex_pident` / `rex_qcovs` | RexDB identity / coverage | 0.6 / 0.6 |
| `helixer_cov` / `augustus_cov` | Ab initio overlap | 0.8 / 0.8 |
| `repeat_cov` | Repeat overlap coverage threshold | 0.5 |
| `miniprot_cov` | Miniprot protein alignment overlap (RF feature) | 0.5 |
# Set config (required)
export SYLVAN_FILTER_CONFIG="toydata/config/config_filter.yml"
# Dry run
snakemake -n --snakefile bin/Snakefile_filter
# Submit to SLURM
sbatch -A [account] -p [partition] -c 1 --mem=4g \
-J filter -o filter.out -e filter.err \
--wrap="./bin/filter_toydata.sh"
# Or run directly
./bin/filter_toydata.sh

Output: results/FILTER/filtered.gff3
Compare annotation quality across all pipeline stages using BUSCO and OMArk:
# Configure benchmark targets in config_filter.yml (Benchmark section)
# Then run:
./bin/benchmark_local.sh # local
./bin/benchmark.sh       # SLURM

This benchmarks each GFF3 listed in `Benchmark.gff3_files` by extracting proteins and running BUSCO protein-mode (and optionally OMArk). Results are saved to results/BENCHMARK/benchmark_summary.tsv.
OMArk setup (optional): OMArk requires the OMAmer database (LUCA.h5, ~6 GB), which is not bundled in the container to keep the image under the Sylabs Cloud 10 GB limit. Download it into your project root:
cd Sylvan/ # project root
wget https://omabrowser.org/All/LUCA.h5

The toydata config (toydata/config/config_filter_local.yml) already references LUCA.h5 as a relative path from the working directory. For custom projects, set `Benchmark.omark_db` in config_filter.yml to the path where you downloaded LUCA.h5. Leave empty to skip OMArk and run BUSCO only.
Output: results/BENCHMARK/benchmark_summary.tsv
See the Wiki — Step 5d for toydata benchmark results and detailed configuration.
Run all phases (annotate + filter + benchmark) sequentially:
./bin/run_local.sh # all three phases
SYLVAN_SKIP_ANNOTATE=1 ./bin/run_local.sh # skip Phase 1
SYLVAN_SKIP_BENCHMARK=1 ./bin/run_local.sh # skip Phase 3

After a filter run completes, run the leave-one-feature-out ablation test:
python bin/filter_feature_importance.py FILTER/data.tsv results/busco/full_table.tsv \
  --output-table FILTER/feature_importance.tsv

See the Wiki for detailed usage, optional flags, and workflow.
Sylvan separates pipeline configuration (inputs, tool parameters, thread counts) from cluster configuration (SLURM account, partition, resources).
| File | Purpose |
|---|---|
| `config_annotate.yml` | Pipeline options: input paths, species parameters, tool settings, per-rule thread counts |
| `config_filter.yml` | Filter options: input paths, cutoff thresholds, thread counts |
| `cluster_annotate.yml` | SLURM resources for annotate: account, partition, per-rule ncpus/memory, extra sbatch flags |
| `cluster_filter.yml` | SLURM resources for filter: account, partition, per-rule ncpus/memory, extra sbatch flags |
| `evm_weights.txt` | EVM evidence weights: priority of each evidence source |
| `config/plant.yaml` | Mikado scoring: transcript selection parameters (plant-specific defaults provided) |
Single-file mode: By default, `config_annotate.yml` can also serve as `--cluster-config` (it includes a `__default__` section with SLURM settings). To use a separate cluster file, set `SYLVAN_CLUSTER_CONFIG` (or `SYLVAN_FILTER_CLUSTER_CONFIG` for filter). Generate a standalone cluster YAML with `bin/generate_cluster_from_config.py`.
Contains:
- Input file paths (genome, RNA-seq, proteins, neighbor species)
- Species-specific settings (Helixer model, Augustus species)
- Tool parameters (max intron length, EVM weights)
- Output prefix and directories
- Per-rule thread counts (read by Snakefiles)
Contains:
- SLURM account and partition (`__default__` section) — both are optional; leave empty or set to `placeholder` on systems that don't require them
- Per-rule CPU/memory/time overrides
- `extra_args` for additional sbatch flags (e.g., `--gres=gpu:1`, `--export=ALL`)
Controls how EvidenceModeler prioritizes different evidence sources. Higher weights give more influence. Example (from toy data):
ABINITIO_PREDICTION AUGUSTUS 7
ABINITIO_PREDICTION Helixer 3
OTHER_PREDICTION Liftoff 2
OTHER_PREDICTION GETA 5
OTHER_PREDICTION Genewise 2
TRANSCRIPT assembler-pasa.sqlite 10
TRANSCRIPT StringTie 1
TRANSCRIPT PsiClass 1
PROTEIN GeneWise 2
PROTEIN miniprot 2
Adjust weights based on the quality of each evidence type for your organism. PASA transcripts (weight 10) typically have the highest weight as they represent direct transcript evidence.
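When tuning weights programmatically, the file's three-column format is trivially machine-readable. A small sketch (`parse_evm_weights` is a hypothetical helper, not a Sylvan script):

```python
def parse_evm_weights(text):
    """Parse an EVM weights file into {(evidence_class, source): weight}.
    Lines that don't have exactly three whitespace-separated fields are skipped."""
    weights = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            cls, source, w = parts
            weights[(cls, source)] = int(w)
    return weights

example = """\
ABINITIO_PREDICTION AUGUSTUS 7
TRANSCRIPT assembler-pasa.sqlite 10
PROTEIN miniprot 2
"""
w = parse_evm_weights(example)
print(w[("TRANSCRIPT", "assembler-pasa.sqlite")])  # 10
```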
| Variable | Phase | Description |
|---|---|---|
| `SYLVAN_CONFIG` | Annotate | Path to config_annotate.yml (default: `config_annotate.yml` in cwd) |
| `SYLVAN_FILTER_CONFIG` | Filter | Path to config_filter.yml (default: `config_filter.yml` in cwd) |
| `SYLVAN_RESULTS_DIR` | Annotate | Override results output directory (default: `$(pwd)/results/`) |
| `TMPDIR` | Both | Temporary directory — critical on HPC (see below) |
| `SLURM_TMPDIR` | Both | Should match `TMPDIR` |
| `SINGULARITY_BIND` | Both | Bind additional host paths into container |
Why TMPDIR matters: Many HPC nodes mount /tmp as tmpfs (RAM-backed). Large temporary files from STAR, RepeatMasker, or Augustus can exhaust memory, causing cryptic segmentation faults or "no space left on device" errors. Always set TMPDIR to disk-backed project storage:
mkdir -p results/TMP
export TMPDIR="$(pwd)/results/TMP"
export SLURM_TMPDIR="$TMPDIR"

# For toydata
export SYLVAN_CONFIG="toydata/config/config_annotate.yml"
# For custom project
export SYLVAN_CONFIG="/path/to/my_config.yml"

This is required for any Snakemake command (dry-run, unlock, etc.):
export SYLVAN_CONFIG="toydata/config/config_annotate.yml"
snakemake -n --snakefile bin/Snakefile_annotate # dry-run
snakemake --unlock --snakefile bin/Snakefile_annotate # unlock

| Parameter | Description | Example |
|---|---|---|
| `prefix` | Output file prefix | `my_species` |
| `helixer_model` | `land_plant`, `vertebrate`, `fungi` | `land_plant` |
| `helixer_subseq` | `64152` (plants), `21384` (fungi), `213840` (vertebrates) | `64152` |
| `augustus_species` | Augustus species name for training | `arabidopsis` |
| `augustus_start_from` | Start Augustus training from an existing species model (skips de novo training if close match available) | `arabidopsis` |
| `use_augustus` | Use a pre-trained Augustus species without re-training (set to species name, or `placeholder` to train fresh) | `placeholder` |
| `num_evm_files` | Number of parallel EVM partitions (more = faster but more SLURM jobs) | 126 |
| `geta.RM_species` | RepeatMasker species database (e.g. `Embryophyta`, `Viridiplantae`, `Metazoa`) | `Embryophyta` |
Helixer benefits significantly from GPU acceleration (~10x speedup). To use a separate GPU partition, add the following per-rule override in config_annotate.yml:
helixer:
ncpus: 4
memory: 32g
account: your-gpu-account # GPU-specific billing account
  partition: your-gpu-partition # GPU partition name

To use custom Helixer .h5 model files instead of the container defaults:
- Set `helixer_model_dir` in your config to the host directory containing the model files:

  helixer_model_dir: "/path/to/custom/models"

- Bind the directory into the container via `SINGULARITY_BIND`:

  export SINGULARITY_BIND="/path/to/custom/models"
The pipeline will look for {helixer_model_dir}/{helixer_model}.h5 (e.g., /path/to/custom/models/land_plant.h5).
Follow these steps to configure Sylvan for your HPC cluster. This only needs to be done once per cluster.
# Show your accounts and partitions
sacctmgr show user "$USER" withassoc format=Account,Partition -nP
# List all available partitions with time limits, node counts, and memory
sinfo -o "%P %l %D %c %m"

Example output:
PARTITION TIMELIMIT NODES CPUS MEMORY
cpu-s1-pgl-0 14-00:00:00 4 64 256000
gpu-s2-core-0 14-00:00:00 10 64 256000
cpu-s3-test-0 8:00:00 2 64 191000
Note your account name (e.g., cpu-s1-pgl-0) and partition name (e.g., cpu-s1-pgl-0). On some clusters these are different; on others they are the same. If your cluster does not require an account or partition, you can leave them empty or set to placeholder.
Use generate_cluster_from_config.py to create a cluster config tailored to your cluster. The script reads per-rule resource requirements from config_annotate.yml and adds your SLURM account/partition.
python3 bin/generate_cluster_from_config.py \
--config config/config_annotate.yml \
--out config/cluster_annotate.yml \
--account your-account \
  --partition your-partition

What this does:
- Extracts per-rule CPU/memory/time settings from `config_annotate.yml`
- Sets `__default__.account` and `__default__.partition` to your values
- Auto-detects walltime: queries `sinfo` for your partition's max time limit and sets `time = max - 1 day` (e.g., if max is 14 days, sets 13 days). Falls back to 9 days if `sinfo` is unavailable.
- Writes a standalone `cluster_annotate.yml` ready for Snakemake's `--cluster-config`
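The "max minus 1 day" walltime arithmetic can be sketched as follows. The one-hour floor here is an illustrative assumption, not documented generator behavior:

```python
def slurm_to_seconds(t):
    """Parse a SLURM time limit like '14-00:00:00' or '8:00:00' into seconds."""
    days = 0
    if "-" in t:
        d, t = t.split("-", 1)
        days = int(d)
    parts = [int(p) for p in t.split(":")]
    while len(parts) < 3:  # allow 'MM:SS' or bare minutes
        parts.insert(0, 0)
    h, m, s = parts
    return ((days * 24 + h) * 60 + m) * 60 + s

def seconds_to_slurm(sec):
    """Format seconds back into SLURM's D-HH:MM:SS form."""
    days, rem = divmod(sec, 86400)
    h, rem = divmod(rem, 3600)
    m, s = divmod(rem, 60)
    return f"{days}-{h:02d}:{m:02d}:{s:02d}"

def default_walltime(partition_max):
    """Partition max minus one day, floored at one hour (illustrative floor)."""
    sec = max(slurm_to_seconds(partition_max) - 86400, 3600)
    return seconds_to_slurm(sec)

print(default_walltime("14-00:00:00"))  # 13-00:00:00
```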
To override the auto-detected time, use --time:
python3 bin/generate_cluster_from_config.py \
--config config/config_annotate.yml \
--out config/cluster_annotate.yml \
--account your-account --partition your-partition \
  --time "5-00:00:00"

For the toy data:
python3 bin/generate_cluster_from_config.py \
--config toydata/config/config_annotate.yml \
--out toydata/config/cluster_annotate.yml \
  --account your-account --partition your-partition

Open the generated file and check that __default__ looks correct:
__default__:
account: your-account
partition: your-partition
memory: 4g
ncpus: 1
nodes: 1
time: "13-00:00:00" # Auto-detected: partition max (14d) minus 1 day
name: '{rule}.{wildcards}'
output: results/logs/{rule}_{wildcards}.out
error: results/logs/{rule}_{wildcards}.err
  extra_args: ''

Important: Always quote time values in YAML (e.g., `time: "3-00:00:00"`). Unquoted values like `72:00:00` are silently parsed as integers by YAML 1.1, resulting in incorrect SLURM walltimes. The generator handles this automatically.
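The failure mode comes from YAML 1.1's sexagesimal (base-60) integer rule, which can be mimicked in a few lines to show why an unquoted value stops being a time string (illustration only, no YAML library involved):

```python
def yaml11_sexagesimal(token):
    """Mimic YAML 1.1 integer resolution for colon-separated digit groups:
    '72:00:00' resolves to 72*3600 + 0*60 + 0 = 259200, not a time string."""
    value = 0
    for part in token.split(":"):
        value = value * 60 + int(part)
    return value

print(yaml11_sexagesimal("72:00:00"))  # 259200 — SLURM receives an integer
# A quoted "72:00:00" stays a string and reaches sbatch unchanged.
```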
export SYLVAN_CONFIG="config/config_annotate.yml"
snakemake -n --snakefile bin/Snakefile_annotate

If this completes without errors, your configuration is valid.
# Submit as a SLURM head job (recommended)
sbatch -A your-account -p your-partition -c 1 --mem=1g \
-J annotate -o annotate.out -e annotate.err \
--wrap="./bin/annotate.sh"
# Or run directly (if already on a compute node)
./bin/annotate.sh

If you move to a new HPC system, re-run Step 2 with the new account/partition. The generator will auto-detect the new partition's time limit. All other settings (per-rule resources) are preserved from config_annotate.yml.
If you prefer not to use a separate cluster file, you can edit the __default__ section directly in config_annotate.yml. The entry scripts default to using the pipeline config as --cluster-config when SYLVAN_CLUSTER_CONFIG is not set.
Job submission is handled by bin/cluster_submit.py, which dynamically builds the sbatch command. Account (-A) and partition (-p) flags are automatically skipped when their values are empty or set to placeholder — no need to edit any script.
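The skip-when-placeholder behavior can be sketched like this (`build_sbatch` is a hypothetical simplification; the real `bin/cluster_submit.py` handles more options):

```python
def build_sbatch(rule_cfg):
    """Assemble an sbatch argument list, dropping -A/-p when the value is
    empty or the literal 'placeholder', mirroring the documented behavior."""
    cmd = ["sbatch"]
    account = rule_cfg.get("account", "")
    partition = rule_cfg.get("partition", "")
    if account and account != "placeholder":
        cmd += ["-A", account]
    if partition and partition != "placeholder":
        cmd += ["-p", partition]
    cmd += ["-c", str(rule_cfg.get("ncpus", 1)),
            "--mem", rule_cfg.get("memory", "4g")]
    if rule_cfg.get("extra_args"):
        cmd += rule_cfg["extra_args"].split()
    return cmd

# Neither -A nor -p is emitted here; the GPU flag passes through.
print(build_sbatch({"account": "placeholder", "partition": "",
                    "ncpus": 12, "memory": "48g",
                    "extra_args": "--gres=gpu:1"}))
```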
For per-rule customization (e.g., GPU for Helixer), add overrides to the rule's section in cluster_annotate.yml or config_annotate.yml:
helixer:
ncpus: 4
memory: 32g
account: your-gpu-account
partition: your-gpu-partition
  extra_args: "--gres=gpu:1"

All outputs are organized under results/:
results/
├── PREFILTER/
│ └── Sylvan.gff3 # Annotate phase final output
│
├── AB_INITIO/
│ └── Helixer/ # Helixer predictions
│
├── GETA/
│ ├── RepeatMasker/ # Repeat masking results
│ ├── Augustus/ # Augustus predictions
│ ├── transcript/ # TransDecoder results
│ ├── homolog/ # Protein alignments (Miniprot → GeneWise)
│ └── CombineGeneModels/ # GETA gene models
│
├── LIFTOVER/
│ └── LiftOff/ # Neighbor species liftover
│
├── TRANSCRIPT/
│ ├── PASA/ # PASA assemblies
│ ├── spades/ # De novo assembly
│ └── evigene/ # Evigene transcript clustering
│
├── PROTEIN/ # Merged protein alignments
│
├── EVM/ # EvidenceModeler output
│
├── FILTER/
│ ├── filtered.gff3 # Kept gene models
│ ├── discard.gff3 # Discarded gene models
│ ├── data.tsv # Feature matrix (input to random forest)
│ ├── keep_data.tsv # Evidence data for kept genes
│ ├── discard_data.tsv # Evidence data for discarded genes
│ ├── {prefix}.cdna # Transcript sequences
│ ├── {prefix}.pep # Peptide sequences
│ ├── STAR/ # RNA-seq realignment for filter
│ ├── rsem_outdir/ # RSEM quantification
│ ├── splitPep/ # Parallelized BLAST inputs
│ ├── busco_*/ # BUSCO results (monitoring only)
│ └── lncrna_predict.csv # lncDC predictions
│
├── BENCHMARK/ # Quality benchmarking (optional)
│ ├── benchmark_summary.tsv # Combined BUSCO + OMArk table
│ ├── {label}.pep # Extracted proteins per GFF3
│ ├── {label}.busco/ # BUSCO results per GFF3
│ └── {label}.omark/ # OMArk results per GFF3 (if omark_db set)
│
├── config/ # Runtime config copies
│
└── logs/ # SLURM job logs
Use TidyGFF to prepare annotations for public distribution:
singularity exec sylvan.sif python bin/TidyGFF.py \
MySpecies results/FILTER/filtered.gff3 \
--out MySpecies_v1.0 --splice-name t --justify 5 --sort \
  --chrom-regex "^Chr" --source Sylvan

TidyGFF options:
| Option | Description |
|---|---|
| `pre` (positional) | Prefix for gene IDs (e.g. `Ath` produces `Ath01G000010`) |
| `gff` (positional) | Input GFF3 file |
| `--out` | Output file basename (produces `.gff3`, `.cdna`, `.pep` files) |
| `--splice-name` | Splice variant label style (e.g. `t` → `mRNA1.t1`, `mRNA1.t2`) |
| `--justify` | Number of digits in gene IDs (default: 8) |
| `--sort` | Sort output by chromosome and start coordinate |
| `--chrom-regex` | Regex for chromosome prefixes (auto-detects `Chr`, `chr`, `LG`, `Ch`, `^\d`) |
| `--contig-regex` | Regex for contig/scaffold naming (e.g. `HiC_scaffold_(\d+$),Scaf`) |
| `--source` | Value for GFF column 2 (e.g. `Sylvan`) |
| `--remove-names` | Remove Name attributes from GFF |
| `--no-chrom-id` | Do not number gene IDs by chromosome |
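The interaction of the prefix, chromosome numbering, and `--justify` can be illustrated with a hypothetical reconstruction of the ID scheme implied by the `Ath01G000010` example (`tidy_gene_id` is not part of TidyGFF and the exact formatting may differ):

```python
def tidy_gene_id(pre, chrom_num, index, justify=8, step=10):
    """Hypothetical gene-ID formatter: prefix + zero-padded chromosome
    number + 'G' + zero-padded ordinal, counting in steps of `step`."""
    return f"{pre}{chrom_num:02d}G{index * step:0{justify}d}"

# The README's Ath01G000010 example corresponds to a 6-digit justify.
print(tidy_gene_id("Ath", 1, 1, justify=6))  # Ath01G000010
```

Stepping IDs by 10 is a common annotation convention that leaves room to insert genes later without renumbering.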
# Force rerun all
./bin/annotate.sh --forceall
# Rerun specific rule
./bin/annotate.sh --forcerun helixer
# Rerun incomplete jobs (jobs that started but didn't finish)
./bin/rerun-incomplete.sh
# Generate report after completion
snakemake --report report.html --snakefile bin/Snakefile_annotate
# Unlock after interruption
./bin/annotate.sh --unlock
# Clean up intermediate files (run after BOTH phases complete)
./bin/cleanup.sh

bin/cleanup.sh removes intermediate files generated during the annotation phase while preserving:
- Final outputs (`PREFILTER/Sylvan.gff3`, `FILTER/filtered.gff3`)
- Log files (`results/logs/`)
- Configuration files
- Filter phase outputs (`FILTER/`)
Run this only after both annotation and filter phases have completed successfully.
| Script | Purpose |
|---|---|
| `bin/generate_cluster_from_config.py` | Generate a standalone cluster-only YAML from `config_annotate.yml` |
| `bin/cluster_submit.py` | SLURM job submission wrapper — dynamically builds sbatch command, skips `-A`/`-p` when account/partition are empty |
Sylvan can run on any Linux machine without SLURM. The bin/annotate_local.sh script uses Snakemake's --cores flag for local parallelism instead of --cluster.
- 16+ CPU cores, 64+ GB RAM recommended
- Singularity 3.x+ (or a writable sandbox)
- No SLURM or job scheduler needed
- Create a local config by copying the toydata example:

  cp toydata/config/config_annotate_local.yml toydata/config/my_local_config.yml

- Edit `__default__` — set `account` and `partition` to empty strings (they are ignored in local mode).
- Adjust per-rule `ncpus`/`threads` to fit your machine (e.g., cap at 12 for a 16-core machine).
export SYLVAN_CONFIG="toydata/config/config_annotate_local.yml"
# Dry run
snakemake -n --snakefile bin/Snakefile_annotate
# Run
./bin/annotate_local.sh
# Pass extra snakemake flags
./bin/annotate_local.sh -n # dry-run
./bin/annotate_local.sh --forceall # force rerun

| Aspect | SLURM (`annotate.sh`) | Local (`annotate_local.sh`) |
|---|---|---|
| Parallelism | `--cluster sbatch` | `--cores 16` |
| Job scheduling | SLURM queue | Snakemake's built-in scheduler |
| Resource limits | Per-job SLURM allocation | System-wide (all jobs share RAM) |
| Singularity | `--use-singularity` | `--use-singularity` (same) |
| GPU (Helixer) | `--gres=gpu:1` | `--nv` flag (auto-detects host GPU) |
| Issue | Solution |
|---|---|
| Helixer GPU/CUDA mismatch | Container CUDA version must match host driver. If mismatched, Helixer produces empty output (pipeline continues). |
| `run:` blocks execute on host | Already handled by `run_in_container()` helper in Snakefile_annotate. |
| Container `/bin/sh` is dash | Avoid `&>` in shell commands inside container; use `> file 2>&1` instead. |
# Find recent errors
ls -lt results/logs/*.err | head -10
grep -l 'Error\|Traceback' results/logs/*.err
# View specific log
cat results/logs/{rule}_{wildcards}.err

| Issue | Solution |
|---|---|
| Out of memory | Increase `memory` in config_annotate.yml for the rule |
| `No space left on device` | `TMPDIR` is on tmpfs or quota exceeded — set `TMPDIR` to project storage |
| `Segmentation fault` | Often caused by tmpfs exhaustion — set `TMPDIR` to disk-backed storage |
| File not found (Singularity) | Path not bound in container — add to `SINGULARITY_BIND` |
| Singularity bind error | Ensure paths are within working directory or use `SINGULARITY_BIND` |
| Permission denied in container | Check directory permissions, ensure path is bound |
| SLURM account error | Use `account` (billing account), not username |
| SLURM account error | Set `account` to empty string (`""`) or `placeholder` in the cluster config if your HPC doesn't require one |
| LFS files not downloaded | Run `git lfs pull`; verify with `ls -la toydata/` (files should be > 200 bytes) |
| Augustus training fails | Needs minimum ~500 training genes; use `augustus_start_from` with a close species |
| Job timeout | Increase `time` in config_annotate.yml for the rule |
| Variables not in SLURM job | Add `#SBATCH --export=ALL` or explicitly export in submit script |
| Filter `chrom_regex` error | Ensure `chrom_regex` in config_filter.yml matches your chromosome naming convention |
- General recommendation: 4 GB per thread
- Example: 48 threads = 192 GB memory
- `ncpus` and `threads` should match in `config_annotate.yml`
- Some rules need more: `mergeSTAR` may require ~18 GB per thread for large datasets
- Check `df -h $TMPDIR` to ensure temp storage is on real disk, not tmpfs
Sylvan: A comprehensive genome annotation pipeline. Under review.
MIT License - see LICENSE

