Sylvan is a comprehensive genome annotation pipeline that combines EVM/PASA, GETA, and Helixer with semi-supervised random forest filtering for generating high-quality gene models from raw genome assemblies.
- Multi-evidence integration: RNA-seq, protein homology, neighbor species annotations
- Dual RNA-seq alignment pathways: STAR and HiSat2 with StringTie/PsiCLASS
- Multiple ab initio predictors: Helixer (GPU-accelerated), Augustus
- Semi-supervised filtering: Random forest-based spurious gene removal
- Score-based filtering: Alternative logistic regression + random forest scoring pipeline
- HPC-ready: SLURM cluster support with Singularity containers
- Local execution: Run without SLURM on any Linux machine with `bin/annotate_local.sh`
- Customizable cluster command: `sbatch` template lives in the config YAML — no shell script edits needed
- TidyGFF: Format annotations for public distribution
- Cleanup utility: Remove intermediate files after pipeline completion
- Complete Installation (conda environment, Singularity image, git clone)
- Run with toy data:
# Dry-run first
snakemake -n --snakefile bin/Snakefile_annotate
# Run annotation
./bin/annotate_toydata.sh

The toy data experiment uses A. thaliana chromosome 4 with 12 paired-end RNA-seq samples, 3 neighbor species, and the land_plant Helixer model. For a detailed walkthrough, see the Wiki.
- Linux (tested on CentOS/RHEL, Ubuntu)
- Singularity/Apptainer 3.x+
- Conda/Mamba
- SLURM for cluster execution (optional — see Local Execution for running without HPC)
- Git LFS (for toy data)
- GPU (optional): NVIDIA GPU with driver >= 525.60.13 for Helixer acceleration. See GPU / CUDA Compatibility for details.
Most bioinformatics tools (STAR, Augustus, GeneWise, PASA, EVM, BLAST, BUSCO, etc.) are bundled inside the Singularity container. The host environment needs:
| Package | Purpose |
|---|---|
| Python 3.10+ | Pipeline orchestration |
| Snakemake 7 | Workflow engine |
| pandas | Data manipulation |
| scikit-learn | Random forest classifier |
| NumPy | Numerical operations |
| PyYAML | Config parsing |
| rich | Logging (optional) |
Perl and R scripts (fillingEndsOfGeneModels.pl, filter_distributions.R) run inside the Singularity container and do not require host installation.
# Create conda environment
conda create -n sylvan -c conda-forge -c bioconda python=3.11 snakemake=7 -y
conda activate sylvan
# Download Singularity image (latest = v4, GPU-capable TensorFlow)
singularity pull --arch amd64 sylvan.sif library://wyim/sylvan/sylvan:latest
# Or a specific version: library://wyim/sylvan/sylvan:v3 (CPU-only TF, smaller)
# Clone repository (with Git LFS for toy data)
git lfs install
git clone https://github.com/plantgenomicslab/Sylvan.git
cd Sylvan/singularity
sudo singularity build sylvan.sif Sylvan.def
# Or without root (requires user namespaces):
singularity build --fakeroot sylvan.sif Sylvan.def

Sylvan uses Helixer (TensorFlow-based deep learning gene predictor), which benefits significantly from GPU acceleration. The container is designed to work across different GPU hardware without CUDA version conflicts.
How it works:
The Singularity container bundles tensorflow with individual NVIDIA CUDA pip packages (nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cublas-cu12, etc.). This means:
- No host CUDA toolkit required — only the NVIDIA driver is needed on the host
- No GPU-model-specific builds — the same container works on V100, A100, H100, etc.
- Automatic CPU fallback — if no GPU is detected, TensorFlow runs on CPU transparently (slower but functional)
| Component | Location | Required |
|---|---|---|
| NVIDIA driver | Host (>= 525.60.13) | For GPU only |
| CUDA runtime | Container (pip: nvidia-cuda-runtime-cu12) | Bundled |
| cuDNN | Container (pip: nvidia-cudnn-cu12) | Bundled |
| TensorFlow 2.15 | Container (helixer conda env) | Bundled |
Singularity --nv flag:
All entry scripts pass --nv to Singularity, which bind-mounts the host's NVIDIA driver libraries into the container. This is safe to include even on CPU-only nodes — Singularity silently skips --nv if no GPU is found.
# Default: GPU passthrough enabled (falls back to CPU if no GPU)
./bin/annotate.sh
# Override singularity args if needed (e.g., custom bind paths, no --nv)
SYLVAN_SINGULARITY_ARGS="--nv -B /scratch" ./bin/annotate.sh

SLURM GPU configuration:
For HPC clusters with separate CPU and GPU partitions, configure the helixer rule in config_annotate.yml to request GPU resources:
helixer:
account: gpu-account # GPU-specific SLURM account (if different)
partition: gpu-partition # GPU partition (e.g., gpu-s1-pgl-0)
extra_args: "--gres=gpu:1" # Request 1 GPU
ncpus: 12
  memory: 48g

All other rules run on CPU nodes using the __default__ account/partition. Only the helixer rule needs GPU access.
Compatibility matrix (tested):
| Host GPU | Host Driver | Container CUDA | Status |
|---|---|---|---|
| NVIDIA A100 | >= 525.60.13 | 12.x (bundled) | Supported |
| NVIDIA V100 | >= 525.60.13 | 12.x (bundled) | Supported |
| No GPU | N/A | N/A | CPU fallback (slower) |
Note: The minimum driver version 525.60.13 corresponds to CUDA 12.0 forward compatibility. Older drivers will trigger CPU fallback. Run `nvidia-smi` on the host to check your driver version.
The Sylvan pipeline consists of two main phases — annotation and filtration — with configurable modules that process evidence from multiple sources and combine them into a unified gene model. The following describes the available tools and modules. Users configure which components to enable and how to parameterize them via config_annotate.yml and config_filter.yml.
The annotation phase generates gene models by integrating multiple configurable evidence sources.
- Repeat Masking
  - Runs RepeatMasker with a user-specified species library (e.g. `Embryophyta`, `Viridiplantae`, `Metazoa` — configured via `geta.RM_species`)
  - Can optionally run RepeatModeler for de novo repeat identification
  - Supports user-supplied custom repeat libraries (e.g. from EDTA, configured via `geta.RM_lib`)
- RNA-seq Processing
  - Quality-trims reads with fastp
  - Aligns reads via STAR (default) or HiSat2 (alternative pathway — both are available in the pipeline; the active pathway depends on the Snakemake rule graph)
  - Assembles transcripts with StringTie and PsiCLASS
  - Optionally performs de novo transcript assembly with SPAdes + Evigene clustering
  - Refines and clusters transcripts with PASA
- Protein Homology (sequential pipeline)
  - Miniprot performs fast protein-to-genome alignment to identify candidate gene regions
  - GeneWise refines gene structures on Miniprot-identified regions
  - GMAP provides exonerate-style exon-level alignments
- Ab Initio Prediction
  - Helixer: deep learning–based gene prediction (optionally GPU-accelerated; model selected via `helixer_model` — `land_plant`, `vertebrate`, or `fungi`)
  - Augustus: HMM-based prediction, either trained de novo on the target genome or initialized from an existing species model (via `augustus_start_from`), or skipped entirely if a pre-trained model is supplied (via `use_augustus`)
- Liftover
  - LiftOff transfers annotations from one or more neighbor species (configured via `liftoff.neighbor_gff` and `liftoff.neighbor_fasta`)
- GETA Pipeline
  - TransDecoder predicts ORFs from assembled transcripts
  - Gene models are combined and filtered; repeat-overlapping genes are removed
- Portcullis
  - Filters splice junctions from transcript evidence
- EvidenceModeler (EVM)
  - Integrates all evidence sources using configurable weights (`evm_weights.txt`)
  - Generates consensus gene models
  - Genome is partitioned into overlapping segments for parallel execution (partition count configured via `num_evm_files`)
- PASA Post-processing
  - PASA operates at two stages in the pipeline: (1) initial transcript assembly and clustering before EVM, and (2) post-EVM refinement for UTR addition and alternative isoform incorporation
- PASA–EVM Merge
  - PASA annotation comparison only outputs gene models with transcript evidence overlap, silently dropping EVM genes without transcript support. The merge step (`merge_pasa_evm.py`) rescues these dropped EVM genes by adding them back alongside PASA-updated models, preserving conserved genes that lack RNA-seq coverage.
- Gene Boundary Refinement
  - Detects truncated gene models by comparing against Helixer and Augustus predictions at the same locus
  - Replaces truncated models only when supported by independent evidence: RNA-seq splice junctions (primary authority for exon boundaries), miniprot protein alignment coverage (detection only, not used for replacement), and cross-source CDS boundary agreement
  - Miniprot is explicitly excluded as a replacement source — its protein-level alignments give approximate, not exact, exon boundaries
- AGAT
  - Final GFF3 format cleaning and validation
Output: results/PREFILTER/Sylvan.gff3
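The rescue logic of the PASA–EVM merge step can be sketched as a set difference over gene IDs. This is a minimal illustration only; the real `merge_pasa_evm.py` carries the full feature records, not just IDs:

```python
def gene_ids(gff3_text):
    """Collect gene feature IDs from GFF3 text (column 9 ID= attribute)."""
    ids = set()
    for line in gff3_text.splitlines():
        cols = line.split("\t")
        if len(cols) == 9 and cols[2] == "gene":
            for field in cols[8].split(";"):
                if field.startswith("ID="):
                    ids.add(field[3:])
    return ids

def rescue_dropped(evm_gff3, pasa_gff3):
    """Return EVM gene IDs absent from the PASA-updated annotation."""
    return gene_ids(evm_gff3) - gene_ids(pasa_gff3)

# evm.g2 has no transcript support, so PASA drops it; the merge adds it back.
evm = ("chr1\tEVM\tgene\t100\t900\t.\t+\t.\tID=evm.g1\n"
       "chr1\tEVM\tgene\t2000\t2900\t.\t-\t.\tID=evm.g2\n")
pasa = "chr1\tPASA\tgene\t100\t950\t.\t+\t.\tID=evm.g1\n"
print(sorted(rescue_dropped(evm, pasa)))  # ['evm.g2']
```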
The filter phase computes additional evidence features for each gene model and applies a semi-supervised random forest classifier to separate high-quality genes from spurious predictions.
The following features are computed for every gene model in the draft annotation:
- PfamScan — identifies conserved protein domains using the Pfam-A HMM database
- RSEM — quantifies transcript expression (TPM) from re-aligned RNA-seq reads; bedtools computes read coverage
- BLASTp (homolog) — measures similarity to a user-supplied protein database (parallelized across 20 split peptide files)
- BLASTp (RexDB) — measures similarity to a repeat element protein database (e.g. RepeatExplorer Viridiplantae)
- Ab initio overlap — computes the fraction of each gene model overlapping with Augustus predictions, Helixer predictions, and RepeatMasker annotations
- Miniprot overlap — computes the fraction of each gene model overlapping with Miniprot protein-to-genome alignments (used as RF feature only — not as a rescue condition)
- lncDC — classifies transcripts as protein-coding or long non-coding RNA using an XGBoost model with plant-specific pre-trained parameters
- BUSCO — identifies conserved single-copy orthologs (used to monitor the filtration process and as a safety net to prevent discarding conserved genes)
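Several of these features reduce to "what fraction of the gene model is covered by intervals from another source". A minimal sketch of that computation, assuming half-open coordinates (the pipeline's own implementation may differ, e.g. by delegating to bedtools):

```python
def overlap_fraction(gene, intervals):
    """Fraction of the gene span covered by (start, end) intervals.
    Coordinates are half-open [start, end); intervals may overlap."""
    gs, ge = gene
    merged = []
    for s, e in sorted(intervals):
        s, e = max(s, gs), min(e, ge)  # clip to the gene span
        if s >= e:
            continue
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))  # extend run
        else:
            merged.append((s, e))
    return sum(e - s for s, e in merged) / (ge - gs)

# Gene of length 100; two overlapping predictions cover 100..180 of it.
print(overlap_fraction((100, 200), [(90, 150), (140, 180)]))  # 0.8
```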
- Initial gene set selection: A data-driven heuristic selects high-confidence positive genes (strong homolog/Pfam/expression evidence) and high-confidence negative genes (repeat-like, no expression) using configurable cutoff thresholds (TPM, coverage, BLAST identity/coverage, repeat overlap)
- Random forest training: A binary classifier is trained on the initial gene set
- Iterative refinement: High-confidence predictions (above the `--recycle` threshold, default 0.95) are added back to the training set, and the model is retrained. This repeats for up to `--max-iter` iterations (default 5) or until convergence
- Three-tier rescue for undecided genes: (1) RF Keep probability > 0.6 excluding TE-only genes, (2) Pfam domain present without repeat/RexDB contamination, (3) BUSCO safety net — genes with Complete BUSCO hits are never discarded
- Discard classification: Discarded genes are categorized as `TE_related` (RexDB hit), `lncRNA` (lncDC prediction), or `pseudogene` (low/no evidence)
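The recycle/retrain loop above can be illustrated with a toy one-feature classifier standing in for the random forest. Everything here is illustrative (the pipeline trains a scikit-learn random forest on the full feature matrix, and `train`/`prob_keep`/`self_train` are hypothetical helpers):

```python
import math

def train(labeled):
    """Toy stand-in for RF training: a 1-D decision threshold at the
    midpoint between class means (assumes both classes are present)."""
    keep = [x for x, y in labeled if y == 1]
    drop = [x for x, y in labeled if y == 0]
    return (sum(keep) / len(keep) + sum(drop) / len(drop)) / 2

def prob_keep(threshold, x):
    """Squash distance from the threshold into a pseudo-probability."""
    return 1.0 / (1.0 + math.exp(-(x - threshold)))

def self_train(labeled, unlabeled, recycle=0.95, max_iter=5):
    """Add confident pseudo-labels back to the training set and retrain,
    up to max_iter rounds or until nothing clears the recycle threshold."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_iter):
        t = train(labeled)
        confident = [x for x in pool
                     if prob_keep(t, x) > recycle or prob_keep(t, x) < 1 - recycle]
        if not confident:
            break  # convergence
        for x in confident:
            labeled.append((x, 1 if prob_keep(t, x) > 0.5 else 0))
            pool.remove(x)
    return train(labeled), pool

# 9.5 and 0.5 are recycled as pseudo-labels; 5.0 stays undecided.
threshold, undecided = self_train([(8.0, 1), (2.0, 0)], [9.5, 0.5, 5.0])
print(threshold, undecided)  # 5.0 [5.0]
```

Undecided genes like the `5.0` example are exactly the ones handed to the three-tier rescue described above.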
Output files:
- `results/FILTER/filtered.gff3` — Kept gene models
- `results/FILTER/discard.gff3` — Discarded gene models (each feature annotated with `discard_reason=TE_related|lncRNA|pseudogene`)
- `results/FILTER/data.tsv` — Feature matrix used by random forest
- `results/FILTER/keep_data.tsv` — Evidence data for kept genes
- `results/FILTER/discard_data.tsv` — Evidence data for discarded genes with `discard_reason` column
- `results/FILTER/{prefix}.cdna` — Extracted transcript sequences
- `results/FILTER/{prefix}.pep` — Extracted peptide sequences
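As a quick sanity check on a finished run, the `discard_reason` attributes in `discard.gff3` can be tallied in a few lines of Python (`discard_reasons` is a hypothetical helper, not part of the pipeline):

```python
from collections import Counter

def discard_reasons(gff3_text):
    """Tally discard_reason attribute values across GFF3 features."""
    counts = Counter()
    for line in gff3_text.splitlines():
        if line.startswith("#") or "\t" not in line:
            continue
        attrs = line.rsplit("\t", 1)[-1]  # column 9
        for field in attrs.split(";"):
            if field.startswith("discard_reason="):
                counts[field.split("=", 1)[1]] += 1
    return counts

gff = ("chr1\tSylvan\tgene\t1\t500\t.\t+\t.\tID=g1;discard_reason=TE_related\n"
       "chr1\tSylvan\tgene\t900\t1200\t.\t-\t.\tID=g2;discard_reason=pseudogene\n"
       "chr1\tSylvan\tgene\t2000\t2300\t.\t+\t.\tID=g3;discard_reason=TE_related\n")
print(discard_reasons(gff))  # Counter({'TE_related': 2, 'pseudogene': 1})
```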
An alternative scoring pipeline (Snakefile_filter_score) uses logistic regression and random forest scoring with pseudo-labels instead of the iterative semi-supervised approach. This requires the same feature generation outputs and produces:
- `results/FILTER/scores.csv` — Per-gene scores and features
- `results/FILTER/scores.metrics.txt` — AUC/PR/F1 and chosen thresholds
export SYLVAN_FILTER_CONFIG="toydata/config/config_filter.yml"
./bin/filter_score_toydata.sh

This section describes the inputs, configuration, and commands needed to run the annotation pipeline on your data.
| Input | Description | Config Field |
|---|---|---|
| Genome assembly | FASTA file (`.fa`, `.fasta`, `.fna`, `.fa.gz`, `.fasta.gz`, `.fna.gz`) | `genome` |
| RNA-seq data | Paired-end gzipped FASTQ files (`*_1.fastq.gz`/`*_2.fastq.gz` or `*_R1.fastq.gz`/`*_R2.fastq.gz`) in a folder | `rna_seq` |
| Protein sequences | FASTA from UniProt, OrthoDB, etc. (comma-separated for multiple files) | `proteins` |
| Neighbor species | Directories containing GFF3 and genome FASTA (`.fa`, `.fasta`, `.fna`, `.fas`, `.fsa`, `.seq`) files, one per species | `liftoff.neighbor_gff`, `liftoff.neighbor_fasta` |
| Repeat library | EDTA output (`.TElib.fa`) | `geta.RM_lib` |
| Singularity image | Path to `sylvan.sif` | `singularity` |
# Set config (required)
export SYLVAN_CONFIG="toydata/config/config_annotate.yml"
# Dry run
snakemake -n --snakefile bin/Snakefile_annotate
# Submit to SLURM
sbatch -A [account] -p [partition] -c 1 --mem=1g \
-J annotate -o annotate.out -e annotate.err \
--wrap="./bin/annotate_toydata.sh"
# Or run directly
./bin/annotate_toydata.sh

# Set config for local execution
export SYLVAN_CONFIG="toydata/config/config_annotate_local.yml"
# Dry run
snakemake -n --snakefile bin/Snakefile_annotate
# Run locally (uses --cores instead of --cluster)
./bin/annotate_local.sh

See Local Execution for details.
Output: results/PREFILTER/Sylvan.gff3
This section describes the inputs and commands for the filter pipeline. All inputs below are specified in config_filter.yml.
| Input | Description | Config Field |
|---|---|---|
| Annotated GFF | Output from Annotate phase (`results/PREFILTER/Sylvan.gff3`) | `anot_gff` |
| Genome | Same as Annotate phase | `genome` |
| RNA-seq data | Same as Annotate phase | `rna_seq` |
| Protein sequences | Same as Annotate phase | `protein` |
| Augustus GFF | Augustus predictions (`results/GETA/Augustus/augustus.gff3`) | `augustus_gff` |
| Helixer GFF | Helixer predictions (`results/AB_INITIO/Helixer/helixer.gff3`) | `helixer_gff` |
| Repeat GFF | RepeatMasker output (`results/GETA/RepeatMasker/genome.repeat.gff3`) | `repeat_gff` |
| HmmDB | Pfam database directory (default: `/usr/local/src` inside container) | `HmmDB` |
| RexDB | RepeatExplorer protein DB (e.g. `Viridiplantae_v4.0.fasta` from rexdb) | `RexDB` |
| BUSCO lineage | e.g., `eudicots_odb10` | `busco_lin` |
| Chromosome regex | Regex to match chromosome prefixes (e.g. `(^Chr)|(^chr)|(^LG)`) | `chrom_regex` |
Filter cutoff thresholds (in config_filter.yml under Cutoff):
| Parameter | Description | Default |
|---|---|---|
| `tpm` | TPM threshold for initial gene selection | 3 |
| `rsem_cov` | RNA-seq coverage threshold | 0.5 |
| `blast_pident` / `blast_qcovs` | BLASTp identity / coverage | 0.6 / 0.6 |
| `rex_pident` / `rex_qcovs` | RexDB identity / coverage | 0.6 / 0.6 |
| `helixer_cov` / `augustus_cov` | Ab initio overlap | 0.8 / 0.8 |
| `repeat_cov` | Repeat overlap coverage threshold | 0.5 |
| `miniprot_cov` | Miniprot protein alignment overlap (RF feature) | 0.5 |
# Set config (required)
export SYLVAN_FILTER_CONFIG="toydata/config/config_filter.yml"
# Dry run
snakemake -n --snakefile bin/Snakefile_filter
# Submit to SLURM
sbatch -A [account] -p [partition] -c 1 --mem=4g \
-J filter -o filter.out -e filter.err \
--wrap="./bin/filter_toydata.sh"
# Or run directly
./bin/filter_toydata.sh

Output: results/FILTER/filtered.gff3
Compare annotation quality across all pipeline stages using BUSCO and OMArk:
# Configure benchmark targets in config_filter.yml (Benchmark section)
# Then run:
./bin/benchmark_local.sh # local
./bin/benchmark.sh       # SLURM

This benchmarks each GFF3 listed in `Benchmark.gff3_files` by extracting proteins and running BUSCO protein-mode (and optionally OMArk). Results are saved to results/BENCHMARK/benchmark_summary.tsv.
OMArk setup (optional): OMArk requires the OMAmer database (LUCA.h5, ~6 GB), which is not bundled in the container to keep the image under the Sylabs Cloud 10 GB limit. Download it into your project root:
cd Sylvan/ # project root
wget https://omabrowser.org/All/LUCA.h5

The toydata config (toydata/config/config_filter_local.yml) already references LUCA.h5 as a relative path from the working directory. For custom projects, set `Benchmark.omark_db` in config_filter.yml to the path where you downloaded LUCA.h5. Leave empty to skip OMArk and run BUSCO only.
Output: results/BENCHMARK/benchmark_summary.tsv
See the Wiki — Step 5d for toydata benchmark results and detailed configuration.
Run all phases (annotate + filter + benchmark) sequentially:
./bin/run_local.sh # all three phases
SYLVAN_SKIP_ANNOTATE=1 ./bin/run_local.sh # skip Phase 1
SYLVAN_SKIP_BENCHMARK=1 ./bin/run_local.sh # skip Phase 3

After a filter run completes, run the leave-one-feature-out ablation test:
python bin/filter_feature_importance.py FILTER/data.tsv results/busco/full_table.tsv \
  --output-table FILTER/feature_importance.tsv

See the Wiki for detailed usage, optional flags, and workflow.
Sylvan separates pipeline configuration (inputs, tool parameters, thread counts) from cluster configuration (SLURM account, partition, resources).
| File | Purpose |
|---|---|
| `config_annotate.yml` | Pipeline options: input paths, species parameters, tool settings, per-rule thread counts |
| `config_filter.yml` | Filter options: input paths, cutoff thresholds, thread counts |
| `cluster_annotate.yml` | SLURM resources for annotate: account, partition, per-rule ncpus/memory, extra sbatch flags |
| `cluster_filter.yml` | SLURM resources for filter: account, partition, per-rule ncpus/memory, extra sbatch flags |
| `evm_weights.txt` | EVM evidence weights: priority of each evidence source |
| `config/plant.yaml` | Mikado scoring: transcript selection parameters (plant-specific defaults provided) |
Single-file mode: By default, `config_annotate.yml` can also serve as `--cluster-config` (it includes a `__default__` section with SLURM settings). To use a separate cluster file, set `SYLVAN_CLUSTER_CONFIG` (or `SYLVAN_FILTER_CLUSTER_CONFIG` for filter). Generate a standalone cluster YAML with `bin/generate_cluster_from_config.py`.
Contains:
- Input file paths (genome, RNA-seq, proteins, neighbor species)
- Species-specific settings (Helixer model, Augustus species)
- Tool parameters (max intron length, EVM weights)
- Output prefix and directories
- Per-rule thread counts (read by Snakefiles)
Contains:
- SLURM account and partition (`__default__` section) — both are optional; leave empty or set to `placeholder` on systems that don't require them
- Per-rule CPU/memory/time overrides
- `extra_args` for additional sbatch flags (e.g., `--gres=gpu:1`, `--export=ALL`)
Controls how EvidenceModeler prioritizes different evidence sources. Higher weights give more influence. Example (from toy data):
ABINITIO_PREDICTION AUGUSTUS 7
ABINITIO_PREDICTION Helixer 3
OTHER_PREDICTION Liftoff 2
OTHER_PREDICTION GETA 5
OTHER_PREDICTION Genewise 2
TRANSCRIPT assembler-pasa.sqlite 10
TRANSCRIPT StringTie 1
TRANSCRIPT PsiClass 1
PROTEIN GeneWise 2
PROTEIN miniprot 2
Adjust weights based on the quality of each evidence type for your organism. PASA transcripts (weight 10) typically have the highest weight as they represent direct transcript evidence.
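When tuning weights programmatically, the file's three-column format is trivially machine-readable. A small sketch (`parse_evm_weights` is a hypothetical helper, not a Sylvan script):

```python
def parse_evm_weights(text):
    """Parse an EVM weights file into {(evidence_class, source): weight}.
    Lines that don't have exactly three whitespace-separated fields are skipped."""
    weights = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            cls, source, w = parts
            weights[(cls, source)] = int(w)
    return weights

example = """\
ABINITIO_PREDICTION AUGUSTUS 7
TRANSCRIPT assembler-pasa.sqlite 10
PROTEIN miniprot 2
"""
w = parse_evm_weights(example)
print(w[("TRANSCRIPT", "assembler-pasa.sqlite")])  # 10
```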
| Variable | Phase | Description |
|---|---|---|
| `SYLVAN_CONFIG` | Annotate | Path to config_annotate.yml (default: `config_annotate.yml` in cwd) |
| `SYLVAN_FILTER_CONFIG` | Filter | Path to config_filter.yml (default: `config_filter.yml` in cwd) |
| `SYLVAN_RESULTS_DIR` | Annotate | Override results output directory (default: `$(pwd)/results/`) |
| `TMPDIR` | Both | Temporary directory — critical on HPC (see below) |
| `SLURM_TMPDIR` | Both | Should match `TMPDIR` |
| `SINGULARITY_BIND` | Both | Bind additional host paths into container |
Why TMPDIR matters: Many HPC nodes mount /tmp as tmpfs (RAM-backed). Large temporary files from STAR, RepeatMasker, or Augustus can exhaust memory, causing cryptic segmentation faults or "no space left on device" errors. Always set TMPDIR to disk-backed project storage:
mkdir -p results/TMP
export TMPDIR="$(pwd)/results/TMP"
export SLURM_TMPDIR="$TMPDIR"

# For toydata
export SYLVAN_CONFIG="toydata/config/config_annotate.yml"
# For custom project
export SYLVAN_CONFIG="/path/to/my_config.yml"

This is required for any Snakemake command (dry-run, unlock, etc.):
export SYLVAN_CONFIG="toydata/config/config_annotate.yml"
snakemake -n --snakefile bin/Snakefile_annotate # dry-run
snakemake --unlock --snakefile bin/Snakefile_annotate # unlock

| Parameter | Description | Example |
|---|---|---|
| `prefix` | Output file prefix | `my_species` |
| `helixer_model` | `land_plant`, `vertebrate`, `fungi` | `land_plant` |
| `helixer_subseq` | `64152` (plants), `21384` (fungi), `213840` (vertebrates) | `64152` |
| `augustus_species` | Augustus species name for training | `arabidopsis` |
| `augustus_start_from` | Start Augustus training from an existing species model (skips de novo training if close match available) | `arabidopsis` |
| `use_augustus` | Use a pre-trained Augustus species without re-training (set to species name, or `placeholder` to train fresh) | `placeholder` |
| `num_evm_files` | Number of parallel EVM partitions (more = faster but more SLURM jobs) | 126 |
| `geta.RM_species` | RepeatMasker species database (e.g. `Embryophyta`, `Viridiplantae`, `Metazoa`) | `Embryophyta` |
Helixer benefits significantly from GPU acceleration (~10x speedup). To use a separate GPU partition, add the following per-rule override in config_annotate.yml:
helixer:
ncpus: 4
memory: 32g
account: your-gpu-account # GPU-specific billing account
  partition: your-gpu-partition # GPU partition name

To use custom Helixer .h5 model files instead of the container defaults:
- Set `helixer_model_dir` in your config to the host directory containing the model files:

  helixer_model_dir: "/path/to/custom/models"

- Bind the directory into the container via `SINGULARITY_BIND`:

  export SINGULARITY_BIND="/path/to/custom/models"
The pipeline will look for {helixer_model_dir}/{helixer_model}.h5 (e.g., /path/to/custom/models/land_plant.h5).
Follow these steps to configure Sylvan for your HPC cluster. This only needs to be done once per cluster.
# Show your accounts and partitions
sacctmgr show user "$USER" withassoc format=Account,Partition -nP
# List all available partitions with time limits, node counts, and memory
sinfo -o "%P %l %D %c %m"

Example output:
PARTITION TIMELIMIT NODES CPUS MEMORY
cpu-s1-pgl-0 14-00:00:00 4 64 256000
gpu-s2-core-0 14-00:00:00 10 64 256000
cpu-s3-test-0 8:00:00 2 64 191000
Note your account name (e.g., cpu-s1-pgl-0) and partition name (e.g., cpu-s1-pgl-0). On some clusters these are different; on others they are the same. If your cluster does not require an account or partition, you can leave them empty or set to placeholder.
Use generate_cluster_from_config.py to create a cluster config tailored to your cluster. The script reads per-rule resource requirements from config_annotate.yml and adds your SLURM account/partition.
python3 bin/generate_cluster_from_config.py \
--config config/config_annotate.yml \
--out config/cluster_annotate.yml \
--account your-account \
  --partition your-partition

What this does:
- Extracts per-rule CPU/memory/time settings from `config_annotate.yml`
- Sets `__default__.account` and `__default__.partition` to your values
- Auto-detects walltime: queries `sinfo` for your partition's max time limit and sets `time = max - 1 day` (e.g., if max is 14 days, sets 13 days). Falls back to 9 days if `sinfo` is unavailable.
- Writes a standalone `cluster_annotate.yml` ready for Snakemake's `--cluster-config`
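The "max minus 1 day" walltime arithmetic can be sketched as follows. The one-hour floor here is an illustrative assumption, not documented generator behavior:

```python
def slurm_to_seconds(t):
    """Parse a SLURM time limit like '14-00:00:00' or '8:00:00' into seconds."""
    days = 0
    if "-" in t:
        d, t = t.split("-", 1)
        days = int(d)
    parts = [int(p) for p in t.split(":")]
    while len(parts) < 3:  # allow 'MM:SS' or bare minutes
        parts.insert(0, 0)
    h, m, s = parts
    return ((days * 24 + h) * 60 + m) * 60 + s

def seconds_to_slurm(sec):
    """Format seconds back into SLURM's D-HH:MM:SS form."""
    days, rem = divmod(sec, 86400)
    h, rem = divmod(rem, 3600)
    m, s = divmod(rem, 60)
    return f"{days}-{h:02d}:{m:02d}:{s:02d}"

def default_walltime(partition_max):
    """Partition max minus one day, floored at one hour (illustrative floor)."""
    sec = max(slurm_to_seconds(partition_max) - 86400, 3600)
    return seconds_to_slurm(sec)

print(default_walltime("14-00:00:00"))  # 13-00:00:00
```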
To override the auto-detected time, use --time:
python3 bin/generate_cluster_from_config.py \
--config config/config_annotate.yml \
--out config/cluster_annotate.yml \
--account your-account --partition your-partition \
  --time "5-00:00:00"

For the toy data:
python3 bin/generate_cluster_from_config.py \
--config toydata/config/config_annotate.yml \
--out toydata/config/cluster_annotate.yml \
  --account your-account --partition your-partition

Open the generated file and check that __default__ looks correct:
__default__:
account: your-account
partition: your-partition
memory: 4g
ncpus: 1
nodes: 1
time: "13-00:00:00" # Auto-detected: partition max (14d) minus 1 day
name: '{rule}.{wildcards}'
output: results/logs/{rule}_{wildcards}.out
error: results/logs/{rule}_{wildcards}.err
  extra_args: ''

Important: Always quote time values in YAML (e.g., `time: "3-00:00:00"`). Unquoted values like `72:00:00` are silently parsed as integers by YAML 1.1, resulting in incorrect SLURM walltimes. The generator handles this automatically.
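The failure mode comes from YAML 1.1's sexagesimal (base-60) integer rule, which can be mimicked in a few lines to show why an unquoted value stops being a time string (illustration only, no YAML library involved):

```python
def yaml11_sexagesimal(token):
    """Mimic YAML 1.1 integer resolution for colon-separated digit groups:
    '72:00:00' resolves to 72*3600 + 0*60 + 0 = 259200, not a time string."""
    value = 0
    for part in token.split(":"):
        value = value * 60 + int(part)
    return value

print(yaml11_sexagesimal("72:00:00"))  # 259200 — SLURM receives an integer
# A quoted "72:00:00" stays a string and reaches sbatch unchanged.
```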
export SYLVAN_CONFIG="config/config_annotate.yml"
snakemake -n --snakefile bin/Snakefile_annotate

If this completes without errors, your configuration is valid.
# Submit as a SLURM head job (recommended)
sbatch -A your-account -p your-partition -c 1 --mem=1g \
-J annotate -o annotate.out -e annotate.err \
--wrap="./bin/annotate.sh"
# Or run directly (if already on a compute node)
./bin/annotate.sh

If you move to a new HPC system, re-run Step 2 with the new account/partition. The generator will auto-detect the new partition's time limit. All other settings (per-rule resources) are preserved from config_annotate.yml.
If you prefer not to use a separate cluster file, you can edit the __default__ section directly in config_annotate.yml. The entry scripts default to using the pipeline config as --cluster-config when SYLVAN_CLUSTER_CONFIG is not set.
Job submission is handled by bin/cluster_submit.py, which dynamically builds the sbatch command. Account (-A) and partition (-p) flags are automatically skipped when their values are empty or set to placeholder — no need to edit any script.
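The skip-when-placeholder behavior can be sketched like this (`build_sbatch` is a hypothetical simplification; the real `bin/cluster_submit.py` handles more options):

```python
def build_sbatch(rule_cfg):
    """Assemble an sbatch argument list, dropping -A/-p when the value is
    empty or the literal 'placeholder', mirroring the documented behavior."""
    cmd = ["sbatch"]
    account = rule_cfg.get("account", "")
    partition = rule_cfg.get("partition", "")
    if account and account != "placeholder":
        cmd += ["-A", account]
    if partition and partition != "placeholder":
        cmd += ["-p", partition]
    cmd += ["-c", str(rule_cfg.get("ncpus", 1)),
            "--mem", rule_cfg.get("memory", "4g")]
    if rule_cfg.get("extra_args"):
        cmd += rule_cfg["extra_args"].split()
    return cmd

# Neither -A nor -p is emitted here; the GPU flag passes through.
print(build_sbatch({"account": "placeholder", "partition": "",
                    "ncpus": 12, "memory": "48g",
                    "extra_args": "--gres=gpu:1"}))
```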
For per-rule customization (e.g., GPU for Helixer), add overrides to the rule's section in cluster_annotate.yml or config_annotate.yml:
helixer:
ncpus: 4
memory: 32g
account: your-gpu-account
partition: your-gpu-partition
  extra_args: "--gres=gpu:1"

All outputs are organized under results/:
results/
├── PREFILTER/
│ └── Sylvan.gff3 # Annotate phase final output
│
├── AB_INITIO/
│ └── Helixer/ # Helixer predictions
│
├── GETA/
│ ├── RepeatMasker/ # Repeat masking results
│ ├── Augustus/ # Augustus predictions
│ ├── transcript/ # TransDecoder results
│ ├── homolog/ # Protein alignments (Miniprot → GeneWise)
│ └── CombineGeneModels/ # GETA gene models
│
├── LIFTOVER/
│ └── LiftOff/ # Neighbor species liftover
│
├── TRANSCRIPT/
│ ├── PASA/ # PASA assemblies
│ ├── spades/ # De novo assembly
│ └── evigene/ # Evigene transcript clustering
│
├── PROTEIN/ # Merged protein alignments
│
├── EVM/ # EvidenceModeler output
│
├── FILTER/
│ ├── filtered.gff3 # Kept gene models
│ ├── discard.gff3 # Discarded gene models
│ ├── data.tsv # Feature matrix (input to random forest)
│ ├── keep_data.tsv # Evidence data for kept genes
│ ├── discard_data.tsv # Evidence data for discarded genes
│ ├── {prefix}.cdna # Transcript sequences
│ ├── {prefix}.pep # Peptide sequences
│ ├── STAR/ # RNA-seq realignment for filter
│ ├── rsem_outdir/ # RSEM quantification
│ ├── splitPep/ # Parallelized BLAST inputs
│ ├── busco_*/ # BUSCO results (monitoring only)
│ └── lncrna_predict.csv # lncDC predictions
│
├── BENCHMARK/ # Quality benchmarking (optional)
│ ├── benchmark_summary.tsv # Combined BUSCO + OMArk table
│ ├── {label}.pep # Extracted proteins per GFF3
│ ├── {label}.busco/ # BUSCO results per GFF3
│ └── {label}.omark/ # OMArk results per GFF3 (if omark_db set)
│
├── config/ # Runtime config copies
│
└── logs/ # SLURM job logs
Use TidyGFF to prepare annotations for public distribution:
singularity exec sylvan.sif python bin/TidyGFF.py \
MySpecies results/FILTER/filtered.gff3 \
--out MySpecies_v1.0 --splice-name t --justify 5 --sort \
  --chrom-regex "^Chr" --source Sylvan

TidyGFF options:
| Option | Description |
|---|---|
| `pre` (positional) | Prefix for gene IDs (e.g. `Ath` produces `Ath01G000010`) |
| `gff` (positional) | Input GFF3 file |
| `--out` | Output file basename (produces `.gff3`, `.cdna`, `.pep` files) |
| `--splice-name` | Splice variant label style (e.g. `t` → `mRNA1.t1`, `mRNA1.t2`) |
| `--justify` | Number of digits in gene IDs (default: 8) |
| `--sort` | Sort output by chromosome and start coordinate |
| `--chrom-regex` | Regex for chromosome prefixes (auto-detects `Chr`, `chr`, `LG`, `Ch`, `^\d`) |
| `--contig-regex` | Regex for contig/scaffold naming (e.g. `HiC_scaffold_(\d+$),Scaf`) |
| `--source` | Value for GFF column 2 (e.g. `Sylvan`) |
| `--remove-names` | Remove Name attributes from GFF |
| `--no-chrom-id` | Do not number gene IDs by chromosome |
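The interaction of the prefix, chromosome numbering, and `--justify` can be illustrated with a hypothetical reconstruction of the ID scheme implied by the `Ath01G000010` example (`tidy_gene_id` is not part of TidyGFF and the exact formatting may differ):

```python
def tidy_gene_id(pre, chrom_num, index, justify=8, step=10):
    """Hypothetical gene-ID formatter: prefix + zero-padded chromosome
    number + 'G' + zero-padded ordinal, counting in steps of `step`."""
    return f"{pre}{chrom_num:02d}G{index * step:0{justify}d}"

# The README's Ath01G000010 example corresponds to a 6-digit justify.
print(tidy_gene_id("Ath", 1, 1, justify=6))  # Ath01G000010
```

Stepping IDs by 10 is a common annotation convention that leaves room to insert genes later without renumbering.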
# Force rerun all
./bin/annotate.sh --forceall
# Rerun specific rule
./bin/annotate.sh --forcerun helixer
# Rerun incomplete jobs (jobs that started but didn't finish)
./bin/rerun-incomplete.sh
# Generate report after completion
snakemake --report report.html --snakefile bin/Snakefile_annotate
# Unlock after interruption
./bin/annotate.sh --unlock
# Clean up intermediate files (run after BOTH phases complete)
./bin/cleanup.sh

bin/cleanup.sh removes intermediate files generated during the annotation phase while preserving:
- Final outputs (`PREFILTER/Sylvan.gff3`, `FILTER/filtered.gff3`)
- Log files (`results/logs/`)
- Configuration files
- Filter phase outputs (`FILTER/`)
Run this only after both annotation and filter phases have completed successfully.
| Script | Purpose |
|---|---|
| `bin/generate_cluster_from_config.py` | Generate a standalone cluster-only YAML from `config_annotate.yml` |
| `bin/cluster_submit.py` | SLURM job submission wrapper — dynamically builds sbatch command, skips `-A`/`-p` when account/partition are empty |
Sylvan can run on any Linux machine without SLURM. The bin/annotate_local.sh script uses Snakemake's --cores flag for local parallelism instead of --cluster.
- 16+ CPU cores, 64+ GB RAM recommended
- Singularity 3.x+ (or a writable sandbox)
- No SLURM or job scheduler needed
- Create a local config by copying the toydata example:

  cp toydata/config/config_annotate_local.yml toydata/config/my_local_config.yml

- Edit `__default__` — set `account` and `partition` to empty strings (they are ignored in local mode).
- Adjust per-rule `ncpus`/`threads` to fit your machine (e.g., cap at 12 for a 16-core machine).
export SYLVAN_CONFIG="toydata/config/config_annotate_local.yml"
# Dry run
snakemake -n --snakefile bin/Snakefile_annotate
# Run
./bin/annotate_local.sh
# Pass extra snakemake flags
./bin/annotate_local.sh -n # dry-run
./bin/annotate_local.sh --forceall # force rerun

| Aspect | SLURM (`annotate.sh`) | Local (`annotate_local.sh`) |
|---|---|---|
| Parallelism | `--cluster sbatch` | `--cores 16` |
| Job scheduling | SLURM queue | Snakemake's built-in scheduler |
| Resource limits | Per-job SLURM allocation | System-wide (all jobs share RAM) |
| Singularity | `--use-singularity` | `--use-singularity` (same) |
| GPU (Helixer) | `--gres=gpu:1` | `--nv` flag (auto-detects host GPU) |
| Issue | Solution |
|---|---|
| Helixer GPU/CUDA mismatch | Container CUDA version must match host driver. If mismatched, Helixer produces empty output (pipeline continues). |
| `run:` blocks execute on host | Already handled by `run_in_container()` helper in Snakefile_annotate. |
| Container `/bin/sh` is dash | Avoid `&>` in shell commands inside container; use `> file 2>&1` instead. |
# Find recent errors
ls -lt results/logs/*.err | head -10
grep -l 'Error\|Traceback' results/logs/*.err
# View specific log
cat results/logs/{rule}_{wildcards}.err

| Issue | Solution |
|---|---|
| Out of memory | Increase `memory` in config_annotate.yml for the rule |
| `No space left on device` | `TMPDIR` is on tmpfs or quota exceeded — set `TMPDIR` to project storage |
| `Segmentation fault` | Often caused by tmpfs exhaustion — set `TMPDIR` to disk-backed storage |
| File not found (Singularity) | Path not bound in container — add to `SINGULARITY_BIND` |
| Singularity bind error | Ensure paths are within working directory or use `SINGULARITY_BIND` |
| Permission denied in container | Check directory permissions, ensure path is bound |
| SLURM account error | Use `account` (billing account), not username |
| SLURM account error | Set `account` to empty string (`""`) or `placeholder` in the cluster config if your HPC doesn't require one |
| LFS files not downloaded | Run `git lfs pull`; verify with `ls -la toydata/` (files should be > 200 bytes) |
| Augustus training fails | Needs minimum ~500 training genes; use `augustus_start_from` with a close species |
| Job timeout | Increase `time` in config_annotate.yml for the rule |
| Variables not in SLURM job | Add `#SBATCH --export=ALL` or explicitly export in submit script |
| Filter `chrom_regex` error | Ensure `chrom_regex` in config_filter.yml matches your chromosome naming convention |
- General recommendation: 4 GB per thread
- Example: 48 threads = 192 GB memory
- `ncpus` and `threads` should match in `config_annotate.yml`
- Some rules need more: `mergeSTAR` may require ~18 GB per thread for large datasets
- Check `df -h $TMPDIR` to ensure temp storage is on real disk, not tmpfs
Sylvan: A comprehensive genome annotation pipeline. Under review.
MIT License - see LICENSE

