annotation-demo — A minimal pixi proof-of-concept for biologists

1. Purpose

This repository is a minimal, self-contained demonstration of the principles behind taxonomy_bundle — a 17-environment pixi project for microbial genomics research.

Before tackling 17 environments, 60+ tools, and ~1.2 TB of databases, this demo lets you experience the core concepts with just four environments, two annotation tools, and two databases on a single well-characterised genome.

What you will learn by running this demo:

  • How a single pixi.toml defines multiple isolated environments
  • How pixi installs tools without dependency conflicts
  • How databases are managed explicitly and connected to tools automatically
  • How Snakemake stitches pixi environments together into a pipeline
  • How these four concepts scale directly to taxonomy_bundle's 17 environments

What this demo runs:

| Step | Tool | Environment | Purpose |
| --- | --- | --- | --- |
| Download genome | NCBI Datasets CLI | env-datasets | Download E. coli K-12 from NCBI |
| Annotate | Prokka 1.15.6 | env-prokka | Fast prokaryotic annotation |
| Annotate | Bakta 1.12.0 | env-bakta | NCBI-compliant annotation |
| Pipeline | Snakemake 9.16.2 | env-snakemake | Automate all steps |

2. Prerequisites

These tools must be present before you begin. Check each one first — on most Ubuntu systems git and tmux are already installed.

git

# Check if already installed
git --version

Already installed: version number appears — no action needed.

Personal machine with administrator access:

sudo apt update && sudo apt install git -y

Shared institutional server without sudo access:

# Check for a loadable module
module avail git
module load git

Pixi

# Check if already installed
pixi --version

If it is not installed: Pixi needs no sudo privileges and installs entirely in your home directory, so it is safe to use on shared institutional servers. Install it with:

curl -fsSL https://pixi.sh/install.sh | bash
source ~/.bashrc
pixi --version    # confirm installation

tmux (recommended)

tmux keeps your session alive during long downloads — if your connection drops, the download continues in the background.

# Check if already installed
tmux -V

Personal machine with administrator access:

sudo apt install tmux -y

Shared institutional server without sudo access:

module avail tmux
module load tmux

tmux commands used in this demo:

| Command | What it does |
| --- | --- |
| tmux new -s name | Start a new named session |
| Ctrl+B then D | Detach; the job keeps running in the background |
| tmux attach -t name | Reattach to a running session |
| tmux ls | List all running sessions |
| exit | Close the session when finished |
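
The three checks in this section can be run in one pass. The sketch below is a convenience only; `check_tool` is a hypothetical helper name and is not part of this repository.

```shell
#!/usr/bin/env bash
# Report whether each prerequisite is on PATH.
# check_tool is a hypothetical helper name, not part of this repository.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1"
        return 1
    fi
}

for tool in git pixi tmux; do
    check_tool "$tool" || true   # keep going so every tool is reported
done
```

Any tool reported as missing can then be installed using the instructions for that tool above.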

3. Setup

3a. Create the annotation-demo directory

annotation-demo and taxonomy_bundle are independent projects. Each must live in its own directory — this is the pixi project model. See section 6 for a full explanation of pixi projects and directory demarcation.

# Go to your software directory — alongside taxonomy_bundle, not inside it
cd ~/software

# Clone annotation-demo into its own directory
git clone https://github.com/bharat1912/annotation-demo
cd annotation-demo

Your directory layout should look like this:

~/software/
├── annotation-demo/    <- this project (you are here)
└── taxonomy_bundle/    <- separate project, not affected

3b. Install all environments

# Installs all four environments in parallel from pixi.toml
# No conda activate, no conflict resolution, no manual environment management
pixi install --all

Takes approximately 2-5 minutes. Verify all four environments installed:

pixi run -e env-prokka prokka --version       # should return 1.15.6
pixi run -e env-bakta bakta --version         # should return 1.12.0
pixi run -e env-snakemake snakemake --version # should return 9.16.2
pixi run -e env-datasets datasets --version   # should return a version number

3c. Download databases

Both tools require databases before they can annotate. This demo stores both in a databases/ directory. See section 6 for how this mirrors the vault architecture in taxonomy_bundle.

Prokka database:

# Sets up Prokka database in databases/prokka/ and links it to the environment
# See section 6 for how this symlink pattern works and mirrors taxonomy_bundle
pixi run -e env-prokka download-prokka-db

Takes approximately 2-3 minutes. Verify:

pixi run -e env-prokka prokka --listdb

Bakta database:

The Bakta database (~1.34 GB) takes approximately 70 minutes on a standard connection. Run in tmux so it survives disconnects:

# Option 1 — with tmux (recommended for long downloads)
tmux new -s bakta_db
pixi run -e env-bakta download-bakta-db
# Press Ctrl+B then D to detach — download continues in background
# Reattach later: tmux attach -t bakta_db
# Type exit when complete to close the session

# Option 2 — directly in terminal (do not close until complete)
pixi run -e env-bakta download-bakta-db

Run once only — re-running downloads the entire database again unnecessarily. See section 6 for why Bakta requires an explicit download.
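
One way to guard against an accidental second download is to check for the database directory first. This is a minimal sketch, assuming the database lands in databases/bakta/db-light as described in this demo; `db_is_present` and `download_bakta_db_once` are hypothetical helper names, not tasks defined in pixi.toml.

```shell
#!/usr/bin/env bash
# db_is_present is a hypothetical helper: succeeds when the directory
# exists and contains at least one entry.
db_is_present() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Hypothetical wrapper name; the default path is the one this demo uses.
download_bakta_db_once() {
    local db_dir="${1:-databases/bakta/db-light}"
    if db_is_present "$db_dir"; then
        echo "Bakta database already present in $db_dir, skipping download"
    else
        pixi run -e env-bakta download-bakta-db
    fi
}
```

Call `download_bakta_db_once` from the project directory. The check only tests that the directory is non-empty, so an interrupted download would also count as present; in that case remove the directory and run the download again.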


4. Running the demo

There are two ways to run the demo. Choose one approach per session — do not mix them in the same directory without cleaning outputs first.

Option A (step by step, learning mode): Run each tool individually to see exactly what it does and what it produces. Recommended for first-time users.

Option B (single pipeline command): Run the complete workflow with one Snakemake command. Recommended once you have completed Option A and understand what each tool does.


Option A — Run each tool individually

A1 — Download the E. coli genome

# Downloads E. coli K-12 MG1655 from NCBI (~4.6 Mb)
# See section 7 for alternative genome downloaders
pixi run -e env-datasets download-genome

Verify:

ls -lh ecoli_k12.fna

A2 — Annotate with Prokka

# Annotates ecoli_k12.fna using databases/prokka/
# Prokka finds its database automatically via symlink — no --dbdir flag needed
# Takes approximately 30 seconds
pixi run -e env-prokka run-prokka

Verify:

ls -lh results/prokka/

A3 — Annotate with Bakta

# Annotates ecoli_k12.fna using databases/bakta/db-light/
# BAKTA_DB is set automatically by pixi before this command runs
# Takes approximately 2 minutes
pixi run -e env-bakta run-bakta

Verify:

ls -lh results/bakta/

A4 — View and understand the output

# View annotation summaries
cat results/prokka/ecoli_k12.txt
cat results/bakta/ecoli_k12.txt

# Count annotated CDS features
grep -c "CDS" results/prokka/ecoli_k12.gff
grep -c "CDS" results/bakta/ecoli_k12.gff3

# List all output files
ls -lh results/prokka/
ls -lh results/bakta/
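
The grep commands above count every line that mentions CDS anywhere. A stricter count looks only at the feature-type column (column 3) of the GFF. The sketch below shows the difference on a tiny made-up file; `count_cds` is a hypothetical helper name and the sample rows are not real annotation output.

```shell
#!/usr/bin/env bash
# count_cds is a hypothetical helper: counts rows whose third
# tab-separated column (the GFF feature type) is exactly "CDS".
count_cds() {
    awk -F'\t' '$3 == "CDS" { n++ } END { print n + 0 }' "$1"
}

# Tiny made-up GFF for illustration only (not real annotation output).
sample_gff="$(mktemp)"
printf '##gff-version 3\n' > "$sample_gff"
printf 'c1\tdemo\tgene\t1\t90\t.\t+\t.\tID=g1;Note=contains CDS g1.c\n' >> "$sample_gff"
printf 'c1\tdemo\tCDS\t1\t90\t.\t+\t0\tID=g1.c\n' >> "$sample_gff"
printf 'c1\tdemo\tCDS\t200\t290\t.\t-\t0\tID=g2.c\n' >> "$sample_gff"

grep -c "CDS" "$sample_gff"   # counts every line mentioning CDS
count_cds "$sample_gff"       # counts CDS feature rows only
```

To apply it to the demo output, run `count_cds results/prokka/ecoli_k12.gff` and `count_cds results/bakta/ecoli_k12.gff3`.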

Differences in Prokka and Bakta annotations

Both tools annotated the same genome but found slightly different numbers of features — this is expected. Prokka and Bakta use different gene prediction approaches and databases. Neither is wrong. The difference reflects the tools' different design goals:

  • Prokka is fast and suits quick checks; its output is also used in pangenome studies.
  • Bakta produces NCBI-compliant output for GenBank submissions.

For a well-characterised genome like E. coli K-12, both results are reliable. For novel or poorly characterised organisms, the choice of tool matters more.

A note on the .err file

Prokka generates an ecoli_k12.err file containing NCBI compliance discrepancy reports. This is normal: it flags items NCBI would require for a GenBank submission, such as missing project IDs and structured comments. For research annotation it can be ignored. Bakta's --compliant flag produces cleaner output for direct NCBI submission.

A note on Prokka --compliant

Prokka supports a --compliant flag that enforces NCBI submission standards — it adds gene features for each CDS, ignores contigs under 200 bp, and requires a sequencing centre ID. This demo runs Prokka without --compliant deliberately, to highlight the difference between a fast research annotation and an NCBI-ready submission. The small difference in CDS count between Prokka (4315) and Bakta (4295) partly reflects this.

If you need NCBI-compliant Prokka output, add --compliant --centre YOUR_CENTRE to the run-prokka task in pixi.toml. For final genome submissions, Bakta with --compliant is the recommended tool — it handles compliance automatically without additional flags.

A note on PGAP — the gold standard for GenBank submission

For final GenBank submissions, NCBI's own annotation pipeline PGAP is the gold standard: it is the same pipeline NCBI runs internally when you submit a genome. However, PGAP is not a conda package and cannot be added to pixi.toml in the same way as Prokka and Bakta. It runs inside a Docker or Singularity container and requires ~100 GB of database storage and at least 32 GB of RAM.

For most labs Bakta with --compliant is the practical choice for submissions — it produces GenBank-ready output that passes NCBI validation without the infrastructure overhead.

In taxonomy_bundle: env-pgap2 wraps PGAP correctly using Singularity via pixi tasks — demonstrating how container-based tools can be integrated alongside conda-based tools in the same pixi.toml. See section 6 for how this pattern works.


Option B — Single pipeline command

Snakemake reads Snakefile_demo and runs all steps in the correct order — genome download, Prokka annotation, Bakta annotation — each in its own pixi environment. See section 6 for how this pipeline pattern scales to taxonomy_bundle's 4 production Snakefiles.

If you ran Option A first, delete outputs so Snakemake runs all steps:

# Snakemake skips steps whose outputs already exist — delete to run fresh
# databases/ must remain — do not delete it
rm -f ecoli_k12.fna ecoli_k12.zip
rm -rf ncbi_dataset/ results/prokka/ results/bakta/

Run the pipeline:

# Dry run first — shows all steps without executing anything
pixi run -e env-snakemake dry-run

# Full pipeline run — download, annotate with Prokka, annotate with Bakta
pixi run -e env-snakemake run-pipeline

# Optional: generate a DAG diagram of the pipeline
# Requires graphviz: sudo apt install graphviz -y
pixi run -e env-snakemake dag

If interrupted, re-run run-pipeline — Snakemake resumes from where it stopped without re-running completed steps.
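
The skip-and-resume behaviour can be pictured with a few lines of shell. This is a deliberately simplified sketch of the idea, not how Snakemake is implemented: Snakemake also compares timestamps and tracks more state. `run_step` is a hypothetical helper, and `touch` stands in for the real download and annotation commands.

```shell
#!/usr/bin/env bash
# run_step is a hypothetical helper that mimics only the simplest part of
# the behaviour described above: skip a step when its output file exists.
run_step() {
    local output="$1"; shift
    if [ -e "$output" ]; then
        echo "skip: $output exists"
    else
        echo "run:  $output"
        "$@"
    fi
}

workdir="$(mktemp -d)"
# Stand-in commands (touch) replace the real download and annotation tasks.
run_step "$workdir/ecoli_k12.fna" touch "$workdir/ecoli_k12.fna"
run_step "$workdir/prokka.done"   touch "$workdir/prokka.done"
# Second pass, as after an interruption: the completed step is skipped.
run_step "$workdir/ecoli_k12.fna" touch "$workdir/ecoli_k12.fna"
```

This is also why the outputs from Option A have to be deleted before Option B will run every step.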


5. Output files

After running either Option A or Option B:

results/
├── prokka/
│   ├── ecoli_k12.gff      # Gene annotation (GFF3)
│   ├── ecoli_k12.gbk      # GenBank format
│   ├── ecoli_k12.faa      # Protein sequences
│   └── ecoli_k12.txt      # Summary statistics
└── bakta/
    ├── ecoli_k12.gff3     # Gene annotation (GFF3, NCBI-compliant)
    ├── ecoli_k12.gbff     # GenBank format (submission-ready)
    ├── ecoli_k12.faa      # Protein sequences
    └── ecoli_k12.txt      # Summary statistics

6. From demo to taxonomy_bundle

Now that you have run the demo, the concepts behind taxonomy_bundle's 17-environment design will be much easier to understand.

Pixi projects — one directory, one pixi.toml

A pixi project is a directory containing a pixi.toml file. That directory is the project home — all environments, databases, and tasks belong to it. Multiple pixi projects can coexist on the same machine but in different directories — they never interfere because of this directory demarcation.

This is fundamentally different from Conda, which has a global base environment that can be contaminated by installing packages into it. A pixi project has no shared base environment: its environments live in the project's own .pixi/ directory. Remove the project directory and the environments it installed disappear with it.

~/software/
├── annotation-demo/    <- independent pixi project (4 environments)
│   ├── pixi.toml
│   ├── pixi.lock
│   ├── databases/
│   └── .pixi/
└── taxonomy_bundle/    <- independent pixi project (17 environments)
    ├── pixi.toml
    ├── pixi.lock
    ├── db_link/
    └── .pixi/

The golden rule: the directory you are in when you type pixi run determines which project and which environments are used.
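
The golden rule can be illustrated by resolving a project root by hand. The sketch below assumes pixi locates pixi.toml by checking the current directory and then its parents (pixi also accepts an explicit --manifest-path). `find_pixi_project` is a hypothetical helper, and the directories are throwaway stand-ins for ~/software/.

```shell
#!/usr/bin/env bash
# find_pixi_project is a hypothetical helper: walk upwards from a starting
# directory and print the first directory that contains pixi.toml.
find_pixi_project() {
    local dir
    dir="$(cd "${1:-.}" && pwd -P)" || return 1
    while :; do
        if [ -f "$dir/pixi.toml" ]; then
            echo "$dir"
            return 0
        fi
        [ "$dir" = "/" ] && return 1
        dir="$(dirname "$dir")"
    done
}

# Throwaway layout mirroring ~/software/ above.
root="$(cd "$(mktemp -d)" && pwd -P)"
mkdir -p "$root/annotation-demo/results" "$root/taxonomy_bundle"
touch "$root/annotation-demo/pixi.toml" "$root/taxonomy_bundle/pixi.toml"

find_pixi_project "$root/annotation-demo/results"
find_pixi_project "$root/taxonomy_bundle"
```

Running `find_pixi_project` with no argument from inside either project shows which pixi.toml a `pixi run` typed there would be expected to use.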

annotation-demo vs taxonomy_bundle

| Feature | annotation-demo | taxonomy_bundle |
| --- | --- | --- |
| Purpose | Learn pixi concepts with a minimal working example | Full production pipeline for microbial genomics research |
| Environments | 4 | 17 |
| Tools | NCBI Datasets, Prokka, Bakta, Snakemake | 60+ tools across assembly, taxonomy, pangenomics, comparative genomics, trait prediction |
| Databases | 2 in databases/ (~1.5 GB total) | 19 in $EXTERNAL_VAULT (~1.2 TB total) |
| Prokka database | symlink: .pixi/envs/env-prokka/db/ → databases/prokka/ | symlink: .pixi/envs/env-pan/db/ → $EXTERNAL_VAULT/prokka_db/ |
| Bakta database | BAKTA_DB via [activation.env] pointing to databases/bakta/db-light | BAKTA_DB via [activation.env] pointing to db_link/bakta |
| All other databases | n/a | 17 more env vars set via [activation.env] |
| Snakemake pipelines | 1 demo pipeline (3 rules) | 4 production pipelines (50+ rules) |
| Install time | ~5 minutes | ~30-60 minutes |
| Disk space | ~3 GB | ~1.5 TB |
| Python versions | 3.10, 3.12 | 3.9, 3.10, 3.11, 3.12 (different per environment) |
| RAM requirement | 4 GB sufficient | Up to 320 GB for GTDB-Tk pplacer step |

The same pattern, 17 times

Every tool in taxonomy_bundle follows the same pattern you just used:

pixi run -e env-checkm2  checkm2 predict ...
pixi run -e env-pan      prokka ...
pixi run -e env-a        bakta ...
pixi run -e env-busco    busco ...
pixi run -e env-anti     antismash ...
pixi run -e env-egg      emapper.py ...
pixi run -e env-b        snakemake -s Snakefile_hybrid_taxonomy.smk

Each environment is one tool or a compatible group. Each Snakemake rule calls pixi run -e <env> <tool> — exactly as in Snakefile_demo. Database paths are handled automatically by symlinks and [activation.env].

How databases are connected to tools

Two patterns handle database paths in both projects:

Pattern 1 — Symlink (Prokka)

download-prokka-db indexes the database into databases/prokka/ then replaces .pixi/envs/env-prokka/db/ with a symlink:

.pixi/envs/env-prokka/db/  →  databases/prokka/

Prokka looks for its database in db/ inside the environment — it finds the symlink and follows it automatically. No --dbdir flag ever needed.

taxonomy_bundle uses the identical pattern:

.pixi/envs/env-pan/db/  →  $EXTERNAL_VAULT/prokka_db/
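
The replace-with-symlink step can be tried safely outside a real environment. The sketch below uses throwaway directories that mimic the layout above; `link_db` is a hypothetical helper name and is not the download-prokka-db task itself.

```shell
#!/usr/bin/env bash
# Throwaway directories stand in for the project; nothing here touches a
# real pixi environment. link_db is a hypothetical helper name.
link_db() {
    local env_db="$1" target="$2"
    rm -rf "$env_db"              # remove the bundled db/ directory (or an old link)
    ln -s "$target" "$env_db"     # leave a symlink in its place
}

project="$(cd "$(mktemp -d)" && pwd -P)"
mkdir -p "$project/databases/prokka" "$project/.pixi/envs/env-prokka/db"
touch "$project/databases/prokka/marker"

link_db "$project/.pixi/envs/env-prokka/db" "$project/databases/prokka"

# Anything that opens db/ inside the environment now lands in databases/prokka/.
ls "$project/.pixi/envs/env-prokka/db"
```

Because the link lives inside .pixi/, it would presumably need to be made again if the environment were ever deleted and recreated.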

Pattern 2 — Activation env var (Bakta and all others)

[activation.env] in pixi.toml sets BAKTA_DB automatically before every pixi run -e env-bakta command. The tool reads the variable and finds its database without any flags.

taxonomy_bundle uses this pattern for all 19 databases: BAKTA_DB, GTDBTK_DATA_PATH, CHECKM2_DB and 16 more — all set once in [activation.env], all used automatically on every run.
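
The effect of an activation variable can be imitated in plain shell. In the sketch below, `fake_bakta` is a hypothetical stand-in for a tool that reads BAKTA_DB; it is not Bakta, and exporting the variable by hand stands in for what [activation.env] does before each run.

```shell
#!/usr/bin/env bash
# fake_bakta is a hypothetical stand-in for a tool that locates its
# database through the BAKTA_DB environment variable.
fake_bakta() {
    if [ -z "${BAKTA_DB:-}" ]; then
        echo "error: BAKTA_DB is not set" >&2
        return 1
    fi
    echo "using database at $BAKTA_DB"
}

# Without the variable the tool cannot find its database.
( unset BAKTA_DB; fake_bakta ) || echo "lookup failed as expected"

# With the variable exported first (the job [activation.env] does for you),
# the same call succeeds with no extra flags.
( export BAKTA_DB="databases/bakta/db-light"; fake_bakta )
```

To see the real value inside the demo, `pixi run -e env-bakta printenv BAKTA_DB` should print the path that pixi sets.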

Why Bakta requires an explicit database download

Bakta was deliberately designed this way for scientific reproducibility. The database is large, actively maintained, and versioned. By downloading it explicitly, you always know which database version produced your annotation results — a requirement for publication. The Snakemake pipeline cannot download it automatically because that would repeat a 70-minute download for every pipeline run.

Prokka bundles a smaller database with its conda install. --setupdb rebuilds it — but always into the bundled db/ directory regardless of --dbdir. The symlink approach redirects this to databases/prokka/ transparently.

The vault architecture in taxonomy_bundle

In this demo, databases live in databases/. In taxonomy_bundle, all 19 databases live in $EXTERNAL_VAULT on an external drive:

  • setup-vault creates 19 subdirectories in $EXTERNAL_VAULT and creates symlinks in db_link/ pointing to each one
  • For Prokka: .pixi/envs/env-pan/db/ is symlinked to $EXTERNAL_VAULT/prokka_db/ — identical to this demo
  • For all other tools: [activation.env] sets the database variable to point at the appropriate db_link/ symlink
  • Moving all databases to a new drive requires updating one variable — $EXTERNAL_VAULT in ~/.bashrc — and re-running setup-vault
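
The vault idea can be sketched in a few lines. This is an illustration under stated assumptions, not the real setup-vault task: `setup_vault_sketch` is a hypothetical function, the two directory names stand in for all 19 databases, and temporary directories stand in for the external drives.

```shell
#!/usr/bin/env bash
# Sketch only: setup_vault_sketch is a hypothetical function, not the real
# setup-vault task, and the two names below stand in for all 19 databases.
setup_vault_sketch() {
    local vault="$1" links="$2" name
    mkdir -p "$links"
    for name in bakta prokka_db; do
        mkdir -p "$vault/$name"
        ln -sfn "$vault/$name" "$links/$name"   # create or repoint the link
    done
}

work="$(cd "$(mktemp -d)" && pwd -P)"
setup_vault_sketch "$work/drive_a" "$work/db_link"
readlink "$work/db_link/bakta"

# "Moving drives": change the vault location and re-run.
setup_vault_sketch "$work/drive_b" "$work/db_link"
readlink "$work/db_link/bakta"
```

Tools keep pointing at db_link/, so only the links change when the vault moves. The database files themselves still have to be copied to the new drive separately.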

The concept is identical. The scale is not.

On AI and domain knowledge

AI can generate pixi.toml structure, write Snakefiles, and document workflows. What AI cannot know without being told:

  • That Bakta requires a pre-downloaded database by deliberate design
  • That Prokka's --setupdb always rebuilds into the bundled db/ directory — the symlink approach is needed to redirect it
  • That PROKKA_DB does not work for --setupdb but the symlink does
  • That Prokka 1.14.x fails on Python 3.12 but 1.15.6 works
  • That GTDB-Tk requires 320 GB RAM for its pplacer step
  • That miga init overwrites configuration and must never be run
  • That bacLIFE requires Snakemake 7, not 8 or 9

This knowledge comes from running the tools, reading errors, and accumulated experience. The way to build it is exactly what you just did — run each tool from its environment, observe the output, and understand what it produces and why.

AI amplifies domain expertise. Running tools from their pixi environments is how you build that expertise.


7. A note on genome downloaders

This demo uses ncbi-datasets-cli, the tool NCBI currently recommends for downloading genomes by accession. Other downloaders work equally well and can be added as tasks in pixi.toml using the same pattern:

| Tool | Best for |
| --- | --- |
| ncbi-datasets-cli | Modern NCBI downloads by accession (used here) |
| wget / curl | Direct FTP/HTTPS downloads; simple, no extra install |
| kingfisher | SRA reads, with multiple backends (AWS, ENA, Aspera) |
| ncbi-genome-download | Batch downloads by organism or taxonomic group |
| genome_updater | Keeping local NCBI genome collections up to date |

All follow the same pattern — add the tool to [feature.X.dependencies] and define a download task in [feature.X.tasks].
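
As a hedged illustration of that pattern, the script below writes a pixi.toml fragment for ncbi-genome-download to a scratch file so it can be reviewed before anything is edited. The feature name `ngd`, the environment name `env-ngd`, and the task name `download-ecoli-batch` are assumptions made for this example. The fragment also assumes the project's channel list already includes bioconda.

```shell
#!/usr/bin/env bash
# Writes a hypothetical pixi.toml fragment to a scratch file for review.
# The feature, environment and task names are illustrative assumptions.
fragment="$(mktemp)"
cat > "$fragment" <<'EOF'
[feature.ngd.dependencies]
ncbi-genome-download = "*"

[feature.ngd.tasks]
download-ecoli-batch = "ncbi-genome-download --dry-run --formats fasta --genera 'Escherichia coli' bacteria"

[environments]
env-ngd = ["ngd"]
EOF
cat "$fragment"
```

When copying this into a real pixi.toml, merge the env-ngd line into the existing [environments] table rather than adding a second one. The --dry-run flag keeps the task from downloading anything until the selection has been checked.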


8. What this demonstrates

  • One file, multiple environments: pixi.toml defines all four environments. No separate environment.yml files.
  • Reproducibility: pixi.lock pins every dependency exactly. pixi install --all gives the same result on any machine, any day.
  • No conflict resolution: Prokka and Bakta have incompatible dependencies. Pixi isolates them automatically.
  • Symlink database pattern: Prokka's db/ is a symlink to databases/prokka/, the same pattern as taxonomy_bundle's vault.
  • Activation env vars: [activation.env] sets BAKTA_DB automatically, the same pattern for all 19 databases in taxonomy_bundle.
  • Task automation: pixi run replaces long shell commands. pixi task list shows everything available.
  • Pipeline integration: Snakemake orchestrates across environments using pixi run -e <env> inside shell rules.
  • Directory demarcation: each pixi project is self-contained. Multiple projects coexist without interference.

9. Citations

If you use Prokka in your work please cite:

Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics 30(14):2068-2069. https://doi.org/10.1093/bioinformatics/btu153

Note on Prokka versions: This demo uses Prokka 1.15.6 (Python 3.10). Earlier versions (1.14.x) are widely cited but may fail to install with modern Python versions. Use version 1.15.6 or later via pixi or conda-forge.

If you use Bakta in your work please cite:

Schwengers et al. (2021) Bakta: rapid and standardised annotation of bacterial genomes via a comprehensive database. Microbial Genomics. https://doi.org/10.1099/mgen.0.000685

If you use NCBI Datasets in your work please cite: https://www.ncbi.nlm.nih.gov/datasets/


10. Author

Bharat K.C. Patel — microbiologist, bioinformatics educator, 30+ years at the intersection of microbiology and computational biology.
