This repository is a minimal, self-contained demonstration of the principles behind taxonomy_bundle — a 17-environment pixi project for microbial genomics research.
Before tackling 17 environments, 60+ tools, and ~1.2 TB of databases, this demo lets you experience the core concepts with just four environments, two annotation tools, and two databases on a single well-characterised genome.
What you will learn by running this demo:
- How a single pixi.toml defines multiple isolated environments
- How pixi installs tools without dependency conflicts
- How databases are managed explicitly and connected to tools automatically
- How Snakemake stitches pixi environments together into a pipeline
- How these four concepts scale directly to taxonomy_bundle's 17 environments
What this demo runs:
| Step | Tool | Environment | Purpose |
|---|---|---|---|
| Download genome | NCBI Datasets CLI | env-datasets | Download E. coli K-12 from NCBI |
| Annotate | Prokka 1.15.6 | env-prokka | Fast prokaryotic annotation |
| Annotate | Bakta 1.12.0 | env-bakta | NCBI-compliant annotation |
| Pipeline | Snakemake 9.16.2 | env-snakemake | Automate all steps |
These tools must be present before you begin. Check each one first — on most Ubuntu systems git and tmux are already installed.
# Check if already installed
git --version
Already installed: a version number appears, so no action is needed.
Personal machine with administrator access:
sudo apt update && sudo apt install git -y
Shared institutional server without sudo access:
# Check for a loadable module
module avail git
module load git
# Check if already installed
pixi --version
If not installed: pixi requires no sudo privileges and installs entirely in your home directory, so it is safe on shared institutional servers:
curl -fsSL https://pixi.sh/install.sh | bash
source ~/.bashrc
pixi --version # confirm installation
tmux keeps your session alive during long downloads: if your connection drops, the download continues in the background.
# Check if already installed
tmux -V
Personal machine with administrator access:
sudo apt install tmux -y
Shared institutional server without sudo access:
module avail tmux
module load tmux
tmux commands used in this demo:
| Command | What it does |
|---|---|
| tmux new -s name | Start a new named session |
| Ctrl+B then D | Detach; the job keeps running in the background |
| tmux attach -t name | Reattach to a running session |
| tmux ls | List all running sessions |
| exit | Close the session when finished |
annotation-demo and taxonomy_bundle are independent projects. Each must live in its own directory — this is the pixi project model. See section 6 for a full explanation of pixi projects and directory demarcation.
# Go to your software directory — alongside taxonomy_bundle, not inside it
cd ~/software
# Clone annotation-demo into its own directory
git clone https://github.com/bharat1912/annotation-demo
cd annotation-demo
Your directory layout should look like this:
~/software/
├── annotation-demo/ <- this project (you are here)
└── taxonomy_bundle/ <- separate project, not affected
# Installs all four environments in parallel from pixi.toml
# No conda activate, no conflict resolution, no manual environment management
pixi install --all
This takes approximately 2-5 minutes. Verify that all four environments installed:
pixi run -e env-prokka prokka --version # should return 1.15.6
pixi run -e env-bakta bakta --version # should return 1.12.0
pixi run -e env-snakemake snakemake --version # should return 9.16.2
pixi run -e env-datasets datasets --version # should return a version number
Both tools require databases before they can annotate. This demo stores
both in a databases/ directory. See section 6 for how this mirrors
the vault architecture in taxonomy_bundle.
Prokka database:
# Sets up Prokka database in databases/prokka/ and links it to the environment
# See section 6 for how this symlink pattern works and mirrors taxonomy_bundle
pixi run -e env-prokka download-prokka-db
This takes approximately 2-3 minutes. Verify:
pixi run -e env-prokka prokka --listdb
Bakta database:
The Bakta database (~1.34 GB) takes approximately 70 minutes on a standard connection. Run in tmux so it survives disconnects:
# Option 1 — with tmux (recommended for long downloads)
tmux new -s bakta_db
pixi run -e env-bakta download-bakta-db
# Press Ctrl+B then D to detach — download continues in background
# Reattach later: tmux attach -t bakta_db
# Type exit when complete to close the session
# Option 2 — directly in terminal (do not close until complete)
pixi run -e env-bakta download-bakta-db
Run once only: re-running downloads the entire database again unnecessarily. See section 6 for why Bakta requires an explicit download.
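Because re-running the download task fetches the whole database again, it can help to guard the call. The following is a minimal sketch, not part of the demo itself: the function names `needs_download` and `download_bakta_db_once` are hypothetical, and the default path assumes the database lives in databases/bakta/db-light as described elsewhere in this document.

```shell
#!/usr/bin/env bash
# Sketch: only start the Bakta download when the database directory is
# missing or empty. Function names are illustrative, not part of the demo.

# Returns 0 (true) when the directory does not exist or contains nothing.
needs_download() {
    local db_dir="$1"
    [ ! -d "$db_dir" ] && return 0
    [ -z "$(ls -A "$db_dir" 2>/dev/null)" ]
}

# Runs the demo's download task once; the default path is an assumption.
download_bakta_db_once() {
    local db_dir="${1:-databases/bakta/db-light}"
    if needs_download "$db_dir"; then
        pixi run -e env-bakta download-bakta-db
    else
        echo "Bakta database already present in $db_dir, skipping download"
    fi
}
```

Call `download_bakta_db_once` from inside the annotation-demo directory. The check only looks for a non-empty directory, so an interrupted download would still be treated as complete and would need to be removed by hand.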
There are two ways to run the demo. Choose one approach per session — do not mix them in the same directory without cleaning outputs first.
Option A: step by step (learning mode). Run each tool individually to see exactly what it does and what it produces. Recommended for first-time users.
Option B: single pipeline command. Run the complete workflow with one Snakemake command. Recommended once you have completed Option A and understand what each tool does.
# Downloads E. coli K-12 MG1655 from NCBI (~4.6 Mb)
# See section 7 for alternative genome downloaders
pixi run -e env-datasets download-genome
Verify:
ls -lh ecoli_k12.fna
# Annotates ecoli_k12.fna using databases/prokka/
# Prokka finds its database automatically via symlink — no --dbdir flag needed
# Takes approximately 30 seconds
pixi run -e env-prokka run-prokka
Verify:
ls -lh results/prokka/
# Annotates ecoli_k12.fna using databases/bakta/db-light/
# BAKTA_DB is set automatically by pixi before this command runs
# Takes approximately 2 minutes
pixi run -e env-bakta run-bakta
Verify:
ls -lh results/bakta/
# View annotation summaries
cat results/prokka/ecoli_k12.txt
cat results/bakta/ecoli_k12.txt
# Count annotated CDS features
grep -c "CDS" results/prokka/ecoli_k12.gff
grep -c "CDS" results/bakta/ecoli_k12.gff3
# List all output files
ls -lh results/prokka/
ls -lh results/bakta/
Both tools annotated the same genome but found slightly different numbers of features. This is expected: Prokka and Bakta use different gene prediction approaches and databases, and neither is wrong. The difference reflects the tools' different design goals:
- Prokka is fast and useful for quick checks and outputs used in pangenome studies.
- Bakta produces NCBI-compliant output for GenBank submissions.
For a well-characterised genome like E. coli K-12, both results are reliable. For novel or poorly characterised organisms, the choice of tool matters more.
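The grep -c "CDS" commands used earlier count every line that contains the string CDS, including any line where it appears only in an attribute or note. A stricter count can match on the feature-type column instead. The sketch below assumes standard tab-separated GFF3 with the feature type in column 3; the function name `count_cds` is hypothetical.

```shell
#!/usr/bin/env bash
# Count rows whose feature-type column (column 3) is exactly "CDS".
# Scanning stops at the ##FASTA directive, so embedded sequence is skipped.
count_cds() {
    awk -F'\t' '
        /^##FASTA/            { exit }   # END still runs after exit
        !/^#/ && $3 == "CDS"  { n++ }
        END                   { print n + 0 }
    ' "$1"
}
```

Usage, from the project directory: `count_cds results/prokka/ecoli_k12.gff` and `count_cds results/bakta/ecoli_k12.gff3`. For these two outputs the numbers may well match the grep counts; the column test simply removes one source of ambiguity.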
Prokka generates an ecoli_k12.err file containing NCBI compliance
discrepancy reports. This is normal — it flags items NCBI would require
for GenBank submission such as missing project IDs and structured comments.
For research annotation it can be ignored. Bakta's --compliant flag
produces cleaner output for direct NCBI submission.
Prokka supports a --compliant flag that enforces NCBI submission
standards — it adds gene features for each CDS, ignores contigs under
200 bp, and requires a sequencing centre ID. This demo runs Prokka
without --compliant deliberately, to highlight the difference between
a fast research annotation and an NCBI-ready submission. The small
difference in CDS count between Prokka (4315) and Bakta (4295) partly
reflects this.
If you need NCBI-compliant Prokka output, add --compliant --centre YOUR_CENTRE
to the run-prokka task in pixi.toml. For final genome submissions,
Bakta with --compliant is the recommended tool: once that flag is set, it
handles compliance without further options.
For final GenBank submissions, NCBI's own annotation pipeline PGAP is
the gold standard — it is the same pipeline NCBI runs internally when
you submit a genome. However, PGAP is not a conda package and cannot be
added to pixi.toml in the same way as Prokka and Bakta. It runs
inside a Docker or Singularity container and requires ~100 GB of
database storage and 32 GB RAM minimum.
For most labs Bakta with --compliant is the practical choice for
submissions — it produces GenBank-ready output that passes NCBI
validation without the infrastructure overhead.
In taxonomy_bundle: env-pgap2 wraps PGAP using Singularity via pixi tasks, demonstrating how container-based tools can be integrated alongside conda-based tools in the same pixi.toml. See section 6 for how this pattern works.
Snakemake reads Snakefile_demo and runs all steps in the correct order —
genome download, Prokka annotation, Bakta annotation — each in its own
pixi environment. See section 6 for how this pipeline pattern scales to
taxonomy_bundle's 4 production Snakefiles.
If you ran Option A first, delete outputs so Snakemake runs all steps:
# Snakemake skips steps whose outputs already exist — delete to run fresh
# databases/ must remain — do not delete it
rm -f ecoli_k12.fna ecoli_k12.zip
rm -rf ncbi_dataset/ results/prokka/ results/bakta/
Run the pipeline:
# Dry run first — shows all steps without executing anything
pixi run -e env-snakemake dry-run
# Full pipeline run — download, annotate with Prokka, annotate with Bakta
pixi run -e env-snakemake run-pipeline
# Optional: generate a DAG diagram of the pipeline
# Requires graphviz: sudo apt install graphviz -y
pixi run -e env-snakemake dag
If interrupted, re-run run-pipeline. Snakemake resumes from where it
stopped without re-running completed steps.
After running either Option A or Option B:
results/
├── prokka/
│ ├── ecoli_k12.gff # Gene annotation (GFF3)
│ ├── ecoli_k12.gbk # GenBank format
│ ├── ecoli_k12.faa # Protein sequences
│ └── ecoli_k12.txt # Summary statistics
└── bakta/
    ├── ecoli_k12.gff3 # Gene annotation (GFF3, NCBI-compliant)
    ├── ecoli_k12.gbff # GenBank format (submission-ready)
    ├── ecoli_k12.faa # Protein sequences
    └── ecoli_k12.txt # Summary statistics
Now that you have run the demo, the concepts behind taxonomy_bundle's 17-environment design will be much easier to understand.
A pixi project is a directory containing a pixi.toml file. That directory
is the project home — all environments, databases, and tasks belong to it.
Multiple pixi projects can coexist on the same machine but in different
directories — they never interfere because of this directory demarcation.
This is fundamentally different from Conda, which has a global base
environment that can be contaminated by installing packages into it. A pixi
project keeps its environments inside its own .pixi/ directory, so projects
share no environment state. Each project is self-contained. Remove the
directory and everything it installed disappears with it.
~/software/
├── annotation-demo/ <- independent pixi project (4 environments)
│ ├── pixi.toml
│ ├── pixi.lock
│ ├── databases/
│ └── .pixi/
└── taxonomy_bundle/ <- independent pixi project (17 environments)
    ├── pixi.toml
    ├── pixi.lock
    ├── db_link/
    └── .pixi/
The golden rule: the directory you are in when you type pixi run determines which project and which environments are used.
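To see which project a given working directory belongs to, the lookup can be imitated in a few lines of shell. This is a sketch under the assumption that the manifest is found by checking the current directory and then each parent in turn; the function name `find_pixi_project` is hypothetical and is not a pixi command.

```shell
#!/usr/bin/env bash
# Walk upward from a starting directory until a pixi.toml is found.
# Prints the project directory, or returns 1 if no manifest exists.
find_pixi_project() {
    local dir
    dir="$(cd "${1:-$PWD}" && pwd -P)" || return 1
    while :; do
        if [ -f "$dir/pixi.toml" ]; then
            echo "$dir"
            return 0
        fi
        # Reached the filesystem root without finding a manifest
        [ "$dir" = "/" ] && return 1
        dir="$(dirname "$dir")"
    done
}
```

Running `find_pixi_project` inside ~/software/annotation-demo/results/prokka would print the annotation-demo directory, while the same call from anywhere under taxonomy_bundle/ would print that project instead.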
| Feature | annotation-demo | taxonomy_bundle |
|---|---|---|
| Purpose | Learn pixi concepts with a minimal working example | Full production pipeline for microbial genomics research |
| Environments | 4 | 17 |
| Tools | NCBI Datasets, Prokka, Bakta, Snakemake | 60+ tools across assembly, taxonomy, pangenomics, comparative genomics, trait prediction |
| Databases | 2 in databases/ (~1.5 GB total) | 19 in $EXTERNAL_VAULT (~1.2 TB total) |
| Prokka database | symlink: .pixi/envs/env-prokka/db/ → databases/prokka/ | symlink: .pixi/envs/env-pan/db/ → $EXTERNAL_VAULT/prokka_db/ |
| Bakta database | BAKTA_DB via [activation.env] pointing to databases/bakta/db-light | BAKTA_DB via [activation.env] pointing to db_link/bakta |
| All other databases | n/a | 17 more env vars set via [activation.env] |
| Snakemake pipelines | 1 demo pipeline (3 rules) | 4 production pipelines (50+ rules) |
| Install time | ~5 minutes | ~30-60 minutes |
| Disk space | ~3 GB | ~1.5 TB |
| Python versions | 3.10, 3.12 | 3.9, 3.10, 3.11, 3.12 (different per environment) |
| RAM requirement | 4 GB sufficient | Up to 320 GB for GTDB-Tk pplacer step |
Every tool in taxonomy_bundle follows the same pattern you just used:
pixi run -e env-checkm2 checkm2 predict ...
pixi run -e env-pan prokka ...
pixi run -e env-a bakta ...
pixi run -e env-busco busco ...
pixi run -e env-anti antismash ...
pixi run -e env-egg emapper.py ...
pixi run -e env-b snakemake -s Snakefile_hybrid_taxonomy.smk
Each environment holds one tool or a group of compatible tools. Each Snakemake rule
calls pixi run -e <env> <tool>, exactly as in Snakefile_demo.
Database paths are handled automatically by symlinks and [activation.env].
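A rule that follows this pattern might look like the sketch below. It is written to a scratch directory so it cannot overwrite anything in the project. The rule names, file paths and Prokka options are illustrative assumptions and are not copied from Snakefile_demo.

```shell
#!/usr/bin/env bash
# Sketch only: write a small Snakefile showing one rule that calls a tool
# through its pixi environment. All names below are illustrative.
workdir="$(mktemp -d)"
cat > "$workdir/Snakefile_sketch" <<'EOF'
rule all:
    input:
        "results/prokka/ecoli_k12.gff"

# --force is included because Snakemake creates results/prokka/ before the
# job starts, and Prokka refuses to write into an existing directory without it.
rule prokka:
    input:
        "ecoli_k12.fna"
    output:
        "results/prokka/ecoli_k12.gff"
    shell:
        "pixi run -e env-prokka prokka --outdir results/prokka "
        "--prefix ecoli_k12 --force {input}"
EOF
echo "Wrote $workdir/Snakefile_sketch"
```

The design point is that Snakemake itself runs from env-snakemake, while each shell command switches to the environment that owns the tool, so no rule depends on whatever happens to be on the caller's PATH.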
Two patterns handle database paths in both projects:
Pattern 1 — Symlink (Prokka)
download-prokka-db indexes the database into databases/prokka/ then
replaces .pixi/envs/env-prokka/db/ with a symlink:
.pixi/envs/env-prokka/db/ → databases/prokka/
Prokka looks for its database in db/ inside the environment, finds the
symlink, and follows it automatically. No --dbdir flag is ever needed.
taxonomy_bundle uses the identical pattern:
.pixi/envs/env-pan/db/ → $EXTERNAL_VAULT/prokka_db/
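The mechanics of the symlink pattern can be tried safely in a scratch directory. The sketch below mirrors the demo's directory names but touches nothing in a real project; the placeholder file stands in for the real database files and is an assumption.

```shell
#!/usr/bin/env bash
# Self-contained sketch of the symlink pattern in a throwaway directory.
project="$(mktemp -d)"
mkdir -p "$project/.pixi/envs/env-prokka/db" "$project/databases/prokka"
touch "$project/databases/prokka/placeholder.db"   # stand-in for real files

env_db="$project/.pixi/envs/env-prokka/db"

# Remove the bundled db/ directory, then point db/ at the managed location
rm -rf "$env_db"
ln -s "$project/databases/prokka" "$env_db"

# Anything that opens db/ inside the environment now reads databases/prokka/
readlink "$env_db"
ls "$env_db"
```

One consequence worth noting: if the environment is ever deleted and reinstalled, the symlink has to be created again, since it lives inside .pixi/.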
Pattern 2 — Activation env var (Bakta and all others)
[activation.env] in pixi.toml sets BAKTA_DB automatically before
every pixi run -e env-bakta command. The tool reads the variable and
finds its database without any flags.
taxonomy_bundle uses this pattern for all 19 databases:
BAKTA_DB, GTDBTK_DATA_PATH, CHECKM2_DB and 16 more —
all set once in [activation.env], all used automatically on every run.
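As an illustration of the activation pattern, the sketch below writes a minimal manifest to a scratch directory. It is not the demo's actual pixi.toml: the workspace name, the feature name `bakta`, the feature-scoped table `[feature.bakta.activation.env]` and the relative database path are all assumptions chosen for the example, and older pixi releases may expect `[project]` rather than `[workspace]`.

```shell
#!/usr/bin/env bash
# Sketch only: a minimal manifest with one environment whose activation
# exports BAKTA_DB. Written to a scratch directory, not to the project.
workdir="$(mktemp -d)"
cat > "$workdir/pixi.toml" <<'EOF'
[workspace]
name = "activation-sketch"
channels = ["conda-forge", "bioconda"]
platforms = ["linux-64"]

[feature.bakta.dependencies]
bakta = "1.12.0"

# Exported before every `pixi run -e env-bakta ...`
[feature.bakta.activation.env]
BAKTA_DB = "databases/bakta/db-light"

[environments]
env-bakta = ["bakta"]
EOF
echo "Wrote $workdir/pixi.toml"
```

With a manifest like this, `pixi run -e env-bakta printenv BAKTA_DB` is a quick way to confirm what the tool will see. A relative value is only meaningful when commands are run from the project directory, so an absolute path may be the safer choice in practice.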
Bakta was deliberately designed this way for scientific reproducibility. The database is large, actively maintained, and versioned. By downloading it explicitly, you always know which database version produced your annotation results — a requirement for publication. The Snakemake pipeline cannot download it automatically because that would repeat a 70-minute download for every pipeline run.
Prokka bundles a smaller database with its conda install. --setupdb
rebuilds it — but always into the bundled db/ directory regardless of
--dbdir. The symlink approach redirects this to databases/prokka/
transparently.
In this demo, databases live in databases/. In taxonomy_bundle, all 19
databases live in $EXTERNAL_VAULT on an external drive:
- setup-vault creates 19 subdirectories in $EXTERNAL_VAULT and creates symlinks in db_link/ pointing to each one
- For Prokka: .pixi/envs/env-pan/db/ is symlinked to $EXTERNAL_VAULT/prokka_db/, identical to this demo
- For all other tools: [activation.env] sets the database variable to point at the appropriate db_link/ symlink
- Moving all databases to a new drive requires updating one variable, $EXTERNAL_VAULT in ~/.bashrc, and re-running setup-vault
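The vault idea can be sketched in a few lines of shell. This is a hypothetical illustration of what a setup-vault-style task might do, not the real task from taxonomy_bundle; the function name `make_vault_links` and its argument order are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: create one subdirectory per database in the vault
# and a matching symlink in the link directory.
# Usage: make_vault_links <vault_dir> <link_dir> <name>...
make_vault_links() {
    local vault="$1" link_dir="$2" name
    shift 2
    mkdir -p "$link_dir"
    for name in "$@"; do
        mkdir -p "$vault/$name"
        # -n replaces an existing link instead of creating one inside its target
        ln -sfn "$vault/$name" "$link_dir/$name"
    done
}
```

For example, `make_vault_links "$EXTERNAL_VAULT" db_link bakta prokka_db` would create both directories and links. Re-running it after changing $EXTERNAL_VAULT retargets every link, which is why tools that read paths through db_link/ need no further changes.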
The concept is identical. The scale is not.
AI can generate pixi.toml structure, write Snakefiles, and document
workflows. What AI cannot know without being told:
- That Bakta requires a pre-downloaded database by deliberate design
- That Prokka's --setupdb always rebuilds into the bundled db/ directory, so the symlink approach is needed to redirect it
- That PROKKA_DB does not work for --setupdb but the symlink does
- That Prokka 1.14.x fails on Python 3.12 but 1.15.6 works
- That GTDB-Tk requires 320 GB RAM for its pplacer step
- That miga init overwrites configuration and must never be run
- That bacLIFE requires Snakemake 7, not 8 or 9
This knowledge comes from running the tools, reading errors, and accumulated experience. The way to build it is exactly what you just did — run each tool from its environment, observe the output, and understand what it produces and why.
AI amplifies domain expertise. Running tools from their pixi environments is how you build that expertise.
This demo uses ncbi-datasets-cli — the current NCBI recommended tool
for downloading genomes by accession. Other downloaders work equally well
and can be added as tasks in pixi.toml using the same pattern:
| Tool | Best for |
|---|---|
| ncbi-datasets-cli | Modern NCBI downloads by accession (used here) |
| wget / curl | Direct FTP/HTTPS; simple, no extra install |
| kingfisher | SRA reads via multiple backends (AWS, ENA, Aspera) |
| ncbi-genome-download | Batch downloads by organism or taxonomic group |
| genome_updater | Keeping local NCBI genome collections up to date |
All follow the same pattern — add the tool to [feature.X.dependencies]
and define a download task in [feature.X.tasks].
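As an example of that pattern, the sketch below writes a manifest fragment for ncbi-genome-download to a scratch file. The feature name `ngd`, the environment name `env-ngd`, the task name `download-genome-ngd` and the accession GCF_000005845.2 (the RefSeq assembly for E. coli K-12 MG1655) are assumptions chosen for illustration; none of them come from this repository.

```shell
#!/usr/bin/env bash
# Sketch only: a fragment that could be merged into an existing pixi.toml.
# Written to a scratch file so the real manifest is left untouched.
workdir="$(mktemp -d)"
cat > "$workdir/pixi-fragment.toml" <<'EOF'
[feature.ngd.dependencies]
ncbi-genome-download = "*"

[feature.ngd.tasks]
download-genome-ngd = "ncbi-genome-download bacteria --section refseq --assembly-accessions GCF_000005845.2 --formats fasta"

# In the existing [environments] table, add:
# env-ngd = ["ngd"]
EOF
echo "Wrote $workdir/pixi-fragment.toml"
```

After merging the fragment and running pixi install, the task would be invoked as `pixi run -e env-ngd download-genome-ngd`. Pinning an exact version instead of "*" would match how the other tools in this demo are specified.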
- One file, multiple environments: pixi.toml defines all four environments. No separate environment.yml files.
- Reproducibility: pixi.lock pins every dependency exactly. pixi install --all gives the same result on any machine, any day.
- No conflict resolution: Prokka and Bakta have incompatible dependencies. Pixi isolates them automatically.
- Symlink database pattern: Prokka's db/ is a symlink to databases/prokka/, the same pattern as taxonomy_bundle's vault.
- Activation env vars: [activation.env] sets BAKTA_DB automatically, the same pattern for all 19 databases in taxonomy_bundle.
- Task automation: pixi run replaces long shell commands. pixi task list shows everything available.
- Pipeline integration: Snakemake orchestrates across environments using pixi run -e <env> inside shell rules.
- Directory demarcation: each pixi project is self-contained. Multiple projects coexist without interference.
If you use Prokka in your work please cite:
Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics 30(14):2068-2069. https://doi.org/10.1093/bioinformatics/btu153
Note on Prokka versions: This demo uses Prokka 1.15.6 (Python 3.10). Earlier versions (1.14.x) are widely cited but may fail to install with modern Python versions. Use version 1.15.6 or later via pixi or conda-forge.
If you use Bakta in your work please cite:
Schwengers et al. (2021) Bakta: rapid and standardised annotation of bacterial genomes via a comprehensive database. Microbial Genomics. https://doi.org/10.1099/mgen.0.000685
If you use NCBI Datasets in your work please cite: https://www.ncbi.nlm.nih.gov/datasets/
Bharat K.C. Patel — microbiologist, bioinformatics educator, 30+ years at the intersection of microbiology and computational biology.