From FASTA to foundation model, fast.
🚀 Ship a protein language model without writing a training loop. nanoPLM gives you a batteries‑included CLI, reproducible data workflows, and simple YAML files to control everything.
- Control everything with simple YAML files: both data preparation and pretraining are configured in YAML.
- Data you can trust: Data Version Control (DVC) under the hood.
- Scale sensibly: multi‑GPU ready.
Install the package from PyPI:

```shell
pip install nanoplm
```

Remember: for CUDA, you should install the dion package (used for the muon and normuon optimizers) manually.

```shell
pip install git+https://github.com/alint77/dion@dev/megabatching
```

For benchmarking you also need Biotrainer:

```shell
pip install git+https://github.com/sacdallago/biotrainer@main
```

```shell
nanoplm data get-yaml
```

You'll get a `params.yaml` and a `dvc.yaml` file. Just edit `params.yaml` if you want.
We're using DVC under the hood, so your data versions are tracked.
Use the command below to prepare your data for pLM pretraining (you'll get train and val FASTAs):

```shell
nanoplm data from-yaml
```

By default, this uses the `params.yaml` in your current directory. You can optionally specify a different path argument (relative or absolute) if needed:

```shell
nanoplm data from-yaml <path/to/params.yaml>
```
For pretraining shard generation, nanoPLM automatically uses a native C tokenizer/writer backend (OpenMP-enabled if available) when using the default protein tokenizer. Disable it with `NANOPLM_DISABLE_NATIVE_SHARD_WRITER=1`.

The `shuffle`, `filter`, and `split` steps also use native C FASTA backends when available. Disable them with `NANOPLM_DISABLE_NATIVE_FASTA_OPS=1`.
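These switches are ordinary environment variables read at runtime. As a minimal sketch of what such a toggle amounts to (the function name and behavior here are assumptions for illustration, not nanoPLM's actual code, which likely also checks that the native extension was built):

```python
import os

def use_native_shard_writer() -> bool:
    """Return True unless the user opted out via the environment.

    Illustrative sketch only; nanoPLM's real backend selection may differ.
    """
    return os.environ.get("NANOPLM_DISABLE_NATIVE_SHARD_WRITER") != "1"

os.environ.pop("NANOPLM_DISABLE_NATIVE_SHARD_WRITER", None)
assert use_native_shard_writer()        # variable unset -> native backend

os.environ["NANOPLM_DISABLE_NATIVE_SHARD_WRITER"] = "1"
assert not use_native_shard_writer()    # opted out -> pure-Python fallback
```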
📊 Now your data is ready! Let's start the training.
```shell
nanoplm pretrain get-yaml
```

This writes the pretraining YAML file to your current directory.

```shell
nanoplm distill get-yaml
```

This writes the distillation YAML file to your current directory.
```shell
nanoplm pretrain from-yaml
```

or

```shell
nanoplm distill from-yaml
```

Example `params.yaml`:

```yaml
data_params:
  # Pipeline mode: 'pretrain', 'distillation', or 'none'
  # - 'pretrain': Generate HDF5 shards for MLM pretraining
  # - 'distillation': Generate teacher embeddings for knowledge distillation
  # - 'none': Only run data preparation (download, filter, split)
  pipeline_mode: "pretrain"

  seqs_num: 20000
  min_seq_len: 20
  max_seq_len: 512
  val_ratio: 0.1
  device: "auto"
  shuffle_backend: "auto" # auto -> seqkit if available, else fast
  shuffle: true
  shuffle_seed: 24
  filter_skip_n: 0

  # Pretrain config (used when pipeline_mode: 'pretrain')
  # A .data_manifest file will be created in output_dir for use by the pretrain pipeline
  pretrain_config:
    output_dir: "output/data/pretrain_data" # Will contain train/ and val/ subdirs
    samples_per_shard: 2000
    max_workers: 2 # -1 to use all available CPUs
    force: false

  # Distillation config (used when pipeline_mode: 'distillation')
  # A .data_manifest file will be created in output_dir for use by the distill pipeline
  distillation_config:
    output_dir: "output/data/distillation_data" # Will contain train/ and val/ subdirs
    on_the_fly: false # If true, skip embedding generation (embeddings computed during training)
    samples_per_shard: 2000 # -1 for single file (no sharding)
    teacher_model: "prott5"
    embed_calc_batch_size: 4

  # Data directories
  data_dirs:
    url: "https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_sprot.fasta.gz"
    # swissprot: "https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_sprot.fasta.gz"
    # trembl: "https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_trembl.fasta.gz"
    # uniref50: "https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz"
    # uniref90: "https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz"
    # uniref100: "https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/uniref100.fasta.gz"
    compressed_fasta: "output/data/raw/uniref50.fasta.gz"
    extracted_fasta: "output/data/raw/uniref50.fasta"
    shuffled_fasta: "output/data/raw/uniref50_shuffled.fasta"
    filtered_fasta: "output/data/filter/uniref50_filtered.fasta"
    splitted_fasta_dir: "output/data/split"
```
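With the default params.yaml values above (seqs_num: 20000, val_ratio: 0.1, samples_per_shard: 2000), the split and shard counts work out as below. This is back-of-the-envelope arithmetic, not nanoPLM's actual code; the exact rounding the pipeline applies may differ:

```python
import math

# Defaults from the generated params.yaml
seqs_num = 20_000
val_ratio = 0.1
samples_per_shard = 2_000

# Assumed rounding; nanoPLM may split slightly differently.
val_seqs = int(seqs_num * val_ratio)    # 2000 sequences go to val
train_seqs = seqs_num - val_seqs        # 18000 sequences go to train

# Shards fill up to samples_per_shard, so the last one may be partial.
train_shards = math.ceil(train_seqs / samples_per_shard)  # 9 shard files
val_shards = math.ceil(val_seqs / samples_per_shard)      # 1 shard file

print(train_seqs, val_seqs, train_shards, val_shards)  # 18000 2000 9 1
```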
Pretraining configuration for nanoPLM:

```yaml
# IMPORTANT: Before running pretraining, ensure you have prepared your data with:
# 1. Set pipeline_mode: 'pretrain' in params.yaml
# 2. Run: nanoplm data from-yaml
# This will generate binary shards and a .data_manifest file.

model:
  hidden_size: 768
  intermediate_size: 1536
  num_hidden_layers: 12
  num_attention_heads: 8
  vocab_size: 32
  mlp_activation: "swiglu"
  mlp_dropout: 0.0
  mlp_bias: false
  attention_bias: false
  attention_dropout: 0.0
  classifier_activation: "gelu"
  # The options below only work on pure-torch and TE pipelines
  use_resid_lambdas: true # scales residual stream per layer
  use_x0_lambdas: true # blends initial embedding x0 per layer
  use_qk_norm: false # applies RMS norm to Q/K in attention
  use_canon_layers: true # enables bidirectional Canon-ABCD (pure_torch only)
  canon_layers_mode: "ac" # subset of Canon sites (A/B/C/D); "ac" is lighter/faster than full "abcd"
  canon_layers_kernel_size: 5 # symmetric Canon kernel size (allowed: 3/5/7, default: 5)

pretraining:
  # Dataset directory (contains .data_manifest from nanoplm data from-yaml)
  # Note: paths are RELATIVE to where you RUN the command, NOT the YAML file.
  dataset_dir: "output/data/pretrain_data"

  # Output model path
  ckp_dir: "output/pretraining_checkpoints"

  # Hyperparameters
  micro_batch_size: 64
  global_batch_size: 1048576 # 2^20 ≈ 1M tokens/step (based on PLM best practices)
  # micro_batch_size: samples per GPU per forward pass (limited by GPU memory)
  # global_batch_size: total tokens per optimizer step across all GPUs
  # gradient_accumulation_steps is inferred automatically:
  #   grad_accum = ceil(global_batch_size / (micro_batch_size * max_seq_len * num_gpus))
  num_epochs: 10
  warmup_ratio: 0.05
  optimizer: "adamw" # adamw, stable_adamw, muon, normuon

  # AdamW hyperparameters (also used for the AdamW side [1D and embedding/unembed params] when optimizer is muon or normuon)
  adam_beta1: 0.9
  adam_beta2: 0.999
  adam_epsilon: 1e-8
  adam_learning_rate: 1e-4 # AdamW LR (Muon uses muon_learning_rate)
  max_grad_norm: .inf # set to .inf (equivalent to float("inf")) to disable clipping
  warmup_steps: 302
  lr_decay_to_fraction: 0.1
  lr_schedule: "cosine" # cosine or linear
  adam_weight_decay: 0.0

  # Muon/NorMuon hyperparameters (used only when optimizer: muon or normuon)
  muon_learning_rate: 1e-3
  muon_weight_decay: 0.01
  muon_cautious_weight_decay: true
  muon_use_polar_express: true
  muon_momentum: 0.95
  muon_nesterov: true
  muon_eps: 1e-7

  mlm_probability: 0.3
  mask_replace_prob: 0.8
  random_token_prob: 0.1
  keep_probability: 0.1

  logging_steps: 1
  eval_steps: 250
  save_steps: 5000
  seed: 42
  num_workers: "auto"
  prefetch_factor: 2

  # Sequence packing: concatenates shorter sequences into fewer rows to eliminate
  # padding waste and increase GPU utilization. Requires flash attention and --pure-torch/--pure-te
  use_packing: true
  # Experimental throughput optimization: with packing, enables static input sizes,
  # which allows torch.compile(dynamic=False) and CUDA graphs
  use_static_inp_size: true

  # Mixed precision training (recommended: keep enabled for 1.5-3x speedup)
  # When bf16 is true, automatically selects the best precision for your hardware:
  # - CUDA Ampere+ (A100, RTX 3090+): bf16 + TF32
  # - CUDA Volta/Turing (V100, RTX 2080): fp16 fallback
  # - Apple Silicon (M1/M2/M3): fp16 (hardware accelerated)
  # - CPU: fp32 (no mixed precision)
  bf16: true
  tf32: true # TF32 mode on Ampere+ CUDA GPUs only (automatically not used on MPS/CPU)
  # Provides 3x faster fp32 matmuls with negligible precision loss
  fp8: false # Enable FP8 Linear matmuls in pure_torch/pure_te paths (CUDA, best on H100+)

  multi_gpu: true
  world_size: "auto" # Use "auto" if you want to use all available GPUs
  project_name: "nanoplm-pretraining"

# Pure-torch training loop settings (alternative to HF Trainer).
pure_torch:
  enabled: false
  # torch.compile: compile the model for faster training. Disable for debugging,
  # unsupported hardware (e.g. Apple Silicon), or to avoid warmup overhead.
  use_compile: true
  # Sequence packing: concatenates shorter sequences into fewer rows to eliminate
  # padding waste and increase GPU utilization. Requires flash attention.
  use_packing: false
  # Fixed row count for static-shape compilation when use_packing is true (enables torch.compile dynamic=False).
  # Set to ceil(micro_batch_size * avg_len / max_seq_len) + margin. Leave null for dynamic=True.
  target_packed_rows: null

resume:
  # Set is_resume: true to resume training from a checkpoint
  # When resuming, the model, tokenizer, and training state will be loaded from checkpoint_dir
  # extra_epochs: adds to 'pretraining.num_epochs' to define the total number of epochs.
  is_resume: false
  checkpoint_dir: "output/pretraining_checkpoints/run-1/checkpoint-1"
  extra_epochs: 0
```
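The gradient-accumulation inference noted in the pretraining config can be checked by hand. With the defaults (global_batch_size 1048576 tokens, micro_batch_size 64, max_seq_len 512) on a single GPU:

```python
import math

# Defaults from the pretraining YAML
global_batch_size = 1_048_576  # tokens per optimizer step (2**20)
micro_batch_size = 64          # samples per GPU per forward pass
max_seq_len = 512              # max tokens per sample
num_gpus = 1

# grad_accum = ceil(global_batch_size / (micro_batch_size * max_seq_len * num_gpus))
grad_accum = math.ceil(global_batch_size / (micro_batch_size * max_seq_len * num_gpus))
print(grad_accum)  # 32: each optimizer step accumulates 32 micro-batches

# The same formula on 8 GPUs needs only 4 accumulation steps:
print(math.ceil(global_batch_size / (micro_batch_size * max_seq_len * 8)))  # 4
```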
Distillation configuration for nanoPLM:

```yaml
# IMPORTANT: Before running distillation, ensure you have prepared your data with:
# 1. Set pipeline_mode: 'distillation' in params.yaml
# 2. Set distillation_config.on_the_fly in params.yaml:
#    - false (default): Pre-compute teacher embeddings during data preparation
#    - true: Generate teacher embeddings on-the-fly during training
# 3. Run: nanoplm data from-yaml
# This will generate a .data_manifest file with the appropriate configuration.

model:
  hidden_size: 1024
  intermediate_size: 2048
  num_hidden_layers: 16
  num_attention_heads: 16
  mlp_activation: "swiglu"
  mlp_dropout: 0.0
  mlp_bias: false
  attention_bias: false
  attention_dropout: 0.0
  classifier_activation: "gelu"
  projection_layer: true # Set to false if student hidden_size matches the teacher's (1024)

distillation:
  # Dataset directory (contains .data_manifest from nanoplm data from-yaml)
  # The manifest automatically provides:
  # - max_seq_len, max_seqs_num, val_ratio
  # - on_the_fly mode and dataset paths (FASTA or H5)
  # Note: paths are RELATIVE to where you RUN the command, NOT the YAML file.
  dataset_dir: "output/data/distillation_data"

  # Output checkpoint path
  ckp_dir: "output/distillation_checkpoints"

  # Training hyperparameters
  num_epochs: 10
  batch_size: 32
  learning_rate: 1e-3
  gradient_accumulation_steps: 1
  warmup_ratio: 0.05

  # LR scheduler
  lr_scheduler: "cosine" # cosine, linear, polynomial, constant
  lr_scheduler_kwargs: {}

  # Data loader optimization
  max_open_files: 5
  chunk_size: 32
  prefetch_batches: 2
  use_threading: true
  num_workers: 4

  # Checkpointing
  project_name: "nanoplm-distillation"
  logging_steps: 1
  eval_steps: 250
  save_steps: 5000

  # Mixed precision training (recommended: keep enabled for 1.5-3x speedup)
  # When bf16 is true, automatically selects the best precision for your hardware:
  # - CUDA Ampere+ (A100, RTX 3090+): bf16 + TF32
  # - CUDA Volta/Turing (V100, RTX 2080): fp16 fallback
  # - Apple Silicon (M1/M2/M3): fp16 (hardware accelerated)
  # - CPU: fp32 (no mixed precision)
  bf16: true
  tf32: true # TF32 mode on Ampere+ CUDA GPUs only (automatically not used on MPS/CPU)
  # Provides 3x faster fp32 matmuls with negligible precision loss

  # Distributed training
  multi_gpu: false
  world_size: 1

  seed: 42

resume:
  # Set is_resume: true to resume training from a checkpoint
  # When resuming, the model, tokenizer, and training state will be loaded from checkpoint_dir
  # extra_epochs: adds to 'distillation.num_epochs' to define the total number of epochs.
  is_resume: false
  checkpoint_dir: "output/distillation/run-1/checkpoint-1"
  extra_epochs: 0
```

Tip: Paths are resolved relative to where you run the command (not where the YAML lives).
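Both pipelines default to a cosine learning-rate schedule with linear warmup (via warmup_ratio / warmup_steps, plus lr_decay_to_fraction for pretraining). A minimal sketch of such a schedule, illustrative only (nanoPLM's actual scheduler may differ in details):

```python
import math

def cosine_lr(step, total_steps, warmup_steps, peak_lr, final_fraction=0.0):
    """Linear warmup to peak_lr, then cosine decay to final_fraction * peak_lr.

    Illustrative sketch only; not nanoPLM's actual scheduler code.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    floor = final_fraction * peak_lr
    return floor + (peak_lr - floor) * 0.5 * (1.0 + math.cos(math.pi * progress))

peak = 1e-3
assert cosine_lr(0, 1000, 50, peak) == 0.0    # start of warmup
assert cosine_lr(50, 1000, 50, peak) == peak  # peak LR reached after warmup
# With final_fraction=0.1 (cf. lr_decay_to_fraction), LR decays to 10% of peak:
assert abs(cosine_lr(1000, 1000, 50, peak, final_fraction=0.1) - 0.1 * peak) < 1e-12
```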
Requirements:

- Python 3.12+
- macOS or Linux
- GPU recommended (CPU is fine for tiny tests)
PRs welcome. If you’re unsure where to start, open an issue with your use‑case.
If nanoPLM saved you time, a star helps others find it and keeps development going.