alint77/nanoplm

From FASTA to Foundation model — Fast.


🚀 Ship a protein language model without writing a training loop. nanoPLM gives you a batteries‑included CLI, reproducible data workflows, and simple YAML files to control everything.

🧬 What makes nanoPLM different?

  • Control everything with simple YAML files: configure data preparation and pretraining entirely through YAML.
  • Data you can trust: Data Version Control (DVC) under the hood keeps your datasets reproducible.
  • Scale sensibly: multi‑GPU ready out of the box.

🛠️ Install

Install the package from PyPI:

pip install nanoplm

For CUDA, install the dion package (used for the muon and normuon optimizers) manually:

pip install git+https://github.com/alint77/dion@dev/megabatching

For benchmarking, you also need Biotrainer:

pip install git+https://github.com/sacdallago/biotrainer@main

🤖 Zero‑to‑model in 4 commands

1. Get data YAML file

nanoplm data get-yaml

This writes params.yaml and dvc.yaml to your current directory. Edit params.yaml as needed.

nanoPLM uses DVC under the hood, so your data versions are tracked automatically.

2. Prepare your data

Use the command below to prepare your data for pLM pretraining (you'll get train and val FASTAs):

nanoplm data from-yaml

By default, this uses params.yaml in your current directory. You can optionally pass a different path (relative or absolute), e.g. nanoplm data from-yaml <path/to/params.yaml>

For pretraining shard generation, nanoPLM automatically uses a native C tokenizer/writer backend (OpenMP-enabled if available) when the default protein tokenizer is selected.
Disable it with NANOPLM_DISABLE_NATIVE_SHARD_WRITER=1.

The shuffle, filter, and split steps also use native C FASTA backends when available.
Disable them with NANOPLM_DISABLE_NATIVE_FASTA_OPS=1.
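Both switches are ordinary environment variables read at run time. A minimal sketch of the opt-out semantics (the helper function below is hypothetical, not part of nanoPLM's API):

```python
import os

def native_shard_writer_enabled() -> bool:
    """Hypothetical helper mirroring the documented opt-out switch."""
    return os.environ.get("NANOPLM_DISABLE_NATIVE_SHARD_WRITER") != "1"

# Default: variable unset, native backend is used.
os.environ.pop("NANOPLM_DISABLE_NATIVE_SHARD_WRITER", None)
enabled_by_default = native_shard_writer_enabled()   # True

# Setting the variable to "1" opts out of the native backend.
os.environ["NANOPLM_DISABLE_NATIVE_SHARD_WRITER"] = "1"
disabled = not native_shard_writer_enabled()         # True
```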

📊 Now your data is ready! Let's start the training.

3. Get a pretrain or distillation YAML file

nanoplm pretrain get-yaml

This writes a pretraining YAML file to your current directory.

nanoplm distill get-yaml

This writes a distillation YAML file to your current directory.

4. Start your pretraining or distillation

nanoplm pretrain from-yaml

or

nanoplm distill from-yaml

Data Preparation YAML

data_params:
  # Pipeline mode: 'pretrain', 'distillation', or 'none'
  # - 'pretrain': Generate HDF5 shards for MLM pretraining
  # - 'distillation': Generate teacher embeddings for knowledge distillation
  # - 'none': Only run data preparation (download, filter, split)
  pipeline_mode: "pretrain"

  seqs_num: 20000
  min_seq_len: 20
  max_seq_len: 512
  val_ratio: 0.1
  device: "auto"

  shuffle_backend: "auto"  # auto -> seqkit if available, else fast
  shuffle: true
  shuffle_seed: 24
  filter_skip_n: 0

# Pretrain config (used when pipeline_mode: 'pretrain')
# A .data_manifest file will be created in output_dir for use by pretrain pipeline
pretrain_config:
  output_dir: "output/data/pretrain_data"  # Will contain train/ and val/ subdirs
  samples_per_shard: 2000
  max_workers: 2  # -1 to use all available CPUs
  force: false

# Distillation config (used when pipeline_mode: 'distillation')
# A .data_manifest file will be created in output_dir for use by distill pipeline
distillation_config:
  output_dir: "output/data/distillation_data"  # Will contain train/ and val/ subdirs
  on_the_fly: false  # If true, skip embedding generation (embeddings computed during training)
  samples_per_shard: 2000  # -1 for single file (no sharding)
  teacher_model: "prott5"
  embed_calc_batch_size: 4

# Data directories
data_dirs:
  url: "https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_sprot.fasta.gz"
  # swissprot: "https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_sprot.fasta.gz"
  # trembl: "https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_trembl.fasta.gz"
  # uniref50: "https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz"
  # uniref90: "https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz"
  # uniref100: "https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/uniref100.fasta.gz"
  compressed_fasta: "output/data/raw/uniref50.fasta.gz"
  extracted_fasta: "output/data/raw/uniref50.fasta"
  shuffled_fasta: "output/data/raw/uniref50_shuffled.fasta"
  filtered_fasta: "output/data/filter/uniref50_filtered.fasta"
  splitted_fasta_dir: "output/data/split"
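With the defaults above, the resulting split and shard counts can be sanity-checked by hand (a quick arithmetic sketch; the exact rounding nanoPLM applies internally may differ slightly):

```python
import math

# Values from the data-preparation YAML above.
seqs_num = 20_000
val_ratio = 0.1
samples_per_shard = 2_000

val_seqs = int(seqs_num * val_ratio)   # 2000 sequences for validation
train_seqs = seqs_num - val_seqs       # 18000 sequences for training

train_shards = math.ceil(train_seqs / samples_per_shard)  # 9 shards
val_shards = math.ceil(val_seqs / samples_per_shard)      # 1 shard
print(train_shards, val_shards)  # 9 1
```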

Pretraining YAML

# Pretraining configuration for nanoPLM
#
# IMPORTANT: Before running pretraining, ensure you have prepared your data with:
#   1. Set pipeline_mode: 'pretrain' in params.yaml
#   2. Run: nanoplm data from-yaml
# This will generate binary shards and a .data_manifest file.

model:
  hidden_size: 768
  intermediate_size: 1536
  num_hidden_layers: 12
  num_attention_heads: 8
  vocab_size: 32
  mlp_activation: "swiglu"
  mlp_dropout: 0.0
  mlp_bias: false
  attention_bias: false
  attention_dropout: 0.0
  classifier_activation: "gelu"
  # The options below only work on pure-torch and TE pipelines
  use_resid_lambdas: true  # scales residual stream per layer
  use_x0_lambdas: true  # blends initial embedding x0 per layer
  use_qk_norm: false  # applies RMS norm to Q/K in attention
  use_canon_layers: true  # enables bidirectional Canon-ABCD (pure_torch only)
  canon_layers_mode: "ac"  # subset of Canon sites (A/B/C/D); "ac" is lighter/faster than full "abcd"
  canon_layers_kernel_size: 5  # symmetric Canon kernel size (allowed: 3/5/7, default: 5)

pretraining:
  # Dataset directory (contains .data_manifest from nanoplm data from-yaml)
  # Note: paths are RELATIVE to where you RUN the command, NOT the YAML file.
  dataset_dir: "output/data/pretrain_data"

  # Output model path
  ckp_dir: "output/pretraining_checkpoints"

  # Hyperparameters
  micro_batch_size: 64
  global_batch_size: 1048576  # 2^20 ≈ 1M tokens/step (based on PLM best practices)
  #   micro_batch_size: samples per GPU per forward pass (limited by GPU memory)
  #   global_batch_size: total tokens per optimizer step across all GPUs
  #   gradient_accumulation_steps is inferred automatically:
  #     grad_accum = ceil(global_batch_size / (micro_batch_size * max_seq_len * num_gpus))
  num_epochs: 10
  warmup_ratio: 0.05

  optimizer: "adamw"  # adamw, stable_adamw, muon, normuon
  # AdamW hyperparameters (also used for AdamW side [1D and embedding/unembed params] when optimizer=muon or normuon)
  adam_beta1: 0.9
  adam_beta2: 0.999
  adam_epsilon: 1e-8
  adam_learning_rate: 1e-4  # AdamW LR (Muon uses muon_learning_rate)
  max_grad_norm: .inf  # set to .inf (equivalent to float("inf")) to disable clipping
  warmup_steps: 302
  lr_decay_to_fraction: 0.1
  lr_schedule: "cosine"  # linear or cosine
  adam_weight_decay: 0.0
  # Muon/NorMuon hyperparameters (used only when optimizer: muon or normuon)
  muon_learning_rate: 1e-3
  muon_weight_decay: 0.01
  muon_cautious_weight_decay: true
  muon_use_polar_express: true
  muon_momentum: 0.95
  muon_nesterov: true
  muon_eps: 1e-7
  mlm_probability: 0.3
  mask_replace_prob: 0.8
  random_token_prob: 0.1
  keep_probability: 0.1
  logging_steps: 1
  eval_steps: 250
  save_steps: 5000
  seed: 42
  num_workers: "auto"
  prefetch_factor: 2
  # Sequence packing: concatenates shorter sequences into fewer rows to eliminate
  # padding waste and increase GPU utilization. Requires flash attention and --pure-torch/--pure-te
  use_packing: true
  # Experimental throughput optimization: with packing, enables static input sizes which enables the use of torch.compile(dynamic=False) and cudagraphs
  use_static_inp_size: true

  # Mixed precision training (recommended: keep enabled for 1.5-3x speedup)
  # When bf16 is true, automatically selects the best precision for your hardware:
  #   - CUDA Ampere+ (A100, RTX 3090+): bf16 + TF32
  #   - CUDA Volta/Turing (V100, RTX 2080): fp16 fallback
  #   - Apple Silicon (M1/M2/M3): fp16 (hardware accelerated)
  #   - CPU: fp32 (no mixed precision)
  bf16: true
  tf32: true  # TF32 mode on Ampere+ CUDA GPUs only (automatically not used on MPS/CPU)
             # Provides 3x faster fp32 matmuls with negligible precision loss
  fp8: false  # Enable FP8 Linear matmuls in pure_torch/pure_te paths (CUDA, best on H100+)

  multi_gpu: true
  world_size: 'auto'  # Use "auto" if you want to use all available GPUs
  project_name: "nanoplm-pretraining"

# Pure-torch training loop settings (alternative to HF Trainer).
pure_torch:
  enabled: false
  # torch.compile: compile the model for faster training. Disable for debugging,
  # unsupported hardware (e.g. Apple Silicon), or to avoid warmup overhead.
  use_compile: true
  # Sequence packing: concatenates shorter sequences into fewer rows to eliminate
  # padding waste and increase GPU utilization. Requires flash attention.
  use_packing: false
  # Fixed row count for static-shape compilation when use_packing is true (enables torch.compile dynamic=False).
  # Set to ceil(micro_batch_size * avg_len / max_seq_len) + margin. Leave null for dynamic=True.
  target_packed_rows: null

resume:
  # Set is_resume: true to resume training from a checkpoint
  # When resuming, the model, tokenizer, and training state will be loaded from checkpoint_dir
  # extra_epochs: adds to 'pretraining.num_epochs' to define total epochs.
  is_resume: false
  checkpoint_dir: "output/pretraining_checkpoints/run-1/checkpoint-1"
  extra_epochs: 0
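The gradient-accumulation formula quoted in the comments above can be checked directly. Using the defaults from this YAML, max_seq_len 512 from the data-preparation YAML, and a single GPU (a hypothetical hardware assumption), the masking budget also follows from the four MLM probabilities:

```python
import math

# Values from the pretraining YAML above.
global_batch_size = 1_048_576   # tokens per optimizer step (2**20)
micro_batch_size = 64           # samples per GPU per forward pass
max_seq_len = 512               # from the data-preparation YAML
num_gpus = 1                    # hypothetical single-GPU setup

# grad_accum = ceil(global_batch_size / (micro_batch_size * max_seq_len * num_gpus))
grad_accum = math.ceil(global_batch_size / (micro_batch_size * max_seq_len * num_gpus))
print(grad_accum)  # 32 micro-batches accumulated per optimizer step

# Of the 30% of tokens selected for MLM, 80% become [MASK],
# 10% become a random token, and 10% are kept unchanged (but still predicted).
mlm_probability = 0.3
mask_frac = mlm_probability * 0.8   # 24% of all tokens -> [MASK]
rand_frac = mlm_probability * 0.1   # 3%  of all tokens -> random token
keep_frac = mlm_probability * 0.1   # 3%  of all tokens -> unchanged
```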

Distill YAML

# Distillation configuration for nanoPLM
#
# IMPORTANT: Before running distillation, ensure you have prepared your data with:
#   1. Set pipeline_mode: 'distillation' in params.yaml
#   2. Set distillation_config.on_the_fly in params.yaml:
#      - false (default): Pre-compute teacher embeddings during data preparation
#      - true: Generate teacher embeddings on-the-fly during training
#   3. Run: nanoplm data from-yaml
# This will generate a .data_manifest file with the appropriate configuration.

model:
  hidden_size: 1024
  intermediate_size: 2048
  num_hidden_layers: 16
  num_attention_heads: 16
  mlp_activation: "swiglu"
  mlp_dropout: 0.0
  mlp_bias: false
  attention_bias: false
  attention_dropout: 0.0
  classifier_activation: "gelu"
  projection_layer: true  # Set to false if student hidden_size matches teacher (1024)

distillation:

  # Dataset directory (contains .data_manifest from nanoplm data from-yaml)
  # The manifest automatically provides:
  #   - max_seq_len, max_seqs_num, val_ratio
  #   - on_the_fly mode and dataset paths (FASTA or H5)
  # Note: paths are RELATIVE to where you RUN the command, NOT the YAML file.
  dataset_dir: "output/data/distillation_data"

  # Output checkpoint path
  ckp_dir: "output/distillation_checkpoints"

  # Training hyperparameters
  num_epochs: 10
  batch_size: 32
  learning_rate: 1e-3
  gradient_accumulation_steps: 1
  warmup_ratio: 0.05

  # LR scheduler
  lr_scheduler: "cosine"  # cosine, linear, polynomial, constant
  lr_scheduler_kwargs: {}

  # Data loader optimization
  max_open_files: 5
  chunk_size: 32
  prefetch_batches: 2
  use_threading: true
  num_workers: 4

  # Checkpointing
  project_name: "nanoplm-distillation"
  logging_steps: 1
  eval_steps: 250
  save_steps: 5000

  # Mixed precision training (recommended: keep enabled for 1.5-3x speedup)
  # When bf16 is true, automatically selects the best precision for your hardware:
  #   - CUDA Ampere+ (A100, RTX 3090+): bf16 + TF32
  #   - CUDA Volta/Turing (V100, RTX 2080): fp16 fallback
  #   - Apple Silicon (M1/M2/M3): fp16 (hardware accelerated)
  #   - CPU: fp32 (no mixed precision)
  bf16: true
  tf32: true  # TF32 mode on Ampere+ CUDA GPUs only (automatically not used on MPS/CPU)
             # Provides 3x faster fp32 matmuls with negligible precision loss

  # Distributed training
  multi_gpu: false
  world_size: 1
  seed: 42

resume:
  # Set is_resume: true to resume training from a checkpoint
  # When resuming, the model, tokenizer, and training state will be loaded from checkpoint_dir
  # extra_epochs: adds to 'distillation.num_epochs' to define total epochs.
  is_resume: false
  checkpoint_dir: "output/distillation/run-1/checkpoint-1"
  extra_epochs: 0

Tip: Paths are resolved relative to where you run the command (not where the YAML lives).
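Both configs above combine a warmup ratio with a cosine schedule. A generic sketch of how linear warmup followed by cosine decay typically behaves (this is an illustration, not nanoPLM's actual scheduler code; the function name and signature are hypothetical):

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 1e-3,
          warmup_ratio: float = 0.05, min_frac: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay toward min_frac * peak_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay: cosine from 1.0 down to 0.0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_frac + (1.0 - min_frac) * cosine)
```

Setting min_frac to a value like 0.1 mirrors the lr_decay_to_fraction option in the pretraining YAML, which stops the decay at a fraction of the peak rather than zero.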


Requirements

  • Python 3.12+
  • macOS or Linux
  • GPU recommended (CPU is fine for tiny tests)

Contributing

PRs welcome. If you’re unsure where to start, open an issue with your use‑case.


Like it? Star it.

If nanoPLM saved you time, a star helps others find it and keeps development going.


About

Dev fork of nanoplm, focused on throughput and architecture optimizations in pretraining.
