Maester is a PyTorch training stack for large language models. It includes reference implementations of Gemma 3 and Llama, distributed utilities, dataset pipelines, and job management tooling used for multi-GPU training runs (e.g. on LUMI).
```bash
git clone https://github.com/rlrs/maester.git
cd maester
uv sync
# select the appropriate PyTorch build if needed, e.g.
# uv sync --extra cuda   # CUDA 12.8 wheels
# uv sync --extra rocm   # ROCm 6.3 wheels
# uv sync --extra cuda-nightly / rocm-nightly for nightly builds
```

If you plan to pull gated Hugging Face models (e.g. Gemma tokenizers) or log to Weights & Biases, run the usual CLI logins before training:
```bash
hf auth login
wandb login
```

Repository layout:

- `maester/models/` – model definitions and shared layers
- `maester/datasets/experimental_otf.py` – on-the-fly Parquet text loader
- `maester/sft/` – conversation dataset and packing helpers for SFT
- `maester/parallelisms/` – tensor/data parallel setup + checkpointing
- `configs/` – ready-to-use experiment configs (Gemma 3 variants, etc.)
- `scripts/` – data converters, packers, checkpoint converters
- `tests/` – regression tests for datasets, masking, and models
Running jobs:

- Create a job directory – `submit.py` renders a config snapshot and SLURM script under `jobs/<job-name>/`:

  ```bash
  python submit.py --config-file configs/gemma3/4b-sft.toml
  ```

- Local / non-SLURM run – use the job directory with `torchrun`:

  ```bash
  torchrun --standalone --nproc_per_node=8 train.py jobs/<job-name>
  ```

  `train.py` reads `jobs/<job-name>/config.json`, initialises distributed state, builds the configured data loader, and logs throughput, padding efficiency, and data-loading time (a quick way to inspect the rendered config snapshot is sketched after this list).

- SLURM run – submit the generated script:

  ```bash
  sbatch jobs/<job-name>/slurm.sh
  ```

  The template lives in `templates/slurm.sh`; customise it (and `scripts/slurm/`) for your cluster. On LUMI, export `RUN_NCCL_PREFLIGHT=1` inside the script so `nccl_preflight.py` runs before training.

- Sweeps – define parameter grids in `sweep_config.py`, then:

  ```bash
  python sweep.py submit sweep_config.py
  python sweep.py status sweeps/<sweep-name>
  ```
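As referenced in the list above, you can sanity-check a rendered job by loading the config snapshot `submit.py` wrote. This is a minimal sketch only: the job name is hypothetical and the printed keys depend entirely on the TOML config you submitted.

```python
# Minimal sketch: peek at the config snapshot submit.py renders into the job
# directory. The job name below is hypothetical; the schema mirrors whatever
# TOML config you submitted.
import json
from pathlib import Path

job_dir = Path("jobs/my-gemma3-run")  # hypothetical job name
cfg = json.loads((job_dir / "config.json").read_text())
print(f"loaded {len(cfg)} top-level config entries from {job_dir / 'config.json'}")
print(json.dumps(cfg, indent=2)[:500])  # preview the start of the snapshot
```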
Setting `cfg.sft` switches the loader to `PackedSFTDataset`, which outputs `position_ids` and `document_ids` so FlexAttention respects conversation boundaries.
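To make the boundary handling concrete, here is a minimal, self-contained sketch of document-aware causal masking with FlexAttention. It assumes a `document_ids` tensor of shape `[batch, seq_len]` like the one described above; the project's actual wrapper is `maester/models/gemma/model.py::make_document_mask_wrapper` and may differ in detail.

```python
# Minimal sketch of document-aware causal masking with FlexAttention.
# `document_ids` is assumed to be a [batch, seq_len] tensor of conversation ids,
# as produced by the packed SFT loader; the real wrapper in
# maester/models/gemma/model.py may differ.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

def make_document_causal_mask(document_ids: torch.Tensor):
    def mask_mod(b, h, q_idx, kv_idx):
        same_doc = document_ids[b, q_idx] == document_ids[b, kv_idx]
        return same_doc & (q_idx >= kv_idx)  # causal, but only within a conversation
    return mask_mod

B, H, S, D = 2, 4, 256, 16
document_ids = torch.zeros(B, S, dtype=torch.long)
document_ids[:, S // 2:] = 1  # pretend each packed row holds two conversations

block_mask = create_block_mask(make_document_causal_mask(document_ids), B, None, S, S)
q = k = v = torch.randn(B, H, S, D)
out = flex_attention(q, k, v, block_mask=block_mask)  # no attention across documents
print(out.shape)  # torch.Size([2, 4, 256, 16])
```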
Typical workflow:
- Convert conversations to Parquet with `scripts/jsonl_convo_to_parquet.py` (JSONL) or `scripts/hf_convo_to_parquet.py` (Hugging Face datasets).
- Pack sequences with `scripts/pack_sft_data.py` to generate fixed-length inputs plus boundary metadata (a quick way to inspect the packed output is sketched after this list).
- Validate with `pytest tests/test_sft.py`, `pytest tests/test_packed_sft.py`, and `pytest tests/test_packed_attention.py`.
- Point `cfg.sft.packed_path` at the packed file and launch training as above.
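As referenced in the packing step above, it can be worth reading a packed file back before training. This is a hedged sketch: the path and the column names (`input_ids`, `document_ids`) are assumptions based on what the packed loader is described as producing; `scripts/pack_sft_data.py` is the authoritative source for the schema.

```python
# Hedged sketch: inspect a packed SFT Parquet file before launching training.
# The path and column names are assumptions; check scripts/pack_sft_data.py
# for the actual schema it writes.
import pyarrow.parquet as pq

table = pq.read_table("data/sft_packed.parquet")  # hypothetical output path
print(table.schema)

row = table.slice(0, 1).to_pydict()
if "input_ids" in row and "document_ids" in row:
    print(len(row["input_ids"][0]), "tokens in the first packed sequence")
    print("conversations packed into it:", len(set(row["document_ids"][0])))
```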
Training notes:

- Activation checkpointing: configure `cfg.ac_mode` (`full`, `selective`, or `none`) and `cfg.selective_ac_option`.
- Optimizer grouping: embeddings and bias/low-rank parameters skip weight decay by default (see `train.py`); adjust the optimizer section of your config to change this behaviour. A sketch of this grouping pattern is shown after this list.
- FlexAttention document masking is implemented in `maester/models/gemma/model.py::make_document_mask_wrapper` and covered by `tests/test_packed_attention.py`.
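The weight-decay exemption mentioned above follows a common PyTorch pattern. Here is a minimal sketch of that pattern, assuming parameters are grouped by dimensionality and name; the exact rules live in `train.py` and may differ.

```python
# Minimal sketch of the usual "no weight decay for embeddings/biases/norms"
# grouping pattern. The actual grouping rules live in train.py and may differ.
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed_tokens = nn.Embedding(1000, 64)
        self.proj = nn.Linear(64, 64)

def build_param_groups(model: nn.Module, weight_decay: float):
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # 1-D tensors (biases, norm scales) and embedding tables skip decay.
        if param.ndim < 2 or "embed" in name:
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

model = TinyModel()
optimizer = torch.optim.AdamW(build_param_groups(model, weight_decay=0.1), lr=3e-4)
print([len(g["params"]) for g in optimizer.param_groups])  # e.g. [1, 2]
```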
Testing and debugging:

- Run targeted tests with `pytest`, e.g. `pytest tests/test_packed_sft.py`.
- Job logs live under `jobs/<job-name>/logs/` and include padding and data-loading statistics.
- For LUMI, enable the NCCL preflight check by exporting `RUN_NCCL_PREFLIGHT=1` in your SLURM script.
Inspired by pytorch/torchtitan and IBM’s experimental dataloader work. Licensed under the terms in LICENSE.
Pull requests are welcome; include regression tests when possible.