This repo provides a byte-level compression pipeline driven by a neural predictor (BoaConstrictor) and entropy coding (range coding). It includes:
- A clean CLI to train a model, compress with it, and decompress back
- Per-experiment YAML configs and an interactive config creator
- Optional progress bars and timing for each major stage
- CPU and GPU execution, with tips for best performance
Key entrypoints:
- CLI: `main.py`
- Example config: `experiments/cms_experiment/cms_experiment.yaml`
Note
Reference implementation for GPU Portability
The `portability_solved_cpp` folder contains a C++ reference implementation of BOA using the Mamba network. It specifically addresses GPU portability issues for CUDA. Note that it includes only compression/decompression logic and does not contain training code.
- Install dependencies (PyTorch is not pinned here; use the build suited for your system):

  ```
  python3 -m pip install -r requirements.txt
  ```

- Create a config interactively and run the experiment:

  ```
  python3 main.py --new-experiment
  ```

- Or run with an existing config and show timings:

  ```
  python3 main.py --config experiment_name --show-timings
  ```

Useful flags:

- `--no-progress` to disable progress bars
- `--device cpu|cuda` to override device
- `--precision fp32|fp16|fp8` to override compute precision (training only)
- `--train-only`, `--compress-only`, `--decompress-only` to run specific stages
- `--model-path /path/to/model.pt` to load a pre-trained checkpoint and skip training (also supported via `model_path` in the YAML)
- `--verify` to verify the files after a compression-decompression cycle
- `--evaluate`, `--evaluate-only` to evaluate performance of the compression model
- `--comparison-baseline-only` to run LZMA and ZLIB on the dataset as baselines
Warning
Currently, training can only be done on a CUDA-compatible GPU!
A minimal example (configs/experiment.yaml):
```yaml
name: example_experiment
file_path: /path/to/dataset.bin
progress: true
device: cuda
precision: fp16

# Optional: set a checkpoint to skip training.
# Path can be absolute or relative to this YAML file.
# model_path: /path/to/checkpoints/example_experiment_final_model_fp16.pt

dataloader:
  seq_len: 32768
  batch_size: 3

model:
  d_model: 256
  num_layers: 8

training:
  lr: 5e-4
  epochs: 50

compression:
  chunks_count: 1000
  file_to_compress: ''

splits: [0.8, 0.1, 0.1]
```

- `file_path` should point to the raw bytes file to train/encode.
- `splits` should sum to 1.0; if not, defaults are applied.
- `chunks_count` controls how many chunks are used during compression; see the Performance notes below.
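As a hedged illustration (not the repo's actual `make_splits` implementation, whose fallback behaviour may differ), the split bookkeeping amounts to:

```python
import math

splits = [0.8, 0.1, 0.1]               # from the YAML above
if not math.isclose(sum(splits), 1.0):
    splits = [0.8, 0.1, 0.1]           # assumed fallback to the defaults

data = bytes(range(256)) * 4           # stand-in for the raw bytes at file_path
n = len(data)
train_end = int(n * splits[0])
val_end = train_end + int(n * splits[1])
train, val, test = data[:train_end], data[train_end:val_end], data[val_end:]
assert train + val + test == data      # every byte lands in exactly one split
```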
main.py wires together:
- Reading input bytes
- Building the model (`BoaConstrictor`) and `ByteDataloader`
- Splitting into train/val/test (`make_splits`)
- Training via `train(...)`
- Compression/decompression via `BoaFile.compress(...)` / `BoaFile.decompress(...)`
Timings are printed when `--show-timings` is used. Progress bars respect `progress: true` (in the config) unless `--no-progress` is passed.
- Byte modeling (neural predictor)
  - The `BoaConstrictor` model receives byte sequences and predicts a distribution over the next byte (0..255) at each position.
  - Training minimizes cross-entropy between predictions and observed bytes.
- Entropy coding (range coding)
  - For each byte to be stored, the predictor provides probabilities p(b | context).
  - A range coder converts these probabilities and symbols into a compact bitstream close to the theoretical entropy (−log₂ p).
- Container and chunks
  - Data is processed in chunks, enabling parallelism and streaming.
  - Each chunk stores (a) first bytes, (b) the compressed range-coded stream, and (c) metadata.
- Decompression mirrors compression
  - The range decoder reconstructs each symbol using the same probabilities generated by the model, conditioned on previously decoded bytes (and chunk state).
Range coding is a practical form of arithmetic coding. At a high level, it maintains an interval [low, high) within [0, 1) representing the current coder state. For each symbol with probability distribution {p_i} over the alphabet:
- Partition the current interval into sub-intervals proportional to {p_i}
- Select the sub-interval for the observed symbol
- Renormalize when the interval becomes too small, emitting bits
Conceptually, after encoding a sequence x₁…x_T, the final interval size is approximately Π_t p(x_t | context), so the total code length approaches −Σ_t log₂ p(x_t | context) bits.
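As a quick numeric check of that bound (the probabilities below are made up for illustration):

```python
import math

# Probabilities a model might assign to four observed bytes (illustrative)
probs = [0.5, 0.25, 0.125, 0.5]
code_len_bits = -sum(math.log2(p) for p in probs)  # 1 + 2 + 3 + 1 = 7.0 bits
```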
A simplified encode step with cumulative frequencies (integer-scaled):
```
state: low = 0, high = RANGE_MAX
for symbol s with cumulative counts C and total T:
    range = high - low + 1
    high  = low + (range * C[s+1] // T) - 1
    low   = low + (range * C[s]   // T)
    while renormalization_condition(low, high):
        output_bit_and_shift(low, high)
```
- `C[k]` is the cumulative count of symbols < k (C[0] = 0, C[Σ] = T)
- Renormalization shifts out stable MSBs so the internal registers don't overflow
- Decoding performs the inverse using the same `C` and `T`
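To make the interval narrowing concrete without worrying about renormalization, here is a toy encoder/decoder using exact rational arithmetic (Python's `fractions`). It is a teaching sketch, not the repo's integer range coder:

```python
from fractions import Fraction

def encode(symbols, C, T):
    """Narrow [low, low + rng) once per symbol, exactly, via rationals."""
    low, rng = Fraction(0), Fraction(1)
    for s in symbols:
        low += rng * Fraction(C[s], T)       # select the symbol's sub-interval
        rng *= Fraction(C[s + 1] - C[s], T)  # shrink to that sub-interval's width
    return low, rng  # any value in [low, low + rng) identifies the sequence

def decode(value, C, T, n):
    """Mirror the encoder: find which sub-interval the value falls in."""
    out = []
    low, rng = Fraction(0), Fraction(1)
    for _ in range(n):
        target = (value - low) / rng
        s = max(k for k in range(len(C) - 1) if Fraction(C[k], T) <= target)
        out.append(s)
        low += rng * Fraction(C[s], T)
        rng *= Fraction(C[s + 1] - C[s], T)
    return out

# Alphabet {0, 1, 2} with counts 2, 1, 1  ->  C = [0, 2, 3, 4], T = 4
C, T = [0, 2, 3, 4], 4
msg = [0, 2, 1, 0, 0]
low, rng = encode(msg, C, T)
assert decode(low, C, T, len(msg)) == msg
# Interval width = product of symbol probabilities: (1/2)^3 * (1/4)^2 = 1/128,
# i.e. an ideal code length of -log2(1/128) = 7 bits.
assert rng == Fraction(1, 128)
```

The real coder approximates these exact rationals with fixed-width integers, which is why it needs the renormalization step above.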
Why range coding here?
- It’s simple, fast, and numerically stable vs naive arithmetic coding
- Integer arithmetic avoids floating-point drift
- It compresses close to the entropy bound, assuming good probability estimates from the model
Compression/decompression performance hinges on two main costs:
- Probability computation (neural model inference)
- Range coder symbol processing
Range coder seriality vs. parallelism
- The range coder is intrinsically sequential per symbol. However, you can parallelize across independent chunks.
- Choose `compression.chunks_count` to balance parallelism and overhead. Too many tiny chunks increase metadata and launch overhead; too few large chunks underutilize parallel resources.
CPU speedups
- Vectorized preprocessing: Prefer NumPy or PyTorch tensor operations on large slices over Python loops.
- Chunk sizing: Tune `chunks_count` so each chunk fits in cache and reduces memory stalls.
- Threaded inference: If running on CPU, enable MKL/OpenMP threading for BLAS (subject to your PyTorch build). Typical knobs: `OMP_NUM_THREADS`, `MKL_NUM_THREADS`, `torch.set_num_threads`.
- I/O buffering: Read the dataset once into memory when feasible; use memory-mapped I/O for very large files.
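The memory-mapped I/O suggestion can be sketched as follows (the file here is a small stand-in created on the fly; real runs would map the dataset at `file_path`):

```python
import mmap
import os
import tempfile

# Small stand-in for the dataset; real runs would map the file at file_path.
path = os.path.join(tempfile.mkdtemp(), "dataset.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 16)  # 4 KiB of synthetic bytes

# Memory-map instead of read(): the OS pages data in lazily, so files far
# larger than RAM can still be sliced like a bytes object.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_kb = mm[:1024]  # slicing returns bytes without loading the rest
    mm.close()
```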
GPU speedups
- Batch inference: Evaluate the model on multiple sequences (or longer sequences) in one pass to utilize SMs better.
- Mixed precision: Use `precision: fp16` (or `--precision fp16`) to halve bandwidth and often speed up GEMMs/attention layers on supported GPUs (training only).
- Chunk-level parallelism: Schedule multiple chunks concurrently so the GPU is fed continuously; avoid tiny chunks that cause excessive kernel launch overhead.
- Custom GPU range coder for batch-independent compression
Compression runs in batches of chunks to keep memory usage bounded. By default, the batch size ("gpu_streams") is chosen automatically based on your configuration. For demos or reproducibility, you can force a fixed batch size via an environment variable:
```
# Example: process 10,000 chunks in two streaming batches of 5,000 each
export BOA_GPU_STREAMS=5000
python3 main.py --config your_experiment
```

With `chunks_count: 10000` (or when the input produces 10,000 chunks), this will compress in two waves of 5,000 chunks each, demonstrating the streaming pattern (write-as-you-go, with an index finalized at the end).
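The wave count in that example is simple ceiling division; a sketch of the arithmetic (the automatic fallback when `BOA_GPU_STREAMS` is unset is an assumption here, not the repo's documented default):

```python
import math
import os

os.environ["BOA_GPU_STREAMS"] = "5000"  # as in the export above
chunks_count = 10_000

# Assumption: an unset/zero variable falls back to an automatic choice;
# here we model that fallback as "all chunks in one wave".
streams = int(os.environ.get("BOA_GPU_STREAMS", "0")) or chunks_count
waves = math.ceil(chunks_count / streams)  # two waves of 5,000 chunks
```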
- The CLI saves/loads model checkpoints according to the `train(...)` implementation. Keep names consistent with your `name` field so compression uses the trained model you expect.
- For long runs, prefer deterministic flags where feasible (e.g., set random seeds), but note that GPU determinism can reduce performance and is not always guaranteed. Determinism is only guaranteed across the same software and hardware stack.
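A minimal seeding sketch for the "set random seeds" advice (the torch lines are left commented out, since the right determinism flags depend on your build and hardware):

```python
import os
import random

import numpy as np

def seed_everything(seed: int = 0) -> None:
    """Seed the common RNG sources used in a typical PyTorch pipeline."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # import torch
    # torch.manual_seed(seed)
    # torch.cuda.manual_seed_all(seed)

seed_everything(0)
a = random.random()
seed_everything(0)
b = random.random()
assert a == b  # same seed, same stream
```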
- `file_path` not found:
  - Update the YAML to point to an existing dataset file. For a smoke test, use a small file first.
- CUDA out of memory:
  - Reduce `batch_size`, decrease `seq_len`. Ensure other processes aren't using VRAM.
- Slow throughput on GPU:
  - Increase chunk-level parallelism and batch size, and avoid tiny chunks.
- Range coding (arithmetic coding): classic papers and tutorials provide in-depth renormalization details and proofs.
- Neural compression literature for modeling bytes/sequences with transformers and state-space models (e.g., Mamba).
If you use this codebase, or otherwise find our work valuable, please cite BOA Constrictor:
@misc{gupta2025boaconstrictormambabasedlossless,
title={BOA Constrictor: A Mamba-based lossless compressor for High Energy Physics data},
author={Akshat Gupta and Caterina Doglioni and Thomas Joseph Elliott},
year={2025},
eprint={2511.11337},
archivePrefix={arXiv},
primaryClass={physics.comp-ph},
url={https://arxiv.org/abs/2511.11337},
}
@software{gupta_2025_17571973,
author = {Gupta, Akshat and
Doglioni, Caterina and
Elliott, Thomas},
title = {Boa Constrictor: A Mamba-based Lossless Compressor
for High Energy Physics data
},
month = nov,
year = 2025,
publisher = {Zenodo},
version = {v1.0.0},
doi = {10.5281/zenodo.17571973},
url = {https://doi.org/10.5281/zenodo.17571973},
swhid = {swh:1:dir:7273b2950222286fe7622e7c545a5806863d1afa
;origin=https://doi.org/10.5281/zenodo.17571972;vi
sit=swh:1:snp:6b782111318d9521b182d6fab427ad97d9ea
17ad;anchor=swh:1:rel:355c1a3afc7bb7536829745e9c53
0fe831265922;path=boa-constrictor-1.0.0
},
}
This project is licensed under the GNU Affero General Public License v3.0 (AGPLv3).
See the LICENSE file for details.
The authors declare the use of generative AI in the research and writing process. According to the GAIDeT taxonomy (2025), the following tasks were delegated to GAI tools under full human supervision:
- Feasibility assessment and risk evaluation
- Preliminary hypothesis testing
- Evaluation of the novelty of the research and identification of gaps
- Code generation
- Code optimisation
- Creation of algorithms for data analysis
- Visualization
- Proofreading and editing
- Summarising text
- Adapting and adjusting emotional tone
- Reformatting
- Preparation of press releases and outreach materials
- Quality assessment
The GAI tools used were ChatGPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5. Responsibility for the final manuscript lies entirely with the authors. GAI tools are not listed as authors and do not bear responsibility for the final outcomes. Declaration submitted by: Akshat Gupta