rubam is a pure-Rust BAM/VCF analysis library with first-class Python bindings. It provides per-base depth, pileup, flag statistics, read counting, VCF/BCF read+write and indexed query — multi-threaded, with bit-exact parity against samtools and pysam, and native binaries for Linux, macOS and Windows (no WSL, no MSYS2, no htslib system install). The core AlignmentFile surface (fetch / count / count_coverage / pileup / header access) is a drop-in for pysam on BAM, validated base-for-base against pysam on real hg38 data. CRAM is experimental: AlignmentFile(path, reference_filename=...) opens any CRAM and reads its header; record decode is panic-guarded and raises a Python error on codecs noodles-cram does not yet support, so it never crashes across the FFI boundary.
Originally forked from rustbam (Choi et al.). rubam is now an independent project: pure-Rust backend (noodles), expanded API, full cross-platform CI, and a peer-reviewed validation campaign.
| Capability | pysam | samtools CLI | mosdepth | rubam |
|---|---|---|---|---|
| Native Windows wheel | ❌ | ❌ | ❌ | ✅ |
| Multi-threaded depth | ❌ (GIL) | ⚠ partial | ❌ | ✅ |
| Python API | ✅ | ❌ | ❌ | ✅ |
| CRAM support | ✅ | ✅ | ✅ | ⚠ (skeleton v0.3.1; full decode v0.4) |
| Pure-Rust (no C dep) | ❌ | n/a | ❌ | ✅ |
| `pip install` "just works" on Windows | ❌ | n/a | ❌ | ✅ |
| Tool | 1 thread | 8 threads | vs pysam @ 8t |
|---|---|---|---|
| rubam | 4.14 s | 1.51 s | 6.0× |
| samtools depth | 8.34 s | 5.79 s | 1.6× |
| pysam | 8.95 s | 9.11 s (GIL) | 1.0× |
| mosdepth | 15.32 s | 13.88 s | 0.66× |
Scaling sweep at threads {1, 2, 4, 8, 16}, 3 reps best-of (lower = better):
| Tool | 1t | 2t | 4t | 8t | 16t |
|---|---|---|---|---|---|
| rubam | 60.4 s | 35.3 s | 21.8 s | 17.1 s | 17.1 s |
| samtools depth | 89.7 s | 43.5 s | 44.1 s | 43.8 s | 45.2 s |
| pysam | 109.7 s | 110.7 s | 111.5 s | 109.4 s | 111.1 s (GIL) |
| mosdepth | 36.7 s | 36.8 s | 36.7 s | 36.3 s | 37.1 s |
rubam scales 3.5× from 1 → 8 threads, then saturates at 8t (I/O-bound). samtools scales only 1 → 2 threads. pysam and mosdepth are flat. At 8 threads, rubam beats every competitor: 6.4× pysam, 2.6× samtools, 2.1× mosdepth.
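The ratios quoted above follow directly from the sweep table. A minimal sketch that recomputes them from the published timings; the dictionary layout and the helper name `speedup` are illustrative and not part of rubam:

```python
# Wall-clock seconds copied from the scaling-sweep table (best of 3 reps).
SWEEP = {
    "rubam":          {1: 60.4, 2: 35.3, 4: 21.8, 8: 17.1, 16: 17.1},
    "samtools depth": {1: 89.7, 2: 43.5, 4: 44.1, 8: 43.8, 16: 45.2},
    "pysam":          {1: 109.7, 2: 110.7, 4: 111.5, 8: 109.4, 16: 111.1},
    "mosdepth":       {1: 36.7, 2: 36.8, 4: 36.7, 8: 36.3, 16: 37.1},
}


def speedup(slow: float, fast: float) -> float:
    """Return how many times faster `fast` is than `slow` (hypothetical helper)."""
    return slow / fast


# Self-scaling of rubam from 1 to 8 threads.
rubam_scaling = speedup(SWEEP["rubam"][1], SWEEP["rubam"][8])

# rubam versus each other tool at 8 threads.
vs_at_8t = {
    tool: speedup(times[8], SWEEP["rubam"][8])
    for tool, times in SWEEP.items()
    if tool != "rubam"
}

print(f"rubam 1t -> 8t: {rubam_scaling:.1f}x")
for tool, ratio in vs_at_8t.items():
    print(f"rubam vs {tool} @ 8t: {ratio:.1f}x")
```

Rounded to one decimal, the computed values reproduce the 3.5×, 6.4×, 2.6× and 2.1× figures in the paragraph above.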
| Tool | 1 thread | 8 threads | vs pysam @ 8t |
|---|---|---|---|
| rubam | 5.3 s | 1.9 s | 5.6× |
| samtools depth | 9.7 s | 5.8 s | 1.8× |
| pysam | 10.6 s | 10.5 s | 1.0× |
| mosdepth | 19.0 s | 19.0 s | 0.55× |
→ rubam handles long-read CIGAR (rich D/I/=) without slowdown.
| Tool | 1 thread | 8 threads | vs pysam @ 8t |
|---|---|---|---|
| rubam | 4.6 s | 2.3 s | 4.8× |
| samtools depth | 8.3 s | 5.8 s | 1.9× |
| pysam | 11.0 s | 10.8 s | 1.0× |
→ rubam correctly skips reference-skip ops (N) without crashing; throughput is unchanged vs unspliced data. mosdepth not run on spliced data.
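To make the reference-skip behaviour concrete, here is a minimal pure-Python sketch of how per-base depth can be accumulated from CIGAR operations. It is an illustration, not rubam's implementation: the function name `depth_from_cigars` and the `(start, cigar)` input shape are assumptions, and it follows the default convention of `samtools depth`, where deletions (`D`) and reference skips (`N`) advance along the reference without adding depth.

```python
from collections import Counter

# Ops that consume the reference AND carry a read base (counted as depth).
ALIGNED = {"M", "=", "X"}
# Ops that consume the reference but carry no read base (skipped over).
GAPS = {"D", "N"}
# I, S, H and P do not consume the reference, so they are ignored here.


def depth_from_cigars(reads):
    """Accumulate depth per 0-based reference position.

    `reads` is an iterable of (start, cigar) pairs, where `cigar` is a list
    of (op, length) tuples. Hypothetical input shape, for illustration only.
    """
    depth = Counter()
    for start, cigar in reads:
        ref_pos = start
        for op, length in cigar:
            if op in ALIGNED:
                for p in range(ref_pos, ref_pos + length):
                    depth[p] += 1
                ref_pos += length
            elif op in GAPS:
                ref_pos += length  # advance without adding depth
    return depth


# A spliced read: 3 aligned bases, a 4-base intron (N), 2 aligned bases.
spliced = (100, [("M", 3), ("N", 4), ("M", 2)])
# An unspliced read that overlaps the start of the intron.
plain = (102, [("M", 4)])

depth = depth_from_cigars([spliced, plain])
```

In this toy input, positions inside the intron receive depth only from the unspliced read, and the spliced read resumes contributing at position 107.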
All numbers are best-of-3 wall-clock on the datasets named in each table heading.
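For readers who want to reproduce a best-of-N measurement, a minimal timing harness could look like the following. The helper name `best_of` and the placeholder workload are hypothetical; the harness actually used for the tables is not shown in this README.

```python
import time


def best_of(fn, reps=3):
    """Run `fn` `reps` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - t0)
    return min(timings)


def workload():
    # Placeholder standing in for a real depth call on a BAM file.
    return sum(range(100_000))


elapsed = best_of(workload, reps=3)
print(f"best of 3: {elapsed:.4f} s")
```

Taking the minimum rather than the mean reduces the influence of cold caches and background load on the reported number.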
rubam is also a publishable Cargo crate. Add it to your Cargo.toml:

    [dependencies]
    rubam = "0.3.12"

…and use the pure-Rust types directly (no Python, no pyo3):

    use rubam::api::{AlignmentFile, Aux};

    // Count reads mapped to the reverse strand with a single linear scan.
    fn count_reverse_reads(bam_path: &str) -> rubam::api::Result<usize> {
        let mut bam = AlignmentFile::open(bam_path)?;
        let mut n = 0;
        for r in bam.records() {
            if r?.is_reverse() {
                n += 1;
            }
        }
        Ok(n)
    }

    // Collect the SA (supplementary alignment) tag of every read that has one.
    fn extract_split_reads(bam_path: &str) -> rubam::api::Result<Vec<String>> {
        let mut bam = AlignmentFile::open(bam_path)?;
        let mut sa_tags = Vec::new();
        for r in bam.records() {
            let r = r?;
            if let Ok(Aux::String(s)) = r.aux(b"SA") {
                sa_tags.push(s.to_owned());
            }
        }
        Ok(sa_tags)
    }

API surface (v0.2.1, stable):
| Type | Methods |
|---|---|
| `AlignmentFile` | `open(path)`, `header()`, `records()` |
| `Header` | `target_count`, `tid2name(tid)`, `target_len(tid)`, `target_names()` |
| `AlignedSegment` | `qname`, `tid`, `pos`, `mapq`, `seq`, `qual` (raw phred), `seq_len`, 12 flag accessors, `cigar()`, `aux(tag)` |
| `Cigar` | enum with `Match`/`Ins`/`Del`/`RefSkip`/`Equal`/`Diff`/`SoftClip`/`HardClip`/`Pad`, each `(u32)` |
| `Aux<'a>` | enum with 18 variants (`Char`, `I8`/`U8`/.../`U32`, `Float`/`Double`, `String`, `HexByteArray`, 8 `Array*`) |
Drop-in replacement for rust_htslib::bam::Reader::from_path for codebases that iterate linearly. Indexed query (fetch) lands in v0.3.x. The pyo3 wrapper classes (rubam.AlignmentFile etc.) coexist with api::* and share the same noodles backend; v0.2.2 will refactor them to delegate to api::* directly.
rubam is bit-exact against samtools depth -a over 5 × 10⁶ positions across five datasets, including whole-chromosome chr1:
| Dataset | Positions | rubam vs samtools |
|---|---|---|
| Synthetic chr20 30× WGS | 1 000 000 | 0 mismatches ✅ |
| Synthetic chr20 spliced (5 % CIGAR `N`) | 1 000 000 | 0 mismatches ✅ |
| HG002 GIAB 2×250bp chr20 | 1 000 000 | 0 mismatches ✅ |
| HG002 PacBio HiFi chr20 | 1 000 000 | 0 mismatches ✅ |
| HG002 GIAB 2×250bp whole chr1 (249 Mb) | 1 000 000 | 0 mismatches ✅ |
| Total | 5 000 000 | 0 / 5 M ✅ |
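A bit-exact comparison of this kind can be sketched as follows. The snippet parses the tab-separated `chrom`, `pos`, `depth` columns that `samtools depth -a` prints for a single BAM and counts positions where another tool disagrees. The function names and the inline sample data are illustrative assumptions; in practice the second input would come from a call such as `rubam.get_depths(...)`.

```python
def parse_samtools_depth(tsv_text):
    """Parse `samtools depth` output into a {(chrom, pos): depth} dict."""
    table = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        chrom, pos, depth = line.split("\t")[:3]
        table[(chrom, int(pos))] = int(depth)
    return table


def count_mismatches(reference, chrom, positions, depths):
    """Count positions whose depth differs from the reference table.

    A position missing from the reference table also counts as a mismatch.
    """
    return sum(
        1
        for pos, depth in zip(positions, depths)
        if reference.get((chrom, pos)) != depth
    )


# Tiny made-up example; the table above compares 1 000 000 positions per dataset.
samtools_out = "chr20\t1000\t31\nchr20\t1001\t30\nchr20\t1002\t0\n"
positions = [1000, 1001, 1002]
depths = [31, 30, 0]

reference = parse_samtools_depth(samtools_out)
mismatches = count_mismatches(reference, "chr20", positions, depths)
print(f"{mismatches} mismatches over {len(positions)} positions")
```

Passing `-a` to `samtools depth` matters here because it emits zero-depth positions too, so both sides cover the same set of coordinates.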
VCF-side correctness vs pysam.VariantFile: 319 349 / 319 349 = 100.00 % on the GIAB HG002 truth chr1 (319 k records, 13 MB BGZF).
Cross-tool correctness vs system bcftools: 100 % on view, query, sort.

    pip install rubam

Pre-built wheels are published for Linux, macOS and Windows; a single abi3 wheel per OS covers CPython 3.8 → 3.13. No htslib, no compiler, no WSL required: `pip install rubam` just works on Windows.
The NumPy return path (`get_depths_numpy`) needs NumPy at runtime:

    pip install rubam[numpy]

Python:

    import rubam

    positions, depths = rubam.get_depths(
        "sample.bam", "chr1", 1_000_000, 1_001_000,
        step=1, min_mapq=20, min_bq=20,
        max_depth=8000, num_threads=12,
    )

CLI:

    rubam depth sample.bam chr1 1000000 1001000 -n 12 -Q 20 -q 20 > depth.tsv

- `get_depths(bam, chr, start, end, ...)`: per-base coverage over a 1-based, inclusive region.
- CLI `rubam depth …`.
- `count_reads(bam, chr, start, end, ...)`: `pysam.AlignmentFile.count` replacement.
- `flag_stats(bam)`: `samtools flagstat` replacement, returning a Python dict.
- `pileup_bases(bam, chr, start, end, ...)`: A/C/G/T counts per position.
- `get_depths_regions(bam, regions)`: batch BED-style regions with shared thread pool.
- `get_depths_numpy(...)`: zero-copy `np.uint64` / `np.uint32` return path (~4.5× lower peak RSS than the list path; needs `pip install rubam[numpy]`).
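
`get_depths` takes a 1-based, inclusive region, whereas pysam's `fetch` uses 0-based, half-open coordinates, so off-by-one errors are easy to introduce when porting code between the two conventions. A minimal sketch of the conversion; the helper names are hypothetical and not part of rubam:

```python
def one_based_inclusive_to_zero_based_half_open(start, end):
    """Convert [start, end] (1-based, inclusive) to [start, end) (0-based)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def zero_based_half_open_to_one_based_inclusive(start, end):
    """Convert [start, end) (0-based) to [start, end] (1-based, inclusive)."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


# The region from the usage example: chr1:1,000,000-1,001,000.
region = (1_000_000, 1_001_000)
zero_based = one_based_inclusive_to_zero_based_half_open(*region)

# Both representations describe the same number of bases.
n_inclusive = region[1] - region[0] + 1
n_half_open = zero_based[1] - zero_based[0]
```

Only the start coordinate shifts; the end value is numerically the same in both conventions because the half-open end is exclusive.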
- ⚠ CRAM full record decode (v0.4): `rubam.AlignmentFile("sample.cram", reference_filename="ref.fa")` already opens and reads the header; record decode is panic-guarded and raises a Python error on codecs `noodles-cram` does not yet support (e.g. Huffman byte-series on NYGC-style CRAMs). Tracking the upstream codec landing.
- `to_pandas()` zero-copy helper; Parquet output.
- `rubam.compat.pysam` drop-in shim (v0.5).
- `rubam.AlignmentFile` and `rubam.AlignedSegment`: drop-in pysam-style read iteration and per-read property access (flags, cigar, sequence, qualities, tags, reference helpers).
- `AlignmentFile.fetch(chr, start, end)`: indexed region iterator.
- `AlignmentFile.pileup(chr, start, end)`: buffered per-position iterator yielding `PileupColumn` objects with `(reference_pos, depth, A/C/G/T/N)`.
- `rubam.tools.{sort, index, view, merge, flagstat, idxstats, calmd, faidx}`: pure-Rust ports of the eight most-used samtools subcommands.
- `rubam-samtools` shadow CLI binary: `alias samtools='rubam samtools'` and your shell pipelines keep working, on Windows included.
- `rubam::api::{AlignmentFile, AlignedSegment, Header, Cigar, Aux, Error}`: pure-Rust public crate API. External Rust crates drop in `rubam = "0.3.12"` and import these types directly without pulling in pyo3, which makes it a drop-in for `rust_htslib::bam::Reader` in codebases that iterate linearly. The public surface is pinned by `tests/api_smoke.rs` and `tests/integration_test.rs`.
- `rubam.VariantFile` and `rubam.VariantRecord`: pysam-style VCF / BCF / Tabix support. Read, write (modes `"w"` / `"wz"` / `"wb"` for plain / BGZF / BCF), iterate, indexed `fetch(contig, start, end)`, multi-sample genotype access via `record.samples["NA12878"]["GT"]`.
- `rubam.VariantHeader`: read-only metadata covering samples, contigs (with lengths), INFO / FORMAT meta lines (id / number / type / description), FILTER ids, and file format version.
- `rubam.VariantRecord(header=, …)` constructor: build records from scratch. Plus the `set_position`, `set_quality`, `set_filter`, `add_filter`, `clear_filters`, `set_info` mutation APIs.
- `rubam.tools.bcftools.{view, norm, concat, query, index, sort, stats}`: pure-Rust ports of the seven most-used bcftools subcommands.
- `rubam-bcftools` shadow CLI: `alias bcftools='rubam bcftools'` works on Windows. Same shape as `rubam-samtools`.
- Cross-tool correctness: the `(chrom, pos, ref, alt, ids, qual, filters)` tuples extracted via `rubam.VariantFile` and `pysam.VariantFile` agree, with 0 mismatches across 100 records of a 3-sample synthetic VCF.
pysam parity on real-world hg38 BAMs, verified base-for-base against
pysam 0.24.0 (tests/test_pysam_parity_findings.py):
- Tolerant header parsing: opens real hg38 / GATK / Picard BAMs that a strict SAM-header parser rejects (`@HD` with no `VN`, multi-part versions like `VN:1.6.0`, duplicate `@PG` / `@RG` / `@SQ` IDs from re-run pipelines). Valid headers still take the strict fast path unchanged.
- `count` matches pysam defaults: `read_callback='nofilter'` by default (counts every read in the region, including secondary / supplementary / duplicate / QC-fail); `read_callback='all'` applies the `0x704` mask.
- `count_coverage` matches pysam defaults: `quality_threshold=15` (base counted iff `qual >= threshold`), no depth cap, and a `read_callback` argument.
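
The `0x704` mask is the bitwise OR of four SAM flag bits: unmapped (`0x4`), secondary (`0x100`), QC-fail (`0x200`) and duplicate (`0x400`). Supplementary (`0x800`) is not part of it. A minimal sketch of the two filtering rules described above, using plain integers rather than the rubam API (the function names are hypothetical):

```python
# SAM flag bits, as defined in the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
QC_FAIL = 0x200
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800

MASK_ALL = UNMAPPED | SECONDARY | QC_FAIL | DUPLICATE  # == 0x704


def passes_all_filter(flag):
    """Mimic read_callback='all': reject reads with any masked bit set."""
    return (flag & MASK_ALL) == 0


def base_is_counted(qual, quality_threshold=15):
    """Mimic the count_coverage rule: a base counts iff qual >= threshold."""
    return qual >= quality_threshold


# Plain, reverse-strand, secondary, duplicate, supplementary, unmapped.
flags = [0x0, 0x10, 0x100, 0x400, 0x800, 0x4]
kept = [f for f in flags if passes_all_filter(f)]
```

With this input, the plain, reverse-strand and supplementary reads survive the `'all'` filter, while the secondary, duplicate and unmapped reads are dropped.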
The compatibility layer rubam.compat.pysam (drop-in from rubam.compat import pysam)
lands in v0.5; v0.2 + v0.3 are the foundation it sits on top of.
rubam is validated against pysam, samtools depth, samtools mpileup, mosdepth, bedtools genomecov and the original rustbam on real WGS, RNA-seq, exome and PacBio HiFi datasets (HG002, NA12878, public ENA RNA-seq), with multi-threaded scaling and cross-platform parity. The numbers in the tables above are drawn from that campaign.
MIT — see LICENSE.
If you use rubam in academic work, please cite the bioRxiv preprint (link will be added once posted).