Releases · abdenlab/oxbow

07 May 16:55

nvictus

v0.8.0

b228a3e

v0.8.0

Breaking changes

Scanner constructors require an explicit `CoordSystem`

Every scanner's new() constructor now takes a trailing coord_system: CoordSystem parameter. Pass the format-native value to preserve previous behavior.

use oxbow::CoordSystem;
use oxbow::alignment::scanner::BamScanner;
use oxbow::bed::scanner::BedScanner;

// Before
let scanner = BamScanner::new(header, fields, tag_defs)?;
let scanner = BedScanner::new(bed_schema, fields)?;

// After — preserve previous (format-native) behavior
let scanner = BamScanner::new(header, fields, tag_defs, CoordSystem::OneClosed)?;
let scanner = BedScanner::new(bed_schema, fields, CoordSystem::ZeroHalfOpen)?;

Format-native values:

| Constructor | Pass |
| --- | --- |
| `SamScanner`, `BamScanner`, `CramScanner` | `CoordSystem::OneClosed` |
| `VcfScanner`, `BcfScanner` | `CoordSystem::OneClosed` |
| `GffScanner`, `GtfScanner` | `CoordSystem::OneClosed` |
| `FastaScanner` | `CoordSystem::OneClosed` |
| `BedScanner`, `BigBedScanner`, `BigWigScanner` | `CoordSystem::ZeroHalfOpen` |
| `BBIZoomScanner` | inherits from base scanner |

`scan_query` takes `oxbow::Region` instead of `noodles::core::Region`

All scanner scan_query methods now accept oxbow::Region. Convert at the call site:

use oxbow::{CoordSystem, Region};

// Before
let region: noodles::core::Region = "chr1:10000-20000".parse()?;
scanner.scan_query(reader, region, /* ... */)?;

// After — parse with an explicit coord system
let region = Region::parse("chr1:10000-20000", CoordSystem::OneClosed)?;
scanner.scan_query(reader, region, /* ... */)?;

// Or use self-describing bracket notation, no CoordSystem needed
let region: Region = "chr1:[10000,20000)".parse()?;
scanner.scan_query(reader, region, /* ... */)?;

// Or construct directly (always 0-based half-open)
let region = Region::new("chr1", Some(10000), Some(20000));

oxbow::Region normalizes to 0-based half-open internally and converts to noodles::core::Region via to_noodles() for index seeking, so range query behavior is unchanged.

What's new

Coordinate-system-aware scanners and regions

Every scanner is now coordinate-system-aware in both directions: it applies one convention when interpreting input region queries and one when encoding start positions in the output Arrow batches. Each format defaults to its native convention, so the new feature is opt-in beyond the constructor change above.

New oxbow::CoordSystem enum (OneClosed / ZeroHalfOpen) with Display/FromStr round-tripping ("11" / "01") and start_offset_from() for computing the offset between systems.
New oxbow::Region type with coordinate-system-aware parsing. Internally always 0-based half-open. Supports two parsing styles:
- Implicit UCSC notation (chr1:10,000-20,000) interpreted using a provided CoordSystem. Accepts , and _ as thousands separators.
- Explicit bracket notation (chr1:[10_000,20_000) or chr1:[10_001,20_000]) that is self-describing and overrides any provided coordinate system.
CoordSystem and Region extracted into a new oxbow::coords module and re-exported from the crate root.
Every model now carries a coord_system field (alignment, variant, gxf, bed, bbi base, bbi zoom, sequence). Offsets are applied at the field-builder level (alignment, variant, gxf, bed) or batch-builder level (bbi base, bbi zoom) during push().
Default impls added on AlignmentModel and GxfModel.

Resolves #114. See the user guide for a conceptual overview.

Full Changelog: v0.7.0...v0.8.0


07 May 17:00

nvictus

py-oxbow@v0.8.0

b228a3e

py-oxbow@v0.8.0 Latest


New features

Coordinate-system control: the `coords` argument

Every DataSource constructor and from_* factory now accepts a coords keyword argument that controls how start positions are emitted in output Arrow batches and how region query strings are interpreted. Each format defaults to its native convention, so existing code is unchanged.

import oxbow as ox

# Native: BAM positions are emitted 1-based.
ds = ox.from_bam("sample.bam")

# Coerce BAM positions to 0-based half-open to match BED tracks.
ds = ox.from_bam("sample.bam", coords="01")

# Native: BED positions are emitted 0-based.
ds = ox.from_bed("sample.bed")

# Coerce BED positions to 1-based closed to match SAM/VCF.
ds = ox.from_bed("sample.bed", coords="11")

Accepted values are the literal strings "01" (0-based start, 1-based end — half-open) and "11" (1-based start, 1-based end — closed). Only the start column is affected; end coordinates are numerically identical in both systems.
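As a rough illustration of the arithmetic only (this is not oxbow's implementation), converting between the two systems shifts the start by one and leaves the end alone. The function name `convert_interval` is hypothetical and exists only for this sketch.

```python
def convert_interval(start, end, src, dst):
    """Re-express an interval given in `src` coords ("01" or "11") in `dst` coords.

    Hypothetical helper for illustration; not part of the oxbow API.
    """
    for name in (src, dst):
        if name not in ("01", "11"):
            raise ValueError(f"unknown coordinate system: {name!r}")
    # The first character of the code is the base of the start coordinate,
    # so the difference between the two is the shift to apply to `start`.
    offset = int(dst[0]) - int(src[0])
    # The end coordinate is numerically identical in both systems.
    return start + offset, end


print(convert_interval(10001, 20000, "11", "01"))  # (10000, 20000)
print(convert_interval(10000, 20000, "01", "11"))  # (10001, 20000)
```

Both calls describe the same 10,000 bases; only the start value differs.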

Format-native defaults:

"11" — SAM, BAM, CRAM, VCF, BCF, GFF, GTF, FASTA
"01" — BED, BigBed, BigWig

Region query strings: bracket notation

Region strings passed to regions() now support an explicit, self-describing bracket notation that carries the coordinate system in the string itself:

ds.regions("chr1:[10000,20000)")    # 0-based half-open
ds.regions("chr1:[10_001,20_000]")  # 1-based closed (same interval)

Bracket notation overrides any coords setting on the data source. The familiar UCSC notation (chr1:10,000-20,000) is still accepted and is interpreted in the data source's coords setting. Both , and _ are accepted as thousands separators in UCSC notation; only _ is accepted in bracket notation since , separates start from end. Resolves #114.
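The parsing rules above can be sketched in plain Python. This is a simplified model under stated assumptions, not the library's parser: `parse_region` is a hypothetical name, only the two bracket forms shown in these notes are handled, and chromosome-only or open-ended regions are ignored. The result is normalized to 0-based half-open.

```python
import re

# Hypothetical patterns for this sketch; oxbow's real grammar may accept more.
_BRACKET = re.compile(
    r"^(?P<chrom>[^:]+):\[(?P<start>[\d_]+),(?P<end>[\d_]+)(?P<close>[\])])$"
)
_UCSC = re.compile(r"^(?P<chrom>[^:]+):(?P<start>[\d,_]+)-(?P<end>[\d,_]+)$")


def _to_int(text):
    # Drop thousands separators before converting.
    return int(text.replace(",", "").replace("_", ""))


def parse_region(text, coords="11"):
    """Return (chrom, start, end) as a 0-based half-open interval."""
    m = _BRACKET.match(text)
    if m:
        # Bracket notation is self-describing: "[a,b)" is 0-based half-open,
        # "[a,b]" is 1-based closed. The `coords` argument is ignored.
        system = "01" if m["close"] == ")" else "11"
    else:
        m = _UCSC.match(text)
        if m is None:
            raise ValueError(f"cannot parse region: {text!r}")
        system = coords
    start, end = _to_int(m["start"]), _to_int(m["end"])
    if system == "11":
        start -= 1  # a 1-based start becomes 0-based; the end is unchanged
    if start < 0 or end < start:
        raise ValueError(f"invalid interval in region: {text!r}")
    return m["chrom"], start, end


print(parse_region("chr1:[10000,20000)"))              # ('chr1', 10000, 20000)
print(parse_region("chr1:[10_001,20_000]"))            # ('chr1', 10000, 20000)
print(parse_region("chr1:10,001-20,000", coords="11"))  # ('chr1', 10000, 20000)
```

All three strings name the same interval, which is why a query can be written in whichever convention is most convenient.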

Combine systems freely

# Return 0-based output, but supply the query in 1-based closed.
ds = ox.from_bam("sample.bam", coords="01").regions("chr3:[10001,20000]")

Documentation

New user-guide page on coordinate conventions: https://oxbow.readthedocs.io/en/latest/user-guide/coordinate-systems.html

Full Changelog: https://github.com/abdenlab/oxbow/compare/py-oxbow@v0.7.0...py-oxbow@v0.8.0


18 Mar 18:00

nvictus

v0.7.0

5c3be21

v0.7.0

What's changed

Declarative data model types for all format families

Each format family now has a standalone "model" type that encapsulates all schema-defining parameters independently of any file handle or header. The model type implements the data model for how we map a file format to Arrow and enables declaration of the initial projection/schema of a table derived from a file.

AlignmentModel (SAM/BAM/CRAM) — fields + tag definitions
GxfModel (GFF/GTF) — fields + attribute definitions
SequenceModel (FASTA/FASTQ) — fields
VariantModel (VCF/BCF) — fields, info fields, genotype fields, samples, layout
BedModel (BED/BigBed/BigWig) — wraps BedSchema with field projection
BBIZoomModel — zoom summary schema

Each model produces an Arrow schema without a file, supports column projection, exposes a model() accessor on its scanner, and round-trips through Display/FromStr.

Scanners gain with_model(header, model) constructors alongside the existing new() constructors.

Selection semantics

A ternary Select<T> enum replaces Option<Vec<T>> for all types of field selection. All scanner constructors and model types now use it, instead of Some or None, to express field selection intent unambiguously:

Select::All — include all fields (from defaults or file header)
Select::Some(vec) — include only the named fields
Select::Omit — omit the column group entirely

This affects the new() constructors of all scanners across all format families (alignment, GXF, variant, sequence, BED, BBI).

Customizable BED schemas for BED and BBI files

New constructor BedSchema::from_defs() for fully custom BED schemas (field name and type definitions) from a Vec<FieldDef>, enabling programmatic schema construction for formats like narrowPeak without going through a string specifier. The autoSql-based type system (FieldDef) is now shared and harmonized across the BED and BBI models for extended BED fields, but standard BED fields are interpreted using format-native (BigBed) or spec-compliant (BED) types.

Nested samples table in `VariantModel`

VariantModel gains a samples_nested boolean parameter. When true, all sample genotype data is emitted as a single "samples" struct column rather than N top-level per-sample or per-field columns. This makes it straightforward to treat genotype data as an atomic projection unit (e.g. project(["samples"])). The default is false, preserving existing behavior. Resolves #167.

Full Changelog: v0.6.0...v0.7.0


18 Mar 18:41

nvictus

py-oxbow@v0.7.0

02652a4

py-oxbow@v0.7.0

New features

New selection semantics (None, list, "*") in #172

All DataSource constructors now accept the value "*" for all field declaration parameters (referring to all standard fields, all info/format fields in a header, all samples in a header, etc.) in addition to a list or None (which now means: "omit entirely"). Previously, None was used as the "all fields" sentinel, which was ambiguous. Parameter defaults have been updated to reflect these new semantics, keeping the same defaults, except for those listed below.
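The three-way semantics can be modeled in a few lines of Python. This is an illustrative sketch, not py-oxbow code: `resolve_selection` is a hypothetical helper, and raising `KeyError` on unknown names is an assumption of the sketch rather than documented behavior.

```python
def resolve_selection(spec, available):
    """Resolve a field declaration against the names a file or format offers.

    spec: None (omit entirely), "*" (everything), or a list of names.
    Hypothetical helper for illustration only.
    """
    if spec is None:
        return []  # the column group is omitted
    if isinstance(spec, str):
        if spec == "*":
            return list(available)  # all standard or header-defined fields
        raise ValueError('a string selection must be "*"; use a list for names')
    # Assumption of this sketch: unknown names are rejected.
    unknown = [name for name in spec if name not in available]
    if unknown:
        raise KeyError(f"unknown fields: {unknown}")
    return list(spec)


header_samples = ["NA12878", "NA12891", "NA12892"]
print(resolve_selection("*", header_samples))          # all three samples
print(resolve_selection(["NA12891"], header_samples))  # ['NA12891']
print(resolve_selection(None, header_samples))         # []
```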

Customizable BED schemas for BED and BBI files in #169

Support for fully custom BED schemas (field name and type definitions) from a tuple of (str, dict[str, str]), where the first item is a bed3-12 string specifier for the initial standard fields, and the second item is a dictionary of field names to type names for the remaining fields, parsed using an AutoSql-inspired type system with additional Rust numeric type aliases. This enables programmatic schema construction for formats like narrowPeak. The autoSql-based type system is now shared and harmonized across the BED and BBI models for extended BED fields, but standard BED fields are interpreted using format-native (BigBed) or spec-compliant (BED) types.
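For example, a narrowPeak file is BED6 plus four extra columns, so its schema could be written as the tuple below. This is a sketch under assumptions: the type names "float" and "int" are autoSql names and may not match the exact spellings py-oxbow accepts, the standard field names are taken from the UCSC BED specification and may differ from the column names oxbow emits, and `schema_columns` is a hypothetical helper. The keyword through which the tuple is passed is not stated in these notes, so no call into oxbow is shown.

```python
# Standard BED12 field names as listed in the UCSC BED specification.
UCSC_BED_FIELDS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes",
    "blockStarts",
]

# (specifier for the leading standard fields, extra field name -> type name)
narrowpeak_schema = (
    "bed6",
    {"signalValue": "float", "pValue": "float", "qValue": "float", "peak": "int"},
)


def schema_columns(schema):
    """List the column names a (specifier, extras) tuple describes, in order."""
    spec, extras = schema
    if not spec.startswith("bed"):
        raise ValueError(f"expected a bedN specifier, got {spec!r}")
    n = int(spec[3:])
    if not 3 <= n <= 12:
        raise ValueError("the number of standard BED fields must be 3 to 12")
    # Dicts preserve insertion order, so extras keep their declared order.
    return UCSC_BED_FIELDS[:n] + list(extras)


print(schema_columns(narrowpeak_schema))
```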

Nested samples table in VCF/BCF DataSources in #170

VcfFile/from_vcf and BcfFile/from_bcf gain a samples_nested boolean parameter. When true, all sample genotype data is emitted as a single "samples" struct column rather than N top-level per-sample or per-field columns. This makes it straightforward to treat genotype data as an atomic projection unit. The default is false, preserving existing behavior. Resolves #167.

API changes

Tag and attribute discovery is no longer automatic (breaking)

Previously, alignment and annotation file constructors would scan an initial number of records to discover tag/attribute definitions and include them in the schema by default. This auto-discovery has been removed. Tag and attribute definition and discovery is now opt-in.

tag_scan_rows parameter removed from SamFile/from_sam, BamFile/from_bam, CramFile/from_cram.
attribute_scan_rows parameter removed from GtfFile/from_gtf, GffFile/from_gff.
tag_defs and attribute_defs now default to None, which omits the "tags" / "attributes" column entirely.
Recommended: use the new with_tags() and with_attributes() builder methods (below) to opt in.

Sample genotype data is no longer projected by default (breaking)

from_vcf and from_bcf previously defaulted to projecting all samples defined in the header, including all sample genotype columns. The default is now samples=None, omitting genotype data entirely.
Use the new with_samples() builder method (below) to opt in. (Recommended)

New builder methods for tags, attributes and samples

`with_tags()` — opt-in tag discovery for alignment files

df = ox.from_bam("sample.bam").with_tags().pl()

Call with_tags() on any SamFile, BamFile, or CramFile to discover tag definitions by scanning an initial number of records. Pass explicit definitions to skip discovery:

ds = ox.from_bam("sample.bam").with_tags([("NM", "i"), ("MD", "Z")])

The scan_rows keyword argument controls how many records are scanned (default: 1024; pass -1 to scan the whole file).
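Conceptually, discovery collects the distinct tag definitions seen in the first scan_rows records. The sketch below models that idea on plain Python data. It is an assumption-laden illustration: `discover_tag_defs` and the record layout (a list of (tag, type code) pairs per record) are hypothetical, and the order in which oxbow reports discovered tags is not specified in these notes.

```python
from itertools import islice


def discover_tag_defs(records, scan_rows=1024):
    """Collect distinct (tag, type_code) pairs from the first `scan_rows` records.

    records: iterable where each item is a list of (tag, type_code) pairs.
    scan_rows: number of records to scan; a negative value scans everything.
    """
    limit = None if scan_rows < 0 else scan_rows
    seen = {}
    for tags in islice(records, limit):
        for name, type_code in tags:
            # Keep the first type seen for each tag name.
            seen.setdefault(name, type_code)
    return list(seen.items())


records = [
    [("NM", "i")],
    [("MD", "Z"), ("NM", "i")],
    [("AS", "i")],  # only seen if the scan reaches the third record
]
print(discover_tag_defs(records, scan_rows=2))   # [('NM', 'i'), ('MD', 'Z')]
print(discover_tag_defs(records, scan_rows=-1))  # [('NM', 'i'), ('MD', 'Z'), ('AS', 'i')]
```

The example shows the trade-off behind scan_rows: a tag that first appears after the scanned prefix is missed unless the whole file is scanned or the definition is passed explicitly.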

`with_attributes()` — opt-in attribute discovery for annotation files

df = ox.from_gff("sample.gff").with_attributes().pl()

Same pattern as with_tags(), for GtfFile and GffFile. The scan_rows keyword argument is also supported.

`with_samples()` — nested sample genotype data for variant files

Calling with_samples() on a VcfFile or BcfFile includes all sample genotype data nested under a single "samples" struct column. Accepts optional samples, genotype_fields, and group_by arguments:

df = ox.from_vcf("sample.vcf.gz").with_samples().pl()
df.unnest("samples")

ds = (
    ox.from_vcf("sample.vcf.gz")
    .with_samples(["NA12891", "NA12892"], genotype_fields=["GT", "DP"], group_by="field")
)

Full Changelog: https://github.com/abdenlab/oxbow/compare/py-oxbow@v0.6.0...py-oxbow@v0.7.0


07 Mar 02:50

nvictus

py-oxbow@v0.6.0

015367a

py-oxbow@v0.6.0

New Features

Rust backtrace in Python exceptions: When RUST_BACKTRACE=1 is set, parsing and validation errors raised from the Rust core now include a Rust backtrace in the exception message, making it easier to diagnose issues. Errors also map to more appropriate Python exception types: KeyError for missing resources, IOError for I/O failures, ValueError for everything else. (#166)
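The mapping described above can be pictured as a small lookup table. This is only a model of the documented behavior, written in Python for illustration: the variant names come from the core v0.6.0 notes, while `to_python_exception` is a hypothetical function (the actual conversion happens in the Rust bindings).

```python
# Variants with a dedicated exception type; everything else is a ValueError.
_EXCEPTION_FOR_VARIANT = {
    "NotFound": KeyError,  # missing resources
    "Io": IOError,         # I/O failures (IOError is an alias of OSError)
}


def to_python_exception(variant, message):
    """Build the Python exception that corresponds to a core error variant."""
    exc_type = _EXCEPTION_FOR_VARIANT.get(variant, ValueError)
    return exc_type(message)


for variant in ["InvalidInput", "InvalidData", "NotFound", "Io", "Arrow", "External"]:
    exc = to_python_exception(variant, "example")
    print(variant, "->", type(exc).__name__)
```

In practice this means callers can catch `KeyError` and `OSError` separately from general `ValueError` parsing failures.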

External reference support in CRAM high-level API: from_cram() now accepts reference and reference_index keyword arguments for decoding CRAM files that store bases as diffs against an external reference. Also fixes a bug where tag discovery on reference-dependent CRAM files would fail. (#161)

Core version retrieval: oxbow.__core_version__ exposes the version of the core oxbow Rust library. (#162)

Maintenance

Simplified DataSource internals: Schema-defining parameters (fields, tag_defs, attr_defs, etc.) are now passed to the Rust scanner at construction time rather than at each scan call. With this change, the Python DataSource classes are significantly simplified. Column projection is now handled entirely by the Rust scanner. User-facing API (from_*() constructors, .to_table(), .to_batches(), .to_reader(), .batches(), etc.) is unchanged. (#161)

Dependency Upgrades: PyO3 0.28, pyo3-arrow 0.17, Arrow 58, noodles 0.107. (#165)

Full Changelog: https://github.com/abdenlab/oxbow/compare/py-oxbow@v0.5.2...py-oxbow@v0.6.0


07 Mar 02:13

nvictus

v0.6.0

99c76e3

v0.6.0

New Features

Zero-column projection support — record batch builders now accept zero-column projections and preserve row counts when no columns are projected. (#160)

Multithreaded BGZF reader support — Widened trait bounds allow callers to pass bgzf::io::MultithreadedReader to scan methods. (#164)

Backtrace capture in errors — Six crate-level error variants (InvalidInput, InvalidData, NotFound, Io, Arrow, External) with backtrace captured at creation time (displayed when RUST_BACKTRACE=1). On the Python side, variants map to PyValueError, PyKeyError, or PyIOError. (#166)

API changes

Schema definition separate from projection: scanner schemas are now declared at construction time and scan methods project onto the declared schema. (#161)

Schema-defining parameters (fields, tag_defs, attr_defs, info_fields, genotype_fields, samples, etc.) move from scan() arguments to Scanner::new(). Scanners validate and cache their Arrow schema at construction. Scan methods now accept only columns (projection), batch_size, and limit.
Discovery methods like tag_defs() and attribute_defs() are now static/standalone rather than instance methods.

New traits: shared RecordBatchBuilder and Push<T> traits for record batch builders (#160)

schema() now returns the cached schema (not computed on the fly).

Widened BGZF trait bounds on scan methods (#164)

scan_query, scan_unmapped, and scan_virtual_ranges accept generic R: bgzf::io::BufRead + bgzf::io::Seek instead of concrete bgzf::io::Reader<R>

Custom crate error type: OxbowError replaces io::Error (#166)

All public scanner methods and the Push<T> trait now return crate::Result<T> (alias for Result<T, OxbowError>).
TryFrom/FromStr impls use type Error = OxbowError.

Maintenance

Module paths renamed: /format/ directories renamed to /scanner/, with batch_iterator nested underneath. All public re-exports preserved at the family level, but direct path imports will break. (#163)
Unified BED schema model: Shared BedSchema, FieldDef, and FieldType type system (37 AutoSql variants) extracted into bed/model/field_def.rs, used by both BED and BBI formats. (#161)
Export version variable for inspection in py-oxbow. (#162)
Dependency Upgrades: noodles 0.107, Arrow 58, PyO3 0.28, pyo3-arrow 0.17, Rust toolchain 1.94. (#165)

Full Changelog: v0.5.2...v0.6.0


03 Mar 16:40

nvictus

v0.5.2

cb8dd65

v0.5.2

Bug fixes

VCF: Recover INFO fields around malformed tokens.

Real-world VCFs (e.g., Ensembl variation files) can contain double semicolons (;;) in the INFO column. Previously, info.get() was called per field, which scanned from the beginning each time and aborted at the first tokenization error, silently nullifying all fields past the error. Now uses a single info.iter() pass that advances past malformed tokens, recovering fields on both sides of ;;. This also improves performance by scanning the INFO string once instead of N times per record, but comes at the cost of doing both tokenization and parsing of all fields even when projecting a subset. (#156)
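The difference between the two strategies can be shown on a plain INFO string. This is a simplified Python model, not the oxbow or noodles code: both function names are hypothetical, and values are left as strings rather than parsed against header types.

```python
def parse_info_abort(info):
    """Model of the old behavior: stop at the first malformed token."""
    fields = {}
    for token in info.split(";"):
        if not token:
            break  # everything after the error is lost
        key, sep, value = token.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_info_recover(info):
    """Model of the new behavior: one pass that steps over malformed tokens."""
    fields = {}
    for token in info.split(";"):
        if not token:
            continue  # an empty token produced by ';;' is skipped
        key, sep, value = token.partition("=")
        # Flag-style keys carry no '=' and are recorded as True.
        fields[key] = value if sep else True
    return fields


info = "DP=10;;AF=0.5;DB"
print(parse_info_abort(info))    # {'DP': '10'}
print(parse_info_recover(info))  # {'DP': '10', 'AF': '0.5', 'DB': True}
```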

New Contributors

@mwiewior made their first contribution in #156

Full Changelog: v0.5.1...v0.5.2


03 Mar 17:19

nvictus

py-oxbow@v0.5.2

e6c4889

py-oxbow@v0.5.2

Bug fixes

VCF: Recover INFO fields around malformed tokens (oxbow-rs). (#156)
Streaming-compatible Polars lazy frames: scan_pyarrow_dataset() still does not support the Polars streaming engine, so it has been replaced with register_io_source(), which yields DataFrames batch-by-batch and integrates natively with streaming execution. This enables sink_parquet() and other streaming operations on oxbow DataSource objects. (#158, fixes #157)

New Contributors

@mwiewior made their first contribution in #156

Full Changelog: https://github.com/abdenlab/oxbow/compare/py-oxbow@v0.5.1...py-oxbow@v0.5.2


10 Dec 16:58

nvictus

v0.5.1

6e5e5f9

v0.5.1

Bug fixes and improvements

Projections onto all types of BED schemas now work correctly #148
Updated noodles and arrow dependencies to latest versions #149

Full Changelog: v0.5.0...v0.5.1


10 Dec 17:10

nvictus

py-oxbow@v0.5.1

6e5e5f9

py-oxbow@v0.5.1

Bug fixes and improvements

🎉 All Rust panics during batch scanning now propagate to Python as exceptions and are no longer fatal #146
Projections onto all types of BED schemas now work correctly #148
Updated noodles, arrow, pyo3, and pyo3-arrow dependencies to latest versions #149

Full Changelog: https://github.com/abdenlab/oxbow/compare/py-oxbow@v0.5.0...py-oxbow@v0.5.1

