twobitreader is a small, fast Python package for reading UCSC .2bit genome
files. It supports random access by sequence name and genomic interval, making
it useful for pulling slices from large genome files without loading whole
chromosomes into memory.

The package reads .2bit files only; it does not write them.

Version 4 keeps decoding pure Python while reducing startup cost and speeding up
common slice paths. The main changes are lazy construction of the large
two-byte lookup table, faster N-block lookup with bisect, and decoded sequence
buffers backed by plain Python character lists instead of deprecated
array('u') buffers.
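
As a rough illustration of two of these changes, the sketch below builds a two-byte decode table only on first use and finds the N-blocks that overlap a slice with bisect. The function names (`two_byte_table`, `overlapping_n_blocks`) and the assumption that blocks are sorted and non-overlapping are made up for this example; they are not the package's actual internals.

```python
from bisect import bisect_right
from functools import lru_cache

BASES = "TCAG"  # .2bit codes: T=0b00, C=0b01, A=0b10, G=0b11


@lru_cache(maxsize=None)
def two_byte_table():
    """Build the 65,536-entry decode table on first call, then reuse it."""
    # Each 16-bit value packs 8 bases, first base in the highest 2 bits.
    return tuple(
        "".join(BASES[(value >> shift) & 0b11] for shift in range(14, -1, -2))
        for value in range(1 << 16)
    )


def overlapping_n_blocks(starts, sizes, start, end):
    """Return (block_start, block_end) for N-blocks overlapping [start, end).

    Assumption: blocks are sorted by start and do not overlap each other.
    """
    # Blocks starting at or after `end` cannot overlap the slice.
    hi = bisect_right(starts, end - 1)
    # The last block starting at or before `start` may still cover it.
    lo = max(bisect_right(starts, start) - 1, 0)
    hits = []
    for i in range(lo, hi):
        block_end = starts[i] + sizes[i]
        if block_end > start:
            hits.append((starts[i], block_end))
    return hits
```

Because the search is a binary search over block starts, a short slice inspects only a handful of blocks even when a sequence has tens of thousands of them, which is consistent with the kind of speedup reported for the N-block benchmark.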

Benchmarks below compare v4.0.0 with v3.1.8 on Python 3.14.5, using
synthetic 5 Mb .2bit files. The v3.1.9 tag has the same reader
implementation as v3.1.8, plus release/CI packaging changes.

| Benchmark | v3.1.8 | v4.0.0 | Change |
|---|---|---|---|
| Cold import time | 179.6 ms | 35.6 ms | 5.0x faster |
| Peak import memory | 14.18 MB | 2.22 MB | 6.4x less |
| Plain 1 Mb slice | 135.6 ms | 17.3 ms | 7.8x faster |
| 10 bp slice with 50k N-blocks | 0.749 ms | 0.0026 ms | 290x faster |

Install the latest released package from PyPI:

    pip install twobitreader

For local development, clone the repository and install it in editable mode:

    git clone https://github.com/benjschiller/twobitreader.git
    cd twobitreader
    pip install -e ".[dev,docs]"
    pre-commit install

Open a .2bit file with TwoBitFile. It behaves like a dictionary whose keys
are sequence names and whose values are sliceable sequence objects.

    from twobitreader import TwoBitFile

    with TwoBitFile("hg19.2bit") as genome:
        print(genome.keys())
        print(genome.sequence_sizes()["chr1"])
        sequence = genome["chr1"][100_000:100_050]
        print(sequence)

Coordinates follow Python and UCSC BED conventions: they are 0-based and
end-open. For example, genome["chr1"][10:20] returns 10 bases.
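
To make the convention concrete, here is a small sketch that converts a 1-based, inclusive position string (the style shown in the UCSC Genome Browser) into a Python slice. `region_to_slice` is a hypothetical helper, not part of twobitreader, and a plain string stands in for a sequence object.

```python
def region_to_slice(region):
    """Convert 'chr1:11-20' (1-based, inclusive) to ('chr1', slice(10, 20))."""
    chrom, _, span = region.partition(":")
    first, _, last = span.replace(",", "").partition("-")
    # Subtract 1 from the start only; the inclusive end equals the open end.
    return chrom, slice(int(first) - 1, int(last))


# A plain string stands in for a sequence object in this sketch.
toy_sequence = "ACGT" * 6
chrom, window = region_to_slice("chr1:11-20")
print(chrom, window.start, window.stop)  # chr1 10 20
print(len(toy_sequence[window]))         # 10
```

The same slice object could in principle be passed to a sequence taken from a TwoBitFile, since the sequence objects are described above as sliceable.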

Converting an entire chromosome to a string works, but can use a lot of memory:

    with TwoBitFile("hg19.2bit") as genome:
        chr_m = str(genome["chrM"])

twobitreader can also read BED-style intervals from standard input and write
FASTA records to standard output:

    python -m twobitreader genome.2bit < regions.bed > regions.fa

Input lines should have at least three whitespace-separated fields:

    chrom start end
    chr1 100000 100050
    chr2 250 300

Invalid regions are skipped with warnings written to standard error. Intervals that extend past the end of a sequence are truncated.
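
The skipping and truncation rules might look roughly like the following. `parse_bed_line` is a hypothetical helper written for this example; the exact checks the command-line tool performs, and the wording of its warnings, are assumptions.

```python
import sys


def _warn(reason, line):
    # Warnings go to standard error so they never mix with FASTA output.
    print(f"skipping {line.rstrip()!r}: {reason}", file=sys.stderr)


def parse_bed_line(line, sizes):
    """Return (chrom, start, end), or None if the region is unusable.

    `sizes` maps sequence names to lengths, as sequence_sizes() does.
    """
    fields = line.split()
    if len(fields) < 3:
        _warn("fewer than three fields", line)
        return None
    chrom, start_text, end_text = fields[:3]
    if chrom not in sizes:
        _warn("unknown sequence name", line)
        return None
    try:
        start, end = int(start_text), int(end_text)
    except ValueError:
        _warn("start and end must be integers", line)
        return None
    # Truncate intervals that run past the end of the sequence.
    end = min(end, sizes[chrom])
    if start < 0 or start >= end:
        _warn("empty or out-of-range interval", line)
        return None
    return chrom, start, end
```

With `sizes = {"chr1": 1000}`, the line `chr1 900 2000` would come back as `("chr1", 900, 1000)`, while `chrX 1 2` would be skipped with a warning.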

The twobitreader.download module can fetch .2bit genomes from UCSC:

    python -m twobitreader.download hg19

Please follow UCSC's usage guidelines and avoid excessive automated downloads.

Run the full test suite with:

    python3 -m unittest discover -s tests

Run the lightweight package smoke test with:

    python3 test_package.py

Build the package with:

    python3 -m build

Build the Sphinx documentation with:

    sphinx-build -W --keep-going -b html doc doc/_build/html

Run formatting and repository checks with:

    pre-commit run --all-files

The Makefile uses python in a few targets. If your environment only provides
python3, run the equivalent command directly with python3.

twobitreader is licensed under the Perl Artistic License 2.0. See
LICENSE.txt and COPYRIGHT for details.

No warranty is provided, express or implied.