Fast image loading for Python. Built in Rust.
A drop-in replacement for PIL.Image.open() that decodes, resizes, crops, and normalizes images 6x+ faster using libjpeg-turbo SIMD decoding, IDCT downscaling, and hardware-accelerated resizing. Returns numpy arrays via zero-copy ownership transfer. Includes parallel batch loading, a torchvision.transforms replacement with Rust-accelerated augmentations, a Dataset/DataLoader pair with 40x faster training loops, and Rust-backed dataset filtering.
- 6.6x faster than PIL for full ML pipelines (resize + crop + normalize)
- 47x faster batch loading via a parallel rayon thread pool
- 40x faster DataLoader — drop-in `ImageFolder` replacement with Rust rayon collation instead of Python multiprocessing
- Drop-in torchvision.transforms replacement with an auto-fused Rust fast path
- Zero-copy numpy output and PyTorch tensor support (`device="cpu"`/`"cuda"`)
- Dataset filtering with perceptual-hash dedup (3-27x faster than imagehash) and CLIP aesthetic scoring
- JPEG, PNG, WebP, AVIF with RGBA/grayscale auto-converted to RGB
- EXIF orientation auto-correction for phone photos
- Bytes API (`load_bytes`, `load_batch_bytes`) for S3/HTTP workflows
```sh
pip install tensorimage
```

```python
import tensorimage as ti

# Load an image as a numpy array (H, W, 3) uint8
img = ti.load("photo.jpg")

# Load and resize — shortest edge becomes 512, aspect ratio preserved
img = ti.load("photo.jpg", size=512)

# Full ML pipeline — resize, center crop, normalize → f32 (3, 224, 224) ready for PyTorch
tensor = ti.load("photo.jpg", size=224, crop="center", normalize="imagenet")

# Batch loading — parallel via rayon, returns stacked (N, 3, 224, 224) ndarray
batch = ti.load_batch(paths, size=224, crop="center", normalize="imagenet")
```

No `Image.open()`, no `.convert("RGB")`, no manual normalize+transpose. One call, one tensor.
```python
import tensorimage as ti

# Load from in-memory bytes — same parameters as ti.load()
data = open("photo.webp", "rb").read()  # or from S3, HTTP response, etc.
img = ti.load_bytes(data, size=224, crop="center", normalize="imagenet")

# Parallel batch loading from bytes
data_list = [open(p, "rb").read() for p in paths]
batch = ti.load_batch_bytes(data_list, size=224, crop="center", normalize="imagenet")

# Quick dimension check (header-only, no decode)
w, h = ti.image_info("photo.jpg")
```

```python
import tensorimage as ti

# Load directly as a PyTorch CPU tensor (zero-copy from numpy)
tensor = ti.load("photo.jpg", size=224, crop="center", normalize="imagenet", device="cpu")

# Load to GPU
tensor = ti.load("photo.jpg", size=224, crop="center", normalize="imagenet", device="cuda")

# Batch loading to GPU
batch = ti.load_batch(paths, size=224, crop="center", normalize="imagenet", device="cuda")

# DLPack interop — framework-agnostic zero-copy (JAX, TensorFlow, etc.)
arr = ti.load("photo.jpg", size=224)
import torch
tensor = torch.from_dlpack(ti.to_dlpack(arr))
```

`device=None` (default) returns numpy arrays for full backward compatibility.
```python
import tensorimage as ti
from tensorimage import transforms

# Drop-in replacement for torchvision.datasets.ImageFolder
dataset = ti.ImageFolder(
    "data/imagenet/train",
    transform=transforms.RandomHorizontalFlip(),  # per-image augmentations
    size=224,              # Rust: resize shortest edge to 224
    crop="center",         # Rust: center crop to 224x224
    normalize="imagenet",  # Rust: fused normalize + HWC->CHW
)

# Rust rayon handles all parallelism — no Python multiprocessing needed
loader = ti.create_dataloader(dataset, batch_size=64, shuffle=True)
for images, labels in loader:
    # images: (64, 3, 224, 224) float32 tensor
    # labels: (64,) long tensor
    output = model(images.cuda())
```

Or from a flat list of paths (no directory structure required):

```python
dataset = ti.ImageDataset(paths, labels, size=224, crop="center", normalize="imagenet")
loader = ti.create_dataloader(dataset, batch_size=32)
```

Key design: `__getitem__` returns `(path, label)` instead of `(image, label)`. The custom `collate_fn` batch-loads all images in one `ti.load_batch()` call, leveraging Rust rayon parallelism. This eliminates Python multiprocessing overhead entirely — `num_workers=0` by default.
```python
# from torchvision import transforms
from tensorimage import transforms  # same API, faster
from PIL import Image

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("photo.jpg")
tensor = transform(img)  # (3, 224, 224) float32
```

Supports: Compose, Resize, CenterCrop, RandomCrop, ToTensor, Normalize, RandomHorizontalFlip, RandomVerticalFlip, ColorJitter, GaussianBlur, RandomRotation, RandomAffine, RandomPerspective, RandomErasing, Grayscale, RandomGrayscale, GaussianNoise, Pad, ElasticTransform. Resize, GaussianBlur, and the spatial transforms (rotation, affine, perspective) use the Rust SIMD backend. torch is optional: ToTensor returns torch.Tensor if available, numpy otherwise. Compose auto-detects common patterns and fuses operations in Rust for extra speed.
Resize accepts an optional max_size parameter to cap the longer edge after shortest-edge scaling: Resize(256, max_size=512).
Compose fast-path conditions: The full Rust pipeline (5.2x speedup) activates automatically when the pipeline is exactly [Resize(int), CenterCrop(int), ToTensor, Normalize] and the input to __call__ is a file path string or a PIL Image that still has a .filename attribute set. For all other inputs or pipeline shapes the standard sequential path runs (with fused ToTensor+Normalize if that pair is present).
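The shape check behind this fast path can be illustrated with a standalone sketch. The stub classes below are illustrative stand-ins, not the library's internals:

```python
class Resize:
    def __init__(self, size, max_size=None):
        self.size, self.max_size = size, max_size

class CenterCrop:
    def __init__(self, size):
        self.size = size

class ToTensor:
    pass

class Normalize:
    def __init__(self, mean, std):
        self.mean, self.std = mean, std

def is_fast_path(pipeline):
    """True only for exactly [Resize(int), CenterCrop(int), ToTensor, Normalize]."""
    if len(pipeline) != 4:
        return False
    r, c, t, n = pipeline
    return (isinstance(r, Resize) and isinstance(r.size, int)
            and isinstance(c, CenterCrop) and isinstance(c.size, int)
            and isinstance(t, ToTensor)
            and isinstance(n, Normalize))

fast = [Resize(256), CenterCrop(224), ToTensor(), Normalize([0.5] * 3, [0.5] * 3)]
slow = [Resize(256), ToTensor()]
print(is_fast_path(fast), is_fast_path(slow))  # True False
```

Any extra or reordered transform makes the check fail, so the sequential path (with fused ToTensor+Normalize where applicable) remains the safe fallback.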
```python
import tensorimage as ti

# Perceptual hashing — fast, computed in Rust
h = ti.phash("photo.jpg")                          # 64-bit dHash
h = ti.phash("photo.jpg", algorithm="phash")       # 64-bit pHash (more robust)
hashes = ti.phash_batch(paths, algorithm="dhash")  # parallel batch hashing
dist = ti.hamming_distance(h1, h2)                 # Hamming distance between hashes

# Deduplication — find and group near-duplicate images
result = ti.deduplicate(paths, algorithm="dhash", threshold=5)
# result = {"keep_indices": [0, 3, 7], "duplicate_groups": [[0, 1, 2], ...], "hashes": [...]}
unique_paths = [paths[i] for i in result["keep_indices"]]

# Full pipeline — dimension filter + dedup + optional aesthetic scoring
result = ti.filter_dataset(
    paths,
    min_width=512,           # remove undersized images
    min_height=512,
    deduplicate=True,        # remove near-duplicates (default)
    hash_algorithm="dhash",  # "dhash" (fast) or "phash" (robust)
    hash_threshold=5,        # Hamming distance threshold
    min_aesthetic=5.0,       # CLIP aesthetic score (requires torch + open_clip)
    verbose=True,            # print progress
)
clean_paths = result["paths"]
print(result["stats"])  # {"total": 1000, "dimension_removed": 50, "duplicate_removed": 120, ...}
```

Filters are applied cheapest-first: dimension check (header-only, no decode) -> perceptual hash dedup (Rust parallel) -> aesthetic scoring (CLIP, only if min_aesthetic is set). No torch dependency for hash-only workflows.
All benchmarks on Apple M4, 100 iterations.
4000x2000 JPEG -> 512px shortest edge:
| Task | tensorimage | PIL + numpy | Speedup |
|---|---|---|---|
| Resize only | 7.7 ms | 41.6 ms | 5.4x |
| Full pipeline (resize + crop + normalize) | 6.5 ms | 43.0 ms | 6.6x |
| Batch (8 images, 4 workers) | 7.3 ms | 343.8 ms | 47x |
Resize(256) -> CenterCrop(224) -> ToTensor -> Normalize, 4000x2000 JPEG:
| Pipeline | tensorimage | torchvision | Speedup |
|---|---|---|---|
| From numpy (fused ToTensor+Normalize) | 3.2 ms | 10.4 ms | 3.2x |
| End-to-end file -> tensor (fast-path) | 3.5 ms | 18.2 ms | 5.2x |
Full training loop: ImageFolder + DataLoader, Resize(224) + CenterCrop(224) + RandomHorizontalFlip + Normalize, 256 images:
| DataLoader | Epoch time | Throughput | Speedup |
|---|---|---|---|
| tensorimage (num_workers=0, rayon) | 87 ms | 2,939 img/s | 40x |
| torchvision (num_workers=0) | 3,513 ms | 73 img/s | 1x |
| torchvision (num_workers=2) | 12,865 ms | 20 img/s | 0.2x |
| torchvision (num_workers=4) | 22,300 ms | 11 img/s | 0.1x |
tensorimage's ImageFolder defers image loading to the collate_fn, which calls ti.load_batch() for Rust rayon parallel decoding. This eliminates Python multiprocessing overhead entirely. torchvision's multi-worker DataLoader actually gets slower due to process spawn, IPC serialization, and GIL contention — tensorimage avoids all of this by doing parallelism in Rust with num_workers=0.
| Task | tensorimage | imagehash | Speedup |
|---|---|---|---|
| dHash (1920x1080 JPEG) | 0.97 ms | 3.05 ms | 3.1x |
| pHash (1920x1080 JPEG) | 1.11 ms | 30.00 ms | 27x |
| dHash (4000x2000 JPEG) | 2.83 ms | 10.24 ms | 3.6x |
| pHash (4000x2000 JPEG) | 3.05 ms | 10.87 ms | 3.6x |
| dHash batch (8 images, parallel) | 4.06 ms | 53.49 ms | 13.2x |
pHash is especially fast because tensorimage uses IDCT-scaled JPEG decode (decodes directly at ~32px instead of full resolution) plus a hand-rolled DCT in Rust. Batch adds ~3-4x via rayon parallelism. Header-only dimension reads run at 0.015 ms/image.
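For intuition, a minimal dHash over a grayscale array looks like this. This is an illustrative re-implementation, not the Rust code; real inputs are first resized to 9x8 grayscale:

```python
import numpy as np

def dhash64(gray_8x9):
    """64-bit difference hash: bit (r, c) is 1 if pixel[r, c] > pixel[r, c+1]."""
    assert gray_8x9.shape == (8, 9)
    bits = gray_8x9[:, :-1] > gray_8x9[:, 1:]  # (8, 8) boolean grid = 64 bits
    h = 0
    for b in bits.ravel():
        h = (h << 1) | int(b)
    return h

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(8, 9), dtype=np.uint8)
h = dhash64(img)
print(h.bit_length() <= 64)  # True
```

Because only adjacent-pixel comparisons are needed, dHash avoids the DCT that makes pHash costlier, which matches the speed ordering in the table above.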
ti.load(path, size=None, algorithm="lanczos3", crop=None, normalize=None, device=None)

Load an image file and return a numpy array or torch.Tensor.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `path` | `str` | required | Image file path (JPEG, PNG, WebP, AVIF) |
| `size` | `int` | `None` | Target shortest edge size. Preserves aspect ratio. |
| `algorithm` | `str` | `"lanczos3"` | `"nearest"`, `"bilinear"`, `"catmullrom"`, `"mitchell"`, `"lanczos3"` |
| `crop` | `str` | `None` | `"center"` = center crop to size x size. Requires `size`. |
| `normalize` | `str` | `None` | `"imagenet"`, `"clip"`, or `"[-1,1]"`. Outputs float32 CHW. |
| `device` | `str` | `None` | `None` = numpy, `"cpu"` = zero-copy torch, `"cuda"` = GPU tensor |

Returns: an (H, W, 3) uint8 ndarray without normalize, a (3, H, W) float32 ndarray with normalize, and a torch.Tensor if device is set.
Normalize preset values:

| Preset | Mean (R, G, B) | Std (R, G, B) |
|---|---|---|
| `"imagenet"` | `[0.485, 0.456, 0.406]` | `[0.229, 0.224, 0.225]` |
| `"clip"` | `[0.48145, 0.45783, 0.40821]` | `[0.26863, 0.26130, 0.27578]` |
| `"[-1,1]"` | `[0.5, 0.5, 0.5]` | `[0.5, 0.5, 0.5]` |
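Numerically, a preset scales each channel to [0, 1], standardizes per channel, and transposes HWC to CHW. A numpy sketch of the `"imagenet"` preset (the function name here is illustrative, not the library's internal API):

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_imagenet(hwc_uint8):
    x = hwc_uint8.astype(np.float32) / 255.0            # uint8 -> [0, 1]
    x = (x - mean) / std                                # per-channel standardize
    return np.ascontiguousarray(x.transpose(2, 0, 1))   # HWC -> CHW

img = np.full((4, 4, 3), 255, dtype=np.uint8)  # pure white image
out = normalize_imagenet(img)
print(out.shape)  # (3, 4, 4)
print(round(float(out[0, 0, 0]), 3))  # (1.0 - 0.485) / 0.229 ≈ 2.249
```

The Rust pipeline performs the same arithmetic, but fused into a single pass with pre-computed scale/bias.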
ti.load_batch(paths, size=None, algorithm=None, crop=None, normalize=None, workers=None, device=None)

Load multiple images in parallel. Same parameters as load(), plus workers (default: CPU count).

Returns: with normalize + crop, a contiguous (N, 3, H, W) float32 array; otherwise, a list of individual arrays.

ti.load_bytes(data, size=None, algorithm=None, crop=None, normalize=None, device=None)

Load an image from raw bytes. Same parameters as load(), but accepts bytes instead of a file path. Useful for S3/HTTP workflows where image data is already in memory.

ti.load_batch_bytes(data_list, size=None, algorithm=None, crop=None, normalize=None, workers=None, device=None)

Load multiple images from raw bytes in parallel. Same parameters as load_batch(), but accepts list[bytes].

ti.image_info(path)

Read image dimensions without decoding (header-only, very fast). Returns a (width, height) tuple.
ti.ImageFolder(root, transform=None, size=None, crop=None, normalize=None, device="cpu")

Drop-in replacement for torchvision.datasets.ImageFolder. Walks the root/ directory and discovers images (.jpg, .jpeg, .png, .webp, .avif) in class subdirectories (sorted alphabetically).

| Parameter | Type | Default | Description |
|---|---|---|---|
| `root` | `str` | required | Root directory with class subdirectories |
| `transform` | callable | `None` | Per-image augmentation (applied after Rust batch decode) |
| `size` | `int` | `None` | Resize shortest edge (passed to ti.load_batch) |
| `crop` | `str` | `None` | `"center"` = center crop to size x size |
| `normalize` | `str` | `None` | `"imagenet"`, `"clip"`, or `"[-1,1]"` |
| `device` | `str` | `"cpu"` | Tensor device |
Attributes: classes, class_to_idx, samples (list of (path, label)), targets, imgs (alias for samples).
__getitem__ returns (path, label) — image loading is deferred to the collate function.
ti.ImageDataset(paths, labels=None, transform=None, size=None, crop=None, normalize=None, device="cpu")
Generic dataset from a list of image paths (no directory structure required). Same interface as ImageFolder but accepts explicit paths and labels.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `paths` | `list[str]` | required | Image file paths |
| `labels` | `list[int]` | `None` | Labels (defaults to all zeros) |
Convenience function that creates a torch.utils.data.DataLoader with num_workers=0 and the dataset's Rust-backed collate_fn. All parallelism is handled by Rust rayon.
ti.phash(path_or_array, algorithm="dhash")

Compute a 64-bit perceptual hash. Accepts a file path (str) or numpy array (H, W, 3) uint8.

| Algorithm | Description |
|---|---|
| `"dhash"` | Difference hash. Fast, compares adjacent pixels. |
| `"phash"` | Perceptual hash. More robust, uses DCT. |
Compute perceptual hashes for multiple images in parallel. Returns list[int].
Note: phash_batch accepts file paths only. To hash an in-memory numpy array use ti.phash(array) (single image).
Hamming distance (number of differing bits) between two 64-bit hashes. Returns int (0 = identical, 64 = maximally different).
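The distance itself is just a popcount of the XOR of the two hashes, as this equivalent pure-Python sketch shows:

```python
def hamming_distance(h1: int, h2: int) -> int:
    # Number of bit positions where the two 64-bit hashes differ.
    return bin(h1 ^ h2).count("1")

print(hamming_distance(0b1010, 0b1010))    # 0  (identical)
print(hamming_distance(0, (1 << 64) - 1))  # 64 (maximally different)
```

A small distance (e.g. <= 5 for dHash) means the two images are near-duplicates; the thresholds in deduplicate() and filter_dataset() are expressed in these units.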
Find and group near-duplicate images by perceptual hash.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `paths` | `list[str]` | required | Image file paths |
| `algorithm` | `str` | `"dhash"` | `"dhash"` or `"phash"` |
| `threshold` | `int` | 0 (dhash) / 10 (phash) | Max Hamming distance to consider as duplicate. Default is 0 for `"dhash"` (exact match only) and 10 for `"phash"`. |
| `workers` | `int` | CPU count | Worker threads |
Returns: dict with "keep_indices" (first of each group), "duplicate_groups" (groups with 2+ members), "hashes" (all hashes).
High-level dataset filtering pipeline. Applies filters cheapest-first.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `paths` | `list[str]` | required | Image file paths |
| `min_width` | `int` | `None` | Remove images narrower than this |
| `min_height` | `int` | `None` | Remove images shorter than this |
| `deduplicate` | `bool` | `True` | Remove near-duplicates |
| `hash_algorithm` | `str` | `"dhash"` | `"dhash"` or `"phash"` |
| `hash_threshold` | `int` | `None` | Hamming distance threshold for dedup |
| `min_aesthetic` | `float` | `None` | Min CLIP aesthetic score (1-10). Requires torch + open_clip_torch. |
| `workers` | `int` | CPU count | Worker threads |
| `verbose` | `bool` | `False` | Print progress |
Returns: dict with "paths" (surviving paths), "indices" (original indices), "stats" (counts per stage).
CLIP-based aesthetic scorer. Predicts a quality score (roughly 1–10) using CLIP ViT-L/14 features with a linear regression head trained on the LAION aesthetic dataset.
Requires pip install torch open_clip_torch.
On first use, the predictor weights (~3 MB) are downloaded from GitHub and cached at ~/.cache/tensorimage/aesthetic_predictor_v2.pth. Subsequent calls use the cached file.
```python
from tensorimage.aesthetic import AestheticScorer

scorer = AestheticScorer(device="cuda")  # or "cpu"
score = scorer.score("photo.jpg")  # → float, e.g. 6.2
scores = scorer.score_batch(["a.jpg", "b.jpg"], batch_size=32)  # → [6.2, 4.1]
```

| Method | Description |
|---|---|
| `score(path_or_image)` | Score a single image. Accepts a file path (str/Path) or PIL.Image. |
| `score_batch(inputs, batch_size=32)` | Score a list of paths or PIL Images. Returns list[float]. |
Read image dimensions for multiple files without decoding (header-only, very fast). Returns a list[(width, height)]. Uses the shared rayon thread pool for parallelism. Useful as a cheap pre-filter before loading.
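A typical pre-filter pass looks like the sketch below; the hard-coded `dims` list stands in for the batch header-read described above:

```python
paths = ["a.jpg", "b.jpg", "c.jpg"]
# Stand-in for the (width, height) tuples returned by the batch header read.
dims = [(640, 480), (200, 200), (1024, 768)]

# Keep only images that meet the minimum dimensions -- no decode needed.
keep = [p for p, (w, h) in zip(paths, dims) if w >= 512 and h >= 512]
print(keep)  # ['c.jpg']
```

At ~0.015 ms per header read, this pass costs almost nothing compared with decoding, which is why filter_dataset() runs it first.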
Export a numpy array or torch tensor via DLPack for framework-agnostic interop.
Requires numpy >= 1.22 for numpy array inputs (numpy's __dlpack__ protocol was added in 1.22). Older numpy versions will raise a TypeError with a version hint.
```
Image file on disk (JPEG, PNG, WebP, AVIF)
  | std::fs::read (single syscall)
  v
Raw bytes in memory
  | Format routing: JPEG → turbojpeg, WebP → libwebp, other → image crate
  | JPEG: IDCT scaling (decode at 1/4 resolution if target is small)
  | JPEG: EXIF orientation auto-correction (rotate/flip as needed)
  v
RGB pixels at reduced resolution
  | fused resize+crop: SIMD Lanczos with source-space crop (single resampling pass)
  v
RGB pixels at crop size (e.g., 224x224)
  | fused normalize + HWC->CHW transpose (single pass, pre-computed scale/bias)
  v
float32 pixels in CHW layout
  | PyArray::from_vec (zero-copy ownership transfer to numpy)
  v
numpy.ndarray (3, H, W) float32
```
The GIL is released during all Rust work. Batch loading runs images in parallel via a persistent rayon thread pool.
Key optimizations: fat LTO + target-cpu=native for cross-crate SIMD inlining, fused resize+crop in a single resampling pass, persistent thread pool via OnceLock, and contiguous batch output ([N,3,H,W] pre-allocated, each worker writes to its slice).
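The contiguous batch-output idea can be sketched with Python threads writing into a pre-allocated numpy buffer (illustrative only; the library does this in Rust with rayon, and `decode_stub` stands in for the decode pipeline):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

N, C, H, W = 8, 3, 224, 224
batch = np.empty((N, C, H, W), dtype=np.float32)  # allocated once, up front

def decode_stub(i):
    # Stand-in for decode + resize + normalize of image i.
    return np.full((C, H, W), float(i), dtype=np.float32)

def worker(i):
    batch[i] = decode_stub(i)  # each worker writes only its own slice

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(worker, range(N)))

print(batch.shape)  # (8, 3, 224, 224)
```

Writing disjoint slices of one buffer avoids a final stacking copy: the output is already contiguous (N, 3, H, W) when the last worker finishes.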
| Platform | Architecture | Status |
|---|---|---|
| Linux | x86_64 (AVX2) | Supported |
| Linux | aarch64 (NEON) | Supported |
| macOS | Apple Silicon (NEON) | Supported |
| macOS | Intel x86_64 (AVX2) | Supported |
| Windows | x86_64 (AVX2) | Supported |
Requires Python 3.8+ and numpy. torch is optional (needed for device= and to_dlpack). The wheel is self-contained — no system libjpeg, libpng, or libwebp required.
| Scenario | Behavior |
|---|---|
| Corrupt or truncated image | Raises ValueError with a descriptive message |
| Unsupported format | Raises ValueError |
| Corrupt image in `load_batch` | Raises ValueError and aborts the whole batch |
| `crop` without `size` | Raises ValueError |
| `normalize` preset not recognized | Raises ValueError listing valid options |
| `deduplicate` / `filter_dataset` I/O error | The failing file is reported in the exception; the batch is aborted |
All errors from Rust propagate as Python ValueError exceptions with human-readable messages.
Requires Rust toolchain, CMake, and NASM (for libjpeg-turbo):
```sh
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install build deps (macOS)
brew install cmake nasm

# Build and install in development mode
python -m venv .venv && source .venv/bin/activate
pip install maturin numpy
maturin develop --release
```

```
tensorimage/
├── crates/
│   ├── tensorimage-core/        # Pure Rust library
│   │   └── src/
│   │       ├── decode.rs        # JPEG (turbojpeg), WebP (libwebp), PNG/AVIF (image crate)
│   │       ├── exif.rs          # EXIF orientation parsing + pixel transforms
│   │       ├── resize.rs        # SIMD resize via fast_image_resize
│   │       ├── crop.rs          # Center crop
│   │       ├── normalize.rs     # Fused normalize + HWC->CHW transpose
│   │       ├── pipeline.rs      # Chained decode->resize->crop->normalize (file + bytes)
│   │       ├── batch.rs         # Parallel batch loading via rayon (file + bytes)
│   │       ├── pool.rs          # Shared rayon thread pool
│   │       ├── phash.rs         # Perceptual hashing (dHash, pHash)
│   │       ├── dedup.rs         # Parallel deduplication
│   │       ├── jpeg_info.rs     # Header-only dimension read
│   │       ├── augment.rs       # Gaussian blur, affine/perspective transforms
│   │       └── error.rs         # Error types
│   └── tensorimage-python/      # PyO3 bindings
│       └── src/
│           ├── lib.rs           # Python module definition
│           ├── load.rs          # Image loading bindings
│           ├── hash.rs          # Hash and dedup bindings
│           └── augment.rs       # Augmentation bindings (blur, affine, perspective)
├── python/tensorimage/          # Python package
│   ├── __init__.py              # Public API
│   ├── transforms.py            # torchvision.transforms replacement
│   ├── data.py                  # ImageFolder, ImageDataset, DataLoader integration
│   └── aesthetic.py             # CLIP aesthetic scoring (optional)
├── tests/                       # 257 tests
└── benches/                     # Benchmarks vs PIL and torchvision
```
Phases 1-5, 7-10 are complete. Upcoming:
- Phase 6: GPU decode via NVJPEG — end-to-end CUDA pipeline
- Phase 11: Streaming I/O (TAR shards, WebDataset, HTTP/URL loading) for large-scale training
- Phase 12: Video frame extraction via FFmpeg
See PLAN.md for the detailed development roadmap and phase history.
MIT