SouKangC/tensorimage


tensorimage

Fast image loading for Python. Built in Rust.

A drop-in replacement for PIL.Image.open() that decodes, resizes, crops, and normalizes images 6x+ faster using libjpeg-turbo SIMD decoding, IDCT downscaling, and hardware-accelerated resize, returning zero-copy numpy arrays. Also includes parallel batch loading, a torchvision.transforms replacement with Rust-accelerated augmentations, Dataset/DataLoader integration with 40x faster training loops, and Rust-backed dataset filtering.

Features

  • 6.6x faster than PIL for full ML pipelines (resize + crop + normalize)
  • 47x faster batch loading via parallel rayon thread pool
  • 40x faster DataLoader — drop-in ImageFolder replacement with Rust rayon collation instead of Python multiprocessing
  • Drop-in torchvision.transforms replacement with auto-fused Rust fast-path
  • Zero-copy numpy output and PyTorch tensor support (device="cpu" / "cuda")
  • Dataset filtering with perceptual hash dedup (3-27x faster than imagehash) and CLIP aesthetic scoring
  • JPEG, PNG, WebP, AVIF with RGBA/grayscale auto-converted to RGB
  • EXIF orientation auto-correction for phone photos
  • Bytes API (load_bytes, load_batch_bytes) for S3/HTTP workflows

Installation

pip install tensorimage

Quickstart

import tensorimage as ti

# Load an image as a numpy array (H, W, 3) uint8
img = ti.load("photo.jpg")

# Load and resize — shortest edge becomes 512, aspect ratio preserved
img = ti.load("photo.jpg", size=512)

# Full ML pipeline — resize, center crop, normalize → f32 (3, 224, 224) ready for PyTorch
tensor = ti.load("photo.jpg", size=224, crop="center", normalize="imagenet")

# Batch loading — parallel via rayon, returns stacked (N, 3, 224, 224) ndarray
batch = ti.load_batch(paths, size=224, crop="center", normalize="imagenet")

No Image.open(), no .convert("RGB"), no manual normalize+transpose. One call, one tensor.

Loading from bytes (S3, HTTP, etc.)

import tensorimage as ti

# Load from in-memory bytes — same parameters as ti.load()
data = open("photo.webp", "rb").read()  # or from S3, HTTP response, etc.
img = ti.load_bytes(data, size=224, crop="center", normalize="imagenet")

# Parallel batch loading from bytes
data_list = [open(p, "rb").read() for p in paths]
batch = ti.load_batch_bytes(data_list, size=224, crop="center", normalize="imagenet")

# Quick dimension check (header-only, no decode)
w, h = ti.image_info("photo.jpg")

PyTorch integration

import tensorimage as ti

# Load directly as a PyTorch CPU tensor (zero-copy from numpy)
tensor = ti.load("photo.jpg", size=224, crop="center", normalize="imagenet", device="cpu")

# Load to GPU
tensor = ti.load("photo.jpg", size=224, crop="center", normalize="imagenet", device="cuda")

# Batch loading to GPU
batch = ti.load_batch(paths, size=224, crop="center", normalize="imagenet", device="cuda")

# DLPack interop — framework-agnostic zero-copy (JAX, TensorFlow, etc.)
arr = ti.load("photo.jpg", size=224)
import torch
tensor = torch.from_dlpack(ti.to_dlpack(arr))

device=None (default) returns numpy arrays for full backward compatibility.

Drop-in ImageFolder & DataLoader (40x faster)

import tensorimage as ti
from tensorimage import transforms

# Drop-in replacement for torchvision.datasets.ImageFolder
dataset = ti.ImageFolder(
    "data/imagenet/train",
    transform=transforms.RandomHorizontalFlip(),  # per-image augmentations
    size=224,                   # Rust: resize shortest edge to 224
    crop="center",              # Rust: center crop to 224x224
    normalize="imagenet",       # Rust: fused normalize + HWC->CHW
)

# Rust rayon handles all parallelism — no Python multiprocessing needed
loader = ti.create_dataloader(dataset, batch_size=64, shuffle=True)

for images, labels in loader:
    # images: (64, 3, 224, 224) float32 tensor
    # labels: (64,) long tensor
    output = model(images.cuda())

Or from a flat list of paths (no directory structure required):

dataset = ti.ImageDataset(paths, labels, size=224, crop="center", normalize="imagenet")
loader = ti.create_dataloader(dataset, batch_size=32)

Key design: __getitem__ returns (path, label) instead of (image, label). The custom collate_fn batch-loads all images in one ti.load_batch() call, leveraging Rust rayon parallelism. This eliminates Python multiprocessing overhead entirely — num_workers=0 by default.
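The deferred-loading pattern can be sketched in plain Python. This is an illustrative sketch only, not tensorimage's actual implementation; fake_batch_load is a hypothetical stand-in for the single ti.load_batch() call:

```python
# Minimal sketch of the deferred-loading collate pattern (illustrative only;
# fake_batch_load is a stand-in for ti.load_batch and returns dummy arrays).
import numpy as np

class PathDataset:
    """__getitem__ returns (path, label); no image is decoded here."""
    def __init__(self, samples):
        self.samples = samples          # list of (path, label)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        return self.samples[i]          # cheap: just a tuple

def fake_batch_load(paths):
    # Stand-in for one parallel batch decode over the whole batch.
    return np.zeros((len(paths), 3, 224, 224), dtype=np.float32)

def collate(items):
    # All the heavy work happens here, once per batch, not once per item.
    paths, labels = zip(*items)
    images = fake_batch_load(list(paths))
    return images, np.asarray(labels)

ds = PathDataset([("a.jpg", 0), ("b.jpg", 1)])
images, labels = collate([ds[0], ds[1]])
print(images.shape, labels.tolist())    # (2, 3, 224, 224) [0, 1]
```

Because __getitem__ does no I/O, there is nothing for worker processes to speed up, which is why num_workers=0 is the right default here.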

Drop-in torchvision.transforms replacement

# from torchvision import transforms
from tensorimage import transforms  # same API, faster

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
from PIL import Image

img = Image.open("photo.jpg")
tensor = transform(img)  # (3, 224, 224) float32

Supports: Compose, Resize, CenterCrop, RandomCrop, ToTensor, Normalize, RandomHorizontalFlip, RandomVerticalFlip, ColorJitter, GaussianBlur, RandomRotation, RandomAffine, RandomPerspective, RandomErasing, Grayscale, RandomGrayscale, GaussianNoise, Pad, ElasticTransform. Resize, GaussianBlur, and the spatial transforms (rotation, affine, perspective) use the Rust SIMD backend. torch is optional: ToTensor returns a torch.Tensor if torch is available and a numpy array otherwise. Compose auto-detects common patterns and fuses operations in Rust for extra speed.

Resize accepts an optional max_size parameter to cap the longer edge after shortest-edge scaling: Resize(256, max_size=512).

Compose fast-path conditions: The full Rust pipeline (5.2x speedup) activates automatically when the pipeline is exactly [Resize(int), CenterCrop(int), ToTensor, Normalize] and the input to __call__ is a file path string or a PIL Image that still has a .filename attribute set. For all other inputs or pipeline shapes the standard sequential path runs (with fused ToTensor+Normalize if that pair is present).
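The eligibility rule above can be sketched as a plain predicate. This is an illustration of the stated conditions, not tensorimage's internal detection code; the transform classes here are simplified stand-ins:

```python
# Sketch of the fast-path eligibility check (illustrative; the class
# definitions are simplified stand-ins, not tensorimage internals).
class Resize:
    def __init__(self, size): self.size = size

class CenterCrop:
    def __init__(self, size): self.size = size

class ToTensor:
    pass

class Normalize:
    def __init__(self, mean, std): self.mean, self.std = mean, std

def is_fast_path(pipeline, inp):
    # Pipeline must be exactly [Resize(int), CenterCrop(int), ToTensor, Normalize]
    shape_ok = (
        len(pipeline) == 4
        and isinstance(pipeline[0], Resize) and isinstance(pipeline[0].size, int)
        and isinstance(pipeline[1], CenterCrop) and isinstance(pipeline[1].size, int)
        and isinstance(pipeline[2], ToTensor)
        and isinstance(pipeline[3], Normalize)
    )
    # Input must be a file path string, or a PIL-like image whose
    # .filename attribute is still set (i.e. the file is re-readable).
    input_ok = isinstance(inp, str) or bool(getattr(inp, "filename", ""))
    return shape_ok and input_ok

pipe = [Resize(256), CenterCrop(224), ToTensor(), Normalize([0.5] * 3, [0.5] * 3)]
print(is_fast_path(pipe, "photo.jpg"))      # True
print(is_fast_path(pipe[:3], "photo.jpg"))  # False: Normalize missing
```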

Dataset filtering

import tensorimage as ti

# Perceptual hashing — fast, computed in Rust
h = ti.phash("photo.jpg")                              # 64-bit dHash
h = ti.phash("photo.jpg", algorithm="phash")           # 64-bit pHash (more robust)
hashes = ti.phash_batch(paths, algorithm="dhash")       # parallel batch hashing
dist = ti.hamming_distance(h1, h2)                     # Hamming distance between hashes

# Deduplication — find and group near-duplicate images
result = ti.deduplicate(paths, algorithm="dhash", threshold=5)
# result = {"keep_indices": [0, 3, 7], "duplicate_groups": [[0, 1, 2], ...], "hashes": [...]}
unique_paths = [paths[i] for i in result["keep_indices"]]

# Full pipeline — dimension filter + dedup + optional aesthetic scoring
result = ti.filter_dataset(
    paths,
    min_width=512,              # remove undersized images
    min_height=512,
    deduplicate=True,           # remove near-duplicates (default)
    hash_algorithm="dhash",     # "dhash" (fast) or "phash" (robust)
    hash_threshold=5,           # Hamming distance threshold
    min_aesthetic=5.0,          # CLIP aesthetic score (requires torch + open_clip)
    verbose=True,               # print progress
)
clean_paths = result["paths"]
print(result["stats"])  # {"total": 1000, "dimension_removed": 50, "duplicate_removed": 120, ...}

Filters are applied cheapest-first: dimension check (header-only, no decode) -> perceptual hash dedup (Rust parallel) -> aesthetic scoring (CLIP, only if min_aesthetic is set). No torch dependency for hash-only workflows.
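The cheapest-first ordering amounts to letting each stage see only the survivors of the previous one, so expensive stages run on as few images as possible. A generic sketch, with hypothetical stand-in predicates in place of the real header/hash/CLIP stages:

```python
def run_stages(items, stages):
    """Apply filter stages cheapest-first; each stage sees only survivors."""
    stats = {"total": len(items)}
    for name, keep_fn in stages:            # ordered cheap -> expensive
        survivors = [x for x in items if keep_fn(x)]
        stats[f"{name}_removed"] = len(items) - len(survivors)
        items = survivors
    return items, stats

# Hypothetical stand-in predicates (the real stages are the header-only
# dimension check, Rust hash dedup, and optional CLIP scoring).
sizes = {"a.jpg": (600, 600), "b.jpg": (100, 100), "c.jpg": (600, 600)}
stages = [
    ("dimension", lambda p: min(sizes[p]) >= 512),   # header-only, cheapest
    ("duplicate", lambda p: p != "c.jpg"),           # pretend c duplicates a
]
paths, stats = run_stages(list(sizes), stages)
print(paths)   # ['a.jpg']
print(stats)   # {'total': 3, 'dimension_removed': 1, 'duplicate_removed': 1}
```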

Benchmarks

All benchmarks on Apple M4, 100 iterations.

Image loading (vs PIL + numpy)

4000x2000 JPEG -> 512px shortest edge:

Task                                        tensorimage   PIL + numpy   Speedup
Resize only                                 7.7 ms        41.6 ms       5.4x
Full pipeline (resize + crop + normalize)   6.5 ms        43.0 ms       6.6x
Batch (8 images, 4 workers)                 7.3 ms        343.8 ms      47x

Transforms (vs torchvision.transforms)

Resize(256) -> CenterCrop(224) -> ToTensor -> Normalize, 4000x2000 JPEG:

Pipeline                                tensorimage   torchvision   Speedup
From numpy (fused ToTensor+Normalize)   3.2 ms        10.4 ms       3.2x
End-to-end file -> tensor (fast-path)   3.5 ms        18.2 ms       5.2x

DataLoader (vs torchvision.datasets.ImageFolder)

Full training loop: ImageFolder + DataLoader, Resize(224) + CenterCrop(224) + RandomHorizontalFlip + Normalize, 256 images:

DataLoader                           Epoch time   Throughput    Speedup
tensorimage (num_workers=0, rayon)   87 ms        2,939 img/s   40x
torchvision (num_workers=0)          3,513 ms     73 img/s      1x
torchvision (num_workers=2)          12,865 ms    20 img/s      0.2x
torchvision (num_workers=4)          22,300 ms    11 img/s      0.1x

tensorimage's ImageFolder defers image loading to the collate_fn, which calls ti.load_batch() for Rust rayon parallel decoding. This eliminates Python multiprocessing overhead entirely. torchvision's multi-worker DataLoader actually gets slower due to process spawn, IPC serialization, and GIL contention — tensorimage avoids all of this by doing parallelism in Rust with num_workers=0.

Perceptual hashing (vs imagehash)

Task                               tensorimage   imagehash   Speedup
dHash (1920x1080 JPEG)             0.97 ms       3.05 ms     3.1x
pHash (1920x1080 JPEG)             1.11 ms       30.00 ms    27x
dHash (4000x2000 JPEG)             2.83 ms       10.24 ms    3.6x
pHash (4000x2000 JPEG)             3.05 ms       10.87 ms    3.6x
dHash batch (8 images, parallel)   4.06 ms       53.49 ms    13.2x

pHash is especially fast because tensorimage uses IDCT-scaled JPEG decode (decodes directly at ~32px instead of full resolution) plus a hand-rolled DCT in Rust. Batch adds ~3-4x via rayon parallelism. Header-only dimension reads run at 0.015 ms/image.
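For reference, the dHash algorithm itself is simple. Below is a numpy sketch of the classic scheme (downscale to a 9x8 grid, compare horizontal neighbors), not tensorimage's Rust implementation; the crude nearest-neighbor downscale is an assumption of this sketch, real implementations use proper resampling:

```python
import numpy as np

def dhash64(gray):
    """Classic 64-bit dHash: sample to 8 rows x 9 cols, compare neighbors.

    gray: 2D uint8/float array. Nearest-neighbor sampling stands in for a
    proper resize here, so hashes won't match other implementations exactly.
    """
    h, w = gray.shape
    rows = (np.arange(8) * h) // 8
    cols = (np.arange(9) * w) // 9
    small = gray[np.ix_(rows, cols)].astype(np.int32)
    # 8 comparisons per row x 8 rows = 64 bits
    bits = (small[:, 1:] > small[:, :-1]).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

# A left-to-right brightness gradient makes every comparison True
img = np.tile(np.arange(256, dtype=np.uint8), (128, 1))
print(hex(dhash64(img)))   # 0xffffffffffffffff
```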

API Reference

ti.load(path, size=None, algorithm=None, crop=None, normalize=None, device=None)

Load an image file and return a numpy array or torch.Tensor.

Parameter   Type   Default      Description
path        str    required     Image file path (JPEG, PNG, WebP, AVIF)
size        int    None         Target shortest-edge size; preserves aspect ratio
algorithm   str    "lanczos3"   "nearest", "bilinear", "catmullrom", "mitchell", "lanczos3"
crop        str    None         "center" = center crop to size x size; requires size
normalize   str    None         "imagenet", "clip", or "[-1,1]"; outputs float32 CHW
device      str    None         None = numpy, "cpu" = zero-copy torch, "cuda" = GPU tensor

Returns: an (H, W, 3) uint8 ndarray without normalize, a (3, H, W) float32 ndarray with normalize, or a torch.Tensor if device is set.

Normalize preset values:

Preset       Mean (R, G, B)                Std (R, G, B)
"imagenet"   [0.485, 0.456, 0.406]         [0.229, 0.224, 0.225]
"clip"       [0.48145, 0.45783, 0.40821]   [0.26863, 0.26130, 0.27578]
"[-1,1]"     [0.5, 0.5, 0.5]               [0.5, 0.5, 0.5]
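Each preset plugs into the standard formula out = (pixel / 255 - mean) / std, applied per channel, followed by the HWC -> CHW transpose. A numpy sketch of that math (the real work happens fused in Rust):

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD  = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_chw(img_u8, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """(H, W, 3) uint8 -> (3, H, W) float32: scale, normalize, transpose."""
    x = img_u8.astype(np.float32) / 255.0
    x = (x - mean) / std                          # broadcasts over channels
    return np.ascontiguousarray(x.transpose(2, 0, 1))

img = np.full((4, 4, 3), 255, dtype=np.uint8)     # pure white
out = normalize_chw(img)
print(out.shape)                                  # (3, 4, 4)
print(round(float(out[0, 0, 0]), 3))              # (1 - 0.485) / 0.229 = 2.249
```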

ti.load_batch(paths, size=None, algorithm=None, crop=None, normalize=None, workers=None, device=None)

Load multiple images in parallel. Same parameters as load(), plus workers (default: CPU count).

Returns: a contiguous (N, 3, H, W) float32 array when both normalize and crop are set; otherwise a list of individual arrays.

ti.load_bytes(data, size=None, algorithm=None, crop=None, normalize=None, device=None)

Load an image from raw bytes. Same parameters as load(), but accepts bytes instead of a file path. Useful for S3/HTTP workflows where image data is already in memory.

ti.load_batch_bytes(data_list, size=None, algorithm=None, crop=None, normalize=None, workers=None, device=None)

Load multiple images from raw bytes in parallel. Same parameters as load_batch(), but accepts list[bytes].

ti.image_info(path)

Read image dimensions without decoding (header-only, very fast). Returns (width, height) tuple.

ti.ImageFolder(root, transform=None, size=None, crop=None, normalize=None, device="cpu")

Drop-in replacement for torchvision.datasets.ImageFolder. Walks root/ directory, discovers images (.jpg, .jpeg, .png, .webp, .avif) in class subdirectories (sorted alphabetically).

Parameter   Type       Default    Description
root        str        required   Root directory with class subdirectories
transform   callable   None       Per-image augmentation (applied after Rust batch decode)
size        int        None       Resize shortest edge (passed to ti.load_batch)
crop        str        None       "center" = center crop to size x size
normalize   str        None       "imagenet", "clip", or "[-1,1]"
device      str        "cpu"      Tensor device

Attributes: classes, class_to_idx, samples (list of (path, label)), targets, imgs (alias for samples).

__getitem__ returns (path, label) — image loading is deferred to the collate function.

ti.ImageDataset(paths, labels=None, transform=None, size=None, crop=None, normalize=None, device="cpu")

Generic dataset from a list of image paths (no directory structure required). Same interface as ImageFolder but accepts explicit paths and labels.

Parameter   Type        Default    Description
paths       list[str]   required   Image file paths
labels      list[int]   None       Labels (defaults to all zeros)

ti.create_dataloader(dataset, batch_size=32, shuffle=True, drop_last=False, **kwargs)

Convenience function that creates a torch.utils.data.DataLoader with num_workers=0 and the dataset's Rust-backed collate_fn. All parallelism is handled by Rust rayon.

ti.phash(path_or_array, algorithm="dhash")

Compute a 64-bit perceptual hash. Accepts a file path (str) or numpy array (H, W, 3) uint8.

Algorithm   Description
"dhash"     Difference hash. Fast; compares adjacent pixels.
"phash"     Perceptual hash. More robust; uses DCT.

ti.phash_batch(paths, algorithm="dhash", workers=None)

Compute perceptual hashes for multiple images in parallel. Returns list[int].

Note: phash_batch accepts file paths only. To hash an in-memory numpy array use ti.phash(array) (single image).

ti.hamming_distance(a, b)

Hamming distance (number of differing bits) between two 64-bit hashes. Returns int (0 = identical, 64 = maximally different).
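The same computation in pure Python is a popcount of the XOR, shown here for reference:

```python
def hamming64(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit hashes."""
    return bin(a ^ b).count("1")

print(hamming64(0b1010, 0b0110))    # 2
print(hamming64(0, (1 << 64) - 1))  # 64 (maximally different)
print(hamming64(7, 7))              # 0 (identical)
```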

ti.deduplicate(paths, algorithm="dhash", threshold=None, workers=None)

Find and group near-duplicate images by perceptual hash.

Parameter   Type        Default                  Description
paths       list[str]   required                 Image file paths
algorithm   str         "dhash"                  "dhash" or "phash"
threshold   int         0 (dhash) / 10 (phash)   Max Hamming distance to count as a duplicate; the "dhash" default of 0 means exact match only
workers     int         CPU count                Worker threads

Returns: dict with "keep_indices" (first of each group), "duplicate_groups" (groups with 2+ members), "hashes" (all hashes).
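The grouping step can be sketched greedily: each hash joins the first existing group whose representative is within the threshold. This is an illustrative sketch of the semantics above, not tensorimage's Rust implementation, which may group differently:

```python
def group_duplicates(hashes, threshold):
    """Greedy grouping: compare each hash to each group's first member."""
    def dist(a, b):
        return bin(a ^ b).count("1")   # Hamming distance

    groups = []                        # each group is a list of indices
    for i, h in enumerate(hashes):
        for g in groups:
            if dist(hashes[g[0]], h) <= threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return {
        "keep_indices": [g[0] for g in groups],          # first of each group
        "duplicate_groups": [g for g in groups if len(g) > 1],
    }

result = group_duplicates([0b0000, 0b0001, 0b1111, 0b0011], threshold=1)
print(result)   # {'keep_indices': [0, 2, 3], 'duplicate_groups': [[0, 1]]}
```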

ti.filter_dataset(paths, ...)

High-level dataset filtering pipeline. Applies filters cheapest-first.

Parameter        Type        Default     Description
paths            list[str]   required    Image file paths
min_width        int         None        Remove images narrower than this
min_height       int         None        Remove images shorter than this
deduplicate      bool        True        Remove near-duplicates
hash_algorithm   str         "dhash"     "dhash" or "phash"
hash_threshold   int         None        Hamming distance threshold for dedup
min_aesthetic    float       None        Min CLIP aesthetic score (1-10); requires torch + open_clip_torch
workers          int         CPU count   Worker threads
verbose          bool        False       Print progress

Returns: dict with "paths" (surviving paths), "indices" (original indices), "stats" (counts per stage).

ti.aesthetic.AestheticScorer(model_name="ViT-L-14", pretrained="openai", device="cpu")

CLIP-based aesthetic scorer. Predicts a quality score (roughly 1–10) using CLIP ViT-L/14 features with a linear regression head trained on the LAION aesthetic dataset.

Requires pip install torch open_clip_torch.

On first use, the predictor weights (~3 MB) are downloaded from GitHub and cached at ~/.cache/tensorimage/aesthetic_predictor_v2.pth. Subsequent calls use the cached file.

from tensorimage.aesthetic import AestheticScorer

scorer = AestheticScorer(device="cuda")   # or "cpu"
score = scorer.score("photo.jpg")         # → float, e.g. 6.2
scores = scorer.score_batch(["a.jpg", "b.jpg"], batch_size=32)  # → [6.2, 4.1]

Method                               Description
score(path_or_image)                 Score a single image. Accepts a file path (str/Path) or PIL.Image.
score_batch(inputs, batch_size=32)   Score a list of paths or PIL Images. Returns list[float].

ti.image_info_batch(paths, workers=None)

Read image dimensions for multiple files without decoding (header-only, very fast). Returns a list[(width, height)]. Uses the shared rayon thread pool for parallelism. Useful as a cheap pre-filter before loading.
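Header-only reads are cheap because dimensions sit in the first bytes of the file. A sketch for the PNG case (illustrative only; tensorimage's reader also handles JPEG, WebP, and AVIF headers):

```python
import struct

def png_dimensions(data: bytes):
    """Read (width, height) from a PNG header without decoding pixels.

    PNG mandates that IHDR is the first chunk, so width and height are
    big-endian u32 values at byte offsets 16 and 20.
    """
    if data[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError("not a PNG")
    return struct.unpack(">II", data[16:24])

# Minimal synthetic header: signature + IHDR length/type + 640x480
header = (
    b"\x89PNG\r\n\x1a\n"
    + struct.pack(">I", 13) + b"IHDR"
    + struct.pack(">II", 640, 480)
)
print(png_dimensions(header))   # (640, 480)
```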

ti.to_dlpack(array)

Export a numpy array or torch tensor via DLPack for framework-agnostic interop.

Requires numpy >= 1.22 for numpy array inputs (numpy's __dlpack__ protocol was added in 1.22). Older numpy versions will raise a TypeError with a version hint.

How it works

Image file on disk (JPEG, PNG, WebP, AVIF)
  |  std::fs::read (single syscall)
  v
Raw bytes in memory
  |  Format routing: JPEG → turbojpeg, WebP → libwebp, other → image crate
  |  JPEG: IDCT scaling (decode at 1/4 resolution if target is small)
  |  JPEG: EXIF orientation auto-correction (rotate/flip as needed)
  v
RGB pixels at reduced resolution
  |  fused resize+crop: SIMD Lanczos with source-space crop (single resampling pass)
  v
RGB pixels at crop size (e.g., 224x224)
  |  fused normalize + HWC->CHW transpose (single pass, pre-computed scale/bias)
  v
float32 pixels in CHW layout
  |  PyArray::from_vec (zero-copy ownership transfer to numpy)
  v
numpy.ndarray (3, H, W) float32

The GIL is released during all Rust work. Batch loading runs images in parallel via a persistent rayon thread pool.

Key optimizations: fat LTO + target-cpu=native for cross-crate SIMD inlining, fused resize+crop in a single resampling pass, persistent thread pool via OnceLock, and contiguous batch output ([N,3,H,W] pre-allocated, each worker writes to its slice).

Platform support

Platform   Architecture           Status
Linux      x86_64 (AVX2)          Supported
Linux      aarch64 (NEON)         Supported
macOS      Apple Silicon (NEON)   Supported
macOS      Intel x86_64 (AVX2)    Supported
Windows    x86_64 (AVX2)          Supported

Requires Python 3.8+ and numpy. torch is optional (needed for device= and to_dlpack). The wheel is self-contained — no system libjpeg, libpng, or libwebp required.

Error handling

Scenario                                 Behavior
Corrupt or truncated image               Raises ValueError with a descriptive message
Unsupported format                       Raises ValueError
Corrupt image in load_batch              Raises ValueError and aborts the whole batch
crop without size                        Raises ValueError
normalize preset not recognized          Raises ValueError listing valid options
deduplicate / filter_dataset I/O error   Failing file is reported in the exception; the batch is aborted

All errors from Rust propagate as Python ValueError exceptions with human-readable messages.

Building from source

Requires Rust toolchain, CMake, and NASM (for libjpeg-turbo):

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install build deps (macOS)
brew install cmake nasm

# Build and install in development mode
python -m venv .venv && source .venv/bin/activate
pip install maturin numpy
maturin develop --release

Project structure

tensorimage/
├── crates/
│   ├── tensorimage-core/       # Pure Rust library
│   │   └── src/
│   │       ├── decode.rs       # JPEG (turbojpeg), WebP (libwebp), PNG/AVIF (image crate)
│   │       ├── exif.rs         # EXIF orientation parsing + pixel transforms
│   │       ├── resize.rs       # SIMD resize via fast_image_resize
│   │       ├── crop.rs         # Center crop
│   │       ├── normalize.rs    # Fused normalize + HWC->CHW transpose
│   │       ├── pipeline.rs     # Chained decode->resize->crop->normalize (file + bytes)
│   │       ├── batch.rs        # Parallel batch loading via rayon (file + bytes)
│   │       ├── pool.rs         # Shared rayon thread pool
│   │       ├── phash.rs        # Perceptual hashing (dHash, pHash)
│   │       ├── dedup.rs        # Parallel deduplication
│   │       ├── jpeg_info.rs    # Header-only dimension read
│   │       ├── augment.rs      # Gaussian blur, affine/perspective transforms
│   │       └── error.rs        # Error types
│   └── tensorimage-python/     # PyO3 bindings
│       └── src/
│           ├── lib.rs          # Python module definition
│           ├── load.rs         # Image loading bindings
│           ├── hash.rs         # Hash and dedup bindings
│           └── augment.rs      # Augmentation bindings (blur, affine, perspective)
├── python/tensorimage/         # Python package
│   ├── __init__.py             # Public API
│   ├── transforms.py           # torchvision.transforms replacement
│   ├── data.py                 # ImageFolder, ImageDataset, DataLoader integration
│   └── aesthetic.py            # CLIP aesthetic scoring (optional)
├── tests/                      # 257 tests
└── benches/                    # Benchmarks vs PIL and torchvision

Roadmap

Phases 1-5, 7-10 are complete. Upcoming:

  • Phase 6: GPU decode via NVJPEG — end-to-end CUDA pipeline
  • Phase 11: Streaming I/O (TAR shards, WebDataset, HTTP/URL loading) for large-scale training
  • Phase 12: Video frame extraction via FFmpeg

See PLAN.md for the detailed development roadmap and phase history.

License

MIT
