Description
When using isel() to select a single index along a dimension, the variable's encoding['chunksizes'] is preserved unchanged.
This creates a mismatch where the data has N-1 dimensions but the encoding still contains N-dimensional chunksizes.
Minimal example
import tempfile
from pathlib import Path
import numpy as np
import xarray as xr
with tempfile.TemporaryDirectory() as tmpdir:
    original_path = Path(tmpdir) / 'original.nc'
    resaved_path = Path(tmpdir) / 'resaved.nc'

    # Create a chunked netCDF file
    data = np.random.randint(0, 255, size=(5, 10, 20), dtype=np.uint8)
    ds = xr.Dataset({'images': (['time', 'y', 'x'], data)})
    ds['images'].encoding = {'chunksizes': (1, 10, 20)}
    ds.to_netcdf(original_path, engine='h5netcdf')

    # Load it back
    loaded = xr.open_dataset(original_path, engine='h5netcdf')
    print(f"Loaded shape: {loaded.images.shape}")
    print(f"Loaded chunksizes: {loaded.images.encoding.get('chunksizes')}")

    # Use isel to select single index, removing 'time' dimension
    selected = loaded.isel(time=2)
    print(f"\nAfter isel shape: {selected.images.shape}")
    print(f"After isel chunksizes: {selected.images.encoding.get('chunksizes')}")

    # MISMATCH: data is 2D but chunksizes is still 3D
    print(f"\n*** Data has {selected.images.ndim} dims, chunksizes has {len(selected.images.encoding['chunksizes'])} dims ***")

    # Save and reload to see what xarray does with the mismatched encoding
    selected.to_netcdf(resaved_path, engine='h5netcdf')
    reloaded = xr.open_dataset(resaved_path, engine='h5netcdf')
    print("\nAfter save/reload:")
    print(f"Shape: {reloaded.images.shape}")
    print(f"Chunksizes: {reloaded.images.encoding.get('chunksizes')}")
Output:
Loaded shape: (5, 10, 20)
Loaded chunksizes: (1, 10, 20)
After isel shape: (10, 20)
After isel chunksizes: (1, 10, 20)
*** Data has 2 dims, chunksizes has 3 dims ***
After save/reload:
Shape: (10, 20)
Chunksizes: None
Our use case:
We work with large chunked netCDF files and have workflows that select subsets before re-saving.
For example, selecting the best image from a stack of images removes a dimension.
Our save code reads encoding['chunksizes'] to preserve the user's chunking preferences, but after isel() removes a dimension, the encoding no longer matches the data.
What we find complicated:
The encoding dict seems to serve two purposes:
- Source metadata - "this is how the data was stored in the file you loaded"
- Output preferences - "this is how the data should be written when you save"
After isel(), the encoding still reflects the source structure, but for saving we need it to describe the output structure.
We can't easily distinguish "user explicitly set this" from "this was copied from the source file and is now stale."
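To make the "output preferences" reading concrete, here is a minimal sketch of one way to keep chunking preferences valid across a dimension-reducing selection: record them by dimension name instead of by position, then rebuild the positional tuple for whatever dimensions survive. The helper names chunks_by_dim and chunksizes_for are hypothetical and not part of xarray, and clamping each chunk to the dimension length is an assumption about what the writer will accept.

```python
from typing import Dict, Optional, Sequence, Tuple


def chunks_by_dim(dims: Sequence[str], chunksizes: Sequence[int]) -> Dict[str, int]:
    """Record positional chunksizes as a {dimension name: chunk length} mapping.

    Hypothetical helper: meant to be called while dims and chunksizes still agree,
    i.e. before any dimension is dropped.
    """
    if len(dims) != len(chunksizes):
        raise ValueError(f"{len(dims)} dims but {len(chunksizes)} chunksizes")
    return dict(zip(dims, chunksizes))


def chunksizes_for(
    dims: Sequence[str], shape: Sequence[int], by_dim: Dict[str, int]
) -> Optional[Tuple[int, ...]]:
    """Rebuild positional chunksizes for the dims that are still present.

    Returns None when any dim has no recorded preference, so the caller can
    fall back to the backend's default chunking.
    """
    if any(d not in by_dim for d in dims):
        return None
    # Assumption: a chunk should never be longer than the dimension it covers.
    return tuple(min(by_dim[d], n) for d, n in zip(dims, shape))


# Preferences captured from the 3D variable in the minimal example.
prefs = chunks_by_dim(("time", "y", "x"), (1, 10, 20))

# After isel(time=2) only ('y', 'x') remain, with shape (10, 20).
print(chunksizes_for(("y", "x"), (10, 20), prefs))  # (10, 20)
```

The rebuilt tuple could then be passed explicitly at write time, for example through the encoding argument of to_netcdf, rather than relying on whatever is left in the variable's encoding dict.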
What we've done:
We detect when len(chunksizes) != ndim and recompute chunksizes with a warning.
But it feels like something that could be handled more systematically.
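For reference, the check looks roughly like the sketch below. It is a simplified stand-in rather than our actual save code: the name validated_chunksizes and the optional recompute callback are hypothetical, and when no callback is given the stale value is simply dropped so the backend can pick its own chunking.

```python
import warnings
from typing import Callable, Optional, Sequence, Tuple


def validated_chunksizes(
    shape: Sequence[int],
    chunksizes: Optional[Sequence[int]],
    recompute: Optional[Callable[[Sequence[int]], Tuple[int, ...]]] = None,
) -> Optional[Tuple[int, ...]]:
    """Return chunksizes only if they match the data's dimensionality.

    On a mismatch, emit a warning and either call `recompute(shape)` (a
    caller-supplied policy, hypothetical here) or return None.
    """
    if chunksizes is None:
        return None
    if len(chunksizes) != len(shape):
        warnings.warn(
            f"encoding chunksizes {tuple(chunksizes)} has {len(chunksizes)} entries "
            f"but the data has {len(shape)} dimensions; ignoring the stale value",
            stacklevel=2,
        )
        return recompute(shape) if recompute is not None else None
    return tuple(chunksizes)


# Matching case: the 3D variable as loaded from the file.
print(validated_chunksizes((5, 10, 20), (1, 10, 20)))  # (1, 10, 20)

# Mismatching case: 2D data after isel(time=2) with the stale 3D chunksizes.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(validated_chunksizes((10, 20), (1, 10, 20)))  # None
```

In practice the result would be written back into the variable's encoding (or the key removed when the result is None) just before saving.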
Questions:
- Is this the expected behavior?
- Should isel() (or other dimension-reducing operations) update dimension-dependent encoding fields?
- Is there a recommended pattern for handling this?