encoding['chunksizes'] can become stale after isel() removes a dimension #11028

@claydugo

When using isel() to select a single index along a dimension, the variable's encoding['chunksizes'] is preserved unchanged.
This creates a mismatch where the data has N-1 dimensions but the encoding still contains N-dimensional chunksizes.

Minimal example
import tempfile
from pathlib import Path

import numpy as np
import xarray as xr

with tempfile.TemporaryDirectory() as tmpdir:
  original_path = Path(tmpdir) / 'original.nc'
  resaved_path = Path(tmpdir) / 'resaved.nc'

  # Create a chunked netCDF file
  data = np.random.randint(0, 255, size=(5, 10, 20), dtype=np.uint8)
  ds = xr.Dataset({'images': (['time', 'y', 'x'], data)})
  ds['images'].encoding = {'chunksizes': (1, 10, 20)}
  ds.to_netcdf(original_path, engine='h5netcdf')

  # Load it back
  loaded = xr.open_dataset(original_path, engine='h5netcdf')
  print(f"Loaded shape: {loaded.images.shape}")
  print(f"Loaded chunksizes: {loaded.images.encoding.get('chunksizes')}")

  # Use isel to select single index, removing 'time' dimension
  selected = loaded.isel(time=2)
  print(f"\nAfter isel shape: {selected.images.shape}")
  print(f"After isel chunksizes: {selected.images.encoding.get('chunksizes')}")

  # MISMATCH: data is 2D but chunksizes is still 3D
  print(f"\n*** Data has {selected.images.ndim} dims, chunksizes has {len(selected.images.encoding['chunksizes'])} dims ***")

  # Save and reload to see what xarray does with the mismatched encoding
  selected.to_netcdf(resaved_path, engine='h5netcdf')
  reloaded = xr.open_dataset(resaved_path, engine='h5netcdf')
  print(f"\nAfter save/reload:")
  print(f"Shape: {reloaded.images.shape}")
  print(f"Chunksizes: {reloaded.images.encoding.get('chunksizes')}")
Output:
Loaded shape: (5, 10, 20)
Loaded chunksizes: (1, 10, 20)

After isel shape: (10, 20)
After isel chunksizes: (1, 10, 20)

*** Data has 2 dims, chunksizes has 3 dims ***

After save/reload:
Shape: (10, 20)
Chunksizes: None

Our use case:

We work with large chunked netCDF files and have workflows that select subsets before re-saving.
For example, selecting the best image from a stack of images removes a dimension.
Our save code reads encoding['chunksizes'] to preserve the user's chunking preferences, but after isel() removes a dimension, the encoding no longer matches the data.

What we find complicated:

The encoding dict seems to serve two purposes:

  1. Source metadata - "this is how the data was stored in the file you loaded"
  2. Output preferences - "this is how the data should be written when you save"

After isel(), the encoding still reflects the source structure, but for saving we need it to describe the output structure.
We can't easily distinguish "user explicitly set this" from "this was copied from the source file and is now stale."

What we've done:

We detect when len(chunksizes) != ndim and recompute chunksizes with a warning.
But it feels like something that could be handled more systematically.
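
A minimal sketch of the kind of check described above, for illustration only. The helper name reconcile_chunksizes is hypothetical and not part of xarray. It also assumes that "recompute" can be approximated by dropping the stale entry so the backend chooses its own chunking, since there is no way to tell from the tuple alone which axis was removed; our actual recomputation logic is not shown here.

```python
import warnings


def reconcile_chunksizes(shape, encoding):
    """Return a copy of `encoding` whose 'chunksizes' is consistent with `shape`.

    Hypothetical helper (not an xarray API). If the number of chunk sizes
    does not match the number of dimensions, the entry is treated as stale
    and dropped with a warning, on the assumption that the writer will then
    fall back to its default chunking.
    """
    encoding = dict(encoding)  # never mutate the caller's dict
    chunksizes = encoding.get("chunksizes")
    if chunksizes is None:
        return encoding
    if len(chunksizes) != len(shape):
        warnings.warn(
            f"chunksizes {tuple(chunksizes)} has {len(chunksizes)} dims but "
            f"the data has {len(shape)}; dropping stale chunksizes",
            stacklevel=2,
        )
        del encoding["chunksizes"]
    return encoding


# The situation from the example: 3D chunksizes left over on 2D data.
stale = {"chunksizes": (1, 10, 20), "zlib": True}
print(reconcile_chunksizes((10, 20), stale))  # {'zlib': True}
```

With xarray objects this could be applied before saving by looping over selected.data_vars.items() and assigning var.encoding = reconcile_chunksizes(var.shape, var.encoding).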

Questions:

  1. Is this the expected behavior?
  2. Should isel() (or other dimension-reducing operations) update dimension-dependent encoding fields?
  3. Is there a recommended pattern for handling this?
