Chunking and rechunking functionality for large datasets by hombit · Pull Request #475 · lincc-frameworks/nested-pandas

hombit · 2026-03-24T19:24:39Z

This is an alternative approach to supporting large nested arrays with more than 2**31 nested values; see #462 for another approach. It has two goals: 1) prevent offset overflow, and 2) use rechunking to avoid the appearance of small chunks and memory borrowing. Unlike #462, this is not a breaking change, but it introduces a number of trade-offs and tuning hyperparameters that have not been validated for best performance.
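To make the overflow concern concrete, here is a minimal sketch in plain Python. It assumes list offsets are stored as signed 32-bit integers, as in a regular Arrow list array. The helper names `offsets_from_lengths` and `fits_int32` are hypothetical and are not part of this PR.

```python
from itertools import accumulate

INT32_MAX = 2**31 - 1  # largest offset a signed 32-bit integer can hold


def offsets_from_lengths(lengths):
    """Return list offsets (cumulative start positions) for the given list lengths.

    Python ints do not overflow, so this behaves like 64-bit accumulation.
    """
    return [0, *accumulate(lengths)]


def fits_int32(lengths):
    """True if the last (largest) offset is representable as a signed 32-bit integer."""
    return offsets_from_lengths(lengths)[-1] <= INT32_MAX


# Three rows of 2**30 nested values each: 3 * 2**30 exceeds 2**31 - 1.
big = [2**30, 2**30, 2**30]
small = [3, 0, 2]
```

Here `fits_int32(small)` holds while `fits_int32(big)` does not, which is the situation where a single array with 32-bit offsets would have to be split into several chunks.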

Closes #95

- accessor.py: use list_lengths directly instead of np.diff(list_offsets)
- ext_array.py: remove __getstate__ (default pickle now preserves chunks)
- packer.py: view_sorted_series_as_list_array now produces properly chunked output using compute_chunk_boundaries instead of one giant ListArray; calculate_sorted_index_offsets returns int64 to avoid overflow; boundaries computed once in view_sorted_df_as_list_arrays and shared across columns; view_sorted_series_as_list_array uses keyword-only args with * separator
- packer.py: pack_lists passes explicit struct_type to pa.chunked_array so empty DataFrames (0-row) no longer raise ArrowInvalid
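The idea of computing chunk boundaries once from list lengths and sharing them across columns can be sketched as follows. This is a simplified greedy illustration, not the PR's `compute_chunk_boundaries`: the function name `chunk_boundaries` and the `max_values` parameter are assumptions, and the sketch only caps chunk size without merging small chunks.

```python
def chunk_boundaries(list_lengths, max_values):
    """Split rows into consecutive chunks of at most ``max_values`` nested values.

    Returns row indices [0, b1, ..., n]; chunk i covers rows
    boundaries[i]:boundaries[i + 1]. A single row longer than ``max_values``
    cannot be split, so the chunk containing it exceeds the limit.
    """
    boundaries = [0]
    in_chunk = 0  # nested values accumulated in the current chunk
    for row, length in enumerate(list_lengths):
        if in_chunk > 0 and in_chunk + length > max_values:
            boundaries.append(row)  # close the current chunk before this row
            in_chunk = 0
        in_chunk += length
    if boundaries[-1] != len(list_lengths):
        boundaries.append(len(list_lengths))  # close the final chunk
    return boundaries


# Row boundaries are computed once and can be reused for every nested column,
# because all columns of one nested frame share the same list lengths.
lengths = [3, 4, 2, 5, 1]
bounds = chunk_boundaries(lengths, max_values=6)
chunks = [lengths[a:b] for a, b in zip(bounds, bounds[1:])]
```

With a limit chosen below 2**31, offsets computed inside each chunk restart at zero and stay within the 32-bit range, provided no single row exceeds the limit.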

github-actions · 2026-03-24T19:25:50Z

Pandas Nightly Test Results (Python 3.11)

486 tests +27	469 ✅ +27	22s ⏱️ +5s
1 suites ± 0	0 💤 ± 0
1 files ± 0	17 ❌ ± 0

For more details on these failures, see this check.

Results for commit c8a284a. ± Comparison against base commit 509562f.

codecov · 2026-03-24T19:26:04Z

Codecov Report

❌ Patch coverage is 96.96970% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.99%. Comparing base (32a90df) to head (c8a284a).
⚠️ Report is 14 commits behind head on main.

Files with missing lines	Patch %	Lines
src/nested_pandas/series/packer.py	89.28%	3 Missing ⚠️
src/nested_pandas/series/ext_array.py	98.59%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #475      +/-   ##
==========================================
- Coverage   97.30%   95.99%   -1.32%     
==========================================
  Files          19       20       +1     
  Lines        2156     2347     +191     
==========================================
+ Hits         2098     2253     +155     
- Misses         58       94      +36

github-actions · 2026-03-24T19:30:47Z

Before [`509562f`]	After [`db44e38`]	Ratio	Benchmark (Parameter)
10.6±0.1ms	10.9±0.05ms	1.03	benchmarks.NestedFrameQuery.time_run
103M	105M	1.02	benchmarks.NestedFrameAddNested.peakmem_run
108M	111M	1.02	benchmarks.NestedFrameQuery.peakmem_run
107M	109M	1.02	benchmarks.NestedFrameReduce.peakmem_run
256M	258M	1.01	benchmarks.AssignSingleDfToNestedSeries.peakmem_run
136M	138M	1.01	benchmarks.CountNestedBy.peakmem_run
10.9±0.3ms	11.1±0.2ms	1.01	benchmarks.NestedFrameAddNested.time_run
1.21G	1.22G	1.01	benchmarks.ReadFewColumnsS3.peakmem_run
1.25±0.01ms	1.23±0ms	0.99	benchmarks.NestedFrameReduce.time_run
66.6±0.8ms	65.5±0.5ms	0.98	benchmarks.CountNestedBy.time_run

hombit added 5 commits March 6, 2026 16:39

Initial rechunking impl

41f0746

Improve rechunking code

fd5cba6

Rechunk in concat

a514ae5

Rechunk getitem and fix list_offsets

c8a284a

hombit requested a review from dougbrn March 24, 2026 20:48

hombit wants to merge 5 commits into main from rechunking
