Chunking and rechunking functionality for large datasets #475

Draft
hombit wants to merge 5 commits into main from rechunking

@hombit (Collaborator) commented Mar 24, 2026

This is an alternative approach to supporting large nested arrays with more than 2**31 nested values; see #462 for another approach. It has two goals: 1) prevent offset overflow, and 2) avoid small chunks and memory borrowing through re-chunking. Unlike #462, this is not a breaking change, but it involves many trade-offs and tuning hyperparameters that have not been validated for best performance.

Closes #95
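
To make the overflow concern concrete, here is a minimal pure-Python sketch of how row boundaries could be chosen so that the flat values in each chunk stay within the int32 offset range. It is an illustration only: `chunk_boundaries` and `max_flat` are hypothetical names, and this greedy rule is an assumption, not necessarily what `compute_chunk_boundaries` in this PR does.

```python
INT32_MAX = 2**31 - 1  # largest offset a 32-bit list array can store


def chunk_boundaries(list_lengths, max_flat=INT32_MAX):
    """Greedily pick row boundaries so each chunk has at most max_flat flat values.

    Hypothetical helper for illustration; not the PR's implementation.
    Returns row indices [0, ..., n_rows]; chunk i covers rows
    boundaries[i]:boundaries[i + 1].
    """
    boundaries = [0]
    flat_in_chunk = 0
    for row, length in enumerate(list_lengths):
        if length > max_flat:
            # A single row that is too long cannot be fixed by row-wise chunking.
            raise ValueError(f"row {row} has {length} values, more than {max_flat}")
        if flat_in_chunk + length > max_flat:
            # Adding this row would overflow the chunk, so start a new chunk here.
            boundaries.append(row)
            flat_in_chunk = 0
        flat_in_chunk += length
    boundaries.append(len(list_lengths))
    return boundaries


# Small limit so the behaviour is easy to follow by hand.
small = chunk_boundaries([3, 2, 4, 1, 5], max_flat=5)

# Three rows of 2**30 values each: any two together exceed the int32 limit.
large = chunk_boundaries([2**30, 2**30, 2**30])
```

With the small limit the rows split into chunks `[0:2]`, `[2:4]` and `[4:5]`; in the large case every row ends up in its own chunk. A greedy rule like this can still produce small trailing chunks, which is one of the trade-offs the description mentions.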

hombit added 5 commits March 6, 2026 16:39
- accessor.py: use list_lengths directly instead of np.diff(list_offsets)
- ext_array.py: remove __getstate__ (default pickle now preserves chunks)
- packer.py: view_sorted_series_as_list_array now produces properly chunked
  output using compute_chunk_boundaries instead of one giant ListArray;
  calculate_sorted_index_offsets returns int64 to avoid overflow; boundaries
  computed once in view_sorted_df_as_list_arrays and shared across columns;
  view_sorted_series_as_list_array uses keyword-only args with * separator
- packer.py: pack_lists passes explicit struct_type to pa.chunked_array so
  empty DataFrames (0-row) no longer raise ArrowInvalid
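
The commit notes above mention computing boundaries once and sharing them across columns, with keyword-only arguments. The following pure-Python sketch shows that idea under stated assumptions: `offsets_from_lengths` and `split_column` are hypothetical names, plain lists stand in for Arrow buffers, and the real functions in `packer.py` may be structured differently.

```python
from itertools import accumulate


def offsets_from_lengths(list_lengths):
    """Cumulative offsets [0, l0, l0 + l1, ...] for the given list lengths."""
    return [0, *accumulate(list_lengths)]


def split_column(flat_values, list_lengths, *, boundaries):
    """Split one flat column into (local_offsets, values) chunks.

    `boundaries` is keyword-only, mirroring the `*` separator mentioned in
    the commit message. Offsets inside every chunk restart from zero, so
    they stay small even when the global offsets are large.
    """
    global_offsets = offsets_from_lengths(list_lengths)
    chunks = []
    for start_row, stop_row in zip(boundaries[:-1], boundaries[1:]):
        first = global_offsets[start_row]
        last = global_offsets[stop_row]
        local_offsets = [o - first for o in global_offsets[start_row : stop_row + 1]]
        chunks.append((local_offsets, flat_values[first:last]))
    return chunks


# Boundaries are computed once (here written by hand) and reused per column.
lengths = [2, 1, 3]
boundaries = [0, 2, 3]
columns = {
    "time": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "band": ["g", "r", "g", "r", "r", "g"],
}
chunked = {
    name: split_column(values, lengths, boundaries=boundaries)
    for name, values in columns.items()
}
```

Because every column is split at the same row boundaries, the resulting chunks line up row for row, which is what allows them to be combined into one struct per chunk afterwards.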
@github-actions

Pandas Nightly Test Results (Python 3.11)

486 tests +27   469 ✅ +27   17 ❌ ±0   0 💤 ±0   22s ⏱️ +5s
1 suite ±0   1 file ±0

For more details on these failures, see this check.

Results for commit c8a284a. ± Comparison against base commit 509562f.

@codecov

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 96.96970% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.99%. Comparing base (32a90df) to head (c8a284a).
⚠️ Report is 14 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/nested_pandas/series/packer.py | 89.28% | 3 Missing ⚠️ |
| src/nested_pandas/series/ext_array.py | 98.59% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #475      +/-   ##
==========================================
- Coverage   97.30%   95.99%   -1.32%     
==========================================
  Files          19       20       +1     
  Lines        2156     2347     +191     
==========================================
+ Hits         2098     2253     +155     
- Misses         58       94      +36     

@github-actions

| Before [509562f] | After [db44e38] | Ratio | Benchmark (Parameter) |
|---|---|---|---|
| 10.6±0.1ms | 10.9±0.05ms | 1.03 | benchmarks.NestedFrameQuery.time_run |
| 103M | 105M | 1.02 | benchmarks.NestedFrameAddNested.peakmem_run |
| 108M | 111M | 1.02 | benchmarks.NestedFrameQuery.peakmem_run |
| 107M | 109M | 1.02 | benchmarks.NestedFrameReduce.peakmem_run |
| 256M | 258M | 1.01 | benchmarks.AssignSingleDfToNestedSeries.peakmem_run |
| 136M | 138M | 1.01 | benchmarks.CountNestedBy.peakmem_run |
| 10.9±0.3ms | 11.1±0.2ms | 1.01 | benchmarks.NestedFrameAddNested.time_run |
| 1.21G | 1.22G | 1.01 | benchmarks.ReadFewColumnsS3.peakmem_run |
| 1.25±0.01ms | 1.23±0ms | 0.99 | benchmarks.NestedFrameReduce.time_run |
| 66.6±0.8ms | 65.5±0.5ms | 0.98 | benchmarks.CountNestedBy.time_run |

@hombit hombit requested a review from dougbrn March 24, 2026 20:48

Development

Successfully merging this pull request may close these issues.

Handle series with more than 2^31 "flat" elements
