Chunking and rechunking functionality for large datasets #475
Draft
- accessor.py: use list_lengths directly instead of np.diff(list_offsets)
- ext_array.py: remove __getstate__ (default pickle now preserves chunks)
- packer.py: view_sorted_series_as_list_array now produces properly chunked output using compute_chunk_boundaries instead of one giant ListArray; calculate_sorted_index_offsets returns int64 to avoid overflow; boundaries computed once in view_sorted_df_as_list_arrays and shared across columns; view_sorted_series_as_list_array uses keyword-only args with * separator
- packer.py: pack_lists passes explicit struct_type to pa.chunked_array so empty DataFrames (0-row) no longer raise ArrowInvalid
Pandas Nightly Test Results (Python 3.11)
486 tests +27, 469 ✅ +27, 22s ⏱️ +5s
Results for commit c8a284a. ± Comparison against base commit 509562f.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #475      +/-   ##
==========================================
- Coverage   97.30%   95.99%   -1.32%
==========================================
  Files          19       20       +1
  Lines        2156     2347     +191
==========================================
+ Hits         2098     2253     +155
- Misses         58       94      +36
This is an alternative approach to supporting large nested arrays with more than 2**31 nested values; see #462 for another approach. It has two aims: 1) prevent offsets overflow, and 2) use re-chunking to prevent the appearance of small chunks and memory borrowing. Unlike #462, this is not a breaking change, but it involves many trade-offs and tuning hyperparameters that have not been validated for best performance.

Closes #95
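
For readers unfamiliar with the offsets problem, here is a minimal pure-Python sketch of the general idea of splitting rows into chunks so that no chunk's offsets exceed the 32-bit limit. It is not the code from this PR: `chunk_boundaries`, `chunk_offsets` and the `max_values` parameter are hypothetical names chosen for illustration, and the greedy strategy shown is an assumption, not necessarily what `compute_chunk_boundaries` does. It also does nothing about the second aim (avoiding small chunks).

```python
from itertools import accumulate

INT32_MAX = 2**31 - 1  # largest offset a 32-bit list offsets buffer can hold


def chunk_boundaries(list_lengths, max_values=INT32_MAX):
    """Greedily pick row boundaries so each chunk holds at most max_values nested values.

    Hypothetical helper for illustration only.
    """
    boundaries = [0]
    running = 0  # nested values accumulated in the current chunk
    for row, length in enumerate(list_lengths):
        if length > max_values:
            raise ValueError(f"row {row} alone holds {length} values (limit {max_values})")
        if running + length > max_values:
            boundaries.append(row)  # close the current chunk just before this row
            running = 0
        running += length
    boundaries.append(len(list_lengths))
    return boundaries


def chunk_offsets(list_lengths, boundaries):
    """Build per-chunk offsets; each chunk restarts at 0, so offsets stay within the limit."""
    return [
        [0, *accumulate(list_lengths[start:stop])]
        for start, stop in zip(boundaries, boundaries[1:])
    ]


# Tiny limit so the splitting is visible without allocating billions of values.
lengths = [3, 4, 2, 5, 1]
bounds = chunk_boundaries(lengths, max_values=7)
offsets = chunk_offsets(lengths, bounds)
print(bounds)   # [0, 2, 4, 5]
print(offsets)  # [[0, 3, 7], [0, 2, 7], [0, 1]]
```

Because each chunk's offsets restart at zero, the largest offset in any chunk is bounded by `max_values`, whereas a single unchunked array would need offsets up to the total number of nested values.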