Skip to content

DONT-MERGE GSOC26: Improve performance of replace_with_mask()#476

Draft
s7g4 wants to merge 1 commit intolincc-frameworks:mainfrom
s7g4:gsoc-replace-with-mask-performance-opt
Draft

DONT-MERGE GSOC26: Improve performance of replace_with_mask()#476
s7g4 wants to merge 1 commit intolincc-frameworks:mainfrom
s7g4:gsoc-replace-with-mask-performance-opt

Conversation

@s7g4
Copy link
Copy Markdown

@s7g4 s7g4 commented Mar 26, 2026

Change Description

Closes #52

Optimized NestedExtensionArray.__setitem__ and the internal replace_with_mask utility to handle pyarrow.Scalar values directly. This avoids the significant overhead caused by expanding a single replacement scalar into a full-sized pyarrow.Array (matching the length of the target array).

Solution Description

The optimization involves two main changes in src/nested_pandas/series/ext_array.py:

  1. NestedExtensionArray.__setitem__: Now detects if the replacement value can be boxed as a pa.Scalar. If so, it passes the scalar directly instead of creating a len(self)-sized array.
  2. replace_with_mask: Updated to handle pa.Scalar inputs. When a scalar is detected, it leverages PyArrow's native pa.compute.if_else kernel for efficient broadcasting.

Benchmarks (1,000,000 rows):

  • Original approach: ~7.3s execution time | ~78MB peak memory
  • Optimized approach: ~0.26s execution time | ~0MB peak memory (approx. 30x faster)

Code Quality

  • I have read the Contribution Guide and agree to the Code of Conduct
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation

@hombit hombit marked this pull request as draft March 26, 2026 20:35
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.04%. Comparing base (509562f) to head (1dc5fdf).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #476   +/-   ##
=======================================
  Coverage   96.03%   96.04%           
=======================================
  Files          20       20           
  Lines        2247     2249    +2     
=======================================
+ Hits         2158     2160    +2     
  Misses         89       89           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions
Copy link
Copy Markdown

Before [509562f] After [29420ea] Ratio Benchmark (Parameter)
180M 182M 1.01 benchmarks.ReadFewColumnsHTTPS.peakmem_run
139M 139M 1 benchmarks.CountNestedBy.peakmem_run
106M 105M 1 benchmarks.NestedFrameAddNested.peakmem_run
111M 110M 1 benchmarks.NestedFrameQuery.peakmem_run
11.3±0.3ms 11.3±0.2ms 1 benchmarks.NestedFrameQuery.time_run
109M 109M 1 benchmarks.NestedFrameReduce.peakmem_run
680±50ms 684±10ms 1 benchmarks.ReadFewColumnsS3.time_run
262M 262M 1 benchmarks.ReassignHalfOfNestedSeries.peakmem_run
66.8±2ms 66.1±1ms 0.99 benchmarks.CountNestedBy.time_run
11.5±0.3ms 11.5±0.2ms 0.99 benchmarks.NestedFrameAddNested.time_run

Click here to view all benchmarks.

# TODO: performance optimization
# https://github.com/lincc-frameworks/nested-pandas/issues/52
# Fast path for scalars using native broadcasting
if isinstance(value, pa.Scalar):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include unit tests for the new functionality. Existing test_replace_with_mask only provides list test data.



def replace_with_mask(array: pa.ChunkedArray, mask: pa.BooleanArray, value: pa.Array) -> pa.ChunkedArray:
def replace_with_mask(array: pa.ChunkedArray, mask: pa.BooleanArray, value: pa.Array | pa.Scalar) -> pa.ChunkedArray:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-commit fails for line too long.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve performance of series.ext_array.replace_with_mask()

2 participants