Expand distributed indexing, match numpy indexing scheme by ClaudiaComito · Pull Request #938 · helmholtz-analytics/heat

ClaudiaComito · 2022-03-24T05:23:18Z

Description

This pull request introduces a significant overhaul of distributed indexing within dndarray.py, specifically targeting the __getitem__ and __setitem__ methods. The primary objective is to achieve full NumPy indexing compliance in a distributed environment while minimizing MPI overhead and memory footprint.

The logic has been refactored to identify zero-communication paths ("early out") and to route heavy unordered advanced indexing through optimized communication.

The following table shows the distribution semantics of DNDarray indexing operations.

UPDATED 26.5.2026

| Array is distributed | Operation | Key is distributed | Value is distributed | Result is distributed | Notes |
|---|---|---|---|---|---|
| No | `array[key]` | No | -- | No | Standard local indexing. |
| No | `array[key]` | Yes | -- | Yes | The resulting array inherits the `split` axis and balanced status directly from the distributed key. |
| Yes | `array[key]` | No | -- | Yes / No | No if the key is a pure scalar along the split axis (the split dimension is lost and the result is broadcast). Yes for slices/masks. Unordered local advanced indices are automatically distributed across the split axis under the hood. |
| Yes | `array[key]` | Yes | -- | Yes | Split axis is retained or shifted. Evaluated as a `distr_mask` fast path, or triggers `__getitem_unordered` for cross-node MPI collective fetching. |
| No | `array[key] = val` | No | No | No (in-place) | Standard local assignment. |
| Yes | `array[key] = val` | No | No | Yes (in-place) | The local value is automatically converted into a distributed array and broadcast to align with the array's distribution constraints. |
| Yes | `array[key] = val` | No | Yes | Yes (in-place) | Split axis match required: if the `value`'s split axis doesn't match the target's split axis, a `RuntimeError` is raised. If they do match, `value` is dynamically load-balanced (`redistribute_`) to match the target's chunk sizes before assignment. |
| Yes | `array[key] = val` | Yes | No, scalar | Yes (in-place) | A pure scalar value is assigned to all masked/indexed elements across all MPI ranks. |
| Yes | `array[key] = val` | Yes | No, array | ERROR | Exception raised. You cannot assign a local/non-distributed array using a distributed index. |
| Yes | `array[key] = val` | Yes | Yes | Yes (in-place) | Communication-heavy: the value is dynamically redistributed to match the key's distribution layout, followed by an `Alltoallv` shuffle to assign elements to their global unordered indices. |
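For readers who prefer code to tables, the rows above can be restated as a small decision function. This is only a sketch of the table, not code from `dndarray.py`: the function names, the `scalar_on_split_axis` flag, and the `ValueError` type are assumptions made for illustration (the table only says that an exception is raised), and combinations the table does not list are rejected explicitly.

```python
from typing import Optional


def getitem_result_is_distributed(
    array_split: Optional[int],
    key_is_distributed: bool,
    scalar_on_split_axis: bool = False,
) -> bool:
    """Return whether `array[key]` is distributed, following the table rows."""
    if array_split is None:
        # Local array: the result takes its distribution from the key, if any.
        return key_is_distributed
    if not key_is_distributed and scalar_on_split_axis:
        # A pure scalar along the split axis removes the split dimension.
        return False
    return True


def check_setitem(
    array_split: Optional[int],
    key_is_distributed: bool,
    value_split: Optional[int],
    value_is_scalar: bool = False,
) -> str:
    """Classify `array[key] = value` according to the table rows."""
    if array_split is None:
        if key_is_distributed or value_split is not None:
            # These combinations do not appear in the table.
            raise NotImplementedError("combination not listed in the table")
        return "local assignment"
    if not key_is_distributed:
        if value_split is None:
            return "value is distributed to match the array"
        if value_split != array_split:
            raise RuntimeError("split axis of value does not match the target")
        return "value is rebalanced to the target's chunk sizes"
    # Distributed key from here on.
    if value_is_scalar:
        return "scalar assigned on every rank"
    if value_split is None:
        # Hypothetical exception type; the table does not name one.
        raise ValueError("cannot assign a local array through a distributed key")
    return "value redistributed, then shuffled with Alltoallv"
```

For example, `getitem_result_is_distributed(0, False, scalar_on_split_axis=True)` corresponds to the "Yes / No" row with a scalar key on the split axis and returns `False`.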

Routing logic

UPDATED 26.5.2026

graph TD
    Start((Receive Key)) --> CheckScalar{Is key a pure scalar<br/>and not boolean?}
    
    CheckScalar -- Yes --> EvalRoot{Compute root}
    EvalRoot --> OpScalar[op_type = 'scalar']
    
    CheckScalar -- No --> CheckFastPath{Matches distr_mask<br/>fast path?}
    
    CheckFastPath -- Yes & not tuple --> OpDistrMask1[op_type = 'distr_mask']
    
    CheckFastPath -- No / Tuple --> Normalize[Normalize keys, extract bounds,<br/>check dimensionality & broadcast]
    
    Normalize --> FinalRouting{Evaluate Key State}
    
    FinalRouting -->|root is not None| OpScalar2[op_type = 'scalar']
    FinalRouting -->|split_key_is_ordered == 0| OpDist[op_type = 'distributed'<br/>Unordered MPI Communication]
    FinalRouting -->|split_key_is_ordered == -1| OpDesc[op_type = 'descending_slice']
    
    FinalRouting -->|key_is_mask_like == True| MaskTypeCheck{distr_mask_fast_path?}
    MaskTypeCheck -- Yes --> OpDistrMask2[op_type = 'distr_mask']
    MaskTypeCheck -- No --> OpLocalMask[op_type = 'local_mask']
    
    FinalRouting -->|Default / Ordered| OpAdv[op_type = 'advanced'<br/>Local Fast Path]

    %% Map to actual handlers
    subgraph Handlers [Target Routing Methods]
        OpScalar & OpScalar2 --> H_Scalar[__getitem_scalar<br/>__setitem_scalar]
        OpDist --> H_Dist[__getitem_advanced_distributed<br/>__setitem_advanced_distributed]
        OpDesc --> H_Desc[__getitem_descending_slice_distributed<br/>__setitem_descending_slice_distributed]
        OpDistrMask1 & OpDistrMask2 --> H_DistMask[__getitem_mask<br/>__setitem_mask]
        OpLocalMask --> H_LocalMask[__getitem_advanced_local<br/>__setitem_advanced_local]
        OpAdv --> H_Adv[__getitem_advanced_local<br/>__setitem_advanced_local]
    end
    
    %% Styling
    classDef target fill:#d4edda,stroke:#28a745,stroke-width:2px;
    class H_Scalar,H_Dist,H_Desc,H_DistMask,H_LocalMask,H_Adv target;
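The graph can also be read as a plain function. The sketch below mirrors the nodes and edge labels of the diagram; it is not the implementation in this PR. The function name `resolve_op_type` and its boolean parameters are hypothetical, and the order in which the final branches are tested is an assumption, since the diagram does not state a priority between them. The handler names are copied from the diagram.

```python
from typing import Optional

# Handler names as they appear in the diagram, keyed by op_type.
HANDLERS = {
    "scalar": ("__getitem_scalar", "__setitem_scalar"),
    "distributed": (
        "__getitem_advanced_distributed",
        "__setitem_advanced_distributed",
    ),
    "descending_slice": (
        "__getitem_descending_slice_distributed",
        "__setitem_descending_slice_distributed",
    ),
    "distr_mask": ("__getitem_mask", "__setitem_mask"),
    "local_mask": ("__getitem_advanced_local", "__setitem_advanced_local"),
    "advanced": ("__getitem_advanced_local", "__setitem_advanced_local"),
}


def resolve_op_type(
    key_is_pure_scalar: bool,
    key_is_tuple: bool,
    distr_mask_fast_path: bool,
    root: Optional[int] = None,
    split_key_is_ordered: int = 1,
    key_is_mask_like: bool = False,
) -> str:
    """Return the op_type label the diagram assigns to a key state."""
    # Pure, non-boolean scalar keys are resolved before anything else.
    if key_is_pure_scalar:
        return "scalar"
    # Non-tuple keys matching the distributed-mask fast path skip normalization.
    if distr_mask_fast_path and not key_is_tuple:
        return "distr_mask"
    # Key normalization would happen here; only its outcome is modelled.
    if root is not None:
        return "scalar"
    if split_key_is_ordered == 0:
        return "distributed"  # unordered key: MPI communication needed
    if split_key_is_ordered == -1:
        return "descending_slice"
    if key_is_mask_like:
        return "distr_mask" if distr_mask_fast_path else "local_mask"
    return "advanced"  # ordered key: local fast path
```

Looking up `HANDLERS[resolve_op_type(False, True, False, split_key_is_ordered=0)]` yields the `__getitem_advanced_distributed` / `__setitem_advanced_distributed` pair, matching the "Unordered MPI Communication" branch.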

Main changes

Abstracts key parsing and alignment into a centralized private method that handles dimension expansion and shape broadcasting, and classifies the state of the indexing operation to determine network routing.
Enforces standard last-assignment-wins semantics for duplicate indices in advanced indexing on CUDA tensors by generating linear indices and mapping local occurrence priorities (thanks @Hakdag97).
Intercepts multidimensional and single-dimensional boolean masks early in the pipeline, converting them to explicit integer indices locally to prevent unnecessary cross-rank broadcasting.
Identifies and isolates zero-communication assignments during slice operations, executing purely local PyTorch tensor modifications when the requested indices and data already reside on the active rank.
Structures unordered read requests by compiling global communication matrices, so that non-blocking Isend and Recv calls are dispatched strictly between the nodes that own the requested indices and the nodes requesting them.
Forces distribution alignment during set operations if the right-hand side value is also distributed, using an Alltoallv operation to shuffle payload data and target indices concurrently.
Introduces a value broadcasting helper function that squeezes or expands the dimensions of scalar or tensor payloads to match the dimensional footprint of the target slice before assignment occurs.
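To illustrate the last-assignment-wins point from the list above: the idea of flattening multi-dimensional indices to linear indices and keeping only the last occurrence of each can be sketched in plain Python. This is not the PR's code, which operates on torch tensors; the helper names `linear_indices`, `last_occurrence_positions`, and `assign_last_wins` are made up for this example, and row-major flattening is an assumption.

```python
from typing import Dict, List, Sequence, Tuple


def linear_indices(
    index_tuples: Sequence[Tuple[int, ...]], shape: Tuple[int, ...]
) -> List[int]:
    """Flatten multi-dimensional indices to row-major linear indices."""
    linear = []
    for idx in index_tuples:
        flat = 0
        for i, dim in zip(idx, shape):
            flat = flat * dim + i
        linear.append(flat)
    return linear


def last_occurrence_positions(linear: Sequence[int]) -> List[int]:
    """Return the positions that hold the last occurrence of each index."""
    last_seen: Dict[int, int] = {}
    for pos, flat in enumerate(linear):
        last_seen[flat] = pos  # later duplicates overwrite earlier ones
    return sorted(last_seen.values())


def assign_last_wins(
    buffer: List[int],
    index_tuples: Sequence[Tuple[int, ...]],
    values: Sequence[int],
    shape: Tuple[int, ...],
) -> List[int]:
    """Write values into a flat buffer so that the last duplicate wins."""
    linear = linear_indices(index_tuples, shape)
    for pos in last_occurrence_positions(linear):
        buffer[linear[pos]] = values[pos]
    return buffer
```

With shape `(2, 3)`, indices `[(0, 1), (1, 2), (0, 1)]` and values `[10, 20, 30]`, the duplicate target `(0, 1)` ends up holding `30`, the last value assigned to it. After deduplication every target is written exactly once, so the outcome no longer depends on the order in which writes are executed.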

To Be Continued...

Memory footprint

Scaling behaviour

Issue/s resolved: #703 #914 #918 #1012 #1019 #2135 #1816 #824

Changes proposed:

feature extension in __process_key, getitem, and setitem methods
edge case handling
extensive comparison to numpy API in unittests

Type of change

Memory requirements

Performance

Due Diligence

All split configurations tested
Multiple dtypes tested in relevant functions
Documentation updated (if needed)
Updated changelog.md under the title "Pending Additions"

Does this change modify the behaviour of other functions? If so, which?

yes / no

skip ci


brownbaerchen

I had a quick look through all the stuff that is not actually the advanced indexing. I think it would be good to clean up this PR by

Removing any changes we don't want to keep at all
Separate PR with refactoring of basic tests
Separate PR with changes to non_zero
Separate PR with adding keyword arguments to DNDarray instantiation

These separate PRs can be merged very quickly and then the PR does only what it promises to and is easier to review.

brownbaerchen · 2026-05-27T09:44:58Z



-def nonzero(x: DNDarray) -> DNDarray:
+def nonzero(x: DNDarray, as_tuple: bool = True) -> tuple[DNDarray, ...] | DNDarray:


I think the changes to this function belong in its own PR since it seems unrelated to advanced indexing and could be merged quickly.

In principle you're right and I agree, but in practice the changes and tests are not so easy to disentangle from the new indexing capabilities. If you want to go for it, I've started PR #2332 but I won't spend time on it.

brownbaerchen · 2026-06-03T14:36:18Z

+
+        # 1D boolean mask resolution
+        first = key[0] if isinstance(key, tuple) and len(key) >= 1 else key
+        if isinstance(first, (DNDarray, torch.Tensor, np.ndarray)) and arr.ndim >= 1:


I think it would be nice to cast numpy arrays and torch tensors to DNDarray in the beginning of this function. Then we always know we have a DNDarray and don't have to worry about stuff like numel or size.

I think it would be nice if we do:

Early out for some special things that we need to be fast

Cast array keys to DNDarray such that we have a key that is a tuple of ellipses, slices, integers, or DNDarrays

Any further processing of keys

What do you think, @ClaudiaComito? Would that make sense?
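A rough sketch of the three-stage ordering proposed in this comment, in plain Python. It is not taken from the PR: `ArrayKey` is a hypothetical stand-in for a DNDarray key, lists stand in for NumPy arrays and torch tensors, and the "further processing" stage only expands an ellipsis and pads missing dimensions (no `None`/newaxis or boolean-mask handling).

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class ArrayKey:
    """Hypothetical stand-in for a key that has been cast to DNDarray."""

    data: Tuple[int, ...]


def normalize_key(key: Any, ndim: int) -> Tuple[Any, ...]:
    # 1. Early out for the cases that need to be fast (here: a plain integer).
    if isinstance(key, int) and not isinstance(key, bool):
        return (key,)

    # 2. Cast array-like entries so that the key becomes a tuple of
    #    ellipses, slices, integers, or ArrayKey objects.
    if not isinstance(key, tuple):
        key = (key,)
    cast = tuple(ArrayKey(tuple(k)) if isinstance(k, list) else k for k in key)

    # 3. Further processing: expand a single Ellipsis, then pad with
    #    full slices up to the number of dimensions.
    n_ellipsis = sum(k is Ellipsis for k in cast)
    if n_ellipsis > 1:
        raise IndexError("an index can only have a single ellipsis")
    if n_ellipsis == 1:
        pos = next(i for i, k in enumerate(cast) if k is Ellipsis)
        fill = ndim - (len(cast) - 1)
        cast = cast[:pos] + (slice(None),) * fill + cast[pos + 1 :]
    if len(cast) > ndim:
        raise IndexError("too many indices for array")
    return cast + (slice(None),) * (ndim - len(cast))
```

After stage 2, every array-like entry has the same type, so stage 3 does not need to distinguish between `numel`, `size`, and similar attributes of different array types.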


ClaudiaComito mentioned this pull request Mar 25, 2022

fix #925: ht.nonzero() returns tuple of 1-D arrays instead of n-D arrays #937

Merged

4 tasks

This was referenced Aug 30, 2022

[Bug]: Indexing with 0-dimensional key #1019

Open

[Bug]: Slice error when array contains an axis of length 0 #1012

Open

ClaudiaComito and others added 27 commits August 31, 2022 09:31

Expand __process_key() to deal with distributed boolean mask

231c1de

Expand test_getitem for distributed single-element indexing, non-dist…

f19f902

…r boolean mask

Add check for matching boolean index / indexed array shapes

7ed435f

Only sort result if input.split != 0

0da7f56

BROKEN: distributed boolean indexing to return stable result for all …

e55c7f9

…splits

Add tests for distributed boolean indexing

75d9314

BROKEN: Fixed key redistribution for input.split != 0.

15a8a28

Expanded boolean indexing tests

8db0511

Set up communication matrix for boolean indexing along non-zero split

291329e

Implement getitem for non-ordered key along split axis

6d986dd

Fix edge-case contiguity mismatch for Allgatherv

f46ae67

merge branch release/1.2.x

4da69fd

Update ubuntu

27ea911

[pre-commit.ci] auto fixes from pre-commit.com hooks

d0fb6c8

for more information, see https://pre-commit.ci

switch back to ubuntu 20.04

0e704d4

pull

f5d7850

Upgrade CI to ubuntu 22.04 and cuda 11.7.1

acfe9bd

avoid unnecessary gathering of test DNDarrays

0fd3d87

early out for resplit of non-distributed DNDarrays

3c4c07c

match split of comparison array to expected output

989e0f4

avoid MPI calls in non-distributed cases

6d66fad

avoid MPI calls in non-distributed resplit

a37b4d3

set default to None

8eebe10

remove print statement

22c5c68

upgrade torch version

c692bff

copy to cpu before comparing

df6a4e5

use ht.allclose instead of np.allclose

af0e721

comment out print statements

ffd87cd

ClaudiaComito added 3 commits May 27, 2026 05:18

vectorized mapping to dest ranks

d745001

refactor unordered setitem, extract communication prep

8e68226

switch unordered getitem to Alltoallv

61f1740

ClaudiaComito added the benchmark PR label May 27, 2026

brownbaerchen requested changes May 27, 2026

View reviewed changes

ClaudiaComito added 2 commits June 2, 2026 06:10

Merge branch 'main' into 914_adv-indexing-outshape-outsplit

c8589c4

Merge branch 'main' into 914_adv-indexing-outshape-outsplit

86ecb5c

brownbaerchen mentioned this pull request Jun 2, 2026

nonzero, where fixes from #938 #2332

Open

7 tasks

brownbaerchen reviewed Jun 3, 2026

View reviewed changes

ClaudiaComito and others added 14 commits June 8, 2026 09:07

Merge branch 'main' into 914_adv-indexing-outshape-outsplit

9feaf97

Reintegrate input sanitation in nonzero

2d5502e

[pre-commit.ci] auto fixes from pre-commit.com hooks

6af95c2

for more information, see https://pre-commit.ci

Apply suggestions from code review

937bc18

Co-authored-by: Thomas Saupe <39156931+brownbaerchen@users.noreply.github.com>

Remove edits

1ec9543

bring back to original state

6bfb650

[pre-commit.ci] auto fixes from pre-commit.com hooks

199518a

for more information, see https://pre-commit.ci

Refactor distr_mask_fast_path

eaa34eb

* First small cleanup * Another small simplification

Merge branch 'main' into 914_adv-indexing-outshape-outsplit

350aaf8

remove legacy indexing leftovers

23ab286

remove orphaned functions after refactoring

b3bf485

better name and docstring for tafkaprocessed_key

c803588

move resolve_indexing_state out of Class

442d83e

rename dedup and move out of class

49406ab

ClaudiaComito wants to merge 271 commits into main from 914_adv-indexing-outshape-outsplit

5 participants


