fix #925: ht.nonzero() returns tuple of 1-D arrays instead of n-D arrays#937
Conversation
different python and pytorch versions
for more information, see https://pre-commit.ci
delete example with different split axis
…oc/901-tutorial_update Update tutorial.ipynb
…ocs/927-citation Create CITATION.cff
…nhancement/203-ghactions-matrix run tests on gh actions
Removal of old logo
|
GPU cluster tests are currently disabled on this Pull Request. |
|
Hi @ClaudiaComito. This is the solution that I could come up with at first glance. It works well for the arrays I tested. But I still have to test different edge-cases like 1-D arrays, etc... Just thought I could get your input before proceeding further. |
1d683cf to
96fabe5
Compare
|
My fix failed for 1-D arrays. I think I fixed it now! |
|
I will update the documentation and the tests a little bit later. |
ClaudiaComito
left a comment
There was a problem hiding this comment.
Hey @Mystic-Slice thanks a lot for diving into this right away!
Note that we only use torch operations internally. Our tensors might reside on GPUs, and a numpy operation would require copying the tensor to CPU.
be699be to
dd1b83d
Compare
ClaudiaComito
left a comment
There was a problem hiding this comment.
Well done @Mystic-Slice, we're getting close, thanks a lot! The tests will be more work because many existing functions rely on the previous nonzero format.
|
@Mystic-Slice I changed the base branch of this PR, because the change you implemented is needed in our indexing overhaul PR #938. I've taken care of resolving conflicts (hence the commits). I hope you don't mind. |
|
Not at all. |
Absolutely! |
|
@ClaudiaComito |
|
@ClaudiaComito Hi, I made the fixes. But I am not able to get rid of a few errors that arise in testing. Can you please take a look? |
@Mystic-Slice , the DNDarray metadata always have to be correct. The information doesn't get lost, it gets used in all subsequent operations, and the returned tuple is a tuple of DNDarrays whose metadata will be based on the original one. |
| comm=x.comm, | ||
| balanced=False, | ||
| lcl_nonzero = lcl_nonzero.transpose(0, 1) | ||
|
|
There was a problem hiding this comment.
gout must reflect the transposed global shape
There was a problem hiding this comment.
@Mystic-Slice thanks for the changes. Just two more changes are needed as far as I'm concerned.
Don't worry about the tests, they fail because our getitem/setitem is transitioning (among other things) from the old nonzero format to the new one, so basically every operation fails. We will take care of this in the parent branch/PR.
Ahh okok. I was so lost becoz the changes I made had nothing to do with the errors I got. Thnkx for the clarification. |
|
I made this change because when the DNDarray is converted into a tuple, the meta-data is not really being passed on to the tuple members. This fix solves it. |
ClaudiaComito
left a comment
There was a problem hiding this comment.
I've been losing (lengthy) comments so I'll submit this review now even if it's not quite finished and fix it later. Thanks @Mystic-Slice !
| def nonzero(x: DNDarray) -> Tuple[DNDarray, ...]: | ||
| """ | ||
| Return a :class:`~heat.core.dndarray.DNDarray` containing the indices of the elements that are non-zero.. (using ``torch.nonzero``) | ||
| Return a Tuple of :class:`~heat.core.dndarray.DNDarray`s, one for each dimension of a, |
There was a problem hiding this comment.
"... one for each dimension of x"
| """ | ||
| Return a :class:`~heat.core.dndarray.DNDarray` containing the indices of the elements that are non-zero.. (using ``torch.nonzero``) | ||
| Return a Tuple of :class:`~heat.core.dndarray.DNDarray`s, one for each dimension of a, | ||
| containing the indices of the non-zero elements in that dimension. (using ``torch.nonzero``) |
There was a problem hiding this comment.
I know it was there before you started working on it, but I would remove "(using torch.nonzero)" now.
| can be UNBALANCED as it contains the indices of the non-zero elements on each node. | ||
| Returns an array with one entry for each dimension of ``x``, containing the indices of the non-zero elements in that dimension. | ||
| The values in ``x`` are always tested and returned in row-major, C-style order. | ||
| The values in ``x`` are always tested and returned in column-major, F-style order. |
There was a problem hiding this comment.
No, they are still tested in row-major, C-style order, otherwise nonzero would return indices starting from the last dimension. Here's a good reference: Internal memory layout of an ndarray.
| gout = list(lcl_nonzero.size()) | ||
| gout[0] = x.comm.allreduce(gout[0], MPI.SUM) |
There was a problem hiding this comment.
I had written a lengthy detailed explanation of what's happening here, and then I lost it. Doh. Anyway.
The original torch.nonzero output shape is (number_of_nonzero_elements, number_of_dimensions). The process-local lcl_nonzero only know about how many nonzero elements are on process. But the global output DNDarray must know the total.
That allreduce call replaces the local value of gout[0] (local number of rows = local number of nonzero elements) with the sum of all gout[0] on all processes, in order to synchronize gout to the global (albeit memory-distributed) number of rows of the output.
Problem: lcl_nonzero has been transposed, so it's no longer the rows, but the columns that represent the number of nonzero elements.
My suggestion:
- transpose before the
if splitcheck - adjust line 71
lcl_nonzero[..., x.split] += displs[x.comm.rank]to correct the row (not the column) corresponding to the split dimension - the MPI collective call at line 78 should sum along the dimension containing the number of nonzero elements, now the columns
|
@ClaudiaComito Thanks for the detailed review. I have made the changes. |
ClaudiaComito
left a comment
There was a problem hiding this comment.
Thanks @Mystic-Slice, one more change needed and then I think we're good to go.
| [ | ||
| DNDarray( | ||
| dim_indices, | ||
| gshape=tuple(dim_indices.size()), |
There was a problem hiding this comment.
gshape needs to be the global size of the array (you calculated it with the allreduce call), dim_indices are local slices of the array.
There was a problem hiding this comment.
Ahh makes sense....will make the change
ClaudiaComito
left a comment
There was a problem hiding this comment.
@Mystic-Slice can you update the changelog? I'll merge this one into the parent branch afterwards. Thanks a lot!
Done :) |
ClaudiaComito
left a comment
There was a problem hiding this comment.
Great, thanks a lot @Mystic-Slice , I will merge this into the parent branch and we can take care of the tests with the rest.
eb297fb
into
helmholtz-analytics:914_adv-indexing-outshape-outsplit
Description
Switched out the torch's non-zero function with the NumPy version. Works as described in the issue ticket. Testing to be done.
Issue/s resolved: #925
Due Diligence
Does this change modify the behaviour of other functions? If so, which?
no
skip ci