Reading NDArrays from arrow records Using Binary data #184

Open
ShamsUlAzeem wants to merge 2 commits into fr_fastpy from sa/read-ndarrays-arrow

Conversation

@ShamsUlAzeem

What changes were proposed in this pull request?

Reading NDArrays from arrow records Using Binary data

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Quick checklist

The following checklist helps ensure your PR is complete:

  • Eclipse Contributor Agreement signed, and signed commits - see IP Requirements page for details
  • Reviewed the Contributing Guidelines and followed the steps within.
  • Created tests for any significant new code additions.
  • Relevant tests for your changes are passing.

@ShamsUlAzeem
Author

@agibsonccc @AlexDBlack I'm looking at the changes in this commit: 6f97e6a
What else can we have in the binary data? It could be a bunch of things. For now, I'm assuming ND4J arrays, but it could be something else, like the numpy binary format, or just some other form of binary data. What I can think of here is to use try/catch blocks to find out whether the data can be deserialized into NDArrays (it could be from numpy as well), and if not, just treat it as a BytesWritable.
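The try/catch fallback described above could be sketched roughly as follows. This is an illustrative sketch, not DataVec code: `BinaryProbe` and its signature are hypothetical, and the candidate deserializers are passed in as plain functions (in practice one candidate might wrap something like ND4J's byte-array deserialization).

```java
import java.util.function.Function;

public class BinaryProbe {
    // Try each candidate deserializer in order; if none succeeds,
    // fall back to treating the payload as opaque bytes (the BytesWritable case).
    @SafeVarargs
    public static Object probe(byte[] data, Function<byte[], Object>... candidates) {
        for (Function<byte[], Object> candidate : candidates) {
            try {
                Object result = candidate.apply(data);
                if (result != null) {
                    return result; // deserialized into an NDArray-like object
                }
            } catch (RuntimeException e) {
                // not this format -- try the next candidate
            }
        }
        return data; // opaque binary data
    }
}
```

As Alex notes below, this kind of probing is brittle: two formats can both "successfully" deserialize the same bytes, so it only works when the candidate formats are unambiguous.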

@ShamsUlAzeem
Author

The context here is that when we save an NDArrayWritable as an arrow record, it's saved in a binary data format, and while deserializing we don't really know what that format could be. The schema while saving the record is NDArray, but while deserializing it's Bytes.

@AlexDBlack

Although I can see storing NDArrays as bytes in Arrow format being potentially problematic, it's probably fine if the only use is within DataVec.
If we are trying to convert Arrow data that was provided by the user (say, as part of a konduit serving pipeline), then we can't and shouldn't assume it's in a particular known format (like ND4J or numpy).

In general, a "try/catch for a bunch of common formats" approach will be too brittle; let's avoid that and do it properly. Arrow supports n-dimensional arrays.
Can you clarify the use cases here? Then we can design a better solution...

@ShamsUlAzeem
Author

@AlexDBlack

So, my first use case for storing an NDArray was like this:

Schema customSchema = new Schema.Builder()
        .addColumnNDArray("inputVar", new long[] {10, 10, 10})
        .build();
ArrowRecordWriter arrowRecordWriter = new ArrowRecordWriter(customSchema);

File tmpFile = new File(temporary.getRoot(), "tmp.arrow");
System.out.println("tmpFile: " + tmpFile);
FileSplit fileSplit = new FileSplit(tmpFile);
arrowRecordWriter.initialize(fileSplit, new NumberOfRecordsPartitioner());
arrowRecordWriter.writeBatch(
        Collections.singletonList(
                Collections.singletonList(
                        new NDArrayWritable(Nd4j.ones(10, 10, 10))
                )
        ));

When we store an NDArrayWritable using ArrowRecordWriter, it's saved in the arrow format as a Binary type, as apparent from here:

case NDArray: return field(name, new ArrowType.Binary());

Keeping that in mind, when we try to read it back again into an arrow record using the following code:

Pair<Schema, ArrowWritableRecordBatch> output1 = ArrowConverter.readFromFile(tmpFile);

The internal mapper does map it to a BinaryMetaData,

but when we try to fetch the data using the get method:

System.out.println(output1.getValue().get(0));

The function doesn't really know how to map it to a suitable writable, and the output comes out as a NullWritable, due to an exception which is hit when this is evaluated:

ret.add(ArrowConverter.fromEntry(offset + i, list.get(column), schema.getType(column)));

So, for the mappings of binary data: we should know what the binary data could potentially contain, and if it's not an array, we can assume it's something else. For now, I can think of arrays in the form of binary nd4j or numpy data.
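One cheap way to tell at least the numpy case apart without a full try/catch round-trip: numpy's .npy files begin with the fixed magic bytes 0x93 followed by the ASCII string "NUMPY", which can be checked before attempting any deserialization. This is a hedged illustration; `NpySniffer` is a hypothetical name, not part of DataVec.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NpySniffer {
    // numpy's .npy format starts with the magic byte 0x93 followed by "NUMPY"
    private static final byte[] NPY_MAGIC;
    static {
        byte[] ascii = "NUMPY".getBytes(StandardCharsets.US_ASCII);
        NPY_MAGIC = new byte[ascii.length + 1];
        NPY_MAGIC[0] = (byte) 0x93;
        System.arraycopy(ascii, 0, NPY_MAGIC, 1, ascii.length);
    }

    // Returns true if the buffer starts with the .npy magic header.
    public static boolean looksLikeNpy(byte[] data) {
        return data.length >= NPY_MAGIC.length
                && Arrays.equals(Arrays.copyOf(data, NPY_MAGIC.length), NPY_MAGIC);
    }
}
```

A sniff like this only disambiguates formats that carry a magic header; arbitrary binary blobs would still fall through to the BytesWritable case.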

@ShamsUlAzeem
Author

Or maybe Arrow has a datatype for tensors that we can use here instead.

@agibsonccc

@ShamsUlAzeem no, there is only the Tensor container type. Nd4j-arrow covers this pretty well already.
