Reading NDArrays from arrow records Using Binary data #184

Open
ShamsUlAzeem wants to merge 2 commits into fr_fastpy from sa/read-ndarrays-arrow

Conversation

@ShamsUlAzeem

What changes were proposed in this pull request?

Reading NDArrays from arrow records Using Binary data

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Quick checklist

The following checklist helps ensure your PR is complete:

  • Eclipse Contributor Agreement signed, and signed commits - see IP Requirements page for details
  • Reviewed the Contributing Guidelines and followed the steps within.
  • Created tests for any significant new code additions.
  • Relevant tests for your changes are passing.

@ShamsUlAzeem
Author

@agibsonccc @AlexDBlack I'm looking at the changes in this commit: 6f97e6a
What else can we have in the binary data? It could be a bunch of things. For now, I'm assuming ND4J arrays, but it could be something else, like the numpy binary format, or just some other form of binary data. What I can think of here is to use try/catch blocks to find out whether the data can be deserialized into NDArrays (it could be from numpy as well), and if not, just treat it as a BytesWritable.
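The try/catch fallback described above could be sketched roughly as follows. This is an illustrative sketch, not DataVec code: `BinaryProbe` and its signature are hypothetical, and the candidate deserializers are passed in as plain functions (in practice one candidate might wrap something like ND4J's byte-array deserialization).

```java
import java.util.function.Function;

public class BinaryProbe {
    // Try each candidate deserializer in order; if none succeeds,
    // fall back to treating the payload as opaque bytes (the BytesWritable case).
    @SafeVarargs
    public static Object probe(byte[] data, Function<byte[], Object>... candidates) {
        for (Function<byte[], Object> candidate : candidates) {
            try {
                Object result = candidate.apply(data);
                if (result != null) {
                    return result; // deserialized into an NDArray-like object
                }
            } catch (RuntimeException e) {
                // not this format -- try the next candidate
            }
        }
        return data; // opaque binary data
    }
}
```

As Alex notes below, this kind of probing is brittle: two formats can both "successfully" deserialize the same bytes, so it only works when the candidate formats are unambiguous.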

@ShamsUlAzeem
Author

The context here is that when we save an NDArrayWritable as an arrow record, it's saved in a binary data format, and while deserializing we don't really know what that format could be. The schema while saving the record is NDArray, but while deserializing it's Bytes.

@AlexDBlack

Although I can see storing NDArrays as bytes in Arrow format being potentially problematic, it's probably fine if the only use is within DataVec.
If we are trying to convert Arrow data that was provided by the user (say, as part of a konduit serving pipeline), then we can't and shouldn't assume it's in a particular known format (like ND4J or numpy).

In general, a "try/catch for a bunch of common formats" approach will be too brittle; let's avoid that and do it properly. Arrow supports n-dimensional arrays.
Can you clarify the use cases here? Then we can design a better solution...

@ShamsUlAzeem
Author

@AlexDBlack

So, my first use case for storing an NDArray was like this:

Schema customSchema = new Schema.Builder()
        .addColumnNDArray("inputVar", new long[] {10, 10, 10})
        .build();
ArrowRecordWriter arrowRecordWriter = new ArrowRecordWriter(customSchema);

File tmpFile = new File(temporary.getRoot(), "tmp.arrow");
System.out.println("tmpFile: " + tmpFile);
FileSplit fileSplit = new FileSplit(tmpFile);
arrowRecordWriter.initialize(fileSplit, new NumberOfRecordsPartitioner());
arrowRecordWriter.writeBatch(
        Collections.singletonList(
                Collections.singletonList(
                        new NDArrayWritable(Nd4j.ones(10, 10, 10))
                )
        ));

When we store an NDArrayWritable using ArrowRecordWriter, it's saved in the arrow format as a Binary type, as apparent from here:

case NDArray: return field(name, new ArrowType.Binary());

Keeping that in mind, when we try to read it back again into an arrow record using the following code:

Pair<Schema, ArrowWritableRecordBatch> output1 = ArrowConverter.readFromFile(tmpFile);

The internal mapper does map it to a BinaryMetaData,

but when we try to fetch the data using the get method:

System.out.println(output1.getValue().get(0));

The function doesn't really know how to map it to a suitable writable, and the output comes out as a NullWritable, due to an exception which is hit when this is evaluated:

ret.add(ArrowConverter.fromEntry(offset + i, list.get(column), schema.getType(column)));

So, for the mappings of binary data: we should know what the binary data could potentially contain, and if it's not an array, we can assume it's something else. For now, I can think of arrays in the form of binary nd4j or numpy data.
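One cheap way to tell at least the numpy case apart without a full try/catch round-trip: numpy's .npy files begin with the fixed magic bytes 0x93 followed by the ASCII string "NUMPY", which can be checked before attempting any deserialization. This is a hedged illustration; `NpySniffer` is a hypothetical name, not part of DataVec.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NpySniffer {
    // numpy's .npy format starts with the magic byte 0x93 followed by "NUMPY"
    private static final byte[] NPY_MAGIC;
    static {
        byte[] ascii = "NUMPY".getBytes(StandardCharsets.US_ASCII);
        NPY_MAGIC = new byte[ascii.length + 1];
        NPY_MAGIC[0] = (byte) 0x93;
        System.arraycopy(ascii, 0, NPY_MAGIC, 1, ascii.length);
    }

    // Returns true if the buffer starts with the .npy magic header.
    public static boolean looksLikeNpy(byte[] data) {
        return data.length >= NPY_MAGIC.length
                && Arrays.equals(Arrays.copyOf(data, NPY_MAGIC.length), NPY_MAGIC);
    }
}
```

A sniff like this only disambiguates formats that carry a magic header; arbitrary binary blobs would still fall through to the BytesWritable case.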

@ShamsUlAzeem
Author

Or maybe Arrow has a datatype for tensors that we can use here instead.

@agibsonccc

@ShamsUlAzeem no, there is only the Tensor container type. Nd4j-arrow covers this pretty well already.
