
dealing with strange conventions for writing nullable arrays with no nulls #43

@ExpandingMan

The pyarrow output for arrays that contain no nulls is rather strange. By default, the schema that pyarrow writes marks every column as nullable. For a column without nulls, however, it writes a zero-length buffer in place of a normal bitmask. Concretely, the RecordBatch contains a FieldNode for the column reporting zero nulls, along with two Buffer objects (as expected). Instead of describing the all-1's bitmask that you'd expect, the first of these Buffer objects has zero length. It would of course make sense to elide the bitmask when it's unnecessary, but in that case I'd expect there to be no Buffer object at all.
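For comparison, here is a rough sketch of the all-1's bitmask one might have expected to find in that first Buffer. It assumes Arrow's least-significant-bit ordering for validity bits and ignores any buffer padding; `all_valid_bitmask` is a hypothetical helper for illustration, not part of pyarrow or any Arrow library.

```python
def all_valid_bitmask(length: int) -> bytes:
    """Build a validity bitmask marking `length` values as all non-null.

    Assumes bit i lives in byte i // 8 at bit position i % 8
    (least-significant-bit ordering). Padding is deliberately ignored.
    """
    full_bytes, remaining_bits = divmod(length, 8)
    mask = bytearray(b"\xff" * full_bytes)
    if remaining_bits:
        # Set only the low `remaining_bits` bits of the final byte.
        mask.append((1 << remaining_bits) - 1)
    return bytes(mask)


# A 10-element column with no nulls would need two bytes of bitmask,
# whereas the pyarrow output described above carries zero bytes.
expected = all_valid_bitmask(10)
```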

I can see the following options for dealing with this:

  1. Detect that the Buffer has zero length and return an object without a bitmask.
  2. Promote the nullable objects to optionally hold FillArrays of all 1's instead of a normal arrow bitmask.
  3. Allocate a new arrow formatted bitmask in Julia.

Of these options, 3 seems the worst, as it is potentially a huge performance sacrifice. 1 and 2 both have the disadvantage that the container types can no longer be uniquely predicted from the schema, though this issue seems somewhat worse for 1. 2 is a more complicated attempt at a solution that still doesn't really solve the problem, so I think 1 is the only real option.
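As a rough illustration of option 1, the sketch below checks whether the validity buffer has zero length and, if so, returns the values without consulting a bitmask. It is written in plain Python rather than Julia purely for illustration. `FieldNode`, `is_valid`, and `read_column` are hypothetical stand-ins, not the API of any Arrow library, and the sketch assumes least-significant-bit ordering for validity bits.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class FieldNode:
    """Hypothetical stand-in for the FieldNode found in a RecordBatch."""
    length: int
    null_count: int


def is_valid(bitmask: bytes, i: int) -> bool:
    """Return True if bit i of the validity bitmask is set (LSB ordering assumed)."""
    return bool((bitmask[i // 8] >> (i % 8)) & 1)


def read_column(node: FieldNode, validity: bytes,
                values: Sequence[int]) -> List[Optional[int]]:
    """Option 1: treat a zero-length validity buffer as 'no bitmask'."""
    if len(validity) == 0:
        # A missing bitmask is only consistent with a column that has no nulls.
        if node.null_count != 0:
            raise ValueError("zero-length validity buffer but null_count != 0")
        return list(values[:node.length])
    # Otherwise fall back to the normal bitmask-driven path.
    return [values[i] if is_valid(validity, i) else None
            for i in range(node.length)]


# Column written without a bitmask (the case described in this issue).
no_nulls = read_column(FieldNode(length=3, null_count=0), b"", [1, 2, 3])

# Column with a real bitmask: 0b101 marks the middle element as null.
with_nulls = read_column(FieldNode(length=3, null_count=1), bytes([0b101]), [1, 2, 3])
```

Here both branches return the same Python list type, which hides the drawback noted above: in a typed reader, the two branches would yield different container types even though the schema is identical.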
