
fix(writer): unify nullable utf8/binary on the NullableData carrier #168

Merged: dfa1 merged 1 commit into main from feat/unify-nullable-carrier on Jun 26, 2026

dfa1 (Owner) commented Jun 26, 2026

Summary

Follow-up to #167. Tackles the nullable-input shape asymmetry (utf8 used String[]-with-nulls, primitives used NullableData) by unifying on the NullableData carrier, and fixes two latent bugs that the asymmetry was hiding.

What changed

  • ChunkImpl — nullable Utf8/Binary now produces NullableData(String[]-with-nulls, validity), like nullable primitives.
  • VarBinEncodingEncoder — a null element encodes as a zero-length slot (masked values child), no more NPE; stats already skip null.
  • ZstdEncodingEncoder — consumes NullableData for utf8/binary; acceptsNullable covers them.
  • CsvExporter — handles MaskedArray (null row → empty field).
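
To make the carrier change concrete, here is a minimal sketch of the two steps described above: deriving a validity mask from a String[] with inline nulls, and encoding a null element as a zero-length slot. `NullableStrings` and `VarBinSketch` are hypothetical stand-ins, not the project's `NullableData` or `VarBinEncodingEncoder`, and the offsets-plus-bytes layout is an assumption about what a VarBin-style encoding looks like.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the NullableData carrier: values plus a validity mask.
record NullableStrings(String[] values, boolean[] validity) {
    // Derive validity from inline nulls (true = present, false = null).
    static NullableStrings of(String[] values) {
        boolean[] validity = new boolean[values.length];
        for (int i = 0; i < values.length; i++) {
            validity[i] = values[i] != null;
        }
        return new NullableStrings(values, validity);
    }
}

// Hypothetical VarBin-style layout: one concatenated byte buffer plus n + 1 offsets.
record VarBinSketch(byte[] bytes, int[] offsets) {
    static VarBinSketch encode(NullableStrings data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int[] offsets = new int[data.values().length + 1];
        for (int i = 0; i < data.values().length; i++) {
            if (data.validity()[i]) {
                byte[] utf8 = data.values()[i].getBytes(StandardCharsets.UTF_8);
                buf.write(utf8, 0, utf8.length);
            }
            // A null element writes no bytes, so its slot has zero length
            // and nothing is dereferenced (no NullPointerException).
            offsets[i + 1] = buf.size();
        }
        return new VarBinSketch(buf.toByteArray(), offsets);
    }
}
```

In this sketch a null and an empty string both occupy a zero-length slot, so only the validity mask tells them apart. That is why the values child needs to travel with its mask.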

Latent bugs fixed

  • Default-path nullable utf8 write threw NullPointerException in VarBin.
  • Nullable utf8 with no nulls in the scanned rows decoded as a bare VarBinArray, silently dropping null info; now consistently MaskedArray.
  • CSV export of any nullable column (primitives too) failed with "unsupported array type".
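
For the CSV case, a rough sketch of the "null row → empty field" rule for a single nullable column follows. `CsvNullSketch` and its methods are hypothetical and not the project's `CsvExporter`. The quoting rule is a minimal RFC 4180-style assumption, since the PR does not show the exporter's actual quoting behaviour.

```java
// Hypothetical sketch: export one nullable utf8 column, one line per row.
final class CsvNullSketch {
    static String exportColumn(String header, String[] values, boolean[] validity) {
        StringBuilder out = new StringBuilder(header).append('\n');
        for (int i = 0; i < values.length; i++) {
            // A masked-out (null) row contributes an empty field.
            if (validity[i]) {
                out.append(quote(values[i]));
            }
            out.append('\n');
        }
        return out.toString();
    }

    // Quote only when the field contains a comma, a double quote, or a newline.
    static String quote(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"") || field.contains("\n");
        return needsQuotes ? "\"" + field.replace("\"", "\"\"") + "\"" : field;
    }
}
```

With this rule a null and an empty string render identically in the output, so the export is lossy with respect to nullness.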

Behavior change

Nullable utf8/binary now always decode as MaskedArray (validity + values child), consistent with nullable primitives. Tests updated (parquet importer, parquet import IT).
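
As an illustration of the decoded shape, the sketch below models a masked utf8 array as a validity mask over a VarBin-style values child. `MaskedUtf8Sketch` is a hypothetical type, not the project's `MaskedArray`, and the field layout is assumed for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a decoded masked array: validity + values child
// (the values child is modelled as concatenated bytes plus n + 1 offsets).
record MaskedUtf8Sketch(byte[] bytes, int[] offsets, boolean[] validity) {
    int length() {
        return validity.length;
    }

    // Returns null for a masked-out row, otherwise the UTF-8 string in that slot.
    String get(int row) {
        if (!validity[row]) {
            return null;
        }
        int start = offsets[row];
        int len = offsets[row + 1] - start;
        return new String(bytes, start, len, StandardCharsets.UTF_8);
    }
}
```

A reader that always receives this shape for a nullable column can treat all-valid and partly-null chunks the same way, instead of branching on whether the scanned rows happened to contain a null.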

Testing

  • Full unit suite (all modules) green.
  • Full integration suite green, incl. new javaWriter_rustReader_masked_nullableUtf8 (Rust reads a Java-written nullable utf8 file) and the existing zstd-nullable ITs.

🤖 Generated with Claude Code

Nullable utf8/binary columns now flow through the same NullableData carrier
as nullable primitives instead of a String[] with inline nulls. ChunkImpl
derives the validity bitmap, the writer routes the column through
MaskedEncoding (or a nullable-capable encoder such as vortex.zstd), and
VarBin treats a null element as a zero-length slot.

Fixes two latent bugs the old String[]-with-nulls path hid:
- default-path nullable utf8 write threw NullPointerException in VarBin
- nullable utf8 with no nulls in the scanned rows decoded as a bare
  VarBinArray, silently dropping null information; it now decodes as a
  MaskedArray (validity + values child) like nullable primitives.

CsvExporter now handles MaskedArray (null row -> empty field), so nullable
columns export instead of failing with "unsupported array type".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
dfa1 force-pushed the feat/unify-nullable-carrier branch from 58407c5 to e18948b on June 26, 2026 15:50
dfa1 merged commit 70d6c89 into main on Jun 26, 2026
6 checks passed
dfa1 deleted the feat/unify-nullable-carrier branch on June 26, 2026 15:50