
Conversation

@Abhisheklearn12
Contributor

Which issue does this PR close?

Rationale for this change

The arrow-json crate does not support RunEndEncoded arrays. This adds read and write support for RunEndEncoded arrays in the JSON reader and writer.

What changes are included in this PR?

  • Add DataType::RunEndEncoded match arm in make_decoder function
  • Add RunEndEncodedArrayDecoder that decodes JSON values and run-length encodes consecutive equal values
  • Add DataType::RunEndEncoded match arm in make_encoder function
  • Add RunEndEncodedEncoder that maps logical indices to physical indices via get_physical_index()
  • Add tests for RunEndEncoded read, write, and roundtrip
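
As background for the encoder side, the logical-to-physical mapping looks roughly like the following minimal sketch using public arrow-array APIs (the data is made up, this is not the encoder's internal code, and it assumes get_physical_index returns the physical offset directly as in current arrow-rs):

use arrow_array::{Int32Array, RunArray, StringArray, types::Int32Type};

fn main() {
    // Logical column ["a", "a", "b", "b", "b"] stored as two runs.
    let run_ends = Int32Array::from(vec![2, 5]);
    let values = StringArray::from(vec!["a", "b"]);
    let ree: RunArray<Int32Type> = RunArray::try_new(&run_ends, &values).unwrap();

    // Logical index 3 falls inside the second run, so it maps to physical index 1;
    // the encoder uses this mapping to emit the value "b" for that row.
    assert_eq!(ree.get_physical_index(3), 1);
}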

Are these changes tested?

Yes. Added seven tests:

  • test_read_run_end_encoded - tests basic read with consecutive runs
  • test_run_end_encoded_roundtrip - tests write then read back
  • test_read_run_end_encoded_consecutive_nulls - tests null run coalescing
  • test_read_run_end_encoded_all_unique - tests no compression when all values unique
  • test_read_run_end_encoded_int16_run_ends - tests Int16 run end type
  • test_write_run_end_encoded - tests writing string REE array
  • test_write_run_end_encoded_int_values - tests writing integer REE array

Are there any user-facing changes?

Yes. RunEndEncoded arrays can now be serialized to and deserialized from JSON using the arrow-json crate.
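
A rough roundtrip sketch of the new support (the field name "tag" and the assertions are illustrative, not taken from the PR's tests, and the sketch assumes the reader and writer support added in this PR):

use std::sync::Arc;
use arrow_array::{Array, ArrayRef, Int32Array, RecordBatch, RunArray, StringArray, types::Int32Type};
use arrow_json::{writer::LineDelimitedWriter, ReaderBuilder};
use arrow_schema::{Field, Schema};

fn main() {
    // Logical column ["a", "a", "b", "b", "b"] stored as two runs.
    let run_ends = Int32Array::from(vec![2, 5]);
    let values = StringArray::from(vec!["a", "b"]);
    let ree: RunArray<Int32Type> = RunArray::try_new(&run_ends, &values).unwrap();

    let schema = Arc::new(Schema::new(vec![Field::new("tag", ree.data_type().clone(), true)]));
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(ree) as ArrayRef]).unwrap();

    // Serialize to line-delimited JSON; each logical row becomes one JSON object.
    let mut buf = Vec::new();
    LineDelimitedWriter::new(&mut buf).write_batches(&[&batch]).unwrap();

    // Deserialize with the same schema, so the column comes back as RunEndEncoded.
    let mut reader = ReaderBuilder::new(schema).build(buf.as_slice()).unwrap();
    let round_tripped = reader.next().unwrap().unwrap();
    assert_eq!(round_tripped.num_rows(), 5);
}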

@github-actions bot added the arrow (Changes to the arrow crate) label on Feb 9, 2026
@Abhisheklearn12
Contributor Author

Hi @Jefffrey, I’d love to get your feedback whenever you have time. Appreciate it!

assert_eq!(batches.len(), 1);

let col = batches[0].column(0);
let run_array = col
Contributor

use as_run for more ergonomic downcasting

Contributor Author

gotcha, will switch to as_run

assert_eq!(run_array.len(), 5);
assert_eq!(run_array.run_ends().values(), &[2, 5]);

let values = run_array
Contributor

and as_string here
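
Combining both suggestions, a small sketch of the more ergonomic downcasts (the check helper and its input are illustrative; it assumes the column holds the two-run string data from the test above, and that as_run is the AsArray helper being referred to):

use arrow_array::{Array, ArrayRef, RunArray, cast::AsArray, types::Int32Type};

fn check(col: &ArrayRef) {
    // Manual downcast, as originally written:
    let _manual = col.as_any().downcast_ref::<RunArray<Int32Type>>().unwrap();
    // Ergonomic equivalents from arrow_array::cast::AsArray:
    let run_array = col.as_run::<Int32Type>();
    assert_eq!(run_array.run_ends().values(), &[2, 5]);
    let values = run_array.values().as_string::<i32>();
    assert_eq!(values.value(0), "a");
}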

let mut buf = Vec::new();
{
let mut writer = crate::writer::LineDelimitedWriter::new(&mut buf);
writer.write_batches(&[&batch]).unwrap();
Contributor

it feels like this test best belongs in the write module since it uses write functionality 🤔

Contributor Author

makes sense, will move it to the writer tests


let len = pos.len();
if len == 0 {
let empty_run_ends = new_empty_run_ends(run_ends_type);
Contributor

We can use new_empty_array

Also not sure about calling decode here if len is zero; does that achieve anything?

Contributor Author

Yeah, new_empty_array handles this cleanly. The decode call was there to get a typed empty values array but that's redundant when new_empty_array already constructs the full REE structure. So, will simplify it.
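
For reference, a minimal sketch of the simplified empty-input path (the helper name is made up, and it assumes the decoder produces an ArrayData like the other arrow-json decoders do):

use arrow_array::{new_empty_array, Array};
use arrow_data::ArrayData;
use arrow_schema::DataType;

// Build an empty RunEndEncoded array, children included, in one call.
fn empty_ree(ree_type: &DataType) -> ArrayData {
    new_empty_array(ree_type).to_data()
}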

.add_child_data(values_data);

// Safety:
// Valid by construction
Contributor

What does valid by construction mean?

Contributor Author

@Abhisheklearn12 commented on Feb 10, 2026

The run_ends are built strictly increasing by the encoding loop, with the last value always equal to len, and values has the same length as run_ends, so the REE layout invariants are satisfied. I can spell that out in the comment instead if you'd prefer.
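
A standalone illustration of those invariants and the unsafe build (not the PR's code; the data and child field names are made up for the example):

use std::sync::Arc;
use arrow_array::{Array, Int32Array, StringArray};
use arrow_data::ArrayData;
use arrow_schema::{DataType, Field};

fn main() {
    let run_ends = Int32Array::from(vec![2, 5]);    // strictly increasing, last entry == len
    let values = StringArray::from(vec!["a", "b"]); // same physical length as run_ends
    let ree_type = DataType::RunEndEncoded(
        Arc::new(Field::new("run_ends", DataType::Int32, false)),
        Arc::new(Field::new("values", DataType::Utf8, true)),
    );

    // Safety: run_ends is strictly increasing and its last entry equals the
    // logical length (5), and values has the same physical length, so the
    // RunEndEncoded layout invariants hold and validation can be skipped.
    let data = unsafe {
        ArrayData::builder(ree_type)
            .len(5)
            .add_child_data(run_ends.into_data())
            .add_child_data(values.into_data())
            .build_unchecked()
    };
    assert_eq!(data.len(), 5);
}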

data.slice(i, 1) == data.slice(j, 1)
}

fn build_run_ends_array(dt: &DataType, run_ends: Vec<i64>) -> Result<ArrayData, ArrowError> {
Contributor

We could probably inline this if we make RunEndEncodedArrayDecoder generic over its index type, instead of defaulting to i64 and then trying to convert after the fact

Contributor Author

Good idea. I'll make the decoder generic over R: RunEndIndexType and dispatch in make_decoder; that removes the i64 conversion entirely.
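
A sketch of what the generic version buys; the decoder itself is internal to arrow-json, so this only shows the run-ends construction, and the helper name is made up:

use arrow_array::{Array, PrimitiveArray, types::RunEndIndexType};
use arrow_data::ArrayData;

// With the decoder generic over R: RunEndIndexType, run ends are collected in
// R::Native directly, so building the run_ends child needs no i64 conversion.
fn build_run_ends<R: RunEndIndexType>(run_ends: Vec<R::Native>) -> ArrayData {
    PrimitiveArray::<R>::from_iter_values(run_ends).into_data()
}

make_decoder then matches on the run-ends field's DataType (Int16, Int32, or Int64) to pick the concrete R at construction time.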

@Abhisheklearn12 force-pushed the feat/support-ree-arrow-json branch from 9ca6d1d to 82ff74d on February 10, 2026 at 18:38
@Abhisheklearn12
Contributor Author

Thanks for the review @Jefffrey! All the feedback has been addressed.

Contributor

@scovich left a comment

LGTM!

Comment on lines +374 to +375
array => {
NullableEncoder::new(
Contributor

Does the macro require the {} even tho normally this would be array => NullableEncoder::new(...)?

Labels

arrow (Changes to the arrow crate)

Development

Successfully merging this pull request may close these issues.

Support RunEndEncoded arrays in arrow-json

3 participants