-
Notifications
You must be signed in to change notification settings - Fork 1.1k
feat: support RunEndEncoded arrays in arrow-json reader and writer #9379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: support RunEndEncoded arrays in arrow-json reader and writer #9379
Conversation
|
Hi @Jefffrey, I’d love to get your feedback whenever you have time. Appreciate it! |
arrow-json/src/reader/mod.rs
Outdated
| assert_eq!(batches.len(), 1); | ||
|
|
||
| let col = batches[0].column(0); | ||
| let run_array = col |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use as_run for more ergonomic downcasting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gotcha, will switch to as_run
arrow-json/src/reader/mod.rs
Outdated
| assert_eq!(run_array.len(), 5); | ||
| assert_eq!(run_array.run_ends().values(), &[2, 5]); | ||
|
|
||
| let values = run_array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and as_string here
arrow-json/src/reader/mod.rs
Outdated
| let mut buf = Vec::new(); | ||
| { | ||
| let mut writer = crate::writer::LineDelimitedWriter::new(&mut buf); | ||
| writer.write_batches(&[&batch]).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it feels like this test best belongs in the write module since it uses write functionality 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, will move it to the writer tests
|
|
||
| let len = pos.len(); | ||
| if len == 0 { | ||
| let empty_run_ends = new_empty_run_ends(run_ends_type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use new_empty_array
Also not sure about calling decode here if len is zero; does that achieve anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, new_empty_array handles this cleanly. The decode call was there to get a typed empty values array but that's redundant when new_empty_array already constructs the full REE structure. So, will simplify it.
| .add_child_data(values_data); | ||
|
|
||
| // Safety: | ||
| // Valid by construction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does valid by construction mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The run_ends are built strictly increasing from the encoding loop, with the last value always equal to len, and values has the same length as run_ends, so the REE layout invariants are satisfied. I can spell that out in the comment instead if you'd prefer, commented for clarity.
| data.slice(i, 1) == data.slice(j, 1) | ||
| } | ||
|
|
||
| fn build_run_ends_array(dt: &DataType, run_ends: Vec<i64>) -> Result<ArrayData, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could probably inline this if we make RunEndEncodedArrayDecoder generic over its index type, instead of defaulting to i64 and then trying to convert after the fact
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea, I'll make the decoder generic over R: RunEndIndexType and dispatch in make_decoder, that removes the i64 conversion entirely.
9ca6d1d to
82ff74d
Compare
|
Thanks for the review @Jefffrey! All the feedbacks have been addressed. |
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
| array => { | ||
| NullableEncoder::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the macro require the {} even tho normally this would be array => NullableEncoder::new(...)?
Which issue does this PR close?
RunEndEncodedarrays inarrow-json#9359.Rationale for this change
The
arrow-jsoncrate does not supportRunEndEncodedarrays. This adds read and write support forRunEndEncodedarrays in the JSON reader and writer.What changes are included in this PR?
DataType::RunEndEncodedmatch arm inmake_decoderfunctionRunEndEncodedArrayDecoderthat decodes JSON values and run-length encodes consecutive equal valuesDataType::RunEndEncodedmatch arm inmake_encoderfunctionRunEndEncodedEncoderthat maps logical indices to physical indices viaget_physical_index()Are these changes tested?
Yes. Added seven tests:
test_read_run_end_encoded- tests basic read with consecutive runstest_run_end_encoded_roundtrip- tests write then read backtest_read_run_end_encoded_consecutive_nulls- tests null run coalescingtest_read_run_end_encoded_all_unique- tests no compression when all values uniquetest_read_run_end_encoded_int16_run_ends- tests Int16 run end typetest_write_run_end_encoded- tests writing string REE arraytest_write_run_end_encoded_int_values- tests writing integer REE arrayAre there any user-facing changes?
Yes.
RunEndEncodedarrays can now be serialized to and deserialized from JSON using thearrow-jsoncrate.