-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[Parquet] Allow setting page size per column #9353
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| self.page_metrics.num_buffered_rows as usize >= self.props.data_page_row_count_limit() | ||
| || self.encoder.estimated_data_page_size() >= self.props.data_page_size_limit() | ||
| || self.encoder.estimated_data_page_size() | ||
| >= self.props.column_data_page_size_limit(self.descr.path()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is the actual change
| let other_values = write_and_collect_page_values(ColumnPath::from("other"), props, data); | ||
|
|
||
| assert_eq!(col_values, vec![3, 3, 3, 1]); | ||
| assert_eq!(other_values, vec![10]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the observed end-to-end behavior, only the specified column changed, other columns unaffected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested without this code change and the tests fails with
---- column::writer::tests::test_column_writer_column_data_page_size_limit stdout ----
thread 'column::writer::tests::test_column_writer_column_data_page_size_limit' (23590832) panicked at parquet/src/column/writer/
mod.rs:2522:9:
assertion `left == right` failed
left: [10]
right: [3, 3, 3, 1]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @XiangpengHao
I have one suggestion about the testing, but otherwise this looks great to me
cc @etseidl @sdf-jkl and @jhorstmann as you are potentially interested in this as well
parquet/src/file/properties.rs
Outdated
| /// | ||
| /// Note: this is a best effort limit based on value of | ||
| /// [`set_write_batch_size`](Self::set_write_batch_size). | ||
| /// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for the nice comments
| let other_values = write_and_collect_page_values(ColumnPath::from("other"), props, data); | ||
|
|
||
| assert_eq!(col_values, vec![3, 3, 3, 1]); | ||
| assert_eq!(other_values, vec![10]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested without this code change and the tests fails with
---- column::writer::tests::test_column_writer_column_data_page_size_limit stdout ----
thread 'column::writer::tests::test_column_writer_column_data_page_size_limit' (23590832) panicked at parquet/src/column/writer/
mod.rs:2522:9:
assertion `left == right` failed
left: [10]
right: [3, 3, 3, 1]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
etseidl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @XiangpengHao. I'm in favor of more per-column settings, but I think we should keep a consistent model for the settings.
parquet/src/file/properties.rs
Outdated
| /// Note: this is a best effort limit based on the write batch size | ||
| /// | ||
| /// For more details see [`WriterPropertiesBuilder::set_data_page_size_limit`] | ||
| /// and [`WriterPropertiesBuilder::set_column_data_page_size_limit`]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it confusing that this setting will break from all other per-column settings in that it exists on both the WriterProperties and ColumnProperties. I think it would be better to remove data_page_size_limit from the WriterProperties and rely on the value set on the ColumnProperties (either the specific column or the default).
parquet/src/file/properties.rs
Outdated
| /// [`set_write_batch_size`](Self::set_write_batch_size). | ||
| /// | ||
| /// This value acts as the default for all columns and can be overridden with | ||
| /// [`Self::set_column_data_page_size_limit`]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, I think this should move down to the "Setters for any column (global)" section below (line 735 or so) and use the same pattern as other per-column settings.
|
Thank you @etseidl the comments make sense to me, I've updated the pr. |
etseidl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @XiangpengHao, looks great!
| /// See example on [`WriterProperties`] | ||
| #[derive(Debug, Clone)] | ||
| pub struct WriterPropertiesBuilder { | ||
| data_page_size_limit: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 nice
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @XiangpengHao and @etseidl -- this looks great
Page sizes were per-file configuration, this PR allows each column to have their own page size settings.
This enables a few optimizations, for example, configure smaller pages size for columns that need better random access.
Related to
WriterProperties#9242cc @alamb @coracuity @JigaoLuo