Core: Interface based DataFile reader and writer API #12298

pvary wants to merge 6 commits into apache:main

Conversation

pvary force-pushed from 907089c to 313c2d5
pvary force-pushed from c528a52 to 9975b4f
liurenjie1024 left a comment:
Thanks @pvary for this proposal, I left some comments.
I will start to collect the differences here between the different writer types (appender/dataWriter/equalityDeleteWriter/positionalDeleteWriter) for reference:
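A sketch of the four builder types as interface stubs may help frame the comparison; the type names come from this thread, but the signatures are illustrative assumptions, not the PR's API:

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.deletes.EqualityDeleteWriter;
import org.apache.iceberg.deletes.PositionDeleteWriter;
import org.apache.iceberg.io.DataWriter;
import org.apache.iceberg.io.FileAppender;

// Illustrative stubs capturing the differences under discussion, not the real API.
interface AppenderBuilder<D> {
  FileAppender<D> build(); // plain appender: no partition metadata required
}

interface DataWriterBuilder<D> {
  DataWriterBuilder<D> spec(PartitionSpec spec); // data files also carry partition/sort order
  DataWriter<D> build();
}

interface EqualityDeleteWriterBuilder<D> {
  EqualityDeleteWriterBuilder<D> equalityFieldIds(int... ids); // unique to equality deletes
  EqualityDeleteWriter<D> build();
}

interface PositionDeleteWriterBuilder<T> {
  PositionDeleteWriter<T> build(); // rows are PositionDelete<T>; the schema is fixed
}
```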
While I think the goal here is a good one, the implementation looks too complex to be workable in its current form. The primary issue that we currently have is adapting object models (like Iceberg's internal one):

```diff
- switch (format) {
-   case AVRO:
-     AvroIterable<ManifestEntry<F>> reader =
-         Avro.read(file)
-             .project(ManifestEntry.wrapFileSchema(Types.StructType.of(fields)))
-             .createResolvingReader(this::newReader)
-             .reuseContainers()
-             .build();
+ CloseableIterable<ManifestEntry<F>> reader =
+     InternalData.read(format, file)
+         .project(ManifestEntry.wrapFileSchema(Types.StructType.of(fields)))
+         .reuseContainers()
+         .build();

-     addCloseable(reader);
+ addCloseable(reader);

-     return CloseableIterable.transform(reader, inheritableMetadata::apply);
+ return CloseableIterable.transform(reader, inheritableMetadata::apply);
-
-   default:
-     throw new UnsupportedOperationException("Invalid format for manifest file: " + format);
- }
```

This shows:

In this PR, there are a lot of other changes as well. I'm looking at one of the simpler Spark cases in the row reader. The builder is initialized from:

```java
return DataFileServiceRegistry.readerBuilder(
    format, InternalRow.class.getName(), file, projection, idToConstant)
```

There are also new static classes in the file. Each creates a new service, and each service creates the builder and object model:

```java
public static class AvroReaderService implements DataFileServiceRegistry.ReaderService {
  @Override
  public DataFileServiceRegistry.Key key() {
    return new DataFileServiceRegistry.Key(FileFormat.AVRO, InternalRow.class.getName());
  }

  @Override
  public ReaderBuilder builder(
      InputFile inputFile,
      Schema readSchema,
      Map<Integer, ?> idToConstant,
      DeleteFilter<?> deleteFilter) {
    return Avro.read(inputFile)
        .project(readSchema)
        .createResolvingReader(schema -> SparkPlannedAvroReader.create(schema, idToConstant));
  }
}
```

In addition, there are now a lot more abstractions:

I think that the next steps are to focus on making this a lot simpler, and there are some good ways to do that:
I'm happy that we agree on the goals. I created this PR to start the conversation. If there are willing reviewers, we can introduce more invasive changes to achieve a better API. I'm all for it!
I think we need to keep these direct transformations to prevent the performance loss that would be caused by multiple transformations between object model -> common model -> file format. We have a matrix of transformations which we need to encode somewhere (see the sketch below):
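To make the matrix concrete, here is a minimal sketch of such a lookup keyed by (file format, object model) pairs; the key mirrors the `DataFileServiceRegistry.Key(format, objectModelName)` idea from this PR, while the model and reader names are placeholders, not the actual registry contents:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the (file format x object model) matrix as a lookup table.
public class TransformationMatrix {
  record Key(String format, String objectModel) {}

  public static void main(String[] args) {
    Map<Key, String> readers = new HashMap<>();
    readers.put(new Key("avro", "iceberg-generic"), "GenericAvroReader");
    readers.put(new Key("parquet", "iceberg-generic"), "GenericParquetReader");
    readers.put(new Key("avro", "spark-internal-row"), "SparkPlannedAvroReader");
    readers.put(new Key("parquet", "spark-internal-row"), "SparkParquetReader");

    // Each (format, model) cell is a direct transformation, avoiding the cost of
    // an object model -> common model -> file format double conversion.
    readers.forEach((key, reader) -> System.out.println(key + " -> " + reader));
  }
}
```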
The InternalData reader has one advantage over the data file readers/writers: its internal object model is static. For the DataFile readers/writers we have multiple object models to handle.
If we allow adding new builders for the file formats, we can remove a good chunk of the boilerplate code. Let me see how this would look.
We need to refactor the Avro positional delete writer for this, or add a positionalWriterFunc. We also need to consider the format-specific configurations, which are different for the appenders and the delete files (DELETE_PARQUET_ROW_GROUP_SIZE_BYTES vs. PARQUET_ROW_GROUP_SIZE_BYTES).
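As an illustration of that split, a minimal sketch of picking the right property key by file content, assuming the existing TableProperties constants named above (the helper itself is hypothetical):

```java
import org.apache.iceberg.FileContent;
import org.apache.iceberg.TableProperties;

// Sketch only: delete writers read the DELETE_-prefixed property, while data
// file appenders read the plain one.
class RowGroupSizeKeys {
  static String forContent(FileContent content) {
    return content == FileContent.DATA
        ? TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES
        : TableProperties.DELETE_PARQUET_ROW_GROUP_SIZE_BYTES;
  }
}
```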
If we are ok with having a new Builder for the readers/writers, then we don't need the service. It was needed to keep the current APIs and the new APIs compatible.
Will do
Will see what could be achieved.
pvary force-pushed from c488d32 to 71ec538
```java
private static class FileWriterBuilderImpl<W extends FileWriter<?, ?>, D, S>
```

I don't think that the type params are quite right here. The row type of FileWriter should be D, right? That means that this should probably be `FileWriterBuilderImpl<D, S, W extends FileWriter<D, ?>>`, right? And it seems suspicious that we aren't correctly carrying through the R param of FileWriter, too. This could probably be parameterized by R since it is determined by the returned writer type.
I left it like this, because it needs some ugly casting magic on the registry side:

```java
FormatModel<PositionDelete<D>, ?> model =
    (FormatModel<PositionDelete<D>, ?>) (FormatModel) modelFor(format, PositionDelete.class);
```

Updated the code based on your recommendation. Check if you like it this way better, or not.
pvary force-pushed from fc5a2f2 to 8a8a67e
```java
// Spark eagerly consumes the batches. So the underlying memory allocated could be
// reused without worrying about subsequent reads clobbering over each other. This
// improves read performance as every batch read doesn't have to pay the cost of
// allocating memory.
```

Nit: Did this need to be reformatted? It's less of a problem if there aren't substantive changes mixed together with reformatting.

Reformatted to comply with line-length restrictions. The increased indentation required the comment to be reformatted.
pvary force-pushed from cecf8c3 to bec9b38
```java
 * @param outputFile destination for the written data
 * @return a configured delete write builder for creating a {@link PositionDeleteWriter}
 */
@SuppressWarnings({"unchecked", "rawtypes"})
```

I don't think this needs all of the casts and rawtypes. This works for me:

```java
@SuppressWarnings("unchecked")
public static <D> FileWriterBuilder<PositionDeleteWriter<D>, ?> positionDeleteWriteBuilder(
    FileFormat format, EncryptedOutputFile outputFile) {
  FormatModel<PositionDelete<D>, ?> model =
      FormatModelRegistry.modelFor(format, PositionDelete.class);
  return FileWriterBuilderImpl.forPositionDelete(model, outputFile);
}
```

This is very strange, because IntelliJ doesn't report the compilation error, but when compiling from the command line we get this:

```
> Task :iceberg-core:compileJava
/Users/petervary/dev/iceberg/core/src/main/java/org/apache/iceberg/formats/FormatModelRegistry.java:182: error: incompatible types: cannot infer type-variable(s) D#1,S
    FormatModel<PositionDelete<D>, ?> model = modelFor(format, PositionDelete.class);
                                                      ^
  (argument mismatch; Class<PositionDelete> cannot be converted to Class<? extends PositionDelete<D#2>>)
  where D#1,S,D#2 are type-variables:
    D#1 extends Object declared in method <D#1,S>modelFor(FileFormat,Class<? extends D#1>)
    S extends Object declared in method <D#1,S>modelFor(FileFormat,Class<? extends D#1>)
    D#2 extends Object declared in method <D#2>positionDeleteWriteBuilder(FileFormat,EncryptedOutputFile)
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
1 error
```

Basically, the raw FormatModel<PositionDelete, ?> cannot be converted to FormatModel<PositionDelete<D>, ?>.

Alternatively we can do something like this (equally ugly):

```java
FormatModel<PositionDelete<D>, ?> model =
    modelFor(format, (Class<PositionDelete<D>>) (Class) PositionDelete.class);
```
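For reference, the same inference failure can be reproduced standalone with simplified names; nothing below is from the Iceberg codebase:

```java
import java.util.List;

// Self-contained reproducer: javac cannot infer D = List<T> from a raw Class
// literal, which is exactly the mismatch reported above.
class InferenceRepro {
  interface Model<D, S> {}

  static <D, S> Model<D, S> modelFor(Class<? extends D> type) {
    return new Model<D, S>() {};
  }

  @SuppressWarnings({"unchecked", "rawtypes"})
  static <T> Model<List<T>, ?> listModel() {
    // Does not compile: Class<List> cannot be converted to Class<? extends List<T>>:
    // Model<List<T>, ?> model = modelFor(List.class);

    // The double cast from this thread silences the mismatch:
    return (Model<List<T>, ?>) (Model) modelFor(List.class);
  }
}
```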
| "Equality field ids not supported for this writer type"); | ||
| } | ||
|
|
||
| ModelWriteBuilder<D, S> modelWriteBuilder() { |
Is there a reason to use package-private rather than protected? Looks like these are intended for use in the private subclasses. I think this is the more restrictive option?

According to the Java documentation (https://docs.oracle.com/javase/tutorial/java/javaOO/accesscontrol.html), package-private access is more restrictive than protected. Our current Checkstyle rules require accessor methods for both protected and package-private fields, so regardless of which visibility we choose, we still need accessor methods.
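For reference, a minimal illustration of the two access levels (standard Java semantics, not Iceberg code):

```java
// In package `builders`: a field with no modifier is package-private.
package builders;

public class Base {
  int packagePrivateField;      // visible only to classes in package `builders`
  protected int protectedField; // visible in `builders` AND to subclasses in any package
}

// In another package, a subclass can still reach the protected member:
//
//   package other;
//
//   public class Child extends builders.Base {
//     int read() {
//       return protectedField;         // OK: protected is inherited across packages
//       // return packagePrivateField; // error: not visible outside `builders`
//     }
//   }
```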
```java
      keyMetadata());
}

private static class PositionDeleteFileAppender<T> implements FileAppender<StructLike> {
```
Because of this and some of the suppressions (like "rawtypes"), I took a deeper look into the type params in this class.
I was able to avoid needing this class by updating PositionDeleteWriter so that its FileAppender is parameterized by PositionDelete<T> rather than StructLike (which is a parent of PositionDelete). Here's the diff:
```diff
diff --git a/core/src/main/java/org/apache/iceberg/deletes/PositionDeleteWriter.java b/core/src/main/java/org/apache/iceberg/deletes/PositionDeleteWriter.java
index a8af5e9d0f..6fcd772d59 100644
--- a/core/src/main/java/org/apache/iceberg/deletes/PositionDeleteWriter.java
+++ b/core/src/main/java/org/apache/iceberg/deletes/PositionDeleteWriter.java
@@ -51,7 +51,7 @@ public class PositionDeleteWriter<T> implements FileWriter<PositionDelete<T>, De
   private static final Set<Integer> FILE_AND_POS_FIELD_IDS =
       ImmutableSet.of(DELETE_FILE_PATH.fieldId(), DELETE_FILE_POS.fieldId());

-  private final FileAppender<StructLike> appender;
+  private final FileAppender<PositionDelete<T>> appender;
   private final FileFormat format;
   private final String location;
   private final PartitionSpec spec;
@@ -61,7 +61,7 @@ public class PositionDeleteWriter<T> implements FileWriter<PositionDelete<T>, De
   private DeleteFile deleteFile = null;

   public PositionDeleteWriter(
-      FileAppender<StructLike> appender,
+      FileAppender<PositionDelete<T>> appender,
       FileFormat format,
       String location,
       PartitionSpec spec,
```

Also, I looked into consolidating as much as possible into the parent class, and I think that validations are cleaner if they are put in a validate method on the parent. However, there was an issue with the type param D for PositionDeleteWriteBuilder, where PositionDeleteWriter needs to be constructed in the PositionDeleteWriterBuilder class, because D in FileWriterBuilderImpl is D=PositionDelete<T> and there is no way to identify T in the parent class. That made me keep this model of implementing the build methods in the child classes.
I also think it's less code to use protected instance fields rather than the getter methods. It seems slightly cleaner, but I'm fine if you don't like the change and want to keep the getters. Here's the diff for the other changes:
```diff
diff --git a/core/src/main/java/org/apache/iceberg/formats/FileWriterBuilderImpl.java b/core/src/main/java/org/apache/iceberg/formats/FileWriterBuilderImpl.java
index 85c7464069..e79161bd7c 100644
--- a/core/src/main/java/org/apache/iceberg/formats/FileWriterBuilderImpl.java
+++ b/core/src/main/java/org/apache/iceberg/formats/FileWriterBuilderImpl.java
@@ -20,13 +20,11 @@ package org.apache.iceberg.formats;

 import java.io.IOException;
 import java.nio.ByteBuffer;
-import java.util.List;
 import java.util.Objects;
 import java.util.stream.Collectors;
 import java.util.stream.IntStream;
 import org.apache.iceberg.FileContent;
 import org.apache.iceberg.FileFormat;
-import org.apache.iceberg.Metrics;
 import org.apache.iceberg.MetricsConfig;
 import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.Schema;
@@ -38,21 +36,11 @@ import org.apache.iceberg.deletes.PositionDeleteWriter;
 import org.apache.iceberg.encryption.EncryptedOutputFile;
 import org.apache.iceberg.encryption.EncryptionKeyMetadata;
 import org.apache.iceberg.io.DataWriter;
-import org.apache.iceberg.io.FileAppender;
 import org.apache.iceberg.io.FileWriter;
 import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

 abstract class FileWriterBuilderImpl<W extends FileWriter<D, ?>, D, S>
     implements FileWriterBuilder<W, S> {
-  private final ModelWriteBuilder<D, S> modelWriteBuilder;
-  private final String location;
-  private final FileFormat format;
-  private Schema schema = null;
-  private PartitionSpec spec = null;
-  private StructLike partition = null;
-  private EncryptionKeyMetadata keyMetadata = null;
-  private SortOrder sortOrder = null;
-
   /** Creates a builder for {@link DataWriter} instances for writing data files. */
   static <D, S> FileWriterBuilder<DataWriter<D>, S> forDataFile(
       FormatModel<D, S> model, EncryptedOutputFile outputFile) {
@@ -75,8 +63,20 @@ abstract class FileWriterBuilderImpl<W extends FileWriter<D, ?>, D, S>
     return new PositionDeleteWriterBuilder<>(model, outputFile);
   }

+  private final FileContent content;
+  protected final ModelWriteBuilder<D, S> modelWriteBuilder;
+  protected final String location;
+  protected final FileFormat format;
+  protected Schema schema = null;
+  protected PartitionSpec spec = null;
+  protected StructLike partition = null;
+  protected EncryptionKeyMetadata keyMetadata = null;
+  protected SortOrder sortOrder = null;
+  protected int[] equalityFieldIds = null;
+
   private FileWriterBuilderImpl(
       FormatModel<D, S> model, EncryptedOutputFile outputFile, FileContent content) {
+    this.content = content;
     this.modelWriteBuilder = model.writeBuilder(outputFile).content(content);
     this.location = outputFile.encryptingOutputFile().location();
     this.format = model.format();
@@ -157,40 +157,27 @@ abstract class FileWriterBuilderImpl<W extends FileWriter<D, ?>, D, S>

   @Override
   public FileWriterBuilderImpl<W, D, S> equalityFieldIds(int... fieldIds) {
-    throw new UnsupportedOperationException(
-        "Equality field ids not supported for this writer type");
-  }
-
-  ModelWriteBuilder<D, S> modelWriteBuilder() {
-    return modelWriteBuilder;
-  }
-
-  String location() {
-    return location;
-  }
-
-  FileFormat format() {
-    return format;
-  }
-
-  Schema schema() {
-    return schema;
-  }
-
-  PartitionSpec spec() {
-    return spec;
-  }
+    if (content != FileContent.EQUALITY_DELETES) {
+      throw new UnsupportedOperationException(
+          "Equality field ids not supported for this writer type");
+    }

-  StructLike partition() {
-    return partition;
-  }
+    this.equalityFieldIds = fieldIds;

-  EncryptionKeyMetadata keyMetadata() {
-    return keyMetadata;
+    return this;
   }

-  SortOrder sortOrder() {
-    return sortOrder;
+  protected void validate() {
+    Preconditions.checkState(
+        content != FileContent.EQUALITY_DELETES || equalityFieldIds != null,
+        "Invalid delete field ids for equality delete writer: null");
+    Preconditions.checkState(
+        content == FileContent.POSITION_DELETES || schema != null, "Invalid schema: null");
+    Preconditions.checkArgument(spec != null, "Invalid partition spec: null");
+    Preconditions.checkArgument(
+        spec.isUnpartitioned() || partition != null,
+        "Invalid partition, does not match spec: %s",
+        spec);
   }

   /** Builder for creating {@link DataWriter} instances for writing data files. */
@@ -203,21 +190,9 @@ abstract class FileWriterBuilderImpl<W extends FileWriter<D, ?>, D, S>

     @Override
     public DataWriter<D> build() throws IOException {
-      Preconditions.checkState(schema() != null, "Invalid schema for data writer: null");
-      Preconditions.checkArgument(spec() != null, "Invalid partition spec for data writer: null");
-      Preconditions.checkArgument(
-          spec().isUnpartitioned() || partition() != null,
-          "Invalid partition, does not match spec: %s",
-          spec());
-
+      validate();
       return new DataWriter<>(
-          modelWriteBuilder().build(),
-          format(),
-          location(),
-          spec(),
-          partition(),
-          keyMetadata(),
-          sortOrder());
+          modelWriteBuilder.build(), format, location, spec, partition, keyMetadata, sortOrder);
     }
   }

@@ -227,33 +202,16 @@ abstract class FileWriterBuilderImpl<W extends FileWriter<D, ?>, D, S>
   private static class EqualityDeleteWriterBuilder<D, S>
       extends FileWriterBuilderImpl<EqualityDeleteWriter<D>, D, S> {
-    private int[] equalityFieldIds = null;
-
     private EqualityDeleteWriterBuilder(FormatModel<D, S> model, EncryptedOutputFile outputFile) {
       super(model, outputFile, FileContent.EQUALITY_DELETES);
     }

-    @Override
-    public EqualityDeleteWriterBuilder<D, S> equalityFieldIds(int... fieldIds) {
-      this.equalityFieldIds = fieldIds;
-      return this;
-    }
-
     @Override
     public EqualityDeleteWriter<D> build() throws IOException {
-      Preconditions.checkState(schema() != null, "Invalid schema for equality delete writer: null");
-      Preconditions.checkState(
-          equalityFieldIds != null, "Invalid delete field ids for equality delete writer: null");
-      Preconditions.checkArgument(
-          spec() != null, "Invalid partition spec for equality delete writer: null");
-      Preconditions.checkArgument(
-          spec().isUnpartitioned() || partition() != null,
-          "Invalid partition, does not match spec: %s",
-          spec());
-
+      validate();
       return new EqualityDeleteWriter<>(
-          modelWriteBuilder()
-              .schema(schema())
+          modelWriteBuilder
+              .schema(schema)
               .meta("delete-type", "equality")
               .meta(
                   "delete-field-ids",
@@ -261,12 +219,12 @@ abstract class FileWriterBuilderImpl<W extends FileWriter<D, ?>, D, S>
                       .mapToObj(Objects::toString)
                       .collect(Collectors.joining(", ")))
               .build(),
-          format(),
-          location(),
-          spec(),
-          partition(),
-          keyMetadata(),
-          sortOrder(),
+          format,
+          location,
+          spec,
+          partition,
+          keyMetadata,
+          sortOrder,
           equalityFieldIds);
     }
   }

@@ -284,55 +242,14 @@ abstract class FileWriterBuilderImpl<W extends FileWriter<D, ?>, D, S>

     @Override
     public PositionDeleteWriter<D> build() throws IOException {
-      Preconditions.checkArgument(
-          spec() != null, "Invalid partition spec for position delete writer: null");
-      Preconditions.checkArgument(
-          spec().isUnpartitioned() || partition() != null,
-          "Invalid partition, does not match spec: %s",
-          spec());
-
+      validate();
       return new PositionDeleteWriter<>(
-          new PositionDeleteFileAppender<>(
-              modelWriteBuilder().meta("delete-type", "position").build()),
-          format(),
-          location(),
-          spec(),
-          partition(),
-          keyMetadata());
-    }
-
-    private static class PositionDeleteFileAppender<T> implements FileAppender<StructLike> {
-      private final FileAppender<PositionDelete<T>> appender;
-
-      PositionDeleteFileAppender(FileAppender<PositionDelete<T>> appender) {
-        this.appender = appender;
-      }
-
-      @SuppressWarnings("unchecked")
-      @Override
-      public void add(StructLike positionDelete) {
-        appender.add((PositionDelete<T>) positionDelete);
-      }
-
-      @Override
-      public Metrics metrics() {
-        return appender.metrics();
-      }
-
-      @Override
-      public long length() {
-        return appender.length();
-      }
-
-      @Override
-      public void close() throws IOException {
-        appender.close();
-      }
-
-      @Override
-      public List<Long> splitOffsets() {
-        return appender.splitOffsets();
-      }
+          modelWriteBuilder.meta("delete-type", "position").build(),
+          format,
+          location,
+          spec,
+          partition,
+          keyMetadata);
     }
   }
 }
```

I think this is a bit better and removes some of the casting needed.
I tried to avoid changing PositionDeleteWriter, since that could introduce a breaking change for external users who might be using the writer with a StructLike appender. Let's discuss this further in the API PR.

I kept the attributes private and retained the accessor methods (as required by Checkstyle).

Merged the validation logic as suggested.
Here is what the PR does:

- `ReadBuilder` - Builder for reading data from data files
- `AppenderBuilder` - Builder for writing data to data files
- `ObjectModel` - Providing `ReadBuilder`s and `AppenderBuilder`s for the specific data file format and object model pair
- `AppenderBuilder` - Builder for writing a file
- `DataWriterBuilder` - Builder for generating a data file
- `PositionDeleteWriterBuilder` - Builder for generating a position delete file
- `EqualityDeleteWriterBuilder` - Builder for generating an equality delete file
- No new `ReadBuilder` here - the file format reader builder is reused
- A `WriterBuilder` class which implements the interfaces above (`AppenderBuilder`/`DataWriterBuilder`/`PositionDeleteWriterBuilder`/`EqualityDeleteWriterBuilder`) based on a provided file format specific `AppenderBuilder`
- An `ObjectModelRegistry` which stores the available `ObjectModel`s, and from which engines and users can request the readers (`ReadBuilder`) and writers (`AppenderBuilder`/`DataWriterBuilder`/`PositionDeleteWriterBuilder`/`EqualityDeleteWriterBuilder`) - see the usage sketch after this list
- `GenericObjectModel`s - for reading and writing Iceberg Records
- `SparkObjectModel`s - for reading (vectorized and non-vectorized) and writing Spark InternalRow/ColumnarBatch objects
- `FlinkObjectModels` - for reading and writing Flink RowData objects
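A rough usage sketch of the flow described above; `ObjectModelRegistry` comes from the list, but the method names, the "spark" model name, and the signatures here are assumptions that may differ from the final API:

```java
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.DataWriter;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.apache.spark.sql.catalyst.InternalRow;

// Hypothetical engine-side usage; registry method signatures are guesses.
class RegistryUsageSketch {
  CloseableIterable<InternalRow> read(InputFile file, Schema projection) {
    return ObjectModelRegistry.<InternalRow>readBuilder(FileFormat.PARQUET, "spark", file)
        .project(projection)
        .build();
  }

  DataWriter<InternalRow> write(
      OutputFile file, Schema schema, PartitionSpec spec, StructLike partition) {
    return ObjectModelRegistry.<InternalRow>writeBuilder(FileFormat.PARQUET, "spark", file)
        .schema(schema)
        .spec(spec)
        .partition(partition)
        .build();
  }
}
```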