Discover and upload Iceberg tables alongside Hudi #189
Draft
tiennguyen-onehouse wants to merge 6 commits into
Conversation
Extends LakeView to find Iceberg tables (metadata/*.metadata.json) in addition to Hudi tables (.hoodie/) and upload their current metadata.json to the control plane for parsing. Iceberg upload is a parallel orchestrator rather than a Hudi-coupled branch: the active/archived timeline, hoodie.properties bootstrap, and LSM manifest plumbing don't apply.

Abstractions:
- TableFormatDetector SPI with HudiTableFormatDetector + IcebergTableFormatDetector implementations. TableDiscoveryService iterates registered detectors; the first match determines the format and tags the Table.
- New IcebergMetadataUploaderService, a sibling of TableMetadataUploaderService. Reuses OnehouseApiClient, PresignedUrlFileUploader, AsyncStorageClient.
- TableDiscoveryAndUploadJob.dispatchUpload partitions discovered tables by format and runs the two uploaders concurrently.

Wire:
- TableFormat enum (HUDI, ICEBERG) added to api/models/request.
- Table model carries tableFormat (default HUDI for backward compat).
- InitializeSingleTableMetricsCheckpointRequest gains tableFormat (nullable; server treats absent as HUDI). TableType (COW/MOR) stays @NonNull; Iceberg uploads pass COW as a meaningless placeholder since the server discriminates on tableFormat first.

Depends on idls PR #1939 for ObservedTableFormat on the proto side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
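For orientation, a minimal sketch of the detector SPI shape. The interface and implementation names come from this PR; the method signature and the stand-in File/TableFormat types below are assumptions so the sketch compiles on its own:

```java
import java.util.List;

// Minimal stand-ins so this sketch is self-contained; the real types are
// LakeView's storage File model and the TableFormat enum added in this PR.
record File(String filename) {}
enum TableFormat { HUDI, ICEBERG }

// Hypothetical shape of the SPI; the method name and signature are assumed.
interface TableFormatDetector {
  TableFormat getTableFormat();

  // Given the listing of a candidate directory, decide whether it is a table root.
  boolean detect(List<File> directoryListing);
}

class HudiTableFormatDetector implements TableFormatDetector {
  @Override
  public TableFormat getTableFormat() {
    return TableFormat.HUDI;
  }

  @Override
  public boolean detect(List<File> directoryListing) {
    // A Hudi table root is identified by its .hoodie/ folder.
    return directoryListing.stream()
        .anyMatch(file -> file.filename().startsWith(".hoodie"));
  }
}
```

TableDiscoveryService then walks the registered detectors in order, stopping at the first match and tagging the discovered Table with that detector's format.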
LakeviewSyncTool builds TableDiscoveryService and TableDiscoveryAndUploadJob manually (no Guice in the sync-tool entry path), so the constructor signature changes from the parent commit broke its compilation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tiennguyen-onehouse left a comment
/push-image iceberg-metrics-pr2-lakeview-may13
Adds the LakeView half of the fast-path that lets us skip the per-cycle S3 LIST on Iceberg tables when the control plane already knows the current metadata.json URI (e.g. from AWS Glue's metadata_location parameter on the catalog entry).

- New TableHint POJO and Database.tableHints map (keyed on tableId). Older YAML versions don't carry this field; Jackson leaves it null and discovery falls through to the existing listing-based path.
- Table model gains metadataLocationHint (optional).
- TableDiscoveryService merges per-database tableHints into a single tableId -> hint map and attaches metadataLocationHint to each Table it discovers, when the tableId matches a hint.
- IcebergMetadataUploaderService.uploadIfNewMetadataJson branches on the hint: if set, derive the filename, compare against the checkpoint, and PUT directly to the hint URI. If the checkpoint already matches, no-op without touching S3 at all. Falls back to the existing LIST behavior when no hint is provided.

The control-plane producer (gw-agent MetricsExtractorFileUpdater) does not yet emit the new field into the YAML — that needs a lakeview-config artifact version bump on the gateway-controller side. Once that lands, the fast-path activates automatically; until then, tableHints stays null and behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
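Roughly, the hint branch could look like this. Only uploadIfNewMetadataJson and metadataLocationHint appear in the change; every helper name below (uploadLatestFromListing, lastUploadedFile, uploadFile) is an assumption, not the actual API:

```java
import java.util.concurrent.CompletableFuture;

// Sketch of the branch inside IcebergMetadataUploaderService; helper names assumed.
public CompletableFuture<Boolean> uploadIfNewMetadataJson(Table table) {
  String hint = table.getMetadataLocationHint();
  if (hint == null) {
    // Older YAML without tableHints: fall through to the LIST-based path.
    return uploadLatestFromListing(table);
  }
  // Derive the current filename from the hint URI; no S3 LIST required.
  String filename = hint.substring(hint.lastIndexOf('/') + 1);
  if (filename.equals(lastUploadedFile(table))) {
    // Checkpoint already matches: no-op without touching S3 at all.
    return CompletableFuture.completedFuture(true);
  }
  // New metadata.json: PUT it directly from the hinted URI.
  return uploadFile(table, hint);
}
```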
tiennguyen-onehouse left a comment
/push-jar iceberg-metrics-pr2-lakeview-may13
tiennguyen-onehouse left a comment
/push-jar iceberg-lakeview-may13
tiennguyen-onehouse left a comment
/push-jar iceberg-lakeview-may13-v2
…tion
Concern 2: the previous IcebergTableFormatDetector matched any directory
containing a sub-directory named "metadata", which produced false positives
across customer warehouses (Spark checkpoint dirs, custom layouts, schema
folders, etc.). Replace the SPI ordering with per-Database declarative
routing — each Database in the parser YAML now declares its tableFormat
(default HUDI for backward compat), and TableDiscoveryService picks the
single matching detector for that database. The Iceberg detector also
becomes strict: requires metadata/ AND at least one *.metadata.json inside
it (one extra LIST during discovery only, not per upload cycle).
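A hypothetical shape of the per-Database declaration in the parser YAML described above (only the tableFormat key is from this change; every other key is illustrative):

```yaml
# Hypothetical YAML shape; only the per-Database tableFormat key is new here.
databases:
  - name: analytics_hudi
    # tableFormat omitted => HUDI, keeping pre-existing YAML backward compatible
    basePaths:
      - s3://example-bucket/hudi/
  - name: analytics_iceberg
    tableFormat: ICEBERG
    basePaths:
      - s3://example-bucket/iceberg/
```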
Concern 1: pickLatestMetadataJson lex-sorted filenames, so for the Hadoop
catalog naming v{N}.metadata.json (unpadded), v2.metadata.json beat
v10.metadata.json — a stale pointer. Resolve in three tiers: catalog hint,
then metadata/version-hint.text when present (canonical Iceberg lookup),
then numeric-aware sort on the leading integer. Empty metadata/ now
increments a NO_SUCH_KEY failure counter instead of silently returning
true, so phantom tables surface in dashboards.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
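A sketch of the tier-3 numeric-aware sort from the Concern 1 fix (tier 1 is the catalog hint, tier 2 is metadata/version-hint.text; the class and method names here are assumptions):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the tier-3 fallback, assuming the catalog hint and
// version-hint.text lookups have already come up empty.
class MetadataJsonPicker {
  private static final Pattern HADOOP_VERSION =
      Pattern.compile("^v(\\d+)\\.metadata\\.json$");

  static Optional<String> pickLatest(List<String> filenames) {
    return filenames.stream()
        .filter(name -> name.endsWith(".metadata.json"))
        .max(Comparator.<String>comparingLong(MetadataJsonPicker::versionOf)
            .thenComparing(Comparator.naturalOrder()));
  }

  // v10.metadata.json must beat v2.metadata.json, so compare the leading
  // integer numerically; non-Hadoop names (e.g. 00000-<uuid>.metadata.json)
  // get -1 and fall back to lexicographic order.
  private static long versionOf(String filename) {
    Matcher m = HADOOP_VERSION.matcher(filename);
    return m.matches() ? Long.parseLong(m.group(1)) : -1L;
  }
}
```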
tiennguyen-onehouse left a comment
/push-jar iceberg-lakeview-may13-v2
S3 ListObjectsV2 surfaces "subdirectories" as CommonPrefixes whose
Prefix string carries a trailing slash (e.g. "metadata/"). The storage
client preserves that in File.filename, so the exact-string check
"metadata".equals(filename) fails and the detector returns false even
though the table layout is a valid Iceberg root.
Concrete repro in staging (Testing-Acme org, table
s3://acme-data-2/iceberg_tables/iceberg_table_test_may13/):
- S3 list of the table base returns a single CommonPrefix "metadata/"
- The agent emits tableFormat=ICEBERG + metadataLocationHint in the
extractor YAML, the v2 lake-view image parses it fine, the
discovery walks the base path
- Detector returns false at the base, discoverTablesInPath recurses
into metadata/, no Iceberg Table is ever emitted, dispatchUpload
is called with an empty Iceberg set, IcebergMetadataUploaderService
early-returns, nothing reaches the temp bucket
Strip a single trailing slash before comparing, so both "metadata" and
"metadata/" are accepted. Hudi's detector dodges this naturally via
startsWith(".hoodie"); the Iceberg one switched to equals() and lost
the same tolerance.
Add a unit test that hands the detector a File with filename="metadata/"
— the shape S3 actually produces — which fails on main and passes here.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
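Both the fix and the regression test are small; a sketch with an assumed helper name (JUnit 5 assumed for the test):

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;
import org.junit.jupiter.api.Test;

class IcebergMetadataFolderCheckTest {
  // Normalize before the exact-match check: S3 ListObjectsV2 reports
  // "subdirectories" as CommonPrefixes with a trailing slash ("metadata/").
  static boolean isIcebergMetadataFolder(String filename) {
    String normalized =
        filename.endsWith("/") ? filename.substring(0, filename.length() - 1) : filename;
    return "metadata".equals(normalized);
  }

  @Test
  void detectsTableRootWhenPrefixCarriesTrailingSlash() {
    assertTrue(isIcebergMetadataFolder("metadata/")); // the shape S3 actually produces
    assertTrue(isIcebergMetadataFolder("metadata"));  // bare name still accepted
    assertFalse(isIcebergMetadataFolder("metadata2/"));
  }
}
```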
tiennguyen-onehouse left a comment
/push-jar iceberg-lakeview-may13-v3
tiennguyen-onehouse left a comment
/push-image iceberg-lakeview-may13-v3
LakeView's InitializeTableMetricsCheckpoint request sends tableFormat as a Jackson-serialized enum value. Without an explicit @JsonProperty the wire string was the short Java identifier ("HUDI" / "ICEBERG"), which protobuf-java-util's JSON parser on external-api cannot map onto lake.TableFormat (the proto enum uses the canonical names "TABLE_FORMAT_HUDI" / "TABLE_FORMAT_ICEBERG").

The mismatch is silent: the parser falls back to enum-zero (TABLE_FORMAT_INVALID), the checkpoint is persisted without the field, and the GenerateCommitMetadataUploadUrl handler routes the table through the Hudi back-compat path. For Hudi this happens to be correct so nothing surfaces; for Iceberg the filename regex rejects the metadata.json with 400.

Annotate the enum values so the JSON wire string matches the proto enum name on both sides of the contract. Java callsites remain unchanged.
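The fix amounts to pinning the wire string on each constant; a sketch, assuming Jackson 2.6+ where @JsonProperty is honored on enum constants:

```java
import com.fasterxml.jackson.annotation.JsonProperty;

// Pin the JSON wire string to the proto enum's canonical names so
// protobuf-java-util on the external-api side can parse it.
public enum TableFormat {
  @JsonProperty("TABLE_FORMAT_HUDI")
  HUDI,

  @JsonProperty("TABLE_FORMAT_ICEBERG")
  ICEBERG
}
```

Callers keep writing TableFormat.ICEBERG; only the serialized string changes.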
dharmendersheshma left a comment
/push-image iceberg-lakeview-may14-v1


Summary
Teaches LakeView to detect Iceberg table roots (the metadata/ folder) in addition to Hudi (.hoodie/), and upload the current metadata.json pointer file to the Onehouse temp bucket on each iteration. The control plane parses the snapshot summary (which carries cumulative total-records, total-files-size, total-data-files, per-snapshot deltas, and the snapshot-log[] history) — gateway-controller side is a separate PR.

Abstractions
- TableFormatDetector SPI with HudiTableFormatDetector (existing .hoodie/ check moved out of TableDiscoveryService.isHudiTableFolder) and IcebergTableFormatDetector (matches when a directory listing contains a metadata/ subdirectory). TableDiscoveryService iterates registered detectors and tags each discovered Table with its format.
- IcebergMetadataUploaderService as a sibling of TableMetadataUploaderService rather than a branch inside it. Hudi's active/archived timeline distinction, hoodie.properties bootstrap, and LSM manifest plumbing don't apply to Iceberg, so a separate service is clearer than a branching megaclass. It reuses OnehouseApiClient, PresignedUrlFileUploader, AsyncStorageClient, and StorageUtils directly.
- TableDiscoveryAndUploadJob.dispatchUpload partitions discovered tables by format and runs the two uploaders concurrently; both runOnce and processTables route through it.

Wire
- TableFormat enum (HUDI, ICEBERG) in api/models/request/.
- Table model gains tableFormat (defaults to HUDI for backward compat).
- InitializeSingleTableMetricsCheckpointRequest gains a nullable tableFormat. Server treats absent as HUDI. tableType (COW/MOR) stays @NonNull to preserve the existing wire contract; Iceberg init passes COW as a meaningless placeholder since the server discriminates on tableFormat first.
- ICEBERG_METADATA_FOLDER_NAME = "metadata", ICEBERG_METADATA_FILE_SUFFIX = ".metadata.json".

What gets uploaded for Iceberg
Exactly one file per iteration — the latest metadata.json (picked by lexicographic order, which is correct for both Hadoop-catalog v{N}.metadata.json and Hive/Glue/Spark 00000-<uuid>.metadata.json naming). Checkpoint tracks the last uploaded filename; if it hasn't changed since last run, the iteration is a no-op.

Out of scope (separate PRs)
- Parsing the uploaded metadata.json and emitting to OpenSearch (gateway-controller PR with IcebergCommitMetadataParser).
- ObservedTableFormat on TableMetricsCheckpoint: added on the idls PR #1939 branch (80d81701).

Test plan
- ./gradlew :lakeview:test --tests "ai.onehouse.metadata_extractor.*" green locally (incl. 3 new test files: detector pairs + pickLatestMetadataJson)
- TableDiscoveryServiceTest and TableDiscoveryAndUploadJobTest pass after the constructor signature changes
- End-to-end verification against s3://aadsharma-quanton-test/spark4_iceberg_variant.db/t_variant_1/ once control-plane parser lands

🤖 Generated with Claude Code