
Discover and upload Iceberg tables alongside Hudi #189

Draft

tiennguyen-onehouse wants to merge 6 commits into main from iceberg-metrics-pr2-lakeview

Conversation

@tiennguyen-onehouse
Contributor

Summary

Teaches LakeView to detect Iceberg table roots (the metadata/ folder) in addition to Hudi (.hoodie/), and to upload the current metadata.json pointer file to the Onehouse temp bucket on each iteration. The control plane parses the snapshot summary (which carries cumulative total-records, total-files-size, total-data-files, per-snapshot deltas, and the snapshot-log[] history); the gateway-controller side is a separate PR.

Abstractions

  • TableFormatDetector SPI with HudiTableFormatDetector (existing .hoodie/ check moved out of TableDiscoveryService.isHudiTableFolder) and IcebergTableFormatDetector (matches when a directory listing contains a metadata/ subdirectory). TableDiscoveryService iterates registered detectors and tags each discovered Table with its format (see the sketch after this list).
  • IcebergMetadataUploaderService as a sibling of TableMetadataUploaderService rather than a branch inside it. Hudi's active/archived timeline distinction, hoodie.properties bootstrap, and LSM manifest plumbing don't apply to Iceberg, so a separate service is clearer than a branching megaclass. It reuses OnehouseApiClient, PresignedUrlFileUploader, AsyncStorageClient, and StorageUtils directly.
  • TableDiscoveryAndUploadJob.dispatchUpload partitions discovered tables by format and runs the two uploaders concurrently; both runOnce and processTables route through it.
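A minimal sketch of the detector SPI from the first bullet above. The method names, the listing-based contract, and the File accessors (getFilename, isDirectory) are assumptions; File is LakeView's storage listing model and TableFormat is the new enum introduced in this PR, neither is redefined here.

```java
import java.util.List;

// Sketch only, based on the PR description, not the committed signatures.
public interface TableFormatDetector {
  TableFormat tableFormat();

  // Decide from a shallow listing of a candidate directory whether it is a table root.
  boolean isTableRoot(List<File> directoryListing);
}

class HudiTableFormatDetector implements TableFormatDetector {
  @Override
  public TableFormat tableFormat() {
    return TableFormat.HUDI;
  }

  @Override
  public boolean isTableRoot(List<File> directoryListing) {
    // The .hoodie/ check that previously lived in TableDiscoveryService.isHudiTableFolder.
    return directoryListing.stream()
        .anyMatch(f -> f.isDirectory() && f.getFilename().startsWith(".hoodie"));
  }
}

class IcebergTableFormatDetector implements TableFormatDetector {
  @Override
  public TableFormat tableFormat() {
    return TableFormat.ICEBERG;
  }

  @Override
  public boolean isTableRoot(List<File> directoryListing) {
    // Matches when the listing contains a metadata/ subdirectory.
    return directoryListing.stream()
        .anyMatch(f -> f.isDirectory() && "metadata".equals(f.getFilename()));
  }
}
```

TableDiscoveryService can then walk its registered detectors in order and tag each discovered Table with the format of the first detector that matches.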

Wire

  • New TableFormat enum (HUDI, ICEBERG) in api/models/request/.
  • Table model gains tableFormat (defaults to HUDI for backward compat).
  • InitializeSingleTableMetricsCheckpointRequest gains a nullable tableFormat. Server treats absent as HUDI. tableType (COW/MOR) stays @NonNull to preserve the existing wire contract; Iceberg init passes COW as a meaningless placeholder since the server discriminates on tableFormat first (see the sketch after this list).
  • New constants: ICEBERG_METADATA_FOLDER_NAME = "metadata", ICEBERG_METADATA_FILE_SUFFIX = ".metadata.json".
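A hedged sketch of the request-model change from the third bullet above. The Lombok usage and field set beyond tableType/tableFormat are assumptions; TableType and TableFormat are the existing and new enums referenced in this section.

```java
import lombok.Builder;
import lombok.NonNull;
import lombok.Value;

// Sketch only: the real request carries more fields. Shown here is just the
// tableFormat / tableType interplay described in the bullet above.
@Value
@Builder
public class InitializeSingleTableMetricsCheckpointRequest {

  // Stays @NonNull to preserve the existing wire contract; Iceberg init passes
  // COW here as a placeholder and the server ignores it once tableFormat = ICEBERG.
  @NonNull TableType tableType;

  // Nullable on purpose: older LakeView builds omit it and the server treats
  // the absent value as HUDI.
  TableFormat tableFormat;
}
```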

What gets uploaded for Iceberg

Exactly one file per iteration — the latest metadata.json (picked by lexicographic order, which is correct for both Hadoop-catalog v{N}.metadata.json and Hive/Glue/Spark 00000-<uuid>.metadata.json naming). Checkpoint tracks the last uploaded filename; if it hasn't changed since last run, the iteration is a no-op.
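A rough sketch of the per-iteration no-op check; the checkpoint accessor and uploader helpers are assumed names, not the real IcebergMetadataUploaderService API.

```java
// Sketch only: illustrates "one file per iteration, skipped when unchanged".
boolean uploadIfNewMetadataJson(Table table, Checkpoint checkpoint) {
  // e.g. "v12.metadata.json" or "00012-<uuid>.metadata.json"
  String latest = pickLatestMetadataJson(listMetadataFolder(table));
  if (latest.equals(checkpoint.getLastUploadedFile())) {
    return true;  // pointer unchanged since the last run: the iteration is a no-op
  }
  uploadToTempBucket(table, latest);           // presigned-URL upload of the single file
  return updateCheckpoint(checkpoint, latest); // remember the filename for the next run
}
```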

Out of scope (separate PRs)

  • Control-plane parsing of Iceberg metadata.json and emit to OpenSearch (gateway-controller PR with IcebergCommitMetadataParser).
  • Proto field on TableMetricsCheckpoint: added on the idls PR #1939 branch (80d81701).

Test plan

  • ./gradlew :lakeview:test --tests "ai.onehouse.metadata_extractor.*" green locally (incl. 3 new test files: detector pairs + pickLatestMetadataJson)
  • Existing TableDiscoveryServiceTest and TableDiscoveryAndUploadJobTest pass after the constructor signature changes
  • CI green
  • Manual: end-to-end against s3://aadsharma-quanton-test/spark4_iceberg_variant.db/t_variant_1/ once control-plane parser lands

🤖 Generated with Claude Code

tiennguyen-onehouse and others added 2 commits May 12, 2026 19:30
Extends LakeView to find Iceberg tables (metadata/*.metadata.json) in addition
to Hudi tables (.hoodie/) and upload their current metadata.json to the
control plane for parsing. Iceberg upload is a parallel orchestrator rather
than a Hudi-coupled branch: the active/archived timeline, hoodie.properties
bootstrap, and LSM manifest plumbing don't apply.

Abstractions:
- TableFormatDetector SPI with HudiTableFormatDetector + IcebergTableFormatDetector
  implementations. TableDiscoveryService iterates registered detectors; the
  first match determines the format and tags the Table.
- New IcebergMetadataUploaderService sibling of TableMetadataUploaderService.
  Reuses OnehouseApiClient, PresignedUrlFileUploader, AsyncStorageClient.
- TableDiscoveryAndUploadJob.dispatchUpload partitions discovered tables by
  format and runs the two uploaders concurrently.

Wire:
- TableFormat enum (HUDI, ICEBERG) added to api/models/request.
- Table model carries tableFormat (default HUDI for backward compat).
- InitializeSingleTableMetricsCheckpointRequest gains tableFormat (nullable;
  server treats absent as HUDI). TableType (COW/MOR) stays @NonNull; Iceberg
  uploads pass COW as a meaningless placeholder since the server discriminates
  on tableFormat first.

Depends on idls PR #1939 for ObservedTableFormat on the proto side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LakeviewSyncTool builds TableDiscoveryService and TableDiscoveryAndUploadJob
manually (no Guice in the sync-tool entry path), so the constructor
signature changes from the parent commit broke its compile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor Author

@tiennguyen-onehouse tiennguyen-onehouse left a comment

/push-image iceberg-metrics-pr2-lakeview-may13

Adds the LakeView half of the fast-path that lets us skip the per-cycle
S3 LIST on Iceberg tables when the control plane already knows the
current metadata.json URI (e.g. from AWS Glue's metadata_location
parameter on the catalog entry).

- New TableHint POJO and Database.tableHints map (keyed on tableId).
  Older YAML versions don't carry this field; Jackson leaves it null
  and discovery falls through to the existing listing-based path.
- Table model gains metadataLocationHint (optional).
- TableDiscoveryService merges per-database tableHints into a single
  tableId -> hint map and attaches metadataLocationHint to each Table
  it discovers, when the tableId matches a hint.
- IcebergMetadataUploaderService.uploadIfNewMetadataJson branches on
  the hint: if set, derive the filename, compare against the
  checkpoint, and PUT directly to the hint URI. If the checkpoint
  already matches, no-op without touching S3 at all. Falls back to the
  existing LIST behavior when no hint is provided.
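
A hedged sketch of the hint branch described in the bullet above; the helper names are assumptions.

```java
// Sketch only. The important part is the ordering: the checkpoint comparison
// happens before any storage call, so an unchanged table costs zero S3
// requests for the whole cycle.
boolean uploadIfNewMetadataJson(Table table, Checkpoint checkpoint) {
  String hint = table.getMetadataLocationHint();
  if (hint != null) {
    // ".../metadata/00003-<uuid>.metadata.json" -> filename only
    String filename = hint.substring(hint.lastIndexOf('/') + 1);
    if (filename.equals(checkpoint.getLastUploadedFile())) {
      return true;  // checkpoint already matches: no LIST, no GET, no PUT
    }
    return uploadFileAtUri(table, hint);  // upload the hinted object directly
  }
  // No hint in the YAML (older config versions): existing listing-based path.
  return uploadViaListing(table, checkpoint);
}
```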

The control-plane producer (gw-agent MetricsExtractorFileUpdater) does
not yet emit the new field into the YAML — that needs a lakeview-config
artifact version bump on the gateway-controller side. Once that lands,
the fast-path activates automatically; until then, tableHints stays
null and behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor Author

@tiennguyen-onehouse tiennguyen-onehouse left a comment

/push-jar iceberg-metrics-pr2-lakeview-may13

Contributor Author

@tiennguyen-onehouse tiennguyen-onehouse left a comment

/push-jar iceberg-lakeview-may13

Contributor Author

@tiennguyen-onehouse tiennguyen-onehouse left a comment

/push-jar iceberg-lakeview-may13-v2

…tion

Concern 2: the previous IcebergTableFormatDetector matched any directory
containing a sub-directory named "metadata", which produced false positives
across customer warehouses (Spark checkpoint dirs, custom layouts, schema
folders, etc.). Replace the SPI ordering with per-Database declarative
routing — each Database in the parser YAML now declares its tableFormat
(default HUDI for backward compat), and TableDiscoveryService picks the
single matching detector for that database. The Iceberg detector also
becomes strict: requires metadata/ AND at least one *.metadata.json inside
it (one extra LIST during discovery only, not per upload cycle).

Concern 1: pickLatestMetadataJson lex-sorted filenames, so for the Hadoop
catalog naming v{N}.metadata.json (unpadded), v2.metadata.json beat
v10.metadata.json — a stale pointer. Resolve in three tiers: catalog hint,
then metadata/version-hint.text when present (canonical Iceberg lookup),
then numeric-aware sort on the leading integer. Empty metadata/ now
increments a NO_SUCH_KEY failure counter instead of silently returning
true, so phantom tables surface in dashboards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
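
A sketch of the three-tier resolution described in the commit message above. The class wrapper, helper methods, and the version-hint.text filename reconstruction are illustrative assumptions, not the committed code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: pick the current metadata.json in three tiers.
class MetadataJsonResolverSketch {

  Optional<String> resolveCurrentMetadataJson(Table table) {
    // Tier 1: catalog hint (e.g. Glue metadata_location forwarded through the YAML).
    String hint = table.getMetadataLocationHint();
    if (hint != null) {
      return Optional.of(hint.substring(hint.lastIndexOf('/') + 1));
    }
    // Tier 2: metadata/version-hint.text, the canonical Iceberg lookup when present.
    Optional<String> versionHint = readVersionHint(table);  // file body, e.g. "12"
    if (versionHint.isPresent()) {
      return versionHint.map(v -> "v" + v.trim() + ".metadata.json");
    }
    // Tier 3: numeric-aware sort on the leading integer, so v10 beats v2 and
    // 00010-<uuid> beats 00002-<uuid>. An empty listing bubbles up as
    // Optional.empty() and the caller counts it as a NO_SUCH_KEY failure.
    List<String> filenames = listMetadataJsonFiles(table);
    return filenames.stream().max(Comparator.comparingLong((String name) -> leadingNumber(name)));
  }

  static long leadingNumber(String filename) {
    // "v12.metadata.json" -> 12, "00012-<uuid>.metadata.json" -> 12
    Matcher m = Pattern.compile("^v?(\\d+)").matcher(filename);
    return m.find() ? Long.parseLong(m.group(1)) : -1L;
  }
}
```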
Contributor Author

@tiennguyen-onehouse tiennguyen-onehouse left a comment

/push-jar iceberg-lakeview-may13-v2

S3 ListObjectsV2 surfaces "subdirectories" as CommonPrefixes whose
Prefix string carries a trailing slash (e.g. "metadata/"). The storage
client preserves that in File.filename, so the exact-string check
"metadata".equals(filename) fails and the detector returns false even
though the table layout is a valid Iceberg root.

Concrete repro in staging (Testing-Acme org, table
s3://acme-data-2/iceberg_tables/iceberg_table_test_may13/):
  - S3 list of the table base returns a single CommonPrefix "metadata/"
  - The agent emits tableFormat=ICEBERG + metadataLocationHint in the
    extractor YAML, the v2 lake-view image parses it fine, the
    discovery walks the base path
  - Detector returns false at the base, discoverTablesInPath recurses
    into metadata/, no Iceberg Table is ever emitted, dispatchUpload
    is called with an empty Iceberg set, IcebergMetadataUploaderService
    early-returns, nothing reaches the temp bucket

Strip a single trailing slash before comparing, so both "metadata" and
"metadata/" are accepted. Hudi's detector dodges this naturally via
startsWith(".hoodie"); the Iceberg one switched to equals() and lost
the same tolerance.

Add a unit test that hands the detector a File with filename="metadata/"
— the shape S3 actually produces — which fails on main and passes here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
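
A minimal sketch of the trailing-slash normalization described above, assuming the detector sees the raw prefix string via File.getFilename and that ICEBERG_METADATA_FOLDER_NAME is the "metadata" constant from this PR.

```java
import java.util.List;

// Sketch only: accept both "metadata" (filesystem-style listings) and
// "metadata/" (S3 ListObjectsV2 CommonPrefixes keep the trailing slash).
private static String stripTrailingSlash(String filename) {
  return filename.endsWith("/")
      ? filename.substring(0, filename.length() - 1)
      : filename;
}

boolean isIcebergTableRoot(List<File> directoryListing) {
  return directoryListing.stream()
      .filter(File::isDirectory)
      .anyMatch(f -> ICEBERG_METADATA_FOLDER_NAME.equals(stripTrailingSlash(f.getFilename())));
}
```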
Contributor Author

@tiennguyen-onehouse tiennguyen-onehouse left a comment

/push-jar iceberg-lakeview-may13-v3

Contributor Author

@tiennguyen-onehouse tiennguyen-onehouse left a comment

/push-image iceberg-lakeview-may13-v3

LakeView's InitializeTableMetricsCheckpoint request sends tableFormat
as a Jackson-serialized enum value. Without an explicit @JsonProperty
the wire string was the short Java identifier ("HUDI" / "ICEBERG"),
which protobuf-java-util's JSON parser on external-api cannot map
onto lake.TableFormat (the proto enum uses the canonical names
"TABLE_FORMAT_HUDI" / "TABLE_FORMAT_ICEBERG"). The mismatch is silent:
the parser falls back to enum-zero (TABLE_FORMAT_INVALID), the
checkpoint is persisted without the field, and GenerateCommitMetadata
UploadUrlHandler routes the table through the Hudi back-compat path.
For Hudi this happens to be correct so nothing surfaces; for Iceberg
the filename regex rejects the metadata.json with 400.

Annotate the enum values so the JSON wire string matches the proto
enum name on both sides of the contract. Java callsites remain
unchanged.
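
A sketch of the annotated enum; the proto-style wire names come from the comment above, the rest is assumed.

```java
import com.fasterxml.jackson.annotation.JsonProperty;

// Java callsites keep writing TableFormat.HUDI / TableFormat.ICEBERG; only the
// JSON wire string changes, so protobuf-java-util can map it onto lake.TableFormat.
public enum TableFormat {
  @JsonProperty("TABLE_FORMAT_HUDI")
  HUDI,

  @JsonProperty("TABLE_FORMAT_ICEBERG")
  ICEBERG
}
```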
Contributor

@dharmendersheshma dharmendersheshma left a comment

/push-image iceberg-lakeview-may14-v1

@sonarqubecloud

Quality Gate failed

Failed conditions
41.3% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud
