
Discover and upload Iceberg tables alongside Hudi #189

Draft

tiennguyen-onehouse wants to merge 6 commits into main from iceberg-metrics-pr2-lakeview

Conversation

@tiennguyen-onehouse
Contributor

Summary

Teaches LakeView to detect Iceberg table roots (the metadata/ folder) in addition to Hudi (.hoodie/), and to upload the current metadata.json pointer file to the Onehouse temp bucket on each iteration. The control plane parses the snapshot summary (which carries cumulative total-records, total-files-size, total-data-files, per-snapshot deltas, and the snapshot-log[] history); the gateway-controller side is a separate PR.

Abstractions

  • TableFormatDetector SPI with HudiTableFormatDetector (existing .hoodie/ check moved out of TableDiscoveryService.isHudiTableFolder) and IcebergTableFormatDetector (matches when a directory listing contains a metadata/ subdirectory). TableDiscoveryService iterates registered detectors and tags each discovered Table with its format (see the sketch after this list).
  • IcebergMetadataUploaderService as a sibling of TableMetadataUploaderService rather than a branch inside it. Hudi's active/archived timeline distinction, hoodie.properties bootstrap, and LSM manifest plumbing don't apply to Iceberg, so a separate service is clearer than a branching megaclass. It reuses OnehouseApiClient, PresignedUrlFileUploader, AsyncStorageClient, and StorageUtils directly.
  • TableDiscoveryAndUploadJob.dispatchUpload partitions discovered tables by format and runs the two uploaders concurrently; both runOnce and processTables route through it.
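A minimal sketch of the detector SPI from the first bullet above. The method names, the listing-based contract, and the File accessors (getFilename, isDirectory) are assumptions; File is LakeView's storage listing model and TableFormat is the new enum introduced in this PR, neither is redefined here.

```java
import java.util.List;

// Sketch only, based on the PR description, not the committed signatures.
public interface TableFormatDetector {
  TableFormat tableFormat();

  // Decide from a shallow listing of a candidate directory whether it is a table root.
  boolean isTableRoot(List<File> directoryListing);
}

class HudiTableFormatDetector implements TableFormatDetector {
  @Override
  public TableFormat tableFormat() {
    return TableFormat.HUDI;
  }

  @Override
  public boolean isTableRoot(List<File> directoryListing) {
    // The .hoodie/ check that previously lived in TableDiscoveryService.isHudiTableFolder.
    return directoryListing.stream()
        .anyMatch(f -> f.isDirectory() && f.getFilename().startsWith(".hoodie"));
  }
}

class IcebergTableFormatDetector implements TableFormatDetector {
  @Override
  public TableFormat tableFormat() {
    return TableFormat.ICEBERG;
  }

  @Override
  public boolean isTableRoot(List<File> directoryListing) {
    // Matches when the listing contains a metadata/ subdirectory.
    return directoryListing.stream()
        .anyMatch(f -> f.isDirectory() && "metadata".equals(f.getFilename()));
  }
}
```

TableDiscoveryService can then walk its registered detectors in order and tag each discovered Table with the format of the first detector that matches.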

Wire

  • New TableFormat enum (HUDI, ICEBERG) in api/models/request/.
  • Table model gains tableFormat (defaults to HUDI for backward compat).
  • InitializeSingleTableMetricsCheckpointRequest gains a nullable tableFormat. Server treats absent as HUDI. tableType (COW/MOR) stays @NonNull to preserve the existing wire contract; Iceberg init passes COW as a meaningless placeholder since the server discriminates on tableFormat first (see the sketch after this list).
  • New constants: ICEBERG_METADATA_FOLDER_NAME = "metadata", ICEBERG_METADATA_FILE_SUFFIX = ".metadata.json".
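A hedged sketch of the request-model change from the third bullet above. The Lombok usage and field set beyond tableType/tableFormat are assumptions; TableType and TableFormat are the existing and new enums referenced in this section.

```java
import lombok.Builder;
import lombok.NonNull;
import lombok.Value;

// Sketch only: the real request carries more fields. Shown here is just the
// tableFormat / tableType interplay described in the bullet above.
@Value
@Builder
public class InitializeSingleTableMetricsCheckpointRequest {

  // Stays @NonNull to preserve the existing wire contract; Iceberg init passes
  // COW here as a placeholder and the server ignores it once tableFormat = ICEBERG.
  @NonNull TableType tableType;

  // Nullable on purpose: older LakeView builds omit it and the server treats
  // the absent value as HUDI.
  TableFormat tableFormat;
}
```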

What gets uploaded for Iceberg

Exactly one file per iteration — the latest metadata.json (picked by lexicographic order, which is correct for both Hadoop-catalog v{N}.metadata.json and Hive/Glue/Spark 00000-<uuid>.metadata.json naming). Checkpoint tracks the last uploaded filename; if it hasn't changed since last run, the iteration is a no-op.
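A rough sketch of the per-iteration no-op check; the checkpoint accessor and uploader helpers are assumed names, not the real IcebergMetadataUploaderService API.

```java
// Sketch only: illustrates "one file per iteration, skipped when unchanged".
boolean uploadIfNewMetadataJson(Table table, Checkpoint checkpoint) {
  // e.g. "v12.metadata.json" or "00012-<uuid>.metadata.json"
  String latest = pickLatestMetadataJson(listMetadataFolder(table));
  if (latest.equals(checkpoint.getLastUploadedFile())) {
    return true;  // pointer unchanged since the last run: the iteration is a no-op
  }
  uploadToTempBucket(table, latest);           // presigned-URL upload of the single file
  return updateCheckpoint(checkpoint, latest); // remember the filename for the next run
}
```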

Out of scope (separate PRs)

  • Control-plane parsing of Iceberg metadata.json and emit to OpenSearch (gateway-controller PR with IcebergCommitMetadataParser).
  • Proto field on TableMetricsCheckpoint: added on the idls PR #1939 branch (80d81701).

Test plan

  • ./gradlew :lakeview:test --tests "ai.onehouse.metadata_extractor.*" green locally (incl. 3 new test files: detector pairs + pickLatestMetadataJson)
  • Existing TableDiscoveryServiceTest and TableDiscoveryAndUploadJobTest pass after the constructor signature changes
  • CI green
  • Manual: end-to-end against s3://aadsharma-quanton-test/spark4_iceberg_variant.db/t_variant_1/ once control-plane parser lands

🤖 Generated with Claude Code

tiennguyen-onehouse and others added 2 commits May 12, 2026 19:30
Extends LakeView to find Iceberg tables (metadata/*.metadata.json) in addition
to Hudi tables (.hoodie/) and upload their current metadata.json to the
control plane for parsing. Iceberg upload is a parallel orchestrator rather
than a Hudi-coupled branch: the active/archived timeline, hoodie.properties
bootstrap, and LSM manifest plumbing don't apply.

Abstractions:
- TableFormatDetector SPI with HudiTableFormatDetector + IcebergTableFormatDetector
  implementations. TableDiscoveryService iterates registered detectors; the
  first match determines the format and tags the Table.
- New IcebergMetadataUploaderService sibling of TableMetadataUploaderService.
  Reuses OnehouseApiClient, PresignedUrlFileUploader, AsyncStorageClient.
- TableDiscoveryAndUploadJob.dispatchUpload partitions discovered tables by
  format and runs the two uploaders concurrently.

Wire:
- TableFormat enum (HUDI, ICEBERG) added to api/models/request.
- Table model carries tableFormat (default HUDI for backward compat).
- InitializeSingleTableMetricsCheckpointRequest gains tableFormat (nullable;
  server treats absent as HUDI). TableType (COW/MOR) stays @NonNull; Iceberg
  uploads pass COW as a meaningless placeholder since the server discriminates
  on tableFormat first.

Depends on idls PR #1939 for ObservedTableFormat on the proto side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LakeviewSyncTool builds TableDiscoveryService and TableDiscoveryAndUploadJob
manually (no Guice in the sync-tool entry path), so the constructor
signature changes from the parent commit broke its compile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor Author

@tiennguyen-onehouse tiennguyen-onehouse left a comment

/push-image iceberg-metrics-pr2-lakeview-may13

Adds the LakeView half of the fast-path that lets us skip the per-cycle
S3 LIST on Iceberg tables when the control plane already knows the
current metadata.json URI (e.g. from AWS Glue's metadata_location
parameter on the catalog entry).

- New TableHint POJO and Database.tableHints map (keyed on tableId).
  Older YAML versions don't carry this field; Jackson leaves it null
  and discovery falls through to the existing listing-based path.
- Table model gains metadataLocationHint (optional).
- TableDiscoveryService merges per-database tableHints into a single
  tableId -> hint map and attaches metadataLocationHint to each Table
  it discovers, when the tableId matches a hint.
- IcebergMetadataUploaderService.uploadIfNewMetadataJson branches on
  the hint: if set, derive the filename, compare against the
  checkpoint, and PUT directly to the hint URI. If the checkpoint
  already matches, no-op without touching S3 at all. Falls back to the
  existing LIST behavior when no hint is provided.
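
A hedged sketch of the hint branch described in the bullet above; the helper names are assumptions.

```java
// Sketch only. The important part is the ordering: the checkpoint comparison
// happens before any storage call, so an unchanged table costs zero S3
// requests for the whole cycle.
boolean uploadIfNewMetadataJson(Table table, Checkpoint checkpoint) {
  String hint = table.getMetadataLocationHint();
  if (hint != null) {
    // ".../metadata/00003-<uuid>.metadata.json" -> filename only
    String filename = hint.substring(hint.lastIndexOf('/') + 1);
    if (filename.equals(checkpoint.getLastUploadedFile())) {
      return true;  // checkpoint already matches: no LIST, no GET, no PUT
    }
    return uploadFileAtUri(table, hint);  // upload the hinted object directly
  }
  // No hint in the YAML (older config versions): existing listing-based path.
  return uploadViaListing(table, checkpoint);
}
```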

The control-plane producer (gw-agent MetricsExtractorFileUpdater) does
not yet emit the new field into the YAML — that needs a lakeview-config
artifact version bump on the gateway-controller side. Once that lands,
the fast-path activates automatically; until then, tableHints stays
null and behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor Author

@tiennguyen-onehouse tiennguyen-onehouse left a comment

/push-jar iceberg-metrics-pr2-lakeview-may13

Contributor Author

@tiennguyen-onehouse tiennguyen-onehouse left a comment

/push-jar iceberg-lakeview-may13

Contributor Author

@tiennguyen-onehouse tiennguyen-onehouse left a comment

/push-jar iceberg-lakeview-may13-v2

…tion

Concern 2: the previous IcebergTableFormatDetector matched any directory
containing a sub-directory named "metadata", which produced false positives
across customer warehouses (Spark checkpoint dirs, custom layouts, schema
folders, etc.). Replace the SPI ordering with per-Database declarative
routing — each Database in the parser YAML now declares its tableFormat
(default HUDI for backward compat), and TableDiscoveryService picks the
single matching detector for that database. The Iceberg detector also
becomes strict: requires metadata/ AND at least one *.metadata.json inside
it (one extra LIST during discovery only, not per upload cycle).

Concern 1: pickLatestMetadataJson lex-sorted filenames, so for the Hadoop
catalog naming v{N}.metadata.json (unpadded), v2.metadata.json beat
v10.metadata.json — a stale pointer. Resolve in three tiers: catalog hint,
then metadata/version-hint.text when present (canonical Iceberg lookup),
then numeric-aware sort on the leading integer. Empty metadata/ now
increments a NO_SUCH_KEY failure counter instead of silently returning
true, so phantom tables surface in dashboards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
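
A sketch of the three-tier resolution described in the commit message above. The class wrapper, helper methods, and the version-hint.text filename reconstruction are illustrative assumptions, not the committed code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: pick the current metadata.json in three tiers.
class MetadataJsonResolverSketch {

  Optional<String> resolveCurrentMetadataJson(Table table) {
    // Tier 1: catalog hint (e.g. Glue metadata_location forwarded through the YAML).
    String hint = table.getMetadataLocationHint();
    if (hint != null) {
      return Optional.of(hint.substring(hint.lastIndexOf('/') + 1));
    }
    // Tier 2: metadata/version-hint.text, the canonical Iceberg lookup when present.
    Optional<String> versionHint = readVersionHint(table);  // file body, e.g. "12"
    if (versionHint.isPresent()) {
      return versionHint.map(v -> "v" + v.trim() + ".metadata.json");
    }
    // Tier 3: numeric-aware sort on the leading integer, so v10 beats v2 and
    // 00010-<uuid> beats 00002-<uuid>. An empty listing bubbles up as
    // Optional.empty() and the caller counts it as a NO_SUCH_KEY failure.
    List<String> filenames = listMetadataJsonFiles(table);
    return filenames.stream().max(Comparator.comparingLong((String name) -> leadingNumber(name)));
  }

  static long leadingNumber(String filename) {
    // "v12.metadata.json" -> 12, "00012-<uuid>.metadata.json" -> 12
    Matcher m = Pattern.compile("^v?(\\d+)").matcher(filename);
    return m.find() ? Long.parseLong(m.group(1)) : -1L;
  }
}
```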
Contributor Author

@tiennguyen-onehouse tiennguyen-onehouse left a comment

/push-jar iceberg-lakeview-may13-v2

S3 ListObjectsV2 surfaces "subdirectories" as CommonPrefixes whose
Prefix string carries a trailing slash (e.g. "metadata/"). The storage
client preserves that in File.filename, so the exact-string check
"metadata".equals(filename) fails and the detector returns false even
though the table layout is a valid Iceberg root.

Concrete repro in staging (Testing-Acme org, table
s3://acme-data-2/iceberg_tables/iceberg_table_test_may13/):
  - S3 list of the table base returns a single CommonPrefix "metadata/"
  - The agent emits tableFormat=ICEBERG + metadataLocationHint in the
    extractor YAML, the v2 lake-view image parses it fine, the
    discovery walks the base path
  - Detector returns false at the base, discoverTablesInPath recurses
    into metadata/, no Iceberg Table is ever emitted, dispatchUpload
    is called with an empty Iceberg set, IcebergMetadataUploaderService
    early-returns, nothing reaches the temp bucket

Strip a single trailing slash before comparing, so both "metadata" and
"metadata/" are accepted. Hudi's detector dodges this naturally via
startsWith(".hoodie"); the Iceberg one switched to equals() and lost
the same tolerance.

Add a unit test that hands the detector a File with filename="metadata/"
— the shape S3 actually produces — which fails on main and passes here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
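
A minimal sketch of the trailing-slash normalization described above, assuming the detector sees the raw prefix string via File.getFilename and that ICEBERG_METADATA_FOLDER_NAME is the "metadata" constant from this PR.

```java
import java.util.List;

// Sketch only: accept both "metadata" (filesystem-style listings) and
// "metadata/" (S3 ListObjectsV2 CommonPrefixes keep the trailing slash).
private static String stripTrailingSlash(String filename) {
  return filename.endsWith("/")
      ? filename.substring(0, filename.length() - 1)
      : filename;
}

boolean isIcebergTableRoot(List<File> directoryListing) {
  return directoryListing.stream()
      .filter(File::isDirectory)
      .anyMatch(f -> ICEBERG_METADATA_FOLDER_NAME.equals(stripTrailingSlash(f.getFilename())));
}
```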
Contributor Author

@tiennguyen-onehouse tiennguyen-onehouse left a comment

/push-jar iceberg-lakeview-may13-v3

Contributor Author

@tiennguyen-onehouse tiennguyen-onehouse left a comment

/push-image iceberg-lakeview-may13-v3

LakeView's InitializeTableMetricsCheckpoint request sends tableFormat
as a Jackson-serialized enum value. Without an explicit @JsonProperty
the wire string was the short Java identifier ("HUDI" / "ICEBERG"),
which protobuf-java-util's JSON parser on external-api cannot map
onto lake.TableFormat (the proto enum uses the canonical names
"TABLE_FORMAT_HUDI" / "TABLE_FORMAT_ICEBERG"). The mismatch is silent:
the parser falls back to enum-zero (TABLE_FORMAT_INVALID), the
checkpoint is persisted without the field, and GenerateCommitMetadata
UploadUrlHandler routes the table through the Hudi back-compat path.
For Hudi this happens to be correct so nothing surfaces; for Iceberg
the filename regex rejects the metadata.json with 400.

Annotate the enum values so the JSON wire string matches the proto
enum name on both sides of the contract. Java callsites remain
unchanged.
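
A sketch of the annotated enum; the proto-style wire names come from the comment above, the rest is assumed.

```java
import com.fasterxml.jackson.annotation.JsonProperty;

// Java callsites keep writing TableFormat.HUDI / TableFormat.ICEBERG; only the
// JSON wire string changes, so protobuf-java-util can map it onto lake.TableFormat.
public enum TableFormat {
  @JsonProperty("TABLE_FORMAT_HUDI")
  HUDI,

  @JsonProperty("TABLE_FORMAT_ICEBERG")
  ICEBERG
}
```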
Contributor

@dharmendersheshma dharmendersheshma left a comment

/push-image iceberg-lakeview-may14-v1

@sonarqubecloud

Quality Gate failed

Failed conditions
41.3% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud
