[OSS PR #18204] feat(utilities): add external HudiHiveSyncJob for on-demand Hive sync#35

Open: yihua wants to merge 8 commits into master from oss-18204

Conversation

@yihua
Owner

@yihua yihua commented Apr 13, 2026

Mirror of apache#18204 for automated bot review.

Original author: @suryaprasanna
Base branch: master

Summary by CodeRabbit

  • New Features

    • Added HudiHiveSyncJob utility enabling synchronization of Hudi tables with Hive metastore
    • Supports CLI configuration through properties files and command-line parameters
    • Enables registration of unregistered Hudi datasets in the metastore
  • Tests

    • Added comprehensive test coverage for the new Hive sync utility

@coderabbitai

coderabbitai Bot commented Apr 13, 2026

📝 Walkthrough

The PR introduces a new Hive sync utility job for Hudi with CLI argument parsing and table synchronization capabilities, alongside comprehensive test coverage. Additionally, resource cleanup in test utilities is improved with conditional null-guarding and explicit null-assignment to prevent resource leaks.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Hive Test Utilities: hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/testutils/HiveTestUtil.java | Enhanced clear() method to conditionally execute Hive DDL operations only when ddlExecutor is non-null; updated shutdown() method to explicitly nullify resource references (hiveServer, hiveTestService, zkServer, zkService, fileSystem) after closing to prevent leaks. |
| Hive Sync Job Utility: hudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.java | New executable job class that parses CLI arguments via JCommander into a Config object, builds a Spark context, constructs Hive-sync properties from a user-provided file and --hoodie-conf entries, and runs HiveSyncTool to synchronize Hudi tables with error handling and resource cleanup. |
| Hive Sync Job Test: hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHudiHiveSyncJob.java | New JUnit 5 test class validating that HudiHiveSyncJob registers unregistered Hudi datasets in the Hive metastore, including Hive test infrastructure setup, dataset creation via Spark, table existence verification pre- and post-sync, and proper resource cleanup. |

Sequence Diagram

sequenceDiagram
    actor User
    participant HudiHiveSyncJob
    participant HiveSyncTool
    participant HiveMetastore
    participant SparkContext

    User->>HudiHiveSyncJob: main(args)
    HudiHiveSyncJob->>HudiHiveSyncJob: parse CLI args via JCommander
    HudiHiveSyncJob->>SparkContext: create Spark context
    SparkContext-->>HudiHiveSyncJob: context ready
    HudiHiveSyncJob->>HudiHiveSyncJob: construct properties from file + configs
    HudiHiveSyncJob->>HiveSyncTool: new HiveSyncTool(properties)
    HudiHiveSyncJob->>HiveSyncTool: run()
    HiveSyncTool->>HiveMetastore: sync table metadata
    HiveMetastore-->>HiveSyncTool: sync complete
    HudiHiveSyncJob->>HiveSyncTool: close()
    HudiHiveSyncJob->>SparkContext: stop()
    HudiHiveSyncJob-->>User: job finished

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hops of joy through metastore halls,
Where Hive tables sync when duty calls,
With resources cleaned and nulls in place,
This rabbit's code: a tidy embrace!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately describes the main change: adding a new external HudiHiveSyncJob utility for on-demand Hive synchronization, which is the primary focus of the changeset. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps

greptile-apps Bot commented Apr 13, 2026

Greptile Summary

This PR introduces HudiHiveSyncJob, a new standalone Spark utility for running Hive metastore sync on-demand against any Hudi table path, decoupled from ingestion pipelines. It also ships a full integration test backed by an embedded Hive server, and defensively hardens HiveTestUtil with null-checks on shared static fields to prevent cascading failures across test classes.

Key changes:

  • HudiHiveSyncJob.java — new spark-submit-compatible entry point; wraps HiveSyncTool with JCommander config parsing, timer logging, and a finally close block.
  • TestHudiHiveSyncJob.java — end-to-end test that writes a Hudi dataset without Hive registration, then verifies HudiHiveSyncJob.run() correctly registers it in the embedded metastore.
  • HiveTestUtil.java — adds null guards for ddlExecutor in clear(), and assigns null after stopping hiveServer, hiveTestService, zkServer, zkService, and fileSystem in shutdown() to prevent stale-reference bugs across tests.

Issues found:

  • Exception suppression (P1): In run(), if syncHoodieTable() throws and then syncTool.close() also throws inside finally, Java silently discards the primary exception. Using try-with-resources would preserve both via suppressed exception chaining.
  • Dead throws IOException (P2): run() declares throws IOException but all exceptions are caught and re-thrown as unchecked HoodieException; the declaration misleads callers.
  • Mutable props in run() (P2): META_SYNC_BASE_PATH and META_SYNC_BASE_FILE_FORMAT are placed into the shared props field on every run() call rather than in the constructor.
  • getHiveConf() NPE risk (P2): After shutdown() sets hiveServer = null, a call to HiveTestUtil.getHiveConf() would throw a NullPointerException with no diagnostic message.

Confidence Score: 3/5

Safe to merge after addressing the exception-suppression bug in run(); remaining issues are style/cleanup

The P1 exception-suppression issue in the finally block is a real correctness bug: when both syncHoodieTable() and close() fail, the primary sync failure is silently discarded and only the close exception surfaces, making production incidents very hard to diagnose. This warrants a fix before merge. The remaining P2s (dead throws IOException, props mutation in run(), getHiveConf() NPE guard) are style and defensive-coding concerns that do not affect the happy path.

hudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.java — exception handling in run()

Important Files Changed

| Filename | Overview |
|---|---|
| hudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.java | New utility job for on-demand Hive sync; has exception-suppression bug in finally block, misleading throws IOException declaration, and mutable props field modified in run() |
| hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHudiHiveSyncJob.java | Integration test verifying sync registration against a live embedded Hive metastore; test design is sound with correct lifecycle hooks and end-to-end assertion |
| hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/testutils/HiveTestUtil.java | Defensive null-checks added for ddlExecutor in clear() and null assignments after stopping hiveServer/hiveTestService/zkServer/zkService/fileSystem; getHiveConf() still lacks null-check for hiveServer |

Sequence Diagram

sequenceDiagram
    participant CLI as spark-submit / caller
    participant Job as HudiHiveSyncJob
    participant UH as UtilHelpers
    participant JSC as JavaSparkContext
    participant HST as HiveSyncTool
    participant HMS as Hive Metastore

    CLI->>Job: main(args)
    Job->>UH: buildSparkContext(...)
    UH-->>Job: JavaSparkContext
    Job->>Job: new HudiHiveSyncJob(jsc, cfg)
    note over Job: builds props from propsFilePath + --hoodie-conf overrides
    Job->>Job: run()
    Job->>Job: props.put(BASE_PATH, BASE_FILE_FORMAT)
    Job->>HST: new HiveSyncTool(props, HiveConf)
    Job->>HST: syncHoodieTable()
    HST->>HMS: register / update table & partitions
    HMS-->>HST: OK
    HST-->>Job: (done)
    Job->>HST: close() [finally]
    Job->>JSC: stop() [finally in main]

Comments Outside Diff (1)

  1. hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/testutils/HiveTestUtil.java, line 207-209 (link)

    P2 getHiveConf() has no null-check for hiveServer

    The PR now sets hiveServer = null in shutdown(). However, getHiveConf() still dereferences hiveServer without a null-check. TestHudiHiveSyncJob.tableExistsInMetastore() calls HiveTestUtil.getHiveConf(), and if shutdown() were ever called before the assertion completes (e.g., in a test ordering edge case), this would throw a NullPointerException.
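One way to make this failure mode self-explanatory is a guard that fails fast with a descriptive message. The following is a minimal sketch only: `HiveServerHolderSketch`, `FakeServer`, and the message text are hypothetical stand-ins, not the actual HiveTestUtil code.

```java
final class HiveServerHolderSketch {
    /** Hypothetical stand-in for the embedded Hive server held in a static field. */
    static final class FakeServer {
        String conf() {
            return "fake-hive-conf";
        }
    }

    // Mirrors the pattern described above: a shared static reference that shutdown() nulls out.
    private static FakeServer hiveServer;

    static void start() {
        hiveServer = new FakeServer();
    }

    static void shutdown() {
        hiveServer = null;
    }

    /** Fails fast with a diagnostic message instead of a bare NullPointerException. */
    static String getHiveConf() {
        if (hiveServer == null) {
            throw new IllegalStateException(
                "Hive server is not running; call start() before getHiveConf()");
        }
        return hiveServer.conf();
    }
}
```

With a guard like this, a call made after shutdown() reports what went wrong rather than surfacing as an unexplained NullPointerException.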

Reviews (1): Last reviewed commit: "Fixing test set up" | Re-trigger Greptile

Comment on lines +93 to +113
public void run() throws IOException {
  LOG.info("Starting hive sync for {}", cfg.basePath);
  HoodieTimer timer = HoodieTimer.start();
  HiveSyncTool syncTool = null;
  try {
    props.put(META_SYNC_BASE_PATH.key(), cfg.basePath);
    props.put(META_SYNC_BASE_FILE_FORMAT.key(), cfg.baseFileFormat);

    LOG.info("HiveSyncConfig props used to sync data {}", props);
    syncTool = new HiveSyncTool(props, new HiveConf(hadoopConf, HiveConf.class));
    syncTool.syncHoodieTable();
  } catch (Exception e) {
    LOG.error("Exception in running hive-sync", e);
    throw new HoodieException("Hive sync failed", e);
  } finally {
    if (syncTool != null) {
      syncTool.close();
    }
    LOG.info("Hive-sync duration in ms {}", timer.endTimer());
  }
}

P1 Exception suppression in finally block

If syncTool.syncHoodieTable() throws (caught and re-thrown as HoodieException), and then syncTool.close() also throws a HoodieHiveSyncException inside the finally block, the exception from close() replaces the original HoodieException. The primary sync failure and its cause are discarded rather than attached as suppressed exceptions, which makes the real failure hard to diagnose.

Since HiveSyncTool implements AutoCloseable, the cleanest fix is try-with-resources, which attaches any exception thrown by close() to the primary exception as a suppressed exception:

public void run() {
    LOG.info("Starting hive sync for {}", cfg.basePath);
    HoodieTimer timer = HoodieTimer.start();
    props.put(META_SYNC_BASE_PATH.key(), cfg.basePath);
    props.put(META_SYNC_BASE_FILE_FORMAT.key(), cfg.baseFileFormat);
    LOG.info("HiveSyncConfig props used to sync data {}", props);
    try (HiveSyncTool syncTool = new HiveSyncTool(props, new HiveConf(hadoopConf, HiveConf.class))) {
      syncTool.syncHoodieTable();
    } catch (Exception e) {
      LOG.error("Exception in running hive-sync", e);
      throw new HoodieException("Hive sync failed", e);
    } finally {
      LOG.info("Hive-sync duration in ms {}", timer.endTimer());
    }
}
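The difference between the two shapes can be reproduced without Hudi on the classpath. The sketch below is illustrative only: `CloseSuppressionDemo` and `FailingTool` are hypothetical stand-ins for the job and for HiveSyncTool, and plain RuntimeException stands in for HoodieException.

```java
final class CloseSuppressionDemo {
    /** Hypothetical stand-in for a sync tool whose sync and close both fail. */
    static final class FailingTool implements AutoCloseable {
        void sync() {
            throw new IllegalStateException("sync failed");
        }

        @Override
        public void close() {
            throw new IllegalArgumentException("close failed");
        }
    }

    /** Manual try/catch/finally: the exception from close() replaces the wrapped sync failure. */
    static RuntimeException withFinally() {
        FailingTool tool = new FailingTool();
        try {
            try {
                tool.sync();
            } catch (Exception e) {
                throw new RuntimeException("Hive sync failed", e);
            } finally {
                tool.close();
            }
        } catch (RuntimeException e) {
            return e;
        }
        throw new AssertionError("unreachable: close() always throws in this demo");
    }

    /** try-with-resources: the close() failure is attached to the sync failure as suppressed. */
    static RuntimeException withResources() {
        try {
            try (FailingTool tool = new FailingTool()) {
                tool.sync();
            } catch (Exception e) {
                throw new RuntimeException("Hive sync failed", e);
            }
        } catch (RuntimeException e) {
            return e;
        }
        throw new AssertionError("unreachable: sync() always throws in this demo");
    }

    public static void main(String[] args) {
        RuntimeException lost = withFinally();
        System.out.println(lost.getMessage() + ", cause=" + lost.getCause());

        RuntimeException kept = withResources();
        Throwable cause = kept.getCause();
        System.out.println(kept.getMessage() + ", cause=" + cause.getMessage()
            + ", suppressed=" + cause.getSuppressed()[0].getMessage());
    }
}
```

In the first method the caller sees only "close failed" with no cause. In the second, the caller sees "Hive sync failed" with the original sync exception as its cause and the close failure recorded on that cause via getSuppressed().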

}
}

public void run() throws IOException {

P2 Misleading throws IOException declaration

run() declares throws IOException, but it never actually throws a checked IOException. All exceptions from HiveSyncTool are caught in the try/catch block and re-thrown as unchecked HoodieException. The syncTool.close() in the finally block throws HoodieHiveSyncException (also unchecked). The throws IOException is dead and misleads callers into handling an exception that will never be checked-thrown.

Suggested change
- public void run() throws IOException {
+ public void run() {

Comment on lines +97 to +99
try {
  props.put(META_SYNC_BASE_PATH.key(), cfg.basePath);
  props.put(META_SYNC_BASE_FILE_FORMAT.key(), cfg.baseFileFormat);

P2 Mutable shared props modified in run()

props is a field of HudiHiveSyncJob that is built once in the constructor. Calling props.put(...) inside run() mutates the shared field on every invocation. These two properties are known at construction time and should be set in the constructor, keeping run() free of field mutations:

// In the constructor, after building props:
this.props.put(META_SYNC_BASE_PATH.key(), cfg.basePath);
this.props.put(META_SYNC_BASE_FILE_FORMAT.key(), cfg.baseFileFormat);
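The overall shape of that refactor might look like the sketch below. It is a simplified illustration under stated assumptions: `ImmutablePropsJobSketch` is a hypothetical class, the two key constants are placeholders rather than the real META_SYNC_BASE_PATH and META_SYNC_BASE_FILE_FORMAT keys, and run() returns the properties it would hand to the sync tool instead of invoking one.

```java
import java.util.Properties;

final class ImmutablePropsJobSketch {
    // Placeholder keys for illustration; the real job would use the Hudi config constants.
    static final String BASE_PATH_KEY = "example.sync.base.path";
    static final String BASE_FILE_FORMAT_KEY = "example.sync.base.file.format";

    private final Properties props;

    /** All properties, including the two derived from the config, are fixed at construction. */
    ImmutablePropsJobSketch(Properties userProps, String basePath, String baseFileFormat) {
        Properties merged = new Properties();
        merged.putAll(userProps);
        merged.setProperty(BASE_PATH_KEY, basePath);
        merged.setProperty(BASE_FILE_FORMAT_KEY, baseFileFormat);
        this.props = merged;
    }

    /** Reads the field without mutating it; returns a copy of what would be passed on. */
    Properties run() {
        Properties snapshot = new Properties();
        snapshot.putAll(props);
        return snapshot;
    }
}
```

Because the constructor copies the caller's properties, later changes to the caller's object do not leak into the job, and repeated run() calls observe the same values.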


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHudiHiveSyncJob.java (1)

73-77: Avoid swallowing Throwable in cleanup.

Line [75] should catch Exception instead of Throwable to avoid masking serious VM/test framework errors.

🧹 Proposed cleanup adjustment
-    } catch (Throwable t) {
+    } catch (Exception e) {
       // no-op for cleanup failures in tests
     }
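The practical difference is easy to check in isolation. In this minimal sketch, `CleanupCatchDemo` and `runCleanup` are hypothetical names; the Runnable stands in for the HiveTestUtil.clear() call.

```java
final class CleanupCatchDemo {
    /** Runs cleanup, ignoring ordinary exceptions but letting Errors propagate. */
    static String runCleanup(Runnable cleanup) {
        try {
            cleanup.run();
            return "clean";
        } catch (Exception e) {
            // Best-effort cleanup: ordinary failures are reported but not rethrown.
            return "ignored: " + e.getMessage();
        }
    }
}
```

An IllegalStateException thrown by the cleanup is absorbed, while an Error such as OutOfMemoryError or a JUnit AssertionError still reaches the test framework; with catch (Throwable t) both would be swallowed.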
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHudiHiveSyncJob.java`
around lines 73 - 77, In TestHudiHiveSyncJob, change the cleanup block that
calls HiveTestUtil.clear() so it catches Exception (not Throwable) to avoid
swallowing VM/test framework errors; update the catch clause in the try {
HiveTestUtil.clear(); } catch (...) to catch Exception and optionally log the
exception (e.g., using test logger) while keeping the no-op behavior for normal
cleanup failures.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 48bab25b-ffca-4c5a-921c-5d7bf99e23c9

📥 Commits

Reviewing files that changed from the base of the PR and between 35e2bbf and 57e2be4.

📒 Files selected for processing (3)
  • hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/testutils/HiveTestUtil.java
  • hudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.java
  • hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHudiHiveSyncJob.java

public static void main(String[] args) throws IOException {
  final Config cfg = new Config();
  new JCommander(cfg, null, args);
  LOG.info("Cfg received: {}", cfg);

⚠️ Potential issue | 🟠 Major

Sensitive values are being logged.

Line [79] and Line [101] can expose credentials because Config.toString() and props include user/password entries.

🔐 Proposed fix (redact or avoid secret-bearing logs)
-    LOG.info("Cfg received: {}", cfg);
+    LOG.info("Cfg received: basePath={}, baseFileFormat={}, propsFilePath={}, configsCount={}",
+        cfg.basePath, cfg.baseFileFormat, cfg.propsFilePath, cfg.configs.size());

@@
-      LOG.info("HiveSyncConfig props used to sync data {}", props);
+      LOG.info("HiveSyncConfig prepared for sync ({} entries)", props.size());

Also applies to: 101-101, 136-142
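If logging the property names is still useful for debugging, an alternative to logging only a count is to mask secret-bearing values. The helper below is a hedged sketch: `PropsRedactor`, its marker list, and the mask string are assumptions for illustration and are not part of Hudi.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

final class PropsRedactor {
    // Assumed substrings that mark a key as secret-bearing; adjust to the real config keys.
    private static final String[] SENSITIVE_MARKERS = {"pass", "secret", "token"};

    /** Returns a sorted copy of the properties with sensitive values masked. */
    static Map<String, String> redact(Properties props) {
        Map<String, String> safe = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            safe.put(key, isSensitive(key) ? "***" : props.getProperty(key));
        }
        return safe;
    }

    /** Case-insensitive substring match against the marker list. */
    static boolean isSensitive(String key) {
        String lower = key.toLowerCase(Locale.ROOT);
        for (String marker : SENSITIVE_MARKERS) {
            if (lower.contains(marker)) {
                return true;
            }
        }
        return false;
    }
}
```

A call site could then log `PropsRedactor.redact(props)` in place of `props`. Substring matching is deliberately coarse: it may mask a few harmless keys, which is usually preferable to leaking a credential.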

Comment on lines +81 to +85
if (StringUtils.isNullOrEmpty(cfg.sparkMaster)) {
  jsc = UtilHelpers.buildSparkContext("HudiHiveSyncJob", "local[2]", true);
} else {
  jsc = UtilHelpers.buildSparkContext("HudiHiveSyncJob", cfg.sparkMaster, true);
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify actual Spark context builder semantics and conventions in sibling utilities jobs.

set -euo pipefail

echo "== Locate UtilHelpers =="
fd -i "UtilHelpers.java"

echo
echo "== Inspect buildSparkContext declarations/implementations =="
rg -n -C3 --type=java '\bbuildSparkContext\s*\('

echo
echo "== Compare spark-master fallback patterns in utilities jobs =="
rg -n -C3 --type=java 'sparkMaster|local\[2\]|buildSparkContext\("Hudi'

Repository: yihua/hudi

Length of output: 50367


🏁 Script executed:

cat -n hudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.java | head -150

Repository: yihua/hudi

Length of output: 6583


Fix --spark-master fallback to honor environment configuration.

Lines 81–85 hard-code local[2] when cfg.sparkMaster is empty, but the parameter description at lines 125–127 states it should inherit from the environment when not defined. Update to call UtilHelpers.buildSparkContext("HudiHiveSyncJob", cfg.enableHiveSupport) when cfg.sparkMaster is empty, matching the behavior of other utilities like HoodieStreamer and respecting documented configuration precedence.
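The intended precedence can be expressed independently of Spark. The sketch below is an illustration only: `SparkMasterResolver` is a hypothetical helper, and it models just the decision of whether an explicit master was supplied, not the UtilHelpers call itself.

```java
import java.util.Optional;

final class SparkMasterResolver {
    /**
     * Returns the explicitly requested master, or empty when none was given.
     * An empty result means "do not set a master; inherit it from the environment".
     */
    static Optional<String> explicitMaster(String cliSparkMaster) {
        if (cliSparkMaster == null || cliSparkMaster.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(cliSparkMaster.trim());
    }
}
```

A caller would pass the master to the context builder only when the Optional is present, and otherwise use whichever builder overload leaves the master unset, so no local[2] default is baked into the job.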

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.java`
around lines 81 - 85, The Spark master fallback currently hard-codes "local[2]"
when cfg.sparkMaster is empty; change the logic in HudiHiveSyncJob so that when
StringUtils.isNullOrEmpty(cfg.sparkMaster) it calls
UtilHelpers.buildSparkContext("HudiHiveSyncJob", cfg.enableHiveSupport) instead
of using "local[2]" so the job inherits the environment/default Spark master
like HoodieStreamer; keep the existing branch that uses cfg.sparkMaster when
present and ensure you reference UtilHelpers.buildSparkContext and
cfg.enableHiveSupport in the updated call.
