Conversation
📝 WalkthroughWalkthroughThe PR introduces a new Hive sync utility job for Hudi with CLI argument parsing and table synchronization capabilities, alongside comprehensive test coverage. Additionally, resource cleanup in test utilities is improved with conditional null-guarding and explicit null-assignment to prevent resource leaks. Changes
Sequence DiagramsequenceDiagram
actor User
participant HudiHiveSyncJob
participant HiveSyncTool
participant HiveMetastore
participant SparkContext
User->>HudiHiveSyncJob: main(args)
HudiHiveSyncJob->>HudiHiveSyncJob: parse CLI args via JCommander
HudiHiveSyncJob->>SparkContext: create Spark context
SparkContext-->>HudiHiveSyncJob: context ready
HudiHiveSyncJob->>HudiHiveSyncJob: construct properties from file + configs
HudiHiveSyncJob->>HiveSyncTool: new HiveSyncTool(properties)
HudiHiveSyncJob->>HiveSyncTool: run()
HiveSyncTool->>HiveMetastore: sync table metadata
HiveMetastore-->>HiveSyncTool: sync complete
HudiHiveSyncJob->>HiveSyncTool: close()
HudiHiveSyncJob->>SparkContext: stop()
HudiHiveSyncJob-->>User: job finished
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Greptile SummaryThis PR introduces Key changes:
Issues found:
Confidence Score: 3/5Safe to merge after addressing the exception-suppression bug in run(); remaining issues are style/cleanup The P1 exception-suppression issue in the finally block is a real correctness bug: when both syncHoodieTable() and close() fail, the primary sync failure is silently discarded and only the close exception surfaces, making production incidents very hard to diagnose. This warrants a fix before merge. The remaining P2s (dead throws IOException, props mutation in run(), getHiveConf() NPE guard) are style and defensive-coding concerns that do not affect the happy path. hudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.java — exception handling in run() Important Files Changed
Sequence DiagramsequenceDiagram
participant CLI as spark-submit / caller
participant Job as HudiHiveSyncJob
participant UH as UtilHelpers
participant JSC as JavaSparkContext
participant HST as HiveSyncTool
participant HMS as Hive Metastore
CLI->>Job: main(args)
Job->>UH: buildSparkContext(...)
UH-->>Job: JavaSparkContext
Job->>Job: new HudiHiveSyncJob(jsc, cfg)
note over Job: builds props from propsFilePath + --hoodie-conf overrides
Job->>Job: run()
Job->>Job: props.put(BASE_PATH, BASE_FILE_FORMAT)
Job->>HST: new HiveSyncTool(props, HiveConf)
Job->>HST: syncHoodieTable()
HST->>HMS: register / update table & partitions
HMS-->>HST: OK
HST-->>Job: (done)
Job->>HST: close() [finally]
Job->>JSC: stop() [finally in main]
|
| public void run() throws IOException { | ||
| LOG.info("Starting hive sync for {}", cfg.basePath); | ||
| HoodieTimer timer = HoodieTimer.start(); | ||
| HiveSyncTool syncTool = null; | ||
| try { | ||
| props.put(META_SYNC_BASE_PATH.key(), cfg.basePath); | ||
| props.put(META_SYNC_BASE_FILE_FORMAT.key(), cfg.baseFileFormat); | ||
|
|
||
| LOG.info("HiveSyncConfig props used to sync data {}", props); | ||
| syncTool = new HiveSyncTool(props, new HiveConf(hadoopConf, HiveConf.class)); | ||
| syncTool.syncHoodieTable(); | ||
| } catch (Exception e) { | ||
| LOG.error("Exception in running hive-sync", e); | ||
| throw new HoodieException("Hive sync failed", e); | ||
| } finally { | ||
| if (syncTool != null) { | ||
| syncTool.close(); | ||
| } | ||
| LOG.info("Hive-sync duration in ms {}", timer.endTimer()); | ||
| } | ||
| } |
There was a problem hiding this comment.
Exception suppression in
finally block
If syncTool.syncHoodieTable() throws (caught and re-thrown as HoodieException), and then syncTool.close() also throws a HoodieHiveSyncException inside the finally block, Java will suppress the original HoodieException — the primary sync failure and its cause are silently lost, replaced by the close-failure. This makes debugging very confusing.
Since HiveSyncTool implements AutoCloseable, the cleanest fix is try-with-resources, which automatically adds suppressed exceptions:
public void run() {
LOG.info("Starting hive sync for {}", cfg.basePath);
HoodieTimer timer = HoodieTimer.start();
props.put(META_SYNC_BASE_PATH.key(), cfg.basePath);
props.put(META_SYNC_BASE_FILE_FORMAT.key(), cfg.baseFileFormat);
LOG.info("HiveSyncConfig props used to sync data {}", props);
try (HiveSyncTool syncTool = new HiveSyncTool(props, new HiveConf(hadoopConf, HiveConf.class))) {
syncTool.syncHoodieTable();
} catch (Exception e) {
LOG.error("Exception in running hive-sync", e);
throw new HoodieException("Hive sync failed", e);
} finally {
LOG.info("Hive-sync duration in ms {}", timer.endTimer());
}
}| } | ||
| } | ||
|
|
||
| public void run() throws IOException { |
There was a problem hiding this comment.
Misleading
throws IOException declaration
run() declares throws IOException, but it never actually throws a checked IOException. All exceptions from HiveSyncTool are caught in the try/catch block and re-thrown as unchecked HoodieException. The syncTool.close() in the finally block throws HoodieHiveSyncException (also unchecked). The throws IOException is dead and misleads callers into handling an exception that will never be checked-thrown.
| public void run() throws IOException { | |
| public void run() { |
| try { | ||
| props.put(META_SYNC_BASE_PATH.key(), cfg.basePath); | ||
| props.put(META_SYNC_BASE_FILE_FORMAT.key(), cfg.baseFileFormat); |
There was a problem hiding this comment.
Mutable shared
props modified in run()
props is a field of HudiHiveSyncJob that is built once in the constructor. Calling props.put(...) inside run() mutates the shared field on every invocation. These two properties are known at construction time and should be set in the constructor, keeping run() free of field mutations:
// In the constructor, after building props:
this.props.put(META_SYNC_BASE_PATH.key(), cfg.basePath);
this.props.put(META_SYNC_BASE_FILE_FORMAT.key(), cfg.baseFileFormat);There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHudiHiveSyncJob.java (1)
73-77: Avoid swallowingThrowablein cleanup.Line [75] should catch
Exceptioninstead ofThrowableto avoid masking serious VM/test framework errors.🧹 Proposed cleanup adjustment
- } catch (Throwable t) { + } catch (Exception e) { // no-op for cleanup failures in tests }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHudiHiveSyncJob.java` around lines 73 - 77, In TestHudiHiveSyncJob, change the cleanup block that calls HiveTestUtil.clear() so it catches Exception (not Throwable) to avoid swallowing VM/test framework errors; update the catch clause in the try { HiveTestUtil.clear(); } catch (...) to catch Exception and optionally log the exception (e.g., using test logger) while keeping the no-op behavior for normal cleanup failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.java`:
- Around line 81-85: The Spark master fallback currently hard-codes "local[2]"
when cfg.sparkMaster is empty; change the logic in HudiHiveSyncJob so that when
StringUtils.isNullOrEmpty(cfg.sparkMaster) it calls
UtilHelpers.buildSparkContext("HudiHiveSyncJob", cfg.enableHiveSupport) instead
of using "local[2]" so the job inherits the environment/default Spark master
like HoodieStreamer; keep the existing branch that uses cfg.sparkMaster when
present and ensure you reference UtilHelpers.buildSparkContext and
cfg.enableHiveSupport in the updated call.
---
Nitpick comments:
In
`@hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHudiHiveSyncJob.java`:
- Around line 73-77: In TestHudiHiveSyncJob, change the cleanup block that calls
HiveTestUtil.clear() so it catches Exception (not Throwable) to avoid swallowing
VM/test framework errors; update the catch clause in the try {
HiveTestUtil.clear(); } catch (...) to catch Exception and optionally log the
exception (e.g., using test logger) while keeping the no-op behavior for normal
cleanup failures.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 48bab25b-ffca-4c5a-921c-5d7bf99e23c9
📒 Files selected for processing (3)
hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/testutils/HiveTestUtil.javahudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.javahudi-utilities/src/test/java/org/apache/hudi/utilities/TestHudiHiveSyncJob.java
| public static void main(String[] args) throws IOException { | ||
| final Config cfg = new Config(); | ||
| new JCommander(cfg, null, args); | ||
| LOG.info("Cfg received: {}", cfg); |
There was a problem hiding this comment.
Sensitive values are being logged.
Line [79] and Line [101] can expose credentials because Config.toString() and props include user/password entries.
🔐 Proposed fix (redact or avoid secret-bearing logs)
- LOG.info("Cfg received: {}", cfg);
+ LOG.info("Cfg received: basePath={}, baseFileFormat={}, propsFilePath={}, configsCount={}",
+ cfg.basePath, cfg.baseFileFormat, cfg.propsFilePath, cfg.configs.size());
@@
- LOG.info("HiveSyncConfig props used to sync data {}", props);
+ LOG.info("HiveSyncConfig prepared for sync ({} entries)", props.size());Also applies to: 101-101, 136-142
| if (StringUtils.isNullOrEmpty(cfg.sparkMaster)) { | ||
| jsc = UtilHelpers.buildSparkContext("HudiHiveSyncJob", "local[2]", true); | ||
| } else { | ||
| jsc = UtilHelpers.buildSparkContext("HudiHiveSyncJob", cfg.sparkMaster, true); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify actual Spark context builder semantics and conventions in sibling utilities jobs.
set -euo pipefail
echo "== Locate UtilHelpers =="
fd -i "UtilHelpers.java"
echo
echo "== Inspect buildSparkContext declarations/implementations =="
rg -n -C3 --type=java '\bbuildSparkContext\s*\('
echo
echo "== Compare spark-master fallback patterns in utilities jobs =="
rg -n -C3 --type=java 'sparkMaster|local\[2\]|buildSparkContext\("Hudi'Repository: yihua/hudi
Length of output: 50367
🏁 Script executed:
cat -n hudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.java | head -150Repository: yihua/hudi
Length of output: 6583
Fix --spark-master fallback to honor environment configuration.
Lines 81–85 hard-code local[2] when cfg.sparkMaster is empty, but the parameter description at lines 125–127 states it should inherit from the environment when not defined. Update to call UtilHelpers.buildSparkContext("HudiHiveSyncJob", cfg.enableHiveSupport) when cfg.sparkMaster is empty, matching the behavior of other utilities like HoodieStreamer and respecting documented configuration precedence.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.java`
around lines 81 - 85, The Spark master fallback currently hard-codes "local[2]"
when cfg.sparkMaster is empty; change the logic in HudiHiveSyncJob so that when
StringUtils.isNullOrEmpty(cfg.sparkMaster) it calls
UtilHelpers.buildSparkContext("HudiHiveSyncJob", cfg.enableHiveSupport) instead
of using "local[2]" so the job inherits the environment/default Spark master
like HoodieStreamer; keep the existing branch that uses cfg.sparkMaster when
present and ensure you reference UtilHelpers.buildSparkContext and
cfg.enableHiveSupport in the updated call.
Mirror of apache#18204 for automated bot review.
Original author: @suryaprasanna
Base branch: master
Summary by CodeRabbit
New Features
Tests