
[OSS PR #18520] chore: Add Java 17 Hadoop base image and Spark 4.0.1 docker compose s…#48

Open
yihua wants to merge 1 commit into master from oss-18520

Conversation

@yihua
Owner

@yihua yihua commented Apr 17, 2026

Mirror of apache#18520 for automated bot review.

Original author: @voonhous
Base branch: master

Summary by CodeRabbit

Release Notes

  • New Features

    • Added Java 17 runtime support for Hadoop environments alongside existing Java 11 support.
    • Introduced complete Docker Compose configurations for Hadoop 3.4.0 + Hive 3.1.3 + Spark 4.0.1 stack (both amd64 and arm64).
    • Parameterized AWS dependency versions for flexibility.
  • Bug Fixes

    • Improved error handling during library copying in container builds.
  • Chores

    • Updated maintainer labels to modern Docker standards.

…etup

- Introduce base_java17 Hadoop base image to support Spark 4.x (which requires Java 17)
- build_docker_images.sh auto-selects base_java11 or base_java17 based on SPARK_VERSION
- Add docker-compose_hadoop340_hive313_spark401 files for amd64 and arm64
- Parameterize hadoop-aws and aws-java-sdk-bundle versions in spark_base Dockerfile
@coderabbitai

coderabbitai Bot commented Apr 17, 2026

Walkthrough

This PR extends Docker infrastructure to support Java 17-based images for Apache Spark 4.0.1+ environments. It introduces a new base_java17 image with Hadoop, updates the build script to detect Spark version and select the appropriate base image, adds Docker Compose configurations for Hadoop 3.4.0 + Hive 3.1.3 + Spark 4.0.1 stacks on both amd64 and arm64 architectures, parameterizes AWS library versions in Spark, and improves error handling in Spark ad-hoc containers.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Build Script Logic<br>docker/build_docker_images.sh | Parses SPARK_VERSION major component to select base_java17 for Spark ≥ 4.0 and base_java11 otherwise; updates first Docker image entry to use the selected base image dynamically. |
| Docker Compose Configurations<br>docker/compose/docker-compose_hadoop340_hive313_spark401_amd64.yml, docker/compose/docker-compose_hadoop340_hive313_spark401_arm64.yml | New integrated Docker Compose environments defining full Hadoop/Hive/Spark 4.0.1 stacks with HDFS (namenode/datanode), YARN historyserver, Hive metastore (PostgreSQL-backed), HiveServer2, Zookeeper, Kafka, Spark cluster, and MinIO S3-compatible storage with inter-service wiring, healthchecks, and named volume persistence. |
| Base Java 11 Image Update<br>docker/hoodie/hadoop/base_java11/Dockerfile | Updates maintainer declaration from deprecated MAINTAINER instruction to LABEL maintainer="Hoodie" annotation. |
| Base Java 17 Image<br>docker/hoodie/hadoop/base_java17/Dockerfile, docker/hoodie/hadoop/base_java17/entrypoint.sh, docker/hoodie/hadoop/base_java17/export_container_ip.sh, docker/hoodie/hadoop/base_java17/pom.xml | New Java 17-based Hadoop base image with Dockerfile installing Hadoop from Apache archives, bash entrypoint script configuring Hadoop XML properties via environment variables, network IP export utility, and Maven POM for Docker image build and tagging via dockerfile-maven-plugin. |
| Spark Base Dependencies<br>docker/hoodie/hadoop/spark_base/Dockerfile | Introduces build arguments HADOOP_AWS_VERSION and AWS_SDK_VERSION to parameterize AWS library jar downloads instead of hardcoding versions inline. |
| Spark Ad-hoc Container<br>docker/hoodie/hadoop/sparkadhoc/Dockerfile | Adds error suppression (2>/dev/null \|\| true) to JAR copy commands for optional Hive library files (calcite-core*.jar, libfb*.jar) to prevent build failures when glob patterns do not match. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hoppy times ahead!
A Java 17 base now springs to life,
With Spark 4.0.1 cutting through the strife,
Docker Compose weaves a Hadoop tapestry so grand,
MinIO and Hive join hands across the land! ✨
From base_java11 to base_java17 we bound!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately captures the main changes: adding a Java 17 Hadoop base image and Spark 4.0.1 docker compose configurations, which align with the primary modifications across multiple Docker-related files. |


@greptile-apps

greptile-apps Bot commented Apr 17, 2026

Greptile Summary

This PR adds a Java 17 Hadoop base Docker image (base_java17/) to support Spark 4.0+, introduces new Docker Compose files for the Hadoop 3.4.0 + Hive 3.1.3 + Spark 4.0.1 combination, and updates build_docker_images.sh to automatically select the Java 17 base when Spark 4.0+ is requested.

Key changes:

  • New docker/hoodie/hadoop/base_java17/ directory with Dockerfile, entrypoint.sh, export_container_ip.sh, and pom.xml — mirrors the structure of the existing base_java11 image but targets eclipse-temurin:17-jdk and Hadoop 3.4.0
  • New docker-compose_hadoop340_hive313_spark401_amd64.yml and _arm64.yml compose files
  • build_docker_images.sh gained logic to select base_java17 when SPARK_MAJOR >= 4
  • spark_base/Dockerfile and sparkadhoc/Dockerfile updated to work with the new stack

Issues found:

  • The new amd64 compose file is missing platform: linux/amd64 on every service — all other _amd64.yml files in this directory include this directive on each service, and its absence means the file won't enforce amd64 execution on non-amd64 hosts
  • build_docker_images.sh defaults to HADOOP_VERSION=2.8.4, which suits neither Java 17 (Hadoop 2.x is not certified on it) nor the 3.4.0 image tags expected by the new compose files; a user running ./build_docker_images.sh --spark-version 4.0.1 without explicitly adding --hadoop-version 3.4.0 would build mismatched image tags

Confidence Score: 3/5

Not ready to merge — the amd64 compose file is missing platform directives and the build script's default Hadoop version will produce mismatched image tags when building for Spark 4.0.1

Two concrete P1 issues block the primary user path: (1) the _amd64.yml compose file omits platform: linux/amd64 on all services, breaking its purpose on non-amd64 hosts; (2) the build script defaults to HADOOP_VERSION=2.8.4 while the new compose files expect 3.4.0, meaning the default build invocation produces images with wrong tags that won't work with the new compose setup. The new base_java17 image and pom are structurally correct; the issues are in the wiring between the build script and compose files.

docker/compose/docker-compose_hadoop340_hive313_spark401_amd64.yml needs platform directives; docker/build_docker_images.sh needs its default HADOOP_VERSION aligned with the Spark 4.0 path

Important Files Changed

| Filename | Overview |
|---|---|
| docker/build_docker_images.sh | Updated build script auto-selects the Java 17 base for Spark 4.0+, but defaults to HADOOP_VERSION=2.8.4, which is incompatible with the Java 17 image and the new compose files (which expect 3.4.0) |
| docker/compose/docker-compose_hadoop340_hive313_spark401_amd64.yml | New compose file for Hadoop 3.4.0 + Hive 3.1.3 + Spark 4.0.1 (amd64), but missing platform: linux/amd64 on every service, which is inconsistent with the established pattern in all other _amd64.yml files in this directory |
| docker/compose/docker-compose_hadoop340_hive313_spark401_arm64.yml | arm64 variant of the Spark 4.0.1 compose file; correctly omits platform directives (consistent with other _arm64.yml files), identical structure to the amd64 file |
| docker/hoodie/hadoop/base_java17/Dockerfile | New Java 17 base image using eclipse-temurin:17-jdk with Hadoop 3.4.0; structurally sound and mirrors the base_java11 pattern, but downloads the .asc signature without verifying it |
| docker/hoodie/hadoop/base_java17/entrypoint.sh | Identical copy of base_java11/entrypoint.sh; handles Hadoop XML config injection and Ganglia metrics; no new issues introduced |
| docker/hoodie/hadoop/base_java17/export_container_ip.sh | Copy of the existing export_container_ip.sh; resolves the container IP via ifconfig on en0/eth0; no new issues |
| docker/hoodie/hadoop/base_java17/pom.xml | New Maven module pom for the Java 17 base Docker image; mirrors the structure of base_java11's pom.xml and looks correct |
| docker/hoodie/hadoop/spark_base/Dockerfile | Updated to support Spark 4.0.1 (downloads spark-4.0.1-bin-hadoop3.tgz correctly), but default ARG HADOOP_VERSION=3.3.4 is stale relative to the new 3.4.0 setup |
| docker/hoodie/hadoop/sparkadhoc/Dockerfile | Updated to reference the Spark 4.0.1 base image; default ARG HADOOP_VERSION=3.3.4 is stale, and separate RUN apt-get update/install layers are a minor Docker best practice issue |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[build_docker_images.sh] --> B{SPARK_MAJOR >= 4?}
    B -- Yes --> C[base_java17/Dockerfile\neclipse-temurin:17-jdk\nHadoop 3.4.0]
    B -- No --> D[base_java11/Dockerfile\nopenjdk:11-jdk-slim\nHadoop 2.8.4]
    C --> E[hudi-hadoop_3.4.0-base]
    D --> F[hudi-hadoop_2.8.4-base]
    E --> G[datanode / namenode / historyserver]
    E --> H[hive_base → hudi-hadoop_3.4.0-hive_3.1.3]
    H --> I[spark_base/Dockerfile\nspark-4.0.1-bin-hadoop3.tgz]
    I --> J[sparkadhoc / sparkmaster / sparkworker]
    J --> K[docker-compose\nhadoop340_hive313_spark401\namd64 / arm64]

    style C fill:#d4edda,stroke:#28a745
    style K fill:#d4edda,stroke:#28a745
    style A fill:#fff3cd,stroke:#ffc107

Comments Outside Diff (2)

  1. docker/build_docker_images.sh, line 22-25 (link)

    P1 Default HADOOP_VERSION incompatible with Spark 4.0 + Java 17 path

    The default HADOOP_VERSION is 2.8.4, but when --spark-version 4.0.1 is passed (which triggers the Java 17 base image), the resulting image tags are apachehudi/hudi-hadoop_2.8.4-*. The new compose files, however, reference apachehudi/hudi-hadoop_3.4.0-*. A user running:

    ./build_docker_images.sh --spark-version 4.0.1

    would build images that do not match the new compose files at all. Additionally, Hadoop 2.8.4 is not certified to run on Java 17, so the build itself is likely to fail or produce a broken image.

    The base_java17/Dockerfile correctly defaults to HADOOP_VERSION=3.4.0, but the build script overrides that ARG with its own default of 2.8.4. Consider updating the default HADOOP_VERSION to 3.4.0 (and HIVE_VERSION to 3.1.3) when Spark 4.0+ is detected, or at minimum add a comment/error guard requiring the caller to pair --spark-version 4.0.1 with --hadoop-version 3.4.0.

    Or alternatively, derive the Hadoop/Hive defaults after parsing the Spark version argument.

  2. docker/hoodie/hadoop/spark_base/Dockerfile, line 18-19 (link)

    P2 Stale default HADOOP_VERSION in ARG

    The default ARG HADOOP_VERSION=3.3.4 is out of sync with the new Hadoop 3.4.0 compose setup. While the build script correctly passes --build-arg HADOOP_VERSION=3.4.0, anyone building this image manually without arguments would silently pick up 3.3.4 and produce an image that does not match the new _hadoop340_* compose files.

    The same issue exists in sparkadhoc/Dockerfile (also ARG HADOOP_VERSION=3.3.4).

    Consider bumping both defaults to 3.4.0 to stay consistent with the new compose configuration.
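For the first comment above, one way to derive the Hadoop and Hive defaults only after the Spark version is known is sketched below. This is a minimal illustration, not the script's actual code: the function name `apply_version_defaults` and the `BASE_IMAGE` variable are hypothetical, and it assumes `HADOOP_VERSION` and `HIVE_VERSION` stay unset unless the caller passed `--hadoop-version` or `--hive-version`.

```shell
# Hypothetical helper: choose the base image and fill in version defaults
# from SPARK_VERSION. Names are illustrative, not from build_docker_images.sh.
apply_version_defaults() {
  # "4.0.1" -> "4"
  spark_major="${SPARK_VERSION%%.*}"

  if [ "$spark_major" -ge 4 ]; then
    BASE_IMAGE="base_java17"
    # Only applied when the caller did not set these explicitly.
    : "${HADOOP_VERSION:=3.4.0}"
    : "${HIVE_VERSION:=3.1.3}"
  else
    BASE_IMAGE="base_java11"
    : "${HADOOP_VERSION:=2.8.4}"
    # The legacy Hive default is left to the existing script; it is not
    # stated in this review, so it is not guessed here.
  fi
}
```

With this shape an explicit `--hadoop-version` still wins, while a bare `--spark-version 4.0.1` would produce the 3.4.0 tags the new compose files reference.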

Reviews (1): Last reviewed commit: "chore: Add Java 17 Hadoop base image and..." | Re-trigger Greptile

Comment on lines +16 to +20
services:

  namenode:
    image: apachehudi/hudi-hadoop_3.4.0-namenode:latest
    hostname: namenode

P1 Missing platform: linux/amd64 on every service

Every service in the existing amd64 compose files (e.g. docker-compose_hadoop334_hive313_spark353_amd64.yml) carries a platform: linux/amd64 directive. Without it, Docker will pull/run the image for the host's native architecture, defeating the purpose of the _amd64 variant. On Apple Silicon (arm64) hosts, this would silently use the arm64 image (or fail if only an amd64 image exists), and on an amd64 host this would just happen to work by accident.

All services — namenode, datanode1, historyserver, hive-metastore-postgresql, hivemetastore, hiveserver, zookeeper, kafka, sparkmaster, spark-worker-1, adhoc-1, adhoc-2, minio, and mc — should each have a platform: linux/amd64 line, consistent with the pattern established in all other _amd64.yml files in this directory.
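A quick heuristic for spotting this gap is sketched below. It assumes every service declares exactly one `image:` line, so the number of `image:` lines should equal the number of `platform: linux/amd64` lines. The function name `check_platform_lines` is hypothetical, and this is a line count, not a YAML parser.

```shell
# Hypothetical check: succeed only if the compose file has as many
# "platform: linux/amd64" lines as "image:" lines.
check_platform_lines() {
  compose_file="$1"
  # grep -c prints 0 (and exits non-zero) when nothing matches; only the
  # printed count is used here.
  images=$(grep -cE '^[[:space:]]+image:' "$compose_file")
  platforms=$(grep -cE '^[[:space:]]+platform:[[:space:]]*linux/amd64' "$compose_file")
  [ "$images" -eq "$platforms" ]
}
```

It could be run as `check_platform_lines docker/compose/docker-compose_hadoop340_hive313_spark401_amd64.yml`; a non-zero exit status would indicate that at least one service lacks the directive.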

Comment on lines +30 to +39
RUN set -x \
&& DEBIAN_FRONTEND=noninteractive apt-get -yq update && apt-get -yq install curl wget netcat-openbsd procps \
&& echo "Fetch URL2 is : ${HADOOP_URL}" \
&& curl -fSL "${HADOOP_URL}" -o /tmp/hadoop.tar.gz \
&& curl -fSL "${HADOOP_URL}.asc" -o /tmp/hadoop.tar.gz.asc \
&& mkdir -p /opt/hadoop-$HADOOP_VERSION/logs \
&& tar -xvf /tmp/hadoop.tar.gz -C /opt/ \
&& rm /tmp/hadoop.tar.gz* \
&& ln -s /opt/hadoop-$HADOOP_VERSION/etc/hadoop /etc/hadoop \
&& mkdir /hadoop-data

P2 Hadoop archive signature downloaded but never verified

The .asc file is fetched but no GPG verification step follows:

&& curl -fSL "${HADOOP_URL}.asc" -o /tmp/hadoop.tar.gz.asc \

Without calling gpg --verify, the downloaded .asc file provides no actual integrity guarantee. The same pattern exists in base_java11/Dockerfile, so this is pre-existing, but worth resolving in the new image. A hardened build would import the Apache release keys and run:

gpg --import <apache-keys.asc>
gpg --verify /tmp/hadoop.tar.gz.asc /tmp/hadoop.tar.gz

Alternatively, use the SHA-512 checksum which Apache publishes alongside the tarball.
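The checksum route could be sketched as follows. This assumes `sha512sum` (GNU coreutils) is available in the image and that the expected digest is supplied by the caller, for example copied from the checksum file published next to the tarball. The function name `verify_sha512` is hypothetical.

```shell
# Hypothetical helper: compare a file's SHA-512 digest with an expected
# hex string. Returns 0 on match, non-zero otherwise.
verify_sha512() {
  file="$1"
  expected="$2"
  # sha512sum prints "<digest>  <filename>"; keep only the digest.
  actual=$(sha512sum "$file" | awk '{print $1}')
  [ "$actual" = "$expected" ]
}
```

In the RUN chain this would sit between the download and the `tar -xvf` step, so a mismatch stops the build before anything is extracted.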


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (4)
docker/hoodie/hadoop/base_java17/Dockerfile (1)

53-54: Use COPY instead of ADD for local files.

Per Dockerfile best practices, COPY is preferred over ADD when simply copying files (without extraction or URL fetching).

♻️ Proposed fix
-ADD entrypoint.sh /entrypoint.sh
-ADD export_container_ip.sh /usr/bin/
+COPY entrypoint.sh /entrypoint.sh
+COPY export_container_ip.sh /usr/bin/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/hoodie/hadoop/base_java17/Dockerfile` around lines 53 - 54, Replace
the ADD directives that copy local files with COPY: change the lines that add
entrypoint.sh and export_container_ip.sh (the ADD entrypoint.sh /entrypoint.sh
and ADD export_container_ip.sh /usr/bin/) to use COPY instead, ensuring file
permissions remain correct (e.g., keep executable bits on entrypoint.sh) and
update any related Dockerfile comments if present.
docker/hoodie/hadoop/base_java17/export_container_ip.sh (1)

23-23: Address shellcheck warnings for robustness.

Replace backticks with $(...) and quote the variable to prevent issues with word splitting:

♻️ Proposed fix
-  ipAddr=`ifconfig $interface | grep -Eo 'inet (addr:)?([0-9]+\.){3}[0-9]+' | grep -Eo '([0-9]+\.){3}[0-9]+' | grep -v '127.0.0.1' | head`
+  ipAddr=$(ifconfig "$interface" 2>/dev/null | grep -Eo 'inet (addr:)?([0-9]+\.){3}[0-9]+' | grep -Eo '([0-9]+\.){3}[0-9]+' | grep -v '127.0.0.1' | head -n1)

Note: Also added 2>/dev/null to suppress errors for non-existent interfaces and -n1 to head for explicitness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/hoodie/hadoop/base_java17/export_container_ip.sh` at line 23, The
command that assigns ipAddr uses backticks and an unquoted $interface which
risks word-splitting and failure; replace the backticks with $(...) and quote
"$interface" in the ifconfig call, redirect stderr to /dev/null to ignore
missing-interface errors, and make head explicit with -n1; specifically update
the ipAddr assignment (the line that references ipAddr and $interface and uses
ifconfig | grep ... | head) to use $(...) and "$interface", add 2>/dev/null
after ifconfig, and change head to head -n1.
docker/compose/docker-compose_hadoop340_hive313_spark401_arm64.yml (2)

131-141: Replace bitnamilegacy images with current bitnami images.

The bitnamilegacy/zookeeper:3.6.4 and bitnamilegacy/kafka:3.4.1 images are deprecated and no longer maintained. Bitnami has moved these to a legacy namespace and provides no further updates. The current bitnami/zookeeper and bitnami/kafka images are actively maintained and support ARM64 architecture. Upgrade to the latest stable versions (e.g., bitnami/zookeeper:3.9.3 and bitnami/kafka:latest) for ongoing security patches and maintenance.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/compose/docker-compose_hadoop340_hive313_spark401_arm64.yml` around
lines 131 - 141, Update the Docker Compose service images for the zookeeper and
kafka services to use the maintained Bitnami namespace: replace image references
for the zookeeper service (image: 'bitnamilegacy/zookeeper:3.6.4') and the kafka
service (image: 'bitnamilegacy/kafka:3.4.1') with current bitnami images (e.g.,
'bitnami/zookeeper:3.9.3' and 'bitnami/kafka:latest' or another pinned stable
tag), ensuring the quotes/formatting remain consistent and that any
environment/port settings for services named zookeeper and kafka are left
intact.

227-240: Replace deprecated MinIO environment variables and pin image version.

  1. MINIO_ACCESS_KEY and MINIO_SECRET_KEY have been deprecated since MinIO release 2021-04-22. Use MINIO_ROOT_USER and MINIO_ROOT_PASSWORD instead.
  2. Using minio/minio:latest can cause reproducibility issues. Pin to a specific release tag (e.g., RELEASE.2024-12-18T13-15-44Z or another stable version).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/compose/docker-compose_hadoop340_hive313_spark401_arm64.yml` around
lines 227 - 240, Update the minio service block to replace deprecated
environment variables and pin the image: change MINIO_ACCESS_KEY and
MINIO_SECRET_KEY to MINIO_ROOT_USER and MINIO_ROOT_PASSWORD in the environment
list for the minio service, and replace the image value minio/minio:latest with
a specific release tag (for example RELEASE.2024-12-18T13-15-44Z) to ensure
reproducible builds while keeping the existing hostname, container_name, ports,
volumes, and command unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docker/compose/docker-compose_hadoop340_hive313_spark401_amd64.yml`:
- Around line 18-35: The namenode service (image
apachehudi/hudi-hadoop_3.4.0-namenode:latest, container_name namenode) is
missing a volumes mount for the declared named volume that should persist
NameNode metadata; add a volumes: section to the namenode service and mount the
declared volume into the NameNode data directory used by the image (e.g., add "-
<your_named_volume>:/hadoop/dfs/name" under the namenode service) so the HDFS
namespace is persisted across container recreates.

In `@docker/hoodie/hadoop/base_java17/Dockerfile`:
- Around line 30-39: The Dockerfile for base_java17 is missing the safe
creation/check for /etc/hadoop/mapred-site.xml which entrypoint.sh expects when
calling addProperty; update the RUN step that extracts Hadoop (the block using
HADOOP_URL, tar -xvf, ln -s, mkdir /hadoop-data) to also create or copy a
default mapred-site.xml if it doesn't exist (match the logic in base_java11) —
e.g., test for /etc/hadoop/mapred-site.xml and if absent copy
/opt/hadoop-$HADOOP_VERSION/etc/hadoop/mapred-site.xml.template or touch a file
so addProperty in entrypoint.sh can safely modify it. Ensure you reference the
same HADOOP_VERSION path and /etc/hadoop symlink so addProperty and
entrypoint.sh find the file.

In `@docker/hoodie/hadoop/base_java17/entrypoint.sh`:
- Around line 80-81: The MAPRED section incorrectly adds the YARN property via
the addProperty call targeting /etc/hadoop/mapred-site.xml with key
yarn.nodemanager.bind-host; remove this addProperty line or move it to the
appropriate YARN config (e.g. use addProperty against /etc/hadoop/yarn-site.xml
for yarn.nodemanager.bind-host) and ensure the MAPRED block only contains
MapReduce-specific properties.
- Around line 76-77: Duplicate addition of the same Hadoop property is
happening: remove the redundant addProperty call so yarn.nodemanager.bind-host
is only added once, or replace the second addProperty invocation with a guard
that checks yarn-site.xml for an existing yarn.nodemanager.bind-host entry
before adding; locate the addProperty calls (symbol: addProperty) targeting
yarn-site.xml and ensure only a single write for yarn.nodemanager.bind-host
occurs.

In `@docker/hoodie/hadoop/base_java17/export_container_ip.sh`:
- Around line 29-30: The export in /usr/bin/export_container_ip.sh only affects
its subprocess, so entrypoint.sh won't see MY_CONTAINER_IP; fix by either (A)
having entrypoint.sh source the script (e.g., use .
/usr/bin/export_container_ip.sh) so export MY_CONTAINER_IP=$ipAddr sets the
variable in the caller, or (B) change export_container_ip.sh to simply output
the ip (echo "$ipAddr") and update entrypoint.sh to capture it
(MY_CONTAINER_IP=$(/usr/bin/export_container_ip.sh)); reference the script name
export_container_ip.sh, the caller entrypoint.sh, and the variable
MY_CONTAINER_IP/ipAddr when making the change.

In `@docker/hoodie/hadoop/spark_base/Dockerfile`:
- Around line 82-86: The Dockerfile uses ARG HADOOP_AWS_VERSION and ARG
AWS_SDK_VERSION but build_docker_images.sh doesn't forward these for Hadoop
3.4.0, causing a mismatch; update build_docker_images.sh to detect when the
requested --hadoop-version is 3.4.0 (or Spark 4.x/Java 17 target) and add the
corresponding --build-arg HADOOP_AWS_VERSION=3.4.0 (and align AWS_SDK_VERSION if
needed) to the docker build command so the HADOOP_AWS_VERSION ARG in the
Dockerfile receives the correct value.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c1227866-d2d5-4be8-abaf-3a7cb50c3c88

📥 Commits

Reviewing files that changed from the base of the PR and between a369773 and 1252d5d.

📒 Files selected for processing (10)
  • docker/build_docker_images.sh
  • docker/compose/docker-compose_hadoop340_hive313_spark401_amd64.yml
  • docker/compose/docker-compose_hadoop340_hive313_spark401_arm64.yml
  • docker/hoodie/hadoop/base_java11/Dockerfile
  • docker/hoodie/hadoop/base_java17/Dockerfile
  • docker/hoodie/hadoop/base_java17/entrypoint.sh
  • docker/hoodie/hadoop/base_java17/export_container_ip.sh
  • docker/hoodie/hadoop/base_java17/pom.xml
  • docker/hoodie/hadoop/spark_base/Dockerfile
  • docker/hoodie/hadoop/sparkadhoc/Dockerfile

Comment on lines +18 to +35
  namenode:
    image: apachehudi/hudi-hadoop_3.4.0-namenode:latest
    hostname: namenode
    container_name: namenode
    environment:
      - CLUSTER_NAME=hudi_hadoop340_hive313_spark401
    ports:
      - "50070:50070"
      - "8020:8020"
      - "9870:9870"
    env_file:
      - ./hadoop.env
    healthcheck:
      test: ["CMD", "curl", "-f", "http://namenode:9870"]
      interval: 30s
      timeout: 10s
      retries: 3


⚠️ Potential issue | 🟠 Major

Mount the declared namenode volume into the NameNode data directory.

The named volume declared at Lines 256-257 is never attached to the namenode service. That means the NameNode metadata stays ephemeral, so recreating this container can wipe the HDFS namespace and break the rest of the stack. Please add a volumes: mount here for the NameNode metadata path used by this image.

Also applies to: 256-257

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/compose/docker-compose_hadoop340_hive313_spark401_amd64.yml` around
lines 18 - 35, The namenode service (image
apachehudi/hudi-hadoop_3.4.0-namenode:latest, container_name namenode) is
missing a volumes mount for the declared named volume that should persist
NameNode metadata; add a volumes: section to the namenode service and mount the
declared volume into the NameNode data directory used by the image (e.g., add "-
<your_named_volume>:/hadoop/dfs/name" under the namenode service) so the HDFS
namespace is persisted across container recreates.

Comment on lines +30 to +39
RUN set -x \
&& DEBIAN_FRONTEND=noninteractive apt-get -yq update && apt-get -yq install curl wget netcat-openbsd procps \
&& echo "Fetch URL2 is : ${HADOOP_URL}" \
&& curl -fSL "${HADOOP_URL}" -o /tmp/hadoop.tar.gz \
&& curl -fSL "${HADOOP_URL}.asc" -o /tmp/hadoop.tar.gz.asc \
&& mkdir -p /opt/hadoop-$HADOOP_VERSION/logs \
&& tar -xvf /tmp/hadoop.tar.gz -C /opt/ \
&& rm /tmp/hadoop.tar.gz* \
&& ln -s /opt/hadoop-$HADOOP_VERSION/etc/hadoop /etc/hadoop \
&& mkdir /hadoop-data

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if Hadoop 3.4.0 includes mapred-site.xml.template
curl -sI "https://archive.apache.org/dist/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz" | head -5
# Compare with base_java11
diff -u <(cat docker/hoodie/hadoop/base_java11/Dockerfile | grep -A20 "RUN set -x") <(cat docker/hoodie/hadoop/base_java17/Dockerfile | grep -A20 "RUN set -x") 2>/dev/null || echo "Files differ"

Repository: yihua/hudi

Length of output: 202


🏁 Script executed:

# Read the relevant sections from both Dockerfiles
echo "=== base_java11 Dockerfile (lines 30-50) ==="
cat -n docker/hoodie/hadoop/base_java11/Dockerfile | sed -n '30,50p'

echo -e "\n=== base_java17 Dockerfile (lines 30-50) ==="
cat -n docker/hoodie/hadoop/base_java17/Dockerfile | sed -n '30,50p'

Repository: yihua/hudi

Length of output: 2144


🏁 Script executed:

# Check if mapred-site.xml.template exists in Hadoop 3.4.0 distribution
# Download and inspect the Hadoop tarball structure
cd /tmp
curl -sL "https://archive.apache.org/dist/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz" | tar -tzf - | grep -i "mapred-site.xml" | head -20

Repository: yihua/hudi

Length of output: 93


🏁 Script executed:

# Search for all template files in Hadoop 3.4.0 distribution
cd /tmp
curl -sL "https://archive.apache.org/dist/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz" | tar -tzf - | grep "\.template" | head -20

Repository: yihua/hudi

Length of output: 140


🏁 Script executed:

# Also check all files in etc/hadoop directory
cd /tmp
curl -sL "https://archive.apache.org/dist/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz" | tar -tzf - | grep "etc/hadoop/" | sort

Repository: yihua/hudi

Length of output: 140


🏁 Script executed:

# Search repository for any references to mapred-site.xml
rg "mapred-site.xml" --type dockerfile

echo "---"

# Check if there's documentation about the differences between base_java11 and base_java17
find docker/hoodie/hadoop -name "README*" -o -name "*.md" | xargs cat 2>/dev/null || echo "No README found"

echo "---"

# Look for any comments or context around the Hadoop setup
grep -r "mapred-site" docker/hoodie/hadoop/ 2>/dev/null || echo "No matches in comments"

Repository: yihua/hudi

Length of output: 1004


🏁 Script executed:

# Check the full base_java17 Dockerfile to understand the context
cat -n docker/hoodie/hadoop/base_java17/Dockerfile | head -80

Repository: yihua/hudi

Length of output: 2683


Add missing mapred-site.xml configuration step.

The base_java17 Dockerfile is missing a step that exists in base_java11. More importantly, the entrypoint.sh script for base_java17 expects mapred-site.xml to exist when it attempts to modify it:

addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0

The base_java11 Dockerfile handles this more safely with a conditional check. Update base_java17 to match:

Proposed fix
     && ln -s /opt/hadoop-$HADOOP_VERSION/etc/hadoop /etc/hadoop \
+    && if [[ -f /etc/hadoop/mapred-site.xml.template ]]; then cp /etc/hadoop/mapred-site.xml.template /etc/hadoop/mapred-site.xml; fi \
     && mkdir /hadoop-data
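One caveat on the proposed fix: `[[ ... ]]` is a bash construct, while Docker's shell-form RUN uses `/bin/sh -c` by default on Linux, and whether that shell accepts `[[` depends on the base image. A portable variant using the POSIX `[ ... ]` test is sketched below. The function name `ensure_mapred_site` is hypothetical, and the fallback stub is an assumption about what a usable empty config looks like, not something taken from base_java11.

```shell
# Hypothetical helper: make sure <conf_dir>/mapred-site.xml exists.
# Copies the template when present, otherwise writes an empty configuration.
ensure_mapred_site() {
  conf_dir="$1"
  if [ ! -f "$conf_dir/mapred-site.xml" ]; then
    if [ -f "$conf_dir/mapred-site.xml.template" ]; then
      cp "$conf_dir/mapred-site.xml.template" "$conf_dir/mapred-site.xml"
    else
      # Assumed minimal stub so the file is at least well-formed XML.
      printf '%s\n' '<?xml version="1.0"?>' '<configuration>' '</configuration>' \
        > "$conf_dir/mapred-site.xml"
    fi
  fi
}
```

An existing mapred-site.xml is left untouched, so the step is safe to run whether or not the Hadoop distribution already ships the file.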
🧰 Tools
🪛 Trivy (0.69.3)

[error] 30-39: 'apt-get' missing '--no-install-recommends'

'--no-install-recommends' flag is missed: 'set -x && DEBIAN_FRONTEND=noninteractive apt-get -yq update && apt-get -yq install curl wget netcat-openbsd procps && echo "Fetch URL2 is : ${HADOOP_URL}" && curl -fSL "${HADOOP_URL}" -o /tmp/hadoop.tar.gz && curl -fSL "${HADOOP_URL}.asc" -o /tmp/hadoop.tar.gz.asc && mkdir -p /opt/hadoop-$HADOOP_VERSION/logs && tar -xvf /tmp/hadoop.tar.gz -C /opt/ && rm /tmp/hadoop.tar.gz* && ln -s /opt/hadoop-$HADOOP_VERSION/etc/hadoop /etc/hadoop && mkdir /hadoop-data'

Rule: DS-0029


(IaC/Dockerfile)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/hoodie/hadoop/base_java17/Dockerfile` around lines 30 - 39, The
Dockerfile for base_java17 is missing the safe creation/check for
/etc/hadoop/mapred-site.xml which entrypoint.sh expects when calling
addProperty; update the RUN step that extracts Hadoop (the block using
HADOOP_URL, tar -xvf, ln -s, mkdir /hadoop-data) to also create or copy a
default mapred-site.xml if it doesn't exist (match the logic in base_java11) —
e.g., test for /etc/hadoop/mapred-site.xml and if absent copy
/opt/hadoop-$HADOOP_VERSION/etc/hadoop/mapred-site.xml.template or touch a file
so addProperty in entrypoint.sh can safely modify it. Ensure you reference the
same HADOOP_VERSION path and /etc/hadoop symlink so addProperty and
entrypoint.sh find the file.

Comment on lines +76 to +77
addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0

⚠️ Potential issue | 🟡 Minor

Duplicate property addition.

yarn.nodemanager.bind-host is added twice to yarn-site.xml (lines 76 and 77).

🔧 Proposed fix
     addProperty /etc/hadoop/yarn-site.xml yarn.resourcemanager.bind-host 0.0.0.0
     addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
-    addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
     addProperty /etc/hadoop/yarn-site.xml yarn.timeline-service.bind-host 0.0.0.0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
-addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
+addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/hoodie/hadoop/base_java17/entrypoint.sh` around lines 76 - 77,
Duplicate addition of the same Hadoop property is happening: remove the
redundant addProperty call so yarn.nodemanager.bind-host is only added once, or
replace the second addProperty invocation with a guard that checks yarn-site.xml
for an existing yarn.nodemanager.bind-host entry before adding; locate the
addProperty calls (symbol: addProperty) targeting yarn-site.xml and ensure only
a single write for yarn.nodemanager.bind-host occurs.
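
A guard of the kind mentioned above might look like the following sketch. `add_property_once` is a hypothetical stand-in, not the image's `addProperty`, and it assumes the closing `</configuration>` tag sits on its own line.

```shell
#!/usr/bin/env bash
# Hypothetical guard (not the image's addProperty): write a property only if
# its <name> is not already present in the file.
add_property_once() {
  local file="$1" name="$2" value="$3"
  if grep -qF "<name>${name}</name>" "$file"; then
    return 0                              # already configured, skip the duplicate
  fi
  local entry="<property><name>${name}</name><value>${value}</value></property>"
  local tmp
  tmp="$(mktemp)"
  # Assumes </configuration> is on its own line: drop it, append the entry,
  # then put the closing tag back.
  grep -vF '</configuration>' "$file" > "$tmp"
  printf '%s\n</configuration>\n' "$entry" >> "$tmp"
  mv "$tmp" "$file"
}

# Demo on a scratch file: the second call is a no-op.
site="$(mktemp)"
printf '<configuration>\n</configuration>\n' > "$site"
add_property_once "$site" yarn.nodemanager.bind-host 0.0.0.0
add_property_once "$site" yarn.nodemanager.bind-host 0.0.0.0
grep -c 'yarn.nodemanager.bind-host' "$site"
```

For this PR, deleting the redundant line is the simpler fix; a guard like this is mainly useful if the entrypoint may run more than once against the same config files.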

Comment on lines +80 to +81
# MAPRED
addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0

⚠️ Potential issue | 🟡 Minor

Wrong configuration file for YARN property.

yarn.nodemanager.bind-host is a YARN property but is being added to mapred-site.xml. This appears to be a copy-paste error. The MAPRED section should likely set a MapReduce-specific property or be removed.

🔧 Proposed fix (replace incorrect entry with MapReduce-specific properties)
     # MAPRED
-    addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0
+    addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.address 0.0.0.0:10020
+    addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.webapp.address 0.0.0.0:19888

Alternatively, if no MapReduce multi-homed config is needed, simply remove lines 80-81.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-# MAPRED
-addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0
+# MAPRED
+addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.address 0.0.0.0:10020
+addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.webapp.address 0.0.0.0:19888
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/hoodie/hadoop/base_java17/entrypoint.sh` around lines 80 - 81, The
MAPRED section incorrectly adds the YARN property via the addProperty call
targeting /etc/hadoop/mapred-site.xml with key yarn.nodemanager.bind-host;
remove this addProperty line or move it to the appropriate YARN config (e.g. use
addProperty against /etc/hadoop/yarn-site.xml for yarn.nodemanager.bind-host)
and ensure the MAPRED block only contains MapReduce-specific properties.

Comment on lines +29 to +30
echo "Container IP is set to : $ipAddr"
export MY_CONTAINER_IP=$ipAddr

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check how the base_java11 version handles this
cat docker/hoodie/hadoop/base_java11/export_container_ip.sh 2>/dev/null || echo "File not found"
cat docker/hoodie/hadoop/base_java11/entrypoint.sh 2>/dev/null | grep -A2 -B2 "export_container_ip"

Repository: yihua/hudi

Length of output: 1269


🏁 Script executed:

cat docker/hoodie/hadoop/base_java17/entrypoint.sh 2>/dev/null | grep -A3 -B3 "export_container_ip" || echo "Not found or error"

Repository: yihua/hudi

Length of output: 136


🏁 Script executed:

cat docker/hoodie/hadoop/base_java17/entrypoint.sh

Repository: yihua/hudi

Length of output: 4618


Exported variable in subprocess won't propagate to entrypoint.sh.

When entrypoint.sh calls /usr/bin/export_container_ip.sh as a subprocess (not sourced), the export MY_CONTAINER_IP=$ipAddr only affects that subprocess. The variable won't be available in entrypoint.sh after the script completes. Note that base_java11/entrypoint.sh avoids this by exporting MY_CONTAINER_IP directly in the entrypoint script before calling the subprocess.

Fix this by sourcing the script:

-/usr/bin/export_container_ip.sh
+source /usr/bin/export_container_ip.sh

Or have the script output the IP for the caller to capture:

echo "$ipAddr"  # Caller does: MY_CONTAINER_IP=$(/usr/bin/export_container_ip.sh)
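
The difference between executing and sourcing can be reproduced in isolation. The sketch below uses a temporary stand-in for export_container_ip.sh; the address `10.0.0.5` is a made-up placeholder, not a value from the repository.

```shell
#!/usr/bin/env bash
# Stand-in for export_container_ip.sh; the IP is a placeholder for illustration.
helper="$(mktemp)"
cat > "$helper" <<'EOF'
ipAddr="10.0.0.5"
export MY_CONTAINER_IP=$ipAddr
EOF

unset MY_CONTAINER_IP
bash "$helper"                            # child process: export stays in the child
after_exec="${MY_CONTAINER_IP:-unset}"

. "$helper"                               # sourced: export lands in this shell
after_source="${MY_CONTAINER_IP:-unset}"

echo "after exec:   ${after_exec}"
echo "after source: ${after_source}"
```

If the capture variant is chosen instead, note that the script's existing `echo "Container IP is set to : $ipAddr"` line would also end up in the captured value unless it is redirected to stderr or removed.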
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/hoodie/hadoop/base_java17/export_container_ip.sh` around lines 29 -
30, The export in /usr/bin/export_container_ip.sh only affects its subprocess,
so entrypoint.sh won't see MY_CONTAINER_IP; fix by either (A) having
entrypoint.sh source the script (e.g., use . /usr/bin/export_container_ip.sh) so
export MY_CONTAINER_IP=$ipAddr sets the variable in the caller, or (B) change
export_container_ip.sh to simply output the ip (echo "$ipAddr") and update
entrypoint.sh to capture it
(MY_CONTAINER_IP=$(/usr/bin/export_container_ip.sh)); reference the script name
export_container_ip.sh, the caller entrypoint.sh, and the variable
MY_CONTAINER_IP/ipAddr when making the change.

Comment on lines +82 to +86
+ARG HADOOP_AWS_VERSION=3.3.4
+ARG AWS_SDK_VERSION=1.12.734
 RUN wget -nc -q -O "${SPARK_INSTALL}/jars/jersey-bundle-1.19.4.jar" "https://repo1.maven.org/maven2/com/sun/jersey/jersey-bundle/1.19.4/jersey-bundle-1.19.4.jar" && \
-    wget -O "${SPARK_INSTALL}/jars/hadoop-aws-3.3.4.jar" https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar && \
-    wget -O "${SPARK_INSTALL}/jars/aws-java-sdk-bundle-1.12.734.jar" https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.734/aws-java-sdk-bundle-1.12.734.jar
+    wget -O "${SPARK_INSTALL}/jars/hadoop-aws-${HADOOP_AWS_VERSION}.jar" https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_AWS_VERSION}/hadoop-aws-${HADOOP_AWS_VERSION}.jar && \
+    wget -O "${SPARK_INSTALL}/jars/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar" https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if build_docker_images.sh passes HADOOP_AWS_VERSION arg
rg -n "HADOOP_AWS_VERSION" docker/build_docker_images.sh

Repository: yihua/hudi

Length of output: 36


🏁 Script executed:

# Find and examine the build script
find . -name "build_docker_images.sh" -o -name "*build*.sh" | head -20

Repository: yihua/hudi

Length of output: 1283


🏁 Script executed:

# Check the full Dockerfile to understand Hadoop version context
head -100 docker/hoodie/hadoop/spark_base/Dockerfile

Repository: yihua/hudi

Length of output: 3814


🏁 Script executed:

# Search for references to Hadoop 3.4.0 or Java 17 context in the repo
rg -i "hadoop.3.4|java.17|spark.4" docker/ --type dockerfile --max-count 10

Repository: yihua/hudi

Length of output: 88


🏁 Script executed:

# Read the build_docker_images.sh script
cat docker/build_docker_images.sh

Repository: yihua/hudi

Length of output: 5138


🏁 Script executed:

# Check for other spark_base Dockerfiles or variants
find docker -name "*Dockerfile*" -o -name "*dockerfile*" | grep -i spark

Repository: yihua/hudi

Length of output: 227


🏁 Script executed:

# Search for Hadoop 3.4.0 references in docker directory
rg "3.4.0" docker/

Repository: yihua/hudi

Length of output: 5786


🏁 Script executed:

# Check the base_java17 Dockerfile to see how it handles Hadoop and AWS versions
head -100 docker/hoodie/hadoop/base_java17/Dockerfile

Repository: yihua/hudi

Length of output: 2277


🏁 Script executed:

# Check if spark_base Dockerfile inherits from a base image
grep -n "^FROM" docker/hoodie/hadoop/spark_base/Dockerfile

Repository: yihua/hudi

Length of output: 123


🏁 Script executed:

# Check if base_java17 has the same AWS library ARG setup
rg -n "HADOOP_AWS_VERSION|AWS_SDK_VERSION" docker/hoodie/hadoop/base_java17/

Repository: yihua/hudi

Length of output: 36


Address potential AWS SDK version mismatch for Hadoop 3.4.0 builds.

The parameterization of HADOOP_AWS_VERSION and AWS_SDK_VERSION is good, but build_docker_images.sh currently doesn't pass these values when building with different Hadoop versions. When using --hadoop-version 3.4.0 (for Spark 4.x with Java 17), the docker build will still default to HADOOP_AWS_VERSION=3.3.4, creating a version mismatch. Update the build script to pass --build-arg HADOOP_AWS_VERSION=3.4.0 when appropriate for Hadoop 3.4.0 builds.
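
One way the build script could forward the value is sketched below. `spark_base_build_args` is a hypothetical helper, not existing code in build_docker_images.sh, and it assumes the hadoop-aws artifact should simply track the Hadoop version passed to the script. Which AWS SDK artifact and version pairs with a given hadoop-aws release is left open here and would need to be checked against that release's dependencies.

```shell
#!/usr/bin/env bash
# Hypothetical helper for build_docker_images.sh (name and shape are assumptions):
# derive the spark_base build arguments from the requested Hadoop version.
spark_base_build_args() {
  local hadoop_version="$1"
  # Assumption: hadoop-aws should match the Hadoop version being built.
  echo "--build-arg HADOOP_AWS_VERSION=${hadoop_version}"
}

# Demo: print the command instead of invoking docker.
echo docker build $(spark_base_build_args 3.4.0) docker/hoodie/hadoop/spark_base
```

Keeping the mapping in one function means the Dockerfile defaults stay untouched for the existing Hadoop builds and only the Hadoop 3.4.0 path overrides them.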

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/hoodie/hadoop/spark_base/Dockerfile` around lines 82 - 86, The
Dockerfile uses ARG HADOOP_AWS_VERSION and ARG AWS_SDK_VERSION but
build_docker_images.sh doesn't forward these for Hadoop 3.4.0, causing a
mismatch; update build_docker_images.sh to detect when the requested
--hadoop-version is 3.4.0 (or Spark 4.x/Java 17 target) and add the
corresponding --build-arg HADOOP_AWS_VERSION=3.4.0 (and align AWS_SDK_VERSION if
needed) to the docker build command so the HADOOP_AWS_VERSION ARG in the
Dockerfile receives the correct value.
