chore: Add Java 17 Hadoop base image and Spark 4.0.1 docker compose s…#18520

Merged
yihua merged 1 commit into apache:master from voonhous:add-docker-compose-hadoop3
Apr 17, 2026

@voonhous
Member

@voonhous voonhous commented Apr 17, 2026

…etup

Describe the issue this Pull Request addresses

Add a Java 17 Hadoop base image and Spark 4.0.1 Docker Compose files, which our Variant feature work requires.

Register base_java17 in Maven reactor and fix tag collision

  • Add base_java17 to the hudi-hadoop-docker parent POM so the Java 17 base image is picked up by the Maven reactor and imported by IntelliJ, mirroring how base_java11 is wired in.
  • Rename base_java17's docker repository from apachehudi/hudi-hadoop_*-base to apachehudi/hudi-hadoop_*-base-java17 so that a Maven-driven reactor build does not stomp on the tag produced by the base/ module (which publishes apachehudi/hudi-hadoop_*-base and is consumed by namenode, datanode, historyserver, hive_base, and prestobase via FROM ...-base:latest).
  • This mirrors base_java11's -base-java11 suffix. Also add the commented-out push goal for parity with base_java11.

Note: docker/build_docker_images.sh still tags the base image as plain -base regardless of Java version, which is intentional for the shell path so that downstream Dockerfiles' FROM ...-base clauses inherit the newer JDK. Only the Maven reactor path needed the suffix fix.
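
The collision on the Maven reactor path can be illustrated with a small sketch. The helper names `repo_for_module` and `find_repo_collisions` are hypothetical and exist only for this illustration; the repository names follow the ones quoted above, and the Hadoop version is passed in by hand rather than read from the POM.

```shell
#!/usr/bin/env bash
# Hypothetical helper: repository name each base module publishes after this PR.
repo_for_module() {
  local hadoop_version="$1" module="$2"
  case "$module" in
    base)        echo "apachehudi/hudi-hadoop_${hadoop_version}-base" ;;
    base_java11) echo "apachehudi/hudi-hadoop_${hadoop_version}-base-java11" ;;
    base_java17) echo "apachehudi/hudi-hadoop_${hadoop_version}-base-java17" ;;
    *) echo "unknown module: $module" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print any repository name produced by more than one module.
find_repo_collisions() {
  local hadoop_version="$1" module
  shift
  for module in "$@"; do
    repo_for_module "$hadoop_version" "$module"
  done | sort | uniq -d
}

# Prints nothing when every module maps to a distinct repository.
find_repo_collisions 3.4.0 base base_java11 base_java17
```

Had `base_java17` kept the plain `-base` name, the same check would print that name once, which is the overwrite this change avoids.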

Summary and Changelog

  • Introduce base_java17 Hadoop base image to support Spark 4.x (which requires Java 17)
  • build_docker_images.sh auto-selects base_java11 or base_java17 based on SPARK_VERSION
  • Add docker-compose_hadoop340_hive313_spark401 files for amd64 and arm64
  • Parameterize hadoop-aws and aws-java-sdk-bundle versions in spark_base Dockerfile
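
As a rough sketch of the auto-selection described in the list above, the Spark major version can be taken from the version string and compared against 4. The function name `select_base_module` is an assumption made for this example and is not taken from `build_docker_images.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a Spark version string to a base image module.
select_base_module() {
  local spark_version="$1"
  local spark_major="${spark_version%%.*}"   # "4.0.1" -> "4"
  # Reject anything whose major component is not a plain integer.
  case "$spark_major" in
    ''|*[!0-9]*)
      echo "invalid Spark version: $spark_version" >&2
      return 1
      ;;
  esac
  if [ "$spark_major" -ge 4 ]; then
    echo "base_java17"   # Spark 4.x requires Java 17
  else
    echo "base_java11"
  fi
}

select_base_module "4.0.1"
select_base_module "3.5.3"
```

Validating the major version before the numeric comparison keeps a malformed `--spark-version` value from silently falling through to the Java 11 branch.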

Impact

None

Risk Level

Low, this is a chore task.

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@voonhous voonhous force-pushed the add-docker-compose-hadoop3 branch from f43107b to 74b91d5 Compare April 17, 2026 17:50
@github-actions github-actions Bot added the size:L PR with lines of changes in (300, 1000] label Apr 17, 2026
@voonhous voonhous force-pushed the add-docker-compose-hadoop3 branch from 74b91d5 to 1252d5d Compare April 17, 2026 17:56
@voonhous voonhous requested a review from yihua April 17, 2026 17:57
Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

LGTM — straightforward chore adding a Java 17 base image module for Spark 4.x. One question inline about potential Docker image tag collision with the existing base_java11 module.

</goals>
<configuration>
<skip>${docker.build.skip}</skip>
<pullNewerImage>false</pullNewerImage>
Contributor

🤖 The repository name apachehudi/hudi-hadoop_${docker.hadoop.version}-base and tags (latest, ${project.version}) look identical to what base_java11 publishes. If both modules are ever built in the same Maven invocation, won't the second one overwrite the first's image under the same tag? Would it be safer to differentiate the tag (e.g. include java17 in the repository name or tag) so the two base images can coexist?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Contributor

I think this is fine as long as we're using a newer Hadoop version that supports Java 17. I'll leave it to you, @voonhous, to decide whether a java17 suffix on the image name is really needed.

Contributor

We can add docs on which image is for which Java version.

- ALLOW_ANONYMOUS_LOGIN=yes

kafka:
image: 'bitnamilegacy/kafka:3.4.1'
Contributor

Non-blocker: should we consider using 'apache/kafka:3.7.2'?

Comment on lines +147 to +148
- KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
- ALLOW_PLAINTEXT_LISTENER=yes
Contributor

Non-blocker: do we need the same configs as the Spark 3.5 compose file (docker-compose_hadoop334_hive313_spark353_arm64), listed below?

      - KAFKA_NODE_ID=1
      - KAFKA_PROCESS_ROLES=broker,controller
      - KAFKA_LISTENERS=PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:9093,PLAINTEXT_HOST://0.0.0.0:9092
      - KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://kafkabroker:29092,PLAINTEXT_HOST://localhost:9092
      - KAFKA_CONTROLLER_LISTENER_NAMES=CONTROLLER
      - KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      - KAFKA_INTER_BROKER_LISTENER_NAME=PLAINTEXT
      - KAFKA_CONTROLLER_QUORUM_VOTERS=1@kafkabroker:9093
      - KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1
      - KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR=1
      - KAFKA_TRANSACTION_STATE_LOG_MIN_ISR=1
      - KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS=0
      - KAFKA_NUM_PARTITIONS=3
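
One thing worth checking when copying a KRaft listener block like the one above between compose files is that every listener name has an entry in the security protocol map. The sketch below is only an illustration of that check; the function names are hypothetical and it parses the values as plain strings rather than reading a compose file.

```shell
#!/usr/bin/env bash
# Print the listener names from a value such as "A://h:1,B://h:2".
listener_names() {
  printf '%s\n' "$1" | tr ',' '\n' | sed 's,://.*$,,'
}

# Print the listener names from a map such as "A:PLAINTEXT,B:PLAINTEXT".
mapped_names() {
  printf '%s\n' "$1" | tr ',' '\n' | cut -d: -f1
}

# Return non-zero if any listener name lacks a security protocol mapping.
check_listeners_mapped() {
  local listeners="$1" map="$2" name missing=0
  for name in $(listener_names "$listeners"); do
    if ! mapped_names "$map" | grep -qx "$name"; then
      echo "listener $name has no security protocol mapping" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Values copied from the KAFKA_LISTENERS and KAFKA_LISTENER_SECURITY_PROTOCOL_MAP lines above.
check_listeners_mapped \
  "PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:9093,PLAINTEXT_HOST://0.0.0.0:9092" \
  "CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT" \
  && echo "all listeners mapped"
```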

…etup

- Introduce base_java17 Hadoop base image to support Spark 4.x (which requires Java 17)
- build_docker_images.sh auto-selects base_java11 or base_java17 based on SPARK_VERSION
- Add docker-compose_hadoop340_hive313_spark401 files for amd64 and arm64
- Parameterize hadoop-aws and aws-java-sdk-bundle versions in spark_base Dockerfile
- Register base_java17 in Maven reactor and fix tag collision
  - Add <module>base_java17</module> to the hudi-hadoop-docker parent POM so the Java 17 base image is picked up by the Maven reactor and imported by IntelliJ, mirroring how base_java11 is wired in.
  - Rename base_java17's docker repository from apachehudi/hudi-hadoop_*-base to apachehudi/hudi-hadoop_*-base-java17 so that a Maven-driven reactor build does not stomp on the tag produced by the base/ module (which publishes apachehudi/hudi-hadoop_*-base and is consumed by namenode, datanode, historyserver, hive_base, and prestobase via FROM ...-base:latest).
  - This mirrors base_java11's -base-java11 suffix. Also add the commented-out push goal for parity with base_java11.
  - Note: docker/build_docker_images.sh still tags the base image as plain -base regardless of Java version, which is intentional for the shell path so that downstream Dockerfiles' FROM ...-base clauses inherit the newer JDK. Only the Maven reactor path needed the suffix fix.
@voonhous voonhous force-pushed the add-docker-compose-hadoop3 branch from 1252d5d to 1fc0a55 Compare April 17, 2026 18:10
Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

CodeRabbit Walkthrough: This PR extends Docker infrastructure to support Java 17-based images for Apache Spark 4.0.1+ environments. It introduces a new base_java17 image with Hadoop, updates the build script to detect Spark version and select the appropriate base image, adds Docker Compose configurations for Hadoop 3.4.0 + Hive 3.1.3 + Spark 4.0.1 stacks on both amd64 and arm64 architectures, parameterizes AWS library versions in Spark, and improves error handling in Spark ad-hoc containers.

Greptile Summary: This PR adds a Java 17 Hadoop base Docker image (base_java17/) to support Spark 4.0+, introduces new Docker Compose files for the Hadoop 3.4.0 + Hive 3.1.3 + Spark 4.0.1 combination, and updates build_docker_images.sh to automatically select the Java 17 base when Spark 4.0+ is requested.

Key changes:

  • New docker/hoodie/hadoop/base_java17/ directory with Dockerfile, entrypoint.sh, export_container_ip.sh, and pom.xml — mirrors the structure of the existing base_java11 image but targets eclipse-temurin:17-jdk and Hadoop 3.4.0
  • New docker-compose_hadoop340_hive313_spark401_amd64.yml and _arm64.yml compose files
  • build_docker_images.sh gained logic to select base_java17 when SPARK_MAJOR >= 4
  • spark_base/Dockerfile and sparkadhoc/Dockerfile updated to work with the new stack

Issues found:

  • The new amd64 compose file is missing platform: linux/amd64 on every service — all other _amd64.yml files in this directory include this directive on each service, and its absence means the file won't enforce amd64 execution on non-amd64 hosts
  • build_docker_images.sh defaults to HADOOP_VERSION=2.8.4. Hadoop 2.x is not compatible with Java 17, and 2.8.4 does not match the 3.4.0 image tags that the new compose files expect, so a user running ./build_docker_images.sh --spark-version 4.0.1 without explicitly adding --hadoop-version 3.4.0 would build mismatched image tags

Greptile Confidence Score: 3/5
Not ready to merge — the amd64 compose file is missing platform directives and the build script's default Hadoop version will produce mismatched image tags when building for Spark 4.0.1

Two concrete P1 issues block the primary user path: (1) the _amd64.yml compose file omits platform: linux/amd64 on all services, breaking its purpose on non-amd64 hosts; (2) the build script defaults to HADOOP_VERSION=2.8.4 while the new compose files expect 3.4.0, meaning the default build invocation produces images with wrong tags that won't work with the new compose setup. The new base_java17 image and pom are structurally correct; the issues are in the wiring between the build script and compose files.

docker/compose/docker-compose_hadoop340_hive313_spark401_amd64.yml needs platform directives; docker/build_docker_images.sh needs its default HADOOP_VERSION aligned with the Spark 4.0 path
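
One possible way to align the default, sketched here under assumptions, is to derive the Hadoop default from the Spark major version only when the caller has not passed a Hadoop version. The function names are hypothetical, and this is not necessarily how the script was changed in the follow-up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: default Hadoop version for a given Spark major version.
default_hadoop_version() {
  local spark_major="$1"
  if [ "$spark_major" -ge 4 ]; then
    echo "3.4.0"   # matches the hadoop340 compose files
  else
    echo "2.8.4"   # the script's existing default
  fi
}

# Hypothetical helper: an explicit --hadoop-version value wins over the default.
resolve_hadoop_version() {
  local spark_version="$1" requested="${2:-}"
  if [ -n "$requested" ]; then
    echo "$requested"
    return 0
  fi
  default_hadoop_version "${spark_version%%.*}"
}

resolve_hadoop_version "4.0.1"
resolve_hadoop_version "4.0.1" "3.3.4"
```

With this shape, `--spark-version 4.0.1` alone resolves to 3.4.0, while an explicit Hadoop version is passed through unchanged.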

CodeRabbit: yihua#48 (review)
Greptile: yihua#48 (review)

&& tar -xvf /tmp/hadoop.tar.gz -C /opt/ \
&& rm /tmp/hadoop.tar.gz* \
&& ln -s /opt/hadoop-$HADOOP_VERSION/etc/hadoop /etc/hadoop \
&& mkdir /hadoop-data
Contributor

P2 Hadoop archive signature downloaded but never verified

The .asc file is fetched but no GPG verification step follows:

&& curl -fSL "${HADOOP_URL}.asc" -o /tmp/hadoop.tar.gz.asc \

Without calling gpg --verify, the downloaded .asc file provides no actual integrity guarantee. The same pattern exists in base_java11/Dockerfile, so this is pre-existing, but worth resolving in the new image. A hardened build would import the Apache release keys and run:

gpg --import <apache-keys.asc>
gpg --verify /tmp/hadoop.tar.gz.asc /tmp/hadoop.tar.gz

Alternatively, use the SHA-512 checksum which Apache publishes alongside the tarball.

Greptile (original) (source:comment#3102315298)
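
For the SHA-512 alternative, a minimal sketch of the comparison step might look like the following. It assumes GNU `sha512sum` is available in the image and that the expected digest has already been obtained as a bare hex string; extracting that string from the published checksum file is left out because its exact layout is not shown in this thread. The function name `verify_sha512` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare a file's SHA-512 digest against an expected hex string.
verify_sha512() {
  local file="$1" expected="$2" actual
  actual="$(sha512sum "$file" 2>/dev/null | awk '{print $1}')"
  # Normalise case so an upper-case published digest still matches.
  expected="$(printf '%s' "$expected" | tr 'A-F' 'a-f')"
  if [ -n "$actual" ] && [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file" >&2
    return 1
  fi
}
```

In a Dockerfile `RUN` chain, a call such as `verify_sha512 /tmp/hadoop.tar.gz "$EXPECTED_SHA512"` placed before the `tar -xvf` step would abort the build on a mismatch; `EXPECTED_SHA512` is an illustrative variable name, not one the existing Dockerfile defines.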

Contributor

All these are existing issues. Comparing base_java11 and base_java17, there isn't much change except for version and docs update:

> diff -r base_java11 base_java17
diff --color=auto -r base_java11/Dockerfile base_java17/Dockerfile
18c18
< FROM eclipse-temurin:11-jdk-jammy
---
> FROM eclipse-temurin:17-jdk
25c25
< ARG HADOOP_VERSION=2.8.4 
---
> ARG HADOOP_VERSION=3.4.0
31c31
<     && DEBIAN_FRONTEND=noninteractive apt-get -yq update && apt-get -yq install curl wget netcat procps \
---
>     && DEBIAN_FRONTEND=noninteractive apt-get -yq update && apt-get -yq install curl wget netcat-openbsd procps \
39d38
<     && cp /etc/hadoop/mapred-site.xml.template /etc/hadoop/mapred-site.xml \
52c51
< EXPOSE 0-1024 4040 7000-10100 5000-5100 50000-50200 58188 58088 58042 
---
> EXPOSE 0-1024 4040 7000-10100 5000-5100 50000-50200 58188 58088 58042
60d58
< 
diff --color=auto -r base_java11/pom.xml base_java17/pom.xml
27c27
<   <artifactId>hudi-hadoop-base-java11-docker</artifactId>
---
>   <artifactId>hudi-hadoop-base-java17-docker</artifactId>
29c29
<   <description>Base Docker Image with Hoodie</description>
---
>   <description>Base Docker Image with Java 17 for Spark 4.0+</description>
37d36
< 
50d48
< 
71c69
<               <repository>apachehudi/hudi-hadoop_${docker.hadoop.version}-base-java11</repository>
---
>               <repository>apachehudi/hudi-hadoop_${docker.hadoop.version}-base-java17</repository>
87c85
<               <repository>apachehudi/hudi-hadoop_${docker.hadoop.version}-base-java11</repository>
---
>               <repository>apachehudi/hudi-hadoop_${docker.hadoop.version}-base-java17</repository>

# YARN
addProperty /etc/hadoop/yarn-site.xml yarn.resourcemanager.bind-host 0.0.0.0
addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
Contributor

⚠️ Potential issue | 🟡 Minor

Duplicate property addition.

yarn.nodemanager.bind-host is added twice to yarn-site.xml (lines 76 and 77).

🔧 Proposed fix
     addProperty /etc/hadoop/yarn-site.xml yarn.resourcemanager.bind-host 0.0.0.0
     addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
-    addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
     addProperty /etc/hadoop/yarn-site.xml yarn.timeline-service.bind-host 0.0.0.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/hoodie/hadoop/base_java17/entrypoint.sh` around lines 76 - 77,
Duplicate addition of the same Hadoop property is happening: remove the
redundant addProperty call so yarn.nodemanager.bind-host is only added once, or
replace the second addProperty invocation with a guard that checks yarn-site.xml
for an existing yarn.nodemanager.bind-host entry before adding; locate the
addProperty calls (symbol: addProperty) targeting yarn-site.xml and ensure only
a single write for yarn.nodemanager.bind-host occurs.

CodeRabbit (original) (source:comment#3102348614)
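
The guard mentioned in the prompt above (check for an existing entry before adding) could be sketched roughly as follows. `add_property_once` and `has_property` are hypothetical names; the real `addProperty` in `entrypoint.sh` is not shown in this thread, so the insertion logic here is an assumption and not a copy of it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: true if the config file already defines the property.
has_property() {
  local path="$1" name="$2"
  grep -qF "<name>${name}</name>" "$path"
}

# Hypothetical helper: add a property only if it is not already present.
# Assumes values contain no backslashes, since awk -v interprets escapes.
add_property_once() {
  local path="$1" name="$2" value="$3"
  if has_property "$path" "$name"; then
    echo "skip: $name already set in $path" >&2
    return 0
  fi
  local entry="<property><name>${name}</name><value>${value}</value></property>"
  # Insert the entry on its own line just before </configuration>.
  awk -v entry="$entry" '/<\/configuration>/ { print entry } { print }' "$path" > "${path}.tmp" \
    && mv "${path}.tmp" "$path"
}
```

Calling it twice with `yarn.nodemanager.bind-host` leaves a single entry in the file, so an accidental duplicate line in the script becomes harmless.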

addProperty /etc/hadoop/yarn-site.xml yarn.timeline-service.bind-host 0.0.0.0

# MAPRED
addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0
Contributor

⚠️ Potential issue | 🟡 Minor

Wrong configuration file for YARN property.

yarn.nodemanager.bind-host is a YARN property but is being added to mapred-site.xml. This appears to be a copy-paste error. The MAPRED section should likely set a MapReduce-specific property or be removed.

🔧 Proposed fix (remove incorrect entry)
     # MAPRED
-    addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0
+    addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.address 0.0.0.0:10020
+    addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.webapp.address 0.0.0.0:19888

Alternatively, if no MapReduce multi-homed config is needed, simply remove lines 80-81.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/hoodie/hadoop/base_java17/entrypoint.sh` around lines 80 - 81, The
MAPRED section incorrectly adds the YARN property via the addProperty call
targeting /etc/hadoop/mapred-site.xml with key yarn.nodemanager.bind-host;
remove this addProperty line or move it to the appropriate YARN config (e.g. use
addProperty against /etc/hadoop/yarn-site.xml for yarn.nodemanager.bind-host)
and ensure the MAPRED block only contains MapReduce-specific properties.

CodeRabbit (original) (source:comment#3102348632)

Contributor

@yihua yihua left a comment

LGTM overall. We should address minor comments in a follow-up.

@yihua yihua merged commit 41cfc19 into apache:master Apr 17, 2026
54 of 58 checks passed
@voonhous voonhous deleted the add-docker-compose-hadoop3 branch April 17, 2026 18:42
voonhous added a commit to voonhous/hudi that referenced this pull request Apr 17, 2026
…stack

- Fix duplicate/misplaced Hadoop properties and MY_CONTAINER_IP propagation in base_java17 entrypoint and export_container_ip scripts.
- Guard mapred-site.xml.template copy in base_java17 Dockerfile so it works against Hadoop 3.4.0 which ships mapred-site.xml directly.
- Validate SPARK_MAJOR in build_docker_images.sh and forward HADOOP_AWS_VERSION / AWS_SDK_VERSION as build args keyed off Hadoop major.minor so Hadoop 3.4 builds pull the matching AWS jars.
- Pin linux/amd64 platform on every service in the Spark 4 amd64 compose file, swap Kafka to apache/kafka:3.7.2 with KRaft config, and expose the JobHistory web UI port (19888) on both amd64 and arm64.
- Document which base image goes with which Spark/JDK combo in docker/README.md.
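
The template guard in the second item can be sketched as a conditional copy. This is an illustration under assumptions, not the follow-up commit's actual Dockerfile change; `ensure_mapred_site` is a hypothetical name and the config directory is passed in as an argument.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy mapred-site.xml.template only when the template exists.
# Hadoop 3.4.0 is described above as shipping mapred-site.xml directly, so the
# copy step must not fail when the template is absent.
ensure_mapred_site() {
  local conf_dir="$1"
  if [ -f "$conf_dir/mapred-site.xml.template" ]; then
    cp "$conf_dir/mapred-site.xml.template" "$conf_dir/mapred-site.xml"
  fi
}
```

Inside a Dockerfile `RUN` chain joined with `&&`, the equivalent inline form needs to evaluate to success when the template is missing, which the `if ... fi` construct does and a bare `[ -f ... ] && cp ...` does not.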
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.84%. Comparing base (a369773) to head (1fc0a55).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18520   +/-   ##
=========================================
  Coverage     68.84%   68.84%           
- Complexity    28250    28253    +3     
=========================================
  Files          2464     2464           
  Lines        135442   135442           
  Branches      16417    16417           
=========================================
+ Hits          93239    93251   +12     
+ Misses        34821    34812    -9     
+ Partials       7382     7379    -3     
Flag Coverage Δ
common-and-other-modules 44.58% <ø> (+<0.01%) ⬆️
hadoop-mr-java-client 44.80% <ø> (-0.02%) ⬇️
spark-client-hadoop-common 48.44% <ø> (ø)
spark-java-tests 48.93% <ø> (+0.01%) ⬆️
spark-scala-tests 45.45% <ø> (-0.01%) ⬇️
utilities 38.19% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.
11 files have indirect coverage changes.
