…etup
- Introduce base_java17 Hadoop base image to support Spark 4.x (which requires Java 17)
- build_docker_images.sh auto-selects base_java11 or base_java17 based on SPARK_VERSION
- Add docker-compose_hadoop340_hive313_spark401 files for amd64 and arm64
- Parameterize hadoop-aws and aws-java-sdk-bundle versions in spark_base Dockerfile
📝 Walkthrough
This PR extends Docker infrastructure to support Java 17-based images for Apache Spark 4.0.1+ environments. It introduces a new
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Greptile Summary
This PR adds a Java 17 Hadoop base Docker image (
Key changes:
Issues found:
Confidence Score: 3/5
Not ready to merge: the amd64 compose file is missing platform directives, and the build script's default Hadoop version will produce mismatched image tags when building for Spark 4.0.1.
Two concrete P1 issues block the primary user path: (1) the _amd64.yml compose file omits
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[build_docker_images.sh] --> B{SPARK_MAJOR >= 4?}
B -- Yes --> C[base_java17/Dockerfile\neclipse-temurin:17-jdk\nHadoop 3.4.0]
B -- No --> D[base_java11/Dockerfile\nopenjdk:11-jdk-slim\nHadoop 2.8.4]
C --> E[hudi-hadoop_3.4.0-base]
D --> F[hudi-hadoop_2.8.4-base]
E --> G[datanode / namenode / historyserver]
E --> H[hive_base → hudi-hadoop_3.4.0-hive_3.1.3]
H --> I[spark_base/Dockerfile\nspark-4.0.1-bin-hadoop3.tgz]
I --> J[sparkadhoc / sparkmaster / sparkworker]
J --> K[docker-compose\nhadoop340_hive313_spark401\namd64 / arm64]
style C fill:#d4edda,stroke:#28a745
style K fill:#d4edda,stroke:#28a745
style A fill:#fff3cd,stroke:#ffc107
|
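The branch at the top of the flowchart can be sketched in shell. This is only an illustration of the `SPARK_MAJOR >= 4` decision shown above; the function name `select_base_dir` is hypothetical and is not taken from build_docker_images.sh.

```shell
# Hypothetical helper: choose the Hadoop base image directory from a Spark version.
select_base_dir() {
  spark_version="$1"
  spark_major="${spark_version%%.*}"   # "4.0.1" -> "4"
  if [ "$spark_major" -ge 4 ]; then
    echo "base_java17"                 # Spark 4.x requires Java 17
  else
    echo "base_java11"
  fi
}

select_base_dir "4.0.1"
select_base_dir "3.5.3"
```

Deriving the major version with parameter expansion keeps the check free of external tools.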
services:

  namenode:
    image: apachehudi/hudi-hadoop_3.4.0-namenode:latest
    hostname: namenode
Missing platform: linux/amd64 on every service
Every service in the existing amd64 compose files (e.g. docker-compose_hadoop334_hive313_spark353_amd64.yml) carries a platform: linux/amd64 directive. Without it, Docker will pull/run the image for the host's native architecture, defeating the purpose of the _amd64 variant. On Apple Silicon (arm64) hosts, this would silently use the arm64 image (or fail if only an amd64 image exists), and on an amd64 host this would just happen to work by accident.
All services — namenode, datanode1, historyserver, hive-metastore-postgresql, hivemetastore, hiveserver, zookeeper, kafka, sparkmaster, spark-worker-1, adhoc-1, adhoc-2, minio, and mc — should each have a platform: linux/amd64 line, consistent with the pattern established in all other _amd64.yml files in this directory.
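A quick way to audit a compose file for this is sketched below. It is a rough text scan, not a YAML parser, and it assumes service names sit at two-space indentation under `services:` with their keys at four spaces; the function name `services_missing_platform` is hypothetical.

```shell
# Print every service under "services:" that has no "platform:" key.
# Assumes two-space indentation for service names and four for their keys.
services_missing_platform() {
  awk '
    function report() { if (svc != "" && !has_platform) print svc }
    /^[^ #]/ {                       # any top-level key ends the current service
      report(); svc = ""
      in_services = ($0 ~ /^services:/)
      next
    }
    in_services && /^  [A-Za-z0-9_.-]+:[ \t]*$/ {
      report()
      svc = $1; sub(/:$/, "", svc); has_platform = 0
      next
    }
    in_services && /^    platform:/ { has_platform = 1 }
    END { report() }
  ' "$1"
}
```

Running it against the new _amd64.yml file should print nothing once every service carries the directive.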
RUN set -x \
    && DEBIAN_FRONTEND=noninteractive apt-get -yq update && apt-get -yq install curl wget netcat-openbsd procps \
    && echo "Fetch URL2 is : ${HADOOP_URL}" \
    && curl -fSL "${HADOOP_URL}" -o /tmp/hadoop.tar.gz \
    && curl -fSL "${HADOOP_URL}.asc" -o /tmp/hadoop.tar.gz.asc \
    && mkdir -p /opt/hadoop-$HADOOP_VERSION/logs \
    && tar -xvf /tmp/hadoop.tar.gz -C /opt/ \
    && rm /tmp/hadoop.tar.gz* \
    && ln -s /opt/hadoop-$HADOOP_VERSION/etc/hadoop /etc/hadoop \
    && mkdir /hadoop-data
Hadoop archive signature downloaded but never verified
The .asc file is fetched but no GPG verification step follows:
&& curl -fSL "${HADOOP_URL}.asc" -o /tmp/hadoop.tar.gz.asc \
Without calling gpg --verify, the downloaded .asc file provides no actual integrity guarantee. The same pattern exists in base_java11/Dockerfile, so this is pre-existing, but worth resolving in the new image. A hardened build would import the Apache release keys and run:
gpg --import <apache-keys.asc>
gpg --verify /tmp/hadoop.tar.gz.asc /tmp/hadoop.tar.gz
Alternatively, use the SHA-512 checksum, which Apache publishes alongside the tarball.
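The checksum alternative could look roughly like the sketch below. It assumes `sha512sum` is available in the image and that the expected digest is supplied by the build (for example through a build argument); the function name `verify_sha512` is hypothetical.

```shell
# Hypothetical helper: succeed only if the file's SHA-512 digest equals the
# expected hex string passed as the second argument.
verify_sha512() {
  file="$1"
  expected="$2"
  actual=$(sha512sum "$file" | awk '{print $1}')
  [ "$actual" = "$expected" ]
}

# Intended use inside the RUN step (the digest variable is an assumption):
#   verify_sha512 /tmp/hadoop.tar.gz "$HADOOP_SHA512" || exit 1
```

A checksum only proves the download matches the published digest; the GPG signature additionally ties the artifact to a release key.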
Actionable comments posted: 6
🧹 Nitpick comments (4)
docker/hoodie/hadoop/base_java17/Dockerfile (1)
53-54: Use COPY instead of ADD for local files.
Per Dockerfile best practices, COPY is preferred over ADD when simply copying files (without extraction or URL fetching).
♻️ Proposed fix
-ADD entrypoint.sh /entrypoint.sh
-ADD export_container_ip.sh /usr/bin/
+COPY entrypoint.sh /entrypoint.sh
+COPY export_container_ip.sh /usr/bin/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/hoodie/hadoop/base_java17/Dockerfile` around lines 53 - 54, Replace the ADD directives that copy local files with COPY: change the lines that add entrypoint.sh and export_container_ip.sh (the ADD entrypoint.sh /entrypoint.sh and ADD export_container_ip.sh /usr/bin/) to use COPY instead, ensuring file permissions remain correct (e.g., keep executable bits on entrypoint.sh) and update any related Dockerfile comments if present.
docker/hoodie/hadoop/base_java17/export_container_ip.sh (1)
23-23: Address shellcheck warnings for robustness.
Replace backticks with $(...) and quote the variable to prevent issues with word splitting:
♻️ Proposed fix
- ipAddr=`ifconfig $interface | grep -Eo 'inet (addr:)?([0-9]+\.){3}[0-9]+' | grep -Eo '([0-9]+\.){3}[0-9]+' | grep -v '127.0.0.1' | head`
+ ipAddr=$(ifconfig "$interface" 2>/dev/null | grep -Eo 'inet (addr:)?([0-9]+\.){3}[0-9]+' | grep -Eo '([0-9]+\.){3}[0-9]+' | grep -v '127.0.0.1' | head -n1)
Note: Also added 2>/dev/null to suppress errors for non-existent interfaces and -n1 to head for explicitness.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/hoodie/hadoop/base_java17/export_container_ip.sh` at line 23, The command that assigns ipAddr uses backticks and an unquoted $interface which risks word-splitting and failure; replace the backticks with $(...) and quote "$interface" in the ifconfig call, redirect stderr to /dev/null to ignore missing-interface errors, and make head explicit with -n1; specifically update the ipAddr assignment (the line that references ipAddr and $interface and uses ifconfig | grep ... | head) to use $(...) and "$interface", add 2>/dev/null after ifconfig, and change head to head -n1.
docker/compose/docker-compose_hadoop340_hive313_spark401_arm64.yml (2)
131-141: Replace bitnamilegacy images with current bitnami images.
The bitnamilegacy/zookeeper:3.6.4 and bitnamilegacy/kafka:3.4.1 images are deprecated and no longer maintained. Bitnami has moved these to a legacy namespace and provides no further updates. The current bitnami/zookeeper and bitnami/kafka images are actively maintained and support ARM64 architecture. Upgrade to the latest stable versions (e.g., bitnami/zookeeper:3.9.3 and bitnami/kafka:latest) for ongoing security patches and maintenance.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/compose/docker-compose_hadoop340_hive313_spark401_arm64.yml` around lines 131 - 141, Update the Docker Compose service images for the zookeeper and kafka services to use the maintained Bitnami namespace: replace image references for the zookeeper service (image: 'bitnamilegacy/zookeeper:3.6.4') and the kafka service (image: 'bitnamilegacy/kafka:3.4.1') with current bitnami images (e.g., 'bitnami/zookeeper:3.9.3' and 'bitnami/kafka:latest' or another pinned stable tag), ensuring the quotes/formatting remain consistent and that any environment/port settings for services named zookeeper and kafka are left intact.
227-240: Replace deprecated MinIO environment variables and pin image version.
- MINIO_ACCESS_KEY and MINIO_SECRET_KEY have been deprecated since MinIO release 2021-04-22. Use MINIO_ROOT_USER and MINIO_ROOT_PASSWORD instead.
- Using minio/minio:latest can cause reproducibility issues. Pin to a specific release tag (e.g., RELEASE.2024-12-18T13-15-44Z or another stable version).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/compose/docker-compose_hadoop340_hive313_spark401_arm64.yml` around lines 227 - 240, Update the minio service block to replace deprecated environment variables and pin the image: change MINIO_ACCESS_KEY and MINIO_SECRET_KEY to MINIO_ROOT_USER and MINIO_ROOT_PASSWORD in the environment list for the minio service, and replace the image value minio/minio:latest with a specific release tag (for example RELEASE.2024-12-18T13-15-44Z) to ensure reproducible builds while keeping the existing hostname, container_name, ports, volumes, and command unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docker/compose/docker-compose_hadoop340_hive313_spark401_amd64.yml`:
- Around line 18-35: The namenode service (image
apachehudi/hudi-hadoop_3.4.0-namenode:latest, container_name namenode) is
missing a volumes mount for the declared named volume that should persist
NameNode metadata; add a volumes: section to the namenode service and mount the
declared volume into the NameNode data directory used by the image (e.g., add "-
<your_named_volume>:/hadoop/dfs/name" under the namenode service) so the HDFS
namespace is persisted across container recreates.
In `@docker/hoodie/hadoop/base_java17/Dockerfile`:
- Around line 30-39: The Dockerfile for base_java17 is missing the safe
creation/check for /etc/hadoop/mapred-site.xml which entrypoint.sh expects when
calling addProperty; update the RUN step that extracts Hadoop (the block using
HADOOP_URL, tar -xvf, ln -s, mkdir /hadoop-data) to also create or copy a
default mapred-site.xml if it doesn't exist (match the logic in base_java11) —
e.g., test for /etc/hadoop/mapred-site.xml and if absent copy
/opt/hadoop-$HADOOP_VERSION/etc/hadoop/mapred-site.xml.template or touch a file
so addProperty in entrypoint.sh can safely modify it. Ensure you reference the
same HADOOP_VERSION path and /etc/hadoop symlink so addProperty and
entrypoint.sh find the file.
In `@docker/hoodie/hadoop/base_java17/entrypoint.sh`:
- Around line 80-81: The MAPRED section incorrectly adds the YARN property via
the addProperty call targeting /etc/hadoop/mapred-site.xml with key
yarn.nodemanager.bind-host; remove this addProperty line or move it to the
appropriate YARN config (e.g. use addProperty against /etc/hadoop/yarn-site.xml
for yarn.nodemanager.bind-host) and ensure the MAPRED block only contains
MapReduce-specific properties.
- Around line 76-77: Duplicate addition of the same Hadoop property is
happening: remove the redundant addProperty call so yarn.nodemanager.bind-host
is only added once, or replace the second addProperty invocation with a guard
that checks yarn-site.xml for an existing yarn.nodemanager.bind-host entry
before adding; locate the addProperty calls (symbol: addProperty) targeting
yarn-site.xml and ensure only a single write for yarn.nodemanager.bind-host
occurs.
In `@docker/hoodie/hadoop/base_java17/export_container_ip.sh`:
- Around line 29-30: The export in /usr/bin/export_container_ip.sh only affects
its subprocess, so entrypoint.sh won't see MY_CONTAINER_IP; fix by either (A)
having entrypoint.sh source the script (e.g., use .
/usr/bin/export_container_ip.sh) so export MY_CONTAINER_IP=$ipAddr sets the
variable in the caller, or (B) change export_container_ip.sh to simply output
the ip (echo "$ipAddr") and update entrypoint.sh to capture it
(MY_CONTAINER_IP=$(/usr/bin/export_container_ip.sh)); reference the script name
export_container_ip.sh, the caller entrypoint.sh, and the variable
MY_CONTAINER_IP/ipAddr when making the change.
In `@docker/hoodie/hadoop/spark_base/Dockerfile`:
- Around line 82-86: The Dockerfile uses ARG HADOOP_AWS_VERSION and ARG
AWS_SDK_VERSION but build_docker_images.sh doesn't forward these for Hadoop
3.4.0, causing a mismatch; update build_docker_images.sh to detect when the
requested --hadoop-version is 3.4.0 (or Spark 4.x/Java 17 target) and add the
corresponding --build-arg HADOOP_AWS_VERSION=3.4.0 (and align AWS_SDK_VERSION if
needed) to the docker build command so the HADOOP_AWS_VERSION ARG in the
Dockerfile receives the correct value.
---
Nitpick comments:
In `@docker/compose/docker-compose_hadoop340_hive313_spark401_arm64.yml`:
- Around line 131-141: Update the Docker Compose service images for the
zookeeper and kafka services to use the maintained Bitnami namespace: replace
image references for the zookeeper service (image:
'bitnamilegacy/zookeeper:3.6.4') and the kafka service (image:
'bitnamilegacy/kafka:3.4.1') with current bitnami images (e.g.,
'bitnami/zookeeper:3.9.3' and 'bitnami/kafka:latest' or another pinned stable
tag), ensuring the quotes/formatting remain consistent and that any
environment/port settings for services named zookeeper and kafka are left
intact.
- Around line 227-240: Update the minio service block to replace deprecated
environment variables and pin the image: change MINIO_ACCESS_KEY and
MINIO_SECRET_KEY to MINIO_ROOT_USER and MINIO_ROOT_PASSWORD in the environment
list for the minio service, and replace the image value minio/minio:latest with
a specific release tag (for example RELEASE.2024-12-18T13-15-44Z) to ensure
reproducible builds while keeping the existing hostname, container_name, ports,
volumes, and command unchanged.
In `@docker/hoodie/hadoop/base_java17/Dockerfile`:
- Around line 53-54: Replace the ADD directives that copy local files with COPY:
change the lines that add entrypoint.sh and export_container_ip.sh (the ADD
entrypoint.sh /entrypoint.sh and ADD export_container_ip.sh /usr/bin/) to use
COPY instead, ensuring file permissions remain correct (e.g., keep executable
bits on entrypoint.sh) and update any related Dockerfile comments if present.
In `@docker/hoodie/hadoop/base_java17/export_container_ip.sh`:
- Line 23: The command that assigns ipAddr uses backticks and an unquoted
$interface which risks word-splitting and failure; replace the backticks with
$(...) and quote "$interface" in the ifconfig call, redirect stderr to /dev/null
to ignore missing-interface errors, and make head explicit with -n1;
specifically update the ipAddr assignment (the line that references ipAddr and
$interface and uses ifconfig | grep ... | head) to use $(...) and "$interface",
add 2>/dev/null after ifconfig, and change head to head -n1.
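The parsing half of that pipeline can be exercised on its own, without a real network interface. The sketch below wraps the same grep chain from the proposed fix in a function that reads ifconfig-style text on stdin; the name `first_ipv4` is hypothetical.

```shell
# Extract the first non-loopback IPv4 address from ifconfig-style text on stdin.
# Handles both "inet 172.18.0.5" and the older "inet addr:172.18.0.5" layout.
first_ipv4() {
  grep -Eo 'inet (addr:)?([0-9]+\.){3}[0-9]+' \
    | grep -Eo '([0-9]+\.){3}[0-9]+' \
    | grep -v '127.0.0.1' \
    | head -n1
}

# Intended use in export_container_ip.sh:
#   ipAddr=$(ifconfig "$interface" 2>/dev/null | first_ipv4)
```

Separating the parser from the `ifconfig` call makes the quoting and `head -n1` changes easy to check with canned input.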
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c1227866-d2d5-4be8-abaf-3a7cb50c3c88
📒 Files selected for processing (10)
docker/build_docker_images.sh
docker/compose/docker-compose_hadoop340_hive313_spark401_amd64.yml
docker/compose/docker-compose_hadoop340_hive313_spark401_arm64.yml
docker/hoodie/hadoop/base_java11/Dockerfile
docker/hoodie/hadoop/base_java17/Dockerfile
docker/hoodie/hadoop/base_java17/entrypoint.sh
docker/hoodie/hadoop/base_java17/export_container_ip.sh
docker/hoodie/hadoop/base_java17/pom.xml
docker/hoodie/hadoop/spark_base/Dockerfile
docker/hoodie/hadoop/sparkadhoc/Dockerfile
  namenode:
    image: apachehudi/hudi-hadoop_3.4.0-namenode:latest
    hostname: namenode
    container_name: namenode
    environment:
      - CLUSTER_NAME=hudi_hadoop340_hive313_spark401
    ports:
      - "50070:50070"
      - "8020:8020"
      - "9870:9870"
    env_file:
      - ./hadoop.env
    healthcheck:
      test: ["CMD", "curl", "-f", "http://namenode:9870"]
      interval: 30s
      timeout: 10s
      retries: 3
Mount the declared namenode volume into the NameNode data directory.
The named volume declared at Lines 256-257 is never attached to the namenode service. That means the NameNode metadata stays ephemeral, so recreating this container can wipe the HDFS namespace and break the rest of the stack. Please add a volumes: mount here for the NameNode metadata path used by this image.
Also applies to: 256-257
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docker/compose/docker-compose_hadoop340_hive313_spark401_amd64.yml` around
lines 18 - 35, The namenode service (image
apachehudi/hudi-hadoop_3.4.0-namenode:latest, container_name namenode) is
missing a volumes mount for the declared named volume that should persist
NameNode metadata; add a volumes: section to the namenode service and mount the
declared volume into the NameNode data directory used by the image (e.g., add "-
<your_named_volume>:/hadoop/dfs/name" under the namenode service) so the HDFS
namespace is persisted across container recreates.
RUN set -x \
    && DEBIAN_FRONTEND=noninteractive apt-get -yq update && apt-get -yq install curl wget netcat-openbsd procps \
    && echo "Fetch URL2 is : ${HADOOP_URL}" \
    && curl -fSL "${HADOOP_URL}" -o /tmp/hadoop.tar.gz \
    && curl -fSL "${HADOOP_URL}.asc" -o /tmp/hadoop.tar.gz.asc \
    && mkdir -p /opt/hadoop-$HADOOP_VERSION/logs \
    && tar -xvf /tmp/hadoop.tar.gz -C /opt/ \
    && rm /tmp/hadoop.tar.gz* \
    && ln -s /opt/hadoop-$HADOOP_VERSION/etc/hadoop /etc/hadoop \
    && mkdir /hadoop-data
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if Hadoop 3.4.0 includes mapred-site.xml.template
curl -sI "https://archive.apache.org/dist/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz" | head -5
# Compare with base_java11
diff -u <(cat docker/hoodie/hadoop/base_java11/Dockerfile | grep -A20 "RUN set -x") <(cat docker/hoodie/hadoop/base_java17/Dockerfile | grep -A20 "RUN set -x") 2>/dev/null || echo "Files differ"
Repository: yihua/hudi
Length of output: 202
🏁 Script executed:
# Read the relevant sections from both Dockerfiles
echo "=== base_java11 Dockerfile (lines 30-50) ==="
cat -n docker/hoodie/hadoop/base_java11/Dockerfile | sed -n '30,50p'
echo -e "\n=== base_java17 Dockerfile (lines 30-50) ==="
cat -n docker/hoodie/hadoop/base_java17/Dockerfile | sed -n '30,50p'
Repository: yihua/hudi
Length of output: 2144
🏁 Script executed:
# Check if mapred-site.xml.template exists in Hadoop 3.4.0 distribution
# Download and inspect the Hadoop tarball structure
cd /tmp
curl -sL "https://archive.apache.org/dist/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz" | tar -tzf - | grep -i "mapred-site.xml" | head -20
Repository: yihua/hudi
Length of output: 93
🏁 Script executed:
# Search for all template files in Hadoop 3.4.0 distribution
cd /tmp
curl -sL "https://archive.apache.org/dist/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz" | tar -tzf - | grep "\.template" | head -20Repository: yihua/hudi
Length of output: 140
🏁 Script executed:
# Also check all files in etc/hadoop directory
cd /tmp
curl -sL "https://archive.apache.org/dist/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz" | tar -tzf - | grep "etc/hadoop/" | sort
Repository: yihua/hudi
Length of output: 140
🏁 Script executed:
# Search repository for any references to mapred-site.xml
rg "mapred-site.xml" --type dockerfile
echo "---"
# Check if there's documentation about the differences between base_java11 and base_java17
find docker/hoodie/hadoop -name "README*" -o -name "*.md" | xargs cat 2>/dev/null || echo "No README found"
echo "---"
# Look for any comments or context around the Hadoop setup
grep -r "mapred-site" docker/hoodie/hadoop/ 2>/dev/null || echo "No matches in comments"
Repository: yihua/hudi
Length of output: 1004
🏁 Script executed:
# Check the full base_java17 Dockerfile to understand the context
cat -n docker/hoodie/hadoop/base_java17/Dockerfile | head -80
Repository: yihua/hudi
Length of output: 2683
Add missing mapred-site.xml configuration step.
The base_java17 Dockerfile is missing a step that exists in base_java11. More importantly, the entrypoint.sh script for base_java17 expects mapred-site.xml to exist when it attempts to modify it:
addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0
The base Dockerfile handles this more safely with a conditional check. Update base_java17 to match:
Proposed fix
   && ln -s /opt/hadoop-$HADOOP_VERSION/etc/hadoop /etc/hadoop \
+  && if [ -f /etc/hadoop/mapred-site.xml.template ]; then cp /etc/hadoop/mapred-site.xml.template /etc/hadoop/mapred-site.xml; fi \
   && mkdir /hadoop-data
🧰 Tools
🪛 Trivy (0.69.3)
[error] 30-39: 'apt-get' missing '--no-install-recommends'
'--no-install-recommends' flag is missed: 'set -x && DEBIAN_FRONTEND=noninteractive apt-get -yq update && apt-get -yq install curl wget netcat-openbsd procps && echo "Fetch URL2 is : ${HADOOP_URL}" && curl -fSL "${HADOOP_URL}" -o /tmp/hadoop.tar.gz && curl -fSL "${HADOOP_URL}.asc" -o /tmp/hadoop.tar.gz.asc && mkdir -p /opt/hadoop-$HADOOP_VERSION/logs && tar -xvf /tmp/hadoop.tar.gz -C /opt/ && rm /tmp/hadoop.tar.gz* && ln -s /opt/hadoop-$HADOOP_VERSION/etc/hadoop /etc/hadoop && mkdir /hadoop-data'
Rule: DS-0029
(IaC/Dockerfile)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docker/hoodie/hadoop/base_java17/Dockerfile` around lines 30 - 39, The
Dockerfile for base_java17 is missing the safe creation/check for
/etc/hadoop/mapred-site.xml which entrypoint.sh expects when calling
addProperty; update the RUN step that extracts Hadoop (the block using
HADOOP_URL, tar -xvf, ln -s, mkdir /hadoop-data) to also create or copy a
default mapred-site.xml if it doesn't exist (match the logic in base_java11) —
e.g., test for /etc/hadoop/mapred-site.xml and if absent copy
/opt/hadoop-$HADOOP_VERSION/etc/hadoop/mapred-site.xml.template or touch a file
so addProperty in entrypoint.sh can safely modify it. Ensure you reference the
same HADOOP_VERSION path and /etc/hadoop symlink so addProperty and
entrypoint.sh find the file.
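The "copy the template or fall back to an empty file" logic described in this prompt might be sketched as follows. The function name `ensure_mapred_site` is hypothetical, and whether the Hadoop 3.4.0 tarball ships a template at all is not settled by the analysis above, so both branches are covered.

```shell
# Hypothetical helper: guarantee that <conf_dir>/mapred-site.xml exists so a
# later addProperty call has a file to edit.
ensure_mapred_site() {
  conf_dir="$1"
  target="$conf_dir/mapred-site.xml"
  if [ -f "$target" ]; then
    return 0                                   # already shipped by the tarball
  fi
  if [ -f "$target.template" ]; then
    cp "$target.template" "$target"            # same approach as the proposed fix
  else
    # Fall back to an empty configuration document.
    printf '<?xml version="1.0"?>\n<configuration>\n</configuration>\n' > "$target"
  fi
}

# Intended use in the Dockerfile RUN step: ensure_mapred_site /etc/hadoop
```

Using `[ -f ... ]` instead of `[[ ... ]]` keeps the check valid under a plain POSIX `sh`, which is what a Dockerfile `RUN` uses by default.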
addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
Duplicate property addition.
yarn.nodemanager.bind-host is added twice to yarn-site.xml (lines 76 and 77).
🔧 Proposed fix
addProperty /etc/hadoop/yarn-site.xml yarn.resourcemanager.bind-host 0.0.0.0
addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
- addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
addProperty /etc/hadoop/yarn-site.xml yarn.timeline-service.bind-host 0.0.0.0
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
- addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
+ addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docker/hoodie/hadoop/base_java17/entrypoint.sh` around lines 76 - 77,
Duplicate addition of the same Hadoop property is happening: remove the
redundant addProperty call so yarn.nodemanager.bind-host is only added once, or
replace the second addProperty invocation with a guard that checks yarn-site.xml
for an existing yarn.nodemanager.bind-host entry before adding; locate the
addProperty calls (symbol: addProperty) targeting yarn-site.xml and ensure only
a single write for yarn.nodemanager.bind-host occurs.
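The guard option mentioned in this prompt could be sketched as below. `add_property_once` is a hypothetical stand-in written for illustration; the real `addProperty` in entrypoint.sh is not shown in this review and may be implemented differently.

```shell
# Hypothetical stand-in for addProperty that refuses to write a property twice.
add_property_once() {
  file="$1"
  name="$2"
  value="$3"
  # Skip the write when the property name is already present in the file.
  if grep -qF "<name>$name</name>" "$file"; then
    return 0
  fi
  entry="<property><name>$name</name><value>$value</value></property>"
  # Emit the new entry on the line before the closing </configuration> tag.
  awk -v entry="$entry" '/<\/configuration>/ { print entry } { print }' "$file" > "$file.tmp" \
    && mv "$file.tmp" "$file"
}
```

Simply deleting the duplicate line is the smaller change; a guard like this only pays off if properties can also arrive from other sources such as environment variables.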
# MAPRED
addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0
Wrong configuration file for YARN property.
yarn.nodemanager.bind-host is a YARN property but is being added to mapred-site.xml. This appears to be a copy-paste error. The MAPRED section should likely set a MapReduce-specific property or be removed.
🔧 Proposed fix (remove incorrect entry)
# MAPRED
- addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0
+ addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.address 0.0.0.0:10020
+ addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.webapp.address 0.0.0.0:19888
Alternatively, if no MapReduce multi-homed config is needed, simply remove lines 80-81.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- # MAPRED
- addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0
+ # MAPRED
+ addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.address 0.0.0.0:10020
+ addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.webapp.address 0.0.0.0:19888
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docker/hoodie/hadoop/base_java17/entrypoint.sh` around lines 80 - 81, The
MAPRED section incorrectly adds the YARN property via the addProperty call
targeting /etc/hadoop/mapred-site.xml with key yarn.nodemanager.bind-host;
remove this addProperty line or move it to the appropriate YARN config (e.g. use
addProperty against /etc/hadoop/yarn-site.xml for yarn.nodemanager.bind-host)
and ensure the MAPRED block only contains MapReduce-specific properties.
echo "Container IP is set to : $ipAddr"
export MY_CONTAINER_IP=$ipAddr
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check how the base_java11 version handles this
cat docker/hoodie/hadoop/base_java11/export_container_ip.sh 2>/dev/null || echo "File not found"
cat docker/hoodie/hadoop/base_java11/entrypoint.sh 2>/dev/null | grep -A2 -B2 "export_container_ip"
Repository: yihua/hudi
Length of output: 1269
🏁 Script executed:
cat docker/hoodie/hadoop/base_java17/entrypoint.sh 2>/dev/null | grep -A3 -B3 "export_container_ip" || echo "Not found or error"
Repository: yihua/hudi
Length of output: 136
🏁 Script executed:
cat docker/hoodie/hadoop/base_java17/entrypoint.sh
Repository: yihua/hudi
Length of output: 4618
Exported variable in subprocess won't propagate to entrypoint.sh.
When entrypoint.sh calls /usr/bin/export_container_ip.sh as a subprocess (not sourced), the export MY_CONTAINER_IP=$ipAddr only affects that subprocess. The variable won't be available in entrypoint.sh after the script completes. Note that base_java11/entrypoint.sh avoids this by exporting MY_CONTAINER_IP directly in the entrypoint script before calling the subprocess.
Fix this by sourcing the script:
-/usr/bin/export_container_ip.sh
+source /usr/bin/export_container_ip.sh
Or have the script output the IP for the caller to capture:
echo "$ipAddr" # Caller does: MY_CONTAINER_IP=$(/usr/bin/export_container_ip.sh)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docker/hoodie/hadoop/base_java17/export_container_ip.sh` around lines 29 -
30, The export in /usr/bin/export_container_ip.sh only affects its subprocess,
so entrypoint.sh won't see MY_CONTAINER_IP; fix by either (A) having
entrypoint.sh source the script (e.g., use . /usr/bin/export_container_ip.sh) so
export MY_CONTAINER_IP=$ipAddr sets the variable in the caller, or (B) change
export_container_ip.sh to simply output the ip (echo "$ipAddr") and update
entrypoint.sh to capture it
(MY_CONTAINER_IP=$(/usr/bin/export_container_ip.sh)); reference the script name
export_container_ip.sh, the caller entrypoint.sh, and the variable
MY_CONTAINER_IP/ipAddr when making the change.
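The difference between running and sourcing the script can be reproduced with a throwaway file. In this sketch the temporary script stands in for export_container_ip.sh, and the IP address is a made-up value.

```shell
# Throwaway script standing in for export_container_ip.sh (made-up address).
demo_script=$(mktemp)
printf 'export MY_CONTAINER_IP=172.18.0.5\n' > "$demo_script"

unset MY_CONTAINER_IP
sh "$demo_script"                               # child process: export is lost on exit
after_subprocess="${MY_CONTAINER_IP:-unset}"

. "$demo_script"                                # current shell: export persists
after_source="${MY_CONTAINER_IP:-unset}"

echo "after subprocess: $after_subprocess"
echo "after source:     $after_source"
rm -f "$demo_script"
```

The first line printed reports `unset` and the second reports the address, which is why option (A) or (B) is needed for entrypoint.sh to see the value.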
+ARG HADOOP_AWS_VERSION=3.3.4
+ARG AWS_SDK_VERSION=1.12.734
 RUN wget -nc -q -O "${SPARK_INSTALL}/jars/jersey-bundle-1.19.4.jar" "https://repo1.maven.org/maven2/com/sun/jersey/jersey-bundle/1.19.4/jersey-bundle-1.19.4.jar" && \
-    wget -O "${SPARK_INSTALL}/jars/hadoop-aws-3.3.4.jar" https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar && \
-    wget -O "${SPARK_INSTALL}/jars/aws-java-sdk-bundle-1.12.734.jar" https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.734/aws-java-sdk-bundle-1.12.734.jar
+    wget -O "${SPARK_INSTALL}/jars/hadoop-aws-${HADOOP_AWS_VERSION}.jar" https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_AWS_VERSION}/hadoop-aws-${HADOOP_AWS_VERSION}.jar && \
+    wget -O "${SPARK_INSTALL}/jars/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar" https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if build_docker_images.sh passes HADOOP_AWS_VERSION arg
rg -n "HADOOP_AWS_VERSION" docker/build_docker_images.sh
Repository: yihua/hudi
Length of output: 36
🏁 Script executed:
# Find and examine the build script
find . -name "build_docker_images.sh" -o -name "*build*.sh" | head -20
Repository: yihua/hudi
Length of output: 1283
🏁 Script executed:
# Check the full Dockerfile to understand Hadoop version context
head -100 docker/hoodie/hadoop/spark_base/Dockerfile
Repository: yihua/hudi
Length of output: 3814
🏁 Script executed:
# Search for references to Hadoop 3.4.0 or Java 17 context in the repo
rg -i "hadoop.3.4|java.17|spark.4" docker/ --type dockerfile --max-count 10
Repository: yihua/hudi
Length of output: 88
🏁 Script executed:
# Read the build_docker_images.sh script
cat docker/build_docker_images.shRepository: yihua/hudi
Length of output: 5138
🏁 Script executed:
# Check for other spark_base Dockerfiles or variants
find docker -name "*Dockerfile*" -o -name "*dockerfile*" | grep -i spark
Repository: yihua/hudi
Length of output: 227
🏁 Script executed:
# Search for Hadoop 3.4.0 references in docker directory
rg "3.4.0" docker/
Repository: yihua/hudi
Length of output: 5786
🏁 Script executed:
# Check the base_java17 Dockerfile to see how it handles Hadoop and AWS versions
head -100 docker/hoodie/hadoop/base_java17/Dockerfile
Repository: yihua/hudi
Length of output: 2277
🏁 Script executed:
# Check if spark_base Dockerfile inherits from a base image
grep -n "^FROM" docker/hoodie/hadoop/spark_base/Dockerfile
Repository: yihua/hudi
Length of output: 123
🏁 Script executed:
# Check if base_java17 has the same AWS library ARG setup
rg -n "HADOOP_AWS_VERSION|AWS_SDK_VERSION" docker/hoodie/hadoop/base_java17/
Repository: yihua/hudi
Length of output: 36
Address potential AWS SDK version mismatch for Hadoop 3.4.0 builds.
The parameterization of HADOOP_AWS_VERSION and AWS_SDK_VERSION is good, but build_docker_images.sh currently doesn't pass these values when building with different Hadoop versions. When using --hadoop-version 3.4.0 (for Spark 4.x with Java 17), the docker build will still default to HADOOP_AWS_VERSION=3.3.4, creating a version mismatch. Update the build script to pass --build-arg HADOOP_AWS_VERSION=3.4.0 when appropriate for Hadoop 3.4.0 builds.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docker/hoodie/hadoop/spark_base/Dockerfile` around lines 82 - 86, The
Dockerfile uses ARG HADOOP_AWS_VERSION and ARG AWS_SDK_VERSION but
build_docker_images.sh doesn't forward these for Hadoop 3.4.0, causing a
mismatch; update build_docker_images.sh to detect when the requested
--hadoop-version is 3.4.0 (or Spark 4.x/Java 17 target) and add the
corresponding --build-arg HADOOP_AWS_VERSION=3.4.0 (and align AWS_SDK_VERSION if
needed) to the docker build command so the HADOOP_AWS_VERSION ARG in the
Dockerfile receives the correct value.
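One possible shape for that forwarding logic is sketched below. The function name `aws_build_args` is hypothetical, and the sketch assumes that hadoop-aws should track the Hadoop version for 3.4.x builds only; it leaves AWS_SDK_VERSION at the Dockerfile default, since the matching SDK bundle for Hadoop 3.4.0 is not stated in the review and would need to be checked separately.

```shell
# Hypothetical helper: extra docker build arguments derived from the Hadoop version.
# Prints nothing for other versions so the Dockerfile default (3.3.4) still applies.
aws_build_args() {
  hadoop_version="$1"
  case "$hadoop_version" in
    3.4.*) echo "--build-arg HADOOP_AWS_VERSION=${hadoop_version}" ;;
    *)     echo "" ;;
  esac
}

# Intended use in build_docker_images.sh (left unquoted so the two words split):
#   docker build $(aws_build_args "$HADOOP_VERSION") ...
```

Keeping the mapping in one function makes it easy to extend if later Hadoop lines need their own pairing.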
Mirror of apache#18520 for automated bot review.
Original author: @voonhous
Base branch: master
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Chores