refactor(docker): migrate single-node compose from host to bridge networking by bitflicker64 · Pull Request #2952 · apache/hugegraph

bitflicker64 · 2026-02-15T13:13:37Z

Purpose of the PR

Fix Docker deployment failing on macOS due to Linux-only host networking. Originally scoped to single-node, expanded during review to cover the full 3-node distributed cluster as well.

close #2951

Main Changes

Docker Compose:

Remove network_mode: host from both single-node and 3-node compose files
Switch to Docker bridge network (hg-net) with container hostnames
Replace config file volume mounts with environment variable injection (HG_* prefix)
Add proper depends_on: condition: service_healthy, restart policies, and healthchecks

Entrypoint Scripts (complete rewrite):

hugegraph-pd/hg-pd-dist/docker/docker-entrypoint.sh — SPRING_APPLICATION_JSON injection via HG_PD_* env vars, deprecated alias migration
hugegraph-store/hg-store-dist/docker/docker-entrypoint.sh — HG_STORE_* env vars
hugegraph-server/hugegraph-dist/docker/docker-entrypoint.sh — HG_SERVER_* env vars, auth support
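As a rough illustration of the env-var injection described above, the sketch below builds a SPRING_APPLICATION_JSON value from a few HG_PD_* variables. The variable names (HG_PD_GRPC_HOST, HG_PD_RAFT_ADDRESS, HG_PD_RAFT_PEERS_LIST), the property keys, and the helper names are assumptions chosen for illustration; the actual entrypoint may use different ones.

```sh
#!/bin/sh
# Sketch only: variable names and property keys are illustrative assumptions.

# Escape backslashes and double quotes so a value is safe inside a JSON string.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Fail when a required variable is missing or empty.
# Runs in a subshell so the temporary variable does not leak.
require_var() (
    eval "value=\${$1:-}"
    if [ -z "$value" ]; then
        echo "ERROR: $1 must be set" >&2
        exit 1
    fi
)

# Print a flat JSON object suitable for SPRING_APPLICATION_JSON.
# The peers list falls back to the node's own raft address (single-node case).
build_pd_json() {
    require_var HG_PD_GRPC_HOST || return 1
    require_var HG_PD_RAFT_ADDRESS || return 1
    printf '{"grpc.host":"%s","raft.address":"%s","raft.peers-list":"%s"}' \
        "$(json_escape "$HG_PD_GRPC_HOST")" \
        "$(json_escape "$HG_PD_RAFT_ADDRESS")" \
        "$(json_escape "${HG_PD_RAFT_PEERS_LIST:-$HG_PD_RAFT_ADDRESS}")"
}
```

An entrypoint could then run `SPRING_APPLICATION_JSON="$(build_pd_json)" || exit 1` followed by `export SPRING_APPLICATION_JSON` before starting the JVM, so a missing required variable stops the container early instead of producing a half-configured node.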

Helper Scripts:

wait-storage.sh — configurable PD auth
wait-partition.sh — 120s timeout, configurable
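A configurable timeout of this kind can be sketched as a small polling helper. The function name `wait_until` and the WAIT_PARTITION_TIMEOUT_S variable in the usage note are hypothetical; only the 120s default comes from the list above.

```sh
#!/bin/sh
# wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds, or gives up once TIMEOUT seconds of
# sleeping have accumulated. Runs in a subshell to keep its variables local.
wait_until() (
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out after ${timeout}s: $*" >&2
            exit 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
)
```

A caller might write `wait_until "${WAIT_PARTITION_TIMEOUT_S:-120}" 5 some_check`, where `some_check` is any command that returns 0 once partitions are assigned. The elapsed time counts only the sleeps, so a slow probe makes the real wall-clock wait somewhat longer than the nominal timeout.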

Documentation:

Created docker/README.md — full setup guide, env var reference, port table, troubleshooting
Fixed 7 files pointing to dead example/ directory
Fixed hugegraph-store/docs/deployment-guide.md wrong env var names
Updated K8s manifest env var names
Added bridge network notes to PD configuration and README docs

Problem

The original Docker configuration uses network_mode: host which only works on native Linux. Docker Desktop on macOS does not implement host networking the same way. Containers start but HugeGraph services advertise incorrect addresses (127.0.0.1, 0.0.0.0).

Resulting failures:

Server stuck in loop waiting for storage backend
PD client UNAVAILABLE io exception errors
Store reports zero partitions
Cluster never becomes usable even though containers are running

Root Cause

network_mode: host is Linux-specific
Docker Desktop falls back to bridge networking silently
HugeGraph components advertise localhost-style addresses
Other containers cannot route to those addresses

Solution

Switch to bridge networking and advertise container-resolvable hostnames. Docker DNS resolves service names automatically. Configuration is injected via SPRING_APPLICATION_JSON through HG_* env vars instead of mounted config files.

Verification

Tested on macOS (Apple M4, Docker Desktop):

Single-node cluster:

docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
hg-server   Up healthy   0.0.0.0:8080->8080
hg-store    Up healthy   0.0.0.0:8520->8520
hg-pd       Up healthy   0.0.0.0:8620->8620

3-node cluster:

curl -u store:admin -s http://localhost:8620/v1/stores | grep -o '"partitionCount":[0-9]*'
"partitionCount":12
"partitionCount":12
"partitionCount":12

All 3 stores show partitionCount:12. All 9 containers healthy.
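The manual check above could be turned into a pass/fail test along the following lines. This is a sketch that assumes the response contains one `"partitionCount":N` field per store, as in the output shown; the function name is hypothetical.

```sh
#!/bin/sh
# all_stores_have_partitions EXPECTED
# Reads a PD /v1/stores JSON response on stdin and succeeds only if it
# reports exactly EXPECTED stores, each with a non-zero partitionCount.
all_stores_have_partitions() (
    expected=$1
    # Extract just the numbers, one per line.
    counts=$(grep -o '"partitionCount":[0-9]*' | cut -d: -f2)
    total=$(printf '%s\n' "$counts" | grep -c '[0-9]')
    nonzero=$(printf '%s\n' "$counts" | grep -c '[1-9]')
    [ "$total" -eq "$expected" ] && [ "$nonzero" -eq "$expected" ]
)
```

For the 3-node cluster this would be used as `curl -u store:admin -s http://localhost:8620/v1/stores | all_stores_have_partitions 3`.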

Also verified on Ubuntu 24.04 (native Docker):

All 9 containers healthy
partitionCount:12 on all 3 stores
Cluster works regardless of which PD node wins leader election

Does this PR potentially affect the following parts?

Documentation Status

Doc - Done

Related fixes discovered during this work:

| Bug | Issue | Fix |
| --- | --- | --- |
| `getLeaderGrpcAddress()` NPE in bridge mode | #2959 | #2961 |
| `IpAuthHandler` hostname vs IP mismatch | #2960 | #2962 |

Changes Checklist

Replace network_mode: host with explicit port mappings and add configuration volumes for PD, Store, and Server services to support macOS/Windows Docker.

- Remove host network mode from all services
- Add explicit port mappings (8620, 8520, 8080)
- Add configuration directories with volume mounts
- Update healthcheck endpoints
- Add PD peers environment variable

Enables HugeGraph cluster to run on all Docker platforms.

bitflicker64 · 2026-02-20T12:09:08Z

Bridge networking changes have been validated successfully across environments:

macOS (Docker Desktop)
Ubuntu 24.04.4 LTS

Observed behavior:

PD container starts and becomes healthy
Store container starts, registers, and receives partitions
Partitions are assigned and Raft leaders are elected
Server container initializes without errors
REST endpoints respond as expected

No regressions were observed in the single-node deployment. Service discovery and inter-container communication function correctly under bridge networking.

ARM64 Compatibility Fix — `wait-storage.sh`

Problem

The original wait-storage.sh relied on gremlin-console.sh for storage readiness detection:

On ARM64 (Apple Silicon), this fails due to a Jansi native library crash

Root Cause

gremlin-console.sh depends on Jansi, which is unstable on ARM64
The detection logic is triggered only when hugegraph.* environment variables are used
Volume-mounted configurations bypass this code path, masking the failure

Fix

Replaced Gremlin Console detection with a lightweight PD REST health check:
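A minimal sketch of such a check, assuming curl is available in the image. The function name is hypothetical and the URL is left as a parameter, since the exact health path used by the script is not shown here.

```sh
#!/bin/sh
# pd_is_healthy URL [USER:PASSWORD]
# Succeeds when URL answers with a 2xx status within 5 seconds.
# -f makes curl return non-zero on HTTP errors, -sS hides progress output
# but still prints errors, and -o /dev/null discards the response body.
pd_is_healthy() {
    if [ -n "${2:-}" ]; then
        curl -fsS --max-time 5 -u "$2" -o /dev/null "$1"
    else
        curl -fsS --max-time 5 -o /dev/null "$1"
    fi
}
```

For example, `pd_is_healthy "http://localhost:8620/v1/stores" "store:admin"` probes the same endpoint and credentials used in the verification section. Because it needs only curl, this avoids starting a JVM and loading Jansi just to test readiness.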

Cleanup

detect-storage.groovy is no longer required by the updated startup flow and can be removed

Copilot

Pull request overview

This PR migrates the single-node Docker Compose configuration from Linux-specific host networking to cross-platform bridge networking. The change addresses a critical issue where Docker Desktop on macOS and Windows doesn't support host networking properly, causing services to advertise unreachable addresses and preventing cluster initialization.

Changes:

Replaced host networking with bridge networking and explicit port mappings
Added comprehensive environment-based configuration for PD, Store, and Server through new entrypoint scripts
Implemented health-aware startup with PD REST endpoint polling in wait-storage.sh
Added volume mounts for persistent data and deprecated variable migration guards

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file

| File | Description |
| --- | --- |
| docker/docker-compose.yml | Migrated from host to bridge networking, added environment variables, updated healthchecks, exposed required ports |
| hugegraph-pd/hg-pd-dist/docker/docker-entrypoint.sh | New comprehensive entrypoint with SPRING_APPLICATION_JSON configuration, deprecation guards, and required variable validation |
| hugegraph-store/hg-store-dist/docker/docker-entrypoint.sh | New comprehensive entrypoint with SPRING_APPLICATION_JSON configuration, deprecation guards, and required variable validation |
| hugegraph-server/hugegraph-dist/docker/docker-entrypoint.sh | Refactored to use environment variables for backend and PD configuration with deprecation guards |
| hugegraph-server/hugegraph-dist/src/assembly/static/bin/wait-storage.sh | Replaced Gremlin-based storage detection with PD REST health endpoint polling, increased timeout to 300s |

docker/docker-compose.yml

hugegraph-pd/hg-pd-dist/docker/docker-entrypoint.sh

hugegraph-store/hg-store-dist/docker/docker-entrypoint.sh

docker/docker-compose.yml

bitflicker64 · 2026-02-20T18:52:22Z

Thank you for the review. I’ll take care of the suggested adjustments and will proceed with testing the 3-node cluster configuration next.

hugegraph-server/hugegraph-dist/src/assembly/static/bin/wait-storage.sh

docker/docker-compose.yml

hugegraph-server/hugegraph-dist/docker/docker-entrypoint.sh

hugegraph-pd/hg-pd-dist/docker/docker-entrypoint.sh

hugegraph-server/hugegraph-dist/docker/scripts/detect-storage.groovy

hugegraph-server/hugegraph-dist/docker/docker-entrypoint.sh

docker/docker-compose.yml

hugegraph-server/hugegraph-dist/src/assembly/static/bin/wait-storage.sh

Copilot

Pull request overview

Copilot reviewed 33 out of 37 changed files in this pull request and generated 2 comments.

docker/docker-compose-3pd-3store-3server.yml

hugegraph-server/hugegraph-dist/docker/docker-entrypoint.sh

hugegraph-server/hugegraph-dist/src/assembly/static/bin/wait-storage.sh

imbajin · 2026-03-09T10:54:18Z

Overall this PR looks good to me now.

The overall direction makes sense, and I really appreciate all the work you’ve put into this, especially the follow-up updates during review. Thanks a lot for the contribution and for continuing to improve it.

Once the last small issue is fixed and confirmed, I think it would be reasonable to merge it first and validate it in real-world use.

bitflicker64 · 2026-03-10T06:53:00Z

> Overall this PR looks good to me now.
>
> The overall direction makes sense, and I really appreciate all the work you’ve put into this, especially the follow-up updates during review. Thanks a lot for the contribution and for continuing to improve it.
>
> Once the last small issue is fixed and confirmed, I think it would be reasonable to merge it first and validate it in real-world use.

@imbajin I'll get the remaining fix done shortly.

Also wanted to flag something I noticed while testing: docker logs doesn't work properly for any of the HugeGraph containers right now. The startup scripts redirect JVM output to files inside the container (>> ${OUTPUT} 2>&1), so nothing reaches Docker's log capture. On top of that, the console appender is defined in the Log4j2 dist configs but never wired to the root logger. This means that during debugging I had to manually exec into the container and tail the log files directly.

I also noticed the non-dist log4j2.xml configs already have the console appender wired correctly — the dist configs just never got updated to match. Should I open a separate PR for this?
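To illustrate the redirect problem only: one possible shell-side workaround is to duplicate output to both the log file and stdout, so Docker's log capture sees it. This is a sketch of an alternative technique, not the console-appender fix proposed here, and `run_logged` is a hypothetical helper.

```sh
#!/bin/sh
# run_logged LOGFILE COMMAND [ARGS...]
# Runs COMMAND, appending its stdout and stderr to LOGFILE while also
# passing them through to this script's stdout (what `docker logs` reads).
# Note: the pipeline's exit status is tee's, not COMMAND's.
run_logged() {
    log_file=$1
    shift
    "$@" 2>&1 | tee -a "$log_file"
}
```

Compared with `>> ${OUTPUT} 2>&1`, this keeps the on-disk log but no longer hides the output from the container's stdout.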

imbajin · 2026-03-10T07:06:21Z

> Also wanted to flag something I noticed while testing: docker logs doesn't work properly for any of the HugeGraph containers right now. The startup scripts redirect JVM output to files inside the container (>> ${OUTPUT} 2>&1), so nothing reaches Docker's log capture. On top of that, the console appender is defined in the Log4j2 dist configs but never wired to the root logger. This means that during debugging I had to manually exec into the container and tail the log files directly.
>
> I also noticed the non-dist log4j2.xml configs already have the console appender wired correctly — the dist configs just never got updated to match. Should I open a separate PR for this?

Thanks for catching that while testing — that does sound like a real issue.

I’d lean toward a separate PR for it, since it feels a bit beyond the main scope of this one. Keeping it separate would make this PR easier to land first, and the logging change can be reviewed on its own.

So I think it makes sense to finish the remaining fix here, and open a follow-up PR for the docker logs / dist logging config issue.

bitflicker64 · 2026-03-10T07:11:03Z

> Thanks for catching that while testing — that does sound like a real issue.
>
> I’d lean toward a separate PR for it, since it feels a bit beyond the main scope of this one. Keeping it separate would make this PR easier to land first, and the logging change can be reviewed on its own.
>
> So I think it makes sense to finish the remaining fix here, and open a follow-up PR for the docker logs / dist logging config issue.

Makes sense, will do!

- Cache leader PeerId after waitingForLeader() and null-check to avoid NPE when leader election times out
- Remove incorrect fallback that derived leader gRPC address from local node's port, causing silent misroutes in multi-node clusters
- Wire config.getRpcTimeout() into RaftRpcClient's RpcOptions so Bolt transport timeout is consistent with future.get() caller timeout
- Replace hardcoded 10000ms in waitingForLeader() with config.getRpcTimeout()
- Remove unused RaftOptions variable and dead imports (ReplicatorGroup, ThreadId)

Fixes apache#2959

Related to apache#2952, apache#2962

bitflicker64 · 2026-03-17T14:04:33Z

Fixed in the latest commit: wait-storage.sh now tries all PD peers in order instead of only the first one. Also fixed the wait-partition.sh endpoint, which was always returning partitionCount:0. Both verified on Ubuntu 24.04 with a fresh 3-node cluster.
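Trying peers in order can be sketched as below. The function name, the comma-separated list format, and the `probe` callback are assumptions for illustration; the real script may structure this differently.

```sh
#!/bin/sh
# first_reachable_peer PEERS PROBE [PROBE_ARGS...]
# PEERS is a comma-separated list such as "pd0:8620,pd1:8620,pd2:8620".
# PROBE is invoked as: PROBE [PROBE_ARGS...] PEER
# Prints the first peer for which PROBE succeeds; fails if none does.
first_reachable_peer() (
    peers=$1
    shift
    for peer in $(printf '%s' "$peers" | tr ',' ' '); do
        if "$@" "$peer"; then
            printf '%s\n' "$peer"
            exit 0
        fi
    done
    exit 1
)
```

With a probe that wraps a curl health check, a call such as `first_reachable_peer "$PD_PEERS" probe` (PD_PEERS being a hypothetical variable) keeps startup from stalling when the first listed PD node is down but another peer is serving.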

Copilot

Pull request overview

Copilot reviewed 33 out of 37 changed files in this pull request and generated 9 comments.

docker/docker-compose-3pd-3store-3server.yml

docker/docker-compose.yml

hugegraph-pd/hg-pd-dist/docker/docker-entrypoint.sh

+./bin/start-hugegraph-pd.sh -j "${JAVA_OPTS:-}"
 tail -f /dev/null


docker/docker-compose-3pd-3store-3server.yml

docker/docker-compose.yml

docker/docker-compose-3pd-3store-3server.yml

hugegraph-store/hg-store-dist/docker/docker-entrypoint.sh

+./bin/start-hugegraph-store.sh -j "${JAVA_OPTS:-}"
 tail -f /dev/null


docker/docker-compose.yml

hugegraph-server/hugegraph-dist/docker/docker-entrypoint.sh

+./bin/start-hugegraph.sh -j "${JAVA_OPTS:-}" -t 120
+
+# Post-startup cluster stabilization check
+./bin/wait-partition.sh || log "WARN: partitions not assigned yet"



VGalaxies · 2026-03-18T12:23:18Z

I re-ran this locally with the Docker dev compose path and the single-node flow works fine for me now: PD, Store, and Server all came up healthy, and the basic endpoints responded as expected.

One small follow-up issue I noticed: the new PD/Store entrypoints seem to break the current standalone docker run examples in the module READMEs, because they now require the newer HG_PD_* / HG_STORE_* environment variables. This doesn’t seem blocking for the compose-based change in this PR, but it would be good to align the docs or add a small compatibility fallback in a follow-up.

bitflicker64 · 2026-03-18T12:47:04Z

> I re-ran this locally with the Docker dev compose path and the single-node flow works fine for me now: PD, Store, and Server all came up healthy, and the basic endpoints responded as expected.
>
> One small follow-up issue I noticed: the new PD/Store entrypoints seem to break the current standalone docker run examples in the module READMEs, because they now require the newer HG_PD_* / HG_STORE_* environment variables. This doesn’t seem blocking for the compose-based change in this PR, but it would be good to align the docs or add a small compatibility fallback in a follow-up.

The docker run examples in the module READMEs have already been updated to use the new HG_PD_* / HG_STORE_* variable names in the docs follow-up PR #2963. The entrypoints also include soft migration fallbacks so old variable names are automatically mapped to the new ones with a deprecation warning, meaning existing setups won't break immediately. Hope that covers the concern!
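A soft migration fallback of this kind might look like the sketch below. The helper name and the example variable names (GRPC_HOST as a legacy alias for HG_PD_GRPC_HOST) are hypothetical, chosen only to show the pattern.

```sh
#!/bin/sh
# migrate_env OLD_NAME NEW_NAME
# If only the deprecated variable is set, copy its value to the new name
# and print a deprecation warning. An explicitly set new name always wins.
migrate_env() {
    eval "old_value=\${$1:-}"
    eval "new_value=\${$2:-}"
    if [ -n "$old_value" ] && [ -z "$new_value" ]; then
        echo "WARN: $1 is deprecated, use $2 instead" >&2
        export "$2=$old_value"
    fi
}
```

An entrypoint would call this once per renamed variable, for example `migrate_env GRPC_HOST HG_PD_GRPC_HOST`, before validating required settings, so older `docker run` invocations keep working while users see which names to update.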

bitflicker64 · 2026-03-18T13:04:32Z

Note: testing the 3-node cluster locally currently requires building a local PD image (docker build -f hugegraph-pd/Dockerfile -t hugegraph/pd:local .) due to the temporary entrypoint volume mount workaround. This is because the current published images predate the getLeaderGrpcAddress() and IpAuthHandler fixes (#2961, #2962) — without them the cluster only works reliably when pd0 wins leader election. Once updated images are published this requirement will be cleaned up in a follow-up PR.

VGalaxies

LGTM.

@imbajin Since #2962 is already merged and #2961 still seems relevant for the 3-node path, my suggestion would be to land #2961 first, then this PR, and follow with #2963.

bitflicker64 · 2026-03-18T18:27:55Z

Two small cleanups: removed the UTF-8 BOM from the 3-node compose (Copilot flag), and removed the image: tag from the dev compose so docker compose up actually builds locally instead of pulling from Docker Hub.

dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. pd PD module store Store module labels Feb 15, 2026

github-project-automation bot added this to HugeGraph PD-Store Tasks Feb 15, 2026

github-project-automation bot moved this to In progress in HugeGraph PD-Store Tasks Feb 15, 2026

chore: add Apache license headers to config files

3c2ba6e

dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Feb 15, 2026

imbajin mentioned this pull request Feb 15, 2026

[Bug] Docker setup broken on macOS/Windows — network_mode: host replaced with bridge networking (single-node + 3-node cluster) #2951

bitflicker64 added 2 commits February 19, 2026 00:33

---

afa63f0

fix: ARM64 compatibility in wait-storage.sh - use curl instead of gremlin-console

64a9aab

dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:XXL This PR changes 1000+ lines, ignoring generated files. labels Feb 20, 2026

imbajin requested a review from Copilot February 20, 2026 16:17

Copilot started reviewing on behalf of imbajin February 20, 2026 16:18 View session

Copilot AI reviewed Feb 20, 2026

Fix Docker configuration: remove deprecated checks, simplify healthchecks, remove unused detect-storage script

6e024d2