
fix: bound curl fetches with timeout+retry to stop docker build hangs #58

Merged
chiro-hiro merged 1 commit into orochi-network:main from chiro-hiro:fix/curl-timeout-build-hang on Jun 7, 2026

@chiro-hiro

Contributor

Problem

Consumer docker builds hang indefinitely in the fetch/install phase. The root cause is that every curl fetch in dev-off runs without a timeout, so a stalled-but-open TCP connection (flaky registry, self-hosted runner behind a proxy, GitHub raw blip) blocks forever with no output.

yarn and corepack carry their own network timeouts and eventually fail. The bare curl ... | bash that fetches the build script inside the build RUN, and the trust-script fetches, have no such timeout, so they are the only infinite-hang vector in that phase.

How it was diagnosed

  • Read and swept every consumer-executed script: no infinite loops, and the while read pipes are correctly isolated.
  • Empirically ruled out the usual suspects in real containers: corepack downloads fine non-interactively (no prompt hang), and Yarn 4 treats --frozen-lockfile as a deprecated alias and fails in ~10 ms (no hang).
  • Reproduced the real generated build end-to-end: happy path ≈ 4.2 s, confirming the pipeline is correct and the hang is network-stall-dependent.

Fix

Add a bounded timeout and retry to every curl fetch, so a stall now fails fast and transient blips are retried instead of hanging:

--connect-timeout 15 --max-time 120 --retry 3 --retry-delay 2
  • dockerfile.sh — generated build RUN fetch (curl ... | bash) + remote template fetch
  • check-gpg.sh / check-ssh.sh — checksum + allowlist fetches (via a CURL_OPTS array)
  • generate-ssh-allowed-signers.sh — GitHub .keys fetch
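A minimal bash sketch of the shared-options pattern follows. The four bounds are the ones listed above; the variable names, the `fetch` helper, and the extra `--fail --silent --show-error --location` flags are illustrative assumptions, not the actual contents of check-gpg.sh or check-ssh.sh. It also works out the upper bound these settings put on a single fetch.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Bounds from this PR; the variable names are illustrative, not dev-off's.
CONNECT_TIMEOUT=15  # seconds allowed for the connection phase
MAX_TIME=120        # hard cap on each individual attempt
RETRIES=3           # retries after the first attempt
RETRY_DELAY=2       # fixed pause between attempts, in seconds

# Assumption: --fail/--silent/--show-error/--location are common companions
# to these bounds; the PR text does not say which other flags are used.
CURL_OPTS=(
  --fail --silent --show-error --location
  --connect-timeout "$CONNECT_TIMEOUT"
  --max-time "$MAX_TIME"
  --retry "$RETRIES"
  --retry-delay "$RETRY_DELAY"
)

# Hypothetical helper: every fetch shares the same bounded options.
fetch() {
  curl "${CURL_OPTS[@]}" "$1"
}

# Upper bound before curl gives up: every attempt runs into --max-time,
# plus the fixed delays between attempts (ignoring any server Retry-After).
worst_case=$(( (RETRIES + 1) * MAX_TIME + RETRIES * RETRY_DELAY ))
echo "worst-case fetch time: ${worst_case}s"
```

With these values a call such as `fetch "$url" > "$dest"` either succeeds or exits non-zero after roughly 4 × 120 + 3 × 2 = 486 s, instead of blocking forever. Note that curl does not retry a refused connection unless `--retry-connrefused` is also passed; it fails immediately instead.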

The generated build RUN's outer shell stays at set -eu: on the Ubuntu builder it is dash, which does not support set -o pipefail.
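What set -eu alone does and does not catch in a pipeline can be shown with a short bash sketch. It leaves pipefail off to mirror the dash RUN. `failed_fetch` is a hypothetical stand-in for a curl call that hit --max-time (curl reports an operation timeout as exit code 28), and `PIPESTATUS` is bash-only, used here just to surface the status the pipeline hides.

```bash
#!/usr/bin/env bash
# Mirror the generated RUN's options: errexit + nounset, but no pipefail.
set -eu

# Hypothetical stand-in for a curl fetch that hit --max-time (curl exit 28).
failed_fetch() { return 28; }

# Without pipefail a pipeline's status is its LAST command's status,
# so set -e does not stop the script even though the fetch failed.
failed_fetch | cat
statuses=("${PIPESTATUS[@]}")  # bash-only; dash has no equivalent

echo "fetch exited ${statuses[0]}, cat exited ${statuses[1]}; script kept going"
```

In the generated `curl ... | bash` RUN, this means the exit status Docker sees is the downstream shell's.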

Also: restore green CI

main is currently red (both failures were introduced by the #57 merge); this PR brings it back to green by regenerating the stale checksum.sha256 and silencing a false-positive shellcheck SC2016.

Verification

  • shellcheck -e SC1091 over all scripts — clean
  • bash -n on all edited scripts — clean
  • ./generate-checksums.sh + sha256sum -c --strict — fresh & matching
  • dockerfile.sh dry-run smoke for node/next/nginx/strapi (+ strapi runtime contract, injection rejection) — pass
  • Real docker build of a node consumer end-to-end with the regenerated Dockerfile — succeeds; build RUN now carries the timeout flags
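The checksum check can be sketched with GNU coreutils alone. This is an assumption about what a script like generate-checksums.sh does, not its actual contents: it writes checksum.sha256 with sha256sum, verifies it with the same `sha256sum -c --strict` used above, and then reproduces the stale-manifest failure that #57 caused by changing a tracked script without refreshing the file. The script contents are placeholders.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Scratch directory with placeholder scripts standing in for the real ones.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"
printf 'echo build\n' > dockerfile.sh
printf 'echo node\n' > build-prod-node.sh

# Regenerate the manifest (roughly what a generate-checksums step might do).
sha256sum dockerfile.sh build-prod-node.sh > checksum.sha256

# Verify: --strict also rejects improperly formatted manifest lines.
sha256sum -c --strict checksum.sha256

# Change a tracked script without refreshing the manifest.
printf 'echo changed\n' >> dockerfile.sh
if sha256sum -c --strict --quiet checksum.sha256; then
  stale=no
else
  stale=yes   # dockerfile.sh now reports FAILED
fi
echo "manifest stale: $stale"
```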

Out of scope (noted, not changed)

corepack enable is emitted in the runner stage rather than the builder stage; it works today only because the base image's yarn is already a corepack shim. Its failure mode is errors, not hangs, so it is flagged for a separate PR.

Consumer docker builds (and the CI trust checks) could hang indefinitely in
the fetch/install phase. Every curl fetch in dev-off ran without a timeout, so
a stalled-but-open TCP connection blocked forever with no output. yarn and
corepack have their own network timeouts; the bare `curl ... | bash` that
fetches the build script — and the trust-script fetches — did not.

Add bounded --connect-timeout/--max-time plus --retry/--retry-delay to every
curl so a stalled fetch fails fast and retries transient blips instead of
hanging:
- dockerfile.sh: generated build RUN fetch + remote template fetch
- check-gpg.sh / check-ssh.sh: checksum + allowlist fetches (CURL_OPTS array)
- generate-ssh-allowed-signers.sh: GitHub .keys fetch

Also restore green CI (both jobs went red after the orochi-network#57 merge):
- regenerate checksum.sha256 — it was stale because orochi-network#57 changed dockerfile.sh
  and scripts/build-prod-node.sh without refreshing it
- silence a false-positive shellcheck SC2016 on the intentional single-quoted
  `ENV APP_VERSION=${APP_VERSION}` (Docker, not the shell, expands it)
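The SC2016 case is easiest to see in isolation. In this sketch (the variable name `env_line` is hypothetical), single quotes keep `${APP_VERSION}` literal so that Docker, not the generator, expands it, and a `# shellcheck disable=SC2016` directive on the line just before the assignment scopes the suppression to that one command.

```bash
#!/usr/bin/env bash
set -euo pipefail

# The single quotes are intentional: ${APP_VERSION} must reach the Dockerfile
# unexpanded so Docker substitutes it at build time.
# shellcheck disable=SC2016
env_line='ENV APP_VERSION=${APP_VERSION}'

printf '%s\n' "$env_line"
```

Without the directive, shellcheck reports SC2016 ("Expressions don't expand in single quotes") on the assignment; with it, the rest of the script is still checked normally.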
chiro-hiro merged commit fe2d6f4 into orochi-network:main Jun 7, 2026
4 checks passed
chiro-hiro added a commit to chiro-hiro/dev-off that referenced this pull request Jun 7, 2026
The generated Dockerfile ran `corepack enable` only in the runner stage, even
though the build (`yarn install`/`yarn build`) happens in the builder. It worked
by accident — the orochinetwork/ubuntu:node base image already ships yarn as a
corepack shim — and contradicted the documented contract (DOCKERFILE.md:
"Enables corepack in both the builder and the runner"). If the base image ever
stopped pre-shimming yarn, the Strapi (Yarn 4 berry) build would break with no
corepack in the builder.

Emit the corepack-enable RUN layer in BOTH stages:
- Dockerfile.template: add {{builder_corepack_run}}, placed as root before the
  USER switch (and before COPY, so the layer caches independently of code).
- dockerfile.sh: substitute {{builder_corepack_run}} from the same generator;
  refresh the now-stale comment.

Also carries the same shellcheck SC2016 silence as orochi-network#58 so this PR is green
independently of merge order (the finding was introduced by the orochi-network#57 merge).
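A rough bash sketch of the substitution, assuming a plain string replace, is below. Only the `{{builder_corepack_run}}` placeholder name comes from the commit; the template text, the `USER app` line, and the exact `RUN corepack enable` layer are hypothetical stand-ins for Dockerfile.template and dockerfile.sh. It shows the builder's corepack layer rendered as root, before the USER switch and before COPY.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical excerpt of a builder stage; only the placeholder name is from the commit.
template='FROM orochinetwork/ubuntu:node AS builder
{{builder_corepack_run}}
USER app
COPY . .
RUN yarn install && yarn build'

# Assumed layer text; the same generator would also feed the runner stage.
corepack_run='RUN corepack enable'

# Quoting the pattern makes bash match it literally; quoting the replacement
# keeps any & in it literal under bash 5.2's patsub_replacement.
rendered=${template//"{{builder_corepack_run}}"/"$corepack_run"}

printf '%s\n' "$rendered"
```

Placing the layer before COPY means a code change does not invalidate it, so Docker can reuse the cached corepack layer across builds.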
chiro-hiro added a commit that referenced this pull request Jun 7, 2026
…ner (#59)

chiro-hiro deleted the fix/curl-timeout-build-hang branch June 7, 2026 05:50