Skip to content

K8s-friendly nginx: cluster DNS resolver, /matter route, MATTER_API_URL env#3

Closed
an9xyz wants to merge 1 commit into
mainfrom
chore/oss-improvements
Closed

K8s-friendly nginx: cluster DNS resolver, /matter route, MATTER_API_URL env#3
an9xyz wants to merge 1 commit into
mainfrom
chore/oss-improvements

Conversation

@an9xyz
Copy link
Copy Markdown
Contributor

@an9xyz an9xyz commented May 12, 2026

Summary

Three coupled changes to `nginx.conf.template` + `docker-entrypoint.sh` so the prebuilt image works out of the box in both Kubernetes and docker-compose deployments:

  1. K8s-friendly resolver — Replace the hardcoded `resolver 127.0.0.11` (docker embedded DNS, only valid inside docker-compose) with the cluster's CoreDNS service `kube-dns.kube-system.svc.cluster.local`. This is what production nginx configs already use and unblocks variable-form `proxy_pass` on K8s (the previous setting returned 502 on /summary because nginx could not resolve the service hostname at request time).
  2. `/matter/` route — Add a `location /matter/` block that proxies to `$matter_api_url` (env: `MATTER_API_URL`) with the same optional-503 + rewrite pattern already used for /summary. Closes a gap where the matter (task / timeline) service had no front-door route in octo-web.
  3. Entrypoint envsubst — extend `docker-entrypoint.sh` to envsubst the new `MATTER_API_URL` variable alongside existing `SUMMARY_API_URL` and `API_URL`.

Test plan

  • In K8s: `MATTER_API_URL=http://octo-matter:8080\` env set, `/matter/api/v1/matters` returns the backend's response (401 without token is expected)
  • `MATTER_API_URL` unset → `/matter/*` returns 503 `{"msg":"matter is not configured"}`
  • docker-compose users still work — note: they may need to switch `resolver` back to `127.0.0.11` manually (documented in code comment)
  • `/api/`, `/v1/`, `/summary/api/v1/`, `/` still route as before

Three coupled changes to nginx.conf.template + docker-entrypoint.sh so
that the prebuilt image works out of the box in both Kubernetes and
docker-compose deployments:

1. Replace the hardcoded `resolver 127.0.0.11` (docker embedded DNS,
   only valid inside docker-compose) with the cluster's CoreDNS service
   - kube-dns.kube-system.svc.cluster.local. This is what production
   nginx configs already use and unblocks variable-form proxy_pass on
   Kubernetes (the previous setting returned 502 on /summary because
   nginx could not resolve the service hostname at request time).

2. Add a `location /matter/` block that proxies to $matter_api_url
   (env: MATTER_API_URL) with the same optional-503 + rewrite pattern
   already used for /summary. Closes a gap where the matter (task /
   timeline) service had no front-door route in octo-web.

3. Extend docker-entrypoint.sh to envsubst the new MATTER_API_URL
   variable alongside SUMMARY_API_URL and API_URL.
Copy link
Copy Markdown
Contributor

@Jerry-Xin Jerry-Xin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR adds the right /matter/ route shape, but the resolver configuration is internally inconsistent and will break at least one supported deployment path.

🔴 Blocking

🟡 Warning — nginx.conf.template hardcodes resolver kube-dns.kube-system.svc.cluster.local even though docker-entrypoint.sh defines and exports NGINX_RESOLVER with a docker-compose default of 127.0.0.11. envsubst includes ${NGINX_RESOLVER}, but the template never references it, so users cannot actually override the resolver through the documented env var. This contradicts the PR goal of working out of the box in both Kubernetes and docker-compose, and can make /summary/ and /matter/ fail in compose because nginx will try to use the Kubernetes DNS service name as its runtime resolver. Use resolver ${NGINX_RESOLVER} valid=30s ipv6=off; or otherwise align the entrypoint/defaults/template.

💬 Non-blocking

🔵 Suggestion — Add a small config-rendering test for docker-entrypoint.sh/nginx.conf.template that asserts MATTER_API_URL and NGINX_RESOLVER are substituted into the generated config. The current security-header test would not catch this regression.

✅ Highlights

The /matter/ location mirrors the existing optional /summary/ pattern, returns JSON for the unconfigured case, and rewrites /matter/api/v1/... to /api/v1/..., which matches the existing frontend API base path.

Copy link
Copy Markdown
Contributor

@Jerry-Xin Jerry-Xin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

octo-web #3 Review — CHANGES_REQUESTED

/matter/ 路由 + resolver 提级到 server 块 + env var 化,方向完全正确。Jerry-Xin 抓到的 P0 我逐行确认了,确实存在。


🔴 P0 — NGINX_RESOLVER env var 有名无实(+1 Jerry-Xin)

entrypoint 侧链路完整:

: "${NGINX_RESOLVER:=127.0.0.11}"
export ... NGINX_RESOLVER
envsubst '... ${NGINX_RESOLVER}' < /nginx.conf.template > ...

模板里硬编码了 K8s DNS 名

resolver kube-dns.kube-system.svc.cluster.local valid=30s ipv6=off;

应该是:

resolver ${NGINX_RESOLVER} valid=30s ipv6=off;

现状:NGINX_RESOLVER 被 envsubst 处理但模板里没有占位符引用它,env var 写了等于没写。docker-compose 用户拿到的 resolver 是 kube-dns.kube-system.svc.cluster.local——容器里根本解析不到,/summary/*/matter/* 全部 502。

修法一行:模板里 kube-dns.kube-system.svc.cluster.local${NGINX_RESOLVER}


💬 Non-blocking

  1. 注释可更新:模板里的注释写的是 "K8s CoreDNS service",修完之后 resolver 值由 env 决定,注释应改成说明 env var 机制而非指向特定值,比如 "Runtime DNS resolver, default 127.0.0.11 (Docker embedded DNS), override via NGINX_RESOLVER env"

  2. /matter/ rewrite 比 /summary/ 更宽/summary/ 只匹配 /summary/api/v1/*/matter/ 匹配 /matter/* 全量。设计上没问题(matter 后端可能有 /api/v1/ 以外的路径),只是留意两者风格不一致


Highlights

  • /matter/ location 完全复用了 #2 建立的 optional-503 + rewrite + variable proxy_pass 模式,一致性好
  • resolver 从 location 提到 server 级消除了重复,新 location 自动继承
  • resolver_timeout 5s 是好补充
  • entrypoint 里 POSIX default + export + envsubst 链路设计是对的,模板占位符补上就通了
  • 顺手修了 docker-entrypoint.sh 末尾缺少的 newline 👍

Copy link
Copy Markdown
Contributor

@lml2468 lml2468 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree the existing resolver blocker still needs a fix, and I did not find an additional PR-introduced blocker in the /matter path handling.

Blocking items:

  • Existing P1, already raised by Jerry-Xin: I independently rendered the template with NGINX_RESOLVER=127.0.0.11 and the generated config still contains resolver kube-dns.kube-system.svc.cluster.local valid=30s ipv6=off;, while docker-entrypoint.sh includes ${NGINX_RESOLVER} in envsubst. That confirms the env var is currently not connected to the template. I am not opening a duplicate blocker for the same issue.

Non-blocking notes:

  • apps/web/vite.config.ts:118: the dev proxy rewrites /matter/api/v1 when VITE_MATTER_API_URL is set, but not when the fallback VITE_TODO_API_URL is set. If VITE_TODO_API_URL is also intended to point directly at the matter/todo backend, local dev will forward /matter/api/v1/... instead of /api/v1/.... If it is only a legacy alias for the gateway, the current behavior is fine but the comment/variable naming is ambiguous.
  • nginx.conf.template:53: the existing /summary location keeps the PR #2 fix pattern: rewrite first, then proxy_pass $summary_api_url without a URI suffix, so I do not see the variable proxy_pass path-loss bug there.
  • nginx.conf.template:78: /matter/api/v1/matters rewrites to /api/v1/matters before proxy_pass $matter_api_url, so the new production route does strip the /matter prefix correctly.

Highlights:

  • The new /matter/ production location mirrors the optional-service behavior from /summary: blank env returns JSON 503, configured env proxies with auth/forwarding headers, and query suffixes should remain on the rewritten URI.
  • Dockerfile still copies the entrypoint and template into the nginx image, so the entrypoint change is on the runtime path.

Copy link
Copy Markdown
Contributor

@lml2468 lml2468 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upgrading to CHANGES_REQUESTED — I independently verified the resolver blocker raised by Jerry-Xin.

P0 — NGINX_RESOLVER env var not wired into template (+1 Jerry-Xin, +1 Allen)

I rendered the template locally with envsubst — the generated config still has the hardcoded resolver kube-dns.kube-system.svc.cluster.local valid=30s ipv6=off; regardless of NGINX_RESOLVER value. The entrypoint defines/exports/envsubsts the var, but nginx.conf.template:27 never references ${NGINX_RESOLVER}.

Fix: resolver kube-dns.kube-system.svc.cluster.local valid=30s ipv6=off;resolver ${NGINX_RESOLVER} valid=30s ipv6=off;

Impact: docker-compose users get a resolver that does not exist inside the container → all variable proxy_pass locations (/summary/*, /matter/*) return 502.

Non-blocking

  • vite.config.ts:118-122: dev proxy rewrites /matter prefix only when VITE_MATTER_API_URL is set, not when VITE_TODO_API_URL fallback is used. If VITE_TODO_API_URL also points directly at the matter service, dev requests will arrive with the /matter prefix intact. Minor — only affects local dev, not production nginx.

Highlights

  • /matter/ location correctly mirrors the /summary/ optional-503 + rewrite + variable proxy_pass pattern from PR #2
  • Resolver promoted to server level eliminates per-location duplication
  • resolver_timeout 5s is a good addition
  • Entrypoint POSIX defaults are clean; template placeholder is the only missing link

Copy link
Copy Markdown
Contributor

@Jerry-Xin Jerry-Xin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolver configuration bug:

The entrypoint correctly exports NGINX_RESOLVER (default 127.0.0.11) and passes it to envsubst, but nginx.conf.template hardcodes the resolver to kube-dns.kube-system.svc.cluster.local instead of using ${NGINX_RESOLVER}.

This means:

  1. Docker-compose deployments will break — kube-dns.kube-system.svc.cluster.local doesn't resolve outside K8s
  2. The NGINX_RESOLVER env var has no effect (exported but never referenced in template)

Fix: change the template line to:

resolver ${NGINX_RESOLVER} valid=30s ipv6=off;

The rest of the PR (MATTER_API_URL route, graceful 503 fallback) looks good.

an9xyz added a commit that referenced this pull request May 21, 2026
Addresses the new findings from the second round of review on commit 3cac5f1.

Blocking
  - B1' (Jerry-Xin + yujiawei): the round-1 fix only handled known HTTP status
    codes on confirm/create, but mapBindError's top branch returns
    `terminal: false` for any non-OidcBindHttpError (network timeout, fetch
    abort, parseLoginResp JSON throw, applyLoginResp invariant throw). Same
    stuck-spinner failure mode as B1 since the confirming/creating loader
    doesn't render inlineError. Now: post-verify endpoints terminate on
    *every* failure shape, including non-HTTP errors and confirm 401.

Non-blocking cleanups picked up
  - yujiawei #2: move `/oidc/bind` pathname guard above the `?invite=`
    short-circuit in apps/web/src/Layout/index.tsx. Defense in depth — no
    documented URL composes both today, but if one ever did, the bind token
    would otherwise be silently dropped.
  - yujiawei #4: merge the two `from '../oidc'` import lines in BindPage.tsx.
  - yujiawei #5: drop the now-contradictory "唯一例外 429 / 503 — 允许再试"
    comment in runConfirm (since round-1 made those terminal). Replace with a
    one-liner describing the actual contract.
  - yujiawei #7: remove `BindErrorDisplay.retryable`. The field was set on
    503/500 of interactive endpoints but no caller read it (the "retry" UX it
    was meant to gate never materialised — users retry by re-submitting the
    form directly). Drop the field + the three test assertions on it.
  - yujiawei #9: replace the magic 200ms setTimeout with a comment explaining
    it's there to let the success state paint + Toast settle before navigation
    tears down the React tree.

Skipped
  - yujiawei #3 (entryRef unmount cleanup): incompatible with the existing
    `initRanRef` guard under React 18 StrictMode dev — the spurious unmount
    fires the new cleanup, nulling entryRef, then the remount skips the init
    body because initRanRef.current is true (refs persist across strict-mode
    remount). Same trap as the OidcResume bug fixed earlier. Memory pressure
    of leaving entryRef on the closure is one object until the full-page
    navigation kills the JS context — not worth re-introducing the trap to
    save it.

Tests
  - vitest 162/162
  - playwright 19/19
@yujiawei yujiawei assigned an9xyz and unassigned yujiawei May 22, 2026
@an9xyz an9xyz closed this Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants