fix(file): self-heal MinIO bucket policy on every process start (#77)#87
fix(file): self-heal MinIO bucket policy on every process start (#77)#87tongjichao wants to merge 2 commits into
Conversation
…nglamp-OSS#77) - ensureBucket 在 BucketExists=true 时不再早 return,仍会 SetBucketPolicy - 加 bucketReady sync.Map 做进程内 memo,热上传路径无额外 round-trip - 集成测试 TestEnsureBucket_SelfHealsPolicyOnExistingBucket 覆盖 Mininglamp-OSS#77 复现 deployer 用 minio-init / mc mb 预创建 bucket 时,旧版 ensureBucket 跳过 SetBucketPolicy,导致 group 这类 bucket 永远无匿名读策略,群头像 403。 现在首次 ensureBucket 调用一定会落 policy;后续调用通过 bucketReady memo 短路,进程重启即自动修复。 OSS 后端 (modules/file/service_oss.go:52) 是同形 bug 但本次 out-of-scope, 另开 Mininglamp-OSS#86 跟踪。
yujiawei
left a comment
There was a problem hiding this comment.
Code Review — PR #87 (octo-server)
Summary
Targeted, minimal fix for #77. ServiceMinio.ensureBucket previously short-circuited on BucketExists=true and never reached SetBucketPolicy, so any bucket pre-provisioned by minio-init / mc mb (e.g. group, report, download) stayed policy-less and anonymous GETs returned 403. The PR removes the early return and adds a per-process bucketReady sync.Map so the bootstrap (BucketExists + optional MakeBucket + SetBucketPolicy) runs exactly once per bucket per process, regardless of upload concurrency.
The fix is small (2 files, +116/-14), the rationale is documented in-place, and the new integration test is a faithful reproduction of #77.
Verification
| Item | Result | Evidence |
|---|---|---|
| Early return removed | ✅ | modules/file/service_minio.go:130 — the if exists { return nil } block is gone; MakeBucket is now gated by if !exists and SetBucketPolicy runs unconditionally. |
| Memo guards the hot path | ✅ | modules/file/service_minio.go:109-122,157 — double-checked Load (once outside the lock at L109, once inside at L120) and a single Store at L157 after a successful Set. |
| Memo only set on success | ✅ | All error paths (BucketExists error L127, MakeBucket non-benign error L143, SetBucketPolicy error L154) return before reaching bucketReady.Store, so a transient failure leaves the bucket retryable on the next call. |
BucketAlreadyOwnedByYou still falls through to SetBucketPolicy |
✅ | modules/file/service_minio.go:139-145 — race winner getting the swallowed code still proceeds to the policy step, preserving the prior cross-process race semantics. |
| Concurrent-bootstrap invariant preserved | ✅ | TestPresignedPutURL_ConcurrentBucketBootstrap (service_minio_integration_test.go:269) still passes — MakeBucket and SetBucketPolicy each run exactly once across 16 racing goroutines on a shared fresh bucket. |
| New test reproduces #77 | ✅ | Confirmed locally: TestEnsureBucket_SelfHealsPolicyOnExistingBucket fails on parent (a03a32f) with policyCount==0, passes on 0e689a2. Uses the group/... prefix called out in #77. |
| Allow-listed bucket assumption | ✅ | group is present in allowedMinioBuckets (modules/file/helpers.go:18), so the test exercises the real code path rather than collapsing to the file default. |
go test ./modules/file/... |
✅ | Green locally on HEAD. |
go build ./... and go vet ./modules/file/... |
✅ | Both clean. |
Findings
No P0 / P1 issues.
P2 / suggestions (non-blocking)
-
Test assertion is a bit looser than necessary.
service_minio_integration_test.go:399usesGreaterOrEqual(... int32(1))forpolicyCount, but the test issues exactly onePresignedPutURLand the production path applies policy exactly once per cold-start bootstrap. A stricterEqual(int32(1), policyCount.Load(), ...)would also catch an unintended over-apply (e.g. a future bug where the memo gate stops working and every upload re-PUTs the policy). Loose is defensible — keeps the test robust to incidental future calls — but tightening it would mirror the==1style used immediately above inTestPresignedPutURL_ConcurrentBucketBootstrap. -
Memoization scope is process-local by design — surface it for operators. The new docstring at
service_minio.go:97-107correctly notes "a restart re-runs the bootstrap, which is exactly the self-healing behaviour wanted." Worth restating in the PR's user-facing notes (and eventually the deployment docs): if an operator manuallymc policy set none <bucket>mid-run, octo-server will NOT re-self-heal until the process restarts. This is the right tradeoff for the failure mode #77 describes, but it's a behaviour operators should know. -
Sibling out-of-scope bugs called out clearly — keep #86 visible. The PR description correctly flags
service_oss.goas having the same shape of bug (CreateBucket(... ACLPublicRead)only on first-create) and defers it to #86. Worth a quick scan ofservice_cos.goas well — it isn't mentioned, and a quick read suggests COS has its own bucket-bootstrap flow worth checking for the same "early return on exists" anti-pattern. If COS is already known-clean, a one-line note in #86 closing it out would help future readers. Out of scope for this PR; flagging for awareness only. -
bucketLocksandbucketReadyare bounded but not pruned. Documented at L62-63 and L72-74 — bucket count is bounded by the allow-list (~10), so this is fine. Just noting for the record that both maps are intentionally append-only.
Behaviour change to call out
The PR description already notes this, but worth restating: every ensureBucket call now reaches SetBucketPolicy on first invocation per bucket per process. Deployments that previously had buckets created at startup (where the policy was applied at create-time) will now see a redundant SetBucketPolicy PUT on the first upload to each bucket after a restart. Cost is O(allowed buckets) extra round-trips per process lifetime — negligible.
Recommendation
The fix is correct, the test is a faithful reproduction, the docstring explains the memo + self-heal rationale clearly, and the concurrency guarantee is preserved. Approving.
Jerry-Xin
left a comment
There was a problem hiding this comment.
The PR is relevant to octo-server, but the implementation does not actually self-heal MinIO bucket policy on process start, so it may not fix the reported broken existing avatar/read case.
🔴 Blocking
🔴 Critical — The policy repair only runs on upload paths, not on process start or read paths. NewServiceMinio only constructs the service and does not bootstrap any buckets (modules/file/service_minio.go). ensureBucket is only called from UploadFile and PresignedPutURL (modules/file/service_minio.go, modules/file/service_minio.go). DownloadURL/PresignedGetURL do not call it, so a deployment with a pre-existing policy-less group bucket and existing avatars will still return anonymous GET 403 after restart until some later upload happens to that bucket. That contradicts the PR title/description’s “on every process start” and “self-healed on octo-server startup” behavior, and leaves the reported existing asset read failure unresolved.
💬 Non-blocking
🔵 Suggestion — The helper comment in modules/file/service_minio_integration_test.go still says BucketExists skips SetBucketPolicy entirely. After this PR, that is no longer true, so the test comment should be updated to avoid documenting the old behavior.
✅ Highlights
The concurrency guard is sound: bucketReady is checked before and after the per-bucket lock, and it is only stored after SetBucketPolicy succeeds (modules/file/service_minio.go).
The new regression test covers the important pre-existing-bucket branch and correctly verifies no MakeBucket call plus at least one policy PUT (modules/file/service_minio_integration_test.go).
Verification: go test ./modules/file/... passes.
lml2468
left a comment
There was a problem hiding this comment.
Independent Review (齐静春, 0e689a2)
Architecture Assessment
Clean fix for #77. The core insight: pre-provisioned buckets (via minio-init / mc mb) never got SetBucketPolicy applied because the old code returned early on BucketExists=true. Fix is to always apply the policy, then memoize with bucketReady sync.Map.
Code Quality
- Double-checked locking —
bucketReady.Loadbefore lock, re-check inside lock — correct pattern for one-time init per key. - SetBucketPolicy is idempotent — safe to re-apply even if policy already exists.
- MakeBucket race improved — on
BucketAlreadyOwnedByYou, falls through toSetBucketPolicyinstead of returning early. The race winner may not have applied the policy yet. - Memo is process-local — restart re-runs bootstrap. This IS the self-healing behaviour: minio-init breaks policy → server restart fixes it.
- Integration test —
TestEnsureBucket_SelfHealsPolicyOnExistingBucketexactly reproduces #77: bucket exists, MakeBucket NOT called, SetBucketPolicy IS called.
Re: Jerry-Xin's P0 (policy only runs on upload, not startup/read)
Independently confirmed: ensureBucket is only called from PresignedPutURL, not at startup or on read paths. A server restart with pre-provisioned buckets and NO uploads will still serve 403 on browser-direct GETs until the first upload triggers ensureBucket.
However, I'd classify this as P2 (follow-up), not P0:
- Production reality: upload traffic typically starts immediately after startup. The 403 window is bounded by time-to-first-upload (seconds/minutes, not hours).
- Startup hook complexity: calling
ensureBucketat init time for all allowed buckets requires MinIO to be reachable, which creates a startup ordering dependency. If MinIO starts after octo-server, init would fail or need retry logic. - The fix DOES solve #77: the bug report is about pre-provisioned buckets staying policy-less forever. Now they self-heal on first use per process.
- The title is misleading — "on every process start" implies a startup hook. The actual behavior is "on first upload per process per bucket." Suggest updating the title to:
fix(file): self-heal MinIO bucket policy on first use per process (#77).
A startup hook iterating allowed buckets can land as a follow-up without blocking this fix.
Verdict
APPROVED — correct fix for #77 with proper concurrency handling and regression test. The "not at startup" gap is P2, not a blocker.
) - NewServiceMinio 启动 goroutine 对 allowedMinioBuckets 全集跑 ensureBucket - 30s 总 timeout 防 minio 长时间不可达时 goroutine 永生 - 失败 WARN 不阻塞启动,下次 upload 会通过 ensureBucket 重试 - cfg.Test 跳过以保留现有测试 fake-server 计数的确定性 - 新测试 TestNewServiceMinio_BootstrapsAllAllowedBucketsAtStartup 验证 read-only / never-uploaded 场景(Mininglamp-OSS#77 真实复现路径) 回应 Jerry-Xin (PR Mininglamp-OSS#87) "self-heal on process start" 与实际行为 不符的 review 反馈:先前的 ensureBucket 修复只在 upload 路径上跑, read-only group avatar 重启后仍 403;现在 startup 就把 policy 落齐。
|
@Jerry-Xin Thanks for the catch — you were correct, the iteration-1 fix was load-bearing on "some upload eventually happens" which doesn't hold for read-only buckets like Per-finding ack🔴 Blocking — "only runs on upload paths, not on process start or read paths"Fixed.
Read paths ( 🔵 Suggestion — "helper comment in service_minio_integration_test.go:22 still says SetBucketPolicy is skipped entirely"Fixed in the same commit. Now reads:
✅ HighlightsThanks for confirming the concurrency guard and the iteration-1 test shape. New test
|
yujiawei
left a comment
There was a problem hiding this comment.
Code Review — PR #87 (octo-server)
Summary
Addresses the prior round's blocking concern: the earlier commit (0e689a2) only repaired bucket policy on the upload path, so a pre-provisioned group bucket that never sees a fresh write would still serve 403 on anonymous GET after restart. This commit (cfb20a1) adds a process-start goroutine — bootstrapAllowedBuckets — that walks the full allowedMinioBuckets set and calls ensureBucket against each, so the policy is reapplied at boot regardless of whether anyone subsequently uploads.
The fix is small and the right shape: async + best-effort + budgeted, so a transient MinIO outage during boot never blocks startup, and the existing upload path still self-heals on demand. Net runtime cost in steady state is ~10 HTTP round trips at startup, then zero per request via the bucketReady memo. Tests exercise the new path explicitly.
Verification
| Item | Result |
|---|---|
go test ./modules/file/... -race -count=1 |
✅ green (1.6s) |
go vet ./modules/file/... |
✅ clean |
TestEnsureBucket_SelfHealsPolicyOnExistingBucket |
✅ pass — covers exists=true → SetBucketPolicy must still run |
TestNewServiceMinio_BootstrapsAllAllowedBucketsAtStartup |
✅ pass — covers the new goroutine; verifies 10/10 buckets get SetBucketPolicy, 0 MakeBucket |
TestPresignedPutURL_ConcurrentBucketBootstrap |
✅ pass — bucketReady double-checked-locking preserves the "exactly once per bucket" invariant |
TestPresignedURLs_SignAgainstPublicEndpoint |
✅ pass — unchanged by this PR |
Findings
P0 / P1 (blocking)
None.
P2 (non-blocking — suggestions)
P2-1: bucketReady is a true memo, not a TTL — drift only repairs at restart
bucketReady is sync.Map keyed by bucket name, only ever stored, never deleted. Once a bucket is marked ready in a given process, the policy is never reapplied — even if an operator manually clears it via mc policy set none ... mid-process. Documented in the field comment and that is the intended trade-off (restart = repair), but worth surfacing: if you ever want runtime drift detection (e.g. nightly sweep), the memo needs an expiry.
Not blocking — matches the PR's stated scope of "self-heal on process start" and the cost of an inline policy re-apply on every upload is what the memo exists to avoid.
P2-2: 30s total budget is a magic number
modules/file/service_minio.go:118 hardcodes 30 * time.Second for the entire bootstrap loop. The doc comment justifies the size (10 buckets × ~3s each), but the constant is not named, so the 30/10 relationship is implicit. If allowedMinioBuckets grows to 20 the budget silently becomes ~1.5s per bucket. Suggestion: lift to a named const minioBootstrapTimeout = 30 * time.Second adjacent to minioDefaultRegion, or compute as 3 * time.Second * time.Duration(len(allowedMinioBuckets)). Cosmetic, not blocking.
P2-3: const wantBuckets int32 = 10 in the new test is brittle
modules/file/service_minio_integration_test.go:467 hardcodes the allow-list size. Comment correctly flags it ("bump this if allowedMinioBuckets in helpers.go grows or shrinks"), but a future contributor adding audio to the allow-list will get a confusing assertion failure (got 11, want 10) with no compile-time signal. Lowest-friction fix: export AllowedMinioBuckets (or a length helper) from the package and use int32(len(file.AllowedMinioBuckets)) in the test. Non-blocking — the comment is sufficient as a contract.
P2-4: Test mutates ctx.GetConfig().Test = false after NewTestContext force-set it true
modules/file/service_minio_integration_test.go:463 reaches into the config to flip the test gate, which works (cfg is a pointer) but is awkward: the test depends on the implementation detail that NewTestContext returns a context backed by the same cfg pointer. If testutil.NewTestContext ever defensively copies, the test breaks silently (bootstrap doesn't run, polling times out, assertion fails — but the root cause is hidden behind the timeout). The cleaner pattern is to add a testutil option to skip the Test=true override, or construct the context manually. Non-blocking — current shape works and the comment explains why.
P2-5: bootstrapAllowedBuckets iterates allowedMinioBuckets (map) — order is non-deterministic
This is fine functionally (every bucket gets processed) and the comment in the func acknowledges it, but if a deployer's MinIO is partially reachable (say only the first few buckets succeed before the 30s deadline), which buckets get healed varies per restart. Not worth fixing — the upload path picks up the rest on demand — but worth a one-line mention in the bootstrap doc that order is undefined.
Architecture / Design
- Async + best-effort is the right call. Blocking startup on MinIO availability would create a deploy-order coupling between octo-server and MinIO that doesn't otherwise exist. WARN-then-continue is correct.
- Double-checked locking on
bucketReadyis sound. The pre-lock load skips the mutex for the hot path; the post-lock re-check prevents a parallel cold-start caller from re-running the bootstrap after winning the mutex race.bucketReady.Storehappens last, afterSetBucketPolicysucceeds, so a partial failure never causes a false memo. cfg.Testgate is in the right place. The other integration tests in this file setcfg.Test=trueprecisely to suppress the bootstrap side-effect so their fake-server counters stay deterministic. The new test cleanly opts back in. Without the gate, everyTestPresigned…test would race extra HEAD/PUT calls into its own counters.- Scoped correctly. OSS / COS / Qiniu / SeaweedFS have analogous shapes but are out of scope (PR body calls out #86 for OSS). Keeping each backend's fix as its own PR is the right granularity.
Additional observations (not requested but worth flagging)
bootstrapAllowedBucketsdoes not log a success summary. If you ever need to diagnose "did the bootstrap actually run on this pod?", you only get WARN lines on failure — success is silent. A single INFO at end ("MinIO startup bootstrap completed: 10/10 buckets ready") would be cheap and useful for ops. Optional.- The 30s
context.WithTimeoutis consumed by the combined duration of allBucketExists+SetBucketPolicycalls in the loop, not per-bucket. Documented behavior, but a slow first bucket can starve later ones. Minor.
Verdict
This is the correct fix for #77, addresses the prior CHANGES_REQUESTED cleanly, and ships with tests that actually pin the new behavior (not just the diff). The P2 items above are polish, not blockers.
Jerry-Xin
left a comment
There was a problem hiding this comment.
This PR is in scope for octo-server and correctly addresses MinIO bucket policy self-healing without introducing blocking issues.
💬 Non-blocking
- 🟡 Warning:
bucketReadyis perServiceMinioinstance, but the comments describe “per process” behavior. Since several modules constructfile.NewService(ctx), startup may run multiple bootstrap sweeps in one process. This is not a correctness issue, but it can add extra startup MinIO calls. Seemodules/file/service_minio.go:66,modules/file/service_minio.go:98, and call sites such asmodules/user/api.go:116.
✅ Highlights
- The early-return bug is fixed: existing buckets now still receive
SetBucketPolicy.modules/file/service_minio.go:163 - The per-bucket lock plus double-checked memo keeps concurrent upload bootstrap bounded.
modules/file/service_minio.go:147 - Tests cover both existing-bucket self-heal and startup bootstrap behavior.
modules/file/service_minio_integration_test.go:356,modules/file/service_minio_integration_test.go:425
Verification run:
go test ./modules/file/...passed- Targeted MinIO bootstrap tests passed
git diff --check main...HEADpassed
lml2468
left a comment
There was a problem hiding this comment.
APPROVED (incremental re-review: 0e689a2 → cfb20a1, 1 new commit).
Previous P2 finding resolved: the startup bootstrap goroutine now exists. The title "self-heal on every process start" is now accurate.
New code assessment:
-
bootstrapAllowedBuckets— walks the fullallowedMinioBucketsmap and callsensureBucketon each. 30s context timeout prevents goroutine leak if MinIO is unreachable. Errors logged at WARN, non-blocking — upload path retries via the sameensureBucket. Clean. -
NewServiceMiniowiring — bootstrap runs in a goroutine, skipped undercfg.Test=trueto keep existing tests deterministic. Correct placement: after struct construction, before return. -
Test (
TestNewServiceMinio_BootstrapsAllAllowedBucketsAtStartup) — fake MinIO with HEAD/PUT counters, polls forpolicyCount >= 10with 5s deadline. Explicitly flipscfg.Test=falseto exercise the production path while other tests suppress it. ThewantBuckets = 10literal is documented as needing a bump ifallowedMinioBucketschanges — acceptable for a per-file constant.
No blocking items. No new issues.
| // not have applied the policy yet, and a redundant Set is | ||
| // cheap. | ||
| if minio.ToErrorResponse(err).Code == minioBucketAlreadyOwnedByYou { | ||
| sm.Info("bucket already owned by us, skipping create", zap.String("bucket", bucket)) |
| if minio.ToErrorResponse(err).Code == minioBucketAlreadyOwnedByYou { | ||
| sm.Info("bucket already owned by us, skipping create", zap.String("bucket", bucket)) | ||
| } else { | ||
| sm.Error(fmt.Sprintf("创建 %s目录失败", bucket), zap.Error(err)) |
Fixes #77 (server-side half).
Problem
ServiceMinio.ensureBucketshort-circuits whenBucketExistsreturns true and never reachesSetBucketPolicy. Deployments that pre-provision buckets viaminio-init/mc mb(Docker Compose, Helm, octo-deployment) hit this path on every startup, so any bucket the deployer pre-created without the anonymous-read policy stays policy-less — anonymous GET returns 403. The reported symptom in #77 is broken group avatars served from thegroupbucket; the same path also affectsreport/downloadbuckets that octo-server treats as anonymous-readable butminio-initexcludes from its policy loop.Fix
if exists { return nil }early return inensureBucket. After the existence/MakeBucket dance, always callSetBucketPolicyso a pre-existing bucket gets its policy reapplied.bucketReady sync.Mapso the hot upload path pays the extra round-trip exactly once per bucket per process — not per request. The check runs both before and insidebucketLocks(double-checked) so concurrent cold-start callers do not race past the memo.Tests
New
TestEnsureBucket_SelfHealsPolicyOnExistingBucketinmodules/file/service_minio_integration_test.go:MakeBucketwas NOT called andSetBucketPolicyWAS called ≥ 1 time after a singlePresignedPutURL.group/...prefix specifically — the bucket bug: group bucket missing anonymous read policy — group avatars return 403 #77 names as broken.Existing
TestPresignedPutURL_ConcurrentBucketBootstrapstill passes — the memo preserves its "each of make/policy runs exactly once across N concurrent callers" invariant.go test ./modules/file/...→ green.Out of scope
groupto the anonymous-download loop) — that fix belongs inMininglamp-OSS/octo-deployment, not this repo. With this PR alone, deployers re-running their stack will getgroupself-healed on octo-server startup; the docker-compose change is a complementary deployment-time fix.CreateBucket(... ACLPublicRead)only on first-create, never re-applied — pre-provisioned OSS buckets stay private). Tracked in service_oss: ACL applied only on bucket create — pre-provisioned OSS buckets render 403 (same shape as #77) #86. Same fix recipe (ensureBuckethelper + memo) but a different SDK surface, so left to a separate PR.Test plan
go test ./modules/file/...git stash && go test ...) and passes on this commitensureBucketmatches the code (the doc explains the self-healing rationale + the memo)