fix(file): self-heal OSS bucket ACL on first upload#199
Conversation
UploadFile only called CreateBucket (with ACLPublicRead) when the bucket did not exist. For deployments where the bucket was pre-provisioned by external tooling (Terraform, console, ossutil) with a non-public ACL, the server never reapplied the ACL, so anonymous browser GETs against asset URLs returned 403. Added ensureBucketACL that: 1. Creates the bucket with ACLPublicRead if absent (first-upload bootstrap) 2. Unconditionally calls SetBucketACL(PublicRead) on pre-existing buckets 3. Runs at most once per process lifetime via sync.Once (single bucket) Both UploadFile and PresignedPutURL now call ensureBucketACL before writing, so the ACL is guaranteed correct regardless of how the bucket was originally provisioned. This is the OSS-backend counterpart to #77 (MinIO equivalent). Signed-off-by: lifeifei <lifeifei@mlamp.cn>
Jerry-Xin
left a comment
There was a problem hiding this comment.
The PR is relevant to octo-server, but the new one-time OSS ACL bootstrap can permanently break uploads after a transient or permission-related first failure.
π΄ Blocking
π΄ Critical β sync.Once caches the first failure forever. In modules/file/service_oss.go, aclOnce.Do runs only once even when IsBucketExist, CreateBucket, or SetBucketACL fails. Because aclErr is then returned on every later UploadFile and PresignedPutURL call, a temporary OSS/network/permission propagation error during the first request makes the process unable to upload until restart. Use a retryable guard instead: only mark ACL initialization complete after success, or protect the check with a mutex/state flag so failures can be retried.
π΄ Critical β the create path is not race-tolerant and combines badly with the cached failure. In modules/file/service_oss.go and modules/file/service_oss.go, two processes can both observe the bucket as absent, one creates it, and the other receives an βalready exists/already ownedβ OSS service error. This PR records that error in aclErr permanently due to sync.Once, so one app instance can fail all future uploads even though the bucket now exists. Treat the benign OSS already-exists codes as success and still apply/verify ACL, or retry the existing-bucket ACL path after that error.
π¬ Non-blocking
π‘ Warning β modules/file/service_oss.go uses client.IsBucketExist, which the Aliyun SDK implements via bucket listing. That may require broader ListBuckets permission than a least-privilege runtime policy that only has object write and bucket ACL permissions. Consider avoiding a global list dependency by attempting SetBucketACL directly and handling NoSuchBucket, or document the new RAM permission requirement clearly.
β Highlights
The PR correctly identifies that client.Bucket() is only a handle constructor and does not prove remote bucket existence.
Both server-side uploads and presigned PUT generation now route through the same bucket ACL bootstrap path.
Verification run: go test ./modules/file/... passed.
lml2468
left a comment
There was a problem hiding this comment.
[REQUEST_CHANGES]
Good root-cause analysis β client.Bucket() always returns a non-nil handle so the old bucket == nil branch was dead code. The ensureBucketACL approach is the right shape.
π‘ Warning (blocking)
sync.Once caches transient failures permanently. If the first ensureBucketACL invocation fails (transient network error, OSS control-plane blip, credentials not yet rotated at boot), aclErr is permanently set and sync.Once never retries. Every subsequent UploadFile / PresignedPutURL returns this stale error for the entire process lifetime β a restart is required to recover.
The MinIO equivalent (ensureBucket in service_minio.go) intentionally avoids sync.Once: it uses a sync.Mutex and retries on every call until success. The OSS implementation should follow the same pattern. Suggested fix:
type ServiceOSS struct {
// ...
aclMu sync.Mutex
aclDone atomic.Bool
}
func (s *ServiceOSS) ensureBucketACL(client *oss.Client, bucketName string) error {
if s.aclDone.Load() {
return nil
}
s.aclMu.Lock()
defer s.aclMu.Unlock()
if s.aclDone.Load() {
return nil
}
// ... existing creation / SetBucketACL logic ...
s.aclDone.Store(true)
return nil
}This keeps the "at most once on success" optimization while allowing retry on failure.
β Highlights
- Correct identification that
client.Bucket()is always non-nil β the SDK just constructs a local handle, making the oldbucket == nilbranch unreachable. IsBucketExistis the right API to probe server-side existence.- Skipping redundant
SetBucketACLafterCreateBucketis a clean optimization. - Both
UploadFileandPresignedPutURLare covered.
CI: Build/Lint/Vet/i18n β pending; check-sprint failure is sprint-tracking, not code quality.
yujiawei
left a comment
There was a problem hiding this comment.
Code Review β PR #199 (octo-server)
Summary
The change correctly diagnoses the real bug: client.Bucket() in the Aliyun SDK always returns a non-nil handle, so the old if bucket == nil { CreateBucket(...) } branch was dead code and pre-provisioned buckets never had ACLPublicRead reapplied. Replacing it with an explicit IsBucketExist probe plus an unconditional SetBucketACL self-heal is the right approach, and routing both UploadFile and PresignedPutURL through the helper is consistent. Build is clean (go build ./modules/file/...).
However, the chosen sync.Once + sticky-aclErr memoization introduces a new, worse failure mode than the bug it fixes, so I'm requesting one change before merge.
P1 β sync.Once permanently latches a transient ACL error and disables all uploads for the process lifetime
modules/file/service_oss.go:59-91
func (s *ServiceOSS) ensureBucketACL(client *oss.Client, bucketName string) error {
s.aclOnce.Do(func() {
exists, err := client.IsBucketExist(bucketName)
if err != nil {
...
s.aclErr = err // latched forever
return
}
...
})
return s.aclErr
}ServiceOSS is a long-lived singleton β it is constructed once per module at startup via file.NewService(ctx) (modules/file/service.go:64, called from modules/user/api.go:127, modules/message/api.go:266, modules/robot/api.go:234, etc.) and reused for every request. Because sync.Once.Do consumes the once-guard on the first invocation regardless of outcome:
- If the very first
ensureBucketACLcall hits a transient failure β an OSS 5xx, a network blip, a temporarily-expired STS credential βaclErris cached andaclOnceis spent. - Every subsequent
UploadFileandPresignedPutURLcall then returns that stale cached error forever, with no retry, until the process is restarted. One momentary error at the first upload after a deploy bricks all file uploads.
This is strictly worse than the pre-PR behavior. Before this change, the missing self-heal only caused anonymous GETs to 403 β uploads themselves still succeeded. After this change, a single transient error on the bootstrap path takes down the upload path entirely until restart.
Note the divergence from the sibling MinIO backend, which the PR description explicitly cites as the model: ServiceMinio.ensureBucket (modules/file/service_minio.go:85-120) re-checks on every call under a per-bucket mutex and returns errors without latching, so a transient failure self-recovers on the next request. The OSS version should have the same self-recovery property.
Suggested fix
Don't memoize the failure. Replace sync.Once/aclErr with a mutex + a success-only flag, mirroring the MinIO pattern β re-attempt on every call until one succeeds, then short-circuit:
type ServiceOSS struct {
log.Log
ctx *config.Context
aclMu sync.Mutex
aclDone bool
}
func (s *ServiceOSS) ensureBucketACL(client *oss.Client, bucketName string) error {
s.aclMu.Lock()
defer s.aclMu.Unlock()
if s.aclDone {
return nil
}
exists, err := client.IsBucketExist(bucketName)
if err != nil {
return err // not latched β next call retries
}
if !exists {
if err := client.CreateBucket(bucketName, oss.ACL(oss.ACLPublicRead)); err != nil {
return err
}
s.aclDone = true
return nil
}
if err := client.SetBucketACL(bucketName, oss.ACLPublicRead); err != nil {
return err
}
s.aclDone = true
return nil
}This keeps the "apply once per process under steady state" optimization the PR is after, while making the bootstrap path resilient to transient errors. (The successful path still runs exactly once, so the "negligible startup overhead" rationale in the docstring is preserved.)
Nits (non-blocking)
- Docstring HTTP-verb inaccuracy (
service_oss.go:55-57): the comment says the unconditional self-heal is "a single HTTP HEAD + PUT". The implementation actually issuesIsBucketExist(aGET ?location/ HEAD-style probe) plusSetBucketACL(PUT ?acl), i.e. two round-trips, not one. Minor wording, worth correcting so future readers don't undercount the calls. - Concurrency on first call is fine β
sync.Once.Docorrectly serializes concurrent first-callers, so there is no double-create race. The only defect is the error-latching above; flagging this so it's clear the concurrency aspect was reviewed and is sound.
Things verified as correct
- β
Root-cause analysis is accurate β confirmed
client.Bucket()returns a non-nil handle unconditionally, so the removedbucket == nilbranch was genuinely dead. - β
IsBucketExist,CreateBucket,SetBucketACLSDK signatures match the calls. - β
Both write paths (
UploadFile,PresignedPutURL) now gate onensureBucketACL;DownloadURL/PresignedGetURLcorrectly do not (read-only). - β
Skipping the redundant
SetBucketACLafterCreateBucket(ACLPublicRead)is correct. - β
Existing regression test
TestOSSDownloadURL_RoutesThroughNormalizeris unaffected and the new code path is integration-only (consistent withTestOSSUpload's manual-only skip).
Problem
UploadFileinservice_oss.goonly calledCreateBucket(ACLPublicRead)when the bucket did not exist. For deployments where the OSS bucket was pre-provisioned by external tooling (Terraform, console,ossutil) with a non-public ACL, the server never reappliedACLPublicRead, so anonymous browser GETs against asset URLs returned HTTP 403.This is the OSS-backend counterpart to #77 (MinIO equivalent, PR#87 in progress).
Root Cause
The Aliyun SDK's
client.Bucket()always returns a non-nil handle regardless of whether the bucket exists on the server, so thebucket == nilbranch was effectively dead code for any deployment with pre-existing buckets.Fix
Added
ensureBucketACLthat:ACLPublicReadif absent (first-upload bootstrap)client.SetBucketACL(PublicRead)on pre-existing buckets to self-heal incorrect ACLsync.Once(OSS uses a single configured bucket, unlike MinIO's per-type buckets)Both
UploadFileandPresignedPutURLnow callensureBucketACLbefore writing.Testing
go build ./modules/file/...β cleanTestOSSDownloadURL_RoutesThroughNormalizerβ passesensureBucketACLintegration requires real Aliyun credentials (same as existingTestOSSUpload, manual-only)Closes #86