Skip to content

fix(file): self-heal OSS bucket ACL on first upload#199

Open
lml2468 wants to merge 1 commit into
mainfrom
fix/oss-acl-self-heal-86
Open

fix(file): self-heal OSS bucket ACL on first upload#199
lml2468 wants to merge 1 commit into
mainfrom
fix/oss-acl-self-heal-86

Conversation

@lml2468
Copy link
Copy Markdown
Contributor

@lml2468 lml2468 commented May 30, 2026

Problem

UploadFile in service_oss.go only called CreateBucket(ACLPublicRead) when the bucket did not exist. For deployments where the OSS bucket was pre-provisioned by external tooling (Terraform, console, ossutil) with a non-public ACL, the server never reapplied ACLPublicRead, so anonymous browser GETs against asset URLs returned HTTP 403.

This is the OSS-backend counterpart to #77 (MinIO equivalent, PR#87 in progress).

Root Cause

// Before: ACL only set on bucket creation
bucket, err := client.Bucket(bucketName)
if bucket == nil {
    client.CreateBucket(bucketName, oss.ACL(oss.ACLPublicRead))
    // ...
}
// Pre-existing bucket: ACL never applied

The Aliyun SDK's client.Bucket() always returns a non-nil handle regardless of whether the bucket exists on the server, so the bucket == nil branch was effectively dead code for any deployment with pre-existing buckets.

Fix

Added ensureBucketACL that:

  1. Creates the bucket with ACLPublicRead if absent (first-upload bootstrap)
  2. Unconditionally calls client.SetBucketACL(PublicRead) on pre-existing buckets to self-heal incorrect ACL
  3. Runs at most once per process lifetime via sync.Once (OSS uses a single configured bucket, unlike MinIO's per-type buckets)

Both UploadFile and PresignedPutURL now call ensureBucketACL before writing.

Testing

  • go build ./modules/file/... β€” clean
  • Existing TestOSSDownloadURL_RoutesThroughNormalizer β€” passes
  • ensureBucketACL integration requires real Aliyun credentials (same as existing TestOSSUpload, manual-only)

Closes #86

UploadFile only called CreateBucket (with ACLPublicRead) when the bucket
did not exist. For deployments where the bucket was pre-provisioned by
external tooling (Terraform, console, ossutil) with a non-public ACL,
the server never reapplied the ACL, so anonymous browser GETs against
asset URLs returned 403.

Added ensureBucketACL that:
1. Creates the bucket with ACLPublicRead if absent (first-upload bootstrap)
2. Unconditionally calls SetBucketACL(PublicRead) on pre-existing buckets
3. Runs at most once per process lifetime via sync.Once (single bucket)

Both UploadFile and PresignedPutURL now call ensureBucketACL before
writing, so the ACL is guaranteed correct regardless of how the bucket
was originally provisioned.

This is the OSS-backend counterpart to #77 (MinIO equivalent).

Signed-off-by: lifeifei <lifeifei@mlamp.cn>
@lml2468 lml2468 requested a review from a team as a code owner May 30, 2026 07:09
@github-actions github-actions Bot added the size/M PR size: M label May 30, 2026
Copy link
Copy Markdown
Contributor

@Jerry-Xin Jerry-Xin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is relevant to octo-server, but the new one-time OSS ACL bootstrap can permanently break uploads after a transient or permission-related first failure.

πŸ”΄ Blocking

πŸ”΄ Critical β€” sync.Once caches the first failure forever. In modules/file/service_oss.go, aclOnce.Do runs only once even when IsBucketExist, CreateBucket, or SetBucketACL fails. Because aclErr is then returned on every later UploadFile and PresignedPutURL call, a temporary OSS/network/permission propagation error during the first request makes the process unable to upload until restart. Use a retryable guard instead: only mark ACL initialization complete after success, or protect the check with a mutex/state flag so failures can be retried.

πŸ”΄ Critical β€” the create path is not race-tolerant and combines badly with the cached failure. In modules/file/service_oss.go and modules/file/service_oss.go, two processes can both observe the bucket as absent, one creates it, and the other receives an β€œalready exists/already owned” OSS service error. This PR records that error in aclErr permanently due to sync.Once, so one app instance can fail all future uploads even though the bucket now exists. Treat the benign OSS already-exists codes as success and still apply/verify ACL, or retry the existing-bucket ACL path after that error.

πŸ’¬ Non-blocking

🟑 Warning β€” modules/file/service_oss.go uses client.IsBucketExist, which the Aliyun SDK implements via bucket listing. That may require broader ListBuckets permission than a least-privilege runtime policy that only has object write and bucket ACL permissions. Consider avoiding a global list dependency by attempting SetBucketACL directly and handling NoSuchBucket, or document the new RAM permission requirement clearly.

βœ… Highlights

The PR correctly identifies that client.Bucket() is only a handle constructor and does not prove remote bucket existence.

Both server-side uploads and presigned PUT generation now route through the same bucket ACL bootstrap path.

Verification run: go test ./modules/file/... passed.

Copy link
Copy Markdown
Contributor Author

@lml2468 lml2468 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[REQUEST_CHANGES]

Good root-cause analysis β€” client.Bucket() always returns a non-nil handle so the old bucket == nil branch was dead code. The ensureBucketACL approach is the right shape.

🟑 Warning (blocking)

sync.Once caches transient failures permanently. If the first ensureBucketACL invocation fails (transient network error, OSS control-plane blip, credentials not yet rotated at boot), aclErr is permanently set and sync.Once never retries. Every subsequent UploadFile / PresignedPutURL returns this stale error for the entire process lifetime β€” a restart is required to recover.

The MinIO equivalent (ensureBucket in service_minio.go) intentionally avoids sync.Once: it uses a sync.Mutex and retries on every call until success. The OSS implementation should follow the same pattern. Suggested fix:

type ServiceOSS struct {
    // ...
    aclMu   sync.Mutex
    aclDone atomic.Bool
}

func (s *ServiceOSS) ensureBucketACL(client *oss.Client, bucketName string) error {
    if s.aclDone.Load() {
        return nil
    }
    s.aclMu.Lock()
    defer s.aclMu.Unlock()
    if s.aclDone.Load() {
        return nil
    }
    // ... existing creation / SetBucketACL logic ...
    s.aclDone.Store(true)
    return nil
}

This keeps the "at most once on success" optimization while allowing retry on failure.

βœ… Highlights

  • Correct identification that client.Bucket() is always non-nil β€” the SDK just constructs a local handle, making the old bucket == nil branch unreachable.
  • IsBucketExist is the right API to probe server-side existence.
  • Skipping redundant SetBucketACL after CreateBucket is a clean optimization.
  • Both UploadFile and PresignedPutURL are covered.

CI: Build/Lint/Vet/i18n β€” pending; check-sprint failure is sprint-tracking, not code quality.

Copy link
Copy Markdown
Contributor

@yujiawei yujiawei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review β€” PR #199 (octo-server)

Summary

The change correctly diagnoses the real bug: client.Bucket() in the Aliyun SDK always returns a non-nil handle, so the old if bucket == nil { CreateBucket(...) } branch was dead code and pre-provisioned buckets never had ACLPublicRead reapplied. Replacing it with an explicit IsBucketExist probe plus an unconditional SetBucketACL self-heal is the right approach, and routing both UploadFile and PresignedPutURL through the helper is consistent. Build is clean (go build ./modules/file/...).

However, the chosen sync.Once + sticky-aclErr memoization introduces a new, worse failure mode than the bug it fixes, so I'm requesting one change before merge.

P1 β€” sync.Once permanently latches a transient ACL error and disables all uploads for the process lifetime

modules/file/service_oss.go:59-91

func (s *ServiceOSS) ensureBucketACL(client *oss.Client, bucketName string) error {
	s.aclOnce.Do(func() {
		exists, err := client.IsBucketExist(bucketName)
		if err != nil {
			...
			s.aclErr = err   // latched forever
			return
		}
		...
	})
	return s.aclErr
}

ServiceOSS is a long-lived singleton β€” it is constructed once per module at startup via file.NewService(ctx) (modules/file/service.go:64, called from modules/user/api.go:127, modules/message/api.go:266, modules/robot/api.go:234, etc.) and reused for every request. Because sync.Once.Do consumes the once-guard on the first invocation regardless of outcome:

  • If the very first ensureBucketACL call hits a transient failure β€” an OSS 5xx, a network blip, a temporarily-expired STS credential β€” aclErr is cached and aclOnce is spent.
  • Every subsequent UploadFile and PresignedPutURL call then returns that stale cached error forever, with no retry, until the process is restarted. One momentary error at the first upload after a deploy bricks all file uploads.

This is strictly worse than the pre-PR behavior. Before this change, the missing self-heal only caused anonymous GETs to 403 β€” uploads themselves still succeeded. After this change, a single transient error on the bootstrap path takes down the upload path entirely until restart.

Note the divergence from the sibling MinIO backend, which the PR description explicitly cites as the model: ServiceMinio.ensureBucket (modules/file/service_minio.go:85-120) re-checks on every call under a per-bucket mutex and returns errors without latching, so a transient failure self-recovers on the next request. The OSS version should have the same self-recovery property.

Suggested fix

Don't memoize the failure. Replace sync.Once/aclErr with a mutex + a success-only flag, mirroring the MinIO pattern β€” re-attempt on every call until one succeeds, then short-circuit:

type ServiceOSS struct {
	log.Log
	ctx     *config.Context
	aclMu   sync.Mutex
	aclDone bool
}

func (s *ServiceOSS) ensureBucketACL(client *oss.Client, bucketName string) error {
	s.aclMu.Lock()
	defer s.aclMu.Unlock()
	if s.aclDone {
		return nil
	}
	exists, err := client.IsBucketExist(bucketName)
	if err != nil {
		return err // not latched β€” next call retries
	}
	if !exists {
		if err := client.CreateBucket(bucketName, oss.ACL(oss.ACLPublicRead)); err != nil {
			return err
		}
		s.aclDone = true
		return nil
	}
	if err := client.SetBucketACL(bucketName, oss.ACLPublicRead); err != nil {
		return err
	}
	s.aclDone = true
	return nil
}

This keeps the "apply once per process under steady state" optimization the PR is after, while making the bootstrap path resilient to transient errors. (The successful path still runs exactly once, so the "negligible startup overhead" rationale in the docstring is preserved.)

Nits (non-blocking)

  • Docstring HTTP-verb inaccuracy (service_oss.go:55-57): the comment says the unconditional self-heal is "a single HTTP HEAD + PUT". The implementation actually issues IsBucketExist (a GET ?location / HEAD-style probe) plus SetBucketACL (PUT ?acl), i.e. two round-trips, not one. Minor wording, worth correcting so future readers don't undercount the calls.
  • Concurrency on first call is fine β€” sync.Once.Do correctly serializes concurrent first-callers, so there is no double-create race. The only defect is the error-latching above; flagging this so it's clear the concurrency aspect was reviewed and is sound.

Things verified as correct

  • βœ… Root-cause analysis is accurate β€” confirmed client.Bucket() returns a non-nil handle unconditionally, so the removed bucket == nil branch was genuinely dead.
  • βœ… IsBucketExist, CreateBucket, SetBucketACL SDK signatures match the calls.
  • βœ… Both write paths (UploadFile, PresignedPutURL) now gate on ensureBucketACL; DownloadURL/PresignedGetURL correctly do not (read-only).
  • βœ… Skipping the redundant SetBucketACL after CreateBucket(ACLPublicRead) is correct.
  • βœ… Existing regression test TestOSSDownloadURL_RoutesThroughNormalizer is unaffected and the new code path is integration-only (consistent with TestOSSUpload's manual-only skip).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M PR size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

service_oss: ACL applied only on bucket create β€” pre-provisioned OSS buckets render 403 (same shape as #77)

3 participants