refactor(operatorhub): install ArtifactVersion via violet binary (PR 2/3)#14
Conversation
The MinIO root for violet packages varies across environments (private vs shared, regional mirrors), so the CLI should not pretend any value is "the default". Make Violet.PackagePrefix a required field; empty value surfaces as a clear "packagePrefix is empty" error at BuildPackageURL time (already covered in PR 1 tests).
Swaps the InstallArtifactVersion entry point from the in-process dynamic
client write path to a child-process invocation of the `violet push` binary.
Artifact / ArtifactVersion CR creation is now owned by violet; OLM Subscription,
InstallPlan and CSV flows remain in Go and are unchanged.
The new flow lives in installViaViolet (pkg/operator/operatorhub/violet.go)
and runs eight stages, each with a labelled error prefix so failures are easy
to locate:
1. get Artifact CR (must already exist, managed out of band)
2. build the MinIO download URL via BuildPackageURL
3. download .tgz to a per-call temp directory (defer cleanup)
4. verify Version.ExpectedSha256 when non-empty (no-op otherwise)
5. ensure clean ArtifactVersion: delete same-name residue + poll NotFound,
so waitArtifactVersionPresent cannot match a stale object from a prior
failed upgrade and report false success
6. exec `violet push <tgz>` with a minimal-permission child env
(KUBECONFIG / PATH / HOME / USER / VIOLET_*); credentials come from
VIOLET_REGISTRY_USERNAME / _PASSWORD via BuildVioletPushArgs; the rendered
command is mask-logged before exec to keep --password values out of logs
7. wait ArtifactVersion phase=Present (reuses existing wait helper) and
cross-check that spec.tag equals the requested BundleVersion — if violet
ever changes its naming convention this surfaces loudly rather than
silently accepting an unrelated AV
8. wait PackageManifest exposes the CSV (reuses existing wait helper)
Supporting changes:
- Operator gains a `violet *config.VioletConfig` field, populated by
NewOperator from OperatorConfig.Violet. InstallArtifactVersion returns
a clear config error when this field is nil.
- UpgradeOperator now passes the whole config.Version (not just BundleVersion)
into InstallArtifactVersion — channel and ExpectedSha256 are needed for the
violet flow.
- validateVioletBin requires Violet.Bin (when set) to be an absolute path to
an executable regular file, guarding against accidental execution of a
user-supplied directory or non-executable path. Empty Bin falls back to
looking up `violet` in $PATH.
- createArtifactVersion (the in-process write path) is preserved as dead
code with a Deprecated comment; PR 3 removes it.
OQ1 outcome: the local violet binary does not support --password-stdin or
file-based credential reads, so argv passing is the only available path.
Mitigations are README documentation (shared-runner warning) and MaskCommand
log redaction; OS-level inspection (`ps auxe`) of the credential remains an
accepted risk for the first release. Issue tracked in plan §Security.
OQ2 outcome: deferred to integration validation. The cross-check on
spec.tag (step 7) converts a silent failure into a loud one if violet uses
a different AV naming convention.
OQ3 outcome: deferred to integration validation. Tekton task image must
ship violet; verify before merging the PR.
Plan: docs/plans/2026-05-18-refactor-externalize-violet-binary-plan.md
- README §编写配置文件: extend the upgrade.yaml example with operatorConfig.violet (required block) and Version.expectedSha256 (optional). Adds a "配置说明" callout that names violet as required, declares packagePrefix has no default, documents the URL convention, and flags expectedSha256 as a defense against HTTP-MITM-induced supply-chain attacks. - README new section §violet 依赖与运行环境: covers binary dependency (PATH or absolute path with executable-bit check), the env allowlist (KUBECONFIG / PATH / HOME / USER / VIOLET_*), registry credential injection via VIOLET_REGISTRY_USERNAME / _PASSWORD with shared-runner argv warning, and KUBECONFIG passthrough for k8s auth. - CLAUDE.md §架构: clarify Artifact / ArtifactVersion writes are owned by violet; Go retains only Get + Delete + waits. References installViaViolet and targetCatalogSource constant. - CLAUDE.md §硬约束: add three rules — violet owns AV writes, registry creds must go through env vars not config, packagePrefix is required (no default).
🤖 AI Code Review
SummaryThis PR (PR 2/3) refactors the operatorhub flow to install ArtifactVersion via the external Review Statistics
Critical Issues
Warnings
Suggestions
Positive Feedback
Note: This review focused on security and reliability issues. The PR has good overall structure and addresses many of the concerns raised in PR 1. The critical URL scheme and SHA validation issues should be addressed before merge. ℹ️ About this reviewThis review was automatically generated using the
|
Add unit tests for the new code introduced in this PR. installViaViolet and execVioletPush remain integration-level (require a real cluster and a stub binary respectively), but everything below them is testable in process and now covered: pkg/exec (PR 1 additions that previously had no test): - TestMatchAllowlist: exact / prefix-glob / mixed / empty allowlist - TestFilterHostEnv_*: empty allowlist preserves full os.Environ() passthrough, non-empty allowlist filters, prefix glob matches by prefix not substring - TestLastLines: trailing-newline trimming, fewer-than-N lines, only-newlines collapses to empty - TestWrapWithStderrTail: empty stderr passes through unchanged; non-empty stderr is appended and errors.Is still identifies the base error - TestRunCommand_*: end-to-end verification that EnvAllowlist actually filters what the child sees (uses /bin/sh -c 'env'), empty allowlist preserves legacy behavior, and a failing child wraps the stderr tail into the error pkg/operator/operatorhub: - TestValidateVioletBin: relative path / missing / directory / non-executable rejections plus the absolute-executable accept path - TestDownloadToTemp_*: 200 OK + filename extracted from URL path; 404 and 500 surface as labelled errors; URL with only "/" falls back to "package.tgz"; context-cancel propagates from net/http (verifies the request honours ctx) - TestDeleteArtifactVersionIfExists_*: backed by k8s client-go dynamic fake. Covers no-residue noop, residue removed + poll-until-NotFound, and the race where the AV vanishes between our Get and Delete (must not propagate as an error) Production fix surfaced by the tests: - downloadToTemp used filepath.Base(rawURL) to pick the local filename, which is wrong: filepath.Base does not understand URL structure and on inputs like "http://host:port/" returns "host:port" rather than "/". Switched to url.Parse(rawURL) + path.Base(parsed.Path), with the same fallback to "package.tgz" for empty/root paths. Tests now lock this behaviour in. go.mod / go.sum: k8s.io/client-go/dynamic/fake transitively pulls a few additional indirect deps (kr/pretty, go-difflib, etc.) — picked up by `go mod tidy`. No first-party deps added.
- bound .tgz download with an explicit timeout derived from o.timeout so a stalled MinIO does not pin the upgrade to OS-level TCP keepalive - explicit close-and-check on the downloaded file so a flush error (disk full, NFS reset) does not silently produce a truncated .tgz that bypasses VerifySha256 when ExpectedSha256 is unset - retry transient apiserver errors (timeout / throttle / unavailable) while polling ArtifactVersion and PackageManifest; treat NotFound as keep-polling to absorb the post-violet-push eventual-consistency window - exact-match CSV in PackageManifest entries (was strings.Contains, which matched v0.65.0 against v0.65.0-rc1 and friends) and surface NestedString type errors instead of silently dropping them - return csv from InstallArtifactVersion so UpgradeOperator does not silently feed an empty status.version into InstallSubscription via a second unchecked NestedString extraction - require version.channel up front in UpgradeOperator (was a silent "stable" fallback that contradicted BuildPackageURL's strict policy and could route Subscription through the wrong OLM channel on a typo) Adds TestDownloadToTemp_TimeoutEnforcedWithoutParentDeadline as a regression guard for the download timeout path.
17 findings from a 6-agent parallel review (security / silent-failure / simplicity / architecture / performance / type-design) consolidated into the todos/ directory: - 001-012 are P2 — should fix before / with PR 3 - 013-017 are P3 — cleanup follow-ups for the PR 3 cleanup commit 001 + 004-008 are already addressed by the previous commit (a036487); the corresponding files are kept in pending status so triage can flip them to ready / complete when the implementation is reviewed.
Review SummaryPR 2 of 3: The violet externalization refactor is well-executed. The code correctly delegates ArtifactVersion write operations to Critical Issues: 0 The 17 See |
… channel Real-cluster integration testing on 40-devops surfaced a design defect: the single Version.Channel field was used both as the OLM Subscription channel and as the MinIO URL path segment, but the two are independent on platform packages. On the live cluster: - tektoncd-operator OLM channels are: latest, pipelines-4.0, pipelines-4.2, pipelines-4.6, stable - MinIO URL segments are: v4.0, v4.2, v4.6, rc A user filling `channel: stable` (which is correct for Subscription) made BuildPackageURL produce ".../tektoncd-operator/stable/...tgz" which is 404, and the violet flow could never download anything. Fix: - Add Version.PackageChannel for the MinIO segment. - Channel keeps its OLM Subscription meaning (unchanged for callers). - Version.EffectivePackageChannel() returns PackageChannel when set, falling back to Channel for repositories where they coincide. - installViaViolet now calls BuildPackageURL with EffectivePackageChannel() instead of Channel. - README documents the distinction explicitly and gives both fields in the example config. - New table test TestVersionEffectivePackageChannel covers precedence, fallback, both-empty, and divergent-segments cases. Verified by HEAD-checking the MinIO endpoint: http://.../tektoncd-operator/stable/...v4.0.17.tgz -> 404 http://.../tektoncd-operator/v4.0/...v4.0.17.tgz -> 200
| return "", nil, fmt.Errorf("build request: %w", err) | ||
| } | ||
|
|
||
| resp, err := http.DefaultClient.Do(req) |
There was a problem hiding this comment.
Critical (security/ssrf): Neither BuildPackageURL nor downloadToTemp validates the URL scheme. An attacker who controls config.yaml (multi-tenant pipeline, supply-chain injection) can set packagePrefix: http://169.254.169.254/latest/meta-data/ to exfiltrate cloud IMDS tokens — Go follows up to 10 redirects by default. Add scheme allowlist (http/https only) before the HTTP request:
parsed, err := url.Parse(rawURL)
if err != nil { return "", nil, err }
if parsed.Scheme != "http" && parsed.Scheme != "https" {
return "", nil, fmt.Errorf("unsupported scheme %q (only http/https allowed)", parsed.Scheme)
}There was a problem hiding this comment.
Update: This issue remains unaddressed in the current PR. The URL scheme is still not validated - accepts any scheme. This should be addressed before merging.
| Args: args, | ||
| EnvAllowlist: violetEnvAllowlist, | ||
| }) | ||
| return result.Err |
There was a problem hiding this comment.
Critical (security/secrets): VIOLET_REGISTRY_PASSWORD is passed as --password argv to the violet subprocess. MaskCommand only redacts it from the parent log, but violet's stdout/stderr stream directly to the console via io.MultiWriter in pkg/exec/exec.go, and wrapWithStderrTail injects raw stderr into the error chain. On a shared CI runner, ps auxe or /proc/<violet-pid>/cmdline can read the password. Scrub the password from result.Err.Error() and the captured Stderr buffer after exec.RunCommand returns:
if pw := os.Getenv(EnvVioletRegistryPassword); pw != "" && result.Err != nil {
result.Err = fmt.Errorf("%v", strings.ReplaceAll(result.Err.Error(), pw, "***"))
result.Stderr = strings.ReplaceAll(result.Stderr, pw, "***")
}
return result.ErrThere was a problem hiding this comment.
Update: This issue remains unaddressed. The passwords are still passed as CLI arguments and can leak via violet's stdout/stderr (printed via exec.RunCommand's MultiWriter to os.Stderr). Should be addressed before merging.
|
|
||
| tgzPath, cleanup, err := downloadToTemp(ctx, url, o.timeout) | ||
| if err != nil { | ||
| return nil, "", fmt.Errorf("download %s: %w", url, err) |
There was a problem hiding this comment.
Critical (security/integrity): ExpectedSha256 is optional. HTTP downloads with no SHA verification allow an on-path attacker to inject a malicious .tgz, which violet promotes as ArtifactVersion and OLM installs in cpaas-system (cluster-admin). The README documents this risk but code does not enforce it. Implement Config.Validate() (see todo #10) to reject configs where packagePrefix starts with http:// and any expectedSha256 is empty.
There was a problem hiding this comment.
Update: This issue remains unaddressed. is still optional in config.go. For http/https URLs, this should be required to prevent MITM attacks.
| // essentials; VIOLET_* covers VIOLET_REGISTRY_USERNAME / _PASSWORD and any | ||
| // future violet-specific overrides without leaking unrelated CI secrets such | ||
| // as GITHUB_TOKEN or AWS_*. | ||
| var violetEnvAllowlist = []string{ |
There was a problem hiding this comment.
Warning (security/env-allowlist): The VIOLET_* prefix glob is overly broad — any env var starting with VIOLET_ leaks into the violet subprocess, including hypothetical future vars like VIOLET_GITHUB_TOKEN. Narrow to the two known required vars:
var violetEnvAllowlist = []string{
"KUBECONFIG", "PATH", "HOME", "USER",
EnvVioletRegistryUsername,
EnvVioletRegistryPassword,
}There was a problem hiding this comment.
Update: This issue remains unaddressed. The allowlist still uses glob pattern at violet.go:359. Consider using explicit list: VIOLET_REGISTRY_USERNAME, VIOLET_REGISTRY_PASSWORD, VIOLET_PLATFORM_USERNAME, VIOLET_PLATFORM_PASSWORD.
| if err != nil { | ||
| return "", nil, fmt.Errorf("mkdir temp: %w", err) | ||
| } | ||
| cleanup := func() { _ = os.RemoveAll(dir) } |
There was a problem hiding this comment.
Suggestion (observability/cleanup-warning): cleanup := func() { _ = os.RemoveAll(dir) } silently ignores cleanup failures. When CI runners run low on /tmp, orphaned dirs accumulate without trace. Add a warn log:
cleanup := func() {
if err := os.RemoveAll(dir); err != nil {
log.Warnw("failed to remove temp dir", "dir", dir, "err", err)
}
}There was a problem hiding this comment.
Update: This suggestion remains unaddressed. The cleanup function at violet.go:266 still silently ignores errors. Consider logging at debug level.
| // Returns (av, csv, err) so callers do not have to re-read status.version | ||
| // from the unstructured object — a second extraction site was prone to | ||
| // silently feeding an empty csv to InstallSubscription. | ||
| func (o *Operator) installViaViolet(ctx context.Context, version config.Version) (*unstructured.Unstructured, string, error) { |
There was a problem hiding this comment.
Suggestion (refactor/narrow-interface): installViaViolet receives config.Version (~10 fields) but uses only 3 (Channel, BundleVersion, ExpectedSha256). Extracting a narrow bundleParams struct would document the actual dependency and prevent future drift as new fields are added to Version.
There was a problem hiding this comment.
Update: This suggestion remains unaddressed. still receives the full struct (~10 fields) but only uses 3. Consider extracting a narrow struct for better interface clarity.
…via env
Real-cluster integration testing on 40-devops surfaced four missing pieces
that PR 2 had punted into Violet.PushArgs as a workaround. This commit
promotes each one to a first-class part of the schema (or the env-based
credential protocol) so production configs no longer have to thread
sensitive flags through an opaque string list.
(1) Violet.PlatformAddress — required ACP URL the violet binary
authenticates against. Lives in config (not a credential). Empty value
surfaces at violet exec time, not silently.
(2) Violet.Clusters — target subcluster(s) for the Artifact/AV write.
violet defaults to "global", but multi-cluster ACP deployments expose a
per-workload subcluster (kubectl is connected through
/kubernetes/<subcluster>) and the two must match. Without it, violet
reports "updated successfully" while the CRs never appear in the
cluster the upgrade CLI is watching — a silent assumed-success that
only surfaces minutes later as a wait timeout.
(3) Violet.Force *bool, defaulting to true — without --force, violet aborts
AV upserts with "already exist, skip it" and creates nothing. Verified
on tektoncd-operator v4.0.17: same input that succeeded with --force
silently no-op'd without it. The pointer keeps the door open for users
who genuinely want to preserve a stale AV, but the default matches
every real-world flow we have.
(4) VIOLET_PLATFORM_USERNAME / VIOLET_PLATFORM_PASSWORD environment
variables — auto-injected as --platform-username / --platform-password
so platform credentials follow the same env-only contract as registry
credentials. MaskCommand now redacts both --password and
--platform-password (via a sensitivePasswordFlags map). OS-level argv
inspection remains the documented residual risk.
Mechanical changes:
- BuildVioletPushArgs refactored to take a VioletPushParams struct rather
than a growing positional argument list. Keeps violet.go decoupled from
the config package (caller dereferences the *bool defaults before
populating the struct).
- execVioletPush populates VioletPushParams from o.violet, applying the
SkipPush=true / Force=true defaults at the boundary via a tiny
derefBoolDefault helper.
- defaultConfig fills Violet.Force=true when nil, mirroring the existing
SkipPush=true default.
- TestBuildVioletPushArgs rewritten as a 10-case table covering every new
knob, both creds env-var pairs, force on/off, push-args ordering, and
the full real-cluster shape. TestMaskCommand grows two cases covering
--platform-password redaction alone and the combined registry+platform
password mask.
README and CLAUDE.md document the new fields, the dual credential env
protocol, and the "Force defaults true" / "Clusters required on multi-
cluster ACP" hard constraints.
Integration test report — end-to-end pass ✅Real-cluster integration test of the full violet onboarding flow completed successfully against a multi-cluster ACP setup. All 8 stages of Final cluster state verified
Open Questions outcomes
Findings folded into this PRFour real-cluster needs that PR 2 had originally punted into
Earlier in-PR fixes (already on the branch)
Test coverage
Pre-merge checklist
|
The five P3 items captured nice-to-have polish (silent error cleanup, defense-in-depth security, organizational tweaks, field-format validation, env-allowlist threat-model doc). None of them are blocking and none have been scheduled, so retiring them keeps todos/ focused on the P2 list that will actually drive a follow-up cycle.
Six findings from the post-integration re-review, applied as one commit so
the result reads coherently — three of them tighten the credential-handling
contract and three of them strip complexity that the previous follow-up had
left behind.
Security
--------
(S1) Violet.PushArgs now rejects credential flags. --username, --password,
--platform-username, --platform-password (in both "--flag value" and
"--flag=value" form) cause BuildVioletPushArgs to return an error
before any process is started. The escape hatch stays usable for
--plain / --image-pull-secret / --dest-repo and similar, but it can
no longer be used to smuggle a real secret into config.yaml and
bypass the env-only credential contract.
(S2) Replace real credentials in test fixtures and README example with
obviously-fake placeholders ("admin@example.invalid",
"redacted-password"). Anything that ships in git should never look
like a working credential, even when it isn't one anymore.
(S3) Add Command.RedactSecrets in pkg/exec. When non-empty, every Write
to stdout/stderr (console + capture buffer + the stderr-tail wrapped
into Err) is scrubbed via bytes.ReplaceAll before forwarding. The
violet caller passes the real credential strings from
VIOLET_REGISTRY_PASSWORD / VIOLET_PLATFORM_PASSWORD, so even if
violet itself echoes its argv in verbose mode the CI log no longer
carries the secret verbatim. MaskCommand was already covering OUR
log line; this covers the child's output.
Simplicity
----------
(P1) Remove Violet.Force. The CLI always passes --force to violet — the
"preserve a stale AV" use case was hypothetical and conflicts with
deleteArtifactVersionIfExists already running first. One field, one
*bool dance, one default branch, and two test cases gone.
(P2) Inline the two single-use helpers: derefBoolDefault becomes three
lines at the SkipPush call site, sensitivePasswordFlags becomes an
isPasswordFlag function with a one-line || expression. Both helpers
were strictly local indirection.
(P3) EnvAllowlist drops glob support. matchAllowlist used to interpret a
trailing "*" as a prefix match; the only caller (violetEnvAllowlist)
used that for "VIOLET_*", but it also masks future unaudited
VIOLET_FOO env vars from sneaking through. Replaced with an exact
enumeration of the four VIOLET_REGISTRY/PLATFORM_* names. The
filterHostEnv loop now does a plain string equality check, which
was always the intent.
Mechanical follow-through
-------------------------
- BuildVioletPushArgs signature is now `(VioletPushParams) ([]string, error)`.
- VioletPushParams.Force removed.
- pkg/operator/operatorhub/violet.go drops the VIOLET_* glob comment and
enumerates the four env names; the test using a glob is removed and a
new TestRunCommand_RedactSecrets* trio covers stdout/stderr/Err scrubbing
and the no-op fast path.
- TestBuildVioletPushArgs rewritten as 8 happy-path cases (no Force knob)
plus TestBuildVioletPushArgs_RejectsCredentialFlagsInPushArgs (7 cases)
and TestBuildVioletPushArgs_AllowsNonCredentialFlags (1 sanity check).
- README documents that credentials in PushArgs are rejected; CLAUDE.md
adds the "PushArgs rejects credential flags" and "CLI always sends
--force" hard constraints, replacing the prior Force section.
Previously every UpgradePath step deleted the Subscription and prior CSV, then recreated the Subscription from scratch. That was effectively a fresh install on each iteration and did not exercise the OLM replace chain, so data-compatibility tests could miss upgrade regressions. Now InstallSubscription only creates the Subscription on first use; from the second version onward it keeps the existing Subscription and CSV in place, patches spec.channel when it differs, and bumps a refresh annotation to force the OLM controller to re-reconcile (mainly to defeat PackageManifest cache staleness right after a violet push). It then waits for an InstallPlan whose spec.clusterServiceVersionNames contains the target CSV (the old "any installplan name" check returned the previously approved plan on retry and hung waitCSVReady), approves it idempotently, and waits for the new CSV phase=Succeeded. Also removes the now-unused deleteResource helper and its schema / dynamic imports.
- VioletConfig.LocalPackageDir: opt-in on-disk cache for downloaded .tgz packages. Layout mirrors the MinIO URL convention so cache hits are trivially diagnosable. SHA-256 verification still runs against cached files; half-written files are removed on download error to prevent false hits on retry. - VioletConfig.PlatformUsername / PlatformPassword: opt-in config-side credentials for environments where VIOLET_PLATFORM_* env injection is awkward. Config wins, env falls back, both feed --platform-*. Resolved password also lands in RedactSecrets so violet's own stdout cannot leak. - Refactor downloadToTemp into downloadFile + per-call wrapper so the cache path and the legacy /tmp path share the same HTTP/error logic.
- Drop dead createArtifactVersion: violet binary owns the write path now; git history serves the "diff vs old impl" rationale better than keeping unreachable code in the file. - Wrap refreshSubscriptionForUpgrade and approveInstallPlan in retry.RetryOnConflict so a concurrent OLM status write does not abort the upgrade with a single 409. Get moves inside the retry block so each attempt sees the freshest resourceVersion. - Move Version.Channel required-check from UpgradeOperator runtime path to validateConfig at LoadConfig time. Catches the typo at load instead of after a half-finished upgrade. Gated on Type=="operatorhub" so the local flow (which never reads Channel) stays untouched. Tests: 3 new subscription cases (retry on 409, idempotent skip), 4 new config cases (missing-channel rejection, local exemption, defaulted type still validates, valid path loads cleanly).
alaudabot
left a comment
There was a problem hiding this comment.
This PR has solid improvements but 3 critical security issues remain unaddressed: (1) URL scheme not validated, (2) passwords as CLI args, (3) SHA256 optional for HTTP. Please review the inline comments.
| // ExpectedSha256, when non-empty, is the lowercase hex SHA-256 of the | ||
| // downloaded .tgz. The upgrade CLI verifies the digest after download and | ||
| // fails fast on mismatch. Optional; leave empty to skip verification. | ||
| ExpectedSha256 string `yaml:"expectedSha256,omitempty"` |
There was a problem hiding this comment.
Critical (security/integrity): ExpectedSha256 is optional. For HTTP URLs (non-HTTPS), this allows on-path attackers to inject malicious .tgz. Consider making SHA256 required when packagePrefix starts with http://.
| } | ||
| platformUser := p.PlatformUsername | ||
| if platformUser == "" { | ||
| platformUser = os.Getenv(EnvVioletPlatformUsername) |
There was a problem hiding this comment.
Critical (security/secrets): Passwords passed as CLI args. While RedactSecrets scrubs stdout/stderr, passwords are visible via ps auxe. Document this as known risk for shared CI runners.
alaudabot
left a comment
There was a problem hiding this comment.
Adding additional review comments:
| if err != nil { | ||
| return fmt.Errorf("build request: %w", err) | ||
| } | ||
| resp, err := http.DefaultClient.Do(req) |
There was a problem hiding this comment.
Critical (security/ssrf): No URL scheme validation here or in BuildPackageURL. An attacker controlling config could set packagePrefix to internal IPs or cloud metadata endpoints (e.g., http://169.254.169.254/). Add scheme validation to reject non-http/https URLs.
| if err != nil { | ||
| return "", nil, fmt.Errorf("mkdir temp: %w", err) | ||
| } | ||
| cleanup := func() { _ = os.RemoveAll(dir) } |
There was a problem hiding this comment.
Warning (quality/cleanup): cleanup function silently ignores errors with _ = os.RemoveAll(dir). Consider logging at debug level: if err := os.RemoveAll(dir); err != nil { log.Debugw("cleanup failed", "err", err) }.
| return nil, "", fmt.Errorf("get artifact %s: %w", o.artifact, err) | ||
| } | ||
|
|
||
| url, err := BuildPackageURL(o.violet.PackagePrefix, o.name, version.EffectivePackageChannel(), version.BundleVersion) |
There was a problem hiding this comment.
Suggestion (refactor/narrow-interface): installViaViolet receives full config.Version (~10 fields) but only uses Channel, BundleVersion, ExpectedSha256, EffectivePackageChannel. Consider a narrow input struct for better type safety and self-documentation.
测试结论:跨大版本升级 v4.0.17 → v4.6.3-rc.91.g8ff4ed2 通过 ✅测试环境
关键时间线(节点本地时区)
本 PR 改动的具体验证
测试中暴露的问题(非本 PR 范围,可作为 follow-up)
总结本 PR 引入的两个新配置入口( |
Round 2:逐步升级 v4.0.17 → v4.2.5 → v4.6.3-rc.91.g8ff4ed2 通过 ✅延续 Round 1(一步直跨)的同环境同节点,补充验证多跳 OLM replace chain。集群在 Round 1 跑完后做了一次精准清理(按名字删 Subscription / CSV / IP / 全部 AV),保证起点干净。 配置差异(vs Round 1)upgradePaths:
- name: v4.0.17-to-v4.2.5-to-v4.6.3-rc.91
versions:
- name: v4.0.17 # packageChannel: v4.0
- name: v4.2.5 # packageChannel: v4.2 ← 新增中间站
- name: v4.6.3-rc.91.g8ff4ed2 # packageChannel: rc三站 OLM channel 都是 关键时间线(节点本地时区)
Round 1 vs Round 2 对比
本 PR 改动的额外验证(Round 1 已覆盖的不重复)
新观察 / Insight
总结本 PR 引入的 |
BuildPackageURL hardcoded the second-to-last filename segment as `latest`, which worked for tektoncd only because that uploader double-wrote both `latest.ALL` and `<channel>.ALL` files. Operators like gitlab-ce that upload a single `<channel>.ALL` variant hit a 404 in acquirePackage. Split the API into pathChannel (directory segment, can be a release-train name like `v4.0`) and fileChannel (filename segment, the OLM channel like `stable`). acquirePackage's cache layout follows the same split.
…first onboard The Get only used artifact.GetName(), which equals o.artifact (already known from config at construction time). The Get added no information but turned every "first onboard" target — a fresh platform that has never installed this operator — into a fail-fast: violet push itself creates the Artifact CR when missing, so the pre-check was actively blocking a legitimate path. Use o.artifact directly when composing the AV name.
| dlCtx, cancel := context.WithTimeout(ctx, timeout) | ||
| defer cancel() | ||
|
|
||
| req, err := http.NewRequestWithContext(dlCtx, http.MethodGet, rawURL, nil) |
There was a problem hiding this comment.
Critical (security/ssrf): No URL scheme validation. An attacker controlling config.yaml could set packagePrefix to internal URLs (e.g., 169.254.169.254 for AWS metadata) or other dangerous schemes. Add: if parsed.Scheme != "http" && parsed.Scheme != "https" { return fmt.Errorf("unsupported scheme %q (only http/https allowed)", parsed.Scheme) }
| // ExpectedSha256, when non-empty, is the lowercase hex SHA-256 of the | ||
| // downloaded .tgz. The upgrade CLI verifies the digest after download and | ||
| // fails fast on mismatch. Optional; leave empty to skip verification. | ||
| ExpectedSha256 string `yaml:"expectedSha256,omitempty"` |
There was a problem hiding this comment.
Critical (security/integrity): ExpectedSha256 is optional but required for HTTP URLs to prevent on-path attacks. Consider adding validation in validateConfig to require SHA when PackagePrefix starts with http://.
| } | ||
| if p := os.Getenv(EnvVioletRegistryPassword); p != "" { | ||
| args = append(args, flagPassword, p) | ||
| if p2 := os.Getenv(EnvVioletRegistryPassword); p2 != "" { |
There was a problem hiding this comment.
Critical (security/secrets): Password passed as CLI argument (--password) visible via ps auxe / /proc/<pid>/cmdline. The README documents this risk - consider if this is an acceptable tradeoff or if there's a better approach.
| if err != nil { | ||
| return "", nil, fmt.Errorf("mkdir temp: %w", err) | ||
| } | ||
| cleanup := func() { _ = os.RemoveAll(dir) } |
There was a problem hiding this comment.
Warning (quality/cleanup): Cleanup function silently ignores errors with _ = os.RemoveAll(dir). Consider logging at debug level: if err := os.RemoveAll(dir); err != nil { log.Debugw("cleanup failed", "err", err) }
PR #14 的 commit a036487 + 12f63d2 已经把以下 6 个 P2 finding 实际修复, 对应 todo 文件失去价值,直接删除(按"todos/ 是看板不是档案"约定)。 | 删除文件 | 修复 commit | 改动要点 | |---------|-------------|---------| | 001 http-download-no-timeout | a036487 | downloadToTemp 用 context.WithTimeout(ctx, o.timeout) 包裹 | | 004 package-manifest-substring-csv-match| a036487 | strings.Contains → 精确匹配 + surface NestedString type error | | 005 wait-av-no-retry-on-transient-errors| a036487 | 引入 isTransientAPIError + switch 分支:NotFound/Transient 继续轮询 | | 006 download-close-error-swallowed | a036487 | explicit Close + 显式 error check,避免 truncated tgz 绕过 sha256 | | 007 empty-csv-fed-to-subscription | a036487 | InstallArtifactVersion 返回 (av, csv, err) 替代二次 NestedString | | 008 channel-default-inconsistency | a036487 + 12f63d2 | 删 "stable" 静默 fallback;channel required-check 移到 validateConfig | 剩余 pending 的 P2(002 / 003 / 009 / 010 / 011 / 012)继续保留—— 代码尚未落地,归档为 completed 会变成"修报告掩盖未做工作"。
) * chore(operatorhub): rename artifact_versiong.go to artifact_version.go (PR 3/3) 修复 PR 1 之前就遗留的文件名 typo(多了一个尾巴的 g)。 PR 3 原计划要清理的 dead code `createArtifactVersion` 已经在 PR 2 的 commit 12f63d2 (Drop dead createArtifactVersion: violet binary owns the write path now) 一并清掉了——此 PR 收尾把这个一直没人收的文件名 typo 也解决掉,确保仓库里再没有 `versiong` 这个错字。 同步更新引用:docs/brainstorms、docs/plans、todos/004/005/007/010。 其中 `:行号` 是 2026-05-18 plan / PR 2 review 时刻的历史快照,PR 2 重写后行号已不对应,但属于历史性引用,保留原值不动。 Plan: docs/plans/2026-05-18-refactor-externalize-violet-binary-plan.md * chore(todos): drop 6 P2 todos already fixed by PR #14 PR #14 的 commit a036487 + 12f63d2 已经把以下 6 个 P2 finding 实际修复, 对应 todo 文件失去价值,直接删除(按"todos/ 是看板不是档案"约定)。 | 删除文件 | 修复 commit | 改动要点 | |---------|-------------|---------| | 001 http-download-no-timeout | a036487 | downloadToTemp 用 context.WithTimeout(ctx, o.timeout) 包裹 | | 004 package-manifest-substring-csv-match| a036487 | strings.Contains → 精确匹配 + surface NestedString type error | | 005 wait-av-no-retry-on-transient-errors| a036487 | 引入 isTransientAPIError + switch 分支:NotFound/Transient 继续轮询 | | 006 download-close-error-swallowed | a036487 | explicit Close + 显式 error check,避免 truncated tgz 绕过 sha256 | | 007 empty-csv-fed-to-subscription | a036487 | InstallArtifactVersion 返回 (av, csv, err) 替代二次 NestedString | | 008 channel-default-inconsistency | a036487 + 12f63d2 | 删 "stable" 静默 fallback;channel required-check 移到 validateConfig | 剩余 pending 的 P2(002 / 003 / 009 / 010 / 011 / 012)继续保留—— 代码尚未落地,归档为 completed 会变成"修报告掩盖未做工作"。 --------- Co-authored-by: huanyang@alauda.io <huanyang@alauda.io>
Three silent-failure paths fixed (P1), four hardening / cleanup changes applied (P2). All P1 fixes close anti-patterns where preflight reported "clean" while actually bypassing the guard it was built to enforce. P1 — silent-failure closures: - #13/#14 cluster guard: in-cluster mode now requires --confirm-cluster matching violet.clusters; empty kubeconfig CurrentContext is now a hard error with an actionable 'kubectl config use-context' hint. Previously both paths log.Warn'd and returned nil, defeating the guard exactly when it mattered most (CI pods with violet.clusters set, multi-context kubeconfigs). - #15 BundleVersion required for operatorhub: empty value used to slip past validateConfig, producing 'operatorhub-foo.' AV names with a trailing dot — preflight reported clean while upgrade failed late with cryptic 404s. Required at load time. P2 — internal hardening: - #16 Drop the seen map from runPreflight: dead code under fail-fast semantics (architecture-strategist + simplicity-reviewer both flagged it independently). -14 LOC. - #018 checkInstallPlanResidue no longer discards the NestedString error. status.phase type drift now surfaces a wrapped error instead of being swallowed and producing a 'every IP is non-terminal' noise storm. - #019 New preflight.NewResidual constructor centralises kubectl cleanup template + %q shell-quoting. Three call sites in operatorhub/preflight.go go from 5-line struct literals to one-line constructor calls. P3 follow-ups (020-023) recorded as separate todos for later triage. Tests for the cmd-layer changes (#17) land in the next commit.
Closes P2 #17 — the two cmd-layer functions added by PR #17 carried zero coverage (pr-test-analyzer criticality 9). Coverage: - assertClusterMatch (10 sub-tests): violet nil / clusters empty no-ops; in-cluster missing/mismatched/matching --confirm-cluster; file-mode unreadable kubeconfig, empty current-context (pins the #14 fix in place), missing/mismatched/matching --confirm-cluster. - runPreflight (5 sub-tests): clean cluster, empty Versions skipped, baseline-only invariant (regression guard: a refactor scanning all versions would surface mid-upgrade AVs as false positives), fail-fast across paths (second path's PreflightBaseline is verifiably not reached), operator errors propagate unwrapped (must NOT be wrapped as *PreflightError). The fake operator records the Version passed to each PreflightBaseline call so the baseline-only test asserts the actual contract rather than just call count.
* feat(operator): scaffold PreflightBaseline interface
- new pkg/operator/preflight package holds the Residual value type so
both pkg/operator and its concrete impls can import without cycles
- OperatorInterface gains PreflightBaseline(ctx, version)
([]preflight.Residual, error); contract: read-only, baseline-only
- local: no-op returning (nil, nil) — README will document the assumption
that `make deploy` must be idempotent
- operatorhub: scaffolding stub; real check logic lands in follow-up
- plan: docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md
* feat(operatorhub): implement PreflightBaseline residue checks
Three read-only checks against the baseline of an upgrade path:
- Subscription/<name> in <namespace>
- ArtifactVersion/<artifact>.<bundleVersion> in cpaas-system
- non-terminal InstallPlan in <namespace>, filtered by the OLM-managed
label operators.coreos.com/<package>.<ns> so we never scan the
namespace's full historical IP archive
Behavior:
- 30s context timeout caps the whole call so a hung apiserver cannot
block the upgrade for the default o.timeout (10m)
- transient API errors (ServerTimeout / TooManyRequests /
ServiceUnavailable) are wrapped with a 'retry the run' hint
- permanent errors propagate with the check name attached
- kubectl cleanup hints quote both name and namespace via %q so
copy-pasted commands stay shell-safe
CSV residue is intentionally NOT checked: an independent CSV without
a corresponding Subscription or ArtifactVersion is a degenerate state
already covered by the other three. Plan documents how to add
PackageManifest-driven CSV resolution if a future incident requires it.
Tests cover clean cluster, single/multi residue, non-terminal phases
(empty/Planning/RequiresApproval/Installing), terminal phases
(Complete/Failed) ignored, transient + permanent error paths, and a
read-only contract assertion (spy reactor fails the test on any
Create/Update/Patch/Delete).
* feat(config): validate namespace + bundleVersion for operatorhub
Two new load-time checks gated on Type == operatorhub:
1. operatorConfig.namespace must be non-empty. An empty namespace
would let preflight's Subscription / InstallPlan queries silently
degrade to cluster-scope, producing false-positive residue
reports from completely unrelated operators.
2. Version.bundleVersion must match ^[a-zA-Z0-9._-]+$. The field is
interpolated into kubectl cleanup hints emitted by preflight and
into violet argv, so allowing shell metacharacters here would
propagate $, backticks, semicolons or quotes straight into the
user's terminal. Centralising the check at load time closes the
injection vector for every downstream consumer with a single
chokepoint.
Both checks fire only for operatorhub type — local mode runs
`make deploy` and has no equivalent risk surface.
Tests cover: namespace required (operatorhub), namespace optional
(local), shell-active characters rejected, realistic versions
(v4.6.3 / 4.6.3-rc.91 / v0.74.0) accepted.
* feat(cmd): wire preflight + cluster guard into Execute
Three additions to the upgrade entry point, all ahead of the upgrade loop:
1. *PreflightError (cmd-internal) formats []preflight.Residual into a
multi-line copy-pasteable report: one cleanup block per residual,
then a finalizer-unstuck command template, then the bypass hint.
The type stays in cmd/ so pkg/operator never picks up a dependency
on how the CLI talks to the user. Decision C (all-English text) is
the locked default; Error() is the single place to change wording
without touching any contract.
2. assertClusterMatch enforces the operator≡kubectl cluster identity
contract whenever operatorConfig.violet.clusters is set. The
recommended rule (Decision B, locked at default B1) is exact
string equality against KUBECONFIG's CurrentContext via
--confirm-cluster. In-cluster runs (no kubeconfig file) degrade
to a loud warning so CI pods aren't silently blocked.
3. runPreflight iterates UpgradePaths, calls op.PreflightBaseline on
each baseline (Versions[0]), and fails fast at the first dirty
path with a *PreflightError. (Kind, NS, Name) dedup is in place
for future aggregation; --skip-preflight bypasses the whole step
with a Warn-level audit line.
Cobra's SilenceUsage is set in AddFlags so that PreflightError's
actionable kubectl lines are not buried under --help output on a
non-zero exit. Flag-parsing errors still print usage (they execute
before RunE), so users who typo a flag are not stranded.
* docs(preflight): README usage + CLAUDE.md hard constraints
- README: new 'Preflight 前置检查' section documents what gets
scanned (Subscription / ArtifactVersion / non-terminal InstallPlan),
the 30s timeout, a sample error message, the two new flags
(--skip-preflight / --confirm-cluster), and the minimum RBAC verbs
needed to run preflight in isolation.
- CLAUDE.md:
- 数据流: preflight + cluster guard inserted ahead of the upgrade
loop; explicit note that preflight does NOT honour cfg.Immediate
(quality gate, not part of the upgrade chain).
- Operator 抽象 paragraph mentions PreflightBaseline and the leaf
pkg/operator/preflight package created to avoid an import cycle.
- 硬约束: four new entries — PreflightBaseline must stay read-only
and baseline-only (with a note on why CSV is intentionally
excluded and where to add it back if needed); cmd.SilenceUsage
is mandatory; --confirm-cluster matching rule (B1 default);
BundleVersion regex chokepoint.
- Plan acceptance criteria all marked complete.
* docs(plan): mark preflight-check plan as completed
* fix: address P1+P2 review findings for preflight check
Three silent-failure paths fixed (P1), four hardening / cleanup changes
applied (P2). All P1 fixes close anti-patterns where preflight reported
"clean" while actually bypassing the guard it was built to enforce.
P1 — silent-failure closures:
- #13/#14 cluster guard: in-cluster mode now requires --confirm-cluster
matching violet.clusters; empty kubeconfig CurrentContext is now a hard
error with an actionable 'kubectl config use-context' hint. Previously
both paths log.Warn'd and returned nil, defeating the guard exactly when
it mattered most (CI pods with violet.clusters set, multi-context kubeconfigs).
- #15 BundleVersion required for operatorhub: empty value used to slip
past validateConfig, producing 'operatorhub-foo.' AV names with a
trailing dot — preflight reported clean while upgrade failed late with
cryptic 404s. Required at load time.
P2 — internal hardening:
- #16 Drop the seen map from runPreflight: dead code under fail-fast
semantics (architecture-strategist + simplicity-reviewer both flagged
it independently). -14 LOC.
- #018 checkInstallPlanResidue no longer discards the NestedString error.
status.phase type drift now surfaces a wrapped error instead of being
swallowed and producing a 'every IP is non-terminal' noise storm.
- #019 New preflight.NewResidual constructor centralises kubectl cleanup
template + %q shell-quoting. Three call sites in operatorhub/preflight.go
go from 5-line struct literals to one-line constructor calls.
P3 follow-ups (020-023) recorded as separate todos for later triage.
Tests for the cmd-layer changes (#17) land in the next commit.
* test(cmd): unit tests for assertClusterMatch and runPreflight
Closes P2 #17 — the two cmd-layer functions added by PR #17 carried
zero coverage (pr-test-analyzer criticality 9).
Coverage:
- assertClusterMatch (10 sub-tests): violet nil / clusters empty no-ops;
in-cluster missing/mismatched/matching --confirm-cluster; file-mode
unreadable kubeconfig, empty current-context (pins the #14 fix in
place), missing/mismatched/matching --confirm-cluster.
- runPreflight (5 sub-tests): clean cluster, empty Versions skipped,
baseline-only invariant (regression guard: a refactor scanning all
versions would surface mid-upgrade AVs as false positives), fail-fast
across paths (second path's PreflightBaseline is verifiably not
reached), operator errors propagate unwrapped (must NOT be wrapped
as *PreflightError).
The fake operator records the Version passed to each PreflightBaseline
call so the baseline-only test asserts the actual contract rather than
just call count.
* chore(todos): remove completed P1/P2 review findings
The 7 todos (013-019) that drove the silent-failure fixes + cmd-layer
tests are done; their substance lives in the PR description and the
commits that closed them (1201665 + f721f8e). Removing the files
keeps todos/ as a list of OPEN work, which is its only useful state.
Remaining in todos/: 4 P3 follow-ups for this PR (020-023) and 6 P2
items from prior PRs that have not been actioned yet.
* chore(todos): drop P3 review follow-ups for PR #17
The four P3 nice-to-haves (test variants, internal simplifications,
agent-ergonomics knobs, misc cleanups) are not on the roadmap and
filing them as todos would just decorate the backlog with low-value
items nobody plans to pick up. The PR description still records the
reviewer recommendations for archeology; the file copies were
redundant.
---------
Co-authored-by: huanyang@alauda.io <huanyang@alauda.io>
Summary
PR 2 of 3 in the violet externalization refactor. This is the behavior change: `InstallArtifactVersion` now invokes the `violet push` binary instead of building `Artifact`/`ArtifactVersion` CRs in-process. OLM `Subscription` / `InstallPlan` / `CSV` flows are unchanged.
Commits in this PR:
Open Questions outcomes (from plan)
What changed
Testing
Integration validation required before merge
Post-Deploy Monitoring & Validation
Design references