Skip to content

refactor(operatorhub): install ArtifactVersion via violet binary (PR 2/3)#14

Merged
yhuan123 merged 15 commits into
mainfrom
refactor/install-via-violet
May 19, 2026
Merged

refactor(operatorhub): install ArtifactVersion via violet binary (PR 2/3)#14
yhuan123 merged 15 commits into
mainfrom
refactor/install-via-violet

Conversation

@yhuan123

Copy link
Copy Markdown
Collaborator

Summary

PR 2 of 3 in the violet externalization refactor. This is the behavior change: `InstallArtifactVersion` now invokes the `violet push` binary instead of building `Artifact`/`ArtifactVersion` CRs in-process. OLM `Subscription` / `InstallPlan` / `CSV` flows are unchanged.

Commits in this PR:

  1. `refactor: drop hardcoded violet package prefix` — PR 1 review fix. `DefaultVioletPackagePrefix` removed; `Violet.PackagePrefix` is now a required field with no default. Empty value surfaces loudly via `BuildPackageURL`.
  2. `feat(operatorhub): install ArtifactVersion via violet binary` — the actual switch. New `installViaViolet` orchestrates 8 stages (get Artifact → URL → download → sha256 verify → ensure-clean AV → `violet push` → wait AV Present (+ `spec.tag` cross-check) → wait PackageManifest). Old `createArtifactVersion` kept as dead code with `Deprecated:` comment for one release cycle; PR 3 removes it.
  3. `docs: update README and CLAUDE.md for violet onboarding flow` — config example with new `operatorConfig.violet` block + `expectedSha256`, new "violet 依赖与运行环境" section covering binary dependency / env allowlist / credential env vars / shared-runner argv warning. CLAUDE.md updated to reflect "violet owns AV writes" boundary and the new hard constraints.

Open Questions outcomes (from plan)

# Question Outcome
OQ1 Does `violet push` support `--password-stdin` or file-based credentials? No. Verified via `violet push --help` — only `--password string` and `--platform-password string`. Mitigations: `MaskCommand` redacts in logs; README documents the `ps auxe` risk and bans `skipPush: false` on shared CI runners.
OQ2 Does violet name AVs as `.`? Reviewer must verify in integration-test before merging. The code uses the convention; the cross-check on `spec.tag` (step 7) converts a naming mismatch into a loud failure rather than a silent stale-AV match. If the convention does not hold we will follow up by switching wait to a label-selector lookup.
OQ3 Does the Tekton task image include `violet`? Reviewer must verify before merging. Suggested check: `kubectl run --rm -it test --image= --command -- which violet`. If absent, either rebuild the base image or inject violet in the task script before running upgrade.

What changed

File Change
`pkg/config/config.go` Remove `DefaultVioletPackagePrefix` const; `packagePrefix` is now required. `Violet.SkipPush` default-to-true logic preserved.
`pkg/operator/operatorhub/operator.go` Add `Operator.violet` field populated from `OperatorConfig.Violet`. `UpgradeOperator` now passes the full `config.Version` into `InstallArtifactVersion`.
`pkg/operator/operatorhub/artifact_versiong.go` `InstallArtifactVersion` signature changes to `(ctx, version config.Version)`; body delegates to `installViaViolet` and returns a clear error when `Operator.violet` is nil. `createArtifactVersion` marked `Deprecated:` and kept as dead code.
`pkg/operator/operatorhub/violet.go` Add `installViaViolet`, `downloadToTemp`, `deleteArtifactVersionIfExists`, `execVioletPush`, `validateVioletBin`, and `violetEnvAllowlist`. Imports expand to cover `net/http`, `path/filepath`, k8s machinery, and `pkg/exec`.
`README.md` Extend config example with `operatorConfig.violet` and `expectedSha256`; add §violet 依赖与运行环境 covering binary lookup, env allowlist, credential env vars, shared-runner warning.
`CLAUDE.md` Architecture section updated to reflect "violet owns AV writes / Go retains Get + Delete + waits"; three new hard constraints added (violet owns AV writes, registry creds via env not config, packagePrefix is required).

Testing

  • `go vet ./...` — clean
  • `go build ./...` — clean
  • `go test ./pkg/operator/operatorhub/...` — 22 sub-tests pass (PR 1 table tests still green; `installViaViolet` itself is integration-level and intentionally untested in unit scope per project convention)
  • Manual `violet push --help` check confirms OQ1 outcome

Integration validation required before merge

  • Run a real upgrade in `integration-test` cluster (e.g. tektoncd-operator v4.5 → v4.6) and confirm:
    • violet creates an `ArtifactVersion` named `.` (OQ2)
    • `spec.tag` cross-check does not fire
    • `/proc//environ` does not expose unrelated CI secrets (env allowlist works)
  • Confirm violet is available in the Tekton task image (OQ3)
  • Reproduce an AV-residue scenario by pre-creating an AV with the expected name and confirm `deleteArtifactVersionIfExists` removes it before `violet push`

Post-Deploy Monitoring & Validation

  • What to monitor
    • Logs: upgrade CLI stderr; in particular look for stage-labelled errors (`violet push:`, `wait artifact version present:`, `ensure clean AV:`)
    • Metrics: AV creation latency in cpaas-system, OLM PackageManifest reconcile time
  • Validation checks
    • `kubectl -n cpaas-system get artifactversion -w` during an upgrade run
    • `kubectl -n cpaas-system describe artifactversion .` after upgrade — `spec.tag` and the bundleVersion must match
    • `tail -f /var/log/upgrade-*.log | grep -E '\[violet|wait artifact|ensure clean\]'`
  • Expected healthy signal
    • One INFO line `invoking violet` per version, followed within ~30s by `waiting for ArtifactVersion to reach Present`, then `waiting for package manifest`, ending in `artifact version installed via violet`
  • Failure signal / rollback trigger
    • Repeated `violet push:` errors → `git revert` PR 2 (returns to in-process path; `createArtifactVersion` is preserved exactly for this)
    • `spec.tag mismatch` errors → indicates OQ2 violation; rollback and pursue label-selector fix in a follow-up
  • Validation window & owner
    • Window: 1 week of upgrade runs in integration-test before merging PR 3
    • Owner: whoever merges PR 2

Design references

huanyang@alauda.io added 3 commits May 18, 2026 19:40
The MinIO root for violet packages varies across environments (private vs
shared, regional mirrors), so the CLI should not pretend any value is "the
default". Make Violet.PackagePrefix a required field; empty value surfaces
as a clear "packagePrefix is empty" error at BuildPackageURL time (already
covered in PR 1 tests).
Swaps the InstallArtifactVersion entry point from the in-process dynamic
client write path to a child-process invocation of the `violet push` binary.
Artifact / ArtifactVersion CR creation is now owned by violet; OLM Subscription,
InstallPlan and CSV flows remain in Go and are unchanged.

The new flow lives in installViaViolet (pkg/operator/operatorhub/violet.go)
and runs eight stages, each with a labelled error prefix so failures are easy
to locate:

  1. get Artifact CR (must already exist, managed out of band)
  2. build the MinIO download URL via BuildPackageURL
  3. download .tgz to a per-call temp directory (defer cleanup)
  4. verify Version.ExpectedSha256 when non-empty (no-op otherwise)
  5. ensure clean ArtifactVersion: delete same-name residue + poll NotFound,
     so waitArtifactVersionPresent cannot match a stale object from a prior
     failed upgrade and report false success
  6. exec `violet push <tgz>` with a minimal-permission child env
     (KUBECONFIG / PATH / HOME / USER / VIOLET_*); credentials come from
     VIOLET_REGISTRY_USERNAME / _PASSWORD via BuildVioletPushArgs; the rendered
     command is mask-logged before exec to keep --password values out of logs
  7. wait ArtifactVersion phase=Present (reuses existing wait helper) and
     cross-check that spec.tag equals the requested BundleVersion — if violet
     ever changes its naming convention this surfaces loudly rather than
     silently accepting an unrelated AV
  8. wait PackageManifest exposes the CSV (reuses existing wait helper)

Supporting changes:

- Operator gains a `violet *config.VioletConfig` field, populated by
  NewOperator from OperatorConfig.Violet. InstallArtifactVersion returns
  a clear config error when this field is nil.
- UpgradeOperator now passes the whole config.Version (not just BundleVersion)
  into InstallArtifactVersion — channel and ExpectedSha256 are needed for the
  violet flow.
- validateVioletBin requires Violet.Bin (when set) to be an absolute path to
  an executable regular file, guarding against accidental execution of a
  user-supplied directory or non-executable path. Empty Bin falls back to
  looking up `violet` in $PATH.
- createArtifactVersion (the in-process write path) is preserved as dead
  code with a Deprecated comment; PR 3 removes it.

OQ1 outcome: the local violet binary does not support --password-stdin or
file-based credential reads, so argv passing is the only available path.
Mitigations are README documentation (shared-runner warning) and MaskCommand
log redaction; OS-level inspection (`ps auxe`) of the credential remains an
accepted risk for the first release. Issue tracked in plan §Security.

OQ2 outcome: deferred to integration validation. The cross-check on
spec.tag (step 7) converts a silent failure into a loud one if violet uses
a different AV naming convention.

OQ3 outcome: deferred to integration validation. Tekton task image must
ship violet; verify before merging the PR.

Plan: docs/plans/2026-05-18-refactor-externalize-violet-binary-plan.md
- README §编写配置文件: extend the upgrade.yaml example with operatorConfig.violet
  (required block) and Version.expectedSha256 (optional). Adds a "配置说明"
  callout that names violet as required, declares packagePrefix has no default,
  documents the URL convention, and flags expectedSha256 as a defense against
  HTTP-MITM-induced supply-chain attacks.
- README new section §violet 依赖与运行环境: covers binary dependency
  (PATH or absolute path with executable-bit check), the env allowlist
  (KUBECONFIG / PATH / HOME / USER / VIOLET_*), registry credential injection
  via VIOLET_REGISTRY_USERNAME / _PASSWORD with shared-runner argv warning,
  and KUBECONFIG passthrough for k8s auth.
- CLAUDE.md §架构: clarify Artifact / ArtifactVersion writes are owned by
  violet; Go retains only Get + Delete + waits. References installViaViolet
  and targetCatalogSource constant.
- CLAUDE.md §硬约束: add three rules — violet owns AV writes, registry creds
  must go through env vars not config, packagePrefix is required (no default).
@alaudabot

alaudabot commented May 18, 2026

Copy link
Copy Markdown
Contributor

🤖 AI Code Review

Property Value
Model opencode/minimax-m2.5-free
Style strict
Issues Found 3
Config Source centralized
Profile ❌ Not Found
Personalized Prompt ❌ No
Prompt Path .github/review/profiles/alaudadevops/upgrade-test/pr-review.md
Alauda Skills ✅ base-acp-operator-list, base-acp-operator-release, base-authoring, base-m365, base-ocp-operator-list, base-skill-setup, builders-alauda-component-e2e-release, builders-alauda-component-upgrade, builders-alauda-pipeline, builders-claudetask-submit, builders-component-knowledge, builders-confluence, builders-dev-mesh-qa, builders-edge-ci-trace, builders-gitlab-ops, builders-helm-operator-generator, builders-install-cluster-plugin, builders-jira, builders-notify-wecom, builders-olm-operator-lifecycle, builders-prd-to-testcase, builders-publish-errata, builders-roadmap-studio, builders-story-split, builders-violet, builders-webapp-testing, cross-repo-add-mirror, cross-repo-publish, devops-add-bug-release-notes, devops-autodns, devops-bundle-csv-baseline-diff, devops-candidate-version-supervisor, devops-connectors-acceptance-test, devops-connectors-explore, devops-connectors-poc-case, devops-connectors-review, devops-connectors-unit-test, devops-connectors-upgrade-test, devops-connectors-write-user-docs, devops-creating-tekton-pipelines, devops-fix-go-vulns, devops-fork-alauda-binary-release, devops-gen-advanced-form-descriptors, devops-jira-rfd-acceptance, devops-knowledge-adoption, devops-pr-review, devops-refresh-containerfile-digests, devops-refresh-containerfile-tags, devops-replace-strings, devops-scan-docker-keywords, devops-sync-alauda-github-releases, devops-tekton-dynamic-form-optimizer, devops-tekton-operator-task-e2e, devops-tekton-pipeline-delivery, devops-tekton-refresh-results-tag, devops-tekton-task-delivery, devops-tekton-task-overview-template, devops-tekton-task-version-upgrade, devops-tekton-upgrade-notes, devops-tool-report-troubleshoot, devops-ui-e2e-code-audit, devops-ui-e2e-fix-base-on-report, devops-ui-e2e-regression-and-fix, devops-ui-generate-e2e-from-feature, devops-ui-pre-setup, devops-upgrade-go, devops-upstream-backport-cve, devops-upstream-upgrade
Reviewed at 2026-05-19 06:02:09 UTC

Summary

This PR (PR 2/3) refactors the operatorhub flow to install ArtifactVersion via the external violet push binary instead of building CRs in-process. The implementation is well-structured with good error handling for most cases (transient API errors, download close errors, CSV extraction). However, there are several critical security issues that remain unaddressed from previous reviews.

Review Statistics

Category Count
Critical Issues 3
Warnings 1
Suggestions 1
Files Reviewed 26

Critical Issues

  • [pkg/operator/operatorhub/violet.go:389] (security/ssrf): URL scheme not validated. downloadFile uses http.NewRequestWithContext without checking the URL scheme. An attacker controlling config.yaml could set packagePrefix: file:///etc/passwd or internal network URLs. Add scheme validation: if parsed.Scheme != "http" && parsed.Scheme != "https" { return fmt.Errorf("unsupported scheme %q", parsed.Scheme) }.

  • [pkg/config/config.go:166] (security/integrity): ExpectedSha256 not required for HTTP. Version.ExpectedSha256 is optional. For HTTP URLs (non-HTTPS), this allows on-path attackers to inject malicious .tgz packages. Add validation in validateConfig or LoadConfig to require SHA for http:// prefixes.

  • [pkg/operator/operatorhub/violet.go:169-171] (security/secrets): Passwords passed as CLI arguments. VIOLET_REGISTRY_PASSWORD is passed as --password CLI argument to violet subprocess. While RedactSecrets scrubs stdout/stderr, passwords are visible via ps auxe or /proc/<pid>/cmdline. The README documents this risk, but the code should ideally use --password-stdin if violet supports it (OQ1 concluded it doesn't). Consider documenting as a known limitation.

Warnings

  • [pkg/operator/operatorhub/violet.go:356] (quality/cleanup): Cleanup silently ignores errors. cleanup := func() { _ = os.RemoveAll(dir) } silently ignores cleanup failures. Consider logging at debug level: if err := os.RemoveAll(dir); err != nil { log.Debugw("cleanup failed", "err", err) }.

Suggestions

  • [pkg/operator/operatorhub/violet.go:243] (refactor/narrow-interface): installViaViolet receives full config.Version struct. The function receives config.Version (~10 fields) but only uses Channel, BundleVersion, ExpectedSha256, and EffectivePackageChannel(). Consider extracting a narrow interface or struct for better clarity.

Positive Feedback

  1. Good transient error handling: waitArtifactVersionPresent and waitPackageManifest correctly handle transient API errors (ServerTimeout, TooManyRequests, ServiceUnavailable) with retry logic.

  2. Download close error properly checked: The explicit f.Close() with error check at line 413 properly catches disk full/NFS reset scenarios.

  3. CSV extraction improved: The new signature returns CSV directly, avoiding the double-extraction issue that could silently pass empty csv to InstallSubscription.

  4. Environment allowlist explicit: The violetEnvAllowlist is an explicit enumeration, not a prefix glob, which is correct.

  5. Good test coverage: The new tests cover download, validateVioletBin, and deleteArtifactVersionIfExists with various edge cases.


Note: This review focused on security and reliability issues. The PR has good overall structure and addresses many of the concerns raised in PR 1. The critical URL scheme and SHA validation issues should be addressed before merge.


ℹ️ About this review

This review was automatically generated using the run-actions workflow.

  • Shared prompt: .github/prompts/code-review.md
  • Config source: centralized
  • Profile path: Not Found
  • Profile ref: e75e733e9aa1b417a8b3c6441e53495dbcb418ad
  • No repository-specific prompt configured
  • Alauda skills: base-acp-operator-list, base-acp-operator-release, base-authoring, base-m365, base-ocp-operator-list, base-skill-setup, builders-alauda-component-e2e-release, builders-alauda-component-upgrade, builders-alauda-pipeline, builders-claudetask-submit, builders-component-knowledge, builders-confluence, builders-dev-mesh-qa, builders-edge-ci-trace, builders-gitlab-ops, builders-helm-operator-generator, builders-install-cluster-plugin, builders-jira, builders-notify-wecom, builders-olm-operator-lifecycle, builders-prd-to-testcase, builders-publish-errata, builders-roadmap-studio, builders-story-split, builders-violet, builders-webapp-testing, cross-repo-add-mirror, cross-repo-publish, devops-add-bug-release-notes, devops-autodns, devops-bundle-csv-baseline-diff, devops-candidate-version-supervisor, devops-connectors-acceptance-test, devops-connectors-explore, devops-connectors-poc-case, devops-connectors-review, devops-connectors-unit-test, devops-connectors-upgrade-test, devops-connectors-write-user-docs, devops-creating-tekton-pipelines, devops-fix-go-vulns, devops-fork-alauda-binary-release, devops-gen-advanced-form-descriptors, devops-jira-rfd-acceptance, devops-knowledge-adoption, devops-pr-review, devops-refresh-containerfile-digests, devops-refresh-containerfile-tags, devops-replace-strings, devops-scan-docker-keywords, devops-sync-alauda-github-releases, devops-tekton-dynamic-form-optimizer, devops-tekton-operator-task-e2e, devops-tekton-pipeline-delivery, devops-tekton-refresh-results-tag, devops-tekton-task-delivery, devops-tekton-task-overview-template, devops-tekton-task-version-upgrade, devops-tekton-upgrade-notes, devops-tool-report-troubleshoot, devops-ui-e2e-code-audit, devops-ui-e2e-fix-base-on-report, devops-ui-e2e-regression-and-fix, devops-ui-generate-e2e-from-feature, devops-ui-pre-setup, devops-upgrade-go, devops-upstream-backport-cve, devops-upstream-upgrade

huanyang@alauda.io added 3 commits May 18, 2026 20:02
Add unit tests for the new code introduced in this PR. installViaViolet and
execVioletPush remain integration-level (require a real cluster and a stub
binary respectively), but everything below them is testable in process and
now covered:

pkg/exec (PR 1 additions that previously had no test):
- TestMatchAllowlist: exact / prefix-glob / mixed / empty allowlist
- TestFilterHostEnv_*: empty allowlist preserves full os.Environ() passthrough,
  non-empty allowlist filters, prefix glob matches by prefix not substring
- TestLastLines: trailing-newline trimming, fewer-than-N lines, only-newlines
  collapses to empty
- TestWrapWithStderrTail: empty stderr passes through unchanged; non-empty
  stderr is appended and errors.Is still identifies the base error
- TestRunCommand_*: end-to-end verification that EnvAllowlist actually filters
  what the child sees (uses /bin/sh -c 'env'), empty allowlist preserves
  legacy behavior, and a failing child wraps the stderr tail into the error

pkg/operator/operatorhub:
- TestValidateVioletBin: relative path / missing / directory / non-executable
  rejections plus the absolute-executable accept path
- TestDownloadToTemp_*: 200 OK + filename extracted from URL path; 404 and 500
  surface as labelled errors; URL with only "/" falls back to "package.tgz";
  context-cancel propagates from net/http (verifies the request honours ctx)
- TestDeleteArtifactVersionIfExists_*: backed by k8s client-go dynamic fake.
  Covers no-residue noop, residue removed + poll-until-NotFound, and the
  race where the AV vanishes between our Get and Delete (must not propagate
  as an error)

Production fix surfaced by the tests:
- downloadToTemp used filepath.Base(rawURL) to pick the local filename, which
  is wrong: filepath.Base does not understand URL structure and on inputs like
  "http://host:port/" returns "host:port" rather than "/". Switched to
  url.Parse(rawURL) + path.Base(parsed.Path), with the same fallback to
  "package.tgz" for empty/root paths. Tests now lock this behaviour in.

go.mod / go.sum: k8s.io/client-go/dynamic/fake transitively pulls a few
additional indirect deps (kr/pretty, go-difflib, etc.) — picked up by
`go mod tidy`. No first-party deps added.
- bound .tgz download with an explicit timeout derived from o.timeout so
  a stalled MinIO does not pin the upgrade to OS-level TCP keepalive
- explicit close-and-check on the downloaded file so a flush error (disk
  full, NFS reset) does not silently produce a truncated .tgz that
  bypasses VerifySha256 when ExpectedSha256 is unset
- retry transient apiserver errors (timeout / throttle / unavailable)
  while polling ArtifactVersion and PackageManifest; treat NotFound as
  keep-polling to absorb the post-violet-push eventual-consistency window
- exact-match CSV in PackageManifest entries (was strings.Contains, which
  matched v0.65.0 against v0.65.0-rc1 and friends) and surface NestedString
  type errors instead of silently dropping them
- return csv from InstallArtifactVersion so UpgradeOperator does not
  silently feed an empty status.version into InstallSubscription via a
  second unchecked NestedString extraction
- require version.channel up front in UpgradeOperator (was a silent
  "stable" fallback that contradicted BuildPackageURL's strict policy and
  could route Subscription through the wrong OLM channel on a typo)

Adds TestDownloadToTemp_TimeoutEnforcedWithoutParentDeadline as a
regression guard for the download timeout path.
17 findings from a 6-agent parallel review (security / silent-failure /
simplicity / architecture / performance / type-design) consolidated into
the todos/ directory:

- 001-012 are P2 — should fix before / with PR 3
- 013-017 are P3 — cleanup follow-ups for the PR 3 cleanup commit

001 + 004-008 are already addressed by the previous commit (a036487);
the corresponding files are kept in pending status so triage can
flip them to ready / complete when the implementation is reviewed.
@alaudabot

Copy link
Copy Markdown
Contributor

Review Summary

PR 2 of 3: The violet externalization refactor is well-executed. The code correctly delegates ArtifactVersion write operations to violet push, with the Go side retaining only Get, Delete, and OLM resource management. Key improvements include proper transient error handling, explicit file close-and-check, exact CSV name matching (vs substring), and consistent channel validation.

Critical Issues: 0
Warnings: 0
Suggestions: 2 minor items (dead validation branch, env allowlist awareness — both non-blocking)

The 17 todos/ files document important follow-up work for PR 3; none block this merge. Integration validation (OQ2/OQ3) should be confirmed before PR 3 ships.

See pr-overview.md for the full review.

… channel

Real-cluster integration testing on 40-devops surfaced a design defect: the
single Version.Channel field was used both as the OLM Subscription channel
and as the MinIO URL path segment, but the two are independent on platform
packages. On the live cluster:

- tektoncd-operator OLM channels are: latest, pipelines-4.0, pipelines-4.2,
  pipelines-4.6, stable
- MinIO URL segments are: v4.0, v4.2, v4.6, rc

A user filling `channel: stable` (which is correct for Subscription) made
BuildPackageURL produce ".../tektoncd-operator/stable/...tgz" which is 404,
and the violet flow could never download anything.

Fix:
- Add Version.PackageChannel for the MinIO segment.
- Channel keeps its OLM Subscription meaning (unchanged for callers).
- Version.EffectivePackageChannel() returns PackageChannel when set,
  falling back to Channel for repositories where they coincide.
- installViaViolet now calls BuildPackageURL with EffectivePackageChannel()
  instead of Channel.
- README documents the distinction explicitly and gives both fields in the
  example config.
- New table test TestVersionEffectivePackageChannel covers precedence,
  fallback, both-empty, and divergent-segments cases.

Verified by HEAD-checking the MinIO endpoint:
  http://.../tektoncd-operator/stable/...v4.0.17.tgz  -> 404
  http://.../tektoncd-operator/v4.0/...v4.0.17.tgz    -> 200
return "", nil, fmt.Errorf("build request: %w", err)
}

resp, err := http.DefaultClient.Do(req)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical (security/ssrf): Neither BuildPackageURL nor downloadToTemp validates the URL scheme. An attacker who controls config.yaml (multi-tenant pipeline, supply-chain injection) can set packagePrefix: http://169.254.169.254/latest/meta-data/ to exfiltrate cloud IMDS tokens — Go follows up to 10 redirects by default. Add scheme allowlist (http/https only) before the HTTP request:

parsed, err := url.Parse(rawURL)
if err != nil { return "", nil, err }
if parsed.Scheme != "http" && parsed.Scheme != "https" {
    return "", nil, fmt.Errorf("unsupported scheme %q (only http/https allowed)", parsed.Scheme)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: This issue remains unaddressed in the current PR. The URL scheme is still not validated - accepts any scheme. This should be addressed before merging.

Args: args,
EnvAllowlist: violetEnvAllowlist,
})
return result.Err

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical (security/secrets): VIOLET_REGISTRY_PASSWORD is passed as --password argv to the violet subprocess. MaskCommand only redacts it from the parent log, but violet's stdout/stderr stream directly to the console via io.MultiWriter in pkg/exec/exec.go, and wrapWithStderrTail injects raw stderr into the error chain. On a shared CI runner, ps auxe or /proc/<violet-pid>/cmdline can read the password. Scrub the password from result.Err.Error() and the captured Stderr buffer after exec.RunCommand returns:

if pw := os.Getenv(EnvVioletRegistryPassword); pw != "" && result.Err != nil {
    result.Err = fmt.Errorf("%v", strings.ReplaceAll(result.Err.Error(), pw, "***"))
    result.Stderr = strings.ReplaceAll(result.Stderr, pw, "***")
}
return result.Err

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: This issue remains unaddressed. The passwords are still passed as CLI arguments and can leak via violet's stdout/stderr (printed via exec.RunCommand's MultiWriter to os.Stderr). Should be addressed before merging.

Comment thread pkg/operator/operatorhub/violet.go Outdated

tgzPath, cleanup, err := downloadToTemp(ctx, url, o.timeout)
if err != nil {
return nil, "", fmt.Errorf("download %s: %w", url, err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical (security/integrity): ExpectedSha256 is optional. HTTP downloads with no SHA verification allow an on-path attacker to inject a malicious .tgz, which violet promotes as ArtifactVersion and OLM installs in cpaas-system (cluster-admin). The README documents this risk but code does not enforce it. Implement Config.Validate() (see todo #10) to reject configs where packagePrefix starts with http:// and any expectedSha256 is empty.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: This issue remains unaddressed. is still optional in config.go. For http/https URLs, this should be required to prevent MITM attacks.

Comment thread pkg/operator/operatorhub/artifact_versiong.go
Comment thread pkg/operator/operatorhub/violet.go
// essentials; VIOLET_* covers VIOLET_REGISTRY_USERNAME / _PASSWORD and any
// future violet-specific overrides without leaking unrelated CI secrets such
// as GITHUB_TOKEN or AWS_*.
var violetEnvAllowlist = []string{

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning (security/env-allowlist): The VIOLET_* prefix glob is overly broad — any env var starting with VIOLET_ leaks into the violet subprocess, including hypothetical future vars like VIOLET_GITHUB_TOKEN. Narrow to the two known required vars:

var violetEnvAllowlist = []string{
    "KUBECONFIG", "PATH", "HOME", "USER",
    EnvVioletRegistryUsername,
    EnvVioletRegistryPassword,
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: This issue remains unaddressed. The allowlist still uses glob pattern at violet.go:359. Consider using explicit list: VIOLET_REGISTRY_USERNAME, VIOLET_REGISTRY_PASSWORD, VIOLET_PLATFORM_USERNAME, VIOLET_PLATFORM_PASSWORD.

if err != nil {
return "", nil, fmt.Errorf("mkdir temp: %w", err)
}
cleanup := func() { _ = os.RemoveAll(dir) }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion (observability/cleanup-warning): cleanup := func() { _ = os.RemoveAll(dir) } silently ignores cleanup failures. When CI runners run low on /tmp, orphaned dirs accumulate without trace. Add a warn log:

cleanup := func() {
    if err := os.RemoveAll(dir); err != nil {
        log.Warnw("failed to remove temp dir", "dir", dir, "err", err)
    }
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: This suggestion remains unaddressed. The cleanup function at violet.go:266 still silently ignores errors. Consider logging at debug level.

// Returns (av, csv, err) so callers do not have to re-read status.version
// from the unstructured object — a second extraction site was prone to
// silently feeding an empty csv to InstallSubscription.
func (o *Operator) installViaViolet(ctx context.Context, version config.Version) (*unstructured.Unstructured, string, error) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion (refactor/narrow-interface): installViaViolet receives config.Version (~10 fields) but uses only 3 (Channel, BundleVersion, ExpectedSha256). Extracting a narrow bundleParams struct would document the actual dependency and prevent future drift as new fields are added to Version.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: This suggestion remains unaddressed. still receives the full struct (~10 fields) but only uses 3. Consider extracting a narrow struct for better interface clarity.

…via env

Real-cluster integration testing on 40-devops surfaced four missing pieces
that PR 2 had punted into Violet.PushArgs as a workaround. This commit
promotes each one to a first-class part of the schema (or the env-based
credential protocol) so production configs no longer have to thread
sensitive flags through an opaque string list.

(1) Violet.PlatformAddress — required ACP URL the violet binary
    authenticates against. Lives in config (not a credential). Empty value
    surfaces at violet exec time, not silently.

(2) Violet.Clusters — target subcluster(s) for the Artifact/AV write.
    violet defaults to "global", but multi-cluster ACP deployments expose a
    per-workload subcluster (kubectl is connected through
    /kubernetes/<subcluster>) and the two must match. Without it, violet
    reports "updated successfully" while the CRs never appear in the
    cluster the upgrade CLI is watching — a silent assumed-success that
    only surfaces minutes later as a wait timeout.

(3) Violet.Force *bool, defaulting to true — without --force, violet aborts
    AV upserts with "already exist, skip it" and creates nothing. Verified
    on tektoncd-operator v4.0.17: same input that succeeded with --force
    silently no-op'd without it. The pointer keeps the door open for users
    who genuinely want to preserve a stale AV, but the default matches
    every real-world flow we have.

(4) VIOLET_PLATFORM_USERNAME / VIOLET_PLATFORM_PASSWORD environment
    variables — auto-injected as --platform-username / --platform-password
    so platform credentials follow the same env-only contract as registry
    credentials. MaskCommand now redacts both --password and
    --platform-password (via a sensitivePasswordFlags map). OS-level argv
    inspection remains the documented residual risk.

Mechanical changes:
- BuildVioletPushArgs refactored to take a VioletPushParams struct rather
  than a growing positional argument list. Keeps violet.go decoupled from
  the config package (caller dereferences the *bool defaults before
  populating the struct).
- execVioletPush populates VioletPushParams from o.violet, applying the
  SkipPush=true / Force=true defaults at the boundary via a tiny
  derefBoolDefault helper.
- defaultConfig fills Violet.Force=true when nil, mirroring the existing
  SkipPush=true default.
- TestBuildVioletPushArgs rewritten as a 10-case table covering every new
  knob, both creds env-var pairs, force on/off, push-args ordering, and
  the full real-cluster shape. TestMaskCommand grows two cases covering
  --platform-password redaction alone and the combined registry+platform
  password mask.

README and CLAUDE.md document the new fields, the dual credential env
protocol, and the "Force defaults true" / "Clusters required on multi-
cluster ACP" hard constraints.
@yhuan123

Copy link
Copy Markdown
Collaborator Author

Integration test report — end-to-end pass ✅

Real-cluster integration test of the full violet onboarding flow completed successfully against a multi-cluster ACP setup. All 8 stages of installViaViolet plus the downstream InstallSubscription flow ran without intervention; the upgrade path finished in 56 seconds end to end (download → exec violet → wait AV Present → wait PackageManifest → install Subscription → approve InstallPlan → wait CSV Succeeded → run testCommand).

Final cluster state verified

Resource Field Value
ArtifactVersion metadata.labels[cpaas.io/managed-by] violet
ArtifactVersion spec.tag matches BundleVersion (cross-check passed)
ArtifactVersion status.phase Present
ClusterServiceVersion status.phase Succeeded
Subscription spec.channel matches Version.Channel

Open Questions outcomes

  • OQ2 (AV naming convention)verified ✓. violet creates the ArtifactVersion as <artifact>.<bundleVersion>, matching what waitArtifactVersionPresent and the spec.tag cross-check both expect. The defensive cross-check did not fire.
  • OQ3 (violet present in execution environment) → verified on the local developer setup; CI image presence still needs the documented kubectl run --image=<task-image> -- which violet probe before merge.
  • OQ1 (--password-stdin support in violet) → confirmed unavailable; the credentials-via-env protocol with argv injection and MaskCommand log redaction is the documented mitigation.

Findings folded into this PR

Four real-cluster needs that PR 2 had originally punted into Violet.PushArgs are now first-class:

  1. Violet.PlatformAddress (config, non-credential).
  2. Violet.Clusters (config; multi-cluster ACP needs this — leaving it blank silently writes to the wrong subcluster).
  3. Violet.Force *bool defaulting to true (without it, violet aborts AV upserts with already exist, skip it and creates nothing).
  4. VIOLET_PLATFORM_USERNAME / VIOLET_PLATFORM_PASSWORD env vars, auto-injected as --platform-username / --platform-password. MaskCommand redacts both --password and --platform-password via a sensitivePasswordFlags map.

BuildVioletPushArgs was refactored to accept a VioletPushParams struct so future violet flags do not keep growing the positional argument list. See commit 554ff6e.

Earlier in-PR fixes (already on the branch)

  • fix(config): separate MinIO URL channel segment from OLM Subscription channel — the brainstorm assumed they were the same; in practice the OLM channel and the MinIO path segment diverge for several first-party operators, so Version.PackageChannel was added with a fallback to Version.Channel when they coincide.
  • test: cover PR 2 helpers; fix URL base-name extraction bug found by testdownloadToTemp was using filepath.Base(rawURL) which can return the URL host on some inputs; now uses url.Parse + path.Base(parsed.Path).

Test coverage

  • go vet ./... clean.
  • go test ./pkg/... passing across all packages.
  • TestBuildVioletPushArgs rewritten as a 10-case table covering every new schema knob plus both env-credential pairs.
  • TestMaskCommand extended to cover --platform-password alone and the combined --password + --platform-password case.

Pre-merge checklist

  • Verify violet is present in the Tekton task image used by the upgrade-test pipeline (5-second probe with kubectl run).
  • Make sure pipeline-side VIOLET_PLATFORM_USERNAME / _PASSWORD are wired through to the upgrade CLI environment (not the config file).

@yhuan123 yhuan123 requested a review from kycheng May 18, 2026 14:59
huanyang@alauda.io added 3 commits May 18, 2026 23:02
The five P3 items captured nice-to-have polish (silent error cleanup,
defense-in-depth security, organizational tweaks, field-format validation,
env-allowlist threat-model doc). None of them are blocking and none have
been scheduled, so retiring them keeps todos/ focused on the P2 list that
will actually drive a follow-up cycle.
Six findings from the post-integration re-review, applied as one commit so
the result reads coherently — three of them tighten the credential-handling
contract and three of them strip complexity that the previous follow-up had
left behind.

Security
--------

(S1) Violet.PushArgs now rejects credential flags. --username, --password,
     --platform-username, --platform-password (in both "--flag value" and
     "--flag=value" form) cause BuildVioletPushArgs to return an error
     before any process is started. The escape hatch stays usable for
     --plain / --image-pull-secret / --dest-repo and similar, but it can
     no longer be used to smuggle a real secret into config.yaml and
     bypass the env-only credential contract.

(S2) Replace real credentials in test fixtures and README example with
     obviously-fake placeholders ("admin@example.invalid",
     "redacted-password"). Anything that ships in git should never look
     like a working credential, even when it isn't one anymore.

(S3) Add Command.RedactSecrets in pkg/exec. When non-empty, every Write
     to stdout/stderr (console + capture buffer + the stderr-tail wrapped
     into Err) is scrubbed via bytes.ReplaceAll before forwarding. The
     violet caller passes the real credential strings from
     VIOLET_REGISTRY_PASSWORD / VIOLET_PLATFORM_PASSWORD, so even if
     violet itself echoes its argv in verbose mode the CI log no longer
     carries the secret verbatim. MaskCommand was already covering OUR
     log line; this covers the child's output.

Simplicity
----------

(P1) Remove Violet.Force. The CLI always passes --force to violet — the
     "preserve a stale AV" use case was hypothetical and conflicts with
     deleteArtifactVersionIfExists already running first. One field, one
     *bool dance, one default branch, and two test cases gone.

(P2) Inline the two single-use helpers: derefBoolDefault becomes three
     lines at the SkipPush call site, sensitivePasswordFlags becomes an
     isPasswordFlag function with a one-line || expression. Both helpers
     were strictly local indirection.

(P3) EnvAllowlist drops glob support. matchAllowlist used to interpret a
     trailing "*" as a prefix match; the only caller (violetEnvAllowlist)
     used that for "VIOLET_*", but it also masks future unaudited
     VIOLET_FOO env vars from sneaking through. Replaced with an exact
     enumeration of the four VIOLET_REGISTRY/PLATFORM_* names. The
     filterHostEnv loop now does a plain string equality check, which
     was always the intent.

Mechanical follow-through
-------------------------

- BuildVioletPushArgs signature is now `(VioletPushParams) ([]string, error)`.
- VioletPushParams.Force removed.
- pkg/operator/operatorhub/violet.go drops the VIOLET_* glob comment and
  enumerates the four env names; the test using a glob is removed and a
  new TestRunCommand_RedactSecrets* trio covers stdout/stderr/Err scrubbing
  and the no-op fast path.
- TestBuildVioletPushArgs rewritten as 8 happy-path cases (no Force knob)
  plus TestBuildVioletPushArgs_RejectsCredentialFlagsInPushArgs (7 cases)
  and TestBuildVioletPushArgs_AllowsNonCredentialFlags (1 sanity check).
- README documents that credentials in PushArgs are rejected; CLAUDE.md
  adds the "PushArgs rejects credential flags" and "CLI always sends
  --force" hard constraints, replacing the prior Force section.
Previously every UpgradePath step deleted the Subscription and prior CSV,
then recreated the Subscription from scratch. That was effectively a fresh
install on each iteration and did not exercise the OLM replace chain, so
data-compatibility tests could miss upgrade regressions.

Now InstallSubscription only creates the Subscription on first use; from
the second version onward it keeps the existing Subscription and CSV in
place, patches spec.channel when it differs, and bumps a refresh
annotation to force the OLM controller to re-reconcile (mainly to defeat
PackageManifest cache staleness right after a violet push). It then waits
for an InstallPlan whose spec.clusterServiceVersionNames contains the
target CSV (the old "any installplan name" check returned the previously
approved plan on retry and hung waitCSVReady), approves it idempotently,
and waits for the new CSV phase=Succeeded.

Also removes the now-unused deleteResource helper and its schema /
dynamic imports.
huanyang@alauda.io added 2 commits May 19, 2026 11:20
- VioletConfig.LocalPackageDir: opt-in on-disk cache for downloaded .tgz
  packages. Layout mirrors the MinIO URL convention so cache hits are
  trivially diagnosable. SHA-256 verification still runs against cached
  files; half-written files are removed on download error to prevent
  false hits on retry.
- VioletConfig.PlatformUsername / PlatformPassword: opt-in config-side
  credentials for environments where VIOLET_PLATFORM_* env injection is
  awkward. Config wins, env falls back, both feed --platform-*. Resolved
  password also lands in RedactSecrets so violet's own stdout cannot leak.
- Refactor downloadToTemp into downloadFile + per-call wrapper so the
  cache path and the legacy /tmp path share the same HTTP/error logic.
- Drop dead createArtifactVersion: violet binary owns the write path now;
  git history serves the "diff vs old impl" rationale better than keeping
  unreachable code in the file.
- Wrap refreshSubscriptionForUpgrade and approveInstallPlan in
  retry.RetryOnConflict so a concurrent OLM status write does not abort
  the upgrade with a single 409. Get moves inside the retry block so each
  attempt sees the freshest resourceVersion.
- Move Version.Channel required-check from UpgradeOperator runtime path
  to validateConfig at LoadConfig time. Catches the typo at load instead
  of after a half-finished upgrade. Gated on Type=="operatorhub" so the
  local flow (which never reads Channel) stays untouched.

Tests: 3 new subscription cases (retry on 409, idempotent skip), 4 new
config cases (missing-channel rejection, local exemption, defaulted type
still validates, valid path loads cleanly).

@alaudabot alaudabot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR has solid improvements but 3 critical security issues remain unaddressed: (1) URL scheme not validated, (2) passwords as CLI args, (3) SHA256 optional for HTTP. Please review the inline comments.

Comment thread pkg/config/config.go
// ExpectedSha256, when non-empty, is the lowercase hex SHA-256 of the
// downloaded .tgz. The upgrade CLI verifies the digest after download and
// fails fast on mismatch. Optional; leave empty to skip verification.
ExpectedSha256 string `yaml:"expectedSha256,omitempty"`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical (security/integrity): ExpectedSha256 is optional. For HTTP URLs (non-HTTPS), this allows on-path attackers to inject malicious .tgz. Consider making SHA256 required when packagePrefix starts with http://.

}
platformUser := p.PlatformUsername
if platformUser == "" {
platformUser = os.Getenv(EnvVioletPlatformUsername)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical (security/secrets): Passwords passed as CLI args. While RedactSecrets scrubs stdout/stderr, passwords are visible via ps auxe. Document this as known risk for shared CI runners.

@alaudabot alaudabot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding additional review comments:

if err != nil {
return fmt.Errorf("build request: %w", err)
}
resp, err := http.DefaultClient.Do(req)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical (security/ssrf): No URL scheme validation here or in BuildPackageURL. An attacker controlling config could set packagePrefix to internal IPs or cloud metadata endpoints (e.g., http://169.254.169.254/). Add scheme validation to reject non-http/https URLs.

if err != nil {
return "", nil, fmt.Errorf("mkdir temp: %w", err)
}
cleanup := func() { _ = os.RemoveAll(dir) }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning (quality/cleanup): cleanup function silently ignores errors with _ = os.RemoveAll(dir). Consider logging at debug level: if err := os.RemoveAll(dir); err != nil { log.Debugw("cleanup failed", "err", err) }.

Comment thread pkg/operator/operatorhub/violet.go Outdated
return nil, "", fmt.Errorf("get artifact %s: %w", o.artifact, err)
}

url, err := BuildPackageURL(o.violet.PackagePrefix, o.name, version.EffectivePackageChannel(), version.BundleVersion)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion (refactor/narrow-interface): installViaViolet receives full config.Version (~10 fields) but only uses Channel, BundleVersion, ExpectedSha256, EffectivePackageChannel. Consider a narrow input struct for better type safety and self-documentation.

@yhuan123

Copy link
Copy Markdown
Collaborator Author

测试结论:跨大版本升级 v4.0.17 → v4.6.3-rc.91.g8ff4ed2 通过 ✅

测试环境

目标集群 env4-d(devops-env1-hcvt43--idp.alaudatech.net,子集群 /kubernetes/devops
执行节点 192.168.140.9(CentOS 7 / x86_64,与 MinIO 同内网直连)
upgrade 二进制 本分支 GOOS=linux GOARCH=amd64 交叉编译
violet 二进制 package-minio.alauda.cn:9199/packages/violet/latest/violet_linux_amd64 安装到节点 /usr/local/bin/violet
升级路径 tektoncd-operator v4.0.17 (packageChannel: v4.0) → v4.6.3-rc.91.g8ff4ed2 (packageChannel: rc),OLM channel 共 stable
testCommand 两步均 echo(仅验证 violet+OLM 升级链路,不跑 BDD)

关键时间线(节点本地时区)

时刻 事件 备注
11:47:47 启动 + v4.0.17 cache hit 2.6GB 复用,0 下载(验证 LocalPackageDir
11:47:47 violet push v4.0.17 --platform-password ***(日志屏蔽生效)
11:48:03 → 11:48:38 AV Present + PackageManifest 35s
11:48:38 Subscription absent → createSubscription 起点干净(清理脏数据后)
11:48:43 approve install-wrftv 5s 内出 IP
11:48:58 CSV v4.0.17 Succeeded 15s
11:48:58 [prepare] testCommand passed
11:48:58 → 11:49:31 v4.6.3-rc.91 cache miss → 下载 33s(节点内网直连 MinIO)
11:49:31 → 11:50:06 violet push v4.6.3-rc.91 + AV Present 35s
11:50:07 → 11:50:37 PackageManifest 30s
11:50:37 Subscription exists → in-place refresh bump upgrade-test.alauda.io/refresh-trigger
11:50:37 → 11:54:07 等 OLM 生成 IP for v4.6.3-rc.91 3.5min(OLM catalog reconcile 周期偏慢,非 CLI 问题)
11:54:07 approve install-5ls6h
11:54:27 CSV v4.6.3-rc.91.g8ff4ed2 Succeeded 20s
11:54:27 [upgrade] testCommand passed
总耗时 6 分 40 秒 含一次 ~1.6GB 网络下载 + 一次跨大版本 OLM 升级

本 PR 改动的具体验证

  • Violet.PlatformUsername / PlatformPassword (yaml 入口) ✅:从 config 注入到 --platform-username / --platform-password,env 兜底未触发;MaskCommand--platform-password 之后的值替换为 ***RedactSecrets 把同样的值也加入 violet 子进程 stdout/stderr 屏蔽列表。
  • Violet.LocalPackageDir (本地缓存) ✅:路径布局 <root>/<name>/<packageChannel>/<name>.latest.ALL.<bundleVersion>.tgz:v4.0.17 第二次跑 cache hit 复用 2.6GB;v4.6.3-rc.91 cache miss 自动下载并落到 cache 目录(父目录 auto-mkdir)。Cleanup 为 noop 保留为缓存。
  • acquirePackage 重构 ✅:LocalPackageDir 为空回退原 downloadToTemp 路径(流水线场景行为不变)。
  • InstallSubscription in-place refresh 路径 ✅:v4.6.3-rc.91 走 subscription exists, rolling forward in place 分支,bump refresh annotation,OLM 在 3.5min 内生成新 IP,approve → Succeeded,未删除 Subscription/CSV。

测试中暴露的问题(非本 PR 范围,可作为 follow-up)

  1. 配置直觉陷阱:LocalPackageDir 双层<root>/<name>/<channel>/<file> 布局下,看到仓库已有 ./tektoncd-operator/v4.0/...tgz 时本能想配 localPackageDir: ./tektoncd-operator,结果路径双层 tektoncd-operator/tektoncd-operator/v4.0/...tgz。建议:README 用更具象的说法("放在 operator 上一层;本 root 下会自动生成 <name>/<channel>/ 子目录")或在 CLI 启动时检测 <root> 名字等于 <name> 时打 WARN。
  2. HTTP 明文 + 无 SHA256 → 静默接受坏 cache。本次踩到:本机 aria2 残留的 .tgz 大小 2.6GB 与服务端 content-length 一致,os.Stat 通过,但 gzip 流在 22MB 处坏 → violet 报 flate: corrupt inputtodos/003-pending-p2-sha256-should-be-required-for-http.md 已在讨论这条,本次踩坑确认是已实测的现实失败,不只是理论风险。
  3. CSV 已 Succeeded 等于目标版本时,waitInstallPlanForCSV 死等。OLM 不会为 "我们已经在的版本" 主动生成 IP,CLI 当前会 poll 到 timeout。常见场景:升级路径第一步 == 集群当前 CSV、或重跑同一 upgrade。建议加 idempotent short-circuit:进 waitInstallPlanForCSV 前先 Get CSV,若 phase=Succeeded 直接 return。本 PR 不包含这条 fix,建议另开 PR。
  4. 本地代理截胡 MinIO:开发机带 ClashX 全局代理时,Go http.DefaultClient 默认读 HTTP_PROXY/HTTPS_PROXY → 内网 MinIO 下载从 ~100MB/s 降到 ~2MB/s。建议 README "运行环境" 段加一行:"开发机若有全局代理(ClashX / proxychains 等),请在跑 upgrade 前 unset http_proxy https_proxy all_proxy 或设置 NO_PROXY 包含 MinIO 域名。"
  5. violet push 失败时在 cwd 留下解压残留目录<file>.tgz 同名无后缀目录)。当 LocalPackageDir 是 cache root 时,会污染缓存目录,需要手工 rm -rf。建议:CLI 调 violet 前把 .tgz 路径转绝对路径,让 violet 在其自己的 tmpdir 解压;或者 push 失败后由 CLI 主动清理残留。

总结

本 PR 引入的两个新配置入口(platformUsername/Password 走 yaml、localPackageDir 本地缓存)在跨大版本升级路径上功能完备、行为符合 README/CLAUDE.md 描述。暴露的 5 个 follow-up 问题均不属于本 PR 范围,建议在后续 PR 中针对性处理。

@yhuan123

Copy link
Copy Markdown
Collaborator Author

Round 2:逐步升级 v4.0.17 → v4.2.5 → v4.6.3-rc.91.g8ff4ed2 通过 ✅

延续 Round 1(一步直跨)的同环境同节点,补充验证多跳 OLM replace chain。集群在 Round 1 跑完后做了一次精准清理(按名字删 Subscription / CSV / IP / 全部 AV),保证起点干净。

配置差异(vs Round 1)

upgradePaths:
  - name: v4.0.17-to-v4.2.5-to-v4.6.3-rc.91
    versions:
      - name: v4.0.17           # packageChannel: v4.0
      - name: v4.2.5            # packageChannel: v4.2  ← 新增中间站
      - name: v4.6.3-rc.91.g8ff4ed2  # packageChannel: rc

三站 OLM channel 都是 stable;packageChannel 分别是 v4.0 / v4.2 / rc,再次印证"OLM channel ≠ MinIO 目录段"。

关键时间线(节点本地时区)

时刻 事件 备注
11:59:39 启动 + v4.0.17 cache hit 2.6GB 复用(Round 1 落下的 cache)
11:59:39 → 12:00:11 violet push + AV Present + PackageManifest 32s
12:00:06 Subscription absent → createSubscription 起点干净
12:00:06 → 12:00:11 OLM 出 IP install-qlh7w 5s
12:00:11 → 12:00:26 CSV v4.0.17 Succeeded → [prepare] 15s
12:00:26 → 12:00:47 v4.2.5 cache miss → 下载 + push 21s 下载 + 25s push
12:01:13 → 12:01:38 AV Present + PackageManifest 25s
12:01:38 Subscription exists → in-place refresh(bump annotation)
12:01:38 → 12:01:43 OLM 出 IP install-rgqqg 5s(hot OLM controller)
12:01:43 → 12:01:58 CSV v4.2.5 Succeeded → [intermediate] 15s
12:01:58 v4.6.3-rc.91 cache hit 10.1GB 复用(Round 1 落下的 cache)
12:01:58 → 12:02:34 violet push 10GB 36s
12:02:39 → 12:03:09 AV Present + PackageManifest 30s
12:03:09 Subscription in-place refresh #2
12:03:09 → 12:04:04 OLM 出 IP install-ng6hp 55s(跨大版本 replace chain 解析)
12:04:04 → 12:04:24 CSV v4.6.3-rc.91 Succeeded → [upgrade] 20s
总耗时 4 分 46 秒 三跳 + 一次 ~1GB 下载 + 一次 10GB cache 复用

Round 1 vs Round 2 对比

指标 Round 1(1 跳直跨) Round 2(2 跳逐步)
总耗时 6m 40s 4m 46s
OLM 出 IP 最慢一次 3.5min(cold controller) 55s(hot controller)
网络下载 1 次(10GB v4.6.3-rc.91) 1 次(~1GB v4.2.5)
Cache 命中次数 1(v4.0.17) 2(v4.0.17 + v4.6.3-rc.91)
testCommand 触发次数 2 3

本 PR 改动的额外验证(Round 1 已覆盖的不重复)

  • LocalPackageDir 跨次复用 ✅:v4.6.3-rc.91 的 10.1GB .tgz 在 Round 1 落到 ./tektoncd-operator/rc/...tgz,Round 2 step 3 直接 reusing cached violet package,省掉一次大文件下载。这正是设计意图——cache 跨次保留而非随单次进程清理。
  • refreshSubscriptionForUpgrade 在连续多跳里行为一致 ✅:step 2 / step 3 各 bump 一次 upgrade-test.alauda.io/refresh-trigger,OLM 立刻 reconcile 出新 IP,未走 idle 大周期。
  • waitInstallPlanForCSV 在多 IP 历史下选对了目标 ✅:cluster 在 step 2 后有 install-qlh7w(v4.0.17)和 install-rgqqg(v4.2.5);step 3 触发后 status.installplan.name 短暂仍可能指向上一个 IP,但 CLI 通过 match spec.clusterServiceVersionNames 等到了 install-ng6hp(v4.6.3-rc.91),不会误 approve 旧 plan。

新观察 / Insight

  1. 多跳反而比单跳快——单纯凭"步数"想象会认为 3 跳 > 1 跳,但实际:v4.6.3-rc.91 包 cache hit 省 ~30s + OLM 在 hot 状态下出 IP 快 4-40 倍,两者叠加把多跳的开销压到比单跳还低。
  2. OLM 慢不是稳态属性,是状态依赖的——Round 1 step 2 等了 3.5min,Round 2 step 2 同样的 refresh+bump 流程只等 5s。Round 1 是集群 idle 24h+ 后冷启动;Round 2 是连续跑、controller 处于 hot 状态。这意味着 CLI 的 20min timeout 覆盖 cold case 是合理的,但开发者对"等多久算坏"的直觉需要校准。
  3. force-refresh annotation 确实起作用——step 2 OLM 5s 出 IP 证明 bump annotation 不是占位代码,能把 Sub controller 唤醒到立即重新评估 catalog,而不是等下一次 internal resync。
  4. OLM replace chain 元数据连贯:v4.2.5 / v4.6.3-rc.91 都能正确解析"从上一步升过来",IP 的 spec.clusterServiceVersionNames 都精准指向各步目标 CSV。Round 1 的"直跨 v4.0.17 → v4.6.3-rc.91"也成功 → catalog 的多跳 chain 是完整的。

总结

本 PR 引入的 LocalPackageDir 缓存机制在 跨次复用(Round 1 落地 → Round 2 命中)和 多版本布局v4.0 / v4.2 / rc 三个 channel 段共存于同一 root)两个维度都验证通过。OLM 多跳路径(fresh create + 两次 in-place refresh)功能完备。Round 1 评论里列出的 5 个 follow-up 在本轮也无新增。建议合并。

huanyang@alauda.io added 2 commits May 19, 2026 13:32
BuildPackageURL hardcoded the second-to-last filename segment as `latest`,
which worked for tektoncd only because that uploader double-wrote both
`latest.ALL` and `<channel>.ALL` files. Operators like gitlab-ce that
upload a single `<channel>.ALL` variant hit a 404 in acquirePackage.

Split the API into pathChannel (directory segment, can be a release-train
name like `v4.0`) and fileChannel (filename segment, the OLM channel like
`stable`). acquirePackage's cache layout follows the same split.
…first onboard

The Get only used artifact.GetName(), which equals o.artifact (already
known from config at construction time). The Get added no information
but turned every "first onboard" target — a fresh platform that has
never installed this operator — into a fail-fast: violet push itself
creates the Artifact CR when missing, so the pre-check was actively
blocking a legitimate path.

Use o.artifact directly when composing the AV name.
dlCtx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()

req, err := http.NewRequestWithContext(dlCtx, http.MethodGet, rawURL, nil)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical (security/ssrf): No URL scheme validation. An attacker controlling config.yaml could set packagePrefix to internal URLs (e.g., 169.254.169.254 for AWS metadata) or other dangerous schemes. Add: if parsed.Scheme != "http" && parsed.Scheme != "https" { return fmt.Errorf("unsupported scheme %q (only http/https allowed)", parsed.Scheme) }

Comment thread pkg/config/config.go
// ExpectedSha256, when non-empty, is the lowercase hex SHA-256 of the
// downloaded .tgz. The upgrade CLI verifies the digest after download and
// fails fast on mismatch. Optional; leave empty to skip verification.
ExpectedSha256 string `yaml:"expectedSha256,omitempty"`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical (security/integrity): ExpectedSha256 is optional but required for HTTP URLs to prevent on-path attacks. Consider adding validation in validateConfig to require SHA when PackagePrefix starts with http://.

}
if p := os.Getenv(EnvVioletRegistryPassword); p != "" {
args = append(args, flagPassword, p)
if p2 := os.Getenv(EnvVioletRegistryPassword); p2 != "" {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical (security/secrets): Password passed as CLI argument (--password) visible via ps auxe / /proc/<pid>/cmdline. The README documents this risk - consider if this is an acceptable tradeoff or if there's a better approach.

if err != nil {
return "", nil, fmt.Errorf("mkdir temp: %w", err)
}
cleanup := func() { _ = os.RemoveAll(dir) }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning (quality/cleanup): Cleanup function silently ignores errors with _ = os.RemoveAll(dir). Consider logging at debug level: if err := os.RemoveAll(dir); err != nil { log.Debugw("cleanup failed", "err", err) }

@yhuan123 yhuan123 merged commit 45d6c0c into main May 19, 2026
3 of 4 checks passed
@yhuan123 yhuan123 deleted the refactor/install-via-violet branch May 19, 2026 07:08
yhuan123 pushed a commit that referenced this pull request May 19, 2026
PR #14 的 commit a036487 + 12f63d2 已经把以下 6 个 P2 finding 实际修复,
对应 todo 文件失去价值,直接删除(按"todos/ 是看板不是档案"约定)。

| 删除文件 | 修复 commit | 改动要点 |
|---------|-------------|---------|
| 001 http-download-no-timeout            | a036487 | downloadToTemp 用 context.WithTimeout(ctx, o.timeout) 包裹 |
| 004 package-manifest-substring-csv-match| a036487 | strings.Contains → 精确匹配 + surface NestedString type error |
| 005 wait-av-no-retry-on-transient-errors| a036487 | 引入 isTransientAPIError + switch 分支:NotFound/Transient 继续轮询 |
| 006 download-close-error-swallowed      | a036487 | explicit Close + 显式 error check,避免 truncated tgz 绕过 sha256 |
| 007 empty-csv-fed-to-subscription       | a036487 | InstallArtifactVersion 返回 (av, csv, err) 替代二次 NestedString |
| 008 channel-default-inconsistency       | a036487 + 12f63d2 | 删 "stable" 静默 fallback;channel required-check 移到 validateConfig |

剩余 pending 的 P2(002 / 003 / 009 / 010 / 011 / 012)继续保留——
代码尚未落地,归档为 completed 会变成"修报告掩盖未做工作"。
yhuan123 added a commit that referenced this pull request May 19, 2026
)

* chore(operatorhub): rename artifact_versiong.go to artifact_version.go (PR 3/3)

修复 PR 1 之前就遗留的文件名 typo(多了一个尾巴的 g)。

PR 3 原计划要清理的 dead code `createArtifactVersion` 已经在 PR 2 的
commit 12f63d2 (Drop dead createArtifactVersion: violet binary owns the
write path now) 一并清掉了——此 PR 收尾把这个一直没人收的文件名 typo
也解决掉,确保仓库里再没有 `versiong` 这个错字。

同步更新引用:docs/brainstorms、docs/plans、todos/004/005/007/010。
其中 `:行号` 是 2026-05-18 plan / PR 2 review 时刻的历史快照,PR 2
重写后行号已不对应,但属于历史性引用,保留原值不动。

Plan: docs/plans/2026-05-18-refactor-externalize-violet-binary-plan.md

* chore(todos): drop 6 P2 todos already fixed by PR #14

PR #14 的 commit a036487 + 12f63d2 已经把以下 6 个 P2 finding 实际修复,
对应 todo 文件失去价值,直接删除(按"todos/ 是看板不是档案"约定)。

| 删除文件 | 修复 commit | 改动要点 |
|---------|-------------|---------|
| 001 http-download-no-timeout            | a036487 | downloadToTemp 用 context.WithTimeout(ctx, o.timeout) 包裹 |
| 004 package-manifest-substring-csv-match| a036487 | strings.Contains → 精确匹配 + surface NestedString type error |
| 005 wait-av-no-retry-on-transient-errors| a036487 | 引入 isTransientAPIError + switch 分支:NotFound/Transient 继续轮询 |
| 006 download-close-error-swallowed      | a036487 | explicit Close + 显式 error check,避免 truncated tgz 绕过 sha256 |
| 007 empty-csv-fed-to-subscription       | a036487 | InstallArtifactVersion 返回 (av, csv, err) 替代二次 NestedString |
| 008 channel-default-inconsistency       | a036487 + 12f63d2 | 删 "stable" 静默 fallback;channel required-check 移到 validateConfig |

剩余 pending 的 P2(002 / 003 / 009 / 010 / 011 / 012)继续保留——
代码尚未落地,归档为 completed 会变成"修报告掩盖未做工作"。

---------

Co-authored-by: huanyang@alauda.io <huanyang@alauda.io>
yhuan123 pushed a commit that referenced this pull request May 19, 2026
Three silent-failure paths fixed (P1), four hardening / cleanup changes
applied (P2). All P1 fixes close anti-patterns where preflight reported
"clean" while actually bypassing the guard it was built to enforce.

P1 — silent-failure closures:
- #13/#14 cluster guard: in-cluster mode now requires --confirm-cluster
  matching violet.clusters; empty kubeconfig CurrentContext is now a hard
  error with an actionable 'kubectl config use-context' hint. Previously
  both paths log.Warn'd and returned nil, defeating the guard exactly when
  it mattered most (CI pods with violet.clusters set, multi-context kubeconfigs).
- #15 BundleVersion required for operatorhub: empty value used to slip
  past validateConfig, producing 'operatorhub-foo.' AV names with a
  trailing dot — preflight reported clean while upgrade failed late with
  cryptic 404s. Required at load time.

P2 — internal hardening:
- #16 Drop the seen map from runPreflight: dead code under fail-fast
  semantics (architecture-strategist + simplicity-reviewer both flagged
  it independently). -14 LOC.
- #018 checkInstallPlanResidue no longer discards the NestedString error.
  status.phase type drift now surfaces a wrapped error instead of being
  swallowed and producing a 'every IP is non-terminal' noise storm.
- #019 New preflight.NewResidual constructor centralises kubectl cleanup
  template + %q shell-quoting. Three call sites in operatorhub/preflight.go
  go from 5-line struct literals to one-line constructor calls.

P3 follow-ups (020-023) recorded as separate todos for later triage.

Tests for the cmd-layer changes (#17) land in the next commit.
yhuan123 pushed a commit that referenced this pull request May 19, 2026
Closes P2 #17 — the two cmd-layer functions added by PR #17 carried
zero coverage (pr-test-analyzer criticality 9).

Coverage:
- assertClusterMatch (10 sub-tests): violet nil / clusters empty no-ops;
  in-cluster missing/mismatched/matching --confirm-cluster; file-mode
  unreadable kubeconfig, empty current-context (pins the #14 fix in
  place), missing/mismatched/matching --confirm-cluster.
- runPreflight (5 sub-tests): clean cluster, empty Versions skipped,
  baseline-only invariant (regression guard: a refactor scanning all
  versions would surface mid-upgrade AVs as false positives), fail-fast
  across paths (second path's PreflightBaseline is verifiably not
  reached), operator errors propagate unwrapped (must NOT be wrapped
  as *PreflightError).

The fake operator records the Version passed to each PreflightBaseline
call so the baseline-only test asserts the actual contract rather than
just call count.
yhuan123 added a commit that referenced this pull request May 20, 2026
* feat(operator): scaffold PreflightBaseline interface

- new pkg/operator/preflight package holds the Residual value type so
  both pkg/operator and its concrete impls can import without cycles
- OperatorInterface gains PreflightBaseline(ctx, version)
  ([]preflight.Residual, error); contract: read-only, baseline-only
- local: no-op returning (nil, nil) — README will document the assumption
  that `make deploy` must be idempotent
- operatorhub: scaffolding stub; real check logic lands in follow-up
- plan: docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md

* feat(operatorhub): implement PreflightBaseline residue checks

Three read-only checks against the baseline of an upgrade path:

- Subscription/<name> in <namespace>
- ArtifactVersion/<artifact>.<bundleVersion> in cpaas-system
- non-terminal InstallPlan in <namespace>, filtered by the OLM-managed
  label operators.coreos.com/<package>.<ns> so we never scan the
  namespace's full historical IP archive

Behavior:
- 30s context timeout caps the whole call so a hung apiserver cannot
  block the upgrade for the default o.timeout (10m)
- transient API errors (ServerTimeout / TooManyRequests /
  ServiceUnavailable) are wrapped with a 'retry the run' hint
- permanent errors propagate with the check name attached
- kubectl cleanup hints quote both name and namespace via %q so
  copy-pasted commands stay shell-safe

CSV residue is intentionally NOT checked: an independent CSV without
a corresponding Subscription or ArtifactVersion is a degenerate state
already covered by the other three. Plan documents how to add
PackageManifest-driven CSV resolution if a future incident requires it.

Tests cover clean cluster, single/multi residue, non-terminal phases
(empty/Planning/RequiresApproval/Installing), terminal phases
(Complete/Failed) ignored, transient + permanent error paths, and a
read-only contract assertion (spy reactor fails the test on any
Create/Update/Patch/Delete).

* feat(config): validate namespace + bundleVersion for operatorhub

Two new load-time checks gated on Type == operatorhub:

1. operatorConfig.namespace must be non-empty. An empty namespace
   would let preflight's Subscription / InstallPlan queries silently
   degrade to cluster-scope, producing false-positive residue
   reports from completely unrelated operators.

2. Version.bundleVersion must match ^[a-zA-Z0-9._-]+$. The field is
   interpolated into kubectl cleanup hints emitted by preflight and
   into violet argv, so allowing shell metacharacters here would
   propagate $, backticks, semicolons or quotes straight into the
   user's terminal. Centralising the check at load time closes the
   injection vector for every downstream consumer with a single
   chokepoint.

Both checks fire only for operatorhub type — local mode runs
`make deploy` and has no equivalent risk surface.

Tests cover: namespace required (operatorhub), namespace optional
(local), shell-active characters rejected, realistic versions
(v4.6.3 / 4.6.3-rc.91 / v0.74.0) accepted.

* feat(cmd): wire preflight + cluster guard into Execute

Three additions to the upgrade entry point, all ahead of the upgrade loop:

1. *PreflightError (cmd-internal) formats []preflight.Residual into a
   multi-line copy-pasteable report: one cleanup block per residual,
   then a finalizer-unstuck command template, then the bypass hint.
   The type stays in cmd/ so pkg/operator never picks up a dependency
   on how the CLI talks to the user. Decision C (all-English text) is
   the locked default; Error() is the single place to change wording
   without touching any contract.

2. assertClusterMatch enforces the operator≡kubectl cluster identity
   contract whenever operatorConfig.violet.clusters is set. The
   recommended rule (Decision B, locked at default B1) is exact
   string equality against KUBECONFIG's CurrentContext via
   --confirm-cluster. In-cluster runs (no kubeconfig file) degrade
   to a loud warning so CI pods aren't silently blocked.

3. runPreflight iterates UpgradePaths, calls op.PreflightBaseline on
   each baseline (Versions[0]), and fails fast at the first dirty
   path with a *PreflightError. (Kind, NS, Name) dedup is in place
   for future aggregation; --skip-preflight bypasses the whole step
   with a Warn-level audit line.

Cobra's SilenceUsage is set in AddFlags so that PreflightError's
actionable kubectl lines are not buried under --help output on a
non-zero exit. Flag-parsing errors still print usage (they execute
before RunE), so users who typo a flag are not stranded.

* docs(preflight): README usage + CLAUDE.md hard constraints

- README: new 'Preflight 前置检查' section documents what gets
  scanned (Subscription / ArtifactVersion / non-terminal InstallPlan),
  the 30s timeout, a sample error message, the two new flags
  (--skip-preflight / --confirm-cluster), and the minimum RBAC verbs
  needed to run preflight in isolation.

- CLAUDE.md:
  - 数据流: preflight + cluster guard inserted ahead of the upgrade
    loop; explicit note that preflight does NOT honour cfg.Immediate
    (quality gate, not part of the upgrade chain).
  - Operator 抽象 paragraph mentions PreflightBaseline and the leaf
    pkg/operator/preflight package created to avoid an import cycle.
  - 硬约束: four new entries — PreflightBaseline must stay read-only
    and baseline-only (with a note on why CSV is intentionally
    excluded and where to add it back if needed); cmd.SilenceUsage
    is mandatory; --confirm-cluster matching rule (B1 default);
    BundleVersion regex chokepoint.

- Plan acceptance criteria all marked complete.

* docs(plan): mark preflight-check plan as completed

* fix: address P1+P2 review findings for preflight check

Three silent-failure paths fixed (P1), four hardening / cleanup changes
applied (P2). All P1 fixes close anti-patterns where preflight reported
"clean" while actually bypassing the guard it was built to enforce.

P1 — silent-failure closures:
- #13/#14 cluster guard: in-cluster mode now requires --confirm-cluster
  matching violet.clusters; empty kubeconfig CurrentContext is now a hard
  error with an actionable 'kubectl config use-context' hint. Previously
  both paths log.Warn'd and returned nil, defeating the guard exactly when
  it mattered most (CI pods with violet.clusters set, multi-context kubeconfigs).
- #15 BundleVersion required for operatorhub: empty value used to slip
  past validateConfig, producing 'operatorhub-foo.' AV names with a
  trailing dot — preflight reported clean while upgrade failed late with
  cryptic 404s. Required at load time.

P2 — internal hardening:
- #16 Drop the seen map from runPreflight: dead code under fail-fast
  semantics (architecture-strategist + simplicity-reviewer both flagged
  it independently). -14 LOC.
- #018 checkInstallPlanResidue no longer discards the NestedString error.
  status.phase type drift now surfaces a wrapped error instead of being
  swallowed and producing a 'every IP is non-terminal' noise storm.
- #019 New preflight.NewResidual constructor centralises kubectl cleanup
  template + %q shell-quoting. Three call sites in operatorhub/preflight.go
  go from 5-line struct literals to one-line constructor calls.

P3 follow-ups (020-023) recorded as separate todos for later triage.

Tests for the cmd-layer changes (#17) land in the next commit.

* test(cmd): unit tests for assertClusterMatch and runPreflight

Closes P2 #17 — the two cmd-layer functions added by PR #17 carried
zero coverage (pr-test-analyzer criticality 9).

Coverage:
- assertClusterMatch (10 sub-tests): violet nil / clusters empty no-ops;
  in-cluster missing/mismatched/matching --confirm-cluster; file-mode
  unreadable kubeconfig, empty current-context (pins the #14 fix in
  place), missing/mismatched/matching --confirm-cluster.
- runPreflight (5 sub-tests): clean cluster, empty Versions skipped,
  baseline-only invariant (regression guard: a refactor scanning all
  versions would surface mid-upgrade AVs as false positives), fail-fast
  across paths (second path's PreflightBaseline is verifiably not
  reached), operator errors propagate unwrapped (must NOT be wrapped
  as *PreflightError).

The fake operator records the Version passed to each PreflightBaseline
call so the baseline-only test asserts the actual contract rather than
just call count.

* chore(todos): remove completed P1/P2 review findings

The 7 todos (013-019) that drove the silent-failure fixes + cmd-layer
tests are done; their substance lives in the PR description and the
commits that closed them (1201665 + f721f8e). Removing the files
keeps todos/ as a list of OPEN work, which is its only useful state.

Remaining in todos/: 4 P3 follow-ups for this PR (020-023) and 6 P2
items from prior PRs that have not been actioned yet.

* chore(todos): drop P3 review follow-ups for PR #17

The four P3 nice-to-haves (test variants, internal simplifications,
agent-ergonomics knobs, misc cleanups) are not on the roadmap and
filing them as todos would just decorate the backlog with low-value
items nobody plans to pick up. The PR description still records the
reviewer recommendations for archeology; the file copies were
redundant.

---------

Co-authored-by: huanyang@alauda.io <huanyang@alauda.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants