feat: upgrade preflight check (OLM residue scan + cluster guard) #17

Merged
yhuan123 merged 10 commits into main from feat/preflight-check on May 20, 2026

@yhuan123 yhuan123 commented May 19, 2026

Collaborator

Summary

  • Preflight residue scan: read-only check before the upgrade loop. For each UpgradePath's baseline (Versions[0]), Get the operator's Subscription, Get the baseline ArtifactVersion (<artifact>.<bundleVersion> in cpaas-system), and List non-terminal InstallPlans filtered by OLM's operators.coreos.com/<pkg>.<ns> label. Any residue → fail with copy-pasteable kubectl delete commands (names/ns escaped via %q, generated by the centralised preflight.NewResidual constructor), a finalizer-unstuck template, and a "wait 30s for OLM" hint. 30s context timeout caps the entire call.

  • Cluster identity guard (strict in every mode after code review): when operatorConfig.violet.clusters is set, the user must pass --confirm-cluster=<NAME>. The matching rule is mode-aware:

    • File-mode kubeconfig: --confirm-cluster must equal CurrentContext. Empty CurrentContext is treated as a user error with an actionable kubectl config use-context <name> hint (no silent fallback).
    • In-cluster: there is no CurrentContext to read, so --confirm-cluster must equal violet.clusters; the user thereby explicitly acknowledges the target subcluster. A CI pod with violet.clusters: devops but no --confirm-cluster is now a hard error instead of a log line. This makes in-cluster mode symmetric with file mode and closes the "silent false success" path that motivated the guard in the first place.
  • CSV intentionally excluded from preflight: independent CSV residue (without Sub or AV) is a degenerate state covered by the other three checks. PackageManifest-driven CSV resolution is documented in the plan as a future-storage option if a real incident requires it.

  • Config hardening: pkg/config/config.go::validateConfig now requires namespace for operatorhub type, requires bundleVersion for operatorhub type (empty value used to produce operatorhub-foo. AV names with a trailing dot — preflight reported clean while upgrade failed late with cryptic 404s), and enforces bundleVersion against ^[a-zA-Z0-9._-]+$ (the field is interpolated into kubectl + violet argv, so shell metacharacters here are an injection vector — single chokepoint closes it for every downstream consumer).

  • CLI ergonomics: cmd.SilenceUsage = true in AddFlags so cobra does not dump --help over the actionable kubectl-delete lines on a non-zero exit. Flag-parsing errors still print usage (they execute before RunE), so typo'd flags are not stranded.
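
The mode-aware matching rule from the cluster identity guard bullet can be sketched as follows. This is a minimal illustration, not the PR's code: checkClusterIdentity, its parameters, and the error wording are all assumptions, since the real assertClusterMatch signature is not shown in this conversation.

```go
package main

import (
	"errors"
	"fmt"
)

// checkClusterIdentity is a hypothetical stand-in for the guard described
// above; the real assertClusterMatch signature is not shown in this PR text.
func checkClusterIdentity(violetClusters, confirm, currentContext string, inCluster bool) error {
	if violetClusters == "" {
		return nil // guard is a no-op when violet.clusters is not configured
	}
	if confirm == "" {
		return errors.New("violet.clusters is set: pass --confirm-cluster=<NAME>")
	}
	if inCluster {
		// No kubeconfig to read, so the flag must acknowledge the target subcluster.
		if confirm != violetClusters {
			return fmt.Errorf("--confirm-cluster=%q does not match violet.clusters=%q", confirm, violetClusters)
		}
		return nil
	}
	if currentContext == "" {
		return errors.New("kubeconfig has no current-context: run 'kubectl config use-context <name>'")
	}
	if confirm != currentContext {
		return fmt.Errorf("--confirm-cluster=%q does not match current-context %q", confirm, currentContext)
	}
	return nil
}

func main() {
	// A CI pod with violet.clusters set but no flag is a hard error.
	fmt.Println(checkClusterIdentity("devops", "", "", true))
	// File mode: the flag must equal the kubeconfig's current-context.
	fmt.Println(checkClusterIdentity("devops", "env3", "env3", false))
}
```

Exact string equality keeps every branch a plain comparison, which is what makes each branch easy to cover with a table-driven test.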

Code Review Fixes (P1 + P2 closed)

Multi-agent review (8 reviewers in parallel — architecture, performance, security, simplicity, silent-failure, type-design, pr-test, agent-native) surfaced 11 actionable findings; 3 P1 + 4 P2 fixed in commits 1201665 and f721f8e. P3 follow-ups recorded for later triage.

P1 — silent-failure closures (preflight was reporting "clean" while bypassing its own guard):
  • #13/#14 Cluster guard: in-cluster mode now requires --confirm-cluster matching violet.clusters, and an empty kubeconfig CurrentContext is a hard error with an actionable kubectl config use-context hint. Both paths previously logged a warning and returned nil.
  • #15 bundleVersion required for operatorhub: an empty value used to slip past validateConfig and produce operatorhub-foo. AV names with a trailing dot.

P2 — internal hardening:

  • #16 Drop seen map from runPreflight: dead code under fail-fast semantics (architecture-strategist + simplicity-reviewer independently flagged it). -14 LOC.
  • #018 checkInstallPlanResidue no longer discards the NestedString error. status.phase type drift now surfaces a wrapped error instead of being swallowed and producing an "every IP is non-terminal" noise storm — protects against OLM CRD schema migrations.
  • #019 preflight.NewResidual constructor: centralises kubectl cleanup template + %q shell-quoting. Three call sites in operatorhub/preflight.go go from 5-line struct literals to one-line constructor calls. The shell-quoting invariant is now compiler-protected.
  • #17 cmd-layer test coverage: assertClusterMatch + runPreflight carried zero coverage (pr-test-analyzer criticality 9). New cmd/upgrade_command_test.go adds 15 sub-tests including baseline-only invariant (regression guard against "scan all versions") and fail-fast verification (second path's PreflightBaseline provably not reached).
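
As an illustration of the #019 item, a constructor that owns the cleanup template might look like the sketch below. The Kind, Namespace, and Name fields appear in the review thread; the RecommendedCleanup field type and the NewResidual signature are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// Residual mirrors the fields visible in the review thread (Kind, Namespace,
// Name); RecommendedCleanup as a string field is an assumption.
type Residual struct {
	Kind, Namespace, Name string
	RecommendedCleanup    string
}

// NewResidual builds the cleanup hint in one place so every call site
// quotes name and namespace the same way. The signature is hypothetical.
func NewResidual(kind, namespace, name string) Residual {
	return Residual{
		Kind:      kind,
		Namespace: namespace,
		Name:      name,
		RecommendedCleanup: fmt.Sprintf("kubectl delete %s %q -n %q",
			strings.ToLower(kind), name, namespace),
	}
}

func main() {
	r := NewResidual("Subscription", "tekton-operator", "tektoncd-operator")
	fmt.Println(r.RecommendedCleanup)
}
```

Note that %q emits Go string syntax. It coincides with safe shell double-quoting only when the inputs contain no $, backticks, or backslashes, which is why the load-time bundleVersion regex matters alongside it.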

P3 follow-ups recorded as todos for later:

  • todos/020-pending-p3-test-coverage-extensions.md: transient error type variants, IP label fixture drift, regex edge cases
  • todos/021-pending-p3-preflight-simplifications.md: checks slice → sequential calls, phases map → switch
  • todos/022-pending-p3-agent-ergonomics-followups.md: --preflight-output=json, exit-code taxonomy, optional --preflight-only
  • todos/023-pending-p3-misc-cleanups.md: SilenceUsage placement, defensive zero-residual test, local Info log

Real-Cluster Validation

Verified against env3-devops cluster with live tektoncd-operator (Subscription in tekton-operator namespace, multiple historical ArtifactVersions in cpaas-system, terminal InstallPlan install-4pbdf):

| Designed behavior | Real-cluster result |
| --- | --- |
| Subscription name + namespace match | ✓ Detected tekton-operator/tektoncd-operator |
| ArtifactVersion <artifact>.<bundleVersion> match | ✓ Detected tektoncd-operator.v4.11.0-beta.78.gcec2b43 |
| OLM label selector operators.coreos.com/<pkg>.<ns> | ✓ Real label is exactly this format; selector hits the right plans |
| Terminal-phase IP (Complete) filtered out | ✓ Not reported as residue (matches plan design) |
| %q quoting in kubectl hints | kubectl delete subscription "tektoncd-operator" -n "tekton-operator" ready to paste |
| No cluster mutation when preflight fails | ✓ No violet invocation, no Subscription patch; verified by confirming that the IP's status.phase is unchanged across runs |
| Exit code | 1 |
| stdout / stderr split | ✓ INFO logs → stdout; Error: preflight failed: ... → stderr |

Testing

  • 17 new tests in cmd/upgrade_command_test.go (10 assertClusterMatch × all branches + 5 runPreflight × baseline-only / fail-fast / error propagation / empty-versions / clean cluster + 2 PreflightError formatter).
  • 12 sub-tests in pkg/operator/operatorhub/preflight_test.go covering clean cluster, single residue per kind, multi-kind aggregation, non-terminal IP phases ("" / Planning / RequiresApproval / Installing), terminal IP phases (Complete / Failed) ignored, transient API error wrap with "retry the run" hint, permanent error wrap chain (errors.Is), and a read-only contract assertion (spy reactor on create/update/patch/delete/deletecollection fails the test on any mutation).
  • 9 sub-cases in pkg/config/config_test.go for new validators (shell metachar rejection × 5, realistic version acceptance × 4, namespace required for operatorhub, namespace optional for local, empty bundleVersion rejected for operatorhub).
  • go test ./... + go vet ./... all pass with no regressions in existing violet / subscription / config tests.

Plan & Decisions

  • Plan (with full Deepen report + 8 sub-agent findings folded in): docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md — included in the first commit so reviewers can map every design choice back to its source.
  • Decision B (--confirm-cluster matching rule): locked at B1 = exact string equality. Implementation note in cmd/upgrade_command.go::assertClusterMatch documents how to switch to substring/regex without changing the flag surface.
  • Decision C (PreflightError wording): locked at C1 = all-English. Implementation note in cmd/preflight_error.go::Error() documents the swap point if C2 / C3 is preferred later.

Post-Deploy Monitoring & Validation

  • What to monitor:
    • upgrade-test CI runs in sprint envs for the first week after merge — look for an uptick in preflight failed: lines.
    • User-reported issues mentioning "upgrade now fails on a clean env" → potential false positive (namespace defaulting or label-selector miss).
    • CI jobs setting violet.clusters but missing --confirm-cluster: hard fail at startup is the new behaviour — pipeline configs may need a one-time update.
  • Validation checks:
    • Sanity: on a known-clean sprint env, ./upgrade --config <real-upgrade.yaml> should reach the upgrade loop within ~2s of starting.
    • Negative: kubectl apply a stray Subscription, then re-run — expect a non-zero exit with the exact kubectl delete cleanup line printed.
    • Cluster-guard sanity: try --confirm-cluster=wrong-name against a real KUBECONFIG; expect a hard-fail with the actual current-context shown.
  • Expected healthy behavior: silent preflight pass; upgrade continues as before.
  • Failure signal / rollback trigger: user reports preflight false-positive on a clean env. Immediate mitigation: --skip-preflight. Code-level mitigation: revert this PR (no schema migration, no on-disk state).
  • Validation window & owner: 1 week after merge / huanyang@alauda.io.

RBAC (smallest verb set for preflight only)

- apiGroups: ["operators.coreos.com"]
  resources: ["subscriptions", "installplans"]
  verbs: ["get", "list"]
- apiGroups: ["app.alauda.io"]
  resources: ["artifactversions"]
  verbs: ["get"]

The upgrade itself still requires the existing write permissions.

huanyang@alauda.io added 6 commits May 19, 2026 18:20
- new pkg/operator/preflight package holds the Residual value type so
  both pkg/operator and its concrete impls can import without cycles
- OperatorInterface gains PreflightBaseline(ctx, version)
  ([]preflight.Residual, error); contract: read-only, baseline-only
- local: no-op returning (nil, nil) — README will document the assumption
  that `make deploy` must be idempotent
- operatorhub: scaffolding stub; real check logic lands in follow-up
- plan: docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md

Three read-only checks against the baseline of an upgrade path:

- Subscription/<name> in <namespace>
- ArtifactVersion/<artifact>.<bundleVersion> in cpaas-system
- non-terminal InstallPlan in <namespace>, filtered by the OLM-managed
  label operators.coreos.com/<package>.<ns> so we never scan the
  namespace's full historical IP archive

Behavior:
- 30s context timeout caps the whole call so a hung apiserver cannot
  block the upgrade for the default o.timeout (10m)
- transient API errors (ServerTimeout / TooManyRequests /
  ServiceUnavailable) are wrapped with a 'retry the run' hint
- permanent errors propagate with the check name attached
- kubectl cleanup hints quote both name and namespace via %q so
  copy-pasted commands stay shell-safe
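
The single shared deadline and the "check name attached" error wrapping can be sketched with the standard library alone. The check type, runChecks, and the 50ms budget are illustrative assumptions; the PR itself uses a 30s budget.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// check is a hypothetical named read-only probe.
type check struct {
	name string
	run  func(ctx context.Context) error
}

// runChecks caps all checks with one shared deadline and attaches the
// check name to any error, so a hung call cannot outlive the budget.
func runChecks(parent context.Context, budget time.Duration, checks []check) error {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()
	for _, c := range checks {
		if err := c.run(ctx); err != nil {
			return fmt.Errorf("preflight: %s: %w", c.name, err)
		}
	}
	return nil
}

// hung simulates an apiserver call that never answers until the deadline.
func hung(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// timedOut reports whether the wrapped error chain ends in a deadline.
func timedOut(err error) bool { return errors.Is(err, context.DeadlineExceeded) }

// demo runs one healthy check followed by one hung check.
func demo() error {
	return runChecks(context.Background(), 50*time.Millisecond, []check{
		{name: "Subscription", run: func(context.Context) error { return nil }},
		{name: "InstallPlan", run: hung},
	})
}

func main() {
	err := demo()
	fmt.Println(err, timedOut(err))
}
```

Because the deadline is created once outside the loop, the budget covers the whole call rather than resetting for each check.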

CSV residue is intentionally NOT checked: an independent CSV without
a corresponding Subscription or ArtifactVersion is a degenerate state
already covered by the other three. Plan documents how to add
PackageManifest-driven CSV resolution if a future incident requires it.

Tests cover clean cluster, single/multi residue, non-terminal phases
(empty/Planning/RequiresApproval/Installing), terminal phases
(Complete/Failed) ignored, transient + permanent error paths, and a
read-only contract assertion (spy reactor fails the test on any
Create/Update/Patch/Delete).

Two new load-time checks gated on Type == operatorhub:

1. operatorConfig.namespace must be non-empty. An empty namespace
   would let preflight's Subscription / InstallPlan queries silently
   degrade to cluster-scope, producing false-positive residue
   reports from completely unrelated operators.

2. Version.bundleVersion must match ^[a-zA-Z0-9._-]+$. The field is
   interpolated into kubectl cleanup hints emitted by preflight and
   into violet argv, so allowing shell metacharacters here would
   propagate $, backticks, semicolons or quotes straight into the
   user's terminal. Centralising the check at load time closes the
   injection vector for every downstream consumer with a single
   chokepoint.
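
A minimal sketch of the two load-time checks, assuming a trimmed config shape. The operatorConfig struct and validate function are hypothetical; only the regex and the operatorhub gating come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// bundleVersionRe is the allow-list quoted in the commit message.
var bundleVersionRe = regexp.MustCompile(`^[a-zA-Z0-9._-]+$`)

// operatorConfig is a trimmed, hypothetical shape of the loaded config.
type operatorConfig struct {
	Type           string // "operatorhub" or "local"
	Namespace      string
	BundleVersions []string // one per configured version
}

// validate applies the load-time checks only to the operatorhub type.
func validate(c operatorConfig) error {
	if c.Type != "operatorhub" {
		return nil
	}
	if c.Namespace == "" {
		return errors.New("operatorConfig.namespace is required for operatorhub")
	}
	for _, v := range c.BundleVersions {
		// An empty string also fails: the regex requires one or more characters.
		if !bundleVersionRe.MatchString(v) {
			return fmt.Errorf("bundleVersion %q must match %s", v, bundleVersionRe)
		}
	}
	return nil
}

func main() {
	good := operatorConfig{Type: "operatorhub", Namespace: "tekton-operator", BundleVersions: []string{"v4.6.3"}}
	bad := operatorConfig{Type: "operatorhub", Namespace: "tekton-operator", BundleVersions: []string{"v1;id"}}
	fmt.Println(validate(good))
	fmt.Println(validate(bad))
}
```

An allow-list is used rather than a deny-list of shell metacharacters, so characters nobody thought to forbid are rejected by default.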

Both checks fire only for operatorhub type — local mode runs
`make deploy` and has no equivalent risk surface.

Tests cover: namespace required (operatorhub), namespace optional
(local), shell-active characters rejected, realistic versions
(v4.6.3 / 4.6.3-rc.91 / v0.74.0) accepted.

Three additions to the upgrade entry point, all ahead of the upgrade loop:

1. *PreflightError (cmd-internal) formats []preflight.Residual into a
   multi-line copy-pasteable report: one cleanup block per residual,
   then a finalizer-unstuck command template, then the bypass hint.
   The type stays in cmd/ so pkg/operator never picks up a dependency
   on how the CLI talks to the user. Decision C (all-English text) is
   the locked default; Error() is the single place to change wording
   without touching any contract.

2. assertClusterMatch enforces the operator≡kubectl cluster identity
   contract whenever operatorConfig.violet.clusters is set. The
   recommended rule (Decision B, locked at default B1) is exact
   string equality against KUBECONFIG's CurrentContext via
   --confirm-cluster. In-cluster runs (no kubeconfig file) degrade
   to a loud warning so CI pods aren't silently blocked.

3. runPreflight iterates UpgradePaths, calls op.PreflightBaseline on
   each baseline (Versions[0]), and fails fast at the first dirty
   path with a *PreflightError. (Kind, NS, Name) dedup is in place
   for future aggregation; --skip-preflight bypasses the whole step
   with a Warn-level audit line.
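
The report layout described in item 1 (one cleanup block per residual, then a finalizer template, then the bypass hint) could be rendered roughly as below. The type names and all wording are illustrative assumptions; the actual text lives in cmd/preflight_error.go and is not shown in this conversation.

```go
package main

import (
	"fmt"
	"strings"
)

// residual is a minimal stand-in for preflight.Residual.
type residual struct {
	Kind, Namespace, Name, Cleanup string
}

// preflightError is a hypothetical sketch of the cmd-internal error type.
type preflightError struct {
	residuals []residual
}

// Error renders one cleanup block per residual, then a finalizer template,
// then the bypass hint. The wording here is illustrative only.
func (e *preflightError) Error() string {
	var b strings.Builder
	fmt.Fprintf(&b, "preflight failed: %d residual resource(s) found\n", len(e.residuals))
	for _, r := range e.residuals {
		fmt.Fprintf(&b, "  %s %s/%s\n    %s\n", r.Kind, r.Namespace, r.Name, r.Cleanup)
	}
	b.WriteString("if a delete hangs on finalizers:\n")
	b.WriteString(`    kubectl patch <kind> <name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":null}}'` + "\n")
	b.WriteString("after cleanup, wait ~30s for OLM, or bypass with --skip-preflight")
	return b.String()
}

// reportLines splits the rendered report for inspection.
func reportLines(err error) []string { return strings.Split(err.Error(), "\n") }

func main() {
	var err error = &preflightError{residuals: []residual{{
		Kind:      "Subscription",
		Namespace: "tekton-operator",
		Name:      "tektoncd-operator",
		Cleanup:   `kubectl delete subscription "tektoncd-operator" -n "tekton-operator"`,
	}}}
	fmt.Println(err)
}
```

Keeping the formatter in the cmd layer means the operator packages only return data, so the wording can change without touching the PreflightBaseline contract.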

Cobra's SilenceUsage is set in AddFlags so that PreflightError's
actionable kubectl lines are not buried under --help output on a
non-zero exit. Flag-parsing errors still print usage (they execute
before RunE), so users who typo a flag are not stranded.

- README: new 'Preflight 前置检查' (preflight check) section documents what gets
  scanned (Subscription / ArtifactVersion / non-terminal InstallPlan),
  the 30s timeout, a sample error message, the two new flags
  (--skip-preflight / --confirm-cluster), and the minimum RBAC verbs
  needed to run preflight in isolation.

- CLAUDE.md:
  - 数据流 (data flow): preflight + cluster guard inserted ahead of the upgrade
    loop; explicit note that preflight does NOT honour cfg.Immediate
    (quality gate, not part of the upgrade chain).
  - Operator 抽象 (Operator abstraction) paragraph mentions PreflightBaseline and the leaf
    pkg/operator/preflight package created to avoid an import cycle.
  - 硬约束 (hard constraints): four new entries: PreflightBaseline must stay read-only
    and baseline-only (with a note on why CSV is intentionally
    excluded and where to add it back if needed); cmd.SilenceUsage
    is mandatory; --confirm-cluster matching rule (B1 default);
    BundleVersion regex chokepoint.

- Plan acceptance criteria all marked complete.

@alaudabot alaudabot left a comment

Contributor

PR Review: feat/preflight-check

Summary

This PR introduces a preflight check layer and cluster identity guard. The implementation is well-documented, thoroughly tested, and follows the repository's existing conventions. No blocking issues found.

Critical Issues: 0

Warnings (2)

  1. pkg/operator/operatorhub/preflight.go:34: installPlanTerminalPhases is a package-level map[string]struct{} modified via assignment on first use. While not a runtime hazard, a const map or struct init would make immutability intent explicit.

  2. pkg/config/config.go:20: bundleVersionRegex uses regexp.MustCompile which panics on invalid regex at import time. Document the regex format or switch to regexp.Compile with a hardcoded fallback.

Suggestions (4)

  1. cmd/preflight_error.go:36: RecommendedCleanup uses %q for shell-escaping. Verify busybox-based kubectl handles quoted resource names identically. Consider adding a shell-executability test.

  2. pkg/operator/operatorhub/preflight.go:119-127: checkInstallPlanResidue handles errors.IsNotFound on a List call. A List returning zero items is the idiomatic "nothing found" signal in K8s; consider removing the IsNotFound check.

  3. pkg/operator/operatorhub/preflight_test.go:44-63: newIPWithPhase hardcodes label key "operators.coreos.com/tektoncd-operator.test-ns" instead of deriving from operator's name/namespace. Refactor to accept parameters.

  4. pkg/operator/operatorhub/preflight_test.go:226-230 — Calling t.Errorf inside a reactor function may not propagate correctly from the fake client's goroutine. Move mutation detection to the main goroutine.
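
For suggestion 3, deriving the label key from parameters could look like this sketch. Both helper names are hypothetical, and the empty label value is an assumption about how OLM populates the label.

```go
package main

import "fmt"

// installPlanLabelKey builds the OLM-managed label key in the
// operators.coreos.com/<pkg>.<ns> format quoted in the PR description.
func installPlanLabelKey(pkg, ns string) string {
	return fmt.Sprintf("operators.coreos.com/%s.%s", pkg, ns)
}

// fixtureLabels is a hypothetical helper for newIPWithPhase-style fixtures.
// The empty value is an assumption; only the key format is given in the PR.
func fixtureLabels(pkg, ns string) map[string]string {
	return map[string]string{installPlanLabelKey(pkg, ns): ""}
}

func main() {
	fmt.Println(installPlanLabelKey("tektoncd-operator", "test-ns"))
	fmt.Println(fixtureLabels("tektoncd-operator", "test-ns"))
}
```

Sharing one key builder between the production selector and the test fixture keeps the two from drifting apart silently.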

Positive Feedback

  • Security-first design with single-chokepoint validation (bundleVersionRegex)
  • Excellent test coverage including the read-only contract enforcement
  • Clear architectural decisions documented in code comments
  • Excellent README documentation on preflight
  • Honest TOCTOU acknowledgment with "wait ~30s" hint

Status: PR is good to go. These are all non-blocking suggestions.

@alaudabot

alaudabot commented May 19, 2026

Contributor

🤖 AI Code Review

| Property | Value |
| --- | --- |
| Model | opencode/minimax-m2.5-free |
| Style | strict |
| Issues Found | 0 |
| Config Source | centralized |
| Profile | ❌ Not Found |
| Personalized Prompt | ❌ No |
| Prompt Path | .github/review/profiles/alaudadevops/upgrade-test/pr-review.md |
| Alauda Skills | ✅ base-acp-operator-list, base-acp-operator-release, base-authoring, base-m365, base-ocp-operator-list, base-skill-setup, builders-alauda-component-e2e-release, builders-alauda-component-upgrade, builders-alauda-pipeline, builders-claudetask-submit, builders-component-knowledge, builders-confluence, builders-dev-mesh-qa, builders-edge-ci-trace, builders-gitlab-ops, builders-helm-operator-generator, builders-install-cluster-plugin, builders-jira, builders-notify-wecom, builders-olm-operator-lifecycle, builders-prd-to-testcase, builders-publish-errata, builders-roadmap-studio, builders-story-split, builders-violet, builders-webapp-testing, cross-repo-add-mirror, cross-repo-publish, devops-add-bug-release-notes, devops-autodns, devops-bundle-csv-baseline-diff, devops-candidate-version-supervisor, devops-connectors-acceptance-test, devops-connectors-explore, devops-connectors-poc-case, devops-connectors-review, devops-connectors-unit-test, devops-connectors-upgrade-test, devops-connectors-write-user-docs, devops-creating-tekton-pipelines, devops-fix-go-vulns, devops-fork-alauda-binary-release, devops-gen-advanced-form-descriptors, devops-jira-rfd-acceptance, devops-knowledge-adoption, devops-pr-review, devops-refresh-containerfile-digests, devops-refresh-containerfile-tags, devops-replace-strings, devops-scan-docker-keywords, devops-sync-alauda-github-releases, devops-tekton-dynamic-form-optimizer, devops-tekton-operator-task-e2e, devops-tekton-pipeline-delivery, devops-tekton-refresh-results-tag, devops-tekton-task-delivery, devops-tekton-task-overview-template, devops-tekton-task-version-upgrade, devops-tekton-upgrade-notes, devops-tool-report-troubleshoot, devops-ui-e2e-code-audit, devops-ui-e2e-fix-base-on-report, devops-ui-e2e-regression-and-fix, devops-ui-generate-e2e-from-feature, devops-ui-pre-setup, devops-upgrade-go, devops-upstream-backport-cve, devops-upstream-upgrade |
| Reviewed at | 2026-05-19 12:55:26 UTC |

Summary

This PR implements a comprehensive preflight check system for the upgrade CLI, including OLM residue scanning (Subscription, ArtifactVersion, InstallPlan) and a cluster identity guard. The implementation is well-structured with good test coverage and proper separation of concerns. The code addresses the silent failure scenarios described in the PR description while maintaining backward compatibility through the --skip-preflight flag.

Review Statistics

| Category | Count |
| --- | --- |
| Critical Issues | 0 |
| Warnings | 2 |
| Suggestions | 3 |
| Files Reviewed | 14 |

Critical Issues

Issues that MUST be addressed before merging (security, bugs, breaking changes)

None identified.

Warnings

Issues that SHOULD be addressed but are not blocking

  • [pkg/operator/operatorhub/preflight_test.go:248] (style/dead-code): The compile-time import guard var _ = metav1.GetOptions{} is technically valid but unconventional. While it serves as documentation that the package is used, metav1 is already imported and used elsewhere in the file. Consider removing or replacing with a build tag comment if this pattern is important for the project.

  • [pkg/operator/operatorhub/preflight.go:19] (style/convention): installPlanTerminalPhases is a map that's never mutated. Consider using a constant map pattern instead:

    var installPlanTerminalPhases = map[string]struct{}{
        "Complete": {},
        "Failed":   {},
    }

    While functional, this could be made more explicit as a const if the language allows.
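
For context on the "const" remark: Go constants are limited to booleans, numbers, and strings, so a map cannot be declared const. One way to make the terminal-phase set immutable is a switch inside a function, sketched below. The function name is an assumption; the phase names are those listed in the PR description.

```go
package main

import "fmt"

// isTerminalPhase reports whether an InstallPlan phase needs no further
// action from OLM. The two terminal phases come from the PR description.
func isTerminalPhase(phase string) bool {
	switch phase {
	case "Complete", "Failed":
		return true
	default:
		// "", Planning, RequiresApproval, Installing and any unknown phase
		// count as non-terminal, so a new phase is reported rather than hidden.
		return false
	}
}

func main() {
	for _, p := range []string{"", "Planning", "RequiresApproval", "Installing", "Complete", "Failed"} {
		fmt.Printf("%q terminal=%v\n", p, isTerminalPhase(p))
	}
}
```

Unlike a package-level map, the switch cannot be mutated by other code in the package.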

Suggestions

Recommendations for improvement (nice to have)

  • [cmd/upgrade_command.go:190] (style/clarity): The nil-check cfg.OperatorConfig.Violet == nil is correct, but relies on the reader knowing Violet is a pointer. Consider adding a brief comment:

    // Violet is a pointer - check both nil and empty string
  • [pkg/operator/operatorhub/preflight.go:94] (style/naming): The _ config.Version parameter in checkArtifactVersionResidue is unused, consistent with the pattern in other check functions but reduces readability. Consider explicitly naming it _ as is done, but this is a minor style preference.

  • [pkg/operator/operatorhub/preflight.go:65] (refactor/clarity): The error message "preflight: %s: %w (transient, retry the run)" includes the kind name but could be more consistent with other error messages in the codebase. This is minor.

Positive Feedback

  • Security: The bundleVersionRegex validation (^[a-zA-Z0-9._-]+$) properly prevents shell injection - this is a critical security fix.
  • Test Coverage: Excellent coverage including the read-only contract test (TestPreflightBaseline_ReadOnly) that asserts no mutating operations occur.
  • Error Handling: Good transient error detection with actionable "retry the run" messages.
  • Architecture: Clean separation with preflight.Residual in a leaf package to avoid import cycles.
  • Cluster Guard: Comprehensive handling of both in-cluster and file-mode kubeconfig scenarios.
  • Documentation: The plan document and CLAUDE.md updates are thorough.

Review Note: The existing critical comment about isTransientAPIError being inaccessible is incorrect - the function is defined in artifact_version.go within the same package and is properly accessible. The seen map mentioned in a suggestion was already removed in prior commits as noted in the PR description.


ℹ️ About this review

This review was automatically generated using the run-actions workflow.

  • Shared prompt: .github/prompts/code-review.md
  • Config source: centralized
  • Profile path: Not Found
  • Profile ref: e75e733e9aa1b417a8b3c6441e53495dbcb418ad
  • No repository-specific prompt configured

@alaudabot alaudabot left a comment

Contributor

PR Review: feat: upgrade preflight check (OLM residue scan + cluster guard)

Critical Issues

  • pkg/operator/operatorhub/preflight.go:65: isTransientAPIError is called here but is defined in artifact_version.go:107 (sibling file, same package). The call compiles because Go resolves package-internal references at link time, but the dependency is implicit and undocumented in code. Future refactoring of artifact_version.go's internal helpers will silently break preflight with no compile-time warning. Per CLAUDE.md conventions (GVRs declared at the top of operator.go), shared helpers should be in a named utility file, not hidden behind cross-file package resolution.

Suggested fix: Move isTransientAPIError to a shared non-test utility file (e.g., pkg/operator/operatorhub/errors.go) and add a doc comment referencing the original.

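// errors.go: shared utility, not internal to artifact_version.go
package operatorhub // name assumed from the directory path

import "k8s.io/apimachinery/pkg/api/errors"

// isTransientAPIError reports whether err is a retryable apiserver error.
func isTransientAPIError(err error) bool {
	return errors.IsServerTimeout(err) ||
		errors.IsTooManyRequests(err) ||
		errors.IsServiceUnavailable(err)
}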

Warnings

  • cmd/upgrade_command.go:119: assertClusterMatch short-circuits when cfg.OperatorConfig.Violet == nil. This is correct behaviour (nil means not configured), but the pattern relies on the reader noticing the pointer type. A brief comment would prevent future confusion:
// Violet is nil when not configured in YAML — cluster guard is a no-op.
if cfg.OperatorConfig.Violet == nil || cfg.OperatorConfig.Violet.Clusters == "" {
	return nil
}
  • pkg/operator/operatorhub/preflight_test.go:248: A blank-identifier declaration var _ = metav1.GetOptions{} appears at the end of the file as a compile-time import guard. It is dead code: the metav1 import is already used by the actual tests. Please remove it.

Suggestions

  • pkg/operator/operatorhub/preflight.go:19: installPlanTerminalPhases is a var but is never mutated. Consider making it const.

  • pkg/operator/operatorhub/preflight.go:109: checkArtifactVersionResidue's second parameter _ config.Version is unused (same pattern as checkSubscriptionResidue and checkInstallPlanResidue). For consistency and readability, either remove it or use _ prefix to signal intent.

  • cmd/upgrade_command.go:200–204: runPreflight deduplicates residuals via a seen map, but map iteration order is non-deterministic. Since the primary output is copy-pasteable kubectl delete commands, deterministic ordering (e.g., sort by Kind then Name) would improve reproducibility. Not blocking.
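
If residuals are ever aggregated before printing, the ordering this suggestion asks for could be a small sort like the sketch below. The residual type and sortResiduals are hypothetical, and the Namespace tie-break is an addition beyond "Kind then Name".

```go
package main

import (
	"fmt"
	"sort"
)

// residual is a minimal stand-in for preflight.Residual.
type residual struct{ Kind, Namespace, Name string }

// sortResiduals orders by Kind, then Name, with Namespace as a final
// tie-break so the printed cleanup commands are stable across runs.
func sortResiduals(rs []residual) {
	sort.Slice(rs, func(i, j int) bool {
		a, b := rs[i], rs[j]
		if a.Kind != b.Kind {
			return a.Kind < b.Kind
		}
		if a.Name != b.Name {
			return a.Name < b.Name
		}
		return a.Namespace < b.Namespace
	})
}

func main() {
	rs := []residual{
		{Kind: "Subscription", Namespace: "tekton-operator", Name: "tektoncd-operator"},
		{Kind: "ArtifactVersion", Namespace: "cpaas-system", Name: "tektoncd-operator.v4.6.3"},
		{Kind: "InstallPlan", Namespace: "tekton-operator", Name: "install-4pbdf"},
	}
	sortResiduals(rs)
	for _, r := range rs {
		fmt.Println(r.Kind, r.Namespace, r.Name)
	}
}
```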

Positive

The 10-sub-test preflight suite (including the spy-reactor read-only contract assertion), the bundleVersion regex chokepoint, and the comprehensive plan document (docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md) are all excellent. The design decisions are well-documented and traceable.

Comment thread cmd/upgrade_command.go

logger.Info("operator type", zap.String("type", cfg.OperatorConfig.Type))

// Cluster identity guard: when violet.clusters is configured, the user

Contributor

Warning (style/clarity): The nil-check short-circuit is correct, but relies on the reader noticing Violet is a pointer. A one-line comment prevents future confusion when someone reads this guard in isolation.


// Confirm metav1.GetOptions is reachable here (import guard — if a future
// refactor drops the dependency by accident the test will catch it).
var _ = metav1.GetOptions{}

Contributor

Warning (style/dead-code): The blank-identifier declaration var _ = metav1.GetOptions{} is a compile-time import guard, but metav1 is already used by the actual test code. Dead code; please remove.

// preflightTimeout caps the entire PreflightBaseline call so one slow or
// hung apiserver does not block the upgrade flow for o.timeout (default 10m).
// 30s is generous enough for 3 Get/List against a healthy apiserver while
// keeping the "fail-fast within seconds" promise honest in the typical case.

Contributor

Suggestion (style/convention): installPlanTerminalPhases is never mutated — consider const instead of var.

Comment thread pkg/operator/operatorhub/preflight.go Outdated
return []preflight.Residual{{
Kind: "ArtifactVersion",
Namespace: systemNamespace,
Name: av.GetName(),

Contributor

Suggestion (style/naming): The _ config.Version parameter is unused, consistent with the other two check functions but reduces readability. Consider checkArtifactVersionResidue(ctx, _ config.Version) to signal intent explicitly, or removing the parameter if all callers are updated.

Comment thread cmd/upgrade_command.go
if err != nil {
return fmt.Errorf("read kubeconfig %s for cluster guard: %v", kubeconfig, err)
}
currentCtx := apiCfg.CurrentContext

Contributor

Suggestion (style/determinism): The seen map iteration order is non-deterministic, meaning the order of cleanup commands in the error output may vary across runs. For a tool whose primary output is copy-pasteable commands, sorting residuals (e.g., by Kind then Name) would improve reproducibility. Not blocking — just a note for future polish.

huanyang@alauda.io added 4 commits May 19, 2026 20:12
Three silent-failure paths fixed (P1), four hardening / cleanup changes
applied (P2). All P1 fixes close anti-patterns where preflight reported
"clean" while actually bypassing the guard it was built to enforce.

P1 — silent-failure closures:
- #13/#14 cluster guard: in-cluster mode now requires --confirm-cluster
  matching violet.clusters; empty kubeconfig CurrentContext is now a hard
  error with an actionable 'kubectl config use-context' hint. Previously
  both paths log.Warn'd and returned nil, defeating the guard exactly when
  it mattered most (CI pods with violet.clusters set, multi-context kubeconfigs).
- #15 BundleVersion required for operatorhub: empty value used to slip
  past validateConfig, producing 'operatorhub-foo.' AV names with a
  trailing dot — preflight reported clean while upgrade failed late with
  cryptic 404s. Required at load time.

P2 — internal hardening:
- #16 Drop the seen map from runPreflight: dead code under fail-fast
  semantics (architecture-strategist + simplicity-reviewer both flagged
  it independently). -14 LOC.
- #018 checkInstallPlanResidue no longer discards the NestedString error.
  status.phase type drift now surfaces a wrapped error instead of being
  swallowed and producing a 'every IP is non-terminal' noise storm.
- #019 New preflight.NewResidual constructor centralises kubectl cleanup
  template + %q shell-quoting. Three call sites in operatorhub/preflight.go
  go from 5-line struct literals to one-line constructor calls.
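
The #018 fix can be illustrated with a standard-library analogue of a nested string lookup that distinguishes "field missing" from "field has the wrong type". Both function names are hypothetical; the real code works on unstructured Kubernetes objects, which this sketch only imitates with plain maps.

```go
package main

import "fmt"

// nestedString is a stdlib-only analogue of the lookup the commit message
// refers to: found=false for a missing field, an error when the field
// exists but is not a string.
func nestedString(obj map[string]any, fields ...string) (string, bool, error) {
	var cur any = obj
	for _, f := range fields {
		m, ok := cur.(map[string]any)
		if !ok {
			return "", false, fmt.Errorf("%v: parent is %T, not an object", fields, cur)
		}
		if cur, ok = m[f]; !ok {
			return "", false, nil
		}
	}
	s, ok := cur.(string)
	if !ok {
		return "", false, fmt.Errorf("%v: value is %T, not a string", fields, cur)
	}
	return s, true, nil
}

// installPlanPhase surfaces type drift as a wrapped error instead of
// silently treating the plan as having an empty phase.
func installPlanPhase(ip map[string]any) (string, error) {
	phase, _, err := nestedString(ip, "status", "phase")
	if err != nil {
		return "", fmt.Errorf("read InstallPlan status.phase: %w", err)
	}
	return phase, nil // a missing phase reads as "", which is non-terminal
}

func main() {
	fmt.Println(installPlanPhase(map[string]any{"status": map[string]any{"phase": "Complete"}}))
	fmt.Println(installPlanPhase(map[string]any{"status": map[string]any{"phase": 7}}))
}
```

Discarding the error would make a numeric or object-valued phase look like an empty phase, which is how every plan could end up reported as non-terminal.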

P3 follow-ups (020-023) recorded as separate todos for later triage.

Tests for the cmd-layer changes (#17) land in the next commit.

Closes P2 #17 — the two cmd-layer functions added by PR #17 carried
zero coverage (pr-test-analyzer criticality 9).

Coverage:
- assertClusterMatch (10 sub-tests): violet nil / clusters empty no-ops;
  in-cluster missing/mismatched/matching --confirm-cluster; file-mode
  unreadable kubeconfig, empty current-context (pins the #14 fix in
  place), missing/mismatched/matching --confirm-cluster.
- runPreflight (5 sub-tests): clean cluster, empty Versions skipped,
  baseline-only invariant (regression guard: a refactor scanning all
  versions would surface mid-upgrade AVs as false positives), fail-fast
  across paths (second path's PreflightBaseline is verifiably not
  reached), operator errors propagate unwrapped (must NOT be wrapped
  as *PreflightError).
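
The baseline-only and fail-fast properties listed above can be sketched with a recording fake. Everything here is a simplified assumption: the real PreflightBaseline returns []preflight.Residual rather than []string, and the type names are invented for the example.

```go
package main

import (
	"context"
	"fmt"
)

type version struct{ BundleVersion string }
type upgradePath struct{ Versions []version }

// preflighter is the slice of OperatorInterface this sketch needs.
type preflighter interface {
	PreflightBaseline(ctx context.Context, v version) ([]string, error)
}

// fakeOperator records every version it is asked about and reports
// residue for the versions listed in dirty.
type fakeOperator struct {
	seen  []string
	dirty map[string]bool
}

func (f *fakeOperator) PreflightBaseline(_ context.Context, v version) ([]string, error) {
	f.seen = append(f.seen, v.BundleVersion)
	if f.dirty[v.BundleVersion] {
		return []string{"Subscription/" + v.BundleVersion}, nil
	}
	return nil, nil
}

// runPreflight scans only Versions[0] of each path and stops at the first dirty path.
func runPreflight(ctx context.Context, op preflighter, paths []upgradePath) error {
	for _, p := range paths {
		if len(p.Versions) == 0 {
			continue // nothing to scan for an empty path
		}
		residuals, err := op.PreflightBaseline(ctx, p.Versions[0])
		if err != nil {
			return err // operator errors propagate unwrapped
		}
		if len(residuals) > 0 {
			return fmt.Errorf("preflight failed: %v", residuals)
		}
	}
	return nil
}

// scan is a convenience wrapper using a background context.
func scan(op *fakeOperator, paths []upgradePath) error {
	return runPreflight(context.Background(), op, paths)
}

func main() {
	paths := []upgradePath{
		{Versions: []version{{BundleVersion: "v1"}, {BundleVersion: "v2"}}},
		{Versions: []version{{BundleVersion: "v3"}}},
	}
	op := &fakeOperator{dirty: map[string]bool{"v1": true}}
	fmt.Println(scan(op, paths), op.seen)
}
```

Recording the version argument, rather than counting calls, is what lets a test tell "scanned the baseline" apart from "scanned some version once".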

The fake operator records the Version passed to each PreflightBaseline
call so the baseline-only test asserts the actual contract rather than
just call count.

The 7 todos (013-019) that drove the silent-failure fixes + cmd-layer
tests are done; their substance lives in the PR description and the
commits that closed them (1201665 + f721f8e). Removing the files
keeps todos/ as a list of OPEN work, which is its only useful state.

Remaining in todos/: 4 P3 follow-ups for this PR (020-023) and 6 P2
items from prior PRs that have not been actioned yet.

The four P3 nice-to-haves (test variants, internal simplifications,
agent-ergonomics knobs, misc cleanups) are not on the roadmap and
filing them as todos would just decorate the backlog with low-value
items nobody plans to pick up. The PR description still records the
reviewer recommendations for archeology; the file copies were
redundant.
@yhuan123 yhuan123 requested a review from kycheng May 20, 2026 02:52
@yhuan123 yhuan123 merged commit 27ad600 into main May 20, 2026
4 checks passed
@yhuan123 yhuan123 deleted the feat/preflight-check branch May 20, 2026 03:23
3 participants