feat: upgrade preflight check (OLM residue scan + cluster guard)#17
Conversation
- new pkg/operator/preflight package holds the Residual value type so both pkg/operator and its concrete impls can import without cycles - OperatorInterface gains PreflightBaseline(ctx, version) ([]preflight.Residual, error); contract: read-only, baseline-only - local: no-op returning (nil, nil) — README will document the assumption that `make deploy` must be idempotent - operatorhub: scaffolding stub; real check logic lands in follow-up - plan: docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md
Three read-only checks against the baseline of an upgrade path: - Subscription/<name> in <namespace> - ArtifactVersion/<artifact>.<bundleVersion> in cpaas-system - non-terminal InstallPlan in <namespace>, filtered by the OLM-managed label operators.coreos.com/<package>.<ns> so we never scan the namespace's full historical IP archive Behavior: - 30s context timeout caps the whole call so a hung apiserver cannot block the upgrade for the default o.timeout (10m) - transient API errors (ServerTimeout / TooManyRequests / ServiceUnavailable) are wrapped with a 'retry the run' hint - permanent errors propagate with the check name attached - kubectl cleanup hints quote both name and namespace via %q so copy-pasted commands stay shell-safe CSV residue is intentionally NOT checked: an independent CSV without a corresponding Subscription or ArtifactVersion is a degenerate state already covered by the other three. Plan documents how to add PackageManifest-driven CSV resolution if a future incident requires it. Tests cover clean cluster, single/multi residue, non-terminal phases (empty/Planning/RequiresApproval/Installing), terminal phases (Complete/Failed) ignored, transient + permanent error paths, and a read-only contract assertion (spy reactor fails the test on any Create/Update/Patch/Delete).
Two new load-time checks gated on Type == operatorhub: 1. operatorConfig.namespace must be non-empty. An empty namespace would let preflight's Subscription / InstallPlan queries silently degrade to cluster-scope, producing false-positive residue reports from completely unrelated operators. 2. Version.bundleVersion must match ^[a-zA-Z0-9._-]+$. The field is interpolated into kubectl cleanup hints emitted by preflight and into violet argv, so allowing shell metacharacters here would propagate $, backticks, semicolons or quotes straight into the user's terminal. Centralising the check at load time closes the injection vector for every downstream consumer with a single chokepoint. Both checks fire only for operatorhub type — local mode runs `make deploy` and has no equivalent risk surface. Tests cover: namespace required (operatorhub), namespace optional (local), shell-active characters rejected, realistic versions (v4.6.3 / 4.6.3-rc.91 / v0.74.0) accepted.
Three additions to the upgrade entry point, all ahead of the upgrade loop: 1. *PreflightError (cmd-internal) formats []preflight.Residual into a multi-line copy-pasteable report: one cleanup block per residual, then a finalizer-unstuck command template, then the bypass hint. The type stays in cmd/ so pkg/operator never picks up a dependency on how the CLI talks to the user. Decision C (all-English text) is the locked default; Error() is the single place to change wording without touching any contract. 2. assertClusterMatch enforces the operator≡kubectl cluster identity contract whenever operatorConfig.violet.clusters is set. The recommended rule (Decision B, locked at default B1) is exact string equality against KUBECONFIG's CurrentContext via --confirm-cluster. In-cluster runs (no kubeconfig file) degrade to a loud warning so CI pods aren't silently blocked. 3. runPreflight iterates UpgradePaths, calls op.PreflightBaseline on each baseline (Versions[0]), and fails fast at the first dirty path with a *PreflightError. (Kind, NS, Name) dedup is in place for future aggregation; --skip-preflight bypasses the whole step with a Warn-level audit line. Cobra's SilenceUsage is set in AddFlags so that PreflightError's actionable kubectl lines are not buried under --help output on a non-zero exit. Flag-parsing errors still print usage (they execute before RunE), so users who typo a flag are not stranded.
- README: new 'Preflight 前置检查' section documents what gets
scanned (Subscription / ArtifactVersion / non-terminal InstallPlan),
the 30s timeout, a sample error message, the two new flags
(--skip-preflight / --confirm-cluster), and the minimum RBAC verbs
needed to run preflight in isolation.
- CLAUDE.md:
- 数据流: preflight + cluster guard inserted ahead of the upgrade
loop; explicit note that preflight does NOT honour cfg.Immediate
(quality gate, not part of the upgrade chain).
- Operator 抽象 paragraph mentions PreflightBaseline and the leaf
pkg/operator/preflight package created to avoid an import cycle.
- 硬约束: four new entries — PreflightBaseline must stay read-only
and baseline-only (with a note on why CSV is intentionally
excluded and where to add it back if needed); cmd.SilenceUsage
is mandatory; --confirm-cluster matching rule (B1 default);
BundleVersion regex chokepoint.
- Plan acceptance criteria all marked complete.
alaudabot
left a comment
There was a problem hiding this comment.
PR Review: feat/preflight-check
Summary
This PR introduces a preflight check layer and cluster identity guard. The implementation is well-documented, thoroughly tested, and follows the repository's existing conventions. No blocking issues found.
Critical Issues: 0
Warnings (2)
-
pkg/operator/operatorhub/preflight.go:34—installPlanTerminalPhasesis a package-levelmap[string]struct{}modified via assignment on first use. While not a runtime hazard, a const map or struct init would make immutability intent explicit. -
pkg/config/config.go:20—bundleVersionRegexusesregexp.MustCompilewhich panics on invalid regex at import time. Document the regex format or switch toregexp.Compilewith a hardcoded fallback.
Suggestions (4)
-
cmd/preflight_error.go:36—RecommendedCleanupuses%qfor shell-escaping. Verify busybox-based kubectl handles quoted resource names identically. Consider adding a shell-executability test. -
pkg/operator/operatorhub/preflight.go:119-127—checkInstallPlanResiduehandleserrors.IsNotFoundon a List call. A List returning zero items is the idiomatic "nothing found" signal in K8s; consider removing theIsNotFoundcheck. -
pkg/operator/operatorhub/preflight_test.go:44-63—newIPWithPhasehardcodes label key"operators.coreos.com/tektoncd-operator.test-ns"instead of deriving from operator's name/namespace. Refactor to accept parameters. -
pkg/operator/operatorhub/preflight_test.go:226-230— Callingt.Errorfinside a reactor function may not propagate correctly from the fake client's goroutine. Move mutation detection to the main goroutine.
Positive Feedback
- Security-first design with single-chokepoint validation (
bundleVersionRegex) - Excellent test coverage including the read-only contract enforcement
- Clear architectural decisions documented in code comments
- Excellent README documentation on preflight
- Honest TOCTOU acknowledgment with "wait ~30s" hint
Status: PR is good to go. These are all non-blocking suggestions.
🤖 AI Code Review
SummaryThis PR implements a comprehensive preflight check system for the upgrade CLI, including OLM residue scanning (Subscription, ArtifactVersion, InstallPlan) and a cluster identity guard. The implementation is well-structured with good test coverage and proper separation of concerns. The code addresses the silent failure scenarios described in the PR description while maintaining backward compatibility through the Review Statistics
Critical Issues
None identified. Warnings
Suggestions
Positive Feedback
Review Note: The existing critical comment about ℹ️ About this reviewThis review was automatically generated using the
|
alaudabot
left a comment
There was a problem hiding this comment.
PR Review: feat: upgrade preflight check (OLM residue scan + cluster guard)
Critical Issues
- pkg/operator/operatorhub/preflight.go:65 —
isTransientAPIErroris called here but is defined inartifact_version.go:107(sibling file, same package). The call compiles because Go resolves package-internal references at link time, but the dependency is implicit and undocumented in code. Future refactoring ofartifact_version.go's internal helpers will silently break preflight with no compile-time warning. Per CLAUDE.md conventions (GVRs declared at the top ofoperator.go), shared helpers should be in a named utility file, not hidden behind cross-file package resolution.
Suggested fix: Move isTransientAPIError to a shared non-test utility file (e.g., pkg/operator/operatorhub/errors.go) and add a doc comment referencing the original.
// errors.go — shared utility, not internal to artifact_version.go
func isTransientAPIError(err error) bool {
return errors.IsServerTimeout(err) ||
errors.IsTooManyRequests(err) ||
errors.IsServiceUnavailable(err)
}
Warnings
- cmd/upgrade_command.go:119 —
assertClusterMatchshort-circuits whencfg.OperatorConfig.Violet == nil. This is correct behaviour (nil means not configured), but the pattern relies on the reader noticing the pointer type. A brief comment would prevent future confusion:
// Violet is nil when not configured in YAML — cluster guard is a no-op.
if cfg.OperatorConfig.Violet == nil || cfg.OperatorConfig.Violet.Clusters == "" {
return nil
}
- pkg/operator/operatorhub/preflight_test.go:248 — A blank-line
var _ = metav1.GetOptions{}appears at the end of the file as a compile-time import guard. It is dead code: themetav1import is already used by the actual tests. Please remove it.
Suggestions
-
pkg/operator/operatorhub/preflight.go:19 —
installPlanTerminalPhasesis avarbut is never mutated. Consider making itconst. -
pkg/operator/operatorhub/preflight.go:109 —
checkArtifactVersionResidue's second parameter_ config.Versionis unused (same pattern ascheckSubscriptionResidueandcheckInstallPlanResidue). For consistency and readability, either remove it or use_prefix to signal intent. -
cmd/upgrade_command.go:200–204 —
runPreflightdeduplicates residuals via aseenmap, but map iteration order is non-deterministic. Since the primary output is copy-pasteablekubectl deletecommands, deterministic ordering (e.g., sort by Kind then Name) would improve reproducibility. Not blocking.
Positive
The 10-sub-test preflight suite (including the spy-reactor read-only contract assertion), the bundleVersion regex chokepoint, and the comprehensive plan document (docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md) are all excellent. The design decisions are well-documented and traceable.
|
|
||
| logger.Info("operator type", zap.String("type", cfg.OperatorConfig.Type)) | ||
|
|
||
| // Cluster identity guard: when violet.clusters is configured, the user |
There was a problem hiding this comment.
Warning (style/clarity): The nil-check short-circuit is correct, but relies on the reader noticing Violet is a pointer. A one-line comment prevents future confusion when someone reads this guard in isolation.
|
|
||
| // Confirm metav1.GetOptions is reachable here (import guard — if a future | ||
| // refactor drops the dependency by accident the test will catch it). | ||
| var _ = metav1.GetOptions{} |
There was a problem hiding this comment.
Warning (style/dead-code): The blank-line var _ = metav1.GetOptions{} is a compile-time import guard but metav1 is already used by the actual test code. Dead code — please remove.
| // preflightTimeout caps the entire PreflightBaseline call so one slow or | ||
| // hung apiserver does not block the upgrade flow for o.timeout (default 10m). | ||
| // 30s is generous enough for 3 Get/List against a healthy apiserver while | ||
| // keeping the "fail-fast within seconds" promise honest in the typical case. |
There was a problem hiding this comment.
Suggestion (style/convention): installPlanTerminalPhases is never mutated — consider const instead of var.
| return []preflight.Residual{{ | ||
| Kind: "ArtifactVersion", | ||
| Namespace: systemNamespace, | ||
| Name: av.GetName(), |
There was a problem hiding this comment.
Suggestion (style/naming): The _ config.Version parameter is unused, consistent with the other two check functions but reduces readability. Consider checkArtifactVersionResidue(ctx, _ config.Version) to signal intent explicitly, or removing the parameter if all callers are updated.
| if err != nil { | ||
| return fmt.Errorf("read kubeconfig %s for cluster guard: %v", kubeconfig, err) | ||
| } | ||
| currentCtx := apiCfg.CurrentContext |
There was a problem hiding this comment.
Suggestion (style/determinism): The seen map iteration order is non-deterministic, meaning the order of cleanup commands in the error output may vary across runs. For a tool whose primary output is copy-pasteable commands, sorting residuals (e.g., by Kind then Name) would improve reproducibility. Not blocking — just a note for future polish.
Three silent-failure paths fixed (P1), four hardening / cleanup changes applied (P2). All P1 fixes close anti-patterns where preflight reported "clean" while actually bypassing the guard it was built to enforce. P1 — silent-failure closures: - #13/#14 cluster guard: in-cluster mode now requires --confirm-cluster matching violet.clusters; empty kubeconfig CurrentContext is now a hard error with an actionable 'kubectl config use-context' hint. Previously both paths log.Warn'd and returned nil, defeating the guard exactly when it mattered most (CI pods with violet.clusters set, multi-context kubeconfigs). - #15 BundleVersion required for operatorhub: empty value used to slip past validateConfig, producing 'operatorhub-foo.' AV names with a trailing dot — preflight reported clean while upgrade failed late with cryptic 404s. Required at load time. P2 — internal hardening: - #16 Drop the seen map from runPreflight: dead code under fail-fast semantics (architecture-strategist + simplicity-reviewer both flagged it independently). -14 LOC. - #018 checkInstallPlanResidue no longer discards the NestedString error. status.phase type drift now surfaces a wrapped error instead of being swallowed and producing a 'every IP is non-terminal' noise storm. - #019 New preflight.NewResidual constructor centralises kubectl cleanup template + %q shell-quoting. Three call sites in operatorhub/preflight.go go from 5-line struct literals to one-line constructor calls. P3 follow-ups (020-023) recorded as separate todos for later triage. Tests for the cmd-layer changes (#17) land in the next commit.
Closes P2 #17 — the two cmd-layer functions added by PR #17 carried zero coverage (pr-test-analyzer criticality 9). Coverage: - assertClusterMatch (10 sub-tests): violet nil / clusters empty no-ops; in-cluster missing/mismatched/matching --confirm-cluster; file-mode unreadable kubeconfig, empty current-context (pins the #14 fix in place), missing/mismatched/matching --confirm-cluster. - runPreflight (5 sub-tests): clean cluster, empty Versions skipped, baseline-only invariant (regression guard: a refactor scanning all versions would surface mid-upgrade AVs as false positives), fail-fast across paths (second path's PreflightBaseline is verifiably not reached), operator errors propagate unwrapped (must NOT be wrapped as *PreflightError). The fake operator records the Version passed to each PreflightBaseline call so the baseline-only test asserts the actual contract rather than just call count.
The 7 todos (013-019) that drove the silent-failure fixes + cmd-layer tests are done; their substance lives in the PR description and the commits that closed them (1201665 + f721f8e). Removing the files keeps todos/ as a list of OPEN work, which is its only useful state. Remaining in todos/: 4 P3 follow-ups for this PR (020-023) and 6 P2 items from prior PRs that have not been actioned yet.
The four P3 nice-to-haves (test variants, internal simplifications, agent-ergonomics knobs, misc cleanups) are not on the roadmap and filing them as todos would just decorate the backlog with low-value items nobody plans to pick up. The PR description still records the reviewer recommendations for archeology; the file copies were redundant.
Summary
Preflight residue scan: read-only check before the upgrade loop. For each
UpgradePath's baseline (Versions[0]), Get the operator's Subscription, Get the baseline ArtifactVersion (<artifact>.<bundleVersion>incpaas-system), and List non-terminal InstallPlans filtered by OLM'soperators.coreos.com/<pkg>.<ns>label. Any residue → fail with copy-pasteablekubectl deletecommands (names/ns escaped via%q, generated by the centralisedpreflight.NewResidualconstructor), a finalizer-unstuck template, and a "wait 30s for OLM" hint. 30s context timeout caps the entire call.Cluster identity guard (strict in every mode after code review): when
operatorConfig.violet.clustersis set, the user must pass--confirm-cluster=<NAME>. The matching rule is mode-aware:--confirm-clustermust equalCurrentContext. EmptyCurrentContextis treated as a user error with an actionablekubectl config use-context <name>hint (no silent fallback).CurrentContextto read, so--confirm-clustermust equalviolet.clusters— the user explicitly acknowledges the target subcluster. A CI pod withviolet.clusters: devopsbut no--confirm-clusteris now a hard error instead of a log line. This closes the "silent 假成功" symmetry that motivated the guard in the first place.CSV intentionally excluded from preflight: independent CSV residue (without Sub or AV) is a degenerate state covered by the other three checks. PackageManifest-driven CSV resolution is documented in the plan as a future-storage option if a real incident requires it.
Config hardening:
pkg/config/config.go::validateConfignow requiresnamespacefor operatorhub type, requiresbundleVersionfor operatorhub type (empty value used to produceoperatorhub-foo.AV names with a trailing dot — preflight reported clean while upgrade failed late with cryptic 404s), and enforcesbundleVersionagainst^[a-zA-Z0-9._-]+$(the field is interpolated into kubectl + violet argv, so shell metacharacters here are an injection vector — single chokepoint closes it for every downstream consumer).CLI ergonomics:
cmd.SilenceUsage = trueinAddFlagsso cobra does not dump--helpover the actionable kubectl-delete lines on a non-zero exit. Flag-parsing errors still print usage (they execute before RunE), so typo'd flags are not stranded.Code Review Fixes (P1 + P2 closed)
Multi-agent review (8 reviewers in parallel — architecture, performance, security, simplicity, silent-failure, type-design, pr-test, agent-native) surfaced 11 actionable findings; 3 P1 + 4 P2 fixed in commits
1201665andf721f8e. P3 follow-ups recorded for later triage.P1 — silent-failure closures (preflight was reporting "clean" while bypassing its own guard):
CurrentContextused tolog.Warnthenreturn nil. Both now hard-fail with actionable error messages. The previous behaviour defeated the guard exactly where it mattered most (CI pods, multi-context kubeconfigs).validateConfig. The downstream AV-name interpolation produced malformedoperatorhub-foo.names →Getreturned NotFound → preflight falsely reported clean while upgrade crashed later with cryptic errors. Now required at load time.P2 — internal hardening:
seenmap fromrunPreflight: dead code under fail-fast semantics (architecture-strategist + simplicity-reviewer independently flagged it).-14 LOC.checkInstallPlanResidueno longer discards theNestedStringerror.status.phasetype drift now surfaces a wrapped error instead of being swallowed and producing an "every IP is non-terminal" noise storm — protects against OLM CRD schema migrations.preflight.NewResidualconstructor: centralises kubectl cleanup template +%qshell-quoting. Three call sites inoperatorhub/preflight.gogo from 5-line struct literals to one-line constructor calls. The shell-quoting invariant is now compiler-protected.assertClusterMatch+runPreflightcarried zero coverage (pr-test-analyzer criticality 9). Newcmd/upgrade_command_test.goadds 15 sub-tests including baseline-only invariant (regression guard against "scan all versions") and fail-fast verification (second path'sPreflightBaselineprovably not reached).P3 follow-ups recorded as todos for later:
todos/020-pending-p3-test-coverage-extensions.md— transient error type variants, IP label fixture drift, regex edge casestodos/021-pending-p3-preflight-simplifications.md— checks slice → sequential calls, phases map → switchtodos/022-pending-p3-agent-ergonomics-followups.md—--preflight-output=json, exit-code taxonomy, optional--preflight-onlytodos/023-pending-p3-misc-cleanups.md—SilenceUsageplacement, defensive zero-residual test, local Info logReal-Cluster Validation
Verified against
env3-devopscluster with livetektoncd-operator(Subscription intekton-operatornamespace, multiple historical ArtifactVersions incpaas-system, terminal InstallPlaninstall-4pbdf):tekton-operator/tektoncd-operator<artifact>.<bundleVersion>matchtektoncd-operator.v4.11.0-beta.78.gcec2b43operators.coreos.com/<pkg>.<ns>Complete) filtered out%qquoting in kubectl hintskubectl delete subscription "tektoncd-operator" -n "tekton-operator"ready to pastestatus.phaseis unchanged across runs1Error: preflight failed: ...→ stderrTesting
cmd/upgrade_command_test.go(10assertClusterMatch× all branches + 5runPreflight× baseline-only / fail-fast / error propagation / empty-versions / clean cluster + 2PreflightErrorformatter).pkg/operator/operatorhub/preflight_test.gocovering clean cluster, single residue per kind, multi-kind aggregation, non-terminal IP phases (""/ Planning / RequiresApproval / Installing), terminal IP phases (Complete / Failed) ignored, transient API error wrap with "retry the run" hint, permanent error wrap chain (errors.Is), and a read-only contract assertion (spy reactor on create/update/patch/delete/deletecollection fails the test on any mutation).pkg/config/config_test.gofor new validators (shell metachar rejection × 5, realistic version acceptance × 4, namespace required for operatorhub, namespace optional for local, empty bundleVersion rejected for operatorhub).go test ./...+go vet ./...all pass with no regressions in existing violet / subscription / config tests.Plan & Decisions
docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md— included in the first commit so reviewers can map every design choice back to its source.--confirm-clustermatching rule): locked at B1 = exact string equality. Implementation note incmd/upgrade_command.go::assertClusterMatchdocuments how to switch to substring/regex without changing the flag surface.cmd/preflight_error.go::Error()documents the swap point if C2 / C3 is preferred later.Post-Deploy Monitoring & Validation
preflight failed:lines.violet.clustersbut missing--confirm-cluster: hard fail at startup is the new behaviour — pipeline configs may need a one-time update../upgrade --config <real-upgrade.yaml>should reach the upgrade loop within ~2s of starting.kubectl applya stray Subscription, then re-run — expect a non-zero exit with the exactkubectl deletecleanup line printed.--confirm-cluster=wrong-nameagainst a real KUBECONFIG; expect a hard-fail with the actualcurrent-contextshown.--skip-preflight. Code-level mitigation: revert this PR (no schema migration, no on-disk state).RBAC (smallest verb set for preflight only)
The upgrade itself still requires the existing write permissions.