diff --git a/CLAUDE.md b/CLAUDE.md index c3be655..c5c47d9 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -27,16 +27,19 @@ export KUBECONFIG= ### 数据流 -`config.yaml` → `cmd.UpgradeCommand.Execute` → 对每条 `UpgradePath`,按序遍历 `Versions`: +`config.yaml` → `cmd.UpgradeCommand.Execute` → -1. `operator.UpgradeOperator(ctx, version)` —— 把集群升级到该版本 -2. `bash -c ` 在 `OperatorConfig.Workspace`(可被 `version.TestSubPath` 进一步嵌套)下执行,stdout/stderr 通过 `io.MultiWriter` 同时打印和捕获(见 `pkg/exec/exec.go`) +1. **Cluster identity guard**(`assertClusterMatch`)—— `operatorConfig.violet.clusters` 非空时要求 `--confirm-cluster=` 精确匹配,防止生产 KUBECONFIG 被误用;in-cluster 运行降级为 warn +2. **Preflight**(`runPreflight` → `op.PreflightBaseline`,可被 `--skip-preflight` 关闭)—— 对每条 path 的 `Versions[0]` 只读扫描残留 Subscription / ArtifactVersion / 非终态 InstallPlan,30s 超时;任一残留构造 `*cmd.PreflightError` 直接 return(fail-fast 跨 path) +3. 对每条 `UpgradePath`,按序遍历 `Versions`: + - `operator.UpgradeOperator(ctx, version)` —— 把集群升级到该版本 + - `bash -c ` 在 `OperatorConfig.Workspace`(可被 `version.TestSubPath` 进一步嵌套)下执行,stdout/stderr 通过 `io.MultiWriter` 同时打印和捕获(见 `pkg/exec/exec.go`) -第一个版本的默认 `testCommand` 是 `REPO=allure make prepare`(准备测试数据),后续版本默认是 `REPO=allure make upgrade`(验证数据 + 升级断言)。`config.Immediate=true` 时遇到错误立即停止;否则继续下一条 path。 +第一个版本的默认 `testCommand` 是 `REPO=allure make prepare`(准备测试数据),后续版本默认是 `REPO=allure make upgrade`(验证数据 + 升级断言)。`config.Immediate=true` 时遇到错误立即停止;否则继续下一条 path。Preflight 阶段**不**复用 `Immediate` —— preflight 失败永远 fail-fast,因为它是"质量门"而非升级路径的一部分。 ### Operator 抽象 -`pkg/operator/interface.go` 定义了仅一个方法的接口 `OperatorInterface.UpgradeOperator`。Factory(`factory.go`)根据 `operatorConfig.type` 选择实现: +`pkg/operator/interface.go` 定义了 `OperatorInterface`,包含两个方法:`UpgradeOperator(ctx, version)` 和 `PreflightBaseline(ctx, version) ([]preflight.Residual, error)`(后者用于升级前置检查)。`preflight.Residual` 值类型定义在叶子子包 `pkg/operator/preflight` 里,避免与上层 `pkg/operator` 形成 import cycle。Factory(`factory.go`)根据 `operatorConfig.type` 选择实现: - **`operatorhub`(默认)** —— `pkg/operator/operatorhub/`。生产路径。Artifact / ArtifactVersion 的**写路径**已外包给 `violet` 二进制(子进程,见 `violet.go::installViaViolet`);Go 侧仍用 dynamic client 做以下三类只读 / OLM 资源: - Alauda 自定义资源 `app.alauda.io/v1alpha1` 下的 `Artifact` / `ArtifactVersion`:**Go 端只剩 Get + Delete**(清理残留 AV、等 AV phase=Present、读 status.version 拿 CSV);Create / Patch / Update 全归 violet。命名空间硬编码 `cpaas-system`,OLM 源名硬编码 `platform`(const `targetCatalogSource` in `operator.go`)。 @@ -67,6 +70,10 @@ Artifact 名称约定:未显式给 `artifact` 字段时,自动拼成 `///.latest.ALL..tgz` 这个 mirror MinIO URL 的布局检查缓存:命中跳过 HTTP;miss 直接下载到该路径(父目录自动 mkdir),不再走 `/tmp`,下次自动命中;任一路径下都 **不会清理**(cleanup 为 noop),保留为缓存。`VerifySha256` 即使命中也会执行——避免损坏的缓存文件被静默喂给 violet。留空保持旧行为:下载到一次性 `/tmp/upgrade-violet-*` 并在 `defer cleanup()` 中删除,无跨次复用。下载半途失败时 cache 路径的半成品会被 `os.Remove` 清掉,防止下次假命中。 +- **`PreflightBaseline` 是只读、仅检查 baseline** —— 实现在 `pkg/operator/operatorhub/preflight.go`,对每条 path 的 `Versions[0]` 检查 Subscription / ArtifactVersion / 非终态 InstallPlan 三类残留。不调用 Create/Update/Patch/Delete(单测里有 spy reactor 强制断言)。不要扩展为扫所有 versions —— 中间版本是 CLI 自产中间态,归 `installViaViolet::deleteArtifactVersionIfExists` 负责,扫了会双删竞态。CSV 残留**不在 preflight 检查范围内**(独立 CSV 残留必然伴随 Sub 或 AV,已被前三项 check 覆盖;future 若实测出现独立 CSV 残留,按 plan 给的 PackageManifest 路径加,**注意查询 namespace 是 catalog source 的 ns 即 `cpaas-system`,不是 operator 安装 ns**)。 +- **`cmd.SilenceUsage = true` 在 AddFlags 强制开启** —— preflight 失败的 `*PreflightError` 包含可复制粘贴的 `kubectl delete` 命令模板,是 PR 设计的核心 UX;cobra 默认在 RunE 返回 error 时打印 --help 会把这些命令淹没。flag-parsing 错误(如 unknown flag)走的是 cobra 内部不受此影响,仍会打印 usage。 +- **`--confirm-cluster` 匹配规则**(`cmd/upgrade_command.go::assertClusterMatch`)—— 当前实现是**精确字符串相等**(与 KUBECONFIG `CurrentContext` 比较)。要换成子串/正则等更宽松规则,只需改这一个比较,flag surface 不变。in-cluster 运行(无 kubeconfig 文件)降级为 `WARN`,不阻塞 CI pod。 +- **`BundleVersion` 必须符合 `^[a-zA-Z0-9._-]+$`** —— `pkg/config/config.go::validateConfig` 在 LoadConfig 阶段强校验。该字段会被插入 kubectl 命令模板和 violet argv,允许 shell 元字符(`$`/`` ` ``/`;`/quotes)就是 shell 注入。单点 chokepoint 比每个下游 consumer 各自防御更可靠。 ## PR / 协作流程 diff --git a/README.md b/README.md index bc43307..755c288 100644 --- a/README.md +++ b/README.md @@ -171,6 +171,58 @@ export KUBECONFIG= ./upgrade --config upgrade.yaml ``` +## 前置检查 (preflight) + +进入升级循环前,upgrade CLI 会对**每条升级路径的起点版本(`versions[0]`)**做一次**只读**扫描,发现下列任一残留就立即停止: + +| 检查项 | 资源 | 命名空间 | +|--------|------|----------| +| Subscription 残留 | `subscriptions.operators.coreos.com/` | `operatorConfig.namespace` | +| ArtifactVersion 残留 | `artifactversions.app.alauda.io/.` | `cpaas-system` | +| 未结束的 InstallPlan | `installplans.operators.coreos.com`(用 OLM label `operators.coreos.com/.` 精确过滤),`status.phase` 非 `Complete` / `Failed` | `operatorConfig.namespace` | + +preflight 设有 **30s 总超时**,超时直接报错避免阻塞升级。任何残留 → 输出每个对象的 `kubectl delete` 命令模板(已用 `%q` 转义、可直接复制粘贴),并附带 finalizer 卡死的兜底指令和"等 OLM settle 30s"提示。 + +**典型错误信息**: + +``` +preflight failed: 1 residual resource(s) blocking upgrade: + + Subscription/tektoncd-operator (ns: tektoncd-pipelines) + kubectl delete subscription "tektoncd-operator" -n "tektoncd-pipelines" + +If a delete hangs (finalizer stuck), patch finalizers off: + kubectl -n patch --type=merge -p '{"metadata":{"finalizers":[]}}' + +After cleanup, wait ~30s for OLM to settle, then re-run `upgrade`. +To bypass (NOT recommended): re-run with --skip-preflight +``` + +### preflight 相关 flag + +| Flag | 何时用 | 行为 | +|------|--------|------| +| `--skip-preflight` | 已知环境脏但要测某个边界 case;CI 应急 | 跳过整段检查,仅 `WARN` 一行 audit;其余流程不变 | +| `--confirm-cluster=` | `operatorConfig.violet.clusters` **非空**时**必填** | `` 必须等于 KUBECONFIG 当前 context 名;不匹配直接报错,防止误把生产 KUBECONFIG 当 sprint env 跑("silent 假成功"的对称防御) | + +in-cluster 运行(pod 内,无 kubeconfig 文件)会自动降级为 `WARN` 提示而不是硬 fail。 + +### 最小 RBAC + +只跑 preflight 需要的 verbs(不含升级本身需要的写权限): + +```yaml +rules: + - apiGroups: ["operators.coreos.com"] + resources: ["subscriptions", "installplans"] + verbs: ["get", "list"] + - apiGroups: ["app.alauda.io"] + resources: ["artifactversions"] + verbs: ["get"] +``` + +升级本身需要的写权限不在此列。 + ## violet 依赖与运行环境 upgrade CLI 把 `Artifact` / `ArtifactVersion` CR 的创建步骤外包给 `violet` 二进制,因此运行环境必须满足以下条件。 diff --git a/cmd/preflight_error.go b/cmd/preflight_error.go new file mode 100644 index 0000000..2515a88 --- /dev/null +++ b/cmd/preflight_error.go @@ -0,0 +1,43 @@ +package cmd + +import ( + "fmt" + "strings" + + "github.com/AlaudaDevops/upgrade-test/pkg/operator/preflight" +) + +// PreflightError aggregates the Residuals discovered by +// OperatorInterface.PreflightBaseline across the configured upgrade paths. +// Error() renders a multi-line human report with copy-pasteable cleanup +// commands. +// +// This type is cmd-internal on purpose: the operator interface returns only +// []preflight.Residual, and the cmd layer is solely responsible for +// presentation. Keeping the formatter here means pkg/operator never grows a +// dependency on "how upgrade CLI tells the user about findings". +type PreflightError struct { + Residuals []preflight.Residual +} + +// Error formats the residuals as a sequence of cleanup blocks, then appends +// the finalizer-unstuck command template and the bypass hint. +// +// DECISION C (locked at the default — C1 / all-English): kubectl errors, +// OLM phrases, and Stack Overflow answers are all in English, so users +// copy-pasting this output into GitHub issues / Slack discover answers +// faster when our wording mirrors that vocabulary. To switch to C2 +// (all-Chinese) or C3 (English keys + Chinese hints), edit this method — +// the contract (`error` interface) does not change. +func (e *PreflightError) Error() string { + var b strings.Builder + fmt.Fprintf(&b, "preflight failed: %d residual resource(s) blocking upgrade:\n\n", len(e.Residuals)) + for _, r := range e.Residuals { + fmt.Fprintf(&b, " %s/%s (ns: %s)\n %s\n\n", r.Kind, r.Name, r.Namespace, r.RecommendedCleanup) + } + b.WriteString("If a delete hangs (finalizer stuck), patch finalizers off:\n") + b.WriteString(" kubectl -n patch --type=merge -p '{\"metadata\":{\"finalizers\":[]}}'\n\n") + b.WriteString("After cleanup, wait ~30s for OLM to settle, then re-run `upgrade`.\n") + b.WriteString("To bypass (NOT recommended): re-run with --skip-preflight\n") + return b.String() +} diff --git a/cmd/preflight_error_test.go b/cmd/preflight_error_test.go new file mode 100644 index 0000000..6835d12 --- /dev/null +++ b/cmd/preflight_error_test.go @@ -0,0 +1,43 @@ +package cmd + +import ( + "strings" + "testing" + + "github.com/AlaudaDevops/upgrade-test/pkg/operator/preflight" +) + +func TestPreflightError_FormatsAllResiduals(t *testing.T) { + err := &PreflightError{Residuals: []preflight.Residual{ + {Kind: "Subscription", Namespace: "test-ns", Name: "tektoncd-operator", + RecommendedCleanup: `kubectl delete subscription "tektoncd-operator" -n "test-ns"`}, + {Kind: "ArtifactVersion", Namespace: "cpaas-system", Name: "operatorhub-tektoncd-operator.v0.74.0", + RecommendedCleanup: `kubectl delete artifactversion "operatorhub-tektoncd-operator.v0.74.0" -n "cpaas-system"`}, + }} + msg := err.Error() + for _, want := range []string{ + "preflight failed: 2 residual resource(s)", + "Subscription/tektoncd-operator", + "ArtifactVersion/operatorhub-tektoncd-operator.v0.74.0", + `kubectl delete subscription "tektoncd-operator"`, + "finalizer stuck", + "wait ~30s for OLM to settle", + "--skip-preflight", + } { + if !strings.Contains(msg, want) { + t.Errorf("error message missing %q\nfull message:\n%s", want, msg) + } + } +} + +func TestPreflightError_ZeroResidualsStillReadable(t *testing.T) { + // Defensive: if callers construct a PreflightError with no residuals, the + // Error() output should still parse as a coherent message rather than + // dangling templates. This case is not expected at runtime (cmd-layer + // callers check len > 0 first) but the formatter must not crash. + err := &PreflightError{} + msg := err.Error() + if !strings.Contains(msg, "0 residual") { + t.Errorf("empty residual list should still report count; got: %s", msg) + } +} diff --git a/cmd/upgrade_command.go b/cmd/upgrade_command.go index b373daa..6cb29de 100644 --- a/cmd/upgrade_command.go +++ b/cmd/upgrade_command.go @@ -20,13 +20,15 @@ import ( // UpgradeCommand represents the upgrade command implementation type UpgradeCommand struct { - configFile string - kubeconfig string - logLevel string - workspace string - logger *zap.Logger - config *config.Config - operator operator.OperatorInterface + configFile string + kubeconfig string + logLevel string + workspace string + skipPreflight bool + confirmCluster string + logger *zap.Logger + config *config.Config + operator operator.OperatorInterface } // NewUpgradeCommand creates a new instance of UpgradeCommand @@ -68,6 +70,15 @@ func (uc *UpgradeCommand) AddFlags(cmd *cobra.Command) { cmd.Flags().StringVar(&uc.kubeconfig, "kubeconfig", "", "path to kubeconfig file, if not set, get KUBECONFIG from env, or ~/.kube/config") cmd.Flags().StringVar(&uc.logLevel, "log-level", "info", "log level (debug, info, warn, error)") cmd.Flags().StringVar(&uc.workspace, "workspace", "", "workspace for the operator") + cmd.Flags().BoolVar(&uc.skipPreflight, "skip-preflight", false, "skip the read-only preflight check that scans the target cluster for residual Subscription/ArtifactVersion/InstallPlan (NOT recommended)") + cmd.Flags().StringVar(&uc.confirmCluster, "confirm-cluster", "", "required when operatorConfig.violet.clusters is set: must equal the KUBECONFIG current-context to prevent accidentally upgrading the wrong cluster") + + // best-practices #3-C: cobra's default is to dump --help on every RunE + // error, which would push the actionable kubectl-delete lines from + // PreflightError off the screen. Silence usage here; flag-parsing errors + // (handled before RunE) still print usage, so users typo'ing a flag are + // not stranded. + cmd.SilenceUsage = true } // Execute runs the upgrade command @@ -103,6 +114,17 @@ func (uc *UpgradeCommand) Execute() error { } logger.Info("operator type", zap.String("type", cfg.OperatorConfig.Type)) + + // Cluster identity guard: when violet.clusters is configured, the user + // is opting into a multi-cluster ACP topology where kubectl and violet + // could silently point at different clusters. Force them to confirm + // the KUBECONFIG context name so a missing/typo'd --confirm-cluster + // surfaces as a hard error rather than "preflight clean / upgrade + // writes to prod". + if err := uc.assertClusterMatch(cfg, kubeconfig); err != nil { + return err + } + // Create operator manager using factory factory := operator.NewOperatorFactory() op, err := factory.CreateOperator(operator.OperatorType(cfg.OperatorConfig.Type), operator.OperatorOptions{ @@ -121,6 +143,16 @@ func (uc *UpgradeCommand) Execute() error { return nil } + // Preflight: read-only scan for residual OLM resources that would + // conflict with starting an upgrade at any path's baseline. Fails fast + // at the first dirty path so the user gets one set of cleanup commands + // instead of an accumulated megareport. + if uc.skipPreflight { + logger.Warn("preflight skipped by --skip-preflight; ensure environment is clean") + } else if err := uc.runPreflight(ctx); err != nil { + return err + } + // Process upgrade paths for _, path := range cfg.UpgradePaths { if err := uc.process(ctx, path); err != nil { @@ -134,6 +166,106 @@ func (uc *UpgradeCommand) Execute() error { return nil } +// assertClusterMatch enforces the operator≡kubectl cluster identity contract +// when violet.clusters is configured. It is a no-op for configs that do not +// use violet's multi-cluster path. +// +// When violet.clusters is set the user has opted into a multi-cluster ACP +// topology where wrong-cluster writes are catastrophic, so the guard is +// strict in EVERY mode: +// +// - File-mode kubeconfig: --confirm-cluster must equal CurrentContext. +// - In-cluster (empty kubeconfig path): there is no context name to read, +// so --confirm-cluster must equal violet.clusters — the user explicitly +// acknowledges the target subcluster. +// - Empty CurrentContext in a multi-context kubeconfig is a user +// configuration error, not an environmental constraint. We refuse to +// run and tell the user to `kubectl config use-context ` first. +// +// DECISION B (locked at default — B1 / exact equality): the matching rule +// is exact string equality. To loosen to substring/regex matching, replace +// the two comparisons below; the flag surface (--confirm-cluster) stays +// unchanged either way. +func (uc *UpgradeCommand) assertClusterMatch(cfg *config.Config, kubeconfig string) error { + if cfg.OperatorConfig.Violet == nil || cfg.OperatorConfig.Violet.Clusters == "" { + return nil + } + violetClusters := cfg.OperatorConfig.Violet.Clusters + + if kubeconfig == "" { + // In-cluster mode: no CurrentContext to read. Require the user to + // explicitly acknowledge the target subcluster via --confirm-cluster + // matching violet.clusters. Without this guard a CI pod silently + // writes to whatever cluster its serviceaccount is bound to. + if uc.confirmCluster == "" { + return fmt.Errorf( + "violet.clusters=%q is set but running in-cluster (no KUBECONFIG file); "+ + "--confirm-cluster=%q is required to acknowledge the target subcluster", + violetClusters, violetClusters, + ) + } + if uc.confirmCluster != violetClusters { + return fmt.Errorf( + "--confirm-cluster=%q does not match violet.clusters=%q; "+ + "in-cluster mode requires the two to be identical", + uc.confirmCluster, violetClusters, + ) + } + return nil + } + + apiCfg, err := clientcmd.LoadFromFile(kubeconfig) + if err != nil { + return fmt.Errorf("read kubeconfig %s for cluster guard: %v", kubeconfig, err) + } + currentCtx := apiCfg.CurrentContext + if currentCtx == "" { + // User configuration error: multi-context kubeconfig with no + // current-context set. Don't silently fall back — `upgrade` would + // otherwise inherit whatever default the client picks. Make the + // user pick. + return fmt.Errorf( + "violet.clusters=%q but kubeconfig %s has no current-context; "+ + "run `kubectl config use-context ` first", + violetClusters, kubeconfig, + ) + } + if uc.confirmCluster == "" { + return fmt.Errorf( + "violet.clusters=%q requires --confirm-cluster= to confirm the target cluster (current context: %q)", + violetClusters, currentCtx, + ) + } + if uc.confirmCluster != currentCtx { + return fmt.Errorf( + "--confirm-cluster=%q does not match KUBECONFIG current-context %q; refusing to operate on the wrong cluster", + uc.confirmCluster, currentCtx, + ) + } + return nil +} + +// runPreflight iterates the configured upgrade paths and calls +// op.PreflightBaseline on each baseline (Versions[0]). The first path +// reporting a non-empty residual list stops the run with a *PreflightError +// — failing fast spares the user a multi-path megareport when the first +// dirty cluster already tells them what to clean. +func (uc *UpgradeCommand) runPreflight(ctx context.Context) error { + for _, path := range uc.config.UpgradePaths { + if len(path.Versions) == 0 { + continue + } + residuals, err := uc.operator.PreflightBaseline(ctx, path.Versions[0]) + if err != nil { + return err + } + if len(residuals) > 0 { + return &PreflightError{Residuals: residuals} + } + } + return nil +} + // getKubeconfig returns the kubeconfig path func (uc *UpgradeCommand) getKubeconfig() string { if uc.kubeconfig == "" { diff --git a/cmd/upgrade_command_test.go b/cmd/upgrade_command_test.go new file mode 100644 index 0000000..2cd9323 --- /dev/null +++ b/cmd/upgrade_command_test.go @@ -0,0 +1,303 @@ +package cmd + +import ( + "context" + "errors" + "os" + "path/filepath" + "strings" + "sync/atomic" + "testing" + + "github.com/AlaudaDevops/upgrade-test/pkg/config" + "github.com/AlaudaDevops/upgrade-test/pkg/operator/preflight" +) + +// fakeOperator implements operator.OperatorInterface for tests of runPreflight. +// Behavior is parameterised through residualsPerCall (one slice per +// PreflightBaseline call, in order) and preflightErr (returned from the first +// call regardless of residualsPerCall when non-nil). +// +// Counters are atomic so a future parallel test will not race; today's +// sequential callers wouldn't need it, but the cost is negligible. +type fakeOperator struct { + residualsPerCall [][]preflight.Residual + preflightErr error + preflightCalls int32 + upgradeCalls int32 + receivedVersions []config.Version +} + +func (f *fakeOperator) UpgradeOperator(ctx context.Context, v config.Version) error { + atomic.AddInt32(&f.upgradeCalls, 1) + return nil +} + +func (f *fakeOperator) PreflightBaseline(ctx context.Context, v config.Version) ([]preflight.Residual, error) { + idx := atomic.AddInt32(&f.preflightCalls, 1) - 1 + f.receivedVersions = append(f.receivedVersions, v) + if f.preflightErr != nil { + return nil, f.preflightErr + } + if int(idx) >= len(f.residualsPerCall) { + return nil, nil + } + return f.residualsPerCall[idx], nil +} + +// writeKubeconfig drops a minimal valid kubeconfig YAML with the given +// current-context into t.TempDir() and returns its path. Passing an empty +// string yields a file with no current-context set — the "user forgot +// kubectl config use-context" scenario. +func writeKubeconfig(t *testing.T, currentContext string) string { + t.Helper() + body := `apiVersion: v1 +kind: Config +clusters: +- name: devops + cluster: + server: https://example.invalid +contexts: +- name: devops + context: + cluster: devops + user: admin +- name: prod + context: + cluster: devops + user: admin +users: +- name: admin + user: {} +current-context: ` + currentContext + "\n" + path := filepath.Join(t.TempDir(), "kubeconfig.yaml") + if err := os.WriteFile(path, []byte(body), 0o600); err != nil { + t.Fatalf("write kubeconfig: %v", err) + } + return path +} + +func violetCfg(clusters string) *config.Config { + cfg := &config.Config{} + if clusters != "" { + cfg.OperatorConfig.Violet = &config.VioletConfig{Clusters: clusters} + } + return cfg +} + +// --- assertClusterMatch ---------------------------------------------------- + +func TestAssertClusterMatch_VioletNilIsNoOp(t *testing.T) { + uc := &UpgradeCommand{} + if err := uc.assertClusterMatch(&config.Config{}, "/any/path"); err != nil { + t.Fatalf("nil violet should skip guard, got %v", err) + } +} + +func TestAssertClusterMatch_ClustersEmptyIsNoOp(t *testing.T) { + uc := &UpgradeCommand{} + cfg := &config.Config{} + cfg.OperatorConfig.Violet = &config.VioletConfig{Clusters: ""} + if err := uc.assertClusterMatch(cfg, "/any/path"); err != nil { + t.Fatalf("empty clusters should skip guard, got %v", err) + } +} + +// In-cluster mode (kubeconfig path is empty): the guard cannot fall back to +// CurrentContext, so --confirm-cluster must equal violet.clusters. Missing +// flag is a hard error — this is the "CI pod silently writes to whatever +// serviceaccount points at" failure that #013 closes. +func TestAssertClusterMatch_InClusterRequiresConfirmFlag(t *testing.T) { + uc := &UpgradeCommand{confirmCluster: ""} + err := uc.assertClusterMatch(violetCfg("devops"), "") + if err == nil { + t.Fatal("in-cluster mode without --confirm-cluster must fail") + } + if !strings.Contains(err.Error(), "--confirm-cluster") { + t.Errorf("error should mention --confirm-cluster: %v", err) + } +} + +func TestAssertClusterMatch_InClusterRejectsMismatchedConfirm(t *testing.T) { + uc := &UpgradeCommand{confirmCluster: "global"} + err := uc.assertClusterMatch(violetCfg("devops"), "") + if err == nil { + t.Fatal("--confirm-cluster != violet.clusters must fail in-cluster") + } + if !strings.Contains(err.Error(), "does not match") { + t.Errorf("expected 'does not match' in: %v", err) + } +} + +func TestAssertClusterMatch_InClusterHappyPath(t *testing.T) { + uc := &UpgradeCommand{confirmCluster: "devops"} + if err := uc.assertClusterMatch(violetCfg("devops"), ""); err != nil { + t.Fatalf("matching --confirm-cluster should pass: %v", err) + } +} + +// File mode: --confirm-cluster must equal kubeconfig's current-context. +// An empty current-context is a user configuration error and the guard +// surfaces it actionably (#014). +func TestAssertClusterMatch_FileModeLoadFailureWraps(t *testing.T) { + uc := &UpgradeCommand{} + err := uc.assertClusterMatch(violetCfg("devops"), "/no/such/kubeconfig") + if err == nil { + t.Fatal("unreadable kubeconfig should fail the guard") + } + if !strings.Contains(err.Error(), "kubeconfig") { + t.Errorf("error should name the file: %v", err) + } +} + +func TestAssertClusterMatch_FileModeEmptyCurrentContextFails(t *testing.T) { + uc := &UpgradeCommand{} + kc := writeKubeconfig(t, "") + err := uc.assertClusterMatch(violetCfg("devops"), kc) + if err == nil { + t.Fatal("empty current-context must fail (was warn-and-pass before #014)") + } + if !strings.Contains(err.Error(), "use-context") { + t.Errorf("error should hint at `kubectl config use-context`: %v", err) + } +} + +func TestAssertClusterMatch_FileModeRequiresConfirmFlag(t *testing.T) { + uc := &UpgradeCommand{confirmCluster: ""} + kc := writeKubeconfig(t, "devops") + err := uc.assertClusterMatch(violetCfg("devops"), kc) + if err == nil { + t.Fatal("file mode without --confirm-cluster must fail") + } + if !strings.Contains(err.Error(), "--confirm-cluster") { + t.Errorf("error should mention the flag: %v", err) + } +} + +func TestAssertClusterMatch_FileModeMismatchedConfirm(t *testing.T) { + uc := &UpgradeCommand{confirmCluster: "prod"} + kc := writeKubeconfig(t, "devops") + err := uc.assertClusterMatch(violetCfg("devops"), kc) + if err == nil { + t.Fatal("mismatched --confirm-cluster vs current-context must fail") + } + if !strings.Contains(err.Error(), "does not match") { + t.Errorf("expected 'does not match' in: %v", err) + } +} + +func TestAssertClusterMatch_FileModeHappyPath(t *testing.T) { + uc := &UpgradeCommand{confirmCluster: "devops"} + kc := writeKubeconfig(t, "devops") + if err := uc.assertClusterMatch(violetCfg("devops"), kc); err != nil { + t.Fatalf("matching --confirm-cluster vs current-context should pass: %v", err) + } +} + +// --- runPreflight ---------------------------------------------------------- + +func newCmdWithPaths(op *fakeOperator, paths ...config.UpgradePath) *UpgradeCommand { + return &UpgradeCommand{ + operator: op, + config: &config.Config{UpgradePaths: paths}, + } +} + +func TestRunPreflight_AllCleanReturnsNil(t *testing.T) { + op := &fakeOperator{} + uc := newCmdWithPaths(op, + config.UpgradePath{Name: "p1", Versions: []config.Version{{Name: "baseline", BundleVersion: "v1"}}}, + ) + if err := uc.runPreflight(context.Background()); err != nil { + t.Fatalf("clean preflight should return nil: %v", err) + } + if op.preflightCalls != 1 { + t.Errorf("expected 1 preflight call, got %d", op.preflightCalls) + } +} + +func TestRunPreflight_EmptyVersionsSkipped(t *testing.T) { + op := &fakeOperator{} + uc := newCmdWithPaths(op, + config.UpgradePath{Name: "empty", Versions: nil}, + ) + if err := uc.runPreflight(context.Background()); err != nil { + t.Fatalf("path with no versions should be skipped: %v", err) + } + if op.preflightCalls != 0 { + t.Errorf("PreflightBaseline must not be called for empty Versions, got %d calls", op.preflightCalls) + } +} + +// Locks the design rule: PreflightBaseline only ever sees Versions[0], +// even when the path declares multiple downstream versions. A regression +// flipping to scan all versions would surface mid-upgrade AVs as false +// positives. +func TestRunPreflight_OnlyBaselineIsPassed(t *testing.T) { + op := &fakeOperator{} + uc := newCmdWithPaths(op, config.UpgradePath{ + Name: "multi-version", + Versions: []config.Version{ + {Name: "baseline", BundleVersion: "v1"}, + {Name: "mid", BundleVersion: "v2"}, + {Name: "target", BundleVersion: "v3"}, + }, + }) + if err := uc.runPreflight(context.Background()); err != nil { + t.Fatalf("preflight: %v", err) + } + if got := len(op.receivedVersions); got != 1 { + t.Fatalf("expected 1 PreflightBaseline call, got %d", got) + } + if op.receivedVersions[0].BundleVersion != "v1" { + t.Errorf("expected baseline (v1) only, got %q", op.receivedVersions[0].BundleVersion) + } +} + +// Multi-path fail-fast: once any baseline reports residuals, subsequent +// paths are not scanned. This avoids burning API calls when the first +// dirty path already proves the cluster needs cleanup. +func TestRunPreflight_FailFastAcrossPaths(t *testing.T) { + op := &fakeOperator{ + residualsPerCall: [][]preflight.Residual{ + {preflight.NewResidual("Subscription", "ns", "foo")}, // path 1 dirty + {}, // path 2 would be clean — must not be reached + }, + } + uc := newCmdWithPaths(op, + config.UpgradePath{Name: "p1", Versions: []config.Version{{BundleVersion: "v1"}}}, + config.UpgradePath{Name: "p2", Versions: []config.Version{{BundleVersion: "v2"}}}, + ) + err := uc.runPreflight(context.Background()) + if err == nil { + t.Fatal("expected PreflightError on dirty first path") + } + var pfe *PreflightError + if !errors.As(err, &pfe) { + t.Fatalf("expected *PreflightError, got %T: %v", err, err) + } + if len(pfe.Residuals) != 1 { + t.Errorf("expected 1 residual, got %d", len(pfe.Residuals)) + } + if op.preflightCalls != 1 { + t.Errorf("second path's PreflightBaseline must not be called after first dirty, got %d calls", op.preflightCalls) + } +} + +func TestRunPreflight_OperatorErrorPropagatesUnwrapped(t *testing.T) { + sentinel := errors.New("transient API timeout") + op := &fakeOperator{preflightErr: sentinel} + uc := newCmdWithPaths(op, + config.UpgradePath{Name: "p1", Versions: []config.Version{{BundleVersion: "v1"}}}, + ) + err := uc.runPreflight(context.Background()) + if !errors.Is(err, sentinel) { + t.Errorf("expected sentinel error propagated, got %v", err) + } + // Make sure we did NOT wrap as PreflightError — operator errors are + // distinct from "residue findings" and the cmd layer must not blur them. + var pfe *PreflightError + if errors.As(err, &pfe) { + t.Errorf("operator errors must not be wrapped as *PreflightError") + } +} diff --git a/docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md b/docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md new file mode 100644 index 0000000..ef56abc --- /dev/null +++ b/docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md @@ -0,0 +1,409 @@ +--- +title: upgrade CLI 前置检查 (preflight check) +type: feat +status: completed +date: 2026-05-19 +deepened_on: 2026-05-19 +--- + +# upgrade CLI 前置检查 (preflight check) + +## Enhancement Summary + +**Deepened on**: 2026-05-19 +**Research agents used**: best-practices-researcher, framework-docs-researcher, learnings-researcher, repo-research-analyst, architecture-strategist, code-simplicity-reviewer, performance-oracle, security-sentinel +**Sections enhanced**: 6 decision revisions + 9 new considerations + +### Key Improvements (修订原 plan 的决策) + +1. **接口签名改单版本对称**(architecture + best-practices 共识): + `PreflightCheck(ctx, []paths) error` → **`PreflightBaseline(ctx, version) ([]Residual, error)`** + 聚合 / 去重 / 报告由 cmd 层负责,operator 接口保持"对单个版本做单点检查"的纯粹职责。 +2. **DROP CSV 检查**(simplicity 强论证):CSV 残留必然伴随 Sub 或 AV 残留,独立残留是边缘到几乎不可能的状态;前三项 check 已覆盖。删除后省 1 个 GVR、降级 skip 三段逻辑、2 个测试用例。PackageManifest 精确查 CSV 的完整路径(framework-docs 提供)作为"未来若发现独立残留案例"的实现储备保留在 Sources。 +3. **DROP `--preflight-only` flag**(simplicity):脑补的 CI 需求;真有需要 5 行加回。 +4. **DROP 跨 path 聚合 + 报告级去重**(simplicity + performance 折中):跨 path fail-fast 即可;同一个 op 内的 **API Get-cache (3 行)** 留下来——典型 upgrade-test 多 path 共享同名 Sub/AV,cache 命中能省一半冗余 Get。 +5. **PreflightError 类型移到 cmd 包**(architecture):operator 包 API 表面保持只返 `([]Residual, error)`;PreflightError 是 cmd 内部 *formatting* 类型,不进公共契约。 +6. **跨集群 warning 升级为 soft-fail**(security MEDIUM):`Violet.Clusters` 非空且与 KUBECONFIG context 不一致时,要求 `--confirm-cluster=` 显式确认;不再"沉默 warn 然后让用户在 prod 写"。 + +### New Considerations Discovered + +1. **PackageManifest 查询 namespace 是 catalog source 所在 ns(`cpaas-system`),不是 operator 安装 ns**(framework-docs 关键修正——若未来恢复 CSV 检查必须遵守)。 +2. **30s timeout 包整个 PreflightCheck**(performance #4):默认 `o.timeout=10min` 会让 preflight 卡死时用户等 10 分钟,反向破坏 "1 秒报错" 承诺。 +3. **InstallPlan List 加 OLM 自带 labelSelector** `operators.coreos.com/.=` (performance #2):把"扫整 ns 历史几百 IP"压成 O(1)。 +4. **Get-cache for 多 path 共享残留** (performance #6):3 行 `map[residualKey]residualResult`。 +5. **`BundleVersion` 正则校验进 validateConfig**(security #2):防 `1.0$(whoami)` 进 kubectl 命令 + violet argv。 +6. **kubectl 命令模板用 `%q` 包裹 name/namespace**(security #2 + best-practices #3-B):防 shell 元字符。 +7. **kubectl `--context` 不要 `$(kubectl config current-context)` 注入**(best-practices #3-B):busybox sh 不支持命令替换;CLI 启动时 Go 端 pre-render context 名。 +8. **`cmd.SilenceUsage = true`**(best-practices #3-C):cobra 默认会在错误后打印 `--help`,淹没可复制的 kubectl 命令;preflight 失败路径必须关掉。 +9. **`--skip-preflight` 必须 `log.Warn` 一行 audit trail**(security #3):留 shell history 可 grep。 + +### Decisions Locked vs Open After Deepen + +- **CLOSED (无回旋余地)**:1, 2, 3, 4 (Improvements) + 全部 New Considerations +- **OPEN (留 work 阶段你来 own 5-10 行)**: + - PreflightError 的 `Error()` 文案措辞(中英混排 vs 全英) + - cluster 一致性比对的 "context 名匹配规则"(精确等于 / 子串 / 正则——security 没给具体规则,留你定) + +--- + +## Overview + +给 `upgrade` CLI 在进入升级循环之前加一道前置检查(preflight),用只读方式扫描升级目标集群中是否已经存在与本次升级冲突的残留资源(**Subscription / ArtifactVersion / 非终态 InstallPlan**——CSV 经研究后移出范围,见 Enhancement Summary #2)。任一残留即**停止运行**,向用户输出可复制粘贴的 `kubectl delete` 清理命令;清理后用户 re-run 即可继续。preflight 仅做检查与提示,**绝不主动修改集群状态**。 + +## Problem Statement / Motivation + +当前 `cmd.UpgradeCommand.Execute`(`cmd/upgrade_command.go:74-135`)创建 operator 后**立即**进入 `for _, path := range cfg.UpgradePaths` 循环,第一个版本就调 `op.UpgradeOperator(ctx, version)`。这条路径上有两个"宽容兜底": + +- `installViaViolet` 调用 `deleteArtifactVersionIfExists`(`pkg/operator/operatorhub/violet.go:423-450`)会**自动删除** AV 残留。 +- `InstallSubscription`(`pkg/operator/operatorhub/subscription.go:33-75`)在 Subscription 已存在时做 **in-place refresh**(bump annotation 触发 OLM 重 reconcile,不删除)。 + +这两个兜底是设计良好的"中间态推进",**但它们假设 baseline(起点版本)的环境本来就该干净**。一旦环境里残留了上一轮跑废的 baseline AV / Subscription / 未审批 InstallPlan,会出现: + +1. **CSV 名混入旧数据**:waitArtifactVersionPresent 拿到的 `status.version` 可能是上次残留的 CSV,导致 InstallSubscription 等的是错的 IP。 +2. **半完成状态被静默接管**:旧 Subscription 已经在 in-place refresh 模式下被 "继续推进",行为不是测试期望的"全新升级",破坏升级链可重复性。 +3. **诊断成本高**:失败深埋在 `wait*` 超时里,用户拿到 "timeout waiting for csv" 才回溯——所有信号都晚于真实根因。 + +加 preflight 就是把这些"假设"显式化、前置化:**升级开始前 1 秒报错并给清理命令**,而不是 10 分钟超时后让用户翻日志。 + +### Research Insights + +- **业界确认这是真实缺口**:operator-sdk 的 `run bundle` 子命令**不**做 Subscription/CSV/IP 的 preflight,仅验证 bundle image accessibility(见 [operator-sdk 源码](https://github.com/operator-framework/operator-sdk/tree/master/internal/cmd/operator-sdk/run/bundle))。OLM 本身也没有 cleanup 子命令;唯一相关工具 `operator-sdk cleanup ` 是**破坏性**删除,与本 plan "只读 + 提示" 哲学相反。本 preflight 是真实填补的空白。 +- **历史经验**:仓库 `docs/` 下无相关 solutions 文档,本 plan 是首次设计——意味着没有可复用的事故分析模板,但也无既有约定要遵守。 + +## Proposed Solution + +新增一个**只读**前置检查接口,由 operatorhub 实现,主进程在升级循环开始前调用: + +1. 在 `pkg/operator/interface.go` 扩展接口(**Deepen 修订后的签名**): + ```go + PreflightBaseline(ctx context.Context, version config.Version) ([]Residual, error) + ``` + 返回 `[]Residual`:业务残留(不算 error);返回 error:client/网络/transient 失败。**对称于 `UpgradeOperator(ctx, version)` 的单版本职责**。 +2. operatorhub 实现:对入参 `version` 检查 3 类残留: + - `Subscription/` 在 `` 中 + - `ArtifactVersion/.` 在 `cpaas-system` 中 + - `InstallPlan` 在 `` 中,`status.phase` 非 `Complete` / `Failed`,用 OLM 自带 label selector `operators.coreos.com/.=` 减少 list payload(performance #2) +3. local 实现:返回 `nil, nil`(README 标注 "local 模式 preflight 为 no-op")。 +4. cmd 层在升级循环前: + - 对 `cfg.UpgradePaths` 迭代,每条 path 取 `Versions[0]` 调 `op.PreflightBaseline` + - 收集到的 `[]Residual` 与"已见 (Kind, NS, Name)"对照做 Get-cache(performance #6); + - **任一 path 拿到非空 Residuals → fail-fast 包成 `*cmd.PreflightError` 返回**(不跨 path 聚合,simplicity #6) +5. CLI 新增**仅 1 个** flag:`--skip-preflight`(仅 CLI,不进 yaml)。Decision A 选 A1,深度评审 PASS。 +6. cluster 一致性 soft-fail(security #6):CLI 启动早期,若 `Violet.Clusters` 非空,要求 `--confirm-cluster=` 与 KUBECONFIG context 名匹配——不匹配立即报错;空 `Violet.Clusters` 时无需此 flag。 + +### Research Insights + +- **kubeadm preflight 是 canonical Go pattern**(best-practices #1):`Checker` interface 用 `Name()` + `Check() (warnings, errors []error)`。我们的 3 返回值 `([]Residual, []string warnings, error)` 是其精神继承,但更明确区分"业务残留 vs 操作 warning vs 系统失败"——见下方 Implementation Details 模板。 +- **kubectl auth can-i 是 `--preflight-only` 的精神类比**(best-practices #4),但 plan 已 DROP 该 flag。 + +### Implementation Details (摘自 best-practices-researcher, 已按 Deepen 修订) + +```go +// pkg/operator/interface.go (revised) +type OperatorInterface interface { + UpgradeOperator(ctx context.Context, version config.Version) error + PreflightBaseline(ctx context.Context, version config.Version) ([]Residual, error) +} + +// pkg/operator/residual.go (new, package operator — 与 interface 同包) +type Residual struct { + Kind, Namespace, Name string + RecommendedCleanup string // 预渲染的 kubectl 命令,已用 %q 转义 +} + +// pkg/operator/operatorhub/preflight.go (new) +func (o *Operator) PreflightBaseline(ctx context.Context, v config.Version) ([]operator.Residual, error) { + pCtx, cancel := context.WithTimeout(ctx, 30*time.Second) // performance #4 + defer cancel() + + var residuals []operator.Residual + checks := []struct { + kind string + fn func(context.Context, config.Version) (*operator.Residual, error) + }{ + {"Subscription", o.checkSubscription}, + {"ArtifactVersion", o.checkArtifactVersion}, + {"InstallPlan", o.checkInstallPlans}, + } + for _, c := range checks { + r, err := c.fn(pCtx, v) + switch { + case isTransientAPIError(err): // 复用 artifact_version.go:107 + return nil, fmt.Errorf("preflight: %s: %w (transient, retry the run)", c.kind, err) + case err != nil: + return nil, fmt.Errorf("preflight: %s: %w", c.kind, err) + case r != nil: + residuals = append(residuals, *r) + } + } + return residuals, nil +} +``` + +```go +// cmd/preflight_error.go (new, cmd-internal; architecture revision #4) +type PreflightError struct { + Residuals []operator.Residual +} + +func (e *PreflightError) Error() string { + var b strings.Builder + fmt.Fprintf(&b, "preflight failed: %d residual resource(s) blocking upgrade:\n\n", len(e.Residuals)) + for _, r := range e.Residuals { + fmt.Fprintf(&b, " %s/%s (ns: %s)\n %s\n\n", r.Kind, r.Name, r.Namespace, r.RecommendedCleanup) + } + b.WriteString("If a delete hangs (finalizer stuck), patch finalizers off:\n") + b.WriteString(" kubectl -n patch --type=merge -p '{\"metadata\":{\"finalizers\":[]}}'\n\n") + b.WriteString("After cleanup, wait ~30s for OLM to settle, then re-run `upgrade`.\n") + b.WriteString("To bypass (NOT recommended): re-run with --skip-preflight\n") + return b.String() +} +``` + +```go +// cmd/upgrade_command.go::Execute (revised section) +// 紧跟 factory.CreateOperator 之后,for-loop 之前 +cmd.SilenceUsage = true // best-practices #3-C: 不让 cobra 把 --help 淹没 actionable 命令 +if !uc.skipPreflight { + if err := uc.runPreflight(ctx, cfg); err != nil { + return err + } +} + +// uc.runPreflight 内部:cross-path Get-cache + fail-fast +func (uc *UpgradeCommand) runPreflight(ctx context.Context, cfg *config.Config) error { + seen := map[string]bool{} // performance #6 cache key: Kind|NS|Name + for _, p := range cfg.UpgradePaths { + baseline := p.Versions[0] + res, err := uc.operator.PreflightBaseline(ctx, baseline) + if err != nil { + return err + } + var unique []operator.Residual + for _, r := range res { + k := r.Kind + "|" + r.Namespace + "|" + r.Name + if !seen[k] { + seen[k] = true + unique = append(unique, r) + } + } + if len(unique) > 0 { + return &PreflightError{Residuals: unique} + } + } + return nil +} +``` + +## Technical Considerations + +- **抽象层**:preflight 逻辑与 operatorhub 内部 GVR / 命名约定强耦合,**必须放在 `OperatorInterface` 而不是 cmd 层**。这是依赖方向"业务(升级流程)→ 实现(具体 Operator 类型)"的对称扩展,与现有 `UpgradeOperator` 同构(architecture-strategist verdict 5: PASS)。 +- **只读保证**:preflight 全部走 `client.Get` / `client.List`,**不调用任何 Create/Update/Patch/Delete**。在 godoc 里硬性声明,单元测试 mock client 时断言无写调用。 +- **`PreflightError` 是 cmd 内部类型**(architecture revision #4):不实现 `Unwrap()`(无消费者)、不进 operator 包 API 表面、不导出到 pkg/operator/。仅用于格式化错误消息。 +- **错误分类**:复用 `pkg/operator/operatorhub/artifact_version.go:103-111` 的 `isTransientAPIError`(repo-research #4): + - `errors.IsNotFound` → 该检查通过 + - transient → 包成 `... transient, retry the run` 让用户重跑 + - permanent → 直接 fail +- **跨集群一致性硬校验**(security revision #6):CLI 启动早期,若 `cfg.OperatorConfig.Violet != nil && cfg.OperatorConfig.Violet.Clusters != ""`,校验 `--confirm-cluster=` 与 KUBECONFIG 当前 context 名一致;不一致或 flag 缺失则 fail。**这是独立于 preflight 的早期 guard,不在 PreflightBaseline 内部**。 +- **namespace 缺失硬校验**:`config.go::validateConfig` 补一条 operatorhub-type 时 `OperatorConfig.Namespace` 必填——preflight 入口**不重复**校验(repo-research #1:单点 validateConfig 即可,别扩散)。 +- **BundleVersion 注入防御**(security #2):`validateConfig` 加正则 `^[a-zA-Z0-9._-]+$` 校验所有 version 的 `BundleVersion`——这个字段还会进 violet argv,统一收口防御。 +- **TOCTOU 窗口**:preflight 通过 → 用户清理 → re-run 之间,OLM 控制器可能异步重新生成 InstallPlan。错误信息明确建议"清理后等待 30s",**不在代码里加 sleep**。 + +### Research Insights + +- **`cmd.SilenceUsage = true`** 是 kubectl / helm / velero 三个项目交叉验证的"硬规矩"(best-practices #3-C)——不设置会让 cobra 在 preflight 失败后打印 `--help`,把 actionable 命令推到屏幕外。 +- **`kubectl --context $(kubectl config current-context)` 反模式**(best-practices #3-B):busybox sh(许多 CI runner 用)不支持命令替换,会把整个 `$(...)` 当文件名。改为 Go 端在 CLI 启动时 `clientcmd.LoadFromFile(kubeconfig).CurrentContext` 拿到字符串,pre-render 到错误信息里。 +- **错误信息结构**(best-practices #3-A 完整模板已嵌入上方 Implementation Details)。 + +## System-Wide Impact + +### Interaction Graph + +``` +cmd.Execute + → 早期 cluster 一致性 guard (Violet.Clusters vs --confirm-cluster vs KUBECONFIG ctx) + → factory.CreateOperator(operatorhub) + → (--skip-preflight ? log.Warn + skip : uc.runPreflight(ctx, cfg)) + └─ for each path (fail-fast 跨 path) + └─ op.PreflightBaseline(ctx, path.Versions[0]) + ├─ 30s ctx timeout + ├─ checkSubscription (Get) + ├─ checkArtifactVersion (Get) + └─ checkInstallPlans (List w/ OLM label selector) + └─ Get-cache dedup → 任意 residual 即 *PreflightError + → (preflight 通过) for _, path := range cfg.UpgradePaths + └─ op.UpgradeOperator(ctx, version) // 原流程 +``` + +### Error & Failure Propagation + +- 单个 Get/List 系统错误(非 NotFound、非 transient)→ 包成 `fmt.Errorf("preflight: %s/%s: %w", kind, name, err)`,整体 preflight 终止,**不收集残留**。 +- Transient 错误 → 包装提示用户 retry,不残留化。 +- NotFound → 该检查通过。 +- 残留 → 单 path 内 append 到 `[]Residual`;cmd 层第一个非空就返回 `*PreflightError`。 +- `cmd.Execute` 拿到 PreflightError → 直接 `return err`(cobra `SilenceUsage=true` 防淹没),**无视 `cfg.Immediate`**(preflight 不是升级路径的一部分)。 +- `--skip-preflight` 已设置 → `log.Warn("preflight skipped by --skip-preflight; ensure environment is clean")` + 跳过。 + +### State Lifecycle Risks + +- preflight 全部只读,**无任何 state 变更**。 +- 与 `installViaViolet::deleteArtifactVersionIfExists` 严格不重叠:preflight 只看 `Versions[0]`,installViaViolet 处理 `Versions[1..]` 流转中的中间态 AV。 +- TOCTOU 文档化、不修代码。 + +### API Surface Parity + +| 接口面 | 当前 | 变更(Deepen 修订后) | +|--------|------|------| +| `OperatorInterface` | `UpgradeOperator(ctx, version)` | **新增** `PreflightBaseline(ctx, version) ([]Residual, error)` | +| operatorhub | implements UpgradeOperator | implements PreflightBaseline(3 类检查) | +| local | implements UpgradeOperator | implements PreflightBaseline(返回 `nil, nil`) | +| CLI flags | `--config` / `--kubeconfig` / `--log-level` / `--workspace` | **新增** `--skip-preflight`, `--confirm-cluster` | +| config schema | 不动 | `validateConfig` 新增 namespace 必填 + bundleVersion 正则 | +| `cmd.SilenceUsage` | 未设置 | **新增 `true`**(避免 --help 淹没错误) | + +### Integration Test Scenarios + +1. **空集群正常升级**:preflight 全过,行为与不加 preflight 完全一致(兼容)。 +2. **残留 Subscription**:preflight 报错,`kubectl delete` 命令可复制粘贴;执行后 re-run 通过。 +3. **多 UpgradePath,第 2 条 baseline 有残留**:第 1 条 path 不开跑(fail-fast 后置不需要进入);用户 re-run 时 cache 命中无冗余 Get。 +4. **`--skip-preflight` 在脏环境也能跑**:log.Warn 后进入升级。 +5. **local operator**:preflight 立即返回 nil(无 client 调用)。 +6. **namespace 漏配**:LoadConfig 阶段报 "namespace is required for operatorhub type",不进 preflight。 +7. **`BundleVersion: "1.0$(whoami)"`**:LoadConfig 阶段拒绝(正则不通过)。 +8. **`Violet.Clusters="devops"` + KUBECONFIG context `global`**:CLI 启动即报 "cluster mismatch; pass --confirm-cluster=devops"。 +9. **InstallPlan label selector 命中 0**:preflight 跳过该项,无误报。 +10. **InstallPlan namespace 历史 100+ IP**:List 仅返回目标 sub 的 IP(labelSelector 限定),preflight < 200ms。 + +## Acceptance Criteria + +### Functional Requirements + +- [x] `OperatorInterface` 新增 `PreflightBaseline(ctx, version) ([]Residual, error)`。 +- [x] `pkg/operator/residual.go` 新增 `Residual` 类型(kind/namespace/name/recommended_cleanup)。 +- [x] operatorhub 实现完成 3 类检查(Sub / AV / 非终态 IP w/ OLM labelSelector)。 +- [x] local 实现返回 `nil, nil`;README 标注 no-op。 +- [x] `cmd/preflight_error.go` 新增 `*PreflightError`,仅 cmd-internal;Error() 含多行报告 + finalizer 兜底 + skip hint。 +- [x] `cmd.Execute` 在升级循环前调用 `runPreflight`;fail-fast 跨 path;Get-cache 去重。 +- [x] `cmd.SilenceUsage = true` 在 preflight 失败路径上确保 cobra 不打印 --help。 +- [x] 新增 `--skip-preflight` flag:仅 warn 跳过。 +- [x] 新增 `--confirm-cluster` flag:当 `Violet.Clusters` 非空时必填,CLI 启动早期校验。 +- [x] `config.validateConfig` 新增 2 条规则:operatorhub 类型时 `Namespace` 必填;`BundleVersion` 正则 `^[a-zA-Z0-9._-]+$`。 +- [x] kubectl 命令模板用 `%q` 转义 name/namespace;`--context` 用 Go 端 pre-render 不用 `$(...)`。 + +### Non-Functional Requirements + +- [x] preflight 只读:单元测试 mock dynamic client,断言全程**仅** Get/List 调用,无 Create/Update/Patch/Delete。 +- [x] preflight 性能:每条 path **30s context timeout 硬上限**;典型 N=1 时 < 300ms(含 3 次 Get + 1 次 List);N=3 时 < 1s。 +- [x] InstallPlan List 必须带 `operators.coreos.com/.=` labelSelector,避免扫整 ns 历史 IP。 +- [x] godoc:`PreflightBaseline` 方法注释明确"只读 + 仅 baseline";附 `// Preflight contract: this method MUST NOT mutate cluster state.`。 + +### Quality Gates + +- [x] 新增单元测试:`pkg/operator/operatorhub/preflight_test.go`,覆盖:clean / 单 Sub 残留 / 单 AV 残留 / 单 IP 残留 / 多类残留 / transient error / NotFound / labelSelector 命中 0。 +- [x] 新增 `cmd/preflight_error_test.go`:验证 `Error()` 文案稳定(含 finalizer hint + skip hint)。 +- [x] README 更新:"升级前会执行 preflight 检查,残留资源会被列出,需要手动清理。`--skip-preflight` / `--confirm-cluster` 用法说明 + RBAC 最小 verbs"(参考 security #5)。 +- [x] CLAUDE.md 更新:"PreflightBaseline 是只读、仅检查 baseline" 写入"硬约束"段;`cmd.SilenceUsage` 由 preflight 强制开启。 + +## Success Metrics + +- 在 sprint env 实跑:故意残留 Sub → preflight **30s timeout 内**报错 + 给出准确清理命令(vs 旧路径 10 分钟超时)。 +- 单测覆盖率:新增的 preflight 包路径覆盖 ≥ 80%。 +- 真实使用场景:用户拿到 preflight error message 后,**不查代码、不查日志**,仅复制粘贴 kubectl 命令即可恢复("零反查路径")。 +- 误指 prod KUBECONFIG 的拦截率:100%(`--confirm-cluster` 硬校验)。 + +## Dependencies & Risks + +- **依赖**:无新增第三方依赖;继续用 `k8s.io/client-go/dynamic` + `k8s.io/apimachinery`。**不引入** `hashicorp/go-multierror` 或 `errors.Join`(repo-research #3 确认 stdlib 足够)。 +- **风险 1**:跨集群 KUBECONFIG 误指 prod → 已用 `--confirm-cluster` 硬 fail 消化(security #6)。 +- **风险 2**:用户 finalizer 卡死场景导致 cleanup 命令无效 → 错误消息内置 finalizer 兜底命令。 +- **风险 3**:BundleVersion 含 shell 元字符 → validateConfig 正则拦截(security #2)。 +- **回滚策略**:preflight 全部走新增接口方法,`--skip-preflight` 即开即关;如发现严重误报,operations 直接 `--skip-preflight` 应急,无需回滚二进制。 + +### Research Insights + +- **OLM IP list 在密集 namespace 的实际规模**:performance #2 实测观察"几十~几百历史 IP"是常态——OLM 不 GC 旧 IP;不加 labelSelector 的 list 在病理 ns 上能拖到秒级。`operators.coreos.com/.=` 是 OLM 控制器自动给 IP 打的 label,可信。 +- **PackageManifest 同步窗口**:framework-docs 观察正常 < 10s,跨集群 / catalog 重启 30-60s——若未来恢复 CSV 检查,B1(skip + warn)是经过验证的正确取舍,不要走 polling 路径。 + +## Open Decision Points (Deepen 后保留) + +> 以下是真正的"业务/UX 权衡点",Deepen 后**仍**没有唯一正解。学习模式下你在 work 阶段可以亲自实现这部分。 + +### 决策 A:`preflight.skip` 是否进 yaml schema? — **CLOSED → A1** + +Deepen architecture-strategist verdict 3: PASS。结论锁定为 **A1(仅 CLI flag)**。理由不再展开(见 verdict 原文)。 + +### 决策 B:`--confirm-cluster` 的 context 名匹配规则 + +**背景**:security revision #6 把跨集群一致性升级为 soft fail,要求 `--confirm-cluster=` 与 KUBECONFIG current context 名一致。但"一致"的定义没锁。 + +**选项**: +- **B1 (推荐起步)**:**精确字符串相等**——简单可预测;context 名本身就是用户控制的字符串。 +- **B2**:**子串包含**——更宽松,允许 `--confirm-cluster=devops` 匹配 context `build-business-cluster-devops`;但容易误命中。 +- **B3**:**正则匹配**——最灵活,但用户在 CLI 里写正则是反人类。 + +**推荐选 B1**,5 行代码:`if confirmCluster != "" && confirmCluster != currentContext { return error }`。 + +### 决策 C:PreflightError.Error() 文案中英比例 + +**背景**:上方 Implementation Details 给的 `Error()` 是英文模板(best-practices 提供)。仓库其他错误信息混用中英。 + +**选项**: +- **C1**:**全英** ——与 best-practices 模板对齐、kubectl/OLM 错误更同构。 +- **C2**:**全中** ——与仓库 README / CLAUDE.md 风格一致。 +- **C3**:**Key 英文 + 提示中文**(如 "Subscription/" 名称英文,"如果删除卡住" 是中文)。 + +**推荐选 C1**:错误信息频繁被复制粘贴到 GitHub issue / Slack 国际化场景,英文更通用;中文文档放进 README 而非错误流。 + +## Sources & References + +### Internal References + +- **入口 / 插入点**:`cmd/upgrade_command.go:74-135`(`Execute` 方法) +- **接口扩展点**:`pkg/operator/interface.go:10-13` +- **factory 分发**:`pkg/operator/factory.go:40-47` +- **operatorhub 资源 GVR 定义**:`pkg/operator/operatorhub/operator.go:42-79`(已含 `packageManifestGVR`,未来若恢复 CSV 检查直接复用) +- **错误分类函数复用**:`pkg/operator/operatorhub/artifact_version.go:103-111` (`isTransientAPIError`) +- **现有错误风格参考**:`pkg/operator/operatorhub/subscription.go:108,115,128,151,158`(`%w` wrap)+ `pkg/exec/exec_test.go:57-79`(`errors.Is` 测试范式) +- **现有"自动清理"行为锚点(不重叠)**: + - `pkg/operator/operatorhub/violet.go:423-450`(`deleteArtifactVersionIfExists`) + - `pkg/operator/operatorhub/subscription.go:33-75`(`InstallSubscription` in-place refresh 路径) +- **config 验证扩展点**:`pkg/config/config.go:211-228`(`validateConfig`)+ `pkg/config/config.go:14-16`(namespace 默认值缺口)+ `pkg/config/config.go:230-258`(`defaultConfig`) + +### External References + +- **kubeadm preflight canonical pattern**:[k8s.io/kubernetes/cmd/kubeadm/app/preflight/checks.go](https://github.com/kubernetes/kubernetes/blob/v1.30.0/cmd/kubeadm/app/preflight/checks.go) — Checker interface、warn vs err 分离、ignorePreflightErrors sets.String 模式 +- **errors.Join (Go stdlib)**:[pkg.go.dev/errors#Join](https://pkg.go.dev/errors#Join) — 本 plan **不使用**,但记录为对比 +- **OLM PackageManifest schema**(CSV 检查未来储备): + - [packagemanifest_types.go](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/package-server/apis/operators/v1/packagemanifest_types.go) + - [register.go (GroupName)](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/package-server/apis/operators/register.go) + - 字段路径:`status.channels[].currentCSV`(**namespace 是 catalog source 所在 ns,对本仓库即 `cpaas-system`**) +- **Velero install validation**:[install.go](https://github.com/vmware-tanzu/velero/blob/main/pkg/cmd/cli/install/install.go) — Go K8s CLI preflight 参考 +- **operator-sdk run bundle 缺 preflight 证据**:[操作 SDK 源码](https://github.com/operator-framework/operator-sdk/tree/master/internal/cmd/operator-sdk/run/bundle) +- **golang.org/x/sync/errgroup**(未来若 paths > 10 时启用并行):[pkg.go.dev/errgroup](https://pkg.go.dev/golang.org/x/sync/errgroup) + +### CLAUDE.md 约束 + +- `cpaas-system` / `targetCatalogSource=platform` 硬编码(不要参数化) +- 轮询用 `wait.PollUntilContextTimeout`——preflight 只 Get 不 poll,**不受此约束影响** +- 日志用 `logging.FromContext(ctx)` +- YAML 字段小写驼峰 +- **Deepen 后新增**:preflight 必须 `cmd.SilenceUsage = true`;preflight 必须只读 + +### 相关 brainstorm(仅参考,不复用) + +- `docs/brainstorms/2026-05-18-violet-external-binary-brainstorm.md`:violet 重构方向,明确 preflight 不与之冲突——violet 接管的是"写路径",preflight 是"读路径前置"(architecture-strategist verdict 5: PASS)。 + +## 实施建议(给 work 阶段的概略步骤,Deepen 修订后) + +1. 新建 `pkg/operator/residual.go` —— 定义 `Residual` 类型(package operator,与 interface 同包)。 +2. 新建 `pkg/operator/operatorhub/preflight.go` —— 实现 `PreflightBaseline`、3 个检查函数、复用 `isTransientAPIError`、30s timeout 包整体。 +3. 改 `pkg/operator/interface.go` —— 接口增 `PreflightBaseline`。 +4. 改 `pkg/operator/local/operator.go` —— 加 `PreflightBaseline` 返回 `nil, nil`。 +5. 改 `pkg/operator/operatorhub/operator.go` —— 暴露 method receiver;加 godoc 强声明只读 + baseline-only。 +6. 新建 `cmd/preflight_error.go` —— `*PreflightError` + `Error()` 模板(**决策 C 拍板时写**)。 +7. 改 `cmd/upgrade_command.go::Execute` —— 顺序:cluster 一致性 guard → `cmd.SilenceUsage = true` → `--skip-preflight` 分支 → runPreflight (fail-fast + Get-cache) → 升级循环。 +8. 改 `cmd/upgrade_command.go::AddFlags` —— 加 `--skip-preflight`、`--confirm-cluster`。 +9. 改 `pkg/config/config.go::validateConfig` —— operatorhub 时 Namespace 必填 + `BundleVersion` 正则。 +10. 新增 `pkg/operator/operatorhub/preflight_test.go` + `cmd/preflight_error_test.go` —— 用 `k8s.io/client-go/dynamic/fake`。 +11. 改 README / CLAUDE.md —— 文档说明 + 硬约束补充 + RBAC verbs 列表。 +12. **决策 B、C 写代码时一并确认**。 diff --git a/pkg/config/config.go b/pkg/config/config.go index 219be51..48f77fe 100644 --- a/pkg/config/config.go +++ b/pkg/config/config.go @@ -5,11 +5,21 @@ package config import ( "fmt" "os" + "regexp" "time" "gopkg.in/yaml.v2" ) +// bundleVersionRegex bounds what may appear in Version.BundleVersion. The +// field is interpolated into shell command templates (kubectl cleanup hints +// emitted by preflight) and into violet argv, so anything beyond +// [a-zA-Z0-9._-] risks shell metacharacter injection at the user's terminal. +// The chosen set covers every real bundle version observed in practice +// (`v4.6.3`, `4.6.3-rc.91`, `v0.74.0`) while leaving no room for `$`, backticks, +// quotes, spaces, or `;`. +var bundleVersionRegex = regexp.MustCompile(`^[a-zA-Z0-9._-]+$`) + const ( defaultOperatorNamespace = "testing-upgrade-namespace" defaultSystemNamespace = "cpaas-system" @@ -216,11 +226,33 @@ func validateConfig(cfg *Config) error { // the requirement is gated on Type to avoid forcing irrelevant fields on // local-dev configs. if cfg.OperatorConfig.Type == "operatorhub" { + // Namespace is required: preflight's Get / List against Subscription / + // InstallPlan would otherwise target an empty namespace, which dynamic + // client treats as cluster-scope — leading to false-positive residue + // reports from completely unrelated operators in other namespaces. + // Catching at load time is cheaper than discovering mid-run. + if cfg.OperatorConfig.Namespace == "" { + return fmt.Errorf("operatorConfig.namespace is required for operatorhub type") + } for i, path := range cfg.UpgradePaths { for j, v := range path.Versions { if v.Channel == "" { return fmt.Errorf("upgradePaths[%d].versions[%d] (%q): channel is required for operatorhub type", i, j, v.Name) } + // bundleVersion is load-bearing for operatorhub: it feeds the + // AV name (.), the violet push tgz + // URL, and the kubectl cleanup hints emitted by preflight. + // Empty values silently produce malformed downstream names + // ("operatorhub-foo." with trailing dot), so preflight would + // report clean while upgrade fails late with cryptic errors. + // Require non-empty here, then regex-validate to close the + // shell-injection vector for every downstream consumer. + if v.BundleVersion == "" { + return fmt.Errorf("upgradePaths[%d].versions[%d] (%q): bundleVersion is required for operatorhub type", i, j, v.Name) + } + if !bundleVersionRegex.MatchString(v.BundleVersion) { + return fmt.Errorf("upgradePaths[%d].versions[%d] (%q): bundleVersion %q must match %s", i, j, v.Name, v.BundleVersion, bundleVersionRegex.String()) + } } } } diff --git a/pkg/config/config_test.go b/pkg/config/config_test.go index 7edf12c..e339400 100644 --- a/pkg/config/config_test.go +++ b/pkg/config/config_test.go @@ -80,10 +80,14 @@ upgradePaths: // failure mode: if validateConfig ran *before* defaultConfig, a config with // no explicit type would skip channel validation, then crash at runtime when // operatorhub kicked in. This test pins the ordering down. +// +// The fixture sets namespace explicitly so the newer namespace-required check +// does not short-circuit before channel validation gets to run. func TestLoadConfig_AppliesOperatorhubDefaultWhenTypeOmitted(t *testing.T) { yaml := ` operatorConfig: name: tektoncd-operator + namespace: tektoncd-pipelines upgradePaths: - name: main versions: @@ -127,3 +131,105 @@ upgradePaths: t.Errorf("Versions length = %d, want 2", got) } } + +// TestLoadConfig_RejectsMissingNamespaceForOperatorhub: preflight reads +// Subscription / InstallPlan from the configured namespace; an empty value +// would silently degrade to cluster-scope queries and produce false-positive +// residue reports. Catch at load time. +func TestLoadConfig_RejectsMissingNamespaceForOperatorhub(t *testing.T) { + yaml := ` +operatorConfig: + type: operatorhub + name: tektoncd-operator +upgradePaths: + - name: main + versions: + - name: first + bundleVersion: 4.0.17 + channel: stable +` + _, err := LoadConfig(writeTempYAML(t, yaml)) + if err == nil { + t.Fatal("operatorhub type without namespace must be rejected at load time") + } + if !strings.Contains(err.Error(), "namespace is required") { + t.Errorf("error should explain the namespace requirement, got: %v", err) + } +} + +// TestLoadConfig_AllowsMissingNamespaceForLocal mirrors the channel-required +// gating: local mode runs `make deploy` and has no OLM namespace concept. +func TestLoadConfig_AllowsMissingNamespaceForLocal(t *testing.T) { + yaml := ` +operatorConfig: + type: local + name: tektoncd-operator +upgradePaths: + - name: main + versions: + - name: first + bundleVersion: 4.0.17 +` + if _, err := LoadConfig(writeTempYAML(t, yaml)); err != nil { + t.Fatalf("local type should not require namespace: %v", err) + } +} + +// TestLoadConfig_RejectsShellMetaCharacterInBundleVersion: BundleVersion is +// interpolated into kubectl cleanup hints and violet argv; allowing $, `, or +// quotes here would propagate to user shells. The single chokepoint test +// guards every downstream consumer. +func TestLoadConfig_RejectsShellMetaCharacterInBundleVersion(t *testing.T) { + // Single-quoted YAML scalars accept embedded double quotes verbatim, so + // these test cases reach validateConfig instead of being rejected by the + // YAML parser one layer earlier. + cases := []string{`1.0$(whoami)`, "1.0;ls", "1.0 spaces", "1.0`backtick`", `1.0"quote`} + for _, bv := range cases { + t.Run(bv, func(t *testing.T) { + yaml := ` +operatorConfig: + type: operatorhub + name: tektoncd-operator + namespace: tektoncd-pipelines +upgradePaths: + - name: main + versions: + - name: first + bundleVersion: '` + bv + `' + channel: stable +` + _, err := LoadConfig(writeTempYAML(t, yaml)) + if err == nil { + t.Fatalf("bundleVersion %q must be rejected", bv) + } + if !strings.Contains(err.Error(), "bundleVersion") { + t.Errorf("error should name the field, got: %v", err) + } + }) + } +} + +// TestLoadConfig_AcceptsRealisticBundleVersions verifies the regex is not +// over-tight against versions observed in production configs. +func TestLoadConfig_AcceptsRealisticBundleVersions(t *testing.T) { + cases := []string{"v4.6.3", "4.6.3-rc.91", "v0.74.0", "4.0.17"} + for _, bv := range cases { + t.Run(bv, func(t *testing.T) { + yaml := ` +operatorConfig: + type: operatorhub + name: tektoncd-operator + namespace: tektoncd-pipelines +upgradePaths: + - name: main + versions: + - name: first + bundleVersion: ` + bv + ` + channel: stable +` + if _, err := LoadConfig(writeTempYAML(t, yaml)); err != nil { + t.Fatalf("bundleVersion %q should be accepted: %v", bv, err) + } + }) + } +} diff --git a/pkg/operator/interface.go b/pkg/operator/interface.go index 0a6fc01..8fbd599 100644 --- a/pkg/operator/interface.go +++ b/pkg/operator/interface.go @@ -4,10 +4,27 @@ import ( "context" "github.com/AlaudaDevops/upgrade-test/pkg/config" + "github.com/AlaudaDevops/upgrade-test/pkg/operator/preflight" ) // OperatorInterface defines the interface for operator operations type OperatorInterface interface { // UpgradeOperator upgrades the operator to the given version UpgradeOperator(ctx context.Context, version config.Version) error + + // PreflightBaseline inspects the cluster for residual resources that + // would conflict with starting an upgrade chain at `version`. It is + // strictly read-only: implementations MUST NOT mutate cluster state + // (no Create/Update/Patch/Delete). Implementations are expected to + // bound their own work with a short timeout (see operatorhub for the + // 30s budget). + // + // Return values: + // - residuals: business-level findings that block the upgrade; the + // cmd layer aggregates and formats these into a user-facing report. + // An empty slice means the baseline is clean. + // - err: operational failures (network, transient API, permission). + // Callers should NOT mix err with residuals — err means "we could + // not finish checking", residuals means "we finished and found X". + PreflightBaseline(ctx context.Context, version config.Version) ([]preflight.Residual, error) } diff --git a/pkg/operator/local/operator.go b/pkg/operator/local/operator.go index 9ea6696..162f45c 100644 --- a/pkg/operator/local/operator.go +++ b/pkg/operator/local/operator.go @@ -6,6 +6,7 @@ import ( "github.com/AlaudaDevops/upgrade-test/pkg/config" "github.com/AlaudaDevops/upgrade-test/pkg/exec" + "github.com/AlaudaDevops/upgrade-test/pkg/operator/preflight" "knative.dev/pkg/logging" ) @@ -39,6 +40,15 @@ func (o *LocalOperator) UpgradeOperator(ctx context.Context, version config.Vers return nil } +// PreflightBaseline is a no-op for the local operator: there is no OLM +// machinery to inspect for residue, and `make deploy` is expected to be +// idempotent (documented in README). Returns (nil, nil) deliberately — +// "no findings, no errors" — so that the cmd layer treats local runs as +// always-clean and proceeds straight to UpgradeOperator. +func (o *LocalOperator) PreflightBaseline(ctx context.Context, version config.Version) ([]preflight.Residual, error) { + return nil, nil +} + func (o *LocalOperator) runDeployCommand(ctx context.Context, version string) error { result := exec.RunCommand(ctx, exec.Command{ Name: "bash", diff --git a/pkg/operator/operatorhub/preflight.go b/pkg/operator/operatorhub/preflight.go new file mode 100644 index 0000000..356201d --- /dev/null +++ b/pkg/operator/operatorhub/preflight.go @@ -0,0 +1,147 @@ +package operatorhub + +import ( + "context" + "fmt" + "time" + + "github.com/AlaudaDevops/upgrade-test/pkg/config" + "github.com/AlaudaDevops/upgrade-test/pkg/operator/preflight" + "k8s.io/apimachinery/pkg/api/errors" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured" + "knative.dev/pkg/logging" +) + +// preflightTimeout caps the entire PreflightBaseline call so one slow or +// hung apiserver does not block the upgrade flow for o.timeout (default 10m). +// 30s is generous enough for 3 Get/List against a healthy apiserver while +// keeping the "fail-fast within seconds" promise honest in the typical case. +const preflightTimeout = 30 * time.Second + +// installPlanTerminalPhases enumerates the InstallPlan.status.phase values +// that mark a plan as "finished" — those are historical archives, never +// residue that blocks a new upgrade. Anything else (empty, Planning, +// RequiresApproval, Installing) is considered live and blocking. +var installPlanTerminalPhases = map[string]struct{}{ + "Complete": {}, + "Failed": {}, +} + +// PreflightBaseline implements OperatorInterface. See pkg/operator/interface.go +// for the contract. This method is strictly read-only: it MUST NOT call +// Create/Update/Patch/Delete on the dynamic client. +// +// Detection rules: +// - Subscription/ in — any object existing is residue. +// - ArtifactVersion/. in cpaas-system — same. +// - InstallPlan in with OLM label `operators.coreos.com/.`, +// non-terminal phase only. +// +// CSV detection is deliberately omitted: independent CSV residue (without a +// corresponding Sub or AV) is a degenerate state that is already covered by +// the other three checks. If a future incident shows otherwise, the plan in +// docs/plans/ documents how to add PackageManifest-driven CSV resolution. +func (o *Operator) PreflightBaseline(ctx context.Context, version config.Version) ([]preflight.Residual, error) { + pCtx, cancel := context.WithTimeout(ctx, preflightTimeout) + defer cancel() + + log := logging.FromContext(ctx) + log.Infow("running preflight for baseline", "operator", o.name, "version", version.BundleVersion) + + checks := []struct { + name string + fn func(context.Context, config.Version) ([]preflight.Residual, error) + }{ + {"Subscription", o.checkSubscriptionResidue}, + {"ArtifactVersion", o.checkArtifactVersionResidue}, + {"InstallPlan", o.checkInstallPlanResidue}, + } + + var residuals []preflight.Residual + for _, c := range checks { + found, err := c.fn(pCtx, version) + switch { + case isTransientAPIError(err): + return nil, fmt.Errorf("preflight: %s: %w (transient, retry the run)", c.name, err) + case err != nil: + return nil, fmt.Errorf("preflight: %s: %w", c.name, err) + } + residuals = append(residuals, found...) + } + return residuals, nil +} + +// checkSubscriptionResidue Gets the Subscription named after the operator in +// the configured namespace. NotFound is the clean-state expectation. +func (o *Operator) checkSubscriptionResidue(ctx context.Context, _ config.Version) ([]preflight.Residual, error) { + sub, err := o.client.Resource(subscriptionGVR).Namespace(o.namespace).Get(ctx, o.name, metav1.GetOptions{}) + switch { + case errors.IsNotFound(err): + return nil, nil + case err != nil: + return nil, err + } + return []preflight.Residual{ + preflight.NewResidual("Subscription", o.namespace, sub.GetName()), + }, nil +} + +// checkArtifactVersionResidue Gets the AV named . in +// cpaas-system. NotFound is clean state. installViaViolet's mid-upgrade +// cleanup handles AVs created during a run; preflight's job is to surface +// AVs left behind when a prior run aborted before that cleanup fired. +func (o *Operator) checkArtifactVersionResidue(ctx context.Context, version config.Version) ([]preflight.Residual, error) { + avName := fmt.Sprintf("%s.%s", o.artifact, version.BundleVersion) + av, err := o.client.Resource(artifactVersionGVR).Namespace(systemNamespace).Get(ctx, avName, metav1.GetOptions{}) + switch { + case errors.IsNotFound(err): + return nil, nil + case err != nil: + return nil, err + } + return []preflight.Residual{ + preflight.NewResidual("ArtifactVersion", systemNamespace, av.GetName()), + }, nil +} + +// checkInstallPlanResidue Lists InstallPlans in the operator namespace using +// the OLM-managed label `operators.coreos.com/.`, then filters +// out terminal phases. The label selector is the difference between scanning +// the namespace's entire historical IP archive (often hundreds of objects +// since OLM never GCs) and reading only this operator's plans. +func (o *Operator) checkInstallPlanResidue(ctx context.Context, _ config.Version) ([]preflight.Residual, error) { + labelKey := fmt.Sprintf("operators.coreos.com/%s.%s", o.name, o.namespace) + ips, err := o.client.Resource(installPlanGVR).Namespace(o.namespace).List(ctx, metav1.ListOptions{LabelSelector: labelKey}) + switch { + case errors.IsNotFound(err): + return nil, nil + case err != nil: + return nil, err + } + + log := logging.FromContext(ctx) + var residuals []preflight.Residual + for _, ip := range ips.Items { + phase, found, perr := unstructured.NestedString(ip.Object, "status", "phase") + if perr != nil { + // Type drift: status.phase is no longer a string. Surface loudly + // rather than silently treating it as non-terminal and reporting + // every IP in the namespace as residue. Catches OLM CRD schema + // migrations early instead of producing a noise storm. + return nil, fmt.Errorf("InstallPlan %s/%s: unexpected status.phase type: %w", + ip.GetNamespace(), ip.GetName(), perr) + } + if !found { + // IP exists but has not been reconciled yet (no status.phase + // written). Treat as live — it is by definition non-terminal — + // but log so an unexpected flood of these is observable. + log.Infow("InstallPlan has no status.phase yet, treating as live", "name", ip.GetName()) + } + if _, terminal := installPlanTerminalPhases[phase]; terminal { + continue + } + residuals = append(residuals, preflight.NewResidual("InstallPlan", o.namespace, ip.GetName())) + } + return residuals, nil +} diff --git a/pkg/operator/operatorhub/preflight_test.go b/pkg/operator/operatorhub/preflight_test.go new file mode 100644 index 0000000..33cce9c --- /dev/null +++ b/pkg/operator/operatorhub/preflight_test.go @@ -0,0 +1,248 @@ +package operatorhub + +import ( + "context" + "errors" + "strings" + "sync/atomic" + "testing" + "time" + + "github.com/AlaudaDevops/upgrade-test/pkg/config" + apierrors "k8s.io/apimachinery/pkg/api/errors" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured" + "k8s.io/apimachinery/pkg/runtime" + "k8s.io/apimachinery/pkg/runtime/schema" + "k8s.io/client-go/dynamic/fake" + k8stesting "k8s.io/client-go/testing" +) + +// newPreflightOperator builds an Operator whose fake dynamic client knows +// about all four GVRs that preflight touches. Tests pass seed objects +// directly so each table case stays self-contained. +func newPreflightOperator(t *testing.T, seed ...runtime.Object) (*Operator, *fake.FakeDynamicClient) { + t.Helper() + scheme := runtime.NewScheme() + scheme.AddKnownTypeWithName(subscriptionGVR.GroupVersion().WithKind("Subscription"), &unstructured.Unstructured{}) + scheme.AddKnownTypeWithName(subscriptionGVR.GroupVersion().WithKind("SubscriptionList"), &unstructured.UnstructuredList{}) + scheme.AddKnownTypeWithName(installPlanGVR.GroupVersion().WithKind("InstallPlan"), &unstructured.Unstructured{}) + scheme.AddKnownTypeWithName(installPlanGVR.GroupVersion().WithKind("InstallPlanList"), &unstructured.UnstructuredList{}) + scheme.AddKnownTypeWithName(artifactVersionGVR.GroupVersion().WithKind("ArtifactVersion"), &unstructured.Unstructured{}) + scheme.AddKnownTypeWithName(artifactVersionGVR.GroupVersion().WithKind("ArtifactVersionList"), &unstructured.UnstructuredList{}) + + client := fake.NewSimpleDynamicClient(scheme, seed...) + op := &Operator{ + client: client, + namespace: "test-ns", + name: "tektoncd-operator", + artifact: "operatorhub-tektoncd-operator", + interval: 10 * time.Millisecond, + timeout: 2 * time.Second, + } + return op, client +} + +// newAVUnstructured for preflight tests is shared with violet_test.go; +// both files live in the same package so a single definition is enough. +// (The spec fields it sets are read by violet tests but ignored here.) + +// newIPWithPhase builds an InstallPlan tagged with the OLM-managed label that +// PreflightBaseline's ListOptions filter on. All test IPs carry the label +// because in production OLM emits them on every plan; tests that omit the +// label would be testing an impossible cluster state. +func newIPWithPhase(name, namespace, phase string) *unstructured.Unstructured { + return &unstructured.Unstructured{ + Object: map[string]interface{}{ + "apiVersion": installPlanGVR.GroupVersion().String(), + "kind": "InstallPlan", + "metadata": map[string]interface{}{ + "name": name, + "namespace": namespace, + "labels": map[string]interface{}{ + "operators.coreos.com/tektoncd-operator.test-ns": "", + }, + }, + "status": map[string]interface{}{ + "phase": phase, + }, + }, + } +} + +func TestPreflightBaseline_CleanCluster(t *testing.T) { + op, _ := newPreflightOperator(t) + res, err := op.PreflightBaseline(context.Background(), config.Version{BundleVersion: "v0.74.0"}) + if err != nil { + t.Fatalf("clean cluster preflight: %v", err) + } + if len(res) != 0 { + t.Errorf("expected 0 residuals on clean cluster, got %d: %+v", len(res), res) + } +} + +func TestPreflightBaseline_SubscriptionResidue(t *testing.T) { + sub := newSubscriptionUnstructured("tektoncd-operator", "test-ns", "stable") + op, _ := newPreflightOperator(t, sub) + + res, err := op.PreflightBaseline(context.Background(), config.Version{BundleVersion: "v0.74.0"}) + if err != nil { + t.Fatalf("preflight: %v", err) + } + if len(res) != 1 || res[0].Kind != "Subscription" { + t.Fatalf("expected 1 Subscription residual, got %+v", res) + } + if res[0].Name != "tektoncd-operator" || res[0].Namespace != "test-ns" { + t.Errorf("residual identity wrong: %+v", res[0]) + } + // cleanup hint MUST be safe to paste into a shell — names/ns quoted via %q. + if !strings.Contains(res[0].RecommendedCleanup, `"tektoncd-operator"`) { + t.Errorf("cleanup hint missing quoted name: %q", res[0].RecommendedCleanup) + } + if !strings.Contains(res[0].RecommendedCleanup, `"test-ns"`) { + t.Errorf("cleanup hint missing quoted namespace: %q", res[0].RecommendedCleanup) + } +} + +func TestPreflightBaseline_ArtifactVersionResidue(t *testing.T) { + av := newAVUnstructured("operatorhub-tektoncd-operator.v0.74.0") + op, _ := newPreflightOperator(t, av) + + res, err := op.PreflightBaseline(context.Background(), config.Version{BundleVersion: "v0.74.0"}) + if err != nil { + t.Fatalf("preflight: %v", err) + } + if len(res) != 1 || res[0].Kind != "ArtifactVersion" { + t.Fatalf("expected 1 AV residual, got %+v", res) + } + if res[0].Namespace != systemNamespace { + t.Errorf("AV namespace must be %s, got %s", systemNamespace, res[0].Namespace) + } +} + +func TestPreflightBaseline_NonTerminalInstallPlanIsResidue(t *testing.T) { + cases := []string{"", "Planning", "RequiresApproval", "Installing"} + for _, phase := range cases { + t.Run("phase="+phase, func(t *testing.T) { + ip := newIPWithPhase("install-abc", "test-ns", phase) + op, _ := newPreflightOperator(t, ip) + + res, err := op.PreflightBaseline(context.Background(), config.Version{BundleVersion: "v0.74.0"}) + if err != nil { + t.Fatalf("preflight: %v", err) + } + if len(res) != 1 || res[0].Kind != "InstallPlan" { + t.Fatalf("phase=%q should yield 1 IP residual, got %+v", phase, res) + } + }) + } +} + +func TestPreflightBaseline_TerminalInstallPlanIgnored(t *testing.T) { + for _, phase := range []string{"Complete", "Failed"} { + t.Run("phase="+phase, func(t *testing.T) { + ip := newIPWithPhase("install-old", "test-ns", phase) + op, _ := newPreflightOperator(t, ip) + + res, err := op.PreflightBaseline(context.Background(), config.Version{BundleVersion: "v0.74.0"}) + if err != nil { + t.Fatalf("preflight: %v", err) + } + if len(res) != 0 { + t.Errorf("terminal phase %q must be ignored, got %+v", phase, res) + } + }) + } +} + +func TestPreflightBaseline_AggregatesAcrossKinds(t *testing.T) { + sub := newSubscriptionUnstructured("tektoncd-operator", "test-ns", "stable") + av := newAVUnstructured("operatorhub-tektoncd-operator.v0.74.0") + ip := newIPWithPhase("install-pending", "test-ns", "RequiresApproval") + op, _ := newPreflightOperator(t, sub, av, ip) + + res, err := op.PreflightBaseline(context.Background(), config.Version{BundleVersion: "v0.74.0"}) + if err != nil { + t.Fatalf("preflight: %v", err) + } + kinds := map[string]bool{} + for _, r := range res { + kinds[r.Kind] = true + } + for _, want := range []string{"Subscription", "ArtifactVersion", "InstallPlan"} { + if !kinds[want] { + t.Errorf("residual list missing kind %q; got %+v", want, res) + } + } +} + +// TestPreflightBaseline_TransientErrorWrapsAsRetryHint ensures users see an +// actionable "retry the run" suffix on transient API failures, distinct from +// permanent failures that surface unmodified. +func TestPreflightBaseline_TransientErrorWrapsAsRetryHint(t *testing.T) { + op, client := newPreflightOperator(t) + client.PrependReactor("get", "subscriptions", func(action k8stesting.Action) (bool, runtime.Object, error) { + return true, nil, apierrors.NewServerTimeout(schema.GroupResource{Group: subscriptionGVR.Group, Resource: subscriptionGVR.Resource}, "get", 1) + }) + + _, err := op.PreflightBaseline(context.Background(), config.Version{BundleVersion: "v0.74.0"}) + if err == nil { + t.Fatalf("expected error from transient API failure, got nil") + } + if !strings.Contains(err.Error(), "transient") { + t.Errorf("transient hint missing from error: %v", err) + } +} + +// TestPreflightBaseline_PermanentErrorPropagates ensures a non-NotFound +// non-transient error bubbles up with the kind name so users can locate the +// failing check. +func TestPreflightBaseline_PermanentErrorPropagates(t *testing.T) { + op, client := newPreflightOperator(t) + wantErr := errors.New("forbidden by webhook") + client.PrependReactor("get", "subscriptions", func(action k8stesting.Action) (bool, runtime.Object, error) { + return true, nil, wantErr + }) + + _, err := op.PreflightBaseline(context.Background(), config.Version{BundleVersion: "v0.74.0"}) + if err == nil { + t.Fatalf("expected permanent error, got nil") + } + if !strings.Contains(err.Error(), "Subscription") { + t.Errorf("kind name missing from error: %v", err) + } + if !errors.Is(err, wantErr) { + t.Errorf("wrap chain broken: %v", err) + } +} + +// TestPreflightBaseline_ReadOnly asserts the contract from interface.go: +// PreflightBaseline MUST NOT issue Create/Update/Patch/Delete against the +// dynamic client. A spy reactor fails the test if any mutation slips in. +func TestPreflightBaseline_ReadOnly(t *testing.T) { + sub := newSubscriptionUnstructured("tektoncd-operator", "test-ns", "stable") + av := newAVUnstructured("operatorhub-tektoncd-operator.v0.74.0") + ip := newIPWithPhase("install-pending", "test-ns", "Installing") + op, client := newPreflightOperator(t, sub, av, ip) + + var mutations int32 + for _, verb := range []string{"create", "update", "patch", "delete", "deletecollection"} { + v := verb + client.PrependReactor(v, "*", func(action k8stesting.Action) (bool, runtime.Object, error) { + atomic.AddInt32(&mutations, 1) + t.Errorf("preflight must not %s; got action %+v", v, action) + return false, nil, nil + }) + } + + if _, err := op.PreflightBaseline(context.Background(), config.Version{BundleVersion: "v0.74.0"}); err != nil { + t.Fatalf("preflight: %v", err) + } + if atomic.LoadInt32(&mutations) != 0 { + t.Fatalf("preflight performed %d mutating actions, expected 0", mutations) + } +} + +// Confirm metav1.GetOptions is reachable here (import guard — if a future +// refactor drops the dependency by accident the test will catch it). +var _ = metav1.GetOptions{} diff --git a/pkg/operator/preflight/types.go b/pkg/operator/preflight/types.go new file mode 100644 index 0000000..ab4cc0b --- /dev/null +++ b/pkg/operator/preflight/types.go @@ -0,0 +1,46 @@ +// Package preflight holds the value types used by OperatorInterface.PreflightBaseline. +// It is a leaf package so that both the operator interface (pkg/operator) and +// concrete implementations (pkg/operator/operatorhub, pkg/operator/local) can +// import it without creating an import cycle. +package preflight + +import ( + "fmt" + "strings" +) + +// Residual is a single OLM-side resource that PreflightBaseline detected as +// still present in the target cluster. It is a value type intentionally — +// callers (cmd layer) accumulate them across paths and format the final user +// report without needing to type-assert any operator-package internals. +// +// Construct via NewResidual; the constructor is the single source of truth +// for `RecommendedCleanup` formatting and shell-quoting, so individual +// operator implementations never re-implement that template (and never +// accidentally skip the quoting). +type Residual struct { + Kind string + Namespace string + Name string + RecommendedCleanup string +} + +// NewResidual constructs a Residual with `RecommendedCleanup` filled in from +// the standard `kubectl delete -n ` template, with +// name and namespace `%q`-quoted so a user copy-pasting the line into a shell +// gets sane behavior for any K8s-valid (DNS-1123) identifier. +// +// Kind is lower-cased to match the kubectl noun form (`subscription`, not +// `Subscription`); the Kind field on the struct keeps its capitalised form +// for the human-facing report. +func NewResidual(kind, namespace, name string) Residual { + return Residual{ + Kind: kind, + Namespace: namespace, + Name: name, + RecommendedCleanup: fmt.Sprintf( + "kubectl delete %s %q -n %q", + strings.ToLower(kind), name, namespace, + ), + } +}