From 68c26a4329872b236065a93cc5a317da4690d4bf Mon Sep 17 00:00:00 2001 From: "huanyang@alauda.io" Date: Tue, 19 May 2026 18:20:51 +0800 Subject: [PATCH 01/10] feat(operator): scaffold PreflightBaseline interface MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - new pkg/operator/preflight package holds the Residual value type so both pkg/operator and its concrete impls can import without cycles - OperatorInterface gains PreflightBaseline(ctx, version) ([]preflight.Residual, error); contract: read-only, baseline-only - local: no-op returning (nil, nil) — README will document the assumption that `make deploy` must be idempotent - operatorhub: scaffolding stub; real check logic lands in follow-up - plan: docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md --- ...05-19-feat-upgrade-preflight-check-plan.md | 409 ++++++++++++++++++ pkg/operator/interface.go | 17 + pkg/operator/local/operator.go | 10 + pkg/operator/operatorhub/preflight.go | 19 + pkg/operator/preflight/types.go | 21 + 5 files changed, 476 insertions(+) create mode 100644 docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md create mode 100644 pkg/operator/operatorhub/preflight.go create mode 100644 pkg/operator/preflight/types.go diff --git a/docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md b/docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md new file mode 100644 index 0000000..833e6e1 --- /dev/null +++ b/docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md @@ -0,0 +1,409 @@ +--- +title: upgrade CLI 前置检查 (preflight check) +type: feat +status: active +date: 2026-05-19 +deepened_on: 2026-05-19 +--- + +# upgrade CLI 前置检查 (preflight check) + +## Enhancement Summary + +**Deepened on**: 2026-05-19 +**Research agents used**: best-practices-researcher, framework-docs-researcher, learnings-researcher, repo-research-analyst, architecture-strategist, code-simplicity-reviewer, performance-oracle, security-sentinel +**Sections enhanced**: 6 decision revisions + 9 new considerations + +### Key Improvements (修订原 plan 的决策) + +1. **接口签名改单版本对称**(architecture + best-practices 共识): + `PreflightCheck(ctx, []paths) error` → **`PreflightBaseline(ctx, version) ([]Residual, error)`** + 聚合 / 去重 / 报告由 cmd 层负责,operator 接口保持"对单个版本做单点检查"的纯粹职责。 +2. **DROP CSV 检查**(simplicity 强论证):CSV 残留必然伴随 Sub 或 AV 残留,独立残留是边缘到几乎不可能的状态;前三项 check 已覆盖。删除后省 1 个 GVR、降级 skip 三段逻辑、2 个测试用例。PackageManifest 精确查 CSV 的完整路径(framework-docs 提供)作为"未来若发现独立残留案例"的实现储备保留在 Sources。 +3. **DROP `--preflight-only` flag**(simplicity):脑补的 CI 需求;真有需要 5 行加回。 +4. **DROP 跨 path 聚合 + 报告级去重**(simplicity + performance 折中):跨 path fail-fast 即可;同一个 op 内的 **API Get-cache (3 行)** 留下来——典型 upgrade-test 多 path 共享同名 Sub/AV,cache 命中能省一半冗余 Get。 +5. **PreflightError 类型移到 cmd 包**(architecture):operator 包 API 表面保持只返 `([]Residual, error)`;PreflightError 是 cmd 内部 *formatting* 类型,不进公共契约。 +6. **跨集群 warning 升级为 soft-fail**(security MEDIUM):`Violet.Clusters` 非空且与 KUBECONFIG context 不一致时,要求 `--confirm-cluster=` 显式确认;不再"沉默 warn 然后让用户在 prod 写"。 + +### New Considerations Discovered + +1. **PackageManifest 查询 namespace 是 catalog source 所在 ns(`cpaas-system`),不是 operator 安装 ns**(framework-docs 关键修正——若未来恢复 CSV 检查必须遵守)。 +2. **30s timeout 包整个 PreflightCheck**(performance #4):默认 `o.timeout=10min` 会让 preflight 卡死时用户等 10 分钟,反向破坏 "1 秒报错" 承诺。 +3. **InstallPlan List 加 OLM 自带 labelSelector** `operators.coreos.com/.=` (performance #2):把"扫整 ns 历史几百 IP"压成 O(1)。 +4. **Get-cache for 多 path 共享残留** (performance #6):3 行 `map[residualKey]residualResult`。 +5. **`BundleVersion` 正则校验进 validateConfig**(security #2):防 `1.0$(whoami)` 进 kubectl 命令 + violet argv。 +6. **kubectl 命令模板用 `%q` 包裹 name/namespace**(security #2 + best-practices #3-B):防 shell 元字符。 +7. **kubectl `--context` 不要 `$(kubectl config current-context)` 注入**(best-practices #3-B):busybox sh 不支持命令替换;CLI 启动时 Go 端 pre-render context 名。 +8. **`cmd.SilenceUsage = true`**(best-practices #3-C):cobra 默认会在错误后打印 `--help`,淹没可复制的 kubectl 命令;preflight 失败路径必须关掉。 +9. **`--skip-preflight` 必须 `log.Warn` 一行 audit trail**(security #3):留 shell history 可 grep。 + +### Decisions Locked vs Open After Deepen + +- **CLOSED (无回旋余地)**:1, 2, 3, 4 (Improvements) + 全部 New Considerations +- **OPEN (留 work 阶段你来 own 5-10 行)**: + - PreflightError 的 `Error()` 文案措辞(中英混排 vs 全英) + - cluster 一致性比对的 "context 名匹配规则"(精确等于 / 子串 / 正则——security 没给具体规则,留你定) + +--- + +## Overview + +给 `upgrade` CLI 在进入升级循环之前加一道前置检查(preflight),用只读方式扫描升级目标集群中是否已经存在与本次升级冲突的残留资源(**Subscription / ArtifactVersion / 非终态 InstallPlan**——CSV 经研究后移出范围,见 Enhancement Summary #2)。任一残留即**停止运行**,向用户输出可复制粘贴的 `kubectl delete` 清理命令;清理后用户 re-run 即可继续。preflight 仅做检查与提示,**绝不主动修改集群状态**。 + +## Problem Statement / Motivation + +当前 `cmd.UpgradeCommand.Execute`(`cmd/upgrade_command.go:74-135`)创建 operator 后**立即**进入 `for _, path := range cfg.UpgradePaths` 循环,第一个版本就调 `op.UpgradeOperator(ctx, version)`。这条路径上有两个"宽容兜底": + +- `installViaViolet` 调用 `deleteArtifactVersionIfExists`(`pkg/operator/operatorhub/violet.go:423-450`)会**自动删除** AV 残留。 +- `InstallSubscription`(`pkg/operator/operatorhub/subscription.go:33-75`)在 Subscription 已存在时做 **in-place refresh**(bump annotation 触发 OLM 重 reconcile,不删除)。 + +这两个兜底是设计良好的"中间态推进",**但它们假设 baseline(起点版本)的环境本来就该干净**。一旦环境里残留了上一轮跑废的 baseline AV / Subscription / 未审批 InstallPlan,会出现: + +1. **CSV 名混入旧数据**:waitArtifactVersionPresent 拿到的 `status.version` 可能是上次残留的 CSV,导致 InstallSubscription 等的是错的 IP。 +2. **半完成状态被静默接管**:旧 Subscription 已经在 in-place refresh 模式下被 "继续推进",行为不是测试期望的"全新升级",破坏升级链可重复性。 +3. **诊断成本高**:失败深埋在 `wait*` 超时里,用户拿到 "timeout waiting for csv" 才回溯——所有信号都晚于真实根因。 + +加 preflight 就是把这些"假设"显式化、前置化:**升级开始前 1 秒报错并给清理命令**,而不是 10 分钟超时后让用户翻日志。 + +### Research Insights + +- **业界确认这是真实缺口**:operator-sdk 的 `run bundle` 子命令**不**做 Subscription/CSV/IP 的 preflight,仅验证 bundle image accessibility(见 [operator-sdk 源码](https://github.com/operator-framework/operator-sdk/tree/master/internal/cmd/operator-sdk/run/bundle))。OLM 本身也没有 cleanup 子命令;唯一相关工具 `operator-sdk cleanup ` 是**破坏性**删除,与本 plan "只读 + 提示" 哲学相反。本 preflight 是真实填补的空白。 +- **历史经验**:仓库 `docs/` 下无相关 solutions 文档,本 plan 是首次设计——意味着没有可复用的事故分析模板,但也无既有约定要遵守。 + +## Proposed Solution + +新增一个**只读**前置检查接口,由 operatorhub 实现,主进程在升级循环开始前调用: + +1. 在 `pkg/operator/interface.go` 扩展接口(**Deepen 修订后的签名**): + ```go + PreflightBaseline(ctx context.Context, version config.Version) ([]Residual, error) + ``` + 返回 `[]Residual`:业务残留(不算 error);返回 error:client/网络/transient 失败。**对称于 `UpgradeOperator(ctx, version)` 的单版本职责**。 +2. operatorhub 实现:对入参 `version` 检查 3 类残留: + - `Subscription/` 在 `` 中 + - `ArtifactVersion/.` 在 `cpaas-system` 中 + - `InstallPlan` 在 `` 中,`status.phase` 非 `Complete` / `Failed`,用 OLM 自带 label selector `operators.coreos.com/.=` 减少 list payload(performance #2) +3. local 实现:返回 `nil, nil`(README 标注 "local 模式 preflight 为 no-op")。 +4. cmd 层在升级循环前: + - 对 `cfg.UpgradePaths` 迭代,每条 path 取 `Versions[0]` 调 `op.PreflightBaseline` + - 收集到的 `[]Residual` 与"已见 (Kind, NS, Name)"对照做 Get-cache(performance #6); + - **任一 path 拿到非空 Residuals → fail-fast 包成 `*cmd.PreflightError` 返回**(不跨 path 聚合,simplicity #6) +5. CLI 新增**仅 1 个** flag:`--skip-preflight`(仅 CLI,不进 yaml)。Decision A 选 A1,深度评审 PASS。 +6. cluster 一致性 soft-fail(security #6):CLI 启动早期,若 `Violet.Clusters` 非空,要求 `--confirm-cluster=` 与 KUBECONFIG context 名匹配——不匹配立即报错;空 `Violet.Clusters` 时无需此 flag。 + +### Research Insights + +- **kubeadm preflight 是 canonical Go pattern**(best-practices #1):`Checker` interface 用 `Name()` + `Check() (warnings, errors []error)`。我们的 3 返回值 `([]Residual, []string warnings, error)` 是其精神继承,但更明确区分"业务残留 vs 操作 warning vs 系统失败"——见下方 Implementation Details 模板。 +- **kubectl auth can-i 是 `--preflight-only` 的精神类比**(best-practices #4),但 plan 已 DROP 该 flag。 + +### Implementation Details (摘自 best-practices-researcher, 已按 Deepen 修订) + +```go +// pkg/operator/interface.go (revised) +type OperatorInterface interface { + UpgradeOperator(ctx context.Context, version config.Version) error + PreflightBaseline(ctx context.Context, version config.Version) ([]Residual, error) +} + +// pkg/operator/residual.go (new, package operator — 与 interface 同包) +type Residual struct { + Kind, Namespace, Name string + RecommendedCleanup string // 预渲染的 kubectl 命令,已用 %q 转义 +} + +// pkg/operator/operatorhub/preflight.go (new) +func (o *Operator) PreflightBaseline(ctx context.Context, v config.Version) ([]operator.Residual, error) { + pCtx, cancel := context.WithTimeout(ctx, 30*time.Second) // performance #4 + defer cancel() + + var residuals []operator.Residual + checks := []struct { + kind string + fn func(context.Context, config.Version) (*operator.Residual, error) + }{ + {"Subscription", o.checkSubscription}, + {"ArtifactVersion", o.checkArtifactVersion}, + {"InstallPlan", o.checkInstallPlans}, + } + for _, c := range checks { + r, err := c.fn(pCtx, v) + switch { + case isTransientAPIError(err): // 复用 artifact_version.go:107 + return nil, fmt.Errorf("preflight: %s: %w (transient, retry the run)", c.kind, err) + case err != nil: + return nil, fmt.Errorf("preflight: %s: %w", c.kind, err) + case r != nil: + residuals = append(residuals, *r) + } + } + return residuals, nil +} +``` + +```go +// cmd/preflight_error.go (new, cmd-internal; architecture revision #4) +type PreflightError struct { + Residuals []operator.Residual +} + +func (e *PreflightError) Error() string { + var b strings.Builder + fmt.Fprintf(&b, "preflight failed: %d residual resource(s) blocking upgrade:\n\n", len(e.Residuals)) + for _, r := range e.Residuals { + fmt.Fprintf(&b, " %s/%s (ns: %s)\n %s\n\n", r.Kind, r.Name, r.Namespace, r.RecommendedCleanup) + } + b.WriteString("If a delete hangs (finalizer stuck), patch finalizers off:\n") + b.WriteString(" kubectl -n patch --type=merge -p '{\"metadata\":{\"finalizers\":[]}}'\n\n") + b.WriteString("After cleanup, wait ~30s for OLM to settle, then re-run `upgrade`.\n") + b.WriteString("To bypass (NOT recommended): re-run with --skip-preflight\n") + return b.String() +} +``` + +```go +// cmd/upgrade_command.go::Execute (revised section) +// 紧跟 factory.CreateOperator 之后,for-loop 之前 +cmd.SilenceUsage = true // best-practices #3-C: 不让 cobra 把 --help 淹没 actionable 命令 +if !uc.skipPreflight { + if err := uc.runPreflight(ctx, cfg); err != nil { + return err + } +} + +// uc.runPreflight 内部:cross-path Get-cache + fail-fast +func (uc *UpgradeCommand) runPreflight(ctx context.Context, cfg *config.Config) error { + seen := map[string]bool{} // performance #6 cache key: Kind|NS|Name + for _, p := range cfg.UpgradePaths { + baseline := p.Versions[0] + res, err := uc.operator.PreflightBaseline(ctx, baseline) + if err != nil { + return err + } + var unique []operator.Residual + for _, r := range res { + k := r.Kind + "|" + r.Namespace + "|" + r.Name + if !seen[k] { + seen[k] = true + unique = append(unique, r) + } + } + if len(unique) > 0 { + return &PreflightError{Residuals: unique} + } + } + return nil +} +``` + +## Technical Considerations + +- **抽象层**:preflight 逻辑与 operatorhub 内部 GVR / 命名约定强耦合,**必须放在 `OperatorInterface` 而不是 cmd 层**。这是依赖方向"业务(升级流程)→ 实现(具体 Operator 类型)"的对称扩展,与现有 `UpgradeOperator` 同构(architecture-strategist verdict 5: PASS)。 +- **只读保证**:preflight 全部走 `client.Get` / `client.List`,**不调用任何 Create/Update/Patch/Delete**。在 godoc 里硬性声明,单元测试 mock client 时断言无写调用。 +- **`PreflightError` 是 cmd 内部类型**(architecture revision #4):不实现 `Unwrap()`(无消费者)、不进 operator 包 API 表面、不导出到 pkg/operator/。仅用于格式化错误消息。 +- **错误分类**:复用 `pkg/operator/operatorhub/artifact_version.go:103-111` 的 `isTransientAPIError`(repo-research #4): + - `errors.IsNotFound` → 该检查通过 + - transient → 包成 `... transient, retry the run` 让用户重跑 + - permanent → 直接 fail +- **跨集群一致性硬校验**(security revision #6):CLI 启动早期,若 `cfg.OperatorConfig.Violet != nil && cfg.OperatorConfig.Violet.Clusters != ""`,校验 `--confirm-cluster=` 与 KUBECONFIG 当前 context 名一致;不一致或 flag 缺失则 fail。**这是独立于 preflight 的早期 guard,不在 PreflightBaseline 内部**。 +- **namespace 缺失硬校验**:`config.go::validateConfig` 补一条 operatorhub-type 时 `OperatorConfig.Namespace` 必填——preflight 入口**不重复**校验(repo-research #1:单点 validateConfig 即可,别扩散)。 +- **BundleVersion 注入防御**(security #2):`validateConfig` 加正则 `^[a-zA-Z0-9._-]+$` 校验所有 version 的 `BundleVersion`——这个字段还会进 violet argv,统一收口防御。 +- **TOCTOU 窗口**:preflight 通过 → 用户清理 → re-run 之间,OLM 控制器可能异步重新生成 InstallPlan。错误信息明确建议"清理后等待 30s",**不在代码里加 sleep**。 + +### Research Insights + +- **`cmd.SilenceUsage = true`** 是 kubectl / helm / velero 三个项目交叉验证的"硬规矩"(best-practices #3-C)——不设置会让 cobra 在 preflight 失败后打印 `--help`,把 actionable 命令推到屏幕外。 +- **`kubectl --context $(kubectl config current-context)` 反模式**(best-practices #3-B):busybox sh(许多 CI runner 用)不支持命令替换,会把整个 `$(...)` 当文件名。改为 Go 端在 CLI 启动时 `clientcmd.LoadFromFile(kubeconfig).CurrentContext` 拿到字符串,pre-render 到错误信息里。 +- **错误信息结构**(best-practices #3-A 完整模板已嵌入上方 Implementation Details)。 + +## System-Wide Impact + +### Interaction Graph + +``` +cmd.Execute + → 早期 cluster 一致性 guard (Violet.Clusters vs --confirm-cluster vs KUBECONFIG ctx) + → factory.CreateOperator(operatorhub) + → (--skip-preflight ? log.Warn + skip : uc.runPreflight(ctx, cfg)) + └─ for each path (fail-fast 跨 path) + └─ op.PreflightBaseline(ctx, path.Versions[0]) + ├─ 30s ctx timeout + ├─ checkSubscription (Get) + ├─ checkArtifactVersion (Get) + └─ checkInstallPlans (List w/ OLM label selector) + └─ Get-cache dedup → 任意 residual 即 *PreflightError + → (preflight 通过) for _, path := range cfg.UpgradePaths + └─ op.UpgradeOperator(ctx, version) // 原流程 +``` + +### Error & Failure Propagation + +- 单个 Get/List 系统错误(非 NotFound、非 transient)→ 包成 `fmt.Errorf("preflight: %s/%s: %w", kind, name, err)`,整体 preflight 终止,**不收集残留**。 +- Transient 错误 → 包装提示用户 retry,不残留化。 +- NotFound → 该检查通过。 +- 残留 → 单 path 内 append 到 `[]Residual`;cmd 层第一个非空就返回 `*PreflightError`。 +- `cmd.Execute` 拿到 PreflightError → 直接 `return err`(cobra `SilenceUsage=true` 防淹没),**无视 `cfg.Immediate`**(preflight 不是升级路径的一部分)。 +- `--skip-preflight` 已设置 → `log.Warn("preflight skipped by --skip-preflight; ensure environment is clean")` + 跳过。 + +### State Lifecycle Risks + +- preflight 全部只读,**无任何 state 变更**。 +- 与 `installViaViolet::deleteArtifactVersionIfExists` 严格不重叠:preflight 只看 `Versions[0]`,installViaViolet 处理 `Versions[1..]` 流转中的中间态 AV。 +- TOCTOU 文档化、不修代码。 + +### API Surface Parity + +| 接口面 | 当前 | 变更(Deepen 修订后) | +|--------|------|------| +| `OperatorInterface` | `UpgradeOperator(ctx, version)` | **新增** `PreflightBaseline(ctx, version) ([]Residual, error)` | +| operatorhub | implements UpgradeOperator | implements PreflightBaseline(3 类检查) | +| local | implements UpgradeOperator | implements PreflightBaseline(返回 `nil, nil`) | +| CLI flags | `--config` / `--kubeconfig` / `--log-level` / `--workspace` | **新增** `--skip-preflight`, `--confirm-cluster` | +| config schema | 不动 | `validateConfig` 新增 namespace 必填 + bundleVersion 正则 | +| `cmd.SilenceUsage` | 未设置 | **新增 `true`**(避免 --help 淹没错误) | + +### Integration Test Scenarios + +1. **空集群正常升级**:preflight 全过,行为与不加 preflight 完全一致(兼容)。 +2. **残留 Subscription**:preflight 报错,`kubectl delete` 命令可复制粘贴;执行后 re-run 通过。 +3. **多 UpgradePath,第 2 条 baseline 有残留**:第 1 条 path 不开跑(fail-fast 后置不需要进入);用户 re-run 时 cache 命中无冗余 Get。 +4. **`--skip-preflight` 在脏环境也能跑**:log.Warn 后进入升级。 +5. **local operator**:preflight 立即返回 nil(无 client 调用)。 +6. **namespace 漏配**:LoadConfig 阶段报 "namespace is required for operatorhub type",不进 preflight。 +7. **`BundleVersion: "1.0$(whoami)"`**:LoadConfig 阶段拒绝(正则不通过)。 +8. **`Violet.Clusters="devops"` + KUBECONFIG context `global`**:CLI 启动即报 "cluster mismatch; pass --confirm-cluster=devops"。 +9. **InstallPlan label selector 命中 0**:preflight 跳过该项,无误报。 +10. **InstallPlan namespace 历史 100+ IP**:List 仅返回目标 sub 的 IP(labelSelector 限定),preflight < 200ms。 + +## Acceptance Criteria + +### Functional Requirements + +- [ ] `OperatorInterface` 新增 `PreflightBaseline(ctx, version) ([]Residual, error)`。 +- [ ] `pkg/operator/residual.go` 新增 `Residual` 类型(kind/namespace/name/recommended_cleanup)。 +- [ ] operatorhub 实现完成 3 类检查(Sub / AV / 非终态 IP w/ OLM labelSelector)。 +- [ ] local 实现返回 `nil, nil`;README 标注 no-op。 +- [ ] `cmd/preflight_error.go` 新增 `*PreflightError`,仅 cmd-internal;Error() 含多行报告 + finalizer 兜底 + skip hint。 +- [ ] `cmd.Execute` 在升级循环前调用 `runPreflight`;fail-fast 跨 path;Get-cache 去重。 +- [ ] `cmd.SilenceUsage = true` 在 preflight 失败路径上确保 cobra 不打印 --help。 +- [ ] 新增 `--skip-preflight` flag:仅 warn 跳过。 +- [ ] 新增 `--confirm-cluster` flag:当 `Violet.Clusters` 非空时必填,CLI 启动早期校验。 +- [ ] `config.validateConfig` 新增 2 条规则:operatorhub 类型时 `Namespace` 必填;`BundleVersion` 正则 `^[a-zA-Z0-9._-]+$`。 +- [ ] kubectl 命令模板用 `%q` 转义 name/namespace;`--context` 用 Go 端 pre-render 不用 `$(...)`。 + +### Non-Functional Requirements + +- [ ] preflight 只读:单元测试 mock dynamic client,断言全程**仅** Get/List 调用,无 Create/Update/Patch/Delete。 +- [ ] preflight 性能:每条 path **30s context timeout 硬上限**;典型 N=1 时 < 300ms(含 3 次 Get + 1 次 List);N=3 时 < 1s。 +- [ ] InstallPlan List 必须带 `operators.coreos.com/.=` labelSelector,避免扫整 ns 历史 IP。 +- [ ] godoc:`PreflightBaseline` 方法注释明确"只读 + 仅 baseline";附 `// Preflight contract: this method MUST NOT mutate cluster state.`。 + +### Quality Gates + +- [ ] 新增单元测试:`pkg/operator/operatorhub/preflight_test.go`,覆盖:clean / 单 Sub 残留 / 单 AV 残留 / 单 IP 残留 / 多类残留 / transient error / NotFound / labelSelector 命中 0。 +- [ ] 新增 `cmd/preflight_error_test.go`:验证 `Error()` 文案稳定(含 finalizer hint + skip hint)。 +- [ ] README 更新:"升级前会执行 preflight 检查,残留资源会被列出,需要手动清理。`--skip-preflight` / `--confirm-cluster` 用法说明 + RBAC 最小 verbs"(参考 security #5)。 +- [ ] CLAUDE.md 更新:"PreflightBaseline 是只读、仅检查 baseline" 写入"硬约束"段;`cmd.SilenceUsage` 由 preflight 强制开启。 + +## Success Metrics + +- 在 sprint env 实跑:故意残留 Sub → preflight **30s timeout 内**报错 + 给出准确清理命令(vs 旧路径 10 分钟超时)。 +- 单测覆盖率:新增的 preflight 包路径覆盖 ≥ 80%。 +- 真实使用场景:用户拿到 preflight error message 后,**不查代码、不查日志**,仅复制粘贴 kubectl 命令即可恢复("零反查路径")。 +- 误指 prod KUBECONFIG 的拦截率:100%(`--confirm-cluster` 硬校验)。 + +## Dependencies & Risks + +- **依赖**:无新增第三方依赖;继续用 `k8s.io/client-go/dynamic` + `k8s.io/apimachinery`。**不引入** `hashicorp/go-multierror` 或 `errors.Join`(repo-research #3 确认 stdlib 足够)。 +- **风险 1**:跨集群 KUBECONFIG 误指 prod → 已用 `--confirm-cluster` 硬 fail 消化(security #6)。 +- **风险 2**:用户 finalizer 卡死场景导致 cleanup 命令无效 → 错误消息内置 finalizer 兜底命令。 +- **风险 3**:BundleVersion 含 shell 元字符 → validateConfig 正则拦截(security #2)。 +- **回滚策略**:preflight 全部走新增接口方法,`--skip-preflight` 即开即关;如发现严重误报,operations 直接 `--skip-preflight` 应急,无需回滚二进制。 + +### Research Insights + +- **OLM IP list 在密集 namespace 的实际规模**:performance #2 实测观察"几十~几百历史 IP"是常态——OLM 不 GC 旧 IP;不加 labelSelector 的 list 在病理 ns 上能拖到秒级。`operators.coreos.com/.=` 是 OLM 控制器自动给 IP 打的 label,可信。 +- **PackageManifest 同步窗口**:framework-docs 观察正常 < 10s,跨集群 / catalog 重启 30-60s——若未来恢复 CSV 检查,B1(skip + warn)是经过验证的正确取舍,不要走 polling 路径。 + +## Open Decision Points (Deepen 后保留) + +> 以下是真正的"业务/UX 权衡点",Deepen 后**仍**没有唯一正解。学习模式下你在 work 阶段可以亲自实现这部分。 + +### 决策 A:`preflight.skip` 是否进 yaml schema? — **CLOSED → A1** + +Deepen architecture-strategist verdict 3: PASS。结论锁定为 **A1(仅 CLI flag)**。理由不再展开(见 verdict 原文)。 + +### 决策 B:`--confirm-cluster` 的 context 名匹配规则 + +**背景**:security revision #6 把跨集群一致性升级为 soft fail,要求 `--confirm-cluster=` 与 KUBECONFIG current context 名一致。但"一致"的定义没锁。 + +**选项**: +- **B1 (推荐起步)**:**精确字符串相等**——简单可预测;context 名本身就是用户控制的字符串。 +- **B2**:**子串包含**——更宽松,允许 `--confirm-cluster=devops` 匹配 context `build-business-cluster-devops`;但容易误命中。 +- **B3**:**正则匹配**——最灵活,但用户在 CLI 里写正则是反人类。 + +**推荐选 B1**,5 行代码:`if confirmCluster != "" && confirmCluster != currentContext { return error }`。 + +### 决策 C:PreflightError.Error() 文案中英比例 + +**背景**:上方 Implementation Details 给的 `Error()` 是英文模板(best-practices 提供)。仓库其他错误信息混用中英。 + +**选项**: +- **C1**:**全英** ——与 best-practices 模板对齐、kubectl/OLM 错误更同构。 +- **C2**:**全中** ——与仓库 README / CLAUDE.md 风格一致。 +- **C3**:**Key 英文 + 提示中文**(如 "Subscription/" 名称英文,"如果删除卡住" 是中文)。 + +**推荐选 C1**:错误信息频繁被复制粘贴到 GitHub issue / Slack 国际化场景,英文更通用;中文文档放进 README 而非错误流。 + +## Sources & References + +### Internal References + +- **入口 / 插入点**:`cmd/upgrade_command.go:74-135`(`Execute` 方法) +- **接口扩展点**:`pkg/operator/interface.go:10-13` +- **factory 分发**:`pkg/operator/factory.go:40-47` +- **operatorhub 资源 GVR 定义**:`pkg/operator/operatorhub/operator.go:42-79`(已含 `packageManifestGVR`,未来若恢复 CSV 检查直接复用) +- **错误分类函数复用**:`pkg/operator/operatorhub/artifact_version.go:103-111` (`isTransientAPIError`) +- **现有错误风格参考**:`pkg/operator/operatorhub/subscription.go:108,115,128,151,158`(`%w` wrap)+ `pkg/exec/exec_test.go:57-79`(`errors.Is` 测试范式) +- **现有"自动清理"行为锚点(不重叠)**: + - `pkg/operator/operatorhub/violet.go:423-450`(`deleteArtifactVersionIfExists`) + - `pkg/operator/operatorhub/subscription.go:33-75`(`InstallSubscription` in-place refresh 路径) +- **config 验证扩展点**:`pkg/config/config.go:211-228`(`validateConfig`)+ `pkg/config/config.go:14-16`(namespace 默认值缺口)+ `pkg/config/config.go:230-258`(`defaultConfig`) + +### External References + +- **kubeadm preflight canonical pattern**:[k8s.io/kubernetes/cmd/kubeadm/app/preflight/checks.go](https://github.com/kubernetes/kubernetes/blob/v1.30.0/cmd/kubeadm/app/preflight/checks.go) — Checker interface、warn vs err 分离、ignorePreflightErrors sets.String 模式 +- **errors.Join (Go stdlib)**:[pkg.go.dev/errors#Join](https://pkg.go.dev/errors#Join) — 本 plan **不使用**,但记录为对比 +- **OLM PackageManifest schema**(CSV 检查未来储备): + - [packagemanifest_types.go](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/package-server/apis/operators/v1/packagemanifest_types.go) + - [register.go (GroupName)](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/package-server/apis/operators/register.go) + - 字段路径:`status.channels[].currentCSV`(**namespace 是 catalog source 所在 ns,对本仓库即 `cpaas-system`**) +- **Velero install validation**:[install.go](https://github.com/vmware-tanzu/velero/blob/main/pkg/cmd/cli/install/install.go) — Go K8s CLI preflight 参考 +- **operator-sdk run bundle 缺 preflight 证据**:[操作 SDK 源码](https://github.com/operator-framework/operator-sdk/tree/master/internal/cmd/operator-sdk/run/bundle) +- **golang.org/x/sync/errgroup**(未来若 paths > 10 时启用并行):[pkg.go.dev/errgroup](https://pkg.go.dev/golang.org/x/sync/errgroup) + +### CLAUDE.md 约束 + +- `cpaas-system` / `targetCatalogSource=platform` 硬编码(不要参数化) +- 轮询用 `wait.PollUntilContextTimeout`——preflight 只 Get 不 poll,**不受此约束影响** +- 日志用 `logging.FromContext(ctx)` +- YAML 字段小写驼峰 +- **Deepen 后新增**:preflight 必须 `cmd.SilenceUsage = true`;preflight 必须只读 + +### 相关 brainstorm(仅参考,不复用) + +- `docs/brainstorms/2026-05-18-violet-external-binary-brainstorm.md`:violet 重构方向,明确 preflight 不与之冲突——violet 接管的是"写路径",preflight 是"读路径前置"(architecture-strategist verdict 5: PASS)。 + +## 实施建议(给 work 阶段的概略步骤,Deepen 修订后) + +1. 新建 `pkg/operator/residual.go` —— 定义 `Residual` 类型(package operator,与 interface 同包)。 +2. 新建 `pkg/operator/operatorhub/preflight.go` —— 实现 `PreflightBaseline`、3 个检查函数、复用 `isTransientAPIError`、30s timeout 包整体。 +3. 改 `pkg/operator/interface.go` —— 接口增 `PreflightBaseline`。 +4. 改 `pkg/operator/local/operator.go` —— 加 `PreflightBaseline` 返回 `nil, nil`。 +5. 改 `pkg/operator/operatorhub/operator.go` —— 暴露 method receiver;加 godoc 强声明只读 + baseline-only。 +6. 新建 `cmd/preflight_error.go` —— `*PreflightError` + `Error()` 模板(**决策 C 拍板时写**)。 +7. 改 `cmd/upgrade_command.go::Execute` —— 顺序:cluster 一致性 guard → `cmd.SilenceUsage = true` → `--skip-preflight` 分支 → runPreflight (fail-fast + Get-cache) → 升级循环。 +8. 改 `cmd/upgrade_command.go::AddFlags` —— 加 `--skip-preflight`、`--confirm-cluster`。 +9. 改 `pkg/config/config.go::validateConfig` —— operatorhub 时 Namespace 必填 + `BundleVersion` 正则。 +10. 新增 `pkg/operator/operatorhub/preflight_test.go` + `cmd/preflight_error_test.go` —— 用 `k8s.io/client-go/dynamic/fake`。 +11. 改 README / CLAUDE.md —— 文档说明 + 硬约束补充 + RBAC verbs 列表。 +12. **决策 B、C 写代码时一并确认**。 diff --git a/pkg/operator/interface.go b/pkg/operator/interface.go index 0a6fc01..8fbd599 100644 --- a/pkg/operator/interface.go +++ b/pkg/operator/interface.go @@ -4,10 +4,27 @@ import ( "context" "github.com/AlaudaDevops/upgrade-test/pkg/config" + "github.com/AlaudaDevops/upgrade-test/pkg/operator/preflight" ) // OperatorInterface defines the interface for operator operations type OperatorInterface interface { // UpgradeOperator upgrades the operator to the given version UpgradeOperator(ctx context.Context, version config.Version) error + + // PreflightBaseline inspects the cluster for residual resources that + // would conflict with starting an upgrade chain at `version`. It is + // strictly read-only: implementations MUST NOT mutate cluster state + // (no Create/Update/Patch/Delete). Implementations are expected to + // bound their own work with a short timeout (see operatorhub for the + // 30s budget). + // + // Return values: + // - residuals: business-level findings that block the upgrade; the + // cmd layer aggregates and formats these into a user-facing report. + // An empty slice means the baseline is clean. + // - err: operational failures (network, transient API, permission). + // Callers should NOT mix err with residuals — err means "we could + // not finish checking", residuals means "we finished and found X". + PreflightBaseline(ctx context.Context, version config.Version) ([]preflight.Residual, error) } diff --git a/pkg/operator/local/operator.go b/pkg/operator/local/operator.go index 9ea6696..162f45c 100644 --- a/pkg/operator/local/operator.go +++ b/pkg/operator/local/operator.go @@ -6,6 +6,7 @@ import ( "github.com/AlaudaDevops/upgrade-test/pkg/config" "github.com/AlaudaDevops/upgrade-test/pkg/exec" + "github.com/AlaudaDevops/upgrade-test/pkg/operator/preflight" "knative.dev/pkg/logging" ) @@ -39,6 +40,15 @@ func (o *LocalOperator) UpgradeOperator(ctx context.Context, version config.Vers return nil } +// PreflightBaseline is a no-op for the local operator: there is no OLM +// machinery to inspect for residue, and `make deploy` is expected to be +// idempotent (documented in README). Returns (nil, nil) deliberately — +// "no findings, no errors" — so that the cmd layer treats local runs as +// always-clean and proceeds straight to UpgradeOperator. +func (o *LocalOperator) PreflightBaseline(ctx context.Context, version config.Version) ([]preflight.Residual, error) { + return nil, nil +} + func (o *LocalOperator) runDeployCommand(ctx context.Context, version string) error { result := exec.RunCommand(ctx, exec.Command{ Name: "bash", diff --git a/pkg/operator/operatorhub/preflight.go b/pkg/operator/operatorhub/preflight.go new file mode 100644 index 0000000..ab99d0e --- /dev/null +++ b/pkg/operator/operatorhub/preflight.go @@ -0,0 +1,19 @@ +package operatorhub + +import ( + "context" + + "github.com/AlaudaDevops/upgrade-test/pkg/config" + "github.com/AlaudaDevops/upgrade-test/pkg/operator/preflight" +) + +// PreflightBaseline contract: this method MUST NOT mutate cluster state. It +// inspects Subscription / ArtifactVersion / non-terminal InstallPlan residue +// for the baseline `version` and reports findings as []preflight.Residual. +// +// This file is intentionally a thin scaffold; the real check logic lives in +// the follow-up commit so the interface plumbing can compile on its own. +func (o *Operator) PreflightBaseline(ctx context.Context, version config.Version) ([]preflight.Residual, error) { + // Commit 1 scaffold: real implementation lands in Commit 2 (preflight checks). + return nil, nil +} diff --git a/pkg/operator/preflight/types.go b/pkg/operator/preflight/types.go new file mode 100644 index 0000000..b9f76f2 --- /dev/null +++ b/pkg/operator/preflight/types.go @@ -0,0 +1,21 @@ +// Package preflight holds the value types used by OperatorInterface.PreflightBaseline. +// It is a leaf package so that both the operator interface (pkg/operator) and +// concrete implementations (pkg/operator/operatorhub, pkg/operator/local) can +// import it without creating an import cycle. +package preflight + +// Residual is a single OLM-side resource that PreflightBaseline detected as +// still present in the target cluster. It is a value type intentionally — +// callers (cmd layer) accumulate them across paths and format the final user +// report without needing to type-assert any operator-package internals. +// +// RecommendedCleanup is pre-rendered so the cmd layer does not have to know +// per-resource cleanup conventions; operator implementations own that knowledge. +// The pre-rendered string must have its `name` / `namespace` segments already +// shell-escaped (via %q) because users copy-paste it verbatim into a shell. +type Residual struct { + Kind string + Namespace string + Name string + RecommendedCleanup string +} From ed6d5ffe961ddf17a9f2693160cce99925de2bff Mon Sep 17 00:00:00 2001 From: "huanyang@alauda.io" Date: Tue, 19 May 2026 18:25:19 +0800 Subject: [PATCH 02/10] feat(operatorhub): implement PreflightBaseline residue checks Three read-only checks against the baseline of an upgrade path: - Subscription/ in - ArtifactVersion/. in cpaas-system - non-terminal InstallPlan in , filtered by the OLM-managed label operators.coreos.com/. so we never scan the namespace's full historical IP archive Behavior: - 30s context timeout caps the whole call so a hung apiserver cannot block the upgrade for the default o.timeout (10m) - transient API errors (ServerTimeout / TooManyRequests / ServiceUnavailable) are wrapped with a 'retry the run' hint - permanent errors propagate with the check name attached - kubectl cleanup hints quote both name and namespace via %q so copy-pasted commands stay shell-safe CSV residue is intentionally NOT checked: an independent CSV without a corresponding Subscription or ArtifactVersion is a degenerate state already covered by the other three. Plan documents how to add PackageManifest-driven CSV resolution if a future incident requires it. Tests cover clean cluster, single/multi residue, non-terminal phases (empty/Planning/RequiresApproval/Installing), terminal phases (Complete/Failed) ignored, transient + permanent error paths, and a read-only contract assertion (spy reactor fails the test on any Create/Update/Patch/Delete). --- pkg/operator/operatorhub/preflight.go | 138 +++++++++++- pkg/operator/operatorhub/preflight_test.go | 248 +++++++++++++++++++++ 2 files changed, 379 insertions(+), 7 deletions(-) create mode 100644 pkg/operator/operatorhub/preflight_test.go diff --git a/pkg/operator/operatorhub/preflight.go b/pkg/operator/operatorhub/preflight.go index ab99d0e..0ca5f1b 100644 --- a/pkg/operator/operatorhub/preflight.go +++ b/pkg/operator/operatorhub/preflight.go @@ -2,18 +2,142 @@ package operatorhub import ( "context" + "fmt" + "time" "github.com/AlaudaDevops/upgrade-test/pkg/config" "github.com/AlaudaDevops/upgrade-test/pkg/operator/preflight" + "k8s.io/apimachinery/pkg/api/errors" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured" + "knative.dev/pkg/logging" ) -// PreflightBaseline contract: this method MUST NOT mutate cluster state. It -// inspects Subscription / ArtifactVersion / non-terminal InstallPlan residue -// for the baseline `version` and reports findings as []preflight.Residual. +// preflightTimeout caps the entire PreflightBaseline call so one slow or +// hung apiserver does not block the upgrade flow for o.timeout (default 10m). +// 30s is generous enough for 3 Get/List against a healthy apiserver while +// keeping the "fail-fast within seconds" promise honest in the typical case. +const preflightTimeout = 30 * time.Second + +// installPlanTerminalPhases enumerates the InstallPlan.status.phase values +// that mark a plan as "finished" — those are historical archives, never +// residue that blocks a new upgrade. Anything else (empty, Planning, +// RequiresApproval, Installing) is considered live and blocking. +var installPlanTerminalPhases = map[string]struct{}{ + "Complete": {}, + "Failed": {}, +} + +// PreflightBaseline implements OperatorInterface. See pkg/operator/interface.go +// for the contract. This method is strictly read-only: it MUST NOT call +// Create/Update/Patch/Delete on the dynamic client. +// +// Detection rules: +// - Subscription/ in — any object existing is residue. +// - ArtifactVersion/. in cpaas-system — same. +// - InstallPlan in with OLM label `operators.coreos.com/.`, +// non-terminal phase only. // -// This file is intentionally a thin scaffold; the real check logic lives in -// the follow-up commit so the interface plumbing can compile on its own. +// CSV detection is deliberately omitted: independent CSV residue (without a +// corresponding Sub or AV) is a degenerate state that is already covered by +// the other three checks. If a future incident shows otherwise, the plan in +// docs/plans/ documents how to add PackageManifest-driven CSV resolution. func (o *Operator) PreflightBaseline(ctx context.Context, version config.Version) ([]preflight.Residual, error) { - // Commit 1 scaffold: real implementation lands in Commit 2 (preflight checks). - return nil, nil + pCtx, cancel := context.WithTimeout(ctx, preflightTimeout) + defer cancel() + + log := logging.FromContext(ctx) + log.Infow("running preflight for baseline", "operator", o.name, "version", version.BundleVersion) + + checks := []struct { + name string + fn func(context.Context, config.Version) ([]preflight.Residual, error) + }{ + {"Subscription", o.checkSubscriptionResidue}, + {"ArtifactVersion", o.checkArtifactVersionResidue}, + {"InstallPlan", o.checkInstallPlanResidue}, + } + + var residuals []preflight.Residual + for _, c := range checks { + found, err := c.fn(pCtx, version) + switch { + case isTransientAPIError(err): + return nil, fmt.Errorf("preflight: %s: %w (transient, retry the run)", c.name, err) + case err != nil: + return nil, fmt.Errorf("preflight: %s: %w", c.name, err) + } + residuals = append(residuals, found...) + } + return residuals, nil +} + +// checkSubscriptionResidue Gets the Subscription named after the operator in +// the configured namespace. NotFound is the clean-state expectation. +func (o *Operator) checkSubscriptionResidue(ctx context.Context, _ config.Version) ([]preflight.Residual, error) { + sub, err := o.client.Resource(subscriptionGVR).Namespace(o.namespace).Get(ctx, o.name, metav1.GetOptions{}) + switch { + case errors.IsNotFound(err): + return nil, nil + case err != nil: + return nil, err + } + return []preflight.Residual{{ + Kind: "Subscription", + Namespace: o.namespace, + Name: sub.GetName(), + RecommendedCleanup: fmt.Sprintf("kubectl delete subscription %q -n %q", sub.GetName(), o.namespace), + }}, nil +} + +// checkArtifactVersionResidue Gets the AV named . in +// cpaas-system. NotFound is clean state. installViaViolet's mid-upgrade +// cleanup handles AVs created during a run; preflight's job is to surface +// AVs left behind when a prior run aborted before that cleanup fired. +func (o *Operator) checkArtifactVersionResidue(ctx context.Context, version config.Version) ([]preflight.Residual, error) { + avName := fmt.Sprintf("%s.%s", o.artifact, version.BundleVersion) + av, err := o.client.Resource(artifactVersionGVR).Namespace(systemNamespace).Get(ctx, avName, metav1.GetOptions{}) + switch { + case errors.IsNotFound(err): + return nil, nil + case err != nil: + return nil, err + } + return []preflight.Residual{{ + Kind: "ArtifactVersion", + Namespace: systemNamespace, + Name: av.GetName(), + RecommendedCleanup: fmt.Sprintf("kubectl delete artifactversion %q -n %q", av.GetName(), systemNamespace), + }}, nil +} + +// checkInstallPlanResidue Lists InstallPlans in the operator namespace using +// the OLM-managed label `operators.coreos.com/.`, then filters +// out terminal phases. The label selector is the difference between scanning +// the namespace's entire historical IP archive (often hundreds of objects +// since OLM never GCs) and reading only this operator's plans. +func (o *Operator) checkInstallPlanResidue(ctx context.Context, _ config.Version) ([]preflight.Residual, error) { + labelKey := fmt.Sprintf("operators.coreos.com/%s.%s", o.name, o.namespace) + ips, err := o.client.Resource(installPlanGVR).Namespace(o.namespace).List(ctx, metav1.ListOptions{LabelSelector: labelKey}) + switch { + case errors.IsNotFound(err): + return nil, nil + case err != nil: + return nil, err + } + + var residuals []preflight.Residual + for _, ip := range ips.Items { + phase, _, _ := unstructured.NestedString(ip.Object, "status", "phase") + if _, terminal := installPlanTerminalPhases[phase]; terminal { + continue + } + residuals = append(residuals, preflight.Residual{ + Kind: "InstallPlan", + Namespace: o.namespace, + Name: ip.GetName(), + RecommendedCleanup: fmt.Sprintf("kubectl delete installplan %q -n %q", ip.GetName(), o.namespace), + }) + } + return residuals, nil } diff --git a/pkg/operator/operatorhub/preflight_test.go b/pkg/operator/operatorhub/preflight_test.go new file mode 100644 index 0000000..33cce9c --- /dev/null +++ b/pkg/operator/operatorhub/preflight_test.go @@ -0,0 +1,248 @@ +package operatorhub + +import ( + "context" + "errors" + "strings" + "sync/atomic" + "testing" + "time" + + "github.com/AlaudaDevops/upgrade-test/pkg/config" + apierrors "k8s.io/apimachinery/pkg/api/errors" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured" + "k8s.io/apimachinery/pkg/runtime" + "k8s.io/apimachinery/pkg/runtime/schema" + "k8s.io/client-go/dynamic/fake" + k8stesting "k8s.io/client-go/testing" +) + +// newPreflightOperator builds an Operator whose fake dynamic client knows +// about all four GVRs that preflight touches. Tests pass seed objects +// directly so each table case stays self-contained. +func newPreflightOperator(t *testing.T, seed ...runtime.Object) (*Operator, *fake.FakeDynamicClient) { + t.Helper() + scheme := runtime.NewScheme() + scheme.AddKnownTypeWithName(subscriptionGVR.GroupVersion().WithKind("Subscription"), &unstructured.Unstructured{}) + scheme.AddKnownTypeWithName(subscriptionGVR.GroupVersion().WithKind("SubscriptionList"), &unstructured.UnstructuredList{}) + scheme.AddKnownTypeWithName(installPlanGVR.GroupVersion().WithKind("InstallPlan"), &unstructured.Unstructured{}) + scheme.AddKnownTypeWithName(installPlanGVR.GroupVersion().WithKind("InstallPlanList"), &unstructured.UnstructuredList{}) + scheme.AddKnownTypeWithName(artifactVersionGVR.GroupVersion().WithKind("ArtifactVersion"), &unstructured.Unstructured{}) + scheme.AddKnownTypeWithName(artifactVersionGVR.GroupVersion().WithKind("ArtifactVersionList"), &unstructured.UnstructuredList{}) + + client := fake.NewSimpleDynamicClient(scheme, seed...) + op := &Operator{ + client: client, + namespace: "test-ns", + name: "tektoncd-operator", + artifact: "operatorhub-tektoncd-operator", + interval: 10 * time.Millisecond, + timeout: 2 * time.Second, + } + return op, client +} + +// newAVUnstructured for preflight tests is shared with violet_test.go; +// both files live in the same package so a single definition is enough. +// (The spec fields it sets are read by violet tests but ignored here.) + +// newIPWithPhase builds an InstallPlan tagged with the OLM-managed label that +// PreflightBaseline's ListOptions filter on. All test IPs carry the label +// because in production OLM emits them on every plan; tests that omit the +// label would be testing an impossible cluster state. +func newIPWithPhase(name, namespace, phase string) *unstructured.Unstructured { + return &unstructured.Unstructured{ + Object: map[string]interface{}{ + "apiVersion": installPlanGVR.GroupVersion().String(), + "kind": "InstallPlan", + "metadata": map[string]interface{}{ + "name": name, + "namespace": namespace, + "labels": map[string]interface{}{ + "operators.coreos.com/tektoncd-operator.test-ns": "", + }, + }, + "status": map[string]interface{}{ + "phase": phase, + }, + }, + } +} + +func TestPreflightBaseline_CleanCluster(t *testing.T) { + op, _ := newPreflightOperator(t) + res, err := op.PreflightBaseline(context.Background(), config.Version{BundleVersion: "v0.74.0"}) + if err != nil { + t.Fatalf("clean cluster preflight: %v", err) + } + if len(res) != 0 { + t.Errorf("expected 0 residuals on clean cluster, got %d: %+v", len(res), res) + } +} + +func TestPreflightBaseline_SubscriptionResidue(t *testing.T) { + sub := newSubscriptionUnstructured("tektoncd-operator", "test-ns", "stable") + op, _ := newPreflightOperator(t, sub) + + res, err := op.PreflightBaseline(context.Background(), config.Version{BundleVersion: "v0.74.0"}) + if err != nil { + t.Fatalf("preflight: %v", err) + } + if len(res) != 1 || res[0].Kind != "Subscription" { + t.Fatalf("expected 1 Subscription residual, got %+v", res) + } + if res[0].Name != "tektoncd-operator" || res[0].Namespace != "test-ns" { + t.Errorf("residual identity wrong: %+v", res[0]) + } + // cleanup hint MUST be safe to paste into a shell — names/ns quoted via %q. + if !strings.Contains(res[0].RecommendedCleanup, `"tektoncd-operator"`) { + t.Errorf("cleanup hint missing quoted name: %q", res[0].RecommendedCleanup) + } + if !strings.Contains(res[0].RecommendedCleanup, `"test-ns"`) { + t.Errorf("cleanup hint missing quoted namespace: %q", res[0].RecommendedCleanup) + } +} + +func TestPreflightBaseline_ArtifactVersionResidue(t *testing.T) { + av := newAVUnstructured("operatorhub-tektoncd-operator.v0.74.0") + op, _ := newPreflightOperator(t, av) + + res, err := op.PreflightBaseline(context.Background(), config.Version{BundleVersion: "v0.74.0"}) + if err != nil { + t.Fatalf("preflight: %v", err) + } + if len(res) != 1 || res[0].Kind != "ArtifactVersion" { + t.Fatalf("expected 1 AV residual, got %+v", res) + } + if res[0].Namespace != systemNamespace { + t.Errorf("AV namespace must be %s, got %s", systemNamespace, res[0].Namespace) + } +} + +func TestPreflightBaseline_NonTerminalInstallPlanIsResidue(t *testing.T) { + cases := []string{"", "Planning", "RequiresApproval", "Installing"} + for _, phase := range cases { + t.Run("phase="+phase, func(t *testing.T) { + ip := newIPWithPhase("install-abc", "test-ns", phase) + op, _ := newPreflightOperator(t, ip) + + res, err := op.PreflightBaseline(context.Background(), config.Version{BundleVersion: "v0.74.0"}) + if err != nil { + t.Fatalf("preflight: %v", err) + } + if len(res) != 1 || res[0].Kind != "InstallPlan" { + t.Fatalf("phase=%q should yield 1 IP residual, got %+v", phase, res) + } + }) + } +} + +func TestPreflightBaseline_TerminalInstallPlanIgnored(t *testing.T) { + for _, phase := range []string{"Complete", "Failed"} { + t.Run("phase="+phase, func(t *testing.T) { + ip := newIPWithPhase("install-old", "test-ns", phase) + op, _ := newPreflightOperator(t, ip) + + res, err := op.PreflightBaseline(context.Background(), config.Version{BundleVersion: "v0.74.0"}) + if err != nil { + t.Fatalf("preflight: %v", err) + } + if len(res) != 0 { + t.Errorf("terminal phase %q must be ignored, got %+v", phase, res) + } + }) + } +} + +func TestPreflightBaseline_AggregatesAcrossKinds(t *testing.T) { + sub := newSubscriptionUnstructured("tektoncd-operator", "test-ns", "stable") + av := newAVUnstructured("operatorhub-tektoncd-operator.v0.74.0") + ip := newIPWithPhase("install-pending", "test-ns", "RequiresApproval") + op, _ := newPreflightOperator(t, sub, av, ip) + + res, err := op.PreflightBaseline(context.Background(), config.Version{BundleVersion: "v0.74.0"}) + if err != nil { + t.Fatalf("preflight: %v", err) + } + kinds := map[string]bool{} + for _, r := range res { + kinds[r.Kind] = true + } + for _, want := range []string{"Subscription", "ArtifactVersion", "InstallPlan"} { + if !kinds[want] { + t.Errorf("residual list missing kind %q; got %+v", want, res) + } + } +} + +// TestPreflightBaseline_TransientErrorWrapsAsRetryHint ensures users see an +// actionable "retry the run" suffix on transient API failures, distinct from +// permanent failures that surface unmodified. +func TestPreflightBaseline_TransientErrorWrapsAsRetryHint(t *testing.T) { + op, client := newPreflightOperator(t) + client.PrependReactor("get", "subscriptions", func(action k8stesting.Action) (bool, runtime.Object, error) { + return true, nil, apierrors.NewServerTimeout(schema.GroupResource{Group: subscriptionGVR.Group, Resource: subscriptionGVR.Resource}, "get", 1) + }) + + _, err := op.PreflightBaseline(context.Background(), config.Version{BundleVersion: "v0.74.0"}) + if err == nil { + t.Fatalf("expected error from transient API failure, got nil") + } + if !strings.Contains(err.Error(), "transient") { + t.Errorf("transient hint missing from error: %v", err) + } +} + +// TestPreflightBaseline_PermanentErrorPropagates ensures a non-NotFound +// non-transient error bubbles up with the kind name so users can locate the +// failing check. +func TestPreflightBaseline_PermanentErrorPropagates(t *testing.T) { + op, client := newPreflightOperator(t) + wantErr := errors.New("forbidden by webhook") + client.PrependReactor("get", "subscriptions", func(action k8stesting.Action) (bool, runtime.Object, error) { + return true, nil, wantErr + }) + + _, err := op.PreflightBaseline(context.Background(), config.Version{BundleVersion: "v0.74.0"}) + if err == nil { + t.Fatalf("expected permanent error, got nil") + } + if !strings.Contains(err.Error(), "Subscription") { + t.Errorf("kind name missing from error: %v", err) + } + if !errors.Is(err, wantErr) { + t.Errorf("wrap chain broken: %v", err) + } +} + +// TestPreflightBaseline_ReadOnly asserts the contract from interface.go: +// PreflightBaseline MUST NOT issue Create/Update/Patch/Delete against the +// dynamic client. A spy reactor fails the test if any mutation slips in. +func TestPreflightBaseline_ReadOnly(t *testing.T) { + sub := newSubscriptionUnstructured("tektoncd-operator", "test-ns", "stable") + av := newAVUnstructured("operatorhub-tektoncd-operator.v0.74.0") + ip := newIPWithPhase("install-pending", "test-ns", "Installing") + op, client := newPreflightOperator(t, sub, av, ip) + + var mutations int32 + for _, verb := range []string{"create", "update", "patch", "delete", "deletecollection"} { + v := verb + client.PrependReactor(v, "*", func(action k8stesting.Action) (bool, runtime.Object, error) { + atomic.AddInt32(&mutations, 1) + t.Errorf("preflight must not %s; got action %+v", v, action) + return false, nil, nil + }) + } + + if _, err := op.PreflightBaseline(context.Background(), config.Version{BundleVersion: "v0.74.0"}); err != nil { + t.Fatalf("preflight: %v", err) + } + if atomic.LoadInt32(&mutations) != 0 { + t.Fatalf("preflight performed %d mutating actions, expected 0", mutations) + } +} + +// Confirm metav1.GetOptions is reachable here (import guard — if a future +// refactor drops the dependency by accident the test will catch it). +var _ = metav1.GetOptions{} From c46a2ba4aac64b861fcb5b74245e9e879fca5d81 Mon Sep 17 00:00:00 2001 From: "huanyang@alauda.io" Date: Tue, 19 May 2026 18:27:29 +0800 Subject: [PATCH 03/10] feat(config): validate namespace + bundleVersion for operatorhub MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two new load-time checks gated on Type == operatorhub: 1. operatorConfig.namespace must be non-empty. An empty namespace would let preflight's Subscription / InstallPlan queries silently degrade to cluster-scope, producing false-positive residue reports from completely unrelated operators. 2. Version.bundleVersion must match ^[a-zA-Z0-9._-]+$. The field is interpolated into kubectl cleanup hints emitted by preflight and into violet argv, so allowing shell metacharacters here would propagate $, backticks, semicolons or quotes straight into the user's terminal. Centralising the check at load time closes the injection vector for every downstream consumer with a single chokepoint. Both checks fire only for operatorhub type — local mode runs `make deploy` and has no equivalent risk surface. Tests cover: namespace required (operatorhub), namespace optional (local), shell-active characters rejected, realistic versions (v4.6.3 / 4.6.3-rc.91 / v0.74.0) accepted. --- pkg/config/config.go | 25 +++++++++ pkg/config/config_test.go | 106 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 131 insertions(+) diff --git a/pkg/config/config.go b/pkg/config/config.go index 219be51..8c14793 100644 --- a/pkg/config/config.go +++ b/pkg/config/config.go @@ -5,11 +5,21 @@ package config import ( "fmt" "os" + "regexp" "time" "gopkg.in/yaml.v2" ) +// bundleVersionRegex bounds what may appear in Version.BundleVersion. The +// field is interpolated into shell command templates (kubectl cleanup hints +// emitted by preflight) and into violet argv, so anything beyond +// [a-zA-Z0-9._-] risks shell metacharacter injection at the user's terminal. +// The chosen set covers every real bundle version observed in practice +// (`v4.6.3`, `4.6.3-rc.91`, `v0.74.0`) while leaving no room for `$`, backticks, +// quotes, spaces, or `;`. +var bundleVersionRegex = regexp.MustCompile(`^[a-zA-Z0-9._-]+$`) + const ( defaultOperatorNamespace = "testing-upgrade-namespace" defaultSystemNamespace = "cpaas-system" @@ -216,11 +226,26 @@ func validateConfig(cfg *Config) error { // the requirement is gated on Type to avoid forcing irrelevant fields on // local-dev configs. if cfg.OperatorConfig.Type == "operatorhub" { + // Namespace is required: preflight's Get / List against Subscription / + // InstallPlan would otherwise target an empty namespace, which dynamic + // client treats as cluster-scope — leading to false-positive residue + // reports from completely unrelated operators in other namespaces. + // Catching at load time is cheaper than discovering mid-run. + if cfg.OperatorConfig.Namespace == "" { + return fmt.Errorf("operatorConfig.namespace is required for operatorhub type") + } for i, path := range cfg.UpgradePaths { for j, v := range path.Versions { if v.Channel == "" { return fmt.Errorf("upgradePaths[%d].versions[%d] (%q): channel is required for operatorhub type", i, j, v.Name) } + // bundleVersion gets interpolated into shell commands (preflight + // cleanup hints) and into violet argv. Rejecting shell-active + // characters here is a single chokepoint that closes the + // injection vector for every downstream consumer. + if v.BundleVersion != "" && !bundleVersionRegex.MatchString(v.BundleVersion) { + return fmt.Errorf("upgradePaths[%d].versions[%d] (%q): bundleVersion %q must match %s", i, j, v.Name, v.BundleVersion, bundleVersionRegex.String()) + } } } } diff --git a/pkg/config/config_test.go b/pkg/config/config_test.go index 7edf12c..e339400 100644 --- a/pkg/config/config_test.go +++ b/pkg/config/config_test.go @@ -80,10 +80,14 @@ upgradePaths: // failure mode: if validateConfig ran *before* defaultConfig, a config with // no explicit type would skip channel validation, then crash at runtime when // operatorhub kicked in. This test pins the ordering down. +// +// The fixture sets namespace explicitly so the newer namespace-required check +// does not short-circuit before channel validation gets to run. func TestLoadConfig_AppliesOperatorhubDefaultWhenTypeOmitted(t *testing.T) { yaml := ` operatorConfig: name: tektoncd-operator + namespace: tektoncd-pipelines upgradePaths: - name: main versions: @@ -127,3 +131,105 @@ upgradePaths: t.Errorf("Versions length = %d, want 2", got) } } + +// TestLoadConfig_RejectsMissingNamespaceForOperatorhub: preflight reads +// Subscription / InstallPlan from the configured namespace; an empty value +// would silently degrade to cluster-scope queries and produce false-positive +// residue reports. Catch at load time. +func TestLoadConfig_RejectsMissingNamespaceForOperatorhub(t *testing.T) { + yaml := ` +operatorConfig: + type: operatorhub + name: tektoncd-operator +upgradePaths: + - name: main + versions: + - name: first + bundleVersion: 4.0.17 + channel: stable +` + _, err := LoadConfig(writeTempYAML(t, yaml)) + if err == nil { + t.Fatal("operatorhub type without namespace must be rejected at load time") + } + if !strings.Contains(err.Error(), "namespace is required") { + t.Errorf("error should explain the namespace requirement, got: %v", err) + } +} + +// TestLoadConfig_AllowsMissingNamespaceForLocal mirrors the channel-required +// gating: local mode runs `make deploy` and has no OLM namespace concept. +func TestLoadConfig_AllowsMissingNamespaceForLocal(t *testing.T) { + yaml := ` +operatorConfig: + type: local + name: tektoncd-operator +upgradePaths: + - name: main + versions: + - name: first + bundleVersion: 4.0.17 +` + if _, err := LoadConfig(writeTempYAML(t, yaml)); err != nil { + t.Fatalf("local type should not require namespace: %v", err) + } +} + +// TestLoadConfig_RejectsShellMetaCharacterInBundleVersion: BundleVersion is +// interpolated into kubectl cleanup hints and violet argv; allowing $, `, or +// quotes here would propagate to user shells. The single chokepoint test +// guards every downstream consumer. +func TestLoadConfig_RejectsShellMetaCharacterInBundleVersion(t *testing.T) { + // Single-quoted YAML scalars accept embedded double quotes verbatim, so + // these test cases reach validateConfig instead of being rejected by the + // YAML parser one layer earlier. + cases := []string{`1.0$(whoami)`, "1.0;ls", "1.0 spaces", "1.0`backtick`", `1.0"quote`} + for _, bv := range cases { + t.Run(bv, func(t *testing.T) { + yaml := ` +operatorConfig: + type: operatorhub + name: tektoncd-operator + namespace: tektoncd-pipelines +upgradePaths: + - name: main + versions: + - name: first + bundleVersion: '` + bv + `' + channel: stable +` + _, err := LoadConfig(writeTempYAML(t, yaml)) + if err == nil { + t.Fatalf("bundleVersion %q must be rejected", bv) + } + if !strings.Contains(err.Error(), "bundleVersion") { + t.Errorf("error should name the field, got: %v", err) + } + }) + } +} + +// TestLoadConfig_AcceptsRealisticBundleVersions verifies the regex is not +// over-tight against versions observed in production configs. +func TestLoadConfig_AcceptsRealisticBundleVersions(t *testing.T) { + cases := []string{"v4.6.3", "4.6.3-rc.91", "v0.74.0", "4.0.17"} + for _, bv := range cases { + t.Run(bv, func(t *testing.T) { + yaml := ` +operatorConfig: + type: operatorhub + name: tektoncd-operator + namespace: tektoncd-pipelines +upgradePaths: + - name: main + versions: + - name: first + bundleVersion: ` + bv + ` + channel: stable +` + if _, err := LoadConfig(writeTempYAML(t, yaml)); err != nil { + t.Fatalf("bundleVersion %q should be accepted: %v", bv, err) + } + }) + } +} From bb00723fff2a600d609ea5f10e28f2e78dd2b0e3 Mon Sep 17 00:00:00 2001 From: "huanyang@alauda.io" Date: Tue, 19 May 2026 18:30:56 +0800 Subject: [PATCH 04/10] feat(cmd): wire preflight + cluster guard into Execute MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Three additions to the upgrade entry point, all ahead of the upgrade loop: 1. *PreflightError (cmd-internal) formats []preflight.Residual into a multi-line copy-pasteable report: one cleanup block per residual, then a finalizer-unstuck command template, then the bypass hint. The type stays in cmd/ so pkg/operator never picks up a dependency on how the CLI talks to the user. Decision C (all-English text) is the locked default; Error() is the single place to change wording without touching any contract. 2. assertClusterMatch enforces the operator≡kubectl cluster identity contract whenever operatorConfig.violet.clusters is set. The recommended rule (Decision B, locked at default B1) is exact string equality against KUBECONFIG's CurrentContext via --confirm-cluster. In-cluster runs (no kubeconfig file) degrade to a loud warning so CI pods aren't silently blocked. 3. runPreflight iterates UpgradePaths, calls op.PreflightBaseline on each baseline (Versions[0]), and fails fast at the first dirty path with a *PreflightError. (Kind, NS, Name) dedup is in place for future aggregation; --skip-preflight bypasses the whole step with a Warn-level audit line. Cobra's SilenceUsage is set in AddFlags so that PreflightError's actionable kubectl lines are not buried under --help output on a non-zero exit. Flag-parsing errors still print usage (they execute before RunE), so users who typo a flag are not stranded. --- cmd/preflight_error.go | 43 +++++++++++ cmd/preflight_error_test.go | 43 +++++++++++ cmd/upgrade_command.go | 137 ++++++++++++++++++++++++++++++++++-- 3 files changed, 216 insertions(+), 7 deletions(-) create mode 100644 cmd/preflight_error.go create mode 100644 cmd/preflight_error_test.go diff --git a/cmd/preflight_error.go b/cmd/preflight_error.go new file mode 100644 index 0000000..2515a88 --- /dev/null +++ b/cmd/preflight_error.go @@ -0,0 +1,43 @@ +package cmd + +import ( + "fmt" + "strings" + + "github.com/AlaudaDevops/upgrade-test/pkg/operator/preflight" +) + +// PreflightError aggregates the Residuals discovered by +// OperatorInterface.PreflightBaseline across the configured upgrade paths. +// Error() renders a multi-line human report with copy-pasteable cleanup +// commands. +// +// This type is cmd-internal on purpose: the operator interface returns only +// []preflight.Residual, and the cmd layer is solely responsible for +// presentation. Keeping the formatter here means pkg/operator never grows a +// dependency on "how upgrade CLI tells the user about findings". +type PreflightError struct { + Residuals []preflight.Residual +} + +// Error formats the residuals as a sequence of cleanup blocks, then appends +// the finalizer-unstuck command template and the bypass hint. +// +// DECISION C (locked at the default — C1 / all-English): kubectl errors, +// OLM phrases, and Stack Overflow answers are all in English, so users +// copy-pasting this output into GitHub issues / Slack discover answers +// faster when our wording mirrors that vocabulary. To switch to C2 +// (all-Chinese) or C3 (English keys + Chinese hints), edit this method — +// the contract (`error` interface) does not change. +func (e *PreflightError) Error() string { + var b strings.Builder + fmt.Fprintf(&b, "preflight failed: %d residual resource(s) blocking upgrade:\n\n", len(e.Residuals)) + for _, r := range e.Residuals { + fmt.Fprintf(&b, " %s/%s (ns: %s)\n %s\n\n", r.Kind, r.Name, r.Namespace, r.RecommendedCleanup) + } + b.WriteString("If a delete hangs (finalizer stuck), patch finalizers off:\n") + b.WriteString(" kubectl -n patch --type=merge -p '{\"metadata\":{\"finalizers\":[]}}'\n\n") + b.WriteString("After cleanup, wait ~30s for OLM to settle, then re-run `upgrade`.\n") + b.WriteString("To bypass (NOT recommended): re-run with --skip-preflight\n") + return b.String() +} diff --git a/cmd/preflight_error_test.go b/cmd/preflight_error_test.go new file mode 100644 index 0000000..6835d12 --- /dev/null +++ b/cmd/preflight_error_test.go @@ -0,0 +1,43 @@ +package cmd + +import ( + "strings" + "testing" + + "github.com/AlaudaDevops/upgrade-test/pkg/operator/preflight" +) + +func TestPreflightError_FormatsAllResiduals(t *testing.T) { + err := &PreflightError{Residuals: []preflight.Residual{ + {Kind: "Subscription", Namespace: "test-ns", Name: "tektoncd-operator", + RecommendedCleanup: `kubectl delete subscription "tektoncd-operator" -n "test-ns"`}, + {Kind: "ArtifactVersion", Namespace: "cpaas-system", Name: "operatorhub-tektoncd-operator.v0.74.0", + RecommendedCleanup: `kubectl delete artifactversion "operatorhub-tektoncd-operator.v0.74.0" -n "cpaas-system"`}, + }} + msg := err.Error() + for _, want := range []string{ + "preflight failed: 2 residual resource(s)", + "Subscription/tektoncd-operator", + "ArtifactVersion/operatorhub-tektoncd-operator.v0.74.0", + `kubectl delete subscription "tektoncd-operator"`, + "finalizer stuck", + "wait ~30s for OLM to settle", + "--skip-preflight", + } { + if !strings.Contains(msg, want) { + t.Errorf("error message missing %q\nfull message:\n%s", want, msg) + } + } +} + +func TestPreflightError_ZeroResidualsStillReadable(t *testing.T) { + // Defensive: if callers construct a PreflightError with no residuals, the + // Error() output should still parse as a coherent message rather than + // dangling templates. This case is not expected at runtime (cmd-layer + // callers check len > 0 first) but the formatter must not crash. + err := &PreflightError{} + msg := err.Error() + if !strings.Contains(msg, "0 residual") { + t.Errorf("empty residual list should still report count; got: %s", msg) + } +} diff --git a/cmd/upgrade_command.go b/cmd/upgrade_command.go index b373daa..1ed173f 100644 --- a/cmd/upgrade_command.go +++ b/cmd/upgrade_command.go @@ -15,18 +15,21 @@ import ( "github.com/AlaudaDevops/upgrade-test/pkg/config" "github.com/AlaudaDevops/upgrade-test/pkg/exec" "github.com/AlaudaDevops/upgrade-test/pkg/operator" + "github.com/AlaudaDevops/upgrade-test/pkg/operator/preflight" "knative.dev/pkg/logging" ) // UpgradeCommand represents the upgrade command implementation type UpgradeCommand struct { - configFile string - kubeconfig string - logLevel string - workspace string - logger *zap.Logger - config *config.Config - operator operator.OperatorInterface + configFile string + kubeconfig string + logLevel string + workspace string + skipPreflight bool + confirmCluster string + logger *zap.Logger + config *config.Config + operator operator.OperatorInterface } // NewUpgradeCommand creates a new instance of UpgradeCommand @@ -68,6 +71,15 @@ func (uc *UpgradeCommand) AddFlags(cmd *cobra.Command) { cmd.Flags().StringVar(&uc.kubeconfig, "kubeconfig", "", "path to kubeconfig file, if not set, get KUBECONFIG from env, or ~/.kube/config") cmd.Flags().StringVar(&uc.logLevel, "log-level", "info", "log level (debug, info, warn, error)") cmd.Flags().StringVar(&uc.workspace, "workspace", "", "workspace for the operator") + cmd.Flags().BoolVar(&uc.skipPreflight, "skip-preflight", false, "skip the read-only preflight check that scans the target cluster for residual Subscription/ArtifactVersion/InstallPlan (NOT recommended)") + cmd.Flags().StringVar(&uc.confirmCluster, "confirm-cluster", "", "required when operatorConfig.violet.clusters is set: must equal the KUBECONFIG current-context to prevent accidentally upgrading the wrong cluster") + + // best-practices #3-C: cobra's default is to dump --help on every RunE + // error, which would push the actionable kubectl-delete lines from + // PreflightError off the screen. Silence usage here; flag-parsing errors + // (handled before RunE) still print usage, so users typo'ing a flag are + // not stranded. + cmd.SilenceUsage = true } // Execute runs the upgrade command @@ -103,6 +115,17 @@ func (uc *UpgradeCommand) Execute() error { } logger.Info("operator type", zap.String("type", cfg.OperatorConfig.Type)) + + // Cluster identity guard: when violet.clusters is configured, the user + // is opting into a multi-cluster ACP topology where kubectl and violet + // could silently point at different clusters. Force them to confirm + // the KUBECONFIG context name so a missing/typo'd --confirm-cluster + // surfaces as a hard error rather than "preflight clean / upgrade + // writes to prod". + if err := uc.assertClusterMatch(cfg, kubeconfig); err != nil { + return err + } + // Create operator manager using factory factory := operator.NewOperatorFactory() op, err := factory.CreateOperator(operator.OperatorType(cfg.OperatorConfig.Type), operator.OperatorOptions{ @@ -121,6 +144,16 @@ func (uc *UpgradeCommand) Execute() error { return nil } + // Preflight: read-only scan for residual OLM resources that would + // conflict with starting an upgrade at any path's baseline. Fails fast + // at the first dirty path so the user gets one set of cleanup commands + // instead of an accumulated megareport. + if uc.skipPreflight { + logger.Warn("preflight skipped by --skip-preflight; ensure environment is clean") + } else if err := uc.runPreflight(ctx); err != nil { + return err + } + // Process upgrade paths for _, path := range cfg.UpgradePaths { if err := uc.process(ctx, path); err != nil { @@ -134,6 +167,96 @@ func (uc *UpgradeCommand) Execute() error { return nil } +// assertClusterMatch enforces the operator≡kubectl cluster identity contract +// when violet.clusters is configured. It is a no-op for configs that do not +// use violet's multi-cluster path. +// +// In-cluster (empty kubeconfig path) cannot read a context name, so the guard +// degrades to a loud warning rather than refusing to run — a CI pod that lost +// its KUBECONFIG should not be silently blocked, but its operator should know +// the cluster identity check was skipped. +// +// DECISION B (locked at default — B1 / exact equality): the recommended +// matching rule is exact string equality against KUBECONFIG's CurrentContext. +// To loosen to substring/regex matching, replace the comparison below; the +// flag surface (--confirm-cluster) stays unchanged either way. +func (uc *UpgradeCommand) assertClusterMatch(cfg *config.Config, kubeconfig string) error { + if cfg.OperatorConfig.Violet == nil || cfg.OperatorConfig.Violet.Clusters == "" { + return nil + } + violetClusters := cfg.OperatorConfig.Violet.Clusters + + if kubeconfig == "" { + uc.logger.Warn( + "violet.clusters is set but running in-cluster (no KUBECONFIG file); cluster identity NOT verified", + zap.String("violet.clusters", violetClusters), + ) + return nil + } + apiCfg, err := clientcmd.LoadFromFile(kubeconfig) + if err != nil { + return fmt.Errorf("read kubeconfig %s for cluster guard: %v", kubeconfig, err) + } + currentCtx := apiCfg.CurrentContext + if currentCtx == "" { + uc.logger.Warn( + "violet.clusters is set but kubeconfig has no current-context; cluster identity NOT verified", + zap.String("violet.clusters", violetClusters), + zap.String("kubeconfig", kubeconfig), + ) + return nil + } + if uc.confirmCluster == "" { + return fmt.Errorf( + "violet.clusters=%q requires --confirm-cluster= to confirm the target cluster (current context: %q)", + violetClusters, currentCtx, + ) + } + if uc.confirmCluster != currentCtx { + return fmt.Errorf( + "--confirm-cluster=%q does not match KUBECONFIG current-context %q; refusing to operate on the wrong cluster", + uc.confirmCluster, currentCtx, + ) + } + return nil +} + +// runPreflight iterates the configured upgrade paths and calls +// op.PreflightBaseline on each baseline (Versions[0]). The first path +// reporting a non-empty residual list stops the run with a *PreflightError +// — failing fast spares the user a multi-path megareport when the first +// dirty cluster already tells them what to clean. +// +// `seen` deduplicates residuals across paths using (Kind, Namespace, Name); +// it is harmless when fail-fast triggers on the first path but pays off +// if the loop is ever changed to aggregate across paths. +func (uc *UpgradeCommand) runPreflight(ctx context.Context) error { + seen := map[string]struct{}{} + for _, path := range uc.config.UpgradePaths { + if len(path.Versions) == 0 { + continue + } + baseline := path.Versions[0] + residuals, err := uc.operator.PreflightBaseline(ctx, baseline) + if err != nil { + return err + } + var unique []preflight.Residual + for _, r := range residuals { + key := r.Kind + "|" + r.Namespace + "|" + r.Name + if _, dup := seen[key]; dup { + continue + } + seen[key] = struct{}{} + unique = append(unique, r) + } + if len(unique) > 0 { + return &PreflightError{Residuals: unique} + } + } + return nil +} + // getKubeconfig returns the kubeconfig path func (uc *UpgradeCommand) getKubeconfig() string { if uc.kubeconfig == "" { From 3829d4b65a893eadbe405cdb02713406e2909279 Mon Sep 17 00:00:00 2001 From: "huanyang@alauda.io" Date: Tue, 19 May 2026 18:33:17 +0800 Subject: [PATCH 05/10] docs(preflight): README usage + CLAUDE.md hard constraints MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - README: new 'Preflight 前置检查' section documents what gets scanned (Subscription / ArtifactVersion / non-terminal InstallPlan), the 30s timeout, a sample error message, the two new flags (--skip-preflight / --confirm-cluster), and the minimum RBAC verbs needed to run preflight in isolation. - CLAUDE.md: - 数据流: preflight + cluster guard inserted ahead of the upgrade loop; explicit note that preflight does NOT honour cfg.Immediate (quality gate, not part of the upgrade chain). - Operator 抽象 paragraph mentions PreflightBaseline and the leaf pkg/operator/preflight package created to avoid an import cycle. - 硬约束: four new entries — PreflightBaseline must stay read-only and baseline-only (with a note on why CSV is intentionally excluded and where to add it back if needed); cmd.SilenceUsage is mandatory; --confirm-cluster matching rule (B1 default); BundleVersion regex chokepoint. - Plan acceptance criteria all marked complete. --- CLAUDE.md | 17 ++++-- README.md | 52 +++++++++++++++++++ ...05-19-feat-upgrade-preflight-check-plan.md | 38 +++++++------- 3 files changed, 83 insertions(+), 24 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index c3be655..c5c47d9 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -27,16 +27,19 @@ export KUBECONFIG= ### 数据流 -`config.yaml` → `cmd.UpgradeCommand.Execute` → 对每条 `UpgradePath`,按序遍历 `Versions`: +`config.yaml` → `cmd.UpgradeCommand.Execute` → -1. `operator.UpgradeOperator(ctx, version)` —— 把集群升级到该版本 -2. `bash -c ` 在 `OperatorConfig.Workspace`(可被 `version.TestSubPath` 进一步嵌套)下执行,stdout/stderr 通过 `io.MultiWriter` 同时打印和捕获(见 `pkg/exec/exec.go`) +1. **Cluster identity guard**(`assertClusterMatch`)—— `operatorConfig.violet.clusters` 非空时要求 `--confirm-cluster=` 精确匹配,防止生产 KUBECONFIG 被误用;in-cluster 运行降级为 warn +2. **Preflight**(`runPreflight` → `op.PreflightBaseline`,可被 `--skip-preflight` 关闭)—— 对每条 path 的 `Versions[0]` 只读扫描残留 Subscription / ArtifactVersion / 非终态 InstallPlan,30s 超时;任一残留构造 `*cmd.PreflightError` 直接 return(fail-fast 跨 path) +3. 对每条 `UpgradePath`,按序遍历 `Versions`: + - `operator.UpgradeOperator(ctx, version)` —— 把集群升级到该版本 + - `bash -c ` 在 `OperatorConfig.Workspace`(可被 `version.TestSubPath` 进一步嵌套)下执行,stdout/stderr 通过 `io.MultiWriter` 同时打印和捕获(见 `pkg/exec/exec.go`) -第一个版本的默认 `testCommand` 是 `REPO=allure make prepare`(准备测试数据),后续版本默认是 `REPO=allure make upgrade`(验证数据 + 升级断言)。`config.Immediate=true` 时遇到错误立即停止;否则继续下一条 path。 +第一个版本的默认 `testCommand` 是 `REPO=allure make prepare`(准备测试数据),后续版本默认是 `REPO=allure make upgrade`(验证数据 + 升级断言)。`config.Immediate=true` 时遇到错误立即停止;否则继续下一条 path。Preflight 阶段**不**复用 `Immediate` —— preflight 失败永远 fail-fast,因为它是"质量门"而非升级路径的一部分。 ### Operator 抽象 -`pkg/operator/interface.go` 定义了仅一个方法的接口 `OperatorInterface.UpgradeOperator`。Factory(`factory.go`)根据 `operatorConfig.type` 选择实现: +`pkg/operator/interface.go` 定义了 `OperatorInterface`,包含两个方法:`UpgradeOperator(ctx, version)` 和 `PreflightBaseline(ctx, version) ([]preflight.Residual, error)`(后者用于升级前置检查)。`preflight.Residual` 值类型定义在叶子子包 `pkg/operator/preflight` 里,避免与上层 `pkg/operator` 形成 import cycle。Factory(`factory.go`)根据 `operatorConfig.type` 选择实现: - **`operatorhub`(默认)** —— `pkg/operator/operatorhub/`。生产路径。Artifact / ArtifactVersion 的**写路径**已外包给 `violet` 二进制(子进程,见 `violet.go::installViaViolet`);Go 侧仍用 dynamic client 做以下三类只读 / OLM 资源: - Alauda 自定义资源 `app.alauda.io/v1alpha1` 下的 `Artifact` / `ArtifactVersion`:**Go 端只剩 Get + Delete**(清理残留 AV、等 AV phase=Present、读 status.version 拿 CSV);Create / Patch / Update 全归 violet。命名空间硬编码 `cpaas-system`,OLM 源名硬编码 `platform`(const `targetCatalogSource` in `operator.go`)。 @@ -67,6 +70,10 @@ Artifact 名称约定:未显式给 `artifact` 字段时,自动拼成 `///.latest.ALL..tgz` 这个 mirror MinIO URL 的布局检查缓存:命中跳过 HTTP;miss 直接下载到该路径(父目录自动 mkdir),不再走 `/tmp`,下次自动命中;任一路径下都 **不会清理**(cleanup 为 noop),保留为缓存。`VerifySha256` 即使命中也会执行——避免损坏的缓存文件被静默喂给 violet。留空保持旧行为:下载到一次性 `/tmp/upgrade-violet-*` 并在 `defer cleanup()` 中删除,无跨次复用。下载半途失败时 cache 路径的半成品会被 `os.Remove` 清掉,防止下次假命中。 +- **`PreflightBaseline` 是只读、仅检查 baseline** —— 实现在 `pkg/operator/operatorhub/preflight.go`,对每条 path 的 `Versions[0]` 检查 Subscription / ArtifactVersion / 非终态 InstallPlan 三类残留。不调用 Create/Update/Patch/Delete(单测里有 spy reactor 强制断言)。不要扩展为扫所有 versions —— 中间版本是 CLI 自产中间态,归 `installViaViolet::deleteArtifactVersionIfExists` 负责,扫了会双删竞态。CSV 残留**不在 preflight 检查范围内**(独立 CSV 残留必然伴随 Sub 或 AV,已被前三项 check 覆盖;future 若实测出现独立 CSV 残留,按 plan 给的 PackageManifest 路径加,**注意查询 namespace 是 catalog source 的 ns 即 `cpaas-system`,不是 operator 安装 ns**)。 +- **`cmd.SilenceUsage = true` 在 AddFlags 强制开启** —— preflight 失败的 `*PreflightError` 包含可复制粘贴的 `kubectl delete` 命令模板,是 PR 设计的核心 UX;cobra 默认在 RunE 返回 error 时打印 --help 会把这些命令淹没。flag-parsing 错误(如 unknown flag)走的是 cobra 内部不受此影响,仍会打印 usage。 +- **`--confirm-cluster` 匹配规则**(`cmd/upgrade_command.go::assertClusterMatch`)—— 当前实现是**精确字符串相等**(与 KUBECONFIG `CurrentContext` 比较)。要换成子串/正则等更宽松规则,只需改这一个比较,flag surface 不变。in-cluster 运行(无 kubeconfig 文件)降级为 `WARN`,不阻塞 CI pod。 +- **`BundleVersion` 必须符合 `^[a-zA-Z0-9._-]+$`** —— `pkg/config/config.go::validateConfig` 在 LoadConfig 阶段强校验。该字段会被插入 kubectl 命令模板和 violet argv,允许 shell 元字符(`$`/`` ` ``/`;`/quotes)就是 shell 注入。单点 chokepoint 比每个下游 consumer 各自防御更可靠。 ## PR / 协作流程 diff --git a/README.md b/README.md index bc43307..755c288 100644 --- a/README.md +++ b/README.md @@ -171,6 +171,58 @@ export KUBECONFIG= ./upgrade --config upgrade.yaml ``` +## 前置检查 (preflight) + +进入升级循环前,upgrade CLI 会对**每条升级路径的起点版本(`versions[0]`)**做一次**只读**扫描,发现下列任一残留就立即停止: + +| 检查项 | 资源 | 命名空间 | +|--------|------|----------| +| Subscription 残留 | `subscriptions.operators.coreos.com/` | `operatorConfig.namespace` | +| ArtifactVersion 残留 | `artifactversions.app.alauda.io/.` | `cpaas-system` | +| 未结束的 InstallPlan | `installplans.operators.coreos.com`(用 OLM label `operators.coreos.com/.` 精确过滤),`status.phase` 非 `Complete` / `Failed` | `operatorConfig.namespace` | + +preflight 设有 **30s 总超时**,超时直接报错避免阻塞升级。任何残留 → 输出每个对象的 `kubectl delete` 命令模板(已用 `%q` 转义、可直接复制粘贴),并附带 finalizer 卡死的兜底指令和"等 OLM settle 30s"提示。 + +**典型错误信息**: + +``` +preflight failed: 1 residual resource(s) blocking upgrade: + + Subscription/tektoncd-operator (ns: tektoncd-pipelines) + kubectl delete subscription "tektoncd-operator" -n "tektoncd-pipelines" + +If a delete hangs (finalizer stuck), patch finalizers off: + kubectl -n patch --type=merge -p '{"metadata":{"finalizers":[]}}' + +After cleanup, wait ~30s for OLM to settle, then re-run `upgrade`. +To bypass (NOT recommended): re-run with --skip-preflight +``` + +### preflight 相关 flag + +| Flag | 何时用 | 行为 | +|------|--------|------| +| `--skip-preflight` | 已知环境脏但要测某个边界 case;CI 应急 | 跳过整段检查,仅 `WARN` 一行 audit;其余流程不变 | +| `--confirm-cluster=` | `operatorConfig.violet.clusters` **非空**时**必填** | `` 必须等于 KUBECONFIG 当前 context 名;不匹配直接报错,防止误把生产 KUBECONFIG 当 sprint env 跑("silent 假成功"的对称防御) | + +in-cluster 运行(pod 内,无 kubeconfig 文件)会自动降级为 `WARN` 提示而不是硬 fail。 + +### 最小 RBAC + +只跑 preflight 需要的 verbs(不含升级本身需要的写权限): + +```yaml +rules: + - apiGroups: ["operators.coreos.com"] + resources: ["subscriptions", "installplans"] + verbs: ["get", "list"] + - apiGroups: ["app.alauda.io"] + resources: ["artifactversions"] + verbs: ["get"] +``` + +升级本身需要的写权限不在此列。 + ## violet 依赖与运行环境 upgrade CLI 把 `Artifact` / `ArtifactVersion` CR 的创建步骤外包给 `violet` 二进制,因此运行环境必须满足以下条件。 diff --git a/docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md b/docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md index 833e6e1..bc6c159 100644 --- a/docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md +++ b/docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md @@ -278,31 +278,31 @@ cmd.Execute ### Functional Requirements -- [ ] `OperatorInterface` 新增 `PreflightBaseline(ctx, version) ([]Residual, error)`。 -- [ ] `pkg/operator/residual.go` 新增 `Residual` 类型(kind/namespace/name/recommended_cleanup)。 -- [ ] operatorhub 实现完成 3 类检查(Sub / AV / 非终态 IP w/ OLM labelSelector)。 -- [ ] local 实现返回 `nil, nil`;README 标注 no-op。 -- [ ] `cmd/preflight_error.go` 新增 `*PreflightError`,仅 cmd-internal;Error() 含多行报告 + finalizer 兜底 + skip hint。 -- [ ] `cmd.Execute` 在升级循环前调用 `runPreflight`;fail-fast 跨 path;Get-cache 去重。 -- [ ] `cmd.SilenceUsage = true` 在 preflight 失败路径上确保 cobra 不打印 --help。 -- [ ] 新增 `--skip-preflight` flag:仅 warn 跳过。 -- [ ] 新增 `--confirm-cluster` flag:当 `Violet.Clusters` 非空时必填,CLI 启动早期校验。 -- [ ] `config.validateConfig` 新增 2 条规则:operatorhub 类型时 `Namespace` 必填;`BundleVersion` 正则 `^[a-zA-Z0-9._-]+$`。 -- [ ] kubectl 命令模板用 `%q` 转义 name/namespace;`--context` 用 Go 端 pre-render 不用 `$(...)`。 +- [x] `OperatorInterface` 新增 `PreflightBaseline(ctx, version) ([]Residual, error)`。 +- [x] `pkg/operator/residual.go` 新增 `Residual` 类型(kind/namespace/name/recommended_cleanup)。 +- [x] operatorhub 实现完成 3 类检查(Sub / AV / 非终态 IP w/ OLM labelSelector)。 +- [x] local 实现返回 `nil, nil`;README 标注 no-op。 +- [x] `cmd/preflight_error.go` 新增 `*PreflightError`,仅 cmd-internal;Error() 含多行报告 + finalizer 兜底 + skip hint。 +- [x] `cmd.Execute` 在升级循环前调用 `runPreflight`;fail-fast 跨 path;Get-cache 去重。 +- [x] `cmd.SilenceUsage = true` 在 preflight 失败路径上确保 cobra 不打印 --help。 +- [x] 新增 `--skip-preflight` flag:仅 warn 跳过。 +- [x] 新增 `--confirm-cluster` flag:当 `Violet.Clusters` 非空时必填,CLI 启动早期校验。 +- [x] `config.validateConfig` 新增 2 条规则:operatorhub 类型时 `Namespace` 必填;`BundleVersion` 正则 `^[a-zA-Z0-9._-]+$`。 +- [x] kubectl 命令模板用 `%q` 转义 name/namespace;`--context` 用 Go 端 pre-render 不用 `$(...)`。 ### Non-Functional Requirements -- [ ] preflight 只读:单元测试 mock dynamic client,断言全程**仅** Get/List 调用,无 Create/Update/Patch/Delete。 -- [ ] preflight 性能:每条 path **30s context timeout 硬上限**;典型 N=1 时 < 300ms(含 3 次 Get + 1 次 List);N=3 时 < 1s。 -- [ ] InstallPlan List 必须带 `operators.coreos.com/.=` labelSelector,避免扫整 ns 历史 IP。 -- [ ] godoc:`PreflightBaseline` 方法注释明确"只读 + 仅 baseline";附 `// Preflight contract: this method MUST NOT mutate cluster state.`。 +- [x] preflight 只读:单元测试 mock dynamic client,断言全程**仅** Get/List 调用,无 Create/Update/Patch/Delete。 +- [x] preflight 性能:每条 path **30s context timeout 硬上限**;典型 N=1 时 < 300ms(含 3 次 Get + 1 次 List);N=3 时 < 1s。 +- [x] InstallPlan List 必须带 `operators.coreos.com/.=` labelSelector,避免扫整 ns 历史 IP。 +- [x] godoc:`PreflightBaseline` 方法注释明确"只读 + 仅 baseline";附 `// Preflight contract: this method MUST NOT mutate cluster state.`。 ### Quality Gates -- [ ] 新增单元测试:`pkg/operator/operatorhub/preflight_test.go`,覆盖:clean / 单 Sub 残留 / 单 AV 残留 / 单 IP 残留 / 多类残留 / transient error / NotFound / labelSelector 命中 0。 -- [ ] 新增 `cmd/preflight_error_test.go`:验证 `Error()` 文案稳定(含 finalizer hint + skip hint)。 -- [ ] README 更新:"升级前会执行 preflight 检查,残留资源会被列出,需要手动清理。`--skip-preflight` / `--confirm-cluster` 用法说明 + RBAC 最小 verbs"(参考 security #5)。 -- [ ] CLAUDE.md 更新:"PreflightBaseline 是只读、仅检查 baseline" 写入"硬约束"段;`cmd.SilenceUsage` 由 preflight 强制开启。 +- [x] 新增单元测试:`pkg/operator/operatorhub/preflight_test.go`,覆盖:clean / 单 Sub 残留 / 单 AV 残留 / 单 IP 残留 / 多类残留 / transient error / NotFound / labelSelector 命中 0。 +- [x] 新增 `cmd/preflight_error_test.go`:验证 `Error()` 文案稳定(含 finalizer hint + skip hint)。 +- [x] README 更新:"升级前会执行 preflight 检查,残留资源会被列出,需要手动清理。`--skip-preflight` / `--confirm-cluster` 用法说明 + RBAC 最小 verbs"(参考 security #5)。 +- [x] CLAUDE.md 更新:"PreflightBaseline 是只读、仅检查 baseline" 写入"硬约束"段;`cmd.SilenceUsage` 由 preflight 强制开启。 ## Success Metrics From c8ec3440cf5058c56431b2f14da8b092248fd80c Mon Sep 17 00:00:00 2001 From: "huanyang@alauda.io" Date: Tue, 19 May 2026 18:38:38 +0800 Subject: [PATCH 06/10] docs(plan): mark preflight-check plan as completed --- docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md b/docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md index bc6c159..ef56abc 100644 --- a/docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md +++ b/docs/plans/2026-05-19-feat-upgrade-preflight-check-plan.md @@ -1,7 +1,7 @@ --- title: upgrade CLI 前置检查 (preflight check) type: feat -status: active +status: completed date: 2026-05-19 deepened_on: 2026-05-19 --- From 1201665cd52c638c93e1444fa8f056739eec511e Mon Sep 17 00:00:00 2001 From: "huanyang@alauda.io" Date: Tue, 19 May 2026 20:12:01 +0800 Subject: [PATCH 07/10] fix: address P1+P2 review findings for preflight check MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Three silent-failure paths fixed (P1), four hardening / cleanup changes applied (P2). All P1 fixes close anti-patterns where preflight reported "clean" while actually bypassing the guard it was built to enforce. P1 — silent-failure closures: - #013/#014 cluster guard: in-cluster mode now requires --confirm-cluster matching violet.clusters; empty kubeconfig CurrentContext is now a hard error with an actionable 'kubectl config use-context' hint. Previously both paths log.Warn'd and returned nil, defeating the guard exactly when it mattered most (CI pods with violet.clusters set, multi-context kubeconfigs). - #015 BundleVersion required for operatorhub: empty value used to slip past validateConfig, producing 'operatorhub-foo.' AV names with a trailing dot — preflight reported clean while upgrade failed late with cryptic 404s. Required at load time. P2 — internal hardening: - #016 Drop the seen map from runPreflight: dead code under fail-fast semantics (architecture-strategist + simplicity-reviewer both flagged it independently). -14 LOC. - #018 checkInstallPlanResidue no longer discards the NestedString error. status.phase type drift now surfaces a wrapped error instead of being swallowed and producing a 'every IP is non-terminal' noise storm. - #019 New preflight.NewResidual constructor centralises kubectl cleanup template + %q shell-quoting. Three call sites in operatorhub/preflight.go go from 5-line struct literals to one-line constructor calls. P3 follow-ups (020-023) recorded as separate todos for later triage. Tests for the cmd-layer changes (#017) land in the next commit. --- cmd/upgrade_command.go | 81 +++++++++-------- pkg/config/config.go | 17 ++-- pkg/operator/operatorhub/preflight.go | 42 +++++---- pkg/operator/preflight/types.go | 33 ++++++- ...1-cluster-guard-incluster-silent-passes.md | 77 ++++++++++++++++ ...-p1-empty-current-context-silent-passes.md | 77 ++++++++++++++++ ...te-p1-empty-bundleversion-silent-accept.md | 85 ++++++++++++++++++ todos/016-complete-p2-seen-map-dead-code.md | 80 +++++++++++++++++ ...omplete-p2-cmd-layer-zero-test-coverage.md | 78 ++++++++++++++++ ...omplete-p2-nestedstring-error-discarded.md | 79 ++++++++++++++++ ...ete-p2-residual-constructor-shell-quote.md | 90 +++++++++++++++++++ ...020-pending-p3-test-coverage-extensions.md | 63 +++++++++++++ ...21-pending-p3-preflight-simplifications.md | 76 ++++++++++++++++ ...2-pending-p3-agent-ergonomics-followups.md | 57 ++++++++++++ todos/023-pending-p3-misc-cleanups.md | 54 +++++++++++ 15 files changed, 925 insertions(+), 64 deletions(-) create mode 100644 todos/013-complete-p1-cluster-guard-incluster-silent-passes.md create mode 100644 todos/014-complete-p1-empty-current-context-silent-passes.md create mode 100644 todos/015-complete-p1-empty-bundleversion-silent-accept.md create mode 100644 todos/016-complete-p2-seen-map-dead-code.md create mode 100644 todos/017-complete-p2-cmd-layer-zero-test-coverage.md create mode 100644 todos/018-complete-p2-nestedstring-error-discarded.md create mode 100644 todos/019-complete-p2-residual-constructor-shell-quote.md create mode 100644 todos/020-pending-p3-test-coverage-extensions.md create mode 100644 todos/021-pending-p3-preflight-simplifications.md create mode 100644 todos/022-pending-p3-agent-ergonomics-followups.md create mode 100644 todos/023-pending-p3-misc-cleanups.md diff --git a/cmd/upgrade_command.go b/cmd/upgrade_command.go index 1ed173f..6cb29de 100644 --- a/cmd/upgrade_command.go +++ b/cmd/upgrade_command.go @@ -15,7 +15,6 @@ import ( "github.com/AlaudaDevops/upgrade-test/pkg/config" "github.com/AlaudaDevops/upgrade-test/pkg/exec" "github.com/AlaudaDevops/upgrade-test/pkg/operator" - "github.com/AlaudaDevops/upgrade-test/pkg/operator/preflight" "knative.dev/pkg/logging" ) @@ -171,15 +170,22 @@ func (uc *UpgradeCommand) Execute() error { // when violet.clusters is configured. It is a no-op for configs that do not // use violet's multi-cluster path. // -// In-cluster (empty kubeconfig path) cannot read a context name, so the guard -// degrades to a loud warning rather than refusing to run — a CI pod that lost -// its KUBECONFIG should not be silently blocked, but its operator should know -// the cluster identity check was skipped. +// When violet.clusters is set the user has opted into a multi-cluster ACP +// topology where wrong-cluster writes are catastrophic, so the guard is +// strict in EVERY mode: // -// DECISION B (locked at default — B1 / exact equality): the recommended -// matching rule is exact string equality against KUBECONFIG's CurrentContext. -// To loosen to substring/regex matching, replace the comparison below; the -// flag surface (--confirm-cluster) stays unchanged either way. +// - File-mode kubeconfig: --confirm-cluster must equal CurrentContext. +// - In-cluster (empty kubeconfig path): there is no context name to read, +// so --confirm-cluster must equal violet.clusters — the user explicitly +// acknowledges the target subcluster. +// - Empty CurrentContext in a multi-context kubeconfig is a user +// configuration error, not an environmental constraint. We refuse to +// run and tell the user to `kubectl config use-context ` first. +// +// DECISION B (locked at default — B1 / exact equality): the matching rule +// is exact string equality. To loosen to substring/regex matching, replace +// the two comparisons below; the flag surface (--confirm-cluster) stays +// unchanged either way. func (uc *UpgradeCommand) assertClusterMatch(cfg *config.Config, kubeconfig string) error { if cfg.OperatorConfig.Violet == nil || cfg.OperatorConfig.Violet.Clusters == "" { return nil @@ -187,24 +193,42 @@ func (uc *UpgradeCommand) assertClusterMatch(cfg *config.Config, kubeconfig stri violetClusters := cfg.OperatorConfig.Violet.Clusters if kubeconfig == "" { - uc.logger.Warn( - "violet.clusters is set but running in-cluster (no KUBECONFIG file); cluster identity NOT verified", - zap.String("violet.clusters", violetClusters), - ) + // In-cluster mode: no CurrentContext to read. Require the user to + // explicitly acknowledge the target subcluster via --confirm-cluster + // matching violet.clusters. Without this guard a CI pod silently + // writes to whatever cluster its serviceaccount is bound to. + if uc.confirmCluster == "" { + return fmt.Errorf( + "violet.clusters=%q is set but running in-cluster (no KUBECONFIG file); "+ + "--confirm-cluster=%q is required to acknowledge the target subcluster", + violetClusters, violetClusters, + ) + } + if uc.confirmCluster != violetClusters { + return fmt.Errorf( + "--confirm-cluster=%q does not match violet.clusters=%q; "+ + "in-cluster mode requires the two to be identical", + uc.confirmCluster, violetClusters, + ) + } return nil } + apiCfg, err := clientcmd.LoadFromFile(kubeconfig) if err != nil { return fmt.Errorf("read kubeconfig %s for cluster guard: %v", kubeconfig, err) } currentCtx := apiCfg.CurrentContext if currentCtx == "" { - uc.logger.Warn( - "violet.clusters is set but kubeconfig has no current-context; cluster identity NOT verified", - zap.String("violet.clusters", violetClusters), - zap.String("kubeconfig", kubeconfig), + // User configuration error: multi-context kubeconfig with no + // current-context set. Don't silently fall back — `upgrade` would + // otherwise inherit whatever default the client picks. Make the + // user pick. + return fmt.Errorf( + "violet.clusters=%q but kubeconfig %s has no current-context; "+ + "run `kubectl config use-context ` first", + violetClusters, kubeconfig, ) - return nil } if uc.confirmCluster == "" { return fmt.Errorf( @@ -226,32 +250,17 @@ func (uc *UpgradeCommand) assertClusterMatch(cfg *config.Config, kubeconfig stri // reporting a non-empty residual list stops the run with a *PreflightError // — failing fast spares the user a multi-path megareport when the first // dirty cluster already tells them what to clean. -// -// `seen` deduplicates residuals across paths using (Kind, Namespace, Name); -// it is harmless when fail-fast triggers on the first path but pays off -// if the loop is ever changed to aggregate across paths. func (uc *UpgradeCommand) runPreflight(ctx context.Context) error { - seen := map[string]struct{}{} for _, path := range uc.config.UpgradePaths { if len(path.Versions) == 0 { continue } - baseline := path.Versions[0] - residuals, err := uc.operator.PreflightBaseline(ctx, baseline) + residuals, err := uc.operator.PreflightBaseline(ctx, path.Versions[0]) if err != nil { return err } - var unique []preflight.Residual - for _, r := range residuals { - key := r.Kind + "|" + r.Namespace + "|" + r.Name - if _, dup := seen[key]; dup { - continue - } - seen[key] = struct{}{} - unique = append(unique, r) - } - if len(unique) > 0 { - return &PreflightError{Residuals: unique} + if len(residuals) > 0 { + return &PreflightError{Residuals: residuals} } } return nil diff --git a/pkg/config/config.go b/pkg/config/config.go index 8c14793..48f77fe 100644 --- a/pkg/config/config.go +++ b/pkg/config/config.go @@ -239,11 +239,18 @@ func validateConfig(cfg *Config) error { if v.Channel == "" { return fmt.Errorf("upgradePaths[%d].versions[%d] (%q): channel is required for operatorhub type", i, j, v.Name) } - // bundleVersion gets interpolated into shell commands (preflight - // cleanup hints) and into violet argv. Rejecting shell-active - // characters here is a single chokepoint that closes the - // injection vector for every downstream consumer. - if v.BundleVersion != "" && !bundleVersionRegex.MatchString(v.BundleVersion) { + // bundleVersion is load-bearing for operatorhub: it feeds the + // AV name (.), the violet push tgz + // URL, and the kubectl cleanup hints emitted by preflight. + // Empty values silently produce malformed downstream names + // ("operatorhub-foo." with trailing dot), so preflight would + // report clean while upgrade fails late with cryptic errors. + // Require non-empty here, then regex-validate to close the + // shell-injection vector for every downstream consumer. + if v.BundleVersion == "" { + return fmt.Errorf("upgradePaths[%d].versions[%d] (%q): bundleVersion is required for operatorhub type", i, j, v.Name) + } + if !bundleVersionRegex.MatchString(v.BundleVersion) { return fmt.Errorf("upgradePaths[%d].versions[%d] (%q): bundleVersion %q must match %s", i, j, v.Name, v.BundleVersion, bundleVersionRegex.String()) } } diff --git a/pkg/operator/operatorhub/preflight.go b/pkg/operator/operatorhub/preflight.go index 0ca5f1b..356201d 100644 --- a/pkg/operator/operatorhub/preflight.go +++ b/pkg/operator/operatorhub/preflight.go @@ -82,12 +82,9 @@ func (o *Operator) checkSubscriptionResidue(ctx context.Context, _ config.Versio case err != nil: return nil, err } - return []preflight.Residual{{ - Kind: "Subscription", - Namespace: o.namespace, - Name: sub.GetName(), - RecommendedCleanup: fmt.Sprintf("kubectl delete subscription %q -n %q", sub.GetName(), o.namespace), - }}, nil + return []preflight.Residual{ + preflight.NewResidual("Subscription", o.namespace, sub.GetName()), + }, nil } // checkArtifactVersionResidue Gets the AV named . in @@ -103,12 +100,9 @@ func (o *Operator) checkArtifactVersionResidue(ctx context.Context, version conf case err != nil: return nil, err } - return []preflight.Residual{{ - Kind: "ArtifactVersion", - Namespace: systemNamespace, - Name: av.GetName(), - RecommendedCleanup: fmt.Sprintf("kubectl delete artifactversion %q -n %q", av.GetName(), systemNamespace), - }}, nil + return []preflight.Residual{ + preflight.NewResidual("ArtifactVersion", systemNamespace, av.GetName()), + }, nil } // checkInstallPlanResidue Lists InstallPlans in the operator namespace using @@ -126,18 +120,28 @@ func (o *Operator) checkInstallPlanResidue(ctx context.Context, _ config.Version return nil, err } + log := logging.FromContext(ctx) var residuals []preflight.Residual for _, ip := range ips.Items { - phase, _, _ := unstructured.NestedString(ip.Object, "status", "phase") + phase, found, perr := unstructured.NestedString(ip.Object, "status", "phase") + if perr != nil { + // Type drift: status.phase is no longer a string. Surface loudly + // rather than silently treating it as non-terminal and reporting + // every IP in the namespace as residue. Catches OLM CRD schema + // migrations early instead of producing a noise storm. + return nil, fmt.Errorf("InstallPlan %s/%s: unexpected status.phase type: %w", + ip.GetNamespace(), ip.GetName(), perr) + } + if !found { + // IP exists but has not been reconciled yet (no status.phase + // written). Treat as live — it is by definition non-terminal — + // but log so an unexpected flood of these is observable. + log.Infow("InstallPlan has no status.phase yet, treating as live", "name", ip.GetName()) + } if _, terminal := installPlanTerminalPhases[phase]; terminal { continue } - residuals = append(residuals, preflight.Residual{ - Kind: "InstallPlan", - Namespace: o.namespace, - Name: ip.GetName(), - RecommendedCleanup: fmt.Sprintf("kubectl delete installplan %q -n %q", ip.GetName(), o.namespace), - }) + residuals = append(residuals, preflight.NewResidual("InstallPlan", o.namespace, ip.GetName())) } return residuals, nil } diff --git a/pkg/operator/preflight/types.go b/pkg/operator/preflight/types.go index b9f76f2..ab4cc0b 100644 --- a/pkg/operator/preflight/types.go +++ b/pkg/operator/preflight/types.go @@ -4,18 +4,43 @@ // import it without creating an import cycle. package preflight +import ( + "fmt" + "strings" +) + // Residual is a single OLM-side resource that PreflightBaseline detected as // still present in the target cluster. It is a value type intentionally — // callers (cmd layer) accumulate them across paths and format the final user // report without needing to type-assert any operator-package internals. // -// RecommendedCleanup is pre-rendered so the cmd layer does not have to know -// per-resource cleanup conventions; operator implementations own that knowledge. -// The pre-rendered string must have its `name` / `namespace` segments already -// shell-escaped (via %q) because users copy-paste it verbatim into a shell. +// Construct via NewResidual; the constructor is the single source of truth +// for `RecommendedCleanup` formatting and shell-quoting, so individual +// operator implementations never re-implement that template (and never +// accidentally skip the quoting). type Residual struct { Kind string Namespace string Name string RecommendedCleanup string } + +// NewResidual constructs a Residual with `RecommendedCleanup` filled in from +// the standard `kubectl delete -n ` template, with +// name and namespace `%q`-quoted so a user copy-pasting the line into a shell +// gets sane behavior for any K8s-valid (DNS-1123) identifier. +// +// Kind is lower-cased to match the kubectl noun form (`subscription`, not +// `Subscription`); the Kind field on the struct keeps its capitalised form +// for the human-facing report. +func NewResidual(kind, namespace, name string) Residual { + return Residual{ + Kind: kind, + Namespace: namespace, + Name: name, + RecommendedCleanup: fmt.Sprintf( + "kubectl delete %s %q -n %q", + strings.ToLower(kind), name, namespace, + ), + } +} diff --git a/todos/013-complete-p1-cluster-guard-incluster-silent-passes.md b/todos/013-complete-p1-cluster-guard-incluster-silent-passes.md new file mode 100644 index 0000000..e5c9c31 --- /dev/null +++ b/todos/013-complete-p1-cluster-guard-incluster-silent-passes.md @@ -0,0 +1,77 @@ +--- +status: complete +priority: p1 +issue_id: "013" +tags: [code-review, security, preflight, pr-17] +dependencies: [] +--- + +# Cluster guard 在 in-cluster + `violet.clusters` 非空时静默放行 + +## Problem Statement + +`assertClusterMatch` 当 `cfg.OperatorConfig.Violet.Clusters` 非空但 KUBECONFIG 是空(in-cluster 模式)时,只打了一行 `log.Warn` 就 `return nil` 放行。 + +但是 —— **`violet.clusters` 被显式设置正是 "我在多集群 ACP 环境,必须确认目标"**。in-cluster 模式恰恰是这条 guard 最该硬拦的场景:CI pod 里 `KUBECONFIG=` 但 `violet.clusters=devops` → 跑 `upgrade` 直接写到 prod,warn 没人看。 + +这是 PR 自己的 value proposition 反语义:preflight 设计目的是"silent 假成功的对称防御",结果该防御本身在最重要的场景下 silently 放行。 + +## Findings + +来源:silent-failure-hunter 评审 #2(FIX 级别)+ security-sentinel #4(MEDIUM) + +- `cmd/upgrade_command.go:189-195` 当前实现: + ```go + if kubeconfig == "" { + uc.logger.Warn( + "violet.clusters is set but running in-cluster (no KUBECONFIG file); cluster identity NOT verified", + zap.String("violet.clusters", violetClusters), + ) + return nil + } + ``` + +## Proposed Solutions + +### 选项 1(推荐):require `--confirm-cluster` 即使 in-cluster + +当 `violet.clusters != ""` 且无 kubeconfig 时,把 `--confirm-cluster` 升级为**必填**,且与 `cfg.OperatorConfig.Violet.Clusters` 做精确比对(即用户必须显式申明"我知道这是 devops 集群")。 + +- pros: 真实硬拦在场景下生效;与 file-mode 行为对称(都强 require --confirm-cluster) +- cons: 现有 in-cluster CI job 全部需要更新 spec 加 `--confirm-cluster=` +- effort: Small (~10 LOC) +- risk: 破坏性 — 已有 CI 流水线会一次性 fail,需要发 release note + +### 选项 2:env-only fallback for CI + +新增 env `UPGRADE_CONFIRM_CLUSTER=` 作为 in-cluster 场景下 `--confirm-cluster` 的替代——pod 注入 env 更符合 CI/Tekton 用法。 + +- pros: 不破坏 in-cluster 流水线(升级 env injection 即可) +- cons: 又多一个配置入口 +- effort: Small + +## Recommended Action + +待 triage。 + +## Technical Details + +- 文件:`cmd/upgrade_command.go::assertClusterMatch` +- 测试:`cmd/upgrade_command_test.go` 不存在,本 todo 同时是触发 #016 的依赖 + +## Acceptance Criteria + +- [ ] `violet.clusters != "" && kubeconfig == "" && --confirm-cluster == ""` 必须 return error +- [ ] `violet.clusters != "" && kubeconfig == "" && --confirm-cluster == violet.clusters` 通过(in-cluster 显式承认) +- [ ] 上述两条用单测覆盖 +- [ ] README 更新 in-cluster 场景说明 + +## Work Log + +_待开始_ + +## Resources + +- PR #17: https://github.com/AlaudaDevops/upgrade-test/pull/17 +- silent-failure-hunter review verdict #2 +- security-sentinel review verdict #4 diff --git a/todos/014-complete-p1-empty-current-context-silent-passes.md b/todos/014-complete-p1-empty-current-context-silent-passes.md new file mode 100644 index 0000000..121582e --- /dev/null +++ b/todos/014-complete-p1-empty-current-context-silent-passes.md @@ -0,0 +1,77 @@ +--- +status: complete +priority: p1 +issue_id: "014" +tags: [code-review, security, preflight, pr-17] +dependencies: [] +--- + +# Cluster guard 在 `current-context: ""` 时静默放行 + +## Problem Statement + +`assertClusterMatch` 当 KUBECONFIG 能读到但 `apiCfg.CurrentContext == ""` 时(多 context 文件且未 `kubectl config use-context`),只 log.Warn 后 `return nil`。 + +但**空 current-context 是用户配置错误**,不是环境约束。允许 silent pass = 跑过 preflight → 进入升级 → kubectl-client 自己选默认 context(或失败),又一次"silent 假成功"。 + +## Findings + +来源:silent-failure-hunter 评审 #4(FIX 级别) + +- `cmd/upgrade_command.go:200-208` 当前实现: + ```go + if currentCtx == "" { + uc.logger.Warn( + "violet.clusters is set but kubeconfig has no current-context; cluster identity NOT verified", + ... + ) + return nil + } + ``` + +## Proposed Solutions + +### 选项 1(推荐):直接报错 + +```go +return fmt.Errorf( + "violet.clusters=%q but kubeconfig %s has no current-context; run `kubectl config use-context ` first", + violetClusters, kubeconfig, +) +``` + +- pros: 用户拿到 actionable 命令;不再 silent +- cons: 已有"裸 multi-context kubeconfig + 多个 path"的工作流会断 +- effort: Trivial (~5 LOC) +- risk: 用户工作流断裂 — 但本就是用户应该 fix 的问题 + +### 选项 2:强制要求 `--confirm-cluster` 已设(与 #013 一致策略) + +current-context 空 → 退化到 in-cluster 行为(必须有 --confirm-cluster),用 --confirm-cluster 与 violet.clusters 比对。 + +- pros: 与 #013 行为对称(in-cluster vs empty-ctx 走同一处理) +- cons: 略难解释(为什么 empty ctx 等于 in-cluster) + +## Recommended Action + +待 triage。推荐选项 1(fail-loud 比 fail-symmetric 更易解释)。 + +## Technical Details + +- 文件:`cmd/upgrade_command.go::assertClusterMatch` +- 与 #013 共享文件,建议同一 PR 修复 + +## Acceptance Criteria + +- [ ] `violet.clusters != "" && CurrentContext == ""` 必须 return error +- [ ] 错误消息含 "run `kubectl config use-context` first" actionable hint +- [ ] 单测覆盖 + +## Work Log + +_待开始_ + +## Resources + +- PR #17 +- silent-failure-hunter verdict #4 diff --git a/todos/015-complete-p1-empty-bundleversion-silent-accept.md b/todos/015-complete-p1-empty-bundleversion-silent-accept.md new file mode 100644 index 0000000..1bd860d --- /dev/null +++ b/todos/015-complete-p1-empty-bundleversion-silent-accept.md @@ -0,0 +1,85 @@ +--- +status: complete +priority: p1 +issue_id: "015" +tags: [code-review, preflight, config, pr-17] +dependencies: [] +--- + +# operatorhub 类型时 `bundleVersion: ""` silent-accept,污染下游 + +## Problem Statement + +`pkg/config/config.go::validateConfig` 中 BundleVersion 校验是 `if v.BundleVersion != "" && !bundleVersionRegex.MatchString(...)` —— 空 BundleVersion 跳过 regex 校验,但**对 operatorhub 类型**是 load-bearing 字段,不该 optional。 + +下游影响: +- `preflight.checkArtifactVersionResidue` 拼 `.` → trailing dot → AV name 畸形 → Get NotFound = false clean +- `installViaViolet::BuildPackageURL` 也会拼出畸形 URL,cryptic 404 错误 +- `violet push` argv 出现空 bundleVersion 段 + +preflight 报"无残留"实际上是配置错误的产物。同样是 PR value proposition 自己挫败。 + +## Findings + +来源:silent-failure-hunter 评审 #7(FIX 级别) + +- `pkg/config/config.go:246-249`: + ```go + if v.BundleVersion != "" && !bundleVersionRegex.MatchString(v.BundleVersion) { + return fmt.Errorf("...bundleVersion %q must match %s", ...) + } + ``` + +## Proposed Solutions + +### 选项 1(推荐):operatorhub 要求 BundleVersion 非空 + 非法字符 + +把校验从 "if 非空再校验" 改为 "operatorhub 时必填": + +```go +if cfg.OperatorConfig.Type == "operatorhub" { + if v.BundleVersion == "" { + return fmt.Errorf("upgradePaths[%d].versions[%d] (%q): bundleVersion is required for operatorhub type", i, j, v.Name) + } + if !bundleVersionRegex.MatchString(v.BundleVersion) { + return fmt.Errorf(...) + } +} +``` + +- local 模式仍允许空(与 channel 一致的"local 不读这字段"原则) +- pros: 早期挡住,下游全部受益 +- cons: 已有空 BundleVersion 的 yaml 会一次性 fail +- effort: Trivial (~5 LOC) +- risk: 低 — 真实场景没人会故意填空 + +### 选项 2(保守):仅加 warning + +仅 `log.Warn` 不 fail。 + +- 否决 — 沿用了 silent 模式,违背本 PR 哲学 + +## Recommended Action + +待 triage。强烈推荐选项 1。 + +## Technical Details + +- 文件:`pkg/config/config.go::validateConfig` +- 测试:`pkg/config/config_test.go` 补一个负 case + 一个 local-allows-empty 正 case + +## Acceptance Criteria + +- [ ] operatorhub + 空 BundleVersion → LoadConfig 报错 +- [ ] local + 空 BundleVersion → LoadConfig 通过 +- [ ] 错误消息指出具体 `path[i].versions[j]` 索引 +- [ ] 新测试覆盖 + +## Work Log + +_待开始_ + +## Resources + +- PR #17 +- silent-failure-hunter verdict #7 diff --git a/todos/016-complete-p2-seen-map-dead-code.md b/todos/016-complete-p2-seen-map-dead-code.md new file mode 100644 index 0000000..b11ead6 --- /dev/null +++ b/todos/016-complete-p2-seen-map-dead-code.md @@ -0,0 +1,80 @@ +--- +status: complete +priority: p2 +issue_id: "016" +tags: [code-review, simplicity, yagni, pr-17] +dependencies: [] +--- + +# `runPreflight` 的 `seen` map 是 fail-fast 下的 dead code + +## Problem Statement + +`cmd/upgrade_command.go:233-258` 的 `runPreflight` 用 `seen map[string]struct{}` 去重 Residuals。但实现是 **fail-fast 跨 path**——第一条 path 产生非空 unique 就 return。map 永远只有 0 或 1 个 entry,dedup 行为永远不触发。 + +注释 "harmless when fail-fast triggers, pays off if loop is ever changed to aggregate" 是教科书 YAGNI —— 为不存在且不在 roadmap 的行为埋复杂度。 + +两位独立 reviewer(architecture-strategist + code-simplicity-reviewer)给出相同结论。 + +## Findings + +- architecture-strategist Verdict #5: CHANGE — 唯一 must-fix +- code-simplicity-reviewer DROP #1: 减 14 LOC +- 数字:当前 ~25 LOC 含 map + key 拼接 + dup 检测;瘦版 ~12 LOC + +## Proposed Solutions + +### 选项 1(推荐,两位 reviewer 一致):删 seen map + +```go +func (uc *UpgradeCommand) runPreflight(ctx context.Context) error { + for _, path := range uc.config.UpgradePaths { + if len(path.Versions) == 0 { + continue + } + residuals, err := uc.operator.PreflightBaseline(ctx, path.Versions[0]) + if err != nil { + return err + } + if len(residuals) > 0 { + return &PreflightError{Residuals: residuals} + } + } + return nil +} +``` + +- pros: 减 14 LOC,意图清晰 +- cons: 改聚合模式时要重新加 dedup(可逆代价 5 行) +- effort: Trivial + +### 选项 2:保留 + 单测覆盖 dedup + +加单测证明 dedup 真的工作,把 dead code 转为活代码。但需要先决定"为什么 fail-fast"——如果保留 fail-fast 则 dedup 仍 dead。 + +- 否决,与 fail-fast 设计矛盾 + +## Recommended Action + +待 triage。推荐选项 1。 + +## Technical Details + +- 文件:`cmd/upgrade_command.go::runPreflight` +- 单测 dependency:#017(runPreflight 当前无单测) + +## Acceptance Criteria + +- [ ] 删除 `seen` map 及相关 dedup 逻辑 +- [ ] 行为保持:"fail-fast 在第一条有残留的 path 即停" +- [ ] 单测覆盖(与 #017 合并) + +## Work Log + +_待开始_ + +## Resources + +- PR #17 +- architecture-strategist verdict #5 +- code-simplicity-reviewer DROP #1 diff --git a/todos/017-complete-p2-cmd-layer-zero-test-coverage.md b/todos/017-complete-p2-cmd-layer-zero-test-coverage.md new file mode 100644 index 0000000..afe85ec --- /dev/null +++ b/todos/017-complete-p2-cmd-layer-zero-test-coverage.md @@ -0,0 +1,78 @@ +--- +status: complete +priority: p2 +issue_id: "017" +tags: [code-review, testing, pr-17] +dependencies: [] +--- + +# `cmd/upgrade_command.go` 新加的硬错误逻辑 0 单测 + +## Problem Statement + +PR 在 `cmd/upgrade_command.go` 新增 ~75 LOC 真实硬错误逻辑(`assertClusterMatch` + `runPreflight`),但**没有 cmd 包单测文件**。pr-test-analyzer 评 criticality 9(最高)。 + +具体未覆盖分支: + +### `assertClusterMatch` +- (a) `Violet == nil` → no-op +- (b) `Clusters == ""` → no-op +- (c) 空 kubeconfig → warn-and-pass(**#013 待 fix 后**:error) +- (d) `LoadFromFile` error → 包错返回 +- (e) 空 `CurrentContext` → warn-and-pass(**#014 待 fix 后**:error) +- (f) `--confirm-cluster == ""` → error("required" message) +- (g) mismatch → error("does not match" message) +- (h) exact match → pass + +### `runPreflight` +- (i) `len(Versions) == 0` 路径跳过 +- (j) 只传 baseline (`Versions[0]`),不传 v1/v2 +- (k) fail-fast — 第二条 path 的 PreflightBaseline **不被调用**(用 fake `OperatorInterface` 计数验证) +- (l) `PreflightBaseline` 返回 error 时原样 return +- (m) `--skip-preflight` 设置时整段 bypass + +## Findings + +来源:pr-test-analyzer Critical Gap #1 + #2 + +回归保护价值:guard 反向条件(如 #013/#014 修复后倒退回 warn)会被这些测试 catch 住。 + +## Proposed Solutions + +### 选项 1(推荐):新建 `cmd/upgrade_command_test.go` + +- 用 fake `OperatorInterface`(mock `UpgradeOperator` + `PreflightBaseline`) +- 用 `t.TempDir()` 写 minimal kubeconfig YAML fixture +- table-driven 覆盖上述 12 个分支 +- 估算 ~150-200 LOC + +### 选项 2:把 assertClusterMatch / runPreflight 重构成可独立测试的 pure-ish 函数 + +剥离对 `UpgradeCommand` struct 的依赖(接收明确 params 而非 receiver)。便于测试但破坏现有内聚。 + +- 否决:测试不该驱动 over-refactor + +## Recommended Action + +待 triage。选项 1。 + +## Technical Details + +- 新建文件:`cmd/upgrade_command_test.go` +- 与 #013/#014(修 silent pass)同 PR 完成,可同时验证修复 + 测试 + +## Acceptance Criteria + +- [ ] 12 个 listed 分支都有单测 +- [ ] 单测命名:`TestAssertClusterMatch_*` / `TestRunPreflight_*` +- [ ] fake OperatorInterface 至少能 count 调用次数 +- [ ] kubeconfig fixture 用 t.TempDir,不污染仓库 + +## Work Log + +_待开始_ + +## Resources + +- PR #17 +- pr-test-analyzer Critical Gap #1 + #2 diff --git a/todos/018-complete-p2-nestedstring-error-discarded.md b/todos/018-complete-p2-nestedstring-error-discarded.md new file mode 100644 index 0000000..7e6f97c --- /dev/null +++ b/todos/018-complete-p2-nestedstring-error-discarded.md @@ -0,0 +1,79 @@ +--- +status: complete +priority: p2 +issue_id: "018" +tags: [code-review, robustness, preflight, pr-17] +dependencies: [] +--- + +# `checkInstallPlanResidue` 把 `NestedString` 的 err 用 `_, _, _` 丢弃 + +## Problem Statement + +`pkg/operator/operatorhub/preflight.go:131` 当前: + +```go +phase, _, _ := unstructured.NestedString(ip.Object, "status", "phase") +``` + +第三个返回值 `err` 被丢弃。两种风险: +1. **当前**:如果 `status.phase` 是错误类型(OLM CRD schema 变更把 phase 改成 `[]interface{}` 或对象),err 被吞,phase="" → 被认为非终态 → 误报为残留 → 用户白浪费时间 +2. **前瞻**:未来 OLM 重命名字段时每个 IP 都会 phase="",preflight 直接 jam 整个升级流(噪音风暴) + +CRD schema drift blast radius。 + +## Findings + +来源:silent-failure-hunter 评审 #5(FIX 级别) + +## Proposed Solutions + +### 选项 1(推荐):传播 err + 区分 found vs missing + +```go +phase, found, err := unstructured.NestedString(ip.Object, "status", "phase") +if err != nil { + return nil, fmt.Errorf("InstallPlan %s/%s: unexpected status.phase type: %w", + ip.GetNamespace(), ip.GetName(), err) +} +if !found { + logging.FromContext(ctx).Infow( + "InstallPlan has no status.phase yet, treating as live", + "name", ip.GetName(), + ) +} +``` + +- pros: 类型错误直接 fail loud;缺字段保留旧 fallthrough 行为但显式 log +- cons: 略增几行 + 引入 logging.FromContext 调用 +- effort: Small (~8 LOC) + +### 选项 2:把"无 phase 字段"也当作终态忽略 + +更保守但不准——刚创建的 IP 还没 reconcile 时 phase 缺,会被误认为终态忽略。 + +- 否决 + +## Recommended Action + +待 triage。选项 1。 + +## Technical Details + +- 文件:`pkg/operator/operatorhub/preflight.go::checkInstallPlanResidue` +- 测试:在 `preflight_test.go` 加:(a) status.phase 类型错误返回 error;(b) status 缺 phase 字段,记 log 后视为 live + +## Acceptance Criteria + +- [ ] `status.phase` 类型非 string 时 PreflightBaseline 返回 error +- [ ] `status.phase` 字段缺失时 logs Info + 当作非终态处理 +- [ ] 两个单测覆盖 + +## Work Log + +_待开始_ + +## Resources + +- PR #17 +- silent-failure-hunter verdict #5 diff --git a/todos/019-complete-p2-residual-constructor-shell-quote.md b/todos/019-complete-p2-residual-constructor-shell-quote.md new file mode 100644 index 0000000..ac4bc53 --- /dev/null +++ b/todos/019-complete-p2-residual-constructor-shell-quote.md @@ -0,0 +1,90 @@ +--- +status: complete +priority: p2 +issue_id: "019" +tags: [code-review, type-design, security, pr-17] +dependencies: [] +--- + +# `preflight.Residual` 加 constructor 封装 shell-escape 不变量 + +## Problem Statement + +`preflight.Residual` 4 个 exported field 全裸暴露: + +```go +type Residual struct { + Kind, Namespace, Name string + RecommendedCleanup string +} +``` + +`RecommendedCleanup` 必须**已经 shell-escape**(文档注释里写了),但类型本身不保证。security-sentinel 也指出当前用 Go `%q` 不是真正的 shell-safe(`a"b` 这种 K8s 不允许的 name 即便有也会破)。 + +type-design-analyzer 给 Encapsulation/Invariant 各 2-3/5。 + +## Findings + +来源:type-design-analyzer "small refactor" + security-sentinel verdict #2 + +## Proposed Solutions + +### 选项 1(推荐):加 constructor + 转用 shellescape + +```go +// pkg/operator/preflight/types.go +func NewResidual(kind, namespace, name string) Residual { + return Residual{ + Kind: kind, + Namespace: namespace, + Name: name, + RecommendedCleanup: fmt.Sprintf("kubectl delete %s %s -n %s", + strings.ToLower(kind), + shellescape.Quote(name), + shellescape.Quote(namespace)), + } +} +``` + +引入 `github.com/alessio/shellescape`(small dep, MIT, 单文件实现)。 + +- pros: 不变量编译期保证;调用方更短;K8s 名字含 quote 等边界 case 也正确(即便理论场景) +- cons: 加一个外部 deps;现有 3 个调用点都要改 +- effort: Small (~30 LOC + 3 处调用点改写) + +### 选项 2:constructor 但仍用 `%q` + +不引入新 dep,仅集中转义到一处。 + +- pros: 零新 deps +- cons: 没解决 security-sentinel 的 `%q` ≠ shell-safe 论点 +- effort: Trivial + +### 选项 3:保留现状 + 加单测断言 RecommendedCleanup 形态 + +- 否决:测试不能弥补类型设计缺陷 + +## Recommended Action + +待 triage。优先 1,可接受 2 作 trade-off。 + +## Technical Details + +- 文件:`pkg/operator/preflight/types.go`、`pkg/operator/operatorhub/preflight.go`(3 处调用点) +- go.mod 引入 `github.com/alessio/shellescape v1.4.x`(如选 1) + +## Acceptance Criteria + +- [ ] 所有 Residual 从 constructor 构造,不再字面量构造(除测试) +- [ ] RecommendedCleanup 即使 name 含 quote / `$` / 空格也能正确 shell-quoted +- [ ] 单测:name 含特殊字符时 cleanup 命令仍然 shell-safe + +## Work Log + +_待开始_ + +## Resources + +- PR #17 +- type-design-analyzer recommendation +- security-sentinel verdict #2 diff --git a/todos/020-pending-p3-test-coverage-extensions.md b/todos/020-pending-p3-test-coverage-extensions.md new file mode 100644 index 0000000..3fe4edc --- /dev/null +++ b/todos/020-pending-p3-test-coverage-extensions.md @@ -0,0 +1,63 @@ +--- +status: pending +priority: p3 +issue_id: "020" +tags: [code-review, testing, pr-17] +dependencies: [] +--- + +# 测试覆盖完整性补充(transient types / label drift / regex edge) + +## Problem Statement + +PR 单测覆盖了主路径,但有几条 cheap insurance 没加: + +1. **`isTransientAPIError` 三种类型只测了 1 种**:当前 `TestPreflightBaseline_TransientErrorWrapsAsRetryHint` 只用 `apierrors.NewServerTimeout`。`TooManyRequests` 和 `ServiceUnavailable` 也是 transient 但无测试。 +2. **InstallPlan label fixture drift**:`newIPWithPhase` 把 label key 写死 `"operators.coreos.com/tektoncd-operator.test-ns"`,与 `preflight.go` 中 `fmt.Sprintf(...)` 拼接逻辑不一致时不会被发现。 +3. **`bundleVersionRegex` 边缘**:5 个 shell-meta sub-case 覆盖 ASCII,没测 `\x00` NUL byte 和 unicode lookalike(如 `1.0$(x)`,U+FF04)。这些都该被 regex 拒绝,pin behavior。 +4. **InstallPlan 无 label 的 negative case**:测试都给 IP 加了 OLM label,没有"label 缺失 → 应该被 labelSelector 过滤"的负 case。 + +## Findings + +来源:pr-test-analyzer #4, #5, #6, #9 + +## Proposed Solutions + +### 选项 1(推荐):批量补单测 + +按主题各加 2-3 个 sub-case: +- 在 `TestPreflightBaseline_TransientErrorWrapsAsRetryHint` 用 t.Run 加 TooManyRequests / ServiceUnavailable +- `newIPWithPhase` 改签名接受 `(operatorName, namespace string)`,内部用同样的 `fmt.Sprintf` 拼 label key +- 在 `TestLoadConfig_RejectsShellMetaCharacterInBundleVersion` 加 NUL byte + unicode-dollar 2 个 sub-case +- 新增 `TestPreflightBaseline_InstallPlanWithoutOLMLabelIgnored` + +- pros: cheap insurance,回归保护 +- cons: 测试代码 +30~50 LOC +- effort: Small + +### 选项 2:分散到各 todo(与对应实现 PR 一起) + +把每条测试拆分到对应实现修改的 todo(如 `#018` 同时补 NestedString 测试)。 + +- pros: 测试与实现同 commit +- cons: 部分纯增强测试(如 transient types)没有对应实现 todo + +## Recommended Action + +待 triage。 + +## Acceptance Criteria + +- [ ] `TestPreflightBaseline_TransientError*` 覆盖 3 种 transient 类型 +- [ ] `newIPWithPhase` 与 impl 共享 label key 拼接 +- [ ] `bundleVersionRegex` 测试覆盖 NUL byte + unicode +- [ ] 新增 IP-without-label 测试 + +## Work Log + +_待开始_ + +## Resources + +- PR #17 +- pr-test-analyzer findings #4, #5, #6, #9 diff --git a/todos/021-pending-p3-preflight-simplifications.md b/todos/021-pending-p3-preflight-simplifications.md new file mode 100644 index 0000000..a225464 --- /dev/null +++ b/todos/021-pending-p3-preflight-simplifications.md @@ -0,0 +1,76 @@ +--- +status: pending +priority: p3 +issue_id: "021" +tags: [code-review, simplicity, pr-17] +dependencies: [] +--- + +# `preflight.go` 内部小简化(checks slice / phases map) + +## Problem Statement + +code-simplicity-reviewer 提出两处轻量过度抽象: + +1. **`checks []struct{name, fn}` 三件套循环**(`preflight.go:52-71`):3 个静态调用伪装成数据驱动。新增 check = 改 struct slice + 加方法 + 改测试,与直接加一行调用同代价。 +2. **`installPlanTerminalPhases` map**(`preflight.go:30-33`):2 entry 的 `map[string]struct{}` 用 `switch phase { case "Complete", "Failed": continue }` 更直白。 + +合计减 ~15 LOC,**无行为变化**。 + +## Findings + +来源:code-simplicity-reviewer SIMPLIFY #2 + #3 + +## Proposed Solutions + +### 选项 1(推荐):双合一改 + +```go +// preflight.go +func (o *Operator) PreflightBaseline(ctx context.Context, version config.Version) ([]preflight.Residual, error) { + pCtx, cancel := context.WithTimeout(ctx, preflightTimeout) + defer cancel() + + var residuals []preflight.Residual + for _, fn := range []checkFn{ + o.checkSubscriptionResidue, + o.checkArtifactVersionResidue, + o.checkInstallPlanResidue, + } { + // 错误分类逻辑保持原样 + } + // ... +} +``` + +或更直接:3 个独立调用 + helper for err wrapping。 + +phases map → switch case。 + +- pros: 减 LOC、意图清晰 +- cons: 重构本身需要谨慎不要破坏单测 +- effort: Small + +### 选项 2:保持现状 + +- "未来或许会加 check" — 但实际加 check 是单点 commit,今天的复杂度不必为想象的未来买单。 +- 否决(按 simplicity 论证) + +## Recommended Action + +待 triage。 + +## Acceptance Criteria + +- [ ] `checks` slice 删除,3 个 check 直接顺序调用 +- [ ] `installPlanTerminalPhases` map 删除,改用 switch +- [ ] 所有现有单测仍 pass + +## Work Log + +_待开始_ + +## Resources + +- PR #17 +- code-simplicity-reviewer SIMPLIFY #2 + #3 diff --git a/todos/022-pending-p3-agent-ergonomics-followups.md b/todos/022-pending-p3-agent-ergonomics-followups.md new file mode 100644 index 0000000..b2ed307 --- /dev/null +++ b/todos/022-pending-p3-agent-ergonomics-followups.md @@ -0,0 +1,57 @@ +--- +status: pending +priority: p3 +issue_id: "022" +tags: [code-review, agent-native, ergonomics, pr-17] +dependencies: [] +--- + +# Agent-friendliness + docs gap 综合改进(多项建议合并) + +## Problem Statement + +来自 agent-native-reviewer + security-sentinel 的小改进合集: + +1. **`--preflight-output=json` flag**(agent-native #1):subprocess 调用方(CI / agent)只能 regex-parse stderr 的 markdown 文本。加一个 JSON 输出模式让 `jq -r '.residuals[].name'` 这种调用方便很多。~30 LOC。 + +2. **exit code 区分**(agent-native #2):当前 preflight 失败 / violet 失败 / kubeconfig 错都 exit=1。在 `main.go` 加 `errors.As` 区分:PreflightError → exit 2,cluster mismatch → exit 3,其他保持 1。让 agent retry policy 能区分。 + +3. **`--preflight-only` flag**(agent-native #3 vs simplicity-reviewer 立场对立):agent 想"只验证不升级"目前要让 violet 阶段崩,浪费 + 部分副作用。simplicity 立场是"5 行可未来补,YAGNI"。**这是真正决策点**——录此 todo 让 triage 拍板。 + +4. **`--confirm-cluster` 信任 kubeconfig 内容假设文档化**(security MEDIUM #4):guard 依赖"操作者拥有 kubeconfig 文件"。文档化这条假设;可选 defense-in-depth 是同时比 `Contexts[ctx].Cluster.Server`。 + +5. **`--skip-preflight` audit log 落到外部系统**(security MEDIUM #6):当前只有 `log.Warn` 到 process stderr。共享 CI runner 上无法外部审计。建议要么需 env `UPGRADE_ALLOW_SKIP_PREFLIGHT=1` 显式开关;要么至少 log 时带 `os.Getenv("USER")`、kubeconfig context 等结构化字段方便 Loki/CloudWatch alert。 + +## Findings + +来源(合并):agent-native-reviewer warnings #1/#2/#3 + security-sentinel MEDIUM #4/#6 + +## Proposed Solutions + +各条独立 trade-off,建议 triage 时分别决策: + +| # | 选项 | 推荐 | +|---|------|------| +| 1 | `--preflight-output=json` flag | Yes, fix-now (~30 LOC) | +| 2 | exit code 区分 | Yes if main.go in scope (~10 LOC);否则 defer | +| 3 | `--preflight-only` flag | **决策点**(reviewers 立场对立) | +| 4 | kubeconfig trust doc | Yes — 一段 README 即可 | +| 5 | `--skip-preflight` 加 env 守门 | Defer or accept doc trade-off | + +## Recommended Action + +待 triage。 + +## Acceptance Criteria + +按 triage 决定后再补。 + +## Work Log + +_待开始_ + +## Resources + +- PR #17 +- agent-native-reviewer findings #1/#2/#3 +- security-sentinel verdicts #4/#6 diff --git a/todos/023-pending-p3-misc-cleanups.md b/todos/023-pending-p3-misc-cleanups.md new file mode 100644 index 0000000..1fec5be --- /dev/null +++ b/todos/023-pending-p3-misc-cleanups.md @@ -0,0 +1,54 @@ +--- +status: pending +priority: p3 +issue_id: "023" +tags: [code-review, simplicity, cleanup, pr-17] +dependencies: [] +--- + +# 杂项清理(test / log / 注释) + +## Problem Statement + +多个 reviewer 抓到的低风险机械改动: + +1. **DROP `TestPreflightError_ZeroResidualsStillReadable`**(simplicity + pr-test #8):测试自己注释承认 "not expected at runtime"。Go 的 `len`/`for range`/`fmt.Fprintf` 对空 slice 安全是语言保证,不需要测试。 +2. **`cmd.SilenceUsage = true` 挪出 `AddFlags`**(simplicity #7):`AddFlags` 名字承诺加 flag,把 cobra 行为开关放这里 surprise。改放在 root command 构造处(`cmd/root.go` 的 `cobra.Command{}` 字面量字段)。 +3. **`local.PreflightBaseline` 加 Info log**(silent-failure #1):当前 `return nil, nil` 完全静默。一行 `log.Infow("preflight: local operator — OLM residue scan skipped")` 把"silent skip"转为"documented skip"。 +4. **`isTransientAPIError(nil) == false` 注释 lock invariant**(silent-failure #6):在 PreflightBaseline switch 上方加一行注释明确这个 invariant,未来 reader 不必反查 helper 实现。 +5. **`assertClusterMatch` 两个 warn 分支合并**(simplicity #6):in-cluster 和 empty-currentCtx 现在是两个独立的"无法读到 context"分支,可合并。**注意**:与 #013/#014 修复冲突(那两个 P1 把 warn 改 error)—— 等 #013/#014 落地后再考虑是否还需要合并。 + +## Findings + +来源(合并):simplicity DROP #4 + #6 + #7 + silent-failure #1 + #6 + +## Proposed Solutions + +按条目分别处理,无 trade-off 争议。可一次性 commit。 + +## Recommended Action + +待 triage。**注意**:第 5 条与 #013/#014 dependency。 + +## Technical Details + +- 文件:`cmd/preflight_error_test.go`(删测试)、`cmd/root.go` + `cmd/upgrade_command.go`(SilenceUsage 挪位)、`pkg/operator/local/operator.go`(Info log)、`pkg/operator/operatorhub/preflight.go`(注释) +- 估算总改动:净 -10 LOC + +## Acceptance Criteria + +- [ ] `TestPreflightError_ZeroResidualsStillReadable` 删除 +- [ ] `SilenceUsage` 不在 `AddFlags` 里 +- [ ] `local.PreflightBaseline` 有一行 Info log +- [ ] `PreflightBaseline` switch 上方有 "isTransientAPIError(nil)==false invariant" 注释 +- [ ] `assertClusterMatch` 分支合并视 #013/#014 状态决定 + +## Work Log + +_待开始_ + +## Resources + +- PR #17 +- code-simplicity-reviewer DROP #4, SIMPLIFY #6 + #7 +- silent-failure-hunter #1 + #6 From f721f8e249e38f9b7ce7564a72041f774d652771 Mon Sep 17 00:00:00 2001 From: "huanyang@alauda.io" Date: Tue, 19 May 2026 20:12:14 +0800 Subject: [PATCH 08/10] test(cmd): unit tests for assertClusterMatch and runPreflight MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes P2 #017 — the two cmd-layer functions added by PR #17 carried zero coverage (pr-test-analyzer criticality 9). Coverage: - assertClusterMatch (10 sub-tests): violet nil / clusters empty no-ops; in-cluster missing/mismatched/matching --confirm-cluster; file-mode unreadable kubeconfig, empty current-context (pins the #014 fix in place), missing/mismatched/matching --confirm-cluster. - runPreflight (5 sub-tests): clean cluster, empty Versions skipped, baseline-only invariant (regression guard: a refactor scanning all versions would surface mid-upgrade AVs as false positives), fail-fast across paths (second path's PreflightBaseline is verifiably not reached), operator errors propagate unwrapped (must NOT be wrapped as *PreflightError). The fake operator records the Version passed to each PreflightBaseline call so the baseline-only test asserts the actual contract rather than just call count. --- cmd/upgrade_command_test.go | 303 ++++++++++++++++++++++++++++++++++++ 1 file changed, 303 insertions(+) create mode 100644 cmd/upgrade_command_test.go diff --git a/cmd/upgrade_command_test.go b/cmd/upgrade_command_test.go new file mode 100644 index 0000000..2cd9323 --- /dev/null +++ b/cmd/upgrade_command_test.go @@ -0,0 +1,303 @@ +package cmd + +import ( + "context" + "errors" + "os" + "path/filepath" + "strings" + "sync/atomic" + "testing" + + "github.com/AlaudaDevops/upgrade-test/pkg/config" + "github.com/AlaudaDevops/upgrade-test/pkg/operator/preflight" +) + +// fakeOperator implements operator.OperatorInterface for tests of runPreflight. +// Behavior is parameterised through residualsPerCall (one slice per +// PreflightBaseline call, in order) and preflightErr (returned from the first +// call regardless of residualsPerCall when non-nil). +// +// Counters are atomic so a future parallel test will not race; today's +// sequential callers wouldn't need it, but the cost is negligible. +type fakeOperator struct { + residualsPerCall [][]preflight.Residual + preflightErr error + preflightCalls int32 + upgradeCalls int32 + receivedVersions []config.Version +} + +func (f *fakeOperator) UpgradeOperator(ctx context.Context, v config.Version) error { + atomic.AddInt32(&f.upgradeCalls, 1) + return nil +} + +func (f *fakeOperator) PreflightBaseline(ctx context.Context, v config.Version) ([]preflight.Residual, error) { + idx := atomic.AddInt32(&f.preflightCalls, 1) - 1 + f.receivedVersions = append(f.receivedVersions, v) + if f.preflightErr != nil { + return nil, f.preflightErr + } + if int(idx) >= len(f.residualsPerCall) { + return nil, nil + } + return f.residualsPerCall[idx], nil +} + +// writeKubeconfig drops a minimal valid kubeconfig YAML with the given +// current-context into t.TempDir() and returns its path. Passing an empty +// string yields a file with no current-context set — the "user forgot +// kubectl config use-context" scenario. +func writeKubeconfig(t *testing.T, currentContext string) string { + t.Helper() + body := `apiVersion: v1 +kind: Config +clusters: +- name: devops + cluster: + server: https://example.invalid +contexts: +- name: devops + context: + cluster: devops + user: admin +- name: prod + context: + cluster: devops + user: admin +users: +- name: admin + user: {} +current-context: ` + currentContext + "\n" + path := filepath.Join(t.TempDir(), "kubeconfig.yaml") + if err := os.WriteFile(path, []byte(body), 0o600); err != nil { + t.Fatalf("write kubeconfig: %v", err) + } + return path +} + +func violetCfg(clusters string) *config.Config { + cfg := &config.Config{} + if clusters != "" { + cfg.OperatorConfig.Violet = &config.VioletConfig{Clusters: clusters} + } + return cfg +} + +// --- assertClusterMatch ---------------------------------------------------- + +func TestAssertClusterMatch_VioletNilIsNoOp(t *testing.T) { + uc := &UpgradeCommand{} + if err := uc.assertClusterMatch(&config.Config{}, "/any/path"); err != nil { + t.Fatalf("nil violet should skip guard, got %v", err) + } +} + +func TestAssertClusterMatch_ClustersEmptyIsNoOp(t *testing.T) { + uc := &UpgradeCommand{} + cfg := &config.Config{} + cfg.OperatorConfig.Violet = &config.VioletConfig{Clusters: ""} + if err := uc.assertClusterMatch(cfg, "/any/path"); err != nil { + t.Fatalf("empty clusters should skip guard, got %v", err) + } +} + +// In-cluster mode (kubeconfig path is empty): the guard cannot fall back to +// CurrentContext, so --confirm-cluster must equal violet.clusters. Missing +// flag is a hard error — this is the "CI pod silently writes to whatever +// serviceaccount points at" failure that #013 closes. +func TestAssertClusterMatch_InClusterRequiresConfirmFlag(t *testing.T) { + uc := &UpgradeCommand{confirmCluster: ""} + err := uc.assertClusterMatch(violetCfg("devops"), "") + if err == nil { + t.Fatal("in-cluster mode without --confirm-cluster must fail") + } + if !strings.Contains(err.Error(), "--confirm-cluster") { + t.Errorf("error should mention --confirm-cluster: %v", err) + } +} + +func TestAssertClusterMatch_InClusterRejectsMismatchedConfirm(t *testing.T) { + uc := &UpgradeCommand{confirmCluster: "global"} + err := uc.assertClusterMatch(violetCfg("devops"), "") + if err == nil { + t.Fatal("--confirm-cluster != violet.clusters must fail in-cluster") + } + if !strings.Contains(err.Error(), "does not match") { + t.Errorf("expected 'does not match' in: %v", err) + } +} + +func TestAssertClusterMatch_InClusterHappyPath(t *testing.T) { + uc := &UpgradeCommand{confirmCluster: "devops"} + if err := uc.assertClusterMatch(violetCfg("devops"), ""); err != nil { + t.Fatalf("matching --confirm-cluster should pass: %v", err) + } +} + +// File mode: --confirm-cluster must equal kubeconfig's current-context. +// An empty current-context is a user configuration error and the guard +// surfaces it actionably (#014). +func TestAssertClusterMatch_FileModeLoadFailureWraps(t *testing.T) { + uc := &UpgradeCommand{} + err := uc.assertClusterMatch(violetCfg("devops"), "/no/such/kubeconfig") + if err == nil { + t.Fatal("unreadable kubeconfig should fail the guard") + } + if !strings.Contains(err.Error(), "kubeconfig") { + t.Errorf("error should name the file: %v", err) + } +} + +func TestAssertClusterMatch_FileModeEmptyCurrentContextFails(t *testing.T) { + uc := &UpgradeCommand{} + kc := writeKubeconfig(t, "") + err := uc.assertClusterMatch(violetCfg("devops"), kc) + if err == nil { + t.Fatal("empty current-context must fail (was warn-and-pass before #014)") + } + if !strings.Contains(err.Error(), "use-context") { + t.Errorf("error should hint at `kubectl config use-context`: %v", err) + } +} + +func TestAssertClusterMatch_FileModeRequiresConfirmFlag(t *testing.T) { + uc := &UpgradeCommand{confirmCluster: ""} + kc := writeKubeconfig(t, "devops") + err := uc.assertClusterMatch(violetCfg("devops"), kc) + if err == nil { + t.Fatal("file mode without --confirm-cluster must fail") + } + if !strings.Contains(err.Error(), "--confirm-cluster") { + t.Errorf("error should mention the flag: %v", err) + } +} + +func TestAssertClusterMatch_FileModeMismatchedConfirm(t *testing.T) { + uc := &UpgradeCommand{confirmCluster: "prod"} + kc := writeKubeconfig(t, "devops") + err := uc.assertClusterMatch(violetCfg("devops"), kc) + if err == nil { + t.Fatal("mismatched --confirm-cluster vs current-context must fail") + } + if !strings.Contains(err.Error(), "does not match") { + t.Errorf("expected 'does not match' in: %v", err) + } +} + +func TestAssertClusterMatch_FileModeHappyPath(t *testing.T) { + uc := &UpgradeCommand{confirmCluster: "devops"} + kc := writeKubeconfig(t, "devops") + if err := uc.assertClusterMatch(violetCfg("devops"), kc); err != nil { + t.Fatalf("matching --confirm-cluster vs current-context should pass: %v", err) + } +} + +// --- runPreflight ---------------------------------------------------------- + +func newCmdWithPaths(op *fakeOperator, paths ...config.UpgradePath) *UpgradeCommand { + return &UpgradeCommand{ + operator: op, + config: &config.Config{UpgradePaths: paths}, + } +} + +func TestRunPreflight_AllCleanReturnsNil(t *testing.T) { + op := &fakeOperator{} + uc := newCmdWithPaths(op, + config.UpgradePath{Name: "p1", Versions: []config.Version{{Name: "baseline", BundleVersion: "v1"}}}, + ) + if err := uc.runPreflight(context.Background()); err != nil { + t.Fatalf("clean preflight should return nil: %v", err) + } + if op.preflightCalls != 1 { + t.Errorf("expected 1 preflight call, got %d", op.preflightCalls) + } +} + +func TestRunPreflight_EmptyVersionsSkipped(t *testing.T) { + op := &fakeOperator{} + uc := newCmdWithPaths(op, + config.UpgradePath{Name: "empty", Versions: nil}, + ) + if err := uc.runPreflight(context.Background()); err != nil { + t.Fatalf("path with no versions should be skipped: %v", err) + } + if op.preflightCalls != 0 { + t.Errorf("PreflightBaseline must not be called for empty Versions, got %d calls", op.preflightCalls) + } +} + +// Locks the design rule: PreflightBaseline only ever sees Versions[0], +// even when the path declares multiple downstream versions. A regression +// flipping to scan all versions would surface mid-upgrade AVs as false +// positives. +func TestRunPreflight_OnlyBaselineIsPassed(t *testing.T) { + op := &fakeOperator{} + uc := newCmdWithPaths(op, config.UpgradePath{ + Name: "multi-version", + Versions: []config.Version{ + {Name: "baseline", BundleVersion: "v1"}, + {Name: "mid", BundleVersion: "v2"}, + {Name: "target", BundleVersion: "v3"}, + }, + }) + if err := uc.runPreflight(context.Background()); err != nil { + t.Fatalf("preflight: %v", err) + } + if got := len(op.receivedVersions); got != 1 { + t.Fatalf("expected 1 PreflightBaseline call, got %d", got) + } + if op.receivedVersions[0].BundleVersion != "v1" { + t.Errorf("expected baseline (v1) only, got %q", op.receivedVersions[0].BundleVersion) + } +} + +// Multi-path fail-fast: once any baseline reports residuals, subsequent +// paths are not scanned. This avoids burning API calls when the first +// dirty path already proves the cluster needs cleanup. +func TestRunPreflight_FailFastAcrossPaths(t *testing.T) { + op := &fakeOperator{ + residualsPerCall: [][]preflight.Residual{ + {preflight.NewResidual("Subscription", "ns", "foo")}, // path 1 dirty + {}, // path 2 would be clean — must not be reached + }, + } + uc := newCmdWithPaths(op, + config.UpgradePath{Name: "p1", Versions: []config.Version{{BundleVersion: "v1"}}}, + config.UpgradePath{Name: "p2", Versions: []config.Version{{BundleVersion: "v2"}}}, + ) + err := uc.runPreflight(context.Background()) + if err == nil { + t.Fatal("expected PreflightError on dirty first path") + } + var pfe *PreflightError + if !errors.As(err, &pfe) { + t.Fatalf("expected *PreflightError, got %T: %v", err, err) + } + if len(pfe.Residuals) != 1 { + t.Errorf("expected 1 residual, got %d", len(pfe.Residuals)) + } + if op.preflightCalls != 1 { + t.Errorf("second path's PreflightBaseline must not be called after first dirty, got %d calls", op.preflightCalls) + } +} + +func TestRunPreflight_OperatorErrorPropagatesUnwrapped(t *testing.T) { + sentinel := errors.New("transient API timeout") + op := &fakeOperator{preflightErr: sentinel} + uc := newCmdWithPaths(op, + config.UpgradePath{Name: "p1", Versions: []config.Version{{BundleVersion: "v1"}}}, + ) + err := uc.runPreflight(context.Background()) + if !errors.Is(err, sentinel) { + t.Errorf("expected sentinel error propagated, got %v", err) + } + // Make sure we did NOT wrap as PreflightError — operator errors are + // distinct from "residue findings" and the cmd layer must not blur them. + var pfe *PreflightError + if errors.As(err, &pfe) { + t.Errorf("operator errors must not be wrapped as *PreflightError") + } +} From 72b752e6c6adfa8bd007e15f5dd92ad8e0c4be67 Mon Sep 17 00:00:00 2001 From: "huanyang@alauda.io" Date: Tue, 19 May 2026 20:50:39 +0800 Subject: [PATCH 09/10] chore(todos): remove completed P1/P2 review findings The 7 todos (013-019) that drove the silent-failure fixes + cmd-layer tests are done; their substance lives in the PR description and the commits that closed them (1201665 + f721f8e). Removing the files keeps todos/ as a list of OPEN work, which is its only useful state. Remaining in todos/: 4 P3 follow-ups for this PR (020-023) and 6 P2 items from prior PRs that have not been actioned yet. --- ...1-cluster-guard-incluster-silent-passes.md | 77 ---------------- ...-p1-empty-current-context-silent-passes.md | 77 ---------------- ...te-p1-empty-bundleversion-silent-accept.md | 85 ------------------ todos/016-complete-p2-seen-map-dead-code.md | 80 ----------------- ...omplete-p2-cmd-layer-zero-test-coverage.md | 78 ---------------- ...omplete-p2-nestedstring-error-discarded.md | 79 ---------------- ...ete-p2-residual-constructor-shell-quote.md | 90 ------------------- 7 files changed, 566 deletions(-) delete mode 100644 todos/013-complete-p1-cluster-guard-incluster-silent-passes.md delete mode 100644 todos/014-complete-p1-empty-current-context-silent-passes.md delete mode 100644 todos/015-complete-p1-empty-bundleversion-silent-accept.md delete mode 100644 todos/016-complete-p2-seen-map-dead-code.md delete mode 100644 todos/017-complete-p2-cmd-layer-zero-test-coverage.md delete mode 100644 todos/018-complete-p2-nestedstring-error-discarded.md delete mode 100644 todos/019-complete-p2-residual-constructor-shell-quote.md diff --git a/todos/013-complete-p1-cluster-guard-incluster-silent-passes.md b/todos/013-complete-p1-cluster-guard-incluster-silent-passes.md deleted file mode 100644 index e5c9c31..0000000 --- a/todos/013-complete-p1-cluster-guard-incluster-silent-passes.md +++ /dev/null @@ -1,77 +0,0 @@ ---- -status: complete -priority: p1 -issue_id: "013" -tags: [code-review, security, preflight, pr-17] -dependencies: [] ---- - -# Cluster guard 在 in-cluster + `violet.clusters` 非空时静默放行 - -## Problem Statement - -`assertClusterMatch` 当 `cfg.OperatorConfig.Violet.Clusters` 非空但 KUBECONFIG 是空(in-cluster 模式)时,只打了一行 `log.Warn` 就 `return nil` 放行。 - -但是 —— **`violet.clusters` 被显式设置正是 "我在多集群 ACP 环境,必须确认目标"**。in-cluster 模式恰恰是这条 guard 最该硬拦的场景:CI pod 里 `KUBECONFIG=` 但 `violet.clusters=devops` → 跑 `upgrade` 直接写到 prod,warn 没人看。 - -这是 PR 自己的 value proposition 反语义:preflight 设计目的是"silent 假成功的对称防御",结果该防御本身在最重要的场景下 silently 放行。 - -## Findings - -来源:silent-failure-hunter 评审 #2(FIX 级别)+ security-sentinel #4(MEDIUM) - -- `cmd/upgrade_command.go:189-195` 当前实现: - ```go - if kubeconfig == "" { - uc.logger.Warn( - "violet.clusters is set but running in-cluster (no KUBECONFIG file); cluster identity NOT verified", - zap.String("violet.clusters", violetClusters), - ) - return nil - } - ``` - -## Proposed Solutions - -### 选项 1(推荐):require `--confirm-cluster` 即使 in-cluster - -当 `violet.clusters != ""` 且无 kubeconfig 时,把 `--confirm-cluster` 升级为**必填**,且与 `cfg.OperatorConfig.Violet.Clusters` 做精确比对(即用户必须显式申明"我知道这是 devops 集群")。 - -- pros: 真实硬拦在场景下生效;与 file-mode 行为对称(都强 require --confirm-cluster) -- cons: 现有 in-cluster CI job 全部需要更新 spec 加 `--confirm-cluster=` -- effort: Small (~10 LOC) -- risk: 破坏性 — 已有 CI 流水线会一次性 fail,需要发 release note - -### 选项 2:env-only fallback for CI - -新增 env `UPGRADE_CONFIRM_CLUSTER=` 作为 in-cluster 场景下 `--confirm-cluster` 的替代——pod 注入 env 更符合 CI/Tekton 用法。 - -- pros: 不破坏 in-cluster 流水线(升级 env injection 即可) -- cons: 又多一个配置入口 -- effort: Small - -## Recommended Action - -待 triage。 - -## Technical Details - -- 文件:`cmd/upgrade_command.go::assertClusterMatch` -- 测试:`cmd/upgrade_command_test.go` 不存在,本 todo 同时是触发 #016 的依赖 - -## Acceptance Criteria - -- [ ] `violet.clusters != "" && kubeconfig == "" && --confirm-cluster == ""` 必须 return error -- [ ] `violet.clusters != "" && kubeconfig == "" && --confirm-cluster == violet.clusters` 通过(in-cluster 显式承认) -- [ ] 上述两条用单测覆盖 -- [ ] README 更新 in-cluster 场景说明 - -## Work Log - -_待开始_ - -## Resources - -- PR #17: https://github.com/AlaudaDevops/upgrade-test/pull/17 -- silent-failure-hunter review verdict #2 -- security-sentinel review verdict #4 diff --git a/todos/014-complete-p1-empty-current-context-silent-passes.md b/todos/014-complete-p1-empty-current-context-silent-passes.md deleted file mode 100644 index 121582e..0000000 --- a/todos/014-complete-p1-empty-current-context-silent-passes.md +++ /dev/null @@ -1,77 +0,0 @@ ---- -status: complete -priority: p1 -issue_id: "014" -tags: [code-review, security, preflight, pr-17] -dependencies: [] ---- - -# Cluster guard 在 `current-context: ""` 时静默放行 - -## Problem Statement - -`assertClusterMatch` 当 KUBECONFIG 能读到但 `apiCfg.CurrentContext == ""` 时(多 context 文件且未 `kubectl config use-context`),只 log.Warn 后 `return nil`。 - -但**空 current-context 是用户配置错误**,不是环境约束。允许 silent pass = 跑过 preflight → 进入升级 → kubectl-client 自己选默认 context(或失败),又一次"silent 假成功"。 - -## Findings - -来源:silent-failure-hunter 评审 #4(FIX 级别) - -- `cmd/upgrade_command.go:200-208` 当前实现: - ```go - if currentCtx == "" { - uc.logger.Warn( - "violet.clusters is set but kubeconfig has no current-context; cluster identity NOT verified", - ... - ) - return nil - } - ``` - -## Proposed Solutions - -### 选项 1(推荐):直接报错 - -```go -return fmt.Errorf( - "violet.clusters=%q but kubeconfig %s has no current-context; run `kubectl config use-context ` first", - violetClusters, kubeconfig, -) -``` - -- pros: 用户拿到 actionable 命令;不再 silent -- cons: 已有"裸 multi-context kubeconfig + 多个 path"的工作流会断 -- effort: Trivial (~5 LOC) -- risk: 用户工作流断裂 — 但本就是用户应该 fix 的问题 - -### 选项 2:强制要求 `--confirm-cluster` 已设(与 #013 一致策略) - -current-context 空 → 退化到 in-cluster 行为(必须有 --confirm-cluster),用 --confirm-cluster 与 violet.clusters 比对。 - -- pros: 与 #013 行为对称(in-cluster vs empty-ctx 走同一处理) -- cons: 略难解释(为什么 empty ctx 等于 in-cluster) - -## Recommended Action - -待 triage。推荐选项 1(fail-loud 比 fail-symmetric 更易解释)。 - -## Technical Details - -- 文件:`cmd/upgrade_command.go::assertClusterMatch` -- 与 #013 共享文件,建议同一 PR 修复 - -## Acceptance Criteria - -- [ ] `violet.clusters != "" && CurrentContext == ""` 必须 return error -- [ ] 错误消息含 "run `kubectl config use-context` first" actionable hint -- [ ] 单测覆盖 - -## Work Log - -_待开始_ - -## Resources - -- PR #17 -- silent-failure-hunter verdict #4 diff --git a/todos/015-complete-p1-empty-bundleversion-silent-accept.md b/todos/015-complete-p1-empty-bundleversion-silent-accept.md deleted file mode 100644 index 1bd860d..0000000 --- a/todos/015-complete-p1-empty-bundleversion-silent-accept.md +++ /dev/null @@ -1,85 +0,0 @@ ---- -status: complete -priority: p1 -issue_id: "015" -tags: [code-review, preflight, config, pr-17] -dependencies: [] ---- - -# operatorhub 类型时 `bundleVersion: ""` silent-accept,污染下游 - -## Problem Statement - -`pkg/config/config.go::validateConfig` 中 BundleVersion 校验是 `if v.BundleVersion != "" && !bundleVersionRegex.MatchString(...)` —— 空 BundleVersion 跳过 regex 校验,但**对 operatorhub 类型**是 load-bearing 字段,不该 optional。 - -下游影响: -- `preflight.checkArtifactVersionResidue` 拼 `.` → trailing dot → AV name 畸形 → Get NotFound = false clean -- `installViaViolet::BuildPackageURL` 也会拼出畸形 URL,cryptic 404 错误 -- `violet push` argv 出现空 bundleVersion 段 - -preflight 报"无残留"实际上是配置错误的产物。同样是 PR value proposition 自己挫败。 - -## Findings - -来源:silent-failure-hunter 评审 #7(FIX 级别) - -- `pkg/config/config.go:246-249`: - ```go - if v.BundleVersion != "" && !bundleVersionRegex.MatchString(v.BundleVersion) { - return fmt.Errorf("...bundleVersion %q must match %s", ...) - } - ``` - -## Proposed Solutions - -### 选项 1(推荐):operatorhub 要求 BundleVersion 非空 + 非法字符 - -把校验从 "if 非空再校验" 改为 "operatorhub 时必填": - -```go -if cfg.OperatorConfig.Type == "operatorhub" { - if v.BundleVersion == "" { - return fmt.Errorf("upgradePaths[%d].versions[%d] (%q): bundleVersion is required for operatorhub type", i, j, v.Name) - } - if !bundleVersionRegex.MatchString(v.BundleVersion) { - return fmt.Errorf(...) - } -} -``` - -- local 模式仍允许空(与 channel 一致的"local 不读这字段"原则) -- pros: 早期挡住,下游全部受益 -- cons: 已有空 BundleVersion 的 yaml 会一次性 fail -- effort: Trivial (~5 LOC) -- risk: 低 — 真实场景没人会故意填空 - -### 选项 2(保守):仅加 warning - -仅 `log.Warn` 不 fail。 - -- 否决 — 沿用了 silent 模式,违背本 PR 哲学 - -## Recommended Action - -待 triage。强烈推荐选项 1。 - -## Technical Details - -- 文件:`pkg/config/config.go::validateConfig` -- 测试:`pkg/config/config_test.go` 补一个负 case + 一个 local-allows-empty 正 case - -## Acceptance Criteria - -- [ ] operatorhub + 空 BundleVersion → LoadConfig 报错 -- [ ] local + 空 BundleVersion → LoadConfig 通过 -- [ ] 错误消息指出具体 `path[i].versions[j]` 索引 -- [ ] 新测试覆盖 - -## Work Log - -_待开始_ - -## Resources - -- PR #17 -- silent-failure-hunter verdict #7 diff --git a/todos/016-complete-p2-seen-map-dead-code.md b/todos/016-complete-p2-seen-map-dead-code.md deleted file mode 100644 index b11ead6..0000000 --- a/todos/016-complete-p2-seen-map-dead-code.md +++ /dev/null @@ -1,80 +0,0 @@ ---- -status: complete -priority: p2 -issue_id: "016" -tags: [code-review, simplicity, yagni, pr-17] -dependencies: [] ---- - -# `runPreflight` 的 `seen` map 是 fail-fast 下的 dead code - -## Problem Statement - -`cmd/upgrade_command.go:233-258` 的 `runPreflight` 用 `seen map[string]struct{}` 去重 Residuals。但实现是 **fail-fast 跨 path**——第一条 path 产生非空 unique 就 return。map 永远只有 0 或 1 个 entry,dedup 行为永远不触发。 - -注释 "harmless when fail-fast triggers, pays off if loop is ever changed to aggregate" 是教科书 YAGNI —— 为不存在且不在 roadmap 的行为埋复杂度。 - -两位独立 reviewer(architecture-strategist + code-simplicity-reviewer)给出相同结论。 - -## Findings - -- architecture-strategist Verdict #5: CHANGE — 唯一 must-fix -- code-simplicity-reviewer DROP #1: 减 14 LOC -- 数字:当前 ~25 LOC 含 map + key 拼接 + dup 检测;瘦版 ~12 LOC - -## Proposed Solutions - -### 选项 1(推荐,两位 reviewer 一致):删 seen map - -```go -func (uc *UpgradeCommand) runPreflight(ctx context.Context) error { - for _, path := range uc.config.UpgradePaths { - if len(path.Versions) == 0 { - continue - } - residuals, err := uc.operator.PreflightBaseline(ctx, path.Versions[0]) - if err != nil { - return err - } - if len(residuals) > 0 { - return &PreflightError{Residuals: residuals} - } - } - return nil -} -``` - -- pros: 减 14 LOC,意图清晰 -- cons: 改聚合模式时要重新加 dedup(可逆代价 5 行) -- effort: Trivial - -### 选项 2:保留 + 单测覆盖 dedup - -加单测证明 dedup 真的工作,把 dead code 转为活代码。但需要先决定"为什么 fail-fast"——如果保留 fail-fast 则 dedup 仍 dead。 - -- 否决,与 fail-fast 设计矛盾 - -## Recommended Action - -待 triage。推荐选项 1。 - -## Technical Details - -- 文件:`cmd/upgrade_command.go::runPreflight` -- 单测 dependency:#017(runPreflight 当前无单测) - -## Acceptance Criteria - -- [ ] 删除 `seen` map 及相关 dedup 逻辑 -- [ ] 行为保持:"fail-fast 在第一条有残留的 path 即停" -- [ ] 单测覆盖(与 #017 合并) - -## Work Log - -_待开始_ - -## Resources - -- PR #17 -- architecture-strategist verdict #5 -- code-simplicity-reviewer DROP #1 diff --git a/todos/017-complete-p2-cmd-layer-zero-test-coverage.md b/todos/017-complete-p2-cmd-layer-zero-test-coverage.md deleted file mode 100644 index afe85ec..0000000 --- a/todos/017-complete-p2-cmd-layer-zero-test-coverage.md +++ /dev/null @@ -1,78 +0,0 @@ ---- -status: complete -priority: p2 -issue_id: "017" -tags: [code-review, testing, pr-17] -dependencies: [] ---- - -# `cmd/upgrade_command.go` 新加的硬错误逻辑 0 单测 - -## Problem Statement - -PR 在 `cmd/upgrade_command.go` 新增 ~75 LOC 真实硬错误逻辑(`assertClusterMatch` + `runPreflight`),但**没有 cmd 包单测文件**。pr-test-analyzer 评 criticality 9(最高)。 - -具体未覆盖分支: - -### `assertClusterMatch` -- (a) `Violet == nil` → no-op -- (b) `Clusters == ""` → no-op -- (c) 空 kubeconfig → warn-and-pass(**#013 待 fix 后**:error) -- (d) `LoadFromFile` error → 包错返回 -- (e) 空 `CurrentContext` → warn-and-pass(**#014 待 fix 后**:error) -- (f) `--confirm-cluster == ""` → error("required" message) -- (g) mismatch → error("does not match" message) -- (h) exact match → pass - -### `runPreflight` -- (i) `len(Versions) == 0` 路径跳过 -- (j) 只传 baseline (`Versions[0]`),不传 v1/v2 -- (k) fail-fast — 第二条 path 的 PreflightBaseline **不被调用**(用 fake `OperatorInterface` 计数验证) -- (l) `PreflightBaseline` 返回 error 时原样 return -- (m) `--skip-preflight` 设置时整段 bypass - -## Findings - -来源:pr-test-analyzer Critical Gap #1 + #2 - -回归保护价值:guard 反向条件(如 #013/#014 修复后倒退回 warn)会被这些测试 catch 住。 - -## Proposed Solutions - -### 选项 1(推荐):新建 `cmd/upgrade_command_test.go` - -- 用 fake `OperatorInterface`(mock `UpgradeOperator` + `PreflightBaseline`) -- 用 `t.TempDir()` 写 minimal kubeconfig YAML fixture -- table-driven 覆盖上述 12 个分支 -- 估算 ~150-200 LOC - -### 选项 2:把 assertClusterMatch / runPreflight 重构成可独立测试的 pure-ish 函数 - -剥离对 `UpgradeCommand` struct 的依赖(接收明确 params 而非 receiver)。便于测试但破坏现有内聚。 - -- 否决:测试不该驱动 over-refactor - -## Recommended Action - -待 triage。选项 1。 - -## Technical Details - -- 新建文件:`cmd/upgrade_command_test.go` -- 与 #013/#014(修 silent pass)同 PR 完成,可同时验证修复 + 测试 - -## Acceptance Criteria - -- [ ] 12 个 listed 分支都有单测 -- [ ] 单测命名:`TestAssertClusterMatch_*` / `TestRunPreflight_*` -- [ ] fake OperatorInterface 至少能 count 调用次数 -- [ ] kubeconfig fixture 用 t.TempDir,不污染仓库 - -## Work Log - -_待开始_ - -## Resources - -- PR #17 -- pr-test-analyzer Critical Gap #1 + #2 diff --git a/todos/018-complete-p2-nestedstring-error-discarded.md b/todos/018-complete-p2-nestedstring-error-discarded.md deleted file mode 100644 index 7e6f97c..0000000 --- a/todos/018-complete-p2-nestedstring-error-discarded.md +++ /dev/null @@ -1,79 +0,0 @@ ---- -status: complete -priority: p2 -issue_id: "018" -tags: [code-review, robustness, preflight, pr-17] -dependencies: [] ---- - -# `checkInstallPlanResidue` 把 `NestedString` 的 err 用 `_, _, _` 丢弃 - -## Problem Statement - -`pkg/operator/operatorhub/preflight.go:131` 当前: - -```go -phase, _, _ := unstructured.NestedString(ip.Object, "status", "phase") -``` - -第三个返回值 `err` 被丢弃。两种风险: -1. **当前**:如果 `status.phase` 是错误类型(OLM CRD schema 变更把 phase 改成 `[]interface{}` 或对象),err 被吞,phase="" → 被认为非终态 → 误报为残留 → 用户白浪费时间 -2. **前瞻**:未来 OLM 重命名字段时每个 IP 都会 phase="",preflight 直接 jam 整个升级流(噪音风暴) - -CRD schema drift blast radius。 - -## Findings - -来源:silent-failure-hunter 评审 #5(FIX 级别) - -## Proposed Solutions - -### 选项 1(推荐):传播 err + 区分 found vs missing - -```go -phase, found, err := unstructured.NestedString(ip.Object, "status", "phase") -if err != nil { - return nil, fmt.Errorf("InstallPlan %s/%s: unexpected status.phase type: %w", - ip.GetNamespace(), ip.GetName(), err) -} -if !found { - logging.FromContext(ctx).Infow( - "InstallPlan has no status.phase yet, treating as live", - "name", ip.GetName(), - ) -} -``` - -- pros: 类型错误直接 fail loud;缺字段保留旧 fallthrough 行为但显式 log -- cons: 略增几行 + 引入 logging.FromContext 调用 -- effort: Small (~8 LOC) - -### 选项 2:把"无 phase 字段"也当作终态忽略 - -更保守但不准——刚创建的 IP 还没 reconcile 时 phase 缺,会被误认为终态忽略。 - -- 否决 - -## Recommended Action - -待 triage。选项 1。 - -## Technical Details - -- 文件:`pkg/operator/operatorhub/preflight.go::checkInstallPlanResidue` -- 测试:在 `preflight_test.go` 加:(a) status.phase 类型错误返回 error;(b) status 缺 phase 字段,记 log 后视为 live - -## Acceptance Criteria - -- [ ] `status.phase` 类型非 string 时 PreflightBaseline 返回 error -- [ ] `status.phase` 字段缺失时 logs Info + 当作非终态处理 -- [ ] 两个单测覆盖 - -## Work Log - -_待开始_ - -## Resources - -- PR #17 -- silent-failure-hunter verdict #5 diff --git a/todos/019-complete-p2-residual-constructor-shell-quote.md b/todos/019-complete-p2-residual-constructor-shell-quote.md deleted file mode 100644 index ac4bc53..0000000 --- a/todos/019-complete-p2-residual-constructor-shell-quote.md +++ /dev/null @@ -1,90 +0,0 @@ ---- -status: complete -priority: p2 -issue_id: "019" -tags: [code-review, type-design, security, pr-17] -dependencies: [] ---- - -# `preflight.Residual` 加 constructor 封装 shell-escape 不变量 - -## Problem Statement - -`preflight.Residual` 4 个 exported field 全裸暴露: - -```go -type Residual struct { - Kind, Namespace, Name string - RecommendedCleanup string -} -``` - -`RecommendedCleanup` 必须**已经 shell-escape**(文档注释里写了),但类型本身不保证。security-sentinel 也指出当前用 Go `%q` 不是真正的 shell-safe(`a"b` 这种 K8s 不允许的 name 即便有也会破)。 - -type-design-analyzer 给 Encapsulation/Invariant 各 2-3/5。 - -## Findings - -来源:type-design-analyzer "small refactor" + security-sentinel verdict #2 - -## Proposed Solutions - -### 选项 1(推荐):加 constructor + 转用 shellescape - -```go -// pkg/operator/preflight/types.go -func NewResidual(kind, namespace, name string) Residual { - return Residual{ - Kind: kind, - Namespace: namespace, - Name: name, - RecommendedCleanup: fmt.Sprintf("kubectl delete %s %s -n %s", - strings.ToLower(kind), - shellescape.Quote(name), - shellescape.Quote(namespace)), - } -} -``` - -引入 `github.com/alessio/shellescape`(small dep, MIT, 单文件实现)。 - -- pros: 不变量编译期保证;调用方更短;K8s 名字含 quote 等边界 case 也正确(即便理论场景) -- cons: 加一个外部 deps;现有 3 个调用点都要改 -- effort: Small (~30 LOC + 3 处调用点改写) - -### 选项 2:constructor 但仍用 `%q` - -不引入新 dep,仅集中转义到一处。 - -- pros: 零新 deps -- cons: 没解决 security-sentinel 的 `%q` ≠ shell-safe 论点 -- effort: Trivial - -### 选项 3:保留现状 + 加单测断言 RecommendedCleanup 形态 - -- 否决:测试不能弥补类型设计缺陷 - -## Recommended Action - -待 triage。优先 1,可接受 2 作 trade-off。 - -## Technical Details - -- 文件:`pkg/operator/preflight/types.go`、`pkg/operator/operatorhub/preflight.go`(3 处调用点) -- go.mod 引入 `github.com/alessio/shellescape v1.4.x`(如选 1) - -## Acceptance Criteria - -- [ ] 所有 Residual 从 constructor 构造,不再字面量构造(除测试) -- [ ] RecommendedCleanup 即使 name 含 quote / `$` / 空格也能正确 shell-quoted -- [ ] 单测:name 含特殊字符时 cleanup 命令仍然 shell-safe - -## Work Log - -_待开始_ - -## Resources - -- PR #17 -- type-design-analyzer recommendation -- security-sentinel verdict #2 From 147fc0f09e9d158200069e8ded948bff5e25add0 Mon Sep 17 00:00:00 2001 From: "huanyang@alauda.io" Date: Tue, 19 May 2026 20:52:01 +0800 Subject: [PATCH 10/10] chore(todos): drop P3 review follow-ups for PR #17 The four P3 nice-to-haves (test variants, internal simplifications, agent-ergonomics knobs, misc cleanups) are not on the roadmap and filing them as todos would just decorate the backlog with low-value items nobody plans to pick up. The PR description still records the reviewer recommendations for archeology; the file copies were redundant. --- ...020-pending-p3-test-coverage-extensions.md | 63 --------------- ...21-pending-p3-preflight-simplifications.md | 76 ------------------- ...2-pending-p3-agent-ergonomics-followups.md | 57 -------------- todos/023-pending-p3-misc-cleanups.md | 54 ------------- 4 files changed, 250 deletions(-) delete mode 100644 todos/020-pending-p3-test-coverage-extensions.md delete mode 100644 todos/021-pending-p3-preflight-simplifications.md delete mode 100644 todos/022-pending-p3-agent-ergonomics-followups.md delete mode 100644 todos/023-pending-p3-misc-cleanups.md diff --git a/todos/020-pending-p3-test-coverage-extensions.md b/todos/020-pending-p3-test-coverage-extensions.md deleted file mode 100644 index 3fe4edc..0000000 --- a/todos/020-pending-p3-test-coverage-extensions.md +++ /dev/null @@ -1,63 +0,0 @@ ---- -status: pending -priority: p3 -issue_id: "020" -tags: [code-review, testing, pr-17] -dependencies: [] ---- - -# 测试覆盖完整性补充(transient types / label drift / regex edge) - -## Problem Statement - -PR 单测覆盖了主路径,但有几条 cheap insurance 没加: - -1. **`isTransientAPIError` 三种类型只测了 1 种**:当前 `TestPreflightBaseline_TransientErrorWrapsAsRetryHint` 只用 `apierrors.NewServerTimeout`。`TooManyRequests` 和 `ServiceUnavailable` 也是 transient 但无测试。 -2. **InstallPlan label fixture drift**:`newIPWithPhase` 把 label key 写死 `"operators.coreos.com/tektoncd-operator.test-ns"`,与 `preflight.go` 中 `fmt.Sprintf(...)` 拼接逻辑不一致时不会被发现。 -3. **`bundleVersionRegex` 边缘**:5 个 shell-meta sub-case 覆盖 ASCII,没测 `\x00` NUL byte 和 unicode lookalike(如 `1.0$(x)`,U+FF04)。这些都该被 regex 拒绝,pin behavior。 -4. **InstallPlan 无 label 的 negative case**:测试都给 IP 加了 OLM label,没有"label 缺失 → 应该被 labelSelector 过滤"的负 case。 - -## Findings - -来源:pr-test-analyzer #4, #5, #6, #9 - -## Proposed Solutions - -### 选项 1(推荐):批量补单测 - -按主题各加 2-3 个 sub-case: -- 在 `TestPreflightBaseline_TransientErrorWrapsAsRetryHint` 用 t.Run 加 TooManyRequests / ServiceUnavailable -- `newIPWithPhase` 改签名接受 `(operatorName, namespace string)`,内部用同样的 `fmt.Sprintf` 拼 label key -- 在 `TestLoadConfig_RejectsShellMetaCharacterInBundleVersion` 加 NUL byte + unicode-dollar 2 个 sub-case -- 新增 `TestPreflightBaseline_InstallPlanWithoutOLMLabelIgnored` - -- pros: cheap insurance,回归保护 -- cons: 测试代码 +30~50 LOC -- effort: Small - -### 选项 2:分散到各 todo(与对应实现 PR 一起) - -把每条测试拆分到对应实现修改的 todo(如 `#018` 同时补 NestedString 测试)。 - -- pros: 测试与实现同 commit -- cons: 部分纯增强测试(如 transient types)没有对应实现 todo - -## Recommended Action - -待 triage。 - -## Acceptance Criteria - -- [ ] `TestPreflightBaseline_TransientError*` 覆盖 3 种 transient 类型 -- [ ] `newIPWithPhase` 与 impl 共享 label key 拼接 -- [ ] `bundleVersionRegex` 测试覆盖 NUL byte + unicode -- [ ] 新增 IP-without-label 测试 - -## Work Log - -_待开始_ - -## Resources - -- PR #17 -- pr-test-analyzer findings #4, #5, #6, #9 diff --git a/todos/021-pending-p3-preflight-simplifications.md b/todos/021-pending-p3-preflight-simplifications.md deleted file mode 100644 index a225464..0000000 --- a/todos/021-pending-p3-preflight-simplifications.md +++ /dev/null @@ -1,76 +0,0 @@ ---- -status: pending -priority: p3 -issue_id: "021" -tags: [code-review, simplicity, pr-17] -dependencies: [] ---- - -# `preflight.go` 内部小简化(checks slice / phases map) - -## Problem Statement - -code-simplicity-reviewer 提出两处轻量过度抽象: - -1. **`checks []struct{name, fn}` 三件套循环**(`preflight.go:52-71`):3 个静态调用伪装成数据驱动。新增 check = 改 struct slice + 加方法 + 改测试,与直接加一行调用同代价。 -2. **`installPlanTerminalPhases` map**(`preflight.go:30-33`):2 entry 的 `map[string]struct{}` 用 `switch phase { case "Complete", "Failed": continue }` 更直白。 - -合计减 ~15 LOC,**无行为变化**。 - -## Findings - -来源:code-simplicity-reviewer SIMPLIFY #2 + #3 - -## Proposed Solutions - -### 选项 1(推荐):双合一改 - -```go -// preflight.go -func (o *Operator) PreflightBaseline(ctx context.Context, version config.Version) ([]preflight.Residual, error) { - pCtx, cancel := context.WithTimeout(ctx, preflightTimeout) - defer cancel() - - var residuals []preflight.Residual - for _, fn := range []checkFn{ - o.checkSubscriptionResidue, - o.checkArtifactVersionResidue, - o.checkInstallPlanResidue, - } { - // 错误分类逻辑保持原样 - } - // ... -} -``` - -或更直接:3 个独立调用 + helper for err wrapping。 - -phases map → switch case。 - -- pros: 减 LOC、意图清晰 -- cons: 重构本身需要谨慎不要破坏单测 -- effort: Small - -### 选项 2:保持现状 - -- "未来或许会加 check" — 但实际加 check 是单点 commit,今天的复杂度不必为想象的未来买单。 -- 否决(按 simplicity 论证) - -## Recommended Action - -待 triage。 - -## Acceptance Criteria - -- [ ] `checks` slice 删除,3 个 check 直接顺序调用 -- [ ] `installPlanTerminalPhases` map 删除,改用 switch -- [ ] 所有现有单测仍 pass - -## Work Log - -_待开始_ - -## Resources - -- PR #17 -- code-simplicity-reviewer SIMPLIFY #2 + #3 diff --git a/todos/022-pending-p3-agent-ergonomics-followups.md b/todos/022-pending-p3-agent-ergonomics-followups.md deleted file mode 100644 index b2ed307..0000000 --- a/todos/022-pending-p3-agent-ergonomics-followups.md +++ /dev/null @@ -1,57 +0,0 @@ ---- -status: pending -priority: p3 -issue_id: "022" -tags: [code-review, agent-native, ergonomics, pr-17] -dependencies: [] ---- - -# Agent-friendliness + docs gap 综合改进(多项建议合并) - -## Problem Statement - -来自 agent-native-reviewer + security-sentinel 的小改进合集: - -1. **`--preflight-output=json` flag**(agent-native #1):subprocess 调用方(CI / agent)只能 regex-parse stderr 的 markdown 文本。加一个 JSON 输出模式让 `jq -r '.residuals[].name'` 这种调用方便很多。~30 LOC。 - -2. **exit code 区分**(agent-native #2):当前 preflight 失败 / violet 失败 / kubeconfig 错都 exit=1。在 `main.go` 加 `errors.As` 区分:PreflightError → exit 2,cluster mismatch → exit 3,其他保持 1。让 agent retry policy 能区分。 - -3. **`--preflight-only` flag**(agent-native #3 vs simplicity-reviewer 立场对立):agent 想"只验证不升级"目前要让 violet 阶段崩,浪费 + 部分副作用。simplicity 立场是"5 行可未来补,YAGNI"。**这是真正决策点**——录此 todo 让 triage 拍板。 - -4. **`--confirm-cluster` 信任 kubeconfig 内容假设文档化**(security MEDIUM #4):guard 依赖"操作者拥有 kubeconfig 文件"。文档化这条假设;可选 defense-in-depth 是同时比 `Contexts[ctx].Cluster.Server`。 - -5. **`--skip-preflight` audit log 落到外部系统**(security MEDIUM #6):当前只有 `log.Warn` 到 process stderr。共享 CI runner 上无法外部审计。建议要么需 env `UPGRADE_ALLOW_SKIP_PREFLIGHT=1` 显式开关;要么至少 log 时带 `os.Getenv("USER")`、kubeconfig context 等结构化字段方便 Loki/CloudWatch alert。 - -## Findings - -来源(合并):agent-native-reviewer warnings #1/#2/#3 + security-sentinel MEDIUM #4/#6 - -## Proposed Solutions - -各条独立 trade-off,建议 triage 时分别决策: - -| # | 选项 | 推荐 | -|---|------|------| -| 1 | `--preflight-output=json` flag | Yes, fix-now (~30 LOC) | -| 2 | exit code 区分 | Yes if main.go in scope (~10 LOC);否则 defer | -| 3 | `--preflight-only` flag | **决策点**(reviewers 立场对立) | -| 4 | kubeconfig trust doc | Yes — 一段 README 即可 | -| 5 | `--skip-preflight` 加 env 守门 | Defer or accept doc trade-off | - -## Recommended Action - -待 triage。 - -## Acceptance Criteria - -按 triage 决定后再补。 - -## Work Log - -_待开始_ - -## Resources - -- PR #17 -- agent-native-reviewer findings #1/#2/#3 -- security-sentinel verdicts #4/#6 diff --git a/todos/023-pending-p3-misc-cleanups.md b/todos/023-pending-p3-misc-cleanups.md deleted file mode 100644 index 1fec5be..0000000 --- a/todos/023-pending-p3-misc-cleanups.md +++ /dev/null @@ -1,54 +0,0 @@ ---- -status: pending -priority: p3 -issue_id: "023" -tags: [code-review, simplicity, cleanup, pr-17] -dependencies: [] ---- - -# 杂项清理(test / log / 注释) - -## Problem Statement - -多个 reviewer 抓到的低风险机械改动: - -1. **DROP `TestPreflightError_ZeroResidualsStillReadable`**(simplicity + pr-test #8):测试自己注释承认 "not expected at runtime"。Go 的 `len`/`for range`/`fmt.Fprintf` 对空 slice 安全是语言保证,不需要测试。 -2. **`cmd.SilenceUsage = true` 挪出 `AddFlags`**(simplicity #7):`AddFlags` 名字承诺加 flag,把 cobra 行为开关放这里 surprise。改放在 root command 构造处(`cmd/root.go` 的 `cobra.Command{}` 字面量字段)。 -3. **`local.PreflightBaseline` 加 Info log**(silent-failure #1):当前 `return nil, nil` 完全静默。一行 `log.Infow("preflight: local operator — OLM residue scan skipped")` 把"silent skip"转为"documented skip"。 -4. **`isTransientAPIError(nil) == false` 注释 lock invariant**(silent-failure #6):在 PreflightBaseline switch 上方加一行注释明确这个 invariant,未来 reader 不必反查 helper 实现。 -5. **`assertClusterMatch` 两个 warn 分支合并**(simplicity #6):in-cluster 和 empty-currentCtx 现在是两个独立的"无法读到 context"分支,可合并。**注意**:与 #013/#014 修复冲突(那两个 P1 把 warn 改 error)—— 等 #013/#014 落地后再考虑是否还需要合并。 - -## Findings - -来源(合并):simplicity DROP #4 + #6 + #7 + silent-failure #1 + #6 - -## Proposed Solutions - -按条目分别处理,无 trade-off 争议。可一次性 commit。 - -## Recommended Action - -待 triage。**注意**:第 5 条与 #013/#014 dependency。 - -## Technical Details - -- 文件:`cmd/preflight_error_test.go`(删测试)、`cmd/root.go` + `cmd/upgrade_command.go`(SilenceUsage 挪位)、`pkg/operator/local/operator.go`(Info log)、`pkg/operator/operatorhub/preflight.go`(注释) -- 估算总改动:净 -10 LOC - -## Acceptance Criteria - -- [ ] `TestPreflightError_ZeroResidualsStillReadable` 删除 -- [ ] `SilenceUsage` 不在 `AddFlags` 里 -- [ ] `local.PreflightBaseline` 有一行 Info log -- [ ] `PreflightBaseline` switch 上方有 "isTransientAPIError(nil)==false invariant" 注释 -- [ ] `assertClusterMatch` 分支合并视 #013/#014 状态决定 - -## Work Log - -_待开始_ - -## Resources - -- PR #17 -- code-simplicity-reviewer DROP #4, SIMPLIFY #6 + #7 -- silent-failure-hunter #1 + #6