feat(cli): add okactl command-line tool for sandbox operations#497
feat(cli): add okactl command-line tool for sandbox operations#497Liquorice-Ma wants to merge 10 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #497 +/- ##
==========================================
+ Coverage 78.34% 80.13% +1.78%
==========================================
Files 162 208 +46
Lines 11739 15416 +3677
==========================================
+ Hits 9197 12353 +3156
- Misses 2187 2606 +419
- Partials 355 457 +102
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
|
||
| OpenKruise Agents 项目原有四个组件(controller、manager、gateway、runtime),都是长运行服务。日常运维中,对 SandboxSet 的扩缩容、镜像更新、容器重启等操作只能通过 `kubectl edit` 或手写 YAML 完成,操作繁琐且容易出错。 | ||
|
|
||
| 为此,我们新增了第五个组件 **agent-cli** —— 一个 kubectl 风格的命令行工具,提供三个核心命令: |
|
|
||
| 前两个命令(scale、set image)是**纯客户端操作**,直接通过 K8s API 修改 SandboxSet CR,由已有的 SandboxSet controller 自动处理变更。 | ||
|
|
||
| 第三个命令(restart)采用了**CRD 驱动模式**(参照 OpenKruise 的 ContainerRecreateRequest):CLI 创建一个 `SandboxContainerRestart` CR,新增的 controller 监听并执行实际的容器重启操作。 |
There was a problem hiding this comment.
cli create CRR is enough, no need to create SandboxContainerRestart CR
| @@ -0,0 +1,227 @@ | |||
| # agent-cli 命令行工具设计与实现文档 | |||
There was a problem hiding this comment.
plz put the proposal in docs/proposals, and rewrite the design in English for wider audience
feat(cli): add okactl command-line tool for sandbox operations fix(cli): add in-cluster config support for okactl running inside Pods Update okactl binary okactl scale/setimage/restart -h
920a645 to
57d6ff2
Compare
06fb203 to
b48e3df
Compare
Add create suo subcommand that creates SandboxUpdateOps to batch update container images of claimed sandboxes by label selector. Includes auto-cleanup of existing SUOs, container name validation, and correct PodTemplateSpec-level patch structure. Also adds proposal doc for the create suo feature.
- Add E2E tests for scale, set image, create suo, and restart commands - Add dedicated e2e-okactl.yaml workflow running on kind cluster - Add unit tests to meet 80% coverage requirement - Fix set image optimistic lock conflict with retry.RetryOnConflict
59cca84 to
cbfe589
Compare
… display - Add top-level 'okactl status' command group with 'sbs' and 'suo' subcommands - status sbs: show SandboxSet rolling update progress with auto-diagnosis - status suo: show SandboxUpdateOps batch update progress - Both support --wait flag for polling until completion - Add cobra Aliases for resource short names (sandboxset/sbs, sandboxupdateops/suo) - Customize cobra usage template to display subcommand aliases in help output - Remove 'set image status' subcommand (replaced by top-level 'status sbs') - Update set image examples to reference 'okactl status sbs' - Add developer manual for okactl CLI - Add proposal document for status command design - Add unit tests (coverage: 85.2%) and E2E tests for new commands Signed-off-by: 马赫 <Yurong.mh@alibaba-inc.com>
furykerry
left a comment
There was a problem hiding this comment.
plz never submit binary to the repo
| } | ||
|
|
||
| var deleted []string | ||
| for i := range list.Items { |
There was a problem hiding this comment.
deleteActiveSandboxUpdateOps deletes ALL SUOs, not just active ones, plz Only delete SUOs with Phase == Pending || Phase == Updating
| } | ||
|
|
||
| if sbx.Spec.Template == nil { | ||
| return nil |
There was a problem hiding this comment.
plz fetch the referenced SandboxTemplate to validate if Template nil
|
|
||
| // diagnoseSandboxSetUpdate checks sandboxes belonging to a SandboxSet and reports any issues. | ||
| // It builds a kubernetes client to inspect pod status when sandbox messages are empty. | ||
| func diagnoseSandboxSetUpdate(globalOpts *GlobalOptions, sbs *agentsv1alpha1.SandboxSet, reported *map[string]bool) { |
There was a problem hiding this comment.
ach diagnosis call creates two new REST clients via globalOpts.AgentsClient() and globalOpts.KubeClient(), which involves re-reading kubeconfig and establishing new TLS connections. In --wait mode this happens every 3 poll cycles (~9s), consider pass the already-created clients as parameters to diagnoseSandboxSetUpdate.
| } | ||
|
|
||
| // formatSuoImagePairs formats a map of container=image pairs as a slice of "container=image" strings. | ||
| func formatSuoImagePairs(images map[string]string) []string { |
There was a problem hiding this comment.
ormatSuoImagePairs() and buildSuoImagePatch() generate Non-deterministic output from map iteration, consider sort keys before iteration in both functions.
| } | ||
|
|
||
| // parseContainerImages parses "container=image" pairs and returns a map. | ||
| func parseContainerImages(args []string) (map[string]string, error) { |
There was a problem hiding this comment.
setimage.go and create.go duplicate parseContainerImages / parseSuoContainerImages, Consider extracting a shared parseImageArgs() to reduce duplication.
| okactl status sbs my-pool --wait | ||
|
|
||
| # Batch update images for claimed sandboxes | ||
| okactl create suo -l app=my-app app=nginx:1.25 |
There was a problem hiding this comment.
is it possible to considate create suo command with set image
okactl set image -l app=my-app app=nginx:1.25
There was a problem hiding this comment.
Originally set image supported both sbs and sbx, but create suo was split out in an earlier review so that set image only handles SandboxSet while claimed sandboxes are updated via SUO.
| okactl set image sbs my-pool app=nginx:1.25 | ||
|
|
||
| # Check update progress (or wait for completion) | ||
| okactl status sbs my-pool --wait |
There was a problem hiding this comment.
--wait should be the option of set image not status
Binary is now distributed via GitHub Releases instead of committing to the repository. Signed-off-by: 马赫 <Yurong.mh@alibaba-inc.com>
- Use apierrors.IsNotFound instead of string matching in waitForSUODeletion - Replace parseSuoSelectorToMap with metav1.ParseToLabelSelector for full label selector syntax support (key in (v1,v2), key!=value, etc.) - Validate container names against all matching sandboxes instead of only the first; warn on partial mismatch, error only when missing from all - Add Running phase check to restart command before creating CRR - Remove --wait flag from status command (belongs to set image only) - Sync developer manual and proposal documents Signed-off-by: 马赫 <Yurong.mh@alibaba-inc.com>
zmberg
left a comment
There was a problem hiding this comment.
Reviewed a couple of items — see inline comments.
| } | ||
|
|
||
| containers := o.containers | ||
| if len(containers) == 0 { |
There was a problem hiding this comment.
Restarting ALL containers by default is dangerous.
When no -c flag is provided, the command silently restarts every container in the sandbox (including sidecars, init containers, etc.). A user might accidentally run okactl restart sbx my-sbx without -c and cause a production incident.
Suggested fix: make -c required and add an explicit --all flag:
okactl restart sbx my-sbx -c app # restart specific container
okactl restart sbx my-sbx --all # explicitly restart all
okactl restart sbx my-sbx # error: must specify -c or --allAlternatively, when no -c is given, list available container names and ask the user to specify explicitly.
| limitations under the License. | ||
| */ | ||
|
|
||
| package main |
There was a problem hiding this comment.
PR description is severely outdated and does not match the actual implementation.
The PR description mentions:
- Binary name:
agent-cli→ Actual:okactl - Path:
cmd/agent-cli/→ Actual:cmd/okactl/ - New CRD:
SandboxContainerRestart(scr) → Actual: uses OpenKruiseContainerRecreateRequest(CRR), no new CRD - New controller:
pkg/controller/sandboxcontainerrestart/→ Actual: no new controller - 3 commands → Actual: 5 commands (scale, set image, restart, create suo, status)
- Included 47MB binary to be removed → Actual: binary already removed,
.gitignoreadded
Please update the PR description to accurately reflect the current implementation.
zmberg
left a comment
There was a problem hiding this comment.
Code-level review — see 8 inline comments below covering error handling, missing timeouts, hardcoded values, and logic gaps.
|
|
||
| printSandboxSetStatus(sbs) | ||
| var reported map[string]bool | ||
| kubeClient, _ := globalOpts.KubeClient() |
There was a problem hiding this comment.
Error silently ignored. Project guidelines state "Never ignore errors. Always check and handle err."
If the kubeconfig is invalid, the diagnosis feature will silently skip with no feedback to the user. Consider returning the error or at least printing a warning.
kubeClient, err := globalOpts.KubeClient()
if err != nil {
fmt.Fprintf(os.Stderr, "Warning: failed to create kube client for diagnosis: %v\n", err)
}| // Create kubeClient once for diagnosis instead of on every poll cycle | ||
| kubeClient, _ := globalOpts.KubeClient() | ||
|
|
||
| for { |
There was a problem hiding this comment.
No timeout for --wait. The polling loop runs forever with no exit condition beyond completion. If the update never completes (e.g., misconfigured image, persistent resource shortage), the CLI hangs indefinitely.
Also, ctx.Done() is never checked, which means Ctrl+C / SIGTERM won't be handled gracefully.
Suggestion: add a --timeout flag (default e.g. 5m) and use context.WithTimeout. Also add select { case <-ctx.Done(): return ctx.Err(); case <-time.After(pollInterval): } instead of time.Sleep.
| } | ||
|
|
||
| sbxList, err := agentsClient.Sandboxes(ns).List(context.TODO(), metav1.ListOptions{ | ||
| LabelSelector: fmt.Sprintf("agents.kruise.io/sandbox-template=%s", sbs.Name), |
There was a problem hiding this comment.
Hardcoded label selector key. The string "agents.kruise.io/sandbox-template" should use the constant from the API package (e.g. agentsv1alpha1.LabelSandboxTemplate). If the label key ever changes, this will silently break.
| return fmt.Errorf("invalid label selector %q: %w", o.selector, err) | ||
| } | ||
|
|
||
| // List all sandboxes and filter by label selector |
There was a problem hiding this comment.
Client-side filtering instead of server-side. All sandboxes are listed and then filtered locally. For namespaces with many sandboxes this is inefficient and puts unnecessary load on the API server.
Suggestion: pass the selector to the API server directly:
sbxList, err := client.Sandboxes(ns).List(ctx, metav1.ListOptions{
LabelSelector: o.selector,
})This also makes the labels.Parse call at line 112 unnecessary.
| for i := range sandboxes { | ||
| sbx := &sandboxes[i] | ||
| known := make(map[string]bool) | ||
| if sbx.Spec.Template != nil { |
There was a problem hiding this comment.
validateSuoImageContainers doesn't handle TemplateRef sandboxes. When sbx.Spec.Template == nil (TemplateRef is used), the known map is empty, so every requested container will be counted as "missing" for that sandbox. This could trigger false positive warnings or even block creation if ALL matched sandboxes use TemplateRef.
Suggestion: resolve container names from the referenced SandboxTemplate when Template is nil, similar to how fetchContainerNames in restart.go handles it.
| // removeSUOFinalizer removes the finalizer from a SandboxUpdateOps via JSON patch. | ||
| // This allows the SUO to be deleted immediately without waiting for the controller to process it. | ||
| func removeSUOFinalizer(client apiv1alpha1.ApiV1alpha1Interface, ns, name string) error { | ||
| patch := []byte(`[{"op":"replace","path":"/metadata/finalizers","value":[]}]`) |
There was a problem hiding this comment.
Hardcoded JSON Patch with replace op. The replace operation on /metadata/finalizers will fail if the field doesn't exist (no finalizers). Consider using MergePatch with {"metadata":{"finalizers":[]}} which is idempotent regardless of whether finalizers exist.
Also, this bypasses the SUO controller's finalizer logic. If the controller relies on the finalizer to clean up internal state (e.g., expectations, rate limiters), removing it externally could leave stale state behind.
| "podName": sandboxName, | ||
| "containers": containerTargets, | ||
| "strategy": map[string]interface{}{ | ||
| "failurePolicy": "Fail", |
There was a problem hiding this comment.
failurePolicy hardcoded to "Fail". Users cannot choose the Ignore policy, which would continue restarting remaining containers even if one fails. Consider adding a --failure-policy flag to let users choose between Fail and Ignore.
zmberg
left a comment
There was a problem hiding this comment.
Additional code-level review — 8 more inline comments covering CLI responsibility boundaries, error handling, and type safety.
General note: create.go currently only contains the create suo subcommand. Consider renaming to createsuo.go for consistency with setimage.go naming convention.
| return fmt.Errorf("invalid label selector %q: %w", o.selector, err) | ||
| } | ||
|
|
||
| // List all sandboxes and filter by label selector |
There was a problem hiding this comment.
CLI should not perform sandbox listing, filtering, or container validation. These three blocks (list sandboxes, client-side filter by label selector, validate container names) are the SUO controller's responsibility, not the CLI's. The CLI's job is simply to create the SUO resource with the selector and patch — the controller will match sandboxes by selector and record any container name issues in the SUO status.
Problems with CLI doing this validation:
- Client-side filtering is inefficient (already noted in a separate comment)
- Validation logic is duplicated with the controller, requiring maintenance in two places
- TOCTOU issue: the check inspects current state, but sandboxes may change before the SUO is processed
Suggestion: Remove the list/filter/validate block entirely. Just build the SUO spec and create it.
| } | ||
|
|
||
| // Delete any active (non-terminal) SUO before creating a new one | ||
| if err := deleteActiveSandboxUpdateOps(client, ns); err != nil { |
There was a problem hiding this comment.
CLI should not manage existing resource lifecycles. A CLI tool's job is to create resources, not to delete active SUOs or force-remove finalizers. If there's a conflict with an existing active SUO, the API server or an admission webhook should reject the new creation with a clear error.
Additionally, deleteActiveSandboxUpdateOps deletes ALL active SUOs in the namespace regardless of selector, which could affect unrelated operations.
Suggestion: Remove this block entirely. If conflict detection is needed, add it as an admission webhook on the API server side.
| return err | ||
| } | ||
|
|
||
| labelSelector, err := metav1.ParseToLabelSelector(o.selector) |
There was a problem hiding this comment.
Label selector is parsed twice. labels.Parse at line 112 for client-side filtering, and metav1.ParseToLabelSelector here for the SUO spec. If the client-side filtering is removed (as suggested above), only this parse is needed.
| const maxWait = 10 * time.Second | ||
| const pollInterval = 500 * time.Millisecond | ||
|
|
||
| for elapsed := time.Duration(0); elapsed < maxWait; elapsed += pollInterval { |
There was a problem hiding this comment.
elapsed counter is inaccurate. It only accumulates pollInterval but ignores the actual time spent in the Get call. Under slow API server conditions, the real wait duration could significantly exceed maxWait.
Suggestion: Use context.WithTimeout and check ctx.Done() instead.
| // When running locally, it falls back to the kubeconfig file. | ||
| func (o *GlobalOptions) RESTConfig() (*rest.Config, error) { | ||
| if o.KubeConfig == "" && o.Context == "" { | ||
| if cfg, err := inClusterConfigFn(); err == nil { |
There was a problem hiding this comment.
In-cluster config failure is silently swallowed. If the user expects in-cluster config to work but it fails (e.g., missing ServiceAccount token), the code silently falls back to kubeconfig with no indication. Consider logging at debug level when in-cluster config is attempted but fails.
| }) | ||
| } | ||
|
|
||
| // CRR targets the Pod which has the same name as the Sandbox |
There was a problem hiding this comment.
Should use the official OpenKruise typed client instead of unstructured.Unstructured. OpenKruise provides a type-sa ent API at github.com/openkruise/kruise-api that includes ContainerRecreateRequest typed definitions. Using map[string]interface{} to manually construct the CRR is error-prone and hard to maintain.
Reference: https://github.com/openkruise/kruise-api
Benefits of using the typed client:
- Compile-time type checking, preventing field name typos
- No need to manually maintain apiVersion/kind strings
- DeepCopy and serialization are guaranteed by generated code
- Direct use of
apps.kruise.io/v1alpha1.ContainerRecreateRequesttype
| "apiVersion": "apps.kruise.io/v1alpha1", | ||
| "kind": "ContainerRecreateRequest", | ||
| "metadata": map[string]interface{}{ | ||
| "generateName": sandboxName + "-restart-", |
There was a problem hiding this comment.
generateName causes CRR accumulation. Each restart creates a new CRR with no cleanup of previous ones. Multiple restarts of the sa sandbox will accumulate CRR objects in the namespace.
Suggestion: Consider checking for an existing active CRR and reusing it, or using a deterministic name to avoid duplication.
| } | ||
| for container := range images { | ||
| if !found[container] { | ||
| return fmt.Errorf("container %q not found in sandboxset %q", container, name) |
There was a problem hiding this comment.
Misleading error wrapping inside RetryOnConflict. This validation error is correctly not retried, but it gets wrapped by the outer error as "failed to update sandboxset: container not found" — misleading because the update didn't fail, the pre-update validation did.
Suggestion: Move validation outside RetryOnConflict, or return a sentinel error that skips the outer wrapping.
b56e69e to
8519b29
Compare
fix 16 comments 12 12
Ⅰ. Describe what this PR does
Add okactl — a kubectl-style CLI tool for OpenKruise Agents sandbox operations, eliminating the need to hand-write YAML or use kubectl edit.
Commands
okactl scale sandboxset NAME --replicas=N (alias: sbs)
Scale a SandboxSet's replica count via JSON Merge Patch (atomic, no optimistic-lock conflicts).
okactl set image sandboxset NAME CONTAINER=IMAGE [...] (alias: sbs)
Update container images in a SandboxSet's inline template. Detects TemplateRef usage and guides users to modify the SandboxTemplate directly. Supports --wait with --timeout to poll until the rolling update completes.
okactl restart sandbox NAME [-c CONTAINER ...] [--all] [--failure-policy=Fail|Ignore] (alias: sbx)
Restart containers in a running Sandbox by creating an OpenKruise ContainerRecreateRequest (CRR). Requires explicit -c or --all to prevent accidental full restarts. Supports --failure-policy (Fail stops on error, Ignore continues).
okactl create suo -l SELECTOR CONTAINER=IMAGE [...] (alias: sandboxupdateops)
Create a SandboxUpdateOps to batch-update images for claimed sandboxes. Uses server-side label selector filtering. Automatically cleans up existing active SUOs, with finalizer removal as a fallback only if the controller is unavailable.
okactl status sbs NAME / okactl status suo NAME
Show update progress for SandboxSet or SandboxUpdateOps, with automatic diagnosis of stalled updates (e.g., ImagePullBackOff, insufficient resources).
Files
cmd/okactl/main.go — CLI entrypoint
pkg/cli/ — All command implementations and unit tests
docs/developer-manuals/okactl.md — Developer manual
docs/proposals/ — Three design proposals
test/e2e/okactl_test.go — E2E tests
Ⅱ. Does this pull request fix one issue?
NONE
Ⅲ. Describe how to verify it
Ⅳ. Special notes for reviews
restart sandbox uses dynamic.Interface to create OpenKruise ContainerRecreateRequest (CRR) because CRR belongs to apps.kruise.io, not agents.kruise.io.
create suo deletes existing active SUOs before creating a new one. It first issues a Delete and waits for the controller to process finalizer cleanup (which also removes sandbox labels). Finalizer is force-removed via MergePatch only as a timeout fallback.
restart sandbox requires explicit -c or --all; it does not restart all containers silently by default.
Developer manual is also synced to the openkruise.io repo.