feat: add NetworkPolicy support for sandbox isolation #291

Sanchit2662 wants to merge 7 commits into volcano-sh:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing
Code Review
This pull request implements network isolation for sandboxes using Kubernetes NetworkPolicies, supporting 'None', 'Restricted', and 'Custom' modes via the AgentRuntime and CodeInterpreter CRDs. The workload-manager is updated to manage the lifecycle of these policies, including automated creation and cleanup. A critical issue was identified in pkg/workloadmanager/workload_builder.go where the per-session NetworkPolicy override is ignored during the creation of CodeInterpreter sandboxes when warm pools are not utilized.
- Add SandboxNetworkPolicy API type (None/Restricted/Custom modes)
- Build per-sandbox NetworkPolicy with router ingress always allowed
- Support cross-namespace router via NamespaceSelector
- Per-session NP override in CreateSandboxRequest (replace semantics)
- Configurable router selector/namespace via chart values and CLI flags
- Clean up NP on sandbox delete, rollback, and GC
- Add networkpolicies RBAC to the workloadmanager ClusterRole

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
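The "replace semantics" of the per-session override can be sketched as follows. This is a minimal stand-alone illustration with a local stub type, not the PR's actual code (the diff does contain an `effectiveNetworkPolicy` helper, but its real signature uses the CRD types):

```go
package main

import "fmt"

// SandboxNetworkPolicy is a local stand-in for the v1alpha1 API type.
type SandboxNetworkPolicy struct {
	Mode string
}

// effectiveNetworkPolicy illustrates replace semantics: a non-nil session
// override wholly replaces the template default; rules are never merged.
// (Sketch only; the real helper in workload_builder.go takes the CRD types.)
func effectiveNetworkPolicy(session, template *SandboxNetworkPolicy) *SandboxNetworkPolicy {
	if session != nil {
		return session
	}
	return template
}

func main() {
	tmpl := &SandboxNetworkPolicy{Mode: "Restricted"}
	sess := &SandboxNetworkPolicy{Mode: "None"}
	fmt.Println(effectiveNetworkPolicy(sess, tmpl).Mode) // session wins when set
	fmt.Println(effectiveNetworkPolicy(nil, tmpl).Mode)  // falls back to template
}
```

Replace-not-merge keeps the behavior predictable: a caller who sets any session policy takes full responsibility for it, rather than getting a hard-to-debug union of template and session rules.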
Force-pushed c8b70b4 to 9e2f065
Pull request overview
Adds optional Kubernetes NetworkPolicy generation/cleanup to improve sandbox network isolation in AgentCube’s Workload Manager, closing the “no network restrictions” gap for untrusted sandbox code execution.
Changes:
- Introduces `SandboxNetworkPolicy` in runtime API templates (AgentRuntime + CodeInterpreter) and a per-session override via `CreateSandboxRequest`.
- Implements NetworkPolicy building (None/Restricted/Custom) plus lifecycle management (create on sandbox create; best-effort delete on rollback, delete API, and GC).
- Adds router selector/namespace configuration (flags + Helm values) to support cross-namespace router→sandbox ingress allow rules.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/workloadmanager/workload_builder.go | Wires template/session policy into sandbox build flow and returns a NetworkPolicy object for creation. |
| pkg/workloadmanager/server.go | Adds config defaults for router selector/namespace used by Restricted mode ingress. |
| pkg/workloadmanager/networkpolicy_builder.go | Builds NetworkPolicy objects for None/Restricted/Custom modes (router ingress + optional DNS egress). |
| pkg/workloadmanager/networkpolicy_builder_test.go | Unit tests for NetworkPolicy building and “replace, not merge” selection behavior. |
| pkg/workloadmanager/k8s_client.go | Adds helpers to create/delete NetworkPolicies and generalizes clientset type. |
| pkg/workloadmanager/handlers.go | Creates NetworkPolicy during sandbox creation; deletes on rollback and explicit delete handler. |
| pkg/workloadmanager/handlers_test.go | Updates handler tests for new builder signatures and adds session override forwarding test. |
| pkg/workloadmanager/garbage_collection.go | Ensures NetworkPolicy cleanup runs even when Sandbox/SandboxClaim is already NotFound. |
| pkg/workloadmanager/garbage_collection_test.go | Adds fake clientset to cover NP cleanup path without nil panics. |
| pkg/common/types/sandbox.go | Extends CreateSandboxRequest to accept per-session NetworkPolicy override. |
| pkg/apis/runtime/v1alpha1/networkpolicy_types.go | Adds new API types for SandboxNetworkPolicy modes and rules. |
| pkg/apis/runtime/v1alpha1/agent_type.go | Adds NetworkPolicy field to AgentRuntime SandboxTemplate. |
| pkg/apis/runtime/v1alpha1/codeinterpreter_types.go | Adds NetworkPolicy field to CodeInterpreterSandboxTemplate. |
| pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go | Updates generated deep-copies for new NetworkPolicy fields/types. |
| manifests/charts/base/values.yaml | Adds Helm values for router selector/namespace used in Restricted mode. |
| manifests/charts/base/templates/workloadmanager.yaml | Passes router selector/namespace flags into workload-manager deployment. |
| manifests/charts/base/templates/rbac/workloadmanager.yaml | Grants workload-manager RBAC to manage networkpolicies. |
| manifests/charts/base/crds/runtime.agentcube.volcano.sh_agentruntimes.yaml | Updates AgentRuntime CRD schema with networkPolicy fields. |
| manifests/charts/base/crds/runtime.agentcube.volcano.sh_codeinterpreters.yaml | Updates CodeInterpreter CRD schema with networkPolicy fields. |
| cmd/workload-manager/main.go | Adds flags + selector parsing for router selector/namespace configuration. |
| cmd/workload-manager/main_test.go | Unit tests for selector parsing logic. |
```go
// createNetworkPolicy creates a NetworkPolicy using the workloadmanager's own client.
func createNetworkPolicy(ctx context.Context, clientset kubernetes.Interface, np *networkingv1.NetworkPolicy) error {
	_, err := clientset.NetworkingV1().NetworkPolicies(np.Namespace).Create(ctx, np, metav1.CreateOptions{})
	if err != nil && !apierrors.IsAlreadyExists(err) {
		return fmt.Errorf("failed to create network policy %s/%s: %w", np.Namespace, np.Name, err)
	}
	return nil
}
```
```go
@@ -21,6 +21,7 @@ import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
```
Codecov Report: ❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #291      +/-   ##
==========================================
+ Coverage   46.73%   47.39%   +0.65%
==========================================
  Files          30       31       +1
  Lines        2822     3028     +206
==========================================
+ Hits         1319     1435     +116
- Misses       1363     1442      +79
- Partials      140      151      +11
```

Flags with carried forward coverage won't be shown.
…eter path

- Fix session NP override being ignored in the direct sandbox path
- Add CIDR format validation markers to SandboxEgressRule and SandboxIngressRule
- Add an example manifest for NetworkPolicy modes (Restricted, Custom)

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
hzxuzhonghu left a comment:
Good design with three modes that maintain backward compatibility. The GC leak fix is a nice bonus. Please confirm behavior of NetworkPolicy for the warm-pool path before merging.
- Add OwnerReference on NetworkPolicy pointing to Sandbox/SandboxClaim so k8s GC cascades deletion when the owner is removed out-of-band
- Add CEL validation: the custom field must be set when mode is Custom
- Add TODO comment on the warm-pool NP selector gap pending an agent-sandbox fix
- Add GC lifecycle tests verifying the NP is deleted alongside the sandbox
- Extend createSandboxClaim to return the UID for OwnerReference wiring

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
Pull request overview
Adds Kubernetes NetworkPolicy generation/management to improve sandbox network isolation in the Workload Manager, with configurable router selectors and support for template-level defaults plus per-session overrides.
Changes:
- Introduces `SandboxNetworkPolicy` API types and adds `networkPolicy` fields to AgentRuntime/CodeInterpreter templates and CreateSandboxRequest.
- Implements NetworkPolicy building (Restricted/Custom) and lifecycle management (create, rollback, delete, GC cleanup).
- Adds Helm values/flags for router selector + namespace and updates RBAC/CRDs/examples/tests accordingly.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/workloadmanager/workload_builder.go | Threads session/template NetworkPolicy into sandbox build flow. |
| pkg/workloadmanager/server.go | Adds configurable router selector/namespace defaults to server config. |
| pkg/workloadmanager/networkpolicy_builder.go | Implements NetworkPolicy object construction for sandbox pods. |
| pkg/workloadmanager/networkpolicy_builder_test.go | Unit tests for NetworkPolicy builder + override semantics. |
| pkg/workloadmanager/k8s_client.go | Adds NetworkPolicy create/delete helpers; returns SandboxClaim UID for owner refs. |
| pkg/workloadmanager/handlers.go | Creates NetworkPolicy (with owner refs), adds rollback + delete handler cleanup. |
| pkg/workloadmanager/handlers_test.go | Updates handler/server tests for new create signature + session NP override. |
| pkg/workloadmanager/garbage_collection.go | Ensures NP cleanup runs even when Sandbox/SandboxClaim is already NotFound. |
| pkg/workloadmanager/garbage_collection_test.go | Tests GC deletes associated NetworkPolicies. |
| pkg/common/types/sandbox.go | Extends CreateSandboxRequest with per-session NetworkPolicy override. |
| pkg/apis/runtime/v1alpha1/networkpolicy_types.go | New CRD/API types for sandbox network policy configuration. |
| pkg/apis/runtime/v1alpha1/agent_type.go | Adds template-level networkPolicy to AgentRuntime SandboxTemplate. |
| pkg/apis/runtime/v1alpha1/codeinterpreter_types.go | Adds template-level networkPolicy to CodeInterpreterSandboxTemplate. |
| pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go | Regenerates DeepCopy for new API fields/types. |
| manifests/charts/base/values.yaml | Adds chart values for router selector/namespace. |
| manifests/charts/base/templates/workloadmanager.yaml | Wires chart values into workload-manager flags. |
| manifests/charts/base/templates/rbac/workloadmanager.yaml | Grants workloadmanager RBAC to manage NetworkPolicies. |
| manifests/charts/base/crds/runtime.agentcube.volcano.sh_agentruntimes.yaml | Updates CRD schema for new template field. |
| manifests/charts/base/crds/runtime.agentcube.volcano.sh_codeinterpreters.yaml | Updates CRD schema for new template field. |
| example/code-interpreter/code-interpreter-network-policy.yaml | Adds usage examples for Restricted/Custom modes. |
| cmd/workload-manager/main.go | Adds flags for router selector/namespace + selector parsing helper. |
| cmd/workload-manager/main_test.go | Tests selector parsing behavior. |
Extract sandbox/claim creation and NetworkPolicy OwnerReference wiring into a createWorkload helper to bring createSandbox complexity from 16 to under the gocyclo limit of 15.

Also address Copilot review comments:
- Treat AlreadyExists as an error in createNetworkPolicy instead of silently reusing a potentially stale or mismatched policy
- Validate the per-session NetworkPolicy override in CreateSandboxRequest.Validate: reject unknown modes and require the custom field when mode is Custom
- Fix import ordering to be goimports-compliant across handlers.go, k8s_client.go, sandbox.go, and main.go
- Warn at startup when --router-selector parses to an empty map

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
Hey @hzxuzhonghu. Practically: the NP gets created, sits there with no matching pods, and does nothing; the sandbox runs unprotected in the warm-pool case even if Mode=Restricted. I've got a TODO in the code at the build site marking this. Two possible fixes I'm aware of:

My suggestion: merge this PR as-is, since the direct-sandbox path works correctly, and track the warm-pool fix as a follow-up once we confirm what labels agent-sandbox applies. I'd rather not block the whole feature on the warm-pool edge case, but I'm happy to dig into it more if you'd prefer it handled in this PR.
hzxuzhonghu left a comment:
The overall design is solid — per-session override, Restricted-mode DNS egress defaults, cross-namespace router peer with NamespaceSelector, and GC/rollback NP cleanup are all well thought out. A few issues below.
Also note dead code in handleDeleteSandbox: deleteSandboxClaim and deleteSandbox in k8s_client.go already return nil for IsNotFound errors, so the if apierrors.IsNotFound(err) branches at lines ~322 and ~334 of handlers.go are unreachable — err can never be a NotFound error at that point. The log lines there will never fire. The dead branches should be removed to avoid confusing future readers who might assume NotFound could surface there.
```go
// Create NetworkPolicy if one was built from the template spec.
// Uses the workloadmanager's own client (not the user client) since NP is infra.
if networkPolicy != nil {
	if err := createNetworkPolicy(ctx, s.k8sClient.clientset, networkPolicy); err != nil {
```
Security: Mode=Custom in CreateSandboxRequest is a privilege-escalation vector when EnableAuth=true.
The NetworkPolicy is created here using s.k8sClient.clientset — the workloadmanager's own elevated service account (which now has networkpolicies create/delete RBAC per this PR's manifest changes). The NP rules, however, come entirely from the user-supplied sandboxReq.NetworkPolicy field.
This means a user who can call the WM sandbox-create API but does not have networkpolicies/create permission in k8s can bypass that restriction by setting networkPolicy.mode=Custom with arbitrary egress rules. The practical impact: they can weaken the isolation the cluster operator intended (e.g., override a template-level Restricted policy with open egress rules that reach the k8s API server, cloud metadata endpoints, or internal services).
Suggested fixes:
- Disallow `Custom` mode in `CreateSandboxRequest` when `EnableAuth=true`; reserve `Custom` for cluster-admin-controlled CRD templates only, and allow only `None`/`Restricted` per-session overrides.
- SubjectAccessReview gate: before creating the NP, perform a SAR check on behalf of the requesting user to verify they hold `networking.k8s.io/networkpolicies` `create` in the sandbox namespace.
```go
routerPeer := networkingv1.NetworkPolicyPeer{
	PodSelector: &metav1.LabelSelector{MatchLabels: routerSelector},
}
if routerNamespace != "" {
```
When routerNamespace is an empty string, NamespaceSelector is not added to the peer. Kubernetes NetworkPolicy semantics for a peer with only PodSelector (no NamespaceSelector): match pods in the same namespace as the NetworkPolicy.
If the router is in a different namespace and routerNamespace is empty, the router ingress rule silently becomes a same-namespace-only rule — cross-namespace router traffic is blocked, defeating the whole point of this cross-namespace logic.
In production this is guarded by NewServer defaulting RouterNamespace to IdentitySecretNamespace, so the empty case never fires at runtime. But the function itself is silently wrong for empty input, and any test or future caller that passes "" will get incorrect policy without any signal. Consider either panic/returning an error for empty routerNamespace, or at minimum adding a comment: // routerNamespace must be non-empty; an empty value restricts the rule to the sandbox's own namespace.
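The cross-namespace peer the comment describes can be sketched like this, with local stand-ins for the networking/v1 types (illustrative only; `kubernetes.io/metadata.name` is the well-known namespace label, and using it to key the selector is an assumption about this PR's implementation):

```go
package main

import "fmt"

// Local stand-ins for metav1.LabelSelector / networkingv1.NetworkPolicyPeer.
type LabelSelector struct{ MatchLabels map[string]string }
type NetworkPolicyPeer struct {
	PodSelector       *LabelSelector
	NamespaceSelector *LabelSelector
}

// routerPeer builds the ingress peer. Per NetworkPolicy semantics, a peer with
// only a PodSelector matches pods in the policy's own namespace, so the
// NamespaceSelector must always be set for a cross-namespace router.
func routerPeer(routerSelector map[string]string, routerNamespace string) (NetworkPolicyPeer, error) {
	if routerNamespace == "" {
		return NetworkPolicyPeer{}, fmt.Errorf("routerNamespace must not be empty")
	}
	return NetworkPolicyPeer{
		PodSelector: &LabelSelector{MatchLabels: routerSelector},
		NamespaceSelector: &LabelSelector{
			MatchLabels: map[string]string{"kubernetes.io/metadata.name": routerNamespace},
		},
	}, nil
}

func main() {
	p, err := routerPeer(map[string]string{"app": "agentcube-router"}, "agentcube-system")
	fmt.Println(p.NamespaceSelector.MatchLabels, err)
}
```

Returning an error for the empty namespace makes the failure mode explicit instead of silently narrowing the rule to the sandbox's own namespace.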
```go
	np.Spec.Egress = buildRestrictedEgressRules(spec)
case runtimev1alpha1.NetworkPolicyModeCustom:
	if spec.Custom != nil {
		np.Spec.Ingress = spec.Custom.Ingress
```
PolicyTypes: [Ingress, Egress] is set unconditionally (lines 53-56) for both Restricted and Custom modes. In Custom mode, np.Spec.Ingress is set directly from spec.Custom.Ingress. If the caller provides only egress rules and leaves Custom.Ingress nil, the resulting NP has PolicyTypes: [Ingress] with no ingress rules — deny-all ingress, including from the router.
Unlike Restricted mode, Custom mode does not automatically inject the router ingress rule. This is not documented anywhere in the type or the mode description, so an operator writing:

```yaml
networkPolicy:
  mode: Custom
  custom:
    egress:
    - ports: [{port: 443}]
```

...would unknowingly block the router from reaching the sandbox and see requests time out silently.

Please add a note to SandboxNetworkPolicyCustomRules or NetworkPolicyModeCustom warning that callers must explicitly include any required ingress rules (including router ingress) when using Custom mode.
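Beyond the documentation note, one defensive option (an assumption, not what this PR does) is to derive PolicyTypes from which rule sets were actually supplied, since listing a policy type with zero rules means deny-all for that direction:

```go
package main

import "fmt"

// policyTypesFor sketches deriving PolicyTypes from the supplied custom rules
// instead of hardcoding [Ingress, Egress]. Hypothetical helper, not in the PR:
// omitting a type leaves that direction ungoverned by this policy, while
// including it with no rules would deny all traffic in that direction.
func policyTypesFor(ingressSet, egressSet bool) []string {
	var types []string
	if ingressSet {
		types = append(types, "Ingress")
	}
	if egressSet {
		types = append(types, "Egress")
	}
	return types
}

func main() {
	// Caller supplied only egress rules: ingress stays ungoverned rather than
	// silently becoming deny-all, so the router can still reach the sandbox.
	fmt.Println(policyTypesFor(false, true))
}
```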
The overall design is solid — per-session override semantics, rollback logic, GC cleanup, and the cross-namespace router NamespaceSelector are all well-thought-out.
Dead code in handleDeleteSandbox (lines ~322/334 — outside this diff's hunks): The if apierrors.IsNotFound(err) branches after deleteSandboxClaim/deleteSandbox calls are unreachable. Both helpers already swallow IsNotFound and return nil for it. Any error that reaches those if err != nil guards is always a real, non-NotFound error. The branches are harmless today but would cause a silent behavior change if the helpers ever stop suppressing NotFound. Fix: remove the dead IsNotFound branches, or remove NotFound-suppression from the helpers and handle it at each call site.
```go
// Uses the workloadmanager's own client (not the user client) since NP is infra.
if networkPolicy != nil {
	if err := createNetworkPolicy(ctx, s.k8sClient.clientset, networkPolicy); err != nil {
		return nil, api.NewInternalError(err)
```
Security: per-session Custom mode allows RBAC bypass when EnableAuth=true.
This call uses s.k8sClient.clientset (the workloadmanager's own elevated SA, which has networking.k8s.io/networkpolicies create RBAC per the manifest added in this PR). A user who calls this API endpoint with networkPolicy.mode=Custom can therefore create an arbitrary NetworkPolicy in the sandbox namespace, even if their own principal lacks that RBAC permission.
In a multi-tenant cluster where an operator sets Mode=Restricted to lock down sandbox egress (e.g., block the metadata endpoint or internal services), any API caller can override that with Mode=Custom and open egress at will — effectively a privilege escalation.
Fix options:
- Disallow `Custom` in per-session overrides. Only allow it in the operator-controlled template spec; session overrides can only set `Mode=Restricted`/`None` and add `allowedEgress` entries.
- SubjectAccessReview gate. Before using the elevated client, perform an `authorizationv1.SubjectAccessReview` to confirm the requesting user has `networking.k8s.io/networkpolicies` create permission in the target namespace.
```go
case runtimev1alpha1.NetworkPolicyModeCustom:
	if spec.Custom != nil {
		np.Spec.Ingress = spec.Custom.Ingress
		np.Spec.Egress = spec.Custom.Egress
```
PolicyTypes: [Ingress, Egress] is always set; nil Custom.Ingress = deny-all ingress, including from the router.
PolicyTypes is set unconditionally to [Ingress, Egress] for all non-None modes. In Custom mode, if spec.Custom.Ingress is nil (the user only wants to configure egress), np.Spec.Ingress is nil here. Per k8s NetworkPolicy semantics, PolicyType=Ingress with no ingress rules means deny all ingress — including from the router. The sandbox becomes completely unreachable.
This affects a user who passes {mode: Custom, custom: {egress: [...]}} expecting to add egress allow rules while keeping normal inbound traffic.
The kubebuilder validation only requires custom != null, so {mode: Custom, custom: {}} is accepted and silently creates a deny-all policy.
Fix: add a warning to the SandboxNetworkPolicyCustomRules godoc:

```go
// When mode is Custom, ALL ingress is denied unless rules are explicitly
// specified here. The router ingress rule injected by Restricted mode is NOT
// automatic — include it in Ingress if sandbox reachability is required.
```
```go
	PodSelector: &metav1.LabelSelector{MatchLabels: routerSelector},
}
if routerNamespace != "" {
	routerPeer.NamespaceSelector = &metav1.LabelSelector{
```
Silent wrong semantics when routerNamespace is empty.
When routerNamespace == "", NamespaceSelector is skipped. Per k8s NetworkPolicy semantics, a NetworkPolicyPeer with only PodSelector and no NamespaceSelector selects pods in the same namespace as the NetworkPolicy — not all namespaces.
So if routerNamespace is empty, the rule only allows router pods co-located in the sandbox namespace. If the router is in a different namespace (the common multi-tenant layout the comment above describes), it silently cannot reach the sandbox — exactly the case the NamespaceSelector was added to prevent.
server.go/NewServer ensures RouterNamespace always defaults to at least IdentitySecretNamespace, so production is safe today. But the function itself has incorrect semantics for the empty case, making it a hazard for tests and any future direct callers.
Fix: guard at the top of buildRestrictedIngressRules:

```go
if routerNamespace == "" {
	// A missing namespace selector restricts the rule to the sandbox namespace only,
	// defeating cross-namespace router ingress. Callers must supply a non-empty value.
	klog.Fatalf("routerNamespace must not be empty")
}
```

...or return an error up the call chain.
```go
// "Custom": use the raw Ingress/Egress rules in the Custom field.
// +kubebuilder:default=None
// +optional
Mode NetworkPolicyMode `json:"mode,omitempty"`
```
Why do you need this field? I think None can be represented with an unset policy, Custom can be represented by AllowedIngress/AllowedEgress, and Restricted, like None, should not need an explicit mode.
```go
// AllowedEgress lists additional egress allow rules applied when Mode=Restricted.
// +optional
AllowedEgress []SandboxEgressRule `json:"allowedEgress,omitempty"`
```
Suggested change:

```diff
-	AllowedEgress []SandboxEgressRule `json:"allowedEgress,omitempty"`
+	Egress []SandboxEgressRule `json:"allowedEgress,omitempty"`
```
```go
// AllowDNS controls whether a DNS egress rule (port 53 UDP/TCP) is added when
// Mode=Restricted. Defaults to true.
// +optional
AllowDNS *bool `json:"allowDNS,omitempty"`
```
```go
// AllowedIngress lists additional ingress allow rules when Mode=Restricted.
// Router ingress is always permitted in Restricted mode regardless of this field.
// +optional
AllowedIngress []SandboxIngressRule `json:"allowedIngress,omitempty"`
```
Suggested change:

```diff
-	AllowedIngress []SandboxIngressRule `json:"allowedIngress,omitempty"`
+	Ingress []SandboxIngressRule `json:"ingress,omitempty"`
```
```go
CIDRs []string `json:"cidrs,omitempty"`
// Ports restricts the rule to these destination ports. An empty list matches all ports.
// +optional
Ports []SandboxNetworkPolicyPort `json:"ports,omitempty"`
```
I think the comment is wrong; it was copied from SandboxEgressRule.
```go
}

// SandboxIngressRule describes an ingress allow rule for a sandbox pod.
type SandboxIngressRule struct {
```
This struct is the same as SandboxEgressRule; please check the k8s NetworkPolicySpec API.
```go
		return nil, err
	}
} else {
	if _, err := createSandbox(ctx, dynamicClient, sandbox); err != nil {
```
This was extracted from createSandbox to get its cyclomatic complexity under the gocyclo lint limit of 15; it was hitting 16 before. The logic is exactly the same, just pulled out into its own helper. The early return is just how I naturally wrote it in Go; happy to match whatever style you prefer here.
- Remove Mode enum (None/Restricted/Custom); nil policy means no enforcement, non-nil enforces deny-all with router ingress and DNS
- Remove AllowDNS toggle; DNS egress is always included when a policy is set
- Rename AllowedEgress/AllowedIngress to Egress/Ingress; update JSON tags to egress/ingress
- Merge SandboxEgressRule and SandboxIngressRule into a single SandboxNetworkPolicyRule type
- Block Custom mode in per-session overrides to prevent RBAC privilege escalation via the workloadmanager elevated SA
- Add klog.Fatalf guard when routerNamespace is empty to prevent silent same-namespace-only router ingress semantics
- Remove dead IsNotFound branches in handleDeleteSandbox; helpers already swallow NotFound and return nil
- Fix incorrect Ports comment copied from SandboxEgressRule into SandboxIngressRule

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
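The always-on DNS egress described in the commit above can be sketched with local stand-ins for the networking/v1 types (illustrative only; the real builder uses k8s.io/api/networking/v1 with intstr-typed ports):

```go
package main

import "fmt"

// Local stand-ins for networkingv1 port/rule types (illustrative only).
type NetworkPolicyPort struct {
	Protocol string
	Port     int
}
type NetworkPolicyEgressRule struct{ Ports []NetworkPolicyPort }

// dnsEgressRule sketches the unconditional DNS allowance the commit describes:
// port 53 over both UDP and TCP, with no peer restriction so any resolver
// (node-local cache or cluster DNS service) stays reachable.
func dnsEgressRule() NetworkPolicyEgressRule {
	return NetworkPolicyEgressRule{Ports: []NetworkPolicyPort{
		{Protocol: "UDP", Port: 53},
		{Protocol: "TCP", Port: 53},
	}}
}

func main() {
	fmt.Println(dnsEgressRule())
}
```

Allowing both protocols matters because DNS falls back to TCP for truncated responses; a UDP-only rule would intermittently break large lookups.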
Pull request overview
This PR adds Kubernetes NetworkPolicy provisioning to improve sandbox network isolation in AgentCube’s workload-manager, including router-ingress allowlisting, DNS egress allowlisting, and cleanup on rollback/delete/GC, plus configuration knobs for selecting the router pods.
Changes:
- Build and (optionally) create a per-sandbox NetworkPolicy alongside Sandbox/SandboxClaim creation, with rollback/delete/GC cleanup paths.
- Add router selector/namespace configuration via workload-manager flags and Helm values, and wire these into the NetworkPolicy builder.
- Introduce API/CRD surface changes for template-level and per-session NetworkPolicy configuration, plus unit tests around policy construction and cleanup.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/workloadmanager/workload_builder.go | Returns a built NetworkPolicy alongside sandbox objects and applies session override semantics. |
| pkg/workloadmanager/server.go | Adds router selector/namespace config defaults for policy building. |
| pkg/workloadmanager/networkpolicy_builder.go | Implements NetworkPolicy object construction (router ingress + DNS egress + extra rules). |
| pkg/workloadmanager/networkpolicy_builder_test.go | Adds unit tests for NetworkPolicy construction and override semantics. |
| pkg/workloadmanager/k8s_client.go | Adds NetworkPolicy create/delete helpers; returns created UIDs for OwnerReferences; makes clientset interface-typed for testability. |
| pkg/workloadmanager/handlers.go | Wires per-session policy override into builders; creates NetworkPolicy; adds OwnerReferences; improves rollback to include NP deletion; delete handler best-effort NP cleanup. |
| pkg/workloadmanager/handlers_test.go | Updates mocks/signatures and adds test for session-level policy override forwarding. |
| pkg/workloadmanager/garbage_collection.go | Ensures NP cleanup runs even when sandbox/claim is already NotFound. |
| pkg/workloadmanager/garbage_collection_test.go | Adds fake clientset and test verifying GC deletes associated NetworkPolicy. |
| pkg/common/types/sandbox.go | Adds NetworkPolicy field to CreateSandboxRequest and a placeholder validator. |
| pkg/apis/runtime/v1alpha1/networkpolicy_types.go | Adds SandboxNetworkPolicy API type (ingress/egress rules). |
| pkg/apis/runtime/v1alpha1/agent_type.go | Adds NetworkPolicy to SandboxTemplate. |
| pkg/apis/runtime/v1alpha1/codeinterpreter_types.go | Adds NetworkPolicy to CodeInterpreterSandboxTemplate. |
| pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go | Updates generated deep-copies for new network policy types/fields. |
| manifests/charts/base/values.yaml | Adds Helm values for router selector/namespace. |
| manifests/charts/base/templates/workloadmanager.yaml | Passes router selector/namespace flags to workload-manager. |
| manifests/charts/base/templates/rbac/workloadmanager.yaml | Adds RBAC for NetworkPolicy resources. |
| manifests/charts/base/crds/runtime.agentcube.volcano.sh_agentruntimes.yaml | Updates CRD schema to include networkPolicy fields. |
| manifests/charts/base/crds/runtime.agentcube.volcano.sh_codeinterpreters.yaml | Updates CRD schema to include networkPolicy fields. |
| example/code-interpreter/code-interpreter-network-policy.yaml | Adds an example CodeInterpreter manifest using template networkPolicy. |
| cmd/workload-manager/main.go | Adds CLI flags and selector parsing for router selector/namespace. |
| cmd/workload-manager/main_test.go | Adds tests for selector parsing. |
```go
// SandboxNetworkPolicy defines the network isolation policy applied to each sandbox pod.
// When set (non-nil), a deny-all NetworkPolicy is created with explicit allow rules for
// router ingress and DNS egress. Additional rules can be added via Ingress and
// Egress. Leave this field unset (nil) to disable network policy enforcement.
type SandboxNetworkPolicy struct {
	// Egress lists additional egress allow rules beyond the default DNS rule.
	// +optional
	Egress []SandboxNetworkPolicyRule `json:"egress,omitempty"`

	// Ingress lists additional ingress allow rules.
	// Router ingress is always permitted regardless of this field.
	// +optional
	Ingress []SandboxNetworkPolicyRule `json:"ingress,omitempty"`
}
```
```go
Spec corev1.PodSpec `json:"spec" protobuf:"bytes,3,opt,name=spec"`

// NetworkPolicy defines the network isolation policy for sandbox pods created from
// this template. Defaults to Mode=None (no policy created).
```
```yaml
networkPolicy:
  # routerSelector is the pod label selector used to identify the router pod.
  # Applied as the mandatory ingress allow rule when SandboxNetworkPolicy.Mode=Restricted.
  # Format: comma-separated key=value pairs, e.g. "app=agentcube-router,tier=proxy"
  routerSelector: "app=agentcube-router"
  # routerNamespace is the namespace the router runs in.
  # Used to build the cross-namespace NamespaceSelector for the router→sandbox ingress rule.
  # Defaults to the workloadmanager's own namespace when left empty.
  routerNamespace: ""
```
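The comma-separated key=value format described in the comments above can be parsed with a helper along these lines (a sketch; the actual selector-parsing helper in cmd/workload-manager/main.go may differ in name and error handling):

```go
package main

import (
	"fmt"
	"strings"
)

// parseSelector turns a "k1=v1,k2=v2" flag value into a label map.
// An empty input yields an empty map, which is the case the startup
// warning for --router-selector is meant to catch.
func parseSelector(s string) (map[string]string, error) {
	sel := map[string]string{}
	if strings.TrimSpace(s) == "" {
		return sel, nil
	}
	for _, pair := range strings.Split(s, ",") {
		k, v, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if !ok || k == "" {
			return nil, fmt.Errorf("invalid selector segment %q: want key=value", pair)
		}
		sel[k] = v
	}
	return sel, nil
}

func main() {
	sel, err := parseSelector("app=agentcube-router,tier=proxy")
	fmt.Println(sel, err)
}
```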
```yaml
networkPolicy:
  description: |-
    NetworkPolicy defines the network isolation policy for sandbox pods created from
    this template. Defaults to Mode=None (no policy created).
  properties:
    allowDNS:
      description: |-
        AllowDNS controls whether a DNS egress rule (port 53 UDP/TCP) is added when
        Mode=Restricted. Defaults to true.
      type: boolean
    allowedEgress:
      description: AllowedEgress lists additional egress allow rules
        applied when Mode=Restricted.
      items:
        description: SandboxEgressRule describes an egress allow
          rule for a sandbox pod.
        properties:
          cidrs:
            description: |-
              CIDRs is the list of destination CIDR blocks.
```
```yaml
networkPolicy:
  description: |-
    NetworkPolicy defines the network isolation policy for sandbox pods created from
    this template. Defaults to Mode=None (no policy created).
  properties:
    allowDNS:
      description: |-
        AllowDNS controls whether a DNS egress rule (port 53 UDP/TCP) is added when
        Mode=Restricted. Defaults to true.
      type: boolean
    allowedEgress:
      description: AllowedEgress lists additional egress allow rules
        applied when Mode=Restricted.
```
```go
Resources corev1.ResourceRequirements `json:"resources,omitempty"`

// NetworkPolicy defines the network isolation policy for sandbox pods created from
// this template. Defaults to Mode=None (no policy created).
```
```go
// validateNetworkPolicyOverride validates a per-session NetworkPolicy override from
// CreateSandboxRequest. Unlike the CRD field (which has kubebuilder markers), this
// comes in as raw JSON and bypasses API-server admission, so we validate it here.
func validateNetworkPolicyOverride(_ *runtimev1alpha1.SandboxNetworkPolicy) error {
	return nil
```
```go
if routerNamespace == "" {
	// Without a NamespaceSelector the peer only matches pods in the same namespace
	// as the NetworkPolicy, defeating cross-namespace router ingress. This is a
	// configuration error that must be caught loudly — silent wrong semantics here
	// would make the sandbox unreachable in any multi-namespace layout.
	klog.Fatalf("routerNamespace must not be empty: a missing NamespaceSelector restricts router ingress to the sandbox namespace only")
}
```
| egress: | ||
| - cidrs: | ||
| - "10.0.0.0/8" | ||
| ports: | ||
| - protocol: TCP | ||
| port: 443 |
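Pieced together from the CRD schema and the egress fragment above, a complete template-level policy could look roughly like this. The `mode`/`allowDNS` keys and the `spec.template` nesting are assumptions inferred from the Go field names and kubebuilder defaults, not copied from the example file:

```yaml
apiVersion: runtime.agentcube.volcano.sh/v1alpha1
kind: CodeInterpreter
metadata:
  name: ci-restricted
spec:
  template:
    networkPolicy:
      mode: Restricted        # None | Restricted | Custom
      allowDNS: true          # keep port 53 UDP/TCP egress open
      allowedEgress:          # extra allow rules, Restricted mode only
      - cidrs:
        - "10.0.0.0/8"
        ports:
        - protocol: TCP
          port: 443
```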
Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
|
Hi @hzxuzhonghu, thanks for the thorough review! I've addressed all the feedback.
|
|
||
| // Create NetworkPolicy if one was built from the template spec. | ||
| // Uses the workloadmanager's own client (not the user client) since NP is infra. | ||
| if networkPolicy != nil { |
Can we create the NetworkPolicy before creating the Sandbox/SandboxClaim, or otherwise block pod startup until the policy exists? Creating the workload first leaves a window where untrusted sandbox code can run before the deny-all policy is applied.
Done. The NP is now created before the Sandbox/SandboxClaim so the deny-all is in place before the CNI wires up the pod's network interface. The OwnerRef is patched onto the NP after the sandbox UID is known.
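The ordering settled on in this thread can be sketched with stub types. All names here are illustrative stand-ins, not the actual workloadmanager API: policy first, sandbox second, ownerRef patch last, with a best-effort NP delete on rollback.

```go
package main

import "fmt"

// fakeCluster records operations so the create ordering is observable.
type fakeCluster struct{ ops []string }

func (c *fakeCluster) createNetworkPolicy(name string) error {
	c.ops = append(c.ops, "create-np:"+name)
	return nil
}

func (c *fakeCluster) createSandbox(name string) (string, error) {
	c.ops = append(c.ops, "create-sandbox:"+name)
	return "uid-123", nil
}

func (c *fakeCluster) patchOwnerRef(npName, ownerUID string) {
	c.ops = append(c.ops, "patch-ownerref:"+npName+"->"+ownerUID)
}

func (c *fakeCluster) deleteNetworkPolicy(name string) {
	c.ops = append(c.ops, "delete-np:"+name)
}

// createWithPolicy creates the deny-all policy before the workload, so no
// pod can start in the window before the policy exists.
func createWithPolicy(c *fakeCluster, name string) error {
	if err := c.createNetworkPolicy(name); err != nil {
		return err
	}
	uid, err := c.createSandbox(name)
	if err != nil {
		// Rollback: best-effort NP delete so a failed create leaks nothing.
		c.deleteNetworkPolicy(name)
		return err
	}
	c.patchOwnerRef(name, uid)
	return nil
}

func main() {
	c := &fakeCluster{}
	_ = createWithPolicy(c, "sbx-1")
	fmt.Println(c.ops)
}
```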
| } | ||
| sandboxEntry.Kind = types.SandboxClaimsKind | ||
| return simpleSandbox, sandboxClaim, sandboxEntry, nil | ||
| np := buildNetworkPolicy(sandboxName, namespace, effectiveNetworkPolicy(sessionNetworkPolicy, codeInterpreterObj.Spec.Template.NetworkPolicy), routerSelector, routerNamespace) |
Can we avoid creating a NetworkPolicy for the warm-pool path until it actually selects the bound pod? As written, this creates a policy object but the TODO says the pod does not carry SandboxNameLabelKey, so users can enable isolation and still get no enforcement.
I've addressed this: warm-pool pods don't carry SandboxNameLabelKey, so the NP's PodSelector would never match them; creating the object only gave a false sense of isolation. I've removed NP creation from the warm-pool path entirely and added a warning log when a policy is configured but can't be enforced. Proper enforcement will need agent-sandbox to propagate claim labels onto the bound pod; left a note in the code for that.
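The `effectiveNetworkPolicy` helper called in the excerpt above implements replace, not merge, semantics for the per-session override. A minimal stdlib sketch (the struct here is a stand-in for the real CRD type, with only a couple of fields for illustration):

```go
package main

import "fmt"

// SandboxNetworkPolicy is a stand-in for the CRD type; only a subset of
// fields is modeled here.
type SandboxNetworkPolicy struct {
	Mode     string
	AllowDNS bool
}

// effectiveNetworkPolicy returns the session override wholesale when set,
// otherwise the template's policy. No field-level merging happens.
func effectiveNetworkPolicy(session, template *SandboxNetworkPolicy) *SandboxNetworkPolicy {
	if session != nil {
		return session
	}
	return template
}

func main() {
	tmpl := &SandboxNetworkPolicy{Mode: "Restricted", AllowDNS: true}
	sess := &SandboxNetworkPolicy{Mode: "None"}
	fmt.Println(effectiveNetworkPolicy(sess, tmpl).Mode) // session wins
	fmt.Println(effectiveNetworkPolicy(nil, tmpl).Mode)  // falls back to template
}
```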
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds Kubernetes NetworkPolicy generation and lifecycle management to improve sandbox network isolation, with configurable router selection and per-session overrides.
Changes:
- Build and (optionally) create a deny-all NetworkPolicy per sandbox with router-ingress + DNS-egress allow rules and user-specified extra rules.
- Add cleanup paths for NetworkPolicies on delete, rollback, and garbage collection, plus ownerReference patching.
- Expose router selector/namespace configuration via Helm values and CLI flags; add unit tests.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| pkg/workloadmanager/workload_builder.go | Threads template/session NetworkPolicy config into sandbox build and returns a built NP object. |
| pkg/workloadmanager/server.go | Adds router selector/namespace config defaults used for NP rule construction. |
| pkg/workloadmanager/networkpolicy_builder.go | Implements NetworkPolicy object construction (router ingress + DNS egress + extra rules). |
| pkg/workloadmanager/networkpolicy_builder_test.go | Adds unit tests for NP building and override semantics. |
| pkg/workloadmanager/k8s_client.go | Adds NP create/delete/patch helpers and returns UIDs for ownerRefs. |
| pkg/workloadmanager/handlers.go | Creates NP before workload, patches ownerRef after UID known, and cleans up on delete/rollback. |
| pkg/workloadmanager/handlers_test.go | Updates tests for new builder/createSandbox signatures; adds session override forwarding test. |
| pkg/workloadmanager/garbage_collection.go | Ensures GC deletes NP even when sandbox/claim is already gone. |
| pkg/workloadmanager/garbage_collection_test.go | Adds GC test verifying NP cleanup behavior. |
| pkg/common/types/sandbox.go | Adds per-session NetworkPolicy override field and validation hook. |
| pkg/apis/runtime/v1alpha1/networkpolicy_types.go | Introduces CRD types for SandboxNetworkPolicy rules. |
| pkg/apis/runtime/v1alpha1/agent_type.go | Adds NetworkPolicy field to AgentRuntime sandbox template CRD type. |
| pkg/apis/runtime/v1alpha1/codeinterpreter_types.go | Adds NetworkPolicy field to CodeInterpreter template CRD type. |
| pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go | Adds deep-copy support for new NetworkPolicy fields/types. |
| manifests/charts/base/values.yaml | Adds Helm values for router selector/namespace used for NP router ingress rule. |
| manifests/charts/base/templates/workloadmanager.yaml | Wires router selector/namespace Helm values into workload-manager CLI args. |
| manifests/charts/base/templates/rbac/workloadmanager.yaml | Grants workload-manager permissions for NetworkPolicies (create/get/list/watch/delete). |
| manifests/charts/base/crds/runtime.agentcube.volcano.sh_codeinterpreters.yaml | Extends CodeInterpreter CRD schema with networkPolicy fields. |
| manifests/charts/base/crds/runtime.agentcube.volcano.sh_agentruntimes.yaml | Extends AgentRuntime CRD schema with networkPolicy fields. |
| example/code-interpreter/code-interpreter-network-policy.yaml | Adds example manifests demonstrating template networkPolicy usage. |
| cmd/workload-manager/main.go | Adds CLI flags for router selector/namespace and parsing helper. |
| cmd/workload-manager/main_test.go | Adds unit test coverage for selector parsing. |
| verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] | ||
| - apiGroups: ["networking.k8s.io"] | ||
| resources: ["networkpolicies"] | ||
| verbs: ["get", "list", "watch", "create", "delete"] |
| if routerNamespace == "" { | ||
| // Without a NamespaceSelector the peer only matches pods in the same namespace | ||
| // as the NetworkPolicy, defeating cross-namespace router ingress. This is a | ||
| // configuration error that must be caught loudly — silent wrong semantics here | ||
| // would make the sandbox unreachable in any multi-namespace layout. | ||
| klog.Fatalf("routerNamespace must not be empty: a missing NamespaceSelector restricts router ingress to the sandbox namespace only") | ||
| } | ||
| // Router ingress is always allowed. NamespaceSelector is set so the rule works | ||
| // when the router runs in a namespace distinct from the sandbox (the common | ||
| // production layout: router in agentcube-system, sandboxes in user namespaces). | ||
| routerPeer := networkingv1.NetworkPolicyPeer{ | ||
| PodSelector: &metav1.LabelSelector{MatchLabels: routerSelector}, | ||
| NamespaceSelector: &metav1.LabelSelector{ | ||
| MatchLabels: map[string]string{k8sNamespaceLabelKey: routerNamespace}, | ||
| }, | ||
| } | ||
| rules := []networkingv1.NetworkPolicyIngressRule{ | ||
| {From: []networkingv1.NetworkPolicyPeer{routerPeer}}, | ||
| } |
| // NetworkPolicy defines the network isolation policy for sandbox pods created from | ||
| // this template. Defaults to Mode=None (no policy created). | ||
| // +optional | ||
| NetworkPolicy *SandboxNetworkPolicy `json:"networkPolicy,omitempty"` |
| // validateNetworkPolicyOverride validates a per-session NetworkPolicy override from | ||
| // CreateSandboxRequest. Unlike the CRD field (which has kubebuilder markers), this | ||
| // comes in as raw JSON and bypasses API-server admission, so we validate it here. | ||
| func validateNetworkPolicyOverride(_ *runtimev1alpha1.SandboxNetworkPolicy) error { |
| return nil | ||
| } |
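The function above is currently a stub despite its comment. A hedged sketch of checks it could perform, using stand-in types whose field names mirror the CRD schema shown earlier (`Mode`, `AllowedEgress`, `CIDRs`); the exact shapes are assumptions:

```go
package main

import (
	"fmt"
	"net"
)

// Stand-in types mirroring the CRD schema fields seen above.
type egressRule struct{ CIDRs []string }

type sandboxNetworkPolicy struct {
	Mode          string
	AllowedEgress []egressRule
}

// validateOverride checks what the API server's admission would otherwise
// enforce via kubebuilder markers: a known mode and well-formed CIDRs.
func validateOverride(p *sandboxNetworkPolicy) error {
	if p == nil {
		return nil // no override supplied
	}
	switch p.Mode {
	case "None", "Restricted", "Custom":
	default:
		return fmt.Errorf("invalid mode %q", p.Mode)
	}
	for _, r := range p.AllowedEgress {
		for _, c := range r.CIDRs {
			if _, _, err := net.ParseCIDR(c); err != nil {
				return fmt.Errorf("invalid CIDR %q: %w", c, err)
			}
		}
	}
	return nil
}

func main() {
	ok := &sandboxNetworkPolicy{
		Mode:          "Restricted",
		AllowedEgress: []egressRule{{CIDRs: []string{"10.0.0.0/8"}}},
	}
	fmt.Println(validateOverride(ok))
}
```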
| // DefaultRouterSelector is the pod label selector used to identify the router pod | ||
| // when building the mandatory router→sandbox ingress rule in Restricted network policy mode. | ||
| var DefaultRouterSelector = map[string]string{"app": "agentcube-router"} |
| if len(config.RouterSelector) == 0 { | ||
| config.RouterSelector = DefaultRouterSelector | ||
| } |
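For illustration, a `--router-selector` flag value like `app=agentcube-router,tier=proxy` can be split into a label map along these lines. The real helper lives in cmd/workload-manager/main.go and its name and signature may differ:

```go
package main

import (
	"fmt"
	"strings"
)

// parseSelector turns "k1=v1,k2=v2" into a label map, rejecting terms
// without an '=' or with an empty key.
func parseSelector(s string) (map[string]string, error) {
	out := map[string]string{}
	if strings.TrimSpace(s) == "" {
		return out, nil // empty flag means "use the default selector"
	}
	for _, pair := range strings.Split(s, ",") {
		k, v, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if !ok || k == "" {
			return nil, fmt.Errorf("invalid selector term %q", pair)
		}
		out[k] = v
	}
	return out, nil
}

func main() {
	m, err := parseSelector("app=agentcube-router")
	fmt.Println(m["app"], err)
}
```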
What type of PR is this?
/kind feature
What this PR does / why we need it:
Sandboxes had zero network restrictions. A pod inside a sandbox could reach anything, internal or external. Since sandboxes run untrusted code this is a real isolation gap, so I picked up issue #216 to add NetworkPolicy support.
I added a `SandboxNetworkPolicy` field to both `SandboxTemplate` (AgentRuntime) and `CodeInterpreterSandboxTemplate`. It has three modes you pick from:

- `None` - default, no policy created, nothing breaks for existing users
- `Restricted` - deny-all baseline. DNS egress on port 53 (UDP+TCP) is always allowed so pods don't lose name resolution. Router ingress is always allowed too, since the router has to be able to reach the sandbox
- `Custom` - you pass raw k8s NetworkPolicy ingress/egress rules directly

A few specific things I was careful about:
Cross-namespace router ingress - using just `PodSelector` would only match pods in the same namespace as the NP. In a real deployment the router usually lives in a different namespace than the sandbox pods, so that would silently break the router rule. I fixed this by adding a `NamespaceSelector` using `kubernetes.io/metadata.name` so it works across namespaces.

Router selector is configurable - you can set it in `values.yaml` under `workloadmanager.networkPolicy.routerSelector` and `routerNamespace`, or pass `--router-selector` and `--router-namespace` flags. Defaults to `app=agentcube-router` and the workloadmanager's own namespace.

Per-session override - added a `NetworkPolicy` field to `CreateSandboxRequest`. When set it fully replaces the template-level policy for that session only. No merging, just a clean replace.

Cleanup - NPs are deleted when you delete the sandbox through the API, on rollback if creation fails partway through, and in the GC loop. I also fixed a bug where the GC was returning early on `IsNotFound` before deleting the NP. Since the store record gets purged right after, that would've leaked the NP permanently. Now NP cleanup runs regardless of whether the sandbox was already gone.

Rollback - simplified it to always attempt NP delete when `networkPolicy != nil` rather than tracking an `npCreated` flag. This also handles the edge case where the k8s write succeeded but returned an error to us.

RBAC - added `networking.k8s.io/networkpolicies` create/get/list/watch/delete to the workloadmanager ClusterRole. Without this the API server returns 403 on NP create.

Which issue(s) this PR fixes:
Fixes #216
Special notes for your reviewer:
Known gap: for the warm-pool path (SandboxClaim), the NP gets created but its `PodSelector` uses `SandboxNameLabelKey`, which agent-sandbox doesn't propagate onto warm-pool pods. So the NP exists but won't enforce anything for warm-pool sessions right now. I want to confirm the label propagation behavior on the agent-sandbox side before fixing this; happy to do it as a follow-up.

No OwnerReferences on the NP for now. Cleanup is handled explicitly through the delete handler and GC. Can add owner refs in a follow-up if you'd prefer that approach.
Does this PR introduce a user-facing change?: