Skip to content

feat: add NetworkPolicy support for sandbox isolation#291

Open
Sanchit2662 wants to merge 7 commits intovolcano-sh:mainfrom
Sanchit2662:feat/sandbox-network-policy
Open

feat: add NetworkPolicy support for sandbox isolation#291
Sanchit2662 wants to merge 7 commits intovolcano-sh:mainfrom
Sanchit2662:feat/sandbox-network-policy

Conversation

@Sanchit2662
Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Sandboxes had zero network restrictions. A pod inside a sandbox could reach anything, internal or external. Since sandboxes run untrusted code this is a real isolation gap, so I picked up issue #216 to add NetworkPolicy support.

I added a SandboxNetworkPolicy field to both SandboxTemplate (AgentRuntime) and CodeInterpreterSandboxTemplate. It has three modes you pick from:

  • None - default, no policy created, nothing breaks for existing users - Restricted - deny-all baseline. DNS egress on port 53 (UDP+TCP) is always allowed so pods don't lose name resolution. Router ingress is always allowed too
    since the router has to be able to reach the sandbox - Custom - you pass raw k8s NetworkPolicy ingress/egress rules directly

A few specific things I was careful about:

Cross-namespace router ingress - using just PodSelector would only match pods in the same namespace as the NP. In a real deployment the router usually lives in a different namespace than the sandbox pods, so that would silently break the router rule. I fixed this by adding a NamespaceSelector using kubernetes.io/metadata.name so it works across namespaces.

Router selector is configurable - you can set it in values.yaml under workloadmanager.networkPolicy.routerSelector and routerNamespace, or pass --router-selector and --router-namespace flags. Defaults to app=agentcube-router and the workloadmanager's own namespace.

Per-session override - added NetworkPolicy field to CreateSandboxRequest. When set it fully replaces the template-level policy for that session only. No merging, just a clean replace.

Cleanup - NPs are deleted when you delete the sandbox through the API, on rollback if creation fails partway through, and in the GC loop. I also fixed a bug where the GC was returning early on IsNotFound before deleting the NP. Since the store record gets purged right after, that would've leaked the NP permanently. Now NP cleanup runs regardless of whether the sandbox was already gone.

Rollback - simplified it to always attempt NP delete when networkPolicy != nil rather than tracking a npCreated flag. This also handles the edge case where the k8s write succeeded but returned an error to us.

RBAC - added networking.k8s.io/networkpolicies create/get/list/watch/delete to the workloadmanager ClusterRole. Without this the API server returns 403 on NP create.

Which issue(s) this PR fixes:

Fixes #216

Special notes for your reviewer:

Known gap: for the warm-pool path (SandboxClaim), the NP gets created but its PodSelector uses SandboxNameLabelKey which agent-sandbox doesn't propagate onto warm pool pods. So the NP exists but won't enforce anything for warm-pool sessions right now. I want to confirm the label propagation behavior on the agent-sandbox side before fixing this, happy to do it as a follow-up.

No OwnerReferences on the NP for now. Cleanup is handled explicitly through the delete handler and GC. Can add owner refs in a follow-up if you'd prefer that approach.

Does this PR introduce a user-facing change?:

Added NetworkPolicy support for sandbox isolation. Set `networkPolicy.mode` on AgentRuntime or CodeInterpreter templates to `Restricted` (deny-all with DNS and router ingress allowed) or `Custom` (raw k8s rules). Defaults to `None` so existing setups are unaffected. Router selector and namespace are configurable via chart   values.
                                     

@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hzxuzhonghu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements network isolation for sandboxes using Kubernetes NetworkPolicies, supporting 'None', 'Restricted', and 'Custom' modes via the AgentRuntime and CodeInterpreter CRDs. The workload-manager is updated to manage the lifecycle of these policies, including automated creation and cleanup. A critical issue was identified in pkg/workloadmanager/workload_builder.go where the per-session NetworkPolicy override is ignored during the creation of CodeInterpreter sandboxes when warm pools are not utilized.

Comment thread pkg/workloadmanager/workload_builder.go Outdated
- Add SandboxNetworkPolicy API type (None/Restricted/Custom modes)
- Build per-sandbox NetworkPolicy with router ingress always allowed
- Support cross-namespace router via NamespaceSelector
- Per-session NP override in CreateSandboxRequest (replace semantics)
- Configurable router selector/namespace via chart values and CLI flags
- Clean up NP on sandbox delete, rollback, and GC
- Add networkpolicies RBAC to workloadmanager ClusterRole

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds optional Kubernetes NetworkPolicy generation/cleanup to improve sandbox network isolation in AgentCube’s Workload Manager, closing the “no network restrictions” gap for untrusted sandbox code execution.

Changes:

  • Introduces SandboxNetworkPolicy in runtime API templates (AgentRuntime + CodeInterpreter) and a per-session override via CreateSandboxRequest.
  • Implements NetworkPolicy building (None/Restricted/Custom) plus lifecycle management (create on sandbox create; best-effort delete on rollback, delete API, and GC).
  • Adds router selector/namespace configuration (flags + Helm values) to support cross-namespace router→sandbox ingress allow rules.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pkg/workloadmanager/workload_builder.go Wires template/session policy into sandbox build flow and returns a NetworkPolicy object for creation.
pkg/workloadmanager/server.go Adds config defaults for router selector/namespace used by Restricted mode ingress.
pkg/workloadmanager/networkpolicy_builder.go Builds NetworkPolicy objects for None/Restricted/Custom modes (router ingress + optional DNS egress).
pkg/workloadmanager/networkpolicy_builder_test.go Unit tests for NetworkPolicy building and “replace, not merge” selection behavior.
pkg/workloadmanager/k8s_client.go Adds helpers to create/delete NetworkPolicies and generalizes clientset type.
pkg/workloadmanager/handlers.go Creates NetworkPolicy during sandbox creation; deletes on rollback and explicit delete handler.
pkg/workloadmanager/handlers_test.go Updates handler tests for new builder signatures and adds session override forwarding test.
pkg/workloadmanager/garbage_collection.go Ensures NetworkPolicy cleanup runs even when Sandbox/SandboxClaim is already NotFound.
pkg/workloadmanager/garbage_collection_test.go Adds fake clientset to cover NP cleanup path without nil panics.
pkg/common/types/sandbox.go Extends CreateSandboxRequest to accept per-session NetworkPolicy override.
pkg/apis/runtime/v1alpha1/networkpolicy_types.go Adds new API types for SandboxNetworkPolicy modes and rules.
pkg/apis/runtime/v1alpha1/agent_type.go Adds NetworkPolicy field to AgentRuntime SandboxTemplate.
pkg/apis/runtime/v1alpha1/codeinterpreter_types.go Adds NetworkPolicy field to CodeInterpreterSandboxTemplate.
pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go Updates generated deep-copies for new NetworkPolicy fields/types.
manifests/charts/base/values.yaml Adds Helm values for router selector/namespace used in Restricted mode.
manifests/charts/base/templates/workloadmanager.yaml Passes router selector/namespace flags into workload-manager deployment.
manifests/charts/base/templates/rbac/workloadmanager.yaml Grants workload-manager RBAC to manage networkpolicies.
manifests/charts/base/crds/runtime.agentcube.volcano.sh_agentruntimes.yaml Updates AgentRuntime CRD schema with networkPolicy fields.
manifests/charts/base/crds/runtime.agentcube.volcano.sh_codeinterpreters.yaml Updates CodeInterpreter CRD schema with networkPolicy fields.
cmd/workload-manager/main.go Adds flags + selector parsing for router selector/namespace configuration.
cmd/workload-manager/main_test.go Unit tests for selector parsing logic.

Comment thread pkg/workloadmanager/workload_builder.go Outdated
Comment on lines +388 to +394
// createNetworkPolicy creates a NetworkPolicy using the workloadmanager's own client.
func createNetworkPolicy(ctx context.Context, clientset kubernetes.Interface, np *networkingv1.NetworkPolicy) error {
_, err := clientset.NetworkingV1().NetworkPolicies(np.Namespace).Create(ctx, np, metav1.CreateOptions{})
if err != nil && !apierrors.IsAlreadyExists(err) {
return fmt.Errorf("failed to create network policy %s/%s: %w", np.Namespace, np.Name, err)
}
return nil
Comment thread pkg/common/types/sandbox.go Outdated
@@ -21,6 +21,7 @@ import (
"time"

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 20, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 39.64912% with 172 lines in your changes missing coverage. Please review.
✅ Project coverage is 47.39%. Comparing base (3fddb1b) to head (4edac2e).
⚠️ Report is 50 commits behind head on main.

Files with missing lines Patch % Lines
pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go 0.00% 56 Missing ⚠️
pkg/workloadmanager/workload_builder.go 9.30% 39 Missing ⚠️
pkg/workloadmanager/handlers.go 46.77% 29 Missing and 4 partials ⚠️
pkg/workloadmanager/k8s_client.go 9.09% 29 Missing and 1 partial ⚠️
pkg/workloadmanager/garbage_collection.go 0.00% 2 Missing and 4 partials ⚠️
pkg/workloadmanager/server.go 0.00% 4 Missing ⚠️
pkg/common/types/sandbox.go 50.00% 1 Missing and 1 partial ⚠️
pkg/workloadmanager/networkpolicy_builder.go 97.40% 1 Missing and 1 partial ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #291      +/-   ##
==========================================
+ Coverage   46.73%   47.39%   +0.65%     
==========================================
  Files          30       31       +1     
  Lines        2822     3028     +206     
==========================================
+ Hits         1319     1435     +116     
- Misses       1363     1442      +79     
- Partials      140      151      +11     
Flag Coverage Δ
unittests 47.39% <39.64%> (+0.65%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…eter path

- Fix session NP override being ignored in direct sandbox path
- Add CIDR format validation markers to SandboxEgressRule and SandboxIngressRule
- Add example manifest for NetworkPolicy modes (Restricted, Custom)

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
hzxuzhonghu

This comment was marked as spam.

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good design with three modes that maintain backward compatibility. The GC leak fix is a nice bonus. Please confirm behavior of NetworkPolicy for the warm-pool path before merging.

- Add OwnerReference on NetworkPolicy pointing to Sandbox/SandboxClaim
  so k8s GC cascades deletion when owner is removed out-of-band
- Add CEL validation: custom field must be set when mode is Custom
- Add TODO comment on warm-pool NP selector gap pending agent-sandbox fix
- Add GC lifecycle tests verifying NP is deleted alongside sandbox
- Extend createSandboxClaim to return UID for OwnerReference wiring

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
Copilot AI review requested due to automatic review settings April 21, 2026 05:32
@Sanchit2662 Sanchit2662 requested a review from hzxuzhonghu April 21, 2026 05:34
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds Kubernetes NetworkPolicy generation/management to improve sandbox network isolation in the Workload Manager, with configurable router selectors and support for template-level defaults plus per-session overrides.

Changes:

  • Introduces SandboxNetworkPolicy API types and adds networkPolicy fields to AgentRuntime/CodeInterpreter templates and CreateSandboxRequest.
  • Implements NetworkPolicy building (Restricted/Custom) and lifecycle management (create, rollback, delete, GC cleanup).
  • Adds Helm values/flags for router selector + namespace and updates RBAC/CRDs/examples/tests accordingly.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pkg/workloadmanager/workload_builder.go Threads session/template NetworkPolicy into sandbox build flow.
pkg/workloadmanager/server.go Adds configurable router selector/namespace defaults to server config.
pkg/workloadmanager/networkpolicy_builder.go Implements NetworkPolicy object construction for sandbox pods.
pkg/workloadmanager/networkpolicy_builder_test.go Unit tests for NetworkPolicy builder + override semantics.
pkg/workloadmanager/k8s_client.go Adds NetworkPolicy create/delete helpers; returns SandboxClaim UID for owner refs.
pkg/workloadmanager/handlers.go Creates NetworkPolicy (with owner refs), adds rollback + delete handler cleanup.
pkg/workloadmanager/handlers_test.go Updates handler/server tests for new create signature + session NP override.
pkg/workloadmanager/garbage_collection.go Ensures NP cleanup runs even when Sandbox/SandboxClaim is already NotFound.
pkg/workloadmanager/garbage_collection_test.go Tests GC deletes associated NetworkPolicies.
pkg/common/types/sandbox.go Extends CreateSandboxRequest with per-session NetworkPolicy override.
pkg/apis/runtime/v1alpha1/networkpolicy_types.go New CRD/API types for sandbox network policy configuration.
pkg/apis/runtime/v1alpha1/agent_type.go Adds template-level networkPolicy to AgentRuntime SandboxTemplate.
pkg/apis/runtime/v1alpha1/codeinterpreter_types.go Adds template-level networkPolicy to CodeInterpreterSandboxTemplate.
pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go Regenerates DeepCopy for new API fields/types.
manifests/charts/base/values.yaml Adds chart values for router selector/namespace.
manifests/charts/base/templates/workloadmanager.yaml Wires chart values into workload-manager flags.
manifests/charts/base/templates/rbac/workloadmanager.yaml Grants workloadmanager RBAC to manage NetworkPolicies.
manifests/charts/base/crds/runtime.agentcube.volcano.sh_agentruntimes.yaml Updates CRD schema for new template field.
manifests/charts/base/crds/runtime.agentcube.volcano.sh_codeinterpreters.yaml Updates CRD schema for new template field.
example/code-interpreter/code-interpreter-network-policy.yaml Adds usage examples for Restricted/Custom modes.
cmd/workload-manager/main.go Adds flags for router selector/namespace + selector parsing helper.
cmd/workload-manager/main_test.go Tests selector parsing behavior.

Comment thread pkg/workloadmanager/networkpolicy_builder.go Outdated
Comment thread pkg/common/types/sandbox.go
Comment thread cmd/workload-manager/main.go
Comment thread pkg/common/types/sandbox.go Outdated
Extract sandbox/claim creation and NetworkPolicy OwnerReference wiring
into createWorkload helper to bring createSandbox complexity from 16 to
under the gocyclo limit of 15

Also address Copilot review comments:
- Treat AlreadyExists as error in createNetworkPolicy instead of silently
  reusing a potentially stale or mismatched policy
- Validate per-session NetworkPolicy override in CreateSandboxRequest.Validate:
  reject unknown modes and require custom field when mode is Custom
- Fix import ordering to be goimports-compliant across handlers.go,
  k8s_client.go, sandbox.go and main.go
- Warn at startup when --router-selector parses to an empty map

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
@Sanchit2662
Copy link
Copy Markdown
Contributor Author

Sanchit2662 commented Apr 21, 2026

Hey @hzxuzhonghu
For the warm-pool path, the NP is created but the PodSelector (SandboxNameLabelKey: ) won't match anything. Here's why: warm-pool pods are pre-created from the SandboxTemplate before any session exists, so they carry whatever labels the template defines , not our per-session SandboxNameLabelKey. The label only gets set explicitly in buildSandboxObject (the direct-sandbox path), where we control the pod template at creation time. In the warm-pool path we create a SandboxClaim and agent-sandbox binds a pre-warmed pod to it, but that binding doesn't patch our label onto the pod.

Practically: the NP gets created, sits there with no matching pods, and does nothing. The sandbox runs unprotected in the warm-pool case even if Mode=Restricted. I've got a TODO in the code at the build site marking this.

Two possible fixes I'm aware of:

  1. Post-ready pod patch , after the sandbox reports ready, we know the pod name (from the SandboxPodNameAnnotation). We could patch the pod to add SandboxNameLabelKey and then create the NP. Needs pod patch RBAC, and it's a race-y two-step between pod running and NP creation.

  2. Use whatever labels agent-sandbox applies at bind time , if agent-sandbox patches session-identifying labels onto the pod when binding a claim, we can use those as the selector instead. I'd need to look at the agent-sandbox controller code to confirm what labels end up on a bound warm-pool pod.

My suggestion: merge this PR as-is since the direct-sandbox path works correctly, and track the warm-pool fix as a follow-up once we confirm what labels agent-sandbox applies. I'd rather not block the whole feature on the warm-pool edge case, but happy to dig into it more if you'd prefer it handled in this PR.

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall design is solid — per-session override, Restricted-mode DNS egress defaults, cross-namespace router peer with NamespaceSelector, and GC/rollback NP cleanup are all well thought out. A few issues below.

Also note dead code in handleDeleteSandbox: deleteSandboxClaim and deleteSandbox in k8s_client.go already return nil for IsNotFound errors, so the if apierrors.IsNotFound(err) branches at lines ~322 and ~334 of handlers.go are unreachableerr can never be a NotFound error at that point. The log lines there will never fire. The dead branches should be removed to avoid confusing future readers who might assume NotFound could surface there.

Comment thread pkg/workloadmanager/handlers.go Outdated
// Create NetworkPolicy if one was built from the template spec.
// Uses the workloadmanager's own client (not the user client) since NP is infra.
if networkPolicy != nil {
if err := createNetworkPolicy(ctx, s.k8sClient.clientset, networkPolicy); err != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security: Mode=Custom in CreateSandboxRequest is a privilege-escalation vector when EnableAuth=true.

The NetworkPolicy is created here using s.k8sClient.clientset — the workloadmanager's own elevated service account (which now has networkpolicies create/delete RBAC per this PR's manifest changes). The NP rules, however, come entirely from the user-supplied sandboxReq.NetworkPolicy field.

This means a user who can call the WM sandbox-create API but does not have networkpolicies/create permission in k8s can bypass that restriction by setting networkPolicy.mode=Custom with arbitrary egress rules. The practical impact: they can weaken the isolation the cluster operator intended (e.g., override a template-level Restricted policy with open egress rules that reach the k8s API server, cloud metadata endpoints, or internal services).

Suggested fixes:

  1. Disallow Custom mode in CreateSandboxRequest when EnableAuth=true; reserve Custom for cluster-admin-controlled CRD templates only and allow only None/Restricted per-session overrides.
  2. SubjectAccessReview gate: before creating the NP, perform a SAR check on behalf of the requesting user to verify they hold networking.k8s.io/networkpolicies create in the sandbox namespace.

routerPeer := networkingv1.NetworkPolicyPeer{
PodSelector: &metav1.LabelSelector{MatchLabels: routerSelector},
}
if routerNamespace != "" {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When routerNamespace is an empty string, NamespaceSelector is not added to the peer. Kubernetes NetworkPolicy semantics for a peer with only PodSelector (no NamespaceSelector): match pods in the same namespace as the NetworkPolicy.

If the router is in a different namespace and routerNamespace is empty, the router ingress rule silently becomes a same-namespace-only rule — cross-namespace router traffic is blocked, defeating the whole point of this cross-namespace logic.

In production this is guarded by NewServer defaulting RouterNamespace to IdentitySecretNamespace, so the empty case never fires at runtime. But the function itself is silently wrong for empty input, and any test or future caller that passes "" will get incorrect policy without any signal. Consider either panic/returning an error for empty routerNamespace, or at minimum adding a comment: // routerNamespace must be non-empty; an empty value restricts the rule to the sandbox's own namespace.

np.Spec.Egress = buildRestrictedEgressRules(spec)
case runtimev1alpha1.NetworkPolicyModeCustom:
if spec.Custom != nil {
np.Spec.Ingress = spec.Custom.Ingress
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PolicyTypes: [Ingress, Egress] is set unconditionally (lines 53-56) for both Restricted and Custom modes. In Custom mode, np.Spec.Ingress is set directly from spec.Custom.Ingress. If the caller provides only egress rules and leaves Custom.Ingress nil, the resulting NP has PolicyTypes: [Ingress] with no ingress rules — deny-all ingress, including from the router.

Unlike Restricted mode, Custom mode does not automatically inject the router ingress rule. This is not documented anywhere in the type or the mode description, so an operator writing:

networkPolicy:
  mode: Custom
  custom:
    egress:
      - ports: [{port: 443}]

...would unknowingly block the router from reaching the sandbox and see requests time out silently.

Please add a note to SandboxNetworkPolicyCustomRules or NetworkPolicyModeCustom warning that callers must explicitly include any required ingress rules (including router ingress) when using Custom mode.

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall design is solid — per-session override semantics, rollback logic, GC cleanup, and the cross-namespace router NamespaceSelector are all well-thought-out.

Dead code in handleDeleteSandbox (lines ~322/334 — outside this diff's hunks): The if apierrors.IsNotFound(err) branches after deleteSandboxClaim/deleteSandbox calls are unreachable. Both helpers already swallow IsNotFound and return nil for it. Any error that reaches those if err != nil guards is always a real, non-NotFound error. The branches are harmless today but would cause a silent behavior change if the helpers ever stop suppressing NotFound. Fix: remove the dead IsNotFound branches, or remove NotFound-suppression from the helpers and handle it at each call site.

Comment thread pkg/workloadmanager/handlers.go Outdated
// Uses the workloadmanager's own client (not the user client) since NP is infra.
if networkPolicy != nil {
if err := createNetworkPolicy(ctx, s.k8sClient.clientset, networkPolicy); err != nil {
return nil, api.NewInternalError(err)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security: per-session Custom mode allows RBAC bypass when EnableAuth=true.

This call uses s.k8sClient.clientset (the workloadmanager's own elevated SA, which has networking.k8s.io/networkpolicies create RBAC per the manifest added in this PR). A user who calls this API endpoint with networkPolicy.mode=Custom can therefore create an arbitrary NetworkPolicy in the sandbox namespace, even if their own principal lacks that RBAC permission.

In a multi-tenant cluster where an operator sets Mode=Restricted to lock down sandbox egress (e.g., block the metadata endpoint or internal services), any API caller can override that with Mode=Custom and open egress at will — effectively a privilege escalation.

Fix options:

  1. Disallow Custom in per-session overrides. Only allow it in the operator-controlled template spec; session overrides can only set Mode=Restricted/None and add allowedEgress entries.
  2. SubjectAccessReview gate. Before using the elevated client, perform an authorizationv1.SubjectAccessReview to confirm the requesting user has networking.k8s.io/networkpolicies create permission in the target namespace.

case runtimev1alpha1.NetworkPolicyModeCustom:
if spec.Custom != nil {
np.Spec.Ingress = spec.Custom.Ingress
np.Spec.Egress = spec.Custom.Egress
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PolicyTypes: [Ingress, Egress] is always set; nil Custom.Ingress = deny-all ingress, including from the router.

PolicyTypes is set unconditionally to [Ingress, Egress] for all non-None modes. In Custom mode, if spec.Custom.Ingress is nil (the user only wants to configure egress), np.Spec.Ingress is nil here. Per k8s NetworkPolicy semantics, PolicyType=Ingress with no ingress rules means deny all ingress — including from the router. The sandbox becomes completely unreachable.

This affects a user who passes {mode: Custom, custom: {egress: [...]}} expecting to add egress allow rules while keeping normal inbound traffic.

The kubebuilder validation only requires custom != null, so {mode: Custom, custom: {}} is accepted and silently creates a deny-all policy.

Fix: add a warning to the SandboxNetworkPolicyCustomRules godoc:

// When mode is Custom, ALL ingress is denied unless rules are explicitly
// specified here. The router ingress rule injected by Restricted mode is NOT
// automatic — include it in Ingress if sandbox reachability is required.

PodSelector: &metav1.LabelSelector{MatchLabels: routerSelector},
}
if routerNamespace != "" {
routerPeer.NamespaceSelector = &metav1.LabelSelector{
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silent wrong semantics when routerNamespace is empty.

When routerNamespace == "", NamespaceSelector is skipped. Per k8s NetworkPolicy semantics, a NetworkPolicyPeer with only PodSelector and no NamespaceSelector selects pods in the same namespace as the NetworkPolicy — not all namespaces.

So if routerNamespace is empty, the rule only allows router pods co-located in the sandbox namespace. If the router is in a different namespace (the common multi-tenant layout the comment above describes), it silently cannot reach the sandbox — exactly the case the NamespaceSelector was added to prevent.

server.go/NewServer ensures RouterNamespace always defaults to at least IdentitySecretNamespace, so production is safe today. But the function itself has incorrect semantics for the empty case, making it a hazard for tests and any future direct callers.

Fix: guard at the top of buildRestrictedIngressRules:

if routerNamespace == "" {
    // A missing namespace selector restricts the rule to the sandbox namespace only,
    // defeating cross-namespace router ingress. Callers must supply a non-empty value.
    klog.Fatalf("routerNamespace must not be empty")
}

or return an error up the call chain.

// "Custom": use the raw Ingress/Egress rules in the Custom field.
// +kubebuilder:default=None
// +optional
Mode NetworkPolicyMode `json:"mode,omitempty"`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this field, i thin none can be represented with unset policy

Custom can be represented by AllowedIngress AllowedEgress

Restricted should be like None


// AllowedEgress lists additional egress allow rules applied when Mode=Restricted.
// +optional
AllowedEgress []SandboxEgressRule `json:"allowedEgress,omitempty"`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
AllowedEgress []SandboxEgressRule `json:"allowedEgress,omitempty"`
Egress []SandboxEgressRule `json:"allowedEgress,omitempty"`

// AllowDNS controls whether a DNS egress rule (port 53 UDP/TCP) is added when
// Mode=Restricted. Defaults to true.
// +optional
AllowDNS *bool `json:"allowDNS,omitempty"`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed?

// AllowedIngress lists additional ingress allow rules when Mode=Restricted.
// Router ingress is always permitted in Restricted mode regardless of this field.
// +optional
AllowedIngress []SandboxIngressRule `json:"allowedIngress,omitempty"`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
AllowedIngress []SandboxIngressRule `json:"allowedIngress,omitempty"`
Ingress []SandboxIngressRule `json:"ingress,omitempty"`

CIDRs []string `json:"cidrs,omitempty"`
// Ports restricts the rule to these destination ports. An empty list matches all ports.
// +optional
Ports []SandboxNetworkPolicyPort `json:"ports,omitempty"`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?? I think the comment is wrong , copied from SandboxEgressRule

}

// SandboxIngressRule describes an ingress allow rule for a sandbox pod.
type SandboxIngressRule struct {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This struct is same as SandboxEgressRule, please check k8s NetworkPolicySpec api

return nil, err
}
} else {
if _, err := createSandbox(ctx, dynamicClient, sandbox); err != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was extracted from createSandbox to get its cyclomatic complexity under the gocyclo lint limit of 15 , it was hitting 16 before. The logic is exactly the same, just pulled out into its own helper. The early return is just how I naturally wrote it in Go, happy to match whatever style you prefer here.

  - Remove Mode enum (None/Restricted/Custom); nil policy means no                                                                                                       enforcement, non-nil enforces deny-all with router ingress and DNS
  - Remove AllowDNS toggle; DNS egress is always included when a policy
    is set                                                                                                                                                             - Rename AllowedEgress/AllowedIngress to Egress/Ingress; update JSON
    tags to egress/ingress                                                                                                                                             - Merge SandboxEgressRule and SandboxIngressRule into single
    SandboxNetworkPolicyRule type                                                                                                                                      - Block Custom mode in per-session overrides to prevent RBAC privilege
    escalation via the workloadmanager elevated SA
  - Add klog.Fatalf guard when routerNamespace is empty to prevent silent                                                                                                same-namespace-only router ingress semantics
  - Remove dead IsNotFound branches in handleDeleteSandbox; helpers
    already swallow NotFound and return nil                                                                                                                            - Fix incorrect Ports comment copied from SandboxEgressRule into
    SandboxIngressRule

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
Copilot AI review requested due to automatic review settings April 24, 2026 05:34
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds Kubernetes NetworkPolicy provisioning to improve sandbox network isolation in AgentCube’s workload-manager, including router-ingress allowlisting, DNS egress allowlisting, and cleanup on rollback/delete/GC, plus configuration knobs for selecting the router pods.

Changes:

  • Build and (optionally) create a per-sandbox NetworkPolicy alongside Sandbox/SandboxClaim creation, with rollback/delete/GC cleanup paths.
  • Add router selector/namespace configuration via workload-manager flags and Helm values, and wire these into the NetworkPolicy builder.
  • Introduce API/CRD surface changes for template-level and per-session NetworkPolicy configuration, plus unit tests around policy construction and cleanup.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
pkg/workloadmanager/workload_builder.go Returns a built NetworkPolicy alongside sandbox objects and applies session override semantics.
pkg/workloadmanager/server.go Adds router selector/namespace config defaults for policy building.
pkg/workloadmanager/networkpolicy_builder.go Implements NetworkPolicy object construction (router ingress + DNS egress + extra rules).
pkg/workloadmanager/networkpolicy_builder_test.go Adds unit tests for NetworkPolicy construction and override semantics.
pkg/workloadmanager/k8s_client.go Adds NetworkPolicy create/delete helpers; returns created UIDs for OwnerReferences; makes clientset interface-typed for testability.
pkg/workloadmanager/handlers.go Wires per-session policy override into builders; creates NetworkPolicy; adds OwnerReferences; improves rollback to include NP deletion; delete handler best-effort NP cleanup.
pkg/workloadmanager/handlers_test.go Updates mocks/signatures and adds test for session-level policy override forwarding.
pkg/workloadmanager/garbage_collection.go Ensures NP cleanup runs even when sandbox/claim is already NotFound.
pkg/workloadmanager/garbage_collection_test.go Adds fake clientset and test verifying GC deletes associated NetworkPolicy.
pkg/common/types/sandbox.go Adds NetworkPolicy field to CreateSandboxRequest and a placeholder validator.
pkg/apis/runtime/v1alpha1/networkpolicy_types.go Adds SandboxNetworkPolicy API type (ingress/egress rules).
pkg/apis/runtime/v1alpha1/agent_type.go Adds NetworkPolicy to SandboxTemplate.
pkg/apis/runtime/v1alpha1/codeinterpreter_types.go Adds NetworkPolicy to CodeInterpreterSandboxTemplate.
pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go Updates generated deep-copies for new network policy types/fields.
manifests/charts/base/values.yaml Adds Helm values for router selector/namespace.
manifests/charts/base/templates/workloadmanager.yaml Passes router selector/namespace flags to workload-manager.
manifests/charts/base/templates/rbac/workloadmanager.yaml Adds RBAC for NetworkPolicy resources.
manifests/charts/base/crds/runtime.agentcube.volcano.sh_agentruntimes.yaml Updates CRD schema to include networkPolicy fields.
manifests/charts/base/crds/runtime.agentcube.volcano.sh_codeinterpreters.yaml Updates CRD schema to include networkPolicy fields.
example/code-interpreter/code-interpreter-network-policy.yaml Adds an example CodeInterpreter manifest using template networkPolicy.
cmd/workload-manager/main.go Adds CLI flags and selector parsing for router selector/namespace.
cmd/workload-manager/main_test.go Adds tests for selector parsing.

Comment on lines +24 to +37
// SandboxNetworkPolicy defines the network isolation policy applied to each sandbox pod.
// When set (non-nil), a deny-all NetworkPolicy is created with explicit allow rules for
// router ingress and DNS egress. Additional rules can be added via Ingress and
// Egress. Leave this field unset (nil) to disable network policy enforcement.
type SandboxNetworkPolicy struct {
// Egress lists additional egress allow rules beyond the default DNS rule.
// +optional
Egress []SandboxNetworkPolicyRule `json:"egress,omitempty"`

// Ingress lists additional ingress allow rules.
// Router ingress is always permitted regardless of this field.
// +optional
Ingress []SandboxNetworkPolicyRule `json:"ingress,omitempty"`
}
Spec corev1.PodSpec `json:"spec" protobuf:"bytes,3,opt,name=spec"`

// NetworkPolicy defines the network isolation policy for sandbox pods created from
// this template. Defaults to Mode=None (no policy created).
Comment on lines +56 to +64
networkPolicy:
# routerSelector is the pod label selector used to identify the router pod.
# Applied as the mandatory ingress allow rule when SandboxNetworkPolicy.Mode=Restricted.
# Format: comma-separated key=value pairs, e.g. "app=agentcube-router,tier=proxy"
routerSelector: "app=agentcube-router"
# routerNamespace is the namespace the router runs in.
# Used to build the cross-namespace NamespaceSelector for the router→sandbox ingress rule.
# Defaults to the workloadmanager's own namespace when left empty.
routerNamespace: ""
Comment on lines +66 to +85
networkPolicy:
description: |-
NetworkPolicy defines the network isolation policy for sandbox pods created from
this template. Defaults to Mode=None (no policy created).
properties:
allowDNS:
description: |-
AllowDNS controls whether a DNS egress rule (port 53 UDP/TCP) is added when
Mode=Restricted. Defaults to true.
type: boolean
allowedEgress:
description: AllowedEgress lists additional egress allow rules
applied when Mode=Restricted.
items:
description: SandboxEgressRule describes an egress allow
rule for a sandbox pod.
properties:
cidrs:
description: |-
CIDRs is the list of destination CIDR blocks.
Comment on lines +335 to +347
networkPolicy:
description: |-
NetworkPolicy defines the network isolation policy for sandbox pods created from
this template. Defaults to Mode=None (no policy created).
properties:
allowDNS:
description: |-
AllowDNS controls whether a DNS egress rule (port 53 UDP/TCP) is added when
Mode=Restricted. Defaults to true.
type: boolean
allowedEgress:
description: AllowedEgress lists additional egress allow rules
applied when Mode=Restricted.
Resources corev1.ResourceRequirements `json:"resources,omitempty"`

// NetworkPolicy defines the network isolation policy for sandbox pods created from
// this template. Defaults to Mode=None (no policy created).
Comment on lines +90 to 94
// validateNetworkPolicyOverride validates a per-session NetworkPolicy override from
// CreateSandboxRequest. Unlike the CRD field (which has kubebuilder markers), this
// comes in as raw JSON and bypasses API-server admission, so we validate it here.
func validateNetworkPolicyOverride(_ *runtimev1alpha1.SandboxNetworkPolicy) error {
return nil
Comment on lines +70 to +76
if routerNamespace == "" {
// Without a NamespaceSelector the peer only matches pods in the same namespace
// as the NetworkPolicy, defeating cross-namespace router ingress. This is a
// configuration error that must be caught loudly — silent wrong semantics here
// would make the sandbox unreachable in any multi-namespace layout.
klog.Fatalf("routerNamespace must not be empty: a missing NamespaceSelector restricts router ingress to the sandbox namespace only")
}
Comment on lines +60 to +65
egress:
- cidrs:
- "10.0.0.0/8"
ports:
- protocol: TCP
port: 443
Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
@Sanchit2662
Copy link
Copy Markdown
Contributor Author

Hi @hzxuzhonghu , thanks for the thorough review! I've addressed all the feedback


// Create NetworkPolicy if one was built from the template spec.
// Uses the workloadmanager's own client (not the user client) since NP is infra.
if networkPolicy != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we create the NetworkPolicy before creating the Sandbox/SandboxClaim, or otherwise block pod startup until the policy exists? Creating the workload first leaves a window where untrusted sandbox code can run before the deny-all policy is applied.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done , NP is now created before the Sandbox/SandboxClaim so the deny-all is in place before the CNI wires up the pod's network interface. The OwnerRef is patched onto the NP after the sandbox UID is known.

Comment thread pkg/workloadmanager/workload_builder.go Outdated
}
sandboxEntry.Kind = types.SandboxClaimsKind
return simpleSandbox, sandboxClaim, sandboxEntry, nil
np := buildNetworkPolicy(sandboxName, namespace, effectiveNetworkPolicy(sessionNetworkPolicy, codeInterpreterObj.Spec.Template.NetworkPolicy), routerSelector, routerNamespace)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid creating a NetworkPolicy for the warm-pool path until it actually selects the bound pod? As written, this creates a policy object but the TODO says the pod does not carry SandboxNameLabelKey, so users can enable isolation and still get no enforcement.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've addressed this ; Warm-pool pods don't carry SandboxNameLabelKey, so the NP's PodSelector would never match them , creating the object only gave a false sense of isolation. I've removed NP creation from the warm-pool path entirely and added a warning log when a policy is configured but can't be enforced. Proper enforcement will need agent-sandbox to propagate claim labels onto the bound pod; left a note in the code for that.

Copilot AI review requested due to automatic review settings April 24, 2026 16:09
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds Kubernetes NetworkPolicy generation and lifecycle management to improve sandbox network isolation, with configurable router selection and per-session overrides.

Changes:

  • Build and (optionally) create a deny-all NetworkPolicy per sandbox with router-ingress + DNS-egress allow rules and user-specified extra rules.
  • Add cleanup paths for NetworkPolicies on delete, rollback, and garbage collection, plus ownerReference patching.
  • Expose router selector/namespace configuration via Helm values and CLI flags; add unit tests.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
pkg/workloadmanager/workload_builder.go Threads template/session NetworkPolicy config into sandbox build and returns a built NP object.
pkg/workloadmanager/server.go Adds router selector/namespace config defaults used for NP rule construction.
pkg/workloadmanager/networkpolicy_builder.go Implements NetworkPolicy object construction (router ingress + DNS egress + extra rules).
pkg/workloadmanager/networkpolicy_builder_test.go Adds unit tests for NP building and override semantics.
pkg/workloadmanager/k8s_client.go Adds NP create/delete/patch helpers and returns UIDs for ownerRefs.
pkg/workloadmanager/handlers.go Creates NP before workload, patches ownerRef after UID known, and cleans up on delete/rollback.
pkg/workloadmanager/handlers_test.go Updates tests for new builder/createSandbox signatures; adds session override forwarding test.
pkg/workloadmanager/garbage_collection.go Ensures GC deletes NP even when sandbox/claim is already gone.
pkg/workloadmanager/garbage_collection_test.go Adds GC test verifying NP cleanup behavior.
pkg/common/types/sandbox.go Adds per-session NetworkPolicy override field and validation hook.
pkg/apis/runtime/v1alpha1/networkpolicy_types.go Introduces CRD types for SandboxNetworkPolicy rules.
pkg/apis/runtime/v1alpha1/agent_type.go Adds NetworkPolicy field to AgentRuntime sandbox template CRD type.
pkg/apis/runtime/v1alpha1/codeinterpreter_types.go Adds NetworkPolicy field to CodeInterpreter template CRD type.
pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go Adds deep-copy support for new NetworkPolicy fields/types.
manifests/charts/base/values.yaml Adds Helm values for router selector/namespace used for NP router ingress rule.
manifests/charts/base/templates/workloadmanager.yaml Wires router selector/namespace Helm values into workload-manager CLI args.
manifests/charts/base/templates/rbac/workloadmanager.yaml Grants workload-manager permissions for NetworkPolicies (create/get/list/watch/delete).
manifests/charts/base/crds/runtime.agentcube.volcano.sh_codeinterpreters.yaml Extends CodeInterpreter CRD schema with networkPolicy fields.
manifests/charts/base/crds/runtime.agentcube.volcano.sh_agentruntimes.yaml Extends AgentRuntime CRD schema with networkPolicy fields.
example/code-interpreter/code-interpreter-network-policy.yaml Adds example manifests demonstrating template networkPolicy usage.
cmd/workload-manager/main.go Adds CLI flags for router selector/namespace and parsing helper.
cmd/workload-manager/main_test.go Adds unit test coverage for selector parsing.

verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["networking.k8s.io"]
resources: ["networkpolicies"]
verbs: ["get", "list", "watch", "create", "delete"]
Comment on lines +70 to +88
if routerNamespace == "" {
// Without a NamespaceSelector the peer only matches pods in the same namespace
// as the NetworkPolicy, defeating cross-namespace router ingress. This is a
// configuration error that must be caught loudly — silent wrong semantics here
// would make the sandbox unreachable in any multi-namespace layout.
klog.Fatalf("routerNamespace must not be empty: a missing NamespaceSelector restricts router ingress to the sandbox namespace only")
}
// Router ingress is always allowed. NamespaceSelector is set so the rule works
// when the router runs in a namespace distinct from the sandbox (the common
// production layout: router in agentcube-system, sandboxes in user namespaces).
routerPeer := networkingv1.NetworkPolicyPeer{
PodSelector: &metav1.LabelSelector{MatchLabels: routerSelector},
NamespaceSelector: &metav1.LabelSelector{
MatchLabels: map[string]string{k8sNamespaceLabelKey: routerNamespace},
},
}
rules := []networkingv1.NetworkPolicyIngressRule{
{From: []networkingv1.NetworkPolicyPeer{routerPeer}},
}
Comment on lines +82 to +85
// NetworkPolicy defines the network isolation policy for sandbox pods created from
// this template. Defaults to Mode=None (no policy created).
// +optional
NetworkPolicy *SandboxNetworkPolicy `json:"networkPolicy,omitempty"`
// validateNetworkPolicyOverride validates a per-session NetworkPolicy override from
// CreateSandboxRequest. Unlike the CRD field (which has kubebuilder markers), this
// comes in as raw JSON and bypasses API-server admission, so we validate it here.
func validateNetworkPolicyOverride(_ *runtimev1alpha1.SandboxNetworkPolicy) error {
Comment on lines 94 to 95
return nil
}
Comment on lines +47 to +49
// DefaultRouterSelector is the pod label selector used to identify the router pod
// when building the mandatory router→sandbox ingress rule in Restricted network policy mode.
var DefaultRouterSelector = map[string]string{"app": "agentcube-router"}
Comment on lines +91 to +93
if len(config.RouterSelector) == 0 {
config.RouterSelector = DefaultRouterSelector
}
Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
@Sanchit2662 Sanchit2662 force-pushed the feat/sandbox-network-policy branch from 3f46273 to 4edac2e Compare April 24, 2026 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RFC: Support network policy for sandbox

6 participants