From c9deff179b8292363db13725d176f3397b89cdbb Mon Sep 17 00:00:00 2001 From: Drew Newberry Date: Mon, 15 Jun 2026 23:42:08 -0700 Subject: [PATCH 01/10] docs(rfc): add gateway interceptors RFC Signed-off-by: Drew Newberry --- rfc/0006-gateway-interceptors/README.md | 494 ++++++++++++++++++ .../policy-governance-example.md | 70 +++ 2 files changed, 564 insertions(+) create mode 100644 rfc/0006-gateway-interceptors/README.md create mode 100644 rfc/0006-gateway-interceptors/policy-governance-example.md diff --git a/rfc/0006-gateway-interceptors/README.md b/rfc/0006-gateway-interceptors/README.md new file mode 100644 index 000000000..f63c1ae7f --- /dev/null +++ b/rfc/0006-gateway-interceptors/README.md @@ -0,0 +1,494 @@ +--- +authors: + - "@anewberry" +state: draft +links: + - https://github.com/NVIDIA/OpenShell/issues/1919 +--- + +# RFC 0006 - Gateway Interceptors + +## Summary + +This RFC proposes a first-class Gateway Interceptors system for OpenShell. +Interceptors let operators and external integrators customize gateway API +behavior without forking the gateway or adding special cases to compute +drivers. + +Interceptors and drivers serve different extension needs. Interceptors add business logic +around gateway operations. Drivers replace or provide implementation for +platform functionality, such as how sandboxes are provisioned on Docker, +Kubernetes, or VMs. + +Gateway Interceptors is the umbrella name for gateway extension points. This +RFC defines one interceptor role: + +- **Operation interceptors** observe, modify, validate, reject, or audit gateway + operations at well-defined phases. + +Future RFCs may define event-driven or workflow interceptors under the same +umbrella, but they are out of scope for this first implementation. + +Compute drivers continue to own compute-platform provisioning. Interceptors own +gateway-level policy for resource writes: tenancy, quotas, naming, policy +authority, and driver configuration restrictions. 
+ +The gateway database remains the system of record. External systems integrate +by writing through existing OpenShell APIs that persist into the gateway DB. +Gateway runtime paths read gateway-owned state; they do not call external +systems during lookup. + +## Motivation + +OpenShell already has several centralized control-plane choke points: + +- Sandbox creation validates requests, defaults images, validates policy + safety, persists a sandbox object, and provisions through the selected driver. +- Policy and runtime settings are resolved through gateway APIs before they are + delivered to sandbox supervisors. +- Provider profiles and provider records are stored and resolved by the + gateway. +- Driver-specific `SandboxTemplate.driver_config` is selected by the gateway + before the translated `DriverSandbox` reaches the compute driver. + +These are the right places for operator-specific control, but today those +controls must be implemented directly in OpenShell code. That does not scale to +organizational requirements such as: + +- Vend policies and providers from an external source by writing them through + existing provider, provider profile, and config APIs. +- Enforce one system-wide sandbox policy and reject custom sandbox policies. +- Verify policy writes against an external authority before accepting them. +- Restrict driver configuration payloads to an approved schema or fixed value. +- Limit each user to a maximum number of running sandboxes. +- Require sandbox names to follow an organization prefix, such as `nvidia-`. + +These examples are gateway policy, not compute-driver behavior. A compute +driver can validate whether a pod, container, or VM can be provisioned. It +should not own tenant quotas, global policy authority, provider resource +management, or naming conventions. + +## Non-goals + +- Replacing compute drivers or adding a second compute provisioning interface. 
+- Letting interceptors bypass gateway authentication, authorization, policy + safety validation, or driver schema validation. +- Moving sandbox runtime enforcement out of the sandbox supervisor and proxy. +- Replacing the gateway database as the system of record. + +## Proposal + +Add a gateway interceptor framework with explicit phases, resource selectors, +deterministic ordering, bounded execution, audit logging, and conservative +failure behavior. + +Interceptors do not replace gateway functionality. They add governance and +business logic around resource operations: defaulting, validation, rejection, +and audit. Replacing how core functionality is implemented remains the role of +drivers and other provider-style interfaces. + +### Operation interceptors + +Operation interceptors run in request handling paths. They may modify a request or +object only in modification phases. They may reject in validation phases. They may +attach warnings and audit annotations in all phases. + +Operation interceptors should work for all gateway operations, not a +hand-maintained subset. Each operation exposes stable interceptor metadata: + +- `resource`: the logical resource being operated on, such as a sandbox, + provider, provider profile, policy/config object, or internal driver-facing + sandbox request. +- `operation`: the action being performed, such as create, update, delete, + attach, detach, import, merge, validate, or another domain operation. + +The gateway should derive this metadata from the operation being handled rather +than checking it against a fixed allowlist. New gateway operations should enter +the interceptor pipeline by default when they are added. + +This lets OpenShell add deployment-specific business logic around the resource +operations it already supports while keeping runtime reads local and +deterministic. + +### Gateway state + +External systems should not participate in live gateway lookup paths. 
Instead, +they run controllers or sync jobs that write desired state through existing +OpenShell APIs. + +Examples of existing DB-backed state include: + +| State | Existing API surface | +|---|---| +| Sandboxes | `CreateSandbox`, `DeleteSandbox`, sandbox provider attach/detach | +| Providers | `CreateProvider`, `UpdateProvider`, `DeleteProvider` | +| Provider profiles | `ImportProviderProfiles`, `DeleteProviderProfile` | +| Sandbox policy and settings | `UpdateConfig`, policy history/status APIs | +| Gateway-global config | `UpdateConfig --global`, gateway settings APIs | + +Gateway runtime paths read this state from the gateway store. If an external +catalog or controller is unavailable, the gateway continues using the last +accepted state already persisted in the DB. + +This RFC does not introduce new gateway resource kinds for quotas, name +policies, policy bundles, or driver config policy. Those concerns can be +enforced by interceptor services and normal gateway configuration. If +OpenShell later needs first-class resources for them, that should be a separate +RFC. + +### Operation phases + +Operation phases are ordered. Later phases see the result of earlier phases. + +| Phase | Modification allowed | Purpose | +|---|---:|---| +| `pre_request` | yes | Normalize or reject the raw API request after auth and basic size limits. | +| `modify_object` | yes | Apply defaults to the gateway object after standard request parsing. | +| `validate_object` | no | Enforce object-level policy before persistence. | +| `validate_driver` | no | Enforce driver-facing policy after translation to `DriverSandbox`. | +| `post_commit` | no | Emit audit or notify external systems after successful persistence or provisioning. | + +Gateway invariants run after modification so interceptors cannot leave invalid +objects in the system. Driver validation still runs after interceptors so +drivers remain the authority for driver-owned schemas. 
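
The phase table above can be sketched as a small Rust type. This is only an illustration under stated assumptions: the `Phase` enum, its method names, and the `check_patches` helper are hypothetical and are not part of the RFC or of any existing OpenShell crate. The phase names and the "modification allowed" column are taken from the table.

```rust
/// Operation phases, declared in the fixed order the table lists them.
/// Deriving `Ord` makes the declaration order the execution order.
/// (Hypothetical type for illustration; not an OpenShell API.)
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Phase {
    PreRequest,
    ModifyObject,
    ValidateObject,
    ValidateDriver,
    PostCommit,
}

impl Phase {
    /// All phases in execution order.
    pub const ORDERED: [Phase; 5] = [
        Phase::PreRequest,
        Phase::ModifyObject,
        Phase::ValidateObject,
        Phase::ValidateDriver,
        Phase::PostCommit,
    ];

    /// Wire name as written in the phase table.
    pub fn as_str(self) -> &'static str {
        match self {
            Phase::PreRequest => "pre_request",
            Phase::ModifyObject => "modify_object",
            Phase::ValidateObject => "validate_object",
            Phase::ValidateDriver => "validate_driver",
            Phase::PostCommit => "post_commit",
        }
    }

    /// Mirrors the "Modification allowed" column of the table.
    pub fn allows_modification(self) -> bool {
        matches!(self, Phase::PreRequest | Phase::ModifyObject)
    }
}

/// Hypothetical guard: reject patches returned in a phase where the
/// table marks modification as not allowed.
pub fn check_patches(phase: Phase, patch_count: usize) -> Result<(), String> {
    if patch_count > 0 && !phase.allows_modification() {
        return Err(format!(
            "phase {} does not accept patches, got {}",
            phase.as_str(),
            patch_count
        ));
    }
    Ok(())
}
```

Encoding the order in the enum declaration means a gateway could iterate `Phase::ORDERED` once per operation and skip phases that do not apply, rather than keeping a separate ordering table.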
+ +### Interceptor request contract + +The interceptor request should be stable and resource-oriented, not tied to Rust +handler internals. + +```proto +message InterceptorReview { + string api_version = 1; + string interceptor_name = 2; + string binding_id = 3; + string phase = 4; + string resource = 5; + string operation = 6; + + InterceptorPrincipal principal = 7; + InterceptorRequestContext context = 8; + + google.protobuf.Struct object = 9; + google.protobuf.Struct old_object = 10; + google.protobuf.Struct request = 11; +} + +message InterceptorPrincipal { + string kind = 1; // user, service, sandbox + string subject = 2; + repeated string groups = 3; +} + +message InterceptorRequestContext { + string request_id = 1; + string gateway_replica_id = 2; + string compute_driver = 3; + bool dry_run = 4; + map<string, string> labels = 5; +} +``` + +The interceptor response returns an allow/deny decision, optional patches, and +operator-visible metadata for operation interceptors. + +```proto +message InterceptorDecision { + bool allowed = 1; + string reason = 2; + string status_code = 3; + repeated JsonPatch patches = 4; + repeated string warnings = 5; + map<string, string> audit_annotations = 6; +} +``` + +Only modification phases accept patches. A validation interceptor that returns +patches is a configuration error. + +The `binding_id` is owned by the interceptor service. It identifies the +service-declared binding that selected the review. + +### Interceptor endpoints + +The framework supports one service protocol with two transports. The gateway +detects the transport from the interceptor endpoint URI: + +- `grpc://host:port` connects to a plaintext gRPC interceptor service over TCP. +- `grpcs://host:port` connects to a TLS-protected gRPC interceptor service over TCP. +- `unix:///path/to/socket` connects to a gRPC interceptor service over a Unix domain + socket. + +Both transports use the same protobuf service contract. 
Unix domain sockets are +the preferred local deployment shape because they avoid exposing a network +listener and can rely on filesystem permissions. TCP is for interceptors that run as +separate services or outside the gateway host. + +### Selection and ordering + +Selection should be oriented around interceptor services, not individual +phase/resource routes. Operators should normally configure a small number of +interceptor services and service-specific settings. The service tells the +gateway which operation bindings it supports. + +A configured `[[interceptors]]` entry represents one interceptor service +instance. During gateway startup or config reload, the gateway calls a +`Describe` RPC on the service. The response describes the service's default +bindings: + +```proto +message InterceptorManifest { + string api_version = 1; + repeated InterceptorBinding bindings = 2; +} + +message InterceptorBinding { + string id = 1; + repeated string phases = 2; + repeated string resources = 3; + repeated string operations = 4; + int32 order = 5; + bool modifies = 6; + string default_failure_policy = 7; +} +``` + +By default, the gateway enables the bindings returned by the service manifest. +Operators can configure the service once, then optionally override specific +bindings when they need to disable, narrow, or reorder behavior. Overrides +should only narrow service-declared selectors unless a future RFC explicitly +allows expansion. 
+ +Example: + +```toml +[[interceptors]] +name = "org-controls" +order = 100 +failure_policy = "fail_closed" +endpoint = "unix:///run/openshell/interceptors/org-controls.sock" +timeout = "500ms" + +[interceptors.config] +sandbox_name_prefix = "nvidia-" +generated_sandbox_names_only = true +max_running_sandboxes_per_user = 10 +system_policy_authority = true +policy_authority_endpoint = "grpcs://policy-control.example.com:8443" + +[interceptors.config.driver_config.kubernetes.required_payload] +runtimeClassName = "nvidia" + +[[interceptors.overrides]] +binding = "provider-profile-governance" +enabled = false + +[[interceptors.overrides]] +binding = "driver-config-validation" +failure_policy = "fail_closed" +match = { compute_driver = "kubernetes" } + +[[interceptors.overrides]] +binding = "policy-authority" +order = 90 +match = { operations = ["update", "merge", "delete"] } +``` + +The service manifest keeps common configuration terse. Operators do not need to +know that sandbox prefix behavior runs at `modify_object` while driver config +behavior runs at `validate_driver`; the service exposes those bindings. + +The gateway builds an execution plan from enabled bindings. Selection evaluates +the service-declared resource, operation, phase, principal, label, and driver +selectors, then applies gateway-configured narrowing overrides. + +Interceptors run in fixed phase order. Within a phase, matching bindings run by +this deterministic ordering: + +1. configured interceptor service `order`. +2. service-declared binding `order`, after gateway overrides. +3. interceptor service name. +4. binding ID. + +The gateway rejects interceptor configuration that creates ambiguous +modification order for the same field if that can be detected statically. + +### Failure policy + +Each binding has an effective failure policy. The gateway starts with the +service default, applies the interceptor service-level gateway config, then +applies any binding override. 
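
The two resolution rules above, within-phase ordering and effective failure policy, can be sketched in Rust as follows. This is a minimal sketch, not the gateway's implementation: `FailurePolicy`, `PlannedBinding`, `effective_failure_policy`, and `sort_within_phase` are hypothetical names introduced here. The sketch assumes the service-level gateway setting and the binding override are both optional, and that a later layer simply replaces an earlier one.

```rust
/// Failure policies named in this RFC. (Hypothetical Rust type.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailurePolicy {
    FailClosed,
    FailOpen,
    Ignore,
}

impl FailurePolicy {
    /// Parse the string form used in manifests and gateway config.
    pub fn parse(s: &str) -> Option<Self> {
        match s {
            "fail_closed" => Some(Self::FailClosed),
            "fail_open" => Some(Self::FailOpen),
            "ignore" => Some(Self::Ignore),
            _ => None,
        }
    }
}

/// Precedence: binding override, then service-level gateway config,
/// then the default declared by the service manifest.
pub fn effective_failure_policy(
    service_default: FailurePolicy,
    service_config: Option<FailurePolicy>,
    binding_override: Option<FailurePolicy>,
) -> FailurePolicy {
    binding_override.or(service_config).unwrap_or(service_default)
}

/// One enabled binding in the execution plan for a single phase.
/// (Hypothetical struct; field names are assumptions.)
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PlannedBinding {
    pub service_order: i32,   // configured interceptor service `order`
    pub binding_order: i32,   // binding `order`, after gateway overrides
    pub service_name: String, // tie-breaker 1
    pub binding_id: String,   // tie-breaker 2
}

/// Sort by the four-part key from the ordering list, in that order.
pub fn sort_within_phase(bindings: &mut [PlannedBinding]) {
    bindings.sort_by(|a, b| {
        (a.service_order, a.binding_order, &a.service_name, &a.binding_id)
            .cmp(&(b.service_order, b.binding_order, &b.service_name, &b.binding_id))
    });
}
```

Because the key ends in the service name and binding ID, two bindings sort the same way on every gateway replica regardless of the order in which configuration was loaded.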
+ +| Failure policy | Behavior | +|---|---| +| `fail_closed` | Interceptor timeout or service error rejects the API operation. | +| `fail_open` | Interceptor timeout or service error permits the operation. The gateway emits warnings and audit logs. | +| `ignore` | Interceptor errors are logged only. Valid only for `post_commit`. | + +Defaults: + +- Modifying and validating bindings default to `fail_closed`. +- `post_commit` bindings default to `ignore`. + +Every interceptor service has a timeout and response size limit. Operation +interceptor bindings also have a maximum patch count. + +### Worked examples + +See [policy-governance-example.md](policy-governance-example.md) for a +non-normative example of an organization policy interceptor service with +multiple service-declared bindings and gateway-side overrides. + +### External reconciliation through existing APIs + +External systems integrate by reconciling desired state through existing +OpenShell APIs. The gateway validates and persists those writes, then runtime +paths read the persisted state. + +```mermaid +flowchart LR + External[External catalog/controller] --> API[Existing OpenShell API] + API --> Interceptors[Operation interceptors] + Interceptors --> Validate[OpenShell validation] + Validate --> Store[Gateway DB] + Store --> Runtime[Gateway runtime reads] +``` + +Provider profile sync should use the existing provider profile import API. +Provider sync should use the existing provider create/update APIs. + +Policy sync should use the existing global and sandbox-scoped config APIs. +Managed deployments that want an authoritative global policy can set the global +policy through `UpdateConfig --global` and use operation interceptors to reject +sandbox-scoped policy changes. + +Ownership and provenance should use existing metadata surfaces where available, +such as labels on objects and config fields on provider records. 
The gateway DB +record is still authoritative; provenance explains how the current desired +state arrived. + +### Gateway info surface + +The first version should not add a dedicated interceptor management API or CLI. +Interceptor configuration remains gateway-local configuration. + +The existing gateway info command may expose a read-only summary of configured +interceptor services, enabled bindings, effective failure policies, and last +observed health. That is sufficient for operator visibility in this RFC. + +### Observability and audit + +Every interceptor decision should emit structured gateway logs with: + +- interceptor name. +- binding ID. +- phase. +- resource and operation. +- principal subject. +- decision. +- reason. +- latency. +- failure policy. +- patch count. +- audit annotations. + +Security-relevant denials should be emitted as OCSF detection findings or +configuration/security events, depending on the event class. Non-security +operational failures can use plain tracing. + +### Security model + +Interceptor services run outside the gateway trust boundary. The gateway must +continue to enforce first-party invariants after interceptor modification. + +Rules: + +- Interceptors receive only the fields needed for their phase. +- `grpcs://` endpoints use TLS and should be required for remote interceptor services. +- `grpc://` endpoints are plaintext and should be limited to loopback or + explicitly trusted local networks. +- UDS interceptor services rely on filesystem permissions and should be owned by the + gateway operator. +- Interceptor service responses are bounded by timeout and body size. + Operation interceptor patches are also bounded by patch count. +- Interceptor services cannot replace built-in validation. Imported profiles and + policies are validated before use. + +## Implementation plan + +1. 
Add a `crates/openshell-interceptors` crate with shared interceptor + manifest, request/response, selector matching, ordering, failure policy + handling, patch application, and test helpers. +2. Add interceptor configuration parsing to gateway config and validate it at startup. +3. Implement gRPC interceptor clients that derive TCP or Unix domain socket + transport from the configured endpoint URI and call `Describe` during + startup or config reload. +4. Build an execution plan from service manifests plus gateway-configured + overrides. +5. Wire interceptor execution into the gateway operation pipeline so all + gateway operations can pass through `pre_request`, `modify_object`, + `validate_object`, `validate_driver`, and `post_commit` where applicable. +6. Add example service bindings for the policy governance workflows described + in [policy-governance-example.md](policy-governance-example.md). +7. Audit existing gateway operations and route each resource-affecting path + through the shared interceptor pipeline. +8. Add interceptor decision audit logging and metrics. +9. Document how external controllers should reconcile providers, provider + profiles, global policy, and sandbox policy through existing APIs. +10. Add read-only interceptor visibility to the existing gateway info command. +11. Document gateway interceptor configuration, endpoint requirements, failure + modes, and security guidance. + +## Risks + +- Interceptors can make request behavior harder to reason about if ordering + and audit are weak. +- Synchronous gRPC interceptor services can become availability dependencies for the + gateway. +- Modifying interceptors can hide user intent if they silently rewrite user-supplied + values. +- Ownership can become confusing when external controllers and humans both edit + the same provider profile, provider, or policy through existing APIs. +- Quota interceptors need a stronger consistency design before they are safe in HA + deployments. 
+ +Mitigations: + +- Keep interceptors disabled by default. +- Make ordering deterministic and visible. +- Default modifying and validating interceptors to `fail_closed`. +- Run first-party invariant validation after modification. +- Make HA-unsafe interceptors declare their scope explicitly. + +## Alternatives + +### Add more gateway config fields + +OpenShell could add first-party config fields for each requirement, such as +`sandbox_name_prefix`, `max_sandboxes_per_user`, and +`allowed_driver_config_keys`. + +This is simple for known cases but does not scale to organization-specific +policy or external sources. It also keeps growing the gateway config schema for +controls that are not core OpenShell semantics. + +### Put this in compute drivers + +Drivers already validate driver-owned config. They could also reject names, +quotas, and policy choices. + +This mixes responsibilities. Drivers should own compute-platform feasibility. +The gateway should own API behavior, tenancy, policy authority, and provider +state. Interceptors are appropriate for additional business logic around gateway +operations; drivers are appropriate when OpenShell needs a different +implementation of compute functionality. + +### Use HTTP webhooks + +OpenShell could model interceptors as HTTP webhooks with JSON request and response +payloads. + +This is familiar to Kubernetes users, but OpenShell already uses protobuf and +gRPC heavily. A protobuf gRPC contract avoids a second wire format for gateway +extension points, works over Unix domain sockets for local integrations, and +matches the gateway's existing service boundaries. diff --git a/rfc/0006-gateway-interceptors/policy-governance-example.md b/rfc/0006-gateway-interceptors/policy-governance-example.md new file mode 100644 index 000000000..4fdaf3fa8 --- /dev/null +++ b/rfc/0006-gateway-interceptors/policy-governance-example.md @@ -0,0 +1,70 @@ +# Policy Governance Example + +This companion note is non-normative. 
It shows how one organization policy +interceptor service could expose several bindings through its manifest, so a +gateway operator can configure one service and selectively override behavior. + +The service is not special gateway integration. It uses the same +gRPC-over-TCP or gRPC-over-UDS contract available to external users. + +## Example Service Bindings + +### System Policy Authority + +Reject sandbox-scoped policy creation, update, merge, or delete when an +operator-configured gateway policy is authoritative. Optionally inject the +default policy into sandbox creation when no policy is supplied. + +This complements the existing global policy behavior. Global policy override +controls effective sandbox config; this interceptor service makes custom policy +submission fail at the API boundary instead of being silently overridden later. + +### External Policy Authority Verifier + +Validate global or sandbox-scoped policy writes against an external authority +before the gateway persists them. The external authority might verify a policy +bundle signature, check that a submitted policy was approved by an internal +control plane, or compare policy metadata against an organization-owned source. + +This is a write-time validation path. Accepted policy state is still persisted +in the gateway DB and runtime paths continue to read gateway-owned state. If +the external authority is unavailable, the configured failure policy determines +whether the write is rejected or allowed with audit warnings. + +### Driver Config Validator + +Validate `SandboxTemplate.driver_config` before it reaches a driver. This can +enforce allowed keys, exact payloads, forbidden annotations, resource ceilings, +or driver-specific profiles. + +Example: + +```yaml +driver: kubernetes +allowed_keys: + - nodeSelector + - tolerations +required_payload: + runtimeClassName: nvidia +``` + +### User Sandbox Quota + +Reject `CreateSandbox` when a principal already has too many active sandboxes. 
+The initial version may use the existing store list path for single-replica +deployments. A later HA-safe version should use a quota lease or counter with +database compare-and-swap or a transaction. + +The rejection code should be `resource_exhausted`. + +### Sandbox Name Prefix + +Require sandbox names to start with a configured prefix. Generated names may be +modified. User-supplied names should be rejected rather than silently changed. + +Example: + +```text +generated: bright-lake -> nvidia-bright-lake +supplied: demo -> reject, expected prefix nvidia- +``` From 3c0335e088d7547b9318d54e8dacdc9bbf2a42a9 Mon Sep 17 00:00:00 2001 From: Drew Newberry Date: Tue, 16 Jun 2026 00:05:44 -0700 Subject: [PATCH 02/10] docs(rfc): clarify gateway interceptors proposal Signed-off-by: Drew Newberry --- rfc/0006-gateway-interceptors/README.md | 130 ++++++++++++------ .../policy-governance-example.md | 2 +- 2 files changed, 88 insertions(+), 44 deletions(-) diff --git a/rfc/0006-gateway-interceptors/README.md b/rfc/0006-gateway-interceptors/README.md index f63c1ae7f..ea0fc7800 100644 --- a/rfc/0006-gateway-interceptors/README.md +++ b/rfc/0006-gateway-interceptors/README.md @@ -40,7 +40,8 @@ systems during lookup. ## Motivation -OpenShell already has several centralized control-plane choke points: +OpenShell already has several centralized control-plane paths where the gateway +has enough context to enforce deployment-specific policy: - Sandbox creation validates requests, defaults images, validates policy safety, persists a sandbox object, and provisions through the selected driver. @@ -52,10 +53,10 @@ OpenShell already has several centralized control-plane choke points: before the translated `DriverSandbox` reaches the compute driver. These are the right places for operator-specific control, but today those -controls must be implemented directly in OpenShell code. 
That does not scale to -organizational requirements such as: +controls must be implemented directly in OpenShell code. That does not scale +for organizational requirements such as: -- Vend policies and providers from an external source by writing them through +- Sync policies and providers from an external source by writing them through existing provider, provider profile, and config APIs. - Enforce one system-wide sandbox policy and reject custom sandbox policies. - Verify policy writes against an external authority before accepting them. @@ -87,12 +88,26 @@ business logic around resource operations: defaulting, validation, rejection, and audit. Replacing how core functionality is implemented remains the role of drivers and other provider-style interfaces. +The design keeps three boundaries intact: + +- The gateway database remains the system of record for gateway-owned state. +- Existing gateway and driver validation still run after interceptor + modification. +- External systems integrate through writes to OpenShell APIs, not live lookup + calls on runtime paths. + ### Operation interceptors -Operation interceptors run in request handling paths. They may modify a request or -object only in modification phases. They may reject in validation phases. They may +An operation interceptor runs during a gateway operation, such as creating a +sandbox, importing provider profiles, updating policy, or translating a +sandbox request into driver-facing configuration. It may modify a request or +object only in modification phases. It may reject in validation phases. It may attach warnings and audit annotations in all phases. +Interceptor services expose one or more bindings. A binding is a +service-declared rule that maps the service to phases, resources, operations, +and selectors. The gateway uses bindings to decide when to call the service. + Operation interceptors should work for all gateway operations, not a hand-maintained subset. 
Each operation exposes stable interceptor metadata: @@ -130,6 +145,32 @@ Gateway runtime paths read this state from the gateway store. If an external catalog or controller is unavailable, the gateway continues using the last accepted state already persisted in the DB. +External systems integrate by reconciling desired state through existing +OpenShell APIs. The gateway validates and persists those writes, then runtime +paths read the persisted state. + +```mermaid +flowchart LR + External[External catalog/controller] --> API[Existing OpenShell API] + API --> Interceptors[Operation interceptors] + Interceptors --> Validate[OpenShell validation] + Validate --> Store[Gateway DB] + Store --> Runtime[Gateway runtime reads] +``` + +Provider profile sync should use the existing provider profile import API. +Provider sync should use the existing provider create/update APIs. + +Policy sync should use the existing global and sandbox-scoped config APIs. +Managed deployments that want an authoritative global policy can set the global +policy through `UpdateConfig --global` and use operation interceptors to reject +sandbox-scoped policy changes. + +Ownership and provenance should use existing metadata surfaces where available, +such as labels on objects and config fields on provider records. The gateway DB +record is still authoritative; provenance explains how the current desired +state arrived. + This RFC does not introduce new gateway resource kinds for quotas, name policies, policy bundles, or driver config policy. Those concerns can be enforced by interceptor services and normal gateway configuration. If @@ -148,6 +189,25 @@ Operation phases are ordered. Later phases see the result of earlier phases. | `validate_driver` | no | Enforce driver-facing policy after translation to `DriverSandbox`. | | `post_commit` | no | Emit audit or notify external systems after successful persistence or provisioning. 
| +For `CreateSandbox`, the phases fit into the existing gateway flow like this: + +```text +authenticate request +validate raw field sizes and labels +pre_request interceptors +load gateway-owned providers, policy, and settings +gateway defaulting from stored state +modify_object interceptors +gateway invariant validation +validate_object interceptors +translate to DriverSandbox +validate_driver interceptors +compute driver validation +persist sandbox +driver create +post_commit interceptors +``` + Gateway invariants run after modification so interceptors cannot leave invalid objects in the system. Driver validation still runs after interceptors so drivers remain the authority for driver-owned schemas. @@ -190,7 +250,7 @@ message InterceptorRequestContext { ``` The interceptor response returns an allow/deny decision, optional patches, and -operator-visible metadata for operation interceptors. +diagnostic metadata for operation interceptors. ```proto message InterceptorDecision { @@ -250,6 +310,14 @@ message InterceptorBinding { int32 order = 5; bool modifies = 6; string default_failure_policy = 7; + InterceptorSelector selector = 8; +} + +message InterceptorSelector { + repeated string principal_kinds = 1; + repeated string principal_groups = 2; + map<string, string> labels = 3; + repeated string compute_drivers = 4; } ``` @@ -259,6 +327,10 @@ bindings when they need to disable, narrow, or reorder behavior. Overrides should only narrow service-declared selectors unless a future RFC explicitly allows expansion. +Empty selector fields match all values. For example, a binding with no +`compute_drivers` selector can run for all drivers, while a gateway override can +narrow it to only `kubernetes`. 
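
The selector rule above ("empty selector fields match all values") can be sketched in Rust. This is an illustration under assumptions the RFC does not state: the `Selector` and `RequestFacts` types and the `selector_matches` function are hypothetical, a principal is assumed to match when it belongs to any one of the listed groups, and every selector label is assumed to require an exact key and value match on the request.

```rust
use std::collections::BTreeMap;

/// Rust mirror of the `InterceptorSelector` message fields.
/// (Hypothetical type; not generated code.)
#[derive(Debug, Default, Clone)]
pub struct Selector {
    pub principal_kinds: Vec<String>,
    pub principal_groups: Vec<String>,
    pub labels: BTreeMap<String, String>,
    pub compute_drivers: Vec<String>,
}

/// The request attributes a selector is evaluated against.
/// (Hypothetical type; field names are assumptions.)
pub struct RequestFacts<'a> {
    pub principal_kind: &'a str,
    pub principal_groups: &'a [&'a str],
    pub labels: &'a BTreeMap<String, String>,
    pub compute_driver: &'a str,
}

/// An empty list matches every value; otherwise the value must be listed.
fn list_matches(allowed: &[String], value: &str) -> bool {
    allowed.is_empty() || allowed.iter().any(|a| a == value)
}

/// Returns true when every non-empty selector field accepts the request.
pub fn selector_matches(sel: &Selector, req: &RequestFacts<'_>) -> bool {
    // Assumption: membership in any one listed group is enough.
    let groups_ok = sel.principal_groups.is_empty()
        || sel
            .principal_groups
            .iter()
            .any(|want| req.principal_groups.iter().any(|have| *have == want.as_str()));

    // Assumption: all selector labels must be present with equal values.
    // An empty label map passes because `all` over nothing is true.
    let labels_ok = sel
        .labels
        .iter()
        .all(|(k, v)| req.labels.get(k) == Some(v));

    list_matches(&sel.principal_kinds, req.principal_kind)
        && groups_ok
        && labels_ok
        && list_matches(&sel.compute_drivers, req.compute_driver)
}
```

Under this sketch a gateway override that narrows a binding to `kubernetes` is just a selector with `compute_drivers = ["kubernetes"]`, evaluated in addition to the service-declared selector.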
+ Example: ```toml @@ -286,7 +358,7 @@ enabled = false [[interceptors.overrides]] binding = "driver-config-validation" failure_policy = "fail_closed" -match = { compute_driver = "kubernetes" } +match = { compute_drivers = ["kubernetes"] } [[interceptors.overrides]] binding = "policy-authority" @@ -333,40 +405,6 @@ Defaults: Every interceptor service has a timeout and response size limit. Operation interceptor bindings also have a maximum patch count. -### Worked examples - -See [policy-governance-example.md](policy-governance-example.md) for a -non-normative example of an organization policy interceptor service with -multiple service-declared bindings and gateway-side overrides. - -### External reconciliation through existing APIs - -External systems integrate by reconciling desired state through existing -OpenShell APIs. The gateway validates and persists those writes, then runtime -paths read the persisted state. - -```mermaid -flowchart LR - External[External catalog/controller] --> API[Existing OpenShell API] - API --> Interceptors[Operation interceptors] - Interceptors --> Validate[OpenShell validation] - Validate --> Store[Gateway DB] - Store --> Runtime[Gateway runtime reads] -``` - -Provider profile sync should use the existing provider profile import API. -Provider sync should use the existing provider create/update APIs. - -Policy sync should use the existing global and sandbox-scoped config APIs. -Managed deployments that want an authoritative global policy can set the global -policy through `UpdateConfig --global` and use operation interceptors to reject -sandbox-scoped policy changes. - -Ownership and provenance should use existing metadata surfaces where available, -such as labels on objects and config fields on provider records. The gateway DB -record is still authoritative; provenance explains how the current desired -state arrived. - ### Gateway info surface The first version should not add a dedicated interceptor management API or CLI. 
@@ -374,7 +412,7 @@ Interceptor configuration remains gateway-local configuration. The existing gateway info command may expose a read-only summary of configured interceptor services, enabled bindings, effective failure policies, and last -observed health. That is sufficient for operator visibility in this RFC. +observed health. That is sufficient for operational visibility in this RFC. ### Observability and audit @@ -414,6 +452,12 @@ Rules: - Interceptor services cannot replace built-in validation. Imported profiles and policies are validated before use. +### Worked examples + +See [policy-governance-example.md](policy-governance-example.md) for a +non-normative example of an organization policy interceptor service with +multiple service-declared bindings and gateway-side overrides. + ## Implementation plan 1. Add a `crates/openshell-interceptors` crate with shared interceptor diff --git a/rfc/0006-gateway-interceptors/policy-governance-example.md b/rfc/0006-gateway-interceptors/policy-governance-example.md index 4fdaf3fa8..4f10a7834 100644 --- a/rfc/0006-gateway-interceptors/policy-governance-example.md +++ b/rfc/0006-gateway-interceptors/policy-governance-example.md @@ -4,7 +4,7 @@ This companion note is non-normative. It shows how one organization policy interceptor service could expose several bindings through its manifest, so a gateway operator can configure one service and selectively override behavior. -The service is not special gateway integration. It uses the same +The service is not a special gateway integration. It uses the same gRPC-over-TCP or gRPC-over-UDS contract available to external users. 
## Example Service Bindings From 3020acb84737ecf83b2d16864aeb0e76a5414469 Mon Sep 17 00:00:00 2001 From: Drew Newberry Date: Tue, 16 Jun 2026 00:19:28 -0700 Subject: [PATCH 03/10] docs(rfc): clarify interceptor source of truth Signed-off-by: Drew Newberry --- rfc/0006-gateway-interceptors/README.md | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/rfc/0006-gateway-interceptors/README.md b/rfc/0006-gateway-interceptors/README.md index ea0fc7800..e0a0352fa 100644 --- a/rfc/0006-gateway-interceptors/README.md +++ b/rfc/0006-gateway-interceptors/README.md @@ -76,6 +76,8 @@ management, or naming conventions. safety validation, or driver schema validation. - Moving sandbox runtime enforcement out of the sandbox supervisor and proxy. - Replacing the gateway database as the system of record. +- Adding new first-class gateway resource kinds for quotas, name policies, + policy bundles, or driver config policy. ## Proposal @@ -125,7 +127,7 @@ This lets OpenShell add deployment-specific business logic around the resource operations it already supports while keeping runtime reads local and deterministic. -### Gateway state +### Source of truth and reconciliation External systems should not participate in live gateway lookup paths. Instead, they run controllers or sync jobs that write desired state through existing @@ -171,12 +173,6 @@ such as labels on objects and config fields on provider records. The gateway DB record is still authoritative; provenance explains how the current desired state arrived. -This RFC does not introduce new gateway resource kinds for quotas, name -policies, policy bundles, or driver config policy. Those concerns can be -enforced by interceptor services and normal gateway configuration. If -OpenShell later needs first-class resources for them, that should be a separate -RFC. - ### Operation phases Operation phases are ordered. Later phases see the result of earlier phases. 
From 3bdf531624715b1dfb258d44870c0005f341558e Mon Sep 17 00:00:00 2001 From: Drew Newberry Date: Tue, 23 Jun 2026 00:06:09 -0700 Subject: [PATCH 04/10] docs(rfc): refine gateway interceptor proposal Signed-off-by: Drew Newberry --- rfc/0006-gateway-interceptors/README.md | 525 +++++++++--------- .../policy-governance-example.md | 70 --- 2 files changed, 257 insertions(+), 338 deletions(-) delete mode 100644 rfc/0006-gateway-interceptors/policy-governance-example.md diff --git a/rfc/0006-gateway-interceptors/README.md b/rfc/0006-gateway-interceptors/README.md index e0a0352fa..78177ab6a 100644 --- a/rfc/0006-gateway-interceptors/README.md +++ b/rfc/0006-gateway-interceptors/README.md @@ -10,64 +10,65 @@ links: ## Summary -This RFC proposes a first-class Gateway Interceptors system for OpenShell. -Interceptors let operators and external integrators customize gateway API -behavior without forking the gateway or adding special cases to compute -drivers. +Operators and external integrators need a flexible way to customize gateway API +behavior to fit their own requirements — for example, enforcing tenancy, +quotas, naming conventions, or policy authority. Today any such customization +has to be hardcoded into gateway handlers or pushed into drivers, which mixes +responsibilities and does not scale to deployment-specific requirements. + +This RFC proposes a first-class extension system that lets external services +observe, modify, validate, reject, or audit gateway operations at well-defined +phases. We call these **Gateway Interceptors**. Interceptors and drivers serve different extension needs. Interceptors add business logic around gateway operations. Drivers replace or provide implementation for platform functionality, such as how sandboxes are provisioned on Docker, Kubernetes, or VMs. -Gateway Interceptors is the umbrella name for gateway extension points. 
This -RFC defines one interceptor role: - -- **Operation interceptors** observe, modify, validate, reject, or audit gateway - operations at well-defined phases. +This RFC scopes interceptors to gateway API operations. An interceptor can +observe, modify, validate, reject, or audit a gateway operation at well-defined +phases. Future RFCs may extend interceptors to other gateway functionality, +such as event-driven or workflow behavior, but that is out of scope for this +first implementation. -Future RFCs may define event-driven or workflow interceptors under the same -umbrella, but they are out of scope for this first implementation. +Drivers continue to own platform implementation — how gateway functionality is +actually provided. Interceptors own gateway-level governance for resource +writes: tenancy, quotas, naming, policy authority, and driver configuration +restrictions. -Compute drivers continue to own compute-platform provisioning. Interceptors own -gateway-level policy for resource writes: tenancy, quotas, naming, policy -authority, and driver configuration restrictions. - -The gateway database remains the system of record. External systems integrate -by writing through existing OpenShell APIs that persist into the gateway DB. -Gateway runtime paths read gateway-owned state; they do not call external -systems during lookup. +The gateway database remains the system of record. Interceptors add governance +around gateway operations; they do not replace gateway-owned state. ## Motivation -OpenShell already has several centralized control-plane paths where the gateway -has enough context to enforce deployment-specific policy: +Operators running OpenShell in their own environments need to apply +deployment-specific rules to gateway operations that core OpenShell does not +encode. Examples include: -- Sandbox creation validates requests, defaults images, validates policy - safety, persists a sandbox object, and provisions through the selected driver. 
-- Policy and runtime settings are resolved through gateway APIs before they are - delivered to sandbox supervisors. -- Provider profiles and provider records are stored and resolved by the - gateway. -- Driver-specific `SandboxTemplate.driver_config` is selected by the gateway - before the translated `DriverSandbox` reaches the compute driver. - -These are the right places for operator-specific control, but today those -controls must be implemented directly in OpenShell code. That does not scale -for organizational requirements such as: - -- Sync policies and providers from an external source by writing them through - existing provider, provider profile, and config APIs. +- Sync policies and providers from an external source of truth. - Enforce one system-wide sandbox policy and reject custom sandbox policies. - Verify policy writes against an external authority before accepting them. - Restrict driver configuration payloads to an approved schema or fixed value. - Limit each user to a maximum number of running sandboxes. -- Require sandbox names to follow an organization prefix, such as `nvidia-`. -These examples are gateway policy, not compute-driver behavior. A compute -driver can validate whether a pod, container, or VM can be provisioned. It -should not own tenant quotas, global policy authority, provider resource -management, or naming conventions. +These are not core OpenShell semantics. They vary per deployment, and the set +changes over time, so they are not a good fit for a fixed set of built-in +options. + +OpenShell already extends two of its subsystems. Drivers (RFC 0001) provide +implementations for the platform and infrastructure layer. Sandbox egress +middleware (RFC 0009) runs in the supervisor proxy and governs what an agent's +outbound requests may carry. 
Interceptors complete this pattern for the gateway +control plane: an extension point for the API operations themselves, where +deployment-specific rules like tenant quotas, policy authority, and naming +belong. + +Some of these may ship as built-in gateway defaults over time. Interceptors do +not replace that — they let a deployment extend or override built-in defaults +when its rules differ, without waiting on an upstream change. + +Without a dedicated mechanism, operators carry these rules as gateway forks or +local patches. ## Non-goals @@ -81,172 +82,110 @@ management, or naming conventions. ## Proposal -Add a gateway interceptor framework with explicit phases, resource selectors, +Add a gateway interceptor framework with explicit phases, RPC method selectors, deterministic ordering, bounded execution, audit logging, and conservative failure behavior. Interceptors do not replace gateway functionality. They add governance and -business logic around resource operations: defaulting, validation, rejection, +business logic around gateway operations: defaulting, validation, rejection, and audit. Replacing how core functionality is implemented remains the role of drivers and other provider-style interfaces. -The design keeps three boundaries intact: +The design keeps two boundaries intact: - The gateway database remains the system of record for gateway-owned state. - Existing gateway and driver validation still run after interceptor modification. -- External systems integrate through writes to OpenShell APIs, not live lookup - calls on runtime paths. -### Operation interceptors +### Gateway API interceptors -An operation interceptor runs during a gateway operation, such as creating a -sandbox, importing provider profiles, updating policy, or translating a -sandbox request into driver-facing configuration. It may modify a request or -object only in modification phases. It may reject in validation phases. It may -attach warnings and audit annotations in all phases. 
+A gateway API interceptor runs during a gateway API operation, such as creating a +sandbox, importing provider profiles, updating policy, or applying sandbox +configuration. It may modify an RPC request or operation input only in +modification phases. It may reject in validation phases. It may attach warnings +and audit annotations in all phases. Interceptor services expose one or more bindings. A binding is a -service-declared rule that maps the service to phases, resources, operations, -and selectors. The gateway uses bindings to decide when to call the service. - -Operation interceptors should work for all gateway operations, not a -hand-maintained subset. Each operation exposes stable interceptor metadata: - -- `resource`: the logical resource being operated on, such as a sandbox, - provider, provider profile, policy/config object, or internal driver-facing - sandbox request. -- `operation`: the action being performed, such as create, update, delete, - attach, detach, import, merge, validate, or another domain operation. - -The gateway should derive this metadata from the operation being handled rather -than checking it against a fixed allowlist. New gateway operations should enter -the interceptor pipeline by default when they are added. - -This lets OpenShell add deployment-specific business logic around the resource +service-declared rule that maps the service to phases, gateway RPC methods, and +selectors. The gateway uses bindings to decide when to call the service. + +The public gRPC service and method identify the API operation. The v1 selector +vocabulary uses fully qualified RPC names, for example +`openshell.v1.OpenShell/CreateSandbox`. This keeps binding configuration tied +to the public API operators already know and avoids another compatibility +surface. + +All interceptable gateway API RPCs run through the same standard phase pipeline. 
+The gateway rejects interceptor bindings that reference unknown RPCs for the +running gateway version, unless the RPC selector is empty to match all +interceptable RPCs. + +Gateway API interceptors should work for all relevant gateway RPCs, not a +hand-maintained subset. New gateway RPCs should enter the interceptor pipeline by +using the shared gateway API execution path, not by adding per-RPC interceptor +hooks or updating a separate allowlist. RPCs may opt out only when they are not +gateway API operations in scope for this RFC, such as low-level streaming or +supervisor-internal calls, and the opt-out should be explicit in code review. + +This lets OpenShell add deployment-specific business logic around the gateway operations it already supports while keeping runtime reads local and deterministic. -### Source of truth and reconciliation - -External systems should not participate in live gateway lookup paths. Instead, -they run controllers or sync jobs that write desired state through existing -OpenShell APIs. - -Examples of existing DB-backed state include: - -| State | Existing API surface | -|---|---| -| Sandboxes | `CreateSandbox`, `DeleteSandbox`, sandbox provider attach/detach | -| Providers | `CreateProvider`, `UpdateProvider`, `DeleteProvider` | -| Provider profiles | `ImportProviderProfiles`, `DeleteProviderProfile` | -| Sandbox policy and settings | `UpdateConfig`, policy history/status APIs | -| Gateway-global config | `UpdateConfig --global`, gateway settings APIs | - -Gateway runtime paths read this state from the gateway store. If an external -catalog or controller is unavailable, the gateway continues using the last -accepted state already persisted in the DB. - -External systems integrate by reconciling desired state through existing -OpenShell APIs. The gateway validates and persists those writes, then runtime -paths read the persisted state. 
- -```mermaid -flowchart LR - External[External catalog/controller] --> API[Existing OpenShell API] - API --> Interceptors[Operation interceptors] - Interceptors --> Validate[OpenShell validation] - Validate --> Store[Gateway DB] - Store --> Runtime[Gateway runtime reads] -``` - -Provider profile sync should use the existing provider profile import API. -Provider sync should use the existing provider create/update APIs. - -Policy sync should use the existing global and sandbox-scoped config APIs. -Managed deployments that want an authoritative global policy can set the global -policy through `UpdateConfig --global` and use operation interceptors to reject -sandbox-scoped policy changes. - -Ownership and provenance should use existing metadata surfaces where available, -such as labels on objects and config fields on provider records. The gateway DB -record is still authoritative; provenance explains how the current desired -state arrived. - ### Operation phases -Operation phases are ordered. Later phases see the result of earlier phases. - -| Phase | Modification allowed | Purpose | -|---|---:|---| -| `pre_request` | yes | Normalize or reject the raw API request after auth and basic size limits. | -| `modify_object` | yes | Apply defaults to the gateway object after standard request parsing. | -| `validate_object` | no | Enforce object-level policy before persistence. | -| `validate_driver` | no | Enforce driver-facing policy after translation to `DriverSandbox`. | -| `post_commit` | no | Emit audit or notify external systems after successful persistence or provisioning. 
| - -For `CreateSandbox`, the phases fit into the existing gateway flow like this: - -```text -authenticate request -validate raw field sizes and labels -pre_request interceptors -load gateway-owned providers, policy, and settings -gateway defaulting from stored state -modify_object interceptors -gateway invariant validation -validate_object interceptors -translate to DriverSandbox -validate_driver interceptors -compute driver validation -persist sandbox -driver create -post_commit interceptors -``` +Operation phases are ordered. Later phases see the result of earlier phases. All +interceptable gateway API RPCs use the same phases in the same order so +interceptor authors and operators do not need per-RPC phase rules. + +| Phase | Modification allowed | Purpose | Examples | +|---|---:|---|---| +| `pre_request` | yes | Normalize or reject the RPC request after auth and basic size limits. | Normalize labels, require a sandbox name prefix, or reject requests with unsupported request fields. | +| `modify_operation` | yes | Apply defaults or controlled changes after the gateway prepares the operation input. | Stamp a default sandbox policy, select a provider profile, or clamp resource limits to deployment defaults. | +| `validate` | no | Enforce deployment-specific rules before persistence, provisioning, or other side effects. | Enforce tenant quotas, reject policy updates that allow internet egress, or verify driver config against an approved schema. | +| `post_commit` | no | Emit audit or notify external systems after successful persistence or provisioning. | Send audit records, notify an inventory system, or trigger a reconciliation job after a successful write. | Gateway invariants run after modification so interceptors cannot leave invalid -objects in the system. Driver validation still runs after interceptors so -drivers remain the authority for driver-owned schemas. +objects in the system. 
Operation-specific built-in validation, including driver +validation where applicable, remains part of the gateway-owned execution path so +drivers stay the authority for driver-owned schemas. ### Interceptor request contract -The interceptor request should be stable and resource-oriented, not tied to Rust -handler internals. +The interceptor request should be stable and tied to the public gateway API, not +to Rust handler internals. ```proto -message InterceptorReview { +message InterceptorEvaluation { string api_version = 1; string interceptor_name = 2; string binding_id = 3; string phase = 4; - string resource = 5; - string operation = 6; + string rpc_service = 5; + string rpc_method = 6; - InterceptorPrincipal principal = 7; - InterceptorRequestContext context = 8; + string principal = 7; + map<string, string> context = 8; - google.protobuf.Struct object = 9; - google.protobuf.Struct old_object = 10; - google.protobuf.Struct request = 11; + google.protobuf.Struct operation_input = 9; + google.protobuf.Struct existing_state = 10; + google.protobuf.Struct rpc_request = 11; } +``` -message InterceptorPrincipal { - string kind = 1; // user, service, sandbox - string subject = 2; - repeated string groups = 3; -} +The `rpc_service` and `rpc_method` fields are the split form of the fully +qualified RPC selector used by bindings. For example, +`openshell.v1.OpenShell/CreateSandbox` becomes +`rpc_service = "openshell.v1.OpenShell"` and +`rpc_method = "CreateSandbox"`. -message InterceptorRequestContext { - string request_id = 1; - string gateway_replica_id = 2; - string compute_driver = 3; - bool dry_run = 4; - map<string, string> labels = 5; -} -``` +The payload fields are phase-scoped. `rpc_request` is the raw gateway RPC +payload available to `pre_request`. `operation_input` is the gateway-prepared +input available after state loading and defaulting; it is the main payload for +`modify_operation`, `validate`, and `post_commit`.
`existing_state` is populated +only when the operation has prior gateway-owned state. The interceptor response returns an allow/deny decision, optional patches, and -diagnostic metadata for operation interceptors. +diagnostic metadata for gateway API interceptors. ```proto message InterceptorDecision { @@ -259,36 +198,37 @@ message InterceptorDecision { } ``` -Only modification phases accept patches. A validation interceptor that returns -patches is a configuration error. +Only modification phases accept patches. `pre_request` patches apply to +`rpc_request`; `modify_operation` patches apply to `operation_input`. +`validate` and `post_commit` interceptors that return patches are configuration +errors. The `binding_id` is owned by the interceptor service. It identifies the -service-declared binding that selected the review. +service-declared binding that selected the evaluation. ### Interceptor endpoints -The framework supports one service protocol with two transports. The gateway -detects the transport from the interceptor endpoint URI: +The framework uses one protobuf/gRPC service contract. The gateway derives the +endpoint type and TLS mode from the interceptor endpoint URI: - `grpc://host:port` connects to a plaintext gRPC interceptor service over TCP. - `grpcs://host:port` connects to a TLS-protected gRPC interceptor service over TCP. - `unix:///path/to/socket` connects to a gRPC interceptor service over a Unix domain socket. -Both transports use the same protobuf service contract. Unix domain sockets are -the preferred local deployment shape because they avoid exposing a network -listener and can rely on filesystem permissions. TCP is for interceptors that run as -separate services or outside the gateway host. +Remote gRPC interceptors require authentication. The exact configuration shape +is out of scope for this RFC, but the implementation should support mTLS and +bearer-token authentication. 
### Selection and ordering Selection should be oriented around interceptor services, not individual -phase/resource routes. Operators should normally configure a small number of +phase/RPC routes. Operators should normally configure a small number of interceptor services and service-specific settings. The service tells the -gateway which operation bindings it supports. +gateway which RPC bindings it supports. -A configured `[[interceptors]]` entry represents one interceptor service -instance. During gateway startup or config reload, the gateway calls a +A `[[interceptors]]` table in the gateway config TOML represents one interceptor +service instance. During gateway startup or config reload, the gateway calls a `Describe` RPC on the service. The response describes the service's default bindings: @@ -301,19 +241,16 @@ message InterceptorManifest { message InterceptorBinding { string id = 1; repeated string phases = 2; - repeated string resources = 3; - repeated string operations = 4; - int32 order = 5; - bool modifies = 6; - string default_failure_policy = 7; - InterceptorSelector selector = 8; + repeated string rpcs = 3; + int32 order = 4; + bool modifies = 5; + string default_failure_policy = 6; + InterceptorSelector selector = 7; } message InterceptorSelector { - repeated string principal_kinds = 1; - repeated string principal_groups = 2; - map<string, string> labels = 3; - repeated string compute_drivers = 4; + repeated string principals = 1; + map<string, string> labels = 2; } ``` @@ -323,52 +260,25 @@ bindings when they need to disable, narrow, or reorder behavior. Overrides should only narrow service-declared selectors unless a future RFC explicitly allows expansion. -Empty selector fields match all values. For example, a binding with no -`compute_drivers` selector can run for all drivers, while a gateway override can -narrow it to only `kubernetes`. +Empty selector fields match all values.
A gateway override can narrow a +service-declared selector, such as limiting a binding to a specific RPC. -Example: +Gateway config example for a remote policy provider: ```toml [[interceptors]] -name = "org-controls" -order = 100 +name = "policy-provider" +endpoint = "grpcs://policy-provider.example.com:8443" failure_policy = "fail_closed" -endpoint = "unix:///run/openshell/interceptors/org-controls.sock" timeout = "500ms" - -[interceptors.config] -sandbox_name_prefix = "nvidia-" -generated_sandbox_names_only = true -max_running_sandboxes_per_user = 10 -system_policy_authority = true -policy_authority_endpoint = "grpcs://policy-control.example.com:8443" - -[interceptors.config.driver_config.kubernetes.required_payload] -runtimeClassName = "nvidia" - -[[interceptors.overrides]] -binding = "provider-profile-governance" -enabled = false - -[[interceptors.overrides]] -binding = "driver-config-validation" -failure_policy = "fail_closed" -match = { compute_drivers = ["kubernetes"] } - -[[interceptors.overrides]] -binding = "policy-authority" -order = 90 -match = { operations = ["update", "merge", "delete"] } ``` The service manifest keeps common configuration terse. Operators do not need to -know that sandbox prefix behavior runs at `modify_object` while driver config -behavior runs at `validate_driver`; the service exposes those bindings. +know which phase each behavior runs in; the service exposes those bindings. The gateway builds an execution plan from enabled bindings. Selection evaluates -the service-declared resource, operation, phase, principal, label, and driver -selectors, then applies gateway-configured narrowing overrides. +the service-declared RPC, phase, principal, and label selectors, then applies +gateway-configured narrowing overrides. Interceptors run in fixed phase order. Within a phase, matching bindings run by this deterministic ordering: @@ -398,18 +308,9 @@ Defaults: - Modifying and validating bindings default to `fail_closed`. 
- `post_commit` bindings default to `ignore`. -Every interceptor service has a timeout and response size limit. Operation +Every interceptor service has a timeout and response size limit. Gateway API interceptor bindings also have a maximum patch count. -### Gateway info surface - -The first version should not add a dedicated interceptor management API or CLI. -Interceptor configuration remains gateway-local configuration. - -The existing gateway info command may expose a read-only summary of configured -interceptor services, enabled bindings, effective failure policies, and last -observed health. That is sufficient for operational visibility in this RFC. - ### Observability and audit Every interceptor decision should emit structured gateway logs with: @@ -417,8 +318,8 @@ Every interceptor decision should emit structured gateway logs with: - interceptor name. - binding ID. - phase. -- resource and operation. -- principal subject. +- RPC service and method. +- principal. - decision. - reason. - latency. @@ -430,29 +331,102 @@ Security-relevant denials should be emitted as OCSF detection findings or configuration/security events, depending on the event class. Non-security operational failures can use plain tracing. -### Security model +### Example: remote policy provider -Interceptor services run outside the gateway trust boundary. The gateway must -continue to enforce first-party invariants after interceptor modification. +An interceptor should start from the invariant it wants to preserve, then find +every gateway API RPC that can establish or weaken that invariant. For example, +an operator may want a remote policy provider to be the authority for sandbox +policy decisions. -Rules: +Two RPCs matter for this invariant: -- Interceptors receive only the fields needed for their phase. -- `grpcs://` endpoints use TLS and should be required for remote interceptor services. 
-- `grpc://` endpoints are plaintext and should be limited to loopback or - explicitly trusted local networks. -- UDS interceptor services rely on filesystem permissions and should be owned by the - gateway operator. -- Interceptor service responses are bounded by timeout and body size. - Operation interceptor patches are also bounded by patch count. -- Interceptor services cannot replace built-in validation. Imported profiles and - policies are validated before use. +- `openshell.v1.OpenShell/CreateSandbox` establishes the initial sandbox policy. +- `openshell.v1.OpenShell/UpdateConfig` changes sandbox or global policy. -### Worked examples +The interceptor service declares one binding to apply an approved initial policy +and another to guard later policy changes: -See [policy-governance-example.md](policy-governance-example.md) for a -non-normative example of an organization policy interceptor service with -multiple service-declared bindings and gateway-side overrides. +```proto +InterceptorManifest { + api_version: "v1" + bindings: [ + { + id: "sandbox-policy-default" + phases: ["modify_operation"] + rpcs: ["openshell.v1.OpenShell/CreateSandbox"] + modifies: true + default_failure_policy: "fail_closed" + }, + { + id: "policy-authority" + phases: ["validate"] + rpcs: ["openshell.v1.OpenShell/UpdateConfig"] + modifies: false + default_failure_policy: "fail_closed" + } + ] +} +``` + +The handler can then focus on the phase and RPC method that selected the +binding: + +```rust +// Toy implementation of the InterceptorService evaluate RPC. +async fn evaluate(&self, req: InterceptorEvaluation) -> InterceptorDecision { + match (req.rpc_method.as_str(), req.phase.as_str()) { + // CreateSandbox: ask the remote policy provider for the approved + // initial policy and stamp it into the prepared operation input. 
+ ("CreateSandbox", "modify_operation") => { + let approved_policy = self.policy_provider.initial_policy(&req).await; + + InterceptorDecision::allow().with_patch(JsonPatch::replace( + "/policy", + approved_policy, + )) + } + + // UpdateConfig: reject policy writes the remote provider does not approve. + ("UpdateConfig", "validate") => { + let decision = self.policy_provider.validate_update(&req).await; + if !decision.allowed { + return InterceptorDecision::reject( + "PERMISSION_DENIED", + decision.reason, + ); + } + + InterceptorDecision::allow() + } + + // The service should only receive bound RPCs, but defaulting to allow + // keeps the handler safe if the manifest grows later. + _ => InterceptorDecision::allow(), + } +} +``` + +The gateway config can stay small because the service manifest declares the +bindings: + +```toml +[[interceptors]] +name = "policy-provider" +endpoint = "grpcs://policy-provider.example.com:8443" +failure_policy = "fail_closed" +timeout = "500ms" +``` + +This example illustrates the general interceptor design loop: + +- Start with the invariant, then identify every RPC that can establish or weaken + it. +- Pick the phase by intent: `modify_operation` to apply an approved initial + policy and `validate` to reject unauthorized later changes. +- Use `fail_closed` because policy authority is a control-plane security + boundary. +- Keep gateway validation after the interceptor so built-in policy safety checks + still run. ## Implementation plan @@ -465,18 +439,13 @@ multiple service-declared bindings and gateway-side overrides. startup or config reload. 4. Build an execution plan from service manifests plus gateway-configured overrides. -5. Wire interceptor execution into the gateway operation pipeline so all - gateway operations can pass through `pre_request`, `modify_object`, - `validate_object`, `validate_driver`, and `post_commit` where applicable. -6. 
Add example service bindings for the policy governance workflows described - in [policy-governance-example.md](policy-governance-example.md). -7. Audit existing gateway operations and route each resource-affecting path +5. Wire interceptor execution into the gateway API operation pipeline so all + gateway operations can pass through `pre_request`, `modify_operation`, + `validate`, and `post_commit` where applicable. +6. Audit existing gateway operations and route each resource-affecting path through the shared interceptor pipeline. -8. Add interceptor decision audit logging and metrics. -9. Document how external controllers should reconcile providers, provider - profiles, global policy, and sandbox policy through existing APIs. -10. Add read-only interceptor visibility to the existing gateway info command. -11. Document gateway interceptor configuration, endpoint requirements, failure +7. Add interceptor decision audit logging and metrics. +8. Document gateway interceptor configuration, endpoint requirements, failure modes, and security guidance. ## Risks @@ -512,6 +481,26 @@ This is simple for known cases but does not scale to organization-specific policy or external sources. It also keeps growing the gateway config schema for controls that are not core OpenShell semantics. +Built-in fields and interceptors are not mutually exclusive. OpenShell may still +ship common defaults as first-party config; interceptors let a deployment extend +or override those defaults when its rules differ. + +### Build a specific policy driver + +OpenShell could add a dedicated policy driver interface for deployments that want +policy decisions to come from an external authority. + +This solves one important use case, but it creates a narrow extension point for +one resource type instead of a general gateway operation framework. The same +deployments may need adjacent controls for sandbox creation, provider +profiles, quotas, naming, and audit. 
It would also be difficult to evolve: +OpenShell would need to expose policy-specific hooks that are likely to track +individual deployment use cases rather than a stable gateway operation contract. +This is different from compute drivers, which implement backend behavior after +the gateway has accepted an operation. A policy authority participates in the +gateway's decision to accept, reject, or modify the operation before +persistence, so it fits better as an interceptor than as a driver. + ### Put this in compute drivers Drivers already validate driver-owned config. They could also reject names, diff --git a/rfc/0006-gateway-interceptors/policy-governance-example.md b/rfc/0006-gateway-interceptors/policy-governance-example.md deleted file mode 100644 index 4f10a7834..000000000 --- a/rfc/0006-gateway-interceptors/policy-governance-example.md +++ /dev/null @@ -1,70 +0,0 @@ -# Policy Governance Example - -This companion note is non-normative. It shows how one organization policy -interceptor service could expose several bindings through its manifest, so a -gateway operator can configure one service and selectively override behavior. - -The service is not a special gateway integration. It uses the same -gRPC-over-TCP or gRPC-over-UDS contract available to external users. - -## Example Service Bindings - -### System Policy Authority - -Reject sandbox-scoped policy creation, update, merge, or delete when an -operator-configured gateway policy is authoritative. Optionally inject the -default policy into sandbox creation when no policy is supplied. - -This complements the existing global policy behavior. Global policy override -controls effective sandbox config; this interceptor service makes custom policy -submission fail at the API boundary instead of being silently overridden later. - -### External Policy Authority Verifier - -Validate global or sandbox-scoped policy writes against an external authority -before the gateway persists them. 
The external authority might verify a policy -bundle signature, check that a submitted policy was approved by an internal -control plane, or compare policy metadata against an organization-owned source. - -This is a write-time validation path. Accepted policy state is still persisted -in the gateway DB and runtime paths continue to read gateway-owned state. If -the external authority is unavailable, the configured failure policy determines -whether the write is rejected or allowed with audit warnings. - -### Driver Config Validator - -Validate `SandboxTemplate.driver_config` before it reaches a driver. This can -enforce allowed keys, exact payloads, forbidden annotations, resource ceilings, -or driver-specific profiles. - -Example: - -```yaml -driver: kubernetes -allowed_keys: - - nodeSelector - - tolerations -required_payload: - runtimeClassName: nvidia -``` - -### User Sandbox Quota - -Reject `CreateSandbox` when a principal already has too many active sandboxes. -The initial version may use the existing store list path for single-replica -deployments. A later HA-safe version should use a quota lease or counter with -database compare-and-swap or a transaction. - -The rejection code should be `resource_exhausted`. - -### Sandbox Name Prefix - -Require sandbox names to start with a configured prefix. Generated names may be -modified. User-supplied names should be rejected rather than silently changed. 
- -Example: - -```text -generated: bright-lake -> nvidia-bright-lake -supplied: demo -> reject, expected prefix nvidia- -``` From 0a26c9e4c00dc6348df44c0e5d52885d4328ab92 Mon Sep 17 00:00:00 2001 From: Drew Newberry Date: Tue, 23 Jun 2026 00:16:59 -0700 Subject: [PATCH 05/10] wip --- rfc/0006-gateway-interceptors/README.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/rfc/0006-gateway-interceptors/README.md b/rfc/0006-gateway-interceptors/README.md index 78177ab6a..41d2481b5 100644 --- a/rfc/0006-gateway-interceptors/README.md +++ b/rfc/0006-gateway-interceptors/README.md @@ -216,8 +216,7 @@ endpoint type and TLS mode from the interceptor endpoint URI: - `unix:///path/to/socket` connects to a gRPC interceptor service over a Unix domain socket. -Remote gRPC interceptors require authentication. The exact configuration shape -is out of scope for this RFC, but the implementation should support mTLS and +Remote gRPC interceptors require authentication. The exact approach is out of scope for this RFC, but the implementation should support mTLS and bearer-token authentication. 
### Selection and ordering From b87edf2593d72c1f279ae85192da6aab90401807 Mon Sep 17 00:00:00 2001 From: Drew Newberry Date: Tue, 23 Jun 2026 00:34:49 -0700 Subject: [PATCH 06/10] docs(rfc): document interceptor order example Signed-off-by: Drew Newberry --- rfc/0006-gateway-interceptors/README.md | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/rfc/0006-gateway-interceptors/README.md b/rfc/0006-gateway-interceptors/README.md index 41d2481b5..7beb3eb47 100644 --- a/rfc/0006-gateway-interceptors/README.md +++ b/rfc/0006-gateway-interceptors/README.md @@ -268,6 +268,7 @@ Gateway config example for a remote policy provider: [[interceptors]] name = "policy-provider" endpoint = "grpcs://policy-provider.example.com:8443" +order = 100 failure_policy = "fail_closed" timeout = "500ms" ``` @@ -314,17 +315,17 @@ interceptor bindings also have a maximum patch count. Every interceptor decision should emit structured gateway logs with: -- interceptor name. -- binding ID. -- phase. -- RPC service and method. -- principal. -- decision. -- reason. -- latency. -- failure policy. -- patch count. -- audit annotations. +- interceptor name +- binding ID +- phase +- RPC service and method +- principal +- decision +- reason +- latency +- failure policy +- patch count +- audit annotations Security-relevant denials should be emitted as OCSF detection findings or configuration/security events, depending on the event class. 
Non-security @@ -412,6 +413,7 @@ bindings: [[interceptors]] name = "policy-provider" endpoint = "grpcs://policy-provider.example.com:8443" +order = 100 failure_policy = "fail_closed" timeout = "500ms" ``` From f8081d9fef03fe15e5f547c3e64247d6a8c46017 Mon Sep 17 00:00:00 2001 From: Drew Newberry Date: Tue, 23 Jun 2026 07:56:04 -0700 Subject: [PATCH 07/10] docs(rfc): renumber gateway interceptors RFC Signed-off-by: Drew Newberry --- .../README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename rfc/{0006-gateway-interceptors => 0010-gateway-interceptors}/README.md (99%) diff --git a/rfc/0006-gateway-interceptors/README.md b/rfc/0010-gateway-interceptors/README.md similarity index 99% rename from rfc/0006-gateway-interceptors/README.md rename to rfc/0010-gateway-interceptors/README.md index 7beb3eb47..b14610eb2 100644 --- a/rfc/0006-gateway-interceptors/README.md +++ b/rfc/0010-gateway-interceptors/README.md @@ -6,7 +6,7 @@ links: - https://github.com/NVIDIA/OpenShell/issues/1919 --- -# RFC 0006 - Gateway Interceptors +# RFC 0010 - Gateway Interceptors ## Summary From 4cba33c225170754ba7a5abebe298b6e2dda83c3 Mon Sep 17 00:00:00 2001 From: Drew Newberry Date: Tue, 23 Jun 2026 11:21:13 -0700 Subject: [PATCH 08/10] docs(rfc): clarify gateway interceptor service Signed-off-by: Drew Newberry --- rfc/0010-gateway-interceptors/README.md | 257 +++++++++++++----------- 1 file changed, 136 insertions(+), 121 deletions(-) diff --git a/rfc/0010-gateway-interceptors/README.md b/rfc/0010-gateway-interceptors/README.md index b14610eb2..7e24892f9 100644 --- a/rfc/0010-gateway-interceptors/README.md +++ b/rfc/0010-gateway-interceptors/README.md @@ -20,24 +20,24 @@ This RFC proposes a first-class extension system that lets external services observe, modify, validate, reject, or audit gateway operations at well-defined phases. We call these **Gateway Interceptors**. -Interceptors and drivers serve different extension needs. 
Interceptors add business logic -around gateway operations. Drivers replace or provide implementation for -platform functionality, such as how sandboxes are provisioned on Docker, -Kubernetes, or VMs. +Gateway interceptors and drivers serve different extension needs. The gateway +interceptor role adds business logic around gateway operations. Drivers replace +or provide implementation for platform functionality, such as how sandboxes are +provisioned on Docker, Kubernetes, or VMs. -This RFC scopes interceptors to gateway API operations. An interceptor can -observe, modify, validate, reject, or audit a gateway operation at well-defined -phases. Future RFCs may extend interceptors to other gateway functionality, -such as event-driven or workflow behavior, but that is out of scope for this -first implementation. +This RFC scopes gateway interceptors to gateway API operations. A gateway +interceptor can observe, modify, validate, reject, or audit a gateway operation +at well-defined phases. Future RFCs may extend gateway interceptors to other +gateway functionality, such as event-driven or workflow behavior, but that is +out of scope for this first implementation. Drivers continue to own platform implementation — how gateway functionality is -actually provided. Interceptors own gateway-level governance for resource -writes: tenancy, quotas, naming, policy authority, and driver configuration -restrictions. +actually provided. Gateway interceptors own gateway-level governance for +resource writes: tenancy, quotas, naming, policy authority, and driver +configuration restrictions. -The gateway database remains the system of record. Interceptors add governance -around gateway operations; they do not replace gateway-owned state. +The gateway database remains the system of record. Gateway interceptors add +governance around gateway operations; they do not replace gateway-owned state. ## Motivation @@ -58,14 +58,14 @@ options. OpenShell already extends two of its subsystems. 
Drivers (RFC 0001) provide implementations for the platform and infrastructure layer. Sandbox egress middleware (RFC 0009) runs in the supervisor proxy and governs what an agent's -outbound requests may carry. Interceptors complete this pattern for the gateway -control plane: an extension point for the API operations themselves, where -deployment-specific rules like tenant quotas, policy authority, and naming -belong. +outbound requests may carry. Gateway interceptors complete this pattern for the +gateway control plane: an extension point for the API operations themselves, +where deployment-specific rules like tenant quotas, policy authority, and +naming belong. -Some of these may ship as built-in gateway defaults over time. Interceptors do -not replace that — they let a deployment extend or override built-in defaults -when its rules differ, without waiting on an upstream change. +Some of these may ship as built-in gateway defaults over time. Gateway interceptor +services do not replace that — they let a deployment extend or override built-in +defaults when its rules differ, without waiting on an upstream change. Without a dedicated mechanism, operators carry these rules as gateway forks or local patches. @@ -73,8 +73,8 @@ local patches. ## Non-goals - Replacing compute drivers or adding a second compute provisioning interface. -- Letting interceptors bypass gateway authentication, authorization, policy - safety validation, or driver schema validation. +- Letting gateway interceptors bypass gateway authentication, authorization, + policy safety validation, or driver schema validation. - Moving sandbox runtime enforcement out of the sandbox supervisor and proxy. - Replacing the gateway database as the system of record. 
- Adding new first-class gateway resource kinds for quotas, name policies, @@ -86,26 +86,26 @@ Add a gateway interceptor framework with explicit phases, RPC method selectors, deterministic ordering, bounded execution, audit logging, and conservative failure behavior. -Interceptors do not replace gateway functionality. They add governance and -business logic around gateway operations: defaulting, validation, rejection, +Gateway interceptors do not replace gateway functionality. They add governance +and business logic around gateway operations: defaulting, validation, rejection, and audit. Replacing how core functionality is implemented remains the role of drivers and other provider-style interfaces. The design keeps two boundaries intact: - The gateway database remains the system of record for gateway-owned state. -- Existing gateway and driver validation still run after interceptor +- Existing gateway and driver validation still run after gateway interceptor modification. -### Gateway API interceptors +### Gateway interceptors -A gateway API interceptor runs during a gateway API operation, such as creating a +A gateway interceptor runs during a gateway API operation, such as creating a sandbox, importing provider profiles, updating policy, or applying sandbox configuration. It may modify an RPC request or operation input only in modification phases. It may reject in validation phases. It may attach warnings and audit annotations in all phases. -Interceptor services expose one or more bindings. A binding is a +Gateway interceptor services expose one or more bindings. A binding is a service-declared rule that maps the service to phases, gateway RPC methods, and selectors. The gateway uses bindings to decide when to call the service. @@ -116,16 +116,17 @@ to the public API operators already know and avoids another compatibility surface. All interceptable gateway API RPCs run through the same standard phase pipeline. 
-The gateway rejects interceptor bindings that reference unknown RPCs for the -running gateway version, unless the RPC selector is empty to match all +The gateway rejects gateway interceptor bindings that reference unknown RPCs for +the running gateway version, unless the RPC selector is empty to match all interceptable RPCs. -Gateway API interceptors should work for all relevant gateway RPCs, not a -hand-maintained subset. New gateway RPCs should enter the interceptor pipeline by -using the shared gateway API execution path, not by adding per-RPC interceptor -hooks or updating a separate allowlist. RPCs may opt out only when they are not -gateway API operations in scope for this RFC, such as low-level streaming or -supervisor-internal calls, and the opt-out should be explicit in code review. +Gateway interceptors should work for all relevant gateway RPCs, not a +hand-maintained subset. New gateway RPCs should enter the gateway interceptor +pipeline by using the shared gateway API execution path, not by adding per-RPC +gateway interceptor hooks or updating a separate allowlist. RPCs may opt out +only when they are not gateway API operations in scope for this RFC, such as +low-level streaming or supervisor-internal calls, and the opt-out should be +explicit in code review. This lets OpenShell add deployment-specific business logic around the gateway operations it already supports while keeping runtime reads local and @@ -135,7 +136,7 @@ deterministic. Operation phases are ordered. Later phases see the result of earlier phases. All interceptable gateway API RPCs use the same phases in the same order so -interceptor authors and operators do not need per-RPC phase rules. +gateway interceptor authors and operators do not need per-RPC phase rules. | Phase | Modification allowed | Purpose | Examples | |---|---:|---|---| @@ -144,15 +145,25 @@ interceptor authors and operators do not need per-RPC phase rules. 
| `validate` | no | Enforce deployment-specific rules before persistence, provisioning, or other side effects. | Enforce tenant quotas, reject policy updates that allow internet egress, or verify driver config against an approved schema. | | `post_commit` | no | Emit audit or notify external systems after successful persistence or provisioning. | Send audit records, notify an inventory system, or trigger a reconciliation job after a successful write. | -Gateway invariants run after modification so interceptors cannot leave invalid -objects in the system. Operation-specific built-in validation, including driver -validation where applicable, remains part of the gateway-owned execution path so -drivers stay the authority for driver-owned schemas. +Gateway invariants run after modification so gateway interceptors cannot leave +invalid objects in the system. Operation-specific built-in validation, including +driver validation where applicable, remains part of the gateway-owned execution +path so drivers stay the authority for driver-owned schemas. -### Interceptor request contract +### Gateway interceptor request contract -The interceptor request should be stable and tied to the public gateway API, not -to Rust handler internals. +Each gateway interceptor is a registered service instance that implements the +`GatewayInterceptor` gRPC service: + +```proto +service GatewayInterceptor { + rpc Describe(google.protobuf.Empty) returns (InterceptorManifest); + rpc Evaluate(InterceptorEvaluation) returns (InterceptorDecision); +} +``` + +The gateway interceptor request should be stable and tied to the public gateway +API, not to Rust handler internals. ```proto message InterceptorEvaluation { @@ -184,8 +195,8 @@ input available after state loading and defaulting; it is the main payload for `modify_operation`, `validate`, and `post_commit`. `existing_state` is populated only when the operation has prior gateway-owned state. 
-The interceptor response returns an allow/deny decision, optional patches, and -diagnostic metadata for gateway API interceptors. +The gateway interceptor response returns an allow/deny decision, optional +patches, and diagnostic metadata for selected gateway operations. ```proto message InterceptorDecision { @@ -200,36 +211,39 @@ message InterceptorDecision { Only modification phases accept patches. `pre_request` patches apply to `rpc_request`; `modify_operation` patches apply to `operation_input`. -`validate` and `post_commit` interceptors that return patches are configuration -errors. +`validate` and `post_commit` gateway interceptors that return patches are +configuration errors. -The `binding_id` is owned by the interceptor service. It identifies the +The `binding_id` is owned by the gateway interceptor service. It identifies the service-declared binding that selected the evaluation. -### Interceptor endpoints +### Gateway interceptor endpoints The framework uses one protobuf/gRPC service contract. The gateway derives the -endpoint type and TLS mode from the interceptor endpoint URI: +endpoint type and TLS mode from the gateway interceptor endpoint URI: -- `grpc://host:port` connects to a plaintext gRPC interceptor service over TCP. -- `grpcs://host:port` connects to a TLS-protected gRPC interceptor service over TCP. -- `unix:///path/to/socket` connects to a gRPC interceptor service over a Unix domain - socket. +- `grpc://host:port` connects to a plaintext gRPC gateway interceptor service + over TCP. +- `grpcs://host:port` connects to a TLS-protected gRPC gateway interceptor + service over TCP. +- `unix:///path/to/socket` connects to a gRPC gateway interceptor service over a + Unix domain socket. -Remote gRPC interceptors require authentication. The exact approach is out of scope for this RFC, but the implementation should support mTLS and -bearer-token authentication. +All gateway interceptor connections require authentication, regardless of +endpoint type. 
The exact approach is out of scope for this RFC, but the +implementation should support mTLS and bearer-token authentication. ### Selection and ordering -Selection should be oriented around interceptor services, not individual -phase/RPC routes. Operators should normally configure a small number of -interceptor services and service-specific settings. The service tells the -gateway which RPC bindings it supports. +Selection should be oriented around gateway interceptor services, not individual +phase/RPC routes. Operators should normally configure a small number of these +services and service-specific settings. The service tells the gateway which RPC +bindings it supports. -A `[[interceptors]]` table in the gateway config TOML represents one interceptor -service instance. During gateway startup or config reload, the gateway calls a -`Describe` RPC on the service. The response describes the service's default -bindings: +A `[[gateway_interceptors]]` table in the gateway config TOML represents one +gateway interceptor service instance. During gateway startup or config reload, +the gateway calls a `Describe` RPC on the service. The response describes the +service's default bindings: ```proto message InterceptorManifest { @@ -265,7 +279,7 @@ service-declared selector, such as limiting a binding to a specific RPC. Gateway config example for a remote policy provider: ```toml -[[interceptors]] +[[gateway_interceptors]] name = "policy-provider" endpoint = "grpcs://policy-provider.example.com:8443" order = 100 @@ -280,42 +294,42 @@ The gateway builds an execution plan from enabled bindings. Selection evaluates the service-declared RPC, phase, principal, and label selectors, then applies gateway-configured narrowing overrides. -Interceptors run in fixed phase order. Within a phase, matching bindings run by -this deterministic ordering: +Gateway interceptors run in fixed phase order. Within a phase, matching +bindings run by this deterministic ordering: -1. 
configured interceptor service `order`. +1. configured gateway interceptor service `order`. 2. service-declared binding `order`, after gateway overrides. -3. interceptor service name. +3. gateway interceptor service name. 4. binding ID. -The gateway rejects interceptor configuration that creates ambiguous +The gateway rejects gateway interceptor configuration that creates ambiguous modification order for the same field if that can be detected statically. ### Failure policy Each binding has an effective failure policy. The gateway starts with the -service default, applies the interceptor service-level gateway config, then -applies any binding override. +service default, applies the gateway interceptor service-level gateway config, +then applies any binding override. | Failure policy | Behavior | |---|---| -| `fail_closed` | Interceptor timeout or service error rejects the API operation. | -| `fail_open` | Interceptor timeout or service error permits the operation. The gateway emits warnings and audit logs. | -| `ignore` | Interceptor errors are logged only. Valid only for `post_commit`. | +| `fail_closed` | Gateway interceptor timeout or service error rejects the API operation. | +| `fail_open` | Gateway interceptor timeout or service error permits the operation. The gateway emits warnings and audit logs. | +| `ignore` | Gateway interceptor errors are logged only. Valid only for `post_commit`. | Defaults: - Modifying and validating bindings default to `fail_closed`. - `post_commit` bindings default to `ignore`. -Every interceptor service has a timeout and response size limit. Gateway API -interceptor bindings also have a maximum patch count. +Every gateway interceptor service has a timeout and response size limit. Each +binding also has a maximum patch count. 
### Observability and audit -Every interceptor decision should emit structured gateway logs with: +Every gateway interceptor decision should emit structured gateway logs with: -- interceptor name +- gateway interceptor name - binding ID - phase - RPC service and method @@ -333,18 +347,18 @@ operational failures can use plain tracing. ### Example: remote policy provider -An interceptor should start from the invariant it wants to preserve, then find -every gateway API RPC that can establish or weaken that invariant. For example, -an operator may want a remote policy provider to be the authority for sandbox -policy decisions. +A gateway interceptor should start from the invariant it wants to preserve, then +find every gateway API RPC that can establish or weaken that invariant. For +example, an operator may want a remote policy provider to be the authority for +sandbox policy decisions. Two RPCs matter for this invariant: - `openshell.v1.OpenShell/CreateSandbox` establishes the initial sandbox policy. - `openshell.v1.OpenShell/UpdateConfig` changes sandbox or global policy. -The interceptor service declares one binding to apply an approved initial policy -and another to guard later policy changes: +The gateway interceptor service declares one binding to apply an approved +initial policy and another to guard later policy changes: ```proto InterceptorManifest { @@ -372,7 +386,7 @@ The handler can then focus on the phase and RPC method that selected the binding: ```rust -// Toy implementation of the InterceptorService evaluate RPC. +// Toy implementation of the GatewayInterceptor Evaluate RPC. 
async fn evaluate(&self, req: InterceptorEvaluation) -> InterceptorDecision { match (req.rpc_method.as_str(), req.phase.as_str()) { // CreateSandbox: ask the remote policy provider for the approved @@ -410,7 +424,7 @@ The gateway config can stay small because the service manifest declares the bindings: ```toml -[[interceptors]] +[[gateway_interceptors]] name = "policy-provider" endpoint = "grpcs://policy-provider.example.com:8443" order = 100 @@ -418,7 +432,7 @@ failure_policy = "fail_closed" timeout = "500ms" ``` -This example illustrates the general interceptor design loop: +This example illustrates the general gateway interceptor design loop: - Start with the invariant, then identify every RPC that can establish or weaken it. @@ -426,49 +440,50 @@ This example illustrates the general interceptor design loop: policy and `validate` to reject unauthorized later changes. - Use `fail_closed` because policy authority is a control-plane security boundary. -- Keep gateway validation after the interceptor so built-in policy safety checks - still run. +- Keep gateway validation after the gateway interceptor so built-in policy + safety checks still run. ## Implementation plan -1. Add a `crates/openshell-interceptors` crate with shared interceptor - manifest, request/response, selector matching, ordering, failure policy - handling, patch application, and test helpers. -2. Add interceptor configuration parsing to gateway config and validate it at startup. -3. Implement gRPC interceptor clients that derive TCP or Unix domain socket - transport from the configured endpoint URI and call `Describe` during +1. Add a `crates/openshell-gateway-interceptors` crate with the shared manifest, + request/response, selector matching, ordering, failure policy handling, patch + application, and test helpers. +2. Add gateway interceptor configuration parsing to gateway config and validate + it at startup. +3. 
Implement gRPC gateway interceptor clients that derive TCP or Unix domain + socket transport from the configured endpoint URI and call `Describe` during startup or config reload. 4. Build an execution plan from service manifests plus gateway-configured overrides. -5. Wire interceptor execution into the gateway API operation pipeline so all - gateway operations can pass through `pre_request`, `modify_operation`, +5. Wire gateway interceptor execution into the gateway API operation pipeline so + all gateway operations can pass through `pre_request`, `modify_operation`, `validate`, and `post_commit` where applicable. 6. Audit existing gateway operations and route each resource-affecting path - through the shared interceptor pipeline. -7. Add interceptor decision audit logging and metrics. + through the shared gateway interceptor pipeline. +7. Add gateway interceptor decision audit logging and metrics. 8. Document gateway interceptor configuration, endpoint requirements, failure modes, and security guidance. ## Risks -- Interceptors can make request behavior harder to reason about if ordering - and audit are weak. -- Synchronous gRPC interceptor services can become availability dependencies for the - gateway. -- Modifying interceptors can hide user intent if they silently rewrite user-supplied - values. +- Gateway interceptors can make request behavior harder to reason about if + ordering and audit are weak. +- Synchronous gRPC gateway interceptor services can become availability + dependencies for the gateway. +- Modifying gateway interceptors can hide user intent if they silently rewrite + user-supplied values. - Ownership can become confusing when external controllers and humans both edit the same provider profile, provider, or policy through existing APIs. -- Quota interceptors need a stronger consistency design before they are safe in HA - deployments. +- Quota gateway interceptors need a stronger consistency design before they are + safe in HA deployments. 
Mitigations: -- Keep interceptors disabled by default. +- Keep gateway interceptors disabled by default. - Make ordering deterministic and visible. -- Default modifying and validating interceptors to `fail_closed`. +- Default modifying and validating gateway interceptors to `fail_closed`. - Run first-party invariant validation after modification. -- Make HA-unsafe interceptors declare their scope explicitly. +- Make HA-unsafe gateway interceptors declare their scope explicitly. ## Alternatives @@ -482,9 +497,9 @@ This is simple for known cases but does not scale to organization-specific policy or external sources. It also keeps growing the gateway config schema for controls that are not core OpenShell semantics. -Built-in fields and interceptors are not mutually exclusive. OpenShell may still -ship common defaults as first-party config; interceptors let a deployment extend -or override those defaults when its rules differ. +Built-in fields and gateway interceptors are not mutually exclusive. OpenShell +may still ship common defaults as first-party config; gateway interceptors let a +deployment extend or override those defaults when its rules differ. ### Build a specific policy driver @@ -500,7 +515,7 @@ individual deployment use cases rather than a stable gateway operation contract. This is different from compute drivers, which implement backend behavior after the gateway has accepted an operation. A policy authority participates in the gateway's decision to accept, reject, or modify the operation before -persistence, so it fits better as an interceptor than as a driver. +persistence, so it fits better as a gateway interceptor than as a driver. ### Put this in compute drivers @@ -509,14 +524,14 @@ quotas, and policy choices. This mixes responsibilities. Drivers should own compute-platform feasibility. The gateway should own API behavior, tenancy, policy authority, and provider -state. 
Interceptors are appropriate for additional business logic around gateway -operations; drivers are appropriate when OpenShell needs a different +state. Gateway interceptors are appropriate for additional business logic around +gateway operations; drivers are appropriate when OpenShell needs a different implementation of compute functionality. ### Use HTTP webhooks -OpenShell could model interceptors as HTTP webhooks with JSON request and response -payloads. +OpenShell could model gateway interceptors as HTTP webhooks with JSON request +and response payloads. This is familiar to Kubernetes users, but OpenShell already uses protobuf and gRPC heavily. A protobuf gRPC contract avoids a second wire format for gateway From bb1b3fa4db926214d1e2001d6135cd495cea9318 Mon Sep 17 00:00:00 2001 From: Drew Newberry Date: Tue, 23 Jun 2026 11:26:40 -0700 Subject: [PATCH 09/10] docs(rfc): clarify gateway interceptor limits Signed-off-by: Drew Newberry --- rfc/0010-gateway-interceptors/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rfc/0010-gateway-interceptors/README.md b/rfc/0010-gateway-interceptors/README.md index 7e24892f9..5d71dd614 100644 --- a/rfc/0010-gateway-interceptors/README.md +++ b/rfc/0010-gateway-interceptors/README.md @@ -322,8 +322,8 @@ Defaults: - Modifying and validating bindings default to `fail_closed`. - `post_commit` bindings default to `ignore`. -Every gateway interceptor service has a timeout and response size limit. Each -binding also has a maximum patch count. +The gateway enforces a timeout and response size limit for every gateway +interceptor service call. Each binding also has a maximum patch count. 
 ### Observability and audit


From 4c38e06df79dc94116a7d775e8dba9434a5019b3 Mon Sep 17 00:00:00 2001
From: Drew Newberry
Date: Tue, 23 Jun 2026 13:33:34 -0700
Subject: [PATCH 10/10] docs(rfc): align gateway interceptor config

Signed-off-by: Drew Newberry
---
 rfc/0010-gateway-interceptors/README.md | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/rfc/0010-gateway-interceptors/README.md b/rfc/0010-gateway-interceptors/README.md
index 5d71dd614..a09a3e7a0 100644
--- a/rfc/0010-gateway-interceptors/README.md
+++ b/rfc/0010-gateway-interceptors/README.md
@@ -220,7 +220,7 @@ service-declared binding that selected the evaluation.
 ### Gateway interceptor endpoints

 The framework uses one protobuf/gRPC service contract. The gateway derives the
-endpoint type and TLS mode from the gateway interceptor endpoint URI:
+endpoint type and TLS mode from the gateway interceptor `grpc_endpoint` URI:

 - `grpc://host:port` connects to a plaintext gRPC gateway interceptor service
   over TCP.
@@ -240,10 +240,10 @@ phase/RPC routes. Operators should normally configure a small number of these
 services and service-specific settings. The service tells the gateway which RPC
 bindings it supports.

-A `[[gateway_interceptors]]` table in the gateway config TOML represents one
-gateway interceptor service instance. During gateway startup or config reload,
-the gateway calls a `Describe` RPC on the service. The response describes the
-service's default bindings:
+A `[[openshell.gateway.interceptors]]` table in the gateway config TOML
+represents one gateway interceptor service instance. During gateway startup or
+config reload, the gateway calls a `Describe` RPC on the service. The response
+describes the service's default bindings:

 ```proto
 message InterceptorManifest {
@@ -279,9 +279,9 @@ service-declared selector, such as limiting a binding to a specific RPC.
 Gateway config example for a remote policy provider:

 ```toml
-[[gateway_interceptors]]
+[[openshell.gateway.interceptors]]
 name = "policy-provider"
-endpoint = "grpcs://policy-provider.example.com:8443"
+grpc_endpoint = "grpcs://policy-provider.example.com:8443"
 order = 100
 failure_policy = "fail_closed"
 timeout = "500ms"
@@ -424,9 +424,9 @@ The gateway config can stay small because the service manifest declares the
 bindings:

 ```toml
-[[gateway_interceptors]]
+[[openshell.gateway.interceptors]]
 name = "policy-provider"
-endpoint = "grpcs://policy-provider.example.com:8443"
+grpc_endpoint = "grpcs://policy-provider.example.com:8443"
 order = 100
 failure_policy = "fail_closed"
 timeout = "500ms"
@@ -451,8 +451,8 @@ This example illustrates the general gateway interceptor design loop:
 2. Add gateway interceptor configuration parsing to gateway config and validate
    it at startup.
 3. Implement gRPC gateway interceptor clients that derive TCP or Unix domain
-   socket transport from the configured endpoint URI and call `Describe` during
-   startup or config reload.
+   socket transport from the configured `grpc_endpoint` URI and call `Describe`
+   during startup or config reload.
 4. Build an execution plan from service manifests plus gateway-configured
    overrides.
 5. Wire gateway interceptor execution into the gateway API operation pipeline so