diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md
new file mode 100644
index 0000000..ba6b775
--- /dev/null
+++ b/enhancements/environment-agent/environment-agent.md
@@ -0,0 +1,1214 @@
+---
+title: Environment Agent
+authors:
+ - "@gabriel-farache"
+reviewers:
+ - "@gciavarrini"
+ - "@ygalblum"
+ - "@machacekondra"
+ - "@jenniferubah"
+approvers:
+ - ""
+creation-date: 2026-06-03
+see-also:
+ - "/enhancements/service-provider-health-check/service-provider-health-check.md"
+ - "/enhancements/state-management/service-provider-status-reporting.md"
+ - "/enhancements/sp-registration-flow/sp-registration-flow.md"
+ - "/enhancements/placement-manager/placement-manager.md"
+ - "/enhancements/sp-resource-manager/sp-resource-manager.md"
+---
+
+# Environment Agent
+
+## Open Questions
+
+1. Can multiple agent replicas consume from the same topic for high
+ availability? (deferred to HA iteration)
+2. How does an administrator update the agent's cost tier without restarting it?
+ **Proposed resolution:** The administrator updates the agent's configuration
+ (config file, environment variable, or ConfigMap on Kubernetes). The agent
+ detects the change and sends a `POST /api/v1/agents` to DCM with the updated
+ cost tier — the same mechanism used when the supported service types list
+ changes. **This solution is deferred to a later version: in the current
+ version, a restart is needed for a cost tier change to be propagated (via
+ the [Agent Registration Flow](#agent-registration-flow)).**
+3. How does DCM handle the "queued" CloudEvent response
+ (`dcm.agent.request-queued`)? Does it expose the status to the user, set a
+ timeout, or re-evaluate policies? (deferred to DCM-side design)
+
+## Terminology
+
+- **Agent:** A lightweight process that runs in a target environment, acting as
+ the intermediary between DCM and the Service Providers deployed in that
+ environment. It registers the environment to DCM, consumes resource operation
+ requests from a messaging system, and routes them to the appropriate Service
+ Provider.
+- **Embedded SP:** SP code shipped within the agent binary (K8s Container, ACM
+ Cluster, KubeVirt), enabled via configuration. Embedded SPs register
+ internally at agent startup without a REST call.
+- **External SP:** A standalone SP process that registers to the agent via the
+ REST API (`POST /api/v1/providers`). Also referred to as "bring your own" SP.
+- **Environment:** A set of infrastructure that is ready to receive workloads
+ from DCM (e.g., `dev`, `staging`, `prod-eu-west-1`).
+
+## Summary
+
+This enhancement introduces the notion of an environment by adding a layer
+between the SPs and DCM: an agent runs in each environment usable by DCM and
+registers that environment to DCM.
+
+The agent supports a hybrid SP model: it ships with embedded SP code for known
+service types (K8s Container, ACM Cluster, KubeVirt), enabled via configuration,
+and also accepts external ("bring your own") SPs that register via REST API.
+Only one SP — embedded or external — may serve a given service type per agent;
+duplicate registrations are rejected.
+
+This enhancement also proposes changing the way a creation request is
+submitted to the agent (or currently, to the SP): instead of sending a direct
+request, DCM publishes the request to a messaging system, and the relevant
+agent consumes it from there to create the requested resource.
+
+## Motivation
+
+When deploying resources, one of the main criteria taken into account is the
+type of environment in which the resource will be deployed: DEV, INT, VAL,
+PROD, etc.
+
+Currently, in DCM, a resource creation request is routed to a given Service
+Provider (SP) by a policy on the basis of several criteria. Once the SP is
+selected, DCM sends it a request to create the resource.
+
+There is currently no way for a policy to determine in which environment an SP
+is running and hence a user cannot explicitly set the targeted environment
+constraint when requesting the creation of a resource.
+
+Furthermore, with the current way of submitting creation requests, the
+administrator has to make sure the ports are open for DCM to reach the SP.
+Changing how creation requests are consumed by giving the initiative to the
+agent would solve this problem: the agent pulls work from a messaging system,
+removing the need for DCM-to-environment inbound connectivity for creation
+requests. The agent still requires outbound connectivity to DCM for registration
+and heartbeats. This approach also aligns with the way K8s/OCP consume creation
+requests, where manifests are pulled by the application creating the resource.
+
+### Goals
+
+- Define how the agent registers to DCM
+- Define what information the agent gives to DCM while registering
+- Define how agents and DCM communicate
+- Define how agents and Service Providers interact with each other
+- Define how embedded SPs integrate with the agent alongside external SPs
+ (hybrid model)
+- Define the service type uniqueness constraint (one SP per service type)
+- Define how Service Providers register to the agent, allowing the agent to
+ dynamically build and maintain its list of supported service types
+- Define how the agent monitors Service Provider health using the three-state
+ health model (Ready, Unhealthy, Unavailable) and updates DCM when the
+ supported service types change as a result
+- Define how the agent reports its own health to DCM via periodic heartbeats
+
+### Non-Goals
+
+- Defining how to use the information registered by the agent to DCM
+- Defining how the agent will provision applications (as opposed to simple
+ service types)
+- Updating other enhancement files to reflect the changes introduced by the
+ present document; this will be done in subsequent PRs.
+
+## Proposal
+
+### Overview
+
+For each environment that can be used by DCM, an agent must be spawned. The
+agent self-registers to DCM. When doing so, it provides, among other
+information, the environment in which it is running and the service types it
+can serve.
+
+When starting, the agent will also create a specific topic in the messaging
+system in order for DCM to communicate with the agent. The topic name is
+deterministic — either derived from the agent's name or provided via
+configuration — ensuring that after a restart the agent reuses the same topic.
+If the topic already exists, the agent reuses it. The topic name is unique per
+environment and is shared with DCM upon registration. In the current
+single-agent model, one agent consumes from the topic. In a future HA model,
+multiple agent replicas for the same environment could consume from the same
+topic as competing consumers.
+
+The agent supports a hybrid SP model combining embedded and external SPs:
+
+- **Embedded SPs:** The agent ships with SP code for K8s Container, ACM Cluster,
+ and KubeVirt. These are enabled via configuration and register internally at
+ agent startup — no REST call is needed. The embedded SP code lives in
+ dedicated packages within the agent codebase.
+- **External SPs ("bring your own"):** Standalone SP processes register to the
+ agent via the REST API (`POST /api/v1/providers`), following the contract
+ defined in the
+ [SP Registration Flow](../sp-registration-flow/sp-registration-flow.md).
+
+Only one SP — embedded or external — may serve a given service type per agent.
+If an SP attempts to register for a service type that is already served, the
+registration is rejected (see
+[SP Registration to Agent](#sp-registration-to-agent)). Future iterations may
+support multiple SPs per service type with selection strategies (e.g.,
+affinity-based, capacity-based).
+
+The agent dynamically builds its list of supported service types based on the
+SPs registered to it (both embedded and external). When the list changes (SP
+registration or health-driven removal), the agent updates DCM accordingly.
+
+An agent must have at least one SP (embedded or external) registered and healthy
+before self-registering to DCM. Each service type advertised to DCM must be
+backed by a healthy SP.
+
+DCM sends each creation request to the topic created by the agent.
+
+The agent then consumes the message, validates it, and passes it to the
+relevant SP.
+
+The agent monitors the health of its registered SPs using the three-state model
+(Ready, Unhealthy, Unavailable). The health monitoring mechanism differs by SP
+type:
+
+- **Embedded SPs:** Health is determined in-process — the agent directly checks
+ the embedded SP's internal state without a network call.
+- **External SPs:** Health is determined by polling the SP's `GET /health`
+ endpoint, as defined in the
+ [Service Provider Health Check enhancement](../service-provider-health-check/service-provider-health-check.md).
+
+The agent differentiates its behavior based on the SP health state:
+
+- **Unhealthy:** The agent keeps the service type in its advertised list to DCM
+ but stops routing requests to the SP. Incoming requests for that service type
+ are held in a dedicated retry topic until the SP recovers or becomes
+ unavailable.
+- **Unavailable:** The agent removes the service type from its advertised list,
+ updates DCM, and rejects any held requests for that service type.
+
+The agent exposes the health status of each registered SP via a `/api/v1/status`
+endpoint. On Kubernetes/OpenShift deployments, the agent additionally surfaces
+this information as custom pod conditions on its own pod, allowing
+administrators to quickly identify which SPs are causing issues via
+`oc describe pod`.
+
+The agent reports its own liveness to DCM via periodic REST heartbeats. DCM
+tracks the last heartbeat timestamp and marks the agent as unavailable if no
+heartbeat is received within a configurable threshold.
+
+Status monitoring is not impacted: the SP still manages the resource and the
+current status reporting flow remains the same; the agent is only an
+intermediary.
+
+### Architecture
+
+```mermaid
+%%{init: {'flowchart': {'rankSpacing': 80, 'nodeSpacing': 10, 'curve': 'linear'}}}%%
+flowchart TD
+ classDef dcm fill:#2d2d2d,color:#ffffff,stroke:#81c784,stroke-width:2px
+ classDef messaging fill:#2d2d2d,color:#ffffff,stroke:#ffb74d,stroke-width:2px
+ classDef agent fill:#2d2d2d,color:#ffffff,stroke:#f48fb1,stroke-width:2px
+ classDef embedded fill:#2d2d2d,color:#ffffff,stroke:#ce93d8,stroke-width:2px
+ classDef external fill:#2d2d2d,color:#ffffff,stroke:#90caf9,stroke-width:2px
+ classDef clusterEnvironment fill:#FFFFFF,stroke:#bdbdbd,stroke-width:2px
+
+ DCM["**DCM**<br/>Control Plane"]:::dcm
+ MS["**Messaging System**<br/>Subject-based routing"]:::messaging
+
+ subgraph Target_Environment["Target Environment"]
+ direction LR
+ EXT_SP["**External SP**<br/>Service Type Z<br/>(bring your own)"]:::external
+
+ subgraph Agent_Process["Agent Process"]
+ direction TB
+ AG["**Agent**<br/>Routes creation requests to SP"]:::agent
+ EMB_SP["**Embedded SPs**<br/>K8s Container · ACM Cluster · KubeVirt<br/>(enabled via config)"]:::embedded
+ EMB_SP ---|In-process| AG
+ end
+
+ EXT_SP -. "Registration (REST)" .-> AG
+ AG -->|Creation Request| EXT_SP
+ AG -.->|"Health Check (polling)"| EXT_SP
+ end
+
+ DCM -->|Creation Request| MS
+ MS -->|Creation Request| AG
+ AG -. Registration .-> DCM
+ AG -. Heartbeat .-> DCM
+ AG -->|Health Warning| MS
+ MS -->|Health Warning| DCM
+ EXT_SP -->|Status| MS
+ EMB_SP -->|Status| MS
+ MS -->|Status| DCM
+
+ class Target_Environment clusterEnvironment
+```
+
+#### Flow Description
+
+- The agent is spawned in an environment
+- At startup, the agent registers its configured embedded SPs internally (K8s
+ Container, ACM Cluster, KubeVirt — each enabled via configuration)
+- External SPs register to the agent via REST API; the agent rejects
+ registration if the service type is already served (by an embedded or another
+ external SP)
+- Only one SP (embedded or external) may serve a given service type
+- The agent creates a specific topic in the messaging system
+- Once at least one SP is registered and healthy, the agent self-registers to
+ DCM and begins sending periodic heartbeats
+- DCM sends creation requests to the agent's topic
+- The agent consumes the messages sent to the topic
+- The agent routes the creation request to the SP serving the requested service
+ type
+- The agent monitors each registered SP's health: in-process for embedded SPs,
+ via `/health` endpoint polling for external SPs. When the SP for a service
+ type becomes unhealthy, the agent publishes a health warning through the
+ messaging system
+- The status monitoring remains unchanged: each SP manages its resource
+ lifecycle and reports status through the messaging system
+
+### API
+
+#### Agent Endpoints
+
+| Method | Endpoint | Description |
+| ------ | ----------------- | ------------------------------------------------------------------------------------------------------------- |
+| POST | /api/v1/providers | SP registration — reuses the [SP Registration Flow](../sp-registration-flow/sp-registration-flow.md) contract |
+| GET | /api/v1/status | Agent status — health of all registered SPs |
+
+##### `POST /api/v1/providers` — SP Registration (External SPs only)
+
+Reuses the contract defined in the
+[SP Registration Flow](../sp-registration-flow/sp-registration-flow.md)
+enhancement. The agent applies the same idempotency semantics (name as natural
+key, create-or-update behavior).
+
+Only one SP may serve a given service type. If the requested service type is
+already served by another SP (embedded or external), the agent rejects the
+registration with `409 Conflict`:
+
+```json
+{
+ "error": "service type 'vm' is already served by provider 'vm-provider'"
+}
+```
+
+Embedded SPs register internally at startup and do not use this endpoint.
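+
+For illustration, a minimal request body might look like the following. It
+uses only the fields shown in the
+[SP Registration to Agent](#sp-registration-to-agent) diagram (`name`,
+`serviceType`, `endpoint`); the values are hypothetical, and the authoritative
+schema remains the one defined in the SP Registration Flow enhancement.
+
+```json
+{
+  "name": "db-provider",
+  "serviceType": "database",
+  "endpoint": "https://db-provider.example.internal:8443"
+}
+```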
+
+##### `GET /api/v1/status` — Agent Status
+
+Returns the health state of all registered SPs (both embedded and external).
+This endpoint is always available, regardless of the deployment mode
+(Kubernetes, Docker, standalone), and is the primary way to inspect the agent's
+view of its Service Providers.
+
+Example response:
+
+```json
+{
+ "providers": [
+ {
+ "providerId": "sp-container-001",
+ "name": "k8s-container",
+ "serviceType": "container",
+ "type": "embedded",
+ "status": "Ready",
+ "lastCheck": "2026-06-05T10:30:00Z"
+ },
+ {
+ "providerId": "sp-db-001",
+ "name": "db-provider",
+ "serviceType": "database",
+ "type": "external",
+ "status": "Unhealthy",
+ "lastCheck": "2026-06-05T10:30:00Z"
+ }
+ ]
+}
+```
+
+#### DCM Endpoints
+
+| Method | Endpoint | Description |
+| ------ | ---------------------------------- | ------------------ |
+| POST | /api/v1/agents | Agent registration |
+| PUT | /api/v1/agents/{agentId}/heartbeat | Agent heartbeat |
+
+##### `POST /api/v1/agents` — Agent Registration
+
+Registers a new agent to DCM, or updates the existing registration when an
+agent with the same `name` is already registered.
+
+| Field | Type | Required | Description |
+| ------------------ | -------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| name | string | yes | Unique agent name |
+| environment | string | yes | Freeform environment identifier (e.g., `"dev"`, `"staging"`, `"prod-eu-west-1"`) |
+| serviceTypes | string[] | yes | List of service types the agent can serve. Must be non-empty on initial registration (prerequisite: at least one healthy SP, embedded or external). May be empty on subsequent re-registrations when SPs become unavailable (an Unhealthy SP does not trigger service type removal — see [SP Health Monitoring](#sp-health-monitoring)). |
+| resourcesAvailable | object | no | Available resources in the environment — sourced from K8s node info or manual configuration (see below) |
+| cost | enum | yes | Cost tier: `low` \| `medium-low` \| `medium` \| `medium-high` \| `high` |
+| topicName | string | yes | Deterministic topic name for the agent's messaging channel |
+
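+Response: `201 Created` with `{agentId}` on initial registration. When the
+request updates an existing registration, DCM responds with `200 OK` (see
+[DCM Notification](#dcm-notification)).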
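+
+As an illustration, a registration request body could look like this. All
+values are hypothetical, including the topic naming scheme; only the field
+names come from the table above.
+
+```json
+{
+  "name": "agent-prod-eu-west-1",
+  "environment": "prod-eu-west-1",
+  "serviceTypes": ["container", "database"],
+  "resourcesAvailable": {
+    "totalCpu": 200,
+    "totalMemory": "1TB",
+    "totalStorage": "2TB",
+    "totalNode": 100
+  },
+  "cost": "medium",
+  "topicName": "dcm.agents.agent-prod-eu-west-1"
+}
+```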
+
+###### `resourcesAvailable` Structure
+
+The `resourcesAvailable` field is optional. When provided, it follows a similar
+structure to the SP registration metadata defined in the
+[SP Registration Flow](../sp-registration-flow/sp-registration-flow.md), but
+represents the aggregate available resources across the environment rather than
+a single SP's capacity.
+
+Example:
+
+```json
+{
+ "totalCpu": 200,
+ "totalMemory": "1TB",
+ "totalStorage": "2TB",
+ "totalNode": 100
+}
+```
+
+##### `PUT /api/v1/agents/{agentId}/heartbeat` — Agent Heartbeat
+
+| Field | Type | Required | Description |
+| --------- | ----------------- | -------- | ------------------------- |
+| timestamp | string (ISO 8601) | yes | Agent's current timestamp |
+
+Response: `200 OK`
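+
+A heartbeat request body is therefore minimal. For example (the timestamp
+value is illustrative):
+
+```json
+{
+  "timestamp": "2026-06-05T10:30:00Z"
+}
+```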
+
+### SP Registration to Agent
+
+Service Providers register to the agent rather than to DCM directly. The agent
+supports two registration mechanisms and dynamically maintains its list of
+supported service types based on registered SPs.
+
+**Service type uniqueness constraint:** Only one SP — embedded or external — may
+serve a given service type per agent. The first SP to register for a service
+type claims the slot. Subsequent registration attempts for the same service type
+are rejected.
+
+#### Embedded SP Registration
+
+At startup, the agent registers its configured embedded SPs internally. Each
+embedded SP's code lives in a dedicated package within the agent codebase and is
+enabled explicitly via a configuration field. The embedded SP code reaches the
+agent's registration logic directly — no REST call is involved.
+
+If the agent's state is not clean (e.g., an external SP already holds a service
+type slot from a prior session), the embedded SP registration for that service
+type is rejected. The agent logs a warning and continues running — this is not a
+fatal error.
+
+Because embedded SPs register at startup before external SPs can connect, they
+effectively take priority on a clean agent state.
+
+#### External SP Registration
+
+External SPs register via the REST API (`POST /api/v1/providers`), following the
+contract defined in the
+[SP Registration Flow](../sp-registration-flow/sp-registration-flow.md)
+enhancement. The agent applies the same idempotency semantics (name as natural
+key, create-or-update behavior).
+
+If the requested service type is already served by another SP (embedded or
+external), the agent rejects the registration with `409 Conflict` and a message
+identifying the conflicting provider, so the administrator can take action if
+necessary.
+
+External SPs periodically re-register with the agent to maintain their
+registration. This periodic re-registration serves as a lease renewal and
+ensures that after an agent restart (where the agent loses its in-memory state),
+SPs naturally re-register without requiring any additional coordination
+mechanism.
+
+#### DCM Notification
+
+When the list of supported service types changes as a result of an SP
+registration (embedded or external) and the agent is already registered to DCM,
+the agent updates DCM via a `POST /api/v1/agents` request with the full updated
+registration payload. If the agent has not yet registered to DCM (i.e., this is
+the first SP registering), the agent does not notify DCM yet; instead, the SP
+registration satisfies the prerequisite for the agent to proceed with its
+initial registration to DCM (see
+[Agent Registration Flow](#agent-registration-flow)).
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant SP as External SP
+ participant AG as Agent
+ participant DCM as DCM<br/>(Control Plane)
+ participant DB as Database
+
+ Note over AG: Agent starts:<br/>register embedded SPs<br/>from configuration
+
+ AG->>AG: Register embedded SPs internally<br/>(K8s Container, ACM Cluster, KubeVirt<br/>— each if enabled in config)
+
+ Note over SP: External SP starts and<br/>registers to the agent
+
+ SP->>AG: POST /api/v1/providers<br/>{name, serviceType, endpoint}
+ activate AG
+
+ alt Service type already served by another SP
+ AG-->>SP: 409 Conflict<br/>{error: "service type X already<br/>served by provider Y"}
+ else Service type available
+ AG->>AG: Store SP registration<br/>Add service type to supported list
+
+ alt Service type list changed AND agent already registered to DCM
+ AG->>DCM: POST /api/v1/agents<br/>{name, environment, serviceTypes,<br/>resourcesAvailable, cost, topicName}
+ activate DCM
+ DCM->>DB: Update agent registration
+ activate DB
+ DB-->>DCM: Registration updated
+ deactivate DB
+ DCM-->>AG: 200 OK
+ deactivate DCM
+ else Service type list changed AND agent not yet registered to DCM
+ Note over AG: Prerequisite for initial<br/>agent registration is now met<br/>(see Agent Registration Flow)
+ end
+
+ AG-->>SP: 201 Created<br/>{providerId}
+ end
+ deactivate AG
+
+ Note over SP,AG: External SP periodically<br/>re-registers to maintain its lease
+```
+
+#### Flow Description
+
+1. At startup, the agent registers its configured embedded SPs internally. Each
+ embedded SP claims a service type slot. If a slot is already occupied, the
+ agent logs a warning and continues
+2. An external SP starts and registers to the agent via a REST API call,
+ providing:
+ - Name
+ - Service type it serves
+ - Endpoint (URL where the agent can reach the SP)
+3. The agent checks whether the requested service type is already served:
+ - If **already served**: the agent rejects the registration with
+ `409 Conflict` and a message identifying the conflicting provider
+ - If **available**: the agent stores the SP registration and adds the service
+ type to its supported list
+4. If the service type list changed (new service type added):
+ - If the agent is already registered to DCM: the agent sends a
+ `POST /api/v1/agents` request to DCM with the full updated agent
+ registration; DCM updates the agent record in the database
+ - If the agent is not yet registered to DCM: the agent does not notify DCM
+ yet; instead, this SP registration satisfies the prerequisite for the
+ agent's initial registration (see
+ [Agent Registration Flow](#agent-registration-flow))
+5. The agent acknowledges the SP registration
+6. External SPs periodically re-register with the agent; the agent handles this
+ idempotently (create or update). This ensures that after an agent restart,
+ external SPs naturally rebuild the agent's state without additional
+ coordination
+
+### Agent Registration Flow
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant AG as Agent
+ participant MS as Messaging System
+ participant DCM as DCM<br/>(Control Plane)
+ participant DB as Database
+
+ Note over AG: Agent starts in<br/>target environment
+
+ AG->>MS: Create main topic (deterministic name)
+ MS-->>AG: Topic created<br/>{topicName}
+
+ AG->>MS: Create retry topic (internal)
+ MS-->>AG: Topic created<br/>{topicName}.retry
+
+ Note over AG: Prerequisite:<br/>At least 1 SP (embedded or<br/>external) must be registered<br/>and healthy<br/>(see SP Registration to Agent)
+
+ AG->>DCM: POST /api/v1/agents<br/>{name, environment, serviceTypes,<br/>resourcesAvailable, cost, topicName}
+ activate DCM
+
+ DCM->>DB: Store agent registration<br/>{name, environment, serviceTypes,<br/>resourcesAvailable, cost, topicName}
+ activate DB
+ DB-->>DCM: Registration stored
+ deactivate DB
+
+ DCM-->>AG: 201 Created<br/>{agentId}
+ deactivate DCM
+```
+
+#### Flow Description
+
+1. The agent starts and serves a specific environment
+2. The agent creates two topics in the messaging system:
+ - A **main topic** (using a deterministic name) to establish a dedicated
+ communication channel with DCM. This topic name is advertised to DCM during
+ registration.
+ - A **retry topic** (`{topicName}.retry`) used internally by the agent to
+ hold requests when the SP for a service type is Unhealthy (see
+ [Retry Topic](#retry-topic)). This topic is not advertised to DCM.
+3. The agent checks whether at least one SP (embedded or external) is registered
+ and healthy:
+ - If at least one SP is registered and healthy: the agent proceeds to
+ register to DCM
+ - Else: the agent waits until at least one SP is registered and healthy
+4. The agent registers itself with DCM via a REST API call, providing:
+ - Name
+ - Environment
+ - Supported service types
+ - Available resources
+ - Cost tier
+ - Topic name
+5. DCM persists the registration in the database
+6. DCM acknowledges the registration
+
+#### Re-Registration on Restart
+
+When the agent restarts, it uses the same `POST /api/v1/agents` endpoint with
+the same payload. The agent does not persist its `agentId`; it relies on DCM's
+idempotent registration, which uses the agent `name` as the natural key (same
+pattern as SP registration defined in the
+[SP Registration Flow](../sp-registration-flow/sp-registration-flow.md)): if the
+name already exists and no `agentId` is provided (or the same `agentId` is
+provided), DCM updates the existing entry, returns the same `agentId`, and
+resets the heartbeat tracker. The agent then uses the returned `agentId` for
+subsequent heartbeats and updates.
+
+Ensuring that each agent uses a unique name is an operational responsibility.
+
+Note that `topicName` is not unique per agent: in a future HA model, multiple
+agent replicas for the same environment may share the same topic name.
+
+### Resource Creation Flow
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant DCM as DCM<br/>(Control Plane)
+ participant MS as Messaging System
+ participant AG as Agent
+ participant EMB as Embedded SP
+ participant EXT as External SP
+
+ DCM->>MS: PUBLISH CloudEvent (creation request)<br/>topic: {agentTopicName}<br/>{resourceId, serviceType, spec}
+
+ MS->>AG: PUSH message
+ activate AG
+
+ AG->>AG: Validate requested service type<br/>is supported by a registered SP
+
+ alt Service type not supported
+ AG->>MS: PUBLISH CloudEvent<br/>{error: "unsupported service type"}
+ MS->>DCM: PUSH error message
+ else Service type supported but SP is Unhealthy
+ AG->>MS: PUBLISH CloudEvent (hold request)<br/>topic: {agentTopicName}.retry<br/>{resourceId, serviceType, spec}
+ AG->>MS: PUBLISH CloudEvent<br/>topic: dcm.agents.responses<br/>{resourceId, status: QUEUED,<br/>reason: "SP unhealthy — held for retry"}
+ MS->>DCM: PUSH queued response
+ else Service type supported and SP is Ready
+ alt SP is embedded
+ AG->>EMB: In-process call<br/>{serviceType, spec}
+ activate EMB
+ alt Creation fails
+ EMB-->>AG: Error
+ deactivate EMB
+ AG->>MS: PUBLISH CloudEvent<br/>{error: "creation failed", details}
+ MS->>DCM: PUSH error message
+ else Creation succeeds
+ EMB-->>AG: Success<br/>{instanceId, status: PROVISIONING}
+ AG->>MS: PUBLISH CloudEvent<br/>{resourceId, status: PROVISIONING}
+ MS->>DCM: PUSH creation acknowledged
+ Note over EMB: SP manages resource lifecycle<br/>and reports status through<br/>the existing status reporting flow
+ end
+ else SP is external
+ AG->>EXT: POST {spEndpoint}/api/v1/{serviceType}<br/>{spec}
+ activate EXT
+ alt Creation fails
+ EXT-->>AG: Error response
+ deactivate EXT
+ AG->>MS: PUBLISH CloudEvent<br/>{error: "creation failed", details}
+ MS->>DCM: PUSH error message
+ else Creation succeeds
+ EXT-->>AG: Success response<br/>{instanceId, status: PROVISIONING}
+ AG->>MS: PUBLISH CloudEvent<br/>{resourceId, status: PROVISIONING}
+ MS->>DCM: PUSH creation acknowledged
+ Note over EXT: SP manages resource lifecycle<br/>and reports status through<br/>the existing status reporting flow
+ end
+ end
+ end
+ deactivate AG
+```
+
+#### Flow Description
+
+1. DCM publishes a creation request CloudEvent to the agent's dedicated topic in
+ the messaging system, including the resource ID, service type, and spec
+2. The agent consumes the message
+3. The agent validates that the requested service type is supported by a
+ registered SP (embedded or external)
+4. If the service type is **not supported**:
+ - The agent publishes an error CloudEvent back to the messaging system
+ - DCM consumes the error message
+5. If the service type is **supported but the SP is Unhealthy**:
+ - The agent publishes the original request CloudEvent to the retry topic
+ (`{agentTopicName}.retry`) for durable holding
+ - The agent publishes a "queued" CloudEvent to `dcm.agents.responses` with
+ `{resourceId, serviceType, status: "QUEUED"}`, informing DCM that the
+ request is held for retry
+ - The request will be processed when the SP recovers, or rejected if the SP
+ becomes Unavailable (see [Retry Topic](#retry-topic))
+6. If the service type is **supported and the SP is Ready**:
+ - The agent forwards the creation request to the SP via REST API (for
+ external SPs) or in-process call (for embedded SPs)
+ - If the SP returns an **immediate error**: the agent publishes an error
+ CloudEvent back to the messaging system for DCM to consume
+ - If the SP **accepts** the request: the agent publishes a CloudEvent
+ acknowledging the creation is in progress. The SP takes over resource
+ lifecycle management and reports status changes through the existing status
+ reporting flow (SP → Messaging System → DCM)
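+
+As a sketch of step 1, a creation request could be encoded as follows,
+assuming the CloudEvents 1.0 JSON format. The `data` fields (`resourceId`,
+`serviceType`, `spec`) come from the flow above; the `id`, `source`, and `type`
+values and the contents of `spec` are placeholders that this document does not
+define.
+
+```json
+{
+  "specversion": "1.0",
+  "id": "6f1c2d9e-0001",
+  "source": "/dcm/control-plane",
+  "type": "dcm.agent.create-request",
+  "time": "2026-06-05T10:30:00Z",
+  "datacontenttype": "application/json",
+  "data": {
+    "resourceId": "res-12345",
+    "serviceType": "container",
+    "spec": {}
+  }
+}
+```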
+
+#### Service Type Uniqueness
+
+Each service type is served by exactly one SP (embedded or external). There is
+no SP selection strategy in the current version. Future iterations may support
+multiple SPs per service type with selection strategies (e.g., affinity-based,
+capacity-based).
+
+#### Retry Policy
+
+When the agent forwards a creation request to an SP and the SP returns an error,
+the agent applies a configurable retry policy. When retries are exhausted, the
+agent publishes an error CloudEvent to the messaging system with the resource ID
+(provided by DCM in the original creation request), allowing DCM to track the
+failure.
+
+#### Retry Topic
+
+When the SP for a given service type is Unhealthy, the agent cannot route
+requests but the service type remains advertised to DCM (to avoid registration
+flapping). Instead of rejecting the request, the agent publishes it to a
+dedicated **retry topic** (`{agentTopicName}.retry`) for durable holding, and
+responds to DCM with a "queued" CloudEvent.
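+
+A "queued" response published to `dcm.agents.responses` might look like the
+following, again assuming the CloudEvents 1.0 JSON format. The event type
+`dcm.agent.request-queued` is the one named in the Open Questions; the `id`,
+`source`, and `data` values are illustrative only.
+
+```json
+{
+  "specversion": "1.0",
+  "id": "6f1c2d9e-0002",
+  "source": "/agents/agent-prod-eu-west-1",
+  "type": "dcm.agent.request-queued",
+  "time": "2026-06-05T10:30:05Z",
+  "datacontenttype": "application/json",
+  "data": {
+    "resourceId": "res-12345",
+    "serviceType": "database",
+    "status": "QUEUED",
+    "reason": "SP unhealthy, held for retry"
+  }
+}
+```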
+
+The retry topic is created by the agent at startup alongside the main topic (see
+[Agent Registration Flow](#agent-registration-flow)). It is internal to the
+agent and is not advertised to DCM.
+
+**Message format:** The original CloudEvent is published to the retry topic
+as-is (passthrough, no wrapping).
+
+**Consumption is event-driven.** The agent reads the retry topic only when an SP
+health state changes — not periodically:
+
+- **SP transitions to Ready:** The agent consumes the retry topic. For each
+ message whose service type now has a Ready SP, the agent processes the request
+ (forwards to the SP, responds to DCM with success or error). Messages for
+ service types whose SP is still Unhealthy are re-published to the retry topic.
+- **SP transitions to Unavailable:** The agent consumes the retry topic. For
+ each message whose service type's SP is Unavailable, the agent rejects the
+ request with an error CloudEvent to DCM. Messages for other service types are
+ re-published to the retry topic.
+- **No health state change:** The retry topic is not consumed.
+
+**Creation/Deletion dedup:** If both a creation request and a deletion request
+for the same resource ID are present in the retry topic, both messages are
+removed — they cancel out since the resource was never created. The agent logs
+the cancellation and acknowledges the deletion to DCM. The creation request is
+silently dropped since it was never started.
+
+**Ordering:** Requests are processed in arrival order per service type. Requests
+for different service types are independent.
+
+**Durability:** Messages in the retry topic survive agent crashes, guaranteed by
+the messaging system's persistence layer. On restart, the agent re-reads both
+the main topic and the retry topic.
+
+#### In-Flight Request Handling
+
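+When the agent restarts, unconsumed messages on both the main topic and the
+retry topic are consumed once the agent is back up (guaranteed by the messaging
+system's persistence layer). Requests whose SP is Ready are processed normally.
+Otherwise, handling depends on the SP's health state: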
+
+- **SP is Unhealthy:** The agent publishes the request to the retry topic and
+ responds to DCM with a "queued" CloudEvent. The request is processed when the
+ SP recovers, or rejected when the SP for that service type becomes Unavailable
+ (see [Retry Topic](#retry-topic)).
+- **SP is Unavailable:** The agent responds with an error CloudEvent for each
+ incoming request targeting that service type. Additionally, the agent drains
+ the retry topic, rejecting any held requests for that service type with error
+ CloudEvents.
+
+### Resource Deletion Flow
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant DCM as DCM<br/>(Control Plane)
+ participant MS as Messaging System
+ participant AG as Agent
+ participant EMB as Embedded SP
+ participant EXT as External SP
+
+ DCM->>MS: PUBLISH CloudEvent (deletion request)<br/>topic: {agentTopicName}<br/>{resourceId, serviceType}
+
+ MS->>AG: PUSH message
+ activate AG
+
+ AG->>AG: Validate requested service type<br/>is supported by a registered SP
+
+ alt Service type not supported
+ AG->>MS: PUBLISH CloudEvent<br/>{error: "unsupported service type"}
+ MS->>DCM: PUSH error message
+ else Service type supported but SP is Unhealthy
+ AG->>MS: PUBLISH CloudEvent (hold request)<br/>topic: {agentTopicName}.retry<br/>{resourceId, serviceType}
+ AG->>MS: PUBLISH CloudEvent<br/>topic: dcm.agents.responses<br/>{resourceId, status: QUEUED,<br/>reason: "SP unhealthy — held for retry"}
+ MS->>DCM: PUSH queued response
+ else Service type supported and SP is Ready
+ alt SP is embedded
+ AG->>EMB: In-process call<br/>{serviceType, resourceId}
+ activate EMB
+ alt Deletion fails
+ EMB-->>AG: Error
+ deactivate EMB
+ AG->>MS: PUBLISH CloudEvent<br/>{error: "deletion failed",<br/>resourceId, details}
+ MS->>DCM: PUSH error message
+ else Deletion succeeds
+ EMB-->>AG: Success<br/>{resourceId, status: DELETING}
+ AG->>MS: PUBLISH CloudEvent<br/>{resourceId, status: DELETING}
+ MS->>DCM: PUSH deletion acknowledged
+ Note over EMB: SP manages resource deletion<br/>and reports final status through<br/>the existing status reporting flow
+ end
+ else SP is external
+ AG->>EXT: DELETE {spEndpoint}/api/v1/{serviceType}/{resourceId}
+ activate EXT
+ alt Deletion fails
+ EXT-->>AG: Error response
+ deactivate EXT
+ AG->>MS: PUBLISH CloudEvent<br/>{error: "deletion failed",<br/>resourceId, details}
+ MS->>DCM: PUSH error message
+ else Deletion succeeds
+ EXT-->>AG: Success response<br/>{resourceId, status: DELETING}
+ AG->>MS: PUBLISH CloudEvent<br/>{resourceId, status: DELETING}
+ MS->>DCM: PUSH deletion acknowledged
+ Note over EXT: SP manages resource deletion<br/>and reports final status through<br/>the existing status reporting flow
+ end
+ end
+ end
+ deactivate AG
+```
+
+#### Flow Description
+
+1. DCM publishes a deletion request CloudEvent to the agent's dedicated topic in
+ the messaging system, including the resource ID and service type
+2. The agent consumes the message
+3. The agent validates that the requested service type is supported by a
+ registered SP (embedded or external)
+4. If the service type is **not supported**:
+ - The agent publishes an error CloudEvent back to the messaging system
+ - DCM consumes the error message
+5. If the service type is **supported but the SP is Unhealthy**:
+ - The agent publishes the original request to the retry topic for durable
+ holding
+ - The agent publishes a "queued" CloudEvent to `dcm.agents.responses`,
+ informing DCM that the request is held for retry
+ - The request will be processed when the SP recovers, or rejected if the SP
+ becomes Unavailable (see [Retry Topic](#retry-topic))
+6. If the service type is **supported and the SP is Ready**:
+ - The agent forwards the deletion request to the SP via a REST `DELETE` call
+ (for external SPs) or in-process call (for embedded SPs)
+ - If the SP returns an **immediate error**: the agent publishes an error
+ CloudEvent back to the messaging system for DCM to consume
+ - If the SP **accepts** the request: the agent publishes a CloudEvent
+ acknowledging the deletion is in progress. The SP manages the actual
+ resource deletion and reports the final status through the existing status
+ reporting flow (SP → Messaging System → DCM)
+
+The retry policy and in-flight request handling described in the
+[Resource Creation Flow](#resource-creation-flow) apply equally to deletion
+requests.
+
+### Health
+
+#### Agent Health
+
+The agent reports its own liveness to DCM via periodic REST heartbeats. The
+messaging system carries resource operations (creation and deletion requests,
+status updates), so the heartbeat reuses the REST channel that the agent
+already uses for registration.
+
+DCM tracks the last heartbeat timestamp for each agent. If no heartbeat is
+received within a configurable threshold, DCM marks the agent as
+**Unavailable**.
+
+On startup, the agent registers to DCM (as described in
+[Agent Registration Flow](#agent-registration-flow)). If the agent restarts, it
+re-registers to DCM; DCM handles this idempotently, resetting the heartbeat
+tracker.
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant AG as Agent
+ participant DCM as DCM<br/>(Control Plane)
+ participant DB as Database
+
+ loop Every {heartbeatInterval} seconds
+ AG->>DCM: PUT /api/v1/agents/{agentId}/heartbeat<br/>{timestamp}
+ activate DCM
+ DCM->>DB: Update last heartbeat timestamp
+ DB-->>DCM: Updated
+ DCM-->>AG: 200 OK
+ deactivate DCM
+ end
+
+ Note over DCM: No heartbeat received<br/>within {threshold} seconds
+
+ DCM->>DB: Mark agent as Unavailable
+ activate DB
+ DB-->>DCM: Updated
+ deactivate DB
+```
+
+##### Flow Description
+
+1. The agent periodically sends a heartbeat to DCM via a REST `PUT` call
+2. DCM updates the agent's last heartbeat timestamp in the database
+3. If DCM does not receive a heartbeat within the configured threshold, it marks
+ the agent as **Unavailable**
+4. When the agent restarts, its initial registration to DCM resets the heartbeat
+ tracker and the agent status
+
+#### SP Health Monitoring
+
+The agent monitors the health of its registered Service Providers using the
+three-state health model defined in the
+[Service Provider Health Check enhancement](../service-provider-health-check/service-provider-health-check.md).
+The monitoring mechanism differs by SP type:
+
+- **Embedded SPs:** Health is determined in-process — the agent directly checks
+ the embedded SP's internal state without a network call.
+- **External SPs:** Health is determined by polling the SP's `GET /health`
+ endpoint.
+
+| State | Condition |
+| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
+| **Ready** | SP responds with `200 OK` and `status: "healthy"` (external), or internal check passes (embedded) |
+| **Unhealthy** | SP responds with `200 OK` and `status: "unhealthy"` (external), or internal check reports unhealthy (embedded) |
+| **Unavailable** | SP does not respond or returns an error after exceeding the failure threshold (external), or internal check reports unavailable (embedded) |
+
+With the agent layer, the responsibility for monitoring SP health shifts from
+DCM to the agent. The agent is the natural point to perform health checks on its
+registered SPs, as it already maintains the list of SP registrations.
+
+The agent only routes requests to SPs in the **Ready** state. An SP in the
+**Unhealthy** or **Unavailable** state is not eligible for routing, even though
+an Unhealthy SP may be technically reachable. When the SP for a service type is
+Unhealthy, incoming requests are held in the retry topic rather than rejected
+(see [Retry Topic](#retry-topic)).
+
+Since each service type is served by exactly one SP, the agent's behavior is
+determined by that SP's health state:
+
+**When the SP becomes Unhealthy:**
+
+1. The agent **keeps** the service type in its advertised list (no update sent
+ to DCM to remove it)
+2. The agent stops routing new requests for that service type — incoming
+ requests are held in the retry topic and a "queued" CloudEvent is sent to DCM
+3. The agent publishes a health warning CloudEvent to `dcm.agents.health` with
+ type `service-type-degraded`
+
+**When the SP becomes Unavailable:**
+
+1. The agent removes the service type from its advertised list
+2. The agent sends a `POST /api/v1/agents` request to DCM with the updated
+ registration (service types list without the affected type)
+3. The agent drains the retry topic: all held requests for that service type are
+ rejected with error CloudEvents to DCM
+4. The agent publishes a health warning CloudEvent to `dcm.agents.health` with
+ type `service-type-unavailable`
+
+**When a previously unhealthy or unavailable SP recovers** (returns to Ready
+state):
+
+1. If the service type was removed (Unavailable case): the agent re-adds it to
+ its list and sends a `POST /api/v1/agents` to DCM with the updated
+ registration
+2. The agent processes held requests from the retry topic for that service type
+ (see [Retry Topic](#retry-topic))
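+
+As an illustration of the health warning described above, a
+`service-type-degraded` event might be serialized as follows. This is a sketch
+that assumes the CloudEvents JSON format; the `id`, `time`, agent, topic,
+provider, and `reason` values are hypothetical placeholders, and only the
+`type`, `source`, `subject`, and `data` field names come from the
+[CloudEvent Message Definitions](#cloudevent-message-definitions) table.
+
+```json
+{
+  "specversion": "1.0",
+  "id": "5c1d7f0e-2a4b-4c8e-9f3a-6b7d8e9f0a1b",
+  "type": "dcm.agent.health.service-type-degraded",
+  "source": "dcm/agents/agent-prod-eu-01",
+  "subject": "dcm.agents.health",
+  "time": "2026-06-03T10:15:00Z",
+  "datacontenttype": "application/json",
+  "data": {
+    "agentId": "agent-prod-eu-01",
+    "agentName": "prod-eu-agent",
+    "topicName": "dcm.agents.prod-eu-agent",
+    "serviceType": "database",
+    "reason": "SP reported status unhealthy",
+    "affectedProvider": "sp-db-001"
+  }
+}
+```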
+
+##### Agent Status
+
+The agent exposes the health status of all registered SPs via the
+`GET /api/v1/status` endpoint (see
+[Agent Endpoints — `GET /api/v1/status`](#get-apiv1status--agent-status) for the
+response format).
+
+##### Pod Conditions (Kubernetes / OpenShift)
+
+On Kubernetes or OpenShift deployments, the agent additionally exposes the
+health status of each registered SP as custom pod conditions on its own pod.
+This complements the `/api/v1/status` endpoint and allows administrators to
+inspect the agent's pod (e.g., via `oc describe pod`) and immediately see which
+SPs are healthy, unhealthy, or unavailable without having to query the agent's
+REST API.
+
+Each registered SP is represented as a separate pod condition, using the SP's
+provider ID and service type (`{providerId}/{serviceType}`, as in the example
+below) as the condition type. The condition's `status` field reflects whether
+the SP is healthy (`True`) or not (`False`), and the `reason` and `message`
+fields provide additional context.
+
+Example output from `oc describe pod <agent-pod-name>`:
+
+```
+Conditions:
+ Type Status Reason Message
+ sp-vm-001/vm True Ready SP vm-provider serving service type vm is healthy
+ sp-db-001/database False Unhealthy SP db-provider serving service type database is unhealthy
+```
+
+###### Implementation Detail
+
+The agent uses
+[Pod Readiness Gates](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate)
+to surface per-SP health as custom pod conditions. The agent's pod spec declares
+a readiness gate for each expected condition type, and the agent application
+patches its own pod's `status.conditions` via the Kubernetes API using
+in-cluster authentication (`rest.InClusterConfig()` or equivalent). This
+requires RBAC permissions on the `pods/status` subresource for the agent's
+service account.
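+
+As a sketch of what such a patch could look like, the request body below sets
+one SP condition on the agent's pod. It assumes a strategic merge patch sent to
+the pod's `status` subresource, which merges `conditions` entries by `type`; the
+condition values mirror the example output above, and the timestamp is a
+hypothetical placeholder.
+
+```json
+{
+  "status": {
+    "conditions": [
+      {
+        "type": "sp-db-001/database",
+        "status": "False",
+        "reason": "Unhealthy",
+        "message": "SP db-provider serving service type database is unhealthy",
+        "lastTransitionTime": "2026-06-03T10:15:00Z"
+      }
+    ]
+  }
+}
+```
+
+Because each condition type is also declared as a readiness gate, Kubernetes
+reports the pod as Ready only while every gated condition is `True`, so an
+unhealthy SP would also mark the agent pod as not Ready. Whether that coupling
+is desirable is a design choice to confirm.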
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant AG as Agent
+ participant SP as External SP
+ participant MS as Messaging System
+ participant DCM as DCM<br/>(Control Plane)
+ participant DB as Database
+
+ Note over AG: Embedded SPs: health<br/>checked in-process
+
+ loop Every {healthCheckInterval} seconds (external SPs)
+ AG->>SP: GET /health
+ alt Healthy
+ SP-->>AG: 200 OK<br/>{status: "healthy"}
+ AG->>AG: Reset failure counter<br/>Mark SP as Ready
+ else Unhealthy
+ SP-->>AG: 200 OK<br/>{status: "unhealthy"}
+ AG->>AG: Mark SP as Unhealthy
+ else No response / error
+ SP-->>AG: Timeout / Error
+ AG->>AG: Increment failure counter
+ Note over AG: If counter >= threshold:<br/>Mark SP as Unavailable
+ end
+ end
+
+ alt SP for service type X becomes Unhealthy
+ Note over AG: Keep service type X<br/>in advertised list.<br/>Hold incoming requests<br/>in retry topic.
+
+ AG->>MS: PUBLISH CloudEvent<br/>topic: dcm.agents.health<br/>{type: "service-type-degraded",<br/>agentId, serviceType, reason,<br/>affectedProvider}
+ MS->>DCM: PUSH health warning
+
+ else SP for service type X becomes Unavailable
+ AG->>DCM: POST /api/v1/agents<br/>{updated serviceTypes without X}
+ activate DCM
+ DCM->>DB: Update agent registration
+ DB-->>DCM: Updated
+ DCM-->>AG: 200 OK
+ deactivate DCM
+
+ Note over AG: Drain retry topic:<br/>reject held requests for<br/>service type X
+
+ AG->>MS: PUBLISH CloudEvent(s)<br/>topic: dcm.agents.responses<br/>{error: "SP unavailable"}<br/>for each held request
+
+ AG->>MS: PUBLISH CloudEvent<br/>topic: dcm.agents.health<br/>{type: "service-type-unavailable",<br/>agentId, serviceType, reason,<br/>affectedProvider}
+ MS->>DCM: PUSH health warning
+
+ else Previously unhealthy/unavailable SP recovers to Ready
+ Note over AG: Re-add service type if removed.<br/>Process held requests<br/>from retry topic.
+
+ opt Service type was removed (Unavailable case)
+ AG->>DCM: POST /api/v1/agents<br/>{updated serviceTypes with X}
+ activate DCM
+ DCM->>DB: Update agent registration
+ DB-->>DCM: Updated
+ DCM-->>AG: 200 OK
+ deactivate DCM
+ end
+
+ AG->>SP: Forward held requests from retry topic
+ SP-->>AG: Responses
+ AG->>MS: PUBLISH CloudEvent(s)<br/>topic: dcm.agents.responses<br/>{success/error for each}
+ end
+```
+
+##### Flow Description
+
+1. The agent monitors each registered SP's health:
+ - **Embedded SPs:** health checked in-process (no network call)
+ - **External SPs:** health checked by periodically polling `GET /health`
+2. Based on the result, the agent updates the SP's health state:
+ - Healthy → **Ready** (failure counter reset)
+ - Unhealthy → **Unhealthy**
+ - Timeout or error (external) / internal failure (embedded) → increment
+ failure counter; if the counter reaches the threshold → **Unavailable**
+3. When the SP for a service type becomes **Unhealthy**:
+ - The agent **keeps** the service type in its advertised list (no update sent
+ to DCM)
+ - Incoming requests for that service type are held in the retry topic (see
+ [Retry Topic](#retry-topic))
+ - The agent publishes a `service-type-degraded` health warning CloudEvent to
+ the `dcm.agents.health` topic
+4. When the SP for a service type becomes **Unavailable**:
+ - The agent removes the service type from its advertised list
+ - The agent sends a `POST /api/v1/agents` to DCM with the updated
+ registration
+ - The agent drains the retry topic: all held requests for that service type
+ are rejected with error CloudEvents to DCM
+ - The agent publishes a `service-type-unavailable` health warning CloudEvent
+ to the `dcm.agents.health` topic
+5. When a previously unhealthy or unavailable SP recovers:
+ - If the service type was removed (Unavailable case): the agent re-adds it to
+ its list and sends a `POST /api/v1/agents` to DCM with the updated
+ registration
+ - The agent processes held requests from the retry topic for that service
+ type
+6. The agent exposes the health status of all registered SPs (both embedded and
+ external) via the `GET /api/v1/status` endpoint. On Kubernetes/OpenShift
+ deployments, the agent additionally surfaces this information as custom pod
+ conditions on its own pod (see
+ [Pod Conditions](#pod-conditions-kubernetes--openshift))
+
+### CloudEvent Message Definitions
+
+All messages exchanged through the messaging system use the
+[CloudEvents v1.0](https://github.com/cloudevents/spec/blob/v1.0.2/cloudevents/spec.md)
+specification, following the conventions established in the
+[Service Provider Status Reporting](../state-management/service-provider-status-reporting.md)
+enhancement.
+
+All agent-originated CloudEvents include `agentName` and `topicName` in the data
+payload for correlation, in addition to the `source` envelope attribute. This
+allows DCM to identify both the resource and the originating agent when
+consuming from the shared `dcm.agents.responses` subject.
+
+The `spec` field in creation request data follows the schema defined by the
+target service type (see
+[SP Resource Manager](../sp-resource-manager/sp-resource-manager.md),
+[Placement Manager](../placement-manager/placement-manager.md)).
+
+| Message | `type` | `source` | `subject` | `data` |
+| --------------------- | ------------------------------------------- | ---------------------- | ---------------------- | ------------------------------------------------------------------------ |
+| Creation Request | `dcm.request.create` | `dcm/control-plane` | `{agentTopicName}` | `{resourceId, serviceType, spec}` |
+| Deletion Request | `dcm.request.delete` | `dcm/control-plane` | `{agentTopicName}` | `{resourceId, serviceType}` |
+| Creation Acknowledged | `dcm.agent.creation-acknowledged` | `dcm/agents/{agentId}` | `dcm.agents.responses` | `{resourceId, agentName, topicName, status: "PROVISIONING"}` |
+| Deletion Acknowledged | `dcm.agent.deletion-acknowledged` | `dcm/agents/{agentId}` | `dcm.agents.responses` | `{resourceId, agentName, topicName, status: "DELETING"}` |
+| Request Queued | `dcm.agent.request-queued` | `dcm/agents/{agentId}` | `dcm.agents.responses` | `{resourceId, agentName, topicName, serviceType, status: "QUEUED"}` |
+| Error | `dcm.agent.error` | `dcm/agents/{agentId}` | `dcm.agents.responses` | `{resourceId, agentName, topicName, error, details}` |
+| Health Degraded | `dcm.agent.health.service-type-degraded` | `dcm/agents/{agentId}` | `dcm.agents.health` | `{agentId, agentName, topicName, serviceType, reason, affectedProvider}` |
+| Health Unavailable | `dcm.agent.health.service-type-unavailable` | `dcm/agents/{agentId}` | `dcm.agents.health` | `{agentId, agentName, topicName, serviceType, reason, affectedProvider}` |
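+
+For illustration, a Request Queued message from the table above might look
+like the following on the wire. This is a sketch that assumes the CloudEvents
+JSON format; the `id`, `time`, agent, topic, and resource values are
+hypothetical placeholders rather than values defined by this enhancement.
+
+```json
+{
+  "specversion": "1.0",
+  "id": "b3f1c2d4-5e6f-4a7b-8c9d-0e1f2a3b4c5d",
+  "type": "dcm.agent.request-queued",
+  "source": "dcm/agents/agent-prod-eu-01",
+  "subject": "dcm.agents.responses",
+  "time": "2026-06-03T10:15:00Z",
+  "datacontenttype": "application/json",
+  "data": {
+    "resourceId": "08aa81d1-a0d2-4d5f-a4df-b80addf07781",
+    "agentName": "prod-eu-agent",
+    "topicName": "dcm.agents.prod-eu-agent",
+    "serviceType": "vm",
+    "status": "QUEUED"
+  }
+}
+```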
+
+### Assumptions
+
+- A messaging system (e.g., NATS) is deployed and accessible to both DCM and the
+ agent
+- The agent has outbound network connectivity to DCM's REST API (for
+ registration and heartbeats)
+- External SPs have network connectivity to the agent's REST API (for
+ registration), and the agent has network connectivity to each external SP's
+ endpoint (for health checks and request forwarding)
+- For Kubernetes/OpenShift deployments: the agent's service account has RBAC
+ permissions for the `pods/status` subresource
+
+### Risks and Mitigations
+
+| Risk | Mitigation |
+| -------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Agent is a single point of failure per environment | Deferred to HA iteration. Agent restart recovers state: embedded SPs register internally at startup; external SPs periodically re-register, naturally rebuilding the agent's state. |
+| Messaging system failure blocks creation requests | Dependent on chosen bus technology's delivery guarantees. Stated as an assumption. |
+| Message loss with at-most-once semantics | Rely on bus capabilities (e.g., JetStream for NATS). Specific delivery guarantee is a deployment decision. |
+| Split-brain: agent loses DCM connectivity but keeps processing | On reconnection, the agent re-registers to DCM. During the split, DCM marks the agent as unavailable and stops routing new requests to its topic. In-flight messages are processed normally. Duplicate creation risk if DCM re-routes to another agent is mitigated by idempotent resource creation (resource ID provided by DCM in the creation request). |
+| Unauthenticated external SP registration | Deferred to AuthN/Z iteration. Network isolation is the interim mitigation. |
+| Embedded SP crash takes down the agent | Embedded SPs run in-process; a panic/crash affects the entire agent. Mitigation: embedded SP code is well-tested and isolated in dedicated packages. Process-level restart recovers state via re-registration. |
+
+## Drawbacks
+
+- Adds operational complexity: a new binary (the agent) must be deployed,
+ configured, and monitored per environment
+- Adds latency to the creation path: DCM → messaging system → agent → SP, versus
+ the current DCM → SP direct call
+- Fragments health monitoring responsibility: DCM monitors agent health via
+ heartbeats, while the agent monitors SP health directly (in-process for
+ embedded SPs, via polling for external SPs)
+- Requires messaging system infrastructure accessible to both DCM and all target
+ environments
+- Embedding SP code (K8s Container, ACM Cluster, KubeVirt) increases agent
+ binary size and couples the agent release cycle to the embedded SPs for
+ updates
+
+## Alternatives
+
+### Alternative 1: Watch / Reconcile Pattern
+
+#### Description
+
+Instead of using a messaging system for creation requests, DCM would expose
+resource requests through its own API. The agent would poll DCM's API or be
+notified by DCM of new events, discover pending resource requests targeting its
+environment, and reconcile them by forwarding the creation request to the
+relevant SP and reporting the result back to DCM. This mimics the Kubernetes
+controller pattern (watch → reconcile) but with DCM acting as the API server
+rather than a Kubernetes cluster.
+
+#### Pros
+
+- Familiar pattern for teams experienced with Kubernetes controllers
+- Could eliminate the messaging system dependency for creation requests
+- DCM retains full visibility of pending requests (they live in DCM's own
+ storage, not in a bus topic)
+- No additional infrastructure beyond DCM itself — the agent only needs outbound
+ connectivity to DCM's API, which it already has for registration and
+ heartbeats
+
+#### Cons
+
+- Requires DCM to implement watch/notification semantics natively, which adds
+ complexity to the control plane
+- The messaging system is still required for status reporting (SP → bus → DCM),
+ so this does not fully eliminate the messaging infrastructure dependency
+- Maturity of a DCM-native watch system is unproven compared to established
+ messaging systems (e.g., NATS JetStream)
+
+#### Status
+
+Deferred
+
+#### Rationale
+
+The watch/reconcile pattern's main advantage is eliminating the messaging system
+for creation requests and keeping all request state within DCM. However, the
+messaging system is already required for status reporting (SP → bus → DCM), so
+removing it for creation requests alone does not eliminate the infrastructure
+dependency.
+
+Additionally, DCM does not currently expose watch/notification semantics.
+Building a reliable, scalable watch system into DCM requires further
+investigation — particularly around delivery guarantees, fan-out to multiple
+agents, and behaviour under network partitions. This is deferred to a future
+iteration when the trade-offs are better understood and the maturity level of a
+DCM-native watch system can be assessed.
+
+## Cross-Cutting Impact
+
+The following enhancement documents will need to be updated to reflect the
+changes introduced by this enhancement. These updates will be done in subsequent
+PRs.
+
+| Document | Impact |
+| -------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| [SP Registration Flow](../sp-registration-flow/sp-registration-flow.md) | External SPs register to the agent instead of DCM. The existing registration API contract remains valid for the agent's REST API, but DCM's registration handler no longer receives SP registrations directly. Embedded SPs register internally and do not use this flow. |
+| [Service Provider Health Check](../service-provider-health-check/service-provider-health-check.md) | Health polling responsibility shifts from DCM to the agent. DCM monitors agent health via heartbeats instead of polling individual SPs. |
+| [SP Resource Manager](../sp-resource-manager/sp-resource-manager.md) | SPRM publishes creation requests to the agent's bus topic instead of calling SP REST endpoints directly. SPRM interacts with the agent (not individual SPs) for health status. From SPRM's perspective, the agent serves the same role as an SP: provisioning service types. |
+| [Placement Manager](../placement-manager/placement-manager.md) | Policy evaluation may now include environment as a selection criterion. Placement Manager delegates to SPRM, which routes through the messaging system. |
+| [User Flows](../user-flows/user-flows.md) | End-to-end flows must include the agent layer between DCM and SPs. |
+
+Additionally, DCM should monitor consumer lag on agent topics in a future
+iteration. If lag exceeds a configurable threshold, DCM could stop routing new
+requests to that agent to avoid further congestion. A new agent state (e.g.,
+"Congested") could be introduced for this purpose.
diff --git a/enhancements/placement-manager/placement-manager.md b/enhancements/placement-manager/placement-manager.md
index 2b5c0a0..afd1f40 100644
--- a/enhancements/placement-manager/placement-manager.md
+++ b/enhancements/placement-manager/placement-manager.md
@@ -12,6 +12,8 @@ reviewers:
- "@gabriel-farache"
- "@ebichman"
creation-date: 2026-01-09
+see-also:
+ - /enhancements/environment-agent/environment-agent.md
---
# Placement Manager
@@ -20,19 +22,23 @@ creation-date: 2026-01-09
The Placement Manager orchestrates resource requests within DCM core. It
receives user requests through the Catalog Manager, validates and enriches them
-through the Policy Manager, and delegates instance creation to the SP Resource
-Manager. The Placement Manager focuses on request orchestration and
-coordination.
+through the Policy Manager (which now selects an Agent), and delegates instance
+creation and deletion to the SP Resource Manager, which routes through the
+Messaging System to an Agent. The Placement Manager also handles queued-request
+timeout logic when an Agent reports that the Service Provider for the requested
+service type is unhealthy.
## Motivation
### Goals
-- Define end-to-end flow of for creating resources
+- Define end-to-end flow for creating resources
+- Define end-to-end flow for deleting resources (deletion flow)
- Define _Create_, _Read_, _Delete_ endpoints for Placement Manager
-- Define Placement Manager interacts with other services within DCM core
+- Define how Placement Manager interacts with other services within DCM core
(Catalog Manager, Policy Manager, SP Resource Manager)
- Define orchestration responsibilities for Placement Manager
+- Define queued-request timeout logic for agent-based routing
### Non-Goals
@@ -44,8 +50,12 @@ coordination.
The Placement Manager acts as the central orchestration service within DCM core,
coordinating between user requests (from Catalog), policy validation, and
-catalog instance creation. The following diagram illustrates the system
-architecture and component interactions.
+instance lifecycle management. The Policy Manager selects an Agent, and the SP
+Resource Manager publishes requests to the Agent's messaging topic. The Agent
+internally routes to its Service Providers.
+
+The following diagram illustrates the system architecture and component
+interactions.
```mermaid
%%{init: {'flowchart': {'rankSpacing': 100, 'nodeSpacing': 10, 'curve': 'linear'},}}%%
@@ -55,26 +65,28 @@ flowchart TD
classDef policyEngine fill:#2d2d2d,color:#ffffff,stroke:#ffb74d,stroke-width:2px
classDef spResourceManager fill:#2d2d2d,color:#ffffff,stroke:#81c784,stroke-width:2px
classDef database fill:#2d2d2d,color:#ffffff,stroke:#f48fb1,stroke-width:2px
+ classDef messaging fill:#2d2d2d,color:#ffffff,stroke:#ff8a65,stroke-width:2px
+ classDef agent fill:#2d2d2d,color:#ffffff,stroke:#a5d6a7,stroke-width:2px
classDef dcmCore fill:#FFFFFF,stroke:#bdbdbd,stroke-width:2px
CM["**Catalog Manager**<br/>Send Request"]:::catalogManager
subgraph DCM_Core [ ]
- PM["**Placement Manager**<br/>"]:::placementManager
-
- PE["**Policy Manager**<br/>Request Validation<br/>Payload Mutation<br/>SP Selection"]:::policyEngine
-
- SPRM["**SP Resource Manager**<br/>Create Instance<br/>Read Instances<br/>Delete Instances"]:::spResourceManager
-
+ PM["**Placement Manager**<br/>Orchestrate & Timeout"]:::placementManager
+ PE["**Policy Manager**<br/>Request Validation<br/>Payload Mutation<br/>Agent Selection"]:::policyEngine
+ SPRM["**SP Resource Manager**<br/>Publish to Agent Topic<br/>Consume Responses"]:::spResourceManager
PM_DB[("**Placement DB**<br/>Store Intent<br/>Store validated request")]:::database
-
end
+ MS["**Messaging System**<br/>(NATS)"]:::messaging
+ AG["**Agent**<br/>Routes to SPs"]:::agent
+
CM --> PM
PM --> PE
PM --> PM_DB
PM --> SPRM
-
+ SPRM --> MS
+ MS --> AG
class DCM_Core dcmCore
```
@@ -83,15 +95,19 @@ flowchart TD
#### Catalog Service
-- Receives resource creation requests from users
+- Receives resource creation and deletion requests from users
- Provides REST API endpoints for _create_, _read_, _delete_ operations on
catalog instances
- Returns responses and error messages to users
#### Policy Manager
-- Sends requests for validation via `POST /api/v1/engine/evaluate`
-- Receives validated/mutated payload and selected Service Provider
+- Sends requests for validation via
+ `POST /api/v1alpha1/policies:evaluateRequest`
+- Provides `available_agents` metadata in the evaluation request
+- Optionally includes `exclude_agents` to exclude agents from consideration
+ (e.g., after a queued-request timeout)
+- Receives validated/mutated payload and selected Agent (`agentName`)
- Receives policy rejections and constraint violations responses and forwards to
the users
@@ -99,13 +115,18 @@ flowchart TD
- Delegates instance creation, read, and delete operations to SP Resource
Manager
-- Forwards validated requests with selected SP name
+- Forwards `agentName`, `serviceType`, and `spec` in requests
+- SPRM publishes to the agent's messaging topic
- Receives responses and forwards to the users
+- SPRM reports back: success (202), error, or queued status
+- When SPRM reports "queued" status, PM handles timeout logic (see
+ [Queued-Request Handling](#queued-request-handling))
#### Database
- Stores the intent (original request) of the user request
-- Store validated request and enables rehydration process
+- Stores validated request (including `agentName`) and enables rehydration
+ process
- Maintains record of all resources created through Placement Manager
### API Endpoints
@@ -123,7 +144,7 @@ resources.
| DELETE | /api/v1/resources/{resourceId} | Delete a resource |
| GET | /api/v1/health | Placement Manager health check |
-**POST /api/v1/resources - Create an resource.**
+**POST /api/v1/resources - Create a resource.**
The POST endpoint creates a resource that is supported by DCM. The resource
request is an instance of a catalog item and originates from the user (UI)
@@ -151,6 +172,8 @@ requestBody:
description: |
Service specification following one of the supported service type
schemas (VMSpec, ContainerSpec, DatabaseSpec, or ClusterSpec).
+ The `serviceType` field within the spec determines which Agent
+ and Service Provider can fulfill the request.
additionalProperties: true
```
@@ -180,7 +203,7 @@ Response payload: Returns 201 Created if successful.
"id": "08aa81d1-a0d2-4d5f-a4df-b80addf07781",
"path": "resources/08aa81d1-a0d2-4d5f-a4df-b80addf07781",
"catalogItemInstanceId": "4baa35eb-e70d-4d37-867d-0f4efa21d05c",
- "providerName": "kubevirt-sp",
+ "agentName": "prod-eu-agent",
"spec": {
"serviceType": "vm",
"vcpu": { "count": 2 },
@@ -197,8 +220,7 @@ Response payload: Returns 201 Created if successful.
**Note**: This is **only** an example of the payload.
-**GET /api/v1/resources**
-List all resources according to AEP standards.
+**GET /api/v1/resources** List all resources according to AEP standards.
Example of Response Payload
@@ -209,7 +231,7 @@ Example of Response Payload
"id": "696511df-1fcb-4f66-8ad5-aeb828f383a0",
"path": "resources/696511df-1fcb-4f66-8ad5-aeb828f383a0",
"catalogItemInstanceId": "52540146-6212-4514-b534-0c3127b2836f",
- "providerName": "container-sp",
+ "agentName": "prod-us-agent",
"spec": {
"serviceType": "container",
"image": { "reference": "docker.io/nginx:latest" },
@@ -227,7 +249,7 @@ Example of Response Payload
"id": "c66be104-eea3-4246-975c-e6cc9b32d74d",
"path": "resources/c66be104-eea3-4246-975c-e6cc9b32d74d",
"catalogItemInstanceId": "4baa35eb-e70d-4d37-867d-0f4efa21d05c",
- "providerName": "postgres-sp",
+ "agentName": "prod-eu-agent",
"spec": {
"serviceType": "database",
"engine": "postgresql",
@@ -243,7 +265,7 @@ Example of Response Payload
"id": "08aa81d1-a0d2-4d5f-a4df-b80addf07781",
"path": "resources/08aa81d1-a0d2-4d5f-a4df-b80addf07781",
"catalogItemInstanceId": "f3645f8f-82c1-4efb-888f-318c0ac81a08",
- "providerName": "kubevirt-sp",
+ "agentName": "prod-eu-agent",
"spec": {
"serviceType": "vm",
"vcpu": { "count": 2 },
@@ -261,8 +283,7 @@ Example of Response Payload
}
```
-**GET /api/v1/resources/{resourceId}**
-Get a resource based on id.
+**GET /api/v1/resources/{resourceId}** Get a resource based on id.
Example of Response Payload
@@ -271,7 +292,7 @@ Example of Response Payload
"id": "08aa81d1-a0d2-4d5f-a4df-b80addf07781",
"path": "resources/08aa81d1-a0d2-4d5f-a4df-b80addf07781",
"catalogItemInstanceId": "d6ebf344-bfd1-44c9-bc25-97f9fb856f22",
- "providerName": "kubevirt-sp",
+ "agentName": "prod-eu-agent",
"spec": {
"serviceType": "vm",
"vcpu": { "count": 4 },
@@ -286,11 +307,9 @@ Example of Response Payload
}
```
-**Delete /api/v1/resources/{resourceId}**
-Delete a resource based on id.
+**DELETE /api/v1/resources/{resourceId}** Delete a resource based on id.
-**GET /api/v1/health**
-Retrieve the health status of Placement Manager.
+**GET /api/v1/health** Retrieve the health status of Placement Manager.
Example of Response Payload
@@ -306,7 +325,7 @@ Example of Response Payload
### Service Creation Flow
The following sequence diagram illustrates the complete flow for creating a
-resources via the `POST /api/v1/resources` endpoint.
+resource via the `POST /api/v1/resources` endpoint.
```mermaid
sequenceDiagram
@@ -321,43 +340,66 @@ sequenceDiagram
activate PM
PM->>DB: Store intent<br/>{originalRequest}
- activate DB
DB-->>PM: Intent stored
- deactivate DB
- PM->>PE: POST /api/v1/engine/evaluate<br/>{requestPayload, userId, tenantId}
+ PM->>DB: Fetch available agents<br/>(healthy, non-Congested)
+ DB-->>PM: available_agents list
+
+ PM->>PE: POST /api/v1alpha1/policies:evaluateRequest<br/>{service_instance: {spec}, available_agents}
activate PE
- PE-->>PM: Validated/mutated payload<br/>& selected providerName
+ PE-->>PM: Validated/mutated payload<br/>& selectedAgent
deactivate PE
alt Policy validation fails
- PM-->>CM: Error response<br/>(Policy rejection)
- deactivate PM
+ PM->>DB: Delete intent record
+ PM-->>CM: Error response (policy rejection)
else Policy validation succeeds
- PM->>DB: Store validated request<br/>{validatedPayload, providerName}
- activate DB
- DB-->>PM: Validated request stored
- deactivate DB
+ PM->>DB: Store validated request<br/>{validatedPayload, agentName}
- PM->>SPRM: POST /api/v1/service-types/instances<br/>{providerName, serviceType, spec}
+ PM->>SPRM: POST /api/v1/service-type-instances<br/>{agentName, serviceType, spec}
activate SPRM
- alt SP Resource Manager fails
+ alt SPRM returns error (404/503)
SPRM-->>PM: Error response
- PM-->>CM: Error response<br/>(Instance creation failed)
+ PM->>DB: Delete records
+ PM-->>CM: Error response
deactivate SPRM
- else Instance creation succeeds
- SPRM-->>PM: Success response<br/>{instanceId, status, metadata}
- activate DB
- deactivate DB
-
- PM-->>CM: 201 Created<br/>{Resource}
+ else SPRM returns 202 Accepted
+ SPRM-->>PM: 202 Accepted<br/>{instanceId, agentName, status: PENDING}
+ deactivate SPRM
+ PM-->>CM: 201 Created {Resource}
+ end
+ end
+ Note over SPRM: Async: SPRM consumes response<br/>from dcm.agents.responses
+
+ opt SPRM notifies PM of QUEUED status
+ SPRM->>PM: Notify: instance QUEUED<br/>{instanceId, agentName}
+ Note over PM: Start queuedRequestTimeout timer
+
+ alt Timeout expires (or timeout = 0)
+ PM->>SPRM: DELETE /api/v1/service-type-instances/{instanceId}
+ Note over PM: Re-evaluate excluding current agent
+
+ PM->>PE: POST /api/v1alpha1/policies:evaluateRequest<br/>{service_instance: {spec}, available_agents, exclude_agents: [agentName]}
+ activate PE
+ PE-->>PM: New selectedAgent or no match
+ deactivate PE
+
+ alt Alternative agent found
+ PM->>SPRM: POST /api/v1/service-type-instances<br/>{newAgentName, serviceType, spec}
+ SPRM-->>PM: 202 Accepted
+ PM-->>CM: 201 Created {Resource}
+ else No agent available
+ PM->>DB: Delete records
+ PM-->>CM: Error: no agent available
+ end
end
end
+ deactivate PM
```
#### Flow Description
@@ -374,47 +416,164 @@ sequenceDiagram
- This enables rehydration and tracking of the user's original request
- Intent is stored before any processing to ensure request persistence
-3. **Policy Validation**
+3. **Fetch Available Agents**
-- Placement Manager forwards the request to Policy Manager for validation
+- Placement Manager queries the Agent Registry for healthy, non-Congested agents
+ that support the requested service type
+- The resulting `available_agents` list is passed to the Policy Manager for
+ evaluation
+
+4. **Policy Validation**
+
+- Placement Manager forwards the request to Policy Manager with
+ `available_agents` and optional `exclude_agents`
- Policy Manager evaluates requests against policies
- Policy Manager returns:
- Approved or rejected
- Validated and potentially mutated payload
- - Selected Service Provider name (`providerName`)
+ - Selected Agent name (`selectedAgent`)
- Policy constraints and patches applied
- If policy validation fails (request rejected or constraint violation):
- - Delete record from Placement DB
+ - Delete intent record from Placement DB
- Placement Manager returns error response to Catalog Manager
- Request processing stops
- If policy validation succeeds:
- Placement Manager stores the validated request in Placement DB which
- includes the validated/mutated payload and selected `providerName`
+ includes the validated/mutated payload and selected `agentName`
+
+5. **Store Validated Request**
-4. **Instance Creation**
+- Placement Manager persists the validated/mutated payload along with the
+ `agentName` returned by the Policy Manager
+- This enables rehydration and audit
+
+6. **Instance Creation**
- Placement Manager delegates instance creation to SP Resource Manager
-- Forwards the validated request with `providerName`, `serviceType`, and `spec`
-- SP Resource Manager handles SP lookup, health checks, and instance
- provisioning
-- If SP Resource Manager fails to create the instance:
- - Error response is returned to Placement Manager
- - Delete record from Placement DB
- - Placement Manager forwards the error to Catalog Manager
- - Request processing stops
-- If instance creation succeeds:
- - SP Resource Manager returns success response with `instanceId`, `status`
- - Placement Manager returns 201 Created to Catalog Manager with a full
- `Resource` object
- - The resource is now in a `PROVISIONING` state
+- Forwards `agentName`, `serviceType`, and `spec`
+- SP Resource Manager publishes the request to the agent's messaging topic
+- SPRM always responds synchronously with one of:
+ - **SPRM returns error (404/503)**: Error response returned to Placement
+ Manager. Records deleted from Placement DB. Placement Manager forwards the
+ error to Catalog Manager. Request processing stops.
+ - **SPRM returns 202 Accepted**: Instance creation is in progress. Placement
+ Manager returns 201 Created to Catalog Manager with a full `Resource`
+ object. The resource is now in a `PENDING` state.
+
+7. **Queued-Request Handling (Asynchronous)**
+
+- After SPRM returns 202, it continues to consume responses from
+ `dcm.agents.responses`. If the Agent reports a `dcm.agent.request-queued`
+ CloudEvent (the SP for the requested service type is unhealthy), SPRM
+ asynchronously notifies Placement Manager of the `QUEUED` status
+- Upon receiving the QUEUED notification, Placement Manager starts a
+ `queuedRequestTimeout` timer
+- On timeout expiry (or immediately if `queuedRequestTimeout = 0`):
+ - PM tells SPRM to DELETE the queued request
+ - PM re-evaluates policies by calling the Policy Manager again, this time
+ including `exclude_agents: [agentName]` to exclude the timed-out agent
+ - If an alternative agent is found: PM sends a new creation request to SPRM
+ with the new agent
+ - If no alternative agent is available: PM deletes records from Placement DB
+ and returns an error to Catalog Manager
+
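Steps 3 and 4 above can be read as two successive filters over the agent list. The following is a minimal Python sketch of that idea, not the Placement Manager's actual code. The `Agent` record mirrors the `{name, environment, serviceTypes, cost}` metadata; the `state` field and both function names are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    # Mirrors the available_agents metadata described in this document.
    name: str
    environment: str
    service_types: tuple[str, ...]
    cost: int
    # Hypothetical field: "Ready", "Congested", or "Unavailable".
    state: str = "Ready"


def fetch_available_agents(registry, service_type):
    """PM side: keep healthy, non-Congested agents that support the service type."""
    return [
        agent
        for agent in registry
        if agent.state == "Ready" and service_type in agent.service_types
    ]


def prefilter(available_agents, exclude_agents):
    """Policy side: drop agents named in exclude_agents before evaluation."""
    excluded = set(exclude_agents)
    return [agent for agent in available_agents if agent.name not in excluded]
```

On a first evaluation `exclude_agents` is empty, so `prefilter` returns the list unchanged; after a queued-request timeout it removes the timed-out agent.
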
+### Service Deletion Flow
+
+The following sequence diagram illustrates the complete flow for deleting a
+resource via the `DELETE /api/v1/resources/{resourceId}` endpoint.
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant CM as Catalog Manager
+ participant PM as Placement Manager
+ participant DB as Placement DB
+ participant SPRM as SP Resource Manager
+
+ CM->>PM: DELETE /api/v1/resources/{resourceId}
+ activate PM
+
+ PM->>DB: Lookup resource<br/>Get agentName, serviceType, instanceId
+
+ PM->>SPRM: DELETE /api/v1/service-type-instances/{instanceId}
+ activate SPRM
+
+ alt SPRM returns error
+ SPRM-->>PM: Error response
+ PM-->>CM: Error response
+ else SPRM returns 202 Accepted
+ SPRM-->>PM: 202 Accepted<br/>{instanceId, agentName, status: DELETING}
+ PM->>DB: Update resource status to DELETING
+ PM-->>CM: 200 OK
+ end
+ deactivate SPRM
+
+ Note over SPRM: Async: SPRM consumes response<br/>from dcm.agents.responses
+
+ opt SPRM notifies PM of QUEUED status
+ SPRM->>PM: Notify: instance QUEUED<br/>{instanceId, agentName}
+ Note over PM: Same timeout logic as creation
+ end
+ deactivate PM
+```
+
+#### Flow Description
+
+1. **Request Reception**
+
+- Catalog Manager sends a DELETE request to Placement Manager with the
+ `resourceId`
+
+2. **Resource Lookup**
+
+- Placement Manager queries Placement DB to retrieve the resource record,
+ including the `agentName`, `serviceType`, and `instanceId` needed for deletion
-#### Key Characteristics/Notes
+3. **Delegation to SP Resource Manager**
+
+- Placement Manager sends a DELETE request to SPRM with the `instanceId`
+- SPRM publishes a deletion CloudEvent to the agent's messaging topic
+- SPRM always responds synchronously with one of:
+ - **SPRM returns error**: Error response returned to Placement Manager, which
+ forwards it to Catalog Manager
+ - **SPRM returns 202 Accepted**: Deletion is in progress. PM updates the
+ resource status to `DELETING` in Placement DB and returns 200 OK to Catalog
+ Manager
+- **SPRM notifies QUEUED (asynchronous)**: After returning 202, SPRM may
+ asynchronously notify PM of a `QUEUED` status if the Agent reports the SP for
+ the service type is unhealthy. The same `queuedRequestTimeout` logic applies
+ as in the creation flow (see
+ [Queued-Request Handling](#queued-request-handling))
+
+### Configuration
+
+| Parameter | Type | Default | Description |
+| ---------------------- | -------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `queuedRequestTimeout` | Duration | `300s` | Maximum time PM waits when SPRM reports a "queued" status before cancelling the request and re-evaluating policies excluding the current agent. When set to `0`, PM immediately re-evaluates without waiting. Applies to both creation and deletion requests. |
+
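The `queuedRequestTimeout` behavior can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: times are plain seconds, `select_agent` is a hypothetical stand-in for the Policy Manager call, and the returned dictionaries are illustrative, not a defined API.

```python
def should_reevaluate(queued_at, now, queued_request_timeout):
    """True once the timer has run out; a timeout of 0 fires immediately."""
    return now - queued_at >= queued_request_timeout


def on_queued_timeout(queued_agent, available, select_agent):
    """Decide what PM does after the timeout expires.

    available is a list of agent names; select_agent receives the remaining
    candidates and returns an agent name, or None when nothing matches.
    """
    exclude_agents = [queued_agent]
    candidates = [name for name in available if name not in exclude_agents]
    new_agent = select_agent(candidates)
    if new_agent is None:
        # No alternative: PM deletes its records and reports an error.
        return {"action": "fail", "exclude_agents": exclude_agents}
    # Alternative found: PM sends a new creation request for this agent.
    return {
        "action": "resubmit",
        "agentName": new_agent,
        "exclude_agents": exclude_agents,
    }
```

The sketch excludes only the agent that just timed out; capping repeated re-evaluations is listed under Next Steps and is not modeled here.
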
+### Key Characteristics/Notes
- **Intent Preservation**: Original user request is stored before processing for
audit and rehydration purposes
-- **Policy-Driven**: Service Provider selection and request validation are
- handled by Policy Manager
-- **Error Handling**: Clear error paths for policy rejections and instance
- creation failures
+- **Policy-Driven**: Agent selection and request validation are handled by
+ Policy Manager
+- **Agent-Based Selection**: Service Provider selection is no longer a direct
+ concern of the Placement Manager. The Policy Engine selects an Agent based on
+ environment, service types, and cost. The Agent internally selects the SP.
+- **Queued-Request Timeout**: When SPRM reports a "queued" status (the SP for
+ the requested service type on the agent is unhealthy), PM applies a
+ configurable timeout. On expiry, PM cancels the request and re-evaluates
+ policies excluding the timed-out agent.
+- **Error Handling**: Clear error paths for policy rejections, instance creation
+ failures, and queued-request timeouts
- **State Management**: Both original intent and validated request are stored
for complete request lifecycle tracking and rehydration purposes
+
+### Next Steps
+
+- Per-agent timeout overrides (allow different `queuedRequestTimeout` values per
+ agent)
+- Retry limits on re-evaluation (cap the number of times PM re-evaluates after
+ excluding agents)
+- PM-level request priority/ordering (prioritize certain requests over others
+ when re-evaluating)
diff --git a/enhancements/policy-engine/policy-engine.md b/enhancements/policy-engine/policy-engine.md
index b37f4b9..c75751e 100644
--- a/enhancements/policy-engine/policy-engine.md
+++ b/enhancements/policy-engine/policy-engine.md
@@ -13,6 +13,8 @@ reviewers:
approvers:
- TBD
creation-date: 2025-12-15
+see-also:
+ - "/enhancements/environment-agent/environment-agent.md"
---
# Policy API & Execution Engine
@@ -20,7 +22,11 @@ creation-date: 2025-12-15
## Summary
This ADR defines the Management and Execution API and Workflow of the DCM Policy
-Engine
+Engine.
+
+With the introduction of the Environment Agent layer, the Policy Engine selects
+an Agent (rather than a Service Provider) to handle the request, and can
+constrain selection by environment.
## Motivation
@@ -28,10 +34,9 @@ The Policy Engine operates as a specialized microservice within the Data Center
Management (DCM) application responsible for governing service creation and
modification (e.g., VirtualMachines, Containers). It enables Admins,
Tenant-Admins, and Users to inject logic that validates (Approve/Reject),
-mutates (Defaulting/Altering) and assigns Service Providers to request payloads
-using an embedded
-[Open Policy Agent (OPA)](https://www.openpolicyagent.org/docs) engine and
-[Rego](https://www.openpolicyagent.org/docs/policy-language).
+mutates (Defaulting/Altering) and assigns Agents to request payloads using an
+embedded [Open Policy Agent (OPA)](https://www.openpolicyagent.org/docs) engine
+and [Rego](https://www.openpolicyagent.org/docs/policy-language).
OPA is embedded as a Go library within the Policy Engine process rather than
deployed as a separate sidecar service. Rego source code is persisted in the
@@ -75,7 +80,8 @@ Every policy may return one or more of the following outputs
by providing a patch map.
3. **Field Constraints:** Defining the mutability of fields for _subsequent_
policies in the chain.
-4. **Service Provider Selection:** Policies may set a value and/or constraints
+4. **Agent Selection:** Policies may set a target agent and/or agent constraints
+ (including environment constraints).
### Policy Scope & Hierarchy (Execution Order)
@@ -100,10 +106,14 @@ The input payload includes:
they will need to know the expected content
- `constraints` - The current constraints context (accumulated from prior
policies)
-- `provider` - The currently selected service provider (empty string initially,
- populated as policies are evaluated)
-- `service_provider_constraints` - The current service provider constraints
- (accumulated from prior policies)
+- `agent` - The currently selected agent (empty string initially, populated as
+ policies are evaluated)
+- `agent_constraints` - The current agent constraints (accumulated from prior
+ policies)
+- `available_agents` - List of eligible agents with metadata
+ `[{name, environment, serviceTypes, cost}]`, provided by Placement Manager
+- `exclude_agents` - List of agent names to exclude from selection (used during
+ re-evaluation after queued timeout)
#### Output
@@ -113,11 +123,14 @@ following elements
- **rejected** (bool) - since requests are approved by default, policies may
reject them.
- **rejection_reason** (string, optional) - reason for rejection
-- **selected_provider** (string, optional) - the name of the service provider
- chosen to fulfill the request
-- **service_provider_constraints** (object, optional) -
- - `allow_list` - list of allowed service provider names
- - `patterns` - list of regex patterns for matching allowed providers
+- **selected_agent** (string, optional) - the name of the agent chosen to handle
+ the request
+- **agent_constraints** (object, optional) -
+ - `allow_list` - list of allowed agent names
+ - `patterns` - list of regex patterns for matching allowed agents
+ - `environment_constraints` - environment-level constraints
+ - `allow_list` - list of allowed environment identifiers
+ - `patterns` - list of regex patterns for matching allowed environments
- **patch** (map, optional) - a dictionary of the corresponding service type for
setting values. Each internal key is optional
- **constraints** (map, optional) - follows
@@ -266,7 +279,7 @@ sequenceDiagram
Database-->>PolicyEngine: List of policies
loop For each policy
- PolicyEngine->>PolicyEngine: Evaluate policy (embedded OPA)
+ PolicyEngine->>PolicyEngine: Evaluate policy (embedded OPA)<br/>{spec, agent, constraints, agent_constraints}
PolicyEngine->>PolicyEngine: Enforce constraints
PolicyEngine->>PolicyEngine: Mutate payload
alt Policy rejected or constraint violation
@@ -275,7 +288,7 @@ sequenceDiagram
end
end
- PolicyEngine-->>PlacementManager: Success with updated payload
+ PolicyEngine-->>PlacementManager: Success with {evaluatedServiceInstance, selectedAgent, status}
PlacementManager-->>User: Service created
```
@@ -287,6 +300,10 @@ sequenceDiagram
- Service Instance
- spec - the service specification (flexible schema)
+- available_agents - list of agents with metadata (provided by PM)
+ `[{name, environment, serviceTypes, cost}]`
+- exclude_agents - list of agent names to exclude (optional, used for
+ re-evaluation)
###### Execution Logic & Flow
@@ -299,9 +316,11 @@ parallel with policy management operations.
- The Policy API maintains a `ConstraintContext` map in memory for the duration
of the request.
+- Pre-filter: Remove any agents in `exclude_agents` from the `available_agents`
+ list before evaluation begins.
- Fetch & Sort:
- Query DB for enabled policies matching the request payload based on the
- policy’s matching criteria.
+ policy's matching criteria.
- Sort by Level (Global -> Tenant -> User) then Priority (Desc).
- If no policies matching the request payload were found, the request will
return successfully
@@ -310,9 +329,11 @@ parallel with policy management operations.
- Invoke the policy's package main rule
- Pass
- `spec` - the current patched request payload
- - `provider` - the currently selected service provider
+ - `agent` - the currently selected agent
- `constraints` - the accumulated constraint context (if any)
- - `service_provider_constraints` - the accumulated SP constraints (if any)
+ - `agent_constraints` - the accumulated agent constraints (if any)
+ - `available_agents` - the pre-filtered list of eligible agents
+ - `exclude_agents` - the list of excluded agent names
- Check `Reject`
- If `Reject` is `true`, ABORT IMMEDIATELY (Fail Fast). Return 406.
- Validate `Constraints`:
@@ -327,13 +348,16 @@ parallel with policy management operations.
patch the `region`, ABORT with "Policy Conflict Error"
- Apply `Patch`
- Update service_payload with valid patches.
- - Validate `ServiceProvider`
- - If Policy P returned a `selected_provider` and
- `service_provider_constraints` exist, validate the selected provider
- against the constraints.
-
-- Finalize: Return the final payload, selected provider, and status to Placement
+ - Validate `Agent`
+ - If Policy P returned a `selected_agent` and `agent_constraints` exist,
+ validate the selected agent against the constraints.
+ - If `environment_constraints` exist, validate the selected agent's
+ environment against those constraints (Policy Engine uses
+ `available_agents` metadata for this).
+
+- Finalize: Return the final payload, selected agent, and status to Placement
Manager.
+ - Response: `{evaluatedServiceInstance, selectedAgent, status}`
- Status is `APPROVED` if the payload was not modified, `MODIFIED` if any
patches were applied.
@@ -347,3 +371,24 @@ parallel with policy management operations.
- Patch: {"billing_tag": "marketing"}
- Action: Engine checks Context. billing_tag is immutable.
- Result: Error. The User policy violates the Global constraint.
+
+###### _Agent/Environment Constraint Validation Example_
+
+- Step 1 (Global Policy):
+ - agent_constraints: {environment_constraints: {allow_list: ["prod-eu-west-1",
+ "prod-us-east-1"]}}
+ - Result: Only agents in prod-eu-west-1 or prod-us-east-1 are eligible
+- Step 2 (Tenant Policy):
+ - selected_agent: "prod-eu-agent"
+ - Validation: Agent's environment is "prod-eu-west-1" (looked up from
+ available_agents metadata) — matches allow_list. Valid.
+- Step 3 (User Policy):
+ - selected_agent: "dev-agent"
+ - Validation: Agent's environment is "dev" (looked up from available_agents
+ metadata) — NOT in allow_list. Error: violates Global constraint.
+
+## Next Steps
+
+- Cost-based agent selection within agent_constraints
+- Resource capacity constraints (totalCpu, totalMemory)
+- SP-level constraints passed through to agents
diff --git a/enhancements/service-provider-health-check/service-provider-health-check.md b/enhancements/service-provider-health-check/service-provider-health-check.md
index 191d6c4..9e55eba 100644
--- a/enhancements/service-provider-health-check/service-provider-health-check.md
+++ b/enhancements/service-provider-health-check/service-provider-health-check.md
@@ -13,77 +13,194 @@ reviewers:
approvers:
- ""
creation-date: 2025-12-15
+see-also:
+ - "/enhancements/environment-agent/environment-agent.md"
---
# Service Provider Health Check
## Summary
-This enhancement proposes a mechanism for the DCM control plane to actively
-monitor the health of service providers. Instead of providers pushing
-heartbeats, the DCM control plane will poll a `/health` endpoint on the service
-provider to verify liveness and backing provider health.
+The Environment Agent monitors SP health using two mechanisms: in-process checks
+for embedded SPs (K8s Container, ACM Cluster, KubeVirt) and polling the
+`/health` endpoint for external SPs. DCM monitors Agent health via heartbeats
+and consumer lag reporting.
## Motivation
-Define the DCM control plane way to determine if a service provider is
-accessible. Without an active check, the control plane might attempt to schedule
-services on providers that are down.
+Define how SP health is monitored by the Agent, and how Agent health and
+congestion are monitored by DCM.
### Goals
-- Implement a polling mechanism where DCM checks provider health.
+- Define the polling mechanism where the Agent checks SP health.
- Define a standard `/health` endpoint for all Service Providers.
+- Define the heartbeat mechanism by which DCM monitors Agent health.
+- Define consumer lag monitoring and the Congested agent state.
### Non-Goals
- Status reporting of individual services running _on_ the provider.
- Deep provider diagnostics (out of scope for liveness check).
-- Ensure DCM excludes "Unhealthy" or "Unreachable" providers from scheduling.
+- Agent high availability (deferred to HA iteration).
## Proposal
### Overview
-The DCM Control Plane will act as the "prober." It will maintain a list of
-registered service providers URLs. At a configurable interval, DCM will perform
-an HTTP GET request to the provider's `/health` endpoint.
+The Agent acts as the prober for SP health. The monitoring mechanism differs by
+SP type: embedded SPs are checked in-process (the agent directly checks the
+embedded SP's internal state without a network call), while external SPs are
+checked by polling their `/health` endpoint at a configurable interval. DCM
+monitors Agent health via periodic REST heartbeats and tracks consumer lag.
### Architecture
-1. **Health Polling (High Frequency):**
- - **Initiator:** DCM Control Plane.
- - **Target:** Service Provider `/health` endpoint.
- - **Frequency:** Every 10 seconds (default).
- - **Success Criteria:** HTTP 200 OK.
-
-2. **Resource Synchronization (Low Frequency/On-Demand):**
- - **Note:** Detailed resource data (CPU/Memory) continues to be handled via
- the Provider Info API, but the "Ready" state is governed by the Health
- Check results.
+1. **SP Health Monitoring (Agent → SP):**
+ - **Embedded SPs:** Health is determined in-process — the agent directly
+ checks the embedded SP's internal state without a network call.
+ - **External SPs:** Health is determined by polling the SP's `/health`
+ endpoint.
+ - **Initiator:** Agent.
+ - **Target:** Service Provider `/health` endpoint.
+ - **Frequency:** Every 10 seconds (default).
+ - **Success Criteria:** HTTP 200 OK.
+
+2. **Agent Health Monitoring (Agent → DCM):**
+ - **Mechanism:** Agent sends `PUT /api/v1/agents/{agentId}/heartbeat` to DCM.
+ - **Frequency:** Every `heartbeatInterval` seconds (configurable).
+ - **Failure:** If no heartbeat arrives within a configurable threshold, DCM
+ marks the agent as Unavailable.
+
+3. **Consumer Lag Monitoring:**
+ - Agent self-reports consumer lag in heartbeat payload
+ `{timestamp, consumerLag}`.
+ - DCM marks the agent as **Congested** when lag reaches or exceeds
+ `consumerLagThreshold`.
+ - DCM stops routing new requests to a Congested agent.
### Health Check Flow
-1. **DCM Controller:** Iterates through the list of active providers in the
- database.
-2. **Probing:** For each provider, DCM executes:
- `GET http://:/health`.
-3. **State Machine:**
- - **Ready:** If response is `200 OK` and body `status` is `healthy`, reset
- failure counter and mark as `Ready`.
- - **Unhealthy:** If response is `200 OK` and body `status` is `unhealthy`,
- mark as `Unhealthy`. The service provider is reachable but the backing
- provider is unavailable.
- - **Failure:** If timeout or non-200 response, increment failure counter.
- - **Threshold:** If failures exceed the `FailureThreshold` (default: 3),
- transition provider to `Unavailable`.
-4. **Recovery:** A single `200 OK` with `status` `healthy` transitions an
- `Unhealthy` or `Unavailable` provider back to `Ready`.
+1. **Agent:** Iterates through the list of registered SPs (both embedded and
+ external).
+2. **Probing:** For each external SP, the Agent executes:
+ `GET http://<host>:<port>/health`. Embedded SPs are checked in-process.
+3. **State Machine:**
+ - **Ready:** If response is `200 OK` and body `status` is `healthy`, reset
+ failure counter and mark as `Ready`.
+ - **Unhealthy:** If response is `200 OK` and body `status` is `unhealthy`,
+ mark as `Unhealthy`. The service provider is reachable but the backing
+ provider is unavailable.
+ - **Failure:** If timeout or non-200 response, increment failure counter.
+ - **Threshold:** If failures exceed the `FailureThreshold` (default: 3),
+ transition provider to `Unavailable`.
+4. **Recovery:** A single `200 OK` with `status` `healthy` transitions an
+ `Unhealthy` or `Unavailable` provider back to `Ready`.
+
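The state machine above can be sketched as a small Python class. This is a minimal model, not the Agent's code. The class name is hypothetical, and three behaviors are assumptions the text leaves open: the initial state is `Ready`, an `unhealthy` response leaves the failure counter untouched, and a `200 OK` with an unrecognized body counts as a failure.

```python
class SPHealthTracker:
    """Tracks one SP's state from successive health probe results."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.state = "Ready"  # Assumed initial state.

    def record(self, http_status, body_status=None):
        """Apply one probe result; http_status is None on timeout."""
        if http_status == 200 and body_status == "healthy":
            # Recovery: a single healthy response resets everything.
            self.failures = 0
            self.state = "Ready"
        elif http_status == 200 and body_status == "unhealthy":
            # Reachable SP whose backing provider is unavailable.
            self.state = "Unhealthy"
        else:
            self.failures += 1
            # "Exceed" is read strictly: the 4th failure trips a threshold of 3.
            if self.failures > self.failure_threshold:
                self.state = "Unavailable"
        return self.state
```

For embedded SPs the same tracker could be fed the result of the in-process check instead of an HTTP response.
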
+### Differentiated Behavior
+
+Since only one SP (embedded or external) may serve a given service type per
+agent, when that SP transitions out of the Ready state, the Agent's behavior
+differs based on the health state. See the
+[Environment Agent enhancement](../environment-agent/environment-agent.md) for
+full details on retry topic behavior.
+
+**Unhealthy:**
+
+1. Agent **keeps** the service type in its advertised list (no update to DCM).
+2. Stops routing to the SP. Incoming requests are held in the retry topic.
+3. Publishes a `service-type-degraded` health warning CloudEvent.
+
+**Unavailable** (after exceeding failure threshold):
+
+1. Agent **removes** the service type from its advertised list.
+2. Sends `POST /api/v1/agents` to DCM with the updated registration.
+3. Drains retry topic — rejects held requests with error CloudEvents.
+4. Publishes a `service-type-unavailable` health warning CloudEvent.
+
+**Recovery:**
+
+1. Re-adds service type to advertised list if it was removed (Unavailable case)
+ and sends `POST /api/v1/agents` to DCM with the updated registration.
+2. Processes held requests from the retry topic.
+
+## Agent Health Monitoring
+
+The Agent reports its own liveness to DCM via periodic REST heartbeats. DCM
+tracks the last heartbeat timestamp for each agent.
+
+- **Endpoint:** `PUT /api/v1/agents/{agentId}/heartbeat`
+- **Payload:** `{timestamp, consumerLag}`
+- **Frequency:** Every `heartbeatInterval` seconds (configurable).
+- If no heartbeat is received within a configurable threshold, DCM marks the
+ agent as **Unavailable**.
+- On restart, the Agent re-registers to DCM, which resets the heartbeat tracker.
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant AG as Agent
+ participant DCM as DCM Control Plane
+ participant DB as Database
+
+ loop Every {heartbeatInterval} seconds
+ AG->>DCM: PUT /api/v1/agents/{agentId}/heartbeat<br/>{timestamp, consumerLag}
+ DCM->>DB: Update heartbeat timestamp and lag
+ DCM->>DCM: Check consumerLag against threshold
+ alt consumerLag >= consumerLagThreshold
+ DCM->>DB: Mark agent as Congested
+ else consumerLag < consumerLagThreshold
+ DCM->>DB: Clear Congested state (if set)
+ end
+ DCM-->>AG: 200 OK
+ end
+
+ Note over DCM: No heartbeat within threshold
+ DCM->>DB: Mark agent as Unavailable
+```
+
+## Consumer Lag Monitoring
+
+The Agent self-reports the number of pending messages on its topic as
+`consumerLag` in each heartbeat. DCM compares this value against a global
+`consumerLagThreshold`.
+
+- When `consumerLag >= consumerLagThreshold`, DCM marks the agent as
+ **Congested** and stops routing new requests to it.
+- When `consumerLag` drops below the threshold on a subsequent heartbeat, DCM
+ clears the Congested state.
+
+> **Note:** The environment-agent enhancement currently defines the heartbeat
+> payload as `{timestamp}` only. The extended payload `{timestamp, consumerLag}`
+> is defined here as the intended contract; the agent doc will be updated in a
+> follow-up.
+
+## Agent Health State Summary
+
+| Condition                                     | Agent State     |
+| --------------------------------------------- | --------------- |
+| Heartbeat received, lag below threshold       | **Ready**       |
+| Heartbeat received, lag at or above threshold | **Congested**   |
+| No heartbeat within threshold                 | **Unavailable** |
+
+```mermaid
+stateDiagram-v2
+ [*] --> Ready: Agent registers
+ Ready --> Congested: consumerLag >= threshold
+ Congested --> Ready: consumerLag < threshold
+ Ready --> Unavailable: Heartbeat timeout
+ Congested --> Unavailable: Heartbeat timeout
+ Unavailable --> Ready: Agent re-registers
+```
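
The state summary above reduces to a small classification function. The Python sketch below is illustrative only. The parameter name `heartbeat_timeout` stands in for the "configurable threshold" and is an assumption, and times are plain seconds.

```python
def agent_state(last_heartbeat, now, consumer_lag,
                heartbeat_timeout, consumer_lag_threshold):
    """Classify an agent from its latest heartbeat data."""
    if now - last_heartbeat > heartbeat_timeout:
        return "Unavailable"
    if consumer_lag >= consumer_lag_threshold:
        return "Congested"
    return "Ready"
```

The function is stateless, so it does not model the rule that an `Unavailable` agent returns to `Ready` only by re-registering.
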
## Design Details
### Service Provider Implementation
+The SP health endpoint specification applies to external SPs only. Embedded SPs
+are health-checked in-process and do not expose a `/health` endpoint. The only
+difference from the original design is that the Agent, not DCM, is the caller
+for external SPs.
+
The Service Provider must expose a lightweight unauthenticated (or internally
secured) endpoint.
@@ -106,10 +223,10 @@ secured) endpoint.
The `status` field indicates the health of the backing provider:
-- `healthy` — The service provider and its backing provider are operational. DCM
- marks the provider as **Ready**.
+- `healthy` — The service provider and its backing provider are operational. The
+ Agent marks the provider as **Ready**.
- `unhealthy` — The service provider is reachable but the backing provider is
- unavailable. DCM marks the provider as **Unhealthy**.
+ unavailable. The Agent marks the provider as **Unhealthy**.
**Unhealthy Response Example:**
@@ -123,8 +240,14 @@ The `status` field indicates the health of the backing provider:
#### Provider State Summary
-| HTTP Response | `status` field | DCM State |
+| HTTP Response | `status` field | SP State |
| ----------------- | -------------- | ---------------------------------------------------- |
| `200 OK` | `healthy` | **Ready** |
| `200 OK` | `unhealthy` | **Unhealthy** |
| Non-200 / Timeout | N/A | **Unavailable** (after exceeding `FailureThreshold`) |
+
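The Provider State Summary table can be sketched as a function of the health-check result. This Python example is a hedged illustration, not the Agent's implementation: `SPHealthTracker` and its fields are hypothetical names, and it assumes that failures are counted consecutively, that "exceeding" means strictly greater than `FailureThreshold`, and that a `200 OK` with an unrecognized `status` value counts as a failure.

```python
from enum import Enum
from typing import Optional


class SPState(Enum):
    READY = "Ready"
    UNHEALTHY = "Unhealthy"
    UNAVAILABLE = "Unavailable"


class SPHealthTracker:
    """Hypothetical per-SP tracker kept by the Agent."""

    def __init__(self, failure_threshold: int) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.state = SPState.READY

    def record(self, http_status: Optional[int],
               status_field: Optional[str]) -> SPState:
        """Record one health-check result; http_status=None means timeout."""
        if http_status == 200 and status_field in ("healthy", "unhealthy"):
            # A well-formed 200 response resets the failure counter.
            self.consecutive_failures = 0
            if status_field == "healthy":
                self.state = SPState.READY
            else:
                self.state = SPState.UNHEALTHY
        else:
            # Non-200 or timeout: keep the previous state until the
            # threshold is exceeded (assumed: strictly greater than).
            self.consecutive_failures += 1
            if self.consecutive_failures > self.failure_threshold:
                self.state = SPState.UNAVAILABLE
        return self.state
```

With `failure_threshold=2`, the third consecutive failed check moves the SP to Unavailable, and the next healthy `200 OK` moves it back to Ready.
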
+## Next Steps
+
+- Agent HA: multiple agents sharing health-check duties.
+- Authenticated health checks.
+- Per-SP health check intervals.
diff --git a/enhancements/sp-registration-flow/sp-registration-flow.md b/enhancements/sp-registration-flow/sp-registration-flow.md
index 2055a78..d318687 100644
--- a/enhancements/sp-registration-flow/sp-registration-flow.md
+++ b/enhancements/sp-registration-flow/sp-registration-flow.md
@@ -17,6 +17,8 @@ approvers:
- "@flocati"
- "@gabriel-farache"
creation-date: 2025-12-05
+see-also:
+ - "/enhancements/environment-agent/environment-agent.md"
---
# Service Provider Registration Flow
@@ -26,19 +28,35 @@ creation-date: 2025-12-05
The DCM (Data Center Management) is designed to provide a unified control plane
for managing distributed infrastructure across multiple enclaves, including
air-gapped environments, regional datacenters, and isolated security zones (e.g.
-ships, edge locations). A fundamental architectural decision must be made about
-how Service Providers (SP) — the components that execute infrastructure
-provisioning work — become known to and integrate with the DCM Control Plane.
-This decision directly impacts scalability, security, network topology,
-operational model (whether centralized DCM teams or distributed SME teams manage
-Service Provider lifecycle).
+ships, edge locations). In each target environment, an
+[Agent](../environment-agent/environment-agent.md) runs as the intermediary
+between DCM and the Service Providers (SPs) deployed in that environment.
+
+The Agent supports a hybrid SP model: it ships with embedded SP code for known
+service types (K8s Container, ACM Cluster, KubeVirt), enabled via configuration,
+and also accepts external ("bring your own") SPs that register via the Agent's
+SP Registration API (`POST /api/v1/providers`). Only one SP — embedded or
+external — may serve a given service type per agent; duplicate registrations are
+rejected with `409 Conflict`.
+
+This document defines the registration contract for external SPs — API shape,
+idempotency semantics, and natural key behavior. Embedded SPs register
+internally at agent startup without a REST call and do not use this flow.
+
+The Agent, in turn, registers itself to DCM via a separate API
+(`POST /api/v1/agents`), advertising the environment and the aggregated list of
+service types it can serve. DCM's Registration Handler no longer receives SP
+registrations directly; it receives Agent registrations. The Agent Registration
+Flow is defined in the
+[Environment Agent enhancement](../environment-agent/environment-agent.md#agent-registration-flow).
## Motivation
### Goals
-- Define the registration mechanism by which Service Providers become known to
- and communicate with the DCM Control Plane.
+- Define the registration mechanism by which external Service Providers become
+ known to the Agent, and how the Agent becomes known to DCM.
+- Define the service type uniqueness constraint (one SP per service type).
### Non-Goals
@@ -48,6 +66,10 @@ Service Provider lifecycle).
- DCM Control Plane definition
- Meta-service-provider design
- Service Provider's policies
+- Embedded SP registration (these register internally at agent startup; see the
+ [Environment Agent enhancement](../environment-agent/environment-agent.md#embedded-sp-registration))
+- Agent registration to DCM (defined in the
+ [Environment Agent enhancement](../environment-agent/environment-agent.md#agent-registration-flow))
## Proposal
@@ -55,36 +77,41 @@ Service Provider lifecycle).
#### Terminology
-Service Providers must register using the DCM Service Provider API to operate
-within the DCM system. The Registration Handler component implements the
-provider registration endpoints of the Service Provider API. The registration
-phase provides to the DCM Control Plane the SP endpoint, metadata and
-capabilities so it can route requests to the appropriate SP. The registration
-call can be initiated either by the SP itself during start up phase or by a
-third party (e.g. platform admins) on behalf of the SP. Both approaches use the
-same registration API.
+External Service Providers must register using the Agent's SP Registration API
+to operate within the DCM system. The Agent implements the provider registration
+endpoint (`POST /api/v1/providers`), applying the same contract defined in this
+document. Embedded SPs (K8s Container, ACM Cluster, KubeVirt) register
+internally at agent startup and do not use this endpoint.
+
+The registration phase provides the Agent with the SP endpoint, metadata, and
+capabilities so it can route creation requests to the appropriate SP. The
+registration call can be initiated either by the SP itself during its startup
+phase or by a third party (e.g. platform admins) on behalf of the SP. Both
+approaches use the same registration API.
+
+Only one SP — embedded or external — may serve a given service type per agent.
+If the requested service type is already served by another SP (embedded or
+external), the Agent rejects the registration with `409 Conflict` (see the
+[Environment Agent enhancement](../environment-agent/environment-agent.md#sp-registration-to-agent)
+for the full service type uniqueness constraint).
The _initial implementation_ will focus only on the **self registration flow**.
-The _Service Provider API_ is located in the Egress layer and defines the
-contract between the DCM Control Plane and Service Providers. It includes
-endpoints for provider registration, service management, and provider queries.
-The
+The _SP Registration API_ is hosted by the Agent and defines the contract
+between the Agent and Service Providers. It includes the endpoint for provider
+registration. The
[Service Provider API specification](https://github.com/Fale/dcm/blob/od/api/interoperabilityAPI.yaml)
is under development.
-Within this architecture, the _Registration Handler_ is a component within the
-Service Provider API that implements the provider registration endpoints
-(`POST /providers` and related endpoints). When an SP registers, the
-Registration Handler communicates with the Control Plane to update the Service
-Registry.
+DCM implements `POST /api/v1/agents` for Agent registration (defined in the
+[Environment Agent enhancement](../environment-agent/environment-agent.md#post-apiv1agents--agent-registration)).
#### Architectural Assumptions
-Bidirectional network connectivity between Service Providers and the DCM Control
-Plane is required. SPs must reach DCM to register, and DCM must reach SPs to
-route provisioning requests. If either direction is blocked, the system cannot
-function regardless of the registration method used.
+SPs require network connectivity to the Agent. The Agent requires outbound
+connectivity to DCM (for registration and heartbeats) and to the Messaging
+System. DCM requires connectivity to the Messaging System. Direct SP-to-DCM
+connectivity is not required.
#### Registration Flow
@@ -98,17 +125,18 @@ capability matrices.
```mermaid
%%{init: {'flowchart': {'rankSpacing': 100, 'nodeSpacing': 10}}}%%
flowchart BT
- subgraph Data_Sources [**Data Sources**]
- DB[("**Service Registry**
SP endpoints")]
+ subgraph DCM_Control_Plane [**DCM Control Plane**]
+ DB[("**Agent Registry**<br/>Agent endpoints &<br/>service types")]
end
- subgraph API_Block [**Service Provider API**]
- Handler["_Service Registration Handler_
+ subgraph Agent_Block [**Agent**]
+ Handler["_SP Registration Handler_
2. Receive Request
- 3. Validate & Process"]
+ 3. Validate & Process
+ 4. Update internal SP registry"]
end
- subgraph Service_API [**Service API**
]
+ subgraph Service_API [**Service Providers**
]
subgraph SP1 [**ServiceProvider 1**
]
VM_Prov["**VM Provider Impl.**
@@ -130,14 +158,14 @@ flowchart BT
end
VM_Prov & Storage_Prov & Container_Prov & Pod_Prov -- 1. Register --> Handler
- Handler -- 4. Update Service Registry --> DB
+ Handler -- 5. Update DCM<br/>POST /api/v1/agents --> DB
```
- Admins predefine supported
[Service Types](https://github.com/dcm-project/enhancements/blob/main/enhancements/service-type-definitions/service-type-definitions.md)
(e.g., "vm", "database")
-- A registration call must be made to the Registration Handler endpoint for each
- service type the SP supports. The payload includes:
+- A registration call must be made to the Agent's SP Registration endpoint for
+ each service type the SP supports. The payload includes:
1. Unique provider name
2. Unique providerID (optional, server-generated if not provided)
3. Endpoint URL (e.g.,
@@ -146,13 +174,17 @@ flowchart BT
5. Metadata (optional: zone, region, resource constraints)
6. Operations supported for this service type (optional, e.g., _"create"_,
_"delete"_)
-- The Registration Handler processes and validates the metadata
-- The Registration Handler internally updates the Service Registry with:
- 1. SP endpoint
- 1. metadata
-- When user requests a catalog offering, Control Plane matches it to registered
- SPs that can fulfill it based on configured policies and calls the selected SP
- endpoint (endpoint must be reachable)
+- The Agent processes and validates the metadata
+- The Agent stores the SP registration in its internal registry and recomputes
+ its list of supported service types
+- When the Agent's service type list changes (a type is added or removed), the
+ Agent updates DCM via `POST /api/v1/agents` (see the
+ [Environment Agent enhancement](../environment-agent/environment-agent.md#sp-registration-to-agent)
+ for the full flow)
+- When a user requests a catalog offering, DCM's Control Plane matches it to a
+ registered Agent that can fulfill it based on configured policies and routes
+ the request through the messaging system to the Agent, which forwards it to
+ the selected SP
The Service Provider's _name_ is the natural key used to match existing
registrations.
@@ -163,18 +195,25 @@ request body. This allows the `id` field in the schema to be `readOnly`,
preventing conflicts between query param and body values. The server sets `id`
from the query parameter or auto-generates it if not provided.
-The registration endpoint is idempotent. During the registration phase:
+The registration endpoint is idempotent. These idempotency semantics apply at
+the Agent level for SP registration. During the registration phase:
-- If the _name_ does not exist in DCM, a new SP entry is created. If no
- _providerID_ is specified, DCM will automatically generate one.
+- If the _name_ does not exist in the Agent's registry, a new SP entry is
+ created. If no _providerID_ is specified, the Agent will automatically
+ generate one.
- If the _name_ already exists and no _providerID_ is provided (or the same
_providerID_ is provided), the existing entry is updated and the same
_providerID_ is returned.
- If the _name_ already exists but a **different** _providerID_ is provided,
registration fails (conflict: another SP is attempting to register with a
taken name).
-- If a new _name_ is provided but the _providerID_ already exists in DCM,
- registration fails (conflict: _providerID_ is already assigned to another SP).
+- If a new _name_ is provided but the _providerID_ already exists in the Agent's
+ registry, registration fails (conflict: _providerID_ is already assigned to
+ another SP).
+
+Identical idempotency semantics (same `name` natural key pattern) apply at the
+DCM level for Agent registration, as defined in the
+[Environment Agent enhancement](../environment-agent/environment-agent.md#re-registration-on-restart).
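
The four idempotency rules listed above can be sketched as an in-memory registry keyed by the SP _name_. This Python example is a minimal illustration under stated assumptions, not the Agent's actual code: `SPRegistry`, `Provider`, and `Conflict` are hypothetical names, generated IDs are assumed to be UUIDs, and the service type uniqueness check is deliberately left out.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


class Conflict(Exception):
    """Hypothetical error that the API layer would map to 409 Conflict."""


@dataclass
class Provider:
    provider_id: str
    name: str
    endpoint: str


class SPRegistry:
    def __init__(self) -> None:
        self._by_name: Dict[str, Provider] = {}

    def register(self, name: str, endpoint: str,
                 provider_id: Optional[str] = None) -> Provider:
        # Rule 4: a providerID already assigned to a different name conflicts.
        if provider_id is not None:
            for other in self._by_name.values():
                if other.provider_id == provider_id and other.name != name:
                    raise Conflict("providerID is assigned to another SP")

        existing = self._by_name.get(name)
        if existing is not None:
            # Rule 3: known name with a different providerID conflicts.
            if provider_id is not None and provider_id != existing.provider_id:
                raise Conflict("name is taken by another SP")
            # Rule 2: same name, same or omitted providerID updates in place.
            existing.endpoint = endpoint
            return existing

        # Rule 1: unknown name creates an entry, generating an ID if needed.
        provider = Provider(provider_id or str(uuid.uuid4()), name, endpoint)
        self._by_name[name] = provider
        return provider
```

In every successful case the returned entry carries the _providerID_, matching the statement that the response always includes it.
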
The response to a registration request will always include the _providerID_,
regardless of whether it was generated or provided. Consistent with AEP, the
@@ -184,35 +223,37 @@ response payload mirrors the request payload with possibly updated values.
The registration endpoint is idempotent. If an SP's capabilities change
(typically due to a new version following a restart), the SP (or admin) can call
-the same registration endpoint again. The Registration Handler will update the
-existing SP entry rather than creating a duplicate.
+the same registration endpoint again. The Agent will update the existing SP
+entry rather than creating a duplicate.
+
+When an SP re-registers with updated capabilities, the Agent recomputes its
+service type list and, if changed, updates DCM via `POST /api/v1/agents`.
- SP serviceType changes
-- SP restarts and re-registers using the same Service Provider API registration
- endpoint
-- The Registration Handler updates the existing Service Provider Registry and
- Service Catalog entry with the new serviceType
-- The Registration Handler detects that the SP already exists by matching the
- Service Provider _name_
-- The Registration Handler updates the existing Service Registry entry with the
- new serviceType and returns the same providerID.
-- There are 3 potential scenarios for updating a Service Provider within DCM:
+- SP restarts and re-registers using the same Agent SP Registration API endpoint
+- The Agent updates the existing SP entry in its internal registry with the new
+ serviceType
+- The Agent detects that the SP already exists by matching the Service Provider
+ _name_
+- The Agent updates the existing SP entry with the new serviceType and returns
+ the same providerID.
+- There are 3 potential scenarios for updating a Service Provider:
1. SP's _name_ update: If only the SP's name changes (but the providerID remains
- the same), DCM updates the SP's name. An attempt to update with a
+ the same), the Agent updates the SP's name. An attempt to update with a
pre-existing SP's name will result in failure.
2. _providerID_ update: If only the _providerID_ changes (but the SP's _name_
- remains the same), DCM updates the providerID. An attempt to update with a
- pre-existing _providerID_ will result in failure.
-3. Both the SP's name and providerID change: DCM cannot reliably determine if
- this is an update to the existing SP or a new registration of a distinct SP.
- In this scenario the required action is to delete and re-create the SP.
+ remains the same), the Agent updates the providerID. An attempt to update
+ with a pre-existing _providerID_ will result in failure.
+3. Both the SP's name and providerID change: The Agent cannot reliably determine
+ if this is an update to the existing SP or a new registration of a distinct
+ SP. In this scenario the required action is to delete and re-create the SP.
###### Example
- First registration (with client-specified id):
-`POST /api/v1/providers?id=uuid-1234`
+`POST /api/v1/providers?id=uuid-1234` (on the Agent)
```yaml
{
@@ -246,7 +287,7 @@ Response:
- First registration (with server generated id):
-`POST /api/v1/providers`
+`POST /api/v1/providers` (on the Agent)
```yaml
{
@@ -268,7 +309,7 @@ Response:
- Re-registration (SP restarts, same endpoint):
-`POST /api/v1/providers`
+`POST /api/v1/providers` (on the Agent)
```yaml
{
@@ -293,8 +334,23 @@ Response:
### Risks and Mitigations
+The risks related to the Agent-based architecture (agent as single point of
+failure, unauthenticated SP registration, messaging system dependencies) are
+documented in the
+[Environment Agent enhancement](../environment-agent/environment-agent.md#risks-and-mitigations).
+
+### Next Steps
+
+- HA agent replicas for high availability per environment
+- Authenticated SP registration (AuthN/AuthZ for the Agent's SP Registration
+ API)
+- Dynamic cost tier updates without agent restart
+
## Alternatives
+The following alternatives were evaluated before the current Agent-based
+architecture was adopted. They are retained for historical context.
+
### Dynamic Registration Approach
#### Description
@@ -463,4 +519,7 @@ flowchart BT
#### Why rejected
Too complex for initial delivery. Requirements for network scanning, discovery
-protocols, and security policies are not yet defined.
+protocols, and security policies are not yet defined. The Agent-based
+architecture further reinforces this rejection: the Agent eliminates the need
+for direct DCM-to-SP connectivity, making a DCM-driven network scanning approach
+even less aligned with the current architecture.
diff --git a/enhancements/sp-resource-manager/sp-resource-manager.md b/enhancements/sp-resource-manager/sp-resource-manager.md
index 59ff754..c97ed3f 100644
--- a/enhancements/sp-resource-manager/sp-resource-manager.md
+++ b/enhancements/sp-resource-manager/sp-resource-manager.md
@@ -11,21 +11,24 @@ reviewers:
- "@pkliczewski"
- "@gabriel-farache"
creation-date: 2026-01-02
+see-also:
+ - "/enhancements/environment-agent/environment-agent.md"
---
# Service Provider Resource Manager
## Summary
-The DCM Service Provider Resource Manager provides a centralized intermediary
-service between Placement Manager and Service Providers (SPs) for creating and
-managing service type instances. Rather than having Placement Manager directly
-call individual SPs, the Resource Manager abstracts SP interactions by handling
-SP lookup (retrieving SP endpoints and metadata from the Service Registry),
-health validation, instance tracking, and database persistence. This design
-simplifies Placement Manager logic, ensures consistent instance management
-across all SPs, and provides a single point of control for instance lifecycle
-operations within DCM core.
+The DCM Service Provider Resource Manager (SPRM) provides a centralized
+intermediary service between Placement Manager and Environment Agents for
+creating and managing service type instances. Rather than having Placement
+Manager interact with Service Providers directly, the Resource Manager abstracts
+agent interactions by looking up agent details from the Agent Registry, checking
+agent health and congestion state, publishing creation and deletion CloudEvents
+to the agent's messaging topic, and consuming responses from
+`dcm.agents.responses`. This design simplifies Placement Manager logic, ensures
+consistent instance management across all agents, and provides a single point of
+control for instance lifecycle operations within DCM core.
## Motivation
@@ -48,25 +51,35 @@ operations within DCM core.
### Assumptions
-- The SP Resource Manager has connectivity to the registered SPs.
-- The SP Resource Manager has access/permission to the database.
+- The SP Resource Manager has access to the Messaging System for publishing
+ CloudEvents and consuming responses.
+- A Messaging System (e.g., NATS) is deployed and accessible.
+- The SP Resource Manager has access to the Agent Registry and instance record
+ database.
- The SP Resource Manager is reachable from the Placement Manager.
- The SP Resource Manager lives within the SP API.
-- The database persists both SP registry information and created resource
### Integration Points
#### Database Integration
-- **Service Registry**:
- - Stores Service Provider's registration information
- - Used for retrieving SP details during instance creation
- - SP info includes `endpoints`, `metadata`, `status` and `resource capacity`
+- **Agent Registry**:
+ - Stores Agent registration information (name, environment, serviceTypes,
+ topicName, cost, healthStatus, consumerLag)
+ - Used for retrieving agent details during instance creation and deletion
- **Service Type Instance Records**:
- Stores created service type instance information
- - Instance data includes `instanceId`, `providerName`, `status`.
+ - Instance data includes `instanceId`, `agentName`, `serviceType`, `status`.
+ The `providerName` field is populated asynchronously from the agent's
+ creation-acknowledged CloudEvent.
- Maintains record of all created instances within DCM core
+#### Messaging System
+
+- **Publishing**: SPRM publishes creation and deletion request CloudEvents to
+ the agent's topic (`{agentTopicName}`)
+- **Consuming**: SPRM consumes response CloudEvents from `dcm.agents.responses`
+
### API Endpoints
The CRUD endpoints are consumed by the DCM Placement Manager to create and
@@ -103,13 +116,20 @@ requestBody:
schema:
type: object
required:
- - providerName
+ - agentName
+ - serviceType
- spec
properties:
- providerName:
+ agentName:
type: string
- description: The unique identifier of the target Service Provider
- example: "kubevirt-sp"
+ description: The name of the target Environment Agent
+ example: "prod-eu-agent"
+ serviceType:
+ type: string
+ description:
+ The type of service to create (e.g., vm, container, database,
+ cluster)
+ example: "vm"
spec:
type: object
description: |
@@ -122,7 +142,8 @@ Example of payload for incoming VM request
```json
{
- "providerName": "kubevirt-sp",
+ "agentName": "prod-eu-agent",
+ "serviceType": "vm",
"spec": {
"memory": { "size": "2GB" },
"vcpu": { "count": 2 },
@@ -144,19 +165,19 @@ Example of Response Payload
[
{
"name": "nginx-container",
- "providerName": "container-sp",
+ "agentName": "container-agent",
"instanceId": "696511df-1fcb-4f66-8ad5-aeb828f383a0",
"status": "PROVISIONING"
},
{
"name": "postgres-001",
- "providerName": "postgres-sp",
+ "agentName": "postgres-agent",
"instanceId": "c66be104-eea3-4246-975c-e6cc9b32d74d",
"status": "FAILED"
},
{
"name": "ubuntu-vm",
- "providerName": "kubevirt-sp",
+ "agentName": "prod-eu-agent",
"instanceId": "08aa81d1-a0d2-4d5f-a4df-b80addf07781",
"status": "PROVISIONING"
}
@@ -171,7 +192,7 @@ Example of Response Payload
```json
{
"name": "ubuntu-vm",
- "providerName": "kubevirt-sp",
+ "agentName": "prod-eu-agent",
"instanceId": "08aa81d1-a0d2-4d5f-a4df-b80addf07781",
"status": "PROVISIONING"
}
@@ -190,7 +211,7 @@ Retrieve the health status of SP Resource Manager.
This flow demonstrates the creation of a service type instance (VMs, containers,
databases, or clusters) through the SP Resource Manager. It involves
communication between the Placement Manager, SP Resource Manager, database, and
-the targeted Service Provider.
+the Messaging System.
```mermaid
sequenceDiagram
@@ -198,39 +219,23 @@ sequenceDiagram
participant PS as Placement Manager
participant SPRM as SP Resource Manager
participant DB as Database
- participant SP as Service Provider
+ participant MS as Messaging System
- PS->>SPRM: POST /api/v1/service-type-instances
{providerName, spec}
+ PS->>SPRM: POST /api/v1/service-type-instances<br/>{agentName, serviceType, spec}
activate SPRM
-
- alt SP not found
+ SPRM->>DB: Lookup agent by agentName
+ alt Agent not found
SPRM-->>PS: 404 Not Found
- else SP Health Check fails
+ else Agent Unavailable or Congested
SPRM-->>PS: 503 Service Unavailable
- SPRM->>SP: POST {SP_endpoint}/api/v1/services
{payload}
- activate SP
-
- alt SP creation fails
- SP-->>SPRM: Error response
- deactivate SP
- SPRM-->>PS: Return SP error
(SP creation failed)
- else SP creation succeeds
- SP-->>SPRM: Success response
{instanceId, status, metadata}
- SPRM->>DB: Create instance record
{instanceId, providerName, metadata}
- activate DB
-
- alt DB record creation fails
- DB-->>SPRM: Error response
- deactivate DB
- SPRM-->>PS: 500 Internal Server Error
{instanceId, error}
-
- else DB record creation succeeds
- DB-->>SPRM: Record created
- SPRM-->>PS: 202 Accepted
{instanceId, status}
- end
- end
+ else Agent healthy
+ SPRM->>DB: Generate resourceId<br/>Create instance record<br/>{resourceId, agentName, serviceType, status: PENDING}
+
+ SPRM->>MS: PUBLISH CloudEvent<br/>topic: {topicName}<br/>type: dcm.request.create<br/>{resourceId, serviceType, spec}
+
+ SPRM-->>PS: 202 Accepted<br/>{instanceId, agentName, status: PENDING}
end
deactivate SPRM
```
@@ -240,43 +245,143 @@ sequenceDiagram
- **Request Reception**
- SP Resource Manager receives a POST request
(`/api/v1/service-type-instances`) from Placement Manager with:
- - `providerName`: The unique identifier of the target Service Provider
- - `spec`: The detailed spec following any of service type schema (VMSpec,
- ContainerSpec, DatabaseSpec, or ClusterSpec)
-- **Service Provider Lookup**
- - Queries the Service Registry database using the `providerName`
+ - `agentName`: The name of the target Environment Agent
+ - `serviceType`: The type of service to create (e.g., vm, container)
+ - `spec`: The detailed spec following any of the service type schemas
+ (VMSpec, ContainerSpec, DatabaseSpec, or ClusterSpec)
+- **Agent Lookup**
+ - Queries the Agent Registry by `agentName`
- Retrieves:
- - Service Provider endpoint URL
- - SP metadata (region, providerName etc)
- - Current SP status (healthy, degraded, unavailable)
- - If SP is not found, returns 404 error to Placement Manager
- - If SP status is degraded or unavailable, returns 503 error to Placement
- Manager
-- **Service Provider Invocation**
- - Calls the Service Provider's API endpoint:
- `POST {SP_endpoint}/api/v1/services`
- - Forwards the service specification (payload) to the SP
- - If SP instance creation fails, forward the SP's error response to Placement
- Manager
-- **Persist Response**
- - Receives response from Service Provider containing:
- - `instanceId`: Unique identifier for the created instance
- - `status`: Creation status (`PROVISIONING`)
- - Stores instance metadata in the database
- - If database record creation fails, returns 500 Internal Server Error with
- `instanceId` included in error response (instance was created by SP but
- tracking failed)
+ - `topicName`: The agent's messaging topic
+ - `healthStatus`: Current agent health (Ready, Unavailable)
+ - `consumerLag`: Current consumer lag for congestion detection
+ - If agent is not found, returns 404 error to Placement Manager
+ - If agent is Unavailable (missed heartbeats) or Congested (consumer lag
+ threshold exceeded), returns 503 error to Placement Manager
+- **Instance Record Creation**
+ - Generates a `resourceId` for the new instance
+ - Creates an instance record in the database with status `PENDING`
+ - The record includes `resourceId`, `agentName`, `serviceType`, and `status`
+- **CloudEvent Publishing**
+ - Publishes a creation request CloudEvent to the agent's topic (`{topicName}`)
+ via the Messaging System
+ - CloudEvent type: `dcm.request.create`
+ - CloudEvent data: `{resourceId, serviceType, spec}`
+ - See
+ [Environment Agent - CloudEvent Message Definitions](../environment-agent/environment-agent.md#cloudevent-message-definitions)
+ for the full CloudEvent schema
- **Response to Placement Manager**
- - Returns success response (202 Accepted) with:
+ - Returns 202 Accepted with:
- `instanceId`: The created instance identifier
- - `status`: Current instance status
- - Returns error response with appropriate HTTP status code and error details
- if any step fails
+ - `agentName`: The target agent
+ - `status`: `PENDING`
+ - At this point only `agentName` is known; `providerName` is populated
+ asynchronously when the agent's creation-acknowledged response arrives
+
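The creation steps above can be sketched end to end with in-memory stand-ins. This Python example is illustrative and rests on several assumptions not stated in the document: the function `create_instance`, the `publish` callback, the `lag_threshold` parameter, and the CloudEvent `source` value are hypothetical, and the returned `instanceId` is assumed to be the same value as the generated `resourceId`.

```python
import uuid
from typing import Any, Callable, Dict, Tuple

Record = Dict[str, Any]


def create_instance(request: Record,
                    agents: Dict[str, Record],
                    instances: Dict[str, Record],
                    publish: Callable[[str, Record], None],
                    lag_threshold: int) -> Tuple[int, Record]:
    """Return (HTTP status code, response body) for a creation request."""
    # Agent lookup in the (stand-in) Agent Registry.
    agent = agents.get(request["agentName"])
    if agent is None:
        return 404, {"error": "agent not registered"}

    # Unavailable (missed heartbeats) or Congested (lag at threshold).
    if (agent["healthStatus"] != "Ready"
            or agent["consumerLag"] >= lag_threshold):
        return 503, {"error": "agent unavailable or congested"}

    # Instance record creation with status PENDING.
    resource_id = str(uuid.uuid4())
    instances[resource_id] = {
        "resourceId": resource_id,
        "agentName": request["agentName"],
        "serviceType": request["serviceType"],
        "status": "PENDING",
    }

    # CloudEvent publishing to the agent's topic. The attribute names follow
    # CloudEvents 1.0; the source value is a placeholder.
    event = {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": "dcm/sp-resource-manager",
        "type": "dcm.request.create",
        "data": {
            "resourceId": resource_id,
            "serviceType": request["serviceType"],
            "spec": request["spec"],
        },
    }
    publish(agent["topicName"], event)

    return 202, {"instanceId": resource_id,
                 "agentName": request["agentName"],
                 "status": "PENDING"}
```

The sketch does not cover what happens if publishing fails after the record is written; the document does not define that case.
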
+### Service Type Instance Deletion Flow
+
+This flow demonstrates the deletion of a service type instance through the SP
+Resource Manager. It mirrors the creation flow, publishing a deletion CloudEvent
+instead of a creation one.
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant PS as Placement Manager
+ participant SPRM as SP Resource Manager
+ participant DB as Database
+ participant MS as Messaging System
+
+ PS->>SPRM: DELETE /api/v1/service-type-instances/{instanceId}
+ activate SPRM
+
+ SPRM->>DB: Lookup instance by instanceId<br/>Get agentName, serviceType, resourceId
+
+ SPRM->>DB: Lookup agent by agentName
+ alt Agent not found
+ SPRM-->>PS: 404 Not Found
+ else Agent Unavailable or Congested
+ SPRM-->>PS: 503 Service Unavailable
+ else Agent healthy
+ SPRM->>MS: PUBLISH CloudEvent<br/>topic: {topicName}<br/>type: dcm.request.delete<br/>{resourceId, serviceType}
+
+ SPRM->>DB: Update instance status to DELETING
+ SPRM-->>PS: 202 Accepted<br/>{instanceId, status: DELETING}
+ end
+ deactivate SPRM
+```
+
+#### Steps
+
+- **Request Reception**
+ - SP Resource Manager receives a DELETE request
+ (`/api/v1/service-type-instances/{instanceId}`) from Placement Manager
+- **Instance Lookup**
+ - Queries the database by `instanceId`
+ - Retrieves `agentName`, `serviceType`, and `resourceId` from the instance
+ record
+- **Agent Lookup**
+ - Queries the Agent Registry by `agentName`
+ - Retrieves `topicName`, `healthStatus`, and `consumerLag`
+ - If agent is not found, returns 404 error to Placement Manager
+ - If agent is Unavailable or Congested, returns 503 error to Placement Manager
+- **CloudEvent Publishing**
+ - Publishes a deletion request CloudEvent to the agent's topic (`{topicName}`)
+ via the Messaging System
+ - CloudEvent type: `dcm.request.delete`
+ - CloudEvent data: `{resourceId, serviceType}`
+ - See
+ [Environment Agent - CloudEvent Message Definitions](../environment-agent/environment-agent.md#cloudevent-message-definitions)
+ for the full CloudEvent schema
+- **Instance Record Update**
+ - Updates the instance record status to `DELETING`
+- **Response to Placement Manager**
+ - Returns 202 Accepted with:
+ - `instanceId`: The instance identifier
+ - `status`: `DELETING`
+
+> **Note:** The Placement Manager also uses this DELETE endpoint to cancel
+> requests that were queued by an agent (when the SP for the service type was
+> unhealthy). When the Placement Manager's `queuedRequestTimeout` expires, it
+> sends a DELETE for the queued instance, then re-evaluates policies to select
+> an alternative agent. The agent handles creation/deletion dedup in its retry
+> topic — if both the original creation request and the cancellation DELETE are
+> present, they cancel out (see
+> [Environment Agent — Retry Topic](../environment-agent/environment-agent.md#retry-topic)).
+
+### Asynchronous Response Processing
+
+The SP Resource Manager consumes response CloudEvents from the
+`dcm.agents.responses` topic. These responses are published by Environment
+Agents after processing creation or deletion requests. The following table
+describes the actions taken for each response type:
+
+| CloudEvent Type | Action |
+| --------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
+| `dcm.agent.creation-acknowledged` | Update instance record: status to `PROVISIONING`, store `providerName` from response |
+| `dcm.agent.deletion-acknowledged` | Update instance record: status to `DELETING` |
+| `dcm.agent.error` | Update instance record: status to `FAILED`, store error details. Notify Placement Manager. |
+| `dcm.agent.request-queued` | Update instance record: status to `QUEUED`. Report queued status to Placement Manager (PM handles timeout logic). |
+
+Note: `providerName` in instance records is populated asynchronously. At 202
+response time, only `agentName` is known. The `providerName` is set when the
+agent's `dcm.agent.creation-acknowledged` CloudEvent arrives, which includes the
+SP that ultimately handled the request.
+
+See
+[Environment Agent - CloudEvent Message Definitions](../environment-agent/environment-agent.md#cloudevent-message-definitions)
+for the full CloudEvent type definitions and data schemas.
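
The response table above maps naturally onto a dispatch by CloudEvent type. The Python sketch below is an assumption-laden illustration rather than the SPRM implementation: `handle_response` and `notify_pm` are hypothetical names, and the `data` fields `resourceId`, `providerName`, and `error` are assumed shapes; the authoritative schemas are in the Environment Agent enhancement.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]

# Instance status set for each response type, taken from the table above.
STATUS_BY_TYPE = {
    "dcm.agent.creation-acknowledged": "PROVISIONING",
    "dcm.agent.deletion-acknowledged": "DELETING",
    "dcm.agent.error": "FAILED",
    "dcm.agent.request-queued": "QUEUED",
}


def handle_response(event: Record,
                    instances: Dict[str, Record],
                    notify_pm: Callable[[Record], None]) -> bool:
    """Apply one response CloudEvent; return False if it was not applied."""
    status = STATUS_BY_TYPE.get(event.get("type"))
    if status is None:
        return False  # unknown type; dead-letter handling is a next step

    data = event.get("data", {})
    record = instances.get(data.get("resourceId"))
    if record is None:
        return False  # no matching instance record

    record["status"] = status
    if event["type"] == "dcm.agent.creation-acknowledged":
        # providerName becomes known only at this point.
        record["providerName"] = data.get("providerName")
    elif event["type"] == "dcm.agent.error":
        record["error"] = data.get("error")
        notify_pm(record)
    elif event["type"] == "dcm.agent.request-queued":
        # The Placement Manager owns the queued-request timeout logic.
        notify_pm(record)
    return True
```

Returning `False` for unknown types or unknown instances keeps the consumer loop simple and leaves room for the dead-letter handling listed under Next Steps.
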
#### Error Handling
-- **404 Not Found**: Service Provider with the given `providerName` is not
- registered
+- **404 Not Found**: Agent with the given `agentName` is not registered
- **400 Bad Request**: Invalid request schema
-- **503 Service Unavailable**: Service Provider is not healthy
+- **503 Service Unavailable**: Agent is Unavailable (missed heartbeats) or
+ Congested (consumer lag threshold exceeded)
- **500 Internal Server Error**: Unexpected error in SP Resource Manager
+
+### Next Steps
+
+- Dead-letter handling for unprocessable responses
+- Batch publishing of CloudEvents
+- Per-agent response timeout configuration
diff --git a/enhancements/user-flows/user-flows.md b/enhancements/user-flows/user-flows.md
index f5a8c5a..7ae1c78 100644
--- a/enhancements/user-flows/user-flows.md
+++ b/enhancements/user-flows/user-flows.md
@@ -15,13 +15,14 @@ see-also:
- "/enhancements/kubevirt-sp/kubevirt-sp.md"
- "/enhancements/k8s-container-sp/k8s-container-sp.md"
- "/enhancements/acm-cluster-sp/acm-cluster-sp.md"
+ - "/enhancements/environment-agent/environment-agent.md"
---
# DCM User Flows
This document summarizes the primary user flows in the DCM system, covering
policy management, service type and catalog item management, service provider
-lifecycle, and end-to-end CatalogItemInstance creation.
+and agent lifecycle, and end-to-end CatalogItemInstance creation and deletion.
## Table of Contents
@@ -35,16 +36,22 @@ lifecycle, and end-to-end CatalogItemInstance creation.
- [4. Managing CatalogItems](#4-managing-catalogitems)
- [4.1 Create CatalogItem](#41-create-catalogitem)
- [4.2 CatalogItem to ServiceType Translation](#42-catalogitem-to-servicetype-translation)
-- [5. Service Provider Lifecycle](#5-service-provider-lifecycle)
- - [5.1 Service Provider Registration](#51-service-provider-registration)
- - [5.2 Service Provider Health Checks](#52-service-provider-health-checks)
- - [5.3 Service Provider Status Reporting](#53-service-provider-status-reporting)
+- [5. Service Provider & Agent Lifecycle](#5-service-provider--agent-lifecycle)
+ - [5.1 Service Provider Registration (SP → Agent)](#51-service-provider-registration-sp--agent)
+ - [5.2 Agent Registration (Agent → DCM)](#52-agent-registration-agent--dcm)
+ - [5.3 Health Monitoring](#53-health-monitoring)
+ - [5.3.1 SP Health (Agent → SP)](#531-sp-health-agent--sp)
+ - [5.3.2 Agent Health (Agent → DCM heartbeats)](#532-agent-health-agent--dcm-heartbeats)
+ - [5.3.3 Consumer Lag Monitoring](#533-consumer-lag-monitoring)
+ - [5.4 Service Provider Status Reporting](#54-service-provider-status-reporting)
+ - [5.5 Agent Lifecycle](#55-agent-lifecycle)
- [6. CatalogItemInstance Creation (End-to-End)](#6-catalogiteminstance-creation-end-to-end)
- [6.1 Full Creation Flow](#61-full-creation-flow)
- [6.2 Placement Manager Flow](#62-placement-manager-flow)
- [6.3 SP Resource Manager Flow](#63-sp-resource-manager-flow)
- [6.4 Service Provider Instance Creation](#64-service-provider-instance-creation)
- [6.5 Continuous Status Reporting](#65-continuous-status-reporting)
+ - [6.6 Deletion Flow](#66-deletion-flow)
---
@@ -52,16 +59,17 @@ lifecycle, and end-to-end CatalogItemInstance creation.
The DCM system is composed of the following core components:
-| Component | Responsibility |
-| ---------------------------------- | ----------------------------------------------------------------------------------------------------- |
-| **Catalog Manager** | Entry point for user requests; manages CatalogItems and CatalogItemInstances |
-| **Catalog DB** | Stores CatalogItems, CatalogItemInstances, and ServiceType definitions |
-| **Placement Manager** | Orchestrates instance creation; coordinates policy evaluation and SP selection |
-| **Policy Manager (Policy Engine)** | Validates, mutates, and selects Service Providers via REGO policies and OPA |
-| **SP Resource Manager** | Intermediary between Placement Manager and Service Providers; handles SP lookup and health validation |
-| **Service Registry** | Stores Service Provider registration, endpoints, and metadata |
-| **Service Providers** | Execute infrastructure provisioning (KubeVirt SP, K8s Container SP, ACM Cluster SP) |
-| **Messaging System** | Handles CloudEvents for asynchronous status reporting (NATS) |
+| Component | Responsibility |
+| ---------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
+| **Catalog Manager** | Entry point for user requests; manages CatalogItems and CatalogItemInstances |
+| **Catalog DB** | Stores CatalogItems, CatalogItemInstances, and ServiceType definitions |
+| **Placement Manager** | Orchestrates instance creation; coordinates policy evaluation and agent selection |
+| **Policy Manager (Policy Engine)** | Validates, mutates, and selects Agents via REGO policies and OPA |
+| **SP Resource Manager** | Intermediary between Placement Manager and Agents; publishes CloudEvents to agent topics; consumes responses |
+| **Agent Registry** | Stores Agent registration data (name, environment, serviceTypes, topicName, cost, healthStatus) |
+| **Environment Agent** | Runs in target environment; routes creation/deletion requests to SPs; monitors SP health; reports to DCM via heartbeats |
+| **Service Providers** | Execute infrastructure provisioning (KubeVirt SP, K8s Container SP, ACM Cluster SP) |
+| **Messaging System** | Handles CloudEvents for asynchronous request delivery and status reporting (NATS) |
```mermaid
graph TB
@@ -72,10 +80,12 @@ graph TB
PM[Placement Manager]
POL[Policy Manager / OPA]
SPRM[SP Resource Manager]
- SR[(Service Registry)]
+ AR[(Agent Registry)]
DB[(Placement DB)]
- MSG[Messaging System / NATS]
+ MS[Messaging System / NATS]
+ AG1[Agent - Environment 1]
+ AG2[Agent - Environment 2]
SP1[KubeVirt SP]
SP2[K8s Container SP]
SP3[ACM Cluster SP]
@@ -88,15 +98,22 @@ graph TB
PM --> POL
PM --> SPRM
PM --> DB
- SPRM --> SR
- SPRM --> SP1
- SPRM --> SP2
- SPRM --> SP3
-
- SP1 -->|status events| MSG
- SP2 -->|status events| MSG
- SP3 -->|status events| MSG
- MSG -->|status updates| SPRM
+ SPRM --> AR
+ SPRM -->|publish requests| MS
+ MS -->|deliver requests| AG1
+ MS -->|deliver requests| AG2
+ AG1 --> SP1
+ AG1 --> SP2
+ AG2 --> SP3
+ AG1 -.->|registration & heartbeat| DCM_API
+ AG2 -.->|registration & heartbeat| DCM_API
+ DCM_API[DCM API]
+ DCM_API --> AR
+
+ SP1 -->|status events| MS
+ SP2 -->|status events| MS
+ SP3 -->|status events| MS
+ MS -->|status updates| SPRM
SPRM -->|status updates| CM
```
@@ -104,9 +121,9 @@ graph TB
## 2. Managing Policies
-Policies control validation, mutation, and Service Provider selection for all
-resource requests. They are organized in a three-level hierarchy: **Global**
-(Super Admin), **Tenant** (Tenant Admin), and **User** (End User).
+Policies control validation, mutation, and Agent selection for all resource
+requests. They are organized in a three-level hierarchy: **Global** (Super
+Admin), **Tenant** (Tenant Admin), and **User** (End User).
### 2.1 Create Policy
@@ -154,10 +171,10 @@ sequenceDiagram
### 2.2 Policy Evaluation
When a resource request arrives, the Policy Manager fetches all matching enabled
-policies, sorts them by level (Global → Tenant → User) then priority
+policies, sorts them by level (Global > Tenant > User) then priority
(ascending), and evaluates them in a chain-of-responsibility pipeline. Each
policy can reject the request, apply patches (mutations), set constraints, and
-influence Service Provider selection.
+influence Agent selection.
```mermaid
sequenceDiagram
@@ -166,14 +183,14 @@ sequenceDiagram
participant DB as Policy DB
participant OPA as OPA Engine
- PM->>PE: POST /api/v1alpha1/policies:evaluateRequest<br/>{service_instance: {spec}}
+ PM->>PE: POST /api/v1alpha1/policies:evaluateRequest<br/>{service_instance: {spec}, available_agents}
PE->>DB: Fetch enabled policies matching request via label selector
PE->>PE: Sort by Level (Global→Tenant→User), then Priority (asc)
loop For each policy in sorted order
- PE->>OPA: Evaluate policy with:<br/>{spec, provider, constraints, service_provider_constraints}
- OPA-->>PE: {rejected, patch, constraints,<br/>selected_provider, service_provider_constraints}
+ PE->>OPA: Evaluate policy with:<br/>{spec, agent, constraints, agent_constraints}
+ OPA-->>PE: {rejected, patch, constraints,<br/>selected_agent, agent_constraints}
alt rejected == true
PE-->>PM: 406 Not Acceptable (rejection_reason)
@@ -185,12 +202,12 @@ sequenceDiagram
end
PE->>PE: Merge constraints into ConstraintContext
- PE->>PE: Merge service_provider_constraints
+ PE->>PE: Merge agent_constraints
PE->>PE: Validate & apply patches against constraints
- PE->>PE: Validate selected_provider against SP constraints
+ PE->>PE: Validate selected_agent against agent constraints
end
- PE-->>PM: 200 OK {evaluatedServiceInstance, selectedProvider, status}
+ PE-->>PM: 200 OK {evaluatedServiceInstance, selectedAgent, status}
```
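The ordering step in the diagram (level first, then ascending priority) can be sketched as a two-part sort key. The `Policy` dataclass and `evaluation_order` helper are hypothetical names used only for illustration; the real Policy Manager data model may differ.

```python
from dataclasses import dataclass

# Global policies are evaluated first, then Tenant, then User.
LEVEL_ORDER = {"Global": 0, "Tenant": 1, "User": 2}


@dataclass
class Policy:  # hypothetical minimal policy shape
    name: str
    level: str
    priority: int


def evaluation_order(policies):
    """Sort by level (Global, Tenant, User), then by ascending priority."""
    return sorted(policies, key=lambda p: (LEVEL_ORDER[p.level], p.priority))
```

For example, a Tenant policy with priority 1 still runs after every Global policy, because the level is compared before the priority.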
**Evaluation request (Placement Manager → Policy Manager):**
@@ -205,7 +222,21 @@ sequenceDiagram
"guestOS": { "type": "fedora-39" },
"metadata": { "name": "fedora-vm" }
}
- }
+ },
+ "available_agents": [
+ {
+ "name": "agent-prod-eu-west-1",
+ "environment": "prod-eu-west-1",
+ "serviceTypes": ["vm", "container"],
+ "cost": "medium"
+ },
+ {
+ "name": "agent-dev-us-east-1",
+ "environment": "dev-us-east-1",
+ "serviceTypes": ["vm"],
+ "cost": "low"
+ }
+ ]
}
```
@@ -220,9 +251,18 @@ sequenceDiagram
"guestOS": { "type": "fedora-39" },
"metadata": { "name": "fedora-vm" }
},
- "provider": "",
+ "agent": "",
"constraints": {},
- "service_provider_constraints": {}
+ "agent_constraints": {},
+ "available_agents": [
+ {
+ "name": "agent-prod-eu-west-1",
+ "environment": "prod-eu-west-1",
+ "serviceTypes": ["vm", "container"],
+ "cost": "medium"
+ }
+ ],
+ "exclude_agents": []
}
```
@@ -240,10 +280,13 @@ sequenceDiagram
"region": { "const": "us-east-1" },
"vcpu": { "minimum": 2, "maximum": 8 }
},
- "selected_provider": "kubevirt-sp",
- "service_provider_constraints": {
- "allow_list": ["kubevirt-sp", "vmware-sp"],
- "patterns": []
+ "selected_agent": "agent-prod-eu-west-1",
+ "agent_constraints": {
+ "allow_list": ["agent-prod-eu-west-1", "agent-staging-eu-west-1"],
+ "patterns": [],
+ "environment_constraints": {
+ "allow_list": ["prod-eu-west-1", "staging-eu-west-1"]
+ }
}
}
```
@@ -253,7 +296,7 @@ sequenceDiagram
```json
{
"evaluatedServiceInstance": { "...": "final mutated spec" },
- "selectedProvider": "kubevirt-sp",
+ "selectedAgent": "agent-prod-eu-west-1",
"status": "APPROVED | MODIFIED"
}
```
@@ -431,32 +474,59 @@ sequenceDiagram
---
-## 5. Service Provider Lifecycle
+## 5. Service Provider & Agent Lifecycle
-### 5.1 Service Provider Registration
+Service Providers register with the Environment Agent in their target
+environment. The Agent registers with DCM and acts as the intermediary for
+resource operation requests. For full details on agent behavior, see the
+[Environment Agent enhancement](/enhancements/environment-agent/environment-agent.md).
-Service Providers register with DCM per service type. Registration is idempotent
-— re-registering with the same name updates the existing entry.
+### 5.1 Service Provider Registration (SP → Agent)
+
+The Agent supports a hybrid SP model. It ships with embedded SP code for known
+service types (K8s Container, ACM Cluster, KubeVirt), each enabled via
+configuration, and it also accepts external ("bring your own") SPs that register
+via the REST API. Only one SP, embedded or external, may serve a given service
+type per agent; a registration for a service type that another SP already
+serves is rejected with `409 Conflict`. Embedded SPs register internally at
+agent startup; external SPs register via `POST /api/v1/providers`. Registration
+is idempotent: re-registering with the same name updates the existing entry.
+External SPs periodically re-register to maintain their lease, so after an
+agent restart their next re-registration rebuilds the agent's state.
```mermaid
sequenceDiagram
participant SP as Service Provider
- participant SR as Service Registry
+ participant AG as Agent
+ participant DCM as DCM Control Plane
+ participant DB as Database
+
+ Note over AG: Embedded SPs registered<br/>internally at startup
- SP->>SR: POST /api/v1/providers<br/>{name, displayName, endpoint, serviceType, metadata}
+ SP->>AG: POST /api/v1/providers<br/>{name, displayName, endpoint, serviceType, metadata}
- alt Name does not exist
- SR->>SR: Create new SP entry, generate providerID
- SR-->>SP: 201 Created {id, name, status: "registered"}
+ alt Service type already served by another SP
+ AG-->>SP: 409 Conflict<br/>{error: "service type X already served by provider Y"}
+ else Name does not exist
+ AG->>AG: Create new SP entry, generate providerID
+ AG-->>SP: 201 Created {id, name, status: "registered"}
else Name exists, same providerID
- SR->>SR: Update existing entry
- SR-->>SP: 200 OK {id, name, status: "registered"}
+ AG->>AG: Update existing entry
+ AG-->>SP: 200 OK {id, name, status: "registered"}
else Name exists, different providerID
- SR-->>SP: 409 Conflict
+ AG-->>SP: 409 Conflict
end
+
+ alt Service type list changed AND agent registered with DCM
+ AG->>DCM: POST /api/v1/agents<br/>{name, environment, serviceTypes, cost, topicName}
+ DCM->>DB: Update agent registration
+ DCM-->>AG: 200 OK
+ end
+
+ Note over SP,AG: SP periodically re-registers<br/>to maintain lease
```
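The registration branches in the diagram above can be sketched as a small in-memory registry. This is an illustration under stated assumptions: `ProviderRegistry` and its `register` method are hypothetical names, and passing the previously issued `provider_id` on re-registration is an assumed mechanism that the diagram does not spell out.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ProviderRegistry:  # hypothetical in-memory registry kept by the Agent
    # Maps SP name to (provider_id, service_type).
    by_name: dict = field(default_factory=dict)

    def register(self, name, service_type, provider_id=None):
        """Return (http_status, provider_id) following the diagram's branches."""
        # 409: a different SP already serves this service type.
        for other_name, (_, other_type) in self.by_name.items():
            if other_type == service_type and other_name != name:
                return 409, None
        existing = self.by_name.get(name)
        if existing is None:
            # 201: new name, generate a providerID.
            new_id = str(uuid.uuid4())
            self.by_name[name] = (new_id, service_type)
            return 201, new_id
        if provider_id == existing[0]:
            # 200: same name and same providerID, update the entry.
            self.by_name[name] = (existing[0], service_type)
            return 200, existing[0]
        # 409: same name but a different providerID.
        return 409, None
```

Checking the service-type conflict before the name lookup mirrors the order of the `alt` branches in the diagram.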
-**Registration payload example:**
+**Registration payload example (SP → Agent):**
```json
{
@@ -475,47 +545,154 @@ sequenceDiagram
}
```
-### 5.2 Service Provider Health Checks
+### 5.2 Agent Registration (Agent → DCM)
-DCM polls each registered Service Provider's `/health` endpoint at a
-configurable interval (default: every 10 seconds). Health status determines
-whether a provider can receive new requests.
+The Agent registers with DCM after creating its messaging topics and after at
+least one SP (embedded or external) is registered and healthy. Registration is
+idempotent — the agent `name` is the natural key. On restart, the agent
+re-registers; DCM resets the heartbeat tracker. For full registration details,
+see the
+[Environment Agent enhancement](/enhancements/environment-agent/environment-agent.md).
-#### Health State Diagram
+```mermaid
+sequenceDiagram
+ autonumber
+ participant AG as Agent
+ participant MS as Messaging System
+ participant DCM as DCM Control Plane
+ participant DB as Database
+
+ AG->>MS: Create topics (main + retry)
+ Note over AG: Wait for at least 1 SP<br/>(embedded or external) to register<br/>and be healthy
+
+ AG->>DCM: POST /api/v1/agents<br/>{name, environment, serviceTypes,<br/>resourcesAvailable, cost, topicName}
+ DCM->>DB: Store agent registration
+ DCM-->>AG: 201 Created {agentId}
+```
+
+**Registration payload (Agent → DCM):**
+
+```json
+{
+ "name": "agent-prod-eu-west-1",
+ "environment": "prod-eu-west-1",
+ "serviceTypes": ["vm", "container"],
+ "resourcesAvailable": {
+ "totalCpu": 200,
+ "totalMemory": "1TB",
+ "totalStorage": "2TB"
+ },
+ "cost": "medium",
+ "topicName": "dcm.agents.agent-prod-eu-west-1"
+}
+```
+
+### 5.3 Health Monitoring
+
+#### 5.3.1 SP Health (Agent → SP)
+
+The Agent monitors each registered SP's health using a three-state model. The
+monitoring mechanism differs by SP type: embedded SPs are checked in-process (no
+network call), while external SPs are checked by polling their `/health`
+endpoint at a configurable interval.
+
+##### Health State Diagram
```mermaid
stateDiagram-v2
- [*] --> Ready: Registered
+ [*] --> Ready: Registered with Agent
Ready --> FailureCount: Failed
FailureCount --> Ready: OK (reset)
- FailureCount --> NotReady: Threshold reached
- NotReady --> Ready: OK (recover)
+ FailureCount --> Unavailable: Threshold reached
+
+ Ready --> Unhealthy: status: unhealthy
+ Unhealthy --> Ready: status: healthy
+ Unhealthy --> Unavailable: Timeout/error threshold
+
+ Unavailable --> Ready: OK (recover)
```
-#### Health Check Sequence Diagram
+Three health states:
+
+- **Ready**: SP is healthy and eligible for routing.
+- **Unhealthy**: SP is reachable but reports that its backing provider is down.
+ The Agent keeps the service type in its advertised list but stops routing
+ requests to this SP; incoming requests are held in the retry topic until the SP
+ recovers or becomes Unavailable.
+- **Unavailable**: SP is unreachable after exceeding the failure threshold. The
+ Agent removes the service type from its advertised list and updates DCM.
+
+##### Health Check Sequence
```mermaid
sequenceDiagram
- participant DCM as DCM Health Checker
+ participant AG as Agent
participant SP as Service Provider
- loop Every 10 seconds
- DCM->>SP: GET /health
- alt HTTP 200 OK
- SP-->>DCM: 200 {status: "pass"}
- DCM->>DCM: Reset failure counter, mark Ready
- else Timeout or non-200
- SP-->>DCM: Error / Timeout
- DCM->>DCM: Increment failure counter
+ loop Every {healthCheckInterval} seconds
+ AG->>SP: GET /health
+ alt 200 OK, status: healthy
+ SP-->>AG: {status: "healthy"}
+ AG->>AG: Reset failure counter, mark Ready
+ else 200 OK, status: unhealthy
+ SP-->>AG: {status: "unhealthy"}
+ AG->>AG: Mark Unhealthy<br/>Stop routing, hold requests
+ else Timeout or error
+ SP-->>AG: Error / Timeout
+ AG->>AG: Increment failure counter
alt Failures >= threshold
- DCM->>DCM: Mark NotReady
+ AG->>AG: Mark Unavailable<br/>Remove service type, update DCM
end
end
end
```
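The health check loop above can be sketched as a small state tracker driven by one observation per probe. The `SPHealth` class, the default threshold of 3, and the choice to leave the failure counter untouched on an `unhealthy` response are assumptions made for illustration; the diagram does not specify them.

```python
from dataclasses import dataclass


@dataclass
class SPHealth:  # hypothetical per-SP health tracker kept by the Agent
    failure_threshold: int = 3  # assumed default, configurable in practice
    state: str = "Ready"
    failures: int = 0

    def observe(self, result: str) -> str:
        """Apply one probe result: 'healthy', 'unhealthy', or 'error' (timeout or error)."""
        if result == "healthy":
            # Reset the failure counter and mark Ready.
            self.failures = 0
            self.state = "Ready"
        elif result == "unhealthy":
            # Reachable, but the backing provider is down: stop routing, hold requests.
            self.state = "Unhealthy"
        elif result == "error":
            self.failures += 1
            if self.failures >= self.failure_threshold:
                # Threshold reached: the service type is removed and DCM is updated.
                self.state = "Unavailable"
        else:
            raise ValueError(f"unknown probe result: {result}")
        return self.state
```

A single error below the threshold leaves the state unchanged, which corresponds to the transient `FailureCount` node in the state diagram.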
-### 5.3 Service Provider Status Reporting
+#### 5.3.2 Agent Health (Agent → DCM heartbeats)
+
+The Agent reports its own liveness to DCM via periodic REST heartbeats. DCM
+tracks the last heartbeat timestamp for each agent and marks the agent as
+Unavailable if no heartbeat is received within a configurable threshold.
+
+```mermaid
+sequenceDiagram
+ participant AG as Agent
+ participant DCM as DCM Control Plane
+
+ loop Every {heartbeatInterval} seconds
+ AG->>DCM: PUT /api/v1/agents/{agentId}/heartbeat<br/>{timestamp, consumerLag}
+ DCM->>DCM: Update heartbeat, check lag
+ DCM-->>AG: 200 OK
+ end
+
+ Note over DCM: No heartbeat within threshold
+ DCM->>DCM: Mark agent Unavailable
+```
+
+#### 5.3.3 Consumer Lag Monitoring
+
+The Agent self-reports its consumer lag in each heartbeat. If the lag reaches or
+exceeds `consumerLagThreshold`, DCM marks the agent as **Congested** and stops
+routing new requests to it. When the lag drops below the threshold, DCM clears
+the Congested state and the agent returns to Ready.
+
+##### Agent Health State Diagram
+
+```mermaid
+stateDiagram-v2
+ [*] --> Ready: Agent registers
+ Ready --> Congested: lag >= threshold
+ Congested --> Ready: lag < threshold
+ Ready --> Unavailable: Heartbeat timeout
+ Congested --> Unavailable: Heartbeat timeout
+ Unavailable --> Ready: Agent re-registers
+```
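The agent health states above can be sketched as a tracker that DCM updates on registration, on each heartbeat, and on a periodic timeout check. `AgentHealth` and its method names are hypothetical, and ignoring heartbeats from an Unavailable agent until it re-registers is an assumption read from the diagram's single `Unavailable --> Ready` transition.

```python
from dataclasses import dataclass


@dataclass
class AgentHealth:  # hypothetical DCM-side tracker for one agent
    heartbeat_timeout: float      # seconds without a heartbeat before Unavailable
    consumer_lag_threshold: int   # corresponds to consumerLagThreshold
    state: str = "Ready"
    last_heartbeat: float = 0.0

    def register(self, now: float) -> str:
        # (Re-)registration resets the heartbeat tracker and returns to Ready.
        self.last_heartbeat = now
        self.state = "Ready"
        return self.state

    def heartbeat(self, now: float, consumer_lag: int) -> str:
        if self.state == "Unavailable":
            return self.state  # assumed: recovery requires re-registration
        self.last_heartbeat = now
        lagging = consumer_lag >= self.consumer_lag_threshold
        self.state = "Congested" if lagging else "Ready"
        return self.state

    def check_timeout(self, now: float) -> str:
        if now - self.last_heartbeat > self.heartbeat_timeout:
            self.state = "Unavailable"
        return self.state
```

Timestamps are passed in explicitly rather than read from a clock so that the transitions are easy to exercise in isolation.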
+
+### 5.4 Service Provider Status Reporting
+
+> **Note:** Status reporting is not impacted by the Agent layer. SPs publish
+> status CloudEvents directly to the Messaging System. The Agent is not in the
+> status-reporting path.
Service Providers report instance status changes to DCM via CloudEvents
published to a messaging system (NATS). This decoupled approach supports
@@ -525,16 +702,16 @@ multiple consumers (billing, auditing, etc.) and scales independently.
sequenceDiagram
participant Platform as Underlying Platform<br/>(K8s, KubeVirt, ACM)
participant SP as Service Provider
- participant MSG as Messaging System (NATS)
+ participant MS as Messaging System (NATS)
participant DCM as DCM Core Service
participant DB as Status DB
Platform->>SP: State change event<br/>(via informer watch or polling)
SP->>SP: Map platform status → DCM status
SP->>SP: Build CloudEvent
- SP->>MSG: Publish to:<br/>dcm.providers.{provider}.{serviceType}<br/>.instances.{instanceId}.status
+ SP->>MS: Publish to:<br/>dcm.providers.{provider}.{serviceType}<br/>.instances.{instanceId}.status
- MSG->>DCM: Deliver event
+ MS->>DCM: Deliver event
DCM->>DCM: Validate CloudEvent schema
alt Valid
DCM->>DB: UPSERT instance status
@@ -555,14 +732,39 @@ sequenceDiagram
| DELETING | | |
| DELETED | | |
+### 5.5 Agent Lifecycle
+
+This section provides a brief overview of the agent lifecycle. For full details,
+see the
+[Environment Agent enhancement](/enhancements/environment-agent/environment-agent.md).
+
+**Startup:**
+
+1. Agent registers its configured embedded SPs internally (K8s Container, ACM
+ Cluster, KubeVirt — each if enabled in config)
+2. Agent creates messaging topics (main topic + retry topic)
+3. Agent waits for at least one SP (embedded or external) to be registered and
+ healthy
+4. Agent registers with DCM via `POST /api/v1/agents`
+5. Agent begins periodic heartbeats and SP health checking
+
+**Restart:**
+
+1. Agent re-registers with DCM (idempotent; DCM resets heartbeat tracker)
+2. Embedded SPs register internally at startup; external SPs naturally
+ re-register via periodic lease renewal, rebuilding agent state
+3. Unconsumed messages on both main and retry topics survive (messaging system
+ persistence)
+4. Agent resumes consuming from both topics once fully initialized
+
---
## 6. CatalogItemInstance Creation (End-to-End)
This is the primary user flow: creating an infrastructure resource from a
CatalogItem. The request flows through the Catalog Manager, Placement Manager
-(with policy evaluation), SP Resource Manager, and finally to the selected
-Service Provider.
+(with policy evaluation), SP Resource Manager (which publishes to the messaging
+system), the Environment Agent, and finally to the selected Service Provider.
### 6.1 Full Creation Flow
@@ -574,21 +776,22 @@ sequenceDiagram
participant DB as Placement DB
participant PE as Policy Manager
participant SPRM as SP Resource Manager
- participant SR as Service Registry
+ participant AR as Agent Registry
+ participant MS as Messaging System
+ participant AG as Agent
participant SP as Service Provider
- participant MSG as Messaging System
User->>CM: Request CatalogItemInstance<br/>(select CatalogItem + customize fields)
CM->>CM: Validate input, merge with defaults
CM->>PM: POST /api/v1/resources<br/>{CatalogItemInstance: UUID, spec}
- %% Intent preservation
PM->>DB: Store original request (intent)
- %% Policy evaluation
- PM->>PE: POST /api/v1alpha1/policies:evaluateRequest<br/>{service_instance: {spec}}
- PE->>PE: Fetch & sort matching policies<br/>(Global→Tenant→User, by priority)
- PE->>PE: Evaluate policy chain<br/>(validate, mutate, select SP)
+ PM->>AR: Fetch available agents<br/>(healthy, not Congested, matching serviceType)
+ AR-->>PM: available_agents list
+
+ PM->>PE: POST /api/v1alpha1/policies:evaluateRequest<br/>{service_instance: {spec}, available_agents}
+ PE->>PE: Evaluate policy chain<br/>(validate, mutate, select Agent)
alt Policy rejects
PE-->>PM: 406 Not Acceptable
@@ -597,64 +800,87 @@ sequenceDiagram
CM-->>User: Request denied
end
- PE-->>PM: 200 OK<br/>{evaluatedServiceInstance, selectedProvider, status}
- PM->>DB: Store validated request
+ PE-->>PM: 200 OK<br/>{evaluatedServiceInstance, selectedAgent, status}
+ PM->>DB: Store validated request with agentName
- %% SP Resource Manager
- PM->>SPRM: POST /api/v1/service-type-instances<br/>{providerName, spec}
+ PM->>SPRM: POST /api/v1/service-type-instances<br/>{agentName, serviceType, spec}
- SPRM->>SR: Lookup provider by name
- alt Provider not found
- SR-->>SPRM: 404
- SPRM-->>PM: 404 Not Found
+ SPRM->>AR: Lookup agent, get topicName
+ alt Agent not found or unhealthy/Congested
+ SPRM-->>PM: Error (404/503)
PM->>DB: Delete records
PM-->>CM: Error
- CM-->>User: Provider not found
+ CM-->>User: Agent unavailable
end
- SR-->>SPRM: {endpoint, metadata, healthStatus}
- alt Provider unhealthy
- SPRM-->>PM: 503 Service Unavailable
- PM->>DB: Delete records
- PM-->>CM: Error
- CM-->>User: Provider unavailable
+ SPRM->>MS: PUBLISH CloudEvent<br/>topic: {topicName}<br/>{resourceId, serviceType, spec}
+ SPRM->>DB: Create instance record
+ SPRM-->>PM: 202 Accepted {instanceId, agentName, status: PENDING}
+ PM-->>CM: 201 Created
+ CM-->>User: Instance created (PENDING)
+
+ Note over MS,AG: Async processing
+ MS->>AG: Deliver creation request
+ AG->>AG: Validate service type, select SP
+ AG->>SP: POST {spEndpoint}/api/v1/{serviceType}<br/>{spec}
+ SP-->>AG: {instanceId, status: PROVISIONING}
+ AG->>MS: PUBLISH CloudEvent<br/>topic: dcm.agents.responses<br/>{resourceId, agentName, topicName,<br/>status: PROVISIONING}
+ MS->>SPRM: Deliver response
+ SPRM->>DB: Update instance: PROVISIONING
+
+ opt Agent queues request (SP Unhealthy)
+ AG->>MS: PUBLISH CloudEvent<br/>topic: dcm.agents.responses<br/>{resourceId, status: QUEUED}
+ MS->>SPRM: Deliver QUEUED response
+ SPRM->>SPRM: Update instance: QUEUED
+ SPRM->>PM: Notify: instance QUEUED
+
+ Note over PM: Start queuedRequestTimeout
+ alt Timeout or timeout = 0
+ PM->>SPRM: DELETE instance
+ PM->>PE: Re-evaluate excluding agent
+ Note over PM: Route to alternative agent<br/>or return error if none available
+ end
end
- %% Instance creation
- SPRM->>SP: POST {endpoint}/api/v1/{serviceType}<br/>{spec}
- SP->>SP: Create resource on platform
- SP-->>SPRM: {instanceId, status: PROVISIONING}
- SPRM->>DB: Persist instance metadata
- SPRM-->>PM: 202 Accepted {instanceId, status}
- PM-->>CM: 202 Accepted
- CM-->>User: Instance created<br/>{instanceId, status: PROVISIONING}
-
- %% Continuous status reporting
- Note over SP,MSG: Async status reporting begins
- SP->>MSG: Publish status CloudEvents<br/>as instance state changes
- MSG->>PM: Deliver status updates
- PM->>DB: UPSERT status
+ Note over SP,MS: Status reporting (unchanged)
+ SP->>MS: Publish status CloudEvents
+ MS->>SPRM: Deliver status updates
```
+When the agent's SP for the requested service type is Unhealthy, the Agent holds
+the request in its retry topic and responds with a QUEUED CloudEvent
+(`dcm.agent.request-queued`), and DCM records the QUEUED status. If the SP
+recovers, the Agent processes the held request. If the SP becomes Unavailable,
+the Agent rejects the held request with an error CloudEvent. The Placement
+Manager handles the QUEUED status via a `queuedRequestTimeout` timer (see
+[6.2 Placement Manager Flow](#62-placement-manager-flow)).
+
### 6.2 Placement Manager Flow
The Placement Manager is the central orchestrator. It preserves the user's
-original intent, delegates policy evaluation, and coordinates with the SP
-Resource Manager.
+original intent, fetches available agents, delegates policy evaluation, and
+coordinates with the SP Resource Manager.
```mermaid
flowchart TD
A[Receive request from Catalog Manager] --> B[Store original request in Placement DB]
- B --> C[Send to Policy Manager for evaluation]
- C --> D{Policy approved?}
- D -->|No| E[Delete intent record]
- E --> F[Return error to Catalog Manager]
- D -->|Yes| G[Store validated request in Placement DB]
- G --> H[Forward to SP Resource Manager<br/>with providerName and validated spec]
- H --> I{SP Resource Manager<br/>succeeded?}
- I -->|No| J[Delete records from Placement DB]
- J --> F
- I -->|Yes| K[Return 202 Accepted<br/>to Catalog Manager]
+ B --> C[Fetch available agents from Agent Registry]
+ C --> D[Send to Policy Manager for evaluation<br/>with available_agents]
+ D --> E{Policy approved?}
+ E -->|No| F[Delete intent record]
+ F --> G[Return error to Catalog Manager]
+ E -->|Yes| H[Store validated request with agentName]
+ H --> I[Forward to SP Resource Manager<br/>with agentName, serviceType, spec]
+ I --> J{SPRM response?}
+ J -->|Error| K[Delete records from Placement DB]
+ K --> G
+ J -->|202 Accepted| L[Return 201 Created<br/>to Catalog Manager]
+ J -->|QUEUED| M[Start queuedRequestTimeout timer]
+ M --> N{Timeout?}
+ N -->|Yes| O[Send DELETE to SPRM<br/>Re-evaluate excluding agent]
+ O --> P{Alternative agent?}
+ P -->|Yes| I
+ P -->|No| K
```
**Request payload (Catalog Manager → Placement Manager):**
@@ -679,30 +905,27 @@ flowchart TD
{
"CatalogItemInstanceId": "f3645f8f-82c1-4efb-888f-318c0ac81a08",
"resource_name": "fedora-vm",
- "providerName": "kubevirt-sp",
+ "agentName": "agent-prod-eu-west-1",
"id": "08aa81d1-a0d2-4d5f-a4df-b80addf07781"
}
```
### 6.3 SP Resource Manager Flow
-The SP Resource Manager handles Service Provider lookup, health validation, and
-instance creation delegation.
+The SP Resource Manager handles Agent lookup and publishes creation requests as
+CloudEvents to the agent's messaging topic. It no longer calls SP REST endpoints
+directly.
```mermaid
flowchart TD
- A[Receive request from Placement Manager<br/>providerName + spec] --> B[Query Service Registry<br/>by providerName]
- B --> C{Provider found?}
+ A[Receive request from Placement Manager<br/>agentName + serviceType + spec] --> B[Query Agent Registry<br/>by agentName]
+ B --> C{Agent found?}
C -->|No| D[Return 404 Not Found]
- C -->|Yes| E{Provider healthy?}
+ C -->|Yes| E{Agent healthy<br/>and not Congested?}
E -->|No| F[Return 503 Service Unavailable]
- E -->|Yes| G[Forward spec to Service Provider<br/>POST endpoint/api/v1/serviceType]
- G --> H{SP creation succeeded?}
- H -->|No| I[Forward error to Placement Manager]
- H -->|Yes| J[Persist instance in database<br/>instanceId, providerName, metadata]
- J --> K{DB persist succeeded?}
- K -->|No| L[Return 500 Internal Server Error]
- K -->|Yes| M[Return 202 Accepted<br/>instanceId, status]
+ E -->|Yes| G[Publish CloudEvent to agent topic<br/>via Messaging System]
+ G --> H[Create instance record in DB]
+ H --> I[Return 202 Accepted<br/>instanceId, agentName, status: PENDING]
```
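The flowchart above can be sketched as a single handler that looks up the agent, checks its health, and publishes to its topic. This is a rough illustration only: `handle_create`, the dictionary-based registry, and the injected `publish` callable are hypothetical, the sketch reuses the instance ID as the event's `resourceId`, and it omits the database write.

```python
from typing import Callable, Optional


def handle_create(
    agents: dict,
    publish: Callable[[str, dict], None],
    agent_name: str,
    service_type: str,
    spec: dict,
    instance_id: str,
) -> tuple:
    """Return (http_status, body) for a creation request from the Placement Manager."""
    agent: Optional[dict] = agents.get(agent_name)
    if agent is None:
        return 404, {"error": f"agent {agent_name} is not registered"}
    if agent["healthStatus"] != "Ready":
        # Covers both Unavailable (missed heartbeats) and Congested (consumer lag).
        return 503, {"error": f"agent {agent_name} is {agent['healthStatus']}"}
    # Publish the creation request to the agent's topic instead of calling the SP.
    event = {"resourceId": instance_id, "serviceType": service_type, "spec": spec}
    publish(agent["topicName"], event)
    return 202, {"instanceId": instance_id, "agentName": agent_name, "status": "PENDING"}
```

Injecting `publish` keeps the sketch independent of any particular messaging client.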
### 6.4 Service Provider Instance Creation
@@ -710,6 +933,11 @@ flowchart TD
Each Service Provider translates the provider-agnostic ServiceType spec into
platform-native resources.
+> **Note:** Each service type is served by exactly one SP (embedded or external)
+> per agent — there is no SP selection strategy. The Agent forwards the request
+> to the SP via an in-process call (for embedded SPs) or via REST (for external
+> SPs). The SP's internal behavior is unchanged.
+
```mermaid
flowchart LR
subgraph KubeVirtSP[KubeVirt SP]
@@ -730,6 +958,10 @@ flowchart LR
### 6.5 Continuous Status Reporting
+> **Note:** Status reporting is not impacted by the Agent layer. SPs publish
+> status CloudEvents directly to the Messaging System, bypassing the Agent. This
+> path is unchanged from the pre-agent architecture.
+
After instance creation, Service Providers continuously monitor the underlying
platform and report status changes via CloudEvents.
@@ -831,3 +1063,54 @@ graph LR
HC4 --> DC4
HC5 --> DC5
```
+
+### 6.6 Deletion Flow
+
+The deletion flow follows the same architecture as creation: the request is
+published as a CloudEvent to the agent's messaging topic, and the Agent routes
+it to the appropriate SP.
+
+```mermaid
+sequenceDiagram
+ actor User
+ participant CM as Catalog Manager
+ participant PM as Placement Manager
+ participant DB as Placement DB
+ participant PE as Policy Manager
+ participant SPRM as SP Resource Manager
+ participant MS as Messaging System
+ participant AG as Agent
+ participant SP as Service Provider
+
+ User->>CM: Delete CatalogItemInstance
+ CM->>PM: DELETE /api/v1/resources/{resourceId}
+ PM->>DB: Lookup resource (agentName, serviceType, instanceId)
+
+ PM->>SPRM: DELETE /api/v1/service-type-instances/{instanceId}
+ SPRM->>MS: PUBLISH CloudEvent<br/>topic: {topicName}<br/>type: dcm.request.delete<br/>{resourceId, serviceType}
+ SPRM-->>PM: 202 Accepted
+
+ MS->>AG: Deliver deletion request
+ AG->>SP: DELETE {spEndpoint}/api/v1/{serviceType}/{resourceId}
+ SP-->>AG: {status: DELETING}
+ AG->>MS: PUBLISH CloudEvent<br/>topic: dcm.agents.responses<br/>{resourceId, agentName, topicName,<br/>status: DELETING}
+
+ opt Agent queues request (SP Unhealthy)
+ AG->>MS: PUBLISH CloudEvent<br/>topic: dcm.agents.responses<br/>{resourceId, status: QUEUED}
+ MS->>SPRM: Deliver QUEUED response
+ SPRM->>SPRM: Update instance: QUEUED
+ SPRM->>PM: Notify: instance QUEUED
+
+ Note over PM: Start queuedRequestTimeout
+ alt Timeout or timeout = 0
+ PM->>SPRM: DELETE instance
+ PM->>PE: Re-evaluate excluding agent
+ Note over PM: Route to alternative agent<br/>or return error if none available
+ end
+ end
+
+ Note over SP: SP manages deletion<br/>and reports final status
+ SP->>MS: CloudEvent {status: DELETED}
+ MS->>SPRM: Status update
+ SPRM->>DB: Update status: DELETED
+```