From 0232e2de51bcbb65583b1c7563503d5e57109b0d Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Wed, 3 Jun 2026 21:15:45 +0200 Subject: [PATCH 01/20] docs(environment-agent): add environment agent enhancement Define the environment agent layer that sits between DCM and Service Providers. The agent runs per-cluster, registers to DCM with environment metadata, and routes creation requests via a messaging system. SPs register to the agent (not DCM directly), each serving a single resource type. Includes agent registration, resource creation, SP registration, agent heartbeat, and SP health monitoring flows. Assisted by: Claude Code - claude-opus-4-6 Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 406 ++++++++++++++++++ 1 file changed, 406 insertions(+) create mode 100644 enhancements/environment-agent/environment-agent.md diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md new file mode 100644 index 0000000..d8b7cd3 --- /dev/null +++ b/enhancements/environment-agent/environment-agent.md @@ -0,0 +1,406 @@ +--- +title: Environment Agent +authors: + - "@gabriel-farache" +reviewers: + - "@gciavarrini" + - "@ygalblum" + - "@machacekondra" + - "@jenniferubah" +approvers: + - "" +creation-date: 2026-06-03 +--- + +# Environment Agent + +## Summary +This enhancement aims at adding the notion of environment by adding a layer between the SP and DCM: an agent would run on each environments usable by DCM and the agent would regiester the environment to DCM. +The agent would then use the SPs as plugins for the supported resource types and pass the creation request to the relevant one. This would mean that SPs now serve only 1 specific resource type. 
+This enhancement also propose to change the way the creation request is submitted to the agent (or currently, to the SP): instead of sending a direct request to the agent, DCM wil send the request to a bus that will in turn be consumed by the relevant agent to create the requested resource. + +Additionally, this enhancement defines: +- How Service Providers register to the agent (rather than to DCM directly), allowing the agent to dynamically build and maintain its list of supported resource types. +- How the agent reports its own health to DCM via periodic heartbeats. +- How the agent monitors the health of its registered Service Providers using the three-state health model (Ready, Unhealthy, Unavailable) and updates DCM when the supported resource types change as a result. + +## Motivation +When deploying resources in general, one of the main criterion taken into account is the type of environment in which the resource will be deployed: DEV, INT, VAL, PROD, ... + +Currently, in DCM, a resource's creation request is routed to a given Service Provider (SP) by a policy on the base of several criteria. Once the SP is selected, DCM will send a request to the selected SP to request the creation of the resource. + +There is currently no way for a policy to determine in which environment a SP is running and hence an user cannot explictly set the targeted environment constraint when requesting the creation of a resource. + +Furthermore, with the current way on submitting creation's request, deploying an agent on a cluster would also mean the administrator has to make sure the ports are open for DCM to reach the agent. Changing how the creation's requests are consumed by giving the initiative to the agent would solve this problem and would fit the way K8s/OCP are consuming creation requests: when a manifest is submitted, the manifest is pulled by the application actually creating the resource on the cluster. 
+ + +### Goals + +- Define how the agent registers to DCM +- Define what information the agent gives to DCM while registering +- Define how agents and DCM are communicating +- Define how agents and Service Providers interact with each other +- Define how Service Providers register to the agent +- Define how the agent monitors Service Provider health +- Define how the agent reports its own health to DCM + +### Non-Goals + +- Defining how to use the information registered by the agent to DCM +- Define how agent will provision application (vs simple resource type) + +## Proposal + +### Overview + +For each clusters that can be used by DCM, an agent must be spawn. +The agent will then register to DCM. When doing so, it will provide, amongst other information, the environment on which it's running and the resource types it can serve. + +When starting, the agent will also create a specific topic in a bus (Kafka, NATS, ...) in order for DCM to communicate with the agent. The name of the topic be unique and shared with DCM upon registration. + +Service Providers register directly to the agent (not to DCM). Each SP serves a single resource type and registers itself with the agent via a REST API call. The agent dynamically builds its list of supported resource types based on the SPs that are registered to it. When the list changes (SP registration or health-driven removal), the agent updates DCM accordingly. + +An agent must have at least 1 Service Provider (SP) registered to it. For each resource type advertised as supported to DCM by the agent, there must be at least 1 healthy SP registered supporting the given resource type. + +DCM will send the creation request to the specific topic that was created by the agent. + +The agent will then consume the message, validate it and then pass it to the relevant SP. + +The agent monitors the health of its registered SPs by polling their `/health` endpoint, using the three-state model (Ready, Unhealthy, Unavailable). 
When the last SP serving a given resource type becomes unhealthy or unavailable, the agent removes that resource type from its advertised list and updates DCM. The agent also exposes the health status of each registered SP in its own status, allowing administrators to quickly identify which SPs are causing issues. + +The agent reports its own liveness to DCM via periodic REST heartbeats. DCM tracks the last heartbeat timestamp and marks the agent as unavailable if no heartbeat is received within a configurable threshold. + +The status monitoring will not be impacted: the SP will be the one managing the resource and the current flow will remain the same; the agent is only an intermediary. + +### Architecture +```mermaid +%%{init: {'flowchart': {'rankSpacing': 80, 'nodeSpacing': 10, 'curve': 'linear'}}}%% +flowchart TD + classDef dcm fill:#2d2d2d,color:#ffffff,stroke:#81c784,stroke-width:2px + classDef messaging fill:#2d2d2d,color:#ffffff,stroke:#ffb74d,stroke-width:2px + classDef agent fill:#2d2d2d,color:#ffffff,stroke:#f48fb1,stroke-width:2px + classDef provider fill:#2d2d2d,color:#ffffff,stroke:#90caf9,stroke-width:2px + classDef clusterEnvironment fill:#FFFFFF,stroke:#bdbdbd,stroke-width:2px + + DCM["**DCM**
Control Plane"]:::dcm + MS["**Messaging System**
Subject-based routing"]:::messaging + + subgraph Cluster_Environment["Cluster / Environment"] + direction LR + SPX["**SP**
Resource Type X"]:::provider + AG["**Agent**
Routes creation requests to SP"]:::agent + SPY["**SP**
Resource Type Y"]:::provider + SPX -. Registration .-> AG + SPY -. Registration .-> AG + AG -->|Creation Request| SPX + AG -->|Creation Request| SPY + AG -.->|Health Check| SPX + AG -.->|Health Check| SPY + end + + DCM -->|Creation Request| MS + MS -->|Creation Request| AG + AG -. Registration .-> DCM + AG -. Heartbeat .-> DCM + AG -->|Health Warning| MS + MS -->|Health Warning| DCM + SPX -->|Status| MS + SPY -->|Status| MS + MS -->|Status| DCM + + class Cluster_Environment clusterEnvironment +``` + +#### Flow Description +* The agent is spawn in a cluster serving as a specific environment +* Within the same cluster several Service Providers (SP) are running and serving each a specific resource type +* Each SP registers itself to the agent; the agent dynamically builds its supported resource types list +* The agent creates a specific topic in the bus system +* The agent registers to DCM on startup and sends periodic heartbeats +* DCM sends creation request to the specific topic +* The agent consumes the messages sent to the topic +* The agent routes the creation request to the relevant SP +* The agent periodically health-checks each registered SP; when the last SP for a resource type becomes unhealthy, the agent updates DCM and publishes a health warning through the messaging system +* The status monitoring remains unchanged: each SP manages its resource lifecycle and reports status through the messaging system + +### Agent Registration Flow + +```mermaid +sequenceDiagram + autonumber + participant AG as Agent + participant MS as Messaging System + participant DCM as DCM
(Control Plane) + participant DB as Database + + Note over AG: Agent starts in
cluster / environment + + AG->>MS: Create unique topic + MS-->>AG: Topic created
{topicName} + + AG->>DCM: POST /api/v1/agents
{name, environment, resourceTypes,
resourcesAvailable, cost, topicName} + activate DCM + + DCM->>DB: Store agent registration
{name, environment, resourceTypes,
resourcesAvailable, cost, topicName} + activate DB + DB-->>DCM: Registration stored + deactivate DB + + DCM-->>AG: 201 Created
{agentId} + deactivate DCM +``` + +#### Flow Description +1. The agent starts within a cluster serving a specific environment +2. The agent creates a unique topic in the messaging system to establish a dedicated communication channel +3. The agent registers itself with DCM via a REST API call, providing: + * Name + * Environment + * Supported resource types + * Available resources + * Cost tier + * Topic name +4. DCM persists the registration in the database +5. DCM acknowledges the registration + +### Resource Creation Flow + +```mermaid +sequenceDiagram + autonumber + participant DCM as DCM
(Control Plane) + participant MS as Messaging System + participant AG as Agent + participant SP as Service Provider + + DCM->>MS: PUBLISH creation request
topic: {agentTopicName}
{resourceType, spec} + + MS->>AG: PUSH message + activate AG + + AG->>AG: Validate requested resource type
is supported by an attached SP + + alt Resource type not supported + AG->>MS: PUBLISH CloudEvent
{error: "unsupported resource type"} + MS->>DCM: PUSH error message + else Resource type supported + AG->>SP: POST {spEndpoint}/api/v1/{resourceType}
{spec} + activate SP + + alt SP creation fails + SP-->>AG: Error response + deactivate SP + AG->>MS: PUBLISH CloudEvent
{error: "creation failed", details} + MS->>DCM: PUSH error message + else SP creation succeeds + SP-->>AG: Success response
{instanceId, status: PROVISIONING} + Note over SP: SP manages resource lifecycle
and reports status through
the existing status reporting flow + end + end + deactivate AG +``` + +#### Flow Description +1. DCM publishes the creation request to the agent's dedicated topic in the messaging system +2. The agent consumes the message +3. The agent validates that the requested resource type is supported by one of its attached Service Providers +4. If the resource type is **not supported**: + * The agent publishes an error CloudEvent back to the messaging system + * DCM consumes the error message +5. If the resource type is **supported**: + * The agent forwards the creation request to the relevant SP via REST API + * If the SP returns an **immediate error**: the agent publishes an error CloudEvent back to the messaging system for DCM to consume + * If the SP **accepts** the request: the SP takes over resource lifecycle management and reports status changes through the existing status reporting flow (SP → Messaging System → DCM) + +### SP Registration to Agent + +Service Providers register to the agent rather than to DCM directly. The agent exposes a REST API for SP registration and dynamically maintains its list of supported resource types based on registered SPs. + +SPs periodically re-register with the agent to maintain their registration. This periodic re-registration serves as a lease renewal and ensures that after an agent restart (where the agent loses its in-memory state), SPs naturally re-register without requiring any additional coordination mechanism. + +When the list of supported resource types changes as a result of an SP registration, the agent updates DCM via a `PUT` request with the full updated registration payload. + +```mermaid +sequenceDiagram + autonumber + participant SP as Service Provider + participant AG as Agent + participant DCM as DCM
(Control Plane) + participant DB as Database + + Note over SP: SP starts in
cluster / environment + + SP->>AG: POST /api/v1/providers
{name, resourceType, endpoint} + activate AG + + AG->>AG: Store SP registration
Recompute supported resource types + + alt Resource type list changed + AG->>DCM: PUT /api/v1/agents/{agentId}
{name, environment, resourceTypes,
resourcesAvailable, cost, topicName} + activate DCM + DCM->>DB: Update agent registration + activate DB + DB-->>DCM: Registration updated + deactivate DB + DCM-->>AG: 200 OK + deactivate DCM + end + + AG-->>SP: 201 Created
{providerId} + deactivate AG + + Note over SP,AG: SP periodically re-registers
to maintain its lease +``` + +#### Flow Description +1. The SP starts within the same cluster / environment as the agent +2. The SP registers itself with the agent via a REST API call, providing: + * Name + * Resource type it serves + * Endpoint (URL where the agent can reach the SP) +3. The agent stores the SP registration and recomputes the list of supported resource types +4. If the resource type list changed (new resource type added): + * The agent sends a `PUT` request to DCM with the full updated agent registration + * DCM updates the agent record in the database +5. The agent acknowledges the SP registration +6. The SP periodically re-registers with the agent; the agent handles this idempotently (create or update). This ensures that after an agent restart, SPs naturally rebuild the agent's state without additional coordination + +### Health + +#### Agent Health + +The agent reports its own liveness to DCM via periodic REST heartbeats. Since the messaging system is used for resource operations (creation requests, status updates), the heartbeat uses the existing REST channel that the agent already uses for registration. + +DCM tracks the last heartbeat timestamp for each agent. If no heartbeat is received within a configurable threshold, DCM marks the agent as unavailable. + +On startup, the agent registers to DCM (as described in [Agent Registration Flow](#agent-registration-flow)). If the agent restarts, it re-registers to DCM; DCM handles this idempotently, resetting the heartbeat tracker. + +```mermaid +sequenceDiagram + autonumber + participant AG as Agent + participant DCM as DCM
(Control Plane) + participant DB as Database + + loop Every {heartbeatInterval} seconds + AG->>DCM: PUT /api/v1/agents/{agentId}/heartbeat
{timestamp} + activate DCM + DCM->>DB: Update last heartbeat timestamp + DB-->>DCM: Updated + DCM-->>AG: 200 OK + deactivate DCM + end + + Note over DCM: No heartbeat received
within {threshold} seconds + + DCM->>DB: Mark agent as Unavailable + activate DB + DB-->>DCM: Updated + deactivate DB +``` + +##### Flow Description +1. The agent periodically sends a heartbeat to DCM via a REST `PUT` call +2. DCM updates the agent's last heartbeat timestamp in the database +3. If DCM does not receive a heartbeat within the configured threshold, it marks the agent as **Unavailable** +4. When the agent restarts, its initial registration to DCM resets the heartbeat tracker and the agent status + +#### SP Health Monitoring + +The agent monitors the health of its registered Service Providers by polling their `/health` endpoint, using the three-state health model defined in the [Service Provider Health Check enhancement](../service-provider-health-check/service-provider-health-check.md): + +| State | Condition | +|-------|-----------| +| **Ready** | SP responds with `200 OK` and `status: "healthy"` | +| **Unhealthy** | SP responds with `200 OK` and `status: "unhealthy"` (SP reachable but backing provider unavailable) | +| **Unavailable** | SP does not respond or returns an error, after exceeding the failure threshold | + +With the agent layer, the responsibility for polling SP health shifts from DCM to the agent. The agent and its SPs are co-located in the same cluster, making the agent the natural point to perform health checks. + +When the last SP serving a given resource type transitions to **Unhealthy** or **Unavailable**, the agent: +1. Removes that resource type from its advertised list +2. Sends a `PUT` request to DCM with the updated agent registration (resource types list without the affected type) +3. 
Publishes a health warning CloudEvent to a dedicated health topic in the messaging system, providing DCM with context about the degradation (which SP, which resource type, the reason) + +When a previously unhealthy or unavailable SP recovers (returns `200 OK` with `status: "healthy"`), the agent re-adds the resource type to its list and updates DCM accordingly. + +##### Agent Status + +The agent exposes the health status of each registered SP in its own status. This allows cluster administrators to inspect the agent's status and immediately see which SPs are healthy, unhealthy, or unavailable without having to query each SP individually. + +```json +{ + "agentId": "agent-123", + "name": "cluster-prod-eu-west", + "environment": "PROD", + "status": "Ready", + "providers": [ + { + "providerId": "sp-vm-001", + "name": "vm-provider", + "resourceType": "vm", + "health": "Ready", + "endpoint": "http://vm-provider:8080" + }, + { + "providerId": "sp-db-001", + "name": "db-provider", + "resourceType": "database", + "health": "Unhealthy", + "endpoint": "http://db-provider:8080" + } + ] +} +``` + +```mermaid +sequenceDiagram + autonumber + participant AG as Agent + participant SP as Service Provider + participant MS as Messaging System + participant DCM as DCM
(Control Plane) + participant DB as Database + + loop Every {healthCheckInterval} seconds + AG->>SP: GET /health + alt Healthy + SP-->>AG: 200 OK
{status: "healthy"} + AG->>AG: Reset failure counter
Mark SP as Ready + else Unhealthy + SP-->>AG: 200 OK
{status: "unhealthy"} + AG->>AG: Mark SP as Unhealthy + else No response / error + SP-->>AG: Timeout / Error + AG->>AG: Increment failure counter + Note over AG: If counter >= threshold:
Mark SP as Unavailable + end + end + + Note over AG: Last SP for resource type X
becomes Unhealthy or Unavailable + + AG->>DCM: PUT /api/v1/agents/{agentId}
{updated resourceTypes without X} + activate DCM + DCM->>DB: Update agent registration + DB-->>DCM: Updated + DCM-->>AG: 200 OK + deactivate DCM + + AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.health
{type: "resource-type-unavailable",
agentId, resourceType, reason,
affectedProvider} + MS->>DCM: PUSH health warning +``` + +##### Flow Description +1. The agent periodically polls each registered SP's `GET /health` endpoint +2. Based on the response, the agent updates the SP's health state: + * `200 OK` with `status: "healthy"` → **Ready** (failure counter reset) + * `200 OK` with `status: "unhealthy"` → **Unhealthy** + * Timeout or error → increment failure counter; if counter exceeds threshold → **Unavailable** +3. When the last SP serving a given resource type becomes **Unhealthy** or **Unavailable**: + * The agent removes the resource type from its advertised list + * The agent sends a `PUT` to DCM with the updated registration + * The agent publishes a health warning CloudEvent to the `dcm.agents.health` topic with details about the affected SP and resource type +4. When a previously unhealthy/unavailable SP recovers: + * The agent re-adds the resource type to its list (if it was removed) + * The agent sends a `PUT` to DCM with the updated registration +5. 
The agent exposes the health status of all registered SPs in its own status, allowing administrators to inspect the agent and see per-SP health at a glance \ No newline at end of file From 435b36611db3d71ee8f6d5315346ae7e4e5f6d8f Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Thu, 4 Jun 2026 14:04:16 +0200 Subject: [PATCH 02/20] Rework a little bit the doc Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 173 +++++++++--------- 1 file changed, 86 insertions(+), 87 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index d8b7cd3..b658170 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -48,25 +48,27 @@ Furthermore, with the current way on submitting creation's request, deploying an - Defining how to use the information registered by the agent to DCM - Define how agent will provision application (vs simple resource type) +- Update other enhancement files to reflect the changes introduced by the present document; this will be done in subsequent PRs. ## Proposal ### Overview For each clusters that can be used by DCM, an agent must be spawn. -The agent will then register to DCM. When doing so, it will provide, amongst other information, the environment on which it's running and the resource types it can serve. +The agent will self register to DCM. When doing so, it will provide, amongst other information, the environment on which it's running and the resource types it can serve. -When starting, the agent will also create a specific topic in a bus (Kafka, NATS, ...) in order for DCM to communicate with the agent. The name of the topic be unique and shared with DCM upon registration. +When starting, the agent will also create a specific topic in a bus (Kafka, NATS, ...) in order for DCM to communicate with the agent. 
The name of the topic must be unique and shared with DCM upon registration. Service Providers register directly to the agent (not to DCM). Each SP serves a single resource type and registers itself with the agent via a REST API call. The agent dynamically builds its list of supported resource types based on the SPs that are registered to it. When the list changes (SP registration or health-driven removal), the agent updates DCM accordingly. -An agent must have at least 1 Service Provider (SP) registered to it. For each resource type advertised as supported to DCM by the agent, there must be at least 1 healthy SP registered supporting the given resource type. +An agent must have at least 1 Service Provider (SP) registered to it before self registering to DCM. +For each resource type advertised as supported to DCM by the agent, there must be at least 1 healthy SP registered supporting the given resource type. DCM will send the creation request to the specific topic that was created by the agent. The agent will then consume the message, validate it and then pass it to the relevant SP. -The agent monitors the health of its registered SPs by polling their `/health` endpoint, using the three-state model (Ready, Unhealthy, Unavailable). When the last SP serving a given resource type becomes unhealthy or unavailable, the agent removes that resource type from its advertised list and updates DCM. The agent also exposes the health status of each registered SP in its own status, allowing administrators to quickly identify which SPs are causing issues. +The agent monitors the health of its registered SPs by polling their `/health` endpoint, using the three-state model (Ready, Unhealthy, Unavailable). When the last SP serving a given resource type becomes unhealthy or unavailable, the agent removes that resource type from its advertised list and updates DCM. 
The agent also exposes the health status of each registered SP as custom pod conditions on its own pod, allowing administrators to quickly identify which SPs are causing issues via `oc describe pod`. The agent reports its own liveness to DCM via periodic REST heartbeats. DCM tracks the last heartbeat timestamp and marks the agent as unavailable if no heartbeat is received within a configurable threshold. @@ -116,13 +118,68 @@ flowchart TD * Within the same cluster several Service Providers (SP) are running and serving each a specific resource type * Each SP registers itself to the agent; the agent dynamically builds its supported resource types list * The agent creates a specific topic in the bus system -* The agent registers to DCM on startup and sends periodic heartbeats +* Once at least one SP is registered and healthy, the agent self-registers to DCM and begins sending periodic heartbeats * DCM sends creation request to the specific topic * The agent consumes the messages sent to the topic * The agent routes the creation request to the relevant SP * The agent periodically health-checks each registered SP; when the last SP for a resource type becomes unhealthy, the agent updates DCM and publishes a health warning through the messaging system * The status monitoring remains unchanged: each SP manages its resource lifecycle and reports status through the messaging system +### SP Registration to Agent + +Service Providers register to the agent rather than to DCM directly. The agent exposes a REST API for SP registration and dynamically maintains its list of supported resource types based on registered SPs. + +SPs periodically re-register with the agent to maintain their registration. This periodic re-registration serves as a lease renewal and ensures that after an agent restart (where the agent loses its in-memory state), SPs naturally re-register without requiring any additional coordination mechanism. 
+ +When the list of supported resource types changes as a result of an SP registration and the agent is already registered to DCM, the agent updates DCM via a `PUT` request with the full updated registration payload. If the agent has not yet registered to DCM (i.e., this is the first SP registering), the agent does not send a `PUT`; instead, the SP registration satisfies the prerequisite for the agent to proceed with its initial registration to DCM (see [Agent Registration Flow](#agent-registration-flow)). + +```mermaid +sequenceDiagram + autonumber + participant SP as Service Provider + participant AG as Agent + participant DCM as DCM
(Control Plane) + participant DB as Database + + Note over SP: SP starts in
cluster / environment + + SP->>AG: POST /api/v1/providers
{name, resourceType, endpoint} + activate AG + + AG->>AG: Store SP registration
Recompute supported resource types + + alt Resource type list changed AND agent already registered to DCM + AG->>DCM: PUT /api/v1/agents/{agentId}
{name, environment, resourceTypes,
resourcesAvailable, cost, topicName} + activate DCM + DCM->>DB: Update agent registration + activate DB + DB-->>DCM: Registration updated + deactivate DB + DCM-->>AG: 200 OK + deactivate DCM + else Resource type list changed AND agent not yet registered to DCM + Note over AG: Prerequisite for initial
agent registration is now met
(see Agent Registration Flow) + end + + AG-->>SP: 201 Created
{providerId} + deactivate AG + + Note over SP,AG: SP periodically re-registers
to maintain its lease +``` + +#### Flow Description +1. The SP starts within the same cluster / environment as the agent +2. The SP registers itself with the agent via a REST API call, providing: + * Name + * Resource type it serves + * Endpoint (URL where the agent can reach the SP) +3. The agent stores the SP registration and recomputes the list of supported resource types +4. If the resource type list changed (new resource type added): + * If the agent is already registered to DCM: the agent sends a `PUT` request to DCM with the full updated agent registration; DCM updates the agent record in the database + * If the agent is not yet registered to DCM: the agent does not send a `PUT`; instead, this SP registration satisfies the prerequisite for the agent's initial registration (see [Agent Registration Flow](#agent-registration-flow)) +5. The agent acknowledges the SP registration +6. The SP periodically re-registers with the agent; the agent handles this idempotently (create or update). This ensures that after an agent restart, SPs naturally rebuild the agent's state without additional coordination + ### Agent Registration Flow ```mermaid @@ -138,6 +195,8 @@ sequenceDiagram AG->>MS: Create unique topic MS-->>AG: Topic created
{topicName} + Note over AG: Prerequisite:
At least 1 SP must be
registered and healthy
(see SP Registration to Agent) + AG->>DCM: POST /api/v1/agents
{name, environment, resourceTypes,
resourcesAvailable, cost, topicName} activate DCM @@ -153,15 +212,18 @@ sequenceDiagram #### Flow Description 1. The agent starts within a cluster serving a specific environment 2. The agent creates a unique topic in the messaging system to establish a dedicated communication channel -3. The agent registers itself with DCM via a REST API call, providing: +3. The agent checks whether at least one SP is registered and healthy: + * If at least 1 SP is registered and healthy: the agent proceeds to register to DCM + * Else: the agent waits until at least 1 SP is registered and healthy +4. The agent registers itself with DCM via a REST API call, providing: * Name * Environment * Supported resource types * Available resources * Cost tier * Topic name -4. DCM persists the registration in the database -5. DCM acknowledges the registration +5. DCM persists the registration in the database +6. DCM acknowledges the registration ### Resource Creation Flow @@ -212,59 +274,6 @@ sequenceDiagram * If the SP returns an **immediate error**: the agent publishes an error CloudEvent back to the messaging system for DCM to consume * If the SP **accepts** the request: the SP takes over resource lifecycle management and reports status changes through the existing status reporting flow (SP → Messaging System → DCM) -### SP Registration to Agent - -Service Providers register to the agent rather than to DCM directly. The agent exposes a REST API for SP registration and dynamically maintains its list of supported resource types based on registered SPs. - -SPs periodically re-register with the agent to maintain their registration. This periodic re-registration serves as a lease renewal and ensures that after an agent restart (where the agent loses its in-memory state), SPs naturally re-register without requiring any additional coordination mechanism. 
- -When the list of supported resource types changes as a result of an SP registration, the agent updates DCM via a `PUT` request with the full updated registration payload. - -```mermaid -sequenceDiagram - autonumber - participant SP as Service Provider - participant AG as Agent - participant DCM as DCM
(Control Plane) - participant DB as Database - - Note over SP: SP starts in
cluster / environment - - SP->>AG: POST /api/v1/providers
{name, resourceType, endpoint} - activate AG - - AG->>AG: Store SP registration
Recompute supported resource types - - alt Resource type list changed - AG->>DCM: PUT /api/v1/agents/{agentId}
{name, environment, resourceTypes,
resourcesAvailable, cost, topicName} - activate DCM - DCM->>DB: Update agent registration - activate DB - DB-->>DCM: Registration updated - deactivate DB - DCM-->>AG: 200 OK - deactivate DCM - end - - AG-->>SP: 201 Created
{providerId} - deactivate AG - - Note over SP,AG: SP periodically re-registers
to maintain its lease -``` - -#### Flow Description -1. The SP starts within the same cluster / environment as the agent -2. The SP registers itself with the agent via a REST API call, providing: - * Name - * Resource type it serves - * Endpoint (URL where the agent can reach the SP) -3. The agent stores the SP registration and recomputes the list of supported resource types -4. If the resource type list changed (new resource type added): - * The agent sends a `PUT` request to DCM with the full updated agent registration - * DCM updates the agent record in the database -5. The agent acknowledges the SP registration -6. The SP periodically re-registers with the agent; the agent handles this idempotently (create or update). This ensures that after an agent restart, SPs naturally rebuild the agent's state without additional coordination - ### Health #### Agent Health @@ -326,32 +335,22 @@ When a previously unhealthy or unavailable SP recovers (returns `200 OK` with `s ##### Agent Status -The agent exposes the health status of each registered SP in its own status. This allows cluster administrators to inspect the agent's status and immediately see which SPs are healthy, unhealthy, or unavailable without having to query each SP individually. - -```json -{ - "agentId": "agent-123", - "name": "cluster-prod-eu-west", - "environment": "PROD", - "status": "Ready", - "providers": [ - { - "providerId": "sp-vm-001", - "name": "vm-provider", - "resourceType": "vm", - "health": "Ready", - "endpoint": "http://vm-provider:8080" - }, - { - "providerId": "sp-db-001", - "name": "db-provider", - "resourceType": "database", - "health": "Unhealthy", - "endpoint": "http://db-provider:8080" - } - ] -} +The agent exposes the health status of each registered SP as custom pod conditions on its own pod. 
This allows cluster administrators to inspect the agent's pod (e.g., via `oc describe pod`) and immediately see which SPs are healthy, unhealthy, or unavailable without having to query each SP individually. + +Each registered SP is represented as a separate pod condition, using the SP's provider ID as the condition type. The condition's `status` field reflects whether the SP is healthy (`True`) or not (`False`), and the `reason` and `message` fields provide additional context. + +Example output from `oc describe pod `: + ``` +Conditions: + Type Status Reason Message + sp-vm-001/vm True Ready SP vm-provider serving resource type vm is healthy + sp-db-001/database False Unhealthy SP db-provider serving resource type database is unhealthy +``` + +###### Implementation Detail + +The agent uses [Pod Readiness Gates](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate) to surface per-SP health as custom pod conditions. The agent's pod spec declares a readiness gate for each expected condition type, and the agent application patches its own pod's `status.conditions` via the Kubernetes API using in-cluster authentication (`rest.InClusterConfig()` or equivalent). This requires RBAC permissions on the `pods/status` subresource for the agent's service account. ```mermaid sequenceDiagram @@ -403,4 +402,4 @@ sequenceDiagram 4. When a previously unhealthy/unavailable SP recovers: * The agent re-adds the resource type to its list (if it was removed) * The agent sends a `PUT` to DCM with the updated registration -5. The agent exposes the health status of all registered SPs in its own status, allowing administrators to inspect the agent and see per-SP health at a glance \ No newline at end of file +5. 
The agent exposes the health status of all registered SPs as custom pod conditions on its own pod, allowing administrators to inspect the agent via `oc describe pod` and see per-SP health at a glance \ No newline at end of file From 9429d31af645d4b4790da64a0e464c068237c357 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Fri, 5 Jun 2026 09:44:11 +0200 Subject: [PATCH 03/20] Fix typos Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index b658170..c42c5e6 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -15,9 +15,9 @@ creation-date: 2026-06-03 # Environment Agent ## Summary -This enhancement aims at adding the notion of environment by adding a layer between the SP and DCM: an agent would run on each environments usable by DCM and the agent would regiester the environment to DCM. +This enhancement aims at adding the notion of environment by adding a layer between the SP and DCM: an agent would run on each environment usable by DCM and the agent would register the environment to DCM. The agent would then use the SPs as plugins for the supported resource types and pass the creation request to the relevant one. This would mean that SPs now serve only 1 specific resource type. -This enhancement also propose to change the way the creation request is submitted to the agent (or currently, to the SP): instead of sending a direct request to the agent, DCM wil send the request to a bus that will in turn be consumed by the relevant agent to create the requested resource. 
+This enhancement also proposes to change the way the creation request is submitted to the agent (or currently, to the SP): instead of sending a direct request to the agent, DCM will send the request to a bus that will in turn be consumed by the relevant agent to create the requested resource. Additionally, this enhancement defines: - How Service Providers register to the agent (rather than to DCM directly), allowing the agent to dynamically build and maintain its list of supported resource types. @@ -25,13 +25,13 @@ Additionally, this enhancement defines: - How the agent monitors the health of its registered Service Providers using the three-state health model (Ready, Unhealthy, Unavailable) and updates DCM when the supported resource types change as a result. ## Motivation -When deploying resources in general, one of the main criterion taken into account is the type of environment in which the resource will be deployed: DEV, INT, VAL, PROD, ... +When deploying resources in general, one of the main criterion taken into account is the type of environment in which the resource will be deployed: DEV, INT, VAL, PROD, etc Currently, in DCM, a resource's creation request is routed to a given Service Provider (SP) by a policy on the base of several criteria. Once the SP is selected, DCM will send a request to the selected SP to request the creation of the resource. -There is currently no way for a policy to determine in which environment a SP is running and hence an user cannot explictly set the targeted environment constraint when requesting the creation of a resource. +There is currently no way for a policy to determine in which environment a SP is running and hence a user cannot explicitly set the targeted environment constraint when requesting the creation of a resource. -Furthermore, with the current way on submitting creation's request, deploying an agent on a cluster would also mean the administrator has to make sure the ports are open for DCM to reach the agent. 
Changing how the creation's requests are consumed by giving the initiative to the agent would solve this problem and would fit the way K8s/OCP are consuming creation requests: when a manifest is submitted, the manifest is pulled by the application actually creating the resource on the cluster. +Furthermore, with the current way of submitting creation requests, deploying an agent on a cluster would also mean the administrator has to make sure the ports are open for DCM to reach the agent. Changing how creation requests are consumed by giving the initiative to the agent would solve this problem and would fit the way K8s/OCP are consuming creation requests: when a manifest is submitted, the manifest is pulled by the application actually creating the resource on the cluster. ### Goals @@ -54,7 +54,7 @@ Furthermore, with the current way on submitting creation's request, deploying an ### Overview -For each clusters that can be used by DCM, an agent must be spawn. +For each cluster that can be used by DCM, an agent must be spawn. The agent will self register to DCM. When doing so, it will provide, amongst other information, the environment on which it's running and the resource types it can serve. When starting, the agent will also create a specific topic in a bus (Kafka, NATS, ...) in order for DCM to communicate with the agent. The name of the topic must be unique and shared with DCM upon registration. 
@@ -114,7 +114,7 @@ flowchart TD ``` #### Flow Description -* The agent is spawn in a cluster serving as a specific environment +* The agent is spawned in a cluster serving as a specific environment * Within the same cluster several Service Providers (SP) are running and serving each a specific resource type * Each SP registers itself to the agent; the agent dynamically builds its supported resource types list * The agent creates a specific topic in the bus system From eeed61a544b6a29704246c31a3cbda821a791b4a Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Fri, 5 Jun 2026 11:26:40 +0200 Subject: [PATCH 04/20] Agent can run outside of cluster Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 55 ++++++++++++++----- 1 file changed, 41 insertions(+), 14 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index c42c5e6..4be57f1 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -31,7 +31,7 @@ Currently, in DCM, a resource's creation request is routed to a given Service Pr There is currently no way for a policy to determine in which environment a SP is running and hence a user cannot explicitly set the targeted environment constraint when requesting the creation of a resource. -Furthermore, with the current way of submitting creation requests, deploying an agent on a cluster would also mean the administrator has to make sure the ports are open for DCM to reach the agent. Changing how creation requests are consumed by giving the initiative to the agent would solve this problem and would fit the way K8s/OCP are consuming creation requests: when a manifest is submitted, the manifest is pulled by the application actually creating the resource on the cluster. 
+Furthermore, with the current way of submitting creation requests, the administrator has to make sure the ports are open for DCM to reach the agent. Changing how creation requests are consumed by giving the initiative to the agent would solve this problem: the agent pulls work from a messaging system, removing the need for inbound connectivity. This approach also aligns with the way K8s/OCP consume creation requests, where manifests are pulled by the application creating the resource. ### Goals @@ -54,7 +54,7 @@ Furthermore, with the current way of submitting creation requests, deploying an ### Overview -For each cluster that can be used by DCM, an agent must be spawn. +For each environment that can be used by DCM, an agent must be spawned. The agent will self register to DCM. When doing so, it will provide, amongst other information, the environment on which it's running and the resource types it can serve. When starting, the agent will also create a specific topic in a bus (Kafka, NATS, ...) in order for DCM to communicate with the agent. The name of the topic must be unique and shared with DCM upon registration. @@ -68,7 +68,7 @@ DCM will send the creation request to the specific topic that was created by the The agent will then consume the message, validate it and then pass it to the relevant SP. -The agent monitors the health of its registered SPs by polling their `/health` endpoint, using the three-state model (Ready, Unhealthy, Unavailable). When the last SP serving a given resource type becomes unhealthy or unavailable, the agent removes that resource type from its advertised list and updates DCM. The agent also exposes the health status of each registered SP as custom pod conditions on its own pod, allowing administrators to quickly identify which SPs are causing issues via `oc describe pod`. +The agent monitors the health of its registered SPs by polling their `/health` endpoint, using the three-state model (Ready, Unhealthy, Unavailable). 
When the last SP serving a given resource type becomes unhealthy or unavailable, the agent removes that resource type from its advertised list and updates DCM. The agent exposes the health status of each registered SP via a `/api/v1/status` endpoint. On Kubernetes/OpenShift deployments, the agent additionally surfaces this information as custom pod conditions on its own pod, allowing administrators to quickly identify which SPs are causing issues via `oc describe pod`. The agent reports its own liveness to DCM via periodic REST heartbeats. DCM tracks the last heartbeat timestamp and marks the agent as unavailable if no heartbeat is received within a configurable threshold. @@ -87,7 +87,7 @@ flowchart TD DCM["**DCM**
Control Plane"]:::dcm MS["**Messaging System**
Subject-based routing"]:::messaging - subgraph Cluster_Environment["Cluster / Environment"] + subgraph Target_Environment["Target Environment"] direction LR SPX["**SP**
Resource Type X"]:::provider AG["**Agent**
Routes creation requests to SP"]:::agent @@ -110,12 +110,12 @@ flowchart TD SPY -->|Status| MS MS -->|Status| DCM - class Cluster_Environment clusterEnvironment + class Target_Environment clusterEnvironment ``` #### Flow Description -* The agent is spawned in a cluster serving as a specific environment -* Within the same cluster several Service Providers (SP) are running and serving each a specific resource type +* The agent is spawned in an environment +* Several Service Providers (SP) are running and serving each a specific resource type * Each SP registers itself to the agent; the agent dynamically builds its supported resource types list * The agent creates a specific topic in the bus system * Once at least one SP is registered and healthy, the agent self-registers to DCM and begins sending periodic heartbeats @@ -141,7 +141,7 @@ sequenceDiagram participant DCM as DCM
(Control Plane) participant DB as Database - Note over SP: SP starts in
cluster / environment + Note over SP: SP starts and
registers to the agent SP->>AG: POST /api/v1/providers
{name, resourceType, endpoint} activate AG @@ -168,7 +168,7 @@ sequenceDiagram ``` #### Flow Description -1. The SP starts within the same cluster / environment as the agent +1. The SP starts and registers to the agent 2. The SP registers itself with the agent via a REST API call, providing: * Name * Resource type it serves @@ -190,7 +190,7 @@ sequenceDiagram participant DCM as DCM
(Control Plane) participant DB as Database - Note over AG: Agent starts in
cluster / environment + Note over AG: Agent starts in
target environment AG->>MS: Create unique topic MS-->>AG: Topic created
{topicName} @@ -210,7 +210,7 @@ sequenceDiagram ``` #### Flow Description -1. The agent starts within a cluster serving a specific environment +1. The agent starts and serves a specific environment 2. The agent creates a unique topic in the messaging system to establish a dedicated communication channel 3. The agent checks whether at least one SP is registered and healthy: * If at least 1 SP is registered and healthy: the agent proceeds to register to DCM @@ -324,7 +324,7 @@ The agent monitors the health of its registered Service Providers by polling the | **Unhealthy** | SP responds with `200 OK` and `status: "unhealthy"` (SP reachable but backing provider unavailable) | | **Unavailable** | SP does not respond or returns an error, after exceeding the failure threshold | -With the agent layer, the responsibility for polling SP health shifts from DCM to the agent. The agent and its SPs are co-located in the same cluster, making the agent the natural point to perform health checks. +With the agent layer, the responsibility for polling SP health shifts from DCM to the agent. The agent is the natural point to perform health checks on its registered SPs, as it already maintains the list of SP endpoints. When the last SP serving a given resource type transitions to **Unhealthy** or **Unavailable**, the agent: 1. Removes that resource type from its advertised list @@ -335,7 +335,34 @@ When a previously unhealthy or unavailable SP recovers (returns `200 OK` with `s ##### Agent Status -The agent exposes the health status of each registered SP as custom pod conditions on its own pod. This allows cluster administrators to inspect the agent's pod (e.g., via `oc describe pod`) and immediately see which SPs are healthy, unhealthy, or unavailable without having to query each SP individually. +The agent exposes a `GET /api/v1/status` endpoint that returns the health state of all registered SPs. 
This endpoint is always available, regardless of the deployment mode (Kubernetes, Docker, standalone), and is the primary way to inspect the agent's view of its Service Providers. + +Example response: + +```json +{ + "providers": [ + { + "providerId": "sp-vm-001", + "name": "vm-provider", + "resourceType": "vm", + "status": "Ready", + "lastCheck": "2026-06-05T10:30:00Z" + }, + { + "providerId": "sp-db-001", + "name": "db-provider", + "resourceType": "database", + "status": "Unhealthy", + "lastCheck": "2026-06-05T10:30:00Z" + } + ] +} +``` + +##### Pod Conditions (Kubernetes / OpenShift) + +On Kubernetes or OpenShift deployments, the agent additionally exposes the health status of each registered SP as custom pod conditions on its own pod. This complements the `/api/v1/status` endpoint and allows administrators to inspect the agent's pod (e.g., via `oc describe pod`) and immediately see which SPs are healthy, unhealthy, or unavailable without having to query the agent's REST API. Each registered SP is represented as a separate pod condition, using the SP's provider ID as the condition type. The condition's `status` field reflects whether the SP is healthy (`True`) or not (`False`), and the `reason` and `message` fields provide additional context. @@ -402,4 +429,4 @@ sequenceDiagram 4. When a previously unhealthy/unavailable SP recovers: * The agent re-adds the resource type to its list (if it was removed) * The agent sends a `PUT` to DCM with the updated registration -5. The agent exposes the health status of all registered SPs as custom pod conditions on its own pod, allowing administrators to inspect the agent via `oc describe pod` and see per-SP health at a glance \ No newline at end of file +5. The agent exposes the health status of all registered SPs via the `GET /api/v1/status` endpoint. 
On Kubernetes/OpenShift deployments, the agent additionally surfaces this information as custom pod conditions on its own pod (see [Pod Conditions](#pod-conditions-kubernetes--openshift)) \ No newline at end of file From 42dd0573634eec0632b6aef44e8aac7974bd586e Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Tue, 9 Jun 2026 11:07:44 +0200 Subject: [PATCH 05/20] Reword Signed-off-by: gabriel-farache --- enhancements/environment-agent/environment-agent.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index 4be57f1..dc87bb2 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -16,7 +16,7 @@ creation-date: 2026-06-03 ## Summary This enhancement aims at adding the notion of environment by adding a layer between the SP and DCM: an agent would run on each environment usable by DCM and the agent would register the environment to DCM. -The agent would then use the SPs as plugins for the supported resource types and pass the creation request to the relevant one. This would mean that SPs now serve only 1 specific resource type. +The agent would then use the SPs as plugins for the supported resource types and pass the creation request to the relevant one. This would mean that each SP registration with the agent serves exactly one resource type (though a single SP application may register multiple times for different resource types). This enhancement also proposes to change the way the creation request is submitted to the agent (or currently, to the SP): instead of sending a direct request to the agent, DCM will send the request to a bus that will in turn be consumed by the relevant agent to create the requested resource. 
Additionally, this enhancement defines: From 97f9157bff0268183cfa9755729d7388bf9e46e6 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Tue, 9 Jun 2026 14:39:35 +0200 Subject: [PATCH 06/20] Reword again Signed-off-by: gabriel-farache --- enhancements/environment-agent/environment-agent.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index dc87bb2..ce6a2a3 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -59,7 +59,7 @@ The agent will self register to DCM. When doing so, it will provide, amongst oth When starting, the agent will also create a specific topic in a bus (Kafka, NATS, ...) in order for DCM to communicate with the agent. The name of the topic must be unique and shared with DCM upon registration. -Service Providers register directly to the agent (not to DCM). Each SP serves a single resource type and registers itself with the agent via a REST API call. The agent dynamically builds its list of supported resource types based on the SPs that are registered to it. When the list changes (SP registration or health-driven removal), the agent updates DCM accordingly. +Service Providers register directly to the agent (not to DCM). Each SP registration with the agent serves exactly one resource type, though a single SP application may register multiple times for different resource types. The agent dynamically builds its list of supported resource types based on the SPs that are registered to it. When the list changes (SP registration or health-driven removal), the agent updates DCM accordingly. An agent must have at least 1 Service Provider (SP) registered to it before self registering to DCM. For each resource type advertised as supported to DCM by the agent, there must be at least 1 healthy SP registered supporting the given resource type. 
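
The rule in the hunk above (each SP registration serves exactly one resource type, and every advertised resource type needs at least one healthy SP) can be sketched in Go. This is a minimal illustration under stated assumptions, not code from the enhancement: the `Registration` type and the `AdvertisedTypes` and `CanRegisterWithDCM` helpers are hypothetical names, and treating only `Ready` as healthy is an assumption drawn from the three-state model the document describes.

```go
package main

import (
	"fmt"
	"sort"
)

// Health mirrors the three-state SP health model named in the enhancement.
type Health string

const (
	Ready       Health = "Ready"
	Unhealthy   Health = "Unhealthy"
	Unavailable Health = "Unavailable"
)

// Registration is a hypothetical in-memory record of one SP registration.
// Each registration serves exactly one resource type.
type Registration struct {
	ProviderID   string
	ResourceType string
	Health       Health
}

// AdvertisedTypes returns the sorted, de-duplicated resource types that have
// at least one Ready registration. Types served only by Unhealthy or
// Unavailable SPs are left out.
func AdvertisedTypes(regs []Registration) []string {
	seen := make(map[string]bool)
	for _, r := range regs {
		if r.Health == Ready {
			seen[r.ResourceType] = true
		}
	}
	types := make([]string, 0, len(seen))
	for t := range seen {
		types = append(types, t)
	}
	sort.Strings(types)
	return types
}

// CanRegisterWithDCM reports whether the agent has at least one registered,
// healthy SP, which the document requires before self-registration.
func CanRegisterWithDCM(regs []Registration) bool {
	return len(AdvertisedTypes(regs)) > 0
}

func main() {
	regs := []Registration{
		{ProviderID: "sp-vm-001", ResourceType: "vm", Health: Ready},
		{ProviderID: "sp-vm-002", ResourceType: "vm", Health: Unavailable},
		{ProviderID: "sp-db-001", ResourceType: "database", Health: Unhealthy},
	}
	fmt.Println("advertised:", AdvertisedTypes(regs))
	fmt.Println("can register:", CanRegisterWithDCM(regs))
}
```

Recomputing the list from the full set of registrations, rather than adding and removing types incrementally, keeps the result correct when several SPs serve the same type: the type disappears only when the last healthy SP for it is gone. An agent could compare the result with the previously advertised list and send a `PUT` to DCM only when the two differ.
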
From 144685a1acd4dd1e16f99f1cac582c845c21593c Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Tue, 9 Jun 2026 14:58:01 +0200 Subject: [PATCH 07/20] Reduce summary Signed-off-by: gabriel-farache --- enhancements/environment-agent/environment-agent.md | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index ce6a2a3..255bed8 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -19,11 +19,6 @@ This enhancement aims at adding the notion of environment by adding a layer betw The agent would then use the SPs as plugins for the supported resource types and pass the creation request to the relevant one. This would mean that each SP registration with the agent serves exactly one resource type (though a single SP application may register multiple times for different resource types). This enhancement also proposes to change the way the creation request is submitted to the agent (or currently, to the SP): instead of sending a direct request to the agent, DCM will send the request to a bus that will in turn be consumed by the relevant agent to create the requested resource. -Additionally, this enhancement defines: -- How Service Providers register to the agent (rather than to DCM directly), allowing the agent to dynamically build and maintain its list of supported resource types. -- How the agent reports its own health to DCM via periodic heartbeats. -- How the agent monitors the health of its registered Service Providers using the three-state health model (Ready, Unhealthy, Unavailable) and updates DCM when the supported resource types change as a result. 
- ## Motivation When deploying resources in general, one of the main criterion taken into account is the type of environment in which the resource will be deployed: DEV, INT, VAL, PROD, etc @@ -40,9 +35,9 @@ Furthermore, with the current way of submitting creation requests, the administr - Define what information the agent gives to DCM while registering - Define how agents and DCM are communicating - Define how agents and Service Providers interact with each other -- Define how Service Providers register to the agent -- Define how the agent monitors Service Provider health -- Define how the agent reports its own health to DCM +- Define how Service Providers register to the agent, allowing the agent to dynamically build and maintain its list of supported resource types +- Define how the agent monitors Service Provider health using the three-state health model (Ready, Unhealthy, Unavailable) and updates DCM when the supported resource types change as a result +- Define how the agent reports its own health to DCM via periodic heartbeats ### Non-Goals From dd2c0cfa8479f007751785e06d9b6bd25613b7fb Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Tue, 9 Jun 2026 17:29:09 +0200 Subject: [PATCH 08/20] Reorganised and reworded based on feedback Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 613 ++++++++++++++---- 1 file changed, 493 insertions(+), 120 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index 255bed8..4895ad8 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -10,24 +10,61 @@ reviewers: approvers: - "" creation-date: 2026-06-03 +see-also: + - "/enhancements/service-provider-health-check/service-provider-health-check.md" + - "/enhancements/state-management/service-provider-status-reporting.md" + - "/enhancements/sp-registration-flow/sp-registration-flow.md" + - 
"/enhancements/placement-manager/placement-manager.md" + - "/enhancements/sp-resource-manager/sp-resource-manager.md" --- # Environment Agent +## Open Questions + +1. Can multiple agent replicas consume from the same topic for high + availability? (deferred to HA iteration) +2. How does an administrator update the agent's cost tier without restarting it? + ## Summary -This enhancement aims at adding the notion of environment by adding a layer between the SP and DCM: an agent would run on each environment usable by DCM and the agent would register the environment to DCM. -The agent would then use the SPs as plugins for the supported resource types and pass the creation request to the relevant one. This would mean that each SP registration with the agent serves exactly one resource type (though a single SP application may register multiple times for different resource types). -This enhancement also proposes to change the way the creation request is submitted to the agent (or currently, to the SP): instead of sending a direct request to the agent, DCM will send the request to a bus that will in turn be consumed by the relevant agent to create the requested resource. + +This enhancement aims at adding the notion of environment by adding a layer +between the SP and DCM: an agent would run on each environment usable by DCM and +the agent would register the environment to DCM. + +The agent would then use the SPs as plugins for the supported service types and +pass the creation request to the relevant one. This would mean that each SP +registration with the agent serves exactly one service type (though a single SP +application may register multiple times for different service types). 
+ +This enhancement also proposes to change the way the creation request is +submitted to the agent (or currently, to the SP): instead of sending a direct +request to the agent, DCM will send the request to a bus that will in turn be +consumed by the relevant agent to create the requested resource. ## Motivation -When deploying resources in general, one of the main criterion taken into account is the type of environment in which the resource will be deployed: DEV, INT, VAL, PROD, etc -Currently, in DCM, a resource's creation request is routed to a given Service Provider (SP) by a policy on the base of several criteria. Once the SP is selected, DCM will send a request to the selected SP to request the creation of the resource. +When deploying resources in general, one of the main criterion taken into +account is the type of environment in which the resource will be deployed: DEV, +INT, VAL, PROD, etc -There is currently no way for a policy to determine in which environment a SP is running and hence a user cannot explicitly set the targeted environment constraint when requesting the creation of a resource. +Currently, in DCM, a resource's creation request is routed to a given Service +Provider (SP) by a policy on the base of several criteria. Once the SP is +selected, DCM will send a request to the selected SP to request the creation of +the resource. -Furthermore, with the current way of submitting creation requests, the administrator has to make sure the ports are open for DCM to reach the agent. Changing how creation requests are consumed by giving the initiative to the agent would solve this problem: the agent pulls work from a messaging system, removing the need for inbound connectivity. This approach also aligns with the way K8s/OCP consume creation requests, where manifests are pulled by the application creating the resource. 
+There is currently no way for a policy to determine in which environment a SP is +running and hence a user cannot explicitly set the targeted environment +constraint when requesting the creation of a resource. +Furthermore, with the current way of submitting creation requests, the +administrator has to make sure the ports are open for DCM to reach the SP. +Changing how creation requests are consumed by giving the initiative to the +agent would solve this problem: the agent pulls work from a messaging system, +removing the need for DCM-to-environment inbound connectivity for creation +requests. The agent still requires outbound connectivity to DCM for registration +and heartbeats. This approach also aligns with the way K8s/OCP consume creation +requests, where manifests are pulled by the application creating the resource. ### Goals @@ -35,41 +72,76 @@ Furthermore, with the current way of submitting creation requests, the administr - Define what information the agent gives to DCM while registering - Define how agents and DCM are communicating - Define how agents and Service Providers interact with each other -- Define how Service Providers register to the agent, allowing the agent to dynamically build and maintain its list of supported resource types -- Define how the agent monitors Service Provider health using the three-state health model (Ready, Unhealthy, Unavailable) and updates DCM when the supported resource types change as a result +- Define how Service Providers register to the agent, allowing the agent to + dynamically build and maintain its list of supported service types +- Define how the agent monitors Service Provider health using the three-state + health model (Ready, Unhealthy, Unavailable) and updates DCM when the + supported service types change as a result - Define how the agent reports its own health to DCM via periodic heartbeats ### Non-Goals -- Defining how to use the information registered by the agent to DCM -- Define how agent will 
provision application (vs simple resource type) -- Update other enhancement files to reflect the changes introduced by the present document; this will be done in subsequent PRs. +- Defining how to use the information registered by the agent to DCM +- Define how agent will provision application (vs simple service type) +- Update other enhancement files to reflect the changes introduced by the + present document; this will be done in subsequent PRs. ## Proposal ### Overview -For each environment that can be used by DCM, an agent must be spawned. -The agent will self register to DCM. When doing so, it will provide, amongst other information, the environment on which it's running and the resource types it can serve. - -When starting, the agent will also create a specific topic in a bus (Kafka, NATS, ...) in order for DCM to communicate with the agent. The name of the topic must be unique and shared with DCM upon registration. - -Service Providers register directly to the agent (not to DCM). Each SP registration with the agent serves exactly one resource type, though a single SP application may register multiple times for different resource types. The agent dynamically builds its list of supported resource types based on the SPs that are registered to it. When the list changes (SP registration or health-driven removal), the agent updates DCM accordingly. - -An agent must have at least 1 Service Provider (SP) registered to it before self registering to DCM. -For each resource type advertised as supported to DCM by the agent, there must be at least 1 healthy SP registered supporting the given resource type. - -DCM will send the creation request to the specific topic that was created by the agent. - -The agent will then consume the message, validate it and then pass it to the relevant SP. - -The agent monitors the health of its registered SPs by polling their `/health` endpoint, using the three-state model (Ready, Unhealthy, Unavailable). 
When the last SP serving a given resource type becomes unhealthy or unavailable, the agent removes that resource type from its advertised list and updates DCM. The agent exposes the health status of each registered SP via a `/api/v1/status` endpoint. On Kubernetes/OpenShift deployments, the agent additionally surfaces this information as custom pod conditions on its own pod, allowing administrators to quickly identify which SPs are causing issues via `oc describe pod`. - -The agent reports its own liveness to DCM via periodic REST heartbeats. DCM tracks the last heartbeat timestamp and marks the agent as unavailable if no heartbeat is received within a configurable threshold. - -The status monitoring will not be impacted: the SP will be the one managing the resource and the current flow will remain the same; the agent is only an intermediary. +For each environment that can be used by DCM, an agent must be spawned. The +agent will self-register to DCM. When doing so, it will provide, amongst other +information, the environment on which it's running and the service types it can +serve. + +When starting, the agent will also create a specific topic in the messaging +system in order for DCM to communicate with the agent. The topic name is +deterministic (either derived from the agent's name or provided via +configuration), ensuring that after a restart the agent reuses the same topic. +If the topic already exists, the agent reuses it. The topic name is unique per +environment and is shared with DCM upon registration. In the current +single-agent model, one agent consumes from the topic. In a future HA model, +multiple agent replicas for the same environment could consume from the same +topic as competing consumers. + +Service Providers register directly to the agent (not to DCM). Each SP +registration with the agent serves exactly one service type, though a single SP +application may register multiple times for different service types.
The agent +dynamically builds its list of supported service types based on the SPs that are +registered to it. When the list changes (SP registration or health-driven +removal), the agent updates DCM accordingly. + +An agent must have at least 1 Service Provider (SP) registered to it before +self-registering to DCM. For each service type advertised as supported to DCM by the +agent, there must be at least 1 healthy SP registered supporting the given +service type. + +DCM will send the creation request to the specific topic that was created by the +agent. + +The agent will then consume the message, validate it and then pass it to the +relevant SP. + +The agent monitors the health of its registered SPs by polling their `/health` +endpoint, using the three-state model (Ready, Unhealthy, Unavailable). When the +last SP serving a given service type becomes unhealthy or unavailable, the agent +removes that service type from its advertised list and updates DCM. The agent +exposes the health status of each registered SP via a `/api/v1/status` endpoint. +On Kubernetes/OpenShift deployments, the agent additionally surfaces this +information as custom pod conditions on its own pod, allowing administrators to +quickly identify which SPs are causing issues via `oc describe pod`. + +The agent reports its own liveness to DCM via periodic REST heartbeats. DCM +tracks the last heartbeat timestamp and marks the agent as unavailable if no +heartbeat is received within a configurable threshold. + +Status monitoring is not affected: the SP remains responsible for managing the +resource and the current flow remains the same; the agent is only an +intermediary. ### Architecture + ```mermaid %%{init: {'flowchart': {'rankSpacing': 80, 'nodeSpacing': 10, 'curve': 'linear'}}}%% flowchart TD @@ -84,9 +156,9 @@ flowchart TD subgraph Target_Environment["Target Environment"] direction LR - SPX["**SP**
Resource Type X"]:::provider + SPX["**SP**
Service Type X"]:::provider AG["**Agent**
Routes creation requests to SP"]:::agent - SPY["**SP**
Resource Type Y"]:::provider + SPY["**SP**
Service Type Y"]:::provider SPX -. Registration .-> AG SPY -. Registration .-> AG AG -->|Creation Request| SPX @@ -109,24 +181,42 @@ flowchart TD ``` #### Flow Description -* The agent is spawned in an environment -* Several Service Providers (SP) are running and serving each a specific resource type -* Each SP registers itself to the agent; the agent dynamically builds its supported resource types list -* The agent creates a specific topic in the bus system -* Once at least one SP is registered and healthy, the agent self-registers to DCM and begins sending periodic heartbeats -* DCM sends creation request to the specific topic -* The agent consumes the messages sent to the topic -* The agent routes the creation request to the relevant SP -* The agent periodically health-checks each registered SP; when the last SP for a resource type becomes unhealthy, the agent updates DCM and publishes a health warning through the messaging system -* The status monitoring remains unchanged: each SP manages its resource lifecycle and reports status through the messaging system + +- The agent is spawned in an environment +- Several Service Providers (SPs) are running, each serving a specific service + type +- Each SP registers itself to the agent; the agent dynamically builds its + list of supported service types +- The agent creates a dedicated topic in the messaging system +- Once at least one SP is registered and healthy, the agent self-registers to + DCM and begins sending periodic heartbeats +- DCM sends creation requests to the agent's topic +- The agent consumes the messages sent to the topic +- The agent routes each creation request to the relevant SP +- The agent periodically health-checks each registered SP; when the last SP for + a service type becomes unhealthy or unavailable, the agent updates DCM and + publishes a health warning through the messaging system +- The status monitoring remains unchanged: each SP manages its resource + lifecycle and reports status through the messaging system
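
The registration, list-building, and routing steps in the flow above can be sketched as a small in-memory registry. This is only an illustration under stated assumptions, not the agent's actual implementation: the names `AgentRegistry`, `Provider`, `register`, `set_status`, and `route` are hypothetical, the example endpoints are placeholders, and treating a newly registered SP as Ready until its first health poll is an assumption the proposal does not state.

```python
import random
from dataclasses import dataclass, field

# Three-state health model named in the proposal.
READY, UNHEALTHY, UNAVAILABLE = "Ready", "Unhealthy", "Unavailable"


@dataclass
class Provider:
    name: str
    service_type: str
    endpoint: str
    status: str = READY  # assumption: Ready until the first health poll


@dataclass
class AgentRegistry:
    """Hypothetical in-memory view of the SPs registered to one agent."""

    # Keyed by (name, service_type): one registration serves one service type.
    providers: dict = field(default_factory=dict)

    def supported_service_types(self):
        """Service types that have at least one Ready SP."""
        return sorted(
            {p.service_type for p in self.providers.values() if p.status == READY}
        )

    def register(self, name, service_type, endpoint):
        """Create-or-update; returns True if the advertised list changed."""
        before = self.supported_service_types()
        existing = self.providers.get((name, service_type))
        status = existing.status if existing else READY
        self.providers[(name, service_type)] = Provider(
            name, service_type, endpoint, status
        )
        return self.supported_service_types() != before

    def set_status(self, name, service_type, status):
        """Record a health-check result; returns True if the list changed."""
        before = self.supported_service_types()
        self.providers[(name, service_type)].status = status
        return self.supported_service_types() != before

    def route(self, service_type):
        """Pick a random Ready SP for the service type, or None."""
        candidates = [
            p
            for p in self.providers.values()
            if p.service_type == service_type and p.status == READY
        ]
        return random.choice(candidates) if candidates else None


registry = AgentRegistry()
registry.register("vm-provider", "vm", "http://vm-sp.example:8080")
registry.register("db-provider", "database", "http://db-sp.example:8080")
registry.set_status("db-provider", "database", UNHEALTHY)
print(registry.supported_service_types())
```

A `True` return from `register` or `set_status` is the point at which, per the flows in this document, the agent would send its `PUT` update to DCM; `route` returning `None` corresponds to the "unsupported service type" error path.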
### SP Registration to Agent -Service Providers register to the agent rather than to DCM directly. The agent exposes a REST API for SP registration and dynamically maintains its list of supported resource types based on registered SPs. +Service Providers register to the agent rather than to DCM directly. The agent +exposes a REST API for SP registration and dynamically maintains its list of +supported service types based on registered SPs. -SPs periodically re-register with the agent to maintain their registration. This periodic re-registration serves as a lease renewal and ensures that after an agent restart (where the agent loses its in-memory state), SPs naturally re-register without requiring any additional coordination mechanism. +SPs periodically re-register with the agent to maintain their registration. This +periodic re-registration serves as a lease renewal and ensures that after an +agent restart (where the agent loses its in-memory state), SPs naturally +re-register without requiring any additional coordination mechanism. -When the list of supported resource types changes as a result of an SP registration and the agent is already registered to DCM, the agent updates DCM via a `PUT` request with the full updated registration payload. If the agent has not yet registered to DCM (i.e., this is the first SP registering), the agent does not send a `PUT`; instead, the SP registration satisfies the prerequisite for the agent to proceed with its initial registration to DCM (see [Agent Registration Flow](#agent-registration-flow)). +When the list of supported service types changes as a result of an SP +registration and the agent is already registered to DCM, the agent updates DCM +via a `PUT` request with the full updated registration payload. 
If the agent has +not yet registered to DCM (i.e., this is the first SP registering), the agent +does not send a `PUT`; instead, the SP registration satisfies the prerequisite +for the agent to proceed with its initial registration to DCM (see +[Agent Registration Flow](#agent-registration-flow)). ```mermaid sequenceDiagram @@ -138,13 +228,13 @@ sequenceDiagram Note over SP: SP starts and
registers to the agent - SP->>AG: POST /api/v1/providers
{name, resourceType, endpoint} + SP->>AG: POST /api/v1/providers
{name, serviceType, endpoint} activate AG - AG->>AG: Store SP registration
Recompute supported resource types + AG->>AG: Store SP registration
Recompute supported service types alt Service type list changed AND agent already registered to DCM - AG->>DCM: PUT /api/v1/agents/{agentId}
{name, environment, resourceTypes,
resourcesAvailable, cost, topicName} + AG->>DCM: PUT /api/v1/agents/{agentId}
{name, environment, serviceTypes,
resourcesAvailable, cost, topicName} activate DCM DCM->>DB: Update agent registration activate DB @@ -163,17 +253,26 @@ sequenceDiagram ``` #### Flow Description + 1. The SP starts and registers to the agent 2. The SP registers itself with the agent via a REST API call, providing: - * Name - * Resource type it serves - * Endpoint (URL where the agent can reach the SP) -3. The agent stores the SP registration and recomputes the list of supported resource types -4. If the resource type list changed (new resource type added): - * If the agent is already registered to DCM: the agent sends a `PUT` request to DCM with the full updated agent registration; DCM updates the agent record in the database - * If the agent is not yet registered to DCM: the agent does not send a `PUT`; instead, this SP registration satisfies the prerequisite for the agent's initial registration (see [Agent Registration Flow](#agent-registration-flow)) + - Name + - Service type it serves + - Endpoint (URL where the agent can reach the SP) +3. The agent stores the SP registration and recomputes the list of supported + service types +4. If the service type list changed (new service type added): + - If the agent is already registered to DCM: the agent sends a `PUT` request + to DCM with the full updated agent registration; DCM updates the agent + record in the database + - If the agent is not yet registered to DCM: the agent does not send a `PUT`; + instead, this SP registration satisfies the prerequisite for the agent's + initial registration (see + [Agent Registration Flow](#agent-registration-flow)) 5. The agent acknowledges the SP registration -6. The SP periodically re-registers with the agent; the agent handles this idempotently (create or update). This ensures that after an agent restart, SPs naturally rebuild the agent's state without additional coordination +6. The SP periodically re-registers with the agent; the agent handles this + idempotently (create or update).
This ensures that after an agent restart, + SPs naturally rebuild the agent's state without additional coordination ### Agent Registration Flow @@ -187,15 +286,15 @@ sequenceDiagram Note over AG: Agent starts in
target environment - AG->>MS: Create unique topic + AG->>MS: Create topic (deterministic name) MS-->>AG: Topic created
{topicName} Note over AG: Prerequisite:
At least 1 SP must be
registered and healthy
(see SP Registration to Agent) - AG->>DCM: POST /api/v1/agents
{name, environment, resourceTypes,
resourcesAvailable, cost, topicName} + AG->>DCM: POST /api/v1/agents
{name, environment, serviceTypes,
resourcesAvailable, cost, topicName} activate DCM - DCM->>DB: Store agent registration
{name, environment, resourceTypes,
resourcesAvailable, cost, topicName} + DCM->>DB: Store agent registration
{name, environment, serviceTypes,
resourcesAvailable, cost, topicName} activate DB DB-->>DCM: Registration stored deactivate DB @@ -205,18 +304,21 @@ sequenceDiagram ``` #### Flow Description + 1. The agent starts and serves a specific environment -2. The agent creates a unique topic in the messaging system to establish a dedicated communication channel +2. The agent creates a topic in the messaging system (using a deterministic + name) to establish a dedicated communication channel 3. The agent checks whether at least one SP is registered and healthy: - * If at least 1 SP is registered and healthy: the agent proceeds to register to DCM - * Else: the agent waits until at least 1 SP is registered and healthy + - If at least 1 SP is registered and healthy: the agent proceeds to register + to DCM + - Else: the agent waits until at least 1 SP is registered and healthy 4. The agent registers itself with DCM via a REST API call, providing: - * Name - * Environment - * Supported resource types - * Available resources - * Cost tier - * Topic name + - Name + - Environment + - Supported service types + - Available resources + - Cost tier + - Topic name 5. DCM persists the registration in the database 6. DCM acknowledges the registration @@ -230,18 +332,18 @@ sequenceDiagram participant AG as Agent participant SP as Service Provider - DCM->>MS: PUBLISH creation request
topic: {agentTopicName}
{resourceType, spec} + DCM->>MS: PUBLISH creation request
topic: {agentTopicName}
{serviceType, spec} MS->>AG: PUSH message activate AG - AG->>AG: Validate requested resource type
is supported by an attached SP + AG->>AG: Validate requested service type
is supported by an attached SP alt Service type not supported - AG->>MS: PUBLISH CloudEvent
{error: "unsupported resource type"} + AG->>MS: PUBLISH CloudEvent
{error: "unsupported service type"} MS->>DCM: PUSH error message else Service type supported - AG->>SP: POST {spEndpoint}/api/v1/{resourceType}
{spec} + AG->>SP: POST {spEndpoint}/api/v1/{serviceType}
{spec} activate SP alt SP creation fails @@ -258,26 +360,125 @@ sequenceDiagram ``` #### Flow Description -1. DCM publishes the creation request to the agent's dedicated topic in the messaging system + +1. DCM publishes the creation request to the agent's dedicated topic in the + messaging system 2. The agent consumes the message -3. The agent validates that the requested resource type is supported by one of its attached Service Providers -4. If the resource type is **not supported**: - * The agent publishes an error CloudEvent back to the messaging system - * DCM consumes the error message -5. If the resource type is **supported**: - * The agent forwards the creation request to the relevant SP via REST API - * If the SP returns an **immediate error**: the agent publishes an error CloudEvent back to the messaging system for DCM to consume - * If the SP **accepts** the request: the SP takes over resource lifecycle management and reports status changes through the existing status reporting flow (SP → Messaging System → DCM) +3. The agent validates that the requested service type is supported by one of + its attached Service Providers +4. If the service type is **not supported**: + - The agent publishes an error CloudEvent back to the messaging system + - DCM consumes the error message +5. If the service type is **supported**: + - The agent forwards the creation request to the relevant SP via REST API + - If the SP returns an **immediate error**: the agent publishes an error + CloudEvent back to the messaging system for DCM to consume + - If the SP **accepts** the request: the SP takes over resource lifecycle + management and reports status changes through the existing status reporting + flow (SP → Messaging System → DCM) + +#### SP Selection Strategy + +When multiple SPs are registered for the same service type, the agent selects +one randomly. 
Future iterations may introduce affinity-based or capacity-based +selection strategies (e.g., selecting the SP with the most available resources, +similar to pod affinity in Kubernetes). + +#### Retry Policy + +When the agent forwards a creation request to an SP and the SP returns an error, +the agent applies a configurable retry policy. When retries are exhausted, the +agent publishes an error CloudEvent to the messaging system with the resource ID +(provided by DCM in the original creation request), allowing DCM to track the +failure. + +#### In-Flight Request Handling + +When the agent restarts, unconsumed messages remain on the topic and are +consumed once the agent is back up (guaranteed by the messaging system's +persistence layer). When all SPs for a given service type are unhealthy or +unavailable, the agent responds with an error CloudEvent for each incoming +creation request targeting that service type. + +### Resource Deletion Flow + +```mermaid +sequenceDiagram + autonumber + participant DCM as DCM
(Control Plane) + participant MS as Messaging System + participant AG as Agent + participant SP as Service Provider + + DCM->>MS: PUBLISH deletion request
topic: {agentTopicName}
{serviceType, resourceId} + + MS->>AG: PUSH message + activate AG + + AG->>AG: Validate requested service type
is supported by an attached SP + + alt Service type not supported + AG->>MS: PUBLISH CloudEvent
{error: "unsupported service type"} + MS->>DCM: PUSH error message + else Service type supported + AG->>SP: DELETE {spEndpoint}/api/v1/{serviceType}/{resourceId} + activate SP + + alt SP deletion fails + SP-->>AG: Error response + deactivate SP + AG->>MS: PUBLISH CloudEvent
{error: "deletion failed",
resourceId, details} + MS->>DCM: PUSH error message + else SP deletion succeeds + SP-->>AG: Success response
{resourceId, status: DELETING} + AG->>MS: PUBLISH CloudEvent
{resourceId, status: DELETING} + MS->>DCM: PUSH deletion acknowledged + Note over SP: SP manages resource deletion
and reports final status through
the existing status reporting flow + end + end + deactivate AG +``` + +#### Flow Description + +1. DCM publishes the deletion request to the agent's dedicated topic in the + messaging system, including the service type and resource ID +2. The agent consumes the message +3. The agent validates that the requested service type is supported by one of + its attached Service Providers +4. If the service type is **not supported**: + - The agent publishes an error CloudEvent back to the messaging system + - DCM consumes the error message +5. If the service type is **supported**: + - The agent forwards the deletion request to the relevant SP via a REST + `DELETE` call + - If the SP returns an **immediate error**: the agent publishes an error + CloudEvent back to the messaging system for DCM to consume + - If the SP **accepts** the request: the agent publishes a CloudEvent + acknowledging the deletion is in progress. The SP manages the actual + resource deletion and reports the final status through the existing status + reporting flow (SP → Messaging System → DCM) + +The retry policy and in-flight request handling described in the +[Resource Creation Flow](#resource-creation-flow) apply equally to deletion +requests. ### Health #### Agent Health -The agent reports its own liveness to DCM via periodic REST heartbeats. Since the messaging system is used for resource operations (creation requests, status updates), the heartbeat uses the existing REST channel that the agent already uses for registration. +The agent reports its own liveness to DCM via periodic REST heartbeats. Since +the messaging system is used for resource operations (creation requests, status +updates), the heartbeat uses the existing REST channel that the agent already +uses for registration. -DCM tracks the last heartbeat timestamp for each agent. If no heartbeat is received within a configurable threshold, DCM marks the agent as unavailable. +DCM tracks the last heartbeat timestamp for each agent. 
If no heartbeat is +received within a configurable threshold, DCM marks the agent as unavailable. -On startup, the agent registers to DCM (as described in [Agent Registration Flow](#agent-registration-flow)). If the agent restarts, it re-registers to DCM; DCM handles this idempotently, resetting the heartbeat tracker. +On startup, the agent registers to DCM (as described in +[Agent Registration Flow](#agent-registration-flow)). If the agent restarts, it +re-registers to DCM; DCM handles this idempotently, resetting the heartbeat +tracker. ```mermaid sequenceDiagram @@ -304,33 +505,55 @@ sequenceDiagram ``` ##### Flow Description + 1. The agent periodically sends a heartbeat to DCM via a REST `PUT` call 2. DCM updates the agent's last heartbeat timestamp in the database -3. If DCM does not receive a heartbeat within the configured threshold, it marks the agent as **Unavailable** -4. When the agent restarts, its initial registration to DCM resets the heartbeat tracker and the agent status +3. If DCM does not receive a heartbeat within the configured threshold, it marks + the agent as **Unavailable** +4. 
When the agent restarts, its initial registration to DCM resets the heartbeat + tracker and the agent status #### SP Health Monitoring -The agent monitors the health of its registered Service Providers by polling their `/health` endpoint, using the three-state health model defined in the [Service Provider Health Check enhancement](../service-provider-health-check/service-provider-health-check.md): +The agent monitors the health of its registered Service Providers by polling +their `/health` endpoint, using the three-state health model defined in the +[Service Provider Health Check enhancement](../service-provider-health-check/service-provider-health-check.md): + +| State | Condition | +| --------------- | --------------------------------------------------------------------------------------------------- | +| **Ready** | SP responds with `200 OK` and `status: "healthy"` | +| **Unhealthy** | SP responds with `200 OK` and `status: "unhealthy"` (SP reachable but backing provider unavailable) | +| **Unavailable** | SP does not respond or returns an error, after exceeding the failure threshold | -| State | Condition | -|-------|-----------| -| **Ready** | SP responds with `200 OK` and `status: "healthy"` | -| **Unhealthy** | SP responds with `200 OK` and `status: "unhealthy"` (SP reachable but backing provider unavailable) | -| **Unavailable** | SP does not respond or returns an error, after exceeding the failure threshold | +With the agent layer, the responsibility for polling SP health shifts from DCM +to the agent. The agent is the natural point to perform health checks on its +registered SPs, as it already maintains the list of SP endpoints. -With the agent layer, the responsibility for polling SP health shifts from DCM to the agent. The agent is the natural point to perform health checks on its registered SPs, as it already maintains the list of SP endpoints. +The agent only routes creation requests to SPs in the **Ready** state. 
SPs in +the **Unhealthy** or **Unavailable** state are not eligible for routing, even +though an Unhealthy SP is technically reachable. This simplifies routing logic +and avoids sending requests to SPs whose backing provider is known to be down. -When the last SP serving a given resource type transitions to **Unhealthy** or **Unavailable**, the agent: -1. Removes that resource type from its advertised list -2. Sends a `PUT` request to DCM with the updated agent registration (resource types list without the affected type) -3. Publishes a health warning CloudEvent to a dedicated health topic in the messaging system, providing DCM with context about the degradation (which SP, which resource type, the reason) +When the last SP serving a given service type transitions to **Unhealthy** or +**Unavailable**, the agent: -When a previously unhealthy or unavailable SP recovers (returns `200 OK` with `status: "healthy"`), the agent re-adds the resource type to its list and updates DCM accordingly. +1. Removes that service type from its advertised list +2. Sends a `PUT` request to DCM with the updated agent registration (service + types list without the affected type) +3. Publishes a health warning CloudEvent to a dedicated health topic in the + messaging system, providing DCM with context about the degradation (which SP, + which service type, the reason) + +When a previously unhealthy or unavailable SP recovers (returns `200 OK` with +`status: "healthy"`), the agent re-adds the service type to its list and updates +DCM accordingly. ##### Agent Status -The agent exposes a `GET /api/v1/status` endpoint that returns the health state of all registered SPs. This endpoint is always available, regardless of the deployment mode (Kubernetes, Docker, standalone), and is the primary way to inspect the agent's view of its Service Providers. +The agent exposes a `GET /api/v1/status` endpoint that returns the health state +of all registered SPs. 
This endpoint is always available, regardless of the +deployment mode (Kubernetes, Docker, standalone), and is the primary way to +inspect the agent's view of its Service Providers. Example response: @@ -340,14 +563,14 @@ Example response: { "providerId": "sp-vm-001", "name": "vm-provider", - "resourceType": "vm", + "serviceType": "vm", "status": "Ready", "lastCheck": "2026-06-05T10:30:00Z" }, { "providerId": "sp-db-001", "name": "db-provider", - "resourceType": "database", + "serviceType": "database", "status": "Unhealthy", "lastCheck": "2026-06-05T10:30:00Z" } @@ -357,22 +580,37 @@ Example response: ##### Pod Conditions (Kubernetes / OpenShift) -On Kubernetes or OpenShift deployments, the agent additionally exposes the health status of each registered SP as custom pod conditions on its own pod. This complements the `/api/v1/status` endpoint and allows administrators to inspect the agent's pod (e.g., via `oc describe pod`) and immediately see which SPs are healthy, unhealthy, or unavailable without having to query the agent's REST API. +On Kubernetes or OpenShift deployments, the agent additionally exposes the +health status of each registered SP as custom pod conditions on its own pod. +This complements the `/api/v1/status` endpoint and allows administrators to +inspect the agent's pod (e.g., via `oc describe pod`) and immediately see which +SPs are healthy, unhealthy, or unavailable without having to query the agent's +REST API. -Each registered SP is represented as a separate pod condition, using the SP's provider ID as the condition type. The condition's `status` field reflects whether the SP is healthy (`True`) or not (`False`), and the `reason` and `message` fields provide additional context. +Each registered SP is represented as a separate pod condition, using the SP's +provider ID as the condition type. 
The condition's `status` field reflects +whether the SP is healthy (`True`) or not (`False`), and the `reason` and +`message` fields provide additional context. Example output from `oc describe pod `: ``` Conditions: Type Status Reason Message - sp-vm-001/vm True Ready SP vm-provider serving resource type vm is healthy - sp-db-001/database False Unhealthy SP db-provider serving resource type database is unhealthy + sp-vm-001/vm True Ready SP vm-provider serving service type vm is healthy + sp-db-001/database False Unhealthy SP db-provider serving service type database is unhealthy ``` ###### Implementation Detail -The agent uses [Pod Readiness Gates](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate) to surface per-SP health as custom pod conditions. The agent's pod spec declares a readiness gate for each expected condition type, and the agent application patches its own pod's `status.conditions` via the Kubernetes API using in-cluster authentication (`rest.InClusterConfig()` or equivalent). This requires RBAC permissions on the `pods/status` subresource for the agent's service account. +The agent uses +[Pod Readiness Gates](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate) +to surface per-SP health as custom pod conditions. The agent's pod spec declares +a readiness gate for each expected condition type, and the agent application +patches its own pod's `status.conditions` via the Kubernetes API using +in-cluster authentication (`rest.InClusterConfig()` or equivalent). This +requires RBAC permissions on the `pods/status` subresource for the agent's +service account. ```mermaid sequenceDiagram @@ -398,30 +636,165 @@ sequenceDiagram end end - Note over AG: Last SP for resource type X
becomes Unhealthy or Unavailable + Note over AG: Last SP for service type X
becomes Unhealthy or Unavailable - AG->>DCM: PUT /api/v1/agents/{agentId}
{updated resourceTypes without X} + AG->>DCM: PUT /api/v1/agents/{agentId}
{updated serviceTypes without X} activate DCM DCM->>DB: Update agent registration DB-->>DCM: Updated DCM-->>AG: 200 OK deactivate DCM - AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.health
{type: "resource-type-unavailable",
agentId, resourceType, reason,
affectedProvider} + AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.health
{type: "service-type-unavailable",
agentId, serviceType, reason,
affectedProvider} MS->>DCM: PUSH health warning ``` ##### Flow Description + 1. The agent periodically polls each registered SP's `GET /health` endpoint 2. Based on the response, the agent updates the SP's health state: - * `200 OK` with `status: "healthy"` → **Ready** (failure counter reset) - * `200 OK` with `status: "unhealthy"` → **Unhealthy** - * Timeout or error → increment failure counter; if counter exceeds threshold → **Unavailable** -3. When the last SP serving a given resource type becomes **Unhealthy** or **Unavailable**: - * The agent removes the resource type from its advertised list - * The agent sends a `PUT` to DCM with the updated registration - * The agent publishes a health warning CloudEvent to the `dcm.agents.health` topic with details about the affected SP and resource type + - `200 OK` with `status: "healthy"` → **Ready** (failure counter reset) + - `200 OK` with `status: "unhealthy"` → **Unhealthy** + - Timeout or error → increment failure counter; if counter exceeds threshold + → **Unavailable** +3. When the last SP serving a given service type becomes **Unhealthy** or + **Unavailable**: + - The agent removes the service type from its advertised list + - The agent sends a `PUT` to DCM with the updated registration + - The agent publishes a health warning CloudEvent to the `dcm.agents.health` + topic with details about the affected SP and service type 4. When a previously unhealthy/unavailable SP recovers: - * The agent re-adds the resource type to its list (if it was removed) - * The agent sends a `PUT` to DCM with the updated registration -5. The agent exposes the health status of all registered SPs via the `GET /api/v1/status` endpoint. 
On Kubernetes/OpenShift deployments, the agent additionally surfaces this information as custom pod conditions on its own pod (see [Pod Conditions](#pod-conditions-kubernetes--openshift)) \ No newline at end of file + - The agent re-adds the service type to its list (if it was removed) + - The agent sends a `PUT` to DCM with the updated registration +5. The agent exposes the health status of all registered SPs via the + `GET /api/v1/status` endpoint. On Kubernetes/OpenShift deployments, the agent + additionally surfaces this information as custom pod conditions on its own + pod (see [Pod Conditions](#pod-conditions-kubernetes--openshift)) + +### Assumptions + +- A messaging system (e.g., NATS) is deployed and accessible to both DCM and the + agent +- The agent has outbound network connectivity to DCM's REST API (for + registration and heartbeats) +- SPs have network connectivity to the agent's REST API (for registration and + health checks) +- For Kubernetes/OpenShift deployments: the agent's service account has RBAC + permissions for the `pods/status` subresource + +### Risks and Mitigations + +| Risk | Mitigation | +| -------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Agent is a single point of failure per environment | Deferred to HA iteration. Agent restart recovers state via SP re-registration (SPs periodically re-register, naturally rebuilding agent state). | +| Messaging system failure blocks creation requests | Dependent on chosen bus technology's delivery guarantees. Stated as an assumption. | +| Message loss with at-most-once semantics | Rely on bus capabilities (e.g., JetStream for NATS). 
Specific delivery guarantee is a deployment decision. | +| Split-brain: agent loses DCM connectivity but keeps processing | On reconnection, the agent re-registers to DCM. During the split, DCM marks the agent as unavailable and stops routing new requests to its topic. In-flight messages are processed normally. Duplicate creation risk if DCM re-routes to another agent is mitigated by idempotent resource creation (resource ID provided by DCM in the creation request). | +| Unauthenticated SP registration | Deferred to AuthN/Z iteration. Network isolation is the interim mitigation. | + +## Drawbacks + +- Adds operational complexity: a new binary (the agent) must be deployed, + configured, and monitored per environment +- Adds latency to the creation path: DCM → messaging system → agent → SP, versus + the current DCM → SP direct call +- Fragments health monitoring responsibility: DCM monitors agent health via + heartbeats, while the agent monitors SP health via polling +- Requires messaging system infrastructure accessible to both DCM and all target + environments + +## Alternatives + +### Alternative 1: Monolithic Agent with Embedded SPs + +#### Description + +Instead of separating the agent and Service Providers into distinct processes, +the agent binary would ship with SP code for a known set of SPs (e.g., ACM, +KubeVirt, K8s). At startup, the agent would detect available CRDs or backing +infrastructure on the environment and activate only the relevant SP code. 
+ +#### Pros + +- Single binary to deploy, no REST registration ceremony between agent and SPs +- No health monitoring overhead between agent and SPs (they share a process) +- Simpler deployment and operational model + +#### Cons + +- Tightly couples the agent to a fixed, predefined set of SPs +- Cannot support custom or third-party SPs without rebuilding the agent binary +- Agent binary grows with each new SP type +- Requires agent rebuild and redeployment to add support for a new service type + +#### Status + +Rejected + +#### Rationale + +The agent must support arbitrary SPs, including custom ones developed by third +parties. Tight coupling between the agent and SP code prevents this +extensibility. The plugin-style model (separate processes, REST registration) +allows any SP that implements the registration API to participate, regardless of +who develops or deploys it. + +### Alternative 2: etcd / CRD Watch Pattern + +#### Description + +Instead of using a messaging system for creation requests, DCM would create +Custom Resource (CR) manifests (e.g., `ResourceRequest`) directly in the target +cluster's etcd via the Kubernetes API. The agent would run as a Kubernetes +controller, watching for these CRs and reconciling them by forwarding the +creation request to the relevant SP. This follows the native Kubernetes +controller pattern. + +#### Pros + +- Native Kubernetes pattern, well-understood and battle-tested +- Leverages existing etcd for persistence and watch semantics, no separate + messaging infrastructure needed +- Built-in HA via Kubernetes controller framework (leader election, informer + caching) + +#### Cons + +- Requires DCM to have kubeconfig/API access to each target cluster, + reintroducing DCM-to-environment connectivity that this enhancement aims to + eliminate +- Does not work for non-Kubernetes environments (Docker, standalone, etc.) 
+- Pushes the connectivity requirement from the agent (outbound) to DCM (outbound + to every cluster) + +#### Status + +Rejected + +#### Rationale + +A core motivation of this enhancement is removing the need for +DCM-to-environment inbound connectivity for creation requests. The CRD watch +pattern requires DCM to push CRs to the target cluster's API server, +reintroducing that dependency. Additionally, this approach limits the agent to +Kubernetes-based environments, conflicting with the goal of supporting +non-cluster environments. + +## Cross-Cutting Impact + +The following enhancement documents will need to be updated to reflect the +changes introduced by this enhancement. These updates will be done in subsequent +PRs. + +| Document | Impact | +| -------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [SP Registration Flow](../sp-registration-flow/sp-registration-flow.md) | SPs register to the agent instead of DCM. The existing registration API contract remains valid for the agent's REST API, but DCM's registration handler no longer receives SP registrations directly. | +| [Service Provider Health Check](../service-provider-health-check/service-provider-health-check.md) | Health polling responsibility shifts from DCM to the agent. DCM monitors agent health via heartbeats instead of polling individual SPs. | +| [SP Resource Manager](../sp-resource-manager/sp-resource-manager.md) | SPRM publishes creation requests to the agent's bus topic instead of calling SP REST endpoints directly. SPRM interacts with the agent (not individual SPs) for health status. From SPRM's perspective, the agent serves the same role as a SP: provisioning service types. 
| +| [Placement Manager](../placement-manager/placement-manager.md) | Policy evaluation may now include environment as a selection criterion. Placement Manager delegates to SPRM, which routes through the messaging system. | +| [User Flows](../user-flows/user-flows.md) | End-to-end flows must include the agent layer between DCM and SPs. | + +Additionally, DCM should monitor consumer lag on agent topics in a future +iteration. If lag exceeds a configurable threshold, DCM could stop routing new +requests to that agent to avoid further congestion. A new agent state (e.g., +"Congested") could be introduced for this purpose. From efbf77b06284d7ad8aee3752d3ffc8ce6ffb6ef6 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Wed, 10 Jun 2026 09:41:28 +0200 Subject: [PATCH 09/20] typo Signed-off-by: gabriel-farache --- enhancements/environment-agent/environment-agent.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index 4895ad8..91c9d41 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -183,7 +183,7 @@ flowchart TD #### Flow Description - The agent is spawned in an environment -- Several Service Providers (SP) are running and serving each a specific service +- Several Service Providers (SP) are running and each serving a specific service type - Each SP registers itself to the agent; the agent dynamically builds its supported service types list From 81e16f410d8c0e2e09da30d8cea9fb58f4796bb6 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Wed, 10 Jun 2026 11:23:49 +0200 Subject: [PATCH 10/20] Add pseudo api definition and payload example Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 203 ++++++++++++++---- 1 file changed, 167 insertions(+), 36 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md 
b/enhancements/environment-agent/environment-agent.md index 91c9d41..b452d0a 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -25,6 +25,11 @@ see-also: 1. Can multiple agent replicas consume from the same topic for high availability? (deferred to HA iteration) 2. How does an administrator update the agent's cost tier without restarting it? + **Proposed resolution:** The administrator updates the agent's configuration + (config file, environment variable, or ConfigMap on Kubernetes). The agent + detects the change and sends a `PUT /api/v1/agents/{agentId}` to DCM with the + updated cost tier — the same mechanism used when the supported service types + list changes. ## Summary @@ -199,6 +204,108 @@ flowchart TD - The status monitoring remains unchanged: each SP manages its resource lifecycle and reports status through the messaging system +### API + +#### Agent Endpoints + +| Method | Endpoint | Description | +| ------ | ----------------- | ------------------------------------------------------------------------------------------------------------- | +| POST | /api/v1/providers | SP registration — reuses the [SP Registration Flow](../sp-registration-flow/sp-registration-flow.md) contract | +| GET | /api/v1/status | Agent status — health of all registered SPs | + +##### `POST /api/v1/providers` — SP Registration + +Reuses the contract defined in the +[SP Registration Flow](../sp-registration-flow/sp-registration-flow.md) +enhancement. The agent applies the same idempotency semantics (name as natural +key, create-or-update behavior). + +##### `GET /api/v1/status` — Agent Status + +Returns the health state of all registered SPs. This endpoint is always +available, regardless of the deployment mode (Kubernetes, Docker, standalone), +and is the primary way to inspect the agent's view of its Service Providers. 
+ +Example response: + +```json +{ + "providers": [ + { + "providerId": "sp-vm-001", + "name": "vm-provider", + "serviceType": "vm", + "status": "Ready", + "lastCheck": "2026-06-05T10:30:00Z" + }, + { + "providerId": "sp-db-001", + "name": "db-provider", + "serviceType": "database", + "status": "Unhealthy", + "lastCheck": "2026-06-05T10:30:00Z" + } + ] +} +``` + +#### DCM Endpoints + +| Method | Endpoint | Description | +| ------ | ---------------------------------- | ------------------------- | +| POST | /api/v1/agents | Agent registration | +| PUT | /api/v1/agents/{agentId} | Update agent registration | +| PUT | /api/v1/agents/{agentId}/heartbeat | Agent heartbeat | + +##### `POST /api/v1/agents` — Agent Registration + +Register a new agent to DCM. + +| Field | Type | Required | Description | +| ------------------ | -------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| name | string | yes | Unique agent name | +| environment | string | yes | Freeform environment identifier (e.g., `"dev"`, `"staging"`, `"prod-eu-west-1"`) | +| serviceTypes | string[] | yes | List of service types the agent can serve. Must be non-empty on initial `POST` (prerequisite: at least one healthy SP). May be empty on `PUT` when all SPs are unhealthy/unavailable. | +| resourcesAvailable | object | no | Available resources in the environment — sourced from K8s node info or manual configuration (see below) | +| cost | enum | yes | Cost tier: `low` \| `medium-low` \| `medium` \| `medium-high` \| `high` | +| topicName | string | yes | Deterministic topic name for the agent's messaging channel | + +Response: `201 Created` with `{agentId}` + +###### `resourcesAvailable` Structure + +The `resourcesAvailable` field is optional. 
When provided, it follows a similar +structure to the SP registration metadata defined in the +[SP Registration Flow](../sp-registration-flow/sp-registration-flow.md), but +represents the aggregate available resources across the environment rather than +a single SP's capacity. + +Example: + +```json +{ + "totalCpu": 200, + "totalMemory": "1TB", + "totalStorage": "2TB", + "totalNode": 100 +} +``` + +##### `PUT /api/v1/agents/{agentId}` — Update Agent Registration + +Update an existing agent registration. The payload is identical to the initial +`POST` registration (full replace). All fields are sent on every `PUT`. + +Response: `200 OK` + +##### `PUT /api/v1/agents/{agentId}/heartbeat` — Agent Heartbeat + +| Field | Type | Required | Description | +| --------- | ----------------- | -------- | ------------------------- | +| timestamp | string (ISO 8601) | yes | Agent's current timestamp | + +Response: `200 OK` + ### SP Registration to Agent Service Providers register to the agent rather than to DCM directly. The agent @@ -322,6 +429,23 @@ sequenceDiagram 5. DCM persists the registration in the database 6. DCM acknowledges the registration +#### Re-Registration on Restart + +When the agent restarts, it uses the same `POST /api/v1/agents` endpoint with +the same payload. The agent does not persist its `agentId`; it relies on DCM's +idempotent registration, which uses the agent `name` as the natural key (same +pattern as SP registration defined in the +[SP Registration Flow](../sp-registration-flow/sp-registration-flow.md)): if the +name already exists and no `agentId` is provided (or the same `agentId` is +provided), DCM updates the existing entry, returns the same `agentId`, and +resets the heartbeat tracker. The agent then uses the returned `agentId` for +subsequent heartbeats and updates. + +Ensuring that each agent uses a unique name is an operational responsibility. 
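The name-keyed, idempotent registration described above can be sketched as follows. This is a minimal in-memory illustration, not DCM's implementation: the `AgentRegistry` class, its method names, the use of `uuid4` for new IDs, and the `ValueError` raised when a different `agentId` is supplied are all assumptions, since the enhancement does not specify them.

```python
import uuid
from typing import Any, Dict, Optional, Tuple


class AgentRegistry:
    """Hypothetical DCM-side store that uses the agent name as natural key."""

    def __init__(self) -> None:
        self._by_name: Dict[str, Dict[str, Any]] = {}

    def register(
        self,
        name: str,
        payload: Dict[str, Any],
        agent_id: Optional[str] = None,
        now: float = 0.0,
    ) -> Tuple[str, bool]:
        """Create or update a registration; returns (agentId, created)."""
        existing = self._by_name.get(name)
        if existing is None:
            # First registration for this name: allocate a fresh agentId.
            entry = {
                "agentId": str(uuid.uuid4()),
                "payload": dict(payload),
                "lastHeartbeat": now,
            }
            self._by_name[name] = entry
            return entry["agentId"], True
        if agent_id is not None and agent_id != existing["agentId"]:
            # Assumption: a mismatching agentId is treated as a conflict.
            raise ValueError("name already registered with another agentId")
        # Same name, no agentId or same agentId: update in place,
        # keep the agentId, and reset the heartbeat tracker.
        existing["payload"] = dict(payload)
        existing["lastHeartbeat"] = now
        return existing["agentId"], False

    def last_heartbeat(self, name: str) -> float:
        return self._by_name[name]["lastHeartbeat"]
```

With this shape, a restarted agent that has lost its `agentId` simply calls `register` again with the same name and receives the identifier it had before.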
+ +Note that the `(name, topicName)` pair is not unique: in a future HA model, +multiple agent replicas for the same environment may share the same topic name. + ### Resource Creation Flow ```mermaid @@ -332,7 +456,7 @@ sequenceDiagram participant AG as Agent participant SP as Service Provider - DCM->>MS: PUBLISH creation request
topic: {agentTopicName}
{serviceType, spec} + DCM->>MS: PUBLISH CloudEvent (creation request)
topic: {agentTopicName}
{resourceId, serviceType, spec} MS->>AG: PUSH message activate AG @@ -353,6 +477,8 @@ sequenceDiagram MS->>DCM: PUSH error message else SP creation succeeds SP-->>AG: Success response
{instanceId, status: PROVISIONING} + AG->>MS: PUBLISH CloudEvent
{resourceId, status: PROVISIONING} + MS->>DCM: PUSH creation acknowledged Note over SP: SP manages resource lifecycle
and reports status through
the existing status reporting flow end end @@ -361,8 +487,8 @@ sequenceDiagram #### Flow Description -1. DCM publishes the creation request to the agent's dedicated topic in the - messaging system +1. DCM publishes a creation request CloudEvent to the agent's dedicated topic in + the messaging system, including the resource ID, service type, and spec 2. The agent consumes the message 3. The agent validates that the requested service type is supported by one of its attached Service Providers @@ -373,9 +499,10 @@ sequenceDiagram - The agent forwards the creation request to the relevant SP via REST API - If the SP returns an **immediate error**: the agent publishes an error CloudEvent back to the messaging system for DCM to consume - - If the SP **accepts** the request: the SP takes over resource lifecycle - management and reports status changes through the existing status reporting - flow (SP → Messaging System → DCM) + - If the SP **accepts** the request: the agent publishes a CloudEvent + acknowledging the creation is in progress. The SP takes over resource + lifecycle management and reports status changes through the existing status + reporting flow (SP → Messaging System → DCM) #### SP Selection Strategy @@ -410,7 +537,7 @@ sequenceDiagram participant AG as Agent participant SP as Service Provider - DCM->>MS: PUBLISH deletion request
topic: {agentTopicName}
{serviceType, resourceId} + DCM->>MS: PUBLISH CloudEvent (deletion request)
topic: {agentTopicName}
{resourceId, serviceType} MS->>AG: PUSH message activate AG @@ -441,8 +568,8 @@ sequenceDiagram #### Flow Description -1. DCM publishes the deletion request to the agent's dedicated topic in the - messaging system, including the service type and resource ID +1. DCM publishes a deletion request CloudEvent to the agent's dedicated topic in + the messaging system, including the resource ID and service type 2. The agent consumes the message 3. The agent validates that the requested service type is supported by one of its attached Service Providers @@ -550,33 +677,10 @@ DCM accordingly. ##### Agent Status -The agent exposes a `GET /api/v1/status` endpoint that returns the health state -of all registered SPs. This endpoint is always available, regardless of the -deployment mode (Kubernetes, Docker, standalone), and is the primary way to -inspect the agent's view of its Service Providers. - -Example response: - -```json -{ - "providers": [ - { - "providerId": "sp-vm-001", - "name": "vm-provider", - "serviceType": "vm", - "status": "Ready", - "lastCheck": "2026-06-05T10:30:00Z" - }, - { - "providerId": "sp-db-001", - "name": "db-provider", - "serviceType": "database", - "status": "Unhealthy", - "lastCheck": "2026-06-05T10:30:00Z" - } - ] -} -``` +The agent exposes the health status of all registered SPs via the +`GET /api/v1/status` endpoint (see +[Agent Endpoints — `GET /api/v1/status`](#get-apiv1status--agent-status) for the +response format). 
##### Pod Conditions (Kubernetes / OpenShift) @@ -671,6 +775,33 @@ sequenceDiagram additionally surfaces this information as custom pod conditions on its own pod (see [Pod Conditions](#pod-conditions-kubernetes--openshift)) +### CloudEvent Message Definitions + +All messages exchanged through the messaging system use the +[CloudEvents v1.0](https://github.com/cloudevents/spec/blob/v1.0.2/cloudevents/spec.md) +specification, following the conventions established in the +[Service Provider Status Reporting](../state-management/service-provider-status-reporting.md) +enhancement. + +All agent-originated CloudEvents include `agentName` and `topicName` in the data +payload for correlation, in addition to the `source` envelope attribute. This +allows DCM to identify both the resource and the originating agent when +consuming from the shared `dcm.agents.responses` subject. + +The `spec` field in creation request data follows the schema defined by the +target service type (see +[SP Resource Manager](../sp-resource-manager/sp-resource-manager.md), +[Placement Manager](../placement-manager/placement-manager.md)). 
+ +| Message | `type` | `source` | `subject` | `data` | +| --------------------- | ------------------------------------------- | ---------------------- | ---------------------- | ------------------------------------------------------------------------ | +| Creation Request | `dcm.request.create` | `dcm/control-plane` | `{agentTopicName}` | `{resourceId, serviceType, spec}` | +| Deletion Request | `dcm.request.delete` | `dcm/control-plane` | `{agentTopicName}` | `{resourceId, serviceType}` | +| Creation Acknowledged | `dcm.agent.creation-acknowledged` | `dcm/agents/{agentId}` | `dcm.agents.responses` | `{resourceId, agentName, topicName, status: "PROVISIONING"}` | +| Deletion Acknowledged | `dcm.agent.deletion-acknowledged` | `dcm/agents/{agentId}` | `dcm.agents.responses` | `{resourceId, agentName, topicName, status: "DELETING"}` | +| Error | `dcm.agent.error` | `dcm/agents/{agentId}` | `dcm.agents.responses` | `{resourceId, agentName, topicName, error, details}` | +| Health Warning | `dcm.agent.health.service-type-unavailable` | `dcm/agents/{agentId}` | `dcm.agents.health` | `{agentId, agentName, topicName, serviceType, reason, affectedProvider}` | + ### Assumptions - A messaging system (e.g., NATS) is deployed and accessible to both DCM and the From 09cc808a2a6b498a1f21cf6024284c96ee179790 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Thu, 11 Jun 2026 09:34:24 +0200 Subject: [PATCH 11/20] Add Terminology section Signed-off-by: gabriel-farache --- enhancements/environment-agent/environment-agent.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index b452d0a..99e7d02 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -31,6 +31,16 @@ see-also: updated cost tier — the same mechanism used when the supported service types list changes. 
+## Terminology + +- **Agent:** A lightweight process that runs in a target environment, acting as + the intermediary between DCM and the Service Providers deployed in that + environment. It registers the environment to DCM, consumes resource operation + requests from a messaging system, and routes them to the appropriate Service + Provider. +- **Environment:** A set of infrastructures that is ready to receive workload + from DCM (e.g., `dev`, `staging`, `prod-eu-west-1`). + ## Summary This enhancement aims at adding the notion of environment by adding a layer From e029207afc50d4f20188b34ac4992dac9c586080 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Thu, 11 Jun 2026 11:39:10 +0200 Subject: [PATCH 12/20] fix inconsistency with resource vs service type Signed-off-by: gabriel-farache --- enhancements/environment-agent/environment-agent.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index 99e7d02..b442c90 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -350,7 +350,7 @@ sequenceDiagram AG->>AG: Store SP registration
Recompute supported service types - alt Resource type list changed AND agent already registered to DCM + alt Service type list changed AND agent already registered to DCM AG->>DCM: PUT /api/v1/agents/{agentId}
{name, environment, serviceTypes,
resourcesAvailable, cost, topicName} activate DCM DCM->>DB: Update agent registration @@ -359,7 +359,7 @@ sequenceDiagram deactivate DB DCM-->>AG: 200 OK deactivate DCM - else Resource type list changed AND agent not yet registered to DCM + else Service type list changed AND agent not yet registered to DCM Note over AG: Prerequisite for initial
agent registration is now met
(see Agent Registration Flow) end @@ -374,7 +374,7 @@ sequenceDiagram 1. The SP starts and registers to the agent 2. The SP registers itself with the agent via a REST API call, providing: - Name - - Resource type it serves + - Service type it serves - Endpoint (URL where the agent can reach the SP) 3. The agent stores the SP registration and recomputes the list of supported service types @@ -473,10 +473,10 @@ sequenceDiagram AG->>AG: Validate requested service type
is supported by an attached SP - alt Resource type not supported + alt Service type not supported AG->>MS: PUBLISH CloudEvent
{error: "unsupported service type"} MS->>DCM: PUSH error message - else Resource type supported + else Service type supported AG->>SP: POST {spEndpoint}/api/v1/{serviceType}
{spec} activate SP From bb8eac2e8a84907e5405dcd3b6eb0880e5c0de2b Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Thu, 11 Jun 2026 11:41:09 +0200 Subject: [PATCH 13/20] select SP by alphabetical order Signed-off-by: gabriel-farache --- enhancements/environment-agent/environment-agent.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index b442c90..06538d4 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -517,7 +517,7 @@ sequenceDiagram #### SP Selection Strategy When multiple SPs are registered for the same service type, the agent selects -one randomly. Future iterations may introduce affinity-based or capacity-based +the SP in alphabetical order. Future iterations may introduce affinity-based or capacity-based selection strategies (e.g., selecting the SP with the most available resources, similar to pod affinity in Kubernetes). From 0529b4bdeb83cd2cae964cff8f9724dd40436ff1 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Thu, 11 Jun 2026 11:57:24 +0200 Subject: [PATCH 14/20] explicitly defer open question resolution 2 Signed-off-by: gabriel-farache --- enhancements/environment-agent/environment-agent.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index 06538d4..cc51aa5 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -29,7 +29,8 @@ see-also: (config file, environment variable, or ConfigMap on Kubernetes). The agent detects the change and sends a `PUT /api/v1/agents/{agentId}` to DCM with the updated cost tier — the same mechanism used when the supported service types - list changes. + list changes. 
+ **This solution is deferred to a later version: in the current version, a restart will be needed for the change in the cost tier to be propagated (via [Agent Registration Flow](#agent-registration-flow))** ## Terminology From b5a2143ff8d764a3a942af0081b62fa740d41c9b Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Thu, 11 Jun 2026 11:57:30 +0200 Subject: [PATCH 15/20] format Signed-off-by: gabriel-farache --- enhancements/environment-agent/environment-agent.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index cc51aa5..badc04a 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -29,8 +29,9 @@ see-also: (config file, environment variable, or ConfigMap on Kubernetes). The agent detects the change and sends a `PUT /api/v1/agents/{agentId}` to DCM with the updated cost tier — the same mechanism used when the supported service types - list changes. - **This solution is deferred to a later version: in the current version, a restart will be needed for the change in the cost tier to be propagated (via [Agent Registration Flow](#agent-registration-flow))** + list changes. **This solution is deferred to a later version: in the current + version, a restart will be needed for the change in the cost tier to be + propagated (via [Agent Registration Flow](#agent-registration-flow))** ## Terminology @@ -518,9 +519,9 @@ sequenceDiagram #### SP Selection Strategy When multiple SPs are registered for the same service type, the agent selects -the SP in alphabetical order.
Future iterations may introduce affinity-based or +capacity-based selection strategies (e.g., selecting the SP with the most +available resources, similar to pod affinity in Kubernetes). #### Retry Policy From 372334f79257af59f4e4d5cf6a452032231b9155 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Mon, 15 Jun 2026 11:19:33 +0200 Subject: [PATCH 16/20] Change behviour for unhleathy Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 243 ++++++++++++++---- 1 file changed, 195 insertions(+), 48 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index badc04a..75d2c8d 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -32,6 +32,9 @@ see-also: list changes. **This solution is deferred to later version: in the current version, a restart will be needed for the change in the cost tier to be propagated (via [Agent Registration Flow](#agent-registration-flow) )** +3. How does DCM handle the "queued" CloudEvent response + (`dcm.agent.request-queued`)? Does it expose the status to the user, set a + timeout, or re-evaluate policies? (deferred to DCM-side design) ## Terminology @@ -141,10 +144,18 @@ The agent will then consume the message, validate it and then pass it to the relevant SP. The agent monitors the health of its registered SPs by polling their `/health` -endpoint, using the three-state model (Ready, Unhealthy, Unavailable). When the -last SP serving a given service type becomes unhealthy or unavailable, the agent -removes that service type from its advertised list and updates DCM. The agent -exposes the health status of each registered SP via a `/api/v1/status` endpoint. +endpoint, using the three-state model (Ready, Unhealthy, Unavailable). 
The agent +differentiates its behavior based on the SP health state: + +- **Unhealthy:** The agent keeps the service type in its advertised list to DCM + but stops routing requests to the SP. Incoming requests for that service type + are held in a dedicated retry topic until the SP recovers or becomes + unavailable. +- **Unavailable:** The agent removes the service type from its advertised list, + updates DCM, and rejects any held requests for that service type. + +The agent exposes the health status of each registered SP via a `/api/v1/status` +endpoint. On Kubernetes/OpenShift deployments, the agent additionally surfaces this information as custom pod conditions on its own pod, allowing administrators to quickly identify which SPs are causing issues via `oc describe pod`. @@ -277,7 +288,7 @@ Register a new agent to DCM. | ------------------ | -------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | name | string | yes | Unique agent name | | environment | string | yes | Freeform environment identifier (e.g., `"dev"`, `"staging"`, `"prod-eu-west-1"`) | -| serviceTypes | string[] | yes | List of service types the agent can serve. Must be non-empty on initial `POST` (prerequisite: at least one healthy SP). May be empty on `PUT` when all SPs are unhealthy/unavailable. | +| serviceTypes | string[] | yes | List of service types the agent can serve. Must be non-empty on initial `POST` (prerequisite: at least one healthy SP). May be empty on `PUT` when all SPs are unavailable (Unhealthy SPs do not trigger service type removal — see [SP Health Monitoring](#sp-health-monitoring)). 
| | resourcesAvailable | object | no | Available resources in the environment — sourced from K8s node info or manual configuration (see below) | | cost | enum | yes | Cost tier: `low` \| `medium-low` \| `medium` \| `medium-high` \| `high` | | topicName | string | yes | Deterministic topic name for the agent's messaging channel | @@ -405,9 +416,12 @@ sequenceDiagram Note over AG: Agent starts in
target environment - AG->>MS: Create topic (deterministic name) + AG->>MS: Create main topic (deterministic name) MS-->>AG: Topic created
{topicName} + AG->>MS: Create retry topic (internal) + MS-->>AG: Topic created
{topicName}.retry + Note over AG: Prerequisite:
At least 1 SP must be
registered and healthy
(see SP Registration to Agent) AG->>DCM: POST /api/v1/agents
{name, environment, serviceTypes,
resourcesAvailable, cost, topicName} @@ -425,8 +439,13 @@ sequenceDiagram #### Flow Description 1. The agent starts and serves a specific environment -2. The agent creates a topic in the messaging system (using a deterministic - name) to establish a dedicated communication channel +2. The agent creates two topics in the messaging system: + - A **main topic** (using a deterministic name) to establish a dedicated + communication channel with DCM. This topic name is advertised to DCM during + registration. + - A **retry topic** (`{topicName}.retry`) used internally by the agent to hold + requests when all SPs for a service type are Unhealthy (see + [Retry Topic](#retry-topic)). This topic is not advertised to DCM. 3. The agent checks whether at least one SP is registered and healthy: - If at least 1 SP is registered and healthy: the agent proceeds to register to DCM @@ -478,7 +497,11 @@ sequenceDiagram alt Service type not supported AG->>MS: PUBLISH CloudEvent
{error: "unsupported service type"} MS->>DCM: PUSH error message - else Service type supported + else Service type supported but all SPs Unhealthy + AG->>MS: PUBLISH CloudEvent (hold request)
topic: {agentTopicName}.retry
{resourceId, serviceType, spec} + AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.responses
{resourceId, status: QUEUED,
reason: "SPs unhealthy — held for retry"} + MS->>DCM: PUSH queued response + else Service type supported and at least one SP Ready AG->>SP: POST {spEndpoint}/api/v1/{serviceType}
{spec} activate SP @@ -507,7 +530,15 @@ sequenceDiagram 4. If the service type is **not supported**: - The agent publishes an error CloudEvent back to the messaging system - DCM consumes the error message -5. If the service type is **supported**: +5. If the service type is **supported but all SPs are Unhealthy**: + - The agent publishes the original request CloudEvent to the retry topic + (`{agentTopicName}.retry`) for durable holding + - The agent publishes a "queued" CloudEvent to `dcm.agents.responses` with + `{resourceId, serviceType, status: "QUEUED"}`, informing DCM that the + request is held for retry + - The request will be processed when an SP recovers, or rejected if all SPs + become Unavailable (see [Retry Topic](#retry-topic)) +6. If the service type is **supported and at least one SP is Ready**: - The agent forwards the creation request to the relevant SP via REST API - If the SP returns an **immediate error**: the agent publishes an error CloudEvent back to the messaging system for DCM to consume @@ -531,13 +562,62 @@ agent publishes an error CloudEvent to the messaging system with the resource ID (provided by DCM in the original creation request), allowing DCM to track the failure. +#### Retry Topic + +When all SPs for a given service type are Unhealthy, the agent cannot route +requests but the service type remains advertised to DCM (to avoid +registration flapping). Instead of rejecting the request, the agent publishes +it to a dedicated **retry topic** (`{agentTopicName}.retry`) for durable +holding, and responds to DCM with a "queued" CloudEvent. + +The retry topic is created by the agent at startup alongside the main topic +(see [Agent Registration Flow](#agent-registration-flow)). It is internal to +the agent and is not advertised to DCM. + +**Message format:** The original CloudEvent is published to the retry topic +as-is (passthrough, no wrapping). 
+ +**Consumption is event-driven.** The agent reads the retry topic only when an +SP health state changes — not periodically: + +- **SP transitions to Ready:** The agent consumes the retry topic. For each + message whose service type now has a Ready SP, the agent processes the + request (forwards to the SP, responds to DCM with success or error). + Messages for service types still Unhealthy are re-published to the retry + topic. +- **SP transitions to Unavailable:** The agent consumes the retry topic. For + each message whose service type has all SPs Unavailable, the agent rejects + the request with an error CloudEvent to DCM. Messages for other service + types are re-published to the retry topic. +- **No health state change:** The retry topic is not consumed. + +**Creation/Deletion dedup:** If both a creation request and a deletion request +for the same resource ID are present in the retry topic, both messages are +removed — they cancel out since the resource was never created. The agent logs +the cancellation and acknowledges the deletion to DCM. The creation request is +silently dropped since it was never started. + +**Ordering:** Requests are processed in arrival order per service type. +Requests for different service types are independent. + +**Durability:** Messages in the retry topic survive agent crashes, guaranteed +by the messaging system's persistence layer. On restart, the agent re-reads +both the main topic and the retry topic. + #### In-Flight Request Handling -When the agent restarts, unconsumed messages remain on the topic and are -consumed once the agent is back up (guaranteed by the messaging system's -persistence layer). When all SPs for a given service type are unhealthy or -unavailable, the agent responds with an error CloudEvent for each incoming -creation request targeting that service type. 
+When the agent restarts, unconsumed messages on both the main topic and the +retry topic are consumed once the agent is back up (guaranteed by the messaging +system's persistence layer). + +- **All SPs Unhealthy:** The agent publishes the request to the retry topic and + responds to DCM with a "queued" CloudEvent. The request is processed when an + SP recovers, or rejected when all SPs for that service type become + Unavailable (see [Retry Topic](#retry-topic)). +- **All SPs Unavailable:** The agent responds with an error CloudEvent for each + incoming request targeting that service type. Additionally, the agent drains + the retry topic, rejecting any held requests for that service type with error + CloudEvents. ### Resource Deletion Flow @@ -559,7 +639,11 @@ sequenceDiagram alt Service type not supported AG->>MS: PUBLISH CloudEvent
{error: "unsupported service type"} MS->>DCM: PUSH error message - else Service type supported + else Service type supported but all SPs Unhealthy + AG->>MS: PUBLISH CloudEvent (hold request)
topic: {agentTopicName}.retry
{resourceId, serviceType} + AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.responses
{resourceId, status: QUEUED,
reason: "SPs unhealthy — held for retry"} + MS->>DCM: PUSH queued response + else Service type supported and at least one SP Ready AG->>SP: DELETE {spEndpoint}/api/v1/{serviceType}/{resourceId} activate SP @@ -588,7 +672,14 @@ sequenceDiagram 4. If the service type is **not supported**: - The agent publishes an error CloudEvent back to the messaging system - DCM consumes the error message -5. If the service type is **supported**: +5. If the service type is **supported but all SPs are Unhealthy**: + - The agent publishes the original request to the retry topic for durable + holding + - The agent publishes a "queued" CloudEvent to `dcm.agents.responses`, + informing DCM that the request is held for retry + - The request will be processed when an SP recovers, or rejected if all SPs + become Unavailable (see [Retry Topic](#retry-topic)) +6. If the service type is **supported and at least one SP is Ready**: - The agent forwards the deletion request to the relevant SP via a REST `DELETE` call - If the SP returns an **immediate error**: the agent publishes an error @@ -670,22 +761,40 @@ registered SPs, as it already maintains the list of SP endpoints. The agent only routes creation requests to SPs in the **Ready** state. SPs in the **Unhealthy** or **Unavailable** state are not eligible for routing, even -though an Unhealthy SP is technically reachable. This simplifies routing logic -and avoids sending requests to SPs whose backing provider is known to be down. +though an Unhealthy SP is technically reachable. When all SPs for a service +type are Unhealthy, incoming requests are held in the retry topic rather than +rejected (see [Retry Topic](#retry-topic)). + +The agent differentiates its behavior based on the health state of the last SP +serving a given service type: -When the last SP serving a given service type transitions to **Unhealthy** or -**Unavailable**, the agent: +**When the last SP becomes Unhealthy:** -1. 
Removes that service type from its advertised list -2. Sends a `PUT` request to DCM with the updated agent registration (service - types list without the affected type) -3. Publishes a health warning CloudEvent to a dedicated health topic in the - messaging system, providing DCM with context about the degradation (which SP, - which service type, the reason) +1. The agent **keeps** the service type in its advertised list (no `PUT` to DCM + to remove it) +2. The agent stops routing new requests to SPs for that service type — incoming + requests are held in the retry topic and a "queued" CloudEvent is sent to + DCM +3. The agent publishes a health warning CloudEvent to `dcm.agents.health` with + type `service-type-degraded` -When a previously unhealthy or unavailable SP recovers (returns `200 OK` with -`status: "healthy"`), the agent re-adds the service type to its list and updates -DCM accordingly. +**When the last SP becomes Unavailable:** + +1. The agent removes the service type from its advertised list +2. The agent sends a `PUT` request to DCM with the updated agent registration + (service types list without the affected type) +3. The agent drains the retry topic: all held requests for that service type are + rejected with error CloudEvents to DCM +4. The agent publishes a health warning CloudEvent to `dcm.agents.health` with + type `service-type-unavailable` + +**When a previously unhealthy or unavailable SP recovers** (returns `200 OK` +with `status: "healthy"`): + +1. If the service type was removed (Unavailable case): the agent re-adds it to + its list and sends a `PUT` to DCM with the updated registration +2. The agent processes held requests from the retry topic for that service type + (see [Retry Topic](#retry-topic)) ##### Agent Status @@ -752,17 +861,43 @@ sequenceDiagram end end - Note over AG: Last SP for service type X
becomes Unhealthy or Unavailable + alt Last SP for service type X becomes Unhealthy + Note over AG: Keep service type X
in advertised list.
Hold incoming requests
in retry topic. - AG->>DCM: PUT /api/v1/agents/{agentId}
{updated serviceTypes without X} - activate DCM - DCM->>DB: Update agent registration - DB-->>DCM: Updated - DCM-->>AG: 200 OK - deactivate DCM + AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.health
{type: "service-type-degraded",
agentId, serviceType, reason,
affectedProvider} + MS->>DCM: PUSH health warning + + else Last SP for service type X becomes Unavailable + AG->>DCM: PUT /api/v1/agents/{agentId}
{updated serviceTypes without X} + activate DCM + DCM->>DB: Update agent registration + DB-->>DCM: Updated + DCM-->>AG: 200 OK + deactivate DCM - AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.health
{type: "service-type-unavailable",
agentId, serviceType, reason,
affectedProvider} - MS->>DCM: PUSH health warning + Note over AG: Drain retry topic:
reject held requests for
service type X + + AG->>MS: PUBLISH CloudEvent(s)
topic: dcm.agents.responses
{error: "SP unavailable"}
for each held request + + AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.health
{type: "service-type-unavailable",
agentId, serviceType, reason,
affectedProvider} + MS->>DCM: PUSH health warning + + else Previously unhealthy/unavailable SP recovers to Ready + Note over AG: Re-add service type if removed.
Process held requests
from retry topic. + + opt Service type was removed (Unavailable case) + AG->>DCM: PUT /api/v1/agents/{agentId}
{updated serviceTypes with X} + activate DCM + DCM->>DB: Update agent registration + DB-->>DCM: Updated + DCM-->>AG: 200 OK + deactivate DCM + end + + AG->>SP: Forward held requests from retry topic + SP-->>AG: Responses + AG->>MS: PUBLISH CloudEvent(s)
topic: dcm.agents.responses
{success/error for each} + end ``` ##### Flow Description @@ -773,16 +908,26 @@ sequenceDiagram - `200 OK` with `status: "unhealthy"` → **Unhealthy** - Timeout or error → increment failure counter; if counter exceeds threshold → **Unavailable** -3. When the last SP serving a given service type becomes **Unhealthy** or - **Unavailable**: +3. When the last SP serving a given service type becomes **Unhealthy**: + - The agent **keeps** the service type in its advertised list (no `PUT` to + DCM) + - Incoming requests for that service type are held in the retry topic (see + [Retry Topic](#retry-topic)) + - The agent publishes a `service-type-degraded` health warning CloudEvent to + the `dcm.agents.health` topic +4. When the last SP serving a given service type becomes **Unavailable**: - The agent removes the service type from its advertised list - The agent sends a `PUT` to DCM with the updated registration - - The agent publishes a health warning CloudEvent to the `dcm.agents.health` - topic with details about the affected SP and service type -4. When a previously unhealthy/unavailable SP recovers: - - The agent re-adds the service type to its list (if it was removed) - - The agent sends a `PUT` to DCM with the updated registration -5. The agent exposes the health status of all registered SPs via the + - The agent drains the retry topic: all held requests for that service type + are rejected with error CloudEvents to DCM + - The agent publishes a `service-type-unavailable` health warning CloudEvent + to the `dcm.agents.health` topic +5. When a previously unhealthy or unavailable SP recovers: + - If the service type was removed (Unavailable case): the agent re-adds it + to its list and sends a `PUT` to DCM with the updated registration + - The agent processes held requests from the retry topic for that service + type +6. The agent exposes the health status of all registered SPs via the `GET /api/v1/status` endpoint. 
On Kubernetes/OpenShift deployments, the agent additionally surfaces this information as custom pod conditions on its own pod (see [Pod Conditions](#pod-conditions-kubernetes--openshift)) @@ -811,8 +956,10 @@ target service type (see | Deletion Request | `dcm.request.delete` | `dcm/control-plane` | `{agentTopicName}` | `{resourceId, serviceType}` | | Creation Acknowledged | `dcm.agent.creation-acknowledged` | `dcm/agents/{agentId}` | `dcm.agents.responses` | `{resourceId, agentName, topicName, status: "PROVISIONING"}` | | Deletion Acknowledged | `dcm.agent.deletion-acknowledged` | `dcm/agents/{agentId}` | `dcm.agents.responses` | `{resourceId, agentName, topicName, status: "DELETING"}` | +| Request Queued | `dcm.agent.request-queued` | `dcm/agents/{agentId}` | `dcm.agents.responses` | `{resourceId, agentName, topicName, serviceType, status: "QUEUED"}` | | Error | `dcm.agent.error` | `dcm/agents/{agentId}` | `dcm.agents.responses` | `{resourceId, agentName, topicName, error, details}` | -| Health Warning | `dcm.agent.health.service-type-unavailable` | `dcm/agents/{agentId}` | `dcm.agents.health` | `{agentId, agentName, topicName, serviceType, reason, affectedProvider}` | +| Health Degraded | `dcm.agent.health.service-type-degraded` | `dcm/agents/{agentId}` | `dcm.agents.health` | `{agentId, agentName, topicName, serviceType, reason, affectedProvider}` | +| Health Unavailable | `dcm.agent.health.service-type-unavailable` | `dcm/agents/{agentId}` | `dcm.agents.health` | `{agentId, agentName, topicName, serviceType, reason, affectedProvider}` | ### Assumptions From fe2250f2a807de273a0db7688c712fed51e85f08 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Mon, 15 Jun 2026 16:41:42 +0200 Subject: [PATCH 17/20] Remove PUT Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 57 +++++++++---------- 1 file changed, 26 insertions(+), 31 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md 
b/enhancements/environment-agent/environment-agent.md index 75d2c8d..23dda2e 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -27,7 +27,7 @@ see-also: 2. How does an administrator update the agent's cost tier without restarting it? **Proposed resolution:** The administrator updates the agent's configuration (config file, environment variable, or ConfigMap on Kubernetes). The agent - detects the change and sends a `PUT /api/v1/agents/{agentId}` to DCM with the + detects the change and sends a `POST /api/v1/agents` to DCM with the updated cost tier — the same mechanism used when the supported service types list changes. **This solution is deferred to later version: in the current version, a restart will be needed for the change in the cost tier to be @@ -277,7 +277,6 @@ Example response: | Method | Endpoint | Description | | ------ | ---------------------------------- | ------------------------- | | POST | /api/v1/agents | Agent registration | -| PUT | /api/v1/agents/{agentId} | Update agent registration | | PUT | /api/v1/agents/{agentId}/heartbeat | Agent heartbeat | ##### `POST /api/v1/agents` — Agent Registration @@ -288,7 +287,7 @@ Register a new agent to DCM. | ------------------ | -------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | name | string | yes | Unique agent name | | environment | string | yes | Freeform environment identifier (e.g., `"dev"`, `"staging"`, `"prod-eu-west-1"`) | -| serviceTypes | string[] | yes | List of service types the agent can serve. Must be non-empty on initial `POST` (prerequisite: at least one healthy SP). May be empty on `PUT` when all SPs are unavailable (Unhealthy SPs do not trigger service type removal — see [SP Health Monitoring](#sp-health-monitoring)). 
| +| serviceTypes | string[] | yes | List of service types the agent can serve. Must be non-empty on initial registration (prerequisite: at least one healthy SP). May be empty on subsequent re-registrations when all SPs are unavailable (Unhealthy SPs do not trigger service type removal — see [SP Health Monitoring](#sp-health-monitoring)). | | resourcesAvailable | object | no | Available resources in the environment — sourced from K8s node info or manual configuration (see below) | | cost | enum | yes | Cost tier: `low` \| `medium-low` \| `medium` \| `medium-high` \| `high` | | topicName | string | yes | Deterministic topic name for the agent's messaging channel | @@ -314,13 +313,6 @@ Example: } ``` -##### `PUT /api/v1/agents/{agentId}` — Update Agent Registration - -Update an existing agent registration. The payload is identical to the initial -`POST` registration (full replace). All fields are sent on every `PUT`. - -Response: `200 OK` - ##### `PUT /api/v1/agents/{agentId}/heartbeat` — Agent Heartbeat | Field | Type | Required | Description | @@ -342,10 +334,10 @@ re-register without requiring any additional coordination mechanism. When the list of supported service types changes as a result of an SP registration and the agent is already registered to DCM, the agent updates DCM -via a `PUT` request with the full updated registration payload. If the agent has +via a `POST /api/v1/agents` request with the full updated registration payload. If the agent has not yet registered to DCM (i.e., this is the first SP registering), the agent -does not send a `PUT`; instead, the SP registration satisfies the prerequisite -for the agent to proceed with its initial registration to DCM (see +does not notify DCM yet; instead, the SP registration satisfies the +prerequisite for the agent to proceed with its initial registration to DCM (see [Agent Registration Flow](#agent-registration-flow)). ```mermaid @@ -364,7 +356,7 @@ sequenceDiagram AG->>AG: Store SP registration
Recompute supported service types alt Service type list changed AND agent already registered to DCM - AG->>DCM: PUT /api/v1/agents/{agentId}
{name, environment, serviceTypes,
resourcesAvailable, cost, topicName} + AG->>DCM: POST /api/v1/agents
{name, environment, serviceTypes,
resourcesAvailable, cost, topicName} activate DCM DCM->>DB: Update agent registration activate DB @@ -392,12 +384,12 @@ sequenceDiagram 3. The agent stores the SP registration and recomputes the list of supported service types 4. If the service type list changed (new service type added): - - If the agent is already registered to DCM: the agent sends a `PUT` request - to DCM with the full updated agent registration; DCM updates the agent - record in the database - - If the agent is not yet registered to DCM: the agent does not send a `PUT`; - instead, this SP registration satisfies the prerequisite for the agent's - initial registration (see + - If the agent is already registered to DCM: the agent sends a + `POST /api/v1/agents` request to DCM with the full updated agent + registration; DCM updates the agent record in the database + - If the agent is not yet registered to DCM: the agent does not notify DCM + yet; instead, this SP registration satisfies the prerequisite for the + agent's initial registration (see [Agent Registration Flow](#agent-registration-flow)) 5. The agent acknowledges the SP registration 6. The SP periodically re-registers with the agent; the agent handles this @@ -770,8 +762,8 @@ serving a given service type: **When the last SP becomes Unhealthy:** -1. The agent **keeps** the service type in its advertised list (no `PUT` to DCM - to remove it) +1. The agent **keeps** the service type in its advertised list (no update sent + to DCM to remove it) 2. The agent stops routing new requests to SPs for that service type — incoming requests are held in the retry topic and a "queued" CloudEvent is sent to DCM @@ -781,8 +773,8 @@ serving a given service type: **When the last SP becomes Unavailable:** 1. The agent removes the service type from its advertised list -2. The agent sends a `PUT` request to DCM with the updated agent registration - (service types list without the affected type) +2. 
The agent sends a `POST /api/v1/agents` request to DCM with the updated + registration (service types list without the affected type) 3. The agent drains the retry topic: all held requests for that service type are rejected with error CloudEvents to DCM 4. The agent publishes a health warning CloudEvent to `dcm.agents.health` with @@ -792,7 +784,8 @@ serving a given service type: with `status: "healthy"`): 1. If the service type was removed (Unavailable case): the agent re-adds it to - its list and sends a `PUT` to DCM with the updated registration + its list and sends a `POST /api/v1/agents` to DCM with the updated + registration 2. The agent processes held requests from the retry topic for that service type (see [Retry Topic](#retry-topic)) @@ -868,7 +861,7 @@ sequenceDiagram MS->>DCM: PUSH health warning else Last SP for service type X becomes Unavailable - AG->>DCM: PUT /api/v1/agents/{agentId}
{updated serviceTypes without X} + AG->>DCM: POST /api/v1/agents
{updated serviceTypes without X} activate DCM DCM->>DB: Update agent registration DB-->>DCM: Updated @@ -886,7 +879,7 @@ sequenceDiagram Note over AG: Re-add service type if removed.
Process held requests
from retry topic. opt Service type was removed (Unavailable case) - AG->>DCM: PUT /api/v1/agents/{agentId}
{updated serviceTypes with X} + AG->>DCM: POST /api/v1/agents
{updated serviceTypes with X} activate DCM DCM->>DB: Update agent registration DB-->>DCM: Updated @@ -909,22 +902,24 @@ sequenceDiagram - Timeout or error → increment failure counter; if counter exceeds threshold → **Unavailable** 3. When the last SP serving a given service type becomes **Unhealthy**: - - The agent **keeps** the service type in its advertised list (no `PUT` to - DCM) + - The agent **keeps** the service type in its advertised list (no update + sent to DCM) - Incoming requests for that service type are held in the retry topic (see [Retry Topic](#retry-topic)) - The agent publishes a `service-type-degraded` health warning CloudEvent to the `dcm.agents.health` topic 4. When the last SP serving a given service type becomes **Unavailable**: - The agent removes the service type from its advertised list - - The agent sends a `PUT` to DCM with the updated registration + - The agent sends a `POST /api/v1/agents` to DCM with the updated + registration - The agent drains the retry topic: all held requests for that service type are rejected with error CloudEvents to DCM - The agent publishes a `service-type-unavailable` health warning CloudEvent to the `dcm.agents.health` topic 5. When a previously unhealthy or unavailable SP recovers: - If the service type was removed (Unavailable case): the agent re-adds it - to its list and sends a `PUT` to DCM with the updated registration + to its list and sends a `POST /api/v1/agents` to DCM with the updated + registration - The agent processes held requests from the retry topic for that service type 6. 
The agent exposes the health status of all registered SPs via the From 8deff815d51ba19fd6006acec6b318e29c563ac5 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Fri, 19 Jun 2026 16:10:34 +0200 Subject: [PATCH 18/20] feat(environment-agent): hybrid SP model, 1 SP per service type, defer etcd watch Integrate embedded SPs (K8s Container, ACM Cluster, KubeVirt) into the main proposal alongside external "bring your own" SPs. Enforce a global constraint of one SP per service type with 409 Conflict rejection for duplicates. Change etcd/CRD Watch alternative from Rejected to Deferred pending investigation of DCM-native watch semantics. Co-Authored-By: Claude Opus 4.6 Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 721 +++++++++++------- 1 file changed, 425 insertions(+), 296 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index 23dda2e..ffcfbcd 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -27,9 +27,9 @@ see-also: 2. How does an administrator update the agent's cost tier without restarting it? **Proposed resolution:** The administrator updates the agent's configuration (config file, environment variable, or ConfigMap on Kubernetes). The agent - detects the change and sends a `POST /api/v1/agents` to DCM with the - updated cost tier — the same mechanism used when the supported service types - list changes. **This solution is deferred to later version: in the current + detects the change and sends a `POST /api/v1/agents` to DCM with the updated + cost tier — the same mechanism used when the supported service types list + changes. **This solution is deferred to later version: in the current version, a restart will be needed for the change in the cost tier to be propagated (via [Agent Registration Flow](#agent-registration-flow) )** 3. 
How does DCM handle the "queued" CloudEvent response @@ -43,6 +43,11 @@ see-also: environment. It registers the environment to DCM, consumes resource operation requests from a messaging system, and routes them to the appropriate Service Provider. +- **Embedded SP:** SP code shipped within the agent binary (K8s Container, ACM + Cluster, KubeVirt), enabled via configuration. Embedded SPs register + internally at agent startup without a REST call. +- **External SP:** A standalone SP process that registers to the agent via the + REST API (`POST /api/v1/providers`). Also referred to as "bring your own" SP. - **Environment:** A set of infrastructures that is ready to receive workload from DCM (e.g., `dev`, `staging`, `prod-eu-west-1`). @@ -52,10 +57,11 @@ This enhancement aims at adding the notion of environment by adding a layer between the SP and DCM: an agent would run on each environment usable by DCM and the agent would register the environment to DCM. -The agent would then use the SPs as plugins for the supported service types and -pass the creation request to the relevant one. This would mean that each SP -registration with the agent serves exactly one service type (though a single SP -application may register multiple times for different service types). +The agent supports a hybrid SP model: it ships with embedded SP code for known +service types (K8s Container, ACM Cluster, KubeVirt), enabled via configuration, +and also accepts external ("bring your own") SPs that register via REST API. +Only one SP — embedded or external — may serve a given service type per agent; +duplicate registrations are rejected. This enhancement also proposes to change the way the creation request is submitted to the agent (or currently, to the SP): instead of sending a direct @@ -73,7 +79,7 @@ Provider (SP) by a policy on the base of several criteria. Once the SP is selected, DCM will send a request to the selected SP to request the creation of the resource. 
-There is currently no way for a policy to determine in which environment a SP is +There is currently no way for a policy to determine in which environment an SP is running and hence a user cannot explicitly set the targeted environment constraint when requesting the creation of a resource. @@ -92,6 +98,9 @@ requests, where manifests are pulled by the application creating the resource. - Define what information the agent gives to DCM while registering - Define how agents and DCM are communicating - Define how agents and Service Providers interact with each other +- Define how embedded SPs integrate with the agent alongside external SPs + (hybrid model) +- Define the service type uniqueness constraint (one SP per service type) - Define how Service Providers register to the agent, allowing the agent to dynamically build and maintain its list of supported service types - Define how the agent monitors Service Provider health using the three-state @@ -125,17 +134,31 @@ single-agent model, one agent consumes from the topic. In a future HA model, multiple agent replicas for the same environment could consume from the same topic as competing consumers. -Service Providers register directly to the agent (not to DCM). Each SP -registration with the agent serves exactly one service type, though a single SP -application may register multiple times for different service types. The agent -dynamically builds its list of supported service types based on the SPs that are -registered to it. When the list changes (SP registration or health-driven -removal), the agent updates DCM accordingly. +The agent supports a hybrid SP model combining embedded and external SPs: -An agent must have at least 1 Service Provider (SP) registered to it before self -registering to DCM. For each service type advertised as supported to DCM by the -agent, there must be at least 1 healthy SP registered supporting the given -service type. 
+- **Embedded SPs:** The agent ships with SP code for K8s Container, ACM Cluster, + and KubeVirt. These are enabled via configuration and register internally at + agent startup — no REST call is needed. The embedded SP code lives in + dedicated packages within the agent codebase. +- **External SPs ("bring your own"):** Standalone SP processes register to the + agent via the REST API (`POST /api/v1/providers`), following the contract + defined in the + [SP Registration Flow](../sp-registration-flow/sp-registration-flow.md). + +Only one SP — embedded or external — may serve a given service type per agent. +If an SP attempts to register for a service type that is already served, the +registration is rejected (see +[SP Registration to Agent](#sp-registration-to-agent)). Future iterations may +support multiple SPs per service type with selection strategies (e.g., +affinity-based, capacity-based). + +The agent dynamically builds its list of supported service types based on the +SPs registered to it (both embedded and external). When the list changes (SP +registration or health-driven removal), the agent updates DCM accordingly. + +An agent must have at least one SP (embedded or external) registered and healthy +before self registering to DCM. Each service type advertised to DCM must be +backed by a healthy SP. DCM will send the creation request to the specific topic that was created by the agent. @@ -143,9 +166,17 @@ agent. The agent will then consume the message, validate it and then pass it to the relevant SP. -The agent monitors the health of its registered SPs by polling their `/health` -endpoint, using the three-state model (Ready, Unhealthy, Unavailable). The agent -differentiates its behavior based on the SP health state: +The agent monitors the health of its registered SPs using the three-state model +(Ready, Unhealthy, Unavailable). 
The health monitoring mechanism differs by SP +type: + +- **Embedded SPs:** Health is determined in-process — the agent directly checks + the embedded SP's internal state without a network call. +- **External SPs:** Health is determined by polling the SP's `GET /health` + endpoint, as defined in the + [Service Provider Health Check enhancement](../service-provider-health-check/service-provider-health-check.md). + +The agent differentiates its behavior based on the SP health state: - **Unhealthy:** The agent keeps the service type in its advertised list to DCM but stops routing requests to the SP. Incoming requests for that service type @@ -155,10 +186,10 @@ differentiates its behavior based on the SP health state: updates DCM, and rejects any held requests for that service type. The agent exposes the health status of each registered SP via a `/api/v1/status` -endpoint. -On Kubernetes/OpenShift deployments, the agent additionally surfaces this -information as custom pod conditions on its own pod, allowing administrators to -quickly identify which SPs are causing issues via `oc describe pod`. +endpoint. On Kubernetes/OpenShift deployments, the agent additionally surfaces +this information as custom pod conditions on its own pod, allowing +administrators to quickly identify which SPs are causing issues via +`oc describe pod`. The agent reports its own liveness to DCM via periodic REST heartbeats. 
 DCM tracks the last heartbeat timestamp and marks the agent as unavailable if no
@@ -176,7 +207,8 @@ flowchart TD
     classDef dcm fill:#2d2d2d,color:#ffffff,stroke:#81c784,stroke-width:2px
     classDef messaging fill:#2d2d2d,color:#ffffff,stroke:#ffb74d,stroke-width:2px
     classDef agent fill:#2d2d2d,color:#ffffff,stroke:#f48fb1,stroke-width:2px
-    classDef provider fill:#2d2d2d,color:#ffffff,stroke:#90caf9,stroke-width:2px
+    classDef embedded fill:#2d2d2d,color:#ffffff,stroke:#ce93d8,stroke-width:2px
+    classDef external fill:#2d2d2d,color:#ffffff,stroke:#90caf9,stroke-width:2px
     classDef clusterEnvironment fill:#FFFFFF,stroke:#bdbdbd,stroke-width:2px

     DCM["**DCM**<br/>Control Plane"]:::dcm
@@ -184,15 +216,18 @@ flowchart TD

     subgraph Target_Environment["Target Environment"]
         direction LR
-        SPX["**SP**<br/>Service Type X"]:::provider
-        AG["**Agent**<br/>Routes creation requests to SP"]:::agent
-        SPY["**SP**<br/>Service Type Y"]:::provider
-        SPX -. Registration .-> AG
-        SPY -. Registration .-> AG
-        AG -->|Creation Request| SPX
-        AG -->|Creation Request| SPY
-        AG -.->|Health Check| SPX
-        AG -.->|Health Check| SPY
+        EXT_SP["**External SP**<br/>Service Type Z<br/>(bring your own)"]:::external
+
+        subgraph Agent_Process["Agent Process"]
+            direction TB
+            AG["**Agent**<br/>Routes creation requests to SP"]:::agent
+            EMB_SP["**Embedded SPs**<br/>K8s Container · ACM Cluster · KubeVirt<br/>(enabled via config)"]:::embedded
+            EMB_SP ---|In-process| AG
+        end
+
+        EXT_SP -. "Registration (REST)" .-> AG
+        AG -->|Creation Request| EXT_SP
+        AG -.->|"Health Check (polling)"| EXT_SP
     end

     DCM -->|Creation Request| MS
@@ -201,8 +236,8 @@ flowchart TD
     AG -. Heartbeat .-> DCM
     AG -->|Health Warning| MS
     MS -->|Health Warning| DCM
-    SPX -->|Status| MS
-    SPY -->|Status| MS
+    EXT_SP -->|Status| MS
+    EMB_SP -->|Status| MS
     MS -->|Status| DCM

     class Target_Environment clusterEnvironment
@@ -211,19 +246,23 @@ flowchart TD
 #### Flow Description

 - The agent is spawned in an environment
-- Several Service Providers (SP) are running and each serving a specific service
-  type
-- Each SP registers itself to the agent; the agent dynamically builds its
-  supported service types list
+- At startup, the agent registers its configured embedded SPs internally (K8s
+  Container, ACM Cluster, KubeVirt — each enabled via configuration)
+- External SPs register to the agent via REST API; the agent rejects
+  registration if the service type is already served (by an embedded or another
+  external SP)
+- Only one SP (embedded or external) may serve a given service type
 - The agent creates a specific topic in the bus system
 - Once at least one SP is registered and healthy, the agent self-registers to
   DCM and begins sending periodic heartbeats
 - DCM sends creation request to the specific topic
 - The agent consumes the messages sent to the topic
-- The agent routes the creation request to the relevant SP
-- The agent periodically health-checks each registered SP; when the last SP for
-  a service type becomes unhealthy, the agent updates DCM and publishes a health
-  warning through the messaging system
+- The agent routes the creation request to the SP serving the requested service
+  type
+- The agent monitors each registered SP's health: in-process for embedded SPs,
+  via `/health` endpoint polling for external SPs.
+  When the SP for a service type becomes unhealthy, the agent publishes a health
+  warning through the messaging system
 - The status monitoring remains unchanged: each SP manages its resource
   lifecycle and reports status through the messaging system
@@ -236,18 +275,31 @@ flowchart TD
 | POST | /api/v1/providers | SP registration — reuses the [SP Registration Flow](../sp-registration-flow/sp-registration-flow.md) contract |
 | GET | /api/v1/status | Agent status — health of all registered SPs |

-##### `POST /api/v1/providers` — SP Registration
+##### `POST /api/v1/providers` — SP Registration (External SPs only)

 Reuses the contract defined in the
 [SP Registration Flow](../sp-registration-flow/sp-registration-flow.md)
 enhancement. The agent applies the same idempotency semantics (name as natural
 key, create-or-update behavior).

+Only one SP may serve a given service type. If the requested service type is
+already served by another SP (embedded or external), the agent rejects the
+registration with `409 Conflict`:
+
+```json
+{
+  "error": "service type 'vm' is already served by provider 'vm-provider'"
+}
+```
+
+Embedded SPs register internally at startup and do not use this endpoint.
+
 ##### `GET /api/v1/status` — Agent Status

-Returns the health state of all registered SPs. This endpoint is always
-available, regardless of the deployment mode (Kubernetes, Docker, standalone),
-and is the primary way to inspect the agent's view of its Service Providers.
+Returns the health state of all registered SPs (both embedded and external).
+This endpoint is always available, regardless of the deployment mode
+(Kubernetes, Docker, standalone), and is the primary way to inspect the agent's
+view of its Service Providers.
Example response: @@ -255,9 +307,10 @@ Example response: { "providers": [ { - "providerId": "sp-vm-001", - "name": "vm-provider", - "serviceType": "vm", + "providerId": "sp-container-001", + "name": "k8s-container", + "serviceType": "container", + "type": "embedded", "status": "Ready", "lastCheck": "2026-06-05T10:30:00Z" }, @@ -265,6 +318,7 @@ Example response: "providerId": "sp-db-001", "name": "db-provider", "serviceType": "database", + "type": "external", "status": "Unhealthy", "lastCheck": "2026-06-05T10:30:00Z" } @@ -274,23 +328,23 @@ Example response: #### DCM Endpoints -| Method | Endpoint | Description | -| ------ | ---------------------------------- | ------------------------- | -| POST | /api/v1/agents | Agent registration | -| PUT | /api/v1/agents/{agentId}/heartbeat | Agent heartbeat | +| Method | Endpoint | Description | +| ------ | ---------------------------------- | ------------------ | +| POST | /api/v1/agents | Agent registration | +| PUT | /api/v1/agents/{agentId}/heartbeat | Agent heartbeat | ##### `POST /api/v1/agents` — Agent Registration Register a new agent to DCM. -| Field | Type | Required | Description | -| ------------------ | -------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| name | string | yes | Unique agent name | -| environment | string | yes | Freeform environment identifier (e.g., `"dev"`, `"staging"`, `"prod-eu-west-1"`) | -| serviceTypes | string[] | yes | List of service types the agent can serve. Must be non-empty on initial registration (prerequisite: at least one healthy SP). May be empty on subsequent re-registrations when all SPs are unavailable (Unhealthy SPs do not trigger service type removal — see [SP Health Monitoring](#sp-health-monitoring)). 
| -| resourcesAvailable | object | no | Available resources in the environment — sourced from K8s node info or manual configuration (see below) | -| cost | enum | yes | Cost tier: `low` \| `medium-low` \| `medium` \| `medium-high` \| `high` | -| topicName | string | yes | Deterministic topic name for the agent's messaging channel | +| Field | Type | Required | Description | +| ------------------ | -------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| name | string | yes | Unique agent name | +| environment | string | yes | Freeform environment identifier (e.g., `"dev"`, `"staging"`, `"prod-eu-west-1"`) | +| serviceTypes | string[] | yes | List of service types the agent can serve. Must be non-empty on initial registration (prerequisite: at least one healthy SP, embedded or external). May be empty on subsequent re-registrations when SPs become unavailable (an Unhealthy SP does not trigger service type removal — see [SP Health Monitoring](#sp-health-monitoring)). | +| resourcesAvailable | object | no | Available resources in the environment — sourced from K8s node info or manual configuration (see below) | +| cost | enum | yes | Cost tier: `low` \| `medium-low` \| `medium` \| `medium-high` \| `high` | +| topicName | string | yes | Deterministic topic name for the agent's messaging channel | Response: `201 Created` with `{agentId}` @@ -324,65 +378,116 @@ Response: `200 OK` ### SP Registration to Agent Service Providers register to the agent rather than to DCM directly. 
 The agent
-exposes a REST API for SP registration and dynamically maintains its list of
+supports two registration mechanisms and dynamically maintains its list of
 supported service types based on registered SPs.

-SPs periodically re-register with the agent to maintain their registration. This
-periodic re-registration serves as a lease renewal and ensures that after an
-agent restart (where the agent loses its in-memory state), SPs naturally
-re-register without requiring any additional coordination mechanism.
+**Service type uniqueness constraint:** Only one SP — embedded or external — may
+serve a given service type per agent. The first SP to register for a service
+type claims the slot. Subsequent registration attempts for the same service type
+are rejected.
+
+#### Embedded SP Registration
+
+At startup, the agent registers its configured embedded SPs internally. Each
+embedded SP's code lives in a dedicated package within the agent codebase and is
+enabled explicitly via a configuration field. The embedded SP code reaches the
+agent's registration logic directly — no REST call is involved.
+
+If the agent's state is not clean (e.g., an external SP already holds a service
+type slot from a prior session), the embedded SP registration for that service
+type is rejected. The agent logs a warning and continues running — this is not a
+fatal error.
+
+Because embedded SPs register at startup before external SPs can connect, they
+effectively take priority on a clean agent state.
+
+#### External SP Registration
+
+External SPs register via the REST API (`POST /api/v1/providers`), following the
+contract defined in the
+[SP Registration Flow](../sp-registration-flow/sp-registration-flow.md)
+enhancement. The agent applies the same idempotency semantics (name as natural
+key, create-or-update behavior).
+
+If the requested service type is already served by another SP (embedded or
+external), the agent rejects the registration with `409 Conflict` and a message
+identifying the conflicting provider, so the administrator can take action if
+necessary.
+
+External SPs periodically re-register with the agent to maintain their
+registration. This periodic re-registration serves as a lease renewal and
+ensures that after an agent restart (where the agent loses its in-memory state),
+SPs naturally re-register without requiring any additional coordination
+mechanism.
+
+#### DCM Notification

 When the list of supported service types changes as a result of an SP
-registration and the agent is already registered to DCM, the agent updates DCM
-via a `POST /api/v1/agents` request with the full updated registration payload. If the agent has
-not yet registered to DCM (i.e., this is the first SP registering), the agent
-does not notify DCM yet; instead, the SP registration satisfies the
-prerequisite for the agent to proceed with its initial registration to DCM (see
+registration (embedded or external) and the agent is already registered to DCM,
+the agent updates DCM via a `POST /api/v1/agents` request with the full updated
+registration payload. If the agent has not yet registered to DCM (i.e., this is
+the first SP registering), the agent does not notify DCM yet; instead, the SP
+registration satisfies the prerequisite for the agent to proceed with its
+initial registration to DCM (see
 [Agent Registration Flow](#agent-registration-flow)).

 ```mermaid
 sequenceDiagram
     autonumber
-    participant SP as Service Provider
+    participant SP as External SP
     participant AG as Agent
     participant DCM as DCM<br/>(Control Plane)
     participant DB as Database

-    Note over SP: SP starts and<br/>registers to the agent
+    Note over AG: Agent starts:<br/>register embedded SPs<br/>from configuration
+
+    AG->>AG: Register embedded SPs internally<br/>(K8s Container, ACM Cluster, KubeVirt<br/>— each if enabled in config)
+
+    Note over SP: External SP starts and<br/>registers to the agent
     SP->>AG: POST /api/v1/providers<br/>{name, serviceType, endpoint}
     activate AG

-    AG->>AG: Store SP registration<br/>Recompute supported service types
+    alt Service type already served by another SP
+        AG-->>SP: 409 Conflict<br/>{error: "service type X already<br/>served by provider Y"}
+    else Service type available
+        AG->>AG: Store SP registration<br/>Add service type to supported list

-    alt Service type list changed AND agent already registered to DCM
-        AG->>DCM: POST /api/v1/agents<br/>{name, environment, serviceTypes,<br/>resourcesAvailable, cost, topicName}
-        activate DCM
-        DCM->>DB: Update agent registration
-        activate DB
-        DB-->>DCM: Registration updated
-        deactivate DB
-        DCM-->>AG: 200 OK
-        deactivate DCM
-    else Service type list changed AND agent not yet registered to DCM
-        Note over AG: Prerequisite for initial<br/>agent registration is now met<br/>(see Agent Registration Flow)
-    end
+        alt Service type list changed AND agent already registered to DCM
+            AG->>DCM: POST /api/v1/agents<br/>{name, environment, serviceTypes,<br/>resourcesAvailable, cost, topicName}
+            activate DCM
+            DCM->>DB: Update agent registration
+            activate DB
+            DB-->>DCM: Registration updated
+            deactivate DB
+            DCM-->>AG: 200 OK
+            deactivate DCM
+        else Service type list changed AND agent not yet registered to DCM
+            Note over AG: Prerequisite for initial<br/>agent registration is now met<br/>(see Agent Registration Flow)
+        end

-    AG-->>SP: 201 Created<br/>{providerId}
+        AG-->>SP: 201 Created<br/>{providerId}
+    end
     deactivate AG

-    Note over SP,AG: SP periodically re-registers<br/>to maintain its lease
+    Note over SP,AG: External SP periodically<br/>re-registers to maintain its lease
 ```

 #### Flow Description

-1. The SP starts and registers to the agent
-2. The SP registers itself with the agent via a REST API call, providing:
+1. At startup, the agent registers its configured embedded SPs internally. Each
+   embedded SP claims a service type slot. If a slot is already occupied, the
+   agent logs a warning and continues
+2. An external SP starts and registers to the agent via a REST API call,
+   providing:
    - Name
    - Service type it serves
    - Endpoint (URL where the agent can reach the SP)
-3. The agent stores the SP registration and recomputes the list of supported
-   service types
+3. The agent checks whether the requested service type is already served:
+   - If **already served**: the agent rejects the registration with
+     `409 Conflict` and a message identifying the conflicting provider
+   - If **available**: the agent stores the SP registration and adds the service
+     type to its supported list
 4. If the service type list changed (new service type added):
    - If the agent is already registered to DCM: the agent sends a
      `POST /api/v1/agents` request to DCM with the full updated agent
@@ -392,9 +497,10 @@ sequenceDiagram
      agent's initial registration (see
      [Agent Registration Flow](#agent-registration-flow))
 5. The agent acknowledges the SP registration
-6. The SP periodically re-registers with the agent; the agent handles this
+6. External SPs periodically re-register with the agent; the agent handles this
    idempotently (create or update). This ensures that after an agent restart,
-   SPs naturally rebuild the agent's state without additional coordination
+   external SPs naturally rebuild the agent's state without additional
+   coordination

 ### Agent Registration Flow

@@ -414,7 +520,7 @@ sequenceDiagram
     AG->>MS: Create retry topic (internal)
     MS-->>AG: Topic created<br/>{topicName}.retry

-    Note over AG: Prerequisite:<br/>At least 1 SP must be<br/>registered and healthy<br/>(see SP Registration to Agent)
+    Note over AG: Prerequisite:<br/>At least 1 SP (embedded or<br/>external) must be registered<br/>and healthy<br/>(see SP Registration to Agent)

     AG->>DCM: POST /api/v1/agents<br/>{name, environment, serviceTypes,<br/>resourcesAvailable, cost, topicName}
     activate DCM
@@ -435,13 +541,14 @@ sequenceDiagram
    - A **main topic** (using a deterministic name) to establish a dedicated
      communication channel with DCM. This topic name is advertised to DCM
      during registration.
-   - A **retry topic** (`{topicName}.retry`) used internally by the agent to hold
-     requests when all SPs for a service type are Unhealthy (see
+   - A **retry topic** (`{topicName}.retry`) used internally by the agent to
+     hold requests when the SP for a service type is Unhealthy (see
      [Retry Topic](#retry-topic)). This topic is not advertised to DCM.
-3. The agent checks whether at least one SP is registered and healthy:
-   - If at least 1 SP is registered and healthy: the agent proceeds to register
-     to DCM
-   - Else: the agent waits until at least 1 SP is registered and healthy
+3. The agent checks whether at least one SP (embedded or external) is registered
+   and healthy:
+   - If at least one SP is registered and healthy: the agent proceeds to
+     register to DCM
+   - Else: the agent waits until at least one SP is registered and healthy
 4. The agent registers itself with DCM via a REST API call, providing:
    - Name
    - Environment
@@ -477,36 +584,52 @@ sequenceDiagram
     participant DCM as DCM<br/>(Control Plane)
     participant MS as Messaging System
     participant AG as Agent
-    participant SP as Service Provider
+    participant EMB as Embedded SP
+    participant EXT as External SP

     DCM->>MS: PUBLISH CloudEvent (creation request)<br/>topic: {agentTopicName}<br/>{resourceId, serviceType, spec}
     MS->>AG: PUSH message
     activate AG

-    AG->>AG: Validate requested service type<br/>is supported by an attached SP
+    AG->>AG: Validate requested service type<br/>is supported by a registered SP

     alt Service type not supported
         AG->>MS: PUBLISH CloudEvent<br/>{error: "unsupported service type"}
         MS->>DCM: PUSH error message
-    else Service type supported but all SPs Unhealthy
+    else Service type supported but SP is Unhealthy
         AG->>MS: PUBLISH CloudEvent (hold request)<br/>topic: {agentTopicName}.retry<br/>{resourceId, serviceType, spec}
-        AG->>MS: PUBLISH CloudEvent<br/>topic: dcm.agents.responses<br/>{resourceId, status: QUEUED,<br/>reason: "SPs unhealthy — held for retry"}
+        AG->>MS: PUBLISH CloudEvent<br/>topic: dcm.agents.responses<br/>{resourceId, status: QUEUED,<br/>reason: "SP unhealthy — held for retry"}
         MS->>DCM: PUSH queued response
-    else Service type supported and at least one SP Ready
-        AG->>SP: POST {spEndpoint}/api/v1/{serviceType}<br/>{spec}
-        activate SP
-
-        alt SP creation fails
-            SP-->>AG: Error response
-            deactivate SP
-            AG->>MS: PUBLISH CloudEvent<br/>{error: "creation failed", details}
-            MS->>DCM: PUSH error message
-        else SP creation succeeds
-            SP-->>AG: Success response<br/>{instanceId, status: PROVISIONING}
-            AG->>MS: PUBLISH CloudEvent<br/>{resourceId, status: PROVISIONING}
-            MS->>DCM: PUSH creation acknowledged
-            Note over SP: SP manages resource lifecycle<br/>and reports status through<br/>the existing status reporting flow
+    else Service type supported and SP is Ready
+        alt SP is embedded
+            AG->>EMB: In-process call<br/>{serviceType, spec}
+            activate EMB
+            alt Creation fails
+                EMB-->>AG: Error
+                deactivate EMB
+                AG->>MS: PUBLISH CloudEvent<br/>{error: "creation failed", details}
+                MS->>DCM: PUSH error message
+            else Creation succeeds
+                EMB-->>AG: Success<br/>{instanceId, status: PROVISIONING}
+                AG->>MS: PUBLISH CloudEvent<br/>{resourceId, status: PROVISIONING}
+                MS->>DCM: PUSH creation acknowledged
+                Note over EMB: SP manages resource lifecycle<br/>and reports status through<br/>the existing status reporting flow
+            end
+        else SP is external
+            AG->>EXT: POST {spEndpoint}/api/v1/{serviceType}<br/>{spec}
+            activate EXT
+            alt Creation fails
+                EXT-->>AG: Error response
+                deactivate EXT
+                AG->>MS: PUBLISH CloudEvent<br/>{error: "creation failed", details}
+                MS->>DCM: PUSH error message
+            else Creation succeeds
+                EXT-->>AG: Success response<br/>{instanceId, status: PROVISIONING}
+                AG->>MS: PUBLISH CloudEvent<br/>{resourceId, status: PROVISIONING}
+                MS->>DCM: PUSH creation acknowledged
+                Note over EXT: SP manages resource lifecycle<br/>and reports status through<br/>the existing status reporting flow
+            end
         end
     end
     deactivate AG
@@ -517,21 +640,22 @@ sequenceDiagram
 1. DCM publishes a creation request CloudEvent to the agent's dedicated topic in
    the messaging system, including the resource ID, service type, and spec
 2. The agent consumes the message
-3. The agent validates that the requested service type is supported by one of
-   its attached Service Providers
+3. The agent validates that the requested service type is supported by a
+   registered SP (embedded or external)
 4. If the service type is **not supported**:
    - The agent publishes an error CloudEvent back to the messaging system
    - DCM consumes the error message
-5. If the service type is **supported but all SPs are Unhealthy**:
+5. If the service type is **supported but the SP is Unhealthy**:
    - The agent publishes the original request CloudEvent to the retry topic
      (`{agentTopicName}.retry`) for durable holding
    - The agent publishes a "queued" CloudEvent to `dcm.agents.responses` with
      `{resourceId, serviceType, status: "QUEUED"}`, informing DCM that the
      request is held for retry
-   - The request will be processed when an SP recovers, or rejected if all SPs
-     become Unavailable (see [Retry Topic](#retry-topic))
-6. If the service type is **supported and at least one SP is Ready**:
-   - The agent forwards the creation request to the relevant SP via REST API
+   - The request will be processed when the SP recovers, or rejected if the SP
+     becomes Unavailable (see [Retry Topic](#retry-topic))
+6. If the service type is **supported and the SP is Ready**:
+   - The agent forwards the creation request to the SP via REST API (for
+     external SPs) or in-process call (for embedded SPs)
    - If the SP returns an **immediate error**: the agent publishes an error
      CloudEvent back to the messaging system for DCM to consume
    - If the SP **accepts** the request: the agent publishes a CloudEvent
@@ -539,12 +663,12 @@ sequenceDiagram
      lifecycle management and reports status changes through the existing
      status reporting flow (SP → Messaging System → DCM)

-#### SP Selection Strategy
+#### Service Type Uniqueness

-When multiple SPs are registered for the same service type, the agent selects
-the SP in alphabetical order. Future iterations may introduce affinity-based or
-capacity-based selection strategies (e.g., selecting the SP with the most
-available resources, similar to pod affinity in Kubernetes).
+Each service type is served by exactly one SP (embedded or external). There is
+no SP selection strategy in the current version. Future iterations may support
+multiple SPs per service type with selection strategies (e.g., affinity-based,
+capacity-based).

 #### Retry Policy

@@ -556,31 +680,30 @@ failure.

 #### Retry Topic

-When all SPs for a given service type are Unhealthy, the agent cannot route
-requests but the service type remains advertised to DCM (to avoid
-registration flapping). Instead of rejecting the request, the agent publishes
-it to a dedicated **retry topic** (`{agentTopicName}.retry`) for durable
-holding, and responds to DCM with a "queued" CloudEvent.
+When the SP for a given service type is Unhealthy, the agent cannot route
+requests but the service type remains advertised to DCM (to avoid registration
+flapping). Instead of rejecting the request, the agent publishes it to a
+dedicated **retry topic** (`{agentTopicName}.retry`) for durable holding, and
+responds to DCM with a "queued" CloudEvent.
-The retry topic is created by the agent at startup alongside the main topic
-(see [Agent Registration Flow](#agent-registration-flow)). It is internal to
-the agent and is not advertised to DCM.
+The retry topic is created by the agent at startup alongside the main topic (see
+[Agent Registration Flow](#agent-registration-flow)). It is internal to the
+agent and is not advertised to DCM.

 **Message format:** The original CloudEvent is published to the retry topic
 as-is (passthrough, no wrapping).

-**Consumption is event-driven.** The agent reads the retry topic only when an
-SP health state changes — not periodically:
+**Consumption is event-driven.** The agent reads the retry topic only when an SP
+health state changes — not periodically:

 - **SP transitions to Ready:** The agent consumes the retry topic. For each
-  message whose service type now has a Ready SP, the agent processes the
-  request (forwards to the SP, responds to DCM with success or error).
-  Messages for service types still Unhealthy are re-published to the retry
-  topic.
+  message whose service type now has a Ready SP, the agent processes the request
+  (forwards to the SP, responds to DCM with success or error). Messages for
+  service types whose SP is still Unhealthy are re-published to the retry topic.
 - **SP transitions to Unavailable:** The agent consumes the retry topic. For
-  each message whose service type has all SPs Unavailable, the agent rejects
-  the request with an error CloudEvent to DCM. Messages for other service
-  types are re-published to the retry topic.
+  each message whose service type's SP is Unavailable, the agent rejects the
+  request with an error CloudEvent to DCM. Messages for other service types are
+  re-published to the retry topic.
 - **No health state change:** The retry topic is not consumed.

 **Creation/Deletion dedup:** If both a creation request and a deletion request
@@ -589,12 +712,12 @@ removed — they cancel out since the resource was never created.
 The agent logs the cancellation and acknowledges the deletion to DCM. The
 creation request is silently dropped since it was never started.

-**Ordering:** Requests are processed in arrival order per service type.
-Requests for different service types are independent.
+**Ordering:** Requests are processed in arrival order per service type. Requests
+for different service types are independent.

-**Durability:** Messages in the retry topic survive agent crashes, guaranteed
-by the messaging system's persistence layer. On restart, the agent re-reads
-both the main topic and the retry topic.
+**Durability:** Messages in the retry topic survive agent crashes, guaranteed by
+the messaging system's persistence layer. On restart, the agent re-reads both
+the main topic and the retry topic.

 #### In-Flight Request Handling

@@ -602,11 +725,11 @@ When the agent restarts, unconsumed messages on both the main topic and the
 retry topic are consumed once the agent is back up (guaranteed by the messaging
 system's persistence layer).

-- **All SPs Unhealthy:** The agent publishes the request to the retry topic and
-  responds to DCM with a "queued" CloudEvent. The request is processed when an
-  SP recovers, or rejected when all SPs for that service type become
-  Unavailable (see [Retry Topic](#retry-topic)).
-- **All SPs Unavailable:** The agent responds with an error CloudEvent for each
+- **SP is Unhealthy:** The agent publishes the request to the retry topic and
+  responds to DCM with a "queued" CloudEvent. The request is processed when the
+  SP recovers, or rejected when the SP for that service type becomes Unavailable
+  (see [Retry Topic](#retry-topic)).
+- **SP is Unavailable:** The agent responds with an error CloudEvent for each
   incoming request targeting that service type. Additionally, the agent drains
   the retry topic, rejecting any held requests for that service type with error
   CloudEvents.
@@ -619,36 +742,52 @@ sequenceDiagram
     participant DCM as DCM<br/>(Control Plane)
     participant MS as Messaging System
     participant AG as Agent
-    participant SP as Service Provider
+    participant EMB as Embedded SP
+    participant EXT as External SP

     DCM->>MS: PUBLISH CloudEvent (deletion request)<br/>topic: {agentTopicName}<br/>{resourceId, serviceType}
     MS->>AG: PUSH message
     activate AG

-    AG->>AG: Validate requested service type<br/>is supported by an attached SP
+    AG->>AG: Validate requested service type<br/>is supported by a registered SP

     alt Service type not supported
         AG->>MS: PUBLISH CloudEvent<br/>{error: "unsupported service type"}
         MS->>DCM: PUSH error message
-    else Service type supported but all SPs Unhealthy
+    else Service type supported but SP is Unhealthy
         AG->>MS: PUBLISH CloudEvent (hold request)<br/>topic: {agentTopicName}.retry<br/>{resourceId, serviceType}
-        AG->>MS: PUBLISH CloudEvent<br/>topic: dcm.agents.responses<br/>{resourceId, status: QUEUED,<br/>reason: "SPs unhealthy — held for retry"}
+        AG->>MS: PUBLISH CloudEvent<br/>topic: dcm.agents.responses<br/>{resourceId, status: QUEUED,<br/>reason: "SP unhealthy — held for retry"}
         MS->>DCM: PUSH queued response
-    else Service type supported and at least one SP Ready
-        AG->>SP: DELETE {spEndpoint}/api/v1/{serviceType}/{resourceId}
-        activate SP
-
-        alt SP deletion fails
-            SP-->>AG: Error response
-            deactivate SP
-            AG->>MS: PUBLISH CloudEvent<br/>{error: "deletion failed",<br/>resourceId, details}
-            MS->>DCM: PUSH error message
-        else SP deletion succeeds
-            SP-->>AG: Success response<br/>{resourceId, status: DELETING}
-            AG->>MS: PUBLISH CloudEvent<br/>{resourceId, status: DELETING}
-            MS->>DCM: PUSH deletion acknowledged
-            Note over SP: SP manages resource deletion<br/>and reports final status through<br/>the existing status reporting flow
+    else Service type supported and SP is Ready
+        alt SP is embedded
+            AG->>EMB: In-process call<br/>{serviceType, resourceId}
+            activate EMB
+            alt Deletion fails
+                EMB-->>AG: Error
+                deactivate EMB
+                AG->>MS: PUBLISH CloudEvent<br/>{error: "deletion failed",<br/>resourceId, details}
+                MS->>DCM: PUSH error message
+            else Deletion succeeds
+                EMB-->>AG: Success<br/>{resourceId, status: DELETING}
+                AG->>MS: PUBLISH CloudEvent<br/>{resourceId, status: DELETING}
+                MS->>DCM: PUSH deletion acknowledged
+                Note over EMB: SP manages resource deletion<br/>and reports final status through<br/>the existing status reporting flow
+            end
+        else SP is external
+            AG->>EXT: DELETE {spEndpoint}/api/v1/{serviceType}/{resourceId}
+            activate EXT
+            alt Deletion fails
+                EXT-->>AG: Error response
+                deactivate EXT
+                AG->>MS: PUBLISH CloudEvent<br/>{error: "deletion failed",<br/>resourceId, details}
+                MS->>DCM: PUSH error message
+            else Deletion succeeds
+                EXT-->>AG: Success response<br/>{resourceId, status: DELETING}
+                AG->>MS: PUBLISH CloudEvent<br/>{resourceId, status: DELETING}
+                MS->>DCM: PUSH deletion acknowledged
+                Note over EXT: SP manages resource deletion<br/>and reports final status through<br/>the existing status reporting flow
+            end
         end
     end
     deactivate AG
@@ -659,21 +798,21 @@ sequenceDiagram
 1. DCM publishes a deletion request CloudEvent to the agent's dedicated topic in
    the messaging system, including the resource ID and service type
 2. The agent consumes the message
-3. The agent validates that the requested service type is supported by one of
-   its attached Service Providers
+3. The agent validates that the requested service type is supported by a
+   registered SP (embedded or external)
 4. If the service type is **not supported**:
    - The agent publishes an error CloudEvent back to the messaging system
    - DCM consumes the error message
-5. If the service type is **supported but all SPs are Unhealthy**:
+5. If the service type is **supported but the SP is Unhealthy**:
    - The agent publishes the original request to the retry topic for durable
      holding
    - The agent publishes a "queued" CloudEvent to `dcm.agents.responses`,
      informing DCM that the request is held for retry
-   - The request will be processed when an SP recovers, or rejected if all SPs
-     become Unavailable (see [Retry Topic](#retry-topic))
-6. If the service type is **supported and at least one SP is Ready**:
-   - The agent forwards the deletion request to the relevant SP via a REST
-     `DELETE` call
+   - The request will be processed when the SP recovers, or rejected if the SP
+     becomes Unavailable (see [Retry Topic](#retry-topic))
+6. If the service type is **supported and the SP is Ready**:
+   - The agent forwards the deletion request to the SP via a REST `DELETE` call
+     (for external SPs) or in-process call (for embedded SPs)
    - If the SP returns an **immediate error**: the agent publishes an error
      CloudEvent back to the messaging system for DCM to consume
    - If the SP **accepts** the request: the agent publishes a CloudEvent
@@ -737,40 +876,45 @@ sequenceDiagram

 #### SP Health Monitoring

-The agent monitors the health of its registered Service Providers by polling
-their `/health` endpoint, using the three-state health model defined in the
-[Service Provider Health Check enhancement](../service-provider-health-check/service-provider-health-check.md):
+The agent monitors the health of its registered Service Providers using the
+three-state health model defined in the
+[Service Provider Health Check enhancement](../service-provider-health-check/service-provider-health-check.md).
+The monitoring mechanism differs by SP type:

-| State | Condition |
-| --------------- | --------------------------------------------------------------------------------------------------- |
-| **Ready** | SP responds with `200 OK` and `status: "healthy"` |
-| **Unhealthy** | SP responds with `200 OK` and `status: "unhealthy"` (SP reachable but backing provider unavailable) |
-| **Unavailable** | SP does not respond or returns an error, after exceeding the failure threshold |
+- **Embedded SPs:** Health is determined in-process — the agent directly checks
+  the embedded SP's internal state without a network call.
+- **External SPs:** Health is determined by polling the SP's `GET /health`
+  endpoint.

-With the agent layer, the responsibility for polling SP health shifts from DCM
-to the agent. The agent is the natural point to perform health checks on its
-registered SPs, as it already maintains the list of SP endpoints.
+| State | Condition | +| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| **Ready** | SP responds with `200 OK` and `status: "healthy"` (external), or internal check passes (embedded) | +| **Unhealthy** | SP responds with `200 OK` and `status: "unhealthy"` (external), or internal check reports unhealthy (embedded) | +| **Unavailable** | SP does not respond or returns an error after exceeding the failure threshold (external), or internal check reports unavailable (embedded) | -The agent only routes creation requests to SPs in the **Ready** state. SPs in -the **Unhealthy** or **Unavailable** state are not eligible for routing, even -though an Unhealthy SP is technically reachable. When all SPs for a service -type are Unhealthy, incoming requests are held in the retry topic rather than -rejected (see [Retry Topic](#retry-topic)). +With the agent layer, the responsibility for monitoring SP health shifts from +DCM to the agent. The agent is the natural point to perform health checks on its +registered SPs, as it already maintains the list of SP registrations. -The agent differentiates its behavior based on the health state of the last SP -serving a given service type: +The agent only routes requests to SPs in the **Ready** state. An SP in the +**Unhealthy** or **Unavailable** state is not eligible for routing, even though +an Unhealthy SP may be technically reachable. When the SP for a service type is +Unhealthy, incoming requests are held in the retry topic rather than rejected +(see [Retry Topic](#retry-topic)). -**When the last SP becomes Unhealthy:** +Since each service type is served by exactly one SP, the agent's behavior is +determined by that SP's health state: + +**When the SP becomes Unhealthy:** 1. The agent **keeps** the service type in its advertised list (no update sent to DCM to remove it) -2. 
The agent stops routing new requests to SPs for that service type — incoming - requests are held in the retry topic and a "queued" CloudEvent is sent to - DCM +2. The agent stops routing new requests for that service type — incoming + requests are held in the retry topic and a "queued" CloudEvent is sent to DCM 3. The agent publishes a health warning CloudEvent to `dcm.agents.health` with type `service-type-degraded` -**When the last SP becomes Unavailable:** +**When the SP becomes Unavailable:** 1. The agent removes the service type from its advertised list 2. The agent sends a `POST /api/v1/agents` request to DCM with the updated @@ -780,8 +924,8 @@ serving a given service type: 4. The agent publishes a health warning CloudEvent to `dcm.agents.health` with type `service-type-unavailable` -**When a previously unhealthy or unavailable SP recovers** (returns `200 OK` -with `status: "healthy"`): +**When a previously unhealthy or unavailable SP recovers** (returns to Ready +state): 1. If the service type was removed (Unavailable case): the agent re-adds it to its list and sends a `POST /api/v1/agents` to DCM with the updated @@ -834,12 +978,14 @@ service account. sequenceDiagram autonumber participant AG as Agent - participant SP as Service Provider + participant SP as External SP participant MS as Messaging System participant DCM as DCM
<br/>(Control Plane) participant DB as Database - loop Every {healthCheckInterval} seconds + Note over AG: Embedded SPs: health<br/>checked in-process + + loop Every {healthCheckInterval} seconds (external SPs) AG->>SP: GET /health alt Healthy SP-->>AG: 200 OK<br/>{status: "healthy"} @@ -854,13 +1000,13 @@ sequenceDiagram end end - alt Last SP for service type X becomes Unhealthy + alt SP for service type X becomes Unhealthy Note over AG: Keep service type X<br/>in advertised list.<br/>Hold incoming requests<br/>in retry topic. AG->>MS: PUBLISH CloudEvent<br/>topic: dcm.agents.health<br/>{type: "service-type-degraded",<br/>agentId, serviceType, reason,<br/>affectedProvider} MS->>DCM: PUSH health warning - else Last SP for service type X becomes Unavailable + else SP for service type X becomes Unavailable AG->>DCM: POST /api/v1/agents<br/>
{updated serviceTypes without X} activate DCM DCM->>DB: Update agent registration @@ -895,20 +1041,22 @@ sequenceDiagram ##### Flow Description -1. The agent periodically polls each registered SP's `GET /health` endpoint -2. Based on the response, the agent updates the SP's health state: - - `200 OK` with `status: "healthy"` → **Ready** (failure counter reset) - - `200 OK` with `status: "unhealthy"` → **Unhealthy** - - Timeout or error → increment failure counter; if counter exceeds threshold - → **Unavailable** -3. When the last SP serving a given service type becomes **Unhealthy**: - - The agent **keeps** the service type in its advertised list (no update - sent to DCM) +1. The agent monitors each registered SP's health: + - **Embedded SPs:** health checked in-process (no network call) + - **External SPs:** health checked by periodically polling `GET /health` +2. Based on the result, the agent updates the SP's health state: + - Healthy → **Ready** (failure counter reset) + - Unhealthy → **Unhealthy** + - Timeout or error (external) / internal failure (embedded) → increment + failure counter; if counter exceeds threshold → **Unavailable** +3. When the SP for a service type becomes **Unhealthy**: + - The agent **keeps** the service type in its advertised list (no update sent + to DCM) - Incoming requests for that service type are held in the retry topic (see [Retry Topic](#retry-topic)) - The agent publishes a `service-type-degraded` health warning CloudEvent to the `dcm.agents.health` topic -4. When the last SP serving a given service type becomes **Unavailable**: +4. When the SP for a service type becomes **Unavailable**: - The agent removes the service type from its advertised list - The agent sends a `POST /api/v1/agents` to DCM with the updated registration @@ -917,15 +1065,16 @@ sequenceDiagram - The agent publishes a `service-type-unavailable` health warning CloudEvent to the `dcm.agents.health` topic 5. 
When a previously unhealthy or unavailable SP recovers: - - If the service type was removed (Unavailable case): the agent re-adds it - to its list and sends a `POST /api/v1/agents` to DCM with the updated + - If the service type was removed (Unavailable case): the agent re-adds it to + its list and sends a `POST /api/v1/agents` to DCM with the updated registration - The agent processes held requests from the retry topic for that service type -6. The agent exposes the health status of all registered SPs via the - `GET /api/v1/status` endpoint. On Kubernetes/OpenShift deployments, the agent - additionally surfaces this information as custom pod conditions on its own - pod (see [Pod Conditions](#pod-conditions-kubernetes--openshift)) +6. The agent exposes the health status of all registered SPs (both embedded and + external) via the `GET /api/v1/status` endpoint. On Kubernetes/OpenShift + deployments, the agent additionally surfaces this information as custom pod + conditions on its own pod (see + [Pod Conditions](#pod-conditions-kubernetes--openshift)) ### CloudEvent Message Definitions @@ -962,8 +1111,8 @@ target service type (see agent - The agent has outbound network connectivity to DCM's REST API (for registration and heartbeats) -- SPs have network connectivity to the agent's REST API (for registration and - health checks) +- External SPs have network connectivity to the agent's REST API (for + registration and health checks) - For Kubernetes/OpenShift deployments: the agent's service account has RBAC permissions for the `pods/status` subresource @@ -971,11 +1120,12 @@ target service type (see | Risk | Mitigation | | -------------------------------------------------------------- | 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| Agent is a single point of failure per environment | Deferred to HA iteration. Agent restart recovers state via SP re-registration (SPs periodically re-register, naturally rebuilding agent state). | +| Agent is a single point of failure per environment | Deferred to HA iteration. Agent restart recovers state: embedded SPs register internally at startup; external SPs periodically re-register, naturally rebuilding the agent's state. | | Messaging system failure blocks creation requests | Dependent on chosen bus technology's delivery guarantees. Stated as an assumption. | | Message loss with at-most-once semantics | Rely on bus capabilities (e.g., JetStream for NATS). Specific delivery guarantee is a deployment decision. | | Split-brain: agent loses DCM connectivity but keeps processing | On reconnection, the agent re-registers to DCM. During the split, DCM marks the agent as unavailable and stops routing new requests to its topic. In-flight messages are processed normally. Duplicate creation risk if DCM re-routes to another agent is mitigated by idempotent resource creation (resource ID provided by DCM in the creation request). | -| Unauthenticated SP registration | Deferred to AuthN/Z iteration. Network isolation is the interim mitigation. | +| Unauthenticated external SP registration | Deferred to AuthN/Z iteration. Network isolation is the interim mitigation. | +| Embedded SP crash takes down the agent | Embedded SPs run in-process; a panic/crash affects the entire agent. Mitigation: embedded SP code is well-tested and isolated in dedicated packages. Process-level restart recovers state via re-registration. 
| ## Drawbacks @@ -984,86 +1134,65 @@ target service type (see - Adds latency to the creation path: DCM → messaging system → agent → SP, versus the current DCM → SP direct call - Fragments health monitoring responsibility: DCM monitors agent health via - heartbeats, while the agent monitors SP health via polling + heartbeats, while the agent monitors SP health directly (in-process for + embedded SPs, via polling for external SPs) - Requires messaging system infrastructure accessible to both DCM and all target environments +- Embedding SP code (K8s Container, ACM Cluster, KubeVirt) increases agent + binary size and couples the agent release cycle to the embedded SPs for + updates ## Alternatives -### Alternative 1: Monolithic Agent with Embedded SPs +### Alternative 1: Watch / Reconcile Pattern #### Description -Instead of separating the agent and Service Providers into distinct processes, -the agent binary would ship with SP code for a known set of SPs (e.g., ACM, -KubeVirt, K8s). At startup, the agent would detect available CRDs or backing -infrastructure on the environment and activate only the relevant SP code. +Instead of using a messaging system for creation requests, DCM would expose +resource requests through its own API. The agent would poll DCM's API or be +notified by DCM of new events, discover pending resource requests targeting its +environment, and reconcile them by forwarding the creation request to the +relevant SP and reporting the result back to DCM. This mimics the Kubernetes +controller pattern (watch → reconcile) but with DCM acting as the API server +rather than a Kubernetes cluster. 
#### Pros -- Single binary to deploy, no REST registration ceremony between agent and SPs -- No health monitoring overhead between agent and SPs (they share a process) -- Simpler deployment and operational model +- Familiar pattern for teams experienced with Kubernetes controllers +- Could eliminate the messaging system dependency for creation requests +- DCM retains full visibility of pending requests (they live in DCM's own + storage, not in a bus topic) +- No additional infrastructure beyond DCM itself — the agent only needs + outbound connectivity to DCM's API, which it already has for registration and + heartbeats #### Cons -- Tightly couples the agent to a fixed, predefined set of SPs -- Cannot support custom or third-party SPs without rebuilding the agent binary -- Agent binary grows with each new SP type -- Requires agent rebuild and redeployment to add support for a new service type +- Requires DCM to implement watch/notification semantics natively, which adds + complexity to the control plane +- The messaging system is still required for status reporting (SP → bus → DCM), + so this does not fully eliminate the messaging infrastructure dependency +- Maturity of a DCM-native watch system is unproven compared to established + messaging systems (e.g., NATS JetStream) #### Status -Rejected +Deferred #### Rationale -The agent must support arbitrary SPs, including custom ones developed by third -parties. Tight coupling between the agent and SP code prevents this -extensibility. The plugin-style model (separate processes, REST registration) -allows any SP that implements the registration API to participate, regardless of -who develops or deploys it. - -### Alternative 2: etcd / CRD Watch Pattern - -#### Description - -Instead of using a messaging system for creation requests, DCM would create -Custom Resource (CR) manifests (e.g., `ResourceRequest`) directly in the target -cluster's etcd via the Kubernetes API. 
The agent would run as a Kubernetes -controller, watching for these CRs and reconciling them by forwarding the -creation request to the relevant SP. This follows the native Kubernetes -controller pattern. - -#### Pros - -- Native Kubernetes pattern, well-understood and battle-tested -- Leverages existing etcd for persistence and watch semantics, no separate - messaging infrastructure needed -- Built-in HA via Kubernetes controller framework (leader election, informer - caching) - -#### Cons - -- Requires DCM to have kubeconfig/API access to each target cluster, - reintroducing DCM-to-environment connectivity that this enhancement aims to - eliminate -- Does not work for non-Kubernetes environments (Docker, standalone, etc.) -- Pushes the connectivity requirement from the agent (outbound) to DCM (outbound - to every cluster) - -#### Status - -Rejected - -#### Rationale +The watch/reconcile pattern's main advantage is eliminating the messaging system +for creation requests and keeping all request state within DCM. However, the +messaging system is already required for status reporting (SP → bus → DCM), so +removing it for creation requests alone does not eliminate the infrastructure +dependency. -A core motivation of this enhancement is removing the need for -DCM-to-environment inbound connectivity for creation requests. The CRD watch -pattern requires DCM to push CRs to the target cluster's API server, -reintroducing that dependency. Additionally, this approach limits the agent to -Kubernetes-based environments, conflicting with the goal of supporting -non-cluster environments. +Additionally, DCM does not currently expose watch/notification semantics. +Building a reliable, scalable watch system into DCM requires further +investigation — particularly around delivery guarantees, fan-out to multiple +agents, and behaviour under network partitions. 
This is deferred to a future +iteration when the trade-offs are better understood and the maturity level of a +DCM-native watch system can be assessed. ## Cross-Cutting Impact @@ -1073,7 +1202,7 @@ PRs. | Document | Impact | | -------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [SP Registration Flow](../sp-registration-flow/sp-registration-flow.md) | SPs register to the agent instead of DCM. The existing registration API contract remains valid for the agent's REST API, but DCM's registration handler no longer receives SP registrations directly. | +| [SP Registration Flow](../sp-registration-flow/sp-registration-flow.md) | External SPs register to the agent instead of DCM. The existing registration API contract remains valid for the agent's REST API, but DCM's registration handler no longer receives SP registrations directly. Embedded SPs register internally and do not use this flow. | | [Service Provider Health Check](../service-provider-health-check/service-provider-health-check.md) | Health polling responsibility shifts from DCM to the agent. DCM monitors agent health via heartbeats instead of polling individual SPs. | | [SP Resource Manager](../sp-resource-manager/sp-resource-manager.md) | SPRM publishes creation requests to the agent's bus topic instead of calling SP REST endpoints directly. SPRM interacts with the agent (not individual SPs) for health status. From SPRM's perspective, the agent serves the same role as a SP: provisioning service types. | | [Placement Manager](../placement-manager/placement-manager.md) | Policy evaluation may now include environment as a selection criterion. 
Placement Manager delegates to SPRM, which routes through the messaging system. | From 970f9cd04aee9674c62cbcaff30b4f2a2d5a3cd4 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Mon, 22 Jun 2026 11:57:33 +0200 Subject: [PATCH 19/20] format Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 24 +++++++++---------- 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index ffcfbcd..ba6b775 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -79,8 +79,8 @@ Provider (SP) by a policy on the base of several criteria. Once the SP is selected, DCM will send a request to the selected SP to request the creation of the resource. -There is currently no way for a policy to determine in which environment an SP is -running and hence a user cannot explicitly set the targeted environment +There is currently no way for a policy to determine in which environment an SP +is running and hence a user cannot explicitly set the targeted environment constraint when requesting the creation of a resource. Furthermore, with the current way of submitting creation requests, the @@ -337,14 +337,14 @@ Example response: Register a new agent to DCM. 
-| Field | Type | Required | Description | -| ------------------ | -------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| name | string | yes | Unique agent name | -| environment | string | yes | Freeform environment identifier (e.g., `"dev"`, `"staging"`, `"prod-eu-west-1"`) | +| Field | Type | Required | Description | +| ------------------ | -------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| name | string | yes | Unique agent name | +| environment | string | yes | Freeform environment identifier (e.g., `"dev"`, `"staging"`, `"prod-eu-west-1"`) | | serviceTypes | string[] | yes | List of service types the agent can serve. Must be non-empty on initial registration (prerequisite: at least one healthy SP, embedded or external). May be empty on subsequent re-registrations when SPs become unavailable (an Unhealthy SP does not trigger service type removal — see [SP Health Monitoring](#sp-health-monitoring)). 
| -| resourcesAvailable | object | no | Available resources in the environment — sourced from K8s node info or manual configuration (see below) | -| cost | enum | yes | Cost tier: `low` \| `medium-low` \| `medium` \| `medium-high` \| `high` | -| topicName | string | yes | Deterministic topic name for the agent's messaging channel | +| resourcesAvailable | object | no | Available resources in the environment — sourced from K8s node info or manual configuration (see below) | +| cost | enum | yes | Cost tier: `low` \| `medium-low` \| `medium` \| `medium-high` \| `high` | +| topicName | string | yes | Deterministic topic name for the agent's messaging channel | Response: `201 Created` with `{agentId}` @@ -1120,7 +1120,7 @@ target service type (see | Risk | Mitigation | | -------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| Agent is a single point of failure per environment | Deferred to HA iteration. Agent restart recovers state: embedded SPs register internally at startup; external SPs periodically re-register, naturally rebuilding the agent's state. | +| Agent is a single point of failure per environment | Deferred to HA iteration. Agent restart recovers state: embedded SPs register internally at startup; external SPs periodically re-register, naturally rebuilding the agent's state. | | Messaging system failure blocks creation requests | Dependent on chosen bus technology's delivery guarantees. Stated as an assumption. | | Message loss with at-most-once semantics | Rely on bus capabilities (e.g., JetStream for NATS). Specific delivery guarantee is a deployment decision. 
| | Split-brain: agent loses DCM connectivity but keeps processing | On reconnection, the agent re-registers to DCM. During the split, DCM marks the agent as unavailable and stops routing new requests to its topic. In-flight messages are processed normally. Duplicate creation risk if DCM re-routes to another agent is mitigated by idempotent resource creation (resource ID provided by DCM in the creation request). | @@ -1162,8 +1162,8 @@ rather than a Kubernetes cluster. - Could eliminate the messaging system dependency for creation requests - DCM retains full visibility of pending requests (they live in DCM's own storage, not in a bus topic) -- No additional infrastructure beyond DCM itself — the agent only needs - outbound connectivity to DCM's API, which it already has for registration and +- No additional infrastructure beyond DCM itself — the agent only needs outbound + connectivity to DCM's API, which it already has for registration and heartbeats #### Cons From 1c40ce75d81df25a3a82e9166492567dd621813c Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Mon, 22 Jun 2026 11:57:44 +0200 Subject: [PATCH 20/20] Reflect Agent Environment introduction to other enhancements Assisted by: Claude Code - claude-opus-4-6 Signed-off-by: gabriel-farache --- .../placement-manager/placement-manager.md | 315 +++++++--- enhancements/policy-engine/policy-engine.md | 97 ++- .../service-provider-health-check.md | 205 +++++-- .../sp-registration-flow.md | 201 +++--- .../sp-resource-manager.md | 277 ++++++--- enhancements/user-flows/user-flows.md | 571 +++++++++++++----- 6 files changed, 1220 insertions(+), 446 deletions(-) diff --git a/enhancements/placement-manager/placement-manager.md b/enhancements/placement-manager/placement-manager.md index 2b5c0a0..afd1f40 100644 --- a/enhancements/placement-manager/placement-manager.md +++ b/enhancements/placement-manager/placement-manager.md @@ -12,6 +12,8 @@ reviewers: - "@gabriel-farache" - "@ebichman" creation-date: 2026-01-09 +see-also: 
+ - /enhancements/environment-agent/environment-agent.md --- # Placement Manager @@ -20,19 +22,23 @@ creation-date: 2026-01-09 The Placement Manager orchestrates resource requests within DCM core. It receives user requests through the Catalog Manager, validates and enriches them -through the Policy Manager, and delegates instance creation to the SP Resource -Manager. The Placement Manager focuses on request orchestration and -coordination. +through the Policy Manager (which now selects an Agent), and delegates instance +creation and deletion to the SP Resource Manager, which routes through the +Messaging System to an Agent. The Placement Manager also handles queued-request +timeout logic when an Agent reports that the Service Provider for the requested +service type is unhealthy. ## Motivation ### Goals -- Define end-to-end flow of for creating resources +- Define end-to-end flow for creating resources +- Define end-to-end flow for deleting resources (deletion flow) - Define _Create_, _Read_, _Delete_ endpoints for Placement Manager -- Define Placement Manager interacts with other services within DCM core +- Define how Placement Manager interacts with other services within DCM core (Catalog Manager, Policy Manager, SP Resource Manager) - Define orchestration responsibilities for Placement Manager +- Define queued-request timeout logic for agent-based routing ### Non-Goals @@ -44,8 +50,12 @@ coordination. The Placement Manager acts as the central orchestration service within DCM core, coordinating between user requests (from Catalog), policy validation, and -catalog instance creation. The following diagram illustrates the system -architecture and component interactions. +instance lifecycle management. The Policy Manager selects an Agent, and the SP +Resource Manager publishes requests to the Agent's messaging topic. The Agent +internally routes to its Service Providers. + +The following diagram illustrates the system architecture and component +interactions. 
```mermaid %%{init: {'flowchart': {'rankSpacing': 100, 'nodeSpacing': 10, 'curve': 'linear'},}}%% @@ -55,26 +65,28 @@ flowchart TD classDef policyEngine fill:#2d2d2d,color:#ffffff,stroke:#ffb74d,stroke-width:2px classDef spResourceManager fill:#2d2d2d,color:#ffffff,stroke:#81c784,stroke-width:2px classDef database fill:#2d2d2d,color:#ffffff,stroke:#f48fb1,stroke-width:2px + classDef messaging fill:#2d2d2d,color:#ffffff,stroke:#ff8a65,stroke-width:2px + classDef agent fill:#2d2d2d,color:#ffffff,stroke:#a5d6a7,stroke-width:2px classDef dcmCore fill:#FFFFFF,stroke:#bdbdbd,stroke-width:2px CM["**Catalog Manager**
<br/>Send Request"]:::catalogManager subgraph DCM_Core [ ] - PM["**Placement Manager**<br/>"]:::placementManager - - PE["**Policy Manager**<br/>Request Validation<br/>Payload Mutation<br/>SP Selection"]:::policyEngine - - SPRM["**SP Resource Manager**<br/>Create Instance<br/>Read Instances<br/>Delete Instances"]:::spResourceManager - + PM["**Placement Manager**<br/>Orchestrate & Timeout"]:::placementManager + PE["**Policy Manager**<br/>Request Validation<br/>Payload Mutation<br/>Agent Selection"]:::policyEngine + SPRM["**SP Resource Manager**<br/>Publish to Agent Topic<br/>Consume Responses"]:::spResourceManager PM_DB[("**Placement DB**<br/>Store Intent<br/>Store validated request")]:::database - end + MS["**Messaging System**<br/>(NATS)"]:::messaging + AG["**Agent**<br/>
Routes to SPs"]:::agent + CM --> PM PM --> PE PM --> PM_DB PM --> SPRM - + SPRM --> MS + MS --> AG class DCM_Core dcmCore ``` @@ -83,15 +95,19 @@ flowchart TD #### Catalog Service -- Receives resource creation requests from users +- Receives resource creation and deletion requests from users - Provides REST API endpoints for _create_, _read_, _delete_ operations on catalog instances - Returns responses and error messages to users #### Policy Manager -- Sends requests for validation via `POST /api/v1/engine/evaluate` -- Receives validated/mutated payload and selected Service Provider +- Sends requests for validation via + `POST /api/v1alpha1/policies:evaluateRequest` +- Provides `available_agents` metadata in the evaluation request +- Optionally includes `exclude_agents` to exclude agents from consideration + (e.g., after a queued-request timeout) +- Receives validated/mutated payload and selected Agent (`agentName`) - Receives policy rejections and constraint violations responses and forwards to the users @@ -99,13 +115,18 @@ flowchart TD - Delegates instance creation, read, and delete operations to SP Resource Manager -- Forwards validated requests with selected SP name +- Forwards `agentName`, `serviceType`, and `spec` in requests +- SPRM publishes to the agent's messaging topic - Receives responses and forwards to the users +- Reports back: success (202), error, or queued status +- When SPRM reports "queued" status, PM handles timeout logic (see + [Queued-Request Handling](#queued-request-handling)) #### Database - Stores the intent (original request) of the user request -- Store validated request and enables rehydration process +- Stores validated request (including `agentName`) and enables rehydration + process - Maintains record of all resources created through Placement Manager ### API Endpoints @@ -123,7 +144,7 @@ resources. 
| DELETE | /api/v1/resources/{resourceId} | Delete a resource | | GET | /api/v1/health | Placement Manager health check | -**POST /api/v1/resources - Create an resource.** +**POST /api/v1/resources - Create a resource.** The POST endpoint creates a resource that is supported by DCM. The resource request is an instance of a catalog item and originates from the user (UI) @@ -151,6 +172,8 @@ requestBody: description: | Service specification following one of the supported service type schemas (VMSpec, ContainerSpec, DatabaseSpec, or ClusterSpec). + The `serviceType` field within the spec determines which Agent + and Service Provider can fulfill the request. additionalProperties: true ``` @@ -180,7 +203,7 @@ Response payload: Returns 201 Created if successful. "id": "08aa81d1-a0d2-4d5f-a4df-b80addf07781", "path": "resources/08aa81d1-a0d2-4d5f-a4df-b80addf07781", "catalogItemInstanceId": "4baa35eb-e70d-4d37-867d-0f4efa21d05c", - "providerName": "kubevirt-sp", + "agentName": "prod-eu-agent", "spec": { "serviceType": "vm", "vcpu": { "count": 2 }, @@ -197,8 +220,7 @@ Response payload: Returns 201 Created if successful. **Note**: This is **only** an example of the payload. -**GET /api/v1/resources** -List all resources according to AEP standards. +**GET /api/v1/resources** List all resources according to AEP standards. 
Example of Response Payload @@ -209,7 +231,7 @@ Example of Response Payload "id": "696511df-1fcb-4f66-8ad5-aeb828f383a0", "path": "resources/696511df-1fcb-4f66-8ad5-aeb828f383a0", "catalogItemInstanceId": "52540146-6212-4514-b534-0c3127b2836f", - "providerName": "container-sp", + "agentName": "prod-us-agent", "spec": { "serviceType": "container", "image": { "reference": "docker.io/nginx:latest" }, @@ -227,7 +249,7 @@ Example of Response Payload "id": "c66be104-eea3-4246-975c-e6cc9b32d74d", "path": "resources/c66be104-eea3-4246-975c-e6cc9b32d74d", "catalogItemInstanceId": "4baa35eb-e70d-4d37-867d-0f4efa21d05c", - "providerName": "postgres-sp", + "agentName": "prod-eu-agent", "spec": { "serviceType": "database", "engine": "postgresql", @@ -243,7 +265,7 @@ Example of Response Payload "id": "08aa81d1-a0d2-4d5f-a4df-b80addf07781", "path": "resources/08aa81d1-a0d2-4d5f-a4df-b80addf07781", "catalogItemInstanceId": "f3645f8f-82c1-4efb-888f-318c0ac81a08", - "providerName": "kubevirt-sp", + "agentName": "prod-eu-agent", "spec": { "serviceType": "vm", "vcpu": { "count": 2 }, @@ -261,8 +283,7 @@ Example of Response Payload } ``` -**GET /api/v1/resources/{resourceId}** -Get a resource based on id. +**GET /api/v1/resources/{resourceId}** Get a resource based on id. Example of Response Payload @@ -271,7 +292,7 @@ Example of Response Payload "id": "08aa81d1-a0d2-4d5f-a4df-b80addf07781", "path": "resources/08aa81d1-a0d2-4d5f-a4df-b80addf07781", "catalogItemInstanceId": "d6ebf344-bfd1-44c9-bc25-97f9fb856f22", - "providerName": "kubevirt-sp", + "agentName": "prod-eu-agent", "spec": { "serviceType": "vm", "vcpu": { "count": 4 }, @@ -286,11 +307,9 @@ Example of Response Payload } ``` -**Delete /api/v1/resources/{resourceId}** -Delete a resource based on id. +**DELETE /api/v1/resources/{resourceId}** Delete a resource based on id. -**GET /api/v1/health** -Retrieve the health status of Placement Manager. +**GET /api/v1/health** Retrieve the health status of Placement Manager. 
Example of Response Payload @@ -306,7 +325,7 @@ Example of Response Payload ### Service Creation Flow The following sequence diagram illustrates the complete flow for creating a -resources via the `POST /api/v1/resources` endpoint. +resource via the `POST /api/v1/resources` endpoint. ```mermaid sequenceDiagram @@ -321,43 +340,66 @@ sequenceDiagram activate PM PM->>DB: Store intent
{originalRequest} - activate DB DB-->>PM: Intent stored - deactivate DB - PM->>PE: POST /api/v1/engine/evaluate
{requestPayload, userId, tenantId} + PM->>DB: Fetch available agents
(healthy, non-Congested) + DB-->>PM: available_agents list + + PM->>PE: POST /api/v1alpha1/policies:evaluateRequest
{service_instance: {spec}, available_agents} activate PE - PE-->>PM: Validated/mutated payload
& selected providerName + PE-->>PM: Validated/mutated payload
& selectedAgent deactivate PE alt Policy validation fails - PM-->>CM: Error response
(Policy rejection) - deactivate PM + PM->>DB: Delete intent record + PM-->>CM: Error response (policy rejection) else Policy validation succeeds - PM->>DB: Store validated request
{validatedPayload, providerName} - activate DB - DB-->>PM: Validated request stored - deactivate DB + PM->>DB: Store validated request
{validatedPayload, agentName} - PM->>SPRM: POST /api/v1/service-types/instances
{providerName, serviceType, spec} + PM->>SPRM: POST /api/v1/service-type-instances
{agentName, serviceType, spec} activate SPRM - alt SP Resource Manager fails + alt SPRM returns error (404/503) SPRM-->>PM: Error response - PM-->>CM: Error response
(Instance creation failed) + PM->>DB: Delete records + PM-->>CM: Error response deactivate SPRM - else Instance creation succeeds - SPRM-->>PM: Success response
{instanceId, status, metadata} - activate DB - deactivate DB - - PM-->>CM: 201 Created
{Resource} + else SPRM returns 202 Accepted + SPRM-->>PM: 202 Accepted
{instanceId, agentName, status: PENDING} + deactivate SPRM + PM-->>CM: 201 Created {Resource} + end + end + Note over SPRM: Async: SPRM consumes response
from dcm.agents.responses + + opt SPRM notifies PM of QUEUED status + SPRM->>PM: Notify: instance QUEUED
{instanceId, agentName} + Note over PM: Start queuedRequestTimeout timer + + alt Timeout expires (or timeout = 0) + PM->>SPRM: DELETE /api/v1/service-type-instances/{instanceId} + Note over PM: Re-evaluate excluding current agent + + PM->>PE: POST /api/v1alpha1/policies:evaluateRequest
{service_instance: {spec}, available_agents, exclude_agents: [agentName]} + activate PE + PE-->>PM: New selectedAgent or no match + deactivate PE + + alt Alternative agent found + PM->>SPRM: POST /api/v1/service-type-instances
{newAgentName, serviceType, spec} + SPRM-->>PM: 202 Accepted + PM-->>CM: 201 Created {Resource} + else No agent available + PM->>DB: Delete records + PM-->>CM: Error: no agent available + end end end + deactivate PM ``` #### Flow Description @@ -374,47 +416,164 @@ sequenceDiagram - This enables rehydration and tracking of the user's original request - Intent is stored before any processing to ensure request persistence -3. **Policy Validation** +3. **Fetch Available Agents** -- Placement Manager forwards the request to Policy Manager for validation +- Placement Manager queries the Agent Registry for healthy, non-Congested agents + that support the requested service type +- The resulting `available_agents` list is passed to the Policy Manager for + evaluation + +4. **Policy Validation** + +- Placement Manager forwards the request to Policy Manager with + `available_agents` and optional `exclude_agents` - Policy Manager evaluates requests against policies - Policy Manager returns: - Approved or rejected - Validated and potentially mutated payload - - Selected Service Provider name (`providerName`) + - Selected Agent name (`selectedAgent`) - Policy constraints and patches applied - If policy validation fails (request rejected or constraint violation): - - Delete record from Placement DB + - Delete intent record from Placement DB - Placement Manager returns error response to Catalog Manager - Request processing stops - If policy validation succeeds: - Placement Manager stores the validated request in Placement DB which - includes the validated/mutated payload and selected `providerName` + includes the validated/mutated payload and selected `agentName` + +5. **Store Validated Request** -4. **Instance Creation** +- Placement Manager persists the validated/mutated payload along with the + `agentName` returned by the Policy Manager +- This enables rehydration and audit + +6. 
**Instance Creation** - Placement Manager delegates instance creation to SP Resource Manager -- Forwards the validated request with `providerName`, `serviceType`, and `spec` -- SP Resource Manager handles SP lookup, health checks, and instance - provisioning -- If SP Resource Manager fails to create the instance: - - Error response is returned to Placement Manager - - Delete record from Placement DB - - Placement Manager forwards the error to Catalog Manager - - Request processing stops -- If instance creation succeeds: - - SP Resource Manager returns success response with `instanceId`, `status` - - Placement Manager returns 201 Created to Catalog Manager with a full - `Resource` object - - The resource is now in a `PROVISIONING` state +- Forwards `agentName`, `serviceType`, and `spec` +- SP Resource Manager publishes the request to the agent's messaging topic +- SPRM always responds synchronously with one of: + - **SPRM returns error (404/503)**: Error response returned to Placement + Manager. Records deleted from Placement DB. Placement Manager forwards the + error to Catalog Manager. Request processing stops. + - **SPRM returns 202 Accepted**: Instance creation is in progress. Placement + Manager returns 201 Created to Catalog Manager with a full `Resource` + object. The resource is now in a `PENDING` state. + +7. **Queued-Request Handling (Asynchronous)** + +- After SPRM returns 202, it continues to consume responses from + `dcm.agents.responses`. 
If the Agent reports a `dcm.agent.request-queued` + CloudEvent (the SP for the requested service type is unhealthy), SPRM + asynchronously notifies Placement Manager of the `QUEUED` status +- Upon receiving the QUEUED notification, Placement Manager starts a + `queuedRequestTimeout` timer +- On timeout expiry (or immediately if `queuedRequestTimeout = 0`): + - PM tells SPRM to DELETE the queued request + - PM re-evaluates policies by calling the Policy Manager again, this time + including `exclude_agents: [agentName]` to exclude the timed-out agent + - If an alternative agent is found: PM sends a new creation request to SPRM + with the new agent + - If no alternative agent is available: PM deletes records from Placement DB + and returns an error to Catalog Manager + +### Service Deletion Flow + +The following sequence diagram illustrates the complete flow for deleting a +resource via the `DELETE /api/v1/resources/{resourceId}` endpoint. + +```mermaid +sequenceDiagram + autonumber + participant CM as Catalog Manager + participant PM as Placement Manager + participant DB as Placement DB + participant SPRM as SP Resource Manager + + CM->>PM: DELETE /api/v1/resources/{resourceId} + activate PM + + PM->>DB: Lookup resource
Get agentName, serviceType, instanceId + + PM->>SPRM: DELETE /api/v1/service-type-instances/{instanceId} + activate SPRM + + alt SPRM returns error + SPRM-->>PM: Error response + PM-->>CM: Error response + else SPRM returns 202 Accepted + SPRM-->>PM: 202 Accepted
{instanceId, agentName, status: DELETING} + PM->>DB: Update resource status to DELETING + PM-->>CM: 200 OK + end + deactivate SPRM + + Note over SPRM: Async: SPRM consumes response
from dcm.agents.responses + + opt SPRM notifies PM of QUEUED status + SPRM->>PM: Notify: instance QUEUED
{instanceId, agentName} + Note over PM: Same timeout logic as creation + end + deactivate PM +``` + +#### Flow Description + +1. **Request Reception** + +- Catalog Manager sends a DELETE request to Placement Manager with the + `resourceId` + +2. **Resource Lookup** + +- Placement Manager queries Placement DB to retrieve the resource record, + including the `agentName`, `serviceType`, and `instanceId` needed for deletion -#### Key Characteristics/Notes +3. **Delegation to SP Resource Manager** + +- Placement Manager sends a DELETE request to SPRM with the `instanceId` +- SPRM publishes a deletion CloudEvent to the agent's messaging topic +- SPRM always responds synchronously with one of: + - **SPRM returns error**: Error response returned to Placement Manager, which + forwards it to Catalog Manager + - **SPRM returns 202 Accepted**: Deletion is in progress. PM updates the + resource status to `DELETING` in Placement DB and returns 200 OK to Catalog + Manager +- **SPRM notifies QUEUED (asynchronous)**: After returning 202, SPRM may + asynchronously notify PM of a `QUEUED` status if the Agent reports the SP for + the service type is unhealthy. The same `queuedRequestTimeout` logic applies + as in the creation flow (see + [Queued-Request Handling](#queued-request-handling)) + +### Configuration + +| Parameter | Type | Default | Description | +| ---------------------- | -------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `queuedRequestTimeout` | Duration | `300s` | Maximum time PM waits when SPRM reports a "queued" status before cancelling the request and re-evaluating policies excluding the current agent. When set to `0`, PM immediately re-evaluates without waiting. Applies to both creation and deletion requests. 
| + +### Key Characteristics/Notes - **Intent Preservation**: Original user request is stored before processing for audit and rehydration purposes -- **Policy-Driven**: Service Provider selection and request validation are - handled by Policy Manager -- **Error Handling**: Clear error paths for policy rejections and instance - creation failures +- **Policy-Driven**: Agent selection and request validation are handled by + Policy Manager +- **Agent-Based Selection**: Service Provider selection is no longer a direct + concern of the Placement Manager. The Policy Engine selects an Agent based on + environment, service types, and cost. The Agent internally selects the SP. +- **Queued-Request Timeout**: When SPRM reports a "queued" status (the SP for + the requested service type on the agent is unhealthy), PM applies a + configurable timeout. On expiry, PM cancels the request and re-evaluates + policies excluding the timed-out agent. +- **Error Handling**: Clear error paths for policy rejections, instance creation + failures, and queued-request timeouts - **State Management**: Both original intent and validated request are stored for complete request lifecycle tracking and rehydration purposes + +### Next Steps + +- Per-agent timeout overrides (allow different `queuedRequestTimeout` values per + agent) +- Retry limits on re-evaluation (cap the number of times PM re-evaluates after + excluding agents) +- PM-level request priority/ordering (prioritize certain requests over others + when re-evaluating) diff --git a/enhancements/policy-engine/policy-engine.md b/enhancements/policy-engine/policy-engine.md index b37f4b9..c75751e 100644 --- a/enhancements/policy-engine/policy-engine.md +++ b/enhancements/policy-engine/policy-engine.md @@ -13,6 +13,8 @@ reviewers: approvers: - TBD creation-date: 2025-12-15 +see-also: + - "/enhancements/environment-agent/environment-agent.md" --- # Policy API & Execution Engine @@ -20,7 +22,11 @@ creation-date: 2025-12-15 ## Summary This ADR 
defines the Management and Execution API and Workflow of the DCM Policy -Engine +Engine. + +With the introduction of the Environment Agent layer, the Policy Engine selects +an Agent (rather than a Service Provider) to handle the request, and can +constrain selection by environment. ## Motivation @@ -28,10 +34,9 @@ The Policy Engine operates as a specialized microservice within the Data Center Management (DCM) application responsible for governing service creation and modification (e.g., VirtualMachines, Containers). It enables Admins, Tenant-Admins, and Users to inject logic that validates (Approve/Reject), -mutates (Defaulting/Altering) and assigns Service Providers to request payloads -using an embedded -[Open Policy Agent (OPA)](https://www.openpolicyagent.org/docs) engine and -[Rego](https://www.openpolicyagent.org/docs/policy-language). +mutates (Defaulting/Altering) and assigns Agents to request payloads using an +embedded [Open Policy Agent (OPA)](https://www.openpolicyagent.org/docs) engine +and [Rego](https://www.openpolicyagent.org/docs/policy-language). OPA is embedded as a Go library within the Policy Engine process rather than deployed as a separate sidecar service. Rego source code is persisted in the @@ -75,7 +80,8 @@ Every policy may return one or more of the following outputs by providing a patch map. 3. **Field Constraints:** Defining the mutability of fields for _subsequent_ policies in the chain. -4. **Service Provider Selection:** Policies may set a value and/or constraints +4. 
**Agent Selection:** Policies may set a target agent and/or agent constraints + (including environment constraints) ### Policy Scope & Hierarchy (Execution Order) @@ -100,10 +106,14 @@ The input payload includes: they will need to know the expected content - `constraints` - The current constraints context (accumulated from prior policies) -- `provider` - The currently selected service provider (empty string initially, - populated as policies are evaluated) -- `service_provider_constraints` - The current service provider constraints - (accumulated from prior policies) +- `agent` - The currently selected agent (empty string initially, populated as + policies are evaluated) +- `agent_constraints` - The current agent constraints (accumulated from prior + policies) +- `available_agents` - List of eligible agents with metadata + `[{name, environment, serviceTypes, cost}]`, provided by Placement Manager +- `exclude_agents` - List of agent names to exclude from selection (used during + re-evaluation after queued timeout) #### Output @@ -113,11 +123,14 @@ following elements - **rejected** (bool) - since requests are approved by default, policies may reject them. 
- **rejection_reason** (string, optional) - reason for rejection -- **selected_provider** (string, optional) - the name of the service provider - chosen to fulfill the request -- **service_provider_constraints** (object, optional) - - - `allow_list` - list of allowed service provider names - - `patterns` - list of regex patterns for matching allowed providers +- **selected_agent** (string, optional) - the name of the agent chosen to handle + the request +- **agent_constraints** (object, optional) - + - `allow_list` - list of allowed agent names + - `patterns` - list of regex patterns for matching allowed agents + - `environment_constraints` - environment-level constraints + - `allow_list` - list of allowed environment identifiers + - `patterns` - list of regex patterns for matching allowed environments - **patch** (map, optional) - a dictionary of the corresponding service type for setting values. Each internal key is optional - **constraints** (map, optional) - follows @@ -266,7 +279,7 @@ sequenceDiagram Database-->>PolicyEngine: List of policies loop For each policy - PolicyEngine->>PolicyEngine: Evaluate policy (embedded OPA) + PolicyEngine->>PolicyEngine: Evaluate policy (embedded OPA)
{spec, agent, constraints, agent_constraints} PolicyEngine->>PolicyEngine: Enforce constraints PolicyEngine->>PolicyEngine: Mutate payload alt Policy rejected or constraint violation @@ -275,7 +288,7 @@ sequenceDiagram end end - PolicyEngine-->>PlacementManager: Success with updated payload + PolicyEngine-->>PlacementManager: Success with {evaluatedServiceInstance, selectedAgent, status} PlacementManager-->>User: Service created ``` @@ -287,6 +300,10 @@ sequenceDiagram - Service Instance - spec - the service specification (flexible schema) +- available_agents - list of agents with metadata (provided by PM) + `[{name, environment, serviceTypes, cost}]` +- exclude_agents - list of agent names to exclude (optional, used for + re-evaluation) ###### Execution Logic & Flow @@ -299,9 +316,11 @@ parallel with policy management operations. - The Policy API maintains a `ConstraintContext` map in memory for the duration of the request. +- Pre-filter: Remove any agents in `exclude_agents` from the `available_agents` + list before evaluation begins. - Fetch & Sort: - Query DB for enabled policies matching the request payload based on the - policy’s matching criteria. + policy's matching criteria. - Sort by Level (Global -> Tenant -> User) then Priority (Desc). - If no policies matching the request payload were found, the request will return successfully @@ -310,9 +329,11 @@ parallel with policy management operations. 
- Invoke the policy's package main rule - Pass - `spec` - the current patched request payload - - `provider` - the currently selected service provider + - `agent` - the currently selected agent - `constraints` - the accumulated constraint context (if any) - - `service_provider_constraints` - the accumulated SP constraints (if any) + - `agent_constraints` - the accumulated agent constraints (if any) + - `available_agents` - the pre-filtered list of eligible agents + - `exclude_agents` - the list of excluded agent names - Check `Reject` - If `Reject` is `true`, ABORT IMMEDIATELY (Fail Fast). Return 406. - Validate `Constraints`: @@ -327,13 +348,16 @@ parallel with policy management operations. patch the `region`, ABORT with "Policy Conflict Error" - Apply `Patch` - Update service_payload with valid patches. - - Validate `ServiceProvider` - - If Policy P returned a `selected_provider` and - `service_provider_constraints` exist, validate the selected provider - against the constraints. - -- Finalize: Return the final payload, selected provider, and status to Placement + - Validate `Agent` + - If Policy P returned a `selected_agent` and `agent_constraints` exist, + validate the selected agent against the constraints. + - If `environment_constraints` exist, validate the selected agent's + environment against those constraints (Policy Engine uses + `available_agents` metadata for this). + +- Finalize: Return the final payload, selected agent, and status to Placement Manager. + - Response: `{evaluatedServiceInstance, selectedAgent, status}` - Status is `APPROVED` if the payload was not modified, `MODIFIED` if any patches were applied. @@ -347,3 +371,24 @@ parallel with policy management operations. - Patch: {"billing_tag": "marketing"} - Action: Engine checks Context. billing_tag is immutable. - Result: Error. The User policy violates the Global constraint. 
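As a rough illustration of the pre-filter and `Validate Agent` steps in the execution logic above, the following Python sketch removes excluded agents and checks a selected agent against agent and environment constraints. The `NameConstraint` class and both function names are hypothetical. The rule for combining `allow_list` with `patterns` (a value passes if it is listed or matches any pattern, and an empty constraint allows everything) is an assumption, since the proposal does not spell it out.

```python
import re
from dataclasses import dataclass, field


@dataclass
class NameConstraint:
    """Hypothetical allow_list/patterns pair, usable for agents or environments."""
    allow_list: list = field(default_factory=list)
    patterns: list = field(default_factory=list)

    def allows(self, value: str) -> bool:
        # Assumption: an empty constraint allows everything; otherwise a value
        # passes if it is listed or fully matches at least one regex pattern.
        if not self.allow_list and not self.patterns:
            return True
        if value in self.allow_list:
            return True
        return any(re.fullmatch(p, value) for p in self.patterns)


def prefilter(available_agents, exclude_agents):
    """Drop agents named in exclude_agents before any policy is evaluated."""
    excluded = set(exclude_agents)
    return [a for a in available_agents if a["name"] not in excluded]


def validate_selected_agent(selected_agent, available_agents, agent_c, env_c):
    """Return (ok, reason) for a policy's selected_agent."""
    by_name = {a["name"]: a for a in available_agents}
    agent = by_name.get(selected_agent)
    if agent is None:
        return False, "agent not in available_agents"
    if not agent_c.allows(agent["name"]):
        return False, "agent violates agent_constraints"
    # The environment is looked up from the available_agents metadata.
    if not env_c.allows(agent["environment"]):
        return False, "environment violates environment_constraints"
    return True, ""


if __name__ == "__main__":
    agents = [
        {"name": "prod-eu-agent", "environment": "prod-eu-west-1"},
        {"name": "dev-agent", "environment": "dev"},
    ]
    env = NameConstraint(allow_list=["prod-eu-west-1", "prod-us-east-1"])
    print(validate_selected_agent("prod-eu-agent", agents, NameConstraint(), env))
    print(validate_selected_agent("dev-agent", agents, NameConstraint(), env))
```

Returning a reason string alongside the boolean keeps the sketch close to the fail-fast behaviour described in the text, where the engine aborts and reports which constraint was violated.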
+ +###### _Agent/Environment Constraint Validation Example_ + +- Step 1 (Global Policy): + - agent_constraints: {environment_constraints: {allow_list: ["prod-eu-west-1", + "prod-us-east-1"]}} + - Result: Only agents in prod-eu-west-1 or prod-us-east-1 are eligible +- Step 2 (Tenant Policy): + - selected_agent: "prod-eu-agent" + - Validation: Agent's environment is "prod-eu-west-1" (looked up from + available_agents metadata) — matches allow_list. Valid. +- Step 3 (User Policy): + - selected_agent: "dev-agent" + - Validation: Agent's environment is "dev" (looked up from available_agents + metadata) — NOT in allow_list. Error: violates Global constraint. + +## Next Steps + +- Cost-based agent selection within agent_constraints +- Resource capacity constraints (totalCpu, totalMemory) +- SP-level constraints passed through to agents diff --git a/enhancements/service-provider-health-check/service-provider-health-check.md b/enhancements/service-provider-health-check/service-provider-health-check.md index 191d6c4..9e55eba 100644 --- a/enhancements/service-provider-health-check/service-provider-health-check.md +++ b/enhancements/service-provider-health-check/service-provider-health-check.md @@ -13,77 +13,194 @@ reviewers: approvers: - "" creation-date: 2025-12-15 +see-also: + - "/enhancements/environment-agent/environment-agent.md" --- # Service Provider Health Check ## Summary -This enhancement proposes a mechanism for the DCM control plane to actively -monitor the health of service providers. Instead of providers pushing -heartbeats, the DCM control plane will poll a `/health` endpoint on the service -provider to verify liveness and backing provider health. +The Environment Agent monitors SP health using two mechanisms: in-process checks +for embedded SPs (K8s Container, ACM Cluster, KubeVirt) and polling the +`/health` endpoint for external SPs. DCM monitors Agent health via heartbeats +and consumer lag reporting. 
## Motivation -Define the DCM control plane way to determine if a service provider is -accessible. Without an active check, the control plane might attempt to schedule -services on providers that are down. +Define how SP health is monitored by the Agent, and how Agent health and +congestion are monitored by DCM. ### Goals -- Implement a polling mechanism where DCM checks provider health. +- Define the polling mechanism where the Agent checks SP health. - Define a standard `/health` endpoint for all Service Providers. +- Define the heartbeat mechanism by which DCM monitors Agent health. +- Define consumer lag monitoring and the Congested agent state. ### Non-Goals - Status reporting of individual services running _on_ the provider. - Deep provider diagnostics (out of scope for liveness check). -- Ensure DCM excludes "Unhealthy" or "Unreachable" providers from scheduling. +- Agent high availability (deferred to HA iteration). ## Proposal ### Overview -The DCM Control Plane will act as the "prober." It will maintain a list of -registered service providers URLs. At a configurable interval, DCM will perform -an HTTP GET request to the provider's `/health` endpoint. +The Agent acts as the prober for SP health. The monitoring mechanism differs by +SP type: embedded SPs are checked in-process (the agent directly checks the +embedded SP's internal state without a network call), while external SPs are +checked by polling their `/health` endpoint at a configurable interval. DCM +monitors Agent health via periodic REST heartbeats and tracks consumer lag. ### Architecture -1. **Health Polling (High Frequency):** - - **Initiator:** DCM Control Plane. - - **Target:** Service Provider `/health` endpoint. - - **Frequency:** Every 10 seconds (default). - - **Success Criteria:** HTTP 200 OK. - -2. 
**Resource Synchronization (Low Frequency/On-Demand):** - - **Note:** Detailed resource data (CPU/Memory) continues to be handled via - the Provider Info API, but the "Ready" state is governed by the Health - Check results. +1. **SP Health Monitoring (Agent → SP):** + - **Embedded SPs:** Health is determined in-process — the agent directly + checks the embedded SP's internal state without a network call. + - **External SPs:** Health is determined by polling the SP's `/health` + endpoint. + - **Initiator:** Agent. + - **Target:** Service Provider `/health` endpoint. + - **Frequency:** Every 10 seconds (default). + - **Success Criteria:** HTTP 200 OK. + +2. **Agent Health Monitoring (Agent → DCM):** + - **Mechanism:** Agent sends `PUT /api/v1/agents/{agentId}/heartbeat` to DCM. + - **Frequency:** Every `heartbeatInterval` seconds (configurable). + - **Failure:** If no heartbeat within configurable threshold, DCM marks agent + as Unavailable. + +3. **Consumer Lag Monitoring:** + - Agent self-reports consumer lag in heartbeat payload + `{timestamp, consumerLag}`. + - DCM marks agent as **Congested** when lag exceeds `consumerLagThreshold`. + - DCM stops routing new requests to a Congested agent. ### Health Check Flow -1. **DCM Controller:** Iterates through the list of active providers in the - database. -2. **Probing:** For each provider, DCM executes: - `GET http://:/health`. -3. **State Machine:** - - **Ready:** If response is `200 OK` and body `status` is `healthy`, reset - failure counter and mark as `Ready`. - - **Unhealthy:** If response is `200 OK` and body `status` is `unhealthy`, - mark as `Unhealthy`. The service provider is reachable but the backing - provider is unavailable. - - **Failure:** If timeout or non-200 response, increment failure counter. - - **Threshold:** If failures exceed the `FailureThreshold` (default: 3), - transition provider to `Unavailable`. -4. 
**Recovery:** A single `200 OK` with `status` `healthy` transitions an - `Unhealthy` or `Unavailable` provider back to `Ready`. +1. **Agent:** Iterates through the list of registered SPs (both embedded and + external). +2. **Probing:** For each external SP, the Agent executes: + `GET http://:/health`. Embedded SPs are checked in-process. +3. **State Machine:** + - **Ready:** If response is `200 OK` and body `status` is `healthy`, reset + failure counter and mark as `Ready`. + - **Unhealthy:** If response is `200 OK` and body `status` is `unhealthy`, + mark as `Unhealthy`. The service provider is reachable but the backing + provider is unavailable. + - **Failure:** If timeout or non-200 response, increment failure counter. + - **Threshold:** If failures exceed the `FailureThreshold` (default: 3), + transition provider to `Unavailable`. +4. **Recovery:** A single `200 OK` with `status` `healthy` transitions an + `Unhealthy` or `Unavailable` provider back to `Ready`. + +### Differentiated Behavior + +Since only one SP (embedded or external) may serve a given service type per +agent, when that SP transitions out of the Ready state, the Agent's behavior +differs based on the health state. See the +[Environment Agent enhancement](../environment-agent/environment-agent.md) for +full details on retry topic behavior. + +**Unhealthy:** + +1. Agent **keeps** the service type in its advertised list (no update to DCM). +2. Stops routing to the SP. Incoming requests are held in the retry topic. +3. Publishes a `service-type-degraded` health warning CloudEvent. + +**Unavailable** (after exceeding failure threshold): + +1. Agent **removes** the service type from its advertised list. +2. Sends `POST /api/v1/agents` to DCM with the updated registration. +3. Drains retry topic — rejects held requests with error CloudEvents. +4. Publishes a `service-type-unavailable` health warning CloudEvent. + +**Recovery:** + +1. 
Re-adds service type to advertised list if it was removed (Unavailable case) + and sends `POST /api/v1/agents` to DCM with the updated registration. +2. Processes held requests from the retry topic. + +## Agent Health Monitoring + +The Agent reports its own liveness to DCM via periodic REST heartbeats. DCM +tracks the last heartbeat timestamp for each agent. + +- **Endpoint:** `PUT /api/v1/agents/{agentId}/heartbeat` +- **Payload:** `{timestamp, consumerLag}` +- **Frequency:** Every `heartbeatInterval` seconds (configurable). +- If no heartbeat is received within a configurable threshold, DCM marks the + agent as **Unavailable**. +- On restart, the Agent re-registers to DCM, which resets the heartbeat tracker. + +```mermaid +sequenceDiagram + autonumber + participant AG as Agent + participant DCM as DCM Control Plane + participant DB as Database + + loop Every {heartbeatInterval} seconds + AG->>DCM: PUT /api/v1/agents/{agentId}/heartbeat
{timestamp, consumerLag} + DCM->>DB: Update heartbeat timestamp and lag + DCM->>DCM: Check consumerLag against threshold + alt consumerLag >= consumerLagThreshold + DCM->>DB: Mark agent as Congested + else consumerLag < consumerLagThreshold + DCM->>DB: Clear Congested state (if set) + end + DCM-->>AG: 200 OK + end + + Note over DCM: No heartbeat within threshold + DCM->>DB: Mark agent as Unavailable +``` + +## Consumer Lag Monitoring + +The Agent self-reports the number of pending messages on its topic as +`consumerLag` in each heartbeat. DCM compares this value against a global +`consumerLagThreshold`. + +- When `consumerLag >= consumerLagThreshold`, DCM marks the agent as + **Congested** and stops routing new requests to it. +- When `consumerLag` drops below the threshold on a subsequent heartbeat, DCM + clears the Congested state. + +> **Note:** The environment-agent enhancement currently defines the heartbeat +> payload as `{timestamp}` only. The extended payload `{timestamp, consumerLag}` +> is defined here as the intended contract; the agent doc will be updated in a +> follow-up. + +## Agent Health State Summary + +| Condition | Agent State | +| --------------------------------------------- | --------------- | +| Heartbeat received, lag below threshold | **Ready** | +| Heartbeat received, lag at or above threshold | **Congested** | +| No heartbeat within threshold | **Unavailable** | + +```mermaid +stateDiagram-v2 + [*] --> Ready: Agent registers + Ready --> Congested: consumerLag >= threshold + Congested --> Ready: consumerLag < threshold + Ready --> Unavailable: Heartbeat timeout + Congested --> Unavailable: Heartbeat timeout + Unavailable --> Ready: Agent re-registers +``` ## Design Details ### Service Provider Implementation +The SP health endpoint specification applies to external SPs only. Embedded SPs +are health-checked in-process and do not expose a `/health` endpoint.
The only +difference from the original design is that the Agent, not DCM, is the caller +for external SPs. + The Service Provider must expose a lightweight unauthenticated (or internally secured) endpoint. @@ -106,10 +223,10 @@ secured) endpoint. The `status` field indicates the health of the backing provider: -- `healthy` — The service provider and its backing provider are operational. DCM - marks the provider as **Ready**. +- `healthy` — The service provider and its backing provider are operational. The + Agent marks the provider as **Ready**. - `unhealthy` — The service provider is reachable but the backing provider is - unavailable. DCM marks the provider as **Unhealthy**. + unavailable. The Agent marks the provider as **Unhealthy**. **Unhealthy Response Example:** @@ -123,8 +240,14 @@ The `status` field indicates the health of the backing provider: #### Provider State Summary -| HTTP Response | `status` field | DCM State | +| HTTP Response | `status` field | SP State | | ----------------- | -------------- | ---------------------------------------------------- | | `200 OK` | `healthy` | **Ready** | | `200 OK` | `unhealthy` | **Unhealthy** | | Non-200 / Timeout | N/A | **Unavailable** (after exceeding `FailureThreshold`) | + +## Next Steps + +- Agent HA: multiple agents sharing health-check duties. +- Authenticated health checks. +- Per-SP health check intervals. 
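The SP state machine from the Health Check Flow and the agent states from the Agent Health State Summary can be sketched together in Python. This is a minimal sketch under stated assumptions, not the Agent's or DCM's implementation: `SPHealthTracker`, `agent_state`, and the `heartbeat_timeout` parameter are hypothetical names, an `unhealthy` response is assumed to leave the failure counter untouched, and a `200 OK` with an unrecognised `status` is assumed to count as a failure.

```python
from dataclasses import dataclass

READY, UNHEALTHY, UNAVAILABLE = "Ready", "Unhealthy", "Unavailable"


@dataclass
class SPHealthTracker:
    """Tracks one SP's state as seen by the Agent (hypothetical helper)."""
    failure_threshold: int = 3  # FailureThreshold default from the text
    state: str = READY
    failures: int = 0

    def observe(self, http_status, body_status=None):
        """Apply one probe result; http_status is None on timeout."""
        if http_status == 200 and body_status == "healthy":
            # A single healthy response recovers the SP and resets the counter.
            self.failures = 0
            self.state = READY
        elif http_status == 200 and body_status == "unhealthy":
            # Reachable SP, unavailable backing provider.
            self.state = UNHEALTHY
        else:
            # Timeout or non-200: Unavailable only once failures exceed
            # the threshold.
            self.failures += 1
            if self.failures > self.failure_threshold:
                self.state = UNAVAILABLE
        return self.state


def agent_state(seconds_since_heartbeat, consumer_lag,
                heartbeat_timeout, consumer_lag_threshold):
    """Classify an agent as DCM would, per the Agent Health State Summary."""
    if seconds_since_heartbeat > heartbeat_timeout:
        return "Unavailable"
    if consumer_lag >= consumer_lag_threshold:
        return "Congested"
    return "Ready"
```

With the default threshold of 3, the tracker only reports `Unavailable` on the fourth consecutive failed probe, which matches the "exceed" wording of the threshold rule.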
diff --git a/enhancements/sp-registration-flow/sp-registration-flow.md b/enhancements/sp-registration-flow/sp-registration-flow.md index 2055a78..d318687 100644 --- a/enhancements/sp-registration-flow/sp-registration-flow.md +++ b/enhancements/sp-registration-flow/sp-registration-flow.md @@ -17,6 +17,8 @@ approvers: - "@flocati" - "@gabriel-farache" creation-date: 2025-12-05 +see-also: + - "/enhancements/environment-agent/environment-agent.md" --- # Service Provider Registration Flow @@ -26,19 +28,35 @@ creation-date: 2025-12-05 The DCM (Data Center Management) is designed to provide a unified control plane for managing distributed infrastructure across multiple enclaves, including air-gapped environments, regional datacenters, and isolated security zones (e.g. -ships, edge locations). A fundamental architectural decision must be made about -how Service Providers (SP) — the components that execute infrastructure -provisioning work — become known to and integrate with the DCM Control Plane. -This decision directly impacts scalability, security, network topology, -operational model (whether centralized DCM teams or distributed SME teams manage -Service Provider lifecycle). +ships, edge locations). In each target environment, an +[Agent](../environment-agent/environment-agent.md) runs as the intermediary +between DCM and the Service Providers (SPs) deployed in that environment. + +The Agent supports a hybrid SP model: it ships with embedded SP code for known +service types (K8s Container, ACM Cluster, KubeVirt), enabled via configuration, +and also accepts external ("bring your own") SPs that register via the Agent's +SP Registration API (`POST /api/v1/providers`). Only one SP — embedded or +external — may serve a given service type per agent; duplicate registrations are +rejected with `409 Conflict`. + +This document defines the registration contract for external SPs — API shape, +idempotency semantics, and natural key behavior. 
Embedded SPs register +internally at agent startup without a REST call and do not use this flow. + +The Agent, in turn, registers itself to DCM via a separate API +(`POST /api/v1/agents`), advertising the environment and the aggregated list of +service types it can serve. DCM's Registration Handler no longer receives SP +registrations directly; it receives Agent registrations. The Agent Registration +Flow is defined in the +[Environment Agent enhancement](../environment-agent/environment-agent.md#agent-registration-flow). ## Motivation ### Goals -- Define the registration mechanism by which Service Providers become known to - and communicate with the DCM Control Plane. +- Define the registration mechanism by which external Service Providers become + known to the Agent, and how the Agent becomes known to DCM. +- Define the service type uniqueness constraint (one SP per service type). ### Non-Goals @@ -48,6 +66,10 @@ Service Provider lifecycle). - DCM Control Plane definition - Meta-service-provider design - Service Provider's policies +- Embedded SP registration (these register internally at agent startup; see the + [Environment Agent enhancement](../environment-agent/environment-agent.md#embedded-sp-registration)) +- Agent registration to DCM (defined in the + [Environment Agent enhancement](../environment-agent/environment-agent.md#agent-registration-flow)) ## Proposal @@ -55,36 +77,41 @@ Service Provider lifecycle). #### Terminology -Service Providers must register using the DCM Service Provider API to operate -within the DCM system. The Registration Handler component implements the -provider registration endpoints of the Service Provider API. The registration -phase provides to the DCM Control Plane the SP endpoint, metadata and -capabilities so it can route requests to the appropriate SP. The registration -call can be initiated either by the SP itself during start up phase or by a -third party (e.g. platform admins) on behalf of the SP. 
Both approaches use the -same registration API. +External Service Providers must register using the Agent's SP Registration API +to operate within the DCM system. The Agent implements the provider registration +endpoint (`POST /api/v1/providers`), applying the same contract defined in this +document. Embedded SPs (K8s Container, ACM Cluster, KubeVirt) register +internally at agent startup and do not use this endpoint. + +The registration phase provides the Agent with the SP endpoint, metadata and +capabilities so it can route creation requests to the appropriate SP. The +registration call can be initiated either by the SP itself during its startup phase +or by a third party (e.g. platform admins) on behalf of the SP. Both approaches +use the same registration API. + +Only one SP — embedded or external — may serve a given service type per agent. +If the requested service type is already served by another SP (embedded or +external), the Agent rejects the registration with `409 Conflict` (see the +[Environment Agent enhancement](../environment-agent/environment-agent.md#sp-registration-to-agent) +for the full service type uniqueness constraint). The _initial implementation_ will focus only on the **self registration flow**. -The _Service Provider API_ is located in the Egress layer and defines the -contract between the DCM Control Plane and Service Providers. It includes -endpoints for provider registration, service management, and provider queries. -The +The _SP Registration API_ is hosted by the Agent and defines the contract +between the Agent and Service Providers. It includes the endpoint for provider +registration. The [Service Provider API specification](https://github.com/Fale/dcm/blob/od/api/interoperabilityAPI.yaml) is under development. -Within this architecture, the _Registration Handler_ is a component within the -Service Provider API that implements the provider registration endpoints -(`POST /providers` and related endpoints).
When an SP registers, the -Registration Handler communicates with the Control Plane to update the Service -Registry. +DCM implements `POST /api/v1/agents` for Agent registration (defined in the +[Environment Agent enhancement](../environment-agent/environment-agent.md#post-apiv1agents--agent-registration)). #### Architectural Assumptions -Bidirectional network connectivity between Service Providers and the DCM Control -Plane is required. SPs must reach DCM to register, and DCM must reach SPs to -route provisioning requests. If either direction is blocked, the system cannot -function regardless of the registration method used. +SPs require network connectivity to the Agent. The Agent requires outbound +connectivity to DCM (for registration and heartbeats) and to the Messaging +System. DCM requires connectivity to the Messaging System. Direct SP-to-DCM +connectivity is not required. #### Registration Flow @@ -98,17 +125,18 @@ capability matrices. ```mermaid %%{init: {'flowchart': {'rankSpacing': 100, 'nodeSpacing': 10}}}%% flowchart BT - subgraph Data_Sources [**Data Sources**] - DB[("**Service Registry**
SP endpoints")] + subgraph DCM_Control_Plane [**DCM Control Plane**] + DB[("**Agent Registry**
Agent endpoints &
service types")] end - subgraph API_Block [**Service Provider API**] - Handler["_Service Registration Handler_ + subgraph Agent_Block [**Agent**] + Handler["_SP Registration Handler_ 2. Receive Request - 3. Validate & Process"] + 3. Validate & Process + 4. Update internal SP registry"] end - subgraph Service_API [**Service API**

] + subgraph Service_API [**Service Providers**

] subgraph SP1 [**ServiceProvider 1**
] VM_Prov["**VM Provider Impl.** @@ -130,14 +158,14 @@ flowchart BT end VM_Prov & Storage_Prov & Container_Prov & Pod_Prov -- 1. Register --> Handler - Handler -- 4. Update Service Registry --> DB + Handler -- 5. Update DCM
POST /api/v1/agents --> DB ``` - Admins predefine supported [Service Types](https://github.com/dcm-project/enhancements/blob/main/enhancements/service-type-definitions/service-type-definitions.md) (e.g., "vm", "database") -- A registration call must be made to the Registration Handler endpoint for each - service type the SP supports. The payload includes: +- A registration call must be made to the Agent's SP Registration endpoint for + each service type the SP supports. The payload includes: 1. Unique provider name 2. Unique providerID (optional, server-generated if not provided) 3. Endpoint URL (e.g., @@ -146,13 +174,17 @@ flowchart BT 5. Metadata (optional: zone, region, resource constraints) 6. Operations supported for this service type (optional, e.g., _"create"_, _"delete"_) -- The Registration Handler processes and validates the metadata -- The Registration Handler internally updates the Service Registry with: - 1. SP endpoint - 1. metadata -- When user requests a catalog offering, Control Plane matches it to registered - SPs that can fulfill it based on configured policies and calls the selected SP - endpoint (endpoint must be reachable) +- The Agent processes and validates the metadata +- The Agent stores the SP registration in its internal registry and recomputes + its list of supported service types +- When the Agent's service type list changes (new type added or removed), the + Agent updates DCM via `POST /api/v1/agents` (see the + [Environment Agent enhancement](../environment-agent/environment-agent.md#sp-registration-to-agent) + for the full flow) +- When a user requests a catalog offering, DCM's Control Plane matches it to a + registered Agent that can fulfill it based on configured policies and routes + the request through the messaging system to the Agent, which forwards it to + the selected SP The Service Provider's _name_ is the natural key used to match existing registrations. @@ -163,18 +195,25 @@ request body.
This allows the `id` field in the schema to be `readOnly`, preventing conflicts between query param and body values. The server sets `id` from the query parameter or auto-generates it if not provided. -The registration endpoint is idempotent. During the registration phase: +The registration endpoint is idempotent. These idempotency semantics apply at +the Agent level for SP registration. During the registration phase: -- If the _name_ does not exist in DCM, a new SP entry is created. If no - _providerID_ is specified, DCM will automatically generate one. +- If the _name_ does not exist in the Agent's registry, a new SP entry is + created. If no _providerID_ is specified, the Agent will automatically + generate one. - If the _name_ already exists and no _providerID_ is provided (or the same _providerID_ is provided), the existing entry is updated and the same _providerID_ is returned. - If the _name_ already exists but a **different** _providerID_ is provided, registration fails (conflict: another SP is attempting to register with a taken name). -- If a new _name_ is provided but the _providerID_ already exists in DCM, - registration fails (conflict: _providerID_ is already assigned to another SP). +- If a new _name_ is provided but the _providerID_ already exists in the Agent's + registry, registration fails (conflict: _providerID_ is already assigned to + another SP). + +Identical idempotency semantics (same `name` natural key pattern) apply at DCM +level for Agent registration, as defined in the +[Environment Agent enhancement](../environment-agent/environment-agent.md#re-registration-on-restart). The response to a registration request will always include the _providerID_, regardless of whether it was generated or provided. Consistent with AEP, the @@ -184,35 +223,37 @@ response payload mirrors the request payload with possibly updated values. The registration endpoint is idempotent. 
If an SP's capabilities change (typically due to a new version following a restart), the SP (or admin) can call -the same registration endpoint again. The Registration Handler will update the -existing SP entry rather than creating a duplicate. +the same registration endpoint again. The Agent will update the existing SP +entry rather than creating a duplicate. + +When an SP re-registers with updated capabilities, the Agent recomputes its +service type list and, if changed, updates DCM via `POST /api/v1/agents`. - SP serviceType changes -- SP restarts and re-registers using the same Service Provider API registration - endpoint -- The Registration Handler updates the existing Service Provider Registry and - Service Catalog entry with the new serviceType -- The Registration Handler detects that the SP already exists by matching the - Service Provider _name_ -- The Registration Handler updates the existing Service Registry entry with the - new serviceType and returns the same providerID. -- There are 3 potential scenarios for updating a Service Provider within DCM: +- SP restarts and re-registers using the same Agent SP Registration API endpoint +- The Agent updates the existing SP entry in its internal registry with the new + serviceType +- The Agent detects that the SP already exists by matching the Service Provider + _name_ +- The Agent updates the existing SP entry with the new serviceType and returns + the same providerID. +- There are 3 potential scenarios for updating a Service Provider: 1. SP's _name_ update: If only the SP's name changes (but the providerID remains - the same), DCM updates the SP's name. An attempt to update with a + the same), the Agent updates the SP's name. An attempt to update with a pre-existing SP's name will result in failure. 2. _providerID_ update: If only the _providerID_ changes (but the SP's _name_ - remains the same), DCM updates the providerID. An attempt to update with a - pre-existing _providerID_ will result in failure. -3. 
Both the SP's name and providerID change: DCM cannot reliably determine if - this is an update to the existing SP or a new registration of a distinct SP. - In this scenario the required action is to delete and re-create the SP. + remains the same), the Agent updates the providerID. An attempt to update + with a pre-existing _providerID_ will result in failure. +3. Both the SP's name and providerID change: The Agent cannot reliably determine + if this is an update to the existing SP or a new registration of a distinct + SP. In this scenario the required action is to delete and re-create the SP. ###### Example - First registration (with client-specified id): -`POST /api/v1/providers?id=uuid-1234` +`POST /api/v1/providers?id=uuid-1234` (on the Agent) ```yaml { @@ -246,7 +287,7 @@ Response: - First registration (with server generated id): -`POST /api/v1/providers` +`POST /api/v1/providers` (on the Agent) ```yaml { @@ -268,7 +309,7 @@ Response: - Re-registration (SP restarts, same endpoint): -`POST /api/v1/providers` +`POST /api/v1/providers` (on the Agent) ```yaml { @@ -293,8 +334,23 @@ Response: ### Risks and Mitigations +The risks related to the Agent-based architecture (agent as single point of +failure, unauthenticated SP registration, messaging system dependencies) are +documented in the +[Environment Agent enhancement](../environment-agent/environment-agent.md#risks-and-mitigations). + +### Next Steps + +- HA agent replicas for high availability per environment +- Authenticated SP registration (AuthN/AuthZ for the Agent's SP Registration + API) +- Dynamic cost tier updates without agent restart + ## Alternatives +The following alternatives were evaluated before the current Agent-based +architecture was adopted. They are retained for historical context. + ### Dynamic Registration Approach #### Description @@ -463,4 +519,7 @@ flowchart BT #### Why rejected Too complex for initial delivery. 
Requirements for network scanning, discovery -protocols, and security policies are not yet defined. +protocols, and security policies are not yet defined. The Agent-based +architecture further reinforces this rejection: the Agent eliminates the need +for direct DCM-to-SP connectivity, making a DCM-driven network scanning approach +even less aligned with the current architecture. diff --git a/enhancements/sp-resource-manager/sp-resource-manager.md b/enhancements/sp-resource-manager/sp-resource-manager.md index 59ff754..c97ed3f 100644 --- a/enhancements/sp-resource-manager/sp-resource-manager.md +++ b/enhancements/sp-resource-manager/sp-resource-manager.md @@ -11,21 +11,24 @@ reviewers: - "@pkliczewski" - "@gabriel-farache" creation-date: 2026-01-02 +see-also: + - "/enhancements/environment-agent/environment-agent.md" --- # Service Provider Resource Manager ## Summary -The DCM Service Provider Resource Manager provides a centralized intermediary -service between Placement Manager and Service Providers (SPs) for creating and -managing service type instances. Rather than having Placement Manager directly -call individual SPs, the Resource Manager abstracts SP interactions by handling -SP lookup (retrieving SP endpoints and metadata from the Service Registry), -health validation, instance tracking, and database persistence. This design -simplifies Placement Manager logic, ensures consistent instance management -across all SPs, and provides a single point of control for instance lifecycle -operations within DCM core. +The DCM Service Provider Resource Manager (SPRM) provides a centralized +intermediary service between Placement Manager and Environment Agents for +creating and managing service type instances. 
Rather than having Placement +Manager interact with Service Providers directly, the Resource Manager abstracts +agent interactions by looking up agent details from the Agent Registry, checking +agent health and congestion state, publishing creation and deletion CloudEvents +to the agent's messaging topic, and consuming responses from +`dcm.agents.responses`. This design simplifies Placement Manager logic, ensures +consistent instance management across all agents, and provides a single point of +control for instance lifecycle operations within DCM core. ## Motivation @@ -48,25 +51,35 @@ operations within DCM core. ### Assumptions -- The SP Resource Manager has connectivity to the registered SPs. -- The SP Resource Manager has access/permission to the database. +- The SP Resource Manager has access to the Messaging System for publishing + CloudEvents and consuming responses. +- A Messaging System (e.g., NATS) is deployed and accessible. +- The SP Resource Manager has access to the Agent Registry and instance record + database. - The SP Resource Manager is reachable from the Placement Manager. - The SP Resource Manager lives within the SP API. -- The database persists both SP registry information and created resource ### Integrations Points #### Database Integration -- **Service Registry**: - - Stores Service Provider's registration information - - Used for retrieving SP details during instance creation - - SP info includes `endpoints`, `metadata`, `status` and `resource capacity` +- **Agent Registry**: + - Stores Agent registration information (name, environment, serviceTypes, + topicName, cost, healthStatus, consumerLag) + - Used for retrieving agent details during instance creation and deletion - **Service Type Instance Records**: - Stores created service type instance information - - Instance data includes `instanceId`, `providerName`, `status`. + - Instance data includes `instanceId`, `agentName`, `serviceType`, `status`. 
+ The `providerName` field is populated asynchronously from the agent's + creation-acknowledged CloudEvent. - Maintains record of all created instances within DCM core +#### Messaging System + +- **Publishing**: SPRM publishes creation and deletion request CloudEvents to + the agent's topic (`{agentTopicName}`) +- **Consuming**: SPRM consumes response CloudEvents from `dcm.agents.responses` + ### API Endpoints The CRUD endpoints are consumed by the DCM Placement Manager to create and @@ -103,13 +116,20 @@ requestBody: schema: type: object required: - - providerName + - agentName + - serviceType - spec properties: - providerName: + agentName: type: string - description: The unique identifier of the target Service Provider - example: "kubevirt-sp" + description: The name of the target Environment Agent + example: "prod-eu-agent" + serviceType: + type: string + description: + The type of service to create (e.g., vm, container, database, + cluster) + example: "vm" spec: type: object description: | @@ -122,7 +142,8 @@ Example of payload for incoming VM request ```json { - "providerName": "kubevirt-sp", + "agentName": "prod-eu-agent", + "serviceType": "vm", "spec": { "memory": { "size": "2GB" }, "vcpu": { "count": 2 }, @@ -144,19 +165,19 @@ Example of Response Payload [ { "name": "nginx-container", - "providerName": "container-sp", + "agentName": "container-agent", "instanceId": "696511df-1fcb-4f66-8ad5-aeb828f383a0", "status": "PROVISIONING" }, { "name": "postgres-001", - "providerName": "postgres-sp", + "agentName": "postgres-agent", "instanceId": "c66be104-eea3-4246-975c-e6cc9b32d74d", "status": "FAILED" }, { "name": "ubuntu-vm", - "providerName": "kubevirt-sp", + "agentName": "prod-eu-agent", "instanceId": "08aa81d1-a0d2-4d5f-a4df-b80addf07781", "status": "PROVISIONING" } @@ -171,7 +192,7 @@ Example of Response Payload ```json { "name": "ubuntu-vm", - "providerName": "kubevirt-sp", + "agentName": "prod-eu-agent", "instanceId": "08aa81d1-a0d2-4d5f-a4df-b80addf07781", 
"status": "PROVISIONING" } @@ -190,7 +211,7 @@ Retrieve the health status of SP Resource Manager. This flow demonstrates the creation of a service type instance (VMs, containers, databases, or clusters) through the SP Resource Manager. It involves communication between the Placement Manager, SP Resource Manager, database, and -the targeted Service Provider. +the Messaging System. ```mermaid sequenceDiagram @@ -198,39 +219,23 @@ sequenceDiagram participant PS as Placement Manager participant SPRM as SP Resource Manager participant DB as Database - participant SP as Service Provider + participant MS as Messaging System - PS->>SPRM: POST /api/v1/service-type-instances
{providerName, spec} + PS->>SPRM: POST /api/v1/service-type-instances
{agentName, serviceType, spec} activate SPRM - - alt SP not found + SPRM->>DB: Lookup agent by agentName + alt Agent not found SPRM-->>PS: 404 Not Found - else SP Health Check fails + else Agent Unavailable or Congested SPRM-->>PS: 503 Service Unavailable - SPRM->>SP: POST {SP_endpoint}/api/v1/services
{payload} - activate SP - - alt SP creation fails - SP-->>SPRM: Error response - deactivate SP - SPRM-->>PS: Return SP error
(SP creation failed) - else SP creation succeeds - SP-->>SPRM: Success response
{instanceId, status, metadata} - SPRM->>DB: Create instance record
{instanceId, providerName, metadata} - activate DB - - alt DB record creation fails - DB-->>SPRM: Error response - deactivate DB - SPRM-->>PS: 500 Internal Server Error
{instanceId, error} - - else DB record creation succeeds - DB-->>SPRM: Record created - SPRM-->>PS: 202 Accepted
{instanceId, status} - end - end + else Agent healthy + SPRM->>DB: Generate resourceId
Create instance record
{resourceId, agentName, serviceType, status: PENDING} + + SPRM->>MS: PUBLISH CloudEvent
topic: {topicName}
type: dcm.request.create
{resourceId, serviceType, spec} + + SPRM-->>PS: 202 Accepted
{instanceId, agentName, status: PENDING} end deactivate SPRM ``` @@ -240,43 +245,143 @@ sequenceDiagram - **Request Reception** - SP Resource Manager receives a POST request (`/api/v1/service-type-instances`) from Placement Manager with: - - `providerName`: The unique identifier of the target Service Provider - - `spec`: The detailed spec following any of service type schema (VMSpec, - ContainerSpec, DatabaseSpec, or ClusterSpec) -- **Service Provider Lookup** - - Queries the Service Registry database using the `providerName` + - `agentName`: The name of the target Environment Agent + - `serviceType`: The type of service to create (e.g., vm, container) + - `spec`: The detailed spec following any of the service type schemas + (VMSpec, ContainerSpec, DatabaseSpec, or ClusterSpec) +- **Agent Lookup** + - Queries the Agent Registry by `agentName` - Retrieves: - - Service Provider endpoint URL - - SP metadata (region, providerName etc) - - Current SP status (healthy, degraded, unavailable) - - If SP is not found, returns 404 error to Placement Manager - - If SP status is degraded or unavailable, returns 503 error to Placement - Manager -- **Service Provider Invocation** - - Calls the Service Provider's API endpoint: - `POST {SP_endpoint}/api/v1/services` - - Forwards the service specification (payload) to the SP - - If SP instance creation fails, forward the SP's error response to Placement - Manager -- **Persist Response** - - Receives response from Service Provider containing: - - `instanceId`: Unique identifier for the created instance - - `status`: Creation status (`PROVISIONING`) - - Stores instance metadata in the database - - If database record creation fails, returns 500 Internal Server Error with - `instanceId` included in error response (instance was created by SP but - tracking failed) + - `topicName`: The agent's messaging topic + - `healthStatus`: Current agent health (Ready, Unavailable) + - `consumerLag`: Current consumer lag for congestion detection + - 
If agent is not found, returns 404 error to Placement Manager + - If agent is Unavailable (missed heartbeats) or Congested (consumer lag + threshold exceeded), returns 503 error to Placement Manager +- **Instance Record Creation** + - Generates a `resourceId` for the new instance + - Creates an instance record in the database with status `PENDING` + - The record includes `resourceId`, `agentName`, `serviceType`, and `status` +- **CloudEvent Publishing** + - Publishes a creation request CloudEvent to the agent's topic (`{topicName}`) + via the Messaging System + - CloudEvent type: `dcm.request.create` + - CloudEvent data: `{resourceId, serviceType, spec}` + - See + [Environment Agent - CloudEvent Message Definitions](../environment-agent/environment-agent.md#cloudevent-message-definitions) + for the full CloudEvent schema - **Response to Placement Manager** - - Returns success response (202 Accepted) with: + - Returns 202 Accepted with: - `instanceId`: The created instance identifier - - `status`: Current instance status - - Returns error response with appropriate HTTP status code and error details - if any step fails + - `agentName`: The target agent + - `status`: `PENDING` + - At this point only `agentName` is known; `providerName` is populated + asynchronously when the agent's creation-acknowledged response arrives + +### Service Type Instance Deletion Flow + +This flow demonstrates the deletion of a service type instance through the SP +Resource Manager. It mirrors the creation flow, publishing a deletion CloudEvent +instead of a creation one. + +```mermaid +sequenceDiagram + autonumber + participant PS as Placement Manager + participant SPRM as SP Resource Manager + participant DB as Database + participant MS as Messaging System + + PS->>SPRM: DELETE /api/v1/service-type-instances/{instanceId} + activate SPRM + + SPRM->>DB: Lookup instance by instanceId
Get agentName, serviceType, resourceId + + SPRM->>DB: Lookup agent by agentName + alt Agent not found + SPRM-->>PS: 404 Not Found + else Agent Unavailable or Congested + SPRM-->>PS: 503 Service Unavailable + else Agent healthy + SPRM->>MS: PUBLISH CloudEvent
topic: {topicName}
type: dcm.request.delete
{resourceId, serviceType} + + SPRM->>DB: Update instance status to DELETING + SPRM-->>PS: 202 Accepted
{instanceId, status: DELETING} + end + deactivate SPRM +``` + +#### Steps + +- **Request Reception** + - SP Resource Manager receives a DELETE request + (`/api/v1/service-type-instances/{instanceId}`) from Placement Manager +- **Instance Lookup** + - Queries the database by `instanceId` + - Retrieves `agentName`, `serviceType`, and `resourceId` from the instance + record +- **Agent Lookup** + - Queries the Agent Registry by `agentName` + - Retrieves `topicName`, `healthStatus`, and `consumerLag` + - If agent is not found, returns 404 error to Placement Manager + - If agent is Unavailable or Congested, returns 503 error to Placement Manager +- **CloudEvent Publishing** + - Publishes a deletion request CloudEvent to the agent's topic (`{topicName}`) + via the Messaging System + - CloudEvent type: `dcm.request.delete` + - CloudEvent data: `{resourceId, serviceType}` + - See + [Environment Agent - CloudEvent Message Definitions](../environment-agent/environment-agent.md#cloudevent-message-definitions) + for the full CloudEvent schema +- **Instance Record Update** + - Updates the instance record status to `DELETING` +- **Response to Placement Manager** + - Returns 202 Accepted with: + - `instanceId`: The instance identifier + - `status`: `DELETING` + +> **Note:** The Placement Manager also uses this DELETE endpoint to cancel +> requests that were queued by an agent (when the SP for the service type was +> unhealthy). When the Placement Manager's `queuedRequestTimeout` expires, it +> sends a DELETE for the queued instance, then re-evaluates policies to select +> an alternative agent. The agent handles creation/deletion dedup in its retry +> topic — if both the original creation request and the cancellation DELETE are +> present, they cancel out (see +> [Environment Agent — Retry Topic](../environment-agent/environment-agent.md#retry-topic)). 
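The deletion steps above can be sketched as a single handler. This is a minimal sketch under stated assumptions, not the SP Resource Manager's code: the in-memory dictionaries stand in for the instance database and the Agent Registry, `publish` stands in for the Messaging System client, and the `lag_threshold` value, the CloudEvent `source`, and the 404 returned for an unknown `instanceId` are all hypothetical choices that the enhancement does not define.

```python
import uuid
from typing import Callable, Dict, Tuple


def handle_delete(
    instance_id: str,
    instances: Dict[str, dict],           # stand-in for the instance record database
    agents: Dict[str, dict],              # stand-in for the Agent Registry
    publish: Callable[[str, dict], None], # stand-in for the Messaging System client
    lag_threshold: int = 1000,            # hypothetical congestion threshold
) -> Tuple[int, dict]:
    """Return (HTTP status, response body) for DELETE /service-type-instances/{id}."""
    instance = instances.get(instance_id)
    if instance is None:
        # Assumption: the source does not state the response for an unknown instance.
        return 404, {"error": "instance not found"}

    agent = agents.get(instance["agentName"])
    if agent is None:
        return 404, {"error": "agent not registered"}

    congested = agent["consumerLag"] > lag_threshold
    if agent["healthStatus"] == "Unavailable" or congested:
        return 503, {"error": "agent unavailable or congested"}

    event = {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": "dcm/sp-resource-manager",  # hypothetical source URI
        "type": "dcm.request.delete",
        "data": {
            "resourceId": instance["resourceId"],
            "serviceType": instance["serviceType"],
        },
    }
    publish(agent["topicName"], event)

    # The record is updated only after the request has been published.
    instance["status"] = "DELETING"
    return 202, {"instanceId": instance_id, "status": "DELETING"}
```

Publishing before updating the record follows the order in the sequence diagram; a failed publish then leaves the instance status unchanged.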
+ +### Asynchronous Response Processing + +The SP Resource Manager consumes response CloudEvents from the +`dcm.agents.responses` topic. These responses are published by Environment +Agents after processing creation or deletion requests. The following table +describes the actions taken for each response type: + +| CloudEvent Type | Action | +| --------------------------------- | ----------------------------------------------------------------------------------------------------------------- | +| `dcm.agent.creation-acknowledged` | Update instance record: status to `PROVISIONING`, store `providerName` from response | +| `dcm.agent.deletion-acknowledged` | Update instance record: status to `DELETING` | +| `dcm.agent.error` | Update instance record: status to `FAILED`, store error details. Notify Placement Manager. | +| `dcm.agent.request-queued` | Update instance record: status to `QUEUED`. Report queued status to Placement Manager (PM handles timeout logic). | + +Note: `providerName` in instance records is populated asynchronously. At 202 +response time, only `agentName` is known. The `providerName` is set when the +agent's `dcm.agent.creation-acknowledged` CloudEvent arrives, which includes the +SP that ultimately handled the request. + +See +[Environment Agent - CloudEvent Message Definitions](../environment-agent/environment-agent.md#cloudevent-message-definitions) +for the full CloudEvent type definitions and data schemas. 
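The response table above amounts to a dispatch on the CloudEvent `type`. The sketch below illustrates that mapping under assumptions: the instance record is a plain dictionary, the data field names `providerName` and `error` are hypothetical (the authoritative schemas are in the Environment Agent enhancement), and raising on an unknown type is a placeholder, since dead-letter handling is listed as a next step.

```python
def apply_response(record: dict, event: dict) -> bool:
    """Update an instance record from a response CloudEvent.

    Returns True when the Placement Manager should be informed of the change.
    """
    event_type = event["type"]
    data = event.get("data", {})

    if event_type == "dcm.agent.creation-acknowledged":
        record["status"] = "PROVISIONING"
        # providerName only becomes known here, not at 202 response time.
        record["providerName"] = data.get("providerName")  # assumed field name
        return False
    if event_type == "dcm.agent.deletion-acknowledged":
        record["status"] = "DELETING"
        return False
    if event_type == "dcm.agent.error":
        record["status"] = "FAILED"
        record["error"] = data.get("error")  # assumed field name
        return True
    if event_type == "dcm.agent.request-queued":
        record["status"] = "QUEUED"
        return True  # the Placement Manager owns the queued-request timeout

    # Placeholder for unprocessable responses.
    raise ValueError(f"unhandled CloudEvent type: {event_type}")
```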
#### Error Handling -- **404 Not Found**: Service Provider with the given `providerName` is not - registered +- **404 Not Found**: Agent with the given `agentName` is not registered - **400 Bad Request**: Invalid request schema -- **503 Service Unavailable**: Service Provider is not healthy +- **503 Service Unavailable**: Agent is Unavailable (missed heartbeats) or + Congested (consumer lag threshold exceeded) - **500 Internal Server Error**: Unexpected error in SP Resource Manager + +### Next Steps + +- Dead-letter handling for unprocessable responses +- Batch publishing of CloudEvents +- Per-agent response timeout configuration diff --git a/enhancements/user-flows/user-flows.md b/enhancements/user-flows/user-flows.md index f5a8c5a..7ae1c78 100644 --- a/enhancements/user-flows/user-flows.md +++ b/enhancements/user-flows/user-flows.md @@ -15,13 +15,14 @@ see-also: - "/enhancements/kubevirt-sp/kubevirt-sp.md" - "/enhancements/k8s-container-sp/k8s-container-sp.md" - "/enhancements/acm-cluster-sp/acm-cluster-sp.md" + - "/enhancements/environment-agent/environment-agent.md" --- # DCM User Flows This document summarizes the primary user flows in the DCM system, covering policy management, service type and catalog item management, service provider -lifecycle, and end-to-end CatalogItemInstance creation. +and agent lifecycle, and end-to-end CatalogItemInstance creation and deletion. ## Table of Contents @@ -35,16 +36,22 @@ lifecycle, and end-to-end CatalogItemInstance creation. - [4. Managing CatalogItems](#4-managing-catalogitems) - [4.1 Create CatalogItem](#41-create-catalogitem) - [4.2 CatalogItem to ServiceType Translation](#42-catalogitem-to-servicetype-translation) -- [5. 
Service Provider Lifecycle](#5-service-provider-lifecycle) - - [5.1 Service Provider Registration](#51-service-provider-registration) - - [5.2 Service Provider Health Checks](#52-service-provider-health-checks) - - [5.3 Service Provider Status Reporting](#53-service-provider-status-reporting) +- [5. Service Provider & Agent Lifecycle](#5-service-provider--agent-lifecycle) + - [5.1 Service Provider Registration (SP → Agent)](#51-service-provider-registration-sp--agent) + - [5.2 Agent Registration (Agent → DCM)](#52-agent-registration-agent--dcm) + - [5.3 Health Monitoring](#53-health-monitoring) + - [5.3.1 SP Health (Agent → SP)](#531-sp-health-agent--sp) + - [5.3.2 Agent Health (Agent → DCM heartbeats)](#532-agent-health-agent--dcm-heartbeats) + - [5.3.3 Consumer Lag Monitoring](#533-consumer-lag-monitoring) + - [5.4 Service Provider Status Reporting](#54-service-provider-status-reporting) + - [5.5 Agent Lifecycle](#55-agent-lifecycle) - [6. CatalogItemInstance Creation (End-to-End)](#6-catalogiteminstance-creation-end-to-end) - [6.1 Full Creation Flow](#61-full-creation-flow) - [6.2 Placement Manager Flow](#62-placement-manager-flow) - [6.3 SP Resource Manager Flow](#63-sp-resource-manager-flow) - [6.4 Service Provider Instance Creation](#64-service-provider-instance-creation) - [6.5 Continuous Status Reporting](#65-continuous-status-reporting) + - [6.6 Deletion Flow](#66-deletion-flow) --- @@ -52,16 +59,17 @@ lifecycle, and end-to-end CatalogItemInstance creation. 
 The DCM system is composed of the following core components:

-| Component | Responsibility |
-| ---------------------------------- | ----------------------------------------------------------------------------------------------------- |
-| **Catalog Manager** | Entry point for user requests; manages CatalogItems and CatalogItemInstances |
-| **Catalog DB** | Stores CatalogItems, CatalogItemInstances, and ServiceType definitions |
-| **Placement Manager** | Orchestrates instance creation; coordinates policy evaluation and SP selection |
-| **Policy Manager (Policy Engine)** | Validates, mutates, and selects Service Providers via REGO policies and OPA |
-| **SP Resource Manager** | Intermediary between Placement Manager and Service Providers; handles SP lookup and health validation |
-| **Service Registry** | Stores Service Provider registration, endpoints, and metadata |
-| **Service Providers** | Execute infrastructure provisioning (KubeVirt SP, K8s Container SP, ACM Cluster SP) |
-| **Messaging System** | Handles CloudEvents for asynchronous status reporting (NATS) |
+| Component | Responsibility |
+| ---------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
+| **Catalog Manager** | Entry point for user requests; manages CatalogItems and CatalogItemInstances |
+| **Catalog DB** | Stores CatalogItems, CatalogItemInstances, and ServiceType definitions |
+| **Placement Manager** | Orchestrates instance creation; coordinates policy evaluation and agent selection |
+| **Policy Manager (Policy Engine)** | Validates, mutates, and selects Agents via REGO policies and OPA |
+| **SP Resource Manager** | Intermediary between Placement Manager and Agents; publishes CloudEvents to agent topics; consumes responses |
+| **Agent Registry** | Stores Agent registration data (name, environment, serviceTypes, topicName, cost, healthStatus) |
+| **Environment Agent** | Runs in target environment; routes creation/deletion requests to SPs; monitors SP health; reports to DCM via heartbeats |
+| **Service Providers** | Execute infrastructure provisioning (KubeVirt SP, K8s Container SP, ACM Cluster SP) |
+| **Messaging System** | Handles CloudEvents for asynchronous request delivery and status reporting (NATS) |

 ```mermaid
 graph TB
@@ -72,10 +80,12 @@ graph TB
     PM[Placement Manager]
     POL[Policy Manager / OPA]
     SPRM[SP Resource Manager]
-    SR[(Service Registry)]
+    AR[(Agent Registry)]
     DB[(Placement DB)]
-    MSG[Messaging System / NATS]
+    MS[Messaging System / NATS]
+    AG1[Agent - Environment 1]
+    AG2[Agent - Environment 2]

     SP1[KubeVirt SP]
     SP2[K8s Container SP]
     SP3[ACM Cluster SP]
@@ -88,15 +98,22 @@ graph TB
     PM --> POL
     PM --> SPRM
     PM --> DB
-    SPRM --> SR
-    SPRM --> SP1
-    SPRM --> SP2
-    SPRM --> SP3
-
-    SP1 -->|status events| MSG
-    SP2 -->|status events| MSG
-    SP3 -->|status events| MSG
-    MSG -->|status updates| SPRM
+    SPRM --> AR
+    SPRM -->|publish requests| MS
+    MS -->|deliver requests| AG1
+    MS -->|deliver requests| AG2
+    AG1 --> SP1
+    AG1 --> SP2
+    AG2 --> SP3
+    AG1 -.->|registration & heartbeat| DCM_API
+    AG2 -.->|registration & heartbeat| DCM_API
+    DCM_API[DCM API]
+    DCM_API --> AR
+
+    SP1 -->|status events| MS
+    SP2 -->|status events| MS
+    SP3 -->|status events| MS
+    MS -->|status updates| SPRM
     SPRM -->|status updates| CM
 ```

@@ -104,9 +121,9 @@ graph TB

 ## 2. Managing Policies

-Policies control validation, mutation, and Service Provider selection for all
-resource requests. They are organized in a three-level hierarchy: **Global**
-(Super Admin), **Tenant** (Tenant Admin), and **User** (End User).
+Policies control validation, mutation, and Agent selection for all resource
+requests. They are organized in a three-level hierarchy: **Global** (Super
+Admin), **Tenant** (Tenant Admin), and **User** (End User).
 ### 2.1 Create Policy

@@ -154,10 +171,10 @@ sequenceDiagram
 ### 2.2 Policy Evaluation

 When a resource request arrives, the Policy Manager fetches all matching enabled
-policies, sorts them by level (Global → Tenant → User) then priority
+policies, sorts them by level (Global > Tenant > User) then priority
 (ascending), and evaluates them in a chain-of-responsibility pipeline. Each
 policy can reject the request, apply patches (mutations), set constraints, and
-influence Service Provider selection.
+influence Agent selection.

 ```mermaid
 sequenceDiagram
@@ -166,14 +183,14 @@ sequenceDiagram
     participant DB as Policy DB
     participant OPA as OPA Engine

-    PM->>PE: POST /api/v1alpha1/policies:evaluateRequest<br/>{service_instance: {spec}}
+    PM->>PE: POST /api/v1alpha1/policies:evaluateRequest<br/>{service_instance: {spec}, available_agents}

     PE->>DB: Fetch enabled policies matching request via label selector
     PE->>PE: Sort by Level (Global→Tenant→User), then Priority (asc)

     loop For each policy in sorted order
-        PE->>OPA: Evaluate policy with:<br/>{spec, provider, constraints, service_provider_constraints}
-        OPA-->>PE: {rejected, patch, constraints,<br/>selected_provider, service_provider_constraints}
+        PE->>OPA: Evaluate policy with:<br/>{spec, agent, constraints, agent_constraints}
+        OPA-->>PE: {rejected, patch, constraints,<br/>selected_agent, agent_constraints}

         alt rejected == true
             PE-->>PM: 406 Not Acceptable (rejection_reason)
@@ -185,12 +202,12 @@ sequenceDiagram
         end

         PE->>PE: Merge constraints into ConstraintContext
-        PE->>PE: Merge service_provider_constraints
+        PE->>PE: Merge agent_constraints
         PE->>PE: Validate & apply patches against constraints
-        PE->>PE: Validate selected_provider against SP constraints
+        PE->>PE: Validate selected_agent against agent constraints
     end

-    PE-->>PM: 200 OK {evaluatedServiceInstance, selectedProvider, status}
+    PE-->>PM: 200 OK {evaluatedServiceInstance, selectedAgent, status}
 ```

 **Evaluation request (Placement Manager → Policy Manager):**
@@ -205,7 +222,21 @@ sequenceDiagram
       "guestOS": { "type": "fedora-39" },
       "metadata": { "name": "fedora-vm" }
     }
-  }
+  },
+  "available_agents": [
+    {
+      "name": "agent-prod-eu-west-1",
+      "environment": "prod-eu-west-1",
+      "serviceTypes": ["vm", "container"],
+      "cost": "medium"
+    },
+    {
+      "name": "agent-dev-us-east-1",
+      "environment": "dev-us-east-1",
+      "serviceTypes": ["vm"],
+      "cost": "low"
+    }
+  ]
 }
 ```

@@ -220,9 +251,18 @@ sequenceDiagram
     "guestOS": { "type": "fedora-39" },
     "metadata": { "name": "fedora-vm" }
   },
-  "provider": "",
+  "agent": "",
   "constraints": {},
-  "service_provider_constraints": {}
+  "agent_constraints": {},
+  "available_agents": [
+    {
+      "name": "prod-eu-agent",
+      "environment": "prod-eu-west-1",
+      "serviceTypes": ["vm", "database"],
+      "cost": "medium"
+    }
+  ],
+  "exclude_agents": []
 }
 ```

@@ -240,10 +280,13 @@ sequenceDiagram
     "region": { "const": "us-east-1" },
     "vcpu": { "minimum": 2, "maximum": 8 }
   },
-  "selected_provider": "kubevirt-sp",
-  "service_provider_constraints": {
-    "allow_list": ["kubevirt-sp", "vmware-sp"],
-    "patterns": []
+  "selected_agent": "agent-prod-eu-west-1",
+  "agent_constraints": {
+    "allow_list": ["agent-prod-eu-west-1", "agent-staging-eu-west-1"],
+    "patterns": [],
+    "environment_constraints": {
+      "allow_list": ["prod-eu-west-1", "staging-eu-west-1"]
+    }
   }
 }
 ```
@@ -253,7 +296,7 @@ sequenceDiagram
 ```json
 {
   "evaluatedServiceInstance": { "...": "final mutated spec" },
-  "selectedProvider": "kubevirt-sp",
+  "selectedAgent": "agent-prod-eu-west-1",
   "status": "APPROVED | MODIFIED"
 }
 ```
@@ -431,32 +474,59 @@ sequenceDiagram

 ---

-## 5. Service Provider Lifecycle
+## 5. Service Provider & Agent Lifecycle

-### 5.1 Service Provider Registration
+Service Providers register with the Environment Agent in their target
+environment. The Agent registers with DCM and acts as the intermediary for
+resource operation requests. For full details on agent behavior, see the
+[Environment Agent enhancement](/enhancements/environment-agent/environment-agent.md).

-Service Providers register with DCM per service type. Registration is idempotent
-— re-registering with the same name updates the existing entry.
+### 5.1 Service Provider Registration (SP → Agent)
+
+The Agent supports a hybrid SP model: it ships with embedded SP code for known
+service types (K8s Container, ACM Cluster, KubeVirt), enabled via configuration,
+and also accepts external ("bring your own") SPs that register via the REST API.
+Only one SP — embedded or external — may serve a given service type per agent;
+duplicate registrations are rejected with `409 Conflict`. Embedded SPs register
+internally at agent startup; external SPs register via `POST /api/v1/providers`.
+Registration is idempotent — re-registering with the same name updates the
+existing entry. External SPs periodically re-register to maintain their lease,
+which also ensures that after an agent restart, SPs naturally rebuild the
+agent's state.

 ```mermaid
 sequenceDiagram
     participant SP as Service Provider
-    participant SR as Service Registry
+    participant AG as Agent
+    participant DCM as DCM Control Plane
+    participant DB as Database
+
+    Note over AG: Embedded SPs registered<br/>internally at startup

-    SP->>SR: POST /api/v1/providers<br/>{name, displayName, endpoint, serviceType, metadata}
+    SP->>AG: POST /api/v1/providers<br/>{name, displayName, endpoint, serviceType, metadata}

-    alt Name does not exist
-        SR->>SR: Create new SP entry, generate providerID
-        SR-->>SP: 201 Created {id, name, status: "registered"}
+    alt Service type already served by another SP
+        AG-->>SP: 409 Conflict<br/>{error: "service type X already served by provider Y"}
+    else Name does not exist
+        AG->>AG: Create new SP entry, generate providerID
+        AG-->>SP: 201 Created {id, name, status: "registered"}
     else Name exists, same providerID
-        SR->>SR: Update existing entry
-        SR-->>SP: 200 OK {id, name, status: "registered"}
+        AG->>AG: Update existing entry
+        AG-->>SP: 200 OK {id, name, status: "registered"}
     else Name exists, different providerID
-        SR-->>SP: 409 Conflict
+        AG-->>SP: 409 Conflict
     end
+
+    alt Service type list changed AND agent registered to DCM
+        AG->>DCM: POST /api/v1/agents<br/>{name, environment, serviceTypes, cost, topicName}
+        DCM->>DB: Update agent registration
+        DCM-->>AG: 200 OK
+    end
+
+    Note over SP,AG: SP periodically re-registers<br/>to maintain lease
 ```

-**Registration payload example:**
+**Registration payload example (SP → Agent):**

 ```json
 {
@@ -475,47 +545,154 @@ sequenceDiagram
 }
 ```

-### 5.2 Service Provider Health Checks
+### 5.2 Agent Registration (Agent → DCM)

-DCM polls each registered Service Provider's `/health` endpoint at a
-configurable interval (default: every 10 seconds). Health status determines
-whether a provider can receive new requests.
+The Agent registers with DCM after creating its messaging topics and after at
+least one SP (embedded or external) is registered and healthy. Registration is
+idempotent — the agent `name` is the natural key. On restart, the agent
+re-registers; DCM resets the heartbeat tracker. For full registration details,
+see the
+[Environment Agent enhancement](/enhancements/environment-agent/environment-agent.md).

-#### Health State Diagram
+```mermaid
+sequenceDiagram
+    autonumber
+    participant AG as Agent
+    participant MS as Messaging System
+    participant DCM as DCM Control Plane
+    participant DB as Database
+
+    AG->>MS: Create topics (main + retry)
+    Note over AG: Wait for at least 1 SP<br/>(embedded or external) to register<br/>and be healthy
+
+    AG->>DCM: POST /api/v1/agents<br/>{name, environment, serviceTypes,<br/>resourcesAvailable, cost, topicName}
+    DCM->>DB: Store agent registration
+    DCM-->>AG: 201 Created {agentId}
+```
+
+**Registration payload (Agent → DCM):**
+
+```json
+{
+  "name": "agent-prod-eu-west-1",
+  "environment": "prod-eu-west-1",
+  "serviceTypes": ["vm", "container"],
+  "resourcesAvailable": {
+    "totalCpu": 200,
+    "totalMemory": "1TB",
+    "totalStorage": "2TB"
+  },
+  "cost": "medium",
+  "topicName": "dcm.agents.agent-prod-eu-west-1"
+}
+```
+
+### 5.3 Health Monitoring
+
+#### 5.3.1 SP Health (Agent → SP)
+
+The Agent monitors each registered SP's health using a three-state model. The
+monitoring mechanism differs by SP type: embedded SPs are checked in-process (no
+network call), while external SPs are checked by polling their `/health`
+endpoint at a configurable interval.
+
+##### Health State Diagram

 ```mermaid
 stateDiagram-v2
-    [*] --> Ready: Registered
+    [*] --> Ready: Registered with Agent
     Ready --> FailureCount: Failed
     FailureCount --> Ready: OK (reset)
-    FailureCount --> NotReady: Threshold reached
-    NotReady --> Ready: OK (recover)
+    FailureCount --> Unavailable: Threshold reached
+
+    Ready --> Unhealthy: status: unhealthy
+    Unhealthy --> Ready: status: healthy
+    Unhealthy --> Unavailable: Timeout/error threshold
+
+    Unavailable --> Ready: OK (recover)
 ```

-#### Health Check Sequence Diagram
+Three health states:
+
+- **Ready**: SP is healthy and eligible for routing.
+- **Unhealthy**: SP is reachable but reports its backing provider is down. The
+  Agent keeps the service type in its advertised list but stops routing requests
+  to this SP; incoming requests are held in the retry topic until the SP
+  recovers or becomes Unavailable.
+- **Unavailable**: SP is unreachable after exceeding the failure threshold. The
+  Agent removes the service type from its advertised list and updates DCM.
+
+##### Health Check Sequence

 ```mermaid
 sequenceDiagram
-    participant DCM as DCM Health Checker
+    participant AG as Agent
     participant SP as Service Provider

-    loop Every 10 seconds
-        DCM->>SP: GET /health
-        alt HTTP 200 OK
-            SP-->>DCM: 200 {status: "pass"}
-            DCM->>DCM: Reset failure counter, mark Ready
-        else Timeout or non-200
-            SP-->>DCM: Error / Timeout
-            DCM->>DCM: Increment failure counter
+    loop Every {healthCheckInterval} seconds
+        AG->>SP: GET /health
+        alt 200 OK, status: healthy
+            SP-->>AG: {status: "healthy"}
+            AG->>AG: Reset failure counter, mark Ready
+        else 200 OK, status: unhealthy
+            SP-->>AG: {status: "unhealthy"}
+            AG->>AG: Mark Unhealthy<br/>Stop routing, hold requests
+        else Timeout or error
+            SP-->>AG: Error / Timeout
+            AG->>AG: Increment failure counter
             alt Failures >= threshold
-                DCM->>DCM: Mark NotReady
+                AG->>AG: Mark Unavailable<br/>Remove service type, update DCM
             end
         end
     end
 ```

-### 5.3 Service Provider Status Reporting
+#### 5.3.2 Agent Health (Agent → DCM heartbeats)
+
+The Agent reports its own liveness to DCM via periodic REST heartbeats. DCM
+tracks the last heartbeat timestamp for each agent and marks the agent as
+Unavailable if no heartbeat is received within a configurable threshold.
+
+```mermaid
+sequenceDiagram
+    participant AG as Agent
+    participant DCM as DCM Control Plane
+
+    loop Every {heartbeatInterval} seconds
+        AG->>DCM: PUT /api/v1/agents/{agentId}/heartbeat<br/>{timestamp, consumerLag}
+        DCM->>DCM: Update heartbeat, check lag
+        DCM-->>AG: 200 OK
+    end
+
+    Note over DCM: No heartbeat within threshold
+    DCM->>DCM: Mark agent Unavailable
+```
+
+#### 5.3.3 Consumer Lag Monitoring
+
+The Agent self-reports its consumer lag in each heartbeat. If the lag exceeds
+`consumerLagThreshold`, DCM marks the agent as **Congested** and stops routing
+new requests to it. When the lag drops below the threshold, the Congested state
+is cleared.
+
+##### Agent Health State Diagram
+
+```mermaid
+stateDiagram-v2
+    [*] --> Ready: Agent registers
+    Ready --> Congested: lag >= threshold
+    Congested --> Ready: lag < threshold
+    Ready --> Unavailable: Heartbeat timeout
+    Congested --> Unavailable: Heartbeat timeout
+    Unavailable --> Ready: Agent re-registers
+```
+
+### 5.4 Service Provider Status Reporting
+
+> **Note:** Status reporting is not impacted by the Agent layer. SPs publish
+> status CloudEvents directly to the Messaging System. The Agent is not in the
+> status-reporting path.

 Service Providers report instance status changes to DCM via CloudEvents
 published to a messaging system (NATS). This decoupled approach supports
@@ -525,16 +702,16 @@ multiple consumers (billing, auditing, etc.) and scales independently.
 sequenceDiagram
     participant Platform as Underlying Platform<br/>(K8s, KubeVirt, ACM)
     participant SP as Service Provider
-    participant MSG as Messaging System (NATS)
+    participant MS as Messaging System (NATS)
     participant DCM as DCM Core Service
     participant DB as Status DB

     Platform->>SP: State change event<br/>(via informer watch or polling)
     SP->>SP: Map platform status → DCM status
     SP->>SP: Build CloudEvent
-    SP->>MSG: Publish to:<br/>dcm.providers.{provider}.{serviceType}<br/>.instances.{instanceId}.status
+    SP->>MS: Publish to:<br/>dcm.providers.{provider}.{serviceType}<br/>.instances.{instanceId}.status

-    MSG->>DCM: Deliver event
+    MS->>DCM: Deliver event
     DCM->>DCM: Validate CloudEvent schema
     alt Valid
         DCM->>DB: UPSERT instance status
@@ -555,14 +732,39 @@ sequenceDiagram
 | DELETING | | |
 | DELETED | | |

+### 5.5 Agent Lifecycle
+
+This section provides a brief overview of the agent lifecycle. For full details,
+see the
+[Environment Agent enhancement](/enhancements/environment-agent/environment-agent.md).
+
+**Startup:**
+
+1. Agent registers its configured embedded SPs internally (K8s Container, ACM
+   Cluster, KubeVirt — each if enabled in config)
+2. Agent creates messaging topics (main topic + retry topic)
+3. Agent waits for at least one SP (embedded or external) to be registered and
+   healthy
+4. Agent registers with DCM via `POST /api/v1/agents`
+5. Agent begins periodic heartbeats and SP health checking
+
+**Restart:**
+
+1. Agent re-registers with DCM (idempotent; DCM resets heartbeat tracker)
+2. Embedded SPs register internally at startup; external SPs naturally
+   re-register via periodic lease renewal, rebuilding agent state
+3. Unconsumed messages on both main and retry topics survive (messaging system
+   persistence)
+4. Agent resumes consuming from both topics once fully initialized
+
 ---

 ## 6. CatalogItemInstance Creation (End-to-End)

 This is the primary user flow: creating an infrastructure resource from a
 CatalogItem. The request flows through the Catalog Manager, Placement Manager
-(with policy evaluation), SP Resource Manager, and finally to the selected
-Service Provider.
+(with policy evaluation), SP Resource Manager (which publishes to the messaging
+system), the Environment Agent, and finally to the selected Service Provider.
 ### 6.1 Full Creation Flow

@@ -574,21 +776,22 @@ sequenceDiagram
     participant DB as Placement DB
     participant PE as Policy Manager
     participant SPRM as SP Resource Manager
-    participant SR as Service Registry
+    participant AR as Agent Registry
+    participant MS as Messaging System
+    participant AG as Agent
     participant SP as Service Provider
-    participant MSG as Messaging System

     User->>CM: Request CatalogItemInstance<br/>(select CatalogItem + customize fields)
     CM->>CM: Validate input, merge with defaults
     CM->>PM: POST /api/v1/resources<br/>{CatalogItemInstance: UUID, spec}

-    %% Intent preservation
     PM->>DB: Store original request (intent)

-    %% Policy evaluation
-    PM->>PE: POST /api/v1alpha1/policies:evaluateRequest<br/>{service_instance: {spec}}
-    PE->>PE: Fetch & sort matching policies<br/>(Global→Tenant→User, by priority)
-    PE->>PE: Evaluate policy chain<br/>(validate, mutate, select SP)
+    PM->>AR: Fetch available agents<br/>(healthy, not Congested, matching serviceType)
+    AR-->>PM: available_agents list
+
+    PM->>PE: POST /api/v1alpha1/policies:evaluateRequest<br/>{service_instance: {spec}, available_agents}
+    PE->>PE: Evaluate policy chain<br/>(validate, mutate, select Agent)

     alt Policy rejects
         PE-->>PM: 406 Not Acceptable
@@ -597,64 +800,87 @@ sequenceDiagram
         CM-->>User: Request denied
     end

-    PE-->>PM: 200 OK<br/>{evaluatedServiceInstance, selectedProvider, status}
-    PM->>DB: Store validated request
+    PE-->>PM: 200 OK<br/>{evaluatedServiceInstance, selectedAgent, status}
+    PM->>DB: Store validated request with agentName

-    %% SP Resource Manager
-    PM->>SPRM: POST /api/v1/service-type-instances<br/>{providerName, spec}
+    PM->>SPRM: POST /api/v1/service-type-instances<br/>{agentName, serviceType, spec}

-    SPRM->>SR: Lookup provider by name
-    alt Provider not found
-        SR-->>SPRM: 404
-        SPRM-->>PM: 404 Not Found
+    SPRM->>AR: Lookup agent, get topicName
+    alt Agent not found or unhealthy/Congested
+        SPRM-->>PM: Error (404/503)
         PM->>DB: Delete records
         PM-->>CM: Error
-        CM-->>User: Provider not found
+        CM-->>User: Agent unavailable
     end

-    SR-->>SPRM: {endpoint, metadata, healthStatus}
-    alt Provider unhealthy
-        SPRM-->>PM: 503 Service Unavailable
-        PM->>DB: Delete records
-        PM-->>CM: Error
-        CM-->>User: Provider unavailable
+    SPRM->>MS: PUBLISH CloudEvent<br/>topic: {topicName}<br/>{resourceId, serviceType, spec}
+    SPRM->>DB: Create instance record
+    SPRM-->>PM: 202 Accepted {instanceId, agentName, status: PENDING}
+    PM-->>CM: 201 Created
+    CM-->>User: Instance created (PENDING)
+
+    Note over MS,AG: Async processing
+    MS->>AG: Deliver creation request
+    AG->>AG: Validate service type, select SP
+    AG->>SP: POST {spEndpoint}/api/v1/{serviceType}<br/>{spec}
+    SP-->>AG: {instanceId, status: PROVISIONING}
+    AG->>MS: PUBLISH CloudEvent<br/>topic: dcm.agents.responses<br/>{resourceId, agentName, topicName,<br/>status: PROVISIONING}
+    MS->>SPRM: Deliver response
+    SPRM->>DB: Update instance: PROVISIONING
+
+    opt Agent queues request (SP Unhealthy)
+        AG->>MS: PUBLISH CloudEvent<br/>topic: dcm.agents.responses<br/>{resourceId, status: QUEUED}
+        MS->>SPRM: Deliver QUEUED response
+        SPRM->>SPRM: Update instance: QUEUED
+        SPRM->>PM: Notify: instance QUEUED
+
+        Note over PM: Start queuedRequestTimeout
+        alt Timeout or timeout = 0
+            PM->>SPRM: DELETE instance
+            PM->>PE: Re-evaluate excluding agent
+            Note over PM: Route to alternative agent<br/>or return error if none available
+        end
     end

-    %% Instance creation
-    SPRM->>SP: POST {endpoint}/api/v1/{serviceType}<br/>{spec}
-    SP->>SP: Create resource on platform
-    SP-->>SPRM: {instanceId, status: PROVISIONING}
-    SPRM->>DB: Persist instance metadata
-    SPRM-->>PM: 202 Accepted {instanceId, status}
-    PM-->>CM: 202 Accepted
-    CM-->>User: Instance created<br/>{instanceId, status: PROVISIONING}
-
-    %% Continuous status reporting
-    Note over SP,MSG: Async status reporting begins
-    SP->>MSG: Publish status CloudEvents<br/>as instance state changes
-    MSG->>PM: Deliver status updates
-    PM->>DB: UPSERT status
+    Note over SP,MS: Status reporting (unchanged)
+    SP->>MS: Publish status CloudEvents
+    MS->>SPRM: Deliver status updates
 ```

+When the SP for the requested service type on the agent is Unhealthy, the Agent
+holds the request in its retry topic and responds with a QUEUED CloudEvent. DCM
+records the QUEUED status. If the SP recovers, the Agent processes the held
+request. If the SP becomes Unavailable, the Agent rejects the held request with
+an error CloudEvent. The Placement Manager handles the QUEUED status via a
+`queuedRequestTimeout` timer (see
+[6.2 Placement Manager Flow](#62-placement-manager-flow)).
+
 ### 6.2 Placement Manager Flow

 The Placement Manager is the central orchestrator. It preserves the user's
-original intent, delegates policy evaluation, and coordinates with the SP
-Resource Manager.
+original intent, fetches available agents, delegates policy evaluation, and
+coordinates with the SP Resource Manager.

 ```mermaid
 flowchart TD
     A[Receive request from Catalog Manager] --> B[Store original request in Placement DB]
-    B --> C[Send to Policy Manager for evaluation]
-    C --> D{Policy approved?}
-    D -->|No| E[Delete intent record]
-    E --> F[Return error to Catalog Manager]
-    D -->|Yes| G[Store validated request in Placement DB]
-    G --> H[Forward to SP Resource Manager<br/>with providerName and validated spec]
-    H --> I{SP Resource Manager<br/>succeeded?}
-    I -->|No| J[Delete records from Placement DB]
-    J --> F
-    I -->|Yes| K[Return 202 Accepted<br/>to Catalog Manager]
+    B --> C[Fetch available agents from Agent Registry]
+    C --> D[Send to Policy Manager for evaluation<br/>with available_agents]
+    D --> E{Policy approved?}
+    E -->|No| F[Delete intent record]
+    F --> G[Return error to Catalog Manager]
+    E -->|Yes| H[Store validated request with agentName]
+    H --> I[Forward to SP Resource Manager<br/>with agentName, serviceType, spec]
+    I --> J{SPRM response?}
+    J -->|Error| K[Delete records from Placement DB]
+    K --> G
+    J -->|202 Accepted| L[Return 201 Created<br/>to Catalog Manager]
+    J -->|QUEUED| M[Start queuedRequestTimeout timer]
+    M --> N{Timeout?}
+    N -->|Yes| O[Send DELETE to SPRM<br/>Re-evaluate excluding agent]
+    O --> P{Alternative agent?}
+    P -->|Yes| I
+    P -->|No| K
 ```

 **Request payload (Catalog Manager → Placement Manager):**
@@ -679,30 +905,27 @@ flowchart TD
 {
   "CatalogItemInstanceId": "f3645f8f-82c1-4efb-888f-318c0ac81a08",
   "resource_name": "fedora-vm",
-  "providerName": "kubevirt-sp",
+  "agentName": "agent-prod-eu-west-1",
   "id": "08aa81d1-a0d2-4d5f-a4df-b80addf07781"
 }
 ```

 ### 6.3 SP Resource Manager Flow

-The SP Resource Manager handles Service Provider lookup, health validation, and
-instance creation delegation.
+The SP Resource Manager handles Agent lookup and publishes creation requests as
+CloudEvents to the agent's messaging topic. It no longer calls SP REST endpoints
+directly.

 ```mermaid
 flowchart TD
-    A[Receive request from Placement Manager<br/>providerName + spec] --> B[Query Service Registry<br/>by providerName]
-    B --> C{Provider found?}
+    A[Receive request from Placement Manager<br/>agentName + serviceType + spec] --> B[Query Agent Registry<br/>by agentName]
+    B --> C{Agent found?}
     C -->|No| D[Return 404 Not Found]
-    C -->|Yes| E{Provider healthy?}
+    C -->|Yes| E{Agent healthy<br/>and not Congested?}
     E -->|No| F[Return 503 Service Unavailable]
-    E -->|Yes| G[Forward spec to Service Provider<br/>POST endpoint/api/v1/serviceType]
-    G --> H{SP creation succeeded?}
-    H -->|No| I[Forward error to Placement Manager]
-    H -->|Yes| J[Persist instance in database<br/>instanceId, providerName, metadata]
-    J --> K{DB persist succeeded?}
-    K -->|No| L[Return 500 Internal Server Error]
-    K -->|Yes| M[Return 202 Accepted<br/>instanceId, status]
+    E -->|Yes| G[Publish CloudEvent to agent topic<br/>via Messaging System]
+    G --> H[Create instance record in DB]
+    H --> I[Return 202 Accepted<br/>instanceId, agentName, status: PENDING]
 ```

 ### 6.4 Service Provider Instance Creation
@@ -710,6 +933,11 @@ flowchart TD
 Each Service Provider translates the provider-agnostic ServiceType spec into
 platform-native resources.

+> **Note:** Each service type is served by exactly one SP (embedded or external)
+> per agent — there is no SP selection strategy. The Agent forwards the request
+> to the SP via an in-process call (for embedded SPs) or via REST (for external
+> SPs). The SP's internal behavior is unchanged.
+
 ```mermaid
 flowchart LR
     subgraph KubeVirtSP[KubeVirt SP]
@@ -730,6 +958,10 @@ flowchart LR

 ### 6.5 Continuous Status Reporting

+> **Note:** Status reporting is not impacted by the Agent layer. SPs publish
+> status CloudEvents directly to the Messaging System, bypassing the Agent. This
+> path is unchanged from the pre-agent architecture.
+
 After instance creation, Service Providers continuously monitor the underlying
 platform and report status changes via CloudEvents.

@@ -831,3 +1063,54 @@ graph LR
     HC4 --> DC4
     HC5 --> DC5
 ```
+
+### 6.6 Deletion Flow
+
+The deletion flow follows the same architecture as creation: the request is
+published as a CloudEvent to the agent's messaging topic, and the Agent routes
+it to the appropriate SP.
+
+```mermaid
+sequenceDiagram
+    actor User
+    participant CM as Catalog Manager
+    participant PM as Placement Manager
+    participant DB as Placement DB
+    participant PE as Policy Manager
+    participant SPRM as SP Resource Manager
+    participant MS as Messaging System
+    participant AG as Agent
+    participant SP as Service Provider
+
+    User->>CM: Delete CatalogItemInstance
+    CM->>PM: DELETE /api/v1/resources/{resourceId}
+    PM->>DB: Lookup resource (agentName, serviceType, instanceId)
+
+    PM->>SPRM: DELETE /api/v1/service-type-instances/{instanceId}
+    SPRM->>MS: PUBLISH CloudEvent<br/>topic: {topicName}<br/>type: dcm.request.delete<br/>{resourceId, serviceType}
+    SPRM-->>PM: 202 Accepted
+
+    MS->>AG: Deliver deletion request
+    AG->>SP: DELETE {spEndpoint}/api/v1/{serviceType}/{resourceId}
+    SP-->>AG: {status: DELETING}
+    AG->>MS: PUBLISH CloudEvent<br/>topic: dcm.agents.responses<br/>{resourceId, agentName, topicName,<br/>status: DELETING}
+
+    opt Agent queues request (SP Unhealthy)
+        AG->>MS: PUBLISH CloudEvent<br/>topic: dcm.agents.responses<br/>{resourceId, status: QUEUED}
+        MS->>SPRM: Deliver QUEUED response
+        SPRM->>SPRM: Update instance: QUEUED
+        SPRM->>PM: Notify: instance QUEUED
+
+        Note over PM: Start queuedRequestTimeout
+        alt Timeout or timeout = 0
+            PM->>SPRM: DELETE instance
+            PM->>PE: Re-evaluate excluding agent
+            Note over PM: Route to alternative agent<br/>or return error if none available
+        end
+    end
+
+    Note over SP: SP manages deletion<br/>and reports final status
+    SP->>MS: CloudEvent {status: DELETED}
+    MS->>SPRM: Status update
+    SPRM->>DB: Update status: DELETED
+```
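
Reviewer illustration (not part of the patch): the deletion request that the SP Resource Manager publishes in the diagram above could look roughly like the sketch below. It is a minimal sketch, not the DCM implementation. The `specversion`, `id`, `source`, and `type` attributes are the required CloudEvents 1.0 context attributes, and the event type and `data` fields are taken from the diagram. The `source` value and both helper names (`build_delete_request`, `agent_topic`) are hypothetical. The `dcm.agents.<agentName>` topic pattern is inferred from the registration payload example; in the described flow the SP Resource Manager reads `topicName` from the Agent Registry rather than deriving it.

```python
import json
import uuid
from datetime import datetime, timezone


def build_delete_request(resource_id: str, service_type: str) -> dict:
    """Build a CloudEvents 1.0 envelope for a deletion request (hypothetical helper)."""
    return {
        # Required CloudEvents context attributes.
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": "dcm/sp-resource-manager",  # assumed identifier, not from the patch
        "type": "dcm.request.delete",  # event type named in the deletion diagram
        # Optional context attributes.
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        # Payload fields shown in the deletion diagram.
        "data": {"resourceId": resource_id, "serviceType": service_type},
    }


def agent_topic(agent_name: str) -> str:
    """Derive a topic name following the `dcm.agents.<agentName>` example (assumption)."""
    return f"dcm.agents.{agent_name}"


if __name__ == "__main__":
    event = build_delete_request("08aa81d1-a0d2-4d5f-a4df-b80addf07781", "vm")
    print(agent_topic("agent-prod-eu-west-1"))
    print(json.dumps(event, indent=2))
```

Keeping the envelope construction separate from publishing leaves the messaging client (NATS in this design) out of the sketch, so the payload shape can be reviewed on its own.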