From 0232e2de51bcbb65583b1c7563503d5e57109b0d Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Wed, 3 Jun 2026 21:15:45 +0200 Subject: [PATCH 01/24] docs(environment-agent): add environment agent enhancement Define the environment agent layer that sits between DCM and Service Providers. The agent runs per-cluster, registers to DCM with environment metadata, and routes creation requests via a messaging system. SPs register to the agent (not DCM directly), each serving a single resource type. Includes agent registration, resource creation, SP registration, agent heartbeat, and SP health monitoring flows. Assisted by: Claude Code - claude-opus-4-6 Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 406 ++++++++++++++++++ 1 file changed, 406 insertions(+) create mode 100644 enhancements/environment-agent/environment-agent.md diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md new file mode 100644 index 0000000..d8b7cd3 --- /dev/null +++ b/enhancements/environment-agent/environment-agent.md @@ -0,0 +1,406 @@ +--- +title: Environment Agent +authors: + - "@gabriel-farache" +reviewers: + - "@gciavarrini" + - "@ygalblum" + - "@machacekondra" + - "@jenniferubah" +approvers: + - "" +creation-date: 2026-06-03 +--- + +# Environment Agent + +## Summary +This enhancement aims at adding the notion of environment by adding a layer between the SP and DCM: an agent would run on each environments usable by DCM and the agent would regiester the environment to DCM. +The agent would then use the SPs as plugins for the supported resource types and pass the creation request to the relevant one. This would mean that SPs now serve only 1 specific resource type. 
+This enhancement also propose to change the way the creation request is submitted to the agent (or currently, to the SP): instead of sending a direct request to the agent, DCM wil send the request to a bus that will in turn be consumed by the relevant agent to create the requested resource. + +Additionally, this enhancement defines: +- How Service Providers register to the agent (rather than to DCM directly), allowing the agent to dynamically build and maintain its list of supported resource types. +- How the agent reports its own health to DCM via periodic heartbeats. +- How the agent monitors the health of its registered Service Providers using the three-state health model (Ready, Unhealthy, Unavailable) and updates DCM when the supported resource types change as a result. + +## Motivation +When deploying resources in general, one of the main criterion taken into account is the type of environment in which the resource will be deployed: DEV, INT, VAL, PROD, ... + +Currently, in DCM, a resource's creation request is routed to a given Service Provider (SP) by a policy on the base of several criteria. Once the SP is selected, DCM will send a request to the selected SP to request the creation of the resource. + +There is currently no way for a policy to determine in which environment a SP is running and hence an user cannot explictly set the targeted environment constraint when requesting the creation of a resource. + +Furthermore, with the current way on submitting creation's request, deploying an agent on a cluster would also mean the administrator has to make sure the ports are open for DCM to reach the agent. Changing how the creation's requests are consumed by giving the initiative to the agent would solve this problem and would fit the way K8s/OCP are consuming creation requests: when a manifest is submitted, the manifest is pulled by the application actually creating the resource on the cluster. 
+ + +### Goals + +- Define how the agent registers to DCM +- Define what information the agent gives to DCM while registering +- Define how agents and DCM are communicating +- Define how agents and Service Providers interact with each other +- Define how Service Providers register to the agent +- Define how the agent monitors Service Provider health +- Define how the agent reports its own health to DCM + +### Non-Goals + +- Defining how to use the information registered by the agent to DCM +- Define how agent will provision application (vs simple resource type) + +## Proposal + +### Overview + +For each clusters that can be used by DCM, an agent must be spawn. +The agent will then register to DCM. When doing so, it will provide, amongst other information, the environment on which it's running and the resource types it can serve. + +When starting, the agent will also create a specific topic in a bus (Kafka, NATS, ...) in order for DCM to communicate with the agent. The name of the topic be unique and shared with DCM upon registration. + +Service Providers register directly to the agent (not to DCM). Each SP serves a single resource type and registers itself with the agent via a REST API call. The agent dynamically builds its list of supported resource types based on the SPs that are registered to it. When the list changes (SP registration or health-driven removal), the agent updates DCM accordingly. + +An agent must have at least 1 Service Provider (SP) registered to it. For each resource type advertised as supported to DCM by the agent, there must be at least 1 healthy SP registered supporting the given resource type. + +DCM will send the creation request to the specific topic that was created by the agent. + +The agent will then consume the message, validate it and then pass it to the relevant SP. + +The agent monitors the health of its registered SPs by polling their `/health` endpoint, using the three-state model (Ready, Unhealthy, Unavailable). 
When the last SP serving a given resource type becomes unhealthy or unavailable, the agent removes that resource type from its advertised list and updates DCM. The agent also exposes the health status of each registered SP in its own status, allowing administrators to quickly identify which SPs are causing issues. + +The agent reports its own liveness to DCM via periodic REST heartbeats. DCM tracks the last heartbeat timestamp and marks the agent as unavailable if no heartbeat is received within a configurable threshold. + +The status monitoring will not be impacted: the SP will be the one managing the resource and the current flow will remain the same; the agent is only an intermediary. + +### Architecture +```mermaid +%%{init: {'flowchart': {'rankSpacing': 80, 'nodeSpacing': 10, 'curve': 'linear'}}}%% +flowchart TD + classDef dcm fill:#2d2d2d,color:#ffffff,stroke:#81c784,stroke-width:2px + classDef messaging fill:#2d2d2d,color:#ffffff,stroke:#ffb74d,stroke-width:2px + classDef agent fill:#2d2d2d,color:#ffffff,stroke:#f48fb1,stroke-width:2px + classDef provider fill:#2d2d2d,color:#ffffff,stroke:#90caf9,stroke-width:2px + classDef clusterEnvironment fill:#FFFFFF,stroke:#bdbdbd,stroke-width:2px + + DCM["**DCM**
Control Plane"]:::dcm + MS["**Messaging System**
Subject-based routing"]:::messaging + + subgraph Cluster_Environment["Cluster / Environment"] + direction LR + SPX["**SP**
Resource Type X"]:::provider + AG["**Agent**
Routes creation requests to SP"]:::agent + SPY["**SP**
Resource Type Y"]:::provider + SPX -. Registration .-> AG + SPY -. Registration .-> AG + AG -->|Creation Request| SPX + AG -->|Creation Request| SPY + AG -.->|Health Check| SPX + AG -.->|Health Check| SPY + end + + DCM -->|Creation Request| MS + MS -->|Creation Request| AG + AG -. Registration .-> DCM + AG -. Heartbeat .-> DCM + AG -->|Health Warning| MS + MS -->|Health Warning| DCM + SPX -->|Status| MS + SPY -->|Status| MS + MS -->|Status| DCM + + class Cluster_Environment clusterEnvironment +``` + +#### Flow Description +* The agent is spawn in a cluster serving as a specific environment +* Within the same cluster several Service Providers (SP) are running and serving each a specific resource type +* Each SP registers itself to the agent; the agent dynamically builds its supported resource types list +* The agent creates a specific topic in the bus system +* The agent registers to DCM on startup and sends periodic heartbeats +* DCM sends creation request to the specific topic +* The agent consumes the messages sent to the topic +* The agent routes the creation request to the relevant SP +* The agent periodically health-checks each registered SP; when the last SP for a resource type becomes unhealthy, the agent updates DCM and publishes a health warning through the messaging system +* The status monitoring remains unchanged: each SP manages its resource lifecycle and reports status through the messaging system + +### Agent Registration Flow + +```mermaid +sequenceDiagram + autonumber + participant AG as Agent + participant MS as Messaging System + participant DCM as DCM
(Control Plane) + participant DB as Database + + Note over AG: Agent starts in
cluster / environment + + AG->>MS: Create unique topic + MS-->>AG: Topic created
{topicName} + + AG->>DCM: POST /api/v1/agents
{name, environment, resourceTypes,
resourcesAvailable, cost, topicName} + activate DCM + + DCM->>DB: Store agent registration
{name, environment, resourceTypes,
resourcesAvailable, cost, topicName} + activate DB + DB-->>DCM: Registration stored + deactivate DB + + DCM-->>AG: 201 Created
{agentId} + deactivate DCM +``` + +#### Flow Description +1. The agent starts within a cluster serving a specific environment +2. The agent creates a unique topic in the messaging system to establish a dedicated communication channel +3. The agent registers itself with DCM via a REST API call, providing: + * Name + * Environment + * Supported resource types + * Available resources + * Cost tier + * Topic name +4. DCM persists the registration in the database +5. DCM acknowledges the registration + +### Resource Creation Flow + +```mermaid +sequenceDiagram + autonumber + participant DCM as DCM
(Control Plane) + participant MS as Messaging System + participant AG as Agent + participant SP as Service Provider + + DCM->>MS: PUBLISH creation request
topic: {agentTopicName}
{resourceType, spec} + + MS->>AG: PUSH message + activate AG + + AG->>AG: Validate requested resource type
is supported by an attached SP + + alt Resource type not supported + AG->>MS: PUBLISH CloudEvent
{error: "unsupported resource type"} + MS->>DCM: PUSH error message + else Resource type supported + AG->>SP: POST {spEndpoint}/api/v1/{resourceType}
{spec} + activate SP + + alt SP creation fails + SP-->>AG: Error response + deactivate SP + AG->>MS: PUBLISH CloudEvent
{error: "creation failed", details} + MS->>DCM: PUSH error message + else SP creation succeeds + SP-->>AG: Success response
{instanceId, status: PROVISIONING} + Note over SP: SP manages resource lifecycle
and reports status through
the existing status reporting flow + end + end + deactivate AG +``` + +#### Flow Description +1. DCM publishes the creation request to the agent's dedicated topic in the messaging system +2. The agent consumes the message +3. The agent validates that the requested resource type is supported by one of its attached Service Providers +4. If the resource type is **not supported**: + * The agent publishes an error CloudEvent back to the messaging system + * DCM consumes the error message +5. If the resource type is **supported**: + * The agent forwards the creation request to the relevant SP via REST API + * If the SP returns an **immediate error**: the agent publishes an error CloudEvent back to the messaging system for DCM to consume + * If the SP **accepts** the request: the SP takes over resource lifecycle management and reports status changes through the existing status reporting flow (SP → Messaging System → DCM) + +### SP Registration to Agent + +Service Providers register to the agent rather than to DCM directly. The agent exposes a REST API for SP registration and dynamically maintains its list of supported resource types based on registered SPs. + +SPs periodically re-register with the agent to maintain their registration. This periodic re-registration serves as a lease renewal and ensures that after an agent restart (where the agent loses its in-memory state), SPs naturally re-register without requiring any additional coordination mechanism. + +When the list of supported resource types changes as a result of an SP registration, the agent updates DCM via a `PUT` request with the full updated registration payload. + +```mermaid +sequenceDiagram + autonumber + participant SP as Service Provider + participant AG as Agent + participant DCM as DCM
(Control Plane) + participant DB as Database + + Note over SP: SP starts in
cluster / environment + + SP->>AG: POST /api/v1/providers
{name, resourceType, endpoint} + activate AG + + AG->>AG: Store SP registration
Recompute supported resource types + + alt Resource type list changed + AG->>DCM: PUT /api/v1/agents/{agentId}
{name, environment, resourceTypes,
resourcesAvailable, cost, topicName} + activate DCM + DCM->>DB: Update agent registration + activate DB + DB-->>DCM: Registration updated + deactivate DB + DCM-->>AG: 200 OK + deactivate DCM + end + + AG-->>SP: 201 Created
{providerId} + deactivate AG + + Note over SP,AG: SP periodically re-registers
to maintain its lease +``` + +#### Flow Description +1. The SP starts within the same cluster / environment as the agent +2. The SP registers itself with the agent via a REST API call, providing: + * Name + * Resource type it serves + * Endpoint (URL where the agent can reach the SP) +3. The agent stores the SP registration and recomputes the list of supported resource types +4. If the resource type list changed (new resource type added): + * The agent sends a `PUT` request to DCM with the full updated agent registration + * DCM updates the agent record in the database +5. The agent acknowledges the SP registration +6. The SP periodically re-registers with the agent; the agent handles this idempotently (create or update). This ensures that after an agent restart, SPs naturally rebuild the agent's state without additional coordination + +### Health + +#### Agent Health + +The agent reports its own liveness to DCM via periodic REST heartbeats. Since the messaging system is used for resource operations (creation requests, status updates), the heartbeat uses the existing REST channel that the agent already uses for registration. + +DCM tracks the last heartbeat timestamp for each agent. If no heartbeat is received within a configurable threshold, DCM marks the agent as unavailable. + +On startup, the agent registers to DCM (as described in [Agent Registration Flow](#agent-registration-flow)). If the agent restarts, it re-registers to DCM; DCM handles this idempotently, resetting the heartbeat tracker. + +```mermaid +sequenceDiagram + autonumber + participant AG as Agent + participant DCM as DCM
(Control Plane) + participant DB as Database + + loop Every {heartbeatInterval} seconds + AG->>DCM: PUT /api/v1/agents/{agentId}/heartbeat
{timestamp} + activate DCM + DCM->>DB: Update last heartbeat timestamp + DB-->>DCM: Updated + DCM-->>AG: 200 OK + deactivate DCM + end + + Note over DCM: No heartbeat received
within {threshold} seconds + + DCM->>DB: Mark agent as Unavailable + activate DB + DB-->>DCM: Updated + deactivate DB +``` + +##### Flow Description +1. The agent periodically sends a heartbeat to DCM via a REST `PUT` call +2. DCM updates the agent's last heartbeat timestamp in the database +3. If DCM does not receive a heartbeat within the configured threshold, it marks the agent as **Unavailable** +4. When the agent restarts, its initial registration to DCM resets the heartbeat tracker and the agent status + +#### SP Health Monitoring + +The agent monitors the health of its registered Service Providers by polling their `/health` endpoint, using the three-state health model defined in the [Service Provider Health Check enhancement](../service-provider-health-check/service-provider-health-check.md): + +| State | Condition | +|-------|-----------| +| **Ready** | SP responds with `200 OK` and `status: "healthy"` | +| **Unhealthy** | SP responds with `200 OK` and `status: "unhealthy"` (SP reachable but backing provider unavailable) | +| **Unavailable** | SP does not respond or returns an error, after exceeding the failure threshold | + +With the agent layer, the responsibility for polling SP health shifts from DCM to the agent. The agent and its SPs are co-located in the same cluster, making the agent the natural point to perform health checks. + +When the last SP serving a given resource type transitions to **Unhealthy** or **Unavailable**, the agent: +1. Removes that resource type from its advertised list +2. Sends a `PUT` request to DCM with the updated agent registration (resource types list without the affected type) +3. 
Publishes a health warning CloudEvent to a dedicated health topic in the messaging system, providing DCM with context about the degradation (which SP, which resource type, the reason) + +When a previously unhealthy or unavailable SP recovers (returns `200 OK` with `status: "healthy"`), the agent re-adds the resource type to its list and updates DCM accordingly. + +##### Agent Status + +The agent exposes the health status of each registered SP in its own status. This allows cluster administrators to inspect the agent's status and immediately see which SPs are healthy, unhealthy, or unavailable without having to query each SP individually. + +```json +{ + "agentId": "agent-123", + "name": "cluster-prod-eu-west", + "environment": "PROD", + "status": "Ready", + "providers": [ + { + "providerId": "sp-vm-001", + "name": "vm-provider", + "resourceType": "vm", + "health": "Ready", + "endpoint": "http://vm-provider:8080" + }, + { + "providerId": "sp-db-001", + "name": "db-provider", + "resourceType": "database", + "health": "Unhealthy", + "endpoint": "http://db-provider:8080" + } + ] +} +``` + +```mermaid +sequenceDiagram + autonumber + participant AG as Agent + participant SP as Service Provider + participant MS as Messaging System + participant DCM as DCM
(Control Plane) + participant DB as Database + + loop Every {healthCheckInterval} seconds + AG->>SP: GET /health + alt Healthy + SP-->>AG: 200 OK
{status: "healthy"} + AG->>AG: Reset failure counter
Mark SP as Ready + else Unhealthy + SP-->>AG: 200 OK
{status: "unhealthy"} + AG->>AG: Mark SP as Unhealthy + else No response / error + SP-->>AG: Timeout / Error + AG->>AG: Increment failure counter + Note over AG: If counter >= threshold:
Mark SP as Unavailable + end + end + + Note over AG: Last SP for resource type X
becomes Unhealthy or Unavailable + + AG->>DCM: PUT /api/v1/agents/{agentId}
{updated resourceTypes without X} + activate DCM + DCM->>DB: Update agent registration + DB-->>DCM: Updated + DCM-->>AG: 200 OK + deactivate DCM + + AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.health
{type: "resource-type-unavailable",
agentId, resourceType, reason,
affectedProvider} + MS->>DCM: PUSH health warning +``` + +##### Flow Description +1. The agent periodically polls each registered SP's `GET /health` endpoint +2. Based on the response, the agent updates the SP's health state: + * `200 OK` with `status: "healthy"` → **Ready** (failure counter reset) + * `200 OK` with `status: "unhealthy"` → **Unhealthy** + * Timeout or error → increment failure counter; if counter exceeds threshold → **Unavailable** +3. When the last SP serving a given resource type becomes **Unhealthy** or **Unavailable**: + * The agent removes the resource type from its advertised list + * The agent sends a `PUT` to DCM with the updated registration + * The agent publishes a health warning CloudEvent to the `dcm.agents.health` topic with details about the affected SP and resource type +4. When a previously unhealthy/unavailable SP recovers: + * The agent re-adds the resource type to its list (if it was removed) + * The agent sends a `PUT` to DCM with the updated registration +5. 
The agent exposes the health status of all registered SPs in its own status, allowing administrators to inspect the agent and see per-SP health at a glance \ No newline at end of file From 435b36611db3d71ee8f6d5315346ae7e4e5f6d8f Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Thu, 4 Jun 2026 14:04:16 +0200 Subject: [PATCH 02/24] Rework a little bit the doc Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 173 +++++++++--------- 1 file changed, 86 insertions(+), 87 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index d8b7cd3..b658170 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -48,25 +48,27 @@ Furthermore, with the current way on submitting creation's request, deploying an - Defining how to use the information registered by the agent to DCM - Define how agent will provision application (vs simple resource type) +- Update other enhancement files to reflect the changes introduced by the present document; this will be done in subsequent PRs. ## Proposal ### Overview For each clusters that can be used by DCM, an agent must be spawn. -The agent will then register to DCM. When doing so, it will provide, amongst other information, the environment on which it's running and the resource types it can serve. +The agent will self register to DCM. When doing so, it will provide, amongst other information, the environment on which it's running and the resource types it can serve. -When starting, the agent will also create a specific topic in a bus (Kafka, NATS, ...) in order for DCM to communicate with the agent. The name of the topic be unique and shared with DCM upon registration. +When starting, the agent will also create a specific topic in a bus (Kafka, NATS, ...) in order for DCM to communicate with the agent. 
The name of the topic must be unique and shared with DCM upon registration. Service Providers register directly to the agent (not to DCM). Each SP serves a single resource type and registers itself with the agent via a REST API call. The agent dynamically builds its list of supported resource types based on the SPs that are registered to it. When the list changes (SP registration or health-driven removal), the agent updates DCM accordingly. -An agent must have at least 1 Service Provider (SP) registered to it. For each resource type advertised as supported to DCM by the agent, there must be at least 1 healthy SP registered supporting the given resource type. +An agent must have at least 1 Service Provider (SP) registered to it before self registering to DCM. +For each resource type advertised as supported to DCM by the agent, there must be at least 1 healthy SP registered supporting the given resource type. DCM will send the creation request to the specific topic that was created by the agent. The agent will then consume the message, validate it and then pass it to the relevant SP. -The agent monitors the health of its registered SPs by polling their `/health` endpoint, using the three-state model (Ready, Unhealthy, Unavailable). When the last SP serving a given resource type becomes unhealthy or unavailable, the agent removes that resource type from its advertised list and updates DCM. The agent also exposes the health status of each registered SP in its own status, allowing administrators to quickly identify which SPs are causing issues. +The agent monitors the health of its registered SPs by polling their `/health` endpoint, using the three-state model (Ready, Unhealthy, Unavailable). When the last SP serving a given resource type becomes unhealthy or unavailable, the agent removes that resource type from its advertised list and updates DCM. 
The agent also exposes the health status of each registered SP as custom pod conditions on its own pod, allowing administrators to quickly identify which SPs are causing issues via `oc describe pod`. The agent reports its own liveness to DCM via periodic REST heartbeats. DCM tracks the last heartbeat timestamp and marks the agent as unavailable if no heartbeat is received within a configurable threshold. @@ -116,13 +118,68 @@ flowchart TD * Within the same cluster several Service Providers (SP) are running and serving each a specific resource type * Each SP registers itself to the agent; the agent dynamically builds its supported resource types list * The agent creates a specific topic in the bus system -* The agent registers to DCM on startup and sends periodic heartbeats +* Once at least one SP is registered and healthy, the agent self-registers to DCM and begins sending periodic heartbeats * DCM sends creation request to the specific topic * The agent consumes the messages sent to the topic * The agent routes the creation request to the relevant SP * The agent periodically health-checks each registered SP; when the last SP for a resource type becomes unhealthy, the agent updates DCM and publishes a health warning through the messaging system * The status monitoring remains unchanged: each SP manages its resource lifecycle and reports status through the messaging system +### SP Registration to Agent + +Service Providers register to the agent rather than to DCM directly. The agent exposes a REST API for SP registration and dynamically maintains its list of supported resource types based on registered SPs. + +SPs periodically re-register with the agent to maintain their registration. This periodic re-registration serves as a lease renewal and ensures that after an agent restart (where the agent loses its in-memory state), SPs naturally re-register without requiring any additional coordination mechanism. 
+ +When the list of supported resource types changes as a result of an SP registration and the agent is already registered to DCM, the agent updates DCM via a `PUT` request with the full updated registration payload. If the agent has not yet registered to DCM (i.e., this is the first SP registering), the agent does not send a `PUT`; instead, the SP registration satisfies the prerequisite for the agent to proceed with its initial registration to DCM (see [Agent Registration Flow](#agent-registration-flow)). + +```mermaid +sequenceDiagram + autonumber + participant SP as Service Provider + participant AG as Agent + participant DCM as DCM
(Control Plane) + participant DB as Database + + Note over SP: SP starts in
cluster / environment + + SP->>AG: POST /api/v1/providers
{name, resourceType, endpoint} + activate AG + + AG->>AG: Store SP registration
Recompute supported resource types + + alt Resource type list changed AND agent already registered to DCM + AG->>DCM: PUT /api/v1/agents/{agentId}
{name, environment, resourceTypes,
resourcesAvailable, cost, topicName} + activate DCM + DCM->>DB: Update agent registration + activate DB + DB-->>DCM: Registration updated + deactivate DB + DCM-->>AG: 200 OK + deactivate DCM + else Resource type list changed AND agent not yet registered to DCM + Note over AG: Prerequisite for initial
agent registration is now met
(see Agent Registration Flow) + end + + AG-->>SP: 201 Created
{providerId} + deactivate AG + + Note over SP,AG: SP periodically re-registers
to maintain its lease +``` + +#### Flow Description +1. The SP starts within the same cluster / environment as the agent +2. The SP registers itself with the agent via a REST API call, providing: + * Name + * Resource type it serves + * Endpoint (URL where the agent can reach the SP) +3. The agent stores the SP registration and recomputes the list of supported resource types +4. If the resource type list changed (new resource type added): + * If the agent is already registered to DCM: the agent sends a `PUT` request to DCM with the full updated agent registration; DCM updates the agent record in the database + * If the agent is not yet registered to DCM: the agent does not send a `PUT`; instead, this SP registration satisfies the prerequisite for the agent's initial registration (see [Agent Registration Flow](#agent-registration-flow)) +5. The agent acknowledges the SP registration +6. The SP periodically re-registers with the agent; the agent handles this idempotently (create or update). This ensures that after an agent restart, SPs naturally rebuild the agent's state without additional coordination + ### Agent Registration Flow ```mermaid @@ -138,6 +195,8 @@ sequenceDiagram AG->>MS: Create unique topic MS-->>AG: Topic created
{topicName} + Note over AG: Prerequisite:
At least 1 SP must be
registered and healthy
(see SP Registration to Agent) + AG->>DCM: POST /api/v1/agents
{name, environment, resourceTypes,
resourcesAvailable, cost, topicName} activate DCM @@ -153,15 +212,18 @@ sequenceDiagram #### Flow Description 1. The agent starts within a cluster serving a specific environment 2. The agent creates a unique topic in the messaging system to establish a dedicated communication channel -3. The agent registers itself with DCM via a REST API call, providing: +3. The agent checks whether at least one SP is registered and healthy: + * If at least 1 SP is registered and healthy: the agent proceeds to register to DCM + * Else: the agent waits until at least 1 SP is registered and healthy +4. The agent registers itself with DCM via a REST API call, providing: * Name * Environment * Supported resource types * Available resources * Cost tier * Topic name -4. DCM persists the registration in the database -5. DCM acknowledges the registration +5. DCM persists the registration in the database +6. DCM acknowledges the registration ### Resource Creation Flow @@ -212,59 +274,6 @@ sequenceDiagram * If the SP returns an **immediate error**: the agent publishes an error CloudEvent back to the messaging system for DCM to consume * If the SP **accepts** the request: the SP takes over resource lifecycle management and reports status changes through the existing status reporting flow (SP → Messaging System → DCM) -### SP Registration to Agent - -Service Providers register to the agent rather than to DCM directly. The agent exposes a REST API for SP registration and dynamically maintains its list of supported resource types based on registered SPs. - -SPs periodically re-register with the agent to maintain their registration. This periodic re-registration serves as a lease renewal and ensures that after an agent restart (where the agent loses its in-memory state), SPs naturally re-register without requiring any additional coordination mechanism. 
- -When the list of supported resource types changes as a result of an SP registration, the agent updates DCM via a `PUT` request with the full updated registration payload. - -```mermaid -sequenceDiagram - autonumber - participant SP as Service Provider - participant AG as Agent - participant DCM as DCM
(Control Plane) - participant DB as Database - - Note over SP: SP starts in
cluster / environment - - SP->>AG: POST /api/v1/providers
{name, resourceType, endpoint} - activate AG - - AG->>AG: Store SP registration
Recompute supported resource types - - alt Resource type list changed - AG->>DCM: PUT /api/v1/agents/{agentId}
{name, environment, resourceTypes,
resourcesAvailable, cost, topicName} - activate DCM - DCM->>DB: Update agent registration - activate DB - DB-->>DCM: Registration updated - deactivate DB - DCM-->>AG: 200 OK - deactivate DCM - end - - AG-->>SP: 201 Created
{providerId} - deactivate AG - - Note over SP,AG: SP periodically re-registers
to maintain its lease -``` - -#### Flow Description -1. The SP starts within the same cluster / environment as the agent -2. The SP registers itself with the agent via a REST API call, providing: - * Name - * Resource type it serves - * Endpoint (URL where the agent can reach the SP) -3. The agent stores the SP registration and recomputes the list of supported resource types -4. If the resource type list changed (new resource type added): - * The agent sends a `PUT` request to DCM with the full updated agent registration - * DCM updates the agent record in the database -5. The agent acknowledges the SP registration -6. The SP periodically re-registers with the agent; the agent handles this idempotently (create or update). This ensures that after an agent restart, SPs naturally rebuild the agent's state without additional coordination - ### Health #### Agent Health @@ -326,32 +335,22 @@ When a previously unhealthy or unavailable SP recovers (returns `200 OK` with `s ##### Agent Status -The agent exposes the health status of each registered SP in its own status. This allows cluster administrators to inspect the agent's status and immediately see which SPs are healthy, unhealthy, or unavailable without having to query each SP individually. - -```json -{ - "agentId": "agent-123", - "name": "cluster-prod-eu-west", - "environment": "PROD", - "status": "Ready", - "providers": [ - { - "providerId": "sp-vm-001", - "name": "vm-provider", - "resourceType": "vm", - "health": "Ready", - "endpoint": "http://vm-provider:8080" - }, - { - "providerId": "sp-db-001", - "name": "db-provider", - "resourceType": "database", - "health": "Unhealthy", - "endpoint": "http://db-provider:8080" - } - ] -} +The agent exposes the health status of each registered SP as custom pod conditions on its own pod. 
This allows cluster administrators to inspect the agent's pod (e.g., via `oc describe pod`) and immediately see which SPs are healthy, unhealthy, or unavailable without having to query each SP individually. + +Each registered SP is represented as a separate pod condition, using the SP's provider ID as the condition type. The condition's `status` field reflects whether the SP is healthy (`True`) or not (`False`), and the `reason` and `message` fields provide additional context. + +Example output from `oc describe pod `: + ``` +Conditions: + Type Status Reason Message + sp-vm-001/vm True Ready SP vm-provider serving resource type vm is healthy + sp-db-001/database False Unhealthy SP db-provider serving resource type database is unhealthy +``` + +###### Implementation Detail + +The agent uses [Pod Readiness Gates](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate) to surface per-SP health as custom pod conditions. The agent's pod spec declares a readiness gate for each expected condition type, and the agent application patches its own pod's `status.conditions` via the Kubernetes API using in-cluster authentication (`rest.InClusterConfig()` or equivalent). This requires RBAC permissions on the `pods/status` subresource for the agent's service account. ```mermaid sequenceDiagram @@ -403,4 +402,4 @@ sequenceDiagram 4. When a previously unhealthy/unavailable SP recovers: * The agent re-adds the resource type to its list (if it was removed) * The agent sends a `PUT` to DCM with the updated registration -5. The agent exposes the health status of all registered SPs in its own status, allowing administrators to inspect the agent and see per-SP health at a glance \ No newline at end of file +5. 
The agent exposes the health status of all registered SPs as custom pod conditions on its own pod, allowing administrators to inspect the agent via `oc describe pod` and see per-SP health at a glance \ No newline at end of file From 9429d31af645d4b4790da64a0e464c068237c357 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Fri, 5 Jun 2026 09:44:11 +0200 Subject: [PATCH 03/24] Fix typos Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index b658170..c42c5e6 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -15,9 +15,9 @@ creation-date: 2026-06-03 # Environment Agent ## Summary -This enhancement aims at adding the notion of environment by adding a layer between the SP and DCM: an agent would run on each environments usable by DCM and the agent would regiester the environment to DCM. +This enhancement aims at adding the notion of environment by adding a layer between the SP and DCM: an agent would run on each environment usable by DCM and the agent would register the environment to DCM. The agent would then use the SPs as plugins for the supported resource types and pass the creation request to the relevant one. This would mean that SPs now serve only 1 specific resource type. -This enhancement also propose to change the way the creation request is submitted to the agent (or currently, to the SP): instead of sending a direct request to the agent, DCM wil send the request to a bus that will in turn be consumed by the relevant agent to create the requested resource. 
+This enhancement also proposes to change the way the creation request is submitted to the agent (or currently, to the SP): instead of sending a direct request to the agent, DCM will send the request to a bus that will in turn be consumed by the relevant agent to create the requested resource. Additionally, this enhancement defines: - How Service Providers register to the agent (rather than to DCM directly), allowing the agent to dynamically build and maintain its list of supported resource types. @@ -25,13 +25,13 @@ Additionally, this enhancement defines: - How the agent monitors the health of its registered Service Providers using the three-state health model (Ready, Unhealthy, Unavailable) and updates DCM when the supported resource types change as a result. ## Motivation -When deploying resources in general, one of the main criterion taken into account is the type of environment in which the resource will be deployed: DEV, INT, VAL, PROD, ... +When deploying resources in general, one of the main criterion taken into account is the type of environment in which the resource will be deployed: DEV, INT, VAL, PROD, etc Currently, in DCM, a resource's creation request is routed to a given Service Provider (SP) by a policy on the base of several criteria. Once the SP is selected, DCM will send a request to the selected SP to request the creation of the resource. -There is currently no way for a policy to determine in which environment a SP is running and hence an user cannot explictly set the targeted environment constraint when requesting the creation of a resource. +There is currently no way for a policy to determine in which environment a SP is running and hence a user cannot explicitly set the targeted environment constraint when requesting the creation of a resource. -Furthermore, with the current way on submitting creation's request, deploying an agent on a cluster would also mean the administrator has to make sure the ports are open for DCM to reach the agent. 
Changing how the creation's requests are consumed by giving the initiative to the agent would solve this problem and would fit the way K8s/OCP are consuming creation requests: when a manifest is submitted, the manifest is pulled by the application actually creating the resource on the cluster. +Furthermore, with the current way of submitting creation requests, deploying an agent on a cluster would also mean the administrator has to make sure the ports are open for DCM to reach the agent. Changing how creation requests are consumed by giving the initiative to the agent would solve this problem and would fit the way K8s/OCP are consuming creation requests: when a manifest is submitted, the manifest is pulled by the application actually creating the resource on the cluster. ### Goals @@ -54,7 +54,7 @@ Furthermore, with the current way on submitting creation's request, deploying an ### Overview -For each clusters that can be used by DCM, an agent must be spawn. +For each cluster that can be used by DCM, an agent must be spawn. The agent will self register to DCM. When doing so, it will provide, amongst other information, the environment on which it's running and the resource types it can serve. When starting, the agent will also create a specific topic in a bus (Kafka, NATS, ...) in order for DCM to communicate with the agent. The name of the topic must be unique and shared with DCM upon registration. 
@@ -114,7 +114,7 @@ flowchart TD ``` #### Flow Description -* The agent is spawn in a cluster serving as a specific environment +* The agent is spawned in a cluster serving as a specific environment * Within the same cluster several Service Providers (SP) are running and serving each a specific resource type * Each SP registers itself to the agent; the agent dynamically builds its supported resource types list * The agent creates a specific topic in the bus system From eeed61a544b6a29704246c31a3cbda821a791b4a Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Fri, 5 Jun 2026 11:26:40 +0200 Subject: [PATCH 04/24] Agent can run outside of cluster Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 55 ++++++++++++++----- 1 file changed, 41 insertions(+), 14 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index c42c5e6..4be57f1 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -31,7 +31,7 @@ Currently, in DCM, a resource's creation request is routed to a given Service Pr There is currently no way for a policy to determine in which environment a SP is running and hence a user cannot explicitly set the targeted environment constraint when requesting the creation of a resource. -Furthermore, with the current way of submitting creation requests, deploying an agent on a cluster would also mean the administrator has to make sure the ports are open for DCM to reach the agent. Changing how creation requests are consumed by giving the initiative to the agent would solve this problem and would fit the way K8s/OCP are consuming creation requests: when a manifest is submitted, the manifest is pulled by the application actually creating the resource on the cluster. 
+Furthermore, with the current way of submitting creation requests, the administrator has to make sure the ports are open for DCM to reach the agent. Changing how creation requests are consumed by giving the initiative to the agent would solve this problem: the agent pulls work from a messaging system, removing the need for inbound connectivity. This approach also aligns with the way K8s/OCP consume creation requests, where manifests are pulled by the application creating the resource. ### Goals @@ -54,7 +54,7 @@ Furthermore, with the current way of submitting creation requests, deploying an ### Overview -For each cluster that can be used by DCM, an agent must be spawn. +For each environment that can be used by DCM, an agent must be spawned. The agent will self register to DCM. When doing so, it will provide, amongst other information, the environment on which it's running and the resource types it can serve. When starting, the agent will also create a specific topic in a bus (Kafka, NATS, ...) in order for DCM to communicate with the agent. The name of the topic must be unique and shared with DCM upon registration. @@ -68,7 +68,7 @@ DCM will send the creation request to the specific topic that was created by the The agent will then consume the message, validate it and then pass it to the relevant SP. -The agent monitors the health of its registered SPs by polling their `/health` endpoint, using the three-state model (Ready, Unhealthy, Unavailable). When the last SP serving a given resource type becomes unhealthy or unavailable, the agent removes that resource type from its advertised list and updates DCM. The agent also exposes the health status of each registered SP as custom pod conditions on its own pod, allowing administrators to quickly identify which SPs are causing issues via `oc describe pod`. +The agent monitors the health of its registered SPs by polling their `/health` endpoint, using the three-state model (Ready, Unhealthy, Unavailable). 
When the last SP serving a given resource type becomes unhealthy or unavailable, the agent removes that resource type from its advertised list and updates DCM. The agent exposes the health status of each registered SP via a `/api/v1/status` endpoint. On Kubernetes/OpenShift deployments, the agent additionally surfaces this information as custom pod conditions on its own pod, allowing administrators to quickly identify which SPs are causing issues via `oc describe pod`. The agent reports its own liveness to DCM via periodic REST heartbeats. DCM tracks the last heartbeat timestamp and marks the agent as unavailable if no heartbeat is received within a configurable threshold. @@ -87,7 +87,7 @@ flowchart TD DCM["**DCM**
Control Plane"]:::dcm MS["**Messaging System**
Subject-based routing"]:::messaging - subgraph Cluster_Environment["Cluster / Environment"] + subgraph Target_Environment["Target Environment"] direction LR SPX["**SP**
Resource Type X"]:::provider AG["**Agent**
Routes creation requests to SP"]:::agent @@ -110,12 +110,12 @@ flowchart TD SPY -->|Status| MS MS -->|Status| DCM - class Cluster_Environment clusterEnvironment + class Target_Environment clusterEnvironment ``` #### Flow Description -* The agent is spawned in a cluster serving as a specific environment -* Within the same cluster several Service Providers (SP) are running and serving each a specific resource type +* The agent is spawned in an environment +* Several Service Providers (SP) are running and serving each a specific resource type * Each SP registers itself to the agent; the agent dynamically builds its supported resource types list * The agent creates a specific topic in the bus system * Once at least one SP is registered and healthy, the agent self-registers to DCM and begins sending periodic heartbeats @@ -141,7 +141,7 @@ sequenceDiagram participant DCM as DCM
(Control Plane) participant DB as Database - Note over SP: SP starts in
cluster / environment + Note over SP: SP starts and
registers to the agent SP->>AG: POST /api/v1/providers
{name, resourceType, endpoint} activate AG @@ -168,7 +168,7 @@ sequenceDiagram ``` #### Flow Description -1. The SP starts within the same cluster / environment as the agent +1. The SP starts and registers to the agent 2. The SP registers itself with the agent via a REST API call, providing: * Name * Resource type it serves @@ -190,7 +190,7 @@ sequenceDiagram participant DCM as DCM
(Control Plane) participant DB as Database - Note over AG: Agent starts in
cluster / environment + Note over AG: Agent starts in
target environment AG->>MS: Create unique topic MS-->>AG: Topic created
{topicName} @@ -210,7 +210,7 @@ sequenceDiagram ``` #### Flow Description -1. The agent starts within a cluster serving a specific environment +1. The agent starts and serves a specific environment 2. The agent creates a unique topic in the messaging system to establish a dedicated communication channel 3. The agent checks whether at least one SP is registered and healthy: * If at least 1 SP is registered and healthy: the agent proceeds to register to DCM @@ -324,7 +324,7 @@ The agent monitors the health of its registered Service Providers by polling the | **Unhealthy** | SP responds with `200 OK` and `status: "unhealthy"` (SP reachable but backing provider unavailable) | | **Unavailable** | SP does not respond or returns an error, after exceeding the failure threshold | -With the agent layer, the responsibility for polling SP health shifts from DCM to the agent. The agent and its SPs are co-located in the same cluster, making the agent the natural point to perform health checks. +With the agent layer, the responsibility for polling SP health shifts from DCM to the agent. The agent is the natural point to perform health checks on its registered SPs, as it already maintains the list of SP endpoints. When the last SP serving a given resource type transitions to **Unhealthy** or **Unavailable**, the agent: 1. Removes that resource type from its advertised list @@ -335,7 +335,34 @@ When a previously unhealthy or unavailable SP recovers (returns `200 OK` with `s ##### Agent Status -The agent exposes the health status of each registered SP as custom pod conditions on its own pod. This allows cluster administrators to inspect the agent's pod (e.g., via `oc describe pod`) and immediately see which SPs are healthy, unhealthy, or unavailable without having to query each SP individually. +The agent exposes a `GET /api/v1/status` endpoint that returns the health state of all registered SPs. 
This endpoint is always available, regardless of the deployment mode (Kubernetes, Docker, standalone), and is the primary way to inspect the agent's view of its Service Providers. + +Example response: + +```json +{ + "providers": [ + { + "providerId": "sp-vm-001", + "name": "vm-provider", + "resourceType": "vm", + "status": "Ready", + "lastCheck": "2026-06-05T10:30:00Z" + }, + { + "providerId": "sp-db-001", + "name": "db-provider", + "resourceType": "database", + "status": "Unhealthy", + "lastCheck": "2026-06-05T10:30:00Z" + } + ] +} +``` + +##### Pod Conditions (Kubernetes / OpenShift) + +On Kubernetes or OpenShift deployments, the agent additionally exposes the health status of each registered SP as custom pod conditions on its own pod. This complements the `/api/v1/status` endpoint and allows administrators to inspect the agent's pod (e.g., via `oc describe pod`) and immediately see which SPs are healthy, unhealthy, or unavailable without having to query the agent's REST API. Each registered SP is represented as a separate pod condition, using the SP's provider ID as the condition type. The condition's `status` field reflects whether the SP is healthy (`True`) or not (`False`), and the `reason` and `message` fields provide additional context. @@ -402,4 +429,4 @@ sequenceDiagram 4. When a previously unhealthy/unavailable SP recovers: * The agent re-adds the resource type to its list (if it was removed) * The agent sends a `PUT` to DCM with the updated registration -5. The agent exposes the health status of all registered SPs as custom pod conditions on its own pod, allowing administrators to inspect the agent via `oc describe pod` and see per-SP health at a glance \ No newline at end of file +5. The agent exposes the health status of all registered SPs via the `GET /api/v1/status` endpoint. 
On Kubernetes/OpenShift deployments, the agent additionally surfaces this information as custom pod conditions on its own pod (see [Pod Conditions](#pod-conditions-kubernetes--openshift)) \ No newline at end of file From 42dd0573634eec0632b6aef44e8aac7974bd586e Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Tue, 9 Jun 2026 11:07:44 +0200 Subject: [PATCH 05/24] Reword Signed-off-by: gabriel-farache --- enhancements/environment-agent/environment-agent.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index 4be57f1..dc87bb2 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -16,7 +16,7 @@ creation-date: 2026-06-03 ## Summary This enhancement aims at adding the notion of environment by adding a layer between the SP and DCM: an agent would run on each environment usable by DCM and the agent would register the environment to DCM. -The agent would then use the SPs as plugins for the supported resource types and pass the creation request to the relevant one. This would mean that SPs now serve only 1 specific resource type. +The agent would then use the SPs as plugins for the supported resource types and pass the creation request to the relevant one. This would mean that each SP registration with the agent serves exactly one resource type (though a single SP application may register multiple times for different resource types). This enhancement also proposes to change the way the creation request is submitted to the agent (or currently, to the SP): instead of sending a direct request to the agent, DCM will send the request to a bus that will in turn be consumed by the relevant agent to create the requested resource. 
Additionally, this enhancement defines: From 97f9157bff0268183cfa9755729d7388bf9e46e6 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Tue, 9 Jun 2026 14:39:35 +0200 Subject: [PATCH 06/24] Reword again Signed-off-by: gabriel-farache --- enhancements/environment-agent/environment-agent.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index dc87bb2..ce6a2a3 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -59,7 +59,7 @@ The agent will self register to DCM. When doing so, it will provide, amongst oth When starting, the agent will also create a specific topic in a bus (Kafka, NATS, ...) in order for DCM to communicate with the agent. The name of the topic must be unique and shared with DCM upon registration. -Service Providers register directly to the agent (not to DCM). Each SP serves a single resource type and registers itself with the agent via a REST API call. The agent dynamically builds its list of supported resource types based on the SPs that are registered to it. When the list changes (SP registration or health-driven removal), the agent updates DCM accordingly. +Service Providers register directly to the agent (not to DCM). Each SP registration with the agent serves exactly one resource type, though a single SP application may register multiple times for different resource types. The agent dynamically builds its list of supported resource types based on the SPs that are registered to it. When the list changes (SP registration or health-driven removal), the agent updates DCM accordingly. An agent must have at least 1 Service Provider (SP) registered to it before self registering to DCM. For each resource type advertised as supported to DCM by the agent, there must be at least 1 healthy SP registered supporting the given resource type. 
From 144685a1acd4dd1e16f99f1cac582c845c21593c Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Tue, 9 Jun 2026 14:58:01 +0200 Subject: [PATCH 07/24] Reduce summary Signed-off-by: gabriel-farache --- enhancements/environment-agent/environment-agent.md | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index ce6a2a3..255bed8 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -19,11 +19,6 @@ This enhancement aims at adding the notion of environment by adding a layer betw The agent would then use the SPs as plugins for the supported resource types and pass the creation request to the relevant one. This would mean that each SP registration with the agent serves exactly one resource type (though a single SP application may register multiple times for different resource types). This enhancement also proposes to change the way the creation request is submitted to the agent (or currently, to the SP): instead of sending a direct request to the agent, DCM will send the request to a bus that will in turn be consumed by the relevant agent to create the requested resource. -Additionally, this enhancement defines: -- How Service Providers register to the agent (rather than to DCM directly), allowing the agent to dynamically build and maintain its list of supported resource types. -- How the agent reports its own health to DCM via periodic heartbeats. -- How the agent monitors the health of its registered Service Providers using the three-state health model (Ready, Unhealthy, Unavailable) and updates DCM when the supported resource types change as a result. 
- ## Motivation When deploying resources in general, one of the main criterion taken into account is the type of environment in which the resource will be deployed: DEV, INT, VAL, PROD, etc @@ -40,9 +35,9 @@ Furthermore, with the current way of submitting creation requests, the administr - Define what information the agent gives to DCM while registering - Define how agents and DCM are communicating - Define how agents and Service Providers interact with each other -- Define how Service Providers register to the agent -- Define how the agent monitors Service Provider health -- Define how the agent reports its own health to DCM +- Define how Service Providers register to the agent, allowing the agent to dynamically build and maintain its list of supported resource types +- Define how the agent monitors Service Provider health using the three-state health model (Ready, Unhealthy, Unavailable) and updates DCM when the supported resource types change as a result +- Define how the agent reports its own health to DCM via periodic heartbeats ### Non-Goals From dd2c0cfa8479f007751785e06d9b6bd25613b7fb Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Tue, 9 Jun 2026 17:29:09 +0200 Subject: [PATCH 08/24] Reorganised and reworded based on feedback Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 613 ++++++++++++++---- 1 file changed, 493 insertions(+), 120 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index 255bed8..4895ad8 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -10,24 +10,61 @@ reviewers: approvers: - "" creation-date: 2026-06-03 +see-also: + - "/enhancements/service-provider-health-check/service-provider-health-check.md" + - "/enhancements/state-management/service-provider-status-reporting.md" + - "/enhancements/sp-registration-flow/sp-registration-flow.md" + - 
"/enhancements/placement-manager/placement-manager.md" + - "/enhancements/sp-resource-manager/sp-resource-manager.md" --- # Environment Agent +## Open Questions + +1. Can multiple agent replicas consume from the same topic for high + availability? (deferred to HA iteration) +2. How does an administrator update the agent's cost tier without restarting it? + ## Summary -This enhancement aims at adding the notion of environment by adding a layer between the SP and DCM: an agent would run on each environment usable by DCM and the agent would register the environment to DCM. -The agent would then use the SPs as plugins for the supported resource types and pass the creation request to the relevant one. This would mean that each SP registration with the agent serves exactly one resource type (though a single SP application may register multiple times for different resource types). -This enhancement also proposes to change the way the creation request is submitted to the agent (or currently, to the SP): instead of sending a direct request to the agent, DCM will send the request to a bus that will in turn be consumed by the relevant agent to create the requested resource. + +This enhancement aims at adding the notion of environment by adding a layer +between the SP and DCM: an agent would run on each environment usable by DCM and +the agent would register the environment to DCM. + +The agent would then use the SPs as plugins for the supported service types and +pass the creation request to the relevant one. This would mean that each SP +registration with the agent serves exactly one service type (though a single SP +application may register multiple times for different service types). 
+ +This enhancement also proposes to change the way the creation request is +submitted to the agent (or currently, to the SP): instead of sending a direct +request to the agent, DCM will send the request to a bus that will in turn be +consumed by the relevant agent to create the requested resource. ## Motivation -When deploying resources in general, one of the main criterion taken into account is the type of environment in which the resource will be deployed: DEV, INT, VAL, PROD, etc -Currently, in DCM, a resource's creation request is routed to a given Service Provider (SP) by a policy on the base of several criteria. Once the SP is selected, DCM will send a request to the selected SP to request the creation of the resource. +When deploying resources in general, one of the main criteria taken into +account is the type of environment in which the resource will be deployed: DEV, +INT, VAL, PROD, etc. -There is currently no way for a policy to determine in which environment a SP is running and hence a user cannot explicitly set the targeted environment constraint when requesting the creation of a resource. +Currently, in DCM, a resource's creation request is routed to a given Service +Provider (SP) by a policy on the basis of several criteria. Once the SP is +selected, DCM will send a request to the selected SP to request the creation of +the resource. -Furthermore, with the current way of submitting creation requests, the administrator has to make sure the ports are open for DCM to reach the agent. Changing how creation requests are consumed by giving the initiative to the agent would solve this problem: the agent pulls work from a messaging system, removing the need for inbound connectivity. This approach also aligns with the way K8s/OCP consume creation requests, where manifests are pulled by the application creating the resource.
+There is currently no way for a policy to determine in which environment a SP is +running and hence a user cannot explicitly set the targeted environment +constraint when requesting the creation of a resource. +Furthermore, with the current way of submitting creation requests, the +administrator has to make sure the ports are open for DCM to reach the SP. +Changing how creation requests are consumed by giving the initiative to the +agent would solve this problem: the agent pulls work from a messaging system, +removing the need for DCM-to-environment inbound connectivity for creation +requests. The agent still requires outbound connectivity to DCM for registration +and heartbeats. This approach also aligns with the way K8s/OCP consume creation +requests, where manifests are pulled by the application creating the resource. ### Goals @@ -35,41 +72,76 @@ Furthermore, with the current way of submitting creation requests, the administr - Define what information the agent gives to DCM while registering - Define how agents and DCM are communicating - Define how agents and Service Providers interact with each other -- Define how Service Providers register to the agent, allowing the agent to dynamically build and maintain its list of supported resource types -- Define how the agent monitors Service Provider health using the three-state health model (Ready, Unhealthy, Unavailable) and updates DCM when the supported resource types change as a result +- Define how Service Providers register to the agent, allowing the agent to + dynamically build and maintain its list of supported service types +- Define how the agent monitors Service Provider health using the three-state + health model (Ready, Unhealthy, Unavailable) and updates DCM when the + supported service types change as a result - Define how the agent reports its own health to DCM via periodic heartbeats ### Non-Goals -- Defining how to use the information registered by the agent to DCM -- Define how agent will 
provision application (vs simple resource type) -- Update other enhancement files to reflect the changes introduced by the present document; this will be done in subsequent PRs. +- Defining how to use the information registered by the agent to DCM +- Define how agent will provision application (vs simple service type) +- Update other enhancement files to reflect the changes introduced by the + present document; this will be done in subsequent PRs. ## Proposal ### Overview -For each environment that can be used by DCM, an agent must be spawned. -The agent will self register to DCM. When doing so, it will provide, amongst other information, the environment on which it's running and the resource types it can serve. - -When starting, the agent will also create a specific topic in a bus (Kafka, NATS, ...) in order for DCM to communicate with the agent. The name of the topic must be unique and shared with DCM upon registration. - -Service Providers register directly to the agent (not to DCM). Each SP registration with the agent serves exactly one resource type, though a single SP application may register multiple times for different resource types. The agent dynamically builds its list of supported resource types based on the SPs that are registered to it. When the list changes (SP registration or health-driven removal), the agent updates DCM accordingly. - -An agent must have at least 1 Service Provider (SP) registered to it before self registering to DCM. -For each resource type advertised as supported to DCM by the agent, there must be at least 1 healthy SP registered supporting the given resource type. - -DCM will send the creation request to the specific topic that was created by the agent. - -The agent will then consume the message, validate it and then pass it to the relevant SP. - -The agent monitors the health of its registered SPs by polling their `/health` endpoint, using the three-state model (Ready, Unhealthy, Unavailable). 
When the last SP serving a given resource type becomes unhealthy or unavailable, the agent removes that resource type from its advertised list and updates DCM. The agent exposes the health status of each registered SP via a `/api/v1/status` endpoint. On Kubernetes/OpenShift deployments, the agent additionally surfaces this information as custom pod conditions on its own pod, allowing administrators to quickly identify which SPs are causing issues via `oc describe pod`. - -The agent reports its own liveness to DCM via periodic REST heartbeats. DCM tracks the last heartbeat timestamp and marks the agent as unavailable if no heartbeat is received within a configurable threshold. - -The status monitoring will not be impacted: the SP will be the one managing the resource and the current flow will remain the same; the agent is only an intermediary. +For each environment that can be used by DCM, an agent must be spawned. The +agent will self register to DCM. When doing so, it will provide, amongst other +information, the environment on which it's running and the service types it can +serve. + +When starting, the agent will also create a specific topic in the messaging +system in order for DCM to communicate with the agent. The topic name is +deterministic — either derived from the agent's name or provided via +configuration — ensuring that after a restart the agent reuses the same topic. +If the topic already exists, the agent reuses it. The topic name is unique per +environment and is shared with DCM upon registration. In the current +single-agent model, one agent consumes from the topic. In a future HA model, +multiple agent replicas for the same environment could consume from the same +topic as competing consumers. + +Service Providers register directly to the agent (not to DCM). Each SP +registration with the agent serves exactly one service type, though a single SP +application may register multiple times for different service types. 
The agent +dynamically builds its list of supported service types based on the SPs that are +registered to it. When the list changes (SP registration or health-driven +removal), the agent updates DCM accordingly. + +An agent must have at least 1 Service Provider (SP) registered to it before self +registering to DCM. For each service type advertised as supported to DCM by the +agent, there must be at least 1 healthy SP registered supporting the given +service type. + +DCM will send the creation request to the specific topic that was created by the +agent. + +The agent will then consume the message, validate it and then pass it to the +relevant SP. + +The agent monitors the health of its registered SPs by polling their `/health` +endpoint, using the three-state model (Ready, Unhealthy, Unavailable). When the +last SP serving a given service type becomes unhealthy or unavailable, the agent +removes that service type from its advertised list and updates DCM. The agent +exposes the health status of each registered SP via a `/api/v1/status` endpoint. +On Kubernetes/OpenShift deployments, the agent additionally surfaces this +information as custom pod conditions on its own pod, allowing administrators to +quickly identify which SPs are causing issues via `oc describe pod`. + +The agent reports its own liveness to DCM via periodic REST heartbeats. DCM +tracks the last heartbeat timestamp and marks the agent as unavailable if no +heartbeat is received within a configurable threshold. + +The status monitoring will not be impacted: the SP will be the one managing the +resource and the current flow will remain the same; the agent is only an +intermediary. ### Architecture + ```mermaid %%{init: {'flowchart': {'rankSpacing': 80, 'nodeSpacing': 10, 'curve': 'linear'}}}%% flowchart TD @@ -84,9 +156,9 @@ flowchart TD subgraph Target_Environment["Target Environment"] direction LR - SPX["**SP**
Resource Type X"]:::provider + SPX["**SP**
Service Type X"]:::provider AG["**Agent**
Routes creation requests to SP"]:::agent - SPY["**SP**
Resource Type Y"]:::provider + SPY["**SP**
Service Type Y"]:::provider SPX -. Registration .-> AG SPY -. Registration .-> AG AG -->|Creation Request| SPX @@ -109,24 +181,42 @@ flowchart TD ``` #### Flow Description -* The agent is spawned in an environment -* Several Service Providers (SP) are running and serving each a specific resource type -* Each SP registers itself to the agent; the agent dynamically builds its supported resource types list -* The agent creates a specific topic in the bus system -* Once at least one SP is registered and healthy, the agent self-registers to DCM and begins sending periodic heartbeats -* DCM sends creation request to the specific topic -* The agent consumes the messages sent to the topic -* The agent routes the creation request to the relevant SP -* The agent periodically health-checks each registered SP; when the last SP for a resource type becomes unhealthy, the agent updates DCM and publishes a health warning through the messaging system -* The status monitoring remains unchanged: each SP manages its resource lifecycle and reports status through the messaging system + +- The agent is spawned in an environment +- Several Service Providers (SP) are running and serving each a specific service + type +- Each SP registers itself to the agent; the agent dynamically builds its + supported service types list +- The agent creates a specific topic in the bus system +- Once at least one SP is registered and healthy, the agent self-registers to + DCM and begins sending periodic heartbeats +- DCM sends creation request to the specific topic +- The agent consumes the messages sent to the topic +- The agent routes the creation request to the relevant SP +- The agent periodically health-checks each registered SP; when the last SP for + a service type becomes unhealthy, the agent updates DCM and publishes a health + warning through the messaging system +- The status monitoring remains unchanged: each SP manages its resource + lifecycle and reports status through the messaging system 
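The bookkeeping implied by the flow above (SP registration, recomputing the advertised service types, health-driven removal, and routing) can be sketched as follows. This is a minimal illustration, not the agent's actual implementation: `AgentRegistry`, `Provider`, `record_health`, and `route` are hypothetical names, state is kept in memory only, and a newly registered SP is assumed to be Ready until its first health check. Each method that can change the advertised list returns a boolean, which is the point where the agent would send its `PUT` to DCM.

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Health(Enum):
    READY = "Ready"
    UNHEALTHY = "Unhealthy"
    UNAVAILABLE = "Unavailable"


@dataclass
class Provider:
    name: str
    service_type: str
    endpoint: str
    health: Health = Health.READY  # assumption: Ready until the first poll
    failures: int = 0


class AgentRegistry:
    """Hypothetical in-memory view of the SPs registered to one agent."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        # One entry per (SP name, service type) registration.
        self.providers: Dict[Tuple[str, str], Provider] = {}

    def supported_service_types(self) -> List[str]:
        # Only service types with at least one Ready SP are advertised.
        return sorted({p.service_type for p in self.providers.values()
                       if p.health is Health.READY})

    def register(self, name: str, service_type: str, endpoint: str) -> bool:
        """Idempotent create-or-update; True if the advertised list changed."""
        before = self.supported_service_types()
        key = (name, service_type)
        if key in self.providers:
            self.providers[key].endpoint = endpoint  # periodic re-registration
        else:
            self.providers[key] = Provider(name, service_type, endpoint)
        return self.supported_service_types() != before

    def record_health(self, name: str, service_type: str,
                      http_status: Optional[int],
                      status: Optional[str]) -> bool:
        """Apply one /health poll result; True if the advertised list changed.

        http_status is None when the poll timed out or could not connect.
        """
        before = self.supported_service_types()
        provider = self.providers[(name, service_type)]
        if http_status == 200 and status == "healthy":
            provider.health = Health.READY
            provider.failures = 0
        elif http_status == 200 and status == "unhealthy":
            provider.health = Health.UNHEALTHY
        else:
            provider.failures += 1
            if provider.failures > self.failure_threshold:
                provider.health = Health.UNAVAILABLE
        return self.supported_service_types() != before

    def route(self, service_type: str) -> Optional[Provider]:
        """Pick a random Ready SP for the service type, or None if none exists."""
        ready = [p for p in self.providers.values()
                 if p.service_type == service_type and p.health is Health.READY]
        return random.choice(ready) if ready else None
```

In this sketch, a `None` result from `route` corresponds to the case where the agent would publish an error CloudEvent instead of forwarding the request, and a `True` result from `register` or `record_health` corresponds to the agent updating its registration in DCM.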
### SP Registration to Agent -Service Providers register to the agent rather than to DCM directly. The agent exposes a REST API for SP registration and dynamically maintains its list of supported resource types based on registered SPs. +Service Providers register to the agent rather than to DCM directly. The agent +exposes a REST API for SP registration and dynamically maintains its list of +supported service types based on registered SPs. -SPs periodically re-register with the agent to maintain their registration. This periodic re-registration serves as a lease renewal and ensures that after an agent restart (where the agent loses its in-memory state), SPs naturally re-register without requiring any additional coordination mechanism. +SPs periodically re-register with the agent to maintain their registration. This +periodic re-registration serves as a lease renewal and ensures that after an +agent restart (where the agent loses its in-memory state), SPs naturally +re-register without requiring any additional coordination mechanism. -When the list of supported resource types changes as a result of an SP registration and the agent is already registered to DCM, the agent updates DCM via a `PUT` request with the full updated registration payload. If the agent has not yet registered to DCM (i.e., this is the first SP registering), the agent does not send a `PUT`; instead, the SP registration satisfies the prerequisite for the agent to proceed with its initial registration to DCM (see [Agent Registration Flow](#agent-registration-flow)). +When the list of supported service types changes as a result of an SP +registration and the agent is already registered to DCM, the agent updates DCM +via a `PUT` request with the full updated registration payload. 
If the agent has +not yet registered to DCM (i.e., this is the first SP registering), the agent +does not send a `PUT`; instead, the SP registration satisfies the prerequisite +for the agent to proceed with its initial registration to DCM (see +[Agent Registration Flow](#agent-registration-flow)). ```mermaid sequenceDiagram @@ -138,13 +228,13 @@ sequenceDiagram Note over SP: SP starts and
registers to the agent - SP->>AG: POST /api/v1/providers
{name, resourceType, endpoint} + SP->>AG: POST /api/v1/providers
{name, serviceType, endpoint} activate AG - AG->>AG: Store SP registration
Recompute supported resource types + AG->>AG: Store SP registration
Recompute supported service types alt Resource type list changed AND agent already registered to DCM - AG->>DCM: PUT /api/v1/agents/{agentId}
{name, environment, resourceTypes,
resourcesAvailable, cost, topicName} + AG->>DCM: PUT /api/v1/agents/{agentId}
{name, environment, serviceTypes,
resourcesAvailable, cost, topicName} activate DCM DCM->>DB: Update agent registration activate DB @@ -163,17 +253,26 @@ sequenceDiagram ``` #### Flow Description + 1. The SP starts and registers to the agent 2. The SP registers itself with the agent via a REST API call, providing: - * Name - * Resource type it serves - * Endpoint (URL where the agent can reach the SP) -3. The agent stores the SP registration and recomputes the list of supported resource types -4. If the resource type list changed (new resource type added): - * If the agent is already registered to DCM: the agent sends a `PUT` request to DCM with the full updated agent registration; DCM updates the agent record in the database - * If the agent is not yet registered to DCM: the agent does not send a `PUT`; instead, this SP registration satisfies the prerequisite for the agent's initial registration (see [Agent Registration Flow](#agent-registration-flow)) + - Name + - Service type it serves + - Endpoint (URL where the agent can reach the SP) +3. The agent stores the SP registration and recomputes the list of supported + service types +4. If the service type list changed (new service type added): + - If the agent is already registered to DCM: the agent sends a `PUT` request + to DCM with the full updated agent registration; DCM updates the agent + record in the database + - If the agent is not yet registered to DCM: the agent does not send a `PUT`; + instead, this SP registration satisfies the prerequisite for the agent's + initial registration (see + [Agent Registration Flow](#agent-registration-flow)) 5. The agent acknowledges the SP registration -6. The SP periodically re-registers with the agent; the agent handles this idempotently (create or update). This ensures that after an agent restart, SPs naturally rebuild the agent's state without additional coordination +6. The SP periodically re-registers with the agent; the agent handles this + idempotently (create or update). 
This ensures that after an agent restart, + SPs naturally rebuild the agent's state without additional coordination ### Agent Registration Flow @@ -187,15 +286,15 @@ sequenceDiagram Note over AG: Agent starts in
target environment - AG->>MS: Create unique topic + AG->>MS: Create topic (deterministic name) MS-->>AG: Topic created
{topicName} Note over AG: Prerequisite:
At least 1 SP must be
registered and healthy
(see SP Registration to Agent) - AG->>DCM: POST /api/v1/agents
{name, environment, resourceTypes,
resourcesAvailable, cost, topicName} + AG->>DCM: POST /api/v1/agents
{name, environment, serviceTypes,
resourcesAvailable, cost, topicName} activate DCM - DCM->>DB: Store agent registration
{name, environment, resourceTypes,
resourcesAvailable, cost, topicName} + DCM->>DB: Store agent registration
{name, environment, serviceTypes,
resourcesAvailable, cost, topicName} activate DB DB-->>DCM: Registration stored deactivate DB @@ -205,18 +304,21 @@ sequenceDiagram ``` #### Flow Description + 1. The agent starts and serves a specific environment -2. The agent creates a unique topic in the messaging system to establish a dedicated communication channel +2. The agent creates a topic in the messaging system (using a deterministic + name) to establish a dedicated communication channel 3. The agent checks whether at least one SP is registered and healthy: - * If at least 1 SP is registered and healthy: the agent proceeds to register to DCM - * Else: the agent waits until at least 1 SP is registered and healthy + - If at least 1 SP is registered and healthy: the agent proceeds to register + to DCM + - Else: the agent waits until at least 1 SP is registered and healthy 4. The agent registers itself with DCM via a REST API call, providing: - * Name - * Environment - * Supported resource types - * Available resources - * Cost tier - * Topic name + - Name + - Environment + - Supported service types + - Available resources + - Cost tier + - Topic name 5. DCM persists the registration in the database 6. DCM acknowledges the registration @@ -230,18 +332,18 @@ sequenceDiagram participant AG as Agent participant SP as Service Provider - DCM->>MS: PUBLISH creation request
topic: {agentTopicName}
{resourceType, spec} + DCM->>MS: PUBLISH creation request
topic: {agentTopicName}
{serviceType, spec} MS->>AG: PUSH message activate AG - AG->>AG: Validate requested resource type
is supported by an attached SP + AG->>AG: Validate requested service type
is supported by an attached SP alt Resource type not supported - AG->>MS: PUBLISH CloudEvent
{error: "unsupported resource type"} + AG->>MS: PUBLISH CloudEvent
{error: "unsupported service type"} MS->>DCM: PUSH error message else Resource type supported - AG->>SP: POST {spEndpoint}/api/v1/{resourceType}
{spec} + AG->>SP: POST {spEndpoint}/api/v1/{serviceType}
{spec} activate SP alt SP creation fails @@ -258,26 +360,125 @@ sequenceDiagram ``` #### Flow Description -1. DCM publishes the creation request to the agent's dedicated topic in the messaging system + +1. DCM publishes the creation request to the agent's dedicated topic in the + messaging system 2. The agent consumes the message -3. The agent validates that the requested resource type is supported by one of its attached Service Providers -4. If the resource type is **not supported**: - * The agent publishes an error CloudEvent back to the messaging system - * DCM consumes the error message -5. If the resource type is **supported**: - * The agent forwards the creation request to the relevant SP via REST API - * If the SP returns an **immediate error**: the agent publishes an error CloudEvent back to the messaging system for DCM to consume - * If the SP **accepts** the request: the SP takes over resource lifecycle management and reports status changes through the existing status reporting flow (SP → Messaging System → DCM) +3. The agent validates that the requested service type is supported by one of + its attached Service Providers +4. If the service type is **not supported**: + - The agent publishes an error CloudEvent back to the messaging system + - DCM consumes the error message +5. If the service type is **supported**: + - The agent forwards the creation request to the relevant SP via REST API + - If the SP returns an **immediate error**: the agent publishes an error + CloudEvent back to the messaging system for DCM to consume + - If the SP **accepts** the request: the SP takes over resource lifecycle + management and reports status changes through the existing status reporting + flow (SP → Messaging System → DCM) + +#### SP Selection Strategy + +When multiple SPs are registered for the same service type, the agent selects +one randomly. 
Future iterations may introduce affinity-based or capacity-based +selection strategies (e.g., selecting the SP with the most available resources, +similar to pod affinity in Kubernetes). + +#### Retry Policy + +When the agent forwards a creation request to an SP and the SP returns an error, +the agent applies a configurable retry policy. When retries are exhausted, the +agent publishes an error CloudEvent to the messaging system with the resource ID +(provided by DCM in the original creation request), allowing DCM to track the +failure. + +#### In-Flight Request Handling + +When the agent restarts, unconsumed messages remain on the topic and are +consumed once the agent is back up (guaranteed by the messaging system's +persistence layer). When all SPs for a given service type are unhealthy or +unavailable, the agent responds with an error CloudEvent for each incoming +creation request targeting that service type. + +### Resource Deletion Flow + +```mermaid +sequenceDiagram + autonumber + participant DCM as DCM
(Control Plane) + participant MS as Messaging System + participant AG as Agent + participant SP as Service Provider + + DCM->>MS: PUBLISH deletion request
topic: {agentTopicName}
{serviceType, resourceId} + + MS->>AG: PUSH message + activate AG + + AG->>AG: Validate requested service type
is supported by an attached SP + + alt Service type not supported + AG->>MS: PUBLISH CloudEvent
{error: "unsupported service type"} + MS->>DCM: PUSH error message + else Service type supported + AG->>SP: DELETE {spEndpoint}/api/v1/{serviceType}/{resourceId} + activate SP + + alt SP deletion fails + SP-->>AG: Error response + deactivate SP + AG->>MS: PUBLISH CloudEvent
{error: "deletion failed",
resourceId, details} + MS->>DCM: PUSH error message + else SP deletion succeeds + SP-->>AG: Success response
{resourceId, status: DELETING} + AG->>MS: PUBLISH CloudEvent
{resourceId, status: DELETING} + MS->>DCM: PUSH deletion acknowledged + Note over SP: SP manages resource deletion
and reports final status through
the existing status reporting flow + end + end + deactivate AG +``` + +#### Flow Description + +1. DCM publishes the deletion request to the agent's dedicated topic in the + messaging system, including the service type and resource ID +2. The agent consumes the message +3. The agent validates that the requested service type is supported by one of + its attached Service Providers +4. If the service type is **not supported**: + - The agent publishes an error CloudEvent back to the messaging system + - DCM consumes the error message +5. If the service type is **supported**: + - The agent forwards the deletion request to the relevant SP via a REST + `DELETE` call + - If the SP returns an **immediate error**: the agent publishes an error + CloudEvent back to the messaging system for DCM to consume + - If the SP **accepts** the request: the agent publishes a CloudEvent + acknowledging the deletion is in progress. The SP manages the actual + resource deletion and reports the final status through the existing status + reporting flow (SP → Messaging System → DCM) + +The retry policy and in-flight request handling described in the +[Resource Creation Flow](#resource-creation-flow) apply equally to deletion +requests. ### Health #### Agent Health -The agent reports its own liveness to DCM via periodic REST heartbeats. Since the messaging system is used for resource operations (creation requests, status updates), the heartbeat uses the existing REST channel that the agent already uses for registration. +The agent reports its own liveness to DCM via periodic REST heartbeats. Since +the messaging system is used for resource operations (creation requests, status +updates), the heartbeat uses the existing REST channel that the agent already +uses for registration. -DCM tracks the last heartbeat timestamp for each agent. If no heartbeat is received within a configurable threshold, DCM marks the agent as unavailable. +DCM tracks the last heartbeat timestamp for each agent. 
If no heartbeat is +received within a configurable threshold, DCM marks the agent as unavailable. -On startup, the agent registers to DCM (as described in [Agent Registration Flow](#agent-registration-flow)). If the agent restarts, it re-registers to DCM; DCM handles this idempotently, resetting the heartbeat tracker. +On startup, the agent registers to DCM (as described in +[Agent Registration Flow](#agent-registration-flow)). If the agent restarts, it +re-registers to DCM; DCM handles this idempotently, resetting the heartbeat +tracker. ```mermaid sequenceDiagram @@ -304,33 +505,55 @@ sequenceDiagram ``` ##### Flow Description + 1. The agent periodically sends a heartbeat to DCM via a REST `PUT` call 2. DCM updates the agent's last heartbeat timestamp in the database -3. If DCM does not receive a heartbeat within the configured threshold, it marks the agent as **Unavailable** -4. When the agent restarts, its initial registration to DCM resets the heartbeat tracker and the agent status +3. If DCM does not receive a heartbeat within the configured threshold, it marks + the agent as **Unavailable** +4. 
When the agent restarts, its initial registration to DCM resets the heartbeat + tracker and the agent status #### SP Health Monitoring -The agent monitors the health of its registered Service Providers by polling their `/health` endpoint, using the three-state health model defined in the [Service Provider Health Check enhancement](../service-provider-health-check/service-provider-health-check.md): +The agent monitors the health of its registered Service Providers by polling +their `/health` endpoint, using the three-state health model defined in the +[Service Provider Health Check enhancement](../service-provider-health-check/service-provider-health-check.md): + +| State | Condition | +| --------------- | --------------------------------------------------------------------------------------------------- | +| **Ready** | SP responds with `200 OK` and `status: "healthy"` | +| **Unhealthy** | SP responds with `200 OK` and `status: "unhealthy"` (SP reachable but backing provider unavailable) | +| **Unavailable** | SP does not respond or returns an error, after exceeding the failure threshold | -| State | Condition | -|-------|-----------| -| **Ready** | SP responds with `200 OK` and `status: "healthy"` | -| **Unhealthy** | SP responds with `200 OK` and `status: "unhealthy"` (SP reachable but backing provider unavailable) | -| **Unavailable** | SP does not respond or returns an error, after exceeding the failure threshold | +With the agent layer, the responsibility for polling SP health shifts from DCM +to the agent. The agent is the natural point to perform health checks on its +registered SPs, as it already maintains the list of SP endpoints. -With the agent layer, the responsibility for polling SP health shifts from DCM to the agent. The agent is the natural point to perform health checks on its registered SPs, as it already maintains the list of SP endpoints. +The agent only routes creation requests to SPs in the **Ready** state. 
SPs in +the **Unhealthy** or **Unavailable** state are not eligible for routing, even +though an Unhealthy SP is technically reachable. This simplifies routing logic +and avoids sending requests to SPs whose backing provider is known to be down. -When the last SP serving a given resource type transitions to **Unhealthy** or **Unavailable**, the agent: -1. Removes that resource type from its advertised list -2. Sends a `PUT` request to DCM with the updated agent registration (resource types list without the affected type) -3. Publishes a health warning CloudEvent to a dedicated health topic in the messaging system, providing DCM with context about the degradation (which SP, which resource type, the reason) +When the last SP serving a given service type transitions to **Unhealthy** or +**Unavailable**, the agent: -When a previously unhealthy or unavailable SP recovers (returns `200 OK` with `status: "healthy"`), the agent re-adds the resource type to its list and updates DCM accordingly. +1. Removes that service type from its advertised list +2. Sends a `PUT` request to DCM with the updated agent registration (service + types list without the affected type) +3. Publishes a health warning CloudEvent to a dedicated health topic in the + messaging system, providing DCM with context about the degradation (which SP, + which service type, the reason) + +When a previously unhealthy or unavailable SP recovers (returns `200 OK` with +`status: "healthy"`), the agent re-adds the service type to its list and updates +DCM accordingly. ##### Agent Status -The agent exposes a `GET /api/v1/status` endpoint that returns the health state of all registered SPs. This endpoint is always available, regardless of the deployment mode (Kubernetes, Docker, standalone), and is the primary way to inspect the agent's view of its Service Providers. +The agent exposes a `GET /api/v1/status` endpoint that returns the health state +of all registered SPs. 
This endpoint is always available, regardless of the +deployment mode (Kubernetes, Docker, standalone), and is the primary way to +inspect the agent's view of its Service Providers. Example response: @@ -340,14 +563,14 @@ Example response: { "providerId": "sp-vm-001", "name": "vm-provider", - "resourceType": "vm", + "serviceType": "vm", "status": "Ready", "lastCheck": "2026-06-05T10:30:00Z" }, { "providerId": "sp-db-001", "name": "db-provider", - "resourceType": "database", + "serviceType": "database", "status": "Unhealthy", "lastCheck": "2026-06-05T10:30:00Z" } @@ -357,22 +580,37 @@ Example response: ##### Pod Conditions (Kubernetes / OpenShift) -On Kubernetes or OpenShift deployments, the agent additionally exposes the health status of each registered SP as custom pod conditions on its own pod. This complements the `/api/v1/status` endpoint and allows administrators to inspect the agent's pod (e.g., via `oc describe pod`) and immediately see which SPs are healthy, unhealthy, or unavailable without having to query the agent's REST API. +On Kubernetes or OpenShift deployments, the agent additionally exposes the +health status of each registered SP as custom pod conditions on its own pod. +This complements the `/api/v1/status` endpoint and allows administrators to +inspect the agent's pod (e.g., via `oc describe pod`) and immediately see which +SPs are healthy, unhealthy, or unavailable without having to query the agent's +REST API. -Each registered SP is represented as a separate pod condition, using the SP's provider ID as the condition type. The condition's `status` field reflects whether the SP is healthy (`True`) or not (`False`), and the `reason` and `message` fields provide additional context. +Each registered SP is represented as a separate pod condition, using the SP's +provider ID as the condition type. 
The condition's `status` field reflects +whether the SP is healthy (`True`) or not (`False`), and the `reason` and +`message` fields provide additional context. Example output from `oc describe pod `: ``` Conditions: Type Status Reason Message - sp-vm-001/vm True Ready SP vm-provider serving resource type vm is healthy - sp-db-001/database False Unhealthy SP db-provider serving resource type database is unhealthy + sp-vm-001/vm True Ready SP vm-provider serving service type vm is healthy + sp-db-001/database False Unhealthy SP db-provider serving service type database is unhealthy ``` ###### Implementation Detail -The agent uses [Pod Readiness Gates](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate) to surface per-SP health as custom pod conditions. The agent's pod spec declares a readiness gate for each expected condition type, and the agent application patches its own pod's `status.conditions` via the Kubernetes API using in-cluster authentication (`rest.InClusterConfig()` or equivalent). This requires RBAC permissions on the `pods/status` subresource for the agent's service account. +The agent uses +[Pod Readiness Gates](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate) +to surface per-SP health as custom pod conditions. The agent's pod spec declares +a readiness gate for each expected condition type, and the agent application +patches its own pod's `status.conditions` via the Kubernetes API using +in-cluster authentication (`rest.InClusterConfig()` or equivalent). This +requires RBAC permissions on the `pods/status` subresource for the agent's +service account. ```mermaid sequenceDiagram @@ -398,30 +636,165 @@ sequenceDiagram end end - Note over AG: Last SP for resource type X
becomes Unhealthy or Unavailable + Note over AG: Last SP for service type X
becomes Unhealthy or Unavailable - AG->>DCM: PUT /api/v1/agents/{agentId}
{updated resourceTypes without X} + AG->>DCM: PUT /api/v1/agents/{agentId}
{updated serviceTypes without X} activate DCM DCM->>DB: Update agent registration DB-->>DCM: Updated DCM-->>AG: 200 OK deactivate DCM - AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.health
{type: "resource-type-unavailable",
agentId, resourceType, reason,
affectedProvider} + AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.health
{type: "service-type-unavailable",
agentId, serviceType, reason,
affectedProvider} MS->>DCM: PUSH health warning ``` ##### Flow Description + 1. The agent periodically polls each registered SP's `GET /health` endpoint 2. Based on the response, the agent updates the SP's health state: - * `200 OK` with `status: "healthy"` → **Ready** (failure counter reset) - * `200 OK` with `status: "unhealthy"` → **Unhealthy** - * Timeout or error → increment failure counter; if counter exceeds threshold → **Unavailable** -3. When the last SP serving a given resource type becomes **Unhealthy** or **Unavailable**: - * The agent removes the resource type from its advertised list - * The agent sends a `PUT` to DCM with the updated registration - * The agent publishes a health warning CloudEvent to the `dcm.agents.health` topic with details about the affected SP and resource type + - `200 OK` with `status: "healthy"` → **Ready** (failure counter reset) + - `200 OK` with `status: "unhealthy"` → **Unhealthy** + - Timeout or error → increment failure counter; if counter exceeds threshold + → **Unavailable** +3. When the last SP serving a given service type becomes **Unhealthy** or + **Unavailable**: + - The agent removes the service type from its advertised list + - The agent sends a `PUT` to DCM with the updated registration + - The agent publishes a health warning CloudEvent to the `dcm.agents.health` + topic with details about the affected SP and service type 4. When a previously unhealthy/unavailable SP recovers: - * The agent re-adds the resource type to its list (if it was removed) - * The agent sends a `PUT` to DCM with the updated registration -5. The agent exposes the health status of all registered SPs via the `GET /api/v1/status` endpoint. 
On Kubernetes/OpenShift deployments, the agent additionally surfaces this information as custom pod conditions on its own pod (see [Pod Conditions](#pod-conditions-kubernetes--openshift)) \ No newline at end of file + - The agent re-adds the service type to its list (if it was removed) + - The agent sends a `PUT` to DCM with the updated registration +5. The agent exposes the health status of all registered SPs via the + `GET /api/v1/status` endpoint. On Kubernetes/OpenShift deployments, the agent + additionally surfaces this information as custom pod conditions on its own + pod (see [Pod Conditions](#pod-conditions-kubernetes--openshift)) + +### Assumptions + +- A messaging system (e.g., NATS) is deployed and accessible to both DCM and the + agent +- The agent has outbound network connectivity to DCM's REST API (for + registration and heartbeats) +- SPs have network connectivity to the agent's REST API (for registration and + health checks) +- For Kubernetes/OpenShift deployments: the agent's service account has RBAC + permissions for the `pods/status` subresource + +### Risks and Mitigations + +| Risk | Mitigation | +| -------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Agent is a single point of failure per environment | Deferred to HA iteration. Agent restart recovers state via SP re-registration (SPs periodically re-register, naturally rebuilding agent state). | +| Messaging system failure blocks creation requests | Dependent on chosen bus technology's delivery guarantees. Stated as an assumption. | +| Message loss with at-most-once semantics | Rely on bus capabilities (e.g., JetStream for NATS). 
Specific delivery guarantee is a deployment decision. | +| Split-brain: agent loses DCM connectivity but keeps processing | On reconnection, the agent re-registers to DCM. During the split, DCM marks the agent as unavailable and stops routing new requests to its topic. In-flight messages are processed normally. Duplicate creation risk if DCM re-routes to another agent is mitigated by idempotent resource creation (resource ID provided by DCM in the creation request). | +| Unauthenticated SP registration | Deferred to AuthN/Z iteration. Network isolation is the interim mitigation. | + +## Drawbacks + +- Adds operational complexity: a new binary (the agent) must be deployed, + configured, and monitored per environment +- Adds latency to the creation path: DCM → messaging system → agent → SP, versus + the current DCM → SP direct call +- Fragments health monitoring responsibility: DCM monitors agent health via + heartbeats, while the agent monitors SP health via polling +- Requires messaging system infrastructure accessible to both DCM and all target + environments + +## Alternatives + +### Alternative 1: Monolithic Agent with Embedded SPs + +#### Description + +Instead of separating the agent and Service Providers into distinct processes, +the agent binary would ship with SP code for a known set of SPs (e.g., ACM, +KubeVirt, K8s). At startup, the agent would detect available CRDs or backing +infrastructure on the environment and activate only the relevant SP code. 
+ +#### Pros + +- Single binary to deploy, no REST registration ceremony between agent and SPs +- No health monitoring overhead between agent and SPs (they share a process) +- Simpler deployment and operational model + +#### Cons + +- Tightly couples the agent to a fixed, predefined set of SPs +- Cannot support custom or third-party SPs without rebuilding the agent binary +- Agent binary grows with each new SP type +- Requires agent rebuild and redeployment to add support for a new service type + +#### Status + +Rejected + +#### Rationale + +The agent must support arbitrary SPs, including custom ones developed by third +parties. Tight coupling between the agent and SP code prevents this +extensibility. The plugin-style model (separate processes, REST registration) +allows any SP that implements the registration API to participate, regardless of +who develops or deploys it. + +### Alternative 2: etcd / CRD Watch Pattern + +#### Description + +Instead of using a messaging system for creation requests, DCM would create +Custom Resource (CR) manifests (e.g., `ResourceRequest`) directly in the target +cluster's etcd via the Kubernetes API. The agent would run as a Kubernetes +controller, watching for these CRs and reconciling them by forwarding the +creation request to the relevant SP. This follows the native Kubernetes +controller pattern. + +#### Pros + +- Native Kubernetes pattern, well-understood and battle-tested +- Leverages existing etcd for persistence and watch semantics, no separate + messaging infrastructure needed +- Built-in HA via Kubernetes controller framework (leader election, informer + caching) + +#### Cons + +- Requires DCM to have kubeconfig/API access to each target cluster, + reintroducing DCM-to-environment connectivity that this enhancement aims to + eliminate +- Does not work for non-Kubernetes environments (Docker, standalone, etc.) 
+- Pushes the connectivity requirement from the agent (outbound) to DCM (outbound + to every cluster) + +#### Status + +Rejected + +#### Rationale + +A core motivation of this enhancement is removing the need for +DCM-to-environment inbound connectivity for creation requests. The CRD watch +pattern requires DCM to push CRs to the target cluster's API server, +reintroducing that dependency. Additionally, this approach limits the agent to +Kubernetes-based environments, conflicting with the goal of supporting +non-cluster environments. + +## Cross-Cutting Impact + +The following enhancement documents will need to be updated to reflect the +changes introduced by this enhancement. These updates will be done in subsequent +PRs. + +| Document | Impact | +| -------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [SP Registration Flow](../sp-registration-flow/sp-registration-flow.md) | SPs register to the agent instead of DCM. The existing registration API contract remains valid for the agent's REST API, but DCM's registration handler no longer receives SP registrations directly. | +| [Service Provider Health Check](../service-provider-health-check/service-provider-health-check.md) | Health polling responsibility shifts from DCM to the agent. DCM monitors agent health via heartbeats instead of polling individual SPs. | +| [SP Resource Manager](../sp-resource-manager/sp-resource-manager.md) | SPRM publishes creation requests to the agent's bus topic instead of calling SP REST endpoints directly. SPRM interacts with the agent (not individual SPs) for health status. From SPRM's perspective, the agent serves the same role as a SP: provisioning service types. 
| +| [Placement Manager](../placement-manager/placement-manager.md) | Policy evaluation may now include environment as a selection criterion. Placement Manager delegates to SPRM, which routes through the messaging system. | +| [User Flows](../user-flows/user-flows.md) | End-to-end flows must include the agent layer between DCM and SPs. | + +Additionally, DCM should monitor consumer lag on agent topics in a future +iteration. If lag exceeds a configurable threshold, DCM could stop routing new +requests to that agent to avoid further congestion. A new agent state (e.g., +"Congested") could be introduced for this purpose. From efbf77b06284d7ad8aee3752d3ffc8ce6ffb6ef6 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Wed, 10 Jun 2026 09:41:28 +0200 Subject: [PATCH 09/24] typo Signed-off-by: gabriel-farache --- enhancements/environment-agent/environment-agent.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index 4895ad8..91c9d41 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -183,7 +183,7 @@ flowchart TD #### Flow Description - The agent is spawned in an environment -- Several Service Providers (SP) are running and serving each a specific service +- Several Service Providers (SP) are running and each serving a specific service type - Each SP registers itself to the agent; the agent dynamically builds its supported service types list From 81e16f410d8c0e2e09da30d8cea9fb58f4796bb6 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Wed, 10 Jun 2026 11:23:49 +0200 Subject: [PATCH 10/24] Add pseudo api definition and payload example Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 203 ++++++++++++++---- 1 file changed, 167 insertions(+), 36 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md 
b/enhancements/environment-agent/environment-agent.md index 91c9d41..b452d0a 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -25,6 +25,11 @@ see-also: 1. Can multiple agent replicas consume from the same topic for high availability? (deferred to HA iteration) 2. How does an administrator update the agent's cost tier without restarting it? + **Proposed resolution:** The administrator updates the agent's configuration + (config file, environment variable, or ConfigMap on Kubernetes). The agent + detects the change and sends a `PUT /api/v1/agents/{agentId}` to DCM with the + updated cost tier — the same mechanism used when the supported service types + list changes. ## Summary @@ -199,6 +204,108 @@ flowchart TD - The status monitoring remains unchanged: each SP manages its resource lifecycle and reports status through the messaging system +### API + +#### Agent Endpoints + +| Method | Endpoint | Description | +| ------ | ----------------- | ------------------------------------------------------------------------------------------------------------- | +| POST | /api/v1/providers | SP registration — reuses the [SP Registration Flow](../sp-registration-flow/sp-registration-flow.md) contract | +| GET | /api/v1/status | Agent status — health of all registered SPs | + +##### `POST /api/v1/providers` — SP Registration + +Reuses the contract defined in the +[SP Registration Flow](../sp-registration-flow/sp-registration-flow.md) +enhancement. The agent applies the same idempotency semantics (name as natural +key, create-or-update behavior). + +##### `GET /api/v1/status` — Agent Status + +Returns the health state of all registered SPs. This endpoint is always +available, regardless of the deployment mode (Kubernetes, Docker, standalone), +and is the primary way to inspect the agent's view of its Service Providers. 
+ +Example response: + +```json +{ + "providers": [ + { + "providerId": "sp-vm-001", + "name": "vm-provider", + "serviceType": "vm", + "status": "Ready", + "lastCheck": "2026-06-05T10:30:00Z" + }, + { + "providerId": "sp-db-001", + "name": "db-provider", + "serviceType": "database", + "status": "Unhealthy", + "lastCheck": "2026-06-05T10:30:00Z" + } + ] +} +``` + +#### DCM Endpoints + +| Method | Endpoint | Description | +| ------ | ---------------------------------- | ------------------------- | +| POST | /api/v1/agents | Agent registration | +| PUT | /api/v1/agents/{agentId} | Update agent registration | +| PUT | /api/v1/agents/{agentId}/heartbeat | Agent heartbeat | + +##### `POST /api/v1/agents` — Agent Registration + +Register a new agent to DCM. + +| Field | Type | Required | Description | +| ------------------ | -------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| name | string | yes | Unique agent name | +| environment | string | yes | Freeform environment identifier (e.g., `"dev"`, `"staging"`, `"prod-eu-west-1"`) | +| serviceTypes | string[] | yes | List of service types the agent can serve. Must be non-empty on initial `POST` (prerequisite: at least one healthy SP). May be empty on `PUT` when all SPs are unhealthy/unavailable. | +| resourcesAvailable | object | no | Available resources in the environment — sourced from K8s node info or manual configuration (see below) | +| cost | enum | yes | Cost tier: `low` \| `medium-low` \| `medium` \| `medium-high` \| `high` | +| topicName | string | yes | Deterministic topic name for the agent's messaging channel | + +Response: `201 Created` with `{agentId}` + +###### `resourcesAvailable` Structure + +The `resourcesAvailable` field is optional. 
When provided, it follows a similar +structure to the SP registration metadata defined in the +[SP Registration Flow](../sp-registration-flow/sp-registration-flow.md), but +represents the aggregate available resources across the environment rather than +a single SP's capacity. + +Example: + +```json +{ + "totalCpu": 200, + "totalMemory": "1TB", + "totalStorage": "2TB", + "totalNode": 100 +} +``` + +##### `PUT /api/v1/agents/{agentId}` — Update Agent Registration + +Update an existing agent registration. The payload is identical to the initial +`POST` registration (full replace). All fields are sent on every `PUT`. + +Response: `200 OK` + +##### `PUT /api/v1/agents/{agentId}/heartbeat` — Agent Heartbeat + +| Field | Type | Required | Description | +| --------- | ----------------- | -------- | ------------------------- | +| timestamp | string (ISO 8601) | yes | Agent's current timestamp | + +Response: `200 OK` + ### SP Registration to Agent Service Providers register to the agent rather than to DCM directly. The agent @@ -322,6 +429,23 @@ sequenceDiagram 5. DCM persists the registration in the database 6. DCM acknowledges the registration +#### Re-Registration on Restart + +When the agent restarts, it uses the same `POST /api/v1/agents` endpoint with +the same payload. The agent does not persist its `agentId`; it relies on DCM's +idempotent registration, which uses the agent `name` as the natural key (same +pattern as SP registration defined in the +[SP Registration Flow](../sp-registration-flow/sp-registration-flow.md)): if the +name already exists and no `agentId` is provided (or the same `agentId` is +provided), DCM updates the existing entry, returns the same `agentId`, and +resets the heartbeat tracker. The agent then uses the returned `agentId` for +subsequent heartbeats and updates. + +Ensuring that each agent uses a unique name is an operational responsibility. 
+
+Note that the `(name, topicName)` pair is not unique: in a future HA model,
+multiple agent replicas for the same environment may share the same topic name.
+
 ### Resource Creation Flow
 ```mermaid
@@ -332,7 +456,7 @@ sequenceDiagram
 participant AG as Agent
 participant SP as Service Provider
- DCM->>MS: PUBLISH creation request
topic: {agentTopicName}
{serviceType, spec}
+ DCM->>MS: PUBLISH CloudEvent (creation request)
topic: {agentTopicName}
{resourceId, serviceType, spec}
 MS->>AG: PUSH message
 activate AG
@@ -353,6 +477,8 @@ sequenceDiagram
 MS->>DCM: PUSH error message
 else SP creation succeeds
 SP-->>AG: Success response
{instanceId, status: PROVISIONING}
+ AG->>MS: PUBLISH CloudEvent
{resourceId, status: PROVISIONING}
+ MS->>DCM: PUSH creation acknowledged
 Note over SP: SP manages resource lifecycle
and reports status through
the existing status reporting flow
 end
 end
@@ -361,8 +487,8 @@ sequenceDiagram
 #### Flow Description
-1. DCM publishes the creation request to the agent's dedicated topic in the
- messaging system
+1. DCM publishes a creation request CloudEvent to the agent's dedicated topic in
+ the messaging system, including the resource ID, service type, and spec
 2. The agent consumes the message
 3. The agent validates that the requested service type is supported by one of its attached Service Providers
@@ -373,9 +499,10 @@ sequenceDiagram
 - The agent forwards the creation request to the relevant SP via REST API
 - If the SP returns an **immediate error**: the agent publishes an error CloudEvent back to the messaging system for DCM to consume
- - If the SP **accepts** the request: the SP takes over resource lifecycle
- management and reports status changes through the existing status reporting
- flow (SP → Messaging System → DCM)
+ - If the SP **accepts** the request: the agent publishes a CloudEvent
+ acknowledging the creation is in progress. The SP takes over resource
+ lifecycle management and reports status changes through the existing status
+ reporting flow (SP → Messaging System → DCM)
 #### SP Selection Strategy
@@ -410,7 +537,7 @@ sequenceDiagram
 participant AG as Agent
 participant SP as Service Provider
- DCM->>MS: PUBLISH deletion request
topic: {agentTopicName}
{serviceType, resourceId}
+ DCM->>MS: PUBLISH CloudEvent (deletion request)
topic: {agentTopicName}
{resourceId, serviceType}
 MS->>AG: PUSH message
 activate AG
@@ -441,8 +568,8 @@ sequenceDiagram
 #### Flow Description
-1. DCM publishes the deletion request to the agent's dedicated topic in the
- messaging system, including the service type and resource ID
+1. DCM publishes a deletion request CloudEvent to the agent's dedicated topic in
+ the messaging system, including the resource ID and service type
 2. The agent consumes the message
 3. The agent validates that the requested service type is supported by one of its attached Service Providers
@@ -550,33 +677,10 @@ DCM accordingly.
 ##### Agent Status
-The agent exposes a `GET /api/v1/status` endpoint that returns the health state
-of all registered SPs. This endpoint is always available, regardless of the
-deployment mode (Kubernetes, Docker, standalone), and is the primary way to
-inspect the agent's view of its Service Providers.
-
-Example response:
-
-```json
-{
- "providers": [
- {
- "providerId": "sp-vm-001",
- "name": "vm-provider",
- "serviceType": "vm",
- "status": "Ready",
- "lastCheck": "2026-06-05T10:30:00Z"
- },
- {
- "providerId": "sp-db-001",
- "name": "db-provider",
- "serviceType": "database",
- "status": "Unhealthy",
- "lastCheck": "2026-06-05T10:30:00Z"
- }
- ]
-}
-```
+The agent exposes the health status of all registered SPs via the
+`GET /api/v1/status` endpoint (see
+[Agent Endpoints — `GET /api/v1/status`](#get-apiv1status--agent-status) for the
+response format).
 ##### Pod Conditions (Kubernetes / OpenShift)
@@ -671,6 +775,33 @@ sequenceDiagram
 additionally surfaces this information as custom pod conditions on its own pod (see [Pod Conditions](#pod-conditions-kubernetes--openshift))
+### CloudEvent Message Definitions
+
+All messages exchanged through the messaging system use the
+[CloudEvents v1.0](https://github.com/cloudevents/spec/blob/v1.0.2/cloudevents/spec.md)
+specification, following the conventions established in the
+[Service Provider Status Reporting](../state-management/service-provider-status-reporting.md)
+enhancement.
+
+All agent-originated CloudEvents include `agentName` and `topicName` in the data
+payload for correlation, in addition to the `source` envelope attribute. This
+allows DCM to identify both the resource and the originating agent when
+consuming from the shared `dcm.agents.responses` subject.
+
+The `spec` field in creation request data follows the schema defined by the
+target service type (see
+[SP Resource Manager](../sp-resource-manager/sp-resource-manager.md),
+[Placement Manager](../placement-manager/placement-manager.md)).
+ +| Message | `type` | `source` | `subject` | `data` | +| --------------------- | ------------------------------------------- | ---------------------- | ---------------------- | ------------------------------------------------------------------------ | +| Creation Request | `dcm.request.create` | `dcm/control-plane` | `{agentTopicName}` | `{resourceId, serviceType, spec}` | +| Deletion Request | `dcm.request.delete` | `dcm/control-plane` | `{agentTopicName}` | `{resourceId, serviceType}` | +| Creation Acknowledged | `dcm.agent.creation-acknowledged` | `dcm/agents/{agentId}` | `dcm.agents.responses` | `{resourceId, agentName, topicName, status: "PROVISIONING"}` | +| Deletion Acknowledged | `dcm.agent.deletion-acknowledged` | `dcm/agents/{agentId}` | `dcm.agents.responses` | `{resourceId, agentName, topicName, status: "DELETING"}` | +| Error | `dcm.agent.error` | `dcm/agents/{agentId}` | `dcm.agents.responses` | `{resourceId, agentName, topicName, error, details}` | +| Health Warning | `dcm.agent.health.service-type-unavailable` | `dcm/agents/{agentId}` | `dcm.agents.health` | `{agentId, agentName, topicName, serviceType, reason, affectedProvider}` | + ### Assumptions - A messaging system (e.g., NATS) is deployed and accessible to both DCM and the From 09cc808a2a6b498a1f21cf6024284c96ee179790 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Thu, 11 Jun 2026 09:34:24 +0200 Subject: [PATCH 11/24] Add Terminology section Signed-off-by: gabriel-farache --- enhancements/environment-agent/environment-agent.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index b452d0a..99e7d02 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -31,6 +31,16 @@ see-also: updated cost tier — the same mechanism used when the supported service types list changes. 
+## Terminology + +- **Agent:** A lightweight process that runs in a target environment, acting as + the intermediary between DCM and the Service Providers deployed in that + environment. It registers the environment to DCM, consumes resource operation + requests from a messaging system, and routes them to the appropriate Service + Provider. +- **Environment:** A set of infrastructures that is ready to receive workload + from DCM (e.g., `dev`, `staging`, `prod-eu-west-1`). + ## Summary This enhancement aims at adding the notion of environment by adding a layer From e029207afc50d4f20188b34ac4992dac9c586080 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Thu, 11 Jun 2026 11:39:10 +0200 Subject: [PATCH 12/24] fix inconsistency with resource vs service type Signed-off-by: gabriel-farache --- enhancements/environment-agent/environment-agent.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index 99e7d02..b442c90 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -350,7 +350,7 @@ sequenceDiagram AG->>AG: Store SP registration
Recompute supported service types
- alt Resource type list changed AND agent already registered to DCM
+ alt Service type list changed AND agent already registered to DCM
 AG->>DCM: PUT /api/v1/agents/{agentId}
{name, environment, serviceTypes,
resourcesAvailable, cost, topicName}
 activate DCM
 DCM->>DB: Update agent registration
@@ -359,7 +359,7 @@ sequenceDiagram
 deactivate DB
 DCM-->>AG: 200 OK
 deactivate DCM
- else Resource type list changed AND agent not yet registered to DCM
+ else Service type list changed AND agent not yet registered to DCM
 Note over AG: Prerequisite for initial
agent registration is now met
(see Agent Registration Flow)
 end
@@ -374,7 +374,7 @@ sequenceDiagram
 1. The SP starts and registers to the agent
 2. The SP registers itself with the agent via a REST API call, providing:
 - Name
- - Resource type it serves
+ - Service type it serves
 - Endpoint (URL where the agent can reach the SP)
 3. The agent stores the SP registration and recomputes the list of supported service types
@@ -473,10 +473,10 @@ sequenceDiagram
 AG->>AG: Validate requested service type
is supported by an attached SP
- alt Resource type not supported
+ alt Service type not supported
 AG->>MS: PUBLISH CloudEvent
{error: "unsupported service type"}
 MS->>DCM: PUSH error message
- else Resource type supported
+ else Service type supported
 AG->>SP: POST {spEndpoint}/api/v1/{serviceType}
{spec} activate SP From bb8eac2e8a84907e5405dcd3b6eb0880e5c0de2b Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Thu, 11 Jun 2026 11:41:09 +0200 Subject: [PATCH 13/24] select SP by alphabetical order Signed-off-by: gabriel-farache --- enhancements/environment-agent/environment-agent.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index b442c90..06538d4 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -517,7 +517,7 @@ sequenceDiagram #### SP Selection Strategy When multiple SPs are registered for the same service type, the agent selects -one randomly. Future iterations may introduce affinity-based or capacity-based +the SP in alphabetical order. Future iterations may introduce affinity-based or capacity-based selection strategies (e.g., selecting the SP with the most available resources, similar to pod affinity in Kubernetes). From 0529b4bdeb83cd2cae964cff8f9724dd40436ff1 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Thu, 11 Jun 2026 11:57:24 +0200 Subject: [PATCH 14/24] explicitly defer open question resolution 2 Signed-off-by: gabriel-farache --- enhancements/environment-agent/environment-agent.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index 06538d4..cc51aa5 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -29,7 +29,8 @@ see-also: (config file, environment variable, or ConfigMap on Kubernetes). The agent detects the change and sends a `PUT /api/v1/agents/{agentId}` to DCM with the updated cost tier — the same mechanism used when the supported service types - list changes. + list changes. 
+ **This solution is deferred to later version: in the current version, a restart will be needed for the change in the cost tier to be propagated (via [Agent Registration Flow](#agent-registration-flow) )** ## Terminology From b5a2143ff8d764a3a942af0081b62fa740d41c9b Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Thu, 11 Jun 2026 11:57:30 +0200 Subject: [PATCH 15/24] format Signed-off-by: gabriel-farache --- enhancements/environment-agent/environment-agent.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index cc51aa5..badc04a 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -29,8 +29,9 @@ see-also: (config file, environment variable, or ConfigMap on Kubernetes). The agent detects the change and sends a `PUT /api/v1/agents/{agentId}` to DCM with the updated cost tier — the same mechanism used when the supported service types - list changes. - **This solution is deferred to later version: in the current version, a restart will be needed for the change in the cost tier to be propagated (via [Agent Registration Flow](#agent-registration-flow) )** + list changes. **This solution is deferred to later version: in the current + version, a restart will be needed for the change in the cost tier to be + propagated (via [Agent Registration Flow](#agent-registration-flow) )** ## Terminology @@ -518,9 +519,9 @@ sequenceDiagram #### SP Selection Strategy When multiple SPs are registered for the same service type, the agent selects -the SP in alphabetical order. Future iterations may introduce affinity-based or capacity-based -selection strategies (e.g., selecting the SP with the most available resources, -similar to pod affinity in Kubernetes). +the SP in alphabetical order. 
Future iterations may introduce affinity-based or +capacity-based selection strategies (e.g., selecting the SP with the most +available resources, similar to pod affinity in Kubernetes). #### Retry Policy From 372334f79257af59f4e4d5cf6a452032231b9155 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Mon, 15 Jun 2026 11:19:33 +0200 Subject: [PATCH 16/24] Change behviour for unhleathy Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 243 ++++++++++++++---- 1 file changed, 195 insertions(+), 48 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index badc04a..75d2c8d 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -32,6 +32,9 @@ see-also: list changes. **This solution is deferred to later version: in the current version, a restart will be needed for the change in the cost tier to be propagated (via [Agent Registration Flow](#agent-registration-flow) )** +3. How does DCM handle the "queued" CloudEvent response + (`dcm.agent.request-queued`)? Does it expose the status to the user, set a + timeout, or re-evaluate policies? (deferred to DCM-side design) ## Terminology @@ -141,10 +144,18 @@ The agent will then consume the message, validate it and then pass it to the relevant SP. The agent monitors the health of its registered SPs by polling their `/health` -endpoint, using the three-state model (Ready, Unhealthy, Unavailable). When the -last SP serving a given service type becomes unhealthy or unavailable, the agent -removes that service type from its advertised list and updates DCM. The agent -exposes the health status of each registered SP via a `/api/v1/status` endpoint. +endpoint, using the three-state model (Ready, Unhealthy, Unavailable). 
The agent +differentiates its behavior based on the SP health state: + +- **Unhealthy:** The agent keeps the service type in its advertised list to DCM + but stops routing requests to the SP. Incoming requests for that service type + are held in a dedicated retry topic until the SP recovers or becomes + unavailable. +- **Unavailable:** The agent removes the service type from its advertised list, + updates DCM, and rejects any held requests for that service type. + +The agent exposes the health status of each registered SP via a `/api/v1/status` +endpoint. On Kubernetes/OpenShift deployments, the agent additionally surfaces this information as custom pod conditions on its own pod, allowing administrators to quickly identify which SPs are causing issues via `oc describe pod`. @@ -277,7 +288,7 @@ Register a new agent to DCM. | ------------------ | -------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | name | string | yes | Unique agent name | | environment | string | yes | Freeform environment identifier (e.g., `"dev"`, `"staging"`, `"prod-eu-west-1"`) | -| serviceTypes | string[] | yes | List of service types the agent can serve. Must be non-empty on initial `POST` (prerequisite: at least one healthy SP). May be empty on `PUT` when all SPs are unhealthy/unavailable. | +| serviceTypes | string[] | yes | List of service types the agent can serve. Must be non-empty on initial `POST` (prerequisite: at least one healthy SP). May be empty on `PUT` when all SPs are unavailable (Unhealthy SPs do not trigger service type removal — see [SP Health Monitoring](#sp-health-monitoring)). 
| | resourcesAvailable | object | no | Available resources in the environment — sourced from K8s node info or manual configuration (see below) | | cost | enum | yes | Cost tier: `low` \| `medium-low` \| `medium` \| `medium-high` \| `high` | | topicName | string | yes | Deterministic topic name for the agent's messaging channel | @@ -405,9 +416,12 @@ sequenceDiagram Note over AG: Agent starts in
target environment
- AG->>MS: Create topic (deterministic name)
+ AG->>MS: Create main topic (deterministic name)
 MS-->>AG: Topic created
{topicName}
+ AG->>MS: Create retry topic (internal)
+ MS-->>AG: Topic created
{topicName}.retry
+
 Note over AG: Prerequisite:
At least 1 SP must be
registered and healthy
(see SP Registration to Agent) AG->>DCM: POST /api/v1/agents
{name, environment, serviceTypes,
resourcesAvailable, cost, topicName}
@@ -425,8 +439,13 @@ sequenceDiagram
 #### Flow Description
 1. The agent starts and serves a specific environment
-2. The agent creates a topic in the messaging system (using a deterministic
- name) to establish a dedicated communication channel
+2. The agent creates two topics in the messaging system:
+ - A **main topic** (using a deterministic name) to establish a dedicated
+ communication channel with DCM. This topic name is advertised to DCM during
+ registration.
+ - A **retry topic** (`{topicName}.retry`) used internally by the agent to hold
+ requests when all SPs for a service type are Unhealthy (see
+ [Retry Topic](#retry-topic)). This topic is not advertised to DCM.
 3. The agent checks whether at least one SP is registered and healthy:
 - If at least 1 SP is registered and healthy: the agent proceeds to register to DCM
@@ -478,7 +497,11 @@ sequenceDiagram
 alt Service type not supported
 AG->>MS: PUBLISH CloudEvent
{error: "unsupported service type"}
 MS->>DCM: PUSH error message
- else Service type supported
+ else Service type supported but all SPs Unhealthy
+ AG->>MS: PUBLISH CloudEvent (hold request)
topic: {agentTopicName}.retry
{resourceId, serviceType, spec} + AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.responses
{resourceId, status: QUEUED,
reason: "SPs unhealthy — held for retry"} + MS->>DCM: PUSH queued response + else Service type supported and at least one SP Ready AG->>SP: POST {spEndpoint}/api/v1/{serviceType}
{spec} activate SP @@ -507,7 +530,15 @@ sequenceDiagram 4. If the service type is **not supported**: - The agent publishes an error CloudEvent back to the messaging system - DCM consumes the error message -5. If the service type is **supported**: +5. If the service type is **supported but all SPs are Unhealthy**: + - The agent publishes the original request CloudEvent to the retry topic + (`{agentTopicName}.retry`) for durable holding + - The agent publishes a "queued" CloudEvent to `dcm.agents.responses` with + `{resourceId, serviceType, status: "QUEUED"}`, informing DCM that the + request is held for retry + - The request will be processed when an SP recovers, or rejected if all SPs + become Unavailable (see [Retry Topic](#retry-topic)) +6. If the service type is **supported and at least one SP is Ready**: - The agent forwards the creation request to the relevant SP via REST API - If the SP returns an **immediate error**: the agent publishes an error CloudEvent back to the messaging system for DCM to consume @@ -531,13 +562,62 @@ agent publishes an error CloudEvent to the messaging system with the resource ID (provided by DCM in the original creation request), allowing DCM to track the failure. +#### Retry Topic + +When all SPs for a given service type are Unhealthy, the agent cannot route +requests but the service type remains advertised to DCM (to avoid +registration flapping). Instead of rejecting the request, the agent publishes +it to a dedicated **retry topic** (`{agentTopicName}.retry`) for durable +holding, and responds to DCM with a "queued" CloudEvent. + +The retry topic is created by the agent at startup alongside the main topic +(see [Agent Registration Flow](#agent-registration-flow)). It is internal to +the agent and is not advertised to DCM. + +**Message format:** The original CloudEvent is published to the retry topic +as-is (passthrough, no wrapping). 
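
As an illustration of the three routing branches described above (unsupported service type, supported with no Ready SP, supported with at least one Ready SP), here is a minimal Python sketch. It is not the agent's implementation. The `Agent` class, the `handle_creation` method, and the in-memory `published` list standing in for the messaging system are hypothetical names introduced for this example. Treating a mix of Unhealthy and Unavailable SPs as "hold" is also an assumption, since the proposal only spells out the all-Unhealthy and all-Unavailable cases. Topic names and payload fields follow the proposal.

```python
from dataclasses import dataclass, field
from enum import Enum


class SPState(Enum):
    READY = "Ready"
    UNHEALTHY = "Unhealthy"
    UNAVAILABLE = "Unavailable"


@dataclass
class Agent:
    """Hypothetical stand-in for the agent's routing logic (illustration only)."""

    topic_name: str
    # Service type -> health states of the SPs registered for that type.
    providers: dict = field(default_factory=dict)
    # (topic, payload) pairs; stands in for publishing to the messaging system.
    published: list = field(default_factory=list)

    def handle_creation(self, request: dict) -> str:
        """Return the branch taken: 'error', 'forward' or 'queued'."""
        service_type = request["serviceType"]
        states = self.providers.get(service_type, [])

        # Assumption: a service type is no longer advertised once every SP
        # for it is Unavailable (or none was ever registered).
        if all(state is SPState.UNAVAILABLE for state in states):
            self.published.append((
                "dcm.agents.responses",
                {"resourceId": request["resourceId"],
                 "error": "unsupported service type"},
            ))
            return "error"

        if any(state is SPState.READY for state in states):
            # A real agent would POST {spec} to the SP endpoint here.
            return "forward"

        # No Ready SP: hold the original request, unmodified, on the retry topic
        self.published.append((f"{self.topic_name}.retry", request))
        # and tell DCM that the request is queued.
        self.published.append((
            "dcm.agents.responses",
            {"resourceId": request["resourceId"],
             "serviceType": service_type,
             "status": "QUEUED"},
        ))
        return "queued"


agent = Agent("agent-dev", {"vm": [SPState.UNHEALTHY]})
print(agent.handle_creation({"resourceId": "r-1", "serviceType": "vm", "spec": {}}))
```

Keeping the held message identical to the incoming request mirrors the passthrough format stated above, so the same handler could later process a message read back from the retry topic.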
+ +**Consumption is event-driven.** The agent reads the retry topic only when an +SP health state changes — not periodically: + +- **SP transitions to Ready:** The agent consumes the retry topic. For each + message whose service type now has a Ready SP, the agent processes the + request (forwards to the SP, responds to DCM with success or error). + Messages for service types still Unhealthy are re-published to the retry + topic. +- **SP transitions to Unavailable:** The agent consumes the retry topic. For + each message whose service type has all SPs Unavailable, the agent rejects + the request with an error CloudEvent to DCM. Messages for other service + types are re-published to the retry topic. +- **No health state change:** The retry topic is not consumed. + +**Creation/Deletion dedup:** If both a creation request and a deletion request +for the same resource ID are present in the retry topic, both messages are +removed — they cancel out since the resource was never created. The agent logs +the cancellation and acknowledges the deletion to DCM. The creation request is +silently dropped since it was never started. + +**Ordering:** Requests are processed in arrival order per service type. +Requests for different service types are independent. + +**Durability:** Messages in the retry topic survive agent crashes, guaranteed +by the messaging system's persistence layer. On restart, the agent re-reads +both the main topic and the retry topic. + #### In-Flight Request Handling -When the agent restarts, unconsumed messages remain on the topic and are -consumed once the agent is back up (guaranteed by the messaging system's -persistence layer). When all SPs for a given service type are unhealthy or -unavailable, the agent responds with an error CloudEvent for each incoming -creation request targeting that service type. 
+When the agent restarts, unconsumed messages on both the main topic and the +retry topic are consumed once the agent is back up (guaranteed by the messaging +system's persistence layer). + +- **All SPs Unhealthy:** The agent publishes the request to the retry topic and + responds to DCM with a "queued" CloudEvent. The request is processed when an + SP recovers, or rejected when all SPs for that service type become + Unavailable (see [Retry Topic](#retry-topic)). +- **All SPs Unavailable:** The agent responds with an error CloudEvent for each + incoming request targeting that service type. Additionally, the agent drains + the retry topic, rejecting any held requests for that service type with error + CloudEvents. ### Resource Deletion Flow @@ -559,7 +639,11 @@ sequenceDiagram alt Service type not supported AG->>MS: PUBLISH CloudEvent
{error: "unsupported service type"} MS->>DCM: PUSH error message - else Service type supported + else Service type supported but all SPs Unhealthy + AG->>MS: PUBLISH CloudEvent (hold request)
topic: {agentTopicName}.retry
{resourceId, serviceType} + AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.responses
{resourceId, status: QUEUED,
reason: "SPs unhealthy — held for retry"} + MS->>DCM: PUSH queued response + else Service type supported and at least one SP Ready AG->>SP: DELETE {spEndpoint}/api/v1/{serviceType}/{resourceId} activate SP @@ -588,7 +672,14 @@ sequenceDiagram 4. If the service type is **not supported**: - The agent publishes an error CloudEvent back to the messaging system - DCM consumes the error message -5. If the service type is **supported**: +5. If the service type is **supported but all SPs are Unhealthy**: + - The agent publishes the original request to the retry topic for durable + holding + - The agent publishes a "queued" CloudEvent to `dcm.agents.responses`, + informing DCM that the request is held for retry + - The request will be processed when an SP recovers, or rejected if all SPs + become Unavailable (see [Retry Topic](#retry-topic)) +6. If the service type is **supported and at least one SP is Ready**: - The agent forwards the deletion request to the relevant SP via a REST `DELETE` call - If the SP returns an **immediate error**: the agent publishes an error @@ -670,22 +761,40 @@ registered SPs, as it already maintains the list of SP endpoints. The agent only routes creation requests to SPs in the **Ready** state. SPs in the **Unhealthy** or **Unavailable** state are not eligible for routing, even -though an Unhealthy SP is technically reachable. This simplifies routing logic -and avoids sending requests to SPs whose backing provider is known to be down. +though an Unhealthy SP is technically reachable. When all SPs for a service +type are Unhealthy, incoming requests are held in the retry topic rather than +rejected (see [Retry Topic](#retry-topic)). + +The agent differentiates its behavior based on the health state of the last SP +serving a given service type: -When the last SP serving a given service type transitions to **Unhealthy** or -**Unavailable**, the agent: +**When the last SP becomes Unhealthy:** -1. 
Removes that service type from its advertised list -2. Sends a `PUT` request to DCM with the updated agent registration (service - types list without the affected type) -3. Publishes a health warning CloudEvent to a dedicated health topic in the - messaging system, providing DCM with context about the degradation (which SP, - which service type, the reason) +1. The agent **keeps** the service type in its advertised list (no `PUT` to DCM + to remove it) +2. The agent stops routing new requests to SPs for that service type — incoming + requests are held in the retry topic and a "queued" CloudEvent is sent to + DCM +3. The agent publishes a health warning CloudEvent to `dcm.agents.health` with + type `service-type-degraded` -When a previously unhealthy or unavailable SP recovers (returns `200 OK` with -`status: "healthy"`), the agent re-adds the service type to its list and updates -DCM accordingly. +**When the last SP becomes Unavailable:** + +1. The agent removes the service type from its advertised list +2. The agent sends a `PUT` request to DCM with the updated agent registration + (service types list without the affected type) +3. The agent drains the retry topic: all held requests for that service type are + rejected with error CloudEvents to DCM +4. The agent publishes a health warning CloudEvent to `dcm.agents.health` with + type `service-type-unavailable` + +**When a previously unhealthy or unavailable SP recovers** (returns `200 OK` +with `status: "healthy"`): + +1. If the service type was removed (Unavailable case): the agent re-adds it to + its list and sends a `PUT` to DCM with the updated registration +2. The agent processes held requests from the retry topic for that service type + (see [Retry Topic](#retry-topic)) ##### Agent Status @@ -752,17 +861,43 @@ sequenceDiagram end end - Note over AG: Last SP for service type X
becomes Unhealthy or Unavailable + alt Last SP for service type X becomes Unhealthy + Note over AG: Keep service type X
in advertised list.
Hold incoming requests
in retry topic. - AG->>DCM: PUT /api/v1/agents/{agentId}
{updated serviceTypes without X} - activate DCM - DCM->>DB: Update agent registration - DB-->>DCM: Updated - DCM-->>AG: 200 OK - deactivate DCM + AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.health
{type: "service-type-degraded",
agentId, serviceType, reason,
affectedProvider} + MS->>DCM: PUSH health warning + + else Last SP for service type X becomes Unavailable + AG->>DCM: PUT /api/v1/agents/{agentId}
{updated serviceTypes without X} + activate DCM + DCM->>DB: Update agent registration + DB-->>DCM: Updated + DCM-->>AG: 200 OK + deactivate DCM - AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.health
{type: "service-type-unavailable",
agentId, serviceType, reason,
affectedProvider} - MS->>DCM: PUSH health warning + Note over AG: Drain retry topic:
reject held requests for
service type X + + AG->>MS: PUBLISH CloudEvent(s)
topic: dcm.agents.responses
{error: "SP unavailable"}
for each held request + + AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.health
{type: "service-type-unavailable",
agentId, serviceType, reason,
affectedProvider} + MS->>DCM: PUSH health warning + + else Previously unhealthy/unavailable SP recovers to Ready + Note over AG: Re-add service type if removed.
Process held requests
from retry topic. + + opt Service type was removed (Unavailable case) + AG->>DCM: PUT /api/v1/agents/{agentId}
{updated serviceTypes with X} + activate DCM + DCM->>DB: Update agent registration + DB-->>DCM: Updated + DCM-->>AG: 200 OK + deactivate DCM + end + + AG->>SP: Forward held requests from retry topic + SP-->>AG: Responses + AG->>MS: PUBLISH CloudEvent(s)
topic: dcm.agents.responses
{success/error for each} + end ``` ##### Flow Description @@ -773,16 +908,26 @@ sequenceDiagram - `200 OK` with `status: "unhealthy"` → **Unhealthy** - Timeout or error → increment failure counter; if counter exceeds threshold → **Unavailable** -3. When the last SP serving a given service type becomes **Unhealthy** or - **Unavailable**: +3. When the last SP serving a given service type becomes **Unhealthy**: + - The agent **keeps** the service type in its advertised list (no `PUT` to + DCM) + - Incoming requests for that service type are held in the retry topic (see + [Retry Topic](#retry-topic)) + - The agent publishes a `service-type-degraded` health warning CloudEvent to + the `dcm.agents.health` topic +4. When the last SP serving a given service type becomes **Unavailable**: - The agent removes the service type from its advertised list - The agent sends a `PUT` to DCM with the updated registration - - The agent publishes a health warning CloudEvent to the `dcm.agents.health` - topic with details about the affected SP and service type -4. When a previously unhealthy/unavailable SP recovers: - - The agent re-adds the service type to its list (if it was removed) - - The agent sends a `PUT` to DCM with the updated registration -5. The agent exposes the health status of all registered SPs via the + - The agent drains the retry topic: all held requests for that service type + are rejected with error CloudEvents to DCM + - The agent publishes a `service-type-unavailable` health warning CloudEvent + to the `dcm.agents.health` topic +5. When a previously unhealthy or unavailable SP recovers: + - If the service type was removed (Unavailable case): the agent re-adds it + to its list and sends a `PUT` to DCM with the updated registration + - The agent processes held requests from the retry topic for that service + type +6. The agent exposes the health status of all registered SPs via the `GET /api/v1/status` endpoint. 
On Kubernetes/OpenShift deployments, the agent additionally surfaces this information as custom pod conditions on its own pod (see [Pod Conditions](#pod-conditions-kubernetes--openshift)) @@ -811,8 +956,10 @@ target service type (see | Deletion Request | `dcm.request.delete` | `dcm/control-plane` | `{agentTopicName}` | `{resourceId, serviceType}` | | Creation Acknowledged | `dcm.agent.creation-acknowledged` | `dcm/agents/{agentId}` | `dcm.agents.responses` | `{resourceId, agentName, topicName, status: "PROVISIONING"}` | | Deletion Acknowledged | `dcm.agent.deletion-acknowledged` | `dcm/agents/{agentId}` | `dcm.agents.responses` | `{resourceId, agentName, topicName, status: "DELETING"}` | +| Request Queued | `dcm.agent.request-queued` | `dcm/agents/{agentId}` | `dcm.agents.responses` | `{resourceId, agentName, topicName, serviceType, status: "QUEUED"}` | | Error | `dcm.agent.error` | `dcm/agents/{agentId}` | `dcm.agents.responses` | `{resourceId, agentName, topicName, error, details}` | -| Health Warning | `dcm.agent.health.service-type-unavailable` | `dcm/agents/{agentId}` | `dcm.agents.health` | `{agentId, agentName, topicName, serviceType, reason, affectedProvider}` | +| Health Degraded | `dcm.agent.health.service-type-degraded` | `dcm/agents/{agentId}` | `dcm.agents.health` | `{agentId, agentName, topicName, serviceType, reason, affectedProvider}` | +| Health Unavailable | `dcm.agent.health.service-type-unavailable` | `dcm/agents/{agentId}` | `dcm.agents.health` | `{agentId, agentName, topicName, serviceType, reason, affectedProvider}` | ### Assumptions From fe2250f2a807de273a0db7688c712fed51e85f08 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Mon, 15 Jun 2026 16:41:42 +0200 Subject: [PATCH 17/24] Remove PUT Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 57 +++++++++---------- 1 file changed, 26 insertions(+), 31 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md 
b/enhancements/environment-agent/environment-agent.md index 75d2c8d..23dda2e 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -27,7 +27,7 @@ see-also: 2. How does an administrator update the agent's cost tier without restarting it? **Proposed resolution:** The administrator updates the agent's configuration (config file, environment variable, or ConfigMap on Kubernetes). The agent - detects the change and sends a `PUT /api/v1/agents/{agentId}` to DCM with the + detects the change and sends a `POST /api/v1/agents` to DCM with the updated cost tier — the same mechanism used when the supported service types list changes. **This solution is deferred to later version: in the current version, a restart will be needed for the change in the cost tier to be @@ -277,7 +277,6 @@ Example response: | Method | Endpoint | Description | | ------ | ---------------------------------- | ------------------------- | | POST | /api/v1/agents | Agent registration | -| PUT | /api/v1/agents/{agentId} | Update agent registration | | PUT | /api/v1/agents/{agentId}/heartbeat | Agent heartbeat | ##### `POST /api/v1/agents` — Agent Registration @@ -288,7 +287,7 @@ Register a new agent to DCM. | ------------------ | -------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | name | string | yes | Unique agent name | | environment | string | yes | Freeform environment identifier (e.g., `"dev"`, `"staging"`, `"prod-eu-west-1"`) | -| serviceTypes | string[] | yes | List of service types the agent can serve. Must be non-empty on initial `POST` (prerequisite: at least one healthy SP). May be empty on `PUT` when all SPs are unavailable (Unhealthy SPs do not trigger service type removal — see [SP Health Monitoring](#sp-health-monitoring)). 
| +| serviceTypes | string[] | yes | List of service types the agent can serve. Must be non-empty on initial registration (prerequisite: at least one healthy SP). May be empty on subsequent re-registrations when all SPs are unavailable (Unhealthy SPs do not trigger service type removal — see [SP Health Monitoring](#sp-health-monitoring)). | | resourcesAvailable | object | no | Available resources in the environment — sourced from K8s node info or manual configuration (see below) | | cost | enum | yes | Cost tier: `low` \| `medium-low` \| `medium` \| `medium-high` \| `high` | | topicName | string | yes | Deterministic topic name for the agent's messaging channel | @@ -314,13 +313,6 @@ Example: } ``` -##### `PUT /api/v1/agents/{agentId}` — Update Agent Registration - -Update an existing agent registration. The payload is identical to the initial -`POST` registration (full replace). All fields are sent on every `PUT`. - -Response: `200 OK` - ##### `PUT /api/v1/agents/{agentId}/heartbeat` — Agent Heartbeat | Field | Type | Required | Description | @@ -342,10 +334,10 @@ re-register without requiring any additional coordination mechanism. When the list of supported service types changes as a result of an SP registration and the agent is already registered to DCM, the agent updates DCM -via a `PUT` request with the full updated registration payload. If the agent has +via a `POST /api/v1/agents` request with the full updated registration payload. If the agent has not yet registered to DCM (i.e., this is the first SP registering), the agent -does not send a `PUT`; instead, the SP registration satisfies the prerequisite -for the agent to proceed with its initial registration to DCM (see +does not notify DCM yet; instead, the SP registration satisfies the +prerequisite for the agent to proceed with its initial registration to DCM (see [Agent Registration Flow](#agent-registration-flow)). ```mermaid @@ -364,7 +356,7 @@ sequenceDiagram AG->>AG: Store SP registration
Recompute supported service types alt Service type list changed AND agent already registered to DCM - AG->>DCM: PUT /api/v1/agents/{agentId}
{name, environment, serviceTypes,
resourcesAvailable, cost, topicName} + AG->>DCM: POST /api/v1/agents
{name, environment, serviceTypes,
resourcesAvailable, cost, topicName} activate DCM DCM->>DB: Update agent registration activate DB @@ -392,12 +384,12 @@ sequenceDiagram 3. The agent stores the SP registration and recomputes the list of supported service types 4. If the service type list changed (new service type added): - - If the agent is already registered to DCM: the agent sends a `PUT` request - to DCM with the full updated agent registration; DCM updates the agent - record in the database - - If the agent is not yet registered to DCM: the agent does not send a `PUT`; - instead, this SP registration satisfies the prerequisite for the agent's - initial registration (see + - If the agent is already registered to DCM: the agent sends a + `POST /api/v1/agents` request to DCM with the full updated agent + registration; DCM updates the agent record in the database + - If the agent is not yet registered to DCM: the agent does not notify DCM + yet; instead, this SP registration satisfies the prerequisite for the + agent's initial registration (see [Agent Registration Flow](#agent-registration-flow)) 5. The agent acknowledges the SP registration 6. The SP periodically re-registers with the agent; the agent handles this @@ -770,8 +762,8 @@ serving a given service type: **When the last SP becomes Unhealthy:** -1. The agent **keeps** the service type in its advertised list (no `PUT` to DCM - to remove it) +1. The agent **keeps** the service type in its advertised list (no update sent + to DCM to remove it) 2. The agent stops routing new requests to SPs for that service type — incoming requests are held in the retry topic and a "queued" CloudEvent is sent to DCM @@ -781,8 +773,8 @@ serving a given service type: **When the last SP becomes Unavailable:** 1. The agent removes the service type from its advertised list -2. The agent sends a `PUT` request to DCM with the updated agent registration - (service types list without the affected type) +2. 
The agent sends a `POST /api/v1/agents` request to DCM with the updated + registration (service types list without the affected type) 3. The agent drains the retry topic: all held requests for that service type are rejected with error CloudEvents to DCM 4. The agent publishes a health warning CloudEvent to `dcm.agents.health` with @@ -792,7 +784,8 @@ serving a given service type: with `status: "healthy"`): 1. If the service type was removed (Unavailable case): the agent re-adds it to - its list and sends a `PUT` to DCM with the updated registration + its list and sends a `POST /api/v1/agents` to DCM with the updated + registration 2. The agent processes held requests from the retry topic for that service type (see [Retry Topic](#retry-topic)) @@ -868,7 +861,7 @@ sequenceDiagram MS->>DCM: PUSH health warning else Last SP for service type X becomes Unavailable - AG->>DCM: PUT /api/v1/agents/{agentId}
{updated serviceTypes without X} + AG->>DCM: POST /api/v1/agents
{updated serviceTypes without X} activate DCM DCM->>DB: Update agent registration DB-->>DCM: Updated @@ -886,7 +879,7 @@ sequenceDiagram Note over AG: Re-add service type if removed.
Process held requests
from retry topic. opt Service type was removed (Unavailable case) - AG->>DCM: PUT /api/v1/agents/{agentId}
{updated serviceTypes with X} + AG->>DCM: POST /api/v1/agents
{updated serviceTypes with X} activate DCM DCM->>DB: Update agent registration DB-->>DCM: Updated @@ -909,22 +902,24 @@ sequenceDiagram - Timeout or error → increment failure counter; if counter exceeds threshold → **Unavailable** 3. When the last SP serving a given service type becomes **Unhealthy**: - - The agent **keeps** the service type in its advertised list (no `PUT` to - DCM) + - The agent **keeps** the service type in its advertised list (no update + sent to DCM) - Incoming requests for that service type are held in the retry topic (see [Retry Topic](#retry-topic)) - The agent publishes a `service-type-degraded` health warning CloudEvent to the `dcm.agents.health` topic 4. When the last SP serving a given service type becomes **Unavailable**: - The agent removes the service type from its advertised list - - The agent sends a `PUT` to DCM with the updated registration + - The agent sends a `POST /api/v1/agents` to DCM with the updated + registration - The agent drains the retry topic: all held requests for that service type are rejected with error CloudEvents to DCM - The agent publishes a `service-type-unavailable` health warning CloudEvent to the `dcm.agents.health` topic 5. When a previously unhealthy or unavailable SP recovers: - If the service type was removed (Unavailable case): the agent re-adds it - to its list and sends a `PUT` to DCM with the updated registration + to its list and sends a `POST /api/v1/agents` to DCM with the updated + registration - The agent processes held requests from the retry topic for that service type 6. 
The agent exposes the health status of all registered SPs via the From 8deff815d51ba19fd6006acec6b318e29c563ac5 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Fri, 19 Jun 2026 16:10:34 +0200 Subject: [PATCH 18/24] feat(environment-agent): hybrid SP model, 1 SP per service type, defer etcd watch Integrate embedded SPs (K8s Container, ACM Cluster, KubeVirt) into the main proposal alongside external "bring your own" SPs. Enforce a global constraint of one SP per service type with 409 Conflict rejection for duplicates. Change etcd/CRD Watch alternative from Rejected to Deferred pending investigation of DCM-native watch semantics. Co-Authored-By: Claude Opus 4.6 Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 721 +++++++++++------- 1 file changed, 425 insertions(+), 296 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index 23dda2e..ffcfbcd 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -27,9 +27,9 @@ see-also: 2. How does an administrator update the agent's cost tier without restarting it? **Proposed resolution:** The administrator updates the agent's configuration (config file, environment variable, or ConfigMap on Kubernetes). The agent - detects the change and sends a `POST /api/v1/agents` to DCM with the - updated cost tier — the same mechanism used when the supported service types - list changes. **This solution is deferred to later version: in the current + detects the change and sends a `POST /api/v1/agents` to DCM with the updated + cost tier — the same mechanism used when the supported service types list + changes. **This solution is deferred to a later version: in the current version, a restart will be needed for the change in the cost tier to be propagated (via [Agent Registration Flow](#agent-registration-flow) )** 3. 
How does DCM handle the "queued" CloudEvent response @@ -43,6 +43,11 @@ see-also: environment. It registers the environment to DCM, consumes resource operation requests from a messaging system, and routes them to the appropriate Service Provider. +- **Embedded SP:** SP code shipped within the agent binary (K8s Container, ACM + Cluster, KubeVirt), enabled via configuration. Embedded SPs register + internally at agent startup without a REST call. +- **External SP:** A standalone SP process that registers to the agent via the + REST API (`POST /api/v1/providers`). Also referred to as "bring your own" SP. - **Environment:** A set of infrastructures that is ready to receive workload from DCM (e.g., `dev`, `staging`, `prod-eu-west-1`). @@ -52,10 +57,11 @@ This enhancement aims at adding the notion of environment by adding a layer between the SP and DCM: an agent would run on each environment usable by DCM and the agent would register the environment to DCM. -The agent would then use the SPs as plugins for the supported service types and -pass the creation request to the relevant one. This would mean that each SP -registration with the agent serves exactly one service type (though a single SP -application may register multiple times for different service types). +The agent supports a hybrid SP model: it ships with embedded SP code for known +service types (K8s Container, ACM Cluster, KubeVirt), enabled via configuration, +and also accepts external ("bring your own") SPs that register via REST API. +Only one SP — embedded or external — may serve a given service type per agent; +duplicate registrations are rejected. This enhancement also proposes to change the way the creation request is submitted to the agent (or currently, to the SP): instead of sending a direct @@ -73,7 +79,7 @@ Provider (SP) by a policy on the base of several criteria. Once the SP is selected, DCM will send a request to the selected SP to request the creation of the resource. 
-There is currently no way for a policy to determine in which environment a SP is +There is currently no way for a policy to determine in which environment an SP is running and hence a user cannot explicitly set the targeted environment constraint when requesting the creation of a resource. @@ -92,6 +98,9 @@ requests, where manifests are pulled by the application creating the resource. - Define what information the agent gives to DCM while registering - Define how agents and DCM are communicating - Define how agents and Service Providers interact with each other +- Define how embedded SPs integrate with the agent alongside external SPs + (hybrid model) +- Define the service type uniqueness constraint (one SP per service type) - Define how Service Providers register to the agent, allowing the agent to dynamically build and maintain its list of supported service types - Define how the agent monitors Service Provider health using the three-state @@ -125,17 +134,31 @@ single-agent model, one agent consumes from the topic. In a future HA model, multiple agent replicas for the same environment could consume from the same topic as competing consumers. -Service Providers register directly to the agent (not to DCM). Each SP -registration with the agent serves exactly one service type, though a single SP -application may register multiple times for different service types. The agent -dynamically builds its list of supported service types based on the SPs that are -registered to it. When the list changes (SP registration or health-driven -removal), the agent updates DCM accordingly. +The agent supports a hybrid SP model combining embedded and external SPs: -An agent must have at least 1 Service Provider (SP) registered to it before self -registering to DCM. For each service type advertised as supported to DCM by the -agent, there must be at least 1 healthy SP registered supporting the given -service type. 
+- **Embedded SPs:** The agent ships with SP code for K8s Container, ACM Cluster, + and KubeVirt. These are enabled via configuration and register internally at + agent startup — no REST call is needed. The embedded SP code lives in + dedicated packages within the agent codebase. +- **External SPs ("bring your own"):** Standalone SP processes register to the + agent via the REST API (`POST /api/v1/providers`), following the contract + defined in the + [SP Registration Flow](../sp-registration-flow/sp-registration-flow.md). + +Only one SP — embedded or external — may serve a given service type per agent. +If an SP attempts to register for a service type that is already served, the +registration is rejected (see +[SP Registration to Agent](#sp-registration-to-agent)). Future iterations may +support multiple SPs per service type with selection strategies (e.g., +affinity-based, capacity-based). + +The agent dynamically builds its list of supported service types based on the +SPs registered to it (both embedded and external). When the list changes (SP +registration or health-driven removal), the agent updates DCM accordingly. + +An agent must have at least one SP (embedded or external) registered and healthy +before self-registering to DCM. Each service type advertised to DCM must be +backed by a healthy SP. DCM will send the creation request to the specific topic that was created by the agent. @@ -143,9 +166,17 @@ agent. The agent will then consume the message, validate it and then pass it to the relevant SP. -The agent monitors the health of its registered SPs by polling their `/health` -endpoint, using the three-state model (Ready, Unhealthy, Unavailable). The agent -differentiates its behavior based on the SP health state: +The agent monitors the health of its registered SPs using the three-state model +(Ready, Unhealthy, Unavailable). 
The health monitoring mechanism differs by SP +type: + +- **Embedded SPs:** Health is determined in-process — the agent directly checks + the embedded SP's internal state without a network call. +- **External SPs:** Health is determined by polling the SP's `GET /health` + endpoint, as defined in the + [Service Provider Health Check enhancement](../service-provider-health-check/service-provider-health-check.md). + +The agent differentiates its behavior based on the SP health state: - **Unhealthy:** The agent keeps the service type in its advertised list to DCM but stops routing requests to the SP. Incoming requests for that service type @@ -155,10 +186,10 @@ differentiates its behavior based on the SP health state: updates DCM, and rejects any held requests for that service type. The agent exposes the health status of each registered SP via a `/api/v1/status` -endpoint. -On Kubernetes/OpenShift deployments, the agent additionally surfaces this -information as custom pod conditions on its own pod, allowing administrators to -quickly identify which SPs are causing issues via `oc describe pod`. +endpoint. On Kubernetes/OpenShift deployments, the agent additionally surfaces +this information as custom pod conditions on its own pod, allowing +administrators to quickly identify which SPs are causing issues via +`oc describe pod`. The agent reports its own liveness to DCM via periodic REST heartbeats. 
DCM tracks the last heartbeat timestamp and marks the agent as unavailable if no @@ -176,7 +207,8 @@ flowchart TD classDef dcm fill:#2d2d2d,color:#ffffff,stroke:#81c784,stroke-width:2px classDef messaging fill:#2d2d2d,color:#ffffff,stroke:#ffb74d,stroke-width:2px classDef agent fill:#2d2d2d,color:#ffffff,stroke:#f48fb1,stroke-width:2px - classDef provider fill:#2d2d2d,color:#ffffff,stroke:#90caf9,stroke-width:2px + classDef embedded fill:#2d2d2d,color:#ffffff,stroke:#ce93d8,stroke-width:2px + classDef external fill:#2d2d2d,color:#ffffff,stroke:#90caf9,stroke-width:2px classDef clusterEnvironment fill:#FFFFFF,stroke:#bdbdbd,stroke-width:2px DCM["**DCM**
Control Plane"]:::dcm @@ -184,15 +216,18 @@ flowchart TD subgraph Target_Environment["Target Environment"] direction LR - SPX["**SP**
Service Type X"]:::provider - AG["**Agent**
Routes creation requests to SP"]:::agent - SPY["**SP**
Service Type Y"]:::provider - SPX -. Registration .-> AG - SPY -. Registration .-> AG - AG -->|Creation Request| SPX - AG -->|Creation Request| SPY - AG -.->|Health Check| SPX - AG -.->|Health Check| SPY + EXT_SP["**External SP**
Service Type Z
(bring your own)"]:::external + + subgraph Agent_Process["Agent Process"] + direction TB + AG["**Agent**
Routes creation requests to SP"]:::agent + EMB_SP["**Embedded SPs**
K8s Container · ACM Cluster · KubeVirt
(enabled via config)"]:::embedded + EMB_SP ---|In-process| AG + end + + EXT_SP -. "Registration (REST)" .-> AG + AG -->|Creation Request| EXT_SP + AG -.->|"Health Check (polling)"| EXT_SP end DCM -->|Creation Request| MS @@ -201,8 +236,8 @@ flowchart TD AG -. Heartbeat .-> DCM AG -->|Health Warning| MS MS -->|Health Warning| DCM - SPX -->|Status| MS - SPY -->|Status| MS + EXT_SP -->|Status| MS + EMB_SP -->|Status| MS MS -->|Status| DCM class Target_Environment clusterEnvironment @@ -211,19 +246,23 @@ flowchart TD #### Flow Description - The agent is spawned in an environment -- Several Service Providers (SP) are running and each serving a specific service - type -- Each SP registers itself to the agent; the agent dynamically builds its - supported service types list +- At startup, the agent registers its configured embedded SPs internally (K8s + Container, ACM Cluster, KubeVirt — each enabled via configuration) +- External SPs register to the agent via REST API; the agent rejects + registration if the service type is already served (by an embedded or another + external SP) +- Only one SP (embedded or external) may serve a given service type - The agent creates a specific topic in the bus system - Once at least one SP is registered and healthy, the agent self-registers to DCM and begins sending periodic heartbeats - DCM sends creation request to the specific topic - The agent consumes the messages sent to the topic -- The agent routes the creation request to the relevant SP -- The agent periodically health-checks each registered SP; when the last SP for - a service type becomes unhealthy, the agent updates DCM and publishes a health - warning through the messaging system +- The agent routes the creation request to the SP serving the requested service + type +- The agent monitors each registered SP's health: in-process for embedded SPs, + via `/health` endpoint polling for external SPs. 
When the SP for a service + type becomes unhealthy, the agent publishes a health warning through the + messaging system - The status monitoring remains unchanged: each SP manages its resource lifecycle and reports status through the messaging system @@ -236,18 +275,31 @@ flowchart TD | POST | /api/v1/providers | SP registration — reuses the [SP Registration Flow](../sp-registration-flow/sp-registration-flow.md) contract | | GET | /api/v1/status | Agent status — health of all registered SPs | -##### `POST /api/v1/providers` — SP Registration +##### `POST /api/v1/providers` — SP Registration (External SPs only) Reuses the contract defined in the [SP Registration Flow](../sp-registration-flow/sp-registration-flow.md) enhancement. The agent applies the same idempotency semantics (name as natural key, create-or-update behavior). +Only one SP may serve a given service type. If the requested service type is +already served by another SP (embedded or external), the agent rejects the +registration with `409 Conflict`: + +```json +{ + "error": "service type 'vm' is already served by provider 'vm-provider'" +} +``` + +Embedded SPs register internally at startup and do not use this endpoint. + ##### `GET /api/v1/status` — Agent Status -Returns the health state of all registered SPs. This endpoint is always -available, regardless of the deployment mode (Kubernetes, Docker, standalone), -and is the primary way to inspect the agent's view of its Service Providers. +Returns the health state of all registered SPs (both embedded and external). +This endpoint is always available, regardless of the deployment mode +(Kubernetes, Docker, standalone), and is the primary way to inspect the agent's +view of its Service Providers. 
Example response: @@ -255,9 +307,10 @@ Example response: { "providers": [ { - "providerId": "sp-vm-001", - "name": "vm-provider", - "serviceType": "vm", + "providerId": "sp-container-001", + "name": "k8s-container", + "serviceType": "container", + "type": "embedded", "status": "Ready", "lastCheck": "2026-06-05T10:30:00Z" }, @@ -265,6 +318,7 @@ Example response: "providerId": "sp-db-001", "name": "db-provider", "serviceType": "database", + "type": "external", "status": "Unhealthy", "lastCheck": "2026-06-05T10:30:00Z" } @@ -274,23 +328,23 @@ Example response: #### DCM Endpoints -| Method | Endpoint | Description | -| ------ | ---------------------------------- | ------------------------- | -| POST | /api/v1/agents | Agent registration | -| PUT | /api/v1/agents/{agentId}/heartbeat | Agent heartbeat | +| Method | Endpoint | Description | +| ------ | ---------------------------------- | ------------------ | +| POST | /api/v1/agents | Agent registration | +| PUT | /api/v1/agents/{agentId}/heartbeat | Agent heartbeat | ##### `POST /api/v1/agents` — Agent Registration Register a new agent to DCM. -| Field | Type | Required | Description | -| ------------------ | -------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| name | string | yes | Unique agent name | -| environment | string | yes | Freeform environment identifier (e.g., `"dev"`, `"staging"`, `"prod-eu-west-1"`) | -| serviceTypes | string[] | yes | List of service types the agent can serve. Must be non-empty on initial registration (prerequisite: at least one healthy SP). May be empty on subsequent re-registrations when all SPs are unavailable (Unhealthy SPs do not trigger service type removal — see [SP Health Monitoring](#sp-health-monitoring)). 
| -| resourcesAvailable | object | no | Available resources in the environment — sourced from K8s node info or manual configuration (see below) | -| cost | enum | yes | Cost tier: `low` \| `medium-low` \| `medium` \| `medium-high` \| `high` | -| topicName | string | yes | Deterministic topic name for the agent's messaging channel | +| Field | Type | Required | Description | +| ------------------ | -------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| name | string | yes | Unique agent name | +| environment | string | yes | Freeform environment identifier (e.g., `"dev"`, `"staging"`, `"prod-eu-west-1"`) | +| serviceTypes | string[] | yes | List of service types the agent can serve. Must be non-empty on initial registration (prerequisite: at least one healthy SP, embedded or external). May be empty on subsequent re-registrations when SPs become unavailable (an Unhealthy SP does not trigger service type removal — see [SP Health Monitoring](#sp-health-monitoring)). | +| resourcesAvailable | object | no | Available resources in the environment — sourced from K8s node info or manual configuration (see below) | +| cost | enum | yes | Cost tier: `low` \| `medium-low` \| `medium` \| `medium-high` \| `high` | +| topicName | string | yes | Deterministic topic name for the agent's messaging channel | Response: `201 Created` with `{agentId}` @@ -324,65 +378,116 @@ Response: `200 OK` ### SP Registration to Agent Service Providers register to the agent rather than to DCM directly. 
The agent -exposes a REST API for SP registration and dynamically maintains its list of +supports two registration mechanisms and dynamically maintains its list of supported service types based on registered SPs. -SPs periodically re-register with the agent to maintain their registration. This -periodic re-registration serves as a lease renewal and ensures that after an -agent restart (where the agent loses its in-memory state), SPs naturally -re-register without requiring any additional coordination mechanism. +**Service type uniqueness constraint:** Only one SP — embedded or external — may +serve a given service type per agent. The first SP to register for a service +type claims the slot. Subsequent registration attempts for the same service type +are rejected. + +#### Embedded SP Registration + +At startup, the agent registers its configured embedded SPs internally. Each +embedded SP's code lives in a dedicated package within the agent codebase and is +enabled explicitly via a configuration field. The embedded SP code reaches the +agent's registration logic directly — no REST call is involved. + +If the agent's state is not clean (e.g., an external SP already holds a service +type slot from a prior session), the embedded SP registration for that service +type is rejected. The agent logs a warning and continues running — this is not a +fatal error. + +Because embedded SPs register at startup before external SPs can connect, they +effectively take priority on a clean agent state. + +#### External SP Registration + +External SPs register via the REST API (`POST /api/v1/providers`), following the +contract defined in the +[SP Registration Flow](../sp-registration-flow/sp-registration-flow.md) +enhancement. The agent applies the same idempotency semantics (name as natural +key, create-or-update behavior). 
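
The registration rules described so far (name as natural key, create-or-update on re-registration, and one SP per service type) can be illustrated with a minimal Python sketch. This is not the agent's implementation: `Provider`, `ProviderRegistry`, the `sp-<name>` identifier format, and the `200` status returned for a lease renewal are all assumptions made for illustration; the enhancement itself only specifies `201 Created` for a successful registration and `409 Conflict` for an occupied slot. The sketch also ignores the case of a known name re-registering under a different service type.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    service_type: str
    endpoint: str


class ProviderRegistry:
    """Hypothetical in-memory registry holding one SP per service type."""

    def __init__(self) -> None:
        self._by_service_type: dict[str, Provider] = {}

    def register(self, provider: Provider) -> tuple[int, dict]:
        """Return an (HTTP status, body) pair for a registration attempt."""
        owner = self._by_service_type.get(provider.service_type)
        if owner is not None and owner.name != provider.name:
            # The slot is claimed by a different SP: reject the registration.
            return 409, {
                "error": f"service type '{provider.service_type}' is already "
                         f"served by provider '{owner.name}'"
            }
        # Free slot, or the same SP renewing its lease: create or update.
        # Returning 200 for a renewal is an assumption of this sketch.
        status = 200 if owner is not None else 201
        self._by_service_type[provider.service_type] = provider
        # The providerId format below is invented for illustration.
        return status, {"providerId": f"sp-{provider.name}"}

    def service_types(self) -> list[str]:
        """Service types the agent would advertise to DCM."""
        return sorted(self._by_service_type)
```

Because the registry is keyed by service type, embedded SPs that register first at startup take the slot, and a later external SP asking for the same type receives the conflict response naming the current owner.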
+ +If the requested service type is already served by another SP (embedded or +external), the agent rejects the registration with `409 Conflict` and a message +identifying the conflicting provider, so the administrator can take action if +necessary. + +External SPs periodically re-register with the agent to maintain their +registration. This periodic re-registration serves as a lease renewal and +ensures that after an agent restart (where the agent loses its in-memory state), +SPs naturally re-register without requiring any additional coordination +mechanism. + +#### DCM Notification When the list of supported service types changes as a result of an SP -registration and the agent is already registered to DCM, the agent updates DCM -via a `POST /api/v1/agents` request with the full updated registration payload. If the agent has -not yet registered to DCM (i.e., this is the first SP registering), the agent -does not notify DCM yet; instead, the SP registration satisfies the -prerequisite for the agent to proceed with its initial registration to DCM (see +registration (embedded or external) and the agent is already registered to DCM, +the agent updates DCM via a `POST /api/v1/agents` request with the full updated +registration payload. If the agent has not yet registered to DCM (i.e., this is +the first SP registering), the agent does not notify DCM yet; instead, the SP +registration satisfies the prerequisite for the agent to proceed with its +initial registration to DCM (see [Agent Registration Flow](#agent-registration-flow)). ```mermaid sequenceDiagram autonumber - participant SP as Service Provider + participant SP as External SP participant AG as Agent participant DCM as DCM
(Control Plane) participant DB as Database - Note over SP: SP starts and
registers to the agent + Note over AG: Agent starts:
register embedded SPs
from configuration + + AG->>AG: Register embedded SPs internally
(K8s Container, ACM Cluster, KubeVirt
— each if enabled in config) + + Note over SP: External SP starts and
registers to the agent SP->>AG: POST /api/v1/providers
{name, serviceType, endpoint} activate AG - AG->>AG: Store SP registration
Recompute supported service types + alt Service type already served by another SP + AG-->>SP: 409 Conflict
{error: "service type X already
served by provider Y"} + else Service type available + AG->>AG: Store SP registration
Add service type to supported list - alt Service type list changed AND agent already registered to DCM - AG->>DCM: POST /api/v1/agents
{name, environment, serviceTypes,
resourcesAvailable, cost, topicName} - activate DCM - DCM->>DB: Update agent registration - activate DB - DB-->>DCM: Registration updated - deactivate DB - DCM-->>AG: 200 OK - deactivate DCM - else Service type list changed AND agent not yet registered to DCM - Note over AG: Prerequisite for initial
agent registration is now met
(see Agent Registration Flow) - end + alt Service type list changed AND agent already registered to DCM + AG->>DCM: POST /api/v1/agents
{name, environment, serviceTypes,
resourcesAvailable, cost, topicName} + activate DCM + DCM->>DB: Update agent registration + activate DB + DB-->>DCM: Registration updated + deactivate DB + DCM-->>AG: 200 OK + deactivate DCM + else Service type list changed AND agent not yet registered to DCM + Note over AG: Prerequisite for initial
agent registration is now met
(see Agent Registration Flow) + end - AG-->>SP: 201 Created
{providerId} + AG-->>SP: 201 Created
{providerId} + end deactivate AG - Note over SP,AG: SP periodically re-registers
to maintain its lease + Note over SP,AG: External SP periodically
re-registers to maintain its lease ``` #### Flow Description -1. The SP starts and registers to the agent -2. The SP registers itself with the agent via a REST API call, providing: +1. At startup, the agent registers its configured embedded SPs internally. Each + embedded SP claims a service type slot. If a slot is already occupied, the + agent logs a warning and continues +2. An external SP starts and registers to the agent via a REST API call, + providing: - Name - Service type it serves - Endpoint (URL where the agent can reach the SP) -3. The agent stores the SP registration and recomputes the list of supported - service types +3. The agent checks whether the requested service type is already served: + - If **already served**: the agent rejects the registration with + `409 Conflict` and a message identifying the conflicting provider + - If **available**: the agent stores the SP registration and adds the service + type to its supported list 4. If the service type list changed (new service type added): - If the agent is already registered to DCM: the agent sends a `POST /api/v1/agents` request to DCM with the full updated agent @@ -392,9 +497,10 @@ sequenceDiagram agent's initial registration (see [Agent Registration Flow](#agent-registration-flow)) 5. The agent acknowledges the SP registration -6. The SP periodically re-registers with the agent; the agent handles this +6. External SPs periodically re-register with the agent; the agent handles this idempotently (create or update). This ensures that after an agent restart, - SPs naturally rebuild the agent's state without additional coordination + external SPs naturally rebuild the agent's state without additional + coordination ### Agent Registration Flow @@ -414,7 +520,7 @@ sequenceDiagram AG->>MS: Create retry topic (internal) MS-->>AG: Topic created
{topicName}.retry - Note over AG: Prerequisite:
At least 1 SP must be
registered and healthy
(see SP Registration to Agent) + Note over AG: Prerequisite:
At least 1 SP (embedded or
external) must be registered
and healthy
(see SP Registration to Agent) AG->>DCM: POST /api/v1/agents
{name, environment, serviceTypes,
resourcesAvailable, cost, topicName} activate DCM @@ -435,13 +541,14 @@ sequenceDiagram - A **main topic** (using a deterministic name) to establish a dedicated communication channel with DCM. This topic name is advertised to DCM during registration. - - A **retry topic** (`{topicName}.retry`) used internally by the agent to hold - requests when all SPs for a service type are Unhealthy (see + - A **retry topic** (`{topicName}.retry`) used internally by the agent to + hold requests when the SP for a service type is Unhealthy (see [Retry Topic](#retry-topic)). This topic is not advertised to DCM. -3. The agent checks whether at least one SP is registered and healthy: - - If at least 1 SP is registered and healthy: the agent proceeds to register - to DCM - - Else: the agent waits until at least 1 SP is registered and healthy +3. The agent checks whether at least one SP (embedded or external) is registered + and healthy: + - If at least one SP is registered and healthy: the agent proceeds to + register to DCM + - Else: the agent waits until at least one SP is registered and healthy 4. The agent registers itself with DCM via a REST API call, providing: - Name - Environment @@ -477,36 +584,52 @@ sequenceDiagram participant DCM as DCM
(Control Plane) participant MS as Messaging System participant AG as Agent - participant SP as Service Provider + participant EMB as Embedded SP + participant EXT as External SP DCM->>MS: PUBLISH CloudEvent (creation request)
topic: {agentTopicName}
{resourceId, serviceType, spec} MS->>AG: PUSH message activate AG - AG->>AG: Validate requested service type
is supported by an attached SP + AG->>AG: Validate requested service type
is supported by a registered SP alt Service type not supported AG->>MS: PUBLISH CloudEvent
{error: "unsupported service type"} MS->>DCM: PUSH error message - else Service type supported but all SPs Unhealthy + else Service type supported but SP is Unhealthy AG->>MS: PUBLISH CloudEvent (hold request)
topic: {agentTopicName}.retry
{resourceId, serviceType, spec} - AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.responses
{resourceId, status: QUEUED,
reason: "SPs unhealthy — held for retry"} + AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.responses
{resourceId, status: QUEUED,
reason: "SP unhealthy — held for retry"} MS->>DCM: PUSH queued response - else Service type supported and at least one SP Ready - AG->>SP: POST {spEndpoint}/api/v1/{serviceType}
{spec} - activate SP - - alt SP creation fails - SP-->>AG: Error response - deactivate SP - AG->>MS: PUBLISH CloudEvent
{error: "creation failed", details} - MS->>DCM: PUSH error message - else SP creation succeeds - SP-->>AG: Success response
{instanceId, status: PROVISIONING} - AG->>MS: PUBLISH CloudEvent
{resourceId, status: PROVISIONING} - MS->>DCM: PUSH creation acknowledged - Note over SP: SP manages resource lifecycle
and reports status through
the existing status reporting flow + else Service type supported and SP is Ready + alt SP is embedded + AG->>EMB: In-process call
{serviceType, spec} + activate EMB + alt Creation fails + EMB-->>AG: Error + deactivate EMB + AG->>MS: PUBLISH CloudEvent
{error: "creation failed", details} + MS->>DCM: PUSH error message + else Creation succeeds + EMB-->>AG: Success
{instanceId, status: PROVISIONING} + AG->>MS: PUBLISH CloudEvent
{resourceId, status: PROVISIONING} + MS->>DCM: PUSH creation acknowledged + Note over EMB: SP manages resource lifecycle
and reports status through
the existing status reporting flow + end + else SP is external + AG->>EXT: POST {spEndpoint}/api/v1/{serviceType}
{spec} + activate EXT + alt Creation fails + EXT-->>AG: Error response + deactivate EXT + AG->>MS: PUBLISH CloudEvent
{error: "creation failed", details} + MS->>DCM: PUSH error message + else Creation succeeds + EXT-->>AG: Success response
{instanceId, status: PROVISIONING} + AG->>MS: PUBLISH CloudEvent
{resourceId, status: PROVISIONING} + MS->>DCM: PUSH creation acknowledged + Note over EXT: SP manages resource lifecycle
and reports status through
the existing status reporting flow + end end end deactivate AG @@ -517,21 +640,22 @@ sequenceDiagram 1. DCM publishes a creation request CloudEvent to the agent's dedicated topic in the messaging system, including the resource ID, service type, and spec 2. The agent consumes the message -3. The agent validates that the requested service type is supported by one of - its attached Service Providers +3. The agent validates that the requested service type is supported by a + registered SP (embedded or external) 4. If the service type is **not supported**: - The agent publishes an error CloudEvent back to the messaging system - DCM consumes the error message -5. If the service type is **supported but all SPs are Unhealthy**: +5. If the service type is **supported but the SP is Unhealthy**: - The agent publishes the original request CloudEvent to the retry topic (`{agentTopicName}.retry`) for durable holding - The agent publishes a "queued" CloudEvent to `dcm.agents.responses` with `{resourceId, serviceType, status: "QUEUED"}`, informing DCM that the request is held for retry - - The request will be processed when an SP recovers, or rejected if all SPs - become Unavailable (see [Retry Topic](#retry-topic)) -6. If the service type is **supported and at least one SP is Ready**: - - The agent forwards the creation request to the relevant SP via REST API + - The request will be processed when the SP recovers, or rejected if the SP + becomes Unavailable (see [Retry Topic](#retry-topic)) +6. 
If the service type is **supported and the SP is Ready**: + - The agent forwards the creation request to the SP via REST API (for + external SPs) or in-process call (for embedded SPs) - If the SP returns an **immediate error**: the agent publishes an error CloudEvent back to the messaging system for DCM to consume - If the SP **accepts** the request: the agent publishes a CloudEvent @@ -539,12 +663,12 @@ sequenceDiagram lifecycle management and reports status changes through the existing status reporting flow (SP → Messaging System → DCM) -#### SP Selection Strategy +#### Service Type Uniqueness -When multiple SPs are registered for the same service type, the agent selects -the SP in alphabetical order. Future iterations may introduce affinity-based or -capacity-based selection strategies (e.g., selecting the SP with the most -available resources, similar to pod affinity in Kubernetes). +Each service type is served by exactly one SP (embedded or external). There is +no SP selection strategy in the current version. Future iterations may support +multiple SPs per service type with selection strategies (e.g., affinity-based, +capacity-based). #### Retry Policy @@ -556,31 +680,30 @@ failure. #### Retry Topic -When all SPs for a given service type are Unhealthy, the agent cannot route -requests but the service type remains advertised to DCM (to avoid -registration flapping). Instead of rejecting the request, the agent publishes -it to a dedicated **retry topic** (`{agentTopicName}.retry`) for durable -holding, and responds to DCM with a "queued" CloudEvent. +When the SP for a given service type is Unhealthy, the agent cannot route +requests but the service type remains advertised to DCM (to avoid registration +flapping). Instead of rejecting the request, the agent publishes it to a +dedicated **retry topic** (`{agentTopicName}.retry`) for durable holding, and +responds to DCM with a "queued" CloudEvent. 
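
As a rough illustration of the routing decision described in the creation and deletion flows, the following Python sketch maps one consumed request to an action based on the health of the single SP serving its service type. The function name `route`, the shape of the returned dictionary, and the `"service type unavailable"` error text are assumptions of this sketch; the topic names `{agentTopicName}.retry` and `dcm.agents.responses` and the `QUEUED` status come from the flows above.

```python
from enum import Enum


class Health(Enum):
    READY = "Ready"
    UNHEALTHY = "Unhealthy"
    UNAVAILABLE = "Unavailable"


def route(request: dict, sp_health: dict, agent_topic: str) -> dict:
    """Decide what the agent does with one request consumed from its topic.

    sp_health maps each registered service type to the health state of the
    single SP serving it (hypothetical structure for this sketch).
    """
    service_type = request["serviceType"]
    state = sp_health.get(service_type)
    if state is None:
        # No SP registered for this service type.
        return {"action": "reject", "error": "unsupported service type"}
    if state is Health.UNHEALTHY:
        # Hold the original request as-is and tell DCM it is queued.
        queued = {
            "resourceId": request["resourceId"],
            "serviceType": service_type,
            "status": "QUEUED",
        }
        return {
            "action": "hold",
            "publish": [
                (f"{agent_topic}.retry", request),
                ("dcm.agents.responses", queued),
            ],
        }
    if state is Health.UNAVAILABLE:
        # Error text is illustrative; the enhancement does not define it.
        return {"action": "reject", "error": "service type unavailable"}
    # Ready: forward via REST (external SP) or in-process call (embedded SP).
    return {"action": "forward"}
```

Keeping the decision a pure function of the request and the health map makes it straightforward to re-run the same logic when held messages are later read back from the retry topic.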
-The retry topic is created by the agent at startup alongside the main topic -(see [Agent Registration Flow](#agent-registration-flow)). It is internal to -the agent and is not advertised to DCM. +The retry topic is created by the agent at startup alongside the main topic (see +[Agent Registration Flow](#agent-registration-flow)). It is internal to the +agent and is not advertised to DCM. **Message format:** The original CloudEvent is published to the retry topic as-is (passthrough, no wrapping). -**Consumption is event-driven.** The agent reads the retry topic only when an -SP health state changes — not periodically: +**Consumption is event-driven.** The agent reads the retry topic only when an SP +health state changes — not periodically: - **SP transitions to Ready:** The agent consumes the retry topic. For each - message whose service type now has a Ready SP, the agent processes the - request (forwards to the SP, responds to DCM with success or error). - Messages for service types still Unhealthy are re-published to the retry - topic. + message whose service type now has a Ready SP, the agent processes the request + (forwards to the SP, responds to DCM with success or error). Messages for + service types whose SP is still Unhealthy are re-published to the retry topic. - **SP transitions to Unavailable:** The agent consumes the retry topic. For - each message whose service type has all SPs Unavailable, the agent rejects - the request with an error CloudEvent to DCM. Messages for other service - types are re-published to the retry topic. + each message whose service type's SP is Unavailable, the agent rejects the + request with an error CloudEvent to DCM. Messages for other service types are + re-published to the retry topic. - **No health state change:** The retry topic is not consumed. **Creation/Deletion dedup:** If both a creation request and a deletion request @@ -589,12 +712,12 @@ removed — they cancel out since the resource was never created. 
The agent logs the cancellation and acknowledges the deletion to DCM. The creation request is silently dropped since it was never started. -**Ordering:** Requests are processed in arrival order per service type. -Requests for different service types are independent. +**Ordering:** Requests are processed in arrival order per service type. Requests +for different service types are independent. -**Durability:** Messages in the retry topic survive agent crashes, guaranteed -by the messaging system's persistence layer. On restart, the agent re-reads -both the main topic and the retry topic. +**Durability:** Messages in the retry topic survive agent crashes, guaranteed by +the messaging system's persistence layer. On restart, the agent re-reads both +the main topic and the retry topic. #### In-Flight Request Handling @@ -602,11 +725,11 @@ When the agent restarts, unconsumed messages on both the main topic and the retry topic are consumed once the agent is back up (guaranteed by the messaging system's persistence layer). -- **All SPs Unhealthy:** The agent publishes the request to the retry topic and - responds to DCM with a "queued" CloudEvent. The request is processed when an - SP recovers, or rejected when all SPs for that service type become - Unavailable (see [Retry Topic](#retry-topic)). -- **All SPs Unavailable:** The agent responds with an error CloudEvent for each +- **SP is Unhealthy:** The agent publishes the request to the retry topic and + responds to DCM with a "queued" CloudEvent. The request is processed when the + SP recovers, or rejected when the SP for that service type becomes Unavailable + (see [Retry Topic](#retry-topic)). +- **SP is Unavailable:** The agent responds with an error CloudEvent for each incoming request targeting that service type. Additionally, the agent drains the retry topic, rejecting any held requests for that service type with error CloudEvents. @@ -619,36 +742,52 @@ sequenceDiagram participant DCM as DCM
(Control Plane) participant MS as Messaging System participant AG as Agent - participant SP as Service Provider + participant EMB as Embedded SP + participant EXT as External SP DCM->>MS: PUBLISH CloudEvent (deletion request)
topic: {agentTopicName}
{resourceId, serviceType} MS->>AG: PUSH message activate AG - AG->>AG: Validate requested service type
is supported by an attached SP + AG->>AG: Validate requested service type
is supported by a registered SP alt Service type not supported AG->>MS: PUBLISH CloudEvent
{error: "unsupported service type"} MS->>DCM: PUSH error message - else Service type supported but all SPs Unhealthy + else Service type supported but SP is Unhealthy AG->>MS: PUBLISH CloudEvent (hold request)
topic: {agentTopicName}.retry
{resourceId, serviceType} - AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.responses
{resourceId, status: QUEUED,
reason: "SPs unhealthy — held for retry"} + AG->>MS: PUBLISH CloudEvent
topic: dcm.agents.responses
{resourceId, status: QUEUED,
reason: "SP unhealthy — held for retry"} MS->>DCM: PUSH queued response - else Service type supported and at least one SP Ready - AG->>SP: DELETE {spEndpoint}/api/v1/{serviceType}/{resourceId} - activate SP - - alt SP deletion fails - SP-->>AG: Error response - deactivate SP - AG->>MS: PUBLISH CloudEvent
{error: "deletion failed",
resourceId, details} - MS->>DCM: PUSH error message - else SP deletion succeeds - SP-->>AG: Success response
{resourceId, status: DELETING} - AG->>MS: PUBLISH CloudEvent
{resourceId, status: DELETING} - MS->>DCM: PUSH deletion acknowledged - Note over SP: SP manages resource deletion
and reports final status through
the existing status reporting flow + else Service type supported and SP is Ready + alt SP is embedded + AG->>EMB: In-process call
{serviceType, resourceId} + activate EMB + alt Deletion fails + EMB-->>AG: Error + deactivate EMB + AG->>MS: PUBLISH CloudEvent
{error: "deletion failed",
resourceId, details} + MS->>DCM: PUSH error message + else Deletion succeeds + EMB-->>AG: Success
{resourceId, status: DELETING} + AG->>MS: PUBLISH CloudEvent
{resourceId, status: DELETING} + MS->>DCM: PUSH deletion acknowledged + Note over EMB: SP manages resource deletion
and reports final status through
the existing status reporting flow + end + else SP is external + AG->>EXT: DELETE {spEndpoint}/api/v1/{serviceType}/{resourceId} + activate EXT + alt Deletion fails + EXT-->>AG: Error response + deactivate EXT + AG->>MS: PUBLISH CloudEvent
{error: "deletion failed",
resourceId, details} + MS->>DCM: PUSH error message + else Deletion succeeds + EXT-->>AG: Success response
{resourceId, status: DELETING} + AG->>MS: PUBLISH CloudEvent
{resourceId, status: DELETING} + MS->>DCM: PUSH deletion acknowledged + Note over EXT: SP manages resource deletion
and reports final status through
the existing status reporting flow + end end end deactivate AG @@ -659,21 +798,21 @@ sequenceDiagram 1. DCM publishes a deletion request CloudEvent to the agent's dedicated topic in the messaging system, including the resource ID and service type 2. The agent consumes the message -3. The agent validates that the requested service type is supported by one of - its attached Service Providers +3. The agent validates that the requested service type is supported by a + registered SP (embedded or external) 4. If the service type is **not supported**: - The agent publishes an error CloudEvent back to the messaging system - DCM consumes the error message -5. If the service type is **supported but all SPs are Unhealthy**: +5. If the service type is **supported but the SP is Unhealthy**: - The agent publishes the original request to the retry topic for durable holding - The agent publishes a "queued" CloudEvent to `dcm.agents.responses`, informing DCM that the request is held for retry - - The request will be processed when an SP recovers, or rejected if all SPs - become Unavailable (see [Retry Topic](#retry-topic)) -6. If the service type is **supported and at least one SP is Ready**: - - The agent forwards the deletion request to the relevant SP via a REST - `DELETE` call + - The request will be processed when the SP recovers, or rejected if the SP + becomes Unavailable (see [Retry Topic](#retry-topic)) +6. 
If the service type is **supported and the SP is Ready**: + - The agent forwards the deletion request to the SP via a REST `DELETE` call + (for external SPs) or in-process call (for embedded SPs) - If the SP returns an **immediate error**: the agent publishes an error CloudEvent back to the messaging system for DCM to consume - If the SP **accepts** the request: the agent publishes a CloudEvent @@ -737,40 +876,45 @@ sequenceDiagram #### SP Health Monitoring -The agent monitors the health of its registered Service Providers by polling -their `/health` endpoint, using the three-state health model defined in the -[Service Provider Health Check enhancement](../service-provider-health-check/service-provider-health-check.md): +The agent monitors the health of its registered Service Providers using the +three-state health model defined in the +[Service Provider Health Check enhancement](../service-provider-health-check/service-provider-health-check.md). +The monitoring mechanism differs by SP type: -| State | Condition | -| --------------- | --------------------------------------------------------------------------------------------------- | -| **Ready** | SP responds with `200 OK` and `status: "healthy"` | -| **Unhealthy** | SP responds with `200 OK` and `status: "unhealthy"` (SP reachable but backing provider unavailable) | -| **Unavailable** | SP does not respond or returns an error, after exceeding the failure threshold | +- **Embedded SPs:** Health is determined in-process — the agent directly checks + the embedded SP's internal state without a network call. +- **External SPs:** Health is determined by polling the SP's `GET /health` + endpoint. -With the agent layer, the responsibility for polling SP health shifts from DCM -to the agent. The agent is the natural point to perform health checks on its -registered SPs, as it already maintains the list of SP endpoints. 
+| State | Condition | +| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| **Ready** | SP responds with `200 OK` and `status: "healthy"` (external), or internal check passes (embedded) | +| **Unhealthy** | SP responds with `200 OK` and `status: "unhealthy"` (external), or internal check reports unhealthy (embedded) | +| **Unavailable** | SP does not respond or returns an error after exceeding the failure threshold (external), or internal check reports unavailable (embedded) | -The agent only routes creation requests to SPs in the **Ready** state. SPs in -the **Unhealthy** or **Unavailable** state are not eligible for routing, even -though an Unhealthy SP is technically reachable. When all SPs for a service -type are Unhealthy, incoming requests are held in the retry topic rather than -rejected (see [Retry Topic](#retry-topic)). +With the agent layer, the responsibility for monitoring SP health shifts from +DCM to the agent. The agent is the natural point to perform health checks on its +registered SPs, as it already maintains the list of SP registrations. -The agent differentiates its behavior based on the health state of the last SP -serving a given service type: +The agent only routes requests to SPs in the **Ready** state. An SP in the +**Unhealthy** or **Unavailable** state is not eligible for routing, even though +an Unhealthy SP may be technically reachable. When the SP for a service type is +Unhealthy, incoming requests are held in the retry topic rather than rejected +(see [Retry Topic](#retry-topic)). -**When the last SP becomes Unhealthy:** +Since each service type is served by exactly one SP, the agent's behavior is +determined by that SP's health state: + +**When the SP becomes Unhealthy:** 1. The agent **keeps** the service type in its advertised list (no update sent to DCM to remove it) -2. 
The agent stops routing new requests to SPs for that service type — incoming - requests are held in the retry topic and a "queued" CloudEvent is sent to - DCM +2. The agent stops routing new requests for that service type — incoming + requests are held in the retry topic and a "queued" CloudEvent is sent to DCM 3. The agent publishes a health warning CloudEvent to `dcm.agents.health` with type `service-type-degraded` -**When the last SP becomes Unavailable:** +**When the SP becomes Unavailable:** 1. The agent removes the service type from its advertised list 2. The agent sends a `POST /api/v1/agents` request to DCM with the updated @@ -780,8 +924,8 @@ serving a given service type: 4. The agent publishes a health warning CloudEvent to `dcm.agents.health` with type `service-type-unavailable` -**When a previously unhealthy or unavailable SP recovers** (returns `200 OK` -with `status: "healthy"`): +**When a previously unhealthy or unavailable SP recovers** (returns to Ready +state): 1. If the service type was removed (Unavailable case): the agent re-adds it to its list and sends a `POST /api/v1/agents` to DCM with the updated @@ -834,12 +978,14 @@ service account. sequenceDiagram autonumber participant AG as Agent - participant SP as Service Provider + participant SP as External SP participant MS as Messaging System participant DCM as DCM
<br/>(Control Plane) participant DB as Database - loop Every {healthCheckInterval} seconds + Note over AG: Embedded SPs: health<br/>checked in-process + + loop Every {healthCheckInterval} seconds (external SPs) AG->>SP: GET /health alt Healthy SP-->>AG: 200 OK<br/>{status: "healthy"} @@ -854,13 +1000,13 @@ sequenceDiagram end end - alt Last SP for service type X becomes Unhealthy + alt SP for service type X becomes Unhealthy Note over AG: Keep service type X<br/>in advertised list.<br/>Hold incoming requests<br/>in retry topic. AG->>MS: PUBLISH CloudEvent<br/>topic: dcm.agents.health<br/>{type: "service-type-degraded",<br/>agentId, serviceType, reason,<br/>affectedProvider} MS->>DCM: PUSH health warning - else Last SP for service type X becomes Unavailable + else SP for service type X becomes Unavailable AG->>DCM: POST /api/v1/agents<br/>
{updated serviceTypes without X} activate DCM DCM->>DB: Update agent registration @@ -895,20 +1041,22 @@ sequenceDiagram ##### Flow Description -1. The agent periodically polls each registered SP's `GET /health` endpoint -2. Based on the response, the agent updates the SP's health state: - - `200 OK` with `status: "healthy"` → **Ready** (failure counter reset) - - `200 OK` with `status: "unhealthy"` → **Unhealthy** - - Timeout or error → increment failure counter; if counter exceeds threshold - → **Unavailable** -3. When the last SP serving a given service type becomes **Unhealthy**: - - The agent **keeps** the service type in its advertised list (no update - sent to DCM) +1. The agent monitors each registered SP's health: + - **Embedded SPs:** health checked in-process (no network call) + - **External SPs:** health checked by periodically polling `GET /health` +2. Based on the result, the agent updates the SP's health state: + - Healthy → **Ready** (failure counter reset) + - Unhealthy → **Unhealthy** + - Timeout or error (external) / internal failure (embedded) → increment + failure counter; if counter exceeds threshold → **Unavailable** +3. When the SP for a service type becomes **Unhealthy**: + - The agent **keeps** the service type in its advertised list (no update sent + to DCM) - Incoming requests for that service type are held in the retry topic (see [Retry Topic](#retry-topic)) - The agent publishes a `service-type-degraded` health warning CloudEvent to the `dcm.agents.health` topic -4. When the last SP serving a given service type becomes **Unavailable**: +4. When the SP for a service type becomes **Unavailable**: - The agent removes the service type from its advertised list - The agent sends a `POST /api/v1/agents` to DCM with the updated registration @@ -917,15 +1065,16 @@ sequenceDiagram - The agent publishes a `service-type-unavailable` health warning CloudEvent to the `dcm.agents.health` topic 5. 
When a previously unhealthy or unavailable SP recovers: - - If the service type was removed (Unavailable case): the agent re-adds it - to its list and sends a `POST /api/v1/agents` to DCM with the updated + - If the service type was removed (Unavailable case): the agent re-adds it to + its list and sends a `POST /api/v1/agents` to DCM with the updated registration - The agent processes held requests from the retry topic for that service type -6. The agent exposes the health status of all registered SPs via the - `GET /api/v1/status` endpoint. On Kubernetes/OpenShift deployments, the agent - additionally surfaces this information as custom pod conditions on its own - pod (see [Pod Conditions](#pod-conditions-kubernetes--openshift)) +6. The agent exposes the health status of all registered SPs (both embedded and + external) via the `GET /api/v1/status` endpoint. On Kubernetes/OpenShift + deployments, the agent additionally surfaces this information as custom pod + conditions on its own pod (see + [Pod Conditions](#pod-conditions-kubernetes--openshift)) ### CloudEvent Message Definitions @@ -962,8 +1111,8 @@ target service type (see agent - The agent has outbound network connectivity to DCM's REST API (for registration and heartbeats) -- SPs have network connectivity to the agent's REST API (for registration and - health checks) +- External SPs have network connectivity to the agent's REST API (for + registration and health checks) - For Kubernetes/OpenShift deployments: the agent's service account has RBAC permissions for the `pods/status` subresource @@ -971,11 +1120,12 @@ target service type (see | Risk | Mitigation | | -------------------------------------------------------------- | 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| Agent is a single point of failure per environment | Deferred to HA iteration. Agent restart recovers state via SP re-registration (SPs periodically re-register, naturally rebuilding agent state). | +| Agent is a single point of failure per environment | Deferred to HA iteration. Agent restart recovers state: embedded SPs register internally at startup; external SPs periodically re-register, naturally rebuilding the agent's state. | | Messaging system failure blocks creation requests | Dependent on chosen bus technology's delivery guarantees. Stated as an assumption. | | Message loss with at-most-once semantics | Rely on bus capabilities (e.g., JetStream for NATS). Specific delivery guarantee is a deployment decision. | | Split-brain: agent loses DCM connectivity but keeps processing | On reconnection, the agent re-registers to DCM. During the split, DCM marks the agent as unavailable and stops routing new requests to its topic. In-flight messages are processed normally. Duplicate creation risk if DCM re-routes to another agent is mitigated by idempotent resource creation (resource ID provided by DCM in the creation request). | -| Unauthenticated SP registration | Deferred to AuthN/Z iteration. Network isolation is the interim mitigation. | +| Unauthenticated external SP registration | Deferred to AuthN/Z iteration. Network isolation is the interim mitigation. | +| Embedded SP crash takes down the agent | Embedded SPs run in-process; a panic/crash affects the entire agent. Mitigation: embedded SP code is well-tested and isolated in dedicated packages. Process-level restart recovers state via re-registration. 
| ## Drawbacks @@ -984,86 +1134,65 @@ target service type (see - Adds latency to the creation path: DCM → messaging system → agent → SP, versus the current DCM → SP direct call - Fragments health monitoring responsibility: DCM monitors agent health via - heartbeats, while the agent monitors SP health via polling + heartbeats, while the agent monitors SP health directly (in-process for + embedded SPs, via polling for external SPs) - Requires messaging system infrastructure accessible to both DCM and all target environments +- Embedding SP code (K8s Container, ACM Cluster, KubeVirt) increases agent + binary size and couples the agent release cycle to the embedded SPs for + updates ## Alternatives -### Alternative 1: Monolithic Agent with Embedded SPs +### Alternative 1: Watch / Reconcile Pattern #### Description -Instead of separating the agent and Service Providers into distinct processes, -the agent binary would ship with SP code for a known set of SPs (e.g., ACM, -KubeVirt, K8s). At startup, the agent would detect available CRDs or backing -infrastructure on the environment and activate only the relevant SP code. +Instead of using a messaging system for creation requests, DCM would expose +resource requests through its own API. The agent would poll DCM's API or be +notified by DCM of new events, discover pending resource requests targeting its +environment, and reconcile them by forwarding the creation request to the +relevant SP and reporting the result back to DCM. This mimics the Kubernetes +controller pattern (watch → reconcile) but with DCM acting as the API server +rather than a Kubernetes cluster. 
#### Pros -- Single binary to deploy, no REST registration ceremony between agent and SPs -- No health monitoring overhead between agent and SPs (they share a process) -- Simpler deployment and operational model +- Familiar pattern for teams experienced with Kubernetes controllers +- Could eliminate the messaging system dependency for creation requests +- DCM retains full visibility of pending requests (they live in DCM's own + storage, not in a bus topic) +- No additional infrastructure beyond DCM itself — the agent only needs + outbound connectivity to DCM's API, which it already has for registration and + heartbeats #### Cons -- Tightly couples the agent to a fixed, predefined set of SPs -- Cannot support custom or third-party SPs without rebuilding the agent binary -- Agent binary grows with each new SP type -- Requires agent rebuild and redeployment to add support for a new service type +- Requires DCM to implement watch/notification semantics natively, which adds + complexity to the control plane +- The messaging system is still required for status reporting (SP → bus → DCM), + so this does not fully eliminate the messaging infrastructure dependency +- Maturity of a DCM-native watch system is unproven compared to established + messaging systems (e.g., NATS JetStream) #### Status -Rejected +Deferred #### Rationale -The agent must support arbitrary SPs, including custom ones developed by third -parties. Tight coupling between the agent and SP code prevents this -extensibility. The plugin-style model (separate processes, REST registration) -allows any SP that implements the registration API to participate, regardless of -who develops or deploys it. - -### Alternative 2: etcd / CRD Watch Pattern - -#### Description - -Instead of using a messaging system for creation requests, DCM would create -Custom Resource (CR) manifests (e.g., `ResourceRequest`) directly in the target -cluster's etcd via the Kubernetes API. 
The agent would run as a Kubernetes -controller, watching for these CRs and reconciling them by forwarding the -creation request to the relevant SP. This follows the native Kubernetes -controller pattern. - -#### Pros - -- Native Kubernetes pattern, well-understood and battle-tested -- Leverages existing etcd for persistence and watch semantics, no separate - messaging infrastructure needed -- Built-in HA via Kubernetes controller framework (leader election, informer - caching) - -#### Cons - -- Requires DCM to have kubeconfig/API access to each target cluster, - reintroducing DCM-to-environment connectivity that this enhancement aims to - eliminate -- Does not work for non-Kubernetes environments (Docker, standalone, etc.) -- Pushes the connectivity requirement from the agent (outbound) to DCM (outbound - to every cluster) - -#### Status - -Rejected - -#### Rationale +The watch/reconcile pattern's main advantage is eliminating the messaging system +for creation requests and keeping all request state within DCM. However, the +messaging system is already required for status reporting (SP → bus → DCM), so +removing it for creation requests alone does not eliminate the infrastructure +dependency. -A core motivation of this enhancement is removing the need for -DCM-to-environment inbound connectivity for creation requests. The CRD watch -pattern requires DCM to push CRs to the target cluster's API server, -reintroducing that dependency. Additionally, this approach limits the agent to -Kubernetes-based environments, conflicting with the goal of supporting -non-cluster environments. +Additionally, DCM does not currently expose watch/notification semantics. +Building a reliable, scalable watch system into DCM requires further +investigation — particularly around delivery guarantees, fan-out to multiple +agents, and behaviour under network partitions. 
This is deferred to a future +iteration when the trade-offs are better understood and the maturity level of a +DCM-native watch system can be assessed. ## Cross-Cutting Impact @@ -1073,7 +1202,7 @@ PRs. | Document | Impact | | -------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [SP Registration Flow](../sp-registration-flow/sp-registration-flow.md) | SPs register to the agent instead of DCM. The existing registration API contract remains valid for the agent's REST API, but DCM's registration handler no longer receives SP registrations directly. | +| [SP Registration Flow](../sp-registration-flow/sp-registration-flow.md) | External SPs register to the agent instead of DCM. The existing registration API contract remains valid for the agent's REST API, but DCM's registration handler no longer receives SP registrations directly. Embedded SPs register internally and do not use this flow. | | [Service Provider Health Check](../service-provider-health-check/service-provider-health-check.md) | Health polling responsibility shifts from DCM to the agent. DCM monitors agent health via heartbeats instead of polling individual SPs. | | [SP Resource Manager](../sp-resource-manager/sp-resource-manager.md) | SPRM publishes creation requests to the agent's bus topic instead of calling SP REST endpoints directly. SPRM interacts with the agent (not individual SPs) for health status. From SPRM's perspective, the agent serves the same role as a SP: provisioning service types. | | [Placement Manager](../placement-manager/placement-manager.md) | Policy evaluation may now include environment as a selection criterion. 
Placement Manager delegates to SPRM, which routes through the messaging system. | From 970f9cd04aee9674c62cbcaff30b4f2a2d5a3cd4 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Mon, 22 Jun 2026 11:57:33 +0200 Subject: [PATCH 19/24] format Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 24 +++++++++---------- 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index ffcfbcd..ba6b775 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -79,8 +79,8 @@ Provider (SP) by a policy on the base of several criteria. Once the SP is selected, DCM will send a request to the selected SP to request the creation of the resource. -There is currently no way for a policy to determine in which environment an SP is -running and hence a user cannot explicitly set the targeted environment +There is currently no way for a policy to determine in which environment an SP +is running and hence a user cannot explicitly set the targeted environment constraint when requesting the creation of a resource. Furthermore, with the current way of submitting creation requests, the @@ -337,14 +337,14 @@ Example response: Register a new agent to DCM. 
-| Field | Type | Required | Description | -| ------------------ | -------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| name | string | yes | Unique agent name | -| environment | string | yes | Freeform environment identifier (e.g., `"dev"`, `"staging"`, `"prod-eu-west-1"`) | +| Field | Type | Required | Description | +| ------------------ | -------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| name | string | yes | Unique agent name | +| environment | string | yes | Freeform environment identifier (e.g., `"dev"`, `"staging"`, `"prod-eu-west-1"`) | | serviceTypes | string[] | yes | List of service types the agent can serve. Must be non-empty on initial registration (prerequisite: at least one healthy SP, embedded or external). May be empty on subsequent re-registrations when SPs become unavailable (an Unhealthy SP does not trigger service type removal — see [SP Health Monitoring](#sp-health-monitoring)). 
| -| resourcesAvailable | object | no | Available resources in the environment — sourced from K8s node info or manual configuration (see below) | -| cost | enum | yes | Cost tier: `low` \| `medium-low` \| `medium` \| `medium-high` \| `high` | -| topicName | string | yes | Deterministic topic name for the agent's messaging channel | +| resourcesAvailable | object | no | Available resources in the environment — sourced from K8s node info or manual configuration (see below) | +| cost | enum | yes | Cost tier: `low` \| `medium-low` \| `medium` \| `medium-high` \| `high` | +| topicName | string | yes | Deterministic topic name for the agent's messaging channel | Response: `201 Created` with `{agentId}` @@ -1120,7 +1120,7 @@ target service type (see | Risk | Mitigation | | -------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| Agent is a single point of failure per environment | Deferred to HA iteration. Agent restart recovers state: embedded SPs register internally at startup; external SPs periodically re-register, naturally rebuilding the agent's state. | +| Agent is a single point of failure per environment | Deferred to HA iteration. Agent restart recovers state: embedded SPs register internally at startup; external SPs periodically re-register, naturally rebuilding the agent's state. | | Messaging system failure blocks creation requests | Dependent on chosen bus technology's delivery guarantees. Stated as an assumption. | | Message loss with at-most-once semantics | Rely on bus capabilities (e.g., JetStream for NATS). Specific delivery guarantee is a deployment decision. 
| | Split-brain: agent loses DCM connectivity but keeps processing | On reconnection, the agent re-registers to DCM. During the split, DCM marks the agent as unavailable and stops routing new requests to its topic. In-flight messages are processed normally. Duplicate creation risk if DCM re-routes to another agent is mitigated by idempotent resource creation (resource ID provided by DCM in the creation request). | @@ -1162,8 +1162,8 @@ rather than a Kubernetes cluster. - Could eliminate the messaging system dependency for creation requests - DCM retains full visibility of pending requests (they live in DCM's own storage, not in a bus topic) -- No additional infrastructure beyond DCM itself — the agent only needs - outbound connectivity to DCM's API, which it already has for registration and +- No additional infrastructure beyond DCM itself — the agent only needs outbound + connectivity to DCM's API, which it already has for registration and heartbeats #### Cons From 42db2094c6b8984f27d045e4d98301226db42aa5 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Mon, 22 Jun 2026 16:16:11 +0200 Subject: [PATCH 20/24] Reword for clarity Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 35 +++++++++++-------- 1 file changed, 21 insertions(+), 14 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index ba6b775..46067d3 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -388,15 +388,19 @@ are rejected. #### Embedded SP Registration -At startup, the agent registers its configured embedded SPs internally. Each -embedded SP's code lives in a dedicated package within the agent codebase and is -enabled explicitly via a configuration field. The embedded SP code reaches the -agent's registration logic directly — no REST call is involved. 
- -If the agent's state is not clean (e.g., an external SP already holds a service -type slot from a prior session), the embedded SP registration for that service -type is rejected. The agent logs a warning and continues running — this is not a -fatal error. +Embedded SPs are not active by default. An administrator must explicitly enable +each embedded SP in the agent's configuration file. At startup, the agent +registers only the embedded SPs that are explicitly enabled in its +configuration. Each embedded SP's code lives in a dedicated package within the +agent codebase. The embedded SP code reaches the agent's registration logic +directly — no REST call is involved. + +If the agent restarts with a configuration change that newly enables an embedded +SP for a service type already occupied by an external SP (registered during a +prior session and still holding its slot), the embedded SP registration for that +service type is skipped. The agent logs a warning and continues starting +normally — this is not a fatal error. The external SP retains its slot until it +is explicitly deregistered or its lease expires. Because embedded SPs register at startup before external SPs can connect, they effectively take priority on a clean agent state. @@ -439,9 +443,9 @@ sequenceDiagram participant DCM as DCM
<br/>(Control Plane) participant DB as Database - Note over AG: Agent starts:<br/>register embedded SPs<br/>from configuration + Note over AG: Agent starts:<br/>register only explicitly<br/>enabled embedded SPs - AG->>AG: Register embedded SPs internally<br/>(K8s Container, ACM Cluster, KubeVirt<br/>— each if enabled in config) + AG->>AG: Register explicitly enabled<br/>embedded SPs internally<br/>(K8s Container, ACM Cluster, KubeVirt<br/>— only if enabled in config) Note over SP: External SP starts and<br/>
registers to the agent @@ -475,9 +479,12 @@ sequenceDiagram #### Flow Description -1. At startup, the agent registers its configured embedded SPs internally. Each - embedded SP claims a service type slot. If a slot is already occupied, the - agent logs a warning and continues +1. At startup, the agent registers only the embedded SPs that are explicitly + enabled in its configuration. Embedded SPs are not active by default — an + administrator must opt in via the agent's configuration file. Each enabled + embedded SP claims a service type slot. If a slot is already occupied (e.g., + by an external SP that persisted from a prior session), the agent logs a + warning and continues without registering that embedded SP 2. An external SP starts and registers to the agent via a REST API call, providing: - Name From 38baba72ca5386193bc44330dd590b0642ea0653 Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Mon, 22 Jun 2026 16:32:46 +0200 Subject: [PATCH 21/24] add section for SP registration Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 28 +++++++++++-------- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index 46067d3..77b4a28 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -424,7 +424,7 @@ ensures that after an agent restart (where the agent loses its in-memory state), SPs naturally re-register without requiring any additional coordination mechanism. 
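The registration rules described above (one SP per service type, `409 Conflict` for an occupied slot, idempotent periodic re-registration) can be sketched as follows. This is a minimal illustration in Python, not the agent's implementation: the `ProviderRegistry`, `Registration`, and `Outcome` names are hypothetical, state is kept in memory only, and matching a re-registering SP by its name is an assumption the proposal does not spell out.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    CREATED = "created"    # new service type slot claimed
    UPDATED = "updated"    # same SP re-registered (idempotent)
    CONFLICT = "conflict"  # would map to 409 Conflict on the REST API


@dataclass(frozen=True)
class Registration:
    name: str
    service_type: str
    endpoint: str


@dataclass
class ProviderRegistry:
    """Hypothetical in-memory stand-in for the agent's SP registrations."""

    by_service_type: dict[str, Registration] = field(default_factory=dict)

    def register(self, reg: Registration) -> tuple[Outcome, bool]:
        """Return the outcome and whether the service type list changed."""
        holder = self.by_service_type.get(reg.service_type)
        if holder is not None and holder.name != reg.name:
            # Slot held by a different SP: reject and keep the existing holder.
            return Outcome.CONFLICT, False
        # Create or update. A re-registration by the current holder only
        # refreshes the stored record (for example, a new endpoint).
        self.by_service_type[reg.service_type] = reg
        if holder is None:
            return Outcome.CREATED, True
        return Outcome.UPDATED, False

    def service_types(self) -> list[str]:
        """Service types the agent would advertise to DCM."""
        return sorted(self.by_service_type)
```

In this sketch, only a result with the second element set to `True` would lead the agent to send an updated `POST /api/v1/agents` to DCM. A periodic re-registration returns `Outcome.UPDATED` with `False` and triggers no notification, which is what lets SPs rebuild the agent's state after a restart without extra coordination.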
-#### DCM Notification +#### DCM Notification on Service Type Change When the list of supported service types changes as a result of an SP registration (embedded or external) and the agent is already registered to DCM, @@ -435,6 +435,12 @@ registration satisfies the prerequisite for the agent to proceed with its initial registration to DCM (see [Agent Registration Flow](#agent-registration-flow)). +#### SP Registration Flow + +The following diagram illustrates the complete SP registration flow, including +embedded SP startup, external SP registration, conflict handling, and DCM +notification: + ```mermaid sequenceDiagram autonumber @@ -485,8 +491,8 @@ sequenceDiagram embedded SP claims a service type slot. If a slot is already occupied (e.g., by an external SP that persisted from a prior session), the agent logs a warning and continues without registering that embedded SP -2. An external SP starts and registers to the agent via a REST API call, - providing: +2. An external SP starts and registers to the agent via + `POST /api/v1/providers`, providing: - Name - Service type it serves - Endpoint (URL where the agent can reach the SP) @@ -495,14 +501,14 @@ sequenceDiagram `409 Conflict` and a message identifying the conflicting provider - If **available**: the agent stores the SP registration and adds the service type to its supported list -4. If the service type list changed (new service type added): - - If the agent is already registered to DCM: the agent sends a - `POST /api/v1/agents` request to DCM with the full updated agent - registration; DCM updates the agent record in the database - - If the agent is not yet registered to DCM: the agent does not notify DCM - yet; instead, this SP registration satisfies the prerequisite for the - agent's initial registration (see - [Agent Registration Flow](#agent-registration-flow)) +4. 
If the service type list changed (new service type added), the agent notifies + the DCM Control Plane by sending `POST /api/v1/agents` with the full updated + agent registration (name, environment, supported service types, available + resources, cost, topic name); the DCM Control Plane updates the agent record + in the database and responds with `200 OK`. If the agent is not yet + registered to the DCM Control Plane, this step is deferred — the SP + registration instead satisfies the prerequisite for the agent's initial + registration (see [Agent Registration Flow](#agent-registration-flow)) 5. The agent acknowledges the SP registration 6. External SPs periodically re-register with the agent; the agent handles this idempotently (create or update). This ensures that after an agent restart, From a4d6dfd02af47f71c1efe9539f62bfd5b1d2befc Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Wed, 24 Jun 2026 15:01:41 +0200 Subject: [PATCH 22/24] Add and consolidate future enhancement section Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 124 ++++++++++++++++++ 1 file changed, 124 insertions(+) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index 77b4a28..6a7a14e 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -1225,3 +1225,127 @@ Additionally, DCM should monitor consumer lag on agent topics in a future iteration. If lag exceeds a configurable threshold, DCM could stop routing new requests to that agent to avoid further congestion. A new agent state (e.g., "Congested") could be introduced for this purpose. + +## Future Enhancements + +This section lists potential enhancements to the environment agent that are out +of scope for the initial implementation but are expected to be addressed in +future iterations. 
Items marked with **(consolidation)** are already referenced +elsewhere in this document; they are gathered here for visibility. + +### Agent High Availability (consolidation) + +The current design assumes a single agent instance per environment, making it a +single point of failure. A future iteration would support multiple agent +replicas for the same environment consuming from the same messaging topic as +competing consumers. This requires defining leader election or partitioned +consumption semantics, agent identity (shared name vs. unique replica IDs), +heartbeat coordination (each replica vs. a single heartbeat per environment), +and DCM-side handling of multiple registrations for the same environment. +Referenced in [Open Questions](#open-questions) (item 1), [Overview](#overview), +[Re-Registration on Restart](#re-registration-on-restart), and +[Risks and Mitigations](#risks-and-mitigations). + +### Hot-Reload of Agent Configuration (consolidation) + +Currently, changes to the agent's configuration (e.g., cost tier, enabled +embedded SPs) require a restart to take effect. A future iteration would allow +the agent to detect configuration changes at runtime — via file watchers, +environment variable polling, or Kubernetes ConfigMap updates — and propagate +them to DCM without downtime. Referenced in [Open Questions](#open-questions) +(item 2). + +### DCM-Side Handling of Queued Requests (consolidation) + +When the agent holds a request because the SP for a given service type is +unhealthy, it responds to DCM with a "queued" CloudEvent +(`dcm.agent.request-queued`). The DCM Control Plane does not yet define how it +surfaces this status to the end user, whether it applies a timeout, or whether +it re-evaluates policies to re-route the request to a different +agent/environment. A future iteration would formalize DCM's behavior for queued +requests, including user visibility, timeout semantics, and potential re-routing +strategies. 
Referenced in [Open Questions](#open-questions) (item 3). + +### Multiple SPs per Service Type (consolidation) + +The current design enforces a one-SP-per-service-type constraint. A future +iteration would allow multiple SPs to register for the same service type within +an agent, with selection strategies such as affinity-based routing (e.g., prefer +the SP closest to the data), capacity-based routing (e.g., least-loaded SP), or +round-robin. This would require defining a selection API, conflict resolution +semantics, and health-aware load balancing across SPs serving the same type. +Referenced in [Overview](#overview) and +[Service Type Uniqueness](#service-type-uniqueness). + +### Authentication and Authorization for SP Registration (consolidation) + +External SP registration is currently unauthenticated; the interim mitigation is +network isolation. A future iteration would introduce an authentication and +authorization layer for the agent's REST API (e.g., mTLS, API tokens, or OIDC), +ensuring that only authorized SPs can register. This also applies to the agent's +status endpoint and any future administrative APIs. Referenced in +[Risks and Mitigations](#risks-and-mitigations). + +### Watch/Reconcile Pattern for Creation Requests (consolidation) + +This is an alternative to the current messaging-based creation flow in which the +agent would poll DCM's API or receive notifications of pending resource requests +targeting its environment, then reconcile them Kubernetes-controller style. This +approach could eliminate the messaging system dependency for creation requests +(though it remains needed for status reporting) at the cost of implementing +watch/notification semantics in DCM. See +[Alternative 1: Watch / Reconcile Pattern](#alternative-1-watch--reconcile-pattern) +for the full analysis and rationale for deferral. 
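As a rough illustration of the watch/reconcile idea described above, the sketch below shows what a single poll-and-reconcile pass could look like. It is only a sketch under stated assumptions: `reconcile_once`, the `fetch_pending` callable (standing in for a DCM API that does not exist yet), the request fields (`id`, `service_type`, `spec`), and the outcome strings are all hypothetical and are not defined by this enhancement.

```python
from typing import Callable, Iterable


def reconcile_once(
    environment: str,
    fetch_pending: Callable[[str], Iterable[dict]],
    providers: dict[str, Callable[[dict], None]],
) -> dict[str, str]:
    """Run one reconcile pass and return a (hypothetical) outcome per request ID.

    fetch_pending stands in for polling DCM for pending requests that target
    this environment; providers maps a service type to the SP's create handler.
    """
    outcomes: dict[str, str] = {}
    for request in fetch_pending(environment):
        create = providers.get(request["service_type"])
        if create is None:
            # No SP registered for this service type: leave the request pending.
            outcomes[request["id"]] = "skipped"
            continue
        create(request["spec"])
        outcomes[request["id"]] = "created"
    return outcomes


if __name__ == "__main__":
    created_specs: list[dict] = []
    pending = [
        {"id": "r1", "service_type": "vm", "spec": {"cpu": 2}},
        {"id": "r2", "service_type": "db", "spec": {}},
    ]
    result = reconcile_once("dev", lambda env: pending, {"vm": created_specs.append})
    print(result, created_specs)
```

A real agent would presumably run such a pass on a timer or in response to notifications, and would still report status through the messaging system, as noted above.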
+ +### SP Hub/Store and Plugin SDK + +Instead of — or in addition to — the current manual registration process for +external SPs, a plugin system would make it simpler to discover, provision, and +register SPs. This enhancement has two complementary facets: + +- **Plugin SDK:** A formal interface and development kit for building SPs as + self-contained plugins. The SDK would define the SP contract (health endpoint, + creation/deletion handlers, status reporting), packaging format, and + versioning conventions. Third-party developers would build SPs against this + contract and distribute them as installable plugins. +- **SP Hub/Store:** A centralized, curated marketplace — similar to Helm Hub or + OperatorHub — where pre-built SP packages are published, versioned, and + discoverable. An administrator would browse the hub, select an SP, and the + agent would download, install, and register it automatically, reducing the + operational burden of deploying and configuring external SPs. + +Together, the plugin SDK and hub/store would lower the barrier for extending DCM +with new service types, accelerate SP adoption, and standardize the SP +development experience. + +### Resource Update Operations + +The current design defines creation and deletion flows through the agent but +does not address resource update operations (e.g., resizing a VM's CPU/memory, +scaling a container's replica count, upgrading a database version). A future +iteration would introduce an update/patch flow — a `dcm.request.update` +CloudEvent carrying the resource ID and a partial spec — routed through the +agent to the SP in the same manner as creation requests. This requires defining +update semantics (full replace vs. partial patch), validation rules, and how SPs +report intermediate states during the update. + +### Observability and Metrics + +The agent currently exposes SP health via the `/api/v1/status` endpoint and +Kubernetes pod conditions, but does not emit structured telemetry. 
A future +iteration would add agent-level observability: Prometheus metrics for request +throughput (by service type and outcome), SP health state transitions, retry +topic depth, and end-to-end message latency. Distributed tracing (e.g., +OpenTelemetry) across the DCM → messaging system → agent → SP path would help +diagnose latency and failures in the creation flow. + +### SP Lifecycle Management + +Updating an embedded or external SP currently requires manual intervention — +restarting the agent (for embedded SPs) or redeploying the external SP process. +A future iteration would introduce SP lifecycle management capabilities: +versioned SP upgrades, rolling updates that drain in-flight requests before +swapping to the new version, and the ability to run two versions of the same SP +side by side during a canary deployment. For embedded SPs, this could leverage +the plugin SDK (see [SP Hub/Store and Plugin SDK](#sp-hubstore-and-plugin-sdk)) +to load new versions without rebuilding the agent binary. From 8610595c1db71bdab8524b85dd0873d8ab39971b Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Wed, 24 Jun 2026 15:24:36 +0200 Subject: [PATCH 23/24] Add declaratuve apply to future enhancement Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 25 +++++++++++-------- 1 file changed, 15 insertions(+), 10 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index 6a7a14e..6172ffe 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -1318,16 +1318,21 @@ Together, the plugin SDK and hub/store would lower the barrier for extending DCM with new service types, accelerate SP adoption, and standardize the SP development experience. 
-### Resource Update Operations - -The current design defines creation and deletion flows through the agent but -does not address resource update operations (e.g., resizing a VM's CPU/memory, -scaling a container's replica count, upgrading a database version). A future -iteration would introduce an update/patch flow — a `dcm.request.update` -CloudEvent carrying the resource ID and a partial spec — routed through the -agent to the SP in the same manner as creation requests. This requires defining -update semantics (full replace vs. partial patch), validation rules, and how SPs -report intermediate states during the update. +### Declarative Apply Semantics and Resource Updates + +The current design uses separate CloudEvent types for creation +(`dcm.request.create`) and deletion (`dcm.request.delete`). A future iteration +would replace `dcm.request.create` with a declarative `dcm.request.apply` type +that carries the desired spec for a resource. The SP would determine whether the +resource already exists and perform the appropriate action (create or update), +reporting the outcome back to the agent (`resource-created` or +`resource-updated`). This approach unifies creation and update operations under +a single message type, eliminating the need for a separate `dcm.request.update`. +Deletion would remain a distinct `dcm.request.delete` type since it carries no +spec. Adopting apply semantics requires defining: SP API support for upsert +(create-or-update), update semantics (full replace vs. partial patch), field +mutability rules, and how SPs report intermediate states during updates (e.g., +an `UPDATING` status). 
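The create-or-update behavior described above can be sketched as follows. This is a minimal illustration, not a proposed SP API: `InMemoryServiceProvider`, `handle_event`, the event field names (`resource_id`, `spec`), and the `resource-deleted` outcome are hypothetical, and the sketch assumes full-replace update semantics, which is one of the open choices listed above.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryServiceProvider:
    """Hypothetical SP that keeps resources in a dict instead of a real backend."""

    resources: dict = field(default_factory=dict)

    def apply(self, resource_id: str, spec: dict) -> str:
        """Upsert a resource and return the outcome to report to the agent."""
        existed = resource_id in self.resources
        # Assumption: full replace. A partial patch would merge spec instead.
        self.resources[resource_id] = dict(spec)
        return "resource-updated" if existed else "resource-created"

    def delete(self, resource_id: str) -> str:
        """Remove a resource; the outcome name here is an assumption."""
        self.resources.pop(resource_id, None)
        return "resource-deleted"


def handle_event(sp: InMemoryServiceProvider, event: dict) -> str:
    """Dispatch a CloudEvent-like dict by its type."""
    data = event["data"]
    if event["type"] == "dcm.request.apply":
        return sp.apply(data["resource_id"], data["spec"])
    if event["type"] == "dcm.request.delete":
        # Deletion carries no spec, so it stays a distinct event type.
        return sp.delete(data["resource_id"])
    raise ValueError(f"unsupported event type: {event['type']}")


if __name__ == "__main__":
    sp = InMemoryServiceProvider()
    apply_event = {
        "type": "dcm.request.apply",
        "data": {"resource_id": "vm-1", "spec": {"cpu": 2}},
    }
    print(handle_event(sp, apply_event))
    apply_event["data"]["spec"] = {"cpu": 4}
    print(handle_event(sp, apply_event))
```

Sending the same `dcm.request.apply` event twice is safe in this sketch: the first call creates the resource and later calls replace its spec, which is the property that makes a separate `dcm.request.update` type unnecessary.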
### Observability and Metrics From c219d9e8463be4bd760bf01968ae65aa6a5aa0fe Mon Sep 17 00:00:00 2001 From: gabriel-farache Date: Wed, 24 Jun 2026 15:28:01 +0200 Subject: [PATCH 24/24] =?UTF-8?q?Remove=20op=C3=AAn=20questions?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: gabriel-farache --- .../environment-agent/environment-agent.md | 16 ---------------- 1 file changed, 16 deletions(-) diff --git a/enhancements/environment-agent/environment-agent.md b/enhancements/environment-agent/environment-agent.md index 6172ffe..3af9265 100644 --- a/enhancements/environment-agent/environment-agent.md +++ b/enhancements/environment-agent/environment-agent.md @@ -20,22 +20,6 @@ see-also: # Environment Agent -## Open Questions - -1. Can multiple agent replicas consume from the same topic for high - availability? (deferred to HA iteration) -2. How does an administrator update the agent's cost tier without restarting it? - **Proposed resolution:** The administrator updates the agent's configuration - (config file, environment variable, or ConfigMap on Kubernetes). The agent - detects the change and sends a `POST /api/v1/agents` to DCM with the updated - cost tier — the same mechanism used when the supported service types list - changes. **This solution is deferred to later version: in the current - version, a restart will be needed for the change in the cost tier to be - propagated (via [Agent Registration Flow](#agent-registration-flow) )** -3. How does DCM handle the "queued" CloudEvent response - (`dcm.agent.request-queued`)? Does it expose the status to the user, set a - timeout, or re-evaluate policies? (deferred to DCM-side design) - ## Terminology - **Agent:** A lightweight process that runs in a target environment, acting as