Environment Agent Enhancement by gabriel-farache · Pull Request #53 · dcm-project/enhancements

gabriel-farache · 2026-06-03T19:16:25Z

Summary

Adds the Environment Agent enhancement defining a new agent layer between DCM and Service Providers
The agent runs per-cluster/environment, registers to DCM with environment metadata, and routes creation requests via a messaging system (Kafka, NATS)
Service Providers register directly to the agent (not DCM), each serving a single resource type
Covers: agent registration, resource creation flow, SP-to-agent registration, agent heartbeat, and SP health monitoring using the three-state model

Flows Defined

Agent Registration: Agent creates a messaging topic, registers to DCM with name, environment, resource types, cost, and topic name
Resource Creation: DCM publishes to agent topic → agent validates and routes to relevant SP
SP Registration: SPs register to agent via REST; agent dynamically maintains supported resource types and updates DCM
Agent Health: Periodic REST heartbeats from agent to DCM; DCM marks agent unavailable if threshold exceeded
SP Health Monitoring: Agent polls SP /health endpoints; removes resource types when last SP becomes unhealthy/unavailable
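
A minimal sketch of the SP Registration and SP Health Monitoring flows as summarized above, not the enhancement's actual implementation. The class and method names are hypothetical, and the `HEALTHY` state name is an assumption: the summary only names "unhealthy" and "unavailable" for the three-state model.

```python
from enum import Enum


class Health(Enum):
    # Assumed names for the three-state model; only "unhealthy" and
    # "unavailable" appear in the PR summary.
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    UNAVAILABLE = "unavailable"


class AgentRegistry:
    """Hypothetical in-memory view of the SPs registered to one agent."""

    def __init__(self):
        # resource type -> {SP name -> last observed health}
        self.providers = {}

    def register_sp(self, sp_name, resource_type):
        # SP Registration: a newly registered SP is assumed to start healthy.
        self.providers.setdefault(resource_type, {})[sp_name] = Health.HEALTHY

    def record_health(self, sp_name, resource_type, health):
        # SP Health Monitoring: store the result of polling the SP's /health.
        self.providers[resource_type][sp_name] = health

    def supported_resource_types(self):
        # Resource types the agent would report to DCM: a type is dropped
        # once its last SP is no longer healthy.
        return sorted(
            rt for rt, sps in self.providers.items()
            if any(h is Health.HEALTHY for h in sps.values())
        )
```

With two SPs registered for the same resource type, the type stays in the list until both are unhealthy or unavailable.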

🤖 Generated with Claude Code

machacekondra · 2026-06-05T11:33:03Z

+    class Target_Environment clusterEnvironment
+```
+
+#### Flow Description


I am just wondering whether we really need to keep the Agent and the SPs separate.
I was also thinking about a solution where the agent ships the SP code itself: say, an OpenShift agent that provides SP plugins for ACM, k8s, KubeVirt, ... When deployed, the agent checks what the OpenShift cluster supports (based on CRs) and then registers itself as supporting those specific service types.

The SP code would be part of the agent code, and "activated" only if the corresponding SP is detected in the environment.

WDYT?
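
A rough sketch of the "activate only if detected" idea, assuming the agent can list the CRDs installed on its cluster. The plugin table, the function name, and the mapping from service types to CRD names are illustrative assumptions, not part of the proposal.

```python
# Hypothetical table: service type -> CRDs that must be present for the
# embedded plugin to be activated. The CRD names are illustrative.
EMBEDDED_PLUGINS = {
    "vm": {"virtualmachines.kubevirt.io"},
    "cluster": {"managedclusters.cluster.open-cluster-management.io"},
    "container": set(),  # plain k8s workloads, assumed always available
}


def active_service_types(installed_crds):
    """Return the service types whose required CRDs are all installed."""
    return sorted(
        service_type
        for service_type, required in EMBEDDED_PLUGINS.items()
        if required <= installed_crds  # subset test
    )
```

On a cluster with only KubeVirt installed, `active_service_types({"virtualmachines.kubevirt.io"})` would yield the container and vm types, and the agent would register with those.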

> a solution where the agent provide a code of the SPs, let's say openshift agent which provides SP plugins for ACM,k8s,kubevirt,

Not sure I understand what you mean here.

> But the SPs code will be part of agent code, and "activated" only if SP is detected on environment.

Again, not sure I get what you suggest: are you suggesting that the current SP codebase be merged into the agent codebase and then automatically enabled when the infrastructure needed by the SP's code is present on the cluster? For example, if the agent detects that ACM is installed, it would automatically support the cluster resource type and route the request to the ACM SP code that it holds internally?

The idea was to keep the agent small and reuse the bricks we already have. It could make sense to ship an agent with some supported resource types, but it must be kept open for custom SPs to register to the agent.
Not sure about the maintenance cost of such bundling.

Based on my understanding, I added an alternate option, which I rejected, in bd4de25.
I marked it as rejected for now as I am not sure what you meant.

That's exactly what I meant. I don't think the codebase would be complicated at all. There are projects that do the same, and even keep server and agent in the same codebase, and they are really great; see:
https://github.com/hashicorp/nomad/tree/main/drivers

The advantage of such a solution is simplicity. No need to run additional things; it's just one bundled agent, which you can deploy anywhere.

Honestly, I was even thinking that those "service providers" could just be Ansible playbooks, or anything similar. When the control plane asks you to create a "VM", you spawn an Ansible playbook to deploy the VM, so you don't need to maintain idempotent code to handle everything around VM provisioning.

In that case the agent would just "monitor" the state and execute workers for specific jobs.

With this approach, aren't we killing the flexibility of letting a user bring and plug in their own SP?
I am fine with embedding some SPs in the agent's code, but I believe we should keep the "bring your own SP" feature, unless we want to remove it.

Embedded SPs will have to be tagged as internal to skip the heartbeat part; the resource status reporting will remain.
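
One way the "internal" tag could look, as a sketch only. The `internal` field name and the `providers_to_poll` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceProvider:
    name: str
    service_type: str
    internal: bool = False  # assumed flag: True for SPs embedded in the agent


def providers_to_poll(providers):
    """Return only external SPs; embedded ones skip the health polling."""
    return [sp for sp in providers if not sp.internal]
```

Status reporting for resources would not look at the flag at all, so it stays the same for embedded and external SPs.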

machacekondra · 2026-06-05T11:38:00Z

+5. DCM persists the registration in the database
+6. DCM acknowledges the registration
+
+### Resource Creation Flow


I like this resource creation flow, but do you think it should be a separate enhancement, where you would also weigh it against a flow using an etcd+watch mechanism (which would give us a high-availability store for free)?

> Where you will also challenge a flow using etcd+watch mechanism (which would give us high-availability store for free)

You mean that we would introduce new CRDs that would be created by DCM? The main idea here was to remove the link from DCM to the environment/cluster and have the agent poll requests from somewhere, either directly from DCM or via the bus.

Or did you mean that DCM would create the manifest on its own cluster and then the agent would watch DCM's cluster?

Based on my understanding, I added an alternate option, which I rejected, in bd4de25.
I marked it as rejected for now as I am not sure what you meant.

> Or did you mean that DCM would create the manifest on its own cluster and then the agent would watch DCM's cluster?

Yes, this is what I meant. Basically the same mechanism Kubernetes uses. We would have a control-plane store, and the agent would poll for changes. There would be no need for the message bus.
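
To make the "control-plane store plus polling agent" idea concrete, here is an in-memory sketch. It does not use etcd or the Kubernetes API; `ControlPlaneStore`, `PollingAgent`, and the revision counter are hypothetical stand-ins for whatever store and watch semantics would actually be chosen.

```python
class ControlPlaneStore:
    """Hypothetical store: every write gets a monotonically increasing revision."""

    def __init__(self):
        self._revision = 0
        self._log = []  # list of (revision, environment, request)

    def put(self, environment, request):
        self._revision += 1
        self._log.append((self._revision, environment, request))
        return self._revision

    def changes_since(self, environment, revision):
        # Everything written for this environment after the given revision.
        return [
            (rev, req) for rev, env, req in self._log
            if env == environment and rev > revision
        ]


class PollingAgent:
    def __init__(self, store, environment):
        self.store = store
        self.environment = environment
        self.last_seen = 0  # highest revision already processed
        self.handled = []

    def poll_once(self):
        changes = self.store.changes_since(self.environment, self.last_seen)
        for rev, request in changes:
            self.handled.append(request)
            self.last_seen = rev
        return len(changes)
```

The agent only needs outbound access to the store, and remembering `last_seen` lets it resume after a restart without reprocessing older requests.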

That would mean the agents run with credentials for DCM's cluster.
That means:

1. The agent's admin must have a way to create/use a Service Account (and its token) on DCM's cluster. If the token is rotated or expires, the agent will stop working, so this must be highlighted in the maintenance doc.

2. DCM has to run on a cluster; it cannot be run as a simple application or outside a K8s-based cluster.

For point 1, I guess that's OK: the agents will have to authN/Z to DCM at some point anyway, so sharing the SA token and setting it in the agent's conf is OK for me.

For point 2, I am not sure DCM always has to run on a K8s-based cluster, as it's the application managing the datacenter. Do we know if that's a prerequisite? This is not something we can easily change later, so we have to be sure that we will always expect users to run DCM on a K8s-based cluster.

gabriel-farache · 2026-06-09T12:40:28Z

Yes, I do not currently see a use case where a single SP supports two different resource types.

It's one of our requirements to let an SP support multiple resource types. Just as an example, see #43.
Also, the correct naming is service type.

@gciavarrini in 52642f5 and 8d2862d I reworded the sentences.

machacekondra · 2026-06-11T09:18:18Z

+Changing how creation requests are consumed by giving the initiative to the
+agent would solve this problem: the agent pulls work from a messaging system,
+removing the need for DCM-to-environment inbound connectivity for creation
+requests. The agent still requires outbound connectivity to DCM for registration


Just wondering: if all of the communication goes via the message bus, can the registration and heartbeat go via the bus as well?

I think in that case there would be only communication via the bus, and no direct communication between the agent and the control plane.

We can go full bus, yes.
I kept those direct links to DCM because they confirm DCM is still running: the REST request provides immediate feedback.

If we care, from the agent's PoV, about DCM's response to the registration request, we would need to define a flow where DCM sends an ack message of some sort to the agent's topic to give it feedback (e.g. registration failure).
Not sure we would need it for the heartbeat, as the "connection" is already established and validated. And if DCM is down for some time, the bus buffering mechanism will make sure that no message is lost.
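
A sketch of what such an ack flow could look like, using an in-memory stand-in for the bus rather than a real Kafka or NATS client. The topic names (`dcm.registrations`, `agent.agent-1`), the message fields, and the acceptance rule are all assumptions for illustration.

```python
from collections import defaultdict, deque


class Bus:
    """In-memory stand-in for the messaging system: one FIFO queue per topic."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, topic):
        queue = self.topics[topic]
        return queue.popleft() if queue else None


def dcm_handle_registration(bus, known_environments):
    """DCM side: consume one registration and ack it on the agent's topic.

    Returns False when there was nothing to consume.
    """
    message = bus.consume("dcm.registrations")
    if message is None:
        return False
    accepted = message["environment"] in known_environments
    bus.publish(message["reply_topic"], {
        "type": "registration_ack",
        "accepted": accepted,
        "reason": None if accepted else "unknown environment",
    })
    return True
```

The agent would publish its registration with its own topic as `reply_topic`, then read the ack from that topic to learn whether the registration failed.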

Should I make the change? @dcm-project/team-dcm what do we think about completely removing the direct connection between DCM and agent/SP and using the bus system instead? Meaning we go full async.

Yeah, I guess the question is whether there is support for "ack"; I know there is in NATS, not sure about Kafka.

I am leaning towards sync for the registration/re-registration/updates because these would mostly be called at low frequency and DCM can get the information in real time (e.g. the updated list) instead of having delays while waiting to consume this info from the bus. For example, forwarding a resource creation to an SP that no longer supports a service type before DCM gets the updated list info. Similarly with the heartbeat, we'll need to know immediately when an agent/SP is down to avoid delegating a workload to it.
For simplicity's sake, we could go sync for now and then re-evaluate later.
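
For reference, the threshold check behind "DCM marks agent unavailable if threshold exceeded" could be as small as the sketch below. The class name, the seconds-based threshold, and passing `now` explicitly are assumptions made to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Hypothetical DCM-side tracker of the last heartbeat per agent."""

    def __init__(self, threshold_seconds):
        self.threshold_seconds = threshold_seconds
        self._last_seen = {}  # agent name -> timestamp of last heartbeat

    def heartbeat(self, agent, now):
        self._last_seen[agent] = now

    def unavailable(self, now):
        # Agents whose last heartbeat is older than the threshold.
        return sorted(
            agent for agent, seen in self._last_seen.items()
            if now - seen > self.threshold_seconds
        )
```

Whether heartbeats arrive over REST or the bus, this check is the same; the difference discussed here is only how quickly DCM sees a missing heartbeat.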

> For example, forwarding a resource creation to an SP that no longer supports a service type before DCM gets the updated list info.

If an agent does not support a service type when a creation request is received/read from the bus, it will reject it and DCM will receive the rejection message and will have to re-evaluate
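
A sketch of that reject-and-re-evaluate loop. In the proposal the exchange goes through the bus asynchronously; here it is collapsed into direct function calls for brevity. The status strings, the message fields, and the `place` helper are hypothetical.

```python
def handle_creation(request, supported_service_types):
    """Agent side: reject requests for service types it no longer supports."""
    if request["service_type"] not in supported_service_types:
        return {
            "resource_id": request["resource_id"],
            "status": "rejected",
            "reason": "unsupported service type",
        }
    return {"resource_id": request["resource_id"], "status": "accepted"}


def place(request, candidate_agents):
    """DCM side: try candidates in order, re-evaluating after each rejection.

    candidate_agents is a list of (agent name, supported service types).
    Returns the name of the agent that accepted, or None.
    """
    for agent_name, supported in candidate_agents:
        response = handle_creation(request, supported)
        if response["status"] != "rejected":
            return agent_name
    return None
```

A stale view of the supported list therefore costs DCM one extra round trip rather than a lost request.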

> Similarly with the heartbeat, we'll need to know immediately when an agent/SP is down to avoid delegating a workload to it.

If the agent is down, it will never send a response message on the bus, so DCM should have a routine that waits for messages to be acknowledged for some time and, upon expiration, re-evaluates.
If the SP is unhealthy, the agent will send a message with queued status (the agent waits for the SP to become healthy again before processing the request). If DCM does not want to wait, it can send a deletion message (this removes the waiting creation request) and then re-evaluate.
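
The DCM-side routine described above could be sketched as follows. The class name, the single `ack_timeout` value, and the status strings other than "queued" are assumptions; timestamps are passed in explicitly so the example is deterministic.

```python
class PendingRequests:
    """Hypothetical DCM-side tracker for creation requests awaiting a response."""

    def __init__(self, ack_timeout):
        self.ack_timeout = ack_timeout
        self._deadlines = {}  # resource ID -> time by which a response is due
        self.queued = set()   # resource IDs the agent reported as queued

    def sent(self, resource_id, now):
        self._deadlines[resource_id] = now + self.ack_timeout

    def on_response(self, message):
        # Any response proves the agent is alive, so the deadline is cleared.
        resource_id = message["resource_id"]
        self._deadlines.pop(resource_id, None)
        if message["status"] == "queued":
            self.queued.add(resource_id)
        else:
            self.queued.discard(resource_id)

    def expired(self, now):
        # Requests with no response in time: candidates for re-evaluation.
        return sorted(
            rid for rid, deadline in self._deadlines.items() if deadline <= now
        )
```

DCM would re-evaluate everything in `expired(now)`, and for IDs in `queued` it could either keep waiting or send a deletion message and re-evaluate.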

> Yeah, I guess the question is whether there is support for "ack"; I know there is in NATS, not sure about Kafka.

The ack is done by the agent by sending a message to the responses topic with the resourceID and the status of the resource

The main advantage of async is that if DCM is unavailable for a short period of time, agents will still consider themselves registered to DCM, as DCM will not lose any message and will be able to process the buffered messages when it is ready again. The agent does no work until DCM gives it some, so even if DCM is down for good, the agent will not do anything harmful. If we need the agent to get feedback from DCM, DCM can send a message on the agent's topic acking the registration/heartbeat.

But we can delay the full async (registration and heartbeat) to a subsequent PR, once the changes introduced by this concept are reflected in the other enhancement files (and maybe even implemented, to see if there is any challenge that we did not anticipate).

gabriel-farache · 2026-06-19T14:29:07Z

@machacekondra @jenniferubah in b442c56, as discussed, I updated the relevant parts to reflect the decisions.

Define the environment agent layer that sits between DCM and Service Providers. The agent runs per-cluster, registers to DCM with environment metadata, and routes creation requests via a messaging system. SPs register to the agent (not DCM directly), each serving a single resource type. Includes agent registration, resource creation, SP registration, agent heartbeat, and SP health monitoring flows.

Assisted by: Claude Code - claude-opus-4-6
Signed-off-by: gabriel-farache <gfarache@redhat.com>


…r etcd watch

Integrate embedded SPs (K8s Container, ACM Cluster, KubeVirt) into the main proposal alongside external "bring your own" SPs. Enforce a global constraint of one SP per service type with 409 Conflict rejection for duplicates. Change etcd/CRD Watch alternative from Rejected to Deferred pending investigation of DCM-native watch semantics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: gabriel-farache <gfarache@redhat.com>
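
A sketch of the one-SP-per-service-type constraint from this commit message. Only the 409 Conflict on duplicates comes from the commit message; the 201 on success, treating a re-registration by the same SP as success, and all names are assumptions.

```python
from http import HTTPStatus


class ServiceTypeRegistry:
    """Hypothetical agent-side registry enforcing one SP per service type."""

    def __init__(self):
        self._owner = {}  # service type -> name of the SP serving it

    def register(self, sp_name, service_type):
        current = self._owner.get(service_type)
        if current is not None and current != sp_name:
            # A different SP already serves this type: reject the duplicate.
            return HTTPStatus.CONFLICT, f"{service_type!r} is already served by {current!r}"
        # First registration, or the same SP registering again (assumed OK).
        self._owner[service_type] = sp_name
        return HTTPStatus.CREATED, "registered"
```

Embedded and external SPs would go through the same check, so a "bring your own" SP cannot silently shadow an embedded one.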


jenniferubah

Looks good

gabriel-farache marked this pull request as ready for review June 4, 2026 12:04

gabriel-farache requested review from Fale, croadfeldt, gciavarrini, jenniferubah, machacekondra, pkliczewski and ygalblum as code owners June 4, 2026 12:04

jenniferubah reviewed Jun 4, 2026

