Environment Agent Enhancement #53

Open
gabriel-farache wants to merge 21 commits into dcm-project:main from gabriel-farache:enhancement/environment-agent

@gabriel-farache

Collaborator

Summary

  • Adds the Environment Agent enhancement defining a new agent layer between DCM and Service Providers
  • The agent runs per-cluster/environment, registers to DCM with environment metadata, and routes creation requests via a messaging system (Kafka, NATS)
  • Service Providers register directly to the agent (not DCM), each serving a single resource type
  • Covers: agent registration, resource creation flow, SP-to-agent registration, agent heartbeat, and SP health monitoring using the three-state model

Flows Defined

  • Agent Registration: Agent creates a messaging topic, registers to DCM with name, environment, resource types, cost, and topic name
  • Resource Creation: DCM publishes to agent topic → agent validates and routes to relevant SP
  • SP Registration: SPs register to agent via REST; agent dynamically maintains supported resource types and updates DCM
  • Agent Health: Periodic REST heartbeats from agent to DCM; DCM marks agent unavailable if threshold exceeded
  • SP Health Monitoring: Agent polls SP /health endpoints; removes resource types when last SP becomes unhealthy/unavailable
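The SP health monitoring flow above can be sketched as follows. This is a minimal illustration, not the agent's implementation: the class and method names, the state names in `Health`, and the SP names are all assumptions made for the example. It only shows the rule that a resource type stays advertised while at least one SP serving it is healthy.

```python
from enum import Enum


class Health(Enum):
    # Assumed names for the three-state model; the enhancement may use others.
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    UNAVAILABLE = "unavailable"


class AgentRegistry:
    """Tracks registered SPs and derives the resource types the agent supports."""

    def __init__(self):
        self._sps = {}  # SP name -> (resource type, last known health)

    def register_sp(self, name, resource_type):
        # Assumption: a newly registered SP counts as healthy until a poll says otherwise.
        self._sps[name] = (resource_type, Health.HEALTHY)

    def record_poll(self, name, health):
        # Called with the result of polling the SP's /health endpoint.
        resource_type, _ = self._sps[name]
        self._sps[name] = (resource_type, health)

    def supported_types(self):
        # A type stays advertised while at least one SP serving it is healthy.
        return sorted({rt for rt, h in self._sps.values() if h is Health.HEALTHY})


registry = AgentRegistry()
registry.register_sp("kubevirt-a", "vm")
registry.register_sp("kubevirt-b", "vm")
registry.register_sp("acm", "cluster")

registry.record_poll("kubevirt-a", Health.UNHEALTHY)
after_first_failure = registry.supported_types()  # "vm" is still served by kubevirt-b

registry.record_poll("kubevirt-b", Health.UNAVAILABLE)
after_last_failure = registry.supported_types()  # the last "vm" SP is gone
```

In a real agent, a change in `supported_types()` would presumably trigger the update to DCM described in the SP Registration flow.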

🤖 Generated with Claude Code

Comment thread enhancements/environment-agent/environment-agent.md Outdated
class Target_Environment clusterEnvironment
```

#### Flow Description

Collaborator

I am just wondering whether we really need to keep the Agent and the SPs separate.
I was also thinking about a solution where the agent ships the code of the SPs: say, an OpenShift agent that provides SP plugins for ACM, k8s, KubeVirt, ... When deployed, the agent checks what the OpenShift cluster supports (based on CRs) and then registers itself as supporting those specific service types.

The SP code would be part of the agent code, and "activated" only if the SP is detected in the environment.

WDYT?

Collaborator Author

> a solution where the agent provide a code of the SPs, let's say openshift agent which provides SP plugins for ACM,k8s,kubevirt,

Not sure I understand what you mean here.

> But the SPs code will be part of agent code, and "activated" only if SP is detected on environment.

Again, I am not sure I get what you are suggesting. Are you suggesting that the current SP codebase be merged into the agent codebase and then be enabled automatically when the infrastructure needed by the SP's code is present on the cluster? For example, if the agent detects that ACM is installed, it automatically supports the cluster resource type and routes the request to the ACM SP code that it holds internally?

The idea was to keep the agent small and reuse the bricks we already have. It could make sense to ship the agent with some supported resource types, but it must stay open for custom SPs to register to the agent.
I am not sure about the maintenance cost of such bundling.

Collaborator Author

Based on my understanding, I added an alternative option in bd4de25.
I marked it as rejected for now, as I am not sure what you meant.

Collaborator

That's exactly what I meant. I don't think the codebase would be complicated in any way. There are projects that do the same, even keeping server and agent in the same codebase, and they are really great; see:
https://github.com/hashicorp/nomad/tree/main/drivers

The advantage of such a solution is simplicity. There is no need to run additional things; it's just one bundled agent, which you can deploy anywhere.

Collaborator

Honestly, I was even thinking that those "service providers" could just be Ansible playbooks, or anything similar. When the control plane asks you to create a "VM", you spawn an Ansible playbook to deploy the VM, so you don't need to maintain idempotent code to handle everything around VM provisioning.

In that case, the agent would just "monitor" the state and execute workers for specific jobs.

Collaborator Author

With this approach, aren't we killing the flexibility of letting users bring and plug in their own SP?
I am fine with embedding some SPs in the agent's code, but I believe we should keep the "bring your own SP" feature, unless we want to remove it.

Embedded SPs will have to be tagged as internal so that the heartbeat part is skipped; the resource status reporting will remain.

5. DCM persists the registration in the database
6. DCM acknowledges the registration

### Resource Creation Flow

@machacekondra machacekondra Jun 5, 2026

Collaborator

I like this resource creation flow, but do you think it should be a separate enhancement? There you could also challenge it against a flow using an etcd+watch mechanism (which would give us a high-availability store for free).

Collaborator Author

> Where you will also challenge a flow using etcd+watch mechanism (which would hive us high-availability store for free)

You mean that we would introduce new CRDs that would be created by DCM? The main idea here was to remove the link from DCM to the environment/cluster and have the agent poll the requests from somewhere, either directly from DCM or via the bus.

Or did you mean that DCM would create the manifest on its own cluster and the agent would then watch DCM's cluster?

Collaborator Author

Based on my understanding, I added an alternative option in bd4de25.
I marked it as rejected for now, as I am not sure what you meant.

Collaborator

> Or did you mean that DCM would create the manifest on its own cluster and then the agent would watches DCM's cluster?

Yes, this is what I meant. It is basically the same mechanism Kubernetes uses. We would have a control-plane store, and the agent would poll for changes. There would be no need for the message bus.
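To make the "control-plane store plus polling agent" idea concrete, here is a minimal in-memory sketch. It does not use the etcd or Kubernetes APIs; `ControlPlaneStore`, `PollingAgent`, and the revision counter are hypothetical stand-ins that only illustrate how an agent could resume from the last revision it has seen.

```python
class ControlPlaneStore:
    """In-memory stand-in for a control-plane store with increasing revisions."""

    def __init__(self):
        self._events = []  # list of (revision, request)

    def put(self, request):
        revision = len(self._events) + 1
        self._events.append((revision, request))
        return revision

    def changes_since(self, revision):
        # Return every event strictly newer than the caller's last seen revision.
        return [event for event in self._events if event[0] > revision]


class PollingAgent:
    """Agent that pulls new requests instead of receiving them over a bus."""

    def __init__(self, store):
        self._store = store
        self._last_seen = 0
        self.handled = []

    def poll_once(self):
        for revision, request in self._store.changes_since(self._last_seen):
            self.handled.append(request)
            self._last_seen = revision


store = ControlPlaneStore()
agent = PollingAgent(store)
store.put({"type": "vm", "name": "vm-1"})
agent.poll_once()
store.put({"type": "cluster", "name": "c-1"})
agent.poll_once()
agent.poll_once()  # nothing new, so nothing is handled twice
```

The sketch leaves out the points raised in the follow-up reply, in particular how the agent authenticates to the store.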

Collaborator Author

That would mean the agents run with credentials to DCM's cluster.
That means:

  1. The agent's admin must have a way to create/use a Service Account (and its token) on DCM's cluster. If the token is rotated or expires, the agent will stop working, so this must be highlighted in the maintenance doc.

  2. DCM has to run on a cluster; it cannot be run as a simple application or outside a K8s-based cluster.

For point 1, I guess that's OK: the agents will have to authN/Z to DCM at some point, so sharing the SA token and setting it in the agent's configuration is OK for me.

For point 2, I am not sure whether DCM always has to run on a K8s-based cluster, as it is the application managing the datacenter. Do we know if that is a prerequisite? This is not something we can easily change later, so we have to be sure that we will always expect users to run DCM on a K8s-based cluster.

@gabriel-farache

Collaborator Author

> yes, I do not see currently a use case where a single SP supports 2 differents resource types.

> it's one of our requirements to let the SP supports multiple resource types. Just as an example see #43.
> Also the correct naming is service type

@gciavarrini in 52642f5 and 8d2862d I reworded the sentences.

Comment thread enhancements/environment-agent/environment-agent.md Outdated
Changing how creation requests are consumed by giving the initiative to the
agent would solve this problem: the agent pulls work from a messaging system,
removing the need for DCM-to-environment inbound connectivity for creation
requests. The agent still requires outbound connectivity to DCM for registration

Collaborator

Just wondering, if all of the communication goes via message bus, can the registration and heartbeat go via bus as well?

Collaborator

I think in that case all communication would go via the bus, with no direct communication between the agent and the control plane.

Collaborator Author

We can go full bus, yes.
I kept those direct links to DCM because they confirm DCM is still running: the REST request provides immediate feedback.

If we care, from the agent's PoV, about DCM's response to the registration request, we would need to define a flow where DCM sends an ack message of some sort to the agent's topic to give it feedback (e.g. registration failure).
I am not sure we would need it for the heartbeat, as the "connection" is already established and validated. And if DCM is down for some time, the bus buffering mechanism will make sure that no message is lost.

Should I make the change? @dcm-project/team-dcm what do we think about completely removing the direct connection between DCM and agent/SP and using the bus system instead? Meaning we go full async.

Collaborator

Yeah, I guess the question is whether "ack" is supported. I know it is in NATS; I am not sure about Kafka.

Collaborator

I'm leaning towards sync for registration/re-registration/updates, because these would mostly be called at low frequency and DCM can get the information in real time (e.g. the updated list) instead of being delayed while waiting to consume it from the bus. For example, DCM could forward a resource creation to an SP that no longer supports a service type before it gets the updated list. Similarly with the heartbeat: we need to know immediately when an agent/SP is down to avoid delegating a workload to it.
For simplicity's sake, we could go sync for now and re-evaluate later.

Collaborator Author

> For example, forwarding a resource creation to an SP that no longer supports a service type before DCM gets the updated list info.

If an agent does not support a service type when a creation request is received/read from the bus, it will reject it; DCM will receive the rejection message and will have to re-evaluate.

> Similar with the heartbeat, we'll need to know immediately when an agent/SP is down to avoid delegating a workload to it.

If the agent is down, it will never send a response message on the bus, so DCM should have a routine that waits for messages to be acknowledged for some time and re-evaluates upon expiration.
If the SP is unhealthy, the agent will send a message with the queued status (the agent waits for the SP to become healthy again before processing the request). If DCM does not want to wait, it can send a deletion message (this removes the waiting creation request) and then re-evaluate.

> Yeah, I guess the question is about if there is support of "ack", I know there is in NATS, not sure about kafka.

The ack is done by the agent sending a message to the responses topic with the resourceID and the status of the resource.

The main advantage of async is that if DCM is unavailable for a short period of time, agents will still consider themselves registered to DCM, as DCM will not lose any message and will be able to process the buffered messages once it is ready again. The agent does no work until DCM gives it some, so even if DCM is down for good, the agent will not do anything harmful. If we need the agent to get feedback from DCM, DCM can send a message on the agent's topic acking the registration/heartbeat.
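The DCM-side routine mentioned above (wait for a response for some time, then re-evaluate) could look roughly like this. It is a sketch under assumptions: `PendingRequests`, the 30-second timeout, the resource IDs, and the `"PROVISIONING"` status are hypothetical, and time is passed in explicitly so that no real bus or clock is needed.

```python
class PendingRequests:
    """DCM-side bookkeeping for creation requests awaiting a response message."""

    def __init__(self, ack_timeout_seconds):
        self._timeout = ack_timeout_seconds
        self._pending = {}  # resource ID -> time the request was published

    def published(self, resource_id, now):
        self._pending[resource_id] = now

    def response_received(self, resource_id, status):
        # Any status message counts as the agent's ack in this sketch.
        self._pending.pop(resource_id, None)
        return status

    def expired(self, now):
        # Requests with no response within the timeout must be re-evaluated.
        overdue = sorted(
            rid for rid, sent in self._pending.items() if now - sent >= self._timeout
        )
        for rid in overdue:
            del self._pending[rid]
        return overdue


tracker = PendingRequests(ack_timeout_seconds=30)
tracker.published("res-1", now=0)
tracker.published("res-2", now=10)
tracker.response_received("res-1", "PROVISIONING")  # agent acked res-1

early = tracker.expired(now=35)  # res-2 has waited 25s, below the timeout
late = tracker.expired(now=40)   # res-2 has waited 30s, so it is overdue
```

A fuller version would probably treat the queued status separately, since in that case the agent is alive and only the SP is unhealthy.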

Collaborator Author

But we can defer full async (registration and heartbeat) to a subsequent PR, once the changes introduced by this concept are reflected in the other enhancement files (and maybe even implemented, to see whether there is any challenge we did not anticipate).

@gabriel-farache gabriel-farache force-pushed the enhancement/environment-agent branch from 5ce6564 to 2c4c9ba Compare June 16, 2026 07:59
@gabriel-farache

Collaborator Author

@machacekondra @jenniferubah b442c56 as discussed I updated the relevant part to reflect the decisions

gabriel-farache and others added 18 commits June 22, 2026 11:52
Define the environment agent layer that sits between DCM and Service
Providers. The agent runs per-cluster, registers to DCM with environment
metadata, and routes creation requests via a messaging system. SPs
register to the agent (not DCM directly), each serving a single resource
type. Includes agent registration, resource creation, SP registration,
agent heartbeat, and SP health monitoring flows.

Assisted by: Claude Code - claude-opus-4-6

Signed-off-by: gabriel-farache <gfarache@redhat.com>
…r etcd watch

Integrate embedded SPs (K8s Container, ACM Cluster, KubeVirt) into the
main proposal alongside external "bring your own" SPs. Enforce a global
constraint of one SP per service type with 409 Conflict rejection for
duplicates. Change etcd/CRD Watch alternative from Rejected to Deferred
pending investigation of DCM-native watch semantics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: gabriel-farache <gfarache@redhat.com>
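
The "one SP per service type, 409 Conflict for duplicates" constraint from this commit can be illustrated with a small registry. This is only a sketch: `ServiceTypeRegistry`, the SP names, the `embedded` flag, and the choice of 201 for a new registration and 200 for a repeated one are assumptions; only the 409 rejection comes from the commit message.

```python
from http import HTTPStatus


class ServiceTypeRegistry:
    """Agent-side registry enforcing one SP per service type (hypothetical names)."""

    def __init__(self):
        self._owner = {}  # service type -> (SP name, embedded flag)

    def register(self, sp_name, service_type, embedded=False):
        owner = self._owner.get(service_type)
        if owner is not None and owner[0] != sp_name:
            # Another SP already serves this type: reject the duplicate.
            return HTTPStatus.CONFLICT
        if owner is not None:
            # Assumption: re-registration by the same SP is idempotent.
            return HTTPStatus.OK
        self._owner[service_type] = (sp_name, embedded)
        return HTTPStatus.CREATED

    def service_types(self):
        return sorted(self._owner)


registry = ServiceTypeRegistry()
embedded = registry.register("embedded-kubevirt", "vm", embedded=True)
external = registry.register("custom-db-sp", "database")
duplicate = registry.register("custom-vm-sp", "vm")      # "vm" is already taken
repeat = registry.register("custom-db-sp", "database")   # same SP registering again
```

Embedded and external SPs go through the same check here, so a "bring your own" SP cannot silently replace an embedded one for the same service type.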
@gabriel-farache gabriel-farache force-pushed the enhancement/environment-agent branch from b442c56 to 8deff81 Compare June 22, 2026 09:52
@gabriel-farache gabriel-farache force-pushed the enhancement/environment-agent branch from 6c2e91b to 38baba7 Compare June 22, 2026 14:34

@jenniferubah jenniferubah left a comment

Collaborator

Looks good

4 participants