Enhancement proposal: Cost Management Service Provider #57
Proposes a DCM service provider backed by Red Hat Lightspeed Cost Management (Project Koku). Introduces a `cost` service type, six catalog items across three tiers (basic metering, distribution, full cost), and a Go microservice that translates DCM lifecycle events into Koku API operations.
Signed-off-by: Pau Garcia Quiles <pgarciaq@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
| 2. **In-place tier upgrades.** Can an instance be upgraded from Tier 1 to | ||
| Tier 3 without delete+recreate? Koku cost models are mutable (PUT), so the | ||
| SP could support it, but the simpler path for v1 is delete+recreate. |
There was a problem hiding this comment.
Currently, we don't support updates to instances yet so maybe we want to defer this as well.
There was a problem hiding this comment.
Agreed — I've marked this as deferred in the Open Questions section. The proposal was influenced by UDLM's vision of mutable resource specifications (where cost models can be updated in-place), but since DCM doesn't yet support instance updates, we'll defer this to align with DCM's current capabilities. V1 uses delete+recreate.
| 1. **Tag-based rates in the spec?** Koku supports rates keyed by label | ||
| key:value pairs (40+ cost dimensions). Should the `CostSpec` schema include | ||
| tag rates in v1, or defer to v2? *Recommendation: defer — tiered rates cover | ||
| 90% of use cases; tag rates add significant schema complexity.* |
There was a problem hiding this comment.
Since tag rates seem complex, we can defer them to a future version.
There was a problem hiding this comment.
Thanks for confirming — tag rates are deferred to v2 as recommended in the document.
| - Replacing or reimplementing Koku's metering pipeline, rate engine, | ||
| distribution logic, or reporting. | ||
| - Cloud cost management (AWS CUR, Azure, GCP). The cost SP focuses on | ||
| on-premise OpenShift cluster metering. Koku handles cloud costs separately. |
There was a problem hiding this comment.
Since this is specific to the OpenShift platform, we should indicate that in the enhancement name and title (for example ocp-cost-sp/OCP Cost Management Service Provider). It is similar to other SPs currently implemented, such as kubvirt-sp, acm-cluster-sp, etc.
There was a problem hiding this comment.
Good question — I considered this but intentionally kept the generic name cost-sp rather than ocp-cost-sp. While v1 focuses on OpenShift cluster metering, the architecture is designed to extend beyond OpenShift:
- VM metering: The koku-metrics-operator already captures OpenShift Virtualization VM metrics (CPU, memory, disk, uptime) — so the cost SP can meter VMs today, not just clusters. See queries.go.
- RHOSO metrics: OpenStack on OpenShift metering is being added to the koku-metrics-operator before end of year (COST-5067).
- Cloud costs: Koku already handles AWS CUR, Azure, and GCP cost data. Future versions of this SP could expose those cost dimensions through DCM as well.
- Windows BMaaS: DCM is working on Windows bare-metal provisioning — the cost SP should be able to meter those resources too.
I've updated the Non-Goals section to clarify that v1 focuses on OpenShift cluster and VM metering but the SP design is intentionally generic. Naming it ocp-cost-sp would artificially limit it.
That said, if the team feels cost-sp is too generic, we'd be open to something like rhcostmgmt-sp that reflects the Cost Management product without being platform-specific.
There was a problem hiding this comment.
If v1 supports only clusters and this enhancement will be expanded to support other service types in future versions, then rhcostmgmt-sp makes sense to me, as it references the product.
There was a problem hiding this comment.
@dcm-project/maintainers, what do you think?
| | Endpoint | Koku API Call | | ||
| |----------|--------------| | ||
| | `GET /usage/{id}/compute` | `GET /reports/openshift/compute/?filter[cluster]={cluster_id}` | | ||
| | `GET /usage/{id}/memory` | `GET /reports/openshift/memory/?filter[cluster]={cluster_id}` | | ||
| | `GET /usage/{id}/storage` | `GET /reports/openshift/volumes/?filter[cluster]={cluster_id}` | | ||
|
|
||
| **Cost** (available when a cost model is assigned): | ||
|
|
||
| | Endpoint | Koku API Call | | ||
| |----------|--------------| | ||
| | `GET /cost-reports/{id}` | `GET /reports/openshift/costs/?filter[cluster]={cluster_id}` | | ||
| | `GET /cost-reports/{id}/breakdown` | `GET /reports/openshift/costs/?filter[cluster]={cluster_id}&group_by[project]=*` | | ||
| | `GET /cost-reports/{id}/forecast` | `GET /forecasts/openshift/costs/?filter[cluster]={cluster_id}` | |
There was a problem hiding this comment.
Who is expected to call these read endpoints? Are they intended for DCM, for direct use by admins, or for something else?
There was a problem hiding this comment.
The read-only endpoints are designed for multiple consumers:
- DCM UI — platform engineers can see cost data alongside provisioned resources in a single dashboard.
- Tenant self-service — operators or project owners can query their cluster's metering and cost data through DCM's API gateway without needing direct Koku access.
- External tooling — CI/CD pipelines for cost gates, external dashboards, or cost-aware placement policies.
These endpoints proxy to Koku's report API and return the data filtered by cluster_id. I've clarified this in the document.
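The proxy mapping in the endpoint tables above can be sketched as a small lookup plus query construction. This is only an illustration under assumptions: the function name `kokuURL`, the `kind` keys, and the base URL are hypothetical, and the sketch assumes the SP has already resolved the instance `{id}` to a Koku `cluster_id`. Only the Koku paths and query parameters come from the tables.

```go
package main

import (
	"fmt"
	"net/url"
)

// kokuRoutes maps a (hypothetical) report kind to the Koku path listed in
// the proposal's endpoint tables.
var kokuRoutes = map[string]string{
	"usage/compute":          "/reports/openshift/compute/",
	"usage/memory":           "/reports/openshift/memory/",
	"usage/storage":          "/reports/openshift/volumes/",
	"cost-reports":           "/reports/openshift/costs/",
	"cost-reports/breakdown": "/reports/openshift/costs/",
	"cost-reports/forecast":  "/forecasts/openshift/costs/",
}

// kokuURL builds the upstream Koku URL for a report kind, always filtered
// by the given cluster ID.
func kokuURL(base, kind, clusterID string) (string, error) {
	path, ok := kokuRoutes[kind]
	if !ok {
		return "", fmt.Errorf("unknown report kind %q", kind)
	}
	query := "filter[cluster]=" + url.QueryEscape(clusterID)
	if kind == "cost-reports/breakdown" {
		// The breakdown endpoint groups costs by project.
		query += "&group_by[project]=*"
	}
	return base + path + "?" + query, nil
}

func main() {
	// The base URL is a placeholder, not a real deployment.
	u, err := kokuURL("https://koku.example.com/api", "cost-reports/breakdown", "cluster-a")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u)
}
```

Keeping the mapping in one table makes it easy to see that every read endpoint is scoped to a single cluster, which is what allows tenants to query without direct Koku access.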
There was a problem hiding this comment.
DCM UI — platform engineers can see cost data alongside provisioned resources in a single dashboard.
Since DCM UI communicates with the SPs via the control plane and not directly, I'm wondering if we should have standard interface endpoints for cost reports within the control plane. Probably can be deferred to future versions.
|
|
||
| ## Design Details | ||
|
|
||
| ### Architecture |
There was a problem hiding this comment.
I don't fully get the flow in my head.
- Does the DCM user request for a cluster via a catalog item and then something automatically selects a catalog item for the cost? DCM will be supporting multi-resource creation in a single request so in this use case, this can be a single catalog-item. For reference, see enhancement
- What is the flow for attaching cost metering to a VM? Or is this out of scope?
- Why does the bridge initiate instance creation? Is it for creating cost instances for incoming requests? I also don't fully understand what the bridge does. I might be missing something.
There was a problem hiding this comment.
Great questions — let me address each:
1. Cluster + cost as one request:
Today the flow is two-step: the user (or an automation) requests a cluster, and then the bridge (a NATS consumer) automatically creates the cost instance when the cluster reaches READY status. Your declarative API enhancement is exactly the mechanism that could combine both into a single multi-resource catalog request in the future.
For v1, the bridge approach was chosen because it works with DCM as it exists today — no changes to the catalog, placement, or policy managers are needed. Once the declarative API is implemented, combining both into a multi-resource catalog item would be the natural evolution. I've added a note about this in the document.
2. Cost metering for VMs:
VM metering is actually already partially covered. The koku-metrics-operator already captures OpenShift Virtualization VM metrics — CPU limits/requests/usage, memory limits/requests/usage, disk size, uptime, and labels (see queries.go). These metrics flow through the same Koku pipeline that handles pods and nodes, so the cost SP can already create sources that capture VM data alongside cluster data.
Additionally, RHOSO (Red Hat OpenStack Services on OpenShift) metering support is being added to the koku-metrics-operator before end of year (COST-5067).
I've updated the Non-Goals to clarify this.
3. What the bridge does:
The bridge is a lightweight NATS consumer that subscribes to DCM's CloudEvent stream. When it sees a cluster reach READY status (from any cluster SP — ACM, kcli, k8s, etc.), it automatically calls POST /api/v1alpha1/catalog-item-instances through DCM's standard catalog pipeline to create a cost instance for that cluster. This way, every provisioned cluster automatically gets cost tracking without the operator having to manually create a cost instance. Think of it as "event-driven automation on top of DCM's existing API."
On cluster deletion, the bridge detects the DELETED event and triggers cost instance cleanup through the same pipeline. I've expanded the description in the document to make this clearer.
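The bridge behaviour described above boils down to a decision per incoming event. The sketch below shows only that decision, without the NATS subscription or the HTTP call. The `ClusterEvent` payload shape, the `Action` values, and the `decide` function are assumptions for illustration; the real CloudEvent data and the bridge's internals may look different.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ClusterEvent is a hypothetical, trimmed-down view of the data carried by
// a cluster status CloudEvent.
type ClusterEvent struct {
	ClusterID string `json:"cluster_id"`
	Status    string `json:"status"`
}

// Action names what the bridge would do next (illustrative values).
type Action string

const (
	Ignore             Action = "ignore"
	CreateCostInstance Action = "create-cost-instance" // POST /api/v1alpha1/catalog-item-instances
	DeleteCostInstance Action = "delete-cost-instance" // cleanup through the same pipeline
)

// decide parses one event payload and returns the action plus the cluster ID.
func decide(raw []byte) (Action, string, error) {
	var ev ClusterEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return Ignore, "", fmt.Errorf("decode event: %w", err)
	}
	if ev.ClusterID == "" {
		return Ignore, "", errors.New("event has no cluster_id")
	}
	switch ev.Status {
	case "READY":
		return CreateCostInstance, ev.ClusterID, nil
	case "DELETED":
		return DeleteCostInstance, ev.ClusterID, nil
	default:
		// Any other status leaves cost tracking untouched.
		return Ignore, ev.ClusterID, nil
	}
}

func main() {
	action, id, err := decide([]byte(`{"cluster_id":"cluster-a","status":"READY"}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(action, id)
}
```

Separating the decision from the transport keeps this part testable without a running NATS server, and the same function would still apply if the trigger later moves to a multi-resource catalog request.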
There was a problem hiding this comment.
Thanks for the explanation. It is very clear to me now.
Related to point 1 and 3, I agree with @gabriel-farache, we can ditch the bridge and fly with the multi-resource catalog request. What do you think?
- Fix three-state health model to include unavailable
- Mark in-place tier upgrades as deferred (DCM lacks instance updates)
- Clarify cost SP is not OCP-specific: Koku already meters VMs, RHOSO coming
- Remove inline service type enum, reference service-type-definitions
- Expand ID mapping explanation with table
- Add Koku→DCM status mapping table
- Add CRUD + health endpoint summary table
- Clarify intended consumers of read-only query endpoints
- Add note about future multi-resource catalog items (declarative API)
- Update Non-Goals to reflect broader metering scope
Signed-off-by: Pau Garcia Quiles <pgarciaq@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
|
Why: the Koku backing is a great starting implementation — but for a second cost provider to be droppable behind the same This is already a clean multi-capability provider (
Net: cost recovery becomes provider-agnostic, and the cost data is ingestible by any FOCUS-aware FinOps tool. Pairs with the matching note on the |
gabriel-farache
left a comment
There was a problem hiding this comment.
I think we should reject the bridge and let the declarative API (composite catalog item) request the cost resource to be created alongside a cluster if the user wants it.
| 1. **Tag-based rates in the spec?** Koku supports rates keyed by label | ||
| key:value pairs (40+ cost dimensions). Should the `CostSpec` schema include | ||
| tag rates in v1, or defer to v2? *Recommendation: defer — tiered rates cover | ||
| 90% of use cases; tag rates add significant schema complexity.* |
| 2. **In-place tier upgrades.** *(Deferred.)* Can an instance be upgraded from | ||
| Tier 1 to Tier 3 without delete+recreate? Koku cost models are mutable | ||
| (PUT), so the SP could support it. However, DCM does not yet support | ||
| instance updates, so this is deferred until that capability is available. | ||
| The proposal was influenced by UDLM's vision of mutable resource | ||
| specifications; v1 uses delete+recreate. | ||
|
|
||
| 3. **Multi-tenancy mapping.** DCM does not have tenancy yet (v1). Koku uses | ||
| schema-per-tenant. Which Koku tenant does the bridge create sources in? | ||
| *For v1, a single pre-configured Koku tenant is assumed.* |
There was a problem hiding this comment.
+1 let's defer those
| The Cost Management Service Provider integrates | ||
| [Red Hat Lightspeed Cost Management](https://github.com/project-koku/koku) | ||
| (Project Koku) with DCM's provisioning lifecycle. It introduces a new `cost` | ||
| service type and a Go microservice (`koku-cost-provider`) that translates DCM |
There was a problem hiding this comment.
nitpicking here: no need to mention the language that will be used for implementation
There was a problem hiding this comment.
I agree, there is no need
|
|
||
| ## Motivation | ||
|
|
||
| When DCM provisions an OpenShift cluster, someone must separately configure |
There was a problem hiding this comment.
When DCM provisions an OpenShift cluster
So the cost management is only for cluster, not for any other service type?
There was a problem hiding this comment.
See the related comment: it is designed to support other service types, but in v1 only cluster is supported.
There was a problem hiding this comment.
Ah, OK, that's why the resource type of the cost resource is never mentioned and is set to cluster by default in the service type definition of #60.
That's OK with me, but there should be no default type in the service definition, and the resource type of the cost resource must be explicitly set to cluster to remove any ambiguity.
I am just wondering whether the enhancement is too focused on clusters, but I guess when/if we support other resource types, it will be updated to be less cluster-specific.
|
|
||
| - **No automatic cost visibility.** Clusters can run for days before cost | ||
| tracking is configured. | ||
| - **No lifecycle synchronization.** Deleting a cluster in DCM does not clean up |
There was a problem hiding this comment.
This is not a real problem currently, as the Koku sources were manually installed to begin with.
| | `/api/v1alpha1/instances/{id}` | GET | Get cost instance status and details | | ||
| | `/api/v1alpha1/instances/{id}` | DELETE | Delete cost instance (pause Koku source, remove cost model) | | ||
| | `/api/v1alpha1/instances` | GET | List cost instances | | ||
| | `/usage/{id}/compute` | GET | Query CPU utilization (proxied from Koku) | |
There was a problem hiding this comment.
By id, do you mean the resourceID, i.e. the cluster for which we want the data? Or is it something else?
| #### Status Lifecycle | ||
|
|
||
| ``` | ||
| PROVISIONING → READY → ERROR → DELETED |
There was a problem hiding this comment.
Can an instance in ERROR go back to READY?
| | READY | Metering data actively flowing | `GET /sources/{uuid}/stats/` returns recent data points. Source is active and not paused. | | ||
| | ERROR | Operator not uploading | Source exists but metering data has gone stale (no new data beyond the configured staleness threshold). | | ||
| | ERROR | Koku API unreachable | SP cannot reach the Koku API to verify source status. | | ||
| | ERROR | Source creation failed | Koku rejected the `POST /sources/` request (e.g., duplicate `cluster_id`, invalid authentication). | |
There was a problem hiding this comment.
Is that correct? It feels strange to have the whole SP status be ERROR just because one creation failed.
I get that when creation fails, the response to DCM for the requested resource is ERROR, but aren't we defining the SP status here?
|
|
||
| 1. Receive `dcm.status.cluster` CloudEvent with `status: READY`. | ||
| 2. Read the cluster's labels to select the appropriate catalog item. | ||
| 3. Submit a `CatalogItemInstance` creation request through DCM's API. |
There was a problem hiding this comment.
Aaah! OK, I get it now: the SP will send a POST to DCM to request the creation of a cost resource.
What happens if several cost SPs are listening at the same time? How can we know that the SP requesting the creation will be the one assigned by the policies? Or does it not matter?
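Step 2 of the quoted flow (reading the cluster's labels to select a catalog item) could look roughly like the sketch below. Everything specific here is hypothetical: the label key `cost.example.com/tier`, the catalog item names, and the fallback to basic metering are placeholders, since the proposal excerpt does not show the real values.

```go
package main

import "fmt"

// tierLabel is a hypothetical label key; the proposal's actual key is not
// shown in this thread.
const tierLabel = "cost.example.com/tier"

// catalogItems maps a label value to a hypothetical catalog item name.
var catalogItems = map[string]string{
	"basic":        "cost-basic-metering",
	"distribution": "cost-distribution",
	"full":         "cost-full",
}

// defaultItem is used when the cluster carries no tier label (an assumption).
const defaultItem = "cost-basic-metering"

// selectCatalogItem picks the catalog item for a cluster from its labels.
func selectCatalogItem(labels map[string]string) (string, error) {
	tier, ok := labels[tierLabel]
	if !ok {
		return defaultItem, nil
	}
	item, ok := catalogItems[tier]
	if !ok {
		return "", fmt.Errorf("unsupported value %q for label %s", tier, tierLabel)
	}
	return item, nil
}

func main() {
	item, err := selectCatalogItem(map[string]string{tierLabel: "full"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(item)
}
```

Note that this selection only chooses what to request. Which cost SP ends up serving the request would still be decided by DCM's placement and policy pipeline.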
| will support multi-resource catalog items, enabling "provision cluster + attach | ||
| cost metering" in a single request. When that capability is available, the |
There was a problem hiding this comment.
Exactly, so I believe the bridge feature should be rejected
This will simplify things and avoid back-and-forth, and the declarative API is a core feature of DCM.
Summary
Enhancement proposal for the Cost Management Service Provider — a DCM
service provider backed by Red Hat Lightspeed Cost Management
(Project Koku).
When DCM provisions infrastructure, this SP automatically creates the
corresponding Koku sources and cost models — so metering and cost tracking
begin without manual configuration.
Implementation: pgarciaq/cost-dcm-provider
What makes this SP unique
New `cost` service type. First DCM SP that provisions a cross-cutting capability (cost visibility) rather than a compute resource.
Three-tier model. Basic metering (no cost model), metering + distribution
(overhead attribution without dollar amounts), and full cost (metering ×
rates). One instance, one SP — the tier is determined by what's in the spec.
Bridge-driven automation. A companion NATS consumer watches for cluster
READY events and automatically creates cost instances through DCM's standard
catalog pipeline — every cluster gets cost tracking without operator
intervention per cluster.
Read-only query API. The SP exposes metering and cost data from Koku
through its own endpoints, enabling tenants to query their cluster costs
via DCM.
Policy-governed rates. Rego policies enforce rate ranges, markup minimums,
and budget limits — applied to both automatic and manual cost instances.
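The three-tier model above says the tier follows from what the spec contains. A minimal sketch of that rule is shown below, assuming a `CostSpec` with an optional distribution section and a list of rates. The field names, the `Rate` and `Distribution` types, and the metric name are illustrative and not taken from the proposal's schema.

```go
package main

import "fmt"

// Rate is a hypothetical price for one metric.
type Rate struct {
	Metric string
	Value  float64
}

// Distribution is a hypothetical overhead-attribution setting.
type Distribution struct {
	Method string
}

// CostSpec is an illustrative stand-in for the proposal's spec schema.
type CostSpec struct {
	Distribution *Distribution // nil means no distribution configured
	Rates        []Rate        // empty means no dollar amounts
}

// tierOf infers the tier from the fields present in the spec.
func tierOf(s CostSpec) int {
	switch {
	case len(s.Rates) > 0:
		return 3 // full cost: metering multiplied by rates
	case s.Distribution != nil:
		return 2 // metering plus distribution, no dollar amounts
	default:
		return 1 // basic metering, no cost model
	}
}

func main() {
	spec := CostSpec{Rates: []Rate{{Metric: "cpu_core_hours", Value: 0.05}}}
	fmt.Println("tier:", tierOf(spec))
}
```

Deriving the tier instead of storing it keeps the spec free of a redundant field, at the cost of making the rule part of the SP's contract.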
Key details
healthy/unhealthy) compatible with control-plane
Relationship to existing proposals
The cost SP follows DCM's standard SP contract and works with any cluster SP
(ACM, kcli, etc.). It uses the same registration, CRUD, and CloudEvent patterns
documented in the SP registration flow and status reporting enhancements.