
Enhancement proposal: Cost Management Service Provider#57

Open
pgarciaq wants to merge 2 commits into dcm-project:main from pgarciaq:cost-sp

Conversation

@pgarciaq


Summary

Enhancement proposal for the Cost Management Service Provider — a DCM
service provider backed by Red Hat Lightspeed Cost Management
(Project Koku).

When DCM provisions infrastructure, this SP automatically creates the
corresponding Koku sources and cost models — so metering and cost tracking
begin without manual configuration.

Implementation: pgarciaq/cost-dcm-provider

What makes this SP unique

  1. New cost service type. First DCM SP that provisions a cross-cutting
    capability (cost visibility) rather than a compute resource.

  2. Three-tier model. Basic metering (no cost model), metering + distribution
    (overhead attribution without dollar amounts), and full cost (metering ×
    rates). One instance, one SP — the tier is determined by what's in the spec.

  3. Bridge-driven automation. A companion NATS consumer watches for cluster
    READY events and automatically creates cost instances through DCM's standard
    catalog pipeline — every cluster gets cost tracking without operator
    intervention per cluster.

  4. Read-only query API. The SP exposes metering and cost data from Koku
    through its own endpoints, enabling tenants to query their cluster costs
    via DCM.

  5. Policy-governed rates. Rego policies enforce rate ranges, markup minimums,
    and budget limits — applied to both automatic and manual cost instances.
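The three-tier model in item 2 ("the tier is determined by what's in the spec") can be sketched as below. This is only an illustration: the `CostSpec` fields `Distribution` and `Rates`, the `Rate` type, and the `TierOf` helper are hypothetical names, not the schema from the proposal.

```go
package main

import "fmt"

// Rate is a hypothetical price attached to one metered metric.
type Rate struct {
	Metric string
	Value  float64
}

// CostSpec is a simplified, assumed shape of the cost instance spec.
// The real schema in the enhancement may use different fields.
type CostSpec struct {
	ClusterID    string
	Distribution string // assumed: empty means no overhead distribution
	Rates        []Rate // assumed: empty means no dollar amounts
}

// Tier identifies which of the three tiers a spec falls into.
type Tier int

const (
	TierBasicMetering Tier = iota + 1 // metering only, no cost model
	TierDistribution                  // metering + overhead distribution
	TierFullCost                      // metering x rates
)

// TierOf derives the tier purely from the contents of the spec.
func TierOf(spec CostSpec) Tier {
	switch {
	case len(spec.Rates) > 0:
		return TierFullCost
	case spec.Distribution != "":
		return TierDistribution
	default:
		return TierBasicMetering
	}
}

func main() {
	fmt.Println(TierOf(CostSpec{ClusterID: "c-1"}))
	fmt.Println(TierOf(CostSpec{ClusterID: "c-1", Distribution: "cpu"}))
	fmt.Println(TierOf(CostSpec{ClusterID: "c-1", Rates: []Rate{{Metric: "cpu", Value: 0.02}}}))
}
```

Deriving the tier from the spec rather than storing it as a separate field keeps a single instance type for all six catalog items, which is what "one instance, one SP" suggests.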

Key details

  • Six catalog items across three tiers (basic metering → distribution → full cost)
  • Koku's 40+ cost dimensions available through the rate system
  • Three-state health model (healthy/unhealthy/unavailable) compatible with control-plane
  • Ginkgo test suite, CI with lint + build + generate-check
  • Detailed design docs: integration architecture, SP design, white paper

Relationship to existing proposals

The cost SP follows DCM's standard SP contract and works with any cluster SP
(ACM, kcli, etc.). It uses the same registration, CRUD, and CloudEvent patterns
documented in the SP registration flow and status reporting enhancements.

Made with Cursor

Proposes a DCM service provider backed by Red Hat Lightspeed Cost
Management (Project Koku). Introduces a `cost` service type, six
catalog items across three tiers (basic metering, distribution, full
cost), and a Go microservice that translates DCM lifecycle events into
Koku API operations.

Signed-off-by: Pau Garcia Quiles <pgarciaq@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread enhancements/cost-sp/cost-sp.md Outdated
Comment thread enhancements/cost-sp/cost-sp.md Outdated
Comment on lines +36 to +38
2. **In-place tier upgrades.** Can an instance be upgraded from Tier 1 to
Tier 3 without delete+recreate? Koku cost models are mutable (PUT), so the
SP could support it, but the simpler path for v1 is delete+recreate.

Collaborator

We don't support updates to instances yet, so maybe we should defer this as well.

Author

Agreed — I've marked this as deferred in the Open Questions section. The proposal was influenced by UDLM's vision of mutable resource specifications (where cost models can be updated in-place), but since DCM doesn't yet support instance updates, we'll defer this to align with DCM's current capabilities. V1 uses delete+recreate.

Comment on lines +31 to +34
1. **Tag-based rates in the spec?** Koku supports rates keyed by label
key:value pairs (40+ cost dimensions). Should the `CostSpec` schema include
tag rates in v1, or defer to v2? *Recommendation: defer — tiered rates cover
90% of use cases; tag rates add significant schema complexity.*

Collaborator

Since tag rates seem complex, we can defer them to a future version.

Author

Thanks for confirming — tag rates are deferred to v2 as recommended in the document.

Collaborator

+1

Comment thread enhancements/cost-sp/cost-sp.md Outdated
- Replacing or reimplementing Koku's metering pipeline, rate engine,
distribution logic, or reporting.
- Cloud cost management (AWS CUR, Azure, GCP). The cost SP focuses on
on-premise OpenShift cluster metering. Koku handles cloud costs separately.

Collaborator

Since this is specific to the OpenShift platform, we should indicate that in the enhancement name and title (for example, ocp-cost-sp / OCP Cost Management Service Provider). That would be similar to other SPs currently implemented, such as kubvirt-sp, acm-cluster-sp, etc.

Author

Good question — I considered this but intentionally kept the generic name cost-sp rather than ocp-cost-sp. While v1 focuses on OpenShift cluster metering, the architecture is designed to extend beyond OpenShift:

  1. VM metering: The koku-metrics-operator already captures OpenShift Virtualization VM metrics (CPU, memory, disk, uptime) — so the cost SP can meter VMs today, not just clusters. See queries.go.
  2. RHOSO metrics: OpenStack on OpenShift metering is being added to the koku-metrics-operator before end of year (COST-5067).
  3. Cloud costs: Koku already handles AWS CUR, Azure, and GCP cost data. Future versions of this SP could expose those cost dimensions through DCM as well.
  4. Windows BMaaS: DCM is working on Windows bare-metal provisioning — the cost SP should be able to meter those resources too.

I've updated the Non-Goals section to clarify that v1 focuses on OpenShift cluster and VM metering but the SP design is intentionally generic. Naming it ocp-cost-sp would artificially limit it.

That said, if the team feels cost-sp is too generic, we'd be open to something like rhcostmgmt-sp that reflects the Cost Management product without being platform-specific.

Collaborator

If v1 supports only clusters and this enhancement will be expanded to other service types in future versions, then rhcostmgmt-sp makes sense to me, since it references the product.

Collaborator

@dcm-project/maintainers, what do you think?

Comment thread enhancements/cost-sp/cost-sp.md Outdated
Comment thread enhancements/cost-sp/cost-sp.md Outdated
Comment thread enhancements/cost-sp/cost-sp.md
Comment on lines +328 to +340
| Endpoint | Koku API Call |
|----------|--------------|
| `GET /usage/{id}/compute` | `GET /reports/openshift/compute/?filter[cluster]={cluster_id}` |
| `GET /usage/{id}/memory` | `GET /reports/openshift/memory/?filter[cluster]={cluster_id}` |
| `GET /usage/{id}/storage` | `GET /reports/openshift/volumes/?filter[cluster]={cluster_id}` |

**Cost** (available when a cost model is assigned):

| Endpoint | Koku API Call |
|----------|--------------|
| `GET /cost-reports/{id}` | `GET /reports/openshift/costs/?filter[cluster]={cluster_id}` |
| `GET /cost-reports/{id}/breakdown` | `GET /reports/openshift/costs/?filter[cluster]={cluster_id}&group_by[project]=*` |
| `GET /cost-reports/{id}/forecast` | `GET /forecasts/openshift/costs/?filter[cluster]={cluster_id}` |

Collaborator

Who is expected to call these read endpoints? Are they intended for DCM, for direct use by admins, or for something else?

Author

The read-only endpoints are designed for multiple consumers:

  1. DCM UI — platform engineers can see cost data alongside provisioned resources in a single dashboard.
  2. Tenant self-service — operators or project owners can query their cluster's metering and cost data through DCM's API gateway without needing direct Koku access.
  3. External tooling — CI/CD pipelines for cost gates, external dashboards, or cost-aware placement policies.

These endpoints proxy to Koku's report API and return the data filtered by cluster_id. I've clarified this in the document.
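As a rough illustration of that proxying, the mapping tables quoted above can be expressed as a lookup from SP endpoint to Koku report URL. This is a sketch under assumptions: the `KokuURL` helper, the `route` type, and the example base URL are hypothetical, and the caller is assumed to have already resolved the instance `{id}` to the Koku `cluster_id`.

```go
package main

import (
	"fmt"
	"net/url"
)

// route describes one Koku report call: its path and any extra query suffix.
type route struct {
	path  string
	extra string
}

// kokuRoutes mirrors the endpoint tables in the proposal excerpt.
var kokuRoutes = map[string]route{
	"/usage/{id}/compute":          {path: "/reports/openshift/compute/"},
	"/usage/{id}/memory":           {path: "/reports/openshift/memory/"},
	"/usage/{id}/storage":          {path: "/reports/openshift/volumes/"},
	"/cost-reports/{id}":           {path: "/reports/openshift/costs/"},
	"/cost-reports/{id}/breakdown": {path: "/reports/openshift/costs/", extra: "&group_by[project]=*"},
	"/cost-reports/{id}/forecast":  {path: "/forecasts/openshift/costs/"},
}

// KokuURL builds the upstream Koku URL for an SP endpoint, always filtered
// by cluster. The base URL is supplied by the caller (assumed configuration).
func KokuURL(base, endpoint, clusterID string) (string, error) {
	r, ok := kokuRoutes[endpoint]
	if !ok {
		return "", fmt.Errorf("unknown endpoint %q", endpoint)
	}
	return base + r.path + "?filter[cluster]=" + url.QueryEscape(clusterID) + r.extra, nil
}

func main() {
	// The host below is a placeholder, not a real Koku deployment.
	u, err := KokuURL("https://koku.example.com/api", "/cost-reports/{id}/breakdown", "c-1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u)
}
```

Keeping the mapping in one table makes it easy to see that every read endpoint is scoped by `filter[cluster]`, so a consumer never gets data for a cluster other than the one bound to the instance.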

Collaborator

DCM UI — platform engineers can see cost data alongside provisioned resources in a single dashboard.

Since the DCM UI communicates with the SPs via the control plane and not directly, I'm wondering if we should have standard interface endpoints for cost reports within the control plane. This can probably be deferred to a future version.

Comment thread enhancements/cost-sp/cost-sp.md

## Design Details

### Architecture

Collaborator

I don't fully understand the flow yet.

  1. Does the DCM user request a cluster via a catalog item, and then something automatically selects a catalog item for the cost? DCM will support multi-resource creation in a single request, so in this use case it could be a single catalog item. For reference, see enhancement
  2. What is the flow for attaching cost metering to a VM? Or is this out of scope?
  3. Why does the bridge initiate instance creation? Is it to create cost instances for incoming requests? I also don't fully understand what the bridge does. I might be missing something.

Author

Great questions — let me address each:

1. Cluster + cost as one request:

Today the flow is two-step: the user (or an automation) requests a cluster, and then the bridge (a NATS consumer) automatically creates the cost instance when the cluster reaches READY status. Your declarative API enhancement is exactly the mechanism that could combine both into a single multi-resource catalog request in the future.

For v1, the bridge approach was chosen because it works with DCM as it exists today — no changes to the catalog, placement, or policy managers are needed. Once the declarative API is implemented, combining both into a multi-resource catalog item would be the natural evolution. I've added a note about this in the document.

2. Cost metering for VMs:

VM metering is actually already partially covered. The koku-metrics-operator already captures OpenShift Virtualization VM metrics — CPU limits/requests/usage, memory limits/requests/usage, disk size, uptime, and labels (see queries.go). These metrics flow through the same Koku pipeline that handles pods and nodes, so the cost SP can already create sources that capture VM data alongside cluster data.

Additionally, RHOSO (Red Hat OpenStack Services on OpenShift) metering support is being added to the koku-metrics-operator before end of year (COST-5067).

I've updated the Non-Goals to clarify this.

3. What the bridge does:

The bridge is a lightweight NATS consumer that subscribes to DCM's CloudEvent stream. When it sees a cluster reach READY status (from any cluster SP — ACM, kcli, k8s, etc.), it automatically calls POST /api/v1alpha1/catalog-item-instances through DCM's standard catalog pipeline to create a cost instance for that cluster. This way, every provisioned cluster automatically gets cost tracking without the operator having to manually create a cost instance. Think of it as "event-driven automation on top of DCM's existing API."

On cluster deletion, the bridge detects the DELETED event and triggers cost instance cleanup through the same pipeline. I've expanded the description in the document to make this clearer.
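The bridge behaviour described in point 3 can be sketched as a pure decision function, leaving out the NATS subscription and the HTTP call to `POST /api/v1alpha1/catalog-item-instances`. The event type `dcm.status.cluster` comes from the proposal; the payload field names (`cluster_id`, `status`) and the `Decide` helper are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// clusterEvent is a minimal, assumed view of a DCM cluster status CloudEvent.
type clusterEvent struct {
	Type string `json:"type"`
	Data struct {
		ClusterID string `json:"cluster_id"` // hypothetical field name
		Status    string `json:"status"`
	} `json:"data"`
}

// Action is what the bridge would ask DCM's catalog pipeline to do.
type Action string

const (
	ActionNone   Action = "none"
	ActionCreate Action = "create-cost-instance"
	ActionDelete Action = "delete-cost-instance"
)

// Decide maps one raw event to an action and the cluster it applies to.
// Events of other types, and other statuses, are ignored.
func Decide(raw []byte) (Action, string, error) {
	var ev clusterEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return ActionNone, "", fmt.Errorf("decode event: %w", err)
	}
	if ev.Type != "dcm.status.cluster" {
		return ActionNone, "", nil
	}
	switch ev.Data.Status {
	case "READY":
		return ActionCreate, ev.Data.ClusterID, nil
	case "DELETED":
		return ActionDelete, ev.Data.ClusterID, nil
	default:
		return ActionNone, ev.Data.ClusterID, nil
	}
}

func main() {
	raw := []byte(`{"type":"dcm.status.cluster","data":{"cluster_id":"c-1","status":"READY"}}`)
	action, id, err := Decide(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(action, id)
}
```

Separating the decision from the transport keeps the logic testable without a broker. A real consumer would also need to handle redelivered events, for example by checking whether a cost instance already exists for the cluster before creating one.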

Collaborator

Thanks for the explanation. It is very clear to me now.
Related to point 1 and 3, I agree with @gabriel-farache, we can ditch the bridge and fly with the multi-resource catalog request. What do you think?

- Fix three-state health model to include unavailable
- Mark in-place tier upgrades as deferred (DCM lacks instance updates)
- Clarify cost SP is not OCP-specific: Koku already meters VMs, RHOSO coming
- Remove inline service type enum, reference service-type-definitions
- Expand ID mapping explanation with table
- Add Koku→DCM status mapping table
- Add CRUD + health endpoint summary table
- Clarify intended consumers of read-only query endpoints
- Add note about future multi-resource catalog items (declarative API)
- Update Non-Goals to reflect broader metering scope

Signed-off-by: Pau Garcia Quiles <pgarciaq@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@croadfeldt

Collaborator

Why: the Koku backing is a great starting implementation — but for a second cost provider to be droppable behind the same cost service type, the contract and the served data need to be vendor-neutral. The implementation can stay Koku-specific.

This is already a clean multi-capability provider (realize_resources to provision metering + serve_data for the read-only query API) with Rego-governed rates (policy-as-code) — nicely aligned with the provider model. To make it provider-agnostic:

  1. Conform the served cost data to FOCUS (+ OpenCost for k8s allocation) so consumers query cost identically regardless of which provider backs it.
  2. Declare which standard versions the SP supports (e.g. FOCUS >=1.2 <2.0, preferred 1.4, emit; OpenCost 1.x) so the platform can negotiate the version a consumer needs.
  3. Register the query API as a serve_data / information capability, and bind cost to the target resource by identity (uuid ↔ FOCUS ResourceId); keep Koku authoritative (lookup-only, not copied or owned).
  4. Keep Koku as the backing — it's the implementation detail, not the contract.

Net: cost recovery becomes provider-agnostic, and the cost data is ingestible by any FOCUS-aware FinOps tool. Pairs with the matching note on the cost service-type definition (#60).

@gabriel-farache gabriel-farache left a comment

Collaborator

I think we should reject the bridge and let the declarative api (composite catalog item) request the cost resource to be created alongside a cluster if the user wants it

Comment on lines +31 to +34
1. **Tag-based rates in the spec?** Koku supports rates keyed by label
key:value pairs (40+ cost dimensions). Should the `CostSpec` schema include
tag rates in v1, or defer to v2? *Recommendation: defer — tiered rates cover
90% of use cases; tag rates add significant schema complexity.*

Collaborator

+1

Comment on lines +36 to +45
2. **In-place tier upgrades.** *(Deferred.)* Can an instance be upgraded from
Tier 1 to Tier 3 without delete+recreate? Koku cost models are mutable
(PUT), so the SP could support it. However, DCM does not yet support
instance updates, so this is deferred until that capability is available.
The proposal was influenced by UDLM's vision of mutable resource
specifications; v1 uses delete+recreate.

3. **Multi-tenancy mapping.** DCM does not have tenancy yet (v1). Koku uses
schema-per-tenant. Which Koku tenant does the bridge create sources in?
*For v1, a single pre-configured Koku tenant is assumed.*

Collaborator

+1 let's defer those

The Cost Management Service Provider integrates
[Red Hat Lightspeed Cost Management](https://github.com/project-koku/koku)
(Project Koku) with DCM's provisioning lifecycle. It introduces a new `cost`
service type and a Go microservice (`koku-cost-provider`) that translates DCM

Collaborator

nitpicking here: no need to mention the language that will be used for implementation

Collaborator

I agree, there is no need


## Motivation

When DCM provisions an OpenShift cluster, someone must separately configure

Collaborator

> When DCM provisions an OpenShift cluster

So the cost management is only for cluster, not for any other service type?

Collaborator

See the related comment: it can support other service types, but in v1 only cluster is supported.

Collaborator

Ah, OK, that's why the resource type of the cost resource is never mentioned and is set to cluster by default in the service type definition of #60.

That's OK with me, but there should be no default type in the service definition, and the resource type of the cost resource must be explicitly set to cluster to remove any ambiguity.

I am just wondering whether the enhancement is too focused on clusters, but I guess that when/if we support other resource types, it will be updated to be less cluster-specific.


- **No automatic cost visibility.** Clusters can run for days before cost
tracking is configured.
- **No lifecycle synchronization.** Deleting a cluster in DCM does not clean up

Collaborator

This is not a real issue currently, as the Koku sources were manually installed to begin with.

| `/api/v1alpha1/instances/{id}` | GET | Get cost instance status and details |
| `/api/v1alpha1/instances/{id}` | DELETE | Delete cost instance (pause Koku source, remove cost model) |
| `/api/v1alpha1/instances` | GET | List cost instances |
| `/usage/{id}/compute` | GET | Query CPU utilization (proxied from Koku) |

Collaborator

By id, do you mean the resourceID, i.e. the cluster for which we want the data? Or is it something else?

#### Status Lifecycle

```
PROVISIONING → READY → ERROR → DELETED
```

Collaborator

Can an ERROR state transition back to READY?

| READY | Metering data actively flowing | `GET /sources/{uuid}/stats/` returns recent data points. Source is active and not paused. |
| ERROR | Operator not uploading | Source exists but metering data has gone stale (no new data beyond the configured staleness threshold). |
| ERROR | Koku API unreachable | SP cannot reach the Koku API to verify source status. |
| ERROR | Source creation failed | Koku rejected the `POST /sources/` request (e.g., duplicate `cluster_id`, invalid authentication). |
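The rows above can be read as a small decision procedure. The sketch below is one possible reading, not the SP's actual code: `SourceObservation` and `StatusOf` are hypothetical names, the order of the checks is an assumption, and the PROVISIONING state and the paused-source check are left out because this excerpt does not show how they map.

```go
package main

import (
	"fmt"
	"time"
)

// SourceObservation is an assumed summary of what the SP learned about one
// Koku source during a status check.
type SourceObservation struct {
	Reachable bool          // false if the Koku API could not be reached
	Created   bool          // false if Koku rejected POST /sources/
	DataAge   time.Duration // time since the most recent metering data point
}

// StatusOf returns the DCM status and the condition from the table that
// explains it. staleAfter is the configured staleness threshold.
func StatusOf(o SourceObservation, staleAfter time.Duration) (status, reason string) {
	switch {
	case !o.Reachable:
		return "ERROR", "Koku API unreachable"
	case !o.Created:
		return "ERROR", "Source creation failed"
	case o.DataAge > staleAfter:
		return "ERROR", "Operator not uploading"
	default:
		return "READY", "Metering data actively flowing"
	}
}

func main() {
	fresh := SourceObservation{Reachable: true, Created: true, DataAge: 10 * time.Minute}
	stale := SourceObservation{Reachable: true, Created: true, DataAge: 48 * time.Hour}

	fmt.Println(StatusOf(fresh, 24*time.Hour))
	fmt.Println(StatusOf(stale, 24*time.Hour))
}
```

Because the function is re-evaluated on every check, an instance whose data starts flowing again would be reported as READY on the next pass. Whether the lifecycle is meant to allow that transition is exactly what the review below asks.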

Collaborator

Is that correct? It feels strange for the whole SP status to be ERROR just because one creation failed.
I get that when creation fails, the response to DCM for the requested resource is ERROR, but aren't we defining the SP status here?


1. Receive `dcm.status.cluster` CloudEvent with `status: READY`.
2. Read the cluster's labels to select the appropriate catalog item.
3. Submit a `CatalogItemInstance` creation request through DCM's API.

Collaborator

Aaah! OK, I get it now: the SP will send a POST to DCM to request the creation of a cost resource.
What happens if there are several cost SPs listening at the same time? How can we know that the SP requesting the creation will be the one assigned by the policies? Or does it matter?

Comment on lines +421 to +422
will support multi-resource catalog items, enabling "provision cluster + attach
cost metering" in a single request. When that capability is available, the

Collaborator

Exactly, so I believe the bridge feature should be rejected.
This will simplify things and avoid back-and-forth, and the declarative API is a core feature of DCM.

Collaborator

+1


4 participants