diff --git a/docs/future-features/README.md b/docs/future-features/README.md new file mode 100644 index 0000000..7ca80af --- /dev/null +++ b/docs/future-features/README.md @@ -0,0 +1,9 @@ +# Future Features + +Documents in this directory describe design intent that is not part of the current DCM architecture. They are retained for reference and future planning. + +| Document | Description | Status | +|----------|------------|--------| +| kubernetes-compatibility.md | Kubernetes operator integration layer — CRD mappings, operator SDK, compatibility | Future — design intent, not validated against implementation | +| dcm-operator-interface-spec.md | Formal contract for K8s operators integrating with DCM as Service Providers | Future — depends on kubernetes-compatibility | +| dcm-operator-sdk-api.md | SDK API for building DCM-compatible K8s operators | Future — depends on dcm-operator-interface-spec | diff --git a/docs/future-features/dcm-operator-interface-spec.md b/docs/future-features/dcm-operator-interface-spec.md new file mode 100644 index 0000000..690e950 --- /dev/null +++ b/docs/future-features/dcm-operator-interface-spec.md @@ -0,0 +1,1142 @@ +# DCM Operator Interface Specification + +**Document Status:** 📋 Draft — Ready for Implementation Feedback +**Document Type:** Operator Interface Specification + + + +> ## 📋 Draft — Promoted from Work in Progress +> +> All questions resolved. Level 0–4 conformance levels defined. Cluster-scoped resource ownership clarified. CAPI integration specified. +> +> **This section is explicitly a work in progress and is less mature than the core DCM data model and architecture documentation.** +> +> The Kubernetes operator integration layer — including the Operator Interface Specification, Operator SDK API, and Kubernetes compatibility mappings — represents design intent that has not yet been validated against implementation. 
Specific interface contracts, API signatures, SDK method names, and CRD structures **will change** as implementation work begins. +> +> **Do not build against these specifications yet.** They are published to share design direction and invite feedback, not as stable contracts. +> +> Known gaps and open items for this section: +> - Operator Interface Specification: reconciliation hook signatures are provisional +> - Operator SDK API: Go module structure and dependency model not yet finalized +> - Kubernetes Compatibility Mappings: some concept mappings remain under discussion +> - SDK code examples are illustrative only — not yet tested against a real implementation +> +> Feedback and contributions welcome via [GitHub Issues](https://github.com/dcm-project/issues). + + + +**Version:** 0.1.0-draft +**Status:** Draft — Ready for implementation feedback +**Document Type:** Technical Specification +**Maintainers:** Red Hat FlightPath Team +**GitHub:** https://github.com/dcm-project +**Last Updated:** 2026-03 + +--- + +## Abstract + +This specification defines the interface by which Kubernetes operators integrate with the DCM (Data Center Management) control plane as first-class Service Providers. An operator that conforms to this specification becomes a DCM Service Provider, enabling its managed resources to participate in DCM's unified lifecycle management, multi-tenancy, policy governance, cost analysis, drift detection, and service catalog. + +DCM is designed as a superset of Kubernetes — extending Kubernetes' declarative, controller-based model upward to provide unified management across multiple clusters, infrastructure types, and organizational boundaries. This specification is the technical contract that enables that extension without requiring operators to abandon their existing Kubernetes-native design. + +Operators conforming to this specification function as Service Providers within a single DCM instance. 
In federated deployments (Hub-Spoke or Peer topology), the operator registers with the appropriate Regional or local DCM instance — federation routing is handled by DCM, not by the operator. + +--- + +## 1. Introduction + +> **OIS Versioning:** Providers declare the OIS version they implement in capability registration (`ois_version`). DCM maintains dispatch compatibility with all supported OIS versions. See [API Versioning Strategy](../../architecture/control-plane/api-versioning.md) Section 7. + + +### 1.1 Motivation + +Kubernetes operators are the most mature pattern for managing complex, stateful resources declaratively on Kubernetes. However, operators operate within a single cluster and lack the cross-cluster lifecycle management, multi-tenancy, cost attribution, sovereignty governance, and policy enforcement that enterprise organizations require at scale. + +DCM provides these capabilities at the management plane level — above individual clusters. By conforming to this specification, an operator's managed resources become: + +- **Multi-tenant** — DCM Tenant ownership and isolation applied automatically +- **Cost-attributed** — resource costs tracked and attributed across the full lifecycle +- **Policy-governed** — organizational policies applied at request time via DCM's Policy Engine +- **Cross-cluster** — the same resource type managed across multiple clusters through DCM +- **Self-service** — automatically available in the DCM Service Catalog for consumer request +- **Sovereignty-compliant** — placement and operational constraints enforced by DCM's GateKeeper policies +- **Audit-complete** — full provenance chain from intent through realization + +### 1.2 Scope + +This specification defines: +- The HTTP API an operator must expose to participate in DCM +- The data format for all API payloads (DCM Unified Data Model) +- The registration, health, capacity, status, and lifecycle event contracts +- The field mapping specification for translating between DCM 
format and CRD format +- Conformance levels and what each level unlocks in DCM + +This specification does not define: +- How operators implement their internal reconciliation logic +- Which specific Kubernetes distributions operators must support +- The internal architecture of the DCM control plane +- Provider-specific business logic or domain knowledge + +### 1.3 Relationship to the DCM Service Provider Contract + +This specification is a Kubernetes-specific instantiation of the DCM Service Provider Contract. All general Service Provider Contract requirements apply. This specification adds Kubernetes-specific requirements and guidance. Where this specification and the general Service Provider Contract conflict, this specification takes precedence for Kubernetes operator implementations. + +### 1.4 Terminology + +- **Operator** — a Kubernetes controller that manages custom resources via a Custom Resource Definition (CRD) +- **DCM Control Plane** — the DCM management system that routes requests and manages lifecycle +- **Adapter** — a component that sits between DCM and an operator, implementing this specification on the operator's behalf (used when the operator cannot be modified directly) +- **Native implementation** — an operator that implements this specification directly, without an adapter +- **CR** — Custom Resource — an instance of a CRD managed by the operator +- **CRD** — Custom Resource Definition — the Kubernetes schema definition for a CR +- **Reconciliation loop** — the operator's control loop that drives actual state toward desired state + +--- + +## 2. Conformance Levels + +This specification defines three conformance levels. Higher levels unlock additional DCM capabilities. An operator may implement any level — DCM accepts operators at all levels, with capabilities gated by the declared conformance level. + +**Design principle:** Level 1 must be achievable in a single day of work for an existing operator. 
Level 3 is the target for operators that want full DCM integration. The SDK (see Section 9) handles all protocol concerns — operator developers only implement business logic. + +### 2.1 Level 1 — Basic + +**What it requires:** +- Operator registration with DCM on startup +- Health check endpoint (`GET /health`) +- Basic status reporting to DCM when resource state changes + +**What it unlocks:** +- Operator resources appear in the DCM Service Catalog +- Basic lifecycle state tracking (PROVISIONING, OPERATIONAL, FAILED, DECOMMISSIONED) +- Health monitoring via DCM Observability +- Basic cost tracking (resource exists/does not exist) + +**Estimated implementation effort:** 1 day using the DCM Operator SDK + +### 2.2 Level 2 — Standard + +Level 2 conformance is required for providers that support auto-scaling, auto-healing, or provider-side maintenance operations. Level 2 includes all Level 1 requirements plus the Provider Update Notification API (Section 7a). + + +**What it requires:** All Level 1 requirements, plus: +- Capacity reporting to DCM (scheduled registration) +- Full lifecycle event reporting (DEGRADED, MAINTENANCE, UNSANCTIONED_CHANGE, etc.) 
+- Complete realized state payloads in DCM Unified Data Model format +- Field mapping declaration (CRD fields mapped to DCM Resource Type fields) + +**What it unlocks:** All Level 1 capabilities, plus: +- Intelligent placement — DCM can route requests based on real capacity data +- Drift detection — DCM compares discovered state against realized state +- Full cost attribution — granular resource cost tracking throughout lifecycle +- Cross-cluster management — DCM can route the same resource type to multiple clusters +- Dependency graph participation — operator resources participate in DCM entity relationships + +**Estimated implementation effort:** 2-3 days using the DCM Operator SDK + +### 2.3 Level 3 — Full + +**What it requires:** All Level 2 requirements, plus: +- Sovereignty capability declaration +- Field-level provenance in realized state payloads +- Override control metadata support +- Discovery endpoint (`POST /discover`) — operator can discover existing resources for brownfield ingestion +- Decommission confirmation callback + +**What it unlocks:** All Level 2 capabilities, plus: +- Sovereignty enforcement — DCM can enforce placement and operational constraints per regulatory requirements +- Full audit chain — complete provenance from intent through realization +- Brownfield ingestion — existing resources can be imported into DCM lifecycle management +- Override control enforcement — policy-set field locks honored in operator requests + +**Estimated implementation effort:** 3-5 days using the DCM Operator SDK + +--- + +## 3. Registration API + +### 3.1 Overview + +Operators register with DCM on startup. Registration informs DCM of the operator's endpoint, the resource types it manages, its capabilities, and its conformance level. Registration is idempotent — re-registering with the same name updates the existing registration rather than creating a duplicate. 
+ +### 3.2 Registration Endpoint + +**DCM endpoint:** `POST /api/v1/providers` + +**Timing:** Called by the operator (or adapter) during startup, after the HTTP server is ready. Retried with exponential backoff on failure. Registration failure does not block operator startup — the operator functions normally for Kubernetes consumers even if DCM registration fails. + +### 3.3 Registration Payload + +```yaml +# Registration request payload +provider_registration: + name: + display_name: + conformance_level: <1|2|3> + endpoint: + version: + + service_types: + - service_type: + service_type_uuid: + crd_reference: + group: + version: + kind: + operations_supported: [CREATE, READ, UPDATE, DELETE, DISCOVER] + # DISCOVER only required for Level 3 + field_mapping_ref: + + kubernetes: + cluster_id: + cluster_endpoint: + namespace_strategy: + # per_tenant: one namespace per DCM Tenant + # shared: all DCM resources in one namespace, isolated by labels + # per_resource: one namespace per resource instance + + metadata: + region: + zone: + cluster_type: + cluster_version: + + # Level 2+ required + capacity: + update_mode: + update_frequency_seconds: + + # Level 3 required + sovereignty_capabilities: + data_residency_regions: [] + operational_sovereignty: + hard_tenancy_supported: + air_gapped_capable: + compliance_frameworks: [] +``` + +### 3.4 Registration Response + +```yaml +# Success response +provider_registration_response: + provider_uuid: + name: + status: + conformance_level_accepted: <1|2|3> + capabilities_enabled: + - service_catalog + - health_monitoring + - cost_tracking + # Level 2+ + - placement + - drift_detection + - cross_cluster_management + # Level 3 + - sovereignty_enforcement + - brownfield_ingestion + - full_audit_chain +``` + +--- + +## 4. Health Check API + +### 4.1 Overview + +DCM polls the operator's health endpoint every 10 seconds (configurable). A healthy operator is eligible to receive new resource requests. 
An unhealthy operator is excluded from placement decisions. + +### 4.2 Health Endpoint + +**Endpoint:** `GET /health` +**Authentication:** Unauthenticated (or internally secured — operator choice) +**Expected response:** HTTP 200 OK for healthy or warn status; any non-200 for unhealthy (fail) + +The health response body is **normative**. DCM uses the `status` field to determine provider health and trigger alerts. Providers that return a non-conforming or absent body are treated as `warn` until three consecutive failures, after which they are treated as `fail`. + +```http +GET /health HTTP/1.1 + +HTTP/1.1 200 OK +Content-Type: application/health+json + +{ + "status": "pass", // REQUIRED: "pass" | "warn" | "fail" + "version": "", // REQUIRED: provider software version + "dcm_registration_status": "registered", // REQUIRED: "registered" | "unregistered" | "error" + "uptime_seconds": 86423, // RECOMMENDED: seconds since last restart + "checks": { // RECOMMENDED: per-subsystem health + "provider_backend": { + "status": "pass", + "observed_at": "" + }, + "service_provider_connectivity": { + "status": "pass", + "observed_at": "" + } + }, + "details": {} // OPTIONAL: operator-specific additional detail +} +``` + +**Status semantics:** + +| Status | HTTP code | Meaning | DCM behavior | +|--------|-----------|---------|--------------| +| `pass` | 200 | Fully operational | No action | +| `warn` | 200 | Operational but degraded | Fires `provider.degraded` event; alerts platform admin | +| `fail` | any non-200 | Not operational | Fires `provider.unhealthy` event; triggers recovery policy | + +The health endpoint format follows the [`application/health+json` media type](https://www.iana.org/assignments/media-types/application/health+json), described in the IETF draft "Health Check Response Format for HTTP APIs". 
+ +**DCM polling behavior:** +- Polling interval: declared in provider capability registration (`health_check_interval`, default PT30S) +- Consecutive `fail` threshold before `provider.unhealthy` event: 3 (profile-governed) +- Recovery: first `pass` after `fail` fires `provider.healthy` event + +### 4.3 State Machine + +- **Ready** — HTTP 200 received. Operator eligible for new requests. +- **NotReady** — Non-200 or timeout received 3 consecutive times (configurable threshold). Operator excluded from placement. Existing resources not affected. +- **Recovery** — Single HTTP 200 transitions NotReady back to Ready immediately. + +--- + +## 5. Capacity Reporting API + +*Required for Level 2 conformance.* + +### 5.1 Overview + +DCM maintains an internal capacity rating per operator, per service type, per location. Operators report capacity on a configurable schedule. DCM uses capacity data for intelligent placement decisions. + +### 5.2 Capacity Registration + +**DCM endpoint:** `POST /api/v1/providers/{provider_uuid}/capacity` + +```yaml +capacity_report: + provider_id: + report_timestamp: + next_report_at: + capacity_by_service_type: + - service_type_uuid: + available_units: + reserved_units: + committed_units: + unit_definition: + kubernetes_resources: + available_cpu: + available_memory: + available_storage: + node_count: +``` + +### 5.3 Capacity Denial + +When DCM dispatches a request the operator cannot fulfill, the operator **must** reject it with `INSUFFICIENT_RESOURCES`. DCM receives the denial and retries with an alternative provider. + +```yaml +# Denial response to a resource creation request +denial_response: + request_id: + denial_reason: INSUFFICIENT_RESOURCES + denial_timestamp: + service_type_uuid: + estimated_available_at: + details: +``` + +DCM updates its internal capacity rating for this operator immediately upon receiving a denial. + +--- + +## 6. 
Resource Lifecycle API + +### 6.1 Overview + +DCM dispatches resource lifecycle operations to the operator via standard REST endpoints. The operator translates these into Kubernetes CR operations (Naturalization) and reports results back to DCM in DCM Unified Data Model format (Denaturalization). + +### 6.2 Standard Endpoints + +| Method | Endpoint | Description | Required Level | +|--------|----------|-------------|---------------| +| `POST` | `/api/v1/{service_type}` | Create a new resource | Level 1 | +| `GET` | `/api/v1/{service_type}` | List all resources | Level 1 | +| `GET` | `/api/v1/{service_type}/{resource_id}` | Get a specific resource | Level 1 | +| `PUT` | `/api/v1/{service_type}/{resource_id}` | Update a resource | Level 2 | +| `DELETE` | `/api/v1/{service_type}/{resource_id}` | Delete a resource | Level 1 | +| `POST` | `/api/v1/{service_type}/discover` | Discover existing resources | Level 3 | + +### 6.3 Create Request + +DCM sends the Requested State payload to the operator. The operator naturalizes it to a Kubernetes CR and submits it. The operator responds immediately with a PROVISIONING status — not waiting for reconciliation to complete. + +```yaml +# Create request from DCM — Requested State payload in DCM format +create_request: + request_id: + tenant_uuid: + # Both resource_type_uuid and resource_type_name are always present — DCM resolves from consumer input + resource_type_uuid: + resource_type_name: Storage.Database + spec: + + relationships: + + metadata: + override_control: + +``` + +```yaml +# Create response — immediate acknowledgment +create_response: + resource_id: + dcm_request_id: + lifecycle_state: PROVISIONING + kubernetes_reference: + namespace: + name: + uid: +``` + +### 6.4 Realized State Payload + +When the operator's reconciliation loop completes provisioning, it pushes the realized state to DCM. This is the critical Denaturalization step — translating Kubernetes-native status into DCM Unified Data Model format. 
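As a rough illustration of Denaturalization, the sketch below derives a DCM `lifecycle_state` from Kubernetes conditions using the example condition mappings from Section 7.2. The function name `lifecycleState`, the precedence of `Degraded` over `Ready`, and treating a missing condition as `False` are assumptions made for this example.

```go
package main

import "fmt"

// lifecycleState maps condition type -> status (true means "True") onto a
// DCM lifecycle state. Assumed precedence: Degraded, then Ready, then
// Progressing. A condition absent from the map counts as False.
func lifecycleState(conditions map[string]bool) string {
	switch {
	case conditions["Degraded"]:
		return "DEGRADED"
	case conditions["Ready"]:
		return "OPERATIONAL"
	case conditions["Progressing"]:
		return "PROVISIONING" // Ready=False, Progressing=True
	default:
		return "FAILED" // Ready=False, Progressing=False
	}
}

func main() {
	cases := []map[string]bool{
		{"Ready": true},
		{"Ready": false, "Progressing": true},
		{"Ready": false, "Progressing": false},
		{"Ready": true, "Degraded": true},
	}
	for _, c := range cases {
		fmt.Println(lifecycleState(c))
	}
}
```

The resulting value would populate `lifecycle_state` in the realized state payload that the operator pushes to DCM.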
+ +**DCM endpoint:** `PUT /api/v1/instances/{resource_id}/status` + +```yaml +# Realized state payload — DCM Unified Data Model format +realized_state: + resource_id: + dcm_entity_uuid: + lifecycle_state: + realized_timestamp: + + spec: + + + + # Level 3 — provenance for each field + field_provenance: + : + source_type: provider + source_uuid: + timestamp: + + kubernetes_reference: + namespace: + name: + uid: + resource_version: + + relationships: + +``` + +### 6.5 Delete and Decommission + +When DCM requests deletion, the operator deletes the CR and confirms decommission via the realized state endpoint with `lifecycle_state: DECOMMISSIONED`. + +For **Level 3**, the operator must wait for DCM confirmation before deleting — this allows DCM to apply lifecycle policies (retain, detach) before the operator acts. + +```yaml +# Decommission confirmation callback (Level 3) +# DCM calls this before the operator deletes +decommission_confirmation: + resource_id: + lifecycle_policies_applied: + - entity_uuid: + policy_applied: retain + # storage was retained, not deleted with the parent + - entity_uuid: + policy_applied: destroy + proceed_with_deletion: +``` + +--- + + +--- + +## 7a. Provider Update Notification API + +This section defines the Provider Update Notification endpoint — the formal mechanism by which Service Providers report authorized state changes to DCM. This is a **Level 2** conformance requirement for providers that support auto-scaling, auto-healing, or provider-side maintenance operations. + +### 7a.1 Overview + +The Provider Update Notification API enables providers to report authorized state changes so DCM can update its Realized State with a traceable Requested State record. This is distinct from drift — a provider submitting an update notification is asserting that the change was authorized (by a pre-existing policy or operational agreement). DCM evaluates the assertion and decides whether to accept or reject it. 
+ +**Key principle:** Providers never write directly to DCM's Realized State. They submit a notification; DCM processes it through its governance pipeline; DCM writes the Realized State if approved. + +### 7a.2 Conformance Requirements + +| Conformance Level | Requirement | +|------------------|-------------| +| Level 1 — Basic | Not required. Providers at Level 1 report all state changes as lifecycle events; DCM handles them as drift. | +| Level 2 — Standard | Required for providers that implement auto-scaling, auto-healing, or provider-side maintenance. | +| Level 3 — Full | Required. All authorized provider-side state changes must use this API. | + +### 7a.3 Endpoint + +``` +POST /api/v1/provider/entities/{entity_uuid}/update-notification +Host: {dcm-instance} +Authorization: mTLS (provider certificate) +Content-Type: application/json +``` + +**Note:** This endpoint is on the DCM API Gateway, not on the provider. Providers call DCM; DCM does not poll providers for updates. + +### 7a.4 Request Payload + +```json +{ + "provider_uuid": "", + "notification_uuid": "", + "notification_type": "authorized_change | maintenance_change | auto_scale | auto_heal", + "changed_fields": { + "": { + "previous_value": "", + "new_value": "", + "change_reason": "", + "authorizing_policy_ref": "" + } + }, + "effective_at": "", + "provider_evidence_ref": "" +} +``` + +**`notification_uuid`** is an idempotency key. If DCM receives the same `notification_uuid` twice, it acknowledges the second request without reprocessing. + +**`authorizing_policy_ref`** is the UUID of the DCM policy that pre-authorized this type of change. If null, DCM will evaluate whether a policy covers this change. If no policy covers it, the notification is rejected. + +### 7a.5 Response Codes + +| Response | Meaning | +|----------|---------| +| `202 Accepted` | Notification accepted. DCM is processing. Use `notification_status_url` to poll. 
| +| `200 OK` (with `status: approved`) | Notification accepted and Realized State updated. | +| `200 OK` (with `status: pending_approval`) | Notification queued pending consumer approval. Entity in PENDING_REVIEW. | +| `200 OK` (with `status: rejected`) | Notification rejected. Realized State not updated. Discrepancy is now drift. | +| `409 Conflict` | A notification for this entity is already being processed. Retry after the `retry_after` interval. | +| `422 Unprocessable` | Notification payload malformed or entity UUID not found in this provider's scope. | + +```json +{ + "notification_uuid": "", + "status": "approved | pending_approval | rejected", + "realized_state_uuid": "", + "rejection_reason": "", + "retry_after": "", + "notification_status_url": "/api/v1/provider/notifications/{notification_uuid}" +} +``` + +### 7a.6 Notification Status Polling + +``` +GET /api/v1/provider/notifications/{notification_uuid} + +Response: +{ + "notification_uuid": "", + "status": "processing | approved | pending_approval | rejected", + "entity_uuid": "", + "realized_state_uuid": "", + "consumer_approval_required": true | false, + "consumer_notified_at": "", + "resolved_at": "" +} +``` + +### 7a.7 Idempotency + +Provider Update Notifications are idempotent by `notification_uuid`. If DCM crashes between receiving a notification and writing the Realized State, the provider can safely resend the same notification. DCM will not create duplicate Realized State records. 
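The idempotency rule can be sketched as a store keyed by `notification_uuid` that applies each notification at most once and replays the stored outcome for repeats. This is an in-memory illustration with hypothetical names (`notificationStore`, `process`, `result`). Surviving the crash case described above would require durable storage, which this sketch does not model.

```go
package main

import (
	"fmt"
	"sync"
)

// result is a reduced, hypothetical view of the notification response.
type result struct {
	Status            string
	RealizedStateUUID string
}

// notificationStore remembers the outcome for each notification_uuid.
type notificationStore struct {
	mu   sync.Mutex
	seen map[string]result
}

func newNotificationStore() *notificationStore {
	return &notificationStore{seen: make(map[string]result)}
}

// process runs apply at most once per notification_uuid. For a repeated
// UUID it returns the stored result and duplicate=true without calling apply.
func (s *notificationStore) process(notificationUUID string, apply func() result) (r result, duplicate bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if prev, ok := s.seen[notificationUUID]; ok {
		return prev, true
	}
	r = apply()
	s.seen[notificationUUID] = r
	return r, false
}

func main() {
	store := newNotificationStore()
	writes := 0
	apply := func() result {
		writes++ // stands in for writing a Realized State record
		return result{Status: "approved", RealizedStateUUID: "rs-1"}
	}
	first, dup1 := store.process("n-123", apply)
	second, dup2 := store.process("n-123", apply) // provider resends
	fmt.Println(first, dup1)
	fmt.Println(second, dup2)
	fmt.Println("realized state writes:", writes)
}
```

Holding the lock while `apply` runs keeps the sketch simple by serializing all notifications; a production design would likely lock per entity instead.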
+ +### 7a.8 Pre-Authorization Declarations + +Providers may declare categories of updates they routinely make — enabling organizations to pre-authorize them in policy rather than reviewing each one: + +```json +{ + "provider_uuid": "", + "update_capabilities": [ + { + "notification_type": "auto_scale", + "affected_fields": ["cpu_count", "memory_gb"], + "max_change_magnitude": "2x", + "typical_trigger": "Resource utilization threshold" + }, + { + "notification_type": "auto_heal", + "affected_fields": ["storage_device_id", "network_interface_id"], + "max_change_magnitude": "replacement", + "typical_trigger": "Hardware failure" + } + ] +} +``` + +This declaration is part of provider registration (Section 3.3) and is surfaced in the Service Catalog to help consumers understand what provider-side changes they can expect. + + + +--- + +## 7b. Cancellation API + +This section defines the cancellation endpoint that Service Providers implement for Level 2+ conformance. Providers that declare `supports_cancellation: true` in their registration must implement this endpoint. 
+ +### 7b.1 Cancellation Endpoint + +``` +POST /cancel (on the provider, called by DCM) +Authorization: DCM mTLS certificate + +Body: +{ + "cancellation_uuid": "", + "entity_uuid": "", + "requested_state_uuid": "", + "reason": "consumer_requested | timeout | policy_triggered", + "requested_at": "", + "best_effort": true +} +``` + +### 7b.2 Response + +| Code | Meaning | +|------|---------| +| `200 OK` (status: cancelled) | Cancellation clean; no resources provisioned | +| `200 OK` (status: partial_rollback) | Cancellation attempted; some resources may remain | +| `200 OK` (status: too_late) | Provider completed before cancellation arrived; late response forthcoming | +| `409 Conflict` | Already cancelled or already completed | + +```json +{ + "cancellation_uuid": "", + "status": "cancelled | partial_rollback | too_late", + "resources_remaining": [], + "late_response_expected": false, + "notes": "" +} +``` + +### 7b.3 Late Response After Cancellation + +If the provider returns `status: too_late`, it must still send the completed realization response via the standard realized-state callback. DCM's Late Response Pipeline handles this — the provider does not need to do anything different. The `LATE_RESPONSE_RECEIVED` Recovery Policy fires on the DCM side. + +### 7b.4 Capability Declaration + +```json +{ + "cancellation_capabilities": { + "supports_cancellation": true, + "cancellation_supported_during": ["DISPATCHED", "PROVISIONING"], + "partial_rollback_possible": true, + "cancellation_response_time_seconds": 30 + } +} +``` + + +### 6.4 Interim Status Reporting + +For long-running operations (provisioning complex resources, composite service constituents), providers may send interim progress updates to DCM without waiting for terminal status. This gives DCM — and therefore consumers — live visibility into multi-step operations. 
+ +**DCM endpoint for interim status:** + +``` +POST /api/v1/provider/entities/{entity_uuid}/status + +Authorization: Bearer +Content-Type: application/json + +{ + "request_id": "", + "lifecycle_state": "PROVISIONING", // current state — not yet terminal + "progress": { + "step_current": 3, + "step_total": 7, + "step_label": "Configuring network interfaces", + "step_started_at": "", + "estimated_completion": "" + }, + "constituent_status": [ // for compound/composite service definition operations + { "ref": "vm", "status": "REALIZED", "completed_at": "" }, + { "ref": "ip", "status": "REALIZED", "completed_at": "" }, + { "ref": "dns", "status": "PROVISIONING", "started_at": "" }, + { "ref": "storage", "status": "PENDING", "started_at": null } + ], + "notes": "" +} + +Response 202 Accepted +``` + +DCM uses interim status to: +1. Update `current_step` and progress fields in the request status response +2. Publish `request.progress_updated` event (info urgency) to the Message Bus +3. Deliver live status updates to consumers via SSE stream (see Consumer API Section 4.3) + +**Frequency:** Providers should not send interim status more frequently than once per 10 seconds. DCM rate-limits interim status calls per entity_uuid. + +**Terminal status** is still reported via the existing create/update response callback — interim status supplements, not replaces it. + +## 7. Field Mapping Specification + +*Required for Level 2 conformance.* + +### 7.1 Overview + +The field mapping declaration tells DCM how to translate between DCM Unified Data Model fields and the operator's CRD fields. 
This mapping enables DCM to: +- Generate CRs from DCM Requested State payloads (Naturalization) +- Extract DCM Realized State from CR status (Denaturalization) +- Understand which DCM fields correspond to which CRD fields for drift detection + +### 7.2 Field Mapping Declaration Format + +```yaml +field_mapping: + service_type: Storage.Database + service_type_uuid: + crd_reference: + group: postgresql.cnpg.io + version: v1 + kind: Cluster + + # DCM Requested State → Kubernetes CR (Naturalization) + dcm_to_cr: + - dcm_path: resources.cpu + cr_path: spec.instances[0].resources.requests.cpu + transform: + required: true + + - dcm_path: resources.memory + cr_path: spec.instances[0].resources.requests.memory + transform: gigabytes_to_kubernetes_memory + required: true + + - dcm_path: engine + cr_path: spec.imageName + transform: engine_version_to_image + # engine: postgresql, version: 15 → imageName: ghcr.io/cloudnative-pg/postgresql:15 + required: true + + - dcm_path: metadata.name + cr_path: metadata.name + required: true + + - dcm_path: tenant_uuid + cr_path: metadata.labels.dcm-tenant-id + required: true + + - dcm_path: dcm_entity_uuid + cr_path: metadata.labels.dcm-entity-id + required: true + # All DCM-managed CRs must be labeled with their DCM entity UUID + # This enables discovery and drift detection + + # Kubernetes CR status → DCM Realized State (Denaturalization) + cr_status_to_dcm: + - cr_path: status.phase + dcm_path: lifecycle_state + transform: cr_phase_to_dcm_state + # Mapping defined in condition_mappings below + + - cr_path: status.readyInstances + dcm_path: realized_data.ready_instances + transform: none + + - cr_path: status.instancesStatus[0].ip + dcm_path: realized_data.connection.host + transform: none + + - cr_path: status.certificates.serverCASecret + dcm_path: realized_data.tls.ca_secret_ref + transform: none + + # Kubernetes conditions → DCM lifecycle states + condition_mappings: + - kubernetes_condition: "Ready=True" + dcm_lifecycle_state: 
OPERATIONAL + + - kubernetes_condition: "Ready=False,Progressing=True" + dcm_lifecycle_state: PROVISIONING + + - kubernetes_condition: "Ready=False,Progressing=False" + dcm_lifecycle_state: FAILED + + - kubernetes_condition: "Degraded=True" + dcm_lifecycle_state: DEGRADED + + # Kubernetes events → DCM lifecycle events + lifecycle_event_mappings: + - kubernetes_event: condition_change + condition: "Ready=False" + dcm_event: ENTITY_HEALTH_CHANGE + severity: WARNING + + - kubernetes_event: condition_change + condition: "Degraded=True" + dcm_event: DEGRADATION + severity: CRITICAL + + - kubernetes_event: spec_change_without_dcm_request + dcm_event: UNSANCTIONED_CHANGE + severity: WARNING + # Detected when CR spec changes without a corresponding DCM request ID + # Indicates drift — someone modified the CR directly in Kubernetes + + # Namespace strategy implementation + namespace_strategy: + type: per_tenant + namespace_name_pattern: "dcm-{tenant_uuid_short}" + # {tenant_uuid_short} = first 8 chars of tenant UUID + labels_required: + dcm-managed: "true" + dcm-tenant-id: "{tenant_uuid}" + dcm-entity-id: "{entity_uuid}" +``` + +### 7.3 Mandatory CR Labels + +All CRs created by a DCM-conformant operator must carry these labels. These labels enable DCM's discovery and drift detection capabilities: + +| Label | Value | Purpose | +|-------|-------|---------| +| `dcm-managed` | `"true"` | Identifies this CR as DCM-managed | +| `dcm-tenant-id` | DCM Tenant UUID | Tenant ownership | +| `dcm-entity-id` | DCM Entity UUID | Links CR to DCM entity record | +| `dcm-provider-id` | DCM Provider UUID | Which provider created this | +| `dcm-request-id` | DCM Request UUID | Which request created this | + +Any CR change that does not have a corresponding DCM request ID in its update metadata is flagged as an UNSANCTIONED_CHANGE and reported to DCM. + +--- + +## 8. 
Lifecycle Event API + +*Required for Level 2 conformance.* + +### 8.1 Overview + +Operators must notify DCM of any event that affects the operational status of a managed resource. DCM acts as the Tenant advocate — it receives events, evaluates them through the Policy Engine, and determines the appropriate response. + +### 8.2 Event Endpoint + +**DCM endpoint:** `POST /api/v1/instances/{resource_id}/events` + +### 8.3 Standard Event Types + +| Event Type | Trigger | Severity | Required Level | +|------------|---------|----------|---------------| +| `ENTITY_HEALTH_CHANGE` | CR condition changes | INFO/WARNING | Level 2 | +| `DEGRADATION` | Resource is degraded but operational | WARNING | Level 2 | +| `MAINTENANCE_SCHEDULED` | Planned maintenance window | INFO | Level 2 | +| `MAINTENANCE_STARTED` | Maintenance has begun | INFO | Level 2 | +| `MAINTENANCE_COMPLETED` | Maintenance completed | INFO | Level 2 | +| `UNSANCTIONED_CHANGE` | CR modified without DCM request | WARNING | Level 2 | +| `CAPACITY_CHANGE` | Available capacity changed significantly | INFO | Level 2 | +| `DECOMMISSION_NOTICE` | Operator is shutting down | CRITICAL | Level 2 | +| `PROVIDER_DEGRADATION` | Operator itself is degraded | CRITICAL | Level 2 | + +```yaml +# Event payload +lifecycle_event: + event_uuid: + event_type: UNSANCTIONED_CHANGE + provider_id: + resource_id: + dcm_entity_uuid: + event_timestamp: + severity: WARNING + requires_immediate_action: true + + details: + changed_fields: + - field_path: spec.instances[0].resources.requests.cpu + previous_value: "2000m" + current_value: "4000m" + changed_by: + changed_at: + + kubernetes_reference: + namespace: + name: + resource_version: +``` + +--- + +## 9. DCM Operator SDK + +### 9.1 Overview + +The DCM Operator SDK is an open source Go library that handles all DCM protocol concerns for operator developers. Using the SDK, an operator developer only needs to: + +1. Import the SDK +2. Configure field mappings (declarative YAML) +3. 
Add SDK hooks at key points in the reconciliation loop + +The SDK handles registration, health check endpoint exposure, capacity reporting, status translation, lifecycle event emission, provenance generation, and label management. + +### 9.2 SDK Initialization + +```go +import ( + "os" + + dcmsdk "github.com/dcm-project/operator-sdk" + ctrl "sigs.k8s.io/controller-runtime" +) + +func main() { + // Load field mapping configuration + mappings, err := dcmsdk.LoadFieldMappings("dcm-mappings.yaml") + if err != nil { + panic(err) + } + + // Initialize DCM SDK + dcm, err := dcmsdk.New(dcmsdk.Config{ + ProviderName: "cloudnativepg-provider", + DisplayName: "CloudNativePG Service Provider", + ConformanceLevel: dcmsdk.Level2, + DCMEndpoint: os.Getenv("DCM_ENDPOINT"), + OperatorEndpoint: os.Getenv("OPERATOR_ENDPOINT"), + FieldMappings: mappings, + CapacityReporter: &PostgresCapacityReporter{}, + }) + if err != nil { + panic(err) + } + + ctx := ctrl.SetupSignalHandler() + + // Register with DCM on startup (retries in the background) + dcm.Register(ctx) + + // Start HTTP server with DCM endpoints automatically registered. + // StartServer blocks, so it runs in its own goroutine. + go dcm.StartServer(ctx, ":8080") + + // Start operator manager + mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{}) + if err != nil { + panic(err) + } + if err := mgr.Start(ctx); err != nil { + panic(err) + } +} +``` + +### 9.3 Reconciliation Loop Integration + +```go +func (r *ClusterReconciler) Reconcile( + ctx context.Context, + req ctrl.Request, +) (ctrl.Result, error) { + + cluster := &cnpgv1.Cluster{} + if err := r.Get(ctx, req.NamespacedName, cluster); err != nil { + return ctrl.Result{}, client.IgnoreNotFound(err) + } + + // Check if this CR is DCM-managed + if !r.DCM.IsManagedResource(cluster) { + // Not a DCM-managed resource: skip the DCM-specific handling below + return ctrl.Result{}, nil + } + + // Detect unsanctioned changes + if r.DCM.IsUnsanctionedChange(cluster) { + r.DCM.ReportEvent(ctx, cluster, dcmsdk.UnsanctionedChange{ + ChangedFields: r.DCM.DetectChangedFields(cluster), + }) + } + + // ... existing reconciliation logic ... 
+ + // Report current state to DCM + realizedState, err := r.DCM.TranslateStatus(cluster) + if err != nil { + return ctrl.Result{}, err + } + r.DCM.ReportStatus(ctx, cluster, realizedState) + + return ctrl.Result{}, nil +} +``` + +### 9.4 SDK Responsibilities + +The SDK automatically handles: +- Self-registration on startup with retry and exponential backoff +- Health check HTTP endpoint (`GET /health`) +- Capacity reporting on configurable schedule +- CR label injection on creation (`dcm-managed`, `dcm-tenant-id`, etc.) +- Unsanctioned change detection (spec change without DCM request ID) +- Status translation using field mapping configuration +- Lifecycle event formatting and delivery to DCM +- Provenance metadata generation for realized state payloads (Level 3) + +--- + +## 10. Kubernetes-to-DCM Concept Mappings + +Understanding how Kubernetes concepts map to DCM concepts is essential for implementing this specification correctly. + +| Kubernetes Concept | DCM Concept | Notes | +|-------------------|-------------|-------| +| Custom Resource Definition (CRD) | Resource Type Specification | CRD schema maps to DCM Resource Type fields | +| Custom Resource (CR) | Requested State → Realized State | CR is the naturalized form of the DCM payload | +| Operator reconciliation loop | Realization + Drift Detection | Reconciliation IS the realization process | +| CR status subresource | Realized State payload | Status must be denaturalized to DCM format | +| Kubernetes Namespace | DCM Tenant boundary | One namespace per Tenant (per_tenant strategy) | +| ownerReference | Entity Relationship | ownerReferences map to `contains`/`contained_by` relationships | +| Labels/Annotations | DCM Entity metadata | DCM-specific labels declared as mandatory | +| Finalizers | Lifecycle policy enforcement | Finalizers implement `retain` lifecycle policies | +| Kubernetes conditions | DCM lifecycle states | Mapped via condition_mappings declaration | +| Watch events | DCM lifecycle events | 
Kubernetes watch → DCM event translation | +| Kubernetes RBAC | DCM IDM/IAM + Policy Engine | Kubernetes RBAC is the runtime enforcement; DCM Policy Engine governs the request | +| Kubernetes cluster | DCM Resource Type: Platform.KubernetesCluster | The cluster itself is a DCM-managed resource | + +--- + +## 11. Conformance Testing + +### 11.1 Overview + +The DCM project provides a conformance test suite that validates an operator's implementation against this specification. Operators that pass the conformance test suite at their declared level can claim DCM conformance. + +### 11.2 Test Suite Structure + +``` +dcm-operator-conformance/ +├── level1/ +│ ├── registration_test.go +│ ├── health_check_test.go +│ └── basic_status_test.go +├── level2/ +│ ├── capacity_test.go +│ ├── lifecycle_events_test.go +│ ├── realized_state_test.go +│ └── field_mapping_test.go +└── level3/ + ├── sovereignty_test.go + ├── provenance_test.go + ├── discovery_test.go + └── decommission_confirmation_test.go +``` + +### 11.3 Running the Conformance Tests + +```bash +# Run Level 1 conformance tests against a running operator +dcm-conformance test \ + --level 1 \ + --operator-endpoint https://my-operator:8080 \ + --dcm-endpoint https://dcm-control-plane:8080 \ + --service-type Storage.Database + +# Run all levels +dcm-conformance test --level 3 --operator-endpoint ... +``` + +### 11.4 Conformance Certification + +Operators that pass the conformance test suite may: +- Use the "DCM Compatible — Level N" badge in their documentation +- Be listed in the DCM Operator Registry +- Receive inclusion in the DCM default Service Catalog for participating organizations + +--- + +## 12. Security Considerations + +### 12.1 Authentication + +DCM authenticates outbound requests to operators using the trust model established during registration. Operators must validate that incoming requests originate from the DCM control plane. 
The specific authentication mechanism is declared in the provider registration: + +```yaml +trust_declaration: + auth_method: + auth_config: +``` + +### 12.2 Namespace Isolation + +When using the `per_tenant` namespace strategy, operators must enforce that resources in one namespace cannot access resources in another namespace. This is the physical enforcement of DCM's hard tenancy model at the Kubernetes level. + +### 12.3 Unsanctioned Change Detection + +Operators must monitor for changes to DCM-managed CRs that did not originate from a DCM request. Any such change is an UNSANCTIONED_CHANGE event and must be reported to DCM immediately. DCM's Policy Engine determines the appropriate response (REVERT, UPDATE_DEFINITION, ALERT, etc.). + +--- + +## 13. Open Questions + +| # | Question | Impact | Status | +|---|----------|--------|--------| +| 1 | Should the specification be submitted to CNCF as a sandbox project or proposed as a Kubernetes SIG? | Community adoption strategy | ✅ Resolved | +| 2 | Should conformance certification be self-certified (test suite passes) or require DCM project review? | Community trust | ✅ Resolved | +| 3 | How should the specification handle operators that manage cluster-scoped (non-namespaced) resources? | Namespace strategy | ✅ Resolved — Two models: (A) Cluster-as-a-Service: Tenant owns the entire cluster entity including all cluster-scoped resources within it; (B) Shared cluster: cluster-scoped governance resources belong to __platform__ Tenant. Cluster-as-a-Service is the primary model. | +| 4 | Should the SDK support non-Go operator frameworks (Java Operator SDK, Python kopf)? | Ecosystem breadth | ✅ Resolved | +| 5 | How does the specification interact with Kubernetes Cluster API — can CAPI clusters be DCM-managed resources? | Scope | ✅ Resolved | +| 6 | Should there be a Level 0 — a pure label-based passive mode requiring no operator changes? 
| Adoption friction | ✅ Resolved | + +--- + +## Appendix A — Example Implementation Checklist + +### Level 1 Checklist +- [ ] Operator registers with DCM on startup via `POST /api/v1/providers` +- [ ] Registration retried with exponential backoff on failure +- [ ] `GET /health` endpoint returns HTTP 200 when healthy +- [ ] `GET /health` returns non-200 when operator cannot fulfill requests +- [ ] Status reported to DCM when resource transitions to OPERATIONAL, FAILED, or DECOMMISSIONED +- [ ] All DCM-managed CRs labeled with mandatory DCM labels +- [ ] Create response returns PROVISIONING state immediately + +### Level 2 Checklist +- [ ] All Level 1 items complete +- [ ] Capacity reported to DCM on configurable schedule +- [ ] Capacity denial returns `INSUFFICIENT_RESOURCES` with proper payload +- [ ] Full realized state payload in DCM Unified Data Model format +- [ ] Field mapping declaration complete and validated +- [ ] All standard lifecycle event types implemented +- [ ] Unsanctioned change detection active +- [ ] CR condition changes translated to DCM lifecycle events + +### Level 3 Checklist +- [ ] All Level 2 items complete +- [ ] Sovereignty capabilities declared in registration +- [ ] Field-level provenance included in realized state payloads +- [ ] `POST /discover` endpoint implemented +- [ ] Decommission confirmation callback handled +- [ ] Override control metadata honored in CR creation + +--- + +## Appendix B — Relationship to Other Specifications + +- **DCM Data Model** — defines the Unified Data Model format used in all API payloads +- **DCM Service Provider Contract** — the general provider contract this specification extends +- **DCM Resource Type Registry** — where DCM Resource Types are registered; operators must reference registry UUIDs +- **AEP (API Enhancement Proposals)** — the DCM API follows AEP standards for REST API design +- **OpenAPI 3.1.0** — all API schemas are defined in OpenAPI 3.1.0 + +--- + +*This specification is maintained by 
the DCM Project. For questions, contributions, or conformance certification see [GitHub](https://github.com/dcm-project).* + + +## Resolution Notes + +**Q1:** Submit the Operator Interface Specification as a CNCF specification project (not a Sandbox project requiring a working implementation). SIG App Delivery and SIG Cluster Lifecycle engagement happens before submission. See cncf-strategy.md for the full submission strategy. + +**Q2:** Self-certified via automated test suite is the conformance gate — this is the low-friction path that enables broad adoption. An optional 'DCM Verified' badge is available via DCM project review for organizations wanting a higher-trust production claim. This mirrors Kubernetes conformance: automated test suite gates access; CNCF certification provides the badge. + +**Q3:** Two distinct models apply, and it is important to not conflate them: + +**Model A — Cluster as a catalog item (example Service Provider implementation):** A Kubernetes cluster can be offered as a catalog item that any authorized Tenant requests and owns — this is a natural use of DCM's Service Provider model, not a special architectural feature. From DCM's perspective, `Platform.KubernetesCluster` is simply a resource type whose Service Provider happens to provision Kubernetes clusters (e.g., via CAPI). The Tenant owns the resulting cluster entity, including all cluster-scoped resources within it, because the cluster is the resource boundary. This is an example of how DCM's architecture enables complex resources as services — DCM has no special knowledge of Kubernetes; it treats the cluster as any other resource entity. 
+ +**Model B — Shared cluster infrastructure (the exception):** When multiple Tenants share a single cluster (the multi-tenant cluster model), cluster-scoped resources that govern the shared infrastructure itself (admission webhook configurations, cluster-level network policies, CRD registrations) cannot be owned by any single Tenant — they belong to the `__platform__` system Tenant. These are resources that, if modified by a Tenant, would affect all other Tenants on the cluster. The distinction: resources *inside* a Tenant-owned cluster are always Tenant-owned; resources that *govern shared cluster infrastructure* belong to `__platform__`. + +**The rule:** Cluster-scoped resources are owned by the Tenant that owns the cluster. If no single Tenant owns the cluster (shared infrastructure), cluster-scoped governance resources belong to `__platform__`. Operators managing cluster-scoped resources implement the standard base contract. The catalog item scope (`scope: cluster` vs `scope: namespaced`) determines which ownership model applies and what role is required to request it. + +**Q4:** The Operator Interface Specification is a REST/HTTP API specification and is language-agnostic by definition. The Go SDK is the reference implementation. Operators in any language implement the specification directly via HTTP — no language-specific adapter is required. Community SDKs for Java and Python are encouraged as community projects under the DCM umbrella; the DCM project does not maintain them in v1. + +**Q5:** CAPI clusters are `Platform.KubernetesCluster` resources in DCM. The CAPI operator registers as a Service Provider for this resource type. Once provisioned, a CAPI cluster can optionally register with DCM as a nested DCM deployment or as a Service Provider for workload resources (the composite service definition pattern). Sovereignty constraints are enforced at the CAPI provider selection level. + +**Q6:** Level 0 exists as a label-based passive discovery mode. 
Organizations apply DCM labels to existing operator-managed resources. DCM discovers and tracks these resources (they appear in inventory, drift detection runs against them) but DCM does not dispatch to or control them. No operator code changes are required for Level 0. This is the brownfield ingestion model applied to operators — the lowest possible adoption friction. + diff --git a/docs/future-features/dcm-operator-sdk-api.md b/docs/future-features/dcm-operator-sdk-api.md new file mode 100644 index 0000000..eeac6a7 --- /dev/null +++ b/docs/future-features/dcm-operator-sdk-api.md @@ -0,0 +1,728 @@ +# DCM Operator SDK — API Design + +**Document Status:** ✅ Complete +**Document Type:** SDK Reference (Go) +**Related Documents:** [Operator Interface Specification](dcm-operator-interface-spec.md) | [Kubernetes Compatibility](kubernetes-compatibility.md) | [Provider Callback Auth](https://github.com/croadfeldt/udlm/blob/main/contracts/provider-callback-auth.md) | [Registration Specification](../specifications/dcm-registration-spec.md) + +> **AEP Alignment:** DCM interaction uses colon-syntax custom methods (`:approve`, `:suspend`). +> `operation_uuid == request_uuid` — Operations polling uses `GET /api/v1/operations/{uuid}`. +> `resource_type` accepts both FQN string (`Compute.VirtualMachine`) and Registry UUID; +> DCM resolves either form internally. See `schemas/openapi/dcm-operator-api.yaml` +> for the normative operator-facing OpenAPI specification. + + + +> ## 📋 Draft — Promoted from Work in Progress +> +> All questions resolved. Local durable queue, mock test harness, Prometheus metrics, and dynamic field resolution all specified. 
+> +> **This section is explicitly a work in progress and is less mature than the core DCM data model and architecture documentation.** +> +> The Kubernetes operator integration layer — including the Operator Interface Specification, Operator SDK API, and Kubernetes compatibility mappings — represents design intent that has not yet been validated against implementation. Specific interface contracts, API signatures, SDK method names, and CRD structures **will change** as implementation work begins. +> +> **Do not build against these specifications yet.** They are published to share design direction and invite feedback, not as stable contracts. +> +> Known gaps and open items for this section: +> - Operator Interface Specification: reconciliation hook signatures are provisional +> - Operator SDK API: Go module structure and dependency model not yet finalized +> - Kubernetes Compatibility Mappings: some concept mappings remain under discussion +> - SDK code examples are illustrative only — not yet tested against a real implementation +> +> Feedback and contributions welcome via [GitHub Issues](https://github.com/dcm-project/issues). + + + +**Version:** 0.1.0-draft +**Status:** Draft — Ready for implementation feedback +**Document Type:** Technical Design +**Language:** Go +**Repository:** https://github.com/dcm-project/operator-sdk +**Related Documents:** [Foundational Abstractions](https://github.com/croadfeldt/udlm/blob/main/foundations/foundations.md) | [DCM Operator Interface Specification](dcm-operator-interface-spec.md) | [Kubernetes Compatibility](kubernetes-compatibility.md) + +--- + +## 1. Purpose + +This document defines the public API of the DCM Operator SDK — the Go library that enables Kubernetes operators to implement the DCM Operator Interface Specification with minimal code changes. The SDK handles all DCM protocol concerns so that operator developers only need to implement business logic — field mappings and reconciliation hooks. 
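
As a rough illustration of the "field mappings" half of that business logic, the sketch below shows what a custom transform could look like. It reuses the `TransformFunc` signature given in the Field Mapping API section and borrows the transform name `gigabytes_to_kubernetes_memory` from the Operator Interface Specification example. The function body, the accepted input types, and the choice of the `Gi` suffix are assumptions made for illustration only; the SDK does not yet define this behavior.

```go
package main

import (
	"fmt"
	"strconv"
)

// TransformFunc mirrors the signature shown in the Field Mapping API section.
type TransformFunc func(value interface{}) (interface{}, error)

// gigabytesToKubernetesMemory is a hypothetical transform. It converts a
// positive whole-number memory size into a Kubernetes quantity string.
// Assumption: the DCM value is a plain number and the CR expects "<n>Gi".
func gigabytesToKubernetesMemory(value interface{}) (interface{}, error) {
	switch v := value.(type) {
	case int:
		if v <= 0 {
			return nil, fmt.Errorf("memory must be positive, got %d", v)
		}
		return strconv.Itoa(v) + "Gi", nil
	case float64:
		// Generic YAML/JSON decoding often yields float64 for numbers.
		if v <= 0 || v != float64(int64(v)) {
			return nil, fmt.Errorf("memory must be a positive whole number, got %v", v)
		}
		return strconv.FormatInt(int64(v), 10) + "Gi", nil
	default:
		return nil, fmt.Errorf("unsupported memory value type %T", value)
	}
}

func main() {
	var transform TransformFunc = gigabytesToKubernetesMemory
	out, err := transform(8)
	fmt.Println(out, err)
}
```

A transform like this would presumably be registered under its name through the transform registry and then referenced from the YAML mapping file, keeping the conversion rule out of the reconciliation loop.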
+ +**Design principle:** The SDK must be adoptable in a single day. If implementing Level 1 takes more than a day, the API is too complex. + +--- + +## 2. Package Structure + +``` +github.com/dcm-project/operator-sdk/ +├── pkg/ +│ ├── client/ # DCM control plane client +│ ├── config/ # SDK configuration +│ ├── mapping/ # Field mapping engine +│ ├── reconciler/ # Reconciliation loop helpers +│ ├── registration/ # Provider registration +│ ├── server/ # HTTP server with DCM endpoints +│ ├── status/ # Status translation and reporting +│ ├── events/ # Lifecycle event types and emission +│ ├── discovery/ # Brownfield discovery helpers (Level 3) +│ └── provenance/ # Provenance metadata generation (Level 3) +├── api/ +│ └── v1/ # DCM API type definitions +└── examples/ + ├── level1/ # Minimal Level 1 implementation example + ├── level2/ # Full Level 2 implementation example + └── level3/ # Complete Level 3 implementation example +``` + +--- + +## 3. Core Types + +### 3.1 Config + +```go +// Config is the primary SDK configuration structure. +// Optional fields have sensible defaults. ProviderName, DCMEndpoint, +// OperatorEndpoint, and at least one ServiceTypes entry are required. 
+type Config struct { + // Required + ProviderName string + DCMEndpoint string + OperatorEndpoint string + + // Required — at least one ServiceType must be declared + ServiceTypes []ServiceTypeConfig + + // Optional — defaults to Level1 if not specified + ConformanceLevel ConformanceLevel + + // Optional — defaults to "unknown" if not specified + DisplayName string + Version string + + // Optional — field mappings loaded from file if not inline + FieldMappings []FieldMapping + FieldMappingFiles []string + + // Level 2+ — capacity reporter + // If nil and ConformanceLevel >= Level2, SDK returns error on init + CapacityReporter CapacityReporter + + // Level 3 — sovereignty and provenance + SovereigntyCapabilities *SovereigntyCapabilities + + // Optional — HTTP server configuration + ServerConfig ServerConfig + + // Optional — registration retry configuration + RegistrationConfig RegistrationConfig + + // Optional — health check configuration + HealthConfig HealthConfig + + // Optional — logger (defaults to zap logger) + Logger logr.Logger +} + +// ConformanceLevel declares the operator's DCM conformance level +type ConformanceLevel int + +const ( + Level1 ConformanceLevel = 1 + Level2 ConformanceLevel = 2 + Level3 ConformanceLevel = 3 +) + +// ServiceTypeConfig declares a DCM Resource Type this operator implements +type ServiceTypeConfig struct { + // DCM Resource Type name — e.g., "Storage.Database" + ServiceTypeName string + // DCM Resource Type UUID from the registry + ServiceTypeUUID string + // Kubernetes CRD this service type maps to + CRDReference CRDReference + // Operations this operator supports for this type + OperationsSupported []Operation +} + +type CRDReference struct { + Group string + Version string + Kind string +} + +type Operation string + +const ( + OperationCreate Operation = "CREATE" + OperationRead Operation = "READ" + OperationUpdate Operation = "UPDATE" + OperationDelete Operation = "DELETE" + OperationDiscover Operation = "DISCOVER" // 
Level 3 only +) +``` + +### 3.2 Client — DCM Control Plane Interface + +```go +// Client is the interface for communicating with the DCM control plane. +// The SDK creates and manages this internally — operator developers +// use it only through the higher-level SDK methods. +type Client interface { + // Register sends the provider registration to DCM. + // Returns the DCM-assigned provider UUID on success. + Register(ctx context.Context, reg ProviderRegistration) (string, error) + + // ReportStatus sends a realized state payload to DCM. + ReportStatus(ctx context.Context, resourceID string, status RealizedState) error + + // ReportEvent sends a lifecycle event to DCM. + ReportEvent(ctx context.Context, resourceID string, event LifecycleEvent) error + + // ReportCapacity sends a capacity update to DCM. + // Required for Level 2+. + ReportCapacity(ctx context.Context, capacity CapacityReport) error + + // ConfirmDecommission acknowledges a decommission request from DCM. + // Required for Level 3. + ConfirmDecommission(ctx context.Context, resourceID string, confirmation DecommissionConfirmation) error +} +``` + +### 3.3 SDK — Primary Interface + +```go +// SDK is the primary interface for the DCM Operator SDK. +// Operator developers interact with DCM through this interface. +type SDK interface { + // --- Lifecycle --- + + // Register sends the provider registration to DCM. + // Called during operator startup. Retries with exponential backoff. + // Does not block — runs in background goroutine. + Register(ctx context.Context) + + // Shutdown gracefully deregisters the operator from DCM and + // stops background goroutines. + Shutdown(ctx context.Context) error + + // --- HTTP Server --- + + // StartServer starts the HTTP server with all DCM-required endpoints. + // Blocks until context is cancelled. + StartServer(ctx context.Context, addr string) error + + // Handler returns an http.Handler for use with an existing HTTP server. 
+ // Alternative to StartServer when the operator already has an HTTP server. + Handler() http.Handler + + // --- Reconciliation Helpers --- + + // IsManagedResource returns true if the Kubernetes object + // carries DCM management labels. + IsManagedResource(obj client.Object) bool + + // IsUnsanctionedChange returns true if the object's spec has changed + // without a corresponding DCM request annotation. + // Used in reconciliation loops to detect drift. + IsUnsanctionedChange(obj client.Object) bool + + // DetectChangedFields returns the list of fields that changed + // relative to the last known DCM request state. + DetectChangedFields(obj client.Object) []FieldChange + + // InjectLabels adds DCM-required labels to a Kubernetes object + // before creation. Called before submitting a CR to Kubernetes. + InjectLabels(obj client.Object, req CreateRequest) client.Object + + // AnnotateRequest adds the DCM request ID annotation to a + // Kubernetes object. Used to mark changes as DCM-sanctioned. + AnnotateRequest(obj client.Object, requestID string) client.Object + + // --- Status Translation --- + + // TranslateStatus translates a Kubernetes object's status + // to a DCM RealizedState using the configured field mappings. + TranslateStatus(obj client.Object) (RealizedState, error) + + // ReportStatus translates and reports status to DCM in one call. + // Convenience wrapper for TranslateStatus + Client.ReportStatus. + ReportStatus(ctx context.Context, obj client.Object) error + + // --- Event Emission --- + + // ReportEvent sends a lifecycle event to DCM. + ReportEvent(ctx context.Context, obj client.Object, event LifecycleEventType, details EventDetails) error + + // ReportUnsanctionedChange is a convenience method for reporting + // an unsanctioned change event with the detected changed fields. + ReportUnsanctionedChange(ctx context.Context, obj client.Object, changes []FieldChange) error + + // ReportDegradation reports a DEGRADATION event to DCM. 
+ ReportDegradation(ctx context.Context, obj client.Object, reason string) error + + // ReportHealthChange reports an ENTITY_HEALTH_CHANGE event. + ReportHealthChange(ctx context.Context, obj client.Object, healthy bool, reason string) error + + // --- Capacity --- + + // StartCapacityReporting starts the background capacity reporting + // goroutine. Required for Level 2+. Called automatically by StartServer. + StartCapacityReporting(ctx context.Context) + + // --- Discovery (Level 3) --- + + // BuildDiscoveryResponse queries Kubernetes for existing resources + // and returns them in DCM Realized State format. + // Used to implement the POST /discover endpoint. + BuildDiscoveryResponse(ctx context.Context, k8sClient client.Client, opts DiscoveryOptions) ([]RealizedState, error) +} +``` + +--- + +## 4. Field Mapping API + +```go +// FieldMapping declares how a DCM Resource Type maps to a Kubernetes CRD. +// Can be loaded from a YAML file or declared inline in Go. +type FieldMapping struct { + ServiceTypeName string + ServiceTypeUUID string + CRDReference CRDReference + + // DCM Requested State → Kubernetes CR spec (Naturalization) + DCMToCR []FieldMap + + // Kubernetes CR status → DCM Realized State (Denaturalization) + CRStatusToDCM []FieldMap + + // Kubernetes conditions → DCM lifecycle states + ConditionMappings []ConditionMapping + + // Kubernetes events → DCM lifecycle event types + LifecycleEventMappings []LifecycleEventMapping + + // Namespace strategy for this resource type + NamespaceStrategy NamespaceStrategy +} + +// FieldMap declares a single field translation +type FieldMap struct { + // Source field path — dot-notation, supports array indexing + // e.g., "resources.cpu" or "nodes.controlPlane[0].cpu" + SourcePath string + + // Destination field path + DestPath string + + // Transform function name — registered in the transform registry + // "none" for direct copy, or a named transform + Transform string + + // Required — if true and source field is 
absent, returns error + Required bool + + // Default — used when source field is absent and Required is false + Default interface{} +} + +// ConditionMapping maps a Kubernetes condition to a DCM lifecycle state +type ConditionMapping struct { + // Kubernetes condition expression — e.g., "Ready=True" + // Supports AND: "Ready=False,Progressing=True" + KubernetesCondition string + + // DCM lifecycle state + DCMLifecycleState LifecycleState +} + +// LifecycleEventMapping maps a Kubernetes event to a DCM event type +type LifecycleEventMapping struct { + // "condition_change" | "spec_change_without_dcm_request" | "deletion" + KubernetesEvent string + + // Condition that triggers this mapping (for condition_change events) + Condition string + + // DCM event type + DCMEventType LifecycleEventType + + // Severity + Severity EventSeverity +} + +// Transform registry — operator developers register custom transforms +type TransformRegistry interface { + // Register adds a named transform function + Register(name string, fn TransformFunc) error + + // Get retrieves a transform function by name + Get(name string) (TransformFunc, error) +} + +// TransformFunc transforms a value from source to destination format +type TransformFunc func(value interface{}) (interface{}, error) +``` + +--- + +## 5. Status and State Types + +```go +// LifecycleState represents the DCM lifecycle state of a resource +type LifecycleState string + +const ( + LifecycleStateProvisioning LifecycleState = "PROVISIONING" + LifecycleStateOperational LifecycleState = "OPERATIONAL" + LifecycleStateDegraded LifecycleState = "DEGRADED" + LifecycleStateSuspended LifecycleState = "SUSPENDED" + LifecycleStateFailed LifecycleState = "FAILED" + LifecycleStateDecommissioned LifecycleState = "DECOMMISSIONED" +) + +// RealizedState is the DCM Unified Data Model representation of +// a resource's realized state. This is what the operator sends +// to DCM after successful provisioning or status change. 
+type RealizedState struct { + // DCM resource ID (returned by DCM in the create request) + ResourceID string + + // DCM entity UUID + DCMEntityUUID string + + // Current lifecycle state + LifecycleState LifecycleState + + // Timestamp of this realization + RealizedTimestamp time.Time + + // All realized fields in DCM Unified Data Model format + Spec map[string]interface{} + + // Level 3 — field-level provenance + FieldProvenance map[string]FieldProvenance + + // Kubernetes reference for correlation + KubernetesReference KubernetesReference + + // Relationships created during realization + Relationships []RelationshipRecord +} + +// KubernetesReference carries Kubernetes-specific identity for correlation +type KubernetesReference struct { + Namespace string + Name string + UID types.UID + ResourceVersion string + Generation int64 +} + +// FieldProvenance carries lineage for a single field (Level 3) +type FieldProvenance struct { + SourceType string // "provider" + SourceUUID string // operator provider UUID + Timestamp time.Time + Reason string +} +``` + +--- + +## 6. 
Event Types + +```go +// LifecycleEventType represents a DCM lifecycle event type +type LifecycleEventType string + +const ( + EventEntityHealthChange LifecycleEventType = "ENTITY_HEALTH_CHANGE" + EventDegradation LifecycleEventType = "DEGRADATION" + EventMaintenanceScheduled LifecycleEventType = "MAINTENANCE_SCHEDULED" + EventMaintenanceStarted LifecycleEventType = "MAINTENANCE_STARTED" + EventMaintenanceCompleted LifecycleEventType = "MAINTENANCE_COMPLETED" + EventUnsanctionedChange LifecycleEventType = "UNSANCTIONED_CHANGE" + EventCapacityChange LifecycleEventType = "CAPACITY_CHANGE" + EventDecommissionNotice LifecycleEventType = "DECOMMISSION_NOTICE" + EventProviderDegradation LifecycleEventType = "PROVIDER_DEGRADATION" +) + +// EventSeverity represents the severity of a lifecycle event +type EventSeverity string + +const ( + SeverityInfo EventSeverity = "INFO" + SeverityWarning EventSeverity = "WARNING" + SeverityCritical EventSeverity = "CRITICAL" +) + +// LifecycleEvent is the payload sent to DCM for a lifecycle event +type LifecycleEvent struct { + EventUUID string + EventType LifecycleEventType + ProviderID string + ResourceID string + DCMEntityUUID string + EventTimestamp time.Time + Severity EventSeverity + RequiresImmediateAction bool + Details EventDetails + KubernetesReference KubernetesReference +} + +// EventDetails carries event-specific detail data +type EventDetails struct { + // For UNSANCTIONED_CHANGE events + ChangedFields []FieldChange + + // For DEGRADATION events + DegradationReason string + AffectedComponents []string + + // For MAINTENANCE events + MaintenanceWindow *MaintenanceWindow + MaintenanceReason string + + // For CAPACITY_CHANGE events + PreviousCapacity *CapacityReport + CurrentCapacity *CapacityReport + + // Human-readable message for any event type + Message string +} + +// FieldChange describes a single field change in an unsanctioned change event +type FieldChange struct { + FieldPath string + PreviousValue interface{} + 
CurrentValue interface{} + ChangedBy string // Kubernetes user or service account + ChangedAt time.Time +} +``` + +--- + +## 7. Capacity Types + +```go +// CapacityReporter is the interface operator developers implement +// to report capacity data to DCM. The SDK calls this on schedule. +type CapacityReporter interface { + // GetCapacity returns the current capacity for all service types. + // Called by the SDK on the configured reporting schedule. + GetCapacity(ctx context.Context) (CapacityReport, error) +} + +// CapacityReport contains capacity data for all service types +type CapacityReport struct { + ProviderID string + ReportTimestamp time.Time + NextReportAt time.Time + CapacityByServiceType []ServiceTypeCapacity +} + +// ServiceTypeCapacity contains capacity for a single service type +type ServiceTypeCapacity struct { + ServiceTypeUUID string + AvailableUnits int + ReservedUnits int + CommittedUnits int + UnitDefinition string + KubernetesResources KubernetesResourceCapacity +} + +// KubernetesResourceCapacity contains raw Kubernetes resource availability +type KubernetesResourceCapacity struct { + AvailableCPUMillicores int64 + AvailableMemoryBytes int64 + AvailableStorageBytes int64 + NodeCount int +} +``` + +--- + +## 8. Constructor and Initialization + +```go +// New creates and initializes a new DCM SDK instance. +// Returns an error if the configuration is invalid or +// if required components for the declared conformance level +// are missing. +func New(config Config) (SDK, error) + +// NewWithClient creates a new SDK instance with a pre-configured +// DCM client. Used primarily for testing. +func NewWithClient(config Config, client Client) (SDK, error) + +// LoadFieldMappings loads field mapping declarations from YAML files. +// Accepts one or more file paths or glob patterns. +func LoadFieldMappings(paths ...string) ([]FieldMapping, error) + +// MustNew creates a new SDK instance and panics if initialization fails. 
+// Convenience function for use in main() where error handling via
+// panic is acceptable.
+func MustNew(config Config) SDK
+```
+
+---
+
+## 9. Minimal Level 1 Example
+
+```go
+package main
+
+import (
+    "context"
+    "os"
+
+    // CloudNativePG API types, used by the reconciler below
+    cnpgv1 "github.com/cloudnative-pg/cloudnative-pg/api/v1"
+    dcmsdk "github.com/dcm-project/operator-sdk"
+    ctrl "sigs.k8s.io/controller-runtime"
+    "sigs.k8s.io/controller-runtime/pkg/client"
+)
+
+func main() {
+    // Minimal Level 1 configuration
+    dcm, err := dcmsdk.New(dcmsdk.Config{
+        ProviderName:     "my-operator",
+        DisplayName:      "My Operator DCM Provider",
+        DCMEndpoint:      os.Getenv("DCM_ENDPOINT"),
+        OperatorEndpoint: os.Getenv("OPERATOR_ENDPOINT"),
+        ConformanceLevel: dcmsdk.Level1,
+        ServiceTypes: []dcmsdk.ServiceTypeConfig{
+            {
+                ServiceTypeName: "Storage.Database",
+                ServiceTypeUUID: "dcm-registry-uuid-for-storage-database",
+                CRDReference: dcmsdk.CRDReference{
+                    Group:   "postgresql.cnpg.io",
+                    Version: "v1",
+                    Kind:    "Cluster",
+                },
+                OperationsSupported: []dcmsdk.Operation{
+                    dcmsdk.OperationCreate,
+                    dcmsdk.OperationRead,
+                    dcmsdk.OperationDelete,
+                },
+            },
+        },
+        FieldMappingFiles: []string{"dcm-mappings.yaml"},
+    })
+    if err != nil {
+        panic(err)
+    }
+
+    ctx := ctrl.SetupSignalHandler()
+
+    // Register with DCM in background; does not block startup
+    dcm.Register(ctx)
+
+    // Start HTTP server with health + DCM endpoints
+    go dcm.StartServer(ctx, ":8080")
+
+    // Start operator manager (existing code unchanged)
+    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
+    if err != nil {
+        panic(err)
+    }
+    // Start blocks until the context is cancelled or the manager fails
+    if err := mgr.Start(ctx); err != nil {
+        panic(err)
+    }
+}
+
+// In reconciliation loop: minimal Level 1 additions.
+// ClusterReconciler is the operator's existing reconciler; it is assumed
+// to embed client.Client and to hold the SDK instance in a DCM field.
+func (r *ClusterReconciler) Reconcile(
+    ctx context.Context,
+    req ctrl.Request,
+) (ctrl.Result, error) {
+
+    cluster := &cnpgv1.Cluster{}
+    if err := r.Get(ctx, req.NamespacedName, cluster); err != nil {
+        return ctrl.Result{}, client.IgnoreNotFound(err)
+    }
+
+    // Only process DCM-managed resources
+    if !r.DCM.IsManagedResource(cluster) {
+        return ctrl.Result{}, nil
+    }
+
+    // Existing reconciliation logic here... 
+
+    // Report status to DCM (SDK handles translation via field mappings)
+    r.DCM.ReportStatus(ctx, cluster)
+
+    return ctrl.Result{}, nil
+}
+```
+
+---
+
+## 11. Callback Credential Management
+
+The SDK manages the provider callback credential lifecycle automatically.
+
+### 11.1 Credential Storage
+
+```go
+// CallbackCredential is managed internally by the SDK.
+// Operators do not need to handle credential rotation manually.
+type CallbackCredential struct {
+    Value      string    // Bearer token; never logged
+    ValidUntil time.Time // Pre-rotation begins at 50% of lifetime
+    UUID       string    // For audit correlation
+}
+```
+
+### 11.2 Automatic Rotation
+
+The SDK initiates credential rotation before expiry (at 50% of the credential lifetime). During the transition window, the SDK accepts both the old and new credentials simultaneously. Operators do not need to handle rotation; the SDK does it transparently.
+
+```go
+// The SDK emits a credential rotation event when rotation completes.
+// Operators can subscribe to be notified (e.g., to update secret stores).
+// Parameter names avoid shadowing Go's builtin new.
+sdk.OnCredentialRotated(func(oldCred, newCred CallbackCredential) {
+    // Optional: persist new credential to external secret store
+    log.Info("credential rotated", "new_uuid", newCred.UUID)
+})
+```
+
+### 11.3 entity_uuid and resource_id
+
+The SDK ensures `dcm_entity_uuid` is echoed in every response and callback. `resource_id` is the operator's own stable identifier. DCM uses `dcm_entity_uuid` for all internal routing; `resource_id` is stored by DCM as a correlation handle but never used for routing or identity.
+
+```go
+// The SDK automatically populates dcm_entity_uuid from the CreateRequest.
+// Operators set resource_id to their own stable identifier.
+resp := &CreateResponse{
+    ResourceID:     myInternalID,       // operator-assigned
+    DCMEntityUUID:  req.DCMEntityUUID,  // echoed from CreateRequest; SDK validates this
+    LifecycleState: StatePROVISIONING,
+}
+```
+
+---
+
+## 10. Open Questions
+
+> All questions resolved. 
See Resolution Notes below. + + +| # | Question | Impact | Status | +|---|----------|--------|--------| +| 1 | Should the SDK support non-Go operator frameworks via a language-agnostic REST adapter? | Ecosystem breadth | ✅ Resolved | +| 2 | How should the SDK handle DCM endpoint unavailability — queue events locally or drop? | Reliability | ✅ Resolved | +| 3 | Should field mappings support dynamic resolution — a transform that queries external data? | Flexibility | ✅ Resolved | +| 4 | Should the SDK provide a testing framework for unit testing operator-DCM integration? | Developer experience | ✅ Resolved | +| 5 | Should the SDK expose metrics (Prometheus) for DCM registration status, event delivery success, etc.? | Observability | ✅ Resolved | + +--- + + + +## Resolution Notes + +**Q1:** No language-agnostic REST adapter is needed in the Go SDK — the Operator Interface Specification is itself language-agnostic. Operators in any language implement the specification directly via HTTP. Community SDKs for Java/Python are encouraged as community projects. The Go SDK is the reference implementation only. + +**Q2:** Queue locally, always. The SDK maintains a local durable queue (SQLite — simple, no external dependencies) with configurable capacity and TTL. On DCM reconnection, queued events are replayed in order. If the local queue reaches capacity (DCM unavailable for an extended period), the SDK enters DEGRADED mode: new events are still accepted up to the hard capacity limit, then dropped with a QUEUE_OVERFLOW audit record and an alert via the operator's configured alerting channel. Dropping events silently is never acceptable — the system is designed to be the authoritative source of truth. + +**Q3:** Dynamic field resolution is implemented as an Information Provider reference in the field mapping declaration. The SDK declares 'this field resolves from Information Provider X with lookup key Y'. 
DCM resolves the value during layer assembly via the standard Information Provider query. This keeps transformation logic in DCM's Policy Engine where it belongs and is auditable via standard field provenance. + +**Q4:** A mock DCM test harness ships as a first-class component of the SDK. The harness implements the registration, dispatch, cancel, and discover endpoints with configurable behaviors: inject failures, inject delays, return specific payloads, simulate timeout scenarios. Operators use the test harness for unit and integration testing without a live DCM deployment. This is essential for adoption — operators must be able to test DCM integration in CI without a full environment. + +**Q5:** Prometheus metrics are mandatory, not optional. The SDK exposes: registration_status (gauge), event_delivery_total (counter, labels: status=success|failure), event_delivery_duration_seconds (histogram), local_queue_depth (gauge, only when local queuing active), dispatch_duration_seconds (histogram), discovery_cycle_duration_seconds (histogram). Metrics endpoint follows the standard DCM observability model and is required for Level 2 conformance. + +*Document maintained by the DCM Project. For questions or contributions see [GitHub](https://github.com/dcm-project).* diff --git a/docs/future-features/kubernetes-compatibility.md b/docs/future-features/kubernetes-compatibility.md new file mode 100644 index 0000000..3036744 --- /dev/null +++ b/docs/future-features/kubernetes-compatibility.md @@ -0,0 +1,443 @@ +# DCM — Kubernetes Compatibility and Concept Mappings + + +> ## 📋 Draft — Promoted from Work in Progress +> +> All questions resolved. Cluster-as-a-Service model defined. Namespace-to-Tenant mapping, admission webhook model, and managed K8s integration all specified. 
+> +> **This section is explicitly a work in progress and is less mature than the core DCM data model and architecture documentation.** +> +> The Kubernetes operator integration layer — including the Operator Interface Specification, Operator SDK API, and Kubernetes compatibility mappings — represents design intent that has not yet been validated against implementation. Specific interface contracts, API signatures, SDK method names, and CRD structures **will change** as implementation work begins. +> +> **Do not build against these specifications yet.** They are published to share design direction and invite feedback, not as stable contracts. +> +> Known gaps and open items for this section: +> - Operator Interface Specification: reconciliation hook signatures are provisional +> - Operator SDK API: Go module structure and dependency model not yet finalized +> - Kubernetes Compatibility Mappings: some concept mappings remain under discussion +> - SDK code examples are illustrative only — not yet tested against a real implementation +> +> Feedback and contributions welcome via [GitHub Issues](https://github.com/dcm-project/issues). + + + +**Document Status:** ✅ Complete +**Document Type:** Architecture Reference +**Related Documents:** [Foundational Abstractions](https://github.com/croadfeldt/udlm/blob/main/foundations/foundations.md) | [Entity Relationships](https://github.com/croadfeldt/udlm/blob/main/entities/entity-relationships.md) | [Resource Type Hierarchy](https://github.com/croadfeldt/udlm/blob/main/entities/resource-type-hierarchy.md) | [Resource/Service Entities](https://github.com/croadfeldt/udlm/blob/main/entities/resource-service-entities.md) | [DCM Operator Interface Specification](dcm-operator-interface-spec.md) + +--- + +## 1. Purpose + +> **AEP Alignment:** API endpoint references in this document follow [AEP](https://aep.dev) conventions +> (custom methods use colon syntax). 
See `schemas/openapi/dcm-consumer-api.yaml` for the
+> normative OpenAPI specification.
+
+
+DCM is designed as a **superset of Kubernetes**, extending Kubernetes' declarative, controller-based model upward to provide unified management across multiple clusters, infrastructure types, and organizational boundaries that Kubernetes alone cannot address.
+
+This document serves three purposes:
+
+1. **Defines the formal mapping** between Kubernetes concepts and DCM concepts, enabling implementors to understand how the two models relate and where DCM extends beyond Kubernetes
+2. **Establishes DCM Resource Types** for standard Kubernetes resources, so that Kubernetes-managed resources participate in the DCM registry alongside non-Kubernetes resources
+3. **Documents the boundary** between what Kubernetes governs and what DCM governs, making clear that DCM extends Kubernetes rather than replacing it
+
+---
+
+## 2. The Superset Relationship
+
+DCM is a superset of Kubernetes in terms of management scope: everything Kubernetes manages can be brought under DCM lifecycle management, and DCM manages more besides. An organization running Kubernetes exclusively is using a subset of what DCM can manage. DCM does not replace Kubernetes; it manages the lifecycle of Kubernetes clusters and the resources running on them, while Kubernetes remains the runtime.
+
+The superset relationship means DCM can manage Kubernetes-native resources (Deployments, Services, PersistentVolumes) through conformant operators, and it can manage the clusters themselves as catalog items. It also means DCM manages resources that have no Kubernetes equivalent: bare metal, VMs, VLANs, IP allocations, and organizational data entities.
+ +### 2.1 What Kubernetes Provides + +Kubernetes is a container orchestration platform that provides: +- Declarative desired-state management within a single cluster +- A controller/operator pattern for extending resource management +- Namespace-based isolation within a cluster +- RBAC for access control within a cluster +- A rich ecosystem of operators for managing complex stateful resources + +### 2.2 What DCM Adds + +DCM extends Kubernetes upward by providing: + +| Capability | Kubernetes | DCM | +|------------|-----------|-----| +| Scope | Single cluster | Multi-cluster, multi-infrastructure | +| Tenancy | Namespace isolation | First-class Tenant model with ownership | +| Policy | RBAC + admission webhooks | Full Policy Engine with Validation/Transformation/GateKeeper | +| Data lineage | Not provided | Field-level provenance on all data | +| Cost attribution | Not provided | Full lifecycle cost analysis | +| Drift detection | Basic — controller reconciles | Full four-state model with Intent/Requested/Realized/Discovered | +| Service catalog | Not provided | Full self-service catalog with RBAC-governed presentation | +| Sovereignty | Not provided | Sovereignty declarations, placement constraints, compliance evidence | +| Information context | Labels/annotations | First-class Information Provider relationships | +| Non-Kubernetes resources | Not provided | VMware, bare metal, OpenStack, etc. all managed through same model | + +### 2.3 What DCM Does Not Replace + +DCM does not replace Kubernetes at the runtime level. Kubernetes continues to: +- Schedule and run containers +- Manage Pod lifecycle within a cluster +- Enforce network policies within a cluster +- Provide the Kubernetes API for cluster-native tooling +- Run operators that manage complex stateful resources + +DCM manages the management plane — the lifecycle of what gets requested, provisioned, owned, governed, and decommissioned. 
Kubernetes manages the execution plane — the runtime behavior of what is running. + +--- + +## 3. Core Concept Mappings + +### 3.1 Resource Model + +| Kubernetes Concept | DCM Concept | Relationship | Notes | +|-------------------|-------------|--------------|-------| +| Custom Resource Definition (CRD) | Resource Type Specification | CRD schema → DCM Resource Type fields | DCM Resource Type is the portable, provider-agnostic equivalent. CRD is the Kubernetes-specific implementation schema. | +| Custom Resource (CR) | Requested State payload → Realized State entity | CR is the naturalized form of the DCM payload | The operator translates DCM Requested State into a CR (Naturalization) and translates CR status back to DCM Realized State (Denaturalization). | +| Built-in resource (Pod, Service, PV) | DCM Resource Type in Compute.*, Network.*, Storage.* | Kubernetes built-ins are valid DCM Resource Types | See Section 5 for standard Kubernetes resource type mappings. | +| Kubernetes object | Resource/Service Entity | Every Kubernetes object managed by DCM has a corresponding DCM entity with UUID and provenance | | + +### 3.2 Control Loop + +| Kubernetes Concept | DCM Concept | Relationship | Notes | +|-------------------|-------------|--------------|-------| +| Operator reconciliation loop | Realization + Drift Detection combined | Reconciliation IS the realization process — the operator drives actual state toward desired state | DCM's Drift Detection compares Discovered State against Realized State. The operator's reconciliation loop is the mechanism that corrects drift. | +| Desired state (CR spec) | Requested State | CR spec is the naturalized form of the DCM Requested State | DCM stores the Requested State in DCM format. The operator translates it to CR spec format. 
| +| Actual state (CR status) | Realized State | CR status is the Kubernetes-native form of the DCM Realized State | The operator must denaturalize CR status back to DCM Realized State format and report it to DCM. | +| Watch/Inform pattern | DCM Discovered State polling | Kubernetes watch events are the mechanism for keeping DCM Discovered State current | | + +### 3.3 Isolation and Multi-tenancy + +| Kubernetes Concept | DCM Concept | Relationship | Notes | +|-------------------|-------------|--------------|-------| +| Namespace | DCM Tenant boundary | One namespace per DCM Tenant (per_tenant strategy) | Kubernetes namespace provides the physical isolation enforcement. DCM Tenant provides the ownership and governance model. A single DCM Tenant maps to exactly one namespace per cluster. | +| Namespace | DCM Resource Group | In shared namespace strategies, Resource Group labels replace namespace isolation | When multiple Tenants share a namespace, DCM Resource Group labels provide logical separation. | +| Kubernetes RBAC | DCM IDM/IAM + Policy Engine | Kubernetes RBAC is the runtime enforcement mechanism. DCM Policy Engine governs who can request what via the service catalog. | DCM policies determine what a user can request. Kubernetes RBAC determines what a running workload can do. These are complementary, not duplicative. | +| ServiceAccount | DCM Identity.ServiceAccount Information Type | Kubernetes ServiceAccounts that DCM provisions or references are modeled as DCM Information Type entities | | + +### 3.4 Relationships and Dependencies + +| Kubernetes Concept | DCM Concept | Relationship | Notes | +|-------------------|-------------|--------------|-------| +| ownerReference | Entity Relationship (`contains`/`contained_by`) | Kubernetes ownerReferences are a subset of DCM entity relationships — ownership only | DCM relationships are richer — supporting `requires`, `depends_on`, `references`, `peer`, `manages` in addition to ownership. 
During Denaturalization, ownerReferences are translated to DCM `contains` relationships. | +| Finalizers | Lifecycle policy (`retain`, `detach`) | Kubernetes finalizers implement DCM lifecycle policies at the Kubernetes level | When DCM declares `on_parent_destroy: retain` for a storage entity, the operator implements this using Kubernetes finalizers to prevent deletion until DCM confirms the lifecycle policy has been applied. | +| Label selectors | Resource Group membership | Kubernetes label selectors used for DCM Resource Group filtering | DCM mandatory labels (`dcm-tenant-id`, `dcm-entity-id`) are used as label selectors for Resource Group queries. | + +### 3.5 Data Model + +| Kubernetes Concept | DCM Concept | Relationship | Notes | +|-------------------|-------------|--------------|-------| +| Labels | DCM entity metadata + relationships | DCM-mandatory labels (`dcm-managed`, `dcm-tenant-id`, `dcm-entity-id`, etc.) carry core DCM identity data. Custom labels may map to DCM Information Type relationships. | | +| Annotations | DCM field-level provenance + metadata | Annotations used by DCM to carry request correlation data during the request lifecycle | `dcm-request-id` annotation on a CR identifies the DCM request that created or last modified it — enabling unsanctioned change detection. | +| Resource version | Entity version (Revision component) | Kubernetes resource versions map to DCM entity Revision increments | Major and Minor versions are managed by DCM based on breaking/non-breaking changes. Kubernetes resource version increments map to DCM Revision increments. | +| Generation | Requested State version | CR generation increments correspond to new DCM Requested State records | Each new generation of a CR corresponds to a new intent/request cycle in DCM. 
| + +### 3.6 Lifecycle + +| Kubernetes Concept | DCM Concept | Relationship | Notes | +|-------------------|-------------|--------------|-------| +| Pod phases (Pending, Running, Succeeded, Failed, Unknown) | DCM lifecycle states | Pod phases map to DCM lifecycle states via condition_mappings declaration | | +| CRD conditions | DCM lifecycle states and events | Standard conditions (Ready, Degraded, Progressing) map to DCM states and events via the field mapping specification | | +| Kubernetes events | DCM lifecycle events | Kubernetes watch events trigger DCM lifecycle event reports | The operator translates Kubernetes events into DCM lifecycle event types (ENTITY_HEALTH_CHANGE, DEGRADATION, UNSANCTIONED_CHANGE, etc.) | +| Cluster deletion | DCM decommission workflow | Cluster deletion triggers DCM's full decommission lifecycle — lifecycle policies applied to all related entities | | + +--- + + +## 3a. Cluster as a Service — The Primary Model + +A Kubernetes cluster is a first-class catalog item in DCM. Any authorized Tenant can request and own a cluster through the service catalog, the same way they request a VM or a network. This is not a special case — it is the expected primary consumption model for Kubernetes infrastructure in DCM. + +**How it works:** + +```yaml +catalog_item: Platform.KubernetesCluster +provider: CAPI-based Service Provider (or managed K8s Service Provider) +tenant_uuid: + +entity: + resource_type: Platform.KubernetesCluster + tenant_uuid: # Tenant owns the cluster + lifecycle_state: OPERATIONAL + fields: + kubernetes_version: "1.29" + node_count: 3 + api_endpoint: "https://cluster-01.eu-west.example.com" + kubeconfig_ref: # via credential management service +``` + +**Ownership scope:** When a Tenant owns a `Platform.KubernetesCluster` entity, that Tenant owns everything within the cluster boundary — including cluster-scoped resources (ClusterRoles, StorageClasses, PersistentVolumes, CRDs registered for that cluster). 
The cluster entity is the ownership boundary. DCM treats the cluster as an opaque resource from a Tenant ownership perspective — the Tenant gets the cluster; what's inside it belongs to them. + +**The composite service definition pattern:** A Cluster-as-a-Service catalog item typically composes multiple constituent resources: +```yaml +Platform.KubernetesCluster → constituent providers: + - Compute resources (control plane + worker nodes) + - Network resources (load balancer, ingress) + - Storage resources (CSI driver + storage class) + - DNS records (cluster API endpoint) + - Credential issuance (kubeconfig via credential management service) +``` + +This is a composite service definition — the cluster catalog item orchestrates all constituents and presents a single entity to the Tenant. + +**Sovereignty and accreditation:** Cluster placement follows the standard Placement Engine model. Sovereignty constraints declared by the Tenant apply to cluster placement — a GDPR-scoped Tenant requesting a cluster gets a cluster placed in an EU sovereignty zone. The CAPI provider (or managed K8s Service Provider) must hold appropriate accreditations. + +**Post-provision:** Once the cluster is OPERATIONAL, it can optionally register with DCM as a nested Service Provider for workload resources. The Tenant can then request workload resources (Deployments, Services, PersistentVolumes) against their cluster through the same DCM service catalog. This creates the superset model: DCM provisions the cluster → cluster becomes a workload Service Provider → Tenant uses DCM to manage workloads on their cluster. + + +## 4. Where DCM Extends Beyond Kubernetes + +These are capabilities that exist in DCM but have no Kubernetes equivalent. None of these require Kubernetes to be present — they operate across all provider types. For organizations running pure Kubernetes estates, these are the capabilities DCM brings that Kubernetes tooling alone cannot provide. 
+
+**Summary of extensions:**
+
+| DCM Capability | Kubernetes Gap |
+|---------------|---------------|
+| Intent State | No concept of original consumer intent separate from desired state |
+| Field-Level Provenance | No field lineage; a field is a field |
+| Data Layers and Assembly | No layering model; manifests are flat declarations |
+| Policy Engine | Admission webhooks are cluster-scoped, admission-time only |
+| Cost Analysis | No native cost attribution in the request lifecycle |
+| Information Providers | No structured external organizational data relationships |
+| Cross-Cluster Lifecycle | Single-cluster scope; multi-cluster requires external tooling |
+
+These capabilities are what justify the superset positioning. Each is described below.
+
+### 4.1 Intent State
+
+Kubernetes has no concept of a consumer's original intent separate from the desired state. Once you apply a manifest, Kubernetes only knows the current desired state, not what the consumer originally asked for or why.
+
+DCM's Intent State is the immutable record of what the consumer asked for, stored before any policy processing or layer enrichment. This enables:
+- Rehydration: replaying the original intent through current policies to produce a new request
+- Intent portability: the same intent applied to a different provider
+- Audit: answering "what did the consumer originally ask for?" independently of what was realized
+
+### 4.2 Field-Level Provenance
+
+Kubernetes has no concept of where a field value came from or why it was set. A field in a CR spec is a field; there is no lineage.
+
+DCM's field-level provenance carries the full lineage of every field value through the entire lifecycle: which layer set it, which policy modified it, which provider realized it, and why each change was made. This enables complete audit trails and sovereignty evidence.
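
To make the lineage idea concrete, here is a minimal sketch of an append-only provenance log for a single field. It reuses the `FieldProvenance` field names from the SDK draft; everything else is an assumption made for illustration, including the `ProvenanceLog` type, its methods, the `"layer"` and `"policy"` source-type strings, and the sample UUIDs. It is not the DCM implementation.

```go
package main

import (
	"fmt"
	"time"
)

// FieldProvenance mirrors the field names used in the SDK draft.
type FieldProvenance struct {
	SourceType string // "provider" in the draft; "layer" and "policy" are assumed here
	SourceUUID string
	Timestamp  time.Time
	Reason     string
}

// ProvenanceLog is a hypothetical helper: field path -> ordered lineage.
type ProvenanceLog map[string][]FieldProvenance

// Record appends one lineage entry; earlier entries are never overwritten.
func (p ProvenanceLog) Record(fieldPath string, entry FieldProvenance) {
	p[fieldPath] = append(p[fieldPath], entry)
}

// LastWriter returns the most recent entry for a field, if any exists.
func (p ProvenanceLog) LastWriter(fieldPath string) (FieldProvenance, bool) {
	entries := p[fieldPath]
	if len(entries) == 0 {
		return FieldProvenance{}, false
	}
	return entries[len(entries)-1], true
}

// buildExampleLineage records a layer default, a policy change, and a
// provider realization for the same field, in that order.
func buildExampleLineage() ProvenanceLog {
	t0 := time.Date(2026, time.March, 1, 0, 0, 0, 0, time.UTC)
	lineage := ProvenanceLog{}
	lineage.Record("spec.storage_gb", FieldProvenance{
		SourceType: "layer", SourceUUID: "layer-uuid-1",
		Timestamp: t0, Reason: "organizational default",
	})
	lineage.Record("spec.storage_gb", FieldProvenance{
		SourceType: "policy", SourceUUID: "policy-uuid-7",
		Timestamp: t0.Add(time.Second), Reason: "tenant quota cap",
	})
	lineage.Record("spec.storage_gb", FieldProvenance{
		SourceType: "provider", SourceUUID: "provider-uuid-3",
		Timestamp: t0.Add(2 * time.Second), Reason: "rounded to allocation unit",
	})
	return lineage
}

func main() {
	lineage := buildExampleLineage()
	for _, e := range lineage["spec.storage_gb"] {
		fmt.Printf("%-8s %-16s %s\n", e.SourceType, e.SourceUUID, e.Reason)
	}
}
```

Keeping the log append-only is the design point: an audit question such as "who last changed this field, and why" becomes a lookup rather than a reconstruction.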
+ +### 4.3 Data Layers and Assembly + +Kubernetes has no equivalent to DCM's layering model. A Kubernetes manifest is a flat declaration — there is no concept of organizational standards, site-specific configuration, and service-specific configuration being separate layers that compose into a final manifest. + +DCM's layering model enables 36 layer definitions to govern 40,000 VMs without duplication — impossible in the Kubernetes model. + +### 4.4 Policy Engine + +Kubernetes admission webhooks provide some policy capability (validation, mutation) but are cluster-scoped, apply at admission time only, and have no concept of hierarchy (Global → Tenant → User policy levels) or field-level override control. + +DCM's Policy Engine operates at the management plane level, applies across all clusters and providers, enforces a three-level hierarchy with field-level override control (allow/constrained/immutable), and carries policy decisions as provenance metadata in the payload. + +### 4.5 Cost Analysis + +Kubernetes has no native cost attribution model. Tools like Kubecost exist but are add-ons with no integration into the request lifecycle. + +DCM's cost analysis is built into the lifecycle model — cost attribution is tracked from request time through realization, operation, and decommission for every entity. + +### 4.6 Information Providers + +Kubernetes has no concept of structured relationships to external organizational data (Business Units, Cost Centers, Product Owners). Labels and annotations are unstructured key-value pairs with no type safety, no external system integration, and no verification model. + +DCM's Information Provider model gives every entity structured, verified, versioned relationships to external organizational data with a stable external key model. + +### 4.7 Cross-Cluster Lifecycle + +Kubernetes manages resources within a single cluster. 
Multi-cluster management requires additional tools (ACM, Argo CD, Fleet) that are not part of the core Kubernetes model. + +DCM manages the lifecycle of resources across multiple clusters as a first-class capability — the same Resource Type can be instantiated on any cluster that has a conformant Service Provider registered. + +--- + +## 5. Standard Kubernetes Resource Type Mappings + +These are the DCM Resource Type registry entries for standard Kubernetes resource types. Operators implementing these types should use these registry UUIDs and field definitions. + +### 5.1 Compute + +| DCM Resource Type | Kubernetes Equivalent | Notes | +|------------------|----------------------|-------| +| `Compute.Pod` | Pod | Lowest-level compute unit | +| `Compute.Container` | Container (within a Pod) | Sub-entity of Pod — expanded via bundled declaration | +| `Compute.Deployment` | Deployment | Managed set of Pods | +| `Compute.StatefulSet` | StatefulSet | Stateful managed set of Pods | +| `Compute.Job` | Job | One-time execution workload | +| `Compute.CronJob` | CronJob | Scheduled execution workload | + +### 5.2 Network + +| DCM Resource Type | Kubernetes Equivalent | Notes | +|------------------|----------------------|-------| +| `Network.Service` | Service | In-cluster service discovery and load balancing | +| `Network.Ingress` | Ingress | External HTTP/HTTPS routing | +| `Network.NetworkPolicy` | NetworkPolicy | In-cluster network isolation | + +### 5.3 Storage + +| DCM Resource Type | Kubernetes Equivalent | Notes | +|------------------|----------------------|-------| +| `Storage.PersistentVolume` | PersistentVolume | Cluster-level storage resource | +| `Storage.PersistentVolumeClaim` | PersistentVolumeClaim | Consumer's storage declaration — expanded into Storage.PersistentVolume relationship | +| `Storage.StorageClass` | StorageClass | Storage type definition — maps to DCM Provider Catalog Item | +| `Storage.ConfigMap` | ConfigMap | Configuration data storage | +| 
`Storage.Secret` | Secret | Sensitive data storage | + +### 5.4 Platform + +| DCM Resource Type | Kubernetes Equivalent | Notes | +|------------------|----------------------|-------| +| `Platform.KubernetesCluster` | Kubernetes Cluster (via CAPI or managed service) | The cluster itself is a DCM-managed resource | +| `Platform.Namespace` | Namespace | Maps to DCM Tenant boundary in per_tenant strategy | +| `Platform.CustomResourceDefinition` | CRD | CRD registration maps to DCM Resource Type registration | + +### 5.5 Identity + +| DCM Resource Type | Kubernetes Equivalent | Notes | +|------------------|----------------------|-------| +| `Security.ServiceAccount` | ServiceAccount | Kubernetes identity for workloads | +| `Security.Role` | Role / ClusterRole | Kubernetes RBAC role | +| `Security.RoleBinding` | RoleBinding / ClusterRoleBinding | Kubernetes RBAC binding | + +--- + +## 6. The Kubernetes Information Provider + +Kubernetes clusters function as both Service Providers (for provisioning resources) and Information Providers (for querying existing state). As an Information Provider, a Kubernetes cluster exposes its current resource state to DCM for: + +- **Brownfield ingestion** — discovering existing resources and bringing them under DCM lifecycle management +- **Discovered State** — DCM's Discovered State for Kubernetes resources comes from querying the Kubernetes API +- **Drift detection** — comparing DCM Realized State against what Kubernetes actually has + +### 6.1 Kubernetes as Information Provider Registration + +```yaml +information_provider_registration: + name: kubernetes-cluster-01 + implements: + - information_type: Platform.KubernetesCluster + - information_type: Compute.Pod + - information_type: Storage.PersistentVolume + # ... 
all resource types the cluster contains + endpoint: + kubernetes_credentials: + auth_method: + discovery_capabilities: + label_selector: "dcm-managed=true" + # Only returns DCM-managed resources by default + full_discovery: true + # Can also return all resources for brownfield ingestion +``` + +### 6.2 Discovered State from Kubernetes + +DCM queries the Kubernetes API using the Kubernetes Information Provider to populate Discovered State: + +``` +DCM Drift Detection + │ + ▼ +Kubernetes Information Provider + │ GET /apis/{group}/{version}/namespaces/{ns}/{resource} + │ Filter: label dcm-entity-id = {entity_uuid} + ▼ +Discovered State payload (DCM format) + │ Kubernetes object denaturalized to DCM format + ▼ +Compare against Realized State + │ Field-by-field comparison + ▼ +UNSANCTIONED_CHANGE if differences found + │ Reported to Policy Engine for response determination +``` + +--- + +## 7. Kubernetes-Native Patterns and DCM Equivalents + +### 7.1 GitOps + +GitOps tooling for Kubernetes (Argo CD, Flux) keeps Kubernetes manifests in Git and synchronizes them to clusters. DCM's data model is also Git-based: all layers, Resource Type definitions, and policy definitions are stored in Git. + +The relationship: DCM manages the **request lifecycle** (what gets asked for, approved, and provisioned). GitOps manages the **deployment lifecycle** (what gets deployed to a cluster from a Git repository). These are complementary: + +- DCM governs the provisioning request: "is this consumer allowed to provision this resource?" +- GitOps deploys application code to the provisioned resource +- DCM and GitOps together form a complete lifecycle: DCM provisions the cluster, GitOps deploys applications to it + +### 7.2 Helm + +Helm charts are packages of Kubernetes manifests that can be parameterized. In DCM terms, a Helm chart is a form of Catalog Item: a curated, parameterized offering of a set of Kubernetes resources.
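
To make the Catalog Item analogy concrete, here is a minimal sketch of how a flat set of consumer-facing parameters might be mapped onto a chart's nested values. Everything named here is an assumption for illustration: the parameter names (`replicas`, `image_tag`, `storage_gb`), the Helm value paths, and the `to_helm_values` helper are hypothetical and are not defined by DCM, the Operator SDK, or any particular chart.

```python
from typing import Any

# Hypothetical mapping from Catalog Item parameter names (left) to dotted
# Helm value paths (right). Neither side is defined by DCM or by any chart.
PARAM_TO_HELM_PATH = {
    "replicas": "replicaCount",
    "image_tag": "image.tag",
    "storage_gb": "persistence.size",
}


def to_helm_values(params: dict[str, Any]) -> dict[str, Any]:
    """Translate flat catalog parameters into a nested Helm values dict."""
    values: dict[str, Any] = {}
    for name, value in params.items():
        path = PARAM_TO_HELM_PATH.get(name)
        if path is None:
            # Only parameters the catalog item exposes may be set.
            raise KeyError(f"parameter {name!r} is not offered by this catalog item")
        if name == "storage_gb":
            # Assumption: the chart expects a Kubernetes quantity string.
            value = f"{value}Gi"
        # Walk (and create) the nested dicts named by the dotted path.
        node = values
        *parents, leaf = path.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return values


requested = {"replicas": 3, "image_tag": "1.4.2", "storage_gb": 20}
helm_values = to_helm_values(requested)
print(helm_values)
```

The resulting dictionary could be serialized to a values file and passed to `helm install` or `helm upgrade` with `-f`. Rejecting unknown parameters keeps the offering curated: the consumer can only set what the catalog item chooses to expose, not arbitrary chart values.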
+ +DCM does not replace Helm — it can use Helm as a delivery mechanism inside a Service Provider. The Service Provider receives the DCM Requested State, translates it to Helm values, and uses Helm to deploy the resources. The operator pattern is preferred for Day 2 management (Helm has limited reconciliation), but Helm remains valid for initial provisioning. + +### 7.3 Cluster API (CAPI) + +CAPI is the Kubernetes sub-project for managing Kubernetes clusters themselves using the Kubernetes API and operator pattern. CAPI clusters are a natural fit for DCM's `Platform.KubernetesCluster` Resource Type — a CAPI-based operator would be the Service Provider for provisioning new Kubernetes clusters as DCM-managed resources. + +This is particularly significant: DCM managing the lifecycle of Kubernetes clusters through CAPI means DCM can provision the very infrastructure that operators run on. The superset relationship becomes concrete — DCM provisions the cluster, the cluster runs the operators, the operators provision the resources that DCM manages. + +--- + +## 8. Incremental Adoption — Kubernetes-Native to DCM-Managed + +Organizations running Kubernetes can adopt DCM incrementally across these phases: + +### Phase 1 — Observation (no operator changes) +Deploy DCM with the Kubernetes Information Provider. DCM observes existing resources via the Kubernetes API and builds a Discovered State inventory. No changes to existing operators or workloads. + +### Phase 2 — Brownfield Ingestion (no operator changes) +DCM promotes Discovered State records to Realized State — assuming lifecycle management of existing resources. Resources get DCM UUIDs, Tenant assignments, and provenance records. Existing resources are now DCM-managed without any operator changes. + +### Phase 3 — Level 1 Conformance (minimal operator changes) +Operators implement Level 1 of this specification via the DCM Operator SDK. New resources are provisioned through DCM's service catalog. 
Existing resources managed via brownfield ingestion continue as-is. + +### Phase 4 — Level 2 Conformance (moderate operator changes) +Operators implement Level 2 — full field mappings, capacity reporting, lifecycle events. DCM gains placement intelligence, drift detection, and cross-cluster management capabilities. + +### Phase 5 — Level 3 Conformance (complete integration) +Operators implement Level 3 — sovereignty declarations, provenance, discovery endpoint. Full DCM capabilities available. + +--- + +## 9. Open Questions + +| # | Question | Impact | Status | +|---|----------|--------|--------| +| 1 | How does the Namespace-to-Tenant mapping work when a cluster has existing namespaces that predate DCM adoption? | Brownfield migration | ✅ Resolved | +| 2 | Should `Platform.KubernetesCluster` be the boundary for a DCM deployment, or can DCM manage resources across clusters without treating the cluster as a DCM entity? | Architecture scope | ✅ Resolved | +| 3 | How does DCM interact with Kubernetes admission webhooks — do they duplicate Policy Engine functions or complement them? | Policy model | ✅ Resolved | +| 4 | Should the Kubernetes Information Provider be a built-in DCM component or a separately deployed provider? | Deployment architecture | ✅ Resolved | +| 5 | How does the DCM superset model interact with managed Kubernetes services (EKS, GKE, AKS) where cluster management is outside the user's control? | Cloud provider integration | ✅ Resolved | + +--- + +## 10. 
Related Concepts + +- **DCM Operator Interface Specification** — the technical contract for operators integrating with DCM +- **DCM Operator SDK** — Go library implementing this specification for operator developers +- **Entity Relationships** — DCM's universal relationship model, of which Kubernetes ownerReferences are a subset +- **Resource Type Hierarchy** — the DCM registry where Kubernetes Resource Types are registered +- **Information Providers** — the DCM model for the Kubernetes API as a discoverable information source +- **Four States** — DCM's Intent/Requested/Realized/Discovered model, which extends Kubernetes' desired/actual model + +--- + + + +## Resolution Notes + +**Q1:** Pre-existing namespaces are handled by the brownfield ingestion model. Each namespace maps to one DCM Tenant. Resources without clear ownership land in the `__transitional__` Tenant and are promoted by a platform admin. Same flow as brownfield VM ingestion — no special handling required. + +**Q2:** DCM manages resources across multiple clusters simultaneously. `Platform.KubernetesCluster` is a DCM-managed resource type — both something DCM provisions as a catalog item (Cluster as a Service) and something DCM tracks when externally provisioned. A Tenant can own a full cluster as a catalog item; the cluster is not the boundary of a DCM deployment. DCM's organizational boundary is the Tenant. A single DCM deployment routes requests to Service Providers across many clusters, and can provision new clusters as service catalog items. + +**Q3:** Admission webhooks and the DCM Policy Engine are complementary layers, not duplicates. Admission webhooks enforce cluster-native policy (security contexts, image policies, resource quotas). The DCM Policy Engine enforces DCM request policy (business rules, data governance, sovereignty). A DCM-managed workload resource is validated by both — DCM Policy Engine before dispatch, admission webhook at the cluster. This is defense in depth. 
+ +**Q4:** The Kubernetes Information Provider is a separately deployed provider that registers with DCM as a standard Information Provider. It serves cluster state, namespace inventory, and workload status. There are no built-in Information Providers in DCM's architecture — all Information Providers follow the unified base contract and are independently deployable. + +**Q5:** Managed Kubernetes services (EKS, GKE, AKS) register as Service Providers of resource type `Platform.ManagedKubernetesCluster`. DCM manages workload resources within the cluster (Deployments, Services, PersistentVolumes) but explicitly does not manage the cluster control plane. Sovereignty enforcement applies at cluster selection — DCM places workloads on clusters satisfying sovereignty constraints. The cloud provider manages cluster infrastructure. + +*Document maintained by the DCM Project. For questions or contributions see [GitHub](https://github.com/dcm-project).*