31 changes: 31 additions & 0 deletions architecture/adr/001-why-dcm-exists.md
@@ -0,0 +1,31 @@
# ADR-001: Why DCM Exists

**Status:** Accepted
**Date:** March 2026

## Context

Enterprise data centers run hundreds of thousands of resources — VMs, containers, network segments, storage volumes — across multiple infrastructure platforms. Today, each platform has its own provisioning workflow, API, data format, and lifecycle model. The result:

- **No unified view of what's deployed.** Intended state, deployed state, and actual state diverge silently. Nobody can answer "what's running, who owns it, and does it match what was approved?"
- **No consistent governance.** Policy enforcement is tribal knowledge. Security reviews are manual gates. Compliance is verified after the fact rather than enforced at request time.
- **No common abstraction.** A team requesting a VM goes through one process; requesting a database goes through another; requesting a three-tier application requires manually coordinating both plus networking.

Public cloud solves this with unified control planes (AWS CloudFormation, Azure Resource Manager, GCP Deployment Manager). On-premises infrastructure has no equivalent.

## Decision

Build DCM — a management plane for enterprise data center infrastructure that provides:
- A unified data model and API across all infrastructure platforms
- Policy-as-code enforcement on every request before provisioning
- Full lifecycle management from request through decommission with tamper-evident audit
- A provider abstraction that makes any infrastructure platform consumable through the same interface

DCM is **not** a provisioning tool. It is the governance and orchestration layer that sits above provisioning tools (Ansible, Terraform, operators) and governs what gets requested, approved, built, owned, and decommissioned.

## Consequences

- DCM must be infrastructure-agnostic — it cannot favor any single platform
- The data model must be extensible to any resource type without code changes
- Policy evaluation must be mandatory, not optional — governance is the value proposition
- Audit must be tamper-evident to satisfy regulated environments (the primary adopters)
27 changes: 27 additions & 0 deletions architecture/adr/002-three-abstractions.md
@@ -0,0 +1,27 @@
# ADR-002: Three Foundational Abstractions — Data, Provider, Policy

**Status:** Accepted
**Date:** March 2026
**Docs:** Doc 00 (Foundations)

## Context

A management plane for infrastructure must handle many concerns: data storage, external integrations, governance rules, audit trails, placement decisions, dependency resolution, lifecycle events, and more. Without a unifying model, the architecture becomes a collection of ad-hoc services with unclear boundaries.

## Decision

Every component of DCM maps to exactly one of three foundational abstractions:

**DATA** — Everything stored and versioned. The unified data model, entity lifecycle states, field-level provenance, data layers, and audit records. Data flows through a deterministic pipeline: Intent → Requested → Realized → Discovered.

**PROVIDER** — Everything external. Any system DCM interacts with through a defined contract. Providers receive data from DCM, act on it, and return data to DCM. Six provider types cover all external interactions: service, information, meta, auth, peer_dcm, and process.

**POLICY** — Everything that decides. Rules that fire when data matches conditions and produce typed outputs: allow/deny, validation, field mutations, recovery actions, orchestration directives. Policies govern every transition and transformation in DCM.

The interaction model: Data changes trigger Policy evaluation. Policy decisions may mutate Data or select Providers. Providers produce new Data. The cycle repeats.

## Consequences

- Any new capability must map to one of these three abstractions — if it doesn't fit, the abstraction model needs revision, not a fourth pillar
- Documentation, APIs, and code are organized around these three concepts
- Team members need deep knowledge of only one or two abstractions for their area of work
29 changes: 29 additions & 0 deletions architecture/adr/003-four-lifecycle-states.md
@@ -0,0 +1,29 @@
# ADR-003: Four Lifecycle States

**Status:** Accepted
**Date:** March 2026
**Docs:** Doc 02 (Four States)

## Context

A resource entity goes through multiple stages: the consumer declares intent, the system processes and approves, the provider provisions, and discovery observes what actually exists. If we track this as a single mutable record, we lose the ability to answer: "What did they ask for? What did we approve? What got built? What exists now?"

These four questions are the foundation of governance, audit, compliance, and drift detection.

## Decision

Every resource entity flows through four immutable states:

1. **Intent** — What the consumer asked for (raw declaration, no processing)
2. **Requested** — What was approved after layer assembly and policy evaluation (write-once)
3. **Realized** — What the provider actually created (snapshot from provider callback)
4. **Discovered** — What exists right now (independent observation via polling)

The `entity_uuid` links all four states for the same resource. States are immutable — updates create new records. Drift is the delta between Realized and Discovered. Compliance is provable because Requested State records the policy-approved payload.
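
The drift comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not DCM's implementation: the `StateRecord` class, the `make_record` helper, and the flat `payload` dictionaries are assumptions made for the example; only `entity_uuid` and the four state names come from this ADR.

```python
from dataclasses import dataclass
from enum import Enum
from types import MappingProxyType
from typing import Mapping


class State(Enum):
    INTENT = "intent"
    REQUESTED = "requested"
    REALIZED = "realized"
    DISCOVERED = "discovered"


@dataclass(frozen=True)
class StateRecord:
    """Hypothetical immutable record; an update would create a new record."""
    entity_uuid: str
    state: State
    payload: Mapping[str, object]


def make_record(entity_uuid, state, payload):
    # Copy into a read-only mapping so the stored payload cannot be mutated.
    return StateRecord(entity_uuid, state, MappingProxyType(dict(payload)))


def drift(realized, discovered):
    """Return {field: (realized_value, discovered_value)} for differing fields."""
    if realized.entity_uuid != discovered.entity_uuid:
        raise ValueError("records belong to different entities")
    keys = set(realized.payload) | set(discovered.payload)
    return {
        key: (realized.payload.get(key), discovered.payload.get(key))
        for key in sorted(keys)
        if realized.payload.get(key) != discovered.payload.get(key)
    }


realized = make_record("e-1", State.REALIZED, {"cpu_count": 4, "memory_gb": 8})
discovered = make_record("e-1", State.DISCOVERED, {"cpu_count": 4, "memory_gb": 16})
print(drift(realized, discovered))  # {'memory_gb': (8, 16)}
```

An empty result means the Realized and Discovered payloads agree, so there is no drift to report.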

## Consequences

- Every resource has a record for each of the four states, linked by `entity_uuid`; an update adds a new record rather than changing an existing one
- Drift detection is a comparison: Realized ≠ Discovered
- Rehydration (disaster recovery) re-enters at Intent with current policies
- Audit can trace any resource from consumer's original ask through to what's running
38 changes: 38 additions & 0 deletions architecture/adr/004-service-catalog-consumer-experience.md
@@ -0,0 +1,38 @@
# ADR-004: Service Catalog and Consumer Experience

**Status:** Accepted
**Date:** March 2026
**Docs:** Doc 05 (Resource Type Hierarchy), Doc 06 (Resource/Service Entities)

## Context

Consumers need a way to discover what services are available and request them. The service catalog must abstract away infrastructure complexity — a developer requesting a VM should not need to know which hypervisor, which datacenter, or which network configuration is required.

## Decision

A four-level hierarchy separates what consumers see from what providers implement:

1. **Resource Type Category** — Broad groupings (Compute, Network, Storage, Database)
2. **Resource Type** — Specific resource kinds (Compute.VirtualMachine, Network.VLAN)
3. **Resource Type Specification** — Vendor-neutral field schemas, constraints, lifecycle rules
4. **Provider Catalog Item** — A specific provider's offering (pricing, SLAs, availability)

Consumers browse the catalog, select a catalog item, and submit a request with only the fields they care about (e.g., CPU count, memory, OS). DCM handles everything else: layer assembly, policy evaluation, provider selection, dependency resolution.

**Consumer request surface** is a JSON payload via the Consumer API:

```http
POST /api/v1/requests
Content-Type: application/json

{ "catalog_item_uuid": "...", "fields": { "cpu_count": 4, "memory_gb": 8, "os_family": "rhel" } }
```
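
To show how the hierarchy levels relate to a request like the one above, here is a minimal Python sketch. The class names, the `allowed_fields` attribute, the provider names, and the UUID values are all assumptions made for illustration; the ADR does not define the actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceTypeSpec:
    """Level 3 (hypothetical shape): vendor-neutral schema for one resource type."""
    resource_type: str          # level 2, e.g. "Compute.VirtualMachine"
    allowed_fields: frozenset   # field names a consumer may set


@dataclass(frozen=True)
class ProviderCatalogItem:
    """Level 4 (hypothetical shape): one provider's offering of a spec."""
    catalog_item_uuid: str
    provider: str
    spec: ResourceTypeSpec


def category_of(resource_type):
    # Level 1 is taken here to be the prefix before the first dot (an assumption).
    return resource_type.split(".", 1)[0]


def unknown_fields(item, request):
    """Return request fields that the item's spec does not define."""
    return sorted(set(request["fields"]) - item.spec.allowed_fields)


vm_spec = ResourceTypeSpec(
    "Compute.VirtualMachine",
    frozenset({"cpu_count", "memory_gb", "os_family"}),
)
# Two providers offering the same vendor-neutral resource type.
catalog = [
    ProviderCatalogItem("item-a", "provider-a", vm_spec),
    ProviderCatalogItem("item-b", "provider-b", vm_spec),
]
request = {
    "catalog_item_uuid": "item-a",
    "fields": {"cpu_count": 4, "memory_gb": 8, "os_family": "rhel"},
}
item = next(i for i in catalog if i.catalog_item_uuid == request["catalog_item_uuid"])
print(category_of(item.spec.resource_type), unknown_fields(item, request))
```

Both catalog items share one specification, which is what keeps the consumer-facing fields identical regardless of which provider eventually fulfills the request.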

## Open Question — Application Definition Language

The current consumer interface is an API call with a JSON payload. This works for single resources. For multi-resource applications (three-tier web app, data pipeline, ML training environment), the consumer needs a way to define the application as a whole. This is an open design question — see [ADR-016: Application Definition Language](016-application-definition-language.md).

## Consequences

- Resource types are vendor-neutral; provider catalog items are provider-specific
- Multiple providers can offer catalog items for the same resource type
- Consumers never choose a provider directly — placement does that
- The catalog is queryable via API; RHDH provides the frontend
51 changes: 51 additions & 0 deletions architecture/adr/005-provider-abstraction.md
@@ -0,0 +1,51 @@
# ADR-005: Why Providers Exist and What They Do

**Status:** Accepted
**Date:** April 2026
**Docs:** Doc A (Provider Contract), Doc 53 (Capability Discovery)

## Context

DCM must interact with many external systems: hypervisors, container platforms, network controllers, IPAM systems, identity services, other DCM instances, ITSM tools, FinOps platforms, and more. Each has its own API, data format, and operational model. Without a common abstraction, DCM becomes tightly coupled to specific infrastructure platforms.

## Decision

A **Provider** is any external system DCM interacts with through a defined contract. All providers share the same base contract: registration, health check, sovereignty declaration, accreditation, zero trust authentication, and provenance emission.

What varies is the **capabilities** the provider declares. Capabilities define what the provider can do — not a rigid type assignment, but a profile of operations:

| Capability | What it means | Example |
|-----------|--------------|---------|
| `realize_resources` | Provisions, updates, and decommissions infrastructure resources | OpenStack Nova, KubeVirt, ACM |
| `serve_data` | Responds to queries with authoritative external data | CMDB, DNS, IPAM (InfoBlox) |
| `authenticate` | Authenticates identities and returns tokens/roles/groups | Keycloak, LDAP, FreeIPA |
| `federate` | Another DCM instance — mTLS mandatory, dual audit | Cross-region DCM |
| `execute_workflows` | Runs ephemeral workflows without producing persistent resources | Approval chains, ITSM, runbooks |

**A provider can declare multiple capabilities.** An IPAM system that both serves IP availability data AND allocates IP addresses registers once with `capabilities: [serve_data, realize_resources]` — not twice as two separate providers.

The key mechanism is **Naturalization/Denaturalization**: DCM sends a unified payload to the provider. The provider translates (naturalizes) it into its native API format, acts on it, then translates (denaturalizes) the result back into DCM's unified format.
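
The naturalize / act / denaturalize round trip might look like the following Python sketch. It assumes a hypothetical `Provider` base class and a fake hypervisor whose native field names (`vcpus`, `ram_mb`) are invented for the example; the real provider contract is defined in Doc A, not here.

```python
from abc import ABC, abstractmethod


class Provider(ABC):
    """Hypothetical base class; the method names are illustrative only."""
    capabilities: tuple = ()

    @abstractmethod
    def naturalize(self, unified: dict) -> dict:
        """Translate DCM's unified payload into the provider's native format."""

    @abstractmethod
    def call_native_api(self, native: dict) -> dict:
        """Act on the native payload and return the native result."""

    @abstractmethod
    def denaturalize(self, native_result: dict) -> dict:
        """Translate the native result back into DCM's unified format."""

    def realize(self, unified: dict) -> dict:
        return self.denaturalize(self.call_native_api(self.naturalize(unified)))


class FakeHypervisorProvider(Provider):
    capabilities = ("realize_resources",)

    def naturalize(self, unified):
        return {"vcpus": unified["cpu_count"], "ram_mb": unified["memory_gb"] * 1024}

    def call_native_api(self, native):
        # Stand-in for a real API call; returns a canned native response.
        return {"id": "vm-123", "status": "ACTIVE", **native}

    def denaturalize(self, native_result):
        return {
            "provider_resource_id": native_result["id"],
            "cpu_count": native_result["vcpus"],
            "memory_gb": native_result["ram_mb"] // 1024,
        }


result = FakeHypervisorProvider().realize({"cpu_count": 4, "memory_gb": 8})
print(result)
```

Keeping the translation inside the provider means DCM core only ever sees the unified field names, which is the decoupling this ADR is after.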

## Capability Discovery

DCM and providers discover each other's capabilities bidirectionally:

- **DCM advertises** its capabilities via `GET /api/v1/capabilities` — external systems query what DCM offers (cost data, audit trail, entity lifecycle events, placement decisions) and subscribe to data streams automatically
- **Providers declare** what they offer to DCM (capabilities) AND what they need from DCM (data streams, events) at registration time. DCM matches needs to available capabilities and offers subscription endpoints.

This replaces the earlier one-directional model, in which providers registered with DCM but DCM advertised nothing back.
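
As a rough illustration of the matching step, the sketch below pairs the needs a provider declares at registration with the streams DCM advertises. The registration shape, the `needs` key, the stream names, and the subscription paths are all hypothetical; only `GET /api/v1/capabilities` and the capability names appear in this ADR.

```python
def match_needs(advertised, registration):
    """Split a provider's declared needs into matched subscriptions and unmet needs."""
    needs = registration.get("needs", [])
    subscriptions = {need: advertised[need] for need in needs if need in advertised}
    unmet = [need for need in needs if need not in advertised]
    return subscriptions, unmet


# What DCM might advertise (hypothetical stream names and endpoint paths).
advertised = {
    "cost_data": "/api/v1/streams/cost_data",
    "audit_trail": "/api/v1/streams/audit_trail",
    "entity_lifecycle_events": "/api/v1/streams/entity_lifecycle_events",
    "placement_decisions": "/api/v1/streams/placement_decisions",
}

# A multi-capability provider registering once (hypothetical payload shape).
registration = {
    "name": "ipam-1",
    "capabilities": ["serve_data", "realize_resources"],
    "needs": ["entity_lifecycle_events", "capacity_forecast"],
}

subscriptions, unmet = match_needs(advertised, registration)
print(subscriptions)
print(unmet)
```

Returning the unmet needs separately lets the registration response tell the provider which requested streams are not available, rather than failing silently.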

## Alternatives Considered

1. **12 provider types** (original design) — rejected because credential, notification, message bus, registry, storage, meta, policy, and ITSM providers were implementation details or data concepts, not architectural abstractions
2. **5 rigid types** (interim design) — rejected because it still forced providers into exactly one type, preventing multi-capability providers and providing no discovery mechanism
3. **Unified model with capability declarations** (current) — one provider type with capability profiles, bidirectional discovery, and automatic pipeline establishment

## Consequences

- Adding a new infrastructure platform means writing one provider — not changing DCM core
- Consumers don't know or care which provider fulfills their request
- Provider selection is policy-driven (placement), not consumer-chosen
- All provider interactions are audited and sovereignty-checked
- Multi-capability providers register once, not once per capability
- External systems discover DCM's data streams without reading docs
33 changes: 33 additions & 0 deletions architecture/adr/006-policy-engine.md
@@ -0,0 +1,33 @@
# ADR-006: Why Policy-as-Code and What It Governs

**Status:** Accepted
**Date:** March 2026
**Docs:** Doc B (Policy Contract)

## Context

Enterprise infrastructure requires governance: sizing limits, security constraints, compliance rules, sovereignty requirements, cost controls, naming conventions. Today this governance is tribal knowledge enforced by manual review gates. Manual gates are slow, inconsistent, and unauditable.

## Decision

Every request is policy-evaluated before provisioning. Policies are code artifacts (Rego), not configuration. They fire automatically when data matches conditions and produce typed outputs.

**What policies govern:**
- **Who can request what** (GateKeeper: allow/deny based on role, tenant, resource type)
- **Whether the request is valid** (Validation: field constraints, range checks, format)
- **How the request is enriched** (Transformation: inject monitoring agents, set backup policies, apply naming conventions)
- **What happens when things fail** (Recovery: retry, requeue, compensate)
- **How pipeline stages are ordered** (Orchestration Flow: dependency sequencing)
- **What crosses boundaries** (Governance Matrix: sovereignty, data classification)

**Key design choices:**
- Multi-pass evaluation with convergence — transformation policies can inject fields that other policies depend on
- Lifecycle-scoped — a CPU-sizing policy fires on provisioning and scaling, not on hostname changes
- Override model with 5 mechanisms — governance is not rigid; legitimate exceptions are handled through audited overrides
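
Multi-pass evaluation with convergence can be pictured as a fixed-point loop. The Python sketch below is an assumption-laden illustration: real DCM policies are Rego artifacts, whereas here each transformation policy is a plain function returning the fields it wants to inject, and the policy names, field names, and `max_passes` limit are invented for the example.

```python
def evaluate_until_stable(payload, transformations, max_passes=10):
    """Re-run all transformation policies until a full pass changes nothing."""
    current = dict(payload)
    for passes in range(1, max_passes + 1):
        changed = False
        for policy in transformations:
            for key, value in policy(current).items():
                if key not in current or current[key] != value:
                    current[key] = value
                    changed = True
        if not changed:
            return current, passes
    raise RuntimeError("policies did not converge within max_passes")


def backup_policy(payload):
    # Depends on a field that another policy injects.
    return {"backup_schedule": "daily"} if payload.get("tier") == "gold" else {}


def tier_policy(payload):
    return {"tier": "gold"} if payload.get("environment") == "prod" else {}


# backup_policy runs first and sees no "tier" on pass 1; pass 2 picks it up.
final, passes = evaluate_until_stable(
    {"environment": "prod"}, [backup_policy, tier_policy]
)
print(final, passes)
```

The pass limit guards against policies that keep overwriting each other's output; how DCM itself detects non-convergence is not specified in this ADR.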

## Consequences

- No request bypasses policy evaluation — this is mandatory, not opt-in
- Policies are versioned, have lifecycle (developing → active → retired), and support shadow mode for safe testing
- Every policy evaluation produces an audit record regardless of outcome
- Policy complexity is managed through templates (Gatekeeper ConstraintTemplate pattern) and a Constraint Type Registry
32 changes: 32 additions & 0 deletions architecture/adr/007-placement-engine.md
@@ -0,0 +1,32 @@
# ADR-007: How DCM Decides Where Things Run

**Status:** Accepted
**Date:** March 2026
**Docs:** Doc 50 (Placement), Doc 14 (Profiles)

## Context

When a consumer requests a VM, they don't specify which provider or datacenter. Multiple providers may be capable of fulfilling the request. DCM must select the best provider based on sovereignty requirements, capacity, compliance, cost, and organizational policy.

## Decision

The Placement Engine selects providers through a multi-stage scoring process:

1. **Sovereignty pre-filter** — Eliminate providers that don't satisfy data residency requirements (e.g., EU-WEST resources can only go to EU-WEST providers). This is a hard gate, not a score.

2. **Capability filter** — Eliminate providers that don't support the requested resource type or lack required capabilities.

3. **Reserve query** — Query remaining providers for capacity availability and get confidence scores.

4. **Policy-driven scoring** — Apply placement policies that score providers on criteria like cost, performance tier, organizational preference, and existing affinity (e.g., co-locate with related resources).

5. **Selection** — Highest-scoring provider wins. Ties broken by configurable rules.

For composite services (composite resource type specifications), placement runs per-constituent — the database may land on a different provider than the app server, each scored independently but subject to the same sovereignty constraints.
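
The filter-then-score stages can be sketched as follows. This is a simplified Python illustration under stated assumptions: the `Candidate` fields, the weighted-sum scoring, the weight names, and the alphabetical tie-break are all invented for the example, and the reserve query is reduced to a precomputed confidence value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    region: str
    resource_types: frozenset
    healthy: bool
    capacity_confidence: float  # 0..1, as if returned by a reserve query
    cost_score: float           # 0..1, higher means cheaper


def place(request, candidates, weights):
    """Return (winner_name, scores); winner is None when nobody is eligible."""
    eligible = [
        c for c in candidates
        if c.healthy
        and c.region == request["sovereignty_region"]       # hard gate, not a score
        and request["resource_type"] in c.resource_types    # capability filter
        and c.capacity_confidence > 0                        # reserve query result
    ]
    if not eligible:
        return None, {}
    scores = {
        c.name: weights["capacity"] * c.capacity_confidence
        + weights["cost"] * c.cost_score
        for c in eligible
    }
    # max() returns the first maximal item, so sorting makes ties alphabetical.
    winner = max(sorted(scores), key=scores.get)
    return winner, scores


vm = frozenset({"Compute.VirtualMachine"})
candidates = [
    Candidate("eu-a", "EU-WEST", vm, True, 0.9, 0.5),
    Candidate("eu-b", "EU-WEST", vm, True, 0.6, 0.9),
    Candidate("us-a", "US-EAST", vm, True, 1.0, 1.0),
    Candidate("eu-c", "EU-WEST", vm, False, 1.0, 1.0),
]
request = {"resource_type": "Compute.VirtualMachine", "sovereignty_region": "EU-WEST"}
winner, scores = place(request, candidates, {"capacity": 0.5, "cost": 0.5})
print(winner, sorted(scores))
```

Note that `us-a` never receives a score even though it would rank highest: the sovereignty gate removes it before scoring, and the returned `scores` mapping is the kind of rationale an audit record could capture.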

## Consequences

- Consumers never choose providers — placement is always policy-driven
- Adding new providers to a zone automatically makes them candidates for placement
- Placement decisions are audited with full scoring rationale
- Provider health affects placement — unhealthy providers are excluded
32 changes: 32 additions & 0 deletions architecture/adr/008-dependency-resolution.md
@@ -0,0 +1,32 @@
# ADR-008: How Resources Know What They Need

**Status:** Accepted
**Date:** March 2026
**Docs:** Doc 07 (Service Dependencies), Doc 30 (Composite Service Composition Model)

## Context

Infrastructure resources have dependencies. A VM needs an IP address. A database needs a network port. A three-tier application needs all of its components provisioned in the right order with runtime values (IP addresses, connection strings) flowing from one resource to the next.

## Decision

Dependencies are declared at two levels:

**Type-level** (in the Resource Type Specification): "Every VM requires exactly one IP address." These are portable, provider-agnostic, and apply to all implementations of the resource type. DCM automatically creates sub-requests for type-level dependencies.

**Binding fields** (in composite service definitions): "The backend's db_host field gets its value from the database's realized ip_address." These connect resources via runtime values — the output of one resource becomes the input of another.

**How it works:**
1. Request Processor reads the resource type spec and identifies dependencies
2. Resources with no unmet dependencies are dispatched first (topological order)
3. When a dependency is realized, its output values are injected into dependent resources via dependency payload passing (with full provenance tracking)
4. Dependent resources are dispatched after their dependencies are satisfied

For composite services, the composite resource type spec declares the full dependency graph with binding fields.
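
The dispatch ordering can be sketched with Python's standard-library `graphlib`. The graph below is a made-up three-tier example, and grouping resources into "waves" is an assumption about how dispatch could be batched; binding-field injection is left out.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical graph: each key depends on the resources in its value set.
deps = {
    "frontend": {"backend"},
    "backend": {"database", "backend_ip"},
    "database": {"database_ip"},
    "backend_ip": set(),
    "database_ip": set(),
}


def dispatch_waves(graph):
    """Group resources into waves; each wave only needs earlier waves to be realized."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # in DCM this would follow the provider callbacks
    return waves


provision_order = dispatch_waves(deps)
decommission_order = list(reversed(provision_order))
print(provision_order)
# [['backend_ip', 'database_ip'], ['database'], ['backend'], ['frontend']]

try:
    dispatch_waves({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("cycle rejected")
```

Reversing the waves gives a teardown order in which dependents are removed before the resources they depend on, and the `CycleError` path mirrors rejecting circular dependencies at registration time.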

## Consequences

- Consumers don't manage dependencies — they request a catalog item and DCM resolves the graph
- Each dependency is a first-class DCM entity with its own audit trail and lifecycle
- Decommission reverses the dependency order — dependents are torn down before their dependencies
- Circular dependencies are detected at resource type registration time, not at request time
39 changes: 39 additions & 0 deletions architecture/adr/009-api-gateway-control-plane.md
@@ -0,0 +1,39 @@
# ADR-009: Why an API Gateway and What the Control Plane Services Do

**Status:** Accepted
**Date:** March 2026
**Docs:** Doc 25 (Control Plane Services), OpenAPI Specs

## Context

DCM has multiple consumers (developers, platform engineers, admins, providers, external systems) that interact via different APIs with different authorization scopes. Internally, DCM has multiple services that process requests through a pipeline. These services need a single entry point that handles authentication, routing, rate limiting, and API versioning.

## Decision

The **API Gateway** is the single entry point for all external traffic. It handles:
- Authentication (JWT validation, API key verification)
- Route multiplexing (consumer API, admin API, provider callback API)
- Rate limiting and throttling per tenant
- TLS termination
- API versioning (v1, v1alpha1)

The gateway and the eight services behind it make up the **9 control plane services** that process requests through the pipeline:

| Service | What it does |
|---------|-------------|
| API Gateway | Routes external traffic to internal services |
| Catalog Manager | Serves the service catalog and resource type registry |
| Request Processor | Assembles layers, resolves dependencies, builds requested state |
| Policy Engine | Evaluates all matching policies against the request payload |
| Placement Engine | Scores and selects providers for fulfillment |
| Request Orchestrator | Dispatches to providers, manages async callbacks, handles retries |
| Audit Service | Records tamper-evident audit trail with Merkle tree |
| Discovery Service | Polls providers for current state, detects drift |
| Provider Manager | Manages provider registration, health monitoring, sovereignty declarations |
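
To illustrate why a Merkle tree makes the audit trail tamper-evident, here is a small Python sketch that computes a root hash over a list of audit records. The record fields, the canonical-JSON encoding, and the duplicate-last-node padding are assumptions for the example (the leaf and node prefixes follow the convention used in RFC 6962); the Audit Service's actual tree construction is not specified in this ADR.

```python
import hashlib
import json


def leaf_hash(record):
    # Canonical JSON so the same record always hashes to the same bytes.
    data = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(b"\x00" + data).digest()


def merkle_root(records):
    """Return the hex Merkle root of the records (simple, illustrative scheme)."""
    level = [leaf_hash(r) for r in records]
    if not level:
        return hashlib.sha256(b"").hexdigest()
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pad odd levels by repeating the last node
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


records = [
    {"entity_uuid": "e-1", "event": "requested"},
    {"entity_uuid": "e-1", "event": "policy_evaluated"},
    {"entity_uuid": "e-1", "event": "realized"},
]
original_root = merkle_root(records)

# Changing any stored record changes the root, which exposes the tampering.
tampered = [dict(r) for r in records]
tampered[1]["event"] = "policy_skipped"
print(original_root != merkle_root(tampered))  # True
```

Publishing or externally anchoring the root is what lets an auditor later detect that a stored record was altered.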

## Consequences

- All external traffic goes through one endpoint — simplifies network policy and TLS
- Services communicate internally via direct calls or PostgreSQL LISTEN/NOTIFY
- Each service has its own health endpoint and can be scaled independently
- The pipeline is deterministic: assembly → policy → placement → dispatch → callback