diff --git a/architecture/adr/001-why-dcm-exists.md b/architecture/adr/001-why-dcm-exists.md new file mode 100644 index 0000000..7c1bf56 --- /dev/null +++ b/architecture/adr/001-why-dcm-exists.md @@ -0,0 +1,31 @@ +# ADR-001: Why DCM Exists + +**Status:** Accepted +**Date:** March 2026 + +## Context + +Enterprise data centers run hundreds of thousands of resources — VMs, containers, network segments, storage volumes — across multiple infrastructure platforms. Today, each platform has its own provisioning workflow, API, data format, and lifecycle model. The result: + +- **No unified view of what's deployed.** Intended state, deployed state, and actual state diverge silently. Nobody can answer "what's running, who owns it, and does it match what was approved?" +- **No consistent governance.** Policy enforcement is tribal knowledge. Security reviews are manual gates. Compliance is verified after the fact rather than enforced at request time. +- **No common abstraction.** A team requesting a VM goes through one process; requesting a database goes through another; requesting a three-tier application requires manually coordinating both plus networking. + +Public cloud solves this with unified control planes (AWS CloudFormation, Azure Resource Manager, GCP Deployment Manager). On-premises infrastructure has no equivalent. + +## Decision + +Build DCM — a management plane for enterprise data center infrastructure that provides: +- A unified data model and API across all infrastructure platforms +- Policy-as-code enforcement on every request before provisioning +- Full lifecycle management from request through decommission with tamper-evident audit +- A provider abstraction that makes any infrastructure platform consumable through the same interface + +DCM is **not** a provisioning tool. It is the governance and orchestration layer that sits above provisioning tools (Ansible, Terraform, operators) and governs what gets requested, approved, built, owned, and decommissioned. 
+ +## Consequences + +- DCM must be infrastructure-agnostic — it cannot favor any single platform +- The data model must be extensible to any resource type without code changes +- Policy evaluation must be mandatory, not optional — governance is the value proposition +- Audit must be tamper-evident to satisfy regulated environments (the primary adopters) diff --git a/architecture/adr/002-three-abstractions.md b/architecture/adr/002-three-abstractions.md new file mode 100644 index 0000000..a211aa8 --- /dev/null +++ b/architecture/adr/002-three-abstractions.md @@ -0,0 +1,27 @@ +# ADR-002: Three Foundational Abstractions — Data, Provider, Policy + +**Status:** Accepted +**Date:** March 2026 +**Docs:** Doc 00 (Foundations) + +## Context + +A management plane for infrastructure must handle many concerns: data storage, external integrations, governance rules, audit trails, placement decisions, dependency resolution, lifecycle events, and more. Without a unifying model, the architecture becomes a collection of ad-hoc services with unclear boundaries. + +## Decision + +Every component of DCM maps to exactly one of three foundational abstractions: + +**DATA** — Everything stored and versioned. The unified data model, entity lifecycle states, field-level provenance, data layers, and audit records. Data flows through a deterministic pipeline: Intent → Requested → Realized → Discovered. + +**PROVIDER** — Everything external. Any system DCM interacts with through a defined contract. Providers receive data from DCM, act on it, and return data to DCM. Six provider types cover all external interactions: service, information, meta, auth, peer_dcm, and process. + +**POLICY** — Everything that decides. Rules that fire when data matches conditions and produce typed outputs: allow/deny, validation, field mutations, recovery actions, orchestration directives. Policies govern every transition and transformation in DCM. + +The interaction model: Data changes trigger Policy evaluation. 
Policy decisions may mutate Data or select Providers. Providers produce new Data. The cycle repeats. + +## Consequences + +- Any new capability must map to one of these three abstractions — if it doesn't fit, the abstraction model needs revision, not a fourth pillar +- Documentation, APIs, and code are organized around these three concepts +- Team members only need deep knowledge of 1-2 abstractions for their area of work diff --git a/architecture/adr/003-four-lifecycle-states.md b/architecture/adr/003-four-lifecycle-states.md new file mode 100644 index 0000000..2fa45a6 --- /dev/null +++ b/architecture/adr/003-four-lifecycle-states.md @@ -0,0 +1,29 @@ +# ADR-003: Four Lifecycle States + +**Status:** Accepted +**Date:** March 2026 +**Docs:** Doc 02 (Four States) + +## Context + +A resource entity goes through multiple stages: the consumer declares intent, the system processes and approves, the provider provisions, and discovery observes what actually exists. If we track this as a single mutable record, we lose the ability to answer: "What did they ask for? What did we approve? What got built? What exists now?" + +These four questions are the foundation of governance, audit, compliance, and drift detection. + +## Decision + +Every resource entity flows through four immutable states: + +1. **Intent** — What the consumer asked for (raw declaration, no processing) +2. **Requested** — What was approved after layer assembly and policy evaluation (write-once) +3. **Realized** — What the provider actually created (snapshot from provider callback) +4. **Discovered** — What exists right now (independent observation via polling) + +The `entity_uuid` links all four states for the same resource. States are immutable — updates create new records. Drift is the delta between Realized and Discovered. Compliance is provable because Requested State records the policy-approved payload. 
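The comparison between Realized and Discovered can be sketched in a few lines of Python. This is a minimal illustration, assuming each state is stored as an immutable record whose payload is a flat field map; the `StateRecord` class, the `detect_drift` helper, and the sample values are hypothetical and are not DCM's schema or API.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class StateRecord:
    """One immutable snapshot of an entity in one lifecycle state (hypothetical shape)."""
    entity_uuid: str
    state: str  # "intent", "requested", "realized", or "discovered"
    payload: Mapping[str, Any]


def detect_drift(realized: StateRecord, discovered: StateRecord) -> dict[str, tuple[Any, Any]]:
    """Return {field: (realized_value, discovered_value)} for every differing field."""
    if realized.entity_uuid != discovered.entity_uuid:
        raise ValueError("records belong to different entities")
    fields = set(realized.payload) | set(discovered.payload)
    return {
        name: (realized.payload.get(name), discovered.payload.get(name))
        for name in sorted(fields)
        if realized.payload.get(name) != discovered.payload.get(name)
    }


# Sample data: the provider built 8 GB, but discovery now observes 16 GB.
realized = StateRecord("e-1", "realized", {"cpu_count": 4, "memory_gb": 8})
discovered = StateRecord("e-1", "discovered", {"cpu_count": 4, "memory_gb": 16})
drift = detect_drift(realized, discovered)
print(drift)
```

An empty result means the two states agree; any entry is a drifted field together with both observed values.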
+ +## Consequences + +- Every resource is represented by one record per lifecycle state, linked by entity_uuid; updates add new records rather than overwriting existing ones +- Drift detection is a comparison: Realized ≠ Discovered +- Rehydration (disaster recovery) re-enters at Intent with current policies +- Audit can trace any resource from consumer's original ask through to what's running diff --git a/architecture/adr/004-service-catalog-consumer-experience.md b/architecture/adr/004-service-catalog-consumer-experience.md new file mode 100644 index 0000000..6a6d528 --- /dev/null +++ b/architecture/adr/004-service-catalog-consumer-experience.md @@ -0,0 +1,38 @@ +# ADR-004: Service Catalog and Consumer Experience + +**Status:** Accepted +**Date:** March 2026 +**Docs:** Doc 05 (Resource Type Hierarchy), Doc 06 (Resource/Service Entities) + +## Context + +Consumers need a way to discover what services are available and request them. The service catalog must abstract away infrastructure complexity — a developer requesting a VM should not need to know which hypervisor, which datacenter, or which network configuration is required. + +## Decision + +A four-level hierarchy separates what consumers see from what providers implement: + +1. **Resource Type Category** — Broad groupings (Compute, Network, Storage, Database) +2. **Resource Type** — Specific resource kinds (Compute.VirtualMachine, Network.VLAN) +3. **Resource Type Specification** — Vendor-neutral field schemas, constraints, lifecycle rules +4. **Provider Catalog Item** — A specific provider's offering (pricing, SLAs, availability) + +Consumers browse the catalog, select a catalog item, and submit a request with only the fields they care about (e.g., CPU count, memory, OS). DCM handles everything else: layer assembly, policy evaluation, provider selection, dependency resolution. 
+ +**Consumer request surface** is a JSON payload via the Consumer API: + +```json +POST /api/v1/requests +{ "catalog_item_uuid": "...", "fields": { "cpu_count": 4, "memory_gb": 8, "os_family": "rhel" } } +``` + +## Open Question — Application Definition Language + +The current consumer interface is an API call with a JSON payload. This works for single resources. For multi-resource applications (three-tier web app, data pipeline, ML training environment), the consumer needs a way to define the application as a whole. This is an open design question — see [ADR-016: Application Definition Language](016-application-definition-language.md). + +## Consequences + +- Resource types are vendor-neutral; provider catalog items are provider-specific +- Multiple providers can offer catalog items for the same resource type +- Consumers never choose a provider directly — placement does that +- The catalog is queryable via API; RHDH provides the frontend diff --git a/architecture/adr/005-provider-abstraction.md b/architecture/adr/005-provider-abstraction.md new file mode 100644 index 0000000..657f961 --- /dev/null +++ b/architecture/adr/005-provider-abstraction.md @@ -0,0 +1,51 @@ +# ADR-005: Why Providers Exist and What They Do + +**Status:** Accepted +**Date:** April 2026 +**Docs:** Doc A (Provider Contract), Doc 53 (Capability Discovery) + +## Context + +DCM must interact with many external systems: hypervisors, container platforms, network controllers, IPAM systems, identity services, other DCM instances, ITSM tools, FinOps platforms, and more. Each has its own API, data format, and operational model. Without a common abstraction, DCM becomes tightly coupled to specific infrastructure platforms. + +## Decision + +A **Provider** is any external system DCM interacts with through a defined contract. All providers share the same base contract: registration, health check, sovereignty declaration, accreditation, zero trust authentication, and provenance emission. 
+ +What varies is the **capabilities** the provider declares. Capabilities define what the provider can do — not a rigid type assignment, but a profile of operations: + +| Capability | What it means | Example | +|-----------|--------------|---------| +| `realize_resources` | Provisions, updates, and decommissions infrastructure resources | OpenStack Nova, KubeVirt, ACM | +| `serve_data` | Responds to queries with authoritative external data | CMDB, DNS, IPAM (InfoBlox) | +| `authenticate` | Authenticates identities and returns tokens/roles/groups | Keycloak, LDAP, FreeIPA | +| `federate` | Another DCM instance — mTLS mandatory, dual audit | Cross-region DCM | +| `execute_workflows` | Runs ephemeral workflows without producing persistent resources | Approval chains, ITSM, runbooks | + +**A provider can declare multiple capabilities.** An IPAM system that both serves IP availability data AND allocates IP addresses registers once with `capabilities: [serve_data, realize_resources]` — not twice as two separate providers. + +The key mechanism is **Naturalization/Denaturalization**: DCM sends a unified payload to the provider. The provider translates (naturalizes) it into its native API format, acts on it, then translates (denaturalizes) the result back into DCM's unified format. + +## Capability Discovery + +DCM and providers discover each other's capabilities bidirectionally: + +- **DCM advertises** its capabilities via `GET /api/v1/capabilities` — external systems query what DCM offers (cost data, audit trail, entity lifecycle events, placement decisions) and subscribe to data streams automatically +- **Providers declare** what they offer to DCM (capabilities) AND what they need from DCM (data streams, events) at registration time. DCM matches needs to available capabilities and offers subscription endpoints. + +This replaces the old one-directional model where providers register with DCM but DCM doesn't advertise anything back. + +## Alternatives Considered + +1. 
**12 provider types** (original design) — rejected because credential, notification, message bus, registry, storage, meta, policy, and ITSM providers were implementation details or data concepts, not architectural abstractions +2. **5 rigid types** (interim design) — rejected because it still forced providers into exactly one type, preventing multi-capability providers and providing no discovery mechanism +3. **Unified model with capability declarations** (current) — one provider type with capability profiles, bidirectional discovery, and automatic pipeline establishment + +## Consequences + +- Adding a new infrastructure platform means writing one provider — not changing DCM core +- Consumers don't know or care which provider fulfills their request +- Provider selection is policy-driven (placement), not consumer-chosen +- All provider interactions are audited and sovereignty-checked +- Multi-capability providers register once, not once per capability +- External systems discover DCM's data streams without reading docs diff --git a/architecture/adr/006-policy-engine.md b/architecture/adr/006-policy-engine.md new file mode 100644 index 0000000..5e17fd0 --- /dev/null +++ b/architecture/adr/006-policy-engine.md @@ -0,0 +1,33 @@ +# ADR-006: Why Policy-as-Code and What It Governs + +**Status:** Accepted +**Date:** March 2026 +**Docs:** Doc B (Policy Contract) + +## Context + +Enterprise infrastructure requires governance: sizing limits, security constraints, compliance rules, sovereignty requirements, cost controls, naming conventions. Today this governance is tribal knowledge enforced by manual review gates. Manual gates are slow, inconsistent, and unauditable. + +## Decision + +Every request is policy-evaluated before provisioning. Policies are code artifacts (Rego), not configuration. They fire automatically when data matches conditions and produce typed outputs. 
+ +**What policies govern:** +- **Who can request what** (GateKeeper: allow/deny based on role, tenant, resource type) +- **Whether the request is valid** (Validation: field constraints, range checks, format) +- **How the request is enriched** (Transformation: inject monitoring agents, set backup policies, apply naming conventions) +- **What happens when things fail** (Recovery: retry, requeue, compensate) +- **How pipeline stages are ordered** (Orchestration Flow: dependency sequencing) +- **What crosses boundaries** (Governance Matrix: sovereignty, data classification) + +**Key design choices:** +- Multi-pass evaluation with convergence — transformation policies can inject fields that other policies depend on +- Lifecycle-scoped — a CPU-sizing policy fires on provisioning and scaling, not on hostname changes +- Override model with 5 mechanisms — governance is not rigid; legitimate exceptions are handled through audited overrides + +## Consequences + +- No request bypasses policy evaluation — this is mandatory, not opt-in +- Policies are versioned, have lifecycle (developing → active → retired), and support shadow mode for safe testing +- Every policy evaluation produces an audit record regardless of outcome +- Policy complexity is managed through templates (Gatekeeper ConstraintTemplate pattern) and a Constraint Type Registry diff --git a/architecture/adr/007-placement-engine.md b/architecture/adr/007-placement-engine.md new file mode 100644 index 0000000..7856fb3 --- /dev/null +++ b/architecture/adr/007-placement-engine.md @@ -0,0 +1,32 @@ +# ADR-007: How DCM Decides Where Things Run + +**Status:** Accepted +**Date:** March 2026 +**Docs:** Doc 50 (Placement), Doc 14 (Profiles) + +## Context + +When a consumer requests a VM, they don't specify which provider or datacenter. Multiple providers may be capable of fulfilling the request. DCM must select the best provider based on sovereignty requirements, capacity, compliance, cost, and organizational policy. 
+ +## Decision + +The Placement Engine selects providers through a multi-stage scoring process: + +1. **Sovereignty pre-filter** — Eliminate providers that don't satisfy data residency requirements (e.g., EU-WEST resources can only go to EU-WEST providers). This is a hard gate, not a score. + +2. **Capability filter** — Eliminate providers that don't support the requested resource type or lack required capabilities. + +3. **Reserve query** — Query remaining providers for capacity availability and get confidence scores. + +4. **Policy-driven scoring** — Apply placement policies that score providers on criteria like cost, performance tier, organizational preference, and existing affinity (e.g., co-locate with related resources). + +5. **Selection** — Highest-scoring provider wins. Ties broken by configurable rules. + +For composite services (composite resource type specifications), placement runs per-constituent — the database may land on a different provider than the app server, each scored independently but subject to the same sovereignty constraints. + +## Consequences + +- Consumers never choose providers — placement is always policy-driven +- Adding new providers to a zone automatically makes them candidates for placement +- Placement decisions are audited with full scoring rationale +- Provider health affects placement — unhealthy providers are excluded diff --git a/architecture/adr/008-dependency-resolution.md b/architecture/adr/008-dependency-resolution.md new file mode 100644 index 0000000..3383678 --- /dev/null +++ b/architecture/adr/008-dependency-resolution.md @@ -0,0 +1,32 @@ +# ADR-008: How Resources Know What They Need + +**Status:** Accepted +**Date:** March 2026 +**Docs:** Doc 07 (Service Dependencies), Doc 30 (Composite Service Composition Model) + +## Context + +Infrastructure resources have dependencies. A VM needs an IP address. A database needs a network port. 
A three-tier application needs all of its components provisioned in the right order with runtime values (IP addresses, connection strings) flowing from one resource to the next. + +## Decision + +Dependencies are declared at two levels: + +**Type-level** (in the Resource Type Specification): "Every VM requires exactly one IP address." These are portable, provider-agnostic, and apply to all implementations of the resource type. DCM automatically creates sub-requests for type-level dependencies. + +**Binding fields** (in composite service definitions): "The backend's db_host field gets its value from the database's realized ip_address." These connect resources via runtime values — the output of one resource becomes the input of another. + +**How it works:** +1. Request Processor reads the resource type spec and identifies dependencies +2. Dependencies without parents are dispatched first (topological sort) +3. When a dependency is realized, its output values are injected into dependent resources via dependency payload passing (with full provenance tracking) +4. Dependent resources are dispatched after their dependencies are satisfied + +For composite services, the composite resource type spec declares the full dependency graph with binding fields. 
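The dispatch order described in the steps above can be sketched with Python's standard-library `graphlib`. This is an illustration under stated assumptions, not DCM's Request Processor: the `dispatch_waves` helper and the resource names are hypothetical, and the graph maps each resource to the set of resources it depends on.

```python
from graphlib import CycleError, TopologicalSorter


def dispatch_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group resources into waves; each wave depends only on earlier waves."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises CycleError if the graph is circular
    waves: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave as realized
    return waves


# Hypothetical graph: the backend needs a database and an IP address,
# and the frontend needs the backend.
depends_on = {
    "database": set(),
    "backend_ip": set(),
    "backend": {"database", "backend_ip"},
    "frontend": {"backend"},
}
waves = dispatch_waves(depends_on)
teardown = list(reversed(waves))  # dependents first, dependencies last
print(waves)
print(teardown)

# A circular graph is rejected before anything is dispatched.
cycle_detected = False
try:
    dispatch_waves({"a": {"b"}, "b": {"a"}})
except CycleError:
    cycle_detected = True
```

Resources in the same wave have no ordering constraint between them, so a real dispatcher could send them in parallel. Reversing the waves gives an order in which dependents are removed before the resources they rely on.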
+ +## Consequences + +- Consumers don't manage dependencies — they request a catalog item and DCM resolves the graph +- Each dependency is a first-class DCM entity with its own audit trail and lifecycle +- Decommission reverses the dependency order — dependents are torn down before their dependencies +- Circular dependencies are detected at resource type registration time, not at request time diff --git a/architecture/adr/009-api-gateway-control-plane.md b/architecture/adr/009-api-gateway-control-plane.md new file mode 100644 index 0000000..479618d --- /dev/null +++ b/architecture/adr/009-api-gateway-control-plane.md @@ -0,0 +1,39 @@ +# ADR-009: Why an API Gateway and What the Control Plane Services Do + +**Status:** Accepted +**Date:** March 2026 +**Docs:** Doc 25 (Control Plane Services), OpenAPI Specs + +## Context + +DCM has multiple consumers (developers, platform engineers, admins, providers, external systems) that interact via different APIs with different authorization scopes. Internally, DCM has multiple services that process requests through a pipeline. These services need a single entry point that handles authentication, routing, rate limiting, and API versioning. + +## Decision + +The **API Gateway** is the single entry point for all external traffic. 
It handles: +- Authentication (JWT validation, API key verification) +- Route multiplexing (consumer API, admin API, provider callback API) +- Rate limiting and throttling per tenant +- TLS termination +- API versioning (v1, v1alpha1) + +Counting the gateway itself, **9 control plane services** process requests through the pipeline: + +| Service | What it does | +|---------|-------------| +| API Gateway | Routes external traffic to internal services | +| Catalog Manager | Serves the service catalog and resource type registry | +| Request Processor | Assembles layers, resolves dependencies, builds requested state | +| Policy Engine | Evaluates all matching policies against the request payload | +| Placement Engine | Scores and selects providers for fulfillment | +| Request Orchestrator | Dispatches to providers, manages async callbacks, handles retries | +| Audit Service | Records tamper-evident audit trail with Merkle tree | +| Discovery Service | Polls providers for current state, detects drift | +| Provider Manager | Manages provider registration, health monitoring, sovereignty declarations | + +## Consequences + +- All external traffic goes through one endpoint — simplifies network policy and TLS +- Services communicate internally via direct calls or PostgreSQL LISTEN/NOTIFY +- Each service has its own health endpoint and can be scaled independently +- The pipeline is deterministic: assembly → policy → placement → dispatch → callback diff --git a/architecture/adr/010-audit-tamper-evidence.md b/architecture/adr/010-audit-tamper-evidence.md new file mode 100644 index 0000000..83ffe2a --- /dev/null +++ b/architecture/adr/010-audit-tamper-evidence.md @@ -0,0 +1,31 @@ +# ADR-010: Why Tamper-Evident Audit and How It Works + +**Status:** Accepted +**Date:** April 2026 +**Docs:** Doc 16 (Universal Audit) + +## Context + +Regulated industries (financial services, government, healthcare) require provable audit trails. 
"We logged it" is insufficient — auditors need mathematical proof that records haven't been modified or deleted after the fact. This is a hard requirement for sovereign cloud deployments. + +## Decision + +DCM uses a **Merkle tree** audit model (RFC 9162 — the same pattern used in Certificate Transparency): + +- Every pipeline stage produces a signed audit record (Ed25519 signature) +- Records are leaves in a Merkle tree — a binary hash tree where modifying any leaf changes the root hash +- **Inclusion proofs** prove a specific record exists in the tree +- **Consistency proofs** prove the tree has only grown (no deletions) +- **Signed tree heads** provide non-repudiation by the DCM instance + +**Configurable granularity** because not every deployment needs the same detail: +- **Stage** (~6 leaves/request): one leaf per pipeline stage — sufficient for dev/homelab +- **Mutation** (~15-30 leaves/request): one leaf per field change — standard for production +- **Field** (mutation + per-field hashes): required for FedRAMP/sovereign deployments + +## Consequences + +- Any modification to audit records is mathematically detectable +- Auditors can independently verify the audit trail without trusting DCM +- Granularity is profile-governed — organizations choose their audit depth +- Three SQL tables support the model: audit_records, signed_tree_heads, merkle_tree_nodes diff --git a/architecture/adr/011-sovereignty-data-residency.md b/architecture/adr/011-sovereignty-data-residency.md new file mode 100644 index 0000000..b48d052 --- /dev/null +++ b/architecture/adr/011-sovereignty-data-residency.md @@ -0,0 +1,28 @@ +# ADR-011: Why Sovereignty Is a First-Class Concept + +**Status:** Accepted +**Date:** March 2026 +**Docs:** Doc 14 (Profiles), Doc B §18 (Overrides), Doc 26 (Governance Matrix) + +## Context + +Organizations operating in regulated industries or across jurisdictions face data residency requirements: EU data must stay in EU, classified data must stay on approved 
infrastructure, healthcare data must meet HIPAA locality requirements. Public clouds handle this with regions. On-premises infrastructure has no equivalent enforcement mechanism. + +## Decision + +Sovereignty is enforced at three levels: + +1. **Provider declaration** — Every provider declares its sovereignty zones and data residency scope at registration. This is not self-reported trust — it's validated against the accreditation model. + +2. **Policy enforcement** — Sovereignty policies are GateKeeper policies with hard enforcement. They fire on every lifecycle operation (not just initial provisioning). A resource in EU-WEST stays in EU-WEST for its entire lifecycle, including updates, scaling, and rehydration. + +3. **Placement pre-filter** — The placement engine eliminates non-compliant providers before scoring begins. Sovereignty is a hard gate, not a soft preference. + +**Override governance:** Sovereignty policies can be overridden, but only through dual-approval (two approvers from different roles). Every override is audited at field granularity. 
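A minimal sketch of the hard-gate behaviour of the placement pre-filter and of the dual-approval rule, assuming providers expose their declared zones as a set and that scoring has already produced one number per provider. `Provider`, `select_provider`, `dual_approval_satisfied`, and the sample data are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional


@dataclass(frozen=True)
class Provider:
    name: str
    sovereignty_zones: frozenset[str]  # zones declared at registration
    score: float  # stand-in for the result of policy-driven scoring


def select_provider(providers: list[Provider], required_zone: str) -> Optional[Provider]:
    """Drop non-compliant providers first, then pick the best of what remains."""
    eligible = [p for p in providers if required_zone in p.sovereignty_zones]
    if not eligible:
        return None  # hard gate: no fallback to a non-compliant zone
    return max(eligible, key=lambda p: p.score)


def dual_approval_satisfied(approvals: list[tuple[str, str]]) -> bool:
    """approvals are (approver, role) pairs; two people with different roles are required."""
    return any(
        a[0] != b[0] and a[1] != b[1]
        for a, b in combinations(approvals, 2)
    )


providers = [
    Provider("eu-kubevirt", frozenset({"EU-WEST"}), 0.6),
    Provider("us-openstack", frozenset({"US-EAST"}), 0.9),
]
choice = select_provider(providers, "EU-WEST")
print(choice.name if choice else None)
print(dual_approval_satisfied([("alice", "security"), ("bob", "platform")]))
```

The higher-scoring US provider never reaches the scoring step for an EU-WEST request, which is the difference between a gate and a weighted preference.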
+ +## Consequences + +- Sovereignty violations are caught at request time, not after deployment +- Cross-zone data movement is impossible without explicit, audited override +- Rehydration (disaster recovery) respects current sovereignty policies — rebuilding in a non-compliant zone is blocked +- Profiles (minimal, standard, fsi, sovereign) set sovereignty enforcement minimums diff --git a/architecture/adr/012-data-assembly-layering.md b/architecture/adr/012-data-assembly-layering.md new file mode 100644 index 0000000..467d517 --- /dev/null +++ b/architecture/adr/012-data-assembly-layering.md @@ -0,0 +1,29 @@ +# ADR-012: How Organizational Data Merges with Consumer Requests + +**Status:** Accepted +**Date:** March 2026 +**Docs:** Doc 03 (Layering and Versioning) + +## Context + +When a consumer requests a VM with 4 CPUs, the provisioning system needs much more information: which datacenter, which network, what monitoring agent, what backup policy, what compliance requirements apply. This organizational data shouldn't be the consumer's responsibility — they just want a VM. + +## Decision + +**Data Layers** carry organizational context that gets merged into every request: + +- **System layers** — Datacenter configurations, environment defaults, compliance requirements +- **Tenant layers** — Organization-specific overrides (monitoring agents, naming conventions) +- **Provider layers** — Provider-specific defaults (image mappings, flavor resolution) +- **Consumer intent** — What the consumer actually asked for + +Layers merge in precedence order (system → tenant → provider → consumer). Consumer values override layer defaults. Every field in the merged payload carries **provenance** — where the value came from and what modified it. + +**Layers are Data, not Logic.** Layers provide values. Policies provide decisions. A layer says "the datacenter is EU-WEST-DC1." A policy says "EU-WEST resources must use the EU-WEST monitoring endpoint." 
This separation means layers can be managed by infrastructure teams while policies are managed by security/governance teams. + +## Consequences + +- Consumers declare only what they need — organizational data is injected automatically +- Adding a new datacenter or changing a monitoring agent is a layer change, not a code change +- Provenance on every field answers "why does this VM have this backup policy?" +- Layer conflicts are resolved deterministically by precedence order diff --git a/architecture/adr/013-override-exception-governance.md b/architecture/adr/013-override-exception-governance.md new file mode 100644 index 0000000..9922d43 --- /dev/null +++ b/architecture/adr/013-override-exception-governance.md @@ -0,0 +1,28 @@ +# ADR-013: How to Handle Legitimate Exceptions Without Undermining Governance + +**Status:** Accepted +**Date:** April 2026 +**Docs:** Doc B §18 (Override Model) + +## Context + +Policies will block legitimate requests. A data residency policy may block a valid exception for a disaster recovery scenario. A sizing policy may block a temporary capacity burst for a product launch. If the only options are "change the policy" or "work around the system," governance degrades. + +## Decision + +Five override mechanisms, layered from least to most disruptive: + +1. **Override Policy** — A planned exception registered in advance (e.g., "DR events may use US-EAST zone") +2. **Exception Grant** — A pre-authorized waiver with compensating controls and expiry +3. **Manual Override** — Immediate single-request authorization with written justification +4. **Compensating Control** — Replace a blocked requirement with an equivalent risk-reduction measure +5. 
**Dual-Approval** — Required modifier for hard-enforcement policies (two approvers, different roles) + +**The consumer experience:** When a policy blocks a request, the consumer sees the blocking reason, compliant value suggestions, and four options: modify the request, request an override, cancel, or escalate. Override is one path among four — not the default. + +## Consequences + +- Every override is audited with full Merkle tree leaf +- Frequently-overridden policies are surfaced in metrics for policy review +- Block timeout auto-cancels requests where the consumer takes no action +- The governance model is flexible without being permissive diff --git a/architecture/adr/014-multi-tenancy-isolation.md b/architecture/adr/014-multi-tenancy-isolation.md new file mode 100644 index 0000000..32c0af0 --- /dev/null +++ b/architecture/adr/014-multi-tenancy-isolation.md @@ -0,0 +1,24 @@ +# ADR-014: How Tenants Are Separated + +**Status:** Accepted +**Date:** March 2026 +**Docs:** Doc 11 (Data Store Contracts), Doc 15 (Universal Groups) + +## Context + +DCM serves multiple teams (tenants) within an organization. Each tenant's data, resources, policies, and audit trails must be isolated. A developer on Team A must not see Team B's resources, and Team A's policies must not affect Team B's requests (unless they're system-level policies that apply to everyone). + +## Decision + +**Row-Level Security (RLS)** in PostgreSQL enforces tenant isolation at the database layer. Every query is automatically scoped to the actor's tenant — application code cannot accidentally leak cross-tenant data. + +Tenants are **DCMGroups** with type `tenant_boundary`. Groups can be nested (organization → department → team) and support the universal group model for flexible organizational mapping. + +**Policy domain precedence** respects tenancy: system > platform > tenant > resource_type > entity. A system-level sizing policy applies to all tenants. 
A tenant-level naming convention applies only to that tenant. + +## Consequences + +- Tenant isolation is enforced by the database, not application logic — defense in depth +- Cross-tenant operations (ownership transfer, shared resources) require explicit policy authorization +- RLS adds a small query overhead (~2-5%) — acceptable for the security guarantee +- 18 SQL tables all include tenant_uuid columns with RLS policies diff --git a/architecture/adr/015-minimal-infrastructure.md b/architecture/adr/015-minimal-infrastructure.md new file mode 100644 index 0000000..a6f3d45 --- /dev/null +++ b/architecture/adr/015-minimal-infrastructure.md @@ -0,0 +1,30 @@ +# ADR-015: Why PostgreSQL Is the Only Required Dependency + +**Status:** Accepted +**Date:** March 2026 +**Docs:** Doc 11 (Data Store Contracts), Doc 17 (Deployment) + +## Context + +Infrastructure management platforms often require heavy middleware stacks: message brokers, secret managers, identity providers, search engines. This creates a bootstrap problem — you need significant infrastructure just to manage infrastructure. It also blocks adoption in resource-constrained environments (homelab, edge, evaluation). + +## Decision + +PostgreSQL is the only required dependency. DCM implements internal equivalents for every capability that optional services provide: + +| Capability | Internal (default) | External (optional) | +|-----------|-------------------|-------------------| +| Events | PostgreSQL LISTEN/NOTIFY | Kafka | +| Secrets | Envelope-encrypted table | Vault | +| Auth | Built-in bcrypt + JWT | Keycloak/OIDC | +| Search | PostgreSQL full-text + GIN | OpenSearch | +| Notifications | PostgreSQL LISTEN/NOTIFY + webhooks | External notification service | + +Every optional dependency follows the same pattern: internal by default, externally delegable by configuration. The same API surface is exposed regardless of which implementation is active. 
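The "internal by default, externally delegable" pattern can be illustrated for the Events row of the table. This is only a sketch: both backends below are in-process stand-ins (no PostgreSQL or Kafka client is used), and the `EventBus` protocol, the class names, the `make_event_bus` factory, and the `events` config key are assumptions made for the example.

```python
from typing import Callable, Protocol


class EventBus(Protocol):
    """The single API surface callers see, whichever backend is active."""
    def publish(self, channel: str, message: str) -> None: ...
    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None: ...


class InternalEventBus:
    """Stand-in for the internal default (LISTEN/NOTIFY in DCM); in-process here."""
    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[str], None]]] = {}

    def publish(self, channel: str, message: str) -> None:
        for handler in self._handlers.get(channel, []):
            handler(message)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._handlers.setdefault(channel, []).append(handler)


class ExternalEventBus(InternalEventBus):
    """Placeholder for an external delegate such as Kafka; no real client is used."""


def make_event_bus(config: dict[str, str]) -> EventBus:
    """Pick the backend from configuration, defaulting to the internal one."""
    backends = {"internal": InternalEventBus, "external": ExternalEventBus}
    return backends[config.get("events", "internal")]()


received: list[str] = []
bus = make_event_bus({})  # empty config selects the internal default
bus.subscribe("entity.lifecycle", received.append)
bus.publish("entity.lifecycle", "created")
print(type(bus).__name__, received)
```

Calling code depends only on `publish` and `subscribe`, so switching the configuration value changes the backend without changing callers.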
+ +## Consequences + +- Bootstrap is `docker-compose up` with one PostgreSQL container +- Production deployments can delegate to Kafka, Vault, Keycloak when scale or policy requires it +- Internal implementations have performance ceilings (LISTEN/NOTIFY: ~1K events/sec vs Kafka: millions) +- Every new cross-cutting service must implement the internal path first diff --git a/architecture/adr/016-application-definition-language.md b/architecture/adr/016-application-definition-language.md new file mode 100644 index 0000000..61eb7de --- /dev/null +++ b/architecture/adr/016-application-definition-language.md @@ -0,0 +1,114 @@ +# ADR-016: Application Definition Language + +**Status:** OPEN — Design Decision Required +**Date:** April 2026 +**Raised by:** Ondra (machacekondra), repeatedly + +## Context + +DCM currently has two consumer interfaces: + +1. **Single resource** — A JSON payload to the Consumer API: `POST /api/v1/requests { "catalog_item_uuid": "...", "fields": {...} }` +2. **Compound service** — A composite resource type spec that defines constituent resources, dependencies, and binding fields in YAML + +The single-resource API works well for atomic resources. The composite service definition (composite service model) works for platform engineers who define reusable application templates. But there is a gap: + +**How does a consumer define a custom application?** Not a pre-defined catalog item, but an ad-hoc composition: "I need a database, two app servers, and a load balancer, and here's how they connect." Today, this requires a platform engineer to create a composite resource type spec first. 
+ +Comparable projects have made different choices: +- **Radius** uses Bicep (a DSL) for application definitions, with Recipes (Terraform/Bicep templates) for infrastructure implementation +- **KRO** uses ResourceGraphDefinitions with CEL expressions, generating CRDs from the definition +- **Crossplane** uses Compositions with embedded resource templates and patch sets + +## The Question + +What is DCM's application definition language? Options to evaluate: + +### Option A: API-Only (Current State) +Consumers submit JSON payloads. Compound services require pre-defined composite resource type specs. Platform engineers author specs; consumers consume them. + +**Pros:** Simple, API-first, no custom language to learn +**Cons:** No self-service composition. Every new application pattern requires a platform engineer. + +### Option B: YAML Application Manifests +A YAML document defining resources, dependencies, and binding fields — similar to the composite resource type spec but authored by consumers, not platform engineers. + +```yaml +apiVersion: dcm.io/v1 +kind: Application +metadata: + name: pet-clinic +spec: + resources: + - name: database + type: Database.PostgreSQL + fields: { engine: postgresql, storage_gb: 50 } + - name: backend + type: Compute.VirtualMachine + depends_on: [database] + bindings: + - from: database.ip_address + to: config.db_host + fields: { cpu_count: 4, memory_gb: 8 } + - name: frontend + type: Compute.VirtualMachine + depends_on: [backend] + bindings: + - from: backend.ip_address + to: config.api_host + fields: { cpu_count: 2, memory_gb: 4, replicas: 2 } +``` + +**Pros:** Declarative, GitOps-friendly, reviewable, versionable +**Cons:** New format to learn. Validation complexity. How does this interact with the service catalog? + +### Option C: Reference Existing DSL (Bicep, CEL, HCL) +Adopt an existing language like Radius does with Bicep or KRO does with CEL. Leverage existing tooling and developer familiarity. 
+ +**Pros:** Existing tooling, IDE support, community +**Cons:** Tight coupling to an external project. Bicep is Azure-originated. CEL is an expression language, not a full definition format, so it would still need a surrounding schema (KRO embeds it in Kubernetes resources). HCL is HashiCorp-specific. + +### Option D: Catalog Composition via API +Consumers compose applications by linking multiple catalog requests through the API, declaring dependencies between them. No new language — just structured API calls. + +```json +POST /api/v1/applications +{ + "name": "pet-clinic", + "components": [ + { "name": "database", "catalog_item_uuid": "pg-standard", "fields": {...} }, + { "name": "backend", "catalog_item_uuid": "vm-standard", "fields": {...}, + "depends_on": ["database"], + "bindings": [{ "from": "database.ip_address", "to": "config.db_host" }] } + ] +} +``` + +**Pros:** API-first, no DSL, consistent with existing patterns +**Cons:** JSON is verbose for complex compositions. Not as readable/reviewable as YAML. No GitOps-friendly file format. + +## Evaluation Criteria + +1. **Consumer UX** — How easy is it for a developer to define a three-tier app? +2. **Platform engineer UX** — How easy is it to create reusable templates? +3. **GitOps compatibility** — Can definitions be stored in Git and applied via PR? +4. **Validation** — Can DCM validate the definition before execution? +5. **Existing tooling** — Does it work with existing editors, linters, CI pipelines? +6. **Consistency with DCM patterns** — Does it align with the API-first, JSON, snake_case conventions? + +## Recommendation + +This decision needs team input. The author's preliminary assessment: + +**Option B (YAML manifests) or Option D (API composition) are most aligned** with DCM's existing patterns. Option B is better for GitOps. Option D is better for API-first consistency. They could coexist — the YAML manifest could be a file format that the API endpoint accepts. 
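To make the coexistence idea concrete, here is a rough sketch of translating an Option B manifest into an Option D request body. It assumes the manifest has already been parsed into a dictionary (for example by a YAML library). The `CATALOG_BY_TYPE` lookup and the `manifest_to_payload` function are hypothetical, as is the rule that each resource type maps to exactly one catalog item; neither ADR specifies that mapping.

```python
# Hypothetical lookup from manifest resource types to catalog items.
# The identifiers reuse the placeholder values from the Option D example.
CATALOG_BY_TYPE = {
    "Database.PostgreSQL": "pg-standard",
    "Compute.VirtualMachine": "vm-standard",
}


def manifest_to_payload(manifest: dict) -> dict:
    """Translate an Option B style manifest into an Option D style request body."""
    resources = manifest["spec"]["resources"]
    known_names = {resource["name"] for resource in resources}
    components = []
    for resource in resources:
        # Reject references to resources that the manifest never declares.
        missing = [d for d in resource.get("depends_on", []) if d not in known_names]
        if missing:
            raise ValueError(f"{resource['name']} depends on unknown resources: {missing}")
        if resource["type"] not in CATALOG_BY_TYPE:
            raise ValueError(f"no catalog item for type {resource['type']!r}")
        component = {
            "name": resource["name"],
            "catalog_item_uuid": CATALOG_BY_TYPE[resource["type"]],
            "fields": dict(resource.get("fields", {})),
        }
        # Only emit the optional keys when the manifest actually sets them.
        if resource.get("depends_on"):
            component["depends_on"] = list(resource["depends_on"])
        if resource.get("bindings"):
            component["bindings"] = [dict(b) for b in resource["bindings"]]
        components.append(component)
    return {"name": manifest["metadata"]["name"], "components": components}


# The first two resources of the pet-clinic manifest from Option B, as parsed data.
PET_CLINIC = {
    "apiVersion": "dcm.io/v1",
    "kind": "Application",
    "metadata": {"name": "pet-clinic"},
    "spec": {
        "resources": [
            {
                "name": "database",
                "type": "Database.PostgreSQL",
                "fields": {"engine": "postgresql", "storage_gb": 50},
            },
            {
                "name": "backend",
                "type": "Compute.VirtualMachine",
                "depends_on": ["database"],
                "bindings": [{"from": "database.ip_address", "to": "config.db_host"}],
                "fields": {"cpu_count": 4, "memory_gb": 8},
            },
        ]
    },
}

payload = manifest_to_payload(PET_CLINIC)
```

Under these assumptions the manifest stays the reviewable, Git-stored artifact while the API remains the only execution path, which also gives a natural place to run validation before anything is provisioned.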
+ +**Option C (external DSL) is least aligned** — it introduces a dependency on an external project's language and tooling, which conflicts with DCM's technology-agnostic principle. + +**Regardless of choice, the composite resource type spec remains the implementation mechanism.** The application definition language is a consumer-facing UX that ultimately produces a composite resource type spec (or equivalent) for execution. + +## Actions Required + +- [ ] Team discussion to evaluate options +- [ ] Prototype consumer UX for three-tier app with top 2 options +- [ ] Evaluate interaction with RHDH (Backstage) scaffolding templates +- [ ] Decision by [date TBD] diff --git a/architecture/adr/README.md b/architecture/adr/README.md new file mode 100644 index 0000000..56b2312 --- /dev/null +++ b/architecture/adr/README.md @@ -0,0 +1,24 @@ +# Architecture Decision Records + +Short, reviewable summaries of the major architectural decisions in DCM. Each ADR answers **"Why does this exist and what does it do?"** — not implementation details. + +**Reading order:** ADRs 001-003 establish the foundations. Read those first, then jump to whichever ADRs are relevant to your area. 
+ +| ADR | Decision | One-Line Summary | +|-----|----------|-----------------| +| [001](001-why-dcm-exists.md) | Why DCM Exists | Unified management plane for on-prem infrastructure — the governance layer above provisioning tools | +| [002](002-three-abstractions.md) | Three Foundational Abstractions | Everything in DCM is Data, Provider, or Policy — no exceptions | +| [003](003-four-lifecycle-states.md) | Four Lifecycle States | Intent → Requested → Realized → Discovered — immutable states linked by entity_uuid | +| [004](004-service-catalog-consumer-experience.md) | Service Catalog & Consumer UX | Four-level hierarchy from resource types to catalog items; consumers declare what, not how | +| [005](005-provider-abstraction.md) | Provider Abstraction | Unified provider model with capability declarations; bidirectional discovery; any platform, same interface | +| [006](006-policy-engine.md) | Policy Engine | Policy-as-code on every request; 8 policy types from gatekeeping to orchestration flow | +| [007](007-placement-engine.md) | Placement Engine | Multi-stage scoring: sovereignty pre-filter → capability → capacity → policy scoring | +| [008](008-dependency-resolution.md) | Dependency Resolution | Type-level dependencies trigger automatic sub-requests; binding fields inject runtime values | +| [009](009-api-gateway-control-plane.md) | API Gateway & Control Plane | Single entry point routing to 9 internal services; deterministic pipeline | +| [010](010-audit-tamper-evidence.md) | Audit & Tamper Evidence | Merkle tree (RFC 9162) with configurable granularity; mathematically provable integrity | +| [011](011-sovereignty-data-residency.md) | Sovereignty & Data Residency | First-class enforcement on every lifecycle operation; dual-approval for overrides | +| [012](012-data-assembly-layering.md) | Data Assembly & Layering | Organizational data merges with consumer requests; field-level provenance on everything | +| [013](013-override-exception-governance.md) | Override & 
Exception Governance | 5 mechanisms from planned exceptions to dual-approval; governance with flexibility | +| [014](014-multi-tenancy-isolation.md) | Multi-Tenancy & Isolation | PostgreSQL RLS enforces tenant isolation at the database layer | +| [015](015-minimal-infrastructure.md) | Minimal Infrastructure | PostgreSQL is the only required dependency; everything else is optional | +| [016](016-application-definition-language.md) | Application Definition Language | **OPEN** — How should consumers define multi-resource applications? Options under evaluation |