
feat(guardian): RECON tracks C/F/G/H/I/J -- KUBEBUILDER guard, failurePolicy split, lineage archiver, scoped OperatorContext #22

Merged
ontave merged 20 commits into main from session/pr-merge-reconcile
May 29, 2026
ontave commented May 29, 2026

Summary

  • RECON-I2: admission webhook failurePolicy split -- the RBAC webhook (RBACProfile/PermissionSet) uses Ignore; the lineage immutability, OperatorContext, and operator CR webhooks use Fail; prevents deadlock during bootstrap window
  • RECON-F4: LineageArchiver async write channel pool with batch inserts -- reconcile loop never blocks on CNPG writes
  • RECON-G1: scoped OperatorContext -- Guardian now resolves per-cluster OperatorContext before each reconcile cycle
  • RECON-C1: proactive PKI certificate expiry detection and DriftSignal emission (PKIExpirySoonSignal)
  • RECON-H1/H2: ConditionTypeNodeInfrastructureReady constant; K register completion check
  • Guardian integration test suites: KUBEBUILDER_ASSETS guard added to all 4 suites (controller, epg, webhook, lineage) -- suites skip cleanly when binaries absent
  • Merge: origin/main migration commit (da885cd: guardian.ontai.dev rename + seam rename); seam-sdk/memory import preserved

Test plan

  • go test ./... passes (all unit + integration suites)
  • Integration suites skip when KUBEBUILDER_ASSETS unset
  • failurePolicy Ignore on the RBAC webhook (RBACProfile/PermissionSet); Fail on the lineage immutability, OperatorContext, and operator CR webhooks
  • LineageArchiver flushes within 500ms under load

ontave added 20 commits May 17, 2026 23:49
… seam-core

Adds replace directive for github.com/ontai-dev/platform. Updates
cluster_rbacpolicy_controller to use platformseamv1alpha1.TalosCluster
from platform/api/seam/v1alpha1. Registers platform scheme in main.go.
All unit tests updated to use the platform types.

Replace seam-core -> seam in go.mod replace/require. Update all Go
import paths from github.com/ontai-dev/seam-core/ to
github.com/ontai-dev/seam/. Add seam-sdk replace + require.

Replace ../seam-core with ../seam following the seam-core -> seam
filesystem rename. Module path github.com/ontai-dev/seam was already
updated in Phase 4; this aligns the local path pointer.
… Guardian singleton

- Remove Guardian singleton CRD and types (Guardian, GuardianSpec, GuardianStatus, GuardianList)
- Remove setCNPGCondition and simplify RunWithRetry to 2-arg form (no kube client)
- Remove Scheme/Recorder/OperatorNamespace from BootstrapController; replace Guardian CR
  condition writes with in-memory WebhookModeGate and NamespaceEnforcementRegistry
- Fix one-way ratchet: return early after Initialising->ObserveOnly transition so Enforcing
  check only runs in the next reconcile
- Rename 7 CRD YAML files from security.ontai.dev_*.yaml to guardian.ontai.dev_*.yaml
- Update groupversion_info.go: group annotation and GroupVersion.Group to guardian.ontai.dev
- Update all finalizer constants, kubebuilder markers, GVR/GVK Group fields, apiVersion
  strings in unstructured objects across all guardian packages
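
A minimal sketch of the one-way ratchet fix, assuming a three-state mode and two boolean inputs. The `webhookMode` type, the `reconcileMode` function, and the `bootstrapDone`/`enforceReady` inputs are hypothetical names for illustration; only the mode names and the early return come from the commit message.

```go
package main

import "fmt"

// webhookMode is a hypothetical stand-in for the mode held by WebhookModeGate.
type webhookMode int

const (
	Initialising webhookMode = iota
	ObserveOnly
	Enforcing
)

func (m webhookMode) String() string {
	return [...]string{"Initialising", "ObserveOnly", "Enforcing"}[m]
}

// reconcileMode advances the mode by at most one step per call and never
// moves backwards. bootstrapDone and enforceReady are assumed inputs.
func reconcileMode(current webhookMode, bootstrapDone, enforceReady bool) webhookMode {
	if current == Initialising {
		if bootstrapDone {
			// Return early: the Enforcing check waits for the next reconcile.
			return ObserveOnly
		}
		return Initialising
	}
	if current == ObserveOnly && enforceReady {
		return Enforcing
	}
	// Enforcing is terminal; nothing here ever lowers the mode.
	return current
}

func main() {
	mode := Initialising
	for i := 1; i <= 3; i++ {
		mode = reconcileMode(mode, true, true)
		fmt.Printf("reconcile %d: %s\n", i, mode)
	}
}
```

Even with both inputs true from the start, the first call stops at ObserveOnly and only the second call reaches Enforcing.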
Fresh documentation from current codebase. security.ontai.dev replaced with
guardian.ontai.dev throughout. Guardian singleton CR removed -- Deployment
readiness is the health signal. security-system namespace removed.
seam-core references replaced with seam. wrapper replaced with dispatcher.
LineageRecord and SeamMembership removed from guardian CRD table (owned by seam).
…gs and tests; fix runner.ontai.dev -> seam.ontai.dev in epg_controller; add seam-sdk/platform checkouts to CI

Extends MigrationRunner with 6 new migrations (006-011) that create the
domain memory tables on management CNPG: governance_events,
identity_resolution_events, snapshot_distribution_events, receipt_events,
lineage_archive, lineage_sdns (with FK to lineage_archive). All migrations
are idempotent (CREATE TABLE IF NOT EXISTS). T-WI4-8 complete.

- domain_memory.go: nil-safe write helpers for all four event types
- IdentityBindingReconciler: writes governance_events on every reconcile, identity_resolution_events on trust anchor resolution
- IdentityProviderReconciler: writes governance_events on every reconcile, identity_resolution_events when validation passes
- EPGReconciler: writes snapshot_distribution_events per PermissionSnapshot upsert
- TenantSnapshotRunnable: writes receipt_events on new PermissionSnapshotReceipt creation
- DomainMemoryWriter field is optional (nil-safe); failures are discarded to never block reconciliation
- clusterFromNamespaceDM derives cluster context from seam-tenant-{cluster} namespace prefix
- Unit tests: domain_memory_test.go covers all four write paths + nil-writer no-panic
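
A rough sketch of the two helper patterns listed above: deriving the cluster from a `seam-tenant-{cluster}` namespace, and a write helper that tolerates a nil writer and discards errors. The `governanceWriter` interface, `recordGovernanceEvent`, and `clusterFromNamespace` are hypothetical names, and returning `("", false)` for non-tenant namespaces is an assumption, not the behaviour of `clusterFromNamespaceDM`.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const tenantNamespacePrefix = "seam-tenant-"

// clusterFromNamespace extracts {cluster} from a seam-tenant-{cluster} namespace.
// Returning ("", false) for any other namespace is an assumption.
func clusterFromNamespace(ns string) (string, bool) {
	if !strings.HasPrefix(ns, tenantNamespacePrefix) {
		return "", false
	}
	cluster := strings.TrimPrefix(ns, tenantNamespacePrefix)
	return cluster, cluster != ""
}

// governanceWriter is a hypothetical one-method view of a domain memory writer.
type governanceWriter interface {
	WriteGovernanceEvent(cluster, reason string) error
}

// recordGovernanceEvent is nil-safe and deliberately discards write errors so
// that a database failure can never block reconciliation. Note that the nil
// check only catches a nil interface value, not a typed nil pointer.
func recordGovernanceEvent(w governanceWriter, namespace, reason string) {
	if w == nil {
		return
	}
	cluster, _ := clusterFromNamespace(namespace)
	_ = w.WriteGovernanceEvent(cluster, reason)
}

// failingWriter simulates an unavailable database.
type failingWriter struct{ calls int }

func (f *failingWriter) WriteGovernanceEvent(cluster, reason string) error {
	f.calls++
	return errors.New("database unavailable")
}

func main() {
	recordGovernanceEvent(nil, "seam-tenant-prod", "reconciled") // no panic
	w := &failingWriter{}
	recordGovernanceEvent(w, "seam-tenant-prod", "reconciled") // error discarded
	fmt.Println("writes attempted:", w.calls)
}
```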
…t reader RBAC

Guardian management bootstrap now provisions ClusterRole + ClusterRoleBinding for
the seam LineageController SA (system:serviceaccount:seam-system:lineage-controller)
to get/list/watch PermissionSnapshots across all namespaces.

ensureSeamLineageControllerRBAC uses SSA (ForceOwnership, field-owner=guardian) and is
called from BootstrapAnnotationRunnable.Start after createThirdPartyProfiles, gated on
management role (ManagementClusterName != ""). INV-004: Guardian provisions; seam
does not self-grant. lineage_controller_rbac_test.go: 3 unit tests.

Implements guardian domain memory persistence for LineageRecord CRs:
- LineageArchiveStore interface + InsertLineageArchive() on SQLAuditStore writes to migration #10 lineage_archive table
- LazyLineageArchiveStore mirrors LazyAuditDatabase lazy-init pattern for startup sequencing
- LineageArchiver polls seam.ontai.dev/v1alpha1/lineagerecords cluster-wide every 60s (LINEAGE_ARCHIVE_INTERVAL env), archives new/changed records tracked by resourceVersion
- cnpgStartupRunnable.Start() calls lazyLineage.Set() then starts LineageArchiver goroutine after CNPG connects
- Unit tests: 5 database + 6 archiver tests all green
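
One way the "archive new/changed records tracked by resourceVersion" step could look, as a sketch only. The `lineageRecord` and `versionTracker` types are hypothetical, and keying the tracker by UID is an assumption; the commit message only says that records are tracked by resourceVersion.

```go
package main

import "fmt"

// lineageRecord is a hypothetical, trimmed-down view of a polled LineageRecord.
type lineageRecord struct {
	UID             string
	Name            string
	ResourceVersion string
}

// versionTracker remembers the last archived resourceVersion per record.
// Keying by UID is an assumption made for this sketch.
type versionTracker struct {
	archived map[string]string
}

func newVersionTracker() *versionTracker {
	return &versionTracker{archived: make(map[string]string)}
}

// pending returns the records that are new or whose resourceVersion changed.
// resourceVersions are opaque, so they are compared for equality only.
func (t *versionTracker) pending(records []lineageRecord) []lineageRecord {
	var out []lineageRecord
	for _, r := range records {
		if last, ok := t.archived[r.UID]; !ok || last != r.ResourceVersion {
			out = append(out, r)
		}
	}
	return out
}

// markArchived records that r was written at its current resourceVersion.
func (t *versionTracker) markArchived(r lineageRecord) {
	t.archived[r.UID] = r.ResourceVersion
}

func main() {
	tracker := newVersionTracker()
	firstPoll := []lineageRecord{
		{UID: "uid-a", Name: "rec-a", ResourceVersion: "10"},
		{UID: "uid-b", Name: "rec-b", ResourceVersion: "11"},
	}
	for _, r := range tracker.pending(firstPoll) {
		tracker.markArchived(r)
	}
	secondPoll := []lineageRecord{
		{UID: "uid-a", Name: "rec-a", ResourceVersion: "10"},
		{UID: "uid-b", Name: "rec-b", ResourceVersion: "12"},
	}
	for _, r := range tracker.pending(secondPoll) {
		fmt.Println("archive", r.Name, "at", r.ResourceVersion)
	}
}
```

On the second poll only `rec-b` is archived again, because its resourceVersion changed.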

Unblocks TC-MC-16 (lineage_archive row verification).
…rootRef

LineageRecord spec uses rootBinding.rootKind/rootName/rootNamespace.
Prior implementation used a non-existent spec.lineage.rootRef path,
leaving root_cr_kind/name/namespace empty in lineage_archive rows.
Fixes TC-MC-16 row content.
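
To illustrate why the old path produced empty columns, here is a stdlib-only sketch of nested field lookup on an unstructured object. The `nestedString` helper is hypothetical (real code would more likely use the apimachinery unstructured helpers), and the sample record values are invented.

```go
package main

import "fmt"

// nestedString walks obj along path and returns the string found there.
// It reports false when any segment is missing or has the wrong type.
func nestedString(obj map[string]interface{}, path ...string) (string, bool) {
	var cur interface{} = obj
	for _, key := range path {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return "", false
		}
		cur, ok = m[key]
		if !ok {
			return "", false
		}
	}
	s, ok := cur.(string)
	return s, ok
}

func main() {
	// Invented sample shaped like the spec described in the commit message.
	record := map[string]interface{}{
		"spec": map[string]interface{}{
			"rootBinding": map[string]interface{}{
				"rootKind":      "TalosCluster",
				"rootName":      "prod",
				"rootNamespace": "seam-tenant-prod",
			},
		},
	}

	// The old, non-existent path silently yields an empty value.
	_, found := nestedString(record, "spec", "lineage", "rootRef", "kind")
	fmt.Println("old path found:", found)

	// The corrected path reads the real field.
	kind, found := nestedString(record, "spec", "rootBinding", "rootKind")
	fmt.Println("new path:", kind, found)
}
```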
…async archive writes

RECON-G1: New OperatorContextConflictHandler (ValidatingAdmissionHandler) rejects
OperatorContext CREATE/UPDATE when the incoming CR's scope overlaps at the same specificity
level as an existing CR in the same namespace. Specificity: 2=exact clusterRefs,
1=clusterRoles, 0=global. Overlap at equal specificity with shared cluster or both-global
triggers denial. Self-update (same name) is excluded. List failure admits rather than
blocking all OperatorContext ops during API unavailability. Registered via new
RegisterOperatorContextConflict method on AdmissionWebhookServer. New
ValidatingWebhookConfiguration (failurePolicy=Fail) at /validate-operator-context.
6 unit tests covering all allow/deny permutations including self-update and
different-specificity coexistence.

RECON-F4: Decouple LineageArchiver DB writes from poll loop via buffered event channel
(capacity=500) drained by a pool of 3 writer goroutines. Poll goroutine never blocks on
Postgres. Channel saturation drops events (logged) rather than stalling the watcher.
Absorbs upgrade waves of ~100 clusters with 5 records each (about 500 events, matching the channel capacity) without blocking.
…tConflict wiring

RECON-I2: Split webhook failurePolicy strategy:
- guardian-rbac-webhook (validating-webhook-configuration.yaml): failurePolicy=Ignore.
  RBAC admission checks are liveness-critical; Guardian restart must not block operators
  from creating RBAC resources. The bootstrap RBAC window (INV-020) closes permanently
  after first startup; subsequent restart windows are short and acceptable at Ignore.
- guardian-lineage-immutability-webhook: failurePolicy=Fail (unchanged) -- lineage
  immutability violations corrupt causal history.
- guardian-operator-context-webhook: failurePolicy=Fail (unchanged) -- unchecked
  OperatorContext conflicts corrupt governance state.
- guardian-operator-cr-webhook: failurePolicy=Fail (unchanged) -- operator CR guard
  is safety-critical.

Wire RegisterOperatorContextConflict in main.go: creates a dedicated local dynamic
client from mgr.GetConfig() and registers the conflict handler at
/validate-operator-context. Called after all other webhook registrations.

4 new unit tests:
- TestWebhookPaths_RBACAndOperatorContextAreDistinct: path separation required for
  independent WebhookConfiguration failurePolicy assignment.
- TestWebhookPaths_AllDistinct: all 7 guardian webhook paths are unique.
- TestRBACWebhookConfig_FailurePolicyIsIgnore: reads YAML and verifies Ignore.
- TestOperatorContextWebhookConfig_FailurePolicyIsFail: reads YAML and verifies Fail.
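
A sketch of how a test might read a webhook manifest and check its failurePolicy, using only the standard library. It scans lines rather than parsing YAML, which is a simplification; the `failurePolicies` and `allEqual` helpers, the embedded manifest, and the webhook name `rbac.guardian.ontai.dev` are all hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// failurePolicies returns every failurePolicy value found in manifest, in order.
// This is naive line scanning, not a YAML parser.
func failurePolicies(manifest string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(manifest))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		line = strings.TrimPrefix(line, "- ")
		if strings.HasPrefix(line, "failurePolicy:") {
			value := strings.TrimSpace(strings.TrimPrefix(line, "failurePolicy:"))
			out = append(out, strings.Trim(value, `"'`))
		}
	}
	return out
}

// allEqual reports whether policies is non-empty and every entry equals want.
func allEqual(policies []string, want string) bool {
	if len(policies) == 0 {
		return false
	}
	for _, p := range policies {
		if p != want {
			return false
		}
	}
	return true
}

// rbacWebhookYAML is an invented manifest shaped like a ValidatingWebhookConfiguration.
const rbacWebhookYAML = `apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: guardian-rbac-webhook
webhooks:
  - name: rbac.guardian.ontai.dev
    failurePolicy: Ignore
`

func main() {
	policies := failurePolicies(rbacWebhookYAML)
	fmt.Println("policies:", policies, "all Ignore:", allEqual(policies, "Ignore"))
}
```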
…inaries

All four envtest suites (controller, webhook, lineage, epg) now check
KUBEBUILDER_ASSETS at TestMain startup and call os.Exit(0) when absent,
matching the pattern already established in the platform integration suite.
BinaryAssetsDirectory is set from the env var so setup-envtest paths are
honoured without falling back to the /usr/local/kubebuilder default.
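
The guard could look roughly like this. It is a sketch under assumptions: the decision is factored into a hypothetical `envtestAssets` helper so it can run as a plain program, whereas the suites perform the check inside `TestMain` and pass the directory to envtest's `BinaryAssetsDirectory`.

```go
package main

import (
	"fmt"
	"os"
)

// envtestAssets returns the KUBEBUILDER_ASSETS directory and whether the
// envtest suite should run. lookup is injected so the decision is testable;
// in a suite it would be os.LookupEnv.
func envtestAssets(lookup func(string) (string, bool)) (dir string, ok bool) {
	dir, set := lookup("KUBEBUILDER_ASSETS")
	if !set || dir == "" {
		return "", false
	}
	return dir, true
}

func main() {
	dir, ok := envtestAssets(os.LookupEnv)
	if !ok {
		// Exit 0 so the suite counts as skipped, not failed.
		fmt.Println("KUBEBUILDER_ASSETS not set; skipping envtest suite")
		os.Exit(0)
	}
	// In a real TestMain this is where the suite would set
	// envtest.Environment{BinaryAssetsDirectory: dir} and call os.Exit(m.Run()).
	fmt.Println("would start envtest with binaries from", dir)
}
```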
@ontave ontave merged commit 53bd725 into main May 29, 2026
2 checks passed