feat: Introduce v1alpha2 version of LlamaStackDistribution CRD #253
VaishnaviHire wants to merge 11 commits into llamastack:main
Conversation
Force-pushed from 28c7f4a to afec277
Force-pushed from b760e03 to 1843bd3
@Mergifyio rebase
✅ Branch has been successfully rebased
Force-pushed from 1843bd3 to 31e0e3c
Force-pushed from 3fde13b to 673d4e7
```go
logger := log.FromContext(ctx)

// Handle v1alpha2 native config generation before standard reconciliation.
v1a2Result, v1a2Err := r.handleV1Alpha2NativeConfig(ctx, key, instance)
```
Missing test coverage for FR-097 (preserve running Deployment on config generation failure).
Where is this covered? I don't see a test that creates a Deployment first, then fails config generation, then checks the Deployment is unchanged.
Force-pushed from 673d4e7 to ac0683d
Force-pushed from fc77468 to 958b6f3
eoinfennessy left a comment
Review of config gen pipeline. There are some critical issues that need to be addressed.
```go
for k, v := range settingsMap {
	cfg[k] = v
}
```
It's possible that values from the settings map can override the endpoint, secret refs, and the API key.
Should we skip adding items that are already in cfg? And log a warning?
Or maybe add settings to the cfg map first and then add fields like base_url secret_refs and api_key?
We need to be careful that secret_refs can't override api_key too (which it can currently).
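One way to make the precedence explicit is to apply user settings first and let the operator-derived fields win, warning on reserved keys. A stdlib-only sketch (names like `mergeSettings` and `reservedKeys` are hypothetical, not the PR's actual helpers):

```go
package main

import "fmt"

// reservedKeys are config fields the operator derives itself; user-supplied
// settings must never clobber them.
var reservedKeys = map[string]bool{
	"base_url":    true,
	"api_key":     true,
	"secret_refs": true,
}

// mergeSettings copies user settings in first (skipping reserved keys with a
// warning), then overlays the operator-derived fields so they always win.
func mergeSettings(cfg, settings map[string]any) map[string]any {
	out := map[string]any{}
	for k, v := range settings {
		if reservedKeys[k] {
			fmt.Printf("warning: ignoring reserved setting %q\n", k)
			continue
		}
		out[k] = v
	}
	for k, v := range cfg { // operator-derived fields always win
		out[k] = v
	}
	return out
}

func main() {
	cfg := map[string]any{"base_url": "http://ollama:11434", "api_key": "${env.OLLAMA_KEY}"}
	settings := map[string]any{"base_url": "http://attacker", "timeout": 30}
	merged := mergeSettings(cfg, settings)
	fmt.Println(merged["base_url"], merged["timeout"]) // prints: http://ollama:11434 30
}
```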
```go
for key := range pc.SecretRefs {
	ident := providerID + ":" + key
	if sub, ok := substitutions[ident]; ok {
		cfg[key] = sub
	} else {
		envName := GenerateEnvVarName(providerID, key)
		cfg[key] = "${env." + envName + "}"
	}
}
```
The order of iteration is non-deterministic. This could potentially cause unnecessary Deployment updates.
We should sort the keys before iterating.
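The fix can be sketched with the stdlib alone (the real `SecretRefs` map has a struct value type; strings are used here for brevity):

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns map keys in a stable order so the generated config is
// byte-identical across reconciles, avoiding spurious Deployment updates.
func sortedKeys(m map[string]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	secretRefs := map[string]string{"token": "t", "api_key": "a", "ca_cert": "c"}
	for _, key := range sortedKeys(secretRefs) {
		fmt.Println(key) // api_key, ca_cert, token — always in this order
	}
}
```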
```go
for key, ref := range pc.SecretRefs {
	addSecretToResolution(resolution, secretRefEntry{
		ProviderID: providerID,
		Field:      key,
		SecretName: ref.Name,
		SecretKey:  ref.Key,
	})
}
```
The order of iteration is non-deterministic. This could potentially cause unnecessary Deployment updates.
We should sort the keys before iterating.
```go
// apiNameToConfigKey maps CRD-style camelCase API names to config.yaml snake_case keys.
var apiNameToConfigKey = map[string]string{
	"vectorIo":     "vector_io",
	"toolRuntime":  "tool_runtime",
	"postTraining": "post_training",
	"datasetIo":    "datasetio",
}

// normalizeAPIName converts a CRD-style camelCase API name to the config.yaml
// snake_case key. Names already in snake_case pass through unchanged.
func normalizeAPIName(api string) string {
	if mapped, ok := apiNameToConfigKey[api]; ok {
		return mapped
	}
	return api
}
```
This is all effectively dead code because the disabled enum already specifies snake-case values.
```go
// RenderConfigYAML serializes the config to deterministic YAML.
func RenderConfigYAML(config *BaseConfig) (string, error) {
	// Build an ordered map for deterministic output
	out := buildOrderedConfig(config)

	data, err := yaml.Marshal(out)
	if err != nil {
		return "", fmt.Errorf("failed to marshal config YAML: %w", err)
	}

	return string(data), nil
}
```
This function mutates the provided config, which is unconventional and unexpected for a render function.
Consider having buildOrderedConfig write to out["registered_resources"] instead of config.Extra["registered_resources"].
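A sketch of the suggested shape, with simplified stand-ins for the real `BaseConfig` and the YAML marshalling omitted — the point is that derived fields land in `out`, never back in the input:

```go
package main

import "fmt"

// BaseConfig is a simplified stand-in for the real config type.
type BaseConfig struct {
	Version string
	Extra   map[string]any
}

// buildOrderedConfig assembles the output map without touching cfg, so the
// render function stays a pure read-only operation on its argument.
func buildOrderedConfig(cfg *BaseConfig, registered map[string]any) map[string]any {
	out := map[string]any{"version": cfg.Version}
	for k, v := range cfg.Extra {
		out[k] = v
	}
	// Write derived fields into out, never back into cfg.Extra.
	out["registered_resources"] = registered
	return out
}

func main() {
	cfg := &BaseConfig{Version: "2", Extra: map[string]any{"apis": []string{"inference"}}}
	out := buildOrderedConfig(cfg, map[string]any{"models": []string{"llama3"}})
	// cfg.Extra still has only its original key; out carries the derived field.
	fmt.Println(len(cfg.Extra), out["version"]) // prints: 1 2
}
```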
```go
EvalStore         map[string]interface{} `json:"eval_store,omitempty" yaml:"eval_store,omitempty"`
DatasetIOStore    map[string]interface{} `json:"datasetio_store,omitempty" yaml:"datasetio_store,omitempty"`
Server            map[string]interface{} `json:"server,omitempty" yaml:"server,omitempty"`
ExternalProviders map[string]interface{} `json:"external_providers,omitempty" yaml:"external_providers,omitempty"`
```
The `ExternalProviders` field is effectively unused because we never read it in `buildOrderedConfig`.
We should delete it to avoid confusion.
```go
if mc.ContextLength != nil && *mc.ContextLength > 0 {
	if entry["provider_model_id"] == nil {
```
This check is not needed; it is always true at this point.
```go
provider := mc.Provider
if provider == "" {
	provider = defaultProvider
}
```
We have no validation that the model's provider ID actually exists.
The model can be registered with a non-existent provider and no error is returned. The llama-stack server would fail at startup with a confusing error about an unknown provider.
We should consider adding validation for this at the admission layer if not too complex. Otherwise, we can validate here and return an error so the CR's status can reflect the issue to users.
eoinfennessy left a comment
Review of webhooks:
We need to comprehensively test all validation logic (webhook and CEL). Currently we don't do any testing of validation logic.
There are some problems with data loss and stale data in the conversion logic. I added a suggestion to fix this.
Force-pushed from 958b6f3 to 4021d2f
@eoinfennessy I have addressed the comments and updated the commits. Please take a look.

@VaishnaviHire, thanks for addressing the comments. In future review cycles on this PR, please avoid squashing and force-pushing. Instead, please add new commits for each change. This makes it easier for me to review the changes that have been made between PR reviews, which is especially tricky in such a large PR.

This pull request has merge conflicts that must be resolved before it can be merged. @VaishnaviHire please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
eoinfennessy left a comment
I re-reviewed after the previous suggestions. Thanks for addressing these. A couple of things are left from these reviews:
- CEL validation tests: Let's add envtest tests to ensure all of our complex CEL validation is actually working.
- Split TLSSpec for server and client: see my latest comment in the thread above discussing this
- Small bug remaining in resource.go (see below)
```go
if provider == "" {
	return nil, fmt.Errorf("failed to expand model %q: no provider specified and no default inference provider found", mc.Name)
}
if mc.Provider != "" && !providerExists(provider, userProviders, base) {
```
The check `mc.Provider != ""` means: "only validate if the user explicitly set a provider." But this creates a blind spot — if the user omits the provider and the default is used, no existence check happens at all. The default provider could be stale or wrong, and the error would only surface at LlamaStack server startup with a confusing message about an unknown provider.
The fix is simply:

```go
if !providerExists(provider, userProviders, base) {
```

This validates the provider regardless of whether it came from the user or from the default, which is what you'd want — if we resolved a provider name, we should verify it exists.
eoinfennessy left a comment
Review focussing on /controller. Mostly minor suggestions, but a couple of major things related to surfacing errors and status.
```go
// Handle v1alpha2 native config generation before standard reconciliation.
v1a2Result, v1a2Err := r.handleV1Alpha2NativeConfig(ctx, key, instance)
if v1a2Err != nil {
	logger.Error(v1a2Err, "failed to handle v1alpha2 native config")
}
```
FR-097 states:

> If config generation or validation fails during a CR update, the operator MUST preserve the current running Deployment (image, ConfigMap, env vars) unchanged and set status condition `ConfigGenerated=False` with the failure reason. The running instance MUST NOT be disrupted.

Two gaps:

1. `ConfigGenerated=False` is never set. When `handleV1Alpha2NativeConfig` fails, `v1a2Result` is `nil`, so `finalizeReconciliation` takes the else branch and calls `updateStatus`, which only writes v1alpha1 status fields. `SetV1Alpha2Condition` is only called inside `persistV1Alpha2Status`, which is only reached on success. The constant `ReasonConfigGenFailed` is declared but unused. The failure is logged to operator stdout but never surfaces in `.status.conditions`.
2. No structural guarantee the Deployment is preserved. `handleV1Alpha2NativeConfig` mutates `v1Instance` in-place (setting `UserConfig` and appending env vars) after all fallible operations, then `reconcileResources` uses the same pointer to reconcile the Deployment. Today the mutation ordering is safe — mutations happen last, after all fallible steps. But this is an implicit invariant: if a fallible step is later added after the `UserConfig` assignment, `reconcileResources` would reconcile against a half-modified spec, potentially pointing the Deployment at a ConfigMap that doesn't exist.

Suggested approach: on `handleV1Alpha2NativeConfig` error, skip `reconcileResources`, persist `ConfigGenerated=False` with the failure reason, and return the error to requeue. This satisfies both halves of FR-097: the Deployment is untouched and the failure is visible in status.
There's also no test asserting that .status.conditions contains ConfigGenerated=False after a failed config generation.
```go
// v1alpha2 Condition reasons.
const (
	ReasonConfigGenSucceeded = "ConfigGenerationSucceeded"
	ReasonConfigGenFailed    = "ConfigGenerationFailed"
)
```
```go
if err := r.persistV1Alpha2Status(ctx, key, instance, v1a2Result); err != nil {
	logger.Error(err, "failed to update v1alpha2 status")
}
```
We should return the status update error to match v1alpha1 behaviour in the else block, and ensure the status is eventually consistent.
```go
// updateStatus refreshes the LlamaStack status.
func (r *LlamaStackDistributionReconciler) updateStatus(ctx context.Context, instance *llamav1alpha1.LlamaStackDistribution, reconcileErr error) error {
	// computeStatus computes all status fields on the in-memory v1alpha1 instance
```
We should remove the stale `updateStatus` comment above this line.
```go
for _, envVar := range resolution.EnvVars {
	if envVar.ValueFrom == nil || envVar.ValueFrom.SecretKeyRef == nil {
		continue
	}

	secretName := envVar.ValueFrom.SecretKeyRef.Name
	secretKey := envVar.ValueFrom.SecretKeyRef.Key

	secret := &corev1.Secret{}
	if err := r.Get(ctx, types.NamespacedName{
		Name:      secretName,
		Namespace: namespace,
	}, secret); err != nil {
		if k8serrors.IsNotFound(err) {
			return fmt.Errorf("failed to find Secret %q in namespace %q (referenced by env var %s)", secretName, namespace, envVar.Name)
		}
		return fmt.Errorf("failed to get Secret %q: %w", secretName, err)
	}

	if _, ok := secret.Data[secretKey]; !ok {
		return fmt.Errorf("failed to find key %q in Secret %q in namespace %q", secretKey, secretName, namespace)
	}
}
```
We could maybe consider aggregating errors here to provide a better UX.
```go
for i, mc := range spec.Resources.Models {
	if mc.Provider != "" {
		if _, ok := providerIDs[mc.Provider]; !ok {
			return fmt.Errorf(
				"resources.models[%d].provider: provider ID %q not found; available providers: %s",
				i, mc.Provider, strings.Join(sortedKeys(providerIDs), ", "),
			)
		}
	}
}
```
We could aggregate errors here too.
```go
} else {
	status.Conditions[i].Reason = reason
	status.Conditions[i].Message = message
}
```
There's no `ObservedGeneration` set on the condition. Without it, a client can't distinguish whether a `ConfigGenerated=True` condition was set for the current spec generation or a previous one. Consider setting `condition.ObservedGeneration = instance.Generation` to match the convention used by most Kubernetes controllers.
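A minimal sketch of the convention (simplified `Condition` type; the real `metav1.Condition` already carries an `ObservedGeneration` field):

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type, Status, Reason, Message string
	ObservedGeneration            int64
}

// setCondition updates or appends a condition, stamping the generation the
// controller actually observed so clients can tell stale conditions apart.
func setCondition(conds []Condition, c Condition, generation int64) []Condition {
	c.ObservedGeneration = generation
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	var conds []Condition
	conds = setCondition(conds, Condition{Type: "ConfigGenerated", Status: "True"}, 3)
	fmt.Println(conds[0].ObservedGeneration) // prints: 3
}
```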
```go
namespace := createTestNamespace(t, "test-v1alpha2-secret-ref")
operatorNamespace := createTestNamespace(t, "test-v1alpha2-secret-op")
t.Setenv("OPERATOR_NAMESPACE", operatorNamespace.Name)

// Create operator config ConfigMap (required by NewLlamaStackDistributionReconciler)
opConfig := &corev1.ConfigMap{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "llama-stack-operator-config",
		Namespace: operatorNamespace.Name,
	},
	Data: map[string]string{},
}
require.NoError(t, k8sClient.Create(t.Context(), opConfig))
```
This is repeated 6 times. Consider writing a helper:

```go
func setupV1Alpha2Env(t *testing.T, prefix string) (ns *corev1.Namespace, opNs *corev1.Namespace)
```

```go
clusterInfo := &cluster.ClusterInfo{
	OperatorNamespace:  operatorNamespace.Name,
	DistributionImages: map[string]string{"starter": testImage},
}
reconciler, err := controllers.NewLlamaStackDistributionReconciler(
	t.Context(), k8sClient, scheme.Scheme, clusterInfo,
)
require.NoError(t, err)
```
This is repeated 5 times. Consider a helper:

```go
func newV1Alpha2Reconciler(t *testing.T, opNamespace string) *controllers.LlamaStackDistributionReconciler
```

```yaml
agents:
- provider_id: meta-reference
  provider_type: inline::meta-reference
```
The agents API has been renamed to responses: llamastack/llama-stack#5195
We probably need to update this in all embedded configs
Force-pushed from 4021d2f to 43e2ce6
@eoinfennessy I have addressed the comments. Additionally, I added a deploy-time feature flag for v1alpha2 (an overlay).
Commit messages:
- Add typed v1alpha2 API (ProvidersSpec, ModelConfig, ExposeConfig, StorageSpec) with kubebuilder validation markers and CEL rules. Implement lossless v1alpha1<->v1alpha2 conversion via JSON-blob annotations for fields that have no v1alpha1 equivalent.
- Implement admission webhook that validates distribution names against the embedded registry, enforces unique provider IDs per category, and checks model provider references. Wire up cert-manager and webhook kustomize overlays.
- Build the config generation pipeline that renders a complete config.yaml from v1alpha2 spec fields (providers, resources, storage). Includes distribution registry, provider expansion, model/tool/shield resource resolution, storage configuration, secret-ref placeholder injection, and disabled-API pruning.
- Wire the config generation pipeline into the reconciliation loop. Adds v1alpha2 config source detection, ConfigMap creation with generated config.yaml, secret env-var injection into pod spec, RBAC permissions for secrets and configmap deletion, and controller-level integration tests.
- Add kustomize overlay for OpenShift deployments that patches the webhook configuration to use the service-serving-cert-signer CA instead of cert-manager, along with SCC-compatible manager patches.
- Add end-to-end tests covering v1alpha2 CR creation, conversion round-trips, webhook validation rejection, secret env-var injection, and TLS configuration. Refactor existing e2e tests into focused test files with shared utilities.
- Reorganize sample CRs into v1alpha1/ and v1alpha2/ subdirectories. Add v1alpha2 sample CRs (vLLM+Postgres, HA, networking), API overview, and v1alpha1-to-v1alpha2 migration guide. Update README with v1alpha2 quick-start examples.
- Update Makefile with webhook cert-manager targets, add go module dependencies for the config pipeline and webhook infrastructure, and regenerate the release operator manifest.
- Add v1alpha1-only overlay for deployment. This allows the v1alpha2 api to be incrementally enabled for GA releases.
- Remove support for eval, safety and related apis.

(All commits: Signed-off-by: Vaishnavi Hire <vhire@redhat.com>, Assisted-by: claude-4.6-opus)
Force-pushed from 43e2ce6 to 2de6d1e
This pull request has merge conflicts that must be resolved before it can be merged. @VaishnaviHire please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
This PR introduces the v1alpha2 API version for the LlamaStackDistribution CRD, enabling declarative, Kubernetes-native configuration of LlamaStack servers. Instead of requiring users to manually craft and supply a config.yaml via ConfigMap (as in v1alpha1), the operator now generates the server configuration automatically from structured CR fields (providers, resources, storage, networking). Both API versions are served concurrently with full conversion webhook support.

v1alpha2 Example

The v1alpha2 API replaces environment-variable-driven configuration with structured, declarative fields. All provider fields use typed []ProviderConfig slices with CEL validation.

Basic: single Ollama provider.

Advanced: vLLM with secret refs, PostgreSQL storage, pgvector.

Review Guide

This is a large PR (86 files, ~18k lines). The sections below group changes by area matching the commit structure. Each section is self-contained and can be reviewed independently.

1. v1alpha2 CRD Schema & Conversion (commit: e7cc5c2)
- api/v1alpha2/llamastackdistribution_types.go: []ProviderConfig slices with CEL validation for provider ID uniqueness. OverrideConfig is mutually exclusive with providers/resources/storage/disabled.
- api/v1alpha2/zz_generated.deepcopy.go
- api/v1alpha1/llamastackdistribution_conversion.go: annotations (annV1Alpha1Extras, annV1Alpha2Extras) for lossless round-trips in both directions.
- api/v1alpha1/llamastackdistribution_conversion_test.go
- config/crd/bases/llamastack.io_llamastackdistributions.yaml

2. Validating Webhook (commit: a45c9f6)
- api/v1alpha2/llamastackdistribution_webhook.go
- api/v1alpha2/llamastackdistribution_webhook_test.go: deriveProviderID behavior, unknown distribution rejection, edge cases
- config/webhook/*, config/certmanager/*
- config/crd/kustomization.yaml, config/default/kustomization.yaml, config/default/manager_webhook_patch.yaml
- main.go

3. Config Generation Pipeline (commit: 3feb572)
- pkg/config/config.go
- pkg/config/provider.go: remote:: prefix, endpoint → base_url, sorted secret ref iteration, settings merge with override protection
- pkg/config/resource.go
- pkg/config/storage.go
- pkg/config/secret_resolver.go: maps secretRefs to env vars (LLSD_<PROVIDER_ID>_<KEY>) and ${env.VAR_NAME} substitutions
- pkg/config/resolver.go
- pkg/config/version.go
- pkg/config/types.go: BaseConfig, ProviderEntry, GeneratedConfig
- pkg/config/config_test.go
- pkg/config/configs/*/config.yaml: starter, starter-gpu, postgres-demo distributions
- distributions.json

4. Controller Integration (commit: bf90527)
- controllers/v1alpha2_config.go
- controllers/llamastackdistribution_controller.go: handleV1Alpha2NativeConfig before standard reconcile; dual status update path for v1alpha2 CRs
- controllers/kubebuilder_rbac.go: secrets (get/list/watch) and configmaps (delete)
- controllers/resource_helper.go: startupScript; sets RUN_CONFIG_PATH env var; image's built-in entrypoint.sh handles startup
- controllers/resource_helper_test.go: RUN_CONFIG_PATH instead of command/args overrides
- controllers/suite_test.go, controllers/testing_support_test.go, controllers/llamastackdistribution_controller_test.go
- config/rbac/role.yaml: ClusterRole with secrets and configmaps-delete permissions

5. OpenShift Webhook Overlay (commit: e3f405b)
- config/openshift/kustomization.yaml
- config/openshift/crd_ca_patch.yaml
- config/openshift/manager_webhook_patch.yaml
- config/openshift/webhook_ca_patch.yaml

6. E2E Tests (commit: c6d24e8)
- tests/e2e/creation_v1alpha2_test.go
- tests/e2e/conversion_test.go
- tests/e2e/webhook_validation_test.go
- tests/e2e/validation_test.go
- tests/e2e/creation_test.go
- tests/e2e/e2e_test.go
- tests/e2e/test_utils.go
- .github/workflows/run-e2e-test.yml

7. Documentation & Samples (commit: 3b6972b)
- docs/migration-v1alpha1-to-v1alpha2.md
- docs/api-overview.md
- README.md: secretRefs and list syntax
- config/samples/v1alpha1/*, config/samples/v1alpha2/*
- specs/002-operator-generated-config/*

8. Build Tooling & Release (commit: b5a2df7)
- Makefile
- .gitignore
- go.mod / go.sum
- release/operator.yaml