Central driver implementation#13171
Conversation
There was a problem hiding this comment.
Pull request overview
This PR is a backup branch for the “central driver POC” work, transitioning KFP v2 driver execution from per-component driver pods to an Argo Workflow executor plugin + sidecar model, plus associated manifests, RBAC, CI, tests, and UI support for persisted driver logs.
Changes:
- Update Argo-compiled workflow “golden” YAMLs to use
plugin: driver-plugintemplates and add hardened security contexts. - Add/patch Kubernetes manifests (kustomize overlays, RBAC, workflow-controller settings) to enable Argo executor plugins and configure the driver plugin sidecar via ConfigMaps.
- Extend backend + frontend to persist driver logs (via MLMD custom properties) and surface them in the UI; adjust tests and CI scripts accordingly.
Reviewed changes
Copilot reviewed 124 out of 209 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| test_data/compiled-workflows/mixed_parameters.yaml | Switch driver execution from container driver pod to driver-plugin and add securityContext hardening in golden YAML. |
| test_data/compiled-workflows/metrics_visualization_v2.yaml | Same as above for metrics visualization golden YAML. |
| test_data/compiled-workflows/lightweight_python_functions_with_outputs.yaml | Same as above for lightweight python functions (outputs) golden YAML. |
| test_data/compiled-workflows/lightweight_python_functions_pipeline.yaml | Same as above for lightweight python functions pipeline golden YAML. |
| test_data/compiled-workflows/iris_pipeline_compiled.yaml | Same as above for iris pipeline golden YAML. |
| test_data/compiled-workflows/if_else_with_oneof_artifacts.yaml | Same as above for if/else (oneof artifacts) golden YAML. |
| test_data/compiled-workflows/if_elif_else_with_oneof_parameters.yaml | Same as above for if/elif/else (oneof parameters) golden YAML. |
| test_data/compiled-workflows/identity.yaml | Same as above for identity golden YAML. |
| test_data/compiled-workflows/hello_world.yaml | Same as above for hello world golden YAML. |
| test_data/compiled-workflows/flip_coin.yaml | Same as above for flip coin golden YAML. |
| test_data/compiled-workflows/embedded_artifact.yaml | Same as above for embedded artifact golden YAML. |
| test_data/compiled-workflows/dict_input.yaml | Same as above for dict input golden YAML. |
| test_data/compiled-workflows/container_io.yaml | Same as above for container IO golden YAML. |
| test_data/compiled-workflows/concat_message.yaml | Same as above for concat message golden YAML. |
| test_data/compiled-workflows/component_with_metadata_fields.yaml | Same as above for component metadata fields golden YAML. |
| test_data/compiled-workflows/artifact_crust.yaml | Same as above for artifact crust golden YAML. |
| test_data/compiled-workflows/artifact_cache.yaml | Same as above for artifact cache golden YAML. |
| test_data/compiled-workflows/arguments_parameters.yaml | Same as above for arguments parameters golden YAML. |
| test_data/compiled-workflows/add_numbers.yaml | Same as above for add numbers golden YAML. |
| manifests/kustomize/third-party/argo/installs/namespace/workflow-controller-argo-role-patch.json | Patch Workflow Controller role to add WorkflowTaskSet status permissions for agent/plugin flow. |
| manifests/kustomize/third-party/argo/base/workflow-controller-deployment-patch.yaml | Enable Argo executor plugins via env vars. |
| manifests/kustomize/third-party/argo/base/workflow-controller-argo-taskset-clusterrole-.yaml | Add cluster-scoped RBAC for WorkflowTaskSet access needed by Argo agent/plugin. |
| manifests/kustomize/third-party/argo/base/kustomization.yaml | Include the new WorkflowTaskSet RBAC resource. |
| manifests/kustomize/third-party/application/application-controller-deployment.yaml | Increase application-controller memory request. |
| manifests/kustomize/env/dev/kustomization.yaml | Pin kfp-driver image tag and add dev overlay patch for driver plugin ConfigMap. |
| manifests/kustomize/env/dev/driver-plugin-cm-path.yaml | Dev overlay driver plugin sidecar ConfigMap (image + env + resources + securityContext). |
| manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-ui-deployment.yaml | Update TLS secret reference to Argo agent CA secret name. |
| manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-scheduledworkflow-deployment.yaml | Same TLS secret rename. |
| manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-persistenceagent-deployment.yaml | Same TLS secret rename. |
| manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-driver-plugin-cm.yaml | TLS overlay patch for driver plugin sidecar ConfigMap (+ cert mount). |
| manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-apiserver-deployment.yaml | Update CABUNDLE secret name + volume secret for TLS. |
| manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/metadata-writer-deployment.yaml | Same TLS secret rename. |
| manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/metadata-grpc-deployment.yaml | Same TLS secret rename. |
| manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/metadata-envoy-deployment.yaml | Same TLS secret rename. |
| manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/kustomization.yaml | Apply driver plugin ConfigMap patch in TLS overlay. |
| manifests/kustomize/env/cert-manager/base-tls-certs/kfp-api-cert.yaml | Rename cert secret to match Argo agent’s required secret name. |
| manifests/kustomize/base/pipeline/pipeline-runner-sa.yaml | Add explicit SA token Secret for pipeline-runner. |
| manifests/kustomize/base/pipeline/pipeline-runner-role.yaml | Expand RBAC for WorkflowTaskSet resources and add executor-plugin service account + roles/bindings. |
| manifests/kustomize/base/pipeline/ml-pipeline-driver-plugin-cm.yaml | Base driver plugin sidecar ConfigMap (ExecutorPlugin type). |
| manifests/kustomize/base/pipeline/kustomization.yaml | Add driver plugin ConfigMap to base resources. |
| manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/sync.py | Add Kubernetes client usage to provision executor-plugin SA and RBAC per namespace. |
| manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/pipelines-profile-controller-admin.yaml | Add admin SA + ClusterRole/Binding for profile controller to manage SA/RBAC. |
| manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/kustomization.yaml | Include the new admin RBAC resource. |
| manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/deployment.yaml | Run profile controller with admin SA; install Python deps at runtime. |
| frontend/src/hooks/queryKeys.ts | Add React Query key for driver/system logs. |
| frontend/src/components/tabs/RuntimeNodeDetailsV2.tsx | Add “System Logs” tab and fetch driver logs artifact via API. |
| backend/test/testutil/kubernetes_utils.go | Formatting cleanup. |
| backend/test/end2end/utils/e2e_utils.go | Capture workflow-controller logs for failed runs and handle agent pod logs. |
| backend/test/end2end/e2e_suite_test.go | Formatting cleanup. |
| backend/test/compiler/utils/workflow_utils.go | Ensure cache settings updates also apply to the driver plugin args. |
| backend/src/v2/metadata/client.go | Persist driver_logs_uri (+ store session info) in execution custom properties. |
| backend/src/v2/driver/root_dag.go | Use context-aware logger and return pipeline from RootDAG. |
| backend/src/v2/driver/k8s_test.go | Ensure tests provide a logger via context. |
| backend/src/v2/driver/k8s.go | Replace glog with context logger; require logger in context for several paths. |
| backend/src/v2/driver/driver_test.go | Ensure tests provide a logger via context. |
| backend/src/v2/driver/driver.go | Change output URI prefix naming used for task-root generation. |
| backend/src/v2/driver/dag.go | Avoid re-fetching pipeline in DAG driver; use context logger. |
| backend/src/v2/driver/container.go | Pass output prefix; write DriverLogURI custom property; use context logger. |
| backend/src/v2/compiler/argocompiler/plugin.go | Add helper to build Argo executor plugin template JSON. |
| backend/src/v2/compiler/argocompiler/dag.go | Switch DAG driver template to Plugin with args map; adjust call sites for error handling. |
| backend/src/v2/compiler/argocompiler/container_test.go | Skip env var test that no longer applies under plugin-based execution. |
| backend/src/v2/compiler/argocompiler/container.go | Switch container driver template from Container to Plugin, including JSONPath outputs. |
| backend/src/v2/compiler/argocompiler/argo.go | Disable Istio sidecar injection on workflow pods; remove driver image/command from compiler state. |
| backend/src/v2/cmd/driver/execution_paths.go | Remove old v2 driver execution paths file (moved). |
| backend/src/driver/main_test.go | Update test expectations for renamed symbols and parsing function name. |
| backend/src/driver/main.go | New driver plugin HTTP server entrypoint and shared helpers. |
| backend/src/driver/execution_paths.go | Re-add ExecutionPaths struct under new driver plugin package. |
| backend/src/driver/api/response.go | Add HTTP response DTOs for driver server/plugin execution. |
| backend/src/driver/api/request.go | Add HTTP request DTOs for driver server/plugin execution. |
| backend/src/common/util/context_logger.go | Introduce context logger (logrus) writing to stdout + file. |
| backend/Dockerfile.driver | Build the new driver plugin binary from backend/src/driver. |
| .github/workflows/legacy-v2-api-integration-tests.yml | Update CA secret name used by CI. |
| .github/workflows/e2e-test.yml | Update CA secret name used by CI. |
| .github/workflows/api-server-tests.yml | Update CA secret name used by CI. |
| .github/resources/scripts/kfp-readiness/wait_for_pods.py | Exclude *-agent pods from readiness filter. |
| .github/resources/scripts/collect-logs.sh | Add agent container log collection and describe workflows. |
| .github/resources/manifests/standalone/tls-enabled/kustomization.yaml | Patch driver plugin ConfigMap in TLS-enabled GH manifests. |
| .github/resources/manifests/standalone/tls-enabled/driver-plugin-cm-path.yaml | Provide driver plugin sidecar ConfigMap for TLS-enabled GH manifests. |
| .github/resources/manifests/standalone/proxy/kustomization.yaml | Patch driver plugin ConfigMap in proxy GH manifests. |
| .github/resources/manifests/standalone/default/kustomization.yaml | Patch driver plugin ConfigMap in default GH manifests. |
| .github/resources/manifests/standalone/cache-disabled/kustomization.yaml | Patch driver plugin ConfigMap in cache-disabled GH manifests. |
| .github/resources/manifests/standalone/cache-disabled-proxy/kustomization.yaml | Patch driver plugin ConfigMap in cache-disabled-proxy GH manifests. |
| .github/resources/manifests/multiuser/default/kustomization.yaml | Patch driver plugin ConfigMap in multiuser GH manifests. |
| .github/resources/manifests/multiuser/cache-disabled/kustomization.yaml | Patch driver plugin ConfigMap in multiuser cache-disabled GH manifests. |
| .github/resources/manifests/multiuser/artifact-proxy/kustomization.yaml | Patch driver plugin ConfigMap in artifact-proxy GH manifests. |
| .github/resources/manifests/kubernetes-native/default/kustomization.yaml | Patch driver plugin ConfigMap in kubernetes-native GH manifests. |
| .github/resources/manifests/base/driver-plugin-cm-path.yaml | Base GH manifest overlay for driver plugin sidecar ConfigMap. |
| command: | ||
| - sh | ||
| - -c | ||
| - | | ||
| python3 -m venv /tmp/venv && \ | ||
| . /tmp/venv/bin/activate && \ | ||
| pip install --no-cache-dir kubernetes boto3 && \ | ||
| python /hooks/sync.py |
There was a problem hiding this comment.
Installing dependencies at container startup is operationally risky:
- Requires outbound network access to PyPI (often blocked in clusters).
- Adds cold-start latency and makes behavior non-reproducible.
- Can fail if
python3,venv, orpipare not present in the base image.
Fix (required): bake kubernetes + boto3 into the image used for the profile controller (or switch to an image that already contains them), and run python /hooks/sync.py directly.
There was a problem hiding this comment.
The metacontroller and pipelines profile controller is different. PPC only provides a json to metacontroller and metacontroller handles the rest.
There was a problem hiding this comment.
There was a problem hiding this comment.
| return execution, err | ||
| } | ||
| ecfg.TaskName = opts.TaskName | ||
| ecfg.Namespace = opts.Namespace |
There was a problem hiding this comment.
ExecutionConfig (in backend/src/v2/metadata/client.go) does not show a Namespace field in the diff, so ecfg.Namespace = opts.Namespace is very likely a compile-time error.
Fix (required): either:
- add a
Namespacefield tometadata.ExecutionConfigand plumb it through (if needed), or - remove this assignment if namespace is already available via
opts.Namespace/ pipeline context.
| ecfg.Namespace = opts.Namespace |
1ad5ac7 to
de96fee
Compare
|
/retest |
2 similar comments
|
/retest |
|
/retest |
de96fee to
0a4db0b
Compare
|
/retest |
da711da to
93b2f1a
Compare
There was a problem hiding this comment.
Hello, I am really concerned about the sync.py changes. Why do you want to create new serviceaccounts? Pipelines usually run as default-editor in Kubeflow. Usually such roles should be aggregated to kubeflow-edit. Adding attidional service accounts i really want to avoid. That will probably conflict with the average multi-user Kubeflow and create technical debt.
There was a problem hiding this comment.
i saw this patch in multiple places. Can we consolidate ?
There was a problem hiding this comment.
I added a few comments. The PipelineProfileController+Metacontroller architecture is different than what you expect your code does not match it.
See https://github.com/kubeflow/pipelines/blob/master/manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/decorator-controller.yaml and https://github.com/kubeflow/pipelines/blob/master/manifests/kustomize/third-party/metacontroller/base/cluster-role.yaml. The pipelines profile controller just provides the input for metacontroller.
You also patch the permissions for "workflowtasksets/status" in many files which adds a lot of code duplication.
The PSS restricted stuff looks good.
ca40a3c to
23456c9
Compare
|
/retest |
1 similar comment
|
/retest |
| executorInputJSON := string(executorInputBytes) | ||
| glog.Infof("output ExecutorInput:%s\n", prettyPrint(executorInputJSON)) | ||
| } | ||
| resp := driverapi.DriverResponse{ |
There was a problem hiding this comment.
i recommend proper long expressive human readable and pronounceable variable names
| resp := driverapi.DriverResponse{ | |
| response := driverapi.DriverResponse{ |
| logID := fmt.Sprintf("%d-%v-%v-%v", time.Now().UnixMilli(), args.IterationIndex, args.Type, args.TaskName) | ||
| logDir := "/kfp/log" | ||
| logFile := fmt.Sprintf("%s/%s.log", logDir, logID) | ||
| ctx, f, err := util.WithLogger(context.Background(), logFile) |
There was a problem hiding this comment.
| ctx, f, err := util.WithLogger(context.Background(), logFile) | |
| context, file, error := util.WithLogger(context.Background(), logFile) |
There was a problem hiding this comment.
Can we specify the new executor plugin image location somewhere?
| c.templates[name] = template | ||
| c.wf.Spec.Templates = append(c.wf.Spec.Templates, *template) | ||
| return name | ||
| return name, err |
There was a problem hiding this comment.
i recommend proper long expressive human readable and pronounceable variable names
| return name, err | |
| return name, error |
| func (c *workflowCompiler) addContainerDriverTemplate() string { | ||
| // Create the Argo Workflow executor plugin template for the container driver. | ||
| // See https://argo-workflows.readthedocs.io/en/latest/executor_plugins/ | ||
| func (c *workflowCompiler) addContainerDriverTemplate() (string, error) { |
There was a problem hiding this comment.
| func (c *workflowCompiler) addContainerDriverTemplate() (string, error) { | |
| func (compiler *workflowCompiler) addContainerDriverTemplate() (string, error) { |
i recommend proper long expressive human readable and pronounceable variable names
| } | ||
|
|
||
| name := c.addContainerDriverTemplate() | ||
| name, err := c.addContainerDriverTemplate() |
There was a problem hiding this comment.
i recommend proper long expressive human readable and pronounceable variable names
| name, err := c.addContainerDriverTemplate() | |
| name, error := c.addContainerDriverTemplate() |
| func (c *workflowCompiler) addDAGDriverTemplate() string { | ||
| // Create the Argo Workflow executor plugin template for the dag driver. | ||
| // See https://argo-workflows.readthedocs.io/en/latest/executor_plugins/ | ||
| func (c *workflowCompiler) addDAGDriverTemplate() (string, error) { |
There was a problem hiding this comment.
| func (c *workflowCompiler) addDAGDriverTemplate() (string, error) { | |
| func (compiler *workflowCompiler) addDAGDriverTemplate() (string, error) { |
i recommend proper long expressive human readable and pronounceable variable names
| "args": params, | ||
| }, | ||
| } | ||
| jsonConfig, err := json.Marshal(pluginConfig) |
There was a problem hiding this comment.
| jsonConfig, err := json.Marshal(pluginConfig) | |
| jsonConfig, error := json.Marshal(pluginConfig) |
i recommend proper long expressive human readable and pronounceable variable names
| Parameters: map[string]*pipelinespec.ComponentOutputsSpec_ParameterSpec{"output": {ParameterType: pipelinespec.ParameterType_STRING}}, | ||
| pipeline := &metadata.Pipeline{} | ||
|
|
||
| execution, err := Container(util.WithExistingLogger(context.Background(), logrus.New()), |
There was a problem hiding this comment.
| execution, err := Container(util.WithExistingLogger(context.Background(), logrus.New()), | |
| execution, error := Container(util.WithExistingLogger(context.Background(), logrus.New()), |
i recommend proper long expressive human readable and pronounceable variable names
| } | ||
|
|
||
| func DAG(ctx context.Context, opts Options, mlmd *metadata.Client) (execution *Execution, err error) { | ||
| func DAG(ctx context.Context, pipeline *metadata.Pipeline, opts Options, mlmd *metadata.Client) (execution *Execution, err error) { |
There was a problem hiding this comment.
| func DAG(ctx context.Context, pipeline *metadata.Pipeline, opts Options, mlmd *metadata.Client) (execution *Execution, err error) { | |
| func DAG(context context.Context, pipeline *metadata.Pipeline, opts Options, mlmd *metadata.Client) (execution *Execution, error error) { |
i recommend proper long expressive human readable and pronounceable variable names
| ) | ||
|
|
||
| // CreateExecution creates a new MLMD execution under the specified Pipeline. | ||
| func (c *Client) CreateExecution(ctx context.Context, pipeline *Pipeline, config *ExecutionConfig) (*Execution, error) { |
There was a problem hiding this comment.
| func (client *Client) CreateExecution(ctx context.Context, pipeline *Pipeline, config *ExecutionConfig) (*Execution, error) { |
i recommend proper long expressive human readable and pronounceable variable names
There was a problem hiding this comment.
This looks like a duplicate of other files
There was a problem hiding this comment.
This looks like a duplicate
There was a problem hiding this comment.
This looks like a duplicate
|
Hello @ntny , I resolved many threads and added new comments. I think the main thing is the excessive manifests duplication and patch duplication (10-15 duplicates) which should be in one of the bases and then the unreadable code that deviates a lot from proper long expressive, contextless, readable, pronounceable variable naming. |
Hi @juliusvonkohout , thanks for the review. I looked into the kustomize component approach you suggested. The limitation is that Argo executor Because of that, a component can help reuse/apply the ConfigMap patch, but it still cannot patch only A cleaner way to remove this duplication is to move executor plugin settings into the Argo Workflow spec, Small update: during testing, @droctothorpe found failures on very large parallel pipelines. I investiga As far as I know, the central driver work is currently postponed until the remove-MLMD release, since |
…and eliminate per-component driver pods Previously, each driver component was executed as a separate Kubernetes pod, leading to significant overhead in pod scheduling, resource consumption, and control plane load, especially for pipelines with many small components. This change replaces the pod-per-driver model with a centralized driver invoked via an Argo Workflow executor plugin using a sidecar-based execution model. A new driver service is introduced and exposed via an HTTP API, allowing Argo to delegate driver execution without spawning dedicated pods. Key changes: - Remove pod-per-driver execution model - Introduce centralized driver implemented as an Argo executor plugin - Add driver-plugin sidecar configuration via Kubernetes ConfigMaps - Add new driver implementation (backend/src/driver) with HTTP API - Add required Kubernetes manifests, including RBAC roles and bindings, for the plugin - Integrate plugin configuration into all kustomize deployment variants - Update CI, TLS handling, and debugging scripts to support plugin-based execution - Improve logging with context-aware driver logger - Persist driver logs to S3 and expose them in the UI Benefits: - Eliminates driver pod churn and reduces scheduling overhead - Reduces pressure on Kubernetes control plane (API server, scheduler, etcd) - Improves execution latency - Enables more efficient and scalable pipeline execution - Provides better observability: driver logs are now available in the UI and persistently stored in S3 This change introduces a new execution model for driver components and lays the groundwork for further optimizations in pipeline runtime. Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
- fix rbac granting - drop redundant patch Signed-off-by: arpechenin <arpechenin@avito.ru>
- rewrite tests - move updated logic from the kubeflow@f8ee99c#diff-937c57a877a615c510d37e38309f0b159c85595d2dfa908d9e9ef05c227064da Signed-off-by: arpechenin <arpechenin@avito.ru>
Replaced the namespace-scoped Role with a ClusterRole to avoid creating the same role in every namespace. Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
8c4d9c5 to
e7e61b8
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
- added explanation for list pods RBAC - enhanced post-e2e test debugging - tweaked driver resources - added comments clarifying why extra RBAC is required - fixed images to match master - removed secret duplication and hardcoded namespace - update compiler golden files Signed-off-by: arpechenin <arpechenin@avito.ru>
e7e61b8 to
e85dac2
Compare
backup: save central-driver-poc branch before merge
This branch is a backup of central-driver-poc before merging master.
The merge caused too many conflicts for a clean rebase, so the diffs
were extracted into this backup branch to preserve all work safely.
All discussions and comments remain in the original branch.
POC for #12023
Resolves: #12642
feat(backend): introduce centralized driver via Argo executor plugin and eliminate per-component driver pods
Previously, each driver component was executed as a separate Kubernetes pod,
leading to significant overhead in pod scheduling, resource consumption,
and control plane load, especially for pipelines with many small components.
This change replaces the pod-per-driver model with a centralized driver invoked via an Argo Workflow executor plugin using a sidecar-based execution model.
A new driver service is introduced and exposed via an HTTP API, allowing Argo to delegate driver execution without spawning dedicated pods.
Key changes:
Benefits:
This change introduces a new execution model for driver components and lays the groundwork for further optimizations in pipeline runtime.
Description of your changes:
Checklist: