Skip to content

Central driver implementation#13171

Open
ntny wants to merge 11 commits into
kubeflow:masterfrom
ntny:central-driver-poc-backup
Open

Central driver implementation#13171
ntny wants to merge 11 commits into
kubeflow:masterfrom
ntny:central-driver-poc-backup

Conversation

@ntny
Copy link
Copy Markdown
Contributor

@ntny ntny commented Mar 29, 2026

backup: save central-driver-poc branch before merge
This branch is a backup of central-driver-poc before merging master.
The merge caused too many conflicts for a clean rebase, so the diffs
were extracted into this backup branch to preserve all work safely.
All discussions and comments remain in the original branch.

POC for #12023

Resolves: #12642

feat(backend): introduce centralized driver via Argo executor plugin and eliminate per-component driver pods

Previously, each driver component was executed as a separate Kubernetes pod,
leading to significant overhead in pod scheduling, resource consumption,
and control plane load, especially for pipelines with many small components.

This change replaces the pod-per-driver model with a centralized driver invoked via an Argo Workflow executor plugin using a sidecar-based execution model.

A new driver service is introduced and exposed via an HTTP API, allowing Argo to delegate driver execution without spawning dedicated pods.

Key changes:

  • Remove pod-per-driver execution model
  • Introduce centralized driver implemented as an Argo executor plugin
  • Add driver-plugin sidecar configuration via Kubernetes ConfigMaps
  • Add new driver implementation (backend/src/driver) with HTTP API
  • Add required Kubernetes manifests, including RBAC roles and bindings, for the plugin
  • Integrate plugin configuration into all kustomize deployment variants
  • Update CI, TLS handling, and debugging scripts to support plugin-based execution
  • Improve logging with context-aware driver logger
  • Persist driver logs to S3 and expose them in the UI

Benefits:

  • Eliminates driver pod churn and reduces scheduling overhead
  • Reduces pressure on Kubernetes control plane (API server, scheduler, etcd)
  • Improves execution latency
  • Enables more efficient and scalable pipeline execution
  • Provides better observability: driver logs are now available in the UI and persistently stored in S3

This change introduces a new execution model for driver components and lays the groundwork for further optimizations in pipeline runtime.

Description of your changes:

Checklist:

Copilot AI review requested due to automatic review settings March 29, 2026 20:29
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR is a backup branch for the “central driver POC” work, transitioning KFP v2 driver execution from per-component driver pods to an Argo Workflow executor plugin + sidecar model, plus associated manifests, RBAC, CI, tests, and UI support for persisted driver logs.

Changes:

  • Update Argo-compiled workflow “golden” YAMLs to use plugin: driver-plugin templates and add hardened security contexts.
  • Add/patch Kubernetes manifests (kustomize overlays, RBAC, workflow-controller settings) to enable Argo executor plugins and configure the driver plugin sidecar via ConfigMaps.
  • Extend backend + frontend to persist driver logs (via MLMD custom properties) and surface them in the UI; adjust tests and CI scripts accordingly.

Reviewed changes

Copilot reviewed 124 out of 209 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
test_data/compiled-workflows/mixed_parameters.yaml Switch driver execution from container driver pod to driver-plugin and add securityContext hardening in golden YAML.
test_data/compiled-workflows/metrics_visualization_v2.yaml Same as above for metrics visualization golden YAML.
test_data/compiled-workflows/lightweight_python_functions_with_outputs.yaml Same as above for lightweight python functions (outputs) golden YAML.
test_data/compiled-workflows/lightweight_python_functions_pipeline.yaml Same as above for lightweight python functions pipeline golden YAML.
test_data/compiled-workflows/iris_pipeline_compiled.yaml Same as above for iris pipeline golden YAML.
test_data/compiled-workflows/if_else_with_oneof_artifacts.yaml Same as above for if/else (oneof artifacts) golden YAML.
test_data/compiled-workflows/if_elif_else_with_oneof_parameters.yaml Same as above for if/elif/else (oneof parameters) golden YAML.
test_data/compiled-workflows/identity.yaml Same as above for identity golden YAML.
test_data/compiled-workflows/hello_world.yaml Same as above for hello world golden YAML.
test_data/compiled-workflows/flip_coin.yaml Same as above for flip coin golden YAML.
test_data/compiled-workflows/embedded_artifact.yaml Same as above for embedded artifact golden YAML.
test_data/compiled-workflows/dict_input.yaml Same as above for dict input golden YAML.
test_data/compiled-workflows/container_io.yaml Same as above for container IO golden YAML.
test_data/compiled-workflows/concat_message.yaml Same as above for concat message golden YAML.
test_data/compiled-workflows/component_with_metadata_fields.yaml Same as above for component metadata fields golden YAML.
test_data/compiled-workflows/artifact_crust.yaml Same as above for artifact crust golden YAML.
test_data/compiled-workflows/artifact_cache.yaml Same as above for artifact cache golden YAML.
test_data/compiled-workflows/arguments_parameters.yaml Same as above for arguments parameters golden YAML.
test_data/compiled-workflows/add_numbers.yaml Same as above for add numbers golden YAML.
manifests/kustomize/third-party/argo/installs/namespace/workflow-controller-argo-role-patch.json Patch Workflow Controller role to add WorkflowTaskSet status permissions for agent/plugin flow.
manifests/kustomize/third-party/argo/base/workflow-controller-deployment-patch.yaml Enable Argo executor plugins via env vars.
manifests/kustomize/third-party/argo/base/workflow-controller-argo-taskset-clusterrole-.yaml Add cluster-scoped RBAC for WorkflowTaskSet access needed by Argo agent/plugin.
manifests/kustomize/third-party/argo/base/kustomization.yaml Include the new WorkflowTaskSet RBAC resource.
manifests/kustomize/third-party/application/application-controller-deployment.yaml Increase application-controller memory request.
manifests/kustomize/env/dev/kustomization.yaml Pin kfp-driver image tag and add dev overlay patch for driver plugin ConfigMap.
manifests/kustomize/env/dev/driver-plugin-cm-path.yaml Dev overlay driver plugin sidecar ConfigMap (image + env + resources + securityContext).
manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-ui-deployment.yaml Update TLS secret reference to Argo agent CA secret name.
manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-scheduledworkflow-deployment.yaml Same TLS secret rename.
manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-persistenceagent-deployment.yaml Same TLS secret rename.
manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-driver-plugin-cm.yaml TLS overlay patch for driver plugin sidecar ConfigMap (+ cert mount).
manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-apiserver-deployment.yaml Update CABUNDLE secret name + volume secret for TLS.
manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/metadata-writer-deployment.yaml Same TLS secret rename.
manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/metadata-grpc-deployment.yaml Same TLS secret rename.
manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/metadata-envoy-deployment.yaml Same TLS secret rename.
manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/kustomization.yaml Apply driver plugin ConfigMap patch in TLS overlay.
manifests/kustomize/env/cert-manager/base-tls-certs/kfp-api-cert.yaml Rename cert secret to match Argo agent’s required secret name.
manifests/kustomize/base/pipeline/pipeline-runner-sa.yaml Add explicit SA token Secret for pipeline-runner.
manifests/kustomize/base/pipeline/pipeline-runner-role.yaml Expand RBAC for WorkflowTaskSet resources and add executor-plugin service account + roles/bindings.
manifests/kustomize/base/pipeline/ml-pipeline-driver-plugin-cm.yaml Base driver plugin sidecar ConfigMap (ExecutorPlugin type).
manifests/kustomize/base/pipeline/kustomization.yaml Add driver plugin ConfigMap to base resources.
manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/sync.py Add Kubernetes client usage to provision executor-plugin SA and RBAC per namespace.
manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/pipelines-profile-controller-admin.yaml Add admin SA + ClusterRole/Binding for profile controller to manage SA/RBAC.
manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/kustomization.yaml Include the new admin RBAC resource.
manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/deployment.yaml Run profile controller with admin SA; install Python deps at runtime.
frontend/src/hooks/queryKeys.ts Add React Query key for driver/system logs.
frontend/src/components/tabs/RuntimeNodeDetailsV2.tsx Add “System Logs” tab and fetch driver logs artifact via API.
backend/test/testutil/kubernetes_utils.go Formatting cleanup.
backend/test/end2end/utils/e2e_utils.go Capture workflow-controller logs for failed runs and handle agent pod logs.
backend/test/end2end/e2e_suite_test.go Formatting cleanup.
backend/test/compiler/utils/workflow_utils.go Ensure cache settings updates also apply to the driver plugin args.
backend/src/v2/metadata/client.go Persist driver_logs_uri (+ store session info) in execution custom properties.
backend/src/v2/driver/root_dag.go Use context-aware logger and return pipeline from RootDAG.
backend/src/v2/driver/k8s_test.go Ensure tests provide a logger via context.
backend/src/v2/driver/k8s.go Replace glog with context logger; require logger in context for several paths.
backend/src/v2/driver/driver_test.go Ensure tests provide a logger via context.
backend/src/v2/driver/driver.go Change output URI prefix naming used for task-root generation.
backend/src/v2/driver/dag.go Avoid re-fetching pipeline in DAG driver; use context logger.
backend/src/v2/driver/container.go Pass output prefix; write DriverLogURI custom property; use context logger.
backend/src/v2/compiler/argocompiler/plugin.go Add helper to build Argo executor plugin template JSON.
backend/src/v2/compiler/argocompiler/dag.go Switch DAG driver template to Plugin with args map; adjust call sites for error handling.
backend/src/v2/compiler/argocompiler/container_test.go Skip env var test that no longer applies under plugin-based execution.
backend/src/v2/compiler/argocompiler/container.go Switch container driver template from Container to Plugin, including JSONPath outputs.
backend/src/v2/compiler/argocompiler/argo.go Disable Istio sidecar injection on workflow pods; remove driver image/command from compiler state.
backend/src/v2/cmd/driver/execution_paths.go Remove old v2 driver execution paths file (moved).
backend/src/driver/main_test.go Update test expectations for renamed symbols and parsing function name.
backend/src/driver/main.go New driver plugin HTTP server entrypoint and shared helpers.
backend/src/driver/execution_paths.go Re-add ExecutionPaths struct under new driver plugin package.
backend/src/driver/api/response.go Add HTTP response DTOs for driver server/plugin execution.
backend/src/driver/api/request.go Add HTTP request DTOs for driver server/plugin execution.
backend/src/common/util/context_logger.go Introduce context logger (logrus) writing to stdout + file.
backend/Dockerfile.driver Build the new driver plugin binary from backend/src/driver.
.github/workflows/legacy-v2-api-integration-tests.yml Update CA secret name used by CI.
.github/workflows/e2e-test.yml Update CA secret name used by CI.
.github/workflows/api-server-tests.yml Update CA secret name used by CI.
.github/resources/scripts/kfp-readiness/wait_for_pods.py Exclude *-agent pods from readiness filter.
.github/resources/scripts/collect-logs.sh Add agent container log collection and describe workflows.
.github/resources/manifests/standalone/tls-enabled/kustomization.yaml Patch driver plugin ConfigMap in TLS-enabled GH manifests.
.github/resources/manifests/standalone/tls-enabled/driver-plugin-cm-path.yaml Provide driver plugin sidecar ConfigMap for TLS-enabled GH manifests.
.github/resources/manifests/standalone/proxy/kustomization.yaml Patch driver plugin ConfigMap in proxy GH manifests.
.github/resources/manifests/standalone/default/kustomization.yaml Patch driver plugin ConfigMap in default GH manifests.
.github/resources/manifests/standalone/cache-disabled/kustomization.yaml Patch driver plugin ConfigMap in cache-disabled GH manifests.
.github/resources/manifests/standalone/cache-disabled-proxy/kustomization.yaml Patch driver plugin ConfigMap in cache-disabled-proxy GH manifests.
.github/resources/manifests/multiuser/default/kustomization.yaml Patch driver plugin ConfigMap in multiuser GH manifests.
.github/resources/manifests/multiuser/cache-disabled/kustomization.yaml Patch driver plugin ConfigMap in multiuser cache-disabled GH manifests.
.github/resources/manifests/multiuser/artifact-proxy/kustomization.yaml Patch driver plugin ConfigMap in artifact-proxy GH manifests.
.github/resources/manifests/kubernetes-native/default/kustomization.yaml Patch driver plugin ConfigMap in kubernetes-native GH manifests.
.github/resources/manifests/base/driver-plugin-cm-path.yaml Base GH manifest overlay for driver plugin sidecar ConfigMap.

Comment thread manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/sync.py Outdated
Comment on lines +31 to +38
command:
- sh
- -c
- |
python3 -m venv /tmp/venv && \
. /tmp/venv/bin/activate && \
pip install --no-cache-dir kubernetes boto3 && \
python /hooks/sync.py
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Installing dependencies at container startup is operationally risky:

  • Requires outbound network access to PyPI (often blocked in clusters).
  • Adds cold-start latency and makes behavior non-reproducible.
  • Can fail if python3, venv, or pip are not present in the base image.

Fix (required): bake kubernetes + boto3 into the image used for the profile controller (or switch to an image that already contains them), and run python /hooks/sync.py directly.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully addressed by ntny#2

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metacontroller and pipelines profile controller is different. PPC only provides a json to metacontroller and metacontroller handles the rest.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return execution, err
}
ecfg.TaskName = opts.TaskName
ecfg.Namespace = opts.Namespace
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExecutionConfig (in backend/src/v2/metadata/client.go) does not show a Namespace field in the diff, so ecfg.Namespace = opts.Namespace is very likely a compile-time error.

Fix (required): either:

  • add a Namespace field to metadata.ExecutionConfig and plumb it through (if needed), or
  • remove this assignment if namespace is already available via opts.Namespace / pipeline context.
Suggested change
ecfg.Namespace = opts.Namespace

Copilot uses AI. Check for mistakes.
Comment thread frontend/src/components/tabs/RuntimeNodeDetailsV2.tsx
Comment thread frontend/src/hooks/queryKeys.ts Outdated
@ntny ntny mentioned this pull request Mar 29, 2026
@ntny ntny force-pushed the central-driver-poc-backup branch from 1ad5ac7 to de96fee Compare March 29, 2026 21:13
@ntny
Copy link
Copy Markdown
Contributor Author

ntny commented Mar 29, 2026

/retest

2 similar comments
@ntny
Copy link
Copy Markdown
Contributor Author

ntny commented Mar 29, 2026

/retest

@ntny
Copy link
Copy Markdown
Contributor Author

ntny commented Mar 29, 2026

/retest

@droctothorpe droctothorpe force-pushed the central-driver-poc-backup branch from de96fee to 0a4db0b Compare March 30, 2026 11:04
@ntny
Copy link
Copy Markdown
Contributor Author

ntny commented Mar 30, 2026

/retest

@ntny ntny changed the title Central driver POC backup Central driver implementation Mar 30, 2026
@ntny ntny force-pushed the central-driver-poc-backup branch from da711da to 93b2f1a Compare March 31, 2026 08:03
Copy link
Copy Markdown
Member

@juliusvonkohout juliusvonkohout left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello, I am really concerned about the sync.py changes. Why do you want to create new serviceaccounts? Pipelines usually run as default-editor in Kubeflow. Usually such roles should be aggregated to kubeflow-edit. Adding attidional service accounts i really want to avoid. That will probably conflict with the average multi-user Kubeflow and create technical debt.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i saw this patch in multiple places. Can we consolidate ?

Copy link
Copy Markdown
Member

@juliusvonkohout juliusvonkohout left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few comments. The PipelineProfileController+Metacontroller architecture is different than what you expect your code does not match it.

See https://github.com/kubeflow/pipelines/blob/master/manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/decorator-controller.yaml and https://github.com/kubeflow/pipelines/blob/master/manifests/kustomize/third-party/metacontroller/base/cluster-role.yaml. The pipelines profile controller just provides the input for metacontroller.

You also patch the permissions for "workflowtasksets/status" in many files which adds a lot of code duplication.

The PSS restricted stuff looks good.

@ntny ntny force-pushed the central-driver-poc-backup branch 3 times, most recently from ca40a3c to 23456c9 Compare April 2, 2026 06:33
@ntny
Copy link
Copy Markdown
Contributor Author

ntny commented Apr 2, 2026

/retest

1 similar comment
@ntny
Copy link
Copy Markdown
Contributor Author

ntny commented Apr 2, 2026

/retest

@zazulam zazulam requested a review from juliusvonkohout April 3, 2026 21:14
executorInputJSON := string(executorInputBytes)
glog.Infof("output ExecutorInput:%s\n", prettyPrint(executorInputJSON))
}
resp := driverapi.DriverResponse{
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i recommend proper long expressive human readable and pronounceable variable names

Suggested change
resp := driverapi.DriverResponse{
response := driverapi.DriverResponse{

logID := fmt.Sprintf("%d-%v-%v-%v", time.Now().UnixMilli(), args.IterationIndex, args.Type, args.TaskName)
logDir := "/kfp/log"
logFile := fmt.Sprintf("%s/%s.log", logDir, logID)
ctx, f, err := util.WithLogger(context.Background(), logFile)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ctx, f, err := util.WithLogger(context.Background(), logFile)
context, file, error := util.WithLogger(context.Background(), logFile)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we specify the new executor plugin image location somewhere?

c.templates[name] = template
c.wf.Spec.Templates = append(c.wf.Spec.Templates, *template)
return name
return name, err
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i recommend proper long expressive human readable and pronounceable variable names

Suggested change
return name, err
return name, error

func (c *workflowCompiler) addContainerDriverTemplate() string {
// Create the Argo Workflow executor plugin template for the container driver.
// See https://argo-workflows.readthedocs.io/en/latest/executor_plugins/
func (c *workflowCompiler) addContainerDriverTemplate() (string, error) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (c *workflowCompiler) addContainerDriverTemplate() (string, error) {
func (compiler *workflowCompiler) addContainerDriverTemplate() (string, error) {

i recommend proper long expressive human readable and pronounceable variable names

}

name := c.addContainerDriverTemplate()
name, err := c.addContainerDriverTemplate()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i recommend proper long expressive human readable and pronounceable variable names

Suggested change
name, err := c.addContainerDriverTemplate()
name, error := c.addContainerDriverTemplate()

func (c *workflowCompiler) addDAGDriverTemplate() string {
// Create the Argo Workflow executor plugin template for the dag driver.
// See https://argo-workflows.readthedocs.io/en/latest/executor_plugins/
func (c *workflowCompiler) addDAGDriverTemplate() (string, error) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (c *workflowCompiler) addDAGDriverTemplate() (string, error) {
func (compiler *workflowCompiler) addDAGDriverTemplate() (string, error) {

i recommend proper long expressive human readable and pronounceable variable names

"args": params,
},
}
jsonConfig, err := json.Marshal(pluginConfig)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
jsonConfig, err := json.Marshal(pluginConfig)
jsonConfig, error := json.Marshal(pluginConfig)

i recommend proper long expressive human readable and pronounceable variable names

Parameters: map[string]*pipelinespec.ComponentOutputsSpec_ParameterSpec{"output": {ParameterType: pipelinespec.ParameterType_STRING}},
pipeline := &metadata.Pipeline{}

execution, err := Container(util.WithExistingLogger(context.Background(), logrus.New()),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
execution, err := Container(util.WithExistingLogger(context.Background(), logrus.New()),
execution, error := Container(util.WithExistingLogger(context.Background(), logrus.New()),

i recommend proper long expressive human readable and pronounceable variable names

}

func DAG(ctx context.Context, opts Options, mlmd *metadata.Client) (execution *Execution, err error) {
func DAG(ctx context.Context, pipeline *metadata.Pipeline, opts Options, mlmd *metadata.Client) (execution *Execution, err error) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func DAG(ctx context.Context, pipeline *metadata.Pipeline, opts Options, mlmd *metadata.Client) (execution *Execution, err error) {
func DAG(context context.Context, pipeline *metadata.Pipeline, opts Options, mlmd *metadata.Client) (execution *Execution, error error) {

i recommend proper long expressive human readable and pronounceable variable names

)

// CreateExecution creates a new MLMD execution under the specified Pipeline.
func (c *Client) CreateExecution(ctx context.Context, pipeline *Pipeline, config *ExecutionConfig) (*Execution, error) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (client *Client) CreateExecution(ctx context.Context, pipeline *Pipeline, config *ExecutionConfig) (*Execution, error) {

i recommend proper long expressive human readable and pronounceable variable names

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a duplicate of other files

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a duplicate

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a duplicate

@juliusvonkohout
Copy link
Copy Markdown
Member

Hello @ntny , I resolved many threads and added new comments. I think the main thing is the excessive manifests duplication and patch duplication (10-15 duplicates) which should be in one of the bases and then the unreadable code that deviates a lot from proper long expressive, contextless, readable, pronounceable variable naming.

@ntny
Copy link
Copy Markdown
Contributor Author

ntny commented May 16, 2026

Hello @ntny , I resolved many threads and added new comments. I think the main thing is the excessive manifests duplication and patch duplication (10-15 duplicates) which should be in one of the bases and then the unreadable code that deviates a lot from proper long expressive, contextless, readable, pronounceable variable naming.

Hi @juliusvonkohout , thanks for the review.

I looked into the kustomize component approach you suggested. The limitation is that Argo executor
plugins are currently configured through a ConfigMap, not a structured CRD. Kubernetes parses
ConfigMap.data as map[string]string, and Argo then unmarshals data["sidecar.container"] from that
string into the container spec. So the final manifest must keep sidecar.container as a YAML string.

Because of that, a component can help reuse/apply the ConfigMap patch, but it still cannot patch only
image or volumeMounts inside sidecar.container as structured fields. I also tried
configMapGenerator with a separate driver-plugin-sidecar.yaml it reduces some boilerplate, but
image-only variants still need to replace the whole string, so the main duplication remains while
kustomize gets more complex.

A cleaner way to remove this duplication is to move executor plugin settings into the Argo Workflow spec,
where the sidecar is structured YAML. I have an upstream Argo PR for that here:
argoproj/argo-workflows#15440

Small update: during testing, @droctothorpe found failures on very large parallel pipelines. I investiga
ted and found this is an issue in Argo’s executor plugin implementation. I also opened fixes upstream, c
urrently waiting for review: argoproj/argo-workflows#16075

As far as I know, the central driver work is currently postponed until the remove-MLMD release, since
that work should significantly reduce workflow size. @droctothorpe and @HumairAK can share more details on that..
If Imanage to finish and merge the workflow-level executor plugin configuration in Argo before that, we may be able to remove this duplication before merging this work.

arpechenin added 10 commits May 30, 2026 17:22
…and eliminate per-component driver pods

Previously, each driver component was executed as a separate Kubernetes pod,
leading to significant overhead in pod scheduling, resource consumption,
and control plane load, especially for pipelines with many small components.

This change replaces the pod-per-driver model with a centralized driver
invoked via an Argo Workflow executor plugin using a sidecar-based execution model.

A new driver service is introduced and exposed via an HTTP API, allowing
Argo to delegate driver execution without spawning dedicated pods.

Key changes:
- Remove pod-per-driver execution model
- Introduce centralized driver implemented as an Argo executor plugin
- Add driver-plugin sidecar configuration via Kubernetes ConfigMaps
- Add new driver implementation (backend/src/driver) with HTTP API
- Add required Kubernetes manifests, including RBAC roles and bindings, for the plugin
- Integrate plugin configuration into all kustomize deployment variants
- Update CI, TLS handling, and debugging scripts to support plugin-based execution
- Improve logging with context-aware driver logger
- Persist driver logs to S3 and expose them in the UI

Benefits:
- Eliminates driver pod churn and reduces scheduling overhead
- Reduces pressure on Kubernetes control plane (API server, scheduler, etcd)
- Improves execution latency
- Enables more efficient and scalable pipeline execution
- Provides better observability: driver logs are now available in the UI and persistently stored in S3

This change introduces a new execution model for driver components
and lays the groundwork for further optimizations in pipeline runtime.

Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
- fix rbac granting
- drop redundant patch

Signed-off-by: arpechenin <arpechenin@avito.ru>
- rewrite tests
- move updated logic from the kubeflow@f8ee99c#diff-937c57a877a615c510d37e38309f0b159c85595d2dfa908d9e9ef05c227064da

Signed-off-by: arpechenin <arpechenin@avito.ru>
Replaced the namespace-scoped Role with a ClusterRole to avoid creating the same role in every namespace.

Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
@ntny ntny force-pushed the central-driver-poc-backup branch from 8c4d9c5 to e7e61b8 Compare May 30, 2026 14:23
@google-oss-prow
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hbelmiro, humairak for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

- added explanation for list pods RBAC
- enhanced post-e2e test debugging
- tweaked driver resources
- added comments clarifying why extra RBAC is required
- fixed images to match master
- removed secret duplication and hardcoded namespace
- update compiler golden files

Signed-off-by: arpechenin <arpechenin@avito.ru>
@ntny ntny force-pushed the central-driver-poc-backup branch from e7e61b8 to e85dac2 Compare May 30, 2026 16:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(backend): Reduce KFP Driver pod footprint

5 participants