[WIP/PoC] Centralize KFP driver as standalone ml-pipeline-driver service #2

Open
droctothorpe wants to merge 1 commit into ntny:central-driver-poc-backup from droctothorpe:central-driver-poc-backup

@droctothorpe commented Mar 30, 2026

⚠️ PROOF OF CONCEPT — NOT READY FOR MERGE ⚠️

This PR is a work-in-progress exploration. It is intentionally incomplete,
contains known gaps, and has not passed full CI. It is being shared for
early design feedback only.


Overview

This PoC explores moving the KFP driver from a per-pod Argo executor plugin
sidecar (injected into every workflow pod) to a dedicated, centrally running
Kubernetes Deployment (ml-pipeline-driver). Argo Workflows communicates with
the driver via HTTP address mode rather than injecting a sidecar into each
task pod.

Motivation

| Problem with sidecar model | Centralized model benefit |
| --- | --- |
| Driver binary is injected into every workflow pod | Single driver Deployment: scale, update, and secure independently |
| Namespace resolved from in-pod service-account file | Namespace extracted from Argo request body, which is correct in multi-user deployments |
| Driver logs end up in Argo shared volume (/kfp/log) | Driver logs go to stderr → visible via kubectl logs |
| RBAC bound to per-pod SA (ml-pipeline-driver-agent-executor-plugin) | RBAC bound to the single ml-pipeline SA |
| Per-namespace SA + token secret creation in profile controller | Profile controller no longer manages driver credentials |

What changed

backend/src/driver/

  • Promoted from package main to importable package driver.
  • Merged rpc_handler.go + helpers into handler.go; removed standalone server files (main.go, execution_paths.go).
  • Namespace now extracted from the Argo request body (workflow.metadata.namespace / objectMeta.namespace).
  • Driver log files written to os.TempDir() instead of the Argo shared /kfp/log volume.
  • Added handler_test.go.
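
As a rough illustration of the namespace change above, here is a minimal sketch of pulling the namespace out of the request body instead of reading the in-pod service-account file. The struct shape, the nesting of objectMeta under workflow, and the name namespaceFromRequest are assumptions made for this example; the actual handler.go may decode the body differently.

```go
package main

import (
	"encoding/json"
	"errors"
)

// executeRequest mirrors only the fields this sketch needs from the JSON body
// that Argo posts to the driver. The exact shape is an assumption.
type executeRequest struct {
	Workflow struct {
		Metadata struct {
			Namespace string `json:"namespace"`
		} `json:"metadata"`
		ObjectMeta struct {
			Namespace string `json:"namespace"`
		} `json:"objectMeta"`
	} `json:"workflow"`
}

// namespaceFromRequest returns the workflow namespace carried in the request
// body. It checks workflow.metadata.namespace first and falls back to
// workflow.objectMeta.namespace; it returns an error if neither is set.
func namespaceFromRequest(body []byte) (string, error) {
	var req executeRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return "", err
	}
	if ns := req.Workflow.Metadata.Namespace; ns != "" {
		return ns, nil
	}
	if ns := req.Workflow.ObjectMeta.Namespace; ns != "" {
		return ns, nil
	}
	return "", errors.New("request body carries no workflow namespace")
}
```

Failing loudly when no namespace is present, rather than falling back to the driver pod's own namespace, seems the safer default for multi-user deployments, since a silent fallback would resolve resources in the wrong namespace.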

backend/src/driver/cmd/main.go (new)

  • Standalone HTTP server entry point for ml-pipeline-driver.
  • Serves POST /api/v1/template.execute and GET /healthz on :8080.
  • Calls flag.Set("logtostderr","true") before flag.Parse() so glog emits to stderr.
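
For reviewers who want a mental model of the entry point, the routing described above could look roughly like the sketch below. It uses only the standard library; newDriverMux, stubExecute, and statusFor are hypothetical names, and the glog setup (flag.Set("logtostderr","true") before flag.Parse()) is left out because glog is a third-party dependency.

```go
package main

import (
	"net/http"
	"net/http/httptest"
)

// newDriverMux wires the two routes named in the PR description:
// GET /healthz and POST /api/v1/template.execute.
func newDriverMux(execute http.HandlerFunc) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/api/v1/template.execute", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		execute(w, r)
	})
	return mux
}

// stubExecute is a placeholder for the real driver handler (assumption):
// it answers every request with an empty JSON object.
func stubExecute(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.Write([]byte("{}"))
}

// statusFor sends one in-memory request through h and reports the status
// code, which allows checking the routing without opening a socket.
func statusFor(h http.Handler, method, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}
```

In a real entry point the mux would be served with something like http.ListenAndServe(":8080", newDriverMux(handler)), where handler is the driver package's execute handler.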

backend/Dockerfile.driver (new)

  • Multi-stage Docker build for the standalone driver binary.

backend/Makefile

  • Added image_driver target; included in image_all.

backend/src/apiserver/main.go

  • Removed driver handler registration (driver now lives in its own pod).

backend/src/v2/compiler/argocompiler/plugin.go

  • Updated driverEndpointURL to
    http://ml-pipeline-driver.kubeflow.svc.cluster.local:8080/api/v1/template.execute
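
The endpoint above bakes in the kubeflow namespace. If that ever needs to vary per install, the URL decomposes into the usual in-cluster Service DNS form. The helper below is purely hypothetical (plugin.go in this PR uses a fixed value) and only shows how the pieces fit together.

```go
package main

import "fmt"

// driverEndpoint builds the in-cluster URL for the driver Service.
// Hypothetical helper: the PR itself hardcodes the resulting string.
func driverEndpoint(service, namespace string, port int) string {
	return fmt.Sprintf(
		"http://%s.%s.svc.cluster.local:%d/api/v1/template.execute",
		service, namespace, port,
	)
}
```

Calling driverEndpoint("ml-pipeline-driver", "kubeflow", 8080) reproduces the value used in this PR.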

backend/src/v2/compiler/argocompiler/{argo,container,dag}.go

  • Switched Argo HTTP template output extraction from the non-functional
    ValueFrom.Expression to taskResultExtract() (jsonpath over outputs.result),
    which is the only output field Argo HTTP templates actually populate.
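
To make the outputs.result point concrete: the HTTP response body arrives as a single string, so every downstream value has to be selected out of that string. The plain-Go sketch below mimics what a jsonpath lookup over outputs.result amounts to for a top-level key. It is not the compiler's taskResultExtract() (whose signature is not shown here), and extractResultField and the sample keys are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extractResultField treats result as the JSON object returned by the driver
// and returns the value stored under key. String values are returned
// unquoted; any other JSON value is returned as its raw JSON text.
func extractResultField(result, key string) (string, error) {
	var fields map[string]json.RawMessage
	if err := json.Unmarshal([]byte(result), &fields); err != nil {
		return "", fmt.Errorf("outputs.result is not a JSON object: %w", err)
	}
	raw, ok := fields[key]
	if !ok {
		return "", fmt.Errorf("key %q not present in outputs.result", key)
	}
	var s string
	if err := json.Unmarshal(raw, &s); err == nil {
		return s, nil
	}
	return string(raw), nil
}
```

One consequence worth noting: if the driver response is not valid JSON, every dependent parameter fails to resolve at once, so the driver should return a well-formed object even on partial results.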

manifests/kustomize/base/pipeline/

  • Added ml-pipeline-driver-{sa,role,rolebinding,deployment,service}.yaml.
  • Removed ml-pipeline-driver-plugin-cm.yaml (sidecar config no longer needed).
  • Updated kustomization.yaml.
  • RBAC: driver roles (pod viewer, PVC editor, configmap reader, artifact secret
    reader) are now bound to the ml-pipeline SA.

manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/

  • Removed sidecar-mode patch; updated kustomization.yaml.

manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/sync.py

  • Removed per-namespace driver SA and token secret creation.
  • Grants driver roles to the ml-pipeline SA instead.

.github/resources/manifests/

  • Removed driver-plugin-cm-path.yaml patches (no longer applicable).
  • Updated kustomizations for all CI variants (standalone, multiuser,
    kubernetes-native, tls-enabled).

.github/workflows/image-builds*.yml

  • Added ml-pipeline-driver image build targets.

test_data/compiled-workflows/

  • Regenerated all golden files with the new driver endpoint URL.

Manual testing notes

  • Both HTTP driver nodes (root-driver, add-driver) succeed and driver logs
    are now visible via kubectl logs.
  • The executor container currently exits with code 1 (missing kfp module) in
    the manual test environment — this is a test-script issue
    (install_kfp_package=False) and is unrelated to the driver changes.
  • The ScheduledWorkflow CRD (scheduled-workflow-crd.yaml) must be present in
    the cluster; without it the persistence agent fails to watch ScheduledWorkflow
    resources and runs remain stuck in "pending execution".

Known gaps / TODO

  • End-to-end CI has not been run; only compiler golden regeneration has been validated.
  • TLS and multi-user paths need additional testing.
  • Image promotion / release pipeline not yet wired up for the new ml-pipeline-driver image.
  • The actionlint CI check flags a type mismatch in image-builds-master.yml introduced by the new image target — needs a follow-up fix.
  • Consider whether ml-pipeline-driver should run as a sidecar to ml-pipeline (same pod, different container) vs. a fully independent Deployment.
  • Auth/mTLS between Argo and ml-pipeline-driver not yet addressed.
  • Resource requests/limits for the driver Deployment not yet tuned.

@ntny (Owner) commented Mar 31, 2026

Hey @droctothorpe,

Could you please open an MR from this branch to master?
Not for merging - just to launch the E2E multi-user tests.

You can call me paranoid, but I’m not fully sure the new resources will be applied properly, so let’s verify it :)

@ntny force-pushed the central-driver-poc-backup branch 13 times, most recently from b59dd60 to 8e17485, April 6, 2026 20:02
@droctothorpe changed the title from "manifests: remove sync.py kubernetes runtime dependency" to "[WIP/PoC] Centralize KFP driver as standalone ml-pipeline-driver service", Apr 10, 2026
@droctothorpe force-pushed the central-driver-poc-backup branch 11 times, most recently from b238bc5 to a7cf9e3, April 10, 2026 20:19
…ne-driver service

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: test <test@test.com>
Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>
@droctothorpe force-pushed the central-driver-poc-backup branch from a7cf9e3 to 7a10d17, April 10, 2026 20:28
@ntny force-pushed the central-driver-poc-backup branch from 068cff6 to 0f6119c, April 13, 2026 18:32
@ntny force-pushed the central-driver-poc-backup branch 13 times, most recently from 000ae2e to 8c4d9c5, April 28, 2026 22:25
@ntny force-pushed the central-driver-poc-backup branch 2 times, most recently from e7e61b8 to e85dac2, May 30, 2026 16:52