Central driver POC #12269

Closed
ntny wants to merge 57 commits into kubeflow:master from ntny:central-driver-poc

@ntny
Contributor

@ntny ntny commented Sep 22, 2025

Description of your changes:

POC for #12023

Resolves: #12269

Changes:

  • I modified the Argo compiler in the API server: it now generates a workflow spec that uses the driver plugin instead of a driver container. The driver is now hosted as a server inside the agent.
  • I built modified images for the API server (to compile the new Argo workflow spec) and added the KFP driver server image (hosted by the executor plugin).
  • Added the necessary SA/tokens and additional rules, according to the documentation.
  • Built the images from this branch and pushed them to docker.io.

How to launch:

I built multi-layer container images on both Apple M-series (ARM64) and Linux/AMD64 platforms. If you're using one of these architectures, you can safely reuse the images from Docker Hub (ntny/kfp-driver:beta-poc & ntny/kfp-api-server:beta-poc). These images are already referenced in the manifests in this branch.
If your architecture is different, you will need to build the Dockerfile and Dockerfile.driver yourself from this branch and replace the images with your own here and here before proceeding with the instructions below.
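
As a rough sketch of the architecture check described above, the following script decides whether the published images can be reused or a local build is needed. The `backend/` Dockerfile locations, the `REGISTRY` placeholder, and the `needs_local_build` helper are assumptions made for illustration and are not part of this PR. The script only prints example build commands; it does not run them.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholder: point this at a registry you can push to.
REGISTRY="${REGISTRY:-registry.example.com/me}"

# Succeeds (returns 0) when the given machine architecture is NOT one of
# the two the published beta-poc images were built on (ARM64, AMD64).
needs_local_build() {
  case "$1" in
    x86_64|amd64|arm64|aarch64) return 1 ;;
    *) return 0 ;;
  esac
}

arch="$(uname -m)"
if needs_local_build "$arch"; then
  echo "No published image for $arch; build from this branch, for example:"
  # Assumed Dockerfile paths; adjust them to where the files live in the branch.
  echo "docker build -f backend/Dockerfile -t $REGISTRY/kfp-api-server:beta-poc ."
  echo "docker build -f backend/Dockerfile.driver -t $REGISTRY/kfp-driver:beta-poc ."
else
  echo "Reusing ntny/kfp-api-server:beta-poc and ntny/kfp-driver:beta-poc for $arch"
fi
```

After a local build, the image references in the manifests still have to be replaced by hand, as noted above.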

I use, and have prepared, a platform-agnostic environment inside minikube (single user):

  • Move to the root of the project and run:
kubectl apply -k ./manifests/kustomize/cluster-scoped-resources
  • Wait about 30 seconds, then run:
kubectl apply -k ./manifests/kustomize/env/platform-agnostic
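
The fixed 30-second pause can probably be replaced with an explicit readiness check. The sketch below assumes the cluster-scoped resources install the `applications.app.k8s.io` CRD, as the upstream KFP install instructions do; that CRD name, the 60s timeout, and the `DRY_RUN` switch are assumptions and have not been verified against this branch. By default the script only prints the commands.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical switch: 1 = only print the commands, 0 = execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

install_poc() {
  run kubectl apply -k ./manifests/kustomize/cluster-scoped-resources
  # Block until the CRD is registered instead of sleeping for a fixed time.
  # Assumed CRD name, taken from the upstream install docs, not from this PR.
  run kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
  run kubectl apply -k ./manifests/kustomize/env/platform-agnostic
}

install_poc
```

Run it from the project root with `DRY_RUN=0` to apply the manifests for real.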


Forward the UI port as usual:
```bash
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
```

I have tested this POC on the preinstalled "[Tutorial] Data passing in Python components" pipeline. No driver pods are created; the agent is used instead (and is removed after the pipeline has finished).
Screenshot 2025-09-25 at 12 25 03

Please note: this is just a POC and not a production-ready solution.

@google-oss-prow

Hi @ntny. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions

🚫 This command cannot be processed. Only organization members or owners can use the commands.

@ntny ntny force-pushed the central-driver-poc branch 3 times, most recently from 3388fc7 to 87883fa on September 22, 2025 19:12
@ntny
Contributor Author

ntny commented Sep 22, 2025

/hold

@ntny ntny force-pushed the central-driver-poc branch 5 times, most recently from de3b9d2 to 5c0ae07 on September 24, 2025 23:22
@droctothorpe
Collaborator

This is EPIC, @ntny! Can't wait to try it out.

@ntny
Contributor Author

ntny commented Sep 27, 2025

/unhold

@ntny
Contributor Author

ntny commented Sep 30, 2025

Hi @HumairAK @droctothorpe would you mind giving this a try?
It should be pretty straightforward to run the cluster with the agent only without the driver by following the instructions above.

@ntny
Contributor Author

ntny commented Sep 30, 2025

Hi! @nsingla I made intentional changes to the compiler, and manually updating all specs in test/compiled-workflow would be very time-consuming.
I’ve already used the following code on my side to regenerate specs directly from the test using a special flag (similar to snapshot tests) and then review the diff manually.
Do you have any concerns about this approach, given your experience with test code and test practices?

@nsingla
Copy link
Copy Markdown
Contributor

nsingla commented Sep 30, 2025

> Hi! @nsingla I made intentional changes to the compiler, and manually updating all specs in test/compiled-workflow would be very time-consuming. I’ve already used the following code on my side to regenerate specs directly from the test using a special flag (similar to snapshot tests) and then review the diff manually. Do you have any concerns about this approach, given your experience with test code and test practices?

You don't need to update them manually; you can run the compiler tests locally with this flag:
ginkgo -v -- -updateCompiledFiles=true
This should update the workflows.
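
The snapshot-style workflow discussed here (regenerate the compiled specs, then review the diff by hand) could be sketched roughly as below. The `ginkgo` invocation and the `test/compiled-workflow` directory come from this thread; `TEST_DIR`, `GOLDEN_DIR`, and `DRY_RUN` are hypothetical knobs, and the directory the compiler tests must be run from is an assumption. By default the script only prints the commands.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical knobs; adjust them to the repository layout.
DRY_RUN="${DRY_RUN:-1}"                             # 1 = print only, 0 = execute
TEST_DIR="${TEST_DIR:-.}"                           # where the compiler tests live (assumed)
GOLDEN_DIR="${GOLDEN_DIR:-test/compiled-workflow}"  # golden specs named in this thread

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

refresh_golden_files() {
  # Regenerate the compiled workflow specs from the compiler tests.
  (cd "$TEST_DIR" && run ginkgo -v -- -updateCompiledFiles=true)
  # Summarize what changed so the diff can be reviewed before committing.
  run git diff --stat -- "$GOLDEN_DIR"
}

refresh_golden_files
```

Reviewing the `git diff` output before committing keeps the regenerated specs from silently accepting unintended compiler changes.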

@ntny ntny force-pushed the central-driver-poc branch 2 times, most recently from 4bee799 to 1ee2602 on September 30, 2025 18:28
@zazulam
Contributor

zazulam commented Oct 1, 2025

/ok-to-test

@droctothorpe
Collaborator

Hey, @ntny. Unfortunately, I won't have bandwidth to validate it in the next two weeks, but I just wanted to let you know that it's on my radar and I will get to it as soon as I can. Maybe someone else will get to it before me. VERY excited about this. Kudos!

arpechenin added 9 commits March 9, 2026 21:53
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
…lisecond timestamp prefix

Signed-off-by: arpechenin <arpechenin@avito.ru>
@zazulam
Contributor

zazulam commented Mar 27, 2026

/lgtm

@droctothorpe
Collaborator

image

Mike and I both manually validated this and everything looks great! Amazing work, Anton. A few very small tweaks that should definitely happen in discrete PRs so that we don't hold this one up any longer:

  • Enable wrapping of logs in the UI.
  • Enable searching / filtering of logs in the UI.
  • Add some kind of info overlay or clickable "i" icon for system logs, since users will have no idea what they are.
  • Scale test the driver with extremely parallelized workloads.

Tag us in Slack as soon as you rebase and we will merge. Happy to help if there are issues with the rebase.

@google-oss-prow

New changes are detected. LGTM label has been removed.

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from zazulam. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ntny
Contributor Author

ntny commented Mar 27, 2026

/retest

arpechenin added 11 commits March 27, 2026 21:44
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
…rkflow-compiler tests

Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
- add rbac to driver for pvc and pods
- remove redundant enters

Signed-off-by: arpechenin <arpechenin@avito.ru>
@ntny
Contributor Author

ntny commented Mar 29, 2026

/retest

@ntny
Contributor Author

ntny commented Mar 29, 2026

/retest

@ntny ntny mentioned this pull request Mar 29, 2026
2 tasks
@ntny
Contributor Author

ntny commented Mar 29, 2026

I created a backup branch backup-central-driver to preserve all changes from this branch before merging with master.
The merge with master caused too many conflicts for a clean rebase, so to preserve the correct history, I extracted the diffs into a separate branch.

I am closing this PR to avoid confusion with the merge. All changes are saved in the backup branch, and it is this branch that should be merged.

Labels

ci-passed (All CI tests on a pull request have passed), ok-to-test, size/XXL