@jskswamy commented on Aug 4, 2025

⚠️ DRAFT PR: This is a draft version for initial review and feedback. Comprehensive testing is yet to be completed. Raising this PR to get feedback on the implementation approach and architecture.

🎯 Overview

This PR adds Kubernetes support to NeMo Run by implementing a complete KubeflowExecutor that enables distributed training on Kubernetes clusters using Kubeflow Trainer.

User Value: NeMo Run users can now run their experiments directly on Kubernetes clusters, leveraging the power of Kubeflow Trainer for distributed training without leaving the familiar NeMo Run API.

Key Capabilities:

  • Run NeMo experiments on Kubernetes clusters
  • Scale training across multiple nodes and GPUs
  • Automatic resource management and cleanup
  • Seamless integration with existing NeMo Run workflows

🏗️ Architecture

Key Components

  1. KubeflowExecutor: Core executor that manages training job lifecycle
  2. ConfigMapPackager Integration: Automatic file staging for distributed training
  3. CLI Integration: Factory functions and entrypoints for easy usage
  4. Resource Management: ClusterTrainingRuntime and TrainJob lifecycle

Why This Implementation?

Current State: NeMo Run currently supports various executors for different environments (local, SkyPilot, etc.) but lacks a Kubernetes executor for running experiments directly on Kubernetes clusters.

This Implementation Adds: Complete Kubernetes support for NeMo Run using Kubeflow Trainer, enabling users to:

  • Run NeMo Experiments on Kubernetes: Direct integration with Kubernetes clusters
  • Distributed Training: Scale training across multiple nodes and GPUs
  • Kubeflow Integration: Leverage Kubeflow Trainer's robust distributed training capabilities
  • Resource Management: Automatic GPU allocation, memory management, and scheduling
  • Production Deployment: Run ML experiments in production Kubernetes environments

Technical Approach: The Kubeflow Trainer SDK only provides TrainJob methods, so we implement ClusterTrainingRuntime management directly via the Kubernetes API (see the sketch after this list) for:

  • Environment Consistency: Same runtime configuration across experiments
  • Resource Optimization: Reuse runtime definitions
  • Experiment Isolation: Per-experiment runtime names
  • Cleanup Management: Proper resource lifecycle
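
A minimal sketch of what this direct management can look like with the official Kubernetes Python client. The group/version, resource name, and spec below are illustrative assumptions based on the Kubeflow Trainer v2 CRDs; the actual body is rendered from the executor's runtime template.

# Illustrative sketch only: create a per-experiment ClusterTrainingRuntime via the
# Kubernetes CustomObjectsApi. Group/version and spec fields are assumptions and may
# not match the PR's exact template.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
api = client.CustomObjectsApi()

runtime_body = {
    "apiVersion": "trainer.kubeflow.org/v1alpha1",
    "kind": "ClusterTrainingRuntime",
    "metadata": {"name": "nemo-run-example-runtime"},
    "spec": {},  # image, node count, GPU resources, volumes, etc. rendered from the template
}

api.create_cluster_custom_object(
    group="trainer.kubeflow.org",
    version="v1alpha1",
    plural="clustertrainingruntimes",
    body=runtime_body,
)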

📁 Files Added/Modified

Core Implementation

  • nemo_run/core/execution/kubeflow.py - Main KubeflowExecutor implementation
  • nemo_run/core/packaging/configmap.py - ConfigMapPackager for file staging

Tests

  • test/core/execution/test_kubeflow.py - Comprehensive unit tests for KubeflowExecutor and ConfigMapPackager integration
  • test/core/packaging/test_configmap.py - Comprehensive unit tests for ConfigMapPackager

Examples

  • examples/kubeflow/hello_kubeflow.py - Complete example with CLI integration
  • examples/kubeflow/README.md - Comprehensive documentation
  • examples/kubeflow/architecture.md - Detailed architecture diagrams

🚀 Key Features

1. Seamless NeMo Run Integration

# Simple API usage (illustrative; the import path matches the module added in this PR)
import nemo_run as run
from nemo_run.core.execution.kubeflow import KubeflowExecutor

executor = KubeflowExecutor(nodes=2, gpus=8)
exp = run.Experiment("hello-kubeflow")  # experiment title is illustrative
exp.add(train_function)                 # train_function: a run.Partial or run.Script task
exp.run(executor=executor)

2. Automatic Resource Management

  • Creates ClusterTrainingRuntime per experiment
  • Manages TrainJob lifecycle
  • Automatic cleanup of all resources

3. File Staging via ConfigMaps

  • Integrates with ConfigMapPackager
  • Automatic file distribution to all nodes
  • Optimized for distributed training (see the sketch below)
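
For reference, a rough sketch of the staging step with the official Kubernetes Python client; the names, namespace, and key sanitization shown here are simplified stand-ins for what ConfigMapPackager actually does.

# Illustrative sketch: stage local files into a ConfigMap so every training pod
# can mount the same workspace. Names and namespace are placeholders.
from pathlib import Path
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

files = {"scripts/train.py": Path("scripts/train.py").read_text()}
body = client.V1ConfigMap(
    metadata=client.V1ObjectMeta(name="nemo-workspace-example"),
    # ConfigMap keys cannot contain "/", so path separators are replaced
    data={key.replace("/", "-"): content for key, content in files.items()},
)
core.create_namespaced_config_map(namespace="default", body=body)
# The runtime template mounts this ConfigMap at the executor's workspace_mount_path,
# so all nodes see identical files.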

🔧 Implementation Details

Resource Creation Flow

  1. ClusterTrainingRuntime: Defines training environment (image, resources, scaling)
  2. ConfigMap: Packages and distributes training files
  3. TrainJob: Executes training with a reference to the runtime (resource shape sketched below)
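
For orientation, the TrainJob created in step 3 is itself a namespaced custom resource that references the runtime from step 1, while the files from step 2 are mounted by the runtime template. The PR submits it through the Kubeflow Trainer SDK rather than raw API calls, so the field names below are only an approximation of the trainer.kubeflow.org/v1alpha1 schema.

# Rough illustration of the TrainJob resource shape; the executor actually goes
# through the Kubeflow Trainer SDK, and these field names are assumptions.
trainjob_body = {
    "apiVersion": "trainer.kubeflow.org/v1alpha1",
    "kind": "TrainJob",
    "metadata": {"name": "nemo-run-example-job", "namespace": "default"},
    "spec": {
        "runtimeRef": {"name": "nemo-run-example-runtime"},  # runtime from step 1
        "trainer": {
            "command": ["torchrun"],          # command/args rendered by the executor
            "args": ["/workspace/train.py"],  # file staged via the ConfigMap in step 2
        },
    },
}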

Error Handling

  • Robust Kubernetes connectivity validation
  • Automatic resource cleanup on failures (see the sketch below)
  • Comprehensive error logging and recovery
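
A hedged sketch of what cleanup on failure can look like with the raw Kubernetes client; resource names are placeholders, and the real cleanup path also removes the TrainJob itself.

# Illustrative cleanup-on-failure pattern; names are placeholders.
from kubernetes import client
from kubernetes.client.rest import ApiException

def cleanup(runtime_name: str, configmap_name: str, namespace: str) -> None:
    custom = client.CustomObjectsApi()
    core = client.CoreV1Api()
    deletions = (
        lambda: custom.delete_cluster_custom_object(
            group="trainer.kubeflow.org",
            version="v1alpha1",
            plural="clustertrainingruntimes",
            name=runtime_name,
        ),
        lambda: core.delete_namespaced_config_map(name=configmap_name, namespace=namespace),
    )
    for delete in deletions:
        try:
            delete()
        except ApiException as e:
            if e.status != 404:  # already-gone resources are fine during teardown
                raise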

jskswamy added 11 commits August 4, 2025 12:00
Add ConfigMapPackager to enable staging files into Kubernetes ConfigMaps
for distributed training jobs. This packager supports file size validation,
debug logging, and graceful error handling when the Kubernetes client is
unavailable.

The implementation includes comprehensive test coverage with parameterized
tests for different job_dir scenarios and proper handling of large files
that exceed the 1MB ConfigMap limit.

Dependencies added:
- kubernetes>=28.0.0: Official Kubernetes Python client for API interactions
- kubeflow: Kubeflow SDK for trainer integration and custom resource management

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
This commit introduces a new KubeflowExecutor that enables distributed
training jobs on Kubernetes using the Kubeflow Trainer SDK. The executor
supports both file-based and function-based execution modes, with files
staged into Kubernetes ConfigMaps for execution.

Key features:
- Integration with Kubeflow Trainer SDK for TrainJob management
- Support for both file-based and function-based execution
- ConfigMapPackager integration for file staging
- Comprehensive test coverage with parameterized tests
- Consistent API with other executors (LocalExecutor, SlurmExecutor)
- Resource management (CPU, memory, GPU requests/limits)
- Custom runtime support via runtime_name parameter

The executor follows the same patterns as existing executors and is
integrated into the experiment system for parallel and detached
execution support.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Add comprehensive documentation for KubeflowExecutor in the execution guide,
following the same pattern as other executors. The documentation includes:

- Basic and advanced configuration examples
- File-based and function-based execution modes
- ConfigMapPackager integration for file staging
- Prerequisites and architecture explanation
- Monitoring, debugging, and troubleshooting sections

Also update README to include KubeflowExecutor in supported executors
list and add new section highlighting all available executors.

The KubeflowExecutor enables distributed training jobs on Kubernetes using
Kubeflow Trainer SDK while following proper separation of concerns between
ClusterOps and MLE teams.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Fix Kubernetes resource naming issues by adding sanitize_kubernetes_name utility
function that replaces underscores with hyphens to comply with RFC 1123 rules.
Extract ConfigMap key sanitization logic into reusable _sanitize_configmap_key
method that replaces forward slashes with hyphens for valid ConfigMap keys.

Improve resource formatting in KubeflowExecutor to use flat structure expected
by Kubeflow SDK instead of nested limits/requests format.

Add comprehensive parametrized tests for sanitization functions and error
handling scenarios. Remove redundant tests and fix brittle assertions to
call actual methods rather than duplicating logic.

These changes resolve ConfigMap creation failures due to invalid names and
keys, ensuring successful NeMo Run training job submissions to Kubeflow.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
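
A minimal sketch of the sanitization helper described in the commit above (illustrative only; per a later commit the actual helper was consolidated into base.py and may differ in details such as length truncation):

import re

def sanitize_kubernetes_name(name: str) -> str:
    # RFC 1123 names allow lowercase alphanumerics, "-" and "."; underscores are
    # mapped to hyphens and any other disallowed character is replaced.
    sanitized = name.lower().replace("_", "-")
    sanitized = re.sub(r"[^a-z0-9.-]", "-", sanitized)
    return sanitized.strip("-.")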
This commit addresses several critical issues in the Kubeflow executor:

- Consolidates Kubernetes name sanitization logic into base.py to avoid
  duplication and ensure consistent naming across the codebase
- Removes hardcoded resource defaults from KubeflowExecutor to allow
  ClusterTrainingRuntime to provide appropriate defaults
- Fixes double prefix issue in ConfigMap names by ensuring only
  ConfigMapPackager adds the workspace prefix
- Adds configurable volume_mount_path and default_task_dir parameters
  for better flexibility
- Implements dynamic file path resolution that correctly infers staged
  file locations
- Updates all tests to reflect the new behavior and ensure proper
  coverage

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Add comprehensive ConfigMapPackager integration to KubeflowExecutor
with full test coverage. This enables staging user files as Kubernetes
ConfigMaps for distributed training jobs.

Key changes:
- Fix CustomTrainer API integration (script -> python_file parameter)
- Add ConfigMapPackager staging and cleanup functionality
- Implement proper resource management with experiment lifecycle
- Add comprehensive test suite with 101 passing tests
- Support both Script and Partial task types from Experiment API
- Add proper error handling and logging for ConfigMap operations

The implementation follows NeMo Run architecture with clear separation
between tasks (what to run) and executors (where/how to run).

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Implement focused integration tests for ConfigMapPackager with KubeflowExecutor
covering the complete ConfigMap lifecycle and resource management.

Key improvements:
- Comprehensive integration test covering ConfigMap creation, sanitization,
  large file handling, mount path validation, and error recovery
- Lifecycle management test verifying complete resource cleanup

The tests verify:
- ConfigMap creation during job submission
- Resource cleanup completeness (TrainJob + file cleanup)
- Namespace isolation and mount path validation
- Error handling for large files and API failures

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Add comprehensive tests for ClusterTrainingRuntime creation, TrainJob
management, and resource lifecycle with ConfigMapPackager integration.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Add Kubernetes client integration to create and delete ClusterTrainingRuntime
CRDs with experiment-specific configurations. Move Kubernetes configuration
validation to __post_init__ for early failure detection. Remove unused
cpu_request and memory_request parameters, add configurable image parameter.

Add comprehensive tests that verify actual Kubernetes API calls and CRD
body structure. Remove implementation comments and TODO statements to
focus on user-facing documentation.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Add CLI factory functions and entrypoint to hello_kubeflow.py example.
Follow established patterns from other executors with simple factory
functions and entrypoint integration. Include kubeflow_gpu and
kubeflow_cpu factories with sensible defaults for common use cases.

Update README with CLI usage examples and simplified documentation.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
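
Roughly what such factory functions look like under NeMo Run's CLI conventions; the defaults and names here are placeholders, and the actual factories live in examples/kubeflow/hello_kubeflow.py.

# Illustrative factory sketch; defaults are placeholders, not the example's values.
import nemo_run as run
from nemo_run.core.execution.kubeflow import KubeflowExecutor

@run.cli.factory
def kubeflow_gpu(nodes: int = 2, gpus_per_node: int = 8) -> run.Config[KubeflowExecutor]:
    return run.Config(KubeflowExecutor, nodes=nodes, gpus_per_node=gpus_per_node)

@run.cli.factory
def kubeflow_cpu(nodes: int = 1) -> run.Config[KubeflowExecutor]:
    return run.Config(KubeflowExecutor, nodes=nodes)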
@hemildesai (Contributor) commented:

Hey @jskswamy, thanks a lot for your contribution, this is great 🎉. Appreciate you taking the time since this is a pretty involved PR. I took a quick glance and the design and organization look good to me. Happy to proceed with merging once it's ready on your side and works end to end.

@hemildesai (Contributor) commented:

Hi, can you fix the DCO check by signing off your commits using --signoff?

Add support for inline script execution in the KubeflowExecutor
using the SDK's function argument injection style. This change allows
users to pass scripts directly as inline parameters, enhancing
flexibility in task execution.

Key changes include:
- Introduced `_nemo_inline_entry_params` function for handling
  inline script execution.
- Updated `create_trainjob` and `submit` methods to support
  inline scripts.
- Enhanced logging for better tracking of execution modes.
- Improved Kubernetes runtime management, enabling reuse of
  ClusterTrainingRuntime across experiments with similar
  configurations.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Updated the KubeflowExecutor class to support a unique executor name and
improved the handling of Kubernetes configurations. Key changes include:

- Introduced a `name` parameter for the executor to enable better
  identification in logs and configuration.
- Adjusted volume mount path defaults and removed unnecessary runtime
  parameters.
- Enhanced error handling for Kubernetes configuration failures, ensuring
  clearer logging of issues.
- Streamlined file staging and cleanup processes using the new name
  parameter.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
This commit refactors the KubeflowExecutor class to enhance task
management and streamline the creation of ClusterTrainingRuntime. Key
changes include:

- Removed the _nemo_inline_entry_params function, simplifying
  inline script handling.
- Introduced a new method to get additional files based on task
  types, allowing for better staging of files in ConfigMap.
- Updated the create_trainjob method to accept runtime_name
  directly, improving clarity in job submissions.
- Adjusted the _runtime_name method to generate names based on
  a unique identifier and hash, ensuring no collisions.
- Improved logging for better traceability during execution.

These modifications aim to simplify the executor's interface and
enhance its usability for developers working with Kubeflow.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Refactor the KubeflowExecutor class to replace the CustomTrainer
with CommandTrainer for improved task handling. Introduce a
new enable_tcpxo feature that configures a sidecar for TCP
enhancements in the runtime template. The implementation now
validates entrypoints and manages task configurations more
robustly, ensuring compatibility with the CommandTrainer.

- Added enable_tcpxo flag to runtime template
- Updated TrainerClient initialization with KubernetesBackendConfig
- Enhanced error handling for unsupported tasks
- Improved logging for trainer configurations and commands

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Reorganize annotations under the spec.template.metadata section
based on the enable_tcpxo condition. This improves clarity and
maintains consistency in configuration.

Add podAntiAffinity rules to ensure that replicated jobs do not
schedule on the same node, enhancing fault tolerance and resource
management.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Refactor the KubeflowExecutor to improve the way command and
arguments are built based on the task entrypoint. Introduce a new
method, _build_command_and_args, to centralize this logic.

Additionally, implement a mutation function, _mutate_bash_torchrun_flags,
that appends necessary torchrun flags to bash scripts, ensuring they
include required parameters for distributed training.

This change aims to streamline the execution of script tasks and
enhance compatibility with the PET framework for distributed
training setups.

- Added _build_command_and_args method for command construction
- Implemented _mutate_bash_torchrun_flags for bash script handling
- Updated tests to verify new functionality and flag injection

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
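
A rough illustration of the flag injection described in the commit above. The PET_* environment variables are assumed to be provided by the training runtime, and the exact flags the executor appends may differ.

# Illustrative only: build torchrun flags from PET_* environment variables
# (assumed to be injected by the runtime) and append them to a bash command
# that invokes torchrun without them.
def build_torchrun_flags() -> list[str]:
    return [
        "--nnodes=${PET_NNODES}",
        "--nproc_per_node=${PET_NPROC_PER_NODE}",
        "--node_rank=${PET_NODE_RANK}",
        "--master_addr=${PET_MASTER_ADDR}",
        "--master_port=${PET_MASTER_PORT}",
    ]

def mutate_bash_torchrun_flags(command: str) -> str:
    if "torchrun" in command and "--nnodes" not in command:
        command = command.replace("torchrun", "torchrun " + " ".join(build_torchrun_flags()), 1)
    return command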
Add a new StorageMount class to encapsulate the configuration
for persistent volume claims (PVCs) in KubeflowExecutor. This
includes attributes for mount path, read-only status, and
PVC specifics.

Enhance KubeflowExecutor to support storage mounts by ensuring
PVCs are created if they do not exist. Update the template
rendering to include storage PVC mounts in the runtime
configuration.

Add unit tests for storage mount normalization, PVC creation,
and template rendering, ensuring correct behavior in various
scenarios.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
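
A hedged sketch of the shape of such a storage mount and the ensure-PVC step; field names, defaults, and access mode are assumptions rather than the PR's exact API.

# Illustrative StorageMount-style dataclass plus an idempotent PVC check/create.
from dataclasses import dataclass
from typing import Optional
from kubernetes import client
from kubernetes.client.rest import ApiException

@dataclass
class StorageMount:
    pvc_name: str
    mount_path: str
    size: str = "100Gi"
    storage_class: Optional[str] = None
    read_only: bool = False

def ensure_pvc(mount: StorageMount, namespace: str) -> None:
    core = client.CoreV1Api()
    try:
        core.read_namespaced_persistent_volume_claim(name=mount.pvc_name, namespace=namespace)
        return  # PVC already exists, nothing to do
    except ApiException as e:
        if e.status != 404:
            raise
    spec = {
        "accessModes": ["ReadWriteMany"],
        "resources": {"requests": {"storage": mount.size}},
    }
    if mount.storage_class:
        spec["storageClassName"] = mount.storage_class
    body = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": mount.pvc_name},
        "spec": spec,
    }
    core.create_namespaced_persistent_volume_claim(namespace=namespace, body=body)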
Add functionality to handle launcher commands for both Script and Partial
tasks in the KubeflowExecutor. This update introduces helper functions to build
command arguments and integrates them into the existing execution flow.

Key changes include:
- Creation of `_build_pet_torchrun_flags` to standardize the flags for torchrun.
- Implementation of `_build_launcher_command_and_args` for generating the appropriate command and arguments based on task type and entrypoint.
- Refinement of `_materialize_task_content_for_staging` to accommodate both inline and function-based tasks.
- Adjustments in the test suite to validate the new functionality and ensure correct behavior when executing tasks with torchrun.

These enhancements aim to improve task execution consistency and facilitate better integration of PyTorch distributed training capabilities within the Kubeflow environment.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Updated the variable name from `volume_mount_path` to `workspace_mount_path`
in the `KubeflowExecutor` class to enhance clarity and consistency. This change
affects both the implementation in the code and the corresponding template files.

Additionally, updated unit tests to reflect the new variable name, ensuring
that all references are correctly aligned with this modification. This
improves readability and reduces potential confusion regarding the purpose of
the mount path.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Add support for specifying additional package installations in the
KubeflowExecutor. This enhancement allows users to configure packages
to be installed within the training container, improving flexibility
for various training requirements.

- Introduced an AdditionalPackages dataclass to encapsulate
  installation parameters.
- Updated the CommandTrainer instantiation to accept additional
  packages if configured.
- Modified get_volume_name and get_pvc_claim_name to ensure
  lowercase names for Kubernetes compatibility.

This change enables more robust customization of the training
environment, facilitating the inclusion of necessary dependencies
directly through executor configuration.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
@jskswamy force-pushed the feature/add-kubeflow-executor branch from 8d60b32 to 57c9ea2 on September 17, 2025 04:59
Add functionality to create and manage Kubernetes Secrets for
environment variables in the KubeflowExecutor class. The new
method `_ensure_env_secret` verifies the existence of the
secret and creates or updates it as necessary. This allows
for better handling of sensitive environment configurations.

Update the template rendering to include secrets as environment
variables, enhancing the flexibility of environment variable
management.

- Implemented `_ensure_env_secret` to manage Secrets
- Updated YAML template to include `env_from_secrets`
- Added tests for secret creation and conflict handling

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
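
A minimal sketch of the create-or-patch pattern described in the commit above (names are placeholders). Note that, per the code scanning alerts further down, log messages should mention only the Secret's name, never its contents.

# Illustrative create-or-patch flow for an environment-variable Secret.
# Only the Secret name is logged, never its data.
import logging
from kubernetes import client
from kubernetes.client.rest import ApiException

logger = logging.getLogger(__name__)

def ensure_env_secret(name: str, namespace: str, env: dict[str, str]) -> None:
    core = client.CoreV1Api()
    body = {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "stringData": env,
    }
    try:
        core.create_namespaced_secret(namespace=namespace, body=body)
        logger.info("Created Secret %s in %s", name, namespace)
    except ApiException as e:
        if e.status == 409:  # already exists, update in place
            core.patch_namespaced_secret(name=name, namespace=namespace, body=body)
            logger.info("Patched Secret %s in %s", name, namespace)
        else:
            raise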
@jskswamy force-pushed the feature/add-kubeflow-executor branch from 517fdc3 to 4532d7b on September 17, 2025 07:18
Revise the key sanitization logic to preserve underscores in file names,
aligning with Kubernetes naming conventions. The ConfigMap NAME must
still comply with DNS-1123 standards, but keys can now contain
underscores.

Additionally, update related tests to reflect this change, ensuring
that the sanitization process maintains underscores as valid characters
in keys. This enhances compatibility with Kubernetes while improving
the handling of file names.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Refactor the KubeflowExecutor class to improve clarity
in GPU configuration. The following changes were made:

- Renamed `gpus` to `gpus_per_node` to specify
  GPU allocation per node.
- Updated the related code references to use
  the new `gpus_per_node` attribute.
- Changed `image` to `container_image` for consistency
  in naming conventions.
- Adjusted tests to align with the updated attribute names
  and ensure proper functionality.

These changes enhance the readability and maintainability
of the code, making it clearer how GPU resources are managed.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Refactor the Kubeflow ClusterTrainingRuntime template
to use variable placeholders for the number of nodes
and the number of processes per node. This change
enhances flexibility by allowing dynamic configuration
based on input parameters.

Add unit tests to validate the rendering of nodes and
process counts in the template. The tests ensure that
the correct values are populated in the rendered output
for both single and multi-GPU scenarios.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Code scanning (CodeQL) raised the following findings on this PR:

  • Clear-text logging of sensitive information (High) - 4 alerts on the Secret handling code. The logger.info / logger.warning calls around creating and patching the environment Secret (for example "Created Secret {generated_secret_name} in {self.namespace}", "Patched Secret {generated_secret_name} with updated stringData in {self.namespace}", and the corresponding failure warnings) are flagged as logging sensitive data (secret) in clear text.
  • Unused local variable (Note) - 2 alerts in the test suite, reporting the variables test_pvc_creation_when_missing and test_pvc_creation_skipped_when_exists as unused.