@jskswamy commented on Aug 4, 2025

⚠️ DRAFT PR: This is a draft version for initial review and feedback. Comprehensive testing is yet to be completed. Raising this PR to get feedback on the implementation approach and architecture.

🎯 Overview

This PR adds Kubernetes support to NeMo Run by implementing a complete KubeflowExecutor that enables distributed training on Kubernetes clusters using Kubeflow Trainer.

User Value: NeMo Run users can now run their experiments directly on Kubernetes clusters, leveraging the power of Kubeflow Trainer for distributed training without leaving the familiar NeMo Run API.

Key Capabilities:

  • Run NeMo experiments on Kubernetes clusters
  • Scale training across multiple nodes and GPUs
  • Automatic resource management and cleanup
  • Seamless integration with existing NeMo Run workflows

🏗️ Architecture

Key Components

  1. KubeflowExecutor: Core executor that manages training job lifecycle
  2. ConfigMapPackager Integration: Automatic file staging for distributed training
  3. CLI Integration: Factory functions and entrypoints for easy usage
  4. Resource Management: ClusterTrainingRuntime and TrainJob lifecycle

Why This Implementation?

Current State: NeMo Run currently supports various executors for different environments (local, SkyPilot, etc.) but lacks a Kubernetes executor for running experiments directly on Kubernetes clusters.

This Implementation Adds: Complete Kubernetes support for NeMo Run using Kubeflow Trainer, enabling users to:

  • Run NeMo Experiments on Kubernetes: Direct integration with Kubernetes clusters
  • Distributed Training: Scale training across multiple nodes and GPUs
  • Kubeflow Integration: Leverage Kubeflow Trainer's robust distributed training capabilities
  • Resource Management: Automatic GPU allocation, memory management, and scheduling
  • Production Deployment: Run ML experiments in production Kubernetes environments

Technical Approach: The Kubeflow Trainer SDK only provides TrainJob methods, so we implement ClusterTrainingRuntime management directly via the Kubernetes API (see the sketch after this list) for:

  • Environment Consistency: Same runtime configuration across experiments
  • Resource Optimization: Reuse runtime definitions
  • Experiment Isolation: Per-experiment runtime names
  • Cleanup Management: Proper resource lifecycle
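
A minimal sketch of what this direct management can look like with the official Kubernetes Python client. The group/version, resource name, and spec below are illustrative assumptions based on the Kubeflow Trainer v2 CRDs; the actual body is rendered from the executor's runtime template.

# Illustrative sketch only: create a per-experiment ClusterTrainingRuntime via the
# Kubernetes CustomObjectsApi. Group/version and spec fields are assumptions and may
# not match the PR's exact template.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
api = client.CustomObjectsApi()

runtime_body = {
    "apiVersion": "trainer.kubeflow.org/v1alpha1",
    "kind": "ClusterTrainingRuntime",
    "metadata": {"name": "nemo-run-example-runtime"},
    "spec": {},  # image, node count, GPU resources, volumes, etc. rendered from the template
}

api.create_cluster_custom_object(
    group="trainer.kubeflow.org",
    version="v1alpha1",
    plural="clustertrainingruntimes",
    body=runtime_body,
)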

📁 Files Added/Modified

Core Implementation

  • nemo_run/core/execution/kubeflow.py - Main KubeflowExecutor implementation
  • nemo_run/core/packaging/configmap.py - ConfigMapPackager for file staging

Tests

  • test/core/execution/test_kubeflow.py - Comprehensive unit tests for KubeflowExecutor and ConfigMapPackager integration
  • test/core/packaging/test_configmap.py - Comprehensive unit tests for ConfigMapPackager

Examples

  • examples/kubeflow/hello_kubeflow.py - Complete example with CLI integration
  • examples/kubeflow/README.md - Comprehensive documentation
  • examples/kubeflow/architecture.md - Detailed architecture diagrams

🚀 Key Features

1. Seamless NeMo Run Integration

# Simple API usage (illustrative; the import path matches the module added in this PR)
import nemo_run as run
from nemo_run.core.execution.kubeflow import KubeflowExecutor

executor = KubeflowExecutor(nodes=2, gpus=8)
exp = run.Experiment("hello-kubeflow")  # experiment title is illustrative
exp.add(train_function)                 # train_function: a run.Partial or run.Script task
exp.run(executor=executor)

2. Automatic Resource Management

  • Creates ClusterTrainingRuntime per experiment
  • Manages TrainJob lifecycle
  • Automatic cleanup of all resources

3. File Staging via ConfigMaps

  • Integrates with ConfigMapPackager
  • Automatic file distribution to all nodes
  • Optimized for distributed training (see the sketch below)
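
For reference, a rough sketch of the staging step with the official Kubernetes Python client; the names, namespace, and key sanitization shown here are simplified stand-ins for what ConfigMapPackager actually does.

# Illustrative sketch: stage local files into a ConfigMap so every training pod
# can mount the same workspace. Names and namespace are placeholders.
from pathlib import Path
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

files = {"scripts/train.py": Path("scripts/train.py").read_text()}
body = client.V1ConfigMap(
    metadata=client.V1ObjectMeta(name="nemo-workspace-example"),
    # ConfigMap keys cannot contain "/", so path separators are replaced
    data={key.replace("/", "-"): content for key, content in files.items()},
)
core.create_namespaced_config_map(namespace="default", body=body)
# The runtime template mounts this ConfigMap at the executor's workspace_mount_path,
# so all nodes see identical files.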

🔧 Implementation Details

Resource Creation Flow

  1. ClusterTrainingRuntime: Defines training environment (image, resources, scaling)
  2. ConfigMap: Packages and distributes training files
  3. TrainJob: Executes training with a reference to the runtime (resource shape sketched below)
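
For orientation, the TrainJob created in step 3 is itself a namespaced custom resource that references the runtime from step 1, while the files from step 2 are mounted by the runtime template. The PR submits it through the Kubeflow Trainer SDK rather than raw API calls, so the field names below are only an approximation of the trainer.kubeflow.org/v1alpha1 schema.

# Rough illustration of the TrainJob resource shape; the executor actually goes
# through the Kubeflow Trainer SDK, and these field names are assumptions.
trainjob_body = {
    "apiVersion": "trainer.kubeflow.org/v1alpha1",
    "kind": "TrainJob",
    "metadata": {"name": "nemo-run-example-job", "namespace": "default"},
    "spec": {
        "runtimeRef": {"name": "nemo-run-example-runtime"},  # runtime from step 1
        "trainer": {
            "command": ["torchrun"],          # command/args rendered by the executor
            "args": ["/workspace/train.py"],  # file staged via the ConfigMap in step 2
        },
    },
}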

Error Handling

  • Robust Kubernetes connectivity validation
  • Automatic resource cleanup on failures (see the sketch below)
  • Comprehensive error logging and recovery
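
A hedged sketch of what cleanup on failure can look like with the raw Kubernetes client; resource names are placeholders, and the real cleanup path also removes the TrainJob itself.

# Illustrative cleanup-on-failure pattern; names are placeholders.
from kubernetes import client
from kubernetes.client.rest import ApiException

def cleanup(runtime_name: str, configmap_name: str, namespace: str) -> None:
    custom = client.CustomObjectsApi()
    core = client.CoreV1Api()
    deletions = (
        lambda: custom.delete_cluster_custom_object(
            group="trainer.kubeflow.org",
            version="v1alpha1",
            plural="clustertrainingruntimes",
            name=runtime_name,
        ),
        lambda: core.delete_namespaced_config_map(name=configmap_name, namespace=namespace),
    )
    for delete in deletions:
        try:
            delete()
        except ApiException as e:
            if e.status != 404:  # already-gone resources are fine during teardown
                raise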

jskswamy added 11 commits August 4, 2025 12:00
Add ConfigMapPackager to enable staging files into Kubernetes ConfigMaps
for distributed training jobs. This packager supports file size validation,
debug logging, and graceful error handling when the Kubernetes client is
unavailable.

The implementation includes comprehensive test coverage with parameterized
tests for different job_dir scenarios and proper handling of large files
that exceed the 1MB ConfigMap limit.

Dependencies added:
- kubernetes>=28.0.0: Official Kubernetes Python client for API interactions
- kubeflow: Kubeflow SDK for trainer integration and custom resource management

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
This commit introduces a new KubeflowExecutor that enables distributed
training jobs on Kubernetes using the Kubeflow Trainer SDK. The executor
supports both file-based and function-based execution modes, with files
staged into Kubernetes ConfigMaps for execution.

Key features:
- Integration with Kubeflow Trainer SDK for TrainJob management
- Support for both file-based and function-based execution
- ConfigMapPackager integration for file staging
- Comprehensive test coverage with parameterized tests
- Consistent API with other executors (LocalExecutor, SlurmExecutor)
- Resource management (CPU, memory, GPU requests/limits)
- Custom runtime support via runtime_name parameter

The executor follows the same patterns as existing executors and is
integrated into the experiment system for parallel and detached
execution support.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Add comprehensive documentation for KubeflowExecutor in the execution guide,
following the same pattern as other executors. The documentation includes:

- Basic and advanced configuration examples
- File-based and function-based execution modes
- ConfigMapPackager integration for file staging
- Prerequisites and architecture explanation
- Monitoring, debugging, and troubleshooting sections

Also update README to include KubeflowExecutor in supported executors
list and add new section highlighting all available executors.

The KubeflowExecutor enables distributed training jobs on Kubernetes using
Kubeflow Trainer SDK while following proper separation of concerns between
ClusterOps and MLE teams.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Fix Kubernetes resource naming issues by adding sanitize_kubernetes_name utility
function that replaces underscores with hyphens to comply with RFC 1123 rules.
Extract ConfigMap key sanitization logic into reusable _sanitize_configmap_key
method that replaces forward slashes with hyphens for valid ConfigMap keys.

Improve resource formatting in KubeflowExecutor to use flat structure expected
by Kubeflow SDK instead of nested limits/requests format.

Add comprehensive parametrized tests for sanitization functions and error
handling scenarios. Remove redundant tests and fix brittle assertions to
call actual methods rather than duplicating logic.

These changes resolve ConfigMap creation failures due to invalid names and
keys, ensuring successful NeMo Run training job submissions to Kubeflow.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
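
A minimal sketch of the sanitization helper described in the commit above (illustrative only; per a later commit the actual helper was consolidated into base.py and may differ in details such as length truncation):

import re

def sanitize_kubernetes_name(name: str) -> str:
    # RFC 1123 names allow lowercase alphanumerics, "-" and "."; underscores are
    # mapped to hyphens and any other disallowed character is replaced.
    sanitized = name.lower().replace("_", "-")
    sanitized = re.sub(r"[^a-z0-9.-]", "-", sanitized)
    return sanitized.strip("-.")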
This commit addresses several critical issues in the Kubeflow executor:

- Consolidates Kubernetes name sanitization logic into base.py to avoid
  duplication and ensure consistent naming across the codebase
- Removes hardcoded resource defaults from KubeflowExecutor to allow
  ClusterTrainingRuntime to provide appropriate defaults
- Fixes double prefix issue in ConfigMap names by ensuring only
  ConfigMapPackager adds the workspace prefix
- Adds configurable volume_mount_path and default_task_dir parameters
  for better flexibility
- Implements dynamic file path resolution that correctly infers staged
  file locations
- Updates all tests to reflect the new behavior and ensure proper
  coverage

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Add comprehensive ConfigMapPackager integration to KubeflowExecutor
with full test coverage. This enables staging user files as Kubernetes
ConfigMaps for distributed training jobs.

Key changes:
- Fix CustomTrainer API integration (script -> python_file parameter)
- Add ConfigMapPackager staging and cleanup functionality
- Implement proper resource management with experiment lifecycle
- Add comprehensive test suite with 101 passing tests
- Support both Script and Partial task types from Experiment API
- Add proper error handling and logging for ConfigMap operations

The implementation follows NeMo Run architecture with clear separation
between tasks (what to run) and executors (where/how to run).

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Implement focused integration tests for ConfigMapPackager with KubeflowExecutor
covering the complete ConfigMap lifecycle and resource management.

Key improvements:
- Comprehensive integration test covering ConfigMap creation, sanitization,
  large file handling, mount path validation, and error recovery
- Lifecycle management test verifying complete resource cleanup

The tests verify:
- ConfigMap creation during job submission
- Resource cleanup completeness (TrainJob + file cleanup)
- Namespace isolation and mount path validation
- Error handling for large files and API failures

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Add comprehensive tests for ClusterTrainingRuntime creation, TrainJob
management, and resource lifecycle with ConfigMapPackager integration.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Add Kubernetes client integration to create and delete ClusterTrainingRuntime
CRDs with experiment-specific configurations. Move Kubernetes configuration
validation to __post_init__ for early failure detection. Remove unused
cpu_request and memory_request parameters, add configurable image parameter.

Add comprehensive tests that verify actual Kubernetes API calls and CRD
body structure. Remove implementation comments and TODO statements to
focus on user-facing documentation.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Add CLI factory functions and entrypoint to hello_kubeflow.py example.
Follow established patterns from other executors with simple factory
functions and entrypoint integration. Include kubeflow_gpu and
kubeflow_cpu factories with sensible defaults for common use cases.

Update README with CLI usage examples and simplified documentation.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
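
Roughly what such factory functions look like under NeMo Run's CLI conventions; the defaults and names here are placeholders, and the actual factories live in examples/kubeflow/hello_kubeflow.py.

# Illustrative factory sketch; defaults are placeholders, not the example's values.
import nemo_run as run
from nemo_run.core.execution.kubeflow import KubeflowExecutor

@run.cli.factory
def kubeflow_gpu(nodes: int = 2, gpus_per_node: int = 8) -> run.Config[KubeflowExecutor]:
    return run.Config(KubeflowExecutor, nodes=nodes, gpus_per_node=gpus_per_node)

@run.cli.factory
def kubeflow_cpu(nodes: int = 1) -> run.Config[KubeflowExecutor]:
    return run.Config(KubeflowExecutor, nodes=nodes)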
@hemildesai (Contributor) commented:

Hey @jskswamy, thanks a lot for your contribution, this is great 🎉. Appreciate you taking the time since this is a pretty involved PR. I took a quick glance and the design and organization look good to me. Happy to proceed with merging once it's ready on your side and works end to end.

@hemildesai (Contributor) commented:

Hi, can you fix the DCO check by signing off your commits using --signoff?

Add support for inline script execution in the KubeflowExecutor
using the SDK's function argument injection style. This change allows
users to pass scripts directly as inline parameters, enhancing
flexibility in task execution.

Key changes include:
- Introduced `_nemo_inline_entry_params` function for handling
  inline script execution.
- Updated `create_trainjob` and `submit` methods to support
  inline scripts.
- Enhanced logging for better tracking of execution modes.
- Improved Kubernetes runtime management, enabling reuse of
  ClusterTrainingRuntime across experiments with similar
  configurations.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Updated the KubeflowExecutor class to support a unique executor name and
improved the handling of Kubernetes configurations. Key changes include:

- Introduced a `name` parameter for the executor to enable better
  identification in logs and configuration.
- Adjusted volume mount path defaults and removed unnecessary runtime
  parameters.
- Enhanced error handling for Kubernetes configuration failures, ensuring
  clearer logging of issues.
- Streamlined file staging and cleanup processes using the new name
  parameter.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
This commit refactors the KubeflowExecutor class to enhance task
management and streamline the creation of ClusterTrainingRuntime. Key
changes include:

- Removed the _nemo_inline_entry_params function, simplifying
  inline script handling.
- Introduced a new method to get additional files based on task
  types, allowing for better staging of files in ConfigMap.
- Updated the create_trainjob method to accept runtime_name
  directly, improving clarity in job submissions.
- Adjusted the _runtime_name method to generate names based on
  a unique identifier and hash, ensuring no collisions.
- Improved logging for better traceability during execution.

These modifications aim to simplify the executor's interface and
enhance its usability for developers working with Kubeflow.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Refactor the KubeflowExecutor class to replace the CustomTrainer
with CommandTrainer for improved task handling. Introduce a
new enable_tcpxo feature that configures a sidecar for TCP
enhancements in the runtime template. The implementation now
validates entrypoints and manages task configurations more
robustly, ensuring compatibility with the CommandTrainer.

- Added enable_tcpxo flag to runtime template
- Updated TrainerClient initialization with KubernetesBackendConfig
- Enhanced error handling for unsupported tasks
- Improved logging for trainer configurations and commands

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Reorganize annotations under the spec.template.metadata section
based on the enable_tcpxo condition. This improves clarity and
maintains consistency in configuration.

Add podAntiAffinity rules to ensure that replicated jobs do not
schedule on the same node, enhancing fault tolerance and resource
management.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Refactor the KubeflowExecutor to improve the way command and
arguments are built based on the task entrypoint. Introduce a new
method, _build_command_and_args, to centralize this logic.

Additionally, implement a mutation function, _mutate_bash_torchrun_flags,
that appends necessary torchrun flags to bash scripts, ensuring they
include required parameters for distributed training.

This change aims to streamline the execution of script tasks and
enhance compatibility with the PET framework for distributed
training setups.

- Added _build_command_and_args method for command construction
- Implemented _mutate_bash_torchrun_flags for bash script handling
- Updated tests to verify new functionality and flag injection

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
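
A rough illustration of the flag injection described in the commit above. The PET_* environment variables are assumed to be provided by the training runtime, and the exact flags the executor appends may differ.

# Illustrative only: build torchrun flags from PET_* environment variables
# (assumed to be injected by the runtime) and append them to a bash command
# that invokes torchrun without them.
def build_torchrun_flags() -> list[str]:
    return [
        "--nnodes=${PET_NNODES}",
        "--nproc_per_node=${PET_NPROC_PER_NODE}",
        "--node_rank=${PET_NODE_RANK}",
        "--master_addr=${PET_MASTER_ADDR}",
        "--master_port=${PET_MASTER_PORT}",
    ]

def mutate_bash_torchrun_flags(command: str) -> str:
    if "torchrun" in command and "--nnodes" not in command:
        command = command.replace("torchrun", "torchrun " + " ".join(build_torchrun_flags()), 1)
    return command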
Add a new StorageMount class to encapsulate the configuration
for persistent volume claims (PVCs) in KubeflowExecutor. This
includes attributes for mount path, read-only status, and
PVC specifics.

Enhance KubeflowExecutor to support storage mounts by ensuring
PVCs are created if they do not exist. Update the template
rendering to include storage PVC mounts in the runtime
configuration.

Add unit tests for storage mount normalization, PVC creation,
and template rendering, ensuring correct behavior in various
scenarios.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
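
A hedged sketch of the shape of such a storage mount and the ensure-PVC step; field names, defaults, and access mode are assumptions rather than the PR's exact API.

# Illustrative StorageMount-style dataclass plus an idempotent PVC check/create.
from dataclasses import dataclass
from typing import Optional
from kubernetes import client
from kubernetes.client.rest import ApiException

@dataclass
class StorageMount:
    pvc_name: str
    mount_path: str
    size: str = "100Gi"
    storage_class: Optional[str] = None
    read_only: bool = False

def ensure_pvc(mount: StorageMount, namespace: str) -> None:
    core = client.CoreV1Api()
    try:
        core.read_namespaced_persistent_volume_claim(name=mount.pvc_name, namespace=namespace)
        return  # PVC already exists, nothing to do
    except ApiException as e:
        if e.status != 404:
            raise
    spec = {
        "accessModes": ["ReadWriteMany"],
        "resources": {"requests": {"storage": mount.size}},
    }
    if mount.storage_class:
        spec["storageClassName"] = mount.storage_class
    body = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": mount.pvc_name},
        "spec": spec,
    }
    core.create_namespaced_persistent_volume_claim(namespace=namespace, body=body)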
Add functionality to handle launcher commands for both Script and Partial
tasks in the KubeflowExecutor. This update introduces helper functions to build
command arguments and integrates them into the existing execution flow.

Key changes include:
- Creation of `_build_pet_torchrun_flags` to standardize the flags for torchrun.
- Implementation of `_build_launcher_command_and_args` for generating the appropriate command and arguments based on task type and entrypoint.
- Refinement of `_materialize_task_content_for_staging` to accommodate both inline and function-based tasks.
- Adjustments in the test suite to validate the new functionality and ensure correct behavior when executing tasks with torchrun.

These enhancements aim to improve task execution consistency and facilitate better integration of PyTorch distributed training capabilities within the Kubeflow environment.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Updated the variable name from `volume_mount_path` to `workspace_mount_path`
in the `KubeflowExecutor` class to enhance clarity and consistency. This change
affects both the implementation in the code and the corresponding template files.

Additionally, updated unit tests to reflect the new variable name, ensuring
that all references are correctly aligned with this modification. This
improves readability and reduces potential confusion regarding the purpose of
the mount path.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Add support for specifying additional package installations in the
KubeflowExecutor. This enhancement allows users to configure packages
to be installed within the training container, improving flexibility
for various training requirements.

- Introduced an AdditionalPackages dataclass to encapsulate
  installation parameters.
- Updated the CommandTrainer instantiation to accept additional
  packages if configured.
- Modified get_volume_name and get_pvc_claim_name to ensure
  lowercase names for Kubernetes compatibility.

This change enables more robust customization of the training
environment, facilitating the inclusion of necessary dependencies
directly through executor configuration.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
@jskswamy force-pushed the feature/add-kubeflow-executor branch from 8d60b32 to 57c9ea2 on September 17, 2025 04:59
Add functionality to create and manage Kubernetes Secrets for
environment variables in the KubeflowExecutor class. The new
method `_ensure_env_secret` verifies the existence of the
secret and creates or updates it as necessary. This allows
for better handling of sensitive environment configurations.

Update the template rendering to include secrets as environment
variables, enhancing the flexibility of environment variable
management.

- Implemented `_ensure_env_secret` to manage Secrets
- Updated YAML template to include `env_from_secrets`
- Added tests for secret creation and conflict handling

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
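
A minimal sketch of the create-or-patch pattern described in the commit above (names are placeholders). Note that, per the code scanning alerts further down, log messages should mention only the Secret's name, never its contents.

# Illustrative create-or-patch flow for an environment-variable Secret.
# Only the Secret name is logged, never its data.
import logging
from kubernetes import client
from kubernetes.client.rest import ApiException

logger = logging.getLogger(__name__)

def ensure_env_secret(name: str, namespace: str, env: dict[str, str]) -> None:
    core = client.CoreV1Api()
    body = {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "stringData": env,
    }
    try:
        core.create_namespaced_secret(namespace=namespace, body=body)
        logger.info("Created Secret %s in %s", name, namespace)
    except ApiException as e:
        if e.status == 409:  # already exists, update in place
            core.patch_namespaced_secret(name=name, namespace=namespace, body=body)
            logger.info("Patched Secret %s in %s", name, namespace)
        else:
            raise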
@jskswamy force-pushed the feature/add-kubeflow-executor branch from 517fdc3 to 4532d7b on September 17, 2025 07:18
Revise the key sanitization logic to preserve underscores in file names,
aligning with Kubernetes naming conventions. The ConfigMap NAME must
still comply with DNS-1123 standards, but keys can now contain
underscores.

Additionally, update related tests to reflect this change, ensuring
that the sanitization process maintains underscores as valid characters
in keys. This enhances compatibility with Kubernetes while improving
the handling of file names.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Refactor the KubeflowExecutor class to improve clarity
in GPU configuration. The following changes were made:

- Renamed `gpus` to `gpus_per_node` to specify
  GPU allocation per node.
- Updated the related code references to use
  the new `gpus_per_node` attribute.
- Changed `image` to `container_image` for consistency
  in naming conventions.
- Adjusted tests to align with the updated attribute names
  and ensure proper functionality.

These changes enhance the readability and maintainability
of the code, making it clearer how GPU resources are managed.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Refactor the Kubeflow ClusterTrainingRuntime template
to use variable placeholders for the number of nodes
and the number of processes per node. This change
enhances flexibility by allowing dynamic configuration
based on input parameters.

Add unit tests to validate the rendering of nodes and
process counts in the template. The tests ensure that
the correct values are populated in the rendered output
for both single and multi-GPU scenarios.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Code scanning (CodeQL) raised the following findings on this PR:

  • Clear-text logging of sensitive information (High) - 4 alerts on the Secret handling code. The logger.info / logger.warning calls around creating and patching the environment Secret (for example "Created Secret {generated_secret_name} in {self.namespace}", "Patched Secret {generated_secret_name} with updated stringData in {self.namespace}", and the corresponding failure warnings) are flagged as logging sensitive data (secret) in clear text.
  • Unused local variable (Note) - 2 alerts in the test suite, reporting the variables test_pvc_creation_when_missing and test_pvc_creation_skipped_when_exists as unused.