KubeflowExecutor Implementation #313
Conversation
Add ConfigMapPackager to enable staging files into Kubernetes ConfigMaps for distributed training jobs. This packager supports file size validation, debug logging, and graceful error handling when the Kubernetes client is unavailable. The implementation includes comprehensive test coverage with parameterized tests for different job_dir scenarios and proper handling of large files that exceed the 1MB ConfigMap limit.

Dependencies added:
- kubernetes>=28.0.0: Official Kubernetes Python client for API interactions
- kubeflow: Kubeflow SDK for trainer integration and custom resource management

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
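The 1MB limit mentioned above comes from ConfigMaps being stored in etcd. A minimal sketch of that size check — the names `CONFIGMAP_SIZE_LIMIT` and `validate_file_sizes` are illustrative, not the actual ConfigMapPackager API:

```python
# Illustrative sketch of the ConfigMap size validation described in the
# commit message; names are assumptions, not the real packager's API.
CONFIGMAP_SIZE_LIMIT = 1024 * 1024  # etcd-backed ConfigMaps cap out near 1 MiB


def validate_file_sizes(sizes: dict[str, int]) -> list[str]:
    """Return the paths whose contents exceed the ConfigMap size limit."""
    return [path for path, size in sizes.items() if size > CONFIGMAP_SIZE_LIMIT]
```

Files returned by such a check would be rejected (or routed to another staging mechanism) before the ConfigMap is created.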
This commit introduces a new KubeflowExecutor that enables distributed training jobs on Kubernetes using the Kubeflow Trainer SDK. The executor supports both file-based and function-based execution modes, with files staged into Kubernetes ConfigMaps for execution.

Key features:
- Integration with Kubeflow Trainer SDK for TrainJob management
- Support for both file-based and function-based execution
- ConfigMapPackager integration for file staging
- Comprehensive test coverage with parameterized tests
- Consistent API with other executors (LocalExecutor, SlurmExecutor)
- Resource management (CPU, memory, GPU requests/limits)
- Custom runtime support via runtime_name parameter

The executor follows the same patterns as existing executors and is integrated into the experiment system for parallel and detached execution support.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Add comprehensive documentation for KubeflowExecutor in the execution guide, following the same pattern as other executors. The documentation includes:
- Basic and advanced configuration examples
- File-based and function-based execution modes
- ConfigMapPackager integration for file staging
- Prerequisites and architecture explanation
- Monitoring, debugging, and troubleshooting sections

Also update the README to include KubeflowExecutor in the supported executors list and add a new section highlighting all available executors. The KubeflowExecutor enables distributed training jobs on Kubernetes using the Kubeflow Trainer SDK while following proper separation of concerns between ClusterOps and MLE teams.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Fix Kubernetes resource naming issues by adding a sanitize_kubernetes_name utility function that replaces underscores with hyphens to comply with RFC 1123 rules. Extract ConfigMap key sanitization logic into a reusable _sanitize_configmap_key method that replaces forward slashes with hyphens for valid ConfigMap keys.

Improve resource formatting in KubeflowExecutor to use the flat structure expected by the Kubeflow SDK instead of the nested limits/requests format. Add comprehensive parametrized tests for sanitization functions and error-handling scenarios. Remove redundant tests and fix brittle assertions to call the actual methods rather than duplicating their logic.

These changes resolve ConfigMap creation failures due to invalid names and keys, ensuring successful NeMo Run training job submissions to Kubeflow.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
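The two sanitizers described in this commit can be sketched as follows — a hedged reconstruction, since the real helpers live in the PR's base.py and packager and may differ in detail:

```python
import re


def sanitize_kubernetes_name(name: str) -> str:
    """RFC 1123 resource names: lowercase alphanumerics and hyphens only.

    Sketch of the utility described above: underscores become hyphens,
    anything else invalid is also replaced, and the result is trimmed.
    """
    name = re.sub(r"[^a-z0-9-]", "-", name.lower().replace("_", "-"))
    return name.strip("-")[:253]  # DNS subdomain names max out at 253 chars


def sanitize_configmap_key(path: str) -> str:
    """ConfigMap keys cannot contain '/', so path separators become hyphens."""
    return path.replace("/", "-")
```

A later commit in this PR revises the key rule to preserve underscores; this version reflects the behavior at this point in the history.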
This commit addresses several critical issues in the Kubeflow executor:
- Consolidates Kubernetes name sanitization logic into base.py to avoid duplication and ensure consistent naming across the codebase
- Removes hardcoded resource defaults from KubeflowExecutor to allow ClusterTrainingRuntime to provide appropriate defaults
- Fixes double-prefix issue in ConfigMap names by ensuring only ConfigMapPackager adds the workspace prefix
- Adds configurable volume_mount_path and default_task_dir parameters for better flexibility
- Implements dynamic file path resolution that correctly infers staged file locations
- Updates all tests to reflect the new behavior and ensure proper coverage

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Add comprehensive ConfigMapPackager integration to KubeflowExecutor with full test coverage. This enables staging user files as Kubernetes ConfigMaps for distributed training jobs.

Key changes:
- Fix CustomTrainer API integration (script -> python_file parameter)
- Add ConfigMapPackager staging and cleanup functionality
- Implement proper resource management with experiment lifecycle
- Add comprehensive test suite with 101 passing tests
- Support both Script and Partial task types from Experiment API
- Add proper error handling and logging for ConfigMap operations

The implementation follows NeMo Run architecture with clear separation between tasks (what to run) and executors (where/how to run).

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Implement focused integration tests for ConfigMapPackager with KubeflowExecutor covering the complete ConfigMap lifecycle and resource management.

Key improvements:
- Comprehensive integration test covering ConfigMap creation, sanitization, large file handling, mount path validation, and error recovery
- Lifecycle management test verifying complete resource cleanup

The tests verify:
- ConfigMap creation during job submission
- Resource cleanup completeness (TrainJob + file cleanup)
- Namespace isolation and mount path validation
- Error handling for large files and API failures

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Add comprehensive tests for ClusterTrainingRuntime creation, TrainJob management, and resource lifecycle with ConfigMapPackager integration. Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Add Kubernetes client integration to create and delete ClusterTrainingRuntime CRDs with experiment-specific configurations. Move Kubernetes configuration validation to __post_init__ for early failure detection. Remove unused cpu_request and memory_request parameters, add configurable image parameter. Add comprehensive tests that verify actual Kubernetes API calls and CRD body structure. Remove implementation comments and TODO statements to focus on user-facing documentation. Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Add CLI factory functions and entrypoint to hello_kubeflow.py example. Follow established patterns from other executors with simple factory functions and entrypoint integration. Include kubeflow_gpu and kubeflow_cpu factories with sensible defaults for common use cases. Update README with CLI usage examples and simplified documentation. Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Hey @jskswamy Thanks a lot for your contribution, this is great 🎉. Appreciate you taking the time since this is a pretty involved PR. I took a quick glance and the design and organization look good to me. Happy to proceed with merging once it's ready on your side and works end to end.
Hi, can you fix the DCO check by signing off your commits?
Add support for inline script execution in the KubeflowExecutor using the SDK's function argument injection style. This change allows users to pass scripts directly as inline parameters, enhancing flexibility in task execution.

Key changes include:
- Introduced `_nemo_inline_entry_params` function for handling inline script execution.
- Updated `create_trainjob` and `submit` methods to support inline scripts.
- Enhanced logging for better tracking of execution modes.
- Improved Kubernetes runtime management, enabling reuse of ClusterTrainingRuntime across experiments with similar configurations.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Updated the KubeflowExecutor class to support a unique executor name and improved the handling of Kubernetes configurations.

Key changes include:
- Introduced a `name` parameter for the executor to enable better identification in logs and configuration.
- Adjusted volume mount path defaults and removed unnecessary runtime parameters.
- Enhanced error handling for Kubernetes configuration failures, ensuring clearer logging of issues.
- Streamlined file staging and cleanup processes using the new name parameter.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
This commit refactors the KubeflowExecutor class to enhance task management and streamline the creation of ClusterTrainingRuntime.

Key changes include:
- Removed the _nemo_inline_entry_params function, simplifying inline script handling.
- Introduced a new method to get additional files based on task types, allowing for better staging of files in ConfigMap.
- Updated the create_trainjob method to accept runtime_name directly, improving clarity in job submissions.
- Adjusted the _runtime_name method to generate names based on a unique identifier and hash, ensuring no collisions.
- Improved logging for better traceability during execution.

These modifications aim to simplify the executor's interface and enhance its usability for developers working with Kubeflow.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Refactor the KubeflowExecutor class to replace the CustomTrainer with CommandTrainer for improved task handling. Introduce a new enable_tcpxo feature that configures a sidecar for TCP enhancements in the runtime template. The implementation now validates entrypoints and manages task configurations more robustly, ensuring compatibility with the CommandTrainer.

- Added enable_tcpxo flag to runtime template
- Updated TrainerClient initialization with KubernetesBackendConfig
- Enhanced error handling for unsupported tasks
- Improved logging for trainer configurations and commands

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Reorganize annotations under the spec.template.metadata section based on the enable_tcpxo condition. This improves clarity and maintains consistency in configuration. Add podAntiAffinity rules to ensure that replicated jobs do not schedule on the same node, enhancing fault tolerance and resource management. Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Refactor the KubeflowExecutor to improve the way command and arguments are built based on the task entrypoint. Introduce a new method, _build_command_and_args, to centralize this logic. Additionally, implement a mutation function, _mutate_bash_torchrun_flags, that appends necessary torchrun flags to bash scripts, ensuring they include the required parameters for distributed training. This change aims to streamline the execution of script tasks and enhance compatibility with the PET framework for distributed training setups.

- Added _build_command_and_args method for command construction
- Implemented _mutate_bash_torchrun_flags for bash script handling
- Updated tests to verify new functionality and flag injection

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
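The flag-mutation idea can be sketched as below. This is an illustrative reconstruction: the real `_mutate_bash_torchrun_flags` and the exact set of PET (PyTorch Elastic Training) environment variables it references may differ from what is assumed here:

```python
# Assumed flag set; Kubeflow Trainer injects PET_* variables into the pods,
# but the precise variables used by the PR are not shown in this thread.
PET_TORCHRUN_FLAGS = (
    "--nnodes=${PET_NNODES} "
    "--nproc-per-node=${PET_NPROC_PER_NODE} "
    "--rdzv-endpoint=${PET_MASTER_ADDR}:${PET_MASTER_PORT}"
)


def mutate_bash_torchrun_flags(script: str) -> str:
    """Insert distributed-training flags after `torchrun` on lines lacking them."""
    mutated = []
    for line in script.splitlines():
        if line.lstrip().startswith("torchrun") and "--nnodes" not in line:
            prefix, rest = line.split("torchrun", 1)
            line = f"{prefix}torchrun {PET_TORCHRUN_FLAGS}{rest}"
        mutated.append(line)
    return "\n".join(mutated)
```

Inserting the flags between `torchrun` and the training script keeps them as launcher options rather than script arguments.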
Add a new StorageMount class to encapsulate the configuration for persistent volume claims (PVCs) in KubeflowExecutor. This includes attributes for mount path, read-only status, and PVC specifics. Enhance KubeflowExecutor to support storage mounts by ensuring PVCs are created if they do not exist. Update the template rendering to include storage PVC mounts in the runtime configuration. Add unit tests for storage mount normalization, PVC creation, and template rendering, ensuring correct behavior in various scenarios. Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
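A hypothetical shape for the StorageMount configuration described above — the field names are inferred from the commit message, not copied from the PR:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageMount:
    """Sketch of a PVC mount config; field names are assumptions."""

    mount_path: str                         # where the volume appears in the container
    pvc_name: str                           # PersistentVolumeClaim to bind (or create)
    storage_size: str = "10Gi"              # requested capacity if the PVC is created
    storage_class_name: Optional[str] = None  # None -> cluster default StorageClass
    read_only: bool = False
```

On submission, the executor would check whether `pvc_name` exists in the namespace and create the claim with `storage_size` and `storage_class_name` if it does not.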
Add functionality to handle launcher commands for both Script and Partial tasks in the KubeflowExecutor. This update introduces helper functions to build command arguments and integrates them into the existing execution flow.

Key changes include:
- Creation of `_build_pet_torchrun_flags` to standardize the flags for torchrun.
- Implementation of `_build_launcher_command_and_args` for generating the appropriate command and arguments based on task type and entrypoint.
- Refinement of `_materialize_task_content_for_staging` to accommodate both inline and function-based tasks.
- Adjustments in the test suite to validate the new functionality and ensure correct behavior when executing tasks with torchrun.

These enhancements aim to improve task execution consistency and facilitate better integration of PyTorch distributed training capabilities within the Kubeflow environment.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Updated the variable name from `volume_mount_path` to `workspace_mount_path` in the `KubeflowExecutor` class to enhance clarity and consistency. This change affects both the implementation in the code and the corresponding template files. Additionally, updated unit tests to reflect the new variable name, ensuring that all references are correctly aligned with this modification. This improves readability and reduces potential confusion regarding the purpose of the mount path. Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Add support for specifying additional package installations in the KubeflowExecutor. This enhancement allows users to configure packages to be installed within the training container, improving flexibility for various training requirements.

- Introduced an AdditionalPackages dataclass to encapsulate installation parameters.
- Updated the CommandTrainer instantiation to accept additional packages if configured.
- Modified get_volume_name and get_pvc_claim_name to ensure lowercase names for Kubernetes compatibility.

This change enables more robust customization of the training environment, facilitating the inclusion of necessary dependencies directly through executor configuration.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
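A hedged sketch of the AdditionalPackages dataclass and the lowercase-name helpers mentioned above — the exact field names and helper signatures are assumptions:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AdditionalPackages:
    """Sketch of per-job pip installs; field names are illustrative."""

    packages: List[str] = field(default_factory=list)  # e.g. ["wandb", "einops"]
    index_url: Optional[str] = None                    # optional alternate pip index


def get_volume_name(name: str) -> str:
    """Kubernetes volume names must be lowercase RFC 1123 labels."""
    return name.lower().replace("_", "-")


def get_pvc_claim_name(name: str) -> str:
    """Derive the claim name from the (lowercased) volume name."""
    return f"{get_volume_name(name)}-pvc"
```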
Add functionality to create and manage Kubernetes Secrets for environment variables in the KubeflowExecutor class. The new method `_ensure_env_secret` verifies the existence of the secret and creates or updates it as necessary. This allows for better handling of sensitive environment configurations. Update the template rendering to include secrets as environment variables, enhancing the flexibility of environment variable management.

- Implemented `_ensure_env_secret` to manage Secrets
- Updated YAML template to include `env_from_secrets`
- Added tests for secret creation and conflict handling

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
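The create-or-patch flow can be sketched with the Kubernetes client calls injected as callables, so the conflict-handling logic is visible and testable offline. The real method presumably uses `CoreV1Api` and treats an HTTP 409 Conflict from `create_namespaced_secret` as "already exists, patch instead"; `ConflictError` below is a hypothetical stand-in for that:

```python
class ConflictError(Exception):
    """Stand-in for kubernetes.client.ApiException with status == 409."""


def ensure_env_secret(name, string_data, create, patch):
    """Create the Secret; if it already exists, patch its stringData instead."""
    try:
        create(name, string_data)
        return "created"
    except ConflictError:
        patch(name, string_data)
        return "patched"
```

Note that any logging around this flow should mention only the Secret name, never its `stringData` — the CodeQL findings later in this thread flag exactly that kind of clear-text logging.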
Revise the key sanitization logic to preserve underscores in file names, aligning with Kubernetes naming conventions. The ConfigMap NAME must still comply with DNS-1123 standards, but keys can now contain underscores. Additionally, update related tests to reflect this change, ensuring that the sanitization process maintains underscores as valid characters in keys. This enhances compatibility with Kubernetes while improving the handling of file names. Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
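The revised rule distinguishes the two namespaces: ConfigMap *keys* may use any character in `[-._a-zA-Z0-9]`, so underscores in file names survive, while the ConfigMap *name* must still be a DNS-1123 subdomain (lowercase alphanumerics, `-` and `.`). A minimal sketch, with helper names assumed:

```python
import re


def sanitize_configmap_key(path: str) -> str:
    """Keys keep underscores and dots; path separators become hyphens."""
    key = path.replace("/", "-")                 # '/' is invalid in a key
    return re.sub(r"[^-._a-zA-Z0-9]", "-", key)  # underscores are preserved


def sanitize_configmap_name(name: str) -> str:
    """Names must still be DNS-1123: lowercase, no underscores."""
    return re.sub(r"[^a-z0-9.-]", "-", name.lower()).strip("-.")
```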
Refactor the KubeflowExecutor class to improve clarity in GPU configuration. The following changes were made:
- Renamed `gpus` to `gpus_per_node` to specify GPU allocation per node.
- Updated the related code references to use the new `gpus_per_node` attribute.
- Changed `image` to `container_image` for consistency in naming conventions.
- Adjusted tests to align with the updated attribute names and ensure proper functionality.

These changes enhance the readability and maintainability of the code, making it clearer how GPU resources are managed.

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Refactor the Kubeflow ClusterTrainingRuntime template to use variable placeholders for the number of nodes and the number of processes per node. This change enhances flexibility by allowing dynamic configuration based on input parameters. Add unit tests to validate the rendering of nodes and process counts in the template. The tests ensure that the correct values are populated in the rendered output for both single and multi-GPU scenarios. Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
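The substitution step can be sketched as below. The actual PR templates a full ClusterTrainingRuntime YAML; the fragment and field names here (`numNodes`, `numProcPerNode`, following Kubeflow Trainer v2 conventions) are assumptions for illustration:

```python
from string import Template

# Hypothetical fragment of the runtime template with the two placeholders
# described above.
RUNTIME_SNIPPET = Template(
    "mlPolicy:\n"
    "  numNodes: $nodes\n"
    "  torch:\n"
    "    numProcPerNode: $procs\n"
)


def render_runtime(nodes: int, gpus_per_node: int) -> str:
    # One torchrun worker process per GPU on each node.
    return RUNTIME_SNIPPET.substitute(nodes=nodes, procs=gpus_per_node)
```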
Diff excerpt from the Secret-management code:

```python
        type="Opaque",
    )
    core_client.create_namespaced_secret(namespace=self.namespace, body=body)
    logger.info(f"Created Secret {generated_secret_name} in {self.namespace}")
    # ...
        core_client.patch_namespaced_secret(
            name=generated_secret_name, namespace=self.namespace, body=patch_body
        )
        logger.info(
            f"Patched Secret {generated_secret_name} with updated stringData in {self.namespace}"
        )
    except Exception as patch_err:
        logger.warning(f"Failed to patch Secret {generated_secret_name}: {patch_err}")
    else:
        logger.warning(f"Failed to create Secret {generated_secret_name}: {e}")
```

Code scanning / CodeQL — Check failure (×4): Clear-text logging of sensitive information (High) — sensitive data (secret). Each of the four `logger` calls above is flagged.
```python
        # GPU count should be present under both requests and limits
        assert '"nvidia.com/gpu": 8' in rendered


    def test_pvc_creation_when_missing(self, mocker):
```

Code scanning / CodeQL — Check notice: Unused local variable (Note, test).
```python
        assert body["spec"]["resources"]["requests"]["storage"] == "200Gi"
        assert body.get("spec", {}).get("storageClassName") == "standard"


    def test_pvc_creation_skipped_when_exists(self, mocker):
```

Code scanning / CodeQL — Check notice: Unused local variable (Note, test).
🎯 Overview
This PR adds Kubernetes support to NeMo Run by implementing a complete `KubeflowExecutor` that enables distributed training on Kubernetes clusters using Kubeflow Trainer.

User Value: NeMo Run users can now run their experiments directly on Kubernetes clusters, leveraging the power of Kubeflow Trainer for distributed training without leaving the familiar NeMo Run API.
Key Capabilities:
🏗️ Architecture
Key Components
Why This Implementation?
Current State: NeMo Run currently supports various executors for different environments (local, SkyPilot, etc.) but lacks a Kubernetes executor for running experiments directly on Kubernetes clusters.
This Implementation Adds: Complete Kubernetes support for NeMo Run using Kubeflow Trainer, enabling users to:
Technical Approach: The Kubeflow Trainer SDK only provides `TrainJob` methods, so we implement `ClusterTrainingRuntime` management directly via the Kubernetes API.

📁 Files Added/Modified
Core Implementation
- `nemo_run/core/execution/kubeflow.py` - Main KubeflowExecutor implementation
- `nemo_run/core/packaging/configmap.py` - ConfigMapPackager for file staging

Tests
- `test/core/execution/test_kubeflow.py` - comprehensive unit tests for KubeflowExecutor and ConfigMapPackager integration
- `test/core/packaging/test_configmap.py` - comprehensive unit tests for ConfigMapPackager

Examples
- `examples/kubeflow/hello_kubeflow.py` - Complete example with CLI integration
- `examples/kubeflow/README.md` - Comprehensive documentation
- `examples/kubeflow/architecture.md` - Detailed architecture diagrams

🚀 Key Features
1. Seamless NeMo Run Integration
2. Automatic Resource Management
- `ClusterTrainingRuntime` per experiment
- `TrainJob` lifecycle

3. File Staging via ConfigMaps
- `ConfigMapPackager`

🔧 Implementation Details
Resource Creation Flow
Error Handling