diff --git a/docs/user-manuals/queue-management.md b/docs/user-manuals/queue-management.md index 30d1999856..68041f3395 100644 --- a/docs/user-manuals/queue-management.md +++ b/docs/user-manuals/queue-management.md @@ -204,7 +204,96 @@ my-job-blocked Job The `QueueUnit` stays in `Enqueued` phase because `team-a` has already reached its `max` quota. Once `my-job` completes and resources are released, `my-job-blocked` will be dequeued automatically. -For other job types (TFJob, PyTorchJob, etc.), use the `scheduling.x-k8s.io/suspend: "true"` annotation instead of `spec.suspend`. +### Job Suspension by Type + +Different job types use different fields for suspension: + +| Job Type | API Version | Suspension Field | Example | Status | +|----------|-------------|------------------|---------|--------| +| Kubernetes Job | `batch/v1` | `.spec.suspend` | `spec.suspend: true` | Supported | +| TFJob | `kubeflow.org/v1` | `.spec.runPolicy.suspend` | `spec.runPolicy.suspend: true` | Supported | +| PyTorchJob | `kubeflow.org/v1` | `.spec.runPolicy.suspend` | `spec.runPolicy.suspend: true` | Supported | +| Argo Workflow | `argoproj.io/v1alpha1` | Add `koord-queue-suspend` template | See example below | Supported | +| SparkApplication | `sparkoperator.k8s.io/v1beta2` | `.spec.suspend` | | WIP | +| XGBoostJob | `kubeflow.org/v1` | `.spec.runPolicy.suspend` | | Not Supported Yet | +| PaddleJob | `kubeflow.org/v1` | `.spec.runPolicy.suspend` | | Not Supported Yet | + +**Argo Workflow Example:** + +For Argo Workflow, Koord-Queue uses a special suspend template named `koord-queue-suspend`. The workflow must meet the following conditions to be managed by the queue: + +1. Contains a template named `koord-queue-suspend` with a `suspend` field +2. 
The workflow has a suspend node in the `Running` phase, or `spec.suspend` is set to `true` + +```yaml +apiVersion: argoproj.io/v1alpha1 +kind: Workflow +metadata: + name: my-workflow + annotations: + koord-queue/min-resources: | + cpu: 5 + memory: 5Gi +spec: + suspend: true + templates: + # Add this suspend template for queue management + - name: koord-queue-suspend + suspend: {} + # Your actual workflow templates + - name: main + container: + image: python:3.9 + command: [python, -c, "print('Hello from workflow')"] + entrypoint: main +``` + +**How it works:** + +When a Workflow is submitted, Koord-Queue decides whether to manage it by: +- Scanning all templates for a `koord-queue-suspend` template with a `suspend` field +- Checking whether any workflow node is of type `Suspend` and in the `Running` phase +- Or checking whether `spec.suspend` is set to `true` + +When the `QueueUnit` is dequeued, the Extension Server removes the suspend condition, allowing the workflow to proceed. + +**TFJob Example:** + +For TFJob, set `spec.runPolicy.suspend: true` to enable queue management: + +```yaml +apiVersion: kubeflow.org/v1 +kind: TFJob +metadata: + labels: + quota.scheduling.koordinator.sh/name: team-a-queue +spec: + runPolicy: + suspend: true +``` + +**PyTorchJob Example:** + +For PyTorchJob, set `spec.runPolicy.suspend: true` to enable queue management: + +```yaml +apiVersion: kubeflow.org/v1 +kind: PyTorchJob +metadata: + labels: + quota.scheduling.koordinator.sh/name: team-a-queue +spec: + runPolicy: + suspend: true +``` + +**How it works for Kubeflow Jobs:** + +When a TFJob or PyTorchJob is submitted: +1. The job extension detects the new job with `spec.runPolicy.suspend: true` +2. A corresponding `QueueUnit` is automatically created +3. The job waits in the queue until resources are available +4. 
When dequeued, the Extension Server sets `spec.runPolicy.suspend: false`, allowing the job to create pods and start training ## Use Queue diff --git a/docs/user-manuals/run-pytorchjob-in-koordinator.md b/docs/user-manuals/run-pytorchjob-in-koordinator.md new file mode 100644 index 0000000000..c675da4f78 --- /dev/null +++ b/docs/user-manuals/run-pytorchjob-in-koordinator.md @@ -0,0 +1,343 @@ +# Run PyTorchJob in Koordinator + +This guide explains how to run PyTorchJob workloads in Koordinator with integrated queue management and resource scheduling capabilities. + +## Overview + +Koordinator provides native support for PyTorchJob through its Koord-Queue integration. This enables: + +- **Job-level queuing**: Manage entire PyTorchJob workloads as units rather than individual pods +- **Deep ElasticQuota integration**: Leverage Koordinator's resource quota system for fair sharing and elastic allocation +- **Pre-scheduling**: Queue jobs before they create pods to reduce scheduler pressure +- **Multi-tenant isolation**: Support for multiple teams/projects with resource isolation +- **Priority-based scheduling**: Configure job priorities for fair resource allocation + +## Prerequisites + +Before running PyTorchJob in Koordinator, ensure you have: + +- Kubernetes cluster >= 1.22 +- Koordinator >= 1.5 installed +- Koord-Queue installed and configured +- PyTorchJob V1 CRDs installed (typically via [Training Operator V1](https://www.kubeflow.org/docs/components/trainer/legacy-v1/installation/)) + +## Installation + +### 1. 
Install Koord-Queue + +If not already installed, deploy Koord-Queue using Helm: + +```bash +helm repo add koordinator-sh https://koordinator-sh.github.io/charts/ +helm install koord-queue koordinator-sh/koord-queue --version 1.8.0 \ + --namespace koord-queue \ + --create-namespace +``` + +Enable the PyTorchJob extension in the Helm values: + +```yaml +# values.yaml +extension: + pytorch: + enable: true +``` + +Install with the custom values (`helm upgrade --install` also works when the release already exists): + +```bash +helm upgrade --install koord-queue koordinator-sh/koord-queue --version 1.8.0 \ + --namespace koord-queue \ + --create-namespace \ + -f values.yaml +``` + +### 2. Verify Installation + +```bash +# Check deployments +kubectl get deployment -n koord-queue + +# Verify CRDs +kubectl get crd | grep -E "(queue|pytorchjob)" +``` + +## Configuration + +### 1. Create an ElasticQuota + +Create an ElasticQuota to define resource boundaries for your PyTorchJob queue: + +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: pytorch-team-a + labels: + koord-queue/queue-policy: Priority # Priority, Block, or Intelligent +spec: + max: + cpu: "100" + memory: 200Gi + nvidia.com/gpu: "8" + min: + cpu: "20" + memory: 40Gi + nvidia.com/gpu: "2" +``` + +Apply the configuration: + +```bash +kubectl apply -f elastic-quota.yaml +``` + +### 2. 
Create a Queue (Optional) + +For advanced queue configuration, create a Queue CR: + +```yaml +apiVersion: scheduling.x-k8s.io/v1alpha1 +kind: Queue +metadata: + name: pytorch-training-queue + namespace: koord-queue +spec: + queuePolicy: Priority + priority: 100 + # admissionChecks: [] # Optional: add admission checks if needed +``` + +Apply the queue: + +```bash +kubectl apply -f queue.yaml +``` + +## Running PyTorchJob + +### Basic PyTorchJob Example + +Create a simple distributed PyTorchJob: + +```yaml +apiVersion: kubeflow.org/v1 +kind: PyTorchJob +metadata: + name: pytorch-training-job + namespace: default + annotations: + # Optional: specify which queue to use (defaults to queue matching ElasticQuota name) + scheduling.x-k8s.io/queue: pytorch-team-a + # Optional: set job priority within the queue + scheduling.x-k8s.io/priority: "10" +spec: + pytorchReplicaSpecs: + Master: + replicas: 1 + restartPolicy: OnFailure + template: + spec: + containers: + - name: pytorch + image: pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime + command: + - "python" + - "-m" + - "torch.distributed.launch" + - "--nproc_per_node=1" + - "--nnodes=2" + - "--node_rank=$(RANK)" + - "--master_addr=$(MASTER_ADDR)" + - "--master_port=$(MASTER_PORT)" + - "train.py" + resources: + requests: + cpu: "4" + memory: 8Gi + nvidia.com/gpu: "1" + limits: + cpu: "4" + memory: 8Gi + nvidia.com/gpu: "1" + env: + - name: RANK + valueFrom: + fieldRef: + fieldPath: metadata.annotations['kubeflow.org/rank'] + - name: MASTER_ADDR + valueFrom: + fieldRef: + fieldPath: metadata.annotations['kubeflow.org/master-address'] + - name: MASTER_PORT + value: "29500" + Worker: + replicas: 1 + restartPolicy: OnFailure + template: + spec: + containers: + - name: pytorch + image: pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime + command: + - "python" + - "-m" + - "torch.distributed.launch" + - "--nproc_per_node=1" + - "--nnodes=2" + - "--node_rank=$(RANK)" + - "--master_addr=$(MASTER_ADDR)" + - "--master_port=$(MASTER_PORT)" 
+ - "train.py" + resources: + requests: + cpu: "4" + memory: 8Gi + nvidia.com/gpu: "1" + limits: + cpu: "4" + memory: 8Gi + nvidia.com/gpu: "1" + env: + - name: RANK + valueFrom: + fieldRef: + fieldPath: metadata.annotations['kubeflow.org/rank'] + - name: MASTER_ADDR + valueFrom: + fieldRef: + fieldPath: metadata.annotations['kubeflow.org/master-address'] + - name: MASTER_PORT + value: "29500" +``` + +Apply the PyTorchJob: + +```bash +kubectl apply -f pytorchjob.yaml +``` + +### How It Works + +When you create a PyTorchJob: + +1. **Automatic QueueUnit Creation**: Koord-Queue Controllers automatically detect the new PyTorchJob and create a corresponding `QueueUnit` resource +2. **Job Suspension**: The PyTorchJob stays suspended while `spec.runPolicy.suspend` is `true`, the suspension field for Kubeflow jobs described in [Queue Management](./queue-management.md) +3. **Queue Processing**: The Queue Scheduler evaluates the job based on queue policy, priority, and available resources +4. **Resource Allocation**: If resources are available according to the ElasticQuota, the QueueUnit transitions to the `Dequeued` state +5. **Job Execution**: The Extension Server sets `spec.runPolicy.suspend: false`, allowing the PyTorchJob to create pods and start training + +## Advanced Configuration + +### Priority-Based Scheduling + +Configure job priority by setting `priorityClassName` in the PyTorchJob pod template: + +```yaml +apiVersion: kubeflow.org/v1 +kind: PyTorchJob +metadata: + name: high-priority-training + namespace: default +spec: + pytorchReplicaSpecs: + Master: + replicas: 1 + template: + spec: + priorityClassName: high-priority # Use a PriorityClass + containers: + - name: pytorch + image: pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime + resources: + requests: + cpu: "8" + memory: 16Gi + nvidia.com/gpu: "2" +``` + +### Resource Quota Integration + +The PyTorchJob automatically respects the ElasticQuota limits. 
Monitor quota usage: + +```bash +# Check ElasticQuota status +kubectl describe elasticquota pytorch-team-a + +# Check QueueUnit status +kubectl get queueunit -n default +kubectl describe queueunit +``` + +### Queue Policies + +Koord-Queue supports three queue policies: + +- **Priority**: Jobs with higher priority values are dequeued first (default) +- **Block**: Strict resource blocking: jobs wait until resources are guaranteed +- **Intelligent**: Dual-queue mechanism with a configurable priority threshold + +Configure the policy through the ElasticQuota label: + +```yaml +metadata: + labels: + koord-queue/queue-policy: Block # or Priority, Intelligent +``` + +## Monitoring and Troubleshooting + +### Check Job Status + +```bash +# Check PyTorchJob status +kubectl get pytorchjob +kubectl describe pytorchjob + +# Check QueueUnit status +kubectl get queueunit +kubectl describe queueunit + +# Check pod status (job name taken from the basic example) +kubectl get pods -l training.kubeflow.org/job-name=pytorch-training-job +``` + +### Common Issues + +1. **Job stuck in suspended state**: + - Verify that the ElasticQuota has sufficient resources + - Check QueueUnit status for admission check failures + - Review queue policy settings + +2. **Resource allocation failures**: + - Check that the ElasticQuota min/max limits are properly configured + - Verify that the cluster has sufficient GPU resources + - Review node capacity and taints + +3. **Queue not processing jobs**: + - Verify that the koord-queue controllers are running + - Check logs: `kubectl logs -n koord-queue deployment/koord-queue-controllers` + +## Best Practices + +1. **Use Priority Classes**: Define PriorityClasses for different training workload types +2. **Set Realistic Resource Requests**: Accurately estimate CPU, memory, and GPU requirements +3. **Monitor Quota Usage**: Regularly check ElasticQuota usage to avoid resource contention +4. **Use Gang Scheduling**: For distributed training, ensure all replicas are scheduled together +5. 
**Implement Resource Limits**: Set both requests and limits to prevent resource overcommitment + +## Integration with Other Koordinator Features + +PyTorchJob in Koordinator can leverage additional features: + +- **GPU Share**: Share GPU resources across multiple jobs +- **Network Topology Awareness**: Optimize pod placement for distributed training +- **Load-Aware Scheduling**: Balance cluster load during training workloads +- **Preemption**: Higher priority jobs can preempt lower priority ones + +## Next Steps + +- Learn about [Koord-Queue](./queue-management.md) for advanced queue management +- Explore [ElasticQuota](../architecture/resource-model.md) for resource management +- Read about [Gang Scheduling](../designs/gang-scheduling.md) for distributed training +- Check [Koordinator Architecture](../architecture/overview.md) for comprehensive understanding \ No newline at end of file diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/queue-management.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/queue-management.md index a60ddd9506..6ab7e1030f 100644 --- a/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/queue-management.md +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/queue-management.md @@ -204,7 +204,96 @@ my-job-blocked Job `QueueUnit` 保持 `Enqueued` 状态,因为 `team-a` 已达到 `max` 配额上限。当 `my-job` 完成并释放资源后,`my-job-blocked` 将自动出队执行。 -对于其他作业类型(TFJob、PyTorchJob 等),请使用 `scheduling.x-k8s.io/suspend: "true"` 注解代替 `spec.suspend`。 +### 不同作业类型的暂停方式 + +不同类型的作业使用不同的字段进行暂停: + +| 作业类型 | API 版本 | 暂停字段 | 示例 | 状态 | +|----------|---------|---------|------|------| +| Kubernetes Job | `batch/v1` | `.spec.suspend` | `spec.suspend: true` | 已支持 | +| TFJob | `kubeflow.org/v1` | `.spec.runPolicy.suspend` | `spec.runPolicy.suspend: true` | 已支持 | +| PyTorchJob | `kubeflow.org/v1` | `.spec.runPolicy.suspend` | `spec.runPolicy.suspend: true` | 已支持 | +| Argo Workflow | `argoproj.io/v1alpha1` | 添加 
`koord-queue-suspend` 模板 | 见下方示例 | 已支持 | +| SparkApplication | `sparkoperator.k8s.io/v1beta2` | `.spec.suspend` | | 开发中 | +| XGBoostJob | `kubeflow.org/v1` | `.spec.runPolicy.suspend` | | 尚未支持 | +| PaddleJob | `kubeflow.org/v1` | `.spec.runPolicy.suspend` | | 尚未支持 | + +**Argo Workflow 示例:** + +对于 Argo Workflow,Koord-Queue 使用名为 `koord-queue-suspend` 的特殊暂停模板。工作流必须满足以下条件才能被队列管理: + +1. 包含名为 `koord-queue-suspend` 的模板,且有 `suspend` 字段 +2. 工作流有处于 Running 状态的 suspend 节点,或者 `spec.suspend` 设置为 true + +```yaml +apiVersion: argoproj.io/v1alpha1 +kind: Workflow +metadata: + name: my-workflow + annotations: + koord-queue/min-resources: | + cpu: 5 + memory: 5Gi +spec: + suspend: true + templates: + # 添加此暂停模板用于队列管理 + - name: koord-queue-suspend + suspend: {} + # 你的实际工作流模板 + - name: main + container: + image: python:3.9 + command: [python, -c, "print('Hello from workflow')"] + entrypoint: main +``` + +**工作原理:** + +当提交工作流时,Koord-Queue 通过以下方式检查是否应该管理它: +- 扫描所有模板,查找带有 `suspend` 字段的 `koord-queue-suspend` 模板 +- 检查是否有任何工作流节点类型为 `Suspend` 且状态为 `Running` +- 或者检查 `spec.suspend` 是否设置为 `true` + +当 `QueueUnit` 出队时,Extension Server 将移除暂停条件,允许工作流继续执行。 + +**TFJob 示例:** + +对于 TFJob,设置 `spec.runPolicy.suspend: true` 启用队列管理: + +```yaml +apiVersion: kubeflow.org/v1 +kind: TFJob +metadata: + labels: + quota.scheduling.koordinator.sh/name: team-a-queue +spec: + runPolicy: + suspend: true +``` + +**PyTorchJob 示例:** + +对于 PyTorchJob,设置 `spec.runPolicy.suspend: true` 启用队列管理: + +```yaml +apiVersion: kubeflow.org/v1 +kind: PyTorchJob +metadata: + labels: + quota.scheduling.koordinator.sh/name: team-a-queue +spec: + runPolicy: + suspend: true +``` + +**Kubeflow Jobs 的工作原理:** + +当提交 TFJob 或 PyTorchJob 时: +1. Job extension 检测到带有 `spec.runPolicy.suspend: true` 的新作业 +2. 自动创建对应的 `QueueUnit` +3. 作业在队列中等待直到资源可用 +4. 
出队时,Extension Server 设置 `spec.runPolicy.suspend: false`,允许作业创建 Pod 并开始训练 ## 使用 Queue diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/run-pytorchjob-in-koordinator.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/run-pytorchjob-in-koordinator.md new file mode 100644 index 0000000000..c0e9b5a56b --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/run-pytorchjob-in-koordinator.md @@ -0,0 +1,343 @@ +# 在 Koordinator 中运行 PyTorchJob + +本指南介绍如何在 Koordinator 中运行 PyTorchJob 工作负载,并集成队列管理和资源调度能力。 + +## 概述 + +Koordinator 通过 Koord-Queue 集成提供对 PyTorchJob 的原生支持,由此可以实现: + +- **作业级排队**:将整个 PyTorchJob 工作负载作为单元管理,而非单个 Pod +- **深度 ElasticQuota 集成**:利用 Koordinator 的资源配额系统实现公平共享和弹性分配 +- **预调度**:在作业创建 Pod 之前进行排队,减少调度器压力 +- **多租户隔离**:支持多个团队/项目的资源隔离 +- **基于优先级的调度**:配置作业优先级以实现公平的资源分配 + +## 前置条件 + +在 Koordinator 中运行 PyTorchJob 之前,请确保您具备: + +- Kubernetes 集群 >= 1.22 +- 已安装 Koordinator >= 1.5 +- 已安装并配置 Koord-Queue +- 已安装 PyTorchJob CRD(通常通过 [Training Operator](https://github.com/kubeflow/training-operator) 安装) + +## 安装 + +### 1. 安装 Koord-Queue + +如果尚未安装,使用 Helm 部署 Koord-Queue: + +```bash +helm repo add koordinator-sh https://koordinator-sh.github.io/charts/ +helm install koord-queue koordinator-sh/koord-queue --version 1.8.0 \ + --namespace koord-queue \ + --create-namespace +``` + +在 Helm values 中启用 PyTorchJob 扩展: + +```yaml +# values.yaml +extension: + pytorch: + enable: true +``` + +使用自定义 values 安装(如果 release 已存在,`helm upgrade --install` 同样适用): + +```bash +helm upgrade --install koord-queue koordinator-sh/koord-queue --version 1.8.0 \ + --namespace koord-queue \ + --create-namespace \ + -f values.yaml +``` + +### 2. 验证安装 + +```bash +# 检查 Deployments +kubectl get deployment -n koord-queue + +# 验证 CRDs +kubectl get crd | grep -E "(queue|pytorchjob)" +``` + +## 配置 + +### 1. 
创建 ElasticQuota + +创建 ElasticQuota 来为 PyTorchJob 队列定义资源边界: + +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: pytorch-team-a + labels: + koord-queue/queue-policy: Priority # Priority、Block 或 Intelligent +spec: + max: + cpu: "100" + memory: 200Gi + nvidia.com/gpu: "8" + min: + cpu: "20" + memory: 40Gi + nvidia.com/gpu: "2" +``` + +应用配置: + +```bash +kubectl apply -f elastic-quota.yaml +``` + +### 2. 创建队列(可选) + +对于高级队列配置,创建 Queue CR: + +```yaml +apiVersion: scheduling.x-k8s.io/v1alpha1 +kind: Queue +metadata: + name: pytorch-training-queue + namespace: koord-queue +spec: + queuePolicy: Priority + priority: 100 + # admissionChecks: [] # 可选:按需添加准入检查 +``` + +应用队列: + +```bash +kubectl apply -f queue.yaml +``` + +## 运行 PyTorchJob + +### 基本 PyTorchJob 示例 + +创建一个简单的分布式 PyTorchJob: + +```yaml +apiVersion: kubeflow.org/v1 +kind: PyTorchJob +metadata: + name: pytorch-training-job + namespace: default + annotations: + # 可选:指定使用哪个队列(默认匹配 ElasticQuota 名称的队列) + scheduling.x-k8s.io/queue: pytorch-team-a + # 可选:设置队列中作业的优先级 + scheduling.x-k8s.io/priority: "10" +spec: + pytorchReplicaSpecs: + Master: + replicas: 1 + restartPolicy: OnFailure + template: + spec: + containers: + - name: pytorch + image: pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime + command: + - "python" + - "-m" + - "torch.distributed.launch" + - "--nproc_per_node=1" + - "--nnodes=2" + - "--node_rank=$(RANK)" + - "--master_addr=$(MASTER_ADDR)" + - "--master_port=$(MASTER_PORT)" + - "train.py" + resources: + requests: + cpu: "4" + memory: 8Gi + nvidia.com/gpu: "1" + limits: + cpu: "4" + memory: 8Gi + nvidia.com/gpu: "1" + env: + - name: RANK + valueFrom: + fieldRef: + fieldPath: metadata.annotations['kubeflow.org/rank'] + - name: MASTER_ADDR + valueFrom: + fieldRef: + fieldPath: metadata.annotations['kubeflow.org/master-address'] + - name: MASTER_PORT + value: "29500" + Worker: + replicas: 1 + restartPolicy: OnFailure + template: + spec: + containers: + - name: pytorch + 
image: pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime + command: + - "python" + - "-m" + - "torch.distributed.launch" + - "--nproc_per_node=1" + - "--nnodes=2" + - "--node_rank=$(RANK)" + - "--master_addr=$(MASTER_ADDR)" + - "--master_port=$(MASTER_PORT)" + - "train.py" + resources: + requests: + cpu: "4" + memory: 8Gi + nvidia.com/gpu: "1" + limits: + cpu: "4" + memory: 8Gi + nvidia.com/gpu: "1" + env: + - name: RANK + valueFrom: + fieldRef: + fieldPath: metadata.annotations['kubeflow.org/rank'] + - name: MASTER_ADDR + valueFrom: + fieldRef: + fieldPath: metadata.annotations['kubeflow.org/master-address'] + - name: MASTER_PORT + value: "29500" +``` + +应用 PyTorchJob: + +```bash +kubectl apply -f pytorchjob.yaml +``` + +### 工作原理 + +当您创建 PyTorchJob 时: + +1. **自动创建 QueueUnit**:Koord-Queue Controllers 自动检测到新的 PyTorchJob 并创建对应的 `QueueUnit` 资源 +2. **作业暂停**:当 `spec.runPolicy.suspend` 为 `true` 时,PyTorchJob 保持暂停状态;这是[队列管理](./queue-management.md)中说明的 Kubeflow 作业暂停字段 +3. **队列处理**:Queue Scheduler 根据队列策略、优先级和可用资源评估作业 +4. **资源分配**:如果根据 ElasticQuota 资源可用,QueueUnit 转换为 `Dequeued` 状态 +5. 
**作业执行**:Extension Server 设置 `spec.runPolicy.suspend: false`,允许 PyTorchJob 创建 Pod 并开始训练 + +## 高级配置 + +### 基于优先级的调度 + +通过在 PyTorchJob Pod 模板中设置 `priorityClassName` 来配置作业优先级: + +```yaml +apiVersion: kubeflow.org/v1 +kind: PyTorchJob +metadata: + name: high-priority-training + namespace: default +spec: + pytorchReplicaSpecs: + Master: + replicas: 1 + template: + spec: + priorityClassName: high-priority # 使用 PriorityClass + containers: + - name: pytorch + image: pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime + resources: + requests: + cpu: "8" + memory: 16Gi + nvidia.com/gpu: "2" +``` + +### 资源配额集成 + +PyTorchJob 将自动遵守 ElasticQuota 限制。监控配额使用情况: + +```bash +# 检查 ElasticQuota 状态 +kubectl describe elasticquota pytorch-team-a + +# 检查 QueueUnit 状态 +kubectl get queueunit -n default +kubectl describe queueunit +``` + +### 队列策略 + +Koord-Queue 支持三种队列策略: + +- **Priority**:优先级值更高的作业优先出队(默认) +- **Block**:严格资源阻塞 - 作业等待直到资源有保证 +- **Intelligent**:双队列机制,具有可配置的优先级阈值 + +通过 ElasticQuota 标签配置: + +```yaml +metadata: + labels: + koord-queue/queue-policy: Block # 或 Priority、Intelligent +``` + +## 监控和故障排查 + +### 检查作业状态 + +```bash +# 检查 PyTorchJob 状态 +kubectl get pytorchjob +kubectl describe pytorchjob + +# 检查 QueueUnit 状态 +kubectl get queueunit +kubectl describe queueunit + +# 检查 Pod 状态(作业名称取自基本示例) +kubectl get pods -l training.kubeflow.org/job-name=pytorch-training-job +``` + +### 常见问题 + +1. **作业卡在暂停状态**: + - 验证 ElasticQuota 是否有足够资源 + - 检查 QueueUnit 状态是否有准入检查失败 + - 检查队列策略设置 + +2. **资源分配失败**: + - 检查 ElasticQuota min/max 限制是否正确配置 + - 验证集群是否有足够的 GPU 资源 + - 检查节点容量和污点 + +3. **队列未处理作业**: + - 验证 koord-queue controllers 是否正在运行 + - 检查日志:`kubectl logs -n koord-queue deployment/koord-queue-controllers` + +## 最佳实践 + +1. **使用优先级类**:为不同类型的训练工作负载定义 PriorityClass +2. **设置实际的资源请求**:准确估算 CPU、内存和 GPU 需求 +3. **监控配额使用**:定期检查 ElasticQuota 使用情况以避免资源竞争 +4. **使用 Gang 调度**:对于分布式训练,确保所有副本一起调度 +5. 
**实施资源限制**:同时设置 requests 和 limits 以防止资源超卖 + +## 与其他 Koordinator 功能集成 + +Koordinator 中的 PyTorchJob 可以利用其他功能: + +- **GPU 共享**:在多个作业间共享 GPU 资源 +- **网络拓扑感知**:优化分布式训练的 Pod 放置 +- **负载感知调度**:在训练工作负载期间平衡集群负载 +- **抢占**:高优先级作业可以抢占低优先级作业 + +## 下一步 + +- 了解 [Koord-Queue](./queue-management.md) 进行高级队列管理 +- 探索 [ElasticQuota](../architecture/resource-model.md) 进行资源管理 +- 阅读 [Gang 调度](../designs/gang-scheduling.md) 了解分布式训练 +- 查看 [Koordinator 架构](../architecture/overview.md) 获得全面理解 diff --git a/sidebars.js b/sidebars.js index 026afa5887..540de89f1c 100644 --- a/sidebars.js +++ b/sidebars.js @@ -45,6 +45,7 @@ const sidebars = { 'Capacity Scheduling': [ 'user-manuals/capacity-scheduling', 'user-manuals/queue-management', + 'user-manuals/run-pytorchjob-in-koordinator', ], 'Task Scheduling': [ 'user-manuals/gang-scheduling',