diff --git a/docs/user-manuals/queue-management.md b/docs/user-manuals/queue-management.md
index 30d1999856..68041f3395 100644
--- a/docs/user-manuals/queue-management.md
+++ b/docs/user-manuals/queue-management.md
@@ -204,7 +204,96 @@ my-job-blocked                                   Job
 
 The `QueueUnit` stays in `Enqueued` phase because `team-a` has already reached its `max` quota. Once `my-job` completes and resources are released, `my-job-blocked` will be dequeued automatically.
 
-For other job types (TFJob, PyTorchJob, etc.), use the `scheduling.x-k8s.io/suspend: "true"` annotation instead of `spec.suspend`.
+### Job Suspension by Type
+
+Different job types use different fields for suspension:
+
+| Job Type | API Version | Suspension Field | Example | Status |
+|----------|-------------|------------------|---------|--------|
+| Kubernetes Job | `batch/v1` | `.spec.suspend` | `spec.suspend: true` | Supported |
+| TFJob | `kubeflow.org/v1` | `.spec.runPolicy.suspend` | `spec.runPolicy.suspend: true` | Supported |
+| PyTorchJob | `kubeflow.org/v1` | `.spec.runPolicy.suspend` | `spec.runPolicy.suspend: true` | Supported |
+| Argo Workflow | `argoproj.io/v1alpha1` | Add `koord-queue-suspend` template | See example below | Supported |
+| SparkApplication | `sparkoperator.k8s.io/v1beta2` | `.spec.suspend` |  | WIP |
+| XGBoostJob | `kubeflow.org/v1` | `.spec.runPolicy.suspend` |  | Not Supported Yet |
+| PaddleJob | `kubeflow.org/v1` | `.spec.runPolicy.suspend` |  | Not Supported Yet |
+
+**Argo Workflow Example:**
+
+For Argo Workflow, Koord-Queue uses a special suspend template named `koord-queue-suspend`. The workflow must meet the following conditions to be managed by the queue:
+
+1. Contains a template named `koord-queue-suspend` with a `suspend` field
+2. The workflow has a suspend node in the `Running` phase, or `spec.suspend` is set to `true`
+
+```yaml
+apiVersion: argoproj.io/v1alpha1
+kind: Workflow
+metadata:
+  name: my-workflow
+  annotations:
+    koord-queue/min-resources: |
+      cpu: 5
+      memory: 5Gi
+spec:
+  suspend: true
+  templates:
+    # Add this suspend template for queue management
+    - name: koord-queue-suspend
+      suspend: {}
+    # Your actual workflow templates
+    - name: main
+      container:
+        image: python:3.9
+        command: [python, -c, "print('Hello from workflow')"]
+  entrypoint: main
+```
+
+**How it works:**
+
+When a Workflow is submitted, Koord-Queue decides whether to manage it as follows:
+- It scans all templates for one named `koord-queue-suspend` that has a `suspend` field
+- It then checks whether any workflow node is of type `Suspend` and in the `Running` phase,
+  or whether `spec.suspend` is set to `true`
+
+When the `QueueUnit` is dequeued, the Extension Server will remove the suspend condition, allowing the workflow to proceed.
+
+**TFJob Example:**
+
+For TFJob, set `spec.runPolicy.suspend: true` to enable queue management:
+
+```yaml
+apiVersion: kubeflow.org/v1
+kind: TFJob
+metadata:
+  labels:
+    quota.scheduling.koordinator.sh/name: team-a-queue
+spec:
+  runPolicy:
+    suspend: true
+```
+
+**PyTorchJob Example:**
+
+For PyTorchJob, set `spec.runPolicy.suspend: true` to enable queue management:
+
+```yaml
+apiVersion: kubeflow.org/v1
+kind: PyTorchJob
+metadata:
+  labels:
+    quota.scheduling.koordinator.sh/name: team-a-queue
+spec:
+  runPolicy:
+    suspend: true
+```
+
+**How it works for Kubeflow Jobs:**
+
+When a TFJob or PyTorchJob is submitted:
+1. The job extension detects the new job with `spec.runPolicy.suspend: true`
+2. A corresponding `QueueUnit` is automatically created
+3. The job waits in the queue until resources are available
+4. When dequeued, the Extension Server sets `spec.runPolicy.suspend: false`, allowing the job to create pods and start training
 
 ## Use Queue
 
diff --git a/docs/user-manuals/run-pytorchjob-in-koordinator.md b/docs/user-manuals/run-pytorchjob-in-koordinator.md
new file mode 100644
index 0000000000..c675da4f78
--- /dev/null
+++ b/docs/user-manuals/run-pytorchjob-in-koordinator.md
@@ -0,0 +1,343 @@
+# Run PyTorchJob in Koordinator
+
+This guide explains how to run PyTorchJob workloads in Koordinator with integrated queue management and resource scheduling capabilities.
+
+## Overview
+
+Koordinator provides native support for PyTorchJob through its Koord-Queue integration. This enables:
+
+- **Job-level queuing**: Manage entire PyTorchJob workloads as units rather than individual pods
+- **Deep ElasticQuota integration**: Leverage Koordinator's resource quota system for fair sharing and elastic allocation
+- **Pre-scheduling**: Queue jobs before they create pods to reduce scheduler pressure
+- **Multi-tenant isolation**: Support for multiple teams/projects with resource isolation
+- **Priority-based scheduling**: Configure job priorities for fair resource allocation
+
+## Prerequisites
+
+Before running PyTorchJob in Koordinator, ensure you have:
+
+- Kubernetes cluster >= 1.22
+- Koordinator >= 1.5 installed
+- Koord-Queue installed and configured
+- PyTorchJob V1 CRDs installed (typically via [Training Operator V1](https://www.kubeflow.org/docs/components/trainer/legacy-v1/installation/))
+
+## Installation
+
+### 1. Install Koord-Queue
+
+If not already installed, deploy Koord-Queue using Helm:
+
+```bash
+helm repo add koordinator-sh https://koordinator-sh.github.io/charts/
+helm install koord-queue koordinator-sh/koord-queue --version 1.8.0 \
+  --namespace koord-queue \
+  --create-namespace
+```
+
+Enable PyTorchJob extension in the Helm values:
+
+```yaml
+# values.yaml
+extension:
+  pytorch:
+    enable: true
+```
+
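+Apply the custom values. `helm upgrade --install` upgrades the release created above, or installs it if it does not exist yet (a second `helm install` with the same release name would fail):
+
+```bash
+helm upgrade --install koord-queue koordinator-sh/koord-queue --version 1.8.0 \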
+  --namespace koord-queue \
+  --create-namespace \
+  -f values.yaml
+```
+
+### 2. Verify Installation
+
+```bash
+# Check deployments
+kubectl get deployment -n koord-queue
+
+# Verify CRDs
+kubectl get crd | grep -E "(queue|pytorchjob)"
+```
+
+## Configuration
+
+### 1. Create an ElasticQuota
+
+Create an ElasticQuota to define resource boundaries for your PyTorchJob queue:
+
+```yaml
+apiVersion: scheduling.sigs.k8s.io/v1alpha1
+kind: ElasticQuota
+metadata:
+  name: pytorch-team-a
+  labels:
+    koord-queue/queue-policy: Priority  # Priority, Block, or Intelligent
+spec:
+  max:
+    cpu: "100"
+    memory: 200Gi
+    nvidia.com/gpu: "8"
+  min:
+    cpu: "20"
+    memory: 40Gi
+    nvidia.com/gpu: "2"
+```
+
+Apply the configuration:
+
+```bash
+kubectl apply -f elastic-quota.yaml
+```
+
+### 2. Create a Queue (Optional)
+
+For advanced queue configuration, create a Queue CR:
+
+```yaml
+apiVersion: scheduling.x-k8s.io/v1alpha1
+kind: Queue
+metadata:
+  name: pytorch-training-queue
+  namespace: koord-queue
+spec:
+  queuePolicy: Priority
+  priority: 100
+  # admissionChecks: []  # Optional: add admission checks if needed
+```
+
+Apply the queue:
+
+```bash
+kubectl apply -f queue.yaml
+```
+
+## Running PyTorchJob
+
+### Basic PyTorchJob Example
+
+Create a simple distributed PyTorchJob:
+
+```yaml
+apiVersion: kubeflow.org/v1
+kind: PyTorchJob
+metadata:
+  name: pytorch-training-job
+  namespace: default
+  annotations:
+    # Optional: specify which queue to use (defaults to the queue matching the ElasticQuota name)
+    scheduling.x-k8s.io/queue: pytorch-team-a
+    # Optional: set job priority within the queue
+    scheduling.x-k8s.io/priority: "10"
+spec:
+  pytorchReplicaSpecs:
+    Master:
+      replicas: 1
+      restartPolicy: OnFailure
+      template:
+        spec:
+          containers:
+            - name: pytorch
+              image: pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime
+              command:
+                - "python"
+                - "-m"
+                - "torch.distributed.launch"
+                - "--nproc_per_node=1"
+                - "--nnodes=2"
+                - "--node_rank=$(RANK)"
+                - "--master_addr=$(MASTER_ADDR)"
+                - "--master_port=$(MASTER_PORT)"
+                - "train.py"
+              resources:
+                requests:
+                  cpu: "4"
+                  memory: 8Gi
+                  nvidia.com/gpu: "1"
+                limits:
+                  cpu: "4"
+                  memory: 8Gi
+                  nvidia.com/gpu: "1"
+              env:
+                - name: RANK
+                  valueFrom:
+                    fieldRef:
+                      fieldPath: metadata.annotations['kubeflow.org/rank']
+                - name: MASTER_ADDR
+                  valueFrom:
+                    fieldRef:
+                      fieldPath: metadata.annotations['kubeflow.org/master-address']
+                - name: MASTER_PORT
+                  value: "29500"
+    Worker:
+      replicas: 1
+      restartPolicy: OnFailure
+      template:
+        spec:
+          containers:
+            - name: pytorch
+              image: pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime
+              command:
+                - "python"
+                - "-m"
+                - "torch.distributed.launch"
+                - "--nproc_per_node=1"
+                - "--nnodes=2"
+                - "--node_rank=$(RANK)"
+                - "--master_addr=$(MASTER_ADDR)"
+                - "--master_port=$(MASTER_PORT)"
+                - "train.py"
+              resources:
+                requests:
+                  cpu: "4"
+                  memory: 8Gi
+                  nvidia.com/gpu: "1"
+                limits:
+                  cpu: "4"
+                  memory: 8Gi
+                  nvidia.com/gpu: "1"
+              env:
+                - name: RANK
+                  valueFrom:
+                    fieldRef:
+                      fieldPath: metadata.annotations['kubeflow.org/rank']
+                - name: MASTER_ADDR
+                  valueFrom:
+                    fieldRef:
+                      fieldPath: metadata.annotations['kubeflow.org/master-address']
+                - name: MASTER_PORT
+                  value: "29500"
+```
+
+Apply the PyTorchJob:
+
+```bash
+kubectl apply -f pytorchjob.yaml
+```
+
+### How It Works
+
+When you create a PyTorchJob:
+
+1. **Automatic QueueUnit Creation**: Koord-Queue Controllers automatically detect the new PyTorchJob and create a corresponding `QueueUnit` resource
+2. **Job Suspension**: The PyTorchJob is held in a suspended state through `spec.runPolicy.suspend: true`, so it creates no pods while it waits in the queue
+3. **Queue Processing**: The Queue Scheduler evaluates the job based on queue policy, priority, and available resources
+4. **Resource Allocation**: If resources are available according to the ElasticQuota, the QueueUnit transitions to the `Dequeued` state
+5. **Job Execution**: The Extension Server sets `spec.runPolicy.suspend` to `false`, allowing the PyTorchJob to create pods and start training
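+
+As a minimal sketch of step 2, and assuming the job is submitted with the suspension field that the [queue management guide](./queue-management.md) documents for Kubeflow jobs, the fragment below shows where `runPolicy.suspend` sits in the manifest. The replica specs are omitted here and are assumed to match the basic example above.
+
+```yaml
+apiVersion: kubeflow.org/v1
+kind: PyTorchJob
+metadata:
+  name: pytorch-training-job
+  namespace: default
+spec:
+  runPolicy:
+    # Keeps the job from creating pods while its QueueUnit is still queued.
+    suspend: true
+  # pytorchReplicaSpecs: same Master and Worker specs as in the basic example
+```
+
+While `suspend` is `true`, `kubectl get pods` should list no pods for the job; pods are expected to appear only after the QueueUnit has been dequeued.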
+
+## Advanced Configuration
+
+### Priority-Based Scheduling
+
+Configure job priority by setting `priorityClassName` in the PyTorchJob pod template:
+
+```yaml
+apiVersion: kubeflow.org/v1
+kind: PyTorchJob
+metadata:
+  name: high-priority-training
+  namespace: default
+spec:
+  pytorchReplicaSpecs:
+    Master:
+      replicas: 1
+      template:
+        spec:
+          priorityClassName: high-priority  # Use a PriorityClass
+          containers:
+            - name: pytorch
+              image: pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime
+              resources:
+                limits:  # requests default to limits; extended resources such as GPUs must be set under limits
+                  cpu: "8"
+                  memory: 16Gi
+                  nvidia.com/gpu: "2"
+```
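+
+The `priorityClassName` above must refer to a PriorityClass that already exists in the cluster. A minimal sketch of such an object follows; the name matches the example, while the `value` and `description` are illustrative assumptions rather than Koordinator defaults.
+
+```yaml
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+  name: high-priority
+# Assumed value; choose one that fits your cluster's priority scheme.
+value: 100000
+globalDefault: false
+description: "Illustrative priority class for urgent training jobs."
+```
+
+Create the PriorityClass before submitting the job, because pods that reference a missing class are rejected at creation time.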
+
+### Resource Quota Integration
+
+The PyTorchJob will automatically respect the ElasticQuota limits. Monitor quota usage:
+
+```bash
+# Check ElasticQuota status
+kubectl describe elasticquota pytorch-team-a
+
+# Check QueueUnit status
+kubectl get queueunit -n default
+kubectl describe queueunit <queueunit-name>
+```
+
+### Queue Policies
+
+Koord-Queue supports three queue policies:
+
+- **Priority**: Jobs with higher priority values are dequeued first (default)
+- **Block**: Strict resource blocking - jobs wait until resources are guaranteed
+- **Intelligent**: Dual-queue mechanism with configurable priority threshold
+
+Configure via ElasticQuota labels:
+
+```yaml
+metadata:
+  labels:
+    koord-queue/queue-policy: Block  # or Priority, Intelligent
+```
+
+## Monitoring and Troubleshooting
+
+### Check Job Status
+
+```bash
+# Check PyTorchJob status
+kubectl get pytorchjob
+kubectl describe pytorchjob <job-name>
+
+# Check QueueUnit status
+kubectl get queueunit
+kubectl describe queueunit <queueunit-name>
+
+# Check pod status
+kubectl get pods -l training.kubeflow.org/job-name=<job-name>
+```
+
+### Common Issues
+
+1. **Job stuck in suspended state**:
+   - Verify ElasticQuota has sufficient resources
+   - Check QueueUnit status for admission check failures
+   - Review queue policy settings
+
+2. **Resource allocation failures**:
+   - Check if ElasticQuota min/max limits are properly configured
+   - Verify cluster has sufficient GPU resources
+   - Review node capacity and taints
+
+3. **Queue not processing jobs**:
+   - Verify koord-queue controllers are running
+   - Check logs: `kubectl logs -n koord-queue deployment/koord-queue-controllers`
+
+## Best Practices
+
+1. **Use Priority Classes**: Define PriorityClasses for different training workload types
+2. **Set Realistic Resource Requests**: Accurately estimate CPU, memory, and GPU requirements
+3. **Monitor Quota Usage**: Regularly check ElasticQuota usage to avoid resource contention
+4. **Use Gang Scheduling**: For distributed training, ensure all replicas are scheduled together
+5. **Implement Resource Limits**: Set both requests and limits to prevent resource overcommitment
+
+## Integration with Other Koordinator Features
+
+PyTorchJob in Koordinator can leverage additional features:
+
+- **GPU Share**: Share GPU resources across multiple jobs
+- **Network Topology Awareness**: Optimize pod placement for distributed training
+- **Load-Aware Scheduling**: Balance cluster load during training workloads
+- **Preemption**: Higher priority jobs can preempt lower priority ones
+
+## Next Steps
+
+- Learn about [Koord-Queue](./queue-management.md) for advanced queue management
+- Explore [ElasticQuota](../architecture/resource-model.md) for resource management
+- Read about [Gang Scheduling](../designs/gang-scheduling.md) for distributed training
+- Check [Koordinator Architecture](../architecture/overview.md) for comprehensive understanding
\ No newline at end of file
diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/queue-management.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/queue-management.md
index a60ddd9506..6ab7e1030f 100644
--- a/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/queue-management.md
+++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/queue-management.md
@@ -204,7 +204,96 @@ my-job-blocked                                   Job
 
 `QueueUnit` 保持 `Enqueued` 状态，因为 `team-a` 已达到 `max` 配额上限。当 `my-job` 完成并释放资源后，`my-job-blocked` 将自动出队执行。
 
-对于其他作业类型（TFJob、PyTorchJob 等），请使用 `scheduling.x-k8s.io/suspend: "true"` 注解代替 `spec.suspend`。
+### 不同作业类型的暂停方式
+
+不同类型的作业使用不同的字段进行暂停：
+
+| 作业类型 | API 版本 | 暂停字段 | 示例 | 状态 |
+|----------|---------|---------|------|------|
+| Kubernetes Job | `batch/v1` | `.spec.suspend` | `spec.suspend: true` | 已支持 |
+| TFJob | `kubeflow.org/v1` | `.spec.runPolicy.suspend` | `spec.runPolicy.suspend: true` | 已支持 |
+| PyTorchJob | `kubeflow.org/v1` | `.spec.runPolicy.suspend` | `spec.runPolicy.suspend: true` | 已支持 |
+| Argo Workflow | `argoproj.io/v1alpha1` | 添加 `koord-queue-suspend` 模板 | 见下方示例 | 已支持 |
+| SparkApplication | `sparkoperator.k8s.io/v1beta2` | `.spec.suspend` |  | 开发中 |
+| XGBoostJob | `kubeflow.org/v1` | `.spec.runPolicy.suspend` |  | 尚未支持 |
+| PaddleJob | `kubeflow.org/v1` | `.spec.runPolicy.suspend` |  | 尚未支持 |
+
+**Argo Workflow 示例：**
+
+对于 Argo Workflow，Koord-Queue 使用名为 `koord-queue-suspend` 的特殊暂停模板。工作流必须满足以下条件才能被队列管理：
+
+1. 包含名为 `koord-queue-suspend` 的模板，且有 `suspend` 字段
+2. 工作流有处于 Running 状态的 suspend 节点，或者 `spec.suspend` 设置为 true
+
+```yaml
+apiVersion: argoproj.io/v1alpha1
+kind: Workflow
+metadata:
+  name: my-workflow
+  annotations:
+    koord-queue/min-resources: |
+      cpu: 5
+      memory: 5Gi
+spec:
+  suspend: true
+  templates:
+    # 添加此暂停模板用于队列管理
+    - name: koord-queue-suspend
+      suspend: {}
+    # 你的实际工作流模板
+    - name: main
+      container:
+        image: python:3.9
+        command: [python, -c, "print('Hello from workflow')"]
+  entrypoint: main
+```
+
+**工作原理：**
+
+当提交工作流时，Koord-Queue 通过以下方式检查是否应该管理它：
+- 扫描所有模板，查找带有 `suspend` 字段的 `koord-queue-suspend` 模板
+- 检查是否有任何工作流节点类型为 `Suspend` 且状态为 `Running`
+- 或者检查 `spec.suspend` 是否设置为 `true`
+
+当 `QueueUnit` 出队时，Extension Server 将移除暂停条件，允许工作流继续执行。
+
+**TFJob 示例：**
+
+对于 TFJob，设置 `spec.runPolicy.suspend: true` 启用队列管理：
+
+```yaml
+apiVersion: kubeflow.org/v1
+kind: TFJob
+metadata:
+  labels:
+    quota.scheduling.koordinator.sh/name: team-a-queue
+spec:
+  runPolicy:
+    suspend: true
+```
+
+**PyTorchJob 示例：**
+
+对于 PyTorchJob，设置 `spec.runPolicy.suspend: true` 启用队列管理：
+
+```yaml
+apiVersion: kubeflow.org/v1
+kind: PyTorchJob
+metadata:
+  labels:
+    quota.scheduling.koordinator.sh/name: team-a-queue
+spec:
+  runPolicy:
+    suspend: true
+```
+
+**Kubeflow Jobs 的工作原理：**
+
+当提交 TFJob 或 PyTorchJob 时：
+1. Job extension 检测到带有 `spec.runPolicy.suspend: true` 的新作业
+2. 自动创建对应的 `QueueUnit`
+3. 作业在队列中等待直到资源可用
+4. 出队时，Extension Server 设置 `spec.runPolicy.suspend: false`，允许作业创建 Pod 并开始训练
 
 ## 使用 Queue
 
diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/run-pytorchjob-in-koordinator.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/run-pytorchjob-in-koordinator.md
new file mode 100644
index 0000000000..c0e9b5a56b
--- /dev/null
+++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/run-pytorchjob-in-koordinator.md
@@ -0,0 +1,343 @@
+# 在 Koordinator 中运行 PyTorchJob
+
+本指南介绍如何在 Koordinator 中运行 PyTorchJob 工作负载，并集成队列管理和资源调度能力。
+
+## 概述
+
+Koordinator 通过 Koord-Queue 集成提供对 PyTorchJob 的原生支持。这使得：
+
+- **作业级排队**：将整个 PyTorchJob 工作负载作为单元管理，而非单个 Pod
+- **深度 ElasticQuota 集成**：利用 Koordinator 的资源配额系统实现公平共享和弹性分配
+- **预调度**：在作业创建 Pod 之前进行排队，减少调度器压力
+- **多租户隔离**：支持多个团队/项目的资源隔离
+- **基于优先级的调度**：配置作业优先级以实现公平的资源分配
+
+## 前置条件
+
+在 Koordinator 中运行 PyTorchJob 之前，请确保您具备：
+
+- Kubernetes 集群 >= 1.22
+- 已安装 Koordinator >= 1.5
+- 已安装并配置 Koord-Queue
+- 已安装 PyTorchJob CRD（通常通过 [Training Operator](https://github.com/kubeflow/training-operator) 安装）
+
+## 安装
+
+### 1. 安装 Koord-Queue
+
+如果尚未安装，使用 Helm 部署 Koord-Queue：
+
+```bash
+helm repo add koordinator-sh https://koordinator-sh.github.io/charts/
+helm install koord-queue koordinator-sh/koord-queue --version 1.8.0 \
+  --namespace koord-queue \
+  --create-namespace
+```
+
+在 Helm values 中启用 PyTorchJob 扩展：
+
+```yaml
+# values.yaml
+extension:
+  pytorch:
+    enable: true
+```
+
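+应用自定义 values。`helm upgrade --install` 会升级上面创建的 release；若该 release 尚不存在则进行安装（使用相同的 release 名称再次执行 `helm install` 会失败）：
+
+```bash
+helm upgrade --install koord-queue koordinator-sh/koord-queue --version 1.8.0 \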
+  --namespace koord-queue \
+  --create-namespace \
+  -f values.yaml
+```
+
+### 2. 验证安装
+
+```bash
+# 检查 Deployments
+kubectl get deployment -n koord-queue
+
+# 验证 CRDs
+kubectl get crd | grep -E "(queue|pytorchjob)"
+```
+
+## 配置
+
+### 1. 创建 ElasticQuota
+
+创建 ElasticQuota 来为 PyTorchJob 队列定义资源边界：
+
+```yaml
+apiVersion: scheduling.sigs.k8s.io/v1alpha1
+kind: ElasticQuota
+metadata:
+  name: pytorch-team-a
+  labels:
+    koord-queue/queue-policy: Priority  # Priority、Block 或 Intelligent
+spec:
+  max:
+    cpu: "100"
+    memory: 200Gi
+    nvidia.com/gpu: "8"
+  min:
+    cpu: "20"
+    memory: 40Gi
+    nvidia.com/gpu: "2"
+```
+
+应用配置：
+
+```bash
+kubectl apply -f elastic-quota.yaml
+```
+
+### 2. 创建队列（可选）
+
+对于高级队列配置，创建 Queue CR：
+
+```yaml
+apiVersion: scheduling.x-k8s.io/v1alpha1
+kind: Queue
+metadata:
+  name: pytorch-training-queue
+  namespace: koord-queue
+spec:
+  queuePolicy: Priority
+  priority: 100
+  # admissionChecks: []  # 可选：如需添加准入检查
+```
+
+应用队列：
+
+```bash
+kubectl apply -f queue.yaml
+```
+
+## 运行 PyTorchJob
+
+### 基本 PyTorchJob 示例
+
+创建一个简单的分布式 PyTorchJob：
+
+```yaml
+apiVersion: kubeflow.org/v1
+kind: PyTorchJob
+metadata:
+  name: pytorch-training-job
+  namespace: default
+  annotations:
+    # 可选：指定使用哪个队列（默认匹配 ElasticQuota 名称的队列）
+    scheduling.x-k8s.io/queue: pytorch-team-a
+    # 可选：设置队列中作业的优先级
+    scheduling.x-k8s.io/priority: "10"
+spec:
+  pytorchReplicaSpecs:
+    Master:
+      replicas: 1
+      restartPolicy: OnFailure
+      template:
+        spec:
+          containers:
+            - name: pytorch
+              image: pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime
+              command:
+                - "python"
+                - "-m"
+                - "torch.distributed.launch"
+                - "--nproc_per_node=1"
+                - "--nnodes=2"
+                - "--node_rank=$(RANK)"
+                - "--master_addr=$(MASTER_ADDR)"
+                - "--master_port=$(MASTER_PORT)"
+                - "train.py"
+              resources:
+                requests:
+                  cpu: "4"
+                  memory: 8Gi
+                  nvidia.com/gpu: "1"
+                limits:
+                  cpu: "4"
+                  memory: 8Gi
+                  nvidia.com/gpu: "1"
+              env:
+                - name: RANK
+                  valueFrom:
+                    fieldRef:
+                      fieldPath: metadata.annotations['kubeflow.org/rank']
+                - name: MASTER_ADDR
+                  valueFrom:
+                    fieldRef:
+                      fieldPath: metadata.annotations['kubeflow.org/master-address']
+                - name: MASTER_PORT
+                  value: "29500"
+    Worker:
+      replicas: 1
+      restartPolicy: OnFailure
+      template:
+        spec:
+          containers:
+            - name: pytorch
+              image: pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime
+              command:
+                - "python"
+                - "-m"
+                - "torch.distributed.launch"
+                - "--nproc_per_node=1"
+                - "--nnodes=2"
+                - "--node_rank=$(RANK)"
+                - "--master_addr=$(MASTER_ADDR)"
+                - "--master_port=$(MASTER_PORT)"
+                - "train.py"
+              resources:
+                requests:
+                  cpu: "4"
+                  memory: 8Gi
+                  nvidia.com/gpu: "1"
+                limits:
+                  cpu: "4"
+                  memory: 8Gi
+                  nvidia.com/gpu: "1"
+              env:
+                - name: RANK
+                  valueFrom:
+                    fieldRef:
+                      fieldPath: metadata.annotations['kubeflow.org/rank']
+                - name: MASTER_ADDR
+                  valueFrom:
+                    fieldRef:
+                      fieldPath: metadata.annotations['kubeflow.org/master-address']
+                - name: MASTER_PORT
+                  value: "29500"
+```
+
+应用 PyTorchJob：
+
+```bash
+kubectl apply -f pytorchjob.yaml
+```
+
+### 工作原理
+
+当您创建 PyTorchJob 时：
+
+1. **自动创建 QueueUnit**：Koord-Queue Controllers 自动检测到新的 PyTorchJob 并创建对应的 `QueueUnit` 资源
+2. **作业暂停**：PyTorchJob 通过 `spec.runPolicy.suspend: true` 保持暂停状态，排队期间不会创建 Pod
+3. **队列处理**：Queue Scheduler 根据队列策略、优先级和可用资源评估作业
+4. **资源分配**：如果根据 ElasticQuota 资源可用，QueueUnit 转换为 `Dequeued` 状态
+5. **作业执行**：Extension Server 将 `spec.runPolicy.suspend` 设置为 `false`，允许 PyTorchJob 创建 Pod 并开始训练
+
+## 高级配置
+
+### 基于优先级的调度
+
+通过在 PyTorchJob Pod 模板中设置 `priorityClassName` 来配置作业优先级：
+
+```yaml
+apiVersion: kubeflow.org/v1
+kind: PyTorchJob
+metadata:
+  name: high-priority-training
+  namespace: default
+spec:
+  pytorchReplicaSpecs:
+    Master:
+      replicas: 1
+      template:
+        spec:
+          priorityClassName: high-priority  # 使用 PriorityClass
+          containers:
+            - name: pytorch
+              image: pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime
+              resources:
+                limits:  # requests 默认等于 limits；GPU 等扩展资源必须在 limits 中设置
+                  cpu: "8"
+                  memory: 16Gi
+                  nvidia.com/gpu: "2"
+```
+
+### 资源配额集成
+
+PyTorchJob 将自动遵守 ElasticQuota 限制。监控配额使用情况：
+
+```bash
+# 检查 ElasticQuota 状态
+kubectl describe elasticquota pytorch-team-a
+
+# 检查 QueueUnit 状态
+kubectl get queueunit -n default
+kubectl describe queueunit <queueunit-name>
+```
+
+### 队列策略
+
+Koord-Queue 支持三种队列策略：
+
+- **Priority**：优先级值更高的作业优先出队（默认）
+- **Block**：严格资源阻塞 - 作业等待直到资源有保证
+- **Intelligent**：双队列机制，具有可配置的优先级阈值
+
+通过 ElasticQuota 标签配置：
+
+```yaml
+metadata:
+  labels:
+    koord-queue/queue-policy: Block  # 或 Priority、Intelligent
+```
+
+## 监控和故障排查
+
+### 检查作业状态
+
+```bash
+# 检查 PyTorchJob 状态
+kubectl get pytorchjob
+kubectl describe pytorchjob <job-name>
+
+# 检查 QueueUnit 状态
+kubectl get queueunit
+kubectl describe queueunit <queueunit-name>
+
+# 检查 Pod 状态
+kubectl get pods -l training.kubeflow.org/job-name=<job-name>
+```
+
+### 常见问题
+
+1. **作业卡在暂停状态**：
+   - 验证 ElasticQuota 是否有足够资源
+   - 检查 QueueUnit 状态是否有准入检查失败
+   - 检查队列策略设置
+
+2. **资源分配失败**：
+   - 检查 ElasticQuota min/max 限制是否正确配置
+   - 验证集群是否有足够的 GPU 资源
+   - 检查节点容量和污点
+
+3. **队列未处理作业**：
+   - 验证 koord-queue controllers 是否正在运行
+   - 检查日志：`kubectl logs -n koord-queue deployment/koord-queue-controllers`
+
+## 最佳实践
+
+1. **使用优先级类**：为不同类型的训练工作负载定义 PriorityClass
+2. **设置实际的资源请求**：准确估算 CPU、内存和 GPU 需求
+3. **监控配额使用**：定期检查 ElasticQuota 使用情况以避免资源竞争
+4. **使用 Gang 调度**：对于分布式训练，确保所有副本一起调度
+5. **实施资源限制**：同时设置 requests 和 limits 以防止资源超卖
+
+## 与其他 Koordinator 功能集成
+
+Koordinator 中的 PyTorchJob 可以利用其他功能：
+
+- **GPU 共享**：在多个作业间共享 GPU 资源
+- **网络拓扑感知**：优化分布式训练的 Pod 放置
+- **负载感知调度**：在训练工作负载期间平衡集群负载
+- **抢占**：高优先级作业可以抢占低优先级作业
+
+## 下一步
+
+- 了解 [Koord-Queue](./queue-management.md) 进行高级队列管理
+- 探索 [ElasticQuota](../architecture/resource-model.md) 进行资源管理
+- 阅读 [Gang 调度](../designs/gang-scheduling.md) 了解分布式训练
+- 查看 [Koordinator 架构](../architecture/overview.md) 获得全面理解
diff --git a/sidebars.js b/sidebars.js
index 026afa5887..540de89f1c 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -45,6 +45,7 @@ const sidebars = {
           'Capacity Scheduling': [
             'user-manuals/capacity-scheduling',
             'user-manuals/queue-management',
+            'user-manuals/run-pytorchjob-in-koordinator',
           ],
           'Task Scheduling': [
             'user-manuals/gang-scheduling',