2 changes: 1 addition & 1 deletion docs/contributor/contributing.md
@@ -6,7 +6,7 @@ Welcome to HAMi!

## Code of Conduct

-Please make sure to read and observe our [Code of Conduct](https://github.com/cncf/foundation/blob/main/code-of-conduct.md)
+Please make sure to read and observe the [Code of Conduct](https://github.com/cncf/foundation/blob/main/code-of-conduct.md)

## Community Expectations

2 changes: 1 addition & 1 deletion docs/contributor/github-workflow.md
@@ -110,7 +110,7 @@ in a few cycles.

## Push

-When ready to review (or just to establish an offsite backup of your work),
+When ready to review (or to establish an offsite backup of your work),
push your branch to your fork on `github.com`:

```sh
2 changes: 1 addition & 1 deletion docs/contributor/governance.md
@@ -19,7 +19,7 @@ The HAMi and its leadership embrace the following values:
priority over shipping code or sponsors' organizational goals. Each
contributor participates in the project as an individual.

-* Inclusivity: We innovate through different perspectives and skill sets, which
+* Inclusivity: Innovation comes from different perspectives and skill sets, and this
can only be accomplished in a welcoming and respectful environment.

* Participation: Responsibilities within the project are earned through
2 changes: 1 addition & 1 deletion docs/contributor/ladder.md
@@ -48,7 +48,7 @@ Description: A Contributor contributes directly to the project and adds value to

A very special thanks to the [long list of people](https://github.com/Project-HAMi/HAMi/blob/master/AUTHORS.md) who have contributed to and helped maintain the project. The project wouldn't be where it is today without your contributions. Thank you!

-As long as you contribute to HAMi, your name will be added to the [AUTHORS.md file](https://github.com/Project-HAMi/HAMi/blob/master/AUTHORS.md). If you don't find your name, please contact us to add it.
+As long as you contribute to HAMi, your name will be added to the [AUTHORS.md file](https://github.com/Project-HAMi/HAMi/blob/master/AUTHORS.md). If you don't find your name, please open an issue to have it added.

### Organization Member

Original file line number Diff line number Diff line change
@@ -69,7 +69,7 @@ HAMi 社区不仅受邀进行了技术分享,也在现场设立了展台,与
- 多卡组合
- 拓扑(NUMA / NVLink)

-👉 这直接导致:
+这直接导致:

- 调度逻辑外溢(extender / sidecar)
- 系统复杂度上升
@@ -91,7 +91,7 @@ DRA 的核心优势是:

PPT 里有一页非常关键,很多人会忽略:

-### 👉 DRA 请求长这样
+### DRA 请求长这样

```yaml
spec:
@@ -119,15 +119,15 @@ resources:
nvidia.com/gpu: 1
```

-👉 结论非常明确:
+结论非常明确:

> **DRA 是能力升级,但 UX 明显退化。**
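补充一个示意:DRA 模式下,一块 GPU 需要一个单独的 ResourceClaim 对象来申请(下例基于 `resource.k8s.io/v1beta1` 与 NVIDIA DRA driver 的 `gpu.nvidia.com` DeviceClass,字段仅供示意,具体以集群与 driver 版本为准):

```shell
# 将 DRA 风格的申请写入变量,便于与传统 resources 写法对比(仅示意)
CLAIM_YAML='apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: single-gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com'
echo "$CLAIM_YAML"
```

相比 `nvidia.com/gpu: 1` 一行声明,用户需要额外创建这个对象并在 Pod spec 中引用它,这正是上文所说的 UX 退化。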

## HAMi-DRA 的关键突破:自动化

这是这次分享最有价值的部分之一:

-### 👉 Webhook 自动生成 ResourceClaim
+### Webhook 自动生成 ResourceClaim

HAMi 的做法不是让用户"直接用 DRA",而是:

@@ -178,7 +178,7 @@ DRA driver 并不只是"注册资源",而是完整 lifecycle 管理:
- 环境变量管理
- 临时目录(cache / lock)

-👉 这意味着:
+这意味着:

> **GPU 调度已经进入 runtime orchestration 层,而不是简单资源分配。**

@@ -191,7 +191,7 @@ PPT 中给出了一个很关键的 benchmark:
- HAMi(传统):最高 ~42,000
- HAMi-DRA:显著下降(~30%+ 改善)

-👉 这说明:
+这说明:

> **DRA 的资源预绑定机制,可以减少调度阶段冲突和重试**

@@ -211,7 +211,7 @@ PPT 中给出了一个很关键的 benchmark:
- ResourceClaim:资源分配
- → **资源视角是第一等公民**

-👉 这带来的变化:
+这带来的变化:

> **Observability 从"推导"变成"直接建模"**

@@ -227,7 +227,7 @@ PPT 提出了一个非常关键的未来方向:
- PCI bus ID
- GPU attributes

-👉 这其实是一个更大的叙事:
+这其实是一个更大的叙事:

> **DRA 是 heterogeneous compute abstraction 的起点**

@@ -249,7 +249,7 @@ PPT 提出了一个非常关键的未来方向:

- 调度逻辑 → 资源声明

-👉 本质上:
+本质上:

> **Kubernetes 正在进化为 AI Infra Control Plane**

Original file line number Diff line number Diff line change
@@ -4,7 +4,7 @@ title: HAMi 采用者

# HAMi 采用者

-你和你的组织正在使用 HAMi?太棒了!我们很乐意听到你的使用反馈!💖
+你和你的组织正在使用 HAMi?太棒了!请通过 GitHub 提交使用信息。

## 添加你的信息

Original file line number Diff line number Diff line change
@@ -48,7 +48,7 @@ translated: true
* 受邀参加贡献者活动
* 有资格成为组织成员

-特别感谢[长长的名单](https://github.com/Project-HAMi/HAMi/blob/master/AUTHORS.md)中那些为项目做出贡献并帮助维护项目的人。没有你们的贡献,我们不会有今天的成就。谢谢!💖
+特别感谢[长长的名单](https://github.com/Project-HAMi/HAMi/blob/master/AUTHORS.md)中那些为项目做出贡献并帮助维护项目的人。没有你们的贡献,我们不会有今天的成就。谢谢!

只要你为 HAMi 做出贡献,你的名字将被添加到[这里](https://github.com/Project-HAMi/HAMi/blob/master/AUTHORS.md)。如果你没有找到你的名字,联系我们添加。

Original file line number Diff line number Diff line change
@@ -40,7 +40,7 @@ kubectl get all -n kube-system -l app=hami -o yaml > hami-state-backup.yaml

### 3. 清理运行中的工作负载

-⚠️ **关键提醒:** 升级前必须停止或重新调度所有 GPU 工作负载。在存在运行任务的情况下升级,可能导致段错误(segmentation fault)或不可预测行为。
+**关键提醒:** 升级前必须停止或重新调度所有 GPU 工作负载。在存在运行任务的情况下升级,可能导致段错误(segmentation fault)或不可预测行为。

**优雅清理 GPU 工作负载:**

Original file line number Diff line number Diff line change
@@ -7,13 +7,13 @@ HAMi 支持的设备如下表所示:

| 生产商 | 制造商 | 类型 | 内存隔离 | 核心隔离 | 多卡支持 |
|-------|-------|-----|---------|--------|---------|
-| GPU | NVIDIA | 全部 | | | |
-| MLU | Cambricon | 370, 590 | | | |
-| DCU | Hygon | Z100, Z100L | | | |
-| NPU | Huawei Ascend | 910B, 910B3, 310P | | | |
-| GPU | iluvatar | 全部 | | | |
-| GPU | Mthreads | MTT S4000 | | | |
-| GPU | Metax | MXC500 | | | |
-| GCU | Enflame | S60 | | | |
-| XPU | Kunlunxin | P800 | | | |
-| DPU | Teco | 检查中 | 进行中 | 进行中 | |
+| GPU | NVIDIA | 全部 | Yes | Yes | Yes |
+| MLU | Cambricon | 370, 590 | Yes | Yes | No |
+| DCU | Hygon | Z100, Z100L | Yes | Yes | No |
+| NPU | Huawei Ascend | 910B, 910B3, 310P | Yes | Yes | No |
+| GPU | iluvatar | 全部 | Yes | Yes | No |
+| GPU | Mthreads | MTT S4000 | Yes | Yes | No |
+| GPU | Metax | MXC500 | Yes | Yes | No |
+| GCU | Enflame | S60 | Yes | Yes | No |
+| XPU | Kunlunxin | P800 | Yes | Yes | No |
+| DPU | Teco | 检查中 | 进行中 | 进行中 | No |
Original file line number Diff line number Diff line change
@@ -8,11 +8,11 @@ linktitle: GPU 共享

本组件支持复用燧原 GCU 设备 (S60),并为此提供以下几种与 vGPU 类似的复用功能,包括:

-***GPU 共享***: 每个任务可以只占用一部分显卡,多个任务可以共享一张显卡
+**GPU 共享**: 每个任务可以只占用一部分显卡,多个任务可以共享一张显卡

-***百分比切片能力***: 你现在可以用百分比来申请一个 GCU 切片(例如 20%),本组件会确保任务使用的显存和算力不会超过这个百分比对应的数值
+**百分比切片能力**: 你现在可以用百分比来申请一个 GCU 切片(例如 20%),本组件会确保任务使用的显存和算力不会超过这个百分比对应的数值

-***设备 UUID 选择***: 你可以通过注解指定使用或排除特定的 GCU 设备
+**设备 UUID 选择**: 你可以通过注解指定使用或排除特定的 GCU 设备

**部署说明**: 部署本组件后,只需要部署厂家提供的 gcushare-device-plugin 即可使用

Original file line number Diff line number Diff line change
@@ -8,13 +8,13 @@ translated: true

本组件支持复用天数智芯 GPU 设备 (MR-V100、BI-V150、BI-V100),并为此提供以下几种与 vGPU 类似的复用功能,包括:

-***GPU 共享***: 每个任务可以只占用一部分显卡,多个任务可以共享一张显卡
+**GPU 共享**: 每个任务可以只占用一部分显卡,多个任务可以共享一张显卡

-***可限制分配的显存大小***: 你现在可以用显存值(例如 3000M)来分配 GPU,本组件会确保任务使用的显存不会超过分配数值
+**可限制分配的显存大小**: 你现在可以用显存值(例如 3000M)来分配 GPU,本组件会确保任务使用的显存不会超过分配数值

-***可限制分配的算力核组比例***: 你现在可以用算力比例(例如 60%)来分配 GPU,本组件会确保任务使用的显存不会超过分配数值
+**可限制分配的算力核组比例**: 你现在可以用算力比例(例如 60%)来分配 GPU,本组件会确保任务使用的算力不会超过分配数值

-***设备 UUID 选择***: 你可以通过注解指定使用或排除特定的 GPU 设备
+**设备 UUID 选择**: 你可以通过注解指定使用或排除特定的 GPU 设备

**部署说明**: 部署本组件后,只需要部署厂家提供的 gpu-manager 即可使用
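按上述能力申请资源的一个 Pod 示意(`iluvatar.ai/*` 资源名与数值语义均为假设示例,实际请以 HAMi 文档为准):

```shell
# 生成一个申请 3000M 显存、60% 算力的 Pod 清单(资源名为假设示例,仅供示意)
POD_YAML='apiVersion: v1
kind: Pod
metadata:
  name: iluvatar-demo
spec:
  containers:
  - name: main
    image: busybox
    resources:
      limits:
        iluvatar.ai/vgpu: 1            # 假设的 vGPU 切片资源名
        iluvatar.ai/vcuda-memory: 3000 # 假设的显存上限字段
        iluvatar.ai/vcuda-core: 60     # 假设的算力比例字段'
echo "$POD_YAML"
```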

Original file line number Diff line number Diff line change
@@ -6,7 +6,7 @@ Welcome to HAMi!

## Code of Conduct

-Please make sure to read and observe our [Code of Conduct](https://github.com/cncf/foundation/blob/main/code-of-conduct.md)
+Please make sure to read and observe the [Code of Conduct](https://github.com/cncf/foundation/blob/main/code-of-conduct.md)

## Community Expectations

@@ -51,7 +51,7 @@ When you are willing to take on an issue, just reply on the issue. The maintaine

### File an Issue

-While we encourage everyone to contribute code, it is also appreciated when someone reports an issue.
+Code contributions are welcome, and bug reports are equally appreciated.
Issues should be filed under the appropriate HAMi sub-repository.

*Example:* a HAMi issue should be opened to [Project-HAMi/HAMi](https://github.com/Project-HAMi/HAMi/issues).
Original file line number Diff line number Diff line change
@@ -15,11 +15,11 @@ The HAMi and its leadership embrace the following values:
* Fairness: All stakeholders have the opportunity to provide feedback and submit
contributions, which will be considered on their merits.

-* Community over Product or Company: Sustaining and growing our community takes
+* Community over Product or Company: Sustaining and growing the community takes
priority over shipping code or sponsors' organizational goals. Each
contributor participates in the project as an individual.

-* Inclusivity: We innovate through different perspectives and skill sets, which
+* Inclusivity: Innovation comes from different perspectives and skill sets, and this
can only be accomplished in a welcoming and respectful environment.

* Participation: Responsibilities within the project are earned through
Original file line number Diff line number Diff line change
@@ -6,7 +6,7 @@ This docs different ways to get involved and level up within the project. You ca

## Contributor Ladder

-Hello! We are excited that you want to learn more about our project contributor ladder! This contributor ladder outlines the different contributor roles within the project, along with the responsibilities and privileges that come with them. Community members generally start at the first levels of the "ladder" and advance up it as their involvement in the project grows. Our project members are happy to help you advance along the contributor ladder.
+This contributor ladder outlines the different contributor roles within the project, along with the responsibilities and privileges that come with them.

Each of the contributor roles below is organized into lists of three types of things. "Responsibilities" are things that a contributor is expected to do. "Requirements" are qualifications a person needs to meet to be in that role, and "Privileges" are things contributors on that level are entitled to.

@@ -47,7 +47,7 @@ Description: A Contributor contributes directly to the project and adds value to
* Invitations to contributor events
* Eligible to become an Organization Member

-A very special thanks to the [long list of people](https://github.com/Project-HAMi/HAMi/blob/master/AUTHORS.md) who have contributed to and helped maintain the project. We wouldn't be where we are today without your contributions. Thank you! 💖
+A very special thanks to the [long list of people](https://github.com/Project-HAMi/HAMi/blob/master/AUTHORS.md) who have contributed to and helped maintain the project. The project wouldn't be where it is today without your contributions. Thank you!

As long as you contribute to HAMi, your name will be added [to the AUTHORS list](https://github.com/Project-HAMi/HAMi/blob/master/AUTHORS.md). If you don't find your name, please contact us to add it.

@@ -142,7 +142,7 @@ The current list of maintainers can be found in the [MAINTAINERS](https://github

New maintainers are added by consensus among the current group of maintainers. This can be done via a private discussion via Slack or email. A majority of maintainers should support the addition of the new person, and no single maintainer should object to adding the new maintainer.

-When adding a new maintainer, we should file a PR to [HAMi](https://github.com/Project-HAMi/HAMi) and update [MAINTAINERS](https://github.com/Project-HAMi/HAMi/blob/master/MAINTAINERS.md). Once this PR is merged, you will become a maintainer of HAMi.
+When adding a new maintainer, file a PR to [HAMi](https://github.com/Project-HAMi/HAMi) and update [MAINTAINERS](https://github.com/Project-HAMi/HAMi/blob/master/MAINTAINERS.md). Once this PR is merged, you will become a maintainer of HAMi.

### Removing Maintainers

Original file line number Diff line number Diff line change
@@ -9,8 +9,8 @@ This feature will not be implemented without the help of @sailorvii.

## Introduction

-The NVIDIA GPU build-in sharing method includes: time-slice, MPS and MIG. The context switch for time slice sharing would waste some time, so we chose the MPS and MIG. The GPU MIG profile is variable, the user could acquire the MIG device in the profile definition, but current implementation only defines the dedicated profile before the user requirement. That limits the usage of MIG. We want to develop an automatic slice plugin and create the slice when the user require it.
-For the scheduling method, node-level binpack and spread will be supported. Referring to the binpack plugin, we consider the CPU, Mem, GPU memory and other user-defined resource.
+The NVIDIA GPU built-in sharing methods include time-slicing, MPS, and MIG. The context switch for time-slice sharing wastes time, so MPS and MIG are preferred. MIG profiles are variable and users can request MIG devices by profile, but the current implementation only defines fixed profiles ahead of user requests, which limits the usage of MIG. The goal is an automatic slice plugin that creates slices on demand.
+For the scheduling method, node-level binpack and spread will be supported. Referring to the binpack plugin, the scheduler considers CPU, memory, GPU memory, and other user-defined resources.
HAMi is done by using [hami-core](https://github.com/Project-HAMi/HAMi-core), which is a cuda-hacking library. But mig is also widely used across the world. A unified API for dynamic-mig and hami-core is needed.

## Targets
Original file line number Diff line number Diff line change
@@ -104,7 +104,7 @@ Node1 score: ((1+3)/4) * 10= 10
Node2 score: ((1+2)/4) * 10= 7.5
```

-So, in `Binpack` policy we can select `Node1`.
+So, in `Binpack` policy, the selected node is `Node1`.

#### Spread

@@ -124,7 +124,7 @@ Node1 score: ((1+3)/4) * 10= 10
Node2 score: ((1+2)/4) * 10= 7.5
```

-So, in `Spread` policy we can select `Node2`.
+So, in `Spread` policy, the selected node is `Node2`.

### GPU-scheduler-policy

@@ -147,7 +147,7 @@ GPU1 Score: ((20+10)/100 + (1000+2000)/8000)) * 10 = 6.75
GPU2 Score: ((20+70)/100 + (1000+6000)/8000)) * 10 = 17.75
```

-So, in `Binpack` policy we can select `GPU2`.
+So, in `Binpack` policy, the selected GPU is `GPU2`.

#### Spread

@@ -166,4 +166,4 @@ GPU1 Score: ((20+10)/100 + (1000+2000)/8000)) * 10 = 6.75
GPU2 Score: ((20+70)/100 + (1000+6000)/8000)) * 10 = 17.75
```

-So, in `Spread` policy we can select `GPU1`.
+So, in `Spread` policy, the selected GPU is `GPU1`.
Original file line number Diff line number Diff line change
@@ -92,7 +92,7 @@ sudo systemctl daemon-reload && systemctl restart containerd

#### 2. Label your nodes

-Label your GPU nodes for scheduling with HAMi by adding the label "gpu=on". Without this label, the nodes cannot be managed by our scheduler.
+Label your GPU nodes for scheduling with HAMi by adding the label "gpu=on". Without this label, the nodes cannot be managed by the HAMi scheduler.

```bash
kubectl label nodes {nodeid} gpu=on
@@ -106,7 +106,7 @@ First, you need to check your Kubernetes version by using the following command:
kubectl version
```

-Then, add our repo in helm
+Then, add the HAMi repo in helm

```bash
helm repo add hami-charts https://project-hami.github.io/HAMi/
Original file line number Diff line number Diff line change
@@ -81,7 +81,7 @@ sudo systemctl daemon-reload && systemctl restart containerd

### Label your nodes

-Label your GPU nodes for scheduling with HAMi by adding the label "gpu=on". Without this label, the nodes cannot be managed by our scheduler.
+Label your GPU nodes for scheduling with HAMi by adding the label "gpu=on". Without this label, the nodes cannot be managed by the HAMi scheduler.

```bash
kubectl label nodes {nodeid} gpu=on
Original file line number Diff line number Diff line change
@@ -4,15 +4,15 @@ title: Enable cambricon MLU sharing

## Introduction

-**We now support cambricon.com/mlu by implementing most device-sharing features as nvidia-GPU**, including:
+**HAMi now supports cambricon.com/mlu by implementing most of the device-sharing features available for NVIDIA GPUs**, including:

**MLU sharing**: Each task can allocate a portion of MLU instead of a whole MLU card, thus MLU can be shared among multiple tasks.

**Device Memory Control**: MLUs can be allocated with certain device memory size on certain type(i.e 370) and have made it that it does not exceed the boundary.

**MLU Type Specification**: You can specify which type of MLU to use or to avoid for a certain task, by setting "cambricon.com/use-mlutype" or "cambricon.com/nouse-mlutype" annotations.

-**Very Easy to use**: You don't need to modify your task yaml to use our scheduler. All your MLU jobs will be automatically supported after installation. The only thing you need to do is tag the MLU node.
+**Very Easy to use**: You don't need to modify your task yaml to use the HAMi scheduler. All your MLU jobs will be automatically supported after installation. The only thing you need to do is tag the MLU node.

## Prerequisites

Original file line number Diff line number Diff line change
@@ -18,7 +18,7 @@ You can update these configurations using one of the following methods:
2. Modify Helm Chart: Update the corresponding values in the [ConfigMap](https://raw.githubusercontent.com/archlitchi/HAMi/refs/heads/master/charts/hami/templates/scheduler/device-configmap.yaml), then reapply the Helm Chart to regenerate the ConfigMap.

* `nvidia.deviceMemoryScaling:`
-Float type, by default: 1. The ratio for NVIDIA device memory scaling, can be greater than 1 (enable virtual device memory, experimental feature). For NVIDIA GPU with *M* memory, if we set `nvidia.deviceMemoryScaling` argument to *S*, vGPUs split by this GPU will totally get `S * M` memory in Kubernetes with our device plugin.
+Float type, by default: 1. The ratio for NVIDIA device memory scaling; can be greater than 1 (enables virtual device memory, an experimental feature). For an NVIDIA GPU with *M* memory, if `nvidia.deviceMemoryScaling` is set to *S*, the vGPUs split from this GPU will get `S * M` memory in total in Kubernetes with the HAMi device plugin.
* `nvidia.deviceSplitCount:`
Integer type, by default: equals 10. Maximum tasks assigned to a simple GPU device.
* `nvidia.migstrategy:`
Expand Down
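As a quick sanity check of the scaling rule above: with an example GPU of *M* = 24576 MiB and `nvidia.deviceMemoryScaling` set to 1.5, the vGPUs split from that GPU advertise `S * M` memory in total (values are illustrative only):

```shell
# Total advertised vGPU memory = S * M (illustrative values)
M=24576   # physical GPU memory in MiB
S=1.5     # nvidia.deviceMemoryScaling
awk -v m="$M" -v s="$S" 'BEGIN { printf "%.0f\n", m * s }'   # prints 36864
```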