[Bug]: Cannot assign GPU per container in multi-container Pod with time-slicing #1690

### **Summary**

When using NVIDIA GPU time-slicing (`device-sharing-strategy=time-slicing`, `rename-by-default=false`), Kubernetes does not allow assigning GPU resources independently to multiple containers within the same Pod.

The per-container GPU requests are summed into a **Pod-level** total, making it impossible to run multiple GPU-dependent containers in a single Pod unless the underlying node has multiple **physical GPUs**.

---

### **Environment**

* Kubernetes (EKS)
* Karpenter (NodePool / EC2NodeClass)
* Bottlerocket (`bottlerocket@v1.53.0`)
* Instance types:

  * `g6e.xlarge`, `g6e.2xlarge` (1 physical GPU)
* NVIDIA device plugin:

  * `device-sharing-strategy = "time-slicing"`
  * `replicas = 4`
  * `rename-by-default = false`

---

### **Problem Scenario**

We run a Pod with **multiple containers**, each requiring GPU:

```yaml
containers:
  - name: container-a
    resources:
      limits:
        nvidia.com/gpu: 1

  - name: container-b
    resources:
      limits:
        nvidia.com/gpu: 1
```

---

### **Expectation**

Since time-slicing exposes:

```text
nvidia.com/gpu: 4
```

We expect:

* Each container consumes 1 GPU slice
* Total Pod uses 2 slices
* Pod should run on a node with 1 physical GPU
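
To double-check what a node actually advertises, the counts can be read from the node object. The sketch below is a minimal illustration, not output from our cluster: it parses JSON in the shape returned by `kubectl get node <node-name> -o json`, and the `sample` document is hypothetical, built to match the `nvidia.com/gpu: 4` value above. The helper name `gpu_counts` is our own.

```python
import json

GPU = "nvidia.com/gpu"


def gpu_counts(node_json: str) -> dict:
    """Return the advertised nvidia.com/gpu counts from a Node object (as JSON text).

    Kubernetes stores resource quantities as strings, so they are converted
    to int here. A missing entry is reported as 0.
    """
    status = json.loads(node_json).get("status", {})
    return {
        field: int(status.get(field, {}).get(GPU, "0"))
        for field in ("capacity", "allocatable")
    }


# Hypothetical node document: 1 physical GPU with replicas = 4.
sample = json.dumps(
    {"status": {"capacity": {GPU: "4"}, "allocatable": {GPU: "4"}}}
)

print(gpu_counts(sample))
```

If `allocatable` shows 4 on a single-GPU node, the device plugin side of time-slicing is working and the problem lies in how the Pod's total request is matched against instance types.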

---

### **Actual Behavior**

* Kubernetes aggregates GPU requests at the Pod level:

```text
Pod requires: nvidia.com/gpu = 2
```

* Karpenter evaluates instance types based on **physical GPUs**, not the time-sliced replica count
* Instance types with 1 physical GPU therefore cannot satisfy the request

➡️ The Pod is unschedulable unless multi-GPU instance types are used
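
The arithmetic behind this can be sketched as follows. This is a simplified model under stated assumptions, not Karpenter's or the scheduler's actual code: the Pod-level request is taken as the sum over app containers (or the largest init container, if higher), ignoring sidecar-style init containers and Pod overhead. The function names and the `count_slices` switch are hypothetical and exist only to contrast the two ways of counting capacity.

```python
GPU = "nvidia.com/gpu"


def effective_request(containers, init_containers=()):
    """Pod-level GPU request: sum of app containers, or the largest
    init container if that is higher (simplified model)."""
    app_total = sum(int(c.get(GPU, 0)) for c in containers)
    init_max = max((int(c.get(GPU, 0)) for c in init_containers), default=0)
    return max(app_total, init_max)


def fits(pod_request, physical_gpus, replicas, count_slices):
    """Compare the request against capacity.

    count_slices=True  -> capacity is physical GPUs x time-slicing replicas
    count_slices=False -> capacity is physical GPUs only (assumed to be
                          what instance-type selection sees in this issue)
    """
    capacity = physical_gpus * replicas if count_slices else physical_gpus
    return pod_request <= capacity


# The Pod from the problem scenario: two containers, one GPU each.
pod = [{GPU: 1}, {GPU: 1}]
need = effective_request(pod)

print(need)                                    # total requested by the Pod
print(fits(need, 1, 4, count_slices=True))     # against advertised slices
print(fits(need, 1, 4, count_slices=False))    # against physical GPUs
```

With `replicas = 4` on a `g6e.xlarge`, the request of 2 fits when slices are counted and fails when only the single physical GPU is counted, which matches the behavior we observe.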

---

### **Observed Workarounds & Trade-offs**

#### ❌ Option 1 — Only 1 container uses GPU

```text
container-a → GPU
container-b → CPU only
```

* Not viable: `container-b` needs GPU access and crashes or restarts without it

---

#### ❌ Option 2 — Split into multiple Pods

```text
Pod A → container-a (GPU)
Pod B → container-b (GPU)
```

* Requires architectural changes
* Effectively turns into **multiple services**
* Not suitable when containers are tightly coupled

---

#### ✅ Option 3 — Use multi-GPU nodes (current workaround)

* Use an instance type with ≥2 physical GPUs
* Example: a node with 4 GPUs
* The Pod schedules successfully

**Downside:**

* Significantly higher cost
* Wastes GPU capacity when workloads are small
* Defeats the purpose of GPU time-slicing (cost efficiency)

---

### **Key Insight**

Time-slicing increases the number of **allocatable GPU slots**, but:

* Kubernetes still treats `nvidia.com/gpu` as:

  * an **integer** resource
  * **Pod-scoped** for scheduling (per-container requests are summed)
* There is no concept of:

  * per-container GPU isolation
  * or fractional GPU assignment inside a Pod

---

### **Impact**

* Cannot efficiently run multi-container GPU workloads
* Forces one of:

  * code refactor (split services)
  * over-provisioning (multi-GPU nodes)
* Reduces cost efficiency of GPU sharing setups

---

### **Question**

Is this limitation:

* inherent to the Kubernetes device plugin design?
* or is there a recommended pattern for multi-container GPU workloads with time-slicing?

---

### **Conclusion**

GPU time-slicing does not enable “1 GPU per container” within a Pod.
GPU resources are still aggregated at the Pod level, forcing users to either redesign workloads or over-provision infrastructure.

---

