fix: handle cases where OCI is out of capacity by aniekgul · Pull Request #34 · oracle/karpenter-provider-oci

aniekgul · 2026-06-05T12:48:42Z

Description

When a NodePool runs out of host capacity, OCI Karpenter previously returned a generic CloudProvider.CreateError from Create. Core Karpenter only deletes and reschedules a NodeClaim on an InsufficientCapacityError, so on a CreateError the NodeClaim stayed bound to the same (exhausted) NodePool/offering and retried in a tight loop. The result was retry storms and 429 throttling, with no fallback to other capacity types or NodePools.

This change makes host-capacity exhaustion behave correctly across three layers:

Classify pool-wide exhaustion as insufficient capacity. When every launch attempt is consumed and at least one failed with an out-of-host-capacity error (non-capacity launch errors still return early), CloudProvider.Create now returns cloudprovider.NewInsufficientCapacityError. This lets core Karpenter delete + reschedule the NodeClaim, enabling cross-NodePool fallback.
Add an unavailable-offerings cache. A new TTL cache (pkg/cache/unavailableofferings.go) records (shape, AD/zone, capacity-type) offerings observed to be out of capacity. LaunchInstance marks offerings unavailable on capacity errors, and the instance-type provider gates offering availability (Available=false) when listing instance types. This drives spot→on-demand fallback within a NodePool and prevents the scheduler from immediately re-selecting dead offerings until the entry expires. The TTL is configurable via --unavailable-offerings-ttl-seconds (default 180s / 3 minutes).
Stop blindly retrying out-of-capacity at the SDK layer. oci.IsRetryable now treats an HTTP 500 that is an "Out of host capacity" error as non-retryable, so the capacity-fallback logic handles it instead of the SDK amplifying the host-capacity shortage and contributing to 429s.

Fixes #33

EDIT: Also added a fix for a deadlock issue, see aniekgul#3 for details.

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Unit and integration tests added to code.

Live cluster test

In an OKE cluster in ca-toronto-1.

Setup

Create two node pools, one with an instance type that is highly unlikely to be available and another with lots of room.

  exp-fallback-arm-spot:
    enabled: true
    capacity_type: spot
    architecture: arm64
    shapes: [VM.Standard.A1.Flex]
    # Pin to a full A1 host (max 80 OCPU / 512 GB). The shape is offered in the
    # region so Karpenter still attempts a launch, but a near-empty A1 host on
    # spot almost never exists -> launch fails for insufficient capacity.
    shape_configs:
      - { ocpus: 80, memory_gibs: 512 }
    taints:
      - *taint_exp_fallback
      - *taint_oke_preemptible
    labels:
      experiment-fallback: "true"
    weight: 100

  exp-fallback-amd64-od:
    enabled: true
    capacity_type: on-demand
    shapes:
      - VM.Standard.E3.Flex
      - VM.Standard.E4.Flex
      - VM.Standard.E5.Flex
    taints:
      - *taint_exp_fallback
    labels:
      experiment-fallback: "true"
    weight: 90

Execution

Using the current karpenter version 1.1.0, create a deployment that targets both:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: karpenter-fallback-test
  namespace: default
  labels:
    app: karpenter-fallback-test
    experiment: karpenter-fallback
spec:
  replicas: 1
  selector:
    matchLabels:
      app: karpenter-fallback-test
  template:
    metadata:
      labels:
        app: karpenter-fallback-test
        experiment: karpenter-fallback
    spec:
      # Target only the two experiment pools.
      nodeSelector:
        experiment-fallback: "true"
      tolerations:
        # Isolating taint shared by both experiment pools.
        - key: experiment-karpenter-fallback
          operator: Equal
          value: "true"
          effect: NoSchedule
        # KPO spot taint on the ARM-spot pool — required so Karpenter considers
        # that (higher-weight) pool first.
        - key: oci.oraclecloud.com/oke-is-preemptible
          operator: Exists
          effect: NoSchedule
      containers:
        - name: pause
          # Multi-arch so the pod is valid on both the arm64 and amd64 pools.
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "2"
              memory: 4Gi

The pod will stay pending.

Logs from the current Karpenter version (only important fields included):

LaunchInstance start - name: ci-ca-toronto-1-exp-fallback-arm-spot-rf4wn

LaunchInstance failed - error: Error returned by Compute Service. Http Status Code: 500. Error Code: InternalError. ... Message: Out of host capacity. Operation Name: LaunchInstance

Reconciler error - error: launching nodeclaim, creating nodeclaim, cannot create node after trying all instance types

(Launch failure loop starts)
LaunchInstance failed - error: Error returned by Compute Service. Http Status Code: 500. Error Code: InternalError. ... Message: Out of host capacity. Operation Name: LaunchInstance

Delete the deployment and update to the version from this branch.

Recreate the deployment. It should now schedule on a new node from the second pool.

Logs from this branch's version:

LaunchInstance start - name: ci-ca-toronto-1-exp-fallback-arm-spot-jk6mu

LaunchInstance failed - error: Error returned by Compute Service. Http Status Code: 500. Error Code: InternalError. ... Message: Out of host capacity. Operation Name: LaunchInstance

all instance types exhausted due to insufficient capacity - error: Error returned by Compute Service. Http Status Code: 500. Error Code: InternalError. ... Message: Out of host capacity. Operation Name: LaunchInstance

failed launching nodeclaim - error: insufficient capacity, all instance types exhausted due to insufficient capacity,

deleted nodeclaim - name: ci-ca-toronto-1-exp-fallback-arm-spot-jk6mu

created nodeclaim - name: ci-ca-toronto-1-exp-fallback-amd64-od-q5nz5

LaunchInstance success - name: ci-ca-toronto-1-exp-fallback-amd64-od-q5nz5

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules

Signed-off-by: Aniek Gul <13356402+aniekgul@users.noreply.github.com>

feat: handle oci out of capacity issues

oracle-contributor-agreement · 2026-06-05T12:48:47Z

Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

PR author: aniekgul
13356402+aniekgul@users.noreply.github.com (@aniekgul)

To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application.

When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.

If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.

Signed-off-by: Aniek Gul <13356402+aniekgul@users.noreply.github.com>

feat: handle limit or quota exceeded cases as well

Signed-off-by: Aniek Gul <13356402+aniekgul@users.noreply.github.com>

fix: remove recursive read locks that cause deadlocks in ListInstanceTypes

oracle-contributor-agreement · 2026-06-09T12:13:09Z

Thank you for signing the OCA.

Conflict in pkg/cloudprovider/cloud_provider.go resolved by preserving the fork's IsSkippableLaunchError logic (superset of upstream's IsNoCapacityError — also handles service-limit/quota exhaustion). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
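The superset relationship mentioned in this commit message can be sketched as two predicates, one nested in the other. This is an illustration only: serviceError and both function bodies are assumptions, and the "LimitExceeded" and "QuotaExceeded" codes are assumed here to be what the service returns for limit and quota exhaustion. The real IsSkippableLaunchError and IsNoCapacityError may inspect different fields.

```go
package main

import (
	"fmt"
	"strings"
)

// serviceError is a hypothetical, simplified view of an OCI service error.
type serviceError struct {
	Status  int
	Code    string
	Message string
}

// isNoCapacityError matches the HTTP 500 "Out of host capacity" response
// shown in the PR's logs.
func isNoCapacityError(e serviceError) bool {
	return e.Status == 500 && strings.Contains(e.Message, "Out of host capacity")
}

// isSkippableLaunchError is a superset: capacity errors plus service-limit
// and quota exhaustion. The codes below are assumptions for illustration.
func isSkippableLaunchError(e serviceError) bool {
	if isNoCapacityError(e) {
		return true
	}
	switch e.Code {
	case "LimitExceeded", "QuotaExceeded":
		return true
	}
	return false
}

func main() {
	capacity := serviceError{Status: 500, Code: "InternalError", Message: "Out of host capacity."}
	limit := serviceError{Status: 400, Code: "LimitExceeded", Message: "service limit reached"}
	auth := serviceError{Status: 404, Code: "NotAuthorizedOrNotFound", Message: "not authorized"}

	fmt.Println(isNoCapacityError(capacity), isSkippableLaunchError(capacity)) // true true
	fmt.Println(isNoCapacityError(limit), isSkippableLaunchError(limit))       // false true
	fmt.Println(isNoCapacityError(auth), isSkippableLaunchError(auth))         // false false
}
```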

chore: sync upstream oracle/karpenter-provider-oci v1.2.0

oracle-contributor-agreement · 2026-06-17T18:50:19Z

Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application.

When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.

If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.

aniekgul added 2 commits June 4, 2026 14:37

feat: handle oci out of capacity issues

c9aa700

Signed-off-by: Aniek Gul <13356402+aniekgul@users.noreply.github.com>

Merge pull request #1 from aniekgul/33-handle-out-of-capacity-issues

ec5116d

feat: handle oci out of capacity issues

oracle-contributor-agreement Bot added the "OCA Required" label (At least one contributor does not have an approved Oracle Contributor Agreement.) on Jun 5, 2026

aniekgul mentioned this pull request Jun 5, 2026

[Bug] Karpenter fails to consider other Nodepools when one is out of capacity #33

Open

aniekgul added 4 commits June 6, 2026 13:38

feat: handle limit or quota exceeded cases as well

6fb68f1

Signed-off-by: Aniek Gul <13356402+aniekgul@users.noreply.github.com>

Merge pull request #2 from aniekgul/33-handle-out-of-limit-issues

1eebafd

feat: handle limit or quota exceeded cases as well

fix: remove recursive read locks that cause deadlocks in listinstances

7bcac78

Signed-off-by: Aniek Gul <13356402+aniekgul@users.noreply.github.com>

Merge pull request #3 from aniekgul/fix_deadlock

7d520b2

fix: remove recursive read locks that cause deadlocks in ListInstanceTypes

oracle-contributor-agreement Bot added the "OCA Verified" label (All contributors have signed the Oracle Contributor Agreement.) and removed the "OCA Required" label (At least one contributor does not have an approved Oracle Contributor Agreement.) on Jun 9, 2026

aodanxin and others added 2 commits June 17, 2026 13:51

Merge pull request #4 from aniekgul/sync/upstream-v1.2.0

46be91c

chore: sync upstream oracle/karpenter-provider-oci v1.2.0

oracle-contributor-agreement Bot added the "OCA Required" label (At least one contributor does not have an approved Oracle Contributor Agreement.) and removed the "OCA Verified" label (All contributors have signed the Oracle Contributor Agreement.) on Jun 17, 2026

aniekgul wants to merge 8 commits into oracle:main from aniekgul:main
