fix: handle cases where OCI is out of capacity #34

Open
aniekgul wants to merge 8 commits into oracle:main from aniekgul:main

aniekgul commented Jun 5, 2026

Description

When a NodePool runs out of host capacity, OCI Karpenter previously returned a generic CloudProvider.CreateError from Create. Core Karpenter only deletes and reschedules a NodeClaim on an InsufficientCapacityError, so on a CreateError the NodeClaim stayed bound to the same (exhausted) NodePool/offering and retried in a tight loop. This produced retry storms and 429 throttling, with no fallback to other capacity types or NodePools.

This change makes host-capacity exhaustion behave correctly across three layers:

  1. Classify pool-wide exhaustion as insufficient capacity. When every launch attempt is consumed and at least one failed with an out-of-host-capacity error (non-capacity launch errors still return early), CloudProvider.Create now returns cloudprovider.NewInsufficientCapacityError. This lets core Karpenter delete + reschedule the NodeClaim, enabling cross-NodePool fallback.
  2. Add an unavailable-offerings cache. A new TTL cache (pkg/cache/unavailableofferings.go) records (shape, AD/zone, capacity-type) offerings observed to be out of capacity. LaunchInstance marks offerings unavailable on capacity errors, and the instance-type provider gates offering availability (Available=false) when listing instance types. This drives spot→on-demand fallback within a NodePool and prevents the scheduler from immediately re-selecting dead offerings until the entry expires. The TTL is configurable via --unavailable-offerings-ttl-seconds (default 180s / 3 minutes).
  3. Stop blindly retrying out-of-capacity at the SDK layer. oci.IsRetryable now treats an HTTP 500 that is an "Out of host capacity" error as non-retryable, so the capacity-fallback logic handles it instead of the SDK amplifying the host-capacity shortage and contributing to 429s.

Fixes #33

EDIT: Also added a fix for a deadlock issue, see aniekgul#3 for details.
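
For context on the deadlock: Go's sync.RWMutex does not support recursive read-locking, because a second RLock on the same goroutine can block behind a pending writer that is itself waiting for the first read lock to be released. One common way to avoid this is to take the lock once in exported methods and delegate to helpers that assume the lock is held. The sketch below shows that pattern only; instanceTypeStore and its methods are hypothetical and are not the provider's actual ListInstanceTypes code.

```go
package main

import (
	"fmt"
	"sync"
)

// instanceTypeStore is a hypothetical stand-in for a provider that guards
// cached instance types with an RWMutex.
type instanceTypeStore struct {
	mu     sync.RWMutex
	shapes []string
}

// List takes the read lock exactly once and delegates to a helper that
// does no locking of its own.
func (s *instanceTypeStore) List() []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.listLocked()
}

// Count also needs the list. It calls listLocked rather than List, so the
// read lock is never acquired twice on the same goroutine.
func (s *instanceTypeStore) Count() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.listLocked())
}

// listLocked returns a copy of the shapes. The caller must hold s.mu.
func (s *instanceTypeStore) listLocked() []string {
	out := make([]string, len(s.shapes))
	copy(out, s.shapes)
	return out
}

// Replace swaps the cached shapes under the write lock.
func (s *instanceTypeStore) Replace(shapes []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.shapes = append([]string(nil), shapes...)
}

func main() {
	store := &instanceTypeStore{}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); _ = store.Count() }()
		go func() {
			defer wg.Done()
			store.Replace([]string{"VM.Standard.E4.Flex", "VM.Standard.E5.Flex"})
		}()
	}
	wg.Wait()
	fmt.Println(store.Count()) // 2
}
```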

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Unit and integration tests added to code.

Live cluster test

In an OKE cluster in ca-toronto-1.

Setup

Create two node pools, one with an instance type that is highly unlikely to be available and another with lots of room.

  exp-fallback-arm-spot:
    enabled: true
    capacity_type: spot
    architecture: arm64
    shapes: [VM.Standard.A1.Flex]
    # Pin to a full A1 host (max 80 OCPU / 512 GB). The shape is offered in the
    # region so Karpenter still attempts a launch, but a near-empty A1 host on
    # spot almost never exists -> launch fails for insufficient capacity.
    shape_configs:
      - { ocpus: 80, memory_gibs: 512 }
    taints:
      - *taint_exp_fallback
      - *taint_oke_preemptible
    labels:
      experiment-fallback: "true"
    weight: 100

  exp-fallback-amd64-od:
    enabled: true
    capacity_type: on-demand
    shapes:
      - VM.Standard.E3.Flex
      - VM.Standard.E4.Flex
      - VM.Standard.E5.Flex
    taints:
      - *taint_exp_fallback
    labels:
      experiment-fallback: "true"
    weight: 90

Execution

Using the current Karpenter version 1.1.0, create a deployment that targets both pools:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: karpenter-fallback-test
  namespace: default
  labels:
    app: karpenter-fallback-test
    experiment: karpenter-fallback
spec:
  replicas: 1
  selector:
    matchLabels:
      app: karpenter-fallback-test
  template:
    metadata:
      labels:
        app: karpenter-fallback-test
        experiment: karpenter-fallback
    spec:
      # Target only the two experiment pools.
      nodeSelector:
        experiment-fallback: "true"
      tolerations:
        # Isolating taint shared by both experiment pools.
        - key: experiment-karpenter-fallback
          operator: Equal
          value: "true"
          effect: NoSchedule
        # KPO spot taint on the ARM-spot pool — required so Karpenter considers
        # that (higher-weight) pool first.
        - key: oci.oraclecloud.com/oke-is-preemptible
          operator: Exists
          effect: NoSchedule
      containers:
        - name: pause
          # Multi-arch so the pod is valid on both the arm64 and amd64 pools.
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "2"
              memory: 4Gi

The pod will stay pending.

Logs from the current Karpenter version (only the important fields are included):

LaunchInstance start - name: ci-ca-toronto-1-exp-fallback-arm-spot-rf4wn

LaunchInstance failed - error: Error returned by Compute Service. Http Status Code: 500. Error Code: InternalError. ... Message: Out of host capacity. Operation Name: LaunchInstance

Reconciler error - error: launching nodeclaim, creating nodeclaim, cannot create node after trying all instance types

(Launch failure loop starts)
LaunchInstance failed - error: Error returned by Compute Service. Http Status Code: 500. Error Code: InternalError. ... Message: Out of host capacity. Operation Name: LaunchInstance

Delete the deployment and update to the version from this branch.

Recreate the deployment; the pod should now schedule on a new node from the second pool.

Logs from this branch's version:

LaunchInstance start - name: ci-ca-toronto-1-exp-fallback-arm-spot-jk6mu

LaunchInstance failed - error: Error returned by Compute Service. Http Status Code: 500. Error Code: InternalError. ... Message: Out of host capacity. Operation Name: LaunchInstance

all instance types exhausted due to insufficient capacity - error: Error returned by Compute Service. Http Status Code: 500. Error Code: InternalError. ... Message: Out of host capacity. Operation Name: LaunchInstance

failed launching nodeclaim - error: insufficient capacity, all instance types exhausted due to insufficient capacity,

deleted nodeclaim - name: ci-ca-toronto-1-exp-fallback-arm-spot-jk6mu

created nodeclaim - name: ci-ca-toronto-1-exp-fallback-amd64-od-q5nz5

LaunchInstance success - name: ci-ca-toronto-1-exp-fallback-amd64-od-q5nz5

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

aniekgul added 2 commits June 4, 2026 14:37
Signed-off-by: Aniek Gul <13356402+aniekgul@users.noreply.github.com>
@oracle-contributor-agreement

Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application.

When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.

If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.

@oracle-contributor-agreement oracle-contributor-agreement Bot added the OCA Required At least one contributor does not have an approved Oracle Contributor Agreement. label Jun 5, 2026
aniekgul added 4 commits June 6, 2026 13:38
Signed-off-by: Aniek Gul <13356402+aniekgul@users.noreply.github.com>
feat: handle limit or quota exceeded cases as well
Signed-off-by: Aniek Gul <13356402+aniekgul@users.noreply.github.com>
fix: remove recursive read locks that cause deadlocks in ListInstanceTypes
@oracle-contributor-agreement

Thank you for signing the OCA.

@oracle-contributor-agreement oracle-contributor-agreement Bot added OCA Verified All contributors have signed the Oracle Contributor Agreement. and removed OCA Required At least one contributor does not have an approved Oracle Contributor Agreement. labels Jun 9, 2026
aodanxin and others added 2 commits June 17, 2026 13:51
Conflict in pkg/cloudprovider/cloud_provider.go resolved by preserving
the fork's IsSkippableLaunchError logic (superset of upstream's
IsNoCapacityError — also handles service-limit/quota exhaustion).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
chore: sync upstream oracle/karpenter-provider-oci v1.2.0
@oracle-contributor-agreement

Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application.

When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.

If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.

@oracle-contributor-agreement oracle-contributor-agreement Bot added OCA Required At least one contributor does not have an approved Oracle Contributor Agreement. and removed OCA Verified All contributors have signed the Oracle Contributor Agreement. labels Jun 17, 2026
Labels

OCA Required At least one contributor does not have an approved Oracle Contributor Agreement.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Karpenter fails to consider other Nodepools when one is out of capacity

2 participants