fix: handle cases where OCI is out of capacity #34
Signed-off-by: Aniek Gul <13356402+aniekgul@users.noreply.github.com>
feat: handle oci out of capacity issues
Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application. When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated. If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.
Signed-off-by: Aniek Gul <13356402+aniekgul@users.noreply.github.com>
feat: handle limit or quota exceeded cases as well
Signed-off-by: Aniek Gul <13356402+aniekgul@users.noreply.github.com>
fix: remove recursive read locks that cause deadlocks in ListInstanceTypes
Thank you for signing the OCA.
Conflict in pkg/cloudprovider/cloud_provider.go resolved by preserving the fork's IsSkippableLaunchError logic (superset of upstream's IsNoCapacityError — also handles service-limit/quota exhaustion). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
chore: sync upstream oracle/karpenter-provider-oci v1.2.0
Description
When a NodePool runs out of host capacity, OCI Karpenter previously returned a generic CloudProvider.CreateError from Create. Core Karpenter only deletes and reschedules a NodeClaim on an InsufficientCapacityError, so on a CreateError the NodeClaim stayed bound to the same (exhausted) NodePool/offering and retried in a tight loop, producing retry storms and 429 throttling, with no fallback to other capacity types or NodePools.

This change makes host-capacity exhaustion behave correctly across three layers:

1. CloudProvider.Create now returns cloudprovider.NewInsufficientCapacityError. This lets core Karpenter delete and reschedule the NodeClaim, enabling cross-NodePool fallback.
2. An unavailable-offerings cache (pkg/cache/unavailableofferings.go) records (shape, AD/zone, capacity-type) offerings observed to be out of capacity. LaunchInstance marks offerings unavailable on capacity errors, and the instance-type provider gates offering availability (Available=false) when listing instance types. This drives spot→on-demand fallback within a NodePool and prevents the scheduler from immediately re-selecting dead offerings until the entry expires. The TTL is configurable via --unavailable-offerings-ttl-seconds (default 180s / 3 minutes).
3. oci.IsRetryable now treats an HTTP 500 that is an "Out of host capacity" error as non-retryable, so the capacity-fallback logic handles it instead of the SDK retrying it, which amplified the host-capacity shortage and contributed to 429s.

Fixes #33
EDIT: Also added a fix for a deadlock issue, see aniekgul#3 for details.
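The deadlock fix itself lives in aniekgul#3 and is not reproduced here. As general background, Go's sync.RWMutex does not support recursive read locking: if a goroutine holding RLock calls RLock again while a writer is blocked in Lock, the second RLock waits behind the writer and the writer waits for the first RLock. A common way to avoid this, sketched below with hypothetical names (shapeCache, getLocked), is to take the lock once in the outer method and have inner helpers assume it is already held.

```go
package main

import (
	"fmt"
	"sync"
)

// shapeCache is a hypothetical stand-in for a provider that guards
// shared state with a sync.RWMutex.
type shapeCache struct {
	mu     sync.RWMutex
	shapes map[string]int // shape name -> OCPU count (illustrative data)
}

// Get is the public accessor; it acquires the read lock itself.
func (c *shapeCache) Get(name string) (int, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.getLocked(name)
}

// getLocked assumes the caller already holds c.mu (read or write).
func (c *shapeCache) getLocked(name string) (int, bool) {
	v, ok := c.shapes[name]
	return v, ok
}

// List holds the read lock once and calls getLocked, not Get.
// Calling Get here would re-acquire RLock while already holding it,
// which can deadlock when a writer is waiting in Lock.
func (c *shapeCache) List(names []string) []int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make([]int, 0, len(names))
	for _, n := range names {
		if v, ok := c.getLocked(n); ok {
			out = append(out, v)
		}
	}
	return out
}

// Set takes the write lock.
func (c *shapeCache) Set(name string, ocpus int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.shapes[name] = ocpus
}

func main() {
	c := &shapeCache{shapes: map[string]int{}}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(2)
		// Writers and readers interleave without nested read locks.
		go func(n int) { defer wg.Done(); c.Set("shape-a", n) }(i)
		go func() { defer wg.Done(); c.List([]string{"shape-a", "shape-b"}) }()
	}
	wg.Wait()
	fmt.Println(len(c.List([]string{"shape-a", "shape-b"})))
}
```

Whether ListInstanceTypes was fixed with this exact pattern is an assumption; the sketch only shows one conventional way to remove nested RLock calls.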
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Unit and integration tests added to code.
Live cluster test
In an OKE cluster in ca-toronto-1.
Setup
Create two node pools, one with an instance type that is highly unlikely to be available and another with lots of room.
Execution
Using the current Karpenter version 1.1.0, create a deployment that targets both:
The pod will stay pending.
Logs from the current Karpenter version (only important fields included):
Delete the deployment and update to the version from this branch.
Recreate the deployment and now it should schedule on a new node from the second pool.
Logs from this branch's version:
Checklist: