| copyright |
|
||
|---|---|---|---|
| lastupdated | 2026-06-03 | ||
| keywords | HA for {{site.data.keyword.short_name}}, DR for {{site.data.keyword.short_name}}, {{site.data.keyword.short_name}} recovery time objective, {{site.data.keyword.short_name}} recovery point objective | ||
| subcollection | inference |
{{site.data.keyword.attribute-definition-list}}
{: #ilab-ha-dr}
High availability{: term} (HA) is the ability for a service to remain operational and accessible in the presence of unexpected failures. Disaster recovery{: term} is the process of recovering the service instance to a working state. {: shortdesc}
{{site.data.keyword.instructlab_short}} is a highly available regional service that is designed to remain available during a zonal outage. {{site.data.keyword.instructlab_short}} is designed to meet the Service Level Objectives (SLO) with the Standard plan.
For more information about the available region and data center locations, see Service and infrastructure availability by location.
{: #ha-architecture}
{: caption="Architecture diagram" caption-side="bottom"}{: external download="high-availability-architecture-light.svg"}
{: #ha-features}
{{site.data.keyword.instructlab_short}} supports the following high availability features:
| Feature | Description | Consideration |
|---|---|---|
| Global load balancing | When a node or availability zone fails, the service continues to run with API requests being routed through a global load balancer to the surviving HA instance nodes. | There may be a short period of time (seconds) between the outage and the global load balancer recognizing the failure, during which time, requests may be sent to the failed instance. |
| Active model alignment requests | When a node or availability zone fails, the service continues to run with API requests being routed through a global load balancer to the surviving HA instance nodes. Active synthetic data generation jobs and active model alignment jobs that are running on nodes within the failed zone are automatically retried on nodes in a different zone. In certain regions, capacity constraints mean that model alignment nodes are deployed within a single zone; in those regions, active model alignment jobs are automatically retried when the zone is restored. | There may be a short period of time (seconds) between the outage and the global load balancer recognizing the failure, during which time, requests may be sent to the failed instance. |
| Active inference requests | Active requests are queued within {{site.data.keyword.instructlab_short}}, so even if the model is not available, the request is still processed. If requests are canceled on the client side, they continue to be processed on the backend and can be retrieved later. | N/A |
{: caption="HA features for {{site.data.keyword.instructlab_short}}" caption-side="bottom"}
{: #dr-features}
{{site.data.keyword.instructlab_short}} supports the following disaster recovery features:
| Feature | Description | Consideration |
|---|---|---|
| {{site.data.keyword.short_name}} follows a regional deployment model. | In the case of a regional failure, APIs could become unavailable until the region is restored. | You can use other active regions where {{site.data.keyword.short_name}} is deployed to generate synthetic data and run model alignments until the region is restored. |
| {{site.data.keyword.cos_short}} replication for model alignment | {{site.data.keyword.short_name}} persists all generated synthetic data (SDG) and aligned models to the client-provided object storage bucket. See the {{site.data.keyword.cos_short}} service documentation for disaster recovery strategies. | You can use bucket replication to replicate taxonomy content, generated synthetic data, and aligned models to a different region. For more information, see Understanding high availability and disaster recovery for {{site.data.keyword.cos_full}}. |
{: caption="DR features for {{site.data.keyword.instructlab_short}}" caption-side="bottom"}
{: #features-for-disaster-recovery}
The DR steps must be practiced regularly. As you build your plan, consider the following failure scenarios and resolutions.
| Failure | Resolution |
|---|---|
| Hardware failure (single point) | IBM provides an instance that is resilient to a single point of hardware failure within a zone. No configuration required. |
| Zone failure | IBM provides an instance that is resilient to a zone failure. No configuration required. |
| Model alignment data corruption | Restore an uncorrupted point-in-time version of the client {{site.data.keyword.cos_short}} bucket contents from backup. Restoration of {{site.data.keyword.short_name}} is handled by the service team. |
{: caption="DR scenarios" caption-side="bottom"}
{: #feature-responsibilities}
It is your responsibility to continuously test your plan for HA and DR.
Interruptions in network connectivity and short periods of unavailability of a service might occur. It is your responsibility to make sure that application source code includes client availability retry logic to maintain high availability of the application. {: note}
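The client retry logic mentioned in the note could look something like the following minimal sketch, which retries a call with exponential backoff and jitter. The `TransientServiceError` class, the `call_with_retry` helper, and all default values are hypothetical names chosen for illustration and are not part of any {{site.data.keyword.cloud_notm}} SDK; map them onto whatever timeout or 5xx errors your own client raises.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientServiceError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, HTTP 502/503/504)."""


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientServiceError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # The delay ceiling doubles on each attempt, bounded by max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: a random wait up to the ceiling spreads out retries
            # from many clients that failed at the same moment.
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("unreachable")
```

Wrap only idempotent requests this way, and keep `max_delay` short enough that a brief failover window of a few seconds is absorbed without stalling the application.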
Use the following checklists associated with each feature to help you create and practice your plan.
{{site.data.keyword.cos_short}} replication for model alignment
- Verify that a replication policy is in place from the primary bucket to the backup bucket.
- Verify that a sample taxonomy file is synced within the expected synchronization time from the source to the primary bucket.
- Verify that a sample synthetic data file is synced within the expected synchronization time from the source to the primary bucket.
- Verify that a sample aligned model file is synced within the expected synchronization time from the source to the primary bucket.
Example checklist for {{site.data.keyword.cos_short}} replication for model alignment:
- [ ] Create a primary Red Hat AI Inference instance in primary region.
- [ ] Create a primary Cloud Object Storage bucket in primary region.
- [ ] Create a secondary Red Hat AI Inference instance in secondary region.
- [ ] Create a secondary object storage bucket in secondary region.
- [ ] Enable object replication from the primary object storage bucket to the secondary object storage bucket.
- [ ] Upload the taxonomy to the primary object storage bucket and create a taxonomy object in the primary Red Hat AI Inference instance.
- [ ] Ensure that the taxonomy object in the primary object storage bucket replicates to the secondary region.
- [ ] Generate training data from the taxonomy in the primary Red Hat AI Inference instance.
- [ ] Ensure that the training data file replicates from the primary object storage bucket to the secondary object storage bucket.
- [ ] Fine-tune a model in the primary Red Hat AI Inference instance.
- [ ] Ensure that the model alignment file replicates from the primary object storage bucket to the secondary object storage bucket.
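The "ensure ... replicates" steps in the checklist can be automated by polling the secondary bucket until a sample object appears and recording how long that took. The following is a minimal sketch under stated assumptions: `wait_for_replica` and its parameters are hypothetical, and `object_exists` is a callable that you supply, for example a thin wrapper around an object metadata lookup in your object storage client.

```python
import time
from typing import Callable, Optional


def wait_for_replica(
    object_exists: Callable[[str], bool],
    key: str,
    timeout_s: float = 300.0,
    poll_interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until `key` is visible in the secondary bucket.

    Returns the elapsed seconds when the object is found, or None if the
    timeout passes first. `object_exists` is an assumption: it should return
    True once the replicated object can be read from the secondary bucket.
    """
    start = clock()
    while True:
        if object_exists(key):
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_interval_s)
```

Running this check once each for a sample taxonomy file, a synthetic data file, and an aligned model file gives an observed replication lag that you can compare against your expected synchronization time.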
For more information about responsibility ownership between you and {{site.data.keyword.cloud_notm}} for {{site.data.keyword.instructlab_short}}, see Your responsibilities.
{: #rto-rpo-features}
| Feature | RTO and RPO |
|---|---|
| Object storage replication with backup instance | RTO = minutes, RPO = near 0 |
{: caption="RTO/RPO features for {{site.data.keyword.instructlab_short}}" caption-side="bottom"}
{: #change-management-hadr}
Change management includes tasks such as upgrades, configuration changes, and deletion.
Grant users and processes the IAM roles and actions with the least privilege that is required for their work. For more information, see How can I prevent accidental deletion of services?. {: tip}
Consider creating a manual backup of your taxonomy, generated data, and aligned models before upgrading to a new version of {{site.data.keyword.instructlab_short}}.
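One way to confirm that a manual backup is complete and unchanged is to record a checksum manifest when you take the backup and compare it again before you rely on it. The sketch below assumes that the backup was downloaded to a local directory; `build_manifest` and `changed_files` are hypothetical helper names, not part of any service tooling.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def build_manifest(backup_dir: str) -> Dict[str, str]:
    """Map each file's path (relative to backup_dir) to its SHA-256 digest."""
    root = Path(backup_dir)
    manifest: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so large model files are not loaded
                # into memory all at once.
                for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                    digest.update(chunk)
            manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def changed_files(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Return paths from `before` that are missing or differ in `after`."""
    return sorted(key for key in before if after.get(key) != before[key])
```

Store the manifest alongside the backup; an empty result from `changed_files` indicates that every file recorded at backup time is still present with the same content.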
{: #ibm-regional-failure}
If {{site.data.keyword.IBM_notm}} can’t restore the service instance, you must restore the service as described in Planning for disaster recovery.
{: #ibm-service-maintenance}
- All upgrades follow {{site.data.keyword.IBM_notm}} service best practices, including recovery plans and rollback processes.
- Regular maintenance might cause short interruptions, mitigated by client availability retry logic.
- Changes are rolled out sequentially, region by region, and zone by zone within a region. {{site.data.keyword.IBM_notm}} reverts updates at the first sign of a defect.
- Complex changes are enabled and disabled with feature flags to control exposure.
- Changes that impact customer workloads are detailed in {{site.data.keyword.cloud_notm}} notifications.
For more information about planned maintenance, announcements, and release notes that impact this service, see the following links.