From 7bd3c87855fd5ca3b53b4e7c063010fe3e61d9fe Mon Sep 17 00:00:00 2001
From: mrMigles
Date: Mon, 9 Mar 2026 17:58:59 +0500
Subject: [PATCH 1/2] feat: add resource migration job and troubleshooting section for upgrade failures

Introduced a new resource migration job to handle OpenSearch StatefulSets during upgrades, along with a troubleshooting section in the documentation addressing potential upgrade failures due to pre-deploy migration hooks. This includes detailed descriptions, stack traces, solutions, and recommendations for successful upgrades.
---
 docs/public/troubleshooting.md                | 60 +++++++++++++
 .../pre-deploy/resource-migrator-job.yaml     | 88 +++++++++++++++++++
 .../pre-deploy/resource-migrator-rbac.yaml    | 35 ++++++++
 .../pre-deploy/resource-migrator-sa.yaml      | 12 +++
 .../helm/opensearch-service/values.yaml       | 12 +++
 5 files changed, 207 insertions(+)
 create mode 100644 operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-job.yaml
 create mode 100644 operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-rbac.yaml
 create mode 100644 operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-sa.yaml

diff --git a/docs/public/troubleshooting.md b/docs/public/troubleshooting.md
index 125d1a02..4cc926d4 100644
--- a/docs/public/troubleshooting.md
+++ b/docs/public/troubleshooting.md
@@ -212,6 +212,11 @@
     * [Stack trace](#stack-trace-33)
     * [How to solve](#how-to-solve-35)
     * [Recommendations](#recommendations-34)
+  * [Upgrade Failed Due to Pre-Deploy Migration Hook](#upgrade-failed-due-to-pre-deploy-migration-hook)
+    * [Description](#description-36)
+    * [Stack trace](#stack-trace-34)
+    * [How to solve](#how-to-solve-36)
+    * [Recommendations](#recommendations-35)
 
 ## Cluster Health
 
@@ -1565,3 +1570,58 @@ OpenSearch supports updating cluster settings through `PUT _cluster/settings`, d
 ### Recommendations
 
 At installation or upgrade time, prevent this issue by keeping shard count under control and by sizing the cluster correctly. If `opensearch.data.dedicatedPod.enabled: false`, master nodes also act as data nodes, so increase `opensearch.master.replicas`. If dedicated data pods are enabled, increase `opensearch.data.dedicatedPod.replicas`. Also keep index settings reasonable for new indexes: `index.number_of_shards` defaults to 1, while `index.number_of_replicas` defaults to 1, so unnecessary shard and replica growth should be avoided. If really required, OpenSearch settings can also be provided through `opensearch.config` during installation.
+
+## Upgrade Failed Due to Pre-Deploy Migration Hook
+
+### Description
+
+During an OpenSearch Service upgrade (especially from 2.x to 3.x), the Helm pre-deploy hook job `opensearch-migration-1x` may fail with a `BackoffLimitExceeded` error. This job is a Kubernetes Job that runs as a `pre-install`/`pre-upgrade` Helm hook and performs index migration for indices originally created on OpenSearch 1.x. If the cluster contains such indices, they must be reindexed before the upgrade to 3.x can proceed. When this migration fails, the entire Helm upgrade is blocked.
+
+In **ArgoCD** deployments, this appears as a failed PreSync hook:
+
+```text
+- Job/opensearch-migration-1x; Hook: PreSync; Phase: Failed
+  Sync Message: Job has reached the specified backoff limit
+```
+
+In **Helm** output, the error looks like:
+
+```text
+Error: UPGRADE FAILED: pre-upgrade hooks failed: 1 error occurred:
+    * job opensearch-migration-1x failed: BackoffLimitExceeded
+```
+
+### Stack trace
+
+```text
+Error: UPGRADE FAILED: pre-upgrade hooks failed: 1 error occurred:
+    * job opensearch-migration-1x failed: BackoffLimitExceeded
+```
+
+### How to solve
+
+1. **Check the migration Job logs** to find the root cause of the failure:
+
+    ```sh
+    kubectl logs -n <namespace> job/opensearch-migration-1x
+    ```
+
+    If the pod has already been cleaned up, look for the pod by label:
+
+    ```sh
+    kubectl get pods -n <namespace> -l component=migration
+    kubectl logs -n <namespace> <pod-name>
+    ```
+
+2. **Common failure reasons:**
+    - The cluster contains indices created on OpenSearch 1.x that block the upgrade to 3.x. The migrator attempts to reindex them but may fail due to insufficient resources, connectivity issues, or incompatible index settings.
+    - OpenSearch is not reachable from the migration pod (network or TLS issues).
+    - Insufficient permissions or missing secrets.
+
+3. **After identifying and resolving the issue**, retry the upgrade. If using ArgoCD, trigger a new sync. If using Helm directly, re-run the `helm upgrade` command.
+
+4. For detailed information about the index migration process, including how to run the migrator manually in dry-run mode, refer to the [Indices Migration](indices-migration.md) documentation.
+
+### Recommendations
+
+Before upgrading OpenSearch from 2.x to 3.x, run the migration tool in **dry-run mode** to identify any 1.x indices that would block the upgrade. Review the dry-run output and plan the migration during a maintenance window. See the [Indices Migration](indices-migration.md) guide for the full procedure.
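
Note on the troubleshooting steps added above: the individual `kubectl` calls can be gathered into one helper when a failed hook has to be triaged quickly. The following is a minimal sketch and is not part of this patch. The function name `collect_migration_diagnostics` and the `KUBECTL` and `JOB_NAME` override variables are illustrative assumptions; the Job name `opensearch-migration-1x` and the `component=migration` label are taken from the documentation text in the diff.

```shell
#!/bin/sh
# Sketch only: gather diagnostics for a failed pre-upgrade migration Job.
# KUBECTL and JOB_NAME are hypothetical overrides, not chart parameters.
KUBECTL="${KUBECTL:-kubectl}"
JOB_NAME="${JOB_NAME:-opensearch-migration-1x}"

# Usage: collect_migration_diagnostics <namespace>
collect_migration_diagnostics() {
  ns="$1"

  echo "== Job status =="
  $KUBECTL -n "$ns" get job "$JOB_NAME" -o wide || true

  echo "== Job conditions and events =="
  $KUBECTL -n "$ns" describe job "$JOB_NAME" || true

  echo "== Migration pods (label taken from the troubleshooting text) =="
  $KUBECTL -n "$ns" get pods -l component=migration || true

  echo "== Last log lines of the Job =="
  $KUBECTL -n "$ns" logs "job/$JOB_NAME" --tail=200 || true
}
```

Each call is followed by `|| true` so that a missing Job or an already deleted pod does not stop the remaining checks. After sourcing the snippet, run `collect_migration_diagnostics <namespace>` with the namespace of the release.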
diff --git a/operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-job.yaml b/operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-job.yaml
new file mode 100644
index 00000000..1c8e86ff
--- /dev/null
+++ b/operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-job.yaml
@@ -0,0 +1,88 @@
+{{- if .Values.resourceMigration.enabled }}
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: {{ .Release.Name }}-resource-migrator
+  labels:
+    app.kubernetes.io/instance: {{ .Release.Name }}
+  annotations:
+    "helm.sh/hook": pre-install,pre-upgrade
+    "helm.sh/hook-weight": "2"
+    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
+spec:
+  template:
+    metadata:
+      labels:
+        app.kubernetes.io/instance: {{ .Release.Name }}
+    spec:
+      serviceAccountName: {{ .Release.Name }}-resource-migrator
+      restartPolicy: OnFailure
+      {{- if .Values.resourceMigration.runAsNonRoot }}
+      securityContext:
+        runAsNonRoot: true
+        seccompProfile: { type: RuntimeDefault }
+      {{- end }}
+      containers:
+        - name: migrator
+          image: {{ template "kubectl.image" . }}
+          imagePullPolicy: {{ .Values.resourceMigration.imagePullPolicy | default "IfNotPresent" }}
+          command: ["/bin/sh","-c"]
+          args:
+            - |
+              set -euo pipefail
+
+              command -v jq >/dev/null 2>&1 || { echo "[resource-migrator] jq is required"; exit 1; }
+
+              KUBECTL="kubectl"
+              NS="{{ .Release.Namespace }}"
+              STATEFULSET_NAMES="{{ trim (include "opensearch.statefulsetNames" .) }}"
+
+              if [ -z "$STATEFULSET_NAMES" ]; then
+                echo "[resource-migrator] No statefulset names configured, nothing to do"
+                exit 0
+              fi
+
+              IFS=','; for STS_NAME in $STATEFULSET_NAMES; do
+                STS_NAME="$(echo "$STS_NAME" | xargs)"
+                [ -z "$STS_NAME" ] && continue
+
+                STS_JSON="$($KUBECTL -n "$NS" get statefulset "$STS_NAME" -o json 2>/dev/null || true)"
+                if [ -z "$STS_JSON" ]; then
+                  echo "[resource-migrator] StatefulSet $STS_NAME does not exist, skipping"
+                  continue
+                fi
+
+                # 'node.master' env is a marker of OpenSearch 1.x StatefulSets that must be
+                # recreated during upgrade to 2.x because ArgoCD merge cannot remove extra envs.
+                NODE_MASTER="$(printf '%s' "$STS_JSON" \
+                  | jq -r '[.spec.template.spec.containers[].env[]? | select(.name == "node.master")] | length' 2>/dev/null || echo "0")"
+
+                if [ "$NODE_MASTER" -gt 0 ]; then
+                  echo "[resource-migrator] StatefulSet $STS_NAME contains 'node.master' env (OpenSearch 1.x)"
+                  echo "[resource-migrator] Deleting StatefulSet $STS_NAME with --cascade=orphan"
+                  if $KUBECTL -n "$NS" delete statefulset "$STS_NAME" --cascade=orphan --ignore-not-found=true; then
+                    echo "[resource-migrator] StatefulSet $STS_NAME deleted successfully"
+                  else
+                    echo "[resource-migrator][ERROR] Failed to delete StatefulSet $STS_NAME"
+                    exit 1
+                  fi
+                else
+                  echo "[resource-migrator] StatefulSet $STS_NAME: no 'node.master' env found, skipping"
+                fi
+              done
+
+              echo "[resource-migrator] done"
+          resources:
+            limits:
+              cpu: {{ .Values.resourceMigration.resources.limits.cpu }}
+              memory: {{ .Values.resourceMigration.resources.limits.memory }}
+            requests:
+              cpu: {{ .Values.resourceMigration.resources.requests.cpu }}
+              memory: {{ .Values.resourceMigration.resources.requests.memory }}
+          {{- if .Values.resourceMigration.runAsNonRoot }}
+          securityContext:
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: true
+            capabilities: { drop: ["ALL"] }
+          {{- end }}
+{{- end }}
diff --git a/operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-rbac.yaml b/operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-rbac.yaml
new file mode 100644
index 00000000..d078b29c
--- /dev/null
+++ b/operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-rbac.yaml
@@ -0,0 +1,35 @@
+{{- if .Values.resourceMigration.enabled }}
+apiVersion: rbac.authorization.k8s.io/v1
+kind: Role
+metadata:
+  name: {{ .Release.Name }}-resource-migrator
+  labels:
+    app.kubernetes.io/instance: {{ .Release.Name }}
+  annotations:
+    "helm.sh/hook": pre-install,pre-upgrade
+    "helm.sh/hook-weight": "-190"
+    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
+rules:
+  - apiGroups: ["apps"]
+    resources: ["statefulsets"]
+    verbs: ["get","list","delete"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: RoleBinding
+metadata:
+  name: {{ .Release.Name }}-resource-migrator
+  labels:
+    app.kubernetes.io/instance: {{ .Release.Name }}
+  annotations:
+    "helm.sh/hook": pre-install,pre-upgrade
+    "helm.sh/hook-weight": "-180"
+    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
+subjects:
+  - kind: ServiceAccount
+    name: {{ .Release.Name }}-resource-migrator
+    namespace: {{ .Release.Namespace }}
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: Role
+  name: {{ .Release.Name }}-resource-migrator
+{{- end }}
diff --git a/operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-sa.yaml b/operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-sa.yaml
new file mode 100644
index 00000000..e2ad5c83
--- /dev/null
+++ b/operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-sa.yaml
@@ -0,0 +1,12 @@
+{{- if .Values.resourceMigration.enabled }}
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: {{ .Release.Name }}-resource-migrator
+  labels:
+    app.kubernetes.io/instance: {{ .Release.Name }}
+  annotations:
+    "helm.sh/hook": pre-install,pre-upgrade
+    "helm.sh/hook-weight": "-200"
+    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
+{{- end }}
diff --git a/operator/charts/helm/opensearch-service/values.yaml b/operator/charts/helm/opensearch-service/values.yaml
index 80cf71da..d4da4692 100644
--- a/operator/charts/helm/opensearch-service/values.yaml
+++ b/operator/charts/helm/opensearch-service/values.yaml
@@ -1171,6 +1171,18 @@ integrationTests:
   # ref: https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/
   priorityClassName: ""
 
+resourceMigration:
+  enabled: true
+  resources:
+    requests:
+      memory: 64Mi
+      cpu: 20m
+    limits:
+      memory: 256Mi
+      cpu: 100m
+  imagePullPolicy: IfNotPresent
+  runAsNonRoot: true
+
 groupMigration:
   enabled: true
   oldGroupPrefix: "qubership.org/"

From 34a4b615613152512d942be9ce4a558bfd4cdcba Mon Sep 17 00:00:00 2001
From: mrMigles
Date: Mon, 9 Mar 2026 18:08:02 +0500
Subject: [PATCH 2/2] docs: add resource migration section to installation guide for OpenSearch upgrades

Introduced a new section detailing the resource migration job that automatically handles the removal of incompatible OpenSearch 1.x StatefulSets during upgrades to 2.x. This section explains the job's functionality, parameters, and how it integrates with ArgoCD to ensure a smooth upgrade process without manual intervention.
---
 docs/public/installation.md | 44 +++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/docs/public/installation.md b/docs/public/installation.md
index d107a112..eb6ad3d2 100644
--- a/docs/public/installation.md
+++ b/docs/public/installation.md
@@ -1627,6 +1627,28 @@ Where:
 | `integrationTests.securityContext` | object | no | {} | The pod-level security attributes and common container settings for the OpenSearch integration tests pod. |
 | `integrationTests.priorityClassName` | string | no | "" | The priority class to be used by the OpenSearch integration tests pods. You should create the priority class beforehand. For more information about this feature, refer to [https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/). |
 
+## Resource Migration
+
+The resource migration job is a pre-install/pre-upgrade Helm hook that automatically detects and removes
+OpenSearch 1.x StatefulSets that are incompatible with OpenSearch 2.x. This is required when upgrading via
+ArgoCD, because ArgoCD's merge strategy cannot remove extra environment variables (such as `node.master`)
+from existing StatefulSets. The job deletes affected StatefulSets with `--cascade=orphan`, which preserves
+running pods while allowing Helm to recreate the StatefulSet with the correct 2.x spec.
+
+The job iterates over all configured OpenSearch StatefulSets (master, data, arbiter depending on deployment
+topology). If a StatefulSet does not exist or does not contain the `node.master` environment variable, it is
+skipped.
+
+| Parameter | Type | Mandatory | Default value | Description |
+|------------------------------------------------|---------|-----------|---------------|----------------------------------------------------------------------------------------------------------------------------|
+| `resourceMigration.enabled` | boolean | no | true | Enables the resource migration pre-install/pre-upgrade hook job. |
+| `resourceMigration.resources.requests.cpu` | string | no | 20m | The minimum number of CPUs the resource migration container should use. |
+| `resourceMigration.resources.requests.memory` | string | no | 64Mi | The minimum amount of memory the resource migration container should use. |
+| `resourceMigration.resources.limits.cpu` | string | no | 100m | The maximum number of CPUs the resource migration container should use. |
+| `resourceMigration.resources.limits.memory` | string | no | 256Mi | The maximum amount of memory the resource migration container should use. |
+| `resourceMigration.imagePullPolicy` | string | no | IfNotPresent | The image pull policy for the resource migration container. |
+| `resourceMigration.runAsNonRoot` | boolean | no | true | If `true`, applies a restricted security context (non-root, read-only root filesystem, all capabilities dropped) to the job pod. |
+
 ### Tags Description
 
 This section contains information about integration test tags that can be used in order to test OpenSearch service. You can use the following tags:
@@ -2046,6 +2068,28 @@ If you need migrate to OpenSearch Service `1.x.x` (with OpenSearch 2.x) from pre
 * if `0.2.4` (or newest) version installed just proceed with upgrade.
 * if version before `0.2.4` installed, you need previously upgrade to version `0.2.4` to migrate security configuration to new format and then install required `1.x.x` version.
 
+**ArgoCD upgrades:**
+
+When upgrading from OpenSearch 1.x to 2.x via ArgoCD, the StatefulSet spec changes significantly (for example,
+the `node.master` environment variable is replaced by `node.roles`). ArgoCD's merge strategy cannot remove
+these extra environment variables from existing StatefulSets, which causes the upgrade to fail.
+
+To handle this automatically, the chart includes a **resource migration** pre-upgrade hook
+(`resourceMigration.enabled: true` by default). This hook inspects each OpenSearch StatefulSet for the
+`node.master` environment variable (a marker of the 1.x spec). If found, it deletes the StatefulSet with
+`--cascade=orphan`, preserving the running pods while allowing the new StatefulSet to be created with the
+correct 2.x spec.
+
+No manual intervention is required as long as `resourceMigration.enabled` is `true`. If you prefer to handle
+the StatefulSet cleanup manually, set `resourceMigration.enabled: false` and delete the affected StatefulSets
+yourself before upgrading:
+
+```bash
+kubectl -n <namespace> delete statefulset <statefulset-name> --cascade=orphan
+```
+
+For the full list of resource migration parameters, refer to the [Resource Migration](#resource-migration) section.
+
 ### Migration From OpenDistro Elasticsearch
 
 OpenSearch Service allows migration from OpenDistro Elasticsearch deployments.
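
Note on the manual cleanup path documented in the ArgoCD upgrades paragraph above: when `resourceMigration.enabled` is `false` and several StatefulSets are affected, the single `kubectl delete` command can be wrapped in a loop. The following is a minimal sketch under stated assumptions and is not part of this patch. The function name `orphan_delete_statefulsets` and the `KUBECTL` override are hypothetical; the comma-separated name list and the `delete` flags mirror the hook script in PATCH 1/2.

```shell
#!/bin/sh
# Sketch only: delete several StatefulSets with --cascade=orphan so that
# their pods keep running. KUBECTL is a hypothetical override variable.
KUBECTL="${KUBECTL:-kubectl}"

# Usage: orphan_delete_statefulsets <namespace> <name1,name2,...>
orphan_delete_statefulsets() {
  ns="$1"
  names="$2"
  old_ifs="$IFS"
  IFS=','                              # split the name list on commas
  for name in $names; do
    IFS="$old_ifs"                     # restore default splitting inside the loop
    name="$(echo "$name" | xargs)"     # trim surrounding whitespace
    [ -z "$name" ] && continue
    echo "Deleting StatefulSet $name (pods are left running)"
    $KUBECTL -n "$ns" delete statefulset "$name" \
      --cascade=orphan --ignore-not-found=true || return 1
  done
  IFS="$old_ifs"
}
```

Unlike the hook, this sketch does not check for the `node.master` environment variable, so pass only the StatefulSets that still carry the 1.x spec. The names to pass depend on the deployment topology and are not defined here.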