diff --git a/docs/public/installation.md b/docs/public/installation.md
index ea4444b9..495f8f76 100644
--- a/docs/public/installation.md
+++ b/docs/public/installation.md
@@ -1635,6 +1635,28 @@ Where:
 | `integrationTests.securityContext` | object | no | {} | The pod-level security attributes and common container settings for the OpenSearch integration tests pod. |
 | `integrationTests.priorityClassName` | string | no | "" | The priority class to be used by the OpenSearch integration tests pods. You should create the priority class beforehand. For more information about this feature, refer to [https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/). |
 
+## Resource Migration
+
+The resource migration job is a pre-install/pre-upgrade Helm hook that automatically detects and removes
+OpenSearch 1.x StatefulSets that are incompatible with OpenSearch 2.x. This is required when upgrading via
+ArgoCD, because ArgoCD's merge strategy cannot remove extra environment variables (such as `node.master`)
+from existing StatefulSets. The job deletes affected StatefulSets with `--cascade=orphan`, which preserves
+running pods while allowing Helm to recreate the StatefulSet with the correct 2.x spec.
+
+The job iterates over all configured OpenSearch StatefulSets (master, data, arbiter depending on deployment
+topology). If a StatefulSet does not exist or does not contain the `node.master` environment variable, it is
+skipped.
+
+| Parameter | Type | Mandatory | Default value | Description |
+|------------------------------------------------|---------|-----------|---------------|----------------------------------------------------------------------------------------------------------------------------|
+| `resourceMigration.enabled` | boolean | no | true | Enables the resource migration pre-install/pre-upgrade hook job. |
+| `resourceMigration.resources.requests.cpu` | string | no | 20m | The minimum number of CPUs the resource migration container should use. |
+| `resourceMigration.resources.requests.memory` | string | no | 64Mi | The minimum amount of memory the resource migration container should use. |
+| `resourceMigration.resources.limits.cpu` | string | no | 100m | The maximum number of CPUs the resource migration container should use. |
+| `resourceMigration.resources.limits.memory` | string | no | 256Mi | The maximum amount of memory the resource migration container should use. |
+| `resourceMigration.imagePullPolicy` | string | no | IfNotPresent | The image pull policy for the resource migration container. |
+| `resourceMigration.runAsNonRoot` | boolean | no | true | If `true`, applies restricted security context (non-root, read-only filesystem, drops all capabilities) to the job pod. |
+
 ### Tags Description
 
 This section contains information about integration test tags that can be used in order to test OpenSearch service. You can use the following tags:
@@ -2054,6 +2076,28 @@ If you need migrate to OpenSearch Service `1.x.x` (with OpenSearch 2.x) from pre
 * if `0.2.4` (or newest) version installed just proceed with upgrade.
 * if version before `0.2.4` installed, you need previously upgrade to version `0.2.4` to migrate security configuration to new format and then install required `1.x.x` version.
 
+**ArgoCD upgrades:**
+
+When upgrading from OpenSearch 1.x to 2.x via ArgoCD, the StatefulSet spec changes significantly (for example,
+the `node.master` environment variable is replaced by `node.roles`). ArgoCD's merge strategy cannot remove
+these extra environment variables from existing StatefulSets, which causes the upgrade to fail.
+
+To handle this automatically, the chart includes a **resource migration** pre-upgrade hook
+(`resourceMigration.enabled: true` by default). This hook inspects each OpenSearch StatefulSet for the
+`node.master` environment variable (a marker of the 1.x spec). If found, it deletes the StatefulSet with
+`--cascade=orphan`, preserving the running pods while allowing the new StatefulSet to be created with the
+correct 2.x spec.
+
+No manual intervention is required as long as `resourceMigration.enabled` is `true`. If you prefer to handle
+the StatefulSet cleanup manually, set `resourceMigration.enabled: false` and delete the affected StatefulSets
+yourself before upgrading:
+
+```bash
+kubectl -n <namespace> delete statefulset <statefulset-name> --cascade=orphan
+```
+
+For the full list of resource migration parameters, refer to the [Resource Migration](#resource-migration) section.
+
 ### Migration From OpenDistro Elasticsearch
 
 OpenSearch Service allows migration from OpenDistro Elasticsearch deployments.
diff --git a/docs/public/troubleshooting.md b/docs/public/troubleshooting.md
index dcffaab1..241842e7 100644
--- a/docs/public/troubleshooting.md
+++ b/docs/public/troubleshooting.md
@@ -212,6 +212,11 @@
   * [Stack trace](#stack-trace-33)
   * [How to solve](#how-to-solve-35)
   * [Recommendations](#recommendations-33)
+* [Upgrade Failed Due to Pre-Deploy Migration Hook](#upgrade-failed-due-to-pre-deploy-migration-hook)
+  * [Description](#description-36)
+  * [Stack trace](#stack-trace-34)
+  * [How to solve](#how-to-solve-36)
+  * [Recommendations](#recommendations-34)
 
 ## Cluster Health
 
@@ -1603,3 +1608,58 @@ If `opensearch.data.dedicatedPod.enabled: false`, master nodes also act as data
 If dedicated data pods are enabled, increase `opensearch.data.dedicatedPod.replicas`.
 
 Also keep index settings reasonable for new indexes: `index.number_of_shards` defaults to 1, while `index.number_of_replicas` defaults to 1, so unnecessary shard and replica growth should be avoided. If really required, OpenSearch settings can also be provided through `opensearch.config` during installation.
+
+## Upgrade Failed Due to Pre-Deploy Migration Hook
+
+### Description
+
+During an OpenSearch Service upgrade (especially from 2.x to 3.x), the Helm pre-deploy hook job `opensearch-migration-1x` may fail with a `BackoffLimitExceeded` error. This job is a Kubernetes Job that runs as a `pre-install`/`pre-upgrade` Helm hook and performs index migration for indices originally created on OpenSearch 1.x. If the cluster contains such indices, they must be reindexed before the upgrade to 3.x can proceed. When this migration fails, the entire Helm upgrade is blocked.
+
+In **ArgoCD** deployments, this appears as a failed PreSync hook:
+
+```text
+- Job/opensearch-migration-1x; Hook: PreSync; Phase: Failed
+  Sync Message: Job has reached the specified backoff limit
+```
+
+In **Helm** output, the error looks like:
+
+```text
+Error: UPGRADE FAILED: pre-upgrade hooks failed: 1 error occurred:
+    * job opensearch-migration-1x failed: BackoffLimitExceeded
+```
+
+### Stack trace
+
+```text
+Error: UPGRADE FAILED: pre-upgrade hooks failed: 1 error occurred:
+    * job opensearch-migration-1x failed: BackoffLimitExceeded
+```
+
+### How to solve
+
+1. **Check the migration Job logs** to find the root cause of the failure:
+
+   ```sh
+   kubectl logs -n <namespace> job/opensearch-migration-1x
+   ```
+
+   If the pod has already been cleaned up, look for the pod by label:
+
+   ```sh
+   kubectl get pods -n <namespace> -l component=migration
+   kubectl logs -n <namespace> <pod-name>
+   ```
+
+2. **Common failure reasons:**
+   - The cluster contains indices created on OpenSearch 1.x that block the upgrade to 3.x. The migrator attempts to reindex them but may fail due to insufficient resources, connectivity issues, or incompatible index settings.
+   - OpenSearch is not reachable from the migration pod (network or TLS issues).
+   - Insufficient permissions or missing secrets.
+
+3. **After identifying and resolving the issue**, retry the upgrade. If using ArgoCD, trigger a new sync. If using Helm directly, re-run the `helm upgrade` command.
+
+4. For detailed information about the index migration process, including how to run the migrator manually in dry-run mode, refer to the [Indices Migration](indices-migration.md) documentation.
+
+### Recommendations
+
+Before upgrading OpenSearch from 2.x to 3.x, run the migration tool in **dry-run mode** to identify any 1.x indices that would block the upgrade. Review the dry-run output and plan the migration during a maintenance window. See the [Indices Migration](indices-migration.md) guide for the full procedure.
diff --git a/operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-job.yaml b/operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-job.yaml
new file mode 100644
index 00000000..1c8e86ff
--- /dev/null
+++ b/operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-job.yaml
@@ -0,0 +1,88 @@
+{{- if .Values.resourceMigration.enabled }}
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: {{ .Release.Name }}-resource-migrator
+  labels:
+    app.kubernetes.io/instance: {{ .Release.Name }}
+  annotations:
+    "helm.sh/hook": pre-install,pre-upgrade
+    "helm.sh/hook-weight": "2"
+    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
+spec:
+  template:
+    metadata:
+      labels:
+        app.kubernetes.io/instance: {{ .Release.Name }}
+    spec:
+      serviceAccountName: {{ .Release.Name }}-resource-migrator
+      restartPolicy: OnFailure
+      {{- if .Values.resourceMigration.runAsNonRoot }}
+      securityContext:
+        runAsNonRoot: true
+        seccompProfile: { type: RuntimeDefault }
+      {{- end }}
+      containers:
+        - name: migrator
+          image: {{ template "kubectl.image" . }}
+          imagePullPolicy: {{ .Values.resourceMigration.imagePullPolicy | default "IfNotPresent" }}
+          command: ["/bin/sh","-c"]
+          args:
+            - |
+              set -euo pipefail
+
+              command -v jq >/dev/null 2>&1 || { echo "[resource-migrator] jq is required"; exit 1; }
+
+              KUBECTL="kubectl"
+              NS="{{ .Release.Namespace }}"
+              STATEFULSET_NAMES="{{ trim (include "opensearch.statefulsetNames" .) }}"
+
+              if [ -z "$STATEFULSET_NAMES" ]; then
+                echo "[resource-migrator] No statefulset names configured, nothing to do"
+                exit 0
+              fi
+
+              IFS=','; for STS_NAME in $STATEFULSET_NAMES; do
+                STS_NAME="$(echo "$STS_NAME" | xargs)"
+                [ -z "$STS_NAME" ] && continue
+
+                STS_JSON="$($KUBECTL -n "$NS" get statefulset "$STS_NAME" -o json 2>/dev/null || true)"
+                if [ -z "$STS_JSON" ]; then
+                  echo "[resource-migrator] StatefulSet $STS_NAME does not exist, skipping"
+                  continue
+                fi
+
+                # 'node.master' env is a marker of OpenSearch 1.x StatefulSets that must be
+                # recreated during upgrade to 2.x because ArgoCD merge cannot remove extra envs.
+                NODE_MASTER="$(printf '%s' "$STS_JSON" \
+                  | jq -r '[.spec.template.spec.containers[].env[]? | select(.name == "node.master")] | length' 2>/dev/null || echo "0")"
+
+                if [ "$NODE_MASTER" -gt 0 ]; then
+                  echo "[resource-migrator] StatefulSet $STS_NAME contains 'node.master' env (OpenSearch 1.x)"
+                  echo "[resource-migrator] Deleting StatefulSet $STS_NAME with --cascade=orphan"
+                  if $KUBECTL -n "$NS" delete statefulset "$STS_NAME" --cascade=orphan --ignore-not-found=true; then
+                    echo "[resource-migrator] StatefulSet $STS_NAME deleted successfully"
+                  else
+                    echo "[resource-migrator][ERROR] Failed to delete StatefulSet $STS_NAME"
+                    exit 1
+                  fi
+                else
+                  echo "[resource-migrator] StatefulSet $STS_NAME: no 'node.master' env found, skipping"
+                fi
+              done
+
+              echo "[resource-migrator] done"
+          resources:
+            limits:
+              cpu: {{ .Values.resourceMigration.resources.limits.cpu }}
+              memory: {{ .Values.resourceMigration.resources.limits.memory }}
+            requests:
+              cpu: {{ .Values.resourceMigration.resources.requests.cpu }}
+              memory: {{ .Values.resourceMigration.resources.requests.memory }}
+          {{- if .Values.resourceMigration.runAsNonRoot }}
+          securityContext:
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: true
+            capabilities: { drop: ["ALL"] }
+          {{- end }}
+{{- end }}
diff --git a/operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-rbac.yaml b/operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-rbac.yaml
new file mode 100644
index 00000000..d078b29c
--- /dev/null
+++ b/operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-rbac.yaml
@@ -0,0 +1,35 @@
+{{- if .Values.resourceMigration.enabled }}
+apiVersion: rbac.authorization.k8s.io/v1
+kind: Role
+metadata:
+  name: {{ .Release.Name }}-resource-migrator
+  labels:
+    app.kubernetes.io/instance: {{ .Release.Name }}
+  annotations:
+    "helm.sh/hook": pre-install,pre-upgrade
+    "helm.sh/hook-weight": "-190"
+    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
+rules:
+  - apiGroups: ["apps"]
+    resources: ["statefulsets"]
+    verbs: ["get","list","delete"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: RoleBinding
+metadata:
+  name: {{ .Release.Name }}-resource-migrator
+  labels:
+    app.kubernetes.io/instance: {{ .Release.Name }}
+  annotations:
+    "helm.sh/hook": pre-install,pre-upgrade
+    "helm.sh/hook-weight": "-180"
+    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
+subjects:
+  - kind: ServiceAccount
+    name: {{ .Release.Name }}-resource-migrator
+    namespace: {{ .Release.Namespace }}
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: Role
+  name: {{ .Release.Name }}-resource-migrator
+{{- end }}
diff --git a/operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-sa.yaml b/operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-sa.yaml
new file mode 100644
index 00000000..e2ad5c83
--- /dev/null
+++ b/operator/charts/helm/opensearch-service/templates/pre-deploy/resource-migrator-sa.yaml
@@ -0,0 +1,12 @@
+{{- if .Values.resourceMigration.enabled }}
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: {{ .Release.Name }}-resource-migrator
+  labels:
+    app.kubernetes.io/instance: {{ .Release.Name }}
+  annotations:
+    "helm.sh/hook": pre-install,pre-upgrade
+    "helm.sh/hook-weight": "-200"
+    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
+{{- end }}
diff --git a/operator/charts/helm/opensearch-service/values.yaml b/operator/charts/helm/opensearch-service/values.yaml
index b1f975ff..7a88c058 100644
--- a/operator/charts/helm/opensearch-service/values.yaml
+++ b/operator/charts/helm/opensearch-service/values.yaml
@@ -1183,6 +1183,18 @@ integrationTests:
   # ref: https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/
   priorityClassName: ""
 
+resourceMigration:
+  enabled: true
+  resources:
+    requests:
+      memory: 64Mi
+      cpu: 20m
+    limits:
+      memory: 256Mi
+      cpu: 100m
+  imagePullPolicy: IfNotPresent
+  runAsNonRoot: true
+
 groupMigration:
   enabled: true
   oldGroupPrefix: "qubership.org/"
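
The detection rule used by the resource-migrator job above (a StatefulSet counts as a 1.x leftover when any of its containers declares a `node.master` environment variable) can be previewed before an upgrade without deleting anything. The following is a minimal sketch and not part of the chart: the function names `has_node_master` and `report_candidates` are hypothetical, and it assumes `kubectl` is on the `PATH` with read access to StatefulSets in the target namespace. It reads env names through `kubectl`'s JSONPath output rather than `jq`, so it needs no extra tooling.

```sh
#!/bin/sh
# Hypothetical dry-run helper (an illustration, not shipped with the chart).
# KUBECTL is overridable, mirroring the KUBECTL variable in the job script.
KUBECTL="${KUBECTL:-kubectl}"

# Succeeds if any container of the StatefulSet declares an env var named
# 'node.master'. Fails if the StatefulSet is missing or has no such env var.
has_node_master() {
  _ns="$1"
  _sts="$2"
  _envs="$($KUBECTL -n "$_ns" get statefulset "$_sts" \
    -o jsonpath='{.spec.template.spec.containers[*].env[*].name}' 2>/dev/null)" || return 1
  for _env in $_envs; do
    [ "$_env" = "node.master" ] && return 0
  done
  return 1
}

# Prints one line per StatefulSet name in a comma-separated list.
# IFS is changed only while splitting and restored right after.
report_candidates() {
  _namespace="$1"
  _list="$2"
  _old_ifs="$IFS"
  IFS=','
  set -- $_list
  IFS="$_old_ifs"
  for _name in "$@"; do
    # Drop surrounding whitespace; skip empty entries such as "a,,b".
    _name="$(printf '%s' "$_name" | tr -d '[:space:]')"
    [ -z "$_name" ] && continue
    if has_node_master "$_namespace" "$_name"; then
      echo "$_name: would be deleted with --cascade=orphan (1.x spec)"
    else
      echo "$_name: skipped (missing or already 2.x)"
    fi
  done
}
```

After sourcing the file, a call such as `report_candidates my-namespace "opensearch,opensearch-data"` prints one line per name; the namespace and StatefulSet names here are placeholders. Unlike the loop in the job script, the sketch restores `IFS` after splitting, so later unquoted expansions are not affected by the comma separator.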