
v0.10.4 deploy cycle: 4 blockers + 2 followups (audio loss, redis tolerations, migrations Job hook, capacityReserve falsy-zero) #272


[PLATFORM]

Context

While deploying v0.10.4 to a Helm-on-LKE staging cluster (with the bin-pack vexa.ai/pool taint model) and exercising it against real Zoom Web meetings, we hit a stack of issues that each independently broke or risked breaking the deploy. Filing them together since they all surface during the same upgrade-and-smoke cycle.

Severity scale: 🔴 blocker, 🟡 important, 🟢 nice-to-have.


🔴 1. Audio recording lost on short / abrupt-stop Zoom meetings

Symptom: A Zoom Web meeting (id 11056) with 29s of audio captured: bot was admitted, PulseAudio capture flowing, recording started locally to /tmp/recording_<id>_<uid>.wav. User clicked stop. [Delayed Stop] Waiting 90s for container... elapsed → meeting-api force-finalized → bot pod deleted → 0 objects in S3, 0 rows in recordings/media_files. All audio lost.

Root cause: Zoom (both Web and SDK) never had periodic chunk upload — RecordingService only uploads on performGracefulLeave(). With the 30s SHUTDOWN_TIMEOUT_MS racing against earlier cleanup steps (Zoom UI leave + voice-agent + per-speaker pipeline + PulseAudio sink + mux), the audio upload either never starts or gets cut off.

Fix existed and was reverted:

  • 58ba53e "feat: incremental audio recording — upload segments every 10s to MinIO" added RecordingService.startIncremental(url, token) for both Zoom paths.
  • 24e0641 "revert: remove experimental resource optimization commits" reverted it because "incremental recording was never tested in a live meeting".
  • Preserved on optimization branch; never re-applied.

Asks:

  1. Re-apply the 58ba53e bot-side incremental upload. A straight cherry-pick is non-trivial (recordings.py, profiles.yaml, and tests3/Makefile conflict with the later commits 87e9962, 0a318bc, and 43998c1); the cleanest path is a re-implementation as a focused PR.
  2. Reorder performGracefulLeave() so the audio upload happens FIRST (immediately after captures stop), before the UI leave, voice-agent, pipeline, and PulseAudio cleanup. Then even if the 30s budget is consumed by hangs elsewhere, the recording is already durable.
  3. Google Meet and Teams already have MediaRecorder timeslice + per-chunk upload (referenced in googlemeet/recording.ts as "Pack B"; see #218); Zoom should match.

Reproducer: join a Zoom meeting and click Stop within 30s. Compare to a meeting that runs ≥4 min, which uploads a single chunk on natural shutdown.


🔴 2. Chart deployment-redis.yaml + job-migrations.yaml dropped nodeSelector/tolerations/affinity blocks in 0.10.4

Symptom: Fresh helm upgrade to 0.10.4 left redis Pending forever in any cluster that uses node taints. All nodes tainted (e.g. vexa.ai/pool=temp:NoSchedule); redis pod doesn't tolerate any of them; cluster-autoscaler reports NotTriggerScaleUp: 3 node(s) had untolerated taint(s).

Diff vs 0.10.3:

diff /tmp/vexa-0.10.3/templates/deployment-redis.yaml /tmp/vexa-0.10.4/templates/deployment-redis.yaml
-      {{- with .Values.global.nodeSelector }}
-      nodeSelector: {{- toYaml . | nindent 8 }}
-      {{- end }}
-      {{- with .Values.global.tolerations }}
-      tolerations: {{- toYaml . | nindent 8 }}
-      {{- end }}
-      {{- with .Values.global.affinity }}
-      affinity: {{- toYaml . | nindent 8 }}
-      {{- end }}

The same delta appears in deployment-pgbouncer.yaml. job-migrations.yaml never had these blocks but fails the same way on clusters where the nodes are tainted.

Workaround we applied: local patch of vendored .tgz to restore the blocks.

Ask: restore the nodeSelector/tolerations/affinity blocks, or expose redis.tolerations / migrations.tolerations as values fields; one possible template shape is sketched below.
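For the values-field variant, a hedged sketch of the template side (redis.tolerations is the field name proposed in the ask, not an existing chart key; Sprig's default returns its second argument when non-empty, so the per-component list wins and global stays the fallback):

      {{- with (default .Values.global.tolerations .Values.redis.tolerations) }}
      tolerations: {{- toYaml . | nindent 8 }}
      {{- end }}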


🔴 3. Migrations Job not declared as a Helm hook

Symptom: Every helm upgrade that bumps meetingApi.image.tag fails with:

Error: UPGRADE FAILED: cannot patch "<release>-vexa-migrations" with kind Job:
Job.batch is invalid: spec.template: Invalid value: ...: field is immutable

Helm tries to patch the Job, K8s refuses (a Job's spec.template is immutable), and --atomic rolls back the entire upgrade. We worked around it with a kubectl delete job before every deploy (added to our Makefile).

Root cause: templates/job-migrations.yaml lacks the standard hook annotations that tell Helm to delete and recreate the Job instead of patching it.

Ask: add to the Job metadata:

annotations:
  "helm.sh/hook": pre-upgrade,pre-install
  "helm.sh/hook-weight": "0"
  "helm.sh/hook-delete-policy": before-hook-creation

🟡 4. capacityReserve.replicas: 0 doesn't disable

Symptom: Setting capacityReserve.replicas: 0 in values renders as replicas: 3, because the chart template uses {{ .Values.capacityReserve.replicas | default 3 }} and the template default function treats 0 as empty, so the fallback 3 wins.

Workaround: use capacityReserve.enabled: false (which short-circuits the whole template). Worked, but the falsy-0 trap is a known Go template anti-pattern.

Ask: replace the {{ .Values.capacityReserve.replicas | default 3 }} expression with an explicit key check, e.g. {{ if hasKey .Values.capacityReserve "replicas" }}{{ .Values.capacityReserve.replicas }}{{ else }}3{{ end }} (sketched below), or document that enabled: false is the intended disable knob.
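Side by side, as a minimal sketch (hasKey comes from Sprig and is available in Helm templates):

# current: `default` treats 0 as empty, so replicas: 0 renders as 3
replicas: {{ .Values.capacityReserve.replicas | default 3 }}

# falsy-zero-safe: fall back to 3 only when the key is absent
replicas: {{ if hasKey .Values.capacityReserve "replicas" }}{{ .Values.capacityReserve.replicas }}{{ else }}3{{ end }}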


🟡 5. aioredis clients lack socket_timeout / health_check_interval (#267 still open)

Already filed as #267; adding it here for completeness because it surfaced again during today's session (5/5 services use bare aioredis.from_url(...) without the timeout kwargs; a sketch of the hardened call is below). Our local check OSS_AIOREDIS_HARDENED_TIMEOUTS greps the OSS source on every checkup run and prints a TRACKED warning.
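A sketch of the hardened call, assuming the aioredis 2.x (redis-py-style) API; the URL and timeout values below are illustrative, not the services' real config:

import aioredis

# Sketch only: URL and values are illustrative, not the real service config.
redis = aioredis.from_url(
    "redis://redis:6379/0",
    socket_connect_timeout=5,   # bound connection attempts instead of hanging
    socket_timeout=5,           # bound individual commands against a dead peer
    health_check_interval=30,   # ping idle connections so stale sockets surface early
)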

When this lands, our chart's cronjob-collector-watchdog (currently the runtime safety net) becomes unnecessary.


🟢 6. Bot pod logs are ephemeral (debugging recordings fails)

Observation: While debugging the Issue 1 audio loss, the bot pod was deleted on exit, taking its [Graceful Leave] shutdown logs with it. We can't see whether the audio upload step started, what error it threw, or whether the mux completed. The runtime-api logs only show "exited with code 0".

Asks:

  1. Make the meeting-api [Delayed Stop] window (currently 90s) extract kubectl logs <bot-pod> into the meeting record before the pod is reaped (the runtime-api could do this in its lifecycle callback).
  2. Or document a recommended fluent-bit / Loki sidecar pattern for bot pods; one possible shape is sketched below.
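For ask 2, a hedged sketch of the sidecar shape (the image tag, Loki host, label, and shared log path are illustrative assumptions; it also presumes the bot tees its output to a file on the shared volume):

containers:
  - name: bot
    # existing bot container, additionally writing its logs under /var/log/bot/
    volumeMounts:
      - name: bot-logs
        mountPath: /var/log/bot
  - name: log-forwarder
    image: fluent/fluent-bit:2.2
    # tail the shared file and ship it to Loki so logs survive pod deletion
    args: ["-i", "tail", "-p", "path=/var/log/bot/*.log",
           "-o", "loki", "-p", "host=loki.logging.svc", "-p", "labels=job=vexa-bot"]
    volumeMounts:
      - name: bot-logs
        mountPath: /var/log/bot
        readOnly: true
volumes:
  - name: bot-logs
    emptyDir: {}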

Cross-cutting ask

The per-release smoke matrix reports compose / lite / helm all green, yet Issues 1-4 above all surfaced specifically in the Helm-on-LKE-with-pool-taints scenario, which is plausibly the production deployment shape for many users. Worth adding a smoke target that:

  • deploys to a cluster with nodeSelector + tainted nodes
  • bumps an image tag (forces Job recreate)
  • runs a Zoom meeting and aborts it within 30s
  • asserts audio is recoverable

Happy to contribute the test scaffold once Issue 1 is in flight.


Reporter: vexa-platform team. Repro context: vexa-platform/staging on LKE 590708, Helm 3.x, K8s 1.35, vexa subchart 0.10.4 with locally-patched tolerations.
