
v0.10.4 deploy cycle: 4 blockers + 2 followups (audio loss, redis tolerations, migrations Job hook, capacityReserve falsy-zero) #272


[PLATFORM]

Context

While deploying v0.10.4 to a Helm-on-LKE staging cluster (with the bin-pack vexa.ai/pool taint model) and exercising it against real Zoom Web meetings, we hit a stack of issues that each independently broke or risked breaking the deploy. Filing them together since they all surface during the same upgrade-and-smoke cycle.

Severity scale: 🔴 blocker, 🟡 important, 🟢 nice-to-have.


🔴 1. Audio recording lost on short / abrupt-stop Zoom meetings

Symptom: A Zoom Web meeting (id 11056) with 29s of audio captured: bot was admitted, PulseAudio capture flowing, recording started locally to /tmp/recording_<id>_<uid>.wav. User clicked stop. [Delayed Stop] Waiting 90s for container... elapsed → meeting-api force-finalized → bot pod deleted → 0 objects in S3, 0 rows in recordings/media_files. All audio lost.

Root cause: Zoom (both Web and SDK) never had periodic chunk upload — RecordingService only uploads on performGracefulLeave(). With the 30s SHUTDOWN_TIMEOUT_MS racing against earlier cleanup steps (Zoom UI leave + voice-agent + per-speaker pipeline + PulseAudio sink + mux), the audio upload either never starts or gets cut off.

Fix existed and was reverted:

  • 58ba53e "feat: incremental audio recording — upload segments every 10s to MinIO" added RecordingService.startIncremental(url, token) for both Zoom paths.
  • 24e0641 "revert: remove experimental resource optimization commits" reverted it because "incremental recording was never tested in a live meeting".
  • Preserved on optimization branch; never re-applied.

Asks:

  1. Re-apply the 58ba53e bot-side incremental upload. A straight cherry-pick is non-trivial (recordings.py, profiles.yaml, and tests3/Makefile conflict with the later commits 87e9962, 0a318bc, and 43998c1); the cleanest path is a re-implementation as a focused PR.
  2. Reorder performGracefulLeave() so the audio upload happens FIRST (immediately after captures stop), before the UI leave, voice-agent, pipeline, and PulseAudio cleanup. Then even if the 30s budget is consumed by hangs elsewhere, the recording is already durable.
  3. Google Meet and Teams already have MediaRecorder timeslice + per-chunk upload (referenced in googlemeet/recording.ts as "Pack B"; see #218); Zoom should match.

Reproducer: join a Zoom meeting and click Stop within 30s. Compare to a meeting that runs ≥4 min, which uploads a single chunk on natural shutdown.


🔴 2. Chart deployment-redis.yaml + job-migrations.yaml dropped nodeSelector/tolerations/affinity blocks in 0.10.4

Symptom: Fresh helm upgrade to 0.10.4 left redis Pending forever in any cluster that uses node taints. All nodes tainted (e.g. vexa.ai/pool=temp:NoSchedule); redis pod doesn't tolerate any of them; cluster-autoscaler reports NotTriggerScaleUp: 3 node(s) had untolerated taint(s).

Diff vs 0.10.3:

diff /tmp/vexa-0.10.3/templates/deployment-redis.yaml /tmp/vexa-0.10.4/templates/deployment-redis.yaml
-      {{- with .Values.global.nodeSelector }}
-      nodeSelector: {{- toYaml . | nindent 8 }}
-      {{- end }}
-      {{- with .Values.global.tolerations }}
-      tolerations: {{- toYaml . | nindent 8 }}
-      {{- end }}
-      {{- with .Values.global.affinity }}
-      affinity: {{- toYaml . | nindent 8 }}
-      {{- end }}

The same delta appears in deployment-pgbouncer.yaml. job-migrations.yaml never had these blocks but fails the same way on clusters where the nodes are tainted.

Workaround we applied: local patch of vendored .tgz to restore the blocks.

Ask: restore the nodeSelector/tolerations/affinity blocks, or expose redis.tolerations / migrations.tolerations as values fields; one possible template shape is sketched below.
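For the values-field variant, a hedged sketch of the template side (redis.tolerations is the field name proposed in the ask, not an existing chart key; Sprig's default returns its second argument when non-empty, so the per-component list wins and global stays the fallback):

      {{- with (default .Values.global.tolerations .Values.redis.tolerations) }}
      tolerations: {{- toYaml . | nindent 8 }}
      {{- end }}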


🔴 3. Migrations Job not declared as a Helm hook

Symptom: Every helm upgrade that bumps meetingApi.image.tag fails with:

Error: UPGRADE FAILED: cannot patch "<release>-vexa-migrations" with kind Job:
Job.batch is invalid: spec.template: Invalid value: ...: field is immutable

Helm tries to patch the Job, K8s refuses (a Job's spec.template is immutable), and --atomic rolls back the entire upgrade. We worked around it with a kubectl delete job before every deploy (added to our Makefile).

Root cause: templates/job-migrations.yaml lacks the standard hook annotations that tell Helm to delete and recreate the Job instead of patching it.

Ask: add to the Job metadata:

annotations:
  "helm.sh/hook": pre-upgrade,pre-install
  "helm.sh/hook-weight": "0"
  "helm.sh/hook-delete-policy": before-hook-creation

🟡 4. capacityReserve.replicas: 0 doesn't disable

Symptom: Setting capacityReserve.replicas: 0 in values renders as replicas: 3, because the chart template uses {{ .Values.capacityReserve.replicas | default 3 }} and the template default function treats 0 as empty, so the fallback 3 wins.

Workaround: use capacityReserve.enabled: false (which short-circuits the whole template). Worked, but the falsy-0 trap is a known Go template anti-pattern.

Ask: replace the {{ .Values.capacityReserve.replicas | default 3 }} expression with an explicit key check, e.g. {{ if hasKey .Values.capacityReserve "replicas" }}{{ .Values.capacityReserve.replicas }}{{ else }}3{{ end }} (sketched below), or document that enabled: false is the intended disable knob.
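Side by side, as a minimal sketch (hasKey comes from Sprig and is available in Helm templates):

# current: `default` treats 0 as empty, so replicas: 0 renders as 3
replicas: {{ .Values.capacityReserve.replicas | default 3 }}

# falsy-zero-safe: fall back to 3 only when the key is absent
replicas: {{ if hasKey .Values.capacityReserve "replicas" }}{{ .Values.capacityReserve.replicas }}{{ else }}3{{ end }}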


🟡 5. aioredis clients lack socket_timeout / health_check_interval (#267 still open)

Already filed as #267; adding it here for completeness because it surfaced again during today's session (5/5 services use bare aioredis.from_url(...) without the timeout kwargs; a sketch of the hardened call is below). Our local check OSS_AIOREDIS_HARDENED_TIMEOUTS greps the OSS source on every checkup run and prints a TRACKED warning.
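A sketch of the hardened call, assuming the aioredis 2.x (redis-py-style) API; the URL and timeout values below are illustrative, not the services' real config:

import aioredis

# Sketch only: URL and values are illustrative, not the real service config.
redis = aioredis.from_url(
    "redis://redis:6379/0",
    socket_connect_timeout=5,   # bound connection attempts instead of hanging
    socket_timeout=5,           # bound individual commands against a dead peer
    health_check_interval=30,   # ping idle connections so stale sockets surface early
)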

When this lands, our chart's cronjob-collector-watchdog (currently the runtime safety net) becomes unnecessary.


🟢 6. Bot pod logs are ephemeral (debugging recordings fails)

Observation: While debugging the Issue 1 audio loss, the bot pod was deleted on exit, taking its [Graceful Leave] shutdown logs with it. We can't see whether the audio upload step started, what error it threw, or whether the mux completed. The runtime-api logs only show "exited with code 0".

Asks:

  1. Make the meeting-api [Delayed Stop] window (currently 90s) extract kubectl logs <bot-pod> into the meeting record before the pod is reaped (the runtime-api could do this in its lifecycle callback).
  2. Or document a recommended fluent-bit / Loki sidecar pattern for bot pods; one possible shape is sketched below.
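For ask 2, a hedged sketch of the sidecar shape (the image tag, Loki host, label, and shared log path are illustrative assumptions; it also presumes the bot tees its output to a file on the shared volume):

containers:
  - name: bot
    # existing bot container, additionally writing its logs under /var/log/bot/
    volumeMounts:
      - name: bot-logs
        mountPath: /var/log/bot
  - name: log-forwarder
    image: fluent/fluent-bit:2.2
    # tail the shared file and ship it to Loki so logs survive pod deletion
    args: ["-i", "tail", "-p", "path=/var/log/bot/*.log",
           "-o", "loki", "-p", "host=loki.logging.svc", "-p", "labels=job=vexa-bot"]
    volumeMounts:
      - name: bot-logs
        mountPath: /var/log/bot
        readOnly: true
volumes:
  - name: bot-logs
    emptyDir: {}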

Cross-cutting ask

The per-release smoke matrix reports compose / lite / helm all green, yet Issues 1-4 above all surfaced specifically in the Helm-on-LKE-with-pool-taints scenario, which is plausibly the production deployment shape for many users. Worth adding a smoke target that:

  • deploys to a cluster with nodeSelector + tainted nodes
  • bumps an image tag (forces Job recreate)
  • runs a Zoom meeting and aborts it within 30s
  • asserts audio is recoverable

Happy to contribute the test scaffold once Issue 1 is in flight.


Reporter: vexa-platform team. Repro context: vexa-platform/staging on LKE 590708, Helm 3.x, K8s 1.35, vexa subchart 0.10.4 with locally-patched tolerations.
