Context
While deploying v0.10.4 to a Helm-on-LKE staging cluster (with the bin-pack vexa.ai/pool taint model) and exercising it against real Zoom Web meetings, we hit a stack of issues that each independently broke or risked breaking the deploy. Filing them together since they all surface during the same upgrade-and-smoke cycle.
Severity scale: 🔴 blocker, 🟡 important, 🟢 nice-to-have.
🔴 1. Audio recording lost on short / abrupt-stop Zoom meetings
Symptom: A Zoom Web meeting (id 11056) with 29s of audio captured: bot was admitted, PulseAudio capture flowing, recording started locally to /tmp/recording_<id>_<uid>.wav. User clicked stop. [Delayed Stop] Waiting 90s for container... elapsed → meeting-api force-finalized → bot pod deleted → 0 objects in S3, 0 rows in recordings/media_files. All audio lost.
Root cause: Zoom (both Web and SDK) never had periodic chunk upload — RecordingService only uploads on performGracefulLeave(). With the 30s SHUTDOWN_TIMEOUT_MS racing against earlier cleanup steps (Zoom UI leave + voice-agent + per-speaker pipeline + PulseAudio sink + mux), the audio upload either never starts or gets cut off.
Fix existed and was reverted:
Commit 58ba53e (feat: incremental audio recording — upload segments every 10s to MinIO) added RecordingService.startIncremental(url, token) for both Zoom paths.
Commit 24e0641 (revert: remove experimental resource optimization commits) reverted it because "incremental recording was never tested in a live meeting".
The change is preserved on the optimization branch; it was never re-applied.
Asks:
Re-apply the 58ba53e bot-side incremental upload (the cherry-pick is non-trivial: recordings.py, profiles.yaml, and tests3/Makefile conflict with later commits 87e9962 + 0a318bc + 43998c1; cleanest is a re-implementation as a focused PR).
Reorder performGracefulLeave() so the audio upload happens FIRST (right after captures stop), before UI/voice/pipeline cleanup. Even with the 30s budget consumed by hangs elsewhere, the recording is then durable.
Adopt the MediaRecorder timeslice + per-chunk upload pattern already referenced in googlemeet/recording.ts as "Pack B" (issue #218); Zoom should match.
Reproducer: join a Zoom meeting and click Stop within 30s. Compare to a meeting that runs ≥4 min (which uploads a single chunk on natural shutdown).
🔴 2. Chart deployment-redis.yaml + job-migrations.yaml dropped nodeSelector / tolerations / affinity blocks in 0.10.4
Symptom: A fresh helm upgrade to 0.10.4 leaves redis Pending forever in any cluster that uses node taints. All nodes are tainted (e.g. vexa.ai/pool=temp:NoSchedule); the redis pod doesn't tolerate any of them; cluster-autoscaler reports NotTriggerScaleUp: 3 node(s) had untolerated taint(s).
Same delta in deployment-pgbouncer.yaml. job-migrations.yaml never had these blocks but is also affected (taints elsewhere).
Workaround we applied: local patch of vendored .tgz to restore the blocks.
Ask: restore the toleration blocks (or expose redis.tolerations / migrations.tolerations as values fields).
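For illustration, roughly the shape we patched back in; the value paths (redis.nodeSelector / redis.tolerations / redis.affinity) are our assumption, since 0.10.3 may have scoped them differently:

```yaml
# templates/deployment-redis.yaml (sketch; value paths are assumptions for illustration)
spec:
  template:
    spec:
      {{- with .Values.redis.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.redis.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.redis.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
```

With that in place, a single redis.tolerations entry mirroring the pool taint (key: vexa.ai/pool, operator: Equal, value: temp, effect: NoSchedule) plus the matching nodeSelector gets the pod scheduled; the same pattern applies to pgbouncer and the migrations Job.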
🔴 3. Migrations Job not declared as a Helm hook
Symptom: Every helm upgrade that bumps meetingApi.image.tag fails with:
Error: UPGRADE FAILED: cannot patch "<release>-vexa-migrations" with kind Job:
Job.batch is invalid: spec.template: Invalid value: ...: field is immutable
Helm tries to patch the Job, K8s refuses (Jobs are immutable), --atomic rolls back the entire upgrade. We worked around it by kubectl delete job before every deploy (added to our Makefile).
Root cause: templates/job-migrations.yaml lacks the standard hook annotations that would tell Helm to delete-and-recreate the Job instead of patching it.
Ask: declare the Job as a Helm hook by adding the hook annotations to its metadata.
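For reference, the usual shape (the exact hook phases are a judgment call; pre-install,pre-upgrade with before-hook-creation is the common choice for DB migration Jobs):

```yaml
# templates/job-migrations.yaml -- metadata sketch
metadata:
  name: {{ .Release.Name }}-vexa-migrations
  annotations:
    # run as a hook so Helm never tries to patch the immutable Job in place
    "helm.sh/hook": pre-install,pre-upgrade
    # drop the previous Job object before creating the new one
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
```

before-hook-creation alone is enough to make the immutability error go away; hook-succeeded additionally cleans up completed Jobs so they don't accumulate between releases.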
🟡 4. capacityReserve.replicas: 0 doesn't disable
Symptom: Setting capacityReserve.replicas: 0 in values renders as replicas: 3, because the chart template uses {{ .Values.capacityReserve.replicas | default 3 }} and 0 is falsy in Go templates, so default 3 wins.
Workaround: use capacityReserve.enabled: false (which short-circuits the whole template). Worked, but the falsy-0 trap is a known Go template anti-pattern.
Ask: replace {{ .Values.capacityReserve.replicas | default 3 }} with an explicit nil check, e.g. {{ if hasKey .Values.capacityReserve "replicas" }}{{ .Values.capacityReserve.replicas }}{{ else }}3{{ end }}. Or document that enabled: false is the intended disable knob.
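In context, the replicas line would look something like this (the template file name is our guess):

```yaml
# templates/deployment-capacity-reserve.yaml (sketch; file name assumed)
spec:
  # hasKey distinguishes "replicas unset" (fall back to 3) from an explicit 0
  replicas: {{ if hasKey .Values.capacityReserve "replicas" }}{{ .Values.capacityReserve.replicas }}{{ else }}3{{ end }}
```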
🟡 5. aioredis clients lack socket_timeout / health_check_interval (#267 still open)
Already filed as #267; adding it here for completeness because it surfaced again during today's session (5/5 services use bare aioredis.from_url(...) without the timeout kwargs). Our local check OSS_AIOREDIS_HARDENED_TIMEOUTS greps the OSS source on every check-up and prints a TRACKED warning.
When this lands, our chart's cronjob-collector-watchdog (currently the runtime safety net) becomes unnecessary.
🟢 6. Bot pod logs are ephemeral (debugging recordings fails)
Observation: When debugging the Issue 1 audio loss, the bot pod was deleted on exit — taking its [Graceful Leave] shutdown logs with it. We can't see whether the audio upload step started, what error it threw, whether mux completed. The runtime-api logs only show "exited with code 0".
Asks:
Have the meeting-api [Delayed Stop] path (currently 90s) extract kubectl logs <bot-pod> into the meeting record before the pod is reaped (the runtime-api could do this in its lifecycle callback).
Or document a recommended fluent-bit / Loki sidecar pattern for bot pods.
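For the sidecar option, something like the pod fragment below is what we have in mind. It assumes the bot can also write its shutdown log to a file on a shared volume; container names, the image tag, the log path, and the Loki endpoint are all placeholders, and the fluent-bit flags are illustrative rather than a tested config:

```yaml
# Bot pod fragment (sketch): ship [Graceful Leave] logs via a fluent-bit sidecar
spec:
  volumes:
    - name: bot-logs
      emptyDir: {}
  containers:
    - name: vexa-bot                   # existing bot container (name assumed)
      volumeMounts:
        - name: bot-logs
          mountPath: /logs             # assumes the bot can tee its log to /logs/bot.log
    - name: log-shipper
      image: fluent/fluent-bit:2.2.2
      args:                            # tail the shared file and push to Loki
        - "-i"
        - "tail"
        - "-p"
        - "path=/logs/*.log"
        - "-o"
        - "loki"
        - "-p"
        - "host=loki.monitoring.svc"   # placeholder Loki endpoint
        - "-p"
        - "labels=job=vexa-bot"
      volumeMounts:
        - name: bot-logs
          mountPath: /logs
```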
Cross-cutting ask
The per-release smoke matrix says compose / lite / helm are all green, yet Issues 1-4 above all surfaced specifically in the Helm-on-LKE-with-pool-taints scenario, which is plausibly the production deployment shape for many users. Worth adding a smoke target that:
deploys to a cluster with nodeSelector + tainted nodes (see the kind sketch after this list)
bumps an image tag (forces Job recreate)
runs a Zoom meeting and aborts it within 30s
asserts audio is recoverable
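For the tainted-nodes part, a throwaway kind cluster can reproduce the scheduling shape locally; the taint and label below mirror our LKE pool, while the rest of the config is illustrative:

```yaml
# kind-tainted.yaml (sketch): worker registers with the same taint/label as our pool,
# so chart scheduling regressions (Issues 2-4) show up before touching a real cluster.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    kubeadmConfigPatches:
      - |
        kind: JoinConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "vexa.ai/pool=temp"
            register-with-taints: "vexa.ai/pool=temp:NoSchedule"
```

kind create cluster --config kind-tainted.yaml followed by the normal helm upgrade --install should be enough to catch the scheduling and Job-immutability failures without an LKE round-trip.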
Happy to contribute the test scaffold once Issue 1 is in flight.
Reporter: vexa-platform team. Repro context: vexa-platform/staging on LKE 590708, Helm 3.x, K8s 1.35, vexa subchart 0.10.4 with locally-patched tolerations.