job_scheduler_uu/start_watcher.sh silently dies without venv/PYTHONPATH setup #38

@brando90

Symptom

Running bash ~/ultimate-utils/py_src/uutils/job_scheduler_uu/start_watcher.sh on a SNAP node prints [OK] Watcher running, but the tmux session dies seconds later and tmux ls shows no job_watcher session.

Root cause

start_watcher.sh launches the daemon with ${PYTHON:-python3} and does not:

  1. Activate a venv that has the package deps (dill, etc.)
  2. Set PYTHONPATH=~/ultimate-utils/py_src so uutils is importable

With the default system /usr/bin/python3 on SNAP nodes, the scheduler crashes immediately:

ModuleNotFoundError: No module named 'dill'

(confirmed on skampere2 with Python 3.10.12). The tmux session exits with no visible error because the script itself returned 0 before the Python process failed, and the session closes as soon as that process exits, taking the traceback with it. The [OK] is therefore misleading.
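The two missing pieces can be checked directly against whichever interpreter the script would pick up. The following is a minimal sketch, not code from start_watcher.sh: `missing_modules` is a hypothetical helper, and `dill` and `uutils` are used only as representative imports based on the error above.

```shell
# Hypothetical helper, not part of start_watcher.sh.
# Prints every module the given interpreter cannot import and
# returns non-zero if at least one module is missing.
missing_modules() {
    py=$1
    shift
    rc=0
    for mod in "$@"; do
        # Try the import in a throwaway interpreter; discard its output.
        if ! "$py" -c "import $mod" >/dev/null 2>&1; then
            echo "$mod"
            rc=1
        fi
    done
    return "$rc"
}

# Example call (commented out so that sourcing this has no side effects):
# missing_modules "${PYTHON:-python3}" dill uutils
```

On a node in the state described above, the example call should list at least dill; with the venv active and PYTHONPATH set, it should print nothing and return 0.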

Repro

cd  # default system python
bash ~/ultimate-utils/py_src/uutils/job_scheduler_uu/start_watcher.sh
# stdout: "[OK] Watcher running."
tmux ls  # <no job_watcher session>

Workaround

Launch the tmux session manually with proper env:

tmux new-session -d -s job_watcher "source ~/uv_envs/veribench/bin/activate && export PYTHONPATH=~/ultimate-utils/py_src && python -m uutils.job_scheduler_uu.scheduler --job-dir \$HOME/dfs/job_queue --poll 15"

Suggested fix

Inside start_watcher.sh, either:

  • Activate a known-good venv (e.g. ~/uv_envs/veribench/bin/activate) and export PYTHONPATH=~/ultimate-utils/py_src before tmux new-session, or
  • Validate the Python env upfront (python -c "import uutils.job_scheduler_uu.scheduler") and abort with a clear error if imports fail, or
  • After tmux new-session, sleep 2 && tmux has-session -t job_watcher to confirm the window didn't die, and fail loudly if it did.

The third option (liveness check) is the most robust: it catches any failure that kills the session at startup, not just this missing-dependency case.
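The liveness check could look roughly like the sketch below. It is an illustration under assumptions, not the actual fix: `confirm_session_alive` is a hypothetical function name, the 2-second default mirrors the sleep 2 suggested above, and the messages are placeholders.

```shell
# Hypothetical helper illustrating the liveness check; not taken from start_watcher.sh.
# Waits `delay` seconds, then asks tmux whether the session still exists.
confirm_session_alive() {
    session=$1
    delay=${2:-2}
    sleep "$delay"
    if tmux has-session -t "$session" 2>/dev/null; then
        echo "[OK] Watcher running in tmux session '$session'."
    else
        echo "[ERROR] tmux session '$session' exited within ${delay}s of launch; check the Python environment (venv, PYTHONPATH)." >&2
        return 1
    fi
}

# Intended placement (commented out so that sourcing this has no side effects):
# tmux new-session -d -s job_watcher "<existing daemon command>"
# confirm_session_alive job_watcher 2 || exit 1
```

With this in place, [OK] is printed only after the session has survived the delay. The fixed delay is a trade-off: it only catches failures that occur inside that window, so a larger value may be needed if startup is slow.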

Context

Discovered during SNAP watcher setup on skampere2 (2026-04-16). The workaround above is currently live on skampere2. See agents-config commit bda6e7f for the related ~/dfs symlink requirement and commit 594a3cb for the two-layer job-checking protocol that surfaced this.
