From 2cffe1b44ef4dff3e4698f5bd657bc39f16999c2 Mon Sep 17 00:00:00 2001
From: yoinspiration
Date: Wed, 4 Mar 2026 17:48:47 +0800
Subject: [PATCH 1/7] feat: add board lock watcher and docs

Introduce lock-watcher helper and docs so cancelled workflows can safely release board file locks across organizations.

Made-with: Cursor
---
 docs/board-lock-watcher.md | 187 ++++++++++++++++++
 ...77\347\224\250\350\257\264\346\230\216.md" | 1 +
 runner-wrapper/lock-watcher.sh | 101 ++++++++++
 3 files changed, 289 insertions(+)
 create mode 100644 docs/board-lock-watcher.md
 create mode 100755 runner-wrapper/lock-watcher.sh

diff --git a/docs/board-lock-watcher.md b/docs/board-lock-watcher.md
new file mode 100644
index 0000000..fc6bc44
--- /dev/null
+++ b/docs/board-lock-watcher.md
@@ -0,0 +1,187 @@
+## Board File Lock and Cancellable Wait: Usage Guide
+
+This document explains how, when multiple organizations share one development board, to combine file locks with `lock-watcher.sh` so that **a waiting Job can be interrupted cleanly by Cancel**, avoiding deadlock.
+
+---
+
+### 1. Component overview
+
+- **`runner-wrapper/runner-wrapper.sh`**: injects Job Started / Completed hooks into the Runner.
+- **`runner-wrapper/pre-job-lock.sh`**: acquires the board-level file lock (`flock`) before a Job starts and holds it through a background child process.
+- **`runner-wrapper/post-job-lock.sh`**: creates a `.release` marker when the Job ends, waking the lock-holding child process so that it releases the lock.
+- **`runner-wrapper/lock-watcher.sh`** (new): a daemon script that runs on the host and periodically queries the status of Actions Runs in a given repository; when it finds that the lock-holding Run has been **cancelled**, it forcibly cleans up the lock files so that later waiting Jobs do not hang forever.
+
+Lock file layout (default directory `/tmp/github-runner-locks`):
+
+- `${RESOURCE_ID}.lock`: the lock file used by flock
+- `${RESOURCE_ID}.holder`: current holder information, in the format `PID RUNNER_NAME RUN_ID RUN_ATTEMPT`
+- `${RESOURCE_ID}.${RUNNER_NAME}.${RUN_ID}.${RUN_ATTEMPT}.release`: release marker, created by `post-job-lock.sh`
+
+---
+
+### 2. Runner-side configuration (per-organization .env)
+
+In each organization's `.env` (for example `.env.linebridge` / `.env.yoinspiration`):
+
+```bash
+ORG=
+REPO=test-runner
+GH_PAT=ghp_xxx # Classic PAT used for Runner registration
+
+RUNNER_RESOURCE_ID_ROC_RK3568_PC=board-roc-rk3568-pc
+RUNNER_LOCK_HOST_PATH=/tmp/github-runner-locks
+RUNNER_LOCK_DIR=/tmp/github-runner-locks
+```
+
+Note:
+
+- When multiple organizations share the same board, these variables in every related `.env`
+  - `RUNNER_RESOURCE_ID_ROC_RK3568_PC`
+  - `RUNNER_LOCK_HOST_PATH`
+  - `RUNNER_LOCK_DIR`
+  must be identical.
+
+After editing `.env`, restart the corresponding Runner:
+
+```bash
+ENV_FILE=.env.linebridge ./runner.sh restart
+ENV_FILE=.env.yoinspiration ./runner.sh restart
+```
+
+---
+
+### 3. Configuring lock-watcher on the host
+
+#### 3.1 Recommended number of instances
+
+- Default recommendation: **start one `lock-watcher.sh` instance for each repository that shares the board**, i.e. one `ORG/REPO/GITHUB_TOKEN` configuration per repository.
+- For example, `linebridge/test-runner` and `yoinspiration/test-runner` each have their own `.env.watcher.*` and a corresponding watcher process.
+
+#### 3.2 Prepare a PAT
+
+Under the corresponding organization account, create a **Fine-grained PAT** (recommended):
+
+- Select the repository that contains `test-runner` (for example `linebridge/test-runner`).
+- Under **Repository permissions**, set **Actions** to **Read-only**.
+
+Generating it yields a token of the form `github_pat_xxx`.
+
+#### 3.2 Create the watcher environment file
+
+Create `.env.watcher` in the repository root (this example monitors `linebridge/test-runner` and the `board-roc-rk3568-pc` board):
+
+```bash
+ORG=linebridge
+REPO=test-runner
+GITHUB_TOKEN=github_pat_xxx
+RUNNER_RESOURCE_ID=board-roc-rk3568-pc
+RUNNER_LOCK_DIR=/tmp/github-runner-locks
+INTERVAL=10
+```
+
+> To monitor another organization (for example `yoinspiration`) separately, create another environment file (such as `.env.watcher.yoinspiration`), change `ORG` / `REPO` / `GITHUB_TOKEN`, and start a second watcher instance.
+
+#### 3.3 Install dependencies
+
+Install `jq` on the host to parse the JSON returned by the GitHub API:
+
+```bash
+sudo apt update
+sudo apt install -y jq
+```
+
+---
+
+### 4. Starting lock-watcher
+
+On the host, open a long-running terminal (preferably inside tmux/screen or a systemd service):
+
+```bash
+cd /home/fei/os-internship/github-runners
+
+source .env.watcher
+
+./runner-wrapper/lock-watcher.sh
+```
+
+On successful startup, the terminal prints something like:
+
+```text
+[lock-watcher] monitoring linebridge/test-runner, resource=board-roc-rk3568-pc, lock_dir=/tmp/github-runner-locks, interval=10s
+```
+
+Example logs while running:
+
+- Lock held normally:
+
+```text
+[lock-watcher] run_id=22663158623 status=in_progress conclusion= pid=1511 holder_runner=DESKTOP-...-runner-roc-rk3568-pc
+```
+
+- After the corresponding workflow is cancelled on GitHub:
+
+```text
+[lock-watcher] run_id=22663158623 status=completed conclusion=cancelled pid=1511 holder_runner=DESKTOP-...-runner-roc-rk3568-pc
+[lock-watcher] detected cancelled workflow, force releasing lock for board-roc-rk3568-pc
+```
+
+This shows that the watcher detected the Cancel and forcibly cleaned up the lock files.
+
+---
+
+### 5. Typical verification flow
+
+The following uses `linebridge/test-runner` and `yoinspiration/test-runner` sharing the `board-roc-rk3568-pc` board as an example.
+
+#### 5.1 Trigger the holder that occupies the board
+
+In the `linebridge/test-runner` repository:
+
+1. Open Actions and select the `board-lock-holder` workflow.
+2. Click **Run workflow** to trigger a run.
+3. The log shows:
+
+   - `Waiting for lock: board-roc-rk3568-pc`
+   - `Acquired lock for board-roc-rk3568-pc`
+
+This means the holder has acquired the lock and is occupying the board.
+
+#### 5.2 Trigger the waiting waiter
+
+In the `yoinspiration/test-runner` repository:
+
+1. Open Actions and select the `board-lock-waiter` workflow.
+2. Click **Run workflow**.
+3. The log shows:
+
+   - `Waiting for lock: board-roc-rk3568-pc`
+
+This means the waiter is waiting for the lock on the same board.
+
+#### 5.3 Cancel and observe the automatic unlock
+
+1. On the `board-lock-holder` run page in `linebridge/test-runner`, click **Cancel workflow**.
+2. After a few seconds, the `lock-watcher.sh` log on the host should show:
+
+   ```text
+   [lock-watcher] run_id=... status=completed conclusion=cancelled ...
+   [lock-watcher] detected cancelled workflow, force releasing lock for board-roc-rk3568-pc
+   ```
+
+3. The `board-lock-waiter` on the `yoinspiration` side then continues once the lock is released and runs until the Job completes, instead of staying stuck in the waiting state.
+
+---
+
+### 6. FAQ
+
+- **Q: The watcher log keeps showing `status=in_progress conclusion=`?**
+  **A:** The Run is still in progress and has not been cancelled or completed, so the watcher only logs the status and does not release the lock. Click Cancel on the GitHub page and check again once the status changes to Cancelled.
+
+- **Q: The watcher log frequently shows `empty response for run_id=..., skip`?**
+  **A:** The Run does not belong to the current `ORG/REPO`, or `GITHUB_TOKEN` lacks sufficient Actions read permission on that repository. Check that:
+  - `ORG` / `REPO` in `.env.watcher` match the repository the Run actually belongs to;
+  - the Fine-grained PAT includes that repository and has the Actions permission set to Read-only.
+
+- **Q: Without jq installed, status / conclusion are always ``?**
+  **A:** Install `jq` on the host first; otherwise the watcher cannot parse the status from the response JSON.
+
diff --git "a/docs/\345\244\232\347\273\204\347\273\207\345\205\261\344\272\253Runner\344\275\277\347\224\250\350\257\264\346\230\216.md" "b/docs/\345\244\232\347\273\204\347\273\207\345\205\261\344\272\253Runner\344\275\277\347\224\250\350\257\264\346\230\216.md"
index 3407916..ab314b3 100644
--- "a/docs/\345\244\232\347\273\204\347\273\207\345\205\261\344\272\253Runner\344\275\277\347\224\250\350\257\264\346\230\216.md"
+++ "b/docs/\345\244\232\347\273\204\347\273\207\345\205\261\344\272\253Runner\344\275\277\347\224\250\350\257\264\346\230\216.md"
@@ -50,6 +50,7 @@ RUNNER_LOCK_DIR=/tmp/github-runner-locks
 
 Key requirements:
 
+- Only boards with `RUNNER_RESOURCE_ID_PHYTIUMPI` / `RUNNER_RESOURCE_ID_ROC_RK3568_PC` set enable runner-wrapper and the lock directory mount; boards without it keep using the default `run.sh` and do not take part in lock coordination.
 - `RUNNER_RESOURCE_ID_*` must be identical in both configurations (the same board shares the same lock).
 - `RUNNER_LOCK_HOST_PATH` must be identical in both configurations (pointing to the same host directory).
 
diff --git a/runner-wrapper/lock-watcher.sh b/runner-wrapper/lock-watcher.sh
new file mode 100755
index 0000000..ad46059
--- /dev/null
+++ b/runner-wrapper/lock-watcher.sh
@@ -0,0 +1,101 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# lock-watcher.sh - simple file-lock cleanup script
+#
+# Scenario:
+#   - a lock taken by pre-job-lock.sh may stay held for a long time because of network/runner failures
+#   - after Cancel workflow in the GitHub UI, the run status becomes cancelled but the local lock files remain
+#   - this script periodically checks the run_id in the holder file and, if that run is cancelled, forcibly cleans up to unlock
+#
+# Usage (example, run on the host):
+#   export ORG=yoinspiration
+#   export REPO=test-runner
+#   export GITHUB_TOKEN=github_pat_xxx # needs Actions read-only permission
+#   export RUNNER_RESOURCE_ID=board-roc-rk3568-pc
+#   export RUNNER_LOCK_DIR=/tmp/github-runner-locks
+#   ./runner-wrapper/lock-watcher.sh
+#
+# Required environment variables:
+#   ORG, REPO, GITHUB_TOKEN
+# Optional environment variables:
+#   RUNNER_RESOURCE_ID (default: default-hardware, or set your own)
+#   RUNNER_LOCK_DIR (default: /tmp/github-runner-locks)
+#   INTERVAL (polling interval in seconds, default 10)
+
+: "${ORG:?ORG is required, e.g. yoinspiration}"
+: "${REPO:?REPO is required, e.g. test-runner}"
+: "${GITHUB_TOKEN:?GITHUB_TOKEN is required (with Actions read permission)}"
+
+RUNNER_RESOURCE_ID="${RUNNER_RESOURCE_ID:-default-hardware}"
+LOCK_DIR="${RUNNER_LOCK_DIR:-/tmp/github-runner-locks}"
+INTERVAL="${INTERVAL:-10}"
+
+api_base="${GITHUB_API_URL:-https://api.github.com}"
+
+echo "[lock-watcher] monitoring ${ORG}/${REPO}, resource=${RUNNER_RESOURCE_ID}, lock_dir=${LOCK_DIR}, interval=${INTERVAL}s"
+
+while true; do
+  holder_file="${LOCK_DIR}/${RUNNER_RESOURCE_ID}.holder"
+
+  if [[ ! -f "${holder_file}" ]]; then
+    sleep "${INTERVAL}"
+    continue
+  fi
+
+  # holder file format: PID RUNNER_NAME RUN_ID RUN_ATTEMPT
+  holder_pid=""
+  holder_runner=""
+  holder_run_id=""
+  holder_run_attempt=""
+  if ! read -r holder_pid holder_runner holder_run_id holder_run_attempt < "${holder_file}"; then
+    echo "[lock-watcher] failed to read holder file ${holder_file}" >&2
+    sleep "${INTERVAL}"
+    continue
+  fi
+
+  if [[ -z "${holder_run_id:-}" || "${holder_run_id}" == "unknown" ]]; then
+    sleep "${INTERVAL}"
+    continue
+  fi
+
+  run_url="${api_base}/repos/${ORG}/${REPO}/actions/runs/${holder_run_id}"
+  resp="$(curl -fsSL \
+    -H "Authorization: Bearer ${GITHUB_TOKEN}" \
+    -H "Accept: application/vnd.github+json" \
+    "${run_url}" || true)"
+
+  if [[ -z "${resp}" ]]; then
+    echo "[lock-watcher] empty response for run_id=${holder_run_id}, skip" >&2
+    sleep "${INTERVAL}"
+    continue
+  fi
+
+  # Prefer jq for parsing; if jq is unavailable, fall back to empty strings (no error)
+  if command -v jq >/dev/null 2>&1; then
+    status="$(printf '%s\n' "${resp}" | jq -r '.status // empty' 2>/dev/null || true)"
+    conclusion="$(printf '%s\n' "${resp}" | jq -r '.conclusion // empty' 2>/dev/null || true)"
+  else
+    status=""
+    conclusion=""
+  fi
+
+  echo "[lock-watcher] run_id=${holder_run_id} status=${status:-} conclusion=${conclusion:-} pid=${holder_pid} holder_runner=${holder_runner}"
+
+  # If the workflow was cancelled, treat this lock as safe to force-release
+  if [[ "${status}" == "completed" && "${conclusion}" == "cancelled" ]]; then
+    echo "[lock-watcher] detected cancelled workflow, force releasing lock for ${RUNNER_RESOURCE_ID}" >&2
+
+    # Try to kill the same-numbered PID on the host (note: container and host PID namespaces differ, so this may miss; best-effort only)
+    if [[ -n "${holder_pid}" && "${holder_pid}" =~ ^[0-9]+$ ]]; then
+      kill "${holder_pid}" 2>/dev/null || true
+    fi
+
+    # Remove the holder and matching release markers so later waiters are no longer blocked by the stale lock
+    rm -f "${holder_file}" 2>/dev/null || true
+    rm -f "${LOCK_DIR}/${RUNNER_RESOURCE_ID}."*.release 2>/dev/null || true
+  fi
+
+  sleep "${INTERVAL}"
+done
+

From c8c9cc6fd18f942a0e1c774ab9953457b950744c Mon Sep 17 00:00:00 2001
From: yoinspiration
Date: Fri, 6 Mar 2026 10:58:03 +0800
Subject: [PATCH 2/7] feat: integrate watcher into runner.sh and improve usage docs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- runner.sh gains a watcher subcommand that reuses .env and requires RUNNER_LOCK_MONITOR_TOKEN
- Usage docs: deployment flow overview, keeping the watcher resident with tmux/screen/systemd, host lock directory permissions (2.1)

Made-with: Cursor
---
 docs/board-lock-watcher.md | 165 +++++++++++++++++++++++++++++++++++--
 runner.sh | 35 +++++++-
 2 files changed, 190 insertions(+), 10 deletions(-)

diff --git a/docs/board-lock-watcher.md b/docs/board-lock-watcher.md
index fc6bc44..d6585bd 100644
--- a/docs/board-lock-watcher.md
+++ b/docs/board-lock-watcher.md
@@ -19,6 +19,17 @@
 
 ---
 
+### 1.1 Deployment flow overview
+
+Follow the steps in this order; details are in later sections.
+
+| When | What to do |
+|------|----------|
+| **First deployment on this machine** | ① Prepare each organization's `.env` (with ORG, REPO, GH_PAT, the board lock variables, and `RUNNER_LOCK_MONITOR_TOKEN`, which is required when using the watcher)<br>② **On the host**, set the lock directory permissions: `sudo mkdir -p /tmp/github-runner-locks && sudo chmod 1777 /tmp/github-runner-locks` (see 2.1)<br>③ Run `./runner.sh init -n 2` once per organization to generate the compose file, start the containers, and register<br>④ (Optional) Install `jq`: `sudo apt install -y jq`<br>⑤ Start one watcher per organization: `ENV_FILE=.env.xxx ./runner.sh watcher`, preferably kept resident with tmux/screen or systemd |
+| **Every later use (not the first time)** | Run `ENV_FILE=.env.xxx ./runner.sh start` as needed (per organization); if the watcher is already resident under tmux/screen/systemd, leave it alone, otherwise start one per organization with the command above |
+
+---
+
 ### 2. Runner-side configuration (per-organization .env)
 
 In each organization's `.env` (for example `.env.linebridge` / `.env.yoinspiration`):
@@ -31,6 +42,8 @@ GH_PAT=ghp_xxx # Classic PAT used for Runner registration
 RUNNER_RESOURCE_ID_ROC_RK3568_PC=board-roc-rk3568-pc
 RUNNER_LOCK_HOST_PATH=/tmp/github-runner-locks
 RUNNER_LOCK_DIR=/tmp/github-runner-locks
+# Required (only when using ./runner.sh watcher): Fine-grained PAT, Actions: Read-only
+RUNNER_LOCK_MONITOR_TOKEN=github_pat_xxx
 ```
 
 Note:
@@ -41,6 +54,26 @@ RUNNER_LOCK_DIR=/tmp/github-runner-locks
   - `RUNNER_LOCK_DIR`
   must be identical.
 
+#### 2.1 Host lock directory permissions (first deployment or on Permission denied)
+
+The lock directory is bind-mounted from the host into the container; if its permissions on the host are wrong, the container reports `chmod: Operation not permitted` or `Permission denied`. Run the following **on the host** (only on first deployment or when the errors above appear):
+
+```bash
+# If the directory exists with wrong permissions, clean up first, then fix the permissions
+sudo rm -f /tmp/github-runner-locks/*.holder /tmp/github-runner-locks/*.release
+sudo chmod 1777 /tmp/github-runner-locks
+sudo find /tmp/github-runner-locks -maxdepth 1 -type f -name 'board-*' -exec chmod 666 {} \;
+```
+
+If the directory does not exist, create it first, then set the permissions:
+
+```bash
+sudo mkdir -p /tmp/github-runner-locks
+sudo chmod 1777 /tmp/github-runner-locks
+```
+
+When done, restart the corresponding Runner (see below).
+
 After editing `.env`, restart the corresponding Runner:
 
 ```bash
@@ -52,10 +85,121 @@ ENV_FILE=.env.yoinspiration ./runner.sh restart
 
 ### 3. Configuring lock-watcher on the host
 
+#### 3.0 Recommended: start via runner.sh (same configuration source as the locks, no extra setup)
+
+The watcher is integrated into `runner.sh` and **can reuse each organization's `.env` directly**, so no separate `.env.watcher` needs to be maintained:
+
+1. Add one line (required) to the corresponding organization's `.env`:
+   ```bash
+   RUNNER_LOCK_MONITOR_TOKEN=github_pat_xxx # Fine-grained PAT, Actions: Read-only
+   ```
+2. Run on the host (with the same ENV_FILE as start/restart):
+   ```bash
+   ENV_FILE=.env.linebridge ./runner.sh watcher
+   ```
+   The script automatically uses `ORG`, `REPO`, and `RUNNER_LOCK_DIR` from the current `.env`, plus `RUNNER_RESOURCE_ID_ROC_RK3568_PC` or `RUNNER_RESOURCE_ID_PHYTIUMPI` (roc preferred). To specify a resource ID, pass it as an argument:
+   ```bash
+   ENV_FILE=.env.linebridge ./runner.sh watcher board-roc-rk3568-pc
+   ```
+3. Keep the process resident with tmux/screen or systemd; with multiple organizations, run `./runner.sh watcher` in one terminal (or service) per organization. See "3.0.1 Keeping the watcher resident" below.
+
+#### 3.0.1 Keeping the watcher resident (tmux / screen / systemd)
+
+Pick any one of these so the watcher keeps running after SSH disconnects (only the systemd option also survives a reboot).
+
+**Option A: tmux**
+
+```bash
+# Install (if missing)
+sudo apt install -y tmux
+
+# First organization
+cd /path/to/github-runners
+tmux new -s watcher-linebridge
+ENV_FILE=.env.linebridge ./runner.sh watcher
+# Detach: Ctrl+B, then D
+
+# Second organization (new terminal or new session)
+tmux new -s watcher-yoinspiration
+ENV_FILE=.env.yoinspiration ./runner.sh watcher
+# Detach the same way: Ctrl+B, D
+
+# Reattach to check
+tmux attach -t watcher-linebridge
+tmux attach -t watcher-yoinspiration
+```
+
+**Option B: screen**
+
+```bash
+# Install (if missing)
+sudo apt install -y screen
+
+# First organization
+cd /path/to/github-runners
+screen -S watcher-linebridge
+ENV_FILE=.env.linebridge ./runner.sh watcher
+# Detach: Ctrl+A, then D
+
+# Second organization
+screen -S watcher-yoinspiration
+ENV_FILE=.env.yoinspiration ./runner.sh watcher
+# Detach: Ctrl+A, D
+
+# Reattach
+screen -r watcher-linebridge
+screen -r watcher-yoinspiration
+```
+
+**Option C: systemd (starts at boot, recommended for long-term use)**
+
+One service file per organization, for example linebridge:
+
+```bash
+sudo nano /etc/systemd/system/github-runner-watcher-linebridge.service
+```
+
+Contents (replace `fei` and `/path/to/github-runners` with your username and the repository's absolute path):
+
+```ini
+[Unit]
+Description=GitHub Runner lock watcher (linebridge)
+After=network-online.target
+
+[Service]
+Type=simple
+User=fei
+WorkingDirectory=/path/to/github-runners
+Environment=ENV_FILE=.env.linebridge
+ExecStart=/path/to/github-runners/runner.sh watcher
+Restart=always
+RestartSec=10
+
+[Install]
+WantedBy=multi-user.target
+```
+
+Create another one for yoinspiration (e.g. `github-runner-watcher-yoinspiration.service`), changing only `linebridge` to `yoinspiration` and `ENV_FILE=.env.linebridge` to `ENV_FILE=.env.yoinspiration`.
+
+Enable and start:
+
+```bash
+sudo systemctl daemon-reload
+sudo systemctl enable --now github-runner-watcher-linebridge
+sudo systemctl enable --now github-runner-watcher-yoinspiration
+```
+
+Check status and logs:
+
+```bash
+sudo systemctl status github-runner-watcher-linebridge
+journalctl -u github-runner-watcher-linebridge -f
+```
+
 #### 3.1 Recommended number of instances
 
-- Default recommendation: **start one `lock-watcher.sh` instance for each repository that shares the board**, i.e. one `ORG/REPO/GITHUB_TOKEN` configuration per repository.
-- For example, `linebridge/test-runner` and `yoinspiration/test-runner` each have their own `.env.watcher.*` and a corresponding watcher process.
+- Default recommendation: **start one watcher instance for each repository that shares the board** (i.e. one `.env` per organization, each running `./runner.sh watcher` once).
+- For example, for `linebridge/test-runner` and `yoinspiration/test-runner`, run `ENV_FILE=.env.linebridge ./runner.sh watcher` and `ENV_FILE=.env.yoinspiration ./runner.sh watcher` respectively.
 
 #### 3.2 Prepare a PAT
 
@@ -66,7 +210,7 @@ ENV_FILE=.env.yoinspiration ./runner.sh restart
 
 Generating it yields a token of the form `github_pat_xxx`.
 
-#### 3.2 Create the watcher environment file
+#### 3.3 Create the watcher environment file (for option 2)
 
 Create `.env.watcher` in the repository root (this example monitors `linebridge/test-runner` and the `board-roc-rk3568-pc` board):
 
@@ -81,7 +225,7 @@ INTERVAL=10
 
 > To monitor another organization (for example `yoinspiration`) separately, create another environment file (such as `.env.watcher.yoinspiration`), change `ORG` / `REPO` / `GITHUB_TOKEN`, and start a second watcher instance.
 
-#### 3.3 Install dependencies
+#### 3.4 Install dependencies
 
 Install `jq` on the host to parse the JSON returned by the GitHub API:
 
@@ -94,13 +238,20 @@ sudo apt install -y jq
 
 ### 4. Starting lock-watcher
 
-On the host, open a long-running terminal (preferably inside tmux/screen or a systemd service):
+**Option 1 (recommended): start via runner.sh, same configuration source as the locks**
 
 ```bash
-cd /home/fei/os-internship/github-runners
+cd /path/to/github-runners
+ENV_FILE=.env.linebridge ./runner.sh watcher
+```
 
-source .env.watcher
+**Option 2: separate environment file**
 
+On the host, open a long-running terminal (preferably inside tmux/screen or a systemd service):
+
+```bash
+cd /path/to/github-runners
+source .env.watcher
 ./runner-wrapper/lock-watcher.sh
 ```
 
diff --git a/runner.sh b/runner.sh
index 989d0b9..15344c5 100755
--- a/runner.sh
+++ b/runner.sh
@@ -101,16 +101,20 @@ shell_usage() {
   printf " %-${COLW}s %s\n" "./runner.sh ps|ls|list|status" "Show container status and registered Runner status"
   echo
 
-  echo "4. Deletion commands:"
+  echo "4. Lock watcher (auto-releases board locks after Cancel):"
+  printf " %-${COLW}s %s\n" "./runner.sh watcher [resource]" "Start lock-watcher (uses same .env; requires RUNNER_LOCK_MONITOR_TOKEN)"
+  echo
+
+  echo "5. Deletion commands:"
   printf " %-${COLW}s %s\n" "./runner.sh rm|remove|delete [${RUNNER_NAME_PREFIX}runner- ...]" "Delete specified instances; no args will delete all (confirmation required, -y to skip)"
   printf " %-${COLW}s %s\n" "./runner.sh purge [-y]" "On top of remove, also delete the dynamically generated docker-compose.yml"
   echo
 
-  echo "5. Image management commands:"
+  echo "6. Image management commands:"
   printf " %-${COLW}s %s\n" "./runner.sh image" "Rebuild Docker image based on Dockerfile"
   echo
 
-  echo "6. Help"
+  echo "7. Help"
   printf " %-${COLW}s %s\n" "./runner.sh help" "Show this help"
   echo
 
@@ -126,6 +130,7 @@ shell_usage() {
   printf " %-${KEYW}s %s\n" "RUNNER_RESOURCE_ID_ROC_RK3568_PC" "Lock ID for roc-rk3568-pc board (default: board-roc-rk3568-pc); same ID = serial"
   printf " %-${KEYW}s %s\n" "RUNNER_LOCK_DIR" "Lock dir in container (default /tmp/github-runner-locks)"
   printf " %-${KEYW}s %s\n" "RUNNER_LOCK_HOST_PATH" "Lock dir on host for bind mount (default /tmp/github-runner-locks)"
+  printf " %-${KEYW}s %s\n" "RUNNER_LOCK_MONITOR_TOKEN" "Fine-grained PAT, Actions read-only (required for watcher)"
   echo
 
   echo "Example workflow runs-on: runs-on: [self-hosted, linux, docker]"
@@ -966,6 +971,30 @@ if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
       echo
       ;;
 
+    # ./runner.sh watcher [resource]
+    watcher)
+      shell_get_org_and_pat
+      [[ -n "${RUNNER_LOCK_MONITOR_TOKEN:-}" ]] || shell_die "RUNNER_LOCK_MONITOR_TOKEN is required for watcher (use Fine-grained PAT with Actions: Read-only)."
+      export GITHUB_TOKEN="${RUNNER_LOCK_MONITOR_TOKEN}"
+      export ORG REPO
+      export RUNNER_LOCK_DIR="${RUNNER_LOCK_DIR:-/tmp/github-runner-locks}"
+      if [[ -n "${1:-}" ]]; then
+        export RUNNER_RESOURCE_ID="$1"
+      else
+        if [[ -n "${RUNNER_RESOURCE_ID_ROC_RK3568_PC:-}" ]]; then
+          export RUNNER_RESOURCE_ID="${RUNNER_RESOURCE_ID_ROC_RK3568_PC}"
+        elif [[ -n "${RUNNER_RESOURCE_ID_PHYTIUMPI:-}" ]]; then
+          export RUNNER_RESOURCE_ID="${RUNNER_RESOURCE_ID_PHYTIUMPI}"
+        else
+          shell_die "No board resource ID set (RUNNER_RESOURCE_ID_ROC_RK3568_PC / RUNNER_RESOURCE_ID_PHYTIUMPI). Set one in .env or pass: ./runner.sh watcher "
+        fi
+      fi
+      WATCHER_SCRIPT="$(cd "$(dirname "$0")" && pwd)/runner-wrapper/lock-watcher.sh"
+      [[ -x "${WATCHER_SCRIPT}" ]] || shell_die "lock-watcher.sh not found or not executable: ${WATCHER_SCRIPT}"
+      shell_info "Starting lock-watcher for ${ORG}/${REPO}, resource=${RUNNER_RESOURCE_ID}"
+      exec "${WATCHER_SCRIPT}"
+      ;;
+
     # ./runner.sh init -n|--count N
     init)
       count=0

From 9f9cd0cc90b0927f71d904a10855fb15893be178 Mon Sep 17 00:00:00 2001
From: yoinspiration
Date: Fri, 6 Mar 2026 11:41:22 +0800
Subject: [PATCH 3/7] feat(watcher): monitor all boards from a single process; docs no longer mention .env.watcher
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- runner.sh watcher exports RUNNER_RESOURCE_IDS (roc + phytiumpi) when called without arguments; with an argument it still targets a single board
- lock-watcher.sh supports multiple space-separated resources in RUNNER_RESOURCE_IDS and checks every board on each loop iteration
- Docs now recommend one watcher per organization monitoring all boards; .env.watcher and the option 2 instructions are removed

Made-with: Cursor
---
 docs/board-lock-watcher.md | 59 ++++++---------
 runner-wrapper/lock-watcher.sh | 127 +++++++++++++++++----------------
 runner.sh | 14 ++--
 3 files changed, 95 insertions(+), 105 deletions(-)

diff --git a/docs/board-lock-watcher.md b/docs/board-lock-watcher.md
index d6585bd..71bc28d 100644
--- a/docs/board-lock-watcher.md
+++ b/docs/board-lock-watcher.md
@@ -9,7 +9,7 @@
 - **`runner-wrapper/runner-wrapper.sh`**: injects Job Started / Completed hooks into the Runner.
 - **`runner-wrapper/pre-job-lock.sh`**: acquires the board-level file lock (`flock`) before a Job starts and holds it through a background child process.
 - **`runner-wrapper/post-job-lock.sh`**: creates a `.release` marker when the Job ends, waking the lock-holding child process so that it releases the lock.
-- **`runner-wrapper/lock-watcher.sh`** (new): a daemon script that runs on the host and periodically queries the status of Actions Runs in a given repository; when it finds that the lock-holding Run has been **cancelled**, it forcibly cleans up the lock files so that later waiting Jobs do not hang forever.
+- **`runner-wrapper/lock-watcher.sh`** (new): a daemon script that runs on the host and periodically queries the status of Actions Runs in a given repository; when it finds that the lock-holding Run has been **cancelled**, it forcibly cleans up the lock files so that later waiting Jobs do not hang forever. **One watcher process can monitor several boards at once** (determined by the `RUNNER_RESOURCE_ID_*` values configured in `.env`).
 
 Lock file layout (default directory `/tmp/github-runner-locks`):
 
@@ -25,7 +25,7 @@
 
 | When | What to do |
 |------|----------|
-| **First deployment on this machine** | ① Prepare each organization's `.env` (with ORG, REPO, GH_PAT, the board lock variables, and `RUNNER_LOCK_MONITOR_TOKEN`, which is required when using the watcher)<br>② **On the host**, set the lock directory permissions: `sudo mkdir -p /tmp/github-runner-locks && sudo chmod 1777 /tmp/github-runner-locks` (see 2.1)<br>③ Run `./runner.sh init -n 2` once per organization to generate the compose file, start the containers, and register<br>④ (Optional) Install `jq`: `sudo apt install -y jq`<br>⑤ Start one watcher per organization: `ENV_FILE=.env.xxx ./runner.sh watcher`, preferably kept resident with tmux/screen or systemd |
+| **First deployment on this machine** | ① Prepare each organization's `.env` (with ORG, REPO, GH_PAT, the board lock variables, and `RUNNER_LOCK_MONITOR_TOKEN`, which is required when using the watcher)<br>② **On the host**, set the lock directory permissions: `sudo mkdir -p /tmp/github-runner-locks && sudo chmod 1777 /tmp/github-runner-locks` ([see 2.1](#21-host-lock-directory-permissions-first-deployment-or-on-permission-denied))<br>③ Run `./runner.sh init -n 2` once per organization to generate the compose file, start the containers, and register<br>④ (Required when using the watcher) Install `jq`: `sudo apt install -y jq` ([see 3.3](#33-install-dependencies-required-when-using-the-watcher))<br>⑤ Start just **one** watcher per organization: `ENV_FILE=.env.xxx ./runner.sh watcher` (it monitors **all** boards configured in that .env), preferably kept resident with tmux/screen or systemd ([see 3.0.1](#301-keeping-the-watcher-resident-tmux--screen--systemd)) |
 | **Every later use (not the first time)** | Run `ENV_FILE=.env.xxx ./runner.sh start` as needed (per organization); if the watcher is already resident under tmux/screen/systemd, leave it alone, otherwise start one per organization with the command above |
 
 ---
@@ -37,7 +37,7 @@
 ```bash
 ORG=
 REPO=test-runner
-GH_PAT=ghp_xxx # Classic PAT used for Runner registration
+GH_PAT=ghp_xxx # Used for Runner registration; see 2.2 for permissions
 
 RUNNER_RESOURCE_ID_ROC_RK3568_PC=board-roc-rk3568-pc
 RUNNER_LOCK_HOST_PATH=/tmp/github-runner-locks
@@ -81,6 +81,16 @@ ENV_FILE=.env.linebridge ./runner.sh restart
 ENV_FILE=.env.yoinspiration ./runner.sh restart
 ```
 
+#### 2.2 PAT permissions
+
+- **GH_PAT** (for registering and managing runners)
+  - **Organization-level Runner** (only `ORG` set, no `REPO`): a Classic PAT needs the **`admin:org`** scope to call the organization's Actions runner registration and related endpoints.
+  - **Repository-level Runner** (both `ORG` and `REPO` set): generally needs **`repo`** (full repository access); if the repository belongs to an organization and runners are managed under the organization, **`admin:org`** may still be required.
+  - With a Fine-grained PAT: grant the permission that allows managing Actions runners on the relevant org/repo (the exact name follows GitHub's current UI).
+
+- **RUNNER_LOCK_MONITOR_TOKEN** (watcher only, read-only access to run status)
+  - A **Fine-grained PAT** with **Actions** set to **Read-only** on the relevant repository is recommended: it carries minimal permissions, and keeping it separate from GH_PAT is safer.
+
 ---
 
 ### 3. Configuring lock-watcher on the host
@@ -97,11 +107,11 @@ The watcher is integrated into `runner.sh` and **can reuse each organization's `.env` directly**, so no
    ```bash
    ENV_FILE=.env.linebridge ./runner.sh watcher
    ```
-   The script automatically uses `ORG`, `REPO`, and `RUNNER_LOCK_DIR` from the current `.env`, plus `RUNNER_RESOURCE_ID_ROC_RK3568_PC` or `RUNNER_RESOURCE_ID_PHYTIUMPI` (roc preferred). To specify a resource ID, pass it as an argument:
+   **Without an argument**, the script monitors **all** boards configured in the current `.env` (both `RUNNER_RESOURCE_ID_ROC_RK3568_PC` and `RUNNER_RESOURCE_ID_PHYTIUMPI` are monitored if set). **With an argument**, it monitors only the specified board:
    ```bash
    ENV_FILE=.env.linebridge ./runner.sh watcher board-roc-rk3568-pc
    ```
-3. Keep the process resident with tmux/screen or systemd; with multiple organizations, run `./runner.sh watcher` in one terminal (or service) per organization. See "3.0.1 Keeping the watcher resident" below.
+3. Keep the process resident with tmux/screen or systemd; with multiple organizations, run `./runner.sh watcher` in one terminal (or service) per organization (one process per organization is enough; it monitors all of that organization's boards). See "3.0.1 Keeping the watcher resident" below.
 
 #### 3.0.1 Keeping the watcher resident (tmux / screen / systemd)
 
@@ -210,24 +220,9 @@ journalctl -u github-runner-watcher-linebridge -f
 
 Generating it yields a token of the form `github_pat_xxx`.
 
-#### 3.3 Create the watcher environment file (for option 2)
+#### 3.3 Install dependencies (required when using the watcher)
 
-Create `.env.watcher` in the repository root (this example monitors `linebridge/test-runner` and the `board-roc-rk3568-pc` board):
-
-```bash
-ORG=linebridge
-REPO=test-runner
-GITHUB_TOKEN=github_pat_xxx
-RUNNER_RESOURCE_ID=board-roc-rk3568-pc
-RUNNER_LOCK_DIR=/tmp/github-runner-locks
-INTERVAL=10
-```
-
-> To monitor another organization (for example `yoinspiration`) separately, create another environment file (such as `.env.watcher.yoinspiration`), change `ORG` / `REPO` / `GITHUB_TOKEN`, and start a second watcher instance.
-
-#### 3.4 Install dependencies
-
-Install `jq` on the host to parse the JSON returned by the GitHub API:
+The watcher relies on `jq` to parse the JSON returned by the GitHub API; without it the run's `status/conclusion` cannot be read and the lock is not cleared after a Cancel. Run on the host:
 
 ```bash
 sudo apt update
@@ -238,27 +233,17 @@ sudo apt install -y jq
 
 ### 4. Starting lock-watcher
 
-**Option 1 (recommended): start via runner.sh, same configuration source as the locks**
+**Start via runner.sh, same configuration source as the locks**
 
 ```bash
 cd /path/to/github-runners
 ENV_FILE=.env.linebridge ./runner.sh watcher
 ```
 
-**Option 2: separate environment file**
-
-On the host, open a long-running terminal (preferably inside tmux/screen or a systemd service):
-
-```bash
-cd /path/to/github-runners
-source .env.watcher
-./runner-wrapper/lock-watcher.sh
-```
-
 On successful startup, the terminal prints something like:
 
 ```text
-[lock-watcher] monitoring linebridge/test-runner, resource=board-roc-rk3568-pc, lock_dir=/tmp/github-runner-locks, interval=10s
+[lock-watcher] monitoring linebridge/test-runner, resources=board-roc-rk3568-pc board-phytiumpi, lock_dir=/tmp/github-runner-locks, interval=10s
 ```
 
 Example logs while running:
@@ -266,13 +251,13 @@ source .env.watcher
 - Lock held normally:
 
 ```text
-[lock-watcher] run_id=22663158623 status=in_progress conclusion= pid=1511 holder_runner=DESKTOP-...-runner-roc-rk3568-pc
+[lock-watcher] resource=board-roc-rk3568-pc run_id=22663158623 status=in_progress conclusion= pid=1511 holder_runner=DESKTOP-...-runner-roc-rk3568-pc
 ```
 
 - After the corresponding workflow is cancelled on GitHub:
 
 ```text
-[lock-watcher] run_id=22663158623 status=completed conclusion=cancelled pid=1511 holder_runner=DESKTOP-...-runner-roc-rk3568-pc
+[lock-watcher] resource=board-roc-rk3568-pc run_id=22663158623 status=completed conclusion=cancelled pid=1511 holder_runner=DESKTOP-...-runner-roc-rk3568-pc
 [lock-watcher] detected cancelled workflow, force releasing lock for board-roc-rk3568-pc
 ```
 
@@ -330,7 +315,7 @@ source .env.watcher
 
 - **Q: The watcher log frequently shows `empty response for run_id=..., skip`?**
   **A:** The Run does not belong to the current `ORG/REPO`, or `GITHUB_TOKEN` lacks sufficient Actions read permission on that repository. Check that:
-  - `ORG` / `REPO` in `.env.watcher` match the repository the Run actually belongs to;
+  - the `ORG` / `REPO` used by the watcher (from the `ENV_FILE` .env or a custom environment file) match the repository the Run actually belongs to;
   - the Fine-grained PAT includes that repository and has the Actions permission set to Read-only.
 
 - **Q: Without jq installed, status / conclusion are always ``?**
diff --git a/runner-wrapper/lock-watcher.sh b/runner-wrapper/lock-watcher.sh
index ad46059..bc19010 100755
--- a/runner-wrapper/lock-watcher.sh
+++ b/runner-wrapper/lock-watcher.sh
@@ -12,14 +12,15 @@ set -euo pipefail
 #   export ORG=yoinspiration
 #   export REPO=test-runner
 #   export GITHUB_TOKEN=github_pat_xxx # needs Actions read-only permission
-#   export RUNNER_RESOURCE_ID=board-roc-rk3568-pc
+#   export RUNNER_RESOURCE_IDS="board-roc board-phytiumpi" # multiple boards, space-separated; or use RUNNER_RESOURCE_ID for a single board
 #   export RUNNER_LOCK_DIR=/tmp/github-runner-locks
 #   ./runner-wrapper/lock-watcher.sh
 #
 # Required environment variables:
 #   ORG, REPO, GITHUB_TOKEN
 # Optional environment variables:
-#   RUNNER_RESOURCE_ID (default: default-hardware, or set your own)
+#   RUNNER_RESOURCE_IDS (space-separated list of lock IDs; passed automatically when launched via runner.sh)
+#   RUNNER_RESOURCE_ID (single lock ID, used when RUNNER_RESOURCE_IDS is unset)
 #   RUNNER_LOCK_DIR (default: /tmp/github-runner-locks)
 #   INTERVAL (polling interval in seconds, default 10)
 
@@ -27,74 +28,78 @@ set -euo pipefail
 : "${REPO:?REPO is required, e.g. test-runner}"
 : "${GITHUB_TOKEN:?GITHUB_TOKEN is required (with Actions read permission)}"
 
-RUNNER_RESOURCE_ID="${RUNNER_RESOURCE_ID:-default-hardware}"
 LOCK_DIR="${RUNNER_LOCK_DIR:-/tmp/github-runner-locks}"
 INTERVAL="${INTERVAL:-10}"
 
+# Resource ID list: supports multiple boards, space-separated
+if [[ -n "${RUNNER_RESOURCE_IDS:-}" ]]; then
+  RESOURCE_IDS=(${RUNNER_RESOURCE_IDS})
+else
+  RESOURCE_IDS=("${RUNNER_RESOURCE_ID:-default-hardware}")
+fi
+
 api_base="${GITHUB_API_URL:-https://api.github.com}"
 
-echo "[lock-watcher] monitoring ${ORG}/${REPO}, resource=${RUNNER_RESOURCE_ID}, lock_dir=${LOCK_DIR}, interval=${INTERVAL}s"
+echo "[lock-watcher] monitoring ${ORG}/${REPO}, resources=${RESOURCE_IDS[*]}, lock_dir=${LOCK_DIR}, interval=${INTERVAL}s"
 
 while true; do
-  holder_file="${LOCK_DIR}/${RUNNER_RESOURCE_ID}.holder"
-
-  if [[ ! -f "${holder_file}" ]]; then
-    sleep "${INTERVAL}"
-    continue
-  fi
-
-  # holder file format: PID RUNNER_NAME RUN_ID RUN_ATTEMPT
-  holder_pid=""
-  holder_runner=""
-  holder_run_id=""
-  holder_run_attempt=""
-  if ! read -r holder_pid holder_runner holder_run_id holder_run_attempt < "${holder_file}"; then
-    echo "[lock-watcher] failed to read holder file ${holder_file}" >&2
-    sleep "${INTERVAL}"
-    continue
-  fi
-
-  if [[ -z "${holder_run_id:-}" || "${holder_run_id}" == "unknown" ]]; then
-    sleep "${INTERVAL}"
-    continue
-  fi
-
-  run_url="${api_base}/repos/${ORG}/${REPO}/actions/runs/${holder_run_id}"
-  resp="$(curl -fsSL \
-    -H "Authorization: Bearer ${GITHUB_TOKEN}" \
-    -H "Accept: application/vnd.github+json" \
-    "${run_url}" || true)"
-
-  if [[ -z "${resp}" ]]; then
-    echo "[lock-watcher] empty response for run_id=${holder_run_id}, skip" >&2
-    sleep "${INTERVAL}"
-    continue
-  fi
-
-  # Prefer jq for parsing; if jq is unavailable, fall back to empty strings (no error)
-  if command -v jq >/dev/null 2>&1; then
-    status="$(printf '%s\n' "${resp}" | jq -r '.status // empty' 2>/dev/null || true)"
-    conclusion="$(printf '%s\n' "${resp}" | jq -r '.conclusion // empty' 2>/dev/null || true)"
-  else
-    status=""
-    conclusion=""
-  fi
-
-  echo "[lock-watcher] run_id=${holder_run_id} status=${status:-} conclusion=${conclusion:-} pid=${holder_pid} holder_runner=${holder_runner}"
-
-  # If the workflow was cancelled, treat this lock as safe to force-release
-  if [[ "${status}" == "completed" && "${conclusion}" == "cancelled" ]]; then
-    echo "[lock-watcher] detected cancelled workflow, force releasing lock for ${RUNNER_RESOURCE_ID}" >&2
-
-    # Try to kill the same-numbered PID on the host (note: container and host PID namespaces differ, so this may miss; best-effort only)
-    if [[ -n "${holder_pid}" && "${holder_pid}" =~ ^[0-9]+$ ]]; then
-      kill "${holder_pid}" 2>/dev/null || true
+  for RUNNER_RESOURCE_ID in "${RESOURCE_IDS[@]}"; do
+    holder_file="${LOCK_DIR}/${RUNNER_RESOURCE_ID}.holder"
+
+    if [[ ! -f "${holder_file}" ]]; then
+      continue
+    fi
+
+    # holder file format: PID RUNNER_NAME RUN_ID RUN_ATTEMPT
+    holder_pid=""
+    holder_runner=""
+    holder_run_id=""
+    holder_run_attempt=""
+    if ! read -r holder_pid holder_runner holder_run_id holder_run_attempt < "${holder_file}"; then
+      echo "[lock-watcher] failed to read holder file ${holder_file}" >&2
+      continue
     fi
 
-    # Remove the holder and matching release markers so later waiters are no longer blocked by the stale lock
-    rm -f "${holder_file}" 2>/dev/null || true
-    rm -f "${LOCK_DIR}/${RUNNER_RESOURCE_ID}."*.release 2>/dev/null || true
-  fi
+    if [[ -z "${holder_run_id:-}" || "${holder_run_id}" == "unknown" ]]; then
+      continue
+    fi
+
+    run_url="${api_base}/repos/${ORG}/${REPO}/actions/runs/${holder_run_id}"
+    resp="$(curl -fsSL \
+      -H "Authorization: Bearer ${GITHUB_TOKEN}" \
+      -H "Accept: application/vnd.github+json" \
+      "${run_url}" || true)"
+
+    if [[ -z "${resp}" ]]; then
+      echo "[lock-watcher] empty response for run_id=${holder_run_id}, skip" >&2
+      continue
+    fi
+
+    # Prefer jq for parsing; if jq is unavailable, fall back to empty strings (no error)
+    if command -v jq >/dev/null 2>&1; then
+      status="$(printf '%s\n' "${resp}" | jq -r '.status // empty' 2>/dev/null || true)"
+      conclusion="$(printf '%s\n' "${resp}" | jq -r '.conclusion // empty' 2>/dev/null || true)"
+    else
+      status=""
+      conclusion=""
+    fi
+
+    echo "[lock-watcher] resource=${RUNNER_RESOURCE_ID} run_id=${holder_run_id} status=${status:-} conclusion=${conclusion:-} pid=${holder_pid} holder_runner=${holder_runner}"
+
+    # If the workflow was cancelled, treat this lock as safe to force-release
+    if [[ "${status}" == "completed" && "${conclusion}" == "cancelled" ]]; then
+      echo "[lock-watcher] detected cancelled workflow, force releasing lock for ${RUNNER_RESOURCE_ID}" >&2
+
+      # Try to kill the same-numbered PID on the host (note: container and host PID namespaces differ, so this may miss; best-effort only)
+      if [[ -n "${holder_pid}" && "${holder_pid}" =~ ^[0-9]+$ ]]; then
+        kill "${holder_pid}" 2>/dev/null || true
+      fi
+
+      # Remove the holder and matching release markers so later waiters are no longer blocked by the stale lock
+      rm -f "${holder_file}" 2>/dev/null || true
+      rm -f "${LOCK_DIR}/${RUNNER_RESOURCE_ID}."*.release 2>/dev/null || true
+    fi
+  done
 
   sleep "${INTERVAL}"
 done
diff --git a/runner.sh b/runner.sh
index 15344c5..8bd8374 100755
--- a/runner.sh
+++ b/runner.sh
@@ -979,19 +979,19 @@ if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
       export ORG REPO
       export RUNNER_LOCK_DIR="${RUNNER_LOCK_DIR:-/tmp/github-runner-locks}"
       if [[ -n "${1:-}" ]]; then
-        export RUNNER_RESOURCE_ID="$1"
+        export RUNNER_RESOURCE_IDS="$1"
       else
-        if [[ -n "${RUNNER_RESOURCE_ID_ROC_RK3568_PC:-}" ]]; then
-          export RUNNER_RESOURCE_ID="${RUNNER_RESOURCE_ID_ROC_RK3568_PC}"
-        elif [[ -n "${RUNNER_RESOURCE_ID_PHYTIUMPI:-}" ]]; then
-          export RUNNER_RESOURCE_ID="${RUNNER_RESOURCE_ID_PHYTIUMPI}"
-        else
+        ids=()
+        [[ -n "${RUNNER_RESOURCE_ID_ROC_RK3568_PC:-}" ]] && ids+=("${RUNNER_RESOURCE_ID_ROC_RK3568_PC}")
+        [[ -n "${RUNNER_RESOURCE_ID_PHYTIUMPI:-}" ]] && ids+=("${RUNNER_RESOURCE_ID_PHYTIUMPI}")
+        if [[ ${#ids[@]} -eq 0 ]]; then
           shell_die "No board resource ID set (RUNNER_RESOURCE_ID_ROC_RK3568_PC / RUNNER_RESOURCE_ID_PHYTIUMPI). Set one in .env or pass: ./runner.sh watcher "
         fi
+        export RUNNER_RESOURCE_IDS="${ids[*]}"
       fi
       WATCHER_SCRIPT="$(cd "$(dirname "$0")" && pwd)/runner-wrapper/lock-watcher.sh"
       [[ -x "${WATCHER_SCRIPT}" ]] || shell_die "lock-watcher.sh not found or not executable: ${WATCHER_SCRIPT}"
-      shell_info "Starting lock-watcher for ${ORG}/${REPO}, resource=${RUNNER_RESOURCE_ID}"
+      shell_info "Starting lock-watcher for ${ORG}/${REPO}, resources=${RUNNER_RESOURCE_IDS}"
       exec "${WATCHER_SCRIPT}"
       ;;
 

From 25aa775290272e95c1b8ce864a3ac76a7b7d4d4b Mon Sep 17 00:00:00 2001
From: yoinspiration
Date: Fri, 6 Mar 2026 11:59:31 +0800
Subject: [PATCH 4/7] feat(watcher): integrate into compose, auto start/stop with start/stop
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- When generating compose, add a lock-watcher service (alpine + jq) if RUNNER_LOCK_MONITOR_TOKEN is configured
- start/stop/restart without arguments act on all services (including the watcher), so lock handling needs no extra steps
- Docs updated: the watcher starts automatically with start, no tmux/systemd needed

Made-with: Cursor
---
 docs/board-lock-watcher.md | 58 +++++++++++---------
 runner.sh | 109 +++++++++++++++++++++++++++----------
 2 files changed, 112 insertions(+),
55 deletions(-) diff --git a/docs/board-lock-watcher.md b/docs/board-lock-watcher.md index 71bc28d..c60842e 100644 --- a/docs/board-lock-watcher.md +++ b/docs/board-lock-watcher.md @@ -9,7 +9,7 @@ - **`runner-wrapper/runner-wrapper.sh`**:为 Runner 注入 Job Started / Completed 钩子。 - **`runner-wrapper/pre-job-lock.sh`**:在 Job 开始前获取板子级文件锁(`flock`),并通过后台子进程持有锁。 - **`runner-wrapper/post-job-lock.sh`**:在 Job 结束时创建 `.release` 标记,唤醒持锁子进程释放锁。 -- **`runner-wrapper/lock-watcher.sh`**(新增):运行在宿主机上的守护脚本,周期性查询某个仓库下 Actions Run 的状态;当发现持锁 Run 已被 **Cancel** 时,强制清理解锁文件,避免后续等待 Job 永久卡死。**一个 watcher 进程可同时监控多块板子**(由 `.env` 中配置的 `RUNNER_RESOURCE_ID_*` 决定)。 +- **`runner-wrapper/lock-watcher.sh`**:周期性查询 GitHub Actions Run 的状态;当发现持锁 Run 已被 **Cancel** 时,强制清理解锁文件,避免后续等待 Job 永久卡死。**一个 watcher 进程可同时监控多块板子**。配置 `RUNNER_LOCK_MONITOR_TOKEN` 后,watcher 会作为 compose 服务随 `./runner.sh start` **自动启动**,与锁机制一样使用无感。 锁文件结构(默认目录 `/tmp/github-runner-locks`): @@ -25,8 +25,8 @@ | 时机 | 要做的事 | |------|----------| -| **第一次在这台机器上部署** | ① 准备各组织的 `.env`(含 ORG、REPO、GH_PAT、板子锁变量、以及要用 watcher 时必填 `RUNNER_LOCK_MONITOR_TOKEN`)
<br>② **在宿主机**设置锁目录权限:`sudo mkdir -p /tmp/github-runner-locks && sudo chmod 1777 /tmp/github-runner-locks`([详见 2.1](#21-宿主机锁目录权限首次部署或报-permission-denied-时))<br>③ 每个组织执行一次 `./runner.sh init -n 2`,生成 compose、起容器并注册<br>④(使用 watcher 时必装)安装 `jq`:`sudo apt install -y jq`([详见 3.3](#33-安装依赖使用-watcher-时必装))<br>
⑤ 每个组织起**一个** watcher 即可:`ENV_FILE=.env.xxx ./runner.sh watcher`(会监控该 .env 下配置的**所有**板子),建议用 tmux/screen 或 systemd 常驻([详见 3.0.1](#301-让-watcher-常驻tmux--screen--systemd)) | -| **以后每次使用(非初次)** | 需要时执行 `ENV_FILE=.env.xxx ./runner.sh start`(每个组织);watcher 若已用 tmux/screen/systemd 常驻则不用管,否则再按上面命令各起一个 | +| **第一次在这台机器上部署** | ① 准备各组织的 `.env`(含 ORG、REPO、GH_PAT、板子锁变量、以及 `RUNNER_LOCK_MONITOR_TOKEN`)
<br>② **在宿主机**设置锁目录权限:`sudo mkdir -p /tmp/github-runner-locks && sudo chmod 1777 /tmp/github-runner-locks`([详见 2.1](#21-宿主机锁目录权限首次部署或报-permission-denied-时))<br>③ 每个组织执行一次 `./runner.sh init -n 2`,生成 compose、起容器并注册<br>
④ watcher 会作为 compose 服务**随 start 自动启动**,无需单独起进程([详见 3](#3-watcher-自动启动与锁使用无感)) | +| **以后每次使用(非初次)** | 执行 `ENV_FILE=.env.xxx ./runner.sh start`(每个组织);watcher 随 runners 一起启停,使用无感 | --- @@ -42,7 +42,7 @@ GH_PAT=ghp_xxx # Runner 注册用,权限见 2.2 RUNNER_RESOURCE_ID_ROC_RK3568_PC=board-roc-rk3568-pc RUNNER_LOCK_HOST_PATH=/tmp/github-runner-locks RUNNER_LOCK_DIR=/tmp/github-runner-locks -# 必填(仅在使用 ./runner.sh watcher 时):Fine-grained PAT,Actions: Read-only +# 必填(watcher 自动启动时用):Fine-grained PAT,Actions: Read-only RUNNER_LOCK_MONITOR_TOKEN=github_pat_xxx ``` @@ -93,27 +93,27 @@ ENV_FILE=.env.yoinspiration ./runner.sh restart --- -### 3. 宿主机上配置 lock-watcher +### 3. Watcher 自动启动(与锁使用无感) -#### 3.0 推荐:通过 runner.sh 启动(与锁同源配置,使用无感) +配置 `RUNNER_LOCK_MONITOR_TOKEN` 后,`./runner.sh compose` 或 `init` 生成的 compose 会包含 **lock-watcher** 服务。执行 `./runner.sh start` 时,watcher 会随 runners 一起启动;`stop` / `restart` 时一起停止,**无需单独开终端或 systemd**。 -Watcher 已集成进 `runner.sh`,**可直接复用各组织的 `.env`**,无需单独维护 `.env.watcher`: - -1. 在对应组织的 `.env` 中增加一行(必填): +1. 在对应组织的 `.env` 中增加(必填): ```bash RUNNER_LOCK_MONITOR_TOKEN=github_pat_xxx # Fine-grained PAT,Actions: Read-only ``` -2. 在宿主机执行(与 start/restart 同一套 ENV_FILE): +2. 若已有 compose,需重新生成以加入 watcher: ```bash - ENV_FILE=.env.linebridge ./runner.sh watcher + ENV_FILE=.env.linebridge ./runner.sh compose ``` - **不传参时**:脚本会监控当前 `.env` 中配置的**所有**板子(`RUNNER_RESOURCE_ID_ROC_RK3568_PC` 与 `RUNNER_RESOURCE_ID_PHYTIUMPI` 若已设置则都会监控)。**传参时**只监控指定的一块板子: +3. 之后执行 `./runner.sh start` 即可,watcher 自动随 runners 启停。 + +**手动单独运行**(可选):若需在 compose 外单独跑 watcher(例如另一台机器),仍可使用: ```bash - ENV_FILE=.env.linebridge ./runner.sh watcher board-roc-rk3568-pc + ENV_FILE=.env.linebridge ./runner.sh watcher ``` -3. 
建议用 tmux/screen 或 systemd 常驻该进程;多组织时每个组织各开一个终端(或服务)运行 `./runner.sh watcher`(每个组织一个进程即可,该进程会监控该组织下所有板子)。具体做法见下文「3.0.1 让 watcher 常驻」。 + 建议用 tmux/screen 常驻。传参可指定只监控一块板子:`./runner.sh watcher board-roc-rk3568-pc`。 -#### 3.0.1 让 watcher 常驻(tmux / screen / systemd) +#### 3.0.1 手动常驻(仅在不使用 compose 自动启动时) 任选一种方式,使 watcher 在断开 SSH 或重启后仍可运行。 @@ -208,8 +208,8 @@ journalctl -u github-runner-watcher-linebridge -f #### 3.1 实例数量建议 -- 默认推荐:**每个参与共享同一块板子的仓库,各启动一个 watcher 实例**(即每个组织一份 `.env`,各执行一次 `./runner.sh watcher`)。 -- 例如:`linebridge/test-runner`、`yoinspiration/test-runner` 分别执行 `ENV_FILE=.env.linebridge ./runner.sh watcher` 与 `ENV_FILE=.env.yoinspiration ./runner.sh watcher`。 +- **使用 compose 自动启动**:每组织一个 watcher 容器,随 `./runner.sh start` 自动拉起;无需额外配置。 +- **手动运行**:每组织一个 watcher 进程,分别执行 `ENV_FILE=.env.linebridge ./runner.sh watcher` 与 `ENV_FILE=.env.yoinspiration ./runner.sh watcher`。 #### 3.2 准备 PAT @@ -220,23 +220,31 @@ journalctl -u github-runner-watcher-linebridge -f 生成后得到 `github_pat_xxx`。 -#### 3.3 安装依赖(使用 watcher 时必装) +#### 3.3 安装依赖 + +- **compose 自动启动**:watcher 容器使用 alpine,启动时自动安装 `jq`,宿主机无需安装。 +- **手动运行 watcher**:需在宿主机安装 `jq`:`sudo apt install -y jq`,否则无法解析 run 状态。 + +--- -watcher 依赖 `jq` 解析 GitHub API 返回的 JSON;未安装时无法识别 run 的 `status/conclusion`,不会触发 Cancel 后清锁。在宿主机上执行: +### 4. 启动与验证 + +**compose 自动启动(推荐)** ```bash -sudo apt update -sudo apt install -y jq +cd /path/to/github-runners +ENV_FILE=.env.linebridge ./runner.sh start ``` ---- +watcher 会随 runners 一起启动。查看 watcher 日志: -### 4. 
启动 lock-watcher +```bash +docker logs -f $(docker ps -q -f name=lock-watcher) +``` -**用 runner.sh 启动,与锁同源配置** +**手动运行 watcher**(可选) ```bash -cd /path/to/github-runners ENV_FILE=.env.linebridge ./runner.sh watcher ``` diff --git a/runner.sh b/runner.sh index 8bd8374..49a0b84 100755 --- a/runner.sh +++ b/runner.sh @@ -708,6 +708,34 @@ shell_generate_compose_file() { " - ${RUNNER_NAME_PREFIX}runner-roc-rk3568-pc-data:/home/runner" \ " - ${RUNNER_NAME_PREFIX}runner-roc-rk3568-pc-udev-rules:/etc/udev/rules.d" \ "" >> "${COMPOSE_FILE}" + + # lock-watcher:当配置了 RUNNER_LOCK_MONITOR_TOKEN 时自动加入,与 start/stop 一起启停 + if [[ -n "${RUNNER_LOCK_MONITOR_TOKEN:-}" ]]; then + local watcher_resource_ids=() + [[ -n "$res_phytiumpi" ]] && watcher_resource_ids+=("$res_phytiumpi") + [[ -n "$res_roc" ]] && watcher_resource_ids+=("$res_roc") + local watcher_ids_str="${watcher_resource_ids[*]}" + printf '%s\n' \ + " # lock-watcher:Cancel workflow 后自动清锁,与锁机制配套" \ + " ${RUNNER_NAME_PREFIX}lock-watcher:" \ + " image: alpine:3.19" \ + " container_name: \"${RUNNER_NAME_PREFIX}lock-watcher\"" \ + " restart: unless-stopped" \ + " command:" \ + " - /bin/sh" \ + " - -c" \ + " - \"apk add --no-cache bash curl jq && exec /watcher/lock-watcher.sh\"" \ + " environment:" \ + " ORG: \"${ORG}\"" \ + " REPO: \"${REPO}\"" \ + " GITHUB_TOKEN: \"\${RUNNER_LOCK_MONITOR_TOKEN}\"" \ + " RUNNER_LOCK_DIR: \"${RUNNER_LOCK_DIR:-/tmp/github-runner-locks}\"" \ + " RUNNER_RESOURCE_IDS: \"${watcher_ids_str}\"" \ + " volumes:" \ + " - ./runner-wrapper:/watcher:ro" \ + " - ${RUNNER_LOCK_HOST_PATH:-/tmp/github-runner-locks}:${RUNNER_LOCK_DIR:-/tmp/github-runner-locks}" \ + "" >> "${COMPOSE_FILE}" + fi fi # 生成 volumes @@ -1076,18 +1104,25 @@ if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then shell_info "No Runner containers to start!" exit 0 fi - else - mapfile -t ids < <(docker_list_existing_containers) || ids=() - if [[ ${#ids[@]} -eq 0 ]]; then - shell_info "No Runner containers to start!" 
- exit 0 + shell_info "Starting ${#ids[@]} container(s): ${ids[*]}" + if [[ -f "$COMPOSE_FILE" ]]; then + $DC -f "$COMPOSE_FILE" up -d "${ids[@]}" + else + docker start "${ids[@]}" fi - fi - shell_info "Starting ${#ids[@]} container(s): ${ids[*]}" - if [[ -f "$COMPOSE_FILE" ]]; then - $DC -f "$COMPOSE_FILE" up -d "${ids[@]}" else - docker start "${ids[@]}" + if [[ -f "$COMPOSE_FILE" ]]; then + shell_info "Starting all services (runners + lock-watcher if configured)" + $DC -f "$COMPOSE_FILE" up -d + else + mapfile -t ids < <(docker_list_existing_containers) || ids=() + if [[ ${#ids[@]} -eq 0 ]]; then + shell_info "No Runner containers to start!" + exit 0 + fi + shell_info "Starting ${#ids[@]} container(s): ${ids[*]}" + docker start "${ids[@]}" + fi fi ;; @@ -1106,18 +1141,25 @@ if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then shell_info "No Runner containers to stop!" exit 0 fi - else - mapfile -t ids < <(docker_list_existing_containers) || ids=() - if [[ ${#ids[@]} -eq 0 ]]; then - shell_info "No Runner containers to stop!" - exit 0 + shell_info "Stopping ${#ids[@]} container(s): ${ids[*]}" + if [[ -f "$COMPOSE_FILE" ]]; then + $DC -f "$COMPOSE_FILE" stop "${ids[@]}" + else + docker stop "${ids[@]}" fi - fi - shell_info "Stopping ${#ids[@]} container(s): ${ids[*]}" - if [[ -f "$COMPOSE_FILE" ]]; then - $DC -f "$COMPOSE_FILE" stop "${ids[@]}" else - docker stop "${ids[@]}" + if [[ -f "$COMPOSE_FILE" ]]; then + shell_info "Stopping all services (runners + lock-watcher if configured)" + $DC -f "$COMPOSE_FILE" stop + else + mapfile -t ids < <(docker_list_existing_containers) || ids=() + if [[ ${#ids[@]} -eq 0 ]]; then + shell_info "No Runner containers to stop!" + exit 0 + fi + shell_info "Stopping ${#ids[@]} container(s): ${ids[*]}" + docker stop "${ids[@]}" + fi fi ;; @@ -1136,18 +1178,25 @@ if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then shell_info "No Runner containers to restart!" 
exit 0 fi - else - mapfile -t ids < <(docker_list_existing_containers) || ids=() - if [[ ${#ids[@]} -eq 0 ]]; then - shell_info "No Runner containers to restart!" - exit 0 + shell_info "Restarting ${#ids[@]} container(s): ${ids[*]}" + if [[ -f "$COMPOSE_FILE" ]]; then + $DC -f "$COMPOSE_FILE" restart "${ids[@]}" + else + docker restart "${ids[@]}" fi - fi - shell_info "Restarting ${#ids[@]} container(s): ${ids[*]}" - if [[ -f "$COMPOSE_FILE" ]]; then - $DC -f "$COMPOSE_FILE" restart "${ids[@]}" else - docker restart "${ids[@]}" + if [[ -f "$COMPOSE_FILE" ]]; then + shell_info "Restarting all services (runners + lock-watcher if configured)" + $DC -f "$COMPOSE_FILE" restart + else + mapfile -t ids < <(docker_list_existing_containers) || ids=() + if [[ ${#ids[@]} -eq 0 ]]; then + shell_info "No Runner containers to restart!" + exit 0 + fi + shell_info "Restarting ${#ids[@]} container(s): ${ids[*]}" + docker restart "${ids[@]}" + fi fi ;; From 22dd4b5b919e03454e0c1278962ae770e4184dc1 Mon Sep 17 00:00:00 2001 From: yoinspiration Date: Thu, 19 Mar 2026 14:16:32 +0800 Subject: [PATCH 5/7] docs: add lock permission troubleshooting guidance Document pre-job lock permission failure recovery in both READMEs and update .env.example to recommend a persistent host lock path under /var/tmp while keeping the container lock path unchanged. 
Made-with: Cursor --- .env.example | 4 +++- README.md | 42 ++++++++++++++++++++++++++++++++++++++++++ README_CN.md | 42 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 87 insertions(+), 1 deletion(-) diff --git a/.env.example b/.env.example index 6459949..a68adfc 100644 --- a/.env.example +++ b/.env.example @@ -24,8 +24,10 @@ GH_PAT= # 多组织共享同一块板时显式设相同 ID: # RUNNER_RESOURCE_ID_PHYTIUMPI=board-phytiumpi # RUNNER_RESOURCE_ID_ROC_RK3568_PC=board-roc-rk3568-pc +# 容器内路径,通常保持默认: # RUNNER_LOCK_DIR=/tmp/github-runner-locks -# RUNNER_LOCK_HOST_PATH=/tmp/github-runner-locks +# 宿主机路径,建议使用持久目录避免 /tmp 清理导致权限漂移: +# RUNNER_LOCK_HOST_PATH=/var/tmp/github-runner-locks # ---------- 可选:注册 token 缓存 ---------- # REG_TOKEN_CACHE_FILE=.reg_token.cache diff --git a/README.md b/README.md index e197145..0372453 100644 --- a/README.md +++ b/README.md @@ -112,6 +112,48 @@ RUNNER_RESOURCE_ID_PHYTIUMPI=board-phytiumpi **Note**: Serialization is a hardware limitation. This approach transforms "chaotic contention" into "ordered queuing" without reducing throughput. Different boards using their own lock IDs can run in parallel.See [runner-wrapper/README.md](runner-wrapper/README.md) for details. Reference: [Discussion #341](https://github.com/orgs/arceos-hypervisor/discussions/341). +### Troubleshooting: `Permission denied` in `pre-job-lock.sh` + +If job logs show errors like: + +- `chmod: changing permissions of '/tmp/github-runner-locks': Operation not permitted` +- `/tmp/github-runner-locks/board-xxx.lock: Permission denied` + +the host lock directory permissions are usually incorrect, or the lock directory is placed under host `/tmp` and gets cleaned up by the system, causing permission drift. 
+ +Recommended configuration (keep consistent across organizations): + +- `RUNNER_LOCK_HOST_PATH=/var/tmp/github-runner-locks` (persistent host directory) +- `RUNNER_LOCK_DIR=/tmp/github-runner-locks` (container path, keep default) + +One-time fix: + +```bash +# 1) Create host directory and set sticky-bit permissions (1777) +sudo mkdir -p /var/tmp/github-runner-locks +sudo chown root:root /var/tmp/github-runner-locks +sudo chmod 1777 /var/tmp/github-runner-locks + +# 2) Update each org's .env +# RUNNER_LOCK_HOST_PATH=/var/tmp/github-runner-locks +# RUNNER_LOCK_DIR=/tmp/github-runner-locks + +# 3) Regenerate compose and recreate containers to apply mounts +ENV_FILE=.env. ./runner.sh compose +ENV_FILE=.env. ./runner.sh stop +ENV_FILE=.env. ./runner.sh start +``` + +Verify: + +```bash +ls -ld /var/tmp/github-runner-locks +# expected: drwxrwxrwt + +docker inspect --format '{{range .Mounts}}{{println .Source "->" .Destination}}{{end}}' +# expected to include: /var/tmp/github-runner-locks -> /tmp/github-runner-locks +``` + ## Contributing ```bash diff --git a/README_CN.md b/README_CN.md index 30ed56c..ab28339 100644 --- a/README_CN.md +++ b/README_CN.md @@ -112,6 +112,48 @@ RUNNER_RESOURCE_ID_PHYTIUMPI=board-phytiumpi **注意**:串行是硬件本身的限制,本方案把「无秩序抢占」变为「有序排队」,不额外降低吞吐。不同板子使用各自锁 ID 可并行执行。详见 [runner-wrapper/README.md](runner-wrapper/README.md),参考 [Discussion #341](https://github.com/orgs/arceos-hypervisor/discussions/341)。 +### 常见问题:`pre-job-lock.sh` 报 `Permission denied` + +如果 Job 日志中出现类似报错: + +- `chmod: changing permissions of '/tmp/github-runner-locks': Operation not permitted` +- `/tmp/github-runner-locks/board-xxx.lock: Permission denied` + +通常是宿主机锁目录权限不正确,或把锁目录放在宿主机 `/tmp` 后被系统清理导致权限漂移。 + +推荐配置(多组织保持一致): + +- `RUNNER_LOCK_HOST_PATH=/var/tmp/github-runner-locks`(宿主机持久目录) +- `RUNNER_LOCK_DIR=/tmp/github-runner-locks`(容器内路径,保持默认) + +一次性修复步骤: + +```bash +# 1) 在宿主机创建并设置目录权限(sticky bit 1777) +sudo mkdir -p /var/tmp/github-runner-locks +sudo chown root:root 
/var/tmp/github-runner-locks +sudo chmod 1777 /var/tmp/github-runner-locks + +# 2) 修改每个组织对应 .env +# RUNNER_LOCK_HOST_PATH=/var/tmp/github-runner-locks +# RUNNER_LOCK_DIR=/tmp/github-runner-locks + +# 3) 重新生成 compose,并重建容器使挂载生效 +ENV_FILE=.env. ./runner.sh compose +ENV_FILE=.env. ./runner.sh stop +ENV_FILE=.env. ./runner.sh start +``` + +验证: + +```bash +ls -ld /var/tmp/github-runner-locks +# 期望:drwxrwxrwt + +docker inspect --format '{{range .Mounts}}{{println .Source "->" .Destination}}{{end}}' +# 期望包含:/var/tmp/github-runner-locks -> /tmp/github-runner-locks +``` + ## 贡献 ```bash From fa7c3eadbf93846c9dbfa2a08dbbbe126c0050e6 Mon Sep 17 00:00:00 2001 From: yoinspiration Date: Sat, 21 Mar 2026 08:05:30 +0800 Subject: [PATCH 6/7] =?UTF-8?q?feat:=20libclang/bindgen=20=E4=BE=9D?= =?UTF-8?q?=E8=B5=96=E4=B8=8E=E9=94=81=E7=9B=AE=E5=BD=95=E6=A0=A1=E9=AA=8C?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Dockerfile: 安装 cmake/clang/libclang-dev,软链 libclang.so 供 bindgen - pre-job-lock: 创建锁目录失败或不可写时立即退出并提示修复 Made-with: Cursor --- Dockerfile | 10 ++++++++++ runner-wrapper/pre-job-lock.sh | 14 +++++++++++++- 2 files changed, 23 insertions(+), 1 deletion(-) diff --git a/Dockerfile b/Dockerfile index 0fedc8a..50d8125 100644 --- a/Dockerfile +++ b/Dockerfile @@ -55,10 +55,20 @@ RUN apt-get update \ python3-pip \ python3-tomli \ python3-sphinx \ + cmake \ + clang \ + libclang-dev \ ninja-build \ libslirp0 \ && rm -rf /var/lib/apt/lists/* +# Ensure bindgen can find libclang.so in a stable location. 
+RUN set -eux; \ + libclang_path="$(ls -1 /usr/lib/llvm-*/lib/libclang.so 2>/dev/null | head -n1)"; \ + if [ -n "${libclang_path}" ]; then \ + ln -sf "${libclang_path}" /usr/lib/libclang.so; \ + fi + # Build and install QEMU 10.1.2 from source RUN mkdir -p /tmp/qemu-build \ && cd /tmp/qemu-build \ diff --git a/runner-wrapper/pre-job-lock.sh b/runner-wrapper/pre-job-lock.sh index 2d22ec6..a7c016a 100644 --- a/runner-wrapper/pre-job-lock.sh +++ b/runner-wrapper/pre-job-lock.sh @@ -18,8 +18,20 @@ LOCK_FILE="${LOCK_DIR}/${RESOURCE_ID}.lock" RELEASE_FILE="${LOCK_DIR}/${RESOURCE_ID}.${RUN_KEY}.release" HOLDER_PID_FILE="${LOCK_DIR}/${RESOURCE_ID}.holder" -mkdir -p "${LOCK_DIR}" +if ! mkdir -p "${LOCK_DIR}" 2>/dev/null; then + echo "[$(date -Iseconds)] ❌ Cannot create lock dir ${LOCK_DIR}" >&2 + exit 1 +fi chmod 1777 "${LOCK_DIR}" || true + +# 如果目录不可写,给出明确提示后退出 +if ! touch "${LOCK_DIR}/.write-test" 2>/dev/null; then + echo "[$(date -Iseconds)] ❌ Lock dir ${LOCK_DIR} is not writable by user $(id -un)." 
>&2 + echo "[$(date -Iseconds)] Fix on runner host: sudo chmod 1777 ${LOCK_DIR}" >&2 + exit 1 +fi +rm -f "${LOCK_DIR}/.write-test" || true + # 清理当前 run 的残留释放标记,避免误判为可释放 rm -f "${RELEASE_FILE}" || true From 19b01206dff5b5469878722b5db24821a8933a36 Mon Sep 17 00:00:00 2001 From: yoinspiration Date: Sun, 22 Mar 2026 23:46:33 +0800 Subject: [PATCH 7/7] =?UTF-8?q?docs:=20=E5=A4=9A=E7=BB=84=E7=BB=87?= =?UTF-8?q?=E9=83=A8=E7=BD=B2=E6=8C=87=E5=8D=97=E4=B8=8E=20README=20?= =?UTF-8?q?=E9=93=BE=E6=8E=A5=EF=BC=9Bpre-job-lock=20=E7=AD=89=E5=BE=85?= =?UTF-8?q?=E9=94=81=E6=97=B6=E5=91=A8=E6=9C=9F=E6=97=A5=E5=BF=97?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - 新增 docs/多组织部署指南.md,README 指向完整文档 - pre-job-lock: flock 阻塞时每 10s 输出等待进度;chmod 静默失败避免噪声 Made-with: Cursor --- README_CN.md | 2 + ...50\347\275\262\346\214\207\345\215\227.md" | 202 ++++++++++++++++++ runner-wrapper/pre-job-lock.sh | 19 +- 3 files changed, 220 insertions(+), 3 deletions(-) create mode 100644 "docs/\345\244\232\347\273\204\347\273\207\351\203\250\347\275\262\346\214\207\345\215\227.md" diff --git a/README_CN.md b/README_CN.md index ab28339..99d7391 100644 --- a/README_CN.md +++ b/README_CN.md @@ -79,6 +79,8 @@ name:label1[,label2];name2:label1 ## 多组织共享 +> **完整文档**:参见 [docs/多组织部署指南.md](docs/多组织部署指南.md),含部署方式、环境变量、故障排查等。 + 当前脚本实现了在同一台主机上运行多个 Docker 容器,分别注册到不同的 GitHub 组织,即使这些容器需要访问同一物理硬件(如开发板、串口、电源控制等),也不会导致 CI 会导致资源冲突。 ### 场景说明 diff --git "a/docs/\345\244\232\347\273\204\347\273\207\351\203\250\347\275\262\346\214\207\345\215\227.md" "b/docs/\345\244\232\347\273\204\347\273\207\351\203\250\347\275\262\346\214\207\345\215\227.md" new file mode 100644 index 0000000..59dbdc1 --- /dev/null +++ "b/docs/\345\244\232\347\273\204\347\273\207\351\203\250\347\275\262\346\214\207\345\215\227.md" @@ -0,0 +1,202 @@ +# 多组织部署完整指南 + +本文档是 **在同一台主机上为多个 GitHub 组织部署 Runner** 的完整参考,涵盖概念、配置、部署与故障排查。 + +--- + +## 文档导航 + +| 文档 | 适用场景 | +|------|----------| +| 本文档 | 多组织整体概念与快速上手 | +| 
[多组织共享Runner使用说明](./多组织共享Runner使用说明.md) | 分步操作、验证方法、Cancel 场景 | +| [runner-wrapper-multi-org-lock](./runner-wrapper-multi-org-lock.md) | 锁机制原理与架构设计 | +| [board-lock-watcher](./board-lock-watcher.md) | Cancel 后安全恢复、watcher 配置 | + +--- + +## 1. 概念说明 + +### 1.1 什么是多组织部署 + +**多组织部署** 指在同一台物理主机上运行多套 GitHub Actions Runner 容器,分别注册到不同的组织(或仓库),并可选地共享同一块硬件测试板卡。 + +``` +主机 +├── 组织 A 的 Runner 容器 → 注册到 Org-A +├── 组织 B 的 Runner 容器 → 注册到 Org-B +└── 物理硬件(如 phytiumpi 开发板)← 两套 Runner 均可访问 +``` + +### 1.2 为什么需要多组织 + +- **GitHub 限制**:一个 runner 只能注册到一个目标(repo/org/enterprise),无法同时服务多个组织。 +- **硬件共享**:多块板卡成本高,多个组织希望复用同一台主机上的开发板做 CI 测试。 +- **本方案**:通过 Docker 在同一主机运行多套独立 runner 实例,配合 **runner-wrapper 文件锁** 实现硬件访问的串行协调。 + +### 1.3 核心能力 + +| 能力 | 说明 | +|------|------| +| 同板卡任务串行 | 同一块板子的 Job 排队执行,避免硬件冲突 | +| 异板卡任务并行 | 不同板子的 Job 可同时运行 | +| 容器命名隔离 | 按 ORG/REPO 自动生成前缀,避免重名 | +| Cancel 安全恢复 | 配合 lock-watcher 支持网页 Cancel 后正常解锁 | + +--- + +## 2. 部署方式 + +### 2.1 方式一:同一目录 + ENV_FILE(推荐) + +使用同一份代码,通过 `ENV_FILE` 区分组织: + +```bash +# 组织 A +ENV_FILE=.env.orgA ./runner.sh init -n 2 + +# 组织 B(同一目录) +ENV_FILE=.env.orgB ./runner.sh init -n 2 +``` + +- **优点**:配置简单,代码与脚本统一更新。 +- **适用**:同一团队维护多组织,或快速验证。 + +### 2.2 方式二:不同目录各自部署 + +每个组织使用独立工作目录: + +```bash +# 组织 A +cd /opt/runners/org-a +cp .env.example .env # 编辑 ORG、GH_PAT、锁变量等 +./runner.sh init -n 2 + +# 组织 B +cd /opt/runners/org-b +cp .env.example .env +./runner.sh init -n 2 +``` + +- **优点**:权限与配置完全隔离,适合不同团队维护。 +- **注意**:多组织共享同一块板时,`RUNNER_RESOURCE_ID_*` 与 `RUNNER_LOCK_HOST_PATH` 必须一致。 + +--- + +## 3. 
环境变量配置 + +### 3.1 每个组织必备 + +```env +ORG=your-org-name +GH_PAT=ghp_xxxx # Classic PAT,需 admin:org + +# 若为仓库级 Runner +REPO=your-repo-name +``` + +### 3.2 多组织共享同一块板时(必设相同值) + +```env +RUNNER_RESOURCE_ID_PHYTIUMPI=board-phytiumpi +RUNNER_RESOURCE_ID_ROC_RK3568_PC=board-roc-rk3568-pc +RUNNER_LOCK_HOST_PATH=/var/tmp/github-runner-locks # 推荐持久目录 +RUNNER_LOCK_DIR=/tmp/github-runner-locks +``` + +**要点**:两套配置的 `RUNNER_RESOURCE_ID_*` 和 `RUNNER_LOCK_HOST_PATH` 必须完全一致,否则无法实现同板卡串行。 + +### 3.3 可选:Cancel 后安全恢复 + +需要网页 Cancel 后能正常解锁时,增加: + +```env +RUNNER_LOCK_MONITOR_TOKEN=github_pat_xxx # Fine-grained PAT,Actions: Read-only +``` + +详见 [board-lock-watcher](./board-lock-watcher.md)。 + +--- + +## 4. 快速上手 + +### 4.1 宿主机准备(首次部署) + +```bash +sudo mkdir -p /var/tmp/github-runner-locks +sudo chown root:root /var/tmp/github-runner-locks +sudo chmod 1777 /var/tmp/github-runner-locks +``` + +### 4.2 为每个组织准备 .env + +```bash +# .env.orgA +ORG=org-a +GH_PAT=ghp_aaa +RUNNER_RESOURCE_ID_PHYTIUMPI=board-phytiumpi +RUNNER_RESOURCE_ID_ROC_RK3568_PC=board-roc-rk3568-pc +RUNNER_LOCK_HOST_PATH=/var/tmp/github-runner-locks +RUNNER_LOCK_DIR=/tmp/github-runner-locks +``` + +```bash +# .env.orgB(锁变量与 orgA 一致) +ORG=org-b +GH_PAT=ghp_bbb +RUNNER_RESOURCE_ID_PHYTIUMPI=board-phytiumpi +RUNNER_RESOURCE_ID_ROC_RK3568_PC=board-roc-rk3568-pc +RUNNER_LOCK_HOST_PATH=/var/tmp/github-runner-locks +RUNNER_LOCK_DIR=/tmp/github-runner-locks +``` + +### 4.3 初始化与检查 + +```bash +ENV_FILE=.env.orgA ./runner.sh init -n 2 +ENV_FILE=.env.orgB ./runner.sh init -n 2 + +ENV_FILE=.env.orgA ./runner.sh ps +ENV_FILE=.env.orgB ./runner.sh ps +``` + +--- + +## 5. 常用操作 + +| 操作 | 命令示例 | +|------|----------| +| 启动 | `ENV_FILE=.env.orgA ./runner.sh start` | +| 停止 | `ENV_FILE=.env.orgA ./runner.sh stop` | +| 查看状态 | `ENV_FILE=.env.orgA ./runner.sh list` | +| 配置变更后重建 | `ENV_FILE=.env.orgA ./runner.sh compose` 后 `docker compose -f docker-compose..yml up -d --force-recreate` | + +--- + +## 6. 
常见问题 + +### 6.1 `pre-job-lock.sh` 报 Permission denied + +锁目录权限不正确。推荐使用 `/var/tmp/github-runner-locks` 等持久目录,并执行: + +```bash +sudo chmod 1777 /var/tmp/github-runner-locks +``` + +各组织 `.env` 中设置 `RUNNER_LOCK_HOST_PATH=/var/tmp/github-runner-locks`,然后重新 `compose` 并重启容器。 + +### 6.2 一直 Waiting for a runner + +检查:runner 是否 online、标签是否匹配、Runner group 是否授权目标仓库。 + +### 6.3 容器命名冲突 + +脚本会按 `ORG`(及 `REPO`)自动生成前缀,如 `-orgA-runner-1`。若仍有冲突,可显式设置 `RUNNER_NAME_PREFIX`。 + +--- + +## 7. 参考资料 + +- [README 多组织共享](../README_CN.md#多组织共享) +- [runner-wrapper README](../runner-wrapper/README.md) +- [Discussion #341: 多组织共享集成测试环境问题分析与解决方案](https://github.com/orgs/arceos-hypervisor/discussions/341) diff --git a/runner-wrapper/pre-job-lock.sh b/runner-wrapper/pre-job-lock.sh index a7c016a..95ff232 100644 --- a/runner-wrapper/pre-job-lock.sh +++ b/runner-wrapper/pre-job-lock.sh @@ -22,7 +22,7 @@ if ! mkdir -p "${LOCK_DIR}" 2>/dev/null; then echo "[$(date -Iseconds)] ❌ Cannot create lock dir ${LOCK_DIR}" >&2 exit 1 fi -chmod 1777 "${LOCK_DIR}" || true +chmod 1777 "${LOCK_DIR}" 2>/dev/null || true # 如果目录不可写,给出明确提示后退出 if ! touch "${LOCK_DIR}/.write-test" 2>/dev/null; then @@ -37,9 +37,22 @@ rm -f "${RELEASE_FILE}" || true # 打开锁文件并获取排他锁(阻塞等待) exec 200>"${LOCK_FILE}" -chmod 666 "${LOCK_FILE}" || true +chmod 666 "${LOCK_FILE}" 2>/dev/null || true echo "[$(date -Iseconds)] ⏳ Waiting for lock: ${RESOURCE_ID}" >&2 +# 后台每 10s 打印一次,便于在第二个 job 的日志中看到等待状态(避免在 echo 中嵌套括号与引号,防止部分 bash 误解析) +( + i=0 + while true; do + sleep 10 + i=$((i + 10)) + ts="$(date -Iseconds)" + printf '%s ⏳ Still waiting for lock: %s after %ss\n' "${ts}" "${RESOURCE_ID}" "${i}" >&2 + done +) & +WAITER_PID=$! 
flock -x 200 +kill "${WAITER_PID}" 2>/dev/null || true +wait "${WAITER_PID}" 2>/dev/null || true echo "[$(date -Iseconds)] ✅ Acquired lock for ${RESOURCE_ID}" >&2 # 后台子进程继承 fd 200 并持有锁,等待 post-job 创建释放文件 @@ -50,7 +63,7 @@ echo "[$(date -Iseconds)] ✅ Acquired lock for ${RESOURCE_ID}" >&2 "${RUNNER_NAME_SAFE}" \ "${RUN_ID_SAFE}" \ "${RUN_ATTEMPT_SAFE}" > "${HOLDER_PID_FILE}" - chmod 666 "${HOLDER_PID_FILE}" || true + chmod 666 "${HOLDER_PID_FILE}" 2>/dev/null || true while [ ! -f "${RELEASE_FILE}" ]; do sleep 1