diff --git a/.env.example b/.env.example index 6459949..a68adfc 100644 --- a/.env.example +++ b/.env.example @@ -24,8 +24,10 @@ GH_PAT= # 多组织共享同一块板时显式设相同 ID: # RUNNER_RESOURCE_ID_PHYTIUMPI=board-phytiumpi # RUNNER_RESOURCE_ID_ROC_RK3568_PC=board-roc-rk3568-pc +# 容器内路径,通常保持默认: # RUNNER_LOCK_DIR=/tmp/github-runner-locks -# RUNNER_LOCK_HOST_PATH=/tmp/github-runner-locks +# 宿主机路径,建议使用持久目录避免 /tmp 清理导致权限漂移: +# RUNNER_LOCK_HOST_PATH=/var/tmp/github-runner-locks # ---------- 可选:注册 token 缓存 ---------- # REG_TOKEN_CACHE_FILE=.reg_token.cache diff --git a/Dockerfile b/Dockerfile index 0fedc8a..50d8125 100644 --- a/Dockerfile +++ b/Dockerfile @@ -55,10 +55,20 @@ RUN apt-get update \ python3-pip \ python3-tomli \ python3-sphinx \ + cmake \ + clang \ + libclang-dev \ ninja-build \ libslirp0 \ && rm -rf /var/lib/apt/lists/* +# Ensure bindgen can find libclang.so in a stable location. +RUN set -eux; \ + libclang_path="$(ls -1 /usr/lib/llvm-*/lib/libclang.so 2>/dev/null | head -n1)"; \ + if [ -n "${libclang_path}" ]; then \ + ln -sf "${libclang_path}" /usr/lib/libclang.so; \ + fi + # Build and install QEMU 10.1.2 from source RUN mkdir -p /tmp/qemu-build \ && cd /tmp/qemu-build \ diff --git a/README.md b/README.md index e197145..0372453 100644 --- a/README.md +++ b/README.md @@ -112,6 +112,48 @@ RUNNER_RESOURCE_ID_PHYTIUMPI=board-phytiumpi **Note**: Serialization is a hardware limitation. This approach transforms "chaotic contention" into "ordered queuing" without reducing throughput. Different boards using their own lock IDs can run in parallel. See [runner-wrapper/README.md](runner-wrapper/README.md) for details. Reference: [Discussion #341](https://github.com/orgs/arceos-hypervisor/discussions/341). 
+### Troubleshooting: `Permission denied` in `pre-job-lock.sh` + +If job logs show errors like: + +- `chmod: changing permissions of '/tmp/github-runner-locks': Operation not permitted` +- `/tmp/github-runner-locks/board-xxx.lock: Permission denied` + +the host lock directory permissions are usually incorrect, or the lock directory is placed under host `/tmp` and gets cleaned up by the system, causing permission drift. + +Recommended configuration (keep consistent across organizations): + +- `RUNNER_LOCK_HOST_PATH=/var/tmp/github-runner-locks` (persistent host directory) +- `RUNNER_LOCK_DIR=/tmp/github-runner-locks` (container path, keep default) + +One-time fix: + +```bash +# 1) Create host directory and set sticky-bit permissions (1777) +sudo mkdir -p /var/tmp/github-runner-locks +sudo chown root:root /var/tmp/github-runner-locks +sudo chmod 1777 /var/tmp/github-runner-locks + +# 2) Update each org's .env +# RUNNER_LOCK_HOST_PATH=/var/tmp/github-runner-locks +# RUNNER_LOCK_DIR=/tmp/github-runner-locks + +# 3) Regenerate compose and recreate containers to apply mounts +ENV_FILE=.env. ./runner.sh compose +ENV_FILE=.env. ./runner.sh stop +ENV_FILE=.env. 
./runner.sh start +``` + +Verify: + +```bash +ls -ld /var/tmp/github-runner-locks +# expected: drwxrwxrwt + +docker inspect --format '{{range .Mounts}}{{println .Source "->" .Destination}}{{end}}' <runner-container> +# expected to include: /var/tmp/github-runner-locks -> /tmp/github-runner-locks +``` + ## Contributing ```bash diff --git a/README_CN.md b/README_CN.md index 30ed56c..99d7391 100644 --- a/README_CN.md +++ b/README_CN.md @@ -79,6 +79,8 @@ name:label1[,label2];name2:label1 ## 多组织共享 +> **完整文档**:参见 [docs/多组织部署指南.md](docs/多组织部署指南.md),含部署方式、环境变量、故障排查等。 + 当前脚本实现了在同一台主机上运行多个 Docker 容器,分别注册到不同的 GitHub 组织,即使这些容器需要访问同一物理硬件(如开发板、串口、电源控制等),也不会导致 CI 资源冲突。 ### 场景说明 @@ -112,6 +114,48 @@ RUNNER_RESOURCE_ID_PHYTIUMPI=board-phytiumpi **注意**:串行是硬件本身的限制,本方案把「无秩序抢占」变为「有序排队」,不额外降低吞吐。不同板子使用各自锁 ID 可并行执行。详见 [runner-wrapper/README.md](runner-wrapper/README.md),参考 [Discussion #341](https://github.com/orgs/arceos-hypervisor/discussions/341)。 +### 常见问题:`pre-job-lock.sh` 报 `Permission denied` + +如果 Job 日志中出现类似报错: + +- `chmod: changing permissions of '/tmp/github-runner-locks': Operation not permitted` +- `/tmp/github-runner-locks/board-xxx.lock: Permission denied` + +通常是宿主机锁目录权限不正确,或把锁目录放在宿主机 `/tmp` 后被系统清理导致权限漂移。 + +推荐配置(多组织保持一致): + +- `RUNNER_LOCK_HOST_PATH=/var/tmp/github-runner-locks`(宿主机持久目录) +- `RUNNER_LOCK_DIR=/tmp/github-runner-locks`(容器内路径,保持默认) + +一次性修复步骤: + +```bash +# 1) 在宿主机创建并设置目录权限(sticky bit 1777) +sudo mkdir -p /var/tmp/github-runner-locks +sudo chown root:root /var/tmp/github-runner-locks +sudo chmod 1777 /var/tmp/github-runner-locks + +# 2) 修改每个组织对应 .env +# RUNNER_LOCK_HOST_PATH=/var/tmp/github-runner-locks +# RUNNER_LOCK_DIR=/tmp/github-runner-locks + +# 3) 重新生成 compose,并重建容器使挂载生效 +ENV_FILE=.env. ./runner.sh compose +ENV_FILE=.env. ./runner.sh stop +ENV_FILE=.env. 
./runner.sh start +``` + +验证: + +```bash +ls -ld /var/tmp/github-runner-locks +# 期望:drwxrwxrwt + +docker inspect --format '{{range .Mounts}}{{println .Source "->" .Destination}}{{end}}' <容器名> +# 期望包含:/var/tmp/github-runner-locks -> /tmp/github-runner-locks +``` + ## 贡献 ```bash diff --git a/docs/board-lock-watcher.md b/docs/board-lock-watcher.md new file mode 100644 index 0000000..c60842e --- /dev/null +++ b/docs/board-lock-watcher.md @@ -0,0 +1,331 @@ +## 开发板文件锁与取消等待使用说明 + +本文档说明如何在多组织共享同一块开发板时,结合文件锁与 `lock-watcher.sh`,实现 **等待中的 Job 能被 Cancel 正常打断**,避免死锁。 + +--- + +### 1. 组件概览 + +- **`runner-wrapper/runner-wrapper.sh`**:为 Runner 注入 Job Started / Completed 钩子。 +- **`runner-wrapper/pre-job-lock.sh`**:在 Job 开始前获取板子级文件锁(`flock`),并通过后台子进程持有锁。 +- **`runner-wrapper/post-job-lock.sh`**:在 Job 结束时创建 `.release` 标记,唤醒持锁子进程释放锁。 +- **`runner-wrapper/lock-watcher.sh`**:周期性查询 GitHub Actions Run 的状态;当发现持锁 Run 已被 **Cancel** 时,强制清理解锁文件,避免后续等待 Job 永久卡死。**一个 watcher 进程可同时监控多块板子**。配置 `RUNNER_LOCK_MONITOR_TOKEN` 后,watcher 会作为 compose 服务随 `./runner.sh start` **自动启动**,与锁机制一样使用无感。 + +锁文件结构(默认目录 `/tmp/github-runner-locks`): + +- `${RESOURCE_ID}.lock`:flock 使用的锁文件 +- `${RESOURCE_ID}.holder`:当前持锁信息,格式为 `PID RUNNER_NAME RUN_ID RUN_ATTEMPT` +- `${RESOURCE_ID}.${RUNNER_NAME}.${RUN_ID}.${RUN_ATTEMPT}.release`:释放标记,由 `post-job-lock.sh` 创建 + +--- + +### 1.1 部署流程概览 + +可按以下顺序操作;细节见后续章节。 + +| 时机 | 要做的事 | +|------|----------| +| **第一次在这台机器上部署** | ① 准备各组织的 `.env`(含 ORG、REPO、GH_PAT、板子锁变量、以及 `RUNNER_LOCK_MONITOR_TOKEN`)
② **在宿主机**设置锁目录权限:`sudo mkdir -p /tmp/github-runner-locks && sudo chmod 1777 /tmp/github-runner-locks`([详见 2.1](#21-宿主机锁目录权限首次部署或报-permission-denied-时))
③ 每个组织执行一次 `./runner.sh init -n 2`,生成 compose、起容器并注册
④ watcher 会作为 compose 服务**随 start 自动启动**,无需单独起进程([详见 3](#3-watcher-自动启动与锁使用无感)) | +| **以后每次使用(非初次)** | 执行 `ENV_FILE=.env.xxx ./runner.sh start`(每个组织);watcher 随 runners 一起启停,使用无感 | + +--- + +### 2. Runner 端配置(各组织 .env) + +在各组织对应的 `.env` 中(示例:`.env.linebridge` / `.env.yoinspiration`): + +```bash +ORG= +REPO=test-runner +GH_PAT=ghp_xxx # Runner 注册用,权限见 2.2 + +RUNNER_RESOURCE_ID_ROC_RK3568_PC=board-roc-rk3568-pc +RUNNER_LOCK_HOST_PATH=/tmp/github-runner-locks +RUNNER_LOCK_DIR=/tmp/github-runner-locks +# 必填(watcher 自动启动时用):Fine-grained PAT,Actions: Read-only +RUNNER_LOCK_MONITOR_TOKEN=github_pat_xxx +``` + +注意: + +- 多组织共享同一块板子时,所有相关 `.env` 中的 + - `RUNNER_RESOURCE_ID_ROC_RK3568_PC` + - `RUNNER_LOCK_HOST_PATH` + - `RUNNER_LOCK_DIR` + 必须保持一致。 + +#### 2.1 宿主机锁目录权限(首次部署或报 Permission denied 时) + +锁目录从宿主机挂进容器,若宿主机上该目录权限不对,容器内会报 `chmod: Operation not permitted` 或 `Permission denied`。**在宿主机**执行(仅首次部署或出现上述报错时): + +```bash +# 若目录已存在但权限不对,可先清理再改权限 +sudo rm -f /tmp/github-runner-locks/*.holder /tmp/github-runner-locks/*.release +sudo chmod 1777 /tmp/github-runner-locks +sudo find /tmp/github-runner-locks -maxdepth 1 -type f -name 'board-*' -exec chmod 666 {} \; +``` + +若目录不存在,先创建再设权限: + +```bash +sudo mkdir -p /tmp/github-runner-locks +sudo chmod 1777 /tmp/github-runner-locks +``` + +完成后重启对应 Runner(见下文)。 + +修改完 `.env` 后,重启对应 Runner: + +```bash +ENV_FILE=.env.linebridge ./runner.sh restart +ENV_FILE=.env.yoinspiration ./runner.sh restart +``` + +#### 2.2 PAT 权限说明 + +- **GH_PAT**(Runner 注册、管理 runner 用) + - **组织级 Runner**(只设 `ORG`、不设 `REPO`):Classic PAT 需勾选 **`admin:org`**,用于调用组织 Actions runner 注册等接口。 + - **仓库级 Runner**(设了 `ORG` 和 `REPO`):一般需 **`repo`**(完整仓库权限);若仓库属组织且需在组织下管理 runner,可能仍要求 **`admin:org`**。 + - 若用 Fine-grained PAT:在对应 org/repo 的权限中勾选可“管理 Actions runners”的项(名称以 GitHub 当前界面为准)。 + +- **RUNNER_LOCK_MONITOR_TOKEN**(仅 watcher 用,只读 run 状态) + - 建议使用 **Fine-grained PAT**,在对应仓库下将 **Actions** 设为 **Read-only**,权限最小、与 GH_PAT 分离更安全。 + +--- + +### 3. 
Watcher 自动启动(与锁使用无感) + +配置 `RUNNER_LOCK_MONITOR_TOKEN` 后,`./runner.sh compose` 或 `init` 生成的 compose 会包含 **lock-watcher** 服务。执行 `./runner.sh start` 时,watcher 会随 runners 一起启动;`stop` / `restart` 时一起停止,**无需单独开终端或 systemd**。 + +1. 在对应组织的 `.env` 中增加(必填): + ```bash + RUNNER_LOCK_MONITOR_TOKEN=github_pat_xxx # Fine-grained PAT,Actions: Read-only + ``` +2. 若已有 compose,需重新生成以加入 watcher: + ```bash + ENV_FILE=.env.linebridge ./runner.sh compose + ``` +3. 之后执行 `./runner.sh start` 即可,watcher 自动随 runners 启停。 + +**手动单独运行**(可选):若需在 compose 外单独跑 watcher(例如另一台机器),仍可使用: + ```bash + ENV_FILE=.env.linebridge ./runner.sh watcher + ``` + 建议用 tmux/screen 常驻。传参可指定只监控一块板子:`./runner.sh watcher board-roc-rk3568-pc`。 + +#### 3.0.1 手动常驻(仅在不使用 compose 自动启动时) + +任选一种方式,使 watcher 在断开 SSH 或重启后仍可运行。 + +**方式 A:tmux** + +```bash +# 安装(若无) +sudo apt install -y tmux + +# 第一个组织 +cd /path/to/github-runners +tmux new -s watcher-linebridge +ENV_FILE=.env.linebridge ./runner.sh watcher +# 断开会话:Ctrl+B 再按 D + +# 第二个组织(新开一个终端或新会话) +tmux new -s watcher-yoinspiration +ENV_FILE=.env.yoinspiration ./runner.sh watcher +# 同样 Ctrl+B D 断开 + +# 重新连上查看 +tmux attach -t watcher-linebridge +tmux attach -t watcher-yoinspiration +``` + +**方式 B:screen** + +```bash +# 安装(若无) +sudo apt install -y screen + +# 第一个组织 +cd /path/to/github-runners +screen -S watcher-linebridge +ENV_FILE=.env.linebridge ./runner.sh watcher +# 断开:Ctrl+A 再按 D + +# 第二个组织 +screen -S watcher-yoinspiration +ENV_FILE=.env.yoinspiration ./runner.sh watcher +# Ctrl+A D 断开 + +# 重新连上 +screen -r watcher-linebridge +screen -r watcher-yoinspiration +``` + +**方式 C:systemd(开机自启,推荐长期使用)** + +每个组织一个 service 文件,例如 linebridge: + +```bash +sudo nano /etc/systemd/system/github-runner-watcher-linebridge.service +``` + +内容(将 `fei` 和 `/path/to/github-runners` 换成你的用户名与仓库绝对路径): + +```ini +[Unit] +Description=GitHub Runner lock watcher (linebridge) +After=network-online.target + +[Service] +Type=simple +User=fei +WorkingDirectory=/path/to/github-runners 
+Environment=ENV_FILE=.env.linebridge +ExecStart=/path/to/github-runners/runner.sh watcher +Restart=always +RestartSec=10 + +[Install] +WantedBy=multi-user.target +``` + +再为 yoinspiration 建一份(如 `github-runner-watcher-yoinspiration.service`),仅把 `linebridge` 改为 `yoinspiration`、`ENV_FILE=.env.linebridge` 改为 `ENV_FILE=.env.yoinspiration` 即可。 + +启用并启动: + +```bash +sudo systemctl daemon-reload +sudo systemctl enable --now github-runner-watcher-linebridge +sudo systemctl enable --now github-runner-watcher-yoinspiration +``` + +查看状态与日志: + +```bash +sudo systemctl status github-runner-watcher-linebridge +journalctl -u github-runner-watcher-linebridge -f +``` + +#### 3.1 实例数量建议 + +- **使用 compose 自动启动**:每组织一个 watcher 容器,随 `./runner.sh start` 自动拉起;无需额外配置。 +- **手动运行**:每组织一个 watcher 进程,分别执行 `ENV_FILE=.env.linebridge ./runner.sh watcher` 与 `ENV_FILE=.env.yoinspiration ./runner.sh watcher`。 + +#### 3.2 准备 PAT + +在对应组织账号下创建 **Fine-grained PAT**(推荐): + +- 选择包含 `test-runner` 的仓库(例如 `linebridge/test-runner`)。 +- 在 **Repository permissions** 中将 **Actions** 设置为 **Read-only**。 + +生成后得到 `github_pat_xxx`。 + +#### 3.3 安装依赖 + +- **compose 自动启动**:watcher 容器使用 alpine,启动时自动安装 `jq`,宿主机无需安装。 +- **手动运行 watcher**:需在宿主机安装 `jq`:`sudo apt install -y jq`,否则无法解析 run 状态。 + +--- + +### 4. 
启动与验证 + +**compose 自动启动(推荐)** + +```bash +cd /path/to/github-runners +ENV_FILE=.env.linebridge ./runner.sh start +``` + +watcher 会随 runners 一起启动。查看 watcher 日志: + +```bash +docker logs -f $(docker ps -q -f name=lock-watcher) +``` + +**手动运行 watcher**(可选) + +```bash +ENV_FILE=.env.linebridge ./runner.sh watcher +``` + +启动成功后,终端会打印类似: + +```text +[lock-watcher] monitoring linebridge/test-runner, resources=board-roc-rk3568-pc board-phytiumpi, lock_dir=/tmp/github-runner-locks, interval=10s +``` + +运行过程中示例日志: + +- 正常持锁: + +```text +[lock-watcher] resource=board-roc-rk3568-pc run_id=22663158623 status=in_progress conclusion= pid=1511 holder_runner=DESKTOP-...-runner-roc-rk3568-pc +``` + +- 对应 workflow 在 GitHub 上被 Cancel 后: + +```text +[lock-watcher] resource=board-roc-rk3568-pc run_id=22663158623 status=completed conclusion=cancelled pid=1511 holder_runner=DESKTOP-...-runner-roc-rk3568-pc +[lock-watcher] detected cancelled workflow, force releasing lock for board-roc-rk3568-pc +``` + +表示 watcher 已检测到 Cancel,并强制清理解锁文件。 + +--- + +### 5. 典型验证流程 + +以下以 `linebridge/test-runner` 与 `yoinspiration/test-runner` 共享 `board-roc-rk3568-pc` 板为例。 + +#### 5.1 触发占板子的 holder + +在 `linebridge/test-runner` 仓库中: + +1. 打开 Actions,选择 `board-lock-holder` workflow。 +2. 点击 **Run workflow** 触发一次运行。 +3. 在日志中看到: + + - `Waiting for lock: board-roc-rk3568-pc` + - `Acquired lock for board-roc-rk3568-pc` + +表示 holder 成功持有锁并占用板子。 + +#### 5.2 触发等待的 waiter + +在 `yoinspiration/test-runner` 仓库中: + +1. 打开 Actions,选择 `board-lock-waiter` workflow。 +2. 点击 **Run workflow**。 +3. 在日志中可以看到: + + - `Waiting for lock: board-roc-rk3568-pc` + +表示 waiter 正在等待同一块板子的锁。 + +#### 5.3 Cancel 并观察自动解锁 + +1. 在 `linebridge/test-runner` 的 `board-lock-holder` 运行页面点击 **Cancel workflow**。 +2. 几秒后,宿主机上 `lock-watcher.sh` 日志应出现: + + ```text + [lock-watcher] run_id=... status=completed conclusion=cancelled ... + [lock-watcher] detected cancelled workflow, force releasing lock for board-roc-rk3568-pc + ``` + +3. 
此时 `yoinspiration` 侧的 `board-lock-waiter` 将在锁释放后继续执行,直至 Job 完成,而不会长时间卡在等待状态。 + +--- + +### 6. 常见问题 + +- **Q: watcher 日志中一直是 `status=in_progress conclusion=`?** + **A:** Run 还在运行中,尚未 Cancel 或完成,watcher 只会记录状态,不会释放锁。需要在 GitHub 页面上点击 Cancel,且等状态变为 Cancelled 后再观察。 + +- **Q: watcher 日志中频繁出现 `empty response for run_id=..., skip`?** + **A:** 对应的 Run 不属于当前 `ORG/REPO`,或 `GITHUB_TOKEN` 对该仓库没有足够的 Actions 读取权限。请确认: + - watcher 使用的 `ORG` / `REPO`(来自 `ENV_FILE` 的 .env 或自建环境文件)是否与 Run 实际所在仓库一致; + - Fine-grained PAT 是否勾选了对应仓库,并将 Actions 权限设为 Read-only。 + +- **Q: 没有安装 jq 时,status / conclusion 总是 ``?** + **A:** 需要先在宿主机安装 `jq`,否则 watcher 无法从响应 JSON 中解析状态。 + diff --git "a/docs/\345\244\232\347\273\204\347\273\207\345\205\261\344\272\253Runner\344\275\277\347\224\250\350\257\264\346\230\216.md" "b/docs/\345\244\232\347\273\204\347\273\207\345\205\261\344\272\253Runner\344\275\277\347\224\250\350\257\264\346\230\216.md" index 3407916..ab314b3 100644 --- "a/docs/\345\244\232\347\273\204\347\273\207\345\205\261\344\272\253Runner\344\275\277\347\224\250\350\257\264\346\230\216.md" +++ "b/docs/\345\244\232\347\273\204\347\273\207\345\205\261\344\272\253Runner\344\275\277\347\224\250\350\257\264\346\230\216.md" @@ -50,6 +50,7 @@ RUNNER_LOCK_DIR=/tmp/github-runner-locks 关键要求: +- 只有设置了 `RUNNER_RESOURCE_ID_PHYTIUMPI` / `RUNNER_RESOURCE_ID_ROC_RK3568_PC` 的板子才会启用 runner-wrapper 与锁目录挂载,未配置的板子仍使用默认 `run.sh`,不参与锁协调。 - 两套配置的 `RUNNER_RESOURCE_ID_*` 必须一致(同板卡共享同一锁)。 - 两套配置的 `RUNNER_LOCK_HOST_PATH` 必须一致(指向同一宿主机目录)。 diff --git "a/docs/\345\244\232\347\273\204\347\273\207\351\203\250\347\275\262\346\214\207\345\215\227.md" "b/docs/\345\244\232\347\273\204\347\273\207\351\203\250\347\275\262\346\214\207\345\215\227.md" new file mode 100644 index 0000000..59dbdc1 --- /dev/null +++ "b/docs/\345\244\232\347\273\204\347\273\207\351\203\250\347\275\262\346\214\207\345\215\227.md" @@ -0,0 +1,202 @@ +# 多组织部署完整指南 + +本文档是 **在同一台主机上为多个 GitHub 组织部署 Runner** 的完整参考,涵盖概念、配置、部署与故障排查。 + +--- + +## 文档导航 + +| 文档 | 适用场景 | 
+|------|----------| +| 本文档 | 多组织整体概念与快速上手 | +| [多组织共享Runner使用说明](./多组织共享Runner使用说明.md) | 分步操作、验证方法、Cancel 场景 | +| [runner-wrapper-multi-org-lock](./runner-wrapper-multi-org-lock.md) | 锁机制原理与架构设计 | +| [board-lock-watcher](./board-lock-watcher.md) | Cancel 后安全恢复、watcher 配置 | + +--- + +## 1. 概念说明 + +### 1.1 什么是多组织部署 + +**多组织部署** 指在同一台物理主机上运行多套 GitHub Actions Runner 容器,分别注册到不同的组织(或仓库),并可选地共享同一块硬件测试板卡。 + +``` +主机 +├── 组织 A 的 Runner 容器 → 注册到 Org-A +├── 组织 B 的 Runner 容器 → 注册到 Org-B +└── 物理硬件(如 phytiumpi 开发板)← 两套 Runner 均可访问 +``` + +### 1.2 为什么需要多组织 + +- **GitHub 限制**:一个 runner 只能注册到一个目标(repo/org/enterprise),无法同时服务多个组织。 +- **硬件共享**:多块板卡成本高,多个组织希望复用同一台主机上的开发板做 CI 测试。 +- **本方案**:通过 Docker 在同一主机运行多套独立 runner 实例,配合 **runner-wrapper 文件锁** 实现硬件访问的串行协调。 + +### 1.3 核心能力 + +| 能力 | 说明 | +|------|------| +| 同板卡任务串行 | 同一块板子的 Job 排队执行,避免硬件冲突 | +| 异板卡任务并行 | 不同板子的 Job 可同时运行 | +| 容器命名隔离 | 按 ORG/REPO 自动生成前缀,避免重名 | +| Cancel 安全恢复 | 配合 lock-watcher 支持网页 Cancel 后正常解锁 | + +--- + +## 2. 部署方式 + +### 2.1 方式一:同一目录 + ENV_FILE(推荐) + +使用同一份代码,通过 `ENV_FILE` 区分组织: + +```bash +# 组织 A +ENV_FILE=.env.orgA ./runner.sh init -n 2 + +# 组织 B(同一目录) +ENV_FILE=.env.orgB ./runner.sh init -n 2 +``` + +- **优点**:配置简单,代码与脚本统一更新。 +- **适用**:同一团队维护多组织,或快速验证。 + +### 2.2 方式二:不同目录各自部署 + +每个组织使用独立工作目录: + +```bash +# 组织 A +cd /opt/runners/org-a +cp .env.example .env # 编辑 ORG、GH_PAT、锁变量等 +./runner.sh init -n 2 + +# 组织 B +cd /opt/runners/org-b +cp .env.example .env +./runner.sh init -n 2 +``` + +- **优点**:权限与配置完全隔离,适合不同团队维护。 +- **注意**:多组织共享同一块板时,`RUNNER_RESOURCE_ID_*` 与 `RUNNER_LOCK_HOST_PATH` 必须一致。 + +--- + +## 3. 
环境变量配置 + +### 3.1 每个组织必备 + +```env +ORG=your-org-name +GH_PAT=ghp_xxxx # Classic PAT,需 admin:org + +# 若为仓库级 Runner +REPO=your-repo-name +``` + +### 3.2 多组织共享同一块板时(必设相同值) + +```env +RUNNER_RESOURCE_ID_PHYTIUMPI=board-phytiumpi +RUNNER_RESOURCE_ID_ROC_RK3568_PC=board-roc-rk3568-pc +RUNNER_LOCK_HOST_PATH=/var/tmp/github-runner-locks # 推荐持久目录 +RUNNER_LOCK_DIR=/tmp/github-runner-locks +``` + +**要点**:两套配置的 `RUNNER_RESOURCE_ID_*` 和 `RUNNER_LOCK_HOST_PATH` 必须完全一致,否则无法实现同板卡串行。 + +### 3.3 可选:Cancel 后安全恢复 + +需要网页 Cancel 后能正常解锁时,增加: + +```env +RUNNER_LOCK_MONITOR_TOKEN=github_pat_xxx # Fine-grained PAT,Actions: Read-only +``` + +详见 [board-lock-watcher](./board-lock-watcher.md)。 + +--- + +## 4. 快速上手 + +### 4.1 宿主机准备(首次部署) + +```bash +sudo mkdir -p /var/tmp/github-runner-locks +sudo chown root:root /var/tmp/github-runner-locks +sudo chmod 1777 /var/tmp/github-runner-locks +``` + +### 4.2 为每个组织准备 .env + +```bash +# .env.orgA +ORG=org-a +GH_PAT=ghp_aaa +RUNNER_RESOURCE_ID_PHYTIUMPI=board-phytiumpi +RUNNER_RESOURCE_ID_ROC_RK3568_PC=board-roc-rk3568-pc +RUNNER_LOCK_HOST_PATH=/var/tmp/github-runner-locks +RUNNER_LOCK_DIR=/tmp/github-runner-locks +``` + +```bash +# .env.orgB(锁变量与 orgA 一致) +ORG=org-b +GH_PAT=ghp_bbb +RUNNER_RESOURCE_ID_PHYTIUMPI=board-phytiumpi +RUNNER_RESOURCE_ID_ROC_RK3568_PC=board-roc-rk3568-pc +RUNNER_LOCK_HOST_PATH=/var/tmp/github-runner-locks +RUNNER_LOCK_DIR=/tmp/github-runner-locks +``` + +### 4.3 初始化与检查 + +```bash +ENV_FILE=.env.orgA ./runner.sh init -n 2 +ENV_FILE=.env.orgB ./runner.sh init -n 2 + +ENV_FILE=.env.orgA ./runner.sh ps +ENV_FILE=.env.orgB ./runner.sh ps +``` + +--- + +## 5. 常用操作 + +| 操作 | 命令示例 | +|------|----------| +| 启动 | `ENV_FILE=.env.orgA ./runner.sh start` | +| 停止 | `ENV_FILE=.env.orgA ./runner.sh stop` | +| 查看状态 | `ENV_FILE=.env.orgA ./runner.sh list` | +| 配置变更后重建 | `ENV_FILE=.env.orgA ./runner.sh compose` 后 `docker compose -f docker-compose..yml up -d --force-recreate` | + +--- + +## 6. 
常见问题 + +### 6.1 `pre-job-lock.sh` 报 Permission denied + +锁目录权限不正确。推荐使用 `/var/tmp/github-runner-locks` 等持久目录,并执行: + +```bash +sudo chmod 1777 /var/tmp/github-runner-locks +``` + +各组织 `.env` 中设置 `RUNNER_LOCK_HOST_PATH=/var/tmp/github-runner-locks`,然后重新 `compose` 并重启容器。 + +### 6.2 一直 Waiting for a runner + +检查:runner 是否 online、标签是否匹配、Runner group 是否授权目标仓库。 + +### 6.3 容器命名冲突 + +脚本会按 `ORG`(及 `REPO`)自动生成前缀,如 `-orgA-runner-1`。若仍有冲突,可显式设置 `RUNNER_NAME_PREFIX`。 + +--- + +## 7. 参考资料 + +- [README 多组织共享](../README_CN.md#多组织共享) +- [runner-wrapper README](../runner-wrapper/README.md) +- [Discussion #341: 多组织共享集成测试环境问题分析与解决方案](https://github.com/orgs/arceos-hypervisor/discussions/341) diff --git a/runner-wrapper/lock-watcher.sh b/runner-wrapper/lock-watcher.sh new file mode 100755 index 0000000..bc19010 --- /dev/null +++ b/runner-wrapper/lock-watcher.sh @@ -0,0 +1,106 @@ +#!/usr/bin/env bash +set -euo pipefail + +# lock-watcher.sh - 简单的文件锁清理脚本 +# +# 场景: +# - pre-job-lock.sh 在获取锁时可能因为网络/runner 异常导致锁长期不释放 +# - GitHub 前端 Cancel workflow 后,run 状态变为 cancelled,但本地锁文件还在 +# - 本脚本定期检查 holder 文件中的 run_id,对应 run 如已 cancelled,则强制清理解锁 +# +# 使用方式(示例,在宿主机上运行): +# export ORG=yoinspiration +# export REPO=test-runner +# export GITHUB_TOKEN=github_pat_xxx # 具备 Actions 只读权限 +# export RUNNER_RESOURCE_IDS="board-roc board-phytiumpi" # 多块板子空格分隔;或单块用 RUNNER_RESOURCE_ID +# export RUNNER_LOCK_DIR=/tmp/github-runner-locks +# ./runner-wrapper/lock-watcher.sh +# +# 必要环境变量: +# ORG, REPO, GITHUB_TOKEN +# 可选环境变量: +# RUNNER_RESOURCE_IDS(空格分隔的多个锁 ID,与 runner.sh 集成时自动传) +# RUNNER_RESOURCE_ID(单个锁 ID,RUNNER_RESOURCE_IDS 未设时使用) +# RUNNER_LOCK_DIR(默认:/tmp/github-runner-locks) +# INTERVAL(轮询间隔秒,默认 10) + +: "${ORG:?ORG is required, e.g. yoinspiration}" +: "${REPO:?REPO is required, e.g. 
test-runner}" +: "${GITHUB_TOKEN:?GITHUB_TOKEN is required (with Actions read permission)}" + +LOCK_DIR="${RUNNER_LOCK_DIR:-/tmp/github-runner-locks}" +INTERVAL="${INTERVAL:-10}" + +# 资源 ID 列表:支持多块板子,空格分隔 +if [[ -n "${RUNNER_RESOURCE_IDS:-}" ]]; then + RESOURCE_IDS=(${RUNNER_RESOURCE_IDS}) +else + RESOURCE_IDS=("${RUNNER_RESOURCE_ID:-default-hardware}") +fi + +api_base="${GITHUB_API_URL:-https://api.github.com}" + +echo "[lock-watcher] monitoring ${ORG}/${REPO}, resources=${RESOURCE_IDS[*]}, lock_dir=${LOCK_DIR}, interval=${INTERVAL}s" + +while true; do + for RUNNER_RESOURCE_ID in "${RESOURCE_IDS[@]}"; do + holder_file="${LOCK_DIR}/${RUNNER_RESOURCE_ID}.holder" + + if [[ ! -f "${holder_file}" ]]; then + continue + fi + + # holder 文件格式: PID RUNNER_NAME RUN_ID RUN_ATTEMPT + holder_pid="" + holder_runner="" + holder_run_id="" + holder_run_attempt="" + if ! read -r holder_pid holder_runner holder_run_id holder_run_attempt < "${holder_file}"; then + echo "[lock-watcher] failed to read holder file ${holder_file}" >&2 + continue + fi + + if [[ -z "${holder_run_id:-}" || "${holder_run_id}" == "unknown" ]]; then + continue + fi + + run_url="${api_base}/repos/${ORG}/${REPO}/actions/runs/${holder_run_id}" + resp="$(curl -fsSL \ + -H "Authorization: Bearer ${GITHUB_TOKEN}" \ + -H "Accept: application/vnd.github+json" \ + "${run_url}" || true)" + + if [[ -z "${resp}" ]]; then + echo "[lock-watcher] empty response for run_id=${holder_run_id}, skip" >&2 + continue + fi + + # 优先使用 jq 解析;若 jq 不可用,则退化为空字符串(不报错) + if command -v jq >/dev/null 2>&1; then + status="$(printf '%s\n' "${resp}" | jq -r '.status // empty' 2>/dev/null || true)" + conclusion="$(printf '%s\n' "${resp}" | jq -r '.conclusion // empty' 2>/dev/null || true)" + else + status="" + conclusion="" + fi + + echo "[lock-watcher] resource=${RUNNER_RESOURCE_ID} run_id=${holder_run_id} status=${status:-} conclusion=${conclusion:-} pid=${holder_pid} holder_runner=${holder_runner}" + + # 如果 workflow 已取消,认为这个锁可以强制释放 + if [[ 
"${status}" == "completed" && "${conclusion}" == "cancelled" ]]; then + echo "[lock-watcher] detected cancelled workflow, force releasing lock for ${RUNNER_RESOURCE_ID}" >&2 + + # 尝试在宿主机上杀掉同名 PID(注意:容器内/宿主机 PID 命名空间不同,可能杀不到,仅 best-effort) + if [[ -n "${holder_pid}" && "${holder_pid}" =~ ^[0-9]+$ ]]; then + kill "${holder_pid}" 2>/dev/null || true + fi + + # 清理 holder 和对应的 release 标记,让后续等待不再被旧锁阻塞 + rm -f "${holder_file}" 2>/dev/null || true + rm -f "${LOCK_DIR}/${RUNNER_RESOURCE_ID}."*.release 2>/dev/null || true + fi + done + + sleep "${INTERVAL}" +done + diff --git a/runner-wrapper/pre-job-lock.sh b/runner-wrapper/pre-job-lock.sh index 2d22ec6..95ff232 100644 --- a/runner-wrapper/pre-job-lock.sh +++ b/runner-wrapper/pre-job-lock.sh @@ -18,16 +18,41 @@ LOCK_FILE="${LOCK_DIR}/${RESOURCE_ID}.lock" RELEASE_FILE="${LOCK_DIR}/${RESOURCE_ID}.${RUN_KEY}.release" HOLDER_PID_FILE="${LOCK_DIR}/${RESOURCE_ID}.holder" -mkdir -p "${LOCK_DIR}" -chmod 1777 "${LOCK_DIR}" || true +if ! mkdir -p "${LOCK_DIR}" 2>/dev/null; then + echo "[$(date -Iseconds)] ❌ Cannot create lock dir ${LOCK_DIR}" >&2 + exit 1 +fi +chmod 1777 "${LOCK_DIR}" 2>/dev/null || true + +# 如果目录不可写,给出明确提示后退出 +if ! touch "${LOCK_DIR}/.write-test" 2>/dev/null; then + echo "[$(date -Iseconds)] ❌ Lock dir ${LOCK_DIR} is not writable by user $(id -un)." >&2 + echo "[$(date -Iseconds)] Fix on runner host: sudo chmod 1777 ${LOCK_DIR}" >&2 + exit 1 +fi +rm -f "${LOCK_DIR}/.write-test" || true + # 清理当前 run 的残留释放标记,避免误判为可释放 rm -f "${RELEASE_FILE}" || true # 打开锁文件并获取排他锁(阻塞等待) exec 200>"${LOCK_FILE}" -chmod 666 "${LOCK_FILE}" || true +chmod 666 "${LOCK_FILE}" 2>/dev/null || true echo "[$(date -Iseconds)] ⏳ Waiting for lock: ${RESOURCE_ID}" >&2 +# 后台每 10s 打印一次,便于在第二个 job 的日志中看到等待状态(避免在 echo 中嵌套括号与引号,防止部分 bash 误解析) +( + i=0 + while true; do + sleep 10 + i=$((i + 10)) + ts="$(date -Iseconds)" + printf '%s ⏳ Still waiting for lock: %s after %ss\n' "${ts}" "${RESOURCE_ID}" "${i}" >&2 + done +) & +WAITER_PID=$! 
flock -x 200 +kill "${WAITER_PID}" 2>/dev/null || true +wait "${WAITER_PID}" 2>/dev/null || true echo "[$(date -Iseconds)] ✅ Acquired lock for ${RESOURCE_ID}" >&2 # 后台子进程继承 fd 200 并持有锁,等待 post-job 创建释放文件 @@ -38,7 +63,7 @@ echo "[$(date -Iseconds)] ✅ Acquired lock for ${RESOURCE_ID}" >&2 "${RUNNER_NAME_SAFE}" \ "${RUN_ID_SAFE}" \ "${RUN_ATTEMPT_SAFE}" > "${HOLDER_PID_FILE}" - chmod 666 "${HOLDER_PID_FILE}" || true + chmod 666 "${HOLDER_PID_FILE}" 2>/dev/null || true while [ ! -f "${RELEASE_FILE}" ]; do sleep 1 diff --git a/runner.sh b/runner.sh index 989d0b9..49a0b84 100755 --- a/runner.sh +++ b/runner.sh @@ -101,16 +101,20 @@ shell_usage() { printf " %-${COLW}s %s\n" "./runner.sh ps|ls|list|status" "Show container status and registered Runner status" echo - echo "4. Deletion commands:" + echo "4. Lock watcher (Cancel 后自动释放板卡锁):" + printf " %-${COLW}s %s\n" "./runner.sh watcher [resource]" "Start lock-watcher (uses same .env; requires RUNNER_LOCK_MONITOR_TOKEN)" + echo + + echo "5. Deletion commands:" printf " %-${COLW}s %s\n" "./runner.sh rm|remove|delete [${RUNNER_NAME_PREFIX}runner- ...]" "Delete specified instances; no args will delete all (confirmation required, -y to skip)" printf " %-${COLW}s %s\n" "./runner.sh purge [-y]" "On top of remove, also delete the dynamically generated docker-compose.yml" echo - echo "5. Image management commands:" + echo "6. Image management commands:" printf " %-${COLW}s %s\n" "./runner.sh image" "Rebuild Docker image based on Dockerfile" echo - echo "6. Help" + echo "7. 
Help" printf " %-${COLW}s %s\n" "./runner.sh help" "Show this help" echo @@ -126,6 +130,7 @@ shell_usage() { printf " %-${KEYW}s %s\n" "RUNNER_RESOURCE_ID_ROC_RK3568_PC" "Lock ID for roc-rk3568-pc board (default: board-roc-rk3568-pc); same ID = serial" printf " %-${KEYW}s %s\n" "RUNNER_LOCK_DIR" "Lock dir in container (default /tmp/github-runner-locks)" printf " %-${KEYW}s %s\n" "RUNNER_LOCK_HOST_PATH" "Lock dir on host for bind mount (default /tmp/github-runner-locks)" + printf " %-${KEYW}s %s\n" "RUNNER_LOCK_MONITOR_TOKEN" "Fine-grained PAT, Actions read-only (required for watcher)" echo echo "Example workflow runs-on: runs-on: [self-hosted, linux, docker]" @@ -703,6 +708,34 @@ shell_generate_compose_file() { " - ${RUNNER_NAME_PREFIX}runner-roc-rk3568-pc-data:/home/runner" \ " - ${RUNNER_NAME_PREFIX}runner-roc-rk3568-pc-udev-rules:/etc/udev/rules.d" \ "" >> "${COMPOSE_FILE}" + + # lock-watcher:当配置了 RUNNER_LOCK_MONITOR_TOKEN 时自动加入,与 start/stop 一起启停 + if [[ -n "${RUNNER_LOCK_MONITOR_TOKEN:-}" ]]; then + local watcher_resource_ids=() + [[ -n "$res_phytiumpi" ]] && watcher_resource_ids+=("$res_phytiumpi") + [[ -n "$res_roc" ]] && watcher_resource_ids+=("$res_roc") + local watcher_ids_str="${watcher_resource_ids[*]}" + printf '%s\n' \ + " # lock-watcher:Cancel workflow 后自动清锁,与锁机制配套" \ + " ${RUNNER_NAME_PREFIX}lock-watcher:" \ + " image: alpine:3.19" \ + " container_name: \"${RUNNER_NAME_PREFIX}lock-watcher\"" \ + " restart: unless-stopped" \ + " command:" \ + " - /bin/sh" \ + " - -c" \ + " - \"apk add --no-cache bash curl jq && exec /watcher/lock-watcher.sh\"" \ + " environment:" \ + " ORG: \"${ORG}\"" \ + " REPO: \"${REPO}\"" \ + " GITHUB_TOKEN: \"\${RUNNER_LOCK_MONITOR_TOKEN}\"" \ + " RUNNER_LOCK_DIR: \"${RUNNER_LOCK_DIR:-/tmp/github-runner-locks}\"" \ + " RUNNER_RESOURCE_IDS: \"${watcher_ids_str}\"" \ + " volumes:" \ + " - ./runner-wrapper:/watcher:ro" \ + " - ${RUNNER_LOCK_HOST_PATH:-/tmp/github-runner-locks}:${RUNNER_LOCK_DIR:-/tmp/github-runner-locks}" \ + "" >> 
"${COMPOSE_FILE}" + fi fi # 生成 volumes @@ -966,6 +999,30 @@ if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then echo ;; + # ./runner.sh watcher [resource] + watcher) + shell_get_org_and_pat + [[ -n "${RUNNER_LOCK_MONITOR_TOKEN:-}" ]] || shell_die "RUNNER_LOCK_MONITOR_TOKEN is required for watcher (use Fine-grained PAT with Actions: Read-only)." + export GITHUB_TOKEN="${RUNNER_LOCK_MONITOR_TOKEN}" + export ORG REPO + export RUNNER_LOCK_DIR="${RUNNER_LOCK_DIR:-/tmp/github-runner-locks}" + if [[ -n "${1:-}" ]]; then + export RUNNER_RESOURCE_IDS="$1" + else + ids=() + [[ -n "${RUNNER_RESOURCE_ID_ROC_RK3568_PC:-}" ]] && ids+=("${RUNNER_RESOURCE_ID_ROC_RK3568_PC}") + [[ -n "${RUNNER_RESOURCE_ID_PHYTIUMPI:-}" ]] && ids+=("${RUNNER_RESOURCE_ID_PHYTIUMPI}") + if [[ ${#ids[@]} -eq 0 ]]; then + shell_die "No board resource ID set (RUNNER_RESOURCE_ID_ROC_RK3568_PC / RUNNER_RESOURCE_ID_PHYTIUMPI). Set one in .env or pass: ./runner.sh watcher " + fi + export RUNNER_RESOURCE_IDS="${ids[*]}" + fi + WATCHER_SCRIPT="$(cd "$(dirname "$0")" && pwd)/runner-wrapper/lock-watcher.sh" + [[ -x "${WATCHER_SCRIPT}" ]] || shell_die "lock-watcher.sh not found or not executable: ${WATCHER_SCRIPT}" + shell_info "Starting lock-watcher for ${ORG}/${REPO}, resources=${RUNNER_RESOURCE_IDS}" + exec "${WATCHER_SCRIPT}" + ;; + # ./runner.sh init -n|--count N init) count=0 @@ -1047,18 +1104,25 @@ if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then shell_info "No Runner containers to start!" exit 0 fi - else - mapfile -t ids < <(docker_list_existing_containers) || ids=() - if [[ ${#ids[@]} -eq 0 ]]; then - shell_info "No Runner containers to start!" 
- exit 0 + shell_info "Starting ${#ids[@]} container(s): ${ids[*]}" + if [[ -f "$COMPOSE_FILE" ]]; then + $DC -f "$COMPOSE_FILE" up -d "${ids[@]}" + else + docker start "${ids[@]}" fi - fi - shell_info "Starting ${#ids[@]} container(s): ${ids[*]}" - if [[ -f "$COMPOSE_FILE" ]]; then - $DC -f "$COMPOSE_FILE" up -d "${ids[@]}" else - docker start "${ids[@]}" + if [[ -f "$COMPOSE_FILE" ]]; then + shell_info "Starting all services (runners + lock-watcher if configured)" + $DC -f "$COMPOSE_FILE" up -d + else + mapfile -t ids < <(docker_list_existing_containers) || ids=() + if [[ ${#ids[@]} -eq 0 ]]; then + shell_info "No Runner containers to start!" + exit 0 + fi + shell_info "Starting ${#ids[@]} container(s): ${ids[*]}" + docker start "${ids[@]}" + fi fi ;; @@ -1077,18 +1141,25 @@ if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then shell_info "No Runner containers to stop!" exit 0 fi - else - mapfile -t ids < <(docker_list_existing_containers) || ids=() - if [[ ${#ids[@]} -eq 0 ]]; then - shell_info "No Runner containers to stop!" - exit 0 + shell_info "Stopping ${#ids[@]} container(s): ${ids[*]}" + if [[ -f "$COMPOSE_FILE" ]]; then + $DC -f "$COMPOSE_FILE" stop "${ids[@]}" + else + docker stop "${ids[@]}" fi - fi - shell_info "Stopping ${#ids[@]} container(s): ${ids[*]}" - if [[ -f "$COMPOSE_FILE" ]]; then - $DC -f "$COMPOSE_FILE" stop "${ids[@]}" else - docker stop "${ids[@]}" + if [[ -f "$COMPOSE_FILE" ]]; then + shell_info "Stopping all services (runners + lock-watcher if configured)" + $DC -f "$COMPOSE_FILE" stop + else + mapfile -t ids < <(docker_list_existing_containers) || ids=() + if [[ ${#ids[@]} -eq 0 ]]; then + shell_info "No Runner containers to stop!" + exit 0 + fi + shell_info "Stopping ${#ids[@]} container(s): ${ids[*]}" + docker stop "${ids[@]}" + fi fi ;; @@ -1107,18 +1178,25 @@ if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then shell_info "No Runner containers to restart!" 
exit 0 fi - else - mapfile -t ids < <(docker_list_existing_containers) || ids=() - if [[ ${#ids[@]} -eq 0 ]]; then - shell_info "No Runner containers to restart!" - exit 0 + shell_info "Restarting ${#ids[@]} container(s): ${ids[*]}" + if [[ -f "$COMPOSE_FILE" ]]; then + $DC -f "$COMPOSE_FILE" restart "${ids[@]}" + else + docker restart "${ids[@]}" fi - fi - shell_info "Restarting ${#ids[@]} container(s): ${ids[*]}" - if [[ -f "$COMPOSE_FILE" ]]; then - $DC -f "$COMPOSE_FILE" restart "${ids[@]}" else - docker restart "${ids[@]}" + if [[ -f "$COMPOSE_FILE" ]]; then + shell_info "Restarting all services (runners + lock-watcher if configured)" + $DC -f "$COMPOSE_FILE" restart + else + mapfile -t ids < <(docker_list_existing_containers) || ids=() + if [[ ${#ids[@]} -eq 0 ]]; then + shell_info "No Runner containers to restart!" + exit 0 + fi + shell_info "Restarting ${#ids[@]} container(s): ${ids[*]}" + docker restart "${ids[@]}" + fi fi ;;
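Reviewer note: the `pre-job-lock.sh` changes in this patch all build on one primitive, an exclusive `flock(1)` held on a file descriptor, with one lock file per board. A minimal standalone sketch of that primitive, assuming an illustrative scratch directory `/tmp/demo-locks` and board ID `board-demo` (not the repo defaults):

```shell
#!/usr/bin/env bash
# Sketch of the per-board serialization pattern: jobs that open the same
# lock file queue on the exclusive flock; different lock files (boards)
# proceed in parallel.
set -euo pipefail

LOCK_DIR="${LOCK_DIR:-/tmp/demo-locks}"   # illustrative, not the repo default
RESOURCE_ID="${1:-board-demo}"
mkdir -p "${LOCK_DIR}"
LOCK_FILE="${LOCK_DIR}/${RESOURCE_ID}.lock"

# Open fd 200 on the lock file, then block until the exclusive lock is free.
exec 200>"${LOCK_FILE}"
flock -x 200
echo "acquired ${RESOURCE_ID}"

# ... serialized hardware access would happen here ...

# Closing the fd (or process exit) releases the lock for the next waiter.
exec 200>&-
echo "released ${RESOURCE_ID}"
```

In the real hooks, fd 200 is instead inherited by a background child that keeps the lock until `post-job-lock.sh` drops the `.release` marker.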
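The `.holder` file that `lock-watcher.sh` polls is a single whitespace-separated line, `PID RUNNER_NAME RUN_ID RUN_ATTEMPT`. A sketch of the write/read round trip, with an illustrative path and dummy values:

```shell
#!/usr/bin/env bash
# Writer side (as in pre-job-lock.sh): record who holds the lock.
# Reader side (as in lock-watcher.sh): one `read` splits the fields back out.
set -euo pipefail

holder_file="/tmp/demo-board.holder"   # illustrative path

printf '%s %s %s %s\n' "$$" "demo-runner" "22663158623" "1" > "${holder_file}"

read -r holder_pid holder_runner holder_run_id holder_run_attempt < "${holder_file}"
echo "poll run_id=${holder_run_id} attempt=${holder_run_attempt} held_by=${holder_runner}"
```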
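`lock-watcher.sh` force-releases a lock only when the API reports the run both `completed` and with conclusion `cancelled`; a run that is merely `in_progress` keeps its lock. That guard, extracted as a tiny sketch (the function name is mine, not the script's):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Mirror of the watcher's release condition: completed AND cancelled.
should_release() {
  local status="$1" conclusion="$2"
  [[ "${status}" == "completed" && "${conclusion}" == "cancelled" ]]
}

should_release completed cancelled && echo "release"   # prints "release"
should_release in_progress ""      || echo "keep"      # prints "keep"
```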
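Both READMEs prescribe a sticky-bit, world-writable lock directory (`chmod 1777`) so every org's runner user can create lock files while only owners can delete them. The effect can be checked non-interactively on a scratch path (illustrative; the docs use `/var/tmp/github-runner-locks`, which needs sudo):

```shell
#!/usr/bin/env bash
# Create a world-writable, sticky-bit directory and show the resulting mode,
# mirroring the one-time host fix from the troubleshooting sections.
set -euo pipefail

d=/tmp/demo-runner-locks   # illustrative scratch path
mkdir -p "$d"
chmod 1777 "$d"

# %a = octal mode, %A = rwx string; the sticky bit shows as a trailing "t".
stat -c '%a %A' "$d"   # prints "1777 drwxrwxrwt" with GNU stat
```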