fix: prevent Web admin port conflicts during config hot-reload from crashing AstrBot (#75) #77
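Saving the plugin configuration triggers an AstrBot hot-reload (terminate→initialize). The original implementation had two defects that together crashed the whole AstrBot process (issue DBJD-CR#75):

1. _serve() caught except Exception, which cannot catch the SystemExit (a BaseException) raised by sys.exit() when Uvicorn fails to bind its port. The exception bubbled up to the root of the event loop as an unretrieved task exception and brought down the host process. The handler now uses except BaseException and lets CancelledError through separately, preserving the cancellation semantics of a normal stop.
2. stop() only set should_exit. With long-lived connections still open, the graceful shutdown hung forever, the 5-second timeout was silently swallowed by except Exception: pass, the old listening socket was never released, and the new instance failed to bind the same port on reload. stop() now also sets force_exit=True so the port is released, and a timeout logs a warning and cancels the task instead of being silently ignored.

Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>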
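The first defect comes down to Python's exception hierarchy. The following is a minimal sketch, not the plugin's code: the two helper functions are hypothetical, and `sys.exit(1)` stands in for what Uvicorn does on a failed bind.

```python
import sys


def guarded_with_exception() -> str:
    """Mimics the original handler shape: `except Exception` misses SystemExit."""
    try:
        sys.exit(1)  # stand-in for the exit triggered by a failed port bind
    except Exception:
        return "caught"
    return "unreachable"


def guarded_with_base_exception() -> str:
    """Mimics the widened handler: `except BaseException` does catch SystemExit."""
    try:
        sys.exit(1)
    except BaseException as exc:
        return f"caught {type(exc).__name__}"


# SystemExit derives from BaseException, not from Exception.
assert issubclass(SystemExit, BaseException)
assert not issubclass(SystemExit, Exception)

try:
    guarded_with_exception()
    escaped = False
except SystemExit:
    escaped = True  # the narrow handler let SystemExit through

widened = guarded_with_base_exception()
```

Here `escaped` records that the narrow handler let `SystemExit` out of the function, while `widened` holds the result of the broader handler.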
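Reviewer's Guide
This PR hardens the AstrBot Web admin server's start/stop lifecycle so that configuration hot-reload no longer crashes the whole AstrBot process when Uvicorn fails to bind its port, and so that the port is reliably released even when WebSocket connections linger.
Sequence diagram for hardened Web admin server start/stop lifecycle
sequenceDiagram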
actor AstrBot
participant EventLoop
participant WebAdminServer
participant UvicornServer
AstrBot->>WebAdminServer: start()
WebAdminServer->>EventLoop: asyncio.create_task(_serve)
EventLoop->>WebAdminServer: run _serve()
WebAdminServer->>UvicornServer: serve()
alt [Uvicorn binds port successfully]
UvicornServer-->>WebAdminServer: serve() completes
else [Port bind fails]
UvicornServer-->>WebAdminServer: SystemExit
WebAdminServer-->>WebAdminServer: except BaseException
WebAdminServer->>WebAdminServer: logger.error(...)
WebAdminServer-->>EventLoop: _serve completes without crashing AstrBot
end
AstrBot->>WebAdminServer: stop()
WebAdminServer->>UvicornServer: set should_exit = True
WebAdminServer->>UvicornServer: set force_exit = True
WebAdminServer->>EventLoop: asyncio.wait_for(server_task, 5)
alt [server_task stops within 5s]
EventLoop-->>WebAdminServer: server_task finished
else [Timeout]
EventLoop-->>WebAdminServer: asyncio.TimeoutError
WebAdminServer->>WebAdminServer: logger.warning(...)
WebAdminServer->>EventLoop: server_task.cancel()
end
WebAdminServer-->>AstrBot: stop() returns without port leak
Code Review
This pull request improves the lifecycle management and exception handling of the Web Admin Server by correctly propagating asyncio.CancelledError, catching exceptions to prevent process crashes during port binding failures, and handling server shutdown timeouts. The review feedback points out two key issues: catching BaseException can swallow KeyboardInterrupt signals, so catching (SystemExit, Exception) is recommended instead; and when cancelling the server task during a timeout, the task should be explicitly awaited to ensure the port is fully released before the stop method returns.
    except asyncio.TimeoutError:
        logger.warning(
            "[主动消息] Web 管理端未在 5 秒内停止喵,端口可能仍被占用。"
        )
        self.server_task.cancel()
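Calling self.server_task.cancel() only marks the task for cancellation; the cancellation itself does not run immediately. If the task is not awaited, stop() returns right away while the _serve coroutine may not yet have exited and released the port. In a hot-reload (stop followed immediately by start), this can still cause a brief port conflict.
After cancel(), explicitly await self.server_task inside try...except asyncio.CancelledError, so that stop() only finishes once the task has fully exited and the port has been released.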
Suggested change:

    -except asyncio.TimeoutError:
    -    logger.warning(
    -        "[主动消息] Web 管理端未在 5 秒内停止喵,端口可能仍被占用。"
    -    )
    -    self.server_task.cancel()
    +except asyncio.TimeoutError:
    +    logger.warning(
    +        "[主动消息] Web 管理端未在 5 秒内停止喵,端口可能仍被占用。"
    +    )
    +    self.server_task.cancel()
    +    try:
    +        await self.server_task
    +    except asyncio.CancelledError:
    +        pass
    except BaseException as e:
        # When Uvicorn fails to bind the port it calls sys.exit(), raising SystemExit
        # (a BaseException, not an Exception). If it is not intercepted here, it
        # bubbles up to the root of the event loop as an unretrieved task exception
        # and brings down the whole AstrBot process.
        logger.error(f"[主动消息] Web 管理端运行异常喵: {e!r}")
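In Python, catching BaseException directly also catches and swallows KeyboardInterrupt (for example, when the user presses Ctrl+C to terminate the program), so the program can no longer respond to the interrupt and exit normally.
Since the main goal here is to intercept the SystemExit raised when Uvicorn fails to bind the port, catch (SystemExit, Exception) explicitly instead of the overly broad BaseException. That still intercepts SystemExit and ordinary exceptions without affecting the normal propagation of KeyboardInterrupt.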
Suggested change:

    -except BaseException as e:
    +except (SystemExit, Exception) as e:
         # When Uvicorn fails to bind the port it calls sys.exit(), raising SystemExit
         # (a BaseException, not an Exception). If it is not intercepted here, it
         # bubbles up to the root of the event loop as an unretrieved task exception
         # and brings down the whole AstrBot process.
         logger.error(f"[主动消息] Web 管理端运行异常喵: {e!r}")
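What the narrower handler catches and what it lets through can be checked directly. This is a minimal sketch with a hypothetical helper, `classify`, that raises an exception inside a handler shaped like the suggested one. It assumes Python 3.8 or later, where `asyncio.CancelledError` derives from `BaseException`.

```python
import asyncio


def classify(exc: BaseException) -> str:
    """Raise `exc` under an `except (SystemExit, Exception)` handler and report the outcome."""
    try:
        try:
            raise exc
        except (SystemExit, Exception) as caught:
            return f"handled {type(caught).__name__}"
    except BaseException as escaped:
        # Anything reaching this point was not intercepted by the inner handler.
        return f"propagated {type(escaped).__name__}"


outcomes = [
    classify(SystemExit(1)),            # failed bind
    classify(OSError("bind failed")),   # ordinary exception
    classify(KeyboardInterrupt()),      # Ctrl+C
    classify(asyncio.CancelledError()), # task cancellation
]
```

With this shape, a separate `except asyncio.CancelledError: raise` clause is no longer needed for cancellation to propagate, because the tuple does not match it.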
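Hey - I've left some high level feedback:
- When catching `BaseException` in `_serve`, consider using `logger.exception` instead of `logger.error` so you preserve the full traceback for debugging rare failures like `SystemExit`.
- Setting `self.server.force_exit = True` unconditionally changes shutdown semantics; you might want to gate this behind a configuration flag or only enable it in the hot-reload/unload path where the hanging WebSocket issue actually occurs.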
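The difference between the two logging calls is easy to see with the standard `logging` module. A minimal sketch, assuming a hypothetical logger name and an in-memory stream so the output can be inspected:

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("web_admin_demo")  # hypothetical logger name
logger.setLevel(logging.ERROR)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))


def fail() -> None:
    raise SystemExit(1)  # stand-in for the exit on a failed bind


try:
    fail()
except (SystemExit, Exception) as e:
    logger.error(f"error only: {e!r}")          # message only
    logger.exception(f"with traceback: {e!r}")  # message plus traceback

output = stream.getvalue()
```

Only the `logger.exception` record carries the `Traceback (most recent call last):` section, which is what makes a rare failure such as `SystemExit` traceable to its origin.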
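Address the review feedback on PR DBJD-CR#77:

1. Narrow the except BaseException in _serve() to except (SystemExit, Exception). This still intercepts the SystemExit from a failed Uvicorn bind but no longer affects KeyboardInterrupt.
2. Log with logger.exception in _serve() to keep the full traceback, which helps diagnose rare failures such as SystemExit.
3. In stop(), await the task after force-cancelling it on timeout, so that the cancellation has actually completed and the listening socket is released before stop returns, eliminating leftover ports during hot-reload.

Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>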
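I went through the review comments. The latest commit (1a273d3) changes three things.
As for the force_exit comment, I am not adding a switch for now. This stop is only reached on unload/hot-reload, where the goal is to shut the service down anyway, so there is no case of "wanted a graceful shutdown but got forcibly interrupted". Besides, without force_exit the WebUI's long-lived WebSocket connection never drops, so uvicorn's graceful shutdown keeps waiting for connections, runs into the full 5-second timeout, and the port is never freed. That is the root of the port conflict in this issue. A switch would only make misconfiguration easier. If you would still like a config option, I can add one.
I also reproduced and verified it locally once more: I imported the plugin's real WebAdminServer directly and used its own start/stop/_serve with real uvicorn to simulate a hot-reload fighting over the port. Before the change, the process was brought down by SystemExit, matching the issue. After the change, the process survives the same conflict: SystemExit is intercepted by _serve, and once stop finishes the port can be bound again by the new instance.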
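A rebind check of the kind mentioned in the comment above could look like the sketch below. It is an assumption about how such a probe might be written with plain sockets, not the author's actual script: `can_bind` is a hypothetical helper, and an ephemeral port on `127.0.0.1` plays the role of the old instance's listening socket.

```python
import socket


def can_bind(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a fresh TCP socket can bind host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
            return True
        except OSError:
            return False


# Occupy an ephemeral port to play the "old instance" that still holds it.
old = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
old.bind(("127.0.0.1", 0))
old.listen()
port = old.getsockname()[1]

while_held = can_bind(port)     # probed while the old socket is still open
old.close()
after_release = can_bind(port)  # probed after the old socket is closed
```

Probing before and after the old socket is closed mirrors the question the hot-reload path cares about: whether the new instance can bind the same port once `stop()` has returned.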
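Thank you for the detailed feedback and the thorough local reproduction and verification. Your … Your fix (including …) … There is no need to introduce an additional configuration switch; keeping the current implementation is fine. Thank you very much for the careful work you put into resolving this issue!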
Placeholder for now; I will fix any other problems after I wake up tomorrow.
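📝 Description
Fixes #75: saving this plugin's configuration in the AstrBot main dashboard (WebUI) crashes the whole AstrBot process.
The root cause is two defects stacked on top of each other; with either one missing, the crash does not happen:
- Source of the process crash: saving the plugin configuration triggers an AstrBot hot-reload (`terminate` → `initialize`, run serially). During `initialize`, the new instance restarts the embedded Uvicorn server on port `4100`. If the old port has not been released yet, Uvicorn's failed bind calls `sys.exit(1)`, which raises `SystemExit`. The `_serve()` coroutine catches `except Exception`, but `SystemExit` derives from `BaseException`, not `Exception`, so it is not caught. Because the coroutine is started with `asyncio.create_task` and never `await`ed, the exception bubbles up as an "unretrieved task exception" to the root of the event loop (`run_until_complete`) and ultimately brings down the whole AstrBot process. This matches exactly the call stack in the issue's error log, where `SystemExit: 1` passes through `runners.py` → `_serve`.
- Source of the port conflict: `stop()` only sets `should_exit = True`. When long-lived connections are still open (such as the WebUI's WebSocket), Uvicorn's graceful shutdown waits for them to disconnect and hangs forever, so `await asyncio.wait_for(..., timeout=5)` in `stop()` always times out. The timeout exception is then silently swallowed by `except Exception: pass`, and the old listening socket is never released. As a result, the new instance always fails to bind the same port during hot-reload.
🛠️ Modifications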
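Only one file is modified, `core/web_admin_server.py` (+18 / -4):
- `_serve()`: `except Exception` becomes `except BaseException`, which intercepts the `SystemExit` raised when Uvicorn fails to bind and keeps it from bringing down the host process. A separate `except asyncio.CancelledError: raise` lets the cancellation signal through, preserving the cancellation semantics of a normal stop.
- `stop()`: adds `self.server.force_exit = True`, which skips waiting for long-lived connections that are still open and ensures the listening socket is released. The timeout branch changes from `except Exception: pass` to logging a warning and cancelling the task, so it is no longer silently ignored.
Impact:
- Saving the configuration and triggering a hot-reload no longer crashes AstrBot.
- Even if the port is occasionally not released in time, the worst case is that this particular Web admin startup fails (only the Web admin is unavailable; the plugin's main functionality is unaffected). AstrBot as a whole is not brought down.
- `stop()` now releases the port reliably, so the Web admin can restart normally after a hot-reload.
- No breakage for existing users: the normal start/stop flow is unchanged; only the exceptional paths are more robust.
This is NOT a breaking change.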
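The `_serve()` and `stop()` changes listed under Modifications can be sketched against a stand-in server. This is a minimal sketch under stated assumptions, not the plugin's actual code: `FakeServer` and `WebAdminSketch` are hypothetical, `FakeServer` only mimics the `should_exit` / `force_exit` flags, and no real Uvicorn or socket is involved.

```python
import asyncio
import logging

logger = logging.getLogger("web_admin_sketch")  # hypothetical logger name
logger.addHandler(logging.NullHandler())


class FakeServer:
    """Hypothetical stand-in for the Uvicorn server object."""

    def __init__(self, fail_bind: bool = False) -> None:
        self.should_exit = False
        self.force_exit = False
        self.fail_bind = fail_bind

    async def serve(self) -> None:
        if self.fail_bind:
            raise SystemExit(1)  # what a failed bind surfaces as
        # Pretend open connections block shutdown unless force_exit is also set.
        while not (self.should_exit and self.force_exit):
            await asyncio.sleep(0.01)


class WebAdminSketch:
    def __init__(self, server: FakeServer) -> None:
        self.server = server
        self.server_task = None

    async def _serve(self) -> None:
        try:
            await self.server.serve()
        except asyncio.CancelledError:
            raise  # keep normal cancellation semantics
        except BaseException as e:
            logger.error("web admin failed: %r", e)

    def start(self) -> None:
        self.server_task = asyncio.create_task(self._serve())

    async def stop(self) -> str:
        self.server.should_exit = True
        self.server.force_exit = True
        try:
            await asyncio.wait_for(self.server_task, timeout=5)
            return "stopped"
        except asyncio.TimeoutError:
            logger.warning("web admin did not stop within 5 seconds")
            self.server_task.cancel()
            return "cancelled"


async def demo() -> tuple:
    bad = WebAdminSketch(FakeServer(fail_bind=True))
    bad.start()
    await asyncio.sleep(0.05)  # the failing task must not take the loop down
    survived = bad.server_task.done() and bad.server_task.exception() is None

    good = WebAdminSketch(FakeServer())
    good.start()
    await asyncio.sleep(0.05)
    return survived, await good.stop()


result = asyncio.run(demo())
```

In the sketch, the failing instance ends its task without an exception escaping to the event loop, and the healthy instance stops as soon as both flags are set.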
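📸 Screenshots or Test Results
Test method
Because the crash hinges on "the `_serve` task started by `asyncio.create_task`, the except hierarchy inside it, and the timing of the port release in `stop()`", we used a standalone script that replicates the real structure of `start()` / `_serve()` / `stop()` one-to-one (the same `create_task` without `await`, the same `except` hierarchy, the same `should_exit` + `force_exit`) and ran it against **real Uvicorn (0.48.0)** rather than faking the exception with a mock.
- `bind` fails → `sys.exit(1)` → `SystemExit`; observe whether the process is still brought down.
- `stop()`, then probe whether the port can be bound again, verifying that `force_exit` really released the socket.
- `cancel()` the task and confirm that `CancelledError` still propagates upward normally and is not swallowed by the new `except BaseException`, so the cancellation semantics of a normal unload/reload are not broken.
Test results
Control experiment (the pre-fix code): with the handler in `_serve` changed back to `except Exception`, scenario A crashes under the same script. `asyncio.run` raises `SystemExit` directly and prints `Task exception was never retrieved`, matching issue #75. With `except BaseException`, the process survives.
✅ Checklist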
❤️ CONTRIBUTING