English Version
Summary
Thanks for the comprehensive v1.5.0 download pipeline, especially the fastest strategy, publisher strategy registry, WebVPN/CARSI support, and CloakBrowser/Selenium fallbacks.
Could browser-capable sources get a separate global concurrency limit, or a single-browser reuse mode, for batch downloads? At the moment, when WebVPN-Camofox, CARSI-Camofox, publisher browser strategies, and the Selenium fallback are enabled, combining fastest with batch_workers can open many visible Chrome/CloakBrowser windows and chromedriver processes at once, which quickly fills the Windows taskbar.
Reproduction
- Version: scansci-pdf 1.5.0
- OS: Windows
- Python: Conda Python 3.11
- Strategy: fastest
- Key config: batch_workers=10, vpnsci_enabled=true, carsi_enabled=true, scihub_enabled=true, Elsevier API configured
- Batch input: 63 valid DOI records across Elsevier, Springer Nature, Wiley, MDPI, Scientific.Net, SAGE, IOP, etc.
Example DOI set:
10.1016/j.engfracmech.2024.110751
10.1002/stco.202500030
10.1007/s41403-023-00421-y
10.1016/j.jmatprotec.2022.117815
Observed log (after starting the batch, a single DOI races many sources concurrently):
ScanSci PDF - 10.1016/j.jmps.2023.105299 [fastest]
Racing 15 sources across 6 tiers (parallel)...
[WebVPN-Camofox] Target: ...
[CARSI-Camofox] Navigating to article: ...
[CARSI] Trying selenium download ...
[CARSI-Camofox] Opening browser for sciencedirect...
Observed behavior:
- Multiple visible Chrome/CloakBrowser windows open simultaneously.
- Several chromedriver.exe processes may remain after interrupting the batch.
- The user cannot easily tell which window belongs to which DOI/source.
Root cause in v1.5.0
The issue appears to come from multiplicative concurrency:
- Batch-level concurrency:
    # sources/__init__.py
    workers = config.get("batch_workers", 5)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        ...
  The code falls back to 5 when the key is absent, but the default config sets batch_workers=10.
- Per-paper fastest source racing:
    # sources/__init__.py
    pool = ThreadPoolExecutor(max_workers=len(all_sources))
  A single DOI may race 14-15 sources.
- Browser-capable sources each open their own visible browser:
    # sources/vpnsci.py
    browser = launch(headless=False, humanize=True, ...)
    # publisher_strategies.py
    launch_persistent_context(..., headless=False, humanize=True, ...)
The effective browser pressure can become:
batch_workers × browser-capable sources per DOI
This is much higher than expected for a normal batch download command.
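To make the multiplication concrete, here is a minimal simulation. It is not the project's code: nested thread pools stand in for the batch layer and the per-DOI race, and a counter records how many simulated browser launches overlap. The figure of four browser-capable sources per DOI is an assumption based on the four source types named in this report (WebVPN-Camofox, CARSI-Camofox, publisher browser, Selenium fallback), and all function and class names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

BATCH_WORKERS = 3      # small stand-in for batch_workers (10 in this report)
BROWSER_SOURCES = 4    # assumption: WebVPN, CARSI, publisher browser, Selenium


class PeakCounter:
    """Tracks how many simulated browsers are open at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self.active -= 1
        return False


def simulate_peak(batch_workers, browser_sources):
    counter = PeakCounter()
    # Every simulated browser waits at this barrier, so the overlap is
    # deterministic instead of depending on timing.
    barrier = threading.Barrier(batch_workers * browser_sources)

    def fake_browser_source():
        with counter:
            barrier.wait(timeout=10)

    def download_one_doi(_index):
        # Mirrors the per-DOI race: one thread per source.
        with ThreadPoolExecutor(max_workers=browser_sources) as race:
            futures = [race.submit(fake_browser_source)
                       for _ in range(browser_sources)]
            for future in futures:
                future.result()

    # Mirrors the batch layer: one thread per DOI being processed.
    with ThreadPoolExecutor(max_workers=batch_workers) as batch:
        list(batch.map(download_one_doi, range(batch_workers)))
    return counter.peak


peak = simulate_peak(BATCH_WORKERS, BROWSER_SOURCES)
print(peak, "simulated browsers open at once")
print("upper bound at batch_workers=10:", 10 * BROWSER_SOURCES)
```

With the small values above, the simulation reports 3 × 4 = 12 overlapping launches. Under the same assumption of four browser-capable sources, batch_workers=10 gives an upper bound of 40 visible windows.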
Suggested implementation following the current design
- Add separate browser concurrency settings:
    max_browser_workers = 1
    browser_mode = "single"  # off | single | pooled | per_source
- Separate normal sources from browser-capable sources inside fastest:
  - direct / OA / API sources can still race;
  - WebVPN/CARSI/publisher browser/Selenium sources go through a separate queue;
  - browser fallbacks are tried serially, and only after all non-browser sources have failed.
- Reuse one CloakBrowser/browser pool for the whole batch instead of calling launch() independently per source.
- Label each browser window or its log lines with DOI/source metadata:
    DOI: 10.xxxx/xxxx
    Source: WebVPN-Camofox
    Publisher: Elsevier
- Add an explicit config/CLI switch:
    enable_browser_sources_in_batch = false
  This allows users to measure non-browser coverage first and then opt into browser fallbacks.
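The suggestions above could be combined roughly as follows. This is a sketch under assumptions, not the project's actual API: Source, fetch_fastest, BROWSER_SLOTS, and the is_browser flag are hypothetical names, and the fake sources in the demo only return bytes or None. It shows a two-phase race (non-browser sources in parallel, then browser sources serially) guarded by one process-wide semaphore sized by the proposed max_browser_workers.

```python
import logging
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from typing import Callable, Optional

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("browser-limit-sketch")

MAX_BROWSER_WORKERS = 1  # proposed setting from this report
# One process-wide gate shared by every DOI in the batch (hypothetical name).
BROWSER_SLOTS = threading.BoundedSemaphore(MAX_BROWSER_WORKERS)


@dataclass
class Source:
    name: str
    fetch: Callable[[str], Optional[bytes]]  # returns PDF bytes or None
    is_browser: bool = False


def fetch_fastest(doi: str, sources: list[Source],
                  enable_browser_sources: bool = True) -> Optional[bytes]:
    plain = [s for s in sources if not s.is_browser]
    browser = [s for s in sources if s.is_browser]

    # Phase 1: direct / OA / API sources still race in parallel.
    # Note: leaving the with-block waits for slower sources to finish;
    # a real implementation would probably cancel them instead.
    if plain:
        with ThreadPoolExecutor(max_workers=len(plain)) as pool:
            futures = [pool.submit(s.fetch, doi) for s in plain]
            for future in as_completed(futures):
                try:
                    pdf = future.result()
                except Exception:
                    continue  # a failing source just loses the race
                if pdf:
                    return pdf

    if not enable_browser_sources:
        return None

    # Phase 2: browser sources run one at a time, and only while holding
    # a global slot, so other DOIs in the batch wait instead of launching.
    for source in browser:
        with BROWSER_SLOTS:
            log.info("DOI: %s | Source: %s", doi, source.name)
            try:
                pdf = source.fetch(doi)
            except Exception:
                pdf = None
        if pdf:
            return pdf
    return None


# Demo with fake sources: the OA source fails, the browser source succeeds.
demo_sources = [
    Source("OA", lambda doi: None),
    Source("WebVPN-Camofox", lambda doi: b"%PDF-", is_browser=True),
]
result = fetch_fastest("10.xxxx/xxxx", demo_sources)
print("got PDF bytes:", result is not None)
```

Because the semaphore lives at module level rather than inside the per-DOI call, the limit holds across all batch workers. Passing enable_browser_sources=False corresponds to the proposed enable_browser_sources_in_batch = false switch. The publisher field from the suggested log format is omitted here because the sketch has no publisher lookup.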
Suggested tests
- With batch_workers=10 and max_browser_workers=1, no more than one browser instance is active at any time.
- If a non-browser source succeeds in fastest, browser sources are not launched.
- While one browser source is running, browser sources for other DOIs wait in the queue.
- Interrupting the batch does not leave orphaned chromedriver.exe / CloakBrowser child processes.
- Logs map each browser task to its DOI, publisher, and source.
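The first and third tests could start from something like the sketch below. It assumes the limit is implemented as a shared semaphore, which is only one possible design, and it uses sleeping fake tasks instead of real browsers. run_batch and the placeholder DOIs are hypothetical and exist only for this illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_batch(dois, batch_workers, max_browser_workers):
    """Run one fake browser task per DOI; return (peak, completed DOIs)."""
    gate = threading.BoundedSemaphore(max_browser_workers)
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}
    completed = []

    def fake_browser_task(doi):
        with gate:  # other DOIs block here until a slot frees up
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            time.sleep(0.01)  # stands in for the time a browser stays open
            with lock:
                state["active"] -= 1
                completed.append(doi)

    with ThreadPoolExecutor(max_workers=batch_workers) as pool:
        list(pool.map(fake_browser_task, dois))
    return state["peak"], completed


def test_browser_cap_and_queueing():
    dois = [f"10.0000/fake.{i}" for i in range(10)]  # placeholder DOIs
    peak, completed = run_batch(dois, batch_workers=10, max_browser_workers=1)
    assert peak <= 1                          # never more than one browser
    assert sorted(completed) == sorted(dois)  # queued DOIs still finish


test_browser_cap_and_queueing()
print("browser cap test passed")
```

The orphaned-process test is not covered here. It would need to start real child processes and inspect the process table after an interrupt, which depends on how the project launches chromedriver and CloakBrowser.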
I can help validate a candidate implementation with the mixed-publisher DOI batch above using lawful institutional access.