Limit browser-source concurrency in batch fastest mode to avoid opening many Chrome/CloakBrowser windows #7

@lwz20210407

English Version

Summary

Thanks for the comprehensive v1.5.0 download pipeline, especially the fastest strategy, publisher strategy registry, WebVPN/CARSI support, and CloakBrowser/Selenium fallbacks.

Could browser-capable sources get a separate global concurrency limit, or a single-browser reuse mode, for batch downloads? At the moment, combining fastest with batch_workers can launch many visible Chrome/CloakBrowser/chromedriver windows at once when WebVPN-Camofox, CARSI-Camofox, publisher browser strategies, and the Selenium fallback are enabled, and the Windows taskbar fills up quickly.

Reproduction

  • Version: scansci-pdf 1.5.0
  • OS: Windows
  • Python: Conda Python 3.11
  • Strategy: fastest
  • Key config: batch_workers=10, vpnsci_enabled=true, carsi_enabled=true, scihub_enabled=true, Elsevier API configured
  • Batch input: 63 valid DOI records across Elsevier, Springer Nature, Wiley, MDPI, Scientific.Net, SAGE, IOP, etc.

Example DOI set:

10.1016/j.engfracmech.2024.110751
10.1002/stco.202500030
10.1007/s41403-023-00421-y
10.1016/j.jmatprotec.2022.117815

Observed log:

ScanSci PDF - 10.1016/j.jmps.2023.105299 [fastest]
Racing 15 sources across 6 tiers (parallel)...
[WebVPN-Camofox] Target: ...
[CARSI-Camofox] Navigating to article: ...
[CARSI] Trying selenium download ...
[CARSI-Camofox] Opening browser for sciencedirect...

Observed behavior:

  • Multiple visible Chrome/CloakBrowser windows open simultaneously.
  • Several chromedriver.exe processes may remain after interrupting the batch.
  • The user cannot easily tell which window belongs to which DOI/source.

Root cause in v1.5.0

The issue appears to come from multiplicative concurrency:

  1. Batch-level concurrency:
# sources/__init__.py
workers = config.get("batch_workers", 5)
with ThreadPoolExecutor(max_workers=workers) as pool:
    ...

The code falls back to 5 when the key is unset, but the default config sets batch_workers=10.

  2. Per-paper fastest source racing:
# sources/__init__.py
pool = ThreadPoolExecutor(max_workers=len(all_sources))

A single DOI may race 14-15 sources.

  3. Browser-capable sources independently open visible browsers:
# sources/vpnsci.py
browser = launch(headless=False, humanize=True, ...)

# publisher_strategies.py
launch_persistent_context(..., headless=False, humanize=True, ...)

The effective browser pressure can become:

batch_workers × browser-capable sources per DOI

This is much higher than expected for a normal batch download command.
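
As a rough worked example of that product, the sketch below assumes four browser-capable sources per DOI (WebVPN-Camofox, CARSI-Camofox, a publisher browser strategy, and the Selenium fallback). That count is an assumption for illustration; the real number depends on which sources are enabled and on the publisher. The function name is hypothetical and not part of scansci-pdf.

```python
def worst_case_browser_instances(batch_workers: int, browser_sources_per_doi: int) -> int:
    """Upper bound on simultaneously open browsers when every batch worker
    races all of its browser-capable sources at the same time."""
    return batch_workers * browser_sources_per_doi


# batch_workers=10 is the reported config; 4 is an assumed count of
# browser-capable sources per DOI, not a measured value.
print(worst_case_browser_instances(10, 4))  # prints 40
```

With a global cap of one browser, the same batch would hold at most one window open regardless of batch_workers.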

Suggested implementation following the current design

  1. Add separate browser concurrency settings:
max_browser_workers = 1
browser_mode = "single"  # off | single | pooled | per_source
  2. Separate normal sources from browser-capable sources inside fastest:
  • direct / OA / API sources can still race;
  • WebVPN/CARSI/publisher browser/Selenium sources go through a separate queue;
  • browser fallbacks are tried serially only after non-browser sources fail.
  3. Reuse one CloakBrowser/browser pool for the whole batch instead of calling launch() independently per source.
  4. Label each browser window, or the log lines that belong to it, with the DOI, source, and publisher:
DOI: 10.xxxx/xxxx
Source: WebVPN-Camofox
Publisher: Elsevier
  5. Add an explicit config/CLI switch:
enable_browser_sources_in_batch = false

This allows users to measure non-browser coverage first and then opt into browser fallbacks.
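
The separation described above could look roughly like the following. This is a minimal sketch, not the project's actual code: `fetch_fastest`, the `Source` type, and the module-level semaphore are hypothetical names, and it assumes each source is a callable that takes a DOI and returns a file path or None. The `allow_browser` flag stands in for the proposed enable_browser_sources_in_batch switch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional, Sequence

# Hypothetical type: a source takes a DOI and returns a PDF path, or None on failure.
Source = Callable[[str], Optional[str]]

MAX_BROWSER_WORKERS = 1  # would come from the proposed max_browser_workers setting
# One process-wide semaphore, shared by every batch worker thread.
_browser_slots = threading.BoundedSemaphore(MAX_BROWSER_WORKERS)


def fetch_fastest(doi: str,
                  plain_sources: Sequence[Source],
                  browser_sources: Sequence[Source],
                  allow_browser: bool = True) -> Optional[str]:
    """Race non-browser sources first; try browser sources serially afterwards."""
    if plain_sources:
        with ThreadPoolExecutor(max_workers=len(plain_sources)) as pool:
            futures = [pool.submit(src, doi) for src in plain_sources]
            for fut in as_completed(futures):
                try:
                    result = fut.result()
                except Exception:
                    continue  # a failing source counts as a miss
                if result is not None:
                    # Cancel sources that have not started yet; the `with`
                    # block still waits for sources that are already running.
                    for other in futures:
                        other.cancel()
                    return result
    if not allow_browser:
        return None
    for src in browser_sources:
        # Block until a global browser slot is free, so other DOIs queue here.
        with _browser_slots:
            try:
                result = src(doi)
            except Exception:
                result = None
        if result is not None:
            return result
    return None
```

Because the semaphore is global rather than per DOI, the cap holds across all batch workers, and a browser source is only reached when every non-browser source has returned nothing.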

Suggested tests

  • With batch_workers=10 and max_browser_workers=1, no more than one browser instance is active at any time.
  • If a non-browser source succeeds in fastest, browser sources are not launched.
  • While one browser source is running, browser sources for other DOIs wait in a queue.
  • Interrupting the batch does not leave orphaned chromedriver.exe / CloakBrowser child processes.
  • Logs map each browser task to its DOI, publisher, and source.

I can help validate a candidate implementation with the mixed-publisher DOI batch above using lawful institutional access.
