Add batch coverage audit report by publisher, source, and status #11

@lwz20210407

Description

Summary

Thanks for the multi-source download pipeline in v1.5.0, including the publisher strategy registry, WebVPN/CARSI support, Elsevier API, OA discovery, and batch download support. I ran a controlled coverage audit on a mixed-publisher batch of 63 valid DOIs and would like to suggest a built-in batch coverage / dry-run report mode, which could then be used to close the publisher coverage gaps it reveals.

This audit intentionally disabled browser-based sources, CARSI browser, Sci-Hub, and LibGen. It only tested non-browser routes: publisher non-browser sources, OA/direct APIs, and WebVPN HTTP cookies. This separates ordinary non-browser coverage from browser/institutional fallback coverage.

Reproduction

  • Version: scansci-pdf 1.5.0
  • OS: Windows
  • Python: Conda Python 3.11
  • Input: 63 valid DOI records across Elsevier, Springer Nature, MDPI, Wiley, IOP, SAGE, Scientific.Net, Taylor & Francis, EDP Sciences, etc.
  • Strategy used for this audit:
    • no Browser/CARSI browser
    • no Sci-Hub / LibGen
    • one attempt per DOI and per source, run sequentially
    • tested Crossref, Unpaywall, OpenAlexOA, DOAJ, CrossrefPage, EuropePMC, CORE, PMC, WebVPNHTTP
    • tested publisher non-browser sources where available, e.g. ElsevierAPI, MDPIDirect
    • validated PDF header, file size, and page count
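
The validation step above could be expressed roughly as follows. This is only a sketch of the idea, not scansci-pdf code: `PdfCheck`, `classify_pdf`, and both thresholds are hypothetical names and values, and the page count is assumed to come from whatever PDF parser the project already uses.

```python
from dataclasses import dataclass

PDF_MAGIC = b"%PDF-"  # every PDF file starts with this marker


@dataclass(frozen=True)
class PdfCheck:
    header: bytes    # first bytes of the downloaded file
    size_bytes: int  # file size on disk
    pages: int       # page count reported by a PDF parser (assumed available)


def classify_pdf(check: PdfCheck, min_pages: int = 2, min_size: int = 10_000) -> str:
    """Return 'failed', 'suspicious_pdf', or 'success' for one downloaded file.

    min_pages and min_size are illustrative thresholds, not project defaults.
    """
    if not check.header.startswith(PDF_MAGIC):
        return "failed"  # e.g. an HTML error page saved with a .pdf name
    if check.pages < min_pages or check.size_bytes < min_size:
        return "suspicious_pdf"  # valid PDF, but probably not the full article
    return "success"


# One of the Elsevier API results from this audit: a valid PDF with one page.
print(classify_pdf(PdfCheck(header=b"%PDF-1.7", size_bytes=295358, pages=1)))
```

A one-page threshold will misclassify genuine one-page items such as errata, which is one reason to keep suspicious_pdf as its own status rather than folding it into failed.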

Coverage result

Final result for 63 DOI records:

success:        31
suspicious_pdf: 6
failed:         26

By publisher:

Elsevier: 15 success, 6 suspicious_pdf, 4 failed
Springer Nature: 6 success, 6 failed
MDPI: 1 success, 5 failed
Wiley: 3 success, 1 failed
Trans Tech Publications / Scientific.Net: 3 success
IOP Publishing: 3 failed
SAGE: 2 failed
EDP Sciences: 1 success
University of Liege / ESAFORM: 1 success
Journal of Theoretical and Applied Mechanics / PTMTS: 1 success
Taylor & Francis: 1 failed
National Technical University KhPI: 1 failed
Ukrainian academic publisher: 1 failed
Universidad Nacional de Colombia: 1 failed
European Scientific Platform: 1 failed
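
As a consistency check, the per-publisher rows above can be summed and compared with the overall totals. The snippet below only re-enters the numbers from this report; `sum_statuses` is a hypothetical helper name.

```python
from collections import Counter

# Per-publisher counts copied from the table above.
BY_PUBLISHER = {
    "Elsevier": {"success": 15, "suspicious_pdf": 6, "failed": 4},
    "Springer Nature": {"success": 6, "failed": 6},
    "MDPI": {"success": 1, "failed": 5},
    "Wiley": {"success": 3, "failed": 1},
    "Trans Tech Publications / Scientific.Net": {"success": 3},
    "IOP Publishing": {"failed": 3},
    "SAGE": {"failed": 2},
    "EDP Sciences": {"success": 1},
    "University of Liege / ESAFORM": {"success": 1},
    "Journal of Theoretical and Applied Mechanics / PTMTS": {"success": 1},
    "Taylor & Francis": {"failed": 1},
    "National Technical University KhPI": {"failed": 1},
    "Ukrainian academic publisher": {"failed": 1},
    "Universidad Nacional de Colombia": {"failed": 1},
    "European Scientific Platform": {"failed": 1},
}


def sum_statuses(table: dict[str, dict[str, int]]) -> Counter:
    """Add up status counts across all publishers."""
    totals: Counter = Counter()
    for counts in table.values():
        totals.update(counts)  # Counter.update adds counts instead of replacing
    return totals


totals = sum_statuses(BY_PUBLISHER)
print(dict(totals))          # {'success': 31, 'suspicious_pdf': 6, 'failed': 26}
print(sum(totals.values()))  # 63
```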

All suspicious PDFs came from Elsevier API and were one-page PDFs:

10.1016/j.ijpvp.2026.105819       pages=1 size=295358
10.1016/j.microrel.2025.115923    pages=1 size=202798
10.1016/j.tafmec.2023.104031      pages=1 size=215473
10.1016/j.jcsr.2021.107082        pages=1 size=264508
10.1016/j.engfracmech.2020.107520 pages=1 size=241910
10.1016/j.jcsr.2020.106264        pages=1 size=4535118

Representative failed DOIs

SAGE:
10.1177/03093247261440642
10.1177/09544062231161438

IOP Publishing:
10.1088/1742-6596/3104/1/012026
10.1088/1757-899x/1284/1/012021
10.1088/1742-6596/1781/1/012027

MDPI:
10.3390/cryst14050417
10.3390/cryst12060845
10.3390/ma15030806
10.3390/app11188408
10.3390/met11030479

Springer Nature:
10.1007/s11665-023-08959-2
10.1007/s40192-022-00282-3
10.1007/s00170-022-08742-y
10.1007/s12613-021-2284-4
10.1007/s10704-020-00499-3
10.1007/978-3-030-77719-7_6

Taylor & Francis:
10.1080/02670836.2022.2110339

Coverage gaps exposed in v1.5.0

  1. There is no built-in coverage report summarizing batch results by publisher, source, and status.
  2. failed and suspicious_pdf should be counted separately; a one-page PDF should not be counted as success.
  3. OA publishers such as MDPI and IOP conference series would benefit from deterministic direct routes instead of relying mostly on OA discovery APIs.
  4. Springer chapter DOIs such as 10.1007/978-3-030-77719-7_6 may need separate handling from Springer journal DOIs.
  5. For smaller publishers / proceedings, the report should explain whether the publisher family is not routed, no direct rule exists, or manual checking is required.
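
For item 4, one possible way to tell the two Springer DOI shapes apart is to look at the suffix: the chapter DOI in this batch has an ISBN-like suffix followed by `_<chapter number>`, while the journal DOIs start with `s`. The sketch below is a heuristic built only on those observed patterns; `springer_doi_kind` and its return labels are hypothetical, and it deliberately covers only the `10.1007` prefix.

```python
import re

# ISBN-13-like suffix followed by "_<chapter number>", e.g. 978-3-030-77719-7_6
_CHAPTER_SUFFIX = re.compile(r"^97[89]-[\d-]+_\d+$")
# ISBN-13-like suffix without a chapter number (a whole book)
_BOOK_SUFFIX = re.compile(r"^97[89]-[\d-]+$")


def springer_doi_kind(doi: str) -> str:
    """Heuristically label a DOI as 'book_chapter', 'book', or 'journal_or_other'."""
    prefix, _, suffix = doi.strip().lower().partition("/")
    if prefix != "10.1007":
        return "not_10.1007"  # other Springer Nature prefixes are out of scope here
    if _CHAPTER_SUFFIX.match(suffix):
        return "book_chapter"
    if _BOOK_SUFFIX.match(suffix):
        return "book"
    return "journal_or_other"


for doi in ("10.1007/978-3-030-77719-7_6", "10.1007/s11665-023-08959-2"):
    print(doi, "->", springer_doi_kind(doi))
```

A router could use such a label to pick a chapter-specific PDF rule, or at least to record in the report that a chapter DOI went through the journal route.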

Suggested implementation following the current design

  1. Add a dry-run / coverage report mode:
scansci-pdf coverage input.txt --legal-only --no-browser --json coverage.json

Suggested output shape:

{
  "total": 63,
  "by_publisher": {
    "MDPI": {"success": 1, "failed": 5}
  },
  "by_source": {
    "Unpaywall": {"success": 6, "failed": 20}
  },
  "failed_items": [
    {
      "doi": "10.3390/cryst14050417",
      "publisher": "MDPI",
      "attempted_sources": ["MDPIDirect", "Crossref", "Unpaywall"],
      "suggested_action": "check_mdpi_direct_url_rule"
    }
  ]
}
  2. Add explicit status classes:
success
suspicious_pdf
failed
skipped_existing
not_routed
needs_browser
needs_institution
metadata_invalid
  3. Report source coverage for each publisher:
  • whether a publisher direct route exists;
  • whether OA APIs find a link;
  • whether WebVPN HTTP can construct a PDF URL;
  • whether browser fallback is required.
  4. Use the coverage report as a regression fixture to avoid coverage drops in future versions.
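
The report shape proposed above could be assembled from per-DOI results along these lines. This is a sketch under assumptions: `DoiResult`, `build_coverage_report`, and the extra `status` key in `failed_items` are hypothetical and do not reflect existing scansci-pdf internals. `suggested_action` is left out because deriving it needs publisher-specific rules.

```python
import json
from collections import Counter, defaultdict
from dataclasses import dataclass, field

# Status classes proposed in this issue.
STATUSES = (
    "success", "suspicious_pdf", "failed", "skipped_existing",
    "not_routed", "needs_browser", "needs_institution", "metadata_invalid",
)


@dataclass
class DoiResult:
    doi: str
    publisher: str
    status: str  # final status for the DOI, one of STATUSES
    # source name -> outcome of that single attempt, in the order tried
    attempts: dict[str, str] = field(default_factory=dict)


def build_coverage_report(results: list[DoiResult]) -> dict:
    """Summarize batch results by publisher and by source."""
    by_publisher: dict[str, Counter] = defaultdict(Counter)
    by_source: dict[str, Counter] = defaultdict(Counter)
    failed_items = []
    for r in results:
        if r.status not in STATUSES:
            raise ValueError(f"unknown status {r.status!r} for {r.doi}")
        by_publisher[r.publisher][r.status] += 1
        for source, outcome in r.attempts.items():
            by_source[source][outcome] += 1
        if r.status not in ("success", "skipped_existing"):
            failed_items.append({
                "doi": r.doi,
                "publisher": r.publisher,
                "status": r.status,
                "attempted_sources": list(r.attempts),
            })
    return {
        "total": len(results),
        "by_publisher": {p: dict(c) for p, c in by_publisher.items()},
        "by_source": {s: dict(c) for s, c in by_source.items()},
        "failed_items": failed_items,
    }


example = [DoiResult(
    "10.3390/cryst14050417", "MDPI", "failed",
    {"MDPIDirect": "failed", "Crossref": "failed", "Unpaywall": "failed"},
)]
print(json.dumps(build_coverage_report(example), indent=2))
```

Keeping the per-source outcome separate from the final per-DOI status lets the report show both which sources were tried and which one, if any, delivered the file.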

Suggested tests

  • Build a mixed-publisher DOI fixture covering Elsevier, Springer, MDPI, IOP, Wiley, SAGE, Taylor & Francis.
  • Generate coverage JSON under --no-browser --legal-only.
  • Ensure success, failed, and suspicious_pdf are counted separately.
  • Ensure each DOI records its attempted source list.
  • Ensure coverage JSON can be used as a CI artifact or regression fixture.
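
A regression check over two coverage JSON files might look like the sketch below. It assumes the `by_publisher` shape suggested earlier in this issue; `coverage_regressions` is a hypothetical helper, not an existing project function.

```python
def coverage_regressions(baseline: dict, current: dict, status: str = "success") -> list[str]:
    """List publishers whose count for `status` dropped relative to the baseline."""
    problems = []
    current_by_publisher = current.get("by_publisher", {})
    for publisher, old_counts in baseline.get("by_publisher", {}).items():
        old = old_counts.get(status, 0)
        new = current_by_publisher.get(publisher, {}).get(status, 0)
        if new < old:
            problems.append(f"{publisher}: {status} dropped from {old} to {new}")
    return problems


# Baseline numbers taken from this audit; the "current" run is made up for illustration.
baseline = {"by_publisher": {"MDPI": {"success": 1, "failed": 5},
                             "Wiley": {"success": 3, "failed": 1}}}
current = {"by_publisher": {"MDPI": {"success": 1, "failed": 5},
                            "Wiley": {"success": 2, "failed": 2}}}
print(coverage_regressions(baseline, current))
# ['Wiley: success dropped from 3 to 2']
```

In CI, a non-empty list could fail the job, with the two JSON files uploaded as artifacts for inspection. Since live publisher endpoints change over time, such a check is probably more stable against recorded responses than against the network.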

I can provide a sanitized fixture of these 63 DOIs and the full JSONL results if useful for regression testing.
