English Version
Summary
Thanks for the multi-source download pipeline in v1.5.0, including the publisher strategy registry, WebVPN/CARSI support, Elsevier API, OA discovery, and batch download support. I ran a controlled coverage audit on a mixed-publisher batch with 63 valid DOI records and would like to suggest a built-in batch coverage / dry-run report mode.
This audit intentionally disabled browser-based sources, CARSI browser, Sci-Hub, and LibGen. It only tested non-browser routes: publisher non-browser sources, OA/direct APIs, and WebVPN HTTP cookies. This separates ordinary non-browser coverage from browser/institutional fallback coverage.
Reproduction
- Version: scansci-pdf 1.5.0
- OS: Windows
- Python: Conda Python 3.11
- Input: 63 valid DOI records across Elsevier, Springer Nature, MDPI, Wiley, IOP, SAGE, Scientific.Net, Taylor & Francis, EDP Sciences, etc.
- Strategy used for this audit:
  - no Browser/CARSI browser
  - no Sci-Hub / LibGen
  - each DOI tried against each source in turn
  - tested Crossref, Unpaywall, OpenAlexOA, DOAJ, CrossrefPage, EuropePMC, CORE, PMC, WebVPNHTTP
  - tested publisher non-browser sources where available, e.g. ElsevierAPI, MDPIDirect
  - validated PDF header, file size, and page count
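For reference, the post-download check I applied can be sketched roughly as follows. This is only an illustration of the idea, not scansci-pdf code: `classify_pdf`, `MIN_PDF_BYTES`, and `MIN_PAGES` are names I made up, the size threshold is an assumed value, and the page count is expected to come from whatever PDF library is already in use.

```python
from typing import Optional

MIN_PDF_BYTES = 10_000  # assumed threshold, tune against real downloads
MIN_PAGES = 2           # a one-page "article" is treated as suspicious


def classify_pdf(data: bytes, pages: Optional[int]) -> str:
    """Return 'success', 'suspicious_pdf', or 'failed' for a downloaded payload.

    `pages` is the page count reported by a PDF library
    (for example len(pypdf.PdfReader(path).pages)); None means unreadable.
    """
    # Payloads without the PDF magic bytes are usually HTML error or login pages.
    if not data.startswith(b"%PDF-"):
        return "failed"
    # A real PDF header is not enough: tiny, unreadable, or one-page files
    # are flagged instead of being counted as a successful download.
    if pages is None or len(data) < MIN_PDF_BYTES or pages < MIN_PAGES:
        return "suspicious_pdf"
    return "success"
```

Checking the page count matters because file size alone is not a reliable signal: one of the one-page Elsevier files in this batch was over 4 MB.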
Coverage result
Final result for 63 DOI records:
success: 31
suspicious_pdf: 6
failed: 26
By publisher:
Elsevier: 15 success, 6 suspicious_pdf, 4 failed
Springer Nature: 6 success, 6 failed
MDPI: 1 success, 5 failed
Wiley: 3 success, 1 failed
Trans Tech Publications / Scientific.Net: 3 success
IOP Publishing: 3 failed
SAGE: 2 failed
EDP Sciences: 1 success
University of Liege / ESAFORM: 1 success
Journal of Theoretical and Applied Mechanics / PTMTS: 1 success
Taylor & Francis: 1 failed
National Technical University KhPI: 1 failed
Ukrainian academic publisher: 1 failed
Universidad Nacional de Colombia: 1 failed
European Scientific Platform: 1 failed
All suspicious PDFs came from Elsevier API and were one-page PDFs:
10.1016/j.ijpvp.2026.105819 pages=1 size=295358
10.1016/j.microrel.2025.115923 pages=1 size=202798
10.1016/j.tafmec.2023.104031 pages=1 size=215473
10.1016/j.jcsr.2021.107082 pages=1 size=264508
10.1016/j.engfracmech.2020.107520 pages=1 size=241910
10.1016/j.jcsr.2020.106264 pages=1 size=4535118
Representative failed DOIs
SAGE:
10.1177/03093247261440642
10.1177/09544062231161438
IOP Publishing:
10.1088/1742-6596/3104/1/012026
10.1088/1757-899x/1284/1/012021
10.1088/1742-6596/1781/1/012027
MDPI:
10.3390/cryst14050417
10.3390/cryst12060845
10.3390/ma15030806
10.3390/app11188408
10.3390/met11030479
Springer Nature:
10.1007/s11665-023-08959-2
10.1007/s40192-022-00282-3
10.1007/s00170-022-08742-y
10.1007/s12613-021-2284-4
10.1007/s10704-020-00499-3
10.1007/978-3-030-77719-7_6
Taylor & Francis:
10.1080/02670836.2022.2110339
Coverage gaps exposed in v1.5.0
- There is no built-in coverage report summarizing batch results by publisher, source, and status.
- The failed and suspicious_pdf statuses should be counted separately; a one-page PDF should not be counted as success.
- OA publishers such as MDPI and IOP conference series would benefit from deterministic direct routes instead of relying mostly on OA discovery APIs.
- Springer chapter DOIs such as 10.1007/978-3-030-77719-7_6 may need separate handling from Springer journal DOIs.
- For smaller publishers / proceedings, the report should explain whether the publisher family is not routed, no direct rule exists, or manual checking is required.
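To make the routing gaps above concrete, here is a minimal sketch of prefix-based routing that also separates Springer chapters from journal articles. It is an assumption-laden illustration, not the project's publisher strategy registry: `PREFIX_TO_FAMILY` and `route_doi` are hypothetical names, the prefix table only covers the DOI prefixes that appear in this report, and the chapter rule is a heuristic based on the ISBN-style suffix seen in this batch.

```python
import re

# Hypothetical routing table built from the DOI prefixes listed in this report.
PREFIX_TO_FAMILY = {
    "10.1016": "Elsevier",
    "10.1007": "Springer Nature",
    "10.3390": "MDPI",
    "10.1088": "IOP Publishing",
    "10.1177": "SAGE",
    "10.1080": "Taylor & Francis",
}

# Heuristic: the Springer chapter DOI in this batch has an ISBN-13 style suffix.
SPRINGER_CHAPTER = re.compile(r"^10\.1007/97[89]-")


def route_doi(doi: str) -> str:
    """Map a DOI to a publisher family, or 'not_routed' if no rule matches."""
    doi = doi.strip().lower()
    prefix = doi.split("/", 1)[0]
    family = PREFIX_TO_FAMILY.get(prefix)
    if family is None:
        # Reported explicitly instead of silently falling through to OA discovery.
        return "not_routed"
    if family == "Springer Nature":
        kind = "chapter" if SPRINGER_CHAPTER.match(doi) else "journal"
        return f"Springer Nature ({kind})"
    return family
```

With a rule like this, a DOI from a small publisher would show up in the report as not_routed rather than as a generic failure.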
Suggested implementation following the current design
- Add a dry-run / coverage report mode, for example:
    scansci-pdf coverage input.txt --legal-only --no-browser --json coverage.json
  Suggested output shape:
    {
      "total": 63,
      "by_publisher": {
        "MDPI": {"success": 1, "failed": 5}
      },
      "by_source": {
        "Unpaywall": {"success": 6, "failed": 20}
      },
      "failed_items": [
        {
          "doi": "10.3390/cryst14050417",
          "publisher": "MDPI",
          "attempted_sources": ["MDPIDirect", "Crossref", "Unpaywall"],
          "suggested_action": "check_mdpi_direct_url_rule"
        }
      ]
    }
- Add explicit status classes:
    success
    suspicious_pdf
    failed
    skipped_existing
    not_routed
    needs_browser
    needs_institution
    metadata_invalid
- Report source coverage for each publisher:
  - whether a publisher direct route exists;
  - whether OA APIs find a link;
  - whether WebVPN HTTP can construct a PDF URL;
  - whether browser fallback is required.
- Use the coverage report as a regression fixture to avoid coverage drops in future versions.
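As a rough sketch of how the report could be assembled from per-DOI results, assuming each result is a dict with doi, publisher, status, and a list of per-source attempts. The record layout and the `build_coverage` name are my own assumptions and do not reflect how scansci-pdf stores batch results today; the extra "status" field in failed_items is also my addition to the shape suggested above.

```python
from collections import Counter, defaultdict


def build_coverage(records):
    """Summarize per-DOI download records by publisher, source, and status."""
    by_publisher = defaultdict(Counter)
    by_source = defaultdict(Counter)
    failed_items = []
    total = 0
    for rec in records:
        total += 1
        by_publisher[rec["publisher"]][rec["status"]] += 1
        # Each attempt carries its own status, so one DOI can count as a
        # failure for several sources and a success for another.
        for attempt in rec["attempts"]:
            by_source[attempt["source"]][attempt["status"]] += 1
        if rec["status"] != "success":
            failed_items.append({
                "doi": rec["doi"],
                "publisher": rec["publisher"],
                "status": rec["status"],
                "attempted_sources": [a["source"] for a in rec["attempts"]],
            })
    return {
        "total": total,
        "by_publisher": {name: dict(c) for name, c in by_publisher.items()},
        "by_source": {name: dict(c) for name, c in by_source.items()},
        "failed_items": failed_items,
    }
```

The returned dict can be written out with json.dump, which is what makes it usable as a CI artifact.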
Suggested tests
- Build a mixed-publisher DOI fixture covering Elsevier, Springer, MDPI, IOP, Wiley, SAGE, Taylor & Francis.
- Generate coverage JSON under --no-browser --legal-only.
- Ensure success, failed, and suspicious_pdf are counted separately.
- Ensure each DOI records its attempted source list.
- Ensure coverage JSON can be used as a CI artifact or regression fixture.
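A regression check along these lines could compare a new coverage JSON against a stored baseline. This is a sketch under the assumption that both files follow the by_publisher shape suggested earlier; `coverage_regressions` is a hypothetical helper name.

```python
def coverage_regressions(baseline: dict, current: dict) -> list:
    """List publishers whose success count dropped relative to the baseline."""
    problems = []
    current_pubs = current.get("by_publisher", {})
    for publisher, old in baseline.get("by_publisher", {}).items():
        new = current_pubs.get(publisher, {})
        old_ok = old.get("success", 0)
        new_ok = new.get("success", 0)
        # Only drops are reported; improvements and new publishers pass silently.
        if new_ok < old_ok:
            problems.append(f"{publisher}: success {old_ok} -> {new_ok}")
    return problems
```

A CI job could load both files with json.load and fail when the returned list is non-empty.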
I can provide the 63-DOI fixture and full JSONL results if useful for regression testing.