中文版本
摘要
感谢作者在 v1.5.0 中实现了多来源下载、Sci-Hub fallback、WebVPN/CARSI 以及 PDF 基础校验。建议进一步区分“语法上是 PDF”和“确实是论文全文 PDF”。
在批量测试中,部分文件虽然通过 %PDF- / is_pdf_file() 校验并被视为成功,但实际只有 1 页且文件很小,疑似下载到了 preview、landing/error page、cover sheet 或错误 PDF。同时,use_tor=False 和 WebVPN session 相关日志也有容易误导用户的地方。
复现信息
- 版本:scansci-pdf 1.5.0
- 系统:Windows
- Python:Conda Python 3.11
- 策略:fastest
- 机构访问:WebVPN/CARSI 已配置
- Sci-Hub:启用,但调用侧设置 use_tor=False
- 批量输入:63 个有效 DOI,覆盖多个出版社
下载后使用 PyPDF2 对 30 个已归类 PDF 做页数检查,发现 6 个可疑文件:
Suspicious PDFs: 6 / 30
Pattern: 1 page, around 200-300 KB
示例特征:
pages=1, size=202798
pages=1, size=215473
pages=1, size=241849
pages=1, size=251572
pages=1, size=273561
pages=1, size=295358
在后续受控非浏览器测试中,也复现到 Elsevier API 返回一页小 PDF 的情况:
ElsevierAPI: pages=1 size=295358
这类文件语法上是 PDF,但很可能不是目标论文全文。
v1.5.0 中的根因
- PDF 成功判定过弱。
当前很多路径主要检查:
  - 文件存在;
  - header 是 %PDF-;
  - is_pdf_file() 通过。
但一个有效 PDF 仍可能是错误页、封面页、预览页或非目标文档。
- use_tor=False 后仍会出现 Tor fallback 日志。
sources/scihub.py 中存在自动递归:
if not use_tor:
    log.info("Sci-Hub: all clearnet domains failed, retrying via Tor...")
    return try_scihub(doi, output_path, config, use_tor=True)
用户显式传入 use_tor=False 时,仍会看到“retrying via Tor”,容易理解为配置没有生效。
- WebVPN session 日志语义过宽。
当前最终可能输出:
[WebVPN] No valid session. Use vpnsci_login or carsi_login tool first.
但在某些情况下 cookies 文件实际存在,只是 WebVPN-Camofox / HTTP 对当前 DOI 没有成功。这条日志会让用户误以为 WebVPN 登录状态完全无效。
按当前架构建议的实现方式
- 增加 final PDF quality validation:
  - page count;
  - file size threshold;
  - 从前几页提取文本并匹配 DOI/title 关键词;
  - 可选检查 PDF metadata;
  - 对 publisher/API source 设置 source-specific lower bound。
- 将可疑 PDF 作为独立状态,而不是成功:
{
  "success": false,
  "status": "suspicious_pdf",
  "reason": "one_page_or_too_small",
  "pages": 1,
  "size": 241849
}
- 当某个 source 返回可疑 PDF 时,继续尝试下一个 source,而不是立即结束。
- 严格尊重 use_tor=False,或增加独立配置:
    scihub_auto_tor_fallback = false
当该配置为 false 时,不应输出 “retrying via Tor”。
- 细分 WebVPN 诊断日志:
  - no saved cookies;
  - saved cookies exist but validation failed;
  - saved cookies exist but this DOI download failed;
  - browser fallback unavailable;
  - browser fallback failed。
建议测试
- 构造一个 1 页有效 PDF,确认不会被计为正常成功。
- 构造一个 200-300 KB 的有效 PDF,确认进入 suspicious_pdf 状态。
- source A 返回可疑 PDF 后,source B 仍会继续尝试。
- use_tor=False 时,不发生 Tor fallback,也不输出 retrying via Tor。
- cookies 文件存在但 DOI 下载失败时,日志不应说成 “No valid session”。
- 最终报告中分别统计 success、suspicious_pdf、failed。
如果后续有候选实现,我可以使用已经复现的一页小 PDF 样本和合法机构访问环境协助验证。
English Version
Summary
Thanks for implementing the multi-source download pipeline, Sci-Hub fallback, WebVPN/CARSI support, and basic PDF validation in v1.5.0. Could the tool distinguish "syntactically valid PDF" from "likely full-text article PDF" more strictly?
In batch testing, some files passed %PDF- / is_pdf_file() validation and were treated as successful, but were only one page and very small. They looked like preview pages, landing/error pages, cover sheets, or wrong PDFs rather than full-text articles. There are also two logging behaviors that can be confusing around use_tor=False and WebVPN session status.
Reproduction
- Version: scansci-pdf 1.5.0
- OS: Windows
- Python: Conda Python 3.11
- Strategy: fastest
- Institutional access: WebVPN/CARSI configured
- Sci-Hub: enabled, but the caller set use_tor=False
- Batch input: 63 valid DOI records across multiple publishers
After checking 30 classified PDFs with PyPDF2, 6 suspicious files were found:
Suspicious PDFs: 6 / 30
Pattern: 1 page, around 200-300 KB
Example characteristics:
pages=1, size=202798
pages=1, size=215473
pages=1, size=241849
pages=1, size=251572
pages=1, size=273561
pages=1, size=295358
A later controlled non-browser run also reproduced a small one-page PDF returned by the Elsevier API:
ElsevierAPI: pages=1 size=295358
These files are valid PDFs, but likely not full-text article PDFs.
Root cause in v1.5.0
- Success validation is too weak.
Many paths primarily check:
  - file exists;
  - header starts with %PDF-;
  - is_pdf_file() passes.
A syntactically valid PDF can still be an error page, cover sheet, preview page, or wrong document.
- use_tor=False still produces Tor fallback behavior/logs.
sources/scihub.py contains:
if not use_tor:
    log.info("Sci-Hub: all clearnet domains failed, retrying via Tor...")
    return try_scihub(doi, output_path, config, use_tor=True)
When a caller explicitly passes use_tor=False, seeing "retrying via Tor" makes it look like the setting was ignored.
- WebVPN session diagnostics are too broad.
The final log may say:
[WebVPN] No valid session. Use vpnsci_login or carsi_login tool first.
In some cases a cookies file exists, but WebVPN-Camofox / HTTP simply failed for the current DOI. The message can make users think their WebVPN login is completely invalid.
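To illustrate how the single "No valid session" message could be split, here is a minimal sketch. The enum, the function name `diagnose_webvpn_failure`, and its three boolean inputs are hypothetical and not part of scansci-pdf; the message strings are the five cases proposed in this issue. It assumes the caller already knows whether a cookies file exists, whether the session validated, and whether a browser fallback was available.

```python
from enum import Enum


class WebVPNDiagnosis(Enum):
    # Hypothetical names; the strings mirror the cases proposed in this issue.
    NO_SAVED_COOKIES = "no saved cookies"
    VALIDATION_FAILED = "saved cookies exist but validation failed"
    DOI_DOWNLOAD_FAILED = "saved cookies exist but this DOI download failed"
    BROWSER_UNAVAILABLE = "browser fallback unavailable"
    BROWSER_FAILED = "browser fallback failed"


def diagnose_webvpn_failure(cookies_exist: bool,
                            session_valid: bool,
                            browser_available: bool) -> list:
    """Explain a WebVPN failure for one DOI (call only after the download failed)."""
    if not cookies_exist:
        return [WebVPNDiagnosis.NO_SAVED_COOKIES]
    if not session_valid:
        return [WebVPNDiagnosis.VALIDATION_FAILED]
    # The session looks fine, so the failure is specific to this DOI.
    reasons = [WebVPNDiagnosis.DOI_DOWNLOAD_FAILED]
    if browser_available:
        reasons.append(WebVPNDiagnosis.BROWSER_FAILED)
    else:
        reasons.append(WebVPNDiagnosis.BROWSER_UNAVAILABLE)
    return reasons


if __name__ == "__main__":
    for reason in diagnose_webvpn_failure(True, True, False):
        print(f"[WebVPN] {reason.value}")
```

With something like this, the "Use vpnsci_login or carsi_login tool first" hint would only be shown for the first two cases, where logging in again can actually help.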
Suggested implementation following the current design
- Add final PDF quality validation:
  - page count;
  - file size threshold;
  - DOI/title keyword probe from the first pages;
  - optional PDF metadata check;
  - source-specific lower bounds for publisher/API sources.
- Represent suspicious PDFs as a separate status instead of success:
{
  "success": false,
  "status": "suspicious_pdf",
  "reason": "one_page_or_too_small",
  "pages": 1,
  "size": 241849
}
- If one source returns a suspicious PDF, continue trying the next source instead of stopping.
- Respect use_tor=False strictly, or add a separate option:
    scihub_auto_tor_fallback = false
When disabled, "retrying via Tor" should not be logged and Tor should not be attempted.
- Split WebVPN diagnostics into clearer cases:
  - no saved cookies;
  - saved cookies exist but validation failed;
  - saved cookies exist but this DOI download failed;
  - browser fallback unavailable;
  - browser fallback failed.
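The quality check and the "keep trying the next source" behavior could look roughly like the sketch below. It is not the project's code: `classify_pdf`, `download_with_fallback`, the thresholds `MIN_PAGES` and `MIN_SIZE_BYTES`, and the reason `doi_not_in_first_pages` are all assumptions made for illustration. The thresholds are chosen only so that the 200-300 KB one-page samples in this report are flagged, and would probably need per-source tuning. Page count and extracted text are passed in as plain values, so any PDF library can supply them (for example `len(PdfReader(path).pages)` in recent PyPDF2 releases).

```python
MIN_PAGES = 2             # assumption: a full-text article has at least 2 pages
MIN_SIZE_BYTES = 300_000  # assumption: covers the 200-300 KB samples reported here


def classify_pdf(pages, size, first_pages_text="", doi=""):
    """Return a result dict in the shape proposed in this issue."""
    if pages < MIN_PAGES or size < MIN_SIZE_BYTES:
        return {"success": False, "status": "suspicious_pdf",
                "reason": "one_page_or_too_small", "pages": pages, "size": size}
    # Optional probe: only applied when both the text and the DOI are supplied.
    if doi and first_pages_text and doi.lower() not in first_pages_text.lower():
        return {"success": False, "status": "suspicious_pdf",
                "reason": "doi_not_in_first_pages", "pages": pages, "size": size}
    return {"success": True, "status": "success", "pages": pages, "size": size}


def download_with_fallback(doi, sources):
    """Try each (name, fetch) pair in order; fetch(doi) -> (pages, size) or None."""
    last = {"success": False, "status": "failed"}
    for name, fetch in sources:
        got = fetch(doi)
        if got is None:
            continue  # this source failed outright, try the next one
        result = classify_pdf(*got)
        result["source"] = name
        if result["success"]:
            return result
        last = result  # remember the suspicious result, but keep trying
    return last


if __name__ == "__main__":
    sources = [
        ("A", lambda doi: (1, 241849)),      # one-page small PDF, like the samples
        ("B", lambda doi: (12, 1_500_000)),  # plausible full text
    ]
    print(download_with_fallback("10.1000/example", sources))
```

If every source returns a suspicious file, the last suspicious result is returned rather than a plain failure, so the final report can count it separately. The DOI probe can produce false positives for articles that do not print their DOI, so it may be better treated as a soft signal.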
Suggested tests
- A valid one-page PDF is not counted as normal success.
- A 200-300 KB valid PDF enters suspicious_pdf.
- If source A returns a suspicious PDF, source B is still attempted.
- With use_tor=False, no Tor fallback is attempted and no "retrying via Tor" log is emitted.
- If cookies exist but the DOI download fails, logs do not say only "No valid session".
- Final reports separately count success, suspicious_pdf, and failed.
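As one example of how the use_tor=False test might be written, the sketch below uses a stand-in for the Sci-Hub function rather than the real one. `try_scihub_sketch`, its `fetch` callback, the `auto_tor_fallback` parameter, and the DOI `10.1000/example` are all hypothetical; only the log message is taken from sources/scihub.py as quoted in this issue. The stand-in retries over Tor only when the fallback option is explicitly enabled, and a small logging handler collects messages so the test can check them.

```python
import logging

log = logging.getLogger("scihub_sketch")
log.setLevel(logging.INFO)


class ListHandler(logging.Handler):
    """Collects log messages so a test can inspect them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def try_scihub_sketch(doi, fetch, use_tor=False, auto_tor_fallback=False):
    """Hypothetical stand-in: fetch(doi, use_tor) -> bool does the actual work."""
    if fetch(doi, use_tor):
        return True
    # Retry via Tor only if the caller opted in; use_tor=False alone never does.
    if not use_tor and auto_tor_fallback:
        log.info("Sci-Hub: all clearnet domains failed, retrying via Tor...")
        return try_scihub_sketch(doi, fetch, use_tor=True)
    return False


def run_case(auto_tor_fallback):
    """Run one failing download and report (result, use_tor values seen, logs)."""
    handler = ListHandler()
    log.addHandler(handler)
    calls = []

    def always_fail(doi, use_tor):
        calls.append(use_tor)
        return False

    try:
        ok = try_scihub_sketch("10.1000/example", always_fail,
                               use_tor=False, auto_tor_fallback=auto_tor_fallback)
    finally:
        log.removeHandler(handler)
    return ok, calls, handler.messages


if __name__ == "__main__":
    print(run_case(auto_tor_fallback=False))
    print(run_case(auto_tor_fallback=True))
```

The first case is the one this issue asks for: a single clearnet attempt and no Tor-related log line. The second case shows that the existing fallback stays available when it is enabled on purpose.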
I can help validate candidate changes with the reproduced one-page PDF samples and lawful institutional access.