Improve full-text PDF validation and clarify Tor/WebVPN fallback logs #9

@lwz20210407

Summary

Thanks for implementing the multi-source download pipeline, Sci-Hub fallback, WebVPN/CARSI support, and basic PDF validation in v1.5.0. Could the tool distinguish "syntactically valid PDF" from "likely full-text article PDF" more strictly?

In batch testing, some files passed %PDF- / is_pdf_file() validation and were treated as successful, but were only one page long and very small. They looked like preview pages, landing/error pages, cover sheets, or wrong PDFs rather than full-text articles. Two logging behaviors around use_tor=False and WebVPN session status are also confusing.

Reproduction

  • Version: scansci-pdf 1.5.0
  • OS: Windows
  • Python: Conda Python 3.11
  • Strategy: fastest
  • Institutional access: WebVPN/CARSI configured
  • Sci-Hub: enabled, but caller set use_tor=False
  • Batch input: 63 valid DOI records across multiple publishers

A page-count check with PyPDF2 on 30 of the classified PDFs flagged 6 suspicious files:

Suspicious PDFs: 6 / 30
Pattern: 1 page, around 200-300 KB

Example characteristics:

pages=1, size=202798
pages=1, size=215473
pages=1, size=241849
pages=1, size=251572
pages=1, size=273561
pages=1, size=295358

A later controlled non-browser run also reproduced a small one-page PDF from the Elsevier API:

ElsevierAPI: pages=1 size=295358

These files are valid PDFs, but likely not full-text article PDFs.
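
For anyone who wants to repeat the check, a minimal sketch of the audit is below. It assumes PyPDF2 is installed; `count_pages` and `audit_folder` are hypothetical helper names for illustration, not part of scansci-pdf.

```python
from pathlib import Path


def count_pages(path):
    """Count pages with PyPDF2, the library used for the check above."""
    from PyPDF2 import PdfReader  # third-party; imported lazily on first use

    return len(PdfReader(str(path)).pages)


def audit_folder(folder, page_counter=count_pages):
    """Return one 'pages=N, size=BYTES' line per PDF found in folder."""
    lines = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        lines.append(f"pages={page_counter(pdf)}, size={pdf.stat().st_size}")
    return lines
```

Calling `audit_folder("downloads")` on a folder of results (the folder name is only an example) produces lines in the same format as the list above.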

Root cause in v1.5.0

  1. Success validation is too weak.

Many paths primarily check:

  • file exists;
  • header starts with %PDF-;
  • is_pdf_file() passes.

A syntactically valid PDF can still be an error page, cover sheet, preview page, or wrong document.

  2. use_tor=False still triggers the Tor fallback and its log message.

sources/scihub.py contains:

if not use_tor:
    log.info("Sci-Hub: all clearnet domains failed, retrying via Tor...")
    return try_scihub(doi, output_path, config, use_tor=True)

When a caller explicitly passes use_tor=False, seeing "retrying via Tor" makes it look like the setting was ignored.

  3. WebVPN session diagnostics are too broad.

The final log may say:

[WebVPN] No valid session. Use vpnsci_login or carsi_login tool first.

In some cases a cookies file exists, but WebVPN-Camofox / HTTP simply failed for the current DOI. The message can make users think their WebVPN login is completely invalid.

Suggested implementation following the current design

  1. Add final PDF quality validation:
  • page count;
  • file size threshold;
  • DOI/title keyword match against text extracted from the first few pages;
  • optional PDF metadata check;
  • source-specific lower bounds for publisher/API sources.
  2. Represent suspicious PDFs as a separate status instead of success:
{
  "success": false,
  "status": "suspicious_pdf",
  "reason": "one_page_or_too_small",
  "pages": 1,
  "size": 241849
}
  3. If one source returns a suspicious PDF, continue trying the next source instead of stopping.
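
The first three points could fit together roughly as sketched below. This is only an illustration under stated assumptions: the thresholds, `assess_pdf`, `download_first_plausible`, and the `(pages, size)` return convention are all hypothetical, and each source is modelled as a plain callable rather than the real source classes.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical thresholds for illustration; real values would need tuning,
# possibly per source.
MIN_PAGES = 2
MIN_SIZE_BYTES = 300 * 1024

PdfInfo = Tuple[int, int]  # (page count, size in bytes)


def assess_pdf(pages: int, size: int) -> dict:
    """Classify a downloaded PDF using the result shape proposed above."""
    if pages < MIN_PAGES or size < MIN_SIZE_BYTES:
        return {
            "success": False,
            "status": "suspicious_pdf",
            "reason": "one_page_or_too_small",
            "pages": pages,
            "size": size,
        }
    return {"success": True, "status": "success", "pages": pages, "size": size}


def download_first_plausible(
    sources: Iterable[Tuple[str, Callable[[], Optional[PdfInfo]]]],
) -> dict:
    """Try each source in order and stop only at a PDF that passes assess_pdf."""
    last_suspicious = None
    for name, fetch in sources:
        info = fetch()  # None means this source produced no PDF at all
        if info is None:
            continue
        result = {**assess_pdf(*info), "source": name}
        if result["success"]:
            return result
        last_suspicious = result  # remember it, but keep trying other sources
    return last_suspicious or {"success": False, "status": "failed"}
```

A size-only rule will over-flag genuinely short papers, which is why the text probe and source-specific bounds from point 1 would still be needed on top of this.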

  4. Respect use_tor=False strictly, or add a separate option:

scihub_auto_tor_fallback = false

When disabled, "retrying via Tor" should not be logged and Tor should not be attempted.
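
A minimal sketch of how the retry could be gated is below. `fetch_with_optional_tor` and its `fetch` callback are hypothetical stand-ins, not the actual code in sources/scihub.py; `auto_tor_fallback` mirrors the proposed scihub_auto_tor_fallback option.

```python
import logging
from typing import Callable

log = logging.getLogger("scihub")


def fetch_with_optional_tor(
    fetch: Callable[[bool], bool],
    use_tor: bool,
    auto_tor_fallback: bool = False,
) -> bool:
    """Run fetch(via_tor) and retry through Tor only when explicitly allowed.

    fetch is a hypothetical stand-in for one Sci-Hub download attempt; it
    returns True on success.
    """
    if fetch(use_tor):
        return True
    if use_tor:
        return False  # Tor was already used; nothing left to fall back to
    if auto_tor_fallback:
        log.info("Sci-Hub: all clearnet domains failed, retrying via Tor...")
        return fetch(True)
    log.info("Sci-Hub: all clearnet domains failed; Tor fallback is disabled.")
    return False
```

With the default `auto_tor_fallback=False`, a caller passing `use_tor=False` never reaches the Tor branch, so the "retrying via Tor" line cannot appear.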

  5. Split WebVPN diagnostics into clearer cases:
  • no saved cookies;
  • saved cookies exist but validation failed;
  • saved cookies exist but this DOI download failed;
  • browser fallback unavailable;
  • browser fallback failed.
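
The five cases could be modelled as an enum, as in the sketch below. The names `WebVPNIssue`, `diagnose`, and `format_issues`, the boolean inputs, and the exact message wording are all assumptions for illustration; the real pipeline may track this state differently.

```python
from enum import Enum
from typing import List


class WebVPNIssue(Enum):
    NO_COOKIES = "no saved cookies; log in with vpnsci_login or carsi_login first"
    VALIDATION_FAILED = "saved cookies exist but session validation failed"
    DOI_DOWNLOAD_FAILED = "saved cookies exist but this DOI download failed"
    BROWSER_UNAVAILABLE = "browser fallback unavailable"
    BROWSER_FAILED = "browser fallback failed"


def diagnose(
    cookies_exist: bool,
    session_valid: bool,
    http_ok: bool,
    browser_available: bool = False,
    browser_ok: bool = False,
) -> List[WebVPNIssue]:
    """Return the issues to report; an empty list means the download succeeded."""
    if not cookies_exist:
        return [WebVPNIssue.NO_COOKIES]
    if not session_valid:
        return [WebVPNIssue.VALIDATION_FAILED]
    if http_ok:
        return []
    if not browser_available:
        return [WebVPNIssue.DOI_DOWNLOAD_FAILED, WebVPNIssue.BROWSER_UNAVAILABLE]
    if browser_ok:
        return []
    return [WebVPNIssue.DOI_DOWNLOAD_FAILED, WebVPNIssue.BROWSER_FAILED]


def format_issues(issues: List[WebVPNIssue]) -> List[str]:
    """Render issues as log lines with the existing [WebVPN] prefix."""
    return [f"[WebVPN] {issue.value}" for issue in issues]
```

Only the first case tells the user to log in again; a per-DOI failure with a working session is reported as such.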

Suggested tests

  • A valid one-page PDF is not counted as normal success.
  • A 200-300 KB valid PDF enters the suspicious_pdf status.
  • If source A returns a suspicious PDF, source B is still attempted.
  • With use_tor=False, no Tor fallback is attempted and no "retrying via Tor" log is emitted.
  • If cookies exist but the DOI download fails, logs do not say only "No valid session".
  • Final reports separately count success, suspicious_pdf, and failed.
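
The last test could be checked against a small tally like the sketch below. `summarize` is a hypothetical helper and assumes each result is a dict carrying the "status" field proposed earlier.

```python
from collections import Counter

STATUSES = ("success", "suspicious_pdf", "failed")


def summarize(results):
    """Count results per status, always reporting all three statuses."""
    counts = Counter(result["status"] for result in results)
    return {status: counts.get(status, 0) for status in STATUSES}
```

Reporting a zero for an absent status keeps batch reports comparable across runs.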

I can help validate candidate changes with the reproduced one-page PDF samples and lawful institutional access.
