增强 PDF 全文有效性校验并修正 Tor/WebVPN 误导日志 / Improve full-text PDF validation and clarify Tor/WebVPN fallback logs

# 中文版本

## 摘要

感谢作者在 `v1.5.0` 中实现了多来源下载、Sci-Hub fallback、WebVPN/CARSI 以及 PDF 基础校验。建议进一步区分“语法上是 PDF”和“确实是论文全文 PDF”。

在批量测试中，部分文件虽然通过 `%PDF-` / `is_pdf_file()` 校验并被视为成功，但实际只有 1 页且文件很小，疑似下载到了 preview、landing/error page、cover sheet 或错误 PDF。同时，`use_tor=False` 和 WebVPN session 相关日志也有容易误导用户的地方。

## 复现信息

- 版本：`scansci-pdf 1.5.0`
- 系统：Windows
- Python：Conda Python 3.11
- 策略：`fastest`
- 机构访问：WebVPN/CARSI 已配置
- Sci-Hub：启用，但调用侧设置 `use_tor=False`
- 批量输入：63 个有效 DOI，覆盖多个出版社

下载后使用 PyPDF2 对 30 个已归类 PDF 做页数检查，发现 6 个可疑文件：

```text
Suspicious PDFs: 6 / 30
Pattern: 1 page, around 200-300 KB
```

示例特征：

```text
pages=1, size=202798
pages=1, size=215473
pages=1, size=241849
pages=1, size=251572
pages=1, size=273561
pages=1, size=295358
```

在后续受控非浏览器测试中，也复现到 Elsevier API 返回一页小 PDF 的情况：

```text
ElsevierAPI: pages=1 size=295358
```

这类文件语法上是 PDF，但很可能不是目标论文全文。

## `v1.5.0` 中的根因

1. PDF 成功判定过弱。

当前很多路径主要检查：

- 文件存在；
- header 是 `%PDF-`；
- `is_pdf_file()` 通过。

但一个有效 PDF 仍可能是错误页、封面页、预览页或非目标文档。

2. `use_tor=False` 后仍会出现 Tor fallback 日志。

`sources/scihub.py` 中存在自动递归：

```python
if not use_tor:
    log.info("Sci-Hub: all clearnet domains failed, retrying via Tor...")
    return try_scihub(doi, output_path, config, use_tor=True)
```

用户显式传入 `use_tor=False` 时，仍会看到“retrying via Tor”，容易理解为配置没有生效。

3. WebVPN session 日志语义过宽。

当前最终可能输出：

```text
[WebVPN] No valid session. Use vpnsci_login or carsi_login tool first.
```

但在某些情况下 cookies 文件实际存在，只是 WebVPN-Camofox / HTTP 对当前 DOI 没有成功。这条日志会让用户误以为 WebVPN 登录状态完全无效。

## 按当前架构建议的实现方式

1. 增加 final PDF quality validation：

- page count；
- file size threshold；
- 从前几页提取文本并匹配 DOI/title 关键词；
- 可选检查 PDF metadata；
- 对 publisher/API source 设置 source-specific lower bound。

2. 将可疑 PDF 作为独立状态，而不是成功：

```json
{
  "success": false,
  "status": "suspicious_pdf",
  "reason": "one_page_or_too_small",
  "pages": 1,
  "size": 241849
}
```

3. 当某个 source 返回可疑 PDF 时，继续尝试下一个 source，而不是立即结束。

4. 严格尊重 `use_tor=False`，或增加独立配置：

```toml
scihub_auto_tor_fallback = false
```

当该配置为 false 时，不应输出 “retrying via Tor”。

5. 细分 WebVPN 诊断日志：

- no saved cookies；
- saved cookies exist but validation failed；
- saved cookies exist but this DOI download failed；
- browser fallback unavailable；
- browser fallback failed。

## 建议测试

- 构造一个 1 页有效 PDF，确认不会被计为正常成功。
- 构造一个 200-300 KB 的有效 PDF，确认进入 `suspicious_pdf` 状态。
- source A 返回可疑 PDF 后，source B 仍会继续尝试。
- `use_tor=False` 时，不发生 Tor fallback，也不输出 retrying via Tor。
- cookies 文件存在但 DOI 下载失败时，日志不应说成 “No valid session”。
- 最终报告中分别统计 `success`、`suspicious_pdf`、`failed`。

如果后续有候选实现，我可以使用已经复现的一页小 PDF 样本和合法机构访问环境协助验证。

---

# English Version

## Summary

Thanks for implementing the multi-source download pipeline, Sci-Hub fallback, WebVPN/CARSI support, and basic PDF validation in `v1.5.0`. This issue asks for a stricter distinction between a "syntactically valid PDF" and a "likely full-text article PDF".

In batch testing, some files passed the `%PDF-` / `is_pdf_file()` validation and were treated as successful, but were only one page long and very small. They looked like preview pages, landing/error pages, cover sheets, or wrong PDFs rather than full-text articles. Separately, two log messages are misleading: one appears even when `use_tor=False` is set, and one misreports WebVPN session status.

## Reproduction

- Version: `scansci-pdf 1.5.0`
- OS: Windows
- Python: Conda Python 3.11
- Strategy: `fastest`
- Institutional access: WebVPN/CARSI configured
- Sci-Hub: enabled, but caller set `use_tor=False`
- Batch input: 63 valid DOI records across multiple publishers

A PyPDF2 page-count check on 30 of the downloaded, classified PDFs flagged 6 suspicious files:

```text
Suspicious PDFs: 6 / 30
Pattern: 1 page, around 200-300 KB
```

Example characteristics:

```text
pages=1, size=202798
pages=1, size=215473
pages=1, size=241849
pages=1, size=251572
pages=1, size=273561
pages=1, size=295358
```

A later controlled non-browser run also reproduced a small one-page PDF returned by the Elsevier API:

```text
ElsevierAPI: pages=1 size=295358
```

These files are syntactically valid PDFs, but are probably not the full text of the target articles.

## Root cause in `v1.5.0`

1. Success validation is too weak.

Many paths primarily check:

- file exists;
- header starts with `%PDF-`;
- `is_pdf_file()` passes.

A syntactically valid PDF can still be an error page, cover sheet, preview page, or wrong document.

2. `use_tor=False` still produces Tor fallback behavior/logs.

`sources/scihub.py` contains:

```python
if not use_tor:
    log.info("Sci-Hub: all clearnet domains failed, retrying via Tor...")
    return try_scihub(doi, output_path, config, use_tor=True)
```

When a caller explicitly passes `use_tor=False`, seeing "retrying via Tor" makes it look like the setting was ignored.

3. WebVPN session diagnostics are too broad.

The final log may say:

```text
[WebVPN] No valid session. Use vpnsci_login or carsi_login tool first.
```

In some cases a cookies file exists, but WebVPN-Camofox / HTTP simply failed for the current DOI. The message can make users think their WebVPN login is completely invalid.

## Suggested implementation following the current design

1. Add final PDF quality validation:

- page count;
- file size threshold;
- DOI/title keyword probe from the first pages;
- optional PDF metadata check;
- source-specific lower bounds for publisher/API sources.
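The checks above could be combined roughly as follows. This is a minimal sketch, not the project's code: `assess_pdf`, `PdfVerdict`, and the thresholds (`min_pages=2`, `min_size=300_000`, taken only from the sizes observed in this report) are hypothetical. The page count and extracted text are passed in by the caller, who could obtain them from any PDF library, for example `len(PdfReader(path).pages)` in PyPDF2.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class PdfVerdict:
    ok: bool
    reason: Optional[str] = None


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so line breaks in extracted text
    # do not prevent a DOI/title match.
    return re.sub(r"\s+", " ", text).strip().lower()


def assess_pdf(pages: int, size_bytes: int, first_pages_text: str,
               doi: str, title: str = "",
               min_pages: int = 2, min_size: int = 300_000) -> PdfVerdict:
    """Hypothetical final check: likely full text, or merely a valid PDF?"""
    # Structural check: one-page or very small files are treated as suspicious.
    if pages < min_pages or size_bytes < min_size:
        return PdfVerdict(False, "one_page_or_too_small")

    # Identity check: only applied when text could be extracted at all
    # (scanned PDFs may yield an empty string).
    text = _normalize(first_pages_text)
    if text:
        doi_hit = bool(doi) and doi.lower() in text
        title_hit = bool(title) and _normalize(title) in text
        if not (doi_hit or title_hit):
            return PdfVerdict(False, "doi_or_title_not_found")

    return PdfVerdict(True)
```

Source-specific lower bounds would map onto per-source values for `min_pages` and `min_size`; the defaults here are placeholders and would need tuning against real samples.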

2. Represent suspicious PDFs as a separate status instead of success:

```json
{
  "success": false,
  "status": "suspicious_pdf",
  "reason": "one_page_or_too_small",
  "pages": 1,
  "size": 241849
}
```

3. If one source returns a suspicious PDF, continue trying the next source instead of stopping.
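One way the source loop could treat a suspicious PDF as a soft failure is sketched below. Everything here is an assumption for illustration: `download_with_fallback`, `looks_suspicious`, and the convention that each source is a `(name, fetch)` pair whose `fetch` returns `(pages, size_bytes)` or `None` do not come from the project.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical source shape: (name, fetch), where fetch(doi) returns
# (pages, size_bytes) for the saved file, or None on a hard failure.
Source = Tuple[str, Callable[[str], Optional[Tuple[int, int]]]]


def looks_suspicious(pages: int, size_bytes: int) -> bool:
    # Placeholder rule based on the pattern observed in this report.
    return pages < 2 or size_bytes < 300_000


def download_with_fallback(doi: str, sources: Iterable[Source]) -> dict:
    suspicious = None
    for name, fetch in sources:
        result = fetch(doi)
        if result is None:
            continue  # hard failure: try the next source
        pages, size = result
        if looks_suspicious(pages, size):
            # Remember the first suspicious hit, but keep trying other sources.
            if suspicious is None:
                suspicious = {
                    "success": False,
                    "status": "suspicious_pdf",
                    "reason": "one_page_or_too_small",
                    "source": name,
                    "pages": pages,
                    "size": size,
                }
            continue
        return {"success": True, "status": "success", "source": name,
                "pages": pages, "size": size}
    # No source produced a convincing PDF: report the suspicious one if any.
    return suspicious or {"success": False, "status": "failed"}
```

Returning the suspicious result only after every source has been tried keeps the file available for manual inspection without counting it as a success.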

4. Respect `use_tor=False` strictly, or add a separate option:

```toml
scihub_auto_tor_fallback = false
```

When disabled, "retrying via Tor" should not be logged and Tor should not be attempted.
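The gating could look roughly like the sketch below. It is not the real `try_scihub`: `try_scihub_sketch` and the injected `fetch_clearnet` / `fetch_tor` callables are hypothetical stand-ins, and `auto_tor_fallback` represents the proposed `scihub_auto_tor_fallback` option.

```python
import logging

log = logging.getLogger("scihub_sketch")


def try_scihub_sketch(doi, fetch_clearnet, fetch_tor,
                      use_tor=False, auto_tor_fallback=False):
    """Hypothetical flow; each fetcher takes a DOI and returns True on success."""
    if use_tor:
        # Tor was requested explicitly, so go straight to it.
        return fetch_tor(doi)

    if fetch_clearnet(doi):
        return True

    if not auto_tor_fallback:
        # The caller asked for clearnet only: stop here and say so plainly.
        log.info("Sci-Hub: all clearnet domains failed; "
                 "Tor fallback is disabled, not retrying.")
        return False

    log.info("Sci-Hub: all clearnet domains failed, retrying via Tor...")
    return fetch_tor(doi)
```

Defaulting `auto_tor_fallback` to `False` is a design choice in this sketch; the project could equally keep the current behavior as the default and let users opt out.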

5. Split WebVPN diagnostics into clearer cases:

- no saved cookies;
- saved cookies exist but validation failed;
- saved cookies exist but this DOI download failed;
- browser fallback unavailable;
- browser fallback failed.
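The five cases above could be modeled as an explicit enum so that each one gets its own message. This is only a sketch: `WebVpnFailure`, `classify_webvpn_failure`, the `browser_state` values, and the message wording are all hypothetical, not existing project names or logs.

```python
from enum import Enum


class WebVpnFailure(Enum):
    NO_COOKIES = "no_cookies"
    SESSION_INVALID = "session_invalid"
    DOI_DOWNLOAD_FAILED = "doi_download_failed"
    BROWSER_UNAVAILABLE = "browser_unavailable"
    BROWSER_FAILED = "browser_failed"


# Only the first two cases tell the user to log in again.
MESSAGES = {
    WebVpnFailure.NO_COOKIES:
        "[WebVPN] No saved cookies. Use vpnsci_login or carsi_login tool first.",
    WebVpnFailure.SESSION_INVALID:
        "[WebVPN] Saved cookies failed validation. Log in again.",
    WebVpnFailure.DOI_DOWNLOAD_FAILED:
        "[WebVPN] Session looks usable, but this DOI could not be downloaded.",
    WebVpnFailure.BROWSER_UNAVAILABLE:
        "[WebVPN] HTTP download failed and browser fallback is unavailable.",
    WebVpnFailure.BROWSER_FAILED:
        "[WebVPN] HTTP download and browser fallback both failed for this DOI.",
}


def classify_webvpn_failure(cookies_exist: bool, session_valid: bool,
                            browser_state: str = "not_tried") -> WebVpnFailure:
    """Classify a failed WebVPN attempt.

    browser_state is one of "not_tried", "unavailable", "failed".
    """
    if not cookies_exist:
        return WebVpnFailure.NO_COOKIES
    if not session_valid:
        return WebVpnFailure.SESSION_INVALID
    if browser_state == "unavailable":
        return WebVpnFailure.BROWSER_UNAVAILABLE
    if browser_state == "failed":
        return WebVpnFailure.BROWSER_FAILED
    return WebVpnFailure.DOI_DOWNLOAD_FAILED
```

A caller would then log `MESSAGES[classify_webvpn_failure(...)]` instead of a single catch-all line.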

## Suggested tests

- A valid one-page PDF is not counted as normal success.
- A 200-300 KB valid PDF enters `suspicious_pdf`.
- If source A returns a suspicious PDF, source B is still attempted.
- With `use_tor=False`, no Tor fallback is attempted and no "retrying via Tor" log is emitted.
- If cookies exist but the DOI download fails, logs do not say only "No valid session".
- Final reports separately count `success`, `suspicious_pdf`, and `failed`.
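For the last test, the report tally could be as simple as the sketch below. The `summarize` helper and the assumption that each result is a dict with a `status` key are hypothetical.

```python
from collections import Counter

STATUSES = ("success", "suspicious_pdf", "failed")


def summarize(results):
    """Count results per status; unknown or missing statuses count as failed."""
    counts = Counter(
        r.get("status") if r.get("status") in STATUSES else "failed"
        for r in results
    )
    # Always report all three keys, even when a count is zero.
    return {status: counts.get(status, 0) for status in STATUSES}
```

A test could feed it a mixed list and assert that a one-page PDF result lands under `suspicious_pdf` rather than `success`.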

I can help validate candidate changes with the reproduced one-page PDF samples and lawful institutional access.

