支持 SAGE Journals(10.1177)出版商路由与机构 PDF 下载 / Add SAGE Journals publisher routing and institutional PDF handling #5

@lwz20210407

中文版本

摘要

感谢作者在 v1.5.0 中提供了结构清晰的 publisher strategy registry 和独立 CARSI tier,这使合法机构访问来源的扩展变得非常自然。

建议增加对 SAGE Journals(DOI 前缀 10.1177/)的支持。目前 SAGE 文章不会被路由到 publisher browser strategy、WebVPN-specific PDF URL 构造或可选 CARSI publisher 配置。

复现信息

  • 版本:scansci-pdf 1.5.0(v1.5.0,commit b4e2c5630bbaf900e43ce49868c9951ce2c1474c)
  • 策略:legal_only
  • DOI:10.1177/14644207221121976
  • 已配置的合法来源:有效 WebVPN session,且已启用 CARSI

复现结果:

from scansci_pdf.sources.publishers import get_publisher, get_publisher_fast_sources

doi = "10.1177/14644207221121976"
print(get_publisher(doi))                       # ""
print([name for _, name in get_publisher_fast_sources(doi)])  # []

在当前网络解析中,该 DOI 会到达:

https://sage.cnpereading.com/doi/10.1177/14644207221121976

文章页可正常访问并显示 Restricted access,而未认证的 /doi/pdf/... 请求返回 HTML 而非 PDF。部分网络中标准 journals.sagepub.com 路径也可能不可访问,因此保留 DOI 解析后的 SAGE host 很重要。

v1.5.0 中的根因

  • src/scansci_pdf/sources/publishers.py:没有 10.1177/ -> SAGE 映射,也没有 SAGEBrowser tool registration。
  • src/scansci_pdf/sources/vpnsci.py:没有针对 journals.sagepub.com 或 sage.cnpereading.com 的 PDF URL 规则。
  • src/scansci_pdf/data/publisher_carsi.json:未注册 SAGE domain。如果 SAGE 平台提供可适配的机构认证流程,当前 CARSI routing 也不会尝试该 publisher。
  • generic browser 功能虽然存在,但对未注册的 SAGE DOI,正常 tier 构造不会选择该策略。

按当前架构建议的实现方式

  1. 注册 SAGE:
# sources/publishers.py
"10.1177/": "SAGE",

PUBLISHER_TOOL_MAP["SAGE"] = ["SAGEBrowser", "Crossref", "Unpaywall"]
  2. 在 publisher_strategies.py 中新增 try_sage_browser()。应首先解析 DOI 并保留实际返回的 SAGE host,然后在已认证浏览器上下文中测试 host-relative PDF candidates:
pdf_candidates = [
    f"/doi/pdf/{doi}?download=true",
    f"/doi/pdf/{doi}",
]

这样可同时支持 journals.sagepub.com 以及 sage.cnpereading.com 等区域 SAGE 前端,而不是仅硬编码一个 host。

  3. 为解析后的 SAGE host 增加 WebVPN PDF URL 构造:
# sources/vpnsci.py::_construct_publisher_pdf_url
elif "sagepub.com" in hostname or "sage.cnpereading.com" in hostname:
    base = f"{parsed.scheme}://{parsed.netloc}"
    return f"{base}/doi/pdf/{doi}?download=true"
  4. 确认 SAGE 当前 institution-login selector 后,增加 _PUBLISHER_SSO_CONFIG["SAGE"]。如果 SAGE 的机构登录确认为 CARSI/Shibboleth-compatible,则在 publisher_carsi.json 中加入可用 domain 并按已有 publisher 方式持久化 cookie。

  5. 在统一登录 aliases 中加入 sage,使 DOI-based login 与 README 推荐的工作流保持一致。

建议测试

  • get_publisher("10.1177/14644207221121976") == "SAGE"
  • get_publisher_fast_sources() 包含 SAGEBrowser
  • SAGE PDF URL 生成保留解析后的 hostname。
  • mock 显示 restricted access 的文章响应,再模拟认证后的 PDF 响应。
  • 同时测试 journals.sagepub.com 和 sage.cnpereading.com 两种解析 host。
  • 确认 legal_only 仅尝试合法 OA / institutional 路径。

如果后续有候选实现,我可以使用上述 DOI 和合法机构访问环境协助测试验证。


English Version

Summary

Thanks for the well-structured v1.5.0 release, especially the publisher strategy registry and standalone CARSI tier.

Could SAGE Journals support be added for DOI prefix 10.1177/? Currently SAGE articles are not routed through a publisher browser strategy, WebVPN-specific PDF URL construction, or optional CARSI publisher configuration.

Reproduction

  • Version: scansci-pdf 1.5.0 (v1.5.0, commit b4e2c5630bbaf900e43ce49868c9951ce2c1474c)
  • Strategy: legal_only
  • DOI: 10.1177/14644207221121976
  • Legal sources configured: valid WebVPN session and CARSI enabled

Observed:

from scansci_pdf.sources.publishers import get_publisher, get_publisher_fast_sources

doi = "10.1177/14644207221121976"
print(get_publisher(doi))                       # ""
print([name for _, name in get_publisher_fast_sources(doi)])  # []

For this DOI, resolution in my network reaches:

https://sage.cnpereading.com/doi/10.1177/14644207221121976

The article page is reachable and shows Restricted access, while unauthenticated /doi/pdf/... requests return HTML instead of a PDF. The standard journals.sagepub.com route can also be unavailable in some networks, so retaining the resolved SAGE host is important.
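
Because an unauthenticated PDF request comes back as HTML, a downloader probably should not trust the URL or the status code alone. Below is a minimal sketch of a content check based on the PDF magic bytes. `looks_like_pdf` is a hypothetical helper for illustration, not an existing scansci-pdf function, and the whitespace/BOM tolerance is an assumption about how lenient the check should be.

```python
def looks_like_pdf(body: bytes) -> bool:
    """Return True if the payload begins with the PDF header marker.

    Hypothetical helper: a restricted-access SAGE response is an HTML
    page, so it fails this check even when the request itself succeeds.
    """
    # Tolerate a UTF-8 BOM or leading whitespace before the header.
    head = body[:1024].lstrip(b"\xef\xbb\xbf \t\r\n")
    return head.startswith(b"%PDF-")
```

A caller could run this on the first bytes of each candidate response and move on to the next legal source when it returns False.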

Root cause in v1.5.0

  • src/scansci_pdf/sources/publishers.py: no 10.1177/ -> SAGE mapping or SAGEBrowser tool registration.
  • src/scansci_pdf/sources/vpnsci.py: no PDF URL rule for either journals.sagepub.com or sage.cnpereading.com.
  • src/scansci_pdf/data/publisher_carsi.json: SAGE domains are not registered, so CARSI routing is never attempted, even if the SAGE platform exposes an applicable institutional authentication workflow.
  • The generic browser functionality exists, but normal tier construction never selects it for an unregistered SAGE DOI.

Suggested implementation following the current design

  1. Register SAGE:
# sources/publishers.py
"10.1177/": "SAGE",

PUBLISHER_TOOL_MAP["SAGE"] = ["SAGEBrowser", "Crossref", "Unpaywall"]
  2. Add try_sage_browser() in publisher_strategies.py. Resolve the DOI first and preserve the resolved SAGE host, then try host-relative PDF candidates through the authenticated browser context:
pdf_candidates = [
    f"/doi/pdf/{doi}?download=true",
    f"/doi/pdf/{doi}",
]

This supports both journals.sagepub.com and regional SAGE front ends such as sage.cnpereading.com rather than hardcoding only one host.

  3. Add WebVPN PDF URL construction for resolved SAGE hosts:
# sources/vpnsci.py::_construct_publisher_pdf_url
elif "sagepub.com" in hostname or "sage.cnpereading.com" in hostname:
    base = f"{parsed.scheme}://{parsed.netloc}"
    return f"{base}/doi/pdf/{doi}?download=true"
  4. Add _PUBLISHER_SSO_CONFIG["SAGE"] after confirming SAGE's current institution-login selectors. If SAGE's institutional login is CARSI/Shibboleth-compatible, add a publisher_carsi.json entry covering both usable domains and persist its cookies like the supported publishers.

  5. Add sage to unified login aliases so DOI-based login matches the README's recommended workflow.
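
The host-preserving URL construction from the steps above could be sketched as a standalone function. This is only an illustration under stated assumptions: `is_sage_host` and `sage_pdf_candidates` are hypothetical names, and the suffix-based host check is a stricter variant of the substring test shown earlier, chosen here so that look-alike hostnames do not match. It may need adjusting if WebVPN rewrites hostnames before this point.

```python
from urllib.parse import urlsplit

SAGE_DOI_PREFIX = "10.1177/"
# Hosts named in this issue; the real allow-list is an assumption.
SAGE_HOST_SUFFIXES = ("sagepub.com", "sage.cnpereading.com")


def is_sage_host(hostname: str) -> bool:
    """Match a suffix exactly or as a parent domain (hypothetical helper)."""
    hostname = hostname.lower().rstrip(".")
    return any(
        hostname == suffix or hostname.endswith("." + suffix)
        for suffix in SAGE_HOST_SUFFIXES
    )


def sage_pdf_candidates(resolved_url: str, doi: str) -> list[str]:
    """Build PDF candidate URLs on the host the DOI actually resolved to."""
    parts = urlsplit(resolved_url)
    if not doi.startswith(SAGE_DOI_PREFIX):
        return []
    if not is_sage_host(parts.hostname or ""):
        return []
    # Keep the resolved scheme and netloc instead of hardcoding one host.
    base = f"{parts.scheme}://{parts.netloc}"
    return [
        f"{base}/doi/pdf/{doi}?download=true",
        f"{base}/doi/pdf/{doi}",
    ]
```

For the DOI in this issue, passing the resolved URL on sage.cnpereading.com yields candidates on that same host, while a non-SAGE host or a non-10.1177/ DOI yields an empty list.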

Suggested tests

  • get_publisher("10.1177/14644207221121976") == "SAGE".
  • get_publisher_fast_sources() includes SAGEBrowser.
  • SAGE PDF URL generation retains the resolved hostname.
  • Mock an article response with restricted access, then an authenticated PDF response.
  • Test both journals.sagepub.com and sage.cnpereading.com resolved hosts.
  • Ensure legal_only tries only legal OA/institutional paths.
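
The restricted-access and authenticated cases could be exercised without network access by injecting a fake fetcher. The sketch below assumes nothing about scansci-pdf internals: `first_pdf`, `make_fake_fetch`, and the test function are hypothetical stand-ins that only show the shape such a test might take.

```python
from typing import Callable, Iterable, Optional


def first_pdf(candidates: Iterable[str],
              fetch: Callable[[str], bytes]) -> Optional[bytes]:
    """Return the first candidate body that starts with the PDF header."""
    for url in candidates:
        body = fetch(url)
        if body.lstrip().startswith(b"%PDF-"):
            return body
    return None


def make_fake_fetch(authenticated: bool) -> Callable[[str], bytes]:
    """Fake fetcher: HTML when unauthenticated, PDF bytes when authenticated."""
    def fetch(url: str) -> bytes:
        if authenticated and "/doi/pdf/" in url:
            return b"%PDF-1.7\n%fake-pdf-body"
        return b"<html><body>Restricted access</body></html>"
    return fetch


def test_restricted_then_authenticated() -> None:
    doi = "10.1177/14644207221121976"
    for host in ("journals.sagepub.com", "sage.cnpereading.com"):
        candidates = [
            f"https://{host}/doi/pdf/{doi}?download=true",
            f"https://{host}/doi/pdf/{doi}",
        ]
        # Unauthenticated: every candidate returns HTML, so nothing is kept.
        assert first_pdf(candidates, make_fake_fetch(False)) is None
        # Authenticated: the first candidate already yields PDF bytes.
        pdf = first_pdf(candidates, make_fake_fetch(True))
        assert pdf is not None and pdf.startswith(b"%PDF-")
```

Injecting the fetcher keeps the test independent of any real session, so it can run in CI without institutional access.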

I can help validate a candidate implementation using lawful institutional access to the DOI above.
