English Version
Summary
Thanks for the well-structured v1.5.0 release, especially the publisher strategy registry and standalone CARSI tier.
Could SAGE Journals support be added for DOI prefix 10.1177/? Currently SAGE articles are not routed through a publisher browser strategy, WebVPN-specific PDF URL construction, or optional CARSI publisher configuration.
Reproduction
- Version: scansci-pdf 1.5.0 (v1.5.0, commit b4e2c5630bbaf900e43ce49868c9951ce2c1474c)
- Strategy: legal_only
- DOI: 10.1177/14644207221121976
- Legal sources configured: valid WebVPN session and CARSI enabled
Observed:
from scansci_pdf.sources.publishers import get_publisher, get_publisher_fast_sources
doi = "10.1177/14644207221121976"
print(get_publisher(doi)) # ""
print([name for _, name in get_publisher_fast_sources(doi)]) # []
For this DOI, resolution in my network reaches:
https://sage.cnpereading.com/doi/10.1177/14644207221121976
The article page is reachable and shows Restricted access, while unauthenticated /doi/pdf/... requests return HTML instead of a PDF. The standard journals.sagepub.com route can also be unavailable in some networks, so retaining the resolved SAGE host is important.
Root cause in v1.5.0
- src/scansci_pdf/sources/publishers.py: there is no 10.1177/ -> SAGE mapping and no SAGEBrowser tool registration.
- src/scansci_pdf/sources/vpnsci.py: there is no PDF URL rule for either journals.sagepub.com or sage.cnpereading.com.
- src/scansci_pdf/data/publisher_carsi.json: SAGE domains are not registered, so CARSI routing is never attempted, even if the SAGE platform exposes an applicable institutional authentication workflow.
- The generic browser functionality exists, but normal tier construction never selects it for an unregistered SAGE DOI.
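The routing gap described above can be pictured with a small stand-in for prefix-based lookup. This is only a sketch: `PREFIX_TO_PUBLISHER` and `lookup_publisher` are hypothetical names, and the real `get_publisher` may be structured differently. It shows how a DOI whose prefix is absent from the table falls through to the empty string seen in the reproduction, and how a single added entry changes the result.

```python
# Hypothetical stand-in for the prefix table in sources/publishers.py.
PREFIX_TO_PUBLISHER = {
    "10.1016/": "Elsevier",  # illustrative existing entry
}


def lookup_publisher(doi: str, table: dict[str, str] = PREFIX_TO_PUBLISHER) -> str:
    """Return the publisher name for a DOI, or "" when no prefix matches."""
    normalized = doi.strip().lower()
    for prefix, name in table.items():
        if normalized.startswith(prefix):
            return name
    return ""


doi = "10.1177/14644207221121976"
before = lookup_publisher(doi)  # no SAGE entry in the table
after = lookup_publisher(doi, {**PREFIX_TO_PUBLISHER, "10.1177/": "SAGE"})
print(repr(before), repr(after))
```

With no matching prefix, nothing downstream knows that a publisher-specific strategy should be tried, which is consistent with the empty fast-source list observed above.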
Suggested implementation following the current design
- Register SAGE:
# sources/publishers.py
"10.1177/": "SAGE",
PUBLISHER_TOOL_MAP["SAGE"] = ["SAGEBrowser", "Crossref", "Unpaywall"]
- Add try_sage_browser() in publisher_strategies.py. Resolve the DOI first and preserve the resolved SAGE host, then try host-relative PDF candidates through the authenticated browser context:
pdf_candidates = [
f"/doi/pdf/{doi}?download=true",
f"/doi/pdf/{doi}",
]
This supports both journals.sagepub.com and regional SAGE front ends such as sage.cnpereading.com rather than hardcoding only one host.
- Add WebVPN PDF URL construction for resolved SAGE hosts:
# sources/vpnsci.py::_construct_publisher_pdf_url
elif "sagepub.com" in hostname or "sage.cnpereading.com" in hostname:
    base = f"{parsed.scheme}://{parsed.netloc}"
    return f"{base}/doi/pdf/{doi}?download=true"
- Add _PUBLISHER_SSO_CONFIG["SAGE"] after confirming SAGE's current institution-login selectors. If SAGE's institutional login is CARSI/Shibboleth-compatible, add a publisher_carsi.json entry covering both usable domains and persist its cookies like the supported publishers.
- Add sage to unified login aliases so DOI-based login matches the README's recommended workflow.
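The host-preserving URL construction proposed in the list above could look roughly like the following. It is a sketch under assumptions, not the project's code: `SAGE_HOST_SUFFIXES`, `is_sage_host`, and `sage_pdf_candidates` are hypothetical names, and the host list contains only the two hosts named in this issue. Hostnames rewritten by a WebVPN gateway are not handled here.

```python
from urllib.parse import urlsplit

# Assumed host suffixes: only the two SAGE hosts mentioned in this issue.
SAGE_HOST_SUFFIXES = ("sagepub.com", "sage.cnpereading.com")


def is_sage_host(hostname: str) -> bool:
    """Match a host exactly or as a subdomain of a known SAGE suffix."""
    hostname = hostname.lower().rstrip(".")
    return any(
        hostname == suffix or hostname.endswith("." + suffix)
        for suffix in SAGE_HOST_SUFFIXES
    )


def sage_pdf_candidates(resolved_url: str, doi: str) -> list[str]:
    """Build PDF candidate URLs on the host that DOI resolution returned."""
    parts = urlsplit(resolved_url)
    if not parts.hostname or not is_sage_host(parts.hostname):
        return []
    base = f"{parts.scheme}://{parts.netloc}"
    return [
        f"{base}/doi/pdf/{doi}?download=true",
        f"{base}/doi/pdf/{doi}",
    ]


print(sage_pdf_candidates(
    "https://sage.cnpereading.com/doi/10.1177/14644207221121976",
    "10.1177/14644207221121976",
))
```

Suffix matching is used here instead of a substring test so that an unrelated host such as `notsagepub.com` is not treated as SAGE. Whether that stricter check suits the existing `_construct_publisher_pdf_url` conventions is for the maintainer to decide.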
Suggested tests
- get_publisher("10.1177/14644207221121976") == "SAGE".
- get_publisher_fast_sources() includes SAGEBrowser.
- SAGE PDF URL generation retains the resolved hostname.
- Mock an article response with restricted access, then an authenticated PDF response.
- Test both journals.sagepub.com and sage.cnpereading.com resolved hosts.
- Ensure legal_only tries only legal OA/institutional paths.
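The restricted-then-authenticated case in the list above could be mocked along these lines. This is a sketch with hypothetical names: `fetch_pdf` stands in for whatever download helper the project uses, and the session object is a plain `unittest.mock.Mock`, not the project's real client. The check relies on PDF payloads starting with the `%PDF-` marker, so an HTML "Restricted access" page returned with status 200 is not mistaken for a successful download.

```python
from unittest.mock import Mock


def fetch_pdf(session, url):
    """Hypothetical helper: return the body only if it looks like a PDF."""
    body = session.get(url).content
    return body if body.lstrip().startswith(b"%PDF-") else None


def test_restricted_then_authenticated():
    url = "https://sage.cnpereading.com/doi/pdf/10.1177/14644207221121976"
    restricted = Mock(status_code=200, content=b"<html>Restricted access</html>")
    pdf = Mock(status_code=200, content=b"%PDF-1.7\n")

    session = Mock()
    # First call simulates the unauthenticated page, second the PDF.
    session.get.side_effect = [restricted, pdf]

    assert fetch_pdf(session, url) is None
    assert fetch_pdf(session, url) == b"%PDF-1.7\n"
    assert session.get.call_count == 2
```

The same test body can be parametrized over both resolved hosts once a real implementation exists.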
I can help validate a candidate implementation using lawful institutional access to the DOI above.