Feature/external id scientif journals #130

Merged: lpi-tn merged 7 commits into main from Feature/external-id-scientif-journals on Apr 23, 2026
lpi-tn (Collaborator) commented Apr 22, 2026

This pull request introduces several improvements and refactorings to the document processing plugins, focusing on standardizing the handling of external identifiers (especially DOIs), improving exception handling, and refactoring the OpenAlex plugin for clarity and maintainability. It also updates dependencies and enhances tests to validate the new logic.

Refactoring and Standardization of External Identifiers:

  • The OpenAlex and PeerJ plugins now consistently extract and set the DOI as external_id and specify its type as ExternalIdType.DOI. The OpenAlex DOI extraction method strips the https://doi.org/ prefix if present.
  • The PLOS plugin's document detail extraction is refactored to use helper methods for extracting DOI, ISSN, and journal properties, improving code clarity and maintainability.
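
The prefix stripping described above can be sketched as follows. This is a minimal illustration, not the plugin's code: `strip_doi_prefix` and `DOI_URL_PREFIX` are hypothetical names, and the `ExternalIdType` enum here is a stand-in for the project's own type, whose other members and values are not shown in this PR.

```python
from enum import Enum
from typing import Optional

# Assumption: the only prefix handled is the one named in the PR description.
DOI_URL_PREFIX = "https://doi.org/"


class ExternalIdType(Enum):
    """Stand-in for the project's ExternalIdType; only DOI is sketched here."""

    DOI = "doi"


def strip_doi_prefix(raw_doi: Optional[str]) -> Optional[str]:
    """Return the bare DOI, removing the https://doi.org/ prefix if present."""
    if raw_doi is None:
        return None
    doi = raw_doi.strip()
    if doi.startswith(DOI_URL_PREFIX):
        doi = doi[len(DOI_URL_PREFIX):]
    # Treat an empty result as "no DOI" rather than storing an empty string.
    return doi or None


# Usage: a URL-form DOI and a bare DOI end up as the same external_id.
external_id = strip_doi_prefix("https://doi.org/10.1234/example")
external_id_type = ExternalIdType.DOI
```

Normalizing to the bare form means a document collected through OpenAlex and the same document scraped from PeerJ or PLOS would carry an identical external_id.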

Exception Handling Improvements:

  • Custom exceptions in welearn_datastack/exceptions.py now inherit from Exception instead of BaseException, aligning with Python best practices.
  • A new NoDOIFoundError exception is introduced to explicitly handle cases where a DOI cannot be found during scraping.
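
The practical effect of the base-class change is that a generic `except Exception` handler now catches these errors. The sketch below shows the difference; `NoDOIFoundError` is the name from this PR, but its body is assumed, and `LegacyScrapeError`, `extract_doi`, and `is_caught_by_except_exception` are hypothetical helpers used only for illustration.

```python
class NoDOIFoundError(Exception):
    """Raised when no DOI can be found while scraping (body assumed)."""


class LegacyScrapeError(BaseException):
    """Hypothetical example of the old style, deriving from BaseException."""


def extract_doi(metadata: dict) -> str:
    """Return the DOI from scraped metadata or raise NoDOIFoundError."""
    doi = metadata.get("doi")
    if not doi:
        raise NoDOIFoundError("No DOI found in scraped metadata")
    return doi


def is_caught_by_except_exception(error: BaseException) -> bool:
    """Report whether a plain `except Exception` handler catches `error`."""
    try:
        try:
            raise error
        except Exception:
            return True
    except BaseException:
        # Reached only for errors that bypass `except Exception`.
        return False


caught_new = is_caught_by_except_exception(NoDOIFoundError("missing"))
caught_old = is_caught_by_except_exception(LegacyScrapeError("missing"))
```

BaseException is intended for exits such as KeyboardInterrupt and SystemExit, so application errors that derive from it slip past ordinary handlers and can abort a batch run unexpectedly.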

OpenAlex Plugin Refactor:

  • The _update_welearn_document method is refactored for clarity: publisher authorization, access, and license checks are now performed in dedicated methods before document construction. Details construction, content resolution, and author extraction are modularized.
  • The _invert_abstract method now returns None if the input is None, improving robustness.
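
The None handling in _invert_abstract can be pictured with the sketch below. It assumes the input follows OpenAlex's abstract_inverted_index format (each word mapped to the list of positions where it occurs); the function body is an illustration, since the plugin's actual implementation is not shown in this conversation.

```python
from typing import Dict, List, Optional


def invert_abstract(
    inverted_index: Optional[Dict[str, List[int]]],
) -> Optional[str]:
    """Rebuild plain text from a word -> positions index, or return None."""
    if inverted_index is None:
        # Works without an abstract are passed through instead of raising.
        return None
    # Flatten to (position, word) pairs, then order them by position.
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    positioned.sort()
    return " ".join(word for _, word in positioned)


abstract = invert_abstract({"Hello": [0], "world": [1, 3], "big": [2]})
missing = invert_abstract(None)
```

Returning None rather than an empty string lets the caller distinguish "no abstract available" from "abstract present but empty" when resolving document content.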

Dependency and Typing Updates:

  • The python-dotenv dependency is updated from ^1.1.0 to ^1.2.1.
  • Type hints are improved throughout the OpenAlex plugin for better type safety and readability.

Test Enhancements:

  • Tests for OpenAlex and PeerJ plugins are updated to check that the correct external_id and external_id_type are set on documents, ensuring the new logic is validated.
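
A test of this kind might look like the following pytest-style sketch. The `Document` dataclass, the `ExternalIdType` enum, and `fill_external_id` are simplified stand-ins invented for the example; the real tests exercise the plugins and the project's own document model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ExternalIdType(Enum):
    """Stand-in for the project's enum; only DOI is sketched."""

    DOI = "doi"


@dataclass
class Document:
    """Hypothetical, reduced document model with the two fields under test."""

    url: str
    external_id: Optional[str] = None
    external_id_type: Optional[ExternalIdType] = None


def fill_external_id(document: Document, raw_doi: str) -> Document:
    """Hypothetical helper: store the bare DOI and mark its type."""
    prefix = "https://doi.org/"
    bare = raw_doi[len(prefix):] if raw_doi.startswith(prefix) else raw_doi
    document.external_id = bare
    document.external_id_type = ExternalIdType.DOI
    return document


def test_doi_is_normalized_and_typed() -> None:
    doc = fill_external_id(
        Document(url="https://example.org/article"),
        "https://doi.org/10.1234/example",
    )
    # Both fields are asserted so a missing type is caught as well.
    assert doc.external_id == "10.1234/example"
    assert doc.external_id_type is ExternalIdType.DOI
```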

Refactoring and Standardization:

  • OpenAlex, PeerJ, and PLOS plugins now consistently extract and set the DOI as external_id with the correct type, and refactor extraction logic for maintainability.
  • OpenAlex plugin modularizes publisher, access, and license checks, and document detail construction.
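
One way to picture the "checks first, construction second" structure is the sketch below. Every name in it (`Work`, `WorkRejected`, the allow-lists, and the check functions) is hypothetical and chosen for illustration; it is not the plugin's API, only the general shape of dedicated guard methods that run before a document is built.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: illustrative allow-lists, not the project's real configuration.
ALLOWED_PUBLISHERS = {"example-publisher"}
ALLOWED_LICENSES = {"cc-by", "cc-by-sa"}


class WorkRejected(Exception):
    """Hypothetical error raised when a work fails one of the checks."""


@dataclass
class Work:
    publisher: str
    is_open_access: bool
    license: Optional[str]


def check_publisher(work: Work) -> None:
    if work.publisher not in ALLOWED_PUBLISHERS:
        raise WorkRejected(f"publisher not authorized: {work.publisher}")


def check_access(work: Work) -> None:
    if not work.is_open_access:
        raise WorkRejected("work is not open access")


def check_license(work: Work) -> None:
    if work.license not in ALLOWED_LICENSES:
        raise WorkRejected(f"license not accepted: {work.license}")


def build_document(work: Work) -> dict:
    """Run every guard, then construct the document from a valid work."""
    check_publisher(work)
    check_access(work)
    check_license(work)
    return {"publisher": work.publisher, "license": work.license}
```

Keeping each check in its own function means a rejection reason can be tested in isolation, and the construction step only ever sees works that passed all three.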

Exception Handling:

  • Custom exceptions now inherit from Exception instead of BaseException.
  • Added NoDOIFoundError for missing DOI cases.

Dependency and Typing:

  • Updated python-dotenv dependency version.
  • Improved type annotations in OpenAlex plugin.

Testing:

  • Enhanced tests to validate external_id and external_id_type extraction and assignment.

Copilot AI (Contributor) left a comment

Pull request overview

This PR standardizes how the scientific journal scrapers and REST collectors populate external_id (DOI) and external_id_type, refactors the OpenAlex plugin so that each step has a clearer responsibility, updates the tests to validate the new behavior, and bumps a dependency.

Changes:

  • Standardize DOI extraction and assignment to document.external_id with ExternalIdType.DOI across OpenAlex, PeerJ, and PLOS.
  • Refactor OpenAlex document construction (publisher/access/license checks, details/content/author building).
  • Add NoDOIFoundError, adjust exception inheritance, and extend tests to assert external ID fields.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| welearn_datastack/plugins/scrapers/plos.py | Refactors DOI/journal/ISSN extraction helpers and sets external_id / external_id_type. |
| welearn_datastack/plugins/scrapers/peerj.py | Sets DOI as external_id, adds explicit missing-DOI exception, and sets external_id_type. |
| welearn_datastack/plugins/rest_requesters/open_alex.py | Major refactor to modularize checks/build steps and normalize DOI handling for external_id. |
| welearn_datastack/exceptions.py | Switches custom exceptions to inherit from Exception and adds NoDOIFoundError. |
| tests/document_collector_hub/plugins_test/test_scraping_peerj.py | Adds assertions for DOI external_id and external_id_type. |
| tests/document_collector_hub/plugins_test/test_open_alex.py | Adjusts DOI input format and asserts normalized DOI in external_id/type. |
| pyproject.toml | Bumps python-dotenv version constraint. |

Comment thread welearn_datastack/plugins/rest_requesters/open_alex.py Outdated
Comment thread welearn_datastack/plugins/rest_requesters/open_alex.py
Comment thread welearn_datastack/plugins/rest_requesters/open_alex.py
Comment thread welearn_datastack/plugins/rest_requesters/open_alex.py
Comment thread welearn_datastack/plugins/rest_requesters/open_alex.py
Comment thread welearn_datastack/plugins/rest_requesters/open_alex.py
Comment thread welearn_datastack/plugins/scrapers/plos.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@lpi-tn lpi-tn merged commit aad9f07 into main Apr 23, 2026