Feature/external id scientif journals by lpi-tn · Pull Request #130 · CyberCRI/welearn-datastack

lpi-tn · 2026-04-22T15:47:38Z

This pull request introduces several improvements and refactorings to the document processing plugins, focusing on standardizing the handling of external identifiers (especially DOIs), improving exception handling, and refactoring the OpenAlex plugin for clarity and maintainability. It also updates dependencies and enhances tests to validate the new logic.

Refactoring and Standardization of External Identifiers:

The OpenAlex and PeerJ plugins now consistently extract and set the DOI as external_id and specify its type as ExternalIdType.DOI. The OpenAlex DOI extraction method strips the https://doi.org/ prefix if present. [1] [2] [3] [4] [5]
The Plos plugin's document detail extraction is refactored to use helper methods for extracting DOI, ISSN, and journal properties, improving code clarity and maintainability. [1] [2]
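The prefix stripping described for the OpenAlex plugin can be illustrated with a minimal sketch. The function name `strip_doi_prefix` and the constant `DOI_URL_PREFIX` are hypothetical and chosen for illustration; only the `https://doi.org/` prefix comes from the description above, and the plugin's actual method may differ.

```python
from typing import Optional

# Resolver prefix named in the PR description.
DOI_URL_PREFIX = "https://doi.org/"


def strip_doi_prefix(raw_doi: Optional[str]) -> Optional[str]:
    """Return the bare DOI, removing the resolver prefix if present.

    Hypothetical helper for illustration only; not the plugin's real code.
    """
    if raw_doi is None:
        return None
    doi = raw_doi.strip()
    if doi.startswith(DOI_URL_PREFIX):
        doi = doi[len(DOI_URL_PREFIX):]
    # Treat an empty remainder as "no DOI" (an assumption of this sketch).
    return doi or None
```

With this sketch, both `https://doi.org/10.1234/abc` and `10.1234/abc` normalize to the same value, so documents collected from different sources can be compared on `external_id`.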

Exception Handling Improvements:

Custom exceptions in welearn_datastack/exceptions.py now inherit from Exception instead of BaseException, aligning with Python best practices.
A new NoDOIFoundError exception is introduced to explicitly handle cases where a DOI cannot be found during scraping. [1] [2]
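The practical effect of the inheritance change can be sketched as follows. `NoDOIFoundError` here is a stand-in with an empty body (the real class in `welearn_datastack/exceptions.py` may carry more), and `LegacyScrapeError`, `run_safely`, `raise_no_doi`, and `raise_legacy` are hypothetical names used only to contrast the two base classes.

```python
class NoDOIFoundError(Exception):
    """Stand-in for the exception named in this PR; the real body may differ."""


class LegacyScrapeError(BaseException):
    """Hypothetical exception showing the old inheritance pattern."""


def run_safely(task):
    """Run a task and turn ordinary errors into a status string."""
    try:
        task()
        return "ok"
    except Exception as exc:
        # Subclasses of Exception land here; classes that derive only
        # from BaseException do not and propagate to the caller.
        return f"handled: {type(exc).__name__}"


def raise_no_doi():
    raise NoDOIFoundError("no DOI found on page")


def raise_legacy():
    raise LegacyScrapeError("escapes generic handlers")
```

Calling `run_safely(raise_no_doi)` returns a status string, whereas `run_safely(raise_legacy)` lets the error escape, which is the behaviour that deriving from `Exception` avoids.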

OpenAlex Plugin Refactor:

The _update_welearn_document method is refactored for clarity: publisher authorization, access, and license checks are now performed in dedicated methods before document construction. Details construction, content resolution, and author extraction are modularized. [1] [2]
The _invert_abstract method now returns None if the input is None, improving robustness.
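The `None` handling for abstracts can be sketched as below. This assumes the OpenAlex `abstract_inverted_index` format, a mapping from each word to the list of positions where it occurs; the function name `invert_abstract` and its body are an illustration, not the plugin's actual `_invert_abstract` implementation, which is not shown in this conversation.

```python
from typing import Dict, List, Optional


def invert_abstract(
    inverted_index: Optional[Dict[str, List[int]]],
) -> Optional[str]:
    """Rebuild plain text from a word -> positions mapping.

    Returns None when no index is provided, mirroring the behaviour
    described in the PR. Illustrative sketch only.
    """
    if inverted_index is None:
        return None
    # Flatten to (position, word) pairs, then order by position.
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    return " ".join(word for _, word in sorted(positioned))
```

Returning `None` early lets callers distinguish "no abstract available" from an empty abstract without wrapping the call in a try/except.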

Dependency and Typing Updates:

The python-dotenv dependency is updated from ^1.1.0 to ^1.2.1.
Type hints are improved throughout the OpenAlex plugin for better type safety and readability. [1] [2]

Test Enhancements:

Tests for OpenAlex and PeerJ plugins are updated to check that the correct external_id and external_id_type are set on documents, ensuring the new logic is validated. [1] [2]
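The kind of assertion these tests add might look like the sketch below. `FakeDocument` and `attach_doi` are hypothetical stand-ins for the project's document model and plugin logic, and the enum value `"doi"` is an assumption; only the names `external_id`, `external_id_type`, and `ExternalIdType.DOI` come from the PR description.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ExternalIdType(Enum):
    """Stand-in enum; the member value is an assumption."""

    DOI = "doi"


@dataclass
class FakeDocument:
    """Hypothetical minimal document, not the project's real model."""

    url: str
    external_id: Optional[str] = None
    external_id_type: Optional[ExternalIdType] = None


def attach_doi(document: FakeDocument, doi: str) -> FakeDocument:
    """Set the DOI and its type together so they cannot drift apart."""
    document.external_id = doi
    document.external_id_type = ExternalIdType.DOI
    return document


def test_external_id_is_doi() -> None:
    doc = attach_doi(
        FakeDocument(url="https://example.org/article"), "10.1234/example"
    )
    assert doc.external_id == "10.1234/example"
    assert doc.external_id_type is ExternalIdType.DOI
```

Asserting both fields in the same test guards against a plugin setting the identifier while leaving the type unset.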

Refactoring and Standardization:

OpenAlex, PeerJ, and Plos plugins now consistently extract and set the DOI as external_id with the correct type, and refactor extraction logic for maintainability. [1] [2] [3] [4] [5]
OpenAlex plugin modularizes publisher, access, and license checks, and document detail construction. [1] [2]

Exception Handling:

Custom exceptions now inherit from Exception instead of BaseException.
Added NoDOIFoundError for missing DOI cases. [1] [2]

Dependency and Typing:

Updated python-dotenv dependency version.
Improved type annotations in OpenAlex plugin. [1] [2]

Testing:

Enhanced tests to validate external_id and external_id_type extraction and assignment. [1] [2]

Copilot

Pull request overview

This PR standardizes how scientific journal scrapers/rest collectors populate external_id (DOI) and external_id_type, refactors the OpenAlex plugin for clearer responsibilities, and updates tests/dependencies to validate the new behavior.

Changes:

Standardize DOI extraction and assignment to document.external_id with ExternalIdType.DOI across OpenAlex, PeerJ, and PLOS.
Refactor OpenAlex document construction (publisher/access/license checks, details/content/author building).
Add NoDOIFoundError, adjust exception inheritance, and extend tests to assert external ID fields.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

File	Description
welearn_datastack/plugins/scrapers/plos.py	Refactors DOI/journal/ISSN extraction helpers and sets `external_id` / `external_id_type`.
welearn_datastack/plugins/scrapers/peerj.py	Sets DOI as `external_id`, adds explicit missing-DOI exception, and sets `external_id_type`.
welearn_datastack/plugins/rest_requesters/open_alex.py	Major refactor to modularize checks/build steps and normalize DOI handling for `external_id`.
welearn_datastack/exceptions.py	Switches custom exceptions to inherit from `Exception` and adds `NoDOIFoundError`.
tests/document_collector_hub/plugins_test/test_scraping_peerj.py	Adds assertions for DOI `external_id` and `external_id_type`.
tests/document_collector_hub/plugins_test/test_open_alex.py	Adjusts DOI input format and asserts normalized DOI in `external_id`/type.
pyproject.toml	Bumps `python-dotenv` version constraint.

lpi-tn added 5 commits April 22, 2026 16:20

feat: add DOI extraction and refactor related code in plos.py

18cc720

feat: enhance OpenAlex document processing with detailed metadata extraction and validation

74a8a2b

bump version

a07ae71

feat: update exception handling for version numbers and add DOI validation

ceb20e1

feat: add external ID and type assertions in PeerJ document tests

a462731

lpi-tn requested review from Copilot, jmsevin and sandragjacinto April 22, 2026 15:47

Merge branch 'main' into Feature/external-id-scientif-journals

bdf7ae2

Copilot started reviewing on behalf of lpi-tn April 23, 2026 09:10

jmsevin approved these changes Apr 23, 2026

Copilot AI reviewed Apr 23, 2026

Update welearn_datastack/plugins/rest_requesters/open_alex.py

f65ea16

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

lpi-tn merged commit aad9f07 into main Apr 23, 2026

lpi-tn merged 7 commits into main from Feature/external-id-scientif-journals