refactor: consolidate regular expressions into a dedicated module and…#129

Merged
lpi-tn merged 2 commits into main from Refacto/regex-in-one-file on Apr 28, 2026
Conversation

lpi-tn (Collaborator) commented Apr 22, 2026

This pull request introduces a comprehensive suite of unit tests for regular expressions and refactors the codebase to standardize regex usage by centralizing all regular expression patterns in the welearn_datastack.regular_expression module. The changes improve test coverage, code maintainability, and consistency across the codebase by replacing inline regex patterns with named constants. Additionally, some minor refactoring and cleanup are performed for clarity and efficiency.

Testing improvements:

  • Added a new test file tests/test_regular_expressions.py containing extensive unit tests for all regular expressions used in the codebase, ensuring correctness and robustness of regex-based text processing.
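
As a rough illustration of what one such test could look like, here is a minimal sketch. The constant name LANG_CODE_IN_URL_REGEX appears in this PR, but the pattern below is an assumption made up for the example, and the choice of unittest is only for self-containment (the PR does not say which test runner the project uses).

```python
import re
import unittest

# Constant name taken from the PR; the pattern itself is an ASSUMPTION
# for illustration, not the one defined in welearn_datastack.
LANG_CODE_IN_URL_REGEX = re.compile(r"^https?://([a-z]{2,3})\.wikipedia\.org")


class TestLangCodeInUrlRegex(unittest.TestCase):
    def test_extracts_language_code(self):
        match = LANG_CODE_IN_URL_REGEX.search("https://fr.wikipedia.org/wiki/Paris")
        self.assertIsNotNone(match)
        self.assertEqual(match.group(1), "fr")

    def test_rejects_url_without_language_subdomain(self):
        match = LANG_CODE_IN_URL_REGEX.search("https://example.org/wiki/Paris")
        self.assertIsNone(match)
```

Saved as a test module, this can be run with `python -m unittest`. Covering one matching and one non-matching input per pattern keeps each test small while still catching accidental changes to a shared constant.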

Regex usage standardization:

  • Replaced inline regular expression patterns with named constants imported from welearn_datastack.regular_expression throughout the codebase, including in modules for computed metadata, embedding model helpers, XML extraction, PDF extraction, plugin interfaces, and REST requesters.
  • Updated all regex-based string cleaning, splitting, and matching operations to use the centralized regex constants, improving consistency and maintainability.
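
The pattern described above can be sketched as follows. The names BACKLINES_REGEX and clean_return_to_line both appear in this PR, but the regex and the function body here are assumptions for illustration; the real definitions live in the repository and may differ.

```python
import re

# Stand-in for a constant in welearn_datastack/regular_expression.py.
# The name comes from the PR; the pattern is an ASSUMPTION.
BACKLINES_REGEX = re.compile(r"[\r\n]+")


def clean_return_to_line(text: str) -> str:
    """Replace every run of line breaks with one space (assumed behavior)."""
    # Call sites use the precompiled, named constant instead of an inline
    # re.sub(r"[\r\n]+", ...) literal repeated in each module.
    return BACKLINES_REGEX.sub(" ", text).strip()


print(clean_return_to_line("first line\r\n\r\nsecond line\n"))
# prints: first line second line
```

With the pattern defined once, a fix to the regex reaches every call site, and the constant can be tested in isolation.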

Code cleanup and minor refactoring:

  • Removed unused or redundant code, such as the _check_research method in oe_books_collector.py and unnecessary imports in several modules.
  • Simplified logic for removing trailing slashes from URLs and other minor code improvements.
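
For the trailing-slash simplification, plain string methods are enough. The sketch below is not the code from pdf_extractor.py; the function names are hypothetical and only show two regex-free options that behave slightly differently.

```python
def strip_all_trailing_slashes(url: str) -> str:
    """Remove every trailing slash (hypothetical helper)."""
    return url.rstrip("/")


def strip_one_trailing_slash(url: str) -> str:
    """Remove at most one trailing slash; needs Python 3.9+ (hypothetical helper)."""
    return url.removesuffix("/")


print(strip_all_trailing_slashes("https://example.org/docs//"))
# prints: https://example.org/docs
print(strip_one_trailing_slash("https://example.org/docs//"))
# prints: https://example.org/docs/
```

Which variant fits depends on whether repeated slashes should be collapsed; the PR does not say which behavior the module needs.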

Constants update:

  • Removed the ANTI_URL_REGEX definition from constants.py; it is now provided by the welearn_datastack.regular_expression module.

Copilot AI (Contributor) left a comment

Pull request overview

This PR centralizes commonly used regular expressions into a new welearn_datastack.regular_expression module and updates call sites to use those shared constants, alongside adding unit tests to validate regex behavior.

Changes:

  • Added welearn_datastack/regular_expression.py to host named regex constants and a helper for XML tag matching.
  • Replaced scattered inline regex patterns across modules/plugins with imports from the centralized regex module.
  • Added tests/test_regular_expressions.py with extensive unit coverage for the shared regex patterns.
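
A helper for XML tag matching could look roughly like the sketch below. The function name build_xml_tag_regex and its pattern are hypothetical; the PR only states that such a helper exists, not how it is written.

```python
import re


def build_xml_tag_regex(tag: str) -> re.Pattern:
    """Return a pattern capturing the text between <tag ...> and </tag>.

    Hypothetical helper: the name and the pattern are assumptions.
    """
    name = re.escape(tag)  # guard against regex metacharacters in the tag name
    # DOTALL lets the captured content span several lines.
    return re.compile(rf"<{name}(?:\s[^>]*)?>(.*?)</{name}>", re.DOTALL)


xml = '<title lang="en">Regex in one file</title><body>text</body>'
print(build_xml_tag_regex("title").findall(xml))
# prints: ['Regex in one file']
```

A regex like this does not handle nested elements of the same name, so it only suits flat, predictable markup; a real XML parser is the safer choice for anything more complex.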

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| welearn_datastack/utils_/scraping_utils.py | Uses centralized BACKLINES_REGEX for string cleanup. |
| welearn_datastack/regular_expression.py | Introduces shared regex constants and an XML tag regex helper. |
| welearn_datastack/plugins/scrapers/unccelearn.py | Removes local string-cleaning helper and related unused imports. |
| welearn_datastack/plugins/scrapers/plos.py | Switches ANTI_URL_REGEX import from constants to regex module. |
| welearn_datastack/plugins/scrapers/ird_le_mag.py | Reuses shared clean_return_to_line instead of a local regex cleaner. |
| welearn_datastack/plugins/scrapers/conversation.py | Replaces inline quoted-word regex with centralized constant; uses shared cleaner. |
| welearn_datastack/plugins/rest_requesters/wikipedia.py | Uses centralized LANG_CODE_IN_URL_REGEX for language extraction. |
| welearn_datastack/plugins/rest_requesters/oapen.py | Uses centralized whitespace/newline-related regex constants for text cleanup. |
| welearn_datastack/plugins/interface.py | Uses centralized BACKLINES_REGEX and removes unused imports. |
| welearn_datastack/nodes_workflow/UpdateMaterializedView/update_materialized_view.py | Uses centralized view-name validation regex. |
| welearn_datastack/modules/xml_extractor.py | Replaces inline XML regex with centralized constants/helper. |
| welearn_datastack/modules/pdf_extractor.py | Simplifies trailing-slash removal without regex. |
| welearn_datastack/modules/embedding_model_helpers.py | Uses centralized whitespace/newline regex constants when normalizing text. |
| welearn_datastack/modules/computed_metadata.py | Uses centralized regex constants for punctuation removal, sentence/word counting. |
| welearn_datastack/constants.py | Removes ANTI_URL_REGEX now provided by regex module. |
| welearn_datastack/collectors/oe_books_collector.py | Removes unused _check_research helper. |
| tests/test_regular_expressions.py | Adds unit tests validating each centralized regex pattern/helper. |
| poetry.lock | Lockfile updated/regenerated as part of the change set. |
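
For the view-name validation entry, the usual reason to validate an identifier with a regex is that it ends up interpolated into a SQL statement. The sketch below assumes that is the case here; VIEW_NAME_REGEX, its pattern, and build_refresh_statement are hypothetical names, not taken from the repository.

```python
import re

# HYPOTHETICAL constant: the PR mentions a view-name validation regex
# but gives neither its name nor its pattern.
VIEW_NAME_REGEX = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_refresh_statement(view_name: str) -> str:
    """Build a REFRESH statement only for a plain SQL identifier."""
    # fullmatch rejects names with spaces, quotes, semicolons, and so on.
    if VIEW_NAME_REGEX.fullmatch(view_name) is None:
        raise ValueError(f"Invalid view name: {view_name!r}")
    return f"REFRESH MATERIALIZED VIEW {view_name};"


print(build_refresh_statement("qty_document_in_qdrant"))
# prints: REFRESH MATERIALIZED VIEW qty_document_in_qdrant;
```

Using fullmatch rather than search matters here: search would accept any string that merely contains a valid identifier somewhere inside it.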

Comment thread welearn_datastack/regular_expression.py
lpi-tn merged commit 25f2530 into main on Apr 28, 2026
6 of 7 checks passed
lpi-tn deleted the Refacto/regex-in-one-file branch on April 28, 2026 13:50
3 participants