refactor: consolidate regular expressions into a dedicated module and update usages across the codebase #129
Merged
Pull request overview
This PR centralizes commonly used regular expressions into a new welearn_datastack.regular_expression module and updates call sites to use those shared constants, alongside adding unit tests to validate regex behavior.
Changes:
- Added `welearn_datastack/regular_expression.py` to host named regex constants and a helper for XML tag matching.
- Replaced scattered inline regex patterns across modules/plugins with imports from the centralized regex module.
- Added `tests/test_regular_expressions.py` with extensive unit coverage for the shared regex patterns.
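
As a rough illustration of the centralization described above, here is a minimal sketch of what such a module could look like. The names `BACKLINES_REGEX`, `LANG_CODE_IN_URL_REGEX`, and `clean_return_to_line` appear in this PR, but every pattern body, the helper name `xml_tag_regex`, and all function behavior below are assumptions made for illustration, not the project's actual code.

```python
import re

# Constant names are taken from the PR; the pattern bodies are ASSUMED for illustration.
BACKLINES_REGEX = re.compile(r"[\r\n]+")
LANG_CODE_IN_URL_REGEX = re.compile(r"https?://([a-z]{2,3})\.wikipedia\.org")


def xml_tag_regex(tag: str) -> re.Pattern:
    """Hypothetical helper: match <tag ...>content</tag> and capture the content."""
    name = re.escape(tag)
    return re.compile(rf"<{name}(?:\s[^>]*)?>(.*?)</{name}>", re.DOTALL)


def clean_return_to_line(text: str) -> str:
    """Assumed behavior: collapse runs of line breaks into a single space."""
    return BACKLINES_REGEX.sub(" ", text).strip()
```

Call sites would then import the constants rather than redefining patterns inline, so a pattern fix only has to be made once.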
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| welearn_datastack/utils_/scraping_utils.py | Uses centralized BACKLINES_REGEX for string cleanup. |
| welearn_datastack/regular_expression.py | Introduces shared regex constants and an XML tag regex helper. |
| welearn_datastack/plugins/scrapers/unccelearn.py | Removes local string-cleaning helper and related unused imports. |
| welearn_datastack/plugins/scrapers/plos.py | Switches ANTI_URL_REGEX import from constants to regex module. |
| welearn_datastack/plugins/scrapers/ird_le_mag.py | Reuses shared clean_return_to_line instead of a local regex cleaner. |
| welearn_datastack/plugins/scrapers/conversation.py | Replaces inline quoted-word regex with centralized constant; uses shared cleaner. |
| welearn_datastack/plugins/rest_requesters/wikipedia.py | Uses centralized LANG_CODE_IN_URL_REGEX for language extraction. |
| welearn_datastack/plugins/rest_requesters/oapen.py | Uses centralized whitespace/newline-related regex constants for text cleanup. |
| welearn_datastack/plugins/interface.py | Uses centralized BACKLINES_REGEX and removes unused imports. |
| welearn_datastack/nodes_workflow/UpdateMaterializedView/update_materialized_view.py | Uses centralized view-name validation regex. |
| welearn_datastack/modules/xml_extractor.py | Replaces inline XML regex with centralized constants/helper. |
| welearn_datastack/modules/pdf_extractor.py | Simplifies trailing-slash removal without regex. |
| welearn_datastack/modules/embedding_model_helpers.py | Uses centralized whitespace/newline regex constants when normalizing text. |
| welearn_datastack/modules/computed_metadata.py | Uses centralized regex constants for punctuation removal, sentence/word counting. |
| welearn_datastack/constants.py | Removes ANTI_URL_REGEX now provided by regex module. |
| welearn_datastack/collectors/oe_books_collector.py | Removes unused _check_research helper. |
| tests/test_regular_expressions.py | Adds unit tests validating each centralized regex pattern/helper. |
| poetry.lock | Lockfile updated/regenerated as part of the change set. |
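
The `pdf_extractor.py` row mentions removing a trailing slash without a regex. The diff itself is not shown here, so the following is only a sketch of one common way to do that, using `str.rstrip`; the function names are hypothetical.

```python
import re


def strip_trailing_slashes_regex(url: str) -> str:
    """Regex-based version, shown only for comparison."""
    return re.sub(r"/+$", "", url)


def strip_trailing_slashes(url: str) -> str:
    """Plain string method: removes every trailing '/' without compiling a pattern."""
    return url.rstrip("/")
```

For ordinary URLs the two functions return the same result, and the string method avoids one more pattern to maintain and test.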
jmsevin approved these changes on Apr 22, 2026.
This pull request introduces a suite of unit tests for regular expressions and centralizes all regular expression patterns in the `welearn_datastack.regular_expression` module. Replacing inline regex patterns with named constants improves test coverage, maintainability, and consistency across the codebase. Some minor refactoring and cleanup are also included.
Testing improvements:
- Added `tests/test_regular_expressions.py`, containing unit tests for the regular expressions used in the codebase, to check the correctness of regex-based text processing.
Regex usage standardization:
- Replaced inline regex patterns with named constants from `welearn_datastack.regular_expression` throughout the codebase, including in modules for computed metadata, embedding model helpers, XML extraction, PDF extraction, plugin interfaces, and REST requesters.
Code cleanup and minor refactoring:
- Removed the unused `_check_research` method in `oe_books_collector.py` and unnecessary imports in several modules.
Constants update:
- Removed the `ANTI_URL_REGEX` definition from `constants.py`, as it is now provided via the regular expression module.
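
To give a feel for the kind of test the new test file might contain, here is a minimal sketch for a view-name validation pattern. The constant name `VIEW_NAME_REGEX`, its pattern, and the sample inputs are all assumptions; the real names and cases live in `tests/test_regular_expressions.py` and are not reproduced in this conversation.

```python
import re
import unittest

# HYPOTHETICAL constant and pattern: an identifier-like view name.
VIEW_NAME_REGEX = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


class TestViewNameRegex(unittest.TestCase):
    def test_accepts_identifier_like_names(self):
        for name in ("document_stats", "_tmp", "view2"):
            with self.subTest(name=name):
                self.assertIsNotNone(VIEW_NAME_REGEX.fullmatch(name))

    def test_rejects_unsafe_or_malformed_names(self):
        for name in ("", "2view", "my-view", "docs; DROP TABLE x"):
            with self.subTest(name=name):
                self.assertIsNone(VIEW_NAME_REGEX.fullmatch(name))
```

Saved as a test file, this can be run with `python -m unittest`. Using `fullmatch` rather than `search` matters for a validation pattern, since a partial match would let trailing characters through.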