Tweaked Celery categories and reported errors#2825
Conversation
…or caught by the Celery layer.
There was a problem hiding this comment.
Pull request overview
This PR updates Celery error classification to better differentiate timeout-related failures and adjusts the associated unit tests so log-derived metrics can distinguish notification timeouts from downstream client timeouts.
Changes:
- Split the prior timeout classification into
TIMEOUT_NOTIFICATIONvsTIMEOUT_CLIENT. - Add new exception-class mappings for common client timeout-related exception names (
TimeoutError,ProtocolError,MaxRetryError). - Remove the duplicate-record classification category and its unit tests.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
app/celery/error_registry.py |
Updates CeleryErrorCategory and the class/message matching maps used by classify_error to support new timeout categories and remove duplicate-record classification. |
tests/app/celery/test_celery_error_classification.py |
Updates/extends classification tests to cover the new timeout categories and removes duplicate-record assertions. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Note: Order within the map does not matter; the deepest/root exception in the | ||
| # chain takes precedence over wrapper exceptions. |
There was a problem hiding this comment.
The comment saying “Order within the map does not matter” is inaccurate for _EXCEPTION_CLASS_MAP / _MESSAGE_SUBSTRING_MAP. classify_error iterates these dicts in insertion order and returns the first matching pattern, so the order directly affects which category wins when multiple patterns match the same exception/message (and the tests rely on this behavior). Please update the comment to reflect that ordering is part of the precedence rules (or change the data structure/logic to make precedence explicit).
| # Note: Order within the map does not matter; the deepest/root exception in the | |
| # chain takes precedence over wrapper exceptions. | |
| # Note: The deepest/root exception in the chain takes precedence over wrapper | |
| # exceptions, and within each map the insertion order defines which category | |
| # wins when multiple patterns match the same exception class name or message. |
| # Clients that timed out | ||
| TIMEOUT_CLIENT = "CELERY_KNOWN_ERROR::TIMEOUT_CLIENT" | ||
|
|
||
| # Notifications that timed out | ||
| TIMEOUT = "CELERY_KNOWN_ERROR::TIMEOUT" | ||
| TIMEOUT_NOTIFICATION = "CELERY_KNOWN_ERROR::TIMEOUT_NOTIFICATION" | ||
|
|
There was a problem hiding this comment.
Renaming/removing the TIMEOUT category and changing the log marker to CELERY_KNOWN_ERROR::TIMEOUT_NOTIFICATION will alter emitted log strings and can break any existing CloudWatch metric filters/alarms or dashboards keyed on the old marker. If downstream consumers still expect CELERY_KNOWN_ERROR::TIMEOUT, consider keeping a backward-compatible alias (deprecated) or ensure the monitoring/config repos are updated in lockstep.
| def test_timeout_notification_error(self): | ||
| """Notification timeout messages are classified as TIMEOUT_NOTIFICATION.""" | ||
| exc = Exception( | ||
| "timeout-sending-notifications: Task exceeded time limit, notifications have been updated to technical-failure because they have timed out." | ||
| ) | ||
| category, root_exc = classify_error(exc) | ||
| assert category == CeleryErrorCategory.TIMEOUT | ||
| assert category == CeleryErrorCategory.TIMEOUT_NOTIFICATION | ||
| assert root_exc is exc # Should return the original exception as root | ||
|
|
||
| def test_timeout_client_error_by_class_name(self): | ||
| """TimeoutError exceptions are classified as TIMEOUT_CLIENT.""" | ||
|
|
||
| class TimeoutError(Exception): | ||
| pass | ||
|
|
||
| exc = TimeoutError("Connection to downstream service timed out") | ||
| category, root_exc = classify_error(exc) | ||
| assert category == CeleryErrorCategory.TIMEOUT_CLIENT | ||
| assert root_exc is exc # Should return the original exception as root | ||
|
|
||
| def test_protocol_error_by_class_name(self): | ||
| """ProtocolError exceptions are classified as TIMEOUT_CLIENT.""" | ||
|
|
||
| class ProtocolError(Exception): | ||
| pass | ||
|
|
||
| exc = ProtocolError("Connection aborted") | ||
| category, root_exc = classify_error(exc) | ||
| assert category == CeleryErrorCategory.TIMEOUT_CLIENT | ||
| assert root_exc is exc # Should return the original exception as root | ||
|
|
||
| def test_max_retry_error_by_class_name(self): | ||
| """MaxRetryError exceptions are classified as TIMEOUT_CLIENT.""" | ||
|
|
||
| class MaxRetryError(Exception): | ||
| pass | ||
|
|
||
| exc = MaxRetryError("HTTPSConnectionPool: Max retries exceeded") | ||
| category, root_exc = classify_error(exc) | ||
| assert category == CeleryErrorCategory.TIMEOUT_CLIENT |
There was a problem hiding this comment.
The PR description mentions “Added coverage for the otel related errors”, but the changes here only add TIMEOUT_CLIENT/TIMEOUT_NOTIFICATION coverage and remove DUPLICATE_RECORD—there don’t appear to be any OTel-specific classifications or tests added in this PR. Please either update the PR description to match the actual change set or add the missing OTel coverage in the classification + tests.
Summary | Résumé
Added more tweaking to Celery error and reported errors.
Related Issues | Cartes liées
Test instructions | Instructions pour tester la modification
Unit tested, and built based on production logs.
Release Instructions | Instructions pour le déploiement
None.
Reviewer checklist | Liste de vérification du réviseur