Skip to content

Tweaked Celery categories and reported errors#2825

Open
jimleroyer wants to merge 6 commits intomainfrom
fix/celery-add-more-errors
Open

Tweaked Celery categories and reported errors#2825
jimleroyer wants to merge 6 commits intomainfrom
fix/celery-add-more-errors

Conversation

@jimleroyer
Copy link
Copy Markdown
Member

Summary | Résumé

Added more tweaking to Celery error and reported errors.

  • Split the existing timeout category into notify and client related categories, as differentiating the two should be useful.
  • Added coverage for the otel related errors. These might not actually be caught by Celery but let's see.
  • Removed the duplicate notification category. We will catch in Terraform and that won't be part of the celery known errors set.

Related Issues | Cartes liées

Test instructions | Instructions pour tester la modification

Unit tested, and built based on production logs.

Release Instructions | Instructions pour le déploiement

None.

Reviewer checklist | Liste de vérification du réviseur

  • This PR does not break existing functionality.
  • This PR does not violate GCNotify's privacy policies.
  • This PR does not raise new security concerns. Refer to our GC Notify Risk Register document on our Google drive.
  • This PR does not significantly alter performance.
  • Additional required documentation resulting of these changes is covered (such as the README, setup instructions, a related ADR or the technical documentation).

⚠ If boxes cannot be checked off before merging the PR, they should be moved to the "Release Instructions" section with appropriate steps required to verify before release. For example, changes to celery code may require tests on staging to verify that performance has not been affected.

@jimleroyer jimleroyer requested review from P0NDER0SA and ben851 March 31, 2026 22:03
@jimleroyer jimleroyer self-assigned this Mar 31, 2026
Copilot AI review requested due to automatic review settings March 31, 2026 22:03
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Celery error classification to better differentiate timeout-related failures and adjusts the associated unit tests so log-derived metrics can distinguish notification timeouts from downstream client timeouts.

Changes:

  • Split the prior timeout classification into TIMEOUT_NOTIFICATION vs TIMEOUT_CLIENT.
  • Add new exception-class mappings for common client timeout-related exception names (TimeoutError, ProtocolError, MaxRetryError).
  • Remove the duplicate-record classification category and its unit tests.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
app/celery/error_registry.py Updates CeleryErrorCategory and the class/message matching maps used by classify_error to support new timeout categories and remove duplicate-record classification.
tests/app/celery/test_celery_error_classification.py Updates/extends classification tests to cover the new timeout categories and removes duplicate-record assertions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 45 to 46
# Note: Order within the map does not matter; the deepest/root exception in the
# chain takes precedence over wrapper exceptions.
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment saying “Order within the map does not matter” is inaccurate for _EXCEPTION_CLASS_MAP / _MESSAGE_SUBSTRING_MAP. classify_error iterates these dicts in insertion order and returns the first matching pattern, so the order directly affects which category wins when multiple patterns match the same exception/message (and the tests rely on this behavior). Please update the comment to reflect that ordering is part of the precedence rules (or change the data structure/logic to make precedence explicit).

Suggested change
# Note: Order within the map does not matter; the deepest/root exception in the
# chain takes precedence over wrapper exceptions.
# Note: The deepest/root exception in the chain takes precedence over wrapper
# exceptions, and within each map the insertion order defines which category
# wins when multiple patterns match the same exception class name or message.

Copilot uses AI. Check for mistakes.
Comment on lines +34 to 39
# Clients that timed out
TIMEOUT_CLIENT = "CELERY_KNOWN_ERROR::TIMEOUT_CLIENT"

# Notifications that timed out
TIMEOUT = "CELERY_KNOWN_ERROR::TIMEOUT"
TIMEOUT_NOTIFICATION = "CELERY_KNOWN_ERROR::TIMEOUT_NOTIFICATION"

Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming/removing the TIMEOUT category and changing the log marker to CELERY_KNOWN_ERROR::TIMEOUT_NOTIFICATION will alter emitted log strings and can break any existing CloudWatch metric filters/alarms or dashboards keyed on the old marker. If downstream consumers still expect CELERY_KNOWN_ERROR::TIMEOUT, consider keeping a backward-compatible alias (deprecated) or ensure the monitoring/config repos are updated in lockstep.

Copilot uses AI. Check for mistakes.
Comment on lines +95 to +134
def test_timeout_notification_error(self):
"""Notification timeout messages are classified as TIMEOUT_NOTIFICATION."""
exc = Exception(
"timeout-sending-notifications: Task exceeded time limit, notifications have been updated to technical-failure because they have timed out."
)
category, root_exc = classify_error(exc)
assert category == CeleryErrorCategory.TIMEOUT
assert category == CeleryErrorCategory.TIMEOUT_NOTIFICATION
assert root_exc is exc # Should return the original exception as root

def test_timeout_client_error_by_class_name(self):
"""TimeoutError exceptions are classified as TIMEOUT_CLIENT."""

class TimeoutError(Exception):
pass

exc = TimeoutError("Connection to downstream service timed out")
category, root_exc = classify_error(exc)
assert category == CeleryErrorCategory.TIMEOUT_CLIENT
assert root_exc is exc # Should return the original exception as root

def test_protocol_error_by_class_name(self):
"""ProtocolError exceptions are classified as TIMEOUT_CLIENT."""

class ProtocolError(Exception):
pass

exc = ProtocolError("Connection aborted")
category, root_exc = classify_error(exc)
assert category == CeleryErrorCategory.TIMEOUT_CLIENT
assert root_exc is exc # Should return the original exception as root

def test_max_retry_error_by_class_name(self):
"""MaxRetryError exceptions are classified as TIMEOUT_CLIENT."""

class MaxRetryError(Exception):
pass

exc = MaxRetryError("HTTPSConnectionPool: Max retries exceeded")
category, root_exc = classify_error(exc)
assert category == CeleryErrorCategory.TIMEOUT_CLIENT
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description mentions “Added coverage for the otel related errors”, but the changes here only add TIMEOUT_CLIENT/TIMEOUT_NOTIFICATION coverage and remove DUPLICATE_RECORD—there don’t appear to be any OTel-specific classifications or tests added in this PR. Please either update the PR description to match the actual change set or add the missing OTel coverage in the classification + tests.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

@ben851 ben851 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM I think

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants