Tweaked Celery categories and reported errors by jimleroyer · Pull Request #2825 · cds-snc/notification-api

jimleroyer · 2026-03-31T22:03:33Z

Summary | Résumé

Added more tweaking to Celery error and reported errors.

Split the existing timeout category into notify and client related categories, as differentiating the two should be useful.
Added coverage for the otel related errors. These might not actually be caught by Celery but let's see.
Removed the duplicate notification category. We will catch in Terraform and that won't be part of the celery known errors set.

Related Issues | Cartes liées

Reduce Celery errors with business exception translation layer

Test instructions | Instructions pour tester la modification

Unit tested, and built based on production logs.

Release Instructions | Instructions pour le déploiement

None.

Reviewer checklist | Liste de vérification du réviseur

This PR does not break existing functionality.
This PR does not violate GCNotify's privacy policies.
This PR does not raise new security concerns. Refer to our GC Notify Risk Register document on our Google drive.
This PR does not significantly alter performance.
Additional required documentation resulting of these changes is covered (such as the README, setup instructions, a related ADR or the technical documentation).

⚠ If boxes cannot be checked off before merging the PR, they should be moved to the "Release Instructions" section with appropriate steps required to verify before release. For example, changes to celery code may require tests on staging to verify that performance has not been affected.

…or caught by the Celery layer.

…tegory.

Copilot

Pull request overview

This PR updates Celery error classification to better differentiate timeout-related failures and adjusts the associated unit tests so log-derived metrics can distinguish notification timeouts from downstream client timeouts.

Changes:

Split the prior timeout classification into TIMEOUT_NOTIFICATION vs TIMEOUT_CLIENT.
Add new exception-class mappings for common client timeout-related exception names (TimeoutError, ProtocolError, MaxRetryError).
Remove the duplicate-record classification category and its unit tests.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
`app/celery/error_registry.py`	Updates `CeleryErrorCategory` and the class/message matching maps used by `classify_error` to support new timeout categories and remove duplicate-record classification.
`tests/app/celery/test_celery_error_classification.py`	Updates/extends classification tests to cover the new timeout categories and removes duplicate-record assertions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-31T22:06:45Z

app/celery/error_registry.py

 # Note: Order within the map does not matter; the deepest/root exception in the
 # chain takes precedence over wrapper exceptions.


The comment saying “Order within the map does not matter” is inaccurate for _EXCEPTION_CLASS_MAP / _MESSAGE_SUBSTRING_MAP. classify_error iterates these dicts in insertion order and returns the first matching pattern, so the order directly affects which category wins when multiple patterns match the same exception/message (and the tests rely on this behavior). Please update the comment to reflect that ordering is part of the precedence rules (or change the data structure/logic to make precedence explicit).

Suggested change

# Note: Order within the map does not matter; the deepest/root exception in the

# chain takes precedence over wrapper exceptions.

# Note: The deepest/root exception in the chain takes precedence over wrapper

# exceptions, and within each map the insertion order defines which category

# wins when multiple patterns match the same exception class name or message.

Copilot · 2026-03-31T22:06:46Z

app/celery/error_registry.py

+    # Clients that timed out
+    TIMEOUT_CLIENT = "CELERY_KNOWN_ERROR::TIMEOUT_CLIENT"
+
    # Notifications that timed out
-    TIMEOUT = "CELERY_KNOWN_ERROR::TIMEOUT"
+    TIMEOUT_NOTIFICATION = "CELERY_KNOWN_ERROR::TIMEOUT_NOTIFICATION"



Renaming/removing the TIMEOUT category and changing the log marker to CELERY_KNOWN_ERROR::TIMEOUT_NOTIFICATION will alter emitted log strings and can break any existing CloudWatch metric filters/alarms or dashboards keyed on the old marker. If downstream consumers still expect CELERY_KNOWN_ERROR::TIMEOUT, consider keeping a backward-compatible alias (deprecated) or ensure the monitoring/config repos are updated in lockstep.

Copilot · 2026-03-31T22:06:46Z

tests/app/celery/test_celery_error_classification.py

+    def test_timeout_notification_error(self):
+        """Notification timeout messages are classified as TIMEOUT_NOTIFICATION."""
        exc = Exception(
            "timeout-sending-notifications: Task exceeded time limit, notifications have been updated to technical-failure because they have timed out."
        )
        category, root_exc = classify_error(exc)
-        assert category == CeleryErrorCategory.TIMEOUT
+        assert category == CeleryErrorCategory.TIMEOUT_NOTIFICATION
+        assert root_exc is exc  # Should return the original exception as root
+
+    def test_timeout_client_error_by_class_name(self):
+        """TimeoutError exceptions are classified as TIMEOUT_CLIENT."""
+
+        class TimeoutError(Exception):
+            pass
+
+        exc = TimeoutError("Connection to downstream service timed out")
+        category, root_exc = classify_error(exc)
+        assert category == CeleryErrorCategory.TIMEOUT_CLIENT
+        assert root_exc is exc  # Should return the original exception as root
+
+    def test_protocol_error_by_class_name(self):
+        """ProtocolError exceptions are classified as TIMEOUT_CLIENT."""
+
+        class ProtocolError(Exception):
+            pass
+
+        exc = ProtocolError("Connection aborted")
+        category, root_exc = classify_error(exc)
+        assert category == CeleryErrorCategory.TIMEOUT_CLIENT
+        assert root_exc is exc  # Should return the original exception as root
+
+    def test_max_retry_error_by_class_name(self):
+        """MaxRetryError exceptions are classified as TIMEOUT_CLIENT."""
+
+        class MaxRetryError(Exception):
+            pass
+
+        exc = MaxRetryError("HTTPSConnectionPool: Max retries exceeded")
+        category, root_exc = classify_error(exc)
+        assert category == CeleryErrorCategory.TIMEOUT_CLIENT


The PR description mentions “Added coverage for the otel related errors”, but the changes here only add TIMEOUT_CLIENT/TIMEOUT_NOTIFICATION coverage and remove DUPLICATE_RECORD—there don’t appear to be any OTel-specific classifications or tests added in this PR. Please either update the PR description to match the actual change set or add the missing OTel coverage in the classification + tests.

ben851

LGTM I think

jimleroyer added 4 commits March 31, 2026 16:59

Split timeout category in two: for notify and for clients

277b464

Removed DUPLICATE_RECORD category as this is not an actual raised err…

dfd6668

…or caught by the Celery layer.

Adding ProtocolError to be mapped with client timeout celery error ca…

1fbb12f

…tegory.

Added MaxRetryError mapping to client timeout category

58e9401

jimleroyer requested review from P0NDER0SA and ben851 March 31, 2026 22:03

jimleroyer self-assigned this Mar 31, 2026

Copilot AI review requested due to automatic review settings March 31, 2026 22:03

Copilot started reviewing on behalf of jimleroyer March 31, 2026 22:04 View session

Removed unused import

8a869ba

Copilot AI reviewed Mar 31, 2026

View reviewed changes

jimleroyer mentioned this pull request Mar 31, 2026

Reduce Celery errors with business exception translation layer cds-snc/notification-planning-core#634

Open

5 tasks

Merge branch 'main' into fix/celery-add-more-errors

4cfd07d

ben851 approved these changes Apr 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tweaked Celery categories and reported errors#2825

Tweaked Celery categories and reported errors#2825
jimleroyer wants to merge 6 commits intomainfrom
fix/celery-add-more-errors

jimleroyer commented Mar 31, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 31, 2026

Uh oh!

Copilot AI Mar 31, 2026

Uh oh!

Copilot AI Mar 31, 2026

Uh oh!

ben851 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		# Note: Order within the map does not matter; the deepest/root exception in the
		# chain takes precedence over wrapper exceptions.

-# Note: Order within the map does not matter; the deepest/root exception in the
-# chain takes precedence over wrapper exceptions.
+# Note: The deepest/root exception in the chain takes precedence over wrapper
+# exceptions, and within each map the insertion order defines which category
+# wins when multiple patterns match the same exception class name or message.

Conversation

jimleroyer commented Mar 31, 2026

Summary | Résumé

Related Issues | Cartes liées

Test instructions | Instructions pour tester la modification

Release Instructions | Instructions pour le déploiement

Reviewer checklist | Liste de vérification du réviseur

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

ben851 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants