Skip to content

Bugfix - Timer events not working#80

Merged
auslin-aot merged 2 commits into
AOT-Technologies:mainfrom
andrepestana-aot:bugfix/M8F-279-timer-events-not-working
May 21, 2026
Merged

Bugfix - Timer events not working#80
auslin-aot merged 2 commits into
AOT-Technologies:mainfrom
andrepestana-aot:bugfix/M8F-279-timer-events-not-working

Conversation

@andrepestana-aot
Copy link
Copy Markdown
Collaborator

@andrepestana-aot andrepestana-aot commented May 15, 2026

JIRA Ticket

https://aottech.atlassian.net/browse/M8F-279

Description

This PR hardens M8Flow Celery execution for tenant-aware background tasks and restores timer progression compatibility with the current SpiffWorkflow runtime.

What changed

1. Rebrand Celery task names to M8Flow-owned task IDs

M8Flow now exposes process-instance Celery tasks under M8Flow-owned names instead of upstream spiffworkflow_backend.* names:

  • M8FLOW_CELERY_TASK_PROCESS_INSTANCE_RUN
  • M8FLOW_CELERY_TASK_EVENT_NOTIFIER

This keeps task names and Flower UI branding aligned with M8Flow without modifying upstream SpiffWorkflow code.

2. Move tenant-context ownership into the M8Flow wrapper tasks

The M8Flow wrapper tasks now set and clear tenant context around the upstream raw task execution.

This was necessary because once the worker is dispatching the new M8Flow task names, those tasks are no longer just upstream Celery tasks. They are M8Flow wrappers, so tenant lifecycle must be owned at that wrapper boundary.

The wrapper task flow now:

  • cleans up scoped ORM session state before execution
  • resolves tenant from task headers or process_instance_id
  • sets tenant context for the duration of the task
  • delegates to the upstream .run() implementation
  • clears tenant context and session state afterward

3. Keep worker signal hooks as compatibility/fallback logic

set_tenant_context_for_task() and clear_tenant_context_for_task() in celery_worker.py were updated to skip the two M8Flow wrapper task names.

This prevents double tenant setup/cleanup for:

  • M8FLOW_CELERY_TASK_PROCESS_INSTANCE_RUN
  • M8FLOW_CELERY_TASK_EVENT_NOTIFIER

The worker hooks still remain in place for:

  • legacy upstream task names that may already be queued
  • fallback handling outside the M8Flow wrapper path
  • per-task scoped-session cleanup in prefork workers

4. Harden Celery worker database/session handling

Celery worker runtime handling was tightened to reduce session/connection leakage across prefork worker processes.

This includes:

  • resetting inherited DB connections when worker child processes start
  • cleaning up scoped ORM session state between tasks
  • resolving tenant IDs for process_instance_id using a direct engine query instead of the shared scoped ORM session

5. Restore timer wake-up behavior for current SpiffWorkflow runtime

A compatibility patch was added for timer catch events because the installed SpiffWorkflow version no longer refreshes waiting timer tasks in the way spiffworkflow-backend expects.

The patch restores the expected behavior by revisiting waiting timer-catching tasks and forcing their state update, which allows due timers to advance the process instance correctly.

Why this was needed

These changes address two related problems:

  1. Tenant-aware Celery execution was no longer safe once M8Flow introduced wrapper task names.
    Without changing the worker hooks, tenant context could be applied twice or session state could leak across Celery tasks.

  2. Timer-based process models could remain stuck even after Celery executed, because the current SpiffWorkflow runtime no longer refreshes waiting timer events the way upstream backend code expects.

Behavior after this PR

  • M8Flow-branded Celery task names are used consistently.
  • Tenant context is owned by the M8Flow task wrappers for the main process-instance Celery tasks.
  • Legacy queued upstream task names still remain compatible.
  • Scoped DB/session state is cleaned up more reliably in Celery workers.
  • Due timer catch events advance to the next task as expected.

Files changed

  • m8flow-backend/src/m8flow_backend/background_processing/__init__.py
  • m8flow-backend/src/m8flow_backend/background_processing/celery_worker.py
  • m8flow-backend/src/m8flow_backend/background_processing/celery_tasks/process_instance_task.py
  • m8flow-backend/src/m8flow_backend/services/celery_worker_runtime.py
  • m8flow-backend/src/m8flow_backend/services/spiff_timer_refresh_patch.py
  • m8flow-backend/src/m8flow_backend/startup/patch_registry.py

Type

  • Feature
  • Bug fix
  • Documentation
  • Other

Changes

  • Backend
  • Frontend
  • Documentation

Testing

Validated with unit coverage around:

  • Celery worker runtime/session handling
  • tenant context behavior for Celery tasks
  • startup patch registration
  • timer refresh compatibility

Also verified against the Docker containers by rebuilding/restarting the backend and Celery worker and running a fresh timer-based process instance to confirm the timer advanced to the expected ready/user-input state.

Related Issues

Closes #

@andrepestana-aot andrepestana-aot self-assigned this May 15, 2026
@andrepestana-aot andrepestana-aot changed the title timer fix Bugfix - Timer events not working May 15, 2026
@andrepestana-aot andrepestana-aot marked this pull request as ready for review May 15, 2026 21:57
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 15, 2026

PR Agent Review

Blocking issues

  1. Potential AttributeError in patched_spiff_task_to_api_task when reading error details
    • File: services/process_instance_service_patch.py
    • Code path:
      if error_event:
          error_message = error_event.error_details[-1].message
    • Issue: This assumes error_event.error_details is always a non-empty list. If it is None or empty, this will throw (TypeError/IndexError) and break task serialization.
    • Why blocking: This is a runtime correctness bug in an API task conversion path.
    • Suggested fix: Guard access:
      if error_event and getattr(error_event, "error_details", None):
          if len(error_event.error_details) > 0:
              error_message = error_event.error_details[-1].message

Non-blocking suggestions

  1. Overly broad exception swallowing in DB/session runtime helpers
    • File: services/celery_worker_runtime.py
    • cleanup_scoped_session() and reset_engine_for_worker_process() swallow all exceptions silently.
    • Consider at least debug/warning logging for observability. Silent failure can hide real connection/session lifecycle issues.
  2. _task_tenant_id only parses numeric string IDs
    • File: background_processing/celery_tasks/process_instance_task.py
    • Current logic only converts process_instance_id when str.isdigit(). Values like " 42 " or other serializations won’t resolve.
    • Minor robustness improvement: strip whitespace before digit check.

Recommended tests

  1. Add regression tests for error_details edge cases in patched_spiff_task_to_api_task
    • Cases:
      • error_event.error_details = []
      • error_event.error_details = None
    • Assert task conversion does not crash and error_message is None.
  2. Add tests for Celery task tenant extraction precedence
    • File under test: process_instance_task.py
    • Cases:
      • Header tenant present -> used, DB lookup not called.
      • No header, integer process_instance_id in kwargs -> DB lookup used.
      • No header, first positional arg string numeric -> converted and looked up.
      • Invalid ID type -> returns None.
  3. Add test for per-task context cleanup on upstream task failure
    • Validate _end_task() is called in finally and tenant context is reset when upstream .run() raises.

Generated by model gpt-5.3-codex on PR updates via OpenAI Responses API.

Copy link
Copy Markdown
Collaborator

@auslin-aot auslin-aot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While testing the Two Step Leave Approval sample template, I assigned a single user as the owner of the HR lane. This causes an error when the user in the Manager lane takes action.

Image

Throws below error on

Image Image

Whether this issue was introduced as part of this PR? It may have arisen after implementing the Swimlane Assignment Using Groups instead of Realm Roles PR.

@andrepestana-aot
Copy link
Copy Markdown
Collaborator Author

While testing the Two Step Leave Approval sample template, I assigned a single user as the owner of the HR lane. This causes an error when the user in the Manager lane takes action.

Image Throws below error on

Image Image
Whether this issue was introduced as part of this PR? It may have arisen after implementing the Swimlane Assignment Using Groups instead of Realm Roles PR.

image

I could reproduce the error by removing the email for the user. I fixed the problem with a more defensive code. Thanks!

@auslin-aot auslin-aot merged commit 58c35ec into AOT-Technologies:main May 21, 2026
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants