Add logs for stall detection #2155

Merged
amankrx merged 6 commits into TraceMachina:main from amankrx:add-logs-for-stall-detection
Feb 14, 2026
amankrx (Collaborator) commented Feb 13, 2026

Description

Adds trace-level logs that help detect stalls and identify what caused them.

Fixes # (issue)

Type of change

Please delete options that aren't relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to
    not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please also list any relevant details for your test configuration

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend (see some docs)

MarcusSorealheis (Collaborator) left a comment

High‑volume logs on hot paths can worsen stalls. Please ensure new logs are rate‑limited or include stable identifiers (action_id, worker_id, invocation_id, queue name) so stalls can be correlated across scheduler/worker/store. Basically, log state transitions.
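
One way to read the suggestion above is as a small gate in front of the log call: emit a line whenever an action changes state, and otherwise at most once per interval, always tagging it with stable identifiers. The following is a minimal std-only sketch of that idea, not code from this PR. `ActionState`, `TransitionLogger`, and `observe` are hypothetical names, and the returned string stands in for whatever logging call the codebase actually uses.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical action states, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ActionState {
    Queued,
    Executing,
    Completed,
}

/// Decides whether a log line should be emitted for an action.
/// Assumption: one entry per action_id is cheap enough to keep in memory.
pub struct TransitionLogger {
    min_interval: Duration,
    last: HashMap<String, (ActionState, Instant)>,
}

impl TransitionLogger {
    pub fn new(min_interval: Duration) -> Self {
        Self { min_interval, last: HashMap::new() }
    }

    /// Returns the line to log, or `None` when the event is suppressed.
    /// A state change always logs; a repeat of the same state logs only
    /// after `min_interval` has passed since the last emitted line.
    pub fn observe(
        &mut self,
        action_id: &str,
        worker_id: &str,
        state: ActionState,
        now: Instant,
    ) -> Option<String> {
        let emit = match self.last.get(action_id) {
            None => true,
            Some((prev, _)) if *prev != state => true,
            Some((_, at)) => now.duration_since(*at) >= self.min_interval,
        };
        if !emit {
            return None;
        }
        self.last.insert(action_id.to_string(), (state, now));
        // Stable identifiers make the line easy to correlate across components.
        Some(format!("action_id={action_id} worker_id={worker_id} state={state:?}"))
    }
}
```

With a 10-second interval, a second `Queued` observation one second after the first is dropped, while a change to `Executing` is logged immediately. Passing `now` in explicitly keeps the gate deterministic and easy to test.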

amankrx (Collaborator, Author) commented Feb 14, 2026

> High‑volume logs on hot paths can worsen stalls. Please ensure new logs are rate‑limited or include stable identifiers (action_id, worker_id, invocation_id, queue name) so stalls can be correlated across scheduler/worker/store. Basically, log state transitions.

That's why I set the log level to trace. We won't use the trace log level in production. These logs are only there in case we experience stalling in a normal deployment and want to find the source of the stall.

amankrx enabled auto-merge (squash) February 14, 2026 08:55
amankrx merged commit 94e7e3f into TraceMachina:main Feb 14, 2026
28 checks passed