quicker unhappy jobs #233

Description

@cortlandstarrett

At present, an unhappy job finishes after a timeout of the hanging job timer.
The rationale is to allow sufficient time for a critical event to arrive, which would trigger an alarm condition. This makes sense, but it has a few costs:

  1. It reuses a timer that is intended for other purposes.
  2. It is not configurable separately from the hanging job.
  3. It is a long timer, which means that unhappy jobs are held in memory for a long time. This increases the number of concurrent jobs and could become a memory risk. At 50 jobs per second and a 30 second hanging job timer, this could grow to 1500 jobs waiting to end, which impacts our max jobs per worker setting.
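
The 1500 figure follows from multiplying the arrival rate by the hold time (Little's Law). Here is a minimal sketch of that arithmetic, assuming the worst case in which every job is unhappy. The function name and the 2 second comparison value are illustrative only and do not come from the code base.

```python
def jobs_waiting_to_end(unhappy_jobs_per_second: float, hold_seconds: float) -> float:
    """Steady-state count of unhappy jobs held in memory.

    Little's Law: items in the system = arrival rate * time each item stays.
    """
    return unhappy_jobs_per_second * hold_seconds


# Figures from this issue: 50 jobs per second, 30 second hanging job timer.
with_hanging_job_timer = jobs_waiting_to_end(50, 30)

# Hypothetical shorter timer, purely for comparison.
with_short_timer = jobs_waiting_to_end(50, 2)

print(with_hanging_job_timer, with_short_timer)
```

The backlog scales linearly with the timer, so any shortening of the hold time reduces the number of held jobs in the same proportion.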

It might be good to add a configuration value for this timer.
Another option is to use the intra-event timer which can be very short.
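
The two options could be combined roughly as follows. This is only a sketch: `TimerConfig`, its field names, the default values, and the `fallback` parameter are all hypothetical and are not existing configuration items.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimerConfig:
    # All names and defaults here are assumptions for illustration.
    hanging_job_timeout_s: float = 30.0            # the 30 second figure quoted in this issue
    intra_event_timeout_s: float = 1.0             # assumed "very short" value
    unhappy_job_timeout_s: Optional[float] = None  # proposed new, separately configurable value


def unhappy_job_timeout(cfg: TimerConfig, fallback: str = "hanging") -> float:
    """Pick the timeout applied to an unhappy job.

    An explicitly configured value wins. Otherwise fall back to either the
    intra-event timer or the hanging job timer (today's behaviour).
    """
    if cfg.unhappy_job_timeout_s is not None:
        return cfg.unhappy_job_timeout_s
    if fallback == "intra_event":
        return cfg.intra_event_timeout_s
    return cfg.hanging_job_timeout_s
```

Defaulting to the hanging job timer keeps the current behaviour for deployments that do not set the new value.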

A thought would be to allow the unhappy job to finish quickly, but detect critical events in the Job Gone Horribly Wrong state, which is entered if a "stray event" from a previous job arrives.
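
A rough sketch of that idea, assuming the only state retained after an unhappy job finishes is its job id. `StrayEventMonitor`, its methods, and the returned labels are hypothetical and do not reflect the actual state model.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class StrayEventMonitor:
    """Hypothetical helper: lets unhappy jobs finish at once, then checks stray events."""

    critical_event_types: Set[str]
    finished_unhappy_jobs: Set[str] = field(default_factory=set)
    alarms: List[Tuple[str, str]] = field(default_factory=list)

    def finish_unhappy_job(self, job_id: str) -> None:
        # Release the job immediately; only its id is remembered.
        self.finished_unhappy_jobs.add(job_id)

    def on_event(self, job_id: str, event_type: str) -> str:
        if job_id not in self.finished_unhappy_jobs:
            return "normal"  # event belongs to a job that is still live or unknown
        # Stray event from a previous job: the Job Gone Horribly Wrong path.
        if event_type in self.critical_event_types:
            self.alarms.append((job_id, event_type))
            return "alarm"
        return "job_gone_horribly_wrong"
```

In this sketch the set of finished job ids still grows over time, so a real design would need some expiry policy for it. A set of ids is, however, much cheaper to hold than full job instances.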
