Oasis server healthcheck - slow when under load #1208

@sambles

Description

Report: The healthcheck API was lagging badly under high load, which caused our EKS cluster to kill the Oasis server pod. We raised the probe thresholds, so it now seems OK. I am also refactoring parts of our code to cache some things so we don't hit the Oasis endpoints all the time. But it's really worth looking into Oasis server optimisation if you have time.

On a side note, it seems some internal process inside the server also tries to call the health API, but with the wrong URL: instead of api/health, it calls health. I think it's wait-for-server.sh, but I couldn't work out how it's connected to Gunicorn.

Might be caused by one of two issues

  • Resource problem -> the platform node is overloaded and the healthcheck is slow to respond (correct behaviour?), though something may still need optimisation.
  • Max concurrent requests issue -> because the HTTP server is synchronous (Gunicorn ~ WSGI), it's limited to (number of workers * threads per worker) concurrent requests, so if longer-running requests are gumming up all those slots, the health-check response would hang.
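The second cause can be illustrated with a small simulation. This is only a sketch: it models the server as a single thread pool with `workers * threads` slots (Gunicorn actually spreads those slots across separate worker processes), and the worker/thread counts, function names, and timings are all hypothetical rather than taken from the Oasis deployment.

```python
import time
from concurrent.futures import ThreadPoolExecutor

WORKERS = 2             # hypothetical stand-in for gunicorn --workers
THREADS_PER_WORKER = 2  # hypothetical stand-in for gunicorn --threads
SLOTS = WORKERS * THREADS_PER_WORKER  # max requests handled at once


def slow_request(seconds):
    """Stand-in for a long-running API call that occupies one slot."""
    time.sleep(seconds)
    return "done"


def healthcheck():
    """Stand-in for the health endpoint: trivial work once it gets a slot."""
    return "ok"


def healthcheck_latency(busy_requests, busy_seconds=0.5):
    """Queue `busy_requests` slow calls, then time a healthcheck behind them."""
    with ThreadPoolExecutor(max_workers=SLOTS) as pool:
        for _ in range(busy_requests):
            pool.submit(slow_request, busy_seconds)
        start = time.monotonic()
        pool.submit(healthcheck).result()
        return time.monotonic() - start


if __name__ == "__main__":
    print("one slot free :", round(healthcheck_latency(SLOTS - 1), 3), "s")
    print("all slots busy:", round(healthcheck_latency(SLOTS), 3), "s")
```

With one slot free the healthcheck returns almost immediately; once every slot is held by a slow request, the healthcheck has to wait for one of them to finish, which is the behaviour a liveness probe with a short timeout would see as a failure.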

Options

  • Give the 'healthcheck' calls their own dedicated thread.
  • For the side note, it probably makes sense to update the routing so that both api/healthcheck and healthcheck are valid.
  • Update the server to support async HTTP.
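The first two options could be sketched roughly as follows, using only the Python standard library. This is not the Oasis server's code: the handler, the `start_health_server` and `probe` helpers, the JSON body, and the choice of a separate listener are all assumptions for illustration. With multiple Gunicorn worker processes, a listener like this would need to be started once (or on a per-process port), which the sketch does not address.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Accept both URL forms mentioned in the issue (with or without trailing slash).
HEALTH_PATHS = {
    "/healthcheck", "/healthcheck/",
    "/api/healthcheck", "/api/healthcheck/",
}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path in HEALTH_PATHS:
            body = b'{"status": "OK"}'  # hypothetical payload
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the logs


def start_health_server(host="127.0.0.1", port=0):
    """Serve health probes on a dedicated daemon thread (port 0 = OS picks)."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe(server, path):
    """Issue a GET against the health listener and return the status code."""
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=2)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()
```

Because the listener runs on its own thread and does no application work, it keeps answering while the main request slots are full. The trade-off is that it only reports that the process is alive, not that the API itself can serve requests.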

Metadata

Assignees

Type

Projects

Status

On Hold

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
