Report: The healthcheck API was lagging under high load, which caused our EKS to kill the oasis server pod. We raised the probe thresholds, so it seems OK for now. I am also refactoring parts of our code to cache some things so we don't hit the oasis endpoints all the time. But it's really worth looking into oasis server optimisation if you have time.
On a side note, it seems there is some internal process inside the server that also tries to call the health API, but with the wrong URL: instead of api/health, it calls health. I think it's wait-for-server.sh, but I couldn't work out how it's connected to the gunicorn setup.
Might be caused by one of two issues
- Resource problem -> the platform node is overloaded and the healthcheck is slow to respond. This may be correct behaviour, but it could also mean something needs optimisation.
- Max concurrent requests issue -> because the HTTP server is synchronous (Gunicorn ~ WSGI), it's limited to (number of workers * threads per worker) concurrent requests, so if longer-running requests are occupying all of those slots, the healthcheck response hangs until one frees up.
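The second cause can be illustrated with a small model. This is only a sketch, not the oasis server or Gunicorn itself: a standard-library thread pool stands in for the (workers * threads) request slots, and the worker count, thread count, and request duration are made-up values.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def health_latency(slots: int, slow_requests: int, slow_seconds: float) -> float:
    """Time a trivial 'health check' submitted after some slow requests.

    `slots` models (number of workers * threads per worker).
    """
    with ThreadPoolExecutor(max_workers=slots) as pool:
        # Long-running requests take up slots first.
        for _ in range(slow_requests):
            pool.submit(time.sleep, slow_seconds)
        start = time.monotonic()
        # The health check itself does no work, it only needs a free slot.
        pool.submit(lambda: "ok").result()
        return time.monotonic() - start


# Hypothetical sizing, not taken from the real deployment.
workers, threads = 2, 2
slots = workers * threads

# Every slot busy: the health check waits for a slow request to finish.
saturated = health_latency(slots, slow_requests=slots, slow_seconds=0.3)
# One slot free: the health check is answered almost immediately.
one_free = health_latency(slots, slow_requests=slots - 1, slow_seconds=0.3)

print(f"all slots busy: {saturated:.2f}s, one slot free: {one_free:.2f}s")
```

If this model matches what the server is doing, the probe latency would track the duration of the slowest in-flight requests rather than the cost of the healthcheck itself, which is consistent with raising the probe thresholds having helped.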
Options
- Give the 'healthcheck' calls their own dedicated thread
- For the side note, it probably makes sense to update the routing so that both `api/healthcheck` and `healthcheck` are valid
- Update the server to support async HTTP
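A rough sketch of how the first two options could look together, using only the Python standard library. Everything here is an assumption for illustration: the handler, the helper names, the loopback host, and the use of a separate listener are not taken from the oasis server code, and the two paths are the ones named in the options above.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Both spellings are accepted so either caller gets an answer.
HEALTH_PATHS = {"/healthcheck", "/api/healthcheck"}


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers only the health paths."""

    def do_GET(self):
        if self.path.rstrip("/") in HEALTH_PATHS:
            body = b'{"status": "ok"}'
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Keep probe traffic out of the logs.
        pass


def start_health_server(host="127.0.0.1", port=0):
    """Serve health checks on a daemon thread; port 0 picks a free port."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe(server, path):
    """Issue one GET against the health listener and return the status code."""
    port = server.server_address[1]
    url = f"http://127.0.0.1:{port}{path}"
    with urllib.request.urlopen(url, timeout=2) as resp:
        return resp.status
```

Calling `start_health_server()` and then `probe(server, "/healthcheck")` or `probe(server, "/api/healthcheck")` should return 200 in both cases. One trade-off to weigh: a listener on its own thread only shows that the process is alive, not that the main request slots can still serve traffic, so it may suit a liveness probe better than a readiness probe.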
Status
On Hold