Fix Connection_cache.call hanging on unreachable server#1113
Fix Connection_cache.call hanging on unreachable server#1113lthms wants to merge 1 commit intomirage:mainfrom
Connection_cache.call hanging on unreachable server#1113Conversation
* What The `Connection_cache.call` function could hang indefinitely when attempting to establish a connection to an unreachable server. This change resolves that issue by ensuring that the connection creation process is asynchronous and non-blocking. * Why Previously, the `Connection.create` function returns before the connection is established. When the returned connection is binded, something in the code of `get_connection` prevents the correct handling of the `ECONNREFUSED` Unix error. * How The `create` helper function now uses `Connection.connect` (which returns an Lwt promise) and binds its result using `>>=`. This transforms the `create` function itself into an Lwt promise. Consequently, the `Connection_cache.call` function no longer blocks on connection establishment but rather awaits the connection promise.
|
Don’t hesitate to let me know if I can help moving this forward :). |
art-w
left a comment
There was a problem hiding this comment.
Thanks for the PR, sorry for the delay! I'm not familiar enough with the code to be sure, but if I missed something then I don't understand the bug and I would love to know the true cause :)
| Connection.create ~persistent:true ~finalise ~ctx:self.ctx endp | ||
| and timeout = ref Lwt.return_unit in | ||
| Connection.connect ~persistent:true ~finalise ~ctx:self.ctx endp | ||
| >>= fun connection -> |
There was a problem hiding this comment.
This looks a bit weird, we are blocking on establishing the connection and only then setting up a timeout... I believe this works if we get the ECONNREFUSED immediate error but otherwise we would still hang with no control on the timeout?
Looking at the code around https://github.com/mirage/ocaml-cohttp/blob/main/cohttp-lwt/src/connection.ml#L291 , I wonder if the issue isn't larger than the cache: Connection.t has two queues of Lwt.wait() promises ("waiting" and "in flight"), but when an error happens with the channel, it looks like we don't signal the error to the queues? (leaving their Lwt.u unresolved)
Even if ECONNREFUSED wasn't propagated correctly, I was confused why the timeout below didn't trigger, but Connection.close seems to have the same issue of leaving the queues unresolved such that Connection.call is stuck hanging even though the timeout did close the connection (?)
There was a problem hiding this comment.
I’m not sure to follow your train of thoughts. Concretely, how should we move forward? We could merge this as it at least fix partially the issue, and look for a follow-up, maybe?
There was a problem hiding this comment.
The Connection code is missing calls to queue_fail whenever a failure happens or the connection is closed, so it's still possible that it'll leave requests hanging forever, since the lwt promises on the queues are never resolved? I believe adding calls to queue_fail to those two cases would fix your ECONNREFUSED issue, without requiring changes in the Connection_cache.
(I'm not a maintainer of cohttp so it's not my call to merge or not! I'm just pointing out that the bug seems to be elsewhere and that the follow-up would have to revert this PR to recover the timeout protection.)
There was a problem hiding this comment.
Okay, I don’t know when but i’ll try to have a look.
I’m not sure to understand who is maintaining cohttp 😅
The
Connection_cache.callfunction could hang indefinitely when attempting to establish a connection to an unreachable server. This change resolves that issue by ensuring that the connection creation process is asynchronous and non-blocking.Fixes #1112
Previously, the
Connection.createfunction returns before the connection is established. When the returned connection is binded, something in the code ofget_connectionprevents the correct handling of theECONNREFUSEDUnix error.The
createhelper function now usesConnection.connect(which returns an Lwt promise) and binds its result using>>=. This transforms thecreatefunction itself into an Lwt promise. Consequently, theConnection_cache.callfunction no longer blocks on connection establishment but rather awaits the connection promise.