Skip to content

Fix Connection_cache.call hanging on unreachable server#1113

Open
lthms wants to merge 1 commit intomirage:mainfrom
lthms:fix-1112
Open

Fix Connection_cache.call hanging on unreachable server#1113
lthms wants to merge 1 commit intomirage:mainfrom
lthms:fix-1112

Conversation

@lthms
Copy link

@lthms lthms commented Jun 8, 2025

  • What

The Connection_cache.call function could hang indefinitely when attempting to establish a connection to an unreachable server. This change resolves that issue by ensuring that the connection creation process is asynchronous and non-blocking.

Fixes #1112

  • Why

Previously, the Connection.create function returns before the connection is established. When the returned connection is binded, something in the code of get_connection prevents the correct handling of the ECONNREFUSED Unix error.

  • How

The create helper function now uses Connection.connect (which returns an Lwt promise) and binds its result using >>=. This transforms the create function itself into an Lwt promise. Consequently, the Connection_cache.call function no longer blocks on connection establishment but rather awaits the connection promise.

* What

The `Connection_cache.call` function could hang indefinitely when
attempting to establish a connection to an unreachable server. This
change resolves that issue by ensuring that the connection creation
process is asynchronous and non-blocking.

* Why

Previously, the `Connection.create` function returns before the
connection is established. When the returned connection is binded,
something in the code of `get_connection` prevents the correct handling
of the `ECONNREFUSED` Unix error.

* How

The `create` helper function now uses `Connection.connect` (which
returns an Lwt promise) and binds its result using `>>=`. This
transforms the `create` function itself into an Lwt promise.
Consequently, the `Connection_cache.call` function no longer blocks on
connection establishment but rather awaits the connection promise.
@samoht
Copy link
Member

samoht commented Jun 17, 2025

The change looks fine to me but /cc @art-w and @MisterDA who worked on this piece the most recently.

@lthms
Copy link
Author

lthms commented Jun 29, 2025

Don’t hesitate to let me know if I can help moving this forward :).

Copy link
Contributor

@art-w art-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, sorry for the delay! I'm not familiar enough with the code to be sure, but if I missed something then I don't understand the bug and I would love to know the true cause :)

Connection.create ~persistent:true ~finalise ~ctx:self.ctx endp
and timeout = ref Lwt.return_unit in
Connection.connect ~persistent:true ~finalise ~ctx:self.ctx endp
>>= fun connection ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a bit weird, we are blocking on establishing the connection and only then setting up a timeout... I believe this works if we get the ECONNREFUSED immediate error but otherwise we would still hang with no control on the timeout?

Looking at the code around https://github.com/mirage/ocaml-cohttp/blob/main/cohttp-lwt/src/connection.ml#L291 , I wonder if the issue isn't larger than the cache: Connection.t has two queues of Lwt.wait() promises ("waiting" and "in flight"), but when an error happens with the channel, it looks like we don't signal the error to the queues? (leaving their Lwt.u unresolved)

Even if ECONNREFUSED wasn't propagated correctly, I was confused why the timeout below didn't trigger, but Connection.close seems to have the same issue of leaving the queues unresolved such that Connection.call is stuck hanging even though the timeout did close the connection (?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not sure to follow your train of thoughts. Concretely, how should we move forward? We could merge this as it at least fix partially the issue, and look for a follow-up, maybe?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Connection code is missing calls to queue_fail whenever a failure happens or the connection is closed, so it's still possible that it'll leave requests hanging forever, since the lwt promises on the queues are never resolved? I believe adding calls to queue_fail to those two cases would fix your ECONNREFUSED issue, without requiring changes in the Connection_cache.

(I'm not a maintainer of cohttp so it's not my call to merge or not! I'm just pointing out that the bug seems to be elsewhere and that the follow-up would have to revert this PR to recover the timeout protection.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I don’t know when but i’ll try to have a look.

I’m not sure to understand who is maintaining cohttp 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cohttp_lwt_unix.Connection_cache.call hangs if the server is unreachable

3 participants