Skip to content

When returning JobReply fails initially, the node keeps sending even after a successful reattempt #226

Description

@UnitedMarsupials

While looking at the log of a sizeable batch, I noticed strange errors being reported about Ignoring invalid reply for job ....

A closer investigation into one of such jobs showed, that the remote node tried to return the results, encountered an error, and then kept on trying -- forever...

Let's start with the node's log:

2025-10-28 15:54:55 dispynode - New job id 140058307931016 from 10.92.161.63/10.92.161.63
2025-10-28 15:54:55 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:05 dispynode - Could not send reply for job 140058307931016 to ('10.92.161.63', 21348); saving it in "/tmp/dispy/node/10.92.161.63/1761678653267_k75c7x1g/_dispy_job_reply_140058307931016.pkl"
2025-10-28 15:55:07 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:07 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:07 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:07 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:07 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:07 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:07 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:07 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:07 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:08 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:08 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:08 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:08 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:08 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:08 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:08 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:08 dispynode - Sending result for job 140058307931016 (11)
...

And here is, what the client's logging:

2025-10-28 15:54:55.328 dispy - Running job 140058307931016 on 10.92.161.135
2025-10-28 15:54:55.349 dispy - Running job ('xxxx', 32, 5) / 140058307931016 on 10.92.161.135 (busy: 18 / 18)
...
2025-10-28 15:55:07.075 dispy - Received reply for job ('xxxx', 32, 5) / 140058307931016 from 10.92.161.135
...
2025-10-28 15:55:07.282 dispy - Ignoring invalid reply for job 140058307931016 from 10.92.161.135
2025-10-28 15:55:07.674 dispy - Ignoring invalid reply for job 140058307931016 from 10.92.161.135
...

So... The node's first attempt to send back the results failed at 15:55:05. The sending succeeded two seconds later -- the client processed the results at 15:55:07.075 and moved on. Unfortunately, the node continued to spam the client with the same results -- in perpetuity...

As I type this, there are already about 20 jobs with this problem -- from different nodes -- out of about 1.5 million jobs processed.

That's an obvious bug. In addition, I think, it'd make sense for the initial log-entry "Could not send reply for job... " to include the actual exception -- what caused the sending to fail?..

Using dispy 4.15.2 and pycos 4.12.2... Thank you!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions