While looking at the log of a sizeable batch, I noticed strange errors being reported about Ignoring invalid reply for job ....
A closer investigation into one of such jobs showed, that the remote node tried to return the results, encountered an error, and then kept on trying -- forever...
Let's start with the node's log:
2025-10-28 15:54:55 dispynode - New job id 140058307931016 from 10.92.161.63/10.92.161.63
2025-10-28 15:54:55 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:05 dispynode - Could not send reply for job 140058307931016 to ('10.92.161.63', 21348); saving it in "/tmp/dispy/node/10.92.161.63/1761678653267_k75c7x1g/_dispy_job_reply_140058307931016.pkl"
2025-10-28 15:55:07 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:07 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:07 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:07 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:07 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:07 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:07 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:07 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:07 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:08 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:08 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:08 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:08 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:08 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:08 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:08 dispynode - Sending result for job 140058307931016 (11)
2025-10-28 15:55:08 dispynode - Sending result for job 140058307931016 (11)
...
And here is, what the client's logging:
2025-10-28 15:54:55.328 dispy - Running job 140058307931016 on 10.92.161.135
2025-10-28 15:54:55.349 dispy - Running job ('xxxx', 32, 5) / 140058307931016 on 10.92.161.135 (busy: 18 / 18)
...
2025-10-28 15:55:07.075 dispy - Received reply for job ('xxxx', 32, 5) / 140058307931016 from 10.92.161.135
...
2025-10-28 15:55:07.282 dispy - Ignoring invalid reply for job 140058307931016 from 10.92.161.135
2025-10-28 15:55:07.674 dispy - Ignoring invalid reply for job 140058307931016 from 10.92.161.135
...
So... The node's first attempt to send back the results failed at 15:55:05. The sending succeeded two seconds later -- the client processed the results at 15:55:07.075 and moved on. Unfortunately, the node continued to spam the client with the same results -- in perpetuity...
As I type this, there are already about 20 jobs with this problem -- from different nodes -- out of about 1.5 million jobs processed.
That's an obvious bug. In addition, I think, it'd make sense for the initial log-entry "Could not send reply for job... " to include the actual exception -- what caused the sending to fail?..
Using dispy 4.15.2 and pycos 4.12.2... Thank you!
While looking at the log of a sizeable batch, I noticed strange errors being reported about
Ignoring invalid reply for job ....A closer investigation into one of such jobs showed, that the remote node tried to return the results, encountered an error, and then kept on trying -- forever...
Let's start with the node's log:
And here is, what the client's logging:
So... The node's first attempt to send back the results failed at 15:55:05. The sending succeeded two seconds later -- the client processed the results at 15:55:07.075 and moved on. Unfortunately, the node continued to spam the client with the same results -- in perpetuity...
As I type this, there are already about 20 jobs with this problem -- from different nodes -- out of about 1.5 million jobs processed.
That's an obvious bug. In addition, I think, it'd make sense for the initial log-entry "Could not send reply for job... " to include the actual exception -- what caused the sending to fail?..
Using dispy 4.15.2 and pycos 4.12.2... Thank you!