Occasionally our pystatsd instance freezes, maybe in a similar way as in #40. Strace showed a KeyError in the following line:
See server.py:118
if key not in self.timers:
self.timers[key] = [ [], ts ]
self.timers[key][0].append(float(value or 0))
self.timers[key][1] = ts # <--- KeyError
It seems that the check ensures that the key is always present in self.timers. However, there's a second thread running, flushing the contents of self.timers periodically, and also deleting the contents of self.timers.
See server.py:228
When the flush happens right between the check in line 115 and the access in lines 117 and 118 then a KeyError will be raised. We think this is exactly what happens.
We haven't seen it happen in self.__record_counter, but it looks vulnerable in the same way.
We haven't had time yet to come up with a patch, but maybe something like this would work:
Use a threading.Lock() to synchronize access to self.timers and self.counters. The self.flush() involves network communication and can therefore take a long time. We don't want to block the receiving code during that time. Therefore maybe just use the lock to (deep-)copy self.timers and reset it. Then release the lock and flush from the copy. (Or not copy at all, but just switch references, i.e. assign self.timers to a second variable pointing to the same dictionary and then assign self.timers an empty dict and continue flushing from the new variable.)
Occasionally our pystatsd instance freezes, maybe in a similar way as in #40. Strace showed a
KeyErrorin the following line:See server.py:118
It seems that the check ensures that the key is always present in
self.timers. However, there's a second thread running, flushing the contents ofself.timersperiodically, and also deleting the contents ofself.timers.See server.py:228
When the flush happens right between the check in line 115 and the access in lines 117 and 118 then a
KeyErrorwill be raised. We think this is exactly what happens.We haven't seen it happen in
self.__record_counter, but it looks vulnerable in the same way.We haven't had time yet to come up with a patch, but maybe something like this would work:
Use a
threading.Lock()to synchronize access toself.timersandself.counters. Theself.flush()involves network communication and can therefore take a long time. We don't want to block the receiving code during that time. Therefore maybe just use the lock to (deep-)copyself.timersand reset it. Then release the lock and flush from the copy. (Or not copy at all, but just switch references, i.e. assign self.timers to a second variable pointing to the same dictionary and then assign self.timers an empty dict and continue flushing from the new variable.)