connect is a fairly expensive operation for the cache server. At large volume, it can temporarily overwhelm the server thread that accepts new connections, leading to test instability. Since it's currently impossible to coordinate different rpcperf instances at a fine granularity, I think we should add a few features to the connect behavior to reduce the chance of "reconnect storms":
- allow connect to be rate-limited, with the rate set in the config;
- allow timeouts to be retried with exponential backoff, with a configurable cap on the backoff delay.
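
To make the two proposals concrete, here is a rough sketch of how they could fit together. It is not rpcperf's current code: `ConnectConfig`, its field names, `backoff_delay`, and `ConnectLimiter` are all hypothetical names picked for illustration, and the limiter is a simple fixed-interval pacer rather than any particular existing implementation.

```rust
use std::time::{Duration, Instant};

/// Hypothetical config knobs; the names are illustrative, not existing rpcperf options.
pub struct ConnectConfig {
    /// Maximum connects per second; 0 means unlimited.
    pub ratelimit: u32,
    /// Delay before the first retry after a timeout.
    pub backoff_base: Duration,
    /// Upper bound on the retry delay.
    pub backoff_cap: Duration,
}

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
pub fn backoff_delay(cfg: &ConnectConfig, attempt: u32) -> Duration {
    // Saturate the multiplier instead of overflowing on large attempt counts.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    cfg.backoff_base
        .checked_mul(factor)
        .map_or(cfg.backoff_cap, |d| d.min(cfg.backoff_cap))
}

/// Paces connects so that at most `ratelimit` are started per second.
pub struct ConnectLimiter {
    interval: Duration,
    next: Instant,
}

impl ConnectLimiter {
    pub fn new(cfg: &ConnectConfig, now: Instant) -> Self {
        let interval = if cfg.ratelimit == 0 {
            Duration::ZERO
        } else {
            Duration::from_secs(1) / cfg.ratelimit
        };
        Self { interval, next: now }
    }

    /// Reserves the next connect slot and returns how long the caller
    /// should wait (from `now`) before actually calling connect.
    pub fn reserve(&mut self, now: Instant) -> Duration {
        let slot = self.next.max(now);
        self.next = slot + self.interval;
        slot.saturating_duration_since(now)
    }
}
```

A worker would call `reserve` before each connect and wait for the returned duration; on a timeout it would wait `backoff_delay(&cfg, attempt)` and then go through the limiter again, so retries are also subject to the configured rate.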