Makes getBackoffSleepMillis in ClusterCommandExecutor nondeterministic #3118

adiamzn · 2022-08-25T15:54:40Z

Currently sleepBackOffMillis is deterministic.

As the benchmarks in this blog post show, it is outperformed by backoff methods that add jitter.

In the terminology of the blog post referenced, this PR introduces the "Full Jitter" method, with minimal code changes.
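
For readers unfamiliar with the term, here is a minimal sketch of the difference between a deterministic sleep and "Full Jitter". It is not the code from this PR: the class name `FullJitterSketch`, both method names, and the `maxBackoffMillis` parameter are hypothetical and chosen only for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch only; names are hypothetical, not taken from Jedis.
final class FullJitterSketch {

  // Deterministic variant: every client that computes the same cap sleeps exactly that long.
  static long deterministicSleepMillis(long maxBackoffMillis) {
    return maxBackoffMillis;
  }

  // "Full Jitter": sleep for a uniformly random duration in [0, maxBackoffMillis].
  static long fullJitterSleepMillis(long maxBackoffMillis) {
    if (maxBackoffMillis <= 0) {
      return 0;
    }
    // nextLong(bound) excludes the bound, so add 1 to make the cap itself reachable.
    return ThreadLocalRandom.current().nextLong(maxBackoffMillis + 1);
  }

  public static void main(String[] args) {
    long cap = 200; // assumed cap in milliseconds
    System.out.println("deterministic: " + deterministicSleepMillis(cap));
    for (int i = 0; i < 3; i++) {
      System.out.println("full jitter:   " + fullJitterSleepMillis(cap));
    }
  }
}
```

With jitter, clients that start retrying at the same instant no longer wake up at the same instant, at the cost of a sleep time that varies from call to call.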

sazzad16 · 2022-08-26T01:19:02Z

@yangbodong22011 @dengliming Your reviews are always appreciated.

@walles @jensgreen You were first to introduce backoff. WDYT about this change?

yangbodong22011 · 2022-08-26T02:19:57Z

@adiamzn Hi, thanks for your PR. We don't have lock contention, just network queuing, so I'm not sure what advantage your PR brings. Specifically:

Suppose there are 10 JedisCluster clients. At t1, Redis goes down or fails over (HA), so client requests fail and the clients start to retry.
t1 + 0.4s: the first retry starts, and all clients still fail (Redis has not recovered yet).
t1 + 0.6s: the second retry starts; same as above.
t1 + 0.8s: Redis Cluster returns to normal.
t1 + 1.0s: the third retry starts, and all clients succeed.

Your PR may only be useful in eliminating traffic spikes, but not in speeding up the time to success. Is that what you're trying to do with this PR?

adiamzn · 2022-08-26T08:53:43Z

@yangbodong22011
Thanks for the comment,
Yes, this PR is to mitigate traffic spikes, especially spikes in new connections, which can be caused by multiple clients retrying commands at the same time. Spikes in new connections can be especially problematic for Redis clusters which serve a large number of connections.
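
To make the spike concrete, here is a small simulation sketch. It assumes 10 clients that all fail at the same instant and a 200 ms backoff, and it counts how many retries land in the same 10 ms window with and without jitter. The class `RetrySpikeSketch`, its method, and all of these numbers are assumptions for illustration, not measurements from a real cluster.

```java
import java.util.Random;

// Illustrative simulation only; names and numbers are hypothetical.
final class RetrySpikeSketch {

  // Returns the largest number of retries that fall into the same time bucket.
  static int peakRetriesPerBucket(int clients, int backoffMillis, int bucketMillis,
      boolean jitter, Random random) {
    int[] buckets = new int[backoffMillis / bucketMillis + 1];
    int peak = 0;
    for (int i = 0; i < clients; i++) {
      // Without jitter every client sleeps the full backoff;
      // with jitter each client draws uniformly from [0, backoffMillis].
      int sleep = jitter ? random.nextInt(backoffMillis + 1) : backoffMillis;
      int bucket = sleep / bucketMillis;
      buckets[bucket]++;
      peak = Math.max(peak, buckets[bucket]);
    }
    return peak;
  }

  public static void main(String[] args) {
    Random random = new Random(42); // fixed seed so runs are repeatable
    System.out.println("peak without jitter: "
        + peakRetriesPerBucket(10, 200, 10, false, random));
    System.out.println("peak with jitter:    "
        + peakRetriesPerBucket(10, 200, 10, true, random));
  }
}
```

Without jitter the peak always equals the number of clients, because every retry lands in the same window. With jitter the retries spread over the whole backoff interval, so the peak is typically much lower, although the time until the last client succeeds is not improved.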

yangbodong22011 · 2022-08-26T09:53:33Z

@yangbodong22011 Thanks for the comment, Yes, this PR is to mitigate traffic spikes, especially spikes in new connections, which can be caused by multiple clients retrying commands at the same time. Spikes in new connections can be especially problematic for Redis clusters which serve a large number of connections.

Can you share some diagrams or specific issues you encountered in production?

adiamzn · 2022-08-26T10:48:12Z

@yangbodong22011
It is common in large production environments to see spikes in new connection requests following an event in which the cluster was unresponsive.
This scenario is most easily solved by adding some backoff with jitter to the retry policy in the client.
Unfortunately I don't think I can share any diagrams or specific scenarios from production.
I think this blog post summarizes the benefit of adding jitter quite well.

Thanks!

codecov-commenter · 2022-08-28T15:55:41Z

Codecov Report

Merging #3118 (3980173) into master (a692b47) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 3980173 differs from pull request most recent head ba10ade. Consider uploading reports for the commit ba10ade to get more accurate results

@@             Coverage Diff              @@
##             master    #3118      +/-   ##
============================================
- Coverage     66.55%   66.55%   -0.01%     
- Complexity     4386     4387       +1     
============================================
  Files           243      243              
  Lines         14225    14226       +1     
  Branches        851      851              
============================================
  Hits           9468     9468              
- Misses         4387     4389       +2     
+ Partials        370      369       -1

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...lients/jedis/executors/ClusterCommandExecutor.java | `85.07% <100.00%> (+0.22%)` | ⬆️ |
| ...in/java/redis/clients/jedis/ConnectionFactory.java | `63.26% <0.00%> (-4.09%)` | ⬇️ |
| src/main/java/redis/clients/jedis/Jedis.java | `84.95% <0.00%> (-0.05%)` | ⬇️ |
| src/main/java/redis/clients/jedis/JedisPubSub.java | `71.81% <0.00%> (+1.81%)` | ⬆️ |

yangbodong22011 · 2022-08-30T02:22:06Z

Unfortunately I don't think I can share any diagrams or specific scenarios from production.

@adiamzn OK, but could you run some tests and compare the new algorithm you proposed with the old one, so that everyone can understand it better?

adiamzn · 2022-08-30T07:44:59Z

@yangbodong22011
Ideally yes, but this could take me a few weeks.
I think it would make sense to compare CPU usage and traffic spikes in a scenario with many clients, using this algorithm and the previous one.
Since it may take me some time to get to this, would you prefer that I close this PR and open a new one when I have some results?
Thanks

yangbodong22011 · 2022-08-30T07:52:06Z

@adiamzn Thanks, I don't think a new PR is needed, we can just keep working on this PR and keep the stack for others to see.

adiamzn · 2022-08-31T12:04:51Z

@yangbodong22011 @sazzad16
I think it's worth mentioning that similar jitter mechanisms exist in some other popular client libraries:
redis-py: redis/redis-py#1494
phpredis: phpredis/phpredis#1986
envoy proxy: https://github.com/envoyproxy/envoy/pull/19869/files

sazzad16 · 2022-09-14T06:37:40Z

@yangbodong22011 @adiamzn I am thinking about going forward with this change.

Before going further, here are two formulas:
A = millisLeft / (attemptsLeft * attemptsLeft)
B = millisLeft / (attemptsLeft * (attemptsLeft + 1))

In the original PR where the backoff time was added, the time was calculated as A. But in my testing I found that a few tests here and there would exceed the timeout. So I looked for something definitely smaller than A, and I went with B.

Now in this PR, the total sleep time is expected to be half of the total calculated backoff time, in contrast to the current code, where the total sleep time equals the total calculated time.

WDYT about considering A again, but with this change (jitter)? The sleep time would then be expected to be half of A, which is smaller than A.
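
As a worked example of the two formulas, the sketch below evaluates A and B for assumed values `millisLeft = 1000` and `attemptsLeft = 4`, and computes the mean of a full-jitter draw capped at A. The class and method names are hypothetical, and integer division is assumed, as it would be for `long` arithmetic in Java.

```java
// Illustrative sketch only; names and input values are hypothetical.
final class BackoffFormulaSketch {

  // A = millisLeft / (attemptsLeft * attemptsLeft)
  static long formulaA(long millisLeft, int attemptsLeft) {
    return millisLeft / ((long) attemptsLeft * attemptsLeft);
  }

  // B = millisLeft / (attemptsLeft * (attemptsLeft + 1))
  static long formulaB(long millisLeft, int attemptsLeft) {
    return millisLeft / ((long) attemptsLeft * (attemptsLeft + 1));
  }

  // Mean of a uniform draw from the integers 0..cap, i.e. the expected full-jitter sleep.
  static double expectedJitteredSleep(long cap) {
    return cap / 2.0;
  }

  public static void main(String[] args) {
    long millisLeft = 1000; // assumed time left until the deadline
    int attemptsLeft = 4;   // assumed remaining attempts
    long a = formulaA(millisLeft, attemptsLeft); // 1000 / 16 = 62
    long b = formulaB(millisLeft, attemptsLeft); // 1000 / 20 = 50
    System.out.println("A = " + a + ", B = " + b
        + ", expected jittered A = " + expectedJitteredSleep(a));
  }
}
```

For these inputs A is 62 ms, B is 50 ms, and the expected jittered sleep with cap A is 31 ms. Ignoring integer rounding, A / 2 is smaller than B whenever attemptsLeft is greater than 1, since 1 / (2n²) < 1 / (n(n + 1)) for n > 1. Note that this holds only on average: an individual jittered sleep can still be as large as A.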

BTW, I have added tests of ClusterCommandExecutor in #3139

src/main/java/redis/clients/jedis/executors/ClusterCommandExecutor.java

Makes getBackoffSleepMillis in ClusterCommandExecutor nondeterministic

ecd2090

adiamzn mentioned this pull request Aug 25, 2022

Add jitter to sleep time between retries in ClusterCommandExecutor #3117

Closed

sazzad16 requested review from dengliming, yangbodong22011 and sazzad16 August 26, 2022 01:18

sazzad16 added the wait for more reviews label Aug 26, 2022

Merge branch 'redis:master' into master

3da20ad

sazzad16 added this to the 4.3.0 milestone Sep 14, 2022

Merge branch 'master' into master

3980173

sazzad16 reviewed Sep 14, 2022

src/main/java/redis/clients/jedis/executors/ClusterCommandExecutor.java (outdated)

Larger maxBackOff

ba10ade

sazzad16 merged commit ff3f871 into redis:master Sep 20, 2022

sazzad16 added maintenance and removed wait for more reviews labels Sep 20, 2022

chayim added feature and removed maintenance labels Sep 20, 2022
