8000 Makes getBackoffSleepMillis in ClusterCommandExecutor nondeterministic by adiamzn · Pull Request #3118 · redis/jedis · GitHub
[go: up one dir, main page]

Skip to content

Makes getBackoffSleepMillis in ClusterCommandExecutor nondeterministic #3118

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Sep 20, 2022

Conversation

adiamzn
Copy link
Contributor
@adiamzn adiamzn commented Aug 25, 2022

Currently sleepBackOffMillis is deterministic.

As can be seen by the benchmarks taken in this blog post, this out performed by backoff methods which add Jitter.

In the terminology of the blog post referenced, this PR introduces the "Full Jitter" method, with minimal code changes.

@sazzad16
Copy link
Contributor

@yangbodong22011 @dengliming Your reviews are always appreciated.

@walles @jensgreen You were first to introduce backoff. WDYT about this change?

@yangbodong22011
Copy link
Contributor

@adiamzn Hi, thanks for you PR, we don't have lock contention, just network queuing, so I'm not sure what your PR advantage is? Specifically:

  1. Suppose there are 10 JedisCluster client. At t1, Redis is down or ha, so the client access fails and starts to retry.
  2. t1+0.4 s starts for the first time, and still all fails (redis has not recovered yet)
  3. t1 + 0.6s start the second time, same as above.
  4. t1 + 0.8s Redis Cluster returns to normal.
  5. t1 + 1.0s start the third time, all successful.

Your PR may only be useful in eliminating traffic spikes, but not in speeding up the time to success. Is that what you're trying to do with this PR?

@adiamzn
Copy link
Contributor Author
adiamzn commented Aug 26, 2022

@yangbodong22011
Thanks for the comment,
Yes this PR is to mitigate traffic spikes, especially spikes in new connections which can be caused my multiple clients retrying commands at the same time. Spikes in new connections can be especially problematic for Redis clusters which serve a large number of connections.

@yangbodong22011
Copy link
Contributor

@yangbodong22011 Thanks for the comment, Yes this PR is to mitigate traffic spikes, especially spikes in new connections which can be caused my multiple clients retrying commands at the same time. Spikes in new connections can be especially problematic for Redis clusters which serve a large number of connections.

Can you share some diagrams or specific issues you encountered in production?

@adiamzn
Copy link
Contributor Author
adiamzn commented Aug 26, 2022

@yangbodong22011
It is common in large production environments to see spikes in new connection requests following an event in which the cluster was unresponsive.
This scenario is most easily solved by adding some backoff with jitter to the retry policy in the client.
Unfortunately I don't think I can share any diagrams or specific scenarios from production.
I think this blog post summarizes the benefit of adding jitter quite well.

Thanks!

@codecov-commenter
Copy link
codecov-commenter commented Aug 28, 2022

Codecov Report

Merging #3118 (3980173) into master (a692b47) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 3980173 differs from pull request most recent head ba10ade. Consider uploading reports for the commit ba10ade to get more accurate results

@@             Coverage Diff              @@
##             master    #3118      +/-   ##
============================================
- Coverage     66.55%   66.55%   -0.01%     
- Complexity     4386     4387       +1     
============================================
  Files           243      243              
  Lines         14225    14226       +1     
  Branches        851      851              
============================================
  Hits           9468     9468              
- Misses         4387     4389       +2     
+ Partials        370      369       -1     
Impacted Files Coverage Δ
...lients/jedis/executors/ClusterCommandExecutor.java 85.07% <100.00%> (+0.22%) ⬆️
...in/java/redis/clients/jedis/ConnectionFactory.java 63.26% <0.00%> (-4.09%) ⬇️
src/main/java/redis/clients/jedis/Jedis.java 84.95% <0.00%> (-0.05%) ⬇️
src/main/java/redis/clients/jedis/JedisPubSub.java 71.81% <0.00%> (+1.81%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@yangbodong22011
Copy link
Contributor

Unfortunately I don't think I can share any diagrams or specific scenarios from production.

@adiamzn OK, but you can test some data and compare the new algorithm you proposed with the old one to make everyone understand it better?

@adiamzn
Copy link
Contributor Author
adiamzn commented Aug 30, 2022

@yangbodong22011
Ideally yes, but this could take me a few weeks.
I think it would make sense to compare CPU usage and traffic spikes in a high client scenarios using this algorithm and the previous one.
Since this can take me a some time to get to, would you prefer I close this PR and open a new one when I have some results?
Thanks

@yangbodong22011
Copy link
Contributor

@adiamzn Thanks, I don't think a new PR is needed, we can just keep working on this PR and keep the stack for others to see.

@adiamzn
Copy link
Contributor Author
adiamzn commented Aug 31, 2022

@yangbodong22011 @sazzad16
I think its worth mentioning that similar jitter mechanisms exists in some other popular client libraries
redispy: redis/redis-py#1494
phpredis: phpredis/phpredis#1986
envoy proxy: https://github.com/envoyproxy/envoy/pull/19869/files

@sazzad16
Copy link
Contributor

@yangbodong22011 @adiamzn I am thinking about going forward with this change.

Before saying further, two equations:
A = millisLeft / (attemptsLeft * attemptsLeft)
B = millisLeft / (attemptsLeft * (attemptsLeft + 1))

In original PR where backOff time was added, the time was calculated as A. But in my testing I found that few tests here and there would get timeout exceeded. So I was looking for something definitely smaller than A, and I went with B.

Now in this PR, total sleep time is exptected to be half of total calculated backOff time. Contrary to current code where total sleep time is equals to total calculated time.

WDYT about considering A again but with this change (jitter) as sleep time is expected to be half of A which is smaller than A.

BTW, I have added tests of ClusterCommandExecutor in #3139

@sazzad16 sazzad16 added this to the 4.3.0 milestone Sep 14, 2022
@sazzad16 sazzad16 merged commit ff3f871 into redis:master Sep 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0