Cluster overwhelm countermeasures by neunhoef · Pull Request #13108 · arangodb/arangodb · GitHub

Cluster overwhelm countermeasures #13108


Merged: 106 commits from feature/prevent-cluster-overwhelm2 into devel on Dec 11, 2020

Conversation

@neunhoef (Member) commented Nov 27, 2020

This PR combines several measures that counter the bad behaviour of a cluster under high load. The document discussing these countermeasures is here:

https://github.com/arangodb/documents/blob/master/DesignDocuments/02_PLANNING/FightClusterOverwhelm.md

It started as a document collecting evidence, problems, and ideas for countermeasures; by now it serves as the design document for the changes actually made. A lot of discussion and testing will still be required.

Enterprise PR: https://github.com/arangodb/enterprise/pull/600

Scope & Purpose

(Please describe the changes in this PR for reviewers - mandatory)

  • [*] 💩 Bugfix (requires CHANGELOG entry)
  • [*] 🍕 New feature (requires CHANGELOG entry, feature documentation and release notes)
  • [*] 🔥 Performance improvement
  • 🔨 Refactoring/simplification
  • 📖 CHANGELOG entry made

Backports:

  • No backports required
  • [*] Backports required for: (Please specify versions and link PRs)
    We need to discuss how much of this is needed in 3.7 and 3.6, since the behaviour of clusters under load is pretty bad in all versions.

Related Information

(Please reference tickets / specification / other PRs etc)

Testing & Verification

(Please pick either of the following options)

  • This change is a trivial rework / code cleanup without any test coverage.
  • [*] The behavior in this PR was manually tested
  • This change is already covered by existing tests, such as (please describe tests).
  • This PR adds tests that were used to verify all changes:
    • Added new C++ Unit tests
    • Added new integration tests (e.g. in shell_server / shell_server_aql)
    • Added new resilience tests (only if the feature is impacted by failovers)
  • There are tests in an external testing repository:
  • I ensured this code runs with ASan / TSan or other static verification tools

http://172.16.10.101:8080/view/PR/job/arangodb-matrix-pr/13136/

@neunhoef neunhoef added this to the devel milestone Nov 27, 2020
@neunhoef neunhoef self-assigned this Nov 27, 2020
@neunhoef (Member, Author) commented:
Note that this currently does not compile on Windows due to an initializer list error. Will fix next week.


@mchacki (Member) left a comment:


I have left some inline comments which I would like to be considered. For the "production" code comments, none of them justifies blocking this PR, so I am fine with them not being addressed.

But I am missing CHANGELOG entries here, which I would like to sort out before accepting.

if ((::queueWarningTick++ & 0xFFu) == 0) {
  auto const& now = std::chrono::steady_clock::now();
  if (::conditionQueueFullSince == time_point{}) {
-   logQueueWarningEveryNowAndThen(::queueWarningTick, _maxFifoSize, approxQueueLength);
+   logQueueWarningEveryNowAndThen(::queueWarningTick, _maxFifoSizes[3], approxQueueLength);
mchacki (Member):

Do we want to add this warning to the prio 2 queue now as well?
With 4 lanes there is now a chance for lanes 2 and 3 to starve while lanes 0 and 1 are busy. (Lane 0 should be fine, but lane 1 certainly has the right to do heavy work, e.g. replication, which can trigger RocksDB write stalls.)
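
For context, the excerpt above follows a common log-throttling pattern: a shared tick counter suppresses all but every 256th check, and each of the four priority lanes has its own queue size limit. Below is a minimal standalone sketch of that pattern; the names (maybeWarnQueueFull, the maxFifoSizes parameter) are invented for illustration and are not the actual SupervisedScheduler code.

    #include <array>
    #include <atomic>
    #include <cstdint>
    #include <iostream>

    // Shared tick counter: only every 256th call may emit a warning,
    // which keeps the log readable under sustained overload.
    static std::atomic<std::uint64_t> queueWarningTick{0};

    // Hypothetical helper: warn if the queue of the given lane is at or
    // above its configured limit (one limit per priority lane).
    void maybeWarnQueueFull(std::size_t lane, std::size_t approxQueueLength,
                            std::array<std::size_t, 4> const& maxFifoSizes) {
      if ((queueWarningTick++ & 0xFFu) != 0) {
        return;  // throttled: skip 255 out of 256 checks
      }
      if (approxQueueLength >= maxFifoSizes[lane]) {
        std::cerr << "scheduler queue for lane " << lane << " is full: "
                  << approxQueueLength << " jobs queued, limit "
                  << maxFifoSizes[lane] << std::endl;
      }
    }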

// We limit the number of ongoing low priority jobs to prevent a cluster
// from getting overwhelmed
std::size_t const ongoing = _ongoingLowPriorityGauge.load();
if (ongoing >= _ongoingLowPriorityLimit) {
mchacki (Member):

For now this is good.
For the future: as soon as we have higher parallelization in AQL, this will not be enough anymore. (I am thinking about multiple diamonds in AQL where the upper diamond can work in parallel to the lower diamond. As soon as we have this, the number of fan-out requests in AQL is not bounded by the number of DB-Servers, but by the number of diamonds * DB-Servers.)
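
The snippet under discussion caps the number of low-priority requests that may execute concurrently. The following is a minimal sketch of that admission-control idea; the class and member names are illustrative only, and the check-then-increment is deliberately approximate, as in the excerpt above, since slightly overshooting the limit is harmless here.

    #include <atomic>
    #include <cstddef>

    // Illustrative sketch: an atomic gauge counts low-priority jobs that are
    // currently executing; once a configured limit is reached, new jobs stay
    // queued instead of being dispatched.
    class LowPrioAdmission {
     public:
      explicit LowPrioAdmission(std::size_t limit) : _limit(limit) {}

      // Returns true if the job may start now. The caller must invoke
      // finish() when the job completes, so the gauge is decremented again.
      bool tryStart() {
        if (_ongoing.load() >= _limit) {
          return false;  // leave the job in the queue for a later attempt
        }
        _ongoing.fetch_add(1);
        return true;
      }

      void finish() { _ongoing.fetch_sub(1); }

     private:
      std::atomic<std::size_t> _ongoing{0};
      std::size_t const _limit;
    };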

@mchacki (Member) left a comment:

OK, cross out everything I said before.
This PR adds a total of 0 tests for this very critical piece of code.
What is the plan to test this?

@mchacki (Member) left a comment:

These modifications address my comments very well 👍

The topic of missing tests is still open, though, so still no approval from my side.
Production code is approved.

@@ -490,12 +490,12 @@ bool CommTask::handleRequestSync(std::shared_ptr<RestHandler> handler) {
  // and there is currently only a single client handled by the IoContext
  auto cb = [self = shared_from_this(), handler = std::move(handler), prio]() mutable {
    if (prio == RequestPriority::LOW) {
-     SchedulerFeature::SCHEDULER->ongoingLowPriorityTasks() += 1;
+     SchedulerFeature::SCHEDULER->trackBeginOngoingLowPriorityTask();
mchacki (Member):

\o/

- Gauge<uint64_t>& SupervisedScheduler::ongoingLowPriorityTasks() {
-   return _ongoingLowPriorityGauge;
+ void SupervisedScheduler::trackBeginOngoingLowPriorityTask() {
+   if (_server.isStopping()) {
mchacki (Member):

Yay, this is safe, I like it.
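
The point of the change above is that the gauge is no longer exposed directly; the scheduler encapsulates it behind a tracking call that checks for server shutdown. A hypothetical sketch of that idea follows; it is not the actual SupervisedScheduler implementation, and the trackEndOngoingLowPriorityTask counterpart is assumed here.

    #include <atomic>
    #include <cstdint>

    // Hypothetical sketch: the ongoing-low-priority gauge is only maintained
    // while the server is running; once shutdown begins, both tracking calls
    // become no-ops, so an unmatched begin/end during teardown cannot leave
    // the gauge in a misleading state.
    class Scheduler {
     public:
      void beginShutdown() { _stopping.store(true); }

      void trackBeginOngoingLowPriorityTask() {
        if (_stopping.load()) {
          return;  // no bookkeeping during shutdown
        }
        _ongoingLowPriorityGauge.fetch_add(1);
      }

      void trackEndOngoingLowPriorityTask() {
        if (_stopping.load()) {
          return;
        }
        _ongoingLowPriorityGauge.fetch_sub(1);
      }

     private:
      std::atomic<bool> _stopping{false};
      std::atomic<std::uint64_t> _ongoingLowPriorityGauge{0};
    };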

@jsteemann jsteemann merged commit b9a5b52 into devel Dec 11, 2020
@jsteemann jsteemann deleted the feature/prevent-cluster-overwhelm2 branch December 11, 2020 13:28
@goedderz mentioned this pull request Aug 25, 2021