Cluster overwhelm countermeasures by neunhoef · Pull Request #13108 · arangodb/arangodb · GitHub

Cluster overwhelm countermeasures #13108

Merged
merged 106 commits into from
Dec 11, 2020
Changes from 90 commits
91377a8
Configurable parameter for synchronous replication timeout and better…
Oct 30, 2020
42356c1
Some experiments
Nov 9, 2020
28c856e
Merge remote-tracking branch 'origin/devel' into feature/prevent-clus…
neunhoef Nov 9, 2020
3b48bb6
Take out misuse of set_symmetric_difference.
neunhoef Nov 12, 2020
677337e
Introduce _onGoing counter.
neunhoef Nov 12, 2020
f14b515
Merge remote-tracking branch 'origin/devel' into feature/prevent-clus…
neunhoef Nov 13, 2020
a57773b
Add metrics and implement throttle on coordinators.
neunhoef Nov 13, 2020
2d1def6
Fix atomicity of metrics.
neunhoef Nov 13, 2020
ec55028
Fix metrics atomics.
neunhoef Nov 16, 2020
64c2c8b
fixed gauge assignment operators
kvahed Nov 16, 2020
d3a52a8
tests
kvahed Nov 17, 2020
aa854ed
tiny
kvahed Nov 17, 2020
578a0e3
fixed metrics for correct floating point atomics
kvahed Nov 17, 2020
349840b
Merge remote-tracking branch 'origin/devel' into feature/prevent-clus…
neunhoef Nov 18, 2020
8a1f00d
devel merge
kvahed Nov 19, 2020
44f45d1
better testing
kvahed Nov 25, 2020
1b8ddc5
Scheduler revamp, lots of other changes.
neunhoef Nov 27, 2020
12ef863
Essentially get rid of synchronous replication timeout.
neunhoef Nov 27, 2020
2b2a95a
Take out unnecessary option.
neunhoef Nov 27, 2020
d437b6b
Replication of commit and abort transaction on high prio queue.
neunhoef Nov 27, 2020
59c96ff
Merge remote-tracking branch 'origin/devel' into feature/prevent-clus…
neunhoef Nov 27, 2020
ee23e9d
Trigger RocksDB throttle earlier.
neunhoef Nov 27, 2020
2f54ff5
Take out lots of debugging output.
neunhoef Dec 1, 2020
fc180c8
Merge branch 'devel' into feature/prevent-cluster-overwhelm2
Dec 1, 2020
dab1d46
Merge remote-tracking branch 'origin/devel' into feature/prevent-clus…
neunhoef Dec 2, 2020
6280342
Do not change RocksDB write buffer number after all.
neunhoef Dec 2, 2020
7582e38
Do not touch RocksDB throttle for now.
neunhoef Dec 2, 2020
a5f5020
Some cleanup w.r.t. ongoing throtteling.
neunhoef Dec 2, 2020
d354ab1
Permute some definitions, more like devel.
neunhoef Dec 2, 2020
c0a0cb0
Take out inflight again.
neunhoef Dec 2, 2020
7cf76e2
Remove unused code.
neunhoef Dec 2, 2020
7b82262
Whitespacing.
neunhoef Dec 2, 2020
c42fe2c
Rearrange stuff in SupervisedScheduler.
neunhoef Dec 2, 2020
5231dbc
More rearranging.
neunhoef Dec 2, 2020
6e1980f
Transaction time limit down to 10s again.
neunhoef Dec 2, 2020
e119ae9
Restore startLocalCluster.sh
neunhoef Dec 2, 2020
c19a9bb
Adjust a comment.
neunhoef Dec 2, 2020
400c1fb
Adjust lower limit for replication timeout.
neunhoef Dec 2, 2020
aff9c2d
CHANGELOG.
neunhoef Dec 2, 2020
9d466a1
Try to fix Windows compilation.
neunhoef Dec 2, 2020
4f1bc5e
Merge branch 'feature/prevent-cluster-overwhelm2' of github.com:arang…
Dec 2, 2020
45acdb1
Merge branch 'devel' into feature/prevent-cluster-overwhelm2
Dec 2, 2020
3d5e9b8
Prepare for merge.
Dec 3, 2020
893783e
Merge other branch
Dec 3, 2020
dc8159a
Fix Mac compilation.
Dec 3, 2020
215831e
Move AQL continuations to new lane
Dec 3, 2020
0c1437f
Actually move to new lane
Dec 3, 2020
e463e39
Remove synchronous commit from document handler
Dec 3, 2020
1048caa
Merge branch 'devel' into feature/prevent-cluster-overwhelm2
Dec 4, 2020
ac5a331
Revert temporary workaround for issue that should be fixed elsewhere
Dec 4, 2020
d9b9da2
Ensure max is at least min
Dec 4, 2020
467822d
Make ongoing task accounting clearer
Dec 4, 2020
dea468d
Clarification
Dec 4, 2020
fc97d24
Fix plurality for consistency
Dec 4, 2020
5fe4b0d
Address review comment
Dec 4, 2020
70c6692
Fix compilation
Dec 4, 2020
92fbaf8
Update arangod/Replication/ReplicationMetricsFeature.cpp
Dec 4, 2020
cff195d
Update arangod/Replication/ReplicationMetricsFeature.cpp
Dec 4, 2020
983c57c
Update arangod/Scheduler/SchedulerFeature.cpp
Dec 4, 2020
d44c59b
Update arangod/Scheduler/SchedulerFeature.cpp
Dec 4, 2020
f1cfcbe
Cleanup new parameters
Dec 4, 2020
d5957f4
Update arangod/Scheduler/SupervisedScheduler.cpp
Dec 4, 2020
dc4bf54
Fix atomic ordering for gauge
Dec 4, 2020
c02ac9f
Merge branch 'feature/prevent-cluster-overwhelm2' of github.com:arang…
Dec 4, 2020
359fb16
Add introduced for new flag
Dec 4, 2020
bea1b10
Update arangod/Scheduler/SupervisedScheduler.h
Dec 4, 2020
21f4215
Update arangod/Scheduler/SupervisedScheduler.cpp
Dec 4, 2020
fb354cf
Update arangod/Scheduler/SupervisedScheduler.cpp
Dec 4, 2020
01d9f45
Update arangod/Scheduler/SupervisedScheduler.cpp
Dec 4, 2020
03b3d27
Update arangod/Scheduler/SupervisedScheduler.cpp
Dec 4, 2020
5f3999a
Rename some variables
Dec 4, 2020
43593e8
Merge branch 'feature/prevent-cluster-overwhelm2' of github.com:arang…
Dec 4, 2020
363ef2c
Add introduced for new parameter
Dec 4, 2020
dab2218
Deredundify some wording
Dec 4, 2020
df74bce
Set another introduced flag
Dec 4, 2020
acf47a0
Update arangod/Scheduler/SupervisedScheduler.cpp
Dec 4, 2020
db009e1
Save map lookup
Dec 4, 2020
6872713
Merge branch 'feature/prevent-cluster-overwhelm2' of github.com:arang…
Dec 4, 2020
c73af0d
Merge branch 'devel' into bug-fix/gauge-assignment-opera
neunhoef Dec 7, 2020
685ea8b
Address review comment about integral efficiency
Dec 7, 2020
6a2ff56
Revert some changes to metrics that are covered in another PR.
Dec 7, 2020
7f1293f
Merge branch 'devel' into feature/prevent-cluster-overwhelm2
jsteemann Dec 7, 2020
1a8bd32
Fix Windows and Mac compilation
Dec 7, 2020
1407d37
Merge branch 'feature/prevent-cluster-overwhelm2' of github.com:arang…
Dec 7, 2020
3171aec
relaxed precision on tests
kvahed Dec 8, 2020
a861533
keep memory order as default
kvahed Dec 8, 2020
186c045
Merge branch 'devel' into bug-fix/gauge-assignment-opera
neunhoef Dec 8, 2020
161410a
Merge remote-tracking branch 'origin/devel' into feature/prevent-clus…
neunhoef Dec 8, 2020
3934fb4
Merge remote-tracking branch 'origin/bug-fix/gauge-assignment-opera' …
neunhoef Dec 8, 2020
1f16b99
Fix compilation.
neunhoef Dec 8, 2020
056d0f0
Add missing headers
Dec 8, 2020
e2d8d67
Address some review comments
Dec 8, 2020
7ce9059
Merge branch 'devel' into feature/prevent-cluster-overwhelm2
Dec 8, 2020
825ed38
Fix some issues
Dec 8, 2020
c12a833
Fix a dependency
Dec 8, 2020
f54d0bb
Fix feature dependency in mock
Dec 8, 2020
9398260
Fix feature re-add
Dec 8, 2020
65badf8
fix ubsan issues
jsteemann Dec 8, 2020
f27e69c
speed up shortest path unit tests
jsteemann Dec 8, 2020
2938056
fix MSVC compile warning
jsteemann Dec 8, 2020
ab6ce5b
Merge branch 'devel' of github.com:arangodb/arangodb into feature/clu…
jsteemann Dec 8, 2020
449877b
Merge branch 'devel' of github.com:arangodb/arangodb into feature/pre…
jsteemann Dec 9, 2020
0af7ff7
fix potential nullptr access
jsteemann Dec 9, 2020
12f0e70
make APIs of network::Response safer to use
jsteemann Dec 9, 2020
5a10472
Merge branch 'devel' of github.com:arangodb/arangodb into feature/pre…
jsteemann Dec 9, 2020
24d3d0c
Merge branch 'devel' into feature/prevent-cluster-overwhelm2
jsteemann Dec 11, 2020
9 changes: 8 additions & 1 deletion 3rdParty/fuerte/include/fuerte/message.h
@@ -133,7 +133,7 @@ struct ResponseHeader final : public MessageHeader {
// from (Response) a server.
class Message {
protected:
Message() = default;
Message() : _timestamp(std::chrono::steady_clock::now()) {}
virtual ~Message() = default;

public:
@@ -174,6 +174,13 @@ class Message {
bool isContentTypeVPack() const;
bool isContentTypeHtml() const;
bool isContentTypeText() const;

std::chrono::steady_clock::time_point timestamp() const { return _timestamp; }
// set timestamp when it was sent
void timestamp(std::chrono::steady_clock::time_point t) { _timestamp = t; }

private:
std::chrono::steady_clock::time_point _timestamp;
};

// Request contains the message send to a server in a request.
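
Note on the hunk above: storing a `steady_clock` time point on every message makes it possible to compute how long a request has been in flight, which is presumably what the new internal request duration metric builds on. A minimal sketch, assuming a reduced stand-in class; `TimedMessage` and `secondsSince` are hypothetical names and not part of fuerte:

```cpp
#include <chrono>

// Hypothetical stand-in for fuerte::Message, reduced to the timestamp
// members added in this hunk.
class TimedMessage {
 public:
  // The timestamp defaults to the construction time.
  TimedMessage() : _timestamp(std::chrono::steady_clock::now()) {}

  std::chrono::steady_clock::time_point timestamp() const { return _timestamp; }
  // Overwrite the timestamp, e.g. at the moment the message is actually sent.
  void timestamp(std::chrono::steady_clock::time_point t) { _timestamp = t; }

 private:
  std::chrono::steady_clock::time_point _timestamp;
};

// Hypothetical helper: seconds elapsed between the message timestamp and `now`.
inline double secondsSince(TimedMessage const& msg,
                           std::chrono::steady_clock::time_point now) {
  return std::chrono::duration<double>(now - msg.timestamp()).count();
}
```

`steady_clock` is monotonic, so the difference cannot go negative when the wall clock is adjusted, which is why it suits duration measurements better than `system_clock`.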
38 changes: 31 additions & 7 deletions CHANGELOG
@@ -1,6 +1,30 @@
devel
-----

* Throttle work coming from low priority queue, according to a constant
and to an estimate taking into account fanout for multi-shard operations.

* Move to 4 priority levels "low", "medium", "high" and "maintenance" in
scheduler to ensure that maintenance work and diagnostics is always
possible, even in the case of RocksDB throttles. Do not allow any
RocksDB work on "maintenance".

* Commit replications on high priority queue.

* Essentially get rid of timeout in replication to drop followers. This
is now entirely handled via reboot and failure tracking. The timeout
has now a default minimum of 15 minutes but can still be configured via
options.

* Additional metrics for all queue lengths and low prio ongoing work.

* New metric for number and total time of replication operations.

* New metrics for number of internal requests in flight, internal request
duration, and internal request timeouts

* Fix Gauge class' assignment operators.

* Fix Windows directory creation error handling.

* Add an AQL query kill check during early pruning. Fixes issue #13141.
@@ -109,11 +133,11 @@ devel
* Added metrics for the system CPU usage:
- `arangodb_server_statistics_user_percent`: Percentage of time that the
system CPUs have spent in user mode
- `arangodb_server_statistics_system_percent`: Percentage of time that
- `arangodb_server_statistics_system_percent`: Percentage of time that
the system CPUs have spent in kernel mode
- `arangodb_server_statistics_idle_percent`: Percentage of time that the
- `arangodb_server_statistics_idle_percent`: Percentage of time that the
system CPUs have been idle
- `arangodb_server_statistics_iowait_percent`: Percentage of time that
- `arangodb_server_statistics_iowait_percent`: Percentage of time that
the system CPUs have been waiting for I/O

These metrics resemble the overall CPU usage metrics in `top`.
@@ -127,19 +151,19 @@ devel
configured value was effectively clamped to a value of `1`.

* Improvements for the Pregel distributed graph processing feature:
- during the loading/startup phase, the in-memory edge cache is now
intentionally bypassed. The reason for this is that any edges are
- during the loading/startup phase, the in-memory edge cache is now
intentionally bypassed. The reason for this is that any edges are
looked up exactly once, so caching them is not beneficial, but would
only lead to cache pollution.
- the loading/startup phase can now load multiple collections in parallel,
whereas previously it was only loading multiple shards of the same
collection in parallel. This change helps to reduce load times in case
there are many collections with few shards, and on single server.
- the loading and result storage phases code has been overhauled so that
- the loading and result storage phases code has been overhauled so that
it runs slightly faster.
- for Pregel runs that are based on named graphs (in contrast to explicit
naming of the to-be-used vertex and edge collections), only those edge
collections are considered that, according to the graph definition, can
collections are considered that, according to the graph definition, can
have connections with the vertex. This change can reduce the loading
time substantially in case the graph contains many edge definitions.
- the number of executed rounds for the underlying Pregel algorithm now
Expand Down
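
Note on the new CHANGELOG entries: the four priority levels and the low priority throttle can be pictured with a small single-threaded model. This is only a sketch under stated assumptions; `TinyScheduler`, `maxOngoingLow` and `lowPriorityDone` are hypothetical names, the real `SupervisedScheduler` is multi-threaded, and its fanout estimate is not modelled here.

```cpp
#include <array>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// The four levels named in the CHANGELOG, highest priority first.
enum class Priority : std::size_t { Maintenance = 0, High = 1, Medium = 2, Low = 3 };

// Hypothetical single-threaded model of a four-level queue with a cap on
// concurrently ongoing low priority jobs.
class TinyScheduler {
 public:
  explicit TinyScheduler(std::size_t maxOngoingLow)
      : _maxOngoingLow(maxOngoingLow), _ongoingLow(0) {}

  void queue(Priority p, std::function<void()> job) {
    _queues[static_cast<std::size_t>(p)].push_back(std::move(job));
  }

  // Returns the next eligible job, or an empty function if none is eligible.
  std::function<void()> next() {
    for (std::size_t i = 0; i < _queues.size(); ++i) {
      std::deque<std::function<void()>>& q = _queues[i];
      if (q.empty()) {
        continue;
      }
      if (i == static_cast<std::size_t>(Priority::Low)) {
        if (_ongoingLow >= _maxOngoingLow) {
          continue;  // throttled: too much low priority work is in progress
        }
        ++_ongoingLow;
      }
      std::function<void()> job = std::move(q.front());
      q.pop_front();
      return job;
    }
    return std::function<void()>();
  }

  // Must be called when a low priority job has fully completed.
  void lowPriorityDone() {
    if (_ongoingLow > 0) {
      --_ongoingLow;
    }
  }

 private:
  std::array<std::deque<std::function<void()>>, 4> _queues;
  std::size_t _maxOngoingLow;
  std::size_t _ongoingLow;
};
```

Because the maintenance queue is always inspected first and is never throttled, diagnostics stay reachable even when all low priority slots are occupied.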
3 changes: 2 additions & 1 deletion arangod/Aql/Query.cpp
@@ -1179,7 +1179,8 @@ futures::Future<Result> finishDBServerParts(Query& query, int errorCode) {
network::RequestOptions options;
options.database = query.vocbase().name();
options.timeout = network::Timeout(60.0); // Picked arbitrarily
// options.skipScheduler = true;
options.continuationLane = RequestLane::CLUSTER_AQL_CONTINUATION;
// options.skipScheduler = true;

VPackBuffer<uint8_t> body;
VPackBuilder builder(body);
61 changes: 30 additions & 31 deletions arangod/Aql/SharedQueryState.cpp
@@ -128,37 +128,36 @@ void SharedQueryState::queueHandler() {
// We are shutting down
return;
}

bool queued = scheduler->queue(RequestLane::CLUSTER_AQL,
[self = shared_from_this(),
cb = _wakeupCb,
v = _cbVersion]() {

std::unique_lock<std::mutex> lck(self->_mutex, std::defer_lock);

do {
bool cntn = false;
try {
cntn = cb();
} catch (...) {}

lck.lock();
if (v == self->_cbVersion) {
unsigned c = self->_numWakeups--;
TRI_ASSERT(c > 0);
if (c == 1 || !cntn || !self->_valid) {
break;
}
} else {
return;
}
lck.unlock();
} while (true);

TRI_ASSERT(lck);
self->queueHandler();
});


bool queued =
scheduler->queue(RequestLane::CLUSTER_AQL_CONTINUATION,
[self = shared_from_this(), cb = _wakeupCb, v = _cbVersion]() {
std::unique_lock<std::mutex> lck(self->_mutex, std::defer_lock);

do {
bool cntn = false;
try {
cntn = cb();
} catch (...) {
}

lck.lock();
if (v == self->_cbVersion) {
unsigned c = self->_numWakeups--;
TRI_ASSERT(c > 0);
if (c == 1 || !cntn || !self->_valid) {
break;
}
} else {
return;
}
lck.unlock();
} while (true);

TRI_ASSERT(lck);
self->queueHandler();
});

if (!queued) { // just invalidate
_wakeupCb = nullptr;
_valid = false;
6 changes: 6 additions & 0 deletions arangod/Cluster/ClusterTrxMethods.cpp
@@ -217,6 +217,12 @@ Future<Result> commitAbortTransaction(transaction::Methods& trx, transaction::St

TransactionId tidPlus = state->id().child();
const std::string path = "/_api/transaction/" + std::to_string(tidPlus.id());
if (state->isDBServer()) {
// This is a leader replicating the transaction commit or abort and
// we should tell the follower that this is a replication operation.
// It will then execute the request with a higher priority.
reqOpts.param(StaticStrings::IsSynchronousReplicationString, ServerState::instance()->getId());
}

fuerte::RestVerb verb;
if (status == transaction::Status::COMMITTED) {
Expand Down
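
Note on the hunk above: the leader tags the commit or abort request with a query parameter so that the follower can run it with higher priority. How the follower maps that tag to a queue is not shown in this diff; the sketch below is one possible reading. The parameter key `isSynchronousReplication` is an assumption about the value of `StaticStrings::IsSynchronousReplicationString`, and `QueuePriority` and `priorityFor` are hypothetical names.

```cpp
#include <map>
#include <string>

// Hypothetical two-level view of the scheduler priorities.
enum class QueuePriority { Low, High };

// Assumed value of StaticStrings::IsSynchronousReplicationString.
static const std::string kIsSyncReplication = "isSynchronousReplication";

// Hypothetical follower-side decision: a request that carries a non-empty
// leader id in the replication parameter is treated as replication traffic
// and scheduled with high priority; everything else stays low priority.
inline QueuePriority priorityFor(std::map<std::string, std::string> const& params) {
  auto it = params.find(kIsSyncReplication);
  if (it != params.end() && !it->second.empty()) {
    return QueuePriority::High;
  }
  return QueuePriority::Low;
}
```

Running replicated commits on the high priority queue keeps followers from falling behind while ordinary client work is being throttled.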
2 changes: 1 addition & 1 deletion arangod/Cluster/MaintenanceRestHandler.h
@@ -43,7 +43,7 @@ class MaintenanceRestHandler : public RestBaseHandler {
char const* name() const override { return "MaintenanceRestHandler"; }

RequestLane lane() const override final {
return RequestLane::CLUSTER_INTERNAL;
return RequestLane::CLIENT_FAST;
}

/// @brief Performs routing of request to appropriate subroutines
38 changes: 33 additions & 5 deletions arangod/Cluster/ReplicationTimeoutFeature.cpp
@@ -24,6 +24,7 @@
#include "ReplicationTimeoutFeature.h"

#include "FeaturePhases/DatabaseFeaturePhase.h"
#include "RestServer/MetricsFeature.h"
#include "StorageEngine/EngineSelectorFeature.h"
#include "StorageEngine/StorageEngine.h"

@@ -33,7 +34,18 @@ namespace arangodb {

double ReplicationTimeoutFeature::timeoutFactor = 1.0;
double ReplicationTimeoutFeature::timeoutPer4k = 0.1;
double ReplicationTimeoutFeature::lowerLimit = 30.0; // longer than heartbeat timeout
double ReplicationTimeoutFeature::lowerLimit = 900.0; // used to be 30.0
double ReplicationTimeoutFeature::upperLimit = 3600.0; // used to be 120.0

// We essentially stop using a meaningful timeout for this operation.
// This is achieved by setting the default for the minimal timeout to 1h or 3600s.
// The reason behind this is the following: We have to live with RocksDB stalls
// and write stops, which can happen in overload situations. Then, no meaningful
// timeout helps and it is almost certainly better to keep trying to not have
// to drop the follower and make matters worse. In case of an actual failure
// (or indeed a restart), the follower is marked as failed and its reboot id is
// increased. As a consequence, the connection is aborted and we run into an
// error anyway. This is when a follower will be dropped.

ReplicationTimeoutFeature::ReplicationTimeoutFeature(application_features::ApplicationServer& server)
: ApplicationFeature(server, "ReplicationTimeout") {
@@ -44,10 +56,16 @@ ReplicationTimeoutFeature::ReplicationTimeoutFeature(application_features::Appli
void ReplicationTimeoutFeature::collectOptions(std::shared_ptr<ProgramOptions> options) {
options->addSection("cluster", "Configure the cluster");

options->addOption(
"--cluster.synchronous-replication-timeout-minimum",
"all synchronous replication timeouts will be at least this value (in seconds)",
new DoubleParameter(&lowerLimit));
options->addOption("--cluster.synchronous-replication-timeout-minimum",
"all synchronous replication timeouts will be at least "
"this value (in seconds)",
new DoubleParameter(&lowerLimit));

options->addOption("--cluster.synchronous-replication-timeout-maximum",
"all synchronous replication timeouts will be at most "
"this value (in seconds)",
new DoubleParameter(&upperLimit))
.setIntroducedIn(30800);

options->addOption(
"--cluster.synchronous-replication-timeout-factor",
@@ -62,4 +80,14 @@ void ReplicationTimeoutFeature::collectOptions(std::shared_ptr<ProgramOptions> o
arangodb::options::makeDefaultFlags(arangodb::options::Flags::Hidden));
}

void ReplicationTimeoutFeature::validateOptions(std::shared_ptr<ProgramOptions> options) {
if (upperLimit < lowerLimit) {
LOG_TOPIC("8a9f3", WARN, Logger::CONFIG)
<< "--cluster.synchronous-replication-timeout-maximum must be at least "
<< "--cluster.synchronous-replication-timeout-minimum, setting max to "
<< "min";
upperLimit = lowerLimit;
}
}

} // namespace arangodb
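
Note on the file above: the four static members suggest that a per-request timeout is scaled by payload size and then clamped between the minimum and maximum. The call site is not part of this diff, so the formula below is an assumption; only the member names and the default values (900 s and 3600 s, taken from the initialisers in the hunk) come from the source. `ReplicationTimeoutConfig`, `validate` and `chooseTimeout` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>

// Mirrors the static members of ReplicationTimeoutFeature with the defaults
// shown in the hunk above.
struct ReplicationTimeoutConfig {
  double timeoutFactor = 1.0;
  double timeoutPer4k = 0.1;
  double lowerLimit = 900.0;
  double upperLimit = 3600.0;
};

// Same rule as validateOptions(): the maximum is raised to the minimum if
// the two were configured the wrong way round.
inline void validate(ReplicationTimeoutConfig& cfg) {
  if (cfg.upperLimit < cfg.lowerLimit) {
    cfg.upperLimit = cfg.lowerLimit;
  }
}

// Assumed formula: scale by payload size in 4 KiB units, then clamp.
// Expects a validated config (lowerLimit <= upperLimit).
inline double chooseTimeout(ReplicationTimeoutConfig const& cfg,
                            std::size_t payloadBytes) {
  double raw = cfg.timeoutFactor * cfg.timeoutPer4k *
               (static_cast<double>(payloadBytes) / 4096.0);
  return std::min(std::max(raw, cfg.lowerLimit), cfg.upperLimit);
}
```

With these defaults, any payload below roughly 36 MB ends up at the 900 s floor, which matches the stated intent of no longer relying on the timeout to drop followers.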
6 changes: 5 additions & 1 deletion arangod/Cluster/ReplicationTimeoutFeature.h
@@ -25,8 +25,10 @@
#define ARANGOD_CLUSTER_REPLICATION_TIMEOUT_FEATURE_H 1

#include "Basics/Common.h"
#include "RestServer/Metrics.h"

#include "ApplicationFeatures/ApplicationFeature.h"
#include "ApplicationFeatures/ApplicationServer.h"

namespace arangodb {

@@ -35,12 +37,14 @@ class ReplicationTimeoutFeature : public application_features::ApplicationFeatur
explicit ReplicationTimeoutFeature(application_features::ApplicationServer& server);

void collectOptions(std::shared_ptr<options::ProgramOptions>) override final;
void validateOptions(std::shared_ptr<options::ProgramOptions>) override final;

static double timeoutFactor;
static double timeoutPer4k;
static double lowerLimit;
static double upperLimit;
};

} // namespace arangodb

#endif
#endif
1 change: 0 additions & 1 deletion arangod/Cluster/RestClusterHandler.cpp
@@ -153,7 +153,6 @@ void RestClusterHandler::handleClusterInfo() {
auto& ci = server().getFeature<ClusterFeature>().clusterInfo();
auto dump = ci.toVelocyPack();

LOG_DEVEL << dump.toJson();
generateResult(rest::ResponseCode::OK, dump.slice());

}
2 changes: 2 additions & 0 deletions arangod/FeaturePhases/ServerFeaturePhase.cpp
@@ -26,6 +26,7 @@
#include "FeaturePhases/AqlFeaturePhase.h"
#include "GeneralServer/GeneralServerFeature.h"
#include "GeneralServer/SslServerFeature.h"
#include "Network/NetworkFeature.h"
#include "RestServer/EndpointFeature.h"
#include "RestServer/ServerFeature.h"
#include "RestServer/UpgradeFeature.h"
@@ -41,6 +42,7 @@ ServerFeaturePhase::ServerFeaturePhase(ApplicationServer& server)

startsAfter<EndpointFeature>();
startsAfter<GeneralServerFeature>();
startsAfter<NetworkFeature>();
startsAfter<ServerFeature>();
startsAfter<SslServerFeature>();
startsAfter<StatisticsFeature>();
12 changes: 10 additions & 2 deletions arangod/GeneralServer/CommTask.cpp
@@ -480,15 +480,23 @@ bool CommTask::handleRequestSync(std::shared_ptr<RestHandler> handler) {
handler->statistics().SET_QUEUE_START(SchedulerFeature::SCHEDULER->queueStatistics()._queued);

RequestLane lane = handler->getRequestLane();
RequestPriority prio = PriorityRequestLane(lane);

ContentType respType = handler->request()->contentTypeResponse();
uint64_t mid = handler->messageId();

// queue the operation in the scheduler, and make it eligible for direct execution
// only if the current CommTask type allows it (HttpCommTask: yes, CommTask: no)
// and there is currently only a single client handled by the IoContext
auto cb = [self = shared_from_this(), handler = std::move(handler)]() mutable {
auto cb = [self = shared_from_this(), handler = std::move(handler), prio]() mutable {
if (prio == RequestPriority::LOW) {
SchedulerFeature::SCHEDULER->ongoingLowPriorityTasks() += 1;
}
handler->statistics().SET_QUEUE_END();
handler->runHandler([self = std::move(self)](rest::RestHandler* handler) {
handler->runHandler([self = std::move(self), prio](rest::RestHandler* handler) {
if (prio == RequestPriority::LOW) {
SchedulerFeature::SCHEDULER->ongoingLowPriorityTasks() -= 1;
}
try {
// Pass the response to the io context
self->sendResponse(handler->stealResponse(), handler->stealStatistics());
Expand Down
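
Note on the hunk above: the counter is incremented when a low priority handler starts and decremented in the completion callback, so the two operations have to stay paired on every path. One way to make that pairing harder to break is a small RAII guard. This is a sketch of an alternative, not what the PR does; `OngoingLowPriorityGuard` is a hypothetical name, and a plain `std::atomic` stands in for the scheduler's gauge.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical RAII helper: counts one ongoing low priority task for as long
// as the guard object is alive. For other priorities it does nothing.
class OngoingLowPriorityGuard {
 public:
  OngoingLowPriorityGuard(std::atomic<std::uint64_t>& counter, bool isLowPriority)
      : _counter(isLowPriority ? &counter : nullptr) {
    if (_counter != nullptr) {
      _counter->fetch_add(1, std::memory_order_relaxed);
    }
  }

  ~OngoingLowPriorityGuard() {
    if (_counter != nullptr) {
      _counter->fetch_sub(1, std::memory_order_relaxed);
    }
  }

  // Non-copyable, so that each increment has exactly one matching decrement.
  OngoingLowPriorityGuard(OngoingLowPriorityGuard const&) = delete;
  OngoingLowPriorityGuard& operator=(OngoingLowPriorityGuard const&) = delete;

 private:
  std::atomic<std::uint64_t>* _counter;
};
```

In the asynchronous flow shown in the diff the handler may finish long after the queue callback returns, so such a guard would have to be owned by the completion callback (for example through a `std::shared_ptr`) rather than live on the stack of the queue callback.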