APM-164: Add basic overload control to arangod. by jsteemann · Pull Request #14796 · arangodb/arangodb · GitHub
APM-164: Add basic overload control to arangod. #14796

Merged 16 commits on Sep 23, 2021
Changes from 1 commit
Add basic overload control to arangod.
  This change adds the `x-arango-queue-time-seconds` header to all responses
  sent by arangod. This header contains the most recent request dequeuing time
  (in seconds) as tracked by the scheduler. Client applications and drivers
  can use this value to detect server overload and react to it.
  The new startup option `--http.return-queue-time-header` can be set to
  `false` to suppress these headers in responses sent by arangod.

  In addition, client applications and drivers can optionally augment their
  requests sent to arangod with a header of the same name. If set, the
  value of the header should contain the maximum queuing time (in seconds)
  that the client is willing to accept. If the header is set in an incoming
  request, arangod will compare the current dequeuing time from its scheduler
  with the maximum queue time value contained in the request. If the current
  dequeuing time exceeds the value set in the header, arangod will reject the
  request and return HTTP 412 (precondition failed) with the new error code
  21004 (queue time violated).
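
A client or driver could act on the response header roughly as follows. This is a minimal sketch, not driver code from this PR: `parseQueueTimeSeconds`, `shouldBackOff`, and the threshold parameter are hypothetical names, and the sketch assumes the HTTP library has already extracted the header value as a string.

```cpp
#include <cmath>
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical helper: parse the value of an "x-arango-queue-time-seconds"
// header. Returns std::nullopt unless the whole string is a finite,
// non-negative number.
std::optional<double> parseQueueTimeSeconds(std::string const& value) {
  if (value.empty()) {
    return std::nullopt;
  }
  char* end = nullptr;
  double const seconds = std::strtod(value.c_str(), &end);
  bool const consumedAll = (end != value.c_str()) && (*end == '\0');
  if (!consumedAll || !std::isfinite(seconds) || seconds < 0.0) {
    return std::nullopt;
  }
  return seconds;
}

// Hypothetical client policy: back off when the server-reported dequeuing
// time exceeds a threshold chosen by the application.
bool shouldBackOff(std::string const& headerValue, double thresholdSeconds) {
  std::optional<double> const seconds = parseQueueTimeSeconds(headerValue);
  return seconds.has_value() && *seconds > thresholdSeconds;
}
```

A caller would pass the raw header string from each response, for example `shouldBackOff(headerValue, 1.0)`, and slow down or shed load while it returns true. A missing or malformed header is treated as "no signal" in this sketch.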
jsteemann committed Sep 19, 2021
commit 6f02ef5e2567512d96be019e9a09479967b7d91b
18 changes: 18 additions & 0 deletions CHANGELOG
@@ -1,6 +1,24 @@
devel
-----

* Add basic overload control to arangod.
This change adds the `x-arango-queue-time-seconds` header to all responses
sent by arangod. This header contains the most recent request dequeuing time
(in seconds) as tracked by the scheduler. This value can be used by client
applications and drivers to detect server overload and react to it.
The new startup option `--http.return-queue-time-header` can be set to
`false` to suppress these headers in responses sent by arangod.

In addition, client applications and drivers can optionally augment their
requests sent to arangod with a header of the same name. If set, the
value of the header should contain the maximum queuing time (in seconds)
that the client is willing to accept. If the header is set in an incoming
request, arangod will compare the current dequeuing time from its scheduler
with the maximum queue time value contained in the request. If the current
dequeuing time exceeds the value set in the header, arangod will reject the
request and return HTTP 412 (precondition failed) with the new error code
21004 (queue time violated).

* Add `--datatype` startup option to arangoimport, in order to hard-code the
datatype (null/boolean/number/string) for certain attributes in the CSV/TSV import.
For example, given the following input file:
79 changes: 52 additions & 27 deletions arangod/GeneralServer/CommTask.cpp
@@ -65,31 +65,6 @@ inline bool startsWith(std::string const& path, char const* other) {
path.compare(0, size, other, size) == 0);
}

} // namespace

// -----------------------------------------------------------------------------
// --SECTION-- constructors and destructors
// -----------------------------------------------------------------------------

CommTask::CommTask(GeneralServer& server,
ConnectionInfo info)
: _server(server),
_connectionInfo(std::move(info)),
_connectionStatistics(ConnectionStatistics::acquire()),
_auth(AuthenticationFeature::instance()) {
TRI_ASSERT(_auth != nullptr);
_connectionStatistics.SET_START();
}

CommTask::~CommTask() {
_connectionStatistics.SET_END();
}

// -----------------------------------------------------------------------------
// --SECTION-- protected methods
// -----------------------------------------------------------------------------

namespace {
TRI_vocbase_t* lookupDatabaseFromRequest(application_features::ApplicationServer& server,
GeneralRequest& req) {
// get database name from request
@@ -127,8 +102,43 @@ bool resolveRequestContext(application_features::ApplicationServer& server,
// the "true" means the request is the owner of the context
return true;
}

bool queueTimeViolated(GeneralRequest const& req) {
// check if the client sent the "x-arango-queue-time-seconds" header
bool found = false;
std::string const& queueTimeValue = req.header(StaticStrings::XArangoQueueTimeSeconds, found);
if (found) {
// yes, now parse the sent time value
double requestedQueueTime = StringUtils::doubleDecimal(queueTimeValue);
[Review comment, Contributor] What is the outcome if queueTimeValue is not a valid double? If I understand correctly, it will just get ignored. Would it make sense to signal/inform here (e.g., at least on higher log levels)?

[Review comment, Contributor] Understood, this is set by ArangoDB and not user specified.

if (requestedQueueTime > 0.0) {
// value is > 0.0, so now check the last dequeue time that the scheduler reported
double lastDequeueTime = static_cast<double>(
SchedulerFeature::SCHEDULER->getLastLowPriorityDequeueTime()) / 1000.0;

if (lastDequeueTime > requestedQueueTime) {
return true;
}
}
}
return false;
}

} // namespace
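
The check added in `queueTimeViolated` can be isolated as follows. This is a sketch under assumptions: `std::strtod` stands in for `StringUtils::doubleDecimal`, whose handling of malformed input is not shown in this diff, and the scheduler's value is passed in as a parameter instead of being read from `SchedulerFeature::SCHEDULER`. In this sketch a value that does not parse yields 0.0 and is therefore ignored, which matches the reviewer's reading above but is not confirmed here.

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>

// Sketch of the request-side check. `headerValue` is the client's maximum
// acceptable queue time in seconds; `lastDequeueTimeMs` is the scheduler's
// last recorded low-priority dequeue time in milliseconds.
bool queueTimeViolatedSketch(std::string const& headerValue,
                             std::uint64_t lastDequeueTimeMs) {
  // std::strtod returns 0.0 if no conversion can be performed
  // (assumption: used here in place of StringUtils::doubleDecimal).
  double const requestedQueueTime = std::strtod(headerValue.c_str(), nullptr);
  if (!(requestedQueueTime > 0.0)) {
    // zero, negative, NaN, or unparsable values disable the check
    return false;
  }
  // convert milliseconds to seconds before comparing
  double const lastDequeueTime = static_cast<double>(lastDequeueTimeMs) / 1000.0;
  return lastDequeueTime > requestedQueueTime;
}
```

The comparison is strict: a request that allows 0.75 seconds is still accepted when the last dequeue time is exactly 750 ms, and rejected only when the recorded time is larger.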

CommTask::CommTask(GeneralServer& server,
ConnectionInfo info)
: _server(server),
_connectionInfo(std::move(info)),
_connectionStatistics(ConnectionStatistics::acquire()),
_auth(AuthenticationFeature::instance()) {
TRI_ASSERT(_auth != nullptr);
_connectionStatistics.SET_START();
}

CommTask::~CommTask() {
_connectionStatistics.SET_END();
}

/// Must be called before calling executeRequest, will send an error
/// response if execution is supposed to be aborted

@@ -311,6 +321,12 @@ void CommTask::finishExecution(GeneralResponse& res, std::string const& origin)
// use "IfNotSet" to not overwrite an existing response header
res.setHeaderNCIfNotSet(StaticStrings::XContentTypeOptions, StaticStrings::NoSniff);
}

// add "x-arango-queue-time-seconds" header
if (_server.server().getFeature<GeneralServerFeature>().returnQueueTimeHeader()) {
res.setHeaderNC(StaticStrings::XArangoQueueTimeSeconds,
std::to_string(static_cast<double>(SchedulerFeature::SCHEDULER->getLastLowPriorityDequeueTime()) / 1000.0));
}
}
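
The response value is the scheduler's millisecond gauge converted to seconds and formatted with `std::to_string`. A small sketch of that conversion, where `formatQueueTimeHeader` is a hypothetical name and the gauge value is passed in directly instead of being read from the scheduler:

```cpp
#include <cstdint>
#include <string>

// Sketch: build the "x-arango-queue-time-seconds" response value from a
// dequeue time given in milliseconds.
std::string formatQueueTimeHeader(std::uint64_t lastDequeueTimeMs) {
  return std::to_string(static_cast<double>(lastDequeueTimeMs) / 1000.0);
}
```

With the `%f`-style formatting that `std::to_string(double)` uses up to C++23, a gauge value of 750 ms is sent as `0.750000`, so clients should parse the value as a decimal number and not expect an integer.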

/// Push this request into the execution pipeline
@@ -336,8 +352,17 @@ void CommTask::executeRequest(std::unique_ptr<GeneralRequest> request,
LOG_TOPIC("2cece", WARN, Logger::REQUESTS)
<< "could not find corresponding request/response";
}

rest::ContentType const respType = request->contentTypeResponse();

// check if "x-arango-queue-time-seconds" header was set, and the current
// dequeuing time is above the value contained in it
if (::queueTimeViolated(*request)) {
sendErrorResponse(rest::ResponseCode::PRECONDITION_FAILED,
respType, messageId, TRI_ERROR_QUEUE_TIME_REQUIREMENT_VIOLATED);
return;
}

// create a handler, this takes ownership of request and response
auto& server = _server.server();
auto& factory = server.getFeature<GeneralServerFeature>().handlerFactory();
@@ -351,7 +376,7 @@ void CommTask::executeRequest(std::unique_ptr<GeneralRequest> request,
VPackBuffer<uint8_t>());
return;
}

// forward to correct server if necessary
bool forwarded;
auto res = handler->forwardRequest(forwarded);
21 changes: 14 additions & 7 deletions arangod/GeneralServer/GeneralServerFeature.cpp
@@ -139,6 +139,7 @@ GeneralServerFeature::GeneralServerFeature(application_features::ApplicationServ
: ApplicationFeature(server, "GeneralServer"),
_allowMethodOverride(false),
_proxyCheck(true),
_returnQueueTimeHeader(true),
_permanentRootRedirect(true),
_redirectRootTo("/_admin/aardvark/index.html"),
_supportInfoApiPolicy("hardened"),
@@ -219,6 +220,11 @@ void GeneralServerFeature::collectOptions(std::shared_ptr<ProgramOptions> option
"if true, use a permanent redirect. If false, use a temporary",
new BooleanParameter(&_permanentRootRedirect))
.setIntroducedIn(30712);

options->addOption("--http.return-queue-time-header",
"if true, return the 'x-arango-queue-time-seconds' response header",
new BooleanParameter(&_returnQueueTimeHeader))
.setIntroducedIn(30900);

options->addOption("--frontend.proxy-request-check",
"enable proxy request checking",
@@ -282,9 +288,8 @@ void GeneralServerFeature::prepare() {
}

void GeneralServerFeature::start() {
_jobManager.reset(new AsyncJobManager);

_handlerFactory.reset(new RestHandlerFactory());
_jobManager = std::make_unique<AsyncJobManager>();
_handlerFactory = std::make_unique<RestHandlerFactory>();

defineHandlers();
buildServers();
@@ -321,17 +326,19 @@ void GeneralServerFeature::unprepare() {
_jobManager.reset();
}

double GeneralServerFeature::keepAliveTimeout() const {
double GeneralServerFeature::keepAliveTimeout() const noexcept {
return _keepAliveTimeout;
}

bool GeneralServerFeature::proxyCheck() const { return _proxyCheck; }
bool GeneralServerFeature::proxyCheck() const noexcept { return _proxyCheck; }

bool GeneralServerFeature::returnQueueTimeHeader() const noexcept { return _returnQueueTimeHeader; }

std::vector<std::string> GeneralServerFeature::trustedProxies() const {
return _trustedProxies;
}

bool GeneralServerFeature::allowMethodOverride() const {
bool GeneralServerFeature::allowMethodOverride() const noexcept {
return _allowMethodOverride;
}

@@ -350,7 +357,7 @@ Result GeneralServerFeature::reloadTLS() { // reload TLS data from disk
return res;
}

bool GeneralServerFeature::permanentRootRedirect() const {
bool GeneralServerFeature::permanentRootRedirect() const noexcept {
return _permanentRootRedirect;
}

10 changes: 6 additions & 4 deletions arangod/GeneralServer/GeneralServerFeature.h
@@ -45,13 +45,14 @@ class GeneralServerFeature final : public application_features::ApplicationFeatu
void stop() override final;
void unprepare() override final;

double keepAliveTimeout() const;
bool proxyCheck() const;
double keepAliveTimeout() const noexcept;
bool proxyCheck() const noexcept;
bool returnQueueTimeHeader() const noexcept;
std::vector<std::string> trustedProxies() const;
bool allowMethodOverride() const;
bool allowMethodOverride() const noexcept;
std::vector<std::string> const& accessControlAllowOrigins() const;
Result reloadTLS();
bool permanentRootRedirect() const;
bool permanentRootRedirect() const noexcept;
std::string redirectRootTo() const;
std::string const& supportInfoApiPolicy() const noexcept;

@@ -87,6 +88,7 @@ class GeneralServerFeature final : public application_features::ApplicationFeatu
double _keepAliveTimeout = 300.0;
bool _allowMethodOverride;
bool _proxyCheck;
bool _returnQueueTimeHeader;
bool _permanentRootRedirect;
std::vector<std::string> _trustedProxies;
std::vector<std::string> _accessControlAllowOrigins;
3 changes: 3 additions & 0 deletions arangod/Scheduler/Scheduler.h
@@ -233,6 +233,9 @@ class Scheduler {

virtual void toVelocyPack(velocypack::Builder&) const = 0;
virtual QueueStatistics queueStatistics() const = 0;

/// @brief returns the last stored dequeue time [ms]
virtual uint64_t getLastLowPriorityDequeueTime() const noexcept = 0;

/// @brief approximate fill grade of the scheduler's queue (in %)
virtual double approximateQueueFillGrade() const = 0;
5 changes: 5 additions & 0 deletions arangod/Scheduler/SupervisedScheduler.cpp
@@ -977,6 +977,11 @@ void SupervisedScheduler::trackEndOngoingLowPriorityTask() {
}
}

/// @brief returns the last stored dequeue time [ms]
uint64_t SupervisedScheduler::getLastLowPriorityDequeueTime() const noexcept {
return _metricsLastLowPriorityDequeueTime.load();
}

void SupervisedScheduler::setLastLowPriorityDequeueTime(uint64_t time) noexcept {
// update only probabilistically, in order to reduce contention on the gauge
if ((_sharedPRNG.rand() & 7) == 0) {
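
The `& 7` test passes when the low three bits of the random number are zero, so roughly one in eight dequeue times is recorded, assuming uniformly distributed bits. The sketch below isolates that sampling pattern. `DequeueTimeGauge` and `maybeStore` are hypothetical names, the random bits are passed in so the behavior is deterministic, and the body of the real `if` is elided in this diff, so the store shown here is an assumption.

```cpp
#include <atomic>
#include <cstdint>

// Sketch of a gauge that is updated only for a sampled subset of calls,
// to reduce contention on the shared atomic.
class DequeueTimeGauge {
 public:
  // Stores `timeMs` only if the low three bits of `randomBits` are zero.
  // Returns true if the value was stored.
  bool maybeStore(std::uint64_t timeMs, std::uint32_t randomBits) noexcept {
    if ((randomBits & 7) == 0) {
      _lastMs.store(timeMs, std::memory_order_relaxed);
      return true;
    }
    return false;
  }

  // Returns the last stored dequeue time in milliseconds.
  std::uint64_t load() const noexcept {
    return _lastMs.load(std::memory_order_relaxed);
  }

 private:
  std::atomic<std::uint64_t> _lastMs{0};
};
```

One consequence of sampling is that the value reported in `x-arango-queue-time-seconds` can lag behind the most recent dequeue, so clients should treat it as an approximate load signal.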
3 changes: 3 additions & 0 deletions arangod/Scheduler/SupervisedScheduler.h
@@ -58,6 +58,9 @@ class SupervisedScheduler final : public Scheduler {
void trackBeginOngoingLowPriorityTask();
void trackEndOngoingLowPriorityTask();

/// @brief returns the last stored dequeue time [ms]
uint64_t getLastLowPriorityDequeueTime() const noexcept override;

/// @brief set the time it took for the last low prio item to be dequeued
/// (time between queuing and dequeuing) [ms]
void setLastLowPriorityDequeueTime(uint64_t time) noexcept;
4 changes: 2 additions & 2 deletions js/common/bootstrap/errors.js
@@ -345,10 +345,10 @@
"ERROR_AGENCY_CANNOT_REBUILD_DBS" : { "code" : 20021, "message" : "Cannot rebuild readDB and spearHead" },
"ERROR_AGENCY_MALFORMED_TRANSACTION" : { "code" : 20030, "message" : "malformed agency transaction" },
"ERROR_SUPERVISION_GENERAL_FAILURE" : { "code" : 20501, "message" : "general supervision failure" },
"ERROR_QUEUE_FULL" : { "code" : 21003, "message" : "named queue is full" },
"ERROR_QUEUE_FULL" : { "code" : 21003, "message" : "queue is full" },
"ERROR_QUEUE_TIME_REQUIREMENT_VIOLATED" : { "code" : 21004, "message" : "queue time violated" },
"ERROR_ACTION_OPERATION_UNABORTABLE" : { "code" : 6002, "message" : "this maintenance action cannot be stopped" },
"ERROR_ACTION_UNFINISHED" : { "code" : 6003, "message" : "maintenance action still processing" },
"ERROR_NO_SUCH_ACTION" : { "code" : 6004, "message" : "no such maintenance action" },
"ERROR_HOT_BACKUP_INTERNAL" : { "code" : 7001, "message" : "internal hot backup error" },
"ERROR_HOT_RESTORE_INTERNAL" : { "code" : 7002, "message" : "internal hot restore error" },
"ERROR_BACKUP_TOPOLOGY" : { "code" : 7003, "message" : "backup does not match this topology" },
1 change: 1 addition & 0 deletions lib/Basics/StaticStrings.cpp
@@ -219,6 +219,7 @@ std::string const StaticStrings::Unlimited = "unlimited";
std::string const StaticStrings::WwwAuthenticate("www-authenticate");
std::string const StaticStrings::XContentTypeOptions("x-content-type-options");
std::string const StaticStrings::XArangoFrontend("x-arango-frontend");
std::string const StaticStrings::XArangoQueueTimeSeconds("x-arango-queue-time-seconds");

// mime types
std::string const StaticStrings::MimeTypeDump(
1 change: 1 addition & 0 deletions lib/Basics/StaticStrings.h
@@ -202,6 +202,7 @@ class StaticStrings {
static std::string const WwwAuthenticate;
static std::string const XContentTypeOptions;
static std::string const XArangoFrontend;
static std::string const XArangoQueueTimeSeconds;

// mime types
static std::string const MimeTypeDump;
6 changes: 3 additions & 3 deletions lib/Basics/error-registry.h
@@ -694,13 +694,13 @@ constexpr static frozen::unordered_map<ErrorCode, const char*, 355> ErrorMessage
{TRI_ERROR_SUPERVISION_GENERAL_FAILURE, // 20501
"general supervision failure"},
{TRI_ERROR_QUEUE_FULL, // 21003
"named queue is full"},
"queue is full"},
{TRI_ERROR_QUEUE_TIME_REQUIREMENT_VIOLATED, // 21004
"queue time violated"},
{TRI_ERROR_ACTION_OPERATION_UNABORTABLE, // 6002
"this maintenance action cannot be stopped"},
{TRI_ERROR_ACTION_UNFINISHED, // 6003
"maintenance action still processing"},
{TRI_ERROR_NO_SUCH_ACTION, // 6004
"no such maintenance action"},
{TRI_ERROR_HOT_BACKUP_INTERNAL, // 7001
"internal hot backup error"},
{TRI_ERROR_HOT_RESTORE_INTERNAL, // 7002
6 changes: 3 additions & 3 deletions lib/Basics/errors.dat
@@ -484,18 +484,18 @@ ERROR_AGENCY_MALFORMED_TRANSACTION,20030,"malformed agency transaction","Malform
ERROR_SUPERVISION_GENERAL_FAILURE,20501,"general supervision failure","General supervision failure."

################################################################################
## Dispatcher errors
## Scheduler errors
################################################################################

ERROR_QUEUE_FULL,21003,"named queue is full","Will be returned if a queue with this name is full."
ERROR_QUEUE_FULL,21003,"queue is full","Will be returned if the scheduler queue is full."
ERROR_QUEUE_TIME_REQUIREMENT_VIOLATED,21004,"queue time violated","Will be returned if a request with a queue time requirement is set and it cannot be fulfilled."

################################################################################
## Maintenance errors
################################################################################

ERROR_ACTION_OPERATION_UNABORTABLE,6002,"this maintenance action cannot be stopped","This maintenance action cannot be stopped once it is started"
ERROR_ACTION_UNFINISHED,6003,"maintenance action still processing","This maintenance action is still processing"
ERROR_NO_SUCH_ACTION,6004,"no such maintenance action","No such maintenance action exists"

################################################################################
## Backup/Restore errors
15 changes: 8 additions & 7 deletions lib/Basics/voc-errors.h
@@ -1827,10 +1827,16 @@ constexpr auto TRI_ERROR_AGENCY_MALFORMED_TRANSACTION
constexpr auto TRI_ERROR_SUPERVISION_GENERAL_FAILURE = ErrorCode{20501};

/// 21003: ERROR_QUEUE_FULL
/// "named queue is full"
/// Will be returned if a queue with this name is full.
/// "queue is full"
/// Will be returned if the scheduler queue is full.
constexpr auto TRI_ERROR_QUEUE_FULL = ErrorCode{21003};

/// 21004: ERROR_QUEUE_TIME_REQUIREMENT_VIOLATED
/// "queue time violated"
/// Will be returned if a request with a queue time requirement is set and it
/// cannot be fulfilled.
constexpr auto TRI_ERROR_QUEUE_TIME_REQUIREMENT_VIOLATED = ErrorCode{21004};

/// 6002: ERROR_ACTION_OPERATION_UNABORTABLE
/// "this maintenance action cannot be stopped"
/// This maintenance action cannot be stopped once it is started
@@ -1841,11 +1847,6 @@ constexpr auto TRI_ERROR_ACTION_OPERATION_UNABORTABLE
/// This maintenance action is still processing
constexpr auto TRI_ERROR_ACTION_UNFINISHED = ErrorCode{6003};

/// 6004: ERROR_NO_SUCH_ACTION
/// "no such maintenance action"
/// No such maintenance action exists
constexpr auto TRI_ERROR_NO_SUCH_ACTION = ErrorCode{6004};

/// 7001: ERROR_HOT_BACKUP_INTERNAL
/// "internal hot backup error"
/// Failed to create hot backup set
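
On the client side, a rejected request can be recognized by the combination of HTTP 412 and error number 21004. A hedged sketch of how a driver might classify and retry such responses: `isQueueTimeRejection` and `retryDelay` are hypothetical helpers, the backoff constants are arbitrary, and the sketch assumes the client has already extracted the ArangoDB error number from the error response.

```cpp
#include <algorithm>
#include <chrono>

constexpr int kHttpPreconditionFailed = 412;
constexpr int kErrorQueueTimeViolated = 21004;  // ERROR_QUEUE_TIME_REQUIREMENT_VIOLATED

// True if the response indicates that the server rejected the request
// because its queue time requirement could not be met.
bool isQueueTimeRejection(int httpStatus, int errorNum) noexcept {
  return httpStatus == kHttpPreconditionFailed &&
         errorNum == kErrorQueueTimeViolated;
}

// Hypothetical exponential backoff: 100 ms, 200 ms, 400 ms, ... capped at 5 s.
std::chrono::milliseconds retryDelay(unsigned attempt) noexcept {
  constexpr long long baseMs = 100;
  constexpr long long capMs = 5000;
  unsigned const shift = std::min(attempt, 16u);  // keep the shift in range
  return std::chrono::milliseconds(std::min(baseMs << shift, capMs));
}
```

Checking both the status and the error number matters because HTTP 412 is also used for other precondition failures, which a client would usually not want to retry blindly.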