Bug fix/fix remote executor races #10206
Conversation
Force-pushed from e9fa270 to 0b5512e
} else {
  _lastResponse = std::move(res);
}
std::lock_guard<std::mutex> guard(_communicationMutex);
Actually, I do not get the need for the extra mutex here; shouldn't an atomic _lastTicket be sufficient to synchronize access to _lastError and _lastResponse?
The problem is that we not only have to synchronize a request with its answer, but also have to protect against an answer racing with another request. E.g., the RemoteExecutor sends a getSome/skipSome request, and then a shutdown happens (e.g. due to a timeout) at the same time the answer arrives. Then this code in ExecutionBlockImpl<RemoteExecutor>::shutdown,
if (!_hasTriggeredShutdown) {
  std::lock_guard<std::mutex> guard(_communicationMutex);
  std::ignore = generateNewTicket();
  _hasTriggeredShutdown = true;
}
which resets _lastTicket, _lastError and _lastResponse, races with the lambda that's called with the answer:
[=, ref = std::move(ref)](fuerte::Error err,
                          std::unique_ptr<fuerte::Request>,
                          std::unique_ptr<fuerte::Response> res) {
  std::lock_guard<std::mutex> guard(_communicationMutex);
  if (_lastTicket == ticket) {
    if (err != fuerte::Error::NoError || res->statusCode() >= 400) {
      _lastError = handleErrorResponse(spec, err, res.get());
    } else {
      _lastResponse = std::move(res);
    }
    _query.sharedState()->execute();
  }
}
which reads _lastTicket and then sets _lastError or _lastResponse. I do not see how that could be correctly synchronized with just an atomic _lastTicket.
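To make the intended invariant explicit, here is a minimal, self-contained sketch of the ticket pattern, assuming simplified types and made-up class and method names (only the member names _communicationMutex, _lastTicket, _lastError and _lastResponse mirror the snippets above; the real code differs):

#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Simplified illustration, not the actual ArangoDB code: both shutdown and the
// response callback lock the same mutex, and the callback publishes its result
// only if its ticket is still the current one. Bumping the ticket and resetting
// the result members happen under the same lock, so an in-flight answer can
// never overwrite state that shutdown has already invalidated.
class RemoteBlockSketch {
 public:
  // shutdown path: invalidate any in-flight answer.
  void invalidatePendingResponse() {
    std::lock_guard<std::mutex> guard(_communicationMutex);
    ++_lastTicket;  // answers carrying an older ticket are ignored from now on
    _lastError.clear();
    _lastResponse.reset();
  }

  // request path: the returned ticket is captured by the response callback.
  uint64_t sendRequest() {
    std::lock_guard<std::mutex> guard(_communicationMutex);
    return ++_lastTicket;
  }

  // response callback: only the holder of the current ticket may publish.
  void onResponse(uint64_t ticket, bool isError, std::string body) {
    std::lock_guard<std::mutex> guard(_communicationMutex);
    if (_lastTicket != ticket) {
      return;  // shutdown (or a newer request) invalidated this answer
    }
    if (isError) {
      _lastError = std::move(body);
    } else {
      _lastResponse = std::make_unique<std::string>(std::move(body));
    }
  }

 private:
  std::mutex _communicationMutex;
  uint64_t _lastTicket = 0;
  std::string _lastError;
  std::unique_ptr<std::string> _lastResponse;
};

With only an atomic _lastTicket, the ticket comparison and the subsequent writes to _lastError/_lastResponse would be separate steps, so shutdown could bump the ticket and reset both members in between; the mutex turns check-and-publish into a single critical section.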
Optional comment regarding more debug output; I am not bound to this being addressed.
Otherwise 👍
// Already sent a shutdown request, but haven't got an answer yet.
if (_didSendShutdownRequest) {
  return {ExecutionState::WAITING, TRI_ERROR_NO_ERROR};
Do we want to add traceShutdown here as well?
From the code I had locally, I just added tracing of the actual remote requests, not tracing of ::shutdown() calls. I thought that would be enough in most of the cases where shutdown is of interest at all, and it doesn't clutter the log as much.
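For illustration, this is roughly what the suggestion might look like at the early WAITING return; traceShutdownRequest is a hypothetical helper, not part of this PR (which only traces the actual remote requests):

// Hypothetical sketch, not code from this PR.
// Already sent a shutdown request, but haven't got an answer yet.
if (_didSendShutdownRequest) {
  traceShutdownRequest("shutdown still waiting for remote answer");  // hypothetical helper
  return {ExecutionState::WAITING, TRI_ERROR_NO_ERROR};
}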
…ture/one-shard-clean-up-2
* 'devel' of https://github.com/arangodb/arangodb:
  Bug fix/improve stringutils performance (#10208)
  add option to talk to the SUT using VST (#10217)
  Doc - Added "log-output" example (#10207)
  fix it! (#10198)
  add missing include
  Bug fix/fix simple example dep proxy skip some regression test (#10213)
  fixed ui behaviour when replacing a foxx app (#9719)
  [devel] Fix document search (Ctrl+F/Cmd+F) (#10216)
  Convert many uses of ClusterComm to Fuerte (#10154)
  Remove invokeOnAllElements (#10212)
  AQL Subquery: MultiDependencyRowFetcher (#10101)
  Bug fix/fix remote executor races (#10206)
  fix several inefficiencies in Store (#10189)
  Deprecate rocksdb.max-write-buffer-number startup option (#9654)
  fix arangosh with vst
Scope & Purpose
Fix some races in the RemoteExecutor between in-flight remote responses and a concurrent shutdown.