Bug fix/fix remote executor races by goedderz · Pull Request #10206 · arangodb/arangodb · GitHub

Bug fix/fix remote executor races #10206

Merged
merged 2 commits into devel from bug-fix/fix-remote-executor-races on Oct 10, 2019

Conversation

goedderz (Member) commented Oct 9, 2019

Scope & Purpose

Fix some races in the RemoteExecutor

  • Bug-Fix for devel-branch (no need for backports)

@goedderz goedderz self-assigned this Oct 9, 2019
@goedderz goedderz force-pushed the bug-fix/fix-remote-executor-races branch from e9fa270 to 0b5512e on October 9, 2019 13:21
@goedderz goedderz added the 1 Bug label Oct 9, 2019
@goedderz goedderz added this to the devel milestone Oct 9, 2019
@goedderz goedderz requested review from graetzer and mchacki October 9, 2019 13:38
goedderz (Member Author) commented Oct 9, 2019

} else {
  _lastResponse = std::move(res);
}
std::lock_guard<std::mutex> guard(_communicationMutex);
Contributor commented:

Actually, I do not get the need for the extra mutex here; shouldn't an atomic _lastTicket be sufficient to synchronize access to _lastError and _lastResponse?

goedderz (Member Author) replied Oct 9, 2019:

The problem is that we not only have to synchronize a request with its answer, but also have to protect an answer against a concurrent request; e.g., the RemoteExecutor sends a getSome/skipSome request, then a shutdown happens (e.g. due to a timeout) at the same time the answer arrives. Then this code in ExecutionBlockImpl<RemoteExecutor>::shutdown,

if (!_hasTriggeredShutdown) {
  std::lock_guard<std::mutex> guard(_communicationMutex);
  std::ignore = generateNewTicket();
  _hasTriggeredShutdown = true;
}

which resets _lastTicket, _lastError and _lastResponse, races with the lambda that's called with the answer:

[=, ref(std::move(ref))](fuerte::Error err,
                         std::unique_ptr<fuerte::Request>,
                         std::unique_ptr<fuerte::Response> res) {
  std::lock_guard<std::mutex> guard(_communicationMutex);

  if (_lastTicket == ticket) {
    if (err != fuerte::Error::NoError || res->statusCode() >= 400) {
      _lastError = handleErrorResponse(spec, err, res.get());
    } else {
      _lastResponse = std::move(res);
    }
    _query.sharedState()->execute();
  }
}

which reads _lastTicket and then sets _lastError or _lastResponse. I do not see how that could be correctly synchronized with just an atomic _lastTicket.
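
To illustrate the point, here is a minimal standalone sketch of that check-and-store pattern. It is not the ArangoDB code; the class and member names are made up. It only shows why comparing the ticket and storing the result must happen in a single critical section, which a lone atomic _lastTicket cannot provide.

// Illustrative sketch only, not ArangoDB code; all names are made up.
// The ticket check and the result store share one lock, so the
// invalidation done by shutdown cannot interleave between them.
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>

struct Response {};

class RemoteBlockSketch {
 public:
  // Called on shutdown: invalidate whatever request is still in flight.
  void invalidatePendingRequest() {
    std::lock_guard<std::mutex> guard(_mutex);
    ++_lastTicket;  // responses carrying an older ticket are ignored
    _lastResponse.reset();
    _lastError.clear();
  }

  // Called from the network thread when the response for `ticket` arrives.
  void onResponse(uint64_t ticket, std::string error,
                  std::unique_ptr<Response> res) {
    std::lock_guard<std::mutex> guard(_mutex);
    if (ticket != _lastTicket) {
      return;  // belongs to a request that was already invalidated
    }
    if (!error.empty()) {
      _lastError = std::move(error);
    } else {
      _lastResponse = std::move(res);
    }
    // With only an atomic ticket, invalidatePendingRequest() could run
    // between the comparison above and the writes here, and a stale
    // response would still be stored.
  }

 private:
  std::mutex _mutex;
  uint64_t _lastTicket = 0;
  std::unique_ptr<Response> _lastResponse;
  std::string _lastError;
};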

@mchacki mchacki (Member) left a comment:

Optional comment for more debug output; I am not bound to this being addressed.
Otherwise 👍


// Already sent a shutdown request, but haven't got an answer yet.
if (_didSendShutdownRequest) {
  return {ExecutionState::WAITING, TRI_ERROR_NO_ERROR};
Member commented:

Do we want to add traceShutdown here as well?

goedderz (Member Author) replied:

From the code I had locally, I only added tracing of the actual remote requests, not tracing of ::shutdown() calls. I thought that would be enough in most of the cases where shutdown is of interest at all, and it doesn't clutter the log as much.
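
For illustration, this is roughly what such a trace hook could look like. It is a hypothetical sketch: the type, signature, and logging call are made up and not taken from the ArangoDB code base.

// Hypothetical sketch of a traceShutdown()-style helper; not ArangoDB code.
#include <cstdio>

enum class ExecutionState { DONE, HASMORE, WAITING };

struct ShutdownTracingSketch {
  bool tracingEnabled = false;  // e.g. tied to the query's profiling level

  void traceShutdown(ExecutionState state, int errorCode) const {
    if (!tracingEnabled) {
      return;
    }
    // the real code would write to the query log rather than stdout
    std::printf("shutdown returned state=%d, errorCode=%d\n",
                static_cast<int>(state), errorCode);
  }
};

A call placed right before the early return shown above (the WAITING case) would cover the path the reviewer asked about.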

@mchacki mchacki merged commit 6528c59 into devel Oct 10, 2019
@mchacki mchacki deleted the bug-fix/fix-remote-executor-races branch October 10, 2019 10:36
ObiWahn added a commit that referenced this pull request Oct 11, 2019
…ture/one-shard-clean-up-2

* 'devel' of https://github.com/arangodb/arangodb:
  Bug fix/improve stringutils performance (#10208)
  add option to talk to the SUT using VST (#10217)
  Doc - Added "log-output" example (#10207)
  fix it! (#10198)
  add missing include
  Bug fix/fix simple example dep proxy skip some regression test (#10213)
  fixed ui behaviour when replacing a foxx app (#9719)
  [devel] Fix document search (Ctrl+F/Cmd+F) (#10216)
  Convert many uses of ClusterComm to Fuerte (#10154)
  Remove invokeOnAllElements (#10212)
  AQL Subquery: MultiDependencyRowFetcher (#10101)
  Bug fix/fix remote executor races (#10206)
  fix several inefficiencies in Store (#10189)
  Deprecate rocksdb.max-write-buffer-number startup option (#9654)
  fix arangosh with vst