Dedicated crash handler thread. by neunhoef · Pull Request #21826 · arangodb/arangodb · GitHub

Dedicated crash handler thread. #21826


Merged
merged 30 commits into from
Jul 11, 2025
Changes from 1 commit
Commits
c1d34c2
First draft and experiment.
neunhoef Jun 27, 2025
a0e27a6
Clean up code.
neunhoef Jun 27, 2025
abc4216
CHANGELOG.
neunhoef Jun 27, 2025
4a0e661
Take out test assertion again.
neunhoef Jun 27, 2025
27fb623
Increase backwards compatibility of the output.
neunhoef Jun 27, 2025
1cc8b35
Shut down crash handler thread when --version is used.
neunhoef Jun 27, 2025
4d39c2c
Merge remote-tracking branch 'origin/devel' into feature/crash-handle…
neunhoef Jun 30, 2025
5b45310
Merge remote-tracking branch 'origin/devel' into feature/crash-handle…
neunhoef Jun 30, 2025
3704b8e
Use atomic wait/notify_all for crash handler thread.
neunhoef Jun 30, 2025
dded8d4
Merge remote-tracking branch 'origin/devel' into feature/crash-handle…
neunhoef Jul 1, 2025
f308917
Explain state transitions.
neunhoef Jul 1, 2025
b5950e6
Cleaner shutdown.
neunhoef Jul 1, 2025
62e81ff
Take out commented out code.
neunhoef Jul 1, 2025
1f33096
clang-format.
neunhoef Jul 1, 2025
70b04f0
First phase of moving stuff to dedicated crash handler thread.
neunhoef Jul 1, 2025
ebf3c43
Rework suggested by reviewer.
neunhoef Jul 2, 2025
fb9813c
Take assertion out.
neunhoef Jul 2, 2025
4b8523d
Remove additional string for backwards compatibility.
neunhoef Jul 2, 2025
c09a7c6
Add CrashHandler as thread name.
neunhoef Jul 2, 2025
c4d6bea
Thread exceptions.
neunhoef Jul 2, 2025
2a2f45e
Merge remote-tracking branch 'origin/devel' into feature/crash-handle…
neunhoef Jul 4, 2025
17d3f81
Merge remote-tracking branch 'origin/devel' into feature/crash-handle…
neunhoef Jul 4, 2025
58cf61d
Fixes suggested by reviewer.
neunhoef Jul 4, 2025
7ec0d64
Finishing touches suggested by the reviewer.
neunhoef Jul 4, 2025
065de5e
Merge remote-tracking branch 'origin/devel' into feature/crash-handle…
neunhoef Jul 8, 2025
1da431c
Merge remote-tracking branch 'origin/devel' into feature/crash-handle…
neunhoef Jul 8, 2025
0b9e10c
Merge remote-tracking branch 'origin/devel' into feature/crash-handle…
neunhoef Jul 10, 2025
ff2542f
Fix CHANGELOG.
neunhoef Jul 10, 2025
1c5508d
Update arangod/RestServer/arangod.cpp
neunhoef Jul 10, 2025
c5d4a61
A few more suggestions by reviewer.
neunhoef Jul 11, 2025
Cleaner shutdown.
neunhoef committed Jul 1, 2025
commit b5950e672435fa8965d1c3e9a6c161da06acaa95
12 changes: 6 additions & 6 deletions arangod/RestServer/arangod.cpp
@@ -58,7 +58,9 @@ constexpr auto kNonServerFeatures =

static int runServer(int argc, char** argv, ArangoGlobalContext& context) {
try {
CrashHandler::installCrashHandler();
CrashHandler crashHandler; // initializes the crash handler and starts its
// thread; the destructor will stop it.

std::string name = context.binaryName();

auto options = std::make_shared<arangodb::options::ProgramOptions>(
@@ -181,9 +183,9 @@ static int runServer(int argc, char** argv, ArangoGlobalContext& context) {
return std::make_unique<ShutdownFeature>(
server,
#ifdef USE_V8
std::array { ArangodServer::id<ScriptFeature>() }
std::array{ArangodServer::id<ScriptFeature>()}
#else
std::array { ArangodServer::id<AgencyFeaturePhase>() }
std::array{ArangodServer::id<AgencyFeaturePhase>()}
#endif
);
},
@@ -227,10 +229,8 @@ static int runServer(int argc, char** argv, ArangoGlobalContext& context) {
ret = EXIT_FAILURE;
}

// Shutdown the crash handler thread in a controlled manner
CrashHandler::shutdownCrashHandler();

Logger::flush();
// CrashHandler will be deactivated here automatically by its destructor
return context.exit(ret);
} catch (std::exception const& ex) {
LOG_TOPIC("8afa8", ERR, arangodb::Logger::FIXME)
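The change to runServer above replaces an explicit install/shutdown pair with a scoped object whose destructor stops the thread. A minimal sketch of that RAII pattern follows, assuming a simplified worker loop; `BackgroundWorker`, `completedRuns`, and `runServerSketch` are hypothetical names used only for illustration and are not part of ArangoDB.

```cpp
#include <atomic>
#include <thread>

// Counts worker threads that have run to completion (illustration only).
std::atomic<int> completedRuns{0};

// Hypothetical stand-in for an object that owns a background thread.
class BackgroundWorker {
 public:
  // The constructor starts the thread.
  BackgroundWorker() : _thread([this] { run(); }) {}

  // The destructor asks the thread to stop and joins it.
  ~BackgroundWorker() {
    _stop.store(true, std::memory_order_release);
    _thread.join();
  }

  BackgroundWorker(BackgroundWorker const&) = delete;
  BackgroundWorker& operator=(BackgroundWorker const&) = delete;

 private:
  void run() {
    // Idle until the destructor requests a stop.
    while (!_stop.load(std::memory_order_acquire)) {
      std::this_thread::yield();
    }
    completedRuns.fetch_add(1);
  }

  // Declared before _thread so that it is initialized before the thread runs.
  std::atomic<bool> _stop{false};
  std::thread _thread;
};

// Every return path leaves the scope, so the worker is always joined.
int runServerSketch(bool failEarly) {
  BackgroundWorker worker;
  if (failEarly) {
    return 1;  // early exit: the destructor still stops the thread
  }
  return 0;
}
```

Note that std::exit() does not run destructors of local objects, so a scoped object alone does not cover that path; this is presumably why the PR also registers a std::atexit handler in CrashHandler.cpp.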
4 changes: 2 additions & 2 deletions lib/ApplicationFeatures/VersionFeature.cpp
@@ -64,7 +64,7 @@ void VersionFeature::validateOptions(std::shared_ptr<ProgramOptions>) {
}

std::cout << builder.slice().toJson() << std::endl;
CrashHandler::shutdownCrashHandler();
// CrashHandler::shutdownCrashHandler();
exit(EXIT_SUCCESS);
}

@@ -74,7 +74,7 @@ void VersionFeature::validateOptions(std::shared_ptr<ProgramOptions>) {
<< LGPLNotice << std::endl
<< std::endl
<< Version::getDetailed() << std::endl;
CrashHandler::shutdownCrashHandler();
// CrashHandler::shutdownCrashHandler();
exit(EXIT_SUCCESS);
}
}
9 changes: 9 additions & 0 deletions lib/CrashHandler/CrashHandler.cpp
@@ -503,6 +503,8 @@ void crashHandlerSignalHandler(int signal, siginfo_t* info, void* ucontext) {

namespace arangodb {

std::atomic<CrashHandler*> CrashHandler::_theCrashHandler;

void CrashHandler::triggerCrashHandler() {
::crashHandlerState.store(arangodb::CrashHandlerState::CRASH_DETECTED,
std::memory_order_release);
@@ -685,6 +687,13 @@ void CrashHandler::installCrashHandler() {

CrashHandler::crash(buffer.view());
});

std::atexit([]() {
CrashHandler* ch = CrashHandler::_theCrashHandler;
Member

Maybe do a

Suggested change
CrashHandler* ch = CrashHandler::_theCrashHandler;
auto* ch = CrashHandler::_theCrashHandler.exchange(nullptr);

instead? And similarly in std::at_quick_exit and ~CrashHandler(). That could avoid races between ~CrashHandler() and the exit handlers.

Member Author

Done. As far as I can see, only atexit and at_quick_exit need this. The destructor already handles this with _threadRunning, and shutdownCrashHandler acquires a mutex anyway and then checks the thread pointer.

if (ch != nullptr) {
ch->shutdownCrashHandler();
}
});
}

/// @brief shuts down the crash handler thread
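The exchange(nullptr) suggestion from the review thread above can be sketched as follows. Whoever swaps the pointer out first receives the non-null value and performs the shutdown; every later caller sees nullptr and does nothing. `Handler`, `theHandler`, and `shutdownIfOwner` are hypothetical names for this sketch and do not reflect the actual ArangoDB implementation.

```cpp
#include <atomic>

// Hypothetical stand-in for the crash handler object.
struct Handler {
  int shutdownCalls = 0;
  void shutdown() { ++shutdownCalls; }
};

// Global pointer that exit handlers can reach (assumed layout).
std::atomic<Handler*> theHandler{nullptr};

// Returns true if this caller performed the shutdown, false if another
// caller had already claimed the pointer (or none was registered).
bool shutdownIfOwner() {
  // exchange atomically reads the old value and writes nullptr, so at
  // most one caller can ever observe the non-null pointer.
  Handler* h = theHandler.exchange(nullptr);
  if (h == nullptr) {
    return false;
  }
  h->shutdown();
  return true;
}
```

A function like this could be registered from several places, for example with std::atexit([] { shutdownIfOwner(); }), without the shutdown running more than once.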
38 changes: 31 additions & 7 deletions lib/CrashHandler/CrashHandler.h
@@ -23,11 +23,12 @@

#pragma once

#include <atomic>
#include <string_view>

namespace arangodb {

/// @brief States for the crash handler thread coordination
/// @brief States for the crash handler thread coordination.
/// The state of the dedicated crash handler starts with IDLE
/// and moves to CRASH_DETECTED when a crash is detected. When the
/// thread has handled the situation it goes to HANDLING_COMPLETE
@@ -42,7 +43,29 @@ enum class CrashHandlerState : int {
};
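The state progression described in the comment above (IDLE, then CRASH_DETECTED, then HANDLING_COMPLETE) can be illustrated with a small handshake between two threads. This is only a sketch: commit 3704b8e uses atomic wait/notify_all, whereas the code below polls with std::this_thread::yield() so that it does not require C++20. `State`, `state`, `handled`, `handlerThreadBody`, and `triggerAndWait` are hypothetical names.

```cpp
#include <atomic>
#include <thread>

// Hypothetical copy of the three states named in the comment.
enum class State : int { IDLE, CRASH_DETECTED, HANDLING_COMPLETE };

std::atomic<State> state{State::IDLE};
std::atomic<int> handled{0};

// Body of the dedicated thread: wait for a crash, handle it, report back.
void handlerThreadBody() {
  while (state.load(std::memory_order_acquire) != State::CRASH_DETECTED) {
    std::this_thread::yield();  // polling stands in for atomic wait()
  }
  handled.fetch_add(1);  // placeholder for the real crash handling work
  state.store(State::HANDLING_COMPLETE, std::memory_order_release);
}

// Called by the crashing side: signal the thread, then wait for completion.
void triggerAndWait() {
  state.store(State::CRASH_DETECTED, std::memory_order_release);
  while (state.load(std::memory_order_acquire) != State::HANDLING_COMPLETE) {
    std::this_thread::yield();  // polling stands in for atomic wait()
  }
}
```

The release store paired with the acquire load ensures that whatever the signalling side wrote before changing the state is visible to the handler thread once it observes the new state.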

class CrashHandler {
std::atomic<bool> _threadRunning{false};
static std::atomic<CrashHandler*> _theCrashHandler;
Comment on lines +46 to +47
Member

Can we omit _threadRunning, and do a CAS on _theCrashHandler instead?

Member Author

I think this would be possible, but I would rather leave it as it is. First of all, I think it is easier to read. But more importantly, to do the CAS I have to create the thread first, if I am not mistaken. I would rather not create the thread, then try to compare and swap it into _theCrashHandler, and then roll back the thread creation if that fails. With _threadRunning, this is a simple atomic race with no further consequences if we lose it.

// This needs to be static for the signal handlers to reach!
public:
CrashHandler() {
// starts global background thread if not already done
bool threadRunning = _threadRunning.exchange(true);
if (!threadRunning) {
_theCrashHandler.store(this);
installCrashHandler();
}
}

~CrashHandler() {
// joins global background thread if running
bool threadRunning = _threadRunning.exchange(false);
if (threadRunning) {
shutdownCrashHandler();
}
_theCrashHandler.store(nullptr);
}

/// @brief log backtrace for current thread to logfile
static void logBacktrace();

@@ -67,18 +90,19 @@ class CrashHandler {
/// @brief disable printing of backtraces
static void disableBacktraces();

/// @brief installs the crash handler globally
static void installCrashHandler();

/// @brief shuts down the crash handler thread
static void shutdownCrashHandler();

/// @brief triggers the crash handler thread to handle a crash
/// @return true if successfully triggered, false if already in progress
static void triggerCrashHandler();

/// @brief waits for the crash handler thread to complete its work
static void waitForCrashHandlerCompletion();

private:
/// @brief installs the crash handler globally
static void installCrashHandler();

/// @brief shuts down the crash handler thread
static void shutdownCrashHandler();
};

} // namespace arangodb
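The flag-based arbitration that the author describes in the review thread above (decide the race with an atomic exchange on a boolean before doing any expensive setup) might look roughly like this. The sketch assumes the flag is shared between all instances, which this commit's header does not state explicitly; `running`, `starts`, `stops`, and `Guard` are hypothetical names.

```cpp
#include <atomic>

// Hypothetical shared flag and counters (illustration only).
std::atomic<bool> running{false};
std::atomic<int> starts{0};
std::atomic<int> stops{0};

class Guard {
 public:
  Guard() {
    // exchange returns the previous value, so only the first instance
    // sees false and performs the (expensive) start.
    if (!running.exchange(true)) {
      starts.fetch_add(1);  // placeholder for starting the thread
    }
  }

  ~Guard() {
    // Only the destructor that flips the flag back performs the stop.
    if (running.exchange(false)) {
      stops.fetch_add(1);  // placeholder for joining the thread
    }
  }
};
```

Losing the race costs nothing here, since no thread has been created yet when the flag is tested. One property of this sketch worth noting: the instance that is destroyed first performs the stop, even if another instance is still alive.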