-
-
Notifications
You must be signed in to change notification settings - Fork 56.2k
G-API: Fix streaming hangs for ConstSource & Improve troubleshooting logger messages #21567
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -574,7 +576,8 @@ static void setROIBlob(InferenceEngine::InferRequest& req, | |||
const IE::Blob::Ptr& blob, | |||
const cv::Rect &roi, | |||
const IECallContext& ctx) { | |||
if (ctx.uu.params.device_id.find("GPU") != std::string::npos) { | |||
if (ctx.uu.params.device_id.find("GPU") != std::string::npos && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to delete: not from this issue
@@ -21,6 +21,8 @@ | |||
#include <atomic> | |||
#include <tuple> | |||
|
|||
#include <opencv2/imgproc.hpp> | |||
//#include <opencv2/highgui.hpp> // CommandLineParser |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to delete: not from this issue
@@ -732,8 +732,9 @@ void cv::gimpl::ie::GIEExecutable::run(cv::gimpl::GIslandExecutable::IInput &in | |||
|
|||
if (cv::util::holds_alternative<cv::gimpl::EndOfStream>(in_msg)) | |||
{ | |||
// (3) Wait until all passed task are done. | |||
GAPI_LOG_INFO(nullptr, "Got EndOfStream - wait until all passed task are done"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Troubleshooting & Logging BKMs:
INFO level in log here because pipeline is intending to stop: it's once-time event, not need to worry about performance due to printout here
@@ -497,6 +525,7 @@ void emitterActorThread(std::shared_ptr<cv::gimpl::GIslandEmitter> emitter, | |||
|| cv::util::holds_alternative<Stop>(cmd)); | |||
if (cv::util::holds_alternative<Stop>(cmd)) | |||
{ | |||
GAPI_LOG_INFO(nullptr, "Early stop for: " << emitter->name() << " detected"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Troubleshooting & Logging BKMs:
Log incoming event up right before process it (possible stuck on push())
@@ -506,6 +535,7 @@ void emitterActorThread(std::shared_ptr<cv::gimpl::GIslandEmitter> emitter, | |||
GAPI_ITT_STATIC_LOCAL_HANDLE(emitter_push_hndl, "emitter_push"); | |||
|
|||
// Now start emitting the data from the source to the pipeline. | |||
GAPI_LOG_INFO(nullptr, "starting: " << emitter->name()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Troubleshooting & Logging BKMs:
Log every busy-up worker tread right before starting: to link further logger messages by thread number with actual actor by name:
DEBUG(:1)@timestamp1 starting: VideoEmitter <--- mark thread :1 as VideoEmitter
DEBUG:2@timestamp2 blah blah blah
DEBUG:3@timestamp2 blah blah blah
...
DEBUG(:1)@timestamp2 blah blah blah <---belong to VideoEmitter
@@ -1119,6 +1183,8 @@ class cv::gimpl::GStreamingExecutor::Synchronizer final { | |||
// Start a thread which will handle the synchronization. | |||
// Do nothing if synchronization is not needed | |||
void start() { | |||
GAPI_LOG_INFO(nullptr, "try to start syncActorThread for emitters count: " << | |||
m_synchronized_emitters.size()); | |||
if (m_synchronized_emitters.size() != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Troubleshooting & Logging BKMs:
print out COUNT of object BEFORE condition serve TWO purposes:
-to check actual objects count :)
-to find out is the next condition true or false
So we use only one logger row and do not require to logging approaches for different if-branches
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, from your description I didn't get how it's possible:
but stuck on extracting cmd from other empty queue in rewindToStop
What is the empty queue are u referring to?
@@ -732,8 +732,9 @@ void cv::gimpl::ie::GIEExecutable::run(cv::gimpl::GIslandExecutable::IInput &in | |||
|
|||
if (cv::util::holds_alternative<cv::gimpl::EndOfStream>(in_msg)) | |||
{ | |||
// (3) Wait until all passed task are done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why it's needed to remove comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because comment message is exposed in logger message below - no need to duplicate it in code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought, that we've agreed that logging must be considered separately from this PR
@@ -63,6 +63,7 @@ class GStreamingIntrinExecutable final: public cv::gimpl::GIslandExecutable | |||
const cv::GCompileArgs &, | |||
const std::vector<ade::NodeHandle> &); | |||
|
|||
const char* name() const override { return "GStreamingIntrinExecutable"; }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about backends that are placed in different submodules?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. These quite invasive changes must be dropped from this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's about the following proposals: https://github.com/opencv/opencv/pull/21567/files#r801291452 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't make hasty decisions besides it doesn't apply to the problem you're trying to fix
@@ -186,6 +187,7 @@ class GIslandEmitter | |||
// Obtain next value from the emitter | |||
virtual bool pull(GRunArg &) = 0; | |||
virtual ~GIslandEmitter() = default; | |||
virtual const char* name() const = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding the new pure virtual method might broke the code which is being developed in other submodules
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. All private modules will break after these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Answered above
modules/gapi/src/backends/common/gmetabackend.cpp
Outdated
1E0A
@@ -30,6 +30,8 @@ class GraphMetaExecutable final: public cv::gimpl::GIslandExecutable { | |||
|
|||
void run(std::vector<InObj> &&input_objs, | |||
std::vector<OutObj> &&output_objs) override; | |||
|
|||
const char* name() const override { return "GraphMetaExecutable"; }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it be generic for all backend?
typeid(*this).name()
or something this :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really required for this particular fix??
If it is about error handling and logging, I believe we need to introduce these changes somewhere later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TolyaTalamanov as it was discussed offline - it it possible to put non-pure-virtual function in base class but with some default value "unspecified_executable" which has no necessity in overriding in descendant classes an will not brake anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dmatveev
I see you point, but from my perspective logging & troubleshooting must serve purposes to resolve BUGs & simplify further RootCauseAnalysis:
It should be avoided to put log in spontaneous places spontaneously, thus current logger messages related to investigation & troubleshooting for existing problem. Also it allows to find & investigate further problem appearance too
These are the reasons why I put it here and why this PR is called Fix streaming hangs for ConstSource & Improve troubleshooting logger messages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do that but please use a less invasive method instead.
@AsyaPronina was able to add ITT tracing for all islands uniformly what works fine across platforms.
Here it is https://github.com/opencv/opencv/blob/4.x/modules/gapi/src/compiler/gislandmodel.hpp#L297
auto &q2 = ade::util::value(qit); | ||
if (this_id == id2) continue; | ||
// Will draining all queue & waiting for stop cmd for all queues except own queue | ||
size_t expected_stop_count = std::count_if(in_queues.begin(), in_queues.end(), [] (const Q* ptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you remind me how queue can be nullptr, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -472,6 +499,7 @@ bool QueueReader::getResultsVector(std::vector<Q*> &in_queues, | |||
// Collector thread never handles the inputs directly | |||
// (collector's input queues are always produced by | |||
// islands in the graph). | |||
GAPI_LOG_INFO(nullptr, "try on theory..."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't get the log msg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment on top
// In theory, the CNST should never reach here.
// Collector thread never handles the inputs directly
// (collector's input queues are always produced by
// islands in the graph).
I have suspicions about comments in code starting from "In theory..." therefor I suppose a weak in program logic in this particular place - so it should be wrapped in logger at least.
I agree that "try on theory..." is a bad message and must be rephrased - @dmatveev , @TolyaTalamanov could you help me to produce a correct message here? because i do not understand "In theory..." comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
therefor I suppose a weak in program logic in this particular place - so it should be wrapped in logger at least.
Let me explain what it means. The code is written here in assumption that something never happens. This assumption is indeed true at the point when the code is written, but things change and years later someone can change something and break old contracts. We've had these examples already
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the point is not to tell in comments anything for sure. This shouldn't be relevant to the logs you write, though.
out_iter->ready = true; | ||
m_postIdx.erase(it); // Drop the link from the cache anyway | ||
if (out_iter != m_postings[out_idx].begin()) | ||
bool stop_triggered = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be under mutex
as well, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, we only need this flag for logging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should be not under mutex protection because it is not shared resource but stack-variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we do: Stop event is rare in comparison with regular event and require special processing - thus it is worth to know about it.
This slight refactoring is used to make printout GAPI_LOG_INFO without mutex protection to prevent pessimization
got_cmd = q2->try_pop(cmd); | ||
if (got_cmd && cv::util::holds_alternative<Stop>(cmd)) { | ||
got_stop_count ++; | ||
GAPI_LOG_INFO(nullptr, "got stop from id: " << id2 << |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need such kind of logs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it exposes information about how many queues responded by stop and which are not stopped yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this PR has more changes than really needed
@@ -30,6 +30,8 @@ class GraphMetaExecutable final: public cv::gimpl::GIslandExecutable { | |||
|
|||
void run(std::vector<InObj> &&input_objs, | |||
std::vector<OutObj> &&output_objs) override; | |||
|
|||
const char* name() const override { return "GraphMetaExecutable"; }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really required for this particular fix??
If it is about error handling and logging, I believe we need to introduce these changes somewhere later
@@ -732,8 +732,9 @@ void cv::gimpl::ie::GIEExecutable::run(cv::gimpl::GIslandExecutable::IInput &in | |||
|
|||
if (cv::util::holds_alternative<cv::gimpl::EndOfStream>(in_msg)) | |||
{ | |||
// (3) Wait until all passed task are done. | |||
GAPI_LOG_INFO(nullptr, "Got EndOfStream - wait until all passed task are done"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tasks
m_reqPool->waitAll(); | ||
GAPI_LOG_INFO(nullptr, "all done"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so would it spam console by default here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As i supposed EOS is a rare message (isn't it?) and must have been faced only once.
Are my assumptions correct? If yes - no spam here, only countable output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EOS may happen multiple times especially in start/stop scenarios.
@@ -63,6 +63,7 @@ class GStreamingIntrinExecutable final: public cv::gimpl::GIslandExecutable | |||
const cv::GCompileArgs &, | |||
const std::vector<ade::NodeHandle> &); | |||
|
|||
const char* name() const override { return "GStreamingIntrinExecutable"; }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. These quite invasive changes must be dropped from this PR.
@@ -186,6 +187,7 @@ class GIslandEmitter | |||
// Obtain next value from the emitter | |||
virtual bool pull(GRunArg &) = 0; | |||
virtual ~GIslandEmitter() = default; | |||
virtual const char* name() const = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. All private modules will break after these changes.
GAPI_LOG_INFO(nullptr, "id: " << this_id << ", queues count: " << in_queues.size() << | ||
", expected stop msg count: " << expected_stop_count); | ||
size_t got_stop_count = 0; | ||
while(got_stop_count < expected_stop_count) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space after while
auto &q2 = ade::util::value(qit); | ||
if (this_id == id2) continue; | ||
|
||
GAPI_LOG_INFO(nullptr, "drain next id: " << id2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe these logs are cryptic for those who'll read it. Also, why drain
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this logs would be read with
"rewindToStop - id: XXX, queues count: YYYY, expected stop msg count: ZZZ"
"rewindToStop - drain next id: 0"
"rewindToStop - drain next id: 1"
"rewindToStop - got stop from id: 1 - (1/ZZZ)"
"rewindToStop - drain next id: 2"
"rewindToStop - got stop from id: 2 - (2/ZZZ)"
"rewindToStop - drain next id: 3"
...
"rewindToStop - drain next id: 0"
"rewindToStop - got stop from id: 0 - (ZZZ/ZZZ)"
together
and represent information about which queue is stopped but which is not yet
(It must have got actual name
per each queue
instead of id number in further logging improvements)
@@ -385,6 +410,7 @@ bool QueueReader::getInputVector(std::vector<Q*> &in_queues, | |||
const auto &stop = cv::util::get<Stop>(m_cmd[id]); | |||
if (stop.kind == Stop::Kind::CNST) | |||
{ | |||
GAPI_LOG_INFO(nullptr, "Got `Stop::Kind::CNST`") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we better introduce <<
for our enums instead. But, again, it should be introduced separately & reviewed well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree,
not in this PR.
without <<
this message may be rephrased as "Got Stop from Constant Source" (is it correct?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep
@@ -472,6 +499,7 @@ bool QueueReader::getResultsVector(std::vector<Q*> &in_queues, | |||
// Collector thread never handles the inputs directly | |||
// (collector's input queues are always produced by | |||
// islands in the graph). | |||
GAPI_LOG_INFO(nullptr, "try on theory..."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@@ -801,46 +849,56 @@ class StreamingOutput final: public cv::gimpl::GIslandExecutable::IOutput | |||
GAPI_ITT_STATIC_LOCAL_HANDLE(outputs_post_hndl, "StreamingOutput::post"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the fix was in rewind
already. So why do we need these changes now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, could you clarify with fix you meant here? because you put comment on GAPI_ITT_STATIC_LOCAL_HANDLE(outputs_post_hndl, "StreamingOutput::post");
line without modifications
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question is why do we need all these below changes
To get rid of further discussion - all troubleshooting info were reverted to original state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
// because it had got stop message before rewind was called | ||
expected_stop_count--; | ||
} | ||
GAPI_LOG_DEBUG(nullptr, "id: " << this_id << ", queues count: " << in_queues.size() << |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure that it's can somehow help user
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still a debug message, we have some already so I'd not care about this one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, then @sivanov-work you can ignore these comments :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
G-API: Fix streaming hangs for ConstSource & Improve troubleshooting logger messages
Sometimes by using ConstSource (detected on scenario with Roi generator & VPL Source with VPP preproc) it is possible to get a stuck on end-of-pipeline:
-VideoEmitter resends Stop to ConstEmitter on end-of-stream
-Executor got stop from VideoEmitter and also tries to rewindQueue to extract Stop from other sources (ConstEmiter) but stuck on extracting cmd from other empty queue in rewindToStop
-Meantime ConstEmitter has got Stop from VideoEmitter and tries to push it on queue and got stuck, because this queue is not processed yet on rewindStop, which had stucked on pop from other empty queue
Proposed solution:
To modify rewindToStop to make it unblocking: It analyzed valid queues count (minus 1 as itself) and try to extract Stop Cmds in unblocking manner (try_pop) until received Stop count is not equal to expected queues number
Result:
sample VPL+Preproc+ConstSource doesn't hang anymore
UTs passed
Added troubleshooting info messages
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.
Build configuration