8000 G-API: Fix streaming hangs for ConstSource & Improve troubleshooting logger messages by sivanov-work · Pull Request #21567 · opencv/opencv · GitHub
[go: up one dir, main page]

Skip to content

G-API: Fix streaming hangs for ConstSource & Improve troubleshooting logger messages #21567

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Feb 9, 2022

Conversation

sivanov-work
Copy link
Contributor
@sivanov-work sivanov-work commented Feb 4, 2022

Sometimes by using ConstSource (detected on scenario with Roi generator & VPL Source with VPP preproc) it is possible to get a stuck on end-of-pipeline:
-VideoEmitter resends Stop to ConstEmitter on end-of-stream
-Executor got stop from VideoEmitter and also tries to rewindQueue to extract Stop from other sources (ConstEmiter) but stuck on extracting cmd from other empty queue in rewindToStop
-Meantime ConstEmitter has got Stop from VideoEmitter and tries to push it on queue and got stuck, because this queue is not processed yet on rewindStop, which had stucked on pop from other empty queue

Proposed solution:
To modify rewindToStop to make it unblocking: It analyzed valid queues count (minus 1 as itself) and try to extract Stop Cmds in unblocking manner (try_pop) until received Stop count is not equal to expected queues number

Result:
sample VPL+Preproc+ConstSource doesn't hang anymore
UTs passed
Added troubleshooting info messages

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

Build configuration

force_builders=Custom,Custom Win,Custom Mac
build_gapi_standalone:Linux x64=ade-0.1.1f
build_gapi_standalone:Win64=ade-0.1.1f
build_gapi_standalone:Mac=ade-0.1.1f
build_gapi_standalone:Linux x64 Debug=ade-0.1.1f

Xbuild_image:Custom=centos:7
Xbuildworker:Custom=linux-1
build_gapi_standalone:Custom=ade-0.1.1f

build_image:Custom=ubuntu-openvino-2021.4.1:20.04
build_image:Custom Win=openvino-2021.4.1
build_image:Custom Mac=openvino-2021.4.1

test_modules:Custom=gapi,python2,python3,java
test_modules:Custom Win=gapi,python2,python3,java
test_modules:Custom Mac=gapi,python2,python3,java

buildworker:Custom=linux-1
# disabled due high memory usage: test_opencl:Custom=ON
test_opencl:Custom=OFF
test_bigdata:Custom=1
test_filter:Custom=*

@@ -574,7 +576,8 @@ static void setROIBlob(InferenceEngine::InferRequest& req,
const IE::Blob::Ptr& blob,
const cv::Rect &roi,
const IECallContext& ctx) {
if (ctx.uu.params.device_id.find("GPU") != std::string::npos) {
if (ctx.uu.params.device_id.find("GPU") != std::string::npos &&
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to delete: not from this issue

@@ -21,6 +21,8 @@
#include <atomic>
#include <tuple>

#include <opencv2/imgproc.hpp>
//#include <opencv2/highgui.hpp> // CommandLineParser
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to delete: not from this issue

@@ -732,8 +732,9 @@ void cv::gimpl::ie::GIEExecutable::run(cv::gimpl::GIslandExecutable::IInput &in

if (cv::util::holds_alternative<cv::gimpl::EndOfStream>(in_msg))
{
// (3) Wait until all passed task are done.
GAPI_LOG_INFO(nullptr, "Got EndOfStream - wait until all passed task are done");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Troubleshooting & Logging BKMs:

INFO level in log here because pipeline is intending to stop: it's once-time event, not need to worry about performance due to printout here

@@ -497,6 +525,7 @@ void emitterActorThread(std::shared_ptr<cv::gimpl::GIslandEmitter> emitter,
|| cv::util::holds_alternative<Stop>(cmd));
if (cv::util::holds_alternative<Stop>(cmd))
{
GAPI_LOG_INFO(nullptr, "Early stop for: " << emitter->name() << " detected");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Troubleshooting & Logging BKMs:

Log incoming event up right before process it (possible stuck on push())

@@ -506,6 +535,7 @@ void emitterActorThread(std::shared_ptr<cv::gimpl::GIslandEmitter> emitter,
GAPI_ITT_STATIC_LOCAL_HANDLE(emitter_push_hndl, "emitter_push");

// Now start emitting the data from the source to the pipeline.
GAPI_LOG_INFO(nullptr, "starting: " << emitter->name());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Troubleshooting & Logging BKMs:

Log every busy-up worker tread right before starting: to link further logger messages by thread number with actual actor by name:

DEBUG(:1)@timestamp1     starting: VideoEmitter  <--- mark thread :1 as VideoEmitter
DEBUG:2@timestamp2     blah blah blah
DEBUG:3@timestamp2     blah blah blah
...
DEBUG(:1)@timestamp2     blah blah blah        <---belong to VideoEmitter

@@ -1119,6 +1183,8 @@ class cv::gimpl::GStreamingExecutor::Synchronizer final {
// Start a thread which will handle the synchronization.
// Do nothing if synchronization is not needed
void start() {
GAPI_LOG_INFO(nullptr, "try to start syncActorThread for emitters count: " <<
m_synchronized_emitters.size());
if (m_synchronized_emitters.size() != 0) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Troubleshooting & Logging BKMs:

print out COUNT of object BEFORE condition serve TWO purposes:
-to check actual objects count :)
-to find out is the next condition true or false

So we use only one logger row and do not require to logging approaches for different if-branches

Copy link
Contributor
@TolyaTalamanov TolyaTalamanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, from your description I didn't get how it's possible:

but stuck on extracting cmd from other empty queue in rewindToStop

What is the empty queue are u referring to?

@@ -732,8 +732,9 @@ void cv::gimpl::ie::GIEExecutable::run(cv::gimpl::GIslandExecutable::IInput &in

if (cv::util::holds_alternative<cv::gimpl::EndOfStream>(in_msg))
{
// (3) Wait until all passed task are done.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why it's needed to remove comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because comment message is exposed in logger message below - no need to duplicate it in code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought, that we've agreed that logging must be considered separately from this PR

@@ -63,6 +63,7 @@ class GStreamingIntrinExecutable final: public cv::gimpl::GIslandExecutable
const cv::GCompileArgs &,
const std::vector<ade::NodeHandle> &);

const char* name() const override { return "GStreamingIntrinExecutable"; };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about backends that are placed in different submodules?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. These quite invasive changes must be dropped from this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's about the following proposals: https://github.com/opencv/opencv/pull/21567/files#r801291452 ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't make hasty decisions besides it doesn't apply to the problem you're trying to fix

@@ -186,6 +187,7 @@ class GIslandEmitter
// Obtain next value from the emitter
virtual bool pull(GRunArg &) = 0;
virtual ~GIslandEmitter() = default;
virtual const char* name() const = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the new pure virtual method might broke the code which is being developed in other submodules

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. All private modules will break after these changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Answered above

@@ -30,6 +30,8 @@ class GraphMetaExecutable final: public cv::gimpl::GIslandExecutable {

void run(std::vector<InObj> &&input_objs,
std::vector<OutObj> &&output_objs) override;

const char* name() const override { return "GraphMetaExecutable"; };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be generic for all backend?

typeid(*this).name()

or something this :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really required for this particular fix??
If it is about error handling and logging, I believe we need to introduce these changes somewhere later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TolyaTalamanov as it was discussed offline - it it possible to put non-pure-virtual function in base class but with some default value "unspecified_executable" which has no necessity in overriding in descendant classes an will not brake anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmatveev
I see you point, but from my perspective logging & troubleshooting must serve purposes to resolve BUGs & simplify further RootCauseAnalysis:
It should be avoided to put log in spontaneous places spontaneously, thus current logger messages related to investigation & troubleshooting for existing problem. Also it allows to find & investigate further problem appearance too

These are the reasons why I put it here and why this PR is called Fix streaming hangs for ConstSource & Improve troubleshooting logger messages

Copy link
Contributor
@dmatveev dmatveev Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do that but please use a less invasive method instead.
@AsyaPronina was able to add ITT tracing for all islands uniformly what works fine across platforms.

Here it is https://github.com/opencv/opencv/blob/4.x/modules/gapi/src/compiler/gislandmodel.hpp#L297

auto &q2 = ade::util::value(qit);
if (this_id == id2) continue;
// Will draining all queue & waiting for stop cmd for all queues except own queue
size_t expected_stop_count = std::count_if(in_queues.begin(), in_queues.end(), [] (const Q* ptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remind me how queue can be nullptr, please?

10000
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -472,6 +499,7 @@ bool QueueReader::getResultsVector(std::vector<Q*> &in_queues,
// Collector thread never handles the inputs directly
// (collector's input queues are always produced by
// islands in the graph).
GAPI_LOG_INFO(nullptr, "try on theory...");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't get the log msg

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on top

            // In theory, the CNST should never reach here.
            // Collector thread never handles the inputs directly
            // (collector's input queues are always produced by
            // islands in the graph).

I have suspicions about comments in code starting from "In theory..." therefor I suppose a weak in program logic in this particular place - so it should be wrapped in logger at least.

I agree that "try on theory..." is a bad message and must be rephrased - @dmatveev , @TolyaTalamanov could you help me to produce a correct message here? because i do not understand "In theory..." comment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

therefor I suppose a weak in program logic in this particular place - so it should be wrapped in logger at least.

Let me explain what it means. The code is written here in assumption that something never happens. This assumption is indeed true at the point when the code is written, but things change and years later someone can change something and break old contracts. We've had these examples already

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the point is not to tell in comments anything for sure. This shouldn't be relevant to the logs you write, though.

out_iter->ready = true;
m_postIdx.erase(it); // Drop the link from the cache anyway
if (out_iter != m_postings[out_idx].begin())
bool stop_triggered = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be under mutex as well, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, we only need this flag for logging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be not under mutex protection because it is not shared resource but stack-variable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we do: Stop event is rare in comparison with regular event and require special processing - thus it is worth to know about it.
This slight refactoring is used to make printout GAPI_LOG_INFO without mutex protection to prevent pessimization

got_cmd = q2->try_pop(cmd);
if (got_cmd && cv::util::holds_alternative<Stop>(cmd)) {
got_stop_count ++;
GAPI_LOG_INFO(nullptr, "got stop from id: " << id2 <<
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need such kind of logs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it exposes information about how many queues responded by stop and which are not stopped yet

@dmatveev dmatveev self-assigned this Feb 7, 2022
@dmatveev dmatveev added this to the 4.6.0 milestone Feb 7, 2022
Copy link
Contributor
@dmatveev dmatveev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this PR has more changes than really needed

@@ -30,6 +30,8 @@ class GraphMetaExecutable final: public cv::gimpl::GIslandExecutable {

void run(std::vector<InObj> &&input_objs,
std::vector<OutObj> &&output_objs) override;

const char* name() const override { return "GraphMetaExecutable"; };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really required for this particular fix??
If it is about error handling and logging, I believe we need to introduce these changes somewhere later

@@ -732,8 +732,9 @@ void cv::gimpl::ie::GIEExecutable::run(cv::gimpl::GIslandExecutable::IInput &in

if (cv::util::holds_alternative<cv::gimpl::EndOfStream>(in_msg))
{
// (3) Wait until all passed task are done.
GAPI_LOG_INFO(nullptr, "Got EndOfStream - wait until all passed task are done");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tasks

m_reqPool->waitAll();
GAPI_LOG_INFO(nullptr, "all done");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so would it spam console by default here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As i supposed EOS is a rare message (isn't it?) and must have been faced only once.
Are my assumptions correct? If yes - no spam here, only countable output

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EOS may happen multiple times especially in start/stop scenarios.

@@ -63,6 +63,7 @@ class GStreamingIntrinExecutable final: public cv::gimpl::GIslandExecutable
const cv::GCompileArgs &,
const std::vector<ade::NodeHandle> &);

const char* name() const override { return "GStreamingIntrinExecutable"; };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. These quite invasive changes must be dropped from this PR.

@@ -186,6 +187,7 @@ class GIslandEmitter
// Obtain next value from the emitter
virtual bool pull(GRunArg &) = 0;
virtual ~GIslandEmitter() = default;
virtual const char* name() const = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. All private modules will break after these changes.

GAPI_LOG_INFO(nullptr, "id: " << this_id << ", queues count: " << in_queues.size() <<
", expected stop msg count: " << expected_stop_count);
size_t got_stop_count = 0;
while(got_stop_count < expected_stop_count) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space after while

auto &q2 = ade::util::value(qit);
if (this_id == id2) continue;

GAPI_LOG_INFO(nullptr, "drain next id: " << id2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe these logs are cryptic for those who'll read it. Also, why drain?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logs would be read with

"rewindToStop - id: XXX, queues count: YYYY,  expected stop msg count: ZZZ"
"rewindToStop - drain next id: 0"
"rewindToStop - drain next id: 1"
"rewindToStop - got stop from id: 1 - (1/ZZZ)"
"rewindToStop - drain next id: 2"
"rewindToStop - got stop from id: 2 - (2/ZZZ)"
"rewindToStop - drain next id: 3"
...
"rewindToStop - drain next id: 0"
"rewindToStop - got stop from id: 0 - (ZZZ/ZZZ)"

together
and represent information about which queue is stopped but which is not yet
(It must have got actual name per each queue instead of id number in further logging improvements)

@@ -385,6 +410,7 @@ bool QueueReader::getInputVector(std::vector<Q*> &in_queues,
const auto &stop = cv::util::get<Stop>(m_cmd[id]);
if (stop.kind == Stop::Kind::CNST)
{
GAPI_LOG_INFO(nullptr, "Got `Stop::Kind::CNST`")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we better introduce << for our enums instead. But, again, it should be introduced separately & reviewed well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree,
not in this PR.

without << this message may be rephrased as "Got Stop from Constant Source" (is it correct?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep

@@ -472,6 +499,7 @@ bool QueueReader::getResultsVector(std::vector<Q*> &in_queues,
// Collector thread never handles the inputs directly
// (collector's input queues are always produced by
// islands in the graph).
GAPI_LOG_INFO(nullptr, "try on theory...");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -801,46 +849,56 @@ class StreamingOutput final: public cv::gimpl::GIslandExecutable::IOutput
GAPI_ITT_STATIC_LOCAL_HANDLE(outputs_post_hndl, "StreamingOutput::post");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the fix was in rewind already. So why do we need these changes now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, could you clarify with fix you meant here? because you put comment on GAPI_ITT_STATIC_LOCAL_HANDLE(outputs_post_hndl, "StreamingOutput::post"); line without modifications

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is why do we need all these below changes

@sivanov-work
Copy link
Contributor Author

To get rid of further discussion - all troubleshooting info were reverted to original state.

Copy link
Contributor
@dmatveev dmatveev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

// because it had got stop message before rewind was called
expected_stop_count--;
}
GAPI_LOG_DEBUG(nullptr, "id: " << this_id << ", queues count: " << in_queues.size() <<
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure that it's can somehow help user

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a debug message, we have some already so I'd not care about this one

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, then @sivanov-work you can ignore these comments :)

@TolyaTalamanov TolyaTalamanov self-requested a review February 9, 2022 11:32
Copy link
Contributor
@rgarnov rgarnov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alalek alalek merged commit d5ecb5f into opencv:4.x Feb 9, 2022
@alalek alalek mentioned this pull request Feb 22, 2022
@sivanov-work sivanov-work deleted the gapi_streaming_fix branch March 5, 2022 05:48
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
G-API: Fix streaming hangs for ConstSource & Improve troubleshooting logger messages
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0