-
Notifications
You must be signed in to change notification settings - Fork 68
[release/2.5][ROCm] update state check for test_trace_while_active* #2110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
When timing in enabled, ROCR runtime used to sleep for a small amount which ensured that the application saw the correct state. However, for perf reasons this sleep was removed and now the state is not guaranteed to be "started". That's why, I updated the test state check to be either "started" or "scheduled"
Jenkins build for e6605e94216ab368150c5673bb23cd5f73c3faf5 commit finished as FAILURE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which ROCm version does this apply to? Do we need ROCm version checks? Also, should this be upstreamed? If yes, please file a draft PR and post a link to it in this PR description. Will this PR need to be cherry-picked to other branches as well?
@jithunnair-amd - This change was made in rocm6.4, I can add a 6.4 check. However, this failure was only caught on navi gpu with intel cpu. Not sure why it doesn't happen in other cases. We will definitely need to cherry-pick into future release branches. But, should I add a NAVI check also? |
Jenkins build for 706fe4bdae5cd815cacbb8323dfebec148fd2eba commit finished as FAILURE |
Jenkins build for 706fe4bdae5cd815cacbb8323dfebec148fd2eba commit finished as FAILURE |
Hmm, that's an intricacy. I'm not sure based on the above datapoints if this is a Navi-exclusive issue, or if it was only discovered on Navi. And even if it's a Navi-specific issue today, will it become an MI issue down the line as well. I feel comfortable applying this patch to all GPUs, Navi or otherwise. Paging @jeffdaily for a second opinion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Not sure if the way we get ROCm version in this PR is the same as in other cases, since it uses an internal function (indicated by the "_" in the beginning)
Jenkins build for 706fe4bdae5cd815cacbb8323dfebec148fd2eba commit finished as FAILURE |
Jenkins build for 706fe4bdae5cd815cacbb8323dfebec148fd2eba commit finished as FAILURE |
I'm fine with the change. However there are other places in this test file where |
@pragupta That's a fair question. Can we avoid some rework by making sure if those tests also need this update? Based on this PR's description, it'd seem so. If that's the case, I'd suggest we refactor |
@jithunnair-amd / @jeffdaily -- I can make the changes as suggested. However, as I mentioned earlier, this was caught in a very specific AMD GPU and Intel CPU case on NAVI. Even on that machine, I don't see any other UTs failing. So, while it's a clean solution, don't know if we need to impact other UTs. |
Sounds good, thanks for checking. Let's park that idea for a later point, if needed. |
!cherry-pick --onto release/2.6 |
…OCm#2110) When timing in enabled, ROCR runtime used to sleep for a small amount which ensured that the application saw the correct state. However, for perf reasons this sleep was removed and now the state is not guaranteed to be "started". That's why, I updated the test state check to be either "started" or "scheduled" Fixes https://ontrack-internal.amd.com/browse/SWDEV-525883 Upstream PR: pytorch#153545 (cherry picked from commit 8a1ad2c)
…OCm#2110) When timing in enabled, ROCR runtime used to sleep for a small amount which ensured that the application saw the correct state. However, for perf reasons this sleep was removed and now the state is not guaranteed to be "started". That's why, I updated the test state check to be either "started" or "scheduled" Fixes https://ontrack-internal.amd.com/browse/SWDEV-525883 Upstream PR: pytorch#153545 (cherry picked from commit 8a1ad2c)
#2201) …#2110) When timing in enabled, ROCR runtime used to sleep for a small amount which ensured that the application saw the correct state. However, for perf reasons this sleep was removed and now the state is not guaranteed to be "started". That's why, I updated the test state check to be either "started" or "scheduled" Fixes https://ontrack-internal.amd.com/browse/SWDEV-525883 Upstream PR: pytorch#153545 (cherry picked from commit 8a1ad2c) Fixes #ISSUE_NUMBER
#2202) …#2110) When timing in enabled, ROCR runtime used to sleep for a small amount which ensured that the application saw the correct state. However, for perf reasons this sleep was removed and now the state is not guaranteed to be "started". That's why, I updated the test state check to be either "started" or "scheduled" Fixes https://ontrack-internal.amd.com/browse/SWDEV-525883 Upstream PR: pytorch#153545 (cherry picked from commit 8a1ad2c) Fixes #ISSUE_NUMBER
When timing in enabled, ROCR runtime used to sleep for a small amount which ensured that the application saw the correct state. However, for perf reasons this sleep was removed and now the state is not guaranteed to be "started". That's why, I updated the test state check to be either "started" or "scheduled"
Fixes https://ontrack-internal.amd.com/browse/SWDEV-525883
Upstream PR: pytorch#153545