UNSTABLE pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks) · Issue #149370 · pytorch/pytorch · GitHub

Open
malfet opened this issue Mar 18, 2025 · 10 comments
Assignees
laithsakka
Labels
  • module: ci (Related to continuous integration)
  • oncall: pt2
  • triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
  • unstable

Comments

malfet (Contributor) commented Mar 18, 2025

See https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=pr_time&mergeLF=true: the job passes and fails intermittently, with no apparent commit that could have started it.

cc @chauhang @penguinwu @seemethere @pytorch/pytorch-dev-infra

@pytorch-bot pytorch-bot bot added module: ci Related to continuous integration unstable labels Mar 18, 2025
pytorch-bot (bot) commented Mar 18, 2025
Hello there! From the UNSTABLE prefix in this issue title, it looks like you are attempting to unstable a job in PyTorch CI. The information I have parsed is below:
  • Job name: pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks)
  • Credential: malfet

Within ~15 minutes, pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks) and all of its dependants will be unstable in PyTorch CI. Please verify that the job name looks correct. With great power comes great responsibility.

@malfet malfet added oncall: pt2 and removed module: ci Related to continuous integration unstable labels Mar 18, 2025
@pytorch-bot pytorch-bot bot added module: ci Related to continuous integration unstable labels Mar 18, 2025
laithsakka (Contributor) commented

This should fix it: #149347

@laithsakka laithsakka self-assigned this Mar 18, 2025
@IvanKobzarev IvanKobzarev added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Mar 18, 2025
clee2000 (Contributor) commented Apr 4, 2025

The above fix did work, but the issue wasn't closed, so there have been a couple more occasions where the job got broken and then fixed while staying marked unstable, because this issue was never closed. The most recent revert (c93e34d) is green, so I am closing this now.

@clee2000 clee2000 closed this as completed Apr 4, 2025
laithsakka (Contributor) commented

I am landing #150264.
This should ensure it's OK, but we need to make sure it stays stable right after that PR lands. @clee2000

nWEIdia (Collaborator) commented May 14, 2025

How can I address the failures during the migration from cuda12.4 to cuda12.6? e.g. #151594 (comment)

nWEIdia (Collaborator) commented May 14, 2025

Re-opening as it reappeared in trunk and many PRs, though I am not sure why the HUD page seems to have ignored it. (Can this magic be applied to the cuda12.6 job too, or do I need to create a new issue marking it unstable?)

https://hud.pytorch.org/failure?name=pull%20%2F%20cuda12.4-py3.10-gcc9-sm75%20%2F%20test%20(pr_time_benchmarks%2C%201%2C%201%2C%20linux.g4dn.metal.nvidia.gpu)&jobName=cuda12.6-py3.10-gcc9-sm75%20%2F%20test%20(pr_time_benchmarks%2C%201%2C%201%2C%20linux.g4dn.metal.nvidia.gpu)&failureCaptures=MISSING%20REGRESSION%20TEST

(screenshot attached)

@nWEIdia nWEIdia reopened this May 14, 2025
laithsakka (Contributor) commented May 15, 2025

There have been so many changes this week. I am putting this PR up to test and update the results if needed. Can I get a stamp, just in case I need to update the results? #153481

(screenshot attached)

nWEIdia (Collaborator) commented May 15, 2025

Hi @laithsakka, do you know how to resolve the following errors while trying to reproduce the instruction count numbers?

collecting compile time instruction count for add_loop_eager
Failed to open instruction count event: Operation not permitted.
Error disabling perf event (fd: -1): Bad file descriptor
compile time instruction count for iteration 0 is 18446744073709551615
Failed to open instruction count event: Operation not permitted.
Error disabling perf event (fd: -1): Bad file descriptor
compile time instruction count for iteration 1 is 18446744073709551615

Does this require the installation of any perf tools? Thanks!
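
Aside: the 18446744073709551615 readings are just -1 reinterpreted as an unsigned 64-bit value, i.e. the counter could not be read. "Operation not permitted" when opening a perf event usually comes from the kernel's perf_event_paranoid setting (or from a missing hardware PMU, e.g. in some VMs or containers) rather than from a missing perf tool. A minimal sketch to check this, assuming a Linux host and using only the standard /proc interface, nothing PyTorch-specific:

```python
# Minimal check (assumes Linux): is unprivileged access to hardware perf
# counters likely allowed on this machine? This is not part of the PyTorch
# benchmark harness; it only inspects the standard kernel knob.
from pathlib import Path

PARANOID = Path("/proc/sys/kernel/perf_event_paranoid")

def hardware_counters_likely_available() -> bool:
    if not PARANOID.exists():
        print("perf_event_paranoid not found; perf events are probably unsupported here")
        return False
    level = int(PARANOID.read_text().strip())
    print(f"kernel.perf_event_paranoid = {level}")
    # Usual meanings: -1 no restrictions; 0/1 kernel+user measurements allowed;
    # 2 user-space-only measurements (a common default); some Debian/Ubuntu
    # kernels add 3+, which blocks unprivileged perf_event_open entirely.
    return level <= 2

if __name__ == "__main__":
    if not hardware_counters_likely_available():
        print("Opening an instruction count event will likely fail with EPERM "
              "without root or CAP_PERFMON.")
```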

laithsakka (Contributor) commented May 17, 2025

Are you running on a machine that has access to hardware counters? @nWEIdia

nWEIdia (Collaborator) commented May 17, 2025

> Are you running on a machine that has access to hardware counters? @nWEIdia

I believe I have root access to the hardware, but I am not sure how to enable them (the hardware counters). Does this have any requirement on the OS? I have provisioned Ubuntu 20.04.
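
If root access is available, a common way to allow the counters is to lower kernel.perf_event_paranoid. This is a generic Linux sysctl change, not a documented PyTorch step, so treat the sketch below as an assumption about the environment; the OS version (Ubuntu 20.04) should not matter as long as the machine exposes a hardware PMU:

```python
# Sketch (requires root): relax the perf_event_paranoid restriction so that
# hardware counter events can be opened. Generic Linux sysctl change, not a
# PyTorch-specific procedure.
import subprocess

def relax_perf_event_paranoid(level: int = 1) -> None:
    # Equivalent to: sudo sysctl -w kernel.perf_event_paranoid=<level>
    # 1 allows user and kernel measurements; -1 removes all restrictions.
    subprocess.run(["sysctl", "-w", f"kernel.perf_event_paranoid={level}"], check=True)
    # To persist across reboots, put the same key=value in a file under
    # /etc/sysctl.d/.

if __name__ == "__main__":
    relax_perf_event_paranoid()
```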
