Closed
Labels
- ci: sev (critical failure affecting PyTorch CI)
- module: ci (Related to continuous integration)
- module: regression (It used to work, and now it doesn't)
- triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)
Description
Current Status
Mitigated. Some jobs will have failures and need to be restarted. Any job after 5/16 2:20pm PT should have the correct runtime.
Error looks like
- Job failures with: No CUDA runtime is found
Incident timeline (all times pacific)
Include when the incident began, when it was detected, mitigated, root caused, and finally closed.
started: 5/16 7:15am PT
detected: 5/16 11:48am PT
resolved: 5/16 2:20pm PT
Root cause
An upgrade of nvidia-container-toolkit.
Mitigation
We pinned the version of nvidia-container-toolkit - see pytorch/test-infra#6637
Follow-ups:
- Run `docker run --rm -t --gpus=all python:3.11 nvidia-smi` when installing NVIDIA drivers.
- Figure out if we should be alerted if any GPU runner runs with 0 utilization @wdvr @yangw-dev
- Ask NVIDIA if their dependencies can be pinned: [rpm] libnvidia-container-tools should pin to nvidia-container-toolkit version NVIDIA/nvidia-container-toolkit#1091
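
The first follow-up (running `nvidia-smi` inside a GPU-enabled container after driver installation) could be wrapped in a small check that fails loudly. The following is a minimal sketch, not the script used by test-infra: the function name `gpu_container_smoke_test`, the injectable `run` parameter, and the 300-second timeout are assumptions made for illustration. Only the `docker run` command itself comes from this issue.

```python
import subprocess
from typing import Callable, List

# Command taken verbatim from the follow-up item in this issue.
SMOKE_CMD: List[str] = [
    "docker", "run", "--rm", "-t", "--gpus=all", "python:3.11", "nvidia-smi",
]


def gpu_container_smoke_test(
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    timeout_s: int = 300,  # assumed value; allows time for an image pull
) -> bool:
    """Return True if nvidia-smi succeeds inside a GPU-enabled container.

    `run` is injectable (hypothetical design choice) so the check can be
    exercised without Docker or a GPU being present.
    """
    try:
        result = run(SMOKE_CMD, capture_output=True, text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Docker missing, not executable, or the container hung.
        print(f"GPU smoke test could not complete: {exc}")
        return False

    if result.returncode != 0:
        # A non-zero exit means the container could not reach the GPU runtime.
        print(f"GPU smoke test failed (exit {result.returncode}):")
        print(result.stdout or "")
        print(result.stderr or "")
        return False
    return True
```

A provisioning script could call `gpu_container_smoke_test()` right after the driver and toolkit installation step and abort the runner setup when it returns `False`, so that a runner without a working CUDA runtime never picks up jobs.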
cc @seemethere @malfet @pytorch/pytorch-dev-infra
jeanschmidt