[torchrun] Fix: Use Correctly Reachable Host Address in c10d Rendezvous by kuizhiqing · Pull Request #150533 · pytorch/pytorch

[torchrun] Fix: Use Correctly Reachable Host Address in c10d Rendezvous #150533

Open · wants to merge 2 commits into base: main

kuizhiqing commented Apr 2, 2025

Fixes #150532

In this PR, we replace socket.getfqdn() with socket.gethostbyname(socket.getfqdn()), ensuring that an IP address is used instead of a potentially unresolvable hostname.

In any case, an IP address is more reliable than a hostname here.
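
For illustration only, the change described above can be sketched roughly as follows. This is not the actual diff from the PR: the function name `reachable_host_address` is hypothetical, and the fallback to the hostname when resolution fails is an assumption added here, not something the PR description states.

```python
import socket


def reachable_host_address() -> str:
    """Return an IPv4 address for the local host, or the FQDN if lookup fails.

    Hypothetical helper: the name and the fallback are assumptions made
    for this sketch, not code taken from the PR.
    """
    fqdn = socket.getfqdn()
    try:
        # Same call chain the PR description names.
        return socket.gethostbyname(fqdn)
    except (OSError, UnicodeError):
        # socket.gaierror and socket.herror are subclasses of OSError.
        # Assumed fallback: keep the previous behaviour (the hostname).
        return fqdn


if __name__ == "__main__":
    print(reachable_host_address())
```

Two caveats are worth keeping in mind when reading this sketch: `socket.gethostbyname` returns IPv4 addresses only, and on hosts whose hosts file maps the machine name to a loopback address, the result may be a loopback address that other nodes cannot reach.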

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot bot commented Apr 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150533

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 62a8416 with merge base 0da8127:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla bot commented Apr 2, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

pytorch-bot bot added the oncall: distributed and release notes: distributed (torchelastic) labels Apr 2, 2025
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
janeyx99 requested a review from d4l3k April 4, 2025 21:45
janeyx99 added the triaged label Apr 4, 2025
janeyx99 requested a review from mori360 April 4, 2025 21:45
d4l3k (Member) left a comment

Improving torchelastic so we don't depend on DNS would be very nice, but it needs some careful consideration to avoid breaking existing use cases.

d4l3k requested a review from kiukchung April 4, 2025 22:12
Signed-off-by: kuizhiqing <kuizhiqing@msn.com>
kuizhiqing requested a review from d4l3k April 7, 2025 03:57
Labels
oncall: distributed, open source, release notes: distributed (torchelastic), triaged
Development

Successfully merging this pull request may close these issues.

torchrun in environments without DNS support
4 participants