[PP] Allow unused kwargs in ZB path by H-Huang · Pull Request #153498 · pytorch/pytorch · GitHub

[PP] Allow unused kwargs in ZB path #153498


Open · wants to merge 2 commits into base: gh/H-Huang/179/base

Conversation

H-Huang (Member) commented May 13, 2025

Stack from ghstack (oldest at bottom):

This fixes a case where an unused kwarg is passed to the PP stage forward: we call torch.autograd.grad() on it and try to update its gradients even though it shouldn't have gradients, leading to this error:

[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/stage.py", line 613, in
[rank3]:[rank3]: return lambda: stage_backward_input(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/_backward.py", line 199, in stage_backward_input
[rank3]:[rank3]: dinputs = torch.autograd.grad(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/autograd/__init__.py", line 503, in grad
[rank3]:[rank3]: result = _engine_run_backward(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank3]:[rank3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank3]:[rank3]: RuntimeError: One of the differentiated Tensors does not require grad

related issues: pytorch/torchtitan#1188, #153484
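A minimal sketch of the failure mode and the shape of the fix (this is an illustration, not the PR's actual implementation; the `forward` function and tensor names here are hypothetical): `torch.autograd.grad` raises when any tensor in `inputs` does not require grad, so the zero-bubble backward path has to differentiate only with respect to inputs that actually require gradients.

```python
import torch

# Hypothetical stand-in for a PP stage forward that ignores one kwarg.
def stage_forward(x, unused_kwarg=None):
    return x * 2

x = torch.randn(4, requires_grad=True)
unused = torch.randn(4)  # stage input that does not require grad

# Passing every stage input to torch.autograd.grad fails, because
# `unused` is not part of the autograd graph:
out = stage_forward(x, unused_kwarg=unused)
try:
    torch.autograd.grad(out.sum(), (x, unused))
except RuntimeError as e:
    # "One of the differentiated Tensors does not require grad"
    print(e)

# Sketch of the fix: filter to inputs that require grad before
# calling torch.autograd.grad.
out = stage_forward(x, unused_kwarg=unused)
inputs = [t for t in (x, unused) if t.requires_grad]
dinputs = torch.autograd.grad(out.sum(), inputs)
print(dinputs[0])  # gradient w.r.t. x only
```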

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

[ghstack-poisoned]
pytorch-bot bot commented May 13, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153498

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

H-Huang added a commit that referenced this pull request May 13, 2025
ghstack-source-id: 7228c9b
Pull Request resolved: #153498
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 13, 2025
@H-Huang H-Huang marked this pull request as draft May 13, 2025 22:21
@H-Huang H-Huang added the module: pipelining Pipeline Parallelism label May 13, 2025
@H-Huang H-Huang added the release notes: distributed (pipeline) release notes category label May 14, 2025
@H-Huang H-Huang changed the title wip [PP] Allow unused kwargs in ZB path May 14, 2025
@H-Huang H-Huang marked this pull request as ready for review May 14, 2025 17:38
@H-Huang H-Huang requested a review from kwen2501 May 14, 2025 19:23
Labels
module: pipelining (Pipeline Parallelism) · oncall: distributed · release notes: distributed (pipeline)