[tp] refactor and fix PrepareModuleInput for DTensor inputs #128431

wanchaol · 2024-06-11T18:00:28Z

Stack from ghstack (oldest at bottom):

-> [tp] refactor and fix PrepareModuleInput for DTensor inputs #128431

As titled, this PR refactors the PrepareModuleInput style around a common
method, prepare_input_arg, so that both the args and kwargs paths reuse the same logic.

This also fixes #128365
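
The shape of the refactor can be pictured with a small, torch-free sketch. Everything in it is a hypothetical stand-in: `ToyPrepareModuleInput`, its string "layouts", and the tagging done inside `prepare_input_arg` are illustrative assumptions, not the code in torch/distributed/tensor/parallel/style.py. It only shows the idea that one helper handles a single input and both the positional and keyword paths call it.

```python
from typing import Any, Dict, Optional, Sequence, Tuple


class ToyPrepareModuleInput:
    """Hypothetical stand-in for illustration; not the PyTorch class."""

    def __init__(
        self,
        input_layouts: Sequence[Optional[str]] = (),
        desired_input_layouts: Sequence[Optional[str]] = (),
        input_kwarg_layouts: Optional[Dict[str, str]] = None,
        desired_input_kwarg_layouts: Optional[Dict[str, str]] = None,
    ) -> None:
        if len(input_layouts) != len(desired_input_layouts):
            raise ValueError("input and desired layouts must have the same length")
        self.input_layouts = tuple(input_layouts)
        self.desired_input_layouts = tuple(desired_input_layouts)
        self.input_kwarg_layouts = dict(input_kwarg_layouts or {})
        self.desired_input_kwarg_layouts = dict(desired_input_kwarg_layouts or {})

    def prepare_input_arg(
        self, value: Any, layout: Optional[str], desired_layout: Optional[str]
    ) -> Any:
        # Shared per-input logic. A layout of None means "pass through untouched".
        if layout is None:
            return value
        # Assumption: tagging the value with its target layout stands in for
        # the real wrap-and-redistribute step.
        target = desired_layout if desired_layout is not None else layout
        return (value, target)

    def prepare_args(self, args: Tuple[Any, ...]) -> Tuple[Any, ...]:
        if len(args) != len(self.input_layouts):
            raise ValueError("number of args must match number of input layouts")
        return tuple(
            self.prepare_input_arg(arg, layout, desired)
            for arg, layout, desired in zip(
                args, self.input_layouts, self.desired_input_layouts
            )
        )

    def prepare_kwargs(self, kwargs: Dict[str, Any]) -> Dict[str, Any]:
        # Keyword inputs without a declared layout fall through unchanged.
        return {
            name: self.prepare_input_arg(
                value,
                self.input_kwarg_layouts.get(name),
                self.desired_input_kwarg_layouts.get(name),
            )
            for name, value in kwargs.items()
        }


style = ToyPrepareModuleInput(
    input_layouts=("shard", None),
    desired_input_layouts=("replicate", None),
    input_kwarg_layouts={"mask": "shard"},
    desired_input_kwarg_layouts={"mask": "replicate"},
)
args = style.prepare_args(("x", "y"))                        # (("x", "replicate"), "y")
kwargs = style.prepare_kwargs({"mask": "m", "flag": True})   # {"mask": ("m", "replicate"), "flag": True}
```

Because the per-input handling lives in one place, a fix to it applies to positional and keyword inputs alike.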

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

pytorch-bot · 2024-06-11T18:00:32Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128431

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 41c34a6 with merge base 4345d98:

NEW FAILURE - The following job has failed:

trunk / macos-py3-arm64 / build (gh)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

torch/distributed/tensor/parallel/style.py

wanchaol · 2024-06-11T21:15:06Z

@pytorchbot merge

wanchaol · 2024-06-11T22:12:24Z

@pytorchbot rebase

pytorchmergebot · 2024-06-11T22:13:55Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]

pytorchmergebot · 2024-06-11T22:14:11Z

Successfully rebased gh/wanchaol/485/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/128431)

wanchaol · 2024-06-12T04:53:50Z

@pytorchbot merge

pytorchmergebot · 2024-06-12T04:55:22Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

wanchaol · 2024-06-12T05:20:34Z

@pytorchbot merge -f "macos job seems taking forever to build.."

pytorchmergebot · 2024-06-12T05:20:52Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job waited more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.

pytorchmergebot · 2024-06-12T05:22:19Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

as titled, this PR refactors the PrepareModuleInput style to have common method prepare_input_arg, allow both args/kwargs to reuse this logic. This also fixes #128365. ghstack-source-id: 037bb6d. Pull Request resolved: #128431

wanchaol · 2024-06-12T18:58:10Z

@pytorchbot merge

pytorchmergebot · 2024-06-12T19:00:00Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

wanchaol · 2024-06-12T19:13:56Z

@pytorchbot merge -f "lint fixing only, no need to re-run all CI jobs"

pytorchmergebot · 2024-06-12T19:14:13Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job waited more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.

pytorchmergebot · 2024-06-12T19:16:27Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

This PR is a follow-up that enables fp8 allgather in TP now that these PRs have landed: pytorch/pytorch#128431 and pytorch-labs/float8_experimental#275. To train with fp8, update pytorch and float8_experimental to versions that include those changes. Since fp8 is not yet enabled in our integration tests, there should be no issues on CI or for training runs that do not use fp8.

…128431) as titled, this PR refactors the PrepareModuleInput style to have common method prepare_input_arg, allow both args/kwargs to reuse this logic This also fixes pytorch#128365 Pull Request resolved: pytorch#128431 Approved by: https://github.com/awgu

…ytorch#128431)" This reverts commit 089f9a1. Reverted pytorch#128431 on behalf of https://github.com/DanilBaibak due to: "Sorry for the revert. Your changes broke the linter. Here you can find more details: https://hud.pytorch.org/pytorch/pytorch/commit/089f9a116ac8b2c14d6351b52614b529caba126b" (comment on pytorch#128431)

…128431) as titled, this PR refactors the PrepareModuleInput style to have common method prepare_input_arg, allow both args/kwargs to reuse this logic This also fixes pytorch#128365 Pull Request resolved: pytorch#128431 Approved by: https://github.com/awgu

as titled, this PR refactors the PrepareModuleInput style to have common method prepare_input_arg, allow both args/kwargs to reuse this logic This also fixes #128365 Pull Request resolved: #128431 Approved by: https://github.com/awgu (cherry picked from commit 7775fee)

[tp] refactor PrepareModuleInput to reuse code

178f073

pytorch-bot bot added the oncall: distributed label Jun 11, 2024

Skylion007 reviewed Jun 11, 2024

torch/distributed/tensor/parallel/style.py (outdated, resolved)

wanchaol added the release notes: distributed (dtensor) label Jun 11, 2024

wanchaol requested review from yifuwang, tianyu-l, awgu and Skylion007 June 11, 2024 19:32

awgu approved these changes Jun 11, 2024

wanchaol changed the title ~~[tp] refactor PrepareModuleInput to reuse code~~ [tp] refactor and fix PrepareModuleInput for DTensor inputs Jun 11, 2024

pytorch-bot bot added the ciflow/trunk label Jun 11, 2024

Update

50c418b

[ghstack-poisoned]

pytorchmergebot added the merging label Jun 12, 2024

pytorchmergebot closed this in 089f9a1 Jun 12, 2024

pytorchmergebot added the Merged label Jun 12, 2024

pytorchmergebot added the merging label Jun 12, 2024

pytorchmergebot closed this in 7775fee Jun 12, 2024

pytorchmergebot removed the merging label Jun 12, 2024

wanchaol mentioned this pull request Jun 12, 2024

enable TP fp8 allgather with PrepareFloat8ModuleInput pytorch/torchtitan#393

Merged

This was referenced Jun 14, 2024

[tp] refactor and fix PrepareModuleInput for DTensor inputs (#128431) #128719

Merged

[v.2.4.0] Release Tracker #128436

Closed

github-actions bot deleted the gh/wanchaol/485/head branch July 14, 2024 02:02
