[FSDP2+TP] Disable 2D state_dict #129519

wz337 · 2024-06-25T22:06:39Z

Fixes #ISSUE_NUMBER

Gonna fill in the RFC but just want to run CI to see if anything else breaks.

Test:

python test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_raise_not_implemented_state_dict_if_2d

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @fegin @XilunWu @wanchaol @fduwjj @tianyu-l @wconstab @yf225 @chauhang @d4l3k

pytorch-bot · 2024-06-25T22:06:42Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129519

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

NVIDIA runners have a blacklog of 2000+ jobs

❌ 2 New Failures, 1 Unrelated Failure

As of commit 790e786 with merge base 1c75ddf ():

NEW FAILURES - The following jobs have failed:

pull / linux-focal-py3.8-clang10 / test (dynamo, 3, 3, linux.2xlarge) (gh)
test_quantization.py::TestQuantizePT2EQATModels::test_qat_mobilenet_v2
trunk / macos-py3-arm64-mps / test (mps, 1, 1, macos-m1-14) (gh)
test_mps.py::TestMPS::test_mps_allocator_module

FLAKY - The following job failed but was likely due to flakiness present on trunk:

trunk / macos-py3-arm64-mps / test (mps, 1, 1, macos-m1-13) (gh) (matched macos rule in flaky-rules.json)
Failure: There is only 3229312KB free space left in /, which is less than the minimum requirement of

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wz337 · 2024-06-27T06:20:55Z

torch/distributed/_composable/fsdp/_fsdp_param.py

@@ -223,6 +223,22 @@ def _init_sharded_param(self, param: nn.Parameter, device: torch.device):
                tensor_meta=self._tp_spec.tensor_meta,
            )
            param_data = cast(DTensor, param)._local_tensor
+


@awgu Want to check with you whether to disable the 2D state_dict like this is ok with you.

Originally thinking about doing it in _register_state_dict_hooks, but since _register_state_dict_hooks happens during lazy_init so ended up registering the hooks in param init.
https://github.com/pytorch/pytorch/blob/main/torch/distributed/_composable/fsdp/_fsdp_param_group.py#L469

I think we can move this to the end of FSDPParamGroup.__init__ so that we do not need the if len(cur_module._state_dict_pre_hooks) == 0: check in the case that a module has >1 parameter.

I was looking to add this in FSDPParamGroup initially, but since we only wanted to add this hook for 2D, then we would need an if in FSDPParamGroup.__init__ as well. Something like this:

if any(len(fsdp_param._spmd_placements)>1 for fsdp_param in self.fsdp_params): self.module.register_state_dict_pre_hook(...) self.module._register_load_state_dict_pre_hook(...)

In this case, we would also run the extra if check for 1D vs. We only register the hook if we encounter a 2D placements so nothing changes from 1D side, but we will do the _state_dict_pre_hooks check for every param init for 2D.

Not sure which one is more preferrable to you.

Discussed with @awgu offline to move the check to the end of FSDPParamGroup.init.

awgu

Looks great! Thank you!

wz337 · 2024-07-01T00:45:17Z

@pytorchmergebot merge -i

pytorchmergebot · 2024-07-01T00:47:58Z

Merge started

Your change will be merged while ignoring the following 1 checks: pull / linux-focal-py3.8-clang10 / test (dynamo, 3, 3, linux.2xlarge)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2024-07-01T01:24:14Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-py3-arm64-mps / test (mps, 1, 1, macos-m1-14)

Details for Dev Infra team

Raised by workflow job

…ding for correct full_tensor() result" Fixes issue #129229 #129206 **Summary** 1. Have `FSDP` choose `_StridedShard` placement for FSDP+TP sharding 2. Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])` and the `split_factor` argument will just be the number of shards on that sharding tensor dim. 3. Added a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and simply TP sharding (i.e. non-strided) has the same `full_tensor()` result 4. Re-enabled the tests that were disabled in #129519 and removed relevant code **test** `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py` `pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py` cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o zhaojuanmao mrshenli rohan-varma chauhang [ghstack-poisoned]

…sent nested sharding for correct full_tensor() result" Fixes issue #129229 #129206 **Summary** 1. Have `FSDP` choose `_StridedShard` placement for FSDP+TP sharding 2. Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])` and the `split_factor` argument will just be the number of shards on that sharding tensor dim. 3. Added a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and simply TP sharding (i.e. non-strided) has the same `full_tensor()` result 4. Re-enabled the tests that were disabled in #129519 and removed relevant code **test** `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py` `pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py` cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o zhaojuanmao mrshenli rohan-varma chauhang [ghstack-poisoned]

…ding for correct full_tensor() result" Fixes issue #129229 #129206 **Summary** 1. Have `FSDP` choose `_StridedShard` placement for FSDP+TP sharding 2. Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])` and the `split_factor` argument will just be the number of shards on that sharding tensor dim. 3. Added a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and simply TP sharding (i.e. non-strided) has the same `full_tensor()` result 4. Re-enabled the tests that were disabled in #129519 and removed relevant code **test** `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py` `pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py` cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o zhaojuanmao mrshenli rohan-varma chauhang [ghstack-poisoned]

…sent nested sharding for correct full_tensor() result" Fixes issue #129229 #129206 **Summary** 1. Have `FSDP` choose `_StridedShard` placement for FSDP+TP sharding 2. Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])` and the `split_factor` argument will just be the number of shards on that sharding tensor dim. 3. Added a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and simply TP sharding (i.e. non-strided) has the same `full_tensor()` result 4. Re-enabled the tests that were disabled in #129519 and removed relevant code **test** `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py` `pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py` `pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py` `pytest test/distributed/_composable/fsdp/test_fully_shard_init.py` cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o zhaojuanmao mrshenli rohan-varma chauhang [ghstack-poisoned]

…ding for correct full_tensor() result" Fixes issue #129229 #129206 **Summary** 1. Have `FSDP` choose `_StridedShard` placement for FSDP+TP sharding 2. Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])` and the `split_factor` argument will just be the number of shards on that sharding tensor dim. 3. Added a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and simply TP sharding (i.e. non-strided) has the same `full_tensor()` result 4. Re-enabled the tests that were disabled in #129519 and removed relevant code **test** `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py` `pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py` `pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py` `pytest test/distributed/_composable/fsdp/test_fully_shard_init.py` cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o zhaojuanmao mrshenli rohan-varma chauhang [ghstack-poisoned]

…sent nested sharding for correct full_tensor() result" Fixes issue #129229 #129206 **Summary** 1. Have E7EE `FSDP` choose `_StridedShard` placement for FSDP+TP sharding 2. Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])` and the `split_factor` argument will just be the number of shards on that sharding tensor dim. 3. Added a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and simply TP sharding (i.e. non-strided) has the same `full_tensor()` result 4. Re-enabled the tests that were disabled in #129519 and removed relevant code **test** `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py` `pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py` `pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py` `pytest test/distributed/_composable/fsdp/test_fully_shard_init.py` cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o zhaojuanmao mrshenli rohan-varma chauhang [ghstack-poisoned]

…ding for correct full_tensor() result" Fixes issue #129229 #129206 **Summary** 1. Have `FSDP` choose `_StridedShard` placement for FSDP+TP sharding 2. Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])` and the `split_factor` argument will just be the number of shards on that sharding tensor dim. 3. Added a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and simply TP sharding (i.e. non-strided) has the same `full_tensor()` result 4. Re-enabled the tests that were disabled in #129519 and removed relevant code **test** `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py` `pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py` `pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py` `pytest test/distributed/_composable/fsdp/test_fully_shard_init.py` cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o zhaojuanmao mrshenli rohan-varma chauhang [ghstack-poisoned]

…sent nested sharding for correct full_tensor() result" Fixes issue #129229 #129206 **Summary** 1. Have `FSDP` choose `_StridedShard` placement for FSDP+TP sharding 2. Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])` and the `split_factor` argument will just be the number of shards on that sharding tensor dim. 3. Added a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and simply TP sharding (i.e. non-strided) has the same `full_tensor()` result 4. Re-enabled the tests that were disabled in #129519 and removed relevant code **test** `pytest test/distributed/_composable/fsdp/` `pytest test/distributed/_composable/test_composability/test_2d_composability.py` `pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py` cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o zhaojuanmao mrshenli rohan-varma chauhang [ghstack-poisoned]

…ding for correct full_tensor() result" Fixes issue #129229 #129206 **Summary** 1. Have `FSDP` choose `_StridedShard` placement for FSDP+TP sharding 2. Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])` and the `split_factor` argument will just be the number of shards on that sharding tensor dim. 3. Added a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and simply TP sharding (i.e. non-strided) has the same `full_tensor()` result 4. Re-enabled the tests that were disabled in #129519 and removed relevant code **test** `pytest test/distributed/_composable/fsdp/` `pytest test/distributed/_composable/test_composability/test_2d_composability.py` `pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py` cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o zhaojuanmao mrshenli rohan-varma chauhang [ghstack-poisoned]

…sent nested sharding for correct full_tensor() result" Fixes issue #129229 #129206 **Summary** 1. Have `FSDP` choose `_StridedShard` placement for FSDP+TP sharding 2. Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])` and the `split_factor` argument will just be the number of shards on that sharding tensor dim. 3. Added a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and simply TP sharding (i.e. non-strided) has the same `full_tensor()` result 4. Re-enabled the tests that were disabled in #129519 and removed relevant code **test** `pytest test/distributed/_composable/fsdp/` `pytest test/distributed/_composable/test_composability/test_2d_composability.py` `pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py` cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o zhaojuanmao mrshenli rohan-varma chauhang LucasLLC MeetVadakkanchery mhorowitz [ghstack-poisoned]

…ding for correct full_tensor() result" Fixes issue #129229 #129206 **Summary** 1. Have `FSDP` choose `_StridedShard` placement for FSDP+TP sharding 2. Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])` and the `split_factor` argument will just be the number of shards on that sharding tensor dim. 3. Added a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and simply TP sharding (i.e. non-strided) has the same `full_tensor()` result 4. Re-enabled the tests that were disabled in #129519 and removed relevant code **test** `pytest test/distributed/_composable/fsdp/` `pytest test/distributed/_composable/test_composability/test_2d_composability.py` `pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py` cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o zhaojuanmao mrshenli rohan-varma chauhang LucasLLC MeetVadakkanchery mhorowitz [ghstack-poisoned]

…sent nested sharding for correct full_tensor() result" Fixes issue #129229 #129206 **Summary** 1. Have `FSDP` choose `_StridedShard` placement for FSDP+TP sharding 2. Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])` and the `split_factor` argument will just be the number of shards on that sharding tensor dim. 3. Added a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and simply TP sharding (i.e. non-strided) has the same `full_tensor()` result 4. Re-enabled the tests that were disabled in #129519 and removed relevant code **test** `pytest test/distributed/_composable/fsdp/` `pytest test/distributed/_composable/test_composability/test_2d_composability.py` `pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py` cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o zhaojuanmao mrshenli rohan-varma chauhang LucasLLC MeetVadakkanchery mhorowitz Differential Revision: [D60606114](https://our.internmc.facebook.com/intern/diff/D60606114) [ghstack-poisoned]

…ding for correct full_tensor() result" Fixes issue #129229 #129206 **Summary** 1. Have `FSDP` choose `_StridedShard` placement for FSDP+TP sharding 2. Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])` and the `split_factor` argument will just be the number of shards on that sharding tensor dim. 3. Added a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and simply TP sharding (i.e. non-strided) has the same `full_tensor()` result 4. Re-enabled the tests that were disabled in #129519 and removed relevant code **test** `pytest test/distributed/_composable/fsdp/` `pytest test/distributed/_composable/test_composability/test_2d_composability.py` `pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py` cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o zhaojuanmao mrshenli rohan-varma chauhang LucasLLC MeetVadakkanchery mhorowitz Differential Revision: [D60606114](https://our.internmc.facebook.com/intern/diff/D60606114) [ghstack-poisoned]

…sent nested sharding for correct full_tensor() result" Fixes issue #129229 #129206 **Summary** 1. Have `FSDP` choose `_StridedShard` placement for FSDP+TP sharding 2. Added a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and simply TP sharding (i.e. non-strided) has the same `full_tensor()` result 3. Re-enabled the tests that were disabled in #129519 **test** `pytest test/distributed/_composable/fsdp/` `pytest test/distributed/_composable/test_composability/test_2d_composability.py` `pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py` cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o zhaojuanmao mrshenli rohan-varma chauhang LucasLLC MeetVadakkanchery mhorowitz Differential Revision: [D60606114](https://our.internmc.facebook.com/intern/diff/D60606114) [ghstack-poisoned]

…ding for correct full_tensor() result" Fixes issue #129229 #129206 **Summary** 1. Have `FSDP` choose `_StridedShard` placement for FSDP+TP sharding 2. Added a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and simply TP sharding (i.e. non-strided) has the same `full_tensor()` result 3. Re-enabled the tests that were disabled in #129519 **test** `pytest test/distributed/_composable/fsdp/` `pytest test/distributed/_composable/test_composability/test_2d_composability.py` `pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py` cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o zhaojuanmao mrshenli rohan-varma chauhang LucasLLC MeetVadakkanchery mhorowitz Differential Revision: [D60606114](https://our.internmc.facebook.com/intern/diff/D60606114) [ghstack-poisoned]