DistributedModelParallel resharding Interface by aporialiao · Pull Request #2945 · pytorch/torchrec · GitHub

DistributedModelParallel resharding Interface #2945


Closed
wants to merge 1 commit

Conversation

aporialiao
Member

Summary:
Finally! This adds the DMP interface for resharding; most of the changes here enable proper testing of DMP.

Main changes:

1. DMP reshard API:

  • Adds a reshard entry point on DMP, which calls the underlying sharder of the sharded module to perform the reshard (a sketch of the intended usage follows below).
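A minimal sketch of how this new reshard path might be exercised, assuming a reshard method on DistributedModelParallel that takes the sharded module's path and the changed portion of a new plan; the exact signature is not reproduced on this page, so the names and arguments below are illustrative, not authoritative:

```python
import torch
from torchrec.distributed.model_parallel import DistributedModelParallel

# `model`, `plan_1`, and `plan_2` are assumed to come from the test setup
# described below (two different sharding plans produced by the planner).
dmp = DistributedModelParallel(
    module=model,
    plan=plan_1,
    device=torch.device("cuda"),
)

# Hypothetical resharding call: DMP looks up the sharded module and its
# sharder, then delegates shard movement so placements match the new plan.
dmp.reshard(
    sharded_module_fqn="sparse.ebc",                    # illustrative module path
    changed_shard_to_params=plan_2.plan["sparse.ebc"],  # illustrative plan delta
)
```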

2. Proper Testing:

  • A multi-rank test which generates a full model and exercises the DMP interface. Currently only tests TW (table-wise) sharding.
  • This test is called from test_dynamic_sharding.py -> test_model_parallel.py -> test_sharding.py, following the same structure as the current DMP unit tests.
  • The test checks correctness as follows (a condensed sketch of this flow appears right after this section):
        1. Generate a global model and inputs.
        2. Create 2 identical local models based on the global model.
        3. Use the planner to generate a sharding plan for the local model.
        4. Based on the planner output, generate a second, different sharding plan.
        5. Shard local models 1 and 2 through DMP with plans 1 and 2 respectively.
        6. Reshard (dynamic sharding API) model 1 with plan 2.
        7. Generate predictions from both local models and compare them to the global model's prediction; they are expected to match.
  • This also verifies that the optimizer state is correctly preserved across resharding.
  • The test is set up with other variables to be filled in once more functionality is enabled with dynamic sharding, e.g. variable_batch_size etc.
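The condensed sketch below mirrors those seven steps. The helper names (build_global_model_and_inputs, perturb_plan, the "sparse.ebc" path) are stand-ins for the real test utilities, which are not reproduced on this page, and the reshard call uses the same assumed signature as above:

```python
import copy
import torch
from torchrec.distributed.model_parallel import DistributedModelParallel

# `planner`, `sharders`, and `pg` are assumed to come from the usual planner
# setup (an EmbeddingShardingPlanner, a sharder list, and a process group).

# Steps 1-2: a global model plus two identical local copies (hypothetical helper).
global_model, inputs = build_global_model_and_inputs()
local_model_1 = copy.deepcopy(global_model)
local_model_2 = copy.deepcopy(global_model)

# Steps 3-4: one plan from the planner, then a second, different plan
# derived from it (hypothetical helper).
plan_1 = planner.collective_plan(local_model_1, sharders, pg)
plan_2 = perturb_plan(plan_1)

# Step 5: shard both local models through DMP.
dmp_1 = DistributedModelParallel(local_model_1, plan=plan_1)
dmp_2 = DistributedModelParallel(local_model_2, plan=plan_2)

# Step 6: dynamically reshard model 1 onto plan 2 (assumed signature).
dmp_1.reshard("sparse.ebc", plan_2.plan["sparse.ebc"])

# Step 7: all three models should now agree on predictions.
expected = global_model(inputs)
torch.testing.assert_close(dmp_1(inputs), expected)
torch.testing.assert_close(dmp_2(inputs), expected)
```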

3. Helper functions for testing:

  • get_sharding_constructor_from_type, to enable setting the sharding_type for each unit test.
  • compare_model_pred_one_step, used only for debugging, to get more information on whether the models are identical after resharding / running the initial step.
  • compare_model_weights, also for debugging (see the illustrative sketch below).
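A plausible shape for such a weight-comparison debugging helper, purely as an illustration; the real implementation lives in the test utilities and may differ:

```python
import torch
from torch import nn


def compare_model_weights(m1: nn.Module, m2: nn.Module) -> None:
    """Illustrative debugging helper: assert two models carry identical weights."""
    params_1 = dict(m1.named_parameters())
    params_2 = dict(m2.named_parameters())
    assert params_1.keys() == params_2.keys(), "models expose different parameter names"
    for name, p1 in params_1.items():
        # Exact equality is the useful signal when checking that resharding
        # moved weights without altering them.
        torch.testing.assert_close(
            p1, params_2[name], rtol=0, atol=0, msg=f"weight mismatch in {name}"
        )
```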

4. Small refactoring in the update_shards call.

Differential Revision: D73049934

@facebook-github-bot added the CLA Signed label May 6, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 6, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

1 similar comment

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 6, 2025
aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 7, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 14, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 15, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 15, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 20, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 21, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 21, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 28, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 28, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

Summary:

Finally! This adds the DMP interface for resharding; most of the changes here enable proper testing of DMP.

## Main changes:
### 1. DMP reshard API: 
* Calls the underlying sharder of the sharded module to perform the reshard.

### 2. Proper Testing: 
* A multi-rank test which generates a full model and exercises the DMP interface. Currently only tests TW (table-wise) sharding.
* This test is called from `test_dynamic_sharding.py` -> `test_model_parallel.py` -> `test_sharding.py`, following the same structure as the current DMP unit tests.
* The test checks correctness as follows:
```
        1. Generate global model and inputs
        2. Create 2 identical local models based on global model
        3. Use planner to generate sharding plan for local model
        4. Based on planner output, generate a second, different sharding plan
        5. Shard both local models 1 and 2 through DMP with plan 1 and 2 respectively
        6. Reshard (dynamic sharding API) model 1 with plan 2
        7. Generate predictions for local models and compare them to global model prediction. Expect to be the same.
```
* This also verifies that the `optimizer` state is correctly preserved across resharding.
* The test is set up with other variables to be filled in once more functionality is enabled with dynamic sharding, e.g. `variable_batch_size` etc.

### 3. Helper functions for testing
* `get_sharding_constructor_from_type`, to enable setting the `sharding_type` for each unit test.
* `compare_model_pred_one_step`, used only for debugging, to get more information on whether the models are identical after resharding / running the initial step.
* `compare_model_weights`, also for debugging.

### 4. Bug fixes in the `update_shards` call.
* Namely, the input dist was not properly updated; this caused errors when testing the reshard function in the *middle of training*, since the input dist depends on the shard placements (a conceptual sketch of the idea follows).
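The gist of the fix, sketched with hypothetical names only (the actual internals of `update_shards` are not shown on this page):

```python
# Conceptual sketch only (hypothetical names): after shards are moved to their
# new placements, the module's input dist must be rebuilt from the new plan,
# because the input dist routes lookups according to where each shard lives.
def update_shards_sketch(sharded_module, changed_shard_to_params, env):
    move_shards(sharded_module, changed_shard_to_params, env)  # pre-existing behavior (illustrative)
    rebuild_input_dist(sharded_module, env)                    # the step the fix adds (illustrative)
```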

Reviewed By: aliafzal

Differential Revision: D73049934
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

Labels
CLA Signed · fb-exported