DistributedModelParallel resharding Interface by aporialiao · Pull Request #2945 · pytorch/torchrec · GitHub

DistributedModelParallel resharding Interface #2945


Closed
wants to merge 1 commit

Conversation

aporialiao
Member

Summary:
Finally! This adds the DMP interface for resharding; most of the changes here enable proper testing of DMP.

Main changes:

1. DMP reshard API:

  • Adds a reshard entry point on DMP, which calls the underlying sharder of the sharded module to perform the reshard (a sketch of the intended usage follows below).
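A minimal sketch of how this new reshard path might be exercised, assuming a reshard method on DistributedModelParallel that takes the sharded module's path and the changed portion of a new plan; the exact signature is not reproduced on this page, so the names and arguments below are illustrative, not authoritative:

```python
import torch
from torchrec.distributed.model_parallel import DistributedModelParallel

# `model`, `plan_1`, and `plan_2` are assumed to come from the test setup
# described below (two different sharding plans produced by the planner).
dmp = DistributedModelParallel(
    module=model,
    plan=plan_1,
    device=torch.device("cuda"),
)

# Hypothetical resharding call: DMP looks up the sharded module and its
# sharder, then delegates shard movement so placements match the new plan.
dmp.reshard(
    sharded_module_fqn="sparse.ebc",                    # illustrative module path
    changed_shard_to_params=plan_2.plan["sparse.ebc"],  # illustrative plan delta
)
```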

2. Proper Testing:

  • A multi-rank test which generates a full model and exercises the DMP interface. Currently only tests TW (table-wise) sharding.
  • This test is called from test_dynamic_sharding.py -> test_model_parallel.py -> test_sharding.py, following the same structure as the current DMP unit tests.
  • The test checks correctness as follows (a condensed sketch of this flow appears right after this section):
        1. Generate a global model and inputs.
        2. Create 2 identical local models based on the global model.
        3. Use the planner to generate a sharding plan for the local model.
        4. Based on the planner output, generate a second, different sharding plan.
        5. Shard local models 1 and 2 through DMP with plans 1 and 2 respectively.
        6. Reshard (dynamic sharding API) model 1 with plan 2.
        7. Generate predictions from both local models and compare them to the global model's prediction; they are expected to match.
  • This also verifies that the optimizer state is correctly preserved across resharding.
  • The test is set up with other variables to be filled in once more functionality is enabled with dynamic sharding, e.g. variable_batch_size etc.
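The condensed sketch below mirrors those seven steps. The helper names (build_global_model_and_inputs, perturb_plan, the "sparse.ebc" path) are stand-ins for the real test utilities, which are not reproduced on this page, and the reshard call uses the same assumed signature as above:

```python
import copy
import torch
from torchrec.distributed.model_parallel import DistributedModelParallel

# `planner`, `sharders`, and `pg` are assumed to come from the usual planner
# setup (an EmbeddingShardingPlanner, a sharder list, and a process group).

# Steps 1-2: a global model plus two identical local copies (hypothetical helper).
global_model, inputs = build_global_model_and_inputs()
local_model_1 = copy.deepcopy(global_model)
local_model_2 = copy.deepcopy(global_model)

# Steps 3-4: one plan from the planner, then a second, different plan
# derived from it (hypothetical helper).
plan_1 = planner.collective_plan(local_model_1, sharders, pg)
plan_2 = perturb_plan(plan_1)

# Step 5: shard both local models through DMP.
dmp_1 = DistributedModelParallel(local_model_1, plan=plan_1)
dmp_2 = DistributedModelParallel(local_model_2, plan=plan_2)

# Step 6: dynamically reshard model 1 onto plan 2 (assumed signature).
dmp_1.reshard("sparse.ebc", plan_2.plan["sparse.ebc"])

# Step 7: all three models should now agree on predictions.
expected = global_model(inputs)
torch.testing.assert_close(dmp_1(inputs), expected)
torch.testing.assert_close(dmp_2(inputs), expected)
```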

3. Helper functions for testing:

  • get_sharding_constructor_from_type, to enable setting the sharding_type for each unit test.
  • compare_model_pred_one_step, used only for debugging, to get more information on whether the models are identical after resharding / running the initial step.
  • compare_model_weights, also for debugging (see the illustrative sketch below).
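A plausible shape for such a weight-comparison debugging helper, purely as an illustration; the real implementation lives in the test utilities and may differ:

```python
import torch
from torch import nn


def compare_model_weights(m1: nn.Module, m2: nn.Module) -> None:
    """Illustrative debugging helper: assert two models carry identical weights."""
    params_1 = dict(m1.named_parameters())
    params_2 = dict(m2.named_parameters())
    assert params_1.keys() == params_2.keys(), "models expose different parameter names"
    for name, p1 in params_1.items():
        # Exact equality is the useful signal when checking that resharding
        # moved weights without altering them.
        torch.testing.assert_close(
            p1, params_2[name], rtol=0, atol=0, msg=f"weight mismatch in {name}"
        )
```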

4. Small refactoring in the update_shards call.

Differential Revision: D73049934

@facebook-github-bot added the CLA Signed label May 6, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 6, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

1 similar comment

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 6, 2025
aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 7, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 14, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 15, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 15, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 20, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 21, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 21, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 28, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

aporialiao added a commit to aporialiao/torchrec that referenced this pull request May 28, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

Summary:

Finally! This adds the DMP interface for resharding; most of the changes here enable proper testing of DMP.

## Main changes:
### 1. DMP reshard API: 
* Calls the underlying sharder of the sharded module to perform the reshard.

### 2. Proper Testing: 
* A multi-rank test which generates a full model and exercises the DMP interface. Currently only tests TW (table-wise) sharding.
* This test is called from `test_dynamic_sharding.py` -> `test_model_parallel.py` -> `test_sharding.py`, following the same structure as the current DMP unit tests.
* The test checks correctness as follows:
```
        1. Generate global model and inputs
        2. Create 2 identical local models based on global model
        3. Use planner to generate sharding plan for local model
        4. Based on planner output, generate a second, different sharding plan
        5. Shard both local models 1 and 2 through DMP with plan 1 and 2 respectively
        6. Reshard (dynamic sharding API) model 1 with plan 2
        7. Generate predictions for local models and compare them to global model prediction. Expect to be the same.
```
* This also verifies that the `optimizer` state is correctly preserved across resharding.
* The test is set up with other variables to be filled in once more functionality is enabled with dynamic sharding, e.g. `variable_batch_size` etc.

### 3. Helper functions for testing
* `get_sharding_constructor_from_type`, to enable setting the `sharding_type` for each unit test.
* `compare_model_pred_one_step`, used only for debugging, to get more information on whether the models are identical after resharding / running the initial step.
* `compare_model_weights`, also for debugging.

### 4. Bug fixes in the `update_shards` call.
* Namely, the input dist was not properly updated; this caused errors when testing the reshard function in the *middle of training*, since the input dist depends on the shard placements (a conceptual sketch of the idea follows).
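The gist of the fix, sketched with hypothetical names only (the actual internals of `update_shards` are not shown on this page):

```python
# Conceptual sketch only (hypothetical names): after shards are moved to their
# new placements, the module's input dist must be rebuilt from the new plan,
# because the input dist routes lookups according to where each shard lives.
def update_shards_sketch(sharded_module, changed_shard_to_params, env):
    move_shards(sharded_module, changed_shard_to_params, env)  # pre-existing behavior (illustrative)
    rebuild_input_dist(sharded_module, env)                    # the step the fix adds (illustrative)
```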

Reviewed By: aliafzal

Differential Revision: D73049934
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73049934

Labels
CLA Signed · fb-exported