
[Distributed Checkpointing][torch2.4] torch 2.4 can't load a checkpoint saved by torch 2.3 #133923


Closed
bigning opened this issue Aug 19, 2024 · 7 comments
Assignees: fegin
Labels: high priority · oncall: distributed checkpointing · oncall: distributed · triaged

Comments

bigning (Contributor) commented Aug 19, 2024

🐛 Describe the bug

The PR that changed how load_planner and save_planner flatten the state dict (pytorch#125335, per the fix commit referenced below) broke backward compatibility: checkpoints saved before that PR can no longer be loaded. Loading fails at https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/default_planner.py#L319 because the metadata keys don't match the state_dict keys.
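For illustration, here is a minimal, self-contained sketch of how a change in the planner's flattening rule can make the keys recorded in the saved metadata diverge from the keys produced at load time. The key format and the exact rule change below are made up for illustration; they are not the real DCP formats.

```python
# Hypothetical illustration of the key mismatch; not the actual DCP key format.
state_dict = {"optimizer": {"state": {"step": 10}, "param_groups": [{"lr": 0.1}]}}

def flatten_top_level(d):
    # "old" behavior (hypothetical): only the outer mapping is flattened,
    # nested containers are stored as whole objects under a single key
    return {f"{outer}.{inner}": v for outer, sub in d.items() for inner, v in sub.items()}

def flatten_recursive(d, prefix=""):
    # "new" behavior (hypothetical): every nested dict becomes dotted keys
    out = {}
    for k, v in d.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten_recursive(v, key + "."))
        else:
            out[key] = v
    return out

saved_keys = set(flatten_top_level(state_dict))    # keys written into the old checkpoint metadata
current_keys = set(flatten_recursive(state_dict))  # keys the new load planner produces
print(saved_keys ^ current_keys)
# {'optimizer.state', 'optimizer.state.step'} -- the mismatch that trips the check in default_planner.py
```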

Versions

PyTorch version: 2.4.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.31

Python version: 3.11.9 (main, Apr 6 2024, 17:59:24) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB

Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @xmfan @LucasLLC @pradeepfn

bigning (Contributor, Author) commented Aug 19, 2024

cc @fegin

malfet added the high priority and oncall: distributed labels on Aug 20, 2024
yifuwang (Collaborator) commented

cc @LucasLLC

fegin self-assigned this on Aug 20, 2024
fegin added the oncall: distributed checkpointing label on Aug 20, 2024
bigning (Contributor, Author) commented Aug 20, 2024

Hey @fegin, wondering how you plan to fix it. Will you revert the change, or keep that PR and add a patch so the old checkpoints can still be loaded? Asking because on our side we need that information to decide how we monkeypatch PyTorch in Composer for the torch 2.4 release. Thanks!

fegin (Contributor) commented Aug 21, 2024

@bigning We are working on a PR to fix the BC issue. Reverting the change could break other users and cause more issues. We are also planning to add a version to the metadata so that this kind of BC issue won't happen again in the future.

@pradeepfn
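As a rough sketch of the versioning idea mentioned above, the metadata could carry a format version that the load path checks before deciding how to interpret keys. The field name, helper names, and fallback below are hypothetical, not the actual DCP API.

```python
# Hypothetical versioning sketch; field and helper names are made up.
SAVE_FORMAT_VERSION = 2  # bumped whenever the flattening scheme changes

def build_metadata(flat_keys):
    # record the format version alongside the flattened keys at save time
    return {"dcp_format_version": SAVE_FORMAT_VERSION, "keys": sorted(flat_keys)}

def resolve_keys(metadata, current_keys):
    # checkpoints written before versioning existed carry no version field
    version = metadata.get("dcp_format_version", 1)
    if version < SAVE_FORMAT_VERSION:
        # translate legacy keys instead of asserting on a mismatch
        return remap_legacy_keys(metadata["keys"], current_keys)
    return metadata["keys"]

def remap_legacy_keys(saved_keys, current_keys):
    # placeholder for whatever translation the real fix performs,
    # e.g. matching un-flattened container keys to their flattened form
    raise NotImplementedError
```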

fegin (Contributor) commented Aug 21, 2024

@bigning #134158 is the tentative solution. @pradeepfn, please help review the PR, thanks!

tianyu-l added the triaged label and removed the triage review label on Aug 27, 2024
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this issue Sep 20, 2024
… before 2.4 (pytorch#134158)

The original DCP doesn't flatten all the containers, which can cause issues; pytorch#125335 intends to solve this by flattening all the dictionaries.

Unfortunately, it breaks checkpoints saved before 2.4. This also exposes some issues in DCP:

1. DCP should record version in the metadata.
2. DCP should have a nice way to load old state_dict.
3. DCP should unflatten all containers (map, list) not just map.

This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future.

@pradeepfn Please let me know if this summary matches our discussion.

Fixes pytorch#133923

Pull Request resolved: pytorch#134158
Approved by: https://github.com/wz337, https://github.com/pradeepfn
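Point 3 in the commit message above (unflattening lists as well as maps) is roughly the following operation; this is an illustrative sketch, not DCP's actual helper, and the dotted-key format is hypothetical.

```python
# Illustrative unflatten that rebuilds nested dicts and lists from dotted keys.
def unflatten(flat):
    nested = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return _lists_from_numeric_keys(nested)

def _lists_from_numeric_keys(node):
    # turn {"0": a, "1": b} back into [a, b]; leave other dicts alone
    if not isinstance(node, dict):
        return node
    node = {k: _lists_from_numeric_keys(v) for k, v in node.items()}
    if node and all(k.isdigit() for k in node):
        return [node[str(i)] for i in range(len(node))]
    return node

print(unflatten({"optimizer.state.step": 10, "optimizer.param_groups.0.lr": 0.1}))
# {'optimizer': {'state': {'step': 10}, 'param_groups': [{'lr': 0.1}]}}
```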
bigning (Contributor, Author) commented Sep 23, 2024

Hi @fegin, it looks like this fix is not included in 2.4.1?

fegin (Contributor) commented Sep 27, 2024

@bigning Unfortunately, by the time the fix landed, the cherry-pick window for 2.4.1 had already closed.
