[Distributed Checkpointing][torch2.4] torch 2.4 can't load a checkpointing saved by torch2.3 #133923
cc @fegin
cc @LucasLLC
Hey @fegin, wondering how you plan to fix this. Will you revert the change, or keep that PR and add a patch so the old checkpoints can still be loaded? Asking because on our side we need that information to decide how we monkeypatch PyTorch in Composer for the torch 2.4 release. Thanks!
@bigning We are working on a PR to fix the BC issue. Reverting the change may also break other users and cause more issues. We are also planning to add versions to the metadata so that this kind of BC issue won't happen again in the future.
@bigning #134158 is the tentative solution. @pradeepfn Please help to review the PR, thanks!
… before 2.4 (pytorch#134158)
The original DCP didn't flatten all the containers, which can cause issues. pytorch#125335 intends to solve that by flattening all the dictionaries. Unfortunately, it breaks checkpoints that were saved before 2.4. This also exposes some issues with DCP:
1. DCP should record a version in the metadata.
2. DCP should have a clean way to load old state_dicts.
3. DCP should unflatten all containers (map, list), not just maps.
This PR only addresses issue 2 to unblock users. Issues 1 and 3 need to be addressed in the future. @pradeepfn Please let me know if this summary matches our discussion.
Fixes pytorch#133923
Pull Request resolved: pytorch#134158
Approved by: https://github.com/wz337, https://github.com/pradeepfn
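The versioned-metadata idea from point 1 could look roughly like this (a hypothetical sketch; the field names and version numbers are illustrative, not DCP's actual metadata format): record a format version at save time, and at load time treat version-less checkpoints as the pre-2.4 layout and convert them.

```python
# Hypothetical sketch of versioned checkpoint metadata; the "dcp_version"
# field and the version numbers are illustrative, not DCP's actual format.
def save_metadata(state_dict_keys):
    """Record a format version alongside the saved keys."""
    return {"dcp_version": 2, "keys": sorted(state_dict_keys)}

def load_keys(metadata, convert_old_key):
    """Return keys in the current layout, converting old checkpoints."""
    # Checkpoints without a version field predate flattening (pre-2.4 style).
    version = metadata.get("dcp_version", 1)
    if version == 1:
        # Old format: rewrite each key into the new layout.
        return [convert_old_key(k) for k in metadata["keys"]]
    return metadata["keys"]

meta = save_metadata({"model.w", "optim.lr"})
print(load_keys(meta, str.upper))  # ['model.w', 'optim.lr']
```

Branching on an explicit version at load time avoids guessing the format from the keys themselves, which is what makes silent BC breaks like this one possible.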
Hi @fegin, it looks like this fix is not included in 2.4.1?
@bigning Unfortunately, by the time the fix landed, the cherry-pick window for 2.4.1 had already closed.
🐛 Describe the bug
This PR changed how load_planner and save_planner flatten the state dict, breaking backward compatibility: checkpoints saved before the PR can no longer be loaded. The load fails here https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/default_planner.py#L319 because the metadata keys don't match the state_dict keys.
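To illustrate the mismatch (a minimal hypothetical sketch, not DCP's actual code — the `flatten` helper and dotted key format are assumptions): once the loader flattens nested containers into dotted keys, those keys no longer match the top-level keys recorded in metadata saved by the old, non-flattening code.

```python
# Hypothetical sketch of why flattened keys no longer match old metadata.
# `flatten` and the dotted-key convention are illustrative, not DCP's code.
def flatten(d, prefix=""):
    """Recursively flatten nested dicts into dotted keys."""
    out = {}
    for k, v in d.items():
        key = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            out.update(flatten(v, key))
        else:
            out[key] = v
    return out

# Checkpoint saved by torch 2.3: metadata recorded the *unflattened* keys.
old_metadata_keys = {"model", "optim"}

# torch 2.4 loader flattens the same state_dict before matching keys.
state_dict = {"model": {"w": 1}, "optim": {"lr": 0.1}}
new_keys = set(flatten(state_dict))

print(sorted(new_keys))               # ['model.w', 'optim.lr']
print(new_keys == old_metadata_keys)  # False -> load fails on key mismatch
```

This is the shape of the check that fails in default_planner.py: the loader compares the (now flattened) state_dict keys against metadata keys written in the old layout.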
Versions
PyTorch version: 2.4.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.31
Python version: 3.11.9 (main, Apr 6 2024, 17:59:24) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @xmfan @LucasLLC @pradeepfn