
[Distributed Checkpointing][torch2.4] torch 2.4 can't load a checkpoint saved by torch 2.3 #133923


Closed
bigning opened this issue Aug 19, 2024 · 7 comments
Assignees: fegin
Labels: high priority · oncall: distributed checkpointing · oncall: distributed · triaged

Comments

bigning (Contributor) commented Aug 19, 2024

🐛 Describe the bug

The PR that changed how load_planner and save_planner flatten the state dict (pytorch#125335, per the fix commit referenced below) broke backward compatibility: checkpoints saved before that PR can no longer be loaded. Loading fails at https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/default_planner.py#L319 because the metadata keys don't match the state_dict keys.
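For illustration, here is a minimal, self-contained sketch of how a change in the planner's flattening rule can make the keys recorded in the saved metadata diverge from the keys produced at load time. The key format and the exact rule change below are made up for illustration; they are not the real DCP formats.

```python
# Hypothetical illustration of the key mismatch; not the actual DCP key format.
state_dict = {"optimizer": {"state": {"step": 10}, "param_groups": [{"lr": 0.1}]}}

def flatten_top_level(d):
    # "old" behavior (hypothetical): only the outer mapping is flattened,
    # nested containers are stored as whole objects under a single key
    return {f"{outer}.{inner}": v for outer, sub in d.items() for inner, v in sub.items()}

def flatten_recursive(d, prefix=""):
    # "new" behavior (hypothetical): every nested dict becomes dotted keys
    out = {}
    for k, v in d.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten_recursive(v, key + "."))
        else:
            out[key] = v
    return out

saved_keys = set(flatten_top_level(state_dict))    # keys written into the old checkpoint metadata
current_keys = set(flatten_recursive(state_dict))  # keys the new load planner produces
print(saved_keys ^ current_keys)
# {'optimizer.state', 'optimizer.state.step'} -- the mismatch that trips the check in default_planner.py
```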

Versions

PyTorch version: 2.4.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.31

Python version: 3.11.9 (main, Apr 6 2024, 17:59:24) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB

Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @xmfan @LucasLLC @pradeepfn

bigning (Contributor, Author) commented Aug 19, 2024

cc @fegin

malfet added the high priority and oncall: distributed labels on Aug 20, 2024
yifuwang (Collaborator) commented

cc @LucasLLC

fegin self-assigned this on Aug 20, 2024
fegin added the oncall: distributed checkpointing label on Aug 20, 2024
bigning (Contributor, Author) commented Aug 20, 2024

Hey @fegin, wondering how you plan to fix it. Will you revert the change, or keep that PR and add a patch so the old checkpoints can still be loaded? Asking because on our side we need that information to decide how we monkeypatch PyTorch in Composer for the torch 2.4 release. Thanks!

fegin (Contributor) commented Aug 21, 2024

@bigning We are working on a PR to fix the BC issue. Reverting the change could break other users and cause more issues. We are also planning to add a version to the metadata so that this kind of BC issue won't happen again in the future.

@pradeepfn
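As a rough sketch of the versioning idea mentioned above, the metadata could carry a format version that the load path checks before deciding how to interpret keys. The field name, helper names, and fallback below are hypothetical, not the actual DCP API.

```python
# Hypothetical versioning sketch; field and helper names are made up.
SAVE_FORMAT_VERSION = 2  # bumped whenever the flattening scheme changes

def build_metadata(flat_keys):
    # record the format version alongside the flattened keys at save time
    return {"dcp_format_version": SAVE_FORMAT_VERSION, "keys": sorted(flat_keys)}

def resolve_keys(metadata, current_keys):
    # checkpoints written before versioning existed carry no version field
    version = metadata.get("dcp_format_version", 1)
    if version < SAVE_FORMAT_VERSION:
        # translate legacy keys instead of asserting on a mismatch
        return remap_legacy_keys(metadata["keys"], current_keys)
    return metadata["keys"]

def remap_legacy_keys(saved_keys, current_keys):
    # placeholder for whatever translation the real fix performs,
    # e.g. matching un-flattened container keys to their flattened form
    raise NotImplementedError
```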

fegin (Contributor) commented Aug 21, 2024

@bigning #134158 is the tentative solution. @pradeepfn, please help review the PR, thanks!

tianyu-l added the triaged label and removed the triage review label on Aug 27, 2024
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this issue Sep 20, 2024
… before 2.4 (pytorch#134158)

The original DCP doesn't flatten all the containers, which can cause issues; pytorch#125335 intends to solve this by flattening all the dictionaries.

Unfortunately, it breaks checkpoints saved before 2.4. This also exposes some issues in DCP:

1. DCP should record version in the metadata.
2. DCP should have a nice way to load old state_dict.
3. DCP should unflatten all containers (map, list) not just map.

This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future.

@pradeepfn Please let me know if this summary matches our discussion.

Fixes pytorch#133923

Pull Request resolved: pytorch#134158
Approved by: https://github.com/wz337, https://github.com/pradeepfn
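Point 3 in the commit message above (unflattening lists as well as maps) is roughly the following operation; this is an illustrative sketch, not DCP's actual helper, and the dotted-key format is hypothetical.

```python
# Illustrative unflatten that rebuilds nested dicts and lists from dotted keys.
def unflatten(flat):
    nested = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return _lists_from_numeric_keys(nested)

def _lists_from_numeric_keys(node):
    # turn {"0": a, "1": b} back into [a, b]; leave other dicts alone
    if not isinstance(node, dict):
        return node
    node = {k: _lists_from_numeric_keys(v) for k, v in node.items()}
    if node and all(k.isdigit() for k in node):
        return [node[str(i)] for i in range(len(node))]
    return node

print(unflatten({"optimizer.state.step": 10, "optimizer.param_groups.0.lr": 0.1}))
# {'optimizer': {'state': {'step': 10}, 'param_groups': [{'lr': 0.1}]}}
```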
bigning (Contributor, Author) commented Sep 23, 2024

Hi @fegin, it looks like this fix is not included in 2.4.1?

fegin (Contributor) commented Sep 27, 2024

@bigning Unfortunately, by the time the fix landed, the cherry-pick window for 2.4.1 had already closed.
