[torch2.4] Fix sharded checkpointing backward compatibility issue by bigning · Pull Request #3564 · mosaicml/composer · GitHub

[torch2.4] Fix sharded checkpointing backward compatibility issue #3564


Closed
wants to merge 3 commits

Conversation

@bigning commented Aug 20, 2024

torch 2.4 breaks backward compatibility for sharded checkpointing: it changed how the save_planner and load_planner flatten state dict keys (see pytorch/pytorch#133923), so the new load_planner can't load checkpoints saved with the old save_planner.

This PR monkey-patches the load_planner when loading an old checkpoint fails, then removes the patch once the checkpoint has been loaded.
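For illustration only (not the code in this PR), here is a minimal sketch of the try-then-patch-then-restore pattern described above, assuming a torch >= 2.4 environment. The `_BackwardCompatibleLoadPlanner` subclass, the `_patch_default_load_planner` helper, and `load_sharded_checkpoint` are hypothetical names, and the actual key-translation logic the real patch would need is elided.

```python
import contextlib

import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.default_planner import DefaultLoadPlanner


class _BackwardCompatibleLoadPlanner(DefaultLoadPlanner):
    """Hypothetical planner that would map the pre-2.4 flattened state-dict
    keys onto the layout the torch 2.4 load path expects. The key-translation
    logic is intentionally elided in this sketch."""


@contextlib.contextmanager
def _patch_default_load_planner():
    """Temporarily swap the module-level default load planner so any code that
    constructs it picks up the patched class, then restore the original."""
    original = dcp.DefaultLoadPlanner
    dcp.DefaultLoadPlanner = _BackwardCompatibleLoadPlanner
    try:
        yield
    finally:
        dcp.DefaultLoadPlanner = original


def load_sharded_checkpoint(state_dict, storage_reader):
    try:
        # Fast path: the checkpoint was saved with the torch 2.4 save_planner.
        dcp.load(state_dict=state_dict, storage_reader=storage_reader)
    except Exception:
        # Fallback: the checkpoint was likely written by an older save_planner.
        # Retry with the patched planner; the context manager removes the
        # patch once loading completes.
        with _patch_default_load_planner():
            dcp.load(
                state_dict=state_dict,
                storage_reader=storage_reader,
                planner=_BackwardCompatibleLoadPlanner(),
            )
```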

@bigning bigning marked this pull request as ready for review August 20, 2024 22:28
@bigning bigning requested a review from a team as a code owner August 20, 2024 22:28
@bigning (Author) commented Aug 20, 2024

Closing this since it's from a forked repo and can't trigger the daily test.

@bigning closed this Aug 20, 2024