Update run_language_modeling.py to handle writes on networked filesystem better #3356
In the case of multi-node distributed training, reads and writes typically happen on a common networked filesystem.

In the current version of the `run_language_modeling.py` script, processes with `local_rank` equal to 0 perform the writes to disk (tensorboard, dataset cache and model checkpointing). In multi-node distributed training, there ends up being one process per node with `local_rank` equal to 0, so multiple processes try to write to the filesystem at the same time, resulting in errors being thrown depending on the filesystem.

This pull request updates the script so that only the process with a `global_rank` of 0 does the writing. `global_rank` isn't a variable directly accessible in the script; it is obtained by calling `torch.distributed.get_rank()`.

I've tested the script in 4 different cases and it works without any error in each of them: multi-node training with DDP, single-node training with DDP, single-node training with DP, and single-GPU training.
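For illustration, here is a minimal sketch of the guard pattern described above. The helper name `is_global_master` and the commented call sites are assumptions for clarity, not the actual code changed in this PR:

```python
import torch.distributed as dist


def is_global_master(local_rank: int) -> bool:
    """Return True only for the single process that should write to disk.

    In non-distributed runs (local_rank == -1) there is only one process,
    so it is trivially the master. In distributed runs, the rank returned
    by torch.distributed.get_rank() is global across all nodes, so exactly
    one process in the whole job returns True.
    """
    if local_rank == -1:
        return True
    return dist.get_rank() == 0


# Illustrative call sites (names are assumptions, not the script's actual code):
# if is_global_master(args.local_rank):
#     tb_writer = SummaryWriter()            # tensorboard logging
#     torch.save(model.state_dict(), path)   # model checkpointing
#     save_dataset_cache(examples, cache)    # dataset cache
```

Checking the global rank instead of `local_rank` is what prevents one writer per node from racing on the shared filesystem: only one process in the entire job passes the guard.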