Update run_language_modeling.py to handle writes on networked filesystem better #3356
In the case of multi-node distributed training, reads and writes typically happen on a common networked filesystem.

In the current version of the `run_language_modeling.py` script, processes with `local_rank` equal to 0 perform the writes to disk (tensorboard, dataset cache and model checkpointing). In multi-node distributed training, there ends up being one process per node with `local_rank` equal to 0, so multiple processes try to write to the filesystem at the same time, resulting in errors being thrown depending on the filesystem.

This pull request updates the script so that only the process with a `global_rank` of 0 does the writing. `global_rank` isn't a variable directly accessible in the script; it is obtained by calling `torch.distributed.get_rank()`.

I've tested the script in 4 different cases and it works without any error in each of them: multi-node training with DDP, single-node training with DDP, single-node training with DP, and single-GPU training.
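For illustration, here is a minimal sketch of the guard pattern described above. The helper name `is_global_master` and the commented call sites are assumptions for clarity, not the actual code changed in this PR:

```python
import torch.distributed as dist


def is_global_master(local_rank: int) -> bool:
    """Return True only for the single process that should write to disk.

    In non-distributed runs (local_rank == -1) there is only one process,
    so it is trivially the master. In distributed runs, the rank returned
    by torch.distributed.get_rank() is global across all nodes, so exactly
    one process in the whole job returns True.
    """
    if local_rank == -1:
        return True
    return dist.get_rank() == 0


# Illustrative call sites (names are assumptions, not the script's actual code):
# if is_global_master(args.local_rank):
#     tb_writer = SummaryWriter()            # tensorboard logging
#     torch.save(model.state_dict(), path)   # model checkpointing
#     save_dataset_cache(examples, cache)    # dataset cache
```

Checking the global rank instead of `local_rank` is what prevents one writer per node from racing on the shared filesystem: only one process in the entire job passes the guard.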