Update run_language_modeling.py to handle writes on networked filesystem better by Genius1237 · Pull Request #3356 · huggingface/transformers · GitHub
Update run_language_modeling.py to handle writes on networked filesystem better #3356


Closed
wants to merge 1 commit

Conversation

Genius1237
Contributor

In the case of multi-node distributed training, reads and writes typically happen to a common networked filesystem.

In the current version of the run_language_modeling.py script, the processes that have local_rank 0 perform the writes to disk (tensorboard logs, dataset cache and model checkpoints). In the case of multi-node distributed training, there ends up being one process per node with local_rank 0, so multiple processes try writing to the filesystem at the same time, resulting in errors being thrown depending on the filesystem.

This pull request updates the script so that only the process with a global_rank of 0 does the writing. global_rank isn't a variable directly accessible in the script; it is obtained by calling torch.distributed.get_rank().
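For illustration, here is a minimal sketch of the idea (not the exact diff in this PR); `args`, `model` and `output_dir` stand in for the corresponding objects in the script:

```python
import torch.distributed as dist

def is_global_master(args):
    # local_rank == -1 means training is not distributed, so there is only one process.
    if args.local_rank == -1:
        return True
    # dist.get_rank() returns the rank across all nodes (the global rank),
    # so exactly one process in the whole job passes this check.
    return dist.get_rank() == 0

# Gate every filesystem write (tensorboard, dataset cache, checkpoints) on the global rank
# instead of local_rank, so only a single process touches the shared networked filesystem.
if is_global_master(args):
    model_to_save = model.module if hasattr(model, "module") else model
    model_to_save.save_pretrained(output_dir)
```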

I've tested the script in four different cases, and it runs without error in all of them: multi-node training with DDP, single-node training with DDP, single-node training with DP, and single-GPU training.

@Genius1237
Contributor Author

For reference, check out pytorch/pytorch#12042 and facebookresearch/maskrcnn-benchmark#40; they address the same issue.

Also, one of the checks that failed, check_code_quality, would fail for the existing version of the script as well: it enforces a maximum line length of 119 characters, and many existing lines already exceed that.

@LysandreJik
Member


This looks good to me, thanks for looking into it @Genius1237!
There's a current issue with the code quality test, so don't worry about it.

We should probably apply this modification to all the scripts.

@Genius1237
Contributor Author

I did think about the other scripts. Are those already set up with DistributedDataParallel? One would think those tasks aren't that heavy and wouldn't benefit much from running across multiple GPUs.

Also, I have one or two more fixes along the lines of this one for distributed training. I was wondering if I should rename this PR and add those in, or create a new PR for each fix. One of them is about loading the optimizer and scheduler checkpoints when resuming training.
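That checkpoint-resume fix isn't included in this PR, so the following is only a hypothetical sketch of the kind of pattern involved when reloading optimizer and scheduler state in a distributed run; the checkpoint path, file names and map_location choice are assumptions for illustration, not the author's actual fix, and `optimizer`/`scheduler` are assumed from the surrounding training code:

```python
import os
import torch

# Hypothetical resume logic: each process reloads the optimizer/scheduler state saved by rank 0.
# Loading onto CPU first avoids every process trying to materialize the tensors on GPU 0;
# Optimizer.load_state_dict then moves the state to the devices of the matching parameters.
checkpoint_dir = "output/checkpoint-50000"  # assumed path, for illustration only
optimizer_state = torch.load(os.path.join(checkpoint_dir, "optimizer.pt"), map_location="cpu")
scheduler_state = torch.load(os.path.join(checkpoint_dir, "scheduler.pt"), map_location="cpu")
optimizer.load_state_dict(optimizer_state)
scheduler.load_state_dict(scheduler_state)
```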

@stale
stale bot commented Jul 8, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Jul 8, 2020
@LysandreJik
Member

Closing this, as run_language_modeling.py is now based on the Trainer. Thank you for your contribution!

@LysandreJik closed this Jul 9, 2020