Fix lr_scheduler unexpectedly calls step() when init argument last_epoch is larger than -1 #149312
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149312
Note: Links to docs will display an error until the docs builds have been completed.

❌ 51 New Failures, 6 Cancelled Jobs, 1 Unrelated Failure as of commit 7ad1b75 with merge base 01f226b.

NEW FAILURES - The following jobs have failed:
CANCELLED JOBS - The following jobs were cancelled. Please retry:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 424ac56 to 7c5e79a
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here.
Fix lr_scheduler unexpectedly calls step() when init argument last_epoch is larger than -1
Successfully rebased.
Force-pushed from 7c5e79a to 538d5e0
This does not look like the right approach. If the discrepancy is for ExponentialLR between get_lr and _get_closed_form_lr, I'd expect the fix to be local there. Could you explain your approach a little bit?
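For context, the kind of discrepancy being discussed can be sketched as follows. This is illustrative only (the canonical repro is in issue #102261); setting "initial_lr" by hand just satisfies the resume check that schedulers perform when last_epoch > -1.

```python
import torch
from torch.optim.lr_scheduler import ExponentialLR

model = torch.nn.Linear(2, 2)
optim = torch.optim.SGD(model.parameters(), lr=1.0)
# Resuming with last_epoch > -1 requires "initial_lr" in each param group.
optim.param_groups[0]["initial_lr"] = 1.0

sched = ExponentialLR(optim, gamma=0.5, last_epoch=1)
# The constructor calls step(), so last_epoch silently advances to 2 and the
# recursive get_lr() multiplies the current lr by gamma exactly once:
print(sched.last_epoch)                # 2, not the requested 1
print(optim.param_groups[0]["lr"])     # 0.5
print(sched._get_closed_form_lr()[0])  # 0.25, disagreeing with the above
```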
test/optim/test_lrscheduler.py (Outdated)
optim2 = torch.optim.AdamW(model.parameters())
optim2.load_state_dict(optim.state_dict())
sch2 = LRClass(optim2, last_epoch=1)
self.assertEqual(optim.param_groups[0]["lr"], optim2.param_groups[0]["lr"])
This is not the same comparison as the repro; we should be comparing that the closed-form lr is the same as the param group lr?
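A revised assertion along the lines suggested might look like this (a fragment in the style of the quoted test; the final version in the PR may differ):

```python
sch2 = LRClass(optim2, last_epoch=1)
# Compare the scheduler's closed-form lr against what actually landed
# in the optimizer's param groups after resuming.
self.assertEqual(
    sch2._get_closed_form_lr(),
    [group["lr"] for group in optim2.param_groups],
)
```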
Changed, thanks!
Oh actually, I see what you're doing now. Sorry I was confused yesterday. I'm willing to accept this fix if you update the test case.
It would also be good to include a comment about why we prefer the `_is_initial`.
@@ -134,7 +135,8 @@ def wrapper(*args, **kwargs):
    def _initial_step(self):
        """Initialize step counts and perform a step."""
As someone who has looked into LRScheduler more than I've been able to, have you seen a good reason why we need to call .step() from the constructor?
I think one of the key effects is to initialize the optimizer's initial lr to be the same as the scheduler lr at creation time, reusing this part of the code:
pytorch/torch/optim/lr_scheduler.py, lines 200 to 220 in 4c5cf18:
with _enable_get_lr_call(self):
    if epoch is None:
        self.last_epoch += 1
        values = self.get_lr()
    else:
        warnings.warn(EPOCH_DEPRECATION_WARNING, UserWarning)
        self.last_epoch = epoch
        if hasattr(self, "_get_closed_form_lr"):
            values = cast(list[float], self._get_closed_form_lr())
        else:
            values = self.get_lr()

for param_group, lr in zip(self.optimizer.param_groups, values):
    if isinstance(param_group["lr"], Tensor):
        param_group["lr"].fill_(_to_scalar(lr))
    else:
        param_group["lr"] = lr

self._last_lr: list[float] = [
    group["lr"] for group in self.optimizer.param_groups
]
One improvement that could be made is extracting the internal lr-update logic out of the public step method; please check PR #149392 and the issue it fixed. Thanks!
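Such a refactor might look roughly like the sketch below. The helper name `_update_lr` is hypothetical (not from PR #149392), the deprecated `epoch` argument path is omitted for brevity, and `Tensor`, `_to_scalar`, and `_enable_get_lr_call` are the same names used in the snippet quoted above.

```python
class LRScheduler:
    def _update_lr(self, values):
        # Write the computed lrs into the optimizer's param groups
        # without going through the public step() API.
        for param_group, lr in zip(self.optimizer.param_groups, values):
            if isinstance(param_group["lr"], Tensor):
                param_group["lr"].fill_(_to_scalar(lr))
            else:
                param_group["lr"] = lr
        self._last_lr = [group["lr"] for group in self.optimizer.param_groups]

    def step(self):
        self._step_count += 1
        with _enable_get_lr_call(self):
            self.last_epoch += 1
            self._update_lr(self.get_lr())
```

The constructor could then call `_update_lr` directly to initialize the lr without the side effects of a full `step()`.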
I'd love to see this expanded to ensure this works for all LRSchedulers! I have confirmed that I see the same issue when testing with StepLR (when I try to resume training and set up a new LRScheduler, it is always one step off because of this initial step that is taken in the init of LRSchedulers).
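The StepLR symptom described above can be sketched the same way (illustrative values; "initial_lr" is set by hand to satisfy the resume check):

```python
import torch
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(2, 2)
optim = torch.optim.SGD(model.parameters(), lr=1.0)
optim.param_groups[0]["initial_lr"] = 1.0

# Ask to resume at epoch 1, where the lr should still be 1.0
# (the first decay only happens at epoch 2 with step_size=2).
sched = StepLR(optim, step_size=2, gamma=0.5, last_epoch=1)
print(sched.last_epoch)             # 2, one step past the requested epoch
print(optim.param_groups[0]["lr"])  # 0.5 instead of the expected 1.0
```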
@zeshengzong lmk if you can bring this PR over the finish line by expanding it to all LRSchedulers!
Hi @janeyx99, sorry for the late reply; I was busy with something else. I would like to fix all of them and hope I can clean up all the related issues.
Yes, adding a context to better distinguish the initial lr from the calculated lr.
    [
        partial(ExponentialLR, gamma=0.999),
    ],
)
It'd be great to expand this to more than ExponentialLR!
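One possible expansion of the parametrized list (which schedulers and constructor arguments to use is up to the author; these are illustrative):

```python
[
    partial(ExponentialLR, gamma=0.999),
    partial(StepLR, step_size=2, gamma=0.5),
    partial(MultiStepLR, milestones=[2, 4], gamma=0.5),
    partial(ConstantLR, factor=0.5, total_iters=4),
    partial(LinearLR, start_factor=0.5, total_iters=4),
]
```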
Participating in a PyTorch meetup; will do it next week, thanks! :D
Fixes #102261
Changes

- Add `_is_initial` to replace the `self.last_epoch == 0` condition when judging whether `lr` should take its initial value (see the sketch below)
- Fix the `ExponentialLR` checkpoint use case
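A rough sketch of the idea follows; the property body is an assumption for illustration, and the exact condition lives in the PR diff.

```python
class LRScheduler:
    @property
    def _is_initial(self) -> bool:
        # True only during the step() performed by the constructor, so a
        # scheduler resumed with last_epoch > -1 keeps the restored lr
        # instead of being treated as a fresh start.
        return self._step_count == 1  # assumed condition, not the PR's exact one

class ExponentialLR(LRScheduler):
    def get_lr(self):
        if self._is_initial:  # was: if self.last_epoch == 0
            return [group["lr"] for group in self.optimizer.param_groups]
        return [group["lr"] * self.gamma for group in self.optimizer.param_groups]
```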
Test Result