Fix `lr_scheduler` unexpectedly calls `step()` when init argument last_epoch is larger than -1 by zeshengzong · Pull Request #149312 · pytorch/pytorch · GitHub

Fix lr_scheduler unexpectedly calls step() when init argument last_epoch is larger than -1 #149312


Open
zeshengzong wants to merge 4 commits into main
Conversation

@zeshengzong (Contributor) commented Mar 17, 2025

Fixes #102261

Changes

  • Use a _is_initial flag instead of the self.last_epoch == 0 condition to decide whether the lr should take its initial value (see the sketch after this list)
  • Add a test for the ExponentialLR checkpoint use case
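Roughly, the idea looks like this. This is a standalone sketch for illustration only, not the PR diff; the flag handling is squeezed into the subclass constructor here so the snippet is self-contained, whereas the real change presumably manages it in the LRScheduler base class:

from torch.optim.lr_scheduler import LRScheduler

class SketchExponentialLR(LRScheduler):  # illustrative name, not the PR's class
    def __init__(self, optimizer, gamma, last_epoch=-1):
        self.gamma = gamma
        # True only while the base-class constructor performs its implicit step().
        self._is_initial = True
        super().__init__(optimizer, last_epoch)
        self._is_initial = False

    def get_lr(self):
        # The old check was `if self.last_epoch == 0:`, which is only true when the
        # scheduler is created with last_epoch=-1; when resuming with last_epoch > -1
        # the constructor's implicit step() applied gamma one extra time.
        if self._is_initial:
            return [group["lr"] for group in self.optimizer.param_groups]
        return [group["lr"] * self.gamma for group in self.optimizer.param_groups]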

Test Result

pytest -s test/optim/test_lrscheduler.py  -vv


pytorch-bot (bot) commented Mar 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149312

Note: Links to docs will display an error until the docs builds have been completed.

❌ 51 New Failures, 6 Cancelled Jobs, 1 Unrelated Failure

As of commit 7ad1b75 with merge base 01f226b:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@zeshengzong marked this pull request as ready for review March 18, 2025 08:06
@zeshengzong (Contributor, Author)

Hello @albanD @janeyx99, please check whether this fix is feasible. If it works, I would like to continue fixing more schedulers that have the same problem, such as MultiplicativeLR and LinearLR. Thanks!

@janeyx99 added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Mar 20, 2025
@albanD removed their request for review April 9, 2025 19:37
@zeshengzong (Contributor, Author)

@pytorchbot rebase -b main

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased fix/optim/step onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout fix/optim/step && git pull --rebase)

@janeyx99 (Contributor) left a comment

This does not look like the right approach. If the discrepancy is for ExponentialLR between get_lr and _get_closed_form_lr, I'd expect the fix to be local there. Could you explain your approach a little bit?

optim2 = torch.optim.AdamW(model.parameters())
optim2.load_state_dict(optim.state_dict())
sch2 = LRClass(optim2, last_epoch=1)
self.assertEqual(optim.param_groups[0]["lr"], optim2.param_groups[0]["lr"])
Contributor

This is not the same comparison as the repro--we should be comparing that the closed form lr is the same as the params group lr?
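Concretely, the comparison described here might look roughly like the following (a sketch, not the final test code; it reuses sch2/optim2 from the snippet above and ExponentialLR's private _get_closed_form_lr):

self.assertEqual(
    sch2._get_closed_form_lr(),
    [group["lr"] for group in optim2.param_groups],
)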

Contributor Author

Changed, thanks!

@janeyx99 (Contributor) left a comment

Oh actually, I see what you're doing now. Sorry I was confused yesterday. I'm willing to accept this fix if you update the test case.

It would also be good to include a comment about why we prefer the _is_initial.

@janeyx99 added the topic: bug fixes label (topic category) May 6, 2025
@janeyx99 dismissed their stale review May 6, 2025 17:46 (left a newer review)

@@ -134,7 +135,8 @@ def wrapper(*args, **kwargs):
def _initial_step(self):
"""Initialize step counts and perform a step."""
Contributor

As someone who has looked into LRScheduler more than I've been able to, have you seen a good reason why we need to call .step() from the constructor?

Contributor Author

I think one of the key effects is to initialize the optimizer's lr to match the scheduler's lr at creation time, and to reuse this part of the code:

with _enable_get_lr_call(self):
    if epoch is None:
        self.last_epoch += 1
        values = self.get_lr()
    else:
        warnings.warn(EPOCH_DEPRECATION_WARNING, UserWarning)
        self.last_epoch = epoch
        if hasattr(self, "_get_closed_form_lr"):
            values = cast(list[float], self._get_closed_form_lr())
        else:
            values = self.get_lr()

for param_group, lr in zip(self.optimizer.param_groups, values):
    if isinstance(param_group["lr"], Tensor):
        param_group["lr"].fill_(_to_scalar(lr))
    else:
        param_group["lr"] = lr

self._last_lr: list[float] = [
    group["lr"] for group in self.optimizer.param_groups
]

One improvement that could be made is extracting the internal lr-update logic out of the public step method; please check PR #149392 and the issue it fixes. Thanks!
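A hypothetical shape for that extraction (this is not the content of #149392; the class and method names below are illustrative only, and the attributes they touch are assumed to be set up by the real scheduler base class):

class _LRSchedulerSketch:
    def _apply_lr(self, values):
        # Shared private helper: write the computed lrs back to the optimizer
        # and remember them for get_last_lr().
        for param_group, lr in zip(self.optimizer.param_groups, values):
            param_group["lr"] = lr
        self._last_lr = [group["lr"] for group in self.optimizer.param_groups]

    def step(self, epoch=None):
        # Public entry point: advance the counters, then apply the new lrs.
        self._step_count += 1
        self.last_epoch += 1
        self._apply_lr(self.get_lr())

    def _initial_step(self):
        # Constructor path: reuse the same helper to seed the optimizer's lr
        # (exact counter bookkeeping is out of scope for this sketch).
        self._step_count = 0
        self._apply_lr(self.get_lr())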

@joecummings

I'd love to see this expanded to ensure this works for all LRSchedulers! I have confirmed that I see the same issue when testing with StepLR (when I try to resume training and set up a new LRScheduler, it is always one step off because of this initial step that is taken in the init of LRSchedulers).
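For reference, a minimal repro of that off-by-one (a sketch, not the PR's test; the concrete numbers assume the AdamW/ExponentialLR setup below, and the same pattern applies to StepLR):

import torch
from torch.optim.lr_scheduler import ExponentialLR

model = torch.nn.Linear(2, 2)
optim = torch.optim.AdamW(model.parameters(), lr=1.0)
sched = ExponentialLR(optim, gamma=0.9)
optim.step()
sched.step()                                    # after epoch 1: lr == 0.9

# "Resume" by rebuilding the optimizer and scheduler from the saved optimizer state.
optim2 = torch.optim.AdamW(model.parameters(), lr=1.0)
optim2.load_state_dict(optim.state_dict())      # restores lr (and initial_lr)
sched2 = ExponentialLR(optim2, gamma=0.9, last_epoch=1)

# The constructor's implicit step() applies gamma one more time, so the resumed
# lr is ~0.81 while the original optimizer is still at 0.9.
print(optim.param_groups[0]["lr"], optim2.param_groups[0]["lr"])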

@janeyx99 (Contributor)

@zeshengzong lmk if you can bring this PR over the finish line by expanding it to all LRSchedulers!

@zeshengzong (Contributor, Author) commented May 14, 2025

@zeshengzong lmk if you can bring this PR over the finish line by expanding it to all LRSchedulers!

Hi @janeyx99, sorry for the late reply, I was busy with something else. I would like to fix all of them and hope I can clean up all the issues related to lr_scheduler. Thanks for the help!

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
@zeshengzong (Contributor, Author)

Oh actually, I see what you're doing now. Sorry I was confused yesterday. I'm willing to accept this fix if you update the test case.

It would also be good to include a comment about why we prefer the _is_initial.

Yes, it adds context to better distinguish the initial lr from a calculated lr; self.last_epoch == 0 is not enough in this case.

[
    partial(ExponentialLR, gamma=0.999),
],
)
Contributor

It'd be great to expand this to more than ExponentialLR!

Contributor Author

Participating in a PyTorch meetup, will do it next week, thanks! :D

Labels
open source · release notes: optim · topic: bug fixes (topic category) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Development

Successfully merging this pull request may close these issues.

ExponentialLR unexpectedly calls step() when init argument last_epoch is larger than -1
5 participants