Add support for differentiable LR in SGD + test #143122

Conversation
Ah, I think the failed tests address my second point; I'll work on changing those tests to accept the differentiable flag, because having a default seems to error. Edit: I looked again and I couldn't find any of the tests, so I'm not sure where I saw them before!
Good direction so far! Left some comments. I agree the test case should be simplified and cleaned up as much as possible, and I hope my comments help in that regard!
test/optim/test_optim.py
Outdated
p = p.clone()
if not p.requires_grad:
    p.requires_grad_(True)
p = p * 1.0  # make leaf
not necessary anymore as line 70 makes p a leaf already!
If p originally didn't require_grad, then p.requires_grad_(True) makes it a leaf. Do you think we should allow people to pass in p that doesn't require grad?
If this is the outer leaf, it doesn't need to be cloned and can stay a leaf. We only want the inner param to be a non-leaf for an inner optim that is differentiable.
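To make the leaf distinction concrete, here is a small standalone sketch (mine, not code from the PR) of why requires_grad_(True) keeps a tensor a leaf while clone() or * 1.0 produces a non-leaf:

import torch

p = torch.rand(3)       # leaf, requires_grad=False
p.requires_grad_(True)  # in-place flag flip, records no op: still a leaf
print(p.is_leaf)        # True

q = p.clone()           # clone is an autograd op -> non-leaf
print(q.is_leaf)        # False

r = p * 1.0             # multiplication also records an op -> non-leaf
print(r.is_leaf)        # False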
test/optim/test_optim.py
Outdated
inner_optimizer.zero_grad()

meta_loss = loss  # type: ignore
meta_loss.backward(inputs=(p,), create_graph=True)
Is the create_graph here necessary? Let's remove everything that isn't strictly needed, for clarity and easier maintenance in the future.
It crashes if I remove it and gives this error:
RuntimeError: Trying to backward through the graph a second time
(or directly access saved tensors after they have already been freed).
Saved intermediate values of the graph are freed when you call .backward()
or autograd.grad(). Specify retain_graph=True if you need to backward
through the graph a second time or if you need to access saved tensors after calling backward.
If I do retain_graph=True instead, it doesn't crash, but the test actually fails.
oh hm....
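For reference, the error above is easy to reproduce in isolation; a minimal sketch (mine, not the test's code) of backwarding through the same graph twice:

import torch

x = torch.rand(3, requires_grad=True)
loss = (x * x).sum()
loss.backward()      # frees the graph's saved tensors by default
try:
    loss.backward()  # second backward through the same graph
except RuntimeError as e:
    print(e)         # "Trying to backward through the graph a second time..."

# retain_graph=True keeps the saved tensors alive; create_graph=True does
# that and additionally builds a graph of the backward pass itself, which
# is what a differentiable optimizer needs.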
test/optim/test_optim.py
Outdated
grad = torch.rand_like(params, requires_grad=True, dtype=torch.float64)

lr = torch.rand(1, requires_grad=True, dtype=torch.float64)
outer_mbuff = torch.rand_like(lr, requires_grad=True, dtype=torch.float64)
So the outer optimizer doesn't have to be differentiable, right? I wonder how much it can simplify the test to just have a simple SGD and not worry about its differentiability.
That's a good point, there's no real reason to add in outer_state at all so I'll remove it
I think we can definitely simplify the test by removing everything about testing the outer_optimizer! Interestingly, though, I think it still has to be differentiable: I found that SGD only accepts non-leaf parameters if it's differentiable, and anything that gets modified in the test function has to be a non-leaf because it's cloned. I tried the following code to instantiate the optimizer in the debug console:
Code:
print(p.is_leaf) # False
SGD([p])
Output:
ValueError: can't optimize a non-leaf Tensor
Code:
print(p.is_leaf) # False
SGD([p], differentiable=True)
Output:
SGD (
Parameter Group 0
dampening: 0
differentiable: True
foreach: None
fused: None
lr: 0.001
maximize: False
momentum: 0
nesterov: False
weight_decay: 0
)
Outer_params should be leaves for the outer optimizer though, so that should not need to be differentiable!
I included this because lr is an outer_param and I clone it, making it not a leaf. When I don't clone lr, it gets modified by the test and eventually ends up negative, and then the test fails because of the negative lr.
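If the goal is a fresh copy that stays a leaf, one option (a sketch of mine, not the PR's actual fix) is to detach before cloning:

import torch

lr = torch.tensor(0.1, requires_grad=True)

bad = lr.clone()                                 # records the clone op -> non-leaf
good = lr.detach().clone().requires_grad_(True)  # fresh leaf, safe to mutate

print(bad.is_leaf, good.is_leaf)                 # False True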
torch/optim/sgd.py
Outdated
param.add_(grad, alpha=-lr)
if differentiable and isinstance(lr, Tensor) and lr.requires_grad:
    param.addcmul_(grad, -lr.to(param.device))
This .to is expensive... I'm trying to brainstorm the best way to do this and will get back to you tomorrow. My thoughts so far: if lr is on CPU, we should be able to take advantage of its CPU-ness and just .item() before passing it to CUDA, to avoid this expensive .to op. If that's not easy to do quickly (I haven't figured out why addcmul doesn't automatically allow it yet), we should maintain an lr_dict similar to the one in the fused Adam optimizer. I'll get back to you!
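For context, the lr_dict idea (a hypothetical sketch loosely modeled on the fused Adam implementation; the names here are illustrative, not PyTorch API) caches the transfer so .to runs once per device instead of once per step:

from typing import Dict

import torch
from torch import Tensor

# Hypothetical per-device cache for a CPU-resident lr tensor.
_lr_dict: Dict[torch.device, Tensor] = {}

def _lr_on(device: torch.device, lr: Tensor) -> Tensor:
    # Pay for the host-to-device copy only the first time we see a device.
    # (A differentiable version would need to invalidate the cache whenever
    # lr changes, or the graph through lr would go stale.)
    if lr.device == device:
        return lr
    if device not in _lr_dict:
        _lr_dict[device] = lr.to(device, non_blocking=True)
    return _lr_dict[device]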
Alright! So here's the ideal path forward for performance:

- We should figure out how to take advantage of the CPU-ness of the lr and just .item() before passing it to CUDA for addcmul. Note that the following already works for some torch ops (add and mul):

import torch
grad = torch.rand(2, 4, device="cuda")
lr = torch.tensor(1e-3, requires_grad=True)
torch.add(grad, lr)  # works!
torch.mul(grad, lr)  # also works!
torch.addcmul(grad, grad, lr)  # does not work :/, device mismatch

Step one would be to write a PR (with a test) that enables that line to work just like add and mul do. The CUDA kernel for addcmul is in PointwiseOpsKernel.cu.

- Then come back to this PR and use param.addcmul_(grad, lr, value=-1).
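In the meantime, the mul path that already handles a 0-dim CPU tensor can express the same update (my sketch, not something proposed in the thread, and it costs one temporary):

import torch

param = torch.rand(2, 4, device="cuda")
grad = torch.rand(2, 4, device="cuda")
lr = torch.tensor(1e-3, requires_grad=True)  # 0-dim, lives on CPU

# Same math as param.addcmul_(grad, lr, value=-1): mul broadcasts the 0-dim
# CPU lr onto CUDA (as shown above), so this stays differentiable w.r.t. lr
# without an explicit .to().
param.add_(grad * -lr)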
Do you think I should add in support for functional_sgd? I was going to, but then I realized that _FunctionalSGD runs sgd with …
Renamed kwargs to inner_kwargs, changed x, y to be simpler, reordered the test var definitions to be more logical, and put lr back into inner_kwargs to make the function more adaptable for future enhancement.
If this is referring to the ones in distributed, don't bother!
Ok, I won't! I just set differentiable to False in _FunctionalSGD.
Good progress! The biggest thing I'm asking of you in this review is to leave wonderful Python land and see whether you can make addcmul with a 0-dim Tensor performant (if it's on CPU), and then come back and use it.
@@ -1,6 +1,11 @@
# Owner(s): ["module: optimizer"]

from __future__ import annotations
Nit: Is there a more global way to enable this across the repo? Vs. adding it everywhere?
I looked online and couldn't find anything -- I can ask for it by opening an issue on the cpython repo if you'd like
for the purposes of this pr this is fine + you do not have to but ofc it is up to u!
    inner_kwargs: dict[str, Any],
    *ignored: Any,
) -> tuple[Tensor, ...]:
    assert (
Is there a decision you made here to make the caller responsible for passing differentiable=True? I.e., another way to do this is to set inner_kwargs["differentiable"] = True inside this function instead of the assert, and have the caller not pass in differentiable at all.
I was thinking that every time the preexisting _diff_fn is called, differentiable is True, so I wanted to follow the convention and have it be explicitly passed in inner_kwargs. At the same time, I knew the test would break if it wasn't true, so I thought I should add the assert.
okay, i'm fine with that
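For skimmers, the two conventions being weighed (sketched with a hypothetical, stripped-down helper signature, not the test's real one):

from typing import Any

# Option A (what the PR does): the caller must pass differentiable=True.
def _diff_fn_a(inner_kwargs: dict[str, Any]) -> None:
    assert inner_kwargs["differentiable"] is True
    ...

# Option B (the alternative above): the helper forces it on.
def _diff_fn_b(inner_kwargs: dict[str, Any]) -> None:
    inner_kwargs["differentiable"] = True
    ...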
test/optim/test_optim.py
Outdated
inner_kwargs["differentiable"] is True | ||
), "Only call this test function when differentiable=True" | ||
# TODO: Adjust this function to test more hyperparameters than just lr | ||
assert "lr" in inner_kwargs, "Only call this test function with a custom lr" |
Same here
This one was similar, because I wanted to expand it later to something like assert 'lr' in inner_kwargs or 'betas' in inner_kwargs or 'weight_decay' in inner_kwargs.
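That future version could be written more compactly (a sketch assuming the same inner_kwargs dict):

from typing import Any

inner_kwargs: dict[str, Any] = {"lr": 0.01, "differentiable": True}  # example

_DIFFERENTIABLE_HPARAMS = ("lr", "betas", "weight_decay")
assert any(k in inner_kwargs for k in _DIFFERENTIABLE_HPARAMS), (
    "Only call this test function with at least one custom hyperparameter"
)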
test/optim/test_optim.py
Outdated
lr = inner_kwargs["lr"].clone().requires_grad_(True)
inner_kwargs.update({"lr": lr})
lr = lr * 1.0  # make non-leaf
lr = lr * 1.0  # make non-leaf

Why is making it a non-leaf necessary?
I investigated this, and apparently that line was really problematic: it broke the gradient computation and caused lr.grad to be None. The test passed because it skipped over lr when its gradient was None, but when I remove the * 1.0 I get some complicated behavior that I talk about in my next comment.
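A standalone sketch (mine, not the test's code) of why the * 1.0 left lr.grad as None: by default, .grad is only populated on leaves:

import torch

lr = torch.tensor(0.1, requires_grad=True)  # leaf
lr2 = lr * 1.0                              # non-leaf

loss = (lr2 * 3.0) ** 2
loss.backward()

print(lr.grad)   # tensor(1.8000): gradients accumulate on the leaf
print(lr2.grad)  # None (with a UserWarning): non-leaf .grad isn't retained
                 # unless lr2.retain_grad() is called first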
I looked more into this: the test fails with return (lr,), but it passes with return (meta_loss,), so the derivative of meta_loss w.r.t. lr is being computed correctly, but the derivatives of the returned lr aren't.

The analytical_vJu for lr w.r.t. params was 0, and for lr w.r.t. lr it was 1, but the numerical_vJu for lr w.r.t. params was 0.3474 and for lr w.r.t. lr it was -1.3101. The analytical values make sense, because params have no impact on lr and d(lr)/d(lr) should be 1; the numerical results differ because we run outer_optim.step(). What I don't get is why the gradients don't flow through outer_optim.step() even though it's differentiable.

The test works if we remove outer_optimizer.step(), because then d(lr)/d(params) = 0 and d(lr)/d(lr) = 1. I'm going to work on other comments for now, but do you think we should

- look more into why the gradients don't account for the update taken by outer_optim.step(), or
- just leave it as is and remove outer_optim.step() along with the entire outer optimizer? I'm kinda for this option, because the test would still validate that the inner optimizer is differentiable, but maybe I'm not considering something important.
I agree with your inclination. We can dig into why the params/derivatives don't line up, but it's likely due to some faulty test code that we're removing anyway.
At this point, I'm more curious whether the test case can be combined with the existing test case if there is no more outer optimizer, but I haven't looked closely, so this may be too ambitious.
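For reference, the simplified shape the test is converging toward: one differentiable inner step, no outer optimizer, checked directly with gradcheck. A minimal sketch assuming this PR's tensor-lr support in SGD (not the final test code):

import torch
from torch.optim import SGD

def one_sgd_step(p_init, grad, lr):
    # One differentiable SGD step; return the updated param so gradcheck can
    # compare analytical vs. numerical Jacobians w.r.t. p_init, grad, and lr.
    p = p_init.clone()  # clone -> non-leaf, hence differentiable=True below
    p.grad = grad
    opt = SGD([p], lr=lr, differentiable=True)
    opt.step()
    return (p,)

p = torch.rand(3, dtype=torch.float64, requires_grad=True)
g = torch.rand(3, dtype=torch.float64, requires_grad=True)
lr = torch.rand(1, dtype=torch.float64, requires_grad=True)
torch.autograd.gradcheck(one_sgd_step, (p, g, lr))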
test/optim/test_optim.py
Outdated
criterion = nn.MSELoss()

# Just SGD because we're interested in the inner_optimizer
outer_optimizer = SGD([lr], differentiable=True)
Shouldn't need to be differentiable I don't think
Is this the same as what we do for foreach?
It's not; in foreach they add it in and then raise an error if SGD w/ foreach is called while scripting, so I'll do that!
[Several unrelated commits (#143178, #143237, #141906, #140911, #143100, #143245, #143213, #142350) from other PRs appear here, pulled into this branch by an accidental rebase merge; see the next comment.]
Oh my god, I'm so sorry everyone, I just wanted to rebase but my rebase contained a merge, oh my god.
Second PR in a larger project to broaden support for differentiable optimizers with @janeyx99! The first one had an issue near the end, so this is the second PR on that subject. See #143122 for the development up until this point. Pull Request resolved: #143510 Approved by: https://github.com/janeyx99
PR v2.0 has been merged. @albanD The differentiable tensor path for lr w/ SGD works now!
First PR in a larger project to broaden support for differentiable optimizers with @janeyx99!

I would love some feedback on this PR. I'm uncertain about the following two things:

1. I call meta_loss.backward(inputs=(p,), create_graph=True), but I also tried meta_loss.backward(inputs=(p, lr,), create_graph=True) and that gave me an incorrect answer in the gradcheck. I know that lr is getting a meaningful gradient, but I'm confused why adding it directly to the inputs would change how its gradients are computed, when it looks like an input when the graph is visualized with torchviz's make_dot.

2. I added differentiable: bool = False to _single_tensor_sgd, _multi_tensor_sgd, and _fused_sgd. In multi and fused I throw an error when they are called with differentiable=True. I'm wondering about the fact that I set a default parameter, which doesn't seem to be the standard. I do this because some tests directly call _single_tensor_sgd, but I was wondering if I should just change the tests instead of adding the default param.

def _single_tensor_sgd(
    params: List[Tensor],
    grads: List[Tensor],
    momentum_buffer_list: List[Optional[Tensor]],
    grad_scale: Optional[Tensor],
    found_inf: Optional[Tensor],
    *,
    weight_decay: float,
    momentum: float,
    lr: float | Tensor,
    dampening: float,
    nesterov: bool,
    maximize: bool,
    has_sparse_grad: bool,
+   differentiable: bool = False,
):

I'd also love a way to make my test more concise, because I feel like the more lines I add, the more error-prone it is.
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @xmfan