Add support for differentiable LR in SGD + test #143122

Conversation
Ah, I think the failed tests address my second point; I'll work on changing those tests to accept the differentiable flag, because having a default seems to error. Edit: I looked again and I couldn't find any of the tests, so I'm not sure where I saw them before!
Good direction so far! Left some comments. I agree the test case should be simplified and cleaned up as much as possible, and I hope my comments help in that regard!
test/optim/test_optim.py
Outdated
p = p.clone()
if not p.requires_grad:
    p.requires_grad_(True)
p = p * 1.0  # make leaf
not necessary anymore as line 70 makes p a leaf already!
If p originally didn't require_grad, then p.requires_grad_(True) makes it a leaf. Do you think we should allow people to pass in p that doesn't require grad?
If this is the outer leaf, it doesn't need to be cloned and can stay a leaf. We only want the inner param to be a non-leaf for an inner optim that is differentiable.
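To make the leaf distinction concrete, here is a small standalone sketch (mine, not code from the PR) of why requires_grad_(True) keeps a tensor a leaf while clone() or * 1.0 produces a non-leaf:

import torch

p = torch.rand(3)       # leaf, requires_grad=False
p.requires_grad_(True)  # in-place flag flip, records no op: still a leaf
print(p.is_leaf)        # True

q = p.clone()           # clone is an autograd op -> non-leaf
print(q.is_leaf)        # False

r = p * 1.0             # multiplication also records an op -> non-leaf
print(r.is_leaf)        # False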
test/optim/test_optim.py
Outdated
inner_optimizer.zero_grad()

meta_loss = loss  # type: ignore
meta_loss.backward(inputs=(p,), create_graph=True)
Is the create_graph here necessary? Let's remove everything that isn't strictly needed, for clarity and easier maintenance in the future.
It crashes if I remove it and gives this error:
RuntimeError: Trying to backward through the graph a second time
(or directly access saved tensors after they have already been freed).
Saved intermediate values of the graph are freed when you call .backward()
or autograd.grad(). Specify retain_graph=True if you need to backward
through the graph a second time or if you need to access saved tensors after calling backward.
If I do retain_graph=True instead, it doesn't crash, but the test actually fails.
oh hm....
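For reference, the error above is easy to reproduce in isolation; a minimal sketch (mine, not the test's code) of backwarding through the same graph twice:

import torch

x = torch.rand(3, requires_grad=True)
loss = (x * x).sum()
loss.backward()      # frees the graph's saved tensors by default
try:
    loss.backward()  # second backward through the same graph
except RuntimeError as e:
    print(e)         # "Trying to backward through the graph a second time..."

# retain_graph=True keeps the saved tensors alive; create_graph=True does
# that and additionally builds a graph of the backward pass itself, which
# is what a differentiable optimizer needs.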
test/optim/test_optim.py
Outdated
grad = torch.rand_like(params, requires_grad=True, dtype=torch.float64)

lr = torch.rand(1, requires_grad=True, dtype=torch.float64)
outer_mbuff = torch.rand_like(lr, requires_grad=True, dtype=torch.float64)
So the outer optimizer doesn't have to be differentiable, right? I wonder how much it can simplify the test to just have a simple SGD and not worry about its differentiability.
That's a good point, there's no real reason to add in outer_state at all so I'll remove it
I think we can definitely simplify the test by removing everything about testing the outer_optimizer! Interestingly, though, I think it still has to be differentiable: I found that SGD only accepts non-leaf parameters if it's differentiable, and anything that gets modified in the test function has to be a non-leaf because it's cloned. I tried the following code to instantiate the optimizer in the debug console:
Code:
print(p.is_leaf) # False
SGD([p])
Output:
ValueError: can't optimize a non-leaf Tensor
Code:
print(p.is_leaf) # False
SGD([p], differentiable=True)
Output:
SGD (
Parameter Group 0
dampening: 0
differentiable: True
foreach: None
fused: None
lr: 0.001
maximize: False
momentum: 0
nesterov: False
weight_decay: 0
)
Outer_params should be leaves for the outer optimizer though, so that should not need to be differentiable!
I included this because lr is an outer_param and I clone it, making it not a leaf. When I don't clone lr, it gets modified by the test and eventually ends up negative, and then the test fails because of the negative lr.
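If the goal is a fresh copy that stays a leaf, one option (a sketch of mine, not the PR's actual fix) is to detach before cloning:

import torch

lr = torch.tensor(0.1, requires_grad=True)

bad = lr.clone()                                 # records the clone op -> non-leaf
good = lr.detach().clone().requires_grad_(True)  # fresh leaf, safe to mutate

print(bad.is_leaf, good.is_leaf)                 # False True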
torch/optim/sgd.py
Outdated
param.add_(grad, alpha=-lr)
if differentiable and isinstance(lr, Tensor) and lr.requires_grad:
    param.addcmul_(grad, -lr.to(param.device))
This .to is expensive... I'm trying to brainstorm the best way to do this and will get back to you tomorrow. My thoughts so far: if lr is on CPU, we should be able to take advantage of its CPU-ness and just .item() before passing it to CUDA, to avoid this expensive .to op. If that's not easy to do quickly (I haven't figured out why addcmul doesn't automatically allow it yet), we should maintain an lr_dict similar to the one in the fused Adam optimizer. I'll get back to you!
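For context, the lr_dict idea (a hypothetical sketch loosely modeled on the fused Adam implementation; the names here are illustrative, not PyTorch API) caches the transfer so .to runs once per device instead of once per step:

from typing import Dict

import torch
from torch import Tensor

# Hypothetical per-device cache for a CPU-resident lr tensor.
_lr_dict: Dict[torch.device, Tensor] = {}

def _lr_on(device: torch.device, lr: Tensor) -> Tensor:
    # Pay for the host-to-device copy only the first time we see a device.
    # (A differentiable version would need to invalidate the cache whenever
    # lr changes, or the graph through lr would go stale.)
    if lr.device == device:
        return lr
    if device not in _lr_dict:
        _lr_dict[device] = lr.to(device, non_blocking=True)
    return _lr_dict[device]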
Alright! So here's the ideal path forward for performance:

- We should figure out how to take advantage of the CPU-ness of the lr and just .item() before passing it to CUDA for addcmul. Note that the following already works for some torch ops (add and mul):

import torch
grad = torch.rand(2, 4, device="cuda")
lr = torch.tensor(1e-3, requires_grad=True)
torch.add(grad, lr)  # works!
torch.mul(grad, lr)  # also works!
torch.addcmul(grad, grad, lr)  # does not work :/, device mismatch

Step one would be to write a PR (with a test) that enables that line to work just like add and mul do. The CUDA kernel for addcmul is in PointwiseOpsKernel.cu.

- Then come back to this PR and use param.addcmul_(grad, lr, value=-1).
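In the meantime, the mul path that already handles a 0-dim CPU tensor can express the same update (my sketch, not something proposed in the thread, and it costs one temporary):

import torch

param = torch.rand(2, 4, device="cuda")
grad = torch.rand(2, 4, device="cuda")
lr = torch.tensor(1e-3, requires_grad=True)  # 0-dim, lives on CPU

# Same math as param.addcmul_(grad, lr, value=-1): mul broadcasts the 0-dim
# CPU lr onto CUDA (as shown above), so this stays differentiable w.r.t. lr
# without an explicit .to().
param.add_(grad * -lr)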
Do you think I should add in support for functional_sgd? I was going to, but then I realized that _FunctionalSGD runs sgd with …
Renamed kwargs to inner_kwargs, changed x, y to be simpler, reordered the test var definitions to be more logical, and put lr back into inner_kwargs to make the function more adaptable for future enhancement.
If this is referring to the ones in distributed, don't bother!
Ok, I won't! I just set differentiable to False in _FunctionalSGD.
Good progress! The biggest thing I'm asking of you in this review is to leave wonderful Python land and see whether you can make addcmul with a 0-dim Tensor performant (if it's on CPU), and then come back and use it.
@@ -1,6 +1,11 @@
# Owner(s): ["module: optimizer"]

from __future__ import annotations
Nit: Is there a more global way to enable this across the repo? Vs. adding it everywhere?
I looked online and couldn't find anything -- I can ask for it by opening an issue on the cpython repo if you'd like
for the purposes of this pr this is fine + you do not have to but ofc it is up to u!
    inner_kwargs: dict[str, Any],
    *ignored: Any,
) -> tuple[Tensor, ...]:
    assert (
Is there a decision you made here to make the caller responsible for passing differentiable=True? I.e., another way to do this is to set inner_kwargs["differentiable"] = True inside this function instead of the assert, and have the caller not pass in differentiable at all.
I was thinking that every time the preexisting _diff_fn is called, differentiable is True, so I wanted to follow the convention and have it be explicitly passed in inner_kwargs. At the same time, I knew the test would break if it wasn't true, so I thought I should add the assert.
okay, i'm fine with that
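For skimmers, the two conventions being weighed (sketched with a hypothetical, stripped-down helper signature, not the test's real one):

from typing import Any

# Option A (what the PR does): the caller must pass differentiable=True.
def _diff_fn_a(inner_kwargs: dict[str, Any]) -> None:
    assert inner_kwargs["differentiable"] is True
    ...

# Option B (the alternative above): the helper forces it on.
def _diff_fn_b(inner_kwargs: dict[str, Any]) -> None:
    inner_kwargs["differentiable"] = True
    ...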
test/optim/test_optim.py
Outdated
inner_kwargs["differentiable"] is True | ||
), "Only call this test function when differentiable=True" | ||
# TODO: Adjust this function to test more hyperparameters than just lr | ||
assert "lr" in inner_kwargs, "Only call this test function with a custom lr" |
Same here
This one was similar, because I wanted to expand it later to something like assert 'lr' in inner_kwargs or 'betas' in inner_kwargs or 'weight_decay' in inner_kwargs.
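That future version could be written more compactly (a sketch assuming the same inner_kwargs dict):

from typing import Any

inner_kwargs: dict[str, Any] = {"lr": 0.01, "differentiable": True}  # example

_DIFFERENTIABLE_HPARAMS = ("lr", "betas", "weight_decay")
assert any(k in inner_kwargs for k in _DIFFERENTIABLE_HPARAMS), (
    "Only call this test function with at least one custom hyperparameter"
)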
test/optim/test_optim.py
Outdated
lr = inner_kwargs["lr"].clone().requires_grad_(True)
inner_kwargs.update({"lr": lr})
lr = lr * 1.0  # make non-leaf
lr = lr * 1.0  # make non-leaf

Why is making it a non-leaf necessary?
I investigated this, and apparently that line was really problematic: it broke the gradient computation and caused lr.grad to be None. The test passed because it skipped over lr when its gradient was None, but when I remove the * 1.0 I get some complicated behavior that I talk about in my next comment.
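A standalone sketch (mine, not the test's code) of why the * 1.0 left lr.grad as None: by default, .grad is only populated on leaves:

import torch

lr = torch.tensor(0.1, requires_grad=True)  # leaf
lr2 = lr * 1.0                              # non-leaf

loss = (lr2 * 3.0) ** 2
loss.backward()

print(lr.grad)   # tensor(1.8000): gradients accumulate on the leaf
print(lr2.grad)  # None (with a UserWarning): non-leaf .grad isn't retained
                 # unless lr2.retain_grad() is called first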
I looked more into this: the test fails with return (lr,), but it passes with return (meta_loss,), so the derivative of meta_loss w.r.t. lr is being computed correctly, but the derivatives of the returned lr aren't.

The analytical_vJu for lr w.r.t. params was 0, and for lr w.r.t. lr it was 1, but the numerical_vJu for lr w.r.t. params was 0.3474 and for lr w.r.t. lr it was -1.3101. The analytical values make sense, because params have no impact on lr and d(lr)/d(lr) should be 1; the numerical results differ because we run outer_optim.step(). What I don't get is why the gradients don't flow through outer_optim.step() even though it's differentiable.

The test works if we remove outer_optimizer.step(), because then d(lr)/d(params) = 0 and d(lr)/d(lr) = 1. I'm going to work on other comments for now, but do you think we should

- look more into why the gradients don't account for the update taken by outer_optim.step(), or
- just leave it as is and remove outer_optim.step() along with the entire outer optimizer? I'm kinda for this option, because the test would still validate that the inner optimizer is differentiable, but maybe I'm not considering something important.
I agree with your inclination. We can dig into why the params/derivatives don't line up, but it's likely due to some faulty test code that we're removing anyway.
At this point, I'm more curious whether the test case can be combined with the existing test case if there is no more outer optimizer, but I haven't looked closely, so this may be too ambitious.
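For reference, the simplified shape the test is converging toward: one differentiable inner step, no outer optimizer, checked directly with gradcheck. A minimal sketch assuming this PR's tensor-lr support in SGD (not the final test code):

import torch
from torch.optim import SGD

def one_sgd_step(p_init, grad, lr):
    # One differentiable SGD step; return the updated param so gradcheck can
    # compare analytical vs. numerical Jacobians w.r.t. p_init, grad, and lr.
    p = p_init.clone()  # clone -> non-leaf, hence differentiable=True below
    p.grad = grad
    opt = SGD([p], lr=lr, differentiable=True)
    opt.step()
    return (p,)

p = torch.rand(3, dtype=torch.float64, requires_grad=True)
g = torch.rand(3, dtype=torch.float64, requires_grad=True)
lr = torch.rand(1, dtype=torch.float64, requires_grad=True)
torch.autograd.gradcheck(one_sgd_step, (p, g, lr))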
test/optim/test_optim.py
Outdated
criterion = nn.MSELoss()

# Just SGD because we're interested in the inner_optimizer
outer_optimizer = SGD([lr], differentiable=True)
Shouldn't need to be differentiable I don't think
Is this the same as what we do for foreach?
It's not; in foreach they add it in and then raise an error if SGD w/ foreach is called while scripting, so I'll do that!
[Several unrelated commits (#143178, #143237, #141906, #140911, #143100, #143245, #143213, #142350) from other PRs appear here, pulled into this branch by an accidental rebase merge; see the next comment.]
Oh my god, I'm so sorry everyone, I just wanted to rebase but my rebase contained a merge, oh my god.
Second PR in a larger project to broaden support for differentiable optimizers with @janeyx99! The first one had an issue near the end, so this is the second PR on that subject. See #143122 for the development up until this point. Pull Request resolved: #143510 Approved by: https://github.com/janeyx99
PR v2.0 has been merged. @albanD The differentiable tensor path for lr w/ SGD works now!
First PR in a larger project to broaden support for differentiable optimizers with @janeyx99!

I would love some feedback on this PR. I'm uncertain about the following two things:

1. I call meta_loss.backward(inputs=(p,), create_graph=True), but I also tried meta_loss.backward(inputs=(p, lr,), create_graph=True) and that gave me an incorrect answer in the gradcheck. I know that lr is getting a meaningful gradient, but I'm confused why adding it directly to the inputs would change how its gradients are computed, when it looks like an input when the graph is visualized with torchviz's make_dot.

2. I added differentiable: bool = False to _single_tensor_sgd, _multi_tensor_sgd, and _fused_sgd. In multi and fused I throw an error when they are called with differentiable=True. I'm wondering about the fact that I set a default parameter, which doesn't seem to be the standard. I do this because some tests directly call _single_tensor_sgd, but I was wondering if I should just change the tests instead of adding the default param.

def _single_tensor_sgd(
    params: List[Tensor],
    grads: List[Tensor],
    momentum_buffer_list: List[Optional[Tensor]],
    grad_scale: Optional[Tensor],
    found_inf: Optional[Tensor],
    *,
    weight_decay: float,
    momentum: float,
    lr: float | Tensor,
    dampening: float,
    nesterov: bool,
    maximize: bool,
    has_sparse_grad: bool,
+   differentiable: bool = False,
):

I'd also love a way to make my test more concise, because I feel like the more lines I add, the more error-prone it is.
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @xmfan