Replies: 27 comments
-
Just had a very quick look, so this might not be the whole answer, but: have a look at DiffSharp/src/DiffSharp.Core/Optim.fs, line 44 in 8e1608c. This was something I added by design to turn off nesting through the standard optimizers by default, for efficiency in the standard use cases. Also, we should add some unit test cases for this reversible behavior.
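(For anyone skimming: the flag referred to surfaces later in this thread as the `reversible` constructor argument. A minimal sketch of opting in; the `Linear` model is just a placeholder and the opens/constructor signature aren't checked against that exact commit:)

```fsharp
open DiffSharp
open DiffSharp.Model
open DiffSharp.Optim

// Placeholder model; any Model works the same way here.
let model = Linear(10, 1)

// Nesting through the optimizer is off by default for efficiency;
// pass reversible=true when hypergradients need to flow through the update steps.
let opt = Adam(model, lr=dsharp.tensor(0.001), reversible=true)
```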
-
I see, I'll try that, thanks! FWIW I added the numerical gradient to
and got
so the gradient is really not zero :-)
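(For reference, the kind of central-difference check I mean is something like the following; `f` is whatever scalar objective the optimizer is minimizing, and the helper name is made up:)

```fsharp
open DiffSharp

// Hypothetical central-difference gradient check, for comparing against the
// gradient reported by the optimizer. f maps a 1-d tensor to a scalar tensor.
let numericalGrad (f: Tensor -> Tensor) (x: Tensor) (eps: float) =
    dsharp.stack [ for i in 0 .. x.nelement - 1 ->
                     // perturb only the i-th element by eps
                     let e = dsharp.tensor [ for j in 0 .. x.nelement - 1 -> if j = i then eps else 0.0 ]
                     (f (x + e) - f (x - e)) / (2.0 * eps) ]
```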
-
Another approach could be to set up an optimization loop based on the diffsharp differentiation API (e.g.,
An example of this is in the unit tests:
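(A minimal hand-rolled sketch of what such a loop could look like, assuming the `dsharp.grad` entry point; this is not the exact code from the unit tests:)

```fsharp
open DiffSharp

// Gradient-descent loop driven directly by the differentiation API,
// instead of going through the Optim types.
let gradientDescent (f: Tensor -> Tensor) (x0: Tensor) (lr: float) (steps: int) =
    let mutable x = x0
    for _ in 1 .. steps do
        let g = dsharp.grad f x   // reverse-mode gradient of f at x
        x <- x - lr * g
    x
```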
-
BTW I think we should use string interpolation everywhere, e.g.
and where necessary strongly typed ones:
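(Along these lines; the values are made up for illustration:)

```fsharp
let epoch, loss = 3, 1.2345

// Plain interpolation:
printfn $"epoch {epoch}, loss {loss}"

// Strongly typed interpolation, checked against the format specifier:
printfn $"epoch %d{epoch}, loss %.4f{loss}"
```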
-
Cool, thanks!
-
So my example is using this SGD, which doesn't have it: https://github.com/dsyme/DiffSharp/blob/t2/src/DiffSharp.Core/Optim.fs#L229 Any other ideas?
-
Ah, maybe you meant the inner
-
Hmmm, that didn't work; I get this:
I'll take more of a look tomorrow.
-
I got this working to a first approximation. Interesting results and a lot of fun - hard to believe how easy it is (even if "doing it well" may be harder). To get it working I did this:

(It feels like we need some improvement in the model differentiation API here to avoid having to do these things?)

Anyway, for kicks I used four hyperparameters:

```fsharp
let gopt = Adam(generator, lr=hyp.[0], beta1=hyp.[2], reversible=true)
let dopt = Adam(discriminator, lr=hyp.[1], beta1=hyp.[3], reversible=true)
```

and

```fsharp
Optim.optim.sgd (train, x0=dsharp.tensor([0.0003,0.0003,0.5,0.5]), lr=dsharp.tensor(0.0000000005))
```

The very low learning rate on the outer SGD optimization is needed because some of the initial gradients on the learning rates are so high (given this is R^4 --> R, is it possible or wise to use a matrix of outer learning rates?).

Repeated optimization over the first unshuffled 640 images (not a lot, I know) in batches of 64 yields very strong gradients where the system wants the generator to learn much less quickly and the discriminator to learn more quickly. The betas are not changing, e.g.

Because this is all just for the first 640 images perhaps it makes sense - I guess it's deciding that the "sum of losses" objective function is best optimized by initially having the discriminator learn something quickly (rather than the generator trying to match the random initial noise of the discriminator).

I then looked at optimizing the beta parameters independently of the learning rates - nothing useful from this, but interesting to see how easy it is to set up.
-
Awesome progress! This looks really cool and I will definitely check it out in detail. Another quick reply in the meantime: the API here, DiffSharp/src/DiffSharp.Core/DiffSharp.fs, line 1225 in 8e1608c, namely
-
Just to mention I tried the same thing in vae.fsx here https://github.com/DiffSharp/DiffSharp/compare/dev...dsyme:t2?expand=1#diff-5f944dc2993b2616546c10e0c5c00913d434683d822b41944aa5a2128843dc38R156 However I got very large hypergradients for the learning rate, with pretty much immediate divergence to NaN, no matter what I did. Perhaps I'm doing something wrong, or perhaps it's in the nature of the data or the model or the loss function. It feels hard to debug this condition and work out why the gradient is diverging. I think it's because even two steps of training the model compound the effect of the learning rate again and again. However, a "fail on first NaN gradient" condition or similar might be quite useful?
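(Something like the following is what I have in mind for the fail-fast check; it assumes the tensor `hasnan`/`hasinf` helpers are available, and the function name is made up:)

```fsharp
open DiffSharp

// Hypothetical fail-fast guard: stop as soon as a (hyper)gradient goes bad,
// rather than letting NaN propagate silently through training.
let checkFinite (name: string) (t: Tensor) =
    if t.hasnan() || t.hasinf() then
        failwithf "Non-finite values detected in %s" name
    t

// usage inside the outer loop, e.g.:
// let hypergrad = checkFinite "hypergrad" (dsharp.grad train hyp)
```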
-
#287 gives a possible solution for this. After applying the change to Adam there I can do hyper-gradient optimization of the learning rate. For VAE, for example, I was able to optimize the learning rate (the objective function is to minimise the sum of the losses over all batches in the last epoch). The hyper-gradient optimization progressively selects slightly larger learning rates and does reduce the sum of losses in training, but doesn't lead to actual improvement during validation. But there may be other reasons for that.
-
A little more on this. For VAE, using hyper-gradient optimization on the learning rate (optimizing w.r.t. the sum of the losses in the 2nd epoch) gave me a learning rate of 0.00136 instead of the default of 0.001 (and then sudden divergence of the learning rate to negative as it hit some kind of wall). This results in the following loss graph as training proceeds (10-minibatch rolling average).

The original learning rate first reaches a rolling-average loss of 130 after 390 steps; the hyper-optimized one reaches it after 375 steps. Likewise it reaches 120 after 670 steps vs. 526 steps. The scores on the validation set are 109.7948151 for the default learning rate and 109.1915359 for the hyper-parameter-optimized learning rate.

Presumably for this example the default learning rate came from the Python code for a paper which had already performed some kind of parameter sweep. Of course, I had to run the training multiple times to work out the optimized learning rate, and that is effectively a form of hyper-gradient-guided parameter sweep. I suppose methodologically it would be better to divide the data into three sets

But overall I'm impressed with the ease of doing the hyper-parameter optimization.
-
I pushed the code in my experimental branch to "examples/vae-gan-hypopt" as a record of this experiment: https://github.com/DiffSharp/DiffSharp/compare/examples/vae-gan-hypopt?expand=1
-
Including @awf, who I just showed this to. His comments on the VAE example were:
For the GAN example the comments were:
-
Taking into account @awf's feedback, we can optimize the transformed hyper-parameters (the log of the learning rate and the log of 1-beta) instead:

```fsharp
Optim.optim.sgd ((fun hyp -> train(hyp.[0].exp(), 1.0-hyp.[1].exp(), 1.0-hyp.[2].exp(), None, None)),
                 x0=dsharp.tensor([log 0.001; log (1.0-0.9); log (1.0-0.999)]), lr=v 0.0001)
```

This gives a different set of hyper-parameters and some further improvements:

In any case, these blunt techniques work, sort of, and the mixed/nested-mode gradient-taking looks accurate. The techniques still require re-running training multiple times, however, and will always have to compete with massively parallelizable brute-force parameter sweeps.
-
Hi @dsyme, it's a good indication that these are working! I expect to look at the results closely in the coming days. In the meantime, I would like to point you and @awf to the paper I published exactly on this subject at ICLR: https://arxiv.org/abs/1703.04782 In the paper we address many of the comments you've been raising. It took a lot of effort to explain this technique to the reviewers and the community at ICLR, so the paper is the best summary of everything around this subject so far. I hope you and @awf can have a careful look at it (if you have a bit of time, of course!). I think the two most important comments are:

We should implement the online variety as well. There are also lots of research directions that follow from this and some student projects that are currently ongoing on my side. As a note: the paper initially started as a result of a few quick experiments we did with DiffSharp 0.7 at that time. :)
-
Nice one! You should log how many function/gradient evaluations the optimizer took -- less than 50 is pretty good, more than 500 is too close to grid search. To significantly reduce function evaluations on a 3-parameter optimization problem, I would expect a damped [Gauss-]Newton optimizer (e.g. damped Newton or Levenberg-Marquardt) to do much better than SGD, maybe 10x better in terms of function evaluations. Normally G-N is impractical because it needs good analytic derivatives... @gbaydin is there an efficient way to get the nx3 Jacobian or the 3x3 Hessian for the outer problem?
-
@gbaydin Thank you, yes, the linked paper is great. I'll go back and read it carefully now that I've played with this a bit. Quick question: are the beta1 and beta2 of Adam also tuned incrementally in that paper? It looks like you're computing the analytical incrementalization w.r.t. the learning rate? I was quite surprised that tuning those seemed to make more of a difference than the learning rate in this example.
-
Thanks!
Well, reverse-on-reverse should do it, but it blows up the memory on my machine for this example. There may be ways around that, or there may be a mistake in my code.
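(For the record, the shape of what I mean; `trainOuter` is a stand-in for the real R^3 -> R training objective, and I'm assuming the `dsharp.grad`/`dsharp.hessian` entry points in DiffSharp.fs behave as their names suggest:)

```fsharp
open DiffSharp

// Stand-in for the real outer objective: 3 hyperparameters in, scalar loss out.
let trainOuter (hyp: Tensor) : Tensor = (hyp * hyp).sum()   // placeholder

let hyp0 = dsharp.tensor [0.001; 0.9; 0.999]

// 3-vector of hypergradients (a single reverse pass through training).
let g = dsharp.grad trainOuter hyp0

// 3x3 Hessian of the outer problem; this nests differentiation through the
// whole training run, which is where the memory blow-up comes from.
let h = dsharp.hessian trainOuter hyp0
```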
-
@dsyme We had to focus that paper on the learning rate only. I think it's a great thing to extend this to other things including Adam's betas; other hyperparameters could be momentum rates, regularization coefficients, etc. Another one is to learn per-parameter learning rates (instead of a single scalar learning rate), which is indeed possible but was excluded from that paper.

@awf This actually sounds great, and looking at things other than SGD for the outer problem would be great! Here's a quick list of things we have available right now in this code base:
-
I was wondering about this. I also wondered about different learning rates for different layers/parts of models (so not necessarily one per parameter but one per layer - different parts of the brain with different levels of activity).
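(Just to sketch what I mean; the shapes and names are illustrative, not the optimizer API:)

```fsharp
open DiffSharp

// Stand-in parameter tensor and gradient for one layer.
let w = dsharp.randn([128; 64])
let g = dsharp.randn([128; 64])

// One learning rate per parameter: same shape as w, applied elementwise.
let lrPerParam = dsharp.full([128; 64], 0.001)
let wPerParam = w - lrPerParam * g

// One learning rate per layer: a single scalar per layer, broadcast over it.
let lrLayer = dsharp.tensor 0.0005
let wPerLayer = w - lrLayer * g
```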
-
@gbaydin I implemented the technique from the paper for Adam. I might have made a mistake though. The orange and blue lines indicate the adaptive learning rate (starting at 0.001 and 0.0001 respectively); the green line is the fixed 0.001. The orange and blue look great at the start - they increase to learn very fast - but then the learning rate goes very low and the end score is 116 instead of 110. Have you seen this kind of problem before with adaptive learning rates? This is two epochs of VAE learning over MNIST. I'll check the formulae. The Adam/SGD code for the hyperparameter adaptation is in the branch at the top of this issue.
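(For reference, what I understand the basic additive rule from the paper to be, written out for plain SGD rather than the Adam variant I'm actually using; if my Adam port deviates from this, that would explain things. The names here are illustrative:)

```fsharp
open DiffSharp

// Plain-SGD hypergradient descent on the learning rate:
// lr_{t+1} = lr_t + beta * (g_t . g_{t-1}), since d loss / d lr = - g_t . g_{t-1}.
// gradFn returns the minibatch gradient at the current parameters.
let sgdHD (gradFn: Tensor -> Tensor) (w0: Tensor) (lr0: float) (beta: float) (steps: int) =
    let mutable w = w0
    let mutable lr = dsharp.tensor lr0
    let mutable gPrev : Tensor option = None
    for _ in 1 .. steps do
        let g = gradFn w
        match gPrev with
        | Some gp -> lr <- lr + beta * (g * gp).sum()   // hypergradient step on lr
        | None -> ()
        w <- w - lr * g
        gPrev <- Some g
    w, lr
```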
-
Other things I noticed:

Just to mention, I added the multiplicative update rule from the paper and ran a longer training run, and I still see the effect where the fixed learning rate ultimately does better:
-
@dsyme I wouldn't care too much about a small difference between these two curves beyond 2k iterations. I think in this example the choice of 0.001 was already very close to optimal. I would, for example, look at the adaptive/non-adaptive behavior for other learning rates such as 0.1, 0.01, 0.0001, 0.00001 etc. and see if the adaptive versions do a good enough job of bringing the learning rate towards the optimal value and the loss towards the optimal. If they do, it's a winning result for some contexts because it shows you can adapt the learning rate on the fly to do much better than the non-adaptive algorithm, without any need for costly grid search.

Depending on the setting, minute differences are not always important, as there is never a magic method that is guaranteed to give you "lower than everything else everywhere". Optimization is a huge field and this thing you're expecting is kind of a holy grail. If you're in a setting where you really want to make sure you have the absolute best loss and minute differences are important, you would definitely need the costly grid search, Bayesian optimization or other stuff. But this adaptive algorithm can even help with that, because it can give you an indication of where to start your costly searches from, so you wouldn't need to cover a huge costly grid.

A general note: these two curves are just two individual realizations of running these two algorithms. I would definitely be interested in looking at this too: for each curve (orange and blue) I would run exactly the same experiment a handful of times (like ten times) with different random number seeds, and I would plot the mean and standard deviation of these curves to get a better picture of the expected behaviors. Looking at individual single realizations in stochastic optimization results can be misleading. (I actually want to add the shaded region plotting thing to the diffsharp pyplot wrapper for this purpose :) )
-
@gbaydin Yes, it's possible, though the effect is quite strong - using the dynamic learning rate gives noticeably worse results in this instance, pushing the learning rate down to 0.00003 or so towards the end and giving the system a strong disinclination to learn anything at all. Strange, since the higher learning rates do push the loss down. It's not explicable to me yet; I feel there may be some error in the implementation or methodology I'm using here. The theory is sound of course.

Well, in this case it wouldn't, as the adaptive algorithm is pushing the learning rate so low that any manual searches would be in the wrong zone and miss the (current) lowest-loss solution of a constant 0.001 learning rate. But again, there may simply be some error in implementation or method here. Or it may indeed be an artifact of the implementation.
-
One question about the methodology for hyper-parameter optimization. In the examples above I was running minibatch N, computing the hyper-gradient using the incremental method from the paper, then adjusting the learning rate for use with minibatch N+1. Is this actually valid? Is the hyper-gradient of the learning rate going to be correct when used with highly varying data inputs from minibatch to minibatch? @gbaydin in your paper was this the method used, or was it from epoch to epoch with the same input data at each step? Thanks.
-
@gbaydin This is a question about something I thought might "just work". Don't spend time on it, I'm just curious if there's something obvious I'm doing wrong.
I made a tentative attempt at hyper-parameter optimization for the learning rate of the first 20 minibatches of the GAN sample
https://github.com/DiffSharp/DiffSharp/compare/examples/vae-gan-hypopt?expand=1
gan.fsx: https://github.com/dsyme/DiffSharp/blob/ef0bcd04575a67636b5557dcc953c4ab8e287598/examples/gan.fsx
The aim is simply to work out what the optimal training learning rate would be if we're only going to run training on precisely those same 20 minibatches. However, my attempt doesn't work because the derivative of my `train` function is always zero according to the optimizer; e.g. my printfn addition to the SGD optimizer gives this:

Here `f` is the sum of the generator's losses for the first 20 minibatches (the result of `train`) and `g` is the gradient of the `train` function as reported to SGD. At first I thought this might have been due to `noDiff` erasing all derivatives. However, switching to just a `stripDiff` that just gets the primal didn't change things.

Anyway, if you can spot anything simple I'm doing wrong on a quick glance it would be instructive for me.