1 file changed +4 -4 lines changed
@@ -316,10 +316,10 @@ def step(self, closure=None):
316316 \end{aligned}
317317
318318 In addition, there is a deviation in our implementation when compared to the standard procedure.
319- Noam Shazeer and Mitchell Stern use the sum of the square of the gradient. In our implementation ,
320- we use the mean of the square of the gradient and are careful to account for
321- the normalization factor . This allows for greater numerical stability for largs sums where we have limited
322- numerical ranges.
319+ Noam Shazeer and Mitchell Stern describe using the sum of squared gradients;
320+ this implementation uses the mean instead. The two are mathematically equivalent
321+ because the normalization factors are adjusted accordingly, and the mean allows for
322+ greater numerical stability for large sums where we have limited numerical ranges.
323323
324324 .. _Adafactor\: Adaptive Learning Rates with Sublinear Memory Cost:
325325 https://arxiv.org/pdf/1804.04235
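The equivalence claimed in the changed lines can be checked directly. Below is a minimal NumPy sketch (an illustration, not the actual optimizer code) of Adafactor's factored second-moment estimate: reconstructing the rank-1 estimate from row/column sums versus row/column means gives identical results, because the extra 1/n factors cancel against the adjusted normalization term.

```python
import numpy as np

# A toy gradient matrix standing in for a weight's gradient.
g = np.random.default_rng(0).normal(size=(4, 6))
sq = g * g

# Sum formulation (as in the Shazeer & Stern paper): row and column
# statistics are sums of squared gradients, normalized by the total sum.
row_sum = sq.sum(axis=1)                      # shape (4,)
col_sum = sq.sum(axis=0)                      # shape (6,)
v_sum = np.outer(row_sum, col_sum) / sq.sum()

# Mean formulation: divide each statistic by its element count, and use the
# mean (not the sum) as the normalization factor so the 1/n terms cancel.
row_mean = sq.mean(axis=1)
col_mean = sq.mean(axis=0)
v_mean = np.outer(row_mean, col_mean) / sq.mean()

# The reconstructed second-moment estimates match exactly (up to float error),
# but the mean-based intermediates stay in a much smaller numeric range.
assert np.allclose(v_sum, v_mean)
```

Keeping the running statistics as means rather than sums matters in practice for low-precision dtypes, where a sum over many elements can overflow the representable range even though each squared gradient is small.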