updated doc to reviewer comment · pytorch/pytorch@61dd92e · GitHub


Commit 61dd92e

updated doc to reviewer comment
1 parent 519c017 commit 61dd92e

File tree: 1 file changed, +4 −4 lines changed


torch/optim/_adafactor.py

Lines changed: 4 additions & 4 deletions
@@ -316,10 +316,10 @@ def step(self, closure=None):
     \end{aligned}

 In addition, there is an additional deviation in our implementation when compared to the standard procedure.
-Noam Shazeer and Mitchell Stern use the sum of the square of the gradient. In our implementation,
-we use the mean of the square of the gradient and are careful to account for
-the normalization factor. This allows for greater numerical stability for largs sums where we have limited
-numerical ranges.
+Noam Shazeer and Mitchell Stern describe using the sum of squared gradients;
+this implementation uses the mean instead. This choice is mathematically equivalent because
+the normalization factors are adjusted accordingly, and it allows for greater numerical stability
+for large sums where we have limited numerical ranges.

 .. _Adafactor\: Adaptive Learning Rates with Sublinear Memory Cost:
     https://arxiv.org/pdf/1804.04235
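
To make the equivalence stated in the new docstring text concrete, here is a minimal sketch. It is illustrative only, not the code in torch/optim/_adafactor.py; the names grad, row_mean, col_mean and the tensor shapes are made up for this example. It checks that factored second-moment statistics built from means of squared gradients reconstruct the same estimate as the sum-based statistics in Shazeer & Stern once the normalization factor is adjusted, while the tracked values themselves are smaller by the row/column counts.

import torch

torch.manual_seed(0)
grad = torch.randn(1024, 4096)                 # a hypothetical 2D gradient (n x m)
grad_sq = grad * grad

# Sum-based factored statistics, as described in the paper:
row_sum = grad_sq.sum(dim=1)                   # shape (n,)
col_sum = grad_sq.sum(dim=0)                   # shape (m,)
v_sum = torch.outer(row_sum, col_sum) / row_sum.sum()

# Mean-based factored statistics, with the normalization factor adjusted accordingly:
row_mean = grad_sq.mean(dim=1)                 # row_sum / m
col_mean = grad_sq.mean(dim=0)                 # col_sum / n
v_mean = torch.outer(row_mean, col_mean) / row_mean.mean()

# Same reconstructed second moment, but the mean statistics are m (resp. n) times
# smaller, which leaves more headroom in dtypes with limited numerical range.
print(torch.allclose(v_sum, v_mean, rtol=1e-4))    # True
print((row_sum.max() / row_mean.max()).item())     # 4096.0, i.e. the column count m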

0 commit comments
