updated doc to reviewer comment · pytorch/pytorch@61dd92e · GitHub


Commit 61dd92e

updated doc to reviewer comment
1 parent 519c017 commit 61dd92e

File tree: 1 file changed, +4 −4 lines changed


torch/optim/_adafactor.py

Lines changed: 4 additions & 4 deletions
@@ -316,10 +316,10 @@ def step(self, closure=None):
     \end{aligned}

 In addition, there is an additional deviation in our implementation when compared to the standard procedure.
-Noam Shazeer and Mitchell Stern use the sum of the square of the gradient. In our implementation,
-we use the mean of the square of the gradient and are careful to account for
-the normalization factor. This allows for greater numerical stability for largs sums where we have limited
-numerical ranges.
+Noam Shazeer and Mitchell Stern describe using the sum of squared gradients;
+this implementation uses the mean instead. This choice is mathematically equivalent because
+the normalization factors are adjusted accordingly, and it allows for greater numerical stability
+for large sums where we have limited numerical ranges.

 .. _Adafactor\: Adaptive Learning Rates with Sublinear Memory Cost:
     https://arxiv.org/pdf/1804.04235
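
To make the equivalence stated in the new docstring text concrete, here is a minimal sketch. It is illustrative only, not the code in torch/optim/_adafactor.py; the names grad, row_mean, col_mean and the tensor shapes are made up for this example. It checks that factored second-moment statistics built from means of squared gradients reconstruct the same estimate as the sum-based statistics in Shazeer & Stern once the normalization factor is adjusted, while the tracked values themselves are smaller by the row/column counts.

import torch

torch.manual_seed(0)
grad = torch.randn(1024, 4096)                 # a hypothetical 2D gradient (n x m)
grad_sq = grad * grad

# Sum-based factored statistics, as described in the paper:
row_sum = grad_sq.sum(dim=1)                   # shape (n,)
col_sum = grad_sq.sum(dim=0)                   # shape (m,)
v_sum = torch.outer(row_sum, col_sum) / row_sum.sum()

# Mean-based factored statistics, with the normalization factor adjusted accordingly:
row_mean = grad_sq.mean(dim=1)                 # row_sum / m
col_mean = grad_sq.mean(dim=0)                 # col_sum / n
v_mean = torch.outer(row_mean, col_mean) / row_mean.mean()

# Same reconstructed second moment, but the mean statistics are m (resp. n) times
# smaller, which leaves more headroom in dtypes with limited numerical range.
print(torch.allclose(v_sum, v_mean, rtol=1e-4))    # True
print((row_sum.max() / row_mean.max()).item())     # 4096.0, i.e. the column count m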

0 commit comments
