force computation in opmath_t for CUDA fused optimizers #154069

Open

MeetThePatel wants to merge 4 commits into main

Conversation

MeetThePatel
Contributor

Fixes #153649

Benchmarks before and after (on an RTX 5070):

| Optimizer | Branch | Mean time per run (µs) | Median (µs) |
| --- | --- | --- | --- |
| Adam | main | 11066.4259 | 11065.9903 |
| Adam | forced opmath_t | 8368.6145 | 8367.6631 |
| SGD | main | 3679.6347 | 3678.9713 |
| SGD | forced opmath_t | 3518.2261 | 3517.4846 |
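For context, the gist of the change, as a rough sketch: keep all per-element arithmetic in `opmath_t` and only narrow back to `scalar_t` on the store, so mixed expressions no longer promote to double. Names like `r_args`, `kGradIdx`, `kExpAvgIdx`, and `ii` are illustrative placeholders here, not necessarily the exact identifiers in this PR.

```cpp
// Sketch only: the general pattern of computing in opmath_t.
using opmath_t = at::opmath_type<scalar_t>;

// Load in scalar_t, widen once to opmath_t.
const opmath_t grad = static_cast<opmath_t>(r_args[kGradIdx][ii]);
opmath_t exp_avg    = static_cast<opmath_t>(r_args[kExpAvgIdx][ii]);

// Every operand is opmath_t, so nothing promotes the expression to double
// (a bare 1.0 literal would have made the whole product double-precision).
exp_avg = beta1 * exp_avg + (opmath_t(1) - beta1) * grad;

// Narrow back to scalar_t only when writing the result out.
r_args[kExpAvgIdx][ii] = static_cast<scalar_t>(exp_avg);
```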

pytorch-bot bot commented May 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154069

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit 2f5411f with merge base 59c5fff:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: cuda release notes category label May 21, 2025
@Skylion007 Skylion007 requested review from jansel and janeyx99 and removed request for jansel May 21, 2025 22:52
@albanD albanD added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label May 22, 2025
Contributor

@janeyx99 left a comment

Thanks, left some comments

@@ -10,49 +10,51 @@ namespace at::native {

namespace {

template <typename scalar_t, int depth>
constexpr uint8_t kParamIdx = 0;
Contributor

Is there any perf impact of this or was renaming mostly for code understandability?

Contributor Author

This is just for readability (and consistency with the other fused optimizers). I thought indexing into seemingly arbitrary positions of args/r_args might be a bit hard to understand at first.
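For illustration, this is the kind of named-index pattern being discussed; only `kParamIdx = 0` appears in the hunk above, so the other names here are illustrative rather than taken from the PR:

```cpp
// Named slots into the per-tensor args / r_args arrays, instead of hard-coded
// 0/1/2/3 scattered through the kernel body.
constexpr uint8_t kParamIdx = 0;
constexpr uint8_t kGradIdx = 1;      // illustrative
constexpr uint8_t kExpAvgIdx = 2;    // illustrative
constexpr uint8_t kExpAvgSqIdx = 3;  // illustrative

// e.g. r_args[kGradIdx][ii] instead of r_args[1][ii]
```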

tl.state_steps_addresses[tensor_loc]));

const opmath_t bias_correction1 =
1 - at::native::pow_(static_cast<opmath_t>(beta1), step_count);
Contributor

Are these inner static_casts still necessary when they've already been cast on lines 170/171? I know static_cast doesn't have runtime ramifications, but I'm just curious whether we could remove some redundant code here.

Contributor Author

You're correct, I must've missed them in my final sweep of the file. I was going a bit out of order in the edits.
I'll remove them.

Contributor Author

I switched the code to have a section at the top of the function blocks to do all the casting.
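Roughly the shape of that change, as a sketch; variable names other than `beta1` and `step_count` are assumptions for illustration, not taken from the PR:

```cpp
// Cast every hyperparameter to opmath_t once, at the top of the function
// block, then use the widened copies below -- no per-use static_casts.
const auto beta1_op = static_cast<opmath_t>(beta1);
const auto beta2_op = static_cast<opmath_t>(beta2);  // illustrative

const opmath_t bias_correction1 = 1 - at::native::pow_(beta1_op, step_count);
const opmath_t bias_correction2 = 1 - at::native::pow_(beta2_op, step_count);
```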

@MeetThePatel
Contributor Author
Mismatched elements: 31 / 64 (48.4%)
Greatest absolute difference: 0.00017547607421875 at index (7, 0) (up to 3e-05 allowed)
Greatest relative difference: 1.2930149750900455e-05 at index (1, 7) (up to 1.3e-06 allowed)

To execute this test, run the following from the base repo dir:
    python test/test_cuda.py TestCudaOptimsCUDA.test_grad_scaling_autocast_fused_optimizers_AdamW_cuda_float32

I'm not as familiar with the AMP codebase, but are there any methods to mitigate the loss of precision from moving to opmath_t for the optimizer computations?

@janeyx99
Contributor
janeyx99 commented Jun 2, 2025

@MeetThePatel I haven't looked into the exact failing test cases, but we have a few approaches to make sure it's okay:

  1. use TensorTracker to realign the numbers after every step https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_optimizers.py#L2220 (there are some examples in test/test_optim.py)
  2. go through the exact ops that happen and mathematically compound the error ranges to verify whether the end result makes sense.

Option 1 is easier to do/verify, so I would try to incorporate TensorTracker if it isn't used already.

Labels
open source · release notes: cuda · triaged

Projects
None yet

Development

Successfully merging this pull request may close these issues.

Use opmath_t and not double compute in fused optimizers

4 participants