Use std::fma for CUDA Adam kernel's lerps. #153097
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153097
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 2b31134 with merge base 11c64b7:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Would you have before/after perf numbers?
I am not sure this brings much of a perf benefit here, to be honest; the real win is that it should be more accurate: https://en.cppreference.com/w/cpp/numeric/math/fma. My question is whether ROCm/CUDA actually maps this to the proper __fma_rn / __fmaf_rn intrinsics. I think so, but I don't see any documentation guaranteeing it doesn't fall back to CPU-like emulation.
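As an aside, the accuracy argument can be made concrete with a tiny host-side example (my own illustration, not part of the PR): a plain multiply-add rounds twice, while fma rounds once.

```cpp
#include <cmath>
#include <cstdio>

int main() {
  // With x = 1 + 2^-30, the exact square is 1 + 2^-29 + 2^-60. The plain
  // multiply rounds the 2^-60 term away before the subtraction, while
  // std::fma computes x * x - 1 with a single rounding and keeps it.
  double x = 1.0 + std::ldexp(1.0, -30);
  double manual = x * x - 1.0;           // two roundings
  double fused  = std::fma(x, x, -1.0);  // one rounding
  std::printf("manual: %.17g\nfma:    %.17g\n", manual, fused);
  return 0;
}
```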
@Skylion007 I'm not sure if this is quite exactly what you're asking, but on Godbolt with NVCC 12.5.1 and the example code:

```cpp
double testing_fma(double a, double b, double c) {
    return std::fma(a, b, c);
}

double testing_manual(double a, double b, double c) {
    return a * b + c;
}
```

the optimized output is:

```asm
testing_fma(double, double, double):
jmp fma
testing_manual(double, double, double):
mulsd %xmm1, %xmm0
addsd %xmm2, %xmm0
ret
```

and the unoptimized output is:

```asm
testing_fma(double, double, double):
pushq %rbp
movq %rsp, %rbp
subq $32, %rsp
movsd %xmm0, -8(%rbp)
movsd %xmm1, -16(%rbp)
movsd %xmm2, -24(%rbp)
movsd -24(%rbp), %xmm1
movsd -16(%rbp), %xmm0
movq -8(%rbp), %rax
movapd %xmm1, %xmm2
movapd %xmm0, %xmm1
movq %rax, %xmm0
call fma
movq %xmm0, %rax
movq %rax, %xmm0
leave
ret
testing_manual(double, double, double):
pushq %rbp
movq %rsp, %rbp
movsd %xmm0, -8(%rbp)
movsd %xmm1, -16(%rbp)
movsd %xmm2, -24(%rbp)
movsd -8(%rbp), %xmm0
mulsd -16(%rbp), %xmm0
addsd -24(%rbp), %xmm0
movq %xmm0, %rax
movq %rax, %xmm0
popq %rbp
ret
```
https://godbolt.org/z/oedrhM1v1 for device code (the above is host code); there it maps to fma.rn.f32.
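For reference, a device-side analogue of the experiment above (my own sketch of roughly what the Godbolt link compiles, not a copy of it):

```cuda
#include <cmath>

// nvcc is expected to emit a single fma.rn.f32 for the std::fma call; the manual
// form may also be contracted into an FMA depending on --fmad (on by default).
__device__ float testing_fma_device(float a, float b, float c) {
  return std::fma(a, b, c);
}

__device__ float testing_manual_device(float a, float b, float c) {
  return a * b + c;
}
```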
This probably won't have any unintended numerics consequences, right?
It shouldn't. Neither the original formula nor the new one recovers the value when grad and moment are equal, but we've been living with that for a long time, so it should be fine.
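For context on the "doesn't recover the value" remark, the usual fma-based lerp looks like the sketch below (an assumption about the pattern being used, not the PR's exact code). Even when v0 == v1, it evaluates v0*(1 - t) and t*v0 with separate roundings, so the result is not guaranteed to equal v0 bit-for-bit.

```cuda
#include <cmath>

// Two-FMA lerp pattern; the names are illustrative.
// lerp(v0, v1, t) = v0 + t * (v1 - v0), rewritten as
//   fma(t, v1, fma(-t, v0, v0)) = t*v1 + (v0 - t*v0),
// which rounds each fma once but can still be off by an ulp when v0 == v1.
template <typename T>
__host__ __device__ T lerp_fma(T v0, T v1, T t) {
  return std::fma(t, v1, std::fma(-t, v0, v0));
}
```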
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed
Reason: 1 job has failed, first few of them are: trunk / win-vs2022-cuda12.6-py3 / build
Details for Dev Infra team: raised by workflow job
Windows error is real
For the three fma's: do you think it makes more sense to static_cast the betas to the opmath type? How much would the loss of precision on the betas matter?
Casting betas to opmath is acceptable
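In code, the suggestion amounts to something like the following (an illustrative sketch with made-up names; "opmath" here is the kernel's computation type). Without the cast, the mixed float/double arguments make std::fma resolve to the double overload, so the update runs in double precision.

```cuda
#include <cmath>

// Sketch only: cast the double-typed beta down to the computation type once,
// so the fused multiply-add stays in float (fma.rn.f32) instead of being
// promoted to double (fma.rn.f64).
__device__ float moment_update(float m, float g, double beta1) {
  const float b = static_cast<float>(beta1);  // "cast betas to opmath"
  return std::fma(b, m, (1.0f - b) * g);      // beta1*m + (1 - beta1)*g, in float
}
```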
Compare 3cf034b to 2b31134
cc @janeyx99: it looks like we are now doing computations in double in the fused optimizer; I don't think we should.
Switch the calculation of lerps in Adam's fused CUDA kernel to use std::fma, as proposed by @crcrpar.
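Putting the thread's points together, the change amounts to something like the sketch below: an fma-based lerp for the moment update, with the betas cast to the computation type. All names (opmath_t, exp_avg, grad, beta1) are placeholders, not the actual kernel code.

```cuda
#include <cmath>

// Illustrative sketch of an Adam first-moment update using std::fma.
template <typename opmath_t>
__device__ opmath_t adam_moment_update(opmath_t exp_avg, opmath_t grad, double beta1) {
  const opmath_t b = static_cast<opmath_t>(beta1);  // keep the math in opmath_t
  // lerp(exp_avg, grad, 1 - beta1) == beta1*exp_avg + (1 - beta1)*grad,
  // written as two fused multiply-adds:
  return std::fma(b, exp_avg, std::fma(-b, grad, grad));
}
```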