Figure 1: Multiply and add code fragment and output for x86 and NVIDIA Fermi GPU.

2.0. At the time this paper is written (Spring 2011) there are no commercially available x86 CPUs which offer hardware FMA. Because of this, the computed result in single precision in SSE would be 0. NVIDIA GPUs with compute capability 2.0 do offer hardware FMAs, so the result of executing this code will be the more accurate one by default. However, both results are correct according to the IEEE 754 standard. The code fragment was compiled without any special intrinsics or compiler options for either platform.

The fused multiply-add helps avoid loss of precision during subtractive cancellation. Subtractive cancellation occurs during the addition of quantities of similar magnitude with opposite signs. In this case many of the leading bits cancel, leaving fewer meaningful bits of precision in the result. The fused multiply-add computes a double-width product during the multiplication. Thus even if subtractive cancellation occurs during the addition there are still enough valid bits remaining in the product to get a precise result with no loss of precision.

Figure 2: The serial method uses a simple loop with separate multiplies and adds to compute the dot product of the vectors. The final result can be represented as ((((a1 × b1) + (a2 × b2)) + (a3 × b3)) + (a4 × b4)).

t = 0
for i from 1 to 4
    t = rn(ai × bi + t)
return t

Figure 3: The FMA method uses a simple loop with fused multiply-adds to compute the dot product of the vectors. The final result can be represented as a4 × b4 + (a3 × b3 + (a2 × b2 + (a1 × b1 + 0))).
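To make the two orderings concrete, the serial and FMA methods of Figures 2 and 3 can be written directly in CUDA C for four-element vectors. This is a minimal sketch of ours, not code from the paper; fmaf requests a fused multiply-add with a single rounding, and on devices of compute capability 2.0 it maps to the hardware FMA instruction.

#include <math.h>

// Serial method (Figure 2): separate multiply and add, so each term is
// rounded twice. Note that nvcc may contract the multiply and add into an
// FMA by default; Section 4.5 shows how to prevent that with intrinsics.
__host__ __device__ float dot_serial(const float a[4], const float b[4])
{
    float t = 0.0f;
    for (int i = 0; i < 4; i++)
        t = t + a[i] * b[i];
    return t;
}

// FMA method (Figure 3): one fused multiply-add per term, one rounding.
__host__ __device__ float dot_fma(const float a[4], const float b[4])
{
    float t = 0.0f;
    for (int i = 0; i < 4; i++)
        t = fmaf(a[i], b[i], t);
    return t;
}

Compiled with nvcc as part of a .cu file, both functions can be called from host or device code. The two loops perform the same multiplications and additions but round intermediate results differently, and either loop is a legitimate sequence of IEEE 754 operations.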
However, results computed by an implementation of the serial algorithm may differ from those computed by an implementation of the other two algorithms.

For example, consider the vectors

a = [1.907607, −.7862027, 1.147311, .9604002]
b = [−.9355000, −.6915108, 1.724470, −.7097529]

whose elements are randomly chosen values between −1 and 2. The accuracy of each algorithm corresponding to these inputs is shown in Figure 5.

The main points to notice from the table are that each algorithm yields a different result, and they are all slightly different from the correct mathematical dot product. In this example the FMA version is the most accurate, and the parallel algorithm is more accurate than the serial algorithm. In our experience these results are typical; fused multiply-add significantly increases the accuracy of results, and parallel tree reductions for summation are usually much more accurate than serial summation.

4. CUDA AND FLOATING POINT

NVIDIA has extended the capabilities of GPUs with each successive hardware generation. Current generations of the NVIDIA architecture, such as Tesla C2xxx, GTX 4xx, and GTX 5xx, support both single and double precision with IEEE 754 precision and include hardware support for fused multiply-add in both single and double precision.

4.3 Compute capability 2.0 and above

Devices with compute capability 2.0 and above support both single and double precision IEEE 754, including fused multiply-add in both single and double precision. Operations such as square root and division will result in the floating point value closest to the correct mathematical result in both single and double precision, by default.

4.4 Rounding modes

The IEEE 754 standard defines four rounding modes: round-to-nearest, round towards positive, round towards negative, and round towards zero. CUDA supports all four modes. By default, operations use round-to-nearest. Compiler intrinsics like the ones listed in the tables below can be used to select other rounding modes for individual operations.

mode  interpretation
rn    round to nearest, ties to even
rz    round towards zero
ru    round towards +∞
rd    round towards −∞

Single precision operations and the corresponding intrinsics with explicit rounding mode:

operation       intrinsic                        description
x + y           __fadd_[rn|rz|ru|rd](x, y)       addition
x * y           __fmul_[rn|rz|ru|rd](x, y)       multiplication
fmaf(x, y, z)   __fmaf_[rn|rz|ru|rd](x, y, z)    FMA
1.0f / x        __frcp_[rn|rz|ru|rd](x)          reciprocal
x / y           __fdiv_[rn|rz|ru|rd](x, y)       division
sqrtf(x)        __fsqrt_[rn|rz|ru|rd](x)         square root

Double precision operations and the corresponding intrinsics with explicit rounding mode:

operation       intrinsic                        description
x + y           __dadd_[rn|rz|ru|rd](x, y)       addition
x * y           __dmul_[rn|rz|ru|rd](x, y)       multiplication
fma(x, y, z)    __fma_[rn|rz|ru|rd](x, y, z)     FMA
1.0 / x         __drcp_[rn|rz|ru|rd](x)          reciprocal
x / y           __ddiv_[rn|rz|ru|rd](x, y)       division
sqrt(x)         __dsqrt_[rn|rz|ru|rd](x)         square root
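As a small illustration of directed rounding (our sketch, not from the paper), the device-only __fadd_rd and __fadd_ru intrinsics can be used to compute a running sum rounded towards −∞ and towards +∞, giving a lower and an upper bound that bracket the exact sum:

__global__ void bracket_sum(const float *x, int n, float *lo, float *hi)
{
    float down = 0.0f;   // running sum rounded towards -infinity
    float up   = 0.0f;   // running sum rounded towards +infinity
    for (int i = 0; i < n; i++) {
        down = __fadd_rd(down, x[i]);
        up   = __fadd_ru(up, x[i]);
    }
    *lo = down;
    *hi = up;
}

Launched with a single thread, for example bracket_sum<<<1, 1>>>(d_x, n, d_lo, d_hi), the kernel returns an interval that contains the exact mathematical sum, something that cannot be expressed with the default round-to-nearest operations alone.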
4.5 Controlling fused multiply-add

In general, the fused multiply-add operation is faster and more accurate than performing separate multiply and add operations. However, on occasion you may wish to disable the merging of multiplies and adds into fused multiply-add instructions. To inhibit this optimization one can write the multiplies and additions using intrinsics with explicit rounding mode as shown in the previous tables. Operations written directly as intrinsics are guaranteed to remain independent and will not be merged into fused multiply-add instructions. With CUDA Fortran it is possible to disable FMA merging via a compiler flag.
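For example (our sketch), a single precision multiply-add can be written in either form; the intrinsic version below will never be contracted into an FMA, while the fmaf version always requests one:

__device__ float mad_separate(float x, float y, float z)
{
    // Explicit round-to-nearest multiply and add: two roundings, never fused.
    return __fadd_rn(__fmul_rn(x, y), z);
}

__device__ float mad_fused(float x, float y, float z)
{
    // Fused multiply-add: a single rounding of x * y + z.
    return fmaf(x, y, z);
}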
and the mathematical result is at most ±2 with respect
4.6 Compiler flags to the least significant bit position of the fraction part
Compiler flags relevant to IEEE 754 operations are of the floating point result.
-ftz={true|false}, -prec-div={true|false}, and For most inputs the sin function produces the cor-
-prec-sqrt={true|false}. These flags control single rectly rounded result. For some inputs the result is off
precision operations on devices of compute capability of by 1 ulp. For a small percentage of inputs the result is
2.0 or later. off by 2 ulp.
Producing different results, even on the same system,
mode flags is not uncommon when using a mix of precisions, li-
IEEE 754 mode -ftz=false braries and hardware. Take for example the C code
(default) -prec-div=true sequence shown in Figure 6. We compiled the code se-
-prec-sqrt=true quence on a 64-bit x86 platform using gcc version 4.4.3
fast mode -ftz=true (Ubuntu 4.3.3-4ubuntu5).
-prec-div=false This shows that the result of computing cos(5992555.0)
-prec-sqrt=false using a common library differs depending on whether
the code is compiled in 32-bit mode or 64-bit mode.
The default “IEEE 754 mode” means that single pre- The consequence is that different math libraries can-
cision operations are correctly rounded and support de- not be expected to compute exactly the same result for a
normals, as per the IEEE 754 standard. In the “fast given input. This applies to GPU programming as well.
mode” denormal numbers are flushed to zero, and the Functions compiled for the GPU will use the NVIDIA
operations division and square root are not computed to CUDA math library implementation while functions com-
the nearest floating point value. The flags have no effect piled for the CPU will use the host compiler math li-
on double precision or on devices of compute capability brary implementation (e.g. glibc on Linux). Because
below 2.0. these implementations are independent and neither is
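As an illustration (ours, not from the paper), the two rows of the table correspond to nvcc invocations along the following lines; the file and program names are placeholders:

nvcc -ftz=false -prec-div=true  -prec-sqrt=true  -o app_ieee kernel.cu
nvcc -ftz=true  -prec-div=false -prec-sqrt=false -o app_fast kernel.cu

The first invocation simply spells out the defaults; the second trades the accuracy guarantees described above for speed.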
4.7 Differences from x86

NVIDIA GPUs differ from the x86 architecture in that rounding modes are encoded within each floating point instruction instead of dynamically using a floating point control word. Trap handlers for floating point exceptions are not supported. On the GPU there is no status flag to indicate when calculations have overflowed, underflowed, or have involved inexact arithmetic. Like SSE, the precision of each GPU operation is encoded in the instruction (for x87 the precision is controlled dynamically by the floating point control word).

5. CONSIDERATIONS FOR A HETEROGENEOUS WORLD

5.1 Mathematical function accuracy

So far we have only considered simple math operations such as addition, multiplication, division, and square root. These operations are simple enough that computing the best floating point result (e.g. the closest in round-to-nearest) is reasonable. For other mathematical operations computing the best floating point result is harder.

The problem is called the table maker's dilemma. To guarantee the correctly rounded result, it is not generally enough to compute the function to a fixed high accuracy. There might still be rare cases where the error in the high accuracy result affects the rounding step at the lower accuracy.

It is possible to solve the dilemma for particular functions by doing mathematical analysis and formal proofs [4], but most math libraries choose instead to give up the guarantee of correct rounding. Instead they provide implementations of math functions and document bounds on the relative error of the functions over the input range. For example, the double precision sin function in CUDA is guaranteed to be accurate to within 2 units in the last place (ulp) of the correctly rounded result. In other words, the difference between the computed result and the mathematical result is at most ±2 with respect to the least significant bit position of the fraction part of the floating point result.

For most inputs the sin function produces the correctly rounded result. For some inputs the result is off by 1 ulp. For a small percentage of inputs the result is off by 2 ulp.

Producing different results, even on the same system, is not uncommon when using a mix of precisions, libraries and hardware. Take for example the C code sequence shown in Figure 6. We compiled the code sequence on a 64-bit x86 platform using gcc version 4.4.3 (Ubuntu 4.3.3-4ubuntu5). This shows that the result of computing cos(5992555.0) using a common library differs depending on whether the code is compiled in 32-bit mode or 64-bit mode.

The consequence is that different math libraries cannot be expected to compute exactly the same result for a given input. This applies to GPU programming as well. Functions compiled for the GPU will use the NVIDIA CUDA math library implementation while functions compiled for the CPU will use the host compiler math library implementation (e.g. glibc on Linux). Because these implementations are independent and neither is guaranteed to be correctly rounded, the results will often differ slightly.

volatile float x = 5992555.0;
printf("cos(%f): %.10g\n", x, cos(x));

gcc test.c -lm -m64
cos(5992555.000000): 3.320904615e-07

gcc test.c -lm -m32
cos(5992555.000000): 3.320904692e-07

Figure 6: The computation of cosine using the glibc math library yields different results when compiled with -m32 and -m64.
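In the same spirit as Figure 6, a short CUDA program (our sketch, not from the paper) can print the host math library result next to the CUDA math library result for the same input; depending on the platform and input, the two values may agree or differ by a few ulps.

#include <math.h>
#include <stdio.h>

__global__ void device_cos(double x, double *result)
{
    *result = cos(x);   // CUDA math library implementation
}

int main(void)
{
    double x = 5992555.0;
    double gpu = 0.0;
    double *d_result;
    cudaMalloc((void **)&d_result, sizeof(double));
    device_cos<<<1, 1>>>(x, d_result);
    cudaMemcpy(&gpu, d_result, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    printf("host   cos(%f): %.17g\n", x, cos(x));   // host math library
    printf("device cos(%f): %.17g\n", x, gpu);
    return 0;
}

The program must be built for a device that supports double precision, for example with nvcc -arch=sm_20.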
5.2 x87 and SSE

One of the unfortunate realities of C compilers is that they are often poor at preserving IEEE 754 semantics of floating point operations [6]. This can be particularly confusing on platforms that support x87 and SSE operations. Just like CUDA operations, SSE operations are performed on single or double precision values, while x87 operations often use an additional internal 80-bit precision format. Sometimes the results of a computation using x87 can depend on whether an intermediate result was allocated to a register or stored to memory. Values stored to memory are rounded to the declared precision (e.g. single precision for float and double precision for double). Values kept in registers can remain in extended precision. Also, x87 instructions will often be used by default for 32-bit compiles but SSE instructions will be used by default for 64-bit compiles.

Because of these issues, guaranteeing a specific precision level on the CPU can sometimes be tricky. When comparing CPU results to results computed on the GPU, it is generally best to compare using SSE instructions. SSE instructions follow IEEE 754 for single and double precision.

On 32-bit x86 targets without SSE it can be helpful to declare variables using volatile and force floating point values to be stored to memory (/Op in Visual Studio and -ffloat-store in gcc). This moves results from extended precision registers into memory, where the precision is precisely single or double precision. Alternately, the x87 control word can be updated to set the precision to 24 or 53 bits using the assembly instruction fldcw or a compiler option such as -mpc32 or -mpc64 in gcc.
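A minimal illustration of the volatile technique (our example, with a made-up function name): forcing an intermediate to memory makes it take on its declared single precision value even when the compiler would otherwise keep it in an 80-bit x87 register.

float scaled_product(float a, float b, float c)
{
    volatile float t = a * b;  // stored to memory, rounded to single precision
    return t / c;              // the division starts from the rounded value
}

Compiling with -ffloat-store in gcc, or building for a target that uses SSE, has a similar effect of keeping intermediates at the declared precision.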
5.3 Core Counts

As we have shown in Section 3, the final values computed using IEEE 754 arithmetic can depend on implementation choices such as whether to use fused multiply-add or whether additions are organized in series or parallel. These differences affect computation on the CPU and on the GPU.

One example of the differences can arise from differences between the number of concurrent threads involved in a computation. On the GPU, a common design pattern is to have all threads in a block coordinate to do a parallel reduction on data within the block, followed by a serial reduction of the results from each block. Changing the number of threads per block reorganizes the reduction; if the reduction is addition, then the change rearranges parentheses in the long string of additions.

Even if the same general strategy such as parallel reduction is used on the CPU and GPU, it is common to have widely different numbers of threads on the GPU compared to the CPU. For example, the GPU implementation might launch blocks with 128 threads per block, while the CPU implementation might use 4 threads in total.
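The pattern described above might look like the following sketch (ours, not code from the paper): a shared-memory tree reduction inside each block, followed by a serial pass over the per-block sums. Changing the threads-per-block value changes how the additions are grouped, and can therefore change the final sum in the last bits.

__global__ void block_sums(const float *x, int n, float *partial)
{
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[tid] = (i < n) ? x[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block (blockDim.x must be a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            s[tid] = s[tid] + s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = s[0];
}

// Serial reduction of the per-block sums, run on the host after copying
// the partial array back from the device.
float combine_partials(const float *partial, int numBlocks)
{
    float t = 0.0f;
    for (int b = 0; b < numBlocks; b++)
        t = t + partial[b];
    return t;
}

A launch such as block_sums<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(float)>>>(d_x, n, d_partial) with 128 threads per block and another with 64 threads per block both compute a correct IEEE 754 sum, but with different parenthesizations, so the two results are not required to match bit for bit.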
5.4 Verifying GPU Results

The same inputs will give the same results for individual IEEE 754 operations to a given precision on the CPU and GPU. As we have explained, there are many reasons why the same sequence of operations may not be performed on the CPU and GPU. The GPU has fused multiply-add while the CPU does not. Parallelizing algorithms may rearrange operations, yielding different numeric results. The CPU may be computing results in a precision higher than expected. Finally, many common mathematical functions are not required by the IEEE 754 standard to be correctly rounded, so they should not be expected to yield identical results between implementations.

When porting numeric code from the CPU to the GPU it of course makes sense to use the x86 CPU results as a reference. But differences between the CPU result and GPU result must be interpreted carefully. Differences are not automatically evidence that the result computed by the GPU is wrong or that there is a problem on the GPU.

Computing results in a high precision and then comparing to results computed in a lower precision can be helpful to see if the lower precision is adequate for a particular application. However, rounding high precision results to a lower precision is not equivalent to performing the entire computation in lower precision. This can sometimes be a problem when using x87 and comparing results against the GPU. The results of the CPU may be computed to an unexpectedly high extended precision for some or all of the operations. The GPU result will be computed using single or double precision only.
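In practice this usually means comparing against a tolerance rather than testing for bit-exact equality. The helper below is our sketch, not a prescription from the paper; the tolerances must be chosen with the algorithm, the precision, and the input range in mind.

#include <math.h>

// Count elements whose reference (CPU) and GPU values differ by more than
// a combined absolute and relative tolerance.
int count_mismatches(const double *cpu, const double *gpu, int n,
                     double abs_tol, double rel_tol)
{
    int mismatches = 0;
    for (int i = 0; i < n; i++) {
        double diff    = fabs(cpu[i] - gpu[i]);
        double allowed = abs_tol + rel_tol * fabs(cpu[i]);
        if (diff > allowed)
            mismatches++;
    }
    return mismatches;
}

A nonzero count under a reasonable tolerance is a signal to investigate the algorithm on both platforms, not automatic evidence that the GPU result is the wrong one.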
6. CONCRETE RECOMMENDATIONS

The key points we have covered are the following.

Use the fused multiply-add operator.
The fused multiply-add operator on the GPU has high performance and increases the accuracy of computations. No special flags or function calls are needed to gain this benefit in CUDA programs. Understand that a hardware fused multiply-add operation is not yet available on the CPU, which can cause differences in numerical results.

Compare results carefully.
Even in the strict world of IEEE 754 operations, minor details such as organization of parentheses or thread counts can affect the final result. Take this into account when doing comparisons between implementations.

Know the capabilities of your GPU.
The numerical capabilities are encoded in the compute capability number of your GPU. Devices of compute capability 2.0 and later are capable of single and double precision arithmetic following the IEEE 754 standard, and have hardware units for performing fused multiply-add in both single and double precision.

Take advantage of the CUDA math library functions.
These functions are documented in Appendix C of the CUDA C Programming Guide [7]. The math library includes all the math functions listed in the C99 standard [3] plus some additional useful functions. These functions have been tuned for a reasonable compromise between performance and accuracy.

We constantly strive to improve the quality of our math library functionality. Please let us know about any functions that you require that we do not provide, or if the accuracy or performance of any of our functions does not meet your needs. Leave comments in the NVIDIA CUDA forum (http://forums.nvidia.com/index.php?showforum=62) or join the Registered Developer Program (http://developer.nvidia.com/join-nvidia-registered-developer-program) and file a bug with your feedback.

7. ACKNOWLEDGEMENTS

Thanks to Ujval Kapasi, Kurt Wall, Paul Sidenblad, Massimiliano Fatica, Everett Phillips, Norbert Juffa, and Will Ramey for their helpful comments and suggestions.

8. REFERENCES

[1] ANSI/IEEE 754-1985. American National Standard — IEEE Standard for Binary Floating-Point Arithmetic. American National Standards Institute, Inc., New York, 1985.
[2] IEEE 754-2008. IEEE 754-2008 Standard for Floating-Point Arithmetic. August 2008.
[3] ISO/IEC 9899:1999(E). Programming languages — C. American National Standards Institute, Inc., New York, 1999.
[4] Catherine Daramy-Loirat, David Defour, Florent de Dinechin, Matthieu Gallet, Nicolas Gast, and Jean-Michel Muller. CR-LIBM: A library of correctly rounded elementary functions in double-precision, February 2005.
[5] David Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, March 1991. Edited reprint available at: http://download.oracle.com/docs/cd/E19957-01/806-3568/ncg_goldberg.html.
[6] David Monniaux. The pitfalls of verifying floating-point computations. ACM Transactions on Programming Languages and Systems, May 2008.
[7] NVIDIA. CUDA C Programming Guide Version 4.0, 2011.