NVIDIA-CUDA-Floating-Point

This document discusses floating point accuracy and compliance issues specific to NVIDIA GPUs, focusing on the IEEE 754 Standard. It covers various aspects of floating point formats, operations, and the importance of the fused multiply-add (FMA) operation for improving computational accuracy. Additionally, it provides examples and recommendations for CUDA programmers to enhance numerical algorithm implementation on GPUs.

Uploaded by

bob song

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

25 views7 pages

NVIDIA-CUDA-Floating-Point

Uploaded by

bob song

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 7

Precision & Performance:

Floating Point and IEEE 754 Compliance for NVIDIA GPUs

Nathan Whitehead Alex Fit-Florea

ABSTRACT 2. FLOATING POINT

A number of issues related to floating point accuracy
and compliance are a frequent source of confusion on 2.1 Formats
both CPUs and GPUs. The purpose of this white pa- Floating point encodings and functionality are defined
per is to discuss the most common issues related to in the IEEE 754 Standard [2] last revised in 2008. Gold-
NVIDIA GPUs and to supplement the documentation berg [5] gives a good introduction to floating point and
in the CUDA C Programming Guide [7]. many of the issues that arise.
The standard mandates binary floating point data be
encoded on three fields: a one bit sign field, followed
1. INTRODUCTION by exponent bits encoding the exponent offset by a nu-
Since the widespread adoption in 1985 of the IEEE meric bias specific to each format, and bits encoding
Standard for Binary Floating-Point Arithmetic (IEEE the significand (or fraction).
754-1985 [1]) virtually all mainstream computing sys- sign exponent fraction
tems have implemented the standard, including NVIDIA
with the CUDA architecture. IEEE 754 standardizes In order to ensure consistent computations across plat-
how arithmetic results should be approximated in float- forms and to exchange floating point data, IEEE 754
ing point. Whenever working with inexact results, pro- defines basic and interchange formats. The 32 and 64
gramming decisions can affect accuracy. It is important bit basic binary floating point formats correspond to the
to consider many aspects of floating point behavior in C data types float and double. Their corresponding
order to achieve the highest performance with the preci- representations have the following bit lengths:
sion required for any specific application. This is espe-
cially true in a heterogeneous computing environment float 1 8 23
where operations will be performed on different types
of hardware. double 1 11 52
Understanding some of the intricacies of floating point
and the specifics of how NVIDIA hardware handles float-
ing point is obviously important to CUDA programmers For numerical data representing finite values, the sign
striving to implement correct numerical algorithms. In is either negative or positive, the exponent field encodes
addition, users of libraries such as CUBLAS and CUFFT the exponent in base 2, and the fraction field encodes
will also find it informative to learn how NVIDIA han- the significand without the most significant non-zero
dles floating point under the hood. bit. For example the value −192 equals (−1)1 ×27 ×1.5,
We review some of the basic properties of floating and can be represented as having a negative sign, an
point calculations in Section 2. We also discuss the exponent of 7, and a fractional part .5. The exponents
fused multiply-add operator, which was added to the are biased by 127 and 1023, respectively, to allow ex-
IEEE 754 standard in 2008 [2] and is built into the ponents to extend from negative to positive. Hence the
hardware of NVIDIA GPUs. In Section 3 we work exponent 7 is represented by bit strings with values 134
through an example of computing the dot product of for float and 1030 for double. The integral part of 1.
two short vectors to illustrate how different choices of is implicit in the fraction.
implementation affect the accuracy of the final result.
In Section 4 we describe NVIDIA hardware versions and float
NVCC compiler options that affect floating point calcu- 1 10000110 .100000000000000000000000
lations. In Section 5 we consider some issues regarding
the comparison of CPU and GPU results. Finally, in double
Section 6 we conclude with concrete recommendations 1 10000000110 .10000000000000000. . . 0000000
to programmers that deal with numeric issues relating
to floating point on the GPU. Also, encodings to represent infinity and not-a-number
(NaN) data are reserved. The IEEE 754 Standard [2]
Permission to make digital or hard copies of all or part of this work describes floating point encodings in full.
for any use is granted without fee provided that copies bear this
notice and the full citation on the first page. Given that the fraction field uses a limited number of
Copyright 2011 NVIDIA. bits, not all real numbers can be represented exactly.
For example the mathematical value of the fraction 2/3 Here, the order in which operations are executed affects
represented in binary is 0.10101010... which has an in- the accuracy of the result. The results are independent
finite number of bits after the binary point. The value of the host system. These same results would be ob-
2/3 must be rounded first in order to be represented tained using any microprocessor, CPU or GPU, which
as a floating point number with limited precision. The supports single precision floating point.
rules for rounding and the rounding modes are spec-
ified in IEEE 754. The most frequently used is the 2.3 The Fused Multiply-Add (FMA)
round-to-nearest-or-even mode (abbreviated as round- In 2008 the IEEE 754 standard was revised to include
to-nearest). The value 2/3 rounded in this mode is rep- the fused multiply-add operation (FMA). The FMA op-
resented in binary as: eration computes rn(X × Y + Z) with only one round-
ing step. Without the FMA operation the result would
have to be computed as rn(rn(X × Y ) + Z) with two
float
rounding steps, one for multiply and one for add. Be-
0 01111110 .01010101010101010101011
cause the FMA uses only a single rounding step the
result is computed more accurately.
double
Let’s consider an example to illustrate how the FMA
0 01111111110 .01010101010101010. . . 1010101
operation works using decimal arithmetic first for clar-
The sign is positive and the stored exponent value ity. Let’s compute x2 − 1 in finite-precision with four
represents an exponent of −1. digits of precision after the decimal point, or a total of
five digits of precision including the leading digit before
2.2 Operations and Accuracy the decimal point.
The IEEE 754 standard requires support for a hand- For x = 1.0008, the correct mathematical result is
ful of operations. These include the arithmetic opera- x2 − 1 = 1.60064 × 10−4 . The closest number using only
tions add, subtract, multiply, divide, square root, fused- four digits after the decimal point is 1.6006 × 10−4 . In
multiply-add, remainder, conversion operations, scal- this case rn(x2 − 1) = 1.6006 × 10−4 which corresponds
ing, sign operations, and comparisons. The results of to the fused multiply-add operation rn(x × x + (−1)).
these operations are guaranteed to be the same for all The alternative is to compute separate multiply and add
implementations of the standard, for a given format and steps. For the multiply, x2 = 1.00160064, so rn(x2 ) =
rounding mode. 1.0016. The final result is rn(rn(x2 ) − 1) = 1.6000 ×
The rules and properties of mathematical arithmetic 10−4 .
do not hold directly for floating point arithmetic be- Rounding the multiply and add separately yields a re-
cause of floating point’s limited precision. For example, sult that is wrong by 0.00064. The corresponding FMA
the table below shows single precision values A, B, and computation is wrong by only 0.00004, and its result is
C, and the mathematical exact value of their sum com- closest to the correct mathematical answer. The results
puted using different associativity. are summarized below:
x = 1.0008
A = 21 × 1.00000000000000000000001
x2 = 1.00160064
B = 20 × 1.00000000000000000000001
x2 − 1 = 1.60064 × 10−4 true value
C = 23 × 1.00000000000000000000001
rn(x2 − 1) = 1.6006 × 10−4 fused multiply-add
(A + B) + C = 23 × 1.01100000000000000000001011
rn(x2 ) = 1.0016 × 10−4
A + (B + C) = 23 × 1.01100000000000000000001011
rn(rn(x2 ) − 1) = 1.6000 × 10−4 multiply, then add
Mathematically, (A + B) + C does equal A + (B + C).
Below is another example, using binary single preci-
Let rn(x) denote one rounding step on x. Perform-
sion values:
ing these same computations in single precision floating
point arithmetic in round-to-nearest mode according to A = 20 × 1.00000000000000000000001
IEEE 754, we obtain: B = −20 × 1.00000000000000000000010
rn(A × A + B) = 2−46 × 1.00000000000000000000000
A+B = 21 × 1.1000000000000000000000110000 . . . rn(rn(A × A) + B) = 0
rn(A + B) = 21 × 1.10000000000000000000010
B+C = 23 × 1.0010000000000000000000100100 . . . In this particular case, computing rn(rn(A × A) + B)
rn(B + C) = 23 × 1.00100000000000000000001 as an IEEE 754 multiply followed by an IEEE 754 add
A+B+C = 23 × 1.0110000000000000000000101100 . . .
rn(rn(A + B) + C) = 23 × 1.01100000000000000000010
loses all bits of precision, and the computed result is 0.
rn(A + rn(B + C)) = 23 × 1.01100000000000000000001 The alternative of computing the FMA rn(A × A + B)
provides a result equal to the mathematical value. In
For reference, the exact, mathematical results are com- general, the fused-multiply-add operation generates more
puted as well in the table above. Not only are the re- accurate results than computing one multiply followed
sults computed according to IEEE 754 different from by one add. The choice of whether or not to use the
the exact mathematical results, but also the results cor- fused operation depends on whether the platform pro-
responding to the sum rn(rn(A + B) + C) and the sum vides the operation and also on how the code is com-
rn(A + rn(B + C)) are different from each other. In this piled.
case, rn(A + rn(B + C)) is closer to the correct mathe- Figure 1 shows CUDA C code and output correspond-
matical result than rn(rn(A + B) + C). ing to inputs A and B and operations from the example
This example highlights that seemingly identical com- above. The code is executed on two different hardware
putations can produce different results even if all basic platforms: an x86-class CPU using SSE in single pre-
operations are computed in compliance with IEEE 754. cision, and an NVIDIA GPU with compute capability
union {
float f;
unsigned int i; x86-64 output:
} a, b;
float r; a: 1.0000001
b: -1.0000002
a.i = 0x3F800001; r: 0
b.i = 0xBF800002;
r = a.f * a.f + b.f; t=0
NVIDIA Fermi output: for i from 1 to 4
printf("a %.8g\n", a.f); a: 1.0000001 p = rn(ai × bi )
printf("b %.8g\n", b.f); b: -1.0000002 t = rn(t + p)
printf("r %.8g\n", r); r: 1.4210855e-14 return t

Figure 1: Multiply and add code fragment and Figure 2: The serial method uses a simple loop
output for x86 and NVIDIA Fermi GPU with separate multiplies and adds to compute
the dot product of the vectors. The final result
can be represented as ((((a1 × b1 ) + (a2 × b2 )) + (a3 ×
2.0. At the time this paper is written (Spring 2011) b3 )) + (a4 × b4 )).
there are no commercially available x86 CPUs which
offer hardware FMA. Because of this, the computed re-
sult in single precision in SSE would be 0. NVIDIA
GPUs with compute capability 2.0 do offer hardware
FMAs, so the result of executing this code will be the
more accurate one by default. However, both results are
correct according to the IEEE 754 standard. The code
fragment was compiled without any special intrinsics or t=0
compiler options for either platform. for i from 1 to 4
The fused multiply-add helps avoid loss of precision t = rn(ai × bi + t)
during subtractive cancellation. Subtractive cancella- return t
tion occurs during the addition of quantities of similar
magnitude with opposite signs. In this case many of the
leading bits cancel, leaving fewer meaningful bits of pre-
cision in the result. The fused multiply-add computes a Figure 3: The FMA method uses a simple loop
double-width product during the multiplication. Thus with fused multiply-adds to compute the dot
even if subtractive cancellation occurs during the addi- product of the vectors. The final result can be
tion there are still enough valid bits remaining in the represented as a4 × b4 + (a3 × b3 + (a2 × b2 + (a1 ×
product to get a precise result with no loss of precision. b1 + 0))).

3. DOT PRODUCT: AN ACCURACY EX-

and additions are done separately.
AMPLE A simple improvement to the algorithm is to use the
Consider the problem of finding the dot product of fused multiply-add to do the multiply and addition in
two short vectors ~a and ~b both with four elements. one step to improve accuracy. Figure 3 shows this ver-
sion.
a1
 
b1
  Yet another way to compute the dot product is to use
a2  a divide-and-conquer strategy in which we first find the
~a =   ~b = b2 

~a · ~b = a1 b1 + a2 b2 + a3 b3 + a4 b4 dot products of the first half and the second half of the
a3 b3 
a4 b4 vectors, then combine these results using addition. This
is a recursive strategy; the base case is the dot product
This operation is easy to write mathematically, but of vectors of length 1 which is a single multiply. Fig-
its implementation in software involves several choices. ure 4 graphically illustrates this approach. We call this
All of the strategies we will discuss use purely IEEE 754 algorithm the parallel algorithm because the two sub-
compliant operations. problems can be computed in parallel as they have no
dependencies. The algorithm does not require a paral-
3.1 Example Algorithms lel implementation, however; it can still be implemented
We present three algorithms which differ in how the with a single thread.
multiplications, additions, and possibly fused multiply-
adds are organized. These algorithms are presented in 3.2 Comparison
Figures 2, 3, and 4. Each of the three algorithms is rep- All three algorithms for computing a dot product use
resented graphically. Individual operation are shown as IEEE 754 arithmetic and can be implemented on any
a circle with arrows pointing from arguments to opera- system that supports the IEEE standard. In fact, an
tions. implementation of the serial algorithm on multiple sys-
The simplest way to compute the dot product is using tems will give exactly the same result. So will imple-
a short loop as shown in Figure 2. The multiplications mentations of the FMA or parallel algorithms. How-
p1 = rn(a1 × b1 ) double precision. Older NVIDIA architectures support
p2 = rn(a2 × b2 ) some of these features but not others. In CUDA, the
p3 = rn(a3 × b3 ) features supported by the GPU are encoded in the com-
p4 = rn(a4 × b4 ) pute capability number. The runtime library supports
slef t = rn(p1 + p2 ) a function call to determine the compute capability of
sright = rn(p3 + p4 ) a GPU at runtime; the CUDA C Programming Guide
t = rn(slef t + sright ) also includes a table of compute capabilities for many
return t different devices [7].

4.1 Compute capability 1.2 and below

Devices with compute capability 1.2 and below sup-
port single precision only. In addition, not all opera-
tions in single precision on these GPUs are IEEE 754
accurate. Denormal numbers (small numbers close to
zero) are flushed to zero. Operations such as square
root and division may not always result in the floating
Figure 4: The parallel method uses a tree to re- point value closest to the correct mathematical value.
duce all the products of individual elements into
a final sum. The final result can be represented
as ((a1 × b1 ) + (a2 × b2 )) + ((a3 × b3 ) + (a4 × b4 )). 4.2 Compute capability 1.3
Devices with compute capability 1.3 support both
single and double precision floating point computation.
method result float value Double precision operations are always IEEE 754 accu-
exact .0559587528435... 0x3D65350158... rate. Single precision in devices of compute capability
serial .0559588074 0x3D653510 1.3 is unchanged from previous compute capabilities.
FMA .0559587515 0x3D653501 In addition, the double precision hardware offers fused
parallel .0559587478 0x3D653500 multiply-add. As described in Section 2.3, the fused
multiply-add operation is faster and more accurate than
separate multiplies and additions. There is no single
Figure 5: All three algorithms yield results precision fused multiply-add operation in compute ca-
slightly different from the correct mathematical pability 1.3.
dot product.

ever, results computed by an implementation of the se- 4.3 Compute capability 2.0 and above
rial algorithm may differ from those computed by an Devices with compute capability 2.0 and above sup-
implementation of the other two algorithms. port both single and double precision IEEE 754 includ-
For example, consider the vectors ing fused multiply-add in both single and double preci-
sion. Operations such as square root and division will
~a = [1.907607, −.7862027, 1.147311, .9604002] result in the floating point value closest to the correct
~b = [−.9355000, −.6915108, 1.724470, −.7097529] mathematical result in both single and double precision,
by default.
whose elements are randomly chosen values between −1
and 2. The accuracy of each algorithm corresponding
to these inputs is shown in Figure 5. 4.4 Rounding modes
The main points to notice from the table are that The IEEE 754 standard defines four rounding modes:
each algorithm yields a different result, and they are round-to-nearest, round towards positive, round towards
all slightly different from the correct mathematical dot negative, and round towards zero. CUDA supports
product. In this example the FMA version is the most all four modes. By default, operations use round-to-
accurate, and the parallel algorithm is more accurate nearest. Compiler intrinsics like the ones listed in the
than the serial algorithm. In our experience these re- tables below can be used to select other rounding modes
sults are typical; fused multiply-add significantly in- for individual operations.
creases the accuracy of results, and parallel tree reduc-
tions for summation are usually much more accurate
than serial summation.
mode interpretation
4. CUDA AND FLOATING POINT rn round to nearest, ties to even
NVIDIA has extended the capabilities of GPUs with rz round towards zero
each successive hardware generation. Current genera- ru round towards +∞
tions of the NVIDIA architecture such as Tesla C2xxx, rd round towards -∞
GTX 4xx, and GTX 5xx, support both single and dou-
ble precision with IEEE 754 precision and include hard-
ware support for fused multiply-add in both single and
x + y addition 4.7 Differences from x86
__fadd_[rn|rz|ru|rd](x, y) NVIDIA GPUs differ from the x86 architecture in
x * y multiplication that rounding modes are encoded within each floating
__fmul_[rn|rz|ru|rd](x, y) point instruction instead of dynamically using a floating
fmaf(x, y, z) FMA point control word. Trap handlers for floating point ex-
__fmaf_[rn|rz|ru|rd](x, y, z) ceptions are not supported. On the GPU there is no sta-
1.0f / x reciprocal tus flag to indicate when calculations have overflowed,
__frcp_[rn|rz|ru|rd](x) underflowed, or have involved inexact arithmetic. Like
x / y division SSE, the precision of each GPU operation is encoded
__fdiv_[rn|rz|ru|rd](x, y) in the instruction (for x87 the precision is controlled
sqrtf(x) square root dynamically by the floating point control word).
__fsqrt_[rn|rz|ru|rd](x)
5. CONSIDERATIONS FOR A HETERO-
x + y addition
__dadd_[rn|rz|ru|rd](x, y) GENEOUS WORLD
x * y multiplication
__dmul_[rn|rz|ru|rd](x, y) 5.1 Mathematical function accuracy
fma(x, y, z) FMA So far we have only considered simple math oper-
__fma_[rn|rz|ru|rd](x, y, z) ations such as addition, multiplication, division, and
1.0 / x reciprocal square root. These operations are simple enough that
__drcp_[rn|rz|ru|rd](x) computing the best floating point result (e.g. the clos-
x / y division est in round-to-nearest) is reasonable. For other math-
__ddiv_[rn|rz|ru|rd](x, y) ematical operations computing the best floating point
sqrt(x) square root result is harder.
__dsqrt_[rn|rz|ru|rd](x) The problem is called the table maker’s dilemma. To
guarantee the correctly rounded result, it is not gen-
4.5 Controlling fused multiply-add erally enough to compute the function to a fixed high
accuracy. There might still be rare cases where the er-
In general, the fused multiply-add operation is faster ror in the high accuracy result affects the rounding step
and more accurate than performing separate multiply at the lower accuracy.
and add operations. However, on occasion you may It is possible to solve the dilemma for particular func-
wish to disable the merging of multiplies and adds into tions by doing mathematical analysis and formal proofs [4],
fused multiply-add instructions. To inhibit this op- but most math libraries choose instead to give up the
timization one can write the multiplies and additions guarantee of correct rounding. Instead they provide im-
using intrinsics with explicit rounding mode as shown plementations of math functions and document bounds
in the previous tables. Operations written directly as on the relative error of the functions over the input
intrinsics are guaranteed to remain independent and range. For example, the double precision sin function
will not be merged into fused multiply-add instructions. in CUDA is guaranteed to be accurate to within 2 units
With CUDA Fortran it is possible to disable FMA merg- in the last place (ulp) of the correctly rounded result. In
ing via a compiler flag. other words, the difference between the computed result
and the mathematical result is at most ±2 with respect
4.6 Compiler flags to the least significant bit position of the fraction part
Compiler flags relevant to IEEE 754 operations are of the floating point result.
-ftz={true|false}, -prec-div={true|false}, and For most inputs the sin function produces the cor-
-prec-sqrt={true|false}. These flags control single rectly rounded result. For some inputs the result is off
precision operations on devices of compute capability of by 1 ulp. For a small percentage of inputs the result is
2.0 or later. off by 2 ulp.
Producing different results, even on the same system,
mode flags is not uncommon when using a mix of precisions, li-
IEEE 754 mode -ftz=false braries and hardware. Take for example the C code
(default) -prec-div=true sequence shown in Figure 6. We compiled the code se-
-prec-sqrt=true quence on a 64-bit x86 platform using gcc version 4.4.3
fast mode -ftz=true (Ubuntu 4.3.3-4ubuntu5).
-prec-div=false This shows that the result of computing cos(5992555.0)
-prec-sqrt=false using a common library differs depending on whether
the code is compiled in 32-bit mode or 64-bit mode.
The default “IEEE 754 mode” means that single pre- The consequence is that different math libraries can-
cision operations are correctly rounded and support de- not be expected to compute exactly the same result for a
normals, as per the IEEE 754 standard. In the “fast given input. This applies to GPU programming as well.
mode” denormal numbers are flushed to zero, and the Functions compiled for the GPU will use the NVIDIA
operations division and square root are not computed to CUDA math library implementation while functions com-
the nearest floating point value. The flags have no effect piled for the CPU will use the host compiler math li-
on double precision or on devices of compute capability brary implementation (e.g. glibc on Linux). Because
below 2.0. these implementations are independent and neither is
volatile float x = 5992555.0; block. Changing the number of threads per block reor-
printf("cos(%f): %.10g\n", x, cos(x)); ganizes the reduction; if the reduction is addition, then
the change rearranges parentheses in the long string of
gcc test.c -lm -m64
additions.
cos(5992555.000000): 3.320904615e-07 Even if the same general strategy such as parallel
reduction is used on the CPU and GPU, it is com-
gcc test.c -lm -m32 mon to have widely different numbers of threads on the
cos(5992555.000000): 3.320904692e-07 GPU compared to the CPU. For example, the GPU
implementation might launch blocks with 128 threads
per block, while the CPU implementation might use 4
Figure 6: The computation of cosine using the threads in total.
glibc math library yields different results when
compiled with -m32 and -m64.
5.4 Verifying GPU Results
The same inputs will give the same results for indi-
guaranteed to be correctly rounded, the results will of- vidual IEEE 754 operations to a given precision on the
ten differ slightly. CPU and GPU. As we have explained, there are many
reasons why the same sequence of operations may not be
5.2 x87 and SSE performed on the CPU and GPU. The GPU has fused
One of the unfortunate realities of C compilers is that multiply-add while the CPU does not. Parallelizing al-
they are often poor at preserving IEEE 754 semantics gorithms may rearrange operations, yielding different
of floating point operations [6]. This can be particularly numeric results. The CPU may be computing results in
confusing on platforms that support x87 and SSE oper- a precision higher than expected. Finally, many com-
ations. Just like CUDA operations, SSE operations are mon mathematical functions are not required by the
performed on single or double precision values, while IEEE 754 standard to be correctly rounded so should
x87 operations often use an additional internal 80-bit not be expected to yield identical results between im-
precision format. Sometimes the results of a computa- plementations.
tion using x87 can depend on whether an intermediate When porting numeric code from the CPU to the
result was allocated to a register or stored to memory. GPU of course it makes sense to use the x86 CPU re-
Values stored to memory are rounded to the declared sults as a reference. But differences between the CPU
precision (e.g. single precision for float and double result and GPU result must be interpreted carefully.
precision for double). Values kept in registers can re- Differences are not automatically evidence that the re-
main in extended precision. Also, x87 instructions will sult computed by the GPU is wrong or that there is a
often be used by default for 32-bit compiles but SSE problem on the GPU.
instructions will be used by default for 64-bit compiles. Computing results in a high precision and then com-
Because of these issues, guaranteeing a specific preci- paring to results computed in a lower precision can be
sion level on the CPU can sometimes be tricky. When helpful to see if the lower precision is adequate for a par-
comparing CPU results to results computed on the GPU, ticular application. However, rounding high precision
it is generally best to compare using SSE instructions. results to a lower precision is not equivalent to perform-
SSE instructions follow IEEE 754 for single and double ing the entire computation in lower precision. This can
precision. sometimes be a problem when using x87 and comparing
On 32-bit x86 targets without SSE it can be help- results against the GPU. The results of the CPU may
ful to declare variables using volatile and force float- be computed to an unexpectedly high extended preci-
ing point values to be stored to memory (/Op in Visual sion for some or all of the operations. The GPU result
Studio and -ffloat-store in gcc). This moves results will be computed using single or double precision only.
from extended precision registers into memory, where
the precision is precisely single or double precision. Al-
ternately, the x87 control word can be updated to set
the precision to 24 or 53 bits using the assembly in-
6. CONCRETE RECOMMENDATIONS
struction fldcw or a compiler option such as -mpc32 or The key points we have covered are the following.
-mpc64 in gcc.
Use the fused multiply-add operator.
5.3 Core Counts The fused multiply-add operator on the GPU has
As we have shown in Section 3, the final values com- high performance and increases the accuracy of com-
puted using IEEE 754 arithmetic can depend on imple- putations. No special flags or function calls are needed
mentation choices such as whether to use fused multiply- to gain this benefit in CUDA programs. Understand
add or whether additions are organized in series or par- that a hardware fused multiply-add operation is not yet
allel. These differences affect computation on the CPU available on the CPU, which can cause differences in nu-
and on the GPU. merical results.
One example of the differences can arise from dif-
ferences between the number of concurrent threads in- Compare results carefully.
volved in a computation. On the GPU, a common de- Even in the strict world of IEEE 754 operations, mi-
sign pattern is to have all threads in a block coordinate nor details such as organization of parentheses or thread
to do a parallel reduction on data within the block, counts can affect the final result. Take this into account
followed by a serial reduction of the results from each when doing comparisons between implementations.
Know the capabilities of your GPU. 8. REFERENCES
The numerical capabilities are encoded in the com- [1] ANSI/IEEE 754-1985. American National
pute capability number of your GPU. Devices of com- Standard — IEEE Standard for Binary
pute capability 2.0 and later are capable of single and Floating-Point Arithmetic. American National
double precision arithmetic following the IEEE 754 stan- Standards Institute, Inc., New York, 1985.
dard, and have hardware units for performing fused [2] IEEE 754-2008. IEEE 754–2008 Standard for
multiply-add in both single and double precision. Floating-Point Arithmetic. August 2008.
[3] ISO/IEC 9899:1999(E). Programming
Take advantage of the CUDA math library func- languages—C. American National Standards
tions. Institute, Inc., New York, 1999.
These functions are documented in Appendix C of the
[4] Catherine Daramy-Loirat, David Defour, Florent
CUDA C Programming Guide [7]. The math library
de Dinechin, Matthieu Gallet, Nicolas Gast, and
includes all the math functions listed in the C99 stan-
Jean-Michel Muller. CR-LIBM: A library of
dard [3] plus some additional useful functions. These
correctly rounded elementary functions in
functions have been tuned for a reasonable compromise
double-precision, February 2005.
between performance and accuracy.
We constantly strive to improve the quality of our [5] David Goldberg. What every computer scientist
math library functionality. Please let us know about should know about floating-point arithmetic. ACM
any functions that you require that we do not provide, Computing Surveys, March 1991. Edited reprint
or if the accuracy or performance of any of our func- available at: http://download.oracle.com/docs/
tions does not meet your needs. Leave comments in the cd/E19957-01/806-3568/ncg_goldberg.html.
NVIDIA CUDA forum1 or join the Registered Devel- [6] David Monniaux. The pitfalls of verifying
oper Program2 and file a bug with your feedback. floating-point computations. ACM Transactions on
Programming Languages and Systems, May 2008.
7. ACKNOWLEDGEMENTS [7] NVIDIA. CUDA C Programming Guide Version
4.0, 2011.
Thanks to Ujval Kapasi, Kurt Wall, Paul Sidenblad,
Massimiliano Fatica, Everett Phillips, Norbert Juffa,
and Will Ramey for their helpful comments and sug-
gestions.

1
http://forums.nvidia.com/index.php?showforum=62
2
http://developer.nvidia.com/
join-nvidia-registered-developer-program

International Journal of Engineering Research and Development
No ratings yet
International Journal of Engineering Research and Development
7 pages
MATH1070 2 Error and Computer Arithmetic
No ratings yet
MATH1070 2 Error and Computer Arithmetic
60 pages
Lec07 - Computer Arithmetic - Floating-Point Representation and Arithmetic
No ratings yet
Lec07 - Computer Arithmetic - Floating-Point Representation and Arithmetic
42 pages
2174 PDF
No ratings yet
2174 PDF
7 pages
International Journal of Engineering Research and Development
No ratings yet
International Journal of Engineering Research and Development
6 pages
The World Is Not Just Integers: Programming Languages Support Numbers With Fraction
No ratings yet
The World Is Not Just Integers: Programming Languages Support Numbers With Fraction
51 pages
Design and Implementation of Floating Point ALU With Parity Generator Using Verilog HDL
No ratings yet
Design and Implementation of Floating Point ALU With Parity Generator Using Verilog HDL
6 pages
The IEEE Standard For Floating Point Arithmetic
No ratings yet
The IEEE Standard For Floating Point Arithmetic
9 pages
MATH1070 2 Error and Computer Arithmetic PDF
No ratings yet
MATH1070 2 Error and Computer Arithmetic PDF
60 pages
#3 - Floating Point
No ratings yet
#3 - Floating Point
38 pages
COA
No ratings yet
COA
14 pages
Floating Point
No ratings yet
Floating Point
3 pages
Double Precision Floating Point Arithmetic
100% (3)
Double Precision Floating Point Arithmetic
12 pages
Lab 1
100% (1)
Lab 1
10 pages
10 MIPS Floating Point Arithmetic
No ratings yet
10 MIPS Floating Point Arithmetic
28 pages
Lecture 10 (Temp)
No ratings yet
Lecture 10 (Temp)
50 pages
Floating Point
No ratings yet
Floating Point
16 pages
Friendly Introduction To Numerical Analysis 1st Edition Bradie Solutions Manual PDF
33% (3)
Friendly Introduction To Numerical Analysis 1st Edition Bradie Solutions Manual PDF
13 pages
Floating Point Arithmetic: Numbers
No ratings yet
Floating Point Arithmetic: Numbers
14 pages
02 - User's Manual Hardware - MICREX-SX SPH Instructions (SX-Programmer Expert) - FEH200
100% (2)
02 - User's Manual Hardware - MICREX-SX SPH Instructions (SX-Programmer Expert) - FEH200
603 pages
Operations On Floating Point Numbers
No ratings yet
Operations On Floating Point Numbers
16 pages
Numerical Methods
No ratings yet
Numerical Methods
72 pages
Implementation of Binary To Floating Point Converter Using HDL
No ratings yet
Implementation of Binary To Floating Point Converter Using HDL
41 pages
Design of 32 Bit Floating Point Addition and Subtraction Units Based On Ieee 754 Standard IJERTV2IS60996
No ratings yet
Design of 32 Bit Floating Point Addition and Subtraction Units Based On Ieee 754 Standard IJERTV2IS60996
5 pages
Forth-Programmers-Handbook-3rd - Ed - Dokumen - Pub
100% (1)
Forth-Programmers-Handbook-3rd - Ed - Dokumen - Pub
274 pages
Demystifying Floating Point - John Farrier - CppCon 2015
No ratings yet
Demystifying Floating Point - John Farrier - CppCon 2015
61 pages
Implementation of IEEE 754 Compliant Single Precision Floating-Point Adder Unit Supporting Denormal Inputs On Xilinx FPGA
No ratings yet
Implementation of IEEE 754 Compliant Single Precision Floating-Point Adder Unit Supporting Denormal Inputs On Xilinx FPGA
5 pages
Floating Point: - We Need A Way To Represent
No ratings yet
Floating Point: - We Need A Way To Represent
14 pages
Floating Point
No ratings yet
Floating Point
26 pages
Implementation of Floating Point Multiplier
No ratings yet
Implementation of Floating Point Multiplier
4 pages
Floating Point Arithmetic Class
No ratings yet
Floating Point Arithmetic Class
24 pages
Ieee Arith
No ratings yet
Ieee Arith
3 pages
Floating Point Alu
No ratings yet
Floating Point Alu
11 pages
Design and Implementation of Fast Floating Point Multiplier Unit
No ratings yet
Design and Implementation of Fast Floating Point Multiplier Unit
5 pages
Verilog Project Report
No ratings yet
Verilog Project Report
13 pages
Floating-Point Numbers and Operations Representation
No ratings yet
Floating-Point Numbers and Operations Representation
8 pages
Floating Points
No ratings yet
Floating Points
31 pages
Linux Installation Guide
No ratings yet
Linux Installation Guide
76 pages
Synthesis of Single Precision Floating Point ALU: Department of Electronics and Communication Engineering
No ratings yet
Synthesis of Single Precision Floating Point ALU: Department of Electronics and Communication Engineering
20 pages
Telemac2d User v8p2
No ratings yet
Telemac2d User v8p2
115 pages
Computer Architecture: Nguyễn Trí Thành
No ratings yet
Computer Architecture: Nguyễn Trí Thành
55 pages
Amiga DOS Developer Manual
No ratings yet
Amiga DOS Developer Manual
112 pages
Lecture 2
No ratings yet
Lecture 2
27 pages
8.3 Floating Point Numbers
No ratings yet
8.3 Floating Point Numbers
19 pages
Implementation of 32 Bit Floating Point MAC Unit To Feed Weighted Inputs To Neural Networks
No ratings yet
Implementation of 32 Bit Floating Point MAC Unit To Feed Weighted Inputs To Neural Networks
4 pages
Synopsis and Literature Survey
No ratings yet
Synopsis and Literature Survey
10 pages
IEEE754 Floating Point Standard Presentation Detailed
No ratings yet
IEEE754 Floating Point Standard Presentation Detailed
15 pages
3ADR059056M0201 PB610-B Panel Builder 600 EN PDF
No ratings yet
3ADR059056M0201 PB610-B Panel Builder 600 EN PDF
416 pages
Design and Implementation of FPU For Optimised Speed: R. Bhuvanapriya, Menakadevi T
No ratings yet
Design and Implementation of FPU For Optimised Speed: R. Bhuvanapriya, Menakadevi T
12 pages
CC2032 COAL Lab # 07
No ratings yet
CC2032 COAL Lab # 07
9 pages
Chapter_03_arith_3_float
No ratings yet
Chapter_03_arith_3_float
30 pages
EC-502 - Aritra Dutta
No ratings yet
EC-502 - Aritra Dutta
6 pages
C Good
No ratings yet
C Good
197 pages
Floating-Point Numbers
No ratings yet
Floating-Point Numbers
23 pages
S S 32-B M C D: Imulation and Ynthesis of IT Ultiplier Using Onfigurable Evices
No ratings yet
S S 32-B M C D: Imulation and Ynthesis of IT Ultiplier Using Onfigurable Evices
8 pages
Unit 4 - 1
No ratings yet
Unit 4 - 1
11 pages
Floating Point & fixed point Representation_BCA II
No ratings yet
Floating Point & fixed point Representation_BCA II
24 pages
FPGA Based Reciprocator
No ratings yet
FPGA Based Reciprocator
5 pages
ENSC254 - Floating Point Computation
No ratings yet
ENSC254 - Floating Point Computation
29 pages
Chapter 3 Arithmetic For Computers
No ratings yet
Chapter 3 Arithmetic For Computers
49 pages
Arithmetic For Computers: Chapter-4
No ratings yet
Arithmetic For Computers: Chapter-4
67 pages
Autocad 2013 PDF DXF Reference Enu
No ratings yet
Autocad 2013 PDF DXF Reference Enu
286 pages
2.4 Floating Points
No ratings yet
2.4 Floating Points
36 pages
181
No ratings yet
181
11 pages
IEEE 754 Floating Point Formats
No ratings yet
IEEE 754 Floating Point Formats
12 pages
BCA 303 Unit 2 Notes
100% (1)
BCA 303 Unit 2 Notes
47 pages
Lecture 6
No ratings yet
Lecture 6
11 pages
A High Performance and Full Utilization Hardware Implementation of Floating Point Arithmetic Units
No ratings yet
A High Performance and Full Utilization Hardware Implementation of Floating Point Arithmetic Units
4 pages
Neenopal Data Analysis Task 2
No ratings yet
Neenopal Data Analysis Task 2
4 pages
Number System
No ratings yet
Number System
38 pages
Booth's Multiplication Algorithm
No ratings yet
Booth's Multiplication Algorithm
27 pages
Lecture Notes-1
No ratings yet
Lecture Notes-1
98 pages
DB-405 Windows SCADA Report Database Editing Guide 1.1
No ratings yet
DB-405 Windows SCADA Report Database Editing Guide 1.1
63 pages
MATH2089 NM Lectures Topic1
No ratings yet
MATH2089 NM Lectures Topic1
14 pages
Floating-Point Numbers and Round-Off Errors by Kusal Kaluarachchi Medium
No ratings yet
Floating-Point Numbers and Round-Off Errors by Kusal Kaluarachchi Medium
2 pages
CO 1 Material
No ratings yet
CO 1 Material
20 pages
Co Assignment Unit-2
No ratings yet
Co Assignment Unit-2
2 pages
IJSPR_1203_438 (1)
No ratings yet
IJSPR_1203_438 (1)
4 pages
Brief Introduction To The Fortran 90 Programming Language
No ratings yet
Brief Introduction To The Fortran 90 Programming Language
33 pages
An FPGA Implementation of High Speed and Area Efficient Double-Precision Floating Point Multiplier Using Urdhva Tiryagbhyam Technique
No ratings yet
An FPGA Implementation of High Speed and Area Efficient Double-Precision Floating Point Multiplier Using Urdhva Tiryagbhyam Technique
6 pages
The Sparc Architecture
No ratings yet
The Sparc Architecture
16 pages
Java Data Types
No ratings yet
Java Data Types
14 pages
Migrating From DB2 To PostgreSQL - What You Should Know - Severalnines
No ratings yet
Migrating From DB2 To PostgreSQL - What You Should Know - Severalnines
13 pages
Multiplication: - Accomplished Via Shifting and Addition
No ratings yet
Multiplication: - Accomplished Via Shifting and Addition
4 pages
IEEE 754 Floating Point Standard
No ratings yet
IEEE 754 Floating Point Standard
2 pages
MTL107 MAL230 Problem Sheet 1
No ratings yet
MTL107 MAL230 Problem Sheet 1
4 pages
Modern C++ Programming
From Everand
Modern C++ Programming
Orhan Gazi
No ratings yet
Modern C++ Programming: Including the recent standards C++11, C++17, C++20, C++23
From Everand
Modern C++ Programming: Including the recent standards C++11, C++17, C++20, C++23
Orhan Gazi
No ratings yet
CompTIA Network+ Review Guide: Exam N10-008
From Everand
CompTIA Network+ Review Guide: Exam N10-008
Jon Buhagiar
No ratings yet
Projects With Microcontrollers And PICC
From Everand
Projects With Microcontrollers And PICC
Guillermo Perez Guillen
5/5 (1)