0% found this document useful (0 votes)

106 views46 pages

Lecture 1 FloatingPointNumberSystems

The document discusses floating point number systems which approximate real numbers using a finite number of bits. Floating point numbers represent numbers in the form ±0.d1d2...dt × βp, where the parameters {β, t, L, U} characterize the system. Common systems include IEEE single and double precision. Floating point arithmetic introduces errors like overflow and underflow when numbers are too large or small to represent. Rounding is required to convert real numbers to the nearest representable floating point value.

Uploaded by

Saksham Sharma

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

106 views46 pages

Lecture 1 FloatingPointNumberSystems

Uploaded by

Saksham Sharma

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 46

Lecture 1: Floating Point Number Systems

Leili Rafiee Sevyeri

Based on lecture notes by me and many previous CS370 instructors

Fall 2022
Cheriton School of Computer Science
University of Waterloo

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 1 / 26
Real Numbers

Infinite in extent: There exists x such that |x | is arbitrarily large.

Infinite in density: Any interval a ≤ x ≤ b contains infinitely many numbers.

Problem: Computers cannot represent infinite/arbitrarily large quantities!

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 2 / 26
Real Numbers

Infinite in extent: There exists x such that |x | is arbitrarily large.

Infinite in density: Any interval a ≤ x ≤ b contains infinitely many numbers.

Problem: Computers cannot represent infinite/arbitrarily large quantities!

Solution: The standard (partial) solution is to use floating point numbers to

approximate the reals.

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 2 / 26
Floating Point Systems

An approximate representation of real numbers using a finite number of bits.

Questions to ponder:
How can we represent real numbers digitally?

How does the resulting “approximate” number system behave?

Why is this important?

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 3 / 26
Usage of the word “numerical”?

In ordinary language, it means: “having to do with numbers”.

For scientific/numerical computing, it instead means more like:
“having to do with a computer approximation of some mathematical
value/problem, usually involving (approximate) real numbers”.

Note:
An “analytical” solution is the exact answer to a problem.

A “numerical” solution is the approximate one produced by our algorithm

(often in floating point).

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 4 / 26
Numerical Errors: Toy Example

Consider the sum:

100
X
12 + 0.01
i=1

True answer: 13.

Now, assume we can only keep 2 digits of accuracy at a time. What happens?
Perform the sum one addition at a time, round after each.

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 5 / 26
Numerical Errors: Toy Example

Consider the sum:

100
X
12 + 0.01
i=1

True answer: 13.

Now, assume we can only keep 2 digits of accuracy at a time. What happens?
Perform the sum one addition at a time, round after each.

((12 + 0.01) + 0.01) + 0.01) + 0.01) + · · ·

What is the sum after each step? And at the end?

“Numerical” answer: 12 Wrong!!

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 5 / 26
Numerical Errors: Taylor series example

Say we want to approximately evaluate e −5.5 (without a handy exponential

function).

Recall the Taylor series for a function :

(x − a)2 00 (x − a)3 (3)

f (x ) = f (a) + (x − a)f 0 (a) + f (a) + f (a) + · · ·
2! 3!
(Taylor series lets us approximate a function value in a neighbourhood, assuming
we know its value and derivatives at some nearby point a.)

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 6 / 26
Numerical Errors: Taylor series example

Applying Taylor series to f (x ) = e x around a = 0, gives

(x − 0)2 0 (x − 0)3 0 x2 x3
e x ≈ e 0 + (x − 0)e 0 + e + e + ··· = 1 + x + + + ···
2! 3! 2! 3!
Plugging in x = −5.5, and using 5 digits of accuracy, we find:

e −5.5 ≈ 1 − 5.5 + 15.125 − 27.730 + · · · = 0.0026363

Trying it on your calculator/phone:

e −5.5 = 0.0040868 (rounded to 5 digits)

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 7 / 26
Numerical Errors: Taylor series example

Another approximation is given by

1 1
e −x = = 2
ex x x3
1+x + + + ···
2! 3!
Again using 5 digits accuracy, we now get: e −5.5 ≈ 0.0040865.
This is much closer to the desired solution e −5.5 = 0.0040868.

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 8 / 26
What went wrong here?

Floating point numbers often don’t quite behave like true real numbers.

This can lead to subtle (yet huge) errors!

To be useful, our numerical algorithms must work effectively under floating point
arithmetic.

Let’s examine the floating point representation.

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 9 / 26
Infinite Expansion of Real Numbers

We can express a real number as an infinite expansion relative to some base β.

73
e.g. consider ≈ 24.3333
3
in base 10:
73
= 24.3333 . . . = 2 × 101 + 4 × 100 + 3 × 10−1 + 3 × 10−2 + · · ·
3
In base 2:
73
= 11000.010101 . . . = 1 × 24 + 1 × 23 + 0 × 22 + 0 × 21 + · · ·
3

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 10 / 26
Normalized Form of a Real number

After expressing the real number in the desired base β, we multiply by a power of
β to shift it into a normalized form:

0.d1 d2 d3 d4 . . . × β p

where:
di are digits in base β, i.e. 0 ≤ di < β.

normalized implies we shift to ensure d1 6= 0.

exponent p is an integer.

How can we now avoid storing infinite data?

Careful: You may see other normalization conventions outside this class, e.g.
d1 .d2 d3 d4 . . . × β p

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 11 / 26
Floating Point

Density (or precision) is bounded by limiting the number of digits, t.

Extent (or range) is bounded by limiting the range of values for exponent p.
Our floating point representation then has the finite form:

±0.d1 d2 . . . dt × β p

for L ≤ p ≤ U and d1 6= 0.
or
0 as a special case.

The four integer parameters {β, t, L, U} characterize a specific floating point

system, F .

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 12 / 26
Overflow/underflow errors

If the exponent is too big (> U) or too small (< L), our system cannot
represent the number.

When arithmetic operations generate such a number, this is called overflow

(too big) or underflow (too small), respectively.

For underflow, answer simply rounds to 0 instead.

For overflow, will typically produce a ±∞ (or NaN - not a number).

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 13 / 26
Floating Point Standards

The two most common standardized floating point systems are:

IEEE single precision (32 bits): {β = 2, t = 24, L = −126, U = 127}

IEEE double precision (64 bits): {β = 2, t = 53, L = −1022, U = 1023}

Almost always implemented directly in (CPU/GPU/etc.) hardware.

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 14 / 26
Aside: Why the name “floating point”?

Alternative systems exist where the number of digits after the decimal (or radix)
point is fixed – “fixed point” numbers.

Example
Three digits after the decimal: 10.234, 171.001, 0.010.

This is just an integer representation, scaled by a constant scale factor. e.g.

10234 × 10−3 , 171001 × 10−3 , 10 × 10−3 , where 10−3 is the scale factor.

Floating point number systems let the radix point “float”, to represent a wider
range.

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 15 / 26
Floating Point Density

Unlike fixed point, floating point numbers are not evenly spaced! E.g., For
F = {β = 2, t = 3, L = −1, U = 2} the representable (non-negative) values are
spaced like:

0, 0.100 × 2−1 , 0.101 × 2−1 , 0.110 × 2−1 , 0.111 × 2−1 , 0.100 × 20 ,

0.101 × 20 , 0.110 × 20 , . . . . . . . . . , 0.110 × 22 , 0.111 × 22

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 16 / 26
Rounding vs. Truncation

Floating point systems offer different rounding modes when converting a real
number into a representable FP number.

We will only consider:

1 Round-to-nearest – rounds to closest available number in F .
Usually the default.
1
We’ll break ties by simply rounding up. (Various other options exist ...)
2
2 Truncation/“Chopping” – rounds to next number in F towards zero. i.e.
simply discard any digits after the t-th.

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 17 / 26
Example 1

Express 253.9 in a floating point system with base β = 10, t = 6 digits,

L = −5, U = 5.

Process:
1 Express in base β = 10 :

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 18 / 26
Example 1

Express 253.9 in a floating point system with base β = 10, t = 6 digits,

L = −5, U = 5.

Process:
1 Express in base β = 10 : 253.9.

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 18 / 26
Example 1

Express 253.9 in a floating point system with base β = 10, t = 6 digits,

L = −5, U = 5.

Process:
1 Express in base β = 10 : 253.9.
2 Shift to get leading 0, set exponent p:

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 18 / 26
Example 1

Express 253.9 in a floating point system with base β = 10, t = 6 digits,

L = −5, U = 5.

Process:
1 Express in base β = 10 : 253.9.
2 Shift to get leading 0, set exponent p: 0.2539 × 103 , p = 3 .

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 18 / 26
Example 1

Express 253.9 in a floating point system with base β = 10, t = 6 digits,

L = −5, U = 5.

Process:
1 Express in base β = 10 : 253.9.
2 Shift to get leading 0, set exponent p: 0.2539 × 103 , p = 3 .
3 Pad or round/truncate to t digits:

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 18 / 26
Example 1

Express 253.9 in a floating point system with base β = 10, t = 6 digits,

L = −5, U = 5.

Process:
1 Express in base β = 10 : 253.9.
2 Shift to get leading 0, set exponent p: 0.2539 × 103 , p = 3 .
3 Pad or round/truncate to t digits: 0.253900 × 103 .

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 18 / 26
Example 2

Express 8.25 in a floating point system with base β = 2, t = 7 digits,

L = −5, U = 5.

Process:
1 Express in base β = 2:

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 19 / 26
Example 2

Express 8.25 in a floating point system with base β = 2, t = 7 digits,

L = −5, U = 5.

Process:
1 Express in base β = 2: 1000.01

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 19 / 26
Example 2

Express 8.25 in a floating point system with base β = 2, t = 7 digits,

L = −5, U = 5.

Process:
1 Express in base β = 2: 1000.01
2 Shift to get leading 0, set exponent p:

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 19 / 26
Example 2

Express 8.25 in a floating point system with base β = 2, t = 7 digits,

L = −5, U = 5.

Process:
1 Express in base β = 2: 1000.01
2 Shift to get leading 0, set exponent p: 0.100001 × 24 , p = 4 .

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 19 / 26
Example 2

Express 8.25 in a floating point system with base β = 2, t = 7 digits,

L = −5, U = 5.

Process:
1 Express in base β = 2: 1000.01
2 Shift to get leading 0, set exponent p: 0.100001 × 24 , p = 4 .
3 Pad or round/truncate to t digits:

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 19 / 26
Example 2

Express 8.25 in a floating point system with base β = 2, t = 7 digits,

L = −5, U = 5.

Process:
1 Express in base β = 2: 1000.01
2 Shift to get leading 0, set exponent p: 0.100001 × 24 , p = 4 .
3 Pad or round/truncate to t digits: 0.1000010 × 24 .

Leili Rafiee Sevyeri (CS370) Lecture 1: Floating Point Number Systems Fall 2022 19 / 26
Example 3