Lecture 1 Number - Representation ECSE 343


ECSE 343 Numerical Methods in Engineering
Roni Khazaka
Dept. of Electrical and Computer Engineering
McGill University

Number Representation

2 Copyright ©2020 R. Khazaka ECSE 343 Numerical Methods in Engineering


Number Representation
Abstract quantity: $x$        Stored quantity: $\hat{x}$

$x = 4/7$ (i.e. $7x = 4$)
$\hat{x}$ = 0.6, 0.57, 0.57143, 0.571429 or 0.57142857, depending on the number of digits kept
Unsigned Integers
Decimal digit weights: $10^3\ 10^2\ 10^1\ 10^0$
8 3 0 2 → $8\times10^3 + 3\times10^2 + 0\times10^1 + 2\times10^0 = 8{,}302$

Binary digit weights: $2^7\ 2^6\ 2^5\ 2^4\ 2^3\ 2^2\ 2^1\ 2^0$
1 0 1 0 0 1 0 1 → $1\times2^7 + 0\times2^6 + 1\times2^5 + 0\times2^4 + 0\times2^3 + 1\times2^2 + 0\times2^1 + 1\times2^0 = 165_{10}$

8-bit binary
d7 d6 d5 d4 d3 d2 d1 d0 → $\sum_{i=0}^{7} d_i 2^i$        Range: 0 – 255
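The positional sum above can be sketched in a few lines of Python. This is only an illustration; the helper name `unsigned_value` is hypothetical and not part of the lecture.

```python
def unsigned_value(bits: str) -> int:
    """Evaluate sum(d_i * 2**i) for a bit string written most significant bit first."""
    total = 0
    for i, d in enumerate(reversed(bits)):   # i = 0 is the least significant bit
        total += int(d) * 2**i
    return total

print(unsigned_value("10100101"))   # the 8-bit example from the slide
print(unsigned_value("1" * 8))      # all ones: the top of the 8-bit range
```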


Unsigned Integers
uint8(5)  → 0 0 0 0 0 1 0 1                    (8 bits)
uint16(5) → 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1    (16 bits)

uint8() has a range from 0 to 255.
uint16() has a range from 0 to 65,535.


Signed Integers: Sign Bit
8-bit binary, bit weights: ± $2^6\ 2^5\ 2^4\ 2^3\ 2^2\ 2^1\ 2^0$

s d6 d5 d4 d3 d2 d1 d0 → $(-1)^{s} \sum_{i=0}^{6} d_i 2^i$
Range: −127 → +127

1 0 1 0 0 1 0 1 → $-(0\times2^6 + 1\times2^5 + 0\times2^4 + 0\times2^3 + 1\times2^2 + 0\times2^1 + 1\times2^0) = -37$

0 0 0 0 0 0 0 0 → +0

1 0 0 0 0 0 0 0 → −0


Signed Integers: Two’s Complement
0 1 0 1 1 0 1 1 → $1\times2^6 + 0\times2^5 + 1\times2^4 + 1\times2^3 + 0\times2^2 + 1\times2^1 + 1\times2^0 = 91$

Flip all bits and add 1 →

1 0 1 0 0 1 0 1 → $1\times2^7 + 0\times2^6 + 1\times2^5 + 0\times2^4 + 0\times2^3 + 1\times2^2 + 0\times2^1 + 1\times2^0 = 165$ (unsigned)
→ $165 - 2^8 = -91$ (two's complement)

→ $-2^8 + 1\times2^7 + 0\times2^6 + 1\times2^5 + 0\times2^4 + 0\times2^3 + 1\times2^2 + 0\times2^1 + 1\times2^0 = -91$
→ $-1\times2^7 + 0\times2^6 + 1\times2^5 + 0\times2^4 + 0\times2^3 + 1\times2^2 + 0\times2^1 + 1\times2^0 = -91$
Signed Integers: Two’s Complement
8-bit binary, bit weights: $-2^7\ 2^6\ 2^5\ 2^4\ 2^3\ 2^2\ 2^1\ 2^0$

s d6 d5 d4 d3 d2 d1 d0 → $-s\times2^7 + \sum_{i=0}^{6} d_i 2^i$
Range: −128 → +127

1 0 1 0 0 1 0 1 → $-1\times2^7 + 0\times2^6 + 1\times2^5 + 0\times2^4 + 0\times2^3 + 1\times2^2 + 0\times2^1 + 1\times2^0 = -91$

0 0 0 0 0 0 0 0 → 0
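As a rough sketch, the two signed encodings and the "flip all bits and add 1" rule can be compared in Python. The function names below are hypothetical helpers chosen for this illustration.

```python
def sign_magnitude(bits: str) -> int:
    """(-1)^s times the magnitude stored in the remaining bits."""
    s, magnitude = int(bits[0]), int(bits[1:], 2)
    return -magnitude if s else magnitude

def twos_complement(bits: str) -> int:
    """-s * 2^(n-1) plus the value of the remaining n-1 bits."""
    s, rest = int(bits[0]), int(bits[1:], 2)
    return -s * 2 ** (len(bits) - 1) + rest

def negate(bits: str) -> str:
    """Flip all bits and add 1, keeping the same width."""
    n = len(bits)
    flipped = int(bits, 2) ^ (2**n - 1)
    return format((flipped + 1) % 2**n, f"0{n}b")

pattern = "10100101"
print(sign_magnitude(pattern), twos_complement(pattern))
print(negate("01011011"))
```

The same bit pattern gives different values under the two conventions, which is why the interpretation has to be fixed by the type.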


Integers
Type      Number of Bits  Range
uint8()    8              0 → 255
uint16()  16              0 → 65,535
uint32()  32              0 → 4,294,967,295
uint64()  64              0 → 18,446,744,073,709,551,615
int8()     8              –128 → 127
int16()   16              –32,768 → 32,767
int32()   32              –2,147,483,648 → 2,147,483,647
int64()   64              –9,223,372,036,854,775,808 → 9,223,372,036,854,775,807
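The ranges in the table follow directly from the bit count. A minimal sketch, assuming two's complement for the signed types (the helper names are illustrative only):

```python
def unsigned_range(n_bits: int):
    """Range of an n-bit unsigned integer: 0 to 2^n - 1."""
    return 0, 2**n_bits - 1

def signed_range(n_bits: int):
    """Range of an n-bit two's complement integer: -2^(n-1) to 2^(n-1) - 1."""
    return -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1

for n in (8, 16, 32, 64):
    print(f"uint{n}: {unsigned_range(n)}   int{n}: {signed_range(n)}")
```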


Set of Signed/Unsigned Integers

uint8(): 0, 1, 2, 3, 4, ..., 251, 252, 253, 254, 255
int8(): −128, −127, −126, ..., −1, 0, 1, ..., 125, 126, 127

Distance between two adjacent stored values: 1


Number Representation
Abstract quantity: $x \in \mathbb{R}$, with $0 \le x \le 4{,}294{,}967{,}295$
Stored quantity: $\hat{x} = \mathrm{uint32}(x)$, with $\hat{x} \in \mathrm{uint32}$


Error
Absolute Error: $\epsilon = |x - \hat{x}|$
Relative Error: $\eta = \dfrac{|x - \hat{x}|}{|x|}$
Rounding scheme → Round to nearest integer.
$\epsilon = |x - \hat{x}| \le 0.5$
What about relative error?
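One way to explore the question about relative error is to round a few values of different magnitude and compare both error measures. This is a sketch only: the sample inputs are chosen for illustration, and `rounding_errors` is a hypothetical helper.

```python
def rounding_errors(x: float):
    """Round x to the nearest integer; return (x_hat, absolute error, relative error)."""
    x_hat = round(x)              # nearest integer (Python breaks exact ties toward even)
    abs_err = abs(x - x_hat)
    rel_err = abs_err / abs(x)    # undefined for x == 0
    return x_hat, abs_err, rel_err

for x in (0.6, 1.4, 100.4, 1_000_000.4):   # illustrative inputs, not from the slides
    x_hat, abs_err, rel_err = rounding_errors(x)
    print(f"x = {x:>12}  x_hat = {x_hat:>8}  abs = {abs_err:.3f}  rel = {rel_err:.2e}")
```

The absolute error stays below 0.5 in every case, while the relative error depends strongly on the magnitude of x.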


Fixed-Point Arithmetic
Digit weights: $10^2\ 10^1\ 10^0\ 10^{-1}\ 10^{-2}\ 10^{-3}$
1 3 2 0 4 5 → $1\times10^2 + 3\times10^1 + 2\times10^0 + 0\times10^{-1} + 4\times10^{-2} + 5\times10^{-3} = 132.045$
Distance between two adjacent stored values: $10^{-3}$

Bit weights: $2^3\ 2^2\ 2^1\ 2^0\ 2^{-1}\ 2^{-2}\ 2^{-3}\ 2^{-4}$
0 1 0 1 1 1 0 1 → $0\times2^3 + 1\times2^2 + 0\times2^1 + 1\times2^0 + 1\times2^{-1} + 1\times2^{-2} + 0\times2^{-3} + 1\times2^{-4} = 5.8125_{10}$
Distance between two adjacent stored values: $2^{-4}$

⇒ Difficulty with dynamic range


Fixed-Point Arithmetic
✓ Can use similar (same) hardware as integer arithmetic.
✓ Fast computation.
• Numbers are equally spaced.
• Loss of precision.
• Limited dynamic range.
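A minimal sketch of the binary fixed-point idea, assuming 4 fractional bits as in the 8-bit example and round-to-nearest storage (the rounding rule and the names `to_fixed` and `from_fixed` are assumptions for illustration):

```python
FRACTION_BITS = 4   # gives the 2^-4 spacing of the 8-bit example

def to_fixed(x: float, fraction_bits: int = FRACTION_BITS) -> int:
    """Store x as an integer number of steps of size 2**-fraction_bits."""
    return round(x * 2**fraction_bits)

def from_fixed(q: int, fraction_bits: int = FRACTION_BITS) -> float:
    """Recover the represented value from the stored integer."""
    return q / 2**fraction_bits

q = to_fixed(5.8125)
print(format(q, "08b"), from_fixed(q))   # bit pattern and stored value
print(from_fixed(to_fixed(0.01)))        # a small input collapses onto the grid
```

Because the stored quantity is an ordinary integer, integer hardware can be reused, but every value lands on the same uniform grid regardless of its magnitude.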


Floating Point Arithmetic


Example: Planck’s Law
$B_\nu(\nu, T) = \dfrac{2h\nu^3}{c^2}\,\dfrac{1}{e^{h\nu/(\kappa_B T)} - 1}$

Planck’s constant: $h = 6.626070\times10^{-34}$ J·s
Speed of light: $c = 299{,}792{,}458$ m/s
Boltzmann’s constant: $\kappa_B = 1.380649\times10^{-23}$ J/K
For an IR wavelength of 1000 nm: $\nu \cong 300\ \mathrm{THz} = 3\times10^{14}$ Hz
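To get a feel for the dynamic range involved, the formula can be evaluated directly. The sketch below uses the constants from the slide; the temperature of 5000 K is an assumed value picked only for illustration, and the function name `planck` is hypothetical.

```python
import math

H = 6.626070e-34      # Planck's constant in J.s (value from the slide)
C = 299_792_458.0     # speed of light in m/s
K_B = 1.380649e-23    # Boltzmann's constant in J/K

def planck(nu: float, temperature: float) -> float:
    """Spectral radiance B_nu(nu, T) = (2 h nu^3 / c^2) / (exp(h nu / (k_B T)) - 1)."""
    return (2.0 * H * nu**3 / C**2) / math.expm1(H * nu / (K_B * temperature))

NU = 3e14             # about 300 THz, from the slide
T_ASSUMED = 5000.0    # temperature in kelvin: an assumed value, not from the slide

# Orders of magnitude that appear in a single evaluation.
for name, value in (("h", H), ("k_B", K_B), ("c^2", C**2), ("nu^3", NU**3)):
    print(f"{name:>5} ~ 10^{math.floor(math.log10(value))}")
print("B =", planck(NU, T_ASSUMED))
```

The intermediate quantities span dozens of orders of magnitude, which a format with uniform spacing cannot cover with a reasonable number of bits.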


Scientific Notation
Normalized value
1123 → $1.123\times10^{3}$
11.23 → $1.123\times10^{1}$
0.1123 → $1.123\times10^{-1}$

$1.123\times10^{30}$        $1.123\times10^{-30}$


Floating Point Example: Binary

Sign bit Always ‘1’ Exponent

±1.101001×2!

‘Significand’ or ‘Mantissa’



Floating Point: Binary

Sign bit Always ‘1’ Exponent

$\pm 1.M \times 2^{e}$

‘Significand’ or ‘Mantissa’



Float: Single Precision 32-bit
Bit layout: $b_{31}$ | $b_{30} b_{29} \dots b_{23}$ | $b_{22} b_{21} \dots b_{1} b_{0}$
Sign bit: $b_{31}$
8-bit exponent $E$: $b_{30} \dots b_{23}$
23-bit mantissa: $b_{22} \dots b_{0}$

$e = b_{30} b_{29} b_{28} b_{27} b_{26} b_{25} b_{24} b_{23} - 127 = E - 127$        (bias: 127)
Except for special cases ($E$ = 00000000 and $E$ = 11111111)
⇒ $-126 \le e \le 127$
Float: Single Precision 32-bit
Sign bit $b_{31}$, 8-bit exponent $E = b_{30} \dots b_{23}$ (bias 127), 23-bit mantissa $M = b_{22} b_{21} \dots b_{1} b_{0}$

$(-1)^{b_{31}} \times 1.b_{22} b_{21} \dots b_{1} b_{0} \times 2^{(E-127)}$
$= (-1)^{b_{31}} \times 1.M \times 2^{(E-127)}$
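The three fields can be pulled out of a stored 32-bit float with Python's standard `struct` module. This is a sketch for normalized values only; `decode_single` and `normalized_value` are hypothetical helper names.

```python
import struct

def decode_single(x: float):
    """Return (sign bit, biased exponent E, mantissa field M) of x stored as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31            # b31
    E = (bits >> 23) & 0xFF      # b30..b23
    M = bits & 0x7FFFFF          # b22..b0
    return sign, E, M

def normalized_value(sign: int, E: int, M: int) -> float:
    """(-1)^sign * 1.M * 2^(E - 127); valid only for 0 < E < 255."""
    return (-1) ** sign * (1 + M / 2**23) * 2.0 ** (E - 127)

s, E, M = decode_single(6.5)
print(s, format(E, "08b"), format(M, "023b"))
print(normalized_value(s, E, M))
```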


Special Cases, IEEE 754
Fields: sign $b_{31}$, exponent $E = b_{30} \dots b_{23}$, mantissa $M = b_{22} \dots b_{0}$

           E = 0          0 < E < 255                                     E = 255
M = 0      ±0             $(-1)^{b_{31}} \times 1.0 \times 2^{E-127}$     ±∞
M ≠ 0      Denormalized   $(-1)^{b_{31}} \times 1.M \times 2^{E-127}$     NaN


Distance between two adjacent numbers
23-bit mantissa

+1.100101011100101010010 × $2^{0}$
Next higher number:
+1.100101011100101010011 × $2^{0}$
Difference is $\epsilon = 2^{-23} \times 2^{0} \cong 1.1921\times10^{-7}$
Valid for all $x$ where $1 < x < 2$


Distance between two adjacent numbers
23-bit mantissa

+1.100101011100101010010 × $2^{1}$
Next higher number:
+1.100101011100101010011 × $2^{1}$
Difference is $\epsilon = 2^{-23} \times 2^{1} \cong 2.3842\times10^{-7}$
Valid for all $x$ where $2 < x < 4$


Distance between two adjacent numbers
23-bit mantissa

+1.100101011100101010010 × $2^{2}$
Next higher number:
+1.100101011100101010011 × $2^{2}$
Difference is $\epsilon = 2^{-23} \times 2^{2} \cong 4.7684\times10^{-7}$
Valid for all $x$ where $4 < x < 8$


Distance between two adjacent numbers
23-bit mantissa

+1.100101011100101010010 × $2^{127}$
Next higher number:
+1.100101011100101010011 × $2^{127}$
Difference is $\Delta = 2^{-23} \times 2^{127} \cong 2.0282\times10^{31}$
Valid for all $x$ where $2^{127} < x < 2^{128}$
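The growth of the spacing with the exponent can be checked numerically. A minimal sketch using only the standard library: it steps to the next 32-bit float by adding 1 to the bit pattern, which is valid for positive finite values below the largest one. The helper names are hypothetical.

```python
import struct

def to_single(x: float) -> float:
    """Round a Python float (double) to the nearest 32-bit float value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def next_single(x: float) -> float:
    """Next larger 32-bit float above a positive finite x (bit pattern + 1)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return struct.unpack(">f", struct.pack(">I", bits + 1))[0]

def spacing(x: float) -> float:
    """Distance from x (as stored in single precision) to the next higher stored value."""
    return next_single(x) - to_single(x)

for x in (1.5, 3.0, 5.0, 2.0**127):
    print(f"x = {x:.4e}   spacing = {spacing(x):.4e}")
```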


Smallest Normalized Value
23-bit mantissa

+1.000000000000000000000 × $2^{-126}$
Next higher number:
+1.000000000000000000001 × $2^{-126}$
Difference is $\epsilon = 2^{-23} \times 2^{-126} \cong 1.4013\times10^{-45}$
What about the next smaller number?
The next smaller number is zero if we do not de-normalize.
⇒ $\Delta = 2^{-126} \cong 1.1755\times10^{-38}$
Smallest Normalized Value
0
    gap: $2^{-126} \cong 1.1755\times10^{-38}$
+1.000000000000000000000 × $2^{-126}$
    gap: $2^{-23} \times 2^{-126} \cong 1.4013\times10^{-45}$
+1.000000000000000000001 × $2^{-126}$


Denormalize ($E = 0$)

+0.111111111111111111111 × $2^{-126}$
    gap: $2^{-23} \times 2^{-126} \cong 1.4013\times10^{-45}$
+1.000000000000000000000 × $2^{-126}$
    gap: $2^{-23} \times 2^{-126} \cong 1.4013\times10^{-45}$
+1.000000000000000000001 × $2^{-126}$


Denormalized values ($E = 0$)

+0.111111111111111111111 × $2^{-126}$

+0.000000000000000000001 × $2^{-126}$
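The values around the normalized/denormalized boundary can be built directly from their single-precision bit patterns. A sketch with the standard `struct` module; `single_from_bits` is a hypothetical helper.

```python
import struct

def single_from_bits(bits: int) -> float:
    """Interpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

smallest_normal = single_from_bits(0x00800000)      # E = 1, M = 0
largest_subnormal = single_from_bits(0x007FFFFF)    # E = 0, M = all ones
smallest_subnormal = single_from_bits(0x00000001)   # E = 0, M = 1

print(smallest_normal, largest_subnormal, smallest_subnormal)
print(smallest_normal - largest_subnormal)          # gap across the boundary
```

With denormalized values the gap just below the smallest normalized value is the same as the gap just above it, instead of jumping straight to zero.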


Half-Precision 16-bit
Bit layout: $b_{15}$ | $b_{14} \dots b_{10}$ | $b_{9} \dots b_{0}$
Sign bit: $b_{15}$
5-bit exponent: $E = b_{14} b_{13} b_{12} b_{11} b_{10}$
10-bit mantissa: $M = b_{9} b_{8} \dots b_{1} b_{0}$

$e = b_{14} b_{13} b_{12} b_{11} b_{10} - 15 = E - 15$        (bias: 15)
Except for special cases (00000 and 11111) ⇒ $-14 \le e \le 15$

$(-1)^{b_{15}} \times 1.M \times 2^{(E-15)}$


Quarter Precision 8-bit (non-standard)
Bit layout: $b_{7}$ | $b_{6} b_{5} b_{4}$ | $b_{3} b_{2} b_{1} b_{0}$
Sign bit: $b_{7}$
3-bit exponent: $E = b_{6} b_{5} b_{4}$
4-bit mantissa: $M = b_{3} b_{2} b_{1} b_{0}$

$e = b_{6} b_{5} b_{4} - 3 = E - 3$        (bias: 3)
Except for special cases (000 and 111) ⇒ $-2 \le e \le 3$

$(-1)^{b_{7}} \times 1.M \times 2^{(E-3)}$
Spacing (Powers of 2)
$M = 0000$: $(-1)^{0} \times 1.0000 \times 2^{e}$, $-2 \le e \le 3$

[Number line: 0, 1, 2, 4, 8, 16]

$E = 110$ → $e = 6 - 3 = 3$ → $2^{3} = 8$
$E = 101$ → $e = 5 - 3 = 2$ → $2^{2} = 4$

$E = 010$ → $e = 2 - 3 = -1$ → $2^{-1} = 0.5$

$E = 001$ → $e = 1 - 3 = -2$ → $2^{-2} = 0.25$
Spacing ($E = 110$)
$(-1)^{0} \times 1.M \times 2^{3} = 1 b_{3} b_{2} b_{1} . b_{0}$, $0000 \le M \le 1111$

[Number line: 0, 1, 2, 4, 8, 15.5, 16]

$1000.0_2 \le \hat{x} \le 1111.1_2$, $\Delta = 0.1_2 = 0.5_{10}$

$+1.0000\times2^{3} = 1000.0_2 = 8_{10}$
$+1.0001\times2^{3} = 1000.1_2 = 8.5_{10}$
$+1.0010\times2^{3} = 1001.0_2 = 9_{10}$
$+1.0011\times2^{3} = 1001.1_2 = 9.5_{10}$
$+1.0100\times2^{3} = 1010.0_2 = 10_{10}$
$+1.1111\times2^{3} = 1111.1_2 = 15.5_{10}$
Spacing ($E = 101$)
$(-1)^{0} \times 1.M \times 2^{2} = 1 b_{3} b_{2} . b_{1} b_{0}$, $0000 \le M \le 1111$

[Number line: 0, 1, 2, 4, 8, 15.5, 16]

$100.00_2 \le \hat{x} \le 111.11_2$, $\Delta = 0.01_2 = 0.25_{10}$

$+1.0000\times2^{2} = 100.00_2 = 4_{10}$
$+1.0001\times2^{2} = 100.01_2 = 4.25_{10}$
$+1.0010\times2^{2} = 100.10_2 = 4.5_{10}$
$+1.0011\times2^{2} = 100.11_2 = 4.75_{10}$
$+1.0100\times2^{2} = 101.00_2 = 5_{10}$
$+1.1111\times2^{2} = 111.11_2 = 7.75_{10}$
Spacing ($E = 100$)
$(-1)^{0} \times 1.M \times 2^{1} = 1 b_{3} . b_{2} b_{1} b_{0}$, $0000 \le M \le 1111$

[Number line: 0, 1, 2, 4, 8, 15.5, 16]

$10.000_2 \le \hat{x} \le 11.111_2$, $\Delta = 0.001_2 = 0.125_{10}$

$+1.0000\times2^{1} = 10.000_2 = 2_{10}$
$+1.0001\times2^{1} = 10.001_2 = 2.125_{10}$
$+1.0010\times2^{1} = 10.010_2 = 2.25_{10}$
$+1.0011\times2^{1} = 10.011_2 = 2.375_{10}$
$+1.0100\times2^{1} = 10.100_2 = 2.5_{10}$
$+1.1111\times2^{1} = 11.111_2 = 3.875_{10}$
Spacing ($E = 011$)
$(-1)^{0} \times 1.M \times 2^{0} = 1 . b_{3} b_{2} b_{1} b_{0}$, $0000 \le M \le 1111$

[Number line: 0, 1, 2, 4, 8, 15.5, 16]

$1.0000_2 \le \hat{x} \le 1.1111_2$, $\Delta = 0.0001_2 = 2^{-4} = 0.0625_{10}$

$+1.0000\times2^{0} = 1.0000_2 = 1_{10}$
$+1.0001\times2^{0} = 1.0001_2 = 1.0625_{10}$
$+1.0010\times2^{0} = 1.0010_2 = 1.125_{10}$
$+1.0011\times2^{0} = 1.0011_2 = 1.1875_{10}$
$+1.0100\times2^{0} = 1.0100_2 = 1.25_{10}$
$+1.1111\times2^{0} = 1.1111_2 = 1.9375_{10}$
Spacing ($E = 010$)
$(-1)^{0} \times 1.M \times 2^{-1} = 0.1 b_{3} b_{2} b_{1} b_{0}$, $0000 \le M \le 1111$

[Number line: 0, 1, 2, 4, 8, 15.5, 16]

$0.10000_2 \le \hat{x} \le 0.11111_2$, $\Delta = 0.00001_2 = 2^{-5} = 0.03125_{10}$

$1.0000\times2^{-1} = 0.10000_2 = 0.5_{10}$
$1.0001\times2^{-1} = 0.10001_2 = 0.53125_{10}$
$1.0010\times2^{-1} = 0.10010_2 = 0.5625_{10}$
$1.0011\times2^{-1} = 0.10011_2 = 0.59375_{10}$
$1.0100\times2^{-1} = 0.10100_2 = 0.625_{10}$
$1.1111\times2^{-1} = 0.11111_2 = 0.96875_{10}$
Spacing ($E = 001$)
$(-1)^{0} \times 1.M \times 2^{-2} = 0.01 b_{3} b_{2} b_{1} b_{0}$, $0000 \le M \le 1111$

[Number line: 0, 1, 2, 4, 8, 15.5, 16]

$0.010000_2 \le \hat{x} \le 0.011111_2$, $\Delta = 2^{-4} \times 2^{-2} = 2^{-6} = 0.015625_{10}$

$1.0000\times2^{-2} = 0.010000_2 = 0.25_{10}$
$1.0001\times2^{-2} = 0.010001_2 = 0.25 + 2^{-6}$
$1.0010\times2^{-2} = 0.010010_2 = 0.25 + 2\times2^{-6}$
$1.0011\times2^{-2} = 0.010011_2 = 0.25 + 3\times2^{-6}$
$1.0100\times2^{-2} = 0.010100_2 = 0.25 + 4\times2^{-6}$
$1.1111\times2^{-2} = 0.011111_2 = 0.25 + 15\times2^{-6}$




Normalized Positive Values

[Number line: 0, 1, 2, 4, 8, 15.5, 16]

• Smallest normalized value: 0.25
• Largest normalized value: 15.5
• Spacing is constant between consecutive powers of 2
• There are 16 equally spaced values at each value of $E$
• There are 6 × 16 = 96 normalized positive floating-point values
• There is a ‘large’ gap between zero and the smallest normalized value


Subnormal Positive Values ($E = 000$)
$(-1)^{0} \times 0.M \times 2^{-2} = 0.00 b_{3} b_{2} b_{1} b_{0}$, $0001 \le M \le 1111$

[Number line: 0, 1, 2, 4, 8, 15.5, 16]

$0.000001_2 \le \hat{x} \le 0.001111_2$, $\Delta = 2^{-4} \times 2^{-2} = 2^{-6} = 0.015625_{10}$


Special Cases, “8-bit Floating Point”

           E = 0        0 < E < 7                                   E = 7
M = 0      ±0           $(-1)^{b_{7}} \times 1.0 \times 2^{E-3}$    ±∞
M ≠ 0      Subnormal    $(-1)^{b_{7}} \times 1.M \times 2^{E-3}$    NaN
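Because the 8-bit format has only 256 bit patterns, all of its values can be enumerated. The decoder below is a sketch of the format as described in these slides (1 sign bit, 3 exponent bits with bias 3, 4 mantissa bits); the name `decode8` is hypothetical.

```python
BIAS = 3

def decode8(byte: int) -> float:
    """Decode the non-standard 8-bit format: sign b7, exponent b6..b4, mantissa b3..b0."""
    sign = -1.0 if byte >> 7 else 1.0
    E = (byte >> 4) & 0b111
    M = byte & 0b1111
    if E == 7:                                        # special cases: infinity or NaN
        return sign * float("inf") if M == 0 else float("nan")
    if E == 0:                                        # zero and subnormals: 0.M x 2^-2
        return sign * (M / 16) * 2.0 ** (1 - BIAS)
    return sign * (1 + M / 16) * 2.0 ** (E - BIAS)    # normalized: 1.M x 2^(E-3)

# Positive normalized values: sign bit 0 and E from 001 to 110.
normalized = sorted(decode8(b) for b in range(0x10, 0x70))
print(len(normalized), normalized[0], normalized[-1])
```

The count, the smallest value and the largest value printed here can be compared against the figures quoted for the normalized positive values.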


Set of Floating-Point Values $\mathbb{F}$
[Number line: −∞, −16, −15.5, −8, −4, −2, −1, −0, +0, 1, 2, 4, 8, 15.5, 16, +∞]
Also in the set: NaN

Abstract quantity: $x$, $x \in \mathbb{R}$
Stored quantity: $\hat{x}$, $\hat{x} \in \mathbb{F}$


Floating Point
Type      Bits  Exponent  Mantissa  Exponent Bias  $e_{min}$  $e_{max}$
half()    16    5 bits    10 bits   15             −14        15
single()  32    8 bits    23 bits   127            −126       127
double()  64    11 bits   52 bits   1,023          −1,022     1,023


Double() Largest Value
$M = 111\cdots11$ (52 bits), $E = 111\cdots10$ (11 bits), $e = 2046 - 1023 = 1023$

$+1.111\cdots11 \times 2^{1023} = +1111\cdots11 \times 2^{971}$        (53 bits)
$= 1.797693134862316\times10^{308}$
Double() Smallest Normal Value
$M = 000\cdots00$ (52 bits), $E = 000\cdots01$ (11 bits), $e = 1 - 1023 = -1022$

$+1.000\cdots00 \times 2^{-1022} = 2.225073858507201\times10^{-308}$


Double() Smallest Subnormal Value
$M = 000\cdots01$ (52 bits), $E = 000\cdots00$ (11 bits), $e = -1022$

$+0.000\cdots01 \times 2^{-1022} = 2^{-1074}$
$= 4.940656458412465\times10^{-324}$
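These three limits can be checked in Python, whose `float` is an IEEE 754 double on common platforms (an assumption about the runtime, not something the slides state).

```python
import math
import sys

largest = sys.float_info.max                  # 1.111...1 x 2^1023
smallest_normal = sys.float_info.min          # 1.000...0 x 2^-1022
smallest_subnormal = math.ldexp(1.0, -1074)   # 0.000...01 x 2^-1022 = 2^-1074

print(largest, smallest_normal, smallest_subnormal)
print(smallest_subnormal / 2)                 # nothing representable lies between this and 0
```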


Example 32-bit Float
Bit layout: $b_{31}$ | $b_{30} b_{29} \dots b_{23}$ | $b_{22} b_{21} \dots b_{1} b_{0}$
Sign bit: $b_{31}$
8-bit exponent $E$: $b_{30} \dots b_{23}$ (bias 127)

$(-1)^{b_{31}} \times 1.b_{22} b_{21} \dots b_{1} b_{0} \times 2^{(E-127)}$


Exercise 1
Fill in the values of the exponent bias, the minimum exponent $e_{min}$ and the maximum exponent $e_{max}$ in the table.

Type      Bits  Exponent  Mantissa  Exponent Bias  $e_{min}$  $e_{max}$
half()    16    5 bits    10 bits
single()  32    8 bits    23 bits
double()  64    11 bits   52 bits


Floating Point
Type      Bits  Exponent  Mantissa  Exponent Bias  $e_{min}$  $e_{max}$
half()    16    5 bits    10 bits   15             −14        15
single()  32    8 bits    23 bits   127            −126       127
double()  64    11 bits   52 bits   1,023          −1,022     1,023
