SC2000/CZ2100
Probability & Statistics
Week 2
Ch 2. Presenting Data
● Frequency tables and Charts
● Bar Charts
● Stem and Leaf Displays
● Histograms
● Box Plots
● others
Box Plot:
Draw the Box Plot for the following data set (31 values).
14 15 16 16 17 17 17 17 17 18
18 18 18 18 18 19 19 19 20 20
20 20 20 20 21 21 22 23 24 24 29
Outlier = value > upper adjacent or < lower adjacent
Outlier = 29
Upper adjacent = 24
75th percentile = 20
Median = 19; mean (marked +) = 19.2
25th percentile = 17
Lower adjacent = 14
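The annotated values can be checked numerically. The sketch below is a minimal illustration, not the slide's own method: it assumes the common 1.5 × IQR rule for the fences (the slide does not state the rule, but it reproduces the adjacent values shown) and the default percentile method of Python's `statistics.quantiles`. All variable names are illustrative.

```python
from statistics import mean, median, quantiles

data = [14, 15, 16, 16, 17, 17, 17, 17, 17, 18,
        18, 18, 18, 18, 18, 19, 19, 19, 20, 20,
        20, 20, 20, 20, 21, 21, 22, 23, 24, 24, 29]

# 25th and 75th percentiles (the hinges of the box).
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Assumption: fences are placed 1.5 * IQR beyond the hinges.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Adjacent values are the most extreme data points still inside the fences.
lower_adjacent = min(x for x in data if x >= lower_fence)
upper_adjacent = max(x for x in data if x <= upper_fence)
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("median:", median(data), "mean:", round(mean(data), 1))
print("hinges:", q1, q3, "adjacent:", lower_adjacent, upper_adjacent)
print("outliers:", outliers)
```

Other percentile conventions can give slightly different hinges on other data sets, so the fences may shift by a small amount depending on the method used.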
Figure shows the boxplots of the Anger Expression Index by sports
participation. Does it look like there are any outliers? Which group
reported expressing more anger?
Practical example of Box Plot:
Daily Power Output of an Electric Generator
Another practical example of Box Plot:
Daily share prices
Ch 3. Summarizing Distributions
● Central Tendency: mean, median & mode
● Other Measures of Central Tendency
● Comparing Central Tendency
● Measures of Variability: Range, IQR,
Variance
● Linear Transformation of variable
● Variance Sum Law I
Central Tendency
Summarizes a distribution by its central location
Different ways to define central tendency:
• Mean: \( \mu = \frac{\sum X}{N} \)
• Median = 50th percentile
• Mode = the value with the highest frequency
• Trimean = \( \frac{\text{25th percentile} + 2 \times \text{median} + \text{75th percentile}}{4} \)
• Geometric mean = \( \left( \prod X \right)^{1/N} \), where \( \prod \) means to multiply
  Eg: \( \prod_{i=1}^{5} X_i = X_1 \times X_2 \times X_3 \times X_4 \times X_5 \)
• Trimmed mean = mean of the data with some of the highest and lowest values removed
Central Tendency:
Eg: Given the following data set, compute the mean, the
median, the mode, the trimean, the geometric mean
and the mean trimmed 18.2%.
1 3 4 4 4 5 5 7 8 9 31
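One way to work through this example is sketched below in Python. Two choices are assumptions rather than something the slide specifies: percentiles use the default "exclusive" method of `statistics.quantiles`, and "trimmed 18.2%" is read as removing 18.2% of the values in total, half from each end (one value per end for 11 values).

```python
from statistics import mean, median, mode, geometric_mean, quantiles

data = [1, 3, 4, 4, 4, 5, 5, 7, 8, 9, 31]
n = len(data)

# Quartiles for the trimean (default "exclusive" percentile method).
q1, q2, q3 = quantiles(data, n=4)
trimean = (q1 + 2 * q2 + q3) / 4

# Assumption: 18.2% is the total proportion trimmed, split evenly between ends.
k = round(0.182 * n / 2)            # number of values dropped from each end
trimmed = sorted(data)[k:n - k]

print("mean          :", round(mean(data), 3))
print("median        :", median(data))
print("mode          :", mode(data))
print("trimean       :", trimean)
print("geometric mean:", round(geometric_mean(data), 3))
print("trimmed mean  :", round(mean(trimmed), 3))
```

The single large value 31 pulls the mean well above the median, while the trimean, geometric mean and trimmed mean stay closer to the bulk of the data.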
Central Tendency – comparing various measures
For symmetric distributions:
Mean = Median = Trimean = Trimmed mean
= Mode (except for bimodal distributions)
For skewed distributions:
The measures differ from one another.
Example – the mean is typically higher than
the median for a positively skewed distribution
Central Tendency – comparing various measures
Example (negatively skewed distribution):
Calculate the following:
1. mode
2. median
3. trimean
4. mean
Measures of Variability
An indication of how spread out the distribution is
Frequently used measures of variability:
• Range = Highest value – Lowest value
• Interquartile range: IQR = 75th percentile – 25th percentile
• Variance: \( \sigma^2 = \frac{\sum (X - \mu)^2}{N} \)
• Standard deviation = \( \sqrt{\text{Variance}} \)
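As a rough illustration of these four measures, the Python sketch below applies them to the data set from the earlier central tendency example (1 3 4 4 4 5 5 7 8 9 31), treated here as a whole population. The percentile method (the default of `statistics.quantiles`) is an assumption; other conventions can give a slightly different IQR.

```python
from math import sqrt
from statistics import quantiles

data = [1, 3, 4, 4, 4, 5, 5, 7, 8, 9, 31]
N = len(data)

value_range = max(data) - min(data)          # highest minus lowest value

q1, _, q3 = quantiles(data, n=4)             # 25th and 75th percentiles
iqr = q3 - q1

mu = sum(data) / N
variance = sum((x - mu) ** 2 for x in data) / N   # population variance
std_dev = sqrt(variance)

print("range:", value_range, "IQR:", iqr)
print("variance:", round(variance, 3), "std dev:", round(std_dev, 3))
```

The range and the variance are both inflated by the single value 31, whereas the IQR only reflects the middle half of the data.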
Variability:
Eg: Given the following two data sets, compute the mean,
the range, the IQR and the variance of each.
[Figure: histograms of the Quiz 1 and Quiz 2 score distributions; x-axis: score (4 to 10), y-axis: number of students (0 to 6)]
                          Quiz 1    Quiz 2
Mean \( \mu \)
Range
IQR
Var \( \sigma^2 \)
True or False questions:
Compute the mean and variance from a population of size N:
Population mean: \( \mu = E[X] = \frac{\sum X}{N} \)
Population variance: \( \sigma^2 = E[(X - \mu)^2] = \frac{\sum (X - \mu)^2}{N} \) or \( \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N} \)
Equivalently: \( \sigma^2 = E[X^2] - \mu^2 \)
Estimate the mean and variance from a sample of size n:
Sample mean: \( \bar{x} = \frac{\sum X}{n} \)
Sample variance: \( s^2 = \frac{\sum (X - \bar{x})^2}{n - 1} \) or \( \frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1} \)
Why n − 1?
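The computational forms of the two variance formulas can be checked against Python's `statistics` module. This is only a sketch: the data set is borrowed from the earlier central tendency example and is treated once as a population and once as a sample, purely for illustration.

```python
from statistics import pvariance, variance

data = [1, 3, 4, 4, 4, 5, 5, 7, 8, 9, 31]
n = len(data)

sum_x = sum(data)
sum_x2 = sum(x * x for x in data)

# Computational forms: (sum of squares - (sum)^2 / n) over n or n - 1.
pop_var = (sum_x2 - sum_x ** 2 / n) / n
sample_var = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Library versions: pvariance divides by n, variance divides by n - 1.
print(round(pop_var, 3), round(pvariance(data), 3))
print(round(sample_var, 3), round(variance(data), 3))
```

The sample variance is always the larger of the two by a factor of n / (n − 1).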
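Suppose the denominator of \( s^2 \) is \( n \) instead of \( n - 1 \):
\[ s^2 = \frac{\sum (X - \bar{x})^2}{n} = \frac{1}{n} \left[ \sum X^2 - \frac{(\sum X)^2}{n} \right] \]
For an unbiased estimate, we expect the mean of \( s^2 \) to be equal to \( \sigma^2 \):
\[ E[s^2] = \frac{1}{n} E\left[ \sum X^2 - \frac{(\sum X)^2}{n} \right] \]
\[ = \frac{1}{n} \left[ \sum E[X^2] - \frac{E[(\sum X)^2]}{n} \right] \]
Since \( E[X^2] = \sigma^2 + \mu^2 \) (from the previous slide):
\[ = \frac{1}{n} \left[ n\sigma^2 + n\mu^2 - \frac{E[(\sum X)^2]}{n} \right] \]
Let \( Y = \sum X \), so that \( E[Y^2] = \sigma_Y^2 + \mu_Y^2 \):
\[ E[s^2] = \frac{1}{n} \left[ n\sigma^2 + n\mu^2 - \frac{\mathrm{Var}[\sum X] + (E[\sum X])^2}{n} \right] \]
\[ = \frac{1}{n} \left[ n\sigma^2 + n\mu^2 - \frac{\sum \mathrm{Var}[X] + (\sum E[X])^2}{n} \right] \]
(the step \( \mathrm{Var}[\sum X] = \sum \mathrm{Var}[X] \) relies on the observations being independent)
\[ = \frac{1}{n} \left[ n\sigma^2 + n\mu^2 - \frac{n\sigma^2 + (n\mu)^2}{n} \right] \]
\[ = \frac{1}{n} \left[ n\sigma^2 + n\mu^2 - \sigma^2 - n\mu^2 \right] = \frac{n - 1}{n} \sigma^2 \]
If the denominator is \( n - 1 \), then the mean of \( s^2 \) is equal to \( \sigma^2 \), i.e. \( E[s^2] = \sigma^2 \)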
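The result of this derivation can be confirmed exactly on a small case. The Python sketch below is an illustration under stated assumptions: the population (the faces of a fair die) and the sample size n = 3 are hypothetical choices, and samples are drawn independently with replacement, so every ordered triple is equally likely. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]      # hypothetical population: a fair die
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N   # population variance

n = 3                                 # hypothetical sample size


def sum_sq_dev(sample):
    """Sum of squared deviations of a sample about its own mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)


# Every possible sample of size n (independent draws, all equally likely).
samples = list(product(population, repeat=n))

# Average of each estimator over all samples, i.e. its expected value.
mean_with_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
mean_with_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print("sigma^2               :", sigma2)
print("E[s^2], denominator n  :", mean_with_n)
print("E[s^2], denominator n-1:", mean_with_n_minus_1)
```

Dividing by n underestimates the population variance by the factor (n − 1)/n on average, while dividing by n − 1 recovers it exactly.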
Linear transformation of variable
- Transform data from one measurement scale
to another
Eg. Consider a taxi trip from point A to B. The taxi
service has an initial charge of $3 plus an additional $0.50 per km
for the trip. Let y be the cost of the taxi ride and x be
the distance travelled. We have:
\( y = 0.5x + 3 \)
The mean of y = 0.5 (mean of x) + 3
The variance of y = \( 0.5^2 \) (variance of x)
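The two rules can be checked on a handful of trips. In the Python sketch below the trip distances are hypothetical values invented for illustration; only the fare formula y = 0.5x + 3 comes from the slide.

```python
from statistics import mean, pvariance

distances_km = [2.0, 4.0, 6.0, 8.0, 10.0]        # hypothetical distances x
costs = [0.5 * x + 3 for x in distances_km]       # fare y = 0.5x + 3

# Mean: scaled by 0.5 and shifted by 3.
print(mean(costs), 0.5 * mean(distances_km) + 3)

# Variance: scaled by 0.5 squared; the constant 3 has no effect.
print(pvariance(costs), 0.5 ** 2 * pvariance(distances_km))
```

Adding the fixed $3 charge moves every fare by the same amount, which is why it shifts the mean but leaves the spread unchanged.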
Variance Sum Law I
- Linear combination of 2 independent
variables
Eg. Consider a couple A and B. A works x hrs/day and
earns $5/hr while B works y hrs/day and earns $10/hr.
Total daily income t: \( t = 5x + 10y \)
Daily income difference d: \( d = 5x - 10y \)
Mean = 5 (mean of x) ± 10 (mean of y)   (+ for t, − for d)
Variance = \( 5^2 \sigma_x^2 + 10^2 \sigma_y^2 \)   (the same for t and d)
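A small exact check of this law is sketched below in Python. The working hours are hypothetical values, and independence is modelled by making every (x, y) combination equally likely; exact fractions avoid rounding. The helper names `mean` and `pvar` are illustrative.

```python
from fractions import Fraction
from itertools import product


def mean(values):
    """Exact mean of a list of integers."""
    return Fraction(sum(values), len(values))


def pvar(values):
    """Exact population variance of a list of integers."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


x_hours = [6, 7, 8]      # hypothetical hours worked by A, equally likely
y_hours = [4, 8]         # hypothetical hours worked by B, equally likely

# Independence: every (x, y) pair occurs with the same probability.
pairs = list(product(x_hours, y_hours))
totals = [5 * x + 10 * y for x, y in pairs]     # t = 5x + 10y
diffs = [5 * x - 10 * y for x, y in pairs]      # d = 5x - 10y

expected_var = 5 ** 2 * pvar(x_hours) + 10 ** 2 * pvar(y_hours)
print(pvar(totals), pvar(diffs), expected_var)
```

The variances of the sum and of the difference coincide because the sign of the coefficient disappears when it is squared.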
Ch 4. Bivariate Data
● Introduction to Bivariate Data
● Pearson Correlation and Covariance
● Properties of Pearson Correlation
● Variance Sum Law II
Bivariate Data
- A dataset with a pair of variables which may be
correlated to one another.
Eg: two variables – ice cream sales and
temperature
[Figure: scatter plot of ice cream sales ($0 to $700) against temperature (10 to 26 °C)]
Bivariate Data
Eg: The figures below show the frequency distributions
of the ages of husbands and wives for 280 couples.
They indicate the central tendency, variability and
shape of the distribution of husband and wife ages.
They give no indication of the correlation between the 2 variables.
There is a strong linear correlation between the husband's and
the wife's age.
Eg: Correlation of variables X and Y
[Figure panels:]
Perfect positive correlation
Perfect negative correlation
No correlation between X and Y
Question: Describe the relationship between
variables A and C. Think of things these variables
could represent in real life.
Negative relationship between A and C.
There is a negative relationship between price
and quantity of the products that we buy.
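Pearson Correlation \( \rho \)
- An indicator of the strength of the linear relationship
between two variables.
Definition:
\[ \rho = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} \]
where the numerator is the covariance of X and Y, denoted cov(X,Y)
\[ = \frac{E[XY] - \mu_X \mu_Y}{\sqrt{E[X^2] - \mu_X^2} \, \sqrt{E[Y^2] - \mu_Y^2}} \]
\[ = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{\sqrt{\sum X^2 - \frac{(\sum X)^2}{N}} \, \sqrt{\sum Y^2 - \frac{(\sum Y)^2}{N}}} \]
If \( \mu_X = \mu_Y = 0 \), then \( \rho = \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}} \)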
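Computation of correlation based on a sample of size n
Sample covariance: \( \mathrm{cov}(X,Y) = \frac{1}{n - 1} \sum (X - \bar{X})(Y - \bar{Y}) \)
Correlation:
\[ r = \frac{\mathrm{cov}(X,Y)}{s_X s_Y} = \frac{\frac{1}{n - 1} \sum (X - \bar{X})(Y - \bar{Y})}{s_X s_Y} \]
\[ = \frac{\sum XY - \frac{\sum X \sum Y}{n}}{\sqrt{\sum X^2 - \frac{(\sum X)^2}{n}} \, \sqrt{\sum Y^2 - \frac{(\sum Y)^2}{n}}} \]
If \( \bar{X} = \bar{Y} = 0 \), then \( r = \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}} \)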
Given the data shown in the figure, is it appropriate to
use Pearson Correlation to describe the relationship
between X and Y?
[Figure: scatter plot of Y (0.00 to 160.00) against X (0 to 25)]
Eg: Given the data set, calculate σX, σY, cov(X,Y) and the
Pearson Correlation.
X 2 5 6 8 9
Y 8 5 2 4 1
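A possible way to carry out this calculation in Python is sketched below. It assumes the five pairs are treated as a whole population (dividing by N), in line with the σ notation in the question; with the sample formulas (dividing by n − 1) the standard deviations and covariance change, but the correlation is the same.

```python
from math import sqrt

X = [2, 5, 6, 8, 9]
Y = [8, 5, 2, 4, 1]
N = len(X)

mu_x = sum(X) / N
mu_y = sum(Y) / N

# Population standard deviations and covariance (assumption: divide by N).
sigma_x = sqrt(sum((x - mu_x) ** 2 for x in X) / N)
sigma_y = sqrt(sum((y - mu_y) ** 2 for y in Y) / N)
cov_xy = sum((x - mu_x) * (y - mu_y) for x, y in zip(X, Y)) / N

rho = cov_xy / (sigma_x * sigma_y)

print("sigma_X:", round(sigma_x, 3), "sigma_Y:", round(sigma_y, 3))
print("cov(X,Y):", round(cov_xy, 3), "rho:", round(rho, 3))
```

The negative covariance and correlation reflect that Y tends to fall as X rises in this data set.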
Properties of Correlation
- Value in the range of [−1, +1]
- Symmetric: correlation of X with Y
= correlation of Y with X
- Unaffected by linear transformations:
Correlation of Y with X
= correlation of Y with A X + B
where A and B are constants and A > 0 (a negative A reverses the sign)
Eg: If the correlation between weight (in
pounds) and height (in feet) is 0.58, find:
(a) the correlation between weight (in pounds)
and height (in yards)
(b) the correlation between weight (in
kilograms) and height (in meters).
The correlation for both (a) and (b) is still 0.58
because linear transformations do not affect the
value of Pearson’s correlation, and both of the
above instances are linear transformations.
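The invariance can be seen numerically. In the Python sketch below the weights and heights are hypothetical measurements invented for illustration (their correlation is not the 0.58 quoted in the question), and `pearson` is an illustrative helper, not a library function.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical measurements, invented only for this illustration.
weight_lb = [120, 150, 180, 200, 165]
height_ft = [5.0, 5.5, 6.0, 5.8, 5.6]

r_original = pearson(weight_lb, height_ft)

# (a) feet -> yards: divide heights by 3.
r_yards = pearson(weight_lb, [h / 3 for h in height_ft])

# (b) pounds -> kilograms and feet -> metres.
r_metric = pearson([w * 0.45359237 for w in weight_lb],
                   [h * 0.3048 for h in height_ft])

# A negative scale factor keeps the magnitude but reverses the sign.
r_flipped = pearson(weight_lb, [-h for h in height_ft])

print(r_original, r_yards, r_metric, r_flipped)
```

Unit conversions multiply by a positive constant, so they fall under the case where the correlation is unchanged.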
Variance Sum Law II
- Linear combination of 2 independent variables
X and Y
Variance of \( X \pm Y \): \( \sigma_{x \pm y}^2 = \sigma_x^2 + \sigma_y^2 \)
- If the variables X and Y are correlated
Variance of \( X \pm Y \): \( \sigma_{x \pm y}^2 = \sigma_x^2 + \sigma_y^2 \pm 2\rho\sigma_x\sigma_y \)
- For computation based on a sample
\( s_{x \pm y}^2 = s_x^2 + s_y^2 \pm 2 r s_x s_y \)
Eg: Students took 2 parts of a test, each worth 50 points. Part
A has a variance of 25, and Part B has a variance of 49.
The correlation between the test scores is 0.6.
(i) If the teacher adds the grades of the two parts together to
form a final test grade, what would the variance of the
final test grades be?
(ii) What would the variance of Part A - Part B be?
(i) Var (A + B) = 25 + 49 + 2*0.6*√25*√49
= 116
(ii) Var (A − B) = 25 + 49 - 2*0.6*√25*√49
= 32
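The same arithmetic can be wrapped in a small helper. The Python sketch below is illustrative only; the function name `variance_of_sum` and its `sign` parameter are made up for this example.

```python
from math import sqrt


def variance_of_sum(var_x, var_y, rho, sign=1):
    """Variance of X + Y (sign=1) or X - Y (sign=-1) given correlation rho."""
    return var_x + var_y + sign * 2 * rho * sqrt(var_x) * sqrt(var_y)


var_total = variance_of_sum(25, 49, 0.6, sign=1)    # Part A + Part B
var_diff = variance_of_sum(25, 49, 0.6, sign=-1)    # Part A - Part B
var_indep = variance_of_sum(25, 49, 0.0)            # uncorrelated case

print(round(var_total, 6), round(var_diff, 6), round(var_indep, 6))
```

With a correlation of zero the cross term vanishes and the formula reduces to the independent case, 25 + 49 = 74, for both the sum and the difference.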