3. Data Distribution
Chapter 1 - Central Tendency
Statistics
● Descriptive Statistics
○ Central Tendency
● Inferential Statistics
Central Tendency
● Part of descriptive statistics; the outcome is one value
● Definition: a descriptive summary of a dataset through a single value that reflects the center of the data distribution
● It tells us where the center of the distribution lies
Chapter 2 - Understanding Mean, Median, Mode
Central Tendency
Name     Age
John     33
Mark     28
Susan    25
Joe      46
…        …
Bob      29
● Average = Mean
○ the average age of employees in the data distribution (data set)
Statistics
● Descriptive Statistics
○ Central Tendency: Mean (= Average), Median, Mode
● Inferential Statistics
Calculating Mean
Name     Age
John     33
Mark     28
Susan    25
Joe      46
Ema      32
Bob      29
Keith    42
Julia    21
● Mean = (33 + 28 + 25 + 46 + 32 + 29 + 42 + 21) / 8
● Mean value = 32
Calculating Mean
Day         Sales
Sunday      9500
Monday      100
Tuesday     50
Wednesday   150
Thursday    100
Friday      150
Saturday    100
● Mean = (9500 + 100 + 50 + 150 + 100 + 150 + 100) / 7
● Mean value = 1450
○ more than 10 times most of the daily sales
Mean
● Average value of a data series
○ e.g. data[100, 200, 50, 150], then the mean is (100 + 200 + 50 + 150) / 4 = 125
● Outlier(s) in our data set can distort the mean value
○ e.g. data[9500, 100, 50, 350], then the mean is (9500 + 100 + 50 + 350) / 4 = 2500
○ in the above example, the mean value is far higher than most of the data in the dataset
● Formula
○ Mean = Σx / N
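The formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the names `arithmetic_mean`, `regular` and `with_outlier` are chosen here for the example, and the data are the two series from the bullets above.

```python
from statistics import mean

# Data series taken from the two examples above.
regular = [100, 200, 50, 150]
with_outlier = [9500, 100, 50, 350]


def arithmetic_mean(values):
    """Return sum(x) / N for a non-empty list of numbers."""
    return sum(values) / len(values)


print(arithmetic_mean(regular))       # 125.0
print(arithmetic_mean(with_outlier))  # 2500.0

# The standard library offers the same calculation.
print(mean(with_outlier))
```

A single value of 9500 is enough to pull the mean from the low hundreds up to 2500, which is the outlier effect described above.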
Median
● Middle value in our dataset
● The data must first be sorted from low to high
○ e.g. data[150, 50, 600, 200, 350]
○ sorted_data[50, 150, 200, 350, 600]
○ therefore, the median value is 200
● Formula (n is odd)
○ the {(n + 1) ÷ 2}th value of the sorted data
● If n is even, the median is calculated by averaging the two middle values
○ e.g. data[150, 50, 600, 200, 350, 100]
○ sorted_data[50, 100, 150, 200, 350, 600]
○ therefore, the median value is (150 + 200) / 2 = 175
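The odd and even cases above can be sketched as one small Python function. The function name `median` and its structure are assumptions made for this illustration; the data are the two examples from the bullets.

```python
def median(values):
    """Return the middle value of the data after sorting it low to high."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th value, i.e. index n // 2 when counting from 0.
        return ordered[mid]
    # Even n: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([150, 50, 600, 200, 350]))       # 200
print(median([150, 50, 600, 200, 350, 100]))  # 175.0
```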
Mode
● Value that appears most frequently in the data set
○ e.g. data[100, 50, 200, 50, 150]
○ therefore, the mode is 50, as it appears twice in the data set
● A single data set can have more than one mode
○ e.g. data[100, 50, 200, 50, 150, 100]
○ therefore, the modes are 50 and 100
● A data set has no mode if no value repeats
○ e.g. data[50, 100, 150, 200, 250, 300]
● Sorting is preferred, as it makes repeated values easier to spot
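The three cases above (one mode, several modes, no mode) can be sketched with a frequency count. This is a minimal illustration; the function name `modes` and the choice to return an empty list when no value repeats are assumptions that follow the convention used in these bullets.

```python
from collections import Counter


def modes(values):
    """Return every value with the highest count, or [] if no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # No value repeats, so by the convention above there is no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([100, 50, 200, 50, 150]))            # [50]
print(modes([100, 50, 200, 50, 150, 100]))       # [50, 100]
print(modes([50, 100, 150, 200, 250, 300]))      # []
```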
Chapter 3 - Understanding Weighted Mean
and Mean of Grouped Data
Weighted Mean
● Weighted: some parts of the data are more important than others
● The mean is calculated using the weight of each value
Weighted Mean - Example
Assessment Type      Score/Marks   Weight (%)
Mid-term Exam        95            15%
Practical Project    85            35%
Final Exam           82            50%
● The year-end score is computed as follows:
○ 15% of the overall score comes from the mid-term exam
○ 35% of the overall score comes from the practical project
○ 50% of the overall score comes from the final exam
Weighted Mean - Calculation
Assessment Type      Score/Marks   Weight   Weight x Score
Mid-term Exam        95            0.15     14.25
Practical Project    85            0.35     29.75
Final Exam           82            0.50     41.00
Grade Point                                 85.00
● To calculate the weighted mean
○ Multiply each value by its weight and sum the products
○ Sum the weights
○ Then divide the first sum by the second
● Formula
○ WA = ∑(w_i · x_i) / ∑w_i
○ w = weight of each value, x = data value
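The weighted-mean calculation above can be sketched in Python as follows. The names `weighted_mean`, `scores` and `weights` are illustrative and not from the slides; the numbers are the three assessments from the table. The result is rounded for display because decimal weights such as 0.15 are not stored exactly in floating point.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    weighted_sum = sum(w * x for w, x in zip(weights, values))
    return weighted_sum / sum(weights)


# Mid-term exam, practical project, final exam.
scores = [95, 85, 82]
weights = [0.15, 0.35, 0.50]

print(round(weighted_mean(scores, weights), 2))  # 85.0
```

Dividing by the sum of the weights means the same function also works when the weights do not add up to 1, for example when they are given as 15, 35 and 50.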
Mean of Grouped Data - Example
Frequency Distribution
Sales Group   No. of Days
0-2           11
3-5           8
6-8           5
9-11          3
12-14         1
15-17         2
Mean/Avg = 150/30 = 5
Mean of Grouped Data - Calculation
Sales Group   Midpoint (x)   No. of Days (frequency, f)   f.x
0-2           1              11                           11
3-5           4              8                            32
6-8           7              5                            35
9-11          10             3                            30
12-14         13             1                            13
15-17         16             2                            32
Total                        30                           153
● Formula
○ GM = ∑(f_i · x_i) / ∑f_i
○ f = frequency of each group, x = midpoint of each group
● Grouped Mean = 153 / 30 = 5.1
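The grouped-mean table above can be reproduced with a short Python sketch. The name `grouped_mean` and the representation of each group as a `((low, high), frequency)` pair are assumptions made for this example; the midpoint of each group is taken as `(low + high) / 2`, which matches the midpoints in the table.

```python
def grouped_mean(groups):
    """Return sum(f_i * x_i) / sum(f_i), where x_i is the midpoint of group i."""
    total_fx = sum(((low + high) / 2) * f for (low, high), f in groups)
    total_f = sum(f for _, f in groups)
    return total_fx / total_f


# (sales group bounds, number of days) from the frequency distribution above.
sales_groups = [
    ((0, 2), 11),
    ((3, 5), 8),
    ((6, 8), 5),
    ((9, 11), 3),
    ((12, 14), 1),
    ((15, 17), 2),
]

print(round(grouped_mean(sales_groups), 2))  # 5.1
```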
Chapter 4 - Variability
Statistics
● Descriptive Statistics
○ Central Tendency: Mean, Median, Mode
○ Variability: Range, Variance, Standard Deviation
● Inferential Statistics
Chapter 5 - Understanding Range, Variance
and Standard Deviation
Range - Example & Calculation
● the difference between the largest number and the smallest number
○ e.g. data[100, 50, 200, 50, 150]
○ therefore, range is 200 - 50 = 150
● outliers distort the range in the same way they distort the mean
○ e.g. data[100, 50, 9000, 50, 100]
○ therefore, range is 9000 - 50 = 8950
● Formula
○ Range = Largest number - Smallest number
● sorting is preferred as it helps visually
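The range formula above is a one-line calculation in Python. The function name `value_range` is chosen for this sketch (to avoid shadowing the built-in `range`); the data are the two examples from the bullets.

```python
def value_range(values):
    """Return the largest number minus the smallest number."""
    return max(values) - min(values)


print(value_range([100, 50, 200, 50, 150]))   # 150
print(value_range([100, 50, 9000, 50, 100]))  # 8950
```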
Variance - Calculation
● Calculating Variance
○ Step 1 - calculate the mean value
○ Step 2 - subtract the mean value from each data point
○ Step 3 - square each of the resulting differences
○ Step 4 - calculate the average of the squared values (divide their sum by n)
Variance - Example
● E.g. data[15, 17, 16, 14, 18, 16]
● Step 1
○ mean = (15 + 17 + 16 + 14 + 18 + 16) / 6 = 96 / 6 = 16
● Step 2 & 3
○ 15 - 16 = -1 => (-1)² = 1
○ 17 - 16 = 1 => (1)² = 1
○ 16 - 16 = 0 => (0)² = 0
○ 14 - 16 = -2 => (-2)² = 4
○ 18 - 16 = 2 => (2)² = 4
○ 16 - 16 = 0 => (0)² = 0
● Step 4
○ 1 + 1 + 0 + 4 + 4 + 0 = 10
○ n = 6, therefore VAR = 10 / 6 = 1.67
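The four steps above can be followed line by line in Python. This is a sketch under the assumption, taken from the worked example, that the sum of squares is divided by n (the population variance); the function name `population_variance` is chosen for this illustration. In the standard library, `statistics.pvariance` divides by n in the same way, whereas `statistics.variance` divides by n - 1 and would give a different number.

```python
from statistics import pvariance


def population_variance(values):
    """Variance computed with the four steps above (dividing by n)."""
    n = len(values)
    mean = sum(values) / n                    # Step 1: mean value
    deviations = [x - mean for x in values]   # Step 2: subtract the mean
    squared = [d ** 2 for d in deviations]    # Step 3: square each difference
    return sum(squared) / n                   # Step 4: average of the squares


data = [15, 17, 16, 14, 18, 16]

print(round(population_variance(data), 2))  # 1.67
print(round(pvariance(data), 2))            # 1.67
```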
Variance - Interpretation
● If the variance is small, the data points lie close to the mean
○ E.g. data[15, 17, 16, 14, 18, 16]
○ Variance = 1.67, Mean = 16
○ Since the variance is small, each data point is not far from the mean
● If the variance is large, the data points lie far from the mean
○ E.g. data[13, 3, 40, 12, 3, 25]
○ Variance = 170, Mean = 16
○ Since the variance is large, the data points are considered far from the mean
Standard Deviation - Interpretation
● The standard deviation shows how far the data points typically deviate from the mean
● Formula
○ SD = √Variance (the square root of the variance)
● E.g. data[15, 17, 16, 14, 18, 16]
○ VAR = 1.67
○ SD = √1.67 = 1.29
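As a sketch of the formula above, the standard deviation can be computed by taking the square root of the variance. The function name `population_std_dev` and the variable names `tight` and `spread` are assumptions for this example; the two data sets are the ones used in the variance interpretation, both with mean 16. The variance is again divided by n, as in the worked example.

```python
import math


def population_std_dev(values):
    """Square root of the variance, where the variance divides by n."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return math.sqrt(variance)


tight = [15, 17, 16, 14, 18, 16]   # variance 1.67
spread = [13, 3, 40, 12, 3, 25]    # variance 170

print(round(population_std_dev(tight), 2))   # 1.29
print(round(population_std_dev(spread), 2))  # 13.04
```

Unlike the variance, the standard deviation is expressed in the same unit as the data, so a value of about 13 against a mean of 16 directly signals a wide spread.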
Chapter 6 - Understanding Quartiles and
Interquartile Range
Quartiles
● Divide the data set into four equal segments after arranging in ascending order
Quartiles
● First quartile, denoted as Q1
○ splits off the lowest 25% of data from the highest 75%
● Second quartile, denoted as Q2
○ cuts data set in half, median value
● Third quartile, denoted as Q3
○ splits off the highest 25% of data from the lowest 75%
Quartiles - Calculating Q1, Q2, Q3
● Step 1: Arrange data in ascending order
● Step 2: Find the median value, i.e. Q2
● Step 3: Find the median value of lower half of the data set, i.e. Q1
● Step 4: Find the median value of upper half of the data set, i.e. Q3
Quartiles - Example
● E.g. data: 7, 18, 16, 10, 2, 5, 13, 11, 3
● Step 1: Arrange data in ascending order
○ 2, 3, 5, 7, 10, 11, 13, 16, 18
● Step 2: Find the median value, i.e. Q2
○ Median = 10 = Q2
● Step 3: Find the median value of the lower half of the data set (2, 3, 5, 7), i.e. Q1
○ Median Lower = (3+5)/2 = 4 = Q1
● Step 4: Find the median value of the upper half of the data set (11, 13, 16, 18), i.e. Q3
○ Median Upper = (13+16)/2 = 14.5 = Q3
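The four steps above can be sketched in Python as follows. The function names `median` and `quartiles` are chosen for this illustration. The sketch follows the worked example in leaving the overall median out of both halves when the number of values is odd, and it assumes at least two values. Other quartile conventions exist (for example, the interpolation used by `statistics.quantiles`), so library results may differ slightly.

```python
def median(ordered):
    """Median of a list that is already sorted in ascending order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves steps above."""
    ordered = sorted(values)                       # Step 1
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the middle value is excluded from both halves.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    q2 = median(ordered)                           # Step 2
    q1 = median(lower)                             # Step 3
    q3 = median(upper)                             # Step 4
    return q1, q2, q3


print(quartiles([7, 18, 16, 10, 2, 5, 13, 11, 3]))  # (4.0, 10, 14.5)
```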
Interquartile Range (IQR)
● Measures the spread of the middle half of the data set
○ IQR = Q3 - Q1
● Useful for spotting outliers
○ any value greater than Q3 + 1.5 × IQR, or
○ any value less than Q1 - 1.5 × IQR
Finding Outliers - Example
● Step 1: Arrange data in ascending order (sort first)
○ 11, 41, 44, 47, 51, 53, 57, 75
● Step 2: Find the median value, i.e. Q2
○ (47+51)/2 = 49
● Step 3: Find the median value of the lower half of the data set, i.e. Q1
○ (41+44)/2 = 42.5
● Step 4: Find the median value of the upper half of the data set, i.e. Q3
○ (53+57)/2 = 55
● IQR
○ 55 - 42.5 = 12.5
● Q3 + 1.5 IQR
○ 55 + (1.5 x 12.5) = 73.75
● Q1 - 1.5 IQR
○ 42.5 - (1.5 x 12.5) = 23.75
● Therefore, 75 (greater than 73.75) and 11 (less than 23.75) are outliers
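The outlier check above can be sketched in Python. The function names `median` and `iqr_outliers` are illustrative, and the quartiles are computed with the same median-of-halves approach used in the example, assuming at least two values; other quartile conventions would move the limits slightly.

```python
def median(ordered):
    """Median of a list that is already sorted in ascending order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def iqr_outliers(values):
    """Return (lower limit, upper limit, outliers) using the 1.5 x IQR rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    outliers = [x for x in ordered if x < lower_limit or x > upper_limit]
    return lower_limit, upper_limit, outliers


print(iqr_outliers([11, 41, 44, 47, 51, 53, 57, 75]))
# (23.75, 73.75, [11, 75])
```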