Continuous Distributions:
Normal Distribution
Real life and Normal Distribution
• Many real-life data points follow
Normal Distribution:
• People’s heights and weights
• Population blood pressure
• Test scores
• Also called the Gaussian distribution
• Phenomena that are less natural or man-made often do not follow a normal distribution, e.g. income of people, which is typically right-skewed
Normal Distribution: Key Features
• Symmetry: Perfect symmetry around the mean
• Mean = Median = Mode = Center point of the normal distribution
• Bell-shaped curve
• Empirical rule:
• About 68% of the data falls within Mean ± 1 SD
• About 95% of the data falls within Mean ± 2 SD
• About 99.7% of the data falls within Mean ± 3 SD
• Z-score (Standard score): Z = (Observation - Mean) / SD, i.e. how many standard deviations an observation lies from the mean. Useful for finding the relative position of an observation with respect to the overall population
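The empirical rule can be checked numerically. The following is a minimal sketch using only the Python standard library; the names `std_normal` and `share_within` are illustrative and do not come from the course script.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k SDs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"Mean ± {k} SD: {share_within(k):.1%}")
# Mean ± 1 SD: 68.3%
# Mean ± 2 SD: 95.4%
# Mean ± 3 SD: 99.7%
```

The 68 / 95 / 99.7 figures quoted in the rule are rounded versions of these values.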
Example: Student Heights (C:\code\Data Analytics\normal_distribution.py)
Z-Score Calculations
• Z-scores above +3 or below -3 are generally considered outliers
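A Z-score calculation with the outlier rule might look like the sketch below. The heights, the population mean and the SD are hypothetical values chosen for illustration; they are not the data used in normal_distribution.py.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

# Hypothetical population parameters and observations (illustration only).
POP_MEAN_CM = 165.0
POP_SD_CM = 7.0
heights_cm = [150, 158, 165, 172, 190]

z_scores = {h: z_score(h, POP_MEAN_CM, POP_SD_CM) for h in heights_cm}

# Apply the rule of thumb: |Z| > 3 marks an outlier.
outliers = [h for h, z in z_scores.items() if abs(z) > 3]

for h, z in z_scores.items():
    flag = " (outlier)" if abs(z) > 3 else ""
    print(f"{h} cm -> Z = {z:+.2f}{flag}")
```

With these assumed numbers only the 190 cm observation is flagged, since it lies more than 3 SDs above the mean.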
Z-Score: Problem
• A runner participated in a 200m race and a 500m race
• Using the data below, calculate her Z-scores and determine in which race she did better
Race   Average time   Standard deviation   Runner’s time
200m   31s            1.5s                 28s
500m   125s           8.2s                 132s
Z-Score Example
• Z-score = (Runner’s time - Average time) / Standard deviation
• 200m: (28 - 31) / 1.5 = -2.00
• 500m: (132 - 125) / 8.2 ≈ 0.85
• She did better in the 200m race: her time was 2 standard deviations faster than average, while her 500m time was slower than average
Visualizing Z-Scores
• In this example, a lower time is preferable when completing a race, so the lower Z-score is better
• In other examples, a positive/higher Z-score is better, e.g. marks obtained by a student, because the student would want to be above average
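The runner comparison can be sketched in Python using the figures from the race table. The dictionary layout and the names `races`, `runner_z` and `best_race` are assumptions made for this illustration.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

# Figures from the race table (all times in seconds).
races = {
    "200m": {"mean": 31.0, "sd": 1.5, "time": 28.0},
    "500m": {"mean": 125.0, "sd": 8.2, "time": 132.0},
}

runner_z = {
    name: z_score(r["time"], r["mean"], r["sd"]) for name, r in races.items()
}

# Lower times are better in a race, so the smallest Z-score wins.
best_race = min(runner_z, key=runner_z.get)

for name, z in runner_z.items():
    print(f"{name}: Z = {z:+.2f}")
print("Better race:", best_race)
```

For a measure where higher is better, such as marks, the same comparison would use `max` instead of `min`.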
Z-Score Interpretation
Normal Distribution and Probability
• Standard normal distribution = Normal distribution with mean of 0 and
standard deviation of 1
• Total area under the curve = 1
• Can be used to map a Z-score to a probability, i.e. an area under the curve
Understanding Z-Score and Area Under the Curve (Probability)
• Suppose Z-Score = 1.15
• Suppose Z-Score = -0.24
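For the two Z-scores above, the area under the standard normal curve can be looked up programmatically instead of in a Z-table. This is a minimal sketch using the standard library; `area_left` is an illustrative helper name.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def area_left(z):
    """Area under the standard normal curve to the left of z."""
    return std_normal.cdf(z)

for z in (1.15, -0.24):
    left = area_left(z)
    print(f"Z = {z:+.2f}: left area = {left:.4f}, right area = {1 - left:.4f}")
# Z = +1.15: left area = 0.8749, right area = 0.1251
# Z = -0.24: left area = 0.4052, right area = 0.5948
```

The left area multiplied by 100 is the percentile of the observation, so Z = 1.15 sits at roughly the 87th percentile.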
Student Example: Z-Scores, Probabilities,
Percentiles
Three Important Measurements
Is our Data Normally Distributed?
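• Shapiro-Wilk Test: a p-value > 0.05 means normality is not rejected, so the data can be treated as normally distributed (suited to data sizes <= 5000 rows; for larger data the p-value may be unreliable)
• QQ plot (Quantile-Quantile): for normally distributed data, the points lie close to a straight line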
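The idea behind the QQ plot can be sketched with the standard library alone: sort the data, pair each value with the matching quantile of the standard normal distribution, and measure how close the pairs are to a straight line. The data below is simulated, and `qq_points` and `pearson_r` are illustrative helpers. The Shapiro-Wilk test itself is not implemented here; in practice it is typically run with `scipy.stats.shapiro`.

```python
import random
from statistics import NormalDist, fmean

def qq_points(data):
    """Return (theoretical, observed) quantiles for a normal QQ plot."""
    n = len(data)
    observed = sorted(data)
    # Plotting positions (i - 0.5) / n stay strictly between 0 and 1.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, observed

def pearson_r(xs, ys):
    """Correlation coefficient; close to 1 means a near-straight line."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
normal_data = [rng.gauss(50, 10) for _ in range(500)]        # simulated
skewed_data = [rng.expovariate(1 / 50) for _ in range(500)]  # simulated

r_normal = pearson_r(*qq_points(normal_data))
r_skewed = pearson_r(*qq_points(skewed_data))
print(f"Normal data: r = {r_normal:.3f}")
print(f"Skewed data: r = {r_skewed:.3f}")
```

The normal sample should give a correlation very close to 1, while the right-skewed sample gives a visibly lower value, which corresponds to the points bending away from the straight line in the plot.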
The Central Limit Theorem (CLT)
• Problem: Suppose population data does not follow normal distribution (i.e.
it is left/right-skewed)
• Population -> Samples
• Example: Examination results of 10 lakh (1,000,000) students -> 500 samples of 100 students each
• For each sample, calculate the average marks (sample mean, x̄)
• Plot these sample means on a graph
• They will approximately follow a normal distribution: this is the Central Limit Theorem (CLT)
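• As a rule of thumb, each sample should contain at least 30 observations
• How many such samples? There is no fixed number
• Result: The sample means can be treated as normally distributed even though the original population is not; the population itself keeps its skewed shape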
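The CLT procedure can be simulated as a rough sketch. The population here is simulated from an exponential distribution (right-skewed) and is smaller than the 10 lakh example to keep the run fast; the seed and the `skewness` helper are assumptions made for this illustration.

```python
import random
from statistics import fmean, pstdev

def skewness(values):
    """Population skewness: about 0 for symmetric data, > 0 if right-skewed."""
    m = fmean(values)
    s = pstdev(values, mu=m)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Simulated right-skewed population (not normally distributed).
population = [rng.expovariate(1.0) for _ in range(100_000)]

# 500 samples of 100 observations each, one mean per sample.
sample_means = [fmean(rng.sample(population, 100)) for _ in range(500)]

print(f"Population mean:        {fmean(population):.3f}")
print(f"Mean of sample means:   {fmean(sample_means):.3f}")
print(f"Population skewness:    {skewness(population):.2f}")
print(f"Sample-means skewness:  {skewness(sample_means):.2f}")
```

The sample means cluster around the population mean and are far less skewed than the population, which is what the bell shape of the plotted sample means reflects.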
Figure (CLT): Population -> Sample 1, Sample 2, Sample 3, .., Sample n -> one sample mean per sample