Week5 BAM

This document provides an overview of topics to be covered in the Introduction to Statistical Analysis course. The weekly objectives include discussing continuous probability distribution functions like the uniform, exponential, and normal distributions. Key concepts are the means and variances of continuous distributions, the central limit theorem, and how to check if data is normally distributed. Examples will be done in Python. Continuous distributions are explored in more depth, including the uniform, exponential, and normal distributions. Their probability density functions and cumulative distribution functions are defined and examples are worked through.

Uploaded by

rajaayyappan317
Copyright
© © All Rights Reserved

BAM 1024

Introduction to Statistical Analysis


Weekly Course Objectives
● Go over the idea of continuous probability distribution
functions.
● Go over types of continuous probability distribution
○ Uniform
○ Exponential
○ Normal
● Means and Variances of continuous PDFs!
● Discuss the normal distribution.
● Explore the application of the normal distribution in statistics.
○ Central Limit Theorem
○ Finding probabilities for the normal distribution!
● How do we check if something is “Normally” distributed?
● Do some examples in Python!
Continuous vs Discrete PDFs
● Recall, numerical data can be discrete or continuous.
● Discrete probability distributions describe outcomes that can
be listed and counted. A discrete random variable, X, can take
on only a finite or countable set of values.
○ For example, the outcome of a die roll. We know it can
only be {1,2,3,4,5,6}. It can’t be anything else!
● Continuous probability distributions describe outcomes that
can fall anywhere in an interval containing infinitely many
values. In other words, a continuous random variable, X, can
take on any value within that interval.
○ For example, the heights observed in a classroom. We know
if two consecutive heights are 180.3 and 182.3, that there are
an infinite set of heights that can occur within that range!
For example 180.30001 is a possibility!
● Important to note that the continuous random variable X could
assume infinite values, so the probability of X taking on any
one specific value is zero. AKA P(X=x) = 0.
Continuous vs Discrete PDFs Illustration

● We can see here that discrete distributions place probability on separate, individual values
● Continuous distributions spread probability over infinitely many possible values
● This makes the difference clear: P(X=0), for example, is a positive probability in
the discrete case, and exactly zero in the continuous case
Uniform (Continuous) PDF
● Before, we spoke about the discrete case of the uniform PDF.
● There, every event has an equally likely chance of occurring over
the n possible outcomes.
● For the continuous case, we have some interval, between say a
and b, such that the density is constant at 1/(b-a) everywhere
on the interval.
Uniform (Continuous) PDF cont.
● Now, why is the density 1/(b-a)? Simple, we know the
total area underneath the curve must add up to 1.
● Since the random variable X can take on values between a and
b, the curve is a rectangle of width (b-a), so its height must be
1/(b-a) consistently across the range of values.
● The proof actually goes into how we find probabilities for
continuous distributions.
● Note, the probability of a specific value for a random variable X
that takes on some continuous distribution is 0. Therefore
P(X=x) = 0.
● However, you can find non-zero probabilities for
intervals!
○ I.e., P(X<x) > 0 , provided x is above the lower end of
the values X can take on
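A minimal Python sketch of the point above: intervals carry probability, but a single value does not. The helper name `uniform_interval_prob` and the interval [0, 10] are assumptions chosen for illustration; they are not from the slides.

```python
def uniform_interval_prob(lo, hi, a, b):
    """P(lo < X < hi) for X uniform on [a, b]: overlap width times the density 1/(b-a)."""
    overlap = max(0.0, min(hi, b) - max(lo, a))
    return overlap / (b - a)

a, b = 0.0, 10.0   # hypothetical interval that X lives on
x = 4.0            # the single value we zoom in on

# Shrink an interval around x: its probability shrinks with its width,
# and reaches exactly 0 when the interval collapses to the point x.
for half_width in (1.0, 0.1, 0.001, 0.0):
    p = uniform_interval_prob(x - half_width, x + half_width, a, b)
    print(f"half-width {half_width}: probability {p:.6f}")
```

The last iteration is the case P(X=x): an interval of width zero has zero area under the density.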
Uniform (Continuous) CDF.
● To find the area underneath the curve we must turn to
integration!
● Integration is the process of finding the area underneath
some function, f(x). The math process is looking for the
“opposite” of a derivative.
● This yields something called a cumulative distribution
function (CDF)!
● The CDF is a fast way of finding probabilities for known
continuous distributions, because the integral has already been
calculated for you!
● CDF is denoted by F(x), since f(x) is the PDF!
● For example, the CDF for the uniform (continuous)
distribution is:
● F(x) = ∫ₐˣ 1/(b-a) dt = x/(b-a) - a/(b-a) = (x-a)/(b-a), for a ≤ x ≤ b
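To see that the CDF really is "the integral already calculated for you", here is a small Python sketch that integrates the uniform density numerically and compares the result with (x-a)/(b-a). The function names, the midpoint-rule integrator, and the values a = 2, b = 6, x = 3.5 are illustrative assumptions.

```python
def uniform_pdf(x, a, b):
    """Density of the continuous uniform distribution on [a, b]."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    """Closed-form CDF: 0 below a, (x-a)/(b-a) on [a, b], 1 above b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def integrate(f, lo, hi, steps=10_000):
    """Approximate the area under f between lo and hi with a midpoint Riemann sum."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

a, b, x = 2.0, 6.0, 3.5   # hypothetical values
area = integrate(lambda t: uniform_pdf(t, a, b), a, x)
print(area, uniform_cdf(x, a, b))   # the two numbers should agree closely
```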
Uniform (Continuous) Summarized
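● More formally:
○ PDF: f(x) = 1/(b-a) for a ≤ x ≤ b, and f(x) = 0 otherwise
○ CDF: F(x) = 0 for x < a, F(x) = (x-a)/(b-a) for a ≤ x ≤ b,
and F(x) = 1 for x > b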
Uniform (Continuous) example
● Atinder is trying to guess randomly what the average grade in
his 1024 BAM course will be. He thinks it will range from 55%
to 75%. He doesn’t know exactly what it will be, but it will be
uniformly distributed over the range provided.
● Find the probability that the average grade is less than 60%
○ F(60) = (60-55)/(75-55) = 0.25
Uniform (Continuous) example 2
● Atinder is trying to guess randomly what the average grade in
his 1024 BAM course will be. He thinks it will range from 55%
to 75%. He doesn’t know exactly what it will be, but it will be
uniformly distributed over the range provided.
● Find the probability that the average grade is between 60% and
62.5%
○ P(60<X<62.5) = P(X<62.5) - P(X< 60)
○ Hold up… how?

[Figure: three panels illustrating P(X<62.5), P(X<60), and P(60<X<62.5)]


Uniform (Continuous) example 2 cont.
● P(60<X<62.5) = P(X<62.5) - P(X< 60)
= F(62.5) - F(60)
= (62.5-55)/(75-55) - (60-55)/(75-55)
= 0.375-0.25
= 0.125
Uniform (Continuous) example 3
● Atinder is trying to guess randomly what the average grade in
his 1024 BAM course will be. He thinks it will range from 55%
to 75%. He doesn’t know exactly what it will be, but it will be
uniformly distributed over the range provided.
● Find the probability that the average grade is more than 68%
○ P(68<X) = 1 - P(X< 68)
= 1 - F(68)
= 1 - (68-55)/(75-55)
= 1 - 0.65
= 0.35
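The three grade examples can be checked with a few lines of Python. This is a sketch, not the course's own code; the function name `uniform_cdf` is an assumption, while the range 55 to 75 comes from the examples.

```python
def uniform_cdf(x, a, b):
    """F(x) = (x - a) / (b - a), clamped to 0 below a and 1 above b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

a, b = 55.0, 75.0   # range of the average grade, from the examples

p_less_60 = uniform_cdf(60, a, b)                          # example 1
p_between = uniform_cdf(62.5, a, b) - uniform_cdf(60, a, b)  # example 2
p_more_68 = 1 - uniform_cdf(68, a, b)                      # example 3

print(f"P(X < 60)        = {p_less_60:.3f}")
print(f"P(60 < X < 62.5) = {p_between:.3f}")
print(f"P(X > 68)        = {p_more_68:.3f}")
```

All three cases reuse the same CDF: "less than" is F(x), "between" is a difference of two CDF values, and "more than" is 1 minus F(x).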
Exponential Distribution
● Recall, we mentioned something in the earlier lectures
regarding skewness.
● Most times in the real world, data is right skewed, meaning
most of the data is on the left hand side, and as you increase in
value, the probability of observing those events decreases
exponentially!
● This rate of decrease is given by a parameter called lambda,
denoted as λ, and must be greater than 0.
● Visually we get:
Exponential PDF and CDF
● The mathematical way of expressing this is:
○ f(x) = λe^(-λx), where x ∈ [0, ∞), and λ > 0
● Therefore, the CDF is:
○ F(x) = ∫₀ˣ λe^(-λt) dt = -e^(-λx) - (-e^(-λ(0)))
= -e^(-λx) - (-1)
= -e^(-λx) + 1
= 1 - e^(-λx)
● Again, note, I will never ask you to do these on a midterm. I am
providing proofs of the CDFs to assure you that you can
mathematically derive these yourself if required. You do not
need to memorize these formulas when you understand the
logic!
Exponential Distribution example
● Let X = amount of time (in minutes) a postal clerk spends with
his or her customer. The time is known to have an exponential
distribution with rate parameter λ equal to 4 per minute (so the
mean time is 1/λ = 0.25 minutes).
● Find the probability of the postal clerk spending less than 2.3
minutes with the customer.
○ We know that λ = 4.
○ P(X<2.3) = F(2.3) = 1 - e^(-4(2.3))
= 0.99989896
Exponential Distribution example 2
● Let X = amount of time (in minutes) a postal clerk spends with
his or her customer. The time is known to have an exponential
distribution with rate parameter λ equal to 4 per minute.
● Find the probability of the postal clerk spending more than 0.5
minutes with the customer.
○ We know that λ = 4.
○ P(X>0.5) = 1 - P(X<0.5) = 1 - F(0.5) = 1 - (1 - e^(-4(0.5)))
= e^(-2)
= 0.135335
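Both postal clerk examples follow directly from the exponential CDF, F(x) = 1 - e^(-λx). A minimal Python sketch, where the function name `exponential_cdf` is an assumption and λ = 4 is the value used in the examples:

```python
import math

def exponential_cdf(x, rate):
    """F(x) = 1 - e^(-rate * x) for x >= 0, and 0 for negative x."""
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0

rate = 4.0  # lambda from the examples

p_less = exponential_cdf(2.3, rate)        # P(X < 2.3)
p_more = 1.0 - exponential_cdf(0.5, rate)  # P(X > 0.5) = e^(-2)

print(f"P(X < 2.3) = {p_less:.8f}")
print(f"P(X > 0.5) = {p_more:.6f}")
```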
Normal Distributions
● The Normal Distribution (aka Gaussian distribution) is one of
the most commonly used distributions.
● The following are key facts about the normal distribution:
○ The mean, median and mode are the exact same.
○ The distribution is symmetric around the mean, µ.
○ Exactly 50% of the data lies on each side of the
mean.
○ About 68% of the data lies within one standard deviation of the
mean, and about 95% lies within two standard deviations.
● A big misconception is that most data in the world behaves
“normally” (is normally distributed). Actually, it is statistics
computed from the data, such as sample means, that tend to
follow a normal distribution.
○ Most data follows a long-tail distribution (data that is
skewed, typically to the right).
Normal Distribution PDF… kind of ugly
● The thing about the normal distribution is that it has 2
parameters! One for the mean, expressed as µ, and another
for the standard deviation, expressed by σ.
● µ is called the location parameter, and σ is called the scale
parameter. This is because the location parameter determines
where the data is centrally located ( the middle most value,
average ), and the scale parameter tells you how dispersed the
values are from the center (the dispersion parameter, the
standard deviation). This should ring a bell ;).
● But… the PDF is ugly;

f(x) = (1 / (σ√(2π))) · e^(-(x - µ)² / (2σ²)), -∞ < x < ∞
Normal Distribution PDF… visually nice :)

[Figure: plot of the normal PDF, f(x) = (1 / (σ√(2π))) · e^(-(x - µ)² / (2σ²)), -∞ < x < ∞]
Normal Distribution CDF ?!
● We’re in trouble… integrating this behemoth of a function by
hand is not possible. The integral has no simple closed-form
formula, so it is computed numerically. I.e., with software.
● To make matters more interesting, we know that µ and σ can
vary, leading to different CDFs.
● How do we account for this?
● We do something called a standard normal transformation.
● The standard normal is when the mean, µ, equals 0 and the
standard deviation, σ, equals 1.
○ I.e., µ = 0, σ = 1.
● The standard normal is given by the following transformation.
○ If X is normally distributed with mean µ and standard
deviation σ, the standard normal transformation is:
Z = (X-µ)/σ
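A quick Python check of what the transformation buys us: the probability computed from the original normal distribution equals the probability computed from the standard normal at the z value. This sketch uses `statistics.NormalDist` from the standard library; the values µ = 10, σ = 2 and x = 13 are assumptions for illustration.

```python
from statistics import NormalDist

mu, sigma = 10.0, 2.0   # hypothetical mean and standard deviation
x = 13.0                # hypothetical value of X

z = (x - mu) / sigma                     # standard normal transformation
p_direct = NormalDist(mu, sigma).cdf(x)  # P(X < x) on the original scale
p_standard = NormalDist().cdf(z)         # P(Z < z) on the standard normal

print(f"z = {z}")
print(f"P(X < {x}) = {p_direct:.4f}")
print(f"P(Z < {z}) = {p_standard:.4f}")
```

Because the two probabilities match, a single table or function for the standard normal is enough for every choice of µ and σ.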
Standard Normal Probability Charts

● The standard normal charts are given on the left.
● This chart only works for the standard normal.
● Meaning, you can calculate probabilities using the chart on
the left.
● If the mean and/or standard deviation was different, the
chart on the left would not work unless you did the z
transformation!
● This is why the z-transformation is so important!!!!
Normal Distributions: Standard Normal
● The standard normal distribution is when the mean is equal
to 0, and the standard deviation is equal to 1.
● You can standardize any set of values by the following
equation:
Z = (X - µ)/σ

● Standardizing your data is not only useful to find probabilities
for normal distributions, but also to scale your data (more on this
later).
Standard Normal Proof
● E(Z) = E((X - µ)/σ) = 1/σ ( E(X - µ) )
= 1/σ ( E(X) - E(µ) )
= 1/σ ( µ - µ )
= 0
● Var(Z) = Var((X - µ)/σ) = 1/σ² ( Var(X - µ) )
= 1/σ² ( Var(X) ), since subtracting the constant µ
does not change the variance
= 1/σ² ( σ² )
= 1
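The proof can also be checked by simulation. The sketch below draws values from a normal distribution, standardizes them, and confirms that the resulting z values have mean close to 0 and variance close to 1. The parameters µ = 50, σ = 8, the sample size, and the seed are all assumptions for illustration.

```python
import random
from statistics import fmean

rng = random.Random(0)      # fixed seed so the run is repeatable
mu, sigma = 50.0, 8.0       # hypothetical parameters of X

xs = [rng.gauss(mu, sigma) for _ in range(100_000)]
zs = [(x - mu) / sigma for x in xs]   # Z = (X - mu) / sigma

z_mean = fmean(zs)
z_var = fmean((z - z_mean) ** 2 for z in zs)

print(f"mean of Z     = {z_mean:.4f}")   # should be near 0
print(f"variance of Z = {z_var:.4f}")    # should be near 1
```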



Normal Distributions: Example
● For his brand new Banana phone, Atinder knows that the
length of time it takes the battery to recharge fully is normally
distributed with a mean of 3 hours and a standard deviation of
30 minutes. Atinder owns one of these phones and wants to
know the probability that the length of time will be between 2
and 2.5 hours.
● µ = 3, σ = 0.5

● P( 2 < X < 2.5 ) = ?


Normal Distributions: Example
● P( 2 < X < 2.5 ) = P( X < 2.5 ) - P( X < 2 )
= Φ((2.5 - 3)/0.5) - Φ((2 - 3)/0.5)
= Φ(-1) - Φ(-2)
= 0.1587 - 0.0228
= 0.1359
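The same answer can be reproduced in Python, both directly and through the z transformation. This is a sketch using `statistics.NormalDist` from the standard library; the variable names are assumptions, and the numbers (mean 3 hours, standard deviation 0.5 hours) come from the example.

```python
from statistics import NormalDist

recharge = NormalDist(mu=3.0, sigma=0.5)   # recharge time in hours

# Direct route: difference of two CDF values on the original scale.
p_direct = recharge.cdf(2.5) - recharge.cdf(2.0)

# Chart route: convert both bounds to z values, then use the standard normal.
z_hi = (2.5 - 3.0) / 0.5   # -1.0
z_lo = (2.0 - 3.0) / 0.5   # -2.0
p_chart = NormalDist().cdf(z_hi) - NormalDist().cdf(z_lo)

print(f"P(2 < X < 2.5) = {p_direct:.4f}")
print(f"via z values   = {p_chart:.4f}")
```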
Normal Distributions: Example 2
● The lifetime of Atinder’s phone has a normal distribution with
a mean of 24 months and standard deviation of 4 months. Find
the probability that his phone will last more than 30 months
● µ = 24, σ = 4
● P( 30 < X ) = ?

● P( 30 < X ) = 1 - P( X < 30 )
= 1 - Φ((30 - 24)/4)
= 1 - Φ(1.5)
= 1 - 0.9332
= 0.0668


Normal Distributions: Example 3
● Entry to a certain University is determined by a national test.
The scores on this test are normally distributed with a mean of
500 and a standard deviation of 100. Tom wants to be admitted
to this university and he knows that he must score better than
at least 70% of the students who took the test. Tom takes the
test and scores 585. Will he be admitted to this university?
● µ = 500, σ = 100

● P( X < x ) = 0.7
Normal Distributions: Example 3 cont.
● Here the tricky thing is we know what the probability is, we
just don’t know what value satisfies it within the parameters
provided.
● We can use software to find the inverse of this very easily!
● P( X < x ) = 0.7, so x must be about 552.44, which rounds up to 553.
● Since Tom scored 585, which is greater than 553, he will be admitted! Yay!
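"Using software to find the inverse" can look like the following Python sketch. It relies on `NormalDist.inv_cdf` from the standard library; the variable names are assumptions, and the numbers come from the example.

```python
from statistics import NormalDist

scores = NormalDist(mu=500, sigma=100)   # national test scores

# Inverse CDF: the score x such that P(X < x) = 0.70.
cutoff = scores.inv_cdf(0.70)

tom_score = 585
admitted = tom_score > cutoff

# Equivalent check: the share of students Tom outscored.
tom_percentile = scores.cdf(tom_score)

print(f"cutoff score     = {cutoff:.2f}")
print(f"Tom's percentile = {tom_percentile:.4f}")
print(f"admitted         = {admitted}")
```

Either comparison works: Tom's score is above the cutoff, or equivalently his percentile is above 0.70.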
Central Limit Theorem
● The Central Limit Theorem (CLT) is a theorem which means
that it is NOT a theory or just somebody's idea of the way
things work.
● As a theorem it ranks with the Pythagorean Theorem, or the
theorem that tells us that the angles of a triangle
must add to 180 degrees.
● These are facts of the ways of the world rigorously
demonstrated with mathematical precision and logic.
● The CLT is concerned with drawing finite samples of size n from a
population with a known mean, μ, and a known standard
deviation, σ.
● The conclusion is that if we collect samples of size n with a
"large enough n," calculate each sample's mean, and create a
histogram (distribution) of those means, then the resulting
distribution will tend to have an approximate normal
distribution.
Central Limit Theorem Visual
● We know the exponential distribution is not distributed like
the normal distribution.
● I.e., it is strongly right skewed rather than symmetric.

● But… this is where things get interesting.


Central Limit Theorem Visual cont.
● Say I took a sample of size 2, n=2, from any exponential
distribution
● I might get the numbers 1.2 and 2.1. The sample average being
1.65

● If I did this again, I might get 1.3 and 1.6. The sample average
being 1.45.
Central Limit Theorem Visual cont.
● If I did this 1 million times, and recorded each of the sample
means of the 1 million samples of size 2, I might get the
following:

● Note, the normal distribution is superimposed to help see.


Central Limit Theorem Visual cont.
● If I increase the sample to n=4, I might get this:
Central Limit Theorem Visual cont.
● If I increase the sample to n=20, I might get this:

● Hmm…. something is happening


Central Limit Theorem Visual cont.
● If I increase the sample to n=50, I might get this:

● Omg… that is definitely normally distributed now!


● In short, the CLT states that when we are sampling, the
distribution of the sample mean tends towards the normal
distribution as the sample size increases.
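The experiment described in these slides can be imitated in Python without any plots by tracking skewness: a normal distribution has skewness 0, so the skewness of the sample means should shrink toward 0 as n grows. This is a sketch under stated assumptions: the rate λ = 1, the 20,000 repetitions (instead of 1 million, to keep it fast), the seed, and the helper names are all illustrative.

```python
import random
from statistics import fmean

def skewness(values):
    """Average cubed z value; 0 for a perfectly symmetric distribution."""
    m = fmean(values)
    sd = fmean((v - m) ** 2 for v in values) ** 0.5
    return fmean(((v - m) / sd) ** 3 for v in values)

def sample_means(n, reps, rate, rng):
    """Draw `reps` samples of size n from Exponential(rate) and return their means."""
    return [fmean(rng.expovariate(rate) for _ in range(n)) for _ in range(reps)]

rng = random.Random(42)   # fixed seed so the run is repeatable
rate = 1.0                # hypothetical lambda
reps = 20_000             # number of samples drawn for each n

skew_by_n = {n: skewness(sample_means(n, reps, rate, rng)) for n in (2, 4, 20, 50)}

for n, s in skew_by_n.items():
    print(f"n = {n:>2}: skewness of sample means = {s:.3f}")
```

The skewness values should fall steadily as n goes from 2 to 50, which is the numerical counterpart of the histograms looking more and more like the superimposed normal curve.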
Central Limit Theorem Guidelines.
● In general, when n>30, we can apply the CLT and state that the
sample mean is approximately normally distributed
with mean μ and standard deviation σ/√(n).
○ In other words, μx̄ = μx and σx̄ = σx /√(n).
● Let me guess… you want proofs… okay…
○ Assume the observations X₁, X₂, …, Xₙ in a random sample are
independent and identically distributed (i.i.d.) copies of X.
○ E(X̄) = μx̄ = E( Σ X/n) = E ( (X₁ + X₂ + … + Xₙ)/n)
= E ( X₁/n + X₂/n + … + Xₙ/n)
= E ( X₁ / n) + E ( X₂ / n) + … + E ( Xₙ / n)
= E ( X / n ) + E ( X / n ) + … + E ( X / n ),
since i.i.d.,
= 1/n E(X) + 1/n E(X) + … + 1/n E(X)
= n (1/n) E(X)
= E(X)
= μx
Central Limit Theorem Guidelines cont.
● Similarly for the variance….
○ Var(X̄) = σ²x̄
= Var ( Σ X/n)
= Var ( (X₁ + X₂ + … + Xₙ)/n)
= Var ( X₁/n + X₂/n + … + Xₙ/n)
= Var ( X / n ) + Var ( X / n ) + … + Var ( X / n ),
since i.i.d. (independence lets the variance of a sum
split into a sum of variances),
= 1/n² Var ( X ) + 1/n² Var ( X ) + … + 1/n² Var ( X ),
using the fact that Var(aX) = a²Var(X)
= 1/n² (n Var ( X ) )
= Var ( X ) / n
= σ²x / n
● Taking the square root gets us what we see in the formula!
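The two results, mean μ and standard deviation σ/√(n) for the sample mean, can be checked by simulation. In this Python sketch the population is exponential with λ = 0.5 (so μ = 2 and σ = 2); that choice, the sample size n = 40, the number of repetitions, and the seed are assumptions for illustration.

```python
import math
import random
from statistics import fmean

rng = random.Random(7)   # fixed seed so the run is repeatable
rate = 0.5               # hypothetical lambda: population mean 2, population sd 2
n = 40                   # size of each sample
reps = 10_000            # number of samples drawn

pop_mean = 1 / rate
pop_sd = 1 / rate

means = [fmean(rng.expovariate(rate) for _ in range(n)) for _ in range(reps)]

mean_of_means = fmean(means)
sd_of_means = fmean((m - mean_of_means) ** 2 for m in means) ** 0.5

print(f"mean of sample means = {mean_of_means:.4f} (theory: {pop_mean})")
print(f"sd of sample means   = {sd_of_means:.4f} (theory: {pop_sd / math.sqrt(n):.4f})")
```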
Central Limit Theorem Math Notation

● Z = (X̄ - μ)/(σ/√(n)) → N(0, 1) as n → ∞
● The standardized sample mean, Z, approaches the standard normal
distribution as n tends to infinity.
Central Limit Theorem Example
● Suppose salaries at a very large company have a mean of
$62,000 and a standard deviation of $32,000.
● If a single employee is randomly selected, what is the
probability of their salary exceeding $66,000?
● Answer… we can’t do this because we don’t have enough
info! We know the mean and standard deviation of salaries,
but not the shape of their distribution.
○ P(X>66,000) = ?
● Recall, the CLT only works on the sample means! Meaning if
we took a sample of employees, we could use the CLT to find
probabilities for the sample mean!
Central Limit Theorem Example 2
● Suppose salaries at a very large company have a mean of
$62,000 and a standard deviation of $32,000.
● If 100 employees are randomly selected, what is the probability
their average salary exceeds $66,000?
● We can do this! We know that there are 100 employees!
Therefore the sampling distribution can be assumed to be
Normal!
● μx̄ = μx = $62,000
● σx̄ = σx / √(n) = $32,000 / 10 = $3,200
● P(X̄ > 66,000) = 1 - P(X̄ < 66,000)
= 1 - Φ((66,000 - 62,000)/3,200)
= 1 - Φ(1.25)
= 1 - 0.8944
= 0.1056
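The salary calculation translates into a few lines of Python. This is a sketch with assumed variable names; the numbers come from the example, and `NormalDist().cdf` plays the role of Φ.

```python
import math
from statistics import NormalDist

pop_mean = 62_000   # mean salary
pop_sd = 32_000     # standard deviation of salaries
n = 100             # employees in the sample

standard_error = pop_sd / math.sqrt(n)        # sd of the sample mean
z = (66_000 - pop_mean) / standard_error      # z value for an average of $66,000
p_exceeds = 1 - NormalDist().cdf(z)           # P(sample mean > 66,000)

print(f"standard error = {standard_error}")
print(f"z              = {z}")
print(f"P(mean > 66k)  = {p_exceeds:.4f}")
```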
Is our data really Normal?
● We make all these assumptions, but one of the most important
things to check is if our data is even normally distributed to
begin with!
● We turn to QQ plots, which stands for quantile-quantile plots.
● As the name suggests, we want to compare each quantile of our
data to the matching quantile of the distribution we think it
follows, and if the data truly is distributed according to that
distribution, they should match closely, 1-1.
○ This means that the plotted quantile pairs should form a straight line!
● But how do we do this?
QQ Plots Continued
● Begin by ranking your data in ascending order (smallest to
largest).
● Calculate the percentiles by taking the rank, subtracting 0.5
and then dividing by the total number of observations.
○ Percentile = (Rank - 0.5) / (n)
● Find the value according to the normal distribution that
achieves that percentile.
○ Ex: If a value is in the 10th percentile, we need to find what
value of the normal distribution (the standard normal
preferably) covers that probability.
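● Standardize each data point (subtract the sample mean and divide
by the sample standard deviation) to convert it to a z value.
● Compare your quantiles by plotting the data z values against the
normal z values!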
● An example will be done in Python!
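The steps above can be sketched in plain Python as follows. This version only computes the pairs of points that a QQ plot would display; it does not draw the plot. The function name `qq_points` and the ten height values are hypothetical, made up for illustration.

```python
from statistics import NormalDist, fmean, stdev

def qq_points(data):
    """Return (theoretical z, observed z) pairs for a normal QQ plot."""
    ordered = sorted(data)                 # step 1: rank the data
    n = len(ordered)
    mean, sd = fmean(ordered), stdev(ordered)
    standard_normal = NormalDist()
    points = []
    for rank, value in enumerate(ordered, start=1):
        percentile = (rank - 0.5) / n                        # step 2
        theoretical = standard_normal.inv_cdf(percentile)    # step 3
        observed = (value - mean) / sd                       # step 4
        points.append((theoretical, observed))
    return points

# Hypothetical sample of heights in cm.
heights = [168.2, 171.5, 173.0, 174.8, 176.1, 177.4, 179.0, 180.3, 182.3, 186.9]

for theoretical, observed in qq_points(heights):
    print(f"theoretical z = {theoretical:+.3f}   observed z = {observed:+.3f}")
```

If the data are roughly normal, the two columns should track each other, which is the same as the plotted points lying near a straight line.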
Homework
● The length of time a particular smartphone's battery lasts (in
months) follows an exponential distribution with a lambda
parameter of 1/10.
● What is the probability of the battery lasting more than 4
months?
● What is the probability of the battery lasting between 6 and 11
months?
Homework 2
● An experimental garden has sunflower plants. The plants are
being treated so they grow to unusual heights. Their heights are
normally distributed with an average height of 9.3 feet, with a
standard deviation of 0.5 feet.
● What is the probability of observing a single sunflower being
less than 10 feet?
● What is the probability of a single sunflower being between 10
and 12 feet?
Homework 3
● Let X denote whether or not a randomly selected individual
approves of the job the President is doing. The probability that
any random person approves of what the president is doing is
0.34. Say that we ask 1000 people if they approve of what the
president is doing.
● What is the probability that less than 370 people approve of
what he is doing?
● Hint? Click here.
Homework 3 Solution
● The trick here is to recognize that we are not looking for the
average but instead using the fact that the sample size is large
to instantly jump to the normal distribution!
● There is technically a proof for this, but we will avoid it for
now.
● If you used the hint in the last slide, you will realize;
● Y ~ Bin(n, p), which is approximately N(np, np(1-p)) for large n
● Therefore, Z = (Y-np)/√(np(1-p))
● With this you get that
P(Y < 370) = P(Z < (370-1000(0.34))/√(1000(0.34)(0.66)))
= P(Z < (370 - 340)/(14.9799) )
= P(Z < 2.003)
≈ 0.977
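To see how good the normal approximation is here, the Python sketch below computes it alongside the exact binomial probability. The helper `binom_pmf` is an assumption written for this illustration (it works through logarithms to avoid overflow); n = 1000 and p = 0.34 come from the homework.

```python
import math
from statistics import NormalDist

n, p = 1000, 0.34

# Normal approximation: Y is roughly N(np, np(1-p)).
mean = n * p
sd = math.sqrt(n * p * (1 - p))
z = (370 - mean) / sd
approx = NormalDist().cdf(z)

def binom_pmf(k, n, p):
    """P(Y = k) for Y ~ Bin(n, p), computed on the log scale for stability."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

# Exact answer: "less than 370" means Y takes a value from 0 to 369.
exact = sum(binom_pmf(k, n, p) for k in range(370))

print(f"sd     = {sd:.4f}")
print(f"z      = {z:.4f}")
print(f"approx = {approx:.4f}")
print(f"exact  = {exact:.4f}")
```

The two probabilities should be close, which is why jumping straight to the normal distribution is reasonable for a sample this large.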
Resources
● https://www.easycalculation.com/statistics/bell-curve-calculator.php
Thank you
