Module 2:
Statistical Foundation
Population & Sample
• Population: The entire set of individuals or
observations relevant to a particular study.
• Sample: A subset of the population selected for
analysis.
• Sampling is necessary when studying an entire
population is impractical due to time, cost, or
accessibility constraints.
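As a small illustration, the sketch below draws a simple random sample from a made-up population of ages. The population values, the sample size of 50, and the seed are assumptions chosen only for this example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable (an arbitrary choice)

# Hypothetical population: ages of 10,000 individuals (made-up data).
population = [random.randint(18, 80) for _ in range(10_000)]

# Simple random sample of 50 individuals, drawn without replacement.
sample = random.sample(population, k=50)

print("Population mean:", statistics.mean(population))
print("Sample mean:", statistics.mean(sample))
```

The sample mean will usually be close to, but not exactly equal to, the population mean.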
Variables
• Definition: A characteristic, number, or quantity that can be
measured or quantified.
• Types of Variables:
• Qualitative (Categorical) Variables: Describe non-numerical
characteristics (e.g., gender, education level)
• Quantitative Variables: Represent numerical data and can be further
divided into:
• Discrete Variables: Can take only specific values (e.g., number of
students in a class).
• Continuous Variables: Can take any value within a range (e.g., height,
weight).
Measures of central tendency
• Central tendency provides a summary of the dataset
using a single representative value.
• Mean (Arithmetic Average): Sum of all values divided
by the total number of values.
• Median: The middle value when data is arranged in
ascending or descending order.
• Mode: The most frequently occurring value in the dataset.
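A minimal sketch of the three measures, using Python's standard `statistics` module. The list of scores is made-up data used only for illustration.

```python
import statistics

# Hypothetical exam scores (made-up data for illustration).
scores = [60, 70, 85, 85, 100]

mean_score = statistics.mean(scores)      # sum of values / number of values
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequently occurring value

print("Mean:", mean_score)
print("Median:", median_score)
print("Mode:", mode_score)
```

Here the mean (80) is pulled below the median and mode (both 85) by the low score of 60, which shows why the three measures can disagree.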
Measures of dispersion
Probability distributions
• A probability distribution describes how probabilities are
distributed over the values of a random variable.
• Types of Probability Distributions
• Discrete Probability Distribution: Applies to discrete
variables (e.g., binomial distribution, Poisson distribution).
• Continuous Probability Distribution: Applies to continuous
variables (e.g., normal distribution, exponential distribution).
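The sketch below contrasts the two cases: a binomial probability for an exact count, and a normal probability for an interval. The coin-flip setting and the height parameters (mean 170 cm, standard deviation 10 cm) are assumptions made for the example, and `binomial_pmf` is a helper written here, not a library function.

```python
import math
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial variable with n trials and success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Discrete: probability of exactly 3 heads in 10 fair coin flips.
p_three_heads = binomial_pmf(3, 10, 0.5)

# Continuous: a single point has probability zero, so we ask about an interval.
# Hypothetical heights with mean 170 cm and standard deviation 10 cm.
heights = NormalDist(mu=170, sigma=10)
p_between = heights.cdf(180) - heights.cdf(160)

print(f"P(3 heads in 10 flips) = {p_three_heads:.4f}")
print(f"P(160 <= height <= 180) = {p_between:.4f}")
```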
Distribution
• A distribution is a mathematical description of an experiment's
possible outcomes and how often each of them can occur.
• Ex: Rolling a die is a random experiment.
• A die has six sides numbered from 1 to 6.
• Getting a 1 when you roll the die is an event, and its
probability is one out of six (1/6).
• Similarly, the probability is one-sixth for each of the other
values (2, 3, 4, 5, and 6).
• The probability of getting a 7 is zero, as it’s impossible to
get such a value.
Event Probability
1 1/6
2 1/6
3 1/6
4 1/6
5 1/6
6 1/6
When plotted using a histogram, the distribution takes a characteristic shape that
often helps you identify the distribution you are dealing with. In this case, every
bar has the same height, so you get a uniform distribution.
Therefore, using this probability distribution, you can know that the
possible values for a die roll are 1 to 6, and that every value in this
range has the same probability (in this case, it’s 1/6, which is roughly
0.17, i.e., 17%).
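The single-die distribution can be written down directly. This is a minimal sketch that assumes a fair die; `probability` is a helper name introduced here for illustration.

```python
from fractions import Fraction

# Each face of a fair six-sided die is equally likely.
faces = range(1, 7)
distribution = {face: Fraction(1, 6) for face in faces}

def probability(value):
    """Return P(roll == value); values outside 1..6 are impossible."""
    return distribution.get(value, Fraction(0))

for face in faces:
    print(face, distribution[face], round(float(distribution[face]), 2))

print("P(7) =", probability(7))
print("Total probability =", sum(distribution.values()))
```

The probabilities of all possible values add up to 1, which holds for any probability distribution.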
Every probability
distribution is
Frequency Distribution
• It records how often an event occurs. It is based on actual
observations.
Probability Distribution
• It records the likelihood that an event will occur. It is based on
a theoretical assumption of what should happen.
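To see the difference in practice, the sketch below simulates rolling a die and compares the observed frequencies with the theoretical probability of 1/6. The number of rolls and the seed are arbitrary assumptions.

```python
import random
from collections import Counter

random.seed(0)  # arbitrary seed for repeatability

n_rolls = 6_000  # hypothetical number of observed rolls
rolls = [random.randint(1, 6) for _ in range(n_rolls)]

# Frequency distribution: how often each face was actually observed.
frequency = Counter(rolls)

# Probability distribution: what theory says should happen for a fair die.
theoretical = 1 / 6

for face in range(1, 7):
    relative = frequency[face] / n_rolls
    print(f"face {face}: observed {frequency[face]} times "
          f"(relative frequency {relative:.3f}, theoretical {theoretical:.3f})")
```

With more rolls, the relative frequencies tend to move closer to the theoretical probabilities.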
Suppose you are dealing with two dice
now.
In this case, what will be the probability
of getting the sum of two dice as 2?
(1,1) (2,1) (3,1) (4,1) (5,1) (6,1)
(1,2) (2,2) (3,2) (4,2) (5,2) (6,2)
(1,3) (2,3) (3,3) (4,3) (5,3) (6,3)
(1,4) (2,4) (3,4) (4,4) (5,4) (6,4)
(1,5) (2,5) (3,5) (4,5) (5,5) (6,5)
(1,6) (2,6) (3,6) (4,6) (5,6) (6,6)
• To calculate the probability of each event, you need to look at
how many of the possible outcomes produce it.
• For example, the probability of getting a sum of 1 with two dice
is zero.
• The probability of getting a sum of 2 is 1/36, because this can
only happen when both dice return 1, and of the 36 possible
outcomes only one, (1,1), gives a sum of 2.
• Similarly, the probability of getting a sum of 3 is 2/36, because
only two of the 36 possible outcomes give a sum of 3: (1,2)
and (2,1).
• Therefore, if we know the denominator (the total number of
possible outcomes) and the numerator (the count of outcomes
for each event), we can calculate the probabilities.
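The counting argument above can be sketched in a few lines by enumerating all 36 outcomes. The variable names are illustrative, and two fair dice are assumed.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (first die, second die).
outcomes = list(product(range(1, 7), repeat=2))

# Numerator: how many outcomes produce each sum.
sum_counts = Counter(a + b for a, b in outcomes)

# Probability of a sum = favourable outcomes / total outcomes.
sum_probability = {
    total: Fraction(count, len(outcomes))
    for total, count in sorted(sum_counts.items())
}

for total, prob in sum_probability.items():
    print(f"sum {total:2d}: {sum_counts[total]}/36 = {float(prob):.3f}")
```

A sum of 1 never appears in the result, which matches its probability of zero, and the most likely sum is 7 with 6 of the 36 outcomes.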
The number of possible events and the probability of each event differ from one
experiment to another, so distributions take different shapes, as shown below
Common Types of Data
Discrete Data
• When you roll a die or pick a card from a deck, you have a
limited number of possible outcomes.
• This type of data is called discrete data: it can only take a
specified number of values.
• For example, when rolling a die, the specified values are
1, 2, 3, 4, 5, and 6.
• Suppose you count the number of boys in a class; since the
value is countable, it is discrete.
Continuous Data
• Continuous data is data that can take any value within a given range.
That range can be finite or infinite.
• Height, weight, temperature and length are all examples of continuous data.
• Some continuous data will change over time, such as the temperature in a
room throughout the day.
• A person’s height has infinitely many possible values within a given interval.
• Continuous data is measurable but not countable.
Types of Distribution
Distribution types can be divided into continuous and discrete distributions.
Normal distribution
• Of the different types of distributions, the most widely used
in statistics and data science is the normal distribution, also
known as the Gaussian distribution.
• A normal distribution is a symmetrical distribution with a
bell-shaped curve, where most values are clustered around the
center and taper off as you move away from it.
• A key property of the normal distribution is that its mean,
median, and mode are all equal.
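These properties can be checked with `NormalDist` from Python's standard `statistics` module. The parameters (mean 100, standard deviation 15) are assumptions chosen for the example.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 100, standard deviation 15.
dist = NormalDist(mu=100, sigma=15)

# Mean, median, and mode all coincide at 100.
print(dist.mean, dist.median, dist.mode)

# Symmetry: the density is the same at equal distances either side of the mean.
print(dist.pdf(85), dist.pdf(115))

# Tapering: the density falls as we move away from the center.
print(dist.pdf(100) > dist.pdf(115) > dist.pdf(130))
```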
Central Limit Theorem (CLT)
• Suppose you collect data from 100 individuals about their age and
calculate its mean.
• If you then repeat this process 1,000 times and plot these means,
what you get is a sampling distribution. (As a rule of thumb, each
sample should contain at least 30 observations for the CLT
approximation to hold well.)
• As per CLT, the mean of the sampling distribution equals the mean
of the population from which the samples have been drawn.
• Also, the sampling distribution will be approximately Gaussian
regardless of the distribution of the population, provided the
population has finite variance.
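The process described above can be simulated. This is a sketch under stated assumptions: the population is made-up (exponential with mean 40, so clearly not bell-shaped), and the seed is arbitrary.

```python
import random
import statistics

random.seed(1)  # arbitrary seed for repeatability

# Hypothetical, clearly non-normal population: exponential with mean 40.
population = [random.expovariate(1 / 40) for _ in range(100_000)]

sample_size = 100   # individuals per sample
n_samples = 1_000   # number of repeated samples

# Sampling distribution: the mean of each repeated sample.
sample_means = [
    statistics.fmean(random.sample(population, sample_size))
    for _ in range(n_samples)
]

print("Population mean:", round(statistics.fmean(population), 2))
print("Mean of sample means:", round(statistics.fmean(sample_means), 2))
print("Std. dev. of sample means:", round(statistics.stdev(sample_means), 2))
print("Population std. dev. / sqrt(n):",
      round(statistics.pstdev(population) / sample_size ** 0.5, 2))
```

The mean of the sample means should land close to the population mean, and their spread should be close to the population standard deviation divided by the square root of the sample size.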
The Gaussian distribution follows the 68-95-99.7 rule: about 68% of values lie within one
standard deviation of the mean, 95% within two, and 99.7% within three. This makes it easy
to estimate the probability of finding a value within a given range.
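The three percentages can be recovered from the cumulative distribution function of a standard normal distribution, as in this short sketch using the standard library.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within k standard deviations of the mean.
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, prob in within.items():
    print(f"within {k} standard deviation(s): {prob:.2%}")
```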
Understanding Data Distributions
• When analysing data, it's important to understand the
distribution of the data. The distribution refers to how
the data is spread out or clustered around certain
values or ranges.
• By examining the distribution, we can gain insights into
the characteristics and patterns of the data, which can
be useful in making informed decisions and predictions.
• There are various types of data distributions, each with
its own unique properties and implications.
• Understanding these distributions is a fundamental
aspect of data analysis and can help us make more
accurate and meaningful interpretations of the data.
Hypothesis testing & Significance levels
• Hypothesis testing is a statistical method used to make
decisions about population parameters based on sample data.
• Steps in Hypothesis Testing
1. State the Null (H₀) and Alternative (H₁) Hypothesis:
   1. H₀: No effect or no difference.
   2. H₁: Indicates a significant effect or difference.
2. Choose the Significance Level (α):
   1. Common values: 0.05 (5%) or 0.01 (1%).
3. Select the Appropriate Test:
   1. Z-test, t-test, chi-square test, etc.
4. Compute the Test Statistic:
   1. Compare with the critical value or use the p-value.
5. Make a Decision:
   1. If p-value < α, reject H₀.
   2. If p-value ≥ α, fail to reject H₀.
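The steps above can be sketched as a two-sided, one-sample Z-test. Everything specific here is an assumption made for illustration: the sample values are made-up, the null hypothesis is that the population mean equals 50, and the population standard deviation is taken as known (2.0), which is what justifies a Z-test rather than a t-test.

```python
from statistics import NormalDist, fmean

# Hypothetical sample (made-up data) and assumed known population parameters.
sample = [52.1, 49.8, 53.4, 51.2, 50.9, 54.0, 52.7, 51.5, 50.3, 53.1]
mu_0 = 50.0    # population mean under H0 (assumed)
sigma = 2.0    # known population standard deviation (assumed)
alpha = 0.05   # significance level

# Step 4: test statistic z = (sample mean - mu_0) / (sigma / sqrt(n)).
n = len(sample)
z = (fmean(sample) - mu_0) / (sigma / n ** 0.5)

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 5: decision.
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"z = {z:.3f}, p-value = {p_value:.4f}: {decision}")
```

For this made-up sample the p-value falls below 0.05, so H0 is rejected. With a different sample or a stricter significance level, the decision could change.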
Types of Errors in Hypothesis Testing