Module 2
Data and Sampling Distributions
Random sampling and bias
selection bias
sampling distribution of statistic
bootstrap
confidence intervals
data distributions: normal, long tailed, student’s-t,
binomial, Chi-square, F distribution, Poisson and related
distributions.
Random sampling and bias
Sampling bias shows up in statistics, the branch of mathematics that deals with
collecting and analyzing data. It arises when the way you collect a sample from a
population doesn't give every member of the population an equal chance of being
included.
What is Sampling Bias?
Sampling bias is a type of bias caused by selecting non-random data for
statistical analysis. This bias can skew the results and lead to incorrect
conclusions. It often occurs when the sample is not representative of the
population from which it was drawn.
In mathematics, it refers to the systematic error that occurs when certain
members of a population are more likely to be included in a sample than others.
This leads to a sample that is not representative of the population, which can
skew results and lead to incorrect conclusions.
Types of Sampling Bias
Below are the most common types of sampling bias:
Selection Bias - Some members of the population are systematically more likely to be included in the sample.
Survivorship Bias - Only surviving subjects are considered, leading to an overestimation of the success rate.
Undercoverage Bias - Some members of the population are inadequately represented in the sample.
Voluntary Response Bias - The sample consists of volunteers who choose to participate, often leading to a non-representative sample.
Non-response Bias - Individuals chosen for the sample are unwilling or unable to participate.
KEY TERMS FOR RANDOM SAMPLING
Sample - A subset from a larger data set.
Population - The larger data set or idea of a data set.
N (n) - The size of the population (sample).
Random sampling - Drawing elements into a sample at random.
Stratified sampling - Dividing the population into strata and randomly sampling from each stratum.
Simple random sample - The sample that results from random sampling without stratifying the population.
Bias - Systematic error.
Sample bias - A sample that misrepresents the population.
Random Sampling
Random sampling is a technique where each member of a population has an
equal chance of being selected. This method helps ensure that the sample is
representative of the population, thereby reducing sampling bias.
Example: Drawing names from a hat where each name has an equal chance of
being picked.
Stratified Sampling
Stratified sampling involves dividing the population into subgroups (strata)
based on a specific characteristic (e.g., age, gender, income level) and then
taking a random sample from each subgroup.
Example: In a survey on education, dividing the population into strata based on
educational level (e.g., high school, undergraduate, graduate) and then
randomly sampling from each stratum.
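As a rough illustration, here is a small R sketch (the students data frame, its columns, and the sample sizes are made up for illustration) contrasting simple random sampling with stratified sampling by educational level:

set.seed(42)
students <- data.frame(
  id    = 1:300,
  level = rep(c("high school", "undergraduate", "graduate"), each = 100)
)
# Simple random sample: every student has the same chance of selection
srs <- students[sample(nrow(students), 30), ]
# Stratified sample: draw 10 students at random from each educational level
strata <- split(students, students$level)
stratified <- do.call(rbind, lapply(strata, function(s) s[sample(nrow(s), 10), ]))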
Systematic Sampling
Systematic sampling involves selecting every nth member of the population
after a random starting point.
Example: Choosing every 10th person on a list after randomly selecting a
starting point between 1 and 10.
Cluster Sampling
Cluster sampling involves dividing the population into clusters, usually based
on geography or other natural groupings, and then randomly selecting entire
clusters for the sample.
Example: Dividing a city into districts and randomly selecting some districts,
then surveying all individuals within those districts.
Oversampling and Undersampling
Oversampling: Increasing the proportion of a particular subgroup within the
sample to ensure adequate representation.
Undersampling: Reducing the proportion of a dominant subgroup to
balance the sample.
Example: In a health study, oversampling a minority group to ensure their
health outcomes are adequately represented.
Practice Problems on Sampling Bias: Solved
Problem: A university wants to survey students about campus facilities. They
decide to survey students only in the library. What type of sampling bias
might this introduce?
Solution: This might introduce selection bias because students in the library may
not represent the views of all students on campus.
Problem: A company wants to understand the job satisfaction of its
employees but only surveys employees who have been with the company for
more than 5 years. What type of bias is this?
Solution: This introduces survivorship bias, as it ignores the opinions of newer
employees who may have different perspectives.
Problem: An online retailer sends out a customer satisfaction survey via
email, but only 10% of recipients respond. What type of bias could this lead
to?
Solution: This could lead to non-response bias, as the opinions of the 90% who
did not respond are not considered.
Problem: In a survey about a new product, only the first 100 customers who
bought the product are surveyed. What type of sampling bias might this
cause?
Solution: This might cause selection bias, as the first 100 customers might have
different views than those who purchase the product later.
Problem: A political poll is conducted by calling landline phones. What type
of bias might this introduce?
Solution: This might introduce undercoverage bias, as many younger people or
those in urban areas might only have cell phones.
Random Selection
In the era of big data, it is sometimes surprising that smaller is better. Time and effort spent on
random sampling not only reduce bias, but also allow greater attention to data exploration and
data quality. For example, missing data and outliers may contain useful information. It might be
prohibitively expensive to track down missing values or evaluate outliers in millions of records,
but doing so in a sample of several thousand records may be feasible. Data plotting and manual
inspection bog down if there is too much data.
So when are massive amounts of data needed?
The classic scenario for the value of big data is when the data is not only big, but sparse as well.
Consider the search queries received by Google, where columns are terms, rows are individual
search queries, and cell values are either 0 or 1, depending on whether a query contains a term.
The goal is to determine the best predicted search destination for a given query. There are over
150,000 words in the English language, and Google processes over 1 trillion queries per year.
This yields a huge matrix, the vast majority of whose entries are “0.”
This is a true big data problem — only when such enormous quantities of data are accumulated
can effective search results be returned for most queries. And the more data accumulates, the
better the results. For popular search terms this is not such a problem — effective data can be
found fairly quickly for the handful of extremely popular topics trending at a particular time. The
real value of modern search technology lies in the ability to return detailed and useful results for
a huge variety of search queries, including those that occur only with a frequency, say, of one in a
million.
Consider the search phrase “Ricky Ricardo and Little Red Riding Hood.” In the early days of the
internet, this query would probably have returned results on Ricky Ricardo the band leader, the
television show I Love Lucy in which he starred, and the children’s story Little Red Riding
Hood. Later, now that trillions of search queries have been accumulated, this search query
returns the exact I Love Lucy episode in which Ricky narrates, in dramatic fashion, the Little
Red Riding Hood story to his infant son in a comic mix of English and Spanish.
Sample Mean versus Population Mean
The symbol x̄ (pronounced "x-bar") is used to represent the mean of a sample from a population, whereas μ is used to represent the mean of a population. Why make the distinction? Information about samples is observed, and information about large populations is often inferred from smaller samples. Statisticians like to keep the two things separate in the symbology.
Selection Bias
Selection bias happens when the participants of a study don't represent the whole population you're trying to understand. This leads to errors and false conclusions, eroding your data's accuracy and reliability.
Imagine you're analyzing user behavior to improve a product but only look at
responses from your most active users. Your results will be skewed, showing more
engagement than what actually exists across your entire user base.
Spotting and correcting selection bias is essential for anyone involved in data
analysis and product development. Ensuring your data accurately reflects your
target population makes your findings reliable and valid, which is crucial for
making data-driven decisions.
1. Sampling bias
This occurs when the sample isn't representative of the target population.
Example: If you're surveying only premium users about their experience, the
insights won't reflect the opinions of lower tier users.
2. Self-selection bias
This happens when individuals volunteer to participate, potentially differing
significantly from those who don't. This often leads to overrepresenting more
engaged or motivated users, skewing the data.
Example: In a survey about exercise habits, those who choose to participate
might be more health-conscious than the general population, leading to
overestimated fitness levels.
3. Survivorship bias
This is seen when only those who "survived" a process are studied, ignoring those
who didn’t make it.
Example: Analyzing only the success stories of a product without considering
users who abandoned it early on gives an overly positive view of the product's
performance.
4. Attrition bias
This arises when participants drop out, and the remaining group is no longer
representative.
Example: If less satisfied users are more likely to stop responding to follow-up
surveys, the feedback will be overly positive.
5. Undercoverage bias
This occurs when some members of the intended population are inadequately
represented in the sample.
Example: If a survey about mobile app usage excludes older adults, the results
won't capture their user behavior.
6. Nonresponse bias
This happens when certain individuals or groups don't respond to surveys or data
collection efforts, leading to results that only reflect the views of those who did
respond.
Example: A customer satisfaction survey might miss the views of dissatisfied
customers who choose not to respond, skewing the results positively.
7. Recall bias
This arises when participants do not accurately remember past events or
experiences, often skewing the data based on their current feelings or beliefs.
Example: In surveys asking about past medical history, participants might forget
or misreport their past illnesses, affecting the accuracy of the data.
Understanding these types of bias is crucial for product and data teams to ensure
their analyses reflect the entire user base accurately. Ignoring these biases can lead
to product features that only cater to a segment of your users, missing broader
needs and opportunities.
Identifying selection bias in research
In research, it's important to scrutinize the methods used to select participants.
Ask yourself:
Were participants chosen randomly, or was there a non-random selection
process?
Does the sample accurately represent the target population in terms of
demographics, behavior, and other relevant factors?
Using study designs like cohort studies or case-control studies can help minimize
bias. Ensure your sample size is adequate and covers all relevant demographics to
avoid underrepresentation. If you're conducting a user experience study, make sure
to include users from different age groups, geographic locations, and levels of
engagement with your product.
Identifying selection bias in data analysis
When analyzing data, selection bias can distort your findings. Here are some steps
to identify and address this bias in your data analysis:
Examine data sources: Ensure that your data sources are comprehensive
and representative of your study population. For instance, if you're analyzing
customer feedback, include data from various channels like surveys, social
media, and support tickets.
Analyze participation patterns: Look for patterns in who is participating
and who isn't. Are certain user groups underrepresented in your data? This
can help you understand potential biases in your analysis.
Compare characteristics: Compare the characteristics of participants
versus non-participants. If there's a significant difference, your sample may
be biased. For example, if a significant portion of feedback comes from
power users, the data may not reflect the experiences of casual users.
Use statistical techniques: Employ statistical methods like regression
analysis to control for variables that could introduce bias. Propensity score
matching can also help ensure that your comparisons are between similar
groups, reducing the impact of selection bias.
Validation checks: Perform validation checks to compare included versus
excluded data. This might involve looking at demographic data, usage
patterns, or other relevant metrics to ensure your sample is representative.
By actively identifying and addressing selection bias, you can improve the
accuracy of your data analysis, leading to more reliable insights and better
decision-making.
Mitigating selection bias
Mitigating selection bias is crucial for ensuring that your research and data
analysis produce accurate and reliable results. Here are some strategies to help you
avoid selection bias at different stages of your study.
How to avoid selection bias
To avoid selection bias, it’s important to plan and execute your research design
carefully. This involves considering how participants are selected and ensuring that
your sample is representative of the entire population you wish to study. Here are
key steps to avoid selection bias:
Regression to the Mean
Regression to the mean refers to a phenomenon involving successive
measurements on a given variable: extreme observations tend to be followed by
more central ones. Attaching special focus and meaning to the extreme value can
lead to a form of selection bias. Sports fans are familiar with the “rookie of the
year, sophomore slump” phenomenon. Among the athletes who begin their career
in a given season (the rookie class), there is always one who performs better than
all the rest. Generally, this “rookie of the year” does not do as well in his second
year. Why not? In nearly all major sports, at least those played with a ball or puck,
there are two elements that play a role in overall performance:
Skill
Luck
Regression to the mean is a consequence of a particular form of selection bias.
When we select the rookie with the best performance, skill and good luck are
probably contributing. In his next season, the skill will still be there but, in most
cases, the luck will not, so his performance will decline — it will regress. The
phenomenon was first identified by Francis Galton in 1886 [Galton-1886], who
wrote of it in connection with genetic tendencies; for example, the children of
extremely tall men tend not to be as tall as their fathers (see Figure 2-5).
sampling distribution of statistic
What is Sampling Distribution?
A sampling distribution, also known as a finite-sample distribution, is the probability distribution of a statistic computed from random samples of a given population. It describes how spread out the values of that statistic will be across repeated samples from the population.
Since the population is usually too large to analyze in full, you can select smaller groups and sample or analyze them repeatedly. The gathered statistic is then used to calculate the likely occurrence, or probability, of an event.
Important Terminologies in Sampling Distribution
Some important terminologies related to sampling distribution are given below:
Statistic: A numerical summary of a sample, such as mean, median,
standard deviation, etc.
Parameter: A numerical summary of a population is often estimated using
sample statistics.
Sample: A subset of individuals or observations selected from a population.
Population: Entire group of individuals or observations that a study aims to
describe or draw conclusions about.
Sampling Distribution: Distribution of a statistic (e.g., mean, standard
deviation) across multiple samples taken from the same population.
Central Limit Theorem(CLT): A fundamental theorem in statistics stating
that the sampling distribution of the sample mean tends to be approximately
normal as the sample size increases, regardless of the shape of the
population distribution.
Standard Error: Standard deviation of a sampling distribution, representing
the variability of sample statistics around the population parameter.
Bias: Systematic error in estimation or inference, leading to a deviation of
the estimated statistic from the true population parameter.
Confidence Interval: A range of values calculated from sample data that is
likely to contain the population parameter with a certain level of confidence.
Sampling Method: Technique used to select a sample from a population,
such as simple random sampling, stratified sampling, cluster sampling, etc.
Inferential Statistics: Statistical methods and techniques used to draw
conclusions or make inferences about a population based on sample data.
Hypothesis Testing: A statistical method for making decisions or drawing
conclusions about a population parameter based on sample data and
assumptions about the population.
Factors Influencing Sampling Distribution
A sampling distribution's variability can be measured either by calculating the
standard deviation(also called the standard error of the mean), or by calculating the
population variance. Which one to choose depends on the context and the inferences you want to draw. Both measure the spread of data points in relation to the mean.
3 main factors influencing the variability of a sampling distribution are:
1. Number Observed in a Population: The symbol for this variable is "N." It
is the measure of observed activity in a given group of data.
2. Number Observed in Sample: The symbol for this variable is "n." It is the
measure of observed activity in a random sample of data that is part of the
larger grouping.
3. Method of Choosing Sample: How you chose the samples can account for
variability in some cases.
Types of Distributions
There are 3 main types of sampling distributions:
Sampling Distribution of Mean
Sampling Distribution of Proportion
T-Distribution
Sampling Distribution of Mean
The sampling distribution of the mean is the most common type of sampling distribution.
It focuses on calculating the mean or rather the average of every sample group
chosen from the population and plotting the data points. The graph shows a normal
distribution where the center is the mean of the sampling distribution, which
represents the mean of the entire population.
We take many random samples of a given size n from a population with mean µ
and standard deviation σ. Some sample means will be above the population mean µ
and some will be below, making up the sampling distribution.
For any population with mean µ and standard deviation σ:
Mean, or center of the sampling distribution of x̄, is equal to the population
mean, µ.
μx̄ = μ
There is no tendency for a sample mean to fall systematically above or below µ,
even if the distribution of the raw data is skewed. Thus, the mean of the sampling
distribution is an unbiased estimate of the population mean µ.
Sampling distribution of the sample mean
Standard deviation of the sampling distribution is σ/√n, where n is the
sample size.
σx̄ = σ/√n
Standard deviation of the sampling distribution measures how much the
sample statistic varies from sample to sample. It is smaller than the standard
deviation of the population by a factor of √n. Averages are less variable than
individual observations.
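A small R simulation sketch (the values μ = 50, σ = 10, and n = 25 are made up) illustrating both facts:

set.seed(1)
mu <- 50; sigma <- 10; n <- 25
# draw 10,000 samples of size n and record each sample mean
sample_means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))
mean(sample_means)   # close to mu = 50
sd(sample_means)     # close to sigma / sqrt(n) = 10 / 5 = 2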
sampling distribution of standard deviation
Sampling Distribution of Proportion
Sampling distribution of proportion focuses on proportions in a population. Here,
you select samples and calculate their corresponding proportions. The means of the
sample proportions from each group represent the proportion of the entire
population.
Formula for the sampling distribution of a proportion (often denoted as p̂) is:
p̂ = x/n
where:
p̂ is Sample Proportion
x is Number of "successes" or occurrences of Event of Interest in Sample
n is Sample Size
This formula calculates the proportion of occurrences of a certain event (e.g.,
success, positive outcome) within a sample.
T-Distribution
The t-distribution is used when working with a small sample, or with a population about which little is known. It is used to estimate the mean of the population and other statistics such as confidence intervals, statistical differences, and linear regression.
T-distribution uses a t-score to evaluate data that wouldn't be appropriate for a
normal distribution.
Formula for the t-score, denoted as t, is:
t = (x̄ − μ) / (s / √n)
where:
x̄ is Sample Mean
μ is Population Mean (or an estimate of it)
s is Sample Standard Deviation
n is Sample Size
This formula calculates the difference between the sample mean and the
population mean, scaled by the standard error of the sample mean. The t-score
helps to assess whether the observed difference between the sample and population
means is statistically significant.
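As a quick illustration in R (the sample values and the hypothesized mean mu0 = 100 are invented for this sketch):

x <- c(102, 98, 105, 110, 97, 103, 99, 108)   # hypothetical sample
mu0 <- 100                                    # hypothesized population mean
t_score <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
t_score
# t.test(x, mu = mu0) reports the same statistic together with a p-value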
Central Limit Theorem (CLT)
Central Limit Theorem is the most important theorem of Statistics.
The Central Limit Theorem states that:
When large samples (usually greater than thirty) are considered, the distribution of the sample arithmetic mean approaches the normal distribution, regardless of whether the underlying random variables were originally normally distributed or not.
According to the central limit theorem, if X1, X2, ..., Xn is a random sample of size n taken from a population with mean μ and variance σ², then the sampling distribution of the sample mean tends to a normal distribution with mean μ and variance σ²/n as the sample size becomes large.
This indicates that as the sample size increases, the spread of the sample means around the population mean decreases, with the standard deviation of the sample means shrinking proportionally to the square root of the sample size. The standardized variate is
Z = (x̄ − μ) / (σ/√n)
where,
z is z-score
x is Value being Standardized (either an individual data point or the sample
mean)
μ is Population Mean
σ is Population Standard Deviation
n is Sample Size
This formula quantifies how many standard deviations a data point (or sample
mean) is away from the population mean. Positive z-scores indicate values above
the mean, while negative z-scores indicate values below the mean. The variate Z follows the normal distribution with mean 0 and variance 1; that is, Z follows the standard normal distribution.
According to the central limit theorem, the sampling distribution of the sample mean approaches the normal distribution as the sample size becomes large (n > 30).
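A small R simulation sketch of this behavior: even when the population is strongly skewed (here an exponential population with mean 5, an arbitrary choice), the distribution of sample means for n = 40 looks approximately normal.

set.seed(2)
sample_means <- replicate(5000, mean(rexp(40, rate = 1/5)))
hist(sample_means, breaks = 50,
     main = "Sampling distribution of the mean (n = 40)")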
The Bootstrap
Bootstrap Method is a powerful statistical technique widely used in mathematics
for estimating the distribution of a statistic by resampling with replacement from
the original data.
Bootstrap Method or Bootstrapping is a statistical procedure that resamples
a single data set to create many simulated samples. This process allows for
the calculation of standard errors, confidence intervals, and hypothesis testing
according to a post on bootstrapping statistics from statistician Jim Frost.
Bootstrapping is a resampling technique used to estimate population statistics by
sampling from a dataset with replacement. It can be used to estimate summary
statistics such as the mean and standard deviation. It is used in applied machine
learning to estimate the quality of a machine learning model at predicting
data that is not included in the training data.
“Bootstrapping is a statistical procedure that resamples a
single data set to create many simulated samples.”
How Bootstrapping Works
In the bootstrap method, a sample of size n is drawn from a population. We'll
call this sample S. Then, rather than using theory to determine all possible
estimates, a sampling distribution is created by resampling observations from
S with replacement m times, where each resampled set contains n observations.
With proper sampling, S will be representative of the population. Thus, by
resampling S m times with replacement, it is as if m samples were drawn from the
original population, and the derived estimates will represent the theoretical
distribution from the traditional approach.
Increasing the number of replicate samples m does not increase the information
content of the data; that is, resampling the original dataset 100,000 times is not as
useful as resampling it 1,000 times. The information content of a dataset depends
on the sample size n, which remains constant for each replicate sample. Thus, the
benefit of a larger number of replicate samples is that they provide a more accurate
estimate of the sampling distribution.
Conceptually, you can imagine the bootstrap as replicating the original sample
thousands or millions of times so that you have a hypothetical population that
embodies all the knowledge from your original sample (it’s just larger). You can
then draw samples from this hypothetical population for the purpose of estimating
a sampling distribution. See Figure 2-7.
In practice, it is not necessary to actually replicate the sample a huge number of
times. We simply replace each observation after each draw; that is, we sample with
replacement. In this way we effectively create an infinite population in which the
probability of an element being drawn remains unchanged from draw to draw. The
algorithm for a bootstrap resampling of the mean is as follows, for a sample of size
n:
1. Draw a sample value, record, replace it.
2. Repeat n times.
3. Record the mean of the n resampled values.
4. Repeat steps 1–3 R times.
5. Use the R results to:
a. Calculate their standard deviation (this estimates sample mean standard error).
b. Produce a histogram or boxplot.
c. Find a confidence interval.
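A minimal base-R sketch of these steps, using a made-up sample x and R = 1000 (both illustrative choices, not values from the text):

set.seed(3)
x <- rnorm(50, mean = 100, sd = 15)    # stand-in for the observed sample
R <- 1000                              # number of bootstrap iterations
boot_means <- replicate(R, mean(sample(x, size = length(x), replace = TRUE)))
sd(boot_means)                         # a. estimate of the standard error of the mean
hist(boot_means)                       # b. histogram of the resampled means
quantile(boot_means, c(0.05, 0.95))    # c. a 90% bootstrap confidence interval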
R, the number of iterations of the bootstrap, is set somewhat arbitrarily. The more
iterations you do, the more accurate the estimate of the standard error, or the
confidence interval. The result from this procedure is a bootstrap set of sample
statistics or estimated model parameters, which you can then examine to see how
variable they are.
The R package boot combines these steps in one function. For example, the
following applies the bootstrap to the incomes of people taking out loans:
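A sketch of such a call, assuming loans_income is a numeric vector of applicants' incomes (not defined here) and that the boot package is installed:

library(boot)
# the statistic function must accept the data and a vector of resampled indices
stat_fun <- function(x, idx) mean(x[idx])
boot_obj <- boot(loans_income, statistic = stat_fun, R = 1000)
boot_obj   # prints the original estimate, the estimated bias, and the standard error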
OR
Bootstrap Method
Bootstrap Method or Bootstrapping is a statistical technique for estimating an
entire population quantity by averaging estimates from multiple smaller data
samples. Importantly, each sample is created by drawing observations one at a time from the larger data sample and returning each observation after it is drawn.
This allows a given observation to be included multiple times in a given smaller
sample. This sampling technique is called sampling with replacement.
The process of creating a bootstrap sample can be summarized as follows:
1. Choose a sample size.
2. While the sample is smaller than the chosen size, randomly select an observation from the dataset and add it to the sample.
Bootstrapping can then be used to estimate a population quantity by repeatedly taking such samples, computing statistics, and averaging the computed statistics. The procedure can be summarized as follows:
1. Choose the number of bootstrap samples to take.
2. Choose the sample size.
3. For each bootstrap sample, draw a sample of the chosen size with replacement.
4. Calculate the statistic for each of these samples.
5. Calculate the average of the computed sample statistics.
The bootstrap method is also a suitable way to check and control the stability of the results. Although for most problems it is impossible to know the true confidence interval, the bootstrap is asymptotically more accurate than the standard intervals obtained using the sample variance and assumptions of normality.
Differences between Bootstrap method and Traditional Hypothesis Testing
The main differences between bootstrapping and traditional hypothesis testing are summarized below:
Assumptions: Traditional hypothesis testing relies on the assumption that the data follows a specific probability distribution (e.g., normal distribution) and makes assumptions about the population parameters (e.g., mean, variance). Bootstrapping is a non-parametric method that makes no assumptions about the underlying probability distribution of the data; it relies on resampling from the original data to estimate the sampling distribution of a statistic.
Sampling distribution: In traditional hypothesis testing, the sampling distribution is derived from theoretical probability distributions (e.g., t-distribution, F-distribution) based on the assumptions made about the population. In bootstrapping, the sampling distribution is approximated by repeatedly resampling from the original data with replacement, creating multiple bootstrap samples.
Robustness: Traditional hypothesis testing can be sensitive to violations of the underlying assumptions (e.g., non-normality, heteroscedasticity). Bootstrapping is generally more robust to departures from assumptions and can be applied to a wider range of data situations, including non-normal distributions and complex models.
Interpretation: Traditional hypothesis testing provides p-values and confidence intervals based on theoretical distributions, which are widely understood and interpreted. Bootstrapping provides confidence intervals and hypothesis tests based on the empirical sampling distribution, which may be less intuitive to interpret for some users.
Example of samples created using Bootstrap method
Example of how bootstrap samples are created and used to estimate a
statistic of interest.
Solution:
Let's say we have a small dataset of 5 observations:
Original Data: [3, 4, 5, 6, 7]
Create bootstrap samples by resampling with replacement:
We'll create 3 bootstrap samples of size 5 by randomly drawing observations from
the original data with replacement.
Each bootstrap sample will have the same size as the original dataset.
Bootstrap Sample 1: [5, 6, 3, 4, 7]
Bootstrap Sample 2: [4, 3, 6, 4, 6]
Bootstrap Sample 3: [7, 5, 7, 3, 4]
Calculate the statistic of interest (median) for each bootstrap sample:
Bootstrap Sample 1 median: 5
Bootstrap Sample 2 median: 4
Bootstrap Sample 3 median: 5
Repeat steps 1 and 2 many times (e.g., 10,000 times):
By repeating the process of creating bootstrap samples and calculating the
median, we can build an empirical sampling distribution of the median.
Use the empirical sampling distribution to calculate confidence intervals or
perform hypothesis tests:
For example, if we want to construct a 95% confidence interval for the median,
we can find the 2.5th and 97.5th percentiles of the empirical sampling distribution
of the median.
Let's say the 2.5th percentile is 4, and the 97.5th percentile is 6.
Then, the 95% confidence interval for the median would be [4, 6].
In this example, we used bootstrapping to estimate the median by resampling
from the original data multiple times and calculating the statistic of interest
(median) for each bootstrap sample. By repeating this process many times, we can
build an empirical sampling distribution of the median, which can be used to
construct confidence intervals or perform hypothesis tests without relying on
assumptions about the underlying population distribution.
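A short R sketch of this worked example (the percentile values obtained will vary slightly with the random draws):

set.seed(4)                                # any seed; for reproducibility only
original <- c(3, 4, 5, 6, 7)
boot_medians <- replicate(10000,
                          median(sample(original, size = 5, replace = TRUE)))
quantile(boot_medians, c(0.025, 0.975))    # empirical 95% interval for the median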
Advantages of Bootstrap Method
Bootstrap method offers several key advantages that make it a valuable tool in
statistical analysis and mathematical research:
1. Non-parametric Nature: The Bootstrap method does not rely on
assumptions about the underlying distribution of the data. This makes it
particularly useful when dealing with complex or unknown distributions,
allowing for more flexible and robust statistical analysis.
2. Versatility: It can be applied to a wide range of statistical measures,
including means, medians, variances, and regression coefficients. This
versatility extends to various types of data, whether continuous, discrete, or
categorical.
3. Accuracy in Small Samples: In cases where sample sizes are small,
traditional methods may not provide reliable estimates. The Bootstrap
method can improve the accuracy of these estimates by effectively
increasing the sample size through resampling.
4. Simple Implementation: The Bootstrap method is straightforward to
implement using modern computational tools. It involves repeated
resampling and can be easily programmed, making it accessible for
researchers and analysts.
5. Internal Validation: By generating multiple resampled datasets, the
Bootstrap method allows for internal validation of statistical models. This
helps in assessing the stability and reliability of the models without the need
for additional external data.
6. Confidence Interval Estimation: The Bootstrap method is particularly
effective for constructing confidence intervals for various statistics. This
provides a clearer understanding of the precision and variability of the
estimates, which is crucial for decision-making and hypothesis testing.
7. Handling Complex Data Structures: The Bootstrap method is capable of
dealing with complex data structures, such as time-series data or data with
hierarchical relationships. This adaptability makes it suitable for a broad
range of applications across different fields.
Limitations of Bootstrap Methods
Various limitations of Bootstrap Methods are:
Time-Consuming: Accurate bootstrap requires thousands of simulated
samples.
Computationally Intensive: Because bootstrap requires thousands of
samples and is time-consuming, it also requires more computing power.
Sometimes Incompatible: Bootstrapping is not always the best solution for
your situation, especially when dealing with spatial data or time series.
Prone to Bias: Bootstrapping does not always take into account the
variability of the distribution, which introduces errors and bias into your
calculations.
Confidence Interval
A confidence interval is a range of values used to estimate an unknown
population parameter, such as the mean, proportion, or regression
coefficient. The confidence interval is calculated from a given set of sample data
and is constructed in a way that it has a specified probability of containing the true
population parameter.
The level of confidence (usually expressed as a percentage) is the complement of
the significance level, which represents the probability that the confidence interval
does not contain the true population parameter. For example, a 95% confidence
interval implies that if the process of computing the confidence interval is
repeated multiple times on different samples from the same population, 95%
of the computed intervals will contain the true population parameter.
The width of the confidence interval provides an estimate of the precision or uncertainty associated with the sample estimate. A narrower confidence interval indicates higher precision, while a wider interval suggests greater uncertainty. For a 90% bootstrap confidence interval, for example, we split 100% − 90% = 10% in half and trim 5% from each tail, so that the interval covers the middle 90% of all of the bootstrap sample means.
Given a sample of size n, and a sample statistic of interest, the algorithm for a
bootstrap confidence interval is as follows:
1. Draw a random sample of size n with replacement from the data (a resample).
2. Record the statistic of interest for the resample.
3. Repeat steps 1–2 many (R) times.
4. For an x% confidence interval, trim [(100 − x) / 2]% of the R resample results from either end of the distribution.
5. The trim points are the endpoints of an x% bootstrap confidence interval.
Figure 2-9 shows a 90% confidence interval for the mean annual income of loan
applicants, based on a sample of 20 for which the mean was $57,573.
The bootstrap is a general tool that can be used to generate confidence intervals for
most statistics, or model parameters. Statistical textbooks and software, with roots
in over a half-century of computerless statistical analysis, will also reference
confidence intervals generated by formulas, especially the t-distribution.
Example of Using Bootstrapping to Create Confidence Intervals
Solution:
Let's say we have a small sample of data representing the heights (in inches) of 10
individuals:
Heights = [65.2, 67.1, 68.5, 69.3, 70.0, 71.2, 72.4, 73.1, 74.5, 75.8]
We want to estimate the 95% confidence interval for the mean height in the
population using bootstrapping.
Here are the steps we would follow:
Calculate the sample mean from the original data:
Sample mean = (65.2 + 67.1 + 68.5 + 69.3 + 70.0 + 71.2 + 72.4 + 73.1 + 74.5 +
75.8) / 10 = 70.71 inches
Create a large number of bootstrap samples from the original data by resampling
with replacement. For example, let's create 10,000 bootstrap samples, each of size
10.
For each bootstrap sample, calculate the mean height.
After computing the means for all 10,000 bootstrap samples, we now have an
empirical bootstrap sampling distribution of the mean.
From this empirical bootstrap sampling distribution, we can determine the 95%
confidence interval by finding the 2.5th and 97.5th percentiles of the distribution.
Let's say the 2.5th percentile is 69.8 inches, and the 97.5th percentile is 71.6
inches.
Then, the 95% confidence interval for the mean height is [69.8, 71.6] inches.
This confidence interval means that if we were to repeat the process of taking a
sample of size 10 and constructing a bootstrap confidence interval many times,
95% of those intervals would contain the true population mean height.
The key advantage of bootstrapping in this example is that it does not require any
assumptions about the underlying distribution of heights in the population. It relies
solely on the information contained in the original sample data.
Data distributions
Normal Distribution
Normal distribution, also known as the Gaussian distribution, is a continuous
probability distribution that is symmetric about the mean, depicting that data
near the mean are more frequent in occurrence than data far from the mean.
We define the normal distribution through the probability density function of a continuous random variable. Suppose f(x) is the probability density function of a continuous random variable X, so that f(x) dx gives the probability that X lies in the small interval (x, x + dx). Then f(x) satisfies:
f(x) ≥ 0 for all x ∈ (−∞, +∞)
∫ f(x) dx = 1, where the integral is taken from −∞ to +∞
We observe that the curve traced by the upper values of the Normal Distribution is
in the shape of a Bell, hence Normal Distribution is also called the "Bell Curve".
Normal Distribution Characteristics
Symmetry: The normal distribution is symmetric around its mean. This
means the left side of the distribution mirrors the right side.
Mean, Median, and Mode: In a normal distribution, the mean, median, and
mode are all equal and located at the center of the distribution.
Bell-shaped Curve: The curve is bell-shaped, indicating that most of the
observations cluster around the central peak and the probabilities for values
further away from the mean taper off equally in both directions.
Standard Deviation: The spread of the distribution is determined by the
standard deviation. About 68% of the data falls within one standard
deviation of the mean, 95% within two standard deviations, and 99.7%
within three standard deviations.
Normal Distribution Examples
We can draw Normal Distribution for various types of data that include,
Distribution of Height of People.
Distribution of Errors in any Measurement.
Distribution of Blood Pressure of any Patient, etc.
Normal Distribution Formula - Probability Density Function (PDF)
The formula for the probability density function of Normal Distribution (Gaussian
Distribution) is:
f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))
where,
x is Random Variable
μ is Mean
σ is Standard Deviation
Normal Distribution Curve
In any normal distribution, the random variable is the quantity that takes on the (unknown) values governed by the distribution; in practice its observed values are generally bounded by a range.
For example, consider the distribution of the heights of students in a class: the random variable (height) can take any value, but it is physically bounded, say between 2 ft and 6 ft.
In principle, however, the range of a normal distribution is infinite; the distribution is not restricted by its range, which extends from −∞ to +∞.
The bell curve still exists over this range: the variables are continuous, and their distribution is called a normal distribution because most of the values lie close to the mean. The graph or curve of this distribution is called the normal distribution curve or normal distribution graph.
Normal Distribution Graph
Studying the graph, it is clear that the empirical rule divides the data broadly into three parts; for this reason the empirical rule is also called the "68 - 95 - 99.7" rule. The curve is perfectly symmetric around the mean (μ), which is located at the center and marks the highest point of the curve. This mean represents the average value of the data set. The normal distribution is commonly used in real-world statistics to represent things like test scores, heights, and measurement errors, where most of the values cluster around the average and extreme values are less common.
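The three percentages of the empirical rule can be checked directly with the standard normal cumulative distribution function in R:

pnorm(1) - pnorm(-1)   # about 0.683: within one standard deviation of the mean
pnorm(2) - pnorm(-2)   # about 0.954: within two standard deviations
pnorm(3) - pnorm(-3)   # about 0.997: within three standard deviations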
Standard Normal and QQ-Plots
A standard normal distribution is one in which the units on the x-axis are expressed in
terms of standard deviations away from the mean. To compare data to a standard normal
distribution, you subtract the mean then divide by the standard deviation; this is also
called normalization or standardization (see “Standardization (Normalization, Z-
Scores)”). Note that “standardization” in this sense is unrelated to database record
standardization (conversion to a common format). The transformed value is termed a z-
score, and the normal distribution is sometimes called the z-distribution. A QQ-Plot is
used to visually determine how close a sample is to the normal distribution. The QQ-Plot
orders the z-scores from low to high, and plots each value’s z-score on the y-axis; the x-
axis is the corresponding quantile of a normal distribution for that value's rank. Since the data is normalized, the units correspond to the number of standard deviations the data lie away from the mean. If the points roughly fall on the diagonal line, then the sample
distribution can be considered close to normal. Figure 2-11 shows a QQ-Plot for a sample
of 100 values randomly generated from a normal distribution; as expected, the points
closely follow the line. This figure can be produced in R with the qqnorm function:
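One way to produce such a plot (a sketch; the variable name norm_samp is arbitrary):

norm_samp <- rnorm(100)              # 100 values from a standard normal
qqnorm(norm_samp)
abline(a = 0, b = 1, col = "grey")   # diagonal reference line for a perfect fit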
long tail distribution
In statistics, a long tail distribution is a distribution that has a long “tail” that
slowly tapers off toward the end of the distribution:
One of the most well-known examples of a long-tail distribution is book sales.
There are a few books that have sold hundreds of millions of copies (Harry Potter,
The Lord of the Rings, The Da Vinci Code, etc.) but most books sell less than one
hundred copies total.
If we created a bar chart to visualize the total sales of every book ever published,
we would find that the chart exhibits a long-tailed distribution:
A few well-known books have sold hundreds of millions of copies and then there’s
a long tail of thousands upon thousands of books that have sold very few copies.
Student’s t-Distribution
The t-distribution is a normally shaped distribution, but a bit thicker and longer on
the tails. It is used extensively in depicting distributions of sample statistics.
Distributions of sample means are typically shaped like a t-distribution, and there
is a family of t-distributions that differ depending on how large the sample is. The
larger the sample, the more normally shaped the t-distribution becomes.
KEY TERMS FOR STUDENT’S T-DISTRIBUTION
n - Sample size.
Degrees of freedom - A parameter that allows the t-distribution to adjust to different sample sizes, statistics, and numbers of groups.
The t-distribution is often called Student’s t because it was published in 1908 in
Biometrika by W. S. Gossett under the name “Student.” Gossett’s employer, the
Guinness brewery, did not want competitors to know that it was using statistical
methods, so insisted that Gossett not use his name on the article. Gossett wanted to
answer the question “What is the sampling distribution of the mean of a sample,
drawn from a larger population?” He started out with a resampling experiment —
drawing random samples of 4 from a data set of 3,000 measurements of criminals’
height and left-middle-finger lengths. (This being the era of eugenics, there was
much interest in data on criminals, and in discovering correlations between
criminal tendencies and physical or psychological attributes.) He plotted the
standardized results (the z-scores) on the x-axis and the frequency on the y-axis.
Separately, he had derived a function, now known as Student’s t, and he fit this
function over the sample results, plotting the comparison (see Figure 2-13).
Binomial Distribution
Yes/no (binomial) outcomes lie at the heart of analytics since they are often the
culmination of a decision or other process; buy/don’t buy, click/don’t click,
survive/die, and so on. Central to understanding the binomial distribution is the
idea of a set of trials, each trial having two possible outcomes with definite
probabilities.
For example, flipping a coin 10 times is a binomial experiment with 10 trials, each
trial having two possible outcomes (heads or tails); see Figure 2-14. Such yes/no or
0/1 outcomes are termed binary outcomes, and they need not have 50/50
probabilities. Any probabilities that sum to 1.0 are possible. It is conventional in
statistics to term the “1” outcome the success outcome; it is also common practice
to assign “1” to the more rare outcome. Use of the term success does not imply
that the outcome is desirable or beneficial, but it does tend to indicate the outcome
of interest. For example, loan defaults or fraudulent transactions are relatively
uncommon events that we may be interested in predicting, so they are termed “1s”
or “successes.”
The binomial distribution is the frequency distribution of the number of successes
(x) in a given number of trials (n) with specified probability (p) of success in each
trial. There is a family of binomial distributions, depending on the values of x, n,
and p. The binomial distribution would answer a question like:
If the probability of a click converting to a sale is 0.02, what is the probability of
observing 0 sales in 200 clicks?
The R function dbinom calculates binomial probabilities. For example:
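A call along these lines, using base R's dbinom,

dbinom(x = 2, size = 5, prob = 0.1)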
would return 0.0729, the probability of observing exactly x = 2 successes in n = 5
trials, where the probability of success for each trial is p = 0.1.
Often we are interested in determining the probability of x or fewer successes in n
trials. In this case, we use the function pbinom:
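For instance, a sketch using base R's pbinom:

pbinom(2, 5, 0.1)   # probability of two or fewer successes in five trials with p = 0.1 (about 0.991)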
When the number of trials is large (particularly when p is close to 0.50), the binomial distribution is virtually indistinguishable from the normal distribution. In fact, calculating binomial probabilities with large sample sizes is computationally demanding, and most statistical procedures use the normal distribution, with mean np and variance np(1 − p), as an approximation.
Chi-Squared Distributions
The Chi-Squared distribution is parameterised by the degrees of freedom (df),
which corresponds to the number of independent random variables being summed.
The chi-square distribution is actually a series of distributions that vary in
shape according to their degrees of freedom. As the degrees of freedom
increase, the distribution becomes more symmetric and approaches a normal
distribution.
The chi-square test is a hypothesis test designed to test for a statistically
significant relationship between nominal and ordinal variables organized in a
bivariate table. In other words, it tells us whether two variables are
independent of one another.
Properties of Chi-Squared Distribution
Some of the common properties of Chi-Squared Distribution are discussed
below:
Non-Negativity
The Chi-Squared distribution is defined only for non-negative values
(x ≥ 0) because it is based on the sum of squared standard normal
variables, which are always non-negative.
Degrees of Freedom
The shape of the Chi-Squared distribution depends on the number of degrees
of freedom (k).
For small k, the distribution is positively skewed.
As k → ∞, the distribution approaches a normal distribution (via
the Central Limit Theorem).
Mean
The mean of the Chi-Squared distribution is equal to its degrees of freedom:
Mean = k
Variance
The variance of the Chi-Squared distribution is twice its degrees of freedom:
Variance = 2k
Standard Deviation (SD)
The standard deviation of the Chi-Squared distribution is the square root of
the variance, so: SD = √(2k)
Skewness
The skewness decreases as the degrees of freedom increase: Skewness = √(8/k)
For small k, the distribution is heavily skewed to the right.
As k increases, the skewness approaches 0.
Kurtosis (Excess)
The kurtosis (excess) of the Chi-Squared distribution is: Excess Kurtosis =
12/k
This shows that the distribution becomes less peaked as k increases.
F-distribution
The F-distribution is a continuous statistical distribution used to test whether two
samples have the same variance. The F-distribution has two parameters: the numerator degrees of freedom (df1) and the denominator degrees of freedom (df2). Formula for the F-distribution:
F = (χ₁²/df1) / (χ₂²/df2)
where the independent random variables χ₁² and χ₂² (computed from samples 1 and 2) each follow a chi-square distribution, and df1 and df2 are the corresponding degrees of freedom.
Derivation of the F-Distribution
The F-Distribution can be derived by taking the ratio of two independent
chi-square distributions divided by their degrees of freedom. Let X1 and X2
be two independent chi-square distributed random variables with degrees of
freedom df1 and df2, respectively. Then the ratio
F = (X1/df1) / (X2/df2)
follows an F-Distribution with df1 and df2 degrees of freedom.
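A quick R simulation sketch of this derivation (df1 = 5 and df2 = 10 are arbitrary illustrative choices):

set.seed(5)
df1 <- 5; df2 <- 10
x1 <- rchisq(10000, df = df1)         # independent chi-square draws
x2 <- rchisq(10000, df = df2)
f_ratio <- (x1 / df1) / (x2 / df2)    # ratio of scaled chi-square variables
mean(f_ratio)                         # close to df2 / (df2 - 2) = 1.25, the mean of F(5, 10)
# qqplot(f_ratio, rf(10000, df1, df2)) shows close agreement with direct F draws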
Applications of the F-Distribution in Statistics
The F-Distribution is commonly used in statistical analysis to compare the
variances of two populations.
For example, it can be used in analysis of variance (ANOVA) to test for
differences in means of three or more groups.
It can also be used in regression analysis to test the overall significance of a
regression model, or to compare the variances of the residuals for two or
more models.
Poisson and related distributions.
The Poisson distribution is a discrete probability distribution that calculates the
likelihood of a certain number of events happening in a fixed time or space,
assuming the events occur independently and at a constant rate.
It is characterized by a single parameter, λ (lambda), which represents the event's
average occurrence rate. The distribution is used when the events are rare, the
number of occurrences is non-negative, and can take on integer values (0, 1, 2,
3,...).
The key assumptions of the Poisson distribution are:
1. Events occur independently of each other.
2. The average rate of occurrence (λ) is constant over the given interval.
3. The number of events can be any non-negative integer.
In summary, the Poisson distribution is used to model the likelihood of events
happening at a certain rate within a fixed time or space, under the assumptions of
independence and constant occurrence.
Poisson Distribution Formula
Poisson distribution is characterized by a single parameter, lambda (λ), which
represents the average rate of occurrence of the events. The probability mass
function of the Poisson distribution is given by:
P(X = k) = (λ^k e^(−λ)) / k!
where,
P(X = k) is the Probability of observing k Events
e is the Base of the Natural Logarithm (approximately 2.71828)
λ is the Average Rate of Occurrence of Events
k is the Number of Events that Occur
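For example, the probability of observing exactly k = 2 events when λ = 3 can be computed with R's dpois:

dpois(2, lambda = 3)   # 3^2 * exp(-3) / 2!  which is about 0.224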
Exponential Distribution
Using the same parameter that we used in the Poisson distribution, we can also
model the distribution of the time between events: time between visits to a website
or between cars arriving at a toll plaza. It is also used in engineering to model time
to failure, and in process management to model, for example, the time required per
service call. The R code to generate random numbers from an exponential
distribution takes two arguments, n (the quantity of numbers to be generated), and
rate, the number of events per time period. For example:
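A call along these lines (base R's rexp):

rexp(n = 100, rate = 0.2)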
This code would generate 100 random numbers from an exponential distribution
where the mean number of events per time period is 0.2. So you could use it to
simulate 100 intervals, in minutes, between service calls, where the average rate of
incoming calls is 0.2 per minute. A key assumption in any simulation study for
either the Poisson or exponential distribution is that the rate, λ, remains constant
over the period being considered. This is rarely reasonable in a global sense; for
example, traffic on roads or data networks varies by time of day and day of week.
However, the time periods, or areas of space, can usually be divided into segments
that are sufficiently homogeneous so that analysis or simulation within those
periods is valid.
Estimating the Failure Rate
In many applications, the event rate, λ, is known or can be estimated from prior
data. However, for rare events, this is not necessarily so. Aircraft engine failure, for
example, is sufficiently rare (thankfully) that, for a given engine type, there may be
little data on which to base an estimate of time between failures. With no data at
all, there is little basis on which to estimate an event rate. However, you can make
some guesses: if no events have been seen after 20 hours, you can be pretty sure
that the rate is not 1 per hour. Via simulation, or direct calculation of probabilities,
you can assess different hypothetical event rates and estimate threshold values
below which the rate is very unlikely to fall. If there is some data but not enough to
provide a precise, reliable estimate of the rate, a goodness-of-fit test (see “Chi-
Square Test”) can be applied to various rates to determine how well they fit the
observed data.