
Introduction to Statistics

Statistics is a branch of mathematics that deals with the collection, analysis, interpretation, presentation, and organization of data. It provides methods for making inferences about populations based on the examination of a sample. Statistics is widely used in various fields, including economics, biology, psychology, sociology, medicine, and many others, to draw meaningful conclusions from data.

Categorical Data:

Categorical data represents categories and cannot be measured in a numerical sense. It is often divided into two subtypes: nominal and ordinal.

● Nominal Data:
● Nominal data consists of categories without any inherent order or ranking.
● Example: Colors of cars (Red, Blue, Green). The colors have no inherent
order; they are distinct categories.
● Ordinal Data:
● Ordinal data has categories with a meaningful order or ranking, but the
intervals between them are not consistent.
● Example: Educational levels (High School, Bachelor's, Master's, PhD).
There is an order, but the difference between High School and Bachelor's
may not be the same as between Master's and PhD.

Numerical Data:

Numerical data consists of measurable quantities and can be further categorized into
discrete and continuous data.

● Discrete Data:
● Discrete data consists of distinct, separate values and cannot be
subdivided indefinitely.
● Example: The number of students in a classroom. You can count the
students, and the result is a whole number.
● Continuous Data:
● Continuous data can take any value within a given range and is often
measured with greater precision.
● Example: Height of individuals. Heights can be any value within a range
(e.g., 165.5 cm, 170.2 cm), and measurements can be more precise than
whole numbers.
Measures of central tendency

Measures of central tendency are statistical measures that describe the center or average of a distribution. The main measures of central tendency are the mean, median, and mode.

Mean:

The mean, often referred to as the average, is a measure of central tendency that represents the sum of all values in a dataset divided by the number of observations. It is calculated as follows:

Mean = Sum of all values / Number of observations

Example:

Consider a dataset representing the ages of a group of individuals: 25, 30, 35, 40, and 45. To calculate the mean:

Mean = (25 + 30 + 35 + 40 + 45) / 5 = 175 / 5 = 35

So, the mean age of the group is 35.
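
A quick sketch of the same calculation, assuming Python with only its standard library:

```python
# Mean: sum of all values divided by the number of observations
ages = [25, 30, 35, 40, 45]
mean_age = sum(ages) / len(ages)
print(mean_age)  # 35.0
```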

Advantages of Mean:
​ Precision: The mean provides a precise measure of central location, taking into
account all values in the dataset.
​ Applicability: It can be used with both discrete and continuous numerical data.
​ Balance: The mean balances the values in the dataset by considering each
observation.

Disadvantages of Mean:

​ Sensitivity to Outliers: The mean is highly sensitive to extreme values (outliers) in
the dataset. A single unusually high or low value can significantly impact the mean.
​ Not Suitable for Skewed Distributions: In skewed distributions, where values are
concentrated on one side, the mean may not accurately represent the central
tendency.
​ Dependence on Sample Size: The mean can be influenced by the sample size,
and in small samples, it may not provide a reliable estimate of the population
mean.

Considerations:

● When using the mean, it's essential to be aware of the characteristics of the data.
If the dataset has outliers or a skewed distribution, alternative measures like the
median may be more appropriate.
● The mean is often used in situations where the distribution of data is
approximately normal and when there are no extreme values that could
significantly distort the central tendency.
Median:

The median is another measure of central tendency that represents the middle value in a dataset when the values are arranged in ascending or descending order. It is not affected by extreme values and is particularly useful when dealing with skewed distributions.

Calculation:

● If the dataset has an odd number of observations, the median is the middle
value.
● If the dataset has an even number of observations, the median is the average of
the two middle values.

Example:

Consider a dataset of incomes (in thousands) for a group of individuals: 25, 30, 40, 50, and 200. To find the median:

​ Arrange the values in ascending order: 25, 30, 40, 50, 200.
​ Since there is an odd number of observations (5), the median is the middle value,
which is 40.

So, the median income for this group is 40,000.
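
A minimal sketch of the same calculation, assuming Python's built-in statistics module:

```python
import statistics

# Median: middle value after sorting; not pulled toward the outlier 200
incomes = [25, 30, 40, 50, 200]  # in thousands
print(statistics.median(incomes))  # 40
print(statistics.mean(incomes))    # 69.0 -- the outlier inflates the mean well above the median
```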

Advantages of Median:

​ Robustness to Outliers: The median is not influenced by extreme values
(outliers), making it a robust measure in the presence of skewed distributions.
​ Suitable for Ordinal Data: The median can be applied to ordinal data, unlike the
mean.
​ Insensitivity to Sample Size: Unlike the mean, the median is not significantly
affected by the size of the dataset.

Disadvantages of Median:

​ Less Precision: The median provides less precise information about the central
location compared to the mean.
​ Not Utilizing All Data: It only considers the middle values and doesn't take into
account all observations in the dataset.

Considerations:

● The median is a good choice when dealing with datasets containing outliers or
when the distribution is skewed.
● It's particularly useful when describing the central tendency of income or other
variables with highly skewed distributions.
● In cases where a more precise measure of central tendency is required, the mean
might be preferred.
**Mode:**

The mode is a measure of central tendency that represents the value(s) in a dataset that
occur most frequently. Unlike the mean and median, the mode can be applied to both
numerical and categorical data.

**Calculation:**

- A dataset may have:

- **No Mode:** If all values occur with the same frequency.

- **Unimodal:** If one value occurs more frequently than others.

- **Bimodal or Multimodal:** If two or more values occur with the same highest
frequency.

**Example:**

Consider a dataset of test scores: 85, 90, 75, 90, 80. In this case, the mode is 90, as it
appears more frequently than the other scores.
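
A small sketch, again assuming Python's statistics module:

```python
import statistics

# Mode: the value that occurs most frequently
scores = [85, 90, 75, 90, 80]
print(statistics.mode(scores))       # 90
print(statistics.multimode(scores))  # [90] -- lists every mode when there are ties
```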

**Advantages of Mode:**
1. **Simplicity:** Mode is a simple and easy-to-understand measure of central tendency.

2. **Applicability to Categorical Data:** It can be applied to both numerical and
categorical data.

**Disadvantages of Mode:**

1. **Not Always Unique:** A dataset may have no mode or multiple modes.

2. **May Not Be Representative:** In datasets with multiple modes, the mode(s) may
not provide a representative measure of central tendency.

**Considerations:**

- Mode is particularly useful when describing the most common value or category in a
dataset.

- It may not be as precise as the mean or median, especially in cases where there is
variability in the dataset.

- In some situations, a dataset may have both a mode and meaningful mean/median,
providing a more comprehensive view of the central tendency.
Measures of dispersion

Measures of dispersion are statistical measures that describe the spread or variability
of a dataset. They provide information about how much individual data points differ
from the central tendency. Common measures of dispersion include the range, variance,
standard deviation, and interquartile range.

​ Range:
● Calculation: Range = Maximum value - Minimum value.
● Example: Consider a dataset of exam scores: 65, 72, 80, 85, 92. The range
is 92 - 65 = 27.

● Advantages:
● Simple and easy to understand.
● Provides a quick overview of the spread.
● Disadvantages:
● Sensitive to extreme values (outliers).
● Ignores the distribution of values within the range.
​ Interquartile Range (IQR):
● Calculation: IQR = Q3 (upper quartile) - Q1 (lower quartile).
● Example: Given a dataset: 10, 15, 20, 25, 30, 35, 40, 45. Q1 = 15 and Q3 =
35. IQR = 35 - 15 = 20.

● Advantages:
● Less sensitive to extreme values than the range.
● Provides information about the spread of the middle 50% of the
data.
● Disadvantages:
● Does not consider the entire range of data.
● May not provide a complete picture of the distribution.
​ Variance:
● Calculation: Variance = Average of the squared differences from the mean.
● Example: Consider a dataset: 5, 8, 10, 12, 15. Mean = (5 + 8 + 10 + 12 + 15) / 5 = 10. Variance = [(5-10)² + (8-10)² + (10-10)² + (12-10)² + (15-10)²] / 5 = 58 / 5 = 11.6.
● Advantages:
● Takes into account all values in the dataset.
● Useful for mathematical calculations.
● Disadvantages:
● The squared differences can be hard to interpret in the original
units of measurement.
● Sensitive to extreme values.
​ Standard Deviation:
● Calculation: Standard Deviation = Square root of the variance.
● Example: Using the previous example, the standard deviation is the square root of the variance, which is √11.6 ≈ 3.41.
● Advantages:
● Provides a more interpretable measure of spread compared to
variance.
● Widely used in statistical analysis.
● Disadvantages:
● Sensitive to extreme values.
● Requires additional computational steps compared to the range
and IQR.

Considerations:

● Outliers:
● All measures of dispersion are influenced by outliers, so it's crucial to be
aware of extreme values when interpreting results.
● Relationship with Central Tendency:
● Understanding both measures of central tendency and dispersion is
crucial for a comprehensive analysis of a dataset.
● Choice of Measure:
● The choice of a particular measure depends on the characteristics of the
data and the specific goals of analysis.
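
The variance and standard deviation from the example above can be checked with a short sketch, assuming Python's built-in statistics module (pvariance and pstdev divide by n, matching the population formula used here), with the range included for completeness:

```python
import statistics

data = [5, 8, 10, 12, 15]

data_range = max(data) - min(data)      # 15 - 5 = 10
variance = statistics.pvariance(data)   # 11.6 (population variance, divides by n)
std_dev = statistics.pstdev(data)       # ~3.41, the square root of the variance

print(data_range, variance, std_dev)
```
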
Probability

Probability is a branch of mathematics that deals with the likelihood of events occurring. It is expressed as a number between 0 (impossible event) and 1 (certain event).

​ Probability of Drawing at Least One Ace:


● Problem: What is the probability of drawing at least one Ace from a
standard deck of 52 playing cards?
● Solution:
● P(at least one Ace) = 1 - P(no Ace)
● P(no Ace) = 48/52 (since there are 4 Aces, 48 non-Aces)
● P(at least one Ace) = 1 - 48/52 = 4/52 = 1/13.
​ Probability of Drawing at Most Two Kings:
● Problem: If you draw two cards from a standard deck of 52 playing cards, what is the probability of drawing at most two Kings?
● Solution:
● When only two cards are drawn, the number of Kings obtained can be 0, 1, or 2, so "at most two Kings" covers every possible outcome.
● P(at most two Kings) = P(0 Kings) + P(1 King) + P(2 Kings) = 1 (a certain event).
​ Probability of Drawing Exactly One Heart:
● Problem: What is the probability of drawing exactly one Heart from a
standard deck of 52 playing cards?
● Solution:
● P(exactly one Heart) = P(1 Heart) = 13/52 = 1/4 (13 Hearts in the deck).

Dice:
​ Probability of Rolling at Least a 4:
● Problem: If you roll a six-sided die, what is the probability of rolling at least
a 4?
● Solution:
● P(at least 4) = P(4) + P(5) + P(6) = 3/6 = 1/2.
​ Probability of Rolling at Most a 3:
● Problem: If you roll a six-sided die, what is the probability of rolling at most
a 3?
● Solution:
● P(at most 3) = P(1) + P(2) + P(3) = 3/6 = 1/2.
​ Probability of Rolling Exactly a 6:
● Problem: If you roll a six-sided die, what is the probability of rolling exactly
a 6?
● Solution:
● P(exactly 6) = P(6) = 1/6.

Coins:
​ Probability of Getting at Least One Head in Two Coin Flips:
● Problem: What is the probability of getting at least one head when flipping
a fair coin twice?
● Solution:
● P(at least one head) = 1 - P(no heads)
● P(no heads) = P(two tails) = (1/2) * (1/2) = 1/4
● P(at least one head) = 1 - 1/4 = 3/4.
​ Probability of Getting at Most One Tail in Three Coin Flips:
● Problem: What is the probability of getting at most one tail when flipping a
fair coin three times?
● Solution:
● P(at most one tail) = P(0 tails) + P(1 tail)
● P(0 tails) = (1/2)^3 = 1/8
● P(1 tail) = 3 * (1/2)^3 = 3/8
● P(at most one tail) = 1/8 + 3/8 = 1/2.
​ Probability of Getting Exactly Two Heads in Four Coin Flips:
● Problem: What is the probability of getting exactly two heads when
flipping a fair coin four times?
● Solution:
● P(exactly two heads) = C(4,2) * (1/2)^4 = 6 * 1/16 = 3/8 (using
binomial distribution, where C(n, k) is the binomial coefficient).
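
These coin-flip answers can be cross-checked with a small sketch using the binomial formula, assuming Python's math.comb for the binomial coefficient:

```python
from math import comb

# Small helper: P(exactly k heads in n fair-coin flips) = C(n, k) * (1/2)^n
def p_heads(n, k):
    return comb(n, k) * 0.5 ** n

print(1 - p_heads(2, 0))              # at least one head in 2 flips -> 0.75
print(p_heads(3, 3) + p_heads(3, 2))  # at most one tail in 3 flips  -> 0.5
print(p_heads(4, 2))                  # exactly two heads in 4 flips -> 0.375
```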

Colored Balls:
​ Probability of Drawing at Least One Red Ball:
● Problem: A bag contains 5 red balls, 3 blue balls, and 2 green balls. What
is the probability of drawing at least one red ball?
● Solution:
● P(at least one red) = 1 - P(no red)
● P(no red) = P(all non-red) = 5/10 = 1/2
● P(at least one red) = 1 - 1/2 = 1/2.
​ Probability of Drawing at Most One Blue Ball:
● Problem: A bag contains 5 red balls, 3 blue balls, and 2 green balls. If two balls are drawn without replacement, what is the probability of drawing at most one blue ball?
● Solution:
● P(at most one blue) = P(0 blue) + P(1 blue)
● P(0 blue) = (7/10) * (6/9) = 42/90 = 7/15
● P(1 blue) = 2 * (3/10) * (7/9) = 42/90 = 7/15
● P(at most one blue) = 7/15 + 7/15 = 14/15 (equivalently, 1 - P(2 blue) = 1 - (3/10) * (2/9) = 14/15).
​ Probability of Drawing Exactly Two Green Balls:
● Problem: A bag contains 5 red balls, 3 blue balls, and 2 green balls. What
is the probability of drawing exactly two green balls in three draws?
● Solution:
● P(exactly two green) = C(3,2) * (2/10)² * (8/10) = 12/125 = 0.096 (using the binomial distribution with draws made with replacement, where C(n, k) is the binomial coefficient).
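
These ball-drawing answers can also be approximated by simulation. A rough sketch, assuming Python's random module and the draw schemes described above (one draw, two draws without replacement, three draws with replacement):

```python
import random

bag = ["red"] * 5 + ["blue"] * 3 + ["green"] * 2
trials = 100_000

at_least_one_red = sum(random.choice(bag) == "red" for _ in range(trials))
at_most_one_blue = sum(random.sample(bag, 2).count("blue") <= 1 for _ in range(trials))
two_green = sum([random.choice(bag) for _ in range(3)].count("green") == 2
                for _ in range(trials))

print(at_least_one_red / trials)  # ~0.50  (single draw)
print(at_most_one_blue / trials)  # ~0.93  (two draws without replacement, 14/15)
print(two_green / trials)         # ~0.096 (three draws with replacement, 12/125)
```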

These examples cover a range of probability scenarios, including "at least," "at most," and "exactly" problems for cards, dice, coins, and colored balls.

Probability distributions describe the likelihood of different outcomes in a given set of events.

**Binomial Distribution:**

A binomial distribution is a discrete probability distribution that models the number of successes in a fixed number of independent and identically distributed (i.i.d.) Bernoulli trials. Each trial results in either a success (often denoted as 1) or a failure (denoted as 0), and the probability of success remains constant across all trials. The distribution is characterized by two parameters: the number of trials and the probability of success in a single trial.

**Poisson Distribution:**

The Poisson distribution is a discrete probability distribution that models the number of events that occur in a fixed interval of time or space. It is characterized by a single parameter, which represents the average rate of occurrence of the events. The events must be random, independent, and have a constant average rate.


**Normal Distribution:**

The normal distribution, also known as the Gaussian distribution or bell curve, is a continuous probability distribution that is symmetric and characterized by its mean and standard deviation.
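
A brief sketch of how these three distributions might be evaluated in code, assuming the scipy.stats package is available (the parameter values below are chosen only for illustration):

```python
from scipy import stats

# Binomial: 10 trials, success probability 0.3 -> P(exactly 4 successes)
print(stats.binom.pmf(4, n=10, p=0.3))

# Poisson: average rate of 2 events per interval -> P(exactly 3 events)
print(stats.poisson.pmf(3, mu=2))

# Normal: mean 0, standard deviation 1 -> density at x = 0 and P(X <= 1.96)
print(stats.norm.pdf(0, loc=0, scale=1))
print(stats.norm.cdf(1.96, loc=0, scale=1))
```
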
**Correlation:**

Correlation is a statistical measure that describes the extent to which two variables
change together. It indicates the strength and direction of a linear relationship between
two quantitative variables. The correlation coefficient (usually denoted by \( r \)) ranges
from -1 to 1, where -1 indicates a perfect negative correlation, 1 indicates a perfect
positive correlation, and 0 indicates no correlation.

**Example 1: Positive Correlation**

- **Scenario:** Examining the relationship between hours of study and exam scores.

- **Interpretation:** If there is a positive correlation, it means that as the number of hours spent studying increases, the exam scores also tend to increase.

- **Example:** \( r = 0.80 \) (strong positive correlation).

**Example 2: Negative Correlation**

- **Scenario:** Investigating the relationship between the number of hours spent watching TV and academic performance.

- **Interpretation:** If there is a negative correlation, it means that as the time spent watching TV increases, academic performance tends to decrease.

- **Example:** \( r = -0.60 \) (moderate negative correlation).

**Example 3: No Correlation**

- **Scenario:** Analyzing the correlation between shoe size and mathematical ability.
- **Interpretation:** If there is no correlation, it implies that there is no systematic
relationship between shoe size and mathematical ability.

- **Example:** \( r = 0.05 \) (weak, negligible correlation).

**Example 4: Perfect Negative Correlation**

- **Scenario:** Examining the relationship between the temperature and the amount of
snowfall.

- **Interpretation:** A perfect negative correlation would mean that as the temperature increases, the amount of snowfall decreases, and vice versa.

- **Example:** \( r = -1.0 \) (perfect negative correlation).
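
A short sketch of how a correlation coefficient might be computed, assuming NumPy is available (the study-hours and exam-score numbers are invented for illustration):

```python
import numpy as np

# Hypothetical data: hours studied vs. exam score
hours = np.array([1, 2, 3, 4, 5, 6])
scores = np.array([52, 55, 61, 64, 70, 75])

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r between the two variables
r = np.corrcoef(hours, scores)[0, 1]
print(r)  # close to +1, i.e. a strong positive correlation
```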

Student's t-distribution

The t-distribution, also known as the Student's t-distribution, is a probability distribution that arises in hypothesis testing when the population standard deviation is unknown and is estimated from the sample. It is used primarily for small sample sizes and is characterized by its bell-shaped curve. The shape of the t-distribution is similar to the normal distribution, but it has heavier tails, making it more suitable for handling uncertainty in small samples.
**Hypothesis Testing: An Overview**

Hypothesis testing is a statistical method used to make inferences about a population parameter based on a sample of data. It involves the formulation of two competing hypotheses, a null hypothesis (\(H_0\)) and an alternative hypothesis (\(H_a\)), and a statistical test to determine whether there is enough evidence in the sample data to reject the null hypothesis in favor of the alternative hypothesis.

Here is a step-by-step guide to the hypothesis testing process:

1. Formulate Hypotheses:

- Null Hypothesis (\(H_0\)): Represents a default assumption or a statement of no effect.

- Alternative Hypothesis (\(H_a\)): Represents what the researcher is trying to establish or prove. It is also denoted as \(H_1\).

2. Choose a Significance Level (\(\alpha\)):

- The significance level (\(\alpha\)) is the probability of rejecting the null hypothesis when it is actually true. Common choices include 0.05, 0.01, or 0.10.

3. Select a Statistical Test:

- Choose an appropriate statistical test based on the type of data and the nature of the
research question. Common tests include t-tests, chi-square tests, ANOVA, regression
analysis, etc.
4. Collect and Analyze Data:

- Collect a sample of data and perform the chosen statistical test. Obtain the test
statistic and calculate its associated p-value.

5. Calculate Test Statistic and P-Value:

- The test statistic is a numerical value calculated from the sample data that is used to
determine whether to reject the null hypothesis.

- The p-value is the probability of obtaining a test statistic as extreme as, or more
extreme than, the one observed, assuming the null hypothesis is true.

6. Make a Decision:

- If the p-value is less than the chosen significance level \(\alpha\), reject the null hypothesis in favor of the alternative hypothesis.

- If the p-value is greater than or equal to \(\alpha\), do not reject the null hypothesis.

7. Draw Conclusions:

- Based on the decision made in step 6, draw conclusions and interpret the results in
the context of the research question.

8. Consider Practical Significance:

- Even if a result is statistically significant, it is important to consider whether the observed effect is practically significant and has real-world implications.
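
A compact sketch of steps 1-6 using a one-sample t-test, assuming scipy.stats is available (the sample values and the hypothesized mean of 50 are invented for illustration):

```python
from scipy import stats

# H0: the population mean is 50;  Ha: the population mean differs from 50
sample = [52, 48, 55, 60, 51, 49, 58, 53]
alpha = 0.05

# Test statistic and p-value from the sample data
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(t_stat, p_value)

if p_value < alpha:
    print("Reject H0 in favor of Ha")
else:
    print("Do not reject H0")
```
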
**Common Errors in Hypothesis Testing:**

- **Type I Error (False Positive):** Rejecting a true null hypothesis.

- **Type II Error (False Negative):** Failing to reject a false null hypothesis.

Chi Square test

The chi-square test is a statistical test used to determine if there is a significant association between two categorical variables. It is based on the comparison of observed and expected frequencies in a contingency table. The test assesses whether the observed distribution of categorical data differs from the distribution that would be expected under the assumption of independence.
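
A minimal sketch, assuming scipy.stats is available (the 2x2 contingency table is invented for illustration):

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows and columns are two categorical variables
observed = [[30, 10],
            [20, 40]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
print(expected)  # frequencies expected if the two variables were independent
```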

One-Way ANOVA

The one-way analysis of variance (ANOVA) is a statistical test used to determine if there
are any statistically significant differences between the means of three or more
independent (unrelated) groups. It is an extension of the two-sample t-test for
comparing means of two groups to the case of more than two groups.
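
A short sketch, assuming scipy.stats is available (the three group samples are invented for illustration):

```python
from scipy.stats import f_oneway

# Hypothetical test scores for three independent groups
group_a = [85, 90, 88, 75, 95]
group_b = [70, 65, 80, 72, 68]
group_c = [90, 92, 88, 94, 91]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)  # a small p-value suggests at least one group mean differs
```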

Two-Way ANOVA

The two-way analysis of variance (ANOVA) is an extension of the one-way ANOVA and
is used to investigate the influence of two different categorical independent variables on
a continuous dependent variable.
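
A rough sketch of how a two-way ANOVA might be set up, assuming the pandas and statsmodels packages are available (the fertilizer/watering data are invented for illustration):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical data: plant growth under two factors (fertilizer type, watering level)
df = pd.DataFrame({
    "growth":     [20, 22, 19, 24, 25, 23, 30, 28, 29, 35, 33, 34],
    "fertilizer": ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
    "watering":   ["low", "low", "low", "high", "high", "high",
                   "low", "low", "low", "high", "high", "high"],
})

# Model both main effects and their interaction, then produce the ANOVA table
model = ols("growth ~ C(fertilizer) + C(watering) + C(fertilizer):C(watering)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```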
