
Introduction to Statistics

Statistics is a branch of mathematics that deals with the collection, analysis, interpretation, presentation, and organization of data. It provides methods for making inferences about populations based on the examination of a sample. Statistics is widely used in various fields, including economics, biology, psychology, sociology, medicine, and many others, to draw meaningful conclusions from data.

Categorical Data:

Categorical data represents categories and cannot be measured in a numerical sense. It is often divided into two subtypes: nominal and ordinal.

● Nominal Data:
● Nominal data consists of categories without any inherent order or ranking.
● Example: Colors of cars (Red, Blue, Green). The colors have no inherent
order; they are distinct categories.
● Ordinal Data:
● Ordinal data has categories with a meaningful order or ranking, but the
intervals between them are not consistent.
● Example: Educational levels (High School, Bachelor's, Master's, PhD).
There is an order, but the difference between High School and Bachelor's
may not be the same as between Master's and PhD.

Numerical Data:

Numerical data consists of measurable quantities and can be further categorized into
discrete and continuous data.

● Discrete Data:
● Discrete data consists of distinct, separate values and cannot be
subdivided indefinitely.
● Example: The number of students in a classroom. You can count the
students, and the result is a whole number.
● Continuous Data:
● Continuous data can take any value within a given range and is often
measured with greater precision.
● Example: Height of individuals. Heights can be any value within a range
(e.g., 165.5 cm, 170.2 cm), and measurements can be more precise than
whole numbers.
Measures of central tendency

Measures of central tendency are statistical measures that describe the center or average of a distribution. The main measures of central tendency are the mean, median, and mode.

Mean:

The mean, often referred to as the average, is a measure of central tendency that represents the sum of all values in a dataset divided by the number of observations. It is calculated as follows:

Mean = Sum of all values / Number of observations

Example:

Consider a dataset representing the ages of a group of individuals: 25, 30, 35, 40, and 45. To calculate the mean:

Mean = (25 + 30 + 35 + 40 + 45) / 5 = 175 / 5 = 35

So, the mean age of the group is 35.
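
A quick sketch of the same calculation, assuming Python with only its standard library:

```python
# Mean: sum of all values divided by the number of observations
ages = [25, 30, 35, 40, 45]
mean_age = sum(ages) / len(ages)
print(mean_age)  # 35.0
```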

Advantages of Mean:
​ Precision: The mean provides a precise measure of central location, taking into
account all values in the dataset.
​ Applicability: It can be used with both discrete and continuous numerical data.
​ Balance: The mean balances the values in the dataset by considering each
observation.

Disadvantages of Mean:

​ Sensitivity to Outliers: The mean is highly sensitive to extreme values (outliers) in
the dataset. A single unusually high or low value can significantly impact the mean.
​ Not Suitable for Skewed Distributions: In skewed distributions, where values are
concentrated on one side, the mean may not accurately represent the central
tendency.
​ Dependence on Sample Size: The mean can be influenced by the sample size,
and in small samples, it may not provide a reliable estimate of the population
mean.

Considerations:

● When using the mean, it's essential to be aware of the characteristics of the data.
If the dataset has outliers or a skewed distribution, alternative measures like the
median may be more appropriate.
● The mean is often used in situations where the distribution of data is
approximately normal and when there are no extreme values that could
significantly distort the central tendency.
Median:

The median is another measure of central tendency that represents the middle value in a dataset when the values are arranged in ascending or descending order. It is not affected by extreme values and is particularly useful when dealing with skewed distributions.

Calculation:

● If the dataset has an odd number of observations, the median is the middle
value.
● If the dataset has an even number of observations, the median is the average of
the two middle values.

Example:

Consider a dataset of incomes (in thousands) for a group of individuals: 25, 30, 40, 50, and 200. To find the median:

​ Arrange the values in ascending order: 25, 30, 40, 50, 200.
​ Since there is an odd number of observations (5), the median is the middle value,
which is 40.

So, the median income for this group is 40,000.
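
A minimal sketch of the same calculation, assuming Python's built-in statistics module:

```python
import statistics

# Median: middle value after sorting; not pulled toward the outlier 200
incomes = [25, 30, 40, 50, 200]  # in thousands
print(statistics.median(incomes))  # 40
print(statistics.mean(incomes))    # 69.0 -- the outlier inflates the mean well above the median
```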

Advantages of Median:

​ Robustness to Outliers: The median is not influenced by extreme values
(outliers), making it a robust measure in the presence of skewed distributions.
​ Suitable for Ordinal Data: The median can be applied to ordinal data, unlike the
mean.
​ Insensitivity to Sample Size: Unlike the mean, the median is not significantly
affected by the size of the dataset.

Disadvantages of Median:

​ Less Precision: The median provides less precise information about the central
location compared to the mean.
​ Not Utilizing All Data: It only considers the middle values and doesn't take into
account all observations in the dataset.

Considerations:

● The median is a good choice when dealing with datasets containing outliers or
when the distribution is skewed.
● It's particularly useful when describing the central tendency of income or other
variables with highly skewed distributions.
● In cases where a more precise measure of central tendency is required, the mean
might be preferred.
**Mode:**

The mode is a measure of central tendency that represents the value(s) in a dataset that
occur most frequently. Unlike the mean and median, the mode can be applied to both
numerical and categorical data.

**Calculation:**

- A dataset may have:

- **No Mode:** If all values occur with the same frequency.

- **Unimodal:** If one value occurs more frequently than others.

- **Bimodal or Multimodal:** If two or more values occur with the same highest
frequency.

**Example:**

Consider a dataset of test scores: 85, 90, 75, 90, 80. In this case, the mode is 90, as it
appears more frequently than the other scores.
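
A small sketch, again assuming Python's statistics module:

```python
import statistics

# Mode: the value that occurs most frequently
scores = [85, 90, 75, 90, 80]
print(statistics.mode(scores))       # 90
print(statistics.multimode(scores))  # [90] -- lists every mode when there are ties
```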

**Advantages of Mode:**
1. **Simplicity:** Mode is a simple and easy-to-understand measure of central tendency.

2. **Applicability to Categorical Data:** It can be applied to both numerical and
categorical data.

**Disadvantages of Mode:**

1. **Not Always Unique:** A dataset may have no mode or multiple modes.

2. **May Not Be Representative:** In datasets with multiple modes, the mode(s) may
not provide a representative measure of central tendency.

**Considerations:**

- Mode is particularly useful when describing the most common value or category in a
dataset.

- It may not be as precise as the mean or median, especially in cases where there is
variability in the dataset.

- In some situations, a dataset may have both a mode and meaningful mean/median,
providing a more comprehensive view of the central tendency.
Measures of dispersion

Measures of dispersion are statistical measures that describe the spread or variability
of a dataset. They provide information about how much individual data points differ
from the central tendency. Common measures of dispersion include the range, variance,
standard deviation, and interquartile range.

​ Range:
● Calculation: Range = Maximum value - Minimum value.
● Example: Consider a dataset of exam scores: 65, 72, 80, 85, 92. The range
is 92 - 65 = 27.

● Advantages:
● Simple and easy to understand.
● Provides a quick overview of the spread.
● Disadvantages:
● Sensitive to extreme values (outliers).
● Ignores the distribution of values within the range.
​ Interquartile Range (IQR):
● Calculation: IQR = Q3 (upper quartile) - Q1 (lower quartile).
● Example: Given a dataset: 10, 15, 20, 25, 30, 35, 40, 45. Q1 = 15 and Q3 =
35. IQR = 35 - 15 = 20.

● Advantages:
● Less sensitive to extreme values than the range.
● Provides information about the spread of the middle 50% of the
data.
● Disadvantages:
● Does not consider the entire range of data.
● May not provide a complete picture of the distribution.
​ Variance:
● Calculation: Variance = Average of the squared differences from the mean.
● Example: Consider a dataset: 5, 8, 10, 12, 15. Mean = (5 + 8 + 10 + 12 + 15) / 5 = 10. Variance = [(5-10)² + (8-10)² + (10-10)² + (12-10)² + (15-10)²] / 5 = 58 / 5 = 11.6.
● Advantages:
● Takes into account all values in the dataset.
● Useful for mathematical calculations.
● Disadvantages:
● The squared differences can be hard to interpret in the original
units of measurement.
● Sensitive to extreme values.
​ Standard Deviation:
● Calculation: Standard Deviation = Square root of the variance.
● Example: Using the previous example, the standard deviation is the square root of the variance, which is √11.6 ≈ 3.41.
● Advantages:
● Provides a more interpretable measure of spread compared to
variance.
● Widely used in statistical analysis.
● Disadvantages:
● Sensitive to extreme values.
● Requires additional computational steps compared to the range
and IQR.

Considerations:

● Outliers:
● All measures of dispersion are influenced by outliers, so it's crucial to be
aware of extreme values when interpreting results.
● Relationship with Central Tendency:
● Understanding both measures of central tendency and dispersion is
crucial for a comprehensive analysis of a dataset.
● Choice of Measure:
● The choice of a particular measure depends on the characteristics of the
data and the specific goals of analysis.
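
The variance and standard deviation from the example above can be checked with a short sketch, assuming Python's built-in statistics module (pvariance and pstdev divide by n, matching the population formula used here), with the range included for completeness:

```python
import statistics

data = [5, 8, 10, 12, 15]

data_range = max(data) - min(data)      # 15 - 5 = 10
variance = statistics.pvariance(data)   # 11.6 (population variance, divides by n)
std_dev = statistics.pstdev(data)       # ~3.41, the square root of the variance

print(data_range, variance, std_dev)
```
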
Probability

Probability is a branch of mathematics that deals with the likelihood of events occurring. It is expressed as a number between 0 (impossible event) and 1 (certain event).

​ Probability of Drawing at Least One Ace:


● Problem: What is the probability of drawing at least one Ace from a
standard deck of 52 playing cards?
● Solution:
● P(at least one Ace) = 1 - P(no Ace)
● P(no Ace) = 48/52 (since there are 4 Aces, 48 non-Aces)
● P(at least one Ace) = 1 - 48/52 = 4/52 = 1/13.
​ Probability of Drawing at Most Two Kings:
● Problem: If you draw two cards from a standard deck of 52 playing cards, what is the probability of drawing at most two Kings?
● Solution:
● When only two cards are drawn, the number of Kings obtained can be 0, 1, or 2, so "at most two Kings" covers every possible outcome.
● P(at most two Kings) = P(0 Kings) + P(1 King) + P(2 Kings) = 1 (a certain event).
​ Probability of Drawing Exactly One Heart:
● Problem: What is the probability of drawing exactly one Heart from a
standard deck of 52 playing cards?
● Solution:
● P(exactly one Heart) = P(1 Heart) = 13/52 = 1/4 (13 Hearts in the deck).

Dice:
​ Probability of Rolling at Least a 4:
● Problem: If you roll a six-sided die, what is the probability of rolling at least
a 4?
● Solution:
● P(at least 4) = P(4) + P(5) + P(6) = 3/6 = 1/2.
​ Probability of Rolling at Most a 3:
● Problem: If you roll a six-sided die, what is the probability of rolling at most
a 3?
● Solution:
● P(at most 3) = P(1) + P(2) + P(3) = 3/6 = 1/2.
​ Probability of Rolling Exactly a 6:
● Problem: If you roll a six-sided die, what is the probability of rolling exactly
a 6?
● Solution:
● P(exactly 6) = P(6) = 1/6.

Coins:
​ Probability of Getting at Least One Head in Two Coin Flips:
● Problem: What is the probability of getting at least one head when flipping
a fair coin twice?
● Solution:
● P(at least one head) = 1 - P(no heads)
● P(no heads) = P(two tails) = (1/2) * (1/2) = 1/4
● P(at least one head) = 1 - 1/4 = 3/4.
​ Probability of Getting at Most One Tail in Three Coin Flips:
● Problem: What is the probability of getting at most one tail when flipping a
fair coin three times?
● Solution:
● P(at most one tail) = P(0 tails) + P(1 tail)
● P(0 tails) = (1/2)^3 = 1/8
● P(1 tail) = 3 * (1/2)^3 = 3/8
● P(at most one tail) = 1/8 + 3/8 = 1/2.
​ Probability of Getting Exactly Two Heads in Four Coin Flips:
● Problem: What is the probability of getting exactly two heads when
flipping a fair coin four times?
● Solution:
● P(exactly two heads) = C(4,2) * (1/2)^4 = 6 * 1/16 = 3/8 (using
binomial distribution, where C(n, k) is the binomial coefficient).
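
These coin-flip answers can be cross-checked with a small sketch using the binomial formula, assuming Python's math.comb for the binomial coefficient:

```python
from math import comb

# Small helper: P(exactly k heads in n fair-coin flips) = C(n, k) * (1/2)^n
def p_heads(n, k):
    return comb(n, k) * 0.5 ** n

print(1 - p_heads(2, 0))              # at least one head in 2 flips -> 0.75
print(p_heads(3, 3) + p_heads(3, 2))  # at most one tail in 3 flips  -> 0.5
print(p_heads(4, 2))                  # exactly two heads in 4 flips -> 0.375
```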

Colored Balls:
​ Probability of Drawing at Least One Red Ball:
● Problem: A bag contains 5 red balls, 3 blue balls, and 2 green balls. What
is the probability of drawing at least one red ball?
● Solution:
● P(at least one red) = 1 - P(no red)
● P(no red) = P(all non-red) = 5/10 = 1/2
● P(at least one red) = 1 - 1/2 = 1/2.
​ Probability of Drawing at Most One Blue Ball:
● Problem: A bag contains 5 red balls, 3 blue balls, and 2 green balls. If two balls are drawn without replacement, what is the probability of drawing at most one blue ball?
● Solution:
● P(at most one blue) = P(0 blue) + P(1 blue)
● P(0 blue) = (7/10) * (6/9) = 42/90 = 7/15
● P(1 blue) = 2 * (3/10) * (7/9) = 42/90 = 7/15
● P(at most one blue) = 7/15 + 7/15 = 14/15 (equivalently, 1 - P(2 blue) = 1 - (3/10) * (2/9) = 14/15).
​ Probability of Drawing Exactly Two Green Balls:
● Problem: A bag contains 5 red balls, 3 blue balls, and 2 green balls. What
is the probability of drawing exactly two green balls in three draws?
● Solution:
● P(exactly two green) = C(3,2) * (2/10)² * (8/10) = 12/125 = 0.096 (using the binomial distribution with draws made with replacement, where C(n, k) is the binomial coefficient).
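
These ball-drawing answers can also be approximated by simulation. A rough sketch, assuming Python's random module and the draw schemes described above (one draw, two draws without replacement, three draws with replacement):

```python
import random

bag = ["red"] * 5 + ["blue"] * 3 + ["green"] * 2
trials = 100_000

at_least_one_red = sum(random.choice(bag) == "red" for _ in range(trials))
at_most_one_blue = sum(random.sample(bag, 2).count("blue") <= 1 for _ in range(trials))
two_green = sum([random.choice(bag) for _ in range(3)].count("green") == 2
                for _ in range(trials))

print(at_least_one_red / trials)  # ~0.50  (single draw)
print(at_most_one_blue / trials)  # ~0.93  (two draws without replacement, 14/15)
print(two_green / trials)         # ~0.096 (three draws with replacement, 12/125)
```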

These examples cover a range of probability scenarios, including "at least," "at most," and "exactly" problems for cards, dice, coins, and colored balls.

Probability distributions describe the likelihood of different outcomes in a given set of events.

**Binomial Distribution:**

A binomial distribution is a discrete probability distribution that models the number of successes in a fixed number of independent and identically distributed (i.i.d.) Bernoulli trials. Each trial results in either a success (often denoted as 1) or a failure (denoted as 0), and the probability of success remains constant across all trials. The distribution is characterized by two parameters: the number of trials and the probability of success in a single trial.

**Poisson Distribution:**

The Poisson distribution is a discrete probability distribution that models the number of events that occur in a fixed interval of time or space. It is characterized by a single parameter, which represents the average rate of occurrence of the events. The events must be random, independent, and have a constant average rate.


**Normal Distribution:**

The normal distribution, also known as the Gaussian distribution or bell curve, is a continuous probability distribution that is symmetric and characterized by its mean and standard deviation.
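
A brief sketch of how these three distributions might be evaluated in code, assuming the scipy.stats package is available (the parameter values below are chosen only for illustration):

```python
from scipy import stats

# Binomial: 10 trials, success probability 0.3 -> P(exactly 4 successes)
print(stats.binom.pmf(4, n=10, p=0.3))

# Poisson: average rate of 2 events per interval -> P(exactly 3 events)
print(stats.poisson.pmf(3, mu=2))

# Normal: mean 0, standard deviation 1 -> density at x = 0 and P(X <= 1.96)
print(stats.norm.pdf(0, loc=0, scale=1))
print(stats.norm.cdf(1.96, loc=0, scale=1))
```
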
**Correlation:**

Correlation is a statistical measure that describes the extent to which two variables
change together. It indicates the strength and direction of a linear relationship between
two quantitative variables. The correlation coefficient (usually denoted by \( r \)) ranges
from -1 to 1, where -1 indicates a perfect negative correlation, 1 indicates a perfect
positive correlation, and 0 indicates no correlation.

**Example 1: Positive Correlation**

- **Scenario:** Examining the relationship between hours of study and exam scores.

- **Interpretation:** If there is a positive correlation, it means that as the number of hours spent studying increases, the exam scores also tend to increase.

- **Example:** \( r = 0.80 \) (strong positive correlation).

**Example 2: Negative Correlation**

- **Scenario:** Investigating the relationship between the number of hours spent watching TV and academic performance.

- **Interpretation:** If there is a negative correlation, it means that as the time spent watching TV increases, academic performance tends to decrease.

- **Example:** \( r = -0.60 \) (moderate negative correlation).

**Example 3: No Correlation**

- **Scenario:** Analyzing the correlation between shoe size and mathematical ability.
- **Interpretation:** If there is no correlation, it implies that there is no systematic
relationship between shoe size and mathematical ability.

- **Example:** \( r = 0.05 \) (weak, negligible correlation).

**Example 4: Perfect Negative Correlation**

- **Scenario:** Examining the relationship between the temperature and the amount of
snowfall.

- **Interpretation:** A perfect negative correlation would mean that as the temperature increases, the amount of snowfall decreases, and vice versa.

- **Example:** \( r = -1.0 \) (perfect negative correlation).
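
A short sketch of how a correlation coefficient might be computed, assuming NumPy is available (the study-hours and exam-score numbers are invented for illustration):

```python
import numpy as np

# Hypothetical data: hours studied vs. exam score
hours = np.array([1, 2, 3, 4, 5, 6])
scores = np.array([52, 55, 61, 64, 70, 75])

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r between the two variables
r = np.corrcoef(hours, scores)[0, 1]
print(r)  # close to +1, i.e. a strong positive correlation
```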

Student's t-distribution

The t-distribution, also known as the Student's t-distribution, is a probability distribution that arises in hypothesis testing when the population standard deviation is unknown and is estimated from the sample. It is used primarily for small sample sizes and is characterized by its bell-shaped curve. The shape of the t-distribution is similar to the normal distribution, but it has heavier tails, making it more suitable for handling uncertainty in small samples.
**Hypothesis Testing: An Overview**

Hypothesis testing is a statistical method used to make inferences about a population parameter based on a sample of data. It involves the formulation of two competing hypotheses, a null hypothesis (\(H_0\)) and an alternative hypothesis (\(H_a\)), and a statistical test to determine whether there is enough evidence in the sample data to reject the null hypothesis in favor of the alternative hypothesis.

Here is a step-by-step guide to the hypothesis testing process:

1. Formulate Hypotheses:

- Null Hypothesis (\(H_0\)): Represents a default assumption or a statement of no effect.

- Alternative Hypothesis (\(H_a\)): Represents what the researcher is trying to establish or prove. It is also denoted as \(H_1\).

2. Choose a Significance Level (\(\alpha\)):

- The significance level (\(\alpha\)) is the probability of rejecting the null hypothesis when it is actually true. Common choices include 0.05, 0.01, or 0.10.

3. Select a Statistical Test:

- Choose an appropriate statistical test based on the type of data and the nature of the
research question. Common tests include t-tests, chi-square tests, ANOVA, regression
analysis, etc.
4. Collect and Analyze Data:

- Collect a sample of data and perform the chosen statistical test. Obtain the test
statistic and calculate its associated p-value.

5. Calculate Test Statistic and P-Value:

- The test statistic is a numerical value calculated from the sample data that is used to
determine whether to reject the null hypothesis.

- The p-value is the probability of obtaining a test statistic as extreme as, or more
extreme than, the one observed, assuming the null hypothesis is true.

6. Make a Decision:

- If the p-value is less than the chosen significance level \(\alpha\), reject the null hypothesis in favor of the alternative hypothesis.

- If the p-value is greater than or equal to \(\alpha\), do not reject the null hypothesis.

7. Draw Conclusions:

- Based on the decision made in step 6, draw conclusions and interpret the results in
the context of the research question.

8. Consider Practical Significance:

- Even if a result is statistically significant, it is important to consider whether the observed effect is practically significant and has real-world implications.
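
A compact sketch of steps 1-6 using a one-sample t-test, assuming scipy.stats is available (the sample values and the hypothesized mean of 50 are invented for illustration):

```python
from scipy import stats

# H0: the population mean is 50;  Ha: the population mean differs from 50
sample = [52, 48, 55, 60, 51, 49, 58, 53]
alpha = 0.05

# Test statistic and p-value from the sample data
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(t_stat, p_value)

if p_value < alpha:
    print("Reject H0 in favor of Ha")
else:
    print("Do not reject H0")
```
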
**Common Errors in Hypothesis Testing:**

- **Type I Error (False Positive):** Rejecting a true null hypothesis.

- **Type II Error (False Negative):** Failing to reject a false null hypothesis.

Chi Square test

The chi-square test is a statistical test used to determine if there is a significant association between two categorical variables. It is based on the comparison of observed and expected frequencies in a contingency table. The test assesses whether the observed distribution of categorical data differs from the distribution that would be expected under the assumption of independence.
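
A minimal sketch, assuming scipy.stats is available (the 2x2 contingency table is invented for illustration):

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows and columns are two categorical variables
observed = [[30, 10],
            [20, 40]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
print(expected)  # frequencies expected if the two variables were independent
```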

One-Way ANOVA

The one-way analysis of variance (ANOVA) is a statistical test used to determine if there
are any statistically significant differences between the means of three or more
independent (unrelated) groups. It is an extension of the two-sample t-test for
comparing means of two groups to the case of more than two groups.
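
A short sketch, assuming scipy.stats is available (the three group samples are invented for illustration):

```python
from scipy.stats import f_oneway

# Hypothetical test scores for three independent groups
group_a = [85, 90, 88, 75, 95]
group_b = [70, 65, 80, 72, 68]
group_c = [90, 92, 88, 94, 91]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)  # a small p-value suggests at least one group mean differs
```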

Two-Way ANOVA

The two-way analysis of variance (ANOVA) is an extension of the one-way ANOVA and
is used to investigate the influence of two different categorical independent variables on
a continuous dependent variable.
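
A rough sketch of how a two-way ANOVA might be set up, assuming the pandas and statsmodels packages are available (the fertilizer/watering data are invented for illustration):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical data: plant growth under two factors (fertilizer type, watering level)
df = pd.DataFrame({
    "growth":     [20, 22, 19, 24, 25, 23, 30, 28, 29, 35, 33, 34],
    "fertilizer": ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
    "watering":   ["low", "low", "low", "high", "high", "high",
                   "low", "low", "low", "high", "high", "high"],
})

# Model both main effects and their interaction, then produce the ANOVA table
model = ols("growth ~ C(fertilizer) + C(watering) + C(fertilizer):C(watering)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```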
