
Chapter 1: Defining and Collecting Data

Statistics is the science of collecting, organizing, analyzing, and interpreting DATA to make decisions.
Descriptive Statistics: Involves organizing, summarizing, and displaying data.
Descriptive statistics: the methods for collecting data, then summarizing, presenting, computing, and describing its distinct characteristics so as to give an overall picture of the object of study.
Central tendency
Variation
Skewness

Inferential Statistics: Involves using sample data to draw conclusions about a population.
Inferential statistics: the methods for estimating characteristics of a population, analyzing relationships between the phenomena under study, and making predictions or decisions based on information collected from sample observations.

Confidence interval

Hypothesis Testing

Regression

Types of data
Categorical (qualitative) variables take categories as their values such as “yes”, “no”, or “blue”, “brown”, “green”.
Numerical (quantitative) variables have values that represent a counted or measured quantity.
Discrete variables arise from a counting process.
Continuous variables arise from a measuring process.

Population: A population contains all the items or individuals of interest that you seek to study.
population: all FPT students
Sample: A sample contains only a portion of a population of interest.
sample: 100 FPT students

Data: consist of information coming from observations, counts, measurements, or responses.

Parameter: a numerical measurement describing some characteristics of a population.


parameter: average parking time of all FPT students, population variance, population standard deviation, population proportion,...
Statistic: a numerical measurement describing some characteristics of a sample.
statistic: average parking time of 100 FPT students, sample variance,…

Sources of data
When you perform the activity that collects the data, you are using a primary data source.
When the data collection part of these activities is done by someone else, you are using a secondary data source.
Primary Sources: The data collector is the one using the data for analysis:
Data from a political survey.
Data collected from an experiment. (A treatment is applied to part of a population and responses are observed.)
Observed data (A researcher observes and measures characteristics of interest of part of a population.)
Secondary Sources: The person performing data analysis is not the data collector:
Analyzing census data.
Examining data from print journals or data published on the internet.

Chapter 2: Organizing and Visualizing Variables


Big picture of Statistics
Chapter 3: Numerical Descriptive Measures
1. Measures of Central Tendency
Sample Mean = sum of values divided by the number of values.
n: sample size

580VN: Menu - 6 - 1 - Data - AC - OPTN - 2


570VN: Mode - 3 - 1 - Data - AC - Shift - 1 - 4 - 2
n: sample size
Sample Median: is the “middle” number (50% above, 50% below).

1. Rank the data set in increasing order.


2. Median position = (n+1)/2 position in the ordered data.
o If the number of values is odd, the median is the middle number.
o If the number of values is even, the median is the average of the two middle numbers.
580VN: Menu - 6 - 1 - Data - AC - OPTN - 2
Sample Mode:
Value that occurs most often. There may be no mode, or there may be several modes.
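A minimal Python sketch of these three measures (not part of the original notes), using the standard-library statistics module and the sample 23, 27, 26, 25, 40 from the variance example below:

```python
import statistics

data = [23, 27, 26, 25, 40]

mean = statistics.mean(data)        # sum of values / number of values -> 28.2
median = statistics.median(data)    # middle value of the ranked data -> 26
modes = statistics.multimode(data)  # values tied for the highest count;
                                    # here every value appears once, so there is no single mode
print(mean, median, modes)
```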

2. Measures of Variation and Shape


Sample Range
Sample Variance: Average (approximately) of squared deviations of values from the mean.

580VN: Menu - 6 - 1 - Data - AC - OPTN - 2


570VN: Mode - 3 - 1 - Data - AC - Shift - 1 - 4 - 4 - ^2
Example: data 23, 27, 26, 25, 40 → sample variance = 45.7
Sample Standard Deviation: Is the square root of the variance.

580VN: Menu - 6 - 1 - Data - AC - OPTN - 2


570VN: Mode - 3 - 1 - Data - AC - Shift - 1 - 4 - 4
Sample Coefficient of Variation: CV = (S / x̄) · 100%. Measures relative variation; always expressed as a percentage (%).

Z Score: z = (x − mean) / (standard deviation)

Z score > 3 or < −3 → X is an extreme value (outlier)
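A minimal Python sketch of the variation measures and Z scores, using the same made-up sample (23, 27, 26, 25, 40, whose sample variance is the 45.7 quoted above):

```python
import statistics

data = [23, 27, 26, 25, 40]
mean = statistics.mean(data)                 # 28.2

sample_range = max(data) - min(data)          # 40 - 23 = 17
s2 = statistics.variance(data)                # sample variance (divides by n-1) -> 45.7
s = statistics.stdev(data)                    # sample standard deviation = sqrt(45.7)
cv = s / mean * 100                           # coefficient of variation, in %

z_scores = [(x - mean) / s for x in data]     # |z| > 3 would flag an extreme value/outlier
print(sample_range, s2, round(s, 3), round(cv, 1), z_scores)
```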


Shape of a Distribution (Skewness)
Mean = Median → symmetric
Mean < Median → left-skewed
Mean > Median → right-skewed
3. Exploring Numerical Variables
Box Plots

Quartiles split the ranked data into 4 segments with an equal number of values per segment.
The first quartile, Q1, is the value for which 25% of the values are smaller and 75% are larger.
Q2 is the same as the median (50% of the values are smaller and 50% are larger).
Only 25% of the values are greater than the third quartile.
Find a quartile by determining the value in the appropriate position in the ranked data, where:

First quartile position: Q1 = (n+1)/4 ranked value.

Second quartile position: Q2 = 2(n+1)/4 = (n+1)/2 ranked value (= Median).

Third quartile position: Q3 = 3(n+1)/4 ranked value.

where n is the number of observed values.

The Interquartile Range (IQR) is Q3 – Q1 and measures the spread in the middle 50% of the data.
An outlier is an observation that is numerically distant from the rest of the data.
(< Q1 - 1.5 * IQR OR > Q3 + 1.5 * IQR).
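A minimal Python sketch of quartiles, the IQR, and the 1.5·IQR outlier check; statistics.quantiles with its default ("exclusive") method follows the (n+1)/4 position rule with linear interpolation, so hand-rounding conventions in some textbooks can differ slightly. The data are made up:

```python
import statistics

data = [23, 25, 26, 27, 30, 32, 40]

q1, q2, q3 = statistics.quantiles(data, n=4)   # Q1, median (Q2), Q3 -> 25, 27, 32
iqr = q3 - q1                                  # spread of the middle 50% of the data
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr, outliers)
```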

Five-number summary

4. Numerical Descriptive Measures for a Population


Population Mean (μ)

580VN: Menu - 6 - 1 - Data - AC - OPTN - 2


570VN: Mode - 3 - 1 - Data - AC - Shift - 1 - 4 - 2
Population Variance & Population Standard Deviation

Population Variance
580VN: Menu - 6 - 1 - Data - AC - OPTN - 2
570VN: Mode - 3 - 1 - Data - AC - Shift - 1 - 4 - 3 - ^2
Population Standard Deviation
580VN: Menu - 6 - 1 - Data - AC - OPTN - 2
570VN: Mode - 3 - 1 - Data - AC - Shift - 1 - 4 – 3
Population
Probability distribution table
The Empirical Rule
Approximately 68% of the values are within ±1 standard deviation of the mean.
Approximately 95% of the values are within ±2 standard deviations of the mean.
Approximately 99.7% of the values are within ±3 standard deviations of the mean.
5. The Covariance and the Coefficient of Correlation
The Covariance: The covariance measures the strength of the linear relationship between two numerical variables (X and Y).

Data from a population → population covariance: cov(X, Y) = R · σ_X · σ_Y

Data from a sample → sample covariance: cov(X, Y) = R · S_X · S_Y
The Coefficient of Correlation: The coefficient of correlation measures the relative strength of a linear relationship between two numerical variables (X and Y). −1 ≤ R ≤ 1
negative/positive → R
-1 <= R <= 1
R < 0: negative correlation, X increases Y decreases
R = 0: no correlation/no relationship
R > 0: positive correlation, X increases Y increases

strong/weak → |R|, R^2


0 <= |R| <=1
|R| ~0: → weak correlation
|R| ~1 → strong correlation
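A minimal Python sketch of sample covariance and correlation (statistics.covariance and statistics.correlation require Python 3.10+); the x and y values are made up:

```python
import statistics

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

cov_xy = statistics.covariance(x, y)        # sample covariance
r = statistics.correlation(x, y)            # coefficient of correlation, -1 <= R <= 1

# Consistency check with the notes above: cov(X, Y) = R * Sx * Sy
sx, sy = statistics.stdev(x), statistics.stdev(y)
print(cov_xy, r, r * sx * sy)               # cov_xy and R*Sx*Sy agree
```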

Chapter 4: Basic Probability


I. Basic probability concepts
1. Random experiment:
is a mechanism that produces a definite outcome that cannot be predicted with certainty.
Ex: Rolling a die. There are 6 possible outcomes {1, 2, 3, 4, 5, 6}.
However, none of the outcomes can be predicted exactly. 🡪 Rolling a die is a random experiment.
2. Random variable X: e.g., the number of dots when rolling a die.
Ch5: discrete random var
Ch6: continuous random var
3. Types of variables
Categorical variable: qualitative.
Numeric variable: Discrete & Continuous
Discrete variables arise from a counting process.
(e.g., number of classes you are taking). Keyword: "the number of …"; possible values 0, 1, 2, 3, …
Continuous variables arise from a measuring process.
(e.g., your annual salary, or your weight). Keyword: "the amount of …"; e.g., 50.1 kg, 55.5 kg.
4. Sample space
is the collection of all possible outcomes.
Ex: Roll a die and record the number of dots. 🡪 S = {1, 2, 3, 4, 5, 6}
P(X = 7) = 0 (impossible event); P(X < 10) = 1 (certain event)
5. Event
is a subset of the sample space.
Ex: E1 = {6}, E2 = {even} = {2, 4, 6}
6. Probability
the numerical value representing the chance, likelihood, or possibility that a certain event will occur (always between 0 and
1).

Ex: a standard deck has 52 cards. P(spade) = number of spades / total number of cards = 13/52 = 1/4.
Exam tip: read the question → find the keyword → narrow down the relevant chapter:
probability → Ch4, Ch5, Ch6, Ch7
II. Types of events
1. Impossible event and Certain event
Impossible event has no chance of occurring (probability = 0).
Certain event is sure to occur (probability = 1).
2. Simple event & Joint event
Simple event described by a single characteristic.
Ex: A day in January from all days in 2018.
Joint event described by two or more characteristics.
Ex: A day in January that is also a Wednesday from all days in 2018.
3. Complementary event
Complement of an event A (denoted A’): All events that are not part of event A.
Ex: All days from 2018 that are not in January.
4. Mutually exclusive events (Disjoint events)
Events A and B are said to be mutually exclusive if it is not possible that both occur at the same time.
Ex: Toss of a coin.
Let A be the event that the coin lands on head.
Let B be the event that the coin lands on the tail.
🡪 In a single fair coin toss, events A and B are mutually exclusive.
P(A ∩ B) = 0

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = P(A) + P(B) − 0 = P(A) + P(B)

5. Independent events
Events A and B are said to be independent if the probability of B occurring is unaffected by the occurrence of event A.
Ex: Tossing a coin twice.
Let A be the event that the first coin toss lands on heads.
Let B be the event that the second coin toss lands on heads.
🡪 Clearly the result of the first coin toss does not affect the result of the second coin toss.
🡪 Events A and B are independent.
P(A ∩ B) = P(A) · P(B)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = P(A) + P(B) − P(A) · P(B)

6. Collectively exhaustive events


One of the events must occur. The set of events covers the entire sample space.
P(A ∪ B) = 1

7. Events associated with OR A∪B


is the event that consists of all outcomes that are contained in either of the two events.

8. Events associated with AND A∩B


is the event that consists of all outcomes that are contained in both events.

9. Event E1 but not E2: consists of all outcomes that are in E1 and not in E2.

III. Graph
1. Venn diagram
2. Contingency table

P(female | right-handed) = P(female and right-handed) / P(right-handed) = (44/100) / (87/100) = 44/87


= |female and right-handed| / |right-handed| = 44/87

3. Decision tree
IV. General addition rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

V. Conditional probability. Keywords: "if", "given that", "given". P(A | B) = P(A and B) / P(B)

Chapter 5: Discrete Probability Distributions & Chapter 6: Continuous Probability Distributions


Discrete Random Variable vs. Continuous Random Variable
Definition: Discrete variables produce outcomes that come from a counting process. Continuous variables produce outcomes that come from a measurement.
Examples (discrete): number of girls in a classroom; number of blue marbles in a bag; number of heads when flipping 5 coins; number of typos on a page; number of classes you are taking.
Examples (continuous): height of boys in a class; weight of students in a class; amount of lemonade in a jug; time it takes to run a race; lifetime of a battery.
Distributions (discrete): Binomial distribution, Poisson distribution.
Distributions (continuous): Uniform distribution, Normal distribution.

Binomial distribution (parameters n, π):
Definition: X = the number of successes in n trials.
Notation: X ~ B(n, π)
Mean: E(X) = μ = n·π (expected value = mean)
Variance: V(X) = σ² = n·π·(1−π)
Probability: P(X = x) = C(n, x) · π^x · (1−π)^(n−x) → CASIO

Poisson distribution (keyword: Poisson):
Definition: X = the number of events in a given unit of time/distance/area/volume; λ = mean (average) number of events in that unit.
Notation: X ~ P(λ)
Mean: E(X) = μ = λ
Variance: V(X) = σ² = λ
Probability: P(X = x) = e^(−λ) · λ^x / x! → CASIO

Uniform distribution (keyword: Uniform):
Notation: X ~ U(a, b)
Mean: E(X) = μ = (a + b)/2
Variance: V(X) = σ² = (b − a)²/12
Probability: P(X = x) = 0;  P(c < X < d) = (d − c)/(b − a)

Normal distribution (keyword: Normal):
Notation: X ~ N(μ, σ²)
Mean: E(X) = μ
Variance: V(X) = σ²
Probability: P(X = x) = 0;  P(c < X < d) → CASIO
Given a probability: P(X < x) = p, find x → CASIO (this "find the value" task applies to the Normal distribution; N/A for the others)
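A minimal scipy.stats sketch of the four distributions in the table; all parameter values (n = 10, π = 0.3, λ = 2, a = 0, b = 10, μ = 50, σ = 5) are made up for illustration:

```python
from scipy import stats

# Binomial: X ~ B(n, pi)
n, pi = 10, 0.3
print(stats.binom.pmf(3, n, pi))        # P(X = 3), "exactly"
print(stats.binom.cdf(3, n, pi))        # P(X <= 3), "at most" (cumulative)
print(n * pi, n * pi * (1 - pi))        # mean, variance

# Poisson: X ~ P(lam)
lam = 2
print(stats.poisson.pmf(0, lam))        # P(X = 0) = e^(-2)
print(lam, lam)                          # mean = variance = lambda

# Uniform: X ~ U(a, b); scipy parameterizes it as loc=a, scale=b-a
a, b = 0, 10
print(stats.uniform.cdf(7, loc=a, scale=b - a)
      - stats.uniform.cdf(4, loc=a, scale=b - a))   # P(4 < X < 7) = 3/10

# Normal: X ~ N(mu, sigma^2)
mu, sigma = 50, 5
print(stats.norm.cdf(60, mu, sigma) - stats.norm.cdf(45, mu, sigma))  # P(45 < X < 60)
```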

Chapter 5: Discrete Probability Distributions


1. Binomial distribution X: the number of successes in n trials
- At least: "≥" (greater than or equal to).
- At most: "≤" (less than or equal to) → on the calculator, use Binomial CD (cumulative).
- Exactly: "=" → on the calculator, use Binomial PD (point probability).

2. Poisson distribution X: the number of events in a given unit of time/distance/area/volume


3.

Chapter 6: Continuous Probability Distributions


4. Uniform distribution
5. Normal distribution
Bell Shaped
Symmetrical
Mean, Median and Mode are Equal
Location is determined by the mean, μ.
Spread is determined by the standard deviation, σ.
The random variable has an infinite theoretical range: -∞ to +∞.

Chapter 6: The Normal Distribution and Other Continuous Distributions


Contents:
1. Continuous Random Variable
2. Normal Distribution
The Standardized Normal Distribution
Find Normal Probabilities
Given a Normal Probability, find the X Value
The Empirical Rule
3. The Uniform Distribution
Properties of the Uniform Distribution: mean, variance, standard deviation
Find uniform probabilities
1. Continuous Random Variable
Continuous variables produce outcomes that come from a measurement.
e.g. annual salary
weight, in kg.
thickness of an item.
time required to complete a task.
temperature of a solution.

2. Normal Distribution (mean, variance/standard deviation) variance = (standard deviation)^2


X ~ N(mean, variance)

Bell Shaped
Symmetrical
Mean, Median and Mode are Equal
Location is determined by the mean, μ.
Spread is determined by the standard deviation, σ.
Range: The random variable has an infinite theoretical range: -∞ to +∞.
The Standardized Normal Distribution (Also known as the “Z” distribution) Z-score
Mean is 0.
Standard Deviation is 1. → Variance = 1^2 = 1

The Empirical Rule


μ ± 1σ covers about 68.26% of X’s.
μ ± 2σ covers about 95.44% of X’s.
μ ± 3σ covers about 99.73% of X’s.
Calculating Normal Probabilities

Probability is measured by the area under the curve.


The total area under the curve is 1.0, and the curve is
symmetric, so half is above the mean, half is below.
CASIO 570VN / CASIO 580VN
Continuous: P(X>5) = P(X>=5)
Discrete: P(X>5) = P(X>=6)
Given a Normal Probability. Find the X Value
CASIO 570VN / CASIO 580VN
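A minimal scipy.stats sketch of the two tasks above (find a Normal probability; given a probability, find x), as an alternative to the CASIO steps; μ = 50 and σ = 5 are made-up parameters:

```python
from scipy import stats

mu, sigma = 50, 5

# Find Normal probabilities: P(X < 45), P(X > 60), P(45 < X < 60)
p_less = stats.norm.cdf(45, mu, sigma)
p_greater = 1 - stats.norm.cdf(60, mu, sigma)
p_between = stats.norm.cdf(60, mu, sigma) - stats.norm.cdf(45, mu, sigma)

# Given a probability, find the X value: the x with P(X < x) = 0.95
x_95 = stats.norm.ppf(0.95, mu, sigma)

# Standardized normal (Z): mean 0, standard deviation 1
z = (60 - mu) / sigma                      # Z-score of x = 60
print(p_less, p_greater, p_between, x_95, z)
```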

3. Uniform Distribution (a, b)


a: min
b: max
X ~ U (a,b)

Symmetrical
Also called a rectangular distribution
Range: Any value between the smallest and largest is equally likely.

Properties of the Uniform Distribution: mean, variance, standard deviation

Chapter 7: Sampling Distributions


Contents:
1. Sampling distribution of sample mean
Central limit theorem
Describe sampling distribution of sample mean: shape, mean, standard deviation/variance
Find standard error of the mean
Find the probability
2. Sampling distribution of sample proportions
Find sample proportion
Find standard error of the proportion
Find the probability

1. Concept of Sampling Distribution


a) Definition:
i. Random sample

ii. Statistic

statistic vs. parameter

statistic: sample mean, sample standard deviation, sample variance, …
Ex: population of 10,000 people → a random sample of 1,000 people → compute the sample mean.
iii. Sampling distribution

A sampling distribution is a distribution of all the possible values of a sample statistic for a given sample size
selected from a population.
population, sample size (n: sample size)
Ex: from the same population, sample 1 of n = 100 → sample mean 1; sample 2 of n = 100 → sample mean 2; … (each sample statistic is one value in the sampling distribution)

statistic vs. parameter


b) Point Estimation of Parameters
For μ, the estimate is x̄, the sample mean.
For π, the estimate is p, the sample proportion.
For μ1 − μ2, the estimate is x̄1 − x̄2, the difference between the sample means of two independent random samples.
For π1 − π2, the estimate is p1 − p2, the difference between two sample proportions computed from two independent random samples.

                                             Mean     Standard deviation
Population                                   μ        σ
Sample                                       x̄        S
Sampling distribution of the mean            μ_x̄      σ_x̄
Sampling distribution of the proportion      μ_p      σ_p

2. Sampling Distribution of Sample Mean:


● Standard Error of the Mean / standard deviation of the sample mean: σ_X̄ = σ/√n
● Shape:
● If the Population is Normal with mean μ and standard deviation σ, the sampling distribution of X̄ is also exactly normally distributed, with mean μ_X̄ = μ and standard error σ_X̄ = σ/√n.

● If the Population is not Normal but the sample size is large (n > 30), with mean μ and standard deviation σ, the sampling distribution of X̄ is approximately/nearly/almost normally distributed, with mean μ_X̄ = μ and standard error σ_X̄ = σ/√n.

● Central Limit Theorem: As the sample size gets large enough, the sampling distribution of the sample mean becomes
almost normal regardless of shape of population.

Central limit theorem: when the sample size is large enough, the distribution of the sample means is approximately normal, whether or not the population itself is normally distributed.
population: normal → sampling distribution of sample mean: normal
population: not normal + n>30 → sampling distribution of sample mean: approximately normal
mean, average
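A minimal Python sketch of a sampling-distribution probability for the sample mean, with made-up numbers (μ = 50, σ = 5, n = 36 > 30, so the CLT applies):

```python
import math
from scipy import stats

mu, sigma, n = 50, 5, 36
se = sigma / math.sqrt(n)                  # standard error of the mean = 5/6

# P(sample mean > 51): standardize with the standard error, not with sigma
z = (51 - mu) / se
p = 1 - stats.norm.cdf(z)
print(se, z, p)                            # 0.833..., 1.2, about 0.115
```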
3. Sampling Distribution of Sample Proportion p:
proportion of left-handed people
proportion, rate, percentage, fraction
n: 100, X=10 → p = 10/100
● Population proportion π = the proportion of the population having some characteristic.

Sample proportion p provides an estimate of π .

● Standard Error of the Proportion / standard deviation of the sample proportion: σ_p = √( π·(1−π) / n )

● The sampling distribution of p is approximately normal if n·π ≥ 5 and n·(1−π) ≥ 5, where π is the population proportion and n is the sample size.
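A minimal Python sketch for the sampling distribution of the sample proportion, with made-up numbers (π = 0.4, n = 200):

```python
import math
from scipy import stats

pi, n = 0.4, 200
assert n * pi >= 5 and n * (1 - pi) >= 5          # normal approximation is OK

se = math.sqrt(pi * (1 - pi) / n)                 # standard error of the proportion

# P(sample proportion p > 0.45)
z = (0.45 - pi) / se
print(se, z, 1 - stats.norm.cdf(z))               # se ~ 0.0346, z ~ 1.44, p ~ 0.074
```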
Chapter 8: Confidence Interval Estimation for a single sample
1. Concept of Confidence Interval

width = 2*e = upper confidence limit - lower confidence limit


e: error
General Formula for all confidence intervals is:
Point Estimate ± (Critical Value)*(Standard Error)
(Critical Value)*(Standard Error): Sampling error/Margin of error → e
● Point Estimate is the sample statistic estimating the population parameter of interest.

● Critical Value is a table value based on the sampling distribution of the point estimate and the desired confidence level.
Confidence level (1−α): the level of confidence that the interval will contain the unknown population parameter (always less than 100%).
● Standard Error is the standard deviation of the point estimate.
Interpretation
We are …% confident that the true mean/proportion … is between … and … (unit).
Although the true mean/proportion may or may not be in this interval, …% of intervals formed in this manner will contain the true
mean/proportion.

2. Confidence Intervals

a) Confidence Intervals for the Population Mean μ


i. When the Population Standard Deviation σ is known → use Z:  x̄ ± z_(α/2) · σ/√n

x̄ (x-bar): sample mean
z_(α/2): critical value, where α = 1 − confidence level
σ: population standard deviation
n: sample size
ii. When the Population Standard Deviation σ is unknown → use t:  x̄ ± t_(α/2, n−1) · S/√n

x̄ (x-bar): sample mean
t_(α/2, n−1): critical value, where α = 1 − confidence level and degrees of freedom = n − 1
S: sample standard deviation
n: sample size
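A minimal Python sketch of both confidence intervals for the mean, with made-up numbers (x̄ = 50, n = 36, 95% confidence); scipy supplies the z and t critical values:

```python
import math
from scipy import stats

x_bar, n, conf = 50, 36, 0.95
alpha = 1 - conf

# (i) sigma known -> Z interval
sigma = 5
z_crit = stats.norm.ppf(1 - alpha / 2)                 # ~1.96
e_z = z_crit * sigma / math.sqrt(n)                    # margin of error
print(x_bar - e_z, x_bar + e_z)

# (ii) sigma unknown -> t interval with df = n - 1
s = 5.4                                                # sample standard deviation
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)          # ~2.03
e_t = t_crit * s / math.sqrt(n)
print(x_bar - e_t, x_bar + e_t)
```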
b) Confidence Intervals for the Population Proportion π:  p ± z_(α/2) · √( p·(1−p) / n )

p: sample proportion
z_(α/2): critical value
n: sample size
3. Determining the Required Sample Size.

a) For the Mean:  n = ( z_(α/2) · σ / e )²

b) For the Proportion:  n = z_(α/2)² · π·(1−π) / e²
π (pi): population proportion (use a prior estimate, or 0.5 if none is available)

e: sampling error / margin of error
Keywords: "the sample mean will not differ from the true mean by more than e";
"the sample proportion will not differ from the true proportion by more than e".
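A minimal Python sketch of the two sample-size formulas, with made-up targets (95% confidence); the result is always rounded up:

```python
import math
from scipy import stats

z = stats.norm.ppf(0.975)          # z_(alpha/2) for 95% confidence, ~1.96

# (a) For the mean: sigma = 5 (assumed known/estimated), margin of error e = 1
sigma, e = 5, 1
n_mean = math.ceil((z * sigma / e) ** 2)       # always round UP
print(n_mean)                                   # 97

# (b) For the proportion: prior estimate pi = 0.4, margin of error e = 0.05
pi, e = 0.4, 0.05
n_prop = math.ceil(z ** 2 * pi * (1 - pi) / e ** 2)
print(n_prop)                                   # 369
```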

The previous chapter introduced methods for using sample statistics to estimate population parameters.

This chapter introduces the next major topic of inferential statistics: hypothesis testing.

Chapter 9 Fundamentals of Hypothesis Testing: One-Sample Tests

1. Concept of Hypothesis Testing


A hypothesis is a statement or claim about a property of a population.
A hypothesis is a claim (assertion) about a population parameter. *keyword: claim, believe, … (population mean, population
proportion)
H0: Null Hypothesis and H1: Alternative Hypothesis

One very important guideline is that no information from the sample is used in the hypothesis statements.
The null hypothesis always contains equality (=, ≤, or ≥).
If H1 contains a “ ≠ ” symbol, the hypothesis test is two-tailed. If H1 contains a “>” symbol, the hypothesis test is right-
tailed, and if H1 contains a “<” symbol, the hypothesis test is left-tailed. (Note: Some authors use only the equal symbol in
the null hypothesis statement.)
A procedure leading to a decision about a particular hypothesis is called a test of a hypothesis.

When conducting scientific research, typically there is some known information, perhaps from some past work or from a long
accepted idea. We want to test whether this claim is believable. This is the basic idea behind a hypothesis test:

· State what we think is true.


· Quantify how confident we are about our claim.
· Use sample statistics to make inferences about population parameters.

E.g.

2. Test of hypotheses on population mean (1st way: traditional / critical-value method)

1. If σ is known 🡪 Z test of hypothesis for the mean
Step 1: Formulate the two hypotheses
  Two-tailed test:   H0: μ = μ0    H1: μ ≠ μ0
  Right-tailed test: H0: μ ≤ μ0    H1: μ > μ0
  Left-tailed test:  H0: μ ≥ μ0    H1: μ < μ0
Step 2: Compute the test statistic
  z_stat (z0) = (x̄ − μ0) / (σ/√n)
  x̄: sample mean; μ0: the value stated in H0/H1; σ: population standard deviation; n: sample size
Step 3: Identify the critical value(s) / region of rejection (z or t depending on the test; the tail is given by H1; α = significance level = area of the region of rejection)
  Two-tailed: ±z_(α/2)    Right-tailed: +z_α    Left-tailed: −z_α
Step 4: Decision
  If the test statistic is in the region of rejection → Reject H0.
  If the test statistic is in the region of non-rejection → Fail To Reject H0.

2. If σ is unknown 🡪 t test of hypothesis for the mean
Step 1: Formulate the two hypotheses
  Two-tailed test:   H0: μ = μ0    H1: μ ≠ μ0
  Right-tailed test: H0: μ ≤ μ0    H1: μ > μ0
  Left-tailed test:  H0: μ ≥ μ0    H1: μ < μ0
Step 2: Compute the test statistic
  t_stat (t0) = (x̄ − μ0) / (S/√n)
Step 3: Identify the critical value(s)
  Two-tailed: ±t_(α/2, n−1)    Right-tailed: +t_(α, n−1)    Left-tailed: −t_(α, n−1)
Step 4: Decision
  If the test statistic is in the region of rejection → Reject H0.
  If the test statistic is in the region of non-rejection → Fail To Reject H0.

Step 1: Formulate two hypotheses. (State the null and alternative hypotheses for the appropriate hypothesis test.)
Step 2: Compute test statistic

Step 3: Identify the critical value (State the correct decision rule for the test of hypothesis in terms of a z- or t-test statistic and
appropriate rejection region at the indicated significance level.)
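A minimal Python sketch of the four steps for the one-sample Z test and t test of the mean (two-tailed, α = 0.05); all numbers are made up:

```python
import math
from scipy import stats

alpha, mu0, n, x_bar = 0.05, 50, 36, 52

# Z test (sigma known)
sigma = 6
z_stat = (x_bar - mu0) / (sigma / math.sqrt(n))        # Step 2
z_crit = stats.norm.ppf(1 - alpha / 2)                 # Step 3: +/- 1.96
print(z_stat, "Reject H0" if abs(z_stat) > z_crit else "Fail To Reject H0")

# t test (sigma unknown, S from the sample), df = n - 1
s = 6.5
t_stat = (x_bar - mu0) / (s / math.sqrt(n))
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
print(t_stat, "Reject H0" if abs(t_stat) > t_crit else "Fail To Reject H0")
```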

3. Test of hypotheses on population proportion 🡪 Z test of hypothesis for the proportion
Step 1: Formulate the two hypotheses
  Two-tailed test:   H0: π = π0    H1: π ≠ π0
  Right-tailed test: H0: π ≤ π0    H1: π > π0
  Left-tailed test:  H0: π ≥ π0    H1: π < π0
Step 2: Compute the test statistic
  z_stat (z0) = (p − π0) / √( π0·(1−π0) / n )
Step 3: Identify the critical value(s)
  Two-tailed: ±z_(α/2)    Right-tailed: +z_α    Left-tailed: −z_α
Step 4: Decision
  If the test statistic is in the region of rejection → Reject H0.
  If the test statistic is in the region of non-rejection → Fail To Reject H0.

Step 1: Formulate two hypotheses. (State the null and alternative hypotheses for the appropriate hypothesis test.)

Step 2: Compute test statistic

Step 3: Identify the critical value (State the correct decision rule for the test of hypothesis in terms of a z- or t-test statistic and
appropriate rejection region at the indicated significance level.)

4. Identify the type of error in hypothesis testing

Type I Error: The error of rejecting H0 when H0 is true.

Type II Error: The error of failing to reject H0 when H0 is false (H1 is true).

α = P(Type I Error) = P(Reject H0 when H0 is true)

β = P(Type II Error) = P(Fail To Reject H0 when H0 is false)

Chap 10: Two-Sample Tests


1. Comparing the Means of Two Independent Populations
a. Test of Hypotheses 🡪 Pooled-Variance t Test for the Difference Between Two Means Assuming Equal Variances
Assumption: Samples are randomly and independently drawn.
Populations are normally distributed or both sample sizes are at least 30.
Population variances are unknown but assumed equal.
Step 1: Formulate the two hypotheses
  Two-tailed test:   H0: μ1 − μ2 = μD    H1: μ1 − μ2 ≠ μD
  Right-tailed test: H0: μ1 − μ2 ≤ μD    H1: μ1 − μ2 > μD
  Left-tailed test:  H0: μ1 − μ2 ≥ μD    H1: μ1 − μ2 < μD
Step 2: Compute the test statistic
  t_stat = ( (x̄1 − x̄2) − μD ) / √( S_p² · (1/n1 + 1/n2) )
  where S_p² = [ (n1 − 1)·S1² + (n2 − 1)·S2² ] / (n1 + n2 − 2)  (pooled variance)
Step 3: Identify the critical values
  Two-tailed: ±t_(α/2, n1+n2−2)    Right-tailed: +t_(α, n1+n2−2)    Left-tailed: −t_(α, n1+n2−2)
Step 4: Decision
  If the test statistic is in the region of rejection 🡪 Reject H0.
  If the test statistic is in the region of non-rejection 🡪 Fail to Reject H0.
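A minimal Python sketch of the pooled-variance t test (two-tailed, α = 0.05, μD = 0) on two made-up samples, with scipy's built-in equal-variance test as a cross-check:

```python
import math
import statistics
from scipy import stats

x1 = [23, 25, 28, 30, 26, 27]
x2 = [20, 22, 25, 24, 21, 23]
n1, n2 = len(x1), len(x2)
s1_sq, s2_sq = statistics.variance(x1), statistics.variance(x2)

# Pooled variance and test statistic (mu_D = 0 under H0)
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
t_stat = (statistics.mean(x1) - statistics.mean(x2)) / math.sqrt(sp_sq * (1 / n1 + 1 / n2))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n1 + n2 - 2)
print(t_stat, t_crit, "Reject H0" if abs(t_stat) > t_crit else "Fail to Reject H0")

# Cross-check against scipy's built-in pooled (equal_var=True) test
print(stats.ttest_ind(x1, x2, equal_var=True))
```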
b. Confidence Interval

2. Comparing the Means of Two Related Populations


a. Test of Hypotheses 🡪 Paired t Test for the Difference Between Two Means
Assumption: Differences are normally distributed.
Or, if not Normal, use large samples.
Step 1: Formulate the two hypotheses
  Two-tailed test:   H0: μD = 0    H1: μD ≠ 0
  Right-tailed test: H0: μD ≤ 0    H1: μD > 0
  Left-tailed test:  H0: μD ≥ 0    H1: μD < 0
Step 2: Compute the test statistic
  t_stat = (D̄ − μD) / (S_D / √n)
  D̄: mean of the paired differences; S_D: standard deviation of the paired differences
Step 3: Identify the critical values
  Two-tailed: ±t_(α/2, n−1)    Right-tailed: +t_(α, n−1)    Left-tailed: −t_(α, n−1)
Step 4: Decision
  If the test statistic is in the region of rejection 🡪 Reject H0.
  If the test statistic is in the region of non-rejection 🡪 Fail to Reject H0.
b. Confidence Interval

3. Comparing the Proportions of Two Independent Populations


a. Test of Hypothesis 🡪 Z Test for the Difference Between Two Proportions
Assumption: n1·π1 ≥ 5, n1·(1−π1) ≥ 5, n2·π2 ≥ 5, and n2·(1−π2) ≥ 5.
Step 1: Formulate the two hypotheses
  Two-tailed test:   H0: π1 − π2 = πD    H1: π1 − π2 ≠ πD
  Right-tailed test: H0: π1 − π2 ≤ πD    H1: π1 − π2 > πD
  Left-tailed test:  H0: π1 − π2 ≥ πD    H1: π1 − π2 < πD
Step 2: Compute the test statistic
  z_stat = ( (p1 − p2) − πD ) / √( p̄·(1−p̄)·(1/n1 + 1/n2) )
  where p̄ = (x1 + x2) / (n1 + n2)  (pooled sample proportion)
Step 3: Identify the critical values
  Two-tailed: ±z_(α/2)    Right-tailed: +z_α    Left-tailed: −z_α
Step 4: Decision
  If the test statistic is in the region of rejection 🡪 Reject H0.
  If the test statistic is in the region of non-rejection 🡪 Fail to Reject H0.
b. Confidence Interval

4. F Test for the Ratio of Two Variances


Step 1: Formulate the two hypotheses
  Two-tailed test:   H0: σ1² − σ2² = 0    H1: σ1² − σ2² ≠ 0
  Right-tailed test: H0: σ1² − σ2² ≤ 0    H1: σ1² − σ2² > 0
Step 2: Compute the test statistic
  F_stat = S1² / S2²   (with S1² > S2²; S1²: sample variance of sample 1)
Step 3: Identify the critical value (the F distribution is right-skewed)
  Two-tailed: F_(α/2, n1−1, n2−1)    Right-tailed: F_(α, n1−1, n2−1)
Step 4: Decision
  If the test statistic is in the region of rejection 🡪 Reject H0.
  If the test statistic is in the region of non-rejection 🡪 Fail to Reject H0.


Chap 11: Analysis of Variance (F test)


❖ One-way ANOVA / ANOVA: a single factor
Compare the means of several populations based on the sample means, using a hypothesis test to decide whether these population means are all equal.
1. Assumptions:
Samples are randomly and independently selected.
Populations are normally distributed.
Populations have equal variances.
MS: Mean Square    SS: Sum of Squares    df: degrees of freedom
Step 1: Hypotheses of One-Way ANOVA
  H0: μ1 = μ2 = … = μc (all the population means are equal)
  H1: Not all of the population means are equal.
Step 2: Compute the test statistic
  F_stat = MSA / MSW, where MSA = SSA/(c−1) (among groups) and MSW = SSW/(n−c) (within groups).
Step 3: Identify the critical value
  F_(α, c−1, n−c)
  α: significance level; numerator d.f. = df1 = c−1; denominator d.f. = df2 = n−c
Step 4: Decision
  F_stat > F_(α, c−1, n−c) 🡪 Reject H0
  F_stat ≤ F_(α, c−1, n−c) 🡪 Fail to reject H0
2 One-way ANOVA Summary Table
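A minimal scipy sketch of one-way ANOVA on three made-up groups, comparing F_stat with the critical value F_(α, c−1, n−c):

```python
from scipy import stats

g1 = [23, 25, 28, 30, 26]
g2 = [20, 22, 25, 24, 21]
g3 = [27, 29, 31, 30, 28]
groups = [g1, g2, g3]
c = len(groups)                                # number of groups
n = sum(len(g) for g in groups)                # total number of observations

f_stat, p_value = stats.f_oneway(*groups)      # F test statistic and its p-value

alpha = 0.05
f_crit = stats.f.ppf(1 - alpha, dfn=c - 1, dfd=n - c)
print(f_stat, f_crit, "Reject H0" if f_stat > f_crit else "Fail to reject H0")
```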

❖ Two-way ANOVA
1. Assumptions:
Populations are normally distributed.
Populations have equal variances.
Samples are randomly and independently selected.
Hypotheses of Two-Way ANOVA (three separate F tests):
  Factor A:     H0: There is no difference in means by factor A.   H1: There is a difference in means by factor A.
  Factor B:     H0: There is no difference in means by factor B.   H1: There is a difference in means by factor B.
  Interaction:  H0: There is no interaction effect in means.       H1: There is an interaction effect between factor A and factor B in means.
Compute the test statistic (each F_stat is the corresponding factor's mean square divided by the error mean square, MSE).
Identify the critical values:
  Factor A: F_(α, r−1, rc(n′−1))    Factor B: F_(α, c−1, rc(n′−1))    Interaction: F_(α, (r−1)(c−1), rc(n′−1))
Decision (for each of the three tests):
  F_stat > F_α 🡪 Reject H0
  F_stat ≤ F_α 🡪 Fail to reject H0

2. Two-Way ANOVA Sources of Variation

Two Factors of interest: A and B

r = number of levels of factor A.

c = number of levels of factor B.

n’ = number of replications for each cell.

n = total number of observations in all cells n = (r)(c)(n’).

Xijk = value of the kth observation of level i of factor A and level j of factor B.

3. Two-way ANOVA Summary Table

p-value method
1. H0, H1
2. test statistic
3. tail
two-tail → p-value = 2*P(Z>|T.S|)
right-tail → p-value = P(Z>T.S)
left-tail → p-value = P(Z<T.S)
4. p-value < alpha → Reject H0
p-value >= alpha → Fail to reject H0
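A minimal Python sketch of the p-value method for a Z test (made-up z_stat = 2.0, two-tailed, α = 0.05); for a t test, replace stats.norm with stats.t and supply the degrees of freedom:

```python
from scipy import stats

alpha, z_stat = 0.05, 2.0

p_two_tail = 2 * (1 - stats.norm.cdf(abs(z_stat)))   # 2 * P(Z > |T.S.|)
p_right_tail = 1 - stats.norm.cdf(z_stat)            # P(Z > T.S.)
p_left_tail = stats.norm.cdf(z_stat)                 # P(Z < T.S.)

print(p_two_tail, "Reject H0" if p_two_tail < alpha else "Fail to reject H0")
```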
Chap 13: Simple Linear Regression
Purpose

Comparison → Hypothesis testing (ch10, ch11)


Relationship → Regression (ch13, ch14)
Regression 🡪 Predict
Example: weight = height·a + b
height 🡪 weight
education level 🡪 income
leaf length 🡪 leaf area
house area 🡪 house price
income 🡪 spending
Regression analysis is used to:
Predict the value of a dependent variable based on the value of at least one independent variable. (>=1)
Explain the impact of changes in an independent variable on the dependent variable.
Simple linear regression 🡪 Find a linear equation between 2 variables. (Independent variable X and dependent variable Y)
Example: height 🡪 weight: X = height, Y = weight
education level 🡪 income: X = education level, Y = income
leaf length 🡪 leaf area: X = leaf length, Y = leaf area
house area 🡪 house price: X = house area, Y = house price
income 🡪 spending: X = income, Y = spending
1. Simple Linear Regression model:
Dependent variable (Y): the variable we wish to predict or explain.
Independent variable (X): the variable used to predict or explain the dependent variable.
Only one independent variable, X.
Relationship between X and Y is described by a linear function.
a linear function y = ax + b
Changes in Y are assumed to be related to changes in X.
2. Equation of linear regression:  ŷ = b0 + b1·x

x: value of the independent variable

ŷ: predicted value of the dependent variable

b0: point estimate for the Y-intercept

b1: point estimate for the slope

b1 < 0 → as X increases, Y decreases → negative relationship

b1 = 0 → as X increases, Y does not change → no relationship

b1 > 0 → as X increases, Y increases → positive relationship

Interpretation of the Intercept b0 and the Slope b1


b0 is the estimated mean value of Y when the value of X is zero.

b1 is the estimated change in the mean value of Y as a result of a one-unit increase in X.

Compute b0: 1st way CASIO; 2nd way Excel.
Compute b1: 1st way CASIO; 2nd way Excel.
Use linear regression to predict future values.
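A minimal Python sketch of fitting and using a simple linear regression with scipy.stats.linregress; the (x, y) data are made up:

```python
from scipy import stats

x = [150, 160, 165, 170, 175, 180]       # e.g. height (independent variable X)
y = [50, 56, 58, 63, 66, 72]             # e.g. weight (dependent variable Y)

res = stats.linregress(x, y)
b1, b0, r = res.slope, res.intercept, res.rvalue

x_new = 172
y_hat = b0 + b1 * x_new                  # predicted value for a new X
print(b0, b1, r, r**2, y_hat)            # intercept, slope, R, R^2, prediction
```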

3. |R| and R² 🡪 strength of the linear relationship (strong/weak); the sign of R 🡪 positive/negative correlation

Sample correlation coefficient: R = cov(X, Y) / (S_X · S_Y)
−1 ≤ R ≤ 1
R < 0 → as X increases, Y decreases → negative correlation
R = 0 → as X increases, Y does not change → no correlation
R > 0 → as X increases, Y increases → positive correlation

R and b1 have the same sign.

Coefficient of determination (R Square): R² = SSR / SST

Meaning: the portion (%) of the total variation in the dependent variable that is explained by variation in the independent variable.
|R| and R² both measure the strength of a linear relationship.

0 ≤ |R| ≤ 1, 0 ≤ R² ≤ 1
|R| ≈ 0 → weak correlation
|R| ≈ 1 → strong correlation
4. Compute sums of squares: SST = SSR + SSE
SST = Σ(Yi − Ȳ)²: total sum of squares; SSR = Σ(Ŷi − Ȳ)²: regression sum of squares; SSE = Σ(Yi − Ŷi)²: error sum of squares

5. Standard error of the estimate: S_YX = √( SSE / (n − 2) )

Meaning: a measure of the variation of the observed Y values around the regression line.
6. Assumptions of Regression L.I.N.E
Linearity: The relationship between X and Y is linear.
Independence of Errors: Error values are statistically independent.
Particularly important when data is collected over a period of time.
Normality of Error: Error values are normally distributed for any given value of X.
Equal Variance (also called homoscedasticity): The probability distribution of the errors has constant variance.
7. Residual Analysis

Meaning: a residual (e_i = Y_i − Ŷ_i) is the difference between the observed and the predicted value of Y for observation i.


8. Hypothesis Testing
t test for a population slope (β1). Claim: Is there a linear relationship between X and Y?
  Step 1: Null and alternative hypotheses
    H0: β1 = 0 (no linear relationship)
    H1: β1 ≠ 0 (a linear relationship does exist)
  Step 2: Test statistic
    t_stat = (b1 − β1) / S_b1
  Step 3: Critical values
    ±t_(α/2, n−2)
  Step 4: Decision
    If the test statistic is in the region of rejection 🡪 Reject H0; if it is in the region of non-rejection 🡪 Fail to Reject H0.

t test for a correlation coefficient. Claim: Is there evidence of a linear relationship between X and Y?
  (ρ: population coefficient of correlation; R (r): sample coefficient of correlation)
  Step 1: Null and alternative hypotheses
    H0: ρ = 0 (no correlation)
    H1: ρ ≠ 0 (correlation exists)
  Step 2: Test statistic
    t_stat = (r − ρ) / √( (1 − r²) / (n − 2) )
  Step 3: Critical values
    ±t_(α/2, n−2)
  Step 4: Decision
    If the test statistic is in the region of rejection 🡪 Reject H0; if it is in the region of non-rejection 🡪 Fail to Reject H0.

ch8, ch9: t with df = n-1


ch10: t with df = n1 + n2 -2 (two independent populations)
t with df = n - 1 (two related populations)
ch13: t with df = n - 2

Chap 14: Introduction to Multiple Regression


Multiple Regression model with 2 independent variables

Dependent variable (Y)
Independent variables (X1, X2)
β0 = Y intercept
β1 = slope of Y with variable X1, holding X2 constant
β2 = slope of Y with variable X2, holding X1 constant
εi = random error in Y for observation i
Multiple regression equation: ŷ = b0 + b1·X1 + b2·X2
The coefficient of multiple determination (R Square): the proportion of total variation in Y explained by all X variables taken
together.

Adjusted R Square: the proportion of variation in Y explained by all X variables adjusted for the number of X variables used.
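A minimal Python sketch of a two-variable multiple regression fitted with numpy least squares, plus the R² (coefficient of multiple determination) described above; the data are made up:

```python
import numpy as np

x1 = np.array([1, 2, 3, 4, 5, 6], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5], dtype=float)
y = np.array([6, 7, 13, 14, 20, 21], dtype=float)

# Design matrix with a column of 1s for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef

y_hat = X @ coef
ssr = np.sum((y_hat - y.mean()) ** 2)          # regression sum of squares
sst = np.sum((y - y.mean()) ** 2)              # total sum of squares
r_square = ssr / sst                           # proportion of variation in Y explained

print(b0, b1, b2, r_square)
```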
