DOC MASS
Statistics is the science of collecting, organizing, analyzing, and interpreting DATA to make decisions.
Descriptive Statistics: Involves organizing, summarizing, and displaying data.
Descriptive statistics (thống kê mô tả): the methods concerned with collecting, summarizing, presenting, computing, and describing the various characteristics of data in order to give an overall picture of the object of study.
Central tendency
Variation
Skewness
Inferential Statistics: Involves using sample data to draw conclusions about a population.
Inferential statistics (thống kê suy luận): the methods for estimating the characteristics of a population, analyzing relationships among the phenomena under study, and making predictions or decisions based on information collected from sample observations.
Confidence interval
Hypothesis Testing
Regression
Types of data
Categorical (qualitative) variables take categories as their values such as “yes”, “no”, or “blue”, “brown”, “green”.
Numerical (quantitative) variables have values that represent a counted or measured quantity.
Discrete variables arise from a counting process.
Continuous variables arise from a measuring process.
Population: A population contains all the items or individuals of interest that you seek to study.
population: all FPT students
Sample: A sample contains only a portion of a population of interest.
sample: 100 FPT students
Sources of data
When you perform the activity that collects the data, you are using a primary data source.
When the data collection part of these activities is done by someone else, you are using a secondary data source.
Primary Sources: The data collector is the one using the data for analysis:
Data from a political survey.
Data collected from an experiment. (A treatment is applied to part of a population and responses are observed.)
Observed data (A researcher observes and measures characteristics of interest of part of a population.)
Secondary Sources: The person performing data analysis is not the data collector:
Analyzing census data.
Examining data from print journals or data published on the internet.
Z Score
Quartiles split the ranked data into 4 segments with an equal number of values per segment.
The first quartile, Q1, is the value for which 25% of the values are smaller and 75% are larger.
Q2 is the same as the median (50% of the values are smaller and 50% are larger).
Only 25% of the values are greater than the third quartile.
Find a quartile by determining the value in the appropriate position in the ranked data, where:
First quartile position: Q1 = (n + 1)/4 ranked value.
Second quartile position: Q2 = (n + 1)/2 ranked value = Median.
Third quartile position: Q3 = 3(n + 1)/4 ranked value.
The Interquartile Range (IQR) is Q3 – Q1 and measures the spread in the middle 50% of the data.
An outlier is an observation that is numerically distant from the rest of the data.
(< Q1 - 1.5 * IQR OR > Q3 + 1.5 * IQR).
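A small Python sketch of the quartile-position rule and the 1.5·IQR outlier fences above (the data values and helper names are my own):

```python
# Quartiles by the (n+1)/4 position rule, plus IQR and outlier fences.
def quartile(sorted_data, position):
    """Return the ranked value at a (possibly fractional) 1-based position,
    interpolating linearly between neighbours."""
    lower = int(position) - 1            # 0-based index of the lower neighbour
    frac = position - int(position)
    if frac == 0:
        return sorted_data[lower]
    return sorted_data[lower] + frac * (sorted_data[lower + 1] - sorted_data[lower])

def five_number_summary(data):
    data = sorted(data)
    n = len(data)
    q1 = quartile(data, (n + 1) / 4)
    q2 = quartile(data, (n + 1) / 2)     # the median
    q3 = quartile(data, 3 * (n + 1) / 4)
    return data[0], q1, q2, q3, data[-1]

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]   # n = 9, already ranked
lo, q1, q2, q3, hi = five_number_summary(data)
iqr = q3 - q1
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(q1, q2, q3, iqr, outliers)
```

With n = 9, Q1 sits at position 2.5 (value 12.5) and Q3 at position 7.5 (value 19.5), so IQR = 7 and no point falls outside the fences.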
Five-number summary
Population Variance
580VN: Menu - 6 - 1 - Data - AC - OPTN - 2
570VN: Mode - 3 - 1 - Data - AC - Shift - 1 - 4 - 3 - ^2
Population Standard Deviation
580VN: Menu - 6 - 1 - Data - AC - OPTN - 2
570VN: Mode - 3 - 1 - Data - AC - Shift - 1 - 4 – 3
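The same population variance and standard deviation the calculator steps produce can be checked with Python's statistics module (a sketch; the data values are invented):

```python
# Population variance divides by N (pvariance/pstdev), unlike the
# sample versions which divide by n-1 (variance/stdev).
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
mu = statistics.mean(data)        # population mean
var = statistics.pvariance(data)  # population variance
sd = statistics.pstdev(data)      # population standard deviation
print(mu, var, sd)
```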
Population
Probability distribution table
The Empirical Rule
Approximately 68% of the values are within ±1 standard deviation of the mean.
Approximately 95% of the values are within ±2 standard deviations of the mean.
Approximately 99.7% of the values are within ±3 standard deviations of the mean.
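A quick simulation check of the Empirical Rule, assuming a standard normal population (a sketch, not from the notes):

```python
# Draw standard normal values and count how many fall within
# 1, 2 and 3 standard deviations of the mean.
import random

random.seed(0)
values = [random.gauss(0, 1) for _ in range(100_000)]
within = lambda k: sum(abs(v) <= k for v in values) / len(values)
print(within(1), within(2), within(3))   # close to 0.68, 0.95, 0.997
```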
5. The Covariance and the Coefficient of Correlation
The Covariance: The covariance measures the strength of the linear relationship between two numerical variables (X and Y).
A deck has 52 cards.
P(spade) = number of spades / total number of cards = 13/52 = 1/4
Exam tip: read the question → find the keyword → narrow down the relevant topic.
probability → ch4, ch5, ch6, ch7
II. Types of events
1. Impossible event and Certain event
Impossible event has no chance of occurring (probability = 0).
Certain event is sure to occur (probability = 1).
2. Simple event & Joint event
Simple event described by a single characteristic.
Ex: A day in January from all days in 2018.
Joint event described by two or more characteristics.
Ex: A day in January that is also a Wednesday from all days in 2018.
3. Complementary event
Complement of an event A (denoted A’): All outcomes that are not part of event A.
Ex: All days from 2018 that are not in January.
4. Mutually exclusive events (Disjoint events)
Events A and B are said to be mutually exclusive if it is not possible that both occur at the same time.
Ex: Toss of a coin.
Let A be the event that the coin lands on head.
Let B be the event that the coin lands on the tail.
🡪 In a single fair coin toss, events A and B are mutually exclusive.
P(A ∩ B) = 0
5. Independent events
Events A and B are said to be independent if the probability of B occurring is unaffected by whether event A has occurred.
Ex: Tossing a coin twice.
Let A be the event that the first coin toss lands on heads.
Let B be the event that the second coin toss lands on heads.
🡪 Clearly the result of the first coin toss does not affect the result of the second coin toss.
🡪 Events A and B are independent.
P(A ∩ B) = P(A) · P(B)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = P(A) + P(B) − P(A) · P(B)
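The two rules above can be checked by enumerating two fair coin tosses (a small sketch; the event helpers are my own):

```python
# Enumerate the 4 equally likely outcomes of two fair coin tosses and
# verify the independence rule and the general addition rule.
from itertools import product

outcomes = list(product("HT", repeat=2))
p = lambda event: sum(event(o) for o in outcomes) / len(outcomes)

A = lambda o: o[0] == "H"            # first toss lands on heads
B = lambda o: o[1] == "H"            # second toss lands on heads

p_a, p_b = p(A), p(B)
p_and = p(lambda o: A(o) and B(o))
p_or = p(lambda o: A(o) or B(o))
print(p_and, p_a * p_b)              # independence: P(A ∩ B) = P(A)·P(B)
print(p_or, p_a + p_b - p_and)       # general addition rule
```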
III. Graph
1. Venn diagram
2. Contingency table
3. Decision tree
IV. General addition rule
Bell Shaped
Symmetrical
Mean, Median and Mode are Equal
Location is determined by the mean, μ.
Spread is determined by the standard deviation, σ.
Range: The random variable has an infinite theoretical range: -∞ to +∞.
The Standardized Normal Distribution (Also known as the “Z” distribution) Z-score
Mean is 0.
Standard Deviation is 1. → Variance = 1^2 = 1
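A one-line sketch of standardizing a value with Z = (X − μ)/σ (the exam-score numbers are invented):

```python
# Z tells us how many standard deviations X lies from the mean.
def z_score(x, mu, sigma):
    return (x - mu) / sigma

z = z_score(85, mu=70, sigma=10)
print(z)   # 1.5 -> the value lies 1.5 standard deviations above the mean
```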
Symmetrical
Also called a rectangular distribution
Range: Any value between the smallest and largest is equally likely.
ii. Statistic
A sampling distribution is a distribution of all the possible values of a sample statistic for a given sample size
selected from a population.
From one population, each sample of a given size (e.g., n = 100) produces its own sample mean.
n: sample size
If the population is not normal but the sample size is large (n > 30), with mean μ and standard deviation σ, the sampling distribution of X̄ is approximately normally distributed with mean μ and standard error σ/√n.
● Central Limit Theorem: As the sample size gets large enough, the sampling distribution of the sample mean becomes almost normal regardless of the shape of the population.
population: normal → sampling distribution of sample mean: normal
population: not normal + n>30 → sampling distribution of sample mean: approximately normal
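A small simulation of the Central Limit Theorem: the population below is uniform (clearly not normal), yet with n = 50 the sample means are centered at μ with spread close to σ/√n (a sketch; the sizes are my choice):

```python
# Draw many samples of size n from a Uniform(0, 1) population and look
# at the distribution of their sample means.
import random, statistics

random.seed(1)
n, trials = 50, 2000
sample_means = [statistics.mean(random.random() for _ in range(n))
                for _ in range(trials)]

mu, sigma = 0.5, (1 / 12) ** 0.5     # Uniform(0, 1): mean 0.5, sd ~0.2887
print(statistics.mean(sample_means))   # close to mu = 0.5
print(statistics.stdev(sample_means))  # close to sigma / sqrt(n) ~ 0.0408
```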
mean, average
3. Sampling Distribution of Sample Proportion p:
proportion of left-handed people
proportion, rate, percentage, fraction
n: 100, X=10 → p = 10/100
● Population proportion π = the proportion of the population having some characteristic.
● The sampling distribution of p is approximately a normal distribution if n·π ≥ 5 and n·(1 − π) ≥ 5, where p = X/n.
Chapter 8: Confidence Interval Estimation for a single sample
1. Concept of Confidence Interval
● Critical Value is a table value based on the sampling distribution of the point estimate and the desired confidence level.
Confidence level (1 − α): the probability that the interval will contain the unknown population parameter (less than 100%).
● Standard Error is the standard deviation of the point estimate.
Interpretation
We are …% confident that the true mean/proportion … is between … and … (unit).
Although the true mean/proportion may or may not be in this interval, …% of intervals formed in this manner will contain the true
mean/proportion.
2. Confidence Intervals
p: sample proportion
Z: critical value
n: sample size
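Putting the three symbols together, a sketch of the confidence interval for a proportion, p ± Z·√(p(1 − p)/n) (the sample counts are made up):

```python
# 95% confidence interval for a population proportion.
import math

x, n = 40, 100                       # 40 successes out of 100
p = x / n                            # sample proportion
z = 1.96                             # critical value for 95% confidence
se = math.sqrt(p * (1 - p) / n)      # standard error of p
lower, upper = p - z * se, p + z * se
print(round(lower, 4), round(upper, 4))
```

Interpretation in the template above: we are 95% confident that the true proportion is between the two printed bounds.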
3. Determining the Required Sample Size.
The previous chapter introduced methods for using sample statistics to estimate population parameters.
This chapter introduces the next major topic of inferential statistics: hypothesis testing.
One very important guideline is that no information from the sample is used in the hypothesis statements.
The null hypothesis always contains equality (=, ≤, or ≥).
If H1 contains a “ ≠ ” symbol, the hypothesis test is two-tailed. If H1 contains a “>” symbol, the hypothesis test is right-
tailed, and if H1 contains a “<” symbol, the hypothesis test is left-tailed. (Note: Some authors use only the equal symbol in
the null hypothesis statement.)
A procedure leading to a decision about a particular hypothesis is called a test of a hypothesis.
When conducting scientific research, typically there is some known information, perhaps from some past work or from a long
accepted idea. We want to test whether this claim is believable. This is the basic idea behind a hypothesis test:
The tail of the test is determined by H1; alpha = significance level = area of the region of rejection.
Step 2: Compute the test statistic: t_stat = (x̄ − μ0) / (s / √n).
Step 1: Formulate two hypotheses. (State the null and alternative hypotheses for the appropriate hypothesis test.)
Step 2: Compute test statistic
Step 3: Identify the critical value (State the correct decision rule for the test of hypothesis in terms of a z- or t-test statistic and
appropriate rejection region at the indicated significance level.)
Step 2: Compute the test statistic: z_stat = (p − π0) / √(π0(1 − π0) / n).
Step 3: Identify the critical values: ±z_α/2 (two-tailed), +z_α (right-tailed), −z_α (left-tailed).
Step 1: Formulate two hypotheses. (State the null and alternative hypotheses for the appropriate hypothesis test.)
Step 3: Identify the critical value (State the correct decision rule for the test of hypothesis in terms of a z- or t-test statistic and
appropriate rejection region at the indicated significance level.)
Step 2: Compute the test statistic.
Step 3: Identify the critical values: ±t_α/2, n−1 (two-tailed), +t_α, n−1 (right-tailed), −t_α, n−1 (left-tailed).
Step 4: Decision: If the test statistic is in the region of rejection 🡪 Reject H0. If the test statistic is in the region of non-rejection 🡪 Fail to reject H0.
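The steps above can be sketched for a one-sample t test: t_stat = (x̄ − μ0)/(s/√n) with n − 1 degrees of freedom (the sample data and μ0 are invented):

```python
# One-sample t statistic; compare against +/- t(alpha/2, n-1) from a table.
import math, statistics

sample = [19.8, 21.2, 20.5, 19.1, 22.0, 20.9, 18.7, 21.5]
mu0 = 20                                     # hypothesized mean
n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)                 # sample standard deviation
t_stat = (x_bar - mu0) / (s / math.sqrt(n))
print(round(t_stat, 3))
```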
Step 3: Identify the critical values: ±z_α/2 (two-tailed), +z_α (right-tailed), −z_α (left-tailed).
Step 4: Decision: If the test statistic is in the region of rejection 🡪 Reject H0. If the test statistic is in the region of non-rejection 🡪 Fail to reject H0.
b. Confidence Interval
Step 1: Formulate the hypotheses.
Step 2: Compute the test statistic: F_stat = S1² / S2² (with S1² > S2²).
Step 3: Identify the critical values: F_α/2, n1−1, n2−1 (two-tailed), F_α, n1−1, n2−1 (right-tailed). The F distribution is right-skewed.
Step 2: Compute the test statistic.
Step 3: Identify the critical values: ±t_α/2, n−1 (two-tailed), +t_α, n−1 (right-tailed), −t_α, n−1 (left-tailed).
Step 4: Decision: If the test statistic is in the region of rejection 🡪 Reject H0. If the test statistic is in the region of non-rejection 🡪 Fail to reject H0.
b. Confidence Interval
3. Comparing the Proportions of Two Independent Populations
a. Test of Hypothesis 🡪 Z Test for the Difference Between Two Proportions
Assumption: n1·π1 ≥ 5, n1·(1 − π1) ≥ 5, n2·π2 ≥ 5, n2·(1 − π2) ≥ 5.
Step 3: Identify the critical values: ±z_α/2 (two-tailed), +z_α (right-tailed), −z_α (left-tailed).
Step 4: Decision: If the test statistic is in the region of rejection 🡪 Reject H0. If the test statistic is in the region of non-rejection 🡪 Fail to reject H0.
b. Confidence Interval
4. F Test for the Ratio of Two Variances
Step 1: Formulate the hypotheses.
Two-tailed test: H0: σ1² − σ2² = 0; H1: σ1² − σ2² ≠ 0.
Right-tailed test: H0: σ1² − σ2² ≤ 0; H1: σ1² − σ2² > 0.
Step 2: Compute the test statistic: F_stat = S1² / S2² (with S1² > S2², where S1² is the sample variance of sample 1).
Step 3: Identify the critical values: F_α/2, n1−1, n2−1 (two-tailed), F_α, n1−1, n2−1 (right-tailed). The F distribution is right-skewed.
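The F statistic itself is easy to compute; a sketch with invented samples (the critical value still comes from an F table):

```python
# F = S1^2 / S2^2 with the larger sample variance in the numerator.
import statistics

sample1 = [21, 25, 19, 30, 27, 24]
sample2 = [22, 23, 24, 22, 23, 24]
v1 = statistics.variance(sample1)    # sample variance (divide by n-1)
v2 = statistics.variance(sample2)
f_stat = max(v1, v2) / min(v1, v2)   # larger variance on top
print(round(f_stat, 3))
```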
Critical value: F_α, c−1, n−c
α : significance level
Numerator d.f = df1 = c−1
Denominator d.f = df2 = n−c
Step 4: Decision: Reject H0 if F_stat > F_α, c−1, n−c.
❖ Two-way ANOVA
1. Assumptions:
Populations are normally distributed.
Populations have equal variances.
Samples are randomly and independently selected.
Hypotheses of Two-Way ANOVA:
Factor A: H0: There is no difference in means by factor A. H1: There is a difference in means by factor A.
Factor B: H0: There is no difference in means by factor B. H1: There is a difference in means by factor B.
Interaction: H0: There is no interaction effect in means. H1: There is an interaction effect between factor A and factor B in means.
Compute the test statistic for each effect, then identify the critical values:
Factor A: F_α, r−1, rc(n′−1); Factor B: F_α, c−1, rc(n′−1); Interaction: F_α, (r−1)(c−1), rc(n′−1).
Xijk = value of the kth observation of level i of factor A and level j of factor B.
p-value method
1. H0, H1
2. test statistic
3. tail
two-tail → p-value = 2*P(Z>|T.S|)
right-tail → p-value = P(Z>T.S)
left-tail → p-value = P(Z<T.S)
4. p-value < alpha → Reject H0
p-value >= alpha → Fail to reject H0
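The p-value steps above, sketched for a Z test using the standard normal CDF (math.erf is in the stdlib; the z value is invented):

```python
# p-values for two-, right- and left-tailed Z tests.
import math

def phi(z):                          # standard normal CDF, P(Z <= z)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z_stat = 2.0
p_two = 2 * (1 - phi(abs(z_stat)))   # two-tailed
p_right = 1 - phi(z_stat)            # right-tailed
p_left = phi(z_stat)                 # left-tailed

alpha = 0.05
print(round(p_two, 4), p_two < alpha)   # reject H0 at the 5% level?
```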
Chap 13: Simple Linear Regression
Purpose
Compute b 0
1st way CASIO
2nd Excel
Compute b 1
1st way CASIO
2nd Excel
Use linear regression to predict future values
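Besides CASIO and Excel, a sketch of the least-squares computations behind b0 and b1: b1 = S_xy/S_xx and b0 = ȳ − b1·x̄, then ŷ = b0 + b1·x for prediction (the data are invented):

```python
# Simple linear regression coefficients from first principles.
import statistics

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_bar, y_bar = statistics.mean(x), statistics.mean(y)

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx                     # slope
b0 = y_bar - b1 * x_bar              # intercept
predict = lambda x_new: b0 + b1 * x_new
print(b0, b1, predict(6))
```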
Sample correlation coefficient R = S_xy / √(S_xx · S_yy)
−1 ≤ R ≤ 1
R < 0 → as X increases, Y decreases → negative correlation
R = 0 → as X increases, Y does not change → no correlation
R > 0 → as X increases, Y increases → positive correlation
Meaning: the portion (%) of the total variation in the dependent variable that is explained by variation in the
independent variable.
|R| and R² both measure the strength of a linear relationship.
0 ≤|R|≤ 1, 0 ≤ R2 ≤ 1
|R| ~ 0 → weak correlation
|R| ~ 1 → strong correlation
Meaning: a measure of the variation of the observed Y values around the regression line.
6. Assumptions of Regression L.I.N.E
Linearity: The relationship between X and Y is linear.
Independence of Errors: Error values are statistically independent.
Particularly important when data is collected over a period of time.
Normality of Error: Error values are normally distributed for any given value of X.
Equal Variance (also called homoscedasticity): The probability distribution of the errors has constant variance.
7. Residual Analysis
Step 3: Identify the critical values: ±t_α/2, n−2.
Step 4: If the test statistic is in the region of rejection 🡪 Reject H0. If the test statistic is in the region of non-rejection 🡪 Fail to reject H0.
Adjusted R Square: the proportion of variation in Y explained by all X variables adjusted for the number of X variables used.