6 Correlation and Linear Regression

The document covers fundamental concepts in data management, focusing on correlation and linear regression analysis. It explains how to quantify the relationship between two continuous variables using the Pearson correlation coefficient, detailing the interpretation of correlation values and the conditions necessary for regression analysis. Additionally, it provides examples and formulas for calculating correlation and regression, emphasizing the importance of establishing a strong correlation before making predictions.


Lesson 4

DATA MANAGEMENT
Lesson Coverage
◦ Basic Statistical Concepts
◦ Measures of Central Tendency
◦ Measures of Dispersion
◦ Measures of Relative Position
◦ Probability and the Normal Distribution
◦ Correlation and Linear Regression
◦ Chi-square
CORRELATION AND LINEAR REGRESSION
Lesson Coverage 6
Correlation Analysis
◦ Used to quantify the association between two
continuous variables.
◦ Estimates a sample correlation coefficient, more
specifically the Pearson Product Moment
Correlation Coefficient. The sample correlation
coefficient is denoted r.
◦ Correlation analysis measures the degree to which
change in one variable is associated with change in
the other.
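As an illustration (not part of the original slides), the coefficient can be computed directly in Python. The sketch below assumes NumPy is available, and the data values are made up.

```python
# A minimal sketch of computing Pearson's r, assuming NumPy is available.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical values of variable X
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical values of variable Y

r = np.corrcoef(x, y)[0, 1]   # Pearson r = off-diagonal entry of the 2x2 correlation matrix
print(round(r, 3))            # close to +1: a strong positive linear association
```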
Correlation Analysis
◦ Ranges between -1 and +1 and quantifies the direction and
strength of the linear association between two variables.
◦ The correlation between two variables can be positive or
negative.
◦ The sign of the correlation coefficient indicates the direction
of the association.
◦ The magnitude of the correlation coefficient indicates the
strength of the association.
Correlation Analysis

  +1     A perfect positive linear relationship
  +0.7   A strong positive linear relationship
  +0.5   A moderate positive linear relationship
  +0.3   A weak positive linear relationship
   0     No relationship
  −0.3   A weak negative linear relationship
  −0.5   A moderate negative linear relationship
  −0.7   A strong negative linear relationship
  −1     A perfect negative linear relationship

◦ A correlation of r = 0.9 suggests a strong, positive association between two variables.
◦ A correlation of r = −0.2 suggests a weak, negative association.
◦ A correlation close to zero suggests no linear association between two continuous variables.
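For reference, a small Python helper (a sketch, not from the slides; the function name describe_r is ours) that maps a value of r to the verbal labels in the scale above:

```python
# A sketch mapping r to the slide's verbal scale; cut-offs 0.3, 0.5, 0.7, 1 follow the table.
def describe_r(r: float) -> str:
    direction = "positive" if r > 0 else "negative"
    a = abs(r)
    if a == 1:
        return f"perfect {direction} linear relationship"
    if a >= 0.7:
        return f"strong {direction} linear relationship"
    if a >= 0.5:
        return f"moderate {direction} linear relationship"
    if a >= 0.3:
        return f"weak {direction} linear relationship"
    return "little or no linear relationship"

print(describe_r(0.9))    # strong positive linear relationship
print(describe_r(-0.6))   # moderate negative linear relationship
```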
◦ Scenario 1 depicts a strong positive association (r = 0.9), similar to what we might see for the correlation between infant birth weight and birth length.
◦ Scenario 2 depicts a weaker association (r = 0.2) that we might expect to see between age and body mass index (which tends to increase with age).
◦ Scenario 3 might depict the lack of association (r approximately 0) between the extent of media exposure in adolescence and the age at which adolescents initiate sexual activity.
◦ Scenario 4 might depict the strong negative association (r = −0.9) generally observed between the number of hours of aerobic exercise per week and percent body fat.
◦The closer r is to zero, the weaker the linear relationship.
◦Positive r values indicate a positive correlation, where the
values of both variables tend to increase together.
◦Negative r values indicate a negative correlation, where the
values of one variable tend to increase when the values of
the other variable decrease.
Example: A small study is conducted involving 17 infants to investigate the association between gestational age at birth, measured in weeks, and birth weight, measured in grams.

Infant ID #   Age (wks)   Weight (gm)
1             34.7        1895
2             36.0        2030
3             29.3        1440
4             40.1        2835
5             35.7        3090
6             42.4        3827
7             40.3        3260
8             37.3        2690
9             40.9        3285
10            38.3        2920
11            38.5        3430
12            41.4        3657
13            39.7        3685
14            39.7        3345
15            41.1        3260
16            38.0        2680
17            38.7        2005

[Scatter diagram: birth weight (gm) plotted against gestational age (wks)]
Correlation Formula

$$r = \frac{\mathrm{Cov}(x, y)}{\sqrt{s_x^2 \cdot s_y^2}}$$

where

$$\mathrm{Cov}(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}, \qquad
s_x^2 = \frac{\sum (x - \bar{x})^2}{n - 1}, \qquad
s_y^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$

Equivalently,

$$r = \frac{\dfrac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}}
{\sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1} \cdot \dfrac{\sum (y - \bar{y})^2}{n - 1}}}$$
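A minimal Python sketch of this formula, assuming plain lists of equal length. It mirrors the slide's n − 1 (sample) denominators, which cancel in the final ratio; the function name pearson_r is ours.

```python
# A sketch of the correlation formula above, using plain Python lists.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)   # Cov(x, y)
    sx2 = sum((xi - mx) ** 2 for xi in x) / (n - 1)                      # sample variance of x
    sy2 = sum((yi - my) ** 2 for yi in y) / (n - 1)                      # sample variance of y
    return cov / (sx2 * sy2) ** 0.5
```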
Infant ID #   x = Age (wks)   y = Weight (gm)
1 34.7 1895
2 36.0 2030
3 29.3 1440
4 40.1 2835
5 35.7 3090
6 42.4 3827
7 40.3 3260
8 37.3 2690
9 40.9 3285
10 38.3 2920
11 38.5 3430
12 41.4 3657
13 39.7 3685
14 39.7 3345
15 41.1 3260
16 38.0 2680
17 38.7 2005
Sum 652.1 49334
With $\sum (x - \bar{x})(y - \bar{y}) = 28768.4$, $\sum (x - \bar{x})^2 = 159.45$, and $\sum (y - \bar{y})^2 = 7767660$ (the $n - 1$ terms cancel):

$$r = \frac{28768.4}{\sqrt{(159.45)(7767660)}} = \frac{28768.4}{\sqrt{1238553387}} = \frac{28768.4}{35193.0872} = 0.82$$

The sample correlation coefficient indicates a strong positive linear relationship.
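To check the result, the same computation can be run in Python, reusing the pearson_r sketch given after the correlation formula above; the list names age_wks and weight_gm are ours.

```python
# Checking the worked example: gestational age (weeks) and birth weight (grams) for 17 infants.
age_wks = [34.7, 36.0, 29.3, 40.1, 35.7, 42.4, 40.3, 37.3, 40.9,
           38.3, 38.5, 41.4, 39.7, 39.7, 41.1, 38.0, 38.7]
weight_gm = [1895, 2030, 1440, 2835, 3090, 3827, 3260, 2690, 3285,
             2920, 3430, 3657, 3685, 3345, 3260, 2680, 2005]

print(round(pearson_r(age_wks, weight_gm), 2))   # 0.82, a strong positive linear relationship
```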
Let’s Do This:
A researcher decides to examine whether there is a correlation between
cost of internet service per month (in hundreds) and degree of customer
satisfaction (on a scale of 1 – 10 with 1 being not at all satisfied and 10
being extremely satisfied). The researcher only includes programs with
comparable types of services. Sample data are provided below.
Cost (in hundreds) Satisfaction
22 6
36 8
34 10
30 4
18 9
10 6
24 3
38 5
44 2
50 10
Using the cost values as x and the satisfaction scores as y:

$$r = \frac{24.2}{\sqrt{(1352.4)(74.1)}} = 0.08$$

The result indicates that there is essentially no linear relationship between the amount of money one spends on an internet service provider and the degree of customer satisfaction; r is well below +0.3, the threshold for even a weak positive linear relationship.
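Again as a check, reusing the pearson_r sketch from earlier (the list names cost and satisfaction are ours):

```python
# Checking the exercise: monthly internet cost (in hundreds) against satisfaction score (1-10).
cost = [22, 36, 34, 30, 18, 10, 24, 38, 44, 50]
satisfaction = [6, 8, 10, 4, 9, 6, 3, 5, 2, 10]

print(round(pearson_r(cost, satisfaction), 2))   # 0.08, essentially no linear relationship
```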
Regression Analysis
◦ One of the most important statistical techniques for
business applications. It’s a statistical methodology
that helps estimate the strength and direction of the
relationship between two or more variables.
Using Linear Regression to Predict an
Outcome
◦ Statistical researchers often use a linear relationship to predict the (average)
numerical value of Y for a given value of X using a straight line (called the
regression line). If you know the slope and the y-intercept of that
regression line, then you can plug in a value for X and predict the average
value for Y. In other words, you predict (the average) Y from X.

◦ If you establish at least a moderate correlation between X and Y through


both a correlation coefficient and a scatterplot, then you know they have
some type of linear relationship.
Using Linear Regression to Predict an
Outcome
◦ Never do a regression analysis unless you have already
found at least a moderately strong correlation
between the two variables. (A good rule of thumb is it
should be at or beyond either positive or negative 0.50.)
◦ If the data don’t resemble a line to begin with, you
shouldn’t try to use a line to fit the data and make
predictions (but people still try).
Using Linear Regression to Predict an
Outcome
◦ Before moving forward to find the equation for your
regression line, you have to identify which of your two
variables is X and which is Y. When doing correlations, the
choice of which variable is X and which is Y doesn’t
matter, as long as you’re consistent for all the data. But
when fitting lines and making predictions, the choice
of X and Y does make a difference.
Using Linear Regression to Predict an
Outcome
◦ Statisticians call the X-variable the explanatory
variable, because if X changes, the slope tells you
(or explains) how much Y is expected to change in
response. Therefore, the Y variable is called the
response variable. Other names for X and Y
include the independent and dependent variables,
respectively.
Using Linear Regression to Predict an
Outcome
◦ As a summary, you can only make predictions
when the two conditions are met:
➢ The scatterplot must form a linear pattern.
➢ The correlation, r, is moderate to strong (typically
beyond 0.50 or –0.50).
◦ Some researchers actually don’t check these
conditions before making predictions. Their
claims are not valid unless the two conditions
are met.
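These two conditions can be wrapped into a small guard before any prediction is attempted. The helper below is a hypothetical sketch (the names predict_y and r_threshold are ours, and a and b stand for the intercept and slope introduced formally in the next section); the linear-pattern condition still has to be judged from the scatterplot.

```python
# A hypothetical guard (sketch): refuse to predict Y from X when the sample
# correlation is weaker than the rule-of-thumb threshold |r| >= 0.50.
# The other condition, a linear pattern in the scatterplot, is checked visually,
# not automated here.
def predict_y(x_new, x, y, a, b, r_threshold=0.50):
    r = pearson_r(x, y)                # pearson_r: sketch defined earlier
    if abs(r) < r_threshold:
        raise ValueError(f"r = {r:.2f} is too weak for a valid prediction")
    return a + b * x_new               # plug x_new into the regression line y = a + b*x
```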
The figure shows examples of what various correlations look like, in terms of the strength and direction of the relationship.
Linear Regression Formula

$$y = a + bx$$

where:
◦ $y$ is the dependent variable
◦ $x$ is the independent variable
◦ $a$ is the intercept
◦ $b$ is the slope

$$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$$

$$b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$
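A Python sketch of these two formulas, assuming plain lists of equal length; the function name fit_line is ours, not from the slides.

```python
# A sketch of the intercept (a) and slope (b) formulas above, for plain lists.
def fit_line(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)               # sum of x^2
    sxy = sum(xi * yi for xi, yi in zip(x, y))   # sum of x*y
    denom = n * sxx - sx ** 2
    a = (sy * sxx - sx * sxy) / denom            # intercept
    b = (n * sxy - sx * sy) / denom              # slope
    return a, b
```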
Example: Consider the following two variables x and y; your task is to formulate the regression model.
X Y
34.86 43.04
42.58 51.88
71.73 88.55
110.77 130.69
259.95 314.17
[Scatter diagram: y plotted against x; the points lie almost exactly on a line]

The two conditions are met: the scatterplot forms a linear pattern and r = 0.99997.

$$\sum X = 519.89 \qquad \sum X^2 = 88017.46 \qquad \sum Y = 628.33 \qquad \sum XY = 106206.14 \qquad n = 5$$

$$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2} = \frac{(628.33)(88017.46) - (519.89)(106206.14)}{5(88017.46) - (519.89)^2} = 0.5212$$

$$b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} = \frac{5(106206.14) - (519.89)(628.33)}{5(88017.46) - (519.89)^2} = 1.2036$$

Regression model: $y = 0.5212 + 1.2036x$
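Running the example through the fit_line sketch above reproduces the same coefficients:

```python
# Reproducing the worked example with the fit_line sketch defined earlier.
x = [34.86, 42.58, 71.73, 110.77, 259.95]
y = [43.04, 51.88, 88.55, 130.69, 314.17]

a, b = fit_line(x, y)
print(round(a, 4), round(b, 4))   # approximately 0.5212 and 1.2036, so y = 0.5212 + 1.2036x
```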
Let’s Do This:
ABC laboratory is conducting research on height and weight and wants to know whether there is a relationship, i.e., whether weight increases as height increases. They gathered a sample of 1000 people for each category and computed the average height of each group.

Height (in cm)   Weight (in kg)
130.00           55
135.00           56
140.00           62
142.00           63
147.00           63
156.00           51

[Scatter diagram: weight (kg) plotted against height (cm)]

r = −0.1316

The two conditions are not satisfied. Thus, regression will not be valid.
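A quick check of this exercise, reusing the pearson_r sketch from earlier (the list names height_cm and weight_kg are ours), confirms the weak negative correlation:

```python
# |r| is far below 0.50, so the rule of thumb fails and no regression line
# should be fitted to these data.
height_cm = [130.0, 135.0, 140.0, 142.0, 147.0, 156.0]
weight_kg = [55, 56, 62, 63, 63, 51]

print(round(pearson_r(height_cm, weight_kg), 4))   # about -0.1316
```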
Practice: A small study is conducted involving 17 infants to investigate the association between gestational age at birth, measured in weeks, and birth weight, measured in grams (the same data set as in the earlier example).

[Scatter diagram: birth weight (gm) plotted against gestational age (wks)]

r = 0.82
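One possible way to continue the practice (a sketch, reusing the age_wks and weight_gm lists and the fit_line function from earlier): since r = 0.82 and the scatterplot is roughly linear, the two conditions are met and the regression line can be fitted.

```python
# A possible continuation of the practice: fit the regression line to the infant data.
a, b = fit_line(age_wks, weight_gm)
print(round(a, 2), round(b, 2))   # birth weight is then predicted as a + b * gestational age
```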
CORRELATION AND LINEAR REGRESSION

Utilized by: SHALEEN JEAN E. REVECHE
PPT by: Sir Arvin B. Salera