
Class 9 Validation of The Linear Regression Model

The document discusses validating linear regression models. It provides three key measures for validation: 1) Coefficient of determination (R-squared) which measures the percentage of variation in the dependent variable explained by the model. 2) Hypothesis testing of the regression coefficients to determine if an independent variable is statistically significant in predicting the dependent variable. 3) Analysis of variance to assess the overall validity of multiple linear regression models. Additional details and formulas are provided for calculating R-squared, confidence intervals, and testing hypotheses in simple and multiple linear regression analysis.

Uploaded by

Sumana Basu

Validation of the Linear Regression Model
Validation of the Simple Linear Regression Model

It is important to validate the regression model, confirming its validity and goodness of fit, before it is used for practical applications. The following measures are used to validate a simple linear regression model:

• Coefficient of determination (R-square).
• Hypothesis tests for the regression coefficients.
• Analysis of variance for overall model validity (relevant mainly for multiple linear regression).

These measures and tests are essential, but not exhaustive.
Coefficient of Determination (R-Square or R2)
• The coefficient of determination (R-square or R2) measures the percentage of variation in Y explained by the model (β0 + β1X).
• The variation in Y under the simple linear regression model can be broken into explained variation and unexplained variation, as described below.

In the absence of a predictive model for Yi, users would fall back on the mean value of Y. Thus, the total variation is measured as the difference between Yi and the mean value of Y (i.e., Yi − Ȳ).
Description of total variation, explained variation
and unexplained variation

• Total variation (SST), measured by (Yi − Ȳ): the difference between the actual value and the mean value of Y.
• Variation explained by the model (SSR), measured by (Ŷi − Ȳ): the difference between the estimated value of Yi and the mean value of Y.
• Variation not explained by the model (SSE), measured by (Yi − Ŷi): the difference between the actual value and the predicted value of Yi (the error in prediction).
The relationship between the total variation, explained variation and
the unexplained variation is given as follows:
$$\underbrace{Y_i - \bar{Y}}_{\text{Total variation in } Y} = \underbrace{\hat{Y}_i - \bar{Y}}_{\text{Variation explained by the model}} + \underbrace{Y_i - \hat{Y}_i}_{\text{Variation not explained by the model}}$$

It can be proved mathematically that the sum of squares of total variation is equal to the sum of squares of explained variation plus the sum of squares of unexplained variation:

$$\underbrace{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}_{SST} = \underbrace{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}_{SSR} + \underbrace{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}_{SSE}$$

where SST is the sum of squares of total variation, SSR is the sum of
squares of variation explained by the regression model and SSE is the
sum of squares of errors or unexplained variation.
Coefficient of Determination or R-Square
The coefficient of determination (R2) is given by

$$R^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$$
Coefficient of Determination or R-Square
Thus, R2 is the proportion of variation in the response variable Y explained by the regression model. The coefficient of determination (R2) has the following properties:

• The value of R2 lies between 0 and 1.
• A higher value of R2 implies a better fit, but one should be aware of spurious regression.
• In simple linear regression, the square of the correlation coefficient is equal to the coefficient of determination (i.e., r2 = R2).
• There is no minimum threshold for R2; a higher value simply implies a better fit.
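The SST = SSR + SSE decomposition and the resulting R2 can be computed directly. The sketch below uses made-up illustrative data (the helper name `fit_and_r2` is mine, not from the slides):

```python
# Minimal sketch: least-squares fit of a simple linear regression,
# then SST, SSR, SSE and R-square. Illustrative data only.
def fit_and_r2(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    ssx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / ssx                      # slope estimate
    b0 = ybar - b1 * xbar               # intercept estimate
    yhat = [b0 + b1 * xi for xi in x]   # fitted values
    sst = sum((yi - ybar) ** 2 for yi in y)               # total variation
    ssr = sum((yh - ybar) ** 2 for yh in yhat)            # explained
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained
    return b0, b1, ssr / sst, sst, ssr, sse

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, r2, sst, ssr, sse = fit_and_r2(x, y)
assert abs(sst - (ssr + sse)) < 1e-9    # SST = SSR + SSE holds numerically
```

Note that R2 is computed here as SSR/SST; the equivalent form 1 − SSE/SST gives the same value because of the decomposition.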
Spurious Regression

Number of Facebook users and the number of people who died of helium poisoning in the UK:

Year    X (Facebook users, millions)    Y (helium poisoning deaths, UK)
2004       1                             2
2005       6                             2
2006      12                             2
2007      58                             2
2008     145                            11
2009     360                            21
2010     608                            31
2011     845                            40
2012    1056                            51
Facebook users versus helium poisoning in the UK

The fitted regression model is Y = 1.9967 + 0.0465X.

The R-square value for the regression of the number of deaths due to helium poisoning in the UK on the number of Facebook users is 0.9928. That is, 99.28% of the variation in the number of deaths due to helium poisoning in the UK is "explained" by the number of Facebook users, even though no causal relationship between the two is plausible.
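Refitting the table's data reproduces the reported spurious regression (the slide's 0.0465 and 0.9928 agree with this computation up to rounding):

```python
# The Facebook/helium-poisoning data from the slide, refit by least squares.
x = [1, 6, 12, 58, 145, 360, 608, 845, 1056]   # Facebook users (millions)
y = [2, 2, 2, 2, 11, 21, 31, 40, 51]           # helium poisoning deaths, UK

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
ssx = sum((xi - xbar) ** 2 for xi in x)
ssy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / ssx                 # slope, ~0.0465
b0 = ybar - b1 * xbar          # intercept, ~1.9967
r2 = sxy ** 2 / (ssx * ssy)    # for SLR, R^2 = r^2; ~0.9928
print(b0, b1, r2)
```

A near-perfect R2 here reflects two series that both trend upward over time, not any causal link — the point of the spurious-regression warning above.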
Hypothesis Test for Regression Co-efficient (t-Test)

• The regression coefficient β1 captures the existence of a linear relationship between the response variable and the explanatory variable.
• If β1 = 0, we can conclude that there is no statistically significant linear relationship between the two variables.
The standard error of β̂1 is given by

$$S_e(\hat{\beta}_1) = \frac{S_e}{\sqrt{SS_X}}$$

where SSX = Σ(Xi − X̄)² and Se is the standard error of estimate (the standard error of the residuals), which measures the accuracy of prediction:

$$S_e = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-2}}$$

The denominator in the above equation is (n − 2) since β0 and β1 are estimated from the sample in estimating Yi, and thus two degrees of freedom are lost.
The null and alternative hypotheses for the simple linear regression model can be stated as follows:

H0: There is no relationship between X and Y
HA: There is a relationship between X and Y

• β1 = 0 would imply that there is no linear relationship between the response variable Y and the explanatory variable X. Thus, the null and alternative hypotheses can be restated as follows:

H0: β1 = 0
HA: β1 ≠ 0

• The corresponding t-statistic, with (n − 2) degrees of freedom, is

$$t = \frac{\hat{\beta}_1}{S_e(\hat{\beta}_1)}$$
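The slope t-statistic above can be sketched as follows; the data are illustrative, not from the slides:

```python
import math

# Sketch: t-test for the slope in simple linear regression,
# t = b1 / Se(b1), with n - 2 degrees of freedom.
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
ssx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ssx
b0 = ybar - b1 * xbar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se = math.sqrt(sse / (n - 2))       # standard error of estimate
se_b1 = se / math.sqrt(ssx)         # standard error of the slope
t_stat = b1 / se_b1                 # compare against t(alpha/2, n - 2)
```

A |t| much larger than the critical value t(α/2, n − 2) leads to rejecting H0: β1 = 0.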
Confidence Intervals for the Regression Coefficients β0 and β1
The standard errors of the estimates β̂0 and β̂1 are given by

$$S_e(\hat{\beta}_0) = S_e \sqrt{\frac{\sum_{i=1}^{n} X_i^2}{n \cdot SS_X}}, \qquad S_e(\hat{\beta}_1) = \frac{S_e}{\sqrt{SS_X}}$$

where Se is the standard error of the residuals,

$$S_e = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-2}}$$

and SSX = Σ(Xi − X̄)².

The interval estimates, or (1 − α)100% confidence intervals, for β1 and β0 are given by

$$\hat{\beta}_1 \pm t_{\alpha/2,\,n-2}\, S_e(\hat{\beta}_1), \qquad \hat{\beta}_0 \pm t_{\alpha/2,\,n-2}\, S_e(\hat{\beta}_0)$$
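The interval estimates above can be computed directly; the data are illustrative, and t(0.025, 4) = 2.776 is the standard t-table value for n − 2 = 4 degrees of freedom:

```python
import math

# Sketch: 95% confidence intervals for b0 and b1 in simple linear regression.
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
ssx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ssx
b0 = ybar - b1 * xbar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se = math.sqrt(sse / (n - 2))                                  # Se
se_b1 = se / math.sqrt(ssx)                                    # Se(b1)
se_b0 = se * math.sqrt(sum(xi ** 2 for xi in x) / (n * ssx))   # Se(b0)

t_crit = 2.776  # t(0.025, df = 4) from a standard t-table
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
ci_b0 = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)
```

Here the slope interval excludes 0 (the slope is significant), while the intercept interval contains 0.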
Multiple Linear Regression
• Multiple linear regression means linear in the regression parameters (the β values). For example, Y = β0 + β1X1 + β2X2 + ε is linear in the βs even if the explanatory variables are transformed (e.g., X2 = X1²).

An important task in multiple regression is to estimate the beta values (β1, β2, β3, etc.).
Co-efficient of Multiple Determination (R-Square) and
Adjusted R-Square
As in the case of simple linear regression, R-square measures the proportion of variation in the dependent variable explained by the model. The coefficient of multiple determination (R-square or R2) is given by

$$R^2 = 1 - \frac{SSE}{SST}$$

• SSE is the sum of squares of errors and SST is the sum of squares of total deviation. In the case of MLR, SSE decreases as the number of explanatory variables increases, while SST remains constant.
• To counter this, the R2 value is adjusted by normalizing both SSE and SST with their corresponding degrees of freedom. With n observations and k explanatory variables, the adjusted R-square is given by

$$R^2_{adj} = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}$$
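The penalty for adding variables can be seen numerically; the R2 values below are made up for illustration:

```python
# Sketch: adjusted R-square, equivalent to 1 - (SSE/(n-k-1)) / (SST/(n-1)),
# rewritten in terms of R2 = 1 - SSE/SST.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Adding three variables that raise R2 only from 0.900 to 0.905
# actually lowers the adjusted R-square:
a = adjusted_r2(0.900, n=30, k=2)
b = adjusted_r2(0.905, n=30, k=5)
print(a, b)
```

This is why adjusted R-square, not plain R2, is used to compare MLR models with different numbers of explanatory variables.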
Statistical Significance of Individual Variables in MLR – t-test
Checking the statistical significance of individual variables is achieved through a t-test. Note that the vector of estimated regression coefficients is given by

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$$

This means the estimated value of each regression coefficient is a linear function of the response variable. Since we assume that the residuals follow a normal distribution, Y follows a normal distribution, and the estimates of the regression coefficients also follow a normal distribution. Since the standard deviation of each regression coefficient is estimated from the sample, we use a t-test.
The null and alternative hypotheses for an individual independent variable Xi and the dependent variable Y are given, respectively, by

• H0: There is no relationship between the independent variable Xi and the dependent variable Y
• HA: There is a relationship between the independent variable Xi and the dependent variable Y

Alternatively,
• H0: βi = 0
• HA: βi ≠ 0

The corresponding test statistic, with (n − k − 1) degrees of freedom, is

$$t = \frac{\hat{\beta}_i}{S_e(\hat{\beta}_i)}$$
Validation of Overall Regression Model – F-test

Analysis of variance (ANOVA) is used to validate the overall regression model. If there are k independent variables in the model, then the null and alternative hypotheses are, respectively:

H0: β1 = β2 = β3 = … = βk = 0
H1: Not all βs are zero.

The F-statistic is given by

$$F = \frac{(SST - SSE)/k}{SSE/(n-k-1)} \sim F_{k,\,n-k-1}$$
F-test for the overall fit of the model

• The decision rule at significance level α is: reject H0 if F > F(α, k, n − k − 1), where the critical value F(α, k, n − k − 1) can be found from an F-table.
• The existence of a regression relation by itself does not assure that useful predictions can be made by using it.
• Note that when k = 1, this test reduces to the F-test for testing, in simple linear regression, whether or not β1 = 0.
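The overall F-test can be sketched as follows; the SST, SSE, n, and k values are made up, and F(0.05, 3, 21) ≈ 3.07 is taken from a standard F-table:

```python
# Sketch: overall F-test for a regression with k explanatory variables,
# F = ((SST - SSE)/k) / (SSE/(n - k - 1)). Note SST - SSE = SSR, so this
# is the ratio of mean square regression to mean square error.
def overall_f(sst, sse, n, k):
    msr = (sst - sse) / k        # mean square due to regression
    mse = sse / (n - k - 1)      # mean square error
    return msr / mse

f = overall_f(sst=500.0, sse=120.0, n=25, k=3)
f_crit = 3.07                    # approx. F(0.05, 3, 21) from an F-table
reject_h0 = f > f_crit           # reject H0: all betas are zero
```

Rejecting H0 says only that at least one βi is non-zero; the individual t-tests above identify which variables are significant.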
