Class 9 Validation of The Linear Regression Model
Validation of the Simple Linear Regression Model
The above measures and tests are essential, but not exhaustive.
Coefficient of Determination (R-Square or R2)
• The coefficient of determination (or R-square or $R^2$)
measures the percentage of variation in Y explained by the
model ($\beta_0 + \beta_1 X$).
• The simple linear regression model can be broken into
explained variation and unexplained variation as shown in
In the absence of a predictive model for $Y_i$, users will use the
mean value of $Y_i$. Thus, the total variation is measured as the
difference between $Y_i$ and the mean value of $Y_i$ (i.e., $Y_i - \bar{Y}$).
Description of total variation, explained variation
and unexplained variation
• Total variation (SST), $(Y_i - \bar{Y})$: the difference between the actual
value and the mean value.
• Variation explained by the model, $(\hat{Y}_i - \bar{Y})$: the difference
between the estimated value of $Y_i$ and the mean value of $Y$.
• Variation not explained by the model, $(Y_i - \hat{Y}_i)$: the difference
between the actual value and the predicted value of $Y_i$ (error in prediction).
The relationship between the total variation, explained variation and
the unexplained variation is given as follows:
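$$\underbrace{Y_i - \bar{Y}}_{\text{Total variation in } Y} = \underbrace{\hat{Y}_i - \bar{Y}}_{\text{Variation in } Y \text{ explained by the model}} + \underbrace{Y_i - \hat{Y}_i}_{\text{Variation in } Y \text{ not explained by the model}}$$

$$\underbrace{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}_{SST} = \underbrace{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}_{SSR} + \underbrace{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}_{SSE}$$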
where SST is the sum of squares of total variation, SSR is the sum of
squares of variation explained by the regression model and SSE is the
sum of squares of errors or unexplained variation.
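The decomposition can be checked numerically. The following is a minimal sketch, assuming an ordinary least-squares fit of a straight line; the function names and the five data points are hypothetical and chosen only for illustration.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return the least-squares (intercept, slope) for y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / ssx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def variation_components(x, y):
    """Return (SST, SSR, SSE) for the least-squares line fitted to x and y."""
    b0, b1 = fit_simple_ols(x, y)
    y_bar = mean(y)
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained by the model
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained (errors)
    return sst, ssr, sse

# Hypothetical toy data, used only to illustrate the identity SST = SSR + SSE.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
sst, ssr, sse = variation_components(x, y)
print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}, SSR + SSE = {ssr + sse:.4f}")
```

The identity holds exactly only for the least-squares estimates; for any other line the cross-product term does not vanish and SSR + SSE will generally differ from SST.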
Coefficient of Determination or R-Square
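The coefficient of determination ($R^2$) is given by

$$R^2 = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} = 1 - \frac{SSE}{SST}$$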
Year    Facebook users    Deaths due to helium poisoning in UK
2004          1                  2
2005          6                  2
2006         12                  2
2007         58                  2
2008        145                 11
2009        360                 21
2010        608                 31
2011        845                 40
2012       1056                 51
Facebook users versus helium poisoning in UK
The R-square value for the regression model between the number of deaths due to
helium poisoning in the UK and the number of Facebook users is 0.9928. That is,
99.28% of the variation in the number of deaths due to helium poisoning in the UK
is explained by the number of Facebook users. A high R-square by itself,
however, does not establish a causal relationship between the two variables.
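The quoted R-square can be reproduced from the tabulated data. The sketch below assumes a least-squares fit with deaths as the response and Facebook users as the predictor; the function and variable names are illustrative.

```python
from statistics import mean

# Data from the table above: Facebook users and UK helium-poisoning deaths, 2004-2012.
facebook_users = [1, 6, 12, 58, 145, 360, 608, 845, 1056]
helium_deaths = [2, 2, 2, 2, 11, 21, 31, 40, 51]

def r_squared(x, y):
    """R-square of the least-squares line of y on x, computed as 1 - SSE/SST."""
    x_bar, y_bar = mean(x), mean(y)
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

r2 = r_squared(facebook_users, helium_deaths)
print(f"R-square = {r2:.4f}")
```

The result is approximately 0.993, in line with the value quoted above up to rounding.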
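Hypothesis Test for Regression Coefficient (t-Test)
The standard error of the estimated slope is $S_e(\hat{\beta}_1) = S_e / \sqrt{SS_X}$,
with $SS_X = \sum_{i=1}^{n} (X_i - \bar{X})^2$. Here $S_e$ is the standard error of
estimate (or standard error of the residuals), which measures the accuracy of
prediction and is given by

$$S_e = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}}$$

The null and alternative hypotheses are
$H_0: \beta_1 = 0$
$H_A: \beta_1 \neq 0$
• The corresponding t-statistic, with $n - 2$ degrees of freedom, is given as

$$t = \frac{\hat{\beta}_1}{S_e(\hat{\beta}_1)}$$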
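As an illustration, the t-statistic for the slope can be computed for the Facebook/helium data from earlier in this section. This is a sketch under the usual simple-regression assumptions, taking the slope's standard error as $S_e / \sqrt{SS_X}$; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean

def slope_t_statistic(x, y):
    """Return (b1, se_b1, t) for testing H0: beta1 = 0 in simple linear regression."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = sqrt(sse / (n - 2))   # standard error of estimate
    se_b1 = s_e / sqrt(ssx)     # standard error of the slope estimate
    return b1, se_b1, b1 / se_b1

# Data from the Facebook/helium table earlier in this section.
facebook_users = [1, 6, 12, 58, 145, 360, 608, 845, 1056]
helium_deaths = [2, 2, 2, 2, 11, 21, 31, 40, 51]

b1, se_b1, t_stat = slope_t_statistic(facebook_users, helium_deaths)
print(f"slope = {b1:.4f}, Se(slope) = {se_b1:.5f}, t = {t_stat:.2f}")
```

With n = 9 there are 7 degrees of freedom, for which the two-sided 5% critical value from a t-table is about 2.365. The computed statistic is far larger, so the null hypothesis would be rejected even though the relationship is plainly not causal.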
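Confidence Interval for Regression Coefficients $\beta_0$ and $\beta_1$
The standard errors of the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are given by

$$S_e(\hat{\beta}_0) = \frac{S_e \times \sqrt{\sum_{i=1}^{n} X_i^2}}{\sqrt{n \times SS_X}}, \qquad S_e(\hat{\beta}_1) = \frac{S_e}{\sqrt{SS_X}}$$

where

$$S_e = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}}$$

and $SS_X = \sum_{i=1}^{n} (X_i - \bar{X})^2$.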
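The standard errors above can be turned into confidence intervals. The sketch below assumes the usual interval form, estimate ± t × standard error, which is not spelled out in the text; the critical t value is passed in by the caller (for example from a t-table), and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean

def coefficient_intervals(x, y, t_crit):
    """Return {name: (estimate, std_error, lower, upper)} for beta0 and beta1.

    t_crit is the two-sided critical t value with n - 2 degrees of freedom,
    supplied by the caller; the interval form estimate +/- t_crit * std_error
    is an assumption of this sketch.
    """
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = sqrt(sse / (n - 2))
    se_b0 = s_e * sqrt(sum(xi ** 2 for xi in x)) / sqrt(n * ssx)
    se_b1 = s_e / sqrt(ssx)
    return {
        "beta0": (b0, se_b0, b0 - t_crit * se_b0, b0 + t_crit * se_b0),
        "beta1": (b1, se_b1, b1 - t_crit * se_b1, b1 + t_crit * se_b1),
    }

# Data from the Facebook/helium table earlier in this section (n = 9, so 7 d.f.).
facebook_users = [1, 6, 12, 58, 145, 360, 608, 845, 1056]
helium_deaths = [2, 2, 2, 2, 11, 21, 31, 40, 51]

# 2.365 is the tabulated two-sided 5% critical t value for 7 degrees of freedom.
intervals = coefficient_intervals(facebook_users, helium_deaths, t_crit=2.365)
for name, (est, se, lower, upper) in intervals.items():
    print(f"{name}: estimate = {est:.4f}, Se = {se:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

An interval for the slope that excludes zero leads to the same conclusion as rejecting $H_0: \beta_1 = 0$ in the two-sided t-test at the same significance level.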
Alternatively,
• $H_0: \beta_i = 0$
• $H_A: \beta_i \neq 0$
The corresponding test statistic is given by

$$t = \frac{\hat{\beta}_i}{S_e(\hat{\beta}_i)}$$
Validation of Overall Regression Model – F-test
$H_0: \beta_1 = \beta_2 = \beta_3 = \dots = \beta_k = 0$
$H_1$: Not all $\beta$s are zero.

$$F = \frac{(SST - SSE)/k}{SSE/(n - k - 1)} \sim F_{k,\, n-k-1}$$
F-test for the overall fit of the model
• $H_0$ is rejected at significance level $\alpha$ when the computed F exceeds the
critical value $F(\alpha, k, n-k-1)$, which can be found from an F-table.
• The existence of a regression relation by itself does not assure that
useful predictions can be made by using it.
• Note that when k = 1, this test reduces to the F-test for testing in simple
linear regression whether or not $\beta_1 = 0$.
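The F-statistic and its k = 1 special case can be sketched in a few lines. The example reuses the Facebook/helium data from earlier in this section and checks numerically that, with a single predictor, F equals the square of the slope's t-statistic; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean

def overall_f_statistic(sst, sse, n, k):
    """F = ((SST - SSE) / k) / (SSE / (n - k - 1)), with (k, n - k - 1) d.f."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 observations")
    return ((sst - sse) / k) / (sse / (n - k - 1))

# Data from the Facebook/helium table earlier in this section (k = 1 predictor).
x = [1, 6, 12, 58, 145, 360, 608, 845, 1056]
y = [2, 2, 2, 2, 11, 21, 31, 40, 51]
n, k = len(x), 1

# Least-squares fit and the sums of squares it produces.
x_bar, y_bar = mean(x), mean(y)
ssx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
b0 = y_bar - b1 * x_bar
sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

f_stat = overall_f_statistic(sst, sse, n, k)
t_stat = b1 / (sqrt(sse / (n - 2)) / sqrt(ssx))  # slope t-statistic
print(f"F = {f_stat:.2f}, t^2 = {t_stat ** 2:.2f}")
```

The two printed values agree up to floating-point error, which is the k = 1 reduction noted in the last bullet. The critical value itself still has to be looked up in an F-table or computed with a statistics library.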