Regression Analysis Using R
Simple Linear Regression: A simple linear regression uses a linear relationship to predict the value of a dependent variable from the value of an independent variable. The dependent variable is also referred to as the outcome, target or criterion variable, and the independent variable as the predictor, explanatory or regressor variable. A simple linear regression is also referred to as a bivariate linear regression, or simply as a linear regression.
The basic simple linear regression model is
𝑌 = 𝛼 + 𝛽𝑋 + 𝜖
install.packages("datarium")
library(datarium)
data("marketing", package = "datarium")
head(marketing, 4)
We want to predict future sales on the basis of advertising budget spent on youtube.
Create a scatter plot displaying the sales units versus the youtube advertising budget, and add a smoothed line:
install.packages("ggplot2")
library(ggplot2)
ggplot(marketing, aes(x = youtube, y = sales)) +
geom_point() + stat_smooth()
The graph above suggests a linearly increasing relationship between the sales and youtube variables. This is a good thing, because one important assumption of linear regression is that the relationship between the outcome and predictor variables is linear and additive.
It is also possible to compute the correlation coefficient between the two variables using the R
function cor():
cor(marketing$sales, marketing$youtube)
The R function lm() can be used to determine the beta coefficients of the linear model:
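For example, fitting the model and saving it as model (the name used by the summary(model) call further below):
model <- lm(sales ~ youtube, data = marketing)
model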
Call:
lm(formula = sales ~ youtube, data = marketing)
Coefficients:
(Intercept) youtube
8.43911 0.04754
The results show the intercept and the beta coefficient for the youtube variable. The estimated regression line equation can therefore be written as follows: sales = 8.44 + 0.048*youtube, where sales denotes the predicted number of sales units.
The intercept is 8.44. It can be interpreted as the predicted sales units for a zero youtube advertising budget. Recall that we are operating in units of thousand dollars, so for a youtube advertising budget equal to zero we can expect sales of 8.44*1000 = 8440 dollars.
The regression beta coefficient for the variable youtube, also known as the slope, is 0.048. This means that for a youtube advertising budget equal to 1000 dollars we can expect an increase of 48 units (0.048*1000) in sales; that is, sales = 8.44 + 0.048*1000 = 56.44 units. As we are operating in units of thousand dollars, this represents sales of 56440 dollars.
To add the regression line onto the scatter plot, you can use stat_smooth() with method = "lm". For example:
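ggplot(marketing, aes(x = youtube, y = sales)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE)
The detailed model output can then be inspected with summary():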
summary(model)
Call:
lm(formula = sales ~ youtube, data = marketing)
Residuals:
Min 1Q Median 3Q Max
-10.0632 -2.3454 -0.2295 2.4805 8.6548
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.439112 0.549412 15.36 <2e-16 ***
youtube 0.047537 0.002691 17.67 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In our example, the p-values for both the intercept and the predictor variable are highly significant, so we can reject the null hypothesis that the coefficients are zero: there is a significant association between the predictor and the outcome variables.
The standard error measures the variability/accuracy of the beta coefficients. It can be used to
compute the confidence intervals of the coefficients.
confint(model)
2.5 % 97.5 %
(Intercept) 7.35566312 9.52256140
youtube 0.04223072 0.05284256
confint(model, level = 0.90)   # 90% confidence intervals
The R-squared (R²) ranges from 0 to 1 and represents the proportion of information (i.e. variation) in the data that can be explained by the model. The adjusted R-squared adjusts for the degrees of freedom.
The R² measures how well the model fits the data. For a simple linear regression, R² is the square of the Pearson correlation coefficient.
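This can be checked directly in R:
cor(marketing$sales, marketing$youtube)^2   # same value as summary(model)$r.squared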
A high value of R² is a good indication. However, as the value of R² tends to increase when more predictors are added to the model, as in a multiple linear regression model, you should mainly consider the adjusted R-squared, which is a penalized R² that accounts for the number of predictors.
An (adjusted) R² that is close to 1 indicates that a large proportion of the variability in the outcome has been explained by the regression model. A number near 0 indicates that the regression model did not explain much of the variability in the outcome.
The F-statistic gives the overall significance of the model. It assesses whether at least one predictor variable has a non-zero coefficient.
In a simple linear regression, this test is not really interesting since it just duplicates the information given by the t-test in the coefficient table. In fact, the F-statistic is identical to the square of the t-statistic: 312.1 ≈ (17.67)². This is true in any model with one numerator degree of freedom.
The F-statistic becomes more important once we start using multiple predictors, as in multiple linear regression.
A large F-statistic corresponds to a statistically significant p-value (p < 0.05). In our example, the F-statistic equals 312.14, producing a p-value of 1.46e-42, which is highly significant.
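The F = t² identity can be checked from the fitted object (a small sketch using the model object fitted above):
fstat <- summary(model)$fstatistic["value"]
tval <- coef(summary(model))["youtube", "t value"]
c(F = unname(fstat), t_squared = tval^2)   # equal up to rounding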
library(datasets)   # datasets is part of base R, so no install.packages() call is needed
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)
model
The linear model equation can be written as follows: dist = -17.579 + 3.932*speed, where dist denotes the predicted stopping distance. Note that the units of the variables speed and dist are mph and ft, respectively.
Using the above model, we can predict the stopping distance for new speed values.
Start by creating a new data frame containing, for example, three new speed values:
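For example (19 is the speed value discussed below; 12 and 24 are arbitrary illustrative values):
new.speeds <- data.frame(speed = c(12, 19, 24))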
You can predict the corresponding stopping distances using the R function predict() as follows:
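predict(model, newdata = new.speeds, interval = "confidence")   # 95% limits by default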
fit: the predicted stopping distances for the three new speed values.
lwr and upr: the lower and the upper confidence limits for the expected values, respectively. By default the function produces the 95% confidence limits.
For example, the 95% confidence interval associated with a speed of 19 is (51.83, 62.44). This
means that, according to our model, a car with a speed of 19 mph has, on average, a stopping
distance ranging between 51.83 and 62.44 ft.
Prediction interval
The prediction interval gives the uncertainty around a single predicted value. In the same way as the confidence intervals, the prediction intervals can be computed as follows:
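predict(model, newdata = new.speeds, interval = "prediction")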
The 95% prediction interval associated with a speed of 19 is (25.76, 88.51). This means that, according to our model, 95% of cars with a speed of 19 mph have a stopping distance between 25.76 and 88.51 ft.
Multiple Regression
Multiple regression involves a single dependent variable and two or more independent
variables.
Multiple Regression Model: The general form of the multiple regression model is as follows:
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑘 𝑋𝑘 + 𝜖
where 𝑌 is the dependent variable, 𝑋𝑗 ’s are independent variables, 𝛽0 is the intercept, 𝛽𝑗 ’s are
the partial regression coefficients and 𝜖 is the random error term.
We want to build a model for estimating sales based on the advertising budget invested in
youtube, facebook and newspaper, as follows:
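model <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
summary(model)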
Call:
lm(formula = sales ~ youtube + facebook + newspaper, data = marketing)
Residuals:
Min 1Q Median 3Q Max
-10.5932 -1.0690 0.2902 1.4272 3.3951
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.526667 0.374290 9.422 <2e-16 ***
youtube 0.045765 0.001395 32.809 <2e-16 ***
facebook 0.188530 0.008611 21.893 <2e-16 ***
newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.023 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
The first step in interpreting the multiple regression analysis is to examine the F-statistic and the associated p-value at the bottom of the model summary.
In our example, it can be seen that 𝑝-value of the 𝐹-statistic is < 2.2e-16, which is highly
significant. This means that, at least, one of the predictor variables is significantly related to the
outcome variable.
To see which predictor variables are significant, you can examine the coefficients table, which shows the estimates of the regression beta coefficients and the associated t-statistic p-values:
summary(model)$coefficients
For a given predictor, the 𝑡-statistic evaluates whether or not there is significant association
between the predictor and the outcome variable, that is whether the beta coefficient of the
predictor is significantly different from zero.
It can be seen that changes in the youtube and facebook advertising budgets are significantly associated with changes in sales, while changes in the newspaper budget are not significantly associated with sales.
For a given predictor variable, the coefficient (beta) can be interpreted as the average effect on
y of a one unit increase in predictor, holding all other predictors fixed.
For example, for a fixed amount of youtube and newspaper advertising budget, spending an
additional 1000 dollars on facebook advertising leads to an increase in sales by approximately
0.1885*1000 = 189 sale units, on average.
The youtube coefficient suggests that for every 1000 dollars increase in youtube advertising
budget, holding all other predictors constant, we can expect an increase of 0.045*1000 =
45 sales units, on average.
We found that newspaper is not significant in the multiple regression model. This means that, for a fixed amount of youtube and facebook advertising budget, changes in the newspaper advertising budget will not significantly affect sales units.
As the newspaper variable is not significant, it is possible to remove it from the model:
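model <- lm(sales ~ youtube + facebook, data = marketing)
summary(model)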
Call:
lm(formula = sales ~ youtube + facebook, data = marketing)
Residuals:
Min 1Q Median 3Q Max
-10.5572 -1.0502 0.2906 1.4049 3.3994
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.50532 0.35339 9.919 <2e-16 ***
youtube 0.04575 0.00139 32.909 <2e-16 ***
facebook 0.18799 0.00804 23.382 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.018 on 197 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8962
F-statistic: 859.6 on 2 and 197 DF, p-value: < 2.2e-16
confint(model)
2.5 % 97.5 %
(Intercept) 2.80841159 4.20222820
youtube 0.04301292 0.04849671
facebook 0.17213877 0.20384969
Model assessment: As we have seen in simple linear regression, the overall quality of the model can be assessed by examining the R-squared (R²) and the Residual Standard Error (RSE).
R-squared: In multiple linear regression, R² represents the squared correlation between the observed values of the outcome variable (y) and the fitted (i.e., predicted) values of y. For this reason, the multiple correlation R is always positive, and R² ranges from zero to one.
R² represents the proportion of variance in the outcome variable y that may be predicted by knowing the values of the x variables. An R² value close to 1 indicates that the model explains a large portion of the variance in the outcome variable.
A problem with R² is that it will always increase when more variables are added to the model, even if those variables are only weakly associated with the response. A solution is to adjust R² by taking into account the number of predictor variables. This adjustment, the "Adjusted R-squared" value in the summary output, is a correction for the number of x variables included in the prediction model.
In our example, with the youtube and facebook predictor variables, the adjusted R² = 0.89, meaning that 89% of the variance in sales can be predicted from the youtube and facebook advertising budgets.
This model is better than the simple linear model with only youtube (Chapter simple-linear-regression), which had an adjusted R² of 0.61.
To compute multiple regression using all of the predictors in the data set, simply type this:
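model <- lm(sales ~ ., data = marketing)   # "." stands for all variables other than sales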
If you want to perform the regression using all of the variables except one, say newspaper,
type this:
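model <- lm(sales ~ . - newspaper, data = marketing)   # drop newspaper from the predictors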
model$residuals       # residuals of the fitted model
model$fitted.values   # fitted (predicted) values
A more conventional way to assess model performance is to plot the residuals against different diagnostic measures.
Add the line par(mfrow=c(2,2)) before plot(). If you do not add this line of code, R prompts you to hit Enter to display the next graph.
par(mfrow=c(2,2))
(mfrow=c(2,2)): returns a window with the four graphs side by side. The first 2 sets the number of rows; the second 2 sets the number of columns. If you write (mfrow=c(3,2)), you will create a window with 3 rows and 2 columns.
plot(model)
Predicting Dependent Variable (Y): To predict, we use the predict() function. Called with interval = "confidence", it gives the predicted values along with the confidence interval; called with interval = "prediction", it gives the predicted values along with the prediction interval, as sketched below.
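A minimal sketch (the budget values in new.budgets are hypothetical, chosen within the range of the data):
new.budgets <- data.frame(youtube = 200, facebook = 30, newspaper = 20)
predict(model, newdata = new.budgets, interval = "confidence")
predict(model, newdata = new.budgets, interval = "prediction")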
We must assign a set of levels to a qualitative variable to account for the effect that the
variable may have on the response. This is done through the use of indicator/dummy
variables.
A dummy/indicator variable has two categories, usually coded as 0 and 1. Other codings are also possible (for example, 1 and -1) but are not frequently used.
Dummy variables can be incorporated in regression models just as easily as quantitative
variables.
A regression model may contain regressors that are all exclusively dummy, or
qualitative, in nature. Such models are called Analysis of Variance (ANOVA) models.
ANOVA model: Consider the data on average salary (in dollars) of public school teachers in 50
states and the District of Columbia for the year 1985. These 51 areas are classified into three
geographical regions: (1) Northeast and North Central (21 states in all), (2) South (17 states in
all), and (3) West (13 states in all).
There are various statistical techniques to compare two or more mean values, which generally
go by the name of analysis of variance. But the same objective can be accomplished within the
framework of regression analysis. To see this, consider the following model:
𝑌 = 𝛽0 + 𝛽11 𝐷1 + 𝛽12 𝐷2 + 𝜖
where 𝐷1 = 1 if the state is in the Northeast or North Central (0 otherwise), 𝐷2 = 1 if the state is in the South (0 otherwise), and the West serves as the base category.
Mean salary of public school teachers in the Northeast and North Central: 𝐸(𝑌|𝐷1 = 1, 𝐷2 = 0) = 𝛽0 + 𝛽11
Mean salary of public school teachers in the South: 𝐸(𝑌|𝐷1 = 0, 𝐷2 = 1) = 𝛽0 + 𝛽12
Mean salary of public school teachers in the West: 𝐸(𝑌|𝐷1 = 0, 𝐷2 = 0) = 𝛽0
In other words, the mean salary of public school teachers in the West is given by the intercept,
𝛽0, in the multiple regression and the slope coefficients 𝛽11 and 𝛽12 tell you how much the
mean salaries of teachers in the Northeast and North Central and in the South differ from the
mean salary of teachers in the West.
library(foreign)
data <- read.spss("C:/Users/MdZillurRahman/Dropbox/Lecture/salary.sav", to.data.frame = TRUE)
lmout1 <- lm(salary ~ D1 + D2, data = data)
summary(lmout1)
Call:
lm(formula = salary ~ D1 + D2, data = data)
Residuals:
Min 1Q Median 3Q Max
-6329.1 -2592.1 -370.6 2143.4 15321.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26159 1128 23.180 <2e-16 ***
D1 -1734 1436 -1.208 0.2330
D2 -3265 1499 -2.178 0.0344 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4069 on 48 degrees of freedom
Multiple R-squared: 0.09008, Adjusted R-squared: 0.05217
F-statistic: 2.376 on 2 and 48 DF, p-value: 0.1038
The fitted model is: salary = 26159 - 1734*D1 - 3265*D2 (coefficients rounded to the nearest dollar).
As these regression results show, the mean salary of teachers in the West is about
$26159, that of teachers in the Northeast and North Central is lower by about $1734,
and that of teachers in the South is lower by about $3265. The actual mean salaries in
the last two regions are about $24424 and $22894.
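As a quick check, the three regional means can be recovered from the fitted coefficients (a small sketch using the lmout1 object above):
b <- coef(lmout1)
c(West = b[["(Intercept)"]],
Northeast_NorthCentral = b[["(Intercept)"]] + b[["D1"]],
South = b[["(Intercept)"]] + b[["D2"]])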
An alternative is to include dummies for all three regions and drop the intercept, so that each coefficient directly estimates that region's mean salary:
lmout2 <- lm(salary ~ D1 + D2 + D3 - 1, data = data)
The category for which no dummy variable is assigned is known as the base, benchmark,
control, comparison, reference, or omitted category. And all comparisons are made in
relation to the benchmark category.
The intercept value represents the mean value of the benchmark category.
If a qualitative variable has more than one category, as in our illustrative example, the
choice of the benchmark category is strictly up to the researcher. Sometimes the choice
of the benchmark is dictated by the particular problem at hand.
Model containing interaction effect: Personal savings and personal disposable income, United
States, 1970-1995. These data relate to personal disposable (i.e., after tax) income and personal
savings, both measured in billions of dollars, in the United States for the period 1970 to 1995.
Our objective is to estimate a savings function that relates savings (𝑌) to personal disposable
income (PDI) (𝑋) for the United States for the said period. If we do that, we will be maintaining
that the relationship between savings and PDI remains the same throughout the sample period.
But that might be a tall assumption. For example, it is well known that in 1982 the United
States suffered its worst peacetime recession. The unemployment rate that year reached 9.7
percent, the highest since 1948. An event such as this might disturb the relationship between
savings and PDI. To see if this in fact happened, we can divide our sample data into two
periods, 1970 to 1981 and 1982 to 1995, the pre- and post-1982 recession periods.
data<-matrix(c(1970,61.0,727.1,0,
1971,68.6,790.2,0,
1972,63.6,855.3,0,
1973,89.6,965.0,0,
1974,97.6,1054.2,0,
1975,104.4,1159.2,0,
1976,96.4,1273.0,0,
1977,92.5,1401.4,0,
1978,112.6,1580.1,0,
1979,130.1,1769.5,0,
1980,161.8,1973.3,0,
1981,199.1,2200.2,0,
1982,205.5,2347.3,1,
1983,167.0,2522.4,1,
1984,235.7,2810.0,1,
1985,206.2,3002.0,1,
1986,196.5,3187.6,1,
1987,168.4,3363.1,1,
1988,189.1,3640.8,1,
1989,187.8,3894.5,1,
1990,208.7,4166.8,1,
1991,246.4,4343.7,1,
1992,272.6,4613.7,1,
1993,214.4,4790.2,1,
1994,189.4,5021.7,1,
1995,249.3,5320.8,1),26,4,byrow=T)
colnames(data)<-c('YEAR','PS','PDI','DV')
data<-data.frame(data)
RegOut <- lm(PS ~ PDI + DV + PDI*DV, data = data)   # PDI*DV expands to PDI + DV + PDI:DV
summary(RegOut)
Call:
lm(formula = PS ~ PDI + DV + PDI * DV, data = data)
Residuals:
Min 1Q Median 3Q Max
-38.729 -14.777 -1.398 11.689 50.535
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.01612 20.16483 0.050 0.960266
PDI 0.08033 0.01450 5.541 1.44e-05 ***
DV 152.47855 33.08237 4.609 0.000136 ***
PDI:DV -0.06547 0.01598 -4.096 0.000477 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 23.15 on 22 degrees of freedom
Multiple R-squared: 0.8819, Adjusted R-squared: 0.8658
F-statistic: 54.78 on 3 and 22 DF, p-value: 2.268e-10
Both the differential intercept (beta2, the coefficient of DV) and the differential slope coefficient (beta3, the coefficient of PDI:DV) are individually statistically significant, suggesting that the savings-income relationship changed between the two time periods.
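To make this concrete, the two period-specific savings functions implied by the estimates can be recovered from the coefficients (a sketch using the RegOut object above):
b <- coef(RegOut)   # names: (Intercept), PDI, DV, PDI:DV
rbind(pre1982 = c(intercept = unname(b[1]), slope = unname(b[2])),
post1982 = c(intercept = unname(b[1] + b[3]), slope = unname(b[2] + b[4])))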
Model containing a mixture of quantitative and qualitative regressors: Returning to the teachers' salary data, we add state spending on public schools (𝑋) as a quantitative regressor:
𝑌 = 𝛽0 + 𝛽11 𝐷1 + 𝛽12 𝐷2 + 𝛽2 𝑋 + 𝜖
library(foreign)
data <- read.spss("C:/Users/MdZillurRahman/Dropbox/Lecture/salary.sav", to.data.frame = TRUE)
lmout3 <- lm(salary ~ D1 + D2 + spending, data = data)
summary(lmout3)
Call:
lm(formula = salary ~ D1 + D2 + spending, data = data)
Residuals:
Min 1Q Median 3Q Max
-3936.4 -1517.3 -158.4 1291.6 6134.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13269.1141 1395.0557 9.512 1.57e-12 ***
D1 -1673.5144 801.1703 -2.089 0.0422 *
D2 -1144.1567 861.1182 -1.329 0.1904
spending 3.2888 0.3176 10.354 1.03e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2270 on 47 degrees of freedom
Multiple R-squared: 0.7227, Adjusted R-squared: 0.705
F-statistic: 40.82 on 3 and 47 DF, p-value: 3.875e-13
As public expenditure goes up by a dollar, a public school teacher's salary goes up, on average, by about $3.29. Controlling for spending on education, we now see that the differential intercept coefficient is significant for the Northeast and North Central region, but not for the South.