Regression Analysis Using R
Simple Linear Regression: A simple linear regression uses a linear relationship to predict the value of a dependent variable from the value of an independent variable. The dependent variable is also referred to as the outcome, target or criterion variable, and the independent variable as the predictor, explanatory or regressor variable. A simple linear regression is also referred to as a bivariate linear regression, or simply as a linear regression.
The basic simple linear regression model is
𝑌 = 𝛼 + 𝛽𝑋 + 𝜖
install.packages("datarium")
library(datarium)
data("marketing", package = "datarium")
head(marketing, 4)
We want to predict future sales on the basis of advertising budget spent on youtube.
Create a scatter plot displaying the sales units versus the youtube advertising budget, and add a smoothed line:
install.packages("ggplot2")
library(ggplot2)
ggplot(marketing, aes(x = youtube, y = sales)) +
geom_point() + stat_smooth()
The graph above suggests a linearly increasing relationship between the sales and youtube variables. This is a good thing, because one important assumption of linear regression is that the relationship between the outcome and predictor variables is linear and additive.
It is also possible to compute the correlation coefficient between the two variables using the R
function cor():
cor(marketing$sales, marketing$youtube)
The R function lm() can be used to determine the beta coefficients of the linear model:
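For example, fitting the model and saving it as model (the name used by the summary(model) call further below):
model <- lm(sales ~ youtube, data = marketing)
model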
Call:
lm(formula = sales ~ youtube, data = marketing)
Coefficients:
(Intercept) youtube
8.43911 0.04754
The results show the intercept and the beta coefficient for the youtube variable. The estimated regression line equation can therefore be written as follows: sales = 8.44 + 0.048*youtube, where sales denotes the predicted number of sales units.
The intercept is 8.44. It can be interpreted as the predicted sales units for a zero youtube advertising budget. Recall that we are operating in units of thousand dollars, so for a youtube advertising budget equal to zero we can expect sales of 8.44*1000 = 8440 dollars.
The regression beta coefficient for the variable youtube, also known as the slope, is 0.048. This means that for a youtube advertising budget equal to 1000 dollars we can expect an increase of 48 units (0.048*1000) in sales; that is, sales = 8.44 + 0.048*1000 = 56.44 units. As we are operating in units of thousand dollars, this represents sales of 56440 dollars.
To add the regression line onto the scatter plot, you can use stat_smooth() with method = "lm". For example:
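ggplot(marketing, aes(x = youtube, y = sales)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE)
The detailed model output can then be inspected with summary():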
summary(model)
Call:
lm(formula = sales ~ youtube, data = marketing)
Residuals:
Min 1Q Median 3Q Max
-10.0632 -2.3454 -0.2295 2.4805 8.6548
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.439112 0.549412 15.36 <2e-16 ***
youtube 0.047537 0.002691 17.67 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In our example, the p-values for both the intercept and the predictor variable are highly significant, so we can reject the null hypothesis that the coefficients are zero: there is a significant association between the predictor and the outcome variables.
The standard error measures the variability/accuracy of the beta coefficients. It can be used to
compute the confidence intervals of the coefficients.
confint(model)
2.5 % 97.5 %
(Intercept) 7.35566312 9.52256140
youtube 0.04223072 0.05284256
confint(model, level = 0.90)   # 90% confidence intervals
The R-squared (R²) ranges from 0 to 1 and represents the proportion of information (i.e. variation) in the data that can be explained by the model. The adjusted R-squared adjusts for the degrees of freedom.
The R² measures how well the model fits the data. For a simple linear regression, R² is the square of the Pearson correlation coefficient.
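This can be checked directly in R:
cor(marketing$sales, marketing$youtube)^2   # same value as summary(model)$r.squared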
A high value of R² is a good indication. However, as the value of R² tends to increase when more predictors are added to the model, as in a multiple linear regression model, you should mainly consider the adjusted R-squared, which is a penalized R² that accounts for the number of predictors.
An (adjusted) R² that is close to 1 indicates that a large proportion of the variability in the outcome has been explained by the regression model. A number near 0 indicates that the regression model did not explain much of the variability in the outcome.
The F-statistic gives the overall significance of the model. It assesses whether at least one predictor variable has a non-zero coefficient.
In a simple linear regression, this test is not really interesting since it just duplicates the information given by the t-test in the coefficient table. In fact, the F-statistic is identical to the square of the t-statistic: 312.1 ≈ (17.67)². This is true in any model with one numerator degree of freedom.
The F-statistic becomes more important once we start using multiple predictors, as in multiple linear regression.
A large F-statistic corresponds to a statistically significant p-value (p < 0.05). In our example, the F-statistic equals 312.14, producing a p-value of 1.46e-42, which is highly significant.
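The F = t² identity can be checked from the fitted object (a small sketch using the model object fitted above):
fstat <- summary(model)$fstatistic["value"]
tval <- coef(summary(model))["youtube", "t value"]
c(F = unname(fstat), t_squared = tval^2)   # equal up to rounding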
library(datasets)   # datasets is part of base R, so no install.packages() call is needed
data("cars", package = "datasets")
model <- lm(dist ~ speed, data = cars)
model
The linear model equation can be written as follows: dist = -17.579 + 3.932*speed, where dist denotes the predicted stopping distance. Note that the units of the variables speed and dist are mph and ft, respectively.
Using the above model, we can predict the stopping distance for new speed values.
Start by creating a new data frame containing, for example, three new speed values:
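For example (19 is the speed value discussed below; 12 and 24 are arbitrary illustrative values):
new.speeds <- data.frame(speed = c(12, 19, 24))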
You can predict the corresponding stopping distances using the R function predict() as follows:
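predict(model, newdata = new.speeds, interval = "confidence")   # 95% limits by default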
fit: the predicted stopping distances for the three new speed values.
lwr and upr: the lower and the upper confidence limits for the expected values, respectively. By default the function produces the 95% confidence limits.
For example, the 95% confidence interval associated with a speed of 19 is (51.83, 62.44). This
means that, according to our model, a car with a speed of 19 mph has, on average, a stopping
distance ranging between 51.83 and 62.44 ft.
Prediction interval
The prediction interval gives the uncertainty around a single predicted value. In the same way as the confidence intervals, the prediction intervals can be computed as follows:
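predict(model, newdata = new.speeds, interval = "prediction")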
The 95% prediction interval associated with a speed of 19 is (25.76, 88.51). This means that, according to our model, 95% of cars with a speed of 19 mph have a stopping distance between 25.76 and 88.51 ft.
Multiple Regression
Multiple regression involves a single dependent variable and two or more independent
variables.
Multiple Regression Model: The general form of the multiple regression model is as follows:
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑘 𝑋𝑘 + 𝜖
where 𝑌 is the dependent variable, 𝑋𝑗 ’s are independent variables, 𝛽0 is the intercept, 𝛽𝑗 ’s are
the partial regression coefficients and 𝜖 is the random error term.
We want to build a model for estimating sales based on the advertising budget invested in
youtube, facebook and newspaper, as follows:
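model <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
summary(model)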
Call:
lm(formula = sales ~ youtube + facebook + newspaper, data = marketing)
Residuals:
Min 1Q Median 3Q Max
-10.5932 -1.0690 0.2902 1.4272 3.3951
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.526667 0.374290 9.422 <2e-16 ***
youtube 0.045765 0.001395 32.809 <2e-16 ***
facebook 0.188530 0.008611 21.893 <2e-16 ***
newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.023 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
The first step in interpreting the multiple regression analysis is to examine the F-statistic and the associated p-value at the bottom of the model summary.
In our example, it can be seen that 𝑝-value of the 𝐹-statistic is < 2.2e-16, which is highly
significant. This means that, at least, one of the predictor variables is significantly related to the
outcome variable.
To see which predictor variables are significant, you can examine the coefficients table, which shows the estimates of the regression beta coefficients and the associated t-statistic p-values:
summary(model)$coefficients
For a given predictor, the 𝑡-statistic evaluates whether or not there is significant association
between the predictor and the outcome variable, that is whether the beta coefficient of the
predictor is significantly different from zero.
It can be seen that changes in the youtube and facebook advertising budgets are significantly associated with changes in sales, while changes in the newspaper budget are not significantly associated with sales.
For a given predictor variable, the coefficient (beta) can be interpreted as the average effect on
y of a one unit increase in predictor, holding all other predictors fixed.
For example, for a fixed amount of youtube and newspaper advertising budget, spending an
additional 1000 dollars on facebook advertising leads to an increase in sales by approximately
0.1885*1000 = 189 sale units, on average.
The youtube coefficient suggests that for every 1000 dollars increase in youtube advertising
budget, holding all other predictors constant, we can expect an increase of 0.045*1000 =
45 sales units, on average.
We found that newspaper is not significant in the multiple regression model. This means that, for a fixed amount of youtube and facebook advertising budget, changes in the newspaper advertising budget will not significantly affect sales units.
As the newspaper variable is not significant, it is possible to remove it from the model:
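model <- lm(sales ~ youtube + facebook, data = marketing)
summary(model)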
Call:
lm(formula = sales ~ youtube + facebook, data = marketing)
Residuals:
Min 1Q Median 3Q Max
-10.5572 -1.0502 0.2906 1.4049 3.3994
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.50532 0.35339 9.919 <2e-16 ***
youtube 0.04575 0.00139 32.909 <2e-16 ***
facebook 0.18799 0.00804 23.382 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.018 on 197 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8962
F-statistic: 859.6 on 2 and 197 DF, p-value: < 2.2e-16
confint(model)
2.5 % 97.5 %
(Intercept) 2.80841159 4.20222820
youtube 0.04301292 0.04849671
facebook 0.17213877 0.20384969
Model assessment: As we have seen in simple linear regression, the overall quality of the model can be assessed by examining the R-squared (R²) and the Residual Standard Error (RSE).
R-squared: In multiple linear regression, R² represents the squared correlation between the observed values of the outcome variable (y) and the fitted (i.e., predicted) values of y. For this reason, the multiple correlation R is always positive, and R² ranges from zero to one.
R² represents the proportion of variance in the outcome variable y that may be predicted by knowing the values of the x variables. An R² value close to 1 indicates that the model explains a large portion of the variance in the outcome variable.
A problem with R² is that it will always increase when more variables are added to the model, even if those variables are only weakly associated with the response. A solution is to adjust R² by taking into account the number of predictor variables. This adjustment, the "Adjusted R-squared" value in the summary output, is a correction for the number of x variables included in the prediction model.
In our example, with the youtube and facebook predictor variables, the adjusted R² = 0.89, meaning that 89% of the variance in sales can be predicted from the youtube and facebook advertising budgets.
This model is better than the simple linear model with only youtube (Chapter simple-linear-regression), which had an adjusted R² of 0.61.
To compute multiple regression using all of the predictors in the data set, simply type this:
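model <- lm(sales ~ ., data = marketing)   # "." stands for all variables other than sales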
If you want to perform the regression using all of the variables except one, say newspaper,
type this:
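model <- lm(sales ~ . - newspaper, data = marketing)   # drop newspaper from the predictors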
model$residuals       # residuals of the fitted model
model$fitted.values   # fitted (predicted) values
A more conventional way to assess model performance is to plot the residuals against different diagnostic measures.
Add the line par(mfrow=c(2,2)) before plot(). If you do not add this line of code, R prompts you to hit Enter to display the next graph.
par(mfrow=c(2,2))
(mfrow=c(2,2)): returns a window with the four graphs side by side. The first 2 sets the number of rows; the second 2 sets the number of columns. If you write (mfrow=c(3,2)), you will create a window with 3 rows and 2 columns.
plot(model)
Predicting Dependent Variable (Y): To predict, we use the predict() function. Called with interval = "confidence", it gives the predicted values along with the confidence interval; called with interval = "prediction", it gives the predicted values along with the prediction interval, as sketched below.
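A minimal sketch (the budget values in new.budgets are hypothetical, chosen within the range of the data):
new.budgets <- data.frame(youtube = 200, facebook = 30, newspaper = 20)
predict(model, newdata = new.budgets, interval = "confidence")
predict(model, newdata = new.budgets, interval = "prediction")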
We must assign a set of levels to a qualitative variable to account for the effect that the
variable may have on the response. This is done through the use of indicator/dummy
variables.
A dummy/indicator variable has two categories, usually coded as 0 and 1. Other codings are also possible (for example, 1 and -1) but are not frequently used.
Dummy variables can be incorporated in regression models just as easily as quantitative
variables.
A regression model may contain regressors that are all exclusively dummy, or
qualitative, in nature. Such models are called Analysis of Variance (ANOVA) models.
ANOVA model: Consider the data on average salary (in dollars) of public school teachers in 50
states and the District of Columbia for the year 1985. These 51 areas are classified into three
geographical regions: (1) Northeast and North Central (21 states in all), (2) South (17 states in
all), and (3) West (13 states in all).
There are various statistical techniques to compare two or more mean values, which generally
go by the name of analysis of variance. But the same objective can be accomplished within the
framework of regression analysis. To see this, consider the following model:
𝑌 = 𝛽0 + 𝛽11 𝐷1 + 𝛽12 𝐷2 + 𝜖
where 𝐷1 = 1 if the state is in the Northeast or North Central (0 otherwise), 𝐷2 = 1 if the state is in the South (0 otherwise), and the West serves as the base category.
Mean salary of public school teachers in the Northeast and North Central: 𝐸(𝑌|𝐷1 = 1, 𝐷2 = 0) = 𝛽0 + 𝛽11
Mean salary of public school teachers in the South: 𝐸(𝑌|𝐷1 = 0, 𝐷2 = 1) = 𝛽0 + 𝛽12
Mean salary of public school teachers in the West: 𝐸(𝑌|𝐷1 = 0, 𝐷2 = 0) = 𝛽0
In other words, the mean salary of public school teachers in the West is given by the intercept,
𝛽0, in the multiple regression and the slope coefficients 𝛽11 and 𝛽12 tell you how much the
mean salaries of teachers in the Northeast and North Central and in the South differ from the
mean salary of teachers in the West.
library(foreign)
data <- read.spss("C:/Users/MdZillurRahman/Dropbox/Lecture/salary.sav", to.data.frame = TRUE)
lmout1 <- lm(salary ~ D1 + D2, data = data)
summary(lmout1)
Call:
lm(formula = salary ~ D1 + D2, data = data)
Residuals:
Min 1Q Median 3Q Max
-6329.1 -2592.1 -370.6 2143.4 15321.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26159 1128 23.180 <2e-16 ***
D1 -1734 1436 -1.208 0.2330
D2 -3265 1499 -2.178 0.0344 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4069 on 48 degrees of freedom
Multiple R-squared: 0.09008, Adjusted R-squared: 0.05217
F-statistic: 2.376 on 2 and 48 DF, p-value: 0.1038
The fitted model is: salary = 26159 - 1734*D1 - 3265*D2 (coefficients rounded to the nearest dollar).
As these regression results show, the mean salary of teachers in the West is about
$26159, that of teachers in the Northeast and North Central is lower by about $1734,
and that of teachers in the South is lower by about $3265. The actual mean salaries in
the last two regions are about $24424 and $22894.
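As a quick check, the three regional means can be recovered from the fitted coefficients (a small sketch using the lmout1 object above):
b <- coef(lmout1)
c(West = b[["(Intercept)"]],
Northeast_NorthCentral = b[["(Intercept)"]] + b[["D1"]],
South = b[["(Intercept)"]] + b[["D2"]])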
An alternative is to include dummies for all three regions and drop the intercept, so that each coefficient directly estimates that region's mean salary:
lmout2 <- lm(salary ~ D1 + D2 + D3 - 1, data = data)
The category for which no dummy variable is assigned is known as the base, benchmark,
control, comparison, reference, or omitted category. And all comparisons are made in
relation to the benchmark category.
The intercept value represents the mean value of the benchmark category.
If a qualitative variable has more than one category, as in our illustrative example, the
choice of the benchmark category is strictly up to the researcher. Sometimes the choice
of the benchmark is dictated by the particular problem at hand.
Model containing interaction effect: Personal savings and personal disposable income, United
States, 1970-1995. These data relate to personal disposable (i.e., after tax) income and personal
savings, both measured in billions of dollars, in the United States for the period 1970 to 1995.
Our objective is to estimate a savings function that relates savings (𝑌) to personal disposable
income (PDI) (𝑋) for the United States for the said period. If we do that, we will be maintaining
that the relationship between savings and PDI remains the same throughout the sample period.
But that might be a tall assumption. For example, it is well known that in 1982 the United
States suffered its worst peacetime recession. The unemployment rate that year reached 9.7
percent, the highest since 1948. An event such as this might disturb the relationship between
savings and PDI. To see if this in fact happened, we can divide our sample data into two
periods, 1970 to 1981 and 1982 to 1995, the pre- and post-1982 recession periods.
data<-matrix(c(1970,61.0,727.1,0,
1971,68.6,790.2,0,
1972,63.6,855.3,0,
1973,89.6,965.0,0,
1974,97.6,1054.2,0,
1975,104.4,1159.2,0,
1976,96.4,1273.0,0,
1977,92.5,1401.4,0,
1978,112.6,1580.1,0,
1979,130.1,1769.5,0,
1980,161.8,1973.3,0,
1981,199.1,2200.2,0,
1982,205.5,2347.3,1,
1983,167.0,2522.4,1,
1984,235.7,2810.0,1,
1985,206.2,3002.0,1,
1986,196.5,3187.6,1,
1987,168.4,3363.1,1,
1988,189.1,3640.8,1,
1989,187.8,3894.5,1,
1990,208.7,4166.8,1,
1991,246.4,4343.7,1,
1992,272.6,4613.7,1,
1993,214.4,4790.2,1,
1994,189.4,5021.7,1,
1995,249.3,5320.8,1),26,4,byrow=T)
colnames(data)<-c('YEAR','PS','PDI','DV')
data<-data.frame(data)
RegOut <- lm(PS ~ PDI + DV + PDI*DV, data = data)   # PDI*DV expands to PDI + DV + PDI:DV
summary(RegOut)
Call:
lm(formula = PS ~ PDI + DV + PDI * DV, data = data)
Residuals:
Min 1Q Median 3Q Max
-38.729 -14.777 -1.398 11.689 50.535
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.01612 20.16483 0.050 0.960266
PDI 0.08033 0.01450 5.541 1.44e-05 ***
DV 152.47855 33.08237 4.609 0.000136 ***
PDI:DV -0.06547 0.01598 -4.096 0.000477 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 23.15 on 22 degrees of freedom
Multiple R-squared: 0.8819, Adjusted R-squared: 0.8658
F-statistic: 54.78 on 3 and 22 DF, p-value: 2.268e-10
Both the differential intercept (beta2, the coefficient of DV) and the differential slope coefficient (beta3, the coefficient of PDI:DV) are individually statistically significant, suggesting that the savings-income relationship changed between the two time periods.
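To make this concrete, the two period-specific savings functions implied by the estimates can be recovered from the coefficients (a sketch using the RegOut object above):
b <- coef(RegOut)   # names: (Intercept), PDI, DV, PDI:DV
rbind(pre1982 = c(intercept = unname(b[1]), slope = unname(b[2])),
post1982 = c(intercept = unname(b[1] + b[3]), slope = unname(b[2] + b[4])))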
Model containing a mixture of quantitative and qualitative regressors: Returning to the teachers' salary data, we add state spending on public schools (𝑋) as a quantitative regressor:
𝑌 = 𝛽0 + 𝛽11 𝐷1 + 𝛽12 𝐷2 + 𝛽2 𝑋 + 𝜖
library(foreign)
data <- read.spss("C:/Users/MdZillurRahman/Dropbox/Lecture/salary.sav", to.data.frame = TRUE)
lmout3 <- lm(salary ~ D1 + D2 + spending, data = data)
summary(lmout3)
Call:
lm(formula = salary ~ D1 + D2 + spending, data = data)
Residuals:
Min 1Q Median 3Q Max
-3936.4 -1517.3 -158.4 1291.6 6134.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13269.1141 1395.0557 9.512 1.57e-12 ***
D1 -1673.5144 801.1703 -2.089 0.0422 *
D2 -1144.1567 861.1182 -1.329 0.1904
spending 3.2888 0.3176 10.354 1.03e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2270 on 47 degrees of freedom
Multiple R-squared: 0.7227, Adjusted R-squared: 0.705
F-statistic: 40.82 on 3 and 47 DF, p-value: 3.875e-13
As public expenditure goes up by a dollar, a public school teacher's salary goes up, on average, by about $3.29. Controlling for spending on education, we now see that the differential intercept coefficient is significant for the Northeast and North Central region, but not for the South.