Multiple Linear Regression
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical
technique that uses several explanatory variables (independent variables) to predict the outcome
of a response variable (dependent variable).
Multiple regression is an extension of simple linear (OLS) regression, which uses just one
explanatory variable (independent variable).
How to check assumptions of multiple linear regression
Go to Analyze > Regression > Linear.
Put the scores of your dependent variable in the Dependent box.
Put the scores of your independent variables in the Independent(s) box.
Under the Statistics heading, check Estimates, Model fit, R squared change, Descriptives, Part and
partial correlations, and Casewise diagnostics.
Under the Plots heading, move *ZPRED to the X variable box and *ZRESID to the Y variable box, and also check
Normal probability plot.
Under the Save heading, check Cook's distance.
Click Continue.
Click OK.
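Pasting these choices from the dialog produces syntax roughly like the sketch below; dv, iv1 and iv2 are placeholder variable names, and the exact subcommands may vary slightly by SPSS version:
*Regression with the statistics, plots and Cook's distance described above (placeholder names).
REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG N
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA CHANGE ZPP
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT dv
/METHOD=ENTER iv1 iv2
/SCATTERPLOT=(*ZRESID ,*ZPRED)
/RESIDUALS NORMPROB(ZRESID)
/CASEWISE PLOT(ZRESID) OUTLIERS(3)
/SAVE COOK.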
Assumptions
Sample size: You should have at least 20 cases for each independent variable. This rule applies
when the dependent variable is normally distributed. If it is not normally distributed, you need
more than 20 cases per independent variable.
Assumption 1: The residuals of the dependent variable should be normally distributed
To check normality, go to Analyze > Descriptive Statistics > Explore.
Move the dependent variable into the Dependent List box.
Then click Plots.
Check Histogram and Normality plots with tests, and uncheck Stem-and-leaf.
Click Continue.
Click OK.
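The corresponding Explore syntax looks roughly like this (dv is a placeholder for your dependent variable):
*Normality check for the dependent variable (placeholder name dv).
EXAMINE VARIABLES=dv
/PLOT HISTOGRAM NPPLOT
/STATISTICS DESCRIPTIVES
/MISSING LISTWISE
/NOTOTAL.
The NPPLOT keyword is what produces the normality plots and the Kolmogorov-Smirnov and Shapiro-Wilk tests referred to below.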
Interpretation
Look at the Shapiro-Wilk test. The p value should be greater than .05 for the DV to be considered
normally distributed.
Assumption 2: Linear relationship
There should be a linear relationship between each IV and the DV. Make a scatter plot of the
standardized residuals against the standardized predicted values; all values should lie between
-3.00 and +3.00. In the Residuals Statistics table, the standardized residual values should be
between -3.00 and +3.00. In the normal P-P plot, most or all points should fall on the line.
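A scatter plot for one IV at a time can also be produced from syntax; a minimal sketch with placeholder variable names:
*Scatter plot of one independent variable against the dependent variable.
GRAPH
/SCATTERPLOT(BIVAR)=iv1 WITH dv.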
Assumption 3: There should be an absence of multicollinearity
Multicollinearity refers to when your predictor variables are highly correlated with each
other. This is an issue, as your regression model will not be able to accurately associate
variance in your outcome variable with the correct predictor variable, leading to muddled
results and incorrect inferences. Keep in mind that this assumption is only relevant for a
multiple linear regression, which has multiple predictor variables. If you are performing a
simple linear regression (one predictor), you can skip this assumption.
Three methods to check multicollinearity are as follows:
1) Correlation matrix – The independent variables should not be very strongly related with each
other. When computing the matrix of Pearson's bivariate correlations among all independent
variables, the correlation coefficients should not exceed roughly 0.70. Check this by computing
the correlations among all independent variables, or by running a regression with each IV in the
dependent box and the remaining IVs in the independents box (see the syntax sketch after this list).
2) Tolerance – the tolerance measures the influence of one independent variable on all
other independent variables; the tolerance is calculated with an initial linear regression
analysis in which that IV is regressed on the others. Tolerance is defined as T = 1 – R² for this
first-step regression analysis. As a rule of thumb, the value of tolerance should be above 0.2;
with T < 0.1 there might be multicollinearity in the data, and with T < 0.01 there almost certainly
is. A small tolerance value indicates that the variable under consideration is almost a perfect
linear combination of the independent variables already in the equation and that it should not be
added to the regression equation. All variables involved in the linear relationship will have a
small tolerance.
3) Variance Inflation Factor (VIF) – the variance inflation factor of the linear regression is
defined as VIF = 1/T. With VIF > 5 there is an indication that multicollinearity may be present;
with VIF > 10 there is almost certainly multicollinearity among the variables. The value of VIF
should therefore be low: below 3 is ideal, 3–5 is acceptable, 5–10 suggests a possible problem,
and above 10 is problematic. Note: the Variance Inflation Factor (VIF) is always greater than or
equal to 1. There is no formal VIF cut-off for determining the presence of multicollinearity;
values of VIF that exceed 10 are often regarded as indicating multicollinearity, but in weaker
models values above 2.5 may be a cause for concern.
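All three checks can also be run from syntax; a rough sketch is shown below, where dv, iv1, iv2 and iv3 are placeholder variable names rather than names from a real data file:
*1) Correlation matrix among the independent variables.
CORRELATIONS
/VARIABLES=iv1 iv2 iv3
/PRINT=TWOTAIL NOSIG.
*2) Tolerance by hand: regress one IV on the remaining IVs, then T = 1 - R Square of that model.
REGRESSION
/DEPENDENT iv1
/METHOD=ENTER iv2 iv3.
*3) Tolerance and VIF directly, shown in the Coefficients table of the main model.
REGRESSION
/STATISTICS COEFF R ANOVA TOL
/DEPENDENT dv
/METHOD=ENTER iv1 iv2 iv3.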
If multicollinearity is found in the data, centering the data (that is, subtracting the mean of the
variable from each score) might help to solve the problem, as sketched below. However, the
simplest way to address the problem is to remove independent variables with high VIF values.
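If centering is chosen as the remedy, one way to mean-center a predictor from syntax is sketched below (iv1 is again a placeholder, and the new variable names are our own choice):
*Add the grand mean of iv1 to every case, then subtract it from the original scores.
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES
/iv1_mean=MEAN(iv1).
COMPUTE iv1_centered = iv1 - iv1_mean.
EXECUTE.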
Assumption 4: Presence of homoscedasticity
The variance of the residuals should be the same across all values of the predicted dependent
variable. We can check this through the scatter plot of standardized residuals against standardized
predicted values: the values should be randomly scattered and should not form a U shape, an S
shape, or a cluster at one point.
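In syntax, this is the residual plot already requested in the sketch near the top of this section; the key part is the SCATTERPLOT subcommand (placeholder variable names):
REGRESSION
/DEPENDENT dv
/METHOD=ENTER iv1 iv2 iv3
/SCATTERPLOT=(*ZRESID ,*ZPRED).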
Assumption 5: Independence of observations / no autocorrelation in the data
Linear regression analysis requires that there is little or no autocorrelation in the data.
Autocorrelation occurs when the residuals are not independent from each other; in other words,
when the value of y(x+1) is not independent from the value of y(x).
While a scatterplot allows you to check for autocorrelation, you can test the linear regression
model for autocorrelation with the Durbin-Watson test. Durbin-Watson's d tests the null
hypothesis that the residuals are not linearly autocorrelated. While d can assume values between
0 and 4, values around 2 indicate no autocorrelation. As a rule of thumb, values of 1.5 < d < 2.5
show that there is no autocorrelation in the data. However, the Durbin-Watson test only analyses
linear autocorrelation and only between direct neighbours, which are first-order effects.
To request the test, check Durbin-Watson under the Residuals heading of the Statistics dialog.
The Durbin-Watson (DW) statistic is a test for autocorrelation in the residuals from a statistical
regression analysis. It will always have a value between 0 and 4; a value of 2.0 means that there
is no autocorrelation detected in the sample.
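In syntax, the test can be requested with the RESIDUALS subcommand; a minimal sketch with placeholder variable names:
REGRESSION
/DEPENDENT dv
/METHOD=ENTER iv1 iv2 iv3
/RESIDUALS DURBIN.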
Assumption 6: Absence of outliers
For Cook's distance, no value should be greater than 1. You can check the saved values in Data
View; any value greater than 1 flags that observation as an outlier.
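A sketch of saving and screening Cook's distance from syntax (placeholder variable names; SPSS typically names the saved variable COO_1):
REGRESSION
/DEPENDENT dv
/METHOD=ENTER iv1 iv2 iv3
/SAVE COOK.
*Flag any case whose Cook distance exceeds 1.
COMPUTE cook_flag = (COO_1 > 1).
FREQUENCIES cook_flag.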
Interpretation of results:
In the table of descriptive statistics, M should be close to 50 and SD should be close to 10.
1. Interpretation of the Model Summary table
i. The "R" column represents the value of R, the multiple correlation coefficient. R can be
considered one measure of the quality of the prediction of the dependent variable.
The relation of the predictor variables with the outcome variable should be strong; in the
correlation table, the coefficient for each IV with the DV should be greater than 0.3 (this applies
to the relation of every IV with the DV).
ii. The "R Square" column represents the R 2 value (also called the coefficient of
determination), which is the proportion of variance in the dependent variable that can be
explained by the independent variables (technically, it is the proportion of variation
accounted for by the regression model above and beyond the mean model). R square
value shows the 1 unit increase in the IV produces that percent change in DV.
2. Interpretation of the ANOVA table
The F-ratio in the ANOVA table tests whether the overall regression model is a good fit
for the data, i.e. whether the independent variables, taken together, statistically
significantly predict the dependent variable. If p < .05, you can conclude that the model
predicts the DV significantly better than the mean alone. The tests of the individual
coefficients (whether each is statistically significantly different from 0) appear in the
Coefficients table, where the t-value and corresponding p-value are located in the "t" and
"Sig." columns, respectively.
Example 1
Multiple Linear Regression in SPSS – A Simple Example
A company wants to know how job performance relates to IQ, motivation and social support.
They collect data on 60 employees, resulting in job_performance.sav.
Part of these data are shown below.
Quick Data Check
We usually start our analysis with a solid data inspection.
Check all the assumptions of Multiple linear regression as described above.
Here we'll limit it to a quick check of relevant histograms and correlations. The syntax below
shows the fastest way to generate histograms.
Syntax for Running Histograms
*Inspect histograms for all regression variables.
frequencies perf to soc
/format notable
/histogram.
Histograms Output
We'll show the first histogram below. Note that each histogram is based on 60 observations,
which corresponds to the number of cases in our data. This means that we don't have any system
missing values.
Second, note that all histograms look plausible; none of them have weird shapes or extremely
high or low values. As we see, histograms provide a very nice and quick data check.
Running the Correlation Matrix
Next, we'll check whether the correlations among our regression variables make any sense. We'll
create the correlation matrix by running correlations perf to soc.
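Written out in the same style as the histogram syntax, this is:
*Inspect correlations for all regression variables.
correlations perf to soc.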
Inspecting the Correlation Matrix
Most importantly, the correlations are plausible; job performance correlates positively and
substantively with all other variables. This makes sense because each variable reflects a positive
quality that's likely to contribute to better job performance.
Note that IQ doesn't really correlate with anything but job performance. Perhaps we'd expect
somewhat higher correlations here, but we don't find this result very unusual. Finally, note that
the correlation matrix confirms that there are no missing values in our data.
Linear Regression in SPSS - Model
We'll try to predict job performance from all other variables by means of a multiple regression
analysis. Therefore, job performance is our criterion (or dependent variable). IQ, motivation
and social support are our predictors (or independent variables). The model is illustrated
below.
Important Note
A basic rule of thumb is that we need at least 15 independent observations for each predictor in
our model. With three predictors, we need at least (3 x 15 =) 45 respondents. The 60 respondents
we actually have in our data are sufficient for our model.
Linear Regression in SPSS - Purpose
Keep in mind that regression does not prove any causal relations from our predictors on job
performance. However, we do find such causal relations intuitively likely. If they do exist, then
we can perhaps improve job performance by enhancing the motivation, social support and IQ
of our employees.
If there aren't any causal relations among our variables, then being able to predict job
performance may still be useful for assessing job applicants; we can measure their IQ,
motivation and social support but we can't measure their job performance before we actually hire
them.
Running our Linear Regression in SPSS
The screenshots below illustrate how to run a basic regression analysis in SPSS.
In the linear regression dialog below, we move perf into the Dependent box. Next, we
move IQ, mot and soc into the Independent(s) box. Clicking Paste results in the next syntax
example.
Linear Regression in SPSS - Syntax
*SPSS regression with default settings.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT perf
/METHOD=ENTER iq mot soc.
Linear Regression in SPSS - Short Syntax
We can now run the syntax as generated from the menu. However, we do want to point out that
much of this syntax does absolutely nothing in this example. Running regression /dependent
perf /enter iq mot soc. does exactly the same thing as the longer regression syntax.
SPSS Regression Output - Coefficients Table
SPSS regression with default settings results in four tables. The most important table is the last
table, “Coefficients”.
The b coefficients tell us how many units job performance increases for a single unit
increase in each predictor. For example, a 1-point increase on the IQ test corresponds to a 0.27-point
increase on the job performance test. Given only the scores on our predictors, we can predict job
performance by computing
Job performance = 18.1 + (0.27 x intelligence) + (0.31 x motivation) +(0.16 x social support)
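As a quick sketch, this prediction equation could be applied to the data with a COMPUTE command, using the predictor names from the syntax above (pred_perf is just a name we chose for the predicted score):
COMPUTE pred_perf = 18.1 + 0.27*iq + 0.31*mot + 0.16*soc.
EXECUTE.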
Importantly, note that all b coefficients are positive numbers; higher IQ is associated with higher
job performance and so on.
B coefficients having the “wrong direction” often indicate a problem with the analysis known as
multicollinearity.
The column “Sig.” holds the p-values for our predictors. As a rule of thumb, we say that a b
coefficient is statistically significant if its p-value is smaller than 0.05. All of our b coefficients
are statistically significant.
The beta coefficients allow us to compare the relative strengths of our predictors. These are
roughly 2 to 2 to 1 for IQ, motivation and social support.
SPSS Regression Output - Model Summary Table
The second most important table in our output is the Model Summary as shown below.
As we previously mentioned, our model predicts job performance. R denotes
the correlation between predicted and observed job performance. In our case, R = 0.81. Since
this is a very high correlation, our model predicts job performance rather precisely.
R square is simply the square of R. It indicates the proportion of variance in job performance
that can be "explained" by our three predictors.
Because regression maximizes R square for our sample, it will be somewhat lower for the
entire population, a phenomenon known as shrinkage. The adjusted r-square estimates the
population R square for our model and thus gives a more realistic indication of its predictive
power.
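A commonly used formula for this estimate is adjusted R square = 1 – (1 – R²)(n – 1)/(n – k – 1), where n is the number of cases and k the number of predictors.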
SPSS Linear Regression - Conclusion
The high adjusted R square tells us that our model does a great job in predicting job
performance. On top of that, our b coefficients are all statistically significant and make perfect
intuitive sense. Mission accomplished.
Example 2
The Multiple Linear Regression Analysis in SPSS
This example is based on the FBI's 2006 crime statistics. In particular, we are interested in the
relationship between the size of the state, various property crime rates and the number of murders
in the city.
It is our hypothesis that less violent crimes open the door to violent crimes.
We also hypothesize that even when we account for some of the effect of city size by comparing
crime rates per 100,000 inhabitants, there still is an effect left.
Check Assumption 1: First we need to check whether there is a linear relationship between the
independent variables and the dependent variable in our multiple linear regression model. To do
this, we can check scatter plots. The scatter plots below indicate a good linear relationship
between murder rate and burglary and motor vehicle theft rates, and only weak relationships
between population and larceny.
Check Assumption 2: Secondly, we need to check for multivariate normality. We can do this by
checking normal Q-Q plots of each variable. In our example, we find that multivariate normality
might not be present in the population data (which is not surprising, since we truncated variability
by selecting the 70 biggest cities).
We will ignore this violation of the assumption for now, and conduct the multiple linear
regression analysis.
Analysis
Multiple linear regression is found in SPSS in Analyze/Regression/Linear…
In our example, we need to enter the variable “murder rate” as the dependent variable and the
population, burglary, larceny, and vehicle theft variables as independent variables.
In this case, we will select Stepwise as the method. The default method for the multiple linear
regression analysis is Enter, which means that all variables are forced into the model.
However, since overfitting is a concern of ours, we want only those variables in the model that
explain a significant amount of additional variance.
In the field "Options…" we can set the stepwise criteria. We want to enter variables into our
multiple linear regression model if their probability of F is at most 0.05, and to remove them if
their probability of F is at least 0.10.
The “Statistics…” menu allows us to include additional statistics that we need to assess the
validity of our linear regression analysis.
It is advisable to include the collinearity diagnostics and the Durbin-Watson test for auto-
correlation. To test the assumption of homoscedasticity and normality of residuals we will also
include a special plot from the “Plots…” menu.
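A sketch of the syntax these choices would paste is shown below; the variable names are placeholders rather than the actual names in the crime data file:
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA TOL
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT murder_rate
/METHOD=STEPWISE population burglary larceny vehicle_theft
/RESIDUALS DURBIN NORMPROB(ZRESID)
/SCATTERPLOT=(*ZRESID ,*ZPRED).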
The first table in the results output tells us the variables in our analysis. Turns out that only
motor vehicle theft is useful to predict the murder rate.
The next table shows the multiple linear regression model summary and overall fit statistics. We
find that the adjusted R² of our model is .398 with the R² = .407. This means that the linear
regression explains 40.7% of the variance in the data. The Durbin-Watson d = 2.074, which is
between the two critical values of 1.5 < d < 2.5. Therefore, we can assume that there is no first
order linear auto-correlation in our multiple linear regression data.
If we had forced all variables (Method: Enter) into the linear regression model, we would
have seen a slightly higher R² and adjusted R² (.458 and .424 respectively).
The next output table is the F-test. The linear regression’s F-test has the null hypothesis that the
model explains zero variance in the dependent variable (in other words R² = 0). The F-test is
highly significant, thus we can assume that the model explains a significant amount of the
variance in murder rate.
The next table shows the multiple linear regression estimates including the intercept and the
significance levels.
In our stepwise multiple linear regression analysis, we find a non-significant intercept but highly
significant vehicle theft coefficient, which we can interpret as: for every 1-unit increase in
vehicle thefts per 100,000 inhabitants, we will see .014 additional murders per 100,000.
If we force all variables into the multiple linear regression, we find that only burglary and motor
vehicle theft are significant predictors. We can also see that motor vehicle theft has a higher
impact than burglary by comparing the standardized coefficients (beta = .507 versus beta = .333).
The information in the table above also allows us to check for multicollinearity in our multiple
linear regression model.
Assumption: Tolerance should be > 0.1 (or VIF < 10) for all variables, which is the case here.
Assumption: Lastly, we can check for normality of residuals with a normal P-P plot. The plot
shows that the points generally follow the normal (diagonal) line with no strong deviations. This
indicates that the residuals are normally distributed.