Multiple Regression Analysis
Cautions About Linear Regression
• Correlation and regression describe only linear relations.
• The correlation and the least-squares regression line are not resistant to outliers.
• Predictions outside the range of observed data are often inaccurate.
• The relationship between two variables is often influenced by lurking variables not included in our model.
General Principles of Data Analysis
• Plot your data: to understand the data, always start with a series of graphs.
• Interpret what you see: look for overall patterns and for deviations from those patterns.
• Numerical summary? Choose appropriate measures to describe the pattern and the deviations.
• Mathematical model? If the pattern is regular, summarize the data in a compact mathematical model.
Analysis of Two Quantitative Variables
• Plot your data: for two quantitative variables, use a scatterplot.
• Interpret what you see: describe the direction, form and strength of the relationship.
• Numerical summary? If the pattern is roughly linear, summarize with the correlation, means and standard deviations.
• Mathematical model? Regression gives a compact model of the overall pattern if the relationship is roughly linear.
Analysis of Three or More Quantitative Variables
• Plot your data: to examine the relationships among all possible pairs, use a scatterplot matrix.
• Interpret what you see: describe the direction, form and strength of the relationships.
• Numerical summary? If the pattern is roughly linear, summarize with correlations, means and standard deviations.
• Mathematical model? Multiple regression gives a compact model of the relationship between the response variable and a set of predictors.
           Multiple Regression
• Can we predict job performance (Y) from overall school achievement (X1) and IQ scores (X2)?
   - How much variance in Y is explained by X1 and X2 in combination?
   - How important is each predictor of job performance?
• Two kinds of research questions in Multiple Regression:
   - Is the model significant and important?
   - Are the individual predictors significant and important?
The Structural Model
Y = c + b1X1 + b2X2 + ... + bpXp + e

Y      – any dependent variable score is predicted according to:
c      – an intercept on the Y axis, plus
b1X1   – a weighted effect of predictor X1
b2X2   – a weighted effect of predictor X2
bpXp   – a weighted effect of predictor Xp
e      – error
The Structural Model
Y = c + b1X1 + b2X2 + ... + bpXp + e
DATA = MODEL + RESIDUAL
The Regression Plane – Two Predictors (3D space)
Unstandardized Partial Regression Coefficients - b
• Ŷ is calculated according to the least-squares criterion (LSC)
• solved by finding the set of weights (b) that minimises the errors of prediction (around the plane); see the sketch below
   - b1 indicates the change in Y for a unit change in X1, holding X2 … Xp constant
   - when standardised, it indicates the SD change in Y for an SD change in X, and is denoted by β
• c is the Y intercept
• Ŷ is therefore a weighted combination of the predictors (and intercept), called a linear composite (LC)
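As a rough illustration of the least-squares criterion (outside SPSS, and with entirely hypothetical data and variable names), the weights b and the intercept c can be found by minimising the squared errors of prediction, for example with NumPy:

```python
# Minimal sketch (hypothetical data): solving for the weights b and intercept c
# by the least-squares criterion, i.e. minimising squared errors of prediction.
import numpy as np

rng = np.random.default_rng(0)
n = 100
X1 = rng.normal(50, 10, n)                       # e.g. school achievement (hypothetical)
X2 = rng.normal(100, 15, n)                      # e.g. IQ scores (hypothetical)
Y = 0.4 * X1 + 0.2 * X2 + rng.normal(0, 5, n)    # criterion with random error

# Design matrix with a leading column of 1s for the intercept c
X = np.column_stack([np.ones(n), X1, X2])
coefs, *_ = np.linalg.lstsq(X, Y, rcond=None)    # [c, b1, b2]
c, b1, b2 = coefs

Y_hat = X @ coefs                                # the linear composite (predicted scores)
residuals = Y - Y_hat                            # DATA = MODEL + RESIDUAL
print(f"c = {c:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```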
Bivariate regression
Multiple Regression
Variance Explained – R²
R² is simply the r² representing the proportion of variance in Y that is explained by Ŷ, the linear composite:
r² = SS_regression / SS_total = SS_Ŷ / SS_Y = Σ(Ŷ - Ȳ)² / Σ(Y - Ȳ)²
This is a ratio reflecting the proportion of variance captured by our model relative to the overall variance in our data.
R² = .50 means that 50% of the variance in Y is explained by the combination of X1, X2 … Xp (see the sketch below).
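A minimal sketch of this ratio, assuming observed scores Y and model predictions Ŷ are already available (the values below are toy numbers, not from the worked example):

```python
# Sketch: R^2 as the ratio SS_regression / SS_total, given observed y and
# predicted y_hat from a fitted regression model.
import numpy as np

def r_squared(y, y_hat):
    ss_regression = np.sum((y_hat - y.mean()) ** 2)   # SS of Y-hat around the mean of Y
    ss_total = np.sum((y - y.mean()) ** 2)            # SS of Y around the mean of Y
    return ss_regression / ss_total

y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])              # hypothetical observed scores
y_hat = np.array([3.5, 4.8, 7.1, 8.9, 10.7])          # hypothetical predicted scores
print(r_squared(y, y_hat))                            # proportion of variance explained
```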
R² vs r²
      Significance of the Model
• R² tells us how important the model is
• the model can also be tested for statistical significance
• the test is conducted on R, the multiple correlation coefficient, with df = p and N - p - 1
F = [(N - p - 1) R²] / [p (1 - R²)] = MS_regression / MS_residual
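A small sketch of this F test computed from R², N and p (the numbers passed in are illustrative only); the SciPy F distribution supplies the p-value:

```python
# Sketch: testing the model, F = [(N - p - 1) R^2] / [p (1 - R^2)],
# evaluated against an F distribution with df = (p, N - p - 1).
from scipy import stats

def model_f_test(r2, n, p):
    f = ((n - p - 1) * r2) / (p * (1 - r2))
    p_value = stats.f.sf(f, p, n - p - 1)   # upper-tail probability
    return f, p_value

# e.g. R^2 = .50 with N = 100 cases and p = 2 predictors (hypothetical numbers)
print(model_f_test(0.50, 100, 2))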
 Importance of Individual Predictors
r    – simple correlation coefficient
b    – partial regression coefficient
β    – standardized partial regression coefficient
pr   – partial correlation coefficient
sr   – semi-partial correlation coefficient
r – simple correlation coefficient
  • indicates importance of predictor in terms of its
    direct relationship with the criterion
  • not very useful in Multiple Regression as it does
    not take into account inter-correlations with other
    predictors.
b – Partial Regression Coefficient
  • indication of the importance of a predictor in
    terms of the model (not the data).
  • scale-bound so can’t compare magnitude.
  • can, however, compare significance – each b is tested by dividing it by its standard error to give a t-value: t = b / SE_b, with df = N - p - 1 (sketched below)
β – standardized partial regression coefficient
    • indication of the importance of a predictor in
      terms of the model (not the data).
    • standardized (scale free) so you can compare
      magnitude
    • test of significance is same as for b
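A brief sketch of the t-test for a single coefficient described above, t = b / SE_b with df = N - p - 1 (all numbers below are illustrative, not taken from any output):

```python
# Sketch: each partial regression coefficient b is tested by dividing it by its
# standard error, giving a t statistic with df = N - p - 1.
from scipy import stats

def coefficient_t_test(b, se_b, n, p):
    t = b / se_b
    p_value = 2 * stats.t.sf(abs(t), n - p - 1)   # two-tailed p-value
    return t, p_value

print(coefficient_t_test(b=0.40, se_b=0.15, n=100, p=2))  # hypothetical values
```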
pr – Partial Correlation Coefficient
sr – Semi-Partial Correlation Coefficient
Unique, Shared and Total Variance
Assumptions of Multiple Regression
   • Scale (predictor and criterion scores)
      • measured using a continuous scale (interval
        or ratio)
      • normality (variables are normally distributed)
      • linearity (there is a straight line relationship
        between predictors and criterion)
      • predictors are not multicollinear or singular
        (extremely highly correlated)
Assumptions of Multiple Regression
  • Residuals
     • normality: the arrays of Y values are normally distributed around Ŷ (assumption of normality in arrays)
     • homoscedasticity: the variance of the Y values is constant across the full range of predicted values (assumption of homogeneity of variance in arrays)
     • linearity: straight-line relationship between Ŷ and the residuals (with mean = 0 and slope = 0)
     • independence: residuals are uncorrelated
     (Simple graphical checks of these assumptions are sketched below.)
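One simple way to eyeball the residual assumptions, assuming predictions Ŷ from some fitted model are available (hypothetical values generated below), is a histogram of the residuals together with a residuals-versus-predicted plot:

```python
# Sketch: quick residual checks for normality, homoscedasticity and linearity
# (hypothetical y and y_hat standing in for a fitted regression model).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
y_hat = rng.normal(50, 10, 200)            # hypothetical predicted values
y = y_hat + rng.normal(0, 5, 200)          # hypothetical observed values
residuals = y - y_hat

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(residuals, bins=20)           # normality: roughly bell-shaped?
axes[0].set_title("Residuals histogram")
axes[1].scatter(y_hat, residuals, s=10)    # homoscedasticity / linearity:
axes[1].axhline(0, color="grey")           # even band around 0, no curvature
axes[1].set_title("Residuals vs predicted")
plt.show()
```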
Multicollinearity and Singularity
• occurs when predictors are highly correlated (>.90)
• causes unstable calculation of regression weights (b)
• diagnosed with inter-correlations, tolerance and VIF
Tolerance = 1 - R²x
       • where R²x is the overlap between a particular predictor and all the other predictors
       • values below .10 are considered problematic
Variance Inflation Factor (VIF) = 1 / Tolerance
       • values above 4 are considered problematic (sketched below)
• best solution is to remove or combine collinear predictors
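A sketch of how tolerance and VIF can be computed for each predictor by regressing it on the remaining predictors (the data here are hypothetical and deliberately collinear):

```python
# Sketch: tolerance = 1 - R^2_x and VIF = 1 / tolerance for each predictor,
# where R^2_x comes from regressing that predictor on all the others.
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal(size=200)
X2 = 0.8 * X1 + 0.6 * rng.normal(size=200)    # deliberately correlated with X1
X3 = rng.normal(size=200)
predictors = np.column_stack([X1, X2, X3])

def tolerance_and_vif(predictors, j):
    y = predictors[:, j]
    others = np.delete(predictors, j, axis=1)
    design = np.column_stack([np.ones(len(others)), others])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coefs
    r2_x = np.sum((fitted - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
    tol = 1 - r2_x
    return tol, 1 / tol                        # (tolerance, VIF)

for j in range(predictors.shape[1]):
    print(j, tolerance_and_vif(predictors, j))
```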
Outliers – Extreme Cases
 • distort solution and inflate standard error
 • univariate outliers
        • cases beyond 3 SD on any variable
 • multivariate outliers
        • described in terms of:
            • leverage (h) – distance of case from group
              centroid along line/plane of best fit
            • discrepancy – extent to which case
              deviates from line/plane of best fit
            • influence – combined effect of leverage
              and discrepancy: effect of the outlier on
              the solution
Multivariate Outliers – high influence, high discrepancy
Multivariate Outliers – low influence, high discrepancy
Multivariate Outliers – Testing
Leverage
• leverage statistic (h): varies from 0 to 1; values > .50 are problematic
• Mahalanobis distance = h × (n - 1), distributed as chi-square and tested as such (df = p, α < .001)
Discrepancy
• not directly tested
Influence
• assesses the change in the solution when the case is removed
• Cook's distance: values > 1 are problematic (see the sketch below)
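A sketch of these diagnostics using the influence measures in statsmodels on hypothetical data; the Mahalanobis values follow the h × (n - 1) rule of thumb given above:

```python
# Sketch: leverage, Mahalanobis distance and Cook's distance for a fitted model
# (hypothetical data; thresholds follow the rules of thumb on the slide above).
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, 0.5]) + rng.normal(size=50)

results = sm.OLS(y, sm.add_constant(X)).fit()
influence = results.get_influence()

h = influence.hat_matrix_diag              # leverage, ranges 0 to 1; > .50 problematic
mahal = h * (len(y) - 1)                   # Mahalanobis distance via h x (n - 1)
cooks_d, _ = influence.cooks_distance      # Cook's distance; values > 1 problematic
crit = stats.chi2.ppf(0.999, df=2)         # chi-square cut-off, df = p = 2, alpha = .001

print(np.where(mahal > crit)[0])           # cases flagged as multivariate outliers
print(np.where(cooks_d > 1)[0])            # cases with problematic influence
```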
Working Example
A marketing manager of a large supermarket chain wanted to determine the effect of shelf space and price on the sales of pet food. A random sample of 15 equal-sized shops was selected, and the sales, shelf space in square metres and price per kilogram were recorded.
 1. What contribution do shelf space and price together make to the prediction of sales of pet food?
 2. Which is the better predictor of sales of pet food?
 3. Do a residual analysis.
The data file can be found in Work17.sav.
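Although the worked example below uses SPSS, an equivalent analysis could be sketched with pandas and statsmodels, assuming Work17.sav contains variables named space, price and sales as described in the SPSS steps (pandas.read_spss requires the pyreadstat package to be installed):

```python
# Sketch of the same analysis outside SPSS, assuming the variable names
# space, price and sales in Work17.sav.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

data = pd.read_spss("Work17.sav")

X = sm.add_constant(data[["space", "price"]])   # predictors plus intercept
y = data["sales"]
results = sm.OLS(y, X).fit()

print(results.summary())                # R Square, ANOVA F-test, b, t-values, p-values
print(durbin_watson(results.resid))     # independence of residuals
```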
Using SPSS
   Graphs
    Scatter/Dot
      Matrix Scatter
Using SPSS
  Graph
  [DataSet1] C:\Users\demo\Desktop\Corr&RegressPresentation\SPSS DataFile\Work17.sav
Using SPSS
  Multiple Linear Regression:
  Starting the Procedure
  • In the menu, click on Analyze
  • Point to Regression
  • Point to Linear… and click
Using SPSS
 Multiple Linear Regression:
 Selecting Variables
Choose the variables for analysis from the list in the variable box. To select multiple variables, hold down the Ctrl key and choose the variables that you want.
Using SPSS
 Multiple Linear Regression:
 Selecting Variables
Move shelf space (space) and price per kg (price), which are already highlighted, to the box labeled Independent(s) by clicking the arrow. Move sales of pet food (sales) to the box labeled Dependent by clicking the arrow.
Using SPSS
 Multiple Linear Regression:
 Requesting Statistics
Request descriptive statistics by clicking the button labeled Statistics…
Using SPSS
 Multiple Linear Regression:
 Requesting Statistics
Statistics for the Model fit and Estimates for the Regression Coefficients will be produced by default. Click the checkbox for Descriptives. Also, click the checkbox for Durbin-Watson under Residuals. Click the Continue button.
Using SPSS
 Multiple Linear Regression:
 Standardized Residual Plots
You can also request several different plots. Click the Plots… button. In the box labeled Standardized Residual Plots, first click the checkbox for Histogram, then click the box for Normal probability plot. Click the Continue button.
Using SPSS
 Multiple Linear Regression:
 Enter Method
The independent variables can be entered into the analysis using five different methods.
Enter Method: a procedure for variable selection in which all variables in a block are entered in a single step.
Using SPSS
 Multiple Linear Regression:
 Enter Method
Enter is the default method of variable entry. Click the OK button to run the Multiple Linear Regression procedure.
Using SPSS
  Multiple Linear Regression Output:
  Descriptive Statistics
  Regression
  [DataSet1] C:\Users\demo\Desktop\Corr&RegressPresentation\SPSS DataFile\Work17.sav
Using SPSS
  Multiple Linear Regression Output:
  Correlations
Using SPSS
  Multiple Linear Regression Enter Method Output:
  Variables Entered
Using SPSS
  Multiple Linear Regression Enter Method Output:
  Model Summary
Callouts in the output: the correlation (R), the coefficient of determination (R Square), the standard deviation around the regression line (Std. Error of the Estimate), and the Durbin-Watson statistic.
Using SPSS
  Multiple Linear Regression Enter Method Output:
  Model Summary
Independence
The Durbin-Watson statistic is defined as:
D = Σ (e_t - e_{t-1})² / Σ e_t²
Another way to look at the Durbin-Watson statistic is:
D = 2(1 - ρ)
where ρ is the correlation between consecutive errors.
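A small sketch computing D directly from a vector of residuals and comparing it with the 2(1 - ρ) form (the residuals here are hypothetical, not taken from the SPSS output):

```python
# Sketch: Durbin-Watson statistic, D = sum_{t=2..N}(e_t - e_{t-1})^2 / sum_{t=1..N} e_t^2.
# Values near 2 suggest independent (uncorrelated) residuals.
import numpy as np

def durbin_watson_stat(e):
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

e = np.random.default_rng(4).normal(size=100)    # hypothetical residuals
rho = np.corrcoef(e[:-1], e[1:])[0, 1]           # correlation between consecutive errors
print(durbin_watson_stat(e), 2 * (1 - rho))      # the two forms agree closely
```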
Using SPSS
  Multiple Linear Regression Enter Method Output:
  ANOVA
Callout in the output: measures of variation.
Using SPSS
  Multiple Linear Regression Enter Method Output:
  Coefficients
    Regression Equation:
    ŷi = 10.50x1 + 0.057x2 + 2.029
Using SPSS
  Multiple Linear Regression Enter Method Output:
  Residuals Statistics
Using SPSS
  Multiple Linear Regression Enter Method Output:
  Residuals Histogram
Normality
Normality of residuals is required only for valid hypothesis testing; that is, the normality assumption assures that the p-values for the t-tests and the F-test will be valid. Normality is not required in order to obtain unbiased estimates of the regression coefficients.
Using SPSS
  Multiple Linear Regression Enter Method Output:
  Plot of Standardized Residuals
Normality
A standardized normal probability (P-P) plot is sensitive to non-normality in the middle range of the data rather than in the tails.
Using SPSS
  Multiple Linear Regression Enter Method Output:
  Interpretation of Output
  1. What contribution do both shelf space and price make to the
  prediction of sales of pet food?
Both independent variables (shelf space and price) together explain 85 per cent of the variance (R Square) in sales of pet food, which is highly significant, as indicated by the F-value of 34.08.
Using SPSS
  Multiple Linear Regression Enter Method Output:
  Interpretation of Output
2. Which of the two variables is the better predictor of sales of pet food?
An examination of the t-values and Beta values indicates that price contributes more to the prediction of sales. Therefore, you can say that price significantly predicts sales of pet food, t = 3.22, p < .05. However, the shelf space allocated is not a significant predictor.