Simple Linear Regression Guide
Regression Analysis
• Regression Analysis
  • A statistical procedure used to develop an equation showing how two or more variables are related.
  • Allows us to build a model/equation to help estimate and predict.
  • The entire process will take us from:
    • Taking an initial look at data to see if there is a relationship.
    • Creating an equation to help us estimate/predict.
    • Assessing whether the equation fits the sample data.
    • Using Statistical Inference to see if there is a significant relationship.
    • Predicting with the equation.
  • Regression Analysis does not prove a cause and effect relationship, but rather it helps us to create a model (equation) that can help us to estimate or make predictions.
• Simple Regression
  • Regression analysis involving one independent variable (x) and one dependent variable (y).
• Linear Regression
  • Regression analysis in which relationships between the independent variables and the dependent variable are approximated by a straight line.
• Simple Linear Regression
  • Relationship between one independent variable and one dependent variable that is approximated by a straight line, with slope and intercept.
• Multiple Linear Regression (not covered in this class)
  • Regression analysis involving two or more independent variables to create a straight-line model/equation.
• Curvilinear Relationships (not covered in this class)
  • Relationships that are not linear.
Scatter Chart to “See” If There Is a Relationship
• Graphical method to investigate if there
is a relationship between 2 quantitative
variables
• Excel Charting:
• Independent Variable = x
• Horizontal Axis
• Left most column in data set
• Dependent Variable = y = f(x)
• Vertical Axis
• Column to right of x data column
• Always label the x and y axes.
• Use an informative chart title.
• Goal of chart: Visually, we are “looking” to see if there is a relationship pattern.
• For our Ad Expense (x) / Sales (y) data, we “see” a direct relationship.
• To get the estimated line, equation, and r^2, right-click the markers in the chart and click “Add Trendline”. Then click the dialog button for “Linear” and the checkboxes for “Display equation on chart” & “Display R^2 on chart”. We learn about the equation & r^2 later…
Types of Relationships
Investigate if there is a relationship: With the Scatter Chart, you look to
see if there is a relationship.
• Looks like “As x increases, y increases”: Direct or Positive Relationship.
• Looks like “As x increases, y decreases”: Inverse, Indirect, or Negative Relationship.
• Looks like No Relationship.
Baseball Data Scatter Charts
Covariance and Correlation: Numerical Measures to
Investigate if There is a Relationship
• These numerical measures will be more precise than the “Positive”, “Negative”, and “No Relationship” (also “Little Relationship”) categories that the Scatter Chart gave us.
• Numerical measures to investigate if there is a relationship
between two quantitative variables.
Scatter Chart and Ybar and Xbar Lines
• Scatter Charts are graphical means to find a relationship between 2
quantitative variables.
• We need a numerical measure that is more precise than our Scatter Chart.
• To understand how the numerical measure can do this, we plot a Ybar line
and Xbar line on our chart.
Covariance
• Measure of the linear relationship
between two quantitative variables.
1. Positive values indicate a positive
relationship; negative, a negative
relationship.
2. Close to zero means there is not
much of a relationship.
3. The magnitude of covariance is
difficult to interpret.
4. Covariance has problems with units
(like feet compared to inches).
5. We can standardize covariance by
dividing it by sx*sy to get Coefficient
of Correlation.
• In Excel use COVARIANCE.S function
for sample data:
• Y data first, x data second
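To see what COVARIANCE.S is doing, here is a minimal Python sketch of the sample covariance formula. The data values are made up for illustration, not the workbook's actual figures.

```python
import numpy as np

# Made-up ad-expense (x) and sales (y) sample data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

xbar, ybar = x.mean(), y.mean()
n = len(x)

# Sample covariance, the same quantity Excel's COVARIANCE.S returns:
# sum of (xi - xbar)(yi - ybar), divided by n - 1.
cov_xy = np.sum((x - xbar) * (y - ybar)) / (n - 1)

print(cov_xy)                       # manual formula
print(np.cov(x, y, ddof=1)[0, 1])   # numpy cross-check
```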
Coefficient of Correlation (rxy)
• Measures the strength and direction of the linear relationship between two quantitative variables.
• A relative measure of strength of association (relationship) between 2 variables, or a measure of strength per unit of standard deviation, sx * sy.
• Solves Covariance’s “units”/magnitude problem.
• In Excel use the CORREL or PEARSON functions.
• Investigate if there is a relationship: We will have a number answer that indicates the strength and direction:
  1. Always a number between -1 and 1.
  2. 0 = No correlation.
  3. Near 0.5 or -0.5 = moderate correlation.
  4. Near -1 or 1 = strong correlation.
  5. Does not have problems with units like Covariance does.
  6. Can only be used for one independent variable to measure a linear relationship.
     • As opposed to the Coefficient of Determination (“r squared” or “Goodness of Fit Test”), which can be used for 1 or more independent variables and for linear or non-linear relationships.
• Note: The Correlation Coefficient measures the strength and direction of a LINEAR relationship, not nonlinear relationships. If you get a correlation measure near zero, it may be true that there is a very weak linear relationship, but that does not say that there is not some other sort of non-linear relationship.
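To connect the two measures, a minimal Python sketch on the same style of made-up data: dividing the sample covariance by sx * sy standardizes it into rxy.

```python
import numpy as np

# Standardize covariance by sx * sy to get the Coefficient of Correlation
# (Excel: CORREL or PEARSON). Made-up sample data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

cov_xy = np.cov(x, y)[0, 1]             # sample covariance
sx, sy = x.std(ddof=1), y.std(ddof=1)   # sample standard deviations

r = cov_xy / (sx * sy)                  # unit-free, always between -1 and 1
print(r)
print(np.corrcoef(x, y)[0, 1])          # numpy cross-check
```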
Ad Expenditures / Sales Example:
Covariance and Correlation in Excel
Overview: Simple Linear Regression
• Algebra:
• f(x) = y = m*x + b
• Statistics:
• Yhat = ŷ = b1*x + b0 (sample statistics)
• y = β1 x + β0 (population parameters)
y = β1x + β0 + ε
• y = Predicted value
• β1 = Slope β1 = “Beta sub 1”
• x = Value you put into the equation to try and predict the y value.
• β0 = Y-Intercept β0 = “Beta sub 0”
• ε = Error Value = random variable that accounts for the variability in y that cannot be explained by the linear relationship between x and y. ε = “Epsilon”
Because Not All Sample Points Are On The Estimated Line
We Will Get Some Error (e)
Assumptions About The Error Value (e) Necessary for the “Least
Squares Method” of calculating b1 and b0.
1. The assumption of bell shape for errors indicates that, right on the line, the mean of the error value at any particular x is zero: E(e) = 0.
  • This means that we can use the slope and intercept (β1 & β0) as constants.
2. The total population can be thought of as having sub-populations.
  • For each x value there is a range of possible y values (sub-population).
  • The Bell Shaped distribution is an assumption about the possibility of getting a y value above or below the line for a given x value.
3. The error (e) variation will be constant.
• E(y|x) = β1x + β0
  • Describes the line down the middle, where ε = 0.
  • Is the mean of all the y values and sits exactly on the line.
Simple Linear Regression Equation with Population Parameters
E(y|x) = β1x + β0
• E(y|x) = Expected Value or Mean of all the y values at a particular x value.
• E(y|x) = β1x + β0 describes a straight line down the middle, where ε = 0.
Sample Slope and Y-Intercept
• Because population parameters for slope & intercept are not usually known, we
estimate them using sample data in order to calculate sample statistics for slope
and y-intercept.
Estimated Simple Linear Regression Equation
with Sample Statistics
1. ŷ = Point estimator of E(y|x) = estimates the mean of all y values for a given x in the population, or
2. ŷ = Can predict an individual y value for a particular business situation.
• Graph of the estimated simple linear regression equation is called “estimated regression line”.
Estimation Process for Simple Linear Regression
Overview:
Least Squares Method to Derive Formula for b1 & b0
**For proof of formulas, see
downloadable pdf file.
Formulas for the estimated Slope (b1) and Y-intercept (b0):
• b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
• b0 = ȳ − b1·x̄
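As a cross-check outside Excel, here is a minimal Python sketch of these two formulas on made-up sample data; Excel's SLOPE and INTERCEPT functions return the same values.

```python
import numpy as np

# Least Squares estimates from the formulas above (made-up sample data):
# b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2),  b0 = ybar - b1 * xbar
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)  # slope
b0 = ybar - b1 * xbar                                           # y-intercept

print(b1, b0)
# Cross-check with numpy's least-squares polynomial fit:
print(np.polyfit(x, y, 1))  # returns [slope, intercept]
```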
Ad Expenditures / Sales Example:
Calculate Slope and Y-intercept
Experimental Region
• We are not sure if the relationship is
linear outside our x sample data range.
• It is best to make predictions over the
range of the min and max of the
sample x data.
• This range is called the
“Experimental Region”
• When you make predictions outside
the Experimental Region, it is called
“Extrapolation”
• The y-intercept is often estimated
using Extrapolation
• We do not have empirical evidence
that the relationship holds outside the
experimental region; an estimate
created outside the experimental
region is an unreliable estimate.
Bike Weight/ Bike Price Example:
Calculate Slope and Y-intercept from Sample Data, Make a Prediction
• ŷ = $391,243.63
• Vertical lines on the chart represent residuals, or “Errors in Equation”.
• Sometimes the equation overpredicts: the Predicted Value is above the actual Y Value.
Residuals = (yi − ŷi)
• Predicted Values = ŷi = b1·xi + b0
  • Calculate predicted values using the Estimated Equation at each xi.
  • The FORECAST function in Excel can be used for predicted values. All it needs is an x value and the known y and x values, and it will calculate the predicted value using the Estimated Regression Equation’s Slope and Intercept.
• Residuals = (Yi − ŷi)
  • Particular value − Predicted value.
  • Distance that the “Original Y Sample Value” is above or below the Estimated Line.
  • Vertical lines on the chart represent residuals.
• Note 1: Sum of Yi values = Sum of predicted values (all ŷi).
• Note 2: Sum of residuals = 0.
• Note 3: Residuals Squared are minimized because we calculated predicted values with b1 & b0.
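A minimal Python sketch that verifies the three notes above on made-up data:

```python
import numpy as np

# Residuals check using the estimated equation (made-up data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b1, b0 = np.polyfit(x, y, 1)

y_hat = b1 * x + b0          # predicted values (Excel: FORECAST per x)
residuals = y - y_hat        # particular value minus predicted value

# Note 1: sums of actual and predicted y values match.
print(y.sum(), y_hat.sum())
# Note 2: residuals sum to (essentially) zero.
print(residuals.sum())       # ~0 up to floating-point rounding
# Note 3: SSE (Excel: SUMSQ on the residuals) is the minimized quantity.
print(np.sum(residuals ** 2))
```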
Ybar Line vs. Estimated Simple Linear Equation Line
for Making Predictions
• Ybar Model: these errors are called “Deviations”.
• Yhat Model: these errors are called “Residuals”.
If All Sample Points Fell on Estimated Line, There Would
Be No Errors
Error in Predicting Y Using Equation
Total Error if we used just Ybar
Two Parts in Total Error
Total Error = Explained + Unexplained
Total Error = Regression + Error
Comparing “Regression” or “Error” to “Total” to measure “Goodness of Fit”…
• If we would like to compare the total “Regression” or “Error” to the “Total Error”, the
problem is:
• If we want to add up all “Total Error” or all “Unexplained” or all “Regression”:
• We would get zero!!!!
Squaring & then Summing “Total”, “Regression” & “Error”
SST = Σ(Yi − Ȳ)²   SSR = Σ(Ŷi − Ȳ)²   SSE = Σ(Yi − Ŷi)²   (sums over i = 1 to n)

• SST = Sum of Squares Total.
  • Measure of the error involved in using Ybar to make a prediction.
  • How well the sample points cluster around the Ybar line.
• SSR = Sum of Squares due to Regression.
  • Measure of how far the Predicted Value is away from Ybar.
  • Amount of SST that is explained by the Estimated Regression Equation/Line.
  • Explained part of SST.
  • If all Sample Points fall on the Estimated Line, SSR = SST.
• SSE = Sum of Squares due to Error.
  • Measure of how far away the Particular Value is from the Predicted Value.
  • Amount of SST that is unexplained by the Estimated Regression Equation/Line.
  • Unexplained part of SST.
  • How well the sample points cluster around the Ŷ line.
  • If all Sample Points fall on the Estimated Line, SSE = 0.
  • If you have the residuals already calculated, you can use the Excel function SUMSQ.
How to Think About SST and SSE
SST vs. SSE:
• A measure of how much better using Yhat is for making predictions than using Ybar.
Relationship Between SST, SSR and SSE
SST = Σ(Yi − Ȳ)² = SSR + SSE = Σ(Ŷi − Ȳ)² + Σ(Yi − Ŷi)²
• Coefficient of Determination: r^2 = SSR/SST = the proportion of SST explained by the regression.
• In Excel the RSQ function can be used to calculate r^2 using just the sample x & y data.
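A minimal Python sketch on made-up data, verifying SST = SSR + SSE and that r^2 = SSR/SST matches what RSQ would return:

```python
import numpy as np

# SST = SSR + SSE and r^2 = SSR / SST, on made-up sample data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b1, b0 = np.polyfit(x, y, 1)
y_hat = b1 * x + b0

sst = np.sum((y - y.mean()) ** 2)       # total error around Ybar
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained part
sse = np.sum((y - y_hat) ** 2)          # unexplained part

print(sst, ssr + sse)                   # SST = SSR + SSE
r2 = ssr / sst                          # Coefficient of Determination (Excel: RSQ)
print(r2, np.corrcoef(x, y)[0, 1] ** 2) # equals rxy^2 in simple regression
```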
Ad Expenditures / Sales Example: Calculate Coefficient Of Determination
The Closer to 1, the Better the Fit.
Coefficient of Determination
• From page 137 in our Essentials of Business Analytics textbook (ISBN10: 1-285-18726-1):
• For typical data in the social and behavioral sciences, values of r^2 as low as 0.25
are often considered useful.
• For data in the physical and life sciences, r^2 values of 0.60 or greater are often found, and in some cases, r^2 values greater than 0.90 can be found.
• In business applications, r^2 values vary greatly, depending on the unique characteristics of each application.
Compare Coefficient Of Determination &
Coefficient Of Correlation
C. of Correlation = rxy
• rxy = (Sign of b1)*SQRT(r^2).
• Number between -1 and 1.
• Measures strength and direction of the linear relationship between one independent variable and one dependent variable.
• Only for linear relationships.
• Only for one independent variable.

C. of Determination = r^2
• r^2 = (rxy)^2.
• Number between 0 and 1.
• Measures strength and goodness of fit of the relationship.
• Can be used on linear or non-linear relationships.
• Can be used for one or more independent variables.
• Referred to as R^2 in Multiple Regression.
Estimates for Variance & Standard Deviation of the Estimated
Regression Equation
SSE = Σ(Yi − Ŷi)² = Total Residual Error

MSE = SSE / (n − 2) = Σ(yi − ŷi)² / (n − 2) = Estimate of Variance for the Regression Equation

s = √( Σ(yi − ŷi)² / (n − 2) ) = Estimate of Standard Deviation for the Regression Equation, called the “Standard Error of Estimate” or “Standard Error of y”. (Measures the spread of the Residuals.)
• You can use the Excel function STEYX to calculate the Standard Error of Estimate directly; all it needs are the x and y sample point values.
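A minimal Python sketch of MSE and the Standard Error of Estimate on made-up data; STEYX would return the same s.

```python
import numpy as np

# Standard Error of the Estimate: s = sqrt(SSE / (n - 2)). Made-up data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b1 * x + b0)

n = len(x)
sse = np.sum(residuals ** 2)
mse = sse / (n - 2)   # estimate of variance; n - 2 because b1 and b0 were estimated
s = np.sqrt(mse)      # standard error of estimate (spread of the residuals)
print(mse, s)
```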
Standard Error of the Estimate
Bike Weight/ Bike Price Example:
Calculate Coefficient Of Determination & Standard Error
“How Fairly A Statistic Represents Its Data Points”.
• Measuring how fairly the “Mean” represents its data points, with “Standard Deviation”:
√( Σ(yi − ȳ)² / (n − 1) )
• Measuring how fairly the “Estimated Regression Equation” represents its data points, with the “Standard Deviation of the Estimated Line” or “Standard Error of the Estimate”:
√( Σ(Yi − Ŷi)² / (n − 2) )
Data Analysis, Regression feature Step 1: Dialog Box
Degrees of Freedom
• Degrees of freedom represents the number of independent units of information in a calculation.
• In general, Degrees of Freedom = df = n - # of estimated parameters.
• Both of these would become:
• s(mean) = √( Σ(yi − ȳ)² / df )
• s(regression line) = √( Σ(Yi − Ŷi)² / df )
Data Analysis, Regression feature Step 2: Output
What Regression Output Means
LINEST Array Function to deliver 10 statistics
for Linear Regression
• Highlight one more column than there are independent variables, and 5 rows.
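Outside Excel, scipy's linregress is a rough analogue of part of LINEST's output; a minimal sketch on made-up data:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up sample data
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

result = stats.linregress(x, y)
print(result.slope, result.intercept)   # b1 and b0
print(result.rvalue ** 2)               # r^2
print(result.pvalue)                    # p-value for H0: slope = 0
print(result.stderr)                    # standard error of the slope
```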
• However:
• Before we can test the reasonableness of the slope and y-intercept, we must check
to see if the assumptions necessary to use the Least Squares Regression Model are
valid.
Conditions Necessary for Valid Inference with the Least Squares Model
• For any given combination of
values of the independent
variables x1, x2,…, xq, the
population of potential error
terms e must be:
1. Normally Distributed (Bell
Shaped)
2. Has a mean of zero
3. Has a constant variance
4. The values of e are statistically
independent
• In general, this is a concern only when we collect data from a single entity over time, like in a time series.
We can visually test assumptions by examining Residual Plots
Implication of Assumptions
• e Normally Distributed (Bell Shaped)
• Implication: Because y is a linear function of e, y is a normally distributed random variable for all values of x.
• e has a mean of zero, E(e) = 0
• Implication: The slopes and intercepts are
constants and we can use the line to
estimate or predict
• e has a constant variance
• Implication: the variance is the same for all x
• The values of e are statistically
independent
• Linear model can be used without an
adjustment for seasonality or other cyclical 65
patterns
Visually Testing Assumptions: Plot Residuals Against X Values
• Markers above Zero Line indicate a
Positive Residual, which is a Sample
Point that is above the Predicted Value.
• The equation underestimated as compared to the actual Sample Point.
If the assumptions are met, we should see consistent markers above and below the Zero Line, with a higher frequency near the Zero Line. This plot does not supply evidence that would support rejecting the assumptions.
You can manually plot Residuals Against x or use Data
Analysis, Regression feature
Visually Testing Assumptions: Plot Residuals Against X Values
1. e term (and thus dependent variable y) is Normally
Distributed.
• If this assumption is met we should see:
• The frequency of values near Zero Line should be
greater than frequency of values away from Zero Line.
2. E(e) = 0, has a mean of zero.
• If this assumption is met we should see:
• For a given x value the center of the spread of the
errors should be near the Zero Line.
• About same number of values above and below the
Zero Line.
3. Has a constant variance.
• If this assumption is met we should see:
• Errors are symmetrically distributed above and below, and along, the Zero Line across the x values.
• Spread of errors looks similar for each x.
This plot does NOT provide evidence of a violation of the conditions necessary for valid inference in regression.
The implication of 1) bell shaped errors, 2) mean = 0 & 3) constant variance is that the point estimates are unbiased (do not tend to underpredict or overpredict), for any given combination of values of the independent variables x1, x2,…, xq.
If assumptions are met, point estimates tend to not
underpredict or overpredict (unbiased estimate)
Inference in Regression is Generally Valid Unless You See Marked Violations Such as These:
These plots provide strong evidence of a violation of the conditions necessary for valid inference in regression.
If Residual Plots Show Assumptions Are Met:
• We can run Hypothesis test to check reasonableness of regression parameters β0,
β1, β2, . . . , βq .
• We can create Confidence Intervals for our predicted values of our dependent
variable.
• Hypothesis Testing (Busn 210)
• A statistical procedure that uses sample evidence & probability theory to determine whether a statement about the value of a parameter is reasonable (reliable) or not.
• Confidence Intervals (Busn 210)
• From sample data we calculate a lower & upper limit to make probability
statement about how sure we are that the population parameter will lie
between the lower and upper limit.
• 95% Confidence Intervals mean that if we constructed 100 similar intervals, about 95 would contain the population parameter and 5 would not.
The Logical Element to Test in Linear Regression: Slope
• If the slope is zero, there is probably NOT a relationship.
• If the slope is NOT zero, there is probably a relationship.
Hypothesis Test to Check if Slope/s Are Equal to Zero
• If slope/s are ALL equal to zero, then:
• E(y|x1,x2,…xq) = β0 + β1x1 + β2x2 + · · · + βqxq
becomes:
• E(y|x1,x2,…xq) = β0 + 0*x1 + 0*x2 + · · · + 0*xq
becomes:
• E(y|x1,x2,…xq) = β0
• Not a linear function of x1, x2,…, xq
• For our Hypothesis Test, our goal is to “Reject” the hypothesis “ALL Slope/s = 0”.
• If ALL Slope/s = 0, then the model would be no better than the Ybar line for making predictions.
Steps For Hypothesis Testing
1. State The Null and Alternative Hypotheses.
• Null Hypothesis = H0 = All Slope/s = 0
• Alternative Hypothesis = Ha = All Slope/s <> 0
• “At Least One Slope is NOT Equal to Zero”
2. Set Level of Significance = “alpha”.
• Alpha = risk of rejecting H0 when it is TRUE.
• Alpha determines the hurdle for whether or not the test statistic just represents sample error or
there is a true “statistically significant” difference (past the hurdle).
• Alpha is used to compare against the p-value: if p-value <= Alpha, we Reject H0 and accept Ha.
• Alpha is often 0.05 or 0.01.
• When testing the slope, when we get a statistically significant difference, it will mean:
• It is reasonable to assume that at least one of the slopes is not zero.
• It is reasonable to assume that there is a statistically significant relationship.
3. Rejection Rule:
• “If the p-value is less than our alpha, we reject H0 and accept Ha; otherwise, we fail to reject H0.”
Steps For Hypothesis Testing
4. From Sample Data calculate the Test Statistic and then calculate the p-value of the Test Statistic.
• Use F Test Statistic (and F Distribution) for Testing Overall Significance:
• The F Test Statistic is:
• F = (SSR/q) / (SSE/(n − q − 1)) = (SSR/df Regression) / (SSE/df Error) = MSR/MSE
• p-value:
• = F.DIST.RT(F Test Statistic, q, n − q − 1)
• Use the t Test Statistic (and t Distribution) for testing the individual Slope and Y-Intercept:
• t Test Statistic for Slope:
• t = b1 / s(b1), where s(b1) = s / √( Σ(xi − x̄)² )
• Data Analysis Regression Output provides F & t Test Statistics, & p-values.
• The key is: If p-value is less than Alpha, Reject H0 and Accept Ha
Steps For Hypothesis Testing
5. From the sample evidence, make reasonable statements about the population parameter.
• If we reject H0 and accept Ha we will say:
• “The sample evidence suggests that at least one slope is not equal to zero. It is reasonable to
assume that there is a significant relationship at the given level of significance.”
• “xi and y are related and a linear relationship explains a statistically significant portion of the
variability in y over the Experimental Region.”
• If we fail to reject H0 we will say:
• “The sample evidence suggests that all slope/s are equal to zero. It is reasonable to assume
that there is NOT a significant relationship at the given level of significance.”
F Distribution for Hypothesis Test
F Test Statistic for Hypothesis Test
• Use the F Distribution.
• The F Test Statistic is:
• F = (SSR/q) / (SSE/(n − q − 1)) = (SSR/df Regression) / (SSE/df Error) = MSR/MSE
• SSR = Sum of squares due to regression (explained variation).
• SSE = Sum of squares due to error (unexplained variation).
• q = the number of independent variables in the regression model.
• n = the number of observations in the sample.
• SSR/q = MSR = Mean Square Regression = test statistic that measures variability in the dependent variable y that is explained by the independent variables (x1, x2… xq).
• SSE/(n − q − 1) = MSE = Mean Square Error = measure of variability that is not explained by x1, x2… xq.
• df = degrees of freedom = term used in Excel ANOVA output.
• The larger the F value, the stronger the evidence that there is an overall regression relationship.
• p-value = Probability of getting the F Test Statistic or greater in the F Distribution.
• The smaller the p-value, the stronger the evidence that there is an overall regression relationship (the stronger the evidence against the Null that all slopes are zero).
• If the p-value is smaller than alpha, we reject H0.
• Data Analysis Regression Output provides the F statistic and its p-value.
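A minimal Python sketch of the overall F test on made-up simple-regression data (q = 1); scipy's f.sf plays the role of F.DIST.RT.

```python
import numpy as np
from scipy import stats

# F test for overall significance in simple regression (made-up data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b1, b0 = np.polyfit(x, y, 1)
y_hat = b1 * x + b0

n, q = len(x), 1
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained variation
sse = np.sum((y - y_hat) ** 2)          # unexplained variation

msr = ssr / q              # Mean Square Regression
mse = sse / (n - q - 1)    # Mean Square Error
F = msr / mse              # F test statistic

# Upper-tail p-value, the analogue of Excel's F.DIST.RT(F, q, n-q-1):
p_value = stats.f.sf(F, q, n - q - 1)
print(F, p_value)          # reject H0 (all slopes = 0) if p_value < alpha
```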
Formulas for testing individual estimates of parameters:
• Sum of Squares of Error (Residuals) = SSE = Σ(yi − ŷi)² = Σ(yi − b0 − b1·xi)²
• Confidence interval for b0 is b0 ± t(α/2)·s(b0), where t(α/2) is the t value providing an area of α/2 in the upper tail of a t distribution with n − 2 degrees of freedom.
Testing Individual Regression Parameters
• If F Test Statistic indicates that at least one of the slope/s are not zero, then we can test if there is a
statistically significant relationship between the dependent variable y and each of the independent
variables by testing each slope.
• We use the t Distribution (Bell Shaped Probability Distribution from Busn 210)
• We use the t Test Statistic to test whether the slope is zero:
• t = b1 / s(b1)
• b1 = slope
• s(b1) = estimated standard error of the slope = s / √( Σ(xi − x̄)² )
• t = # of standard deviations.
• If t is past our hurdle, we reject H0 and accept Ha.
• H0: Slope = 0
• Ha: Slope <> 0
• Alpha of 0.05 or 0.01 is often used. Alpha determines the hurdle, or is used to compare against the p-value.
• This is a two-tail test.
• If t is past hurdle in either direction, reject H0 and accept Ha. It seems reasonable that the slope is not zero.
• If the p-value is less than alpha, it seems reasonable that the slope is not zero. The smaller the p-value, the stronger the evidence that the slope is not zero, and the more evidence we have that a relationship exists between y and x.
• In Simple Linear Regression, t test and F test will yield same p-value.
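A minimal Python sketch of the t test for the slope on made-up data, using the formulas above:

```python
import numpy as np
from scipy import stats

# t test for H0: slope = 0 in simple regression (made-up data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b1, b0 = np.polyfit(x, y, 1)

n = len(x)
residuals = y - (b1 * x + b0)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))     # standard error of estimate
s_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))   # standard error of the slope

t = b1 / s_b1                                     # number of std errors from 0
p_value = 2 * stats.t.sf(abs(t), df=n - 2)        # two-tail p-value
print(t, p_value)
# In simple regression t^2 equals the F statistic, so the p-values match.
```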
t Distribution for Hypothesis Test
Hypothesis Test For Weekly Ad Expense and Sales Example:
• First we look at the residual plot to see if the assumptions of the Least Squares Method are met. It
appears that the assumptions are met. The plot does NOT provide evidence of a violation of the
conditions necessary for valid inference in regression.
• Because the p-value for the F Test Statistic is less than 0.01, we reject H0 and accept Ha. It is reasonable to assume that the slope is not zero and that there is a significant relationship between x and y. A linear relationship explains a statistically significant portion of the variability in y over the Experimental Region.
• Similarly, the p-value for the Y-Intercept is less than 0.01, and so we conclude it is not zero. However, the Y-Intercept value is not in our Experimental Region.
What the F Statistic Hypothesis Test Looks Like
Confidence Intervals to Test if Slope β1 & Y-Intercept β0 Are Equal to 0
• Excel Data Analysis, Regression tool calculates upper and lower limit for a Confidence Interval
• Interval does not contain 0: conclude Y-Intercept (β0) is not zero (when all x are set to zero).
• Interval does not contain 0: conclude Slope (β1) is not zero (there is a linear relationship)
• Found an overall regression relationship at both alpha = 0.05 & alpha = 0.01.
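A minimal Python sketch of the slope's confidence interval on made-up data; the Data Analysis Regression tool reports the same style of lower and upper limits.

```python
import numpy as np
from scipy import stats

# 95% confidence interval for the slope: b1 +/- t(alpha/2, n-2) * s(b1).
# If the interval does not contain 0, conclude the slope is not zero.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up sample data
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b1, b0 = np.polyfit(x, y, 1)

n = len(x)
residuals = y - (b1 * x + b0)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))
s_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
print(b1 - t_crit * s_b1, b1 + t_crit * s_b1)   # lower and upper limits
```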
Nonsignificant Variables: Reassess Whole Model/Equation
Slope:
• If the Slope is not significant (Do not reject H0: Slope = 0):
  • If practical experience suggests that the nonsignificant x (independent variable) has a relationship with the y variable, consider leaving the x in the model/equation.
    • Business example: # of deliveries for a truck route had an insignificant slope, but was clearly related to total time.
  • If the model/equation adequately explains the y variable without the nonsignificant x independent variable, try rerunning the regression process without the nonsignificant x variable, but be aware that the calculations for the remaining variables may change.

Intercept:
• If the Y-intercept is not significant:
  • The decision to include or not include the calculated y-intercept may require special consideration, because setting “Constant is Zero” in the Data Analysis Regression tool will set the equation intercept equal to zero and may dramatically change the slope values.
    • Business example when you might want the equation to go through the origin (x=0, y=0): labor hours = x and Output = y.

• The key is that you may have to run the regression tool in Excel a number of times over various variables to try and get the best slopes and y-intercept for the equation.
Multicollinearity
• Multicollinearity
• Correlation among the independent variables when performing multiple regression.
• In Multiple Regression, when you have more than one x, each x should be related to the y value, but in general, no two x values should be related to each other.
  • For example, if we have y = time for truck deliveries in a day, x1 = number of miles, x2 = amount of gas: because number of miles is related to gas, the resulting multiple regression process may have problems.
• Use PEARSON or CORREL to analyze any 2 x variables.
  • Rule of thumb: if the absolute value is greater than 0.7, there is a potential problem.
• The problem with correlation among the independent variables is that it increases the variances & standard errors of the estimated parameters (β0, β1, β2, . . . , βq) and predicted values of y, and so inference based on these estimates is not as precise as it should be.
  • For example, if t tests or confidence intervals lead us to reject a variable as nonsignificant, it may be because there is too much variation and thus the interval is too wide (or the t stat is not past the hurdle).
  • We may incorrectly conclude that the variable is not significantly different from zero when the independent variable actually has a strong relationship with the dependent variable.
• If inference is a primary goal, we should avoid variables that are highly correlated.
  • If two variables are highly correlated, consider removing one.
• If predicting is the primary goal, multicollinearity is not necessarily a concern.
• Note: If any statistic (b0, b1, b2, . . . , bq) or p-value changes significantly when a new x variable is added or removed, we must suspect that multicollinearity is at play.
• Checking correlation between pairs of variables does not always uncover multicollinearity.
  • A variable might be correlated with multiple other variables combined. To check: treat x1 as the dependent variable and the rest of the x variables as independent, run the regression (ANOVA table), and see if R^2 is big, indicating a strong relationship. R^2 > 0.5 is a rule of thumb that there might be multicollinearity. (See the sketch below.)
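A minimal Python sketch of both checks on deliberately correlated, made-up x data: pairwise correlations, then the R^2 from regressing one x on the others.

```python
import numpy as np

# Two rough multicollinearity checks on made-up data (3 independent variables).
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=50)   # deliberately correlated with x1
x3 = rng.normal(size=50)
X = np.column_stack([x1, x2, x3])

# Check 1: pairwise correlations (rule of thumb: |r| > 0.7 is a potential problem).
print(np.corrcoef(X, rowvar=False))

# Check 2: regress x1 on the other x's and look at R^2 (rule of thumb: R^2 > 0.5).
others = np.column_stack([np.ones(len(x1)), x2, x3])
coef, *_ = np.linalg.lstsq(others, x1, rcond=None)
fitted = others @ coef
r2 = 1 - np.sum((x1 - fitted) ** 2) / np.sum((x1 - x1.mean()) ** 2)
print(r2)   # a large R^2 suggests x1 is predictable from the other x's
```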
Inference and Very Large Samples
• When sample size is large:
1. Estimates of Variance and Standard Error (# Standard Deviations) are calculated
with sample size in the denominator. As sample size increases, Estimates of
Variance and Standard Error decrease.
2. Law of Large Numbers (Large Sample Size) says that as sample size gets bigger,
statistic approaches parameter. As statistic approaches parameter, variation
between the two decreases. As the variation between the two decreases,
Estimates of Variance and Standard Error decrease.
3. As Estimates of Variance and Standard Error decrease, the intervals used in inference (Hypothesis Testing and Confidence Intervals) decrease, p-values get smaller, and almost all relationships will seem significant (both meaningful and specious ones).
• You can’t really tell from the small p-value alone whether the relationship is meaningful or specious (deceptively attractive).
4. Multicollinearity can still be an issue.
Small Sample Size
• It may be hard to test the assumptions for inference in regression, like with a Residual Plot (because there are not enough sample points).
• Assessing multicollinearity is difficult.