SIMPLE LINEAR REGRESSION is a statistical method that allows us to summarise, study, and
estimate relationships between two quantitative variables.
LINEAR means that a given change in x is associated with the same change in y, wherever
on the x-scale that change occurs.
You can use simple linear regression when you want to know:
►how strong the relationship is between two variables
e.g. the relationship between rainfall and soil erosion
►the value of the dependent variable at a certain value of the independent variable
e.g. the amount of soil erosion at a certain level of rainfall
Why do we use linear regression?
►data analysts believe that there is a linear relationship between the variable of interest
(y) and the explanatory variables (x)
►business managers make decisions based on the assumed relationship between a
variable of interest (y) and one or more explanatory variables (x)
e.g. predicting sales (y) in terms of expenditure on adverts (x)
How do we identify if a relationship is linear?
►scatter plot
►correlation
INDEPENDENT VARIABLE (x) is the variable that we can manipulate and change and is
assumed to have a direct effect on the dependent variable. Also known as the predictor
or explanatory variable.
DEPENDENT VARIABLE (y) is the variable being tested and measured and is dependent on
the independent variable. Also known as the response or outcome variable.
@study_ingmadesimple Luca du Toit
Simple linear regression formula
$y_i = a + b x_i + e_i$ for $i = 1, 2, \dots, n$
or
$y = \beta_0 + \beta_1 x + u$
►y is the value of the dependent variable at a given value of the independent variable (x).
►a or $\beta_0$ is the intercept, the value of y when x is 0.
$a = \bar{y} - b\bar{x}$
►b or $\beta_1$ is the slope parameter: how much we expect y to change when x increases by
one unit.
$b = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
►x is the independent variable (the variable we expect is influencing y).
►e or u is the error term: the part of y that the regression line does not explain, i.e.
the distance between an observed value of y and the line.
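The formulas for a and b can be checked by hand or with a few lines of code. Below is a minimal Python sketch, assuming five made-up (x, y) pairs that are not taken from these notes; the function name fit_line is also just an illustrative choice.

```python
# Hypothetical data, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    # The intercept follows from the means: a = y_bar - b * x_bar.
    a = y_bar - b * x_bar
    return a, b


a, b = fit_line(x, y)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
```

For this made-up data the fitted line is y = 2.2 + 0.6x: each one-unit increase in x is associated with a 0.6 increase in y.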
Linear regression steps:
(1) plot the two variables – with the variable of interest (dependent) on the y-axis and the
explanatory variable (independent) on the x-axis
(2) check to see if the relationship shows any linearity
(3) check to see if it shows a positive or negative relationship visually
(4) use the data to obtain the linear regression line parameters (a and b)
(5) check to see if you can predict a value of y given a new value of x
(6) check how far outside the data range the regression equation can reasonably be used
to predict
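Steps (3) to (6) can be sketched in Python as follows. This is only an illustration under stated assumptions: the data is made up, and the helper predict and its "inside the data range" flag are hypothetical names, not part of any standard tool. The flag simply reports whether the new x lies between the smallest and largest observed x.

```python
# Hypothetical data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Step (4): obtain the regression line parameters a and b.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar

# Step (3): the sign of the slope gives the direction of the relationship.
direction = "positive" if b > 0 else "negative" if b < 0 else "flat"
print(f"The relationship is {direction} (b = {b:.2f})")


def predict(x_new):
    """Steps (5) and (6): predict y and flag whether x_new is inside the observed x range."""
    inside_range = min(x) <= x_new <= max(x)
    return a + b * x_new, inside_range


for x_new in (3.5, 10.0):
    y_hat, inside_range = predict(x_new)
    note = "within the data range" if inside_range else "extrapolation, treat with caution"
    print(f"x = {x_new}: predicted y = {y_hat:.2f} ({note})")
```

A prediction far outside the observed x values relies on the linear pattern continuing where there is no data to confirm it.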
How well does the estimated linear regression line fit the data: Goodness-of-fit
Measures of Variation – The Sum of Squares
►calculate the sum of squares of prediction error $SSE = \sum (y_i - \hat{y}_i)^2$
►calculate the total sum of squares $SST = \sum (y_i - \bar{y})^2 = SSE + SSR$
►calculate the sum of squares due to regression $SSR = \sum (\hat{y}_i - \bar{y})^2$
where $\bar{y}$ denotes the sample mean of y
where $y_i$ denotes the actual (observed) values
where $\hat{y}_i$ denotes the predicted values
where $\hat{u}_i = y_i - \hat{y}_i$ denotes the residuals (estimated error terms)
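The three sums of squares can be computed directly from their definitions. The Python sketch below uses made-up data (an assumption for illustration, not data from these notes) and checks that SST = SSE + SSR holds for the least-squares line.

```python
# Hypothetical data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Fit the least-squares line y = a + b*x.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar

# Predicted values on the fitted line.
y_hat = [a + b * xi for xi in x]

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # variation explained by the line
sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation in y

print(f"SSE = {sse:.2f}, SSR = {ssr:.2f}, SST = {sst:.2f}")
print("SST equals SSE + SSR:", abs(sst - (sse + ssr)) < 1e-9)
```

For this data SSE = 2.40, SSR = 3.60 and SST = 6.00.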
Coefficient of determination (𝑹𝟐 )
The coefficient of determination measures how much of the variability in y is explained
by its linear relationship with x. This measure of "goodness of fit" takes a value
between 0.0 and 1.0 and gives the proportion (often quoted as a percentage) of the
variation in y explained by the x-variables. A high $R^2$ indicates a good fit, while a
low $R^2$ indicates a poor fit. However, a low $R^2$ does not necessarily mean that the
ordinary least squares regression is of little use.
$R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$, where $0 \le R^2 \le 1$
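Both forms of the formula give the same number for a least-squares line. A short Python sketch, assuming made-up data and a hypothetical helper named r_squared:

```python
def r_squared(x, y):
    """Return R^2 for the least-squares line, computed as SSR/SST and as 1 - SSE/SST."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    a = y_bar - b * x_bar
    y_hat = [a + b * xi for xi in x]
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return ssr / sst, 1 - sse / sst


# Hypothetical data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r2_from_ssr, r2_from_sse = r_squared(x, y)
print(f"R^2 = SSR/SST = {r2_from_ssr:.2f}")
print(f"R^2 = 1 - SSE/SST = {r2_from_sse:.2f}")
```

Here $R^2 = 0.60$, so the line explains 60% of the variation in y for this made-up data.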
Correlation coefficient (𝒓𝒙𝒚 )
The correlation coefficient measures how strong the linear relationship is between x and y.
The formula returns a value between -1 and 1, where:
►1 indicates a perfect positive linear relationship
►-1 indicates a perfect negative linear relationship
►0 indicates no linear relationship
$r_{xy} = (\text{sign of } b) \times \sqrt{R^2}$, where b is the slope in the equation $y = a + bx$
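The link between r, the slope and $R^2$ can be checked numerically. The Python sketch below is an illustration under assumptions: the data is made up, the helper names are hypothetical, and the "direct" value uses the usual Pearson formula (the sum of cross-deviations divided by the square root of the product of the two sums of squared deviations), which these notes do not state explicitly.

```python
import math


def slope_and_r2(x, y):
    """Return the least-squares slope b and R^2 = 1 - SSE/SST."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return b, 1 - sse / sst


def pearson_r(x, y):
    """Correlation coefficient computed directly from the deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: one upward-sloping and one downward-sloping example.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 5.0, 4.0, 5.0]
y_down = [5.0, 4.0, 5.0, 4.0, 2.0]

for label, y in (("upward", y_up), ("downward", y_down)):
    b, r2 = slope_and_r2(x, y)
    # r takes the sign of the slope and the magnitude sqrt(R^2).
    r_from_r2 = math.copysign(math.sqrt(r2), b)
    print(f"{label}: sign(b)*sqrt(R^2) = {r_from_r2:.4f}, direct r = {pearson_r(x, y):.4f}")
```

The two calculations agree in both cases, and r is negative exactly when the slope is negative.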
Outlier VS Influential points
OUTLIER is a data point (x,y) whose y-value does not follow the general trend of the data.
INFLUENTIAL POINT is a data point (x,y) whose inclusion unduly changes the regression line:
removing it would significantly change the fitted line.
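One way to see the difference is to refit the line with and without the suspect point. The Python sketch below uses made-up data (an assumption for illustration): the point (3, 10) has an unusual y-value at an average x-value, while the point (15, 2) has an extreme x-value and pulls the line towards itself.

```python
def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    return y_bar - b * x_bar, b


# Hypothetical base data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

a_base, b_base = fit_line(x, y)

# Outlier: y-value far from the trend, but x sits at the mean of x.
a_out, b_out = fit_line(x + [3.0], y + [10.0])

# Influential point: extreme x-value with a y-value that does not follow the trend.
a_inf, b_inf = fit_line(x + [15.0], y + [2.0])

print(f"base data:              a = {a_base:.2f}, b = {b_base:.2f}")
print(f"with outlier (3, 10):   a = {a_out:.2f}, b = {b_out:.2f}")
print(f"with influential point: a = {a_inf:.2f}, b = {b_inf:.2f}")
```

In this example the outlier shifts the line upwards but leaves the slope at 0.6, whereas the influential point turns the slope negative.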
Regression using Excel
(1) To perform regression in Excel, use the Data Analysis tool in the Analysis group under
the Data tab. If the Analysis group does not appear under the Data tab, the add-in will
need to be installed.
►To install the analysis tools open the File tab and select Options.
►In the Excel Options menu select Add-ins.
►Select Analysis ToolPak-VBA and press Go to install.
►The Analysis group will appear under the Data tab.
(2) Click on a blank cell and then open the Data tab. Select Data Analysis in the Analysis
group.
(3) Select Regression from the pop-up menu and press OK.
(4) Select the 𝑦 range by dragging your cursor over the column containing the 𝑦 values,
including the column name.
(5) Select the 𝑥 range by dragging your cursor over the column containing the 𝑥 values,
including the column name.
(6) Select Labels to add the variable names and press OK.
(7) The regression output should look like this:
e.g. a R1 increase in income is associated with a R0.89 increase in consumption spending,
on average.
(8) Select the Scatter Charts icon in the Charts group under the Insert tab and select the
Scatter option from the dropdown list.
(9) Select Add Chart Element from the Chart Layouts group, open the Trendline menu and
select Linear to add a fitted regression line to the chart. Edit the appearance of the
chart as needed, add axis labels, and give the chart an appropriate title.
The graph indicates a positive relationship between income and consumption. As income
increases, mean consumption also tends to increase. The upward-sloping line corresponds
to a positive regression coefficient on income (𝛽1), which is the slope parameter. A
negative coefficient would indicate a negative relationship between the dependent
variable and the independent (explanatory) variable.
PLEASE NOTE: I am selling the service provided in summarising this chapter and not the
intellectual property provided.