Linear Regression with One Variable
Lecture 02 - 03
Mirza Mohammad Lutfe Elahi, Silvia Ahmed
CSE 445 Machine Learning ECE@NSU
Linear Regression
• Housing price data example
– Supervised learning regression problem
Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852 | 178
… | …
• What do we start with?
– Training set (this is your data set)
Training Set of Housing prices
Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852 | 178
… | …
• Notation (used throughout the course)
– m = number of training examples
– x's = input variables / features
– y's = output / "target" variable
• (x, y): a single training example
• (x^(i), y^(i)): the i-th training example
• i is an index into the training set (a short code sketch follows below)
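A minimal sketch (assuming NumPy; my own illustration, not from the slides) of how this training set and notation map onto code:

```python
import numpy as np

# Training set from the table above
x = np.array([2104, 1416, 1534, 852])   # inputs: size in feet^2
y = np.array([460, 232, 315, 178])      # targets: price in 1000's of $

m = len(x)          # m = number of training examples (here 4)
i = 0               # Python is 0-indexed, so (x[0], y[0]) corresponds to (x^(1), y^(1))
print(m, x[i], y[i])
```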
Training Set of Housing prices
[Scatter plot: housing price (in 1000s of $) vs. size (feet²) for the training set]
Model Representation
• With our training set defined - how do we use it?
– Take training set
– Pass into a learning algorithm
– Algorithm outputs a function, denoted h (h stands for hypothesis)
• This function takes an input (e.g. size of new house)
• Tries to output the estimated value of y
How do we Represent h?
• Hypothesis function h -
– hθ(x) = θ0 + θ1x (shorthand: h(x))
[Scatter plot of the training data with the fitted line h(x) = θ0 + θ1x: price (in 1000s of $) vs. size (feet²)]
Model Representation
• What does this mean?
– Means y is a linear function of x.
– θi are parameters
• θ0 is the zero condition (the intercept)
• θ1 is the gradient (the slope)
• This kind of function is a linear regression with one
variable
– Also called Univariate Linear Regression
• So in summary
– A hypothesis takes in some variable
– Uses parameters determined by a learning system
– Outputs a prediction based on that input
Hypothesis Function
hθ(x) = θ0 + θ1x
• This is the equation of a straight line.
• We set values for θ0 and θ1 to get our estimated output hθ(x).
• hθ is a function that is trying to map input data (the x's) to our
output data (the y's).
• Example:
Input x | Output y
0 | 4
1 | 7
2 | 7
3 | 8
• Let θ0 = 2 and θ1 = 2; then hθ(x) = 2 + 2x
• For input x = 1, the hypothesis gives hθ(1) = 4, which is off from the actual y = 7 by 3 (a quick check in code follows below).
• We need to try out various values of θ0 and θ1 to find the best possible fit.
• Or the most representative "straight line" through the data
points mapped on the x-y plane.
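A quick check of this example in Python (a sketch using the assumed values above):

```python
def h(x, theta0=2.0, theta1=2.0):
    """Hypothesis h_theta(x) = theta0 + theta1 * x, with the example parameters."""
    return theta0 + theta1 * x

xs = [0, 1, 2, 3]
ys = [4, 7, 7, 8]
for x, y in zip(xs, ys):
    print(x, h(x), y - h(x))   # for x = 1: h(1) = 4.0, which is off from y = 7 by 3
```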
Cost Function
Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852 | 178
… | …
• Hypothesis: hθ(x) = θ0 + θ1x
• θi’s: parameters
How to choose θi’s?
Cost Function
• A cost function lets us figure out how to fit the best
straight line to our data.
• We can measure the accuracy of our hypothesis
function by using a cost function.
• Choosing values for θi (parameters)
– Different values give different functions.
Cost Function
hθ(x) = θ0 + θ1x
Three example parameter choices (plots of the three lines omitted):
– θ0 = 1.5, θ1 = 0: h(x) = 1.5 (a horizontal line)
– θ0 = 0, θ1 = 0.5: h(x) = 0.5x (a line through the origin)
– θ0 = 1, θ1 = 0.5: h(x) = 1 + 0.5x
Linear Regression – Cost Function
[Figure: training data plotted on x-y axes with a candidate line defined by parameters θ0, θ1]
• Based on our training set we want to generate parameters
which make the straight line
– Choose these parameters so that hθ(x) is close to y for our training examples
• Basically, feed the x's in the training set through hθ(x) to get outputs that are as close to the actual y values as possible
• Think of hθ(x) as a "y imitator": it tries to convert each x into y, and since we already have the true y for every training example, we can evaluate how well hθ(x) does this.
Cost Function
• To formalize this:
– We want to solve a minimization problem
– Minimize the difference between hθ(x) and y for each/every example:
• minimize (hθ(x) − y)²
– Sum this over the training set (the resulting cost function is written out below)
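In full, this is the standard squared-error cost function and its minimization goal:

$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\Big(h_\theta\big(x^{(i)}\big) - y^{(i)}\Big)^2, \qquad \min_{\theta_0,\,\theta_1} J(\theta_0, \theta_1)$$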
Cost Function
• Minimize squared difference between predicted house price
and actual house price
– 1/2m
• 1/m means we take the average over the training set
• 1/2m: the extra 2 makes the math a bit easier and doesn't change the parameters we find (the minimizer of half a function is still the minimizer of the function)
– Minimizing this means we get the values of θ0 and θ1 that, on average, give the smallest deviation of hθ(x) from y when we use those parameters in our hypothesis
• More cleanly, this is a cost function
• And we want to minimize this cost function
– Because of the summation term, the cost function inherently looks at ALL the data in the training set at any time (a short code sketch follows this list)
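A minimal sketch of this cost function in code (assuming NumPy; the parameter values below are just an illustration, not a fitted result):

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """Squared-error cost J(theta0, theta1) = (1 / 2m) * sum((h(x_i) - y_i)^2)."""
    m = len(x)
    predictions = theta0 + theta1 * x       # h_theta(x^(i)) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
print(compute_cost(x, y, -40.0, 0.25))      # cost of the candidate line h(x) = -40 + 0.25x
```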
Cost Function
• Let's consider some intuition about the cost function and why we want to use it
– Minimizing the cost function determines the parameters
– Those parameter values determine how your hypothesis behaves; different values give different hypotheses
• Simplified hypothesis
– Assume θ0 = 0, so hθ(x) = θ1x
[Plot omitted: hθ(x) = θ1x, a line through the origin]
Cost Function
• Cost function and goal here are very similar to when we
have θ0, but with a simpler parameter
– Simplified hypothesis makes visualizing cost function J(θ) a bit easier
• So every such hypothesis passes through (0, 0)
• Two key functions we want to understand
– hθ(x)
• Hypothesis is a function of x - function of what the size of the house is
– J(θ1)
• Is a function of the parameter θ1
Cost Function
hθ(x) = θ1x
[Left plot: hθ(x) for fixed θ1 = 1, as a function of x; the line passes through every training point.
Right plot: J(θ1) as a function of the parameter θ1; at θ1 = 1, J(θ1) = 0.]
Cost Function
hθ(x) = θ1x
[Left plot: hθ(x) = 0.5x for fixed θ1 = 0.5, as a function of x; the line no longer passes through the training points.
Right plot: J(θ1) as a function of the parameter θ1; at θ1 = 0.5, J(θ1) ≈ 0.58.]
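As a check, assuming the three training points in these plots are (1, 1), (2, 2) and (3, 3) (the standard version of this example), the value 0.58 comes from:

$$J(0.5) = \frac{1}{2 \cdot 3}\Big[(0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2\Big] = \frac{0.25 + 1 + 2.25}{6} = \frac{3.5}{6} \approx 0.58$$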
Cost Function
hθ(x) = θ1x
[Left plot: hθ(x) = θ1x for θ1 = 0, 0.5 and 1.
Right plot: J(θ1) traced over a range of θ1 values; a bowl-shaped curve whose minimum is at θ1 = 1.]
Deeper Insight - Cost Function
• Using our original, more complex hypothesis with its two parameters θ0 and θ1,
– So cost function is
• J(θ0, θ1)
• Example,
– Say
• θ0 = 50
• θ1 = 0.06
– Previously we plotted our cost function by plotting
• θ1 vs J(θ1)
Deeper Insight - Cost Function
– Now we have two parameters
• Plot becomes a bit more complicated
• Generates a 3D surface plot whose axes are
– x = θ1
– z = θ0
– y = J(θ0, θ1)
• The height (y) indicates the value of the cost function, so we look for where y is at a minimum
Deeper Insight - Cost Function
• Instead of a surface plot we can use contour figures/plots
– A set of ellipses in different colors
– Each colored ellipse is a set of points with the same value of J(θ0, θ1); they sit at different locations because θ0 and θ1 vary along the ellipse
– Imagine a bowl-shaped function coming out of the screen, so that the centre of the concentric ellipses is the bottom of the bowl (the minimum)
Deeper Insight - Cost Function
• Each point on the contour plot (like the green one in the original figure) represents a pair of parameter values for θ0 and θ1
– Our example here put the values at
• θ0 = ~800
• θ1 = ~-0.15
– Not a good fit
• i.e. these parameters give a value on our contour plot far from the center
Deeper Insight - Cost Function
– If we have
• θ0 = ~360
• θ1 = 0
• This gives a better hypothesis, but still not great - not in the center of
the contour plot
Deeper Insight - Cost Function
Finally we find the minimum, which gives the best hypothesis
Deeper Insight - Cost Function
• Doing this by eye/hand is a painful task
What we really want is an efficient algorithm for finding the values of θ0 and θ1 that minimize the cost function
Gradient Descent Algorithm
• Minimize cost function J
• Gradient descent
– Used all over machine learning for minimization
• Start by looking at a general J() function
• Problem
– We have J(θ0, θ1)
– We want to get min J(θ0, θ1)
• Gradient descent applies to more general functions
– J(θ0, θ1, θ2, …, θn)
– min J(θ0, θ1, θ2, …, θn)
Gradient Descent Algorithm
How does it work?
• Start with initial guesses
– Start at (0, 0) (or any other value)
– Keep changing θ0 and θ1 a little bit to try and reduce J(θ0, θ1)
• Each time you change the parameters, you take the step (in the direction of the negative gradient) that reduces J(θ0, θ1) the most
• Repeat
• Do so until you converge to a local minimum
• Has an interesting property
– Where you start can determine which local minimum you end up in
Gradient Descent Algorithm
• Here we can see one initialization point led to one local
minimum
• The other led to a different one
Gradient Descent Algorithm
• What does this all mean?
– Update θj by subtracting from it α times the partial derivative of the cost function with respect to θj (the rule is written out below)
• Notation
– := Denotes assignment
– α (alpha)
• Is a number called the learning rate
• Controls how big a step you take
– If α is big have an aggressive gradient descent
– If α is small take tiny steps
• Derivative term: the slope of the cost function at the current point (explored on the following slides)
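Written out, the update rule being described here is the standard gradient descent step, repeated until convergence:

$$\theta_j := \theta_j - \alpha\,\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad \text{for } j = 0 \text{ and } j = 1$$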
Gradient Descent Algorithm
• There is a subtlety about how this gradient descent algorithm is implemented (a short sketch follows this list)
– Do this for θ0 and θ1
– For j = 0 and j = 1 means we simultaneously update both
– How do we do this?
• Compute the right hand side for both θ0 and θ1
– So we need a temp value
• Then, update θ0 and θ1 at the same time
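A minimal, self-contained sketch of the simultaneous-update pattern (the cost function and its partial derivatives here are placeholders assumed purely for illustration):

```python
# Assumed example cost: J(t0, t1) = (t0 - 1)^2 + (t1 - 2)^2, with minimum at (1, 2).
def dJ_dtheta0(theta0, theta1):
    return 2 * (theta0 - 1)

def dJ_dtheta1(theta0, theta1):
    return 2 * (theta1 - 2)

alpha = 0.1                      # learning rate
theta0, theta1 = 0.0, 0.0        # initial guess
for _ in range(200):
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)   # compute BOTH right-hand sides first...
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    theta0, theta1 = temp0, temp1                          # ...then update simultaneously
print(theta0, theta1)            # approaches (1.0, 2.0)
```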
Gradient Descent Algorithm
• To understand gradient descent, we'll return to a simpler
function where we minimize one parameter to help explain the
algorithm in more detail
– where θ1 is a real number
• Two key terms in the algorithm
– Alpha
– Derivative term
• Notation nuances
– Partial derivative vs. derivative
• Use a partial derivative when the function has multiple variables but we differentiate with respect to only one of them
• Use an ordinary derivative when the function has a single variable
Gradient Descent Algorithm
• Derivative term
• The derivative says:
– Take the tangent to J(θ1) at the current point and look at the slope of that line
– If the slope is negative (we are to the left of the minimum), then since α is always positive, θ1 := θ1 − α × (negative number) increases θ1, and J(θ1) moves to a smaller value
– Similarly, if the slope is positive (we are to the right of the minimum), θ1 decreases, and again J(θ1) gets smaller
Gradient Descent Algorithm
Consider the update θ1 := θ1 − α (d/dθ1) J(θ1), with θ1 ∈ ℝ:
• Positive slope: (d/dθ1) J(θ1) ≥ 0, so θ1 := θ1 − α × (positive number), and θ1 decreases
• Negative slope: (d/dθ1) J(θ1) ≤ 0, so θ1 := θ1 − α × (negative number), and θ1 increases
Gradient Descent Algorithm
θ1 := θ1 − α (d/dθ1) J(θ1), for θ1 ∈ ℝ
• If α is too small, gradient descent can be slow.
• If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
Gradient Descent Algorithm
• If θ1 is already at a local optimum, the slope of J there is zero: (d/dθ1) J(θ1) = 0
– So the update θ1 := θ1 − α × 0 leaves θ1 unchanged
Gradient Descent Algorithm
• Gradient descent can converge to a local minimum, even with
the learning rate α fixed.
θ1 := θ1 − α (d/dθ1) J(θ1), for θ1 ∈ ℝ
[Plot: J(θ1) with successive gradient descent steps; the steps shrink as the slope flattens near the minimum]
• As we approach a local minimum, gradient descent will
automatically take smaller steps. So, no need to decrease α
over time.
Gradient Descent Algorithm
repeat until convergence {
update θ0 and θ1 (simultaneously)
}
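Written out for linear regression (these follow from differentiating the squared-error cost J with respect to each parameter):

$$\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\Big(h_\theta\big(x^{(i)}\big) - y^{(i)}\Big)$$

$$\theta_1 := \theta_1 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\Big(h_\theta\big(x^{(i)}\big) - y^{(i)}\Big)\,x^{(i)}$$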
“Batch” Gradient Descent
“Batch”: Each step of gradient descent uses all the training
examples
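Putting the pieces together, here is a minimal batch gradient descent sketch for this problem (my own illustration; the learning rate, iteration count and the rescaling of the sizes are assumptions chosen to keep the fixed α stable, not values from the slides):

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.1, iterations=2000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x.

    "Batch" because every single update uses all m training examples.
    """
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = (theta0 + theta1 * x) - y               # h(x^(i)) - y^(i) for all i
        temp0 = theta0 - alpha * error.sum() / m        # simultaneous update via temporaries
        temp1 = theta1 - alpha * (error * x).sum() / m
        theta0, theta1 = temp0, temp1
    return theta0, theta1

sizes = np.array([2104, 1416, 1534, 852], dtype=float) / 1000.0   # sizes rescaled to order 1
prices = np.array([460, 232, 315, 178], dtype=float)
print(batch_gradient_descent(sizes, prices))                      # fitted (theta0, theta1)
```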
Implementation
• House sizes: 2104, 1416, 1534, 852
• Hypothesis: hθ(x) = −40 + 0.25x
• The predictions, one per house:
– −40 × 1 + 0.25 × 2104
– −40 × 1 + 0.25 × 1416
– −40 × 1 + 0.25 × 1534
– −40 × 1 + 0.25 × 852
• In matrix form: add a column of 1's to the sizes to form the data matrix X, then
Prediction = Data Matrix × Parameters
(a code sketch follows below)
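A sketch of this in NumPy (illustrative):

```python
import numpy as np

sizes = np.array([2104, 1416, 1534, 852], dtype=float)
X = np.column_stack([np.ones_like(sizes), sizes])   # data matrix: a column of 1's, then the sizes
theta = np.array([-40.0, 0.25])                     # parameters of h(x) = -40 + 0.25x
print(X @ theta)                                    # predictions: [486., 314., 343.5, 173.]
```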
Implementation
• House sizes: 2104, 1416, 1534, 852
• We have 3 competing hypotheses:
1. hθ(x) = −40 + 0.25x
2. hθ(x) = 200 + 0.1x
3. hθ(x) = −150 + 0.4x
• Multiplying the data matrix X by a matrix whose columns are the three parameter vectors gives all the predictions at once:

1st hθ | 2nd hθ | 3rd hθ
   486 |    410 |    692
   314 |    342 |    416
   344 |    353 |    464
   173 |    285 |    191
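And the same idea in NumPy with a matrix of parameters, one column per hypothesis (again just a sketch):

```python
import numpy as np

sizes = np.array([2104, 1416, 1534, 852], dtype=float)
X = np.column_stack([np.ones_like(sizes), sizes])    # same data matrix as before
Theta = np.array([[-40.0, 200.0, -150.0],            # theta0 of hypotheses 1, 2, 3
                  [0.25,    0.1,    0.4]])           # theta1 of hypotheses 1, 2, 3
print(X @ Theta)   # 4x3 result: column j holds the predictions of hypothesis j (matches the table above up to rounding)
```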