Prepared by: Dr. Hanaa Bayomi
Updated by: Prof. Abeer ElKorany
Lecture 2: Linear Regression
LINEAR REGRESSION WITH ONE VARIABLE
➢ Model Representation
➢ Cost Function
➢ Gradient Descent
MODEL REPRESENTATION
[Scatter plot: housing price (dependent variable, y-axis) versus size in feet² (independent variable, x-axis), e.g. predicting the price of a 1250 ft² house.]
Supervised Learning: the "right answers" (labeled data) are given.
Regression: predict a continuous-valued output (price).
MODEL REPRESENTATION
Example (notation, using the housing training set):
x(1) = 2104 (size in the first example), y(2) = 232 (price in the second example), x(4) = 852
(x, y) — one training example (one row)
(x(i), y(i)) — the i-th training example
MODEL REPRESENTATION
The job of a learning algorithm is to take the training set and output a function, usually denoted by lowercase h, where h stands for hypothesis.

Training set → Learning algorithm → h,  and then  x → h → y

The job of the hypothesis function is to take a value of x and output an estimated value of y. So h is a function that maps from x's to y's.
MODEL REPRESENTATION
How do we represent h ?
[Scatter plot of training examples (marked ×) with a fitted straight line.]
Linear Equations

Y = θ0 + θ1X

θ1 = slope = change in Y / change in X (ΔY/ΔX)
θ0 = Y-intercept

Linear regression with one variable: univariate linear regression.
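As a minimal sketch (not from the slides), this line can be written as a small Python prediction function; the parameter values used below are purely illustrative:

import numpy as np

# Univariate hypothesis h_theta(x) = theta0 + theta1 * x
def h(x, theta0, theta1):
    """Predict y from x using intercept theta0 and slope theta1."""
    return theta0 + theta1 * x

# Illustrative parameter values only (not from the lecture data):
print(h(1416, 50.0, 0.13))  # predicted price (in $1000's) for a 1416 ft^2 house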
Types of Regression Models
➢ Positive linear relationship
➢ Negative linear relationship
➢ Relationship not linear
➢ No relationship
COST FUNCTION
▪ The cost function lets us figure out how to fit the best possible straight line to our data.
▪ How do we choose the θi's?
Scatter plot
▪ 1. Plot of All (Xi, Yi) Pairs
▪ 2. Suggests How Well Model Will Fit
[Scatter plot of the (Xi, Yi) pairs.]
Thinking Challenge
How would you draw a line through the points?
How do you determine which line ‘fits best’?
[Scatter plot of the data points, before any line is drawn.]
[Candidate line: slope changed, intercept unchanged.]
[Candidate line: slope unchanged, intercept changed.]
[Candidate line: slope changed, intercept changed.]
Training Set

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …
Hypothesis: hθ(x) = θ0 + θ1x

θ's: parameters (weights)

How do we choose the θ's?

[Three example plots showing the lines produced by different choices of θ0 and θ1.]
Least Squares

▪ 1. 'Best fit' means the difference between the actual Y values and the predicted Y values is a minimum. So square the errors!

$$\sum_{i=1}^{m} \left(Y_i - h_\theta(x_i)\right)^2 = \sum_{i=1}^{m} \hat{\varepsilon}_i^{\,2}$$
▪ 2. Least squares (LS) minimizes the Sum of the Squared Errors (SSE).
Least Squares Graphically

LS minimizes $\sum_{i=1}^{n} \hat{\varepsilon}_i^{\,2} = \hat{\varepsilon}_1^{\,2} + \hat{\varepsilon}_2^{\,2} + \hat{\varepsilon}_3^{\,2} + \hat{\varepsilon}_4^{\,2}$

Observed value: $Y_2 = \theta_0 + \theta_1 X_2 + \hat{\varepsilon}_2$    Fitted line: $h_\theta(x_i) = \theta_0 + \theta_1 X_i$

[Plot: the fitted line with the vertical residuals ε̂1 … ε̂4 from each data point down to the line.]
Least-Squared-Errors Linear Regression

COST FUNCTION

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

where the $h_\theta(x^{(i)})$ are the predictions on the training set and the $y^{(i)}$ are the actual values.

Goal: minimize $J(\theta_0, \theta_1)$ over $\theta_0, \theta_1$.
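The cost can be computed directly. Below is a minimal sketch, assuming NumPy arrays built from the housing training set shown earlier; the θ values passed in are illustrative only:

import numpy as np

def cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(x)
    predictions = theta0 + theta1 * x          # h_theta(x) for every training example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Housing training set from the slides
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
print(cost(0.0, 0.2, x, y))  # illustrative parameter choice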
Cost function visualization with One parameter
Consider a simple case of the hypothesis by setting θ0 = 0, so h becomes:

hθ(x) = θ1x

Each value of θ1 corresponds to a different hypothesis, since it is the slope of the line, and therefore to a different line passing through the origin (the y-intercept θ0 is nulled out).

[Plots of hθ(x) = θ1x for θ1 = 2, θ1 = 1, and θ1 = 0.5, each with its corresponding cost J(θ1).]
[Paired plots: as the coefficient θ1 changes, the hypothesis line (left) and the corresponding point on the cost function J(θ1) (right) change together.]
Cost function visualization with one parameter

Evaluating the simple hypothesis at θ1 = 2, θ1 = 1, and θ1 = 0.5 gives one value of J(θ1) for each slope. Plotting more points like this gives the following graph of the cost function as a function of the parameter θ1; each value of θ1 on the plot corresponds to a different hypothesis.
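These points on J(θ1) can be computed in a few lines. This sketch assumes the toy training set (1,1), (2,2), (3,3) suggested by the slide plots:

import numpy as np

# Toy training set assumed from the slide plots: y = x exactly
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

# With theta0 = 0, J(theta1) = (1/2m) * sum((theta1 * x_i - y_i)^2)
for theta1 in (0.5, 1.0, 2.0):
    J = np.sum((theta1 * x - y) ** 2) / (2 * m)
    print(theta1, J)   # J(1) = 0 is the minimum; J(0.5) and J(2) are larger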
Cost function visualization with One parameter
What is the optimal value of θ1 that minimizes J(θ1)?
It is clear that the best value is θ1 = 1, since J(θ1) = 0, which is the minimum.
How do we find the best value for θ1? Plotting? Not practical, especially in high dimensions.
The solution:
1. Analytical solution: not practical for large datasets (a minimal sketch follows this list).
2. Numerical solution: e.g., gradient descent.
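As a sketch of option 1, the analytical (normal-equation) solution can be computed directly with NumPy; the housing numbers below come from the training-set slide, and the use of the pseudo-inverse is an implementation choice, not part of the lecture:

import numpy as np

# Normal equation: theta = (X^T X)^(-1) X^T y, with a leading column of ones for theta0
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])

X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]
theta = np.linalg.pinv(X.T @ X) @ (X.T @ y)    # [theta0, theta1]
print(theta)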
Plotting the cost function J(θ0, θ1)
Cost function visualization with θ0, θ1
COST FUNCTION (RECAP)
Gradient Descent
GRADIENT DESCENT
➢ An iterative solution, not only for linear regression; it is used all over the place in machine learning.
➢ Objective: minimize any function (here, the cost function J).
PROBLEM SETUP
Imagine that this is the landscape of a grassy park, and you want to get to the lowest point in the park as rapidly as possible.

[Surface plot of J(θ0, θ1): red means high cost, blue means low cost. Starting from an initial point, repeated downhill steps lead to a local minimum.]

[With a different starting point, gradient descent can end up at a different (new) local minimum.]
Gradient descent Algorithm (LMS)

Repeat until convergence (simultaneously for j = 0 and j = 1):
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$
where α is the learning rate.
EXAMPLE: J(θ1)

$$\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)$$

Positive slope: the derivative is positive, so θ1 := θ1 − α·(positive number) and θ1 decreases, moving toward the minimum.
Negative slope: the derivative is negative, so θ1 := θ1 − α·(negative number) and θ1 increases, again moving toward the minimum.
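A minimal sketch of this one-parameter update; the numbers are illustrative, not taken from the slides:

# theta1 := theta1 - alpha * dJ/dtheta1
def gradient_step(theta1, alpha, dJ_dtheta1):
    return theta1 - alpha * dJ_dtheta1

print(gradient_step(2.0, 0.1, +3.0))   # positive slope: theta1 decreases (2.0 -> 1.7)
print(gradient_step(0.2, 0.1, -3.0))   # negative slope: theta1 increases (0.2 -> 0.5)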
QUESTION
What do you think one step of gradient descent will do?
Change of Learning rate value

If α is too small, gradient descent can be slow.
If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
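The effect of α can be seen numerically. This sketch assumes the same toy training set (1,1), (2,2), (3,3) with θ0 = 0; the α values are illustrative only:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

for alpha in (0.01, 0.1, 0.5):
    theta1 = 3.0                                   # start far from the minimum at theta1 = 1
    for _ in range(20):
        grad = np.sum((theta1 * x - y) * x) / m    # dJ/dtheta1
        theta1 -= alpha * grad
    print(alpha, theta1)  # 0.01: slow progress; 0.1: converges; 0.5: overshoots and diverges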
Local minimum
Gradient descent can converge to a local minimum, even with the
learning rate α fixed.
As we approach a local minimum, gradient descent will
automatically take smaller steps. So, no need to decrease α
over time.
GRADIENT DESCENT FOR A LINEAR REGRESSION

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j}\, \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x_i) - Y_i\right)^2 = \frac{\partial}{\partial \theta_j}\, \frac{1}{2m} \sum_{i=1}^{m} \left(\theta_0 + \theta_1 x_i - Y_i\right)^2$$

$$j = 0: \quad \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x_i) - Y_i\right)$$

$$j = 1: \quad \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x_i) - Y_i\right) \cdot x_i$$
G.D. FOR LINEAR REGRESSION

Repeat until convergence (simultaneous update):
$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x_i) - Y_i\right)$$
$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x_i) - Y_i\right) \cdot x_i$$
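Putting the two derivatives into the update rule gives batch gradient descent. Here is a minimal NumPy sketch; the learning rate, iteration count, and toy data are illustrative choices, not values from the lecture:

import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = theta0 + theta1 * x - y       # h_theta(x_i) - Y_i
        grad0 = np.sum(errors) / m             # dJ/dtheta0
        grad1 = np.sum(errors * x) / m         # dJ/dtheta1
        theta0 -= alpha * grad0                # simultaneous update
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(gradient_descent(x, y))  # approaches (0.0, 1.0)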
Linear Regression Using TensorFlow
1-D Data Example
Data Preparation
import numpy as np

num_of_points = 100  # Generate 100 data points
points = []
for i in range(num_of_points):
    x1 = np.random.normal(0.0, 0.55)
    y1 = x1 * 0.1 + 0.3 + np.random.normal(0.0, 0.01)
    points.append([x1, y1])

x_data = [v[0] for v in points]
y_data = [v[1] for v in points]
Draw Data
import matplotlib.pyplot as plt
plt.plot(x_data, y_data, 'ro', label='Original data')
plt.legend()
plt.show()
Original Data
Variables and Nodes
Preparation
import tensorflow as tf
# Initialize weight "W" and bias "b"
W = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
b = tf.Variable(tf.zeros([1]))
y = W * x_data + b
#Define Loss function as Mean of Squared Error
loss = tf.reduce_mean(tf.square(y - y_data))
#Create Optimizer class to minimize Losses
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)
#initialize TensorFlow Variables (always)
init = tf.global_variables_initializer()
Execute TensorFlow Graph
# Start a TensorFlow Session and carry out Variable initialization
sess = tf.Session()
sess.run(init)

# Carry out 16 iterations
for step in range(16):
    sess.run(train)

    # Draw original data
    plt.plot(x_data, y_data, 'ro', label='Original data')
    # Draw predicted data (using the weight and bias learned so far)
    plt.plot(x_data, sess.run(W) * x_data + sess.run(b), label='Fitted line')
    plt.xlabel('x')
    plt.xlim(-2, 2)
    plt.ylim(0.1, 0.6)
    plt.ylabel('y')
    plt.legend()
    plt.show()

    # Print the updated weight, bias, and loss value after the current training iteration
    print(step, sess.run(W), sess.run(b), sess.run(loss))
[Plots of the fitted line after each of the 16 training iterations, showing it converging onto the original data.]
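For reference, here is a minimal sketch of the same training loop written for TensorFlow 2 (eager execution). It assumes the x_data and y_data lists from the data-preparation step above; this is an alternative formulation, not the version used in the lecture:

import numpy as np
import tensorflow as tf

# x_data and y_data as built in the data-preparation step (converted to float32 arrays)
x = np.asarray(x_data, dtype=np.float32)
y = np.asarray(y_data, dtype=np.float32)

W = tf.Variable(tf.random.uniform([1], -1.0, 1.0))
b = tf.Variable(tf.zeros([1]))
optimizer = tf.optimizers.SGD(learning_rate=0.5)

for step in range(16):
    with tf.GradientTape() as tape:
        y_pred = W * x + b
        loss = tf.reduce_mean(tf.square(y_pred - y))
    grads = tape.gradient(loss, [W, b])
    optimizer.apply_gradients(zip(grads, [W, b]))
    print(step, W.numpy(), b.numpy(), loss.numpy())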