Lec 3-5 (Function Approximation)
CS F425
Dr. Bharat Richhariya
Department of CSIS, BITS Pilani, Pilani Campus
Lecture 3: Approximating a function (Regression)
Linear Regression
• Regression refers to a set of methods for modeling the relationship
between one or more independent variables and a dependent variable.
o The purpose of regression is most often to characterize the relationship
between the inputs and outputs.
o Machine learning, on the other hand, is most often concerned with prediction.
Outline
• Basic Elements of Linear Regression
• Gradient Descent
• Making Predictions with the Learned Model
Linear regression: linear model
• The linearity assumption says that the target (price) can be expressed as a
weighted sum of the features (area and age):
$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$
• $w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called weights, and $b$ is called the bias (also known as the offset or intercept).
• The weights determine the influence of each feature on our prediction.
• The bias just says what value the predicted price should take when all of the
features take value 0.
• The equation above is an affine transformation of input features, which is
characterized by a linear transformation of features via weighted sum, combined
with a translation via the added bias.
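To make the affine model concrete, here is a minimal Python sketch; the weight and bias values are illustrative placeholders assumed for this example, not parameters learned from data.

```python
# Hypothetical parameter values, chosen only for illustration
w_area, w_age = 2.0, -0.5
b = 100.0

def predict_price(area, age):
    """Affine transformation: weighted sum of the features plus the bias."""
    return w_area * area + w_age * age + b

print(predict_price(area=120.0, age=10.0))  # 2*120 - 0.5*10 + 100 = 335.0
```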
Linear regression: linear model
• Given a dataset, our goal is to choose the weights 𝒘 and the bias 𝑏 such that, on
average, the predictions made by our model best fit the true observations.
Linear regression: Loss function
• To think about how to fit data with our model, we need to determine a measure
of fitness.
• The loss function quantifies the distance between the real and predicted value of
the target.
o The loss will be a non-negative number where smaller values are better.
o Perfect predictions incur a loss of 0.
• The most popular loss function in regression problems is the squared error:
$$l^{(i)}(\boldsymbol{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
• The constant 1/2 makes no difference but will prove to be convenient, cancelling out when we take the derivative of the loss.
Linear regression: Loss function
• To measure model quality on the entire dataset of $n$ examples, we average the losses over the training set; this empirical error is a function of the model parameters only.
$$L(\boldsymbol{w}, b) = \frac{1}{n}\sum_{i=1}^{n} l^{(i)}(\boldsymbol{w}, b) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\left(\boldsymbol{w}^\top \boldsymbol{x}^{(i)} + b - y^{(i)}\right)^2$$
• When training the model, we seek the parameters that minimize the total loss:
$$\boldsymbol{w}^{*}, b^{*} = \operatorname*{argmin}_{\boldsymbol{w}, b} L(\boldsymbol{w}, b)$$
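A minimal NumPy sketch of the empirical loss above, assuming the features are stored row-wise in an array X and the targets in a vector y (both hypothetical names):

```python
import numpy as np

def empirical_loss(w, b, X, y):
    """Average squared-error loss L(w, b) over the n training examples."""
    y_hat = X @ w + b                       # predictions w^T x^(i) + b
    return 0.5 * np.mean((y_hat - y) ** 2)  # mean of (1/2) * (y_hat - y)^2

# Tiny made-up example: two houses described by (area, age)
X = np.array([[120.0, 10.0],
              [80.0, 30.0]])
y = np.array([335.0, 250.0])
print(empirical_loss(np.array([2.0, -0.5]), 100.0, X, y))  # 6.25
```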
Linear regression: analytic solution
• Linear regression can be solved analytically by applying a simple formula:
• Subsume the bias 𝑏 into the parameter 𝒘 by appending a column to the design matrix
(𝑿) consisting of all ones.
• Then our prediction problem is to minimize $\lVert \boldsymbol{y} - \boldsymbol{X}\boldsymbol{w} \rVert^2$.
• There is just one critical point on the loss surface, and it corresponds to the minimum of the loss over the entire domain.
• Taking the derivative of the loss with respect to 𝒘 and setting it equal to zero yields the
analytic solution:
$$\boldsymbol{w}^{*} = \left(\boldsymbol{X}^\top \boldsymbol{X}\right)^{-1} \boldsymbol{X}^\top \boldsymbol{y}$$
• The requirement of an analytic solution is so restrictive that it would exclude all exciting
aspects of deep learning.
• Simple problems like linear regression may admit analytic solutions, but we should not get used to such good fortune!
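A sketch of the closed-form fit, assuming X and y are NumPy arrays as in the earlier example. It uses a least-squares solver instead of inverting X^T X explicitly, which yields the same solution but is numerically more stable:

```python
import numpy as np

def fit_closed_form(X, y):
    """Minimize ||y - Xw||^2 with the bias absorbed into w."""
    ones = np.ones((X.shape[0], 1))
    X_aug = np.hstack([X, ones])                       # append a column of ones for the bias
    # Same solution as w* = (X^T X)^{-1} X^T y, solved more stably
    w_star, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return w_star[:-1], w_star[-1]                     # (weights, bias)

w_hat, b_hat = fit_closed_form(X, y)  # X, y from the earlier loss sketch
```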
Linear regression: gradient descent
• In cases where we cannot solve the models analytically, we can still train models
effectively in practice.
• The key technique for optimizing nearly any deep learning model is called
gradient descent.
• Gradient descent iteratively reduces the error by updating the parameters in the
direction that incrementally lowers the loss function.
[Figure: gradient descent stepping downhill along the loss curve from a starting point]
• With a randomly sampled minibatch $\mathcal{B}$ and learning rate $\eta$, each step updates the parameters as:
$$\boldsymbol{w} \leftarrow \boldsymbol{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\boldsymbol{w}}\, l^{(i)}(\boldsymbol{w}, b) = \boldsymbol{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \boldsymbol{x}^{(i)} \left(\boldsymbol{w}^\top \boldsymbol{x}^{(i)} + b - y^{(i)}\right)$$
$$b \leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{b}\, l^{(i)}(\boldsymbol{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\boldsymbol{w}^\top \boldsymbol{x}^{(i)} + b - y^{(i)}\right)$$
• The values of the batch size and learning rate are manually pre-specified and not typically learned through model training.
o Such parameters, which are tunable but not updated in the training loop, are called hyperparameters; a minimal training-loop sketch follows below.
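The update rules above translate almost line for line into code. This sketch assumes the whole dataset fits in memory as NumPy arrays; the learning rate, batch size, and number of epochs are hand-picked hyperparameter choices, exactly the kind of values that are pre-specified rather than learned.

```python
import numpy as np

def train_minibatch_sgd(X, y, lr=0.03, batch_size=10, num_epochs=5, seed=0):
    """Minibatch SGD for linear regression (lr, batch_size, num_epochs are hyperparameters)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(num_epochs):
        idx = rng.permutation(n)                         # shuffle examples each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            err = X[batch] @ w + b - y[batch]            # w^T x^(i) + b - y^(i)
            w -= (lr / len(batch)) * (X[batch].T @ err)  # update step for w
            b -= (lr / len(batch)) * err.sum()           # update step for b
    return w, b
```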
Linear regression: gradient descent
• Linear regression happens to be a learning problem where there is only one minimum
over the entire domain.
• For more complicated models, like deep networks, the loss surfaces contain many
minima.
• Deep learning practitioners seldom struggle to find parameters that minimize the loss on
training sets.
• The more formidable task is to find parameters that achieve low loss on data that we have not seen before, a challenge called generalization.
Making Predictions with the Learned Model
• Given the learned linear regression model $\hat{\boldsymbol{w}}^\top \boldsymbol{x} + \hat{b}$, we can estimate the price of a new house given its area $x_1$ and age $x_2$.
o Estimating targets given features is commonly called prediction or
inference.
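Prediction is then a single affine evaluation; the names w_hat and b_hat below refer to the hypothetical parameters returned by the training sketches above.

```python
import numpy as np

def predict(w_hat, b_hat, x_new):
    """Estimate the target for a new example, e.g. x_new = [area, age]."""
    return float(np.dot(w_hat, x_new) + b_hat)

# Example: estimated price of a house with area 100 and age 5
# predict(w_hat, b_hat, np.array([100.0, 5.0]))
```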
Logistic Regression
Regression vs. Classification
• Regression estimates a continuous value
• Classification predicts a discrete category
Logistic regression
• This is just like linear regression, except that the value $y$ we want to predict takes on only a small number of discrete values.
• For now, we will focus on the binary classification problem in which y
can take on only two values: 0 and 1.
• For instance, if we are trying to build a spam classifier for email, then $x^{(i)}$ may be some features of a piece of email, and $y$ may be 1 if it is a piece of spam mail, and 0 otherwise.
Logistic regression
• We could approach the classification problem ignoring the fact that y
is discrete-valued, and use our old linear regression algorithm to try
to predict y given x.
• However, this method performs very poorly.
• Intuitively, it also doesn't make sense to consider $\hat{y}$ values larger than 1 or smaller than 0 when we know that $y \in \{0, 1\}$.
Logistic regression
• Let's change the form of our hypothesis for the prediction $\hat{y}$.
• We will choose
$$\hat{y} = \sigma(\boldsymbol{\theta}^\top \boldsymbol{x})$$
• where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid (logistic) function, which maps any real value into the interval (0, 1).
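A minimal sketch of this hypothesis, assuming the bias is folded into theta via a constant feature in x; the 0.5 threshold used to turn the predicted value into a 0/1 label is a common convention, not part of the model itself.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    """Hypothesis: y_hat = sigma(theta^T x)."""
    return sigmoid(np.dot(theta, x))

def predict_label(theta, x, threshold=0.5):
    """Classify as 1 (e.g. spam) when y_hat exceeds the threshold."""
    return int(predict_proba(theta, x) >= threshold)
```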