Module 2
Regression
• Univariate regression problem (one output, real value)
• Fully connected network
ML Model
The model is just a mathematical equation; when the inputs are passed through this
equation, it computes the output, and this is termed inference.
The model equation also contains parameters. Different parameter values change the
outcome of the computation; the model equation describes a family of possible relations
between inputs and outputs, and the parameters specify the particular relationship.
When we train or learn a MODEL, we FIND parameters that describe the true relationship
between inputs and outputs.
Learning Algorithm
A learning algorithm takes a training set of input/output pairs and manipulates the parameters until the inputs predict their corresponding outputs as closely as possible.
Structured/Tabular Data
• For simplicity, we assume that both the input x and output y are vectors of a predetermined and fixed size and that the elements of each vector are always ordered in the same way.
Supervised ML Model
When we compute the prediction y from the input x, we call
this inference.
The model is just a mathematical equation with a fixed form. It
represents a family of different relations between the input
and the output. The model also contains parameters 𝜙. The
choice of parameters determines the particular relation
between input and output, so we should really write:
y = f[x, 𝜙]
Learning/training the Model
• When we talk about learning or training a model, we mean that we
attempt to find parameters 𝝓 that make sensible output predictions
from the input.
• We learn these parameters using a training dataset of I pairs of input and output examples {xᵢ, yᵢ}.
• We aim to select parameters that map each training input to its
associated output as closely as possible. We quantify the degree of
mismatch in this mapping with the loss L.
• This is a scalar value that summarizes how poorly the model predicts
the training outputs from their corresponding inputs for parameters
𝝓.
Loss Function
More properly, the loss function also depends on the training data {xᵢ, yᵢ}, so we should write L[{xᵢ, yᵢ}, 𝝓], but this is rather cumbersome
Deploying the Model
• If the loss is small after this minimization, we have found model
parameters that accurately predict the training outputs yᵢ from the training inputs xᵢ.
• After training a model, we must now assess its performance; we run
the model on separate test data to see how well it generalizes to
examples that it didn’t observe during training.
• If the performance is adequate, then we are ready to deploy the
model.
Example: Linear Model
Two parameters between 1D input and 1D output
Linear Model
y = f[x, 𝜙] = ϕ₀ + ϕ₁x
This model has two parameters 𝜙 = [ϕ₀, ϕ₁], where ϕ₀ is the y-intercept of the line and ϕ₁ is the slope.
Different choices for the y-intercept and slope result in different relations between input and output.
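A minimal sketch of this model in Python (NumPy assumed; the parameter names phi0 and phi1 are illustrative):

```python
import numpy as np

def linear_model(x, phi0, phi1):
    """1D linear model: y = phi0 + phi1 * x (phi0 = y-intercept, phi1 = slope)."""
    return phi0 + phi1 * x

# Two different parameter choices give two different lines.
x = np.linspace(0, 1, 5)
print(linear_model(x, phi0=0.5, phi1=1.0))
print(linear_model(x, phi0=-0.2, phi1=2.0))
```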
Data: Input/output pairs (I = 12)
Example: 1D Linear regression loss function
Loss function:
L[𝝓] = Σᵢ (ϕ₀ + ϕ₁xᵢ − yᵢ)²
“Least squares loss function”
[Figure: loss surface L[𝝓] over (ϕ₀, ϕ₁); the three circles mark the parameters of three example lines; the goal is the minimum of the surface]
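A sketch of the least squares loss under the same assumptions, with x and y holding the I training inputs and outputs:

```python
import numpy as np

def least_squares_loss(phi0, phi1, x, y):
    """Sum of squared differences between model predictions and training outputs."""
    predictions = phi0 + phi1 * x
    return np.sum((predictions - y) ** 2)
```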
Training
• The process of finding parameters that minimize the loss is termed
model fitting, training, or learning.
• The basic method is to choose the initial parameters randomly and
then improve them by “walking down” the loss function until we
reach the bottom.
• One way to do this is to measure the gradient of the surface at the
current position and take a step in the direction that is most steeply
downhill. Repeat this process until the gradient is flat and we can
improve no further.
Example: 1D Linear regression training
This technique is known as gradient descent
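A minimal gradient descent sketch for the 1D linear model under the least squares loss; the learning rate and step count are illustrative choices, not values from the slides:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, steps=1000):
    """Fit phi0 (intercept) and phi1 (slope) by walking downhill on the loss surface."""
    phi0, phi1 = np.random.randn(2)            # random initial parameters
    for _ in range(steps):
        residual = (phi0 + phi1 * x) - y       # prediction error for each example
        grad_phi0 = 2.0 * np.sum(residual)     # dL/dphi0
        grad_phi1 = 2.0 * np.sum(residual * x) # dL/dphi1
        phi0 -= alpha * grad_phi0              # step in the steepest downhill direction
        phi1 -= alpha * grad_phi1
    return phi0, phi1
```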
Shallow Neural Networks
Shallow neural networks
• 1D regression model is obviously limited
• Want to be able to describe input/output mappings that are not lines
• Want multiple inputs
• Want multiple outputs
• Shallow neural networks
• Flexible enough to describe arbitrarily complex input/output mappings
• Can have as many inputs as we want
• Can have as many outputs as we want
Shallow neural networks
• Example network, 1 input, 1 output
• Universal approximation theorem
• More than one output
• More than one input
• General case
• Number of regions
• Terminology
• Functions for three different choices of the ten parameters 𝜙. In each case, the input/output relation is
piecewise linear.
• However, the positions of the joints, the slopes of the linear regions between them, and the overall height
vary.
1D Linear Regression
Example shallow network
y = ϕ₀ + ϕ₁ a[θ₁₀ + θ₁₁x] + ϕ₂ a[θ₂₀ + θ₂₁x] + ϕ₃ a[θ₃₀ + θ₃₁x]
Activation function
a[•] is the activation function
Rectified Linear Unit
(particular kind of activation function)
ReLU[z] = max(0, z): zero when the pre-activation is negative, the identity when it is positive
Example shallow network
This model has 10 parameters: 𝝓 = {ϕ₀, ϕ₁, ϕ₂, ϕ₃, θ₁₀, θ₁₁, θ₂₀, θ₂₁, θ₃₀, θ₃₁}
• Represents a family of functions
• Parameters determine particular function
• Given parameters can perform inference (run equation; see the sketch below)
• Given training dataset
• Define loss function (least squares)
• Change parameters to minimize loss function
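A minimal sketch of this example network (three hidden units, ReLU activation) and its least squares loss; the array layout of theta and phi is an assumption for illustration:

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, z)

def shallow_net_1_3_1(x, theta, phi):
    """y = phi[0] + phi[1]*h1 + phi[2]*h2 + phi[3]*h3, with h_d = ReLU(theta[d,0] + theta[d,1]*x).

    theta: shape (3, 2) -- intercepts and slopes of the three linear functions
    phi:   shape (4,)   -- output bias and the three hidden-unit weights
    """
    h = relu(theta[:, 0:1] + theta[:, 1:2] * x)   # hidden units
    return phi[0] + phi[1:] @ h                   # weighted sum of hidden units

def least_squares_loss(theta, phi, x, y):
    """Sum of squared errors over the training pairs."""
    return np.sum((shallow_net_1_3_1(x, theta, phi) - y) ** 2)
```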
Example shallow network
Piecewise linear functions with three joints
Hidden units
Break down into two parts:
y = ϕ₀ + ϕ₁h₁ + ϕ₂h₂ + ϕ₃h₃
where:
h₁ = a[θ₁₀ + θ₁₁x],   h₂ = a[θ₂₀ + θ₂₁x],   h₃ = a[θ₃₀ + θ₃₁x]
Hidden units
1. Compute three linear functions θ₁₀ + θ₁₁x, θ₂₀ + θ₂₁x, θ₃₀ + θ₃₁x
2. Pass each through a ReLU function to compute the hidden units h₁, h₂, h₃
3. Weight the hidden units and sum them to produce the output y
Example shallow network = piecewise linear functions
1 “joint” per ReLU function
Activation pattern = which hidden units are activated (see the sketch below)
Shaded region:
• Unit 1 active
• Unit 2 inactive
• Unit 3 active
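A small sketch of reading off the activation pattern at a given input, reusing the illustrative theta layout from the sketch above:

```python
import numpy as np

def activation_pattern(x, theta):
    """Return a boolean per hidden unit: True where the ReLU is active (pre-activation > 0)."""
    pre_activations = theta[:, 0] + theta[:, 1] * x   # the three linear functions at input x
    return pre_activations > 0.0

# Example (illustrative parameter values): which units are active at x = 0.4?
theta = np.array([[0.2, 1.0], [0.1, -0.5], [-0.3, 2.0]])
print(activation_pattern(0.4, theta))   # e.g., unit 1 active, unit 2 inactive, unit 3 active
```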
Depicting neural networks
Each parameter multiplies its source and adds to its target
Shallow neural networks
• Example network, 1 input, 1 output
• Universal approximation theorem
• More than one output
• More than one input
• General case
• Number of regions
• Terminology
With 3 hidden units:
y = ϕ₀ + ϕ₁ a[θ₁₀ + θ₁₁x] + ϕ₂ a[θ₂₀ + θ₂₁x] + ϕ₃ a[θ₃₀ + θ₃₁x]
With D hidden units:
y = ϕ₀ + Σ_d ϕ_d a[θ_d0 + θ_d1 x]
With enough hidden units…
… we can describe any 1D function to arbitrary accuracy
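A vectorized sketch of the D-hidden-unit version (1D input and output); adding hidden units adds joints to the piecewise linear function:

```python
import numpy as np

def shallow_net_1_D_1(x, theta, phi):
    """y = phi[0] + sum_d phi[d] * ReLU(theta[d,0] + theta[d,1]*x), for D hidden units.

    theta: shape (D, 2), phi: shape (D + 1,)
    """
    h = np.maximum(0.0, theta[:, [0]] + theta[:, [1]] * x)   # hidden units
    return phi[0] + phi[1:] @ h
```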
Universal approximation theorem
“a formal proof that, with enough hidden units, a shallow neural network can describe any continuous function on a compact subset of ℝ^𝐷𝑖 to arbitrary precision”
Shallow neural networks
• Example network, 1 input, 1 output
• Universal approximation theorem
• More than one output
• More than one input
• General case
• Number of regions
• Terminology
Two outputs
• 1 input, 4 hidden units, 2 outputs
• Both outputs are different weighted sums of the same four hidden units:
h_d = a[θ_d0 + θ_d1 x],   y_1 = ϕ_10 + Σ_d ϕ_1d h_d,   y_2 = ϕ_20 + Σ_d ϕ_2d h_d
Shallow neural networks
• Example network, 1 input, 1 output
• Universal approximation theorem
• More than one output
• More than one input
• General case
• Number of regions
• Terminology
Two inputs
• 2 inputs, 3 hidden units, 1 output
• h_d = a[θ_d0 + θ_d1 x_1 + θ_d2 x_2],   y = ϕ_0 + ϕ_1h_1 + ϕ_2h_2 + ϕ_3h_3
Convex polygons
• Each hidden unit’s “joint” is now a line in the input plane; the activation patterns divide the plane into convex polygons
Question 1:
• For the 2D case, what if there were two outputs?
• If this is one of the outputs, what would the other one look like?
Shallow neural networks
• Example network, 1 input, 1 output
• Universal approximation theorem
• More than one output
• More than one input
• General case
• Number of regions
• Terminology
Arbitrary inputs, hidden units, outputs
• 𝐷𝑜 outputs, D hidden units, and 𝐷𝑖 inputs:
h_d = a[θ_d0 + Σ_i θ_di x_i],   y_j = ϕ_j0 + Σ_d ϕ_jd h_d
• e.g., three inputs, three hidden units, two outputs (see the sketch below)
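A general sketch in matrix form (NumPy assumed); the names Theta and Phi for the weight arrays are illustrative:

```python
import numpy as np

def shallow_network(x, theta0, Theta, phi0, Phi):
    """General shallow network: D_i inputs, D hidden units, D_o outputs.

    x:      (D_i,) input vector          theta0: (D,) hidden biases
    Theta:  (D, D_i) hidden weights      phi0:   (D_o,) output biases
    Phi:    (D_o, D) output weights
    """
    h = np.maximum(0.0, theta0 + Theta @ x)   # hidden units (ReLU of linear functions)
    return phi0 + Phi @ h                     # outputs are weighted sums of hidden units

# Parameter count: (D_i + 1) * D for the hidden layer plus (D + 1) * D_o for the output layer.
# e.g., 3 inputs, 3 hidden units, 2 outputs -> 4*3 + 4*2 = 20 parameters.
```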
Question 2:
• How many parameters does this model have?
Shallow neural networks
• Example network, 1 input, 1 output
• Universal approximation theorem
• More than one output
• More than one input
• General case
• Number of regions
• Terminology
Number of output regions
• In general, each output consists of 𝐷𝑖-dimensional convex polytopes
• With two inputs and three hidden units, we saw there were seven polygons:
Number of output regions
• In general, each output consists of 𝐷𝑖-dimensional convex polytopes
• How many?
[Figure: number of linear regions vs. number of parameters; highlighted point = 500 hidden units, or 51,001 parameters]
Number of regions:
• The number of regions created by D > 𝐷𝑖 hyperplanes in 𝐷𝑖 dimensions was proved by Zaslavsky (1975) to be:
Σ_{j=0}^{𝐷𝑖} (D choose j)
Binomial coefficients!
• How big is this? It’s greater than 2^𝐷𝑖 but less than 2^D.
Proof that it is larger than 2^𝐷𝑖:
• 1D input with 1 hidden unit creates two regions (one joint)
• 2D input with 2 hidden units creates four regions (two lines)
• 3D input with 3 hidden units creates eight regions (three planes)
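A small sketch that evaluates Zaslavsky’s formula; for example, D = 3 hidden units in 𝐷𝑖 = 2 dimensions gives the seven polygons seen earlier:

```python
from math import comb

def max_regions(D, D_i):
    """Maximum number of linear regions created by D hyperplanes in D_i dimensions (Zaslavsky, 1975)."""
    return sum(comb(D, j) for j in range(D_i + 1))

print(max_regions(3, 2))    # 7 -- the two-input, three-hidden-unit example above
print(max_regions(500, 2))  # grows rapidly with more hidden units
```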
Shallow neural networks
• Example network, 1 input, 1 output
• Universal approximation theorem
• More than one output
• More than one input
• General case
• Number of regions
• Terminology
Nomenclature
• Y-offsets = biases
• Slopes = weights
• Everything in one layer connected to everything in the next = Fully Connected Network
• No loops = Feedforward network
• Values after ReLU (activation functions) = activations
• Values before ReLU = pre-activations
• One hidden layer = shallow neural network
• More than one hidden layer = deep neural network
• Number of hidden units ≈ capacity
Other activation functions
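ReLU is only one choice; a sketch of a few other common activation functions (an illustrative selection, not necessarily those shown on the original slide):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes pre-activations into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent: squashes pre-activations into (-1, 1)."""
    return np.tanh(z)

def leaky_relu(z, alpha=0.1):
    """Like ReLU, but with a small slope alpha for negative pre-activations."""
    return np.where(z > 0.0, z, alpha * z)
```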
Regression
We have built a model that can:
• take an arbitrary number of inputs
• produce an arbitrary number of outputs
• model a function of arbitrary complexity between the two
Next time:
• What happens if we feed one neural network into another neural
network?