Algorithms For Artificial Intelligence
Stanford, California
Contents
Acknowledgments
1 Machine Learning I
  1.1 Linear Predictors
    1.1.1 Types of Prediction Tasks
    1.1.2 Data
    1.1.3 Learning
    1.1.4 Feature Extraction
    1.1.5 Weight Vector
    1.1.6 Binary Classification
    1.1.7 Geometric Intuition
  1.2 Loss Minimization
    1.2.1 Score and Margin
    1.2.2 Linear Regression
    1.2.3 Regression Loss Functions
    1.2.4 Loss Minimization Framework
  1.3 Optimization Problem
    1.3.1 Gradient Descent
    1.3.2 Least Squares Regression
2 Machine Learning II
  2.1 Features
    2.1.1 Organization of Features
    2.1.2 Feature Templates
    2.1.3 Sparsity in Feature Vectors
    2.1.4 Feature Vector Representation
    2.1.5 Hypothesis Class
    2.1.6 Feature Extraction and Learning
    2.1.7 Linearity
    2.1.8 Geometric viewpoint
    2.1.9 Summary
  2.2 Neural Networks
    2.2.1 Decomposing the problem
    2.2.2 Learning strategy
    2.2.3 Gradients
    2.2.4 Linear functions
    2.2.5 Neural networks
References
This work is taken from the lecture notes for the course Artificial Intelligence: Principles and Techniques at Stanford University, CS 221 (cs221.stanford.edu). The contributors to the content of this work are Percy Liang and Dorsa Sadigh; this collection is simply a typesetting of existing lecture notes with minor modifications and additions of working Julia implementations. We would like to thank the original authors for their contribution. In addition, we wish to thank Mykel Kochenderfer and Tim Wheeler for their contribution to the Tufte-Algorithms LaTeX template, based on Algorithms for Optimization.[1]

[1] M. J. Kochenderfer and T. A. Wheeler, Algorithms for Optimization. MIT Press, 2019.
Robert J. Moss
Stanford, Calif.
September 15, 2020
We now embark on our journey into machine learning with the simplest yet most
practical tool: linear predictors, which cover both classification and regression and
are examples of reflex models. After getting some geometric intuition for linear
predictors, we will turn to learning the weights of a linear predictor by formulating
an optimization problem based on the loss minimization framework. Finally, we
will discuss stochastic gradient descent, an efficient algorithm, tailored to machine learning, for optimizing (that is, minimizing) the loss, which is much faster than gradient descent.
x −→ f −→ y
The key difference between classification and regression is that the former has discrete outputs (e.g., ''yes'' or ''no''), whereas the latter has continuous outputs. Note that this dichotomy of prediction tasks is not meant to be a formal definition, but rather to provide intuition. For
instance, binary classification could technically be seen as a regression problem
if the labels are −1 and +1. And structured prediction generally refers to tasks
where the possible set of outputs y is huge (generally, exponential in the size
of the input), but where each individual y has some structure. For example, in
machine translation, the output is a sequence of words.
• Binary classification: x −→ f −→ y ∈ {+1, −1}
• Regression: x −→ f −→ y ∈ R
• Multiclass classification: picture −→ f −→ ''cat''
• Ranking: y is a permutation, e.g., 1 2 3 4 −→ f −→ 2 3 4 1
1.1.2 Data
The starting point of machine learning is the data. For now, we will focus on
supervised learning, in which our data provides both inputs and outputs, in contrast
to unsupervised learning, which only provides inputs. A (supervised) example
(also called a data point or instance) is simply an input-output pair ( x, y), which
specifies that y is the ground-truth output for x. The training data Dtrain is a
multiset of examples (repeats are allowed, but this is not important), which forms
a partial specification of the desired behavior of a predictor.
( x, y)
1.1.3 Learning
Learning is about taking the training data Dtrain and producing a predictor f, which is a function that takes inputs x and tries to map them to outputs y = f(x):

    Dtrain −→ learner −→ f
    x −→ f −→ y

One thing to keep in mind is that we want the predictor to approximately work even for examples that we have not seen in Dtrain. This problem of generalization, which we will discuss two chapters from now, forces us to design f in a principled, mathematical way. We will first focus on examining what f is, independent of how the learning works. Then we will come back to learning f based on data.
julia> x = "abc@gmail.com";
julia> ϕ(x)
5-element Array{Float64,1}:
1.0
0.8461538461538461
1.0
1.0
0.0
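The feature extractor ϕ used in this REPL session does not appear in this excerpt; the following is a minimal sketch consistent with the output above. The specific feature names and thresholds are assumptions chosen to match the five features discussed in this section.

# Hypothetical feature extractor consistent with the REPL output above.
ϕ(x) = [
    Float64(length(x) > 10),        # length>10
    count(isletter, x) / length(x), # fracOfAlpha
    Float64(occursin("@", x)),      # contains_@
    Float64(endswith(x, ".com")),   # endsWith_.com
    Float64(endswith(x, ".org")),   # endsWith_.org
]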
φ( x ) = [φ1 ( x ), . . . , φd ( x )]
w =  length>10     : −1.2
     fracOfAlpha   :  0.6
     contains_@    :  3.0
     endsWith_.com :  2.2
     endsWith_.org :  1.4
In the context of binary classification with binary features (φj ( x ) ∈ {0, 1}),
the weights w j ∈ R have an intuitive interpretation. If w j is positive, then the
presence of feature j (φj ( x ) = 1) favors a positive classification. Conversely, if w j
is negative, then the presence of feature j favors a negative classification.
w · φ(x) = ∑_{j=1}^{d} w_j φ(x)_j
Recall that in this setting, we call the predictor a (binary) classifier. The case where w · φ(x) = 0 is a boundary case that isn't so important; we can just predict +1 arbitrarily as a matter of convention.
f_w(x) = sign(w · φ(x)) = { +1  if w · φ(x) ≥ 0
                            −1  otherwise
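As a minimal Julia sketch (assuming a feature extractor ϕ as above; the function name is illustrative):

using LinearAlgebra

# Predict +1 when the score w ⋅ ϕ(x) is nonnegative, −1 otherwise.
predict(x, w, ϕ) = (w ⋅ ϕ(x) >= 0) ? +1 : -1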
Example: with w = [2, −1] and φ(x) ∈ {[2, 0], [0, 2], [2, 4]}, the scores w · φ(x) are 4, −2, and 0, so the predictions are +1, −1, and +1, respectively.
So far we have talked about linear predictors f w which are based on a feature
extractor φ and a weight vector w. Now we turn to the problem of estimating
(also known as fitting or learning) w from training data. The loss minimization
framework is to cast learning as an optimization problem. Note the theme of
separating your problem into a model (optimization problem) and an algorithm
(optimization algorithm).
Loss_{0-1}(x, y, w) = 1[f_w(x) ≠ y] = 1[(w · φ(x))y ≤ 0], where (w · φ(x))y is the margin.

We can plot the loss as a function of the margin (figure: the zero-one loss versus the margin (w · φ(x))y). From the graph, it is clear that the loss is 1 when the margin is negative and 0 when it is positive.
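The margin is used by the classification losses implemented later in this chapter; a minimal Julia sketch (the helper name margin is a convention assumed here):

using LinearAlgebra

# Margin of an example (x, y): positive exactly when the prediction is correct.
margin(x, y, w, ϕ) = (w ⋅ ϕ(x)) * y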
Let's instead define the residual to measure how close the prediction f_w(x) = w · φ(x) is to the correct y. The residual will play the analogous role of the margin for classification and will let us craft an appropriate loss function.

Definition: residual. The residual is the amount by which the prediction f_w(x) = w · φ(x) overshoots the target y:

(w · φ(x)) − y = score − target

(Figure 1.2: the residual is the vertical distance between the prediction w · φ(x) and the target y.)
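The regression loss implementations below call a residual helper; a minimal sketch consistent with the definition above:

using LinearAlgebra

# Residual: the amount by which the prediction w ⋅ ϕ(x) overshoots the target y.
residual(x, y, w, ϕ) = (w ⋅ ϕ(x)) - y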
A popular and convenient loss function to use in linear regression is the squared loss, which penalizes the residual of the prediction quadratically. If the predictor is off by a residual of 10, then the loss will be 100.

Loss_squared(x, y, w) = (f_w(x) − y)² = (w · φ(x) − y)², where w · φ(x) − y is the residual.

(Figure: the squared loss as a function of the residual (w · φ(x)) − y.)

Loss_squared(x, y, w, ϕ) = residual(x, y, w, ϕ)^2

Algorithm 1.6. The squared loss for an example (x, y) using weights w and feature extractor ϕ.
An alternative to the squared loss is the absolute deviation loss, which simply takes the absolute value of the residual.

Loss_absdev(x, y, w) = |f_w(x) − y| = |w · φ(x) − y|, where w · φ(x) − y is the residual.

(Figure: the squared and absolute deviation losses as functions of the residual (w · φ(x)) − y.)

Loss_absdev(x, y, w, ϕ) = abs(residual(x, y, w, ϕ))
TrainLoss(w) = (1/|Dtrain|) ∑_{(x,y) ∈ Dtrain} Loss(x, y, w)

min_{w ∈ R^d} TrainLoss(w)
Key: need to set w to make global tradeoffs—not every example can be happy.
minimize(𝒟train, ϕ, Loss) =
    minimize(w -> TrainLoss(w, 𝒟train, ϕ, Loss))  # delegate to a generic one-argument optimizer `minimize` over functions of w
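A minimal sketch of the training loss used above, assuming a per-example loss of the form Loss(x, y, w, ϕ) as in the implementations in this chapter:

# Average per-example loss over the training set.
TrainLoss(w, 𝒟train, ϕ, Loss) =
    sum(Loss(x, y, w, ϕ) for (x, y) in 𝒟train) / length(𝒟train)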
Which regression loss should we use? For training data Dtrain and feature vector φ(x) = x, the weight w that minimizes the least squares (L2) regression training loss

Loss_squared(x, y, w) = (w · φ(x) − y)²

is the mean of the y values; the mean tries to accommodate every example. For least absolute deviation (L1) regression, the weight w that minimizes the training loss

Loss_absdev(x, y, w) = |w · φ(x) − y|

is the median of the y values; the median is more robust to outliers.
In particular, you can move the furthest point arbitrarily farther out without affecting the median. This makes sense given that the squared loss penalizes large residuals a lot more. In summary, this is an example of where the choice of the loss function has a qualitative impact on the weights learned, and we can study these differences in terms of the objective function without thinking about optimization algorithms.
1.3 Optimization Problem

Having defined a bunch of different objective functions that correspond to training loss, we would now like to optimize them; that is, obtain an algorithm that outputs the w where the objective function achieves the minimum value. (Figure 1.3: training loss as a function of a scalar weight w ∈ R, with its minimum marked.)
min_{w ∈ R^d} TrainLoss(w)

Gradient descent:
    Initialize w = [0, . . . , 0]
    For t = 1, . . . , T:
        w ← w − η ∇_w TrainLoss(w)
where η is the step size and ∇_w TrainLoss(w) is the gradient.
To do this, we will rely on the gradient of the function, which tells us which
direction to move in to decrease the objective the most. The gradient is a valu-
able piece of information, especially since we will often be optimizing in high
dimensions (d on the order of thousands).
This iterative optimization procedure is called gradient descent. Gradient descent
has two hyperparameters, the step size η (which specifies how aggressively we want
to pursue a direction) and the number of iterations T. Let’s not worry about how
to set them, but you can think of T = 100 and η = 0.1 for now.
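A minimal sketch of gradient descent in Julia, assuming we are handed a function ∇TrainLoss that returns the gradient vector (the names and defaults are illustrative):

# Gradient descent: take T steps of size η in the direction of steepest descent.
function gradient_descent(∇TrainLoss, d; η=0.1, T=100)
    w = zeros(d)
    for t in 1:T
        w -= η * ∇TrainLoss(w)
    end
    return w
end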
TrainLoss(w) = (1/|Dtrain|) ∑_{(x,y) ∈ Dtrain} (w · φ(x) − y)²

∇_w TrainLoss(w) = (1/|Dtrain|) ∑_{(x,y) ∈ Dtrain} ∇_w Loss_squared(x, y, w)
                 = (1/|Dtrain|) ∑_{(x,y) ∈ Dtrain} 2(w · φ(x) − y) φ(x),

where w · φ(x) − y is the residual (prediction minus target).
Note that for linear predictors, the gradient is always something times φ( x )
because w only affects the loss through w · φ( x ).
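A sketch of this gradient for the squared loss, written directly from the average above (the names 𝒟train and ϕ follow the conventions used in the earlier sketches):

using LinearAlgebra

# Gradient of the least squares training loss: average of 2(residual)ϕ(x).
∇TrainLoss(w, 𝒟train, ϕ) =
    sum(2 * ((w ⋅ ϕ(x)) - y) * ϕ(x) for (x, y) in 𝒟train) / length(𝒟train)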
w ← w − η ∇w TrainLoss(w)
We can now apply gradient descent on any of our objective functions that we
defined before and have a working algorithm. But it is not necessarily the best
algorithm. One problem (but not the only problem) with gradient descent is that
it is slow. Those of you familiar with optimization will recognize that methods like
Newton’s method can give faster convergence, but that’s not the type of slowness
I’m talking about here. Rather, it is the slowness that arises in large-scale machine
learning applications. Recall that the training loss is a sum over the training data.
If we have one million training examples (which is, by today’s standards, only a
modest number), then each gradient computation requires going through those
one million examples, and this must happen before we can make any progress.
Can we make progress before seeing all the data?
Key idea: stochastic updates. It’s not about quality, it’s about quantity.
In practice, we often find that just performing one pass over the training ex-
amples with SGD, touching each example once, often performs comparably to
taking ten passes over the data with GD. There are also other variants of SGD.
You can randomize the order in which you loop over the training data in each
iteration, which is useful. Think about what would happen if you had all the
positive examples first and the negative examples after that.
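A minimal sketch of stochastic gradient descent, assuming a per-example gradient ∇Loss(x, y, w); the randomized order mentioned above is implemented with shuffle (names and defaults are illustrative):

using Random

# SGD: update on one example at a time, shuffling the data on each pass.
function stochastic_gradient_descent(∇Loss, 𝒟train, d; η=0.1, T=10)
    w = zeros(d)
    for t in 1:T
        for (x, y) in shuffle(𝒟train)
            w -= η * ∇Loss(x, y, w)
        end
    end
    return w
end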
• Constant: η = 0.1
• Decreasing (decaying): η = 1/√(number of updates made so far)

A suggested form for the step size is to set the initial step size to 1 and let the step size decrease as the inverse of the square root of the number of updates we've taken so far. There are some nice theoretical results showing that SGD is guaranteed to converge in this case (provided all your gradients have bounded length).
1.3.6 Summary
In summary we have seen (1) the functions we’re considering (linear predictors),
(2) the criterion for choosing one (loss minimization), and (3) an algorithm
that goes after that criterion (SGD). We already worked out a linear regression
example. What are good loss functions for binary classification?
1. Linear predictors: f_w(x) based on the score w · φ(x)
2. Loss minimization: min_w TrainLoss(w)
3. Stochastic gradient descent: w ← w − η ∇_w Loss(x, y, w)
1.3.7 Zero-One Loss

Recall that we have the zero-one loss for classification. But the main problem with the zero-one loss is that it's hard to optimize (in fact, it's provably NP-hard in the worst case). And in particular, we cannot apply gradient-based optimization to it, because the gradient is zero (almost) everywhere. (Figure: the zero-one loss as a function of the margin (w · φ(x))y.)
Loss_01(x, y, w, ϕ) = I(margin(x, y, w, ϕ) ≤ 0)
To fix this problem, we can use the hinge loss, which is an upper bound on the zero-one loss and has a non-trivial gradient: it tries to increase the margin if it is less than 1. Minimizing an upper bound is a general idea; the hope is that pushing down the upper bound leads to pushing down the actual function.

Loss_hinge(x, y, w) = max{1 − (w · φ(x))y, 0}

Algorithm 1.15. The hinge loss function, an upper bound on the zero-one loss function.
Loss_hinge(x, y, w, ϕ) = max(1 - margin(x, y, w, ϕ), 0)
Advanced: The hinge loss corresponds to the Support Vector Machine (SVM) objective function with one important difference. The SVM objective function also includes a regularization penalty ‖w‖², which prevents the weights from getting too large. We will get to regularization later in the course, so you needn't worry about this for now. But if you're curious, read on.

Why should we penalize ‖w‖²? One answer is Occam's razor, which says to find the simplest hypothesis that explains the data. Here, simplicity is measured in the length of w. This can be made formal using statistical learning theory (take CS229T if you want to learn more).
Perhaps a less abstract and more geometric reason is the following (Figure 1.6 shows the hinge loss alongside the zero-one loss as a function of the margin). Recall that we defined the (algebraic) margin to be (w · φ(x))y. The actual (signed) distance from a point to the decision boundary is (w/‖w‖) · φ(x)y; this is called the geometric margin. So the loss being zero (that is, Loss_hinge(x, y, w) = 0) is equivalent to the algebraic margin being at least 1 (that is, (w · φ(x))y ≥ 1), which is equivalent to the geometric margin being at least 1/‖w‖ (that is, (w/‖w‖) · φ(x)y ≥ 1/‖w‖). Therefore, reducing ‖w‖ increases the geometric margin. For this reason, SVMs are also referred to as max-margin classifiers.
Another popular loss function used in machine learning is the logistic loss, which tries to increase the margin even when it already exceeds 1. The main property of the logistic loss is that no matter how correct your prediction is, you will have non-zero loss, and so there is still an incentive (although a diminishing one) to push the margin even larger. This means that you'll update on every single example. There are some connections between logistic regression and probabilistic models, which we will get to later. (Figure: the zero-one, hinge, and logistic losses as functions of the margin (w · φ(x))y.)
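A sketch of the logistic loss in Julia, using the standard form log(1 + e^(−margin)) and the margin helper assumed earlier:

# Logistic loss: always positive, with a diminishing incentive to grow the margin.
Loss_logistic(x, y, w, ϕ) = log(1 + exp(-margin(x, y, w, ϕ)))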
You should try to ‘‘see’’ the solution before you write things down formally.
Pictorially, it should be evident: when the margin is less than 1, then the
gradient is the gradient of 1 − (w · φ( x ))y, which is equal to −φ( x )y. If the
margin is larger than 1, then the gradient is the gradient of 0, which is 0.
∇_w Loss_hinge(x, y, w) = { −φ(x)y  if (w · φ(x))y < 1
                            0        if (w · φ(x))y > 1
What about when the margin is exactly 1? Technically, the gradient doesn’t
exist because the hinge loss is not differentiable there. Practically speaking,
we can take either −φ( x )y or 0 (or anything in between).
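A sketch of this gradient in Julia, taking the gradient to be 0 at a margin of exactly 1, as discussed above (it relies on the margin helper assumed earlier):

# Gradient of the hinge loss with respect to w.
∇Loss_hinge(x, y, w, ϕ) =
    margin(x, y, w, ϕ) < 1 ? -ϕ(x) * y : zeros(length(w))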
Technical note (can be skipped): given f(w), the gradient ∇f(w) is only defined at points w where f is differentiable. However, subdifferentials ∂f(w) are defined at every point (for convex functions). The subdifferential is a set of vectors called subgradients z ∈ ∂f(w), which define linear underapproximations to f, namely f(w) + z · (w′ − w) ≤ f(w′) for all w′.
1.3.10 Summary

Loss functions: zero-one, hinge, and logistic for classification; squared and absolute deviation for regression.

TrainLoss(w) = (1/|Dtrain|) ∑_{(x,y) ∈ Dtrain} Loss(x, y, w)

min_{w ∈ R^d} TrainLoss(w)
The actual loss function depends on what we're trying to accomplish. Generally, the loss function takes the score w · φ(x) and compares it with the correct output y to form either the residual (for regression) or the margin (for classification).
Regression losses are smallest when the residual is close to zero. Classification
losses are smallest when the margin is large. Which loss function we choose
depends on the desired properties. For example, the absolute deviation loss for
regression is robust against outliers. The logistic loss for classification never relents
in encouraging large margin.
φ(x)      y     Loss(x, y, w) = (w · φ(x) − y)²
[1, 0]    2     (w1 − 2)²
[1, 0]    4     (w1 − 4)²
[0, 1]    −1    (w2 − (−1))²

TrainLoss(w) = (1/3)((w1 − 2)² + (w1 − 4)² + (w2 − (−1))²)
Note that we’ve been talking about the loss on a single example, and
plotting it in 1D against the residual or the margin. Recall that what we’re
actually optimizing is the training loss, which sums over all data points. To
help visualize the connection between a single loss plot and the more general
picture, consider the simple example of linear regression on three data points:
([1, 0], 2), ([1, 0], 4), and ([0, 1], −1), where φ( x ) = x.
Let’s try to draw the training loss, which is a function of w = [w1 , w2 ].
Specifically, the training loss is (1/3)((w1 − 2)² + (w1 − 4)² + (w2 − (−1))²).
The first two points contribute a quadratic term sensitive to w1 , and the third
point contributes a quadratic term sensitive to w2 . When you combine them,
you get a quadratic centered at [3, −1].
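As a sanity check, one can run gradient descent on these three points; the following is a minimal sketch under the assumptions above (φ(x) = x, squared loss, η = 0.1, T = 100):

using LinearAlgebra

𝒟train = [([1.0, 0.0], 2.0), ([1.0, 0.0], 4.0), ([0.0, 1.0], -1.0)]
∇TrainLoss(w) = sum(2 * ((w ⋅ x) - y) * x for (x, y) in 𝒟train) / length(𝒟train)

w = zeros(2)
for t in 1:100
    global w -= 0.1 * ∇TrainLoss(w)  # gradient descent step
end
# w ≈ [3.0, -1.0], the center of the quadratic described above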
Initialize w = [0, . . . , 0]
For t = 1, . . . , T:
    w ← w − η ∇_w TrainLoss(w)

Gradient descent works by iteratively computing the gradient of the objective with respect to the parameters w and stepping in the opposite direction of the gradient. Think of a ball at the current weight vector rolling down the surface of the training loss objective. Gradient descent
(GD) computes the gradient of the full training loss, which can be slow for large
datasets. Stochastic gradient descent (SGD), which approximates the gradient of
the training loss with the loss at a single example, generally takes less time. In
both cases, one must be careful to set the step size η properly (not too big, not too
small).
The first half of this chapter is about thinking about the feature extractor φ.
Features are a critical part of machine learning which often do not get as much
attention as they deserve. Ideally, they would be given to us by a domain expert,
and all we (as machine learning people) have to do is to stick them into our
learning algorithm. While one can get considerable mileage out of doing this, the
interface between general-purpose machine learning and domain knowledge is
often nuanced, so to be successful, it pays to understand this interface.
In the second half of this chapter, we return to learning, rip out the linear
predictors that we had from before, and show how we can build more powerful
neural network classifiers given the features that we extracted.
Can we obtain decision boundaries which are circles by using linear classi-
fiers? The answer is yes.
This might seem paradoxical since we are only working with linear clas-
sifiers. But as we will see later, linear refers to the relationship between the
weight vector w and the prediction score (not the input x, which might not
even be a real vector), whereas the decision boundary refers to how the
prediction varies as a function of x.
Advanced: Sometimes people might think that linear classifiers are not
expressive, and that you need neural networks to get expressive and non-
linear classifiers. This is false. You can build arbitrarily expressive models
with the machinery of linear classifiers (see kernel methods). The advantages
of neural networks are the computational benefits and the inductive bias that
comes from the particular neural network architecture.
2.1 Features

A feature extractor maps an input to a feature vector; the choice of features is arbitrary! For example:

                            length>10     : 1
                            fracOfAlpha   : 0.85
"abc@gmail.com"  −→         contains_@    : 1
  (feature extractor,       endsWith_.com : 1
   arbitrary!)              endsWith_.org : 0

The prediction is then driven by the score w · φ(x).
• Example feature template: contains character ___
A useful organization principle is a feature template, which groups all the features that are computed in a similar way. (People often use the word ''feature'' when they really mean ''feature template''.) A feature template also allows us to define a set of related features (contains_@, contains_a, contains_b). This reduces the burden on the feature engineer, since we don't need to know which particular characters are useful, but only that the existence of certain single characters is a useful cue to look at.
We can write each feature template as an English description with a blank (___), which is to be filled in with an arbitrary string. Also note that feature templates
are most natural for defining binary features, ones which take on value 1 (true)
or 0 (false). Note that an isolated feature (fraction of alphanumeric characters)
can be treated as a trivial feature template with no blanks to be filled.
As another example, if x is a k × k image, then {pixelIntensityij : 1 ≤ i, j ≤ k } is
a feature template consisting of k2 features, whose values are the pixel intensities
at various positions of x.
# Feature template ''last three characters equal ___'': one binary feature
# per length-three string over the alphabet 𝚺.
last_three_chars___(x, 𝚺='a':'z') =
    [endswith(x, a * b * c) for a in 𝚺, b in 𝚺, c in 𝚺]
One option is to fix an ordering of the features and represent the feature values as an array. This representation is
appropriate when the number of nonzeros is significant (the features are dense).
Arrays are especially efficient in terms of space and speed (and you can take
advantage of GPUs). In computer vision applications, features (e.g., the pixel
intensity features) are generally dense, so array representation is more common.
As a map:    fracOfAlpha: 0.85, contains_a: 0, . . . , endsWith_@: 1, . . .
As an array: [0.85, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
1. Maps do incur extra overhead compared to arrays, and therefore maps are
much slower when the features are not sparse.
Finally, it is important to be clear when describing features. Saying ”length”
might mean that there is one feature whose value is the length of x or that there
could be a feature template ”length is equal to ___”. These two encodings of the
same information can have a drastic impact on prediction accuracy when using a
linear predictor, as we’ll see later.
Predictor: f_w(x) = w · φ(x) or sign(w · φ(x))
Hypothesis class: F = {f_w : w ∈ R^d}
Note that if the hypothesis class doesn’t contain any good predictors, then no
amount of learning can help. So the question when extracting features is really
whether they are powerful enough to express predictors which are good. It’s
okay and expected that F will contain a bunch of bad ones as well. Later, we’ll
see reasons for keeping the hypothesis class small (both for computational and
statistical reasons), because we can’t get the optimal w for any feature extractor φ
we choose.
Linear functions: φ(x) = x,
F1 = {x ↦ w1·x : w1 ∈ R}

Quadratic functions: φ(x) = [x, x²],
F2 = {x ↦ w1·x + w2·x² : w1 ∈ R, w2 ∈ R}
Given a fixed feature extractor φ, let us consider the space of all predictors f w
obtained by sweeping w over all possible values. If we use φ( x ) = x, then we get
linear functions that go through the origin. However, we want to have functions
that ”bend” (or are not monotonic). For example, if we want to predict someone’s
health given his or her body temperature, there is a sweet spot temperature (37 °C) where the health is optimal; both higher and lower values should cause the
health to decline. If we use φ( x ) = [ x, x2 ], then we get quadratic functions that
go through the origin, which are a strict superset of the linear functions, and
therefore are strictly more expressive.
However, even quadratic functions can be limiting because they have to rise
and fall in a certain (parabolic) way. What if we wanted a more flexible, freestyle
approach? We can create piecewise constant functions by defining features that
”fire” (are 1) on particular regions of the input (e.g., 1 < x ≤ 2). Each feature
gets associated with its own weight, which in this case corresponds to the desired
function value in that region. Thus by varying the weight vector, we get piecewise constant functions.
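A sketch of such region-indicator features in Julia; the particular region boundaries below are illustrative, not from the text:

using LinearAlgebra

# Indicator features that fire on intervals of the input; the weight w[j]
# is then the predicted value on region j.
ϕ_piecewise(x) = [Float64(lo < x <= hi) for (lo, hi) in [(0, 1), (1, 2), (2, 3), (3, 4)]]
f_piecewise(x, w) = w ⋅ ϕ_piecewise(x)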
2.1.7 Linearity
Wait a minute...how were we able to get non-linear predictions using linear
predictors? It is important to remember that for linear predictors, it is the score
w · φ( x ) that is linear in w and φ( x ) (read it off directly from the formula). In
particular, the score is not linear in x (it sometimes doesn't even make sense because x need not be a vector at all; it could be a string or a PDF file). Also, neither the predictor f_w (unless we're doing linear regression) nor the loss function TrainLoss(w) is linear in anything. For the prediction driven by the score w · φ(x):
Linear in w? Yes
Linear in φ( x )? Yes
Linear in x? No! (x not necessarily even a vector)
Recall that the feature extractor φ should pick out properties of x that might be useful for predicting y.
Let’s apply what you’ve learned about feature extraction to a concrete prob-
lem. The motivation here is that messaging platforms often just show a single
stream of messages, when there is generally a grouping of messages into coherent
conversations. How can we build a classifier that can group messages automat-
ically? We can formulate this as a binary classification problem where we look
at two messages and determine whether or not these two are part of the same
conversation.
Question. What feature templates would you use for predicting whether
the second message is a response to the first?
• time elapsed
2.1.9 Summary
• Feature templates: organize related (sparse) features
What we’ve shown so far is that by being mildly clever with choosing the feature
extractor φ, we can actually get quite a bit of mileage out of our so-called linear
predictors. However, sometimes we don’t know what features are good to use,
either because the prediction task is non-intuitive or we don’t have time to figure
out which features are suitable. Sometimes, we think we might know what features
are good, but then it turns out that they aren’t (this happens a lot!). In the spirit
of machine learning, we’d like to automate things as much as possible. In this
context, it means creating algorithms that can take whatever crude features we
have and turn them into refined predictions, thereby shifting the burden off
feature extraction and moving it to learning.
Neural networks have been around for many decades, but they fell out of favor
because they were difficult to train. In the last decade, there has been a huge
resurgence of interest in neural networks since they perform so well and training
seems to not be such an issue when you have tons of data and compute. In a sense,
neural networks allow one to automatically learn the features of a linear classifier
which are geared towards the desired task, rather than specifying them all by
hand.
As a motivating example (Example 2.4: predicting car collisions), consider the problem of predicting whether two cars at positions x1 and x2 are going to collide. Suppose the true output is 1 (safe) whenever the cars are separated by a distance of at least 1. Clearly, this decision boundary is not linear.
Input: position of two oncoming cars x = [ x1 , x2 ]
Output: whether safe (y = +1) or collide (y = −1)
True function: safe if cars sufficiently far
y = sign(| x1 − x2 | − 1)
Examples:
x         y
[1, 3]    +1
[3, 1]    +1
[1, 0.5]  −1

With the intermediate hidden units h1 and h2 defined below:
x         h1   h2   y
[1, 3]    0    1    +1
[3, 1]    1    0    +1
[1, 0.5]  0    0    −1
h1 = 1[x1 − x2 ≥ 1] = 1[v1 · φ(x) ≥ 0],   v1 = [−1, +1, −1]
h2 = 1[x2 − x1 ≥ 1] = 1[v2 · φ(x) ≥ 0],   v2 = [−1, −1, +1]

where φ(x) = [1, x1, x2]. Final prediction:

y = f_{V,w}(x) = sign(w1 h1 + w2 h2),   w = [1, 1]
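A Julia sketch of this hand-built predictor, assuming the feature map φ(x) = [1, x1, x2] implied by v1 and v2 above; the tie at h1 + h2 = 0 is broken toward −1 so that [1, 0.5] gets the intended label:

using LinearAlgebra

ϕ(x) = [1.0, x[1], x[2]]
v1 = [-1.0,  1.0, -1.0]           # h1 fires when x1 - x2 ≥ 1
v2 = [-1.0, -1.0,  1.0]           # h2 fires when x2 - x1 ≥ 1
h1(x) = Float64(v1 ⋅ ϕ(x) >= 0)
h2(x) = Float64(v2 ⋅ ϕ(x) >= 0)
f(x; w=[1.0, 1.0]) = (w ⋅ [h1(x), h2(x)] > 0) ? +1 : -1

f([1.0, 3.0]), f([3.0, 1.0]), f([1.0, 0.5])   # (+1, +1, -1)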
2.2.3 Gradients
If we try to train the weights v1 , v2 , w1 , w2 , we will immediately notice a problem:
the gradient of h1 with respect to v1 is always zero because of the hard thresholding
function. Therefore, we define the logistic function σ(z), which looks roughly
like the step function 1[z ≥ 0], but has non-zero gradients everywhere. One thing
to bear in mind is that even though the gradients are non-zero, they can be quite
small when |z| is large. This is what makes optimizing neural networks hard.
σ(z) = (1 + e^{−z})^{−1} = 1/(1 + e^{−z})
σ′(z) = σ(z)(1 − σ(z))   (derivative)
The ReLU, ReLU(z) = max{z, 0}, has two advantages: (i) its gradient doesn't vanish as z grows, which makes it empirically easier to train; and (ii) it only involves a max operation, which is computationally cheaper than the exponential function.
For a neural network with one hidden layer, the intermediate hidden units are h_j = σ(v_j · φ(x)) where σ(z) = (1 + e^{−z})^{−1}, producing the output score = w · h.
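A sketch of this forward pass in Julia, where the rows of a matrix V are the vectors v_j (the names are assumptions):

using LinearAlgebra

σ(z) = 1 / (1 + exp(-z))
# One hidden layer: h = σ.(V * ϕ(x)), then a linear output score = w ⋅ h.
score(x, V, w, ϕ) = w ⋅ σ.(V * ϕ(x))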
No matter how deep the neural network is, the top layer is always a linear function, and the layers below can be interpreted as defining a (possibly very complex) feature map. Whether this is a suitable form depends on the nature of the application.
Empirically, though, neural networks have been quite successful, since learning
the features from the data with the explicit objective of minimizing the loss can
yield better features than ones which are manually crafted. Since 2010, there have
been some advances in getting neural networks to work, and they have become
the state-of-the-art in many tasks. For example, the major companies (Google, Microsoft, IBM) have all recently switched over to using neural networks for speech
recognition. In computer vision, (convolutional) neural networks are completely
dominant in object recognition.
h(x) = [h1(x), . . . , hk(x)], the vector of hidden units computed from the features φ(x).
The main thing left to do for neural networks is to be able to train them. Con-
ceptually, this should be straightforward: just take the gradient and run SGD.
While this is true, computing the gradient, even though it is not hard, can be quite
tedious to do by hand.
Optimization problem:

min_{V,w} TrainLoss(V, w)

TrainLoss(V, w) = (1/|Dtrain|) ∑_{(x,y) ∈ Dtrain} Loss(x, y, V, w)
2.3.1 Approach
Mathematically, we could just grind through the chain rule. But next we visualize the computation using a computation graph. We will illustrate a graphical way of organizing the computation of gradients, which is built out of a few components.
This graphical approach will show the structure of the function and will not only
make gradients easy to compute, but also shed more light onto the predictor
and loss function. In fact, these days if you use a package such as TensorFlow or
PyTorch, you can write down the expressions symbolically and the gradient is
computed for you. This is done essentially using the computational procedure
that we will see.
Basic building blocks and their partial derivatives:

Function      ∂out/∂a           ∂out/∂b
a + b         1                 1
a − b         1                 −1
a · b         b                 a
max(a, b)     1[a > b]          1[a < b]
σ(a)          σ(a)(1 − σ(a))
For convenience, we write the partial derivative on the edge connecting the input to the output. Partial derivatives (i.e., gradients) answer the question: how much does the output change if an input changes? The table above gives five examples of simple functions and their partial derivatives. These should be familiar from basic calculus; all we've done is present them in a visually more intuitive way. But it turns out that these simple functions are all we need to build up many of the more complex and potentially scarier looking functions that we'll encounter in machine learning.
2.3.4 Composing functions

The second conceptual point is to think about composing functions. Graphically, this is very natural: the output of one function f simply gets fed as the input into another function g. Now how does in affect out (what is the partial derivative)? The key idea is that the partial derivative decomposes into a product of the two partial derivatives on the two edges:

∂out/∂in = (∂out/∂mid) · (∂mid/∂in)   (chain rule)

You should recognize this is no more than the chain rule in graphical form. More generally, the partial derivative of y with respect to x is simply the product of all the partial derivatives on the edges of the path connecting x and y. This visual intuition will help us better understand more complex functions, which we will turn to next.
As an example, consider the hinge loss Loss(x, y, w) = max{1 − (w · φ(x))y, 0} drawn as a computation graph: w and φ(x) feed into the score w · φ(x), which is multiplied by y to form the margin, and the loss is max{1 − margin, 0}. To compute

∇_w Loss(x, y, w) = ∂Loss(x, y, w)/∂w,

we simply multiply the partial derivatives along the edges of the path from w to the loss: −1[margin < 1] φ(x) y.
2.3.6 Backpropagation
Now, we can apply the same strategy to neural networks. Here we are using the
squared loss for concreteness, but one can also use the logistic or hinge losses.
Note that there is some really nice modularity here: you can pick any predictor
(linear or neural network) to drive the score, and the score can be fed into any
loss function (squared, hinge, etc.).
Loss(x, y, V, w) = (∑_{j=1}^{k} w_j σ(v_j · φ(x)) − y)²,

where the term inside the square is the neural network output minus the target.
In the forward pass, we go through the computation graph from the leaves to the root and compute f_i, the value of each node i, recursively given the node values of the children of i. These values will be used in the backward pass. Next, in the backward pass, we go through all the nodes from the root to the leaves and compute g_i recursively from f_i and g_j, the backward value of the parent j of i, using the key recurrence g_i = (∂f_j/∂f_i) g_j (just the chain rule). In this example, the backward pass gives us the gradient of the output node (the gradient of the loss) with respect to the weights.
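A minimal sketch of this forward/backward computation for the one-hidden-layer squared loss above, done by hand rather than with an automatic differentiation package; the names and layout (rows of V are the v_j) are assumptions:

using LinearAlgebra

σ(z) = 1 / (1 + exp(-z))

# Forward pass computes the loss; backward pass applies the chain rule to
# return gradients with respect to w and to the rows of V.
function backprop(x, y, V, w, ϕ)
    z = V * ϕ(x)                      # forward pass
    h = σ.(z)
    score = w ⋅ h
    loss = (score - y)^2
    dscore = 2 * (score - y)          # backward pass
    ∇w = dscore .* h
    ∇V = (dscore .* w .* h .* (1 .- h)) * ϕ(x)'
    return loss, ∇w, ∇V
end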
Nearest neighbors is one of the simplest learning algorithms. In a way, there is no learning: at training time, we just store the entire training set. At prediction time, we get an input x′ and we just find the input in our training set that is most similar (e.g., by Euclidean distance, ‖v − v′‖₂ = √(∑_{i=1}^{n} (v_i − v_i′)²)), and return its output. In a practical implementation, finding the closest input is non-trivial. Popular choices are using k-d trees or locality-sensitive hashing. We will not worry about this issue. The intuition being expressed here is that similar (nearby) points tend to have similar outputs. This is a reasonable assumption in most cases; all else equal, having a body temperature of 37 and 37.1 is probably not going to affect the health prediction by much.
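A minimal brute-force sketch of nearest neighbor prediction with Euclidean distance (in practice, a k-d tree or locality-sensitive hashing would replace the linear scan):

using LinearAlgebra

# Return the output of the training example whose features are closest to ϕ(x′).
function nearest_neighbor(x′, 𝒟train, ϕ)
    _, i = findmin([norm(ϕ(x′) - ϕ(x)) for (x, y) in 𝒟train])
    return 𝒟train[i][2]
end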
Let’s look at the decision boundary of nearest neighbors. The input space is
partitioned into regions, such that each region has the same closest point (this
is a Voronoi diagram), and each region could get a different output. Notice that
this decision boundary is much more expressive than what you could get with
quadratic features. In particular, one interesting property is that the complexity of
the decision boundary adapts to the number of training examples. As we increase
the number of training examples, the number of regions will also increase. Such
methods are called non-parametric. The decision boundaries are shown in the
Voronoi diagrams in figures 2.3 and 2.4.
• Non-parametric: the hypothesis class adapts to the number of examples. (Figure 2.4: Voronoi diagram showing nearest neighbors classification with Euclidean distance (L2).)
• Simple and powerful, but kind of brute force.
With search problems, we solved a diverse range of problems, from package delivery to route finding. The key was to cast whatever problem we were
interested in solving into the problem of finding the minimum cost path in a
graph. However, search problems assume that taking an action a from a state s
results deterministically in a unique successor state Succ(s, a).
3.1 Uncertainty
In the real world, the deterministic successor assumption is often unrealistic, for
there is randomness: taking an action might lead to any one of many possible states.
One deep question here is how we can even hope to act optimally in the face of
randomness? Certainly we can’t just have a single deterministic plan, and talking
about a minimum cost path doesn’t make sense. Today, we will develop tools
to tackle this more challenging setting. We will fortunately still be able to reuse
many of the intuitions about search problems, in particular the notion of a state.
• Robotics: decide where to move, but actuators can fail, hit unseen obstacles,
etc.
• Resource allocation: decide what to produce, don’t know the customer demand
for various products
• Agriculture: decide what to plant, but don’t know weather and thus crop yield
Let us consider an example. You are exploring a South Pacific island, which
is modeled as a 3 × 4 grid of states. From each state, you can take one of four
actions to move to an adjacent state: north (N), east (E), south (S), or west
(W). If you try to move off the grid, you remain in the same state. You start
at (2, 1). If you end up in either of the green or red squares, your journey
ends, either in a lava lake (reward of −50) or in a safe area with either no
view (2) or a fabulous view of the island (20). What do you do?
If we have a deterministic search problem, then the obvious thing will be
to go for the fabulous view, which yields a reward of 20. You can set numIters
to 10 and press Run. Each state is labeled with the maximum expected utility
(sum of rewards) one can get from that state (analogue of FutureCost in a
search problem). We will define this quantity formally later. For now, look
at the arrows, which represent the best action to take from each cell. Note
that in some cases, there is a tie for the best, where some of the actions seem
to be moving in the wrong direction. This is because there is no penalty for
moving around indefinitely. If you change moveReward to −0.1, then you’ll
see the arrows point in the right direction.
In reality, we are dealing with treacherous terrain, and there is on each
action a probability slipProb of slipping, which results in moving in a ran-
dom direction. Try setting slipProb to various values. For small values (e.g.,
0.1), the optimal action is to still go for the fabulous view. For large values
(e.g., 0.3), then it’s better to go for the safe and boring 2. Play around with
the other reward values to get intuition for the problem.
Important: note that we are only specifying the dynamics of the world,
not directly specifying the best action to take. The best actions are computed
automatically from the algorithms we’ll see shortly.
Let's suppose you always stay. Note that each outcome of the game will result in a different sequence of rewards, resulting in a utility, which is in this case just the sum of the rewards. We are interested in the expected utility, which you can compute to be 12:

(1/3)(4) + (2/3)(1/3)(8) + (2/3)(2/3)(1/3)(12) + · · · = 12.

If you quit instead, the game ends immediately, so the expected utility is simply 1 · (10) = 10.
While we already solved this game directly, we’d like to develop a more general
framework for thinking about not just this game, but also other problems such
as the volcano crossing example. To that end, let us formalize the dice game as a
Markov decision process (MDP).
An MDP can be represented as a graph. The nodes in this graph include both
states and chance nodes. Edges coming out of states are the possible actions from
that state, which lead to chance nodes. Edges coming out of a chance node are the possible random outcomes of that action, which end up back in states.
Our convention is to label these chance-to-state edges with the probability of a
particular transition and the associated reward for traversing that edge.
A Markov decision process has a set of states States, a starting state s_start, and a set of actions Actions(s) from each state s. It also has a transition distribution T, which specifies for each state s and action a a distribution over possible successor states s′. Specifically, we have that ∑_{s′} T(s, a, s′) = 1 because T is a probability distribution (more on this later). Associated with each transition (s, a, s′) is a reward, which could be either positive or negative. If we arrive in a state s for which IsEnd(s) is true, then the game is over. Finally, the discount factor γ is a quantity which specifies how much we value the future and will be discussed later.
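A minimal sketch of this definition as a Julia structure; the field names are illustrative and not from the text, and T, reward, actions, and is_end are stored as functions:

# A Markov decision process: states, start state, actions, transitions,
# rewards, terminal test, and discount factor.
struct MDP
    states          # collection of states
    start_state
    actions         # actions(s)
    T               # T(s, a, s′): transition probability
    reward          # reward(s, a, s′)
    is_end          # is_end(s)
    γ::Float64      # discount factor
end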
MDPs share many similarities with search problems, but there are differences (one main difference and one minor one). The main difference is the move from a search problem to an MDP:

• Succ(s, a) ⇒ T(s, a, s′)
• Cost(s, a) ⇒ Reward(s, a, s′)
3.2.1 Transitions

Just to dwell on the major difference, transition probabilities, a bit more: for each state s and action a, the transition probabilities specify a distribution over successor states s′.

Example: transition probabilities for the dice game.
s    a      s′    T(s, a, s′)
in   quit   end   1
in   stay   in    2/3
in   stay   end   1/3
Probabilities must sum to one. This means that for each given s and a, if we sum the transition probability T(s, a, s′) over all possible successor states s′, we get 1. If a transition to a particular s′ is not possible, then T(s, a, s′) = 0. We refer to the s′ for which T(s, a, s′) > 0 as the successors. Generally, the number of successors of a given (s, a) is much smaller than the total number of states. For instance, in a search problem, each (s, a) has exactly one successor. Formally, for each state s and action a:

∑_{s′ ∈ States} T(s, a, s′) = 1,   where the successors of (s, a) are the s′ with T(s, a, s′) > 0.
Example: transportation. Let us revisit the transportation example. As we all know, magic trams aren't the most reliable forms of transportation, so let us assume that with probability 1/2 the tram actually does as advertised, and with probability 1/2 it just leaves you in the same state.
3.2.2 Solutions
So we now know what an MDP is. What do we do with one? For search problems,
we were trying to find the minimum cost path. However, fixed paths won't suffice for MDPs, because we don't know which states the random dice rolls are going to take us to. Therefore, we define a policy, which specifies an action for every single state, not just the states along a path. This way, we have all our bases covered, and know what action to take no matter where we are. One might wonder if we ever need to take different actions from a given state. The answer is no, since, as in a search problem, the state contains all the information that we need to act
optimally for the future. In more formal speak, the transitions and rewards satisfy
the Markov property. Every time we end up in a state, we are faced with the exact
same problem and therefore should take the same optimal action.
The solution to an MDP is a policy π, which specifies an action π(s) for each state s.

Example: volcano crossing.
s        π(s)
(1, 1)   S
(2, 1)   E
(3, 1)   N
· · ·    · · ·
Now that we’ve defined an MDP (the input) and a policy (the output), let’s turn
to defining the evaluation metric for a policy—there are many of them, which one
should we choose? Recall that we’d like to maximize the total rewards (utility),
but this is a random quantity, so we can't quite do that. Instead, we will maximize the expected utility, which we will refer to as the value (of a policy).
Evaluating a policy: volcano crossing. To get an intuitive feel for the rela-
tionship between a value and utility, consider the volcano example. If you
press Run multiple times, you will get random paths shown on the right
leading to different utilities. Note that there is considerable variation in what
happens. The expectation of this utility is the value. You can run multiple
simulations by increasing numEpisodes. If you set numEpisodes to 1000, then
you’ll see the average utility converging to the value.
Path                                                           Utility
[in; stay, 4, end]                                             4
[in; stay, 4, in; stay, 4, end]                                8
[in; stay, 4, in; stay, 4, in; stay, 4, end]                   12
[in; stay, 4, in; stay, 4, in; stay, 4, in; stay, 4, end]      16
. . .                                                          . . .

Value: 12
3.3.1 Discounting
There is an additional aspect to utility: discounting, which captures the fact that
a reward today might be worth more than the same reward tomorrow. If the
discount γ is small, then we favor the present more and downweight future
rewards more. Note that the discounting parameter is applied exponentially
to future rewards, so the distant future is always going to have a fairly small
contribution to the utility (unless γ = 1). The terminology, though standard, is
slightly confusing: a larger value of the discount parameter γ actually means that
the future is discounted less.
u_1 = r_1 + γ r_2 + γ² r_3 + γ³ r_4 + · · ·

V_π(s) = E[u_1 | s_0 = s] = ∑_{s′} P[s_1 = s′ | s_0 = s, a_1 = π(s)] · E[u_1 | s_1 = s′, s_0 = s, a_1 = π(s)]

Note that P[s_1 = s′ | s_0 = s, a_1 = π(s)] = T(s, π(s), s′). Using the fact that u_1 = r_1 + γ u_2 and taking expectations, we get the recurrence

V_π(s) = Q_π(s, π(s)) = ∑_{s′} T(s, π(s), s′) [Reward(s, π(s), s′) + γ V_π(s′)],   with V_π(s) = 0 when IsEnd(s).
Example: Dice game. As an example, let’s compute the values of the nodes
in the dice game for the policy ‘‘stay’’. Note that the recurrence involves both
Vπ (in) on the left-hand side and the right-hand side. At least in this simple
example, we can solve this recurrence easily to get the value. Let γ = 1. Let
π be the ‘‘stay’’ policy: π (in) = stay.
V_π(end) = 0
V_π(in) = Q_π(in, stay) = (1/3)(4 + V_π(end)) + (2/3)(4 + V_π(in))

In this case, we can solve in closed form:

V_π(in) = (1/3)(4) + (2/3)(4 + V_π(in)) = 4 + (2/3) V_π(in)
⟹ (1/3) V_π(in) = 4
⟹ V_π(in) = 12
But for a much larger MDP with 100000 states, how do we efficiently compute
the value of a policy? One option is the following: observe that the recurrences
define a system of linear equations, where the variables are Vπ (s) for each state
s and there is an equation for each state. So we could solve the system of linear
equations by computing a matrix inverse. However, inverting a 100000 × 100000
matrix is expensive in general. There is an even simpler approach called policy
evaluation. We’ve already seen examples of iterative algorithms in machine learn-
ing: the basic idea is to start with something crude, and refine it over time. Policy evaluation starts with a vector of all zeros for the initial values V_π^(0). Each iteration, we loop over all the states and apply the two recurrences that we had before. The equations look hairier because of the superscript (t), which simply denotes the value at iteration t of the algorithm.
Key idea: iterative algorithm. Start with arbitrary policy values and re-
peatedly apply recurrences to converge to true values.
Some implementation notes: a good strategy for determining how many iterations to run policy evaluation is based on how accurate the result is. Rather than set some fixed number of iterations (e.g., 100), we instead set an error tolerance (e.g., ε = 0.01) and iterate until the maximum change between values of any state s from one iteration (t) to the previous (t − 1) is at most ε. The second note is that while the algorithm is stated as computing V_π^(t) for each iteration t, we actually only need to keep track of the last two values. This is important for saving memory. How many iterations t_PE to run? Repeat until the values don't change much:

max_{s ∈ States} |V_π^(t)(s) − V_π^(t−1)(s)| ≤ ε

Don't store V_π^(t) for each iteration t; we need only the last two: V_π^(t) and V_π^(t−1).
Note that in policy evaluation there is no maximization over actions: the action is fixed to π(s), and we only need to look at the action specified by the policy. Advanced: here, we have to iterate t_PE time steps to reach a target level of error ε. It turns out that t_PE doesn't actually have to be very large for very small errors. Specifically, the error decreases exponentially fast as we increase the number of iterations. In other words, to cut the error in half, we only have to run a constant number of additional iterations. Advanced: for acyclic graphs (for example, the MDP for Blackjack), we just need to do one iteration (not t_PE) provided that we process the nodes in reverse topological order of the graph. This is the same setup as we had for dynamic programming in search problems, only the equations are different.
Let us run policy evaluation on the dice game. The value converges very
quickly to the correct answer. Let π be the stay policy: π (in) = stay.
V_π^(t)(end) = 0
V_π^(t)(in) = (1/3)(4 + V_π^(t−1)(end)) + (2/3)(4 + V_π^(t−1)(in))

t (iterations)    V_π^(t)(end)    V_π^(t)(in)
0                 0.00            0.00
1                 0.00            4.00
2                 0.00            6.67
3                 0.00            8.44
· · ·
100               0.00            12.00

Converges to V_π(in) = 12.
3.3.3 Summary
Let’s summarize: we have defined an MDP, which we should think of a graph
where the nodes are states and chance nodes. Because of randomness, solving an
MDP means generating policies, not just paths. A policy is evaluated based on
its value: the expected utility obtained over random paths. Finally, we saw that
policy evaluation provides a simple way to compute the value of a policy.
If we are given a policy π, we now know how to compute its value V_π(s_start). So now, we could just enumerate all the policies, compute the value of each one, and take the best policy, but the number of policies is exponential in the number of states (|Actions|^|States| to be exact), so we need something a bit more clever. We will now
introduce value iteration, which is an algorithm for finding the best policy. While
evaluating a given policy and finding the best policy might seem very different,
it turns out that value iteration will look a lot like policy evaluation.
Definition: optimal value. The optimal value V ∗ (s) is the maximum value
attained by any policy.
The recurrences for V* and Q* are identical to the ones for policy evaluation with one difference: in computing V*, instead of taking the action from the fixed policy π, we take the best action, the one that results in the largest Q*(s, a):

V*(s) = max_{a ∈ Actions(s)} Q*(s, a)   (and V*(s) = 0 if IsEnd(s)),
Q*(s, a) = ∑_{s′} T(s, a, s′) [Reward(s, a, s′) + γ V*(s′)].
So far, we have focused on computing the value of the optimal policy, but what is the actual policy? It turns out that this is pretty easy to compute. Suppose you're at a state s. Q*(s, a) tells you the value of taking action a from state s. So the optimal action is simply to take the action a with the largest value of Q*(s, a). Given Q*, read off the optimal policy: π*(s) = arg max_{a ∈ Actions(s)} Q*(s, a).
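A minimal sketch of value iteration under the same assumed MDP interface, returning both the optimal values and the read-off policy (argmax(f, collection) requires Julia 1.7 or later):

function value_iteration(mdp; ϵ=0.01)
    Q(s, a, V) = sum(mdp.T(s, a, s′) * (mdp.reward(s, a, s′) + mdp.γ * V[s′])
                     for s′ in mdp.states)
    V = Dict(s => 0.0 for s in mdp.states)
    while true
        V′ = Dict{Any,Float64}()
        for s in mdp.states
            V′[s] = mdp.is_end(s) ? 0.0 : maximum(Q(s, a, V) for a in mdp.actions(s))
        end
        converged = maximum(abs(V′[s] - V[s]) for s in mdp.states) ≤ ϵ
        V = V′
        converged && break
    end
    # Read off the optimal policy from the optimal Q values.
    π = Dict(s => argmax(a -> Q(s, a, V), mdp.actions(s))
             for s in mdp.states if !mdp.is_end(s))
    return V, π
end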
3.4.3 Convergence
Let us state more formally the conditions under which any of these algorithms that we talked about will work. A sufficient condition is that either the discount γ is strictly less than 1 or the MDP graph is acyclic. We can reinterpret the discount γ < 1 condition as an undiscounted MDP in which, at every step, there is an additional probability 1 − γ of transitioning to a zero-reward end state, so the process terminates with probability 1.
Value iteration: dice game. Let us demonstrate value iteration on the dice
game. Initially, the optimal policy is ‘‘quit’’, but as we run value iteration
longer, it switches to ‘‘stay’’.
t (iterations)    V*^(t)(end)    V*^(t)(in)    π*(in)
0                 0.00           0.00          —
1                 0.00           10.00         quit
2                 0.00           10.67         stay
3                 0.00           11.11         stay
· · ·
100               0.00           12.00         stay
to compute, it’s easy to turn these recurrences into algorithms that iterate between
those recurrences until convergence.
Algorithms:
Recipe:
3.4.6 Summary
• Markov decision processes (MDPs) cope with uncertainty.
• Value iteration computes optimal value (maximum expected utility) and opti-
mal policy.
References
1. M. J. Kochenderfer and T. A. Wheeler, Algorithms for Optimization. MIT Press, 2019.