
Algorithms for Artificial Intelligence

Stanford, California
Contents

Acknowledgments

1 Machine Learning I
1.1 Linear Predictors
1.1.1 Types of Prediction Tasks
1.1.2 Data
1.1.3 Learning
1.1.4 Feature Extraction
1.1.5 Weight Vector
1.1.6 Binary Classification
1.1.7 Geometric Intuition
1.2 Loss Minimization
1.2.1 Score and Margin
1.2.2 Linear Regression
1.2.3 Regression Loss Functions
1.2.4 Loss Minimization Framework
1.3 Optimization Problem
1.3.1 Gradient Descent
1.3.2 Least Squares Regression
1.3.3 Gradient Descent is Slow
1.3.4 Stochastic Gradient Descent
1.3.5 Step Size
1.3.6 Summary
1.3.7 Zero-One Loss
1.3.8 Hinge Loss (SVMs)
1.3.9 Logistic Regression
1.3.10 Summary

2 Machine Learning II
2.1 Features
2.1.1 Organization of Features
2.1.2 Feature Templates
2.1.3 Sparsity in Feature Vectors
2.1.4 Feature Vector Representation
2.1.5 Hypothesis Class
2.1.6 Feature Extraction and Learning
2.1.7 Linearity
2.1.8 Geometric viewpoint
2.1.9 Summary
2.2 Neural Networks
2.2.1 Decomposing the problem
2.2.2 Learning strategy
2.2.3 Gradients
2.2.4 Linear functions
2.2.5 Neural networks
2.3 Efficient Gradients
2.3.1 Approach
2.3.2 Functions as boxes
2.3.3 Basic building blocks
2.3.4 Composing functions
2.3.5 Binary classification with hinge loss
2.3.6 Backpropagation
2.4 Nearest Neighbors
2.4.1 Summary of learners

3 Markov Decision Process I
3.1 Uncertainty
3.2 Markov Decision Processes
3.2.1 Transitions
3.2.2 Solutions
3.3 Policy Evaluation
3.3.1 Discounting
3.3.2 Policy evaluation
3.3.3 Summary
3.4 Value Iteration
3.4.1 Optimal value and policy
3.4.2 Value iteration
3.4.3 Convergence
3.4.4 Summary of algorithms
3.4.5 Unifying idea
3.4.6 Summary

References


Acknowledgments

This work is taken from the lecture notes for the course Artificial Intelligence: Prin-
ciples and Techniques at Stanford University, CS 221 (cs221.stanford.edu). The
contributors to the content of this work are Percy Liang and Dorsa Sadigh—this
collection is simply a typesetting of existing lecture notes with minor modifica-
tions and additions of working Julia implementations. We would like to thank
the original authors for their contribution. In addition, we wish to thank Mykel
Kochenderfer and Tim Wheeler for their contribution to the Tufte-Algorithms
LaTeX template, based on Algorithms for Optimization.¹

¹ M. J. Kochenderfer and T. A. Wheeler, Algorithms for Optimization. MIT Press, 2019.

Robert J. Moss
Stanford, Calif.
September 15, 2020

Ancillary material is available on the template’s webpage:


https://github.com/sisl/textbook_template
1 Machine Learning I

From CS221 Spring 2020, Percy Liang, Chelsea Finn & Nima Anari, Stanford University.

1.1 Linear Predictors

We now embark on our journey into machine learning with the simplest yet most
practical tool: linear predictors, which cover both classification and regression and
are examples of reflex models. After getting some geometric intuition for linear
predictors, we will turn to learning the weights of a linear predictor by formulating
an optimization problem based on the loss minimization framework. Finally, we
will discuss stochastic gradient descent, an efficient algorithm for optimizing (that is, minimizing) the loss that is tailored for machine learning and is much faster than gradient descent.

Example application: spam classification.


Input: x = email message
Output: y = {spam, not-spam}
Objective: obtain a predictor f :

x −→ f −→ y

First, some terminology. A predictor is a function f that maps an input x to an output y. In statistics, y is known as a response, and when x is a real vector, it is known as the covariate.

1.1.1 Types of Prediction Tasks


In the context of classification tasks, f is called a classifier and y is called a label (sometimes class, category, or tag). The key distinction between binary classification and regression is that the former has discrete outputs (e.g., "yes" or "no"), whereas the latter has continuous outputs. Note that the dichotomy of prediction tasks is not meant to give formal definitions, but rather to provide intuition. For instance, binary classification could technically be seen as a regression problem if the labels are −1 and +1. And structured prediction generally refers to tasks where the possible set of outputs y is huge (generally, exponential in the size of the input), but where each individual y has some structure. For example, in machine translation, the output is a sequence of words.

• Binary classification: (e.g., email =⇒ spam/not spam)

x −→ f −→ y ∈ {+1, −1}

• Regression: (e.g., location, year =⇒ housing price)

x −→ f −→ y ∈ R

• Multiclass classification: y is a category

picture −→ f −→ ‘‘cat’’

• Ranking: y is a permutation

1 2 3 4 −→ f −→ 2 3 4 1

• Structured prediction: y is an object which is built from parts

‘‘la casa blu’’ −→ f −→ ‘‘the blue house’’

1.1.2 Data
The starting point of machine learning is the data. For now, we will focus on
supervised learning, in which our data provides both inputs and outputs, in contrast
to unsupervised learning, which only provides inputs. A (supervised) example
(also called a data point or instance) is simply an input-output pair ( x, y), which
specifies that y is the ground-truth output for x. The training data Dtrain is a
multiset of examples (repeats are allowed, but this is not important), which forms
a partial specification of the desired behavior of a predictor.


Data example: specifies that y is the ground-truth output for x

( x, y)

Training data: list of examples for spam classification

    Dtrain = { (". . . 10m dollars . . .", +1),
               (". . . CS221 . . .", −1) }

partial specification of behavior

1.1.3 Learning
Learning is about taking the training data Dtrain and producing a predictor f, which is a function that takes inputs x and tries to map them to outputs y = f(x). One thing to keep in mind is that we want the predictor to approximately work even for examples that we have not seen in Dtrain. This problem of generalization, which we will discuss two chapters from now, forces us to design f in a principled, mathematical way. We will first focus on examining what f is, independent of how the learning works. Then we will come back to learning f based on data.

    Dtrain −→ learner −→ f,    x −→ f −→ y

1.1.4 Feature Extraction


We will consider predictors f based on feature extractors φ. Feature extraction is a
bit of an art that requires intuition about both the task and also what machine
learning algorithms are capable of.
The general principle is that features should represent properties of x which
might be relevant for predicting y. It is okay to add features which turn out to be
irrelevant, since the learning algorithm can sort it out (though it might require
more data to do so).

Feature Vector Notation: Each input x is represented by a feature vector φ(x), which is computed by the feature extractor φ. When designing features, it is useful to think of the feature vector as being a map from strings (feature names) to doubles (feature values). But formally, the feature vector φ(x) ∈ Rd is a real vector φ(x) = [φ1(x), . . . , φd(x)], where each component φj(x), for j = 1, . . . , d, represents a feature.

Example 1.1: feature extractor. A feature extractor for classifying whether a string is an email address. The indicator function symbol I can be created by typing \bbI and hitting tab, and the feature extractor symbol ϕ with \phi.

Example task: predict y, indicating whether a string x is an email address.
Question: what properties of x might be relevant for predicting y?
Feature extractor: Given input x, output a set of (feature name, feature value) pairs.

                                              [ length>10     : 1    ]
                                              [ fracOfAlpha   : 0.85 ]
    "abc@gmail.com" −−feature extractor−→     [ contains_@    : 1    ]
                       (arbitrary!)           [ endsWith_.com : 1    ]
                                              [ endsWith_.org : 0    ]

# Indicator function \bbI
I(b) = b ? 1 : 0

# Feature extractor \phi
ϕ = x -> [I(length(x) > 10),
          sum(isletter(xᵢ) for xᵢ in x)/length(x),
          I(occursin("@", x)),
          I(endswith(x, ".com")),
          I(endswith(x, ".org"))]

julia> x = "abc@gmail.com";
julia> ϕ(x)
5-element Array{Float64,1}:
 1.0
 0.8461538461538461
 1.0
 1.0
 0.0



Definition: feature vector. For an input x, its feature vector is:

φ( x ) = [φ1 ( x ), . . . , φd ( x )]

Think of φ( x ) ∈ Rd as a point in a high-dimensional space.

This vector-based representation allows us to think about feature vectors as a point in a (high-dimensional) vector space, which will later be useful for getting some geometric intuition. Mathematically, the feature vector doesn't need feature names:

           [ length>10     : 1    ]   [ 1    ]
           [ fracOfAlpha   : 0.85 ]   [ 0.85 ]
    φ(x) = [ contains_@    : 1    ] = [ 1    ]
           [ endsWith_.com : 1    ]   [ 1    ]
           [ endsWith_.org : 0    ]   [ 0    ]

1.1.5 Weight Vector


So far, we have defined a feature extractor φ that maps each input x to the feature vector φ(x). A weight vector w = [w1, . . . , wd] (also called a parameter vector or weights) specifies the contribution of each feature to the prediction.

        [ length>10     : −1.2 ]   [ −1.2 ]
        [ fracOfAlpha   :  0.6 ]   [  0.6 ]
    w = [ contains_@    :  3   ] = [  3   ]
        [ endsWith_.com :  2.2 ]   [  2.2 ]
        [ endsWith_.org :  1.4 ]   [  1.4 ]

In the context of binary classification with binary features (φj ( x ) ∈ {0, 1}),
the weights w j ∈ R have an intuitive interpretation. If w j is positive, then the
presence of feature j (φj ( x ) = 1) favors a positive classification. Conversely, if w j
is negative, then the presence of feature j favors a negative classification.


Definition: score. The score on an example (x, y) is w · φ(x), how confident we are in predicting +1. Score is a weighted combination of features:

    w · φ(x) = ∑_{j=1}^{d} wj φ(x)j

score(x, w, ϕ) = w⋅ϕ(x)

Algorithm 1.1. The score of input x using weights w and feature extractor ϕ, written in the Julia programming language. The dot product symbol ⋅ can be created by typing \cdot and hitting tab (included in the LinearAlgebra package), and the w symbol with \bfw.

Note that while the feature vector depends on the input x, the weight vector does not. This is because we want a single predictor (specified by the weight vector) that works on any input.
Given a feature vector φ( x ) and a weight vector w, we define the prediction
score to be their inner product. The score intuitively represents the degree to
which the classification is positive or negative. The predictor is linear because
the score is a linear function of w (more on linearity in the next chapter). Again,
in the context of binary classification with binary features, the score aggregates
the contribution of each feature, weighted appropriately. We can think of each
feature present as voting on the classification.

Example 1.2: score. Calculating the score as the weighted sum of features, with feature extractor ϕ from example 1.1.

    weight vector w ∈ Rd              feature vector φ(x) ∈ Rd
    [ length>10     : −1.2 ]          [ length>10     : 1    ]
    [ fracOfAlpha   :  0.6 ]          [ fracOfAlpha   : 0.85 ]
    [ contains_@    :  3   ]          [ contains_@    : 1    ]
    [ endsWith_.com :  2.2 ]          [ endsWith_.com : 1    ]
    [ endsWith_.org :  1.4 ]          [ endsWith_.org : 0    ]

    w · φ(x) = −1.2(1) + 0.6(0.85) + 3(1) + 2.2(1) + 1.4(0) = 4.51


julia> x = "abc@gmail.com";
julia> w = [-1.2, 0.6, 3, 2.2, 1.4];
julia> score(x, w, ϕ)
4.507692307692308


1.1.6 Binary Classification


We now have gathered enough intuition that we can formally define the predictor
f . For each weight vector w, we write f w to denote the predictor that depends
on w and takes the sign of the score. For this section, we will focus on the case
of binary classification. For weight vector w ∈ Rd and feature vector φ( x ) ∈ Rd , a
(binary) linear classifier is defined as:

 +1

 if w · φ( x ) > 0
f w ( x ) = sign(w · φ( x )) = −1 if w · φ( x ) < 0


? if w · φ( x ) = 0

Recall that in this setting, we call the predictor a (binary) classifier. The case
of f w ( x ) = ? is a boundary case that isn’t so important. We can just predict +1
arbitrarily as a matter of convention.

    fw(x) = sign(w · φ(x)) = { +1  if w · φ(x) ≥ 0
                             { −1  otherwise

binary_classifier(x, w, ϕ) = score(x, w, ϕ) ≥ 0 ? +1 : -1

Algorithm 1.2. A binary classifier f to classify input x using weight vector w and feature extractor ϕ. We predict +1 when the score is zero as a matter of convention.

1.1.7 Geometric Intuition


So far, we have talked about linear predictors as weighted combinations of fea-
tures. We can get a bit more insight by studying the geometry of the problem. Let’s
visualize the predictor f w by looking at which points it classifies positive. Specifi-
cally, we can draw a ray from the origin to w (in two dimensions). Points which
form an acute angle with w are classified as positive (dot product is positive),
and points that form an obtuse angle with w are classified as negative. Points
which are orthogonal {z ∈ Rd : w · z = 0} constitute the decision boundary. By
changing w, we change the predictor f w and thus the decision boundary as well.


Example:

w = [2, −1]
φ( x ) ∈ {[2, 0], [0, 2], [2, 4]}
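To make the geometry concrete, here is a minimal sketch (assuming only the weight vector and feature vectors above, supplied directly rather than computed by a feature extractor) that scores each point against w = [2, −1]:

using LinearAlgebra

w = [2, -1]
points = [[2, 0], [0, 2], [2, 4]]

# The sign of the dot product tells us which side of the decision boundary
# each feature vector falls on.
for ϕx in points
    s = w ⋅ ϕx
    println("ϕ(x) = ", ϕx, ", score = ", s, ", prediction = ", (s ≥ 0 ? +1 : -1))
end
# [2, 0] forms an acute angle with w (score 4 > 0, classified +1), [0, 2] an
# obtuse angle (score -2 < 0, classified -1), and [2, 4] is orthogonal to w
# (score 0), so it lies exactly on the decision boundary.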

In general: a binary classifier fw defines a hyperplane decision boundary with normal vector w.
    R2: the hyperplane is a line
    R3: the hyperplane is a plane

1.2 Loss Minimization

So far we have talked about linear predictors f w which are based on a feature
extractor φ and a weight vector w. Now we turn to the problem of estimating
(also known as fitting or learning) w from training data. The loss minimization
framework is to cast learning as an optimization problem. Note the theme of
separating your problem into a model (optimization problem) and an algorithm
(optimization algorithm).

    Dtrain −→ learner −→ f,    x −→ f −→ y

    learner = optimization problem −→ optimization algorithm

Definition: loss function. A loss function Loss( x, y, w) quantifies how un-


happy you would be if you used w to make a prediction on x when the correct
output is y. It is the object we want to minimize.


1.2.1 Score and Margin


Recall that the score on an example (x, y) is w · φ(x), which is how confident we are
in predicting +1. Before we talk about what loss functions look like and how to
learn w, we introduce another important concept, the notion of a margin. Suppose
the correct label is y ∈ {−1, +1}. The margin of an input x is w · φ( x )y, which
measures how correct the prediction that w makes is. The larger the margin the
better, and non-positive margins correspond to classification errors.

Definition: margin. The margin on an example ( x, y) is how correct we are:

(w · φ( x ))y = score × target

margin(x, y, w, ϕ) = score(x, w, ϕ)*y

Algorithm 1.3. The margin of an example (x,y) using weights w and feature extractor ϕ.

Note that if we look at the actual prediction fw(x), we can only ascertain whether the prediction was right or not. By looking at the score and the margin, we can get a more nuanced view into the behavior of the classifier. Geometrically, if ‖w‖ = 1, then the margin of an input x is exactly the distance from its feature vector φ(x) to the decision boundary.

Question: When does a binary classifier err on an example?


• margin less than 0

• margin greater than 0

• score less than 0

• score greater than 0


Figure 1.1. Zero-one loss, plotted as a function of the margin (w · φ(x))y.

Now let us define our first loss function, the zero-one loss. This corresponds exactly to our familiar notion of whether our predictor made a mistake or not. We can also write the loss in terms of the margin.

    Loss0-1(x, y, w) = 1[fw(x) ≠ y]
                     = 1[(w · φ(x))y ≤ 0]

We can plot the loss as a function of the margin. From the graph, it is clear that the loss is 1 when the margin is negative and 0 when it is positive.

Loss_01_f(x, y, w, ϕ, f) = I(f(x, w, ϕ) ≠ y)
Loss_01(x, y, w, ϕ) = I(margin(x, y, w, ϕ) ≤ 0)

Algorithm 1.4. The zero-one loss function for an example (x,y) using weights w, feature extractor ϕ, and classifier f. Note that Loss_01 performs binary classification using the margin.
1.2.2 Linear Regression

Now let's turn for a moment to regression, where the output y is a real number rather than {−1, +1}. Here, the zero-one loss doesn't make sense, because it's unlikely that we're going to predict y exactly.

Let's instead define the residual to measure how close the prediction fw(x) is to the correct y. The residual will play the analogous role of the margin for classification and will let us craft an appropriate loss function.

Figure 1.2. Residual: the amount w · φ(x) − y by which the prediction fw(x) = w · φ(x) overshoots the data point (φ(x), y).

Definition: residual. The residual is the amount by which the prediction fw(x) = w · φ(x) overshoots the target y:

    (w · φ(x)) − y = score − target

residual(x, y, w, ϕ) = score(x, w, ϕ) - y

Algorithm 1.5. The residual of an example (x,y) using weights w and feature extractor ϕ.


1.2.3 Regression Loss Functions

A popular and convenient loss function to use in linear regression is the squared loss, which penalizes the residual of the prediction quadratically. If the predictor is off by a residual of 10, then the loss will be 100.

    Loss_squared(x, y, w) = (fw(x) − y)²
                          = (w · φ(x) − y)²

Loss_squared(x, y, w, ϕ) = residual(x, y, w, ϕ)^2

Algorithm 1.6. The squared loss for an example (x,y) using weights w and feature extractor ϕ.

An alternative to the squared loss is the absolute deviation loss, which simply takes the absolute value of the residual.

    Loss_absdev(x, y, w) = |fw(x) − y|
                         = |w · φ(x) − y|

Loss_absdev(x, y, w, ϕ) = abs(residual(x, y, w, ϕ))

Algorithm 1.7. The absolute deviation loss for an example (x,y) using weights w and feature extractor ϕ.

Example of Loss_squared and Loss_absdev:

    w = [2, −1], φ(x) = [2, 0], y = −1
    residual = w · φ(x) − y = 4 − (−1) = 5
    Loss_squared(x, y, w) = 5² = 25
    Loss_absdev(x, y, w) = |5| = 5

julia> w = [2, -1]; ϕ = x->[2, 0]; y = -1; x = missing;
julia> Loss_squared(x, y, w, ϕ)
25
julia> Loss_absdev(x, y, w, ϕ)
5


1.2.4 Loss Minimization Framework


So far: one example, Loss( x, y, w) is easy to minimize. Note that on one example,
both the squared and absolute deviation loss functions have the same minimum,
so we cannot really appreciate the differences here. However, we are learning w
based on a whole training set Dtrain , not just one example. We typically minimize
the training loss (also known as the training error or empirical risk), which is the
average loss over all the training examples.

Key idea: minimize training loss

    TrainLoss(w) = (1/|Dtrain|) ∑_{(x,y)∈Dtrain} Loss(x, y, w)

    min_{w∈Rd} TrainLoss(w)

Key: need to set w to make global tradeoffs—not every example can be happy.

function TrainLoss(w, 𝒟train, ϕ, Loss)
    return sum(Loss(x,y,w,ϕ) for (x,y) ∈ 𝒟train)/length(𝒟train)
end

minimize(𝒟train, ϕ, Loss) =
    minimize(w->TrainLoss(w, 𝒟train, ϕ, Loss))

Algorithm 1.8. Training loss for weights w and data 𝒟train using feature extractor ϕ and loss function Loss.
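As a quick sanity check, here is a minimal usage sketch with a hypothetical three-example data set and the identity feature extractor (it assumes the earlier definitions of score, residual, and Loss_squared, plus using LinearAlgebra for the dot product in score):

using LinearAlgebra

𝒟train = [([1, 0], 2), ([1, 0], 4), ([0, 1], -1)]
ϕ = x -> x                                         # identity features

TrainLoss([3.0, -1.0], 𝒟train, ϕ, Loss_squared)    # (1 + 1 + 0)/3 ≈ 0.667
TrainLoss([0.0,  0.0], 𝒟train, ϕ, Loss_squared)    # (4 + 16 + 1)/3 = 7.0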

Importantly, such an optimization problem requires making tradeoffs across


all the examples (in general, we won’t be able to set w to a single value that makes
every example have low loss).
Now the question of which loss we should use becomes more interesting. For
example, consider the case where all the inputs are φ( x ) = 1. Essentially the
problem becomes one of predicting a single value y∗ which is the least offensive
towards all the examples.
If our loss function is the squared loss, then the optimal value is the mean y* = (1/|Dtrain|) ∑_{(x,y)∈Dtrain} y. If our loss function is the absolute deviation loss, then the optimal value is the median. The median is more robust to outliers: you can move the furthest point arbitrarily farther out without affecting the median. This makes sense given that the squared loss penalizes large residuals a lot more.

Which regression loss to use? For the training data Dtrain and feature vector φ(x) = x, the weight w that minimizes the least squares (L2) regression training loss is the mean y. The mean tries to accommodate every example (i.e., it is "popular").

    Loss_squared(x, y, w) = (w · φ(x) − y)²

For least absolute deviation (L1) regression, the weight w that minimizes the training loss is the median y. The median is more robust to outliers.

    Loss_absdev(x, y, w) = |w · φ(x) − y|

In summary, this is an example of where the choice of the loss function has a qualitative impact on the weights learned, and we can study these differences in terms of the objective function without thinking about optimization algorithms.

Figure 1.3. Training loss in R (a single weight w) with its minimum.

1.3 Optimization Problem

Having defined a bunch of different objective functions that correspond to training loss, we would now like to optimize them—that is, obtain an algorithm that outputs the w where the objective function achieves the minimum value.

    min_{w∈Rd} TrainLoss(w)

1.3.1 Gradient Descent

Figure 1.4. Training loss in R2 (weights w1 and w2) with its minimum.

Definition: gradient. The gradient ∇w TrainLoss(w) is the direction that increases the loss the most.

A general approach is to use iterative optimization, which essentially starts at some starting point w (say, all zeros), and tries to tweak w so that the objective function value decreases.


function GradientDescent(Dtrain, ∇TrainLoss)
    Initialize w = [0, . . . , 0]
    for t = 1, . . . , T do:
        w ← w − η ∇w TrainLoss(w)      (η: step size, ∇w TrainLoss(w): gradient)

Algorithm 1.9. Gradient descent.

init_weights(d) = d == 1 ? zero(d) : zeros(d)

function gradient_descent(𝒟train, ϕ, ∇TrainLoss; η=0.1, T=100)
    w = init_weights(length(ϕ(𝒟train[1][1])))
    for t in 1:T
        w = w - η*∇TrainLoss(w, 𝒟train, ϕ)
    end
    return w
end

Algorithm 1.10. Gradient descent on the weights w to minimize the least squares training loss function over the training data 𝒟train using feature extractor ϕ, step size η, and number of iterations T.

To do this, we will rely on the gradient of the function, which tells us which
direction to move in to decrease the objective the most. The gradient is a valu-
able piece of information, especially since we will often be optimizing in high
dimensions (d on the order of thousands).
This iterative optimization procedure is called gradient descent. Gradient descent
has two hyperparameters, the step size η (which specifies how aggressively we want
to pursue a direction) and the number of iterations T. Let’s not worry about how
to set them, but you can think of T = 100 and η = 0.1 for now.

1.3.2 Least Squares Regression


All that’s left to do before we can use gradient descent is to compute the gradient
of our objective function TrainLoss. The calculus can usually be done by hand;
combinations of the product and chain rule suffice in most cases for the functions
we care about. Note that the gradient often has a nice interpretation. For squared
loss, it is the residual (prediction − target) times the feature vector φ( x ). Our
objective function is:

    TrainLoss(w) = (1/|Dtrain|) ∑_{(x,y)∈Dtrain} (w · φ(x) − y)²


Compute the gradient using the chain rule:

    ∇w TrainLoss(w) = (1/|Dtrain|) ∑_{(x,y)∈Dtrain} ∇w Loss_squared(x, y, w)
                    = (1/|Dtrain|) ∑_{(x,y)∈Dtrain} 2(w · φ(x) − y) φ(x)

where (w · φ(x) − y) is the residual (prediction − target).

Note that for linear predictors, the gradient is always something times φ( x )
because w only affects the loss through w · φ( x ).

∇Loss_squared(x, y, w, ϕ) = 2residual(x, y, w, ϕ)*ϕ(x)

function ∇TrainLoss(w, 𝒟train, ϕ)
    return sum(∇Loss_squared(x, y, w, ϕ)
               for (x,y) ∈ 𝒟train)/length(𝒟train)
end

Algorithm 1.11. Training loss gradient for least squares regression.
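Putting the pieces together, here is a minimal usage sketch on the same hypothetical three-example data set as before (assuming the definitions above and using LinearAlgebra for the dot product in score):

𝒟train = [([1, 0], 2), ([1, 0], 4), ([0, 1], -1)]
ϕ = x -> x
w = gradient_descent(𝒟train, ϕ, ∇TrainLoss; η=0.1, T=100)
# w approaches [3, -1], the least squares solution for this data set.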

1.3.3 Gradient Descent is Slow


The problem with gradient descent is that each iteration requires going over all the training examples—which is expensive when you have lots of data!

w ← w − η ∇w TrainLoss(w)

We can now apply gradient descent on any of our objective functions that we
defined before and have a working algorithm. But it is not necessarily the best
algorithm. One problem (but not the only problem) with gradient descent is that
it is slow. Those of you familiar with optimization will recognize that methods like
Newton’s method can give faster convergence, but that’s not the type of slowness
I’m talking about here. Rather, it is the slowness that arises in large-scale machine
learning applications. Recall that the training loss is a sum over the training data.
If we have one million training examples (which is, by today’s standards, only a
modest number), then each gradient computation requires going through those
one million examples, and this must happen before we can make any progress.
Can we make progress before seeing all the data?


1.3.4 Stochastic Gradient Descent


In stochastic gradient descent (SGD), rather than looping through all the training
examples to compute a single gradient and making one step, SGD loops through
the examples ( x, y) and updates the weights w based on each example. Each
update is not as good because we’re only looking at one example rather than all
the examples, but we can make many more updates this way.

function StochasticGradientDescent(Dtrain, ∇Loss)
    Initialize w = [0, . . . , 0]
    for t = 1, . . . , T do:
        for (x, y) ∈ Dtrain do:
            w ← w − η ∇w Loss(x, y, w)

Algorithm 1.12. Stochastic gradient descent.

Key idea: stochastic updates. It’s not about quality, it’s about quantity.

function stochastic_gradient_descent(𝒟train, ϕ, ∇Loss; η=0.1, T=100)
    w = init_weights(length(ϕ(𝒟train[1][1])))
    for t in 1:T
        for (x,y) ∈ 𝒟train
            w = w - η*∇Loss(x, y, w, ϕ)
        end
    end
    return w
end

Algorithm 1.13. Stochastic gradient descent over training data 𝒟train using feature extractor ϕ, gradient of the loss function ∇Loss, step size η, and number of passes T.
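A minimal usage sketch, mirroring the gradient descent example above (the data set is hypothetical; ∇Loss_squared is the per-example gradient from algorithm 1.11):

𝒟train = [([1, 0], 2), ([1, 0], 4), ([0, 1], -1)]
ϕ = x -> x
w = stochastic_gradient_descent(𝒟train, ϕ, ∇Loss_squared; η=0.1, T=100)
# With a constant step size, w hovers near the least squares solution [3, -1]
# rather than converging to it exactly; each update touches only one example.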

In practice, we often find that just performing one pass over the training ex-
amples with SGD, touching each example once, often performs comparably to
taking ten passes over the data with GD. There are also other variants of SGD.
You can randomize the order in which you loop over the training data in each
iteration, which is useful. Think about what would happen if you had all the
positive examples first and the negative examples after that.


1.3.5 Step Size


One remaining issue is choosing the step size, which in practice (and as we have
seen) is actually quite important. Generally, larger step sizes are like driving fast.
You can get faster convergence, but you might also get very unstable results and
crash and burn. On the other hand, with smaller step sizes you get more stability,
but you might get to your destination more slowly. What should the step size η be?

    0 ---------------------------- 1    η
    conservative, more stable           aggressive, faster

• Constant: η = 0.1

• Decreasing, decaying: η = 1/√(# updates made so far)

A suggested form for the step size is to set the initial step size to 1 and let the step size decrease as the inverse of the square root of the number of updates we've taken so far. There are some nice theoretical results showing that SGD is guaranteed to converge in this case.¹

¹ Provided all your gradients have bounded length.

mutable struct Decay
    i # iteration
end

import Base: *
*(δη::Decay, x) = 1/sqrt(δη.i+=1) * x

Algorithm 1.14. Inverse square root step size decay. The iteration i will automatically increment every time the Decay is multiplied.
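A usage sketch (hypothetical values): pass a Decay object wherever a numeric step size is multiplied against a gradient, so that the t-th update is scaled by 1/√t:

η = Decay(0)
w = [0.0, 0.0]
g = [1.0, -2.0]     # some per-example gradient
w = w - η*g         # first update: g is scaled by 1/√1
w = w - η*g         # second update: g is scaled by 1/√2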

1.3.6 Summary
In summary we have seen (1) the functions we’re considering (linear predictors),
(2) the criterion for choosing one (loss minimization), and (3) an algorithm
that goes after that criterion (SGD). We already worked out a linear regression
example. What are good loss functions for binary classification?

1. Linear predictors:
f w ( x ) based on score w · φ( x )


2. Loss minimization: learning as optimization

    min_w TrainLoss(w)

3. Stochastic gradient descent: optimization algorithm

    w ← w − η ∇w Loss(x, y, w)

1.3.7 Zero-One Loss

    Loss0-1(x, y, w) = 1[(w · φ(x))y ≤ 0]

Figure 1.5. Zero-one loss as a function of the margin (w · φ(x))y.

Recall that we have the zero-one loss for classification. But the main problem with zero-one loss is that it's hard to optimize (in fact, it's provably NP hard in the worst case). And in particular, we cannot apply gradient-based optimization to it, because the gradient is zero (almost) everywhere.

Loss_01(x, y, w, ϕ) = I(margin(x, y, w, ϕ) ≤ 0)

Problems with zero-one loss:

• Gradient of Loss0-1 is 0 everywhere, therefore SGD is not applicable.

• Loss0-1 is insensitive to how badly the model messed up.

1.3.8 Hinge Loss (SVMs)


Losshinge ( x, y, w) = max{1 − (w · φ( x ))y, 0}


Hinge loss upper bounds Loss0-1 and has a non-trivial gradient: it tries to increase the margin if it is less than 1. To fix this problem, we can use the hinge loss, which is an upper bound on the zero-one loss. Minimizing an upper bound is a general idea; the hope is that pushing down the upper bound leads to pushing down the actual function.

Loss_hinge(x, y, w, ϕ) = max(1 - margin(x, y, w, ϕ), 0)

Algorithm 1.15. The hinge loss function, an upper bound on the zero-one loss function.

Figure 1.6. Hinge loss, plotted with the zero-one loss as a function of the margin (w · φ(x))y.

Advanced: The hinge loss corresponds to the Support Vector Machine (SVM) objective function with one important difference. The SVM objective function also includes a regularization penalty ‖w‖², which prevents the weights from getting too large. We will get to regularization later in the course, so you needn't worry about this for now. But if you're curious, read on.

Why should we penalize ‖w‖²? One answer is Occam's razor, which says to find the simplest hypothesis that explains the data. Here, simplicity is measured in the length of w. This can be made formal using statistical learning theory (take CS229T if you want to learn more).

Perhaps a less abstract and more geometric reason is the following. Recall that we defined the (algebraic) margin to be w · φ(x)y. The actual (signed) distance from a point to the decision boundary is (w/‖w‖) · φ(x)y; this is called the geometric margin. So the loss being zero (that is, Loss_hinge(x, y, w) = 0) is equivalent to the algebraic margin being at least 1 (that is, w · φ(x)y ≥ 1), which is equivalent to the geometric margin being at least 1/‖w‖ (that is, (w/‖w‖) · φ(x)y ≥ 1/‖w‖). Therefore, reducing ‖w‖ increases the geometric margin. For this reason, SVMs are also referred to as max-margin classifiers.

geometric_margin(x, y, w, ϕ) = w/norm(w)⋅ϕ(x)*y

Algorithm 1.16. The geometric margin is the signed distance from a point to the decision boundary.
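A quick check of the hinge loss and the geometric margin, reusing the hypothetical setup from the regression loss example (w = [2, −1], ϕ(x) = [2, 0], y = −1, so the margin is (w⋅ϕ(x))y = −4):

w = [2, -1]; ϕ = x -> [2, 0]; y = -1; x = missing
Loss_hinge(x, y, w, ϕ)         # max(1 - (-4), 0) = 5
geometric_margin(x, y, w, ϕ)   # (w/‖w‖)⋅ϕ(x)y = -4/√5 ≈ -1.79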

1.3.9 Logistic Regression

    Loss_logistic(x, y, w) = log(1 + e^(−(w·φ(x))y))


Loss_logistic(x, y, w, ϕ) = log(1 + exp(-margin(x, y, w, ϕ)))

Algorithm 1.17. The logistic loss function.

Another popular loss function used in machine learning is the logistic loss: it tries to increase the margin even when it already exceeds 1. The main property of the logistic loss is that no matter how correct your prediction is, you will have non-zero loss, and so there is still an incentive (although a diminishing one) to push the margin even larger. This means that you'll update on every single example. There are some connections between logistic regression and probabilistic models, which we will get to later.

Figure 1.7. Logistic loss, plotted with the zero-one and hinge losses as a function of the margin (w · φ(x))y.


A gradient exercise. Compute the gradient of the hinge loss:

Losshinge ( x, y, w) = max{1 − (w · φ( x ))y, 0}

You should try to ‘‘see’’ the solution before you write things down formally.
Pictorially, it should be evident: when the margin is less than 1, then the
gradient is the gradient of 1 − (w · φ( x ))y, which is equal to −φ( x )y. If the
margin is larger than 1, then the gradient is the gradient of 0, which is 0.

    ∇w Loss_hinge(x, y, w) = { −φ(x)y  if w · φ(x)y < 1
                             {    0    if w · φ(x)y > 1

What about when the margin is exactly 1? Technically, the gradient doesn’t
exist because the hinge loss is not differentiable there. Practically speaking,
we can take either −φ( x )y or 0 (or anything in between).

∇Loss_hinge(x, y, w, ϕ) =
    margin(x, y, w, ϕ) < 1 ? -ϕ(x)*y : zero(ϕ(x))  # zero vector keeps w - η*∇Loss well-defined

Technical note (can be skipped): given f(w), the gradient ∇f(w) is only defined at points w where f is differentiable. However, subdifferentials ∂f(w) are defined at every point (for convex functions). The subdifferential is a set of vectors called subgradients z ∈ ∂f(w) which define linear underapproximations to f, namely f(w) + z · (w′ − w) ≤ f(w′) for all w′.

1.3.10 Summary

Table 1.1. Summary of machine learning.

    score = w · φ(x)       Classification                 Regression
    Predictor fw           sign(score)                    score
    Relate to correct y    margin (score · y)             residual (score − y)
    Loss functions         zero-one, hinge, logistic      squared, absolute deviation
    Algorithm              stochastic gradient descent    stochastic gradient descent



2 Machine Learning II

From CS221 Spring 2020, Percy Liang, Chelsea Finn & Nima Anari, Stanford University.

Recall from last chapter that learning is the process of taking training data and turning it into a model (predictor). Last chapter, we started by studying the predictor f, concerning ourselves with linear predictors based on the score w · φ(x), where w is the weight vector we wish to learn and φ is the feature extractor that maps an input x to some feature vector φ(x) ∈ Rd, turning something that is domain-specific (images, text) into a mathematical object. Then we looked at how to learn such a predictor by formulating an optimization problem and developing an algorithm to solve that problem. Recall that the optimization problem was to minimize the training loss, which is the average loss over all the training examples.

    TrainLoss(w) = (1/|Dtrain|) ∑_{(x,y)∈Dtrain} Loss(x, y, w)

    min_{w∈Rd} TrainLoss(w)

The actual loss function depends on what we’re trying to accomplish. Generally,
the loss function takes the score w · φ( x ), compares it with the correct output
y to form either the residual (for regression) or the margin (for classification).
Regression losses are smallest when the residual is close to zero. Classification
losses are smallest when the margin is large. Which loss function we choose
depends on the desired properties. For example, the absolute deviation loss for
regression is robust against outliers. The logistic loss for classification never relents
in encouraging large margin.

Review: Optimization Algorithms. Finally, we introduced two very simple algorithms to minimize the training loss, both based on iteratively computing the gradient of the objective with respect to the parameters w and stepping in the opposite direction of the gradient.

Note that we've been talking about the loss on a single example, and plotting it in 1D against the residual or the margin. Recall that what we're actually optimizing is the training loss, which sums over all data points. To help visualize the connection between a single loss plot and the more general picture, consider the simple example of linear regression on three data points: ([1, 0], 2), ([1, 0], 4), and ([0, 1], −1), where φ(x) = x.

    φ(x)       y      Loss(x, y, w) = (w · φ(x) − y)²
    [1, 0]     2      (w1 − 2)²
    [1, 0]     4      (w1 − 4)²
    [0, 1]    −1      (w2 − (−1))²

Let's try to draw the training loss, which is a function of w = [w1, w2]. Specifically, the training loss is

    TrainLoss(w) = (1/3)((w1 − 2)² + (w1 − 4)² + (w2 − (−1))²)

The first two points contribute a quadratic term sensitive to w1, and the third point contributes a quadratic term sensitive to w2. When you combine them, you get a quadratic centered at [3, −1].

function GradientDescent(Dtrain, ∇TrainLoss)
    Initialize w = [0, . . . , 0]
    for t = 1, . . . , T do:
        w ← w − η ∇w TrainLoss(w)

Algorithm 2.1. Gradient descent.

function StochasticGradientDescent(Dtrain, ∇Loss)
    Initialize w = [0, . . . , 0]
    for t = 1, . . . , T do:
        for (x, y) ∈ Dtrain do:
            w ← w − η ∇w Loss(x, y, w)

Algorithm 2.2. Stochastic gradient descent.

Think about a ball at the current weight vector and rolling it down the surface of the training loss objective. Gradient descent (GD) computes the gradient of the full training loss, which can be slow for large datasets. Stochastic gradient descent (SGD), which approximates the gradient of the training loss with the gradient of the loss at a single example, generally takes less time. In both cases, one must be careful to set the step size η properly (not too big, not too small).
The first half of this chapter is about thinking about the feature extractor φ.
Features are a critical part of machine learning which often do not get as much
attention as they deserve. Ideally, they would be given to us by a domain expert,
and all we (as machine learning people) have to do is to stick them into our
learning algorithm. While one can get considerable mileage out of doing this, the
interface between general-purpose machine learning and domain knowledge is
often nuanced, so to be successful, it pays to understand this interface.
In the second half of this chapter, we return to learning, rip out the linear
predictors that we had from before, and show how we can build more powerful
neural network classifiers given the features that we extracted.


Can we obtain decision boundaries which are circles by using linear classi-
fiers? The answer is yes.
This might seem paradoxical since we are only working with linear clas-
sifiers. But as we will see later, linear refers to the relationship between the
weight vector w and the prediction score (not the input x, which might not
even be a real vector), whereas the decision boundary refers to how the
prediction varies as a function of x.
Advanced: Sometimes people might think that linear classifiers are not
expressive, and that you need neural networks to get expressive and non-
linear classifiers. This is false. You can build arbitrarily expressive models
with the machinery of linear classifiers (see kernel methods). The advantages
of neural networks are the computational benefits and the inductive bias that
comes from the particular neural network architecture.

2.1 Features

There are two components to classification. The score, which drives the prediction, is

    w · φ(x)

• Previous Chapter: learning chooses w via optimization.

• This Chapter: feature extraction specifies φ( x ) based on domain knowledge.

As a reminder, the prediction is driven by the score w · φ( x ). In regression,


we predict the score directly, and in binary classification, we predict the sign of
the score. Both w and φ( x ) play an important role in prediction. So far, we have
fixed φ( x ) and used learning to set w. Now, we will explore how φ( x ) affects the
prediction.


2.1.1 Organization of Features


Task: Predict whether a string is an email address.

                                              [ length>10     : 1    ]
                                              [ fracOfAlpha   : 0.85 ]
    "abc@gmail.com" −−feature extractor−→     [ contains_@    : 1    ]
                       (arbitrary!)           [ endsWith_.com : 1    ]
                                              [ endsWith_.org : 0    ]

Which features to include? We need an organizational principle. How would we go about creating good features? Here, we used our prior knowledge to define certain features (contains_@) which we believe are helpful for detecting email addresses. But this is ad-hoc: which strings should we include? We need a more systematic way to go about this.

2.1.2 Feature Templates

Definition: feature template (informal). A feature template is a group of


features all computed in a similar way.

Input: "abc@gmail.com", with some feature templates:

• Length greater than ___

• Last three characters equals ___

• Contains character ___

• Pixel intensity of position ___, ___

A useful organization principle is a feature template, which groups all the fea-
tures which are computed in a similar way. (People often use the word ”feature”
when they really mean ”feature template”.) A feature template also allows us
to define a set of related features (contains_@, contains_a, contains_b). This
reduces the amount of burden on the feature engineer since we don’t need to
know which particular characters are useful, but only that existence of certain
single characters is a useful cue to look at.


We can write each feature template as an English description with a blank (___), which is to be filled in with an arbitrary string. Also note that feature templates are most natural for defining binary features, ones which take on value 1 (true) or 0 (false). Note that an isolated feature (fraction of alphanumeric characters) can be treated as a trivial feature template with no blanks to be filled.
As another example, if x is a k × k image, then {pixelIntensity_ij : 1 ≤ i, j ≤ k} is a feature template consisting of k² features, whose values are the pixel intensities at various positions of x.

Feature template example. Last three characters equals ___

                        [ endsWith_aaa : 0 ]
                        [ endsWith_aab : 0 ]
                        [      ...         ]
    "abc@gmail.com" −→  [ endsWith_com : 1 ]
                        [      ...         ]
                        [ endsWith_zzz : 0 ]

This is an example of one feature template mapping onto a group of m³ features, where m (26 in this example) is the number of possible characters.

last_three_chars___(x, 𝚺='a':'z') =
    [endswith(x, a*b*c) for a in 𝚺, b in 𝚺, c in 𝚺]
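A small usage sketch of this template (values computed under the definitions above): the template expands into 26³ = 17,576 features, and for any particular input at most one of them fires, which previews the sparsity discussion below.

v = last_three_chars___("abc@gmail.com")
length(v), sum(v)    # (17576, 1): only the endsWith_com feature is nonzero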

2.1.3 Sparsity in Feature Vectors


Inefficient to represent all the zeros. In general, a feature template corresponds
to many features. It would be inefficient to represent all the features explicitly.
Fortunately, the feature vectors are often sparse, meaning that most of the feature
values are 0. It is common for all but one of the features to be 0. This is known as
a one-hot representation of a discrete value such as a character.

2.1.4 Feature Vector Representation


Let’s now talk a bit more about implementation. There are two common ways to
define features: using arrays or using maps. Arrays assume a fixed ordering of


Feature template example. Last character equals ___

                        [ endsWith_a : 0 ]
                        [ endsWith_b : 0 ]
    "abc@gmail.com" −→  [ endsWith_c : 0 ]
                        [     ...        ]
                        [ endsWith_z : 0 ]

last_char_equals___(x, 𝚺='a':'z') = [endswith(x, c) for c in 𝚺]

the features and represent the feature values as an array. This representation is
appropriate when the number of nonzeros is significant (the features are dense).
Arrays are especially efficient in terms of space and speed (and you can take
advantage of GPUs). In computer vision applications, features (e.g., the pixel
intensity features) are generally dense, so array representation is more common.

    [ fracOfAlpha : 0.85 ]
    [ contains_a  : 0    ]
    [     ...            ]
    [ endsWith_@  : 1    ]
    [     ...            ]

• Array representation (good for dense features):

[0.85, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

• Map representation (good for sparse features):

{"fracOfAlpha" : 0.85, "contains_@" : 1}

However, when we have sparsity (few nonzeros), it is typically more efficient


to represent the feature vector as a map from strings to doubles rather than a fixed-
size array of doubles. The features not in the map implicitly have a default value
of zero. This sparse representation is very useful in natural language processing,
and is what allows us to work effectively over trillions of features. In Python,
one would define a feature vector φ( x ) as the dictionary "endsWith_"+x[-3:]:


Finally, it is important to be clear when describing features. Saying ”length”
might mean that there is one feature whose value is the length of x or that there
could be a feature template ”length is equal to ___”. These two encodings of the
same information can have a drastic impact on prediction accuracy when using a
linear predictor, as we’ll see later.

2.1.5 Hypothesis Class


Having discussed how feature templates can be used to organize groups of fea-
tures and allow us to leverage sparsity, let us further study how features impact
prediction. The key notion is that of a hypothesis class, which is the set of all possi-
ble predictors that you can get by varying the weight vector w. Thus, the feature
extractor φ specifies a hypothesis class F . This allows us to take data and learning
out of the picture.

Predictor: f w ( x ) = w · φ( x ) or sign(w · φ( x ))

Definition: hypothesis class. A hypothesis class is the set of possible predic-


tors with a fixed φ( x ) and varying w:

F = { f w : w ∈ Rd }

2.1.6 Feature Extraction and Learning


Stepping back, we can see the two stages more clearly. First, we perform feature
extraction (given domain knowledge) to specify a hypothesis class F . Second, we
perform learning (given training data) to obtain a particular predictor f w ∈ F .

• Feature extraction: set F based on domain knowledge

• Learning: set f w ∈ F based on data


Note that if the hypothesis class doesn’t contain any good predictors, then no
amount of learning can help. So the question when extracting features is really
whether they are powerful enough to express predictors which are good. It’s
okay and expected that F will contain a bunch of bad ones as well. Later, we’ll
see reasons for keeping the hypothesis class small (both for computational and
statistical reasons), because we can’t get the optimal w for any feature extractor φ
we choose.

Example 2.1. Beyond linear functions.

Regression: x ∈ R, y ∈ R

Linear functions:

    φ(x) = x
    F1 = {x ↦ w1 x : w1 ∈ R}

Quadratic functions:

    φ(x) = [x, x²]
    F2 = {x ↦ w1 x + w2 x² : w1 ∈ R, w2 ∈ R}

Given a fixed feature extractor φ, let us consider the space of all predictors f w
obtained by sweeping w over all possible values. If we use φ( x ) = x, then we get
linear functions that go through the origin. However, we want to have functions
that ”bend” (or are not monotonic). For example, if we want to predict someone’s
health given his or her body temperature, there is a sweet spot temperature (37
C) where the health is optimal; both higher and lower values should cause the
health to decline. If we use φ( x ) = [ x, x2 ], then we get quadratic functions that
go through the origin, which are a strict superset of the linear functions, and
therefore are strictly more expressive.
However, even quadratic functions can be limiting because they have to rise and fall in a certain (parabolic) way. What if we wanted a more flexible, freestyle approach? We can create piecewise constant functions by defining features that "fire" (are 1) on particular regions of the input (e.g., 1 < x ≤ 2). Each feature gets associated with its own weight, which in this case corresponds to the desired function value in that region. Thus by varying the weight vector, we get piecewise constant functions with a particular discretization level. We can increase or decrease the discretization level as we need.

Example 2.2. Even more flexible functions.

Regression: x ∈ R, y ∈ R

Piecewise constant functions:

    φ(x) = [1[0 < x ≤ 1], 1[1 < x ≤ 2], . . .]
    F3 = {x ↦ ∑_{j=1}^{10} wj 1[j − 1 < x ≤ j] : w ∈ R10}
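A minimal sketch of these piecewise constant features in Julia, using the indicator I from example 1.1 (the choice of 10 unit-width bins mirrors F3 above):

ϕ_piecewise(x) = [I(j-1 < x ≤ j) for j in 1:10]

With these features, the linear predictor w ⋅ ϕ_piecewise(x) simply returns wj on the interval (j−1, j], so each weight is the function value on its bin.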

Advanced: what happens if x were not a scalar, but a d-dimensional vector?
We could perform discretization in Rd , but the number of features grows expo-
nentially in d, which leads to a phenomenon called the curse of dimensionality.

2.1.7 Linearity
Wait a minute...how were we able to get non-linear predictions using linear
predictors? It is important to remember that for linear predictors, it is the score
w · φ( x ) that is linear in w and φ( x ) (read it off directly from the formula). In
particular, the score is not linear in x (it sometimes doesn't even make sense because x need not be a vector at all—it could be a string or a PDF file). Also, neither the predictor fw (unless we're doing linear regression) nor the loss function TrainLoss(w) is linear in anything. For the prediction driven by the score w · φ(x):

Linear in w? Yes
Linear in φ( x )? Yes
Linear in x? No! (x not necessarily even a vector)

Key idea: non-linearity.

• Predictors f w ( x ) can be expressive non-linear functions and decision


boundaries of x.

• Score w · φ( x ) is linear function of w, which permits efficient learning.


The significance is as follows: From the feature extraction viewpoint, we can


define arbitrary features that yield very non-linear functions in x. From the learn-
ing viewpoint (only looking at φ( x ), not x), linearity plays an important role in
being able to optimize the weights efficiently (as it leads to convex optimization
problems).

2.1.8 Geometric viewpoint


    φ(x) = [1, x1, x2, x1² + x2²]
How to relate non-linear decision boundary in x space with linear decision bound-
ary in φ( x ) space? Let’s try to understand the relationship between the non-
linearity in x and linearity in φ( x ). We consider binary classification where our
input is x = [ x1 , x2 ] ∈ R2 a point on the plane. With the quadratic features
φ( x ), we can carve out the decision boundary corresponding to an ellipse (think
about the formula for an ellipse and break it down into monomials). We can now
look at the feature vectors φ( x ), which include an extra dimension. In this 3D
space, a linear predictor (defined by the hyperplane) actually corresponds to the
non-linear predictor in the original 2D space.
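As an illustrative sketch of this viewpoint (the weights below are hand-picked for this sketch, an assumption, so that the boundary is the unit circle x1² + x2² = 1):

using LinearAlgebra

ϕ_quad(x) = [1, x[1], x[2], x[1]^2 + x[2]^2]
w = [1, 0, 0, -1]                      # score = 1 - (x1² + x2²)
classify(x) = sign(w ⋅ ϕ_quad(x))      # +1 inside the unit circle, -1 outside

classify([0.5, 0.5])   # +1: inside the circle
classify([2.0, 0.0])   # -1: outside the circle

The score is linear in w and in ϕ_quad(x), yet the resulting decision boundary in the original x space is a circle.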

Example 2.3. An example task: detecting responses.

Input x: two consecutive messages in a chat
Output y ∈ {+1, −1}: whether the second message is a response to the first

Recall, feature extractor φ should pick out properties of x that might be useful for prediction of y.

Let’s apply what you’ve learned about feature extraction to a concrete prob-
lem. The motivation here is that messaging platforms often just show a single
stream of messages, when there is generally a grouping of messages into coherent
conversations. How can we build a classifier that can group messages automat-
ically? We can formulate this as a binary classification problem where we look
at two messages and determine whether or not these two are part of the same
conversation.


Question. What feature templates would you use for predicting whether
the second message is a response to the first?
• time elapsed

• time elapsed is between ___ and ___

• first message contains ___

• second message contains ___

• two messages both contain ___

• two messages have ___ common words

2.1.9 Summary
• Feature templates: organize related (sparse) features

• Hypothesis class: defined by features (what is possible)

• Linear classifiers: can produce non-linear decision boundaries

2.2 Neural Networks

What we’ve shown so far is that by being mildly clever with choosing the feature
extractor φ, we can actually get quite a bit of mileage out of our so-called linear
predictors. However, sometimes we don’t know what features are good to use,
either because the prediction task is non-intuitive or we don’t have time to figure
out which features are suitable. Sometimes, we think we might know what features
are good, but then it turns out that they aren’t (this happens a lot!). In the spirit
of machine learning, we’d like to automate things as much as possible. In this
context, it means creating algorithms that can take whatever crude features we
have and turn them into refined predictions, thereby shifting the burden off
feature extraction and moving it to learning.


Neural networks have been around for many decades, but they fell out of favor
because they were difficult to train. In the last decade, there has been a huge
resurgence of interest in neural networks since they perform so well and training
seems to not be such an issue when you have tons of data and compute. In a sense,
neural networks allow one to automatically learn the features of a linear classifier
which are geared towards the desired task, rather than specifying them all by
hand.

Example 2.4. Predicting car collisions.

As a motivating example, consider the problem of predicting whether two cars at positions x₁ and x₂ are going to collide. Suppose the true output is +1 (safe) whenever the cars are separated by a distance of at least 1. Clearly, this decision is not linear.
Input: position of two oncoming cars x = [ x1 , x2 ]
Output: whether safe (y = +1) or collide (y = −1)
True function: safe if cars sufficiently far

y = sign(| x1 − x2 | − 1)

Examples:
    x          y
    [1, 3]     +1
    [3, 1]     +1
    [1, 0.5]   −1

2.2.1 Decomposing the problem


The intuition is to break up the problem into two subproblems, which test if car 1
(or car 2) is to the far right. Given these two binary values h1 , h2 , we can declare
safety if at least one of them is true.

• Test if car 1 is far right of car 2: h1 = 1[ x1 − x2 ≥ 1]

• Test if car 2 is far right of car 1: h2 = 1[ x2 − x1 ≥ 1]

• Safe if at least one is true: y = sign(h1 + h2 )


    x          h₁   h₂   y
    [1, 3]     0    1    +1
    [3, 1]     1    0    +1
    [1, 0.5]   0    0    −1

2.2.2 Learning strategy


Having written y in a specific way, let us try to generalize to a family of predictors
(this seems to be a recurring theme). We can define v1 = [−1, 1, −1] and v2 =
[−1, −1, 1] and w1 = w2 = 1 to accomplish this. At a high-level, we have defined
two intermediate subproblems, that of predicting h1 and h2 . These two values are
hidden in the sense that they are not specified to be anything. They just need to be
set in a way such that y is linearly predictable from them. Define: φ( x ) = [1, x1 , x2 ].
Intermediate hidden subproblems:

h₁ = 1[x₁ − x₂ ≥ 1] = 1[v₁ · φ(x) ≥ 0],   with v₁ = [−1, +1, −1]
h₂ = 1[x₂ − x₁ ≥ 1] = 1[v₂ · φ(x) ≥ 0],   with v₂ = [−1, −1, +1]

Final prediction:

y = sign(h₁ + h₂)
fV,w(x) = sign(w₁h₁ + w₂h₂),   with w = [1, 1]

Key idea: joint learning. Goal is to learn both hidden subproblems V =


(v1 , v2 ) and combination weights w = [w1 , w2 ].
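Before worrying about learning, we can sanity-check the hand-crafted decomposition with a small sketch (a zero score is treated as "collide" here, matching the table above):

ϕ(x) = [1.0, x[1], x[2]]
v₁, v₂ = [-1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]
w = [1.0, 1.0]

h(v, x) = sum(v .* ϕ(x)) ≥ 0 ? 1.0 : 0.0                  # hard-threshold hidden unit
f(x) = (w[1]*h(v₁, x) + w[2]*h(v₂, x) > 0) ? +1 : -1      # safe if at least one test is true

[f([1.0, 3.0]), f([3.0, 1.0]), f([1.0, 0.5])]             # [+1, +1, -1]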

2.2.3 Gradients
If we try to train the weights v1 , v2 , w1 , w2 , we will immediately notice a problem:
the gradient of h1 with respect to v1 is always zero because of the hard thresholding
function. Therefore, we use the logistic function σ(z), which looks roughly like the step function 1[z ≥ 0] but has non-zero gradients everywhere. One thing
to bear in mind is that even though the gradients are non-zero, they can be quite
small when |z| is large. This is what makes optimizing neural networks hard.


Problem: gradient of h1 with respect to v1 is 0: h1 = 1[v1 · φ( x ) ≥ 0]. Solution:


h1 = σ(v1 · φ( x )).

Definition: logistic function. The logistic function maps (−∞, ∞) to [0, 1]:

σ(z) = (1 + e^{−z})^{−1} = 1 / (1 + e^{−z})

σ′(z) = σ(z)(1 − σ(z))    (derivative)

Algorithm 2.3. The logistic function σ and its derivative σ′.

σ(z) = 1/(1 + exp(-z))
σ′(z) = σ(z)*(1 - σ(z))

2.2.4 Linear functions


Let’s try to visualize the functions. Recall that a linear function takes the input φ(x) ∈ Rd and directly takes the dot product with the weight vector w to form the score, the basis for prediction in both binary classification and regression. For linear functions, the output is the score = w · φ(x):

Figure 2.1. A linear function with φ(x) ∈ R3: the inputs φ(x)₁, φ(x)₂, φ(x)₃ are combined with the weight vector w to produce the score.

2.2.5 Neural networks


A (one-layer) neural network first maps an input φ( x ) ∈ Rd onto a hidden
intermediate representation h ∈ Rk , which in turn is mapped to the score via a
linear function. Specifically, let k be the number of hidden units. For each hidden
unit j = 1, . . . , k, we have a weight vector v j ∈ Rd , which is used to determine
the value of the hidden node h j ∈ R (also called the activation) according to
h j = σ (v j · φ( x )), where σ is the activation function. The activation function can


be a number of different things, but its main property is that it is a non-linear


function. Let h = [ h1 , . . . , hk ] be the vector of activations. This activation vector is
now combined with another weight vector w ∈ Rk to produce the final score. The
logistic function is an instance of an activation function, and is the classic one that was used in the past. These days, most people use a rectifier function, commonly known as a rectified linear unit (ReLU), which is defined as ReLU(z) = max(z, 0).
The ReLU has two advantages: (i) its gradient doesn’t vanish as z grows, which makes it empirically easier to train; and (ii) it only involves a max operation, which is computationally cheaper than the exponential function.

Algorithm 2.4. A one-layer neural network with activation function g defaulting to the logistic function, or a vector of activation functions g.

using LinearAlgebra   # for the dot product ⋅

function neural_network(x, V, w, ϕ, g::Function=σ)
    h = map(vⱼ -> g(vⱼ ⋅ ϕ(x)), V)    # hidden activations hⱼ = g(vⱼ ⋅ ϕ(x))
    w ⋅ h                              # score
end

function neural_network(x, V, w, ϕ, g::Vector)
    h = map((gⱼ, vⱼ) -> gⱼ(vⱼ ⋅ ϕ(x)), g, V)   # one activation function per hidden unit
    w ⋅ h
end
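As a usage sketch (assuming σ from Algorithm 2.3 and the definitions above are loaded), the hand-picked weights from the car-collision example give higher scores to the safe inputs than to the colliding one; with soft σ units, a learned threshold would still be needed to turn the score into a label.

ϕ(x) = [1.0, x[1], x[2]]
V = [[-1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]]
w = [1.0, 1.0]

neural_network([1.0, 3.0], V, w, ϕ)   # ≈ 0.78  (safe example)
neural_network([1.0, 0.5], V, w, ϕ)   # ≈ 0.56  (colliding example)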

For a neural network with one hidden layer, the intermediate hidden units are
h j = σ (v j · φ( x )) where σ (z) = (1 + e−z )−1 , producing output score = w · h:

Figure 2.2. A one-layer neural network with φ(x) ∈ R3 and h ∈ R2 using the logistic activation function σ: the inputs feed through the weights V and σ to produce the hidden units h₁, h₂, which are combined with w to produce the score.

Interpretation: intermediate hidden units as learned features of a linear predic-


tor. The noteworthy aspect here is that the activation vector h behaves a lot like our
feature vector φ( x ) that we were using for linear prediction. The difference is that
mapping from input φ( x ) to h is learned automatically, not manually constructed
(as was the case before). Therefore, a neural network can be viewed as learning
the features of a linear classifier. Of course, the type of features that can be learned
must be of the form x ↦ σ(vⱼ · φ(x)). Even for deep neural networks, no matter


how deep the neural network is, the top layer is always a linear function, and the
layers below can be interpreted as defining a (possibly very complex) feature
map. Whether this is a suitable form depends on the nature of the application.
Empirically, though, neural networks have been quite successful, since learning
the features from the data with the explicit objective of minimizing the loss can
yield better features than ones which are manually crafted. Since 2010, there have
been some advances in getting neural networks to work, and they have become
the state-of-the-art in many tasks. For example, the major companies (Google, Microsoft, IBM) have all recently switched over to using neural networks for speech
recognition. In computer vision, (convolutional) neural networks are completely
dominant in object recognition.

Key idea: feature learning.


• Before: apply linear predictor on manually specified features.

φ( x )

• Now: apply linear predictor on automatically learned features.

h( x ) = [ h1 ( x ), . . . , hk ( x )]

2.3 Efficient Gradients

The main thing left to do for neural networks is to be able to train them. Con-
ceptually, this should be straightforward: just take the gradient and run SGD.
While this is true, computing the gradient, even though it is not hard, can be quite
tedious to do by hand.


Optimization problem.

minimize_{V,w} TrainLoss(V, w)

TrainLoss(V, w) = (1 / |Dtrain|) ∑_{(x,y) ∈ Dtrain} Loss(x, y, V, w)

Loss(x, y, V, w) = (fV,w(x) − y)²

fV,w(x) = ∑_{j=1}^{k} wⱼ σ(vⱼ · φ(x))

Goal. Compute the gradient ∇_{V,w} TrainLoss(V, w).

2.3.1 Approach
Mathematically, we could just grind through the chain rule. But next we visualize the computation using a computation graph. We will illustrate a graphical way of
organizing the computation of gradients, which is built out of a few components.
This graphical approach will show the structure of the function and will not only
make gradients easy to compute, but also shed more light onto the predictor
and loss function. In fact, these days if you use a package such as TensorFlow or
PyTorch, you can write down the expressions symbolically and the gradient is
computed for you. This is done essentially using the computational procedure
that we will see.

Advantages:

• Avoid long equations

• Reveal structure of computations (modularity, efficiency, dependencies) —


TensorFlow/PyTorch are built on this

2.3.2 Functions as boxes


The first conceptual step is to think of functions as boxes that take a set of inputs and produce an output. Then the partial derivatives (gradients if the input is vector-valued) are just a measure of sensitivity: if we perturb in₁ by a small amount ε, how much does the output out change? The answer is (∂out/∂in₁) · ε. For


convenience, we write the partial derivative on the edge connecting the input to the output. Partial derivatives (i.e., gradients) answer the question: how much does the output change if an input changes? Example:

out = 2·in₁ + in₂·in₃
2(in₁ + ε) + in₂·in₃ = out + 2ε        ⟹ ∂out/∂in₁ = 2
2·in₁ + (in₂ + ε)·in₃ = out + in₃·ε    ⟹ ∂out/∂in₂ = in₃
(and similarly, ∂out/∂in₃ = in₂)

2.3.3 Basic building blocks

Here are 5 examples of simple functions and their partial derivatives. These should be familiar from basic calculus. All we’ve done is present them in a visually more intuitive way. But it turns out that these simple functions are all we need to build up many of the more complex and potentially scarier looking functions that we’ll encounter in machine learning.

• a + b:        ∂/∂a = 1,            ∂/∂b = 1
• a − b:        ∂/∂a = 1,            ∂/∂b = −1
• a · b:        ∂/∂a = b,            ∂/∂b = a
• max(a, b):    ∂/∂a = 1[a > b],     ∂/∂b = 1[a < b]
• σ(a):         ∂/∂a = σ(a)(1 − σ(a))

2.3.4 Composing functions

The second conceptual point is to think about composing functions. Graphically, this is very natural: the output of one function f simply gets fed as the input into another function g, forming a chain in → f → mid → g → out with edge derivatives ∂mid/∂in and ∂out/∂mid. Now how does in affect out (what is the partial derivative)? The key idea is that the partial derivative decomposes into a product of the two partial derivatives on the two edges:

∂out/∂in = (∂out/∂mid) · (∂mid/∂in)

You should recognize this as no more than the chain rule in graphical form. More generally, the partial derivative of y with respect to x is simply the product of all the edge derivatives (the green expressions in the figures) along the path connecting x and y. This visual intuition will help us better understand more complex functions, which we will turn to next.
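A quick numerical sketch of the chain rule (the functions here are made up purely for illustration): the product of the two edge derivatives matches a finite-difference estimate.

f(x) = x^2          # in -> mid
g(m) = 3m           # mid -> out
x, ϵ = 2.0, 1e-6

analytic  = 3 * 2x                              # (∂out/∂mid)·(∂mid/∂in) = 3·(2x) = 12
numerical = (g(f(x + ϵ)) - g(f(x))) / ϵ         # ≈ 12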


2.3.5 Binary classification with hinge loss


Let us start with a simple example: the hinge loss for binary classification. In red,
we have highlighted the weights w with respect to which we want to take the
derivative. The central question is how small perturbations in w affect a change in
the output (loss). Intermediate nodes have been labeled with interpretable names
(score, margin). The actual gradient is the product of the edge-wise gradients
from w to the loss output.

(Computation graph for the hinge loss, from bottom to top: w and φ(x) feed into score = w · φ(x), with edge derivative ∂score/∂w = φ(x); score and y feed into margin = score · y, with edge derivative ∂margin/∂score = y; 1 − margin feeds into loss = max{1 − margin, 0}, with edge derivatives −1 and 1[1 − margin > 0].)

Gradient with respect to w. For the hinge loss

Loss(x, y, w) = max{1 − w · φ(x) y, 0},

to compute

∇_w Loss(x, y, w) = ∂Loss(x, y, w) / ∂w,

simply multiply the edge derivatives along the path from w to the loss:

∇_w Loss(x, y, w) = −1[margin < 1] φ(x) y.
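A minimal sketch of the hinge loss and the gradient read off the graph:

hinge_loss(x, y, w, ϕ) = max(1 - sum(w .* ϕ(x)) * y, 0)

function hinge_loss_gradient(x, y, w, ϕ)
    margin = sum(w .* ϕ(x)) * y
    return margin < 1 ? -y .* ϕ(x) : zero(ϕ(x))   # -1[margin < 1]·φ(x)·y
end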

2.3.6 Backpropagation
Now, we can apply the same strategy to neural networks. Here we are using the
squared loss for concreteness, but one can also use the logistic or hinge losses.


Note that there is some really nice modularity here: you can pick any predictor
(linear or neural network) to drive the score, and the score can be fed into any
loss function (squared, hinge, etc.).
Loss(x, y, V, w) = ( ∑_{j=1}^{k} wⱼ σ(vⱼ · φ(x)) − y )²,
i.e., the square of (neural network prediction − target).

So far, we have mainly used the graphical representation to visualize the


computation of function values and gradients for our conceptual understanding.
But it turns out that the graph has algorithmic implications too. Recall that to
train any sort of model using (stochastic) gradient descent, we need to compute
the gradient of the loss (top output node) with respect to the weights (leaf nodes
highlighted in red). We also saw that these gradients (partial derivatives) are just
the product of the local derivatives (green stuff) along the path from a leaf to a
root. So we can just go ahead and compute these gradients: for each red node,
multiply the quantities on the edges. However, notice that many of the paths share
subpaths in common, so sometimes there’s an opportunity to save computation
(think dynamic programming). To make this sharing more explicit, for each node
i in the tree, define the forward value f i to be the value of the subexpression
rooted at that node, which depends on the inputs underneath that subtree. For
example, the parent node of w1 corresponds to the expression w1 σ (v1 · φ( x )).
The f i ’s are the intermediate computations required to even evaluate the function
at the root. Next, for each node i in the tree, define the backward value gi to be
the gradient of the output with respect to f i , the forward value of node i. This
measures the change that would happen in the output (root node) induced by
changes to f i . Note that both f i and gi can either be scalars, vectors, or matrices,
but have the same dimensionality.

Definition: forward/backward values.

• Forward: fᵢ is the value of the subexpression rooted at node i.

• Backward: gᵢ = ∂out/∂fᵢ is how fᵢ influences the output.

We now define the backpropagation algorithm on arbitrary computation


graphs. First, in the forward pass, we go through all the nodes in the computation


graph from leaves to the root, and compute fᵢ, the value of each node i, recursively given the node values of the children of i. These values will be used in the backward pass. Next, in the backward pass, we go through all the nodes from the root to the leaves and compute gᵢ recursively from fᵢ and gⱼ, the backward value for the parent j of i, using the key recurrence gᵢ = (∂fⱼ/∂fᵢ) gⱼ (just the chain rule). In this example, the backward pass gives us the gradient of the output node (the gradient of the loss) with respect to the weights (the red nodes).

Algorithm 2.5. Backpropagation.

function Backpropagation(in)
    Forward pass: compute each fᵢ (from leaves to root)
    Backward pass: compute each gᵢ (from root to leaves)
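Rather than a general computation-graph engine, the following sketch applies the same forward/backward idea by hand to the one-layer network with squared loss above; the function and variable names are assumptions made for this example.

using LinearAlgebra   # for the dot product ⋅

σ(z) = 1 / (1 + exp(-z))

function backprop(ϕ, y, V, w)        # ϕ is the feature vector φ(x)
    z = [vⱼ ⋅ ϕ for vⱼ in V]          # forward: pre-activations
    h = σ.(z)                          # forward: hidden activations
    f = w ⋅ h                          # forward: score
    loss = (f - y)^2
    # backward: reuse the forward values to form the gradients
    ∂loss_∂f = 2 * (f - y)
    ∇w = ∂loss_∂f .* h                                             # ∂loss/∂wⱼ = 2(f−y)·hⱼ
    ∇V = [∂loss_∂f * w[j] * h[j] * (1 - h[j]) .* ϕ for j in 1:length(V)]
    return loss, ∇w, ∇V
end

loss, ∇w, ∇V = backprop([1.0, 1.0, 3.0], 1.0, [[-1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]], [1.0, 1.0])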

While we can go through the motions of running the backpropagation al-


gorithm to compute gradients, what is the result of running SGD? For linear
predictors (using the squared loss or hinge loss), TrainLoss(w) is a convex func-
tion, which means that SGD (with an appropriately set step size) is theoretically
guaranteed to converge to the global optimum. However, for neural networks,
TrainLoss(w) is typically non-convex which means that there are multiple local
optima, and SGD is not guaranteed to converge to the global optimum. There are
many settings in which SGD fails both theoretically and empirically, but in practice,
SGD on neural networks can work with proper attention to tuning hyperparame-
ters. The gap between theory and practice is not well understood and an active
area of research.

2.4 Nearest Neighbors

Linear predictors were governed by a simple dot product w · φ(x). Neural networks chained together these simple primitives to yield something more complex. Now, we will consider nearest neighbors, which yields complexity by another mechanism: computing similarities between examples.

Nearest neighbors is perhaps conceptually one of the simplest learning algorithms. In a way, there is no learning. At training time, we just store the entire set of training examples. At prediction time, we get an input x′ and we just find the input in our training set that is most similar (commonly measured with the Manhattan distance ‖v − v′‖₁ = ∑_{i=1}^{n} |vᵢ − v′ᵢ| or the Euclidean distance ‖v − v′‖₂ = (∑_{i=1}^{n} (vᵢ − v′ᵢ)²)^{1/2}), and return its output. In a practical implementation, finding the closest input is non-trivial. Popular choices are using


k-d trees or locality-sensitive hashing. We will not worry about this issue. The
intuition being expressed here is that similar (nearby) points tend to have similar
outputs. This is a reasonable assumption in most cases; all else equal, having
a body temperature of 37 and 37.1 is probably not going to affect the health
prediction by much.

Algorithm 2.6. Nearest neighbors.

function NearestNeighbors(x′, φ, Dtrain)
    Training: just store Dtrain.
    Predictor f(x′):
        Find (x, y) ∈ Dtrain where ‖φ(x) − φ(x′)‖ is smallest
        return y
Key idea: similarity. Similar examples tend to have similar outputs.

Figure 2.3. Voronoi diagram showing nearest neighbors classification with Manhattan distance (L1).

Algorithm 2.7. Nearest neighbors: find the y value in the training data 𝒟train associated with the example closest to the input x′ after applying the feature extractor ϕ.

using LinearAlgebra: norm

dist_manhattan(v, v′) = norm(v - v′, 1)
dist_euclidean(v, v′) = norm(v - v′, 2)

function nearest_neighbors(x′, ϕ, 𝒟train, dist)
    # return the label of the training example closest to x′
    𝒟train[argmin([dist(ϕ(x), ϕ(x′)) for (x,y) in 𝒟train])][2]
end
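Example usage (a sketch with made-up one-dimensional data, assuming the definitions above are loaded):

𝒟train = [(0.0, -1), (1.0, +1), (2.0, +1)]
ϕ(x) = [x]
nearest_neighbors(0.9, ϕ, 𝒟train, dist_euclidean)   # returns +1 (closest example is (1.0, +1))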

Let’s look at the decision boundary of nearest neighbors. The input space is
partitioned into regions, such that each region has the same closest point (this
is a Voronoi diagram), and each region could get a different output. Notice that
this decision boundary is much more expressive than what you could get with
quadratic features. In particular, one interesting property is that the complexity of
the decision boundary adapts to the number of training examples. As we increase
the number of training examples, the number of regions will also increase. Such
methods are called non-parametric. The decision boundaries are shown in the
Voronoi diagrams in figures 2.3 and 2.4.

• Much more expressive than quadratic features.

• Non-parametric: the hypothesis class adapts to the number of examples.

• Simple and powerful, but kind of brute force.

Figure 2.4. Voronoi diagram showing nearest neighbors classification with Euclidean distance (L2).


2.4.1 Summary of learners


Let us conclude now. First, we discussed some general principles for designing
good features for linear predictors. Just with the machinery of linear prediction,
we were able to obtain rich predictors. Second, we focused on expanding the
expressivity of our predictors while fixing a particular feature extractor φ. We covered
three algorithms: linear predictors combine the features linearly (which is rather
weak), but is easy and fast. Note that we can always make the hypothesis class
arbitrarily large by adding more features, but that’s another issue. Neural networks
effectively learn non-linear features, which are then used in a linear way. This is
what gives them their power and prediction speed, but they are harder to learn
(due to the non-convexity of the objective function). Nearest neighbors is based on
computing similarities with training examples. They are powerful and easy to
learn, but are slow to use for prediction because they involve enumerating (or
looking up points in) the training data.

1. Linear predictors: combine raw features

• prediction is fast, easy to learn, weak use of features.

2. Neural networks: combine learned features

• prediction is fast, hard to learn, powerful use of features.

3. Nearest neighbors: predict according to similar examples

• prediction is slow, easy to learn, powerful use of features.



3 Markov Decision Process I

From CS221 Spring 2020, Percy Liang, Chelsea Finn & Nima Anari, Stanford University.

Last chapter, we looked at search problems, a powerful paradigm that can be used to solve a diverse range of problems ranging from word segmentation to package delivery to route finding. The key was to cast whatever problem we were interested in solving into the problem of finding the minimum cost path in a graph. However, search problems assume that taking an action a from a state s results deterministically in a unique successor state Succ(s, a).

3.1 Uncertainty

In the real world, the deterministic successor assumption is often unrealistic, for
there is randomness: taking an action might lead to any one of many possible states.
One deep question here is how we can even hope to act optimally in the face of
randomness? Certainly we can’t just have a single deterministic plan, and talking
about a minimum cost path doesn’t make sense. Today, we will develop tools
to tackle this more challenging setting. We will fortunately still be able to reuse
many of the intuitions about search problems, in particular the notion of a state.

Applications. Randomness shows up in many places. It could be caused by limitations of the sensors and actuators of the robot (which we can control to some extent). Or it could be caused by market forces or nature, which we have
no control over. We’ll see that all of these sources of randomness can be handled
in the same mathematical framework.

• Robotics: decide where to move, but actuators can fail, hit unseen obstacles,
etc.

• Resource allocation: decide what to produce, don’t know the customer demand
for various products

• Agriculture: decide what to plant, but don’t know weather and thus crop yield

Let us consider an example. You are exploring a South Pacific island, which
is modeled as a 3 × 4 grid of states. From each state, you can take one of four
actions to move to an adjacent state: north (N), east (E), south (S), or west
(W). If you try to move off the grid, you remain in the same state. You start
at (2, 1). If you end up in either of the green or red squares, your journey
ends, either in a lava lake (reward of −50) or in a safe area with either no
view (2) or a fabulous view of the island (20). What do you do?
If we have a deterministic search problem, then the obvious thing will be
to go for the fabulous view, which yields a reward of 20. You can set numIters
to 10 and press Run. Each state is labeled with the maximum expected utility
(sum of rewards) one can get from that state (analogue of FutureCost in a
search problem). We will define this quantity formally later. For now, look
at the arrows, which represent the best action to take from each cell. Note
that in some cases, there is a tie for the best, where some of the actions seem
to be moving in the wrong direction. This is because there is no penalty for
moving around indefinitely. If you change moveReward to −0.1, then you’ll
see the arrows point in the right direction.
In reality, we are dealing with treacherous terrain, and there is on each
action a probability slipProb of slipping, which results in moving in a ran-
dom direction. Try setting slipProb to various values. For small values (e.g.,
0.1), the optimal action is to still go for the fabulous view. For large values
(e.g., 0.3), then it’s better to go for the safe and boring 2. Play around with
the other reward values to get intuition for the problem.
Important: note that we are only specifying the dynamics of the world,
not directly specifying the best action to take. The best actions are computed
automatically from the algorithms we’ll see shortly.

3.2 Markov Decision Processes


Example: Dice game. For each round r = 1, 2, . . .


• You choose stay or quit.

• If quit, you get $10 and we end the game.

• If stay, you get $4 and then I roll a 6-sided die.

  – If the die shows 1 or 2, we end the game.


– Otherwise, continue to the next round.

Expected utility if following policy "stay":

(1/3)(4) + (2/3)(1/3)(8) + (2/3)(2/3)(1/3)(12) + ··· = 12
Let’s suppose you always stay. Note that each outcome of the game will result
in a different sequence of rewards, resulting in a utility, which is in this case
just the sum of the rewards. We are interested in the expected utility, which
you can compute to be 12.

Expected utility if following policy "quit":

1 · (10) = 10

If you quit, then you’ll get a reward of 10 deterministically. Therefore, in


expectation, the ”stay” strategy is preferred, even though sometimes you’ll
get less than 10.


While we already solved this game directly, we’d like to develop a more general
framework for thinking about not just this game, but also other problems such
as the volcano crossing example. To that end, let us formalize the dice game as a
Markov decision process (MDP).
An MDP can be represented as a graph. The nodes in this graph include both
states and chance nodes. Edges coming out of states are the possible actions from
that state, which lead to chance nodes. Edges coming out of a chance node
are the possible random outcomes of that action, which end up back in states.
Our convention is to label these chance-to-state edges with the probability of a
particular transition and the associated reward for traversing that edge.

Definition: Markov decision process.


• States: the set of states

• sstart ∈ States: starting state

• Actions(s): possible actions from state s

• T(s′ | s, a): probability of s′ if we take action a in state s

• Reward(s, a, s′): reward for the transition (s, a, s′)

• IsEnd(s): whether at end of game

• 0 ≤ γ ≤ 1: discount factor (default: 1)

A Markov decision process has a set of states States, a starting state sstart, and the set of actions Actions(s) from each state s. It also has a transition distribution T, which specifies for each state s and action a a distribution over possible successor states s′. Specifically, we have that ∑_{s′} T(s, a, s′) = 1 because T is a probability distribution (more on this later). Associated with each transition (s, a, s′) is a reward, which could be either positive or negative. If we arrive in a state s for which IsEnd(s) is true, then the game is over. Finally, the discount factor γ is a quantity which specifies how much we value the future and will be discussed later.
MDPs share many similarities with search problems, but there are differences
(one main difference and one minor one). The main difference is the move from


a deterministic successor function Succ(s, a) to transition probabilities over s′. We can think of the successor function Succ(s, a) as a special case of transition probabilities:

T(s, a, s′) = 1 if s′ = Succ(s, a), and 0 otherwise.

A minor difference is that we’ve gone from minimizing costs to maximizing


rewards. The two are really equivalent: you can negate one to get the other.

Definition: search problem.


• States: the set of states

• sstart ∈ States: starting state

• Actions(s): possible actions from state s

• Succ(s, a): where we end up if we take action a in state s

• Cost(s, a): cost for taking action a in state s

• IsEnd(s): whether at end

Map to MDP:

• Succ(s, a) ⇒ T(s, a, s′)

• Cost(s, a) ⇒ Reward(s, a, s′)

3.2.1 Transitions
Just to dwell on the major difference, transition probabilities, a bit more: for each state s and action a, the transition probabilities specify a distribution over successor states s′.

Definition: transition probabilities. The transition probabilities T(s, a, s′) specify the probability of ending up in state s′ if action a is taken in state s.


Example: transition probabilities.

    s    a     s′    T(s, a, s′)
    in   quit  end   1
    in   stay  in    2/3
    in   stay  end   1/3

Probabilities must sum to one. This means that for each given s and a, if we sum the transition probability T(s, a, s′) over all possible successor states s′, we get 1. If a transition to a particular s′ is not possible, then T(s, a, s′) = 0. We refer to the s′ for which T(s, a, s′) > 0 as the successors. Generally, the number of successors of a given (s, a) is much smaller than the total number of states. For instance, in a search problem, each (s, a) has exactly one successor. Formally, for each state s and action a:

∑_{s′ ∈ States} T(s, a, s′) = 1,   with successors = {s′ : T(s, a, s′) > 0}.

Let us revisit the transportation example. As we all know, magic trams aren’t the most reliable forms of transportation, so let us assume that with probability 1/2 the tram actually does as advertised, and with probability 1/2 it just leaves you in the same state.

Example: transportation.

• Street with blocks numbered 1 to n.

• Walking from s to s + 1 takes 1 minute.

• Taking a magic tram from s to 2s takes 2 minutes.

• How to travel from 1 to n in the least time?

Tram fails with probability 0.5.

3.2.2 Solutions
So we now know what an MDP is. What do we do with one? For search problems,
we were trying to find the minimum cost path. However, fixed paths won’t suffice


for MDPs, because we don’t know which states the random dice rolls are going to take us to. Therefore, we define a policy, which specifies an action for every single state, not just the states along a path. This way, we have all our bases covered, and know what action to take no matter where we are. One might wonder if we ever need to take different actions from a given state. The answer is no, since, as in a search problem, the state contains all the information that we need to act optimally for the future. In more formal speak, the transitions and rewards satisfy the Markov property. Every time we end up in a state, we are faced with the exact same problem and therefore should take the same optimal action.

Solutions to:

• Search problem: path (sequence of actions)

• MDPs: policies (state to action mapping)

Definition: policy. A policy π is a mapping from each state s ∈ States to an


action a ∈ Actions(s).

Example: volcano crossing.

    s        π(s)
    (1, 1)   S
    (2, 1)   E
    (3, 1)   N
    ···      ···

3.3 Policy Evaluation

Now that we’ve defined an MDP (the input) and a policy (the output), let’s turn
to defining the evaluation metric for a policy—there are many of them, which one
should we choose? Recall that we’d like to maximize the total rewards (utility),
but this is a random quantity, so we can’t quite do that. Instead, we will maximize the expected utility, which we will refer to as the value (of a policy).


Definition: utility. Following a policy yields a random path. The utility of a


policy is the (discounted) sum of the rewards on the path (this is a random
quantity).

Definition: value (expected utility). The value of a policy is the expected


utility.

Evaluating a policy: volcano crossing. To get an intuitive feel for the rela-
tionship between a value and utility, consider the volcano example. If you
press Run multiple times, you will get random paths shown on the right
leading to different utilities. Note that there is considerable variation in what
happens. The expectation of this utility is the value. You can run multiple
simulations by increasing numEpisodes. If you set numEpisodes to 1000, then
you’ll see the average utility converging to the value.
    Path                                                         Utility
    [in; stay, 4, end]                                           4
    [in; stay, 4, in; stay, 4, in; stay, 4, end]                 12
    [in; stay, 4, in; stay, 4, end]                              8
    [in; stay, 4, in; stay, 4, in; stay, 4, in; stay, 4, end]    16
    ...                                                          ...

Value: 12
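A minimal simulation sketch of the dice game under the "stay" policy shows the average utility converging to the value of 12:

using Statistics: mean

function simulate_stay()
    u = 0.0
    while true
        u += 4.0                       # reward for staying
        rand() < 1/3 && return u       # the die shows 1 or 2: the game ends
    end
end

mean(simulate_stay() for _ in 1:100_000)   # ≈ 12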


3.3.1 Discounting
There is an additional aspect to utility: discounting, which captures the fact that
a reward today might be worth more than the same reward tomorrow. If the
discount γ is small, then we favor the present more and downweight future
rewards more. Note that the discounting parameter is applied exponentially
to future rewards, so the distant future is always going to have a fairly small
contribution to the utility (unless γ = 1). The terminology, though standard, is
slightly confusing: a larger value of the discount parameter γ actually means that
the future is discounted less.

Definition: utility. Path: s₀, a₁ r₁ s₁, a₂ r₂ s₂, . . . (action, reward, new state). The utility with discount γ is

u₁ = r₁ + γ r₂ + γ² r₃ + γ³ r₄ + ···

• Discount γ = 1 (save for the future):

– [stay, stay, stay, stay]: 4 + 4 + 4 + 4 = 16

• Discount γ = 0 (live in the moment):

– [stay, stay, stay, stay]: 4 + 0 · (4 + · · · ) = 4

• Discount γ = 0.5 (balanced life):

  – [stay, stay, stay, stay]: 4 + (1/2) · 4 + (1/4) · 4 + (1/8) · 4 = 7.5
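A one-line utility calculation reproduces the examples above:

utility(rewards, γ) = sum(γ^(t-1) * r for (t, r) in enumerate(rewards))

utility([4, 4, 4, 4], 1.0)   # 16.0
utility([4, 4, 4, 4], 0.5)   # 7.5
utility([4, 4, 4, 4], 0.0)   # 4.0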

3.3.2 Policy evaluation


Associated with any policy π are two important quantities, the value of the policy
Vπ (s) and the Q-value of a policy Qπ (s, a). In terms of the MDP graph, one can
think of the value Vπ (s) as labeling the state nodes, and the Q-value Qπ (s, a) as
labeling the chance nodes. This label refers to the expected utility if we were to
start at that node and continue the dynamics of the game.
We will now write down some equations relating value and Q-value. Our
eventual goal is to get to an algorithm for computing these values, but as we will
see, writing down the relationships gets us most of the way there, just as writing


Definition: value of a policy. Let Vπ (s) be the expected utility received by


following policy π from state s.

Definition: Q-value of a policy. Let Qπ (s, a) be the expected utility of tak-


ing action a from state s, and then following policy π.

down the recurrence for FutureCost directly led to a dynamic programming algorithm for acyclic search problems. First, we get Vπ(s), the value of a state s, by just following the action edge specified by the policy and taking the Q-value Qπ(s, π(s)). (There’s also a base case where IsEnd(s).) Second, we get Qπ(s, a) by considering all possible transitions to successor states s′ and taking the expectation over the immediate reward Reward(s, a, s′) plus the discounted future reward γVπ(s′). While we’ve defined the recurrence for the expected utility directly, we can derive the recurrence by applying the law of total expectation and invoking the Markov property. To do this, we need to set up some random variables: let s₀ be the initial state, a₁ be the action that we take, r₁ be the reward we obtain, and s₁ be the state we end up in. Also define uₜ = rₜ + γ rₜ₊₁ + γ² rₜ₊₂ + ··· to be the utility of following policy π from time step t. Then Vπ(s) = E[u₁ | s₀ = s], which (assuming s is not an end state) in turn equals

Vπ(s) = E[u₁ | s₀ = s]
      = ∑_{s′} P[s₁ = s′ | s₀ = s, a₁ = π(s)] E[u₁ | s₁ = s′, s₀ = s, a₁ = π(s)].

Note that P[s₁ = s′ | s₀ = s, a₁ = π(s)] = T(s, π(s), s′). Using the fact that u₁ = r₁ + γu₂ and taking expectations, we get

E[u₁ | s₁ = s′, s₀ = s, a₁ = π(s)] = Reward(s, π(s), s′) + γ Vπ(s′).

The rest follows from algebra.
The rest follows from algebra.


Plan. Define recurrences relating value and Q-value:

Vπ(s) = 0 if IsEnd(s);  Qπ(s, π(s)) otherwise

Qπ(s, a) = ∑_{s′} T(s, a, s′)[Reward(s, a, s′) + γ Vπ(s′)]

Example: Dice game. As an example, let’s compute the values of the nodes
in the dice game for the policy ‘‘stay’’. Note that the recurrence involves both
Vπ (in) on the left-hand side and the right-hand side. At least in this simple
example, we can solve this recurrence easily to get the value. Let γ = 1. Let
π be the ‘‘stay’’ policy: π (in) = stay.

Vπ(end) = 0
Vπ(in) = Qπ(in, stay)
       = (1/3)(4 + Vπ(end)) + (2/3)(4 + Vπ(in))

In this case, we can solve in closed form:

Vπ(in) = (1/3)(4) + (2/3)(4 + Vπ(in))
       = 4 + (2/3) Vπ(in)
⟹ (1/3) Vπ(in) = 4
⟹ Vπ(in) = 12

But for a much larger MDP with 100000 states, how do we efficiently compute
the value of a policy? One option is the following: observe that the recurrences
define a system of linear equations, where the variables are Vπ (s) for each state
s and there is an equation for each state. So we could solve the system of linear
equations by computing a matrix inverse. However, inverting a 100000 × 100000
matrix is expensive in general. There is an even simpler approach called policy
evaluation. We’ve already seen examples of iterative algorithms in machine learn-
ing: the basic idea is to start with something crude, and refine it over time. Policy


evaluation starts with a vector of all zeros for the initial values Vπ^(0). Each iteration, we loop over all the states and apply the two recurrences that we had before. The equations look hairier because of the superscript (t), which simply denotes the value at iteration t of the algorithm.

Key idea: iterative algorithm. Start with arbitrary policy values and re-
peatedly apply recurrences to converge to true values.

Algorithm 3.1. Policy evaluation.

function PolicyEvaluation(π)
    Initialize Vπ^(0)(s) ← 0 for all states s
    for iteration t = 1, . . . , tPE do
        for each state s do
            Vπ^(t)(s) ← ∑_{s′} T(s, π(s), s′)[Reward(s, π(s), s′) + γ Vπ^(t−1)(s′)]
    (the quantity on the right-hand side is Q^(t−1)(s, π(s)))
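A minimal sketch of policy evaluation on the dice game; the ad hoc encoding of the MDP as (probability, successor, reward) triples is an assumption made for this example.

function policy_evaluation_dice(; γ=1.0, t_PE=100)
    # transitions under the "stay" policy: (probability, successor, reward)
    T = Dict(:in => [(1/3, :end, 4.0), (2/3, :in, 4.0)])
    V = Dict(:in => 0.0, :end => 0.0)
    for t in 1:t_PE
        V = Dict(:end => 0.0,
                 :in  => sum(p * (r + γ * V[s′]) for (p, s′, r) in T[:in]))
    end
    return V
end

policy_evaluation_dice()[:in]   # ≈ 12.0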

Some implementation notes: a good strategy for determining how many iterations to run policy evaluation is based on how accurate the result is. Rather than set some fixed number of iterations (e.g., 100), we instead set an error tolerance (e.g., ε = 0.01), and iterate until the maximum change between values of any state s from one iteration (t) to the previous (t − 1) is at most ε. The second note is that while the algorithm is stated as computing Vπ^(t) for each iteration t, we actually only need to keep track of the last two values. This is important for saving memory. How many iterations to run (tPE)? Repeat until the values don’t change much:

max_{s ∈ States} |Vπ^(t)(s) − Vπ^(t−1)(s)| ≤ ε

Don’t store Vπ^(t) for each iteration t; we only need the last two: Vπ^(t) and Vπ^(t−1).

Complexity. Computing the running time of policy evaluation is straightfor-


ward: for each of the tPE iterations, we need to enumerate through each of the S states, and for each one of those, loop over the S′ successors. Note that we don’t have a dependence on the number of actions A because we have a fixed policy


π (s) and we only need to look at the action specified by the policy. Advanced:
Here, we have to iterate tPE time steps to reach a target level of error ε. It turns out
that tPE doesn’t actually have to be very large for very small errors. Specifically, the
error decreases exponentially fast as we increase the number of iterations. In other
words, to cut the error in half, we only have to run a constant number of more
iterations. Advanced: For acyclic graphs (for example, the MDP for Blackjack),
we just need to do one iteration (not tPE ) provided that we process the nodes
in reverse topological order of the graph. This is the same setup as we had for
dynamic programming in search problems, only the equations are different.

MDP complexity: O(tPE · S · S′)

• S states

• A actions per state

• S′ successors per state-action pair (the number of s′ with T(s, a, s′) > 0)

Let us run policy evaluation on the dice game. The value converges very
quickly to the correct answer. Let π be the stay policy: π (in) = stay.
Vπ^(t)(end) = 0
Vπ^(t)(in) = (1/3)(4 + Vπ^(t−1)(end)) + (2/3)(4 + Vπ^(t−1)(in))

    s            end     in
    Vπ^(0)       0.00    0.00      (t = 0 iterations)
    Vπ^(1)       0.00    4.00      (t = 1 iteration)
    Vπ^(2)       0.00    6.67      (t = 2 iterations)
    Vπ^(3)       0.00    8.44      (t = 3 iterations)
    Vπ^(100)     0.00    12.00     (t = 100 iterations)

Converges to Vπ(in) = 12.


3.3.3 Summary
Let’s summarize: we have defined an MDP, which we should think of as a graph whose nodes are states and chance nodes. Because of randomness, solving an
MDP means generating policies, not just paths. A policy is evaluated based on
its value: the expected utility obtained over random paths. Finally, we saw that
policy evaluation provides a simple way to compute the value of a policy.

• MDP: graph with states, chance nodes, transition probabilities, rewards.

• Policy: mapping from state to action (solution to MDP).

• Value of policy: expected utility over random paths.

• Policy evaluation: iterative algorithm to compute value of policy.

3.4 Value Iteration

If we are given a policy π, we now know how to compute its value Vπ(sstart). So now, we could just enumerate all the policies, compute the value of each one, and take the best policy, but the number of policies is exponential in the number of states (A^S to be exact), so we need something a bit more clever. We will now
introduce value iteration, which is an algorithm for finding the best policy. While
evaluating a given policy and finding the best policy might seem very different,
it turns out that value iteration will look a lot like policy evaluation.

3.4.1 Optimal value and policy


We will write down a bunch of recurrences which look exactly like policy evalua-
tion, but instead of having Vπ and Qπ with respect to a fixed policy π, we will
have V ∗ and Q∗ , which are with respect to the optimal policy. The goal: try to get
directly at maximum expected utility.

Definition: optimal value. The optimal value V ∗ (s) is the maximum value
attained by any policy.


The recurrences for V ∗ and Q∗ are identical to the ones for policy evaluation
with one difference: in computing V ∗ , instead of taking the action from the fixed
policy π, we take the best action, the one that results in the largest Q∗ (s, a).

• Optimal value if we take action a in state s:

  Q∗(s, a) = ∑_{s′} T(s, a, s′)[Reward(s, a, s′) + γ V∗(s′)]

• Optimal value from state s:

  V∗(s) = 0 if IsEnd(s);  max_{a ∈ Actions(s)} Q∗(s, a) otherwise

So far, we have focused on computing the value of the optimal policy, but what
is the actual policy? It turns out that this is pretty easy to compute. Suppose
you’re at a state s. Q∗ (s, a) tells you the value of taking action a from state s. So
the optimal action is simply to take the action a with the largest value of Q∗ (s, a).
Given Q∗ , read off the optimal policy:

π∗(s) = argmax_{a ∈ Actions(s)} Q∗(s, a)

3.4.2 Value iteration


By now, you should be able to go from recurrences to algorithms easily. Following
the recipe, we simply iterate some number of iterations, go through each state
s and then replace the equality in the recurrence with the assignment operator.
Value iteration is also guaranteed to converge to the optimal value. What about
the optimal policy? We get it as a byproduct. The optimal value V ∗ (s) is computed
by taking a max over actions. If we take the argmax, then we get the optimal
policy π ∗ (s).

3.4.3 Convergence
Let us state more formally the conditions under which any of these algorithms
that we talked about will work. A sufficient condition is that either the discount γ is strictly less than 1 or the MDP graph is acyclic.


Algorithm 3.2. Value iteration [Bellman, 1957].

function ValueIteration
    Initialize V∗^(0)(s) ← 0 for all states s
    for iteration t = 1, . . . , tVI do
        for each state s do
            V∗^(t)(s) ← max_{a ∈ Actions(s)} ∑_{s′} T(s, a, s′)[Reward(s, a, s′) + γ V∗^(t−1)(s′)]
    (the quantity being maximized over a is Q∗^(t−1)(s, a))

Time complexity: O(tVI · S · A · S′)

We can reinterpret the discount γ < 1 condition as introducing a new transition from each state to a


special end state with probability (1 − γ), multiplying all the other transition
probabilities by γ, and setting the discount to 1. The interpretation is that with
probability 1 − γ, the MDP terminates at any state. In this view, we just need that
a sampled path be finite with probability 1. We won’t prove this theorem, but
will instead give a counterexample to show that things can go badly if we have
a cyclic graph and γ = 1. In that graph, however we initialize value iteration, it will terminate immediately with the same values. In some sense, this isn’t really the fault of value iteration; it’s because all paths are of infinite length. If you were to simulate from this MDP, you would never terminate, so we would never find out what your utility was at the end.

3.4.4 Summary of algorithms


• Policy evaluation: (MDP, π) → Vπ

• Value iteration: MDP → (V ∗ , π ∗ )

3.4.5 Unifying idea


There are two key ideas in this chapter. First, the policy π, value Vπ , and Q-value
Qπ are the three key quantities of MDPs, and they are related via a number of
recurrences which can be easily gotten by just thinking about their interpretations.
Second, given recurrences that depend on each other for the values you’re trying


Value iteration: dice game. Let us demonstrate value iteration on the dice
game. Initially, the optimal policy is ‘‘quit’’, but as we run value iteration
longer, it switches to ‘‘stay’’.
    t (iterations)    V∗(end)    V∗(in)    π∗(in)
    0                 0.00       0.00      —
    1                 0.00       10.00     quit
    2                 0.00       10.67     stay
    3                 0.00       11.11     stay
    100               0.00       12.00     stay
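A minimal sketch of value iteration on the dice game reproduces the table above; the ad hoc encoding of the MDP is an assumption made for this example.

function value_iteration_dice(; γ=1.0, t_VI=100)
    actions = [:stay, :quit]
    # transitions for each (state, action): (probability, successor, reward) triples
    T = Dict((:in, :stay) => [(1/3, :end, 4.0), (2/3, :in, 4.0)],
             (:in, :quit) => [(1.0, :end, 10.0)])
    V = Dict(:in => 0.0, :end => 0.0)
    best_action = :quit
    for t in 1:t_VI
        Q = [sum(p * (r + γ * V[s′]) for (p, s′, r) in T[(:in, a)]) for a in actions]
        best_action = actions[argmax(Q)]           # π∗(in) = argmax over a of Q∗(in, a)
        V = Dict(:end => 0.0, :in => maximum(Q))
    end
    return V, best_action
end

value_iteration_dice()   # V[:in] ≈ 12.0, optimal action :stay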

Theorem: convergence. Suppose either discount γ < 1, or MDP graph is


acyclic. Then value iteration converges to the correct answer.

Example: non-convergence. A cyclic MDP graph with discount γ = 1 and zero rewards (the counterexample described above): value iteration terminates immediately with whatever values it was initialized with.


to compute, it’s easy to turn these recurrences into algorithms that iterate between
those recurrences until convergence.

Algorithms:

• Search DP computes FutureCost(s)

• Policy evaluation computes policy value Vπ (s)

• Value iteration computes optimal value V ∗ (s)

Recipe:

• Write down the recurrence (e.g., Vπ(s) = ··· Vπ(s′) ···)

• Turn into iterative algorithm (replace mathematical equality with assignment


operator)

3.4.6 Summary
• Markov decision processes (MDPs) cope with uncertainty.

• Solutions are policies rather than paths.

• Policy evaluation computes policy value (expected utility).

• Value iteration computes optimal value (maximum expected utility) and opti-
mal policy.

• Main technique: write recurrences → algorithm

• Next chapter: reinforcement learning—when we don’t know rewards, transi-


tion probabilities.


