CS 229, Public Course Problem Set #4 Solutions: Unsupervised Learning and Reinforcement Learning
However, EM can also be applied to the supervised learning setting, and in this problem we
discuss a “mixture of linear regressors” model; this is an instance of what is often called the
Hierarchical Mixture of Experts model. We want to represent p(y|x), x ∈ R^n and y ∈ R,
and we do so by again introducing a discrete latent random variable
p(y|x) = \sum_z p(y, z|x) = \sum_z p(y|x, z)\, p(z|x).
For simplicity we’ll assume that z is binary valued, that p(y|x, z) is a Gaussian density,
and that p(z|x) is given by a logistic regression model. More formally
p(z|x; φ) = g(φ^T x)^z (1 − g(φ^T x))^{1−z}

p(y|x, z = i; θ_i) = \frac{1}{\sqrt{2π}\,σ} \exp\left( \frac{−(y − θ_i^T x)^2}{2σ^2} \right), \qquad i = 0, 1
where σ is a known parameter and φ, θ0 , θ1 ∈ Rn are parameters of the model (here we
use the subscript on θ to denote two different parameter vectors, not to index a particular
entry in these vectors).
Intuitively, the process behind the model can be thought of as follows. Given a data point x,
we first determine whether the data point belongs to one of two hidden classes z = 0 or
z = 1, using a logistic regression model. We then determine y as a linear function of x
(different linear functions for different values of z) plus Gaussian noise, as in the standard
linear regression model. For example, the following data set could be well-represented by
the model, but not by standard linear regression.
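To make this generative process concrete, the following Octave/MATLAB sketch draws a data set from the model (the particular values of φ, θ_0, θ_1, and σ below are arbitrary illustrative choices, not part of the problem):

% draw m examples from the mixture of linear regressors model
m = 200;
phi = [1; -1]; theta0 = [2; 0.5]; theta1 = [-2; 0.5]; sigma = 0.3;
g = @(a) 1 ./ (1 + exp(-a));              % logistic function
X = [randn(m, 1), ones(m, 1)];            % inputs (second column is an intercept term)
z = double(rand(m, 1) < g(X*phi));        % latent class drawn from p(z|x)
y = (1 - z).*(X*theta0) + z.*(X*theta1) + sigma*randn(m, 1);   % y from the chosen regressor plus noise

Plotting (X(:,1), y) for such a draw produces two noisy, intersecting lines, exactly the kind of data set a single linear regression cannot represent well.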
(a) Suppose x, y, and z are all observed, so that we obtain a training set
{(x(1) , y (1) , z (1) ), . . . , (x(m) , y (m) , z (m) )}. Write the log-likelihood of the parameters,
and derive the maximum likelihood estimates for φ, θ0 , and θ1 . Note that because
p(z|x) is a logistic regression model, there will not exist a closed form estimate of φ.
In this case, derive the gradient and the Hessian of the likelihood with respect to φ;
in practice, these quantities can be used to numerically compute the ML estimate.
Answer: The log-likelihood is given by
ℓ(φ, θ_0, θ_1) = \log \prod_{i=1}^m p(y^{(i)}|x^{(i)}, z^{(i)}; θ_0, θ_1)\, p(z^{(i)}|x^{(i)}; φ)

= \sum_{i: z^{(i)}=0} \log \left[ (1 − g(φ^T x^{(i)})) \frac{1}{\sqrt{2π}\,σ} \exp\left( \frac{−(y^{(i)} − θ_0^T x^{(i)})^2}{2σ^2} \right) \right]

+ \sum_{i: z^{(i)}=1} \log \left[ g(φ^T x^{(i)}) \frac{1}{\sqrt{2π}\,σ} \exp\left( \frac{−(y^{(i)} − θ_1^T x^{(i)})^2}{2σ^2} \right) \right]
But this is just a least-squares problem on a subset of the data. In particular, if we let X_0
and \vec{y}_0 be the design matrix and target vector formed by considering only those examples with z^{(i)} = 0,
then using the same logic as for the derivation of the least squares solution we get the
maximum likelihood estimate of θ_0,

θ_0 = (X_0^T X_0)^{−1} X_0^T \vec{y}_0,

and the analogous expression in X_1 and \vec{y}_1 gives θ_1. Collecting the terms of the log-likelihood that depend on φ leaves

\sum_{i=1}^m \left[ z^{(i)} \log g(φ^T x^{(i)}) + (1 − z^{(i)}) \log(1 − g(φ^T x^{(i)})) \right].

This is just the standard logistic regression objective function, for which we already know
the gradient and Hessian,

∇_φ = X^T(\vec{z} − \vec{h}), \qquad H = −X^T \mathrm{diag}(h_i(1 − h_i))\, X, \qquad h_i = g(φ^T x^{(i)}),

where X is the design matrix over all examples and \vec{z} is the vector of labels z^{(i)}.
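As a concrete illustration of the fully observed fit, the following Octave/MATLAB sketch computes θ_0 and θ_1 in closed form and runs a few Newton steps for φ (X, y, z are assumed to hold the training data as a design matrix, target vector, and 0/1 label vector; this is only a sketch, not the provided solution code):

% fully observed case: X is m x n, y is m x 1, z is m x 1 with entries in {0,1}
g = @(a) 1 ./ (1 + exp(-a));
theta0 = X(z==0,:) \ y(z==0);          % least squares on the z = 0 examples
theta1 = X(z==1,:) \ y(z==1);          % least squares on the z = 1 examples
phi = zeros(size(X,2), 1);
for iter = 1:20                        % Newton's method for the logistic part
  h = g(X*phi);
  grad = X' * (z - h);                 % gradient of the log-likelihood
  H = -X' * diag(h .* (1 - h)) * X;    % Hessian of the log-likelihood
  phi = phi - H \ grad;                % Newton ascent step (H is negative definite)
end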
(b) Now suppose z is a latent (unobserved) random variable. Write the log-likelihood of
the parameters, and derive an EM algorithm to maximize the log-likelihood. Clearly
specify the E-step and M-step (again, the M-step will require a numerical solution,
so find the appropriate gradients and Hessians).
In the E-step we compute, for each example, the posterior over the latent variable using Bayes' rule,

Q_i(z^{(i)} = j) = p(z^{(i)} = j | x^{(i)}, y^{(i)}; φ, θ_0, θ_1) = \frac{p(y^{(i)}|x^{(i)}, z^{(i)} = j; θ_j)\, p(z^{(i)} = j|x^{(i)}; φ)}{\sum_{j'=0,1} p(y^{(i)}|x^{(i)}, z^{(i)} = j'; θ_{j'})\, p(z^{(i)} = j'|x^{(i)}; φ)}.

Every probability in this term can be computed using the probability densities defined in
the problem, so the E-step is tractable.

For the M-step, we first define w_j^{(i)} = p(z^{(i)} = j | x^{(i)}, y^{(i)}; φ, θ_0, θ_1) for j = 0, 1 as
computed in the E-step (of course we only need to compute one of these terms in the
real E-step, since w_0^{(i)} = 1 − w_1^{(i)}, but we define both to simplify the expressions).
Differentiating our lower bound on the likelihood with respect to θ0 , removing terms that
don’t depend on θ0 , and setting the expression equal to zero, we get
0 \;\overset{\text{set}}{=}\; ∇_{θ_0} \sum_{i=1}^m \sum_{j=0,1} w_j^{(i)} \log \frac{p(y^{(i)}|x^{(i)}, z^{(i)} = j; θ_j)\, p(z^{(i)} = j|x^{(i)}; φ)}{w_j^{(i)}}

= ∇_{θ_0} \sum_{i=1}^m w_0^{(i)} \log p(y^{(i)}|x^{(i)}, z^{(i)} = 0; θ_0)

= ∇_{θ_0} \sum_{i=1}^m −\frac{w_0^{(i)}}{2σ^2} \left( y^{(i)} − θ_0^T x^{(i)} \right)^2
Setting this gradient to zero gives a weighted least-squares problem, whose solution is

θ_0 = (X^T W_0 X)^{−1} X^T W_0 \vec{y}, \qquad W_0 = \mathrm{diag}(w_0^{(1)}, \ldots, w_0^{(m)}),

and similarly for θ_1 with the weights w_1^{(i)}. For φ, the relevant term of the lower bound is \sum_{i=1}^m \sum_{j=0,1} w_j^{(i)} \log p(z^{(i)} = j|x^{(i)}; φ). This term is the same as the objective for the logistic regression task, but with
the quantity w_1^{(i)} replacing the label z^{(i)}. Therefore, the gradient and Hessian are given by
∇_φ \sum_{i=1}^m \sum_{j=0,1} w_j^{(i)} \log p(z^{(i)} = j|x^{(i)}; φ) = \sum_{i=1}^m \left( w_1^{(i)} − g(φ^T x^{(i)}) \right) x^{(i)} = X^T(\vec{w} − \vec{h}), \qquad \vec{w}_i = w_1^{(i)},\ \vec{h}_i = g(φ^T x^{(i)}),

H = −\sum_{i=1}^m g(φ^T x^{(i)}) \left( 1 − g(φ^T x^{(i)}) \right) x^{(i)} x^{(i)T} = −X^T \mathrm{diag}\left( \vec{h}_i (1 − \vec{h}_i) \right) X,

where in both cases the last equality comes from the identity in the hint.
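Putting the E-step and M-step together, a minimal Octave/MATLAB sketch of the resulting EM loop could look as follows (X, y, and the known σ are assumed given; the number of EM iterations and inner Newton steps are arbitrary choices, and this is a sketch rather than the official solution code):

% EM for the mixture of linear regressors (sketch)
g = @(a) 1 ./ (1 + exp(-a));
gauss = @(y, mu, sigma) exp(-(y - mu).^2 / (2*sigma^2)) / (sqrt(2*pi)*sigma);
[m, n] = size(X);
phi = zeros(n, 1); theta0 = randn(n, 1); theta1 = randn(n, 1);
for em_iter = 1:50
  % E-step: responsibilities w1 = p(z = 1 | x, y), w0 = 1 - w1
  p1 = g(X*phi) .* gauss(y, X*theta1, sigma);
  p0 = (1 - g(X*phi)) .* gauss(y, X*theta0, sigma);
  w1 = p1 ./ (p0 + p1);  w0 = 1 - w1;
  % M-step for theta_0, theta_1: weighted least squares
  W0 = diag(w0);  W1 = diag(w1);
  theta0 = (X'*W0*X) \ (X'*W0*y);
  theta1 = (X'*W1*X) \ (X'*W1*y);
  % M-step for phi: a few Newton steps on the weighted logistic objective
  for newton_iter = 1:5
    h = g(X*phi);
    phi = phi + (X'*diag(h.*(1 - h))*X) \ (X'*(w1 - h));
  end
end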
(b) Using these distributions, derive an EM algorithm for the model. Clearly state the
E-step and the M-step of the algorithm.
Answer: Even though z^{(i)} is a scalar value, in this problem we continue to use the
notation z^{(i)T}, etc., to make the similarities to the factor analysis case obvious.
For the E-step, we compute the distribution Q_i(z^{(i)}) = p(z^{(i)}|x^{(i)}; U) by computing
μ_{z^{(i)}|x^{(i)}} and Σ_{z^{(i)}|x^{(i)}} using the above formulas.
For the M-step, we need to maximize
\sum_{i=1}^m \int_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}|z^{(i)}; U)\, p(z^{(i)})}{Q_i(z^{(i)})}\, dz^{(i)}
= \sum_{i=1}^m E_{z^{(i)} \sim Q_i}\left[ \log p(x^{(i)}|z^{(i)}; U) + \log p(z^{(i)}) − \log Q_i(z^{(i)}) \right].
Taking the gradient with respect to U, dropping terms that don't depend
on U, and omitting the subscript on the expectation, this becomes
\sum_{i=1}^m ∇_U E\left[ \log p(x^{(i)}|z^{(i)}; U) \right] = \sum_{i=1}^m ∇_U E\left[ −\frac{1}{2σ^2} (x^{(i)} − U z^{(i)})^T (x^{(i)} − U z^{(i)}) \right]

= −\frac{1}{2σ^2} \sum_{i=1}^m ∇_U E\left[ \mathrm{tr}\, z^{(i)T} U^T U z^{(i)} − 2\, \mathrm{tr}\, z^{(i)T} U^T x^{(i)} \right]

= −\frac{1}{σ^2} \sum_{i=1}^m E\left[ U z^{(i)} z^{(i)T} − x^{(i)} z^{(i)T} \right]

= −\frac{1}{σ^2} \sum_{i=1}^m \left( U\, E[z^{(i)} z^{(i)T}] − x^{(i)} E[z^{(i)T}] \right)
using the same reasoning as in the Factor Analysis class notes. Setting this derivative to
zero gives
U = \left( \sum_{i=1}^m x^{(i)} E[z^{(i)T}] \right) \left( \sum_{i=1}^m E[z^{(i)} z^{(i)T}] \right)^{−1}

= \left( \sum_{i=1}^m x^{(i)} μ_{z^{(i)}|x^{(i)}}^T \right) \left( \sum_{i=1}^m Σ_{z^{(i)}|x^{(i)}} + μ_{z^{(i)}|x^{(i)}} μ_{z^{(i)}|x^{(i)}}^T \right)^{−1}
All these terms were calculated in the E-step, so this is our final M-step update.
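To make the updates concrete, here is a short Octave/MATLAB sketch of the EM iterations for this single-latent-factor model. The E-step uses the standard Gaussian conditional mean and variance for z^{(i)} | x^{(i)} under z ∼ N(0, 1), x | z ∼ N(Uz, σ²I), which should match the formulas referred to above; X, sigma2, the initialization, and the iteration count are all illustrative assumptions:

% EM for the model z ~ N(0, 1), x | z ~ N(U z, sigma2 * I), with scalar z
% X: m x n design matrix (one example per row); sigma2 > 0 assumed known
[m, n] = size(X);
U = randn(n, 1);                        % arbitrary initialization
for iter = 1:100
  % E-step: posterior variance (same for every i) and means of z^(i) given x^(i)
  Sz = sigma2 / (U'*U + sigma2);        % Sigma_{z^(i)|x^(i)}
  mu = X*U / (U'*U + sigma2);           % m x 1 vector of mu_{z^(i)|x^(i)}
  % M-step: U = (sum_i x^(i) E[z^(i)]) / (sum_i E[(z^(i))^2])
  U = (X'*mu) / (m*Sz + mu'*mu);
end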
(c) As σ² → 0, show that if the EM algorithm converges to a parameter vector U^⋆
(and such convergence is guaranteed by the argument presented in class), then U^⋆
must be an eigenvector of the sample covariance matrix Σ = \frac{1}{m} \sum_{i=1}^m x^{(i)} x^{(i)T} — i.e.,
U^⋆ must satisfy

λU^⋆ = ΣU^⋆.

[Hint: When σ² → 0, Σ_{z|x} → 0, so the E-step only needs to compute the means
μ_{z|x} and not the variances. Let w ∈ R^m be a vector containing all these means,
w_i = μ_{z^{(i)}|x^{(i)}}, and show that the E-step and M-step can be expressed as

w = \frac{XU}{U^T U}, \qquad U = \frac{X^T w}{w^T w}

respectively. Finally, show that if U doesn't change after this update, it must satisfy
the eigenvector equation shown above.]
Answer: For the E-step, when σ² → 0, μ_{z^{(i)}|x^{(i)}} = \frac{U^T x^{(i)}}{U^T U}, so using w as defined in the
hint we have

w = \frac{XU}{U^T U}

as desired.
As mentioned in the hint, when σ² → 0, Σ_{z^{(i)}|x^{(i)}} = 0, so

U = \left( \sum_{i=1}^m x^{(i)} μ_{z^{(i)}|x^{(i)}}^T \right) \left( \sum_{i=1}^m Σ_{z^{(i)}|x^{(i)}} + μ_{z^{(i)}|x^{(i)}} μ_{z^{(i)}|x^{(i)}}^T \right)^{−1}
= \left( \sum_{i=1}^m x^{(i)} w_i \right) \left( \sum_{i=1}^m w_i w_i \right)^{−1} = \frac{X^T w}{w^T w}.

Finally, if U does not change after this update, then substituting w = XU/(U^T U) into U = X^T w/(w^T w) gives

U = \frac{X^T X\, U}{(U^T U)(w^T w)},

i.e., ΣU = \frac{(U^T U)(w^T w)}{m} U, which is exactly the eigenvector equation λU^⋆ = ΣU^⋆ with λ = (U^T U)(w^T w)/m.
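The σ² → 0 updates are easy to experiment with directly. The Octave/MATLAB sketch below iterates them on an arbitrary design matrix X and checks the eigenvector relation derived above; with a random initialization the combined update behaves like power iteration, so it typically converges to the principal eigenvector direction (the iteration count is an arbitrary choice):

% sigma^2 -> 0 limit of the EM updates
[m, n] = size(X);
U = randn(n, 1);
for iter = 1:100
  w = X*U / (U'*U);                     % E-step: posterior means w_i = mu_{z^(i)|x^(i)}
  U = X'*w / (w'*w);                    % M-step
end
Sigma = X'*X / m;                       % sample covariance Sigma = (1/m) sum_i x^(i) x^(i)^T
lambda = (U'*U)*(w'*w) / m;             % the eigenvalue identified above
disp(norm(Sigma*U - lambda*U));         % should be close to zero at convergence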
For the image data we're using, the preprocessing step for the ICA algorithm is slightly different, though the
precise mechanism and justification is not important for the sake of this problem. Those who are curious about
the details should read Bell and Sejnowski's paper “The 'Independent Components' of Natural Scenes are Edge
Filters,” which provided the basis for the implementation we use in this problem.
For reference, computing the ICA W matrix for the entire set of image patches takes about
5 minutes on a 1.6 GHz laptop using our implementation.
After you've learned the U matrix for PCA (the columns of U should contain the principal
components of the data) and the W matrix of ICA, you can plot the basis functions using
the plot_ica_bases(W) and plot_pca_bases(U) functions we have provided. Comment
briefly on the difference between the two sets of basis functions.
Answer: The following are our implementations of pca.m and ica.m:
function U = pca(X)
% the columns of U are the eigenvectors of X*X', i.e. the principal components
[U,S,V] = svd(X*X');
function W = ica(X)
[n,m] = size(X);
chunk = 100;                      % minibatch size
alpha = 0.0005;                   % learning rate
W = eye(n);
for iter=1:10,
  disp([num2str(iter)]);
  X = X(:,randperm(m));           % shuffle the columns on each pass
  for i=1:floor(m/chunk),
    Xc = X(:,(i-1)*chunk+1:i*chunk);
    % stochastic gradient ascent on the ICA log-likelihood (sigmoid nonlinearity)
    dW = (1 - 2./(1+exp(-W*Xc)))*Xc' + chunk*inv(W');
    W = W + alpha*dW;
  end
end
The PCA bases capture global features of the images, while the ICA bases capture more local
features.
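As a quick sanity check of the two routines, one can run them on small synthetic data. The sketch below uses arbitrary sizes and Laplacian-distributed sources (chosen because the sigmoid-based update above is suited to super-Gaussian sources); the unmixing may need more passes or a different learning rate than the image-patch settings:

% synthetic check for pca.m and ica.m (illustrative sizes and distributions)
n = 4; m = 20000;
S = sign(randn(n, m)) .* (-log(rand(n, m)));   % independent Laplacian-like sources
A = randn(n);                                  % random square mixing matrix
X = A * S;                                     % observed mixed signals, n x m

U = pca(X);
disp(norm(U'*U - eye(n)));    % ~0: the principal components are orthonormal

W = ica(X);
disp(W*A);                    % should be close to a scaled permutation matrix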
4. Convergence of Policy Iteration
In this problem we show that the Policy Iteration algorithm, described in the lecture notes,
is guaranteed to find the optimal policy for an MDP. First, define B^π to be the Bellman
operator for policy π, defined as follows: if V′ = B(V), then

V′(s) = R(s) + γ \sum_{s′∈S} P_{sπ(s)}(s′) V(s′).
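For reference, a minimal Octave/MATLAB implementation of this operator could look as follows (the |S|×|A|×|S| transition array P, reward vector R, and policy vector used here are assumed representations for illustration, not part of the problem statement):

function Vp = bellman_operator(V, R, P, policy, gamma)
% apply B^policy to V: Vp(s) = R(s) + gamma * sum_{s'} P_{s,policy(s)}(s') * V(s')
S = length(R);
Vp = zeros(S, 1);
for s = 1:S
  Vp(s) = R(s) + gamma * squeeze(P(s, policy(s), :))' * V;
end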
(a) Prove that if V1 (s) ≤ V2 (s) for all s ∈ S, then B(V1 )(s) ≤ B(V2 )(s) for all s ∈ S.
Answer:
B(V_1)(s) = R(s) + γ \sum_{s′∈S} P_{sπ(s)}(s′) V_1(s′)

≤ R(s) + γ \sum_{s′∈S} P_{sπ(s)}(s′) V_2(s′) = B(V_2)(s)
(b) Prove that for any value function V,

‖B^π(V) − V^π‖_∞ ≤ γ‖V − V^π‖_∞

where ‖V‖_∞ = \max_{s∈S} |V(s)|. Intuitively, this means that applying the Bellman
operator B^π to any value function V brings that value function “closer” to the value
function for π, V^π. This also means that applying B^π repeatedly (an infinite number
of times)

B^π(B^π(\ldots B^π(V) \ldots))

will result in the value function V^π (a little bit more is needed to make this completely
formal, but we won't worry about that here).

[Hint: Use the fact that for any α, x ∈ R^n, if \sum_i α_i = 1 and α_i ≥ 0, then \sum_i α_i x_i ≤ \max_i x_i.]

Answer:
‖B^π(V) − V^π‖_∞ = \max_{s∈S} \left| R(s) + γ \sum_{s′∈S} P_{sπ(s)}(s′) V(s′) − R(s) − γ \sum_{s′∈S} P_{sπ(s)}(s′) V^π(s′) \right|

= γ \max_{s∈S} \left| \sum_{s′∈S} P_{sπ(s)}(s′) \left( V(s′) − V^π(s′) \right) \right|

≤ γ‖V − V^π‖_∞,

where the first equality uses the fact that V^π = B^π(V^π) and the final inequality follows from the identity in the hint.
(c) Now suppose we have some policy π, and define π′ to be the greedy policy with respect to V^π,

π′(s) = \arg\max_{a∈A} \sum_{s′∈S} P_{sa}(s′) V^π(s′).

Show that this policy will never perform worse than the previous one — i.e., show
that for all s ∈ S, V^π(s) ≤ V^{π′}(s).

[Hint: First show that V^π(s) ≤ B^{π′}(V^π)(s), then use the preceding exercises to
show that B^{π′}(V^π)(s) ≤ V^{π′}(s).]
Answer:

V^π(s) = R(s) + γ \sum_{s′∈S} P_{sπ(s)}(s′) V^π(s′)

≤ R(s) + γ \max_{a∈A} \sum_{s′∈S} P_{sa}(s′) V^π(s′)

= R(s) + γ \sum_{s′∈S} P_{sπ′(s)}(s′) V^π(s′) = B^{π′}(V^π)(s).

Applying B^{π′} repeatedly to both sides preserves the inequality by part (a), and by part (b) repeated application of B^{π′} converges to V^{π′}; therefore V^π(s) ≤ V^{π′}(s) for all s ∈ S.
(d) Use the preceding exercises to show that policy iteration will eventually converge
(i.e., produce a policy π′ = π). Furthermore, show that it must converge to the
optimal policy π^⋆. For the latter part, you may use the property that if some value
function satisfies

V(s) = R(s) + γ \max_{a∈A} \sum_{s′∈S} P_{sa}(s′) V(s′)

then V = V^⋆.
Answer: We know that policy iteration must converge because there are only a finite
number of possible policies (if there are |S| states, each with |A| actions, then there are
|A|^{|S|} possible policies in total). Since the policies are monotonically improving, as we
showed in part (c), at some point we must stop generating new policies, so the algorithm
must produce π′ = π. Using the assumptions stated in the question, it is easy to show
convergence to the optimal policy. If π′ = π, then using the same logic as in part (c),

V^{π′}(s) = V^π(s) = R(s) + γ \max_{a∈A} \sum_{s′∈S} P_{sa}(s′) V^π(s′),

so V^π = V^⋆, and therefore π = π^⋆.
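The convergence argument corresponds directly to the structure of the algorithm itself; a compact Octave/MATLAB sketch of policy iteration (using the same assumed P, R representation as the Bellman-operator sketch earlier, with exact policy evaluation by solving the linear system V = R + γP_π V) is:

function policy = policy_iteration(R, P, gamma)
% R: |S| x 1 rewards, P: |S| x |A| x |S| transition probabilities, 0 < gamma < 1
[S, A, ~] = size(P);
policy = ones(S, 1);                            % start from an arbitrary policy
while true
  % policy evaluation: solve V = R + gamma * Ppi * V
  Ppi = zeros(S, S);
  for s = 1:S, Ppi(s, :) = squeeze(P(s, policy(s), :))'; end
  V = (eye(S) - gamma*Ppi) \ R;
  % policy improvement: act greedily with respect to V
  Q = zeros(S, A);
  for a = 1:A, Q(:, a) = R + gamma * squeeze(P(:, a, :)) * V; end
  [~, new_policy] = max(Q, [], 2);
  if all(new_policy == policy), break; end      % pi' = pi: converged
  policy = new_policy;
end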
2 The dynamics of this domain were taken from Sutton and Barto, 1998.
All states except those at the top of the hill have a constant reward R(s) = −1, while the
goal state at the hilltop has reward R(s) = 0; thus an optimal agent will try to get to the
top of the hill as fast as possible (when the car reaches the top of the hill, the episode is
over, and the car is reset to its initial position). However, when starting at the bottom
of the hill, the car does not have enough power to reach the top by driving forward, so
it must first accelerate backwards, building up enough momentum to reach
the top of the hill. This strategy of moving away from the goal in order to reach the goal
makes the problem difficult for many classical control algorithms.
As discussed in class, Q-learning maintains a table of Q-values, Q(s, a), for each state and
action. These Q-values are useful because, in order to select an action in state s, we only
need to check to see which Q-value is greatest. That is, in state s we take the action \arg\max_{a∈A} Q(s, a).
The Q-learning algorithm adjusts its estimates of the Q-values as follows. If an agent is in
state s, takes action a, then ends up in state s′, Q-learning will update Q(s, a) by

Q(s, a) := (1 − α)Q(s, a) + α\left( R(s′) + γ \max_{a′∈A} Q(s′, a′) \right),

where α is the learning rate parameter.
At each time, your implementation of Q-learning can execute the greedy policy π(s) =
\arg\max_{a∈A} Q(s, a).
Implement the [q, steps_per_episode] = qlearning(episodes) function in the q5/
directory. As input, the function takes the total number of episodes (each episode starts
with the car at the bottom of the hill, and lasts until the car reaches the top), and outputs
a matrix of the Q-values and a vector indicating how many steps it took before the car was
able to reach the top of the hill. You should use the [x, s, absorb] = mountain_car(x,
actions(a)) function to simulate one control cycle for the task — the x variable describes
the true (continuous) state of the system, whereas the s variable describes the discrete
index of the state, which you'll use to build the Q values.
Plot a graph showing the average number of steps before the car reaches the top of the
hill versus the episode number (there is quite a bit of variation in this quantity, so you will
probably want to average these over a large number of episodes, as this will give you a
better idea of how the number of steps before reaching the hilltop is decreasing). You can
also visualize your resulting controller by calling the draw_mountain_car(q) function.
Answer: The following is our implementation of qlearning.m:
function [q, steps_per_episode] = qlearning(episodes)

% initialization -- the parameter values below are reasonable choices, and the
% state discretization size and action codes are assumptions; the q5/ starter
% code defines the actual conventions used by mountain_car
num_states = 100; num_actions = 2; actions = [-1, 1];
alpha = 0.05; gamma = 0.99;
q = zeros(num_states, num_actions);
steps_per_episode = zeros(1, episodes);

for i=1:episodes,
% start each episode at the bottom of the hill (the initial continuous state
% below is an assumption) and pick an initial greedy action
x = [-pi/6, 0];
[x, s, absorb] = mountain_car(x, actions(1));
[maxq, a] = max(q(s,:));
if (q(s,1) == q(s,2)) a = ceil(rand*num_actions); end
steps = 0;
while (~absorb)
% execute the best action or a random action
[x, sn, absorb] = mountain_car(x, actions(a));
reward = -double(absorb == 0);
% find the best action for the next state and update q value
[maxq, an] = max(q(sn,:));
if (q(sn,1) == q(sn,2)) an = ceil(rand*num_actions); end
q(s,a) = (1 - alpha)*q(s,a) + alpha*(reward + gamma*maxq);
a = an;
s = sn;
steps = steps + 1;
end
steps_per_episode(i) = steps;
end
Within 10000 episodes, the algorithm converges to a policy that usually gets the car up the hill
in around 52-53 steps. The following plot shows the number of steps per episode (averaged
over 500 episodes) versus the number of episodes. We generated the plot using the following
code:
% average the per-episode step counts over 10 independent runs,
% then over blocks of 500 episodes, before plotting
for i=1:10,
  [q, ep_steps] = qlearning(10000);
  all_ep_steps(i,:) = ep_steps;
end
plot(mean(reshape(mean(all_ep_steps), 500, 20)));
(Figure: average steps per episode versus episode number, over 10000 episodes.)