What is TD learning?
• Temporal-Difference learning = TD learning
• The prediction problem is that of estimating the value function for a
policy π
• The control problem is the problem of finding an optimal policy π*
• Given some experience following a policy π, update the estimate V of vπ
for non-terminal states occurring in that experience
• Given current step t, TD methods wait until the next time step to
update V(St)
• Learn from partial returns
Value-based Reinforcement Learning
• We want to estimate the optimal value V*(s) or action-value function
Q*(s, a) using a function approximator V(s; θ) or Q(s, a; θ) with
parameters θ
• This function approximator can be any parametric supervised
machine learning model
• Recall that the optimal value is the maximum value achievable under
any policy
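• As an assumed illustration, the approximator can be as simple as a linear model over state features (the feature mapping below is hypothetical); a deep neural network, as in DQN later, is another choice:

import numpy as np

# Sketch of a parametric value-function approximator V(s; theta):
# a linear model over a state-feature vector. Any parametric supervised
# model (e.g. a neural network) could play the same role.

def v_hat(state_features: np.ndarray, theta: np.ndarray) -> float:
    # V(s; theta) = theta . phi(s)
    return float(theta @ state_features)

def grad_v_hat(state_features: np.ndarray, theta: np.ndarray) -> np.ndarray:
    # Gradient of the linear model with respect to theta is just the features
    return state_features
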
Update Rule for TD(0)
• At time t + 1, TD methods immediately form a target Rt+1 + γ V(St+1)
and make a useful update with step size α using the observed reward
Rt+1 and the estimate V(St+1)
• The update adds to the current estimate the step size times the difference
between the target and the current estimate (the TD error), as written out below
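• In symbols: V(St) ← V(St) + α [Rt+1 + γ V(St+1) - V(St)]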
Update Rule Intuition
• The target is a more accurate estimate of V(St) because the reward Rt+1
has now been observed
• The actual output is our current estimate of V(St)
• We simply take one step with our current value function estimate to
get a more accurate estimate of V(St), then update V(St) to move it
closer to that target; the gap between the two is the temporal
difference that gives the method its name
Tabular TD(0) Algorithm
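• A minimal Python sketch of tabular TD(0) policy evaluation; the small random-walk task, episode count, and step size are illustrative assumptions:

import random
from collections import defaultdict

# Tabular TD(0) policy evaluation on a small random-walk task.
# States 1..5 are non-terminal, 0 and 6 are terminal; reward +1 only
# when terminating on the right. The policy moves left/right at random.

def random_walk_episode():
    s = 3                                   # start in the middle state
    while s not in (0, 6):
        s_next = s + random.choice((-1, 1))
        r = 1.0 if s_next == 6 else 0.0
        yield s, r, s_next
        s = s_next

def td0(num_episodes=1000, alpha=0.1, gamma=1.0):
    V = defaultdict(float)                  # value of terminal states stays 0
    for _ in range(num_episodes):
        for s, r, s_next in random_walk_episode():
            # TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

print({s: round(v, 2) for s, v in sorted(td0().items())})
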
SARSA – On-policy TD Control
• SARSA = State-Action-Reward-State-Action
• Learn an action-value function instead of a state-value function
• qπ is the action-value function for policy π
• Q-values are the values qπ(s, a) for s in S, a in A
• SARSA experiences are used to update Q-values
• Use TD methods for the prediction problem
SARSA Update Rule
• We want to estimate qπ(s, a) for the current policy π, and for all states
s and actions a
• The update rule is similar to that for TD(0) but we transition from
state-action pair to state-action pair, and learn the values of state-
action pairs
• The update is performed after every transition from a non-terminal
state St
• If St+1 is terminal, then Q(St+1, At+1) is zero
• The update rule uses (St, At, Rt+1, St+1, At+1)
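• In symbols: Q(St, At) ← Q(St, At) + α [Rt+1 + γ Q(St+1, At+1) - Q(St, At)]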
SARSA Algorithm
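• A minimal Python sketch of tabular SARSA with an epsilon-greedy policy; the tiny gridworld and hyperparameters are illustrative assumptions:

import random
from collections import defaultdict

# Tabular SARSA on a tiny 4x4 gridworld (illustrative example environment).
# Start at (0, 0), goal at (3, 3), reward -1 per step, epsilon-greedy policy.

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # up, down, left, right
GOAL = (3, 3)

def step(state, action):
    r, c = state
    dr, dc = action
    return (min(max(r + dr, 0), 3), min(max(c + dc, 0), 3)), -1.0

def epsilon_greedy(Q, state, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(ACTIONS)                 # explore
    return max(ACTIONS, key=lambda a: Q[(state, a)])  # exploit

def sarsa(num_episodes=500, alpha=0.5, gamma=1.0):
    Q = defaultdict(float)                            # Q at the terminal state stays 0
    for _ in range(num_episodes):
        s = (0, 0)
        a = epsilon_greedy(Q, s)
        while s != GOAL:
            s_next, reward = step(s, a)
            a_next = epsilon_greedy(Q, s_next)
            # Update uses the quintuple (St, At, Rt+1, St+1, At+1)
            Q[(s, a)] += alpha * (reward + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q

print(epsilon_greedy(sarsa(), (0, 0), epsilon=0.0))   # greedy action at the start state
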
Deep Q-Networks (DQN)
• Introduced deep reinforcement learning
• It is common to use a function approximator Q(s, a; θ) to approximate
the action-value function in Q-learning
• Deep Q-Networks is Q-learning with a deep neural network function
approximator called the Q-network
• Discrete and finite set of actions A
• Example: Breakout has 3 actions – move left, move right, no
movement
• Uses epsilon-greedy policy to select actions
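• A small sketch of the epsilon-greedy selection step, assuming q_values is the 1-D array of Q-value estimates produced by the network for the current state:

import random
import numpy as np

# Epsilon-greedy selection over a discrete, finite action set:
# q_values holds one Q-value estimate per action.

def select_action(q_values, epsilon):
    if random.random() < epsilon:
        return random.randrange(len(q_values))    # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: greedy action
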
Q-Networks
• Core idea: We want the neural network to learn a non-linear
hierarchy of features or feature representation that gives accurate Q-
value estimates
• The neural network has a separate output unit for each possible
action, which gives the Q-value estimate for that action given the
input state
• The neural network is trained using mini-batch stochastic gradient
updates and experience replay
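• A minimal PyTorch sketch of such a Q-network; the plain MLP architecture and layer sizes are illustrative assumptions:

import torch.nn as nn

# Q-network sketch: the input is the (fixed-length) state representation and
# there is one output unit per possible action, giving Q(s, a; theta) for all
# actions in a single forward pass.

class QNetwork(nn.Module):
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),       # one Q-value per action
        )

    def forward(self, state):                     # state: (batch, state_dim)
        return self.net(state)                    # output: (batch, num_actions)
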
Experience Replay
• The state is a sequence of actions and observations st = x1, a1, x2, …, at-1, xt
• Store the agent’s experiences at each time step et = (st, at, rt, st+1) in a
dataset D = e1, ..., en pooled over many episodes into a replay memory
• In practice, only store the last N experience tuples in the replay
memory and sample uniformly from D when performing updates
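• A minimal sketch of such a replay memory; the done flag stored alongside each tuple is a common practical addition, not part of the tuple listed above:

import random
from collections import deque

# Replay memory: keep only the last N experience tuples and sample uniformly.

class ReplayMemory:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)      # oldest tuples are dropped first

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling decorrelates the updates from consecutive time steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
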
State representation
• It is difficult to give the neural network a sequence of arbitrary length
as input
• Use fixed length representation of sequence/history produced by a
function ϕ(st)
• Example: The last 4 image frames in the sequence of Breakout
gameplay
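• A sketch of one such ϕ(st): keep the last 4 preprocessed frames and stack them into a fixed-size array (the 84x84 frame shape is an assumption):

import numpy as np
from collections import deque

# Fixed-length state representation phi(s_t): stack the last k frames.

class FrameStack:
    def __init__(self, k=4, frame_shape=(84, 84)):
        self.frames = deque([np.zeros(frame_shape, dtype=np.float32)] * k, maxlen=k)

    def add(self, frame):
        self.frames.append(frame)                 # drop the oldest frame

    def phi(self):
        return np.stack(self.frames, axis=0)      # shape (k, 84, 84), e.g. (4, 84, 84)
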
Q-Network Training
• Sample a mini-batch of experience tuples uniformly at random from D
• Similar to the Q-learning update rule, but:
• Use mini-batch stochastic gradient updates
• The gradient of the loss function at iteration i with respect to the
parameters θi is the difference between the target value and the current
Q-value estimate, multiplied by the gradient of the Q function
approximator Q(s, a; θi) with respect to those parameters (written out
below)
• Use the gradient of the loss function to update the Q function
approximator
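• In symbols, for a sampled transition (s, a, r, s') with target
y = r + γ maxa' Q(s', a'; θ): each gradient step moves θi in the direction
(y - Q(s, a; θi)) ∇θi Q(s, a; θi), averaged over the mini-batch, treating
the target y as fixed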
DQN Algorithm
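• A compact end-to-end sketch of the training loop (epsilon-greedy acting, replay memory, mini-batch updates); the dummy environment, network size, and hyperparameters are illustrative assumptions, and the current network is also used to compute the targets:

import random
from collections import deque
import torch
import torch.nn as nn
import torch.nn.functional as F

# End-to-end DQN sketch: epsilon-greedy acting, replay memory, and mini-batch
# Q-learning updates on a small MLP Q-network. DummyEnv is a stand-in
# environment (random states, reward 1 for action 0) so the script runs on its own.

STATE_DIM, NUM_ACTIONS = 4, 3

class DummyEnv:
    def reset(self):
        return torch.randn(STATE_DIM)
    def step(self, action):
        reward = 1.0 if action == 0 else 0.0
        return torch.randn(STATE_DIM), reward, random.random() < 0.05  # s', r, done

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
memory = deque(maxlen=10_000)                     # replay memory of the last N tuples
env, gamma, epsilon, batch_size = DummyEnv(), 0.99, 0.1, 32

state = env.reset()
for _ in range(5_000):
    # Epsilon-greedy action selection from the Q-network outputs
    if random.random() < epsilon:
        action = random.randrange(NUM_ACTIONS)
    else:
        with torch.no_grad():
            action = int(q_net(state).argmax())
    next_state, reward, done = env.step(action)
    memory.append((state, action, reward, next_state, done))
    state = env.reset() if done else next_state

    if len(memory) >= batch_size:
        # Sample a mini-batch uniformly at random from the replay memory
        batch = random.sample(memory, batch_size)
        s = torch.stack([b[0] for b in batch])
        a = torch.tensor([b[1] for b in batch])
        r = torch.tensor([b[2] for b in batch])
        s2 = torch.stack([b[3] for b in batch])
        d = torch.tensor([float(b[4]) for b in batch])
        # Q(s, a; theta) for the actions actually taken
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        # Target r + gamma * max_a' Q(s', a'); zero future value for terminal transitions
        with torch.no_grad():
            target = r + gamma * (1 - d) * q_net(s2).max(dim=1).values
        loss = F.mse_loss(q_sa, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
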
Comments
• It was previously thought that the combination of simple online
reinforcement learning algorithms with deep neural networks was
fundamentally unstable
• The sequence of observed data (states) encountered by an online
reinforcement learning agent is non-stationary and online updates are
strongly correlated
• DQN is stable because it stores the agent’s data in an experience replay
memory, so that updates can be sampled from data gathered at different
time-steps
• Aggregating over memory reduces non-stationarity and decorrelates
updates but limits methods to off-policy reinforcement learning
algorithms
• Experience replay updates use more memory and computation per real
interaction than online updates, and require off-policy learning
algorithms that can update from data generated by an older policy