What is TD learning?

• Temporal-Difference learning = TD learning


• The prediction problem is that of estimating the value function for a
policy π
• The control problem is the problem of finding an optimal policy π*
• Given some experience following a policy π, update the estimate V of vπ for non-terminal states occurring in that experience
• Given current step t, TD methods wait until the next time step to
update V(St)
• Learn from partial returns
Value-based Reinforcement Learning
• We want to estimate the optimal value V*(s) or action-value function
Q*(s, a) using a function approximator V(s; θ) or Q(s, a; θ) with
parameters θ
• This function approximator can be any parametric supervised
machine learning model
• Recall that the optimal value is the maximum value achievable under
any policy
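• In symbols: V*(s) = maxπ Vπ(s) and Q*(s, a) = maxπ Qπ(s, a)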
Update Rule for TD(0)
• At time t + 1, TD methods immediately form a target Rt+1 + γ V(St+1)
and make a useful update with step size α using the observed reward
Rt+1 and the estimate V(St+1)
• The update is the step size times the difference between the target
output and the actual output
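• Written out, the TD(0) update is V(St) ← V(St) + α [Rt+1 + γ V(St+1) − V(St)]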
Update Rule Intuition
• The target output is a more accurate estimate of V(St) because the reward Rt+1 is now known
• The actual output is our current estimate of V(St)
• We simply take one step with our current value function estimate to
get a more accurate estimate of V(St) and then perform an update to
move V(St) closer towards the more accurate estimate (i.e. temporal
difference)
Tabular TD(0) Algorithm
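A minimal Python sketch of tabular TD(0) under a fixed policy π; the env.reset()/env.step(action) interface and the policy function are illustrative assumptions:

from collections import defaultdict

def tabular_td0(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    V = defaultdict(float)                      # V(s) initialized to 0; terminal states stay 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)              # follow the given policy pi
            next_state, reward, done = env.step(action)
            # TD(0): move V(St) toward the target Rt+1 + gamma * V(St+1)
            target = reward + gamma * (0.0 if done else V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V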
SARSA – On-policy TD Control
• SARSA = State-Action-Reward-State-Action
• Learn an action-value function instead of a state-value function
• qπ is the action-value function for policy π
• Q-values are the values qπ(s, a) for s in S, a in A
• SARSA experiences are used to update Q-values
• Use TD methods for the prediction problem
SARSA Update Rule
• We want to estimate qπ(s, a) for the current policy π, and for all states s and actions a
• The update rule is similar to that for TD(0) but we transition from
state-action pair to state-action pair, and learn the values of state-
action pairs
• The update is performed after every transition from a non-terminal
state St
• If St+1 is terminal, then Q(St+1, At+1) is zero
• The update rule uses (St, At, Rt+1, St+1, At+1)
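• Written out, the SARSA update is Q(St, At) ← Q(St, At) + α [Rt+1 + γ Q(St+1, At+1) − Q(St, At)]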
SARSA Algorithm
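A minimal Python sketch of SARSA, assuming the same illustrative env.reset()/env.step(action) interface and an epsilon-greedy policy derived from the current Q-values:

import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    # Explore with probability epsilon, otherwise act greedily w.r.t. Q
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                      # Q(s, a) initialized to 0
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            if done:
                target = reward                 # Q of a terminal state is 0
                next_action = None
            else:
                next_action = epsilon_greedy(Q, next_state, actions, epsilon)
                target = reward + gamma * Q[(next_state, next_action)]
            # Update after every transition using (St, At, Rt+1, St+1, At+1)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q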
Deep Q-Networks (DQN)
• Introduced deep reinforcement learning
• It is common to use a function approximator Q(s, a; θ) to approximate
the action-value function in Q-learning
• Deep Q-Networks is Q-learning with a deep neural network function
approximator called the Q-network
• Discrete and finite set of actions A
• Example: Breakout has 3 actions – move left, move right, no
movement
• Uses epsilon-greedy policy to select actions
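A small sketch of epsilon-greedy action selection, assuming q_values holds one Q(s, a; θ) estimate per discrete action:

import numpy as np

def select_action(q_values, epsilon):
    # q_values: one Q(s, a; theta) estimate per discrete action
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore: uniformly random action
    return int(np.argmax(q_values))               # exploit: greedy action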
Q-Networks
• Core idea: We want the neural network to learn a non-linear
hierarchy of features or feature representation that gives accurate Q-
value estimates
• The neural network has a separate output unit for each possible
action, which gives the Q-value estimate for that action given the
input state
• The neural network is trained using mini-batch stochastic gradient
updates and experience replay
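A sketch of such a Q-network in PyTorch; the fully connected layer sizes here are illustrative assumptions (DQN itself uses convolutional layers over stacked image frames):

import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, num_actions):
        super().__init__()
        # One output unit per action: a single forward pass yields all Q-value estimates
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, state):
        return self.net(state)      # shape: (batch_size, num_actions)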
Experience Replay
• The state is a sequence of actions and observations st = x1, a1, x2, …, at-1, xt
• Store the agent’s experiences at each time step et = (st, at, rt, st+1) in a
dataset D = e1, ..., en pooled over many episodes into a replay memory
• In practice, only store the last N experience tuples in the replay
memory and sample uniformly from D when performing updates
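A minimal replay memory sketch, where each stored experience is assumed to be a tuple (st, at, rt, st+1, done):

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)        # keeps only the last N experiences

    def store(self, experience):
        self.buffer.append(experience)              # et = (st, at, rt, st+1, done)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)   # uniform random mini-batch from D

    def __len__(self):
        return len(self.buffer)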
State representation
• It is difficult to give the neural network a sequence of arbitrary length
as input
• Use fixed length representation of sequence/history produced by a
function ϕ(st)
• Example: The last 4 image frames in the sequence of Breakout
gameplay
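A sketch of a fixed-length ϕ(st) that stacks the last 4 frames; the frame shape and any preprocessing are assumptions:

import numpy as np
from collections import deque

class FrameStack:
    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # At the start of an episode, fill the history with copies of the first frame
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.phi()

    def add(self, frame):
        self.frames.append(frame)
        return self.phi()

    def phi(self):
        # Fixed-length representation of the history: shape (num_frames, H, W)
        return np.stack(self.frames, axis=0)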
Q-Network Training
• Sample random mini-batch of experience tuples uniformly at random
from D
• Similar to Q-learning update rule but:
• Use mini-batch stochastic gradient updates
• The gradient of the loss function for a given iteration with respect to the parameters θi is the difference between the target value and the actual value, multiplied by the gradient of the Q function approximator Q(s, a; θ) with respect to those parameters
• Use the gradient of the loss function to update the Q function
approximator
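A sketch of one such training iteration in PyTorch, assuming the mini-batch has already been sampled from the replay memory and converted to tensors:

import torch
import torch.nn.functional as F

def train_step(q_network, optimizer, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    # Actual output: Q(s, a; theta) for the actions that were taken
    q_values = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target: r for terminal transitions, else r + gamma * max_a' Q(s', a'; theta)
        next_q = q_network(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    loss = F.mse_loss(q_values, targets)    # squared difference between target and actual
    optimizer.zero_grad()
    loss.backward()                         # gradient of the loss w.r.t. the parameters theta
    optimizer.step()
    return loss.item()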
DQN Algorithm
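A compact sketch of the overall loop, reusing the illustrative QNetwork, ReplayMemory, select_action, and train_step pieces from the sketches above, plus an assumed env.reset()/env.step(action) interface whose states are already fixed-length arrays ϕ(st):

import numpy as np
import torch

def dqn(env, q_network, optimizer, num_episodes=1000, batch_size=32,
        gamma=0.99, epsilon=0.1, capacity=100_000):
    memory = ReplayMemory(capacity)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            with torch.no_grad():
                q_values = q_network(torch.as_tensor(state, dtype=torch.float32))
            action = select_action(q_values.numpy(), epsilon)    # epsilon-greedy
            next_state, reward, done = env.step(action)
            memory.store((state, action, reward, next_state, float(done)))
            if len(memory) >= batch_size:
                # Sample a random mini-batch of experience tuples and take one gradient step
                columns = list(zip(*memory.sample(batch_size)))
                states, actions, rewards, next_states, dones = [
                    torch.as_tensor(np.array(c), dtype=torch.float32) for c in columns]
                train_step(q_network, optimizer,
                           (states, actions.long(), rewards, next_states, dones), gamma)
            state = next_state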
Comments
• It was previously thought that the combination of simple online
reinforcement learning algorithms with deep neural networks was
fundamentally unstable
• The sequence of observed data (states) encountered by an online
reinforcement learning agent is non-stationary and online updates are
strongly correlated
• The technique of DQN is stable because it stores the agent’s data in
experience replay memory so that it can be randomly sampled from
different time-steps
• Aggregating over memory reduces non-stationarity and decorrelates
updates but limits methods to off-policy reinforcement learning
algorithms
• Experience replay updates use more memory and computation per real
interaction than online updates, and require off-policy learning
algorithms that can update from data generated by an older policy
