Reinforcement Learning
COMP3411/9814: Artificial Intelligence
Lecture Overview
• Introduction
• Elements of Reinforcement Learning
• Exploration vs Exploitation
• The agent-environment interface
• Value functions
• Temporal difference prediction
First ideas
Instrumental (operant) conditioning.
Stimulus-behaviour learning.
Thorndike, 1911
Reinforcement learning
• The idea of learning by interacting with the environment.
• No explicit instructor, but a direct sensorimotor connection to the environment.
• Awareness of how the environment responds to what we do.
Reinforcement Learning
Pole balancing
• Pole balancing can be learned the same way.
• The reward might only be received at the end,
• after the pole falls or the cart hits the end of the track.
Pole balancing
And you think pole balancing is trivial?
Types of learning
• Supervised Learning
• Agent is given examples of input/output pairs
• Learns a function from inputs to outputs that agrees with the training examples and
generalises to new examples
• Unsupervised Learning
• Agent is only given inputs
• Tries to find structure in these inputs
• Reinforcement Learning
• Training examples are presented one at a time
• The agent must guess the best output based only on a reward signal, and it tries to maximise the (expected) reward over time
Lecture Overview
• Introduction
• Elements of Reinforcement Learning
• Exploration vs Exploitation
• The agent-environment interface
• Value functions
• Temporal difference prediction
Reinforcement learning
Reinforcement Learning
• RL is learning what to do: mapping situations to actions.
• The agent must be able to sense the environment's state and
perform actions that affect that state.
• Actions may affect not only the immediate reward but also later rewards.
• An important challenge is the exploration/exploitation trade-off.
Elements of reinforcement learning
There are four essential elements:
• Policy
• Informs how to act in a particular situation.
• Set of stimulus-response rules or associations.
• Can be stochastic.
Elements of reinforcement learning
There are four essential elements:
• Reward function
• Defines the aim of an RL problem.
• Maps each perceived state (or state-action pair) into a number,
the reward.
• The goal is to maximise the long-term reward.
• In biological systems it may correspond to feelings of pain and pleasure.
• Can be stochastic.
Elements of reinforcement learning
There are four essential elements:
• Value function
• Specifies what is good in the long run (whereas the reward indicates what is good in an immediate sense).
• In biological systems it corresponds to more refined and farsighted judgments about the future from a given state.
• Actions are decided based on values.
• It is much harder to determine values than rewards.
Elements of reinforcement learning
There are four essential elements:
• Optionally, a model of the environment
• Imitates the environment behaviour.
• Can predict states and reward obtained.
• The use of models of the environment is still relatively new.
Lecture Overview
• Introduction
• Elements of Reinforcement Learning
• Exploration vs Exploitation
• The agent-environment interface
• Value functions
• Temporal difference prediction
Exploration / Exploitation Trade-off
• Most of the time, the agent chooses what it thinks the “best”
action is.
• But to learn, it must occasionally choose something different
from the preferred action.
Exploration / Exploitation Trade-off
Should I stay or should I go now?
Should I stay or should I go now?
If I go, there will be trouble
And if I stay it will be double
-- The Clash
Exploration / Exploitation Trade-off
• The greedy action exploits the
current knowledge.
• The non-greedy action explores.
• Exploitation maximises the expected immediate reward; exploration may produce greater total reward in the long run.
• There is a conflict between
exploration and exploitation.
Action-value estimation methods
• We denote the true value of action a as q*(a).
• We denote its estimated value at time-step t as Qt(a).
• Sample-average estimation: average the rewards received when action a has been selected Ka times:
Qt(a) = (R1 + R2 + … + RKa) / Ka
Action-value estimation methods
• If Ka = 0, Qt(a) is defined as some arbitrary value, e.g., Qt(a) = 0 (not necessarily the best choice).
• As Ka → ∞, Qt(a) converges to q*(a).
Action-value estimation methods
Greedy method
• The simplest way to choose an action: the action with the
highest estimated value.
• At* where Qt(At*) = maxa Qt(a).
ε-greedy method
• A simple alternative: choose the greedy action most of the time, and sometimes (with a small probability ε) choose a random action instead (a sketch follows below).
• Every Qt(a) then converges to q*(a), and the probability of selecting the optimal action converges to more than 1 − ε.
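As a concrete illustration, here is a minimal sketch of ε-greedy action selection with sample-average estimates for a k-armed bandit. It is not from the lecture; the function names and the Gaussian reward model are illustrative assumptions.

    import random

    def epsilon_greedy(Q, epsilon):
        # With probability epsilon, explore: pick a random action.
        if random.random() < epsilon:
            return random.randrange(len(Q))
        # Otherwise exploit: pick an action with the highest estimated value.
        best = max(Q)
        return random.choice([a for a, q in enumerate(Q) if q == best])

    def run_bandit(true_values, epsilon=0.1, steps=1000):
        k = len(true_values)
        Q = [0.0] * k   # estimated action values Qt(a), initialised arbitrarily to 0
        K = [0] * k     # Ka: number of times each action has been selected
        for _ in range(steps):
            a = epsilon_greedy(Q, epsilon)
            reward = random.gauss(true_values[a], 1.0)  # noisy reward around q*(a)
            K[a] += 1
            Q[a] += (reward - Q[a]) / K[a]              # incremental sample average
        return Q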
Action-value estimation methods
• Results averaged over 2000 agents.
• It is possible to reduce ε over time.
Action-value estimation methods
Softmax method
• ε-greedy trades off exploration and exploitation effectively, but when it explores it chooses equally among all actions.
• Sometimes, the worst action is very bad.
• Softmax (Boltzmann) selection instead weights actions by their estimated values, using a temperature parameter τ (see the sketch below):
Pr(a) = e^(Qt(a)/τ) / Σb e^(Qt(b)/τ)
• High temperatures give almost equal probability to all actions.
• Low temperatures make the differences in probability larger.
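A minimal sketch of softmax (Boltzmann) selection consistent with the formula above; the function name softmax_action is an illustrative assumption.

    import math
    import random

    def softmax_action(Q, tau):
        # Convert the estimated values Qt(a) into selection probabilities.
        prefs = [math.exp(q / tau) for q in Q]
        total = sum(prefs)
        probs = [p / total for p in prefs]
        # Sample an action according to these probabilities.
        return random.choices(range(len(Q)), weights=probs, k=1)[0]

A high temperature tau flattens the probabilities (more exploration); a low temperature concentrates them on the greedy action (more exploitation).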
Incremental implementation
• A simple implementation: record all the rewards.
• Problem: growing use of memory and computational cost over
time.
Incremental implementation
• Denote by Qk the estimate at time-step k, i.e., the average of the first k − 1 rewards. The average can then be computed incrementally:
Qk+1 = Qk + (1/k)[Rk − Qk]
Non-stationary problems
• Averaging methods are appropriate for stationary problems.
• In non-stationary problems, it makes sense to give more weight to recent rewards than to older ones.
• Using a constant step-size parameter, 0 < α ≤ 1 (contrasted with the sample average in the sketch below):
Qk+1 = Qk + α[Rk − Qk]
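A minimal sketch contrasting the two update rules, sample average (1/k step size) for stationary problems versus a constant step size α for non-stationary ones; the class name ValueEstimate is an illustrative assumption.

    class ValueEstimate:
        def __init__(self, alpha=None):
            self.q = 0.0      # current estimate Qk
            self.k = 0        # number of rewards seen so far
            self.alpha = alpha

        def update(self, reward):
            self.k += 1
            # Constant step size weights recent rewards more heavily;
            # 1/k gives the plain sample average.
            step = self.alpha if self.alpha is not None else 1.0 / self.k
            self.q += step * (reward - self.q)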
Contextual or associative search
• So far, only non-associative tasks, i.e., no association between
actions and situations.
• In an RL problem, there is more than one situation.
• The aim is to learn a policy: to map situations into actions.
• For instance, a set of n-armed bandits with changing colours.
• If actions also affect the following situation as well as the
reward, then it corresponds to a full RL problem.
Lecture Overview
• Introduction
• Elements of Reinforcement Learning
• Exploration vs Exploitation
• The agent-environment interface
• Value functions
• Temporal difference prediction
The agent-environment interface
• Any method able to solve the problem is considered an RL
method.
• Agent: comprises the learner and the decision maker (although they can be separated!).
• Environment: everything external to the agent that it interacts with.
The agent-environment interface
• Reward: numeric value the agent tries to maximise. Rt+1 ∈ ℝ.
• St ∈ S. S set of possible states.
• At ∈ A(St). A(St) actions available at St.
• The agent implements a mapping from states to action-selection probabilities.
• This is called the agent's policy πt, where πt(a|s) is the probability that At = a given St = s.
• RL methods detail how an agent updates its policy as a result of its
experience.
The agent-environment interface
• Example: recycling robot.
• The agent decides whether to (i) actively search for a can, (ii) remain stationary and wait for a can, or (iii) go back to the home base to recharge (three possible actions).
• The state is determined by the battery level.
• Reward: mostly zero, positive when a can is collected, and strongly negative when the battery runs empty.
Goals and rewards
• The agent's goal is to maximise the total amount of reward received, not the immediate reward.
• A robot learning to walk receives a reward proportional to the
forward movement.
• A robot learning to escape from a maze receives a reward of zero until it escapes, and then receives +1.
• Another strategy is to give a reward of −1 after each move until it escapes.
• An agent learning to play chess receives +1 for winning and -1 for
losing.
Goals and rewards
• The chess player should be rewarded only for winning and not for taking the opponent's pieces.
• Otherwise, the agent may learn to achieve such subgoals even at the cost of losing the game.
• In summary, the reward signal is the way to communicate to the
agent what you want it to achieve, not how you want it achieved.
Episodic and non-episodic tasks
The pole-balancing task. The mountain car task.
Returns
• If the sequence of rewards received is Rt+1, Rt+2, Rt+3, …, we want to maximise the expected return Gt, where in the simplest case Gt = Rt+1 + Rt+2 + … + RT.
• Some tasks have a final state and can be divided into subsequences (episodes).
• Each episode finishes in the final state, and the task then starts over from an initial state.
• These tasks are known as episodic tasks.
Returns
• Tasks intended to be performed continuously, without limit, are referred to as continuing (non-episodic) tasks.
• The return could be infinite, given that T = ∞.
• In this case, the agent maximises the discounted return, choosing actions to maximise
Gt = Rt+1 + γRt+2 + γ²Rt+3 + …
• The discount rate 0 ≤ γ < 1 determines the present value of future rewards. If γ = 0, the agent is myopic; as γ → 1, the agent becomes more farsighted.
Returns (exercise: 3 minutes)
γ | Reward sequence | Return
0.5 | 1, 0, 0, 0, … | 1
0.5 | 0, 0, 2, 0, 0, 0, … | 0.5
0.9 | 0, 0, 2, 0, 0, 0, … | 1.62
0.5 | −1, 2, 6, 3, 2, 0, 0, 0, … | 2
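As a check on the table, the returns can be computed directly from the discounted-return formula (a small illustrative snippet, not part of the slides):

    def discounted_return(rewards, gamma):
        # Gt = Rt+1 + gamma*Rt+2 + gamma^2*Rt+3 + ...
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    print(discounted_return([1], 0.5))               # 1.0
    print(discounted_return([0, 0, 2], 0.5))         # 0.5
    print(discounted_return([0, 0, 2], 0.9))         # approx. 1.62
    print(discounted_return([-1, 2, 6, 3, 2], 0.5))  # 2.0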
Unified Notation
• Episode termination can be treated as entering a special absorbing state that transitions only to itself and generates rewards of zero.
• With this convention, it is possible that T = ∞ or γ = 1, but not both.
The Markov property
• In RL, the state is whatever information is available to the agent (whether processed or not).
• The state is not expected to inform the agent of everything about the environment. For instance, an agent playing blackjack should not know the next card in the deck.
• We do not blame the agent for not knowing something important, but we do blame it for knowing something and then forgetting it.
• Ideally, a state compactly summarises the past while retaining all relevant information. This is called the Markov property. A chess board position is one example.
The Markov property
• Sometimes, the property cannot be fully satisfied.
• In pole-balancing, the state satisfies the property if the exact
position and velocity of the cart are specified, and the pole angle
and its change rate.
• However, there may be distortions, such as delays and other effects like the temperature of the wheels.
• Some studies have even used a simple region division: left, right,
and centre.
Markov Decision Processes
• An RL task that satisfies the Markov property is called a Markov decision process (MDP).
• Markov decision process (MDP): < S, A, δ, r >.
• S is a finite set of states,
• A is a set of actions,
• δ: S × A → S is the transition function, and
• r: S × A → ℝ is the reward function.
Recycling robot MDP
• At each moment, the robot decides whether to (i) actively search for a can, (ii) wait for someone to bring a can, or (iii) go back to the home base to recharge.
• The best strategy is to actively search for cans.
• If the battery runs out, the robot needs to be rescued, which yields a negative reward.
• The agent decides solely as a function of the battery's energy level. Two levels: high, low.
• S = {high, low}.
• Possible decisions (agent’s actions): wait, search, recharge.
• A(high) = {search, wait}.
• A(low) = {search, wait, recharge}.
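A minimal sketch of how these state and action sets might be written down in code (purely illustrative; the transition probabilities and rewards of the full table are not reproduced here):

    # States (battery levels) and the actions available in each state.
    STATES = ["high", "low"]
    ACTIONS = {
        "high": ["search", "wait"],
        "low": ["search", "wait", "recharge"],
    }

    def available_actions(state):
        # A(s): the set of actions the agent may choose in state s.
        return ACTIONS[state]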
Recycling robot MDP
• Transition probabilities and expected rewards are given in a table, with rsearch > rwait.
• Assumption: cans cannot be collected while going back to the home base or when the battery is depleted.
Recycling robot MDP
• Transition graph:
• The transition probabilities out of each state-action pair always sum to 1.
Lecture Overview
• Introduction
• Elements of Reinforcement Learning
• Exploration vs Exploitation
• The agent-environment interface
• Value functions
• Temporal difference prediction
Value function
• Value functions can be estimated as:
• a state-value function, or
• an action-value function (for state-action pairs).
• The function estimates how good it is for the agent to be in a given
state, in terms of future reward (or expected return).
• The value of a state s under a policy π, denoted vπ(s) or Vπ(s), is the expected return when starting in s and following π thereafter:
vπ(s) = Eπ[ Gt | St = s ]
Value function
• The value of a terminal state, if any, is zero.
• The value of taking action a in state s under policy π, denoted qπ(s,a) or Qπ(s,a), is:
qπ(s,a) = Eπ[ Gt | St = s, At = a ]
• Value functions vπ(s) and qπ(s,a) can be estimated from experience.
• If there are many states, it is impractical to keep a separate value for each state.
• In this case, parameterised function approximators are used to represent vπ(s) and qπ(s,a).
Value function
• Try to maximise the expected future reward:
Vπ(st) = rt + γrt+1 + γ²rt+2 + …
• Vπ(st) is the value of state st under policy π
• γ is a discount factor in [0, 1]
Value function
• Vπ(s) is the expected value of following policy π in state s.
• Let V*(s) be the maximum discounted reward obtainable from s,
• i.e. the value of following the optimal policy.
• We make the simplification that actions are deterministic, but we don't know which action to take.
• Other RL algorithms relax this assumption.
Value function
• The red arrows show π*, the optimal policy, with γ = 0.9.
• V*(s) values are shown in red.
Q-values
• How to choose an action in a state?
Q(s, a) = r(s, a) + γV*(s′)
• The Q-value for an action a in a state s is the immediate reward for the action plus the discounted value of following the optimal policy after that action.
• V* is the value obtained by following the optimal policy.
• s′ = δ(s, a) is the succeeding state, assuming the optimal policy is followed thereafter.
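A minimal sketch of this relation for a deterministic environment; delta, reward and v_star are assumed to be supplied, and the names are illustrative:

    GAMMA = 0.9

    def q_value(state, action, delta, reward, v_star):
        # Q(s, a) = r(s, a) + gamma * V*(delta(s, a))
        next_state = delta(state, action)
        return reward(state, action) + GAMMA * v_star[next_state]

    def greedy_action(state, actions, delta, reward, v_star):
        # Choose the action with the highest Q-value in this state.
        return max(actions, key=lambda a: q_value(state, a, delta, reward, v_star))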
Q-values
γ = 0.9
Grid world example
Another grid world example
• Cells correspond to the states.
• There are 4 possible actions.
• Actions that would take the agent off the grid leave its position unchanged but give reward = −1.
• All other actions give reward = 0, except moves out of A and B.
Another grid world example
• All actions are chosen with equal probability (a random policy).
• Discount factor γ = 0.9.
• Values are negative near the lower edge.
• The best state is A, but its expected return is lower than 10, its immediate reward.
• B is valued at more than 5, its immediate reward.
Another grid world example
• The best state is A, but its expected return is lower than 10, its immediate reward.
• B is valued at more than 5, its immediate reward.
• Why?
2 minutes
Another grid world example
• Optimal value function and optimal policy for the grid world.
Lecture Overview
• Introduction
• Elements of Reinforcement Learning
• Exploration vs Exploitation
• The agent-environment interface
• Value functions
• Temporal difference prediction
Temporal-difference (TD) prediction
• TD learning is a central and novel idea in RL.
• Monte Carlo-like methods must wait until the end of the episode to update V(St), since only then is Gt known:
V(St) ← V(St) + α[ Gt − V(St) ]
• The simplest TD method is called TD(0):
V(St) ← V(St) + α[ Rt+1 + γV(St+1) − V(St) ]
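A minimal TD(0) prediction sketch consistent with the update above; the env interface (reset/step) and the policy function are assumed, not defined in the slides:

    from collections import defaultdict

    def td0_prediction(env, policy, episodes, alpha=0.1, gamma=0.9):
        V = defaultdict(float)        # state-value estimates V(s), default 0
        for _ in range(episodes):
            state = env.reset()
            done = False
            while not done:
                action = policy(state)
                next_state, reward, done = env.step(action)
                # TD(0): move V(St) towards Rt+1 + gamma * V(St+1).
                target = reward + (0.0 if done else gamma * V[next_state])
                V[state] += alpha * (target - V[state])
                state = next_state
        return V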
Temporal-difference (TD) prediction
• TD control methods can be on-policy or off-policy.
• TD control learns an action-value function instead of a state-
value function.
• We estimate qπ(s,a) for the current policy π.
• Therefore, we consider transitions from state-action pair to state-
action pair.
Sarsa: On-Policy TD Control
• Sarsa updates after each transition from a non-terminal state St:
Q(St, At) ← Q(St, At) + α[ Rt+1 + γQ(St+1, At+1) − Q(St, At) ]
• If St+1 is terminal, Q(St+1, At+1) is defined as zero.
• Every element of the 5-tuple (St, At, Rt+1, St+1, At+1) is used, which gives the algorithm its name.
• On-policy methods continuously estimate qπ for policy π, and at
the same time change π greedily towards qπ.
Sarsa: On-Policy TD Control
• On-policy TD algorithm:
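Since the algorithm box is not reproduced here, the following is a hedged sketch of the Sarsa loop consistent with the update above; the env interface (reset/step/actions) is assumed:

    import random
    from collections import defaultdict

    def epsilon_greedy_q(Q, state, actions, epsilon):
        # Explore with probability epsilon, otherwise act greedily w.r.t. Q.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def sarsa(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)        # Q(s, a), keyed by (state, action)
        for _ in range(episodes):
            state = env.reset()
            action = epsilon_greedy_q(Q, state, env.actions(state), epsilon)
            done = False
            while not done:
                next_state, reward, done = env.step(action)
                if done:
                    # Q of any action in a terminal state is defined as zero.
                    Q[(state, action)] += alpha * (reward - Q[(state, action)])
                    break
                next_action = epsilon_greedy_q(Q, next_state, env.actions(next_state), epsilon)
                target = reward + gamma * Q[(next_state, next_action)]
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state, action = next_state, next_action
        return Q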
Q-Learning: Off-Policy TD Control
• A simple but important breakthrough was an off-policy TD algorithm.
• The simplest form is one-step Q-learning:
Q(St, At) ← Q(St, At) + α[ Rt+1 + γ maxa Q(St+1, a) − Q(St, At) ]
• The learned action-value function Q directly approximates q*, the optimal action-value function, regardless of the policy being followed.
• The policy still affects which state-action pairs are visited and updated.
Q-Learning: Off-Policy TD Control
• Off-policy TD algorithm:
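Likewise, a hedged sketch of one-step Q-learning, differing from the Sarsa sketch only in its update target (same assumed env interface and epsilon_greedy_q helper as above):

    def q_learning(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)
        for _ in range(episodes):
            state = env.reset()
            done = False
            while not done:
                action = epsilon_greedy_q(Q, state, env.actions(state), epsilon)
                next_state, reward, done = env.step(action)
                # Off-policy target: bootstrap from the best action in the next state,
                # regardless of which action the behaviour policy will actually take.
                best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions(next_state))
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q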
The Cliff Walking
• Reward of −1 for all transitions, except those into the cliff.
• Stepping into the cliff gives a reward of −100 and sends the agent back to the start position.
The Cliff Walking
• ε-greedy, with ε = 0.1.
• Q-learning learns values for the optimal path; Sarsa learns a longer but safer path.
• However, Q-learning's online performance is worse, because exploratory moves occasionally send it off the cliff.
• If ε is gradually reduced, both methods converge asymptotically to the optimal policy.
Actor-critic methods
• Policy approximation.
• Learning is always on-policy.
• The actor maintains the policy structure.
• The critic learns and critiques the policy being followed.
• Minimal computation is needed to select actions, even with continuous-valued actions or stochastic policies.
• The separate actor makes actor-critic methods more appealing as psychological and biological models.
Deep Q-Network
• Proposed by Mnih et al. in 2015.
• Human-level control through deep reinforcement learning.
• Tested in 49 Atari games.
Examples
• Stumpy - A simple learning
robot.
• Stumpy receives a reward
after each action. Did it
move forward or not?
• After each move, updates
its policy.
Examples
• Stumpy - A simple learning
robot.
• Continues trying to maximise
its reward.
• Stumpy after 30 minutes.
Examples
• Another example
Reference
• For a more comprehensive introduction,
you should definitely have a look at:
• Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
• http://www.incompleteideas.net/book/the-book-2nd.html
Feedback
• In case you want to provide anonymous
feedback on these lectures, please visit:
• https://forms.gle/KBkN744QuffuAZLF8
Thank you very much!