COMP3411 Week 4 - RL

The document provides an overview of Reinforcement Learning (RL), outlining its key elements such as policies, reward functions, and value functions, as well as the exploration vs exploitation trade-off. It explains the agent-environment interface, the importance of maximizing long-term rewards, and introduces Markov Decision Processes (MDPs) as a framework for RL tasks. Additionally, it discusses various action-value estimation methods and the significance of value functions in determining the expected future rewards for an agent's actions.


Reinforcement Learning

COMP3411/9814: Artificial Intelligence


Lecture Overview
• Introduction

• Elements of Reinforcement Learning

• Exploration vs Exploitation

• The agent-environment interface

• Value functions

• Temporal difference prediction


Lecture Overview
• Introduction

• Elements of Reinforcement Learning

• Exploration vs Exploitation

• The agent-environment interface

• Value functions

• Temporal difference prediction


First ideas
• Instrumental (operant) conditioning.
• Stimulus-behaviour learning.

Thorndike, 1911

Reinforcement learning

• The idea of learning by interacting with the environment.

• There is no explicit instructor, only a direct sensorimotor connection to the environment.

• We become aware of how our environment responds to what we do.


Reinforcement Learning
Pole balancing

• Pole balancing can be learned in the same way.

• The reward might only be received at the end:
  • after the pole falls or the cart hits the end of the track.
Pole balancing
And you think pole balancing is trivial?
Types of learning
• Supervised Learning
• Agent is given examples of input/output pairs
• Learns a function from inputs to outputs that agrees with the training examples and
generalises to new examples

• Unsupervised Learning
• Agent is only given inputs
• Tries to find structure in these inputs

• Reinforcement Learning
• Training examples are presented one at a time
• The agent must guess the best output based on a reward signal, and tries to maximise the (expected) reward over time
Lecture Overview
• Introduction

• Elements of Reinforcement Learning

• Exploration vs Exploitation

• The agent-environment interface

• Value functions

• Temporal difference prediction


Reinforcement Learning
• RL is learning what to do: mapping situations to actions.

• The agent must be able to sense the state of the environment and perform actions that affect that state.

• Actions may affect not only the immediate reward but also subsequent rewards.

• An important challenge is the exploration vs exploitation trade-off.
Elements of reinforcement learning

There are four essential elements:

• Policy

• Defines how the agent should act in a particular situation.
• A set of stimulus-response rules or associations.
• Can be stochastic.
Elements of reinforcement learning
There are four essential elements:

• Reward function

• Defines the goal of an RL problem.
• Maps each perceived state (or state-action pair) to a number, the reward.
• The agent's goal is to maximise the long-term reward.
• In biological systems, rewards may correspond to sensations of pain and pleasure.
• Can be stochastic.
Elements of reinforcement learning
There are four essential elements:

• Value function

• Specifies what is good in the long run (whereas the reward indicates what is good in an immediate sense).
• In biological systems, values correspond to more refined judgements or foresight about the future from a given state.
• Actions are chosen based on values.
• It is much harder to determine values than rewards.
Elements of reinforcement learning

There are four essential elements:

• Optionally, a model of the environment

• Mimics the behaviour of the environment.
• Can predict the next state and the reward that will be obtained.
• The use of environment models in RL is still relatively new.
Lecture Overview
• Introduction

• Elements of Reinforcement Learning

• Exploration vs Exploitation

• The agent-environment interface

• Value functions

• Temporal difference prediction


Exploration / Exploitation Trade-off

• Most of the time, the agent chooses what it currently thinks is the “best” action.

• But to learn, it must occasionally choose something different from the preferred action.
Exploration / Exploitation Trade-off

Should I stay or should I go now?


Should I stay or should I go now?
If I go, there will be trouble
And if I stay it will be double

-- The Clash
Exploration / Exploitation Trade-off
• The greedy action exploits the agent's current knowledge.
• A non-greedy action explores.
• Exploitation maximises the immediate reward; exploration may yield greater reward in the long run.
• There is a conflict between exploration and exploitation.
Action-value estimation methods
• We denote the true action value as q*(a).
• We denote its estimated value at time-step t as Qt(a).
• Sample-average estimation: average the rewards received when action a has been selected Ka times (see the formula below).
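
For reference, the sample-average estimate is:

   Qt(a) = (R1 + R2 + … + RKa) / Ka

where R1, …, RKa are the rewards received on the Ka selections of action a.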
Action-value estimation methods
• If Ka = 0, Qt(a) is given an arbitrary default value, e.g., Qt(a) = 0 (not necessarily the best choice).
• As Ka → ∞, Qt(a) converges to q*(a).
Action-value estimation methods
Greedy method
• The simplest way to choose an action: select the action with the highest estimated value.
• That is, select At* such that Qt(At*) = maxa Qt(a).

ε-greedy method
• A simple alternative: choose the best action most of the time, and sometimes (with a small probability ε) a uniformly random action.
• Because every action continues to be sampled, each Qt(a) converges to q*(a), and the probability of selecting the optimal action converges to more than 1 - ε.
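
As an illustration, a minimal ε-greedy selection sketch in Python (the function and variable names are mine, not from the lecture):

```python
import random

def epsilon_greedy(Q, epsilon):
    """Choose an arm index from the estimated values Q (a list) using epsilon-greedy."""
    if random.random() < epsilon:
        return random.randrange(len(Q))            # explore: uniformly random action
    return max(range(len(Q)), key=lambda a: Q[a])  # exploit: greedy action
```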
Action-value estimation methods
• Results averaged over 2000 agents.
• It is possible to reduce ε over time.
Action-value estimation methods
Softmax method
• ε-greedy effectively trades off exploration and exploitation, but when it explores it chooses equally among all actions.
• This can be a problem when the worst action is very bad.
• Softmax (Boltzmann) selection instead weights actions by their estimated values.
• High temperatures give almost equal probability to all actions.
• Low temperatures make the differences in probability larger.
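
A minimal softmax (Boltzmann) selection sketch, assuming the usual Gibbs distribution over Qt(a)/τ (names are illustrative):

```python
import math
import random

def softmax_action(Q, tau):
    """Sample an arm index with probability proportional to exp(Q[a] / tau)."""
    prefs = [math.exp(q / tau) for q in Q]  # higher tau -> flatter preferences
    return random.choices(range(len(Q)), weights=prefs, k=1)[0]
```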
Incremental implementation
• A naive implementation: record all the rewards and recompute the average.

• Problem: memory use and computational cost grow over time.
Incremental implementation
• Denote by Qk the estimate at step k, i.e., the average of the first k-1 rewards; then the estimate can be computed incrementally (see below):
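
For reference, the standard incremental-average update (Sutton & Barto) is:

   Qk+1 = Qk + (1/k) [Rk - Qk]

i.e., NewEstimate ← OldEstimate + StepSize [Target - OldEstimate].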
Non-stationary problems
• Sample-average methods are appropriate for stationary problems.
• In non-stationary problems, it makes sense to give more weight to recent rewards than to older ones.
• This is done using a constant step-size parameter 0 < α ≤ 1 (see below):
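
With a constant step size the update becomes:

   Qk+1 = Qk + α [Rk - Qk]

which gives exponentially more weight to recent rewards (a recency-weighted average).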
Contextual or associative search
• So far we have considered only non-associative tasks, i.e., there is no association between actions and situations.
• In a full RL problem, there is more than one situation.
• The aim is to learn a policy: a mapping from situations to actions.
• For instance, a set of n-armed bandits whose identity is signalled by a changing colour.
• If actions also affect the following situation as well as the reward, then we have the full RL problem.
Lecture Overview
• Introduction

• Elements of Reinforcement Learning

• Exploration vs Exploitation

• The agent-environment interface

• Value functions

• Temporal difference prediction


The agent-environment interface
• Any method able to solve such problems is considered an RL method.
• Agent: comprises the learner and the decision maker (although these can be separated!).
• Environment: everything external to the agent that it interacts with.
The agent-environment interface
• Reward: a numeric value the agent tries to maximise, Rt+1 ∈ ℝ.
• St ∈ S, where S is the set of possible states.
• At ∈ A(St), where A(St) is the set of actions available in St.
• The agent implements a mapping from states to action-selection probabilities.
• This is called the agent's policy πt, where πt(a|s) is the probability that At = a given St = s.
• RL methods specify how the agent updates its policy as a result of its experience.
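
A minimal sketch of this interface as code; the env.reset/env.step/agent methods are hypothetical placeholders, not an API from the lecture:

```python
def run_episode(env, agent, max_steps=1000):
    """One episode of the agent-environment loop: observe S_t, act A_t, receive R_{t+1}, S_{t+1}."""
    state = env.reset()
    for _ in range(max_steps):
        action = agent.act(state)                    # sample A_t ~ pi_t(. | S_t)
        next_state, reward, done = env.step(action)  # environment returns R_{t+1}, S_{t+1}
        agent.update(state, action, reward, next_state)
        state = next_state
        if done:
            break
```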
The agent-environment interface
• Example: recycling robot.
• The agent decides whether to (i) actively search for a can, (ii) remain stationary and wait for a can, or (iii) go back to the home base to recharge (three possible actions).
• The state is determined by the battery level.
• Reward: mostly zero, positive when the robot collects a can, and strongly negative when the battery runs empty.
Goals and rewards
• The agent's goal is to maximise the total reward it receives, not the immediate reward.
• A robot learning to walk receives a reward proportional to its forward movement.
• A robot learning to escape from a maze receives a reward of zero until it escapes, and then receives +1.
• An alternative is to give a reward of -1 after each movement until it escapes.
• An agent learning to play chess receives +1 for winning and -1 for losing.
Goals and rewards
• The chess player should be rewarded only for winning, not for taking the opponent's pieces.
• Otherwise, the agent may learn to maximise such subgoals at the expense of the real goal.
• In summary, the reward signal is the way to communicate to the agent what you want it to achieve, not how you want it achieved.
Episodic and non-episodic tasks

The pole-balancing task. The mountain car task.


Returns
• If the reward sequence received after time t is Rt+1, Rt+2, Rt+3, …, we want to maximise the expected return Gt.

• Some tasks have a final state and can be divided into subsequences (episodes).
• Each episode finishes in the final state, and the task then starts over from an initial state.
• Such tasks are known as episodic tasks.
Returns
• Tasks intended to run continuously without limit are referred to as continuing (or non-episodic) tasks.
• The return could be infinite, since T = ∞.
• In this case, the agent maximises the discounted return, choosing actions to maximise it (see below).
• The discount rate 0 ≤ γ < 1 determines the present value of future rewards. If γ = 0, the agent is myopic; as γ → 1, the agent becomes more foresighted.
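
For reference, the discounted return is:

   Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + … = Σk≥0 γ^k Rt+k+1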
Returns (exercise, ~3 minutes)

   γ     Reward sequence         Return
   0.5   1 0 0 0 …               1
   0.5   0 0 2 0 0 0 …           0.5
   0.9   0 0 2 0 0 0 …           1.62
   0.5   -1 2 6 3 2 0 0 0 …      2
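
A quick sketch to check these values, assuming the listed rewards start at Rt+1 and are zero afterwards:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma**k * R_{t+k+1}, for a finite reward list followed by zeros."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1], 0.5))               # 1.0
print(discounted_return([0, 0, 2], 0.5))         # 0.5
print(discounted_return([0, 0, 2], 0.9))         # ~1.62
print(discounted_return([-1, 2, 6, 3, 2], 0.5))  # 2.0
```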
Unified Notation
• Episodic and continuing tasks can be unified by adding a final absorbing state that transitions only to itself with reward zero.

• It is possible that T = ∞ or γ = 1, but not both.


The Markov property
• In RL, the state is whatever information is available to the agent (whether processed or not).
• The state is not expected to tell the agent everything about the environment. For instance, an agent playing blackjack is not expected to know the next card in the deck.
• We do not blame the agent for not knowing something important, but we do blame it for having known something and then forgetting it.
• Ideally, a state compactly summarises the past while retaining all relevant information. This is called the Markov property. The chess board position is an example.
The Markov property
• Sometimes the property cannot be fully satisfied.
• In pole balancing, the state satisfies the property if it specifies the exact position and velocity of the cart, the pole angle, and the pole's rate of change of angle.
• However, there may be distortions, such as delays and other effects, e.g., the temperature of the wheels.
• Some studies have even used a simple division of the cart position into regions: left, right, and centre.
Markov Decision Processes

• An RL task that satisfies the Markov property is called a Markov decision process (MDP).
• Markov decision process (MDP): ⟨S, A, δ, r⟩, where
• S is a finite set of states,
• A is a set of actions,
• δ is the transition function δ: S × A → S, and
• r is the reward function r: S × A → ℝ.
Recycling robot MDP
• At each decision point, the robot decides whether to (i) actively search for a can, (ii) wait for someone to bring a can, or (iii) go back to the home base to recharge.
• The best way to find cans is to search actively.
• If the battery runs out, the robot must be rescued, which leads to a negative reward.
• The agent decides solely as a function of the battery's energy level, which has two levels: high and low.
• S = {high, low}.
• Possible decisions (agent’s actions): wait, search, recharge.
• A(high) = {search, wait}.
• A(low) = {search, wait, recharge}.
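
A small sketch of these sets as Python data; the transition entries are only indicated schematically, with alpha and r_search as placeholder names:

```python
# State and per-state action sets for the recycling robot
STATES = ["high", "low"]
ACTIONS = {
    "high": ["search", "wait"],
    "low":  ["search", "wait", "recharge"],
}

# A tabular model could map (state, action) to a list of (probability, next_state, reward),
# e.g. ("high", "search") -> [(alpha, "high", r_search), (1 - alpha, "low", r_search)]
```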
Recycling robot MDP
• Transition probabilities and expected rewards, with rsearch > rwait.

• Assumption: cans cannot be collected while returning to the home base or while the battery is depleted.
Recycling robot MDP
• Transition graph:

• The transition probabilities out of each state-action pair always sum to 1.


Lecture Overview
• Introduction

• Elements of Reinforcement Learning

• Exploration vs Exploitation

• The agent-environment interface

• Value functions

• Temporal difference prediction


Value function
• Value functions can be estimated as:
• a state-value function, or
• an action-value function (for state-action pairs).

• The value function estimates how good it is for the agent to be in a given state, in terms of future reward (expected return).
• The value of a state s under a policy π, denoted vπ(s) or Vπ(s), is the expected return when starting in s and following π thereafter:
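
For reference, the standard definition is:

   vπ(s) = Eπ[ Gt | St = s ] = Eπ[ Σk≥0 γ^k Rt+k+1 | St = s ]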


Value function
• The value of a terminal state, if any, is zero.
• The value of taking action a in state s under policy π is denoted qπ(s,a) or Qπ(s,a):
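
For reference, the corresponding definition is:

   qπ(s,a) = Eπ[ Gt | St = s, At = a ]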

• Value functions vπ(s) and qπ(s,a) can be estimated from experience.


• If there are many states, it is impractical to store a value for every state.
• In this case, parameterised function approximators are used to represent vπ(s) and qπ(s,a).
Value function

• Try to maximise expected future reward:

• Vπ(st) is the value of state st under policy π

• γ is a discount factor in [0, 1]
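
In this notation (one common convention), the quantity being maximised is the discounted sum of future rewards:

   Vπ(st) = E[ rt + γ rt+1 + γ² rt+2 + … ]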


Value function

• Vπ(s) is the expected value of following policy π in state s.

• V*(s) is the maximum discounted reward obtainable from s,

• i.e. the value of following the optimal policy.

• We make the simplification that actions are deterministic, but we don't know which action to take.

• Other RL algorithms relax this assumption.


Value function

• The red arrows show π*, the optimal policy, with γ = 0.9.

• V*(s) values are shown in red.
Q-values
• How do we choose an action in a state?

   Q(s, a) = r(s, a) + γ V*(s′)

• The Q-value of an action a in a state s is the immediate reward for the action plus the discounted value of following the optimal policy after that action.

• V* is the value obtained by following the optimal policy.

• s′ = δ(s, a) is the succeeding state, assuming the optimal policy.
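
Once the Q-values are known, acting optimally is simply picking the action with the highest Q-value in each state:

   π*(s) = argmaxa Q(s, a)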
Q-values

γ = 0.9

Grid world example
Another grid world example

• Cells correspond to states.
• There are 4 possible actions.
• Actions that would take the agent off the grid leave its position unchanged but give a reward of -1.
• All other actions give a reward of 0, except moves from states A and B.
Another grid world example

• All actions are chosen with equal probability.
• Discount factor γ = 0.9.
• Values are negative near the lower edge.
• The best state is A, but its expected return is lower than 10, its immediate reward.
• B is valued at more than 5, its immediate reward.
Another grid world example

• The best state is A, but its expected return is lower than 10, its immediate reward.
• B is valued at more than 5, its immediate reward.
• Why? (Take ~2 minutes to think about it.)
Another grid world example

• Optimal value function and optimal policy for the grid world.
Lecture Overview
• Introduction

• Elements of Reinforcement Learning

• Exploration vs Exploitation

• The agent-environment interface

• Value functions

• Temporal difference prediction


Temporal-difference (TD) prediction
• TD learning is one of the central and novel ideas in RL.
• Monte Carlo methods must wait until the end of the episode to update V(St), because only at that point is Gt known.
• The simplest TD method is called TD(0); both updates are shown below.
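
For reference, the two updates are:

   Monte Carlo:  V(St) ← V(St) + α [ Gt - V(St) ]
   TD(0):        V(St) ← V(St) + α [ Rt+1 + γ V(St+1) - V(St) ]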


Temporal-difference (TD) prediction
• These methods can be on-policy or off-policy.
• TD control learns an action-value function instead of a state-value function.
• We estimate qπ(s,a) for the current policy π.
• Therefore, we consider transitions from state-action pair to state-action pair.
Sarsa: On-Policy TD Control
• The update is applied after each transition from a non-terminal state St (see below).
• If St+1 is terminal, Q(St+1, At+1) is defined as zero.
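
The Sarsa update, for reference:

   Q(St, At) ← Q(St, At) + α [ Rt+1 + γ Q(St+1, At+1) - Q(St, At) ]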

• Every element of the 5-tuple (St, At, Rt+1, St+1, At+1) is used, which gives the algorithm its name.
• On-policy methods continually estimate qπ for the current policy π, and at the same time move π greedily towards qπ.
Sarsa: On-Policy TD Control
• On-policy TD algorithm:
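
A minimal tabular Sarsa sketch, assuming a hypothetical env with reset(), step() and actions() methods (an illustrative reconstruction, not the lecture's own pseudocode):

```python
import random
from collections import defaultdict

def sarsa(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0

    def choose(state):      # epsilon-greedy behaviour policy
        if random.random() < epsilon:
            return random.choice(env.actions(state))
        return max(env.actions(state), key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = None if done else choose(s2)
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # Sarsa update
            s, a = s2, a2
    return Q
```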
Q-Learning: Off-Policy TD Control
• A simple but important breakthrough was an off-policy TD control algorithm, Q-learning.
• Its simplest form is one-step Q-learning (see below).
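
The one-step Q-learning update, for reference:

   Q(St, At) ← Q(St, At) + α [ Rt+1 + γ maxa Q(St+1, a) - Q(St, At) ]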

• The learned action-value function Q directly approximates q*, the optimal action-value function, regardless of the policy π being followed.
• The policy still affects which state-action pairs are visited and updated.
Q-Learning: Off-Policy TD Control
• Off-policy TD algorithm:
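
A matching tabular Q-learning sketch under the same assumed env interface as the Sarsa sketch above:

```python
import random
from collections import defaultdict

def q_learning(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # behaviour policy: epsilon-greedy with respect to the current Q
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            # target uses the greedy (max) action, independent of the behaviour policy
            best_next = 0.0 if done else max(Q[(s2, x)] for x in env.actions(s2))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```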
The Cliff Walking
• Reward of -1 for all transitions, except those into the cliff.
• Stepping into the cliff gives a reward of -100 and sends the agent back to the start position.
The Cliff Walking
• Action selection is ε-greedy, with ε = 0.1.
• Q-learning learns the optimal path along the cliff edge; Sarsa learns the longer but safer path.
• However, Q-learning's overall online performance is worse, because exploratory moves near the cliff occasionally fall in.
• If ε is gradually reduced, both methods converge asymptotically to the optimal policy.
Actor-critic methods
• The policy is represented explicitly (policy approximation).
• Learning is always on-policy.
• The actor is the structure that holds the policy.
• The critic learns to evaluate (critique) the policy being followed.
• Action selection requires minimal computation, even for continuous-valued actions or stochastic policies.
• The separate actor structure also makes actor-critic methods more appealing as psychological and biological models.
Deep Q-Network
• Proposed by Mnih et al. in 2015.
• Human-level control through deep reinforcement learning.
• Tested on 49 Atari games.
Examples
• Stumpy: a simple learning robot.

• Stumpy receives a reward after each action: did it move forward or not?
• After each move, it updates its policy.
Examples
• Stumpy: a simple learning robot.

• It keeps trying to maximise its reward.
• Stumpy after 30 minutes of learning.
Examples

• Another example
Reference
• For a more comprehensive introduction, you should definitely have a look at:
  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
  • http://www.incompleteideas.net/book/the-book-2nd.html
Feedback
• If you want to provide anonymous feedback on these lectures, please visit:

• https://forms.gle/KBkN744QuffuAZLF8

Thank you very much!
