Reinforcement Learning
COMP3411/9814: Artificial Intelligence
Lecture Overview
• Introduction
• Elements of Reinforcement Learning
• Exploration vs Exploitation
• The agent-environment interface
• Value functions
• Temporal difference prediction
First ideas
Instrumental (operant) conditioning.
Stimulus-behaviour learning.
Thorndike, 1911
Reinforcement learning
• The idea of learning by interacting with the environment.
• No explicit instructor, but a direct sensorimotor connection to the environment.
• Awareness of how the environment responds to what we do.
Reinforcement Learning
Pole balancing
• Pole balancing can be learned the same way.
• The reward might only be received at the end,
• after the pole falls or the cart hits the end of the track.
Pole balancing
And you think pole balancing is trivial?
Types of learning
• Supervised Learning
• Agent is given examples of input/output pairs
• Learns a function from inputs to outputs that agrees with the training examples and
generalises to new examples
• Unsupervised Learning
• Agent is only given inputs
• Tries to find structure in these inputs
• Reinforcement Learning
• Training examples are presented one at a time
• The agent must guess the best output based only on a reward signal, and it tries to maximise the (expected) reward over time
Lecture Overview
• Introduction
• Elements of Reinforcement Learning
• Exploration vs Exploitation
• The agent-environment interface
• Value functions
• Temporal difference prediction
Reinforcement learning
Reinforcement Learning
• RL is learning what to do: mapping situations to actions.
• The agent must be able to sense the environment's state and
perform actions that affect that state.
• Actions may affect not only the immediate reward but also later rewards.
• An important challenge is the exploration/exploitation trade-off.
Elements of reinforcement learning
There are four essential elements:
• Policy
• Informs how to act in a particular situation.
• Set of stimulus-response rules or associations.
• Can be stochastic.
Elements of reinforcement learning
There are four essential elements:
• Reward function
• Defines the aim of an RL problem.
• Maps each perceived state (or state-action pair) into a number,
the reward.
• The goal is to maximise the long-term reward.
• In biological systems it may correspond to feelings of pain and pleasure.
• Can be stochastic.
Elements of reinforcement learning
There are four essential elements:
• Value function
• Specifies what is good in the long run (whereas the reward indicates what is good in an immediate sense).
• In biological systems it corresponds to more refined and farsighted judgments about the future from a given state.
• Actions are decided based on values.
• It is much harder to determine values than rewards.
Elements of reinforcement learning
There are four essential elements:
• Optionally, a model of the environment
• Imitates the environment behaviour.
• Can predict states and reward obtained.
• The use of models of the environment is still relatively new.
Lecture Overview
• Introduction
• Elements of Reinforcement Learning
• Exploration vs Exploitation
• The agent-environment interface
• Value functions
• Temporal difference prediction
Exploration / Exploitation Trade-off
• Most of the time, the agent chooses what it thinks the “best”
action is.
• But to learn, it must occasionally choose something different
from the preferred action.
Exploration / Exploitation Trade-off
Should I stay or should I go now?
Should I stay or should I go now?
If I go, there will be trouble
And if I stay it will be double
-- The Clash
Exploration / Exploitation Trade-off
• The greedy action exploits the
current knowledge.
• The non-greedy action explores.
• Exploitation maximises the expected immediate reward; exploration may produce greater total reward in the long run.
• There is a conflict between
exploration and exploitation.
Action-value estimation methods
• We denote the true value of action a as q*(a).
• We denote its estimated value at time-step t as Qt(a).
• Sample-average estimation: average the rewards received when action a has been selected Ka times:
Qt(a) = (R1 + R2 + … + RKa) / Ka
Action-value estimation methods
• If Ka = 0, Qt(a) is defined as some arbitrary value, e.g., Qt(a) = 0 (not necessarily the best choice).
• As Ka → ∞, Qt(a) converges to q*(a).
Action-value estimation methods
Greedy method
• The simplest way to choose an action: the action with the
highest estimated value.
• At* where Qt(At*) = maxa Qt(a).
ε-greedy method
• A simple alternative: choose the greedy action most of the time, and sometimes (with a small probability ε) choose a random action instead (a sketch follows below).
• Every Qt(a) then converges to q*(a), and the probability of selecting the optimal action converges to more than 1 − ε.
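As a concrete illustration, here is a minimal sketch of ε-greedy action selection with sample-average estimates for a k-armed bandit. It is not from the lecture; the function names and the Gaussian reward model are illustrative assumptions.

    import random

    def epsilon_greedy(Q, epsilon):
        # With probability epsilon, explore: pick a random action.
        if random.random() < epsilon:
            return random.randrange(len(Q))
        # Otherwise exploit: pick an action with the highest estimated value.
        best = max(Q)
        return random.choice([a for a, q in enumerate(Q) if q == best])

    def run_bandit(true_values, epsilon=0.1, steps=1000):
        k = len(true_values)
        Q = [0.0] * k   # estimated action values Qt(a), initialised arbitrarily to 0
        K = [0] * k     # Ka: number of times each action has been selected
        for _ in range(steps):
            a = epsilon_greedy(Q, epsilon)
            reward = random.gauss(true_values[a], 1.0)  # noisy reward around q*(a)
            K[a] += 1
            Q[a] += (reward - Q[a]) / K[a]              # incremental sample average
        return Q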
Action-value estimation methods
• Results averaged over 2000 agents.
• It is possible to reduce ε over time.
Action-value estimation methods
Softmax method
• ε-greedy trades off exploration and exploitation effectively, but when it explores it chooses equally among all actions.
• Sometimes, the worst action is very bad.
• Softmax (Boltzmann) selection instead weights actions by their estimated values, using a temperature parameter τ (see the sketch below):
Pr(a) = e^(Qt(a)/τ) / Σb e^(Qt(b)/τ)
• High temperatures give almost equal probability to all actions.
• Low temperatures make the differences in probability larger.
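A minimal sketch of softmax (Boltzmann) selection consistent with the formula above; the function name softmax_action is an illustrative assumption.

    import math
    import random

    def softmax_action(Q, tau):
        # Convert the estimated values Qt(a) into selection probabilities.
        prefs = [math.exp(q / tau) for q in Q]
        total = sum(prefs)
        probs = [p / total for p in prefs]
        # Sample an action according to these probabilities.
        return random.choices(range(len(Q)), weights=probs, k=1)[0]

A high temperature tau flattens the probabilities (more exploration); a low temperature concentrates them on the greedy action (more exploitation).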
Incremental implementation
• A simple implementation: record all the rewards.
• Problem: growing use of memory and computational cost over
time.
Incremental implementation
• Denote by Qk the estimate at time-step k, i.e., the average of the first k − 1 rewards. The average can then be computed incrementally:
Qk+1 = Qk + (1/k)[Rk − Qk]
Non-stationary problems
• Averaging methods are appropriate for stationary problems.
• In non-stationary problems, it makes sense to give more weight to recent rewards than to older ones.
• Using a constant step-size parameter, 0 < α ≤ 1 (contrasted with the sample average in the sketch below):
Qk+1 = Qk + α[Rk − Qk]
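A minimal sketch contrasting the two update rules, sample average (1/k step size) for stationary problems versus a constant step size α for non-stationary ones; the class name ValueEstimate is an illustrative assumption.

    class ValueEstimate:
        def __init__(self, alpha=None):
            self.q = 0.0      # current estimate Qk
            self.k = 0        # number of rewards seen so far
            self.alpha = alpha

        def update(self, reward):
            self.k += 1
            # Constant step size weights recent rewards more heavily;
            # 1/k gives the plain sample average.
            step = self.alpha if self.alpha is not None else 1.0 / self.k
            self.q += step * (reward - self.q)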
Contextual or associative search
• So far, only non-associative tasks, i.e., no association between
actions and situations.
• In an RL problem, there is more than one situation.
• The aim is to learn a policy: to map situations into actions.
• For instance, a set of n-armed bandits with changing colours.
• If actions also affect the following situation as well as the
reward, then it corresponds to a full RL problem.
Lecture Overview
• Introduction
• Elements of Reinforcement Learning
• Exploration vs Exploitation
• The agent-environment interface
• Value functions
• Temporal difference prediction
The agent-environment interface
• Any method able to solve the problem is considered an RL
method.
• Agent: comprises the learner and the decision maker (although they can be separated!).
• Environment: everything external to the agent that it interacts with.
The agent-environment interface
• Reward: numeric value the agent tries to maximise. Rt+1 ∈ ℝ.
• St ∈ S. S set of possible states.
• At ∈ A(St). A(St) actions available at St.
• The agent implements a mapping from states to action-selection probabilities.
• This is called the agent's policy πt, where πt(a|s) is the probability that At = a given St = s.
• RL methods detail how an agent updates its policy as a result of its
experience.
The agent-environment interface
• Example: recycling robot.
• The agent decides whether to (i) actively search for a can, (ii) remain stationary and wait for a can, or (iii) go back to the home base to recharge (three possible actions).
• The state is determined by the battery level.
• Reward: mostly zero, positive when a can is collected, and strongly negative when the battery runs empty.
Goals and rewards
• The agent's goal is to maximise the total amount of reward received, not the immediate reward.
• A robot learning to walk receives a reward proportional to the
forward movement.
• A robot learning to escape from a maze receives a reward of zero until it escapes, and then receives +1.
• Another strategy is to give a reward of −1 after each move until it escapes.
• An agent learning to play chess receives +1 for winning and -1 for
losing.
Goals and rewards
• The chess player should be rewarded only for winning and not for taking the opponent's pieces.
• Otherwise, the agent may learn to achieve such subgoals even at the cost of losing the game.
• In summary, the reward signal is the way to communicate to the
agent what you want it to achieve, not how you want it achieved.
Episodic and non-episodic tasks
The pole-balancing task. The mountain car task.
Returns
• If the sequence of rewards received is Rt+1, Rt+2, Rt+3, …, we want to maximise the expected return Gt, where in the simplest case Gt = Rt+1 + Rt+2 + … + RT.
• Some tasks have a final state and can be divided into subsequences (episodes).
• Each episode finishes in the final state, and the task then starts over from an initial state.
• These tasks are known as episodic tasks.
Returns
• Tasks intended to be performed continuously, without limit, are referred to as continuing (non-episodic) tasks.
• The return could be infinite, given that T = ∞.
• In this case, the agent maximises the discounted return, choosing actions to maximise
Gt = Rt+1 + γRt+2 + γ²Rt+3 + …
• The discount rate 0 ≤ γ < 1 determines the present value of future rewards. If γ = 0, the agent is myopic; as γ → 1, the agent becomes more farsighted.
Returns (exercise: 3 minutes)
γ | Reward sequence | Return
0.5 | 1, 0, 0, 0, … | 1
0.5 | 0, 0, 2, 0, 0, 0, … | 0.5
0.9 | 0, 0, 2, 0, 0, 0, … | 1.62
0.5 | −1, 2, 6, 3, 2, 0, 0, 0, … | 2
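As a check on the table, the returns can be computed directly from the discounted-return formula (a small illustrative snippet, not part of the slides):

    def discounted_return(rewards, gamma):
        # Gt = Rt+1 + gamma*Rt+2 + gamma^2*Rt+3 + ...
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    print(discounted_return([1], 0.5))               # 1.0
    print(discounted_return([0, 0, 2], 0.5))         # 0.5
    print(discounted_return([0, 0, 2], 0.9))         # approx. 1.62
    print(discounted_return([-1, 2, 6, 3, 2], 0.5))  # 2.0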
Unified Notation
• Episode termination can be treated as entering a special absorbing state that transitions only to itself and generates rewards of zero.
• With this convention, it is possible that T = ∞ or γ = 1, but not both.
The Markov property
• In RL, the state is whatever information is available to the agent (whether processed or not).
• The state is not expected to inform the agent of everything about the environment. For instance, an agent playing blackjack should not know the next card in the deck.
• We do not blame the agent for not knowing something important, but we do blame it for knowing something and then forgetting it.
• Ideally, a state compactly summarises the past while retaining all relevant information. This is called the Markov property. A chess board position is one example.
The Markov property
• Sometimes, the property cannot be fully satisfied.
• In pole-balancing, the state satisfies the property if the exact
position and velocity of the cart are specified, and the pole angle
and its change rate.
• However, there may be distortions, such as delays and other effects like the temperature of the wheels.
• Some studies have even used a simple region division: left, right,
and centre.
Markov Decision Processes
• An RL task that satisfies the Markov property is called a Markov decision process (MDP).
• Markov decision process (MDP): < S, A, δ, r >.
• S is a finite set of states,
• A is a set of actions,
• δ: S × A → S is the transition function, and
• r: S × A → ℝ is the reward function.
Recycling robot MDP
• At each moment, the robot decides whether to (i) actively search for a can, (ii) wait for someone to bring a can, or (iii) go back to the home base to recharge.
• The best strategy is to actively search for cans.
• If the battery runs out, the robot needs to be rescued, which yields a negative reward.
• The agent decides solely as a function of the battery's energy level. Two levels: high, low.
• S = {high, low}.
• Possible decisions (agent’s actions): wait, search, recharge.
• A(high) = {search, wait}.
• A(low) = {search, wait, recharge}.
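A minimal sketch of how these state and action sets might be written down in code (purely illustrative; the transition probabilities and rewards of the full table are not reproduced here):

    # States (battery levels) and the actions available in each state.
    STATES = ["high", "low"]
    ACTIONS = {
        "high": ["search", "wait"],
        "low": ["search", "wait", "recharge"],
    }

    def available_actions(state):
        # A(s): the set of actions the agent may choose in state s.
        return ACTIONS[state]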
Recycling robot MDP
• Transition probabilities and expected rewards are given in a table, with rsearch > rwait.
• Assumption: cans cannot be collected while going back to the home base or when the battery is depleted.
Recycling robot MDP
• Transition graph:
• The transition probabilities out of each state-action pair always sum to 1.
Lecture Overview
• Introduction
• Elements of Reinforcement Learning
• Exploration vs Exploitation
• The agent-environment interface
• Value functions
• Temporal difference prediction
Value function
• Value functions can be estimated as:
• a state-value function, or
• an action-value function (for state-action pairs).
• The function estimates how good it is for the agent to be in a given
state, in terms of future reward (or expected return).
• The value of a state s under a policy π, denoted vπ(s) or Vπ(s), is the expected return when starting in s and following π thereafter:
vπ(s) = Eπ[ Gt | St = s ]
Value function
• The value of a terminal state, if any, is zero.
• The value of taking action a in state s under policy π, denoted qπ(s,a) or Qπ(s,a), is:
qπ(s,a) = Eπ[ Gt | St = s, At = a ]
• Value functions vπ(s) and qπ(s,a) can be estimated from experience.
• If there are many states, it is impractical to keep a separate value for each state.
• In this case, parameterised function approximators are used to represent vπ(s) and qπ(s,a).
Value function
• Try to maximise the expected future reward:
Vπ(st) = rt + γrt+1 + γ²rt+2 + …
• Vπ(st) is the value of state st under policy π
• γ is a discount factor in [0, 1]
Value function
• Vπ(s) is the expected value of following policy π in state s.
• Let V*(s) be the maximum discounted reward obtainable from s,
• i.e. the value of following the optimal policy.
• We make the simplification that actions are deterministic, but we don't know which action to take.
• Other RL algorithms relax this assumption.
Value function
• The red arrows show π*, the optimal policy, with γ = 0.9.
• V*(s) values are shown in red.
Q-values
• How to choose an action in a state?
Q(s, a) = r(s, a) + γV*(s′)
• The Q-value for an action a in a state s is the immediate reward for the action plus the discounted value of following the optimal policy after that action.
• V* is the value obtained by following the optimal policy.
• s′ = δ(s, a) is the succeeding state, assuming the optimal policy is followed thereafter.
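A minimal sketch of this relation for a deterministic environment; delta, reward and v_star are assumed to be supplied, and the names are illustrative:

    GAMMA = 0.9

    def q_value(state, action, delta, reward, v_star):
        # Q(s, a) = r(s, a) + gamma * V*(delta(s, a))
        next_state = delta(state, action)
        return reward(state, action) + GAMMA * v_star[next_state]

    def greedy_action(state, actions, delta, reward, v_star):
        # Choose the action with the highest Q-value in this state.
        return max(actions, key=lambda a: q_value(state, a, delta, reward, v_star))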
Q-values
γ = 0.9
Grid world example
Another grid world example
• Cells correspond to the states.
• There are 4 possible actions.
• Actions that would take the agent off the grid leave its position unchanged but give reward = −1.
• All other actions give reward = 0, except moves out of A and B.
Another grid world example
• All actions are chosen with equal probability (a random policy).
• Discount factor γ = 0.9.
• Values are negative near the lower edge.
• The best state is A, but its expected return is lower than 10, its immediate reward.
• B is valued at more than 5, its immediate reward.
Another grid world example
• The best state is A, but its expected return is lower than 10, its immediate reward.
• B is valued at more than 5, its immediate reward.
• Why?
2 minutes
Another grid world example
• Optimal value function and optimal policy for the grid world.
Lecture Overview
• Introduction
• Elements of Reinforcement Learning
• Exploration vs Exploitation
• The agent-environment interface
• Value functions
• Temporal difference prediction
Temporal-difference (TD) prediction
• TD learning is a central and novel idea in RL.
• Monte Carlo-like methods must wait until the end of the episode to update V(St), since only then is Gt known:
V(St) ← V(St) + α[ Gt − V(St) ]
• The simplest TD method is called TD(0):
V(St) ← V(St) + α[ Rt+1 + γV(St+1) − V(St) ]
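A minimal TD(0) prediction sketch consistent with the update above; the env interface (reset/step) and the policy function are assumed, not defined in the slides:

    from collections import defaultdict

    def td0_prediction(env, policy, episodes, alpha=0.1, gamma=0.9):
        V = defaultdict(float)        # state-value estimates V(s), default 0
        for _ in range(episodes):
            state = env.reset()
            done = False
            while not done:
                action = policy(state)
                next_state, reward, done = env.step(action)
                # TD(0): move V(St) towards Rt+1 + gamma * V(St+1).
                target = reward + (0.0 if done else gamma * V[next_state])
                V[state] += alpha * (target - V[state])
                state = next_state
        return V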
Temporal-difference (TD) prediction
• TD control methods can be on-policy or off-policy.
• TD control learns an action-value function instead of a state-
value function.
• We estimate qπ(s,a) for the current policy π.
• Therefore, we consider transitions from state-action pair to state-
action pair.
Sarsa: On-Policy TD Control
• Sarsa updates after each transition from a non-terminal state St:
Q(St, At) ← Q(St, At) + α[ Rt+1 + γQ(St+1, At+1) − Q(St, At) ]
• If St+1 is terminal, Q(St+1, At+1) is defined as zero.
• Every element of the 5-tuple (St, At, Rt+1, St+1, At+1) is used, which gives the algorithm its name.
• On-policy methods continuously estimate qπ for policy π, and at
the same time change π greedily towards qπ.
Sarsa: On-Policy TD Control
• On-policy TD algorithm:
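Since the algorithm box is not reproduced here, the following is a hedged sketch of the Sarsa loop consistent with the update above; the env interface (reset/step/actions) is assumed:

    import random
    from collections import defaultdict

    def epsilon_greedy_q(Q, state, actions, epsilon):
        # Explore with probability epsilon, otherwise act greedily w.r.t. Q.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def sarsa(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)        # Q(s, a), keyed by (state, action)
        for _ in range(episodes):
            state = env.reset()
            action = epsilon_greedy_q(Q, state, env.actions(state), epsilon)
            done = False
            while not done:
                next_state, reward, done = env.step(action)
                if done:
                    # Q of any action in a terminal state is defined as zero.
                    Q[(state, action)] += alpha * (reward - Q[(state, action)])
                    break
                next_action = epsilon_greedy_q(Q, next_state, env.actions(next_state), epsilon)
                target = reward + gamma * Q[(next_state, next_action)]
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state, action = next_state, next_action
        return Q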
Q-Learning: Off-Policy TD Control
• A simple but important breakthrough was an off-policy TD algorithm.
• The simplest form is one-step Q-learning:
Q(St, At) ← Q(St, At) + α[ Rt+1 + γ maxa Q(St+1, a) − Q(St, At) ]
• The learned action-value function Q directly approximates q*, the optimal action-value function, regardless of the policy being followed.
• The policy still affects which state-action pairs are visited and updated.
Q-Learning: Off-Policy TD Control
• Off-policy TD algorithm:
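Likewise, a hedged sketch of one-step Q-learning, differing from the Sarsa sketch only in its update target (same assumed env interface and epsilon_greedy_q helper as above):

    def q_learning(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)
        for _ in range(episodes):
            state = env.reset()
            done = False
            while not done:
                action = epsilon_greedy_q(Q, state, env.actions(state), epsilon)
                next_state, reward, done = env.step(action)
                # Off-policy target: bootstrap from the best action in the next state,
                # regardless of which action the behaviour policy will actually take.
                best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions(next_state))
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q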
The Cliff Walking
• Reward of −1 for all transitions, except those into the cliff.
• Stepping into the cliff gives a reward of −100 and sends the agent back to the start position.
The Cliff Walking
• ε-greedy, with ε = 0.1.
• Q-learning learns values for the optimal path; Sarsa learns a longer but safer path.
• However, Q-learning's online performance is worse, because exploratory moves occasionally send it off the cliff.
• If ε is gradually reduced, both methods converge asymptotically to the optimal policy.
Actor-critic methods
• Policy approximation.
• Learning is always on-policy.
• The actor maintains the policy structure.
• The critic learns and critiques the policy being followed.
• Minimal computation is needed to select actions, even with continuous-valued actions or stochastic policies.
• The separate actor makes actor-critic methods more appealing as psychological and biological models.
Deep Q-Network
• Proposed by Mnih et al. in 2015.
• Human-level control through deep reinforcement learning.
• Tested in 49 Atari games.
Examples
• Stumpy - A simple learning
robot.
• Stumpy receives a reward
after each action. Did it
move forward or not?
• After each move, updates
its policy.
Examples
• Stumpy - A simple learning
robot.
• Continues trying to maximise
its reward.
• Stumpy after 30 minutes.
Examples
• Another example
Reference
• For a more comprehensive introduction,
you should definitely have a look at:
• Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
• http://www.incompleteideas.net/book/the-book-2nd.html
Feedback
• In case you want to provide anonymous
feedback on these lectures, please visit:
• https://forms.gle/KBkN744QuffuAZLF8
Thank you very much!