
Reinforcement Learning

Important Terms
• In the model-based approach, a system uses a predictive model of the world to ask questions of the form "what will happen if I do x?" in order to choose the best x.
• In the alternative model-free approach, the modeling step is bypassed altogether in favor of learning a control policy directly.
Reinforcement learning
• Reinforcement learning involves an agent, a set of states, and a set of actions per state.
• By performing an action, the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score).
• The goal of the agent is to maximize its total reward.
• It does this by adding the maximum reward attainable from future states to the reward for achieving its current state, effectively influencing the current action by the potential future reward.
• This potential reward is a weighted sum of the expected values of the rewards of all future steps starting from the current state (see the formula below).
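Written out as a formula (a standard form of this weighted sum; γ is the discount factor, 0 ≤ γ ≤ 1, introduced later in these notes), the potential reward is the discounted return from time step t:

G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … = Σ_{k=0}^{∞} γ^k · R_{t+k+1}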
Example Design:
Self-Driving Cab
• The Smartcab's job is to pick up the passenger at one location and drop them off at another.
• The Smartcab should take care of:
• Dropping the passenger off at the right location.
• Saving the passenger's time by taking the minimum time possible for the drop-off.
• Taking care of the passenger's safety and the traffic rules.
1. Rewards

• The agent (the imaginary driver) is reward-motivated.
• It is going to learn how to control the cab through trial experiences in the environment.
• Points to consider when deciding the rewards and/or penalties and their magnitudes:
• The agent should receive a high positive reward for a successful drop-off.
• The agent should be penalized if it tries to drop off a passenger at a wrong location.
2. State Space

• The State Space is the set of all possible situations our taxi could inhabit.
• Let's say we have a training area for our Smartcab where we are teaching it to transport people in a parking lot to four different locations (R, G, Y, B).
• There are four (4) destinations and five (4 + 1) passenger locations.
• So, our taxi environment has 5 × 5 (taxi positions on the 5×5 grid) × 5 (passenger locations) × 4 (destinations) = 500 total possible states.
3. Action Space
• The agent encounters one of the 500 states and it takes an action.
• The action in our case can be to move in a direction or to decide to pick up / drop off a passenger.
• In other words, we have six possible actions:
• south
• north
• east
• west
• pickup
• dropoff
• The taxi cannot perform certain actions in certain states due to walls.
• In the environment's code, we will simply provide a -1 penalty for every wall hit, and the taxi won't move anywhere.
• This will just rack up penalties, causing the taxi to consider going around the wall.
Problem statement
• "There are 4 locations (labeled by different letters), and our job
is to pick up the passenger at one location and drop him off at
another. We receive +20 points for a successful drop-off and
lose 1 point for every time-step it takes. There is also a 10 point
penalty for illegal pick-up and drop-off actions."
• An Action Space of size 6 and a State Space of size 500 (see the setup sketch below).
• (This is the only information required by the RL algorithm.)
• Reinforcement Learning will learn a mapping of states to the optimal action to perform in that state by exploration,
• i.e. the agent explores the environment and takes actions based on the rewards defined in the environment.
• The optimal action for each state is the action that has the highest cumulative long-term reward.
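A minimal setup sketch, assuming OpenAI Gym's built-in Taxi environment (the version string "Taxi-v3" may differ in your Gym installation):

import gym

# Create the Taxi environment (assumed: Gym's built-in Taxi-v3) and unwrap the time limit
env = gym.make("Taxi-v3").env

env.reset()
env.render()

print("Action Space {}".format(env.action_space))      # Discrete(6)
print("State Space {}".format(env.observation_space))  # Discrete(500)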
Total states: 500

state = env.encode(3, 1, 2, 0)  # (taxi row, taxi column, passenger index, destination index)
print("State:", state)

• State: 328
• Five (4 + 1) passenger locations and four (4) destination locations.
• We are using our illustration's coordinates to generate a number corresponding to a state between 0 and 499, which turns out to be 328 for our illustration's state.
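Going the other way, a small sketch (assuming the classic Gym Taxi environment, whose decode method reverses encode):

# Recover (taxi row, taxi column, passenger index, destination index) from the state number
print(list(env.decode(328)))  # [3, 1, 2, 0]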
The Reward Table

• When the Taxi environment is created, there is an initial Reward table that's also created, called `P`.
• We can think of it like a matrix that has the number of states as rows and the number of actions as columns, i.e. a states × actions matrix.
env.P[328]

OUT:
{0: [(1.0, 428, -1, False)],
 1: [(1.0, 228, -1, False)],
 2: [(1.0, 348, -1, False)],
 3: [(1.0, 328, -1, False)],
 4: [(1.0, 328, -10, False)],
 5: [(1.0, 328, -10, False)]}

• This dictionary has the structure {action: [(probability, nextstate, reward, done)]}.
• The 0-5 corresponds to the actions (south, north, east, west, pickup, dropoff) the taxi can perform at our current state in the illustration.
• The nextstate is the state we would be in if we take the action at this index of the dict.
• All the movement actions have a -1 reward and the pickup/dropoff actions have a -10 reward in this particular state. If we are in a state where the taxi has a passenger and is on top of the right destination, we would see a reward of 20 at the dropoff action.
• done is used to tell us when we have successfully dropped off a passenger in the right location. Each successful dropoff is the end of an episode.
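A small reading sketch (assuming the Gym Taxi environment from above, which exposes this table as env.P):

# Action 1 is "north" in this environment's action ordering
[(prob, next_state, reward, done)] = env.P[328][1]
print(prob, next_state, reward, done)  # 1.0 228 -1 False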
Solving the environment without Reinforcement Learning
• Since we have our `P` table for default rewards in each state, we can try to have our taxi navigate just using that.
• We'll create a loop which runs until one passenger reaches one destination (one episode), or in other words, until the received reward is 20.
• The env.action_space.sample() method automatically selects one random action from the set of all possible actions.

We have the taxi at row 3, column 1, our passenger is at location 2, and our destination is location 0.

state = env.encode(3, 1, 2, 0)
env.s = 328  # set environment to illustration's state

epochs = 0
penalties, reward = 0, 0
frames = []  # for animation
done = False

while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1

    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
    })

    epochs += 1

print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))
• Not good.
• Our agent takes thousands of timesteps and makes lots of wrong drop-offs to deliver just one passenger to the right destination.
• This is because we aren't learning from past experience.
• We can run this over and over, and it will never optimize.
• The agent has no memory of which action was best for each state, which is exactly what Reinforcement Learning will do for us.
Q-learning

• Solution: a simple RL algorithm called Q-learning, which will give our agent some memory.
• Q-learning lets the agent use the environment's rewards to learn, over time, the best action to take in a given state.
• In our Taxi environment, we have the reward table, P, that the agent will learn from.
• It does this by receiving a reward for taking an action in the current state, then updating a Q-value to remember whether that action was beneficial.
• The values stored in the Q-table are called Q-values, and they map to a (state, action) combination.
• A Q-value for a particular state-action combination is representative of the "quality" of an action taken from that state. Better Q-values imply better chances of getting greater rewards.
• Q-values are initialized to an arbitrary value, and as the agent exposes itself to the environment and receives different rewards by executing different actions, the Q-values get updated.
The Q-values are updated using the equation:

Q(state, action) ← (1 − α) · Q(state, action) + α · [reward + γ · max Q(next state, all actions)]

• α (alpha) is the learning rate (0 < α ≤ 1)
• α is the extent to which our Q-values are being updated in every iteration.
• γ (gamma) is the discount factor (0 ≤ γ ≤ 1)
• γ determines how much importance we want to give to future rewards.
• A high value for the discount factor (close to 1) captures the long-term effective reward,
• whereas a discount factor of 0 makes our agent consider only the immediate reward, hence making it greedy.
• The Q-value of the agent's current state and action is updated by
• first taking a weight (1 − α) of the old Q-value,
• then adding the learned value.
• The learned value is a combination of the reward for taking the current action in the current state, and the discounted maximum reward from the next state we will be in once we take the current action.
• We are learning the proper action to take in the current state by looking at the reward for the current state/action combo and the max rewards for the next state (see the one-line sketch below).
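As a one-line sketch of that update (variable names such as old_q and next_max are illustrative, not from the original):

# Weighted blend of the old estimate and the newly learned value
new_q = (1 - alpha) * old_q + alpha * (reward + gamma * next_max)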
Q-Table
Summing up the Q-Learning Process
• Initialize the Q-table to all zeros.
• Start exploring actions:
• For each state, select any one among all possible actions for the current state (S).
• Travel to the next state (S') as a result of that action (a).
• For all possible actions from the state (S'), select the one with the highest Q-value.
• Update the Q-table values using the equation.
• Set the next state as the current state.
• If the goal state is reached, end the episode; otherwise repeat the process.
We evaluate our agents according to the following metrics:
• Average number of penalties per episode: The smaller the number, the better the performance of our agent.
• Average number of timesteps per trip: We want a small number of timesteps per episode as well, since we want our agent to take the minimum number of steps (i.e. the shortest path) to reach the destination.
• Average rewards per move: A larger reward means the agent is doing the right thing. In our case, as both timesteps and penalties are negatively rewarded, a higher average reward would mean that the agent reaches the destination as fast as possible with the fewest penalties.
Hyperparameters and optimizations
• Alpha: (the learning rate) should decrease as you continue to gain a
larger and larger knowledge base.
• Gamma: as you get closer and closer to the deadline, your preference
for near-term reward should increase, as you won't be around long
enough to get the long-term reward, which means your gamma
should decrease.
• Epsilon: as we develop our strategy, we have less need of exploration and more exploitation to get more utility from our policy, so as trials increase, epsilon should decrease (a minimal decay sketch follows this list).
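A minimal decay sketch (the schedule and the constants are illustrative assumptions, not values from the original):

# Applied after every episode: shrink alpha and epsilon toward small floors (illustrative numbers)
alpha = max(0.1, alpha * 0.999)       # learn more cautiously as the knowledge base grows
epsilon = max(0.05, epsilon * 0.995)  # explore less and exploit more as trials increase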
Let’s say that a robot has to cross a maze and reach the end point. There
are mines, and the robot can only move one tile at a time. If the robot steps
onto a mine, the robot is dead. The robot has to reach the end point in the
shortest time possible.
The scoring/reward system is as below:
1. The robot loses 1 point at each step.
This is done so that the robot takes the
shortest path and reaches the goal as
fast as possible.
2. If the robot steps on a mine, the point
loss is 100 and the game ends.
3. If the robot gets power ⚡, it gains 1
point.
4. If the robot reaches the end goal, the
robot gets 100 points.
Q-Table
• This table will guide us to the best action at each state.
• Each Q-table score will be the maximum expected future reward that the robot will get if it takes that action at that state.
• When we start, all the values in the Q-table are zeros.
• There is an iterative process of updating the values.
• As we start to explore the environment, the Q-function gives us better and better approximations by continuously updating the Q-values in the table.
Step 1: initialize the Q-Table

• There are n columns, where n = number of actions. There are m rows, where m = number of states.
• We will initialize the values at 0 (a minimal sketch follows).
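A minimal initialization sketch (assuming NumPy; n_states and n_actions stand for m and n above):

import numpy as np

# m rows (states) x n columns (actions), all zeros to start
q_table = np.zeros([n_states, n_actions])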
Steps 2 and 3: choose and perform an action
• This step runs until we stop the training.
• We will choose an action (a) in the state (s) based on the Q-Table.
• But, as mentioned earlier, when the episode initially starts, every Q-value is 0.
• Epsilon greedy strategy
• Epsilon-Greedy is a simple method to balance exploration and exploitation by choosing between exploration and exploitation randomly.
• Epsilon refers to the probability of choosing to explore.
• In the beginning, the epsilon rate will be higher.
• The robot will explore the environment and randomly choose actions.
• The logic behind this is that the robot does not know anything about the environment.
• As the robot explores the environment, the epsilon rate decreases and the robot starts to exploit the environment.
• Epsilon (hyperparameter): as we develop our strategy, we have less need of exploration and more need of exploitation to get more utility from our policy (see the sketch below).
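A minimal epsilon-greedy sketch (assuming NumPy, a Gym-style env, and the q_table and epsilon introduced above):

import random
import numpy as np

if random.uniform(0, 1) < epsilon:
    action = env.action_space.sample()   # explore: pick a random action
else:
    action = np.argmax(q_table[state])   # exploit: pick the best known action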
Exploration vs Exploitation

• Exploration
• allows an agent to improve its current knowledge about each action, hopefully leading to long-term benefit. Improving the accuracy of the estimated action-values enables an agent to make more informed decisions in the future.
• Exploitation
• on the other hand, chooses the greedy action to get the most reward by exploiting the agent's current action-value estimates. But by being greedy with respect to the action-value estimates, the agent may not actually get the most reward, which can lead to sub-optimal behaviour.
• The agent cannot, however, choose to do both simultaneously; this is also called the exploration-exploitation dilemma.
• For the robot example, there are four actions to choose from: up, down, left, and right.
• We are starting the training now, and our robot knows nothing about the environment. So the robot chooses a random action, say right.
Steps 4 and 5: evaluate
• We can now update the Q-values for being at the start and
moving right using the Bellman equation.
• Now we have taken an action and observed an outcome and
reward. We need to update the function Q(s,a).
The Q-values are updated using the equation:

Q(state, action) ← (1 − α) · Q(state, action) + α · [reward + γ · max Q(next state, all actions)]

• α (alpha) is the learning rate (0 < α ≤ 1)
• α is the extent to which our Q-values are being updated in every iteration.
• γ (gamma) is the discount factor (0 ≤ γ ≤ 1)
• γ determines how much importance we want to give to future rewards.
• A high value for the discount factor (close to 1) captures the long-term effective reward,
• whereas a discount factor of 0 makes our agent consider only the immediate reward, hence making it greedy.
Hyperparameters and optimizations
• Alpha: (the learning rate) should decrease as you continue to gain a
larger and larger knowledge base.
• Gamma: as you get closer and closer to the deadline, your preference
for near-term reward should increase, as you won't be around long
enough to get the long-term reward, which means your gamma
should decrease.
• Epsilon: as we develop our strategy, we have less need of exploration
and more exploitation to get more utility from our policy, so as trials
increase, epsilon should decrease.
• power = +1
• mine = -100
• end = +100
Recap

• Q-Learning is a value-based reinforcement learning algorithm which is used to find the optimal action-selection policy using a Q function.
• Our goal is to maximize the value function Q.
• The Q-table helps us to find the best action for each state.
• It helps to maximize the expected reward by selecting the best of all possible actions.
• Q(state, action) returns the expected future reward of taking that action at that state.
• This function can be estimated using Q-Learning, which iteratively updates Q(s,a) using the Bellman equation.
• Initially, we explore the environment and update the Q-Table. When the Q-Table is ready, the agent will start to exploit the environment and start taking better actions.
Implementing Q-learning in Python

• First, we'll initialize the Q-table to a 500×6 matrix of zeros.
• Then, we create the training algorithm that will update this Q-table as the agent explores the environment over thousands of episodes.
• In each step, we decide whether to pick a random action or to exploit the already computed Q-values.
• We execute the chosen action in the environment to obtain the next_state and the reward from performing the action.
• We calculate the maximum Q-value for the actions corresponding to the next_state, and with that, we can easily update our Q-value to the new_q_value (a sketch of the full loop follows this list).
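A sketch of such a training loop, assuming the Gym Taxi environment and NumPy from earlier; the hyperparameter values are illustrative assumptions, and the classic 4-tuple env.step API is assumed:

import random
import numpy as np

q_table = np.zeros([env.observation_space.n, env.action_space.n])  # 500 x 6

# Hyperparameters (illustrative values)
alpha = 0.1    # learning rate
gamma = 0.6    # discount factor
epsilon = 0.1  # exploration probability

for i in range(1, 100001):            # train over 100,000 episodes
    state = env.reset()
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()   # explore the action space
        else:
            action = np.argmax(q_table[state])   # exploit learned values

        next_state, reward, done, info = env.step(action)

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        # Q-learning update: weighted blend of the old value and the learned value
        q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

        state = next_state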
We have the taxi at row 3, column 1, our passenger is at location 2, and our destination is location 0.

• Now that the Q-table has been established over 100,000 episodes, let's see what the Q-values are at our illustration's state:

q_table[328]

OUT:
array([ -2.30108105,  -1.97092096,  -2.30357004,  -2.20591839,
        -10.3607344 ,  -8.5583017 ])

• The max Q-value is "north" (-1.971), so it looks like Q-learning has effectively learned the best action to take in our illustration's state!
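A hedged evaluation sketch against the metrics above (assuming the trained q_table and the Gym env from earlier; the episode count is illustrative):

total_epochs, total_penalties = 0, 0
episodes = 100

for _ in range(episodes):
    state = env.reset()
    epochs, penalties, done = 0, 0, False

    while not done:
        action = np.argmax(q_table[state])            # always exploit the learned policy
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1
        epochs += 1

    total_epochs += epochs
    total_penalties += penalties

print("Average timesteps per episode:", total_epochs / episodes)
print("Average penalties per episode:", total_penalties / episodes)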
