
Reinforcement Learning

Important Terms
• In the model-based approach, a system uses a predictive model of the world to ask questions of the form "what will happen if I do x?" in order to choose the best x.
• In the alternative model-free approach, the modeling step is bypassed altogether in favor of learning a control policy directly.
Reinforcement learning
• Reinforcement learning involves an agent, a set of states, and a set of actions per state.
• By performing an action, the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score).
• The goal of the agent is to maximize its total reward.
• It does this by adding the maximum reward attainable from future states to the reward for achieving its current state, effectively influencing the current action by the potential future reward.
• This potential reward is a weighted sum of the expected values of the rewards of all future steps starting from the current state (see the formula below).
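Written out as a formula (a standard form of this weighted sum; γ is the discount factor, 0 ≤ γ ≤ 1, introduced later in these notes), the potential reward is the discounted return from time step t:

G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … = Σ_{k=0}^{∞} γ^k · R_{t+k+1}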
Example Design:
Self-Driving Cab
• The Smartcab's job is to pick up the passenger at one location and drop them off at another.
• The Smartcab should take care of:
• Dropping the passenger off at the right location.
• Saving the passenger's time by taking the minimum time possible for the drop-off.
• Taking care of the passenger's safety and the traffic rules.
1. Rewards

• The agent (the imaginary driver) is reward-motivated.
• It is going to learn how to control the cab through trial experiences in the environment.
• Points to consider when deciding the rewards and/or penalties and their magnitudes:
• The agent should receive a high positive reward for a successful drop-off.
• The agent should be penalized if it tries to drop off a passenger at a wrong location.
2. State Space

• The State Space is the set of all possible situations our taxi could inhabit.
• Let's say we have a training area for our Smartcab where we are teaching it to transport people in a parking lot to four different locations (R, G, Y, B).
• There are four (4) destinations and five (4 + 1) passenger locations.
• So, our taxi environment has 5 × 5 (taxi positions on the 5×5 grid) × 5 (passenger locations) × 4 (destinations) = 500 total possible states.
3. Action Space
• The agent encounters one of the 500 states and it takes an action.
• The action in our case can be to move in a direction or to decide to pick up / drop off a passenger.
• In other words, we have six possible actions:
• south
• north
• east
• west
• pickup
• dropoff
• The taxi cannot perform certain actions in certain states due to walls.
• In the environment's code, we will simply provide a -1 penalty for every wall hit, and the taxi won't move anywhere.
• This will just rack up penalties, causing the taxi to consider going around the wall.
Problem statement
• "There are 4 locations (labeled by different letters), and our job
is to pick up the passenger at one location and drop him off at
another. We receive +20 points for a successful drop-off and
lose 1 point for every time-step it takes. There is also a 10 point
penalty for illegal pick-up and drop-off actions."
• An Action Space of size 6 and a State Space of size 500 (see the setup sketch below).
• (This is the only information required by the RL algorithm.)
• Reinforcement Learning will learn a mapping of states to the optimal action to perform in that state by exploration,
• i.e. the agent explores the environment and takes actions based on the rewards defined in the environment.
• The optimal action for each state is the action that has the highest cumulative long-term reward.
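A minimal setup sketch, assuming OpenAI Gym's built-in Taxi environment (the version string "Taxi-v3" may differ in your Gym installation):

import gym

# Create the Taxi environment (assumed: Gym's built-in Taxi-v3) and unwrap the time limit
env = gym.make("Taxi-v3").env

env.reset()
env.render()

print("Action Space {}".format(env.action_space))      # Discrete(6)
print("State Space {}".format(env.observation_space))  # Discrete(500)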
Total states: 500

state = env.encode(3, 1, 2, 0)  # (taxi row, taxi column, passenger index, destination index)
print("State:", state)

• State: 328
• Five (4 + 1) passenger locations and four (4) destination locations.
• We are using our illustration's coordinates to generate a number corresponding to a state between 0 and 499, which turns out to be 328 for our illustration's state.
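Going the other way, a small sketch (assuming the classic Gym Taxi environment, whose decode method reverses encode):

# Recover (taxi row, taxi column, passenger index, destination index) from the state number
print(list(env.decode(328)))  # [3, 1, 2, 0]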
The Reward Table

• When the Taxi environment is created, there is an initial Reward table that's also created, called `P`.
• We can think of it like a matrix that has the number of states as rows and the number of actions as columns, i.e. a states × actions matrix.
env.P[328]

OUT:
{0: [(1.0, 428, -1, False)],
 1: [(1.0, 228, -1, False)],
 2: [(1.0, 348, -1, False)],
 3: [(1.0, 328, -1, False)],
 4: [(1.0, 328, -10, False)],
 5: [(1.0, 328, -10, False)]}

• This dictionary has the structure {action: [(probability, nextstate, reward, done)]}.
• The 0-5 corresponds to the actions (south, north, east, west, pickup, dropoff) the taxi can perform at our current state in the illustration.
• The nextstate is the state we would be in if we take the action at this index of the dict.
• All the movement actions have a -1 reward and the pickup/dropoff actions have a -10 reward in this particular state. If we are in a state where the taxi has a passenger and is on top of the right destination, we would see a reward of 20 at the dropoff action.
• done is used to tell us when we have successfully dropped off a passenger in the right location. Each successful dropoff is the end of an episode.
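A small reading sketch (assuming the Gym Taxi environment from above, which exposes this table as env.P):

# Action 1 is "north" in this environment's action ordering
[(prob, next_state, reward, done)] = env.P[328][1]
print(prob, next_state, reward, done)  # 1.0 228 -1 False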
Solving the environment without Reinforcement Learning
• Since we have our `P` table for default rewards in each state, we can try to have our taxi navigate just using that.
• We'll create a loop which runs until one passenger reaches one destination (one episode), or in other words, until the received reward is 20.
• The env.action_space.sample() method automatically selects one random action from the set of all possible actions.

We have the taxi at row 3, column 1, our passenger is at location 2, and our destination is location 0.

state = env.encode(3, 1, 2, 0)
env.s = 328  # set environment to illustration's state

epochs = 0
penalties, reward = 0, 0
frames = []  # for animation
done = False

while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1

    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
    })

    epochs += 1

print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))
• Not good.
• Our agent takes thousands of timesteps and makes lots of wrong drop-offs to deliver just one passenger to the right destination.
• This is because we aren't learning from past experience.
• We can run this over and over, and it will never optimize.
• The agent has no memory of which action was best for each state, which is exactly what Reinforcement Learning will do for us.
Q-learning

• Solution: a simple RL algorithm called Q-learning, which will give our agent some memory.
• Q-learning lets the agent use the environment's rewards to learn, over time, the best action to take in a given state.
• In our Taxi environment, we have the reward table, P, that the agent will learn from.
• It does this by receiving a reward for taking an action in the current state, then updating a Q-value to remember whether that action was beneficial.
• The values stored in the Q-table are called Q-values, and they map to a (state, action) combination.
• A Q-value for a particular state-action combination is representative of the "quality" of an action taken from that state. Better Q-values imply better chances of getting greater rewards.
• Q-values are initialized to an arbitrary value, and as the agent exposes itself to the environment and receives different rewards by executing different actions, the Q-values get updated.
The Q-values are updated using the equation:

Q(state, action) ← (1 − α) · Q(state, action) + α · [reward + γ · max Q(next state, all actions)]

• α (alpha) is the learning rate (0 < α ≤ 1)
• α is the extent to which our Q-values are being updated in every iteration.
• γ (gamma) is the discount factor (0 ≤ γ ≤ 1)
• γ determines how much importance we want to give to future rewards.
• A high value for the discount factor (close to 1) captures the long-term effective reward,
• whereas a discount factor of 0 makes our agent consider only the immediate reward, hence making it greedy.
• The Q-value of the agent's current state and action is updated by
• first taking a weight (1 − α) of the old Q-value,
• then adding the learned value.
• The learned value is a combination of the reward for taking the current action in the current state, and the discounted maximum reward from the next state we will be in once we take the current action.
• We are learning the proper action to take in the current state by looking at the reward for the current state/action combo and the max rewards for the next state (see the one-line sketch below).
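As a one-line sketch of that update (variable names such as old_q and next_max are illustrative, not from the original):

# Weighted blend of the old estimate and the newly learned value
new_q = (1 - alpha) * old_q + alpha * (reward + gamma * next_max)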
Q-Table
Summing up the Q-Learning Process
• Initialize the Q-table to all zeros.
• Start exploring actions:
• For each state, select any one among all possible actions for the current state (S).
• Travel to the next state (S') as a result of that action (a).
• For all possible actions from the state (S'), select the one with the highest Q-value.
• Update the Q-table values using the equation.
• Set the next state as the current state.
• If the goal state is reached, end the episode; otherwise repeat the process.
We evaluate our agents according to the following metrics:
• Average number of penalties per episode: The smaller the number, the better the performance of our agent.
• Average number of timesteps per trip: We want a small number of timesteps per episode as well, since we want our agent to take the minimum number of steps (i.e. the shortest path) to reach the destination.
• Average rewards per move: A larger reward means the agent is doing the right thing. In our case, as both timesteps and penalties are negatively rewarded, a higher average reward would mean that the agent reaches the destination as fast as possible with the fewest penalties.
Hyperparameters and optimizations
• Alpha: (the learning rate) should decrease as you continue to gain a
larger and larger knowledge base.
• Gamma: as you get closer and closer to the deadline, your preference
for near-term reward should increase, as you won't be around long
enough to get the long-term reward, which means your gamma
should decrease.
• Epsilon: as we develop our strategy, we have less need of exploration and more exploitation to get more utility from our policy, so as trials increase, epsilon should decrease (a minimal decay sketch follows this list).
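A minimal decay sketch (the schedule and the constants are illustrative assumptions, not values from the original):

# Applied after every episode: shrink alpha and epsilon toward small floors (illustrative numbers)
alpha = max(0.1, alpha * 0.999)       # learn more cautiously as the knowledge base grows
epsilon = max(0.05, epsilon * 0.995)  # explore less and exploit more as trials increase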
Let’s say that a robot has to cross a maze and reach the end point. There
are mines, and the robot can only move one tile at a time. If the robot steps
onto a mine, the robot is dead. The robot has to reach the end point in the
shortest time possible.
The scoring/reward system is as below:
1. The robot loses 1 point at each step.
This is done so that the robot takes the
shortest path and reaches the goal as
fast as possible.
2. If the robot steps on a mine, the point
loss is 100 and the game ends.
3. If the robot gets power ⚡, it gains 1
point.
4. If the robot reaches the end goal, the
robot gets 100 points.
Q-Table
• This table will guide us to the best action at each state.
• Each Q-table score will be the maximum expected future reward that the robot will get if it takes that action at that state.
• When we start, all the values in the Q-table are zeros.
• There is an iterative process of updating the values.
• As we start to explore the environment, the Q-function gives us better and better approximations by continuously updating the Q-values in the table.
Step 1: initialize the Q-Table

• There are n columns, where n = number of actions. There are m rows, where m = number of states.
• We will initialize the values at 0 (a minimal sketch follows).
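A minimal initialization sketch (assuming NumPy; n_states and n_actions stand for m and n above):

import numpy as np

# m rows (states) x n columns (actions), all zeros to start
q_table = np.zeros([n_states, n_actions])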
Steps 2 and 3: choose and perform an action
• This step runs until we stop the training.
• We will choose an action (a) in the state (s) based on the Q-Table.
• But, as mentioned earlier, when the episode initially starts, every Q-value is 0.
• Epsilon greedy strategy
• Epsilon-Greedy is a simple method to balance exploration and exploitation by choosing between exploration and exploitation randomly.
• Epsilon refers to the probability of choosing to explore.
• In the beginning, the epsilon rate will be higher.
• The robot will explore the environment and randomly choose actions.
• The logic behind this is that the robot does not know anything about the environment.
• As the robot explores the environment, the epsilon rate decreases and the robot starts to exploit the environment.
• Epsilon (hyperparameter): as we develop our strategy, we have less need of exploration and more need of exploitation to get more utility from our policy (see the sketch below).
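A minimal epsilon-greedy sketch (assuming NumPy, a Gym-style env, and the q_table and epsilon introduced above):

import random
import numpy as np

if random.uniform(0, 1) < epsilon:
    action = env.action_space.sample()   # explore: pick a random action
else:
    action = np.argmax(q_table[state])   # exploit: pick the best known action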
Exploration vs Exploitation

• Exploration
• allows an agent to improve its current knowledge about each action, hopefully leading to long-term benefit. Improving the accuracy of the estimated action-values enables an agent to make more informed decisions in the future.
• Exploitation
• on the other hand, chooses the greedy action to get the most reward by exploiting the agent's current action-value estimates. But by being greedy with respect to the action-value estimates, the agent may not actually get the most reward, which can lead to sub-optimal behaviour.
• The agent cannot, however, choose to do both simultaneously; this is also called the exploration-exploitation dilemma.
• For the robot example, there are four actions to choose from: up, down, left, and right.
• We are starting the training now, and our robot knows nothing about the environment. So the robot chooses a random action, say right.
Steps 4 and 5: evaluate
• We can now update the Q-values for being at the start and
moving right using the Bellman equation.
• Now we have taken an action and observed an outcome and
reward. We need to update the function Q(s,a).
The Q-values are updated using the equation:

Q(state, action) ← (1 − α) · Q(state, action) + α · [reward + γ · max Q(next state, all actions)]

• α (alpha) is the learning rate (0 < α ≤ 1)
• α is the extent to which our Q-values are being updated in every iteration.
• γ (gamma) is the discount factor (0 ≤ γ ≤ 1)
• γ determines how much importance we want to give to future rewards.
• A high value for the discount factor (close to 1) captures the long-term effective reward,
• whereas a discount factor of 0 makes our agent consider only the immediate reward, hence making it greedy.
Hyperparameters and optimizations
• Alpha: (the learning rate) should decrease as you continue to gain a
larger and larger knowledge base.
• Gamma: as you get closer and closer to the deadline, your preference
for near-term reward should increase, as you won't be around long
enough to get the long-term reward, which means your gamma
should decrease.
• Epsilon: as we develop our strategy, we have less need of exploration
and more exploitation to get more utility from our policy, so as trials
increase, epsilon should decrease.
• power = +1
• mine = -100
• end = +100
Recap

• Q-Learning is a value-based reinforcement learning algorithm which is used to find the optimal action-selection policy using a Q function.
• Our goal is to maximize the value function Q.
• The Q-table helps us to find the best action for each state.
• It helps to maximize the expected reward by selecting the best of all possible actions.
• Q(state, action) returns the expected future reward of taking that action at that state.
• This function can be estimated using Q-Learning, which iteratively updates Q(s,a) using the Bellman equation.
• Initially, we explore the environment and update the Q-Table. When the Q-Table is ready, the agent will start to exploit the environment and start taking better actions.
Implementing Q-learning in Python

• First, we'll initialize the Q-table to a 500×6 matrix of zeros.
• Then, we create the training algorithm that will update this Q-table as the agent explores the environment over thousands of episodes.
• In each step, we decide whether to pick a random action or to exploit the already computed Q-values.
• We execute the chosen action in the environment to obtain the next_state and the reward from performing the action.
• We calculate the maximum Q-value for the actions corresponding to the next_state, and with that, we can easily update our Q-value to the new_q_value (a sketch of the full loop follows this list).
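A sketch of such a training loop, assuming the Gym Taxi environment and NumPy from earlier; the hyperparameter values are illustrative assumptions, and the classic 4-tuple env.step API is assumed:

import random
import numpy as np

q_table = np.zeros([env.observation_space.n, env.action_space.n])  # 500 x 6

# Hyperparameters (illustrative values)
alpha = 0.1    # learning rate
gamma = 0.6    # discount factor
epsilon = 0.1  # exploration probability

for i in range(1, 100001):            # train over 100,000 episodes
    state = env.reset()
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()   # explore the action space
        else:
            action = np.argmax(q_table[state])   # exploit learned values

        next_state, reward, done, info = env.step(action)

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        # Q-learning update: weighted blend of the old value and the learned value
        q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

        state = next_state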
We have the taxi at row 3, column 1, our passenger is at location 2, and our destination is location 0.

• Now that the Q-table has been established over 100,000 episodes, let's see what the Q-values are at our illustration's state:

q_table[328]

OUT:
array([ -2.30108105,  -1.97092096,  -2.30357004,  -2.20591839,
        -10.3607344 ,  -8.5583017 ])

• The max Q-value is "north" (-1.971), so it looks like Q-learning has effectively learned the best action to take in our illustration's state!
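A hedged evaluation sketch against the metrics above (assuming the trained q_table and the Gym env from earlier; the episode count is illustrative):

total_epochs, total_penalties = 0, 0
episodes = 100

for _ in range(episodes):
    state = env.reset()
    epochs, penalties, done = 0, 0, False

    while not done:
        action = np.argmax(q_table[state])            # always exploit the learned policy
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1
        epochs += 1

    total_epochs += epochs
    total_penalties += penalties

print("Average timesteps per episode:", total_epochs / episodes)
print("Average penalties per episode:", total_penalties / episodes)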
