UNIT-4 REINFORCEMENT LEARNING:

Reinforcement Learning is defined as a Machine Learning method that is concerned with how software agents should take actions in an environment in order to maximize some portion of the cumulative reward. This learning method helps an agent learn how to attain a complex objective or maximize along a specific dimension over many steps.

Here are some important terms used in Reinforcement Learning:
●​Agent: An assumed entity that performs actions in an environment in order to gain some reward.
●​Environment (e): The scenario that the agent has to face.
●​Reward (R): An immediate return given to the agent when it performs a specific action or task.
●​State (s): The current situation returned by the environment.
●​Policy (π): The strategy applied by the agent to decide the next action based on the current state.
●​Value (V): The expected long-term return with discount, as opposed to the short-term reward.
●​Value Function: Specifies the value of a state, i.e., the total amount of reward an agent can expect to accumulate starting from that state.
●​Model of the environment: Mimics the behavior of the environment. It lets the agent make inferences about how the environment will behave.
●​Model-based methods: Methods for solving reinforcement learning problems that use a model of the environment.
●​Q value or action value (Q): Quite similar to Value; the only difference is that it takes the current action as an additional parameter.
Reinforcement Learning Working
Let's look at a simple example that illustrates the reinforcement learning mechanism. Consider the scenario of teaching new tricks to your cat.
●​As the cat doesn't understand English or any other human language, we can't tell her directly what to do. Instead, we follow a different strategy.
●​We emulate a situation, and the cat tries to respond in many different ways. If the cat's response is the desired one, we give her fish.
●​Now whenever the cat is exposed to the same situation, she executes a similar action with even more enthusiasm in expectation of getting more reward (food).
●​That is how the cat learns "what to do" from positive experiences.
●​At the same time, the cat also learns what not to do when faced with negative experiences.
Example of Reinforcement Learning
How Reinforcement Learning works
In this case,
●​Your cat is an agent that is exposed to the environment, which in this case is your house. An example of a state could be your cat sitting, and you use a specific word to make the cat walk.
●​Our agent reacts by performing an action, transitioning from one "state" to another "state."
●​For example, your cat goes from sitting to walking.
●​The reaction of the agent is an action, and the policy is a method of selecting an action given a state in expectation of better outcomes.
●​After the transition, the agent may get a reward or penalty in return (a minimal sketch of this interaction loop follows below).
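The interaction just described can be written as a short loop. The sketch below (Python) is only an illustration under invented names and rewards, not the actual cat scenario; it shows an agent repeatedly observing a state, choosing an action from its policy, and receiving a reward from the environment.

import random

# Hypothetical "teach the cat" scenario: the states, actions, and rewards
# below are invented purely to illustrate the agent-environment loop.
ACTIONS = ["respond_to_word", "ignore"]

def environment_step(state, action):
    """Return (next_state, reward) for a state-action pair."""
    if state == "sitting" and action == "respond_to_word":
        return "walking", 1.0   # desired behaviour -> reward (fish)
    return "sitting", 0.0       # otherwise no reward

def policy(state):
    """A placeholder policy; a learning agent would improve this over time."""
    return random.choice(ACTIONS)

state = "sitting"
for step in range(5):
    action = policy(state)                                 # agent acts
    next_state, reward = environment_step(state, action)   # environment responds
    print(step, state, action, reward, next_state)
    state = next_state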

Reinforcement Learning Algorithms


There are three approaches to implementing a Reinforcement Learning algorithm.

Value-Based
In a value-based Reinforcement Learning method, you try to maximize a value function V(s). In this method, the agent expects the long-term return of the current state under policy π.
Policy-based
In a policy-based RL method, you try to come up with a policy such that the action performed in every state helps you to gain maximum reward in the future.
Two types of policy-based methods are:
●​Deterministic: For any state, the same action is produced by the policy π.
●​Stochastic: Every action has a certain probability, determined by the stochastic policy (see the sketch below):
π(a | s) = P[A_t = a | S_t = s]
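To make the stochastic-policy notation concrete, here is a small hypothetical sketch (Python) in which π(a|s) is stored as a table of action probabilities per state and an action is sampled from it; the state and action names are invented.

import random

# Hypothetical stochastic policy pi(a|s): for each state, a probability
# distribution over actions (each row sums to 1).
pi = {
    "s1": {"left": 0.7, "right": 0.3},
    "s2": {"left": 0.1, "right": 0.9},
}

def sample_action(state):
    """Sample an action a with probability pi(a | state)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("s1"))  # "left" about 70% of the time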
Model-Based
In this Reinforcement Learning method, you need to
create a virtual model for each environment. The
agent learns to perform in that specific environment.

Characteristics of Reinforcement Learning


Here are the important characteristics of reinforcement learning:
●​There is no supervisor, only a real-valued reward signal
●​Sequential decision making
●​Time plays a crucial role in reinforcement problems
●​Feedback is always delayed, not instantaneous
●​The agent's actions determine the subsequent data it receives
Types of Reinforcement Learning
Two types of reinforcement learning methods are:

Positive:
Positive reinforcement is defined as an event that occurs because of a specific behavior. It increases the strength and frequency of the behavior and impacts the action taken by the agent positively.
This type of reinforcement helps you to maximize performance and sustain change for a longer period. However, too much reinforcement may lead to over-optimization of a state, which can affect the results.
Negative:
Negative reinforcement is defined as the strengthening of behavior that occurs because a negative condition is stopped or avoided. It helps you to define the minimum standard of performance. However, the drawback of this method is that it provides only enough to meet the minimum behavior.

What are some common active and passive RL techniques?
Passive Learning
As the goal of a passive agent is to evaluate how good the given policy is, the agent needs to learn the expected utility Uπ(s) for each state s. This can be done in three ways.
1. Direct Utility Estimation
In this method, the agent executes a sequence of trials or runs (sequences of state-action transitions that continue until the agent reaches the terminal state).
Each trial gives a sample value, and the agent estimates the utility based on the sample values. The utility can be calculated as a running average of the sample values.
The main drawback is that this method makes the wrong assumption that state utilities are independent, while in reality they are Markovian. It is also slow to converge.
Suppose we have a 4x3 grid as the environment in which the agent can move either Left, Right, Up, or Down (the set of available actions). An example of a run:

Total reward starting at (1,1) = 0.72
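Below is a minimal sketch (Python) of direct utility estimation under assumed data: the two trials are invented stand-ins for runs in the 4x3 grid, and the utility of each state is estimated as the running average of the returns observed from it.

from collections import defaultdict

# Invented trials: each is a list of (state, reward) pairs ending at a terminal state.
trials = [
    [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((2, 3), -0.04), ((4, 3), 1.0)],
    [((1, 1), -0.04), ((2, 1), -0.04), ((3, 1), -0.04), ((3, 2), -0.04), ((4, 2), -1.0)],
]

totals = defaultdict(float)   # sum of sampled returns per state
counts = defaultdict(int)     # number of samples per state

for trial in trials:
    for i, (state, _) in enumerate(trial):
        sample_return = sum(r for _, r in trial[i:])   # reward-to-go from this state
        totals[state] += sample_return
        counts[state] += 1

U = {s: totals[s] / counts[s] for s in totals}   # running-average utility estimates
print(U[(1, 1)])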


2. Adaptive Dynamic Programming (ADP)
ADP is a smarter method than Direct Utility Estimation, as it runs trials to learn the model of the environment and estimates the utility of a state as the sum of the reward for being in that state and the expected discounted utility of the next state:
Uπ(s) = R(s) + γ Σs' P(s'|s, π(s)) Uπ(s')
where R(s) = the reward for being in state s, P(s'|s, π(s)) = the transition model, γ = the discount factor, and Uπ(s') = the utility of being in state s'.
This system of equations can be solved using the value-iteration algorithm. The algorithm converges fast but can become quite costly to compute for large state spaces. ADP is a model-based approach and requires the transition model of the environment. A model-free approach is Temporal Difference Learning.
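A minimal sketch (Python) of the ADP idea, assuming R(s), the transition model P(s'|s, π(s)), and the discount factor have already been estimated from trials; the two-state model below is invented.

# Invented estimates of the model under a fixed policy pi
R = {"A": -0.04, "B": 1.0}                      # R(s)
P = {"A": {"A": 0.2, "B": 0.8},                 # P(s' | s, pi(s))
     "B": {"B": 1.0}}
gamma = 0.9

U = {s: 0.0 for s in R}
for _ in range(100):
    # Bellman update: U(s) = R(s) + gamma * sum_s' P(s'|s, pi(s)) * U(s')
    U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in P[s].items()) for s in R}

print(U)   # utilities of each state under the fixed policy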

3. Temporal Difference Learning (TD)
TD learning does not require the agent to learn the transition model. The update occurs between successive states, and the agent only updates states that are directly affected:
Uπ(s) ← Uπ(s) + α ( R(s) + γ Uπ(s') − Uπ(s) )
where α = the learning rate, which determines the convergence to the true utilities.
While ADP adjusts the utility of s with all of its successor states, TD learning adjusts it with that of a single successor state s'. TD is slower in convergence but much simpler in terms of computation.
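The TD update amounts to a single line of arithmetic; the sketch below (Python) uses invented numbers for one observed transition from s to s'.

alpha, gamma = 0.1, 0.9            # learning rate and discount factor
U = {"s": 0.0, "s_next": 0.5}      # current utility estimates (invented)
reward = -0.04                     # R(s) observed on the transition s -> s'

# TD update: move U(s) toward the sampled target R(s) + gamma * U(s')
U["s"] += alpha * (reward + gamma * U["s_next"] - U["s"])
print(U["s"])   # 0.1 * (-0.04 + 0.45) = 0.041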
Active Learning
1. ADP with exploration function
As the goal of an active agent is to learn an optimal policy, the agent needs to learn the expected utility of each state and update its policy. This can be done using a passive ADP agent and then using value or policy iteration to learn optimal actions. But this approach results in a greedy agent. Hence, we use an approach that gives higher weights to unexplored actions and lower weights to actions with lower utilities.

Here f(u, n) is the exploration function, which increases with the expected value u and decreases with the number of tries n; it is applied to the expected utility term inside the value-iteration update. A simple choice is
f(u, n) = R+ if n < Ne, otherwise u
where R+ is an optimistic reward and Ne is the number of times we want the agent to be forced to pick an action in every state. The exploration function converts a passive agent into an active one.
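The same exploration function written as a sketch (Python); R_PLUS and NE are tunable constants chosen here for illustration.

R_PLUS = 2.0   # optimistic reward R+ assumed for rarely tried actions
NE = 5         # minimum number of times to try each action in each state

def f(u, n):
    """Exploration function: optimistic while an action is under-explored,
    otherwise the current utility estimate u."""
    return R_PLUS if n < NE else u

print(f(0.3, 2), f(0.3, 10))   # 2.0 while exploring, 0.3 afterwards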
2. Q-Learning
Q-learning is a TD learning method which does not require the agent to learn the transition model; instead, it learns a Q-value function Q(s, a).

Q-values can be updated using the following equation:
Q(s, a) ← Q(s, a) + α ( R(s) + γ maxa' Q(s', a') − Q(s, a) )

The next action can be selected using the following policy:
π(s) = argmaxa Q(s, a)

Again, this is simpler to compute but slower to converge than ADP.
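A minimal sketch (Python) of the Q-value update and the greedy action choice just described; the states, actions, and numbers are placeholders.

alpha, gamma = 0.1, 0.9
actions = ["left", "right"]

# Q(s, a) table with placeholder values
Q = {("s", a): 0.0 for a in actions}
Q.update({("s_next", a): 0.5 for a in actions})

reward = 1.0   # reward observed on the transition (s, "right") -> s_next

# Q-learning update: Q(s,a) <- Q(s,a) + alpha * (R + gamma * max_a' Q(s',a') - Q(s,a))
best_next = max(Q[("s_next", a)] for a in actions)
Q[("s", "right")] += alpha * (reward + gamma * best_next - Q[("s", "right")])

# Greedy policy: pick the action with the highest Q-value in the current state
best_action = max(actions, key=lambda a: Q[("s", a)])
print(Q[("s", "right")], best_action)   # 0.145, "right"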

Table 1: Comparison of active and passive RL methods

Learning Models of Reinforcement Learning

There are two important learning models in reinforcement learning:
●​Markov Decision Process
●​Q-learning
Markov Decision Process
The following parameters are used to get a solution:
●​Set of actions: A
●​Set of states: S
●​Reward: R
●​Policy: π
●​Value: V
The mathematical approach for mapping a solution in reinforcement learning is known as a Markov Decision Process (MDP).
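To make the MDP ingredients concrete, the hypothetical sketch below (Python) bundles the states, actions, rewards, a transition model, and a discount factor into one structure; the policy π and value V are then computed from such a definition. All names and numbers here are invented.

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                                      # set of states S
    actions: List[str]                                     # set of actions A
    rewards: Dict[str, float]                              # reward R(s)
    transitions: Dict[Tuple[str, str], Dict[str, float]]   # P(s' | s, a)
    gamma: float                                           # discount factor

# A toy two-state MDP, invented for illustration only
mdp = MDP(
    states=["s0", "s1"],
    actions=["stay", "move"],
    rewards={"s0": 0.0, "s1": 1.0},
    transitions={("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s1": 1.0},
                 ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 1.0}},
    gamma=0.9,
)
print(mdp.transitions[("s0", "move")])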
Q-Learning
Q-learning is a value-based method of supplying information to inform which action an agent should take.
Let's understand this method with the following example:
●​There are five rooms in a building which are connected by doors.
●​The rooms are numbered 0 to 4.
●​The outside of the building can be thought of as one big outside area (5).
●​Doors in rooms 1 and 4 lead into the building from outside area 5.
Next, you need to associate a reward value with each door:
●​Doors which lead directly to the goal have a reward of 100.
●​Doors which are not directly connected to the target room give zero reward.
●​As doors are two-way, two arrows are assigned for each room.
●​Every arrow carries an instant reward value.
Explanation:
In this representation, each room represents a state.
The agent's movement from one room to another represents an action.
Each state can be described as a node, while the arrows show the actions.
For example, an agent traverses from room number 2 to 5:
●​Initial state = state 2
●​State 2 -> state 3
●​State 3 -> states (2, 1, 4)
●​State 4 -> states (0, 5, 3)
●​State 1 -> states (5, 3)
●​State 0 -> state 4
(A runnable sketch of this rooms example is given below.)
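The rooms example above can be put together as a short sketch (Python). The reward matrix follows the rules just described (100 for doors that lead directly to goal area 5, 0 for other doors, -1 where there is no door), and the room connections follow the transition list above; the learning constants (γ = 0.8, number of episodes) are assumptions for this illustration.

import random

# Reward matrix R[s][a]: -1 = no door, 0 = ordinary door, 100 = door into goal area 5
R = [
    [-1, -1, -1, -1,  0,  -1],   # room 0 (door to 4)
    [-1, -1, -1,  0, -1, 100],   # room 1 (doors to 3 and 5)
    [-1, -1, -1,  0, -1,  -1],   # room 2 (door to 3)
    [-1,  0,  0, -1,  0,  -1],   # room 3 (doors to 1, 2, 4)
    [ 0, -1, -1,  0, -1, 100],   # room 4 (doors to 0, 3, 5)
    [-1,  0, -1, -1,  0, 100],   # area 5 (goal; doors to 1, 4, and itself)
]

GOAL, GAMMA, EPISODES = 5, 0.8, 500
Q = [[0.0] * 6 for _ in range(6)]

for _ in range(EPISODES):
    state = random.randint(0, 5)
    while state != GOAL:
        # pick a random available door; going through it leads to that room
        action = random.choice([a for a in range(6) if R[state][a] >= 0])
        # Q-learning update with learning rate 1 (deterministic environment)
        Q[state][action] = R[state][action] + GAMMA * max(Q[action])
        state = action

# Read the greedy path from room 2 to the goal off the learned Q-table
state, path = 2, [2]
while state != GOAL:
    state = max(range(6), key=lambda a: Q[state][a])
    path.append(state)
print(path)   # typically [2, 3, 1, 5] or [2, 3, 4, 5]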
Reinforcement Learning vs. Supervised Learning

Parameter | Reinforcement Learning | Supervised Learning
Decision style | Decisions are taken sequentially. | A decision is made on the input given at the beginning.
Works on | Interacting with the environment. | Examples or given sample data.
Dependency of decisions | Learning decisions are dependent on one another, so labels are given to sequences of dependent decisions. | Decisions are independent of each other, so a label is given for every decision.
Best suited | Supports and works better in AI, where human interaction is prevalent. | Mostly operated with an interactive software system or applications.
Example | Chess game | Object recognition
Applications of Reinforcement Learning
Here are some applications of Reinforcement Learning:
●​Robotics for industrial automation.
●​Business strategy planning.
●​Training systems that provide custom instruction and materials according to the requirements of students.
●​Aircraft control and robot motion control.
