ML UNIT 5


UNIT-5

Reinforcement Learning: Single State Case: K-Armed Bandit, Elements of Reinforcement Learning, Model-Based Learning, Temporal Difference Learning, Generalizing from Examples. [TB-1]

In reinforcement learning, the learner is a decision-making agent that takes actions in an environment and receives reward (or penalty) for its actions in trying to solve a problem. After a set of trial-and-error runs, it should learn the best policy, which is the sequence of actions that maximizes the total reward. Reinforcement learning (RL) is a general framework where agents learn to perform actions in an environment so as to maximize a reward. The two main components are the environment, which represents the problem to be solved, and the agent, which represents the learning algorithm.

Reinforcement learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and seeing the results of those actions. For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty.

In reinforcement learning, the agent learns automatically from this feedback without any labeled data, unlike supervised learning. Since there is no labeled data, the agent is bound to learn from its experience alone. RL addresses problems where decision making is sequential and the goal is long-term, such as game playing, robotics, etc.

The agent interacts with the environment and explores it by itself. The primary goal of an agent in reinforcement learning is to improve its performance by collecting the maximum positive reward.

Types of Reinforcement Learning
There are mainly two types of reinforcement learning:
o Positive Reinforcement
o Negative Reinforcement
Positive Reinforcement:
Positive reinforcement means adding something to increase the tendency that the expected behavior will occur again. It impacts the behavior of the agent positively and increases the strength of the behavior. This type of reinforcement can sustain the changes for a long time, but too much positive reinforcement may lead to an overload of states that can reduce the consequences.
Negative Reinforcement:
Negative reinforcement is the opposite of positive reinforcement: it increases the tendency that the specific behavior will occur again by avoiding a negative condition. Depending on the situation and behavior, it can be more effective than positive reinforcement, but it provides reinforcement only to meet the minimum required behavior.

INTRODUCTION:-
Let us say we want to build a machine that learns to play chess. In this case we cannot use a supervised learner, for two reasons. First, it is very costly to have a teacher that will take us through many games and indicate the best move for each position. Second, in many cases there is no such thing as the best move; the goodness of a move depends on the moves that follow. A single move does not count; a sequence of moves is good if, after playing them, we win the game. The only feedback is at the end of the game, when we win or lose.

Another example is a robot that is placed in a maze. The robot can move in one of the four compass directions and should make a sequence of movements to reach the exit. As long as the robot is in the maze there is no feedback; the robot tries many moves until it reaches the exit, and only then does it get a reward. In this case there is no opponent, but we can have a preference for shorter trajectories, implying that in this case we play against time.

These two applications have a number of points in common: there is a decision maker, called the agent, that is placed in an environment (see figure 18.1). In chess, the game-player is the decision maker and the environment is the board; in the second case, the maze is the environment of the robot. At any time, the environment is in a certain state that is one of a set of possible states, for example, the state of the board or the position of the robot in the maze. The decision maker has a set of possible actions: legal movement of pieces on the chess board, movement of the robot in possible directions without hitting the walls, and so forth. Once an action is chosen and taken, the state changes. The solution to the task requires a sequence of actions, and we get feedback in the form of a reward only rarely, generally only when the complete sequence is carried out. The reward defines the problem and is necessary if we want a learning agent. The learning agent learns the best sequence of actions to solve a problem, where "best" is quantified as the sequence of actions that has the maximum cumulative reward. Such is the setting of reinforcement learning.

Reinforcement learning is different from the learning methods we discussed before in a number of respects. It is called "learning with a critic," as opposed to the learning with a teacher that we have in supervised learning. A critic differs from a teacher in that it does not tell us what to do but only how well we have been doing in the past; the critic never informs in advance. The feedback from the critic is scarce, and when it comes, it comes late. This leads to the credit assignment problem. After taking several actions and getting the reward, we would like to assess the individual actions we took in the past and find the moves that led us to win the reward, so that we can record and recall them later on. As we will see shortly, what a reinforcement learning program does is learn to generate an internal value for the intermediate states or actions in terms of how good they are in leading us to the goal and getting us to the real reward. Once such an internal reward mechanism is learned, the agent can just take the local actions that maximize it.

SINGLE STATE CASE: K-ARMED BANDIT
We start with a simple example. The K-armed bandit is a hypothetical slot machine with K levers. The action is to choose and pull one of the levers, and we win a certain amount of money, which is the reward associated with the lever (action). The task is to decide which lever to pull to maximize the reward. This is a classification problem where we choose one of K. If this were supervised learning, then the teacher would tell us the correct class, namely, the lever leading to maximum earnings. In this case of reinforcement learning, we can only try different levers and keep track of the best. This is a simplified reinforcement learning problem because there is only one state, or one slot machine, and we need only decide on the action. Another reason why this is simplified is that we immediately get a reward after a single action; the reward is not delayed, so we immediately see the value of our action.

Let us say Q(a) is the value of action a. Initially, Q(a) = 0 for all a. When we try action a, we get a reward r_a ≥ 0. If rewards are deterministic, we always get the same r_a for any pull of a, and in such a case we can simply set Q(a) = r_a. If we want to exploit, once we find an action a such that Q(a) > 0, we can keep choosing it and get r_a at each pull. However, it is quite possible that there is another lever with a higher reward, so we need to explore.
We can choose different actions and store Q(a) for all a. Whenever we want to exploit, we can choose the action with the maximum value, that is, we choose a* such that

Q(a*) = max_a Q(a)
If rewards are not deterministic but stochastic, we get a different reward each time we choose the same action. The amount of the reward is defined by the probability distribution p(r|a). In such a case, we define Q_t(a) as the estimate of the value of action a at time t. It is an average of all rewards received when action a was chosen before time t. An online update can be defined as

Q_{t+1}(a) ← Q_t(a) + η [ r_{t+1}(a) − Q_t(a) ]    (18.2)

where r_{t+1}(a) is the reward received after taking action a at time (t + 1).

Note that equation 18.2 is the delta rule: η is the learning factor, r_{t+1}(a) is the desired output, and Q_t(a) is the current prediction. Q_{t+1}(a) is the expected value of action a at time t + 1 and converges to the mean of p(r|a) as t increases.
The full reinforcement learning problem generalizes this simple case in a number of ways. First, we have several states. This corresponds to having several slot machines with different reward probabilities, p(r|s_i, a_j), and we need to learn Q(s_i, a_j), which is the value of taking action a_j when in state s_i. Second, the actions affect not only the reward but also the next state, and we move from one state to another. Third, the rewards are delayed, and we need to be able to estimate immediate values from delayed rewards.
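The following minimal Python sketch (not from the textbook) puts the bandit case together: stochastic rewards drawn from an assumed Gaussian p(r|a), ε-greedy action selection, and the delta-rule update of equation 18.2. All names and numbers are illustrative assumptions.

```python
import numpy as np

# Minimal K-armed bandit sketch (illustrative assumptions: Gaussian rewards,
# epsilon-greedy exploration, fixed learning factor eta).
K = 5
true_means = np.random.rand(K)        # hidden expected reward of each lever
Q = np.zeros(K)                       # value estimates Q(a), initially 0
eta, epsilon = 0.1, 0.1               # learning factor and exploration probability

for t in range(1000):
    if np.random.rand() < epsilon:    # explore: pick a lever uniformly at random
        a = np.random.randint(K)
    else:                             # exploit: pick the lever with the highest estimate
        a = int(np.argmax(Q))
    r = np.random.normal(true_means[a], 0.1)   # stochastic reward sampled from p(r|a)
    Q[a] += eta * (r - Q[a])          # equation 18.2: Q <- Q + eta (r - Q)

print("estimated values:", np.round(Q, 2))
print("true means:      ", np.round(true_means, 2))
```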

ELEMENTS OF REINFORCEMENT LEARNING:-
There are four main elements of reinforcement learning, which are given below:
• Policy
• Reward Signal
• Value Function
• Model of the environment

1. Policy:-
A policy can be defined as the way an agent behaves at a given time. It maps the perceived states of the environment to the actions taken in those states. A policy is the core element of RL, as it alone can define the behavior of the agent. In some cases it may be a simple function or a lookup table, whereas in other cases it may involve general computation such as a search process. A policy can be deterministic or stochastic:
For a deterministic policy: a = π(s)

For a stochastic policy: π(a|s) = P[A_t = a | S_t = s]
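A small sketch contrasting the two kinds of policy; the states, actions, and probability table below are illustrative assumptions.

```python
import numpy as np

def deterministic_policy(s):
    # a = pi(s): exactly one action per state (a hypothetical lookup table)
    lookup = {"s0": "left", "s1": "right"}
    return lookup[s]

def stochastic_policy(s):
    # pi(a|s) = P[A_t = a | S_t = s]: the action is sampled from a distribution
    probs = {"s0": {"left": 0.8, "right": 0.2},
             "s1": {"left": 0.3, "right": 0.7}}
    actions = list(probs[s].keys())
    p = list(probs[s].values())
    return np.random.choice(actions, p=p)
```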

2. Reward Signal:-
The goal of reinforcement learning is defined by the reward signal. At each state, the environment sends an immediate signal to the learning agent, and this signal is known as the reward signal. These rewards are given according to the good and bad actions taken by the agent. The agent's main objective is to maximize the total reward it receives for good actions. The reward signal can change the policy; for example, if an action selected by the agent leads to a low reward, then the policy may change to select other actions in the future.

3. Value Function: The value function gives information about how good a situation and an action are and how much reward an agent can expect. A reward indicates the immediate signal for each good or bad action, whereas a value function specifies how good a state or action is in the long run. The value function depends on the reward, as without reward there could be no value. The goal of estimating values is to achieve more reward.

4. Model: The last element of reinforcement learning is the model, which mimics the behavior of the environment. With the help of the model, one can make inferences about how the environment will behave. For example, if a state and an action are given, then the model can predict the next state and reward. The model is used for planning, which means it provides a way to decide on a course of action by considering all future situations before actually experiencing those situations. Approaches that solve RL problems with the help of a model are termed model-based approaches. Comparatively, an approach without a model is called a model-free approach.

The learning decision maker is called the agent. The agent interacts with the environment, which includes everything outside the agent. The agent has sensors to decide on its state in the environment and takes an action that modifies its state. When the agent takes an action, the environment provides a reward. Time is discrete, t = 0, 1, 2, ..., and s_t ∈ S denotes the state of the agent at time t, where S is the set of all possible states. a_t ∈ A(s_t) denotes the action that the agent takes at time t, where A(s_t) is the set of possible actions in state s_t. When the agent in state s_t takes the action a_t, the clock ticks, reward r_{t+1} ∈ R is received, and the agent moves to the next state, s_{t+1}. The problem is modeled using a Markov decision process (MDP). The reward and next state are sampled from their respective probability distributions, p(r_{t+1}|s_t, a_t) and P(s_{t+1}|s_t, a_t). Note that what we have is a Markov system where the state and reward in the next time step depend only on the current state and action. In some applications, reward and next state are deterministic, and for a certain state and action taken there is one possible reward value and next state.
A Markov decision process, or MDP, is used to formalize reinforcement learning problems. If the environment is completely observable, then its dynamics can be modeled as a Markov process. In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state.
An MDP contains a tuple of four elements (S, A, P_a, R_a):
• A set of finite states S
• A set of finite actions A
• Rewards R_a received after transitioning from state S to state S' due to action a
• Transition probabilities P_a
An MDP relies on the Markov property, described below.

Markov Property:
It says that "if the agent is present in the current state s1, performs an action a1, and moves to the state s2, then the transition from s1 to s2 depends only on the current state and action; future states do not depend on past actions, rewards, or states."
In other words, as per the Markov property, the current state transition does not depend on any past action or state. Hence, an MDP is an RL problem that satisfies the Markov property. For example, in a chess game, the players only focus on the current state and do not need to remember past actions or states.
Depending on the application, a certain state may be designated as the initial state, and in some applications there is also an absorbing terminal (goal) state where the search ends; all actions in this terminal state transition to itself with probability 1 and without any reward. The sequence of actions from the start to the terminal state is an episode, or a trial.
The policy, π, defines the agent's behavior and is a mapping from the states of the environment to actions: π : S → A. The policy defines the action to be taken in any state s_t: a_t = π(s_t). The value of a policy π, V^π(s_t), is the expected cumulative reward that will be received while the agent follows the policy, starting from state s_t.
In the finite-horizon or episodic model, the agent tries to maximize the expected reward for the next T steps:

V^π(s_t) = E[ r_{t+1} + r_{t+2} + ... + r_{t+T} ]
Certain tasks are continuing, and there is no prior fixed limit to the episode. In the infinite-horizon model, there is no sequence limit, but future rewards are discounted:

V^π(s_t) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... ]

where 0 ≤ γ < 1 is the discount rate that keeps the return finite. If γ = 0, then only the immediate reward counts. As γ approaches 1, rewards further in the future count more, and we say that the agent becomes more farsighted. γ is less than 1 because there generally is a time limit to the sequence of actions needed to solve the task. The agent may be a robot that runs on a battery. We prefer rewards sooner rather than later because we are not certain how long we will survive.
For each policy π, there is a V^π(s_t), and we want to find the optimal policy π* such that

V*(s_t) = max_π V^π(s_t), for all s_t
Bellman's equation:-
To each possible next state s_{t+1}, we move with probability P(s_{t+1}|s_t, a_t), and continuing from there using the optimal policy, the expected cumulative reward is V*(s_{t+1}). We sum over all such possible next states, and we discount it because it is one time step later. Adding our immediate expected reward, we get the total expected cumulative reward for action a_t. We then choose the best of the possible actions:

V*(s_t) = max_{a_t} ( E[r_{t+1}] + γ Σ_{s_{t+1}} P(s_{t+1}|s_t, a_t) V*(s_{t+1}) )    (18.6)

Equation 18.6 is known as Bellman's equation. Similarly, we can also write

Q*(s_t, a_t) = E[r_{t+1}] + γ Σ_{s_{t+1}} P(s_{t+1}|s_t, a_t) max_{a_{t+1}} Q*(s_{t+1}, a_{t+1})    (18.7)

This means that if we have the Q*(s_t, a_t) values, then by using a greedy search at each local step we get the optimal sequence of steps that maximizes the cumulative reward.

MODEL-BASED LEARNING:-
We start with model-based learning, where we completely know the environment model parameters, p(r_{t+1}|s_t, a_t) and P(s_{t+1}|s_t, a_t). In such a case, we do not need any exploration and can directly solve for the optimal value function and policy using dynamic programming. The optimal value function is unique and is the solution to the simultaneous equations given in equation 18.6. Once we have the optimal value function, the optimal policy is to choose the action that maximizes the value in the next state:

π*(s_t) = arg max_{a_t} ( E[r_{t+1}|s_t, a_t] + γ Σ_{s_{t+1}} P(s_{t+1}|s_t, a_t) V*(s_{t+1}) )

1. Value Iteration:-
To find the optimal policy, we can use the optimal value function, and there is an iterative algorithm called value iteration that has been shown to converge to the correct V* values. Its pseudocode is given in figure 18.2.

We say that the values have converged if the maximum value difference between two iterations is less than a certain threshold δ:

max_{s∈S} | V^{(l+1)}(s) − V^{(l)}(s) | < δ

where l is the iteration counter. Because we care only about the actions with the maximum value, it is possible that the policy converges to the optimal one even before the values converge to their optimal values. Each iteration is O(|S|²|A|), but frequently there is only a small number k < |S| of next possible states, so complexity decreases to O(k|S||A|).
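A minimal sketch of the value-iteration loop described above, assuming a small discrete MDP whose known model is stored in arrays P[s, a, s'] = P(s'|s, a) and R[s, a] = E[r|s, a]; the array names and shapes are assumptions, not the textbook's notation.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, delta=1e-6):
    # P: (n_states, n_actions, n_states) transition probabilities P(s'|s,a)
    # R: (n_states, n_actions) expected immediate rewards E[r|s,a]
    n_states, n_actions = R.shape
    V = np.zeros(n_states)                     # arbitrary initial values
    while True:
        # Bellman backup: Q(s,a) = E[r|s,a] + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * (P @ V)                # shape (n_states, n_actions)
        V_new = Q.max(axis=1)                  # V(s) = max_a Q(s,a)
        if np.max(np.abs(V_new - V)) < delta:  # converged: max change below threshold
            return V_new, Q.argmax(axis=1)     # optimal values and a greedy policy
        V = V_new
```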

2. Policy Iteration:-
In policy iteration, we store and update the policy rather than doing this indirectly over the values. The pseudocode is given in figure 18.3.

The idea is to start with a policy and improve it repeatedly until there is no change. The value function of the current policy can be calculated by solving a set of linear equations. We then check whether we can improve the policy by taking these values into account. This step is guaranteed to improve the policy, and when no improvement is possible, the policy is guaranteed to be optimal. Each iteration of this algorithm takes O(|A||S|² + |S|³) time, which is more than that of value iteration, but policy iteration needs fewer iterations than value iteration.
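A companion sketch of policy iteration under the same assumed model arrays P and R as above; policy evaluation solves the linear system V = R_π + γ P_π V directly.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)              # arbitrary initial policy
    while True:
        # policy evaluation: solve (I - gamma * P_pi) V = R_pi
        idx = np.arange(n_states)
        P_pi, R_pi = P[idx, pi], R[idx, pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # policy improvement: act greedily with respect to V
        pi_new = (R + gamma * (P @ V)).argmax(axis=1)
        if np.array_equal(pi_new, pi):              # no change: policy is optimal
            return pi, V
        pi = pi_new
```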

TEMPORAL DIFFERENCE LEARNING:-
The model is defined by the reward and next state probability distributions, and as we saw in section 18.4, when we know these we can solve for the optimal policy using dynamic programming. However, these methods are costly, and we seldom have such perfect knowledge of the environment.

The more interesting and realistic application of reinforcement learning is when we do not have the model. This requires exploration of the environment to query the model. When we explore and get to see the value of the next state and reward, we use this information to update the value of the current state. These algorithms are called temporal difference algorithms because what we do is look at the difference between our current estimate of the value of a state (or a state-action pair) and the reward received plus the discounted value of the next state. The topics covered are:
1. Exploration Strategies
2. Deterministic Rewards and Actions
3. Nondeterministic Rewards and Actions
4. Eligibility Traces

1. Exploration Strategies:-
To explore, one possibility is to use ε-greedy search, where with probability ε we choose one action uniformly at random among all possible actions, namely, explore, and with probability 1 − ε we choose the best action, namely, exploit. We do not want to continue exploring indefinitely but start exploiting once we have done enough exploration; for this, we start with a high ε value and gradually decrease it. We need to make sure that our policy is soft, that is, the probability of choosing any action a ∈ A in state s ∈ S is greater than 0.
We can choose probabilistically, using the softmax function to convert values to probabilities:

P(a) = exp(Q(a)) / Σ_b exp(Q(b))

and then sample according to these probabilities. To gradually move from exploration to exploitation, we can use a "temperature" variable T and define the probability of choosing action a as

P(a) = exp(Q(a)/T) / Σ_b exp(Q(b)/T)
When T is large, all probabilities are equal and we have exploration. When T is small, better
actions are favored. So the strategy is to start with a large T and decrease it gradually, a procedure
named annealing, which in this case moves from exploration to exploitation smoothly in time.
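The two selection rules can be sketched as follows; Q is assumed to be a 1-D array of action values for the current state, and the function names are illustrative.

```python
import numpy as np

def epsilon_greedy(Q, epsilon):
    # explore with probability epsilon, otherwise exploit the best current estimate
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q))
    return int(np.argmax(Q))

def softmax_action(Q, T):
    # P(a) = exp(Q(a)/T) / sum_b exp(Q(b)/T); subtracting max(Q) avoids overflow
    prefs = np.exp((Q - np.max(Q)) / T)
    probs = prefs / prefs.sum()
    return int(np.random.choice(len(Q), p=probs))
```

Annealing then amounts to calling softmax_action with a temperature T that is decreased from episode to episode.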

2. Deterministic Rewards and Actions:-
In model-free learning, we first discuss the simpler deterministic case, where for any state-action pair there is a single possible reward and next state. In this case, equation 18.7 reduces to

Q(s_t, a_t) = r_{t+1} + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1})

and we simply use this as an assignment to update Q(s_t, a_t). When in state s_t, we choose action a_t by one of the stochastic strategies we saw earlier, which returns a reward r_{t+1} and takes us to state s_{t+1}. We then update the value of the previous action as

Q̂(s_t, a_t) ← r_{t+1} + γ max_{a_{t+1}} Q̂(s_{t+1}, a_{t+1})
3. Nondeterministic Rewards and Actions:-
If the rewards and the results of actions are not deterministic, then we have a probability distribution for the reward, p(r_{t+1}|s_t, a_t), from which rewards are sampled, and a probability distribution for the next state, P(s_{t+1}|s_t, a_t). These help us model the uncertainty in the system that may be due to forces we cannot control in the environment: for instance, our opponent in chess, the dice in backgammon, or our lack of knowledge of the system. For example, we may have an imperfect robot that sometimes fails to go in the intended direction and deviates, or advances a shorter or longer distance than expected.
In such a case, we have

Q(s_t, a_t) = E[r_{t+1}] + γ Σ_{s_{t+1}} P(s_{t+1}|s_t, a_t) max_{a_{t+1}} Q(s_{t+1}, a_{t+1})
Q learning algorithm (off-policy method):-
We cannot do a direct assignment in this case because for the same state and action we may receive different rewards or move to different next states. What we do is keep a running average. This is known as the Q learning algorithm:

Q̂(s_t, a_t) ← Q̂(s_t, a_t) + η ( r_{t+1} + γ max_{a_{t+1}} Q̂(s_{t+1}, a_{t+1}) − Q̂(s_t, a_t) )    (18.15)

The pseudocode of the Q learning algorithm is given in figure 18.5.


We can also think of equation 18.15 as reducing the difference between the current Q value and the backed-up estimate from one time step later. Such algorithms are called temporal difference (TD) algorithms.
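A minimal sketch of tabular Q learning with the update of equation 18.15. The environment interface (reset() returning a state index, step(a) returning next state, reward, and a done flag) is an assumption for illustration, not part of the text.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               eta=0.1, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection from the current estimates
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # off-policy backup: the best next action max_a' Q(s',a') is used
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += eta * (target - Q[s, a])      # equation 18.15
            s = s_next
    return Q
```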

On-policy method (Sarsa algorithm):-
SARSA stands for State-Action-Reward-State-Action and is an on-policy temporal difference learning method. An on-policy control method selects the action for each state while learning, using a specific policy. The goal of SARSA is to calculate Q^π(s, a) for the currently selected policy π and all pairs (s, a). The main difference between the Q learning and SARSA algorithms is that, unlike Q learning, the maximum reward for the next state is not required to update the Q value in the table. In SARSA, the new action and reward are selected using the same policy that determined the original action. SARSA is so named because it uses the quintuple Q(s, a, r, s', a'), where:
s: original state
a: original action
r: reward observed after taking action a in state s
s', a': new state-action pair.
Q learning is an off-policy method because the value of the best next action is used without using the policy. In an on-policy method, the policy is also used to determine the next action. The on-policy version of Q learning is the Sarsa algorithm, whose pseudocode is given in figure 18.6.

We see that instead of looking at all possible next actions a' and choosing the best, the on-policy Sarsa uses the policy derived from the Q values to choose one next action a' and uses its Q value to calculate the temporal difference. On-policy methods estimate the value of a policy while using it to take actions. In off-policy methods, these are separated, and the policy used to generate behavior, called the behavior policy, may in fact be different from the policy that is evaluated and improved, called the estimation policy.
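For comparison with the Q learning sketch above, a minimal Sarsa sketch (same assumed environment interface) is given below; note that the next action a' comes from the policy itself, and its Q value, not the maximum, appears in the update.

```python
import numpy as np

def epsilon_greedy(Q_row, epsilon):
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q_row))
    return int(np.argmax(Q_row))

def sarsa(env, n_states, n_actions, episodes=500,
          eta=0.1, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        a = epsilon_greedy(Q[s], epsilon)
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q[s_next], epsilon)   # chosen by the same policy
            target = r + gamma * Q[s_next, a_next] * (not done)
            Q[s, a] += eta * (target - Q[s, a])           # on-policy TD update
            s, a = s_next, a_next
    return Q
```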
The same idea of temporal difference can also be used to learn V(s) values instead of Q(s, a). TD learning uses the following update rule for a state value:

V(s_t) ← V(s_t) + η ( r_{t+1} + γ V(s_{t+1}) − V(s_t) )

This again is the delta rule, where r_{t+1} + γV(s_{t+1}) is the better, later prediction and V(s_t) is the current estimate. Their difference is the temporal difference, and the update is done to decrease this difference. The update factor η is gradually decreased, and TD learning is guaranteed to converge to the optimal value function V*(s).

4. Eligibility Traces:-
The previous algorithms are one-step: the temporal difference is used to update only the previous value (of the state or state-action pair). An eligibility trace is a record of the occurrence of past visits that enables us to implement temporal credit assignment, allowing us to update the values of previously occurring visits as well. We discuss how this is done with Sarsa to learn Q values; adapting this to learn V values is straightforward.
To store the eligibility trace, we require an additional memory variable associated with each state-action pair, e(s, a), initialized to 0. When the state-action pair (s, a) is visited, namely, when we take action a in state s, its eligibility is set to 1; the eligibilities of all other state-action pairs are multiplied by γλ, where 0 ≤ λ ≤ 1 is the trace decay parameter:

e_t(s, a) = 1 if s = s_t and a = a_t, and e_t(s, a) = γλ e_{t−1}(s, a) otherwise.
We remember that in Sarsa, the temporal error at time t is

δ_t = r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)

In Sarsa with an eligibility trace, named Sarsa(λ), all state-action pairs are updated as

Q(s, a) ← Q(s, a) + η δ_t e_t(s, a), for all s and a
Sarsa(λ) algorithm:-
In online updating, all eligible values are updated immediately after each step; in offline updating, the updates are accumulated and a single update is done at the end of the episode. Online updating takes more time but converges faster. The pseudocode for Sarsa(λ) is given in figure 18.8. Q(λ) and TD(λ) algorithms can be derived similarly.
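The per-step update of Sarsa(λ) with online updating can be sketched as follows; Q and e are tables over all state-action pairs, and the function and argument names are assumptions.

```python
import numpy as np

def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next, eta, gamma, lam):
    # temporal difference error for the visited pair
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]
    e[s, a] = 1.0               # the visited pair becomes fully eligible
    Q += eta * delta * e        # update every state-action pair in proportion to e
    e *= gamma * lam            # decay all eligibilities by gamma * lambda
    return Q, e
```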

GENERALIZATION:-
Until now, we assumed that the Q(s, a) values (or V(s), if we are estimating values of states) are stored in a lookup table; the algorithms we considered earlier are called tabular algorithms. There are a number of problems with this approach:
1. When the number of states and the number of actions is large, the size of the table may become quite large.
2. States and actions may be continuous, for example, turning the steering wheel by a certain angle, and to use a table they should be discretized, which may cause error.
3. When the search space is large, too many episodes may be needed to fill in all the entries of the table with acceptable accuracy.
Instead of storing the Q values as they are, we can consider this a regression problem. This is a supervised learning problem where we define a regressor Q(s, a|θ), taking s and a as inputs and parameterized by a vector of parameters θ, to learn the Q values. For example, this can be an artificial neural network with s and a as its inputs, one output, and θ its connection weights.
To be able to train the regressor, we need a training set. In the case of Sarsa(0), we saw before that we would like Q(s_t, a_t) to get close to r_{t+1} + γQ(s_{t+1}, a_{t+1}). So we can form a set of training samples where the input is the state-action pair (s_t, a_t) and the required output is r_{t+1} + γQ(s_{t+1}, a_{t+1}). We can write the squared error as

E^t(θ) = [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]²
Training sets can similarly be defined for Q(0) and TD(0), where in the latter case we learn V(s) and the required output is r_{t+1} + γV(s_{t+1}). Once such a set is ready, we can use any supervised learning algorithm for learning from the training set.
If we are using a gradient-descent method, as in training neural networks, the parameter vector is updated as

Δθ = η [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ] ∇_θ Q(s_t, a_t)
This is a one-step update. In the case of Sarsa(λ), the eligibility trace is also taken into account:

Δθ = η δ_t e_t

where the temporal difference error is

δ_t = r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)
and the vector of eligibilities of the parameters is updated as

e_t = γλ e_{t−1} + ∇_θ Q(s_t, a_t)    (18.23)

with e_0 all zeros. In the case of a tabular algorithm, the eligibilities are stored for the state-action pairs because they are the parameters (stored as a table). In the case of an estimator, eligibility is associated with the parameters of the estimator. We also note that this is very similar to the momentum method for stabilizing backpropagation. The difference is that in the case of momentum previous weight changes are remembered, whereas here previous gradient vectors are remembered. Depending on the model used for Q(s_t, a_t), for example a neural network, we plug its gradient vector into equation 18.23.
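As a concrete illustration, the sketch below performs the one-step gradient update for a linear regressor Q(s, a|θ) = θᵀφ(s, a), for which the gradient ∇_θ Q(s_t, a_t) is simply φ(s_t, a_t). The feature function phi and all names are assumptions, not the textbook's notation.

```python
import numpy as np

def sarsa_linear_step(theta, phi, s, a, r, s_next, a_next, eta=0.1, gamma=0.9):
    # Q(s, a | theta) = theta . phi(s, a); phi returns a feature vector for (s, a)
    q_sa = theta @ phi(s, a)
    q_next = theta @ phi(s_next, a_next)
    delta = r + gamma * q_next - q_sa            # temporal difference error
    theta = theta + eta * delta * phi(s, a)      # one-step gradient update of theta
    return theta
```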

UNIT-WISE IMPORTANT QUESTIONS:-
1. What is Reinforcement Learning? Explain the K-Armed Bandit.
2. Discuss the elements of reinforcement learning.
3. Identify an example of a reinforcement learning application that can be modeled by a POMDP. Define the states, actions, observations, and reward.
4. Explain the value iteration algorithm for model-based learning.
5. Explain the policy iteration algorithm for model-based learning.
6. Identify the different algorithms in temporal difference learning.
7. Explain deterministic rewards and actions.
8. Explain nondeterministic rewards and actions.
9. Explain the Q learning off-policy temporal difference algorithm.
10. Explain the Sarsa algorithm, an on-policy version of Q learning.
11. Outline the concept of generalizing from examples.
