Reinforcement Learning 2
Introduction, Learning Task, Q Learning, Nondeterministic Rewards and Actions, Temporal-Difference Learning, Relationship to Dynamic Programming, Active Reinforcement Learning, Generalization in Reinforcement Learning.
Motivation, Basics of Sampling Theory: Error Estimation and Estimating Binomial Proportions, The Binomial Distribution, Estimators, Bias, and Variance.
Reinforcement learning addresses the question of how an autonomous agent that senses and acts in its
environment can learn to choose optimal actions to achieve its goals.
4.1. Introduction
Consider building a learning robot. The robot, or agent, has a set of sensors to observe the state of
its environment, and a set of actions it can perform to alter this state.
Its task is to learn a control strategy, or policy, for choosing actions that achieve its goals.
The goals of the agent can be defined by a reward function that assigns a numerical value to each
distinct action the agent may take from each distinct state.
This reward function may be built into the robot, or known only to an external teacher who
provides the reward value for each action performed by the robot.
The task of the robot is to perform sequences of actions, observe their consequences, and learn a
control policy.
The control policy is one that, from any initial state, chooses actions that maximize the reward
accumulated over time by the agent.
Example:
A mobile robot may have sensors such as a camera and sonars, and actions such as "move
forward" and "turn."
The robot may have a goal of docking onto its battery charger whenever its battery level is low.
The goal of docking to the battery charger can be captured by assigning a positive reward (e.g.,
+100) to state-action transitions that immediately result in a connection to the charger, and a
reward of zero to every other state-action transition.
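As a concrete illustration (not from the text itself), the reward structure just described can be written as a small function. The state label "connected" below is a hypothetical name for the situation in which the robot has just docked with its charger.

```python
def reward(state, action, next_state):
    """Sketch of the docking reward r(s, a): +100 for transitions that
    immediately connect the robot to its charger, 0 for everything else."""
    # "connected" is an assumed label for the docked state, not from the text.
    return 100.0 if next_state == "connected" else 0.0
```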
The reinforcement learning problem differs from other function approximation tasks in several important respects:
1. Delayed reward: The task of the agent is to learn a target function π that maps from the current state
s to the optimal action a = π(s). In reinforcement learning, however, training examples of the form (s, π(s)) are not available.
Instead, the trainer provides only a sequence of immediate reward values as the agent executes its
sequence of actions. The agent therefore faces the problem of temporal credit assignment: determining
which of the actions in its sequence are to be credited with producing the eventual rewards.
2. Exploration: In reinforcement learning, the agent influences the distribution of training examples by
the action sequence it chooses. This raises the question of which experimentation strategy produces the
most effective learning. The learner faces a trade-off in choosing whether to favor exploration of
unknown states and actions, or exploitation of states and actions that it has already learned will yield
high reward (a small ε-greedy sketch of this trade-off follows this list).
3. Partially observable states: Although it is convenient to assume that the agent's sensors can perceive the entire state of the
environment at each time step, in many practical situations sensors provide only partial information. In such cases, the
agent needs to consider its previous observations together with its current sensor data when choosing
actions, and the best policy may be one that chooses actions specifically to improve the observability of
the environment.
4. Life-long learning: A robot often needs to learn several related tasks within the same environment, using
the same sensors. For example, a mobile robot may need to learn how to dock on its battery charger,
how to navigate through narrow corridors, and how to pick up output from laser printers. This setting
raises the possibility of using previously obtained experience or knowledge to reduce sample complexity
when learning new tasks.
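The exploration-exploitation trade-off mentioned in point 2 is often handled with an ε-greedy rule: with probability ε the agent tries a random action (exploration), otherwise it takes the action it currently believes is best (exploitation). The sketch below is illustrative only; the table `q` of estimated action values is a hypothetical data structure, not something defined in the text.

```python
import random

def epsilon_greedy(q, state, actions, epsilon=0.1):
    """Pick an action for `state`: explore with probability epsilon,
    otherwise exploit the current value estimates in q[(state, action)]."""
    if random.random() < epsilon:
        return random.choice(actions)                               # exploration
    return max(actions, key=lambda a: q.get((state, a), 0.0))       # exploitation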
4.2. The Learning Task
At each discrete time step t, the agent senses the current state st and chooses and performs an action at. The environment responds by giving the agent a reward rt = r(st, at) and by producing the succeeding state st+1 = δ(st, at). Here the functions r and δ depend only on the current state and action, and not on earlier states or actions.
The task of the agent is to learn a policy π : S → A for selecting its next action at based on the currently
observed state st; that is, π(st) = at.
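The interaction just described can be sketched as a simple loop. This is illustrative only; `pi`, `r`, and `delta` stand for the policy, reward function, and state-transition function defined above.

```python
def run_episode(s0, pi, r, delta, steps=10):
    """Roll out policy pi from initial state s0, collecting the reward sequence."""
    s, rewards = s0, []
    for _ in range(steps):
        a = pi(s)                  # a_t = pi(s_t)
        rewards.append(r(s, a))    # r_t = r(s_t, a_t)
        s = delta(s, a)            # s_{t+1} = delta(s_t, a_t)
    return rewards
```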
1. One approach is to require the policy that produces the greatest possible cumulative reward for the
robot over time.
To state this requirement more precisely, define the cumulative value Vπ(st) achieved by following an arbitrary policy π from an arbitrary initial state st as:
Vπ(st) ≡ rt + γ rt+1 + γ² rt+2 + ... = Σi=0..∞ γ^i rt+i
where the sequence of rewards rt+i is generated by beginning at state st and by repeatedly using
the policy π to select actions.
Here 0 ≤ γ < 1 is a constant that determines the relative value of delayed versus immediate rewards. If we set γ = 0, only the immediate reward is considered. As γ is set closer to 1, future rewards are given
greater emphasis relative to the immediate reward.
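For a finite sequence of observed rewards, the discounted sum above can be computed directly; the snippet below is a minimal sketch of that arithmetic.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite reward list."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Example: rewards [0, 0, 100] with gamma = 0.9 give 0 + 0 + 0.81 * 100 = 81.0
```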
An alternative measure of total reward considers the average reward per time step over the entire lifetime of the agent.
We require that the agent learn a policy π that maximizes Vπ(s) for all states s. We call such a policy an optimal policy and denote it by π*:
π* ≡ argmaxπ Vπ(s), (∀s)
To simplify notation, we refer to the value function Vπ*(s) of such an optimal policy as V*(s). V*(s) gives the maximum discounted cumulative
reward that the agent can obtain starting from state s.
Example:
A simple grid-world environment is depicted in the diagram
The six grid squares in this diagram represent six possible states, or locations, for the agent.
Each arrow in the diagram represents a possible action the agent can take to move from one state
to another.
The number associated with each arrow represents the immediate reward r(s, a) the agent receives
if it executes the corresponding state-action transition
The immediate reward in this environment is defined to be zero for all state-action transitions
except for those leading into the state labelled G. The state G is the goal state, and the agent can
receive reward by entering this state.
Once the states, actions, and immediate rewards are defined, and once we choose a value for the discount factor γ, we can determine the optimal policy π* and its value function V*(s).
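As an illustration of that claim, the sketch below runs value iteration on a tiny deterministic world with a goal state G. The three-state layout, the action table, and the constant GAMMA = 0.9 are assumptions chosen to mirror the description above (reward 100 on entering G, 0 elsewhere); they are not the exact grid from the figure.

```python
GAMMA = 0.9
states = ["s1", "s2", "G"]                      # assumed 3-state chain s1 -> s2 -> G
actions = {"s1": {"right": "s2"}, "s2": {"right": "G"}, "G": {}}

def r(s, a, s_next):
    # Reward 100 for transitions entering the goal state G, 0 otherwise.
    return 100.0 if s_next == "G" else 0.0

V = {s: 0.0 for s in states}
for _ in range(50):
    # Value iteration update: V*(s) = max_a [ r(s, a) + GAMMA * V*(delta(s, a)) ]
    V = {s: max((r(s, a, s2) + GAMMA * V[s2] for a, s2 in actions[s].items()),
                default=0.0)
         for s in states}

print(V)   # under these assumptions: V*(s2) = 100.0, V*(s1) = 90.0, V*(G) = 0.0
```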