Reinforcement Learning 2
Introduction, Learning Task, Q Learning, Nondeterministic Rewards and Actions, Temporal-Difference Learning, Relationship to Dynamic Programming, Active Reinforcement Learning, Generalization in Reinforcement Learning.
Motivation, Basics of Sampling Theory: Error Estimation and Estimating Binomial Proportions, The Binomial Distribution, Estimators, Bias, and Variance.
Reinforcement learning addresses the question of how an autonomous agent that senses and acts in its
environment can learn to choose optimal actions to achieve its goals.
4.1. Introduction
Consider building a learning robot. The robot, or agent, has a set of sensors to observe the state of
its environment, and a set of actions it can perform to alter this state.
Its task is to learn a control strategy, or policy, for choosing actions that achieve its goals.
The goals of the agent can be defined by a reward function that assigns a numerical value to each
distinct action the agent may take from each distinct state.
This reward function may be built into the robot, or known only to an external teacher who
provides the reward value for each action performed by the robot.
The task of the robot is to perform sequences of actions, observe their consequences, and learn a
control policy.
The control policy is one that, from any initial state, chooses actions that maximize the reward
accumulated over time by the agent.
Example:
A mobile robot may have sensors such as a camera and sonars, and actions such as "move
forward" and "turn."
The robot may have a goal of docking onto its battery charger whenever its battery level is low.
The goal of docking to the battery charger can be captured by assigning a positive reward (e.g.,
+100) to state-action transitions that immediately result in a connection to the charger, and a
reward of zero to every other state-action transition.
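As a concrete illustration (not from the text itself), the reward structure just described can be written as a small function. The state label "connected" below is a hypothetical name for the situation in which the robot has just docked with its charger.

```python
def reward(state, action, next_state):
    """Sketch of the docking reward r(s, a): +100 for transitions that
    immediately connect the robot to its charger, 0 for everything else."""
    # "connected" is an assumed label for the docked state, not from the text.
    return 100.0 if next_state == "connected" else 0.0
```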
The reinforcement learning problem differs from other function approximation tasks in several important respects:
1. Delayed reward: The task of the agent is to learn a target function π that maps from the current state
s to the optimal action a = π(s). In reinforcement learning, however, training examples of the form (s, π(s)) are not available.
Instead, the trainer provides only a sequence of immediate reward values as the agent executes its
sequence of actions. The agent therefore faces the problem of temporal credit assignment: determining
which of the actions in its sequence are to be credited with producing the eventual rewards.
2. Exploration: In reinforcement learning, the agent influences the distribution of training examples by
the action sequence it chooses. This raises the question of which experimentation strategy produces the
most effective learning. The learner faces a trade-off in choosing whether to favor exploration of
unknown states and actions, or exploitation of states and actions that it has already learned will yield
high reward (a small ε-greedy sketch of this trade-off follows this list).
3. Partially observable states: Although it is convenient to assume that the agent's sensors can perceive the entire state of the
environment at each time step, in many practical situations sensors provide only partial information. In such cases, the
agent needs to consider its previous observations together with its current sensor data when choosing
actions, and the best policy may be one that chooses actions specifically to improve the observability of
the environment.
4. Life-long learning: A robot often needs to learn several related tasks within the same environment, using
the same sensors. For example, a mobile robot may need to learn how to dock on its battery charger,
how to navigate through narrow corridors, and how to pick up output from laser printers. This setting
raises the possibility of using previously obtained experience or knowledge to reduce sample complexity
when learning new tasks.
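The exploration-exploitation trade-off mentioned in point 2 is often handled with an ε-greedy rule: with probability ε the agent tries a random action (exploration), otherwise it takes the action it currently believes is best (exploitation). The sketch below is illustrative only; the table `q` of estimated action values is a hypothetical data structure, not something defined in the text.

```python
import random

def epsilon_greedy(q, state, actions, epsilon=0.1):
    """Pick an action for `state`: explore with probability epsilon,
    otherwise exploit the current value estimates in q[(state, action)]."""
    if random.random() < epsilon:
        return random.choice(actions)                               # exploration
    return max(actions, key=lambda a: q.get((state, a), 0.0))       # exploitation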
4.2. The Learning Task
At each discrete time step t, the agent senses the current state st and chooses and performs an action at. The environment responds by giving the agent a reward rt = r(st, at) and by producing the succeeding state st+1 = δ(st, at). Here the functions r and δ depend only on the current state and action, and not on earlier states or actions.
The task of the agent is to learn a policy π : S → A for selecting its next action at based on the currently
observed state st; that is, π(st) = at.
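The interaction just described can be sketched as a simple loop. This is illustrative only; `pi`, `r`, and `delta` stand for the policy, reward function, and state-transition function defined above.

```python
def run_episode(s0, pi, r, delta, steps=10):
    """Roll out policy pi from initial state s0, collecting the reward sequence."""
    s, rewards = s0, []
    for _ in range(steps):
        a = pi(s)                  # a_t = pi(s_t)
        rewards.append(r(s, a))    # r_t = r(s_t, a_t)
        s = delta(s, a)            # s_{t+1} = delta(s_t, a_t)
    return rewards
```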
1. One approach is to require the policy that produces the greatest possible cumulative reward for the
robot over time.
To state this requirement more precisely, define the cumulative value Vπ(st) achieved by following an arbitrary policy π from an arbitrary initial state st as:
Vπ(st) ≡ rt + γ rt+1 + γ² rt+2 + ... = Σi=0..∞ γ^i rt+i
where the sequence of rewards rt+i is generated by beginning at state st and by repeatedly using
the policy π to select actions.
Here 0 ≤ γ < 1 is a constant that determines the relative value of delayed versus immediate rewards. If we set γ = 0, only the immediate reward is considered. As γ is set closer to 1, future rewards are given
greater emphasis relative to the immediate reward.
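For a finite sequence of observed rewards, the discounted sum above can be computed directly; the snippet below is a minimal sketch of that arithmetic.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite reward list."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Example: rewards [0, 0, 100] with gamma = 0.9 give 0 + 0 + 0.81 * 100 = 81.0
```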
An alternative measure of total reward considers the average reward per time step over the entire lifetime of the agent.
We require that the agent learn a policy π that maximizes Vπ(s) for all states s. We call such a policy an optimal policy and denote it by π*:
π* ≡ argmaxπ Vπ(s), (∀s)
To simplify notation, we refer to the value function Vπ*(s) of such an optimal policy as V*(s). V*(s) gives the maximum discounted cumulative
reward that the agent can obtain starting from state s.
Example:
A simple grid-world environment is depicted in the diagram
The six grid squares in this diagram represent six possible states, or locations, for the agent.
Each arrow in the diagram represents a possible action the agent can take to move from one state
to another.
The number associated with each arrow represents the immediate reward r(s, a) the agent receives
if it executes the corresponding state-action transition
The immediate reward in this environment is defined to be zero for all state-action transitions
except for those leading into the state labelled G. The state G is the goal state, and the agent can
receive reward by entering this state.
Once the states, actions, and immediate rewards are defined, and once we choose a value for the discount factor γ, we can determine the optimal policy π* and its value function V*(s).
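As an illustration of that claim, the sketch below runs value iteration on a tiny deterministic world with a goal state G. The three-state layout, the action table, and the constant GAMMA = 0.9 are assumptions chosen to mirror the description above (reward 100 on entering G, 0 elsewhere); they are not the exact grid from the figure.

```python
GAMMA = 0.9
states = ["s1", "s2", "G"]                      # assumed 3-state chain s1 -> s2 -> G
actions = {"s1": {"right": "s2"}, "s2": {"right": "G"}, "G": {}}

def r(s, a, s_next):
    # Reward 100 for transitions entering the goal state G, 0 otherwise.
    return 100.0 if s_next == "G" else 0.0

V = {s: 0.0 for s in states}
for _ in range(50):
    # Value iteration update: V*(s) = max_a [ r(s, a) + GAMMA * V*(delta(s, a)) ]
    V = {s: max((r(s, a, s2) + GAMMA * V[s2] for a, s2 in actions[s].items()),
                default=0.0)
         for s in states}

print(V)   # under these assumptions: V*(s2) = 100.0, V*(s1) = 90.0, V*(G) = 0.0
```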