MODULE- I
Syllabus
Introduction to Reinforcement Learning - Reinforcement Learning framework and applications - Introduction to Immediate Reinforcement Learning - Bandit Algorithm Introduction - Bandit Algorithm, UCB, PAC, Bandit Optimality - Value function based method - Policy Gradient
Introduction
Machine Learning Types:
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning:
- Close to Human Learning
- Algorithm Learns a policy of how to act in a given environment.
- Every action has some impact in the environment, and the
environment provides rewards that guides the learning algorithm.
RL Vs Supervised Learning
• Reinforcement learning is different from supervised learning.
Supervised learning is learning from a training set of labeled data.
• In RL, the agent must learn to act correctly in situations that are not present in any training set.
• Supervision is very sparse.
• No target output is provided.
• No error gradient information is available.
• The agent’s actions determine the next state.
RL Vs Unsupervised Learning
• Unsupervised learning is finding structure hidden in collections of
unlabeled data.
• In RL, pattern detection is not the primary goal.
• RL is trying to maximize a reward signal instead of trying to find
hidden structure.
• Uncovering structure in an agent’s experience can certainly be useful
in reinforcement learning, but by itself it does not address the
reinforcement learning problem of maximizing a reward signal.
Reinforcement Learning
• Reinforcement: making something stronger.
• Learning: the process of learning something; knowledge that you get from studying.
• RL is reward-based learning.
• Reinforcement learning is a framework for solving control tasks (also
called decision problems) by building agents that learn from the
environment by interacting with it through trial and error
and receiving rewards (positive or negative) as unique feedback.
• Learning from interactions with the environment comes from our
natural experiences.
• RL is the science of decision making.
• Ex: Chess, Maze solving (e.g., a Micromouse robot), Path planning, Soccer robots, Sweeper robots, etc.
Reinforcement Learning (Contd..)
• An agent is an entity that moves through the environment, performs
actions, and achieves a goal. (Or) An agent is an intelligent program that
learns to make decisions. We can say that an agent is the learner in the
RL setting.
• The agent learns how to move from the initial state to the goal state while collecting as much reward as possible along the way.
• Environment: The playground of the agent is called the environment.
The agent takes all its actions in the environment and is bound to it
(the external conditions), or: the situation in which the agent is
present and by which it is surrounded.
• Action: Actions are the possible decisions taken by an agent within the
environment, i.e., the agent can perform actions based on the current
state of the environment.
Reinforcement Learning(Contd..)
• State: A state is the situation returned by the environment after each
action taken by the agent, or: a snapshot of the environment at a given
point in time.
• Reward: Feedback returned to the agent from the environment to
evaluate the action of the agent.
• The reward is simply a numeric value, which can be negative or
positive, with a magnitude indicating how good or bad the outcome was.
Reinforcement Learning(Contd..)
1.Chess Play
• Environment: The 8x8 chessboard with pieces.
• Agent: The AI player.
• State: Current positions of all pieces, whose turn it is, castling rights, etc.
• Action: Moving a piece (e.g., "Queen to E5").
• Reward:
• +1 for capturing an opponent’s piece,
• +10 for checkmate (win),
• -10 for being checkmated (loss),
• -0.1 per move (encourages efficiency).
Reinforcement Learning(Contd..)
Self-Driving Car
• Environment: Roads, traffic signals, pedestrians, and other vehicles.
• Agent: The autonomous driving system.
• State: Car’s speed, position, nearby obstacles, traffic light status, etc.
• Action: Accelerate, brake, steer left/right, change lanes.
• Reward:
• +1 for maintaining safe speed,
• +10 for obeying traffic rules,
• -100 for collisions,
• -5 for sudden braking (penalizing discomfort).
Reinforcement Learning(Contd..)
Pac-Man Game
• Environment: The Pac-Man maze with ghosts, pellets, and walls.
• Agent: The Pac-Man character.
• State: Current position of Pac-Man, positions of ghosts, remaining
pellets, etc.
• Action: Move up, down, left, or right.
• Reward:
• +10 for eating a pellet,
• +200 for eating a power pellet and a ghost,
• -1 for each time step (encourages efficiency),
• -500 if caught by a ghost (game over).
Robot Vacuum Cleaner
• Environment: A room with furniture, obstacles, and dirt.
• Agent: The vacuum robot.
• State: Battery level, dirt detected, current location, obstacle proximity.
• Action: Move forward, turn left/right, start/stop suction.
• Reward:
• +5 for cleaning dirt,
• -1 for bumping into obstacles,
• -0.2 per second of movement (energy cost),
• +20 if the room is fully cleaned.
Reinforcement Learning(Contd..)
Personalized Treatment Planning (Healthcare)
• Environment: A patient’s medical history, real-time vitals, lab results, and treatment options.
• Agent: The AI system recommending treatments (e.g., dosage adjustments, drug choices).
• State:
• Patient’s current condition
• Action:
• Adjust medication dosage, Recommend a new drug, Suggest lifestyle changes (e.g., diet,
exercise),Order further diagnostic tests.
• Reward:
• +50 if patient’s health improves (e.g., normalized vitals).
• -20 if side effects occur (e.g., adverse drug reaction).
• -100 if condition worsens (e.g., hospitalization needed).
• +10 for cost-effective treatments (encourages efficiency).
Elements of Reinforcement Learning
• Reinforcement learning systems have 4 main elements:
• Policy
• Reward signal
• Value function
• Model of the environment
Elements of Reinforcement Learning
• A policy function, often denoted as π (pi), is a strategy that
maps states to actions, determining what action an agent
should take in a given state.
• It's the agent's decision-making rule and can be deterministic
(choosing one action for each state) or stochastic (assigning
probabilities to multiple actions).
• The goal is to find an optimal policy that maximizes the
cumulative reward over time.
Elements of Reinforcement Learning
• Reinforcement learning algorithms learn policies by interacting
with the environment, receiving rewards, and updating the
policy to maximize the expected cumulative reward.
• This process involves finding the best balance between
exploration (trying new actions) and exploitation (taking the
actions known to be good).
Elements of Reinforcement Learning
Policy:
A Policy defines the learning agent’s way of behaving at a given time.
• A policy is a mapping from the perceived states of the environment to
actions to be taken when in those states.
• A reinforcement learning agent uses a policy to select actions given
the current environment state.
• A policy may be a lookup table or a simple function.
• In some cases it may involve extensive computation, such as a search
process.
• In general, policies may be stochastic, specifying probabilities for each
action.
Elements of Reinforcement Learning
Vacuum Robot
• If dirt detected in current cell:
→ Action: Activate suction until dirt is gone.
• If battery level < 20%:
→ Action: Return to charging station.
• If obstacle detected within 0.5m:
→ Action: Turn 90° and move forward.
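As noted earlier, a policy may be as simple as a lookup table or a small function. A minimal sketch of the vacuum robot rules above as a deterministic policy function (the state fields and action names are illustrative assumptions, not a standard API):

```python
def vacuum_policy(state):
    """Deterministic policy: maps an observed state to an action.

    `state` is assumed to be a dict with keys 'dirt_detected' (bool),
    'battery_level' (percent), and 'obstacle_distance_m' (metres).
    """
    if state["dirt_detected"]:
        return "activate_suction"           # clean the current cell first
    if state["battery_level"] < 20:
        return "return_to_charging_station"
    if state["obstacle_distance_m"] < 0.5:
        return "turn_90_and_move_forward"
    return "move_forward"                   # default behaviour when no rule fires

# Example: with a low battery and no dirt, the policy chooses to recharge.
print(vacuum_policy({"dirt_detected": False,
                     "battery_level": 15,
                     "obstacle_distance_m": 2.0}))
```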
Elements of Reinforcement Learning
Policy Rules :
Chess play
• If opponent’s queen is undefended and can be captured safely:
→ Action: Capture the queen.
• If in an endgame (few pieces left) and a pawn can be
promoted:
→ Action: Push the pawn toward promotion.
Elements of Reinforcement Learning
Personalized Treatment Planning (Chronic Disease Management)
Policy:
• If patient’s blood glucose > 180 mg/dL and no recent insulin
dose:
→ Action: Increase insulin dose by 2 units.
• If patient reports dizziness and blood pressure < 90/60 mmHg:
→ Action: Reduce hypertensive medication by 1 dose.
• If patient’s symptoms improve and no side effects:
→ Action: Maintain current treatment.
Elements of Reinforcement Learning
Reward Signal:
• The reward signal defines the goal of RL.
• On each time step, the environment sends a single number called the
reward to the reinforcement learning agent
• The agent’s objective is to maximize the total reward that it receives over
the long run.
• The reward signal thus defines what the good and bad events are for the
agent.
• The reward signal is used to alter the policy.
• In general, reward signals may be stochastic functions of the state of the
environment and the actions taken.
• The reward signal indicates what is good in an immediate sense
Elements of Reinforcement Learning..
Value function :
• A value function specifies what is good in the long run.
• The value of a state is the total amount of reward an agent can expect
to accumulate over the future, starting from that state.
• For example, a state might always yield a low immediate reward but
still have a high value because it is regularly followed by other states
that yield high rewards. Or the reverse could be true.
• Rewards are primary, whereas values, as predictions of rewards, are
secondary. Without rewards there could be no values.
• The most important component of almost all RL algorithms is a method for
efficiently estimating values.
Elements of Reinforcement Learning..
• We seek actions that bring about states of highest value, not highest
reward, because these actions obtain the greatest amount of reward
for us over the long run.
• Rewards are basically given directly by the environment, but values
must be estimated and re-estimated from the sequences of
observations an agent makes over its entire lifetime.
Elements of Reinforcement Learning..
Model of the environment:
• It mimics the behaviour of the environment.
• It allows inferences to be made about how the environment will behave.
• Ex: Given a state and action, the model might predict the resultant next
state and next reward.
• Models are used for planning, which means deciding on a course of action
by considering possible future situations before they are experienced
• Model-based methods use models and planning. Think of this as modelling
the dynamics p(s’ | s, a)
• Model-free methods learn exclusively from trial-and-error (i.e. no
modelling of the environment)
• Model-based RL:
The agent learns a model of the environment, allowing it to predict future
states and rewards. This model can then be used for planning and
optimization.
• Model-free RL:
The agent learns directly from its experiences without building an explicit
model of the environment. It relies on trial-and-error to update its policy
based on the rewards it receives.
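To make the idea of a model concrete, here is a minimal sketch of a tabular transition model p(s′ | s, a), stored as a dictionary and sampled during planning. All states, actions, rewards, and probabilities below are made-up values for illustration only:

```python
import random

# Tabular model: (state, action) -> list of (next_state, reward, probability).
# The entries are illustrative, not learned from any real environment.
model = {
    ("s0", "left"):  [("s0", 0.0, 0.8), ("s1", 1.0, 0.2)],
    ("s0", "right"): [("s1", 1.0, 0.9), ("s0", 0.0, 0.1)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 5.0, 1.0)],
}

def sample_model(state, action):
    """Draw a (next_state, reward) pair according to the model's p(s', r | s, a)."""
    outcomes = model[(state, action)]
    weights = [p for (_, _, p) in outcomes]
    next_state, reward, _ = random.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

# A planner can roll out imagined experience without touching the real environment.
print(sample_model("s0", "right"))
```

A model-free method skips this structure entirely and updates its estimates only from transitions actually experienced.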
Introduction to Immediate RL
• Immediate Reinforcement Learning (IRL) refers to a subset of
reinforcement learning (RL) where the agent receives feedback
(rewards or penalties) immediately after taking an action, rather
than waiting for a delayed outcome. This contrasts with traditional RL,
where rewards might be sparse or delayed, making learning more
challenging.
Introduction to Immediate RL…
• Instant Feedback: The agent gets a reward signal right after each action,
simplifying the credit assignment problem (knowing which action led to the
reward).
• Simpler Learning: Since feedback is immediate, algorithms can often learn
faster compared to environments with delayed rewards.
• Common in Bandit Problems: A classic example is the multi-armed
bandit problem, where an agent chooses actions (e.g., pulling a lever) and
gets instant rewards.
• Examples:
• Advertisement: An algorithm shows an ad and immediately observes
whether the user clicked it (reward = 1 for click, 0 otherwise)
Introduction-Bandit Algorithm:
• The name comes from imagining a gambler at a row of slot
machines (sometimes known as "one-armed bandits"), who has to
decide which machines to play, how many times to play each machine
and in which order to play them, and whether to continue with the
current machine or try a different machine.
• It is a type of decision-making algorithm that helps to balance
exploration (trying new actions to discover their rewards) and
exploitation (choosing the best-known action to maximize reward).
Multi-Armed Bandit Problem(MAB)
• In the Multi-Armed Bandit problem, an agent is presented with
multiple options (arms), each providing a reward drawn from an
unknown probability distribution. The agent aims to maximize the
cumulative reward over a series of trials.
• The challenge lies in choosing the best arm to pull, balancing the
need to explore different arms to learn about their reward
distributions and exploiting the known arms that have provided high
rewards.
Multi-Armed Bandit Problem(MAB)….
• Examples :
• Game Designing : Building a hit game is challenging. MABP can be used to test
experimental changes in game play/interface and exploit the changes which show
positive experiences for players.
• Recommendation Systems: Suggest products/movies/music dynamically.
• Clinical Trials: Allocate patients to treatments while learning which is best. The
well-being of patients during clinical trials is as important as the actual results of
the study. Here, exploration is equivalent to identifying the best treatment, and
exploitation is treating patients as effectively as possible during the trial.
• Online Ads: Decide which ad to show to a user to maximize click-through rate.
The goal of an advertising campaign is to maximise revenue from displaying ads.
The advertiser makes revenue every time an offer is clicked by a web user.
• Similar to MABP, there is a trade-off between exploration, where the goal is to
collect information on an ad’s performance using click-through rates, and
exploitation, where we stick with the ad that has performed the best so far.
Multi-Armed Bandit(MAB) ….
Exploration versus Exploitation
• Exploration is the action of allowing an agent to discover new
features about the environment.
• Exploitation is making the agent stick to the existing knowledge
gained.
• If the agent continuously exploits past experiences, it is likely to get stuck.
• On the other hand, if it continues to explore, it might never find a
good policy. This results in the exploration-exploitation dilemma.
• Trade-off between exploration and exploitation
Multi-Armed Bandit(MAB) ….
• Exploration is an action that enables agents to gain knowledge about
the environment or model. The exploration process chooses actions
with unpredictable results to collect information about the states and
rewards that the performed actions will result in.
• Exploitation is a strategy in reinforcement learning that an agent
leverages to make decisions in a state from the existing knowledge to
maximize the expected reward. The goal of exploitation is utilizing
what is already known about the environment to achieve the best
outcome.
Multi-Armed Bandit(MAB)
Exploration:
• Try new things all the time.
• Take some risk to collect information about unknown options.
Exploitation:
• Use previous experience.
• Take advantage of the best option we know.
Multi-Armed Bandit(MAB) - Definition
• The central dilemma in the MAB problem is the trade-off between
exploration (trying different arms to gather information about their
rewards) and exploitation (choosing the arm that has provided the
highest rewards based on current information). Balancing these two
aspects is crucial for optimizing long-term rewards.
• Formal Representation
• Formally, the MAB problem can be described as follows:
• Arms: K independent arms, each with an unknown reward
distribution.
• Rewards: Each arm i provides a reward Ri, drawn from an unknown
distribution with an expected value.
• Objective: Maximize the cumulative reward over multiple trials.
• In our k-armed bandit problem, each of the k actions has an expected
or mean reward given that that action is selected; let us call this the
value of that action. We denote the action selected on time step t as
At, and the corresponding reward as Rt. The value, then, of an arbitrary
action a, denoted q*(a), is the expected reward given that a is
selected:
q*(a) = E[ Rt | At = a ]
Action-value Methods
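Action values can be estimated from observed rewards. A common sketch, following the usual k-armed bandit treatment, is the sample-average estimate together with greedy selection:

```latex
% Sample-average estimate of the value of action a before time step t
Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \,\mathbb{1}[A_i = a]}{\sum_{i=1}^{t-1} \mathbb{1}[A_i = a]},
\qquad
A_t \doteq \arg\max_a Q_t(a) \quad \text{(greedy selection)}.
```

By the law of large numbers, Qt(a) converges to q*(a) as action a is selected more and more often.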
Strategies to Solve the Multi-Armed Bandit Problem
• Greedy Strategy : Select the arm with the highest average so far.
• ε-Greedy Strategy :Select an arm at random with probability ε and
otherwise do a greedy selection
a. With probability ε: explore (choose a random arm).
b. With probability 1−ε: exploit (choose the best-known arm).
c. Simple but effective.
• Upper Confidence Bound (UCB)
• Probably Approximately Correct (PAC)
Epsilon-Greedy
• The epsilon-greedy algorithm is one of the simplest strategies for solving
the MAB problem. It works as follows:
• With probability ϵ, explore a random arm.
• With probability 1−ϵ, exploit the arm with the highest estimated reward.
Algorithm of Epsilon-Greedy
1. Initialize the estimated values of all arms to zero or a small positive
number.
2. For each trial:
• Generate a random number between 0 and 1.
• If the number is less than ϵ, select a random arm (exploration).
• Otherwise, select the arm with the highest estimated reward (exploitation).
• Update the estimated reward of the selected arm based on the observed reward.
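A minimal sketch of the steps above for a k-armed bandit, assuming the arm rewards are drawn from noisy distributions with the (unknown) illustrative means below:

```python
import random

true_means = [0.2, 0.5, 0.8]   # illustrative true mean reward of each arm
k = len(true_means)
epsilon = 0.1

q_estimates = [0.0] * k        # step 1: initialize estimated values to zero
counts = [0] * k               # number of times each arm has been pulled

for t in range(1000):
    # Step 2: explore with probability epsilon, otherwise exploit.
    if random.random() < epsilon:
        arm = random.randrange(k)                          # exploration
    else:
        arm = max(range(k), key=lambda a: q_estimates[a])  # exploitation
    reward = random.gauss(true_means[arm], 1.0)            # observe a noisy reward
    # Incremental sample-average update of the selected arm's estimate.
    counts[arm] += 1
    q_estimates[arm] += (reward - q_estimates[arm]) / counts[arm]

print("Estimated arm values:", [round(q, 2) for q in q_estimates])
```

With enough trials the estimates approach the true means, so the greedy choice settles on the best arm while the ε fraction of pulls keeps exploring.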
Upper-Confidence-Bound Action Selection
• Upper Confidence Bound action selection uses uncertainty in the
action – value estimates for balancing exploration and exploitation.
• The estimated means (or a similar quantity) of the random variables
reflect the current knowledge of the algorithm in a condensed form
and guide further exploitation.
• The widths of the confidence bounds reflect the uncertainty of the
algorithm’s knowledge and will guide further exploration.
• By relating means and widths we can obtain criteria on when to
explore and when to exploit.
UCB
• UCB action selection rule: At = argmax_a [ Qt(a) + c · sqrt( ln(t) / Nt(a) ) ]
• where ln(t) denotes the natural logarithm of t,
• Nt(a) denotes the number of times that action a has been selected
prior to time t,
• and the number c > 0 controls the degree of exploration.
• If Nt(a) = 0, then a is considered to be a maximizing action.
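A sketch of UCB selection under the same illustrative bandit setup used for ε-greedy above (the value of c and the reward model are assumptions, not prescribed values):

```python
import math
import random

true_means = [0.2, 0.5, 0.8]   # illustrative, unknown to the agent
k = len(true_means)
c = 2.0                        # exploration coefficient (c > 0)

q_estimates = [0.0] * k
counts = [0] * k

for t in range(1, 1001):
    untried = [a for a in range(k) if counts[a] == 0]
    if untried:
        arm = untried[0]       # Nt(a) = 0: treat the arm as maximizing
    else:
        # UCB rule: estimated mean plus a confidence-width bonus.
        arm = max(range(k),
                  key=lambda a: q_estimates[a] + c * math.sqrt(math.log(t) / counts[a]))
    reward = random.gauss(true_means[arm], 1.0)
    counts[arm] += 1
    q_estimates[arm] += (reward - q_estimates[arm]) / counts[arm]

print("Pull counts per arm:", counts)
```

Arms that have been tried rarely keep a large bonus term and are revisited occasionally, while well-estimated good arms dominate the selection.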
Probably Approximately Correct (PAC)
• It addresses the problem of learning a function from a set of samples
in a way that is both probably correct and approximately correct.
• The "probably" aspect refers to the confidence level of the algorithm,
while the "approximately correct" aspect refers to the accuracy of the
hypothesis.
• An algorithm is a (ε,δ)-PAC algorithm for the multi armed bandit with
sample complexity T , if it outputs an ε-optimal arm, a′ , with
probability at least 1−δ, when it terminates, and the number of time
steps the algorithm performs until it terminates is bounded by T .
Probably Approximately Correct (PAC)
• In the multi-armed bandit setting, you have n arms (actions), each
providing a stochastic reward. The goal is to identify the best arm (the
one with the highest expected reward) with high confidence using as
few samples (pulls) as possible.
• The agent’s goal is to find, with high probability, a near-optimal arm;
namely, with probability at least (1-δ), to output an ε-optimal arm.
• ℓ tells you how many times to sample each arm to ensure that your
empirical estimates are accurate enough to make a good decision.
• n: Number of arms
• ε: Allowed regret margin — you want to be within ε of the best arm
• δ: Confidence — probability of failing to find a good arm should be
less than δ
• ℓ: Number of times you need to sample each arm to ensure accurate
estimation
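For reference, in the naive uniform-sampling (ε,δ)-PAC approach (in the spirit of Even-Dar et al.), every arm is pulled the same number of times and the empirically best arm is returned. One commonly quoted choice for ℓ is sketched below; treat the exact constant as an assumption, since variants of the bound differ:

```latex
% Pulls per arm in the naive (epsilon, delta)-PAC algorithm (constant may vary by analysis)
\ell = \left\lceil \frac{4}{\epsilon^{2}} \ln\!\left(\frac{2n}{\delta}\right) \right\rceil,
\qquad
\text{total sample complexity } T = n\,\ell .
```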
Bandit Optimality
• Bandit optimality refers to a policy or algorithm that achieves the
maximum expected cumulative reward (or equivalently, minimizes
regret) over time.
• Regret Optimality
• PAC Optimality
Bandit Optimality
• Regret Optimality or Regret Minimization (Cumulative Reward
Maximization):
• The primary goal is to minimize cumulative regret, which is the
difference between the reward obtained by always choosing the
optimal arm and the reward obtained by the agent's strategy.
• Regret after T rounds: Regret(T) = T·μ* − Σ_{t=1..T} μ_at
where:
• μ* = mean reward of the best arm,
• μ_at = mean reward of the arm chosen at time t.
Bandit Optimality
An algorithm is (ϵ,δ)-PAC optimal if, after some number of
samples T(ϵ,δ), it returns an arm a such that:
Pr( μ* − μ_a ≤ ϵ ) ≥ 1 − δ
where:
• μ* = true mean of the best arm,
• μ_a = mean of the selected arm,
• ϵ = suboptimality gap tolerance,
• δ = allowed failure probability.
Value function based method
• Value-function-based methods in reinforcement learning
(RL) focus on learning the value of states or state-action pairs
to guide an agent's decision-making. These methods learn a
value function, which estimates the expected cumulative reward
(return) for a given state or state-action pair. The agent then
uses this value function to select actions that maximize the
expected reward.
• Value Function: Estimates the expected return from a state or
state-action pair.
• At each step, the agent learns and updates a value function that
predicts expected return (cumulative future reward).
Value function based method
Types of Value Functions:
• State-value function (V(s)):
• Estimates the expected return when starting from a specific
state and following a given policy.
• Action-value function (Q(s, a)):
• Estimates the expected return when starting from a specific
state, taking a specific action, and then following a given policy.
• For the state-value function, we calculate the value of a state St.
• For the action-value function, we calculate the value of the state-
action pair (St, At), i.e., the value of taking that action at that
state.
• In either case, whichever value function we choose (state-value or
action-value function), the returned value is the expected return.
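In the usual notation, both value functions are expectations of the return Gt (the discounted sum of future rewards) under the policy π:

```latex
V_{\pi}(s) \doteq \mathbb{E}_{\pi}\!\left[ G_t \mid S_t = s \right],
\qquad
Q_{\pi}(s, a) \doteq \mathbb{E}_{\pi}\!\left[ G_t \mid S_t = s,\ A_t = a \right],
\qquad
G_t \doteq \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}.
```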
Value function based method
Examples of Algorithms:
• Q-Learning: Learns the action-value function Q(s,a) directly.
• SARSA (State-Action-Reward-State-Action): Updates the
action-value function based on the action actually taken.
• Deep Q-Networks (DQN): Uses neural networks to
approximate the Q-function for environments with large or
continuous state spaces.
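A minimal sketch of the tabular Q-learning update mentioned above; the environment interface, state/action sets, and hyperparameters are illustrative assumptions:

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.99, 0.1
actions = ["up", "down", "left", "right"]
Q = defaultdict(float)                     # Q[(state, action)] -> estimated return

def choose_action(state):
    """Epsilon-greedy behaviour policy over the current Q estimates."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(state, action, reward, next_state):
    """One-step Q-learning: move Q(s,a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Typical usage inside an episode loop (env.step is a hypothetical environment call):
# action = choose_action(state)
# next_state, reward, done = env.step(action)
# q_learning_update(state, action, reward, next_state)
```

SARSA would replace the max over next actions with the value of the action actually taken next, which is the key difference between the two updates.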
Value function based method
Advantages:
• Simplicity: Often easier to implement and understand.
• Efficiency in Discrete Spaces: Performs well in environments
with a finite set of actions.
Disadvantages:
• Action Space Limitations: Struggles with continuous or high-
dimensional action spaces.
• Implicit Policy Representation: The policy is not explicitly
represented, making it harder to adapt or incorporate prior
knowledge.
Policy Gradient
• Policy gradient methods in reinforcement learning directly
optimize a parametrized policy by using the gradient of the
expected return with respect to the policy parameters. They are
model-free, meaning they don't require a model of the
environment. This approach is particularly well-suited for
handling high-dimensional or continuous action spaces.
• In reinforcement learning, a policy defines how an agent should
choose actions based on the current state of the environment.
• Policy-Based: learn directly the stochastic policy function that
maps states to actions, and act by sampling from that policy.
• Value-Based: the policy is just a simple pre-specified function (for instance, the greedy policy)
that uses the values given by the value function to select its actions.
So the difference is:
• In policy-based training, the optimal policy (denoted π*) is found by training the
policy directly.
• In value-based training, finding an optimal value function (denoted Q* or V*)
leads to having an optimal policy.
Policy-Based Reinforcement Learning
Policy-based methods focus on learning the policy directly
without estimating a value function. The agent optimizes the
policy by adjusting its parameters to maximize the expected
return.
• Policy Representation: The policy π(a∣s; θ) is
parameterized (e.g., by a neural network) and maps states to
actions.
• Optimization Objective: Adjust the policy parameters θ to
maximize the expected cumulative reward.
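Written out, the objective and the gradient used by policy gradient methods (the policy gradient theorem in its REINFORCE form) look roughly like:

```latex
J(\theta) \doteq \mathbb{E}_{\pi_{\theta}}\!\left[ G_t \right],
\qquad
\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\!\left[ G_t \,\nabla_{\theta} \log \pi_{\theta}(A_t \mid S_t) \right],
\qquad
\theta \leftarrow \theta + \alpha \, G_t \,\nabla_{\theta} \log \pi_{\theta}(A_t \mid S_t).
```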
Policy-Based Reinforcement Learning
• REINFORCE Algorithm: A basic policy gradient method that
updates policy parameters in the direction that increases expected
reward.
• Trust Region Policy Optimization (TRPO): Improves training
stability by ensuring that policy updates do not deviate excessively
from the previous policy, but it can be complex to implement due to
its reliance on second-order optimization techniques.
• Proximal Policy Optimization (PPO): An improvement over TRPO,
PPO achieves similar performance while being easier to implement
and computationally more efficient. It simplifies the optimization
process by using a surrogate objective function and clipping
mechanisms to limit the policy update at each step.
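A minimal REINFORCE-style sketch for a one-step (bandit-like) task with a tabular softmax policy; the reward model, step sizes, and all names are illustrative assumptions rather than a canonical implementation:

```python
import math
import random

true_means = [0.2, 0.5, 0.8]   # illustrative expected reward per action
k = len(true_means)
theta = [0.0] * k              # policy parameters: action preferences
alpha = 0.05                   # learning rate

def policy(theta):
    """Softmax over preferences: pi(a) proportional to exp(theta[a])."""
    exps = [math.exp(p) for p in theta]
    z = sum(exps)
    return [e / z for e in exps]

for episode in range(2000):
    probs = policy(theta)
    action = random.choices(range(k), weights=probs, k=1)[0]   # sample from pi
    G = random.gauss(true_means[action], 1.0)                  # one-step return
    # REINFORCE update: theta += alpha * G * grad(log pi(action)).
    # For a softmax policy, d/d_theta_i of log pi(a) = 1[i == a] - pi(i).
    for i in range(k):
        grad_log = (1.0 if i == action else 0.0) - probs[i]
        theta[i] += alpha * G * grad_log

print("Learned action probabilities:", [round(p, 2) for p in policy(theta)])
```

In a full episodic setting the same update is applied at every time step of a sampled trajectory, using the return Gt from that step onward; the high variance of these estimates is exactly the disadvantage noted below.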
Policy-Based Reinforcement Learning
Advantages:
• Continuous Action Spaces: Naturally handles continuous and
high-dimensional action spaces.
• Stochastic Policies: Can learn stochastic policies, beneficial in
environments requiring exploration or dealing with uncertainty.
• Explicit Policy Representation: The policy is explicitly
represented and can be easily adjusted or combined with other
strategies.
Policy-Based Reinforcement Learning
Disadvantages:
• Sample Inefficiency: Generally requires more data to
converge compared to value-based methods.
• High Variance: Policy gradient estimates can have high
variance, making training unstable.
• Complexity: Often more complex to implement and tune.