
Midterm Report

Summer of Science
Reinforcement Learning (Code: CS36)

Submitted by
Yash Vardhan
Roll No: 24B3319

Under the mentorship of


Shreeya Choudhary

June 17, 2025


N-Armed Bandit Problem
The N-armed bandit problem is a fundamental setting in reinforcement learning
that exemplifies the dilemma of decision-making under uncertainty. It models a
situation where an agent must repeatedly choose from among N different options
(or actions), each yielding stochastic rewards, with the objective of maximizing
cumulative reward over time.

Problem Setting
Consider N distinct actions (arms of a slot machine), each associated with an
unknown but fixed reward distribution. At each time step t, the agent selects
an action At ∈ {1, 2, . . . , N } and receives a reward Rt ∼ PAt drawn from the
probability distribution corresponding to that arm.
We assume:
• Each arm a has an expected reward q∗ (a) = E[Rt |At = a].
• These reward distributions are stationary, i.e., they do not change over
time.

The agent’s goal is to maximize the expected cumulative reward over T time
steps:

\max \; \mathbb{E}\left[ \sum_{t=1}^{T} R_t \right]

Estimating Action Values


Since the true action values q∗(a) are unknown, the agent maintains estimates
Qt(a) of the expected reward of each action. A natural estimate is the sample
average:

Q_t(a) = \frac{\sum_{i=1}^{t} \mathbb{1}_{A_i = a} \, R_i}{\sum_{i=1}^{t} \mathbb{1}_{A_i = a}}

Law of Large Numbers


According to the Law of Large Numbers, if each action is sampled sufficiently
many times, its sample average will converge to the true expected reward:

\lim_{n \to \infty} Q_n(a) = q_*(a)

This justifies the use of sample averages for value estimation. However, it
also emphasizes that each action must be sampled multiple times, which
necessitates exploration. This idea is used heavily later, in environments where
the reward distributions are unknown or non-stationary.

Greedy Action Selection
The greedy method always selects the action with the highest current esti-
mated reward:
A_t = \arg\max_a Q_t(a)

This purely exploitative approach can lead to suboptimal behavior, especially
if the initial estimates are inaccurate. Once a suboptimal action is favored
early, it may continue to dominate due to lack of exploration, preventing
discovery of better actions.

Epsilon-Greedy Method
To address the shortcomings of the greedy method, the ε-greedy method
introduces a controlled degree of randomness:

• With probability ε, select a random action (exploration).


• With probability 1 − ε, select the greedy action (exploitation).

This ensures continual exploration, thereby avoiding premature convergence
to suboptimal actions and allowing Qt(a) to converge to q∗(a) for all a.

Incremental Updates
Rather than storing all past rewards, one can efficiently compute the sample
average using the following incremental update rule:
Q_{n+1}(a) = Q_n(a) + \frac{1}{n}\bigl(R_n - Q_n(a)\bigr)
Here, Qn (a) is the estimate after n selections of action a, and Rn is the most
recent observed reward. This approach:

• Requires only the current estimate and the latest reward.


• Ensures convergence to q∗ (a) under stationarity.

• Automatically decreases the learning rate over time.

A more general form uses a fixed step-size parameter α:

Qn+1 (a) = Qn (a) + α (Rn − Qn (a))

This is particularly useful in non-stationary settings, where recent rewards
should be weighted more heavily than those from the distant past. The step-size
parameter can also be varied from step to step; however, convergence is not
necessarily guaranteed in that case.
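
To make these mechanics concrete, here is a minimal Python sketch (illustrative only, not part of the original report) of an ε-greedy agent that uses the incremental sample-average update above; the class and method names are assumptions.

import random

class EpsilonGreedyBandit:
    """Minimal sketch: epsilon-greedy selection with incremental sample-average updates."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.Q = [0.0] * n_arms   # value estimates Q(a)
        self.N = [0] * n_arms     # selection counts N(a)

    def select_action(self):
        # Explore with probability epsilon, otherwise exploit the current estimates.
        if random.random() < self.epsilon:
            return random.randrange(len(self.Q))
        return max(range(len(self.Q)), key=lambda a: self.Q[a])

    def update(self, action, reward):
        # Incremental sample-average update: Q <- Q + (1/N)(R - Q).
        self.N[action] += 1
        self.Q[action] += (reward - self.Q[action]) / self.N[action]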

Optimistic Initial Values
Another simple yet effective method to encourage exploration in the N-armed
bandit setting is the use of optimistic initial values. Instead of initializing the
estimated values Q0 (a) for each action to zero or some neutral default, we assign
them large, unrealistically high values. As a result, the agent initially assumes
that all actions are highly rewarding. This forces the agent to try each action
at least once, thereby achieving a form of structured exploration, even when
using a purely greedy policy. Over time, as the estimates adjust based on the
actual observed rewards, the agent naturally shifts toward exploiting the best-
performing action. While simple to implement, this method can be insufficient
in non-stationary environments.

Upper Confidence Bound (UCB)


A more mathematically grounded solution to the exploration-exploitation trade-
off is the Upper Confidence Bound (UCB) action selection strategy. This method
augments each action’s estimated value with a confidence interval that accounts
for the uncertainty due to limited sampling. At time t, the agent selects the
action At as follows:

A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]

Here, Qt (a) is the estimated value of action a, Nt (a) is the number of times
action a has been selected so far, t is the current time step, and c is a tunable
parameter that controls the level of exploration. The second term in the expres-
sion incentivizes actions that have been tried less frequently, thereby increasing
their likelihood of being selected. As more information is gathered, the uncer-
tainty term diminishes, and the agent increasingly favors actions with genuinely
high estimated rewards. However, its effectiveness depends on the appropriate
tuning of the exploration coefficient c.
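
As an illustration of the selection rule (under the assumption that untried arms are played first so that Nt(a) > 0), a short Python sketch might look like this; the function name and signature are hypothetical.

import math

def ucb_select(Q, N, t, c=2.0):
    """Sketch of UCB action selection: Q and N are per-arm value estimates and counts."""
    # Play every arm at least once so the confidence term is well defined.
    for a in range(len(Q)):
        if N[a] == 0:
            return a
    # Otherwise pick the arm maximizing Q(a) + c * sqrt(ln t / N(a)).
    return max(range(len(Q)), key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))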

Gradient Bandit Methods


In contrast to value-based approaches, gradient bandit methods learn preferences
over actions directly and optimize the policy by ascending the expected reward
gradient. Let Ht (a) denote the agent’s preference for action a at time t. These
preferences are converted into a probability distribution over actions using the
softmax function:
P_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{N} e^{H_t(b)}}
At each time step, an action is chosen according to this distribution, and the
preferences are updated in a direction that increases the probability of rewarding
actions. After selecting action At and receiving reward Rt , the update rule

becomes:

H_{t+1}(a) = \begin{cases} H_t(a) + \alpha (R_t - \bar{R}_t)\,(1 - P_t(a)) & \text{if } a = A_t \\ H_t(a) - \alpha (R_t - \bar{R}_t)\,P_t(a) & \text{if } a \neq A_t \end{cases}

Here, α is the step size, and R̄t is a baseline such as the average reward, used
to reduce variance in updates. The idea is to increase the preference for actions
that perform better than the baseline, and decrease it for others. This results in
a stochastic policy that gradually shifts toward favoring better actions without
deterministically eliminating suboptimal ones. Gradient bandit methods are
particularly suited to problems requiring a non-deterministic policy, and they
perform well in both stationary and non-stationary environments when tuned
appropriately. However, they can be sensitive to learning rates and may exhibit
instability if reward magnitudes are not normalized.
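
The following Python sketch illustrates one gradient-bandit step under the softmax policy and baseline-centred update described above; the names, and the assumption that the caller maintains the running average reward and supplies a reward-sampling function, are illustrative.

import math
import random

def softmax(H):
    # Convert preferences H(a) into a probability distribution P(a).
    m = max(H)  # subtract the max for numerical stability
    exps = [math.exp(h - m) for h in H]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_bandit_step(H, avg_reward, alpha, reward_fn):
    """One gradient-bandit step; reward_fn(a) samples a reward for action a (assumed helper)."""
    P = softmax(H)
    action = random.choices(range(len(H)), weights=P)[0]
    reward = reward_fn(action)
    # Increase preference for the chosen action if it beat the baseline, decrease the others.
    for a in range(len(H)):
        if a == action:
            H[a] += alpha * (reward - avg_reward) * (1 - P[a])
        else:
            H[a] -= alpha * (reward - avg_reward) * P[a]
    return action, reward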

Markov Decision Processes (MDPs)


To model sequential decision-making problems in a mathematically rigorous
way, reinforcement learning relies on the framework of Markov Decision Pro-
cesses (MDPs). An MDP provides a formalism for environments where out-
comes depend both on the agent’s actions and on stochastic transitions within
the environment.

Definition
A Markov Decision Process is formally defined by a tuple (S, A, P, R, γ), where:

• S: A set of possible states the environment can be in.


• A: A set of actions available to the agent.

• P (s′ | s, a): The transition probability — the probability of moving to
  state s′ after taking action a in state s.
• R(s, a): The reward function — the expected immediate reward re-
ceived when taking action a in state s.

• γ ∈ [0, 1]: The discount factor, determining the weight given to future
rewards.

At each discrete time step t, the agent observes the current state St , selects
an action At , receives a reward Rt+1 , and transitions to the next state St+1
according to the transition probability P (St+1 | St , At ). This interaction forms
a trajectory S0 , A0 , R1 , S1 , A1 , R2 , . . .

The Markov Property
The defining assumption of an MDP is the Markov property, which states that
the future is independent of the past given the present. Formally, the next state
and reward depend only on the current state and action:

P (St+1 = s′ , Rt+1 = r | St = s, At = a, history) = P (s′ , r | s, a)

This assumption allows for compact and recursive mathematical formulations,
including the Bellman equations.

Policy and Return


A policy π defines the agent’s behavior by specifying a mapping from states to
actions. It may be deterministic, where π(s) = a, or stochastic, where π(a | s)
defines the probability of selecting action a in state s.
The return at time t, denoted Gt , is the total discounted sum of future
rewards:
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

The objective of reinforcement learning is to find a policy that maximizes
the expected return from each state.

Value Functions
To evaluate how good it is to be in a certain state or to perform a certain action,
we define two fundamental value functions under a given policy π:
State-Value Function:

vπ (s) = Eπ [Gt | St = s]

This function gives the expected return when starting in state s and following
policy π thereafter.
Action-Value Function:

qπ (s, a) = Eπ [Gt | St = s, At = a]

This function gives the expected return when starting in state s, taking action
a, and subsequently following policy π.
These value functions are central to reinforcement learning and provide the
basis for most algorithmic approaches to policy evaluation and improvement.

Bellman Equation and Policy Evaluation


Starting from the definition of the state-value function under a policy π,

vπ (s) = Eπ [Gt | St = s] = Eπ [Rt+1 + γGt+1 | St = s] ,

we apply the law of iterated expectations and expand the return recursively:
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} P(s', r \mid s, a)\,\bigl[ r + \gamma v_\pi(s') \bigr].

This recursive relationship is known as the Bellman expectation equation.


We now turn to the problem of computing the value function vπ for a given
policy π, a process known as policy evaluation.

Policy Evaluation
Given a fixed policy π, policy evaluation is the process of computing the cor-
responding state-value function vπ (s). Since the Bellman expectation equation
defines a system of equations — one for each state — solving it analytically is
intractable in most environments of practical interest. Instead, we compute the
value function iteratively.
Starting from an initial guess v0 (s), we update the value function using the
Bellman equation as an update rule:
v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} P(s', r \mid s, a)\,\bigl[ r + \gamma v_k(s') \bigr]

This process is repeated until the values converge to within a desired thresh-
old. Under the condition γ < 1, the update is guaranteed to converge to the
true value function vπ .
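
A minimal Python sketch of iterative policy evaluation, assuming (purely for illustration) that the model is given as a dictionary P[s][a] of (probability, next_state, reward) triples and the policy as per-state action probabilities:

def policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
    """Apply the Bellman expectation update until values change by less than theta.

    P[s][a] is assumed to be a list of (prob, next_state, reward) triples;
    policy[s][a] is the probability of taking action a in state s.
    """
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V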

Policy Improvement
Once the value function vπ of a deterministic policy π has been computed, we
can attempt to construct a new deterministic policy that performs at least as
well, if not better. This process is known as policy improvement.
Using the current value function vπ , we define a new policy π ′ by acting
greedily with respect to it. For each state s, the improved policy is given by:
\pi'(s) = \arg\max_a \sum_{s', r} P(s', r \mid s, a)\,\bigl[ r + \gamma v_\pi(s') \bigr]

The policy improvement theorem guarantees that the new policy π′ is at
least as good as the original policy π, meaning vπ′(s) ≥ vπ(s) for all states s. If
the inequality is strict for at least one state, then π′ is strictly better than π.

Policy Improvement Theorem


The Policy Improvement Theorem provides a formal guarantee that acting
greedily with respect to the current value function will yield a policy that is
no worse, and possibly better, than the current one.
Let π and π ′ be two deterministic policies. Suppose that for all s ∈ S,
qπ (s, π ′ (s)) ≥ vπ (s).

Then, the new policy π ′ is at least as good as π:

vπ′ (s) ≥ vπ (s) for all s.

Moreover, if strict inequality holds for any state, i.e., if

qπ (s, π ′ (s)) > vπ (s) for some s,

then π ′ is strictly better than π, meaning

vπ′ (s) > vπ (s) for at least one s.

This result underpins the policy improvement step: it guarantees that iterative
procedures which alternate between evaluating a policy and improving it will
eventually converge to an optimal policy.

Dynamic Programming
Dynamic Programming (DP) refers to a class of algorithms that solve Markov
Decision Processes exactly by recursively computing value functions. These
methods require complete knowledge of the environment — specifically, the
transition probabilities P (s′ , r | s, a) and the reward function R(s, a).
At the heart of DP is the repeated application of the Bellman equations
to update value estimates. The key quantities being stored and updated are
the value functions themselves: either the state-value function v : S → R, or
the action-value function q : S × A → R. These are typically stored as arrays
indexed by state (and action) identifiers.
For finite MDPs with n states and m actions, the space complexity is O(n)
for state-value methods and O(nm) for action-value methods. Each iteration
(or sweep) involves computing expectations over all actions and all possible
next states and rewards, leading to a worst-case time complexity of O(n^2 m)
per iteration. Thus, a DP method is guaranteed to find an optimal policy in
polynomial time, even though the total number of (deterministic) policies is
m^n. However, DP is computationally feasible only for small to moderately sized
problems, and it requires knowledge of all the transition probabilities.

Policy Iteration
Policy iteration is an iterative algorithm that alternates between policy evalu-
ation and policy improvement. Starting from an arbitrary initial policy π0 , the
algorithm performs the following steps at each iteration k:

1. Policy Evaluation: Compute the value function vπk, typically using
   iterative updates.

2. Policy Improvement: Construct a new policy πk+1 by acting greedily
with respect to vπk :
\pi_{k+1}(s) = \arg\max_a \sum_{s', r} P(s', r \mid s, a)\,\bigl[ r + \gamma v_{\pi_k}(s') \bigr]

This procedure is repeated until the policy stabilizes, i.e., πk+1 = πk . The
Policy Improvement Theorem guarantees that each new policy is at least as
good as the previous one. For finite MDPs, the process terminates in a finite
number of iterations and returns the optimal policy π∗ .
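
Reusing the policy_evaluation sketch above and the same assumed model format, the policy-iteration loop might be sketched as follows (illustrative, not a prescribed implementation):

def policy_iteration(P, gamma=0.9):
    """Sketch of policy iteration: alternate evaluation and greedy improvement until stable."""
    # Start from an arbitrary deterministic policy (here: the first listed action per state).
    pi = {s: next(iter(P[s])) for s in P}
    while True:
        # Evaluate the current deterministic policy.
        policy = {s: {a: 1.0 if a == pi[s] else 0.0 for a in P[s]} for s in P}
        V = policy_evaluation(P, policy, gamma)
        # Improve greedily with respect to V.
        stable = True
        for s in P:
            best = max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            if best != pi[s]:
                pi[s] = best
                stable = False
        if stable:
            return pi, V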

Value Iteration
Value iteration simplifies policy iteration by combining the policy evaluation and
improvement steps into a single update rule. Instead of evaluating a policy to
completion, the value function is updated directly using the Bellman optimality
equation:
v_{k+1}(s) = \max_a \sum_{s', r} P(s', r \mid s, a)\,\bigl[ r + \gamma v_k(s') \bigr]

As the iterations proceed, vk converges to the optimal value function v∗ .


Once the value function has sufficiently converged, the corresponding optimal
policy is extracted as:
\pi_*(s) = \arg\max_a \sum_{s', r} P(s', r \mid s, a)\,\bigl[ r + \gamma v_*(s') \bigr]

Value iteration typically converges more quickly than policy iteration, espe-
cially in large state spaces where computing exact value functions for interme-
diate policies is computationally expensive. However for the intermediate vk ,
the existence of a corresponding πk is not necessarily guaranteed.
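
For comparison, a value-iteration sketch under the same assumed model representation, folding the max over actions directly into the backup and extracting a greedy policy at the end:

def value_iteration(P, gamma=0.9, theta=1e-6):
    """Sketch of value iteration using the Bellman optimality backup."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract a greedy policy from the (approximately) optimal value function.
    pi = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        for s in P
    }
    return pi, V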

Policy Evaluation in Unknown Environments


In many real-world scenarios, the environment’s dynamics — such as transition
probabilities and reward functions — are not known a priori. This makes it
impossible to compute the exact expectations in the Bellman equations, as re-
quired by Dynamic Programming methods. Instead, we must rely on experience
to estimate value functions, giving rise to a class of model-free approaches.
Policy evaluation in unknown environments involves estimating the value
function vπ (s) = Eπ [Gt | St = s] from sampled episodes generated by following
the policy π. Rather than using explicit knowledge of P (s′ , r | s, a), these meth-
ods estimate expected returns directly from observed state–reward trajectories.
Two primary families of methods fall under this framework:

• Monte Carlo (MC) Methods: These estimate value functions by aver-
aging complete returns observed at the end of episodes. They are simple,
intuitive, and require no bootstrapping, but they can only update values
after episodes terminate.
• Temporal-Difference (TD) Methods: These update value estimates
incrementally after each step using bootstrapped predictions. They are
online, efficient, and applicable to both episodic and continuing tasks.

Both approaches offer viable solutions to policy evaluation in the absence of
a model and form the foundation for modern reinforcement learning algorithms.

Monte Carlo Methods


Monte Carlo (MC) methods estimate value functions from sampled episodes of
interaction with the environment, without requiring any knowledge of transition
probabilities or reward functions. These methods are particularly effective when
the environment is unknown or analytically intractable.
MC methods operate under the assumption that episodes eventually termi-
nate. They compute value estimates by averaging the returns following visits to
each state, under a fixed policy π. Given that the return from time t is defined
as Gt = Rt+1 + γRt+2 + γ 2 Rt+3 + . . ., the goal is to approximate the expected
return:

v(s) = Eπ [Gt | St = s]

Estimating Value Using Visit Counts


Let N (s) denote the number of times state s has been visited across all sampled
episodes. Let G1 , G2 , . . . , GN (s) be the returns following each of these visits.
The Monte Carlo estimate of v(s) is the sample average:
V(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i
By the Law of Large Numbers, this empirical average converges to the true
expected return as the number of visits grows:
\lim_{N(s) \to \infty} \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i = v_\pi(s)

This provides the statistical foundation for Monte Carlo methods: with
enough data, value estimates converge almost surely to their true expectations.

Incremental Monte Carlo Updates
Rather than storing all past returns, we can update V (s) incrementally after
each visit using:
V(s) \leftarrow V(s) + \frac{1}{N(s)}\bigl(G - V(s)\bigr)
where G is the return following the current visit to s. This implementa-
tion maintains only the latest estimate and a visit count, and incorporates new
information as it arrives.
In non-stationary environments, or when memory is constrained, a constant
step-size version is preferred:

V (s) ← V (s) + α (G − V (s))


where α ∈ (0, 1] is a fixed learning rate. This update behaves like an ex-
ponential moving average, giving more weight to recent returns and enabling
continual learning.
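
A first-visit Monte Carlo prediction sketch in Python, assuming (for illustration) that each episode is supplied as a list of (state, reward) pairs where the reward at position t is Rt+1, received after leaving that state:

def mc_prediction(episodes, gamma=0.9):
    """First-visit Monte Carlo prediction with incremental sample-average updates."""
    V, N = {}, {}
    for episode in episodes:
        # Compute the return G_t for every step by scanning the episode backwards.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        # First-visit rule: only the first occurrence of a state in the episode counts.
        seen = set()
        for state, G in returns:
            if state in seen:
                continue
            seen.add(state)
            N[state] = N.get(state, 0) + 1
            v = V.get(state, 0.0)
            V[state] = v + (G - v) / N[state]
    return V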

Temporal-Difference Learning
Temporal-Difference (TD) learning is a class of model-free reinforcement learn-
ing methods that estimate value functions from raw experience by combining
key ideas from both Dynamic Programming and Monte Carlo methods.
Like Monte Carlo methods, TD learning does not require knowledge of the
environment’s transition probabilities or rewards. However, unlike Monte Carlo
methods, which update value estimates only at the end of an episode, TD learn-
ing updates estimates incrementally at every time step. This makes TD learning
suitable for both online learning and continuing tasks with no natural episode
boundaries.

TD(0)
The simplest TD method is known as TD(0). It performs a one-step update
using the observed reward and the estimated value of the next state. Given a
state St , reward Rt+1 , and next state St+1 , the value function is updated as:

V (St ) ← V (St ) + α [Rt+1 + γV (St+1 ) − V (St )]


This is known as a bootstrapped update, as it relies on the current estimate
V (St+1 ) to improve the estimate of V (St ), rather than waiting for the full return.
The term inside the brackets is called the TD error:

δt = Rt+1 + γV (St+1 ) − V (St )

The TD error measures the discrepancy between the current estimate and a
better-informed one based on a single step of experience.
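
A minimal sketch of the tabular TD(0) update for a single observed transition; the function name and the dictionary-based value table are assumptions made for illustration:

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9, done=False):
    """One TD(0) update for an observed transition (S_t, R_{t+1}, S_{t+1})."""
    # Terminal states contribute no bootstrapped value.
    target = reward + (0.0 if done else gamma * V.get(next_state, 0.0))
    td_error = target - V.get(state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * td_error
    return td_error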

Comparison with Monte Carlo and Dynamic Programming
TD(0) can be viewed as a hybrid between Monte Carlo methods and Dynamic
Programming:

• Monte Carlo methods compute updates based on the complete return Gt
  and wait for episode termination.
• Dynamic Programming methods compute exact expectations using the
environment model.
• TD(0) uses samples from experience and bootstraps by estimating the
return with Rt+1 + γV (St+1 ).

Thus, TD(0) is both model-free and incremental, combining the data efficiency
of DP with the model-free flexibility of MC: unlike DP, it works in an unknown
environment, yet it still exploits the Markov property by bootstrapping on its
own one-step estimates.

Convergence and Bias–Variance Tradeoff


Under standard assumptions — such as sufficient exploration of the state space
and a diminishing step size schedule — TD(0) converges to the true value func-
tion vπ .
Compared to Monte Carlo methods, TD(0) typically has lower variance
because it uses only a single-step reward and an estimated value. However,
it introduces bias due to bootstrapping. This bias–variance tradeoff makes
TD(0) particularly effective in many online learning scenarios and practical RL
applications.

TD(λ)
While TD(0) updates value estimates using a one-step bootstrapped return
and Monte Carlo methods rely on complete returns, TD(λ) provides a unified
framework that generalizes both. It introduces a trace-decay parameter λ ∈
[0, 1] that determines the depth of bootstrapping used in value estimation.
When λ = 0, TD(λ) reduces to TD(0); when λ = 1, it approaches the
behavior of Monte Carlo methods. Intermediate values yield algorithms that
balance bias and variance by blending multiple n-step returns.

Forward View: λ-Return


The forward view of TD(λ) defines the λ-return as an exponentially-weighted
average of n-step returns:

G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}

Here, the n-step return G_t^{(n)} is given by:

G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})
The λ-return smoothly interpolates between the immediate bootstrapped
target and the full Monte Carlo return, allowing credit assignment over varying
time scales.

Backward View: Eligibility Traces


The forward view is not suitable for online learning, since it requires access
to future rewards. The backward view, on the other hand, enables real-time
updates using eligibility traces, which track how recently and frequently each
state has been visited.
At each time step t, the eligibility trace for state s is updated as:

Et (s) = γλEt−1 (s) + 1{St =s}


The value function is then updated for all states as:

V (s) ← V (s) + αδt Et (s)


where δt = Rt+1 + γV (St+1 ) − V (St ) is the TD error.
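
A sketch of one backward-view TD(λ) step with accumulating traces kept in a dictionary; names and structure are illustrative:

def td_lambda_step(V, E, state, reward, next_state, alpha=0.1, gamma=0.9, lam=0.8, done=False):
    """One backward-view TD(lambda) step: compute the TD error, decay all traces,
    bump the trace of the visited state, then update every traced state."""
    delta = reward + (0.0 if done else gamma * V.get(next_state, 0.0)) - V.get(state, 0.0)
    # Decay existing traces and accumulate the trace of the state just visited.
    for s in list(E):
        E[s] *= gamma * lam
    E[state] = E.get(state, 0.0) + 1.0
    # Move every state's value in proportion to its eligibility.
    for s, e in E.items():
        V[s] = V.get(s, 0.0) + alpha * delta * e
    return delta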

Equivalence of Forward and Backward Views


The forward and backward views of TD(λ) are fundamentally equivalent in
terms of the total update applied to the value function over a complete episode.
This result is formalized by the following theorem:

Theorem (Equivalence of Forward and Backward Views)


Let ∆V forward (s) denote the total update to the value function from the forward
view using λ-returns:
\Delta V^{\text{forward}}(s) = \alpha \sum_{t=0}^{T-1} \mathbb{1}_{\{S_t = s\}} \bigl( G_t^{\lambda} - V(S_t) \bigr)
Let ∆V^{backward}(s) denote the corresponding update from the backward view
using eligibility traces and TD errors:

\Delta V^{\text{backward}}(s) = \alpha \sum_{t=0}^{T-1} \delta_t E_t(s)
Then, under batch updates over a finite episode of length T ,

\Delta V^{\text{forward}}(s) = \Delta V^{\text{backward}}(s) \quad \text{for all } s \in S


This theoretical equivalence justifies the use of the backward view in practice,
as it permits efficient, online, and incremental learning.

Bias–Variance Tradeoff and Convergence
TD(λ) converges to the true value function under similar conditions to TD(0),
such as sufficient exploration and appropriate step-size schedules. The param-
eter λ offers fine-grained control over the bias–variance tradeoff: smaller values
yield faster but more biased estimates, while larger values yield more accurate
but higher-variance estimates.
By interpolating between TD and Monte Carlo, combined with its efficient
backward implementation, TD(λ) serves as one of the most powerful and gen-
eral prediction algorithms in reinforcement learning.

Monte Carlo Control


In the model-free control setting, the objective is to learn an optimal policy
π∗ without access to the environment’s model. Unlike prediction, where the
policy is fixed, control requires both evaluating the current policy and improv-
ing it, based solely on sampled experience. Monte Carlo control achieves this
by estimating the action-value function qπ (s, a) from episodes and using it to
iteratively refine the policy.

Exploring Starts (ES)


A straightforward approach assumes exploring starts, where each episode begins
from a randomly chosen state-action pair. This ensures that all state-action
pairs are visited infinitely often, allowing unbiased estimation of returns.
The action-value function is estimated using sample averages:
Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{N(s, a)} G_i

where Gi is the return following the i-th visit to (s, a), and N (s, a) is the visit
count.
The policy is then improved by acting greedily with respect to Q:

\pi(s) = \arg\max_a Q(s, a)

This forms a Generalized Policy Iteration (GPI) loop using Monte Carlo
methods for both evaluation and improvement.

ε-Greedy Monte Carlo Control


Since exploring starts is often unrealistic in practice, a more general approach
ensures exploration using an ε-greedy policy. The policy selects the action with
the highest estimated value most of the time, but with a small probability ε, it
explores:

\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A(s)|} & \text{if } a = \arg\max_{a'} Q(s, a') \\[4pt] \dfrac{\varepsilon}{|A(s)|} & \text{otherwise} \end{cases}

Then, for any state s, the value of the ε-greedy policy π that is greedy with
respect to the updated Q is at least as great as that of any previous ε-greedy
policy. This guarantees continual exploration, ensuring that all state-action
pairs are sampled infinitely often, allowing value estimates to improve over time.
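
A compact Python sketch of ε-greedy action selection from a tabular Q, as used in this control loop; the helper name and the Q layout (keyed by (state, action) pairs) are assumptions:

import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """With probability eps pick uniformly at random, otherwise pick the greedy action;
    overall the greedy action receives probability 1 - eps + eps/|A(s)|."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))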

Proof
Let a∗ = arg maxa Q(s, a) be the greedy action, and let π ′ be any previous
ε-greedy policy (not necessarily greedy w.r.t. the current Q).
We compare the expected value under both policies at state s:

\sum_{a} \pi(a \mid s)\, Q(s, a) = \Bigl( 1 - \varepsilon + \frac{\varepsilon}{|A(s)|} \Bigr) Q(s, a^*) + \sum_{a \neq a^*} \frac{\varepsilon}{|A(s)|}\, Q(s, a)

So:
\sum_{a} \pi(a \mid s)\, Q(s, a) = (1 - \varepsilon)\, Q(s, a^*) + \frac{\varepsilon}{|A(s)|} \sum_{a} Q(s, a)

This is a convex combination of the maximum value Q(s, a∗ ) and the average
value over all actions. Since Q(s, a∗ ) ≥ Q(s, a) for all a, the weighted sum is
guaranteed to be:
\sum_{a} \pi(a \mid s)\, Q(s, a) \;\geq\; \sum_{a} \pi'(a \mid s)\, Q(s, a)

as long as π′ places no more probability on a∗ than π does, which holds by
construction when π is the ε-greedy policy with respect to the newly improved Q.

SARSA: On-Policy Temporal-Difference Control


SARSA (State–Action–Reward–State–Action) is an on-policy Temporal-Difference
(TD) control algorithm used to learn the optimal policy π∗ without requiring
a model of the environment. Unlike Monte Carlo methods, SARSA updates
the action-value function incrementally at each time step using bootstrapped
estimates, making it suitable for online learning and continuing tasks.

Update Rule
At time step t, SARSA observes the current state St , selects action At using
an ε-greedy behavior policy π (i.e., greedy with respect to Q with probability
1 − ε, and random otherwise), receives reward Rt+1 , transitions to the next

state St+1, and selects the next action At+1 using the same ε-greedy policy. The
action-value function is then updated as:

Q(St , At ) ← Q(St , At ) + α [Rt+1 + γQ(St+1 , At+1 ) − Q(St , At )]

This is a one-step TD update where the target is constructed using the action
actually taken under the current behavior policy. As a result, SARSA learns
the action-value function corresponding to the same ε-greedy policy it follows,
making it a truly on-policy algorithm. This ensures that all actions are sampled
infinitely often while favoring those with higher estimated value, supporting
both exploration and exploitation.
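
A sketch of the core SARSA update for one transition, assuming a Q-table keyed by (state, action) pairs as in the earlier illustrative helpers:

def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9, done=False):
    """One SARSA update: the target uses the action actually chosen in the next state."""
    target = reward + (0.0 if done else gamma * Q.get((s_next, a_next), 0.0))
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))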

Comparison to Monte Carlo Control


Unlike Monte Carlo control, which updates only at the end of an episode using
the full return Gt , SARSA performs updates at each time step using the TD
error:

δt = Rt+1 + γQ(St+1 , At+1 ) − Q(St , At )


This makes SARSA more sample-efficient and suitable for situations where
episodes are long or non-terminating. SARSA also incorporates bootstrapping,
which enables faster propagation of reward information but introduces addi-
tional bias.
SARSA provides an elegant and practical on-policy control algorithm that
blends the benefits of bootstrapping, online updates, and exploratory policy im-
provement. It is particularly effective in scenarios where balancing exploration
and exploitation during learning is crucial.

Convergence Conditions for On-Policy Control (MC and SARSA)
Let πt be a sequence of ε-greedy policies with respect to the evolving action-
value estimates Qt . Then, under the following conditions:

1. Every state–action pair (s, a) is visited infinitely often.

2. The step-size parameters αt (s, a) satisfy:



\sum_{t=1}^{\infty} \alpha_t(s, a) = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2(s, a) < \infty

3. The sequence of policies πt remains sufficiently exploratory, such as main-
   taining a fixed ε > 0 or slowly decaying εt.

Then the action-value estimates converge to the optimal values:

\lim_{t \to \infty} Q_t(s, a) = q_*(s, a)

and the sequence of policies πt converges to the optimal policy π∗ .


Motivated by the same heuristic of generalising the one-step bootstrapped target
to n-step returns and combining them to obtain a favourable bias-variance
tradeoff, we arrive at the SARSA(λ) algorithm.

SARSA(λ): On-Policy TD(λ) Control


Eligibility Traces and Update Rule
The forward view of SARSA(λ) defines the λ-return as a weighted average of
multi-step SARSA returns. At time step t, the n-step return is defined as:
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})

The λ-return is then given by:

G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}

This formulation smoothly interpolates between short-term bootstrapped


estimates and long-term Monte Carlo returns. However, because it requires
knowledge of future rewards and actions, it is not suitable for online learning.
In practice, SARSA(λ) is implemented using the backward view with eligibility
traces, which yields the same total update as the forward view in episodic tasks
under batch updates.
Instead of updating only the most recent state–action pair (St , At ), SARSA(λ)
maintains an eligibility trace Et (s, a) for every pair (s, a). The traces are up-
dated at each time step as follows:

Et (s, a) = γλEt−1 (s, a) + 1{(s,a)=(St ,At )}


At each time step t, the TD error is computed:

δt = Rt+1 + γQ(St+1 , At+1 ) − Q(St , At )


Then the Q-values are updated for all (s, a) as:

Q(s, a) ← Q(s, a) + α δt Et (s, a)


This backward view of TD(λ) assigns credit to all recently visited state–action
pairs in proportion to how recently and frequently they were visited, enabling
more efficient learning.
• When λ = 0, SARSA(λ) reduces to standard one-step SARSA.

• When λ → 1, it behaves more like Monte Carlo methods with full episode returns.

• Intermediate values of λ provide a trade-off between bias and variance in the
  learning process.

SARSA(λ) is typically used with an ε-greedy policy for exploration and con-
verges under the standard conditions, with the exploration parameter decaying
slowly.
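
Finally, a sketch of one backward-view SARSA(λ) step with accumulating eligibility traces over state-action pairs; as elsewhere, the representation is an assumption made for illustration:

def sarsa_lambda_step(Q, E, s, a, reward, s_next, a_next,
                      alpha=0.1, gamma=0.9, lam=0.8, done=False):
    """One SARSA(lambda) step: compute the TD error, update traces, then update all Q-values."""
    delta = reward + (0.0 if done else gamma * Q.get((s_next, a_next), 0.0)) - Q.get((s, a), 0.0)
    # Decay all traces and accumulate the trace of the pair just visited.
    for key in list(E):
        E[key] *= gamma * lam
    E[(s, a)] = E.get((s, a), 0.0) + 1.0
    # Move every traced state-action value in proportion to its eligibility.
    for key, e in E.items():
        Q[key] = Q.get(key, 0.0) + alpha * delta * e
    return delta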
