
Midterm Report

Summer of Science
Reinforcement Learning (Code: CS36)

Submitted by
Yash Vardhan
Roll No: 24B3319

Under the mentorship of


Shreeya Choudhary

June 17, 2025


N-Armed Bandit Problem
The N-armed bandit problem is a fundamental setting in reinforcement learning
that exemplifies the dilemma of decision-making under uncertainty. It models a
situation where an agent must repeatedly choose from among N different options
(or actions), each yielding stochastic rewards, with the objective of maximizing
cumulative reward over time.

Problem Setting
Consider N distinct actions (arms of a slot machine), each associated with an
unknown but fixed reward distribution. At each time step t, the agent selects
an action At ∈ {1, 2, . . . , N } and receives a reward Rt ∼ PAt drawn from the
probability distribution corresponding to that arm.
We assume:
• Each arm a has an expected reward q∗ (a) = E[Rt |At = a].
• These reward distributions are stationary, i.e., they do not change over
time.

The agent’s goal is to maximize the expected cumulative reward over T time
steps:

\max \; \mathbb{E}\left[ \sum_{t=1}^{T} R_t \right]

Estimating Action Values


Since the true action values q∗(a) are unknown, the agent maintains estimates
Qt(a) of the expected reward of each action. A natural estimate is the sample
average:

Q_t(a) = \frac{\sum_{i=1}^{t} \mathbb{1}_{A_i = a} \, R_i}{\sum_{i=1}^{t} \mathbb{1}_{A_i = a}}

Law of Large Numbers


According to the Law of Large Numbers, if each action is sampled sufficiently
many times, its sample average will converge to the true expected reward:

\lim_{n \to \infty} Q_n(a) = q_*(a)

This justifies the use of sample averages for value estimation. However, it
also emphasizes that each action must be sampled multiple times, which
necessitates exploration. This idea is used heavily later, in environments where
the reward distributions are unknown or non-stationary.

Greedy Action Selection
The greedy method always selects the action with the highest current esti-
mated reward:
A_t = \arg\max_a Q_t(a)

This purely exploitative approach can lead to suboptimal behavior, especially
if the initial estimates are inaccurate. Once a suboptimal action is favored
early, it may continue to dominate due to lack of exploration, preventing
discovery of better actions.

Epsilon-Greedy Method
To address the shortcomings of the greedy method, the ε-greedy method
introduces a controlled degree of randomness:

• With probability ε, select a random action (exploration).


• With probability 1 − ε, select the greedy action (exploitation).

This ensures continual exploration, thereby avoiding premature convergence
to suboptimal actions and allowing Qt(a) to converge to q∗(a) for all a.

Incremental Updates
Rather than storing all past rewards, one can efficiently compute the sample
average using the following incremental update rule:
Q_{n+1}(a) = Q_n(a) + \frac{1}{n}\bigl(R_n - Q_n(a)\bigr)
Here, Qn (a) is the estimate after n selections of action a, and Rn is the most
recent observed reward. This approach:

• Requires only the current estimate and the latest reward.


• Ensures convergence to q∗ (a) under stationarity.

• Automatically decreases the learning rate over time.

A more general form uses a fixed step-size parameter α:

Qn+1 (a) = Qn (a) + α (Rn − Qn (a))

This is particularly useful in non-stationary settings, where recent rewards
should be weighted more heavily than those from the distant past. The step-size
parameter can also be varied from step to step; however, convergence is not
necessarily guaranteed in that case.
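
To make these mechanics concrete, here is a minimal Python sketch (illustrative only, not part of the original report) of an ε-greedy agent that uses the incremental sample-average update above; the class and method names are assumptions.

import random

class EpsilonGreedyBandit:
    """Minimal sketch: epsilon-greedy selection with incremental sample-average updates."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.Q = [0.0] * n_arms   # value estimates Q(a)
        self.N = [0] * n_arms     # selection counts N(a)

    def select_action(self):
        # Explore with probability epsilon, otherwise exploit the current estimates.
        if random.random() < self.epsilon:
            return random.randrange(len(self.Q))
        return max(range(len(self.Q)), key=lambda a: self.Q[a])

    def update(self, action, reward):
        # Incremental sample-average update: Q <- Q + (1/N)(R - Q).
        self.N[action] += 1
        self.Q[action] += (reward - self.Q[action]) / self.N[action]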

Optimistic Initial Values
Another simple yet effective method to encourage exploration in the N-armed
bandit setting is the use of optimistic initial values. Instead of initializing the
estimated values Q0 (a) for each action to zero or some neutral default, we assign
them large, unrealistically high values. As a result, the agent initially assumes
that all actions are highly rewarding. This forces the agent to try each action
at least once, thereby achieving a form of structured exploration, even when
using a purely greedy policy. Over time, as the estimates adjust based on the
actual observed rewards, the agent naturally shifts toward exploiting the best-
performing action. While simple to implement, this method can be insufficient
in non-stationary environments.

Upper Confidence Bound (UCB)


A more mathematically grounded solution to the exploration-exploitation trade-
off is the Upper Confidence Bound (UCB) action selection strategy. This method
augments each action’s estimated value with a confidence interval that accounts
for the uncertainty due to limited sampling. At time t, the agent selects the
action At as follows:

A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]

Here, Qt (a) is the estimated value of action a, Nt (a) is the number of times
action a has been selected so far, t is the current time step, and c is a tunable
parameter that controls the level of exploration. The second term in the expres-
sion incentivizes actions that have been tried less frequently, thereby increasing
their likelihood of being selected. As more information is gathered, the uncer-
tainty term diminishes, and the agent increasingly favors actions with genuinely
high estimated rewards. However, its effectiveness depends on the appropriate
tuning of the exploration coefficient c.
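
As an illustration of the selection rule (under the assumption that untried arms are played first so that Nt(a) > 0), a short Python sketch might look like this; the function name and signature are hypothetical.

import math

def ucb_select(Q, N, t, c=2.0):
    """Sketch of UCB action selection: Q and N are per-arm value estimates and counts."""
    # Play every arm at least once so the confidence term is well defined.
    for a in range(len(Q)):
        if N[a] == 0:
            return a
    # Otherwise pick the arm maximizing Q(a) + c * sqrt(ln t / N(a)).
    return max(range(len(Q)), key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))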

Gradient Bandit Methods


In contrast to value-based approaches, gradient bandit methods learn preferences
over actions directly and optimize the policy by ascending the expected reward
gradient. Let Ht (a) denote the agent’s preference for action a at time t. These
preferences are converted into a probability distribution over actions using the
softmax function:
P_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{N} e^{H_t(b)}}
At each time step, an action is chosen according to this distribution, and the
preferences are updated in a direction that increases the probability of rewarding
actions. After selecting action At and receiving reward Rt , the update rule

becomes:

H_{t+1}(a) = \begin{cases} H_t(a) + \alpha (R_t - \bar{R}_t)\,(1 - P_t(a)) & \text{if } a = A_t \\ H_t(a) - \alpha (R_t - \bar{R}_t)\,P_t(a) & \text{if } a \neq A_t \end{cases}

Here, α is the step size, and R̄t is a baseline such as the average reward, used
to reduce variance in updates. The idea is to increase the preference for actions
that perform better than the baseline, and decrease it for others. This results in
a stochastic policy that gradually shifts toward favoring better actions without
deterministically eliminating suboptimal ones. Gradient bandit methods are
particularly suited to problems requiring a non-deterministic policy, and they
perform well in both stationary and non-stationary environments when tuned
appropriately. However, they can be sensitive to learning rates and may exhibit
instability if reward magnitudes are not normalized.
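
The following Python sketch illustrates one gradient-bandit step under the softmax policy and baseline-centred update described above; the names, and the assumption that the caller maintains the running average reward and supplies a reward-sampling function, are illustrative.

import math
import random

def softmax(H):
    # Convert preferences H(a) into a probability distribution P(a).
    m = max(H)  # subtract the max for numerical stability
    exps = [math.exp(h - m) for h in H]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_bandit_step(H, avg_reward, alpha, reward_fn):
    """One gradient-bandit step; reward_fn(a) samples a reward for action a (assumed helper)."""
    P = softmax(H)
    action = random.choices(range(len(H)), weights=P)[0]
    reward = reward_fn(action)
    # Increase preference for the chosen action if it beat the baseline, decrease the others.
    for a in range(len(H)):
        if a == action:
            H[a] += alpha * (reward - avg_reward) * (1 - P[a])
        else:
            H[a] -= alpha * (reward - avg_reward) * P[a]
    return action, reward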

Markov Decision Processes (MDPs)


To model sequential decision-making problems in a mathematically rigorous
way, reinforcement learning relies on the framework of Markov Decision Pro-
cesses (MDPs). An MDP provides a formalism for environments where out-
comes depend both on the agent’s actions and on stochastic transitions within
the environment.

Definition
A Markov Decision Process is formally defined by a tuple (S, A, P, R, γ), where:

• S: A set of possible states the environment can be in.


• A: A set of actions available to the agent.

• P (s′ | s, a): The transition probability — the probability of moving to
  state s′ after taking action a in state s.
• R(s, a): The reward function — the expected immediate reward re-
ceived when taking action a in state s.

• γ ∈ [0, 1]: The discount factor, determining the weight given to future
rewards.

At each discrete time step t, the agent observes the current state St , selects
an action At , receives a reward Rt+1 , and transitions to the next state St+1
according to the transition probability P (St+1 | St , At ). This interaction forms
a trajectory S0 , A0 , R1 , S1 , A1 , R2 , . . .

The Markov Property
The defining assumption of an MDP is the Markov property, which states that
the future is independent of the past given the present. Formally, the next state
and reward depend only on the current state and action:

P (St+1 = s′ , Rt+1 = r | St = s, At = a, history) = P (s′ , r | s, a)

This assumption allows for compact and recursive mathematical formulations,
including the Bellman equations.

Policy and Return


A policy π defines the agent’s behavior by specifying a mapping from states to
actions. It may be deterministic, where π(s) = a, or stochastic, where π(a | s)
defines the probability of selecting action a in state s.
The return at time t, denoted Gt , is the total discounted sum of future
rewards:
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

The objective of reinforcement learning is to find a policy that maximizes
the expected return from each state.

Value Functions
To evaluate how good it is to be in a certain state or to perform a certain action,
we define two fundamental value functions under a given policy π:
State-Value Function:

vπ (s) = Eπ [Gt | St = s]

This function gives the expected return when starting in state s and following
policy π thereafter.
Action-Value Function:

qπ (s, a) = Eπ [Gt | St = s, At = a]

This function gives the expected return when starting in state s, taking action
a, and subsequently following policy π.
These value functions are central to reinforcement learning and provide the
basis for most algorithmic approaches to policy evaluation and improvement.

Bellman Equation and Policy Evaluation


Starting from the definition of the state-value function under a policy π,

vπ (s) = Eπ [Gt | St = s] = Eπ [Rt+1 + γGt+1 | St = s] ,

we apply the law of iterated expectations and expand the return recursively:
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} P(s', r \mid s, a)\,\bigl[ r + \gamma v_\pi(s') \bigr].

This recursive relationship is known as the Bellman expectation equation.


We now turn to the problem of computing the value function vπ for a given
policy π, a process known as policy evaluation.

Policy Evaluation
Given a fixed policy π, policy evaluation is the process of computing the cor-
responding state-value function vπ (s). Since the Bellman expectation equation
defines a system of equations — one for each state — solving it analytically is
intractable in most environments of practical interest. Instead, we compute the
value function iteratively.
Starting from an initial guess v0 (s), we update the value function using the
Bellman equation as an update rule:
v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} P(s', r \mid s, a)\,\bigl[ r + \gamma v_k(s') \bigr]

This process is repeated until the values converge to within a desired thresh-
old. Under the condition γ < 1, the update is guaranteed to converge to the
true value function vπ .
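
A minimal Python sketch of iterative policy evaluation, assuming (purely for illustration) that the model is given as a dictionary P[s][a] of (probability, next_state, reward) triples and the policy as per-state action probabilities:

def policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
    """Apply the Bellman expectation update until values change by less than theta.

    P[s][a] is assumed to be a list of (prob, next_state, reward) triples;
    policy[s][a] is the probability of taking action a in state s.
    """
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V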

Policy Improvement
Once the value function vπ of a deterministic policy π has been computed, we
can attempt to construct a new deterministic policy that performs at least as
well, if not better. This process is known as policy improvement.
Using the current value function vπ , we define a new policy π ′ by acting
greedily with respect to it. For each state s, the improved policy is given by:
\pi'(s) = \arg\max_a \sum_{s', r} P(s', r \mid s, a)\,\bigl[ r + \gamma v_\pi(s') \bigr]

The policy improvement theorem guarantees that the new policy π′ is at
least as good as the original policy π, meaning vπ′(s) ≥ vπ(s) for all states s. If
the inequality is strict for at least one state, then π′ is strictly better than π.

Policy Improvement Theorem


The Policy Improvement Theorem provides a formal guarantee that acting
greedily with respect to the current value function will yield a policy that is
no worse, and possibly better, than the current one.
Let π and π ′ be two deterministic policies. Suppose that for all s ∈ S,
qπ (s, π ′ (s)) ≥ vπ (s).

Then, the new policy π ′ is at least as good as π:

vπ′ (s) ≥ vπ (s) for all s.

Moreover, if strict inequality holds for any state, i.e., if

qπ (s, π ′ (s)) > vπ (s) for some s,

then π ′ is strictly better than π, meaning

vπ′ (s) > vπ (s) for at least one s.

This result underpins the policy improvement step: it guarantees that iterative
procedures which alternate between evaluating a policy and improving it will
eventually converge to an optimal policy.

Dynamic Programming
Dynamic Programming (DP) refers to a class of algorithms that solve Markov
Decision Processes exactly by recursively computing value functions. These
methods require complete knowledge of the environment — specifically, the
transition probabilities P (s′ , r | s, a) and the reward function R(s, a).
At the heart of DP is the repeated application of the Bellman equations
to update value estimates. The key quantities being stored and updated are
the value functions themselves: either the state-value function v : S → R, or
the action-value function q : S × A → R. These are typically stored as arrays
indexed by state (and action) identifiers.
For finite MDPs with n states and m actions, the space complexity is O(n)
for state-value methods and O(nm) for action-value methods. Each iteration
(or sweep) involves computing expectations over all actions and all possible
next states and rewards, leading to a worst-case time complexity of O(n^2 m)
per iteration. Thus, a DP method is guaranteed to find an optimal policy in
polynomial time, even though the total number of (deterministic) policies is
m^n. However, DP is computationally feasible only for small to moderately sized
problems, and it requires knowledge of all the transition probabilities.

Policy Iteration
Policy iteration is an iterative algorithm that alternates between policy evalu-
ation and policy improvement. Starting from an arbitrary initial policy π0 , the
algorithm performs the following steps at each iteration k:

1. Policy Evaluation: Compute the value function vπk, typically using
   iterative updates.

2. Policy Improvement: Construct a new policy πk+1 by acting greedily
with respect to vπk :
\pi_{k+1}(s) = \arg\max_a \sum_{s', r} P(s', r \mid s, a)\,\bigl[ r + \gamma v_{\pi_k}(s') \bigr]

This procedure is repeated until the policy stabilizes, i.e., πk+1 = πk . The
Policy Improvement Theorem guarantees that each new policy is at least as
good as the previous one. For finite MDPs, the process terminates in a finite
number of iterations and returns the optimal policy π∗ .
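
Reusing the policy_evaluation sketch above and the same assumed model format, the policy-iteration loop might be sketched as follows (illustrative, not a prescribed implementation):

def policy_iteration(P, gamma=0.9):
    """Sketch of policy iteration: alternate evaluation and greedy improvement until stable."""
    # Start from an arbitrary deterministic policy (here: the first listed action per state).
    pi = {s: next(iter(P[s])) for s in P}
    while True:
        # Evaluate the current deterministic policy.
        policy = {s: {a: 1.0 if a == pi[s] else 0.0 for a in P[s]} for s in P}
        V = policy_evaluation(P, policy, gamma)
        # Improve greedily with respect to V.
        stable = True
        for s in P:
            best = max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            if best != pi[s]:
                pi[s] = best
                stable = False
        if stable:
            return pi, V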

Value Iteration
Value iteration simplifies policy iteration by combining the policy evaluation and
improvement steps into a single update rule. Instead of evaluating a policy to
completion, the value function is updated directly using the Bellman optimality
equation:
v_{k+1}(s) = \max_a \sum_{s', r} P(s', r \mid s, a)\,\bigl[ r + \gamma v_k(s') \bigr]

As the iterations proceed, vk converges to the optimal value function v∗ .


Once the value function has sufficiently converged, the corresponding optimal
policy is extracted as:
\pi_*(s) = \arg\max_a \sum_{s', r} P(s', r \mid s, a)\,\bigl[ r + \gamma v_*(s') \bigr]

Value iteration typically converges more quickly than policy iteration, espe-
cially in large state spaces where computing exact value functions for interme-
diate policies is computationally expensive. However for the intermediate vk ,
the existence of a corresponding πk is not necessarily guaranteed.
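
For comparison, a value-iteration sketch under the same assumed model representation, folding the max over actions directly into the backup and extracting a greedy policy at the end:

def value_iteration(P, gamma=0.9, theta=1e-6):
    """Sketch of value iteration using the Bellman optimality backup."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract a greedy policy from the (approximately) optimal value function.
    pi = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        for s in P
    }
    return pi, V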

Policy Evaluation in Unknown Environments


In many real-world scenarios, the environment’s dynamics — such as transition
probabilities and reward functions — are not known a priori. This makes it
impossible to compute the exact expectations in the Bellman equations, as re-
quired by Dynamic Programming methods. Instead, we must rely on experience
to estimate value functions, giving rise to a class of model-free approaches.
Policy evaluation in unknown environments involves estimating the value
function vπ (s) = Eπ [Gt | St = s] from sampled episodes generated by following
the policy π. Rather than using explicit knowledge of P (s′ , r | s, a), these meth-
ods estimate expected returns directly from observed state–reward trajectories.
Two primary families of methods fall under this framework:

• Monte Carlo (MC) Methods: These estimate value functions by aver-
aging complete returns observed at the end of episodes. They are simple,
intuitive, and require no bootstrapping, but they can only update values
after episodes terminate.
• Temporal-Difference (TD) Methods: These update value estimates
incrementally after each step using bootstrapped predictions. They are
online, efficient, and applicable to both episodic and continuing tasks.

Both approaches offer viable solutions to policy evaluation in the absence of
a model and form the foundation for modern reinforcement learning algorithms.

Monte Carlo Methods


Monte Carlo (MC) methods estimate value functions from sampled episodes of
interaction with the environment, without requiring any knowledge of transition
probabilities or reward functions. These methods are particularly effective when
the environment is unknown or analytically intractable.
MC methods operate under the assumption that episodes eventually termi-
nate. They compute value estimates by averaging the returns following visits to
each state, under a fixed policy π. Given that the return from time t is defined
as Gt = Rt+1 + γRt+2 + γ 2 Rt+3 + . . ., the goal is to approximate the expected
return:

v(s) = Eπ [Gt | St = s]

Estimating Value Using Visit Counts


Let N (s) denote the number of times state s has been visited across all sampled
episodes. Let G1 , G2 , . . . , GN (s) be the returns following each of these visits.
The Monte Carlo estimate of v(s) is the sample average:
V(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i
By the Law of Large Numbers, this empirical average converges to the true
expected return as the number of visits grows:
\lim_{N(s) \to \infty} \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i = v_\pi(s)

This provides the statistical foundation for Monte Carlo methods: with
enough data, value estimates converge almost surely to their true expectations.

Incremental Monte Carlo Updates
Rather than storing all past returns, we can update V (s) incrementally after
each visit using:
V(s) \leftarrow V(s) + \frac{1}{N(s)}\bigl(G - V(s)\bigr)
where G is the return following the current visit to s. This implementa-
tion maintains only the latest estimate and a visit count, and incorporates new
information as it arrives.
In non-stationary environments, or when memory is constrained, a constant
step-size version is preferred:

V (s) ← V (s) + α (G − V (s))


where α ∈ (0, 1] is a fixed learning rate. This update behaves like an ex-
ponential moving average, giving more weight to recent returns and enabling
continual learning.
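
A first-visit Monte Carlo prediction sketch in Python, assuming (for illustration) that each episode is supplied as a list of (state, reward) pairs where the reward at position t is Rt+1, received after leaving that state:

def mc_prediction(episodes, gamma=0.9):
    """First-visit Monte Carlo prediction with incremental sample-average updates."""
    V, N = {}, {}
    for episode in episodes:
        # Compute the return G_t for every step by scanning the episode backwards.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        # First-visit rule: only the first occurrence of a state in the episode counts.
        seen = set()
        for state, G in returns:
            if state in seen:
                continue
            seen.add(state)
            N[state] = N.get(state, 0) + 1
            v = V.get(state, 0.0)
            V[state] = v + (G - v) / N[state]
    return V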

Temporal-Difference Learning
Temporal-Difference (TD) learning is a class of model-free reinforcement learn-
ing methods that estimate value functions from raw experience by combining
key ideas from both Dynamic Programming and Monte Carlo methods.
Like Monte Carlo methods, TD learning does not require knowledge of the
environment’s transition probabilities or rewards. However, unlike Monte Carlo
methods, which update value estimates only at the end of an episode, TD learn-
ing updates estimates incrementally at every time step. This makes TD learning
suitable for both online learning and continuing tasks with no natural episode
boundaries.

TD(0)
The simplest TD method is known as TD(0). It performs a one-step update
using the observed reward and the estimated value of the next state. Given a
state St , reward Rt+1 , and next state St+1 , the value function is updated as:

V (St ) ← V (St ) + α [Rt+1 + γV (St+1 ) − V (St )]


This is known as a bootstrapped update, as it relies on the current estimate
V (St+1 ) to improve the estimate of V (St ), rather than waiting for the full return.
The term inside the brackets is called the TD error:

δt = Rt+1 + γV (St+1 ) − V (St )

The TD error measures the discrepancy between the current estimate and a
better-informed one based on a single step of experience.
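
A minimal sketch of the tabular TD(0) update for a single observed transition; the function name and the dictionary-based value table are assumptions made for illustration:

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9, done=False):
    """One TD(0) update for an observed transition (S_t, R_{t+1}, S_{t+1})."""
    # Terminal states contribute no bootstrapped value.
    target = reward + (0.0 if done else gamma * V.get(next_state, 0.0))
    td_error = target - V.get(state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * td_error
    return td_error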

Comparison with Monte Carlo and Dynamic Programming
TD(0) can be viewed as a hybrid between Monte Carlo methods and Dynamic
Programming:

• Monte Carlo methods compute updates based on the complete return Gt
  and wait for episode termination.
• Dynamic Programming methods compute exact expectations using the
environment model.
• TD(0) uses samples from experience and bootstraps by estimating the
return with Rt+1 + γV (St+1 ).

Thus, TD(0) is both model-free and incremental, combining the data efficiency
of DP with the model-free flexibility of MC: unlike DP, it works in an unknown
environment, yet it still exploits the Markov property by bootstrapping on its
own one-step estimates.

Convergence and Bias–Variance Tradeoff


Under standard assumptions — such as sufficient exploration of the state space
and a diminishing step size schedule — TD(0) converges to the true value func-
tion vπ .
Compared to Monte Carlo methods, TD(0) typically has lower variance
because it uses only a single-step reward and an estimated value. However,
it introduces bias due to bootstrapping. This bias–variance tradeoff makes
TD(0) particularly effective in many online learning scenarios and practical RL
applications.

TD(λ)
While TD(0) updates value estimates using a one-step bootstrapped return
and Monte Carlo methods rely on complete returns, TD(λ) provides a unified
framework that generalizes both. It introduces a trace-decay parameter λ ∈
[0, 1] that determines the depth of bootstrapping used in value estimation.
When λ = 0, TD(λ) reduces to TD(0); when λ = 1, it approaches the
behavior of Monte Carlo methods. Intermediate values yield algorithms that
balance bias and variance by blending multiple n-step returns.

Forward View: λ-Return


The forward view of TD(λ) defines the λ-return as an exponentially-weighted
average of n-step returns:

G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}

Here, the n-step return G_t^{(n)} is given by:

G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})
The λ-return smoothly interpolates between the immediate bootstrapped
target and the full Monte Carlo return, allowing credit assignment over varying
time scales.

Backward View: Eligibility Traces


The forward view is not suitable for online learning, since it requires access
to future rewards. The backward view, on the other hand, enables real-time
updates using eligibility traces, which track how recently and frequently each
state has been visited.
At each time step t, the eligibility trace for state s is updated as:

Et (s) = γλEt−1 (s) + 1{St =s}


The value function is then updated for all states as:

V (s) ← V (s) + αδt Et (s)


where δt = Rt+1 + γV (St+1 ) − V (St ) is the TD error.
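
A sketch of one backward-view TD(λ) step with accumulating traces kept in a dictionary; names and structure are illustrative:

def td_lambda_step(V, E, state, reward, next_state, alpha=0.1, gamma=0.9, lam=0.8, done=False):
    """One backward-view TD(lambda) step: compute the TD error, decay all traces,
    bump the trace of the visited state, then update every traced state."""
    delta = reward + (0.0 if done else gamma * V.get(next_state, 0.0)) - V.get(state, 0.0)
    # Decay existing traces and accumulate the trace of the state just visited.
    for s in list(E):
        E[s] *= gamma * lam
    E[state] = E.get(state, 0.0) + 1.0
    # Move every state's value in proportion to its eligibility.
    for s, e in E.items():
        V[s] = V.get(s, 0.0) + alpha * delta * e
    return delta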

Equivalence of Forward and Backward Views


The forward and backward views of TD(λ) are fundamentally equivalent in
terms of the total update applied to the value function over a complete episode.
This result is formalized by the following theorem:

Theorem (Equivalence of Forward and Backward Views)


Let ∆V forward (s) denote the total update to the value function from the forward
view using λ-returns:
\Delta V^{\text{forward}}(s) = \alpha \sum_{t=0}^{T-1} \mathbb{1}_{\{S_t = s\}} \bigl( G_t^{\lambda} - V(S_t) \bigr)
Let ∆V^{backward}(s) denote the corresponding update from the backward view
using eligibility traces and TD errors:

\Delta V^{\text{backward}}(s) = \alpha \sum_{t=0}^{T-1} \delta_t E_t(s)
Then, under batch updates over a finite episode of length T ,

\Delta V^{\text{forward}}(s) = \Delta V^{\text{backward}}(s) \quad \text{for all } s \in S


This theoretical equivalence justifies the use of the backward view in practice,
as it permits efficient, online, and incremental learning.

Bias–Variance Tradeoff and Convergence
TD(λ) converges to the true value function under similar conditions to TD(0),
such as sufficient exploration and appropriate step-size schedules. The param-
eter λ offers fine-grained control over the bias–variance tradeoff: smaller values
yield faster but more biased estimates, while larger values yield more accurate
but higher-variance estimates.
By interpolating between TD and Monte Carlo, combined with its efficient
backward implementation, TD(λ) serves as one of the most powerful and gen-
eral prediction algorithms in reinforcement learning.

Monte Carlo Control


In the model-free control setting, the objective is to learn an optimal policy
π∗ without access to the environment’s model. Unlike prediction, where the
policy is fixed, control requires both evaluating the current policy and improv-
ing it, based solely on sampled experience. Monte Carlo control achieves this
by estimating the action-value function qπ (s, a) from episodes and using it to
iteratively refine the policy.

Exploring Starts (ES)


A straightforward approach assumes exploring starts, where each episode begins
from a randomly chosen state-action pair. This ensures that all state-action
pairs are visited infinitely often, allowing unbiased estimation of returns.
The action-value function is estimated using sample averages:
Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{N(s, a)} G_i

where Gi is the return following the i-th visit to (s, a), and N (s, a) is the visit
count.
The policy is then improved by acting greedily with respect to Q:

\pi(s) = \arg\max_a Q(s, a)

This forms a Generalized Policy Iteration (GPI) loop using Monte Carlo
methods for both evaluation and improvement.

ε-Greedy Monte Carlo Control


Since exploring starts is often unrealistic in practice, a more general approach
ensures exploration using an ε-greedy policy. The policy selects the action with
the highest estimated value most of the time, but with a small probability ε, it
explores:

\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A(s)|} & \text{if } a = \arg\max_{a'} Q(s, a') \\[4pt] \dfrac{\varepsilon}{|A(s)|} & \text{otherwise} \end{cases}

Then, for any state s, the value of the ε-greedy policy π that is greedy with
respect to the updated Q is at least as great as that of any previous ε-greedy
policy. This guarantees continual exploration, ensuring that all state-action
pairs are sampled infinitely often, allowing value estimates to improve over time.
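
A compact Python sketch of ε-greedy action selection from a tabular Q, as used in this control loop; the helper name and the Q layout (keyed by (state, action) pairs) are assumptions:

import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """With probability eps pick uniformly at random, otherwise pick the greedy action;
    overall the greedy action receives probability 1 - eps + eps/|A(s)|."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))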

Proof
Let a∗ = arg maxa Q(s, a) be the greedy action, and let π ′ be any previous
ε-greedy policy (not necessarily greedy w.r.t. the current Q).
We compare the expected value under both policies at state s:

\sum_{a} \pi(a \mid s)\, Q(s, a) = \Bigl( 1 - \varepsilon + \frac{\varepsilon}{|A(s)|} \Bigr) Q(s, a^*) + \sum_{a \neq a^*} \frac{\varepsilon}{|A(s)|}\, Q(s, a)

So:
\sum_{a} \pi(a \mid s)\, Q(s, a) = (1 - \varepsilon)\, Q(s, a^*) + \frac{\varepsilon}{|A(s)|} \sum_{a} Q(s, a)

This is a convex combination of the maximum value Q(s, a∗ ) and the average
value over all actions. Since Q(s, a∗ ) ≥ Q(s, a) for all a, the weighted sum is
guaranteed to be:
\sum_{a} \pi(a \mid s)\, Q(s, a) \;\geq\; \sum_{a} \pi'(a \mid s)\, Q(s, a)

as long as π′ places no more probability on a∗ than π does, which holds by
construction when π is the ε-greedy policy with respect to the newly improved Q.

SARSA: On-Policy Temporal-Difference Control


SARSA (State–Action–Reward–State–Action) is an on-policy Temporal-Difference
(TD) control algorithm used to learn the optimal policy π∗ without requiring
a model of the environment. Unlike Monte Carlo methods, SARSA updates
the action-value function incrementally at each time step using bootstrapped
estimates, making it suitable for online learning and continuing tasks.

Update Rule
At time step t, SARSA observes the current state St , selects action At using
an ε-greedy behavior policy π (i.e., greedy with respect to Q with probability
1 − ε, and random otherwise), receives reward Rt+1 , transitions to the next

state St+1, and selects the next action At+1 using the same ε-greedy policy. The
action-value function is then updated as:

Q(St , At ) ← Q(St , At ) + α [Rt+1 + γQ(St+1 , At+1 ) − Q(St , At )]

This is a one-step TD update where the target is constructed using the action
actually taken under the current behavior policy. As a result, SARSA learns
the action-value function corresponding to the same ε-greedy policy it follows,
making it a truly on-policy algorithm. This ensures that all actions are sampled
infinitely often while favoring those with higher estimated value, supporting
both exploration and exploitation.
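
A sketch of the core SARSA update for one transition, assuming a Q-table keyed by (state, action) pairs as in the earlier illustrative helpers:

def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9, done=False):
    """One SARSA update: the target uses the action actually chosen in the next state."""
    target = reward + (0.0 if done else gamma * Q.get((s_next, a_next), 0.0))
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))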

Comparison to Monte Carlo Control


Unlike Monte Carlo control, which updates only at the end of an episode using
the full return Gt , SARSA performs updates at each time step using the TD
error:

δt = Rt+1 + γQ(St+1 , At+1 ) − Q(St , At )


This makes SARSA more sample-efficient and suitable for situations where
episodes are long or non-terminating. SARSA also incorporates bootstrapping,
which enables faster propagation of reward information but introduces addi-
tional bias.
SARSA provides an elegant and practical on-policy control algorithm that
blends the benefits of bootstrapping, online updates, and exploratory policy im-
provement. It is particularly effective in scenarios where balancing exploration
and exploitation during learning is crucial.

Convergence Conditions for On-Policy Control (MC and SARSA)
Let πt be a sequence of ε-greedy policies with respect to the evolving action-
value estimates Qt . Then, under the following conditions:

1. Every state–action pair (s, a) is visited infinitely often.

2. The step-size parameters αt (s, a) satisfy:



\sum_{t=1}^{\infty} \alpha_t(s, a) = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2(s, a) < \infty

3. The sequence of policies πt remains sufficiently exploratory, such as main-
   taining a fixed ε > 0 or slowly decaying εt.

Then the action-value estimates converge to the optimal values:

\lim_{t \to \infty} Q_t(s, a) = q_*(s, a)

and the sequence of policies πt converges to the optimal policy π∗ .


Motivated by the same heuristic of generalising the one-step bootstrapped target
to n-step returns and combining them to obtain a favourable bias-variance
tradeoff, we arrive at the SARSA(λ) algorithm.

SARSA(λ): On-Policy TD(λ) Control


Eligibility Traces and Update Rule
The forward view of SARSA(λ) defines the λ-return as a weighted average of
multi-step SARSA returns. At time step t, the n-step return is defined as:
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})

The λ-return is then given by:

G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}

This formulation smoothly interpolates between short-term bootstrapped


estimates and long-term Monte Carlo returns. However, because it requires
knowledge of future rewards and actions, it is not suitable for online learning.
In practice, SARSA(λ) is implemented using the backward view with eligibility
traces, which yields the same total update as the forward view in episodic tasks
under batch updates.
Instead of updating only the most recent state–action pair (St , At ), SARSA(λ)
maintains an eligibility trace Et (s, a) for every pair (s, a). The traces are up-
dated at each time step as follows:

Et (s, a) = γλEt−1 (s, a) + 1{(s,a)=(St ,At )}


At each time step t, the TD error is computed:

δt = Rt+1 + γQ(St+1 , At+1 ) − Q(St , At )


Then the Q-values are updated for all (s, a) as:

Q(s, a) ← Q(s, a) + α δt Et (s, a)


This backward view of TD(λ) assigns credit to all recently visited state–action
pairs in proportion to how recently and frequently they were visited, enabling
more efficient learning.
• When λ = 0, SARSA(λ) reduces to standard one-step SARSA.

• When λ → 1, it behaves more like Monte Carlo methods with full episode returns.

• Intermediate values of λ provide a trade-off between bias and variance in the
  learning process.

SARSA(λ) is typically used with an ε-greedy policy for exploration and con-
verges under the standard conditions, with the exploration parameter decaying
slowly.
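
Finally, a sketch of one backward-view SARSA(λ) step with accumulating eligibility traces over state-action pairs; as elsewhere, the representation is an assumption made for illustration:

def sarsa_lambda_step(Q, E, s, a, reward, s_next, a_next,
                      alpha=0.1, gamma=0.9, lam=0.8, done=False):
    """One SARSA(lambda) step: compute the TD error, update traces, then update all Q-values."""
    delta = reward + (0.0 if done else gamma * Q.get((s_next, a_next), 0.0)) - Q.get((s, a), 0.0)
    # Decay all traces and accumulate the trace of the pair just visited.
    for key in list(E):
        E[key] *= gamma * lam
    E[(s, a)] = E.get((s, a), 0.0) + 1.0
    # Move every traced state-action value in proportion to its eligibility.
    for key, e in E.items():
        Q[key] = Q.get(key, 0.0) + alpha * delta * e
    return delta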
