1 Introduction
Reinforcement Learning (RL) is a subfield of artificial intelligence concerned with how agents
ought to take actions in an environment to maximize some notion of cumulative reward.
Unlike supervised learning, where the agent is provided with correct input-output pairs, RL
involves learning from the consequences of actions in a trial-and-error fashion. This process
is akin to how humans and animals learn from interactions with their environment.
The history of RL is rich and multifaceted, drawing from disciplines such as control
theory, psychology, neuroscience, and computer science. This paper traces the chronological
development of RL, highlighting key milestones and the mathematical underpinnings that
have driven progress in the field.
• Section 3 discusses the emergence of model-free methods like Q-Learning and Temporal Difference Learning (1980-2000).
2.1 Markov Decision Processes (MDPs)
2.1.1 Historical Context
The concept of Markov Decision Processes (MDPs) was introduced by Bellman [1957] as
a mathematical framework for modeling decision-making in situations where outcomes are
partly random and partly under the control of a decision-maker. MDPs provide a formalization that is essential for understanding and developing RL algorithms.
Formally, an MDP is defined by a tuple (S, A, P, R, γ), where:
• S: The set of states.
• A: The set of actions available to the agent.
• P(s′|s, a): The transition probability function, specifying the probability of transitioning from state s to state s′ given action a.
• R(s, a, s′): The reward function, giving the immediate reward received after transitioning from s to s′ via action a.
• γ ∈ [0, 1]: The discount factor, representing the importance of future rewards compared
to immediate rewards.
Markov Property An essential feature of MDPs is the Markov property, which states that the future is independent of the past given the present state and action. Mathematically, this is expressed as:

P(s_{t+1} = s' \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} = s' \mid s_t, a_t)

This property simplifies the analysis and computation of optimal policies, as it allows us to consider only the current state and action when making decisions.
The agent’s goal in an MDP is to find a policy π : S → A that maximizes the expected
cumulative discounted reward, also known as the return. The return Gt at time t is defined
as:
G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}
Policy A policy π(a|s) is a mapping from states to probabilities of selecting each possible action. A policy can be deterministic, selecting a single action in each state with certainty, or stochastic, selecting actions according to a probability distribution.
Dynamic programming (DP) is a method for solving complex problems by breaking them
down into simpler subproblems. In the context of RL, DP provides a suite of algorithms for
computing optimal policies given a perfect model of the environment as an MDP.
The state-value function V π (s) under a policy π gives the expected return starting from
state s:
" ∞
#
X
V π (s) = Eπ Gt st = s = Eπ γ k rt+k+1 st = s
k=0
This function quantifies the long-term value of a state when following policy π.
2.2.3 Bellman Expectation Equation for V π(s)
Derivation The derivation of the Bellman equation relies on the law of total expectation
and the Markov property:
\begin{aligned}
V^\pi(s) &= \mathbb{E}_\pi\left[ r_{t+1} + \gamma G_{t+1} \mid s_t = s \right] \\
&= \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a)\left( R(s,a,s') + \gamma\, \mathbb{E}_\pi\left[ G_{t+1} \mid s_{t+1} = s' \right] \right) \\
&= \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a)\left( R(s,a,s') + \gamma V^\pi(s') \right)
\end{aligned}
Interpretation The equation states that the value of a state under policy π is equal to the
expected immediate reward plus the expected discounted value of the next state, averaged
over all possible actions and next states.
2.2.4 Action-Value Function
The action-value function Qπ(s, a) provides the expected return starting from state s, taking
action a, and thereafter following policy π:
Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid s_t = s,\; a_t = a \right]
Significance The action-value function is crucial for selecting optimal actions, as it quantifies the expected return of taking a specific action in a given state.
2.2.6 Bellman Optimality Equations
Optimal State-Value Function
V^*(s) = \max_{a} \sum_{s'} P(s'|s,a)\left[ R(s,a,s') + \gamma V^*(s') \right]
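Optimal Action-Value Function For completeness, the standard optimality equation for the action-value function, stated in the same notation, is:

Q^*(s, a) = \sum_{s'} P(s'|s,a)\left[ R(s,a,s') + \gamma \max_{a'} Q^*(s', a') \right]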
Derivation These equations arise from the principle of optimality, which states that any
subsequence of optimal actions must itself be optimal. The optimal value functions satisfy
these equations, providing a recursive means to compute them.
2.3 Solution Methods: Value Iteration and Policy Iteration
2.3.1 Value Iteration
Value iteration is an algorithm that iteratively applies the Bellman optimality equation to
converge to the optimal value function V ∗ (s).
Algorithm Steps
1. Initialize V (s) arbitrarily (e.g., to zero) for all states.
2. For each state s, update V (s) using the Bellman optimality equation above.
3. Repeat step 2 until the maximum change in V across states falls below a chosen threshold.
4. Extract the optimal policy by acting greedily with respect to the converged value function.
Convergence Under standard assumptions (e.g., finite state and action spaces), value
iteration converges to V ∗ (s) as the number of iterations approaches infinity.
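To make the procedure concrete, the following minimal sketch (illustrative only, not code from the original text) runs tabular value iteration on an MDP supplied as transition and reward arrays; the (S, A, S′) array layout, discount, and tolerance are assumptions made for the example.

import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration.

    P: array of shape (S, A, S'), P[s, a, s'] = transition probability.
    R: array of shape (S, A, S'), R[s, a, s'] = immediate reward.
    Returns the optimal value function V and a greedy policy.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        Q = np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    # One final backup to obtain a greedy policy consistent with the converged V.
    Q = np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])
    policy = Q.argmax(axis=1)
    return V, policy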
2.3.2 Policy Iteration
Policy iteration involves two main steps: policy evaluation and policy improvement.
Policy Evaluation For a fixed policy π, compute the state-value function V π (s) by solving
the system of linear equations defined by the Bellman expectation equation.
Policy Improvement Update the policy by acting greedily with respect to V π (s):
\pi_{\text{new}}(s) = \arg\max_{a} \sum_{s'} P(s'|s,a)\left[ R(s,a,s') + \gamma V^\pi(s') \right]
Algorithm Steps
1. Initialize a policy π arbitrarily.
2. Policy evaluation: compute V π(s) for the current policy.
3. Policy improvement: update the policy greedily as above.
4. Repeat steps 2 and 3 until the policy no longer changes.
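A compact tabular sketch of policy iteration under the same assumed (S, A, S′) array layout used for value iteration above; the exact linear solve in the evaluation step is one possible choice.

import numpy as np

def policy_iteration(P, R, gamma=0.95):
    """Tabular policy iteration for an MDP given as arrays of shape (S, A, S')."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)          # start with an arbitrary policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
        P_pi = P[np.arange(n_states), policy]                          # (S, S')
        r_pi = np.sum(P_pi * R[np.arange(n_states), policy], axis=1)   # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to V.
        Q = np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy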
2.4 Limitations of Dynamic Programming
• Model Requirement: DP methods need a complete model of the transition probabilities and rewards.
• Computation: Each sweep iterates over the entire state space, which becomes prohibitive as the number of states grows (the curse of dimensionality).
• Storage: Storing value functions and policies for large state spaces can be impractical.
2.5 Example: Grid World Application
2.5.1 Scenario Description
Consider a grid world environment where an agent must navigate from a starting position
to a goal state. Each cell in the grid represents a state, and the agent can move in four
directions: up, down, left, and right.
2.5.2 MDP Formulation
• States (S): The cells of the grid.
• Actions (A): Moves up, down, left, and right.
• Transition Probabilities (P): Deterministic or stochastic movements.
• Rewards (R): Negative reward (cost) for each move; large positive reward for reaching
the goal.
2.5.3 Applying Value Iteration
By applying value iteration, we iteratively update the value function for each state until
convergence. This process yields the optimal value function V ∗ (s) and the corresponding
optimal policy.
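The following self-contained sketch applies value iteration to a deterministic 4x4 grid with a step cost of -1 and a goal in the top-right corner; the grid size, rewards, and discount factor are assumptions chosen purely for illustration.

import numpy as np

def grid_world_values(n=4, gamma=0.95, step_cost=-1.0, goal_reward=10.0, tol=1e-6):
    """Value iteration on an n x n deterministic grid; the goal is cell (0, n-1)."""
    goal = (0, n - 1)
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right
    V = np.zeros((n, n))
    while True:
        V_new = np.zeros_like(V)
        for i in range(n):
            for j in range(n):
                if (i, j) == goal:                   # terminal state keeps value 0
                    continue
                best = -np.inf
                for di, dj in moves:
                    # Moves that would leave the grid keep the agent in place.
                    ni, nj = min(max(i + di, 0), n - 1), min(max(j + dj, 0), n - 1)
                    r = goal_reward if (ni, nj) == goal else step_cost
                    best = max(best, r + gamma * V[ni, nj])
                V_new[i, j] = best
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

print(np.round(grid_world_values(), 2))   # higher values cluster near the goal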
2.5.4 Visualization and Interpretation
The resulting value function can be visualized as a heatmap over the grid, indicating the desirability of each state. The optimal policy guides the agent towards the goal while minimizing costs.
2.5.5 Insights
This example illustrates how DP methods can be applied to solve MDPs in simple environments. It also highlights the challenges that arise when scaling to larger or more complex
environments.
Temporal Difference (TD) learning, introduced by Sutton [1988], combines ideas from Monte
Carlo methods and dynamic programming. It enables agents to learn directly from raw
experience without a model of the environment.
3.1.2 TD Prediction
TD prediction focuses on learning the value function V π (s) for a given policy π.
TD(0) Update Rule The simplest TD algorithm, TD(0), updates the value estimate
based on the observed reward and the estimated value of the next state:
V(s_t) \leftarrow V(s_t) + \alpha\left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]

where α is the learning rate. TD(0) converges to V π (s) under certain conditions, such as using a suitable learning rate
schedule and visiting all states sufficiently often.
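As an illustration, the sketch below runs TD(0) to evaluate a uniform-random policy on a small random-walk chain; the chain environment is a toy stand-in invented for this example, not part of the original text.

import numpy as np

def td0_prediction(n_states=5, episodes=1000, alpha=0.1, gamma=1.0, seed=0):
    """TD(0) evaluation of a uniform-random policy on a small random-walk chain.

    States 0..n_states-1; stepping left from state 0 or right from the last
    state ends the episode, with reward +1 only for exiting on the right.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = n_states // 2                      # start in the middle of the chain
        while True:
            step = rng.choice([-1, 1])         # uniform-random policy
            s_next = s + step
            if s_next < 0:                      # exit left, reward 0
                r, v_next, done = 0.0, 0.0, True
            elif s_next >= n_states:            # exit right, reward +1
                r, v_next, done = 1.0, 0.0, True
            else:
                r, v_next, done = 0.0, V[s_next], False
            # TD(0) update: move V(s) toward the bootstrapped target r + gamma*V(s').
            V[s] += alpha * (r + gamma * v_next - V[s])
            if done:
                break
            s = s_next
    return V

print(td0_prediction())   # approximates [1/6, 2/6, 3/6, 4/6, 5/6] for the 5-state walk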
3.2 Q-Learning
3.2.1 Introduction
Q-Learning, introduced by Watkins [1989], is an off-policy temporal-difference control algorithm that learns the optimal action-value function directly from experience via the update

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]

Explanation This update adjusts the estimate of Q(st , at ) towards the observed reward plus the discounted maximum estimated value of the next state. The use of max_{a′} Q(st+1 , a′ ) reflects the assumption that the agent will act optimally in the future.
• Off-Policy: Q-Learning learns the optimal policy regardless of the agent’s current
policy.
• Convergence: Under certain conditions (e.g., appropriate exploration and learning
rates), Q-Learning converges to Q∗ (s, a).
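The rule can be written as a small function acting on a Q-table; the table layout and transition-tuple format below are assumptions made for illustration.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, terminated, alpha=0.1, gamma=0.99):
    """One Q-Learning update from a single transition (s, a, r, s_next).

    The bootstrapped target maximizes over next actions, i.e. it assumes the
    agent acts greedily afterwards, regardless of how the data was collected.
    """
    target = r if terminated else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q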
SARSA is the on-policy counterpart of Q-Learning, with the update rule

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]

Explanation Unlike Q-Learning, SARSA updates the value of Q(st , at ) using the action at+1 selected by the current policy. This makes it sensitive to the agent's exploration strategy.
• Exploration Impact: SARSA accounts for the agent's exploration behavior, potentially leading to safer policies in certain environments.
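For contrast with the Q-Learning function above, the on-policy update differs only in its bootstrap target, which uses the next action actually chosen by the behavior policy (again only a sketch under the same assumed table layout).

def sarsa_update(Q, s, a, r, s_next, a_next, terminated, alpha=0.1, gamma=0.99):
    """One SARSA update; the target follows the policy's own next action."""
    target = r if terminated else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q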
A common way to balance exploration and exploitation is the ϵ-greedy strategy:
• With probability 1−ϵ, select the action with the highest estimated value (exploitation).
• With probability ϵ, select an action uniformly at random (exploration).
3.4.3 Adaptive Exploration Strategies
Beyond a fixed ϵ, exploration can be adapted over the course of learning, for example by gradually decaying ϵ or by adding optimism-based bonuses such as upper confidence bounds.
In environments with large or continuous state spaces, storing and updating Q-values for
every state-action pair is infeasible. Function approximation methods generalize learning
across similar states and actions.
A simple choice is a linear approximator over features ϕ(s, a):

Q(s, a; θ) = θ⊤ ϕ(s, a)
Gradient Descent Updates Parameters θ are updated using gradient descent to minimize the squared TD error.
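A sketch of the corresponding semi-gradient update for a linear action-value approximator; the feature function, the Q-Learning-style target, and the hyperparameters are assumptions for the example.

import numpy as np

def semi_gradient_q_update(theta, phi, s, a, r, s_next, actions,
                           alpha=0.01, gamma=0.99, terminated=False):
    """Semi-gradient TD update for Q(s, a; theta) = theta . phi(s, a).

    phi: feature function mapping (state, action) to a 1-D numpy array.
    actions: iterable of actions available in s_next.
    """
    q_sa = theta @ phi(s, a)
    if terminated:
        target = r
    else:
        target = r + gamma * max(theta @ phi(s_next, b) for b in actions)
    # Semi-gradient: the target is treated as a constant, so the gradient of
    # the squared TD error with respect to theta is proportional to -delta * phi(s, a).
    delta = target - q_sa
    theta += alpha * delta * phi(s, a)
    return theta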
Neural networks can approximate complex, nonlinear functions, enabling agents to handle
high-dimensional inputs such as images.
3.6.1 Environment Description
An agent must navigate along the edge of a cliff to reach a goal state. Stepping off the cliff
results in a large negative reward.
3.6.2 Comparing SARSA and Q-Learning
• SARSA: Learns a policy that avoids the cliff by accounting for the agent’s exploration,
leading to safer paths.
• Q-Learning: Learns the optimal policy that skirts the edge of the cliff to minimize steps, but may fall off during learning because of exploratory actions.
3.6.3 Insights
This example illustrates how the choice between on-policy and off-policy methods can affect
the learned policy, especially in environments where risky actions have severe consequences.
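The contrast can be reproduced with a short tabular experiment. The sketch below assumes the gymnasium package and its CliffWalking-v0 environment are available; hyperparameters are arbitrary, and the comparison is illustrative rather than a result reported in the text.

import numpy as np
import gymnasium as gym

def train(method="q_learning", episodes=500, alpha=0.5, gamma=0.99, eps=0.1, seed=0):
    """Train a tabular agent on CliffWalking-v0 with epsilon-greedy exploration."""
    env = gym.make("CliffWalking-v0")
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    def eps_greedy(s):
        if rng.random() < eps:
            return int(rng.integers(env.action_space.n))
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s, _ = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a_next = eps_greedy(s_next)
            if method == "q_learning":
                target = r + gamma * np.max(Q[s_next]) * (not terminated)
            else:  # SARSA bootstraps on the action the behavior policy will take
                target = r + gamma * Q[s_next, a_next] * (not terminated)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q

# Q-Learning typically converges to the short, risky path along the cliff edge,
# while SARSA's greedy policy tends to take a safer route away from the edge.
q_table = train("q_learning")
sarsa_table = train("sarsa")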
Early attempts to use neural networks in RL faced challenges with stability and convergence.
The resurgence of deep learning in the 2010s, fueled by advances in training techniques and
computational resources, revitalized interest in combining neural networks with RL.
Neural networks can approximate complex, nonlinear functions and handle high-dimensional
sensory inputs, such as images or raw sensor data.
Introduced by Mnih et al. [2015], DQN was a landmark achievement that demonstrated the
potential of deep RL by achieving human-level performance on Atari games using raw pixel
inputs.
4.2.2 Architecture
DQN employs a convolutional neural network (CNN) to process image inputs and output
Q-values for each possible action.
To address the instability issues inherent in combining Q-Learning with deep neural networks,
DQN introduces two key techniques:
Experience Replay
• Stores the agent's transitions (st , at , rt+1 , st+1 ) in a replay buffer.
• Samples random minibatches from the buffer for updates, breaking correlations between consecutive samples and improving data efficiency.
Target Networks
• Maintains a separate target network Q(s, a; θ− ) whose parameters change more slowly than those of the main network.
• Updates the target network parameters periodically by copying from the main network.
• Provides a stable target for the TD error, reducing oscillations in the learning process.
1. Initialize the main network Q(s, a; θ) and the target network Q(s, a; θ− ) with θ− = θ, together with an empty replay buffer.
2. At each environment step, select an action with an ϵ-greedy policy derived from Q(s, a; θ) and store the resulting transition in the replay buffer.
3. Sample a random minibatch of transitions from the buffer.
4. Compute the target yi for each sampled transition using the target network (see the loss function below).
5. Update θ by gradient descent on the squared error between Q(st , at ; θ) and yi .
6. Every fixed number of steps, copy θ into θ− .
4.2.5 Loss Function
The network parameters θ are trained by minimizing the squared error between the prediction Q(st , at ; θ) and the target

y_i = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-)

where θ− denotes the parameters of the target network.
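A minimal PyTorch sketch of this target and loss computation, assuming a replay buffer that yields batched tensors in the dictionary format shown; network definitions and hyperparameters are placeholders.

import torch
import torch.nn as nn

def dqn_loss(online_net, target_net, batch, gamma=0.99):
    """Compute the DQN loss on a batch of transitions.

    batch: dict of tensors with keys "state", "action", "reward",
           "next_state", "done" (1.0 for terminal transitions); an assumed format.
    """
    states, actions = batch["state"], batch["action"]
    rewards, next_states, dones = batch["reward"], batch["next_state"], batch["done"]

    # Q(s_t, a_t; theta) for the actions actually taken.
    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target y_i = r + gamma * max_a' Q(s_{t+1}, a'; theta^-), with no gradient
    # flowing through the target network.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    return nn.functional.mse_loss(q_values, targets)

# Periodically: target_net.load_state_dict(online_net.state_dict())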
DQN demonstrated that deep neural networks could successfully approximate value functions
in high-dimensional spaces, paving the way for further research in deep RL.
4.3.1 Double DQN
Problem Addressed Standard DQN tends to overestimate action values due to the maximization step in the target calculation.
Solution Double DQN, proposed by van Hasselt et al. [2016], decouples action selection and evaluation: the online network selects the greedy next action while the target network evaluates it,

y_i = r_{t+1} + \gamma\, Q\!\left(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \theta);\, \theta^-\right)
4.3.2 Dueling DQN
Motivation In some states, the choice of action has little impact on the value function. Separating the representation of state-value and advantage functions can improve learning efficiency.
4.3.3 Prioritized Experience Replay
Problem Addressed Uniform sampling from the replay buffer may be inefficient, as some
experiences are more valuable for learning.
Solution Prioritized Experience Replay [Schaul et al., 2016] samples experiences based on
the magnitude of their TD error, focusing learning on more significant updates.
P(i) = \frac{|\delta_i|^{\omega}}{\sum_k |\delta_k|^{\omega}}
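A sketch of proportional sampling according to this probability, written directly over an array of TD errors; a production implementation would normally use a sum-tree, and the importance-sampling exponent β shown here is an assumption of the example.

import numpy as np

def sample_prioritized(td_errors, batch_size, omega=0.6, beta=0.4, eps=1e-5, rng=None):
    """Sample replay indices with probability proportional to |delta_i|^omega."""
    rng = rng or np.random.default_rng()
    priorities = (np.abs(td_errors) + eps) ** omega    # eps keeps every probability nonzero
    probs = priorities / priorities.sum()
    idx = rng.choice(len(td_errors), size=batch_size, p=probs)
    # Importance-sampling weights correct the bias introduced by non-uniform sampling.
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    return idx, weights / weights.max()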
Policy gradient methods optimize the policy directly by computing gradients of the expected
return with respect to policy parameters.
Policies are parameterized as π(a|s; θ), where θ are the parameters of a neural network.
Derivation Overview Starting from the objective J(θ) = E_{π_θ}[G_t], the gradient is obtained using the log-derivative trick applied to the policy's probability distribution.
Algorithm Steps
1. Collect one or more trajectories by running the current policy πθ in the environment.
2. Compute the return Gt for every time step of each trajectory.
3. Update the policy parameters:

\theta \leftarrow \theta + \alpha \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t
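A PyTorch sketch of this update for a discrete-action policy network; the trajectory format and the choice of a single-trajectory, unnormalized return are assumptions made for the example.

import torch

def reinforce_update(policy, optimizer, states, actions, rewards, gamma=0.99):
    """One REINFORCE update from a single trajectory.

    states: tensor (T, obs_dim); actions: long tensor (T,); rewards: list of floats.
    policy: a module mapping states to action logits.
    """
    # Compute returns G_t by summing discounted rewards from t onwards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    loss = -(log_probs * returns).sum()     # gradient ascent on sum_t log pi * G_t

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()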
Actor-critic methods maintain two interacting components:
• Actor: The parameterized policy π(a|s; θ) that selects actions.
• Critic: The value function V (s; w) that evaluates the current policy.
The advantage function A(s, a) = Q(s, a) − V (s) measures the benefit of taking action a in
state s over the average action.
The actor updates the policy parameters in the direction suggested by the critic:

\theta \leftarrow \theta + \alpha_\theta\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)

The critic updates the value function parameters by minimizing the TD error δt = rt+1 + γV (st+1 ; w) − V (st ; w):

w \leftarrow w + \alpha_w\, \delta_t\, \nabla_w V(s_t; w)
4.6 Asynchronous Advantage Actor-Critic (A3C)
4.6.1 Motivation
Training a single agent on strongly correlated, sequential experience can destabilize learning. A3C instead runs many actors in parallel, so that their combined updates are decorrelated without the need for a replay buffer.
Multiple worker agents interact with their own copies of the environment in parallel, updating
shared global parameters asynchronously.
4.6.3 Benefits
PPO [Schulman et al., 2017] simplifies the trust region optimization approach, making it
more practical for large-scale problems.
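For reference, the clipped surrogate objective that PPO maximizes, with probability ratio r_t(θ) = π_θ(a_t|s_t)/π_{θ_old}(a_t|s_t) and advantage estimate \hat{A}_t, is:

L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]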
5.1.3 Advantages
5.2 Soft Actor-Critic (SAC)
5.2.1 Objective
SAC [Haarnoja et al., 2018] incorporates an entropy term into the objective to encourage exploration:
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\!\left(\pi(\cdot \mid s_t)\right) \right]
5.2.2 Entropy Regularization
The entropy term H(π(·|st )) promotes stochastic policies, preventing premature convergence
to suboptimal deterministic policies.
5.2.3 Benefits
5.3 Deep Deterministic Policy Gradient (DDPG)
5.3.1 Introduction
DDPG [Lillicrap et al., 2015] extends the deterministic policy gradient (DPG) approach with deep neural networks, enabling RL in continuous action spaces.
5.3.2 Key Components
• Actor: A deterministic policy network µ(s; θ) that outputs continuous actions.
• Critic: A Q-network Q(s, a; w) that evaluates the actor's actions.
• Replay Buffer: Stores transitions for off-policy minibatch updates.
• Target Networks: Slow-moving copies of the actor and critic networks for stable updates.
5.3.3 Exploration Strategy
Exploration is achieved by adding noise (e.g., from an Ornstein-Uhlenbeck process) to the actor's actions during training.
5.4 Distributional RL
5.4.1 Concept
Distributional RL models the distribution of possible returns, rather than just the expected
return.
5.4.2 Quantile Regression DQN (QR-DQN)
QR-DQN [Dabney et al., 2018] approximates the return distribution using quantile regression, capturing the uncertainty and variability in returns.
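For reference, the quantile regression check loss at quantile level τ, applied to the distributional TD error u, is

\rho_\tau(u) = u \left( \tau - \mathbb{1}\{u < 0\} \right)

QR-DQN uses a Huber-smoothed version of this loss.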
5.4.3 Benefits
• Improved Performance: Capturing the full distribution can lead to better policies.
5.5 Decision Transformers
5.5.1 Reframing RL as Sequence Modeling
Decision Transformers [Chen et al., 2021] treat RL as a sequence modeling problem, using transformer architectures to predict actions based on past trajectories.
5.5.2 Architecture
Trajectories are represented as sequences of return-to-go, state, and action tokens, which are fed to a causal (GPT-style) transformer that autoregressively predicts the next action; at evaluation time, action generation is conditioned on a desired target return.
5.5.3 Advantages
• Simplifies Training: Avoids the need for explicit value function estimation.
6 Conclusion
From its early theoretical foundations in the mid-20th century to the sophisticated algorithms
of today, reinforcement learning has undergone remarkable evolution. The development
of model-free methods, the integration with deep learning, and the recent innovations in
policy optimization and sequence modeling have significantly expanded the capabilities of
RL agents.
As we look to the future, addressing challenges related to sample efficiency, safety, and
scalability will be essential. Continued interdisciplinary collaboration and the integration of
RL with other AI domains hold the promise of unlocking new applications and advancing
the field further.
Policy Gradient Theorem For completeness, the gradient of the objective J(θ) can be written as

\nabla_\theta J(\theta) \approx \sum_{s} d^\pi(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^\pi(s, a)

and, equivalently, using the log-derivative trick,

\nabla_\theta J(\theta) = \sum_{s} d^\pi(s) \sum_{a} \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a) \right]

where d^π(s) denotes the discounted on-policy state distribution.
References
Richard Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.
ISBN 978-0-691-07951-6.
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In Advances in Neural Information Processing Systems, volume 34, pages 15084–15097, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/7f489f640400adbfba07ba77a50ba99a-Abstract.html.
Will Dabney, Georg Ostrovski, David Silver, and Remi Munos. Distributional reinforcement
learning with quantile regression. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 32, 2018. doi: 10.1609/aaai.v32i1.11791.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-
policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, ICML'18, pages 1861–
1870. PMLR, 2018.
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval
Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement
learning, 2015. URL https://arxiv.org/abs/1509.02971.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G.
Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig
Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through
deep reinforcement learning. Nature, 518(7540):529–533, 2015. doi: 10.1038/nature14236.
Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience
replay. In 4th International Conference on Learning Representations, 2016. URL https://arxiv.org/abs/1511.05952.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal
policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. URL https://arxiv.org/abs/1707.06347.
Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double
Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence,
pages 2094–2100. AAAI Press, 2016. doi: 10.5555/3016100.3016191.
Christopher John Cornish Hellaby Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, 1989.