1 Introduction
Reinforcement Learning (RL) is a subfield of artificial intelligence concerned with how agents
ought to take actions in an environment to maximize some notion of cumulative reward.
Unlike supervised learning, where the agent is provided with correct input-output pairs, RL
involves learning from the consequences of actions in a trial-and-error fashion. This process
is akin to how humans and animals learn from interactions with their environment.
The history of RL is rich and multifaceted, drawing from disciplines such as control
theory, psychology, neuroscience, and computer science. This paper traces the chronological
development of RL, highlighting key milestones and the mathematical underpinnings that
have driven progress in the field.
• Section 3 discusses the emergence of model-free methods like Q-Learning and Temporal Difference Learning (1980-2000).
2.1 Markov Decision Processes (MDPs)
2.1.1 Historical Context
The concept of Markov Decision Processes (MDPs) was introduced by Bellman [1957] as
a mathematical framework for modeling decision-making in situations where outcomes are
partly random and partly under the control of a decision-maker. MDPs provide a formalization that is essential for understanding and developing RL algorithms.
Formally, an MDP is defined by a tuple (S, A, P, R, γ), where:
• S: The set of states.
• A: The set of actions available to the agent.
• P(s′|s, a): The transition probability function, specifying the probability of transitioning from state s to state s′ given action a.
• R(s, a, s′): The reward function, giving the immediate reward received after transitioning from s to s′ via action a.
• γ ∈ [0, 1]: The discount factor, representing the importance of future rewards compared
to immediate rewards.
Markov Property An essential feature of MDPs is the Markov property, which states that the future is independent of the past given the present state and action. Mathematically, this is expressed as:

P(s_{t+1} = s' \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} = s' \mid s_t, a_t)

This property simplifies the analysis and computation of optimal policies, as it allows us to consider only the current state and action when making decisions.
The agent’s goal in an MDP is to find a policy π : S → A that maximizes the expected
cumulative discounted reward, also known as the return. The return Gt at time t is defined
as:
G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}
Policy A policy π(a|s) is a mapping from states to probabilities of selecting each possible action. A policy can be deterministic, selecting a single action in each state with certainty, or stochastic, selecting actions according to a probability distribution.
Dynamic programming (DP) is a method for solving complex problems by breaking them
down into simpler subproblems. In the context of RL, DP provides a suite of algorithms for
computing optimal policies given a perfect model of the environment as an MDP.
The state-value function V π (s) under a policy π gives the expected return starting from
state s:
" ∞
#
X
V π (s) = Eπ Gt st = s = Eπ γ k rt+k+1 st = s
k=0
This function quantifies the long-term value of a state when following policy π.
2.2.3 Bellman Expectation Equation for V π(s)
Derivation The derivation of the Bellman equation relies on the law of total expectation
and the Markov property:
\begin{aligned}
V^\pi(s) &= \mathbb{E}_\pi\left[ r_{t+1} + \gamma G_{t+1} \mid s_t = s \right] \\
&= \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a)\left( R(s,a,s') + \gamma\, \mathbb{E}_\pi\left[ G_{t+1} \mid s_{t+1} = s' \right] \right) \\
&= \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a)\left( R(s,a,s') + \gamma V^\pi(s') \right)
\end{aligned}
Interpretation The equation states that the value of a state under policy π is equal to the
expected immediate reward plus the expected discounted value of the next state, averaged
over all possible actions and next states.
2.2.4 Action-Value Function
The action-value function Qπ(s, a) provides the expected return starting from state s, taking
action a, and thereafter following policy π:
Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid s_t = s,\; a_t = a \right]
Significance The action-value function is crucial for selecting optimal actions, as it quantifies the expected return of taking a specific action in a given state.
2.2.6 Bellman Optimality Equations
Optimal State-Value Function
V^*(s) = \max_{a} \sum_{s'} P(s'|s,a)\left[ R(s,a,s') + \gamma V^*(s') \right]
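Optimal Action-Value Function For completeness, the standard optimality equation for the action-value function, stated in the same notation, is:

Q^*(s, a) = \sum_{s'} P(s'|s,a)\left[ R(s,a,s') + \gamma \max_{a'} Q^*(s', a') \right]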
Derivation These equations arise from the principle of optimality, which states that any
subsequence of optimal actions must itself be optimal. The optimal value functions satisfy
these equations, providing a recursive means to compute them.
2.3 Solution Methods: Value Iteration and Policy Iteration
2.3.1 Value Iteration
Value iteration is an algorithm that iteratively applies the Bellman optimality equation to
converge to the optimal value function V ∗ (s).
Algorithm Steps
1. Initialize V (s) arbitrarily (e.g., to zero) for all states.
2. For each state s, update V (s) using the Bellman optimality equation above.
3. Repeat step 2 until the maximum change in V across states falls below a chosen threshold.
4. Extract the optimal policy by acting greedily with respect to the converged value function.
Convergence Under standard assumptions (e.g., finite state and action spaces), value
iteration converges to V ∗ (s) as the number of iterations approaches infinity.
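To make the procedure concrete, the following minimal sketch (illustrative only, not code from the original text) runs tabular value iteration on an MDP supplied as transition and reward arrays; the (S, A, S′) array layout, discount, and tolerance are assumptions made for the example.

import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration.

    P: array of shape (S, A, S'), P[s, a, s'] = transition probability.
    R: array of shape (S, A, S'), R[s, a, s'] = immediate reward.
    Returns the optimal value function V and a greedy policy.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        Q = np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    # One final backup to obtain a greedy policy consistent with the converged V.
    Q = np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])
    policy = Q.argmax(axis=1)
    return V, policy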
2.3.2 Policy Iteration
Policy iteration involves two main steps: policy evaluation and policy improvement.
Policy Evaluation For a fixed policy π, compute the state-value function V π (s) by solving
the system of linear equations defined by the Bellman expectation equation.
Policy Improvement Update the policy by acting greedily with respect to V π (s):
\pi_{\text{new}}(s) = \arg\max_{a} \sum_{s'} P(s'|s,a)\left[ R(s,a,s') + \gamma V^\pi(s') \right]
Algorithm Steps
1. Initialize a policy π arbitrarily.
2. Policy evaluation: compute V π(s) for the current policy.
3. Policy improvement: update the policy greedily as above.
4. Repeat steps 2 and 3 until the policy no longer changes.
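A compact tabular sketch of policy iteration under the same assumed (S, A, S′) array layout used for value iteration above; the exact linear solve in the evaluation step is one possible choice.

import numpy as np

def policy_iteration(P, R, gamma=0.95):
    """Tabular policy iteration for an MDP given as arrays of shape (S, A, S')."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)          # start with an arbitrary policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
        P_pi = P[np.arange(n_states), policy]                          # (S, S')
        r_pi = np.sum(P_pi * R[np.arange(n_states), policy], axis=1)   # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to V.
        Q = np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy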
2.4 Limitations of Dynamic Programming
• Model Requirement: DP methods need a complete model of the transition probabilities and rewards.
• Computation: Each sweep iterates over the entire state space, which becomes prohibitive as the number of states grows (the curse of dimensionality).
• Storage: Storing value functions and policies for large state spaces can be impractical.
2.5 Example: Grid World Application
2.5.1 Scenario Description
Consider a grid world environment where an agent must navigate from a starting position
to a goal state. Each cell in the grid represents a state, and the agent can move in four
directions: up, down, left, and right.
2.5.2 MDP Formulation
• States (S): The cells of the grid.
• Actions (A): Moves up, down, left, and right.
• Transition Probabilities (P): Deterministic or stochastic movements.
• Rewards (R): Negative reward (cost) for each move; large positive reward for reaching
the goal.
2.5.3 Applying Value Iteration
By applying value iteration, we iteratively update the value function for each state until
convergence. This process yields the optimal value function V ∗ (s) and the corresponding
optimal policy.
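The following self-contained sketch applies value iteration to a deterministic 4x4 grid with a step cost of -1 and a goal in the top-right corner; the grid size, rewards, and discount factor are assumptions chosen purely for illustration.

import numpy as np

def grid_world_values(n=4, gamma=0.95, step_cost=-1.0, goal_reward=10.0, tol=1e-6):
    """Value iteration on an n x n deterministic grid; the goal is cell (0, n-1)."""
    goal = (0, n - 1)
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right
    V = np.zeros((n, n))
    while True:
        V_new = np.zeros_like(V)
        for i in range(n):
            for j in range(n):
                if (i, j) == goal:                   # terminal state keeps value 0
                    continue
                best = -np.inf
                for di, dj in moves:
                    # Moves that would leave the grid keep the agent in place.
                    ni, nj = min(max(i + di, 0), n - 1), min(max(j + dj, 0), n - 1)
                    r = goal_reward if (ni, nj) == goal else step_cost
                    best = max(best, r + gamma * V[ni, nj])
                V_new[i, j] = best
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

print(np.round(grid_world_values(), 2))   # higher values cluster near the goal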
2.5.4 Visualization and Interpretation
The resulting value function can be visualized as a heatmap over the grid, indicating the desirability of each state. The optimal policy guides the agent towards the goal while minimizing costs.
2.5.5 Insights
This example illustrates how DP methods can be applied to solve MDPs in simple environments. It also highlights the challenges that arise when scaling to larger or more complex
environments.
Temporal Difference (TD) learning, introduced by Sutton [1988], combines ideas from Monte
Carlo methods and dynamic programming. It enables agents to learn directly from raw
experience without a model of the environment.
3.1.2 TD Prediction
TD prediction focuses on learning the value function V π (s) for a given policy π.
TD(0) Update Rule The simplest TD algorithm, TD(0), updates the value estimate
based on the observed reward and the estimated value of the next state:
V(s_t) \leftarrow V(s_t) + \alpha\left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]

where α is the learning rate. TD(0) converges to V π (s) under certain conditions, such as using a suitable learning rate
schedule and visiting all states sufficiently often.
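As an illustration, the sketch below runs TD(0) to evaluate a uniform-random policy on a small random-walk chain; the chain environment is a toy stand-in invented for this example, not part of the original text.

import numpy as np

def td0_prediction(n_states=5, episodes=1000, alpha=0.1, gamma=1.0, seed=0):
    """TD(0) evaluation of a uniform-random policy on a small random-walk chain.

    States 0..n_states-1; stepping left from state 0 or right from the last
    state ends the episode, with reward +1 only for exiting on the right.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = n_states // 2                      # start in the middle of the chain
        while True:
            step = rng.choice([-1, 1])         # uniform-random policy
            s_next = s + step
            if s_next < 0:                      # exit left, reward 0
                r, v_next, done = 0.0, 0.0, True
            elif s_next >= n_states:            # exit right, reward +1
                r, v_next, done = 1.0, 0.0, True
            else:
                r, v_next, done = 0.0, V[s_next], False
            # TD(0) update: move V(s) toward the bootstrapped target r + gamma*V(s').
            V[s] += alpha * (r + gamma * v_next - V[s])
            if done:
                break
            s = s_next
    return V

print(td0_prediction())   # approximates [1/6, 2/6, 3/6, 4/6, 5/6] for the 5-state walk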
3.2 Q-Learning
3.2.1 Introduction
Q-Learning, introduced by Watkins [1989], is an off-policy temporal-difference control algorithm that learns the optimal action-value function directly from experience via the update

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]

Explanation This update adjusts the estimate of Q(st , at ) towards the observed reward plus the discounted maximum estimated value of the next state. The use of max_{a′} Q(st+1 , a′ ) reflects the assumption that the agent will act optimally in the future.
• Off-Policy: Q-Learning learns the optimal policy regardless of the agent’s current
policy.
• Convergence: Under certain conditions (e.g., appropriate exploration and learning
rates), Q-Learning converges to Q∗ (s, a).
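The rule can be written as a small function acting on a Q-table; the table layout and transition-tuple format below are assumptions made for illustration.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, terminated, alpha=0.1, gamma=0.99):
    """One Q-Learning update from a single transition (s, a, r, s_next).

    The bootstrapped target maximizes over next actions, i.e. it assumes the
    agent acts greedily afterwards, regardless of how the data was collected.
    """
    target = r if terminated else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q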
SARSA is the on-policy counterpart of Q-Learning, with the update rule

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]

Explanation Unlike Q-Learning, SARSA updates the value of Q(st , at ) using the action at+1 selected by the current policy. This makes it sensitive to the agent's exploration strategy.
• Exploration Impact: SARSA accounts for the agent's exploration behavior, potentially leading to safer policies in certain environments.
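For contrast with the Q-Learning function above, the on-policy update differs only in its bootstrap target, which uses the next action actually chosen by the behavior policy (again only a sketch under the same assumed table layout).

def sarsa_update(Q, s, a, r, s_next, a_next, terminated, alpha=0.1, gamma=0.99):
    """One SARSA update; the target follows the policy's own next action."""
    target = r if terminated else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q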
A common way to balance exploration and exploitation is the ϵ-greedy strategy:
• With probability 1−ϵ, select the action with the highest estimated value (exploitation).
• With probability ϵ, select an action uniformly at random (exploration).
3.4.3 Adaptive Exploration Strategies
Beyond a fixed ϵ, exploration can be adapted over the course of learning, for example by gradually decaying ϵ or by adding optimism-based bonuses such as upper confidence bounds.
In environments with large or continuous state spaces, storing and updating Q-values for
every state-action pair is infeasible. Function approximation methods generalize learning
across similar states and actions.
A simple choice is a linear approximator over features ϕ(s, a):

Q(s, a; θ) = θ⊤ ϕ(s, a)
Gradient Descent Updates Parameters θ are updated using gradient descent to minimize the squared TD error.
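A sketch of the corresponding semi-gradient update for a linear action-value approximator; the feature function, the Q-Learning-style target, and the hyperparameters are assumptions for the example.

import numpy as np

def semi_gradient_q_update(theta, phi, s, a, r, s_next, actions,
                           alpha=0.01, gamma=0.99, terminated=False):
    """Semi-gradient TD update for Q(s, a; theta) = theta . phi(s, a).

    phi: feature function mapping (state, action) to a 1-D numpy array.
    actions: iterable of actions available in s_next.
    """
    q_sa = theta @ phi(s, a)
    if terminated:
        target = r
    else:
        target = r + gamma * max(theta @ phi(s_next, b) for b in actions)
    # Semi-gradient: the target is treated as a constant, so the gradient of
    # the squared TD error with respect to theta is proportional to -delta * phi(s, a).
    delta = target - q_sa
    theta += alpha * delta * phi(s, a)
    return theta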
Neural networks can approximate complex, nonlinear functions, enabling agents to handle
high-dimensional inputs such as images.
3.6.1 Environment Description
An agent must navigate along the edge of a cliff to reach a goal state. Stepping off the cliff
results in a large negative reward.
3.6.2 Comparing SARSA and Q-Learning
• SARSA: Learns a policy that avoids the cliff by accounting for the agent’s exploration,
leading to safer paths.
• Q-Learning: Learns the optimal policy that skirts the edge of the cliff to minimize steps, but may fall off during learning because of exploratory actions.
3.6.3 Insights
This example illustrates how the choice between on-policy and off-policy methods can affect
the learned policy, especially in environments where risky actions have severe consequences.
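The contrast can be reproduced with a short tabular experiment. The sketch below assumes the gymnasium package and its CliffWalking-v0 environment are available; hyperparameters are arbitrary, and the comparison is illustrative rather than a result reported in the text.

import numpy as np
import gymnasium as gym

def train(method="q_learning", episodes=500, alpha=0.5, gamma=0.99, eps=0.1, seed=0):
    """Train a tabular agent on CliffWalking-v0 with epsilon-greedy exploration."""
    env = gym.make("CliffWalking-v0")
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    def eps_greedy(s):
        if rng.random() < eps:
            return int(rng.integers(env.action_space.n))
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s, _ = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a_next = eps_greedy(s_next)
            if method == "q_learning":
                target = r + gamma * np.max(Q[s_next]) * (not terminated)
            else:  # SARSA bootstraps on the action the behavior policy will take
                target = r + gamma * Q[s_next, a_next] * (not terminated)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q

# Q-Learning typically converges to the short, risky path along the cliff edge,
# while SARSA's greedy policy tends to take a safer route away from the edge.
q_table = train("q_learning")
sarsa_table = train("sarsa")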
Early attempts to use neural networks in RL faced challenges with stability and convergence.
The resurgence of deep learning in the 2010s, fueled by advances in training techniques and
computational resources, revitalized interest in combining neural networks with RL.
Neural networks can approximate complex, nonlinear functions and handle high-dimensional
sensory inputs, such as images or raw sensor data.
Introduced by Mnih et al. [2015], DQN was a landmark achievement that demonstrated the
potential of deep RL by achieving human-level performance on Atari games using raw pixel
inputs.
4.2.2 Architecture
DQN employs a convolutional neural network (CNN) to process image inputs and output
Q-values for each possible action.
To address the instability issues inherent in combining Q-Learning with deep neural networks,
DQN introduces two key techniques:
Experience Replay
• Stores the agent's transitions (st , at , rt+1 , st+1 ) in a replay buffer.
• Samples random minibatches from the buffer for updates, breaking correlations between consecutive samples and improving data efficiency.
Target Networks
• Maintains a separate target network Q(s, a; θ− ) whose parameters change more slowly than those of the main network.
• Updates the target network parameters periodically by copying from the main network.
• Provides a stable target for the TD error, reducing oscillations in the learning process.
1. Initialize the main network Q(s, a; θ) and the target network Q(s, a; θ− ) with θ− = θ, together with an empty replay buffer.
2. At each environment step, select an action with an ϵ-greedy policy derived from Q(s, a; θ) and store the resulting transition in the replay buffer.
3. Sample a random minibatch of transitions from the buffer.
4. Compute the target yi for each sampled transition using the target network (see the loss function below).
5. Update θ by gradient descent on the squared error between Q(st , at ; θ) and yi .
6. Every fixed number of steps, copy θ into θ− .
4.2.5 Loss Function
The network parameters θ are trained by minimizing the squared error between the prediction Q(st , at ; θ) and the target

y_i = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-)

where θ− denotes the parameters of the target network.
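A minimal PyTorch sketch of this target and loss computation, assuming a replay buffer that yields batched tensors in the dictionary format shown; network definitions and hyperparameters are placeholders.

import torch
import torch.nn as nn

def dqn_loss(online_net, target_net, batch, gamma=0.99):
    """Compute the DQN loss on a batch of transitions.

    batch: dict of tensors with keys "state", "action", "reward",
           "next_state", "done" (1.0 for terminal transitions); an assumed format.
    """
    states, actions = batch["state"], batch["action"]
    rewards, next_states, dones = batch["reward"], batch["next_state"], batch["done"]

    # Q(s_t, a_t; theta) for the actions actually taken.
    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target y_i = r + gamma * max_a' Q(s_{t+1}, a'; theta^-), with no gradient
    # flowing through the target network.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    return nn.functional.mse_loss(q_values, targets)

# Periodically: target_net.load_state_dict(online_net.state_dict())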
DQN demonstrated that deep neural networks could successfully approximate value functions
in high-dimensional spaces, paving the way for further research in deep RL.
4.3.1 Double DQN
Problem Addressed Standard DQN tends to overestimate action values due to the maximization step in the target calculation.
Solution Double DQN, proposed by van Hasselt et al. [2016], decouples action selection and evaluation: the online network selects the greedy next action while the target network evaluates it,

y_i = r_{t+1} + \gamma\, Q\!\left(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \theta);\, \theta^-\right)
4.3.2 Dueling DQN
Motivation In some states, the choice of action has little impact on the value function. Separating the representation of state-value and advantage functions can improve learning efficiency.
4.3.3 Prioritized Experience Replay
Problem Addressed Uniform sampling from the replay buffer may be inefficient, as some
experiences are more valuable for learning.
Solution Prioritized Experience Replay [Schaul et al., 2016] samples experiences based on
the magnitude of their TD error, focusing learning on more significant updates.
P(i) = \frac{|\delta_i|^{\omega}}{\sum_k |\delta_k|^{\omega}}
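A sketch of proportional sampling according to this probability, written directly over an array of TD errors; a production implementation would normally use a sum-tree, and the importance-sampling exponent β shown here is an assumption of the example.

import numpy as np

def sample_prioritized(td_errors, batch_size, omega=0.6, beta=0.4, eps=1e-5, rng=None):
    """Sample replay indices with probability proportional to |delta_i|^omega."""
    rng = rng or np.random.default_rng()
    priorities = (np.abs(td_errors) + eps) ** omega    # eps keeps every probability nonzero
    probs = priorities / priorities.sum()
    idx = rng.choice(len(td_errors), size=batch_size, p=probs)
    # Importance-sampling weights correct the bias introduced by non-uniform sampling.
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    return idx, weights / weights.max()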
Policy gradient methods optimize the policy directly by computing gradients of the expected
return with respect to policy parameters.
Policies are parameterized as π(a|s; θ), where θ are the parameters of a neural network.
Derivation Overview Starting from the objective J(θ) = E_{π_θ}[G_t], the gradient is obtained using the log-derivative trick applied to the policy's probability distribution.
Algorithm Steps
1. Collect one or more trajectories by running the current policy πθ in the environment.
2. Compute the return Gt for every time step of each trajectory.
3. Update the policy parameters:

\theta \leftarrow \theta + \alpha \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t
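A PyTorch sketch of this update for a discrete-action policy network; the trajectory format and the choice of a single-trajectory, unnormalized return are assumptions made for the example.

import torch

def reinforce_update(policy, optimizer, states, actions, rewards, gamma=0.99):
    """One REINFORCE update from a single trajectory.

    states: tensor (T, obs_dim); actions: long tensor (T,); rewards: list of floats.
    policy: a module mapping states to action logits.
    """
    # Compute returns G_t by summing discounted rewards from t onwards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    loss = -(log_probs * returns).sum()     # gradient ascent on sum_t log pi * G_t

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()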
Actor-critic methods maintain two interacting components:
• Actor: The parameterized policy π(a|s; θ) that selects actions.
• Critic: The value function V (s; w) that evaluates the current policy.
The advantage function A(s, a) = Q(s, a) − V (s) measures the benefit of taking action a in
state s over the average action.
The actor updates the policy parameters in the direction suggested by the critic:

\theta \leftarrow \theta + \alpha_\theta\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)

The critic updates the value function parameters by minimizing the TD error δt = rt+1 + γV (st+1 ; w) − V (st ; w):

w \leftarrow w + \alpha_w\, \delta_t\, \nabla_w V(s_t; w)
4.6 Asynchronous Advantage Actor-Critic (A3C)
4.6.1 Motivation
Training a single agent on strongly correlated, sequential experience can destabilize learning. A3C instead runs many actors in parallel, so that their combined updates are decorrelated without the need for a replay buffer.
Multiple worker agents interact with their own copies of the environment in parallel, updating
shared global parameters asynchronously.
4.6.3 Benefits
PPO [Schulman et al., 2017] simplifies the trust region optimization approach, making it
more practical for large-scale problems.
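For reference, the clipped surrogate objective that PPO maximizes, with probability ratio r_t(θ) = π_θ(a_t|s_t)/π_{θ_old}(a_t|s_t) and advantage estimate \hat{A}_t, is:

L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]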
5.1.3 Advantages
5.2 Soft Actor-Critic (SAC)
5.2.1 Objective
SAC [Haarnoja et al., 2018] incorporates an entropy term into the objective to encourage exploration:
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\!\left(\pi(\cdot \mid s_t)\right) \right]
5.2.2 Entropy Regularization
The entropy term H(π(·|st )) promotes stochastic policies, preventing premature convergence
to suboptimal deterministic policies.
5.2.3 Benefits
5.3 Deep Deterministic Policy Gradient (DDPG)
5.3.1 Introduction
DDPG [Lillicrap et al., 2015] extends the deterministic policy gradient (DPG) approach with deep neural networks, enabling RL in continuous action spaces.
5.3.2 Key Components
• Actor: A deterministic policy network µ(s; θ) that outputs continuous actions.
• Critic: A Q-network Q(s, a; w) that evaluates the actor's actions.
• Replay Buffer: Stores transitions for off-policy minibatch updates.
• Target Networks: Slow-moving copies of the actor and critic networks for stable updates.
5.3.3 Exploration Strategy
Exploration is achieved by adding noise (e.g., from an Ornstein-Uhlenbeck process) to the actor's actions during training.
5.4 Distributional RL
5.4.1 Concept
Distributional RL models the distribution of possible returns, rather than just the expected
return.
5.4.2 Quantile Regression DQN (QR-DQN)
QR-DQN [Dabney et al., 2018] approximates the return distribution using quantile regression, capturing the uncertainty and variability in returns.
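For reference, the quantile regression check loss at quantile level τ, applied to the distributional TD error u, is

\rho_\tau(u) = u \left( \tau - \mathbb{1}\{u < 0\} \right)

QR-DQN uses a Huber-smoothed version of this loss.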
5.4.3 Benefits
• Improved Performance: Capturing the full distribution can lead to better policies.
5.5 Decision Transformers
5.5.1 Reframing RL as Sequence Modeling
Decision Transformers [Chen et al., 2021] treat RL as a sequence modeling problem, using transformer architectures to predict actions based on past trajectories.
5.5.2 Architecture
Trajectories are represented as sequences of return-to-go, state, and action tokens, which are fed to a causal (GPT-style) transformer that autoregressively predicts the next action; at evaluation time, action generation is conditioned on a desired target return.
5.5.3 Advantages
• Simplifies Training: Avoids the need for explicit value function estimation.
6 Conclusion
From its early theoretical foundations in the mid-20th century to the sophisticated algorithms
of today, reinforcement learning has undergone remarkable evolution. The development
of model-free methods, the integration with deep learning, and the recent innovations in
policy optimization and sequence modeling have significantly expanded the capabilities of
RL agents.
As we look to the future, addressing challenges related to sample efficiency, safety, and
scalability will be essential. Continued interdisciplinary collaboration and the integration of
RL with other AI domains hold the promise of unlocking new applications and advancing
the field further.
Policy Gradient Theorem For completeness, the gradient of the objective J(θ) can be written as

\nabla_\theta J(\theta) \approx \sum_{s} d^\pi(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^\pi(s, a)

and, equivalently, using the log-derivative trick,

\nabla_\theta J(\theta) = \sum_{s} d^\pi(s) \sum_{a} \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a) \right]

where d^π(s) denotes the discounted on-policy state distribution.
References
Richard Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.
ISBN 978-0-691-07951-6.
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In Advances in Neural Information Processing Systems, volume 34, pages 15084–15097, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/7f489f640400adbfba07ba77a50ba99a-Abstract.html.
Will Dabney, Georg Ostrovski, David Silver, and Remi Munos. Distributional reinforcement
learning with quantile regression. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 32, 2018. doi: 10.1609/aaai.v32i1.11791.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-
policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, ICML'18, pages 1861–
1870. PMLR, 2018.
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval
Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement
learning, 2015. URL https://arxiv.org/abs/1509.02971.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G.
Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig
Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through
deep reinforcement learning. Nature, 518(7540):529–533, 2015. doi: 10.1038/nature14236.
Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience
replay. In 4th International Conference on Learning Representations, 2016. URL https://arxiv.org/abs/1511.05952.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal
policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. URL https://arxiv.org/abs/1707.06347.
Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double
Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence,
pages 2094–2100. AAAI Press, 2016. doi: 10.5555/3016100.3016191.
Christopher John Cornish Hellaby Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, 1989.