
A Primer Chapter on Reinforcement Learning (RL)


What is Reinforcement Learning?
Interaction with the environment is one of the most primitive ways for humans to learn. Whether we are
learning how to drive a car or to hold a conversation, we learn by noticing how the environment reacts to
our actions and we seek to improve our behavior based on the feedback from the environment. Compared
to other forms of machine learning, Reinforcement Learning (RL) is one of the most goal-directed computational approaches: it learns from interactions with the environment without direct supervision or a complete world model. In RL, the agent is not directly taught how to act; instead, incentives are provided through reward mechanisms so that a mapping of state-action pairs emerges, from which the learner can decide what to do in each state to maximize its yield. RL uses the formal framework of Markov decision processes to define these relationships (Sutton and Barto 2018).

RL is different from both unsupervised and supervised learning paradigms. Supervised learning deals with
labeled data and aims to extrapolate the gained knowledge, whereas unsupervised learning attempts to
find hidden structures in unlabeled data. In contrast, RL is context aware when making predictions that
maximize the expected cumulative rewards. To achieve this goal, unsupervised and supervised learning
methods are often applied together with reinforcement learning. RL encompasses the general principle
of learning, search, and decision making, inspired by the way organisms learn in a natural environment.

Historically, RL was composed of three distinct but now intertwined threads of “learning from trial and
error,” “optimal control,” and “temporal-difference.” The idea of “trial and error learning” was originally
used in the 1850s by psychologists and ethologists describing observations of animal behavior. The
concept remained mostly used by psychologists until Alan Turing first used it to bring forward the
possibility of creating artificial intelligence. He described a design of a pleasure-pain system in his 1948
report, suggesting reinforcing and cancelling entries based on these sensations. The term “optimal
control” came into use in the late 1950s to describe maximizing or minimizing a measure of a dynamical
system’s behavior over time. The class of methods for solving these problems came to be known as
“dynamic programming” by introducing the discrete stochastic version of the optimal control problem
known as “Markov decision processes.” Temporal-difference methods made the other two threads of reinforcement learning interrelated to some extent. The origins of temporal-difference learning are partly drawn from animal psychology, specifically the concept of secondary reinforcers, which are stimuli paired with primary reinforcers such as food or pain. The “temporal-difference” and “optimal control”
threads were fully brought together with Q-learning proposed by Chris Watkins. His work extended and
integrated prior work in all three threads of reinforcement learning and brought forward what we today
call reinforcement learning methods.

Introduction to RL Formalisms
Reinforcement learning can be preliminarily defined as “learning what to do—how to map situations to actions—so as to maximize a numerical reward signal” (Sutton and Barto 2018). Despite the greedy policy adopted in this learning process, in which the most rewarding action should be selected, it is in the learner’s interest to explore undiscovered actions that maximize the overall return. This definition can be further expanded into the basic components of a standard RL model: (1) an agent represented by the model; (2) states in an environment; (3) actions to be taken by the agent, influenced by the environment states; (4) the policy that dictates which actions the agent takes and how they are valued; and finally (5) the reward for the action taken. More complex RL models may include additional algorithms and deeper model architectures to perform more specialized tasks. An RL algorithm is the procedure by which agents optimize their total reward, as illustrated in the sketch below.
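To make these components concrete, here is a minimal sketch (not taken from the chapter) of the interact-and-learn loop with a hypothetical two-state toy environment; the states, actions, transition rules, and fixed policy are all invented for illustration.

```python
import random

# Hypothetical toy environment: two states, two actions, stochastic transitions.
STATES = ["A", "B"]
ACTIONS = ["left", "right"]

def step(state, action):
    """Environment dynamics: return (next_state, reward) for a state-action pair."""
    if state == "A" and action == "right":
        return ("B", 1.0) if random.random() < 0.8 else ("A", 0.0)
    if state == "B" and action == "left":
        return ("A", 1.0) if random.random() < 0.8 else ("B", 0.0)
    return (state, 0.0)

def policy(state):
    """A simple fixed policy mapping states to actions."""
    return "right" if state == "A" else "left"

# Agent-environment interaction: observe the state, act, collect the reward.
state, total_reward = "A", 0.0
for t in range(10):
    action = policy(state)                  # (4) the policy dictates the action
    state, reward = step(state, action)     # (2)-(3) environment state and action
    total_reward += reward                  # (5) reward accumulated by (1) the agent
print("return over 10 steps:", total_reward)
```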

The RL components terminology introduced earlier is widely used in control theory. In feedback control
systems, the action controller works in a similar way in which the system performance is improved by
means of observing the state metrics until the desired optimal system state is reached. However, criticisms of such a resemblance are not limited to the inherent time delay; they also concern the difficulty of modeling an environment with a relatively large state (and action) space, the so-called “curse of dimensionality”, not to mention system instability in highly uncertain and stochastic environments. In this context, RL lends itself as an artificial intelligence tool to a wide range of control-like problems, with uncertainty-handling capabilities and robustness to large state spaces. RL is also distinguished by the compactness of its format, where the system modules (i.e. environment interaction, state-action mapping and reward accumulation) are all brought together to alleviate latency issues that arise in pipeline-like procedural systems. Consider the following abstraction of a typical control problem (see Figure 1): before the controller lets the system take actions, the following procedures must be performed in each feedback cycle:

1- Information is extracted from the environment as features and fed into the state estimator
2- The system reference propagates to the controller
3- Back-end and front-end system performance metrics are collected and fed into the controller
4- The controller decides on appropriate actions to take based on the estimated state, the system performance, and the desired state

Figure 1: Feedback loop and components structure in control

For RL applications, this four-step protocol is replaced by the agent’s sequential decision-making (SDM) routine, i.e. interact-learn-improve. RL is distinguished from other computational approaches/models by its emphasis on learning through agent interaction with the environment, without the need for extensive knowledge of the environmental constraints and environment-embedded models. In this way, a wide range of transportation decision-making problems can be tackled directly using the state-action-reward workflow adapted from the Markov Decision Process, without the need to incorporate the complex behaviors of transportation system components in the decision-making process, e.g. traffic flow characteristics or driving behavior as in car-following models. RL can be thought of as an SDM mechanism powered by AI, and the components listed earlier reflect exactly this perception of RL. In this section of the chapter, these components of RL will be discussed in the context of SDM. This discussion will relate to other SDM mechanisms and give rise to further elaboration of RL formalisms.

RL Agent-Environment Interaction
An agent is basically the decision-maker. The actions or decisions it makes in response to a specific environment state, and the corresponding reward received from the environment, constitute the interaction with the environment. The agent can be an engine controller in a robot that must run on an icy surface, with stability as the reward; a traffic signal controller at an intersection that seeks to optimize green-time allocation across the intersection approaches, with minimized spillbacks as the reward; or even a robot playing chess in a game environment, with winning the game as the reward. Note that the agent in these three examples needs to pick an action out of a set of actions/decisions, each of which would essentially lead to different outcomes. Thus, it is not feasible for an agent to be certain about the reward corresponding to taking a specific action. But this is not the only source of uncertainty in an RL setting. The other challenge is the stochasticity of the environment itself, which will not necessarily respond in the same way each time: think of traffic stream arrival rates in the traffic light problem, or the competitor’s next move in the chess example. This is called the state transition in the SDM context, from the well-known Markov Decision Process (MDP). Therefore, there is no way for agents to learn about their actions’ uncertain rewards and the state transitions other than trial and error, or better said, learning by doing. However, this is not left totally unstructured; there are structured “learning methods” for agents in RL that will be discussed in the following sections. The takeaway from this section is that the RL agent responds to the stochasticity of action rewards and state transitions by making estimates based on learning. These estimates build on the principles of conditional probability distributions (see Equation 1) and are learned/updated each time the agent interacts with the environment.

$P(S', r \mid S, a)$: the joint probability of transitioning to state $S'$ and receiving reward $r$, given that the initial state was $S$ and action $a$ was taken (Equation 1)

Where:

$S$ and $S'$ are random variables representing the initial and the successor state, respectively, with $S, S' \in \{S_1, S_2, \dots, S_i\}$
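As a small illustration of Equation 1 (not part of the chapter), the joint dynamics of a tiny MDP can be stored in a table and sampled from; the states, actions, rewards, and probabilities below are all made up.

```python
import random

# Hypothetical dynamics P(S', r | S, a) stored as {(S, a): [((S', r), prob), ...]}.
P = {
    ("S1", "a1"): [(("S1", 0.0), 0.7), (("S2", 1.0), 0.3)],
    ("S1", "a2"): [(("S2", 1.0), 0.9), (("S1", 0.0), 0.1)],
    ("S2", "a1"): [(("S1", 0.0), 1.0)],
    ("S2", "a2"): [(("S2", 2.0), 0.5), (("S1", 0.0), 0.5)],
}

def sample_transition(state, action):
    """Draw (next_state, reward) according to P(S', r | S, a)."""
    outcomes, probs = zip(*P[(state, action)])
    return random.choices(outcomes, weights=probs, k=1)[0]

next_state, reward = sample_transition("S1", "a2")
print(next_state, reward)
```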
Agent Policy
The policy component of an RL system is the one dictating the agent’s actions. However, one should distinguish between the policy and the family of RL algorithms, since the two are easily confused. The RL policy helps
the agent to better understand the value of its current state and the value of taking a specific action at
this specific state. This essentially addresses the problem of estimating the uncertainties associated with
states-actions pairs discussed earlier. For the sake of implementation, the policy is usually initialized to
arbitrary values and then estimated iteratively by a technique borrowed from dynamic programming
called policy iteration, in which algorithmic methods are used to improve and reevaluate the policy.
Elaboration will be made on such a process in the policy learning section. To better understand the
mechanics of the policy versus RL algorithms, let us put it this way: the policy helps the agent to quantify
the environment by means of value functions (state- and action-value functions – see Equation 2 and
Equation 3), whilst the RL algorithms’ function is to logically describe how to optimize these values (and consequently maximize rewards) at each step of taking actions. Note that the value function is
subscripted with the Greek letter Pi to denote the policy influence on these functions as they are defined
by the policy Pi. Note also that the value functions are not introduced here as absolute functions; rather, they are defined as expected values (E) to account for the uncertain, stochastic nature of the evaluated
states and actions. Thus, RL algorithms and policy altogether can be thought of as the frontal cortex of the
agent’s brain: policy is responsible for cognitive functions, while the RL algorithms are the mindset for the
agent’s SDM engine. This gave rise to adopting the deep neural networks technique as a legitimate way
of representing the policy in RL, especially for complex environments with infinite state-space
representation.

$v_\pi(s) \doteq E[G_t \mid S_t = s]$    (State-value function, Equation 2)

$q_\pi(s, a) \doteq E[G_t \mid S_t = s, A_t = a]$    (Action-value function, Equation 3)

Where:

$G_t$ is the return (as a function of rewards)

$S_t$ is the agent's state at time step $t$

$A_t$ is the action taken by the agent at time step $t$
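In the tabular case these value functions can simply be held in dictionaries. The short sketch below (with made-up numbers) also uses the standard identity that the state value is the policy-weighted average of the action values, $v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s, a)$, which is consistent with Equations 2 and 3.

```python
# pi(a|s): action probabilities per state; q_pi(s, a): action-value estimates.
pi = {"S1": {"a1": 0.5, "a2": 0.5},
      "S2": {"a1": 1.0, "a2": 0.0}}
q = {("S1", "a1"): 1.2, ("S1", "a2"): 0.4,
     ("S2", "a1"): 0.9, ("S2", "a2"): 2.0}

def state_value(s):
    """v_pi(s) as the probability-weighted average of q_pi(s, a)."""
    return sum(prob * q[(s, a)] for a, prob in pi[s].items())

print({s: state_value(s) for s in pi})   # {'S1': 0.8, 'S2': 0.9}
```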

Reward Signal
This is a critical component of an RL model since it affects the effectiveness of the learning process. Thus,
it should be selected deliberately considering two factors:

(1) Meaningfulness in describing the problem


(2) Promptness in conveying the ongoing environment updates

To illustrate this, consider the earlier example of the robot playing chess with the reward of winning the game. Although this reward meaningfully describes the problem we are trying to solve, it does not promptly convey a granular signal to help the agent quickly improve its performance. Rather, the agent must wait until a full game session is over to be able to assess the value of its actions and improve. However, sometimes such an infrequent reward signal is indispensable, and one must accept it in certain problem domains. This leads to a discussion of the concept of episodes. Earlier, it was informally illustrated that in an RL application, the agent’s goal is to maximize its expected rewards in the long run. The expected rewards from each time step can be generalized to the expected return. Formally, the expected return ($G_t$) can be perceived as the rewards accumulated by the agent until the final time step $T$ (Equation 4). However, this applies only to applications called episodic tasks that do have a final time step, i.e. a termination state, which is not necessarily the case for all applications. Another formal definition of the return is needed for the other type of application, known as continuous tasks. For continuous tasks, the immediate reward is emphasized, while rewards at later time steps are exponentially discounted (see Equation 5). This distinction is crucial, as it governs the way the return is computed in an RL application. Thus, while building an RL application, one should decide whether it performs an episodic or a continuous task, and on this basis formulate the return’s computational rules.

$G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$    (Expected return in episodic tasks, Equation 4)

$G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} = R_{t+1} + \gamma G_{t+1}$    (Expected return in continuous tasks, Equation 5)

Where:

$R_t$ is the reward at time step $t$

$\gamma$ is the discount factor, $0 \le \gamma < 1$

$k$ indexes the progression of time steps
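A short sketch (with an invented reward sequence) of how the two returns are computed in practice: Equation 4 is a plain sum, while the discounted return of Equation 5 is conveniently evaluated backwards through the recursion $G_t = R_{t+1} + \gamma G_{t+1}$.

```python
def episodic_return(rewards):
    """G_t = R_{t+1} + R_{t+2} + ... + R_T (Equation 4)."""
    return sum(rewards)

def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k R_{t+k+1}, computed backwards via G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 0.0, 5.0]            # R_{t+1}, ..., R_T (made up)
print(episodic_return(rewards))            # 6.0
print(discounted_return(rewards, 0.9))     # 1 + 0.9**3 * 5 = 4.645
```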

The “Learning” Component in RL


The algorithmic component of learning in RL can be split into two main categories: (1) tabular solution methods, and (2) approximate solution methods. In tabular solution methods, the state-action pairs are finite and can be listed in a table-like form. This format makes it easy to map the estimated value functions to the corresponding state-action pairs. As for the approximate solution methods, they suit situations where the state-action pairs are infinite; thus, the approximated value function takes the form of a parameterized function with a weight vector. The tabular methods can be listed as follows:

(1) Multi-armed Bandits


(2) Finite Markov Decision Process (MDP)
(3) Monte Carlo Methods
(4) Temporal Difference Learning
(5) N-step Bootstrapping

In this chapter, a brief discussion will be provided on tabular as well as approximate methods, but before
this, an elaboration on the learning component in RL is due. Thus, in the following section, concepts like
policy-based learning versus value-based learning, on-policy and off-policy, and the general learning
workflow in RL will be discussed.

Policy-based learning Vs. Value-based learning


Value function learning estimates the value of a state or state-action pair by calculating the expected
cumulative reward for the current state. The expected reward function is then used to determine what
action to take at the current time step. On the other hand, a policy is a mapping from states to probabilities of selecting each possible action. If policy π is followed by the agent at time t, then π(a|s) is the probability that $A_t = a$ given $S_t = s$. This is a more straightforward alternative to value-function methods, which estimate the value of each state-action pair. A policy π is defined to be better than or equal to a policy π′ if its expected return is greater than or equal to that of π′ for all states. There is always at least one policy that is better than or equal to all other policies; it is called the “optimal policy” and, although not necessarily unique, is denoted π∗. However, even if a complete and accurate model of the environment’s dynamics is available, it is usually not possible to simply compute the optimal policy.
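The policy-ordering criterion above can be checked directly when state values are available in tabular form. The sketch below uses invented value tables for two policies and simply tests whether one dominates the other in every state.

```python
# Made-up state values under two policies, v_pi and v_pi_prime.
v_pi       = {"S1": 1.4, "S2": 0.9, "S3": 2.0}
v_pi_prime = {"S1": 1.1, "S2": 0.9, "S3": 1.7}

def at_least_as_good(v_a, v_b):
    """True if the first policy is better than or equal to the second in every state."""
    return all(v_a[s] >= v_b[s] for s in v_a)

print(at_least_as_good(v_pi, v_pi_prime))   # True: pi >= pi' in every state
```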

On-policy and Off-policy learning


In instances of reinforcement learning where a policy is optimized by agents, evaluation of the policy can
be performed in two ways. On-policy methods use the same learned policy to evaluate the intrinsic future
value of the current policy. In other words, both how the agent takes actions and how rewarding those
actions will be are estimated using the same policy. This is a compromise as the action values learned are
not based on an optimal policy but one that is still being learned.

On the other hand, off-policy learning refers to methods that use two distinct policies for taking actions and for evaluating those actions. The policy being learned about is called the “target policy”, while the policy used to generate behavior is called the “behavior policy”. In this case, learning is from data “off” the target policy, and the overall process is called off-policy learning. The best-known off-policy method is Q-learning, in which the update rule uses the highest-yielding Q-value in the next state, while the behavior policy can be more exploratory.
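A minimal sketch of the Q-learning update just described: the target uses the highest-yielding Q-value in the next state, independently of the action the (possibly exploratory) behavior policy will actually take. The transition, step size, and discount factor below are illustrative.

```python
from collections import defaultdict

ACTIONS = ["a1", "a2"]
Q = defaultdict(float)                      # Q[(state, action)], initialized to 0

def q_learning_update(s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q[(s_next, a_next)] for a_next in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# One hypothetical transition (s, a, r, s'):
q_learning_update("S1", "a1", 1.0, "S2")
print(Q[("S1", "a1")])                      # 0.1 after a single update from zero
```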

On-policy methods are usually simpler conceptually and are typically implemented first. Off-policy methods require additional concepts and notation and often exhibit greater variance, with possible divergence issues. However, off-policy methods are more general and include on-policy methods as a special case. They are widely used for learning from other sources without the explicit need to interact with the environment, such as a conventional non-learning controller or a human expert. It should be noted that the distribution of the experience collected under the behavior policy does not necessarily reflect the distribution induced by the target policy being optimized. Some researchers (Fujimoto, Meger et al. 2019) argue that an extrapolation error becomes noticeable when using off-policy methods with finite data, where the value target includes errors caused by estimates of unseen state-action pairs. Several new methods, including batch-constrained reinforcement learning and KL-control, have been proposed for practical applications to ensure the agent acts realistically and stays close to the available data distribution (Fujimoto, Meger et al. 2019, Jaques, Ghandeharioun et al. 2019). Other methods mix on-policy and off-policy learning techniques to yield superior performance (Hausknecht and Stone 2016).

Various methods for finding the optimal policy are found in the literature. Policy iteration methods try to directly find a policy function that maps states to actions. Policy Gradient (PG) is one of the earliest proposed methods, in which a policy score function is optimized by gradient ascent to find the parameters that improve the policy. The Asynchronous Advantage Actor-Critic (A3C) method published by Google DeepMind is a state-of-the-art method that trains several agents, each based on its own perception of the environment, to diversify the experiences available for training all agents. Other on-policy methods include Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), which limit the possible update steps by constraining them to a trust region or to a ratio bounding the difference between the old and updated policy, respectively. These algorithms mainly attempt to constrain update steps to avoid the risk of divergence.

The learning process in RL
RL methods can be categorized into two classes: model-based and model-free. These two classes have inherent differences which should be considered when choosing the right model. Historically, most RL methods have used a model-based approach, building on world models already formulated by researchers. In this approach, the agent can make predictions about the next state and reward using its world model before taking the action. An example is using control delay estimation models at intersections to optimize an RL model of a traffic signal program. Thanks to the use of user insights and pre-framed world models, model-based RL is usually more data efficient. The agent can learn faster, and there exists a point of reference for correcting its mistakes. This means that if our underlying assumptions for the model are correct, a better goal-oriented and reliable model that needs less data to be trained is obtainable. Moreover, since a human-understandable model is used, model interpretation and adjustment are typically easier for the user.

The cost function is defined by the user and can emphasize the parts the user considers most important. In general, in a model-based algorithm the agent can predict the dynamics of the environment during and after the learning period, guided by its estimates of the transition probabilities and the reward function.

Model-free RL is an instance of RL in which no world model, or estimate of one, is used. Often, an actual working world model giving the dynamics of the environment does not exist or cannot simply be computed. Some applications are complicated, and the reasons why different states occur can be ambiguous. Model-free methods are exceptionally useful in these cases. Here, a series of actions is taken without prior knowledge of the rewards or transition function, in order to learn a policy that maximizes the total reward by interacting directly with the environment. The main algorithms used for model-free RL are Q-learning and State-action-reward-state-action (SARSA). The Deep Q-Network is a variation of Q-learning in which a neural network is trained instead of a two-dimensional array of state-action values, to work with larger state spaces in which it is impossible to estimate the value for all unseen states. This neural network setting belongs to the approximate solution methods, whilst the two-dimensional array belongs to the tabular methods; both categories will be further discussed in the next section of the chapter.

As an example, consider the case of an autonomous vehicle run by a reinforcement learning agent. If the
agent has never driven before, what strategy would be best to implement? We could start by reading the car’s manual, studying road signs, physics, and basic dynamics to plan each movement in response to environmental stimuli. However, any person who has learned to drive knows that it is almost impossible to do so without getting behind the wheel and starting to drive, perhaps with the aid of a driving instructor. Trial and error is the only way to learn this complicated task, as there are simply too many variables to consider, and one cannot master the whole task of driving by study alone before getting on the road. The world model and dynamics of the system cannot be fully described. This
is where a model-free approach would be most beneficial.

However, some of the secondary tasks within the driving goal can be modeled and planned, which helps accelerate the learning process. For instance, routing can be much more efficient if pre-existing maps and data are fed to the agent rather than relying solely on trial and error. Common tasks, such as how to change lanes once a lane-changing decision has been made, can also follow a model-based approach to ensure faster convergence.

Imitation learning, which is learning by observing an expert agent perform, can also prove useful. Imitation learning can be categorized as a middle ground between model-free and model-based learning. In this method, an explicit model is not learned, but certain obvious mistakes can be avoided by learning general rules about the task at hand from imitating the behavior of an already successful expert agent. This is analogous to learning from a driving instructor or from parents as they drive, while the learner observes the instructor's reaction to different situations.

Exploitation-exploration tradeoff

Due to the nature of reinforcement learning, in order to maximize the gained reward agents are faced
with a fundamental dilemma: exploring the world for better informed decisions in the future or making
the best decision given the current information. In other words, since agents learn the dynamics of the
world and the value of their actions in each state as they progress, finding the best long-term policy may
involve short-term sacrifices. This is analogous to trying new foods at your local restaurant versus choosing
the dishes one already knows and likes. A good example is the epsilon-greedy method in RL, in which, for a small portion of the time, the agent takes random actions with unknown payoffs instead of always choosing the action that maximizes the reward given its current understanding of the world dynamics. This exploration of the unknown state-action space lets the agent potentially find actions that maximize the gained reward in the long term, although a random action is not necessarily the best action in the short term. The exploration rate, known as epsilon, decreases as the agent learns more about the world dynamics. (Epsilon should not be confused with the learning rate, alpha, which determines how much the estimates are updated at each step to reach convergence.) Algorithms such as the decaying epsilon-greedy method decrease the number of random actions taken by the agent to address this tradeoff. Other methods, such as the contextual epsilon-greedy algorithm, which chooses the value of epsilon in each state based on how critical the agent's situation is, have also been proposed to address
this issue. Interested readers can refer to (Wang, Zariphopoulou et al. 2019) for more information on the
topic.
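A small sketch of the (decaying) epsilon-greedy rule discussed above: with probability epsilon the agent takes a random action, otherwise it exploits the greedy action under its current action-value estimates. The estimates, the initial epsilon, and the decay schedule are illustrative choices, not values from the chapter or from (Wang, Zariphopoulou et al. 2019).

```python
import random

def epsilon_greedy(q_values, epsilon):
    """q_values: dict mapping actions to estimated values."""
    if random.random() < epsilon:
        return random.choice(list(q_values))           # explore: random action
    return max(q_values, key=q_values.get)             # exploit: greedy action

q_values = {"a1": 0.2, "a2": 0.7, "a3": 0.1}           # made-up estimates
epsilon, decay, min_epsilon = 1.0, 0.99, 0.05
for step in range(200):
    action = epsilon_greedy(q_values, epsilon)
    epsilon = max(min_epsilon, epsilon * decay)        # decay exploration over time
print("final epsilon:", round(epsilon, 3))
```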

Tabular Methods versus Approximate Solution Methods


Tabular and approximate solution methods were referred to briefly in the previous section on the RL learning component. Now that the mechanics of RL learning should be clear, these two categories of algorithms used to perform the learning task in RL are discussed.

Tabular Methods
Multi-armed Bandits
In this method, an agent is faced with a situation in which one out of k options or actions should be selected, and the selection is followed by a numeric reward. In RL settings, this agent needs to develop an understanding of the environment in terms of the best actions to take, so that the reward is maximized over the entire event. The agent starts an exploration-exploitation phase, during which the action-reward pairs are filled in tabularly. The action value in this method is computed as the arithmetic mean (see Equation 6), recalling that the problem is stochastic in nature, i.e. the returned reward from a specific action is not the same every time the action is selected. Once the agent has conducted enough exploration, a greedy approach to selecting actions can be confidently adopted.

$Q_t(a) \doteq \dfrac{\text{sum of rewards when action } a \text{ was taken prior to } t}{\text{number of times action } a \text{ was taken prior to } t}$    (Equation 6)
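The sketch below (with invented arm means) runs an epsilon-greedy k-armed bandit agent and maintains the sample-average estimate of Equation 6 through the equivalent incremental form $Q_{n+1} = Q_n + \frac{1}{n}(R_n - Q_n)$.

```python
import random

true_means = [0.2, 0.5, 0.8]                 # hidden reward means of 3 arms (made up)
Q = [0.0, 0.0, 0.0]                          # action-value estimates
N = [0, 0, 0]                                # times each arm was pulled
epsilon = 0.1

for t in range(2000):
    if random.random() < epsilon:            # explore a random arm
        a = random.randrange(3)
    else:                                    # exploit the current best estimate
        a = max(range(3), key=lambda i: Q[i])
    reward = random.gauss(true_means[a], 1.0)
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]           # incremental sample average (Equation 6)

print([round(q, 2) for q in Q])              # estimates approach the true means
```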
Finite Markov Decision Process (MDP)
The finite Markov Decision Process (MDP), referred to earlier in the SDM discussion, is considered the most pertinent representation of the modern RL model in a decision-making framework. The RL formalisms introduced earlier in this chapter were all excerpted from MDP settings. To recall, in an MDP framework, the agent transitions between environment states with some probability conditional on the action taken in the previous time step, and receives a certain numeric reward associated with that action. Thus, in an RL setting, the agent needs to evaluate the states as well as the actions together, in what are called the state- and action-value functions (see Equation 2 and Equation 3). As long as the states and their viable actions are all finite, the framework legitimately qualifies as a tabular solution method, noting that “complete knowledge” of the environment is assumed in this process.

Monte Carlo Methods (MC)


In this method, the assumption of complete knowledge of the environment is relaxed. This gives the RL framework more flexibility in tackling problems without prior knowledge. MC methods are typically used for numerical estimation of functions by means of random sampling. In RL settings, states, actions and rewards are sequentially sampled, either from a real or a simulated environment, and used by MC to estimate the state-value functions by averaging sample returns. In this way, experience-based learning is established. One limitation of this class of methods in estimating action values should be highlighted here: the problem of “unvisited” state-action pairs. For MC to be able to estimate the value of a state-action pair, several visits must be made to this pair, since we need to average over the returns observed on those visits. However, if actions are selected according to a deterministic (greedy) policy, some pairs may never be visited. Thus, there needs to be sufficient exploration.
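A compact sketch of first-visit Monte Carlo prediction as described above: state values are estimated by averaging the discounted returns that follow each first visit to a state. The episodes here are pre-recorded (state, reward) sequences invented for illustration, with each pair read as $(S_t, R_{t+1})$.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:                    # episode: [(state, reward), ...]
        first_visit = {}
        for t, (s, _) in enumerate(episode):    # index of each state's first visit
            first_visit.setdefault(s, t)
        g = 0.0
        for t in reversed(range(len(episode))): # accumulate returns backwards
            s, r = episode[t]
            g = r + gamma * g
            if first_visit[s] == t:             # record only at the first visit
                returns_sum[s] += g
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

episodes = [[("A", 0.0), ("B", 1.0), ("A", 2.0)],
            [("B", 0.0), ("A", 1.0)]]
print(first_visit_mc(episodes))                 # averaged sample returns per state
```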

Temporal-Difference Learning
As discussed earlier in the RL formalisms, and elaborated in the learning component in RL, policy evaluation or prediction constitutes the main building block of an RL system. The policy evaluation task is usually addressed in two domains: (1) the Monte Carlo method, and (2) the Dynamic Programming method. These methods were historically proven successful at approximating phenomena by repeated sampling and by updates based on learned estimates, respectively. In the Monte Carlo approximation method, one must wait until the last sample (state) of an episode before an estimate can be made (see Equation 2). In Dynamic Programming, the estimate update at time $t+1$ is made based on prior learned estimates, an approach called bootstrapping. The novelty of temporal-difference learning is that it conceptually builds on both methods with a refined approach. As shown in the temporal-difference update equation in its simplest form (see Equation 7), at time step $t+1$, the immediately observed reward $R_{t+1}$ and the estimated state value $V(S_{t+1})$ constitute the update target, accounting for the temporal changes in the environment.

$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$    (Equation 7)
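A minimal sketch of the TD(0) update in Equation 7, applied to a single observed transition $(S_t, R_{t+1}, S_{t+1})$; the value table, step size, and discount factor are illustrative.

```python
V = {"S1": 0.0, "S2": 0.5}                   # current state-value estimates (made up)
alpha, gamma = 0.1, 0.9

def td0_update(s, r, s_next):
    """V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)]."""
    td_target = r + gamma * V[s_next]        # bootstrapped target
    V[s] += alpha * (td_target - V[s])       # move the estimate toward the target

td0_update("S1", 1.0, "S2")
print(V["S1"])                               # 0.1 * (1.0 + 0.9*0.5 - 0.0) = 0.145
```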

However, one must be aware of the pros and cons of temporal-difference policy learning methods. Such a method essentially learns a guess from a guess (bootstrapping), which has nevertheless been proven to converge to the correct values. Unlike dynamic programming methods for learning a policy, temporal-difference methods do not need a model describing the stochastic transitions to next states given the current state and selected action. Moreover, temporal-difference methods are well suited to on-policy learning (e.g. SARSA), as they are capable of learning and updating the policy at each time step, and they are also integrated with off-policy learning (e.g. Q-learning and Expected SARSA).

On-policy Prediction with Approximation


The second category of algorithms used for estimating the state-value function is the approximate solution methods. The approximation in these methods refers to function approximation of the state value ($v_\pi$) from on-policy data. Unlike the tabular methods discussed earlier, the approximated value function is not mapped to unique state-action pairs; rather, it takes the form of a parametric function $\hat{v}(s, w)$ of state features with a weight vector $w$, noting that this weight vector has far fewer components than the number of states. This makes sense because the role of the weight vector is to approximate a state in terms of its most significant features, rather than to describe the environment states themselves, which are most probably infinite for such methods. This approximated value function $\hat{v}$ can be thought of as the output of an artificial neural network that takes state features and predicts the value of the state by adjusting the weights of these features. To generalize, in this category of RL algorithms, the function approximation machinery of supervised learning is used to perform the agent's learning task. The training mechanism can be explained simply as follows. In an on-policy setting, at each time step a training example is observed, consisting of the state at that time step ($S_t$) mapped to its true value as measured under the policy ($v_\pi(S_t)$). These training examples are fed into a supervised learning technique to estimate the parametric weight vector, with the objective of minimizing the so-called prediction objective, the mean squared value error ($\overline{VE}$).
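As a rough sketch of this idea (not the chapter's implementation), the snippet below approximates $v_\pi$ with a linear function $\hat{v}(s, w) = w^\top x(s)$ and nudges the weight vector by stochastic gradient descent toward sampled targets standing in for $v_\pi(S_t)$, thereby reducing the squared prediction error. The feature vectors, states, and sampled returns are all invented.

```python
import numpy as np

# Hand-crafted state features x(s); the weight vector is far smaller than |S| would be.
features = {"S1": np.array([1.0, 0.0, 0.5]),
            "S2": np.array([0.0, 1.0, 0.5])}
w = np.zeros(3)
alpha = 0.05

def v_hat(s):
    """Linear value approximation v_hat(s, w) = w . x(s)."""
    return float(np.dot(w, features[s]))

# (state, sampled return) training examples gathered while following the policy.
training_examples = [("S1", 1.2), ("S2", 0.4), ("S1", 1.0), ("S2", 0.6)] * 50

for s, g in training_examples:
    error = g - v_hat(s)                     # prediction error for this example
    w += alpha * error * features[s]         # gradient step on 0.5 * error**2

print({s: round(v_hat(s), 2) for s in features})
```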

Inverse Reinforcement Learning (IRL)


Inverse Reinforcement Learning (IRL) has recently been recognized in the data science community as a machine learning framework that solves the RL problem in reverse, i.e. finding the reward function that governs an agent's actions. With reference to the RL formalisms, IRL can be formulated as follows: given an optimal policy that tells the agent what to do in each state, and the RL model dynamics that describe the stochastic transitions to next states given the current state and selected action, the agent’s reward function can then be traced back. In an IRL setting, an apprentice agent (also called a researcher agent) tries to learn the reward function from the expert’s demonstration (Piot, Geist et al. 2017). In one of the most cited works on IRL, (Ng and Russell 2000) refer to the role played by IRL in ascertaining the reward function “being optimized by a natural system” in an apprenticeship learning context. In this work, the reward function was particularly regarded as the “most succinct, robust, and transferable definition of the task”. This explains the motivation of IRL to infer the reward function from behavior.

A good example of IRL in a transportation context comes from the area of autonomous driving and safety,
where (Shimosaka, Nishi et al. 2015) used IRL to model risk anticipation and defensive driving on
residential roads. In the context of risk anticipation and defensive driving, a carefully designed reward
function is essential for the system (agent) to optimize its actions in relation to the encountered states.
Thus, in this work, the behavior of expert drivers is used in an IRL setting to learn about their reward
function from their demonstrated risk anticipation and defensive driving behavior. The concept extends even beyond the areas of robotics and autonomous driving, and is used to understand the mechanics of the decision-making process, as will be seen in the next section, where IRL is used to learn the utility (reward) function of individuals making trips.

RL Applications in Transportation Domains
In this section of the primer chapter, RL applications in the transportation domains of transportation planning, transportation policy, urban traffic management, and traffic operations are discussed in light of the contextualized RL algorithmic components and learning paradigms.

Planning
Activity plans, also known as activity chains or tours, are the key components of activity-based travel demand models. Activity plans are simply the sequence of trips made by individual agents on a time-step (possibly daily) basis. (Feygin 2018) used Inverse Reinforcement Learning (IRL) to estimate the utility (reward) functions of such activity plans, drawn from a large sample of call detail record (CDR) data in the San Francisco Bay Area. Raw and anonymized CDR data was first processed using machine learning tools to recognize the underlying activity sequence of individuals. Then, IRL was used to uncover the reward function that incentivized such activity sequences. IRL, as one can tell from the name, is simply learning about agents’ values or rewards (i.e. utility) by observing their behavior (i.e. their trip/activity schedule). In this dissertation, Feygin formulated the daily activity participation dynamics as an MDP (see Figure 2), where:

▪ ST: staying at the current activity


▪ LE: embarking on a trip to a different activity
▪ AR: arriving at a destination
▪ SM: continuing a trip in progress using the same mode
▪ CT: switching to a different travel mode

Figure 2: Example dynamics describing activity-travel plan MDP for two types of activity and two travel
modes. Arrows between activities represent possible choices of next activity or travel given the current
state. Note that certain states are not reachable (i.e., car at t = 0, work at t = 1). - source: Feygin, S. (2018).

Figure 3 shows a typical representation of a general RL problem, while Figure 4 shows the IRL one. In the IRL setting, the agent is the researcher trying to learn the utility over the probable paths of the rational (or expert) trip maker choosing specific plans, by observing the state space that matches the expected features observed in that expert demonstration. The learning agent, the researcher agent in the IRL case, parameterizes the utility U in terms of the θ values, in a way that achieves maximum consistency between the state-space transitions and the expected state-action feature vectors (ϕ). The principle of maximum entropy (Jaynes 1957) was used to solve this optimization problem.

Figure 3: A schematic of the general reinforcement learning problem. - source: Feygin, S. (2018).

Figure 4: A schematic of the general inverse reinforcement learning problem. - source: Feygin, S. (2018).

$D$: the demonstration of behavior, a set of state-action pairs ($\xi$)

$m$: trajectories (activity plans)

$U_t$: utility of a state-action pair at time step $t$ (unknown)

$\theta$: parametric approximation of the utility function

$\phi$: domain-specific feature vectors associated with the state and action space

Transportation Policy
In the domain of transportation policy, emerging issues like dynamic congestion pricing can be efficiently solved in an RL setting. A project recently presented by the Oracle team in a session on Reinforcement Learning for Transportation during the Reinforcement Learning Zürich meeting in Switzerland (Arango 2019) shows the functionality of RL in such an application. The RL agent is a policy/decision maker that decides on and assigns toll values (actions) at certain operational states such that the flow in the network and the total revenues are maximized. The system architecture (see Figure 5) encompasses, along with the typical RL components, a traffic simulator (SUMO) to process the environment state vector [toll segment speed, main segment speed, toll segment occupancy, main segment occupancy, input flow rate] and convey an indicative state signal to the RL agent. The toll price is introduced as a function of the so-called split ratio, which is the desired traffic split between the toll and the main road segments. The action vector comprises three discrete actions corresponding to the split ratio: [Action 0: decrease the split ratio by 0.1; Action 1: increase the split ratio by 0.1; Action 2: do not modify the split ratio]. A Double Deep Q-Network is used for policy learning and action optimization.

Figure 5: RL-based Dynamic Congestion Pricing Controller- source: ORACLE 2019

In a cell transmission model mesoscopic setting, (Pandey, Wang et al. 2019) developed a deep
reinforcement learning algorithm for dynamic pricing of express lanes with multiple access locations. In
this framework, most of the simplifying assumptions related to dynamic pricing were relaxed in order to
test the RL tool under heterogeneous conditions, i.e. heterogeneous travelers and lane choices. The authors particularly emphasized the consideration of “multiple origins and destinations”, “multiple access locations”, “enroute diversion of travelers”, “partial observability of the sensor readings”, and “stochastic demand and observation”. Unlike previous dynamic pricing work, where the protocols heavily relied on a look-up table strategy for pricing, the tolls in (Pandey, Wang et al. 2019) were modelled as continuous and stochastic variables learned by a feedforward neural network. The toll was treated as a function of real-time observations made in a partially observable Markov Decision Process (POMDP). Typical toll policy flaws, e.g. the “jam-and-harvest” behavior inherited from “revenue-maximizing policies”, were avoided in this framework.

Urban Traffic Management


The congestion problem at urban traffic junctions incentivized researchers from the University of Toronto (El-Tantawy, Abdulhai et al. 2013) to explore the potential of RL for Adaptive Traffic Signal Control (ATSC) applications. In their work, they introduced a Multi-Agent Reinforcement Learning approach for Integrated Network settings (MARLIN) to address the problem, so that the agents (controllers) at each junction in an arbitrary urban network work collaboratively with their neighboring agents on reducing the cumulative delays. The algorithm follows the same RL sequence discussed earlier, with agents updating their states and deciding and coordinating on their actions to reach a unique equilibrium policy that would achieve optimality for all agents (Nash Equilibrium). The operational conditions of the signalized intersections are parameterized into the RL agents’ reward signal (see Figure 6) to capture the stochastic states and dynamic operational conditions of the environment. The definition of the reward function is pivotal here for a full understanding of the application.

“The immediate reward for a certain agent is defined as the reduction (saving) in the total cumulative
delay associated with that agent” (El-Tantawy, Abdulhai et al. 2013)

Figure 6: MARLIN-ATSC platform. - Source: El-Tantawy, S. et al. (2013)

Uninterrupted Traffic Operations


(Fares and Gomaa 2014) explored an RL-based density control agent (RLCA) that utilizes the Q-learning algorithm as a viable alternative to conventional ramp metering in uninterrupted traffic operations. The agent was designed on a Markovian modeling framework to reflect the stochastic nature of the problem, with an objective function of optimizing the freeway mainstream density (i.e. maximizing the flow and minimizing total travel times). The method was tested in two environments: dense and light traffic flow. In both environments, the RLCA helped stabilize the flow on the freeway mainstream. Walraven et al. (2016) proposed an RL framework for developing speed limit policies on highways.

Interrupted Traffic Operations

Signal control and interrupted urban network operations have been among the primary fields of application for RL (Wiering 2000, Balaji et al. 2010, Walraven et al. 2016). Researchers have used a Neuro-Fuzzy Actor-Critic control method for real-time intersection and arterial traffic signal control (Zhang et al. 2007). In another study, researchers dealt with the problem of traffic signal control by developing RL methodologies for various situations, including optimal green time extension in the dilemma zone (Chong and Abbas 2010). Li et al. (2016) proposed a deep neural network (DNN) to learn the Q-function of reinforcement learning using the traffic state as input and the corresponding traffic system performance as output. A comprehensive review of reinforcement learning methodologies used for adaptive traffic signal control is provided in Mannion et al. (2016). The recent advent of powerful computational systems unleashed new capabilities for developing adaptive traffic signal controllers at the network level and boosted research in this specific domain. Several studies on the subject have emerged, including deep RL applications (van der Pol and Oliehoek 2016, Wei et al. 2018, Tan et al. 2019), multi-agent RL (El-Tantawy et al. 2013, Chu et al. 2019), ensembles of actor-critic algorithms and LSTM models (Li et al. 2020), applications of Continuous-Time Markov Games (Aragon-Gómez and Clempner 2020), RL-based meta-optimization strategies (Xiao et al. 2019), and comparative approaches between different RL problem formulations for signal control (Aslani et al. 2017, Aslani et al. 2018, Genders and Razavi 2020).

Driver Behavioral Modeling


RL has been very popular for driving control problems (Desjardins and Chaib-Draa 2011, Chong et al. 2011, Shimosaka et al. 2015), and for locally optimal policies for a variety of objectives such as matching a target velocity or minimizing fuel consumption (Wu et al. 2017, Mousa et al. 2018). RL concepts have also been exploited to identify optimal state-action pairs of driving behavior. Abbas et al. (2011) trained a fuzzy-rule neural network using traffic state-action pairs obtained through a driving simulator. Vlachogiannis et al. (2020) identified optimal personalized driving policies based on a Q-learning framework, aimed at achieving safe driving conditions using driving data collected through smartphones.

RL and DRL are often used in car simulations (Azevedo et al. 2016, Kheterpal et al. 2018). A model for autonomous driving in which the vehicle acts like a human in car-following conditions was attempted by Zhu et al. (2018). Some of the most recent studies exploit deep RL methodologies to control the behavior of autonomous vehicles in simulation environments. Vinitsky et al. (2018) proposed benchmarks consisting of four traffic control scenarios in mixed-autonomy networks. Ferdowsi et al. (2018) developed a robust deep reinforcement learning model to train an autonomous vehicle to take the most appropriate actions in the presence of cyber-physical attacks. Chen et al. (2020) proposed a framework for microscopic modeling of CAVs at signalized intersections. Fu et al. (2020) proposed an attention-based Twin Delayed Deep Deterministic policy gradient method to analyze driving behavior. Recently, cross-modal interactions have also been modeled using RL concepts (Alsaleh and Sayed 2020).

Other Transportation Applications


RL has been applied to various transportation optimization problems, such as vehicle routing (James et al. 2019), optimizing taxi driving strategies (Gao et al. 2018), optimizing city logistics (Tamagawa et al. 2010, Taniguchi et al. 2018, Firdausiyah et al. 2019), and transit operations (Cao et al. 2020). Boyalı et al. (2019) proposed a multi-agent RL framework for optimizing autonomous on-demand vehicles with the aim of reducing the mean waiting time of passengers. Qian et al. (2019) formulated a deep RL problem for optimizing electric vehicle (EV) charging navigation, aiming at minimizing the total travel time and the charging cost at EV charging stations. Shi et al. (2019) developed an RL model for optimizing EV fleet operation for ride-hailing services, with the aim of optimizing customer waiting time, electricity cost, and the operational costs of the vehicles. Yoon and Chow (2020) developed a reinforcement learning framework for the transit route design problem. Cats and West (2020) proposed an RL-based modeling framework to optimize transit path choice as a within-day dynamic process.

References
Arango, M. (2019). Toll Road with Dynamic Congestion Pricing Using Reinforcement Learning.
Reinforcement Learning Zurich Meetup

Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. Cambridge, Massachusetts and London, England, The MIT Press.

El-Tantawy, S., et al. (2013). "Multiagent Reinforcement Learning for Integrated Network of Adaptive
Traffic Signal Controllers (MARLIN-ATSC): Methodology and Large-Scale Application on Downtown
Toronto." IEEE Transactions on Intelligent Transportation Systems 14: 1140-1150.

Fares, A. and W. Gomaa (2014). Multi-Agent Reinforcement Learning Control for Ramp Metering. 23rd
International Conference on Systems Engineering. Las Vegas.

Feygin, S. (2018). Inferring Structural Models of Travel Behavior: An Inverse Reinforcement Learning
Approach. Civil & Environmental Engineering, University of California, Berkeley. PhD.

Fujimoto, S., et al. (2019). Off-Policy Deep Reinforcement Learning without Exploration.
arXiv:1812.02900v3 [cs.LG], arXiv.

Hausknecht, M. and P. Stone (2016). On-Policy vs. Off-Policy Updates for Deep Reinforcement Learning.
Deep Reinforcement Learning: Frontiers and Challenges, IJCAI 2016 Workshop.

Jaques, N., et al. (2019). Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human
Preferences in Dialog. arXiv:1907.00456v2 [cs.LG].

Jaynes, E. T. (1957). "Information Theory and Statistical Mechanics .2." Physical Review 108(2): 171-190.

Ng, A. Y. and S. Russell (2000). Algorithms for Inverse Reinforcement Learning. Seventeenth International
Conference on Machine Learning.

Pandey, V., et al. (2019) Deep Reinforcement Learning Algorithm for Dynamic Pricing of Express Lanes
with Multiple Access Locations.

Piot, B., et al. (2017). "Bridging the Gap Between Imitation Learning and Inverse Reinforcement Learning."
IEEE Transactions on Neural Networks and Learning Systems 28(8): 1814-1826.

Shimosaka, M. et al. (2015) ‘Predicting driving behavior using inverse reinforcement learning with multiple
reward functions towards environmental diversity’, Intelligent Vehicles Symposium (IV), IEEE, (Iv). doi:
10.1109/IVS.2015.7225745.

Wang, H., et al. (2019) Exploration versus exploitation in reinforcement learning: a stochastic control
approach.

Zhang, Y., Xie, Y., Ye, Z. (2007), Development and Evaluation of a Multi-Agent Based Neuro-Fuzzy Arterial
Traffic Signal Control System, Work, vol.473700, 00092.

Chong, L., Abbas, M.(2010), Neuro-fuzzy actor critic reinforcement learning for determination of optimal
timing plans, IEEE Conference on Intelligent Transportation Systems, Proceedings, ITSC, 545–550.

Li, L., Lv, Y., Wang, F. Y.(2016), Traffic signal timing via deep reinforcement learning, IEEE/CAA Journal of
Automatica Sinica, vol.3, 3, 247–254.

Mannion, P., Duggan, J., Howley, E.(2016), An Experimental Review of Reinforcement Learning Algorithms
for Adaptive Traffic Signal Control, Autonomic Road Transport Support Systems, 47–66.

Walraven, E., Spaan, M. T., & Bakker, B. (2016). Traffic flow optimization: A reinforcement learning
approach. Engineering Applications of Artificial Intelligence, 52, 203-212.

Chu, T., Wang, J., Codecà, L., & Li, Z. (2019). Multi-agent deep reinforcement learning for large-scale traffic
signal control. IEEE Transactions on Intelligent Transportation Systems.

Li, S., Wei, C., Yan, X., Ma, L., Chen, D., & Wang, Y. (2020). A Deep Adaptive Traffic Signal Controller With
Long-Term Planning Horizon and Spatial-Temporal State Definition Under Dynamic Traffic
Fluctuations. IEEE Access, 8, 37087-37104.

Aragon-Gómez, R., & Clempner, J. B. (2020). Traffic-signal control reinforcement learning approach for
continuous-time Markov games. Engineering Applications of Artificial Intelligence, 89, 103415.

Aslani, M., Mesgari, M. S., & Wiering, M. (2017). Adaptive traffic signal control with actor-critic methods
in a real-world traffic network with different traffic disruption events. Transportation Research Part C:
Emerging Technologies, 85, 732-752.

Aslani, M., Seipel, S., Mesgari, M. S., & Wiering, M. (2018). Traffic signal optimization through discrete and
continuous reinforcement learning with robustness analysis in downtown Tehran. Advanced Engineering
Informatics, 38, 639-655.

Genders, W., & Razavi, S. (2020). Policy Analysis of Adaptive Traffic Signal Control Using Reinforcement
Learning. Journal of Computing in Civil Engineering, 34(1), 04019046.

Xiao, Q., Li, C., Tang, Y., & Li, L. (2019). Meta-Reinforcement Learning of Machining Parameters for Energy-
Efficient Process Control of Flexible Turning Operations. IEEE Transactions on Automation Science and
Engineering.

Wiering, M. A. (2000) ‘Multi-agent reinforcement learning for traffic light control’, in In Machine Learning:
Proceedings of the Seventeenth International Conference, pp. 1151–1158.

Balaji, P. G., German, X. and Srinivasan, D. (2010) ‘Urban traffic signal control using reinforcement learning
agents’, IET Intelligent Transport Systems, 4(3), p. 177. doi: 10.1049/iet-its.2009.0096

Wei, H., Zheng, G., Yao, H., & Li, Z. (2018, July). Intellilight: A reinforcement learning approach for
intelligent traffic light control. In Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining (pp. 2496-2505)

Elise van der Pol and Frans A Oliehoek. 2016. Coordinated Deep Reinforcement Learners for Traffic Light
Control. NIPS

Tan, K. L., Poddar, S., Sarkar, S., & Sharma, A. (2019, November). Deep Reinforcement Learning for
Adaptive Traffic Signal Control. In ASME 2019 Dynamic Systems and Control Conference. American Society
of Mechanical Engineers Digital Collection.

Desjardins, C. and Chaib-draa, B. (2011) ‘Cooperative Adaptive Cruise Control: A Reinforcement Learning
Approach’, IEEE Transactions on Intelligent Transportation Systems, 12(4), pp. 1248–1260. doi: 10.1109/TITS.2011.2157145.

Chong, L. et al. (2011) ‘A revised reinforcement learning algorithm to model complicated vehicle
continuous actions in traffic’, in IEEE Conference on Intelligent Transportation Systems, Proceedings, ITSC,
pp. 1791–1796. doi: 10.1109/ITSC.2011.6083005

Wu, C., Parvate, K., Kheterpal, N., Dickstein, L., Mehta, A., Vinitsky, E., & Bayen, A. M. (2017, October).
Framework for control and deep reinforcement learning in traffic. In 2017 IEEE 20th International
Conference on Intelligent Transportation Systems (ITSC) (pp. 1-8). IEEE.

Vlachogiannis, D. M., Vlahogianni, E. I., & Golias, J. (2020). A Reinforcement Learning Model for
Personalized Driving Policies Identification. International Journal of Transportation Science and
Technology.

Chen, J., Xue, Z., & Fan, D. (2020). Deep Reinforcement Learning Based Left-Turn Connected and
Automated Vehicle Control at Signalized Intersection in Vehicle-to-Infrastructure Environment.
Information, 11(2), 77.

Vinitsky, E., Kreidieh, A., Le Flem, L., Kheterpal, N., Jang, K., Wu, C., Wu, F., Liaw, R., Liang, E., Bayen, A.
M.(2018), Benchmarks for reinforcement learning in mixed-autonomy traffic, in 2nd Conference on Robot
Learning (CoRL 2018).,

Ferdowsi, A., Challita, U., Saad, W., Mandayam, N. B.(2018), Robust Deep Reinforcement Learning for
Security and Safety in Autonomous Vehicle Systems, IEEE Conference on Intelligent Transportation
Systems, Proceedings, ITSC, vol.2018-Novem, 307–312..

Abbas, M. M., Chong, L., Higgs, B., Medina, A.(2011), An Agent-based Reinforcement Learning Model for
Simulating Driver Heterogeneous Behavior during Safety Critical Events in Traffic, in TRB 2012 Annual
Meeting..

Alsaleh, R., & Sayed, T. (2020). Modeling pedestrian-cyclist interactions in shared space using inverse
reinforcement learning. Transportation Research Part F: Traffic Psychology and Behaviour, 70, 37-57.

Zhu, M., Wang, X., & Wang, Y. (2018). Human-like autonomous car-following model with deep
reinforcement learning. Transportation research part C: emerging technologies, 97, 348-368.

Kheterpal, N., Parvate, K., Wu, C., Kreidieh, A., Vinitsky, E., & Bayen, A. (2018). Flow: Deep reinforcement
learning for control in sumo. EPiC Series in Engineering, 2, 134-151.

Azevedo, T., De Araújo, P. J., Rossetti, R. J., & Rocha, A. P. C. (2016). JADE, TraSMAPI and SUMO: A tool-
chain for simulating traffic light control. arXiv preprint arXiv:1601.08154.

Mousa, S. R., Mousa, R., & Ishak, S. (2018). A Deep-reinforcement Learning Algorithm for Eco-driving
Control at Signalized Intersections with Prioritized Experience Replay, Target Network, and Double
Learning (No. 18-06104).

Fu, X., Di, X., & Mo, Z. (2020). When Do Drivers Concentrate? Attention-based Driver Behavior Modeling
With Deep Reinforcement Learning. arXiv preprint arXiv:2002.11385.

Boyalı, A., Hashimoto, N., John, V., & Acarman, T. (2019, June). Multi-Agent Reinforcement Learning for
Autonomous On Demand Vehicles. In 2019 IEEE Intelligent Vehicles Symposium (IV) (pp. 1461-1468).
IEEE.

Cao, Z., Guo, H., Song, W., Gao, K., Chen, Z., Zhang, L., & Zhang, X. (2020). Using Reinforcement
Learning to Minimize the Probability of Delay Occurrence in Transportation. IEEE Transactions on
Vehicular Technology, 69(3), 2424-2436.

Taniguchi, E., Qureshi, A. G., & Konda, K. (2018). Multi‐agent Simulation with Reinforcement Learning for
Evaluating a Combination of City Logistics Policy Measures. City Logistics 2: Modeling and Planning
Initiatives, 165-177.

Firdausiyah, N., Taniguchi, E., & Qureshi, A. G. (2019). Modeling city logistics using adaptive dynamic
programming based multi-agent simulation. Transportation Research Part E: Logistics and Transportation
Review, 125, 74-96.

Tamagawa, D., Taniguchi, E., & Yamada, T. (2010). Evaluating city logistics measures using a multi-agent
model. Procedia-Social and Behavioral Sciences, 2(3), 6002-6012.

Qian, T., Shao, C., Wang, X., & Shahidehpour, M. (2019). Deep Reinforcement Learning for EV Charging
Navigation by Coordinating Smart Grid and Intelligent Transportation System. IEEE Transactions on Smart
Grid.

Gao, Y., Jiang, D., & Xu, Y. (2018). Optimize taxi driving strategies based on reinforcement
learning. International Journal of Geographical Information Science, 32(8), 1677-1696.

James, J. Q., Yu, W., & Gu, J. (2019). Online vehicle routing with neural combinatorial optimization and
deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 20(10), 3806-3817.

Yoon, G. and Chow, J. (2020) Contextual bandit-based sequential transit route design under demand
uncertainty, March 2020 Transportation Research Record Journal of the Transportation Research Board.

Shi, J., Gao, Y., Wang, W., Yu, N., & Ioannou, P. A. (2019). Operating Electric Vehicle Fleet for Ride-Hailing
Services With Reinforcement Learning. IEEE Transactions on Intelligent Transportation Systems.

Cats, O., & West, J. (2020). Learning and Adaptation in Dynamic Transit Assignment Models for Congested
Networks. Transportation Research Record, 0361198119900138.
