Lecture Notes Deep Reinforcement Learning: Generalizability in Deep RL
When the state space is relatively small, e.g. a 3x3 grid or even a few thousand states, it is easy to visit
each and every state and estimate its Q-value. However, when the state space is large, visiting every state
may not be possible. Also, in many cases, such as training a robotic arm to play table tennis, there can be an
infinite number of state-action pairs. So, the classical approach of estimating the Q-value by visiting each
state-action pair wouldn't work. Using deep learning, instead of finding the Q-value for every state, you learn
the Q-function or the policy using function approximation.
Generalizability in Deep RL
In deep reinforcement learning, you try to approximate the Q-function. You train the model, that is, the
action-value function (or the policy or the value function), on the 'seen' states, and the model should be
able to 'generalize' well to the 'unseen' states. For example, say you are training an autonomous car. The car
will be trained in certain traffic conditions and would be exposed to certain states, but it should be able to
perform in new, unseen states as well. This property of “generalization” is valid for any (well-trained) model in
machine learning.
You will use a neural network to approximate these functions: the state-value function, the action-value
function, the policy, etc. You could use other algorithms as well, such as decision trees or SVMs, but in
practice, the complexity of most RL tasks requires deep neural networks.
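To make this concrete, here is a minimal sketch of such a function approximator in PyTorch. The state size (4), the number of actions (2) and the hidden-layer width are illustrative assumptions, not values fixed by these notes.

```python
import torch
import torch.nn as nn

# A small feed-forward network that approximates the action-value function:
# it maps a state vector to one Q-value per action.
class QNetwork(nn.Module):
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# Q-values for a state the network has never been trained on (an 'unseen' state).
q_net = QNetwork()
unseen_state = torch.rand(1, 4)
print(q_net(unseen_state))        # shape: (1, n_actions)
```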
Deep Q Learning
When you train a machine learning model, such as linear regression, you assume that the data points
are independent and identically distributed (IID). For example, say you are using regression to predict
the price of a house using its area, so you have (area, price) tuples as your training data. You assume
that each data point is independent of the others (the price of one house does not affect that of another)
and that all the data points are identically distributed. An identical distribution means that every time
you draw a data point, whether you draw the first or the 100th data point, the probability of getting a
particular data point is the same.
The samples are not independent in the case of RL because the next state is usually highly dependent
on the previous state and action. For example, let's say an agent is trying to learn to play an Atari game.
The state is represented by the screenshot of the game at any point. As you move from one screenshot
to the next, there will be little change in pixel values. Hence, the next state (screenshot) is dependent on
the previous one.
The other problem is that of identical distribution. We said that an identical distribution means that
every time you draw a data point, whether you draw the first or the 100th data point, the probability of
getting a particular data point is the same.
Now, this is not the case in deep RL - the probability of a (state, action, reward) triplet appearing changes
with 'time' (i.e. with episodes). Say you are creating a training set of 1 million (state, action, reward)
triplets. Towards the start, the agent has not learnt much, so the probability of observing a particular
(s, a, r) triplet is very different from the probability towards the end of the game, when the agent has
learnt much more.
Replay Buffer
Before training the neural network to learn the action-value function, you need a dataset that can
be used for training. However, in reinforcement learning tasks, you generally don't have such a dataset; you
are required to create a dataset of the form (s, a, r, s′). As you know, the update equation for Q-learning is:

Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)]

The Q-value predicted by the model for the pair (s, a) gives Q(s, a). Feeding the next state s′ with all possible
actions gives max_a Q(s′, a), which is used to form the target r + γ max_a Q(s′, a).
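As an illustration, the sketch below computes the predicted and target values for a single (s, a, r, s′) sample with a PyTorch network; the network, the discount factor gamma and all the numbers are assumptions made purely for this example.

```python
import torch
import torch.nn as nn

# Illustrative network: maps a 4-dimensional state to one Q-value per action.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
gamma = 0.99                                   # assumed discount factor

# A single (s, a, r, s') sample with made-up values.
s = torch.rand(1, 4)                           # current state
a = 1                                          # action taken in s
r = 1.0                                        # immediate reward
s_next = torch.rand(1, 4)                      # next state s'

predicted = q_net(s)[0, a]                     # Q(s, a) predicted by the network
with torch.no_grad():
    target = r + gamma * q_net(s_next).max()   # r + gamma * max_a' Q(s', a')
print(predicted.item(), target.item())
```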
When in state s, the agent takes an action a. The environment returns the new state s′ and the reward r.
This sample is stored as <s,a,r,s′> in memory. Again, in the new state s′, the agent performs another
action, gets a reward and a new state from the environment. Actions are chosen according to the
epsilon-greedy policy: with probability ϵ you explore (take a random action), while with probability 1−ϵ you
choose the action corresponding to the highest Q-value. You store these experiences in the memory (the
replay buffer) in the form <s, a, r, s′>.
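A minimal sketch of epsilon-greedy action selection could look as follows; q_net, n_actions and the default value of epsilon are illustrative assumptions.

```python
import random
import torch

def epsilon_greedy(q_net, state, n_actions, epsilon=0.1):
    """With probability epsilon take a random action (exploration); otherwise
    take the action with the highest predicted Q-value (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))   # add a batch dimension
    return int(q_values.argmax(dim=1).item())
```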
These samples, generated by the interaction of the agent with the environment, are collectively known
as experience. You represent an experience as a 4-tuple <state, action, reward, next_state>. The reward
here is the immediate reward you get from the environment. These encountered experiences are stored
in memory, and this memory is called the replay buffer.
A neural network is trained on batches in which the data points are random (IID). But the agent's
experience is generated sequentially, i.e. you go from one state to another in a sequential, dependent
fashion. So, to make IID-like batches from this sequential process, you store the experiences in memory
and draw random samples of batch size from it to train the network.
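A minimal replay buffer along these lines could be sketched as follows; the class name, the capacity and the method names are illustrative choices, not something prescribed by these notes.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores <s, a, r, s'> experiences and returns random mini-batches,
    breaking the sequential correlation between consecutive samples."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped first

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```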
The predicted and the target values improve over time as the Q-values produced by the neural network
get better after each gradient update, which in turn leads to better actions and rewards. For example, let's say
that the predicted Q-value you get from the network is 20 and the target value is 30. After training for a
few time steps, the predicted value may rise to 27, but the target value could also increase, say to 35.
This type of situation arises in the initial stages of training, but after training for a sufficiently long time, the
network will stabilise. The parameters of the neural network will not change significantly, and that's when you
can stop training. The predicted value Q(s, a) will be close to the target value r + γ max_a Q(s′, a), and they
will not change significantly as training continues.
Note: You generate only one experience at every timestep, while training happens on a batch of (say, batch
size = 100) 100 experiences at every timestep. This is possible because it is assumed that you have already
generated at least 100 experiences and stored them in memory. Training starts only after you have your
'first' 100 experiences.
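The sketch below illustrates this note with a placeholder environment: exactly one experience is appended per timestep, and a random batch is sampled only once the buffer holds at least batch_size experiences. The dummy transition and all names are assumptions for illustration.

```python
import random
from collections import deque

buffer = deque(maxlen=100_000)
batch_size = 100

for t in range(1000):
    # Placeholder transition standing in for real agent-environment interaction.
    s = [random.random() for _ in range(4)]
    a = random.randrange(2)
    r = random.random()
    s_next = [random.random() for _ in range(4)]
    buffer.append((s, a, r, s_next))            # exactly one new experience per timestep

    if len(buffer) >= batch_size:               # training begins only after the first 100
        batch = random.sample(buffer, batch_size)
        # ... one gradient update on `batch` would happen here ...
```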
The Q-function is updated at the end of every episode. Thus, after each update, you choose better actions
to generate new experiences. This results in more optimal actions and next states, which further improves
the learning process (i.e. the policy). Both the inputs and the labels change with time, which leads to a
dynamic dataset.
The input to the NN is both the state and the action. To choose max_a Q(s′, a) for a particular state, you have
to input every possible action, calculate its Q-value, and choose the action for which the Q-value is maximum.
Disadvantage:
Suppose the state is represented by a vector of size 4 and, for every state, there are 2 possible actions. To
find max_a Q(s′, a), you have to do a feedforward pass 2 times, once for each action, to find the Q-values, and
then take the action for which the Q-value is the maximum.
The input to the NN is only the state and its output is the Q-value of every possible action. Here, the size of the
input vector will be 4 and the output size will be 2.
Advantage:
The advantage here is that, just by giving state as the input to the NN, you will get Q(s, a) for each action, so
you have to run the NN just once for every state. Take the action for which Q(s, a) is the maximum.
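The contrast between the two designs can be sketched as follows; both networks, their sizes and the two-action assumption are purely illustrative.

```python
import torch
import torch.nn as nn

state = torch.rand(1, 4)          # a state vector of size 4

# Design 1: input is (state, action), output is a single Q-value,
# so finding max_a Q(s, a) needs one forward pass per action.
q_sa = nn.Sequential(nn.Linear(4 + 1, 32), nn.ReLU(), nn.Linear(32, 1))
q_values_1 = [q_sa(torch.cat([state, torch.tensor([[float(a)]])], dim=1)) for a in range(2)]
best_action_1 = int(torch.stack(q_values_1).argmax().item())

# Design 2: input is the state alone, output is one Q-value per action,
# so a single forward pass gives Q(s, a) for every action.
q_s = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
best_action_2 = int(q_s(state).argmax(dim=1).item())
```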
Note: Here, the gradient should flow only through the Q-value of the action you took, and not through all the possible Q-values.
• Use the trained Q network to generate another experience and continue the training.
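One common way to respect the note above, assuming a network that outputs one Q-value per action, is to gather only the Q-values of the actions that were actually taken before computing the loss; the batch contents below are placeholders.

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))

states = torch.rand(5, 4)                      # a mini-batch of 5 states
actions = torch.tensor([0, 1, 1, 0, 1])        # the actions actually taken
targets = torch.rand(5)                        # placeholder target Q-values

all_q = q_net(states)                          # Q-values for every action, shape (5, 2)
taken_q = all_q.gather(1, actions.unsqueeze(1)).squeeze(1)   # only Q(s, a_taken)

# The loss, and hence the gradient, depends only on the taken actions' Q-values.
loss = nn.functional.mse_loss(taken_q, targets)
loss.backward()
```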
Double DQN - A DQN Variation
During training, until an episode ends, only the parameters of the Q-network are updated, just like you do for
a DQN architecture. The parameters of the target network are not updated. If you are constantly shifting the
predicted and target Q-values to update the network, it can become destabilized by falling into feedback loops
between the predicted and target Q-values.
To avoid this problem, you keep the target value fixed for the entire episode. To calculate the target
value (r + γ max_a Q(s′, a)) for each sample, Q(s′, a) is computed using the target network. So, the
target value remains fixed during the entire episode, and you try to improve the predicted value using the 'Q-
Network'. The learning in such a case is more stable because the 'goal post' is not shifting after every
timestep. After the episode ends, the weights of the 'Q-Network' are copied to the 'Target Network', and a
new target value is calculated for the next episode.
There are two deep Q-networks in this architecture. In DQN, you update the (single) Q-network after every
timestep in an episode. In double DQN, you still update the main network (also known as the 'Q-network')
after every timestep, while the target network is updated only when the episode ends.
Apart from the 'Q network', there is another neural net called the 'target network'. Recall that you
calculate Q(s,a) and Q(s′,a) using the Q-network in DQN. However, in double DQN you calculate Q(s,a) using
the 'Q-Network' and Q(s′,a) using the 'Target Network'.
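Putting the two networks together, a minimal sketch under the same illustrative assumptions as the earlier snippets (state size 4, two actions) could look as follows; the weight copy at the end of the episode is the step that moves the 'goal post'.

```python
import copy
import torch
import torch.nn as nn

gamma = 0.99
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
target_net = copy.deepcopy(q_net)              # starts as an exact copy of the Q-network
target_net.requires_grad_(False)               # the target network is never trained directly

def td_target(reward, next_state):
    # Q(s', a) comes from the frozen target network, so the target stays fixed
    # while the Q-network is being updated during the episode.
    with torch.no_grad():
        return reward + gamma * target_net(next_state).max(dim=1).values

# During the episode: update q_net at every timestep against td_target(...).
# After the episode ends, copy the Q-network's weights into the target network:
target_net.load_state_dict(q_net.state_dict())
```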