Deep Reinforcement Learning
MIT 6.S191
Alexander Amini
January 30, 2019
AlphaGo video
6.S191 Introduction to Deep Learning
1/30/19
introtodeeplearning.com
Classes of Learning Problems
Supervised Learning
Data: (x, y)
x is data, y is label
Goal: Learn function to map x → y
Apple example: This thing is an apple.
Unsupervised Learning
Data: x
x is data, no labels!
Goal: Learn underlying structure
Apple example: This thing is like the other thing.
Reinforcement Learning
RL: our focus today.
Data: state-action pairs
Goal: Maximize future rewards over many time steps
Apple example: Eat this thing because it will keep you alive.
Reinforcement Learning (RL): Key Concepts
Agent: takes actions.
Environment: the world in which the agent exists and operates.
Action: a move the agent can make in the environment.
Observations: what the agent perceives of the environment after taking actions.
State: a situation which the agent perceives.
Reward: feedback that measures the success or failure of the agent’s action.
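The key concepts above can be sketched as an interaction loop. This is a minimal illustration under stated assumptions, not code from the lecture: `CorridorEnv`, its reward values, and `random_agent` are hypothetical names invented for the example.

```python
import random

class CorridorEnv:
    """Hypothetical environment: positions 0..4 on a line, goal at position 4."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (next_state, reward, done)."""
        self.state = min(4, max(0, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else 0.0  # reward: feedback on the outcome of the action
        return self.state, reward, done

def random_agent(state, rng):
    """Hypothetical agent: ignores the state and picks a random action."""
    return rng.choice([-1, 1])

def run_episode(env, agent, rng, max_steps=1000):
    """Run the agent-environment loop and collect (state, action, reward) tuples."""
    trajectory = []
    state = env.state
    for _ in range(max_steps):
        action = agent(state, rng)                   # agent takes an action
        next_state, reward, done = env.step(action)  # environment returns new state and reward
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory

trajectory = run_episode(CorridorEnv(), random_agent, random.Random(0))
```

Each tuple in `trajectory` is one state-action pair together with the reward that followed it, which is the kind of data RL learns from.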
Total Reward: $R_t = \sum_{i=t}^{\infty} r_i = r_t + r_{t+1} + \dots + r_{t+n} + \dots$
[Diagram: agent-environment loop. The agent sends an action $a_t$ to the environment; the environment returns observations to the agent: the state change $s_{t+1}$ and the reward $r_t$.]
Discounted Total Reward: $R_t = \sum_{i=t}^{\infty} \gamma^{i-t} r_i = r_t + \gamma r_{t+1} + \dots + \gamma^{n} r_{t+n} + \dots$
$\gamma$: discount factor
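As a worked example of the discounted total reward, the sketch below sums a finite list of rewards. It assumes the discount is measured from the current step t, so that $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$ as on the Q-function slide; the function name and the reward values are hypothetical.

```python
def discounted_return(rewards, gamma, t=0):
    """Return R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for a finite reward list."""
    total = 0.0
    for k, r in enumerate(rewards[t:]):
        total += (gamma ** k) * r
    return total

rewards = [0.0, 0.0, 1.0, 2.0]                        # hypothetical rewards r_0 .. r_3
undiscounted = discounted_return(rewards, gamma=1.0)  # plain total reward
discounted = discounted_return(rewards, gamma=0.5)    # later rewards count for less
```

With `gamma=1.0` this reduces to the undiscounted total reward; a smaller `gamma` weights near-term rewards more heavily.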
Defining the Q-function
$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$
Total reward, $R_t$, is the discounted sum of all rewards obtained from time $t$.
$Q(s, a) = \mathbb{E}[R_t]$
The Q-function captures the expected total future reward an agent in state $s$ can receive by executing a certain action $a$.
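One way to make the expectation concrete is a Monte Carlo estimate: run many rollouts that start with a given action and average their discounted returns. The sketch below is only an illustration; the two-step task in `rollout`, the action names, and the value of `GAMMA` are all assumptions made for the example.

```python
import random

GAMMA = 0.9  # assumed discount factor

def rollout(action, rng):
    """Hypothetical two-step episode from a single start state; returns the reward list."""
    if action == "safe":
        return [1.0, 1.0]                        # always 1 now and 1 at the next step
    first = 3.0 if rng.random() < 0.5 else 0.0   # "risky": 3 or 0 with equal chance
    return [first, 1.0]

def discounted_return(rewards, gamma):
    """R_t = r_t + gamma * r_{t+1} + ... for a finite reward list."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def estimate_q(action, rng, num_rollouts=20000):
    """Monte Carlo estimate of Q(s, a) = E[R_t]: average return over many rollouts."""
    total = 0.0
    for _ in range(num_rollouts):
        total += discounted_return(rollout(action, rng), GAMMA)
    return total / num_rollouts

rng = random.Random(0)
q_safe = estimate_q("safe", rng)    # exact value: 1 + 0.9 * 1 = 1.9
q_risky = estimate_q("risky", rng)  # exact value: 1.5 + 0.9 * 1 = 2.4
```

In this toy task the "risky" action has the higher Q-value even though any single rollout of it may return less than the "safe" one.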
How to take actions given a Q-function?
$Q(s, a) = \mathbb{E}[R_t]$, where the arguments are (state, action)
Ultimately, the agent needs a policy $\pi(s)$ to infer the best action to take at its state $s$.
Strategy: the policy should choose an action that maximizes future reward.
$\pi^*(s) = \arg\max_a Q(s, a)$
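The argmax strategy can be sketched with a lookup table standing in for the Q-function. The table contents, state names, and action names below are hypothetical values chosen for illustration.

```python
def greedy_policy(q_values, state):
    """Return argmax_a Q(state, a) from a table of the form {state: {action: value}}."""
    actions = q_values[state]
    return max(actions, key=actions.get)

# Hypothetical Q-table for two states and two actions.
q_table = {
    "ball_left": {"move_left": 0.8, "move_right": 0.1},
    "ball_right": {"move_left": 0.2, "move_right": 0.9},
}
best_left = greedy_policy(q_table, "ball_left")
best_right = greedy_policy(q_table, "ball_right")
```

The policy is fully determined by the Q-values: the same state always yields the same action.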
Deep Reinforcement Learning Algorithms
Value Learning
Find $Q(s, a)$
$a = \arg\max_a Q(s, a)$
Policy Learning
Find $\pi(s)$
Sample $a \sim \pi(s)$
Digging deeper into the Q-function
Example: Atari Breakout
It can be very difficult for humans to accurately estimate Q-values.
[Figure: two Atari Breakout (state, action) pairs, labeled A and B.]
Which (s, a) pair has a higher Q-value?
Example: Atari Breakout - Middle
Example: Atari Breakout - Side
Deep Q Networks (DQN)
How can we use deep neural networks to model Q-functions?
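[Diagram: two network designs. Left: a deep NN takes the state $s$ and an action $a$ (e.g. “move right”) as input and outputs $Q(s, a)$. Right: a deep NN takes only the state $s$ as input and outputs $Q(s, a_1)$, $Q(s, a_2)$, ..., $Q(s, a_n)$, one value per action.]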
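The state-in, all-Q-values-out design can be sketched as a small forward pass. This is an illustration only: the layer sizes, the untrained random weights, and the state features are assumptions, and a real DQN would use a deep learning framework and a much larger network.

```python
import math
import random

def init_layer(num_in, num_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(num_in)] for _ in range(num_out)]
    biases = [0.0] * num_out
    return weights, biases

def dense(x, layer):
    """Compute W x + b for one fully connected layer."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def q_network(state, hidden_layer, output_layer):
    """Map a state vector to one Q-value per action in a single forward pass."""
    hidden = [math.tanh(h) for h in dense(state, hidden_layer)]
    return dense(hidden, output_layer)

rng = random.Random(0)
STATE_DIM, HIDDEN_DIM, NUM_ACTIONS = 4, 8, 3  # assumed sizes
hidden_layer = init_layer(STATE_DIM, HIDDEN_DIM, rng)
output_layer = init_layer(HIDDEN_DIM, NUM_ACTIONS, rng)

state = [0.1, -0.2, 0.0, 0.5]                 # hypothetical state features
q_values = q_network(state, hidden_layer, output_layer)
best_action = max(range(NUM_ACTIONS), key=lambda a: q_values[a])
```

One forward pass yields every action's Q-value, so picking the best action needs a single evaluation rather than one evaluation per action.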
Deep Q Networks (DQN): Training
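$\mathcal{L} = \mathbb{E}\left[ \left\| \left( r + \gamma \max_{a'} Q(s', a') \right) - Q(s, a) \right\|^2 \right]$
target: $r + \gamma \max_{a'} Q(s', a')$
predicted: $Q(s, a)$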
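The loss can be worked through by hand on a small batch. The sketch below assumes the Q-values are already computed and given as plain numbers; the `done` flag that drops the bootstrap term after a terminal step is an assumption added for the example and does not appear on the slide.

```python
def td_target(reward, next_q_values, gamma, done):
    """Target r + gamma * max_a' Q(s', a'); no bootstrap term after a terminal step."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def dqn_loss(batch, gamma):
    """Mean squared difference between the target and the predicted Q(s, a)."""
    total = 0.0
    for predicted_q, reward, next_q_values, done in batch:
        target = td_target(reward, next_q_values, gamma, done)
        total += (target - predicted_q) ** 2
    return total / len(batch)

# Hypothetical batch entries: (Q(s, a), r, [Q(s', a') for each a'], done)
batch = [
    (1.0, 0.0, [0.5, 2.0, 1.0], False),  # target = 0 + 0.9 * 2.0 = 1.8
    (0.5, 1.0, [0.0, 0.0, 0.0], True),   # target = 1.0 (terminal step)
]
loss = dqn_loss(batch, gamma=0.9)        # (0.8**2 + 0.5**2) / 2 = 0.445
```

Here the expectation in the loss is approximated by the mean over the batch.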
DQN Atari Results
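[Figure: DQN results on Atari games, split into games where DQN surpasses human-level performance and games where it is below human-level.]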
Downsides of Q-learning
Complexity:
• Can model scenarios where the action space is discrete and small
• Cannot handle continuous action spaces
IMPORTANT: Imagine you want to predict the steering wheel angle of a car!
Flexibility:
• Cannot learn stochastic policies, since the policy is deterministically computed from the Q-function
To overcome these downsides, consider a new class of RL training algorithms: policy gradient methods
Policy Gradient (PG): Key Idea
DQN (before): Approximating Q and inferring the optimal policy
[Diagram: a deep NN takes the state $s$ as input and outputs $Q(s, a_1)$, $Q(s, a_2)$, ..., $Q(s, a_n)$.]
Policy Gradient: Directly optimize the policy!
[Diagram: a deep NN takes the state $s$ as input and outputs $\pi(a_1 | s)$, $\pi(a_2 | s)$, ..., $\pi(a_n | s)$.]
$\sum_{a_i \in A} \pi(a_i | s) = 1$
$\pi(a | s) = P(\text{action} \mid \text{state})$
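A common way to obtain outputs that form a probability distribution over actions is a softmax over the network's raw outputs; the slides do not name the output layer, so treat the softmax here as an assumption. The logits below are hypothetical values for a single state.

```python
import math
import random

def softmax(logits):
    """Convert raw network outputs into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng):
    """Sample an action index a ~ pi(a | s)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 0.5, -1.0]  # hypothetical network outputs for one state
probs = softmax(logits)
rng = random.Random(0)
actions = [sample_action(probs, rng) for _ in range(1000)]
```

Unlike the argmax over Q-values, sampling from `probs` can return different actions for the same state, which is what makes the policy stochastic.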
Policy Gradient (PG): Training
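function REINFORCE
    Initialize $\theta$
    for episode ~ $\pi_\theta$
        $\{s_i, a_i, r_i\}_{i=1}^{T-1}$ ← episode
        for t = 1 to T-1
            $\nabla$ ← $\nabla_\theta \log \pi_\theta(a_t | s_t) R_t$
            $\theta$ ← $\theta + \alpha \nabla$
    return $\theta$
1. Run a policy for a while
2. Increase probability of actions that lead to high rewards
3. Decrease probability of actions that lead to low/no rewards
In the update $\nabla_\theta \log \pi_\theta(a_t | s_t) R_t$, the term $\log \pi_\theta(a_t | s_t)$ is the log-likelihood of the action and $R_t$ is the reward.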
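The REINFORCE loop can be sketched on a toy task. Everything specific here is an assumption made for illustration: the "go"/"stop" task, the constants, and a softmax policy with two logits that ignores the state, so $\theta$ is just a pair of numbers rather than network weights.

```python
import math
import random

GAMMA = 0.9   # assumed discount factor
ALPHA = 0.1   # assumed learning rate
HORIZON = 3   # assumed maximum episode length

def policy(theta):
    """Softmax over two logits: index 0 = "stop", index 1 = "go"."""
    shift = max(theta)
    exps = [math.exp(z - shift) for z in theta]
    total = sum(exps)
    return [e / total for e in exps]

def sample_episode(theta, rng):
    """Hypothetical task: "go" earns reward 1 and continues, "stop" earns 0 and ends."""
    episode = []
    for state in range(HORIZON):
        probs = policy(theta)
        action = 1 if rng.random() < probs[1] else 0
        reward = 1.0 if action == 1 else 0.0
        episode.append((state, action, reward))
        if action == 0:
            break
    return episode

def reinforce(num_episodes, rng):
    theta = [0.0, 0.0]                             # Initialize theta
    for _ in range(num_episodes):                  # for episode ~ pi_theta
        episode = sample_episode(theta, rng)
        for t, (_, action, _) in enumerate(episode):
            # R_t: discounted sum of rewards from step t to the end of the episode
            ret = sum((GAMMA ** k) * r for k, (_, _, r) in enumerate(episode[t:]))
            probs = policy(theta)
            # Gradient of log softmax with respect to logit k: 1[k == action] - pi(k)
            grad = [(1.0 if k == action else 0.0) - probs[k] for k in range(2)]
            theta = [th + ALPHA * g * ret for th, g in zip(theta, grad)]
    return theta

theta = reinforce(num_episodes=2000, rng=random.Random(0))
prob_go = policy(theta)[1]
```

Because only "go" ever produces a nonzero return in this task, each update raises the probability of "go", which mirrors steps 2 and 3 of the list.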
The Game of Go
Aim: Get more board territory than your opponent.
Board size (n×n)   Positions ($3^{n^2}$)    % Legal   Legal positions
1×1                3                        33.33%    1
2×2                81                       70.37%    57
3×3                19,683                   64.40%    12,675
4×4                43,046,721               56.49%    24,318,165
5×5                847,288,609,443          48.90%    414,295,148,741
9×9                4.434264882×10^38        23.44%    1.03919148791×10^38
13×13              4.300233593×10^80        8.66%     3.72497923077×10^79
19×19              1.740896506×10^172       1.20%     2.08168199382×10^170
Greater number of legal board positions than atoms in the universe.
Source: Wikipedia.
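The first two numeric columns of the table can be checked with a few lines of arithmetic. The sketch below recomputes the position counts as $3^{n^2}$ and the legal percentage for the small boards, taking the legal-position counts from the table as given; the variable and function names are illustrative.

```python
# Legal-position counts for small boards, copied from the table.
legal_positions = {1: 1, 2: 57, 3: 12_675, 4: 24_318_165, 5: 414_295_148_741}

def total_positions(n):
    """Each of the n*n points is empty, black, or white: 3^(n^2) configurations."""
    return 3 ** (n * n)

def percent_legal(n):
    """Share of all configurations that are legal positions, in percent."""
    return 100.0 * legal_positions[n] / total_positions(n)

rows = [(n, total_positions(n), round(percent_legal(n), 2)) for n in sorted(legal_positions)]
digits_19 = len(str(total_positions(19)))  # decimal digits of the 19x19 position count
```

For the 19×19 board the count has 173 decimal digits, consistent with the 1.74×10^172 entry in the table.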
AlphaGo Beats Top Human Player at Go (2016)
1) Initial training: human data
2) Self-play and reinforcement learning → super-human performance
3) “Intuition” about board state
Silver et al., Nature 2016.
AlphaZero: RL from Self-Play (2018)
Silver et al., Science 2018.
Questions?