Add-On DRL CS06
• Supervised learning
– classification, regression
• Unsupervised learning
– clustering
• Reinforcement learning
– more general than supervised/unsupervised learning
– learn from interaction with the environment to achieve a goal
[Figure: agent-environment loop: the agent sends an action to the environment; the environment returns a reward and a new state]
• defining an RL problem
– Markov Decision Processes
• solving an RL problem
– Dynamic Programming
– Monte Carlo methods
– Temporal-Difference learning
– Policy Gradient Methods
Robot in a room
actions: UP, DOWN, LEFT, RIGHT
stochastic transitions: 80% move in the intended direction (e.g. UP), 10% move LEFT, 10% move RIGHT
[Figure: grid world showing the START cell and the +1 and -1 terminal cells]
reward +1 at [4,3], -1 at [4,2]
reward -0.04 for each step
• states
• actions
• rewards
• solution/policy
– mapping from each state to an action
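As a minimal sketch, the room example above can be written down directly as data; the (column, row) state encoding, the variable names, and the omission of any blocked cells are assumptions made only for illustration.

# Illustrative encoding of the room (hypothetical names; coordinates follow
# the [column, row] convention used for [4,3] and [4,2] above).
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4)]   # 4x3 grid
ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]
TERMINAL = {(4, 3): +1.0, (4, 2): -1.0}   # reward +1 at [4,3], -1 at [4,2]
STEP_REWARD = -0.04                       # reward for each other step

def reward(state):
    """r(s): the terminal reward in [4,3] and [4,2], -0.04 elsewhere."""
    return TERMINAL.get(state, STEP_REWARD)

# a solution/policy is a mapping from each non-terminal state to an action
policy = {s: "UP" for s in STATES if s not in TERMINAL}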
Optimal policy
[Figures: optimal policies for the grid world under different per-step rewards (-2, -0.1, -0.04, -0.01, +0.01), with the +1 and -1 terminal cells fixed]
Markov Decision Process (MDP)
• set of states S, set of actions A, initial state S0
• transition model P(s, a, s')
– P( [1,1], up, [1,2] ) = 0.8 (see the sketch below)
• reward function r(s)
– r( [4,3] ) = +1
• goal: maximize cumulative reward in the long run
• reinforcement learning
– transitions and rewards usually not available
– how to change the policy based on experience
– how to explore the environment
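A minimal sketch of the transition model P(s, a, s') for the room, assuming the 80% / 10% / 10% noise from the earlier slide and that moves off the grid leave the robot where it is; the function and variable names are illustrative, not part of the original slides.

# hypothetical helpers for the 4x3 room transition model
MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
DRIFT = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
         "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}

def move(state, action):
    """Apply an action deterministically, staying inside the 4x3 grid."""
    c, r = state
    dc, dr = MOVES[action]
    nc, nr = c + dc, r + dr
    return (nc, nr) if 1 <= nc <= 4 and 1 <= nr <= 3 else state

def P(s, a, s_next):
    """P(s, a, s'): 80% intended direction, 10% drift to either side."""
    prob = 0.0
    for actual, p in ((a, 0.8), (DRIFT[a][0], 0.1), (DRIFT[a][1], 0.1)):
        if move(s, actual) == s_next:
            prob += p
    return prob

print(P((1, 1), "UP", (1, 2)))   # 0.8, matching P( [1,1], up, [1,2] ) = 0.8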
Computing return from rewards
• episodic (vs. continuing) tasks
– “game over” after N steps
– optimal policy depends on N; harder to analyze
• additive rewards
– V(s0, s1, …) = r(s0) + r(s1) + r(s2) + …
– infinite value for continuing tasks
• discounted rewards
– V(s0, s1, …) = r(s0) + γ*r(s1) + γ²*r(s2) + …
– value bounded if rewards bounded
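As a small numeric check of the discounted-return formula, the snippet below sums the discounted rewards of a short, made-up sequence; the value of γ and the rewards are illustrative only.

def discounted_return(rewards, gamma):
    """V(s0, s1, ...) = r(s0) + γ*r(s1) + γ²*r(s2) + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# e.g. three -0.04 steps followed by the +1 terminal reward, with γ = 0.9
print(discounted_return([-0.04, -0.04, -0.04, 1.0], gamma=0.9))   # ≈ 0.62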
Value functions
• state value function: V(s)
– expected return when starting in s and following policy π
• useful for finding the optimal policy
– can estimate from experience
• Bellman equation
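In the notation of these slides (reward r(s), transitions P(s, a, s'), discount γ), one common form of the Bellman equation for a fixed policy π is Vπ(s) = r(s) + γ * Σ_a π(a|s) Σ_s' P(s, a, s') Vπ(s'). The sketch below turns this into an iterative update (policy evaluation); the function signature and the reuse of the gridworld helpers defined earlier are illustrative assumptions, not part of the original slides.

def evaluate_policy(policy, states, P, reward, terminal, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation: sweep the Bellman update until V stabilizes."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in terminal:
                v_new = reward(s)       # terminal states keep their reward
            else:
                a = policy[s]           # deterministic policy: one action per state
                v_new = reward(s) + gamma * sum(P(s, a, s2) * V[s2] for s2 in states)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# e.g., with the earlier sketches: V = evaluate_policy(policy, STATES, P, reward, TERMINAL)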
Optimal value functions
• there’s a set of optimal policies
– V defines a partial ordering on policies
– they share the same optimal value function
• Bellman optimality equation
• Value iteration (sketched below)
Backup diagrams: see the textbook by Richard S. Sutton, Chapter 3: Finite Markov Decision Processes, pages 48 and 61-67.
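A minimal value-iteration sketch for the same world, applying the Bellman optimality update V(s) ← r(s) + γ * max_a Σ_s' P(s, a, s') V(s') and then reading off a greedy policy; it reuses the illustrative gridworld helpers from the earlier sketches and is not taken from the textbook.

def value_iteration(states, actions, P, reward, terminal, gamma=0.9, tol=1e-6):
    """Bellman optimality backups until convergence, then greedy policy extraction."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in terminal:
                v_new = reward(s)
            else:
                v_new = reward(s) + gamma * max(
                    sum(P(s, a, s2) * V[s2] for s2 in states) for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    greedy = {s: max(actions, key=lambda a: sum(P(s, a, s2) * V[s2] for s2 in states))
              for s in states if s not in terminal}
    return V, greedy

# e.g., with the earlier sketches: V, pi = value_iteration(STATES, ACTIONS, P, reward, TERMINAL)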
Generalized Policy Iteration (GPI)
Basic Idea:
• GPI allows policy evaluation and policy improvement to interact in a less rigid manner than traditional policy iteration.
• Unlike strict policy iteration, where evaluation and improvement alternate in distinct phases, GPI allows for more flexible, interleaved interactions between these processes (see the sketch below).
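As a sketch of the GPI idea under the same illustrative gridworld assumptions as above, the loop below interleaves a few (deliberately incomplete) evaluation sweeps with a greedy improvement step, instead of running evaluation to convergence as strict policy iteration would.

def generalized_policy_iteration(states, actions, P, reward, terminal,
                                 gamma=0.9, eval_sweeps=3):
    """Interleave truncated policy evaluation with greedy policy improvement."""
    pi = {s: actions[0] for s in states if s not in terminal}
    V = {s: 0.0 for s in states}
    while True:
        # partial evaluation: only a few Bellman backup sweeps for the current policy
        for _ in range(eval_sweeps):
            for s in states:
                if s in terminal:
                    V[s] = reward(s)
                else:
                    V[s] = reward(s) + gamma * sum(P(s, pi[s], s2) * V[s2] for s2 in states)
        # improvement: act greedily with respect to the current value estimate
        stable = True
        for s in pi:
            best = max(actions, key=lambda a: sum(P(s, a, s2) * V[s2] for s2 in states))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return pi, V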
Outline
• examples
• defining an RL problem
– Markov Decision Processes
• solving an RL problem
– Dynamic Programming
– Monte Carlo methods
– Temporal-Difference learning