Add-On DRL CS06

Previous Lectures

• Supervised learning
– classification, regression

• Unsupervised learning
– clustering

• Reinforcement learning
– more general than supervised/unsupervised learning
– learn from interaction w/ environment to achieve a goal

[Diagram: agent-environment interaction loop; the agent sends an action to the environment, which returns a reward and a new state]
Slides from Peter Bodík RAD Lab, UC Berkeley
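The interaction loop in the diagram can be written as a short control loop. A minimal sketch, assuming a hypothetical `env` object with Gym-style `reset()`/`step()` methods and a placeholder random policy (none of these names come from the slides):

```python
import random

def random_policy(state, actions=("UP", "DOWN", "LEFT", "RIGHT")):
    """Placeholder policy: ignores the state and picks a random action."""
    return random.choice(actions)

def run_episode(env, policy=random_policy, max_steps=100):
    """One pass through the agent-environment loop.

    env is assumed to expose reset() -> state and
    step(action) -> (new_state, reward, done); both are illustrative.
    """
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # agent chooses an action
        state, reward, done = env.step(action)   # environment returns reward and new state
        total_reward += reward
        if done:
            break
    return total_reward
```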


Recall
• examples

• defining an RL problem
– Markov Decision Processes

• solving an RL problem
– Dynamic Programming
– Monte Carlo methods
– Temporal-Difference learning
– Policy Gradient Methods
Robot in a room
actions: UP, DOWN, LEFT, RIGHT
[Grid-world figure: 4x3 room with a START cell, a +1 cell at [4,3], and a -1 cell at [4,2]]

• actions are stochastic: 80% move in the chosen direction, 10% to each side (e.g. UP goes up 80% of the time, LEFT 10%, RIGHT 10%); a transition-model sketch follows this slide

• reward +1 at [4,3], -1 at [4,2]

• reward -0.04 for each step

• what's the strategy to achieve max reward?

• what if the actions were deterministic?
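A minimal sketch of this stochastic transition model; the grid boundaries, coordinate convention (column, row), and wall-bump behaviour are assumptions for illustration:

```python
# Hypothetical encoding of the "robot in a room" dynamics.
# The 80/10/10 action noise is from the slide; everything else is assumed.
ACTIONS = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
# Perpendicular "slip" directions for each intended action.
SIDES = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
         "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}

def transition_probs(state, action, cols=4, rows=3):
    """Return {next_state: probability} for P(s, a, s')."""
    def move(s, a):
        dc, dr = ACTIONS[a]
        c, r = s[0] + dc, s[1] + dr
        # bumping into a wall leaves the robot where it is
        return (c, r) if 1 <= c <= cols and 1 <= r <= rows else s

    probs = {}
    for a, p in [(action, 0.8), (SIDES[action][0], 0.1), (SIDES[action][1], 0.1)]:
        s_next = move(state, a)
        probs[s_next] = probs.get(s_next, 0.0) + p
    return probs

# Example: trying to go UP from [1,1]
# print(transition_probs((1, 1), "UP"))  # {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}
```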
Other examples
• pole-balancing
• TD-Gammon [Gerry Tesauro]
• helicopter [Andrew Ng]

• no teacher who would say “good” or “bad”


– is reward “10” good or bad?
– rewards could be delayed

• similar to control theory


– more general, fewer constraints

• explore the environment and learn from experience


– not just blind search, try to be smart about it
Robot in a room
actions: UP, DOWN, LEFT, RIGHT
[Same grid-world figure: START cell, +1 at [4,3], -1 at [4,2]; 80% move in the chosen direction, 10% to each side]
reward +1 at [4,3], -1 at [4,2]
reward -0.04 for each step

• states
• actions
• rewards

• what is the solution?


Is this a solution?
[Figure: a candidate policy drawn as arrows on the grid with the +1 and -1 cells]

• only if actions deterministic


– not in this case (actions are stochastic)

• solution/policy
– mapping from each state to an action
Optimal policy
[Figure: the optimal policy shown as arrows on the grid with the +1 and -1 cells]

The optimal policy changes with the per-step reward:
[Figures: optimal policies on the same grid for step rewards of -2, -0.1, -0.04, -0.01, and +0.01]
Markov Decision Process (MDP)
[Diagram: the agent-environment loop again, with action, reward, and new state]

• set of states S, set of actions A, initial state S0
• transition model P(s,a,s') (sketched in code after this slide)
– P( [1,1], up, [1,2] ) = 0.8
• reward function r(s)
– r( [4,3] ) = +1
• goal: maximize cumulative reward in the long run

• policy: mapping from S to A
– π(s) or π(s,a) (deterministic vs. stochastic)

• reinforcement learning
– transitions and rewards usually not available
– how to change the policy based on experience
– how to explore the environment
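A minimal sketch of the robot-in-a-room MDP written down as plain data structures. It reuses transition_probs() from the earlier sketch; the terminal-state handling, discount factor, and names are assumptions for illustration:

```python
# Hypothetical container for a finite MDP (states, actions, P, r, gamma).
GAMMA = 0.99                       # discount factor (assumed, not from the slide)
COLS, ROWS = 4, 3
STATES = [(c, r) for c in range(1, COLS + 1) for r in range(1, ROWS + 1)]
TERMINAL = {(4, 3), (4, 2)}        # the +1 and -1 cells from the slide

def reward(state):
    """r(s) as on the slide: +1 at [4,3], -1 at [4,2], -0.04 per step elsewhere."""
    if state == (4, 3):
        return 1.0
    if state == (4, 2):
        return -1.0
    return -0.04

def P(s, a, s_next):
    """Transition model P(s, a, s'); terminal states are treated as absorbing."""
    if s in TERMINAL:
        return 1.0 if s_next == s else 0.0
    return transition_probs(s, a).get(s_next, 0.0)

# Example matching the slide: P([1,1], up, [1,2]) = 0.8
# print(P((1, 1), "UP", (1, 2)))
```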
Computing return from rewards
• episodic (vs. continuing) tasks
– “game over” after N steps
– optimal policy depends on N; harder to analyze

• additive rewards
– V(s0, s1, …) = r(s0) + r(s1) + r(s2) + …
– infinite value for continuing tasks

• discounted rewards
– V(s0, s1, …) = r(s0) + γ·r(s1) + γ²·r(s2) + …
– value bounded if rewards bounded
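A small sketch of both return definitions from this slide; the reward sequence in the usage comment is a made-up example:

```python
def additive_return(rewards):
    """Undiscounted return: r(s0) + r(s1) + ... (may diverge for continuing tasks)."""
    return sum(rewards)

def discounted_return(rewards, gamma=0.99):
    """Discounted return: r(s0) + gamma*r(s1) + gamma^2*r(s2) + ...

    Bounded whenever the rewards are bounded and 0 <= gamma < 1.
    """
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example with an assumed reward sequence:
# discounted_return([-0.04, -0.04, -0.04, 1.0], gamma=0.9)
```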
Value functions
[Backup diagram: s, a, r, s']

• state value function: Vπ(s)
– expected return when starting in s and following π

• state-action value function: Qπ(s,a)
– expected return when starting in s, performing a, and then following π

• useful for finding the optimal policy
– can estimate from experience
– pick the best action using Qπ(s,a)

• Bellman equation (written out below)
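The Bellman equations referenced on the slide, reconstructed in standard form using the same r(s) and P(s,a,s') convention as the MDP slide (these are textbook expectation equations for a deterministic policy π, not text recovered from the slide image):

```latex
% Bellman expectation equations for a policy \pi
V^{\pi}(s) = r(s) + \gamma \sum_{s'} P(s, \pi(s), s') \, V^{\pi}(s')

Q^{\pi}(s, a) = r(s) + \gamma \sum_{s'} P(s, a, s') \, V^{\pi}(s')
```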
Optimal value functions
[Backup diagrams: see the textbook by Richard S. Sutton, Chapter 3: Finite Markov Decision Processes, page 48]

• there's a set of optimal policies
– Vπ defines a partial ordering on policies
– they share the same optimal value function

• Bellman optimality equation (written out below)
– system of n non-linear equations
– solve for V*(s)
– easy to extract the optimal policy

• having Q*(s,a) makes it even simpler
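The Bellman optimality equations referenced on the slide, again reconstructed from standard form in the same convention (not recovered from the slide image):

```latex
% Bellman optimality equations
V^{*}(s) = r(s) + \gamma \max_{a} \sum_{s'} P(s, a, s') \, V^{*}(s')

Q^{*}(s, a) = r(s) + \gamma \sum_{s'} P(s, a, s') \, \max_{a'} Q^{*}(s', a')

% Extracting the optimal policy
\pi^{*}(s) = \arg\max_{a} \sum_{s'} P(s, a, s') \, V^{*}(s') = \arg\max_{a} Q^{*}(s, a)
```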


Dynamic programming
• main idea
– use value functions to structure the search for good policies
– need a perfect model of the environment

• two main components


– policy evaluation: compute Vπ from π
– policy improvement: improve π based on Vπ

– start with an arbitrary policy


– repeat evaluation/improvement until convergence
Policy evaluation/improvement
• policy evaluation: π -> Vπ
– the Bellman equations define a system of n equations in n unknowns
– could solve it directly, but we will use the iterative version
– start with an arbitrary value function V0, iterate until Vk converges

• policy improvement: Vπ -> π'
– π' is either strictly better than π, or π' is optimal (if π = π')
(both steps are sketched in code below)
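A compact sketch of iterative policy evaluation and greedy policy improvement for the assumed grid MDP above; it relies on the STATES/ACTIONS/P/reward/GAMMA helpers defined in the earlier sketches, and the convergence threshold is arbitrary:

```python
def policy_evaluation(policy, V=None, theta=1e-6, gamma=GAMMA):
    """Iterative policy evaluation: pi -> V_pi.

    policy maps state -> action; V is an optional starting value function.
    """
    V = dict(V) if V else {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v_new = reward(s)
            if s not in TERMINAL:   # terminal cells keep their reward as value
                v_new += gamma * sum(P(s, policy[s], s2) * V[s2] for s2 in STATES)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:           # stop once a full sweep barely changes V
            return V

def policy_improvement(V, gamma=GAMMA):
    """Greedy policy improvement: V_pi -> pi'."""
    return {s: max(ACTIONS, key=lambda a: sum(P(s, a, s2) * V[s2] for s2 in STATES))
            for s in STATES}
```

Policy iteration then just alternates the two functions until the policy stops changing.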
Policy/Value iteration
• Policy iteration
– two nested iterations; too slow
– evaluation doesn't need to run until Vπk converges
• just move towards it

• Value iteration
– use the Bellman optimality equation as an update
– converges to V* (see the sketch below)
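A minimal value iteration sketch for the same assumed grid MDP, turning the Bellman optimality equation into an update; it reuses the helpers from the earlier sketches:

```python
def value_iteration(theta=1e-6, gamma=GAMMA):
    """Value iteration: repeatedly apply the Bellman optimality backup.

    Returns an approximation of V* and the greedy policy with respect to it.
    """
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINAL:
                v_new = reward(s)
            else:
                v_new = reward(s) + gamma * max(
                    sum(P(s, a, s2) * V[s2] for s2 in STATES) for a in ACTIONS)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    return V, policy_improvement(V, gamma)   # greedy policy w.r.t. the final V
```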
Using DP
• need complete model of the environment and rewards
– robot in a room
• state space, action space, transition model

• can we use DP to solve


– robot in a room?
– backgammon?
– helicopter?
Self Study Example

Iterative policy evaluation:
See the textbook by Richard S. Sutton, Chapter 4: Dynamic Programming, Exercises 4.1 and 4.2, pages 61-67.
Generalized Policy Iteration (GPI)
Basic Idea:

• GPI allows policy evaluation and policy improvement to interact in a less rigid manner than traditional policy iteration.
• Unlike strict policy iteration, where evaluation and improvement alternate in distinct, fully converged phases, GPI lets the two processes interleave at whatever granularity is convenient (one possible interleaving is sketched below).
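A small sketch of one possible GPI schedule, assuming the helpers from the earlier sketches: a fixed number of (non-converged) evaluation sweeps alternating with a greedy improvement step:

```python
def generalized_policy_iteration(sweeps_per_improvement=1, iterations=100, gamma=GAMMA):
    """One possible GPI schedule: partial evaluation interleaved with improvement.

    With sweeps_per_improvement=1 this behaves much like value iteration;
    letting evaluation run to convergence recovers classic policy iteration.
    """
    V = {s: 0.0 for s in STATES}
    policy = {s: "UP" for s in STATES}           # arbitrary initial policy
    for _ in range(iterations):
        for _ in range(sweeps_per_improvement):  # partial policy evaluation
            for s in STATES:
                v = reward(s)
                if s not in TERMINAL:
                    v += gamma * sum(P(s, policy[s], s2) * V[s2] for s2 in STATES)
                V[s] = v
        policy = policy_improvement(V, gamma)    # greedy policy improvement
    return V, policy
```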
Outline
• examples

• defining an RL problem
– Markov Decision Processes

• solving an RL problem
– Dynamic Programming
– Monte Carlo methods
– Temporal-Difference learning
