
Reinforcement Learning

Philipp Koehn

16 April 2024

Rewards 1

● Agent takes actions

● Agent occasionally receives reward

● Maybe just at the end of the process, e.g., Chess:


– agent has to decide on individual moves
– reward only at end: win/lose

● Maybe more frequently


– Scrabble: points for each word played
– ping pong: any point scored
– baby learning to crawl: any forward movement

Markov Decision Process 2

[Figures: state map; stochastic movement model]

● States s ∈ S, actions a ∈ A

● Model T (s, a, s′) ≡ P (s′∣s, a) = probability that a in s leads to s′

● Reward function R(s) (or R(s, a), R(s, a, s′))

  R(s) = { −0.04 (small penalty)  for nonterminal states
         { ±1                     for terminal states
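
To make the model concrete, the following Python sketch builds such a grid-world MDP (a 4×3 map with a +1 goal, a −1 pit, and one blocked square). The 0.8/0.1/0.1 stochastic movement model and the exact layout are illustrative assumptions, not read off the slide.

WALLS = {(2, 2)}                                  # blocked square
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}          # goal and pit
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) not in WALLS]
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
SIDEWAYS = {"up": ("left", "right"), "down": ("left", "right"),
            "left": ("up", "down"), "right": ("up", "down")}

def R(s):
    """Reward function R(s): +/-1 at terminal states, -0.04 elsewhere."""
    return TERMINALS.get(s, -0.04)

def move(s, a):
    """Deterministic effect of heading in direction a (bumping a wall = stay put)."""
    nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
    return nxt if nxt in STATES else s

def T(s, a):
    """Transition model P(s'|s,a) as a dict s' -> probability."""
    dist = {}
    for prob, direction in [(0.8, a), (0.1, SIDEWAYS[a][0]), (0.1, SIDEWAYS[a][1])]:
        s2 = move(s, direction)
        dist[s2] = dist.get(s2, 0.0) + prob
    return dist

print(T((1, 1), "up"))    # {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}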

Agent Designs 3

● Utility-based agent


– needs model of environment
– learns utility function on states
– selects the action that maximizes expected outcome utility

● Q-learning
– learns action-utility function (Q(s, a) function)
– does not need to model outcomes of actions
– function provides expected utility of taking a given action in a given state

● Reflex agent
– learns policy that maps states to actions

4

passive reinforcement learning

Setup 5

[Figures: state map; stochastic movement; reward function]

         { +1     for goal
  R(s) = { −1     for pit
         { −0.04  for other

Unknown information

● We know which state we are in (= observable environment)


● We know which actions we can take
● But only after taking an action
→ new state becomes known
→ reward becomes known

Passive Reinforcement Learning 6

● Given a policy

● Task: compute utility of policy

● We will extend this later to active reinforcement learning


(⇒ policy needs to be learned)

Sampling 7–16

[Figure sequence, not reproduced: one sampled trial through the grid world under the given policy. Each visited nonterminal state contributes a reward of −0.04 and the trial ends in the +1 terminal state; the resulting samples of reward to go along the path are 0.72, 0.76, 0.80, 0.84, 0.88, 0.92, 0.96, 1.00.]

● Sample of reward to go


Utility of Policy 17

● Definition of utility U^π of the policy π for state s

  U^π(s) = E [ ∑_{t=0}^∞ γ^t R(S_t) ]

● Start at state S0 = s

● Reward for state is R(s)

● Discount factor γ (we use γ = 1 in our examples)

Direct Utility Estimation 18

● Learning from the samples

● Reward to go:
  – (1,1) one sample: 0.72
  – (1,2) two samples: 0.76, 0.84
  – (1,3) two samples: 0.80, 0.88

● Reward to go will converge to the utility of the state

● But very slowly — can we do better?
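
A minimal sketch of direct utility estimation in Python: average the observed reward to go per state over all trials. The hard-coded trial reproduces the path and numbers above (γ = 1); the list-of-(state, reward) trial format is an assumption for illustration.

from collections import defaultdict

def direct_utility_estimate(trials, gamma=1.0):
    totals, counts = defaultdict(float), defaultdict(int)
    for trial in trials:                       # trial = [(state, reward), ...]
        reward_to_go = 0.0
        for state, reward in reversed(trial):  # accumulate rewards from the end backwards
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((1, 2), -0.04),
         ((1, 3), -0.04), ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), +1.0)]
print(direct_utility_estimate([trial]))
# per-state averages (up to float rounding): (1,1) -> 0.72, (1,2) -> 0.80,
# (1,3) -> 0.84, (2,3) -> 0.92, (3,3) -> 0.96, (4,3) -> 1.00

With only this one trial, the estimates for (1,2) and (1,3) are already the averages of their two samples, 0.80 and 0.84.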

Bellman Equation 19

● Direct utility estimation ignores dependency between states

● Given by Bellman equation

  U^π(s) = R(s) + γ ∑_{s′} P(s′ | s, π(s)) U^π(s′)

(γ = discount factor)

● Use of this known dependence can speed up learning

● Requires learning of transition probabilities P (s′∣s, π(s))

Adaptive Dynamic Programming 20

Need to learn:

● State rewards R(s)


– whenever a state is visited, record reward (deterministic)

● Outcome of action π(s) at state s according to policy π


– collect counts count(s, s′) of how often s′ is reached from s
– estimate probability distribution

  P(s′ | s, π(s)) = count(s, s′) / ∑_{s′′} count(s, s′′)

⇒ Ingredients for policy evaluation algorithm
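
A sketch of the resulting passive ADP agent in Python: it records rewards and transition counts, then repeatedly applies the Bellman equation to the learned model. The class layout and the fixed number of evaluation sweeps are illustrative choices (a full implementation would iterate to convergence or use modified policy iteration).

from collections import defaultdict

class PassiveADPAgent:
    def __init__(self, gamma=1.0):
        self.gamma = gamma
        self.R = {}                                          # observed state rewards
        self.counts = defaultdict(lambda: defaultdict(int))  # counts[s][s'] under policy pi
        self.U = defaultdict(float)                          # utility estimates

    def observe(self, s, reward, s_next=None):
        """Record one step taken while following the fixed policy (s_next=None at terminals)."""
        self.R[s] = reward                                   # rewards are deterministic
        if s_next is not None:
            self.counts[s][s_next] += 1

    def P(self, s):
        """Estimated P(s'|s,pi(s)) from the transition counts."""
        total = sum(self.counts[s].values())
        return {s2: n / total for s2, n in self.counts[s].items()}

    def evaluate_policy(self, sweeps=100):
        """Policy evaluation: U(s) = R(s) + gamma * sum_s' P(s'|s,pi(s)) U(s')."""
        for _ in range(sweeps):
            for s in self.R:
                if not self.counts[s]:                       # terminal state: U(s) = R(s)
                    self.U[s] = self.R[s]
                else:
                    self.U[s] = self.R[s] + self.gamma * sum(
                        p * self.U[s2] for s2, p in self.P(s).items())
        return dict(self.U)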

Adaptive Dynamic Programming 21

[Figure/algorithm listing for this slide not reproduced in this extract.]


Learning Curve 22

● Major change at the 78th trial: first time the agent ends in the −1 terminal state at (4,2)

Temporal Difference Learning 23

● Idea: do not model P(s′ | s, π(s)); directly adjust utilities U(s) of visited states

● Estimate of current utility: U^π(s)

● Estimate of utility after action: R(s) + γ U^π(s′)

● Adjust utility of current state U^π(s) if the two differ

  ∆U^π(s) = α ( R(s) + γ U^π(s′) − U^π(s) )

(α = learning rate)

● Learning rate may decrease when state has been visited often
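
A sketch of the corresponding passive TD agent in Python. The decaying learning-rate schedule α(n) = 60/(59 + n) and the explicit terminal flag are illustrative choices, not taken from the slide.

from collections import defaultdict

class PassiveTDAgent:
    def __init__(self, gamma=1.0):
        self.gamma = gamma
        self.U = defaultdict(float)      # utility estimates U(s)
        self.N = defaultdict(int)        # visit counts, used to decay alpha

    def alpha(self, n):
        """Learning rate that decreases as a state is visited more often."""
        return 60.0 / (59.0 + n)

    def update(self, s, reward, s_next, terminal=False):
        """Apply Delta U(s) = alpha * (R(s) + gamma * U(s') - U(s))."""
        self.N[s] += 1
        target = reward if terminal else reward + self.gamma * self.U[s_next]
        self.U[s] += self.alpha(self.N[s]) * (target - self.U[s])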

Learning Curve 24

● Noisier and converges more slowly than ADP

Comparison 25

● Both eventually converge to correct values

● Adaptive dynamic programming (ADP) converges faster than temporal difference learning (TD)

– both make adjustments to make successors agree


– but: ADP adjusts all possible successors, TD only observed successor

● ADP computationally more expensive due to policy evaluation algorithm

26

active reinforcement learning

Active Reinforcement Learning 27

● Previously: passive agent follows prescribed policy

● Now: active agent decides which action to take

– following optimal policy (as currently estimated)


– exploration

● Goal: optimize rewards for a given time frame

Greedy Agent 28

1. Start with initial policy

2. Compute utilities (using ADP)

3. Optimize policy

4. Go to Step 2

● This very seldom converges to the globally optimal policy

Learning Curve 29

● Greedy agent gets stuck in a local optimum

Bandit Problems 31

● Bandit: slot machine

● n-armed bandit: n levers

● Each lever has a different probability distribution over payoffs

● Spend each coin on
– presumed optimal payoff
– exploration (new lever)

● If the levers are independent
– Gittins index: formula for solution
– based on payoff so far / number of times the lever was pulled

Greedy in the Limit of Infinite Exploration 32

● Explore any action in any state an unbounded number of times

● Eventually has to become greedy


– carry out optimal policy
⇒ maximize reward

● Simple strategy
– with probability 1/t take a random action
– initially (t small) focus on exploration
– later (t big) focus on optimal policy
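
A sketch of this simple GLIE action-selection rule in Python; the 1/t schedule follows the slide, while the argument names and passing in the greedy action are illustrative.

import random

def glie_action(t, actions, greedy_action):
    """With probability 1/t explore (random action), otherwise exploit."""
    if random.random() < 1.0 / t:
        return random.choice(list(actions))   # early on (small t): mostly exploration
    return greedy_action                      # later (large t): mostly current optimal policy

print(glie_action(1, ["up", "down", "left", "right"], "up"))     # always random at t = 1
print(glie_action(1000, ["up", "down", "left", "right"], "up"))  # almost always "up"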

Extension of Adaptive Dynamic Programming 33

● Previous definition of utility calculation


  U(s) ← R(s) + γ max_a ∑_{s′} P(s′ | s, a) U(s′)

● New utility calculation

  U+(s) ← R(s) + γ max_a f ( ∑_{s′} P(s′ | s, a) U+(s′), N(s, a) )

● One possible definition of f(u, n)

  f(u, n) = { R+  if n < N_e
            { u   otherwise

R+ is an optimistic estimate: the best possible reward obtainable in any state
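
A direct transcription of this exploration function as Python; R+ = 2 and N_e = 5 are the parameter settings quoted on the next slide.

R_PLUS = 2.0   # optimistic estimate of the best possible reward
N_E = 5        # try each (s, a) at least this often before trusting its estimate

def f(u, n):
    """u = estimated expected utility of the action, n = N(s, a) = times tried."""
    return R_PLUS if n < N_E else u

print(f(0.3, 2))   # 2.0 -> still treated optimistically, keep exploring
print(f(0.3, 7))   # 0.3 -> enough experience, use the learned estimate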

Learning Curve 34

● Performance of exploratory ADP agent


● Parameter settings R+ = 2 and N_e = 5
● Fairly quick convergence to optimal policy

Q-Learning 35

● Learning an action utility function Q(s, a)

● Allows computation of utilities U(s) = max_a Q(s, a)

● Model-free: no explicit transition model P (s′∣s, a)

● Theoretically correct Q values


  Q(s, a) = R(s) + γ ∑_{s′} P(s′ | s, a) max_{a′} Q(s′, a′)

● Update formula inspired by temporal difference learning


  (after taking action a to reach state s′)
  ∆Q(s, a) = α ( R(s, a, s′) + γ max_{a′} Q(s′, a′) − Q(s, a) )

● For our example, Q-learning is slower, but there are successful applications (TD-Gammon)
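
A minimal Q-learning sketch built around the update formula above. The ε-greedy action selection and the fixed learning rate are illustrative assumptions; the slide itself only specifies the update.

import random
from collections import defaultdict

class QLearner:
    def __init__(self, actions, alpha=0.1, gamma=1.0, epsilon=0.1):
        self.Q = defaultdict(float)                    # Q[(s, a)], default 0
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def utility(self, s):
        """U(s) = max_a Q(s, a)."""
        return max(self.Q[(s, a)] for a in self.actions)

    def choose(self, s):
        """Epsilon-greedy action selection (model-free: no P(s'|s,a) needed)."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(s, a)])

    def update(self, s, a, reward, s_next, terminal=False):
        """Delta Q(s,a) = alpha * (R + gamma * max_a' Q(s',a') - Q(s,a))."""
        target = reward if terminal else reward + self.gamma * self.utility(s_next)
        self.Q[(s, a)] += self.alpha * (target - self.Q[(s, a)])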

36

generalization in
reinforcement learning

Large Scale Reinforcement Learning 37

● Adaptive dynamic programming (ADP) scalable to maybe 10,000 states


– Backgammon has 10^20 states
– Chess has 10^40 states

● It is not possible to visit all these states multiple times

⇒ Generalization of states needed

Function Approximation 38

● Define state utility function as linear combination of features

Ûθ (s) = θ1 f1(s) + θ2 f2(s) + ... + θn fn(s)

● Recall: features to assess Chess state


– f1(s) = (number of white pawns) – (number of black pawns)
– f2(s) = (number of white rooks) – (number of black rooks)
– f3(s) = (number of white queens) – (number of black queens)
– f4(s) = king safety
– f5(s) = good pawn position
– etc.

⇒ Reduction from 10^40 states to, say, 20 parameters

● Main benefit: ability to assess unseen states

Learning Feature Weights 39

● Example: two features f1 = x and f2 = y (the state's coordinates)

Ûθ (f1, f2) = θ0 + θ1f1 + θ2f2

● Current feature weights (θ0, θ1, θ2) = (0.5, 0.2, 0.1)

● Model’s prediction of utility of specific state, e.g., Ûθ (1, 1) = 0.8

● A sample set of trials yields the observed value uθ(1, 1) = 0.4

● Error Eθ = ½ (Ûθ(f1, f2) − uθ(f1, f2))²

● How do you update the weights θi?

Gradient Descent Training 40

● Compute gradient of error


  dEθ / dθi = (Ûθ(f1, f2) − uθ(f1, f2)) fi

● Update against gradient


  ∆θi = −µ dEθ / dθi

● Our example
– ∆θ1 = −µ (Ûθ(f1, f2) − uθ(f1, f2)) f1 = −µ (0.8 − 0.4) · 1 = −0.4µ
– ∆θ2 = −µ (Ûθ(f1, f2) − uθ(f1, f2)) f2 = −µ (0.8 − 0.4) · 1 = −0.4µ
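
The worked example as a small Python sketch. The learning rate µ = 0.1 is an arbitrary choice, and the bias weight θ0 is updated with the same rule (its "feature" is the constant 1), which the slide leaves implicit.

def predict(theta, features):
    """U_hat_theta = theta0 + theta1*f1 + theta2*f2 + ..."""
    return theta[0] + sum(t * f for t, f in zip(theta[1:], features))

def gradient_step(theta, features, observed, mu=0.1):
    """One step of Delta theta_i = -mu * (U_hat - u) * f_i (with f_0 = 1)."""
    error = predict(theta, features) - observed        # here approximately 0.8 - 0.4 = 0.4
    grads = [1.0] + list(features)                     # dU_hat / dtheta_i
    return [t - mu * error * g for t, g in zip(theta, grads)]

theta = [0.5, 0.2, 0.1]                                # (theta0, theta1, theta2)
print(predict(theta, [1.0, 1.0]))                      # approx. 0.8
print(gradient_step(theta, [1.0, 1.0], 0.4))           # approx. [0.46, 0.16, 0.06]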

Additional Remarks 41

● If we know something about the problem


⇒ we may want to use more complex features

● Our toy example: utility related to Manhattan distance from goal (xgoal, ygoal)

f3(s) = |x − xgoal| + |y − ygoal|

● Gradient descent training can also be used for temporal difference learning

42

policy search

Policy Search 43

● Idea: directly optimize policy

● Policy may be represented by parameterized Q-functions, hence:

  π(s) = argmax_a Q̂θ(s, a)

● Stochastic policy, e.g., given by softmax function


  πθ(s, a) = (1 / Zs) e^{Q̂θ(s, a)}

● Policy value ρ(θ): expected reward if πθ is carried out
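
A sketch of this softmax policy in Python; the example Q̂ values are made up for illustration.

import math, random

def softmax_policy(q_values):
    """q_values: dict a -> Q_hat_theta(s, a); returns the distribution pi_theta(s, .)."""
    exps = {a: math.exp(q) for a, q in q_values.items()}
    Z = sum(exps.values())                      # normalization constant Z_s
    return {a: e / Z for a, e in exps.items()}

def sample_action(q_values):
    """Draw an action from the stochastic policy."""
    probs = softmax_policy(q_values)
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

q = {"up": 0.8, "down": 0.1, "left": 0.3, "right": 0.5}
print(softmax_policy(q))      # higher Q_hat -> higher probability
print(sample_action(q))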

Hillclimbing 44

● Deterministic policy, deterministic environment


⇒ optimizing policy value ρ(θ) may be done in closed form

● If ρ(θ) differentiable
⇒ gradient ascent by following the policy gradient

● Make small changes to parameters


⇒ hillclimb if ρ(θ) improves

● More complex for stochastic environment

45

examples

Game Playing 46

● Backgammon: TD-Gammon (1992)

● Reward only at end of game

● Training with self-play

● 200,000 training games needed

● Competitive with top human players

● Better positional play, worse end game

Robot Control 47

● Observe position x, velocity ẋ, angle θ, angular velocity θ̇


● Action: jerk left or right
● Reward: time balanced until the pole falls or the cart goes out of bounds
● More complex: multiple stacked poles, helicopter flight, walking

Summary 48

● Building on Markov decision processes and machine learning

● Passive reinforcement learning


(fixed policy, observable environment, stochastic outcomes of actions)
– sampling (carrying out trials)
– adaptive dynamic programming
– temporal difference learning

● Active reinforcement learning


– greedy in the limit of infinite exploration
– following optimal policy vs. exploration
– exploratory adaptive dynamic programming

● Generalization: representing utility function with small set of parameters

● Policy search
