
Reinforcement Learning

Philipp Koehn

16 April 2024

Rewards 1

● Agent takes actions

● Agent occasionally receives reward

● Maybe just at the end of the process, e.g., Chess:


– agent has to decide on individual moves
– reward only at end: win/lose

● Maybe more frequently


– Scrabble: points for each word played
– ping pong: any point scored
– baby learning to crawl: any forward movement

Markov Decision Process 2

[Figures: state map; stochastic movement model]

● States s ∈ S, actions a ∈ A

● Model T (s, a, s′) ≡ P (s′∣s, a) = probability that a in s leads to s′

● Reward function R(s) (or R(s, a), R(s, a, s′))

  R(s) = { −0.04 (small penalty)  for nonterminal states
         { ±1                     for terminal states
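
To make the model concrete, the following Python sketch builds such a grid-world MDP (a 4×3 map with a +1 goal, a −1 pit, and one blocked square). The 0.8/0.1/0.1 stochastic movement model and the exact layout are illustrative assumptions, not read off the slide.

WALLS = {(2, 2)}                                  # blocked square
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}          # goal and pit
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) not in WALLS]
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
SIDEWAYS = {"up": ("left", "right"), "down": ("left", "right"),
            "left": ("up", "down"), "right": ("up", "down")}

def R(s):
    """Reward function R(s): +/-1 at terminal states, -0.04 elsewhere."""
    return TERMINALS.get(s, -0.04)

def move(s, a):
    """Deterministic effect of heading in direction a (bumping a wall = stay put)."""
    nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
    return nxt if nxt in STATES else s

def T(s, a):
    """Transition model P(s'|s,a) as a dict s' -> probability."""
    dist = {}
    for prob, direction in [(0.8, a), (0.1, SIDEWAYS[a][0]), (0.1, SIDEWAYS[a][1])]:
        s2 = move(s, direction)
        dist[s2] = dist.get(s2, 0.0) + prob
    return dist

print(T((1, 1), "up"))    # {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}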

Agent Designs 3

● Utility-based agent


– needs model of environment
– learns utility function on states
– selects the action that maximizes expected outcome utility

● Q-learning
– learns action-utility function (Q(s, a) function)
– does not need to model outcomes of actions
– function provides expected utility of taking a given action in a given state

● Reflex agent
– learns policy that maps states to actions

4

passive reinforcement learning

Setup 5

[Figures: state map; stochastic movement; reward function]

         { +1     for goal
  R(s) = { −1     for pit
         { −0.04  for other

Unknown information

● We know which state we are in (= observable environment)


● We know which actions we can take
● But only after taking an action
→ new state becomes known
→ reward becomes known

Passive Reinforcement Learning 6

● Given a policy

● Task: compute utility of policy

● We will extend this later to active reinforcement learning


(⇒ policy needs to be learned)

Sampling 7–16

[Figure sequence, not reproduced: one sampled trial through the grid world under the given policy. Each visited nonterminal state contributes a reward of −0.04 and the trial ends in the +1 terminal state; the resulting samples of reward to go along the path are 0.72, 0.76, 0.80, 0.84, 0.88, 0.92, 0.96, 1.00.]

● Sample of reward to go


Utility of Policy 17

● Definition of utility U^π of the policy π for state s

  U^π(s) = E [ ∑_{t=0}^∞ γ^t R(S_t) ]

● Start at state S0 = s

● Reward for state is R(s)

● Discount factor γ (we use γ = 1 in our examples)

Direct Utility Estimation 18

● Learning from the samples

● Reward to go:
  – (1,1) one sample: 0.72
  – (1,2) two samples: 0.76, 0.84
  – (1,3) two samples: 0.80, 0.88

● Reward to go will converge to the utility of the state

● But very slowly — can we do better?
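
A minimal sketch of direct utility estimation in Python: average the observed reward to go per state over all trials. The hard-coded trial reproduces the path and numbers above (γ = 1); the list-of-(state, reward) trial format is an assumption for illustration.

from collections import defaultdict

def direct_utility_estimate(trials, gamma=1.0):
    totals, counts = defaultdict(float), defaultdict(int)
    for trial in trials:                       # trial = [(state, reward), ...]
        reward_to_go = 0.0
        for state, reward in reversed(trial):  # accumulate rewards from the end backwards
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((1, 2), -0.04),
         ((1, 3), -0.04), ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), +1.0)]
print(direct_utility_estimate([trial]))
# per-state averages (up to float rounding): (1,1) -> 0.72, (1,2) -> 0.80,
# (1,3) -> 0.84, (2,3) -> 0.92, (3,3) -> 0.96, (4,3) -> 1.00

With only this one trial, the estimates for (1,2) and (1,3) are already the averages of their two samples, 0.80 and 0.84.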

Bellman Equation 19

● Direct utility estimation ignores dependency between states

● Given by Bellman equation

  U^π(s) = R(s) + γ ∑_{s′} P(s′ | s, π(s)) U^π(s′)

(γ = discount factor)

● Use of this known dependence can speed up learning

● Requires learning of transition probabilities P (s′∣s, π(s))

Adaptive Dynamic Programming 20

Need to learn:

● State rewards R(s)


– whenever a state is visited, record reward (deterministic)

● Outcome of action π(s) at state s according to policy π


– collect counts count(s, s′) of how often s′ is reached from s
– estimate probability distribution

  P(s′ | s, π(s)) = count(s, s′) / ∑_{s′′} count(s, s′′)

⇒ Ingredients for policy evaluation algorithm
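
A sketch of the resulting passive ADP agent in Python: it records rewards and transition counts, then repeatedly applies the Bellman equation to the learned model. The class layout and the fixed number of evaluation sweeps are illustrative choices (a full implementation would iterate to convergence or use modified policy iteration).

from collections import defaultdict

class PassiveADPAgent:
    def __init__(self, gamma=1.0):
        self.gamma = gamma
        self.R = {}                                          # observed state rewards
        self.counts = defaultdict(lambda: defaultdict(int))  # counts[s][s'] under policy pi
        self.U = defaultdict(float)                          # utility estimates

    def observe(self, s, reward, s_next=None):
        """Record one step taken while following the fixed policy (s_next=None at terminals)."""
        self.R[s] = reward                                   # rewards are deterministic
        if s_next is not None:
            self.counts[s][s_next] += 1

    def P(self, s):
        """Estimated P(s'|s,pi(s)) from the transition counts."""
        total = sum(self.counts[s].values())
        return {s2: n / total for s2, n in self.counts[s].items()}

    def evaluate_policy(self, sweeps=100):
        """Policy evaluation: U(s) = R(s) + gamma * sum_s' P(s'|s,pi(s)) U(s')."""
        for _ in range(sweeps):
            for s in self.R:
                if not self.counts[s]:                       # terminal state: U(s) = R(s)
                    self.U[s] = self.R[s]
                else:
                    self.U[s] = self.R[s] + self.gamma * sum(
                        p * self.U[s2] for s2, p in self.P(s).items())
        return dict(self.U)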

Adaptive Dynamic Programming 21

[Figure/algorithm listing for this slide not reproduced in this extract.]


Learning Curve 22

● Major change at the 78th trial: first time the agent ends in the −1 terminal state at (4,2)

Temporal Difference Learning 23

● Idea: do not model P(s′ | s, π(s)); directly adjust utilities U(s) of visited states

● Estimate of current utility: U^π(s)

● Estimate of utility after action: R(s) + γ U^π(s′)

● Adjust utility of current state U^π(s) if the two differ

  ∆U^π(s) = α ( R(s) + γ U^π(s′) − U^π(s) )

(α = learning rate)

● Learning rate may decrease when state has been visited often
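
A sketch of the corresponding passive TD agent in Python. The decaying learning-rate schedule α(n) = 60/(59 + n) and the explicit terminal flag are illustrative choices, not taken from the slide.

from collections import defaultdict

class PassiveTDAgent:
    def __init__(self, gamma=1.0):
        self.gamma = gamma
        self.U = defaultdict(float)      # utility estimates U(s)
        self.N = defaultdict(int)        # visit counts, used to decay alpha

    def alpha(self, n):
        """Learning rate that decreases as a state is visited more often."""
        return 60.0 / (59.0 + n)

    def update(self, s, reward, s_next, terminal=False):
        """Apply Delta U(s) = alpha * (R(s) + gamma * U(s') - U(s))."""
        self.N[s] += 1
        target = reward if terminal else reward + self.gamma * self.U[s_next]
        self.U[s] += self.alpha(self.N[s]) * (target - self.U[s])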

Learning Curve 24

● Noisier and converges more slowly than ADP

Comparison 25

● Both eventually converge to correct values

● Adaptive dynamic programming (ADP) converges faster than temporal difference learning (TD)

– both make adjustments to make successors agree


– but: ADP adjusts all possible successors, TD only observed successor

● ADP computationally more expensive due to policy evaluation algorithm

26

active reinforcement learning

Active Reinforcement Learning 27

● Previously: passive agent follows prescribed policy

● Now: active agent decides which action to take

– following optimal policy (as currently estimated)


– exploration

● Goal: optimize rewards for a given time frame

Greedy Agent 28

1. Start with initial policy

2. Compute utilities (using ADP)

3. Optimize policy

4. Go to Step 2

● This very seldom converges to the globally optimal policy

Learning Curve 29

● Greedy agent gets stuck in a local optimum

Bandit Problems 31

● Bandit: slot machine

● n-armed bandit: n levers

● Each lever has a different probability distribution over payoffs

● Spend each coin on
– presumed optimal payoff
– exploration (new lever)

● If the levers are independent
– Gittins index: formula for solution
– based on payoff so far / number of times the lever was pulled

Greedy in the Limit of Infinite Exploration 32

● Explore any action in any state an unbounded number of times

● Eventually has to become greedy


– carry out optimal policy
⇒ maximize reward

● Simple strategy
– with probability 1/t take a random action
– initially (t small) focus on exploration
– later (t big) focus on optimal policy
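
A sketch of this simple GLIE action-selection rule in Python; the 1/t schedule follows the slide, while the argument names and passing in the greedy action are illustrative.

import random

def glie_action(t, actions, greedy_action):
    """With probability 1/t explore (random action), otherwise exploit."""
    if random.random() < 1.0 / t:
        return random.choice(list(actions))   # early on (small t): mostly exploration
    return greedy_action                      # later (large t): mostly current optimal policy

print(glie_action(1, ["up", "down", "left", "right"], "up"))     # always random at t = 1
print(glie_action(1000, ["up", "down", "left", "right"], "up"))  # almost always "up"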

Extension of Adaptive Dynamic Programming 33

● Previous definition of utility calculation


  U(s) ← R(s) + γ max_a ∑_{s′} P(s′ | s, a) U(s′)

● New utility calculation

  U+(s) ← R(s) + γ max_a f ( ∑_{s′} P(s′ | s, a) U+(s′), N(s, a) )

● One possible definition of f(u, n)

  f(u, n) = { R+  if n < N_e
            { u   otherwise

R+ is an optimistic estimate: the best possible reward obtainable in any state
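
A direct transcription of this exploration function as Python; R+ = 2 and N_e = 5 are the parameter settings quoted on the next slide.

R_PLUS = 2.0   # optimistic estimate of the best possible reward
N_E = 5        # try each (s, a) at least this often before trusting its estimate

def f(u, n):
    """u = estimated expected utility of the action, n = N(s, a) = times tried."""
    return R_PLUS if n < N_E else u

print(f(0.3, 2))   # 2.0 -> still treated optimistically, keep exploring
print(f(0.3, 7))   # 0.3 -> enough experience, use the learned estimate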

Learning Curve 34

● Performance of exploratory ADP agent


● Parameter settings R+ = 2 and N_e = 5
● Fairly quick convergence to optimal policy

Q-Learning 35

● Learning an action utility function Q(s, a)

● Allows computation of utilities U(s) = max_a Q(s, a)

● Model-free: no explicit transition model P (s′∣s, a)

● Theoretically correct Q values


  Q(s, a) = R(s) + γ ∑_{s′} P(s′ | s, a) max_{a′} Q(s′, a′)

● Update formula inspired by temporal difference learning


  (after taking action a to reach state s′)
  ∆Q(s, a) = α ( R(s, a, s′) + γ max_{a′} Q(s′, a′) − Q(s, a) )

● For our example, Q-learning is slower, but there are successful applications (TD-Gammon)
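
A minimal Q-learning sketch built around the update formula above. The ε-greedy action selection and the fixed learning rate are illustrative assumptions; the slide itself only specifies the update.

import random
from collections import defaultdict

class QLearner:
    def __init__(self, actions, alpha=0.1, gamma=1.0, epsilon=0.1):
        self.Q = defaultdict(float)                    # Q[(s, a)], default 0
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def utility(self, s):
        """U(s) = max_a Q(s, a)."""
        return max(self.Q[(s, a)] for a in self.actions)

    def choose(self, s):
        """Epsilon-greedy action selection (model-free: no P(s'|s,a) needed)."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(s, a)])

    def update(self, s, a, reward, s_next, terminal=False):
        """Delta Q(s,a) = alpha * (R + gamma * max_a' Q(s',a') - Q(s,a))."""
        target = reward if terminal else reward + self.gamma * self.utility(s_next)
        self.Q[(s, a)] += self.alpha * (target - self.Q[(s, a)])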

36

generalization in
reinforcement learning

Large Scale Reinforcement Learning 37

● Adaptive dynamic programming (ADP) scalable to maybe 10,000 states


– Backgammon has 10^20 states
– Chess has 10^40 states

● It is not possible to visit all these states multiple times

⇒ Generalization of states needed

Function Approximation 38

● Define state utility function as linear combination of features

Ûθ (s) = θ1 f1(s) + θ2 f2(s) + ... + θn fn(s)

● Recall: features to assess Chess state


– f1(s) = (number of white pawns) – (number of black pawns)
– f2(s) = (number of white rooks) – (number of black rooks)
– f3(s) = (number of white queens) – (number of black queens)
– f4(s) = king safety
– f5(s) = good pawn position
– etc.

⇒ Reduction from 10^40 states to, say, 20 parameters

● Main benefit: ability to assess unseen states

Learning Feature Weights 39

● Example: two features f1 = x and f2 = y (the state's coordinates)

Ûθ (f1, f2) = θ0 + θ1f1 + θ2f2

● Current feature weights (θ0, θ1, θ2) = (0.5, 0.2, 0.1)

● Model’s prediction of utility of specific state, e.g., Ûθ (1, 1) = 0.8

● A sample set of trials yields the observed value uθ(1, 1) = 0.4

● Error Eθ = ½ (Ûθ(f1, f2) − uθ(f1, f2))²

● How do you update the weights θi?

Gradient Descent Training 40

● Compute gradient of error


  dEθ / dθi = (Ûθ(f1, f2) − uθ(f1, f2)) fi

● Update against gradient


  ∆θi = −µ dEθ / dθi

● Our example
– ∆θ1 = −µ (Ûθ(f1, f2) − uθ(f1, f2)) f1 = −µ (0.8 − 0.4) · 1 = −0.4µ
– ∆θ2 = −µ (Ûθ(f1, f2) − uθ(f1, f2)) f2 = −µ (0.8 − 0.4) · 1 = −0.4µ
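
The worked example as a small Python sketch. The learning rate µ = 0.1 is an arbitrary choice, and the bias weight θ0 is updated with the same rule (its "feature" is the constant 1), which the slide leaves implicit.

def predict(theta, features):
    """U_hat_theta = theta0 + theta1*f1 + theta2*f2 + ..."""
    return theta[0] + sum(t * f for t, f in zip(theta[1:], features))

def gradient_step(theta, features, observed, mu=0.1):
    """One step of Delta theta_i = -mu * (U_hat - u) * f_i (with f_0 = 1)."""
    error = predict(theta, features) - observed        # here approximately 0.8 - 0.4 = 0.4
    grads = [1.0] + list(features)                     # dU_hat / dtheta_i
    return [t - mu * error * g for t, g in zip(theta, grads)]

theta = [0.5, 0.2, 0.1]                                # (theta0, theta1, theta2)
print(predict(theta, [1.0, 1.0]))                      # approx. 0.8
print(gradient_step(theta, [1.0, 1.0], 0.4))           # approx. [0.46, 0.16, 0.06]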

Additional Remarks 41

● If we know something about the problem


⇒ we may want to use more complex features

● Our toy example: utility related to Manhattan distance from goal (xgoal, ygoal)

f3(s) = |x − xgoal| + |y − ygoal|

● Gradient descent training can also be used for temporal difference learning

42

policy search

Policy Search 43

● Idea: directly optimize policy

● Policy may be represented by parameterized Q-functions, hence:

  π(s) = argmax_a Q̂θ(s, a)

● Stochastic policy, e.g., given by softmax function


  πθ(s, a) = (1 / Zs) e^{Q̂θ(s, a)}

● Policy value ρ(θ): expected reward if πθ is carried out
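
A sketch of this softmax policy in Python; the example Q̂ values are made up for illustration.

import math, random

def softmax_policy(q_values):
    """q_values: dict a -> Q_hat_theta(s, a); returns the distribution pi_theta(s, .)."""
    exps = {a: math.exp(q) for a, q in q_values.items()}
    Z = sum(exps.values())                      # normalization constant Z_s
    return {a: e / Z for a, e in exps.items()}

def sample_action(q_values):
    """Draw an action from the stochastic policy."""
    probs = softmax_policy(q_values)
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

q = {"up": 0.8, "down": 0.1, "left": 0.3, "right": 0.5}
print(softmax_policy(q))      # higher Q_hat -> higher probability
print(sample_action(q))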

Hillclimbing 44

● Deterministic policy, deterministic environment


⇒ optimizing policy value ρ(θ) may be done in closed form

● If ρ(θ) differentiable
⇒ gradient ascent by following the policy gradient

● Make small changes to parameters


⇒ hillclimb if ρ(θ) improves

● More complex for stochastic environment

45

examples

Game Playing 46

● Backgammon: TD-Gammon (1992)

● Reward only at end of game

● Training with self-play

● 200,000 training games needed

● Competitive with top human players

● Better positional play, worse end game

Robot Control 47

● Observe position x, velocity ẋ, angle θ, angular velocity θ̇


● Action: jerk left or right
● Reward: time balanced until the pole falls or the cart goes out of bounds
● More complex: multiple stacked poles, helicopter flight, walking

Summary 48

● Building on Markov decision processes and machine learning

● Passive reinforcement learning


(fixed policy, observable environment, stochastic outcomes of actions)
– sampling (carrying out trials)
– adaptive dynamic programming
– temporal difference learning

● Active reinforcement learning


– greedy in the limit of infinite exploration
– following optimal policy vs. exploration
– exploratory adaptive dynamic programming

● Generalization: representing utility function with small set of parameters

● Policy search
