CMPSC 448: Machine Learning
Lecture 16. Reinforcement Learning and Bandits
Rui Zhang
Fall 2024
What types of ML are there?
Outline
● Introduction to Reinforcement learning
● Multi-armed Bandits
● Markov Decision Processes (MDP)
○ Dynamic Programming when we know the world
● Learning in MDP: When we don't know the world
○ Monte Carlo Methods
○ Temporal-Difference Learning (TD): SARSA and Q-Learning
Note: All of the methods in these lectures are tabular; we will only briefly discuss the
motivation for function approximation methods (e.g., DQN, policy gradient, deep
reinforcement learning)
What is reinforcement learning?
How to build agents that learn behaviors in a dynamic world?
● Agent-oriented learning
● learning by interacting with an environment to achieve a goal
● more natural, realistic, and ambitious than other kinds of machine learning
RL is a general-purpose framework for decision-making
● RL is for an agent with the capacity to act
● Each action influences the agent’s future state
● Success is measured by a scalar reward signal
● Goal: select actions to maximize the future reward
● The learner is not told which actions to take, but instead must discover which actions
yield the most reward by trying them.
● The agent has to exploit what it has already experienced in order to obtain reward, but
it also has to explore in order to make better action selections in the future.
Maze
https://www.samyzaf.com/ML/rl/qmaze.html
TD-Gammon
[Figure: estimated state value (≈ prob of winning); action selection by a shallow search]
Start with a random Neural Network
Play millions of games against itself (i.e., self-play)
Learn a value function from this simulated experience
Six weeks later it’s the best player of backgammon in the world
Originally used expert handcrafted features, later repeated with raw board positions
AlphaGo
In October 2015, AlphaGo became the first computer Go program to beat a human
professional Go player without handicaps on a full-sized 19×19 board.
Monte Carlo Tree Search, with learned policy and value-function networks for pruning the
search tree, trained from expert demonstrations and self-play, and running on Tensor Processing Units
The RL interface between Agent and Environment
The agent is in a state, takes an action, receives a reward for the (state, action) pair,
and transitions to a new state!
RL Terms
A set of States
● These are the possible positions of our mouse within the maze.
A set of Actions available in each state
● This is {forward, back} in a corridor and {forward, back, left, right} at a crossroads.
Transitions between states
● For example, if you go left at a crossroads you end up in a new position. These can be a set of probabilities
that link to more than one possible state (e.g. when you use an attack in a game of Pokémon you can either
miss, inflict some damage, or inflict enough damage to knock out your opponent).
Rewards associated with each transition
● In the robot-mouse example, most of the rewards are 0, but they’re positive if you reach a point that
has water or cheese and negative if you reach a point that has an electric shock.
Policy
● A mapping from states to actions that aims to maximize cumulative reward (how to map situations to actions).
Notations
We will use the following notation for known and unknown variables:
Dynamics
State transition probability. May or may not be known. Could be deterministic or random
Distribution over rewards. May or may not be known. Could be deterministic or random
Goal is to learn a policy: a function whose input is a state and whose output is an action
(possibly randomized). A toy illustration follows below.
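As a toy illustration (not from the slides; the states, actions, probabilities, and rewards below are all invented), the dynamics and a policy for a maze-like problem could be represented as follows:

```python
import random

# Hypothetical dynamics: P[(state, action)] is a list of
# (probability, next_state, reward) triples. All values are made up.
P = {
    ("corridor", "forward"):  [(1.0, "crossroads", 0.0)],
    ("crossroads", "left"):   [(0.9, "cheese", 1.0), (0.1, "shock", -1.0)],
    ("crossroads", "right"):  [(1.0, "corridor", 0.0)],
}

def step(state, action):
    """Sample (next_state, reward) from the transition distribution."""
    triples = P[(state, action)]
    probs = [p for p, _, _ in triples]
    outcomes = [(s2, r) for _, s2, r in triples]
    return random.choices(outcomes, weights=probs, k=1)[0]

def policy(state):
    """A deterministic policy: a mapping from states to actions."""
    return {"corridor": "forward", "crossroads": "left"}[state]

state = "corridor"
while state not in ("cheese", "shock"):
    action = policy(state)
    state, reward = step(state, action)
    print(action, "->", state, "reward:", reward)
```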
Outline
● Introduction to Reinforcement learning
● Multi-armed Bandits
○ Formulation
○ Regret
○ Action-Value Methods
○ ε-greedy action selection
○ UCB action selection
○ Gradient bandits
● Markov Decision Processes (MDP)
● Learning in MDP: When we don't know the world
Multi-armed Bandits
The simplest reinforcement learning problem
One state (no state transition probabilities)
Actions: k levers (arms), each action is associated with a reward
Policy is to sequentially choose arms to maximize cumulative reward
Formulation: k-armed bandit problem
On each of an infinite sequence of time steps t = 1, 2, 3, ..., you choose an action A_t
from k possibilities and receive a real-valued reward R_t
The reward depends only on the action taken; it is independently and identically
distributed given the action, with (unknown) true value q*(a) = E[ R_t | A_t = a ]
The true reward distributions are unknown. Nevertheless, you must maximize your total
reward (equivalently, minimize your total regret)
You must both try actions to learn their values (explore), and prefer those that
appear best (exploit)
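To make the setup concrete, here is a minimal Python sketch of a simulated k-armed bandit in the spirit of the 10-armed Gaussian testbed; the Gaussian reward model and all numbers are assumptions for illustration, not something fixed by the formulation above.

```python
import numpy as np

class GaussianBandit:
    """k arms; arm a pays a reward drawn from N(q_star[a], 1).
    The true means q_star are hidden from the learning agent."""

    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # true action values
        self.k = k

    def pull(self, a):
        """Return a noisy reward for pulling arm a."""
        return self.rng.normal(self.q_star[a], 1.0)

bandit = GaussianBandit(k=10)
print("best arm:", int(np.argmax(bandit.q_star)))
print("one reward from arm 0:", bandit.pull(0))
```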
Regret
Goal: minimize the total REGRET over T steps: L_T = T · max_a q*(a) − E[ Σ_{t=1}^T R_t ]
(the gap between always pulling the best arm and the reward actually collected)
● Low regret means that we do not lose much from not knowing future events.
● We can perform almost as well as someone who observes the entire
sequence and picks the best prediction strategy in hindsight
● We cannot compute regret (because this requires knowing the best arm), but
we use it to analyze our algorithm
● We can also compete with a changing (nonstationary) environment
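In simulation we can measure regret after the fact because the simulator knows the true values; a minimal sketch (the function name and setup are my own):

```python
import numpy as np

def empirical_regret(q_star, chosen_arms):
    """Regret of a sequence of pulls against always playing the best arm,
    measured with the (normally unknown) true action values q_star."""
    q_star = np.asarray(q_star)
    best = q_star.max()
    return float(np.sum(best - q_star[np.asarray(chosen_arms)]))

# Example: the best arm has mean 1.0; pulling arm 0 twice and arm 2 once
print(empirical_regret([0.2, 1.0, 0.5], [0, 0, 2]))  # (1.0-0.2)*2 + (1.0-0.5) = 2.1
```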
Examples
If rewards are deterministic and known
● policy: pull arm with highest reward (exploitation)
If rewards are deterministic and unknown
● policy: try each arm (exploration), then use best one (exploitation)
If rewards are random and known
● policy: take action with highest expected reward
If rewards are random and unknown
● policy: explore by trying each arm 10,000 times to estimate its reward, then exploit.
But here the exploration phase is too long and predetermined.
Action-value methods
Methods that learn action-value estimates and nothing else.
For example, estimate action values as sample averages:
Q_t(a) = (sum of rewards received when a was taken prior to t) / (number of times a was taken prior to t)
The exploration/exploitation dilemma
Suppose you form estimates Q_t(a) ≈ q*(a) for every action a
Define the greedy action at time t as A_t* = argmax_a Q_t(a)
If A_t = A_t*, then you are exploiting
If A_t ≠ A_t*, then you are exploring
You can't do both, but you need to do both
You can never stop exploring, but maybe you should explore less with time. Or
maybe not.
ε-greedy action selection
In greedy action selection, you always exploit
In ε-greedy, you are usually greedy, but with probability ε you instead pick an
action at random (possibly the greedy action again)
This is perhaps the simplest way to balance exploration and exploitation
ε-greedy action selection
Exploration is needed because there is always uncertainty about the accuracy of
the action-value estimates.
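A minimal sketch of ε-greedy selection over the current estimates Q (my own illustration; variable names are not from the slides):

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """With probability epsilon pick a uniformly random arm (explore,
    possibly the greedy arm again); otherwise pick argmax_a Q[a] (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

rng = np.random.default_rng(0)
Q = np.array([0.1, 0.5, 0.3])
print(epsilon_greedy(Q, epsilon=0.1, rng=rng))  # usually prints 1 (the greedy arm)
```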
Linear vs sublinear Regret
● If an algorithm explores forever (e.g., keeps picking arms uniformly at random), it pays a constant expected cost per step, so it will have linear total regret
● If an algorithm never explores, it can lock onto a suboptimal arm forever, so it will also have linear total regret
● Can we have sublinear total regret?
From Averaging to Learning Rule
To simplify notation, let us focus on one action
● Let R_i denote the i-th reward received for this action, and Q_n its estimate after n − 1 rewards: Q_n = (R_1 + R_2 + ... + R_{n-1}) / (n − 1)
How can we compute this incrementally (without storing all the rewards)?
We can store a running sum and count (and divide), or equivalently use the incremental update
Q_{n+1} = Q_n + (1/n) [ R_n − Q_n ]
This is a standard form for the learning/update rules we will use frequently:
NewEstimate ← OldEstimate + StepSize × [ Target − OldEstimate ]
(a quick numerical check of the equivalence is sketched below)
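A quick numerical check (my own) that the incremental rule reproduces the plain sample average:

```python
import numpy as np

rewards = [1.0, 0.0, 2.0, 1.0]

Q, n = 0.0, 0
for r in rewards:
    n += 1
    Q += (r - Q) / n          # Q_{n+1} = Q_n + (1/n)(R_n - Q_n)

print(Q, np.mean(rewards))    # both are 1.0
```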
Optimistic initial values to Encourage Exploration
All methods so far depend on the initial estimates Q_1(a)
So far we have used Q_1(a) = 0
Suppose we initialize the action values optimistically (Q_1(a) set much higher than any
plausible reward); this encourages the agent to try all the arms, since every sampled reward
is disappointing relative to the optimistic estimate and the greedy choice keeps moving on
to the still-untried arms.
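A minimal sketch of optimistic initialization driving early exploration even under purely greedy selection; the value 5.0 and the Gaussian testbed are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
q_star = rng.normal(0.0, 1.0, size=10)   # hidden true values

Q = np.full(10, 5.0)   # optimistic: far above any realistic reward
N = np.zeros(10)

first_pulls = []
for t in range(20):
    a = int(np.argmax(Q))                # purely greedy
    r = rng.normal(q_star[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]            # incremental sample average
    first_pulls.append(a)

print(first_pulls)  # early steps cycle through many different arms
```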
Upper Confidence Bound (UCB) action selection
● ε-greedy action selection forces the non-greedy actions to be tried, but
indiscriminately, with no preference for those that are nearly greedy or
particularly uncertain.
● It would be better to select among the non-greedy actions according to their
potential for actually being optimal, taking into account both how close their
estimates are to being maximal and the uncertainties in those estimates.
● One effective way of doing this is to select actions according to the upper
confidence bound (sketched in code below):
A_t = argmax_a [ Q_t(a) + c · sqrt( ln t / N_t(a) ) ]
where N_t(a) is the number of times action a has been selected so far and c > 0 controls the degree of exploration
○ Estimate an upper bound on the true action values
○ Select the action with the largest estimated upper bound
○ A clever way of reducing exploration over time
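A minimal sketch of UCB selection (my own illustration; giving untried arms priority when N_t(a) = 0 is a common convention, not something stated on the slide):

```python
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    """Pick argmax_a [ Q[a] + c * sqrt(ln t / N[a]) ].
    Arms never tried (N[a] == 0) are selected first."""
    N = np.asarray(N, dtype=float)
    if np.any(N == 0):
        return int(np.argmin(N))          # try an untried arm
    bonus = c * np.sqrt(np.log(t) / N)
    return int(np.argmax(np.asarray(Q) + bonus))

print(ucb_select(Q=[0.5, 0.4, 0.1], N=[10, 2, 8], t=20))  # arm 1: large bonus, rarely tried
```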
Appendix
Standard stochastic approximation convergence condition
Sometimes it is convenient to vary the step-size parameter from step to step.
α_n(a): the step-size parameter used to process the reward received after the n-th
selection of action a.
If α_n(a) = 1/n, it results in the sample average, which will converge to the true action
value by the law of large numbers.
But of course, convergence is not guaranteed for all choices of {α_n(a)}
A well-known result in stochastic approximation theory gives us the conditions
required to assure convergence with probability 1:
Σ_{n=1}^∞ α_n(a) = ∞   and   Σ_{n=1}^∞ α_n(a)² < ∞
The first condition is required to guarantee that the steps are large enough to eventually
overcome any initial conditions or random fluctuations. The second condition guarantees that
eventually the steps become small enough to assure convergence.
Quick check: Yes, α_n = 1/n satisfies both conditions; No, a step size that decays too
quickly (so that its sum is finite, e.g. α_n = 1/n²) does not, because its steps become too
small to overcome initial conditions.
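A worked check of these two representative step sizes (my own illustration, assuming the cases in question are the sample-average choice and a faster-decaying one):

```latex
\[
\alpha_n(a) = \tfrac{1}{n}:\qquad
\sum_{n=1}^{\infty} \tfrac{1}{n} = \infty \;\;\text{(condition 1 holds)},\qquad
\sum_{n=1}^{\infty} \tfrac{1}{n^{2}} = \tfrac{\pi^{2}}{6} < \infty \;\;\text{(condition 2 holds)}.
\]
\[
\alpha_n(a) = \tfrac{1}{n^{2}}:\qquad
\sum_{n=1}^{\infty} \tfrac{1}{n^{2}} < \infty \;\;\text{(condition 1 fails: the steps are too small)}.
\]
```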
Action-Value vs Numerical Preference
We consider learning a numerical preference H_t(a) for each action a
The action probabilities follow a softmax distribution:
π_t(a) = Pr{ A_t = a } = e^{H_t(a)} / Σ_b e^{H_t(b)}
This is similar to a classification problem where the classes are actions.
Then, we can use stochastic gradient ascent on the expected reward :)
Gradient Bandit Algorithm
Then, we can use stochastic gradient ascent on the expected reward :)
On each step, after selecting action A_t and receiving the reward R_t, update the preferences:
H_{t+1}(A_t) = H_t(A_t) + α (R_t − R̄_t) (1 − π_t(A_t)), and
H_{t+1}(a) = H_t(a) − α (R_t − R̄_t) π_t(a) for all a ≠ A_t
The term R̄_t (the average of all rewards observed so far) serves as a baseline with which the reward is compared.
If the reward is higher than the baseline, then the probability of taking A_t in the future is
increased; if the reward is below the baseline, then the probability is decreased.
The non-selected actions move in the opposite direction (a short code sketch follows below).
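A minimal Python sketch of one preference update following the equations above (variable names and the baseline bookkeeping are my own choices):

```python
import numpy as np

def softmax(H):
    z = np.exp(H - H.max())          # subtract max for numerical stability
    return z / z.sum()

def gradient_bandit_step(H, a, reward, baseline, alpha=0.1):
    """One preference update after taking arm a and observing reward.
    baseline is the running average of rewards so far (R-bar in the slides)."""
    pi = softmax(H)
    H = H.copy()
    H -= alpha * (reward - baseline) * pi   # subtract alpha*(R - baseline)*pi[a] from every arm...
    H[a] += alpha * (reward - baseline)     # ...so the chosen arm nets +alpha*(R - baseline)*(1 - pi[a])
    return H

H = np.zeros(3)
H = gradient_bandit_step(H, a=1, reward=1.0, baseline=0.2)
print(softmax(H))   # arm 1's probability has increased
```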
Summary comparison of bandit algorithms
Derivation of gradient-bandit algorithm
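A condensed sketch of the standard derivation (following Sutton & Barto) that the gradient-bandit update is stochastic gradient ascent on the expected reward; the slides' own presentation may differ in detail:

```latex
\[
\mathbb{E}[R_t] = \sum_{x} \pi_t(x)\, q_*(x),
\qquad
\frac{\partial \pi_t(x)}{\partial H_t(a)} = \pi_t(x)\,\bigl(\mathbf{1}_{a = x} - \pi_t(a)\bigr).
\]
Because $\sum_x \partial \pi_t(x) / \partial H_t(a) = 0$, any baseline $B_t$ can be subtracted:
\[
\frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)}
 = \sum_{x} \bigl(q_*(x) - B_t\bigr)\frac{\partial \pi_t(x)}{\partial H_t(a)}
 = \mathbb{E}\!\left[\bigl(R_t - \bar{R}_t\bigr)\,
   \frac{\partial \pi_t(A_t)}{\partial H_t(a)} \Big/ \pi_t(A_t)\right],
\]
where the expectation is over $A_t \sim \pi_t$ and $R_t$, using
$\mathbb{E}[R_t \mid A_t = x] = q_*(x)$ and $B_t = \bar{R}_t$.
Substituting the softmax derivative gives the sampled gradient
$(R_t - \bar{R}_t)\bigl(\mathbf{1}_{a = A_t} - \pi_t(a)\bigr)$, so the update
\[
H_{t+1}(a) = H_t(a) + \alpha\,\bigl(R_t - \bar{R}_t\bigr)\bigl(\mathbf{1}_{a = A_t} - \pi_t(a)\bigr)
\]
is stochastic gradient ascent on $\mathbb{E}[R_t]$.
```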