CMPSC 448: Machine Learning
Lecture 16. Reinforcement Learning and Bandits
Rui Zhang
Fall 2024
What types of ML are there?
Outline
● Introduction to Reinforcement learning
● Multi-armed Bandits
● Markov Decision Processes (MDP)
○ Dynamic Programming when we know the world
● Learning in MDP: When we don't know the world
○ Monte Carlo Methods
○ Temporal-Difference Learning (TD): SARSA and Q-Learning
Note: All of the methods in these lectures are tabular; we will only briefly discuss the
motivation for function approximation methods (e.g., DQN, policy gradient, deep
reinforcement learning)
What is reinforcement learning?
How to build agents that learn behaviors in a dynamic world?
● Agent-oriented learning
● learning by interacting with an environment to achieve a goal
● more natural, realistic, and ambitious than other kinds of machine learning
RL is a general-purpose framework for decision-making
● RL is for an agent with the capacity to act
● Each action influences the agent’s future state
● Success is measured by a scalar reward signal
● Goal: select actions to maximize the future reward
● The learner is not told which actions to take, but instead must discover which actions
yield the most reward by trying them.
● The agent has to exploit what it has already experienced in order to obtain reward, but
it also has to explore in order to make better action selections in the future.
Maze
https://www.samyzaf.com/ML/rl/qmaze.html
TD-Gammon
[Figure: estimated state value (≈ prob of winning); action selection by a shallow search]
Start with a random Neural Network
Play millions of games against itself (i.e., self-play)
Learn a value function from this simulated experience
Six weeks later it’s the best player of backgammon in the world
Originally used expert handcrafted features, later repeated with raw board positions
AlphaGo
In October 2015, AlphaGo became the first computer Go program to beat a human
professional Go player without handicaps on a full-sized 19×19 board.
Monte Carlo Tree Search, with learned policy and value-function networks for pruning the
search tree, trained from expert demonstrations and self-play, and running on Tensor Processing Units
The RL interface between Agent and Environment
The agent is in a state, takes an action, receives a reward for the (state, action) pair,
and transitions to a new state!
RL Terms
A set of States
● These are the possible positions of our mouse within the maze.
A set of Actions available in each state
● This is {forward, back} in a corridor and {forward, back, left, right} at a crossroads.
Transitions between states
● For example, if you go left at a crossroads you end up in a new position. These can be a set of probabilities
that link to more than one possible state (e.g. when you use an attack in a game of Pokémon you can either
miss, inflict some damage, or inflict enough damage to knock out your opponent).
Rewards associated with each transition
● In the robot-mouse example, most of the rewards are 0, but they’re positive if you reach a point that
has water or cheese and negative if you reach a point that has an electric shock.
Policy
● A mapping from states to actions that aims to maximize cumulative reward (how to map situations to actions).
Notations
We will use the following notation for known and unknown variables:
Dynamics
State transition probability. May or may not be known. Could be deterministic or random
Distribution over rewards. May or may not be known. Could be deterministic or random
Goal is to learn a policy: a function whose input is a state and whose output is an action
(possibly randomized). A toy illustration follows below.
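As a toy illustration (not from the slides; the states, actions, probabilities, and rewards below are all invented), the dynamics and a policy for a maze-like problem could be represented as follows:

```python
import random

# Hypothetical dynamics: P[(state, action)] is a list of
# (probability, next_state, reward) triples. All values are made up.
P = {
    ("corridor", "forward"):  [(1.0, "crossroads", 0.0)],
    ("crossroads", "left"):   [(0.9, "cheese", 1.0), (0.1, "shock", -1.0)],
    ("crossroads", "right"):  [(1.0, "corridor", 0.0)],
}

def step(state, action):
    """Sample (next_state, reward) from the transition distribution."""
    triples = P[(state, action)]
    probs = [p for p, _, _ in triples]
    outcomes = [(s2, r) for _, s2, r in triples]
    return random.choices(outcomes, weights=probs, k=1)[0]

def policy(state):
    """A deterministic policy: a mapping from states to actions."""
    return {"corridor": "forward", "crossroads": "left"}[state]

state = "corridor"
while state not in ("cheese", "shock"):
    action = policy(state)
    state, reward = step(state, action)
    print(action, "->", state, "reward:", reward)
```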
Outline
● Introduction to Reinforcement learning
● Multi-armed Bandits
○ Formulation
○ Regret
○ Action-Value Methods
○ ε-greedy action selection
○ UCB action selection
○ Gradient bandits
● Markov Decision Processes (MDP)
● Learning in MDP: When we don't know the world
Multi-armed Bandits
The simplest reinforcement learning problem
One state (no state transition probabilities)
Actions: k levers (arms), each action is associated with a reward
Policy is to sequentially choose arms to maximize cumulative reward
Formulation: k-armed bandit problem
On each of an infinite sequence of time steps t = 1, 2, 3, ..., you choose an action A_t
from k possibilities and receive a real-valued reward R_t
The reward depends only on the action taken; it is independently and identically
distributed given the action, with (unknown) true value q*(a) = E[ R_t | A_t = a ]
The true reward distributions are unknown. Nevertheless, you must maximize your total
reward (equivalently, minimize your total regret)
You must both try actions to learn their values (explore), and prefer those that
appear best (exploit)
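To make the setup concrete, here is a minimal Python sketch of a simulated k-armed bandit in the spirit of the 10-armed Gaussian testbed; the Gaussian reward model and all numbers are assumptions for illustration, not something fixed by the formulation above.

```python
import numpy as np

class GaussianBandit:
    """k arms; arm a pays a reward drawn from N(q_star[a], 1).
    The true means q_star are hidden from the learning agent."""

    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # true action values
        self.k = k

    def pull(self, a):
        """Return a noisy reward for pulling arm a."""
        return self.rng.normal(self.q_star[a], 1.0)

bandit = GaussianBandit(k=10)
print("best arm:", int(np.argmax(bandit.q_star)))
print("one reward from arm 0:", bandit.pull(0))
```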
Regret
Goal: minimize the total REGRET over T steps: L_T = T · max_a q*(a) − E[ Σ_{t=1}^T R_t ]
(the gap between always pulling the best arm and the reward actually collected)
● Low regret means that we do not lose much from not knowing future events.
● We can perform almost as well as someone who observes the entire
sequence and picks the best prediction strategy in hindsight
● We cannot compute regret (because this requires knowing the best arm), but
we use it to analyze our algorithm
● We can also compete with a changing (nonstationary) environment
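In simulation we can measure regret after the fact because the simulator knows the true values; a minimal sketch (the function name and setup are my own):

```python
import numpy as np

def empirical_regret(q_star, chosen_arms):
    """Regret of a sequence of pulls against always playing the best arm,
    measured with the (normally unknown) true action values q_star."""
    q_star = np.asarray(q_star)
    best = q_star.max()
    return float(np.sum(best - q_star[np.asarray(chosen_arms)]))

# Example: the best arm has mean 1.0; pulling arm 0 twice and arm 2 once
print(empirical_regret([0.2, 1.0, 0.5], [0, 0, 2]))  # (1.0-0.2)*2 + (1.0-0.5) = 2.1
```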
Examples
If rewards are deterministic and known
● policy: pull arm with highest reward (exploitation)
If rewards are deterministic and unknown
● policy: try each arm (exploration), then use best one (exploitation)
If rewards are random and known
● policy: take action with highest expected reward
If rewards are random and unknown
● policy: explore by trying each arm 10,000 times to estimate its reward, then exploit.
But here the exploration phase is too long and predetermined.
Action-value methods
Methods that learn action-value estimates and nothing else.
For example, estimate action values as sample averages:
Q_t(a) = (sum of rewards received when a was taken prior to t) / (number of times a was taken prior to t)
The exploration/exploitation dilemma
Suppose you form estimates Q_t(a) ≈ q*(a) for every action a
Define the greedy action at time t as A_t* = argmax_a Q_t(a)
If A_t = A_t*, then you are exploiting
If A_t ≠ A_t*, then you are exploring
You can't do both, but you need to do both
You can never stop exploring, but maybe you should explore less with time. Or
maybe not.
ε-greedy action selection
In greedy action selection, you always exploit
In ε-greedy, you are usually greedy, but with probability ε you instead pick an
action at random (possibly the greedy action again)
This is perhaps the simplest way to balance exploration and exploitation
ε-greedy action selection
Exploration is needed because there is always uncertainty about the accuracy of
the action-value estimates.
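A minimal sketch of ε-greedy selection over the current estimates Q (my own illustration; variable names are not from the slides):

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """With probability epsilon pick a uniformly random arm (explore,
    possibly the greedy arm again); otherwise pick argmax_a Q[a] (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

rng = np.random.default_rng(0)
Q = np.array([0.1, 0.5, 0.3])
print(epsilon_greedy(Q, epsilon=0.1, rng=rng))  # usually prints 1 (the greedy arm)
```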
Linear vs sublinear Regret
● If an algorithm explores forever (e.g., keeps picking arms uniformly at random), it pays a constant expected cost per step, so it will have linear total regret
● If an algorithm never explores, it can lock onto a suboptimal arm forever, so it will also have linear total regret
● Can we have sublinear total regret?
From Averaging to Learning Rule
To simplify notation, let us focus on one action
● Let R_i denote the i-th reward received for this action, and Q_n its estimate after n − 1 rewards: Q_n = (R_1 + R_2 + ... + R_{n-1}) / (n − 1)
How can we compute this incrementally (without storing all the rewards)?
We can store a running sum and count (and divide), or equivalently use the incremental update
Q_{n+1} = Q_n + (1/n) [ R_n − Q_n ]
This is a standard form for the learning/update rules we will use frequently:
NewEstimate ← OldEstimate + StepSize × [ Target − OldEstimate ]
(a quick numerical check of the equivalence is sketched below)
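A quick numerical check (my own) that the incremental rule reproduces the plain sample average:

```python
import numpy as np

rewards = [1.0, 0.0, 2.0, 1.0]

Q, n = 0.0, 0
for r in rewards:
    n += 1
    Q += (r - Q) / n          # Q_{n+1} = Q_n + (1/n)(R_n - Q_n)

print(Q, np.mean(rewards))    # both are 1.0
```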
Optimistic initial values to Encourage Exploration
All methods so far depend on the initial estimates Q_1(a)
So far we have used Q_1(a) = 0
Suppose we initialize the action values optimistically (Q_1(a) set much higher than any
plausible reward); this encourages the agent to try all the arms, since every sampled reward
is disappointing relative to the optimistic estimate and the greedy choice keeps moving on
to the still-untried arms.
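A minimal sketch of optimistic initialization driving early exploration even under purely greedy selection; the value 5.0 and the Gaussian testbed are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
q_star = rng.normal(0.0, 1.0, size=10)   # hidden true values

Q = np.full(10, 5.0)   # optimistic: far above any realistic reward
N = np.zeros(10)

first_pulls = []
for t in range(20):
    a = int(np.argmax(Q))                # purely greedy
    r = rng.normal(q_star[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]            # incremental sample average
    first_pulls.append(a)

print(first_pulls)  # early steps cycle through many different arms
```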
Upper Confidence Bound (UCB) action selection
● ε-greedy action selection forces the non-greedy actions to be tried, but
indiscriminately, with no preference for those that are nearly greedy or
particularly uncertain.
● It would be better to select among the non-greedy actions according to their
potential for actually being optimal, taking into account both how close their
estimates are to being maximal and the uncertainties in those estimates.
● One effective way of doing this is to select actions according to the upper
confidence bound (sketched in code below):
A_t = argmax_a [ Q_t(a) + c · sqrt( ln t / N_t(a) ) ]
where N_t(a) is the number of times action a has been selected so far and c > 0 controls the degree of exploration
○ Estimate an upper bound on the true action values
○ Select the action with the largest estimated upper bound
○ A clever way of reducing exploration over time
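A minimal sketch of UCB selection (my own illustration; giving untried arms priority when N_t(a) = 0 is a common convention, not something stated on the slide):

```python
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    """Pick argmax_a [ Q[a] + c * sqrt(ln t / N[a]) ].
    Arms never tried (N[a] == 0) are selected first."""
    N = np.asarray(N, dtype=float)
    if np.any(N == 0):
        return int(np.argmin(N))          # try an untried arm
    bonus = c * np.sqrt(np.log(t) / N)
    return int(np.argmax(np.asarray(Q) + bonus))

print(ucb_select(Q=[0.5, 0.4, 0.1], N=[10, 2, 8], t=20))  # arm 1: large bonus, rarely tried
```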
Appendix
Standard stochastic approximation convergence condition
Sometimes it is convenient to vary the step-size parameter from step to step.
α_n(a): the step-size parameter used to process the reward received after the n-th
selection of action a.
If α_n(a) = 1/n, it results in the sample average, which will converge to the true action
value by the law of large numbers.
But of course, convergence is not guaranteed for all choices of {α_n(a)}
A well-known result in stochastic approximation theory gives us the conditions
required to assure convergence with probability 1:
Σ_{n=1}^∞ α_n(a) = ∞   and   Σ_{n=1}^∞ α_n(a)² < ∞
The first condition is required to guarantee that the steps are large enough to eventually
overcome any initial conditions or random fluctuations. The second condition guarantees that
eventually the steps become small enough to assure convergence.
Quick check: Yes, α_n = 1/n satisfies both conditions; No, a step size that decays too
quickly (so that its sum is finite, e.g. α_n = 1/n²) does not, because its steps become too
small to overcome initial conditions.
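A worked check of these two representative step sizes (my own illustration, assuming the cases in question are the sample-average choice and a faster-decaying one):

```latex
\[
\alpha_n(a) = \tfrac{1}{n}:\qquad
\sum_{n=1}^{\infty} \tfrac{1}{n} = \infty \;\;\text{(condition 1 holds)},\qquad
\sum_{n=1}^{\infty} \tfrac{1}{n^{2}} = \tfrac{\pi^{2}}{6} < \infty \;\;\text{(condition 2 holds)}.
\]
\[
\alpha_n(a) = \tfrac{1}{n^{2}}:\qquad
\sum_{n=1}^{\infty} \tfrac{1}{n^{2}} < \infty \;\;\text{(condition 1 fails: the steps are too small)}.
\]
```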
Action-Value vs Numerical Preference
We consider learning a numerical preference H_t(a) for each action a
The action probabilities follow a softmax distribution:
π_t(a) = Pr{ A_t = a } = e^{H_t(a)} / Σ_b e^{H_t(b)}
This is similar to a classification problem where the classes are actions.
Then, we can use stochastic gradient ascent on the expected reward :)
Gradient Bandit Algorithm
Then, we can use stochastic gradient ascent on the expected reward :)
On each step, after selecting action A_t and receiving the reward R_t, update the preferences:
H_{t+1}(A_t) = H_t(A_t) + α (R_t − R̄_t) (1 − π_t(A_t)), and
H_{t+1}(a) = H_t(a) − α (R_t − R̄_t) π_t(a) for all a ≠ A_t
The term R̄_t (the average of all rewards observed so far) serves as a baseline with which the reward is compared.
If the reward is higher than the baseline, then the probability of taking A_t in the future is
increased; if the reward is below the baseline, then the probability is decreased.
The non-selected actions move in the opposite direction (a short code sketch follows below).
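A minimal Python sketch of one preference update following the equations above (variable names and the baseline bookkeeping are my own choices):

```python
import numpy as np

def softmax(H):
    z = np.exp(H - H.max())          # subtract max for numerical stability
    return z / z.sum()

def gradient_bandit_step(H, a, reward, baseline, alpha=0.1):
    """One preference update after taking arm a and observing reward.
    baseline is the running average of rewards so far (R-bar in the slides)."""
    pi = softmax(H)
    H = H.copy()
    H -= alpha * (reward - baseline) * pi   # subtract alpha*(R - baseline)*pi[a] from every arm...
    H[a] += alpha * (reward - baseline)     # ...so the chosen arm nets +alpha*(R - baseline)*(1 - pi[a])
    return H

H = np.zeros(3)
H = gradient_bandit_step(H, a=1, reward=1.0, baseline=0.2)
print(softmax(H))   # arm 1's probability has increased
```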
Summary comparison of bandit algorithms
Derivation of gradient-bandit algorithm
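A condensed sketch of the standard derivation (following Sutton & Barto) that the gradient-bandit update is stochastic gradient ascent on the expected reward; the slides' own presentation may differ in detail:

```latex
\[
\mathbb{E}[R_t] = \sum_{x} \pi_t(x)\, q_*(x),
\qquad
\frac{\partial \pi_t(x)}{\partial H_t(a)} = \pi_t(x)\,\bigl(\mathbf{1}_{a = x} - \pi_t(a)\bigr).
\]
Because $\sum_x \partial \pi_t(x) / \partial H_t(a) = 0$, any baseline $B_t$ can be subtracted:
\[
\frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)}
 = \sum_{x} \bigl(q_*(x) - B_t\bigr)\frac{\partial \pi_t(x)}{\partial H_t(a)}
 = \mathbb{E}\!\left[\bigl(R_t - \bar{R}_t\bigr)\,
   \frac{\partial \pi_t(A_t)}{\partial H_t(a)} \Big/ \pi_t(A_t)\right],
\]
where the expectation is over $A_t \sim \pi_t$ and $R_t$, using
$\mathbb{E}[R_t \mid A_t = x] = q_*(x)$ and $B_t = \bar{R}_t$.
Substituting the softmax derivative gives the sampled gradient
$(R_t - \bar{R}_t)\bigl(\mathbf{1}_{a = A_t} - \pi_t(a)\bigr)$, so the update
\[
H_{t+1}(a) = H_t(a) + \alpha\,\bigl(R_t - \bar{R}_t\bigr)\bigl(\mathbf{1}_{a = A_t} - \pi_t(a)\bigr)
\]
is stochastic gradient ascent on $\mathbb{E}[R_t]$.
```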