LECTURER: MAJA POPOVIĆ
REINFORCEMENT LEARNING
TOPIC OUTLINE
1. Introduction to Reinforcement Learning
2. Dynamic Programming and Markov Decision Processes
3. Bandit Problems
4. Q-Learning
5. Reinforcement Learning Approaches
6. Summary: Reinforcement Learning
UNIT 3
BANDIT PROBLEMS
REINFORCEMENT LEARNING
Recommended sources:
• https://www.davidsilver.uk/teaching/
on-line lectures (videos + slides)
• https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf
on-line book
Exploration and exploitation
EXPLORATION AND EXPLOITATION
Exploration
− gather as much information about the environment as possible
Exploitation
− exploit known information to maximise the reward
Both are important.
EXPLORATION VS EXPLOITATION DILEMMA
Exploitation
− Choose high reward actions
− Exploit learned information
− Maximize immediate gain
Exploration
− Discover novel actions
− Learn about the environment
− Balance knowledge expansion
When to explore new options and when to exploit what we know?
THE BANDIT PROBLEM
Dilemma: balancing immediate gains vs. exploration for long-term optimal choices in sequential decision-making
− Scenario: conduct clinical trials for multiple treatments
− Problem: how to balance optimal care for individual patients with discovering the overall best medical treatment?
How much to explore / exploit?
Source of the text: Freedman, 1987
THE BANDIT PROBLEM
Goal: balance optimal decisions with exploration in sequential processes
− Adaptation: optimise decisions based on evolving data
− Iterative Learning: continuously refine strategies for maximised gains
Applications: healthcare, finance, online advertising, recommender systems
EXPLORATION AND EXPLOITATION
Exploration-exploitation strategies
− Random exploration
− “optimism in the face of uncertainty”
preference for exploring uncertain states/actions
− Information state space
Bandit problems
Multiple arm bandits
MULTIPLE ARM BANDIT PROBLEM
Multiple Arm Bandit: optimise choices among multiple actions
to maximise cumulative rewards
− Multiple Arms: set of actions
− Reward probability distribution: unknown, uncertain rewards for each action
− Initial Knowledge: limited or no reward distribution information
Maximise rewards:
− Learning: update action preferences based on observed rewards over time
(Figure: “How to choose actions?”)
MULTIPLE ARM BANDIT
(Figure: probability distributions of the rewards for each action, unknown to the agent)
MULTIPLE ARM BANDIT
Multiple arm bandit
− a tuple $(A, R)$
− $A$ = a known set of actions (“arms”)
− $R^a(r) = P(r \mid a)$ = unknown probability distribution of rewards for each action $a$
− $a_t$ = action taken at time step $t$
− $r_t$ = reward obtained at time step $t$
− The goal: maximise the cumulative reward
$R = \sum_{\tau=1}^{t} r_\tau$
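As a minimal sketch of this setup (assuming Bernoulli reward distributions, which the slides leave unspecified; class and variable names are my own):

import random

class MultiArmBandit:
    """k-armed bandit with Bernoulli rewards; the probabilities are hidden from the agent."""
    def __init__(self, reward_probs):
        self.reward_probs = reward_probs  # one unknown P(r = 1 | a) per arm

    def pull(self, action):
        """Return a stochastic reward r_t for the chosen arm a_t."""
        return 1.0 if random.random() < self.reward_probs[action] else 0.0

# cumulative reward R = sum of r_tau for tau = 1..t, here under a uniformly random policy
bandit = MultiArmBandit([0.5, 0.8, 0.2])
R = sum(bandit.pull(random.randrange(3)) for _ in range(1000))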
MULTIPLE ARM BANDIT
Regret
− Mean reward for action $a$: $q(a) = E[r \mid a]$
− Optimal value: $v^* = q(a^*) = \max_a q(a)$
− Regret at time step $t$: $l_t = E[v^* - q(a_t)]$
− Total regret:
$L_t = E\left[\sum_{\tau=1}^{t} \left(v^* - q(a_\tau)\right)\right]$
maximise cumulative reward = minimise total regret
MULTIPLE ARM BANDIT
Regret: a function of counts and differences (gaps)
$L_t = E\left[\sum_{\tau=1}^{t} \left(v^* - q(a_\tau)\right)\right] = \sum_a E[N_t(a)] \cdot \left(v^* - q(a)\right) = \sum_a E[N_t(a)] \cdot \Delta(a)$
$N_t(a)$ = number of times action $a$ was selected up to time step $t$
$\Delta(a)$ = difference between the value of the optimal action $a^*$ and the value of the selected action $a$
$\Delta(a)$ is unknown!
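A small worked example (the arm values are invented purely for illustration): with two arms whose mean rewards are $q(a_1) = 0.8$ and $q(a_2) = 0.5$, the gaps are $\Delta(a_1) = 0$ and $\Delta(a_2) = 0.3$, so
$L_t = E[N_t(a_1)] \cdot 0 + E[N_t(a_2)] \cdot 0.3$
If the policy selects the sub-optimal arm $a_2$ on average 100 times during the first $t$ steps, the total regret is $L_t = 100 \cdot 0.3 = 30$; only the time spent on sub-optimal arms contributes to the regret.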
MULTIPLE ARM BANDIT
Regret
$L_t = \sum_a E[N_t(a)] \cdot \Delta(a)$
increases over time
depending on the strategy, increases linearly or sub-linearly
MULTIPLE ARM BANDIT
Regret
(Figure: total regret over time; depending on the strategy, it increases linearly or sub-linearly)
Exploitation-exploration strategies
EXPLOITATION – EXPLORATION ALGORITHMS
Exploitation-exploration strategies for bandit problems
random
• greedy strategy (pure exploitation)
• 𝜀-greedy strategy (combining exploitation and exploration)
optimism and uncertainty
• UCB – upper confidence bound (better combines exploitation and exploration)
information state space
• introducing states as summaries of all accumulated information
EXPLOITATION – EXPLORATION ALGORITHMS
Greedy strategy
• estimate the value of each action:
$\hat{q}_t(a) = \dfrac{\#\text{times action } a \text{ was selected and won a reward}}{\#\text{times action } a \text{ was selected}}$
• select the action with the highest estimated value:
$a_t^* = \arg\max_a \hat{q}_t(a)$
can get stuck on a sub-optimal action
linear total regret
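A minimal sketch of the greedy rule in Python, assuming the same win-count value estimate as on the slide (function and variable names are my own):

def greedy_action(counts, wins):
    """counts[a] = times arm a was selected, wins[a] = times arm a returned a reward."""
    # estimated value q_hat_t(a); arms never tried default to 0
    q_hat = [wins[a] / counts[a] if counts[a] > 0 else 0.0 for a in range(len(counts))]
    # pure exploitation: always pick the arm with the highest estimated value
    return max(range(len(q_hat)), key=lambda a: q_hat[a])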
EXPLOITATION – EXPLORATION ALGORITHMS
Greedy strategy: getting stuck
• arm A gives a reward with 50% chance
• arm B gives a reward with 80% chance
Unlucky sequence of initial events:
1) tried A and got a reward
2) tried B and did not get a reward
⇒ $\hat{q}(A) > \hat{q}(B) = 0$, so the greedy strategy keeps pulling A and never returns to the better arm B
EXPLOITATION – EXPLORATION ALGORITHMS
𝜀-greedy strategy
• select a random action with probability 𝜀
• select the action with the highest value (as in greedy strategy) with probability 1 − 𝜀
𝜀 = 1 ⇒ pure exploration
𝜀 = 0 ⇒ pure exploitation (greedy strategy)
again a linear total regret
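A self-contained ε-greedy sketch along the same lines (an illustrative implementation, not the course's reference code):

import random

def epsilon_greedy_action(counts, wins, epsilon=0.1):
    """With probability epsilon explore a random arm, otherwise exploit the greedy choice."""
    if random.random() < epsilon:
        return random.randrange(len(counts))  # exploration: uniformly random arm
    q_hat = [wins[a] / counts[a] if counts[a] > 0 else 0.0 for a in range(len(counts))]
    return max(range(len(q_hat)), key=lambda a: q_hat[a])  # exploitation: greedy arm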
EXPLOITATION – EXPLORATION ALGORITHMS
$\varepsilon_t$-greedy strategy
$\varepsilon$ is not constant but decreases over time
• logarithmic total regret
• requires knowledge about rewards/gaps
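One simple decay schedule, sketched under the assumption of a 1/t decay (the constant c is an arbitrary illustrative choice; the theoretically grounded schedule depends on the gaps, as noted above):

def epsilon_t(t, c=1.0):
    """Decaying exploration rate, e.g. epsilon_t = min(1, c / t)."""
    return min(1.0, c / max(t, 1))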
EXPLOITATION – EXPLORATION ALGORITHMS
Upper Confidence Bound (UCB)
(Figure: three possible actions with reward probability distributions, unknown to the agent)
EXPLOITATION – EXPLORATION ALGORITHMS
Upper Confidence Bound (UCB)
• estimate an upper confidence $\hat{U}_t(a)$ for each action value,
so that $q(a) \leq \hat{q}_t(a) + \hat{U}_t(a)$ with high probability
• select the action which maximises the upper confidence bound:
$a_t^* = \arg\max_a \left[\hat{q}_t(a) + \hat{U}_t(a)\right]$
EXPLOITATION – EXPLORATION ALGORITHMS
Upper Confidence Bound (UCB)
upper confidence depends on how many times the action has been selected
• small $N_t(a)$ ⇒ high $\hat{U}_t(a)$ ⇒ high uncertainty
• large $N_t(a)$ ⇒ low $\hat{U}_t(a)$ ⇒ low uncertainty
exploring action $a$ more and more decreases $\hat{U}_t(a)$
⇒ the agent becomes more certain about its choices
EXPLOITATION – EXPLORATION ALGORITHMS
Upper Confidence Bound (UCB)
$a_t^* = \arg\max_a \left[\hat{q}_t(a) + \sqrt{\dfrac{2 \ln t}{N_t(a)}}\right]$
no knowledge about the rewards (differences, gaps) needed
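A minimal UCB1 sketch using the bonus term from the formula above (names are my own; each arm is pulled once before the bound is applied):

import math

def ucb1_action(counts, q_hat, t):
    """Pick argmax_a [ q_hat(a) + sqrt(2 ln t / N_t(a)) ]."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # try every arm once so that N_t(a) > 0
    ucb = [q_hat[a] + math.sqrt(2.0 * math.log(t) / counts[a]) for a in range(len(counts))]
    return max(range(len(ucb)), key=lambda a: ucb[a])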
EXPLOITATION – EXPLORATION ALGORITHMS
Upper Confidence Bound (UCB)
Hoeffding’s inequality
For the sample mean $\bar{X}_t$ of i.i.d. random variables $X_1, \dots, X_t$ in $[0,1]$:
$P\left[E[X] > \bar{X}_t + u\right] \leq e^{-2tu^2}$
Applied to the rewards of action $a$:
$P\left[q(a) > \hat{q}_t(a) + \hat{U}_t(a)\right] \leq e^{-2 N_t(a)\, \hat{U}_t(a)^2}$
Call this probability $p = e^{-2 N_t(a)\, \hat{U}_t(a)^2}$ and choose e.g. $p = t^{-4}$, so that it decreases over time.
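Solving the bound for $\hat{U}_t(a)$ with this choice of $p$ recovers the bonus used in the UCB1 rule above:
$e^{-2 N_t(a)\, \hat{U}_t(a)^2} = t^{-4} \;\Rightarrow\; 2 N_t(a)\, \hat{U}_t(a)^2 = 4 \ln t \;\Rightarrow\; \hat{U}_t(a) = \sqrt{\dfrac{2 \ln t}{N_t(a)}}$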
EXPLOITATION – EXPLORATION ALGORITHMS
Information state space
• Bandit as a sequential decision problem (not only one-state)
• with an “information state” $\tilde{s}$ at each step,
summarising all information accumulated up to that step
Each action causes a transition to a new information state $\tilde{s}'$ (by adding information)
with probability $\tilde{P}^a_{\tilde{s}\tilde{s}'}$
⇒ a Markov Decision Process $(\tilde{S}, A, \tilde{P}, R, \beta)$
Contextual bandits
CONTEXTUAL BANDITS
Contextual bandit
− a tuple $(A, S, R)$
− $A$ = a known set of actions (“arms”)
− $S = P(s)$ = an unknown distribution over states (contexts)
− $R^a_s(r) = P(r \mid s, a)$ = unknown probability distribution of rewards for each action $a$ and context $s$
− $a_t$ = action taken at time step $t$
− $s_t$ = state/context at time step $t$
− $r_t$ = reward obtained at time step $t$
The goal: maximise the cumulative reward $R = \sum_{\tau=1}^{t} r_\tau$
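A hedged sketch of one way to act in a contextual bandit: a tabular ε-greedy learner that keeps a separate value estimate per observed context (a simplification for illustration; class and method names are my own):

import random
from collections import defaultdict

class ContextualEpsilonGreedy:
    """One running value estimate per (context, arm) pair."""
    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts = defaultdict(lambda: [0] * n_arms)    # N_t(s, a)
        self.values = defaultdict(lambda: [0.0] * n_arms)  # q_hat_t(s, a)

    def select(self, context):
        if random.random() < self.epsilon:
            return random.randrange(self.n_arms)  # explore
        values = self.values[context]
        return max(range(self.n_arms), key=lambda a: values[a])  # exploit

    def update(self, context, action, reward):
        # incremental sample mean of observed rewards for (context, action)
        self.counts[context][action] += 1
        n = self.counts[context][action]
        self.values[context][action] += (reward - self.values[context][action]) / n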
Questions?
Examples
EXPLOITATION – EXPLORATION ALGORITHMS
Tetris
States: screen pixels
Actions: rotate, left, right, drop
Rewards: number of cleared lines
EXPLOITATION – EXPLORATION ALGORITHMS
Recommendation systems
Bandits
Actions (arms): articles
Rewards: clicks, time
Contextual bandits:
Context: information about the user/article
EXPLOITATION – EXPLORATION ALGORITHMS
Clinical trials
Bandits
Actions (arms): medicines
Rewards: healed patients
SESSION 3
TRANSFER TASK
TRANSFER TASK
TASKS
Case study
An online news platform seeks to maximize user engagement by recommending the most relevant articles while balancing click-through rates and diversity of content.
The task involves selecting articles to display to users in order to optimize engagement and satisfaction, considering variables like user preferences, article popularity, and novelty.
Task
Research how the multi-arm bandit problem can enhance article recommendation. Define the arms, rewards, exploration, and exploitation.
TRANSFER TASK
PRESENTATION OF THE RESULTS
Please present your results.
The results will be discussed in plenary.
LEARNING CONTROL QUESTIONS
1. A five-stage finite-horizon four-arm bandit problem uses the following sequence of discount factors: ρ[k] = 1, 1, 1, 0.5, 0.25, for k = 1, 2, 3, 4, 5. Suppose the true mean rewards are 24.5, 36.5, 19.0, 32.5 for arms j = 1, 2, 3, 4, respectively. What is the maximum possible total reward?
a) 71.25
b) 95.0
c) 136.875
d) 182.5
LEARNING CONTROL QUESTIONS
2. A three-arm bandit problem is a 100-stage finite horizon process.
The true mean rewards of the arms are 10.1, 9.0 and 7.5. Suppose
a policy selects each arm for ten stages, calculates the average
reward for each, then selects the arm with the highest observed
average. If the optimal reward is correctly identified, what will be
the regret?
a) 37.0
b) 266.0
c) 973.0
d) 1010.0
LEARNING CONTROL QUESTIONS
3. For softmax action selection, what best describes the policy as
the temperature parameter approaches zero?
a) An exploratory policy
b) A greedy policy
c) An ϵ-greedy policy
d) A UCB1 policy
LIST OF SOURCES
Freedman, B. (1987). Equipoise and the ethics of clinical research. New England Journal of Medicine, 317 (3), 141–145.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press.
© 2022 IU Internationale Hochschule GmbH
This content is protected by copyright. All rights reserved.
This content may not be reproduced and/or electronically edited, duplicated, or distributed in any kind of
form without written permission by the IU Internationale Hochschule GmbH.