DLMAIRIL01 Q4-2024 Session3

The document outlines key concepts in Reinforcement Learning, focusing on Bandit Problems and the exploration-exploitation dilemma. It discusses various strategies for optimizing decision-making in uncertain environments, including multiple arm bandits and contextual bandits. The document also includes examples and applications in fields such as healthcare and online recommendation systems.


LECTURER: MAJA POPOVIĆ

REINFORCEMENT LEARNING
TOPIC OUTLINE

1. Introduction to Reinforcement Learning
2. Dynamic Programming and Markov Decision Processes
3. Bandit Problems
4. Q-Learning
5. Reinforcement Learning Approaches
6. Summary: Reinforcement Learning


UNIT 3

BANDIT PROBLEMS

Recommended sources:

• https://www.davidsilver.uk/teaching/
on-line lectures (videos + slides)

• https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf
on-line book
Exploration and exploitation
EXPLORATION AND EXPLOITATION

Exploration
− gather as much information as possible about the environment

Exploitation
− exploit known information to maximise the reward

both are important


EXPLORATION VS EXPLOITATION DILEMMA

Exploitation
− Choose high-reward actions
− Exploit learned information
− Maximise immediate gain

Exploration
− Discover novel actions
− Learn about the environment
− Balance knowledge expansion

When to explore new options and when to exploit what we know?


THE BANDIT PROBLEM

Dilemma: balancing immediate gains vs. exploration for long-term optimal choices in sequential decision-making

− Scenario: conduct clinical trials for multiple treatments

− Problem: how to balance optimal care for each individual patient with discovering the overall best treatment?

How much to explore / exploit?

Source of the text: Freedman, 1987


THE BANDIT PROBLEM

Goal: balance optimal decisions with exploration in sequential processes

− Adaptation: optimise decisions based on evolving data

− Iterative Learning: continuously refine strategies for maximised gains

Applications: healthcare, finance, online advertising, recommender systems
EXPLORATION AND EXPLOITATION

Exploration-exploitation strategies

− Random exploration

− “optimism in the face of uncertainty”
preference for exploring uncertain states/actions

− Information state space


Bandit problems
Multiple arm bandits
MULTIPLE ARM BANDIT PROBLEM

Multiple Arm Bandit: optimise choices among multiple actions to maximise cumulative rewards

− Multiple Arms: set of actions
− Reward probability distribution: unknown, uncertain rewards for each action
− Initial Knowledge: limited or no information about the reward distributions
− Learning: update action preferences based on observed rewards over time

Maximise rewards: how to choose actions?
MULTIPLE ARM BANDIT

probability distributions of the rewards for each action (unknown)
MULTIPLE ARM BANDIT

Multiple arm bandit

− a tuple $(\mathcal{A}, \mathcal{R})$
− $\mathcal{A}$ = a known set of actions (“arms”)
− $\mathcal{R}^a(r) = P(r \mid a)$ = unknown probability distribution of rewards for each action $a$
− $a_t$ = action taken at the time step $t$
− $r_t$ = reward obtained at the time step $t$

− The goal: maximise the cumulative reward

$$R = \sum_{\tau=1}^{t} r_\tau$$
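To make the tuple $(\mathcal{A}, \mathcal{R})$ concrete, here is a minimal Python sketch of a multi-armed bandit environment, assuming Bernoulli reward distributions (each arm pays 1 with some fixed, unknown probability). The class name `BernoulliBandit` and the success probabilities are illustrative assumptions, not part of the lecture.

```python
import random

class BernoulliBandit:
    """Minimal multi-armed bandit: each arm a pays reward 1 with an
    unknown probability p[a], otherwise 0 (illustrative assumption)."""

    def __init__(self, success_probs):
        self.success_probs = success_probs   # hidden from the agent
        self.n_arms = len(success_probs)

    def pull(self, arm):
        # Sample a reward r_t from the arm's (unknown) distribution R^a.
        return 1.0 if random.random() < self.success_probs[arm] else 0.0

# Example: three arms; the agent only observes rewards, not the probabilities.
env = BernoulliBandit([0.5, 0.8, 0.3])
total_reward = sum(env.pull(random.randrange(env.n_arms)) for _ in range(100))
print(total_reward)
```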
MULTIPLE ARM BANDIT

Regret
− Mean reward for action $a$: $q(a) = E[r \mid a]$
− Optimal value: $v^* = q(a^*) = \max_a q(a)$
− Regret at the time step $t$: $l_t = E[v^* - q(a_t)]$
− Total regret:

$$L_t = E\left[\sum_{\tau=1}^{t} \left(v^* - q(a_\tau)\right)\right]$$

maximise cumulative reward = minimise total regret


MULTIPLE ARM BANDIT

Regret: a function of counts and differences (gaps)

$$L_t = E\left[\sum_{\tau=1}^{t} \left(v^* - q(a_\tau)\right)\right] = \sum_a E[N_t(a)] \cdot \left(v^* - q(a)\right) = \sum_a E[N_t(a)] \cdot \Delta(a)$$

$N_t(a)$ = number of times action $a$ was selected until the time step $t$

$\Delta(a)$ = difference between the values of the optimal action $a^*$ and the selected action $a$

$\Delta(a)$ is unknown!
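As a sketch of this decomposition, the snippet below computes total regret from action counts and gaps, assuming the true mean rewards are known to us as the experimenter (the agent itself never sees them); all numbers are illustrative.

```python
# True mean rewards q(a), known only to the experimenter (assumed values).
q = [0.5, 0.8, 0.3]
v_star = max(q)                          # optimal value v*
gaps = [v_star - q_a for q_a in q]       # Delta(a) = v* - q(a)

# Suppose after t steps the agent selected each arm this many times.
counts = [40, 150, 10]                   # N_t(a), illustrative numbers

# Total regret: L_t = sum_a N_t(a) * Delta(a)
total_regret = sum(n * gap for n, gap in zip(counts, gaps))
print(total_regret)                      # 0.3*40 + 0.0*150 + 0.5*10 = 17.0
```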
MULTIPLE ARM BANDIT

Regret

$$L_t = \sum_a E[N_t(a)] \cdot \Delta(a)$$

increases over time

depending on the strategy, increases linearly or sub-linearly


MULTIPLE ARM BANDIT

Regret
depending on the strategy, increases linearly or sub-linearly
Exploitation-exploration strategies
EXPLOITATION – EXPLORATION ALGORITHMS

Exploitation-exploration strategies for bandit problems


random
• greedy strategy (pure exploitation)
• 𝜀-greedy strategy (combining exploitation and exploration)

optimism and uncertainty


• UCB (upper confidence bound): a better combination of exploitation and exploration

information state space


• introducing states as summaries of all accumulated information
EXPLOITATION – EXPLORATION ALGORITHMS

Greedy strategy

• estimate the value of each action:

$$\hat{q}_t(a) = \frac{\#\text{times action } a \text{ was selected and won a reward}}{\#\text{times action } a \text{ was selected}}$$

• select the action with the highest estimated value:

$$a_t^* = \underset{a}{\operatorname{argmax}} \; \hat{q}_t(a)$$

can get stuck on a sub-optimal action

linear total regret
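A minimal sketch of the greedy strategy in Python, assuming binary (win/no-win) rewards as in the value estimate above; the `GreedyAgent` class name and the initialisation of untried arms to value 0 are illustrative choices.

```python
import random

class GreedyAgent:
    """Pure-exploitation agent: always picks the arm with the highest
    estimated value q_hat(a) = wins(a) / pulls(a)."""

    def __init__(self, n_arms):
        self.pulls = [0] * n_arms     # N_t(a)
        self.wins = [0.0] * n_arms    # accumulated reward per arm

    def q_hat(self, arm):
        # Estimated value; untried arms count as 0 (illustrative assumption).
        return self.wins[arm] / self.pulls[arm] if self.pulls[arm] else 0.0

    def select(self):
        values = [self.q_hat(a) for a in range(len(self.pulls))]
        return values.index(max(values))   # argmax_a q_hat(a)

    def update(self, arm, reward):
        self.pulls[arm] += 1
        self.wins[arm] += reward
```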
EXPLOITATION – EXPLORATION ALGORITHMS

Greedy strategy: getting stuck

• arm A gives a reward with 50% chance
• arm B gives a reward with 80% chance

Unlucky sequence of initial events:

1) tried A and got a reward


2) tried B and did not get a reward

⇒ keep pulling A
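A quick simulation of this failure mode, reusing the illustrative `BernoulliBandit` and `GreedyAgent` sketches above: after the unlucky start, the greedy agent keeps pulling arm A even though arm B is better.

```python
import random

random.seed(0)                       # illustrative seed
env = BernoulliBandit([0.5, 0.8])    # arm 0 = A (50%), arm 1 = B (80%)
agent = GreedyAgent(env.n_arms)

# Force the unlucky start: A wins once, B loses once.
agent.update(0, 1.0)
agent.update(1, 0.0)

for _ in range(1000):
    a = agent.select()               # greedy now always prefers arm A
    agent.update(a, env.pull(a))

print(agent.pulls)                   # all further pulls go to arm 0 (A)
```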
EXPLOITATION – EXPLORATION ALGORITHMS

𝜀-greedy strategy

• select a random action with probability 𝜀
• select the action with the highest value (as in the greedy strategy) with probability 1 − 𝜀

𝜀 = 1 ⇒ pure exploration

𝜀 = 0 ⇒ pure exploitation (greedy strategy)

again a linear total regret
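A minimal 𝜀-greedy variant, extending the hypothetical `GreedyAgent` sketch above (illustrative, not an implementation prescribed by the slides):

```python
import random

class EpsilonGreedyAgent(GreedyAgent):
    """With probability epsilon explore (random arm), otherwise exploit
    the current value estimates like the greedy agent."""

    def __init__(self, n_arms, epsilon=0.1):
        super().__init__(n_arms)
        self.epsilon = epsilon

    def select(self):
        if random.random() < self.epsilon:            # explore
            return random.randrange(len(self.pulls))
        return super().select()                       # exploit
```

Because a fixed 𝜀 keeps the agent exploring forever, the total regret still grows linearly.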


EXPLOITATION – EXPLORATION ALGORITHMS

𝜀𝑡-greedy strategy

𝜀 is not constant but decreases with time

• logarithmic total regret


• requires knowledge about rewards/gaps
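One possible decay schedule, as a sketch: the specific choice 𝜀𝑡 = min(1, c/t) below is an illustrative assumption, not prescribed by the slides; schedules that actually achieve logarithmic regret choose the constant using the gaps Δ(a).

```python
def epsilon_schedule(t, c=5.0):
    """Illustrative decaying exploration rate: epsilon_t = min(1, c / t).
    In theory the constant c would depend on the gaps Delta(a)."""
    return min(1.0, c / max(t, 1))

# Example: epsilon shrinks as the agent gathers more data.
print([epsilon_schedule(t) for t in (1, 10, 100, 1000)])
```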
EXPLOITATION – EXPLORATION ALGORITHMS

Upper Confidence Bound (UCB)

three possible actions with probability distributions (unknown to the agent)


EXPLOITATION – EXPLORATION ALGORITHMS

Upper Confidence Bound (UCB)

• estimate an upper confidence $\hat{u}_t(a)$ for each action value,
so that $q(a) \leq \hat{q}_t(a) + \hat{u}_t(a)$ with high probability

• select the action which maximises the upper confidence bound:

$$a_t^* = \underset{a}{\operatorname{argmax}} \left( \hat{q}_t(a) + \hat{u}_t(a) \right)$$
EXPLOITATION – EXPLORATION ALGORITHMS

Upper Confidence Bound (UCB)

the upper confidence depends on how many times the action has been selected
• small $N_t(a)$ ⇒ high $\hat{u}_t(a)$ ⇒ high uncertainty
• large $N_t(a)$ ⇒ low $\hat{u}_t(a)$ ⇒ low uncertainty

exploring the action $a$ more and more decreases the value of $\hat{u}_t(a)$

⇒ the agent becomes more certain about its choices
EXPLOITATION – EXPLORATION ALGORITHMS

Upper Confidence Bound (UCB)

$$a_t^* = \underset{a}{\operatorname{argmax}} \left( \hat{q}_t(a) + \sqrt{\frac{2 \ln t}{N_t(a)}} \right)$$

no knowledge about the rewards (differences, gaps) needed
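A sketch of this selection rule (UCB1), again building on the hypothetical `GreedyAgent` class from the greedy sketch above for the counts and value estimates:

```python
import math

class UCB1Agent(GreedyAgent):
    """Select argmax_a [ q_hat(a) + sqrt(2 ln t / N_t(a)) ]."""

    def select(self):
        t = sum(self.pulls) + 1
        # Try every arm once first so that N_t(a) > 0 (common convention,
        # assumed here rather than taken from the slides).
        for a, n in enumerate(self.pulls):
            if n == 0:
                return a
        ucb = [self.q_hat(a) + math.sqrt(2 * math.log(t) / self.pulls[a])
               for a in range(len(self.pulls))]
        return ucb.index(max(ucb))
```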


EXPLOITATION – EXPLORATION ALGORITHMS

Upper Confidence Bound (UCB)

Hoeffding’s inequality (for i.i.d. samples bounded in $[0,1]$ with sample mean $\bar{X}_t$):

$$P\left(E[X] > \bar{X}_t + u\right) \leq e^{-2 t u^2}$$

Applied to the rewards of action $a$:

$$P\left(q(a) > \hat{q}_t(a) + \hat{u}_t(a)\right) \leq e^{-2 N_t(a)\, \hat{u}_t(a)^2} = p$$

Choosing a probability $p$ that decreases over time, e.g. $p = t^{-4}$, and solving for $\hat{u}_t(a)$:

$$e^{-2 N_t(a)\, \hat{u}_t(a)^2} = t^{-4} \;\Rightarrow\; \hat{u}_t(a) = \sqrt{\frac{2 \ln t}{N_t(a)}}$$


EXPLOITATION – EXPLORATION ALGORITHMS

Information state space

• Bandit as a sequential decision problem (not only one-state)

• With an “information state” $\tilde{s}$ at each step,
summarising all information accumulated until this step

Each action causes a transition to a new state $\tilde{s}'$ by adding information,
with probability $\tilde{P}^a_{\tilde{s}\tilde{s}'}$

⇒ a Markov Decision Process $(\tilde{S}, A, \tilde{P}, R, \beta)$
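For Bernoulli rewards, one concrete choice of information state is the per-arm count of successes and pulls. The sketch below is an illustrative assumption: each action updates these counts, and the randomness of the observed reward is what makes the transition to the next information state stochastic.

```python
# Illustrative information state for a Bernoulli bandit:
# s_tilde = tuple of (wins, pulls) per arm, summarising everything seen so far.
def initial_state(n_arms):
    return tuple((0, 0) for _ in range(n_arms))

def transition(state, arm, reward):
    """Pulling `arm` and observing `reward` moves s_tilde to s_tilde'."""
    wins, pulls = state[arm]
    new = list(state)
    new[arm] = (wins + int(reward), pulls + 1)
    return tuple(new)

s = initial_state(2)
s = transition(s, 0, 1.0)   # tried arm 0, got a reward
s = transition(s, 1, 0.0)   # tried arm 1, no reward
print(s)                    # ((1, 1), (0, 1))
```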
Contextual bandits
CONTEXTUAL BANDITS

Contextual bandit

− a tuple $(\mathcal{A}, \mathcal{S}, \mathcal{R})$
− $\mathcal{A}$ = a known set of actions (“arms”)
− $\mathcal{S} = P(s)$ = an unknown distribution of states (contexts)
− $\mathcal{R}^a_s(r) = P(r \mid s, a)$ = unknown probability distribution of rewards for each action $a$ and context $s$
− $a_t$ = action taken at the time step $t$
− $s_t$ = state/context at the time step $t$
− $r_t$ = reward obtained at the time step $t$

The goal: maximise the cumulative reward $R = \sum_{\tau=1}^{t} r_\tau$
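A minimal sketch of how the contextual case changes the agent: value estimates are kept per (context, action) pair rather than per action alone. The dictionary-based `ContextualEpsilonGreedy` class below is an illustrative assumption, not an algorithm named in the slides.

```python
from collections import defaultdict
import random

class ContextualEpsilonGreedy:
    """Epsilon-greedy with one value estimate per (context, action) pair."""

    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.pulls = defaultdict(int)    # (context, arm) -> N(s, a)
        self.value = defaultdict(float)  # (context, arm) -> q_hat(s, a)

    def select(self, context):
        if random.random() < self.epsilon:               # explore
            return random.randrange(self.n_arms)
        return max(range(self.n_arms),                   # exploit per context
                   key=lambda a: self.value[(context, a)])

    def update(self, context, arm, reward):
        key = (context, arm)
        self.pulls[key] += 1
        # Incremental mean update: q += (r - q) / N
        self.value[key] += (reward - self.value[key]) / self.pulls[key]
```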


Questions?
Examples
EXPLOITATION – EXPLORATION ALGORITHMS

Tetris

States: screen pixels
Actions: rotate, left, right, drop
Rewards: number of cleared lines
EXPLOITATION – EXPLORATION ALGORITHMS

Recommendation systems

Bandits:
Actions (arms): articles
Rewards: clicks, time spent

Contextual bandits:
Context: information about the user/article
EXPLOITATION – EXPLORATION ALGORITHMS

Clinical trials

Bandits:
Actions (arms): medicines
Rewards: healed patients
SESSION 3

TRANSFER TASK
TRANSFER TASK
TASKS

Case study

An online news platform seeks to maximise user engagement by recommending the most relevant articles while balancing click-through rates and diversity of content.
The task involves selecting articles to display to users in order to optimise engagement and satisfaction, considering variables like user preferences, article popularity, and novelty.

Task
Research how the multi-arm bandit problem can enhance article recommendation. Define the arms, rewards, exploration, and exploitation.
TRANSFER TASK
PRESENTATION OF THE RESULTS

Please present your results.
The results will be discussed in plenary.
LEARNING CONTROL QUESTIONS

1. A five-stage finite horizon four-arm bandit problem uses the following sequence of discount factors: ρ[k] = 1, 1, 1, 0.5, 0.25 for k = 1, 2, 3, 4, 5. Suppose the true mean rewards are 24.5, 36.5, 19.0, 32.5 for arms j = 1, 2, 3, 4, respectively. What is the maximum possible total reward?
a) 71.25
b) 95.0
c) 136.875
d) 182.5
LEARNING CONTROL QUESTIONS

2. A three-arm bandit problem is a 100-stage finite horizon process. The true mean rewards of the arms are 10.1, 9.0, and 7.5. Suppose a policy selects each arm for ten stages, calculates the average reward for each, and then selects the arm with the highest observed average. If the optimal reward is correctly identified, what will be the regret?
a) 37.0
b) 266.0
c) 973.0
d) 1010.0
LEARNING CONTROL QUESTIONS

3. For softmax action selection, what best describes the policy as the temperature parameter approaches zero?
a) An exploratory policy
b) A greedy policy
c) An ϵ-greedy policy
d) A UCB1 policy
LIST OF SOURCES

Freedman, B. (1987). Equipoise and the ethics of clinical research. New England Journal of Medicine, 317 (3), 141–145.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press.
© 2022 IU Internationale Hochschule GmbH
This content is protected by copyright. All rights reserved.
This content may not be reproduced and/or electronically edited, duplicated, or distributed in any kind of
form without written permission by the IU Internationale Hochschule GmbH.
