LECTURER: MAJA POPOVIĆ
REINFORCEMENT LEARNING
TOPIC OUTLINE
1. Introduction to Reinforcement Learning
2. Dynamic Programming and Markov Decision Processes
3. Bandit Problems
4. Q-Learning
5. Reinforcement Learning Approaches
6. Summary: Reinforcement Learning
UNIT 3
BANDIT PROBLEMS
REINFORCEMENT LEARNING
Recommended sources:
• https://www.davidsilver.uk/teaching/
on-line lectures (videos + slides)
• https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf
on-line book
Exploration and exploitation
EXPLORATION AND EXPLOITATION
Exploration
− gather as much information about the environment as possible
Exploitation
− exploit known information to maximise the reward
Both are important.
EXPLORATION VS EXPLOITATION DILEMMA
Exploitation
− Choose high reward actions
− Exploit learned information
− Maximize immediate gain
Exploration
− Discover novel actions
− Learn about the environment
− Balance knowledge expansion
When to explore new options and when to exploit what we know?
THE BANDIT PROBLEM
Dilemma: balancing immediate gains vs. exploration for long-term optimal choices in sequential decision-making
− Scenario: conduct clinical trials for multiple treatments
− Problem: how to balance optimal care for individual patients with discovering the overall best medical treatment?
How much to explore / exploit?
Source of the text: Freedman, 1987
THE BANDIT PROBLEM
Goal: balance optimal decisions with exploration in sequential processes
− Adaptation: optimise decisions based on evolving data
− Iterative Learning: continuously refine strategies for maximised gains
Applications: healthcare, finance, online advertising, recommender systems
EXPLORATION AND EXPLOITATION
Exploration-exploitation strategies
− Random exploration
− “optimism in the face of uncertainty”
preference for exploring uncertain states/actions
− Information state space
Bandit problems
Multiple arm bandits
MULTIPLE ARM BANDIT PROBLEM
Multiple Arm Bandit: optimise choices among multiple actions
to maximise cumulative rewards
− Multiple Arms: set of actions
− Reward probability distribution: unknown, uncertain rewards for each action
− Initial Knowledge: limited or no reward distribution information
Maximise rewards:
− Learning: update action preferences based on observed rewards over time
(Figure: “How to choose actions?”)
MULTIPLE ARM BANDIT
(Figure: probability distributions of the rewards for each action, unknown to the agent)
MULTIPLE ARM BANDIT
Multiple arm bandit
− a tuple $(A, R)$
− $A$ = a known set of actions (“arms”)
− $R^a(r) = P(r \mid a)$ = unknown probability distribution of rewards for each action $a$
− $a_t$ = action taken at time step $t$
− $r_t$ = reward obtained at time step $t$
− The goal: maximise the cumulative reward
$R = \sum_{\tau=1}^{t} r_\tau$
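As a minimal sketch of this setup (assuming Bernoulli reward distributions, which the slides leave unspecified; class and variable names are my own):

import random

class MultiArmBandit:
    """k-armed bandit with Bernoulli rewards; the probabilities are hidden from the agent."""
    def __init__(self, reward_probs):
        self.reward_probs = reward_probs  # one unknown P(r = 1 | a) per arm

    def pull(self, action):
        """Return a stochastic reward r_t for the chosen arm a_t."""
        return 1.0 if random.random() < self.reward_probs[action] else 0.0

# cumulative reward R = sum of r_tau for tau = 1..t, here under a uniformly random policy
bandit = MultiArmBandit([0.5, 0.8, 0.2])
R = sum(bandit.pull(random.randrange(3)) for _ in range(1000))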
MULTIPLE ARM BANDIT
Regret
− Mean reward for action $a$: $q(a) = E[r \mid a]$
− Optimal value: $v^* = q(a^*) = \max_a q(a)$
− Regret at time step $t$: $l_t = E[v^* - q(a_t)]$
− Total regret:
$L_t = E\left[\sum_{\tau=1}^{t} \left(v^* - q(a_\tau)\right)\right]$
maximise cumulative reward = minimise total regret
MULTIPLE ARM BANDIT
Regret: a function of counts and differences (gaps)
$L_t = E\left[\sum_{\tau=1}^{t} \left(v^* - q(a_\tau)\right)\right] = \sum_a E[N_t(a)] \cdot \left(v^* - q(a)\right) = \sum_a E[N_t(a)] \cdot \Delta(a)$
$N_t(a)$ = number of times action $a$ was selected up to time step $t$
$\Delta(a)$ = difference between the value of the optimal action $a^*$ and the value of the selected action $a$
$\Delta(a)$ is unknown!
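A small worked example (the arm values are invented purely for illustration): with two arms whose mean rewards are $q(a_1) = 0.8$ and $q(a_2) = 0.5$, the gaps are $\Delta(a_1) = 0$ and $\Delta(a_2) = 0.3$, so
$L_t = E[N_t(a_1)] \cdot 0 + E[N_t(a_2)] \cdot 0.3$
If the policy selects the sub-optimal arm $a_2$ on average 100 times during the first $t$ steps, the total regret is $L_t = 100 \cdot 0.3 = 30$; only the time spent on sub-optimal arms contributes to the regret.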
MULTIPLE ARM BANDIT
Regret
$L_t = \sum_a E[N_t(a)] \cdot \Delta(a)$
increases over time
depending on the strategy, increases linearly or sub-linearly
MULTIPLE ARM BANDIT
Regret
(Figure: total regret over time; depending on the strategy, it increases linearly or sub-linearly)
Exploitation-exploration strategies
EXPLOITATION – EXPLORATION ALGORITHMS
Exploitation-exploration strategies for bandit problems
random
• greedy strategy (pure exploitation)
• 𝜀-greedy strategy (combining exploitation and exploration)
optimism and uncertainty
• UCB – upper confidence bound (better combines exploitation and exploration)
information state space
• introducing states as summaries of all accumulated information
EXPLOITATION – EXPLORATION ALGORITHMS
Greedy strategy
• estimate the value of each action:
$\hat{q}_t(a) = \dfrac{\#\text{times action } a \text{ was selected and won a reward}}{\#\text{times action } a \text{ was selected}}$
• select the action with the highest estimated value:
$a_t^* = \arg\max_a \hat{q}_t(a)$
can get stuck on a sub-optimal action
linear total regret
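A minimal sketch of the greedy rule in Python, assuming the same win-count value estimate as on the slide (function and variable names are my own):

def greedy_action(counts, wins):
    """counts[a] = times arm a was selected, wins[a] = times arm a returned a reward."""
    # estimated value q_hat_t(a); arms never tried default to 0
    q_hat = [wins[a] / counts[a] if counts[a] > 0 else 0.0 for a in range(len(counts))]
    # pure exploitation: always pick the arm with the highest estimated value
    return max(range(len(q_hat)), key=lambda a: q_hat[a])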
EXPLOITATION – EXPLORATION ALGORITHMS
Greedy strategy: getting stuck
• arm A gives a reward with 50% chance
• arm B gives a reward with 80% chance
Unlucky sequence of initial events:
1) tried A and got a reward
2) tried B and did not get a reward
⇒ $\hat{q}(A) > \hat{q}(B) = 0$, so the greedy strategy keeps pulling A and never returns to the better arm B
EXPLOITATION – EXPLORATION ALGORITHMS
𝜀-greedy strategy
• select a random action with probability 𝜀
• select the action with the highest value (as in greedy strategy) with probability 1 − 𝜀
𝜀 = 1 ⇒ pure exploration
𝜀 = 0 ⇒ pure exploitation (greedy strategy)
again a linear total regret
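A self-contained ε-greedy sketch along the same lines (an illustrative implementation, not the course's reference code):

import random

def epsilon_greedy_action(counts, wins, epsilon=0.1):
    """With probability epsilon explore a random arm, otherwise exploit the greedy choice."""
    if random.random() < epsilon:
        return random.randrange(len(counts))  # exploration: uniformly random arm
    q_hat = [wins[a] / counts[a] if counts[a] > 0 else 0.0 for a in range(len(counts))]
    return max(range(len(q_hat)), key=lambda a: q_hat[a])  # exploitation: greedy arm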
EXPLOITATION – EXPLORATION ALGORITHMS
$\varepsilon_t$-greedy strategy
$\varepsilon$ is not constant but decreases over time
• logarithmic total regret
• requires knowledge about rewards/gaps
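One simple decay schedule, sketched under the assumption of a 1/t decay (the constant c is an arbitrary illustrative choice; the theoretically grounded schedule depends on the gaps, as noted above):

def epsilon_t(t, c=1.0):
    """Decaying exploration rate, e.g. epsilon_t = min(1, c / t)."""
    return min(1.0, c / max(t, 1))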
EXPLOITATION – EXPLORATION ALGORITHMS
Upper Confidence Bound (UCB)
(Figure: three possible actions with reward probability distributions, unknown to the agent)
EXPLOITATION – EXPLORATION ALGORITHMS
Upper Confidence Bound (UCB)
• estimate an upper confidence $\hat{U}_t(a)$ for each action value,
so that $q(a) \leq \hat{q}_t(a) + \hat{U}_t(a)$ with high probability
• select the action which maximises the upper confidence bound:
$a_t^* = \arg\max_a \left[\hat{q}_t(a) + \hat{U}_t(a)\right]$
EXPLOITATION – EXPLORATION ALGORITHMS
Upper Confidence Bound (UCB)
upper confidence depends on how many times the action has been selected
• small $N_t(a)$ ⇒ high $\hat{U}_t(a)$ ⇒ high uncertainty
• large $N_t(a)$ ⇒ low $\hat{U}_t(a)$ ⇒ low uncertainty
exploring action $a$ more and more decreases $\hat{U}_t(a)$
⇒ the agent becomes more certain about its choices
EXPLOITATION – EXPLORATION ALGORITHMS
Upper Confidence Bound (UCB)
$a_t^* = \arg\max_a \left[\hat{q}_t(a) + \sqrt{\dfrac{2 \ln t}{N_t(a)}}\right]$
no knowledge about the rewards (differences, gaps) needed
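A minimal UCB1 sketch using the bonus term from the formula above (names are my own; each arm is pulled once before the bound is applied):

import math

def ucb1_action(counts, q_hat, t):
    """Pick argmax_a [ q_hat(a) + sqrt(2 ln t / N_t(a)) ]."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # try every arm once so that N_t(a) > 0
    ucb = [q_hat[a] + math.sqrt(2.0 * math.log(t) / counts[a]) for a in range(len(counts))]
    return max(range(len(ucb)), key=lambda a: ucb[a])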
EXPLOITATION – EXPLORATION ALGORITHMS
Upper Confidence Bound (UCB)
Hoeffding’s inequality
For the sample mean $\bar{X}_t$ of i.i.d. random variables $X_1, \dots, X_t$ in $[0,1]$:
$P\left[E[X] > \bar{X}_t + u\right] \leq e^{-2tu^2}$
Applied to the rewards of action $a$:
$P\left[q(a) > \hat{q}_t(a) + \hat{U}_t(a)\right] \leq e^{-2 N_t(a)\, \hat{U}_t(a)^2}$
Call this probability $p = e^{-2 N_t(a)\, \hat{U}_t(a)^2}$ and choose e.g. $p = t^{-4}$, so that it decreases over time.
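Solving the bound for $\hat{U}_t(a)$ with this choice of $p$ recovers the bonus used in the UCB1 rule above:
$e^{-2 N_t(a)\, \hat{U}_t(a)^2} = t^{-4} \;\Rightarrow\; 2 N_t(a)\, \hat{U}_t(a)^2 = 4 \ln t \;\Rightarrow\; \hat{U}_t(a) = \sqrt{\dfrac{2 \ln t}{N_t(a)}}$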
EXPLOITATION – EXPLORATION ALGORITHMS
Information state space
• Bandit as a sequential decision problem (not only one-state)
• with an “information state” $\tilde{s}$ at each step,
summarising all information accumulated up to that step
Each action causes a transition to a new information state $\tilde{s}'$ (by adding information)
with probability $\tilde{P}^a_{\tilde{s}\tilde{s}'}$
⇒ a Markov Decision Process $(\tilde{S}, A, \tilde{P}, R, \beta)$
Contextual bandits
CONTEXTUAL BANDITS
Contextual bandit
− a tuple $(A, S, R)$
− $A$ = a known set of actions (“arms”)
− $S = P(s)$ = an unknown distribution over states (contexts)
− $R^a_s(r) = P(r \mid s, a)$ = unknown probability distribution of rewards for each action $a$ and context $s$
− $a_t$ = action taken at time step $t$
− $s_t$ = state/context at time step $t$
− $r_t$ = reward obtained at time step $t$
The goal: maximise the cumulative reward $R = \sum_{\tau=1}^{t} r_\tau$
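A hedged sketch of one way to act in a contextual bandit: a tabular ε-greedy learner that keeps a separate value estimate per observed context (a simplification for illustration; class and method names are my own):

import random
from collections import defaultdict

class ContextualEpsilonGreedy:
    """One running value estimate per (context, arm) pair."""
    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts = defaultdict(lambda: [0] * n_arms)    # N_t(s, a)
        self.values = defaultdict(lambda: [0.0] * n_arms)  # q_hat_t(s, a)

    def select(self, context):
        if random.random() < self.epsilon:
            return random.randrange(self.n_arms)  # explore
        values = self.values[context]
        return max(range(self.n_arms), key=lambda a: values[a])  # exploit

    def update(self, context, action, reward):
        # incremental sample mean of observed rewards for (context, action)
        self.counts[context][action] += 1
        n = self.counts[context][action]
        self.values[context][action] += (reward - self.values[context][action]) / n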
Questions?
Examples
EXPLOITATION – EXPLORATION ALGORITHMS
Tetris
States: screen pixels
Actions: rotate, left, right, drop
Rewards: number of cleared lines
EXPLOITATION – EXPLORATION ALGORITHMS
Recommendation systems
Bandits
Actions (arms): articles
Rewards: clicks, time
Contextual bandits:
Context: information about the user/article
EXPLOITATION – EXPLORATION ALGORITHMS
Clinical trials
Bandits
Actions (arms): medicines
Rewards: healed patients
SESSION 3
TRANSFER TASK
TRANSFER TASK
TASKS
Case study
An online news platform seeks to maximize user engagement by recommending the most relevant articles while balancing click-through rates and diversity of content.
The task involves selecting articles to display to users in order to optimize engagement and satisfaction, considering variables like user preferences, article popularity, and novelty.
Task
Research how the multi-arm bandit problem can enhance article recommendation. Define the arms, rewards, exploration, and exploitation.
TRANSFER TASK
PRESENTATION OF THE RESULTS
Please present your results.
The results will be discussed in plenary.
LEARNING CONTROL QUESTIONS
1. A five-stage finite-horizon four-arm bandit problem uses the following sequence of discount factors: ρ[k] = 1, 1, 1, 0.5, 0.25, for k = 1, 2, 3, 4, 5. Suppose the true mean rewards are 24.5, 36.5, 19.0, 32.5 for arms j = 1, 2, 3, 4, respectively. What is the maximum possible total reward?
a) 71.25
b) 95.0
c) 136.875
d) 182.5
LEARNING CONTROL QUESTIONS
2. A three-arm bandit problem is a 100-stage finite horizon process.
The true mean rewards of the arms are 10.1, 9.0 and 7.5. Suppose
a policy selects each arm for ten stages, calculates the average
reward for each, then selects the arm with the highest observed
average. If the optimal reward is correctly identified, what will be
the regret?
a) 37.0
b) 266.0
c) 973.0
d) 1010.0
LEARNING CONTROL QUESTIONS
3. For softmax action selection, what best describes the policy as
the temperature parameter approaches zero?
a) An exploratory policy
b) A greedy policy
c) An ϵ-greedy policy
d) A UCB1 policy
LIST OF SOURCES
Freedman, B. (1987). Equipoise and the ethics of clinical research. New England Journal of Medicine, 317 (3), 141–145.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press.
© 2022 IU Internationale Hochschule GmbH
This content is protected by copyright. All rights reserved.
This content may not be reproduced and/or electronically edited, duplicated, or distributed in any kind of
form without written permission by the IU Internationale Hochschule GmbH.