
Reinforcement Learning (RL)

CE-717: Machine Learning


Sharif University of Technology

M. Soleymani
Fall 2020

Most slides have been taken from Klein and Abbeel, CS188, UC Berkeley.
Reinforcement Learning (RL)
 Learning as a result of interaction with an environment
 to improve the agent’s ability to behave optimally in the future
to achieve the goal.

 The first idea when we think about the nature of learning

 Examples:
 Baby movements
 Learning to drive a car
 Environment’s response affects our subsequent actions
 We find out the effects of our actions later

2
Paradigms of learning
 Supervised learning
 Training data: features and labels for N samples, \{ (x^{(n)}, y^{(n)}) \}_{n=1}^{N}
 Unsupervised learning
 Training data: only features for N samples, \{ x^{(n)} \}_{n=1}^{N}
 Reinforcement learning
 Training data: a sequence of (s, a, r)
 (state, action, reward)
 Agent acts on its environment and receives some evaluation of its action via a reinforcement signal
 It is not told which action is the correct one to achieve its goal

3
Reinforcement Learning (RL)

 𝑆: Set of states
 𝐴: Set of actions

 Goal: Learning an optimal policy (a mapping from states to actions) in order to maximize the long-term reward
 The agent's objective is to maximize the amount of reward it receives over time.

4
Reinforcement Learning: Example
 Chess game (a deterministic game)
 Learning task: choose moves at arbitrary board states
 Training signal: final win or loss
 Training: e.g., play n games against itself

5
Main characteristics and applications of RL
 Main characteristics of RL
 Learning is a multistage decision making process
 Actions influence later perceptions (inputs)
 Delayed reward: actions may affect not only the immediate reward but
also subsequent rewards
 agent must learn from interactions with environment
 Agent must be able to learn from its own experience
 Not entirely supervised, but interactive
 by trial-and-error
 Opportunity for active exploration
 Needs trade-off between exploration and exploitation
 Goal: map states to actions, so as to maximize reward over
time (optimal policy)

6
Main elements of RL
 A reward function
 It maps each state (or, state-action pair) to a real number,
called reward.

 Policy: A map from state space to action space.


 May be stochastic.

 A value function
 Value of a state (or state-action) is the total expected reward,
starting from that state (or state-action).

7
RL problem: deterministic environment
 Deterministic
 Transition and reward functions

 At time 𝑡:
 Agent observes state 𝑠𝑡 ∈ 𝑆
 Then chooses action 𝑎𝑡 ∈ 𝐴
 Then receives reward 𝑟𝑡 , and state changes to 𝑠𝑡+1

 Learn to choose the action a_t in state s_t that maximizes the discounted return:

R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ⋯ = \sum_{k=0}^{∞} γ^k r_{t+k},   0 < γ ≤ 1

Upon visiting the sequence of states s_t, s_{t+1}, … with actions a_t, a_{t+1}, …, R_t gives the total payoff.
8
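To make the return concrete, here is a minimal Python sketch (illustrative only; the reward list and the γ value are made up) that computes the discounted return of a finite episode:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite reward sequence."""
    R = 0.0
    for r in reversed(rewards):   # accumulate from the last reward backwards
        R = r + gamma * R
    return R

# Example: rewards 1, 0, 10 with gamma = 0.9 give 1 + 0.9*0 + 0.81*10 = 9.1
print(discounted_return([1.0, 0.0, 10.0]))
```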
RL problem: stochastic environment

 Stochastic environment
 Stochastic transition and/or reward function

 Learn to choose a policy that maximizes the expected discounted return starting from state s_t:

E[R_t] = E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ⋯ ],   where R_t = \sum_{k=0}^{∞} γ^k r_{t+k}

9
RL deterministic world: Example
 Example: Robot grid world
 Deterministic and known reward and transitions

10
Non-deterministic world: Example

[Russell, AIMA, 2010]

Reward: −0.04 for all other actions

 What is the policy that achieves the maximum reward?

11
Optimal policy

(Optimal policies shown for different step rewards: r = −0.04, r = −0.4, and r = −4 for other actions.)


12
[Russell, AIMA, 2010]
RL: Autonomous Agent
 Execute actions in environment, observe results, and learn
 Learn a (perhaps stochastic) policy that maximizes
E[ \sum_{k=0}^{∞} γ^k r_{t+k} | s_t = s, π ]   for every state s ∈ S

 Function to be learned is the policy π: S × A → [0, 1]
(when the policy is deterministic, π: S → A)
 Training examples in supervised learning: (s, a) pairs
 RL training data shows the amount of reward for a pair (s, a):
 training data are of the form ((s, a), r)

13
Markov Decision Process (RL Setting)
 We encounter a multistage decision making process.

 Markov assumption:
P(s_{t+1}, r_t | s_t, a_t, r_{t-1}, s_{t-1}, a_{t-1}, r_{t-2}, …) = P(s_{t+1}, r_t | s_t, a_t)

 Markov property: Transition probabilities depend on the state only, not on the path to the state.

 Goal: for every possible state s ∈ S, learn a policy π for choosing actions that maximizes the (infinite-horizon discounted) reward:

E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ⋯ | s_t = s, π ],   0 < γ ≤ 1
14
Markov Decision Process
 MDPs provide a way to think about how we can act
optimally under uncertainty

 Components:
 states
 actions
 state transition probabilities
 reward function
 discount factor

15
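As an illustration of these components, a finite MDP can be written down directly as data. The sketch below is only one possible representation (nothing in the slides prescribes it); it borrows the high/low states of the recycling-robot example that follows, with made-up probabilities and rewards:

```python
# P[s][a] is a list of (probability, next_state, reward) triples.
mdp = {
    "states": ["high", "low"],
    "actions": {"high": ["search", "wait"],
                "low": ["search", "wait", "recharge"]},
    "gamma": 0.9,
    "P": {
        "high": {"search": [(0.7, "high", 4.0), (0.3, "low", 4.0)],
                 "wait":   [(1.0, "high", 1.0)]},
        "low":  {"search": [(0.6, "low", 4.0), (0.4, "high", -3.0)],
                 "wait":   [(1.0, "low", 1.0)],
                 "recharge": [(1.0, "high", 0.0)]},
    },
}
```

The later sketches reuse this (probability, next_state, reward) convention.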
(Search-tree diagram: from state s, an action a leads to the q-state (s, a); a transition (s, a, s') leads to the next state s'.)

16
MDP: Recycling Robot example
 S = {high, low}
 A = {search, wait, recharge}
 𝒜(high) = {search, wait}   (available actions in the 'high' state)
 𝒜(low) = {search, wait, recharge}
 ℛ^{search} > ℛ^{wait}

P(s_{t+1} = high | s_t = high, a_t = search)   (a transition probability labelled in the figure)

17
State-value function for policy 𝜋
 Given a policy 𝜋, define value function

V^π(s) = E[ \sum_{k=0}^{∞} γ^k r_{t+k} | s_t = s, π ]

 V^π(s): how good it is for the agent to be in state s when its policy is π
 It is simply the expected sum of discounted rewards upon
starting in state s and taking actions according to 𝜋

18
Known Environments

 We first assume that the MDP is known, although in RL problems we do not know it
 𝒫^a_{ss'} and ℛ^a_{ss'} are known

 Dynamic programming methods are used for solving the problem when the MDP is known
 Value iteration and Policy iteration are two classic approaches to this problem.

19
Recursive definition for 𝑉 𝜋 (𝑆)
V^π(s) = E[ \sum_{k=0}^{∞} γ^k r_{t+k} | s_t = s, π ]

       = E[ r_t + γ \sum_{k=1}^{∞} γ^{k−1} r_{t+k} | s_t = s, π ]

       = E[ r_t + γ \sum_{k=0}^{∞} γ^{k} r_{t+k+1} | s_t = s, π ]

       = \sum_{s'} 𝒫^{π(s)}_{ss'} [ ℛ^{π(s)}_{ss'} + γ E[ \sum_{k=0}^{∞} γ^k r_{t+k+1} | s_{t+1} = s', π ] ]

where 𝒫^a_{ss'} = P(s_{t+1} = s' | s_t = s, a_t = a),  ℛ^a_{ss'} = E[ r_t | s_t = s, a_t = a, s_{t+1} = s' ],
and the last expectation is just V^π(s').

Bellman equations:   V^π(s) = \sum_{s'} 𝒫^{π(s)}_{ss'} [ ℛ^{π(s)}_{ss'} + γ V^π(s') ]

Base equation for dynamic programming approaches


20
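Because the Bellman equations for a fixed policy are linear in V^π, a small known MDP can be evaluated exactly. A minimal sketch, assuming the dictionary representation from the earlier MDP sketch and NumPy:

```python
import numpy as np

def evaluate_policy_exact(states, policy, P, gamma):
    """Solve V = R_pi + gamma * P_pi V, the Bellman equations for a fixed policy."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    P_pi = np.zeros((n, n))          # transition matrix under the policy
    R_pi = np.zeros(n)               # expected immediate reward under the policy
    for s in states:
        for prob, s_next, r in P[s][policy[s]]:
            P_pi[idx[s], idx[s_next]] += prob
            R_pi[idx[s]] += prob * r
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
    return {s: V[idx[s]] for s in states}
```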
Backup diagram

V^π(s) = \sum_{s'} 𝒫^{π(s)}_{ss'} [ ℛ^{π(s)}_{ss'} + γ V^π(s') ]

(Backup diagram: state s, action π(s), q-state (s, π(s)), transition (s, π(s), s'), successor state s'.)

21
State-action value function for policy 𝜋

Q^π(s, a) = E[ \sum_{k=0}^{∞} γ^k r_{t+k} | s_t = s, a_t = a, π ]
          = \sum_{s'} 𝒫^a_{ss'} [ ℛ^a_{ss'} + γ E[ \sum_{k=0}^{∞} γ^k r_{t+k+1} | s_{t+1} = s', π ] ]

where the inner expectation is V^π(s'), so

Q^π(s, a) = \sum_{s'} 𝒫^a_{ss'} [ ℛ^a_{ss'} + γ V^π(s') ]

(Backup diagram: state s, action a, q-state (s, a), transition (s, a, s'), successor state s'.)

22
State-value function for policy 𝜋: example
 Deterministic example (γ = 0.9):

V^π(s) = ℛ^{π(s)}_{ss'} + γ V^π(s')

(Grid-world figure with state values 66, 73, 81, 90, and 100 under the shown policy.)

23
Grid-world: value function example

[Russell, AIMA, 2010]

24
Optimal policy
 Policy π is better than (or equal to) π′ (i.e., π ≥ π′) iff
V^π(s) ≥ V^{π′}(s),   ∀s ∈ S

 Optimal policy π^* is better than (or equal to) all other policies (∀π, π^* ≥ π)

 Optimal policy π^*:
π^* = argmax_π V^π(s),   ∀s ∈ S

25
MDP: Optimal policy
state-value and action-value functions
 Optimal policies share the same optimal state-value function (V^{π^*}(s) will be abbreviated as V^*(s)):
V^*(s) = \max_{π} V^π(s),   ∀s ∈ S

 And the same optimal action-value function:
Q^*(s, a) = \max_{π} Q^π(s, a),   ∀s ∈ S, a ∈ 𝒜(s)

 For any MDP, a deterministic optimal policy exists!

26
Optimal Quantities

▪ The value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally

▪ The value (utility) of a q-state (s, a):
Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

▪ The optimal policy:
π*(s) = optimal action from state s

(Diagram: s is a state, (s, a) is a q-state, and (s, a, s') is a transition.)

27
Bellman optimality equation

V^*(s) = \max_{a ∈ 𝒜(s)} Q^*(s, a)

Q^*(s, a) = \sum_{s'} 𝒫^a_{ss'} [ ℛ^a_{ss'} + γ V^*(s') ]

V^*(s) = \max_{a ∈ 𝒜(s)} \sum_{s'} 𝒫^a_{ss'} [ ℛ^a_{ss'} + γ V^*(s') ]

Q^*(s, a) = \sum_{s'} 𝒫^a_{ss'} [ ℛ^a_{ss'} + γ \max_{a'} Q^*(s', a') ]

28
Snapshot of Demo – Gridworld V Values

Noise = 0.2
Discount = 0.9
Living reward = 0
29
Snapshot of Demo – Gridworld Q Values

Noise = 0.2
Discount = 0.9
Living reward = 0
30
Optimal policy
 If we have V^*(s) and P(s_{t+1} | s_t, a_t), we can compute π^*(s):

π^*(s) = argmax_a \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V^*(s') ]

 It can also be computed as:

π^*(s) = argmax_{a ∈ 𝒜(s)} Q^*(s, a)

 The optimal policy has the interesting property that it is the optimal policy for all states.
 All optimal policies share the same optimal state-value function
 It does not depend on the initial state.
 Use the same policy no matter what the initial state of the MDP is

31
Value Iteration algorithm
Consider only MDPs with finite state and action spaces:

1) Initialize all V(s) to zero
2) Repeat until convergence
    for s ∈ S
       V^{new}(s) ← \max_{a} \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V(s') ]
    V ← V^{new}
3) for s ∈ S
    π(s) ← argmax_a \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V(s') ]

V(s) converges to V^*(s)

Asynchronous: Instead of updating the values of all states at once in each iteration, it can update them state by state, or update some states more often than others.

32
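A minimal sketch of the algorithm above, assuming the (probability, next_state, reward) transition lists of the earlier MDP sketch; the in-place update corresponds to the asynchronous variant mentioned on the slide:

```python
def value_iteration(states, actions, P, gamma, theta=1e-8):
    """Iterate the Bellman optimality update until values change by less than theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q_values = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in actions[s]]
            v_new = max(q_values)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                      # asynchronous (in-place) update
        if delta < theta:
            break
    policy = {s: max(actions[s],
                     key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in states}
    return V, policy
```

With the dictionary MDP above, it would be called as value_iteration(mdp["states"], mdp["actions"], mdp["P"], mdp["gamma"]).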
Value Iteration

 Bellman equations characterize the optimal values:

V^*(s) = \max_{a} \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V^*(s') ]

 Value iteration computes them:

V_{k+1}(s) ← \max_{a} \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V_k(s') ]

 Value iteration is just a fixed-point solution method
 … though the V_k vectors are also interpretable as time-limited values

33
Value Iteration

 Start with V_0(s) = 0: no time steps left means an expected reward sum of zero

 Given the vector of V_k(s) values, do one ply of expectimax from each state:

V_{k+1}(s) ← \max_{a} \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V_k(s') ]

 Repeat until convergence

 Complexity of each iteration: O(S^2 A)

 Theorem: will converge to unique optimal values
 Basic idea: approximations get refined towards optimal values
 Policy may converge long before values do

34
Gridworld value-iteration demo: snapshots of the values for k = 0, 1, 2, …, 12, and k = 100
(Noise = 0.2, Discount = 0.9, Living reward = 0)
Convergence*
 Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values

 Case 2: If the discount is less than 1
 Sketch: For any state, V_k and V_{k+1} can be viewed as depth k+1 expectimax results on nearly identical search trees
 The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros
 That last layer is at best all R_MAX
 It is at worst R_MIN
 But everything is discounted by γ^k that far out
 So V_k and V_{k+1} are at most γ^k max|R| different
 So as k increases, the values converge

\max_{s ∈ 𝒮} | V_k(s) − V^*(s) | ≤ γ^k max|R| / (1 − γ)

49
Convergence and contractions
 Let ‖·‖ denote the max norm: ‖V‖ = \max_{s} |V(s)|

 Theorem: For any two approximations V and V′:
‖V_{i+1} − V′_{i+1}‖ ≤ γ ‖V_i − V′_i‖   (the Bellman operator is a contraction)

 Any two approximations get closer to each other after a Bellman update.

 Since V^* is a fixed point of the Bellman update (V^*_{i+1} = V^*_i), any approximation gets closer to V^*
 This implies existence and uniqueness of the fixed point

50
Value Iteration
 Value iteration works even if we randomly traverse the
environment instead of looping through each state and action
(update asynchronously)
 but we must still visit each state infinitely often

 Value Iteration
 It is expensive in time and memory

51
Problems with Value Iteration
 Value iteration repeats the Bellman updates:

V_{k+1}(s) ← \max_{a} \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V_k(s') ]

 Problem 1: It's slow – O(S^2 A) per iteration

 Problem 2: The "max" at each state rarely changes

 Problem 3: The policy often converges long before the values

52
Convergence: Example

[Russell, AIMA, 2010]

53
Computing Actions from Values

 Let's imagine we have the optimal values V^*(s)

 How should we act?
 It's not obvious!

 We need to do a one-step lookahead:

π^*(s) = argmax_a \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V^*(s') ]

 This is called policy extraction, since it gets the policy implied by the values

54
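A minimal sketch of policy extraction, contrasting extraction from V* (which needs the model, as above) with extraction from Q* (which, as the next slide notes, is just an argmax); the data layout is the same assumption as in the earlier sketches:

```python
def policy_from_values(states, actions, P, V_star, gamma):
    """One-step lookahead over the model, using the optimal values V*."""
    return {s: max(actions[s],
                   key=lambda a: sum(p * (r + gamma * V_star[s2]) for p, s2, r in P[s][a]))
            for s in states}

def policy_from_q_values(states, actions, Q_star):
    """With Q*, no model is needed: just pick the argmax action in each state."""
    return {s: max(actions[s], key=lambda a: Q_star[(s, a)]) for s in states}
```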
Computing Actions from Q-Values
 Let's imagine we have the optimal q-values Q^*(s, a)

 How should we act?
 Completely trivial to decide: π^*(s) = argmax_a Q^*(s, a)

 Important lesson: actions are easier to select from q-values than from values!

55
Policy Iteration Algorithm
1) Initialize π(s) arbitrarily
2) Repeat until convergence
    Compute the value function for the current policy π (i.e., V^π)
    V ← V^π
    for s ∈ S
       π(s) ← argmax_a \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V(s') ]

The inner loop updates the policy (greedily) using the current value function.
π(s) converges to π^*(s)

56
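A minimal sketch of the algorithm above, reusing evaluate_policy_exact from the earlier sketch for the evaluation step (an assumption; any policy-evaluation routine would do):

```python
def policy_iteration(states, actions, P, gamma):
    """Alternate exact policy evaluation and greedy policy improvement."""
    policy = {s: actions[s][0] for s in states}        # arbitrary initial policy
    while True:
        V = evaluate_policy_exact(states, policy, P, gamma)
        stable = True
        for s in states:
            best = max(actions[s],
                       key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:                                     # policy unchanged: it is optimal
            return policy, V
```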
Fixed Policies Evaluation

 Do the optimal action: max over all actions to compute the optimal values
 Do what π says to do: with some fixed policy π(s), the tree is simpler – only one action per state

(Backup diagrams: on the left, state s with all actions a, q-states (s, a), and transitions (s, a, s'); on the right, state s with only the action π(s), q-state (s, π(s)), and transitions (s, π(s), s').)

57
Utilities for a Fixed Policy

 Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π

 Recursive relation (one-step look-ahead / Bellman equation):

V^π(s) = \sum_{s'} P(s' | s, π(s)) [ R(s, π(s), s') + γ V^π(s') ]

 V^π(s) = expected total discounted rewards starting in s and following π

58
Policy Evaluation
 How do we calculate the V's for a fixed policy π?

 Idea 1: Turn the recursive Bellman equations into updates (like value iteration):

V^π_{k+1}(s) ← \sum_{s'} P(s' | s, π(s)) [ R(s, π(s), s') + γ V^π_k(s') ]

 Efficiency: O(S^2) per iteration

 Idea 2: Without the maxes, the Bellman equations are just a linear system

59
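A minimal sketch of Idea 1 (the fixed-policy Bellman update applied repeatedly); Idea 2 is what the earlier evaluate_policy_exact sketch does by solving the linear system directly:

```python
def evaluate_policy_iterative(states, policy, P, gamma, sweeps=1000):
    """Repeated synchronous sweeps of the fixed-policy Bellman update."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
             for s in states}
    return V
```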
Policy Iteration
 Evaluation: For the fixed current policy π_i, find values with policy evaluation:
 Iterate until values converge:

V^{π_i}_{k+1}(s) ← \sum_{s'} P(s' | s, π_i(s)) [ R(s, π_i(s), s') + γ V^{π_i}_k(s') ]

 Improvement: For fixed values, get a better policy using policy extraction
 One-step look-ahead:

π_{i+1}(s) ← argmax_a \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V^{π_i}_k(s') ]

60
Policy Iteration
 Repeat steps until policy converges
 Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal
utilities!) until convergence
 Step 2: Policy improvement: update policy using one-step look-ahead with
resulting converged (but not optimal!) utilities as future values

 This is policy iteration


 It’s still optimal!
 It is guaranteed to converge and at convergence, the current policy and its value function are
the optimal policy and the optimal value function.
 Can converge (much) faster under some conditions

61
Policy Iteration Optimality: Proof Sketch
 Convergence:
 Every step improves the policy, and each policy is achieved at most once. Thus, the number of iterations is at most as large as the number of different policies (i.e., |𝒜|^{|𝒮|})

 Optimal at convergence:
 At convergence, ∀s: π_{k+1}(s) = π_k(s).
 Indeed, ∀s: V^{π_k}(s) = \max_{a} \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V^{π_k}(s') ]
 Therefore, V^{π_k} satisfies the Bellman optimality equation and thus V^{π_k} = V^*

62
When to stop iterations:

63 [Russell, AIMA, 2010]


Comparison
 Both value iteration and policy iteration compute the same thing (all
optimal values)

 In value iteration:
 Every iteration updates both the values and (implicitly) the policy
 We don’t track the policy, but taking the max over actions implicitly recomputes
it

 In policy iteration:
 We do several passes that update utilities with fixed policy (each pass is fast
because we consider only one action, not all of them)
 After the policy is evaluated, a new policy is chosen (slow like a value iteration
pass)
 The new policy will be better (or we’re done)

 Both are dynamic programs for solving MDPs


64
MDP Algorithms: Summary
 So you want to….
 Compute optimal values: use value iteration or policy iteration
 Compute values for a particular policy: use policy evaluation
 Turn your values into a policy: use policy extraction (one-step lookahead)

65
Unknown transition model
 So far: learning the optimal policy when we know 𝒫^a_{ss'} and ℛ^a_{ss'}
 it requires prior knowledge of the environment's dynamics

 If a model is not available, then it is particularly useful to estimate action values rather than state values

66
Reinforcement Learning

 Still assume a Markov decision process (MDP):


 A set of states s ∈ S
 A set of actions (per state) A
 A model P(s'|a,s)
 A reward function R(s,a,s')
 Still looking for a policy π(s)

 New twist: don’t know P or R


 I.e. we don’t know which states are good or what the actions do
 Must actually try actions and states out to learn

67
Reinforcement Learning

(Agent–environment loop: the agent observes state s, takes actions a, and receives rewards r from the environment.)

 Basic idea:
 Receive feedback in the form of rewards
 Agent’s utility is defined by the reward function
 Must (learn to) act so as to maximize expected rewards
 All learning is based on observed samples of outcomes!

68
What Just Happened?
 That wasn’t planning, it was learning!
 Specifically, reinforcement learning
 There was an MDP, but you couldn’t solve it with just computation
 You needed to actually act to figure it out

 Important ideas in reinforcement learning that came up


 Sampling: because of chance, you have to try things repeatedly
 Difficulty: learning can be much harder than solving a known MDP
 Exploration: you have to try unknown actions to get information
 Exploitation: eventually, you have to use what you know
 Regret: even if you learn intelligently, you make mistakes

69
RL algorithms
 Model-based (passive)
 Learn model of environment (transition and reward
probabilities)
 Then, value iteration or policy iteration algorithms

 Model-free (active)

70
Model-Based Learning
 Model-Based Idea:
 Learn an approximate model based on experiences
 Solve for values as if the learned model were correct

 Step 1: Learn empirical MDP model


 Count outcomes s' for each (s, a)
 Normalize to give an estimate of \hat{P}(s' | s, a)
 Discover each reward R(s, a, s') when we experience (s, a, s')

 Step 2: Solve the learned MDP


 For example, use value iteration, as before

71
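A minimal sketch of Step 1, building an empirical model from observed (s, a, r, s') samples in the same (probability, next_state, reward) format as the earlier sketches, so that Step 2 can reuse the value_iteration sketch:

```python
from collections import defaultdict

def estimate_model(transitions):
    """Empirical MDP model: P_hat[s][a] is a list of (prob, next_state, avg_reward)."""
    counts = defaultdict(int)
    reward_sums = defaultdict(float)
    for s, a, r, s2 in transitions:
        counts[(s, a, s2)] += 1
        reward_sums[(s, a, s2)] += r
    totals = defaultdict(int)
    for (s, a, s2), c in counts.items():
        totals[(s, a)] += c
    P_hat = defaultdict(lambda: defaultdict(list))
    for (s, a, s2), c in counts.items():
        P_hat[s][a].append((c / totals[(s, a)], s2, reward_sums[(s, a, s2)] / c))
    return P_hat
```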
Model free approaches to solve RL problems
 These methods can be categorized into:
 Monte Carlo methods
 Temporal-difference learning
 Q-learning is a more recent approach to this problem.

72
Direct Evaluation of a Policy

 Before discussing the Monte Carlo approach for finding an optimal policy, we first solve a simpler problem, "policy evaluation", by Monte Carlo

 Goal: Compute values for each state under π

 Idea: Average together observed sample values
 Act according to π
 Every time you visit a state, write down what the sum of discounted rewards turned out to be
 Average those samples

73
Example: Direct Policy Evaluation

Input Policy π — Observed Episodes (Training) — Output Values

Episode 1: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
Episode 2: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
Episode 3: E, north, C, -1;  C, east, D, -1;  D, exit, x, +10
Episode 4: E, north, C, -1;  C, east, A, -1;  A, exit, x, -10

Assume: γ = 1

Output values:  A = -10,  B = +8,  C = +4,  D = +10,  E = -2

74
Monte Carlo methods
 do not assume complete knowledge of the environment

 require only experience


 sample sequences of states, actions, and rewards from on-line
or simulated interaction with an environment

 are based on averaging sample returns


 are defined for episodic tasks

75
Monte Carlo policy evaluation
1) Initialize Q arbitrarily and Returns(s, a) to empty lists
2) Repeat
    Generate an episode using π and exploring starts
    for each pair (s, a) appearing in the episode
       R ← return following the first occurrence of (s, a)
       Append R to Returns(s, a)
       Q(s, a) ← average(Returns(s, a))

76
Monte Carlo RL algorithm
1) Initialize Q and π arbitrarily and Returns(s, a) to empty lists
2) Repeat
    Generate an episode using π and exploring starts
    for each pair (s, a) appearing in the episode
       R ← return following the first occurrence of (s, a)
       Append R to Returns(s, a)
       Q(s, a) ← average(Returns(s, a))
    for each s in the episode
       π(s) ← argmax_a Q(s, a)

77
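A minimal sketch of the first-visit Monte Carlo update above; the episode format (a list of (state, action, reward) tuples from an exploring-start rollout) is an assumption:

```python
from collections import defaultdict

def mc_update(episode, Q, returns, gamma=1.0):
    """First-visit Monte Carlo: update Q(s, a) from one episode of (s, a, r) tuples."""
    # Return following each time step, computed by scanning backwards.
    G_after = [0.0] * (len(episode) + 1)
    for t in range(len(episode) - 1, -1, -1):
        _, _, r = episode[t]
        G_after[t] = r + gamma * G_after[t + 1]
    seen = set()
    for t, (s, a, _) in enumerate(episode):
        if (s, a) not in seen:                     # first occurrence of (s, a)
            seen.add((s, a))
            returns[(s, a)].append(G_after[t])
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    return Q

def greedy_policy(Q, actions):
    """Policy improvement step: pi(s) = argmax_a Q(s, a)."""
    return {s: max(acts, key=lambda a: Q.get((s, a), 0.0)) for s, acts in actions.items()}

# Typical usage: Q, returns = {}, defaultdict(list); call mc_update once per generated episode.
```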
Problems with Direct Evaluation

 What's good about direct evaluation?
 It's easy to understand
 It doesn't require any knowledge of P, R
 It eventually computes the correct average values, using just sample transitions

 What's bad about it?
 It wastes information about state connections
 Each state must be learned separately
 So, it takes a long time to learn

(Output values from the previous example: A = -10, B = +8, C = +4, D = +10, E = -2. If B and E both go to C under this policy, how can their values be different?)

78
Connections between states

 Simplified Bellman updates calculate V for a fixed policy:
 Each round, replace V with a one-step-look-ahead layer over V

V^π_{k+1}(s) ← \sum_{s'} P(s' | s, π(s)) [ R(s, π(s), s') + γ V^π_k(s') ]

 This approach fully exploited the connections between the states
 Unfortunately, we need T and R to do it!

 Key question: how can we do this update to V without knowing T and R?
 In other words, how do we take a weighted average without knowing the weights?

79
Connections between states

 We want to improve our estimate of V by computing these averages:

V^π_{k+1}(s) ← \sum_{s'} P(s' | s, π(s)) [ R(s, π(s), s') + γ V^π_k(s') ]

 Idea: Take samples of outcomes s' (by doing the action!) and average

(Diagram: from state s, take action π(s) and observe sample successors s_1', s_2', s_3', …)
Almost! But we can't rewind time to get sample after sample from state s.

80
Temporal Difference Learning
 Big idea: learn from every experience!
 Update V(s) each time we experience a transition (s, a, s', r)
 Likely outcomes s' will contribute updates more often

 Temporal difference learning of values
 Policy still fixed, still doing evaluation!
 Move values toward the value of whatever successor occurs: running average

Sample of V(s):   sample = r + γ V^π(s')
Update to V(s):   V^π(s) ← (1 − α) V^π(s) + α · sample
Same update:      V^π(s) ← V^π(s) + α (sample − V^π(s))

81
Exponential Moving Average

 Exponential moving average
 The running interpolation update:   \bar{x}_n = (1 − α) · \bar{x}_{n−1} + α · x_n
 Makes recent samples more important than older ones
 Forgets about the past (distant past values were wrong anyway)

 Decreasing the learning rate (alpha) can give converging averages

82
Example: Temporal Difference Learning

States and observed transitions (assume γ = 1, α = 1/2):   B, east, C, -2   then   C, east, D, -2

Values before and after the two updates:
  A: 0 → 0 → 0
  B: 0 → -1 → -1
  C: 0 → 0 → 3
  D: 8 → 8 → 8
  E: 0 → 0 → 0

83
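A minimal sketch of the TD(0) update, checked against the example above (γ = 1, α = 1/2):

```python
def td0_update(V, s, r, s_next, alpha=0.5, gamma=1.0):
    """TD(0): move V(s) toward the observed sample r + gamma * V(s')."""
    sample = r + gamma * V.get(s_next, 0.0)
    V[s] = V.get(s, 0.0) + alpha * (sample - V.get(s, 0.0))
    return V

V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}
td0_update(V, "B", -2, "C")    # V(B): 0 -> -1
td0_update(V, "C", -2, "D")    # V(C): 0 -> 3
```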
Temporal Difference Methods
 TD learning is a combination of MC and DP (i.e., Bellman equation) ideas.
 Like MC methods, it can learn directly from raw experience without a model of the environment's dynamics.

 Like DP, it updates estimates based in part on other learned estimates, without waiting for a final outcome.

84
Problems with TD Value Learning
 TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages

 However, if we want to turn values into a (new) policy, we're sunk.

85
Unknown transition model: New policy
 With a model, state values alone are sufficient to determine a policy
 simply look ahead one step and choose whichever action leads to the best combination of reward and next state:

π^*(s) = argmax_a \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V^*(s') ]

 Without a model, state values alone are not sufficient.

 However, if the agent knows Q(s, a), it can choose the optimal action without knowing P and R:

π^*(s) = argmax_a Q^*(s, a)

 Idea: learn Q-values, not state values
 Makes action selection model-free too!

86
Detour: Q-Value Iteration

 Value iteration: find successive (depth-limited) values
 Start with V_0(s) = 0, which we know is right
 Given V_k, calculate the depth k+1 values for all states:

V_{k+1}(s) ← \max_{a} \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ V_k(s') ]

 But Q-values are more useful, so compute them instead
 Start with Q_0(s, a) = 0, which we know is right
 Given Q_k, calculate the depth k+1 q-values for all q-states:

Q_{k+1}(s, a) ← \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ \max_{a'} Q_k(s', a') ]

87
Q-Learning
 We'd like to do Q-value updates to each Q-state:

Q_{k+1}(s, a) ← \sum_{s'} P(s' | s, a) [ R(s, a, s') + γ \max_{a'} Q_k(s', a') ]

 But we can't compute this update without knowing P, R

 Instead, compute the average as we go
 Receive a sample transition (s, a, r, s')
 This sample suggests Q(s, a) ≈ r + γ \max_{a'} Q(s', a')
 But we want to average over results from (s, a) (Why?)
 So keep a running average

88
Q-Learning

 Learn Q(s, a) values as you go
 Receive a sample (s, a, s', r)
 Consider your old estimate: Q(s, a)
 Consider your new sample estimate: sample = r + γ \max_{a'} Q(s', a')

 Incorporate the new estimate into a running average:
Q(s, a) ← (1 − α) Q(s, a) + α · sample

89
Model-Free Learning

 Model-free (temporal difference) learning
 Experience the world through episodes (s, a, r, s', a', r', s'', …)
 Update estimates at each transition
 Over time, the updates will mimic Bellman updates
s’’

90
Video of Demo Q-Learning Auto Cliff Grid

91
Exploration/exploitation tradeoff
 Exploitation: High rewards from trying previously-well-
rewarded actions

 Exploration: Which actions are best?


 Must try ones not tried before

92
How to Explore?

 Several schemes for forcing exploration
 Simplest: random actions (ε-greedy)
 Every time step, flip a coin
 With (small) probability ε, act randomly
 With (large) probability 1−ε, act on the current policy

 Problems with random actions?
 You do eventually explore the space, but keep thrashing around once learning is done
 One solution: lower ε over time
 Another solution: exploration functions

93
Regret
 Even if you learn the optimal policy, you still make mistakes along the way!
 Regret is a measure of your total mistake cost: the difference between
your (expected) rewards, including youthful suboptimality, and optimal
(expected) rewards
 Minimizing regret goes beyond learning to be optimal – it requires
optimally learning to be optimal
 Example: random exploration and exploration functions both end up
optimal, but random exploration has higher regret

94
Q-learning: Policy
 Greedy action selection:
π(s) = argmax_a \hat{Q}(s, a)

 ε-greedy: greedy most of the time, occasionally take a random action

 Softmax policy: give a higher probability to the actions that currently have better utility, e.g.,

π(s, a) = b^{\hat{Q}(s, a)} / \sum_{a'} b^{\hat{Q}(s, a')}

 After learning Q^*, is the policy greedy?

95
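A minimal sketch of ε-greedy and softmax action selection over a tabular Q estimate (a dictionary keyed by (state, action) is an assumption):

```python
import math
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps act randomly, otherwise act greedily w.r.t. Q."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def softmax_action(Q, s, actions, b=math.e):
    """Sample an action with probability proportional to b ** Q(s, a)."""
    prefs = [b ** Q.get((s, a), 0.0) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]
```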
Q-learning Algorithm
 Initialize \hat{Q}(s, a) arbitrarily
 Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
       Choose a from s using a policy derived from \hat{Q} (e.g., ε-greedy, softmax, …)
       Take action a, receive reward r, observe new state s'
       \hat{Q}(s, a) ← \hat{Q}(s, a) + α [ r + γ \max_{a'} \hat{Q}(s', a') − \hat{Q}(s, a) ]
       s ← s'
    until s is terminal

96
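A minimal sketch of the algorithm above with an ε-greedy behaviour policy (reusing the epsilon_greedy sketch); the env.reset() / env.step(a) -> (s', r, done) interface is an assumption, not a specific library:

```python
def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning; Q is a dict keyed by (state, action)."""
    Q = {}
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions[s], eps)
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * max(Q.get((s_next, a2), 0.0)
                                                    for a2 in actions[s_next])
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
            s = s_next
    return Q
```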
Q-learning convergence
 Q-learning converges to optimal Q-values if
 Every state is visited infinitely often
 The policy for action selection becomes greedy as time
approaches infinity
 The step size parameter is chosen appropriately

 Stochastic approximation conditions


 The learning rate is decreased fast enough but not too fast

97
Q-Learning Properties
 Amazing result: Q-learning converges to optimal policy --
even if you’re acting suboptimally!

 This is called off-policy learning

 Caveats:
 You have to explore enough
 You have to eventually make the learning rate small enough
 … but not decrease it too quickly
 Basically, in the limit, it doesn’t matter how you select actions (!)

98
SARSA (State-Action-Reward-State-Action)
 Initialize \hat{Q}(s, a)
 Repeat (for each episode):
    Initialize s
    Choose a from s using a policy derived from \hat{Q} (e.g., greedy, ε-greedy)
    Repeat (for each step of episode):
       Take action a, receive reward r, observe new state s'
       Choose a' from s' using a policy derived from \hat{Q}
       \hat{Q}(s, a) ← \hat{Q}(s, a) + α [ r + γ \hat{Q}(s', a') − \hat{Q}(s, a) ]
       s ← s'
       a ← a'
    until s is terminal

99
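A minimal SARSA sketch under the same assumed environment interface; note that the bootstrap uses the action actually chosen in s', which is what makes it on-policy:

```python
def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular SARSA with an epsilon-greedy policy for both acting and bootstrapping."""
    Q = {}
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions[s], eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r
            else:
                a_next = epsilon_greedy(Q, s_next, actions[s_next], eps)
                target = r + gamma * Q.get((s_next, a_next), 0.0)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
            if not done:
                s, a = s_next, a_next
    return Q
```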
Q-learning vs. SARSA

 Q-learning is an off-policy learning algorithm

 SARSA is an on-policy learning algorithm

100
Tabular methods: Problem
 All of the introduced methods maintain a table

 Table size can be very large for complex environments


 Too many states to visit them all in training
 We may not even visit some states

 Too many states to hold the q-tables in memory


 So there is both a computation and a memory problem

101
Generalizing Across States

 Basic Q-Learning keeps a table of all q-values

 In realistic situations, we cannot possibly learn


about every single state!

 Instead, we want to generalize:


 Learn about some small number of training states from
experience
 Generalize that experience to new, similar situations
 This is a fundamental idea in machine learning, and we’ll
see it over and over again

102
Example: Pacman

Let's say we discover through experience that this state is bad. In naïve q-learning, we know nothing about this state, or even this one!

103
Feature-Based Representations

 Solution: describe a state using a vector of


features (properties)
 Features are functions from states to real numbers
(often 0/1) that capture important properties of the
state
 Example features:
 Distance to closest ghost
 Distance to closest dot
 Number of ghosts
 1 / (dist to dot)^2
 Is Pacman in a tunnel? (0/1)
 …… etc.
 Is it the exact state on this slide?
 Can also describe a q-state (s, a) with features (e.g.
action moves closer to food)

104
Linear Value Functions
 Using a feature representation, we can write a q function (or value function) for any state using a few weights:

Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + ⋯ + w_n f_n(s, a)

 Advantage:
 our experience is summed up in a few powerful numbers
 We can generalize from visited states to unvisited ones,
 in addition to the smaller space requirement

 Disadvantage: states may share features but actually be very different in value!

105
Least square error: minimization
Imagine we had only one point x, with features f(x), target value y, and weights w:

error(w) = (1/2) ( y − \sum_k w_k f_k(x) )^2

 Q-learning with linear Q-functions:

w_k ← w_k + α [ "target" − "prediction" ] f_k(s, a),  where the "target" is r + γ \max_{a'} Q(s', a') and the "prediction" is Q(s, a)

 Intuitive interpretation:
 Adjust weights of active features
 E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features

106
Least square error: minimization
 In the general case (e.g., when Q_θ(s, a) is modeled by a neural network):

L(θ) = E_{s,a,s',r} [ ( target − Q_θ(s, a) )^2 ]

∇_θ L(θ) = E_{s,a,s',r} [ ( target − Q_θ(s, a) ) ∇_θ Q_θ(s, a) ]

∇_θ L(θ) = E_{s,a,s',r} [ ( r + γ \max_{a'} Q_θ(s', a') − Q_θ(s, a) ) ∇_θ Q_θ(s, a) ]

107
Adjusting function weights

θ ← θ + α [ r + γ V_θ(s') − V_θ(s) ] ∇_θ V_θ(s)

or

θ ← θ + α [ r + γ \max_{a'} Q_θ(s', a') − Q_θ(s, a) ] ∇_θ Q_θ(s, a)

108
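A minimal sketch of the second update for the linear case, where ∇_θ Q_θ(s, a) is simply the feature vector f(s, a); the features(s, a) function returning a NumPy array is an assumption:

```python
import numpy as np

def q_update_linear(theta, features, s, a, r, s_next, actions_next,
                    alpha=0.01, gamma=0.9, done=False):
    """One gradient-style Q-learning step for Q_theta(s, a) = theta . f(s, a)."""
    q_sa = theta @ features(s, a)
    if done:
        target = r
    else:
        target = r + gamma * max(theta @ features(s_next, a2) for a2 in actions_next)
    # For a linear model, grad_theta Q_theta(s, a) is just the feature vector f(s, a).
    theta += alpha * (target - q_sa) * features(s, a)
    return theta
```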
Deep RL
 Q(s, a; θ): a neural network with parameters θ

 A single feedforward pass computes, for the current state s, the Q-values of all actions

109
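A minimal sketch of such a Q-network, assuming PyTorch and an arbitrary hidden size; one forward pass returns the Q-values of all actions for the given state:

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action in a single forward pass."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)      # shape: (batch, n_actions)

# Greedy action for a batch of states: q_net(states).argmax(dim=1)
```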
Playing Atari Games

 State: Raw pixels of images


 Action: e.g., R, L, U, D
 Reward: score

 The last layer has a 4-dimensional output (if there are 4 actions):
 Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4)

110
[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
DeepMind Atari (©Two Minute Lectures)

111
References
 T. Mitchell, Machine Learning, McGraw-Hill, 1997. [Chapter 13]

 R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998. [Chapters 1, 3, 4, 6]

112
