Dynamic Programming

Module 4
Content

Dynamic Programming
• Policy Evaluation, Improvement, Iteration
• Value Iteration
• Asynchronous Dynamic Programming
• Generalized Policy Iteration (GPI)
• Bootstrap
• Full Backup

Monte Carlo
• Advantages of Monte Carlo over Dynamic Programming
• Monte Carlo Control
• On-policy
• Off-policy
• Incremental Monte Carlo
• Issues/Assumptions in Monte Carlo Methods
• Solution of Blackjack using the Monte Carlo Method
Classical Dynamic Programming
• Classical DP operates under the assumption that the model of the
environment is fully known and perfect. This means all states, actions,
transitions, and rewards are defined in advance.
• Example: Methods based on Bellman's equations, such as value iteration and policy iteration, are
used to solve grid-world problems where everything is specified.
• Classical DP is often used for deterministic settings, where the outcome of
each action is predictable (given the current state).
• Example: Solving a shortest path problem in a graph can be done with classical DP, as
the costs are known and fixed.
• Classical DP can be computationally expensive, especially for large state
spaces, due to exhaustive listing of all possible states and actions.
• Example: Solving an MDP with a very large number of states can lead to infeasibility
in terms of computation and memory requirements.
• Used in optimization problems, operations research, and algorithm design
(such as shortest paths, knapsack problems, etc.).
Dynamic Programming in Reinforcement
Learning
• Model-Free Assumption: In RL, the model may not be known in advance. Instead of assuming perfect
knowledge, RL methods often learn about the environment through interaction, exploring and exploiting to
gather information.
• Example: Algorithms like Q-learning and SARSA do not require a predefined model of the environment; they learn
directly from the experiences gained through episodes of interaction.
• RL often deals with inherently stochastic environments, where actions may result in different outcomes even
in the same state (due to randomness).
• Example: A robot navigating an uncertain terrain may not always reach its intended location due to obstacles or
varying terrain conditions.
• RL emphasizes the trade-off between exploration (trying new actions to gather more information) and
exploitation (taking actions known to yield high rewards based on current knowledge).
• Example: A player in a game may need to explore different strategies while being rewarded for winning
(exploitation).
• Model Learning: Some RL methods may use model learning, where an agent constructs a model of the
environment based on the experiences it collects. Even with learned models, the computations might not be
as extensive as classical DP since the agent adapts to learn the essential aspects of the environment.
• Example: In model-based reinforcement learning, the agent learns a transition model and then uses DP to plan out
optimal policies based on what it has learned.
Classical Dynamic Programming vs. Dynamic Programming in Reinforcement Learning

• Model Knowledge: classical DP assumes a perfect, known model; DP in RL may not know the model and learns from interaction.
• Determinism: classical DP often works in deterministic settings; DP in RL handles stochastic and uncertain environments.
• Exploration vs. Exploitation: not relevant or needed in classical DP; central to DP in RL, which must balance exploration and exploitation.
• Applications: classical DP is used for optimization and algorithm design; DP in RL targets efficient learning in uncertain environments, robotics, and game playing.
• Computational Cost: can be high for classical DP, which sweeps over all states/actions; RL methods are often designed to be more efficient by sampling the environment.
Dynamic Programming in RL
• The term dynamic programming (DP) refers to a collection of
algorithms that can be used to compute optimal policies given a
perfect model of the environment as a Markov decision process
(MDP).
• A Markov Decision Process is a framework used to model decision-
making where outcomes are partly random and partly under the
control of a decision-maker (like a robot or a person).
• How Does Dynamic Programming Work with MDPs?
• When you have a perfect model of the environment (which includes all states,
actions, transitions, and rewards), you can use dynamic programming to find
the optimal policy (the best set of actions to take in each state to maximize
your long-term rewards).
Dynamic Programming in RL
• DP requires complete knowledge of the system dynamics: states, actions, rewards, and the transition model (p(s', r | s, a), p(s' | s, a), E[r | s, a, s']).
• Expensive computation
• Guaranteed to converge
• Affected by the curse of dimensionality
Policy Evaluation
• Policy Evaluation in Reinforcement Learning refers to the process of estimating
the value function of a given policy. The value function indicates how good it is
to follow that policy from each state, helping to understand the expected
returns (rewards) the agent would receive if it were to behave according to that
policy.
• First we consider how to compute the state-value function v_π for an arbitrary
policy π.
• This is called policy evaluation in DP. We also refer to it as the prediction
problem. For all s ∈ S,

v_π(s) = E_π[ G_t | S_t = s ] = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [ r + γ v_π(s') ]        (4.4)

where π(a|s) is the probability of taking action a in state s under policy π,
and the expectations are subscripted by π to indicate
that they are conditional on π being followed.
Policy Evaluation (1)
• The existence and uniqueness of v_π are guaranteed as long as
either γ < 1 or eventual termination is guaranteed from all
states under the policy π.
• If the environment's dynamics are completely known, then (4.4) is a
system of |S| simultaneous linear equations in |S| unknowns
(the v_π(s), s ∈ S).

Here γ is the discount factor (which represents the importance of future rewards).
Policy Evaluation (2)
• For our purposes, iterative solution methods are most suitable.
Consider a sequence of approximate value functions v_0, v_1, v_2, ...,
each mapping S⁺ to ℝ.
• The initial approximation, v_0, is chosen arbitrarily (except that the
terminal state, if any, must be given value 0), and each successive
approximation is obtained by using the Bellman equation for v_π (4.4)
as an update rule:

v_{k+1}(s) = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [ r + γ v_k(s') ]

for all s ∈ S.
Policy Evaluation (3)
• Iterative policy evaluation algorithm
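As a concrete illustration of the algorithm above, here is a minimal Python sketch of iterative policy evaluation. The inputs it assumes (states as the non-terminal states, model[(s, a)] as a list of (probability, next_state, reward) tuples, policy[s] as {action: π(a|s)}) and all names are illustrative, not taken from the slides.

```python
# Minimal sketch of iterative policy evaluation for a finite MDP.
# Assumed (hypothetical) inputs: states = the non-terminal states;
# model[(s, a)] = [(prob, next_state, reward), ...]; policy[s] = {action: pi(a|s)}.
def policy_evaluation(states, policy, model, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}                  # v_0 chosen arbitrarily (zeros)
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a, pi_sa in policy[s].items():    # pi(a|s)
                for prob, s_next, reward in model[(s, a)]:
                    # States not in V (terminals) contribute value 0.
                    v_new += pi_sa * prob * (reward + gamma * V.get(s_next, 0.0))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                          # in-place update during the sweep
        if delta < theta:                         # stop when the largest change is tiny
            return V
```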
Policy Improvement
• The process of creating a new policy that is better than or equal to
the current policy.
• This is often done by acting greedily with respect to the value
function of the current policy.
• We have a deterministic policy π and its state-value function v_π(s).
• We want to know if changing the action in a specific state s to a
different action a (a ≠ π(s)) would result in a better policy.
Evaluating a Potential Change:
• Consider taking action a in state s once, and then following the
original policy π thereafter.
• The value of this action is given by the action-value function:

q_π(s, a) = E[ R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = a ] = Σ_{s', r} p(s', r | s, a) [ r + γ v_π(s') ]

• The Key Criterion:
• Compare q_π(s, a) with v_π(s).
• If q_π(s, a) > v_π(s): Taking action a once in s and then following π is better
than always following π from s.
• Intuitively, consistently choosing a in state s should lead to a better overall
policy.
The Policy Improvement Theorem:
• Let π and π' be two deterministic policies such that, for all states s,

q_π(s, π'(s)) ≥ v_π(s).        (4.7)

• Then the new policy π' is as good as or better than π for all states:

v_{π'}(s) ≥ v_π(s).        (4.8)

• If there is strict inequality of (4.7) at any state, then there must be
strict inequality of (4.8) at at least one state.
• This result applies in particular to the two policies that we considered
in the previous paragraph: an original deterministic policy π, and a
changed policy π' that is identical to π except that π'(s) = a ≠ π(s).
• Obviously, (4.7) holds at all states other than s.
• Thus, if q_π(s, a) > v_π(s), then the changed policy is indeed better
than π.
Policy Iteration
• Once a policy, π, has been improved using v_π to yield a better policy,
π', we can then compute v_{π'} and improve it again to yield an even
better π''.
• We can thus obtain a sequence of monotonically improving policies
and value functions:

π_0 →(E) v_{π_0} →(I) π_1 →(E) v_{π_1} →(I) π_2 →(E) ... →(I) π_* →(E) v_*

where (E) denotes a policy evaluation step and (I) a policy improvement step.

Policy iteration (using iterative policy evaluation) for v_π.
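A minimal sketch of the loop this algorithm describes, reusing a policy_evaluation routine like the one sketched earlier. Here actions[s] lists the actions available in non-terminal state s; the model format and all names are the same illustrative assumptions as before.

```python
# Minimal sketch of policy iteration (iterative evaluation + greedy improvement).
def policy_iteration(states, actions, model, gamma=0.9):
    policy = {s: actions[s][0] for s in states}   # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: compute v_pi for the current deterministic policy.
        V = policy_evaluation(states,
                              {s: {policy[s]: 1.0} for s in states},
                              model, gamma)
        # Policy improvement: act greedily with respect to v_pi in every state.
        policy_stable = True
        for s in states:
            q = {a: sum(p * (r + gamma * V.get(s2, 0.0))
                        for p, s2, r in model[(s, a)])
                 for a in actions[s]}
            best_action = max(q, key=q.get)
            if q[best_action] > q[policy[s]]:     # switch only on strict improvement
                policy[s] = best_action
                policy_stable = False
        if policy_stable:                         # no state changed its action: done
            return policy, V
```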
Example 4.2: Jack's Car Rental
• Two Locations: Jack manages two car rental locations.
• Daily Rentals: Customers arrive daily at each location to rent cars.
• Revenue: Jack earns $10 for each car rented.
• Lost Business: If a location is out of cars, rental requests are lost (no revenue).
• Car Returns: Rented cars are returned the next day.
• Overnight Movement: Jack can move cars between the two locations overnight. Cost of
Movement: Moving a car costs $2.
• Demand and Returns are Poisson: The number of rental requests and car returns at each
location follow Poisson distributions (discrete probability distribution, number of events
occurring in a fixed interval of time) with given expected values (λ).
• Location 1: Rental requests λ = 3, Returns λ = 3
• Location 2: Rental requests λ = 4, Returns λ = 2
• Capacity Limit: Each location can hold a maximum of 20 cars. Any excess cars effectively
disappear.
Example 4.2: Jack's Car Rental (Policy
Iteration)
1. Begin with a basic strategy (e.g., never move cars).
2. Evaluate Current Strategy: Figure out how good this strategy is for every
possible number of cars at each location (calculate the "value" of each
state). This considers expected future profits based on rental demand
and returns.
3. Improve Strategy: For each possible situation (number of cars at each
location), consider all possible actions (moving -5 to +5 cars). Choose the
action that looks like it will lead to the best future outcome (highest
value based on the evaluation). This creates a new, potentially better
strategy.
4. Repeat: Go back to step 2 and evaluate this new strategy. Keep repeating
steps 2 and 3 until the strategy stops changing. At that point, you've
found the best way for Jack to manage his cars.
Drawbacks of policy iteration:
• Computational Cost:
• Policy Evaluation: Each iteration requires evaluating the current policy, which
can be computationally intensive. For large state spaces, this evaluation step
can take considerable time, especially if a high degree of accuracy is needed.
• Policy Improvement: While typically faster than policy evaluation, the
improvement step still requires calculating the best action based on the
state-value function. In large state-action spaces, this can also be slow.
• Convergence Speed:
• Policy iteration often converges faster than value iteration initially due to the
more structured approach, but in cases where the value function is poorly
approximated, it may take many iterations to converge to an optimal policy.
• Particularly in large state spaces, it may take many iterations to reach
convergence, which can be computationally prohibitive.
Value Iteration
• An algorithm used to compute the optimal policy and value function
in Markov Decision Processes (MDPs).
• Repeatedly updates the value of states based on the expected
rewards of actions.
Gambler's Problem
• Coin flip outcome: Heads, the gambler wins the amount he bet; Tails, he loses the amount he bet.
• Problem definition:
• A gambler has a limited amount of money (e.g., $100) and wants to reach a goal
(e.g., $200) by betting on a fair coin flip.
• If the gambler wins, their amount increases; if they lose, it decreases.
• States :
• Represent the amount of money the gambler has (from 0 to 200).
• Actions:
• The gambler can choose to bet a certain amount (from $1 to the current amount
they hold).
• Rewards:
• Winning a bet results in the value increasing by the amount betted; losing results in
the value decreasing by the same amount.
Gambler's Problem (1)
• Terminal States
• Win (Goal): Reaching $200.
• Lose: Going broke ($0).
• Value Function
• The goal is to calculate the expected return (value) for each state V(s) based on
possible actions at each state.
• The Bellman optimality equation used for the update:

V(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V(s') ]

where:
R(s, a, s') = immediate reward,
γ = discount factor,
P(s' | s, a) = transition probability to the next state.
How does the value iteration process work?
• Initialization: Set V(s) for all states to an initial value (e.g., $0).
• Iterate: Update values using the Bellman equation until convergence
(values do not change significantly).
• Extract Policy: Once values converge, derive the optimal betting
strategy for each state.
✓ With sufficient iterations, the value function stabilizes, providing
insights into the best strategy for maximizing the chances of reaching
$200 without going bankrupt.
✓The Gambler's Problem is about finding the smartest way to bet on
coin flips to reach a specific money goal, knowing the odds of the
coin. We want to know the chances of winning from any current
amount of money and the best betting amount at each step.
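A minimal value iteration sketch for the gambler's setting just described. It assumes, as in the classic formulation, a reward of +1 only for reaching the goal and 0 otherwise; the goal amount and the coin's win probability are parameters (p_heads = 0.5 matches the fair coin described above), and all names are illustrative.

```python
# Minimal sketch of value iteration for the Gambler's Problem.
# States are capital levels 0..goal; 0 and goal are terminal.
def gambler_value_iteration(goal=100, p_heads=0.5, theta=1e-9):
    V = [0.0] * (goal + 1)
    V[goal] = 1.0                                  # reaching the goal counts as return 1
    while True:
        delta = 0.0
        for s in range(1, goal):                   # sweep all non-terminal capitals
            best = max(
                p_heads * V[s + bet] + (1 - p_heads) * V[s - bet]
                for bet in range(1, min(s, goal - s) + 1)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                          # values have stabilized
            break
    # Extract a greedy policy: the stake maximizing the expected value in each state.
    policy = [0] * (goal + 1)
    for s in range(1, goal):
        q = {bet: p_heads * V[s + bet] + (1 - p_heads) * V[s - bet]
             for bet in range(1, min(s, goal - s) + 1)}
        policy[s] = max(q, key=q.get)              # ties resolve to the smallest stake
    return V, policy
```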
Value Iteration

Figure: Jack's Car Rental, the sequence of policies π₀, π₁, π₂, π₃, π₄.
• Positive number (e.g., 1, 2, 3, 4, 5): move that many cars from the second location to the first location overnight.
• Negative number (e.g., -1, -2, -3, -4): move that many cars from the first location to the second location overnight.
• Zero (0): don't move any cars overnight.
• X-axis: number of cars at the second location. Y-axis: number of cars at the first location.
•π₀ (Initial Policy): Starts with a simple policy (in this case, it seems to be to move 0 cars between locations
regardless of the current state).
•π₁: After one iteration of policy evaluation and improvement, the policy starts to suggest moving cars based
on the car distribution. For example, if there are many cars at location 2 and few at location 1, it suggests moving
some from 2 to 1 (positive numbers).
•π₂, π₃, π₄: As the algorithm continues to iterate, the policy becomes more refined. The boundaries between
different actions shift, indicating a better understanding of the optimal car movements for different situations.
Notice how the areas with the same action become more structured.
•π₄ (Near-Optimal Policy): After several iterations, the policy likely converges to a near-optimal strategy.
V₄: shows the estimated value function for the policy π₄.
•Axes: Same as the policy plots: #Cars at second location and #Cars at first location.
•Vertical Axis (Value): Represents the estimated long-term profit you can expect to make if you start with that
specific number of cars at each location and follow the policy π₄.
•The Surface: The height of the surface at each point (combination of cars) indicates the value.
Higher values mean a more profitable starting state under policy π₄.
•What it shows: You can see how the value generally increases as you have more cars available in total.
•The shape of the surface reflects the complexities of the problem (demand, returns, moving costs).
Value Iteration
• The policy evaluation step of policy iteration can be truncated in several
ways without losing the convergence guarantees of policy iteration.
• One important special case is when policy evaluation is stopped after just
one sweep (one backup of each state). This algorithm is called value
iteration.
• Value Iteration is based on the principle of optimality.
• It starts with an initial estimate of the value function (often all zeros) and
iteratively updates this estimate by applying the Bellman Optimality
Equation.
• In each iteration, it considers all possible actions from each state and
updates the value of that state to be the maximum expected return
achievable by taking any action and then following the optimal policy
thereafter.
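In symbols, and in the same notation as the earlier policy evaluation update, the one-sweep update that defines value iteration is

v_{k+1}(s) = max_a Σ_{s', r} p(s', r | s, a) [ r + γ v_k(s') ],  for all s ∈ S,

which is simply the Bellman optimality equation turned into an update rule.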
Asynchronous Dynamic Programming
• Asynchronous DP methods back up state values in place and in any order, using whatever values of other states happen to be available, rather than performing systematic sweeps over the entire state set.
Figure: Gambler's Problem, value estimates over successive sweeps of value iteration.
• X-axis (Capital): the amount of money the gambler has (from $1 to $99).
• Y-axis (Value Estimates): the estimated probability of the gambler reaching the $100 goal from that amount of money.
• As we do more sweeps, the estimates of the winning probability become more accurate; the curves become more stable and converge towards a final shape.
• The probability of winning generally increases as the gambler's capital increases.
Figure: Gambler's Problem, the final policy (stake).
• Capital: again, the amount of money the gambler has.
• Final Policy (Stake): the optimal amount of money the gambler should bet at each capital level to maximize the chance of reaching $100.
• Spikes: the tall spikes indicate situations where the optimal strategy is to bet a larger amount.
• Steps: the flatter parts show situations where smaller bets are likely optimal.
• The best betting strategy isn't always to bet everything; it depends on the gambler's current capital. For example, when close to $50, a larger bet might be optimal to try to reach the goal faster.
Generalized Policy Iteration (GPI)
• Refers to the general concept of allowing policy evaluation and improvement
processes to interact, regardless of the specific timing or granularity.
• All reinforcement learning methods can be viewed as GPI.
• They maintain a policy and a value function, with the policy being improved
based on the value function and the value function being updated to reflect the
current policy.
Why GPI:The Convergence Condition
• If both evaluation and improvement processes stabilize (no
longer produce changes):
• The value function is consistent with the current policy.
• The policy is greedy with respect to the current value function.
• This implies the policy is greedy with respect to its own evaluation
function, satisfying the Bellman optimality equation.

V*(s) = max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V*(s') ]

• Therefore, the policy and value function are optimal.


Example (1)
• Two states: State A and State B (which is a terminal state with a
reward of +1).
• Actions in State A: "Stay in State A" or "Go to State B".
• Discount factor (γ): 1.0 (no discounting).
• Initial State:
• Policy (State A): Let's start with an initial policy in State A: "Stay in State A"
(even though it doesn't lead to the reward).
• Value Function: V(A) = 0
• V(B) = +1 (terminal state's value is its reward)
Iteration 1
• Policy Evaluation: For State A, given the policy "Stay in State A", the value function should reflect the expected future reward. Since staying in A leads nowhere with a reward, V(A) remains 0. V(B) remains +1.
• Policy Improvement: Look at State A and consider all possible actions. The value of taking "Go to State B" is: Reward(A -> B) + γ * V(B) = 0 + 1.0 * (+1) = +1. Since +1 (the value of going to B) is greater than 0 (the value of following the current policy of staying in A), we improve the policy for State A to: "Go to State B".

Iteration 2
• Policy Evaluation: For State A, with the new policy "Go to State B", the value function is now: V(A) = Reward(A -> B) + γ * V(B) = 0 + 1.0 * (+1) = +1. V(B) remains +1.
• Policy Improvement: Look at State A: the action "Go to State B" has a value of +1, which is now equal to the current value V(A). The policy remains "Go to State B".
Bootstrap
• Bootstrapping in RL refers to the process of estimating a value based on
other estimated values.
• Instead of relying solely on actual observed rewards, we use our current
value estimates to update other value estimates.
• When updating the value of a state (or state-action pair), a bootstrapping
method will use the estimated value of the subsequent state (or state-
action pair) in its update rule.
• In Q-learning, for example, the update rule is
Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ].
Note: Q(s', a') is an estimated value, and we are using it to update Q(s, a).
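A minimal sketch of this bootstrapped update as code. Q is assumed to be a dict of dicts (Q[s][a]), and alpha and gamma are the usual step-size and discount parameters; none of these names come from the slides.

```python
# One bootstrapped Q-learning update: the target uses the current *estimate*
# max_a' Q(s', a') instead of waiting for the true return.
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    bootstrap = max(Q[s_next].values()) if Q[s_next] else 0.0  # 0 at terminal states
    td_target = r + gamma * bootstrap
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q
```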
Used for?
• Efficiency:
• In many RL problems, especially those with long episodes or continuous state spaces, waiting
for the final outcome (the true return) to update values can be very slow and inefficient.
• Bootstrapping allows for faster learning by leveraging existing knowledge (our current value
estimates).
• Temporal Difference (TD) Learning:
• Bootstrapping is a fundamental component of Temporal Difference (TD) learning methods
like Q-learning, SARSA, and TD(λ).
• These methods update value estimates based on the difference between the current
estimate and a more informed estimate that includes the immediate reward and the
estimated value of the next state.
Bootstrap
Advantages:
• Faster learning compared to Monte Carlo methods (which wait for the end of an episode).
• Can learn online, without waiting for complete episodes.
Disadvantages:
• Can be susceptible to bias if the initial value estimates are poor.
• Can sometimes lead to instability if not carefully implemented (e.g., with a high learning rate).
Full Backup in Reinforcement Learning
• A full backup involves updating the value of a state (or state-action
pair) by considering all possible next states and their corresponding
probabilities (under the current policy or environment dynamics).
• It essentially performs an expectation over all possible future
outcomes in a single step.
• Full backups are a key characteristic of Dynamic Programming
methods like Policy Iteration and Value Iteration.
• These methods have a complete model of the environment
(transition probabilities and rewards).
Working
• To update the value of a state s, a full backup would look at all
possible next states s' that can be reached from s under all possible
actions a, weighted by their transition probabilities P(s'|s, a).
• For the state value function V(s) under a policy π, the update in
Policy Evaluation (using full backup) looks like:
V_{k+1}(s) = Σ_{a ∈ A(s)} π(a|s) Σ_{s' ∈ S} P(s' | s, a) [ R(s, a, s') + γ V_k(s') ]
• Similarly, in Value Iteration, the update for V(s) involves taking the
maximum over all actions, with each action considering all possible
next states and their probabilities.
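A vectorized sketch of one full-backup sweep of this update, assuming (hypothetically) that the model is stored as NumPy arrays P[s, a, s'] for p(s'|s, a), R[s, a, s'] for R(s, a, s'), and pi[s, a] for π(a|s).

```python
import numpy as np

# One full-backup sweep of policy evaluation: V_{k+1}(s) for every state at once.
def full_backup_sweep(V, P, R, pi, gamma=0.9):
    # Expected one-step return q(s, a): sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * V(s')).
    q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
    # Weight by pi(a|s) and sum over actions to get the new state values.
    return np.einsum("sa,sa->s", pi, q)
```

For value iteration, the same q array would instead be reduced with q.max(axis=1), taking the maximum over actions rather than the policy-weighted average.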
Full Backup
Advantages:
• Guaranteed to converge to the optimal value function (under certain conditions) because it directly uses the environment model.
• Efficient when the environment model is known and the state space is not too large.
Disadvantages:
• Requires a complete and accurate model of the environment (transition probabilities and rewards), which is often unavailable in real-world RL problems.
• Can be computationally expensive for large state and action spaces due to the need to iterate over all possibilities.
Relationship between Bootstrapping and Full Backup

• Full backup is a form of bootstrapping: it uses the estimated values of the next states (V_k(s')) to update the value of the current state (V_{k+1}(s)). However, it does so by considering all possible next states weighted by their probabilities.
• TD learning (with bootstrapping) often performs "sample backups" or "one-step backups". Instead of looking at all possible next states, it takes a single step in the environment and uses the observed reward and the estimated value of the actual next state to update the current value. This is a form of bootstrapping that does not require the full environment model.
• Monte Carlo methods are an example of non-bootstrapping methods. They estimate values based on the entire sequence of rewards received until the end of an episode, without using estimates of future values within the episode.
Monte Carlo Methods in Reinforcement
Learning
• Learning by playing a game many, many times.
• After each full game (episode), you look back at how you played and
the rewards you got.
• You then adjust your strategy (policy) and your understanding of how
good each move was (value) based on the actual outcome of those
games.
• Learn from complete experiences, not by guessing what might
happen next in the middle of a game.
Why Monte Carlo over Dynamic Programming?
• Monte Carlo Wins When...
1. No Perfect Game Rules (Model-Free): We don't know all the
probabilities of what happens next in the environment. MC only
needs to experience the game. DP needs to know the rules
perfectly.
2. Huge State Spaces: Trying to calculate values for every possible
situation (like in a complex game) can be impossible for DP. MC
learns from samples (actual games played), so it can handle much
larger problems.
3. Focus on Specific States: We might only care about the value of
certain starting positions or key moments. MC can focus learning on
those specific parts without needing to calculate everything else.
Monte Carlo Control: the goal is to find the best strategy (policy) to get the most reward.
• Play many episodes using the current strategy.
• Estimate how good each action was in each state based on the total
reward received after taking that action in those episodes.
• Improve the strategy by making it more likely to choose actions that
led to higher rewards in the past.
• Repeat until the strategy stops improving.
1. Start:
• Initial guess of how good taking action a in state s
• Initial strategy (policy) for choosing actions in each
state
• A list to keep track of all the total rewards we've
received after taking action a in state s. Starts empty.
2. Playing (Forever): Start & Play full game until the
game ends.
3. Learn from the Game
i. G: Calculate the total reward received from
the first time we took action a in state s until the
end of the game.
ii. Add this total reward G to our list of rewards
Returns(s, a) for that specific state-action pair.
iii. Update our value estimate Q(s, a) by taking the
average of all the total rewards we've seen so far for
taking action a in state s
4. Improve the Strategy:
i. All the possible actions we could take in that state
ii. Update our strategy π(s) to always choose the
action a that currently has the highest estimated
value Q(s, a). This makes our strategy greedy with
respect to our current value estimates.
5. Repeat: Keep playing games, learning from them, and
improving our strategy forever
π(a|s): Our initial strategy (policy) for choosing actions in each
state.
It is an ε-soft policy:
i. It mostly chooses the action we think is best (a*).
ii. But it also has a small chance (ε) of choosing any other
action randomly. This ensures we keep exploring.

Improve Our Strategy (Slightly):


For Every State in the Game:
i. a*: Find the action a that currently has the highest
estimated value Q(s, a) in that state. This is the action we
currently think is the best.
ii. Update Our Policy (ε-Softly): For all possible actions a in
that state:
a. If a is the best action (a*), we increase the probability
of choosing it in the future (make it more likely).
b. If a is not the best action, we assign it a small
probability (equal probability for all other actions) to
ensure we still explore other options.
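A condensed sketch of on-policy first-visit Monte Carlo control with an ε-soft (ε-greedy) policy, in the spirit of the steps above. The environment interface (env.reset() returning a state, env.step(a) returning (next_state, reward, done)) and every name here are assumptions for illustration, not from the slides.

```python
import random
from collections import defaultdict

def mc_control_epsilon_soft(env, actions, episodes=10_000, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(s, a)] action-value estimates
    n_visits = defaultdict(int)                  # counts for incremental averaging

    def choose_action(s):                        # epsilon-soft: mostly greedy, sometimes random
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        # Generate one complete episode following the current policy.
        episode, s, done = [], env.reset(), False
        while not done:
            a = choose_action(s)
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # Work backwards through the episode, updating first-visit estimates.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = r + gamma * G                    # return from time t onward
            if (s, a) not in [(x, y) for x, y, _ in episode[:t]]:   # first visit only
                n_visits[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / n_visits[(s, a)]     # running average
    return Q
```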
On-Policy vs. Off-Policy Learning

On-Policy Learning:
• Learn about the strategy you are currently using.
• You play the game with one strategy and improve that same strategy.
• Think of it as practicing exactly how you want to play in the real game.

Off-Policy Learning:
• Learn about a different strategy than the one you are currently using.
• You might observe someone else playing (or use a different strategy temporarily) to learn about the best way to play, and then apply that knowledge to your own strategy.
• Think of it as learning by watching an expert play, even if you're currently playing differently.
Incremental Monte Carlo
• Instead of storing returns and waiting until many episodes have been collected, update
the value estimates incrementally after each episode is completed (as a running average of
the returns observed so far).
• This allows learning to happen more continuously and can be
more efficient for long-running tasks.
• Think of it as adjusting your understanding slightly after each game
you play, rather than waiting to analyze a whole bunch of games at
once.
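A minimal sketch of the incremental (running-average) update applied after each finished episode, so past returns never need to be stored. The names, and the optional constant step size alpha, are illustrative assumptions.

```python
# Update state-value estimates from one finished episode.
# episode_returns: list of (state, G) pairs, where G is the return observed from that state.
def incremental_mc_update(V, counts, episode_returns, alpha=None):
    for s, G in episode_returns:
        old = V.get(s, 0.0)
        if alpha is None:                        # exact running average of all returns so far
            counts[s] = counts.get(s, 0) + 1
            V[s] = old + (G - old) / counts[s]
        else:                                    # constant step size: recency-weighted average
            V[s] = old + alpha * (G - old)
    return V
```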
Monte Carlo Methods
1. Need Many Episodes: To get reliable estimates, you need to play
the "game" (interact with the environment) a lot.
2. Only Works for Episodic Tasks: The process relies on having clear
"episodes" with a beginning and an end to calculate total rewards.
3. High Variance: The outcome of a single episode can be random,
leading to noisy value estimates, especially early on. Averaging over
many episodes helps reduce this.
4. Exploration vs. Exploitation: You need to try different actions
(explore) to find the best ones, but also use the actions that seem
best so far (exploit) to get high rewards. Balancing this is important.
BlackJack with Monte Carlo
Refer to the reference book for this topic.
