Dynamic Programming

Module 4
Content

Dynamic Programming
• Policy Evaluation, Improvement, Iteration
• Value Iteration
• Asynchronous Dynamic Programming
• Generalized Policy Iteration (GPI)
• Bootstrap
• Full Backup

Monte Carlo
• Advantages of Monte Carlo over Dynamic Programming
• Monte Carlo Control
• On-policy
• Off-policy
• Incremental Monte Carlo
• Issues/Assumptions in Monte Carlo Methods
• Solution of Blackjack using the Monte Carlo Method
Classical Dynamic Programming
• Classical DP operates under the assumption that the model of the
environment is fully known and perfect. This means all states, actions,
transitions, and rewards are defined in advance.
• Example: Methods based on Bellman's equations, such as value iteration and policy iteration, are
used to solve grid-world problems where everything is specified.
• Classical DP is often used for deterministic settings, where the outcome of
each action is predictable (given the current state).
• Example: Solving a shortest path problem in a graph can be done with classical DP, as
the costs are known and fixed.
• Classical DP can be computationally expensive, especially for large state
spaces, due to exhaustive listing of all possible states and actions.
• Example: Solving an MDP with a very large number of states can lead to infeasibility
in terms of computation and memory requirements.
• Used in optimization problems, operations research, and algorithm design
(such as shortest paths, knapsack problems, etc.).
Dynamic Programming in Reinforcement
Learning
• Model-Free Assumption: In RL, the model may not be known in advance. Instead of assuming perfect
knowledge, RL methods often learn about the environment through interaction, exploring and exploiting to
gather information.
• Example: Algorithms like Q-learning and SARSA do not require a predefined model of the environment; they learn
directly from the experiences gained through episodes of interaction.
• RL often deals with inherently stochastic environments, where actions may result in different outcomes even
in the same state (due to randomness).
• Example: A robot navigating an uncertain terrain may not always reach its intended location due to obstacles or
varying terrain conditions.
• RL emphasizes the trade-off between exploration (trying new actions to gather more information) and
exploitation (taking actions known to yield high rewards based on current knowledge).
• Example: A player in a game may need to explore different strategies while being rewarded for winning
(exploitation).
• Model Learning: Some RL methods may use model learning, where an agent constructs a model of the
environment based on the experiences it collects. Even with learned models, the computations might not be
as extensive as classical DP since the agent adapts to learn the essential aspects of the environment.
• Example: In model-based reinforcement learning, the agent learns a transition model and then uses DP to plan out
optimal policies based on what it has learned.
Classical Dynamic Programming vs. Dynamic Programming in Reinforcement Learning

• Model Knowledge: classical DP assumes a perfect, known model; DP in RL may not know the model and learns from interaction.
• Determinism: classical DP often works in deterministic settings; DP in RL handles stochastic and uncertain environments.
• Exploration vs. Exploitation: not relevant or needed in classical DP; central to DP in RL, which must balance exploration and exploitation.
• Applications: classical DP is used for optimization and algorithm design; DP in RL targets efficient learning in uncertain environments, robotics, and game playing.
• Computational Cost: can be high for classical DP, which sweeps over all states/actions; RL methods are often designed to be more efficient by sampling the environment.
Dynamic Programming in RL
• The term dynamic programming (DP) refers to a collection of
algorithms that can be used to compute optimal policies given a
perfect model of the environment as a Markov decision process
(MDP).
• A Markov Decision Process is a framework used to model decision-
making where outcomes are partly random and partly under the
control of a decision-maker (like a robot or a person).
• How Does Dynamic Programming Work with MDPs?
• When you have a perfect model of the environment (which includes all states,
actions, transitions, and rewards), you can use dynamic programming to find
the optimal policy (the best set of actions to take in each state to maximize
your long-term rewards).
Dynamic Programming in RL
• DP requires complete knowledge of the system dynamics: states, actions, rewards, and the transition model (p(s', r | s, a), p(s' | s, a), E[r | s, a, s']).
• Expensive computation
• Guaranteed to converge
• Affected by the curse of dimensionality
Policy Evaluation
• Policy Evaluation in Reinforcement Learning refers to the process of estimating
the value function of a given policy. The value function indicates how good it is
to follow that policy from each state, helping to understand the expected
returns (rewards) the agent would receive if it were to behave according to that
policy.
• First we consider how to compute the state-value function v_π for an arbitrary
policy π.
• This is called policy evaluation in DP. We also refer to it as the prediction
problem. For all s ∈ S,

v_π(s) = E_π[ G_t | S_t = s ] = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [ r + γ v_π(s') ]        (4.4)

where π(a|s) is the probability of taking action a in state s under policy π,
and the expectations are subscripted by π to indicate
that they are conditional on π being followed.
Policy Evaluation (1)
• The existence and uniqueness of v_π are guaranteed as long as
either γ < 1 or eventual termination is guaranteed from all
states under the policy π.
• If the environment's dynamics are completely known, then (4.4) is a
system of |S| simultaneous linear equations in |S| unknowns
(the v_π(s), s ∈ S).

Here γ is the discount factor (which represents the importance of future rewards).
Policy Evaluation (2)
• For our purposes, iterative solution methods are most suitable.
Consider a sequence of approximate value functions v_0, v_1, v_2, ...,
each mapping S⁺ to ℝ.
• The initial approximation, v_0, is chosen arbitrarily (except that the
terminal state, if any, must be given value 0), and each successive
approximation is obtained by using the Bellman equation for v_π (4.4)
as an update rule:

v_{k+1}(s) = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [ r + γ v_k(s') ]

for all s ∈ S.
Policy Evaluation (3)
• Iterative policy evaluation algorithm
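As a concrete illustration of the algorithm above, here is a minimal Python sketch of iterative policy evaluation. The inputs it assumes (states as the non-terminal states, model[(s, a)] as a list of (probability, next_state, reward) tuples, policy[s] as {action: π(a|s)}) and all names are illustrative, not taken from the slides.

```python
# Minimal sketch of iterative policy evaluation for a finite MDP.
# Assumed (hypothetical) inputs: states = the non-terminal states;
# model[(s, a)] = [(prob, next_state, reward), ...]; policy[s] = {action: pi(a|s)}.
def policy_evaluation(states, policy, model, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}                  # v_0 chosen arbitrarily (zeros)
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a, pi_sa in policy[s].items():    # pi(a|s)
                for prob, s_next, reward in model[(s, a)]:
                    # States not in V (terminals) contribute value 0.
                    v_new += pi_sa * prob * (reward + gamma * V.get(s_next, 0.0))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                          # in-place update during the sweep
        if delta < theta:                         # stop when the largest change is tiny
            return V
```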
Policy Improvement
• The process of creating a new policy that is better than or equal to
the current policy.
• This is often done by acting greedily with respect to the value
function of the current policy.
• We have a deterministic policy π and its state-value function v_π(s).
• We want to know if changing the action in a specific state s to a
different action a (a ≠ π(s)) would result in a better policy.
Evaluating a Potential Change:
• Consider taking action a in state s once, and then following the
original policy π thereafter.
• The value of this action is given by the action-value function:

q_π(s, a) = E[ R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = a ] = Σ_{s', r} p(s', r | s, a) [ r + γ v_π(s') ]

• The Key Criterion:
• Compare q_π(s, a) with v_π(s).
• If q_π(s, a) > v_π(s): Taking action a once in s and then following π is better
than always following π from s.
• Intuitively, consistently choosing a in state s should lead to a better overall
policy.
The Policy Improvement Theorem:
• Let π and π' be two deterministic policies such that, for all states s,

q_π(s, π'(s)) ≥ v_π(s).        (4.7)

• Then the new policy π' is as good as or better than π for all states:

v_{π'}(s) ≥ v_π(s).        (4.8)

• If there is strict inequality of (4.7) at any state, then there must be
strict inequality of (4.8) at at least one state.
• This result applies in particular to the two policies that we considered
in the previous paragraph: an original deterministic policy π, and a
changed policy π' that is identical to π except that π'(s) = a ≠ π(s).
• Obviously, (4.7) holds at all states other than s.
• Thus, if q_π(s, a) > v_π(s), then the changed policy is indeed better
than π.
Policy Iteration
• Once a policy, π, has been improved using v_π to yield a better policy,
π', we can then compute v_{π'} and improve it again to yield an even
better π''.
• We can thus obtain a sequence of monotonically improving policies
and value functions:

π_0 →(E) v_{π_0} →(I) π_1 →(E) v_{π_1} →(I) π_2 →(E) ... →(I) π_* →(E) v_*

where (E) denotes a policy evaluation step and (I) a policy improvement step.

Policy iteration (using iterative policy evaluation) for v_π.
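A minimal sketch of the loop this algorithm describes, reusing a policy_evaluation routine like the one sketched earlier. Here actions[s] lists the actions available in non-terminal state s; the model format and all names are the same illustrative assumptions as before.

```python
# Minimal sketch of policy iteration (iterative evaluation + greedy improvement).
def policy_iteration(states, actions, model, gamma=0.9):
    policy = {s: actions[s][0] for s in states}   # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: compute v_pi for the current deterministic policy.
        V = policy_evaluation(states,
                              {s: {policy[s]: 1.0} for s in states},
                              model, gamma)
        # Policy improvement: act greedily with respect to v_pi in every state.
        policy_stable = True
        for s in states:
            q = {a: sum(p * (r + gamma * V.get(s2, 0.0))
                        for p, s2, r in model[(s, a)])
                 for a in actions[s]}
            best_action = max(q, key=q.get)
            if q[best_action] > q[policy[s]]:     # switch only on strict improvement
                policy[s] = best_action
                policy_stable = False
        if policy_stable:                         # no state changed its action: done
            return policy, V
```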
Example 4.2: Jack's Car Rental
• Two Locations: Jack manages two car rental locations.
• Daily Rentals: Customers arrive daily at each location to rent cars.
• Revenue: Jack earns $10 for each car rented.
• Lost Business: If a location is out of cars, rental requests are lost (no revenue).
• Car Returns: Rented cars are returned the next day.
• Overnight Movement: Jack can move cars between the two locations overnight. Cost of
Movement: Moving a car costs $2.
• Demand and Returns are Poisson: The number of rental requests and car returns at each
location follow Poisson distributions (discrete probability distribution, number of events
occurring in a fixed interval of time) with given expected values (λ).
• Location 1: Rental requests λ = 3, Returns λ = 3
• Location 2: Rental requests λ = 4, Returns λ = 2
• Capacity Limit: Each location can hold a maximum of 20 cars. Any excess cars effectively
disappear.
Example 4.2: Jack's Car Rental (Policy
Iteration)
1. Begin with a basic strategy (e.g., never move cars).
2. Evaluate Current Strategy: Figure out how good this strategy is for every
possible number of cars at each location (calculate the "value" of each
state). This considers expected future profits based on rental demand
and returns.
3. Improve Strategy: For each possible situation (number of cars at each
location), consider all possible actions (moving -5 to +5 cars). Choose the
action that looks like it will lead to the best future outcome (highest
value based on the evaluation). This creates a new, potentially better
strategy.
4. Repeat: Go back to step 2 and evaluate this new strategy. Keep repeating
steps 2 and 3 until the strategy stops changing. At that point, you've
found the best way for Jack to manage his cars.
Drawbacks of policy iteration:
• Computational Cost:
• Policy Evaluation: Each iteration requires evaluating the current policy, which
can be computationally intensive. For large state spaces, this evaluation step
can take considerable time, especially if a high degree of accuracy is needed.
• Policy Improvement: While typically faster than policy evaluation, the
improvement step still requires calculating the best action based on the
state-value function. In large state-action spaces, this can also be slow.
• Convergence Speed:
• Policy iteration often converges faster than value iteration initially due to the
more structured approach, but in cases where the value function is poorly
approximated, it may take many iterations to converge to an optimal policy.
• Particularly in large state spaces, it may take many iterations to reach
convergence, which can be computationally prohibitive.
Value Iteration
• An algorithm used to compute the optimal policy and value function
in Markov Decision Processes (MDPs).
• Repeatedly updates the value of states based on the expected
rewards of actions.
Gambler's Problem
• Coin flip outcome: Heads, the gambler wins the amount he bet; Tails, he loses the amount he bet.
• Problem definition:
• A gambler has a limited amount of money (e.g., $100) and wants to reach a goal
(e.g., $200) by betting on a fair coin flip.
• If the gambler wins, their amount increases; if they lose, it decreases.
• States :
• Represent the amount of money the gambler has (from 0 to 200).
• Actions:
• The gambler can choose to bet a certain amount (from $1 to the current amount
they hold).
• Rewards:
• Winning a bet results in the value increasing by the amount betted; losing results in
the value decreasing by the same amount.
Gambler's Problem (1)
• Terminal States
• Win (Goal): Reaching $200.
• Lose: Going broke ($0).
• Value Function
• The goal is to calculate the expected return (value) for each state V(s) based on
possible actions at each state.
• The Bellman optimality equation used for the update:

V(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V(s') ]

where:
R(s, a, s') = immediate reward,
γ = discount factor,
P(s' | s, a) = transition probability to the next state.
How does the value iteration process work?
• Initialization: Set V(s) for all states to an initial value (e.g., $0).
• Iterate: Update values using the Bellman equation until convergence
(values do not change significantly).
• Extract Policy: Once values converge, derive the optimal betting
strategy for each state.
✓ With sufficient iterations, the value function stabilizes, providing
insights into the best strategy for maximizing the chances of reaching
$200 without going bankrupt.
✓The Gambler's Problem is about finding the smartest way to bet on
coin flips to reach a specific money goal, knowing the odds of the
coin. We want to know the chances of winning from any current
amount of money and the best betting amount at each step.
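A minimal value iteration sketch for the gambler's setting just described. It assumes, as in the classic formulation, a reward of +1 only for reaching the goal and 0 otherwise; the goal amount and the coin's win probability are parameters (p_heads = 0.5 matches the fair coin described above), and all names are illustrative.

```python
# Minimal sketch of value iteration for the Gambler's Problem.
# States are capital levels 0..goal; 0 and goal are terminal.
def gambler_value_iteration(goal=100, p_heads=0.5, theta=1e-9):
    V = [0.0] * (goal + 1)
    V[goal] = 1.0                                  # reaching the goal counts as return 1
    while True:
        delta = 0.0
        for s in range(1, goal):                   # sweep all non-terminal capitals
            best = max(
                p_heads * V[s + bet] + (1 - p_heads) * V[s - bet]
                for bet in range(1, min(s, goal - s) + 1)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                          # values have stabilized
            break
    # Extract a greedy policy: the stake maximizing the expected value in each state.
    policy = [0] * (goal + 1)
    for s in range(1, goal):
        q = {bet: p_heads * V[s + bet] + (1 - p_heads) * V[s - bet]
             for bet in range(1, min(s, goal - s) + 1)}
        policy[s] = max(q, key=q.get)              # ties resolve to the smallest stake
    return V, policy
```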
Value Iteration

Figure: Jack's Car Rental, the sequence of policies π₀, π₁, π₂, π₃, π₄.
• Positive number (e.g., 1, 2, 3, 4, 5): move that many cars from the second location to the first location overnight.
• Negative number (e.g., -1, -2, -3, -4): move that many cars from the first location to the second location overnight.
• Zero (0): don't move any cars overnight.
• X-axis: number of cars at the second location. Y-axis: number of cars at the first location.
•π₀ (Initial Policy): Starts with a simple policy (in this case, it seems to be to move 0 cars between locations
regardless of the current state).
•π₁: After one iteration of policy evaluation and improvement, the policy starts to suggest moving cars based
on the car distribution. For example, if there are many cars at location 2 and few at location 1, it suggests moving
some from 2 to 1 (positive numbers).
•π₂, π₃, π₄: As the algorithm continues to iterate, the policy becomes more refined. The boundaries between
different actions shift, indicating a better understanding of the optimal car movements for different situations.
Notice how the areas with the same action become more structured.
•π₄ (Near-Optimal Policy): After several iterations, the policy likely converges to a near-optimal strategy.
V₄: shows the estimated value function for the policy π₄.
•Axes: Same as the policy plots: #Cars at second location and #Cars at first location.
•Vertical Axis (Value): Represents the estimated long-term profit you can expect to make if you start with that
specific number of cars at each location and follow the policy π₄.
•The Surface: The height of the surface at each point (combination of cars) indicates the value.
Higher values mean a more profitable starting state under policy π₄.
•What it shows: You can see how the value generally increases as you have more cars available in total.
•The shape of the surface reflects the complexities of the problem (demand, returns, moving costs).
Value Iteration
• The policy evaluation step of policy iteration can be truncated in several
ways without losing the convergence guarantees of policy iteration.
• One important special case is when policy evaluation is stopped after just
one sweep (one backup of each state). This algorithm is called value
iteration.
• Value Iteration is based on the principle of optimality.
• It starts with an initial estimate of the value function (often all zeros) and
iteratively updates this estimate by applying the Bellman Optimality
Equation.
• In each iteration, it considers all possible actions from each state and
updates the value of that state to be the maximum expected return
achievable by taking any action and then following the optimal policy
thereafter.
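In symbols, and in the same notation as the earlier policy evaluation update, the one-sweep update that defines value iteration is

v_{k+1}(s) = max_a Σ_{s', r} p(s', r | s, a) [ r + γ v_k(s') ],  for all s ∈ S,

which is simply the Bellman optimality equation turned into an update rule.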
Asynchronous Dynamic Programming
• Asynchronous DP methods back up state values in place and in any order, using whatever values of other states happen to be available, rather than performing systematic sweeps over the entire state set.
Figure: Gambler's Problem, value estimates over successive sweeps of value iteration.
• X-axis (Capital): the amount of money the gambler has (from $1 to $99).
• Y-axis (Value Estimates): the estimated probability of the gambler reaching the $100 goal from that amount of money.
• As we do more sweeps, the estimates of the winning probability become more accurate; the curves become more stable and converge towards a final shape.
• The probability of winning generally increases as the gambler's capital increases.
Figure: Gambler's Problem, the final policy (stake).
• Capital: again, the amount of money the gambler has.
• Final Policy (Stake): the optimal amount of money the gambler should bet at each capital level to maximize the chance of reaching $100.
• Spikes: the tall spikes indicate situations where the optimal strategy is to bet a larger amount.
• Steps: the flatter parts show situations where smaller bets are likely optimal.
• The best betting strategy isn't always to bet everything; it depends on the gambler's current capital. For example, when close to $50, a larger bet might be optimal to try to reach the goal faster.
Generalized Policy Iteration (GPI)
• Refers to the general concept of allowing policy evaluation and improvement
processes to interact, regardless of the specific timing or granularity.
• All reinforcement learning methods can be viewed as GPI.
• They maintain a policy and a value function, with the policy being improved
based on the value function and the value function being updated to reflect the
current policy.
Why GPI:The Convergence Condition
• If both evaluation and improvement processes stabilize (no
longer produce changes):
• The value function is consistent with the current policy.
• The policy is greedy with respect to the current value function.
• This implies the policy is greedy with respect to its own evaluation
function, satisfying the Bellman optimality equation.

V*(s) = max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V*(s') ]

• Therefore, the policy and value function are optimal.


Example (1)
• Two states: State A and State B (which is a terminal state with a
reward of +1).
• Actions in State A: "Stay in State A" or "Go to State B".
• Discount factor (γ): 1.0 (no discounting).
• Initial State:
• Policy (State A): Let's start with an initial policy in State A: "Stay in State A"
(even though it doesn't lead to the reward).
• Value Function: V(A) = 0
• V(B) = +1 (terminal state's value is its reward)
Iteration 1
• Policy Evaluation: For State A, given the policy "Stay in State A", the value function should reflect the expected future reward. Since staying in A leads nowhere with a reward, V(A) remains 0. V(B) remains +1.
• Policy Improvement: Look at State A and consider all possible actions. The value of taking "Go to State B" is: Reward(A -> B) + γ * V(B) = 0 + 1.0 * (+1) = +1. Since +1 (the value of going to B) is greater than 0 (the value of following the current policy of staying in A), we improve the policy for State A to: "Go to State B".

Iteration 2
• Policy Evaluation: For State A, with the new policy "Go to State B", the value function is now: V(A) = Reward(A -> B) + γ * V(B) = 0 + 1.0 * (+1) = +1. V(B) remains +1.
• Policy Improvement: Look at State A: the action "Go to State B" has a value of +1, which is now equal to the current value V(A). The policy remains "Go to State B".
Bootstrap
• Bootstrapping in RL refers to the process of estimating a value based on
other estimated values.
• Instead of relying solely on actual observed rewards, we use our current
value estimates to update other value estimates.
• When updating the value of a state (or state-action pair), a bootstrapping
method will use the estimated value of the subsequent state (or state-
action pair) in its update rule.
• In Q-learning, for example, the update rule is
Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ].
Note: Q(s', a') is an estimated value, and we are using it to update Q(s, a).
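A minimal sketch of this bootstrapped update as code. Q is assumed to be a dict of dicts (Q[s][a]), and alpha and gamma are the usual step-size and discount parameters; none of these names come from the slides.

```python
# One bootstrapped Q-learning update: the target uses the current *estimate*
# max_a' Q(s', a') instead of waiting for the true return.
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    bootstrap = max(Q[s_next].values()) if Q[s_next] else 0.0  # 0 at terminal states
    td_target = r + gamma * bootstrap
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q
```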
Used for?
• Efficiency:
• In many RL problems, especially those with long episodes or continuous state spaces, waiting
for the final outcome (the true return) to update values can be very slow and inefficient.
• Bootstrapping allows for faster learning by leveraging existing knowledge (our current value
estimates).
• Temporal Difference (TD) Learning:
• Bootstrapping is a fundamental component of Temporal Difference (TD) learning methods
like Q-learning, SARSA, and TD(λ).
• These methods update value estimates based on the difference between the current
estimate and a more informed estimate that includes the immediate reward and the
estimated value of the next state.
Bootstrap
Advantages:
• Faster learning compared to Monte Carlo methods (which wait for the end of an episode).
• Can learn online, without waiting for complete episodes.
Disadvantages:
• Can be susceptible to bias if the initial value estimates are poor.
• Can sometimes lead to instability if not carefully implemented (e.g., with a high learning rate).
Full Backup in Reinforcement Learning
• A full backup involves updating the value of a state (or state-action
pair) by considering all possible next states and their corresponding
probabilities (under the current policy or environment dynamics).
• It essentially performs an expectation over all possible future
outcomes in a single step.
• Full backups are a key characteristic of Dynamic Programming
methods like Policy Iteration and Value Iteration.
• These methods have a complete model of the environment
(transition probabilities and rewards).
Working
• To update the value of a state s, a full backup would look at all
possible next states s' that can be reached from s under all possible
actions a, weighted by their transition probabilities P(s'|s, a).
• For the state value function V(s) under a policy π, the update in
Policy Evaluation (using full backup) looks like:
V_{k+1}(s) = Σ_{a ∈ A(s)} π(a|s) Σ_{s' ∈ S} P(s' | s, a) [ R(s, a, s') + γ V_k(s') ]
• Similarly, in Value Iteration, the update for V(s) involves taking the
maximum over all actions, with each action considering all possible
next states and their probabilities.
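A vectorized sketch of one full-backup sweep of this update, assuming (hypothetically) that the model is stored as NumPy arrays P[s, a, s'] for p(s'|s, a), R[s, a, s'] for R(s, a, s'), and pi[s, a] for π(a|s).

```python
import numpy as np

# One full-backup sweep of policy evaluation: V_{k+1}(s) for every state at once.
def full_backup_sweep(V, P, R, pi, gamma=0.9):
    # Expected one-step return q(s, a): sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * V(s')).
    q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
    # Weight by pi(a|s) and sum over actions to get the new state values.
    return np.einsum("sa,sa->s", pi, q)
```

For value iteration, the same q array would instead be reduced with q.max(axis=1), taking the maximum over actions rather than the policy-weighted average.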
Full Backup
Advantages:
• Guaranteed to converge to the optimal value function (under certain conditions) because it directly uses the environment model.
• Efficient when the environment model is known and the state space is not too large.
Disadvantages:
• Requires a complete and accurate model of the environment (transition probabilities and rewards), which is often unavailable in real-world RL problems.
• Can be computationally expensive for large state and action spaces due to the need to iterate over all possibilities.
Relationship between Bootstrapping and Full Backup

• Full backup is a form of bootstrapping: it uses the estimated values of the next states (V_k(s')) to update the value of the current state (V_{k+1}(s)). However, it does so by considering all possible next states weighted by their probabilities.
• TD learning (with bootstrapping) often performs "sample backups" or "one-step backups". Instead of looking at all possible next states, it takes a single step in the environment and uses the observed reward and the estimated value of the actual next state to update the current value. This is a form of bootstrapping that does not require the full environment model.
• Monte Carlo methods are an example of non-bootstrapping methods. They estimate values based on the entire sequence of rewards received until the end of an episode, without using estimates of future values within the episode.
Monte Carlo Methods in Reinforcement
Learning
• Learning by playing a game many, many times.
• After each full game (episode), you look back at how you played and
the rewards you got.
• You then adjust your strategy (policy) and your understanding of how
good each move was (value) based on the actual outcome of those
games.
• Learn from complete experiences, not by guessing what might
happen next in the middle of a game.
Why Monte Carlo over Dynamic Programming?
• Monte Carlo Wins When...
1. No Perfect Game Rules (Model-Free): We don't know all the
probabilities of what happens next in the environment. MC only
needs to experience the game. DP needs to know the rules
perfectly.
2. Huge State Spaces: Trying to calculate values for every possible
situation (like in a complex game) can be impossible for DP. MC
learns from samples (actual games played), so it can handle much
larger problems.
3. Focus on Specific States: We might only care about the value of
certain starting positions or key moments. MC can focus learning on
those specific parts without needing to calculate everything else.
Monte Carlo Control: the goal is to find the best strategy (policy) to get the most reward.
• Play many episodes using the current strategy.
• Estimate how good each action was in each state based on the total
reward received after taking that action in those episodes.
• Improve the strategy by making it more likely to choose actions that
led to higher rewards in the past.
• Repeat until the strategy stops improving.
1. Start:
• Initial guess of how good taking action a in state s
• Initial strategy (policy) for choosing actions in each
state
• A list to keep track of all the total rewards we've
received after taking action a in state s. Starts empty.
2. Playing (Forever): Start & Play full game until the
game ends.
3. Learn from the Game
i. G: Calculate the total reward received from
the first time we took action a in state s until the
end of the game.
ii. Add this total reward G to our list of rewards
Returns(s, a) for that specific state-action pair.
iii. Update our value estimate Q(s, a) by taking the
average of all the total rewards we've seen so far for
taking action a in state s
4. Improve the Strategy:
i. All the possible actions we could take in that state
ii. Update our strategy π(s) to always choose the
action a that currently has the highest estimated
value Q(s, a). This makes our strategy greedy with
respect to our current value estimates.
5. Repeat: Keep playing games, learning from them, and
improving our strategy forever
π(a|s): Our initial strategy (policy) for choosing actions in each
state.
It is an ε-soft policy:
i. It mostly chooses the action we think is best (a*).
ii. But it also has a small chance (ε) of choosing any other
action randomly. This ensures we keep exploring.

Improve Our Strategy (Slightly):


For Every State in the Game:
i. a*: Find the action a that currently has the highest
estimated value Q(s, a) in that state. This is the action we
currently think is the best.
ii. Update Our Policy (ε-Softly): For all possible actions a in
that state:
a. If a is the best action (a*), we increase the probability
of choosing it in the future (make it more likely).
b. If a is not the best action, we assign it a small
probability (equal probability for all other actions) to
ensure we still explore other options.
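A condensed sketch of on-policy first-visit Monte Carlo control with an ε-soft (ε-greedy) policy, in the spirit of the steps above. The environment interface (env.reset() returning a state, env.step(a) returning (next_state, reward, done)) and every name here are assumptions for illustration, not from the slides.

```python
import random
from collections import defaultdict

def mc_control_epsilon_soft(env, actions, episodes=10_000, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(s, a)] action-value estimates
    n_visits = defaultdict(int)                  # counts for incremental averaging

    def choose_action(s):                        # epsilon-soft: mostly greedy, sometimes random
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        # Generate one complete episode following the current policy.
        episode, s, done = [], env.reset(), False
        while not done:
            a = choose_action(s)
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # Work backwards through the episode, updating first-visit estimates.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = r + gamma * G                    # return from time t onward
            if (s, a) not in [(x, y) for x, y, _ in episode[:t]]:   # first visit only
                n_visits[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / n_visits[(s, a)]     # running average
    return Q
```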
On-Policy vs. Off-Policy Learning

On-Policy Learning:
• Learn about the strategy you are currently using.
• You play the game with one strategy and improve that same strategy.
• Think of it as practicing exactly how you want to play in the real game.

Off-Policy Learning:
• Learn about a different strategy than the one you are currently using.
• You might observe someone else playing (or use a different strategy temporarily) to learn about the best way to play, and then apply that knowledge to your own strategy.
• Think of it as learning by watching an expert play, even if you're currently playing differently.
Incremental Monte Carlo
• Instead of storing returns and waiting until many episodes have been collected, update
the value estimates incrementally after each episode is completed (as a running average of
the returns observed so far).
• This allows learning to happen more continuously and can be
more efficient for long-running tasks.
• Think of it as adjusting your understanding slightly after each game
you play, rather than waiting to analyze a whole bunch of games at
once.
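A minimal sketch of the incremental (running-average) update applied after each finished episode, so past returns never need to be stored. The names, and the optional constant step size alpha, are illustrative assumptions.

```python
# Update state-value estimates from one finished episode.
# episode_returns: list of (state, G) pairs, where G is the return observed from that state.
def incremental_mc_update(V, counts, episode_returns, alpha=None):
    for s, G in episode_returns:
        old = V.get(s, 0.0)
        if alpha is None:                        # exact running average of all returns so far
            counts[s] = counts.get(s, 0) + 1
            V[s] = old + (G - old) / counts[s]
        else:                                    # constant step size: recency-weighted average
            V[s] = old + alpha * (G - old)
    return V
```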
Monte Carlo Methods
1. Need Many Episodes: To get reliable estimates, you need to play
the "game" (interact with the environment) a lot.
2. Only Works for Episodic Tasks: The process relies on having clear
"episodes" with a beginning and an end to calculate total rewards.
3. High Variance: The outcome of a single episode can be random,
leading to noisy value estimates, especially early on. Averaging over
many episodes helps reduce this.
4. Exploration vs. Exploitation: You need to try different actions
(explore) to find the best ones, but also use the actions that seem
best so far (exploit) to get high rewards. Balancing this is important.
BlackJack with Monte Carlo
Refer to the reference book for this topic.
