RL Module 4
Policy Evaluation (1)
• The state-value function for a policy π satisfies the Bellman equation (4.4):
  v_π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ v_π(s') ]
  where γ (gamma) is the discount factor, which represents the importance of future rewards.
Policy Evaluation (2)
• For our purposes, iterative solution methods are most suitable. Consider a sequence of approximate value functions v_0, v_1, v_2, …, each mapping S⁺ to ℝ.
• The initial approximation, v_0, is chosen arbitrarily (except that the terminal state, if any, must be given value 0), and each successive approximation is obtained by using the Bellman equation for v_π (4.4) as an update rule:
  v_{k+1}(s) = Σ_a π(a|s) Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ v_k(s') ]
  for all s ∈ S.
Policy Evaluation (3)
• Iterative policy evaluation algorithm
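A minimal sketch of the iterative policy evaluation algorithm in Python, assuming a tabular model where P[s, a, s'] and R[s, a, s'] are NumPy arrays; the names and the in-place sweep are illustrative choices, not something specified on the slides:

```python
import numpy as np

def iterative_policy_evaluation(P, R, policy, gamma=0.9, theta=1e-6):
    """Evaluate a fixed policy by sweeping full backups until values stop changing.

    P[s, a, s2]  : probability of reaching state s2 from s under action a
    R[s, a, s2]  : expected immediate reward for that transition
    policy[s, a] : probability of taking action a in state s
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                     # v_0: arbitrary initial values (0 here)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Full backup: expectation over actions (policy) and next states (dynamics).
            v_new = sum(policy[s, a] * P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
                        for a in range(n_actions)
                        for s2 in range(n_states))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                       # in-place (sweep) update
        if delta < theta:                      # largest change this sweep is negligible
            return V
```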
Policy Improvement
• The process of creating a new policy that is better than or equal to
the current policy.
• This is often done by acting greedily with respect to the value
function of the current policy.
• We have a deterministic policy π and its state-value function v_π(s).
• We want to know if changing the action in a specific state s to a
different action a (a ≠ π(s)) would result in a better policy.
Evaluating a Potential Change:
• Consider taking action a in state s once, and then following the
original policy π thereafter.
• The value of this action is given by the action-value function:
  q_π(s, a) = Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ v_π(s') ]
• If q_π(s, π'(s)) ≥ v_π(s) for all s ∈ S (4.7), then the new policy π' is as good as or better than π for all states:
  v_π'(s) ≥ v_π(s) for all s ∈ S.   (4.8)
• If there is strict inequality of (4.7) at any state, then there must be strict inequality of (4.8) at at least one state.
• This result applies in particular to the two policies that we considered in the previous paragraph: an original deterministic policy π, and a changed policy π' that is identical to π except that π'(s) = a ≠ π(s).
• Obviously, (4.7) holds at all states other than s.
• Thus, if q_π(s, a) > v_π(s), then the changed policy is indeed better than π.
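For completeness, a sketch of the standard argument behind the policy improvement theorem, obtained by repeatedly applying condition (4.7) and expanding q_π; this follows the textbook derivation and is not shown on the slides:

```latex
\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) \\
  &= \mathbb{E}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s) \right] \\
  &= \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right] \\
  &\le \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \right] \\
  &= \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s \right] \\
  &\;\;\vdots \\
  &\le \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right] = v_{\pi'}(s).
\end{aligned}
```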
Policy Iteration
• Once a policy, π, has been improved using v_π to yield a better policy, π', we can then compute v_π' and improve it again to yield an even better π''.
• We can thus obtain a sequence of monotonically improving policies and value functions:
  π_0 →(E) v_{π_0} →(I) π_1 →(E) v_{π_1} →(I) π_2 →(E) ⋯ →(I) π_* →(E) v_*
  where →(E) denotes a policy evaluation step and →(I) a policy improvement step.
Policy iteration (using iterative policy evaluation) for v_π.
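A minimal sketch of the policy iteration loop, reusing the iterative_policy_evaluation sketch above; the one-hot policy encoding and tie-breaking are assumptions, not something specified on the slides:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-6):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    n_states, n_actions, _ = P.shape
    # Start from an arbitrary deterministic policy (always action 0), one-hot encoded.
    policy = np.zeros((n_states, n_actions))
    policy[:, 0] = 1.0
    while True:
        V = iterative_policy_evaluation(P, R, policy, gamma, theta)   # evaluation step
        policy_stable = True
        for s in range(n_states):
            old_action = int(np.argmax(policy[s]))
            # Greedy improvement: pick the action with the best one-step lookahead value.
            q = [sum(P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
                     for s2 in range(n_states))
                 for a in range(n_actions)]
            best_action = int(np.argmax(q))
            policy[s] = 0.0
            policy[s, best_action] = 1.0
            if best_action != old_action:
                policy_stable = False
        if policy_stable:                      # no state changed its greedy action
            return policy, V
```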
Example 4.2: Jack's Car Rental
• Two Locations: Jack manages two car rental locations.
• Daily Rentals: Customers arrive daily at each location to rent cars.
• Revenue: Jack earns $10 for each car rented.
• Lost Business: If a location is out of cars, rental requests are lost (no revenue).
• Car Returns: Rented cars are returned the next day.
• Overnight Movement: Jack can move cars between the two locations overnight.
• Cost of Movement: Moving a car costs $2.
• Demand and Returns are Poisson: The numbers of rental requests and car returns at each location follow Poisson distributions (a discrete probability distribution over the number of events occurring in a fixed interval of time) with the given expected values (λ); a small numeric sketch follows this list.
• Location 1: Rental requests λ = 3, Returns λ = 3
• Location 2: Rental requests λ = 4, Returns λ = 2
• Capacity Limit: Each location can hold a maximum of 20 cars. Any excess cars effectively
disappear.
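As a quick illustration of the Poisson demand model, the probability of seeing exactly n rental requests at a location with expected value λ is λⁿ e^(−λ) / n!; a small sketch (the helper name is illustrative):

```python
from math import exp, factorial

def poisson_pmf(n, lam):
    """Probability of exactly n events when the expected number of events is lam."""
    return lam ** n * exp(-lam) / factorial(n)

# Location 1 rental requests (lambda = 3): probability of exactly 3 requests in a day.
print(poisson_pmf(3, 3.0))   # ~0.224
```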
Example 4.2: Jack's Car Rental (Policy Iteration)
1. Begin with a basic strategy (e.g., never move cars).
2. Evaluate Current Strategy: Figure out how good this strategy is for every
possible number of cars at each location (calculate the "value" of each
state). This considers expected future profits based on rental demand
and returns.
3. Improve Strategy: For each possible situation (number of cars at each
location), consider all possible actions (moving -5 to +5 cars). Choose the
action that looks like it will lead to the best future outcome (highest
value based on the evaluation). This creates a new, potentially better
strategy (see the sketch after this list).
4. Repeat: Go back to step 2 and evaluate this new strategy. Keep repeating
steps 2 and 3 until the strategy stops changing. At that point, you've
found the best way for Jack to manage his cars.
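A compact sketch of how step 3 might look for this problem, with states as pairs (cars at location 1, cars at location 2) and actions as the net number of cars moved from location 1 to 2; `expected_return` is a hypothetical helper, and the discount factor is an assumption not stated on the slides:

```python
MAX_CARS = 20    # capacity limit at each location
MAX_MOVE = 5     # at most 5 cars can be moved overnight (matches the -5..+5 actions above)
MOVE_COST = 2    # $2 per car moved
GAMMA = 0.9      # discount factor -- an assumption, not stated on the slides

def improve_policy(V, expected_return):
    """Greedy improvement step: for every state, pick the best overnight move.

    expected_return(state, move, V, gamma, move_cost) is a hypothetical helper
    that would compute the expected one-step profit plus discounted next-state
    value under the Poisson demand/return model.
    """
    policy = {}
    for c1 in range(MAX_CARS + 1):             # cars at location 1
        for c2 in range(MAX_CARS + 1):         # cars at location 2
            best_move, best_value = 0, float("-inf")
            for move in range(-MAX_MOVE, MAX_MOVE + 1):
                # Only consider moves backed by cars actually present at the source location.
                if not (-c2 <= move <= c1):
                    continue
                value = expected_return((c1, c2), move, V, GAMMA, MOVE_COST)
                if value > best_value:
                    best_move, best_value = move, value
            policy[(c1, c2)] = best_move
    return policy
```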
Drawbacks of policy iteration:
• Computational Cost:
• Policy Evaluation: Each iteration requires evaluating the current policy, which
can be computationally intensive. For large state spaces, this evaluation step
can take considerable time, especially if a high degree of accuracy is needed.
• Policy Improvement: While typically faster than policy evaluation, the
improvement step still requires calculating the best action based on the
state-value function. In large state-action spaces, this can also be slow.
• Convergence Speed:
• Policy iteration often converges in fewer iterations than value iteration because each iteration performs a complete policy evaluation, but when the value function is only poorly approximated, it may still take many iterations to converge to an optimal policy.
• Particularly in large state spaces, reaching convergence may require many iterations, which can be computationally prohibitive.
Value Iteration
• An algorithm used to compute the optimal policy and value function
in Markov Decision Processes (MDPs).
• Repeatedly updates the value of states based on the expected
rewards of actions.
Gambler's Problem
• Coin flip outcomes: Heads, the gambler wins the amount he bet; Tails, he loses the amount he bet.
• Problem definition:
• A gambler has a limited amount of money (e.g., $100) and wants to reach a goal
(e.g., $200) by betting on a fair coin flip.
• If the gambler wins, their amount increases; if they lose, it decreases.
• States :
• Represent the amount of money the gambler has (from 0 to 200).
• Actions:
• The gambler can choose to bet a certain amount (from $1 to the current amount
they hold).
• Rewards:
• Winning a bet results in the gambler's capital increasing by the amount bet; losing results in it decreasing by the same amount.
Gambler's Problem (1)
• Terminal States
• Win (Goal): Reaching $200.
• Lose: Going broke ($0).
• Value Function
• The goal is to calculate the expected return (value) for each state V(s) based on
possible actions at each state.
• The Bellman Equation:
  V(s) = max_a Σ_{s'} P(s′ ∣ s, a) [ R + γ V(s′) ]
  where R = immediate reward, γ = discount factor, and P(s′ ∣ s, a) = transition probability to the next state.
How the Value Iteration Process Works
• Initialization: Set V(s) for all states to an initial value (e.g., 0).
• Iterate: Update values using the Bellman equation until convergence
(values do not change significantly).
• Extract Policy: Once values converge, derive the optimal betting
strategy for each state.
✓ With sufficient iterations, the value function stabilizes, providing
insights into the best strategy for maximizing the chances of reaching
$200 without going bankrupt.
✓The Gambler's Problem is about finding the smartest way to bet on
coin flips to reach a specific money goal, knowing the odds of the
coin. We want to know the chances of winning from any current
amount of money and the best betting amount at each step.
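A minimal sketch of this process for the Gambler's Problem in Python. It assumes the common formulation in which the only reward is +1 for reaching the goal of $200 (so V(s) approximates the probability of winning from capital s), which differs slightly from the per-bet rewards described earlier; the function name and parameters are illustrative:

```python
import numpy as np

def gambler_value_iteration(goal=200, p_h=0.5, theta=1e-9):
    """Value iteration for the Gambler's Problem (undiscounted, episodic).

    V[s] is the probability of reaching `goal` from capital s; fixing
    V[goal] = 1 folds the +1 reward for winning into the backup.
    p_h is the probability of heads (0.5 for the fair coin on the slides).
    """
    V = np.zeros(goal + 1)
    V[goal] = 1.0                                # value of the winning terminal state
    while True:
        delta = 0.0
        for s in range(1, goal):                 # non-terminal states: capital 1..goal-1
            # Legal bets are capped at min(s, goal - s), following the textbook formulation.
            best = max(p_h * V[s + bet] + (1 - p_h) * V[s - bet]
                       for bet in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                        # values no longer change significantly
            break
    # Extract the greedy policy: the (smallest) best bet for each amount of capital.
    policy = np.zeros(goal + 1, dtype=int)
    for s in range(1, goal):
        bets = list(range(1, min(s, goal - s) + 1))
        q = [p_h * V[s + b] + (1 - p_h) * V[s - b] for b in bets]
        policy[s] = bets[int(np.argmax(q))]
    return V, policy
```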
Figure: the sequence of policies π₀, π₁, π₂, π₃, π₄ for the Gambler's Problem; the flatter parts show situations where smaller bets are likely optimal.
Policy Evaluation (Example)
• For State A, given the policy "Stay in State A", the value function should reflect the expected future reward.
• For State A, with the new policy "Go to State B", the value function is now:
  V(A) = Reward(A -> B) + γ * V(B) = 0 + 1.0 * (+1) = +1.
Full Backup in Reinforcement Learning
• A full backup involves updating the value of a state (or state-action
pair) by considering all possible next states and their corresponding
probabilities (under the current policy or environment dynamics).
• It essentially performs an expectation over all possible future
outcomes in a single step.
• Full backups are a key characteristic of Dynamic Programming
methods like Policy Iteration and Value Iteration.
• These methods have a complete model of the environment
(transition probabilities and rewards).
Working
• To update the value of a state s, a full backup would look at all
possible next states s' that can be reached from s under all possible
actions a, weighted by their transition probabilities P(s'|s, a).
• For the state value function V(s) under a policy π, the update in
Policy Evaluation (using full backup) looks like:
  V_{k+1}(s) = Σ_{a ∈ A(s)} π(a|s) Σ_{s' ∈ S} P(s'|s, a) [ R(s, a, s') + γ V_k(s') ]
• Similarly, in Value Iteration, the update for V(s) involves taking the
maximum over all actions, with each action considering all possible
next states and their probabilities.
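A small sketch of a single full backup for one state, in both flavors described above (expectation under the policy for policy evaluation, max over actions for value iteration); the tabular P, R arrays follow the same illustrative layout as the earlier sketches:

```python
def full_backup_evaluation(s, V, P, R, policy, gamma=0.9):
    """One full backup for state s under policy pi: expectation over actions and next states."""
    _, n_actions, n_states = P.shape
    return sum(policy[s, a] * P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
               for a in range(n_actions)
               for s2 in range(n_states))

def full_backup_optimal(s, V, P, R, gamma=0.9):
    """One full backup for state s in value-iteration style: max over actions."""
    _, n_actions, n_states = P.shape
    return max(sum(P[s, a, s2] * (R[s, a, s2] + gamma * V[s2]) for s2 in range(n_states))
               for a in range(n_actions))
```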
Bootstrap
• Advantages: Can learn online, without waiting for complete episodes.
• Disadvantages: Can sometimes lead to instability if not carefully implemented (e.g., with a high learning rate).