Blending MPC & Value Function Approximation for Efficient Reinforcement Learning
Anonymous authors
Paper under double-blind review
ABSTRACT
Model-Predictive Control (MPC) is a powerful tool for controlling complex, real-world
systems that uses a model to make predictions about future behavior. For each state
encountered, MPC solves an online optimization problem to choose a control action
that will minimize future cost. This is a surprisingly effective strategy, but real-time
performance requirements warrant the use of simple models. If the model is not
sufficiently accurate, then the resulting controller can be biased, limiting performance.
We present a framework for improving on MPC with model-free reinforcement learning
(RL). The key insight is to view MPC as constructing a series of local Q-function
approximations. We show that by using a parameter λ, similar to the trace decay
parameter in TD(λ), we can systematically trade-off learned value estimates against the
local Q-function approximations. We present a theoretical analysis that shows how error
from inaccurate models in MPC and value function estimation in RL can be balanced.
We further propose an algorithm that changes λ over time to reduce the dependence on
MPC as our estimates of the value function improve, and test the efficacy of our approach
on challenging high-dimensional manipulation tasks with biased models in simulation.
We demonstrate that our approach can obtain performance comparable to that of MPC with
access to the true dynamics, even under severe model bias, while being more sample
efficient than model-free RL.
1 INTRODUCTION
Model-free Reinforcement Learning (RL) is increasingly used in challenging sequential decision-making
problems including high-dimensional robotics control tasks (Haarnoja et al., 2018; Schulman et al., 2017)
as well as video and board games (Silver et al., 2016; 2017). While these approaches are extremely general,
and can theoretically solve complex problems with little prior knowledge, they also typically require a
large quantity of training data to succeed. In robotics and engineering domains, data may be collected
from real-world interaction, a process that can be dangerous, time consuming, and expensive.
Model-Predictive Control (MPC) offers a simpler, more practical alternative. While RL typically uses
data to learn a global model offline, which is then deployed at test time, MPC solves for a policy online
by optimizing an approximate model for a finite horizon at a given state. This policy is then executed
for a single timestep and the process repeats. MPC is one of the most popular approaches for control
of complex, safety-critical systems such as autonomous helicopters (Abbeel et al., 2010), aggressive
off-road vehicles (Williams et al., 2016) and humanoid robots (Erez et al., 2013), owing to its ability to
use approximate models to optimize complex cost functions with nonlinear constraints.
However, approximations in the model used by MPC can significantly limit performance. Specifically,
model bias may result in persistent errors that eventually compound and become catastrophic. For example,
in non-prehensile manipulation, practitioners often use a simple quasi-static model that assumes an object
does not roll or slide away when pushed. For more dynamic objects, this can lead to aggressive pushing
policies that perpetually over-correct, eventually driving the object off the surface.
Recently, there have been several attempts to combine MPC with model-free RL, showing that the
combination can improve over either approach alone. Many of these methods use
RL to learn a terminal cost function, thereby increasing the effective horizon of MPC (Zhong et al., 2013;
Lowrey et al., 2018; Bhardwaj et al., 2020). However, the learned value function is only applied at the end
of the MPC horizon, so model errors still persist within the horizon, leading to sub-optimal policies. Similar
approaches have also been applied to great effect in discrete games with known models (Silver et al., 2016;
2017; Anthony et al., 2017), where value functions and policies learned via model-free RL are used to
guide Monte-Carlo Tree Search. In this paper, we focus on a somewhat broader question: can machine
learning be used both to increase the effective horizon of MPC and to correct for model bias?
One straightforward approach is to try to learn (or correct) the MPC model from real data encountered
during execution; however, there are practical barriers to this strategy. Hand-constructed models are often
crude approximations of reality and lack the expressivity to represent the dynamics actually encountered.
Moreover, increasing the complexity of such models leads to computationally expensive updates that can
harm MPC's online performance. Model-based RL approaches such as Chua et al. (2018); Nagabandi et al. (2018); Shyam
et al. (2019) aim to learn general neural network models directly from data. However, learning globally
consistent models is an exceptionally hard task due to issues such as covariate shift (Ross & Bagnell, 2012).
We propose a framework, MPQ(λ), for weaving together MPC with learned value estimates to trade-off
errors in the MPC model and approximation error in a learned value function. Our key insight is to
view MPC as tracing out a series of local Q-function approximations. We can then blend each of these
Q-functions with value estimates from reinforcement learning. We show that by using a blending parameter
λ, similar to the trace decay parameter in TD(λ), we can systematically trade-off errors between these
two sources. Moreover, by smoothly decaying λ over learning episodes we can achieve the best of both
worlds: a policy can depend on a prior model before it has encountered any data and then gradually
become more reliant on learned value estimates as it gains experience.
To summarize, our key contributions are:
1. A framework that unifies MPC and Model-free RL through value function approximation.
2. Theoretical analysis of finite horizon planning with approximate models and value functions.
3. Empirical evaluation on challenging manipulation problems with varying degrees of model-bias.
2 PRELIMINARIES
2.1 REINFORCEMENT LEARNING
We consider an agent acting in an infinite-horizon discounted Markov Decision Process (MDP). An MDP
is defined by a tuple $M = (\mathcal{S}, \mathcal{A}, c, P, \gamma, \mu)$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $c(s,a)$ is
the per-step cost function, $s_{t+1} \sim P(\cdot|s_t,a_t)$ is the stochastic transition dynamics, $\gamma$ is the discount
factor, and $\mu(s_0)$ is a distribution over initial states. A closed-loop policy $\pi(\cdot|s)$ outputs a distribution over
actions given a state. Let $\mu^\pi_M$ be the distribution over state-action trajectories obtained by running policy
$\pi$ on $M$. The value function for a given policy $\pi$ is defined as $V^\pi_M(s) = \mathbb{E}_{\mu^\pi_M}\left[\sum_{t=0}^{\infty}\gamma^t c(s_t,a_t) \,\middle|\, s_0 = s\right]$
and the action-value function as $Q^\pi_M(s,a) = \mathbb{E}_{\mu^\pi_M}\left[\sum_{t=0}^{\infty}\gamma^t c(s_t,a_t) \,\middle|\, s_0 = s, a_0 = a\right]$. The objective
is to find an optimal policy $\pi^* = \arg\min_\pi \mathbb{E}_{s_0 \sim \mu}\left[V^\pi_M(s_0)\right]$. We can also define the (dis-)advantage
function $A^\pi_M(s,a) = Q^\pi_M(s,a) - V^\pi_M(s)$, which measures how good an action is compared to the action
taken by the policy in expectation. It can be equivalently expressed in terms of the Bellman error as
$A^\pi_M(s,a) = c(s,a) + \gamma\,\mathbb{E}_{s'\sim P,\,a'\sim\pi}\left[Q^\pi_M(s',a')\right] - \mathbb{E}_{a\sim\pi}\left[Q^\pi_M(s,a)\right]$.
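To make the equivalence explicit, the Bellman-error form of the disadvantage follows directly from the two identities above (a short worked derivation added here for completeness):
$$\begin{aligned}
V^\pi_M(s) &= \mathbb{E}_{a\sim\pi}\left[Q^\pi_M(s,a)\right], \qquad Q^\pi_M(s,a) = c(s,a) + \gamma\,\mathbb{E}_{s'\sim P,\,a'\sim\pi}\left[Q^\pi_M(s',a')\right] \\
\Rightarrow\; A^\pi_M(s,a) &= Q^\pi_M(s,a) - V^\pi_M(s) = c(s,a) + \gamma\,\mathbb{E}_{s'\sim P,\,a'\sim\pi}\left[Q^\pi_M(s',a')\right] - \mathbb{E}_{a\sim\pi}\left[Q^\pi_M(s,a)\right].
\end{aligned}$$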
MPC then iteratively optimizes this estimate (at the current system state $s_t$) to update the policy parameters
$$\phi^*_t = \arg\min_{\phi}\, Q^\phi_H(s_t, \pi_\phi(s_t)) \tag{2}$$
Alternatively, we can also view the above procedure from the perspective of disadvantage minimization.
Let us define an estimator for the 1-step disadvantage with respect to the potential function $\hat{Q}$ as
$A(s_i,a_i) = c(s_i,a_i) + \gamma\hat{Q}(s_{i+1},a_{i+1}) - \hat{Q}(s_i,a_i)$. We can then equivalently write the above optimization
as minimizing the discounted sum of disadvantages over time via the telescoping sum trick
$$\arg\min_{\pi\in\Pi}\,\mathbb{E}_{\mu^{\pi_\phi}_{\hat{M}}}\left[\hat{Q}(s_0,a_0) + \sum_{i=0}^{H-1}\gamma^i A(s_i,a_i)\,\Big|\, s_0 = s_t\right] \tag{3}$$
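As a quick sanity check of the telescoping identity behind (3), the following sketch (illustrative NumPy code, not from the paper; the rollout values are random placeholders) verifies that the H-horizon MPC objective and the disadvantage form coincide on a single sampled rollout:

```python
import numpy as np

rng = np.random.default_rng(0)
H, gamma = 8, 0.99
costs = rng.normal(size=H)        # c(s_i, a_i) along one sampled rollout
q_hat = rng.normal(size=H + 1)    # Qhat(s_i, a_i) evaluated along the same rollout

# H-horizon MPC estimate: discounted costs plus terminal value.
q_mpc = sum(gamma**i * costs[i] for i in range(H)) + gamma**H * q_hat[H]

# Telescoped form (Eq. 3): starting potential plus discounted 1-step disadvantages,
# A(s_i, a_i) = c_i + gamma * Qhat(s_{i+1}, a_{i+1}) - Qhat(s_i, a_i).
adv = costs + gamma * q_hat[1:] - q_hat[:-1]
q_tel = q_hat[0] + sum(gamma**i * adv[i] for i in range(H))

assert np.isclose(q_mpc, q_tel)
```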
Although the above formulation queries $\hat{Q}$ at every timestep, it is still exactly equivalent to the original
problem and hence does not mitigate the effects of model bias. In the next section, we build a concrete
method to address this issue by formulating a novel way to blend Q-estimates from MPC and a learned
value function that balances their respective errors.
This theorem generalizes several established results. Setting $H = 1$, $\epsilon = 0$ gives us the 1-step
simulation lemma in Kearns & Singh (2002) (Appendix A.1.1). Setting $\alpha = 0$, i.e. a true model, recovers
the cost-shaping result in Sun et al. (2018).
Further inspecting the terms in (4), we see that the model error increases with horizon $H$ (the first two terms)
while the learned value error decreases with $H$. This matches our intuition and indicates that there is
an optimal planning horizon $H^*$ that minimizes the bound.
Theorem 3.2. The optimal planning horizon that minimizes the bound in (4) is given by
$$H^* = \max\left(1,\; \min\left(H,\; \frac{\epsilon}{(1-\gamma)\left(\epsilon + \alpha\frac{V_{\max}-V_{\min}}{4}\right)}\right)\right) \tag{5}$$
In practice, the errors in the model and value function are usually unknown and hard to estimate, making it
impossible to set the MPC horizon to its optimal value. Instead, we next propose a strategy to blend the
Q-estimates from MPC and the learned value function at every timestep along the horizon, rather than
just at the terminal step, so that we can properly balance the different sources of error.
3.2 BLENDING MODEL PREDICTIVE CONTROL AND VALUE FUNCTIONS
A naive way to blend Q-estimates from MPC with Q-estimates from the value function would be to
consider a convex combination of the two
$$(1-\lambda)\underbrace{\hat{Q}(s,a)}_{\text{model-free}} + \lambda\underbrace{Q^\phi_H(s,a)}_{\text{model-based}} \tag{6}$$
where $\lambda \in [0,1]$. Here, the value function contributes a residual that is added to the MPC output, an
approach commonly used to combine model-based and model-free methods (Lee et al., 2020). However,
this solution is rather ad hoc. If we have a value function at our disposal, why invoke it only at the
first and last timesteps? As the value function gets better, it should be useful to invoke it at all timesteps.
Instead, consider the following recursive formulation for the Q-estimate. Given $(s_i,a_i)$, the state-action pair
encountered at horizon $i$, the blended estimate $Q^\lambda(s_i,a_i)$ is expressed as
$$\underbrace{Q^\lambda(s_i,a_i)}_{\text{current blended estimate}} = (1-\lambda)\underbrace{\hat{Q}(s_i,a_i)}_{\text{model-free}} + \lambda\Big(\underbrace{\hat{c}(s_i,a_i)}_{\text{model-based}} + \gamma\underbrace{Q^\lambda(s_{i+1},a_{i+1})}_{\text{future blended estimate}}\Big) \tag{7}$$
where $\lambda \in [0,1]$. The recursion ends at $Q^\lambda(s_H,a_H) = \hat{Q}(s_H,a_H)$. In other words, the current blended
estimate is a convex combination of the model-free value function and the one-step model-based return,
where the return in turn uses the future blended estimate. Note that, unlike (6), the model-free estimate is invoked
at every timestep.
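A minimal sketch of the recursion in (7), evaluated backwards along a single model rollout (illustrative code; `costs` and `q_hat` stand in for $\hat{c}(s_i,a_i)$ and $\hat{Q}(s_i,a_i)$ evaluated on that rollout, and are random placeholders here):

```python
import numpy as np

def blended_q(costs, q_hat, gamma, lam):
    """Blended estimate of Eq. (7) for a single H-step rollout.

    costs: shape (H,)   -- model-predicted costs c_hat(s_i, a_i)
    q_hat: shape (H+1,) -- learned Q-function evaluated at (s_i, a_i)
    Returns Q^lambda(s_0, a_0).
    """
    q_lam = q_hat[-1]                      # base case: Q^lam(s_H, a_H) = Qhat(s_H, a_H)
    for i in reversed(range(len(costs))):  # i = H-1, ..., 0
        q_lam = (1 - lam) * q_hat[i] + lam * (costs[i] + gamma * q_lam)
    return q_lam

rng = np.random.default_rng(0)
H, gamma = 8, 0.99
costs, q_hat = rng.normal(size=H), rng.normal(size=H + 1)
# lam = 0 returns Qhat(s_0, a_0); lam = 1 recovers the pure H-horizon MPC estimate.
print(blended_q(costs, q_hat, gamma, 0.0), q_hat[0])
```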
We can unroll (7) in time to show that $Q^\lambda_H(s,a)$, the blended $H$-horizon estimate, is simply an exponentially
weighted average of all horizon estimates
$$Q^\lambda_H(s,a) = (1-\lambda)\sum_{i=0}^{H-1}\lambda^i Q^\phi_i(s,a) + \lambda^H Q^\phi_H(s,a) \tag{8}$$
where $Q^\phi_k(s,a) = \mathbb{E}_{\mu^{\pi_\phi}_{\hat{M}}}\left[\sum_{i=0}^{k-1}\gamma^i\hat{c}(s_i,a_i) + \gamma^k\hat{Q}(s_k,a_k)\,\middle|\,s_0=s,\,a_0=a\right]$ is the $k$-horizon estimate. When
$\lambda=0$, the estimator reduces to just using $\hat{Q}$, and when $\lambda=1$ we recover the original MPC estimate $Q_H$ in
(1). For intermediate values of $\lambda$, we interpolate smoothly between the two by interpolating all $H$ estimates.
Implementing (8) naively would require running $H$ versions of MPC and then combining their outputs,
which is far too expensive. However, we can switch to the disadvantage formulation by applying a similar
telescoping trick
$$Q^\lambda_H(s,a) = \mathbb{E}_{\mu^{\pi_\phi}_{\hat{M}}}\left[\hat{Q}(s_0,a_0) + \sum_{i=0}^{H-1}(\gamma\lambda)^i A(s_i,a_i)\right] \tag{9}$$
This estimator has a similar form to the TD($\lambda$) estimator for the value function. However, while TD($\lambda$)
uses the $\lambda$ parameter for a bias-variance trade-off, our blended estimator aims to trade off bias in the dynamics
model against bias in the learned value function.
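The relationship between these forms can be checked numerically on a toy rollout; the sketch below (illustrative, self-contained, with random placeholder values) confirms that the weighted average (8) matches the recursion (7), and computes the disadvantage form (9) in a single pass without constructing $H$ separate horizon estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
H, gamma, lam = 8, 0.99, 0.7
costs = rng.normal(size=H)        # c_hat(s_i, a_i) along one model rollout
q_hat = rng.normal(size=H + 1)    # Q_hat(s_i, a_i) along the same rollout

# Eq. (7), unrolled backwards along the rollout.
q_rec = q_hat[H]
for i in reversed(range(H)):
    q_rec = (1 - lam) * q_hat[i] + lam * (costs[i] + gamma * q_rec)

# Eq. (8): exponentially weighted average of all k-horizon estimates.
def k_horizon(k):
    return sum(gamma**i * costs[i] for i in range(k)) + gamma**k * q_hat[k]

q_avg = (1 - lam) * sum(lam**k * k_horizon(k) for k in range(H)) + lam**H * k_horizon(H)
assert np.isclose(q_rec, q_avg)

# Eq. (9): single pass with 1-step disadvantages and the steeper discount gamma*lam,
# the form used for action selection and target generation in Algorithm 1.
adv = costs + gamma * q_hat[1:] - q_hat[:-1]
q_dis = q_hat[0] + sum((gamma * lam)**i * adv[i] for i in range(H))
```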
Corollary 3.1 (Proof in Appendix A.1.4). For approximate models and learnt value functions as defined
in Theorem 3.1, there exists a critical $\lambda^* \in [0,1]$ that minimizes the performance bound
$$\lambda^* = \frac{1}{\gamma}\left(1 - \frac{1}{H^*}\right) \tag{10}$$
where $H^*$ is stated in (5) in Theorem 3.2.
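As a concrete illustration, the sketch below evaluates (5) and (10) for some assumed error magnitudes; the values of `eps`, `alpha`, and the value range are hypothetical placeholders, not quantities estimated in the paper:

```python
import numpy as np

def optimal_horizon(eps, alpha, gamma, v_max, v_min, h_max):
    """Eq. (5): planning horizon minimizing the bound, clipped to [1, h_max]."""
    h = eps / ((1.0 - gamma) * (eps + alpha * (v_max - v_min) / 4.0))
    return max(1.0, min(h_max, h))

def optimal_lambda(h_star, gamma):
    """Eq. (10): blending parameter giving an effective horizon of h_star."""
    return float(np.clip((1.0 - 1.0 / h_star) / gamma, 0.0, 1.0))

# A fairly accurate value function (small eps) with a moderately biased model
# (larger alpha * value-range term) favors a short effective horizon.
h_star = optimal_horizon(eps=0.1, alpha=0.05, gamma=0.99, v_max=0.0, v_min=-100.0, h_max=64)
print(h_star, optimal_lambda(h_star, gamma=0.99))   # roughly 7.4 and 0.87
```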
Algorithm 1: MPQ(λ)
Input: Initial Q-function weights θ, approximate dynamics P̂ and cost function ĉ
Parameters: MPC horizon H, λ schedule [λ1, λ2, ...], discount factor γ, minibatch size K, number of mini-batches N, update frequency t_update
1:  D ← ∅
2:  for t = 1 ... ∞ do
3:      λ ← λ_t    // update λ
4:      φ_t ← argmin_φ E_{μ^{π_φ}_{M̂}} [ Q̂_θ(s_0, a_0) + Σ_{i=0}^{H−1} (γλ)^i A(s_i, a_i) | s_0 = s_t ]    // blended MPC action selection
5:      a_t ∼ π_{φ_t}
6:      Execute a_t on the system and observe (c_t, s_{t+1})
7:      D ← D ∪ {(s_t, a_t, c_t, s_{t+1})}
8:      if t % t_update == 0 then
9:          Sample N minibatches {(s_{k,n}, a_{k,n}, c_{k,n}, s'_{k,n})}_{k=1..K, n=1..N} from D
10:         ŷ_{k,n} = c_{k,n} + γ min_φ E_{μ^{π_φ}_{M̂}} [ Q̂_θ(s_0, a_0) + Σ_{i=0}^{H−1} (γλ)^i A(s_i, a_i) | s_0 = s'_{k,n} ]    // blended MPC value targets
11:         Update θ with SGD on the loss L = (1/N)(1/K) Σ_{n=1}^{N} Σ_{k=1}^{K} ( ŷ_{k,n} − Q̂_θ(s_{k,n}, a_{k,n}) )^2

At every timestep $t$, the policy parameters are thus obtained by optimizing the blended Q-estimator under the approximate model
$$\phi^*_t = \arg\min_{\phi}\,\mathbb{E}_{\mu^{\pi_\phi}_{\hat{M}}}\left[\hat{Q}_\theta(s_0,a_0) + \sum_{i=0}^{H-1}(\gamma\lambda)^i A(s_i,a_i)\,\Big|\,s_0 = s_t\right] \tag{11}$$
An action sampled from the resulting policy is then executed on the system. A commonly used heuristic is to
warm start the above optimization by shifting forward the solution from the previous timestep, which serves
as a good initialization if the noise in the dynamics is small (Wagener et al., 2019). This can significantly
cut computational cost by reducing the number of iterations required to optimize (11) at every timestep.
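A minimal sketch of the shift-based warm start (illustrative code; `mean_actions` is an assumed name for the previous H-step mean control sequence, not an identifier from the paper):

```python
import numpy as np

def shift_warm_start(mean_actions: np.ndarray) -> np.ndarray:
    """Shift the previous H x d_action control sequence forward by one step.

    The first action has already been executed, so drop it and repeat the last
    action to keep the sequence length at H before the next optimization.
    """
    shifted = np.roll(mean_actions, shift=-1, axis=0)
    shifted[-1] = mean_actions[-1]   # simple heuristic for the newly exposed final step
    return shifted
```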
Periodically, the parameters $\theta$ are updated via stochastic gradient descent to minimize the following loss
function with $N$ mini-batches of experience tuples of size $K$ sampled from the replay buffer
$$\mathcal{L}(\theta) = \frac{1}{N}\frac{1}{K}\sum_{n=1}^{N}\sum_{k=1}^{K}\left(\hat{y}_{k,n} - \hat{Q}_\theta(s_{k,n},a_{k,n})\right)^2 \tag{12}$$
The H-horizon MPC with blended Q-estimator is again invoked to calculate the targets
Figure 1: Tasks for evaluating MPQ(λ). Left to right: cartpole, peg insertion with a 7DOF arm, and in-hand
manipulation with a 24DOF dexterous hand to align a pen (blue) with a target orientation (green).
$$\hat{y}_{k,n} = c_{k,n} + \gamma\min_{\phi}\,\mathbb{E}_{\mu^{\pi_\phi}_{\hat{M}}}\left[\hat{Q}_\theta(s_0,a_0) + \sum_{i=0}^{H-1}(\gamma\lambda)^i A(s_i,a_i)\,\Big|\,s_0 = s'_{k,n}\right] \tag{13}$$
Using MPC to reduce error in Q-targets has been previously explored in the literature (Lowrey et al., 2018;
Bhardwaj et al., 2020), where the model is either assumed to be perfect or model error is not explicitly
accounted for. MPC with the blended Q-estimator and an appropriate λ allows us to generate more stable
Q-targets than using $\hat{Q}_\theta$ or model-based rollouts with a terminal Q-function alone. However, running
H-horizon optimization for all samples in a mini-batch can be time-consuming, forcing the use of smaller
batch sizes and sparse updates. In our experiments, we employ a practical modification where, during
the action selection step, MPC is also queried for value targets which are then stored in the replay buffer,
allowing us to use larger batch sizes and to perform updates at every timestep.
Finally, we also allow λ to vary over time. In practice, λ is decayed as more data is collected on the
system. Intuitively, in the early stages of learning, the bias in Q̂θ dominates and hence we want to rely
more on the model. A larger value of λ is appropriate as it up-weights longer horizon estimates in the
blended-Q estimator. As Q̂θ estimates improve over time, a smaller λ is favorable to reduce the reliance
on the approximate model.
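Putting the pieces together, the following is a compact sketch of the loop in Algorithm 1 with the online-target modification described above. It is a schematic under stated assumptions, not the authors' implementation: `mpc_blended(s, q_net, horizon, gamma, lam)` stands in for any MPC routine (MPPI in our experiments) that optimizes the blended objective (11) and returns both an action and the blended value estimate, while `env` and `q_net` are hypothetical interfaces.

```python
import torch

def train_mpq_lambda(env, q_net, mpc_blended, lam_schedule,
                     gamma=0.99, horizon=32, batch_size=256,
                     num_steps=100_000, lr=1e-3):
    """Schematic MPQ(lambda) training loop.

    Assumed interfaces (hypothetical):
      q_net(s, a) -> scalar Q-estimate (a torch.nn.Module)
      mpc_blended(s, q_net, horizon, gamma, lam) -> (action, blended_value)
      env.reset() -> state; env.step(a) -> (next_state, cost, done)
    """
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    buffer = []                                   # small buffer of (s, a, target)
    s = env.reset()
    for t in range(num_steps):
        lam = lam_schedule(t)                     # decay lambda over training
        a, _ = mpc_blended(s, q_net, horizon, gamma, lam)            # Eq. (11)
        s_next, cost, done = env.step(a)

        # Online value target: observed cost plus discounted blended estimate at
        # the next state, queried during action selection and stored directly.
        _, v_next = mpc_blended(s_next, q_net, horizon, gamma, lam)
        buffer.append((s, a, cost + gamma * float(v_next)))
        buffer = buffer[-50_000:]

        if len(buffer) >= batch_size:             # one SGD step per timestep
            idx = torch.randint(len(buffer), (batch_size,))
            batch = [buffer[int(i)] for i in idx]
            pred = torch.stack([q_net(si, ai) for si, ai, _ in batch])
            target = torch.tensor([y for _, _, y in batch])
            loss = torch.mean((pred - target) ** 2)                  # Eq. (12)
            opt.zero_grad()
            loss.backward()
            opt.step()

        s = env.reset() if done else s_next
```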
5 EXPERIMENTS
Task Details: We evaluate MPQ(λ) on simulated robot control tasks, including a complex manipulation
task with a 7DOF arm and in-hand manipulation with a 24DOF anthropomorphic hand (Rajeswaran* et al.,
2018) as shown in Fig. 1. For each task, we provide the agent with a biased version of simulation that
is used as the dynamics model for MPC. We use Model Predictive Path Integral Control (MPPI) (Williams
et al., 2017), a state-of-the-art sampling-based algorithm as our MPC algorithm throughout.
1. CARTPOLESWINGUP: A classic control task where the agent slides a cart along a rail to swing up
the pole attached via an unactuated hinge joint. Model bias is simulated by providing the agent incorrect
masses for the cart and pole. The masses are set lower than the true values to make the problem harder
for MPC, as the algorithm will then always apply smaller controls than desired, as also noted in Ramos et al.
(2019). The initial positions of the cart and pole are randomized at every episode.
2. SAWYERPEGINSERTION: The agent controls a 7DOF Sawyer arm to insert a peg attached to the
end-effector into a hole at different locations on a table in front of the robot. We test the effects of inaccurate
perception by simulating a sensor at the target location that provides noisy position measurements at every
timestep. MPC uses a deterministic model that does not take sensor noise into account as commonly done
in controls. This biases the cost of simulated trajectories, causing MPC to fail to reach the target.
3. INHANDMANIPULATION: A challenging in-hand manipulation task with a 24DOF dexterous hand
from Rajeswaran* et al. (2018). The agent must align the pen with a target orientation within a certain
tolerance for success. The initial orientation of the pen is randomized at every episode. Here, we simulate
bias by providing larger estimates of the mass and inertia of the pen as well as the friction coefficients, which
causes the MPC algorithm to optimize overly aggressive policies and drop the pen.
Please refer to the Appendix A.2 for more details of the tasks and biased simulations.
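Since MPPI is the underlying optimizer throughout, the following is a minimal sketch of a single MPPI update with the blended objective (9) used to score rollouts. It is illustrative only: `model_step`, `cost_fn`, and `q_hat` are assumed placeholder callables, and the noise and temperature settings are not the tuned values from the experiments.

```python
import numpy as np

def mppi_update(nominal, state, model_step, cost_fn, q_hat, gamma, lam,
                num_samples=64, noise_std=0.2, temperature=1.0, rng=None):
    """One MPPI iteration over an H x d_action nominal control sequence."""
    rng = rng or np.random.default_rng()
    H, d = nominal.shape
    noise = rng.normal(scale=noise_std, size=(num_samples, H, d))
    controls = nominal[None] + noise                 # K perturbed control sequences
    scores = np.zeros(num_samples)
    for k in range(num_samples):
        s = state
        for i in range(H):
            a = controls[k, i]
            s_next = model_step(s, a)                # approximate (possibly biased) model
            if i == 0:
                scores[k] += q_hat(s, a)             # Qhat(s_0, a_0) term of Eq. (9)
            a_next = controls[k, i + 1] if i + 1 < H else a   # placeholder final action
            adv = cost_fn(s, a) + gamma * q_hat(s_next, a_next) - q_hat(s, a)
            scores[k] += (gamma * lam) ** i * adv    # blended disadvantage term
            s = s_next
    # Exponentiated-cost weighting: lower blended cost -> larger weight.
    weights = np.exp(-(scores - scores.min()) / temperature)
    weights /= weights.sum()
    return np.einsum("k,khd->hd", weights, controls)  # updated nominal sequence
```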
Baselines: We compare MPQ(λ) against both model-based and model-free baselines: (1) MPPI with
true dynamics and no value function, (2) MPPI with biased dynamics and no value function, and (3) Proximal
Policy Optimization (PPO), a state-of-the-art model-free RL method.
Learning Details: We represent the Q-function with a feed-forward neural network. We bias simulation
parameters like mass or friction coefficients using the formula $m = (1+b)\,m_{\text{true}}$, where $b$ is a bias factor.
We also employ a practical modification to Alg. 1 in order to speed up training, as discussed in
Section 4. Instead of maintaining a large global replay buffer and re-calculating targets for every experience
tuple in a mini-batch, as done by approaches such as Bhardwaj et al. (2020); Lowrey et al. (2018), we
simply query MPC for the value targets online and store them in a smaller replay buffer. This has a
similar stabilizing effect as using a frozen target network in DQN and allows us to perform updates at
every timestep. For PPO, we use the publicly available implementation https://rb.gy/61iarq. Refer to
the Appendix A.2 for more details.
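For concreteness, a small sketch of how such a bias factor might be applied to the parameters handed to the MPC model (illustrative; the parameter names are hypothetical, not the exact MuJoCo fields used here):

```python
def bias_params(true_params: dict, b: float) -> dict:
    """Scale selected physical parameters by (1 + b), i.e. m = (1 + b) * m_true."""
    return {name: (1.0 + b) * value for name, value in true_params.items()}

# e.g. b = -0.5 halves the cart and pole masses given to the MPC model
biased = bias_params({"cart_mass": 1.0, "pole_mass": 0.1}, b=-0.5)
```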
5.1 ANALYSIS OF OVERALL PERFORMANCE
O 1. MPQ(λ) is able to overcome model-bias in MPC for a wide range of λ values.
Fig. 2(a) shows a comparison of MPQ(λ) with MPPI using true and biased dynamics with b = −0.5 and
H = 64, for various settings of λ. There exists a wide range of λ values for which MPQ(λ) can efficiently
trade off the model bias against the bias in the learned Q-function and outperform MPPI with biased
dynamics. However, setting λ to a high value such as 1.0 or 0.95, which weighs longer horizons heavily,
leads to poor performance, as the compounding effects of model bias are not compensated for by $\hat{Q}_\theta$.
Performance also begins to drop as λ decreases below 0.6. We note that MPQ(λ) also outperforms MPPI
with access to the true dynamics function. This is not surprising as the learned Q-function adds global
information to the optimization problem and can correct for errors in optimizing for long horizons due
to large variance when using limited trajectory samples in MPPI.
O 2. Faster convergence can be achieved by decaying λ over time.
As more data is collected on the system, we expect the bias in Qθ to decrease, whereas model bias remains
constant. We decay λ in [1.0,λF ] using a fixed schedule to reduce the dependence on MPC over time as
the model-free Q-estimates improve. This leads to faster convergence than using a fixed λ as demonstrated
in Fig. 2(b). This can be attributed to the fact that during the initial steps of training, the bias in a randomly
initialized Qθ is large, whereas longer horizon lookahead with the approximate model can still lead us
to good parts of the state space. Thus, initially a larger value of λ that favors longer horizons is better as the
effect of Qθ is diminished due to discounting. As more data is collected, the accuracy of Qθ increases and
bias in the model begins to hurt performance. A smaller value of λ is more favorable now as it places more
weight on shorter-horizon estimates. In Figures 2(c) and 2(d) we present an ablation study showing
that MPQ(λ) is robust to a wide range of decay rates with H = 64 and H = 32 respectively. We see that when
provided with true dynamics, MPPI with H = 32 performs better than H = 64, which can be attributed
to optimization issues with long horizons. MPQ(λ) reaches performance comparable with MPPI H = 32
in both cases, showing robustness to the choice of horizon, which is important since in practice we wish to set
the horizon as large as the computation budget permits. However, decaying λ too fast or too slow can have
adverse effects on performance. An interesting question for future work is whether λ can be set adaptively
in a state-dependent manner. Refer to Appendix A.2 for details on decay schedule.
O 3. MPQ(λ) is robust to a large degree of model misspecification.
Fig. 3(a) shows the effect of different values of the bias factor b, used to vary the mass of the cart and pole,
on MPQ(λ) with a fixed λ decay schedule over [1.0, 0.75]. MPQ(λ) achieves better performance than MPPI
(H = 64), both with biased and true dynamics, for a wide range of bias factors b, and convergence
is generally faster for smaller bias. For large values of b, MPQ(λ) either fails to improve or diverges as
the compounding effects of model bias hurt learning. A similar trend holds in Fig. 3(b): MPQ(λ) achieves
performance comparable to MPPI with true dynamics for different degrees of bias in the mass, inertia and
friction properties of the pen, but is unable to do so for b = 1.0. Both plots show validation performance
averaged over 3 seeds. We conclude that while MPQ(λ) is robust to a large amount of model bias,
if the model is extremely uninformative, relying on MPC can degrade performance.
O 4. MPQ(λ) is much more sample efficient than model-free RL on high-dimensional continuous
control tasks, even with approximate models.
Figures 3(b) and 3(c) show a comparison of MPQ(λ) with PPO (Schulman et al., 2017), a model-free policy
gradient algorithm, on the INHANDMANIPULATION and SAWYERPEGINSERTION tasks respectively. In
both cases, we observe that MPQ(λ), through its use of approximate models, learned value functions,
and a dynamically-varying λ parameter to trade off different sources of error, rapidly improves its
[Figure 2 plots: average reward versus validation iteration on CARTPOLESWINGUP for (a) fixed values of λ with H = 64, (b) decaying λ schedules with H = 64, and (c), (d) ablations over λ decay rates (final value λF) with H = 64 and H = 32; MPPI with true and biased dynamics shown as dashed baselines.]
Figure 2: CARTPOLESWINGUP experiments. All curves show average rewards over 30 validation episodes, averaged
over 3 runs with different seeds. Training is performed for 100k steps with validation after every 4k steps. When
decaying λ as per a schedule, it is fixed to the current value during validation. In (b),(c),(d), λF denotes the value of λ
at the end of training. The dashed lines show average reward of MPPI and shaded regions denote standard deviation.
[Figure 3 plots: validation rewards for (a) CARTPOLESWINGUP with varying bias factor b, (b) INHANDMANIPULATION with varying bias factor and a PPO baseline, and (c) SAWYERPEGINSERTION with sensor noise, comparing MPQ(λ) against MPPI and PPO baselines.]
Figure 3: Robustness and sample efficiency of MPQ(λ). (a) Varying bias over the mass of the cart and pole, (b) varying
bias factor over the mass, inertia and friction of the pen, (c) peg insertion task with noisy perception. The same bias factor b is
used for all altered properties in each task. Results depict average reward over 30 validation episodes. λ is decayed
over [1.0, 0.75] in (a) and over [1.0, 0.85] in (b) and (c).
performance and achieves average reward comparable to MPPI with access to ground truth dynamics.
In INHANDMANIPULATION, PPO performance does not improve at all over 54k training steps. In
SAWYERPEGINSERTION, the small magnitude of reward difference between MPPI with true and biased
models is due to the fact that even with a biased model, MPC is able to get the peg close to the table, but
due to sensor noise, is unable to control it precisely enough to consistently insert it in the hole. Here, the value
function learned by MPQ(λ) can adapt to the sensor noise and allow for fine-grained control near the table.
PPO only marginally improves over validation iterations and does not get close to the table.
6 CONCLUSION
In this paper, we presented a method to mitigate model bias in MPC by blending in model-free value
estimates using a parameter λ that trades off different sources of bias, along with a practical algorithm that
achieves performance close to MPC with access to the true dynamics while being more sample efficient than
model-free RL methods. However, a few key questions remain. Although faster convergence can be
achieved by decaying λ over time, an interesting avenue for future research is to vary λ in a state-adaptive
fashion. In particular, reasoning about model and value function uncertainty may allow us to vary
λ to rely more or less on the model in different parts of the state space. Another interesting avenue for
research is to investigate variance reduction by using λ similarly to TD(λ), allowing us to optimize over longer
horizons in sampling-based MPC with fewer samples.
REFERENCES
Pieter Abbeel and Andrew Y Ng. Exploration and apprenticeship learning in reinforcement learning. In
Proceedings of the 22nd international conference on Machine learning, pp. 1–8, 2005.
Pieter Abbeel, Adam Coates, and Andrew Y Ng. Autonomous helicopter aerobatics through apprenticeship
learning. The International Journal of Robotics Research, 29(13):1608–1639, 2010.
Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree
search. In Advances in Neural Information Processing Systems, pp. 5360–5370, 2017.
Mohak Bhardwaj, Ankur Handa, Dieter Fox, and Byron Boots. Information theoretic model predictive
q-learning. In Learning for Dynamics and Control, pp. 840–850, 2020.
Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in
a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing
Systems, pp. 4754–4765, 2018.
Tom Erez, Kendall Lowrey, Yuval Tassa, Vikash Kumar, Svetoslav Kolev, and Emanuel Todorov. An
integrated system for real-time model predictive control of humanoid robots. In 2013 13th IEEE-RAS
International Conference on Humanoid Robots (Humanoids), pp. 292–299. IEEE, 2013.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum
entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
Sham Kakade, Michael J Kearns, and John Langford. Exploration in metric state spaces. In Proceedings
of the 20th International Conference on Machine Learning (ICML-03), pp. 306–312, 2003.
Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine
learning, 49(2-3):209–232, 2002.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
Gilwoo Lee, Brian Hou, Sanjiban Choudhury, and Siddhartha S Srinivasa. Bayesian residual policy
optimization: Scalable bayesian reinforcement learning with clairvoyant experts. arXiv preprint
arXiv:2002.03042, 2020.
Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan
online, learn offline: Efficient learning and exploration via model-based control. arXiv preprint
arXiv:1811.01848, 2018.
Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics
for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International
Conference on Robotics and Automation (ICRA), pp. 7559–7566. IEEE, 2018.
Aravind Rajeswaran*, Vikash Kumar*, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel
Todorov, and Sergey Levine. Learning Complex Dexterous Manipulation with Deep Reinforcement
Learning and Demonstrations. In Proceedings of Robotics: Science and Systems (RSS), 2018.
Fabio Ramos, Rafael Carvalhaes Possas, and Dieter Fox. Bayessim: adaptive domain randomization via
probabilistic inference for robotics simulators. arXiv preprint arXiv:1906.01728, 2019.
Stephane Ross and J Andrew Bagnell. Agnostic system identification for model-based reinforcement
learning. arXiv preprint arXiv:1203.1007, 2012.
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional
continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Pranav Shyam, Wojciech Jaśkowski, and Faustino Gomez. Model-based active exploration. In International
Conference on Machine Learning, pp. 5779–5788, 2019.
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the
game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc
Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play
with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
Colin Summers, Kendall Lowrey, Aravind Rajeswaran, Siddhartha Srinivasa, and Emanuel Todorov.
Lyceum: An efficient and scalable ecosystem for robot learning. arXiv preprint arXiv:2001.07343, 2020.
Wen Sun, J Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Combining reinforcement
learning & imitation learning. arXiv preprint arXiv:1805.11240, 2018.
Nolan Wagener, Ching-An Cheng, Jacob Sacks, and Byron Boots. An online learning approach to model
predictive control. arXiv preprint arXiv:1902.08967, 2019.
Grady Williams, Paul Drews, Brian Goldfain, James M Rehg, and Evangelos A Theodorou. Aggressive
driving with model predictive path integral control. In 2016 IEEE International Conference on Robotics
and Automation (ICRA), pp. 1433–1440. IEEE, 2016.
Grady Williams, Nolan Wagener, Brian Goldfain, Paul Drews, James M Rehg, Byron Boots, and
Evangelos A Theodorou. Information theoretic mpc for model-based reinforcement learning. In 2017
IEEE International Conference on Robotics and Automation (ICRA), pp. 1714–1721. IEEE, 2017.
Mingyuan Zhong, Mikala Johnson, Yuval Tassa, Tom Erez, and Emanuel Todorov. Value function
approximation and model predictive control. In 2013 IEEE symposium on adaptive dynamic
programming and reinforcement learning (ADPRL), pp. 100–107. IEEE, 2013.
A APPENDIX
A.1 PROOFS
We present upper bounds on the performance of a greedy policy when using approximate value functions
and models. We also analyze the case of finite horizon planning with an approximate dynamics model
and terminal value function, which can be seen as a generalization of (Sun et al., 2018). For simplicity,
we switch to using $\hat{V}(s)$ for the learnt model-free value function (instead of $\hat{Q}(s)$).
Let $\hat{V}(s)$ be an $\epsilon$-approximation with $\left\|\hat{V}(s) - V^{\pi^*}_M(s)\right\|_\infty \le \epsilon$. Let MDP $\hat{M}$ be an $\alpha$-approximation of $M$
such that $\forall (s,a)$ we have $\left\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\right\|_1 \le \alpha$ and $|\hat{c}(s,a) - c(s,a)| \le \alpha$.
$$\hat{c}(s,\hat{\pi}(s)) + \gamma\sum_{s'}\hat{P}(s'|s,\hat{\pi}(s))\hat{V}(s') \le \hat{c}(s,\pi^*(s)) + \gamma\sum_{s'}\hat{P}(s'|s,\pi^*(s))\hat{V}(s')$$
$$\hat{c}(s,\hat{\pi}(s)) - \hat{c}(s,\pi^*(s)) \le \gamma\left(\sum_{s'}\hat{P}(s'|s,\pi^*(s))\hat{V}(s') - \sum_{s'}\hat{P}(s'|s,\hat{\pi}(s))\hat{V}(s')\right)$$
(using $\left\|\hat{V}(s) - V^{\pi^*}_M(s)\right\|_\infty \le \epsilon$)
$$\hat{c}(s,\hat{\pi}(s)) - \hat{c}(s,\pi^*(s)) \le \gamma\left(\sum_{s'}\hat{P}(s'|s,\pi^*(s))V^{\pi^*}_M(s') - \sum_{s'}\hat{P}(s'|s,\hat{\pi}(s))V^{\pi^*}_M(s')\right) + 2\gamma\epsilon$$
(using $|\hat{c}(s,a) - c(s,a)| \le \alpha$)
$$c(s,\hat{\pi}(s)) - c(s,\pi^*(s)) \le 2\gamma\epsilon + 2\alpha + \gamma\left(\sum_{s'}\hat{P}(s'|s,\pi^*(s))V^{\pi^*}_M(s') - \sum_{s'}\hat{P}(s'|s,\hat{\pi}(s))V^{\pi^*}_M(s')\right) \tag{16}$$
Now, let $s$ be the state with the maximum loss $V^{\hat{\pi}}_M(s) - V^{\pi^*}_M(s)$,
$$V^{\hat{\pi}}_M(s) - V^{\pi^*}_M(s) = c(s,\hat{\pi}) - c(s,\pi^*) + \gamma\sum_{s'}\left(P(s'|s,\hat{\pi})V^{\hat{\pi}}_M(s') - P(s'|s,\pi^*)V^{\pi^*}_M(s')\right)$$
Add and subtract $\gamma\sum_{s'}P(s'|s,\hat{\pi})V^{\pi^*}_M(s')$ and re-arrange:
$$\begin{aligned}
V^{\hat{\pi}}_M(s) - V^{\pi^*}_M(s) &\le 2\gamma\epsilon + 2\alpha + \gamma\sum_{s'}\left(\hat{P}(s'|s,\pi^*) - P(s'|s,\pi^*)\right)V^{\pi^*}_M(s') \\
&\quad - \gamma\sum_{s'}\left(\hat{P}(s'|s,\hat{\pi}) - P(s'|s,\hat{\pi})\right)V^{\pi^*}_M(s') \\
&\quad + \gamma\sum_{s'}P(s'|s,\hat{\pi})\left(V^{\hat{\pi}}_M(s') - V^{\pi^*}_M(s')\right) \\
&\le 2\gamma\epsilon + 2\alpha + 2\gamma\alpha\frac{V_{\max} - V_{\min}}{2} + \gamma\sum_{s'}P(s'|s,\hat{\pi})\left(V^{\hat{\pi}}_M(s') - V^{\pi^*}_M(s')\right)
\end{aligned}$$
$$\left\|V^{\pi^*}_M(s) - V^{\hat{\pi}}_M(s)\right\|_\infty \le \frac{2\left(\gamma\epsilon + \alpha + \gamma\alpha\frac{V_{\max} - V_{\min}}{2}\right)}{1-\gamma} \tag{17}$$
which concludes the proof.
Proof. We temporarily introduce a new MDP $M'$ that has the same cost function as $M$, but the transition
function of $\hat{M}$:
$$\left\langle V^{\pi^*}_M, \pi, M\right\rangle_H - \left\langle V^{\pi^*}_M, \pi, \hat{M}\right\rangle_H = \left(\left\langle V^{\pi^*}_M, \pi, M\right\rangle_H - \left\langle V^{\pi^*}_M, \pi, M'\right\rangle_H\right) + \left(\left\langle V^{\pi^*}_M, \pi, M'\right\rangle_H - \left\langle V^{\pi^*}_M, \pi, \hat{M}\right\rangle_H\right) \tag{19}$$
Let $\Delta P(s_0 \ldots s_H) = P(s_0 \ldots s_H) - \hat{P}(s_0 \ldots s_H)$ represent the difference in the distribution of states encountered
by executing $\pi$ on $M$ and $\hat{M}$ respectively, starting from state $s_0$.
Expanding the RHS of (19),
$$= \sum_{s_0,\ldots,s_H}\Delta P(s_0 \ldots s_H)\left(\sum_{i=0}^{H-1}\gamma^i c(s_i,a_i) + \gamma^H V^{\pi^*}_M(s_H)\right) + \mathbb{E}_{\mu^{\pi}_{\hat{M}}}\left[\sum_{i=0}^{H-1}\gamma^i\left(c(s_i,a_i) - \hat{c}(s_i,a_i)\right)\right] \tag{20}$$
$$\le \sum_{s_2,\ldots,s_{H+1}}\left|\Delta P(s_2 \ldots s_{H+1})\right|\,\left\|\sum_{i=1}^{H-1}\gamma^i c(s_i,a_i) + \gamma^H V^{\pi^*}_M(s_H) - K\right\|_{\infty} + \frac{1-\gamma^H}{1-\gamma}\alpha \tag{22}$$
By choosing $K = \sum_{i=1}^{H-1}\gamma^i\frac{c_{\max}+c_{\min}}{2} + \gamma^H\frac{V_{\max}+V_{\min}}{2}$ we can ensure that the term inside
The above lemma builds on similar results in (Kakade et al., 2003; Abbeel & Ng, 2005; Ross & Bagnell,
2012).
We are now ready to prove our main theorem, i.e. the performance bound of an MPC policy that uses
an approximate model and approximate value function.
Proof of Theorem 3.1
Also, we have
$$\begin{aligned}
\left\langle V^{\pi^*}_M, \hat{\pi}, M\right\rangle_H - \left\langle V^{\pi^*}_M, \pi^*, M\right\rangle_H &= \left(\left\langle V^{\pi^*}_M, \hat{\pi}, M\right\rangle_H - \left\langle V^{\pi^*}_M, \hat{\pi}, \hat{M}\right\rangle_H\right) \\
&\quad - \left(\left\langle V^{\pi^*}_M, \pi^*, M\right\rangle_H - \left\langle V^{\pi^*}_M, \pi^*, \hat{M}\right\rangle_H\right) \\
&\quad + \left(\left\langle V^{\pi^*}_M, \hat{\pi}, \hat{M}\right\rangle_H - \left\langle V^{\pi^*}_M, \pi^*, \hat{M}\right\rangle_H\right)
\end{aligned} \tag{25}$$
The first two terms can be bounded using Lemma A.1 and the third term using Eq. (24) to get
$$\left\langle V^{\pi^*}_M, \hat{\pi}, M\right\rangle_H - \left\langle V^{\pi^*}_M, \pi^*, M\right\rangle_H \le 2\left(\gamma\frac{1-\gamma^{H-1}}{1-\gamma}\alpha H\frac{c_{\max}-c_{\min}}{2} + \gamma^H\alpha H\frac{V_{\max}-V_{\min}}{2} + \frac{1-\gamma^H}{1-\gamma}\alpha + \gamma^H\epsilon\right) \tag{26}$$
Now, let $s$ be the state with the maximum loss $V^{\hat{\pi}}_M(s) - V^{\pi^*}_M(s)$:
$$\begin{aligned}
V^{\hat{\pi}}_M(s) - V^{\pi^*}_M(s) &= \left\langle V^{\hat{\pi}}_M, \hat{\pi}, M\right\rangle_H - \left\langle V^{\pi^*}_M, \pi^*, M\right\rangle_H \\
&= \left(\left\langle V^{\hat{\pi}}_M, \hat{\pi}, M\right\rangle_H - \left\langle V^{\pi^*}_M, \hat{\pi}, M\right\rangle_H\right) + \left(\left\langle V^{\pi^*}_M, \hat{\pi}, M\right\rangle_H - \left\langle V^{\pi^*}_M, \pi^*, M\right\rangle_H\right) \\
&= \gamma^H\left(V^{\hat{\pi}}_M(s_{H+1}) - V^{\pi^*}_M(s_{H+1})\right) + \left(\left\langle V^{\pi^*}_M, \hat{\pi}, M\right\rangle_H - \left\langle V^{\pi^*}_M, \pi^*, M\right\rangle_H\right) \\
&\le \gamma^H\left(V^{\hat{\pi}}_M(s) - V^{\pi^*}_M(s)\right) \\
&\quad + 2\left(\frac{\gamma(1-\gamma^{H-1})}{1-\gamma}\alpha H\frac{c_{\max}-c_{\min}}{2} + \gamma^H\alpha H\frac{V_{\max}-V_{\min}}{2} + \frac{1-\gamma^H}{1-\gamma}\alpha + \gamma^H\epsilon\right)
\end{aligned} \tag{27}$$
where the last inequality comes from applying Eq. (26) and the fact that $s$ is the state with maximum loss. The
final expression follows from simple algebraic manipulation.
Proof. We take the derivative of (4) with respect to $H$ and set it to zero. For convenience, we define the following interim
terms
$$C_1 = \frac{\alpha}{1-\gamma}\frac{c_{\max}-c_{\min}}{2}, \qquad C_2 = \alpha\frac{V_{\max}-V_{\min}}{2} \tag{28}$$
Let $\gamma = 1-\delta$ where $\delta \ll 1$. Hence $\gamma^H = (1-\delta)^H \approx 1 - H\delta$. Plugging into (29),
$$C_1\left((1-2\delta H) + (1-\delta H)\left((\gamma-1)H\log\gamma - \gamma - 1\right) + \gamma\right) + C_2\left((1-\delta H)\left(-\epsilon(1-\delta H) + H\log\gamma + 1\right)\right) + \epsilon\log\gamma\,(1-H\delta) = 0 \tag{30}$$
$$C_1\left(H^2(1-\gamma)\delta\log\gamma - H(1-\gamma)(\delta + \log\gamma)\right) + C_2\left(-H^2(\delta^2 + \delta\log\gamma) + H(\delta + \log\gamma)\right) + \epsilon\log\gamma\,(1-H\delta) = 0 \tag{31}$$
$$C_1\left(-H^2\delta^3 + \tfrac{1}{2}H\delta^3\right) + C_2\left(\tfrac{1}{2}H^2\delta^3 - \tfrac{1}{2}H\delta^2\right) + \epsilon H\delta^2 - \epsilon\delta = 0$$
$$C_1\left(-H^2\delta^2 + \tfrac{1}{2}H\delta^2\right) + C_2\left(\tfrac{1}{2}H^2\delta^2 - \tfrac{1}{2}H\delta\right) + \epsilon H\delta - \epsilon = 0 \tag{32}$$
$$H^2\delta^2\left(\tfrac{C_2}{2} - C_1\right) + H\delta\left(\epsilon + \tfrac{C_2}{2}\right) - \epsilon = 0$$
For simplicity, let us assume $C_1 \approx C_2/2$. Solving for the optimal $H$ we get
$$H = \frac{\epsilon}{\delta\left(\epsilon + C_2/2\right)} \tag{33}$$
$$Q^\phi_H(s,a) = \mathbb{E}_{\mu^{\pi_\phi}_{\hat{M}}}\left[\hat{Q}(s_0,a_0) + \sum_{i=0}^{H-1}\gamma^i A(s_i,a_i)\,\Big|\,s_0 = s_t\right] \tag{34}$$
$$Q^\lambda_H(s,a) = \mathbb{E}_{\mu^{\pi_\phi}_{\hat{M}}}\left[\hat{Q}(s_0,a_0) + \sum_{i=0}^{H-1}(\gamma\lambda)^i A(s_i,a_i)\right] \tag{35}$$
The blended variant simply sets a steeper discount $\gamma\lambda$. This steeper discount results in an effective
horizon of $H \approx \frac{1}{1-\gamma\lambda}$; the effect of terms beyond this $H$ is negligible.
Hence the effect of achieving any critical $H^*$ can be obtained by simply setting $\lambda^*$ such that
$$H^* = \frac{1}{1-\gamma\lambda^*} \qquad\Longrightarrow\qquad \lambda^* = \frac{1}{\gamma}\left(1 - \frac{1}{H^*}\right) \tag{36}$$
SAWYERPEGINSERTION We simulate sensor noise by placing a simulated position sensor at the target
location in the MuJoCo physics engine that adds Gaussian noise with σ =4cm to the observed 3D position
vector. MPPI uses a deterministic model that does not take sensor noise into account for planning.
INHANDMANIPULATION
• Reward function: $-\|x_{obj} - x_{des}\|_2 + z_{obj}^{T} z_{des}$ + bonus for proximity to the desired position and orientation,
where $z_{obj}^{T} z_{des}$ is the dot product between the object axis and the target axis, measuring orientation
similarity.
• Observation: $[q_{pos},\, x_{obj},\, v_{obj},\, z_{obj},\, z_{des},\, x_{obj}-x_{des},\, z_{obj}-z_{des}]$ (45 dim)
The λ decay schedule used is
$$\lambda_t = \frac{\lambda_0}{1 + \kappa\sqrt{t}} \tag{37}$$
where the decay rate κ is calculated based on the desired final value of λ.
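A small sketch of this schedule, under the assumption that κ is solved so the schedule reaches a desired final value λF at the last training step (one plausible reading of "calculated based on the desired final value", not necessarily the exact procedure used):

```python
import numpy as np

def make_lambda_schedule(lam0: float, lam_final: float, total_steps: int):
    """Return lambda_t = lam0 / (1 + kappa * sqrt(t)) with kappa chosen so that
    lambda reaches lam_final at t = total_steps (Eq. 37)."""
    kappa = (lam0 / lam_final - 1.0) / np.sqrt(total_steps)
    return lambda t: lam0 / (1.0 + kappa * np.sqrt(t))

sched = make_lambda_schedule(lam0=1.0, lam_final=0.75, total_steps=100_000)
print(sched(0), sched(100_000))   # 1.0 -> 0.75
```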
PPO: Both the policy and value functions are represented by feed-forward networks with 2 layers each,
with 64 and 128 units for the policy and value networks respectively. All other parameters are left at their
default values. The number of trajectories collected per iteration is modified to correspond to the same
number of samples collected between validation iterations for MPQ(λ).
MPPI parameters: Tables A.2.2, 2, and 3 show the MPPI parameters used for the different experiments. In
addition to the standard MPPI parameters, in certain cases we also use a step-size parameter as introduced
by Wagener et al. (2019). For INHANDMANIPULATION and SAWYERPEGINSERTION we also apply
autoregressive filtering to the sampled MPPI trajectories to induce smoothness in the sampled actions,
with tuned filter coefficients. This has been found to be useful in prior work (Summers et al., 2020;
Lowrey et al., 2018) and helps MPQ(λ) work on high-dimensional control tasks.