Blending MPC & Value Function Approximation for Efficient Reinforcement Learning
Anonymous authors
Paper under double-blind review
ABSTRACT
Model-Predictive Control (MPC) is a powerful tool for controlling complex, real-world
systems that uses a model to make predictions about future behavior. For each state
encountered, MPC solves an online optimization problem to choose a control action
that will minimize future cost. This is a surprisingly effective strategy, but real-time
performance requirements warrant the use of simple models. If the model is not
sufficiently accurate, then the resulting controller can be biased, limiting performance.
We present a framework for improving on MPC with model-free reinforcement learning
(RL). The key insight is to view MPC as constructing a series of local Q-function
approximations. We show that by using a parameter λ, similar to the trace decay
parameter in TD(λ), we can systematically trade-off learned value estimates against the
local Q-function approximations. We present a theoretical analysis that shows how error
from inaccurate models in MPC and value function estimation in RL can be balanced.
We further propose an algorithm that changes λ over time to reduce the dependence on
MPC as our estimates of the value function improve, and test the efficacy of our approach
on challenging high-dimensional manipulation tasks with biased models in simulation.
We demonstrate that our approach can obtain performance comparable to that of MPC with
access to the true dynamics, even under severe model bias, while being more sample
efficient than model-free RL.
1 INTRODUCTION
Model-free Reinforcement Learning (RL) is increasingly used in challenging sequential decision-making
problems including high-dimensional robotics control tasks (Haarnoja et al., 2018; Schulman et al., 2017)
as well as video and board games (Silver et al., 2016; 2017). While these approaches are extremely general,
and can theoretically solve complex problems with little prior knowledge, they also typically require a
large quantity of training data to succeed. In robotics and engineering domains, data may be collected
from real-world interaction, a process that can be dangerous, time consuming, and expensive.
Model-Predictive Control (MPC) offers a simpler, more practical alternative. While RL typically uses
data to learn a global model offline, which is then deployed at test time, MPC solves for a policy online
by optimizing an approximate model for a finite horizon at a given state. This policy is then executed
for a single timestep and the process repeats. MPC is one of the most popular approaches for control
of complex, safety-critical systems such as autonomous helicopters (Abbeel et al., 2010), aggressive
off-road vehicles (Williams et al., 2016) and humanoid robots (Erez et al., 2013), owing to its ability to
use approximate models to optimize complex cost functions with nonlinear constraints.
However, approximations in the model used by MPC can significantly limit performance. Specifically,
model bias may result in persistent errors that eventually compound and become catastrophic. For example,
in non-prehensile manipulation, practitioners often use a simple quasi-static model that assumes an object
does not roll or slide away when pushed. For more dynamic objects, this can lead to aggressive pushing
policies that perpetually over-correct, eventually driving the object off the surface.
Recently, there have been several attempts to combine MPC with model-free RL, showing that the
combination can improve over either approach alone. Many of these methods use
RL to learn a terminal cost function, thereby increasing the effective horizon of MPC (Zhong et al., 2013;
Lowrey et al., 2018; Bhardwaj et al., 2020). However, the learned value function is only applied at the end
of the MPC horizon, so model errors still persist within the horizon, leading to sub-optimal policies. Similar
approaches have also been applied to great effect in discrete games with known models (Silver et al., 2016;
2017; Anthony et al., 2017), where value functions and policies learned via model-free RL are used to
guide Monte-Carlo Tree Search. In this paper, we focus on a somewhat broader question: can machine
learning be used both to increase the effective horizon of MPC and to correct for model bias?
One straightforward approach is to try to learn (or correct) the MPC model from real data encountered
during execution; however, there are practical barriers to this strategy. Hand-constructed models are often
crude approximations of reality and lack the expressivity to represent the dynamics actually encountered.
Moreover, increasing the complexity of such models leads to computationally expensive updates that can
harm MPC's online performance. Model-based RL approaches such as Chua et al. (2018); Nagabandi et al. (2018); Shyam
et al. (2019) aim to learn general neural network models directly from data. However, learning globally
consistent models is an exceptionally hard task due to issues such as covariate shift (Ross & Bagnell, 2012).
We propose a framework, MPQ(λ), for weaving together MPC with learned value estimates to trade-off
errors in the MPC model and approximation error in a learned value function. Our key insight is to
view MPC as tracing out a series of local Q-function approximations. We can then blend each of these
Q-functions with value estimates from reinforcement learning. We show that by using a blending parameter
λ, similar to the trace decay parameter in TD(λ), we can systematically trade-off errors between these
two sources. Moreover, by smoothly decaying λ over learning episodes we can achieve the best of both
worlds: a policy can depend on a prior model before it has encountered any data and then gradually
become more reliant on learned value estimates as it gains experience.
To summarize, our key contributions are:
1. A framework that unifies MPC and Model-free RL through value function approximation.
2. Theoretical analysis of finite horizon planning with approximate models and value functions.
3. Empirical evaluation on challenging manipulation problems with varying degrees of model-bias.
2 PRELIMINARIES
2.1 REINFORCEMENT LEARNING
We consider an agent acting in an infinite-horizon discounted Markov Decision Process (MDP). An MDP
is defined by a tuple $M = (\mathcal{S}, \mathcal{A}, c, P, \gamma, \mu)$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $c(s,a)$ is
the per-step cost function, $s_{t+1} \sim P(\cdot|s_t,a_t)$ is the stochastic transition dynamics, $\gamma$ is the discount
factor, and $\mu(s_0)$ is a distribution over initial states. A closed-loop policy $\pi(\cdot|s)$ outputs a distribution over
actions given a state. Let $\mu^\pi_M$ be the distribution over state-action trajectories obtained by running policy
$\pi$ on $M$. The value function for a given policy $\pi$ is defined as $V^\pi_M(s) = \mathbb{E}_{\mu^\pi_M}\left[\sum_{t=0}^{\infty}\gamma^t c(s_t,a_t) \,\middle|\, s_0 = s\right]$
and the action-value function as $Q^\pi_M(s,a) = \mathbb{E}_{\mu^\pi_M}\left[\sum_{t=0}^{\infty}\gamma^t c(s_t,a_t) \,\middle|\, s_0 = s, a_0 = a\right]$. The objective
is to find an optimal policy $\pi^* = \arg\min_\pi \mathbb{E}_{s_0 \sim \mu}\left[V^\pi_M(s_0)\right]$. We can also define the (dis-)advantage
function $A^\pi_M(s,a) = Q^\pi_M(s,a) - V^\pi_M(s)$, which measures how good an action is compared to the action
taken by the policy in expectation. It can be equivalently expressed in terms of the Bellman error as
$A^\pi_M(s,a) = c(s,a) + \gamma\,\mathbb{E}_{s'\sim P,\,a'\sim\pi}\left[Q^\pi_M(s',a')\right] - \mathbb{E}_{a\sim\pi}\left[Q^\pi_M(s,a)\right]$.
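To make the equivalence explicit, the Bellman-error form of the disadvantage follows directly from the two identities above (a short worked derivation added here for completeness):
$$\begin{aligned}
V^\pi_M(s) &= \mathbb{E}_{a\sim\pi}\left[Q^\pi_M(s,a)\right], \qquad Q^\pi_M(s,a) = c(s,a) + \gamma\,\mathbb{E}_{s'\sim P,\,a'\sim\pi}\left[Q^\pi_M(s',a')\right] \\
\Rightarrow\; A^\pi_M(s,a) &= Q^\pi_M(s,a) - V^\pi_M(s) = c(s,a) + \gamma\,\mathbb{E}_{s'\sim P,\,a'\sim\pi}\left[Q^\pi_M(s',a')\right] - \mathbb{E}_{a\sim\pi}\left[Q^\pi_M(s,a)\right].
\end{aligned}$$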
MPC then iteratively optimizes this estimate (at the current system state $s_t$) to update the policy parameters
$$\phi^*_t = \arg\min_{\phi}\, Q^\phi_H(s_t, \pi_\phi(s_t)) \tag{2}$$
Alternatively, we can also view the above procedure from the perspective of disadvantage minimization.
Let us define an estimator for the 1-step disadvantage with respect to the potential function $\hat{Q}$ as
$A(s_i,a_i) = c(s_i,a_i) + \gamma\hat{Q}(s_{i+1},a_{i+1}) - \hat{Q}(s_i,a_i)$. We can then equivalently write the above optimization
as minimizing the discounted sum of disadvantages over time via the telescoping sum trick
$$\arg\min_{\pi\in\Pi}\,\mathbb{E}_{\mu^{\pi_\phi}_{\hat{M}}}\left[\hat{Q}(s_0,a_0) + \sum_{i=0}^{H-1}\gamma^i A(s_i,a_i)\,\Big|\, s_0 = s_t\right] \tag{3}$$
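As a quick sanity check of the telescoping identity behind (3), the following sketch (illustrative NumPy code, not from the paper; the rollout values are random placeholders) verifies that the H-horizon MPC objective and the disadvantage form coincide on a single sampled rollout:

```python
import numpy as np

rng = np.random.default_rng(0)
H, gamma = 8, 0.99
costs = rng.normal(size=H)        # c(s_i, a_i) along one sampled rollout
q_hat = rng.normal(size=H + 1)    # Qhat(s_i, a_i) evaluated along the same rollout

# H-horizon MPC estimate: discounted costs plus terminal value.
q_mpc = sum(gamma**i * costs[i] for i in range(H)) + gamma**H * q_hat[H]

# Telescoped form (Eq. 3): starting potential plus discounted 1-step disadvantages,
# A(s_i, a_i) = c_i + gamma * Qhat(s_{i+1}, a_{i+1}) - Qhat(s_i, a_i).
adv = costs + gamma * q_hat[1:] - q_hat[:-1]
q_tel = q_hat[0] + sum(gamma**i * adv[i] for i in range(H))

assert np.isclose(q_mpc, q_tel)
```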
Although the above formulation queries $\hat{Q}$ at every timestep, it is still exactly equivalent to the original
problem and hence does not mitigate the effects of model bias. In the next section, we build a concrete
method to address this issue by formulating a novel way to blend Q-estimates from MPC and a learned
value function that balances their respective errors.
This theorem generalizes several established results. Setting $H = 1$, $\epsilon = 0$ gives us the 1-step
simulation lemma in Kearns & Singh (2002) (Appendix A.1.1). Setting $\alpha = 0$, i.e. a true model, recovers
the cost-shaping result in Sun et al. (2018).
Further inspecting the terms in (4), we see that the model error increases with horizon $H$ (the first two terms)
while the learned value error decreases with $H$. This matches our intuition and indicates that there is
an optimal planning horizon $H^*$ that minimizes the bound.
Theorem 3.2. The optimal planning horizon that minimizes the bound in (4) is given by
$$H^* = \max\left(1,\; \min\left(H,\; \frac{\epsilon}{(1-\gamma)\left(\epsilon + \alpha\frac{V_{\max}-V_{\min}}{4}\right)}\right)\right) \tag{5}$$
In practice, the errors in the model and value function are usually unknown and hard to estimate, making it
impossible to set the MPC horizon to its optimal value. Instead, we next propose a strategy to blend the
Q-estimates from MPC and the learned value function at every timestep along the horizon, rather than
just at the terminal step, so that we can properly balance the different sources of error.
3.2 BLENDING MODEL PREDICTIVE CONTROL AND VALUE FUNCTIONS
A naive way to blend Q-estimates from MPC with Q-estimates from the value function would be to
consider a convex combination of the two
$$(1-\lambda)\underbrace{\hat{Q}(s,a)}_{\text{model-free}} + \lambda\underbrace{Q^\phi_H(s,a)}_{\text{model-based}} \tag{6}$$
where $\lambda \in [0,1]$. Here, the value function contributes a residual that is added to the MPC output, an
approach commonly used to combine model-based and model-free methods (Lee et al., 2020). However,
this solution is rather ad hoc. If we have a value function at our disposal, why invoke it only at the
first and last timesteps? As the value function gets better, it should be useful to invoke it at all timesteps.
Instead, consider the following recursive formulation for the Q-estimate. Given $(s_i,a_i)$, the state-action pair
encountered at horizon $i$, the blended estimate $Q^\lambda(s_i,a_i)$ is expressed as
$$\underbrace{Q^\lambda(s_i,a_i)}_{\text{current blended estimate}} = (1-\lambda)\underbrace{\hat{Q}(s_i,a_i)}_{\text{model-free}} + \lambda\Big(\underbrace{\hat{c}(s_i,a_i)}_{\text{model-based}} + \gamma\underbrace{Q^\lambda(s_{i+1},a_{i+1})}_{\text{future blended estimate}}\Big) \tag{7}$$
where $\lambda \in [0,1]$. The recursion ends at $Q^\lambda(s_H,a_H) = \hat{Q}(s_H,a_H)$. In other words, the current blended
estimate is a convex combination of the model-free value function and the one-step model-based return,
where the return in turn uses the future blended estimate. Note that, unlike (6), the model-free estimate is invoked
at every timestep.
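A minimal sketch of the recursion in (7), evaluated backwards along a single model rollout (illustrative code; `costs` and `q_hat` stand in for $\hat{c}(s_i,a_i)$ and $\hat{Q}(s_i,a_i)$ evaluated on that rollout, and are random placeholders here):

```python
import numpy as np

def blended_q(costs, q_hat, gamma, lam):
    """Blended estimate of Eq. (7) for a single H-step rollout.

    costs: shape (H,)   -- model-predicted costs c_hat(s_i, a_i)
    q_hat: shape (H+1,) -- learned Q-function evaluated at (s_i, a_i)
    Returns Q^lambda(s_0, a_0).
    """
    q_lam = q_hat[-1]                      # base case: Q^lam(s_H, a_H) = Qhat(s_H, a_H)
    for i in reversed(range(len(costs))):  # i = H-1, ..., 0
        q_lam = (1 - lam) * q_hat[i] + lam * (costs[i] + gamma * q_lam)
    return q_lam

rng = np.random.default_rng(0)
H, gamma = 8, 0.99
costs, q_hat = rng.normal(size=H), rng.normal(size=H + 1)
# lam = 0 returns Qhat(s_0, a_0); lam = 1 recovers the pure H-horizon MPC estimate.
print(blended_q(costs, q_hat, gamma, 0.0), q_hat[0])
```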
We can unroll (7) in time to show that $Q^\lambda_H(s,a)$, the blended $H$-horizon estimate, is simply an exponentially
weighted average of all horizon estimates
$$Q^\lambda_H(s,a) = (1-\lambda)\sum_{i=0}^{H-1}\lambda^i Q^\phi_i(s,a) + \lambda^H Q^\phi_H(s,a) \tag{8}$$
where $Q^\phi_k(s,a) = \mathbb{E}_{\mu^{\pi_\phi}_{\hat{M}}}\left[\sum_{i=0}^{k-1}\gamma^i\hat{c}(s_i,a_i) + \gamma^k\hat{Q}(s_k,a_k)\,\middle|\,s_0=s,\,a_0=a\right]$ is the $k$-horizon estimate. When
$\lambda=0$, the estimator reduces to just using $\hat{Q}$, and when $\lambda=1$ we recover the original MPC estimate $Q_H$ in
(1). For intermediate values of $\lambda$, we interpolate smoothly between the two by interpolating all $H$ estimates.
Implementing (8) naively would require running $H$ versions of MPC and then combining their outputs,
which is far too expensive. However, we can switch to the disadvantage formulation by applying a similar
telescoping trick
$$Q^\lambda_H(s,a) = \mathbb{E}_{\mu^{\pi_\phi}_{\hat{M}}}\left[\hat{Q}(s_0,a_0) + \sum_{i=0}^{H-1}(\gamma\lambda)^i A(s_i,a_i)\right] \tag{9}$$
This estimator has a similar form to the TD($\lambda$) estimator for the value function. However, while TD($\lambda$)
uses the $\lambda$ parameter for a bias-variance trade-off, our blended estimator aims to trade off bias in the dynamics
model against bias in the learned value function.
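The relationship between these forms can be checked numerically on a toy rollout; the sketch below (illustrative, self-contained, with random placeholder values) confirms that the weighted average (8) matches the recursion (7), and computes the disadvantage form (9) in a single pass without constructing $H$ separate horizon estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
H, gamma, lam = 8, 0.99, 0.7
costs = rng.normal(size=H)        # c_hat(s_i, a_i) along one model rollout
q_hat = rng.normal(size=H + 1)    # Q_hat(s_i, a_i) along the same rollout

# Eq. (7), unrolled backwards along the rollout.
q_rec = q_hat[H]
for i in reversed(range(H)):
    q_rec = (1 - lam) * q_hat[i] + lam * (costs[i] + gamma * q_rec)

# Eq. (8): exponentially weighted average of all k-horizon estimates.
def k_horizon(k):
    return sum(gamma**i * costs[i] for i in range(k)) + gamma**k * q_hat[k]

q_avg = (1 - lam) * sum(lam**k * k_horizon(k) for k in range(H)) + lam**H * k_horizon(H)
assert np.isclose(q_rec, q_avg)

# Eq. (9): single pass with 1-step disadvantages and the steeper discount gamma*lam,
# the form used for action selection and target generation in Algorithm 1.
adv = costs + gamma * q_hat[1:] - q_hat[:-1]
q_dis = q_hat[0] + sum((gamma * lam)**i * adv[i] for i in range(H))
```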
Corollary 3.1 (Proof in Appendix A.1.4). For approximate models and learnt value functions as defined
in Theorem 3.1, there exists a critical $\lambda^* \in [0,1]$ that minimizes the performance bound
$$\lambda^* = \frac{1}{\gamma}\left(1 - \frac{1}{H^*}\right) \tag{10}$$
where $H^*$ is stated in (5) in Theorem 3.2.
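As a concrete illustration, the sketch below evaluates (5) and (10) for some assumed error magnitudes; the values of `eps`, `alpha`, and the value range are hypothetical placeholders, not quantities estimated in the paper:

```python
import numpy as np

def optimal_horizon(eps, alpha, gamma, v_max, v_min, h_max):
    """Eq. (5): planning horizon minimizing the bound, clipped to [1, h_max]."""
    h = eps / ((1.0 - gamma) * (eps + alpha * (v_max - v_min) / 4.0))
    return max(1.0, min(h_max, h))

def optimal_lambda(h_star, gamma):
    """Eq. (10): blending parameter giving an effective horizon of h_star."""
    return float(np.clip((1.0 - 1.0 / h_star) / gamma, 0.0, 1.0))

# A fairly accurate value function (small eps) with a moderately biased model
# (larger alpha * value-range term) favors a short effective horizon.
h_star = optimal_horizon(eps=0.1, alpha=0.05, gamma=0.99, v_max=0.0, v_min=-100.0, h_max=64)
print(h_star, optimal_lambda(h_star, gamma=0.99))   # roughly 7.4 and 0.87
```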
Algorithm 1: MPQ(λ)
Input: Initial Q-function weights θ, approximate dynamics P̂ and cost function ĉ
Parameters: MPC horizon H, λ schedule [λ1, λ2, ...], discount factor γ, minibatch size K, number of mini-batches N, update frequency t_update
1:  D ← ∅
2:  for t = 1 ... ∞ do
3:      λ ← λ_t    // update λ
4:      φ_t ← argmin_φ E_{μ^{π_φ}_{M̂}} [ Q̂_θ(s_0, a_0) + Σ_{i=0}^{H−1} (γλ)^i A(s_i, a_i) | s_0 = s_t ]    // blended MPC action selection
5:      a_t ∼ π_{φ_t}
6:      Execute a_t on the system and observe (c_t, s_{t+1})
7:      D ← D ∪ {(s_t, a_t, c_t, s_{t+1})}
8:      if t % t_update == 0 then
9:          Sample N minibatches {(s_{k,n}, a_{k,n}, c_{k,n}, s'_{k,n})}_{k=1..K, n=1..N} from D
10:         ŷ_{k,n} = c_{k,n} + γ min_φ E_{μ^{π_φ}_{M̂}} [ Q̂_θ(s_0, a_0) + Σ_{i=0}^{H−1} (γλ)^i A(s_i, a_i) | s_0 = s'_{k,n} ]    // blended MPC value targets
11:         Update θ with SGD on the loss L = (1/N)(1/K) Σ_{n=1}^{N} Σ_{k=1}^{K} ( ŷ_{k,n} − Q̂_θ(s_{k,n}, a_{k,n}) )^2

At every timestep $t$, the policy parameters are thus obtained by optimizing the blended Q-estimator under the approximate model
$$\phi^*_t = \arg\min_{\phi}\,\mathbb{E}_{\mu^{\pi_\phi}_{\hat{M}}}\left[\hat{Q}_\theta(s_0,a_0) + \sum_{i=0}^{H-1}(\gamma\lambda)^i A(s_i,a_i)\,\Big|\,s_0 = s_t\right] \tag{11}$$
An action sampled from the resulting policy is then executed on the system. A commonly used heuristic is to
warm start the above optimization by shifting forward the solution from the previous timestep, which serves
as a good initialization if the noise in the dynamics is small (Wagener et al., 2019). This can significantly
cut computational cost by reducing the number of iterations required to optimize (11) at every timestep.
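A minimal sketch of the shift-based warm start (illustrative code; `mean_actions` is an assumed name for the previous H-step mean control sequence, not an identifier from the paper):

```python
import numpy as np

def shift_warm_start(mean_actions: np.ndarray) -> np.ndarray:
    """Shift the previous H x d_action control sequence forward by one step.

    The first action has already been executed, so drop it and repeat the last
    action to keep the sequence length at H before the next optimization.
    """
    shifted = np.roll(mean_actions, shift=-1, axis=0)
    shifted[-1] = mean_actions[-1]   # simple heuristic for the newly exposed final step
    return shifted
```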
Periodically, the parameters $\theta$ are updated via stochastic gradient descent to minimize the following loss
function with $N$ mini-batches of experience tuples of size $K$ sampled from the replay buffer
$$\mathcal{L}(\theta) = \frac{1}{N}\frac{1}{K}\sum_{n=1}^{N}\sum_{k=1}^{K}\left(\hat{y}_{k,n} - \hat{Q}_\theta(s_{k,n},a_{k,n})\right)^2 \tag{12}$$
The H-horizon MPC with blended Q-estimator is again invoked to calculate the targets
Figure 1: Tasks for evaluating MPQ(λ). Left to right: cartpole, peg insertion with a 7DOF arm, and in-hand
manipulation with a 24DOF dexterous hand to align a pen (blue) with a target orientation (green).
$$\hat{y}_{k,n} = c_{k,n} + \gamma\min_{\phi}\,\mathbb{E}_{\mu^{\pi_\phi}_{\hat{M}}}\left[\hat{Q}_\theta(s_0,a_0) + \sum_{i=0}^{H-1}(\gamma\lambda)^i A(s_i,a_i)\,\Big|\,s_0 = s'_{k,n}\right] \tag{13}$$
Using MPC to reduce error in Q-targets has been previously explored in the literature (Lowrey et al., 2018;
Bhardwaj et al., 2020), where the model is either assumed to be perfect or model error is not explicitly
accounted for. MPC with the blended Q-estimator and an appropriate λ allows us to generate more stable
Q-targets than using $\hat{Q}_\theta$ or model-based rollouts with a terminal Q-function alone. However, running
H-horizon optimization for all samples in a mini-batch can be time-consuming, forcing the use of smaller
batch sizes and sparse updates. In our experiments, we employ a practical modification where, during
the action selection step, MPC is also queried for value targets which are then stored in the replay buffer,
allowing us to use larger batch sizes and to perform updates at every timestep.
Finally, we also allow λ to vary over time. In practice, λ is decayed as more data is collected on the
system. Intuitively, in the early stages of learning, the bias in Q̂θ dominates and hence we want to rely
more on the model. A larger value of λ is appropriate as it up-weights longer horizon estimates in the
blended-Q estimator. As Q̂θ estimates improve over time, a smaller λ is favorable to reduce the reliance
on the approximate model.
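Putting the pieces together, the following is a compact sketch of the loop in Algorithm 1 with the online-target modification described above. It is a schematic under stated assumptions, not the authors' implementation: `mpc_blended(s, q_net, horizon, gamma, lam)` stands in for any MPC routine (MPPI in our experiments) that optimizes the blended objective (11) and returns both an action and the blended value estimate, while `env` and `q_net` are hypothetical interfaces.

```python
import torch

def train_mpq_lambda(env, q_net, mpc_blended, lam_schedule,
                     gamma=0.99, horizon=32, batch_size=256,
                     num_steps=100_000, lr=1e-3):
    """Schematic MPQ(lambda) training loop.

    Assumed interfaces (hypothetical):
      q_net(s, a) -> scalar Q-estimate (a torch.nn.Module)
      mpc_blended(s, q_net, horizon, gamma, lam) -> (action, blended_value)
      env.reset() -> state; env.step(a) -> (next_state, cost, done)
    """
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    buffer = []                                   # small buffer of (s, a, target)
    s = env.reset()
    for t in range(num_steps):
        lam = lam_schedule(t)                     # decay lambda over training
        a, _ = mpc_blended(s, q_net, horizon, gamma, lam)            # Eq. (11)
        s_next, cost, done = env.step(a)

        # Online value target: observed cost plus discounted blended estimate at
        # the next state, queried during action selection and stored directly.
        _, v_next = mpc_blended(s_next, q_net, horizon, gamma, lam)
        buffer.append((s, a, cost + gamma * float(v_next)))
        buffer = buffer[-50_000:]

        if len(buffer) >= batch_size:             # one SGD step per timestep
            idx = torch.randint(len(buffer), (batch_size,))
            batch = [buffer[int(i)] for i in idx]
            pred = torch.stack([q_net(si, ai) for si, ai, _ in batch])
            target = torch.tensor([y for _, _, y in batch])
            loss = torch.mean((pred - target) ** 2)                  # Eq. (12)
            opt.zero_grad()
            loss.backward()
            opt.step()

        s = env.reset() if done else s_next
```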
5 EXPERIMENTS
Task Details: We evaluate MPQ(λ) on simulated robot control tasks, including a complex manipulation
task with a 7DOF arm and in-hand manipulation with a 24DOF anthropomorphic hand (Rajeswaran* et al.,
2018) as shown in Fig. 1. For each task, we provide the agent with a biased version of simulation that
is used as the dynamics model for MPC. We use Model Predictive Path Integral Control (MPPI) (Williams
et al., 2017), a state-of-the-art sampling-based algorithm as our MPC algorithm throughout.
1. CARTPOLESWINGUP: A classic control task where the agent slides a cart along a rail to swing up
the pole attached via an unactuated hinge joint. Model bias is simulated by providing the agent incorrect
masses for the cart and pole. The masses are set lower than the true values to make the problem harder
for MPC, as the algorithm will then always apply smaller controls than desired, as also noted in Ramos et al.
(2019). The initial positions of the cart and pole are randomized at every episode.
2. SAWYERPEGINSERTION: The agent controls a 7DOF Sawyer arm to insert a peg attached to the
end-effector into a hole at different locations on a table in front of the robot. We test the effects of inaccurate
perception by simulating a sensor at the target location that provides noisy position measurements at every
timestep. MPC uses a deterministic model that does not take sensor noise into account as commonly done
in controls. This biases the cost of simulated trajectories, causing MPC to fail to reach the target.
3. INHANDMANIPULATION: A challenging in-hand manipulation task with a 24DOF dexterous hand
from Rajeswaran* et al. (2018). The agent must align the pen with a target orientation within a certain
tolerance for success. The initial orientation of the pen is randomized at every episode. Here, we simulate
bias by providing larger estimates of the mass and inertia of the pen as well as the friction coefficients, which
causes the MPC algorithm to optimize overly aggressive policies and drop the pen.
Please refer to the Appendix A.2 for more details of the tasks and biased simulations.
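Since MPPI is the underlying optimizer throughout, the following is a minimal sketch of a single MPPI update with the blended objective (9) used to score rollouts. It is illustrative only: `model_step`, `cost_fn`, and `q_hat` are assumed placeholder callables, and the noise and temperature settings are not the tuned values from the experiments.

```python
import numpy as np

def mppi_update(nominal, state, model_step, cost_fn, q_hat, gamma, lam,
                num_samples=64, noise_std=0.2, temperature=1.0, rng=None):
    """One MPPI iteration over an H x d_action nominal control sequence."""
    rng = rng or np.random.default_rng()
    H, d = nominal.shape
    noise = rng.normal(scale=noise_std, size=(num_samples, H, d))
    controls = nominal[None] + noise                 # K perturbed control sequences
    scores = np.zeros(num_samples)
    for k in range(num_samples):
        s = state
        for i in range(H):
            a = controls[k, i]
            s_next = model_step(s, a)                # approximate (possibly biased) model
            if i == 0:
                scores[k] += q_hat(s, a)             # Qhat(s_0, a_0) term of Eq. (9)
            a_next = controls[k, i + 1] if i + 1 < H else a   # placeholder final action
            adv = cost_fn(s, a) + gamma * q_hat(s_next, a_next) - q_hat(s, a)
            scores[k] += (gamma * lam) ** i * adv    # blended disadvantage term
            s = s_next
    # Exponentiated-cost weighting: lower blended cost -> larger weight.
    weights = np.exp(-(scores - scores.min()) / temperature)
    weights /= weights.sum()
    return np.einsum("k,khd->hd", weights, controls)  # updated nominal sequence
```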
Baselines: We compare MPQ(λ) against both model-based and model-free baselines: (1) MPPI with
true dynamics and no value function, (2) MPPI with biased dynamics and no value function, and (3) Proximal
Policy Optimization (PPO), a state-of-the-art model-free RL method.
Learning Details: We represent the Q-function with a feed-forward neural network. We bias simulation
parameters like mass or friction coefficients using the formula $m = (1+b)\,m_{\text{true}}$, where $b$ is a bias factor.
We also employ a practical modification to Alg. 1 in order to speed up training, as discussed in
Section 4. Instead of maintaining a large global replay buffer and re-calculating targets for every experience
tuple in a mini-batch, as done by approaches such as Bhardwaj et al. (2020); Lowrey et al. (2018), we
simply query MPC for the value targets online and store them in a smaller replay buffer. This has a
similar stabilizing effect as using a frozen target network in DQN and allows us to perform updates at
every timestep. For PPO, we use the publicly available implementation https://rb.gy/61iarq. Refer to
the Appendix A.2 for more details.
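For concreteness, a small sketch of how such a bias factor might be applied to the parameters handed to the MPC model (illustrative; the parameter names are hypothetical, not the exact MuJoCo fields used here):

```python
def bias_params(true_params: dict, b: float) -> dict:
    """Scale selected physical parameters by (1 + b), i.e. m = (1 + b) * m_true."""
    return {name: (1.0 + b) * value for name, value in true_params.items()}

# e.g. b = -0.5 halves the cart and pole masses given to the MPC model
biased = bias_params({"cart_mass": 1.0, "pole_mass": 0.1}, b=-0.5)
```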
5.1 ANALYSIS OF OVERALL PERFORMANCE
O 1. MPQ(λ) is able to overcome model-bias in MPC for a wide range of λ values.
Fig. 2(a) shows a comparison of MPQ(λ) with MPPI using true and biased dynamics with b = −0.5 and
H = 64, for various settings of λ. There exists a wide range of λ values for which MPQ(λ) can efficiently
trade off the model bias against the bias in the learned Q-function and outperform MPPI with biased
dynamics. However, setting λ to a high value such as 1.0 or 0.95, which weighs longer horizons heavily,
leads to poor performance, as the compounding effects of model bias are not compensated for by $\hat{Q}_\theta$.
Performance also begins to drop as λ decreases below 0.6. We note that MPQ(λ) also outperforms MPPI
with access to the true dynamics function. This is not surprising as the learned Q-function adds global
information to the optimization problem and can correct for errors in optimizing for long horizons due
to large variance when using limited trajectory samples in MPPI.
O 2. Faster convergence can be achieved by decaying λ over time.
As more data is collected on the system, we expect the bias in Qθ to decrease, whereas model bias remains
constant. We decay λ in [1.0,λF ] using a fixed schedule to reduce the dependence on MPC over time as
the model-free Q-estimates improve. This leads to faster convergence than using a fixed λ as demonstrated
in Fig. 2(b). This can be attributed to the fact that during the initial steps of training, the bias in a randomly
initialized Qθ is large, whereas longer horizon lookahead with the approximate model can still lead us
to good parts of the state space. Thus, initially a larger value of λ that favors longer horizons is better as the
effect of Qθ is diminished due to discounting. As more data is collected, the accuracy of Qθ increases and
bias in the model begins to hurt performance. A smaller value of λ is more favorable now as it places more
weight on shorter-horizon estimates. In Figures 2(c) and 2(d) we present an ablation study showing
that MPQ(λ) is robust to a wide range of decay rates with H = 64 and H = 32 respectively. We see that when
provided with true dynamics, MPPI with H = 32 performs better than H = 64, which can be attributed
to optimization issues with long horizons. MPQ(λ) reaches performance comparable with MPPI H = 32
in both cases, showing robustness to the choice of horizon, which is important since in practice we wish to set
the horizon as large as the computation budget permits. However, decaying λ too fast or too slow can have
adverse effects on performance. An interesting question for future work is whether λ can be set adaptively
in a state-dependent manner. Refer to Appendix A.2 for details on decay schedule.
O 3. MPQ(λ) is robust to a large degree of model misspecification.
Fig. 3(a) shows the effect of different values of the bias factor b, used to vary the mass of the cart and pole,
on MPQ(λ) with a fixed λ decay schedule over [1.0, 0.75]. MPQ(λ) achieves better performance than MPPI
(H = 64), both with biased and true dynamics, for a wide range of bias factors b, and convergence
is generally faster for smaller bias. For large values of b, MPQ(λ) either fails to improve or diverges as
the compounding effects of model bias hurt learning. A similar trend holds in Fig. 3(b): MPQ(λ) achieves
performance comparable to MPPI with true dynamics for different degrees of bias in the mass, inertia and
friction properties of the pen, but is unable to do so for b = 1.0. Both plots show validation performance
averaged over 3 seeds. We conclude that while MPQ(λ) is robust to a large amount of model bias,
if the model is extremely uninformative, relying on MPC can degrade performance.
O 4. MPQ(λ) is much more sample efficient than model-free RL on high-dimensional continuous
control tasks, even with approximate models.
Figures 3(b) and 3(c) show a comparison of MPQ(λ) with PPO (Schulman et al., 2017), a model-free policy
gradient algorithm, on the INHANDMANIPULATION and SAWYERPEGINSERTION tasks respectively. In
both cases, we observe that MPQ(λ), through its use of approximate models, learned value functions,
and a dynamically-varying λ parameter to trade off different sources of error, rapidly improves its
[Figure 2 plots: average reward versus validation iteration on CARTPOLESWINGUP for (a) fixed values of λ with H = 64, (b) decaying λ schedules with H = 64, and (c), (d) ablations over λ decay rates (final value λF) with H = 64 and H = 32; MPPI with true and biased dynamics shown as dashed baselines.]
Figure 2: CARTPOLESWINGUP experiments. All curves show average rewards over 30 validation episodes, averaged
over 3 runs with different seeds. Training is performed for 100k steps with validation after every 4k steps. When
decaying λ as per a schedule, it is fixed to the current value during validation. In (b),(c),(d), λF denotes the value of λ
at the end of training. The dashed lines show average reward of MPPI and shaded regions denote standard deviation.
[Figure 3 plots: validation rewards for (a) CARTPOLESWINGUP with varying bias factor b, (b) INHANDMANIPULATION with varying bias factor and a PPO baseline, and (c) SAWYERPEGINSERTION with sensor noise, comparing MPQ(λ) against MPPI and PPO baselines.]
Figure 3: Robustness and sample efficiency of MPQ(λ). (a) Varying bias over the mass of the cart and pole, (b) varying
bias factor over the mass, inertia and friction of the pen, (c) peg insertion task with noisy perception. The same bias factor b is
used for all altered properties in each task. Results depict average reward over 30 validation episodes. λ is decayed
over [1.0, 0.75] in (a) and over [1.0, 0.85] in (b) and (c).
performance and achieves average reward comparable to MPPI with access to ground truth dynamics.
In INHANDMANIPULATION, PPO performance does not improve at all over 54k training steps. In
SAWYERPEGINSERTION, the small magnitude of reward difference between MPPI with true and biased
models is due to the fact that even with a biased model, MPC is able to get the peg close to the table, but
due to sensor noise, is unable to control it precisely enough to consistently insert it in the hole. Here, the value
function learned by MPQ(λ) can adapt to the sensor noise and allow for fine-grained control near the table.
PPO only marginally improves over validation iterations and does not get close to the table.
6 CONCLUSION
In this paper, we presented a method to mitigate model bias in MPC by blending in model-free value
estimates using a parameter λ that trades off different sources of bias, along with a practical algorithm that
achieves performance close to MPC with access to the true dynamics while being more sample efficient than
model-free RL methods. However, a few key questions remain. Although faster convergence can be
achieved by decaying λ over time, an interesting avenue for future research is to vary λ in a state-adaptive
fashion. In particular, reasoning about model and value function uncertainty may allow us to vary
λ to rely more or less on the model in different parts of the state space. Another interesting avenue for
research is to investigate variance reduction by using λ similarly to TD(λ), allowing us to optimize over longer
horizons in sampling-based MPC with fewer samples.
REFERENCES
Pieter Abbeel and Andrew Y Ng. Exploration and apprenticeship learning in reinforcement learning. In
Proceedings of the 22nd international conference on Machine learning, pp. 1–8, 2005.
Pieter Abbeel, Adam Coates, and Andrew Y Ng. Autonomous helicopter aerobatics through apprenticeship
learning. The International Journal of Robotics Research, 29(13):1608–1639, 2010.
Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree
search. In Advances in Neural Information Processing Systems, pp. 5360–5370, 2017.
Mohak Bhardwaj, Ankur Handa, Dieter Fox, and Byron Boots. Information theoretic model predictive
q-learning. In Learning for Dynamics and Control, pp. 840–850, 2020.
Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in
a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing
Systems, pp. 4754–4765, 2018.
Tom Erez, Kendall Lowrey, Yuval Tassa, Vikash Kumar, Svetoslav Kolev, and Emanuel Todorov. An
integrated system for real-time model predictive control of humanoid robots. In 2013 13th IEEE-RAS
International Conference on Humanoid Robots (Humanoids), pp. 292–299. IEEE, 2013.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum
entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
Sham Kakade, Michael J Kearns, and John Langford. Exploration in metric state spaces. In Proceedings
of the 20th International Conference on Machine Learning (ICML-03), pp. 306–312, 2003.
Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine
learning, 49(2-3):209–232, 2002.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
Gilwoo Lee, Brian Hou, Sanjiban Choudhury, and Siddhartha S Srinivasa. Bayesian residual policy
optimization: Scalable bayesian reinforcement learning with clairvoyant experts. arXiv preprint
arXiv:2002.03042, 2020.
Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan
online, learn offline: Efficient learning and exploration via model-based control. arXiv preprint
arXiv:1811.01848, 2018.
Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics
for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International
Conference on Robotics and Automation (ICRA), pp. 7559–7566. IEEE, 2018.
Aravind Rajeswaran*, Vikash Kumar*, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel
Todorov, and Sergey Levine. Learning Complex Dexterous Manipulation with Deep Reinforcement
Learning and Demonstrations. In Proceedings of Robotics: Science and Systems (RSS), 2018.
Fabio Ramos, Rafael Carvalhaes Possas, and Dieter Fox. Bayessim: adaptive domain randomization via
probabilistic inference for robotics simulators. arXiv preprint arXiv:1906.01728, 2019.
Stephane Ross and J Andrew Bagnell. Agnostic system identification for model-based reinforcement
learning. arXiv preprint arXiv:1203.1007, 2012.
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional
continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Pranav Shyam, Wojciech Jaśkowski, and Faustino Gomez. Model-based active exploration. In International
Conference on Machine Learning, pp. 5779–5788, 2019.
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the
game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc
Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play
with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
Colin Summers, Kendall Lowrey, Aravind Rajeswaran, Siddhartha Srinivasa, and Emanuel Todorov.
Lyceum: An efficient and scalable ecosystem for robot learning. arXiv preprint arXiv:2001.07343, 2020.
Wen Sun, J Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Combining reinforcement
learning & imitation learning. arXiv preprint arXiv:1805.11240, 2018.
Nolan Wagener, Ching-An Cheng, Jacob Sacks, and Byron Boots. An online learning approach to model
predictive control. arXiv preprint arXiv:1902.08967, 2019.
Grady Williams, Paul Drews, Brian Goldfain, James M Rehg, and Evangelos A Theodorou. Aggressive
driving with model predictive path integral control. In 2016 IEEE International Conference on Robotics
and Automation (ICRA), pp. 1433–1440. IEEE, 2016.
Grady Williams, Nolan Wagener, Brian Goldfain, Paul Drews, James M Rehg, Byron Boots, and
Evangelos A Theodorou. Information theoretic mpc for model-based reinforcement learning. In 2017
IEEE International Conference on Robotics and Automation (ICRA), pp. 1714–1721. IEEE, 2017.
Mingyuan Zhong, Mikala Johnson, Yuval Tassa, Tom Erez, and Emanuel Todorov. Value function
approximation and model predictive control. In 2013 IEEE symposium on adaptive dynamic
programming and reinforcement learning (ADPRL), pp. 100–107. IEEE, 2013.
A APPENDIX
A.1 PROOFS
We present upper bounds on the performance of a greedy policy when using approximate value functions
and models. We also analyze the case of finite horizon planning with an approximate dynamics model
and terminal value function, which can be seen as a generalization of (Sun et al., 2018). For simplicity,
we switch to using $\hat{V}(s)$ for the learnt model-free value function (instead of $\hat{Q}(s)$).
Let $\hat{V}(s)$ be an $\epsilon$-approximation with $\left\|\hat{V}(s) - V^{\pi^*}_M(s)\right\|_\infty \le \epsilon$. Let MDP $\hat{M}$ be an $\alpha$-approximation of $M$
such that $\forall (s,a)$ we have $\left\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\right\|_1 \le \alpha$ and $|\hat{c}(s,a) - c(s,a)| \le \alpha$.
$$\hat{c}(s,\hat{\pi}(s)) + \gamma\sum_{s'}\hat{P}(s'|s,\hat{\pi}(s))\hat{V}(s') \le \hat{c}(s,\pi^*(s)) + \gamma\sum_{s'}\hat{P}(s'|s,\pi^*(s))\hat{V}(s')$$
$$\hat{c}(s,\hat{\pi}(s)) - \hat{c}(s,\pi^*(s)) \le \gamma\left(\sum_{s'}\hat{P}(s'|s,\pi^*(s))\hat{V}(s') - \sum_{s'}\hat{P}(s'|s,\hat{\pi}(s))\hat{V}(s')\right)$$
(using $\left\|\hat{V}(s) - V^{\pi^*}_M(s)\right\|_\infty \le \epsilon$)
$$\hat{c}(s,\hat{\pi}(s)) - \hat{c}(s,\pi^*(s)) \le \gamma\left(\sum_{s'}\hat{P}(s'|s,\pi^*(s))V^{\pi^*}_M(s') - \sum_{s'}\hat{P}(s'|s,\hat{\pi}(s))V^{\pi^*}_M(s')\right) + 2\gamma\epsilon$$
(using $|\hat{c}(s,a) - c(s,a)| \le \alpha$)
$$c(s,\hat{\pi}(s)) - c(s,\pi^*(s)) \le 2\gamma\epsilon + 2\alpha + \gamma\left(\sum_{s'}\hat{P}(s'|s,\pi^*(s))V^{\pi^*}_M(s') - \sum_{s'}\hat{P}(s'|s,\hat{\pi}(s))V^{\pi^*}_M(s')\right) \tag{16}$$
Now, let $s$ be the state with the maximum loss $V^{\hat{\pi}}_M(s) - V^{\pi^*}_M(s)$,
$$V^{\hat{\pi}}_M(s) - V^{\pi^*}_M(s) = c(s,\hat{\pi}) - c(s,\pi^*) + \gamma\sum_{s'}\left(P(s'|s,\hat{\pi})V^{\hat{\pi}}_M(s') - P(s'|s,\pi^*)V^{\pi^*}_M(s')\right)$$
Add and subtract $\gamma\sum_{s'}P(s'|s,\hat{\pi})V^{\pi^*}_M(s')$ and re-arrange:
$$\begin{aligned}
V^{\hat{\pi}}_M(s) - V^{\pi^*}_M(s) &\le 2\gamma\epsilon + 2\alpha + \gamma\sum_{s'}\left(\hat{P}(s'|s,\pi^*) - P(s'|s,\pi^*)\right)V^{\pi^*}_M(s') \\
&\quad - \gamma\sum_{s'}\left(\hat{P}(s'|s,\hat{\pi}) - P(s'|s,\hat{\pi})\right)V^{\pi^*}_M(s') \\
&\quad + \gamma\sum_{s'}P(s'|s,\hat{\pi})\left(V^{\hat{\pi}}_M(s') - V^{\pi^*}_M(s')\right) \\
&\le 2\gamma\epsilon + 2\alpha + 2\gamma\alpha\frac{V_{\max} - V_{\min}}{2} + \gamma\sum_{s'}P(s'|s,\hat{\pi})\left(V^{\hat{\pi}}_M(s') - V^{\pi^*}_M(s')\right)
\end{aligned}$$
$$\left\|V^{\pi^*}_M(s) - V^{\hat{\pi}}_M(s)\right\|_\infty \le \frac{2\left(\gamma\epsilon + \alpha + \gamma\alpha\frac{V_{\max} - V_{\min}}{2}\right)}{1-\gamma} \tag{17}$$
which concludes the proof.
Proof. We temporarily introduce a new MDP $M'$ that has the same cost function as $M$, but the transition
function of $\hat{M}$:
$$\left\langle V^{\pi^*}_M, \pi, M\right\rangle_H - \left\langle V^{\pi^*}_M, \pi, \hat{M}\right\rangle_H = \left(\left\langle V^{\pi^*}_M, \pi, M\right\rangle_H - \left\langle V^{\pi^*}_M, \pi, M'\right\rangle_H\right) + \left(\left\langle V^{\pi^*}_M, \pi, M'\right\rangle_H - \left\langle V^{\pi^*}_M, \pi, \hat{M}\right\rangle_H\right) \tag{19}$$
Let $\Delta P(s_0 \ldots s_H) = P(s_0 \ldots s_H) - \hat{P}(s_0 \ldots s_H)$ represent the difference in the distribution of states encountered
by executing $\pi$ on $M$ and $\hat{M}$ respectively, starting from state $s_0$.
Expanding the RHS of (19),
$$= \sum_{s_0,\ldots,s_H}\Delta P(s_0 \ldots s_H)\left(\sum_{i=0}^{H-1}\gamma^i c(s_i,a_i) + \gamma^H V^{\pi^*}_M(s_H)\right) + \mathbb{E}_{\mu^{\pi}_{\hat{M}}}\left[\sum_{i=0}^{H-1}\gamma^i\left(c(s_i,a_i) - \hat{c}(s_i,a_i)\right)\right] \tag{20}$$
$$\le \sum_{s_2,\ldots,s_{H+1}}\left|\Delta P(s_2 \ldots s_{H+1})\right|\,\left\|\sum_{i=1}^{H-1}\gamma^i c(s_i,a_i) + \gamma^H V^{\pi^*}_M(s_H) - K\right\|_{\infty} + \frac{1-\gamma^H}{1-\gamma}\alpha \tag{22}$$
By choosing $K = \sum_{i=1}^{H-1}\gamma^i\frac{c_{\max}+c_{\min}}{2} + \gamma^H\frac{V_{\max}+V_{\min}}{2}$ we can ensure that the term inside
The above lemma builds on similar results in (Kakade et al., 2003; Abbeel & Ng, 2005; Ross & Bagnell,
2012).
We are now ready to prove our main theorem, i.e. the performance bound of an MPC policy that uses
an approximate model and approximate value function.
Proof of Theorem 3.1
Also, we have
$$\begin{aligned}
\left\langle V^{\pi^*}_M, \hat{\pi}, M\right\rangle_H - \left\langle V^{\pi^*}_M, \pi^*, M\right\rangle_H &= \left(\left\langle V^{\pi^*}_M, \hat{\pi}, M\right\rangle_H - \left\langle V^{\pi^*}_M, \hat{\pi}, \hat{M}\right\rangle_H\right) \\
&\quad - \left(\left\langle V^{\pi^*}_M, \pi^*, M\right\rangle_H - \left\langle V^{\pi^*}_M, \pi^*, \hat{M}\right\rangle_H\right) \\
&\quad + \left(\left\langle V^{\pi^*}_M, \hat{\pi}, \hat{M}\right\rangle_H - \left\langle V^{\pi^*}_M, \pi^*, \hat{M}\right\rangle_H\right)
\end{aligned} \tag{25}$$
The first two terms can be bounded using Lemma A.1 and the third term using Eq. (24) to get
$$\left\langle V^{\pi^*}_M, \hat{\pi}, M\right\rangle_H - \left\langle V^{\pi^*}_M, \pi^*, M\right\rangle_H \le 2\left(\gamma\frac{1-\gamma^{H-1}}{1-\gamma}\alpha H\frac{c_{\max}-c_{\min}}{2} + \gamma^H\alpha H\frac{V_{\max}-V_{\min}}{2} + \frac{1-\gamma^H}{1-\gamma}\alpha + \gamma^H\epsilon\right) \tag{26}$$
Now, let $s$ be the state with the maximum loss $V^{\hat{\pi}}_M(s) - V^{\pi^*}_M(s)$:
$$\begin{aligned}
V^{\hat{\pi}}_M(s) - V^{\pi^*}_M(s) &= \left\langle V^{\hat{\pi}}_M, \hat{\pi}, M\right\rangle_H - \left\langle V^{\pi^*}_M, \pi^*, M\right\rangle_H \\
&= \left(\left\langle V^{\hat{\pi}}_M, \hat{\pi}, M\right\rangle_H - \left\langle V^{\pi^*}_M, \hat{\pi}, M\right\rangle_H\right) + \left(\left\langle V^{\pi^*}_M, \hat{\pi}, M\right\rangle_H - \left\langle V^{\pi^*}_M, \pi^*, M\right\rangle_H\right) \\
&= \gamma^H\left(V^{\hat{\pi}}_M(s_{H+1}) - V^{\pi^*}_M(s_{H+1})\right) + \left(\left\langle V^{\pi^*}_M, \hat{\pi}, M\right\rangle_H - \left\langle V^{\pi^*}_M, \pi^*, M\right\rangle_H\right) \\
&\le \gamma^H\left(V^{\hat{\pi}}_M(s) - V^{\pi^*}_M(s)\right) \\
&\quad + 2\left(\frac{\gamma(1-\gamma^{H-1})}{1-\gamma}\alpha H\frac{c_{\max}-c_{\min}}{2} + \gamma^H\alpha H\frac{V_{\max}-V_{\min}}{2} + \frac{1-\gamma^H}{1-\gamma}\alpha + \gamma^H\epsilon\right)
\end{aligned} \tag{27}$$
where the last inequality comes from applying Eq. (26) and the fact that $s$ is the state with maximum loss. The
final expression follows from simple algebraic manipulation.
Proof. We take the derivative of (4) with respect to $H$ and set it to zero. For convenience, we define the following interim
terms
$$C_1 = \frac{\alpha}{1-\gamma}\frac{c_{\max}-c_{\min}}{2}, \qquad C_2 = \alpha\frac{V_{\max}-V_{\min}}{2} \tag{28}$$
Let $\gamma = 1-\delta$ where $\delta \ll 1$. Hence $\gamma^H = (1-\delta)^H \approx 1 - H\delta$. Plugging into (29),
$$C_1\left((1-2\delta H) + (1-\delta H)\left((\gamma-1)H\log\gamma - \gamma - 1\right) + \gamma\right) + C_2\left((1-\delta H)\left(-\epsilon(1-\delta H) + H\log\gamma + 1\right)\right) + \epsilon\log\gamma\,(1-H\delta) = 0 \tag{30}$$
$$C_1\left(H^2(1-\gamma)\delta\log\gamma - H(1-\gamma)(\delta + \log\gamma)\right) + C_2\left(-H^2(\delta^2 + \delta\log\gamma) + H(\delta + \log\gamma)\right) + \epsilon\log\gamma\,(1-H\delta) = 0 \tag{31}$$
$$C_1\left(-H^2\delta^3 + \tfrac{1}{2}H\delta^3\right) + C_2\left(\tfrac{1}{2}H^2\delta^3 - \tfrac{1}{2}H\delta^2\right) + \epsilon H\delta^2 - \epsilon\delta = 0$$
$$C_1\left(-H^2\delta^2 + \tfrac{1}{2}H\delta^2\right) + C_2\left(\tfrac{1}{2}H^2\delta^2 - \tfrac{1}{2}H\delta\right) + \epsilon H\delta - \epsilon = 0 \tag{32}$$
$$H^2\delta^2\left(\tfrac{C_2}{2} - C_1\right) + H\delta\left(\epsilon + \tfrac{C_2}{2}\right) - \epsilon = 0$$
For simplicity, let us assume $C_1 \approx C_2/2$. Solving for the optimal $H$ we get
$$H = \frac{\epsilon}{\delta\left(\epsilon + C_2/2\right)} \tag{33}$$
$$Q^\phi_H(s,a) = \mathbb{E}_{\mu^{\pi_\phi}_{\hat{M}}}\left[\hat{Q}(s_0,a_0) + \sum_{i=0}^{H-1}\gamma^i A(s_i,a_i)\,\Big|\,s_0 = s_t\right] \tag{34}$$
$$Q^\lambda_H(s,a) = \mathbb{E}_{\mu^{\pi_\phi}_{\hat{M}}}\left[\hat{Q}(s_0,a_0) + \sum_{i=0}^{H-1}(\gamma\lambda)^i A(s_i,a_i)\right] \tag{35}$$
The blended variant simply sets a steeper discount $\gamma\lambda$. This steeper discount results in an effective
horizon of $H \approx \frac{1}{1-\gamma\lambda}$; the effect of terms beyond this $H$ is negligible.
Hence the effect of achieving any critical $H^*$ can be obtained by simply setting $\lambda^*$ such that
$$H^* = \frac{1}{1-\gamma\lambda^*} \qquad\Longrightarrow\qquad \lambda^* = \frac{1}{\gamma}\left(1 - \frac{1}{H^*}\right) \tag{36}$$
SAWYERPEGINSERTION We simulate sensor noise by placing a simulated position sensor at the target
location in the MuJoCo physics engine that adds Gaussian noise with σ =4cm to the observed 3D position
vector. MPPI uses a deterministic model that does not take sensor noise into account for planning.
INHANDMANIPULATION
• Reward function: $-\|x_{obj} - x_{des}\|_2 + z_{obj}^{T} z_{des}$ + bonus for proximity to the desired position and orientation,
where $z_{obj}^{T} z_{des}$ is the dot product between the object axis and the target axis, measuring orientation
similarity.
• Observation: $[q_{pos},\, x_{obj},\, v_{obj},\, z_{obj},\, z_{des},\, x_{obj}-x_{des},\, z_{obj}-z_{des}]$ (45 dim)
The λ decay schedule used is
$$\lambda_t = \frac{\lambda_0}{1 + \kappa\sqrt{t}} \tag{37}$$
where the decay rate κ is calculated based on the desired final value of λ.
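A small sketch of this schedule, under the assumption that κ is solved so the schedule reaches a desired final value λF at the last training step (one plausible reading of "calculated based on the desired final value", not necessarily the exact procedure used):

```python
import numpy as np

def make_lambda_schedule(lam0: float, lam_final: float, total_steps: int):
    """Return lambda_t = lam0 / (1 + kappa * sqrt(t)) with kappa chosen so that
    lambda reaches lam_final at t = total_steps (Eq. 37)."""
    kappa = (lam0 / lam_final - 1.0) / np.sqrt(total_steps)
    return lambda t: lam0 / (1.0 + kappa * np.sqrt(t))

sched = make_lambda_schedule(lam0=1.0, lam_final=0.75, total_steps=100_000)
print(sched(0), sched(100_000))   # 1.0 -> 0.75
```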
PPO: Both the policy and value functions are represented by feed-forward networks with 2 layers each,
with 64 and 128 units for the policy and value networks respectively. All other parameters are left at their
default values. The number of trajectories collected per iteration is modified to correspond to the same
number of samples collected between validation iterations for MPQ(λ).
MPPI parameters: Tables A.2.2, 2, and 3 show the MPPI parameters used for the different experiments. In
addition to the standard MPPI parameters, in certain cases we also use a step-size parameter as introduced
by Wagener et al. (2019). For INHANDMANIPULATION and SAWYERPEGINSERTION we also apply
autoregressive filtering to the sampled MPPI trajectories to induce smoothness in the sampled actions,
with tuned filter coefficients. This has been found to be useful in prior work (Summers et al., 2020;
Lowrey et al., 2018) and helps MPQ(λ) work on high-dimensional control tasks.