Elmar Rueckert¹, Max Mindt¹, Jan Peters¹,² and Gerhard Neumann³

¹Intelligent Autonomous Systems Lab, Technische Universität Darmstadt, Hochschulstr. 10, 64289 Darmstadt, Germany {rueckert, mindt}@ias.tu-darmstadt.de
²Robot Learning Group, Max-Planck Institute for Intelligent Systems, Tuebingen, Germany mail@jan-peters.net
³Computational Learning for Autonomous Systems, Hochschulstr. 10, 64289 Darmstadt, Germany neumann@ias.tu-darmstadt.de
Abstract— For controlling high-dimensional robots, most stochastic optimal control algorithms use approximations of the system dynamics and of the cost function (e.g., using linearizations and Taylor expansions). These approximations are typically only locally correct, which might cause instabilities in the greedy policy updates, lead to oscillations, or cause the algorithms to diverge. To overcome these drawbacks, we add a regularization term to the cost function that punishes large policy update steps in the trajectory optimization procedure. We applied this concept to the Approximate Inference Control method (AICO), where the resulting algorithm guarantees convergence for uninformative initial solutions without complex hand-tuning of learning rates. We evaluated our new algorithm on two simulated robotic platforms. A robot arm with five joints was used for reaching multiple targets while keeping the roll angle constant. On the humanoid robot Nao, we show how complex skills like reaching and balancing can be inferred from desired center of gravity or end effector coordinates.
I.   INTRODUCTION
Typical whole body motor control tasks of humanoids, like reaching for objects while walking, avoiding obstacles during motion, or maintaining balance during movement execution, can be characterized as optimization problems with multiple criteria of optimality or objectives. The objectives may be specified in the robot's configuration space (e.g., joint angles, joint velocities and base reference frame), in task space (where objectives such as desired end effector coordinates or center of gravity positions are specified), or in combinations of both. In this paper, we consider control problems in nonlinear systems with multiple objectives in combinations of these spaces.
A common strategy for whole body motor control is to separate the redundant robot's configuration space into a task space and an orthogonal null space. Objectives or optimality criteria of motion are implemented as weights or priorities [1] for the redundant solutions in the null space. While these approaches have been successfully applied to a variety of tasks, including reaching, obstacle avoidance, walking and maintaining stability [2]–[5], the application of these methods is typically limited to motor control and cannot be directly used for motor planning. It is also unclear how these methods can be applied to motor control problems in nonlinear systems like compliant robots.
Fig. 1. A 5-degree-of-freedom robot arm has to reach for a via-point (the posture on the left in A) and return to its initial pose (the posture on the right in A). The reaching task is encoded in four task objectives, i.e., three Cartesian coordinates and the roll angle of the end effector. The inferred trajectories for the y coordinate and the roll angle, including the objectives, are shown in (B).

Alternatively, in Stochastic Optimal Control (SOC) problems [6], a movement policy is optimized with respect to a cost function, which combines the different criteria of optimality with different weightings. For nonlinear systems, SOC methods use approximations of the system dynamics and of the cost functions, e.g., through linearizations and 2nd order Taylor expansions. These approximations are only locally correct and the updates of the policy may become unstable if the minimum is not close to the points of the linearization, or may oscillate in the case of multiple solutions.
Many SOC methods address this issue and implement regularizations on the algorithmic level. For example, in the iLQG method [7] a diagonal regularization term is added to the control cost Hessian¹, and in an extension [8] it was suggested to penalize deviations from the state trajectory used for linearization rather than the controls. A drawback of this approach is that the additive regularization term needs rapid re-scaling to prevent divergence and accurate fine-tuning of a learning rate to find good solutions, which is challenging and increases the computational time of the algorithm.

¹The update step in the trajectory optimizer corresponds to a Gauss-Newton Hessian approximation [8].
Probabilistic planning methods that translate the SOC problem into an inference problem typically implement learning rates in their belief updates [9] or in the feedback controller [10]. However, in nonlinear systems, both strategies are suboptimal in the sense that even with a small learning rate on the beliefs the corresponding control updates might be large (and vice versa).
We propose to regularize the policy updates on the cost function level for probabilistic planning. To this end, we penalize large distances between two successive trajectories in the iterative trajectory optimization procedure. In [8], the regularization term is only used for the control gains and not for the updates of the value function. However, the deviation from the linearization point can still be high if small regularization terms are used. In our approach, we always want to stay close to the linearization point, as the used approximations are only locally correct. Hence, using too large update steps by greedily exploiting the inaccurate models might again be dangerous, leading to instabilities or oscillations. The scaling parameter of our punishment term serves as step size of the policy update. Due to the use of probabilistic planning, the need for an additional learning rate and complex update strategies for this learning rate can be avoided. Moreover, we will demonstrate that this type of regularization results in more robust policy updates in comparison to [8]. We choose the Approximate Inference Control (AICO) algorithm as probabilistic planning method [9] to discuss and analyze the proposed regularization; however, the same trick can be applied to a large variety of SOC methods.
In the rest of this paper, we introduce the probabilistic planning method AICO, analyze its convergence properties in a reaching task with a light-weight robot arm, and introduce the proposed regularization on the cost function level. The resulting algorithm is evaluated on the humanoid robot Nao, where, in first results, arm reaching and balancing skills are inferred from desired center of gravity or end effector coordinates. We conclude in Section IV.
II.   METHODS
We consider finite horizon Markov decision problems². Let q_t ∈ Q denote the current robot's state in configuration space (e.g., a concatenation of joint angles, joint velocities and reference coordinates in floating base systems) and let the vector x_t ∈ X denote task space features like end effector positions or the center of gravity of a humanoid (these features will be used to specify a cost function later). At time t, the robot executes the action u_t ∈ U according to the movement policy π(u_t|q_t).
The chosen action at the current state is evaluated by the cost function C_t(q_t, u_t) ∈ ℝ and results in a state transition characterized by the probability P(q_{t+1}|q_t, u_t). In Stochastic Optimal Control (SOC), the goal is to find a stochastic policy

π* = argmin_π ⟨ C_T(q_T) + Σ_{t=0}^{T−1} C_t(q_t, u_t) ⟩_{π_q} ,   (1)
where the expectation, denoted by the symbols ⟨·⟩, is taken with respect to the trajectory distribution

π_q(q_{0:T}, u_{0:T−1}) = P(q_0) Π_{t=0}^{T−1} π(u_t|q_t) P(q_{t+1}|q_t, u_t) ,

where P(q_0) is the initial state distribution.
A.   Bayesian inference for control
An interesting class of algorithms for SOC problems has been derived by reformulating the original Bellman formulation in (1) as a Bayesian inference problem [14]–[17].

²Note that the same principle of regulating the update steps in trajectory optimization can also be applied to planning algorithms in infinite horizon problems such as [11], [12].
Fig. 2. Comparison of the convergence properties of iLQG, AICO and our robust variant, where the rate of convergence is controlled via the parameter α. In the top row (A-B), the model of the forward dynamics was approximated by a pseudo-dynamics model [13]. In the bottom row, an analytic forward dynamics model of a 5-degree-of-freedom robot arm was used. The panels in the first column denote the costs of the planning algorithms applied to a simple task, where the robot arm has to reach for an end effector target and return to the initial state. In the second column (B,D), the robot additionally has to keep the roll angle constant (at π/2). Shown are the mean and the standard deviations for 10 initial states q_0 sampled from a Gaussian with zero mean and a standard deviation of 0.05.
Instead of minimizing costs, the idea is to maximize the probability of receiving a reward event (r_t = 1) at every time step,

p(r_t = 1|q_t, u_t) ∝ exp{−C_t(q_t, u_t)} .   (2)

Note that the idea of turning the cost function in Eq. (1) into a reward signal was also used in operational space control approaches [18], [19].
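As a small numerical illustration of the transformation in Eq. (2), the sketch below shows that the reward likelihood exp{−C} of a toy quadratic cost peaks exactly where the cost is minimal, so maximizing the reward probability and minimizing the cost are equivalent. The scalar action and the cost weights are made up for this example.

import numpy as np

u = np.linspace(-3.0, 3.0, 601)
C = 2.0 * (u - 0.7) ** 2        # toy quadratic cost with minimum at u = 0.7 (assumption)
lik = np.exp(-C)                # unnormalized task likelihood p(r = 1 | u) of Eq. (2)

print("action with minimal cost      :", u[np.argmin(C)])
print("action with maximal likelihood:", u[np.argmax(lik)])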
In the probabilistic framework, we want to compute the posterior over state and control sequences, conditioned on observing a reward at every time step,

p(q_{0:T}, u_{0:T−1} | r_{0:T} = 1) ∝ exp{−C_T(q_T)} π_q(q_{0:T}, u_{0:T−1}) Π_{t=1}^{T−1} p(r_t = 1|q_t, u_t) .
For efficient implementations of this inference problem, a number of algorithms have been proposed that apply iterative policy updates, assuming that all probability distributions can be modeled by an instance of the family of exponential distributions [9], [20], [21]. We will restrict our discussion to the Approximate Inference Control (AICO) algorithm with Gaussians [9].
B.   Approximate inference control with Gaussians (AICO)
We consider system dynamics of the form q_{t+1} = f(q_t, u_t) + ε, with ε denoting zero mean Gaussian noise. In AICO (with Gaussians), the system dynamics are linearized through 1st order Taylor expansions, i.e., P(q_{t+1}|q_t, u_t) = N(q_{t+1} | A_t q_t + a_t + B_t u_t, Q_t), where the state transition matrix A_t, the linear drift term a_t and the control matrix B_t are often computed with derivatives simulated through finite differences methods. The numerical stability of AICO also depends on the accuracy of the linearized model; we will therefore additionally compare to an approximation of the system dynamics, where controls u_t correspond directly to joint accelerations³. We will refer to this approximation as the pseudo-dynamics model.
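As a concrete illustration of the pseudo-dynamics approximation, the sketch below builds linear model matrices A_t, a_t, B_t for an n-joint system whose state stacks joint angles and velocities and whose controls are interpreted as joint accelerations. The Euler-integration layout and the placement of the time step Δt in B are our assumptions for illustration; the paper's footnote only spells out the single-joint case.

import numpy as np

def pseudo_dynamics_model(n_joints, dt):
    # Linear model q_{t+1} = A q_t + a + B u_t for controls u = joint accelerations.
    # State layout [angles, velocities] is an assumption of this sketch.
    I = np.eye(n_joints)
    Z = np.zeros((n_joints, n_joints))
    A = np.block([[I, dt * I],   # angles integrate velocities (Euler step)
                  [Z, I]])       # velocities carry over
    a = np.zeros(2 * n_joints)   # no drift in the pseudo-dynamics model
    B = np.vstack([Z, dt * I])   # accelerations act on the velocity part
    return A, a, B

A, a, B = pseudo_dynamics_model(n_joints=5, dt=0.002)   # 2 ms time step as in the arm task
print(A.shape, a.shape, B.shape)                         # (10, 10) (10,) (10, 5)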
We propose to add a regularization term to the cost function. Before explaining the regularization term in more detail, we briefly discuss how different objectives are implemented in AICO. In the simplest case, the task-likelihood function in (2) can be split into separate state and control dependent terms, i.e.,

p(r_t = 1|q_t, u_t) = N[q_t | r_t, R_t] N[u_t | h_t, H_t] ,   (3)
where, for analytical reasons, the Gaussians are given in canonical form, i.e., N[u_t | h_t, H_t] ∝ exp(−1/2 u_t^T H_t u_t + u_t^T h_t). Note that the vector r_t in (3) denotes the linear term of the Gaussian distribution and must not be confused with the auxiliary variable r_t = 1 in (2) denoting a reward event. By inserting (3) in (2) we obtain the quadratic costs
C_t(q_t, u_t) = q_t^T R_t q_t − 2 r_t^T q_t + u_t^T H_t u_t − 2 h_t^T u_t .   (4)
The state dependent costs, encoded by N[q_t | r_t, R_t], can be defined in configuration space⁴, in task space⁵, or even in combinations of both spaces [16].
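For task space objectives, the canonical parameters R_t and r_t can be obtained by linearizing the kinematics around the current trajectory, in the spirit of the footnote on task space costs (R_t = J^T C J). The sketch below does this for a toy planar 2-link arm; the kinematics function, link lengths and precision matrix C are assumptions made for illustration only.

import numpy as np

def fkin(q, l1=0.5, l2=0.5):
    # planar 2-link forward kinematics (illustrative toy example)
    x = l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1])
    y = l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def jacobian(f, q, eps=1e-6):
    # finite-difference Jacobian, as used for the linearizations in the paper
    J = np.zeros((f(q).size, q.size))
    for i in range(q.size):
        dq = np.zeros_like(q)
        dq[i] = eps
        J[:, i] = (f(q + dq) - f(q - dq)) / (2 * eps)
    return J

def task_space_message(q0, x_target, C):
    # Canonical parameters R_t, r_t of the state cost (x* - x)^T C (x* - x)
    # after linearizing x ≈ f(q0) + J (q - q0); matches the q^T R q - 2 r^T q form of Eq. (4).
    J = jacobian(fkin, q0)
    x_hat = x_target - fkin(q0) + J @ q0
    R = J.T @ C @ J
    r = J.T @ C @ x_hat
    return R, r

q0 = np.array([0.3, 0.6])
R, r = task_space_message(q0, x_target=np.array([0.6, 0.4]), C=1e4 * np.eye(2))
print(R.shape, r.shape)   # (2, 2) (2,)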
On the algorithmic level, AICO combines forward messages and backward messages to compute the belief over trajectories. We represent the Gaussian forward message by N[q_t|s_t, S_t], the backward message by N[q_t|v_t, V_t], and the belief by N[q_t|b_t, B_t]. The recursive update equations are given in [9] and in [10], where an implementation that additionally handles control constraints (otherwise h_t = 0) is given.
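In canonical form, combining the forward message, the backward message and the task message amounts to adding their precision matrices and linear terms; the belief mean is then recovered by a linear solve. This fusion rule follows the Gaussian message passing of [9], but the concrete code below is our own minimal sketch with made-up numbers.

import numpy as np

def fuse_belief(s, S, v, V, r, R):
    # Combine forward N[q|s,S], backward N[q|v,V] and task N[q|r,R] messages.
    # In canonical form the precisions and linear terms simply add.
    B = S + V + R                     # belief precision
    b = s + v + r                     # belief linear term
    q_mode = np.linalg.solve(B, b)    # most likely state, B^{-1} b as in Algorithm 1
    return b, B, q_mode

# toy 2-dimensional example with arbitrary precisions
S = np.diag([5.0, 5.0]);  s = S @ np.array([0.1, 0.0])
V = np.diag([2.0, 1.0]);  v = V @ np.array([0.4, 0.2])
R = np.diag([10.0, 0.0]); r = R @ np.array([0.3, 0.0])
print(fuse_belief(s, S, v, V, r, R)[2])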
We can also compute the most likely action given the task constraints. By doing so, in the case of AICO with Gaussians, we obtain a time varying linear feedback controller

u_t^[n] = o_t + O_t q_t ,   (5)

where o_t is an open loop gain and O_t denotes the feedback gain matrix (n denotes the iteration).
³For a single joint with q = [q, q̇]^T, the matrix A = [[1, Δt], [0, 1]] and a = [0, 0]^T; the control matrix B maps the commanded joint acceleration onto the velocity component of the state.

⁵For task space costs, the task mapping x = f(q) is linearized around the current state q_0 with the Jacobian J = ∂f(q_0)/∂q, giving R_t = J^T C J, where the diagonal elements of the matrix C specify the desired precision in task space.
Algorithm 1: Approximate Inference Control with Regularized Update Steps
1   Input: initial state q_0, parameter α^[0], threshold θ
2   Output: feedback control law o_{0:T−1} and O_{0:T−1}
3   initialize q^[0]_{1:T} = q_0, S_0 = 1e10 · I, s_0 = S_0 q_0, n = 1
4   while not converged do
5       q^[n−1]_{0:T} = q^[n]_{0:T}
6       for t ← 1 to T do
7           linearize model: A_t, a_t, B_t
8           compute: H_t, h_t, R_t, r_t
9           update: s_t, S_t, v_t, V_t, b_t, and B_t
10          if ‖b_t − q^[n]_t‖ > θ then
11              repeat this time step
12              t ← t − 1
13          q^[n]_t = B_t^{−1} b_t
14      for t ← T−1 to 0 do
15          ...same updates as above...
16      for t ← 0 to T−1 do
17          compute feedback controller: o_t, O_t
18          u^[n]_t = o_t + O_t q_t
19          q^[n]_{t+1} = A_t q^[n]_t + a_t + B_t u^[n]_t
20      n = n + 1
21      α^[n] = γ α^[n−1]
22  return o_{0:T−1} and O_{0:T−1}
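The following Python skeleton mirrors the control flow of Algorithm 1: an outer loop that grows the regularization parameter geometrically (α^[n] = γ α^[n−1]) and a per-time-step update that is repeated whenever the new belief mean jumps further than the threshold θ from the current trajectory. The belief computation is a simplified stand-in for the AICO message equations of [9], [10], so only the loop structure, not the numerics, corresponds to the paper.

import numpy as np

def regularized_aico_outline(q0, T, alpha0=1.0, gamma=1.0, theta=0.1, max_iters=100):
    # Structural outline of Algorithm 1 on a toy problem (stand-in belief update).
    dim = q0.size
    q_traj = np.tile(q0, (T + 1, 1))           # line 3: initialize with the start state
    alpha = alpha0

    def belief(q_prev_iter, alpha):
        # stand-in belief: pull towards a fixed target, regularized towards
        # the previous iteration's state (cf. Sec. II-D)
        target = np.ones(dim)
        B = (1.0 + alpha) * np.eye(dim)        # belief precision
        b = target + alpha * q_prev_iter       # belief linear term
        return b, B

    for _ in range(max_iters):
        q_old = q_traj.copy()                  # line 5: remember the previous trajectory
        for t in range(1, T + 1):              # lines 6-13: sweep over time steps
            while True:
                b, B = belief(q_old[t], alpha)
                q_new = np.linalg.solve(B, b)  # line 13: q_t = B_t^{-1} b_t
                accept = np.linalg.norm(q_new - q_traj[t]) <= theta
                q_traj[t] = q_new
                if accept:
                    break                      # lines 10-12: otherwise repeat this time step
        alpha *= gamma                         # line 21: alpha^[n] = gamma * alpha^[n-1]
        if np.allclose(q_traj, q_old, atol=1e-8):
            break                              # converged
    return q_traj

print(regularized_aico_outline(np.zeros(2), T=10)[-1])   # approaches the toy target [1, 1]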
C.   Evaluation of the convergence properties of AICO
To investigate the convergence properties of AICO, we use a simulated light-weight robot arm [22] with five joints. The robot has to reach a desired end effector position in Cartesian space and subsequently has to return to its initial pose. To increase the complexity, we define a second task, where the robot should additionally keep the roll angle of the end effector constant. For this task, we used the cost function
C_t(q_t, u_t) = { 10^4 (x*_i − x_t)^T (x*_i − x_t)   if t = T_i
                  10^{−2} u^T u                       else } ,   (6)

where x*_i denotes the desired robot postures in task space at times T_1 = 500 and T_2 = 10^3 (the planning horizon is 2 seconds with a time step of 2 ms), with x*_1 = [1, 0.4, 0, 0, π/2, 0]^T and x*_2 = [1, 0, 0, 0, π/2, 0]^T.
Note that we do not assume any initial solution to initialize the planner; solely the initial posture of the robot in configuration space is used as the initial trajectory. An example movement is shown in Figure 1.
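A direct transcription of the cost function (6) might look as follows. The weights, via-point times and target vectors are taken from the reconstructed equation above, so treat the exact constants as our reading of the (partially garbled) original rather than as authoritative values.

import numpy as np

# Via-point times and desired task space postures of the reaching task (Eq. (6)).
T1, T2 = 500, 1000
targets = {
    T1: np.array([1.0, 0.4, 0.0, 0.0, np.pi / 2, 0.0]),
    T2: np.array([1.0, 0.0, 0.0, 0.0, np.pi / 2, 0.0]),
}

def cost(t, x_t, u_t):
    # Cost C_t of Eq. (6): task error at the via-points, control effort otherwise.
    if t in targets:
        err = targets[t] - x_t
        return 1e4 * err @ err
    return 1e-2 * u_t @ u_t

print(cost(500, np.zeros(6), np.zeros(5)))   # task cost at the first via-point
print(cost(250, np.zeros(6), np.ones(5)))    # control cost in between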
Using the pseudo-dynamics approximation of the system dynamics, the convergence rates of the costs per iteration for both tasks are shown in Figure 2A,B. For the simple task in Figure 2A the inferred cost values converge fast for all algorithms, with the standard AICO algorithm showing the best performance. However, the fast convergence also comes at the cost of a reduced robustness of the policy update, as can be seen from the results in the second scenario illustrated in Figure 2B, where AICO is unstable and cannot infer solutions with low costs. When we used the analytic forward dynamics model (where the linearizations are computed through finite differences) instead of the pseudo-dynamics model, computing the messages in AICO became numerically unstable and no solutions could be inferred. Therefore, the panels in Figure 2C,D do not include results of AICO. We also evaluated the iLQG method [7] that implements an adaptive regularization schedule and line search to prevent divergence [8]. While the iLQG algorithm performed well for the pseudo-dynamics model, the learning rate was automatically decreased to almost zero for the analytical dynamics model. Our regularization method for AICO, which we will present in the next section, considerably outperformed both competing methods.

Fig. 3. Reaching task with the humanoid robot Nao. The robot has to reach a desired end effector position with the right arm while maintaining balance. Eight snapshots of the inferred movement are shown in (A). In (B), the convergence of the costs of the optimization procedure is shown, where we compare iLQG, the standard implementation of AICO and the regularized variant. Shown are the mean and the standard deviations for 10 initial states q_0 sampled from a Gaussian with zero mean and a standard deviation of 0.05. The movement objectives for the right arm are shown in the left panel in (C). To counter balance, the robot moves its left hand and the head.
D.   Regulating the policy updates in AICO
To regularize the policy update steps in (1), we add an additional cost term to the task-likelihood function, i.e.,

p(r_t = 1 | q_t^[n], u_t^[n]) ∝ exp{ −C_t(q_t^[n], u_t^[n]) − α^[n] (q_t^[n] − q_t^[n−1])^T (q_t^[n] − q_t^[n−1]) } ,

which punishes the distance between the state trajectories of two successive iterations of the algorithm (n denotes the iteration). The parameter α controls the size of the update step. For large α, the trajectory update will be conservative, as the algorithm will stay close to the previous trajectory that has been used for linearization. For small α values, the new trajectory will directly jump to the LQG solution given the linearized dynamics and the approximated costs. Hence, α is inversely proportional to the step size. The value of α is updated after each iteration according to α^[n] = γ α^[n−1]. For α^[0] ≥ 1 and γ > 1, convergence is guaranteed as the regularization term will dominate with an increasing number of iterations.
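Because the added term is quadratic in q_t^[n], it can be folded directly into the canonical Gaussian state message of Eqs. (3)/(4): expanding α (q − q^[n−1])^T (q − q^[n−1]) adds α I to the precision R_t and α q_t^[n−1] to the linear term r_t. The helper below sketches this bookkeeping; it is our own reading of how the regularizer enters the updates, not code from the paper.

import numpy as np

def regularize_task_message(R_t, r_t, q_prev, alpha):
    # Add the step-size penalty alpha * ||q - q_prev||^2 to a canonical Gaussian
    # state message with cost form q^T R q - 2 r^T q (Eq. (4)):
    #   alpha * (q - q_prev)^T (q - q_prev) = alpha * q^T q - 2 * alpha * q_prev^T q + const,
    # hence R <- R + alpha * I and r <- r + alpha * q_prev.
    dim = q_prev.size
    return R_t + alpha * np.eye(dim), r_t + alpha * q_prev

# toy check: with only the regularizer active, the belief mode equals q_prev
R0, r0 = np.zeros((3, 3)), np.zeros(3)
q_prev = np.array([0.2, -0.1, 0.4])
R, r = regularize_task_message(R0, r0, q_prev, alpha=10.0)
print(np.linalg.solve(R, r))   # -> [ 0.2 -0.1  0.4]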
The algorithm is listed in Algorithm 1. An interesting feature of this algorithm is that no learning rate is needed, as α is used directly to implement a step size. In the original formulation of AICO, the learning rate is either applied to the state update (in Line 13 in Algorithm 1) [9] or to the feedback controller (in Line 18 in Algorithm 1) [10]. However, neither implementation can guarantee convergence in nonlinear systems or in tasks with costs inducing a nonlinear mapping from Q to X.

We evaluate the resulting algorithm on the same robot arm reaching tasks. For both tasks, the Cartesian planning task in Figure 2A,C and the extension with the additional roll angle objective in Figure 2B,D, we evaluated AICO with the regularization parameter α ∈ {1, 10} (we did not increase α, i.e., γ = 1). For both models of the system dynamics, the pseudo-dynamics approximation (shown in Figure 2A,B) and the analytic model (illustrated in Figure 2C,D), AICO benefits from the regularization term and the costs decay exponentially fast. Interestingly, without good initial solutions, the differential dynamic programming method iLQG [8], which implements a sophisticated regularization scheme, cannot generate movement policies with low costs when using the analytic model. This is shown in Figure 2C,D.
III.   RESULTS
We evaluated the proposed planning method in simulation with the humanoid robot Nao. The Nao robot has 25 degrees-of-freedom. In first experiments, we investigated the performance of the planner with a pseudo-dynamics model of the robot.

Fig. 4. Balancing task with the humanoid robot Nao. The robot should swing its hips, which is encoded by adding an offset scalar to the x-coordinate of the center of gravity vector. In (A), 10 snapshots of the resulting movement for an increasing planning horizon are shown for α = 1. The convergence properties of iLQG, the standard AICO and its regularized variants are shown in (B). Shown are the mean and the standard deviations for 10 initial states q_0 sampled from a Gaussian with zero mean and a standard deviation of 0.05. In (C) the x-coordinate of the center of gravity of the Nao is illustrated. The large dots denote the objectives.

The humanoid had to reach for an end effector target using the right arm while maintaining balance. In a second experiment, Nao had to shift the x-coordinate of the center of gravity while maintaining balance.
A.   Arm reaching with a humanoid robot
The humanoid has to reach for an end effector target while the center of gravity is kept at its initial position, i.e., x_CoG = x_CoG(t = 0). The same cost function as in the experiments for the light-weight robot arm in (6) is used. For this task, however, only a single via-point is defined that is used for the desired end effector target and the center of gravity, i.e., x*_1 = [x*^T, x_CoG^T]^T. Only by specifying two scalars in x