Open AccessArticle

Automatic Tempered Posterior Distributions for Bayesian Inversion Problems

Luca Martino

^1,*

Fernando Llorente

Ernesto Curbelo

²,

Javier López-Santiago

and

Joaquín Míguez

Department of Signal Processing, Universidad rey Juan Carlos (URJC), 28942 Madrid, Spain

Department of Statistics, Universidad Carlos III de Madrid (UC3M), 28911 Madrid, Spain

Department of Signal Processing, Universidad Carlos III de Madrid (UC3M), 28911 Madrid, Spain

Author to whom correspondence should be addressed.

Mathematics 2021, 9(7), 784; https://doi.org/10.3390/math9070784

Submission received: 27 February 2021 / Revised: 26 March 2021 / Accepted: 1 April 2021 / Published: 6 April 2021

(This article belongs to the Special Issue Recent Advances in Data Science)

Download

Browse Figures

Versions Notes

Abstract

We propose a novel adaptive importance sampling scheme for Bayesian inversion problems where the inference of the variables of interest and the power of the data noise are carried out using distinct (but interacting) methods. More specifically, we consider a Bayesian analysis for the variables of interest (i.e., the parameters of the model to invert), whereas we employ a maximum likelihood approach for the estimation of the noise power. The whole technique is implemented by means of an iterative procedure with alternating sampling and optimization steps. Moreover, the noise power is also used as a tempered parameter for the posterior distribution of the the variables of interest. Therefore, a sequence of tempered posterior densities is generated, where the tempered parameter is automatically selected according to the current estimate of the noise power. A complete Bayesian study over the model parameters and the scale parameter can also be performed. Numerical experiments show the benefits of the proposed approach.

Keywords:

Bayesian inference; importance sampling; MCMC; inversion problems

1. Introduction

The estimation of unknown parameters from noisy observations is an essential problem in signal processing, statistics, and machine learning [1,2,3]. Within the Bayesian signal processing framework, these problems are addressed by constructing posterior probability distributions of the unknowns. Given the posterior, one often wants to make inference about the unknowns, e.g., if we are estimating parameters, finding the values that maximize their posterior, or the values that minimize some cost function given the uncertainty of the parameters. Unfortunately, obtaining closed-form solutions, usually expressed as integrals of the posterior, is infeasible in most practical applications. Therefore, developing approximate computational techniques, such as importance sampling and Markov chain Monte Carlo (MCMC) algorithms, is often required [4,5,6].

The so-called tempering of the posterior distributions is a well-known procedure for improving the performance of Monte Carlo (MC) algorithms [7,8,9,10]. Tempering is performed by modulating an artificial scale parameter or by sequentially including new data. There are several reasons for the improvement in performance: improving mixing, discovering modes, fostering the exploration of the inference space, etc. In the first iterations of the MC scheme, a posterior density with a bigger scale is considered. The artificial scale parameter (often called temperature) is reduced along the iterations, until considering the true posterior distribution. However, the user should select a temperature schedule, i.e., a decreasing rule for the scale parameter, which is usually chosen in an heuristic way [4,5]. In the literature, the tempering procedure has gained particular attention for the estimation of the marginal likelihood (also known as Bayesian model evidence) [9,11,12].

Furthermore, the joint inference of parameters (denoted as

θ

) of observation models,

f (θ)

, and scale parameters of the likelihood function (that, in the scalar case, is usually denoted as

σ

) can be a hard task. Indeed, “wrong choices” of

σ

values can easily jeopardize the sampling of

θ

. In this work, we introduce a procedure to tackle this problem.

To be specific, in this work, we design an adaptive importance sampling (AIS) scheme [13] for Bayesian inversion problems, where an automatic tempering procedure is implemented. We assume that the vector of observations

y

is obtained by a nonlinear transformation

f (θ)

of the variables of interest

θ

, perturbed by additive Gaussian noise with unknown power

σ^{2}

. The nonlinear mapping

f (θ)

usually represents a complex physical model, a computer code, etc. The resulting posterior densities are usually highly multimodal and complex distributions. Furthermore, the inference task in the joint space

[θ, σ]

is particularly challenging. We introduce a split strategy to tackle this problem, involving an optimization approach over

σ

and a sampling scheme for

θ

. More specifically, we design an iterative procedure where these two tasks are alternated. Additionally, the current maximum likelihood (ML) estimate of the noise power,

{\hat{σ}}_{ML}^{2}

, is employed as a tempering parameter, starting from high values and then “cooling down” according to the ML estimates at each iteration. Therefore, the proposed scheme deals with a sequence of tempered posteriors according to the current estimation

{\hat{σ}}_{ML}^{2}

. It is important to observe that, given a fixed vector

θ

, the ML estimator

{\hat{σ}}_{ML}^{2}

can be obtained analytically.

Furthermore, the complete Bayesian analysis regarding the joint posterior of

θ

and

σ

is also possible (as discussed in Section 5). This is obtained by implementing a proper re-weighting of the samples generated by the proposed algorithm, called Automatic Tempering AIS (ATAIS), without any additional evaluations of the observation model. An approximation of the marginal posterior of

σ

is provided as well. The advantages of the proposed scheme are shown in two numerical experiments, one of them considering a complex astronomical model.

2. Problem Statement

Let us denote the observed measurements as

y = {[y_{1}, \dots, y_{K}]}^{⊤} \in R^{K}

, and the variable of interest that we wish to infer as

θ = {[θ_{1}, \dots, θ_{M}]}^{⊤} \in Θ \subseteq R^{M}

. Furthermore, let us assume the observation model

y = f (θ) + v,

(1)

where we have a nonlinear mapping,

\begin{matrix} f (θ) = {[f_{1} (θ), \dots, f_{K} (θ)]}^{⊤} : Θ \to R^{K} with Θ \subseteq R^{M}, \end{matrix}

(2)

and a Gaussian perturbation noise,

\begin{matrix} v = {[v_{1}, \dots, v_{K}]}^{⊤} \sim N (v | 0, σ^{2} I_{K}), \end{matrix}

(3)

with

σ > 0

, and

I_{K}

denotes the K-dimensional identity matrix. The model can be easily extended to a matrix of observations

Y = {[y_{1}, \dots, y_{K}]}^{⊤} \in R^{d_{y} \times K}

instead of a vector, if the nonlinear mapping is of type

F (θ) = {[f_{1} (θ), \dots, f_{K} (θ)]}^{⊤} : Θ \subseteq R^{M} \to R^{d_{y} \times K}

. The noise variance

σ^{2}

is unknown, in general. The mapping

f (θ)

may be analytically unknown: the only assumption is that we are able to evaluate it pointwise. The likelihood function is

\begin{matrix} ℓ (y | θ, σ) & = & \frac{1}{{(2 π σ^{2})}^{K / 2}} exp (- \frac{1}{2 σ^{2}} | | y - {f (θ) | |}^{2}), \end{matrix}

(4)

\begin{matrix} = & \frac{1}{{(2 π σ^{2})}^{K / 2}} exp (- \frac{1}{2 σ^{2}} \sum_{k = 1}^{K} {(y_{k} - f_{k} (θ))}^{2}) . \end{matrix}

(5)

Note that we have two types of variables of interest: the vector

θ

contains the parameters of the nonlinear mapping

f (θ)

, whereas

σ

is a scale parameter of the likelihood function.

Given the vector of measurements

y

, we wish to make inferences regarding the hidden parameters

θ

and the noise power

σ^{2}

, obtaining at least some point estimators

\hat{θ}

and

{\hat{σ}}^{2}

. We are also interested in performing uncertainty and correlation analysis for the components of

θ

. Furthermore, we aim to perform model selection, i.e., to compare, select, or properly average different models.

Bayesian inference in the complete space. We consider independent prior densities

g_{θ} (θ)

and

g_{σ} (σ)

over the unknowns. Therefore, the complete posterior density is

\begin{matrix} p (θ, σ | y) = \frac{1}{p (y)} p (θ, σ, y) = \frac{1}{p (y)} ℓ (y | θ, σ) g_{θ} (θ) g_{σ} (σ), \end{matrix}

(6)

The marginal likelihood is

\begin{matrix} Z = p (y) = \int_{R^{+}} \int_{Θ} ℓ (y | θ, σ) g_{θ} (θ) g_{σ} (σ) d θ d σ, \end{matrix}

(7)

This quantity is often needed for model selection. While Z is generally unknown, we can usually evaluate pointwise the un-normalized posterior

π (θ, σ | y) = ℓ (y | θ, σ) g_{θ} (θ) g_{σ} (σ)

, i.e.,

p (θ, σ | y) \propto π (θ, σ | y)

. More generally, the computation of integrals of the form

I (h) = \int_{R^{+}} \int_{Θ} h (θ, σ) p (θ, σ | y) d θ d σ,

(8)

where

h : Θ \times R^{+} \to R

is an integrable function, is usually required. We consider a Monte Carlo quadrature approach for approximating the integral above and, more generally, provide a particle approximation of the joint posterior

p (θ, σ | y)

Main observation. Generating random samples from a complicated posterior in Equation (6) and efficiently computing the integrals as in Equations (7) and (8) is very often a hard task. Moreover, this task becomes more difficult when we try to perform a joint inference where scale parameters are involved, i.e.,

σ

, and parameters of the nonlinearity, i.e.,

θ

. Indeed, “wrong choices” of

σ

values can easily jeopardize the sampling of

θ

. In the next section, we describe a strategy that we propose to tackle this problem. Before do so, however, we need to recall some additional definitions.

Conditional and marginal posteriors. In other to design efficient computational schemes, it is often useful to consider the conditional posteriors, for instance,

\begin{matrix} p (θ | y, σ) = \frac{p (θ, y, σ)}{p (y, σ)} & = \frac{ℓ (y | θ, σ) g_{θ} (θ) g_{σ} (σ)}{p (y | σ) g_{σ} (σ)}, \\ = \frac{ℓ (y | σ, θ) g_{θ} (θ)}{p (y | σ)} . \end{matrix}

(9)

In the next section, we will see that the idea underlying the proposed scheme is to split the space

[θ, σ]

, restricting the sampling problem only to

θ

and considering an optimization problem with respect to

σ

. The conditional marginal likelihood is obtained by integrating out one of the two variables, e.g.,

\begin{matrix} \begin{matrix} Z (σ) = p (y | σ) = \int_{θ} ℓ (y | θ, σ) g_{θ} (θ) d θ . \end{matrix} \end{matrix}

(10)

The integral above cannot be computed analytically, in general. We can also consider marginal posteriors, for instance, the marginal posterior of

σ

\begin{matrix} p (σ | y) = \frac{p (y | σ) g_{σ} (σ)}{p (y)} = \frac{Z (σ) g_{σ} (σ)}{Z} . \end{matrix}

(11)

Note that the joint posterior in Equation (6) can be also written as

\begin{matrix} p (θ, σ | y) & = p (θ | y, σ) p (σ | y) . \end{matrix}

(12)

Outline of the proposed approach. The underlying idea of this work is to divide the inference study in two parts. In the first part (Section 3 and Section 4), we focus on the study of the conditional posterior

p (θ | y, σ)

given a fixed

σ

. Then, in the second part (Section 5), we also estimate the marginal posterior

p (σ | y)

. Finally, using (12), we can obtain a final approximation of the complete posterior

p (θ, σ | y)

. Estimations of

Z (σ)

and Z are also obtained.

3. Key Observations and Proposed Approach

3.1. Split Inference

In the first part of work, we assume a uniform proper (or improper) prior over

θ

, i.e.,

g_{θ} (θ) \propto 1

Θ

. The possible use of a general choice of

g_{θ} (θ)

is discussed in Section 4.1. Let

θ_{MAP} = arg {max}_{θ} p (θ | y, σ)

denote the MAP estimator of

θ

. Generally,

θ_{MAP}

should be a function of

σ

, i.e.,

θ_{MAP} = θ_{MAP} (σ)

. However, due to the choice of likelihood function (and the uniform prior) considered in this paper, we have that

\begin{matrix} θ_{MAP} & = arg max_{θ} log p (θ | y, σ), \\ = arg min_{θ} | | y - {f (θ) | |}^{2}, with g_{θ} (θ) \propto 1, for θ \in Θ, \end{matrix}

which does not depend on

σ

, i.e., we have that

θ_{MAP}

maximizes the conditional posterior

p (θ | y, σ)

for any

σ

. See Appendix A for further details.

Furthermore, the variance of the conditional posterior

p (θ | y, σ)

grows when

σ

increases. In this sense, with larger

σ

, the density

p (θ | y, σ)

is “broader”; thus, it is easier for Monte Carlo methods to explore the space (namely, we have a tempering effect). Based on these considerations, we can run Monte Carlo schemes (specifically IS algorithms) on

p (θ | y, σ_{0})

with a large value

σ_{0}

for estimating

θ_{MAP}

more efficiently. Furthermore, apart from estimating

θ_{MAP}

, we are also interested in studying the conditional posterior

p (θ | y, σ_{ML})

, where

σ_{ML} = arg max_{σ} ℓ (y | θ_{MAP}, σ) .

The value

σ_{ML}

can be obtained in closed-form (see Appendix A). In fact, for any

θ

, we have

\begin{matrix} ℓ (y | θ, σ) & \propto & {(\frac{1}{σ^{2}})}^{\frac{K}{2}} exp (- \frac{| | y - {f (θ) | |}^{2}}{2 σ^{2}}), \end{matrix}

(13)

which has the form of an Inverse Gamma density for

σ^{2}

and it has a unique mode at

\sqrt{\frac{1}{K} | | y - {f (θ) | |}^{2}}

, where

θ

is a fixed. Therefore, finally we have

σ_{ML} = \sqrt{\frac{1}{K} | | y - f (θ_{MAP}) {| |}^{2}} .

This can serve as a point estimator of the noise power in the system, and also as a threshold value to stop the tempering of the conditional posterior, as we show in the following section.

3.2. An Iterative Scheme

Consider that we start with a large value

σ_{0}

, which can be viewed as a coarse approximation of

σ_{ML}

, so we denote it

σ_{0} = {\hat{σ}}_{ML}^{(0)}

. Let

{\hat{θ}}_{MAP}^{(1)}

denote an estimate of

θ_{MAP}

obtained by working w.r.t.

p (θ | y, {\hat{σ}}_{ML}^{(0)})

. We use this current estimation to obtain the next value of

σ

, i.e.,

{\hat{σ}}_{ML}^{(1)} = \sqrt{\frac{1}{K} | | y - f ({\hat{θ}}_{MAP}^{(1)}) {| |}^{2}}

. In general,

{\hat{σ}}_{ML}^{(1)}

is a better estimator of

σ_{ML}

than

{\hat{σ}}_{ML}^{(0)}

, as we have tried to evaluate of the smallest error between

f (θ)

and the data,

y

, i.e.,

| | y - f (θ) | |

, which is related to the power of the noise perturbation in the system. For instance, assuming zero noise, we would have

| | y - f (θ_{MAP}) | | = 0

, recalling that

g_{θ} (θ) \propto 1

for

θ \in Θ

. We can iterate this procedure for

t = 1, \dots, T

1: Estimate ${\hat{θ}}_{MAP}^{(t)}$ by Monte Carlo (e.g., an IS scheme) by approximately maximizing $p (θ | y, {\hat{σ}}_{ML}^{(t - 1)})$ .
2: Compute

${\hat{σ}}_{ML}^{(t)} = \sqrt{\frac{1}{K} | | y - f ({\hat{θ}}_{MAP}^{(t)}) {| |}^{2}} .$

(14)

With this iterative scheme, we have that

{\hat{σ}}_{ML}^{(T)} \to σ_{ML}

as T grows, thus we eventually perform IS with respect to the density of interest

p (θ | y, σ_{ML})

. Furthermore, a non-increasing sequence of values

{\hat{σ}}_{ML}^{(0)} \geq {\hat{σ}}_{ML}^{(1)} \geq \dots \geq {\hat{σ}}_{ML}^{(T)}

is produced, which facilitates the estimation of

θ_{MAP}

, and ensures the IS estimation of

p (θ | y, σ_{ML})

is performed efficiently by using the set of intermediate, tempered (i.e., wider) distributions

p (θ | y, {\hat{σ}}_{ML}^{(t)})

for

t = 0, 1, \dots, T

. Finally, a particle approximation of

p (θ | y, {\hat{σ}}_{ML}^{(T)})

is obtained, i.e.,

p (θ | y, {\hat{σ}}_{ML}^{(T)}) = \sum_{t = 1}^{T} \sum_{n = 1}^{N} {\tilde{w}}_{t}^{(n)} δ (θ - θ_{t}^{(n)}),

where

\sum_{t = 1}^{T} \sum_{n = 1}^{N} {\tilde{w}}_{t}^{(n)} = 1

. Note that

{\tilde{w}}_{t}^{(n)}

are the final corrected weights obtained at the end of the algorithm (see Algorithm 1).

4. Automatic Tempering Adaptive Importance Sampling (ATAIS)

In this section, we describe an adaptive importance sampler with an automatic tempering approach which follows the procedure given above. At each iteration t of the algorithm, we have an ML approximation of

σ

, i.e.,

{\hat{σ}}_{ML}^{(t - 1)}

. Considering Equation (9), we define the un-normalized tempered conditional posterior at the t-th iteration,

\begin{matrix} π_{t} (θ) = ℓ (y | θ, {\hat{σ}}_{ML}^{(t - 1)}) g_{θ} (θ), \end{matrix}

(15)

where we assume

g_{θ} (θ) \propto 1

Θ

. For other generic choice of

g_{θ} (θ)

, see the discussion in Section 4.1. At each iteration, we consider

p (θ | y, {\hat{σ}}_{ML}^{(t - 1)}) \propto π_{t} (θ)

as the target distribution. The dependence on the iteration t occurs because

{\hat{σ}}_{ML}^{(t)}

varies with t. The ATAIS algorithm is outlined in Algorithm 1. The resulting scheme is an adaptive IS algorithm which combines sampling schemes and stochastic optimization. It is important to remark that if

{\hat{σ}}_{ML}^{(0)}

is bigger than the true ML value, we generate a non-increasing sequence of

{\hat{σ}}_{ML}^{(t)}

, i.e.,

{\hat{σ}}_{ML}^{(0)} \geq {\hat{σ}}_{ML}^{(1)} \geq \dots {\hat{σ}}_{ML}^{(t)} \geq {\hat{σ}}_{ML}^{(t + 1)}

, etc. Note that this is true as we have assumed a uniform prior

g_{θ} (θ)

. To see this, recall that

{\hat{σ}}_{ML} = \sqrt{\frac{1}{K} | | y - f ({\hat{θ}}_{MAP}) {| |}^{2}}

. Improving

{\hat{θ}}_{MAP}

means that the squared error

| | y - f ({\hat{θ}}_{MAP}) {| |}^{2}

is smaller, as shown in Equation (14), which implies that

{\hat{σ}}_{ML}

always decreases (provided that we start with

{\hat{σ}}_{ML} > σ_{ML}

IS steps. A set of N samples

{θ_{t}^{(n)}}_{n = 1}^{N}

are drawn from a (normalized) proposal density

q (θ | μ_{t}, Σ_{t})

with mean

μ_{t}

and a covariance matrix

Σ_{t}

. An importance weight

w_{t}^{(n)} = \frac{π_{t} (θ_{t}^{(n)})}{q (θ_{t}^{(n)} | μ_{t}, Σ_{t})},

is assigned to each sample.

Proposal adaptation. A particle estimation of the conditional MAP estimator of

θ

is given by

{\hat{θ}}_{t} = arg max_{n} π_{t} (θ_{t}^{(n)})

. The value of current MAP approximation

π_{t} ({\hat{θ}}_{t})

is then compared with the value of global MAP estimator obtained so far denoted as

π_{MAP}

. If

π_{t} ({\hat{θ}}_{t}) \geq π_{MAP}

, all the global MAP estimators are updated and the proposal pdf is moved at

{\hat{θ}}_{t}

, i.e., we set

\begin{matrix} {\hat{θ}}_{MAP}^{(t)} = {\hat{θ}}_{t}, π_{MAP} = π_{t} ({\hat{θ}}_{t}), μ_{t} = {\hat{θ}}_{t} . \end{matrix}

(16)

Algorithm 1: ATAIS: AIS with automatic tempering.

Initializations: Choose N, $μ_{1}$ , $Σ_{1}$ , and obtain an initialization for ${\hat{σ}}_{ML}^{(0)}$ , and set $π_{MAP} = 0$ .
For $t = 1, \dots, T$ :
(a)
Sampling:
i.
Draw $θ_{t}^{(1)}, \dots, θ_{t}^{(N)} \sim q (θ | μ_{t}, Σ_{t})$ .
ii.
Assign to each sample the weights

$\begin{matrix} w_{t}^{(n)} = \frac{π_{t} (θ_{t}^{(n)})}{q (θ_{t}^{(n)} | μ_{t}, Σ_{t})}, n = 1, \dots, N . \end{matrix}$

(17)

(b)
Current maximum estimations:
i.
Obtain ${\hat{θ}}_{t} = arg max_{n} π_{t} (θ_{t}^{(n)})$ , and compute ${\hat{r}}_{t} = f ({\hat{θ}}_{t})$
ii.
Compute ${\hat{σ}}_{t} = \sqrt{\frac{1}{K} | | y - {\hat{r}}_{t} {| |}^{2}}$ .
(c)
Global maximum estimations:
i.
If ${\hat{σ}}_{t} \leq {\hat{σ}}_{ML}^{(t - 1)}$ , then set ${\hat{σ}}_{ML}^{(t)} = {\hat{σ}}_{t}$ . Otherwise, set ${\hat{σ}}_{ML}^{(t)} = {\hat{σ}}_{ML}^{(t - 1)}$ .
ii.
If $π_{t} ({\hat{θ}}_{t}) \geq π_{MAP}$ , then set ${\hat{θ}}_{MAP}^{(t)} = {\hat{θ}}_{t}$ and $π_{MAP} = π_{t} ({\hat{θ}}_{t})$ . Otherwise, ${\hat{θ}}_{MAP}^{(t)} = {\hat{θ}}_{MAP}^{(t - 1)}$ and keep the value of $π_{MAP}$ .
(d)
Adaptation: Set

$\begin{matrix} μ_{t} & = & {\hat{θ}}_{MAP}^{(t)}, \end{matrix}$

(18)

$\begin{matrix} Σ_{t} & = & \sum_{n = 1}^{N} {\bar{w}}_{t}^{(n)} {(θ_{t}^{(n)} - {\bar{θ}}_{t})}^{⊤} (θ_{t}^{(n)} - {\bar{θ}}_{t}) + ϵ I_{M}, \end{matrix}$

(19)

where ${\bar{w}}_{t}^{(n)} \frac{w_{t}^{(n)}}{\sum_{i = 1}^{N} w_{t}^{(i)}}$ are the normalized weights, ${\bar{θ}}_{t} = \sum_{n = 1}^{N} {\bar{w}}_{t}^{(n)} θ_{t}^{(n)}$ and $ϵ > 0$ is a small scalar value .
Output: Return the final estimators ${\hat{θ}}_{MAP}^{(T)}$ , ${\hat{σ}}_{ML}^{(T)}$ , and all the weighted samples ${θ_{t}^{(n)}, {\tilde{w}}_{t}^{(n)}}$ , for all t and n, with the corrected weights

${\tilde{w}}_{t}^{(n)} = w_{t}^{(n)} \frac{π_{T + 1} (θ_{t}^{(n)})}{π_{t} (θ_{t}^{(n)})} .$

(20)

Otherwise, we keep the previous values of

{\hat{θ}}_{MAP}^{(t)} = {\hat{θ}}_{MAP}^{(t - 1)}

π_{MAP}

, and

μ_{t} = μ_{t - 1}

. The covariance matrix

Σ_{t}

is adapted by considering the empirical covariance of the weighted samples. Note that we set

μ_{t} = {\hat{θ}}_{MAP}^{(t)}

instead of using the empirical mean of the samples (as in other classical AIS schemes). This is because we have noticed that this choice provides better and more robust results, especially as the dimension of the problem grows.

Automatic tempering. As we showed in the previous section, the current ML estimator of

σ

can be obtained analytically as

{\hat{σ}}_{t} = \sqrt{\frac{1}{K} | | y - {\hat{r}}_{t} {| |}^{2}},

(21)

where

{\hat{r}}_{t} = f ({\hat{θ}}_{t})

. If the current ML estimator

{\hat{σ}}_{t}

is smaller than the current global one

{\hat{σ}}_{ML}^{(t - 1)}

, i.e.,

{\hat{σ}}_{t} < {\hat{σ}}_{ML}^{(t - 1)}

, then we update

{\hat{σ}}_{ML}^{(t)} = {\hat{σ}}_{t}

, Otherwise, we keep the value of

{\hat{σ}}_{ML}^{(t)} = {\hat{σ}}_{ML}^{(t - 1)}

. Actually, with a uniform prior

g_{θ} (θ)

, every time that we update

{\hat{θ}}_{MAP}^{(t)}

, we also update

{\hat{σ}}_{ML}^{(t)}

(see footnote in the previous page).

ATAIS outputs. After T iterations, a final correction of the weights is needed, i.e.,

{\tilde{w}}_{t}^{(n)} = w_{t}^{(n)} \frac{π_{T + 1} (θ_{t}^{(n)})}{π_{t} (θ_{t}^{(n)})},

(22)

in order to obtain a particle approximation of the measure of the final conditional posterior

p (θ | y, {\hat{σ}}_{ML}^{(T)}) \propto π_{T + 1} (θ)

. Thus, the algorithm returns the final estimators

{\hat{θ}}_{MAP}^{(T)}

{\hat{σ}}_{ML}^{(T)}

, and all the weighted samples

{θ_{t}^{(n)}, {\tilde{w}}_{t}^{(n)}}

, for all

n = 1, \dots, N

and

t = 1, \dots, T

. Other outputs can be obtained with a postprocessing of the weighted samples, as shown below. Note that Equation (22) does not require any additional evaluation of the model, and the error is

e_{t}^{(n)} = | | y - f (θ_{t}^{(n)}) {| |}^{2}

. Moreover, we can also use

e_{t}^{(n)}

and

{θ_{t}^{(n)}}

for building a particle approximation of any other conditional posterior

p (θ | y, σ)

. This allows the study of the marginal posterior

p (σ | y)

and provides the complete Bayesian inference, as we show in the next section.

4.1. With a Generic Prior $g_{θ} (θ)$

The ATAIS algorithm is based on the fact that

θ_{MAP}

does not depend on

σ

. This allows us to progressively estimate it by targeting the sequence of tempered posteriors

π_{t} (θ) \propto p (θ | y, {\hat{σ}}_{ML}^{(t)})

that share all the same MAP. However, in the case

g_{θ} (θ)

is not uniform, we generally have one

θ_{MAP} (σ)

for each

p (θ | y, σ)

, and we could have that the sequence of

θ_{MAP} ({\hat{σ}}_{ML}^{(t)})

will not approach

θ_{MAP} (σ_{ML})

If the data are informative and the prior

g_{θ} (θ)

is chosen such it is vague with respect to the likelihood, the position of

θ_{MAP} (σ)

is not very sensitive to the value of

σ

. Namely, we have

θ_{MAP} ({\hat{σ}}_{ML}^{(1)}) \approx θ_{MAP} ({\hat{σ}}_{ML}^{(2)}) \approx \dots \approx θ_{MAP} (σ_{ML})

, and thus our algorithm can be applied in this context. When the data are not informative, we should use an even more vague prior (i.e., wider than the likelihood function) in order to maintain the usefulness of the algorithm.

5. Complete Bayesian Inference with ATAIS

Let us assume we have a proper prior

g_{θ} (θ)

and we introduce another proper prior

g_{σ} (σ)

for

σ

. The outputs of the ATAIS algorithm can serve to approximate the normalizing constant of the joint posterior

p (θ, σ | y) \propto ℓ (y | θ, σ) g_{θ} (θ) g_{σ} (σ)

, i.e., the so-called marginal likelihood or Bayesian model evidence, given by

Z = \int_{R^{+}} \int_{Θ} ℓ (y | θ, σ) g_{θ} (θ) g_{σ} (σ) d θ d σ = \int_{R^{+}} Z (σ) g_{σ} (σ) d σ,

(23)

where we have denoted

Z (σ) = \int_{Θ} ℓ (y | θ, σ) g_{θ} (θ d θ

, usually called conditional marginal likelihood. The quantity Z is useful for model selection purposes. Furthermore, a complete Bayesian study of the joint posterior

p (θ, σ | y)

can be provided as well.

Approximation of $Z (σ) = p (y | σ)$ . After the T iterations of ATAIS, we can also approximate the conditional marginal likelihood

Z (σ) = p (y | σ)

without additional evaluations of the target function. Indeed, saving the error values at each particle obtained for the computation of the likelihood function during ATAIS,

e_{t}^{(n)} = | | y - f (θ_{t}^{(n)}) {| |}^{2},

We can compute the IS weights,

ρ_{t}^{(n)} (σ) = \frac{\frac{1}{{(2 π σ^{2})}^{\frac{K}{2}}} exp (- \frac{e_{t}^{(n)}}{2 σ^{2}}) g_{θ} (θ_{t}^{(n)})}{q (θ_{t}^{(n)} | μ_{t}, Σ_{t})},

(24)

for a generic value of

σ

Thus, the IS estimator of the conditional marginal likelihood

Z (σ)

is given by the arithmetic mean of the weights

ρ_{t}^{(n)} (σ)

\hat{Z} (σ) = \hat{p} (y | σ) = \frac{1}{N T} \sum_{t = 1}^{T} \sum_{n = 1}^{N} ρ_{t}^{(n)} (σ) .

(25)

Approximation of Z. Drawing

σ^{(r)} \sim g_{σ} (σ)

, for

r = 1, \dots, R

, (or considering a deterministic grid, e.g., as a Riemannian integration), we can approximate the global marginal likelihood Z by applying simple Monte Carlo to the integral in Equation (23),

\hat{Z} = \frac{1}{R} \sum_{r = 1}^{R} \hat{Z} (σ^{(r)}) .

(26)

Approximation of $p (σ | y)$ . An approximation of the marginal posterior

p (σ | y) = \frac{p (y | σ) g_{σ} (σ)}{p (y)}

can be also obtained as

\begin{matrix} p (σ | y) \approx \hat{p} (σ | y) = \frac{\hat{Z} (σ) g_{σ} (σ)}{\hat{Z}}, \end{matrix}

(27)

which can be used to approximate, e.g., the MAP of

p (σ | y)

σ_{MAP - marg} \approx arg {max}_{σ} \hat{Z} (σ) g_{σ} (σ) .

Other different moments of

p (σ | y)

can be computed by a deterministic quadrature (as the problem is now one-dimensional) or applying noisy Monte Carlo approaches.

Complete Bayesian analysis. We can approximate the integral of interest as

\begin{matrix} I & = \int_{R^{+}} \int_{Θ} h (θ, σ) p (θ, σ | y) d θ d σ, \end{matrix}

(28)

\begin{matrix} = \int_{R^{+}} \int_{Θ} h (θ, σ) p (θ | y, σ) p (σ | y) d θ d σ \end{matrix}

(29)

\begin{matrix} \approx \frac{1}{J} \sum_{j = 1}^{J} \sum_{t = 1}^{T} \sum_{n = 1}^{N} {\bar{ρ}}_{t}^{(n)} (σ^{(j)}) h (θ_{t}^{(n)}, σ^{(j)}), \end{matrix}

(30)

where

\begin{matrix} {\bar{ρ}}_{t}^{(n)} (σ^{(j)}) = \frac{ρ_{t}^{(n)} (σ^{(j)})}{\sum_{τ = 1}^{T} \sum_{i = 1}^{N} ρ_{τ}^{(i)} (σ^{(j)})}, \end{matrix}

(31)

and

σ^{(j)}

are generated by applying a noisy MCMC with invariant density

\hat{p} (σ | y) \propto \hat{Z} (σ) g_{σ} (σ)

. Note that the samples

θ_{t}^{(n)}

do not depend on the index j (they do not change) as we are recycling the particles generated by ATAIS and reusing evaluations

e_{t}^{(n)} = | | y - f (θ_{t}^{(n)}) {| |}^{2}

6. Simulations

We test the proposed scheme in two numerical examples: The first numerical experiment is a simple bidimensional example (which is easy to be reproduced). The second experiment considers a real-world application, i.e., a radial velocity model of exoplanet systems which is often employed in astronomy applications (with a dimension of the inference problem of 6 and 11).

6.1. First Numerical Analysis

For the sake of simplicity, let us consider

θ \in R

and an observation model given by the equation

y_{k} = θ^{2} + log (| sin (10 θ) |) + v_{k},

so that

f (θ) = θ^{2} + log (| sin (10 θ) |)

, and

v_{k} \sim N (0, σ^{2})

. We consider

θ_{true} = 2.5

, and

σ_{true} = 4

. We generate

K = 8

observations from the model above. We also consider a uniform prior for

θ

(0, 20]

. The conditional posterior

p (θ | y, σ_{true})

is shown in Figure 1c. We can observe that

p (θ | y, σ_{true})

is highly multimodal. Figure 1 also depicts the conditional posteriors

p (θ | y, σ)

with

σ \in {10, 20}

. Considering also a uniform prior over

σ

(0, 20]

, we have also a bidimensional joint posterior over

[θ, σ]

, which is depicted in Figure 2a.

In this bidimensional example, it is possible to obtain the ground-truths using an expensive thin grid. We show the ground-truths of the different pdfs in Table 1. Moreover, the true value of the complete evidence

Z = p (y) = 1.5983 \times 10^{- 9}

. As the prior over

σ

is uniform, the maximum likelihood of

σ

σ_{ML} = σ_{MAP - joint} = 3.23

. The two marginal posteriors are shown in Figure 2b,c.

We apply ATAIS with the goal of estimating the expected value and the variance of the posterior density with respect to

θ

. We consider a Gaussian proposal

q (θ | μ_{t}, λ_{t})

with

μ_{0} = 10

and a starting variance of

λ_{0} = 4

. Note that

μ_{0}

is located in a region that does not contain modes. We also start with

{\hat{σ}}_{ML}^{(0)} = 20

and

π_{MAP} = 0

(initial conditions). The Mean Square Error (MSE) of ATAIS, averaged over 500 runs, in estimation of different moments and modes as function of N (and with

T = 10

), is given in Table 2. The ML estimation

{\hat{σ}}_{ML}^{(t)}

, as function of the iteration t (with

N = 5

) for different runs, is given in Figure 3a. The approximation of the marginal posterior

p (σ | y)

, denoted

\hat{p} (σ | y)

, is obtained as in Equation (27) in one specific run, with different

N \in {10, 100, 500}

and

T = 10

. The approximations of the joint posterior

p (θ, σ | y)

and the marginal posterior

p (θ | y)

, obtained by resampling the particles according to the normalized weights in Equations (31) and (24), are shown in Figure 4, i.e., using a sampling importance resampling procedure. For more details, see in [14] and Chapter 24 in [15].

6.2. Radial Velocity Curves of Exoplanets and Binary Systems

In this example, we consider an application in an astronomical model. In recent years, the problem of revealing objects orbiting other stars has acquired large attention. Different techniques have been proposed to discover exo-objects but, nowadays, the radial velocity technique is still the most used [16,17,18,19]. The problem consists in fitting a model (the so-called radial velocity curve) to data acquired at different moments spanning during long time periods (up to years). The model is highly nonlinear, and it is costly in terms of computation time (especially for certain sets of parameters). Obtaining a value to compare to a single observation involves numerically integrating a differential equation in time or an iterative procedure for solving to a nonlinear equation. Typically, the iteration is performed until a threshold is reached or

10^{6}

iterations are performed. The problem of radial velocity curve fitting is applied in several related applications.

Observation model—likelihood. When analyzing the radial velocity data of an exoplanetary system, it is commonly accepted that the wobbling of the star around the center of mass is caused by the sum of the gravitational force of each planet independently and that they do not interact with each other. Each planet follows a Keplerian orbit, and the radial velocity of the host star is given by

y_{r, t} = V_{0} + \sum_{i = 1}^{S} A_{i} [cos (u_{i, t} + ω_{i}) + e_{i} cos (ω_{i})] + ξ_{t},

(32)

with

t = 1, \dots, T

and

r = 1, \dots, R

. In this equation,

A_{i}

is the amplitude of the curve,

w_{i}

is the argument of perigee, and

e_{i}

is the eccentricity of the orbit of the i-th planet. The parameter

V_{0}

represents the mean velocity, and it is common for all the planets. The number of objects in the system is S, which is considered to be known in this experiment (for the sake of simplicity). Both

y_{r, t}

and

u_{i, t}

depend on time t, and then

ξ_{t}

is a Gaussian noise perturbation with variance

σ^{2}

. The likelihood function is defined by (32) and some indicator variables described below. The angle

u_{i, t}

is the true anomaly of the planet i, and it can be determined from

\frac{d u_{i, t}}{d t} = \frac{2 π}{P_{i}} \frac{{(1 + e_{i} cos u_{i, t})}^{2}}{{(1 - e_{i})}^{\frac{3}{2}}}

(33)

This equation has analytical solution. As a result, the true anomaly

u_{t}

can be determined from the mean anomaly M. However, the analytical solution contains a nonlinear term that needs to be determined by iterating. First, we define the mean anomaly

M_{i, t}

M_{i, t} = \frac{2 π}{P_{i}} (t - τ_{i}),

(34)

where

τ_{i}

is the time of periastron passage of the planet i and

P_{i}

is the period of its orbit. Then, through the Kepler’s equation,

M_{i, t} = E_{i, t} - e_{i} sin E_{i, t},

(35)

where

E_{i, t}

is the eccentric anomaly. Equation (35) has no analytic solution and it must be solved by an iterative procedure. A Newton–Raphson method is typically used to find the roots of this equation [20]. For certain sets of parameters, this iterative procedure can be particularly slow. We also have

tan \frac{u_{i, t}}{2} = \sqrt{\frac{1 + e_{i}}{1 - e_{i}}} tan \frac{E_{i, t}}{2},

(36)

The variable of interest

θ

is then the vector

θ = [V_{0}, A_{1}, ω_{1}, e_{1}, P_{1}, τ_{1}, \dots, A_{S}, ω_{S}, e_{S}, P_{S}, τ_{S}],

(37)

Then, for a single object (e.g., a planet or a natural satellite), the dimension of

θ

M = 5 + 1 = 6

, with two objects the dimension of

θ

M = 11

etc.

This example consists in a synthetic radial velocity curve of a planetary system with one planet or two planets (i.e.,

S = 1

S = 2

). More specifically, we generate simulated data with a model with two planets. The orbital parameters of the planets are listed in Table 3, where P is the period of the orbit, A is the amplitude of the curve, e is the eccentricity of the orbit,

ω

is the argument of perigee, and

τ

is the last periastron passage. A mean velocity

V_{0} = 5

m s

^{- 1}

is assumed. A Gaussian noise perturbation is added with a standard deviation

σ = 3

m s

^{- 1}

. To simulate observations, a total of

K = 120

data points are selected from three random time periods (and two planets in the system). Note that the amplitude of the radial velocity curve of the second planet is close to the noise level. We run ATAIS and a standard AIS scheme with the model with one planet and with the model with two planets. The purpose of this simulation is to check the ability of the method to detect the two planets (by approximating the model evidence).

We apply ATAIS and a standard AIS scheme [13] over the space

[θ, σ]

for approximating the model evidence

Z = p (y)

(marginal likelihood) of both models (one planet or two planets) with the given data (generated considering two planets). Uniform priors are considered for each parameter:

P \in [0, 365]

A \in [- 20, 20]

e \in [0, 1]

ω \in [0, 2 π]

, and

τ \in [0, 50]

(moreover,

σ \in [0, 30]

for the standard AIS scheme). The ATAIS algorithm and the standard AIS scheme have been run with

N = 10^{6}

and

T = 50

iterations for both the model with one planet and the model with two planets. In both cases, we consider the same Gaussian proposal with a starting standard deviation of 5 for each component (note that the standard AIS scheme works in higher dimensional space due the inference over

σ

). To decide which model is more probable, the model evidence Z of each model is estimated. More specifically, we approximate the one-planet model

{\hat{Z}}_{1} = {\hat{p}}_{1} (y)

and the two-planet model

{\hat{Z}}_{2} = {\hat{p}}_{2} (y)

with the ATAIS algorithm and the standard AIS scheme. When

{\hat{Z}}_{1} > {\hat{Z}}_{2}

, we select the first model; otherwise, if

{\hat{Z}}_{1} < {\hat{Z}}_{2}

, we select the second one. The true model is the two-planet model, as the simulated data were generated from that model. After 500 independent runs, the percentage of correct detection of the true model for ATAIS is

\approx 98 %

, whereas with the standard AIS scheme is only

\approx 56 %

. This is due to the difficulty of making inference jointly over

[θ, σ]

. Let us denote the Bayesian factor as

B = Z_{2} / Z_{1}

. In ATAIS, the expected value of the ratio between the model evidences (averaged over the 500 runs) is

E [B] \approx 5 \cdot 10^{3}

with a relative variance of

\frac{E [{(B - E [B])}^{2}]}{E {[B]}^{2}} \approx 0.04

. In the case of the standard AIS, we have

E [B] \approx 16.32

and

\frac{E [{(B - E [B])}^{2}]}{E {[B]}^{2}} \approx 0.15

. Therefore, for ATAIS, the model with two planets is clearly more probable than the model with one planet.

The fitted curves, corresponding to the vector of parameters

{\hat{θ}}_{MAP}

obtained with ATAIS, are shown in Figure 5. From the figure, it is not clear which model better fits the simulated observations (blue points), although the model with two planets seems to better fit the observations in the time period from 200 to 300 days. The values of

{\hat{θ}}_{MAP}

, obtained in one specific run by ATAIS, are given in Table 4. We notice that

ω

and

τ

are highly correlated and more iterations may be needed to obtain the actual global maximum, but the remaining parameters obtained from

{\hat{θ}}_{MAP}

are similar to the simulated values. In addition, the amplitude of the curve of the second planet is close to the intensity of the noise, making it difficult to derive the best fit for that planet. Summarizing, our results show the method is able to discriminate between a model with one planet (with six dimensions of the inference problem) and a model with two planets (with 11 dimensions of the inference problem), for this particular simulation. Finally, the evolution of the automatic tempering parameter

{\hat{σ}}_{ML}^{(t)}

is shown in Figure 6. The dashed line is the evolution of

{\hat{σ}}_{ML}^{(t)}

for the single-planet model, whereas the continuous line is the evolution of

{\hat{σ}}_{ML}^{(t)}

for the model with two planets. In this second model, the tempering parameter reaches a smaller value, as expected.

7. Conclusions

We have proposed a novel AIS scheme for Bayesian inversion problems where an automatic tempering procedure is implemented (called ATAIS). The inference of the variables of interest

θ

and the noise power

σ^{2}

is divided. A sampling strategy is considered for

θ

and an optimization approach is employed for

σ^{2}

. Thus, ATAIS performs an iterative procedure, alternating sampling and optimization steps. Therefore, the proposed scheme deals with a sequence of tempered posteriors according to the current estimation of the noise power. We have also discussed the possibility of approximating the marginal posterior of

σ

without additional evaluations of the complex model. Furthermore, the complete Bayesian analysis regarding the complete joint posterior is possible as discussed in Section 5, again without any additional evaluations of the likelihood function.

Several simulations are provided and the application to a sophisticated astronomical model has been considered, where the number of planets in the system is detected by the analysis of the marginal likelihood. The results show the benefits of the proposed scheme. For instance, in the astronomical example, the percentage of correct detection of the true model obtained by ATAIS is ≈

98 %

, whereas with the standard AIS scheme is only ≈

56 %

. As future research, we plan to extend the ATAIS scheme in order to deal with an observation model with correlated noise perturbations (for instance, using a Gaussian Process). Moreover, the use of parallel AIS schemes (or MCMC algorithms) will be also considered. A combination of parallel MCMC chains and AIS schemes can be found in the so-called layered AIS method and other similar approaches [21,22]. This idea seems particularly interesting for improving the inference with radial velocity models.

Author Contributions

All the authors contribute to all the sections in the same way. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Office of Naval Research (award no. N00014-19-1-2226), the Spanish Ministry of Science and Innovation (RTI2018-099655-B- I00 CLARA and PID2019-105032GB-I00 SPGRAPH), the Foundation by the Community of Madrid in the framework of the Multiannual Agreement with the Rey Juan Carlos University in line of action 1, Encouragement of Young Ph.D. students investigation Project under Grant F661, and the regional Government of Madrid (Comunidad de Madrid, reference Y2018/TCS-4705 PRACTICO).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. On the Optimization of the Likelihood Function

Let us set

δ = σ^{2}

and consider to optimize of the likelihood function

ℓ (θ, δ) = \frac{1}{{(2 π δ)}^{K / 2}} exp (- \frac{V (θ)}{δ}) .

Recall that, in our model, we have

V (θ) = | | y - {f (θ) | |}^{2}

. We desire to obtain

[θ_{ML}, δ_{ML}] = arg max ℓ (θ, δ) .

We can write the gradient and equal to zero,

\begin{matrix} \{\begin{matrix} \nabla_{θ} ℓ (θ, δ) & = - \frac{1}{δ} \nabla_{θ} V (θ) [\frac{1}{{(2 π δ)}^{K / 2}} exp (- \frac{V (θ)}{δ})] = 0 ⟹ \nabla_{θ} V (θ) = 0, \\ \frac{\partial ℓ (θ, δ)}{\partial δ} & = \frac{e^{- \frac{V (θ)}{δ}} (2 V (θ) - δ K)}{2^{\frac{K}{2} + 1} δ^{\frac{K}{2} + 2} π^{K / 2}} = 0 ⟹ δ = \frac{2}{K} V (θ) . \end{matrix} \end{matrix}

(A1)

We have obtained that the ML solution is defined by the system of equations,

\begin{matrix} \{\begin{matrix} \nabla_{θ} V (θ_{ML}) = 0 \\ δ_{ML} = \frac{2}{K} V (θ_{ML}) . \end{matrix} \end{matrix}

(A2)

References

Fitzgerald, W.J. Markov chain Monte Carlo methods with applications to signal processing. Signal Process. 2001, 81, 3–18. [Google Scholar] [CrossRef]
Andrieu, C.; de Freitas, N.; Doucet, A.; Jordan, M. An Introduction to MCMC for Machine Learning. Mach. Learn. 2003, 50, 5–43. [Google Scholar] [CrossRef] [Green Version]
Martino, L.; Míguez, J. Generalized Rejection Sampling Schemes and Applications in Signal Processing. Signal Process. 2010, 90, 2981–2995. [Google Scholar] [CrossRef] [Green Version]
Robert, C.P.; Casella, G. Monte Carlo Statistical Methods; Springer: New York, NY, USA, 2004. [Google Scholar]
Liu, J.S. Monte Carlo Strategies in Scientific Computing; Springer: New York, NY, USA, 2004. [Google Scholar]
Martino, L.; Luengo, D.; Miguez, J. Independent Random Sampling Methods; Springer: New York, NY, USA, 2018. [Google Scholar]
Kirkpatrick, S., Jr.; Gelatt, C.D.; Vecchi, M.P. Optimization by Simulated Annealing. Science 1983, 220, 671–680. [Google Scholar] [CrossRef] [PubMed]
Marinari, E.; Parisi, G. Simulated Tempering: A New Monte Carlo Scheme. Europhys. Lett. 1992, 19, 451–458. [Google Scholar] [CrossRef] [Green Version]
Friel, N.; Pettitt, A.N. Marginal Likelihood Estimation via Power Posteriors. J. R. Stat. Soc. Ser. B Stat. Methodol. 2008, 70, 589–607. [Google Scholar] [CrossRef]
Moral, P.D.; Doucet, A.; Jasra, A. Sequential Monte Carlo samplers. J. R. Stat. Soc. Ser. B Stat. Methodol. 2006, 68, 411–436. [Google Scholar] [CrossRef]
Neal, R.M. Annealed importance sampling. Stat. Comput. 2001, 11, 125–139. [Google Scholar] [CrossRef]
Llorente, F.; Martino, L.; Delgado, D.; Lopez-Santiago, J. Marginal likelihood computation for model selection and hypothesis testing: An extensive review. arXiv 2020, arXiv:2005.08334. [Google Scholar]
Bugallo, M.F.; Martino, L.; Corander, J. Adaptive importance sampling in signal processing. Digit. Signal Process. 2015, 47, 36–49. [Google Scholar] [CrossRef] [Green Version]
Rubin, D.B. Using the SIR algorithm to simulate posterior distributions. In Bayesian Statistics 3, Ads Bernardo, Degroot, Lindley, and Smith; Oxford University Press: Oxford, UK, 1988. [Google Scholar]
Gelman, A.; Meng, X.L. Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives; John Wiley & Sons: New York, NY, USA, 2018. [Google Scholar]
Gregory, P.C. Bayesian re-analysis of the Gliese 581 exoplanet system. Mon. Not. R. Astron. Soc. 2011, 415, 2523–2545. [Google Scholar] [CrossRef] [Green Version]
Barros, S.C.C.; Brown, D.J.A.; Hébrard, G.; Gómez Maqueo Chew, Y.; Anderson, D.R.; Boumis, P.; Delrez, L.; Hay, K.L.; Lam, K.W.F.; Llama, J.; et al. WASP-113b and WASP-114b, two inflated hot Jupiters with contrasting densities. Astron. Astrophys. 2016, 593, A113. [Google Scholar] [CrossRef]
Affer, L.; Damasso, M.; Micela, G.; Poretti, E.; Scand ariato, G.; Maldonado, J.; Lanza, A.F.; Covino, E.; Garrido Rubio, A.; González Hernández, J.I.; et al. HADES RV program with HARPS-N at the TNG. IX. A super-Earth around the M dwarf Gl 686. Astron. Astrophys. 2019, 622, A193. [Google Scholar] [CrossRef] [Green Version]
Trifonov, T.; Stock, S.; Henning, T.; Reffert, S.; Kürster, M.; Lee, M.H.; Bitsch, B.; Butler, R.P.; Vogt, S.S. Two Jovian Planets around the Giant Star HD 202696: A Growing Population of Packed Massive Planetary Pairs around Massive Stars? Astron. J. 2019, 157, 93. [Google Scholar] [CrossRef] [Green Version]
Press, W.H.; Teukolsky, S.A.; Vetterling, W.T.; Flannery, B.P. Numerical Recipes in C++: The Art of Scientific Computing; Springer: New York, NY, USA, 2002. [Google Scholar]
Martino, L.; Elvira, V.; Luengo, D.; Corander, J. Layered Adaptive Importance Sampling. Stat. Comput. 2017, 27, 599–623. [Google Scholar] [CrossRef] [Green Version]
Botev, Z.I.; Ecuyer, P.L.; Tuffin, B. Markov chain importance sampling with applications to rare event probability estimation. Stat. Comput. 2013, 23, 271–285. [Google Scholar] [CrossRef]

Figure 1. Conditional posteriors corresponding to different values of

σ

: (a)

σ = 20

, (b)

σ = 10

, and (c)

σ = σ_{true} = 4

Figure 1. Conditional posteriors corresponding to different values of

σ

: (a)

σ = 20

, (b)

σ = 10

, and (c)

σ = σ_{true} = 4

Figure 2. The bidimensional joint posterior

p (θ, σ | y)

and the two marginal posteriors

p (θ | y)

p (σ | y)

in Equation (11), computed by using a thin grid approximation.

Figure 2. The bidimensional joint posterior

p (θ, σ | y)

and the two marginal posteriors

p (θ | y)

p (σ | y)

in Equation (11), computed by using a thin grid approximation.

Figure 3. (a) The maximum likelihood (ML) estimation

{\hat{σ}}_{ML}^{(t)}

(different runs) versus the number of iterations t, with

N = 5

. (b) The true marginal posterior

p (σ | y)

and different approximations, in one specific run,

\hat{p} (σ | y)

obtained as in Equation (27) with different

N \in {10, 100, 500}

and

T = 10

(thus, the total number of samples are

N T

Figure 3. (a) The maximum likelihood (ML) estimation

{\hat{σ}}_{ML}^{(t)}

(different runs) versus the number of iterations t, with

N = 5

. (b) The true marginal posterior

p (σ | y)

and different approximations, in one specific run,

\hat{p} (σ | y)

obtained as in Equation (27) with different

N \in {10, 100, 500}

and

T = 10

(thus, the total number of samples are

N T

Figure 4. Approximations obtained with ATAIS. (a,b) Joint posterior

p (θ, σ | y)

: (a) by an histogram with

2 \times 10^{6}

samples; (b)

10^{4}

samples from joint posterior obtained by ATAIS. (c) Approximation by an histogram with

2 \times 10^{6}

samples, of the marginal posterior

p (θ | y)

Figure 4. Approximations obtained with ATAIS. (a,b) Joint posterior

p (θ, σ | y)

: (a) by an histogram with

2 \times 10^{6}

samples; (b)

10^{4}

samples from joint posterior obtained by ATAIS. (c) Approximation by an histogram with

2 \times 10^{6}

samples, of the marginal posterior

p (θ | y)

Figure 5. Comparison of the results of the ATAIS algorithm with the simulations (blue dots). Left panel shows, in gray, the radial velocity curve for

{\hat{θ}}_{MAP}

using a model with one planet. Right panel is like left panel but considering a model with two planets.

Figure 5. Comparison of the results of the ATAIS algorithm with the simulations (blue dots). Left panel shows, in gray, the radial velocity curve for

{\hat{θ}}_{MAP}

using a model with one planet. Right panel is like left panel but considering a model with two planets.

Figure 6. Evolution of the tempering parameter

{\hat{σ}}_{ML}^{(t)}

. We decide

{\hat{σ}}_{ML}^{(0)} = 50

as starting value (the figure shows from

t = 1

), which is an arbitrary high value to help the exploration in the first iteration. However, after the first iteration, the algorithm is able to obtain reasonable values of

{\hat{σ}}_{ML}^{(1)}

. The dashed line is the evolution for the model with one planet. The continuous line is the evolution of the two-planet model.

Figure 6. Evolution of the tempering parameter

{\hat{σ}}_{ML}^{(t)}

. We decide

{\hat{σ}}_{ML}^{(0)} = 50

as starting value (the figure shows from

t = 1

), which is an arbitrary high value to help the exploration in the first iteration. However, after the first iteration, the algorithm is able to obtain reasonable values of

{\hat{σ}}_{ML}^{(1)}

. The dashed line is the evolution for the model with one planet. The continuous line is the evolution of the two-planet model.

Table 1. Summary of pdfs and ground-truths for the first numerical experiment.

Pdf	Expectation	Variance	MAP
$p (θ \| y, σ_{ML})$	2.48	0.11	2.56
$p (σ \| y)$	4.32	2.43	3.46
$p (θ \| y)$	2.46	0.18	2.56

Table 2. Mean Square Error (MSE) of ATAIS (averaged over 500 runs), in the estimation of the evidence, different moments and modes as function of N and

T = 10

Table 2. Mean Square Error (MSE) of ATAIS (averaged over 500 runs), in the estimation of the evidence, different moments and modes as function of N and

T = 10

Value	$N = 10$	$N = 100$	$N = 1000$	$N = 5000$	Ground-Truths
$E [θ \| y, σ_{ML}]$	0.0311	0.0098	0.0034	0.0024	2.48
$var [θ \| y, σ_{ML}]$	0.0474	0.0370	0.0298	0.0201	0.11
$θ_{MAP}$	0.0410	0.0337	0.0285	0.0127	2.56
$E [σ \| y]$	0.9233	0.0785	0.0097	0.0023	4.32
$var [σ \| y]$	6.1869	0.2640	0.0035	0.0010	2.43
$σ_{MAP - marg}$	0.0056	0.0004	0.0001	$3 \times 10^{- 5}$	3.46
$σ_{ML}$	$8 \times 10^{- 5}$	$2 \times 10^{- 5}$	$5 \times 10^{- 7}$	$6 \times 10^{- 9}$	3.23
$Z = p (y)$	$2 \times 10^{- 18}$	$1.8 \times 10^{- 20}$	$1.4 \times 10^{- 20}$	$3.6 \times 10^{- 22}$	$1.6 \times 10^{- 9}$

Table 3. Main orbital parameters of the two exoplanets in the simulation.

Parameter	Planet 1	Planet 2
P	15 d	115 d
A	25 m s $^{- 1}$	5 m s $^{- 1}$
e	0.1	0.0
$ω$	0.61 rad	0.17 rad
$τ$	3 d	24 d

Table 4. The value of

{\hat{θ}}_{MAP}

and the variances of the marginal posteriors for the 2-planets model (with

K = 120

data points).

Table 4. The value of

{\hat{θ}}_{MAP}

and the variances of the marginal posteriors for the 2-planets model (with

K = 120

data points).

Parameter	Planet 1		Planet 2
Parameter	${\hat{θ}}_{MAP}$	$Var (θ \| y)$	${\hat{θ}}_{MAP}$ —Planet 2	$Var (θ \| y)$
P	14.99 d	0.18	110.39 d	11.28
K	23.78 m s $^{- 1}$	0.52	3.50 m s $^{- 1}$	0.44
e	0.05	0.047	0.00	0.003
$ω$	7.69 rad	0.61	0.68 rad	0.82
$τ$	6.8 d	0.76	7.96 d	20.31

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Martino, L.; Llorente, F.; Curbelo, E.; López-Santiago, J.; Míguez, J. Automatic Tempered Posterior Distributions for Bayesian Inversion Problems. Mathematics 2021, 9, 784. https://doi.org/10.3390/math9070784

AMA Style

Martino L, Llorente F, Curbelo E, López-Santiago J, Míguez J. Automatic Tempered Posterior Distributions for Bayesian Inversion Problems. Mathematics. 2021; 9(7):784. https://doi.org/10.3390/math9070784

Chicago/Turabian Style

Martino, Luca, Fernando Llorente, Ernesto Curbelo, Javier López-Santiago, and Joaquín Míguez. 2021. "Automatic Tempered Posterior Distributions for Bayesian Inversion Problems" Mathematics 9, no. 7: 784. https://doi.org/10.3390/math9070784

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Automatic Tempered Posterior Distributions for Bayesian Inversion Problems

Abstract

1. Introduction

2. Problem Statement

3. Key Observations and Proposed Approach

3.1. Split Inference

3.2. An Iterative Scheme

4. Automatic Tempering Adaptive Importance Sampling (ATAIS)

4.1. With a Generic Prior $g_{θ} (θ)$

5. Complete Bayesian Inference with ATAIS

6. Simulations

6.1. First Numerical Analysis

6.2. Radial Velocity Curves of Exoplanets and Binary Systems

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Appendix A. On the Optimization of the Likelihood Function

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

Automatic Tempered Posterior Distributions for Bayesian Inversion Problems

Abstract

1. Introduction

2. Problem Statement

3. Key Observations and Proposed Approach

3.1. Split Inference

3.2. An Iterative Scheme

4. Automatic Tempering Adaptive Importance Sampling (ATAIS)

4.1. With a Generic Prior g θ ( θ )

5. Complete Bayesian Inference with ATAIS

6. Simulations

6.1. First Numerical Analysis

6.2. Radial Velocity Curves of Exoplanets and Binary Systems

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Appendix A. On the Optimization of the Likelihood Function

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

4.1. With a Generic Prior $g_{θ} (θ)$