Double-Parallel Monte Carlo for Bayesian analysis of big data

729 Accesses
16 Citations
Explore all metrics

Abstract

This paper proposes a simple, practical, and efficient MCMC algorithm for Bayesian analysis of big data. The proposed algorithm suggests to divide the big dataset into some smaller subsets and provides a simple method to aggregate the subset posteriors to approximate the full data posterior. To further speed up computation, the proposed algorithm employs the population stochastic approximation Monte Carlo algorithm, a parallel MCMC algorithm, to simulate from each subset posterior. Since this algorithm consists of two levels of parallel, data parallel and simulation parallel, it is coined as “Double-Parallel Monte Carlo.” The validity of the proposed algorithm is justified mathematically and numerically.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Parallel inference for big data with the group Bayesian method

Article 25 June 2020

Informed sub-sampling MCMC: approximate Bayesian inference for large datasets

Article 09 June 2018

Subsampling sequential Monte Carlo for static Bayesian models

Article 09 September 2020

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

References

Andrieu, C., Moulines, E., Priouret, P.: Stability of stochastic approximation under verifiable conditions. SIAM J. Control Optim. 44, 283–312 (2005). https://doi.org/10.1137/S0363012902417267
Article MathSciNet MATH Google Scholar
Carvalho, C.M., Polson, N.G., Scott, J.G.: The horseshoe estimator for sparse signals. Biometrika 97, 465–480 (2010)
Article MathSciNet MATH Google Scholar
Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984). https://doi.org/10.1109/TPAMI.1984.4767596
Article MATH Google Scholar
Hasting, W.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109 (1970). https://doi.org/10.1093/biomet/57.1.97
Article MathSciNet Google Scholar
Kass, R.E., Tierney, L., Kadane, J.B.: The validity of posterior expansions based on Laplace’s method. In: Geisser, S., Hodges, J.S., Press, S.J., ZeUnerv, A. (eds.) Bayesian and Likelihood Methods in Statistics and Econometrics: Essays in Honor of George A. Barnard, vol. 7, pp. 473–488. North Holland, Amsterdam (1990)
Google Scholar
Li, C., Srivastava, S., Dunson, D.B.: Simple, scalable and accurate posterior interval estimation. Biometrika 104, 665–680 (2017)
Liang, F.: On the use of stochastic approximation Monte Carlo for Monte Carlo integration. Stat. Probab. Lett. 79, 581–587 (2009)
Article MathSciNet MATH Google Scholar
Liang, F., Liu, C., Carroll, R.J.: Stochastic approximation in Monte Carlo computation. J. Am. Stat. Assoc. 102, 305–320 (2007)
Article MathSciNet MATH Google Scholar
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087 (1953)
Article Google Scholar
Minsker, S., Srivastava, S., Lin, L., Dunson, D.B.: Robust and scalable Bayes via a median of subset posterior measures. ArXiv e-prints (2014)
Neiswanger, W., Wang, C., Xing, E.P.: Asymptotically exact, embarrassingly parallel MCMC, In: Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI’14, pp. 623–632, AUAI Press, Arlington. http://dl.acm.org/citation.cfm?id=3020751.3020816 (2014)
Ripley, B.D.: Stochastic Simulation. Wiley, New York (1987)
Book MATH Google Scholar
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951). https://doi.org/10.1214/aoms/1177729586
Article MathSciNet MATH Google Scholar
Roe, B.P., Yang, H.-J., Zhu, J., Liu, Y., Stancu, I., McGregor, G.: Boosted decision trees as an alternative to artificial neural networks for particle identification. Nucl. Instrum. Methods Phys. Res. Sect. A Accel. Spectrom. Detect. Assoc. Equip. 543, 577–584 (2005)
Article Google Scholar
Scott, S.L., Blocker, A.W., Bonassi, F.V.: Bayes and big data: the consensus Monte Carlo algorithm. Int. J. Manag. Sci. Eng. Manag. 11, 78–88 (2016)
Google Scholar
Song, Q., Wu, M., Liang, F.: Weak convergence rates of population versus single-chain stochastic approximation MCMC algorithms. Adv. Appl. Probab. 46, 1059–1083 (2014). https://doi.org/10.1239/aap/1418396243
Article MathSciNet MATH Google Scholar
Srivastava, S., Cevher, V., Tran-Dinh, Q., Dunson, D.B.: WASP: scalable Bayes via barycenters of subset posteriors. In: AISTATS, volume 38 of JMLR: Workshop and Conference Proceedings (JMLR.org) (2015a)
Srivastava, S., Li, C., Dunson, D.B.: Scalable Bayes via barycenter in Wasserstein space. ArXiv e-prints (2015b)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. (Ser. B) 58, 267–288 (1996)
MathSciNet MATH Google Scholar
Wand, M.: KernSmooth: functions for kernel smoothing supporting Wand & Jones (1995). r package version 2.23-15 (2015)
Wang, X., Dunson, D.: Parallelizing MCMC via Weierstrass sampler. ArXiv e-prints (2013)
Wang, X., Guo, F., Heller, K.A., Dunson, D.B.: parallelizing MCMC with random partition trees. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (Eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 451–459. Curran Associates, Inc. http://papers.nips.cc/paper/5986-parallelizing-mcmc-with-random-partition-trees.pdf (2015)

Download references

Acknowledgements

Liang’s research was supported in part by the Grants DMS-1545202, DMS-1612924 and R01-GM117597. The authors thank the Editor, Associate Editor and two referees for their constructive comments which has led to significant improvement of this paper.

Author information

Authors and Affiliations

Department of Statistics, Texas A&M University, College Station, TX, 77843, USA
Jingnan Xue
Department of Statistics, Purdue University, West Lafayette, IN, 47907, USA
Faming Liang

Authors

Jingnan Xue
View author publications
You can also search for this author in PubMed Google Scholar
Faming Liang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Faming Liang.

Appendix A Proof of Theorem 1

Under conditions A1 and A2, we can expand the mean of each subposterior at the corresponding MLE, $\hat{\varvec{\theta }}^{(j)}$, as follows:

$$\begin{aligned} \varvec{\mu }^{(j)}= & {} E_{\tilde{\pi }_j} (\varvec{\theta }) =\hat{\varvec{\theta }}^{(j)}+\frac{\hat{I}^{(j)-1}}{n} \\&\times \left[ \frac{\partial \log g({\varvec{\theta }})}{\partial {\varvec{\theta }}}|_{\hat{{\varvec{\theta }}}^{(j)}} -\frac{1}{2}\hat{H}^{(j)}\hat{I}^{(j)-1} \right] +O(n^{-2}), \end{aligned}$$

where $\tilde{\pi }_j=\tilde{\pi }({\varvec{\theta }}|{\varvec{X}}_{[j]})$, $\hat{I}^{(j)}=-\frac{1}{m} \frac{\partial ^{2} \log f({\varvec{X}}_{[j]}|\varvec{\theta })}{\partial \varvec{\theta }\partial \varvec{\theta }^{T}}|_{\varvec{\theta }=\hat{\varvec{\theta }}^{(j)}}$, $\hat{H}^{(j)}=-\frac{1}{m}\frac{\partial ^{3} logf({\varvec{X}}_{[j]}|\varvec{\theta })}{\partial \varvec{\theta } \partial \varvec{\theta }^{T}\partial \varvec{\theta }}|_{\varvec{\theta }=\hat{\varvec{\theta }}^{(j)}}$, and $\hat{H}^{(j)}\hat{I}^{(j)-1}$ is a vector whose rth element equals $\sum _{st}\hat{H}_{rst}^{(j)}\hat{I}_{st}^{(j)-1}$. To simplify the notation, we denote $\hat{I}^{(j)-1}[\partial \log g({\varvec{\theta }})/\partial {\varvec{\theta }}|_{\hat{{\varvec{\theta }}}^{(j)}} -\frac{1}{2}\hat{H}^{(j)}\hat{I}^{(j)-1}]$ by $\varvec{\nu }^{(j)}$. Here we would like to mention that although A1 and A2 are directly associated with the Laplace approximation of the full posterior, they can also lead to the conditions that guarantee the Laplace approximation of each subposterior, as long as m goes to $\infty $ with n.

Moreover, for each $\hat{\varvec{\theta }}^{(j)}$, we have

$$\begin{aligned} \hat{\varvec{\theta }}^{(j)}=\varvec{\theta }^{*}+\frac{\varvec{\xi }^{(j)}}{\sqrt{m}}+O_{p}(m^{-1}), \end{aligned}$$

where

$$\begin{aligned} \varvec{\xi }^{(j)}= & {} \frac{1}{\sqrt{m}}I^{-1}\sum _{i=1}^{m}\frac{\partial \log f(X_{ji}|\varvec{\theta }^{*})}{\partial \varvec{\theta }},\\ I= & {} -E_{{\varvec{X}}|\varvec{\theta }^{*}}\frac{\partial ^2 \log f({\varvec{X}}|\varvec{\theta }^{*}) }{\partial \varvec{\theta } \partial \varvec{\theta }^{T} }. \end{aligned}$$

Therefore, the mean of the mixture distribution $\tilde{\pi }(\varvec{\theta }|{\varvec{X}})$ is given by

$$\begin{aligned} E_{\tilde{\pi }} (\varvec{\theta })= & {} \frac{1}{k}\sum _{j=1}^{k}\varvec{\mu }^{(j)}\\= & {} \varvec{\theta }^{*}+\frac{1}{k}\sum _{j=1}^{k}\frac{\varvec{\nu }^{(j)}}{n}+\frac{1}{k}\sum _{j=1}^{k} \frac{\varvec{\xi }^{(j)}}{\sqrt{m}}\\&+\,O_{p}(m^{-1})+O(n^{-2}). \end{aligned}$$

We can also implement a similar procedure on the full data posterior $\pi (\varvec{\theta }|{\varvec{X}})$ and obtain

$$\begin{aligned} E_{\pi }(\varvec{\theta })= & {} \varvec{\theta }^{*}+\frac{\varvec{\nu }}{n} +\frac{\varvec{\xi }}{\sqrt{n}}+O_{p}(n^{-1})+O(n^{-2}), \end{aligned}$$

where $\varvec{\nu }=\hat{I}^{-1}[\partial \log g({\varvec{\theta }})/\partial {\varvec{\theta }}|_{\hat{{\varvec{\theta }}} } -\frac{1}{2}\hat{H}\hat{I}^{-1}]$, $\varvec{\xi }=\frac{1}{\sqrt{n}}I^{-1}\sum _{i=1}^{n}\frac{\partial \log f(X_{i}|\varvec{\theta }^{*})}{\partial \varvec{\theta }}$, $\hat{I}=-\frac{1}{n} \frac{\partial ^{2} \log f({\varvec{X}}|\varvec{\theta })}{\partial \varvec{\theta }\partial \varvec{\theta }^{T}}|_{\varvec{\theta }=\hat{\varvec{\theta }}}$, $\hat{H}=-\frac{1}{m}\frac{\partial ^{3} logf({\varvec{X}}|\varvec{\theta })}{\partial \varvec{\theta } \partial \varvec{\theta }^{T}\partial \varvec{\theta }}|_{\varvec{\theta }=\hat{\varvec{\theta }}}$, and $\hat{\varvec{\theta }}$ denotes the MLE of $\theta $ calculated with the full dataset. By noting that $\frac{\varvec{\xi }}{\sqrt{n}}=\frac{1}{k}\sum _{j=1}^{k}\frac{\varvec{\xi }^{(j)}}{\sqrt{m}}$, $\varvec{\nu }^{(j)}$ and $\varvec{\nu }$ are O(1), $m<n$, Eq. (4) in Theorem 1 is thereby verified:

$$\begin{aligned} E_{\tilde{\pi }} (\varvec{\theta })-E_{\pi } (\varvec{\theta })= & {} \frac{1}{k}\sum _{j=1}^{k}\frac{\varvec{\nu }^{(j)}}{n}+O_{p}(m^{-1})\\&-\frac{\varvec{\nu }}{n} +O_{p}(n^{-1})+O(n^{-2}) \\= & {} O(n^{-1})+O_{p}(m^{-1})+O(n^{-1})\\&+\,O_{p}(n^{-1})+O(n^{-2}) \\= & {} O_{p}(m^{-1}). \end{aligned}$$

The variance of each subposterior can be approximated as follows:

$$\begin{aligned} \mathrm{Var}_{\tilde{\pi }_{j}}(\varvec{\theta })=\frac{\hat{I}^{(j)-1}}{n}+O(n^{-2}). \end{aligned}$$

Therefore, the variance of the mixture distribution $\tilde{\pi }(\varvec{\theta }|{\varvec{X}})$ is given by

$$\begin{aligned} \mathrm{Var}_{\tilde{\pi }}(\varvec{\theta })=\frac{1}{k}\sum _{j=1}^{k}\mathrm{Var}_{\tilde{\pi }_{j}}(\varvec{\theta }) =\frac{1}{k}\sum _{j=1}^{k}\frac{\hat{I}^{(j)-1}}{n}+O(n^{-2}). \end{aligned}$$

In addition, we have

$$\begin{aligned} \mathrm{Var}_{\pi }(\varvec{\theta })=\frac{\hat{I}^{-1}}{n}+O(n^{-2}). \end{aligned}$$

Since $\hat{I}^{(j)}=I+o_{p}(1)$ and $\hat{I}=I+o_{p}(1)$, we further have $\hat{I}^{(j)-1}=I^{-1}+o_{p}(1)$ and $\hat{I}^{-1}=I^{-1}+o_{p}(1)$. Equation (5) in Theorem 1 is thereby verified:

$$\begin{aligned} \mathrm{Var}_{\tilde{\pi }}(\varvec{\theta })-\mathrm{Var}_{\pi }(\varvec{\theta })= & {} \frac{1}{k}\sum _{j=1}^{k}\frac{\hat{I}^{(j)-1}}{n}-\frac{\hat{I}^{-1}}{n}+O(n^{-2})\\= & {} \frac{I^{-1}}{n}+o_{p}(n^{-1})-\frac{I^{-1}}{n}\\&+\,o_{p}(n^{-1})+O(n^{-2})\\= & {} o_{p}(n^{-1}). \end{aligned}$$

Finally, for the square of the Wasserstein distance of order 2, we have

$$\begin{aligned}&\bigg |d^{2}(\tilde{\pi }(\varvec{\theta }|{\varvec{X}}),\delta _{{\varvec{\theta }}^{*}})-d^{2}(\pi (\varvec{\theta }|{\varvec{X}}),\delta _{{\varvec{\theta }}^{*}})\bigg |\\&\quad =\bigg |\int _{\Theta }\Vert \varvec{\theta }- \varvec{\theta }^{*}\Vert _{2}^{2}\tilde{\pi }(\varvec{\theta }|{\varvec{X}})\mathrm{d}\varvec{\theta }\\&\quad \quad -\int _{\Theta }\Vert \varvec{\theta }- \varvec{\theta }^{*}\Vert _{2}^{2}\pi (\varvec{\theta }|{\varvec{X}})\mathrm{d}\varvec{\theta }\bigg |\\&\quad =\left| \left\| E_{\tilde{\pi }}(\varvec{\theta })-\varvec{\theta }^{*}\right\| _2^2+tr\left( \mathrm{Var}_{\tilde{\pi }}(\varvec{\theta }) \right) \right. \\&\quad \quad \left. -\left\| E_{\pi }(\varvec{\theta })-\varvec{\theta }^{*}\right\| _2^2 -tr\left( \mathrm{Var}_{\pi }(\varvec{\theta }) \right) \right| \\&\quad \le \left| \left\| E_{\tilde{\pi }}(\varvec{\theta })-E_{\pi }(\varvec{\theta })+E_{\pi }(\varvec{\theta })-\varvec{\theta }^{*}\right\| _2^2\right. \\&\left. \quad \quad -\left\| E_{\pi }(\varvec{\theta })-\varvec{\theta }^{*}\right\| _2^2\right| +|tr\left( \mathrm{Var}_{\tilde{\pi }}(\varvec{\theta }) -\mathrm{Var}_{\pi }(\varvec{\theta }) \right) |\\&\quad \le \Vert E_{\tilde{\pi }}(\varvec{\theta })-E_{\pi }(\varvec{\theta })\Vert _2^2+2\Vert E_{\tilde{\pi }}(\varvec{\theta })\\&\quad \quad -\,E_{\pi }(\varvec{\theta })\Vert _{2}\Vert E_{\pi }(\varvec{\theta })-\varvec{\theta }^{*}\Vert _{2}+o_{p}(n^{-1})\\&\quad =O_{p}(m^{-2})+2O_{p}(m^{-1})O_{p}(n^{-1/2})+o_{p}(n^{-1})\\&\quad =o_{p}(n^{-1}), \end{aligned}$$

where the last equality follows from the fact $n=o(m^{2})$. This verifies Eq. (5) in Theorem 1.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Xue, J., Liang, F. Double-Parallel Monte Carlo for Bayesian analysis of big data. Stat Comput 29, 23–32 (2019). https://doi.org/10.1007/s11222-017-9791-1

Download citation

Received: 22 April 2017
Accepted: 17 November 2017
Published: 27 November 2017
Issue Date: 15 January 2019
DOI: https://doi.org/10.1007/s11222-017-9791-1

Double-Parallel Monte Carlo for Bayesian analysis of big data

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Parallel inference for big data with the group Bayesian method

Informed sub-sampling MCMC: approximate Bayesian inference for large datasets

Subsampling sequential Monte Carlo for static Bayesian models

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Appendix A Proof of Theorem 1

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Double-Parallel Monte Carlo for Bayesian analysis of big data

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Parallel inference for big data with the group Bayesian method

Informed sub-sampling MCMC: approximate Bayesian inference for large datasets

Subsampling sequential Monte Carlo for static Bayesian models

Explore related subjects

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Appendix A Proof of Theorem 1

Appendix A Proof of Theorem 1

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation