Bayesian statistics and modelling

A Publisher Correction to this article was published on 03 February 2021

This article has been updated


Bayesian statistics is an approach to data analysis based on Bayes’ theorem, where available knowledge about parameters in a statistical model is updated with the information in observed data. The background knowledge is expressed as a prior distribution and combined with observational data in the form of a likelihood function to determine the posterior distribution. The posterior can also be used for making predictions about future events. This Primer describes the stages involved in Bayesian analysis, from specifying the prior and data models to deriving inference, model checking and refinement. We discuss the importance of prior and posterior predictive checking, selecting a proper technique for sampling from a posterior distribution, variational inference and variable selection. Examples of successful applications of Bayesian analysis across various research fields are provided, including in social sciences, ecology, genetics, medicine and more. We propose strategies for reproducibility and reporting standards, outlining an updated WAMBS (when to Worry and how to Avoid the Misuse of Bayesian Statistics) checklist. Finally, we outline the impact of Bayesian analysis on artificial intelligence, a major goal in the next decade.

Fig. 1: The Bayesian research cycle.
Fig. 2: Illustration of the key components of Bayes’ theorem.
Fig. 3: Prior predictive checking for the PhD delay example.
Fig. 4: Posterior estimation using MCMC for the PhD-delays example.
Fig. 5: Examples of shrinkage priors for Bayesian variable selection.
Fig. 6: Posterior predictive checking and predicted future page views based on current observations.
Fig. 7: Elements of reproducibility in the research workflow.

Change history

  • 03 February 2021

    A Correction to this paper has been published: https://doi.org/10.1038/s43586-021-00017-2.


Acknowledgements

R.v.d.S. was supported by grant NWO-VIDI-452-14-006 from the Netherlands Organization for Scientific Research. R.K. was supported by Leverhulme research fellowship grant reference RF-2019-299 and by The Alan Turing Institute under the EPSRC grant EP/N510129/1. K.M. was supported by a UK Engineering and Physical Sciences Research Council Doctoral Studentship. C.Y. is supported by a UK Medical Research Council Research Grant (Ref. MR/P02646X/1) and by The Alan Turing Institute under the EPSRC grant EP/N510129/1.

Author information

Contributions

Introduction (R.v.d.S.); Experimentation (S.D., D.V., R.v.d.S. and J.W.); Results (R.K., M.G.T., M.V., D.V., K.M., C.Y. and R.v.d.S.); Applications (S.D., R.K., K.M. and C.Y.); Reproducibility and data deposition (B.K., D.V., S.D. and R.v.d.S.); Limitations and optimizations (A.G.); Outlook (K.M. and C.Y.); Overview of the Primer (R.v.d.S.).

Corresponding author

Correspondence to Rens van de Schoot.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information

Nature Reviews Methods Primers thanks D. Ashby, J. Doll, D. Dunson, F. Feinberg, J. Liu, B. Rosenbaum and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Prior distribution

Beliefs held by researchers about the parameters in a statistical model before seeing the data, expressed as probability distributions.

Likelihood function

The conditional probability distribution of the data given the parameters, defined up to a constant.

Posterior distribution

A way to summarize one’s updated knowledge, balancing prior knowledge with observed data.
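As a minimal sketch of how a prior and a likelihood combine into a posterior, consider a conjugate beta-binomial model. This model, the function names `beta_binomial_update` and `beta_mean`, and the numbers used are illustrative assumptions and are not taken from the Primer.

```python
def beta_binomial_update(prior_a, prior_b, successes, trials):
    """Return the Beta posterior parameters after observing binomial data.

    Prior:      theta ~ Beta(prior_a, prior_b)
    Likelihood: successes ~ Binomial(trials, theta)
    Posterior:  theta ~ Beta(prior_a + successes, prior_b + failures)
    """
    failures = trials - successes
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical numbers: a Beta(2, 2) prior and 7 successes in 10 trials.
post_a, post_b = beta_binomial_update(2, 2, successes=7, trials=10)
prior_mean = beta_mean(2, 2)
posterior_mean = beta_mean(post_a, post_b)
```

In this sketch the posterior mean falls between the prior mean (0.5) and the observed proportion (0.7), which illustrates the balancing of prior knowledge and data described in the definition.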


Informativeness

Priors can have different levels of informativeness and can be anywhere on a continuum from complete uncertainty to relative certainty, but we distinguish between diffuse, weakly informative and informative priors.


Hyperparameters

Parameters that define the prior distribution, such as mean and variance for a normal prior.

Prior elicitation

The process by which background information is translated into a suitable prior distribution.

Informative prior

A reflection of a high degree of certainty or knowledge surrounding the population parameters. Hyperparameters are specified to express particular information reflecting a greater degree of certainty about the model parameters being estimated.

Weakly informative prior

A prior incorporating some information about the population parameter but that is less certain than an informative prior.

Diffuse priors

Reflections of complete uncertainty about population parameters.

Improper priors

Prior distributions whose density integrates to infinity rather than to one, so that they are not valid probability distributions.

Prior predictive checking

The process of checking whether the priors make sense by generating data according to the prior in order to assess whether the results are within the plausible parameter space.

Prior predictive distribution

All possible samples that could occur if the model is true based on the priors.

Kernel density estimation

A non-parametric approach used to estimate a probability density function for the observed data.
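A Gaussian kernel density estimate can be sketched in a few lines. The Gaussian kernel, the hand-picked bandwidth and the function name `gaussian_kde` are assumptions made for illustration; other kernels and bandwidth rules are equally valid.

```python
import math


def gaussian_kde(x, data, bandwidth):
    """Evaluate a Gaussian kernel density estimate at the point x."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    # Each observation contributes a normal density centred on itself.
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)


# Hypothetical observations; the bandwidth is chosen by hand for illustration.
observations = [1.2, 1.9, 2.1, 2.4, 3.3]
density_at_2 = gaussian_kde(2.0, observations, bandwidth=0.5)
```

A smaller bandwidth follows the individual observations more closely, whereas a larger one gives a smoother estimate.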

Prior predictive p-value

An estimate to indicate how unlikely the observed data are to be generated by the model based on the prior predictive distribution.
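The prior predictive entries above can be sketched as a small simulation. The model (a normal prior on a mean and normally distributed data), the choice of the sample mean as the summary statistic, and all function names and numbers are hypothetical; they are not the PhD delay example from the Primer.

```python
import random
import statistics


def prior_predictive_means(n_sims, n_obs, prior_mean, prior_sd, data_sd, rng):
    """Simulate data sets from the prior predictive distribution and
    return the sample mean of each simulated data set."""
    sim_means = []
    for _ in range(n_sims):
        mu = rng.gauss(prior_mean, prior_sd)                # parameter drawn from the prior
        y = [rng.gauss(mu, data_sd) for _ in range(n_obs)]  # data drawn given that parameter
        sim_means.append(statistics.fmean(y))
    return sim_means


def prior_predictive_p_value(sim_stats, observed_stat):
    """Proportion of simulated statistics at least as large as the observed one."""
    return sum(s >= observed_stat for s in sim_stats) / len(sim_stats)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
sims = prior_predictive_means(2000, 20, prior_mean=0.0, prior_sd=1.0, data_sd=1.0, rng=rng)
p_typical = prior_predictive_p_value(sims, observed_stat=0.0)
p_extreme = prior_predictive_p_value(sims, observed_stat=5.0)
```

An observed statistic far in the tail of the simulated values, as with `p_extreme` here, suggests that the priors and the data disagree.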

Bayes factor

The ratio of the posterior odds to the prior odds of two competing hypotheses, also calculated as the ratio of the marginal likelihoods under the two hypotheses. It can be used, for example, to compare candidate models, where each model would correspond to a hypothesis.
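As one worked case of a ratio of marginal likelihoods, the sketch below compares a point hypothesis (theta = 0.5) with a Beta prior on theta for binomial data. The hypotheses, the function names and the data are assumptions chosen because both marginal likelihoods are available in closed form.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def bayes_factor_10(successes, trials, prior_a=1.0, prior_b=1.0):
    """Bayes factor for H1: theta ~ Beta(prior_a, prior_b)
    against H0: theta = 0.5, given binomial data."""
    failures = trials - successes
    n_choose_k = math.comb(trials, successes)
    # Marginal likelihood under H1: the binomial likelihood averaged over the prior.
    marginal_h1 = n_choose_k * math.exp(
        log_beta(prior_a + successes, prior_b + failures) - log_beta(prior_a, prior_b)
    )
    # Marginal likelihood under H0: the likelihood at the single point theta = 0.5.
    marginal_h0 = n_choose_k * 0.5 ** trials
    return marginal_h1 / marginal_h0


# Hypothetical data: 9 successes in 10 trials, uniform Beta(1, 1) prior under H1.
bf10 = bayes_factor_10(successes=9, trials=10)
```

For these numbers the marginal likelihoods are 1/11 under H1 and 10/1024 under H0, so the Bayes factor is 1024/110, roughly 9.3 in favour of H1.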

Credible interval

An interval that contains a parameter with a specified probability. The bounds of the interval are the upper and lower percentiles of the parameter’s posterior distribution. For example, a 95% credible interval has the 2.5th and 97.5th percentiles of the posterior distribution as its bounds.
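Given draws from a posterior distribution, an equal-tailed credible interval can be read off the sorted draws. The sketch below uses simple order statistics as percentile estimates, and standard normal draws stand in for real posterior samples; both choices, and the name `equal_tailed_interval`, are assumptions.

```python
import random


def equal_tailed_interval(draws, prob=0.95):
    """Equal-tailed credible interval from posterior draws, using
    order statistics as percentile estimates."""
    ordered = sorted(draws)
    tail = (1.0 - prob) / 2.0
    lower_index = int(round(tail * (len(ordered) - 1)))
    upper_index = int(round((1.0 - tail) * (len(ordered) - 1)))
    return ordered[lower_index], ordered[upper_index]


# Stand-in posterior draws; in practice these would come from a sampler.
rng = random.Random(0)
posterior_draws = [rng.gauss(0.0, 1.0) for _ in range(20000)]
lower, upper = equal_tailed_interval(posterior_draws, prob=0.95)
```

For a standard normal posterior the bounds should land near -1.96 and 1.96, up to simulation error.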

Closed form

A mathematical expression that can be written using a finite number of standard operations.

Marginal posterior distribution

Probability distribution of a parameter or subset of parameters within the posterior distribution, irrespective of the values of other model parameters. It is obtained by integrating out the other model parameters from the joint posterior distribution.

Markov chain Monte Carlo

(MCMC). A method to indirectly obtain inference on the posterior distribution by simulation. The Markov chain is constructed such that its corresponding stationary distribution is the posterior distribution of interest. Once the chain has reached the stationary distribution, realizations can be regarded as a dependent set of sampled parameter values from the posterior distribution. These sampled parameter values can then be used to obtain empirical estimates of the posterior distribution, and associated summary statistics of interest, using Monte Carlo integration.

Markov chain

An iterative process whereby the values of the Markov chain at time t + 1 are only dependent on the values of the chain at time t.

Monte Carlo

A stochastic algorithm for approximating integrals using the simulation of random numbers from a given distribution. In particular, for sampled values from a distribution, the associated empirical value of a given statistic is an estimate of the corresponding summary statistic of the distribution.
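Monte Carlo integration amounts to averaging a function over simulated values. The sketch below estimates E[X^2] for a standard normal X, a quantity known to equal 1, so the approximation can be checked; the function name `monte_carlo_expectation` and the target are assumptions for illustration.

```python
import random
import statistics


def monte_carlo_expectation(func, sampler, n_draws):
    """Approximate E[func(X)] by averaging func over simulated draws of X."""
    return statistics.fmean(func(sampler()) for _ in range(n_draws))


rng = random.Random(42)
# E[X^2] for X ~ Normal(0, 1) equals the variance, which is 1.
estimate = monte_carlo_expectation(lambda x: x * x, lambda: rng.gauss(0.0, 1.0), 50000)
```

The error of such an estimate shrinks in proportion to one over the square root of the number of draws when the draws are independent.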

Transition kernel

The updating procedure of the parameter values within a Markov chain.
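The Markov chain, Monte Carlo and transition kernel entries can be tied together with a random-walk Metropolis sampler, one of the simplest MCMC algorithms. This is a sketch under stated assumptions, not the sampler used in the Primer: the target is a standard normal standing in for a posterior, and the names `log_target` and `random_walk_metropolis`, the step size and the burn-in length are hypothetical.

```python
import math
import random


def log_target(theta):
    """Unnormalized log density of the target; a standard normal
    stands in for a posterior distribution here."""
    return -0.5 * theta * theta


def random_walk_metropolis(log_density, start, step_sd, n_iter, rng):
    """Random-walk Metropolis sampler for a one-dimensional target."""
    chain = [start]
    current = start
    current_logp = log_density(current)
    for _ in range(n_iter):
        # Transition kernel: propose a move that depends only on the current value.
        proposal = current + rng.gauss(0.0, step_sd)
        proposal_logp = log_density(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_logp - current_logp))
        if rng.random() < accept_prob:
            current, current_logp = proposal, proposal_logp
        chain.append(current)
    return chain


rng = random.Random(7)
chain = random_walk_metropolis(log_target, start=5.0, step_sd=1.0, n_iter=20000, rng=rng)
kept = chain[2000:]  # discard early iterations before the chain settles
```

Because each value depends on the previous one, the retained values are dependent draws; summaries computed from them are Monte Carlo estimates of the corresponding posterior quantities.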

Auxiliary variables

Additional variables entered in a model such that the joint distribution is available in closed form and quick to evaluate.

Trace plots

Plots describing the posterior parameter value at each iteration of the Markov chain (on the y axis) against the iteration number (on the x axis).

\(\hat{R}\) statistic

The ratio of within-chain and between-chain variability. Values close to one for all parameters and quantities of interest suggest the Markov chain Monte Carlo algorithm has sufficiently converged to the stationary distribution.
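One common way to compute this diagnostic is the potential scale reduction factor, which compares a pooled variance estimate with the average within-chain variance. The sketch below assumes equal-length chains and uses that formulation; the function name `r_hat` and the simulated chains are hypothetical, and software packages may use refined variants.

```python
import math
import random
import statistics


def r_hat(chains):
    """Potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average of the within-chain sample variances.
    within = statistics.fmean(statistics.variance(c) for c in chains)
    # B / n: sample variance of the chain means.
    between_over_n = statistics.variance(chain_means)
    # Pooled variance estimate combining both sources of variability.
    pooled = (n - 1) / n * within + between_over_n
    return math.sqrt(pooled / within)


rng = random.Random(3)
# Four chains exploring the same distribution.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Four chains stuck in two different regions.
stuck = [[rng.gauss(shift, 1.0) for _ in range(1000)] for shift in (0.0, 0.0, 3.0, 3.0)]
r_hat_mixed = r_hat(mixed)
r_hat_stuck = r_hat(stuck)
```

Chains that agree give a value close to one, whereas chains centred on different values give a clearly larger value.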

Variational inference

A technique to build approximations to the true Bayesian posterior distribution using combinations of simpler distributions whose parameters are optimized to make the approximation as close as possible to the actual posterior.

Approximating distribution

In the context of posterior inference, replacing a potentially complicated posterior distribution with a simpler distribution that is easy to evaluate and sample from. For example, in variational inference, it is common to approximate the true posterior with a Gaussian distribution.

Stochastic gradient descent

An algorithm that uses a randomly chosen subset of data points to estimate the gradient of a loss function with respect to parameters, providing computational savings in optimization problems involving many data points.
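A minimal sketch of minibatch stochastic gradient descent follows, fitting a single slope by least squares on synthetic data. The loss, learning rate, batch size and the function name `sgd_slope` are assumptions chosen to keep the example small.

```python
import random


def sgd_slope(xs, ys, learning_rate, batch_size, n_steps, rng):
    """Fit y ~ w * x by minibatch stochastic gradient descent on squared error."""
    w = 0.0
    indices = range(len(xs))
    for _ in range(n_steps):
        batch = rng.sample(indices, batch_size)  # random subset of data points
        # Gradient of the mean squared error, estimated from the batch only.
        grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        w -= learning_rate * grad
    return w


rng = random.Random(11)
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]  # synthetic data with true slope 2
w_hat = sgd_slope(xs, ys, learning_rate=0.1, batch_size=32, n_steps=500, rng=rng)
```

Each step touches only 32 of the 1,000 data points, which is where the computational saving comes from.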


Multicollinearity

A situation that arises in a regression model when a predictor can be linearly predicted with high accuracy from the other predictors in the model. This causes numerical instability in the estimation of parameters.

Shrinkage priors

Prior distributions for a parameter that shrink its posterior estimate towards a particular value.


Sparsity

A situation where most parameter values are zero and only a few are non-zero.

Spike-and-slab prior

A shrinkage prior distribution used for variable selection specified as a mixture of two distributions, one peaked around zero (spike) and the other with a large variance (slab).
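Drawing from such a mixture shows its shape. The sketch below assumes a continuous version in which both components are normal distributions centred at zero; a point mass at zero is another common choice for the spike. The mixing weight, the standard deviations and the name `draw_spike_and_slab` are hypothetical.

```python
import random


def draw_spike_and_slab(p_slab, spike_sd, slab_sd, rng):
    """Draw one coefficient from a two-component normal mixture prior."""
    if rng.random() < p_slab:
        return rng.gauss(0.0, slab_sd)  # slab: wide, allows large effects
    return rng.gauss(0.0, spike_sd)     # spike: tightly concentrated at zero


rng = random.Random(5)
draws = [draw_spike_and_slab(0.2, spike_sd=0.01, slab_sd=5.0, rng=rng) for _ in range(10000)]
share_near_zero = sum(abs(b) < 0.05 for b in draws) / len(draws)
largest = max(abs(b) for b in draws)
```

With a slab weight of 0.2, roughly 80% of the draws sit very close to zero while the remainder spread widely, which is the behaviour that makes the prior useful for variable selection.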

Continuous shrinkage prior

A unimodal prior distribution for a parameter that promotes shrinkage of its posterior estimate towards zero.

Global–local shrinkage prior

A continuous shrinkage prior distribution characterized by a high concentration around zero to shrink small parameter values to zero and heavy tails to prevent excessive shrinkage of large parameter values.

Horseshoe prior

An example of a global–local shrinkage prior for variable selection that uses a half-Cauchy scale mixture of normal distributions.
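The half-Cauchy scale mixture can be sketched by first drawing a local scale and then drawing the coefficient from a normal distribution with that scale. The global scale is held fixed at 1 here as a simplifying assumption (it is often given its own prior), and the function names are hypothetical.

```python
import math
import random


def draw_half_cauchy(scale, rng):
    """Draw from a half-Cauchy(0, scale) distribution via the inverse CDF."""
    return scale * math.tan(0.5 * math.pi * rng.random())


def draw_horseshoe(global_scale, rng):
    """Draw one coefficient: normal with a half-Cauchy local scale."""
    local_scale = draw_half_cauchy(1.0, rng)
    return rng.gauss(0.0, global_scale * local_scale)


def share(draws, condition):
    """Fraction of draws that satisfy a condition."""
    return sum(condition(b) for b in draws) / len(draws)


rng = random.Random(9)
horseshoe = [draw_horseshoe(1.0, rng) for _ in range(20000)]
normal = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # reference: a plain normal prior

near_zero_hs = share(horseshoe, lambda b: abs(b) < 0.1)
near_zero_n = share(normal, lambda b: abs(b) < 0.1)
tail_hs = share(horseshoe, lambda b: abs(b) > 3.0)
tail_n = share(normal, lambda b: abs(b) > 3.0)
```

Compared with the normal reference, the horseshoe draws should show both more mass near zero and more mass in the tails, matching the description of global–local shrinkage priors.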


Autoencoder

A particular type of multilayer neural network used for unsupervised learning consisting of two components: an encoder and a decoder. The encoder compresses the input information into low-dimensional summaries of the inputs. The decoder takes these summaries and attempts to recreate the inputs from them. By training the encoder and decoder simultaneously, the hope is that the autoencoder learns low-dimensional, but highly informative, representations of the data.


Split-\(\hat{R}\)

To detect non-stationarity within individual Markov chain Monte Carlo chains (for example, if the first part shows gradually increasing values whereas the second part involves gradually decreasing values), each chain is split into two parts for which the \(\hat{R}\) statistic is computed and compared.


Amortization

A technique used in variational inference to reduce the number of free parameters to be estimated in a variational posterior approximation by replacing the free parameters with a trainable prediction function that can instead predict the values of these parameters.

Cite this article

van de Schoot, R., Depaoli, S., King, R. et al. Bayesian statistics and modelling. Nat Rev Methods Primers 1, 1 (2021). https://doi.org/10.1038/s43586-020-00001-2

  • DOI: https://doi.org/10.1038/s43586-020-00001-2


