Bayesian Non-Parametric Inference for Multivariate
Peaks-over-Threshold Models
Peter Trubey * and Bruno Sansó
Abstract: We consider a constructive definition of the multivariate Pareto that factorizes the random
vector into a radial component and an independent angular component. The former follows a
univariate Pareto distribution, and the latter is defined on the surface of the positive orthant of the
infinity norm unit hypercube. We propose a method for inferring the distribution of the angular
component by identifying its support as the limit of the positive orthant of the unit p-norm spheres
and introduce a projected gamma family of distributions defined through the normalization of a
vector of independent random gammas to the space. This serves to construct a flexible family of
distributions obtained as a Dirichlet process mixture of projected gammas. For model assessment, we
discuss scoring methods appropriate to distributions on the unit hypercube. In particular, working
with the energy score criterion, we develop a kernel metric that produces a proper scoring rule
and present a simulation study to compare different modeling choices using the proposed metric.
Using our approach, we describe the dependence structure of extreme values in integrated vapor
transport (IVT) data, which describe the flow of atmospheric moisture along the coast of California. We
find clear but heterogeneous geographical dependence.
Keywords: multivariate extremes; peaks-over-threshold models; Bayesian non-parametric models;
Dirichlet process mixtures
During the last decade or so, much work has been done in the exploration of the
definition and properties of an appropriate generalization of the univariate GP distribution
to a multivariate setting. To mention some of the papers on the topic, the work of [3] defines
the generalized Pareto distribution, with further analysis of these classes of distributions
presented in [4,5]. A recent review of the state of the art in multivariate peaks over threshold
modeling using generalized Pareto is provided in [6] while [7] provides insight on the
theoretical properties of possible parametrizations. These are used in [8] for likelihood-
based models for PoT estimation. A frequently used method for describing dependence in
multivariate distributions is to use a copula. Refs. [9,10] provide successful examples of
this approach in an EVT framework.
Ref. [11] presents a constructive definition of the Pareto process, which generalizes
the GP to an infinite-dimensional setting. It consists of decomposing the process into
independent radial and angular components. Such an approach can be used in the finite-
dimensional case, where the angular component contains information pertaining to the
dependence structure of the random vector. Based on this definition, we present a novel
approach for modeling the angular component with families of distributions that provide
flexibility and can be applied in a moderately large dimensional setting. Our focus on
the angular measure is similar to that in [12–14], which consider Bayesian non-parametric
approaches. Yet, our approach differs in that it is established in the peaks-over-threshold
regime and uses a constructive definition of the multivariate GP based on the infinity norm.
The approach proposed in this paper adds to the growing literature on Bayesian models for
multivariate extreme value analysis (see, for example, [12–15]), providing a model that has
strong computational advantages due to its structural simplicity, achieves flexibility using a
mixture model, and scales well to moderately large dimensions.
The remainder of this paper is outlined as follows. Section 2 comprises a brief review
of multivariate PoT, detailing the separation of the radial measure from the angular
measure. Section 3 details our approach for estimating the angular measure, based on
transforming an arbitrary distribution supported in $\mathbb{R}^d_+$ onto unit hyperspheres defined
by p-norms. Section 4 develops criteria for model selection in the support of the angular
measure. Section 5 explores the efficacy of the proposed approach on a set of simulated data,
and, acknowledging the relevance of extreme value theory to climatological events [16–18],
estimates the extremal dependence structure for a measure of water vapor flow in the
atmosphere, used for identifying atmospheric rivers. Finally, Section 6 presents our conclusions
and discussion.
Throughout the paper, we adopt the operators ∧ to denote minima and ∨ to denote
maxima. Thus, $\wedge_i s_i = \min_i s_i$ and $\vee_i s_i = \max_i s_i$. These operators can also be
applied component-wise between vectors, such as $a \wedge b = (a_1 \wedge b_1, a_2 \wedge b_2, \ldots)$. Similarly,
we apply inequality and arithmetic operators to vectors, for example, $a \leq b$, and
interpret them component-wise. We use uppercase to indicate random variables, lowercase
to indicate points, and boldface to indicate vectors or matrices thereof.
where $a_n^{-1}$ indicates element-wise inversion, and $\{W \not\leq b_n\}$ denotes the set where at least
one coordinate is above the corresponding component of $b_n$. H is a multivariate Pareto
distribution. It corresponds to a joint distribution conditional on exceeding a multivariate
threshold. H is defined by a non-parametric function governing the multivariate depen-
dence and two d-dimensional vectors of parameters that control the shapes and scales of the
marginals. We denote these as ξ for the shapes and σ for the scales. Ref. [7] provides a num-
ber of stochastic representations for H. In this paper, we focus on a particular one that is
proposed in [11]. To this end, we denote as Z a random variable with distribution H where
ξ = 1 and σ = 0. Then, Z = RV, where R and V are independent. $R = \|Z\|_\infty = \vee_{i=1}^{d} Z_i$ is
distributed as a standard Pareto random variable, and $V = Z/\|Z\|_\infty$ is a random vector
in $\mathbb{S}_\infty^{d-1}$, the positive orthant of the unit sphere under the $L_\infty$ norm, with distribution Φ. This
representation is central to the methods proposed in this paper. R and V are referred to,
respectively, as the radial and angular components of H. The angular measure controls the
dependence structure of Z in the tails. In view of this, to obtain a PoT model, we seek a
flexible model for the distribution of $V \in \mathbb{S}_\infty^{d-1}$, based on a Bayesian non-parametric model.
The approach considered in [6] focuses on the limiting conditional distribution H.
An alternative approach to obtaining a limiting PoT distribution consists of assuming
that regular variation (see, for example, [19]) holds for the limiting distribution of W,
implying that
$$\lim_{n \to \infty} n \Pr\left[n^{-1} W \in A\right] = \mu(A),$$
for some measure µ that is referred to as the exponent measure. µ has the homogeneity
property $\mu(tA) = t^{-1}\mu(A)$. Letting $\rho = \|W\|_p$, $p > 0$, and $\theta = W/\rho$, define $\Psi(B) = \mu(\{w : \rho > 1, \theta \in B\})$,
which is referred to as the angular measure. After some manipulations, we
obtain that
$$\lim_{r \to \infty} \Pr[\theta \in A \mid \rho > r] = \frac{\Psi(A)}{\Psi(\mathbb{S}_p^{d-1})}. \qquad (1)$$
Thus, a model for the exponent measure induces a model for the limiting distribution
conditional on the observations being above a threshold defined with respect to their
p-norm. The constraint that all marginals of µ correspond to a standard Pareto distribution
leads to the so-called moment constraints on Ψ, consisting of
$$\int_{\mathbb{S}_p^{d-1}} w_i \, d\Psi(w) = \frac{1}{d}, \qquad i = 1, \ldots, d.$$
Inference for the limiting distribution of the exceedances needs to account for the
normalizing constant in Equation (1) as well as the moment constraints. Because of these issues, in
this paper, we prefer to follow the limiting conditional distribution approach. An example
of the application of the regular variation approach using p = 1 is developed in [13].
where $(\cdot)_+$ indicates the positive part function. We set $b_{t,\ell} = \hat{F}_\ell^{-1}(1 - 1/t)$, the empirical
$(1 - 1/t)$-quantile. We then estimate $\xi_\ell$ and $\sigma_\ell$, for each ℓ, using likelihood-based
methods. To estimate the angular distribution, we standardize each of the marginals. The
standardization yields
$$z_{i\ell} = \left(1 + \xi_\ell\, \frac{w_{i\ell} - b_{t,\ell}}{\sigma_\ell}\right)_+^{1/\xi_\ell}. \qquad (2)$$
Note that $z_{i\ell} > 1$ implies that $w_{i\ell} > b_{t,\ell}$, meaning that the observation $w_i$ is extreme in the
ℓ-th dimension. Thus, $r_i = \|z_i\|_\infty > 1$ implies that at least one dimension has an extreme
observation and corresponds to a very extreme observation when t is large. We focus on
the observations that are such that $r_i > 1$. These provide a sub-sample of the standardized
original sample. We define $v_i = z_i / r_i \in \mathbb{S}_\infty^{d-1}$. These vectors are used for the estimation
of Φ.
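As an illustration of this step, the following is a minimal NumPy sketch (our own, not the authors' code) of the standardization in Equation (2) and the extraction of the sub-sample with $r_i > 1$; it assumes the marginal GP fits have already produced the thresholds, shapes, and scales, with positive shape parameters.

```python
import numpy as np

def standardize_margins(W, b_t, xi, sigma):
    """Equation (2): z = (1 + xi * (w - b_t) / sigma)_+^(1 / xi), margin by margin."""
    arg = 1.0 + xi * (W - b_t) / sigma          # broadcasts over the d columns
    return np.maximum(arg, 0.0) ** (1.0 / xi)   # positive part, then power

def angular_subsample(W, b_t, xi, sigma):
    """Keep observations with r = ||z||_inf > 1 and return their (r, v) decomposition."""
    Z = standardize_margins(W, b_t, xi, sigma)
    r = Z.max(axis=1)                           # radial component (L-infinity norm)
    keep = r > 1.0
    V = Z[keep] / r[keep, None]                 # angular component on S_inf^{d-1}
    return r[keep], V
```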
The absolute and Euclidean norms are obtained for p = 1 and p = 2, respectively, and the
$L_\infty$ norm can be obtained as a limit:
$$\|x\|_\infty = \lim_{p \to \infty} \|x\|_p = \bigvee_{\ell=1}^{d} x_\ell.$$
is invertible with
$$T^{-1}(r, y_1, \ldots, y_{d-1}) = \left(r y_1, \ldots, r y_{d-1},\; r\left(1 - \sum_{\ell=1}^{d-1} y_\ell^p\right)^{1/p}\right). \qquad (4)$$
The normalization provided by T maps a vector in $\mathbb{R}^d_+$ onto $\mathbb{S}_p^{d-1}$. With a slight abuse of
terminology, we refer to it as a projection. Using Equations (3)–(5), we have the joint density
$$f(r, y) = \left[\prod_{\ell=1}^{d} \frac{\beta_\ell^{\alpha_\ell}}{\Gamma(\alpha_\ell)}\, (r y_\ell)^{\alpha_\ell - 1} \exp\{-\beta_\ell r y_\ell\}\right] \times r^{d-1}\left[y_d + \sum_{\ell=1}^{d-1} y_\ell^p\, y_d^{1-p}\right], \qquad (6)$$
defined for $y \in \mathbb{S}_p^{d-1}$ and for any finite p > 0. To avoid identifiability problems when
estimating the shape and scale parameters, we set $\beta_1 = 1$. Ref. [20] obtains the density in
Equation (7) for p = 2 as a multivariate distribution for directional data using spherical
coordinates. For $y \in \mathbb{S}_1^{d-1}$ and $\beta_\ell = \beta$ for all ℓ, the density in Equation (7) corresponds to
that of a Dirichlet distribution.
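The constructive definition also makes simulation from a projected gamma straightforward: draw independent gammas and divide by their p-norm. A minimal sketch under a shape/rate parameterization (the function names are ours):

```python
import numpy as np

rng = np.random.default_rng(42)

def r_projected_gamma(n, alpha, beta, p=10):
    """Draw n vectors on S_p^{d-1} by normalizing independent Gamma(alpha_l, rate=beta_l) draws."""
    X = rng.gamma(shape=alpha, scale=1.0 / beta, size=(n, len(alpha)))
    norm_p = (X ** p).sum(axis=1) ** (1.0 / p)
    return X / norm_p[:, None]

# e.g., 500 draws in d = 3 dimensions on the positive orthant of the p = 10 sphere
Y = r_projected_gamma(500, alpha=np.array([2.0, 0.5, 1.0]), beta=np.array([1.0, 1.0, 2.0]))
```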
The projected gamma family is simple to specify and has very tractable computational
properties. Thus, we use it as a building block for models of the angular measure Φ. To
build a flexible family of distributions in $\mathbb{S}_p^{d-1}$, we consider mixtures of projected gamma
densities defined as
$$f(y) = \int_{\Theta} PG(y \mid \theta)\, dG(\theta), \qquad (8)$$
where θ = (α, β). Following a Bayesian non-parametric approach [21–23], we assume that
G is drawn from a random measure. In particular, assuming a Dirichlet process prior for G,
we have a hierarchical formulation of the mixture model that, for a vector of observations
$y_i$, is given by
$$y_i \sim PG(y_i \mid \theta_i), \qquad \theta_i \sim G, \qquad G \sim \mathcal{DP}(\eta, G_0), \qquad (9)$$
where $\mathcal{DP}$ denotes a Dirichlet process, η is the precision parameter, and $G_0$ is the centering
distribution.
Unfortunately, in the limit when p → ∞, the normalizing transformation is not differentiable.
Thus, a closed-form expression like Equation (7) for the projected gamma density
on $\mathbb{S}_\infty^{d-1}$ is not available. Instead, we observe that, for a sufficiently large p, $\mathbb{S}_p^{d-1}$ will
approach $\mathbb{S}_\infty^{d-1}$. With that in mind, our strategy consists of describing the angular distribution
Φ using a sample-based approach with the following steps: (i) apply the transformation in
Equation (2) to the original data; (ii) obtain the subsample of the standardized observations
that satisfy R > 1; (iii) take a finite p and project the observations onto $\mathbb{S}_p^{d-1}$; (iv) fit the
model in Equation (8) to the resulting data and obtain samples from the fitted model;
(v) project the resulting samples onto $\mathbb{S}_\infty^{d-1}$. For step (iv), we use a Bayesian approach that is
implemented using a purposely developed Markov chain Monte Carlo sampler described
in the next section.
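Steps (iii) and (v) of this strategy are plain normalizations; a sketch, assuming the standardized exceedances and the posterior predictive draws are stored row-wise in arrays:

```python
import numpy as np

def project_to_p_sphere(Z, p=10):
    """Step (iii): project vectors in R^d_+ onto S_p^{d-1} by dividing by their p-norm."""
    return Z / np.linalg.norm(Z, ord=p, axis=1, keepdims=True)

def project_to_inf_sphere(Y):
    """Step (v): map fitted or posterior predictive samples onto S_inf^{d-1}."""
    return Y / Y.max(axis=1, keepdims=True)   # components are positive, so max = L-infinity norm
```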
[Figure: boundaries of the positive orthant of the unit p-norm sphere in two dimensions for p = 1, 2, 5, and ∞; both axes range from 0 to 1.]
χ12 provides information about the distribution of extremes for the variable Z1 given that
Z2 is very large. When χ12 > 0, Z1 and Z2 are said to be asymptotically dependent; other-
wise, they are asymptotically independent. The following result provides the asymptotic
dependence coefficient between two components of Z for our proposed PoT model.
Proposition 1. Suppose that Z = RV with R ∼ Pa(1), $\Pr[V_\ell > 0] = 1$, and $E[V_\ell]$ exists, for
ℓ = 1, . . . , d. Then
$$\chi_{\jmath\ell} = E\left[\frac{V_\jmath}{E[V_\jmath]} \wedge \frac{V_\ell}{E[V_\ell]}\right]. \qquad (10)$$
Proof. Denote as $F_\ell$ the marginal distribution of $Z_\ell$. To obtain $\chi_{\jmath\ell}$, we need $\Pr(Z_\jmath > z_\jmath, Z_\ell > z_\ell)$,
where $z_\ell = F_\ell^{-1}(u) = E[V_\ell]/(1-u)$, ℓ = 1, . . . , d. Using the fact that $V_\ell > 0$ for all ℓ almost
surely, we have that the former is equal to
$$\Pr\left[R > \frac{z_\jmath}{V_\jmath} \vee \frac{z_\ell}{V_\ell}\right] = E\left[1 \wedge \left(\frac{z_\jmath}{V_\jmath} \vee \frac{z_\ell}{V_\ell}\right)^{-1}\right] = E\left[\frac{V_\jmath}{z_\jmath} \wedge \frac{V_\ell}{z_\ell}\right] = (1-u)\, E\left[\frac{V_\jmath}{E[V_\jmath]} \wedge \frac{V_\ell}{E[V_\ell]}\right],$$
where the second identity is justified by the fact that $V_\ell$ is bounded and $z_\ell \to \infty$. The proof
is completed by noting that $\Pr[F_\ell(Z_\ell) > u] = 1 - u$.
Equation (10) implies that $\chi_{\jmath\ell} > 0$, and so no asymptotic independence is possible
under our proposed model. For the analysis of extreme values, it is of interest to calculate
the multivariate conditional survival function. The following result provides the relevant
expression as a function of the angular measure.
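Equation (10) lends itself to a simple Monte Carlo approximation once samples of the angular component V are available (for instance, posterior predictive draws); a sketch:

```python
import numpy as np

def chi_matrix(V):
    """Monte Carlo estimate of Equation (10): chi_{jl} = E[ V_j/E[V_j]  min  V_l/E[V_l] ],
    computed from an (n x d) array of angular samples."""
    S = V / V.mean(axis=0)                     # scale each column by its sample mean
    d = S.shape[1]
    chi = np.empty((d, d))
    for j in range(d):
        for l in range(d):
            chi[j, l] = np.minimum(S[:, j], S[:, l]).mean()
    return chi
```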
vation i. Under this model, the probability of cluster membership for a given observation is
proportional to
$$\Pr[\delta_i = j \mid \ldots] \propto \begin{cases} n_j^{(-i)}\, PG\!\left(y_i \mid \alpha_j, \beta_j\right), & j = 1, \ldots, J^{(-i)}, \\[4pt] \eta \int PG\!\left(y_i \mid \alpha_j, \beta_j\right) dG_0(\alpha_j, \beta_j), & j = J^{(-i)} + 1, \end{cases}$$
where the top case applies to extant clusters $j = 1, \ldots, J^{(-i)}$, and the bottom case
is for a new cluster. If $G_0$ is not a conjugate prior for the kernel density, the integral in
the above formula may not be available in closed form. We sidestep this using Algorithm 8
from [25]: by Monte Carlo integration, we draw m candidate clusters, $(\alpha_j, \beta_j)$ for
$j = J^{(-i)} + 1, \ldots, J^{(-i)} + m$, from $G_0$. Then, we sample the cluster indicator $\gamma_i$ from extant
or candidate clusters, where the probability of cluster membership is proportional to
$$\Pr[\delta_i = j \mid \ldots] \propto \begin{cases} n_j^{(-i)}\, PG\!\left(y_i \mid \alpha_j, \beta_j\right), & j = 1, \ldots, J^{(-i)}, \\[4pt] \dfrac{\eta}{m}\, PG\!\left(y_i \mid \alpha_j, \beta_j\right), & j = J^{(-i)} + 1, \ldots, J^{(-i)} + m. \end{cases} \qquad (12)$$
Again, the top case applies to extant clusters, and the bottom case to the new candidate
clusters. If a candidate cluster is selected, then $\gamma_i = J = J^{(-i)} + 1$, and
the associated cluster parameters are saved.
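A sketch of this reassignment step is given below. The kernel density, the draw from $G_0$, and the bookkeeping are left generic (log_kernel and draw_from_G0 are placeholders of ours), extant clusters are assumed non-empty after removing observation i, and pruning of emptied clusters is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def reassign_observation(y_i, clusters, counts_minus_i, eta, m, log_kernel, draw_from_G0):
    """One Algorithm-8 update (Equation (12)): sample a label for y_i among the J extant
    clusters and m fresh candidate clusters drawn from the centering distribution G0."""
    J = len(clusters)
    candidates = clusters + [draw_from_G0() for _ in range(m)]
    log_w = np.array([log_kernel(y_i, theta) for theta in candidates])
    log_w[:J] += np.log(counts_minus_i)        # weight n_j^{(-i)} for extant clusters
    log_w[J:] += np.log(eta / m)               # weight eta / m for each candidate
    w = np.exp(log_w - log_w.max())            # normalize on the log scale for stability
    j = rng.choice(J + m, p=w / w.sum())
    return j, candidates[j]
```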
A key feature of the projected gamma distribution is its computational properties.
We augment $PG(y_i \mid \alpha_i, \beta_i)$ by introducing a latent radial component $r_i$ for each observation.
Using Equation (6), we observe that the full conditional of $r_i$ is easy to sample from, as
it is given by
$$r_i \mid \alpha_i, \beta_i \sim \mathcal{G}\left(r_i \,\middle|\, \sum_{\ell=1}^{d} \alpha_{i\ell},\; \sum_{\ell=1}^{d} \beta_{i\ell}\, y_{i\ell}\right). \qquad (13)$$
$$f(\alpha_j, \beta_j \mid Y, r, \delta, \ldots) \propto \prod_{i:\gamma_i = j} \prod_{\ell=1}^{d} \mathcal{G}\left(r_i y_{i\ell} \mid \alpha_{j\ell}, \beta_{j\ell}\right) \times dG_0(\alpha_j, \beta_j). \qquad (14)$$
Note that the ordering of the products can be reversed in Equation (14), indicating that
with appropriate choice of centering distribution, the full conditionals for α j , β j can become
separable by dimension, and thus inference on α jℓ , β jℓ can be done in parallel for all j, ℓ.
We first consider a centering distribution given by a product of independent Gammas:
$$G_0(\alpha_j, \beta_j \mid \xi, \tau, \zeta, \sigma) = \prod_{\ell=1}^{d} \mathcal{G}(\alpha_{j\ell} \mid \xi_\ell, \tau_\ell) \times \prod_{\ell=2}^{d} \mathcal{G}(\beta_{j\ell} \mid \zeta_\ell, \sigma_\ell). \qquad (15)$$
Samples of $\alpha_{j\ell}$ can thus be obtained using a Metropolis step. In our analysis, we first
transform $\alpha_{j\ell}$ to the log scale and use a normal proposal density. The full conditional for
β is
$$\beta_{j\ell} \mid r, Y, \alpha, \zeta_\ell, \sigma_\ell \sim \mathcal{G}\left(\beta_{j\ell} \,\middle|\, n_j \alpha_{j\ell} + \zeta_\ell,\; \sum_{i:\gamma_i = j} r_i y_{i\ell} + \sigma_\ell\right), \qquad (18)$$
where $n_j$ denotes the number of observations currently assigned to cluster j.
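Both conjugate updates reduce to direct gamma draws; a sketch under our shape/rate convention, where alpha and beta hold each observation's currently assigned cluster parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def update_latent_radii(Y, alpha, beta):
    """Equation (13): r_i | ... ~ Gamma(sum_l alpha_il, rate = sum_l beta_il y_il)."""
    shape = alpha.sum(axis=1)
    rate = (beta * Y).sum(axis=1)
    return rng.gamma(shape=shape, scale=1.0 / rate)

def update_beta_jl(r_j, Y_j, alpha_jl, zeta_l, sigma_l, col):
    """Equation (18): beta_jl | ... ~ Gamma(n_j alpha_jl + zeta_l, sum_i r_i y_il + sigma_l)."""
    n_j = Y_j.shape[0]
    shape = n_j * alpha_jl + zeta_l
    rate = (r_j * Y_j[:, col]).sum() + sigma_l
    return rng.gamma(shape=shape, scale=1.0 / rate)
```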
$$G_0(\alpha_j, \beta_j \mid \mu, \Sigma, \zeta, \sigma) = \mathcal{LN}(\alpha_j \mid \mu, \Sigma) \times \prod_{\ell=2}^{d} \mathcal{G}(\beta_{j\ell} \mid \zeta_\ell, \sigma_\ell). \qquad (19)$$
This model is completed with a normal prior on µ, an inverse Wishart prior on Σ, and
Gamma priors on ζ ℓ , σℓ , and η. This model is denoted as the projected gamma-log-normal
(PG-LN) model. We also explore a restricted Gamma form of this model as above, where
β ℓ = 1 for all ℓ. This is denoted as the projected restricted gamma-log-normal (PRG-LN) model.
Updates for α can be accomplished using a joint Metropolis step, where $\beta_{j\ell}$ for ℓ = 2, . . . , d
have been integrated out of the log-density. That is,
$$\pi(\alpha_j \mid Y, r, \delta, \mu, \Sigma, \zeta, \sigma) \propto \exp\left\{-\tfrac{1}{2}(\log \alpha_j - \mu)^{T} \Sigma^{-1} (\log \alpha_j - \mu)\right\} \times \frac{1}{\prod_{\ell=1}^{d} \alpha_{j\ell}} \times \frac{\left(\prod_{i:\gamma_i = j} r_i y_{i1}\right)^{\alpha_{j1} - 1}}{\prod_{\ell=1}^{d} \Gamma^{n_j}(\alpha_{j\ell})} \times \prod_{\ell=2}^{d} \frac{\Gamma\!\left(n_j \alpha_{j\ell} + \zeta_\ell\right)}{\left(\sum_{i:\gamma_i = j} r_i y_{i\ell} + \sigma_\ell\right)^{n_j \alpha_{j\ell} + \zeta_\ell}}.$$
The inferential forms for β jℓ and its priors are the same as for PG-G. The normal prior for µ
is conjugate for the log-normal α j and can be sampled via a Gibbs step. Finally, the inverse
Wishart prior for Σ is again conjugate to the log-normal α j , implying that it can also be
sampled via a Gibbs step.
To effectively explore the sample space with a joint Metropolis step, as well as to speed
convergence, we implement a parallel tempering algorithm [27] for the log-normal models.
This technique runs parallel MCMC chains at ascending temperatures. That is, for chain s,
the posterior density is exponentiated by the reciprocal of temperature ts . For the cold chain,
t1 := 1. Let Es be the log-posterior density under the current parameter state for chain s.
Then states between chains r, s are exchanged via a Metropolis step with probability
$$\alpha_{rs} = \min\left[1, \exp\left\{\left(t_r^{-1} - t_s^{-1}\right)(E_r - E_s)\right\}\right].$$
Higher temperatures serve to flatten the posterior distribution, meaning hotter chains have
a higher probability of making a given transition or will make larger transitions. As such,
they will more quickly explore the parameter space and share information gained through
state exchange.
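A sketch of the between-chain swap move; the chain-state containers and variable names are ours, with E taken to be the log-posterior evaluated at each chain's current state:

```python
import numpy as np

rng = np.random.default_rng(2)

def try_swap(states, log_posts, temps, r, s):
    """Swap chains r and s with probability min{1, exp[(1/t_r - 1/t_s)(E_r - E_s)]}."""
    log_alpha = (1.0 / temps[r] - 1.0 / temps[s]) * (log_posts[r] - log_posts[s])
    if np.log(rng.uniform()) < log_alpha:
        states[r], states[s] = states[s], states[r]
        log_posts[r], log_posts[s] = log_posts[s], log_posts[r]
        return True
    return False

temps = [1.3 ** s for s in range(6)]   # a geometric temperature ladder; the cold chain has t = 1
```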
The energy score criterion, defined for a general probability distribution P with finite
expectation, is given by
$$S_{ES}(P, x_i) = E_P\left[g(X_i, x_i)\right] - \frac{1}{2}\, E_P\left[g\left(X_i, X_i'\right)\right], \qquad (20)$$
where g is a kernel function. The score defined in Equation (20) can be evaluated using
samples from P with the help of the law of large numbers. Moreover, Theorem 4 in [28]
states that if g(·, ·) is a negative definite kernel, then $S(P, x)$ is a proper scoring rule. Recall
that a real-valued function g is a negative definite kernel if it is symmetric in its arguments
and $\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j\, g(x_i, x_j) \le 0$ for all positive integers n and any collection $a_1, \ldots, a_n \in \mathbb{R}$
such that $\sum_{i=1}^{n} a_i = 0$.
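Given posterior predictive samples $X_1, \ldots, X_m$ from P, the score in Equation (20) can be approximated by plug-in sample means; a sketch for a generic kernel g (quadratic in the number of samples):

```python
import numpy as np

def energy_score(samples, x_obs, kernel):
    """Monte Carlo approximation of Equation (20):
    S_ES(P, x) ~ mean_k g(X_k, x) - 0.5 * mean_{k,k'} g(X_k, X_k')."""
    m = len(samples)
    term1 = np.mean([kernel(X, x_obs) for X in samples])
    term2 = np.mean([kernel(samples[k], samples[kp])
                     for k in range(m) for kp in range(m)])
    return term1 - 0.5 * term2
```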
In a Euclidean space, these conditions are satisfied by the Euclidean distance [29].
However, for observations on different faces of $\mathbb{S}_\infty^{d-1}$, the Euclidean distance will
underestimate the geodesic distance, the actual distance required to travel between the two
points. Let
$$\mathcal{C}_\ell^{d-1} = \{x : x \in \mathbb{S}_\infty^{d-1},\; x_\ell = 1\}$$
comprise the ℓth face. For points on the same face, the Euclidean distance corresponds to
the length of the shortest possible path in $\mathbb{S}_\infty^{d-1}$. For points on different faces, the Euclidean
distance is a lower bound for that length.
For a finite p, the shortest connecting path between two points in $\mathbb{S}_p^{d-1}$ is the minimum
geodesic, and its length satisfies the definition of a distance. Thus, its length can be used as
a negative definite kernel for the purpose of defining an energy score. Unfortunately, as
p → ∞, the resulting surface $\mathbb{S}_\infty^{d-1}$ is not differentiable, implying that routines to calculate
geodesics are not readily available. However, as $\mathbb{S}_\infty^{d-1}$ is a portion of a d-cube, we can
borrow a result from geometry [30] stating that the length of the shortest path between two
points on a geometric figure corresponds to the length of a straight line drawn between the
points on an appropriate unfolding, rotation, or net of the figure from a d-dimensional to a
(d − 1)-dimensional space. The optimal net will have the shortest straight line between the
points, as long as that line is fully contained within such a net. As $\mathbb{S}_\infty^{d-1}$ has d faces, each
pair of which is adjacent, there are d! possible nets. However, we are only interested in nets
that begin and end on the source and destination faces, respectively, reducing the number
of nets under consideration to $\sum_{k=0}^{d-2} \binom{d-2}{k}$. This is still computationally burdensome for a
large number of dimensions. However, we can efficiently establish an upper bound on the
geodesic length. We use this upper bound on geodesic distance as the kernel function for
the energy score.
To calculate the energy score, we define the kernel
where $a \in \mathcal{C}_\ell^{d-1}$ and $b \in \mathcal{C}_\jmath^{d-1}$ for ℓ, ȷ ∈ {1, . . . , d}. Evaluating g as described requires
the solution of a (d − 2)-dimensional optimization problem. The following proposition
provides a straightforward approach.
Proposition 3. Let $a \in \mathcal{C}_\ell^{d-1}$ and $b \in \mathcal{C}_\jmath^{d-1}$, for ℓ, ȷ ∈ {1, . . . , d}. For ℓ ̸= ȷ, the transformation
$P_{\jmath\ell}(\cdot)$ required to rotate the ȷth face along the ℓth axis produces the vector b′, where
$$b_i' = \left[P_{\jmath\ell}(b)\right]_i = \begin{cases} b_i & \text{for } i \neq \jmath, \ell, \\ 1 & \text{for } i = \ell, \\ 2 - b_\ell & \text{for } i = \jmath. \end{cases} \qquad (21)$$
Then $g(a, b) = \|a - b'\|_2$.
Proof. Notice that for $c \in \mathcal{C}_\jmath^{d-1} \cap \mathcal{C}_\ell^{d-1}$, $\|b - c\|_2 = \|b' - c\|_2$. We then have that
$$g(a, b) = \min_{c \in \mathcal{C}_\jmath^{d-1} \cap \mathcal{C}_\ell^{d-1}} \left\{\|c - a\|_2 + \|b' - c\|_2\right\} = \|a - b'\|_2.$$
The last equality is due to the fact that a and b′ belong to the same hyperplane.
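A direct implementation of Proposition 3 is simple; the sketch below (our own) infers each point's face from its largest coordinate, which is the coordinate equal to 1 for points on $\mathbb{S}_\infty^{d-1}$:

```python
import numpy as np

def unfold_kernel(a, b):
    """Kernel g(a, b): an upper bound on the geodesic distance on S_inf^{d-1}.
    If a and b lie on different faces, unfold b's face onto a's hyperplane (Equation (21))."""
    l = int(np.argmax(a))                      # face of a: a_l = 1
    j = int(np.argmax(b))                      # face of b: b_j = 1
    if l == j:
        return float(np.linalg.norm(a - b))    # same face: Euclidean distance suffices
    b_prime = b.copy()
    b_prime[l] = 1.0                           # rotate the j-th face about the shared edge
    b_prime[j] = 2.0 - b[l]
    return float(np.linalg.norm(a - b_prime))
```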
5. Data Illustrations
We apply the aforementioned models to simulated angular data. We then consider the
analysis of atmospheric data. To tackle the difficult problem of assessing the convergence of
an MCMC chain for a large-dimensional model, we monitor the log-posterior density. In
all the examples considered, MCMC samples produced stable traces of the log-posterior in
less than 40,000 iterations. We use that as a burn-in and thereafter sample 10,000 additional
iterations. We then thin the chain by retaining one in every ten samples, to obtain 1000 total
samples. These are used to generate samples from the posterior predictive densities. We
used two different strategies to implement the MCMC samplers. For the models whose
DP prior is centered around a log-normal distribution, we used parallel tempering. This
serves to overcome the very slow mixing that we observed in these cases. The temperature
ladder was set as $t_s = 1.3^s$ for s ∈ {0, 1, . . . , 5}. This was set empirically in order to produce
acceptable swap probabilities both for the simulated data and real data. Parallel tempering
produces chains with good mixing properties but has a computational cost that grows
linearly with the number of temperatures. Thus, for the gamma-centered models, we used
a single chain. We leverage the fast speed of each iteration to obtain a large number of
samples, which are appropriately thinned to deal with a mild autocorrelation. In summary,
the strategy for log-normal centered models is based on a costly sampler with good mixing
properties. The strategy for the gamma-centered models is based on a cheap sampler that
can be run for a large number of iterations.
Our hyperprior parameters are set as follows. For the gamma-centered models (PG-G,
PRG-G), the shape parameter for the centering distribution is $\xi_\ell \sim \mathcal{G}(1, 1)$, and the rate
parameter is $\tau_\ell \sim \mathcal{G}(2, 2)$. For the log-normal centered models (PG-LN, PRG-LN), the centering
distribution's log-mean is $\mu \sim N_d(0, I_d)$, and the covariance matrix is $\Sigma \sim \mathcal{IW}(d + 10, (d + 10) I_d)$.
These values are chosen so that draws from the prior for Σ will weakly tend towards
the identity matrix. For models that learn rate parameters $\beta_{j\ell}$ (PG-G, PG-LN), the centering
distribution's shape parameter is $\zeta \sim \mathcal{G}(1, 1)$ and the rate parameter is $\sigma \sim \mathcal{G}(2, 2)$. The choice of
$\mathcal{G}(2, 2)$ priors for the rate parameters places little mass near 0, keeping the estimates away
from 0 for numerical stability.
Figure 2 shows the average rise over baseline in energy score, as calculated on Sd∞−1
using the kernel metric described in Proposition 3, for models trained on simulated data.
After training a model, a posterior predictive dataset is generated, and the energy score is
calculated as a Monte Carlo approximation of Equation (20). In our analysis, after burn-
in and thinning, we had 1000 replicates from the posterior distribution and generated
10 posterior predictive replicates per iteration. The baseline value is the energy score of a
new dataset from the same generating distribution as the training dataset evaluated against
the training dataset. For the simulated data, we observe that the projected gamma models
dominate the other two options considered, regardless of the choice of centering distri-
bution. The projected restricted gamma models with a multivariate log-normal centering
[Figure 2: four panels (1, 2, 4, and 8 mixture components); vertical axis: Energy Score (Rise over Baseline); horizontal axis: Number of Columns (10 to 30); legend: Dirichlet, DP-PG-G, DP-PG-LN, DP-PRG-G, DP-PRG-LN, PairwiseBetas.]
Figure 2. Average energy score rise over baseline (on Sd∞−1 ) for various models fitted to simulated data,
with ascending count of mixture components (indicated by plot heading) and number of dimensions
(indicated by horizontal axis). Note that pairwise betas is a moment-restricted model.
Table 1. Model fit assessment and computation time on ERA-Interim and ERA5 data. (a) Energy score
criterion from fitted models against the IVT data. Lower is better. (b) Time to sample (in minutes)
50,000 iterations for various models.
(a)
Source Pairwise Betas PG-G PG-LN PRG-G PRG-LN
ERA-Interim 0.8620 0.8003 0.7986 0.7966 0.7970
ERA5 2.0311 1.6404 1.5576 1.4349 1.5051
(b)
Source Pairwise Betas PG-G PG-LN PRG-G PRG-LN
ERA-Interim 1.5 16.3 66.5 14.8 52.9
ERA5 53.1 19.4 153.4 24.6 121.4
[Figure 3: maps with numbered grid cells, 1–8 for ERA-Interim (left) and 1–47 for ERA5 (right).]
Figure 3. Grid cell locations for ERA-Interim (left) and ERA5 (right).
Fitting our models to the IVT data requires some pre-processing. First, we subset
the data to the rainy season, which in California runs roughly from November to March.
Following the approach described in Section 3, we estimate the shape and scale parameters
of a univariate GP in each dimension, using maximum likelihood. We set the threshold
in each dimension ℓ as bt,ℓ = F̂ℓ−1 (1 − t−1 ), where F̂ is the empirical CDF and t = 20,
which corresponds to the 95th percentile. We then use the transformation in Equation (2) to
standardize the observations. Dividing each standardized observation by its L∞ norm, we
obtain a projection onto $\mathbb{S}_\infty^{d-1}$. As the data correspond to a daily time series, the observations
are temporally correlated. For each group of consecutive standardized vectors zi such that
∥zi ∥∞ > 1, we retain only the vector with the largest L∞ norm. The complete procedure is
outlined in Algorithm 2.
After subsetting the ERA-Interim data to the rainy season, we have 5587 observations.
After the processing and declustering described in Algorithm 2, this number reduces to
511 observations. A pairwise plot of the transformed data after processing and declustering
is presented in Figure 4. From this, we note that the marginal densities display strong
similarities, with a large spike near 0 and a small spike near 1. A value of 1 in a particular
axis indicates that the standardized threshold exceedance was the largest in that dimension.
The off-diagonal plots correspond to pairwise density plots. We observe that some site pairs,
such as (1, 2), (7, 8), and especially (4, 5), have the bulk of their data concentrated in a small
arc along the 45◦ line; in other site combinations, such as (3, 6), (2, 7), or (1, 8), the data are split,
favoring one side or the other of the 45◦ line. For the ERA5 data, after subsetting, we have
6342 observations, which reduces to 532 observations after processing and declustering.
We fit the PG-G, PRG-G, PG-LN, and PRG-LN models to both datasets.
Algorithm 2 Data preprocessing to isolate and transform data exhibiting extreme behavior.
ri represents the radial component, and vi the angular component. The declustering portion
is relevant for data correlated in time.
for ℓ = 1, . . . , d do
Set bt,ℓ = F̂ℓ−1 1 − 1t .
With xℓ > bt,ℓ , fit σℓ , ξ ℓ via MLE according to generalized Pareto likelihood.
end for
for i = 1, . . . , n do
Define $z_{i,\ell} = \left(1 + \xi_\ell \frac{x_{i,\ell} - b_{t,\ell}}{\sigma_\ell}\right)_+^{1/\xi_\ell}$; then $r_i = \|z_i\|_\infty$ and $v_i = z_i / \|z_i\|_\infty$.
end for
Subset r, v such that ri ≥ 1
if declustering then
for i = 1, . . . , n do
If $r_i \geq 1$ and $r_{i-1} \geq 1$, drop the lesser (and the associated $v_i$) from the dataset.
end for
end if
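The declustering step can be implemented directly on the full standardized series: within each run of consecutive exceedances ($r_i > 1$), only the observation with the largest $L_\infty$ norm is retained. A sketch, with r and V as NumPy arrays over the whole time-ordered series:

```python
import numpy as np

def decluster(r, V):
    """Keep one representative (the largest r) per run of consecutive exceedances r_i > 1."""
    keep, run_best = [], None
    for i, ri in enumerate(r):
        if ri > 1.0:
            if run_best is None or ri > r[run_best]:
                run_best = i                   # current run's maximum so far
        elif run_best is not None:
            keep.append(run_best)              # run ended: keep its maximum
            run_best = None
    if run_best is not None:                   # close a run that reaches the end of the series
        keep.append(run_best)
    return r[keep], V[keep]
```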
Figure 4. Pairwise plots from ERA-Interim data after transformation and projection to $\mathbb{S}_\infty^{7}$. Down
the diagonal are marginal kernel densities with two-dimensional histograms on the off-diagonal. In
those plots, red indicates a higher density. All data are between 0 and 1.
Table 1a shows the values of the estimated energy scores for the different models
considered. We observe that, contrary to the results of the simulation study in Figure 2, the
preferred models are the projected restricted gamma models, though for the lower-dimensional
ERA-Interim data, all models perform comparably. Table 1b shows the computing times
needed to fit the different models to the two datasets. We see the effect of dimensionality
on the various models; for gamma-centered models, it grows linearly; for the log-normal
centered model, it will grow superlinearly as matrix inversion becomes the most costly
operation. For BMAmevt, the parameter space grows combinatorially with the number of
dimensions, and thus so do the computational complexity and sampling time.
We consider an exploration of the pairwise extremal dependence using Monte Carlo
estimates of the coefficients in Equation (10). For this, we use samples obtained from the
PRG-G model. Figure 5 provides a graphical analysis of the results. The coefficients achieve
values between 0.286 and 0.759 for the ERA-Interim data and between 0.181 and 0.840
for the ERA5 data. The greater range in dependence scores observed in the ERA5 data
versus ERA-Interim speaks to the greater granularity of the ERA5 data, indicating that
distance between locations is a strong contributor to the strength of the pairwise asymptotic
dependence. The highest coefficients are 0.759 for locations 4 and 5 in the ERA-Interim data
and 0.840 for locations 1 and 2 in the ERA5 data. Clearly, pairwise asymptotic dependence
coefficients tell a limited story, as a particular dependence may include more than two
locations. We can, however, glean some information from the patterns that emerge in two
dimensions. For the ERA-Interim data, we observe a possible cluster between cells 5–8,
indicating a strong dependence among these cells. Analogously, for the ERA5 data, we
observe three possible groups of locations.
Figure 5. Pairwise extremal dependence coefficients for IVT data using the PRG-G model.
Figure 6 shows, for the ERA-Interim data under the PRG-G model, the conditional sur-
vival curve defined in Equation (11) for one dimension conditioned on all other dimensions
being greater than their (fitted) 90th percentile. Figure 7 presents the bi-variate conditional
survival function, conditioning on all other dimensions. These results illustrate quantita-
tively how extremal dependence affects the shape of the conditional survival curves. The
two top panels represent the joint survival function between grid locations 4 and 5, which
are shown in Figure 5 to exhibit strong extremal dependence. We observe that the joint
survival surface is strongly convex. The bottom panels represent the joint survival surface
between grid locations 1 and 5, which exhibited low extremal dependence. In this case, the
shape of the contours tends to be concave, quite different from the shapes observed in the
top panels.
Figure 6. Conditional survival curves for selected locations using ERA-Interim and PRG-G model
conditioning on all other dimensions at greater than the 90th percentile (fitted). The left panel uses
original units; the right panel uses standardized units.
Figure 7. Pairwise conditional survival curves for selected locations, using ERA-Interim and PRG-G
models, conditioning on all other dimensions at greater than the 90th percentile (fitted).
Using our proposed scoring criteria, we explored the effect of the choice of p on the
final results. Using the simulated data, generated from a mixture of projected gammas, we
were unable to observe sizeable differences in the scores for p ranging between 1 and 15.
However, for the IVT data, we observed a drop in the energy score associated with higher
p, with a diminishing effect as p increased. We observed no significant differences in the
performance of the model that uses p = 10, which corresponds to the analysis presented,
relative to the one that uses p = 15.
6. Conclusions
In this paper, we have built upon the definition of the multivariate Pareto described
in [11] to establish a useful representation of its dependence structure through the
distribution of its angular component, which is supported on the positive orthant of the unit
hypersphere under the $L_\infty$ norm, $\mathbb{S}_\infty^{d-1}$. Due to the inherent difficulty of obtaining the
likelihood of distributions with support on $\mathbb{S}_\infty^{d-1}$, our method transforms the data to $\mathbb{S}_p^{d-1}$, fits
them using mixtures of products of independent gammas, and then transforms the predictions
back to $\mathbb{S}_\infty^{d-1}$. As $\mathbb{S}_p^{d-1}$ converges to $\mathbb{S}_\infty^{d-1}$ as p → ∞, we expect the proposed resampling
to be efficient for large enough p. In fact, our exploration of the simulated and real data
indicates that the procedure is robust to the choice of moderately large values of p. Our
method includes two inferential steps. The first consists of the estimation of the marginal
Pareto distributions; the second consists of the estimation of the angular density. Parameter
uncertainty incurred in the former is not propagated to the latter. Conceptually, an
integrated approach that accounts for all the estimation uncertainty is conceivable. Unfortunately,
this leads to posterior distributions with complex data-dependent restrictions that
are very difficult to explore, especially in large dimensional settings. In fact, our attempts to
fit a simple parametric model for the marginals and the angular measures jointly in several
dimensions were not successful.
In this paper, we have focused on a particular representation of the multivariate
Pareto distribution for PoT inference on extreme values. To this end, our model provides a
computationally efficient and flexible approach. An interesting extension of the proposed
model is to consider regressions of extreme value responses, due to extreme value inputs
following the ideas in [38]. This will produce PoT-based Bayesian non-parametric extreme
value regression models. More generally, models that allow for covariate-dependent
extremal dependence [39] could be considered. In addition, we notice that our approach is
based on flexibly modeling angular distributions for any p-norm. As such, it can be applied
to other problems focused on high-dimensional directional statistics constrained to a cone
of directions.
Developing an angular measure specifically in $\mathbb{S}_\infty^{d-1}$ provides two benefits over $\mathbb{S}_p^{d-1}$.
First, the transformation to $\mathbb{S}_\infty^{d-1}$ is unique. Recall that Equation (4) gives $y_d$ as a function
of $y_1, \ldots, y_{d-1}$. An analogous expression can be obtained for any $y_\ell$. This indicates that
there are d equivalent transformations, each yielding a different Jacobian and, for p > 1,
potentially resulting in a different density. Second, the evaluation of geodesic distances on
$\mathbb{S}_p^{d-1}$ is not straightforward. However, we have demonstrated a computationally efficient
upper bound on the geodesic distance on $\mathbb{S}_\infty^{d-1}$. Accepting these foibles, it would be interesting
to explore distributions defined on $\mathbb{S}_p^{d-1}$.
The computations in this paper were performed on a desktop computer with an AMD
Ryzen 5000 series processor. The program is largely single-threaded, so computation time
is not dependent on the available core count. In each case, we run the MCMC chain for
50,000 iterations, with a burn-in of 40,000 samples. Fitting the PG-G model on the ERA5
dataset took approximately 15 min. Work is in progress to optimize the code and explore
parallelization where possible. We are also exploring alternative computational approaches
that will make it feasible to tackle very high dimensional problems, such as variational
Bayes. In fact, to elaborate on the study of IVT, there is a need to consider several hundreds,
if not thousands, of grid cells over the Pacific Ocean in order to obtain a good description
of atmospheric events responsible for large storm activity over California.
References
1. Coles, S.G. An Introduction to Statistical Modeling of Extreme Values; Springer: Berlin/Heidelberg, Germany, 2001. [CrossRef]
2. De Haan, L.; Ferreira, A. Extreme Value Theory: An Introduction; Springer: Berlin/Heidelberg, Germany, 2006; Volume 21.
[CrossRef]
3. Rootzén, H.; Tajvidi, N. Multivariate generalized Pareto distributions. Bernoulli 2006, 12, 917–930. [CrossRef]
4. Falk, M.; Guillou, A. Peaks-over-Threshold stability of multivariate generalized Pareto distributions. J. Multivar. Anal. 2008,
99, 715–734. [CrossRef]
5. Michel, R. Some notes on multivariate generalized Pareto distributions. J. Multivar. Anal. 2008, 99, 1288–1301. [CrossRef]
6. Rootzén, H.; Segers, J.; Wadsworth, J.L. Multivariate peaks over thresholds models. Extremes 2018, 21, 115–145. [CrossRef]
7. Rootzén, H.; Segers, J.; Wadsworth, J.L. Multivariate generalized Pareto distributions: Parametrizations, representations, and
properties. J. Multivar. Anal. 2018, 165, 117–131. [CrossRef]
8. Kiriliouk, A.; Rootzén, H.; Segers, J.; Wadsworth, J.L. Peaks Over Thresholds Modeling With Multivariate Generalized Pareto
Distributions. Technometrics 2019, 61, 123–135. [CrossRef]
9. Renard, B.; Lang, M. Use of a Gaussian copula for multivariate extreme value analysis: Some case studies in hydrology. Adv.
Water Resour. 2007, 30, 897–912. [CrossRef]
10. Falk, M.; Padoan, S.A.; Wisheckel, F. Generalized Pareto copulas: A key to multivariate extremes. J. Multivar. Anal. 2019,
174, 104538. [CrossRef]
11. Ferreira, A.; de Haan, L. The generalized Pareto process; with a view towards application and simulation. Bernoulli 2014,
20, 1717–1737. [CrossRef]
12. Boldi, M.O.; Davison, A.C. A mixture model for multivariate extremes. J. R. Stat. Soc. Ser. Stat. Methodol. 2007, 69, 217–229.
[CrossRef]
13. Sabourin, A.; Naveau, P. Bayesian Dirichlet mixture model for multivariate extremes: A re-parametrization. Comput. Stat. Data
Anal. 2014, 71, 542–567. [CrossRef]
14. Hanson, T.E.; de Carvalho, M.; Chen, Y. Bernstein polynomial angular densities of multivariate extreme value distributions. Stat.
Probab. Lett. 2017, 128, 60–66. [CrossRef]
15. Guillotte, S.; Perron, F.; Segers, J. Non-parametric Bayesian inference on bivariate extremes. J. R. Stat. Soc. Ser. (Stat. Methodol.)
2011, 73, 377–406. [CrossRef]
16. Jentsch, A.; Kreyling, J.; Beierkuhnlein, C. A new generation of climate-change experiments: Events, not trends. Front. Ecol.
Environ. 2007, 5, 365–374. [CrossRef]
17. Vousdoukas, M.I.; Mentaschi, L.; Voukouvalas, E.; Verlaan, M.; Jevrejeva, S.; Jackson, L.P.; Feyen, L. Global probabilistic
projections of extreme sea levels show intensification of coastal flood hazard. Nat. Commun. 2018, 9, 2360. [CrossRef]
18. Li, C.; Zwiers, F.; Zhang, X.; Chen, G.; Lu, J.; Li, G.; Norris, J.; Tan, Y.; Sun, Y.; Liu, M. Larger increases in more extreme local
precipitation events as climate warms. Geophys. Res. Lett. 2019, 46, 6885–6891. [CrossRef]
19. Resnick, S. Extreme Values, Regular Variation, and Point Processes; Applied Probability; Springer: Berlin/Heidelberg, Germany, 2008.
20. Núñez-Antonio, G.; Geneyro, E. A multivariate projected Gamma model for directional data. Commun. Stat.-Simul. Comput. 2021,
50, 2721–2742. [CrossRef]
21. Ferguson, T.S. Prior Distributions on Spaces of Probability Measures. Ann. Stat. 1974, 2, 615–629. [CrossRef]
22. Antoniak, C.E. Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems. Ann. Stat. 1974,
2, 1152–1174. [CrossRef]
23. Müller, P.; Quintana, F.A.; Jara, A.; Hanson, T. Bayesian Nonparametric Data Analysis; Springer: Berlin/Heidelberg, Germany, 2015;
Volume 1.
24. Ascolani, F.; Lijoi, A.; Rebaudo, G.; Zanella, G. Clustering consistency with Dirichlet process mixtures. Biometrika 2022,
110, 551–558. [CrossRef]
25. Neal, R.M. Markov Chain Sampling Methods for Dirichlet Process Mixture Models. J. Comput. Graph. Stat. 2000, 9, 249–265.
[CrossRef]
26. Escobar, M.D.; West, M. Bayesian density estimation and inference using mixtures. J. Am. Stat. Assoc. 1995, 90, 577–588.
[CrossRef]
27. Earl, D.J.; Deem, M.W. Parallel tempering: Theory, applications, and new perspectives. Phys. Chem. Chem. Phys. 2005,
7, 3910–3916. [CrossRef]
28. Gneiting, T.; Raftery, A.E. Strictly Proper Scoring Rules, Prediction, and Estimation. J. Am. Stat. Assoc. 2007, 102, 359–378.
[CrossRef]
29. Berg, C.; Christensen, J.P.R.; Ressel, P. Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions; Springer:
Berlin/Heidelberg, Germany, 1984; Volume 100.
30. Pappas, T. The Joy of Mathematics: Discovering Mathematics All Around You; Wide World Pub Tetra: Frederick, MD, USA, 1989.
31. Cooley, D.; Davis, R.A.; Naveau, P. The pairwise beta distribution: A flexible parametric multivariate model for extremes. J.
Multivar. Anal. 2010, 101, 2103–2117. [CrossRef]
32. Sabourin, A. BMAmevt: Multivariate Extremes: Bayesian Estimation of the Spectral Measure, R Package Version 1.0.5; 2023. Available
online: https://CRAN.R-project.org/package=BMAmevt (accessed on 10 April 2024).
33. Ralph, F.M.; Iacobellis, S.; Neiman, P.; Cordeira, J.; Spackman, J.; Waliser, D.; Wick, G.; White, A.; Fairall, C. Dropsonde
observations of total integrated water vapor transport within North Pacific atmospheric rivers. J. Hydrometeorol. 2017, 18,
2577–2596. [CrossRef]
34. Neiman, P.J.; White, A.B.; Ralph, F.M.; Gottas, D.J.; Gutman, S.I. A water vapour flux tool for precipitation forecasting. In
Proceedings of the Institution of Civil Engineers-Water Management; Thomas Telford Ltd.: London, UK, 2009; Volume 162, pp. 83–94.
[CrossRef]
35. Berrisford, P.; Kållberg, P.; Kobayashi, S.; Dee, D.; Uppala, S.; Simmons, A.; Poli, P.; Sato, H. Atmospheric conservation properties
in ERA-Interim. Q. J. R. Meteorol. Soc. 2011, 137, 1381–1399. [CrossRef]
36. Dee, D.P.; Uppala, S.; Simmons, A.; Berrisford, P.; Poli, P.; Kobayashi, S.; Andrae, U.; Balmaseda, M.; Balsamo, G.; Bauer, P.; et al.
The ERA-Interim reanalysis: Configuration and performance of the data assimilation system. Q. J. R. Meteorol. Soc. 2011,
137, 553–597. [CrossRef]
37. Hersbach, H.; Bell, B.; Berrisford, P.; Hirahara, S.; Horányi, A.; Muñoz-Sabater, J.; Nicolas, J.; Peubey, C.; Radu, R.;
Schepers, D.; et al. The ERA5 global reanalysis. Q. J. R. Meteorol. Soc. 2020, 146, 1999–2049. [CrossRef]
38. de Carvalho, M.; Kumukova, A.; Dos Reis, G. Regression-type analysis for multivariate extreme values. Extremes 2022, 25,
595–622. [CrossRef]
39. Mhalla, L.; de Carvalho, M.; Chavez-Demoulin, V. Regression-type models for extremal dependence. Scand. J. Stat. 2019,
46, 1141–1167. [CrossRef]