Article

A Bayesian Predictive Discriminant Analysis with Screened Data

Department of Statistics, Dongguk University-Seoul, Pil-Dong 3Ga, Chung-Gu, Seoul 100-715, Korea
Entropy 2015, 17(9), 6481-6502; https://doi.org/10.3390/e17096481
Submission received: 3 April 2015 / Revised: 25 August 2015 / Accepted: 17 September 2015 / Published: 21 September 2015
(This article belongs to the Special Issue Inductive Statistical Methods)

Abstract

In the application of discriminant analysis, a situation sometimes arises in which individual measurements are screened by a multidimensional screening scheme. For this situation, a discriminant analysis with screened populations is considered from a Bayesian viewpoint, and an optimal predictive rule for the analysis is proposed. To establish a flexible method for incorporating the prior information of the screening mechanism, we propose a hierarchical screened scale mixture of normal (HSSMN) model, which allows flexible modeling of the screened observations. A Markov chain Monte Carlo (MCMC) method, combining the Gibbs sampler with Metropolis–Hastings steps within it, is used to perform Bayesian inference on the HSSMN models and to approximate the optimal predictive rule. A simulation study is given to demonstrate the performance of the proposed predictive discrimination procedure.


1. Introduction

The topic of analyzing multivariate screened data has received a great deal of attention over the last few decades. In the standard multivariate problem, an analysis of data generated from a p-dimensional screened random vector x =_d [v | v_0 ∈ C_q] is our issue of interest, where the p × 1 random vector v and the q × 1 random vector v_0 (called the screening vector) are jointly distributed with correlation matrix Corr(v, v_0) ≠ 0. Thus, we observe x only when the unobservable screening vector v_0 belongs to a known subset C_q of its space R^q, such that 0 < P(v_0 ∈ C_q) < 1. That is, x is subject to the screening scheme or hidden truncation (or simply truncation if v = v_0). Model parameters underlying the joint distribution of v and v_0 are then estimated from the screened data (i.e., observations of x) using the conditional density f(x | v_0 ∈ C_q).
The screening of a sample (or sample selection) arises often in practice as a result of controlling observability of the outcome of interest in the study. For example, suppose the dataset consists of the Otis IQ test scores (the values of x) of college freshmen. These students had been screened in the college admission process, which examines whether their prior school grade point average (GPA) and Scholastic Aptitude Test (SAT) scores (i.e., the screening values denoted by v_0) are satisfactory. The true value of the screening vector v_0 for each student may not be available due to a college regulation. The observations available are the IQ values of x, the screened data. For applications with real screened data, one can refer to the analysis of the student aid grants data given by [1] and that of the U.S. labor market data given by [2]. A variety of methods have been suggested for analyzing such screened data. See, e.g., [3,4,5,6], for various distributions for modeling and analyzing screened data; see [7,8] for the estimative classification analysis with screened data; and see, e.g., [1,2,9,10], for the regression analysis with screened response data.
The majority of existing methods rely on the fact that v and v 0 are jointly multivariate normal, and the screened observation vector x is subject to a univariate screening scheme defined by an open interval C q with q = 1 . In many practical situations, however, the screened data are generated from a non-normal joint distribution of v and v 0 , having a multivariate screening scheme defined by a q-dimensional ( q > 1 ) rectangle region C q of v 0 . In this case, a difficulty in applications with the screened data is that the empirical distribution of the screened data is skewed; its parametric model involves a complex density; and hence, standard methods of analysis cannot be used. See [4,6] for the conditional densities, f ( x | v 0 C q ) , useful for fitting the rectangle screened data generated from a non-normal joint distribution of v and v 0 . In this article, we develop yet another multivariate technique applicable for analyzing the rectangle screened data: we are interested in constructing a Bayesian predictive discrimination procedure for the data. More precisely, we consider a Bayesian multivariate technique for sorting, grouping and prediction of multivariate data generated from K rectangle screened populations. In the standard problem, a training sample D = { ( z i , x i ) , i = 1 , , n } is available, where, for each i = 1 , , n , x i is a p × 1 rectangle screened observation vector coming from one of K populations and taking values in R p , and z i is a categorical response variable representing the population membership, so that z i = k implies that the predictor x i belongs to the k-th rectangle screened population (denoted by π k ), k = 1 , , K . Using the training sample D , the goal of the predictive discriminant analysis is to predict population membership of a new screened observation x based on the posterior probability of x belonging to π k . The posterior probability is given by:
p(z = k | D, x) ∝ p(x | D, z = k) p(z = k | D),  k = 1, …, K,   (1)
where z is the population membership of x, p(z = k | D) is the prior probability of π_k updated by the training sample D and:
p(x | D, z = k) = ∫ p(x | Θ_k) p(Θ_k | D, z = k) dΘ_k,   (2)
where p(x | Θ_k) = p(x | v_0 ∈ C_q, z = k) and p(Θ_k | D, z = k), respectively, denote the probability density of x and the posterior density of the parameters Θ_k associated with π_k; Equation (2) itself is the predictive density of x. One of the earliest and most widely applied predictive approaches, due to [11], treats the case of unscreened and normally-distributed populations π_k with unknown parameters Θ_k = {μ_k, Σ_k}, namely π_k : N_p(μ_k, Σ_k) for k = 1, …, K. This is called a Bayesian predictive discriminant analysis with normal populations (BPDA_N), in which a multivariate Student t distribution is obtained for Equation (2).
A practical example where the predictive discriminant analysis with rectangle screened populations (π_k's) is applicable is the discrimination between passed and failed groups of applicants in a college admission process (the second screening process). Consider the case where college admission officers wish to set up an objective criterion (with a predictor vector x) for admitting students for matriculation; however, the admission officers must first ensure that a student with observation x has passed the first screening process. The first screening scheme may be defined by the q-dimensional region C_q of the random vector v_0 (consisting of SAT scores, high-school GPA, and so on), so that only the students who satisfy v_0 ∈ C_q can proceed to the admission process. In this case, we encounter a crucial problem in applying the normal classification of [11]: given the screening scheme v_0 ∈ C_q, the assumption of a multivariate normal population distribution for [x | z = k] =_d [x | π_k], k = 1, 2, …, K, is not valid. The work in [7,12] found that the normal classification shows a lack of robustness to departures from normality of the population distribution; hence, the performance of the normal classification can be very misleading if used with a continuous, but non-normal or screened normal, input vector x.
Thus, the predictive density in Equation (2) has two specific features to be considered for Bayesian predictive discrimination with the rectangle screened populations, one about the prior distribution of the parameters Θ k and the other about the distributional assumption of the population model with density p ( x | Θ k ) . For the unscreened populations case, there have been a variety of studies that are concerned with the two considerations. See, for example, [11,13,14] for the choice of the prior distributions of Θ k , and see [15,16] for copious references to the literature on the predictive discriminant analysis with non-normal population models. Meanwhile, for deriving Equation (2) of the rectangle screened observation x, we need to develop a population model with density p ( x | Θ k ) that uses the screened sample information in order to maintain consistency with the underlying theory associated with the populations π k generating the screened sample. Then, we propose a Bayesian hierarchical approach to flexibly incorporate the prior knowledge about Θ k with the non-normal sample information, which is the main contribution of this paper to the literature on Bayesian predictive discriminant analysis.
The rest of this paper is organized as follows. Section 2 considers a class of screened scale mixture of normal (SSMN) population models, which accounts well for the screening scheme conducted through a q-dimensional rectangle region C_q of an external scale mixture of normal vector, v_0. Section 3 proposes a hierarchical screened scale mixture of normal (HSSMN) model to derive the predictive density Equation (2) and proposes an optimal rule for Bayesian predictive discriminant analysis (BPDA) with the SSMN populations (abbreviated as BPDA_SSMN). Approximation of the rule is studied in Section 4 by using an MCMC method applied to the HSSMN model. In Section 5, a simulation study is conducted to check the convergence of the MCMC method and the performance of the BPDA_SSMN by making a comparison between the BPDA_SSMN and the BPDA_N. Finally, concluding remarks are given in Section 6.

2. The SSMN Population Distributions

Assume that the joint distribution of the respective q × 1 and p × 1 vector variables v_0 and v, associated with π_k, is F ∈ F, where:
F = { F : F = N_s(μ_k*, κ(η) Σ_k*), η ∼ g(η) }, with κ(η) > 0 and η > 0,   (3)
k = 1 , , K , s = ( q + p ) , η is a mixing variable with the pdf g ( η ) , κ ( η ) is a suitably-chosen weight function and μ k * and Σ k * are partitioned corresponding to the orders of v 0 and v:
v* = (v_0′, v′)′,  μ_k* = (μ_0k′, μ_k′)′,  Σ_k* = [ Σ_0k  Δ_k′ ; Δ_k  Σ_k ].
Notice that F defined by Equation (3) denotes a class of scale mixture of multivariate normal (SMN) distributions (see, e.g., [17,18] for details), equivalently denoted as SMN_s(μ_k*, Σ_k*, κ(η), G) in the remainder of the paper, where G = G(η) denotes the cdf of η.
Given the joint distribution [ v * | π k ] S M N s ( μ k * , Σ k * , κ ( η ) , G ) , the SSMN distribution is defined by the following screening scheme:
[x | π_k] =_d [v | v_0 ∈ C_q(α, β), π_k] ∼ SSMN_p(C_q(α, β); μ_k*, Σ_k*, κ(η), G),   (5)
where C_q(α, β) = {v_0 ∈ R^q | α ≤ v_0 ≤ β} is a q-dimensional rectangle screening region in the space of v_0 ∈ R^q. Here, α = (α_1, …, α_q)′, β = (β_1, …, β_q)′, and α_j < β_j for j = 1, …, q. This region includes C_q(α, ∞) and C_q(−∞, β) as special cases.
The pdf of x is given by:
f(x | μ_k*, Σ_k*, π_k) = [ ∫_0^∞ h_p(x | μ_k*, Σ_k*, κ(η)) dG(η) ] / [ ∫_0^∞ Φ̄_q(C_q(α, β); μ_0k, κ(η) Σ_0k) dG(η) ],  x ∈ R^p,   (6)
where:
h p ( x | μ k * , Σ k * , κ ( η ) ) = ϕ p ( x ; μ k , κ ( η ) Σ k ) Φ ¯ q C q ( α , β ) ; μ v 0 k | x , κ ( η ) Σ v 0 k | x ,
μ_{v_0k|x} = μ_0k + Δ_k′ Σ_k^{−1}(x − μ_k) and Σ_{v_0k|x} = Σ_0k − Δ_k′ Σ_k^{−1} Δ_k. Here, ϕ_q(·; μ, Σ) and Φ̄_q(C_q(α, β); μ, Σ), respectively, denote the pdf and the probability of the rectangle region of a random vector w ∼ N_q(μ, Σ). The latter is equivalent to Pr(w ∈ C_q(α, β)).
One particular member of the class of SSMN distributions is the rectangle-screened normal (RSN) distribution defined by Equation (5) and Equation (6), for which G ( η ) is degenerate with κ ( η ) = 1 . The work in [4,8] studied properties of the distribution and denoted it as the R S N p ( C q ( α , β ) ; μ k * , Σ k * ) distribution. Another member of the class is the rectangle-screened p-variate Student t distributions ( R S t p ) considered by [8]. Its pdf is given by:
f(x | μ_k*, Σ_k*, π_k) = t_p(x | μ_k, Σ_k, ν) T̄_q(C_q(α, β); μ_{v_0k|x}, Γ_{v_0k|x}, ν + p) / T̄_q(C_q(α, β); μ_0k, Σ_0k, ν),  x ∈ R^p,   (7)
where t p ( · | a , B , c ) and T ¯ p ( C p ; a , B , c ) are the respective pdf and probability of a rectangle region C p of the p-variate Student t distribution with the location vector a , the scale matrix B, the degrees of freedom c and:
Γ_{v_0k|x} = (ν + p)^{−1} [ ν + (x − μ_k)′ Σ_k^{−1} (x − μ_k) ] Σ_{v_0k|x}.
Similar to the RSN distributions, the density Equation (7) of [x | π_k] ∼ RSt_p(C_q(α, β); μ_k*, Σ_k*, ν) is obtained by taking κ(η) = 1/η and η ∼ Gamma(ν/2, ν/2), i.e.,
g(η) = [ (ν/2)^{ν/2} / Γ(ν/2) ] η^{ν/2 − 1} exp( −νη/2 ),  η > 0.
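For concreteness, the RSN member (κ(η) ≡ 1) of this class can be evaluated directly in R with the mvtnorm package, which supplies the rectangle probabilities Φ̄_q appearing in Equation (6). The following is a minimal sketch written for illustration; the function name drsn and the numerical values are ours, not the paper's, and Delta is taken to be the p × q block Δ_k of Σ_k*.

```r
library(mvtnorm)

# Rectangle-screened normal (RSN) density, i.e., Equation (6) with kappa(eta) = 1.
# Delta: the p x q block Delta_k of Sigma_k*; all values below are illustrative only.
drsn <- function(x, mu, mu0, Sigma, Sigma0, Delta, lower, upper) {
  mu.v0x <- as.vector(mu0 + t(Delta) %*% solve(Sigma, x - mu))   # mean of v0 given x
  S.v0x  <- Sigma0 - t(Delta) %*% solve(Sigma, Delta)            # cov  of v0 given x
  num <- dmvnorm(x, mean = mu, sigma = Sigma) *
         pmvnorm(lower = lower, upper = upper, mean = mu.v0x, sigma = S.v0x)[1]
  den <- pmvnorm(lower = lower, upper = upper, mean = mu0, sigma = Sigma0)[1]
  num / den
}

# Illustrative p = q = 2 evaluation
mu  <- c(2, -1); Sigma  <- diag(c(3, 1))
mu0 <- c(0, 0);  Sigma0 <- matrix(c(7, -2, -2, 4), 2, 2)
Delta <- matrix(c(1.0, 0.5, 0.5, 0.8), 2, 2)
drsn(c(1.5, -0.5), mu, mu0, Sigma, Sigma0, Delta,
     lower = c(-0.5, -0.5), upper = c(4, 4))
```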
The stochastic representations of the RSN and RSt_p distributions are immediately obtained by applying the following lemma, a detailed proof of which can be found in [4].
Lemma 1. Suppose [x | π_k] ∼ SSMN_p(C_q(α, β); μ_k*, Σ_k*, κ(η), G). Then, it has the following stochastic representation in a hierarchical fashion,
[x | η, π_k] =_d μ_k + Δ_k Σ_0k^{−1} Z_{C_q(a_k, b_k)} + (Σ_k − Δ_k Σ_0k^{−1} Δ_k′)^{1/2} Z_p,   (8)
η ∼ G(η) with κ(η) > 0, η > 0,   (9)
where Z_p ∼ N_p(0, κ(η) I_p) and Z_{C_q} − μ_0k =_d [Z_q | Z_q ∈ C_q(a_k, b_k)] are conditionally independent and Z_q ∼ N_q(0, κ(η) Σ_0k). Here, a_k = α − μ_0k and b_k = β − μ_0k.
Lemma 1 provides the following: (i) an intrinsic structure of the SSMN population distributions, which reveals a type of departure from the SMN law, because the distribution of [x | π_k] reduces to the SMN distribution if Δ_k = 0 (i.e., Cov(v_0, v | π_k) = 0); (ii) a convenient device for random number generation; and (iii) a simple and direct construction of an HSSMN model for the BPDA with the SSMN populations, i.e., [x | π_k] ∼ SSMN_p(C_q(α, β); μ_k*, Σ_k*, κ(η), G).
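To illustrate point (ii), the following sketch draws one observation from an SSMN population via the representation in Lemma 1, using the tmvtnorm package for the truncated normal component Z_{C_q}. It is written for the RSt case (κ(η) = 1/η, η ∼ Gamma(ν/2, ν/2)); setting nu = Inf reduces it to the RSN member. The function name rssmn1 is ours, and Delta is again assumed to be the p × q block Δ_k of Σ_k*.

```r
library(MASS)       # mvrnorm
library(tmvtnorm)   # rtmvnorm

# One draw from SSMN_p(C_q(alpha, beta); mu*, Sigma*, kappa, G) via Lemma 1 (RSt/RSN cases).
rssmn1 <- function(mu, mu0, Sigma, Sigma0, Delta, alpha, beta, nu = 5) {
  eta <- if (is.finite(nu)) rgamma(1, shape = nu / 2, rate = nu / 2) else 1
  kap <- 1 / eta                                    # kappa(eta) = 1/eta; kap = 1 gives the RSN
  a <- alpha - mu0; b <- beta - mu0                 # a_k and b_k of Lemma 1
  Zc  <- as.vector(rtmvnorm(1, mean = rep(0, length(mu0)),
                            sigma = kap * Sigma0, lower = a, upper = b))
  Psi <- Sigma - Delta %*% solve(Sigma0, t(Delta))  # Sigma_k - Delta_k Sigma_0k^{-1} Delta_k'
  Zp  <- mvrnorm(1, rep(0, length(mu)), kap * Psi)
  as.vector(mu + Delta %*% solve(Sigma0, Zc) + Zp)
}
```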

3. The HSSMN Model

3.1. The Hierarchical Model

For a Bayesian predictive discriminant analysis, suppose we have K rectangle screened populations π_k (k = 1, …, K), each specified by the SSMN_p(C_q(α, β); μ_k*, Σ_k*, κ(η), G) distribution. Let D_k = {x_k1, …, x_kn_k} be a training sample obtained from the rectangle screened population π_k with the SSMN_p(C_q(α, β); μ_k*, Σ_k*, κ(η), G) distribution, where the parameters (μ_k*, Σ_k*) are unknown. The goal of the predictive discriminant analysis is to assess the relative predictive odds ratio or posterior probability that a screened multivariate observation x belongs to one of the K populations, π_k. As noted by Equation (6), however, the complex likelihood function of D_k prevents us from choosing reasonable priors of the model parameters and from obtaining the predictive density of x given by Equation (2). These problems are solved if we use the following hierarchical representation of the population models.
According to Lemma 1, we may rewrite the SSMN model of Equations (8) and (9) as a three-level hierarchy given by:
[x_ki | η_ki, f_ki, π_k] =_d μ_k + Λ_k f_ki + ε_ki,  ε_ki ∼ind N_p(0, κ(η_ki) Ψ_k),  i = 1, …, n_k,
f_ki ∼ind N_q(0, κ(η_ki) Σ_0k) I(f_ki ∈ C_q(a_k, b_k)),  κ(η_ki) > 0,
η_ki ∼iid G(η) with η_ki > 0,   (10)
where Λ_k = Δ_k Σ_0k^{−1}, Ψ_k = Σ_k − Δ_k Σ_0k^{−1} Δ_k′, G is the scale mixing distribution of the independent η_ki's, f_ki and ε_ki are independent conditional on η_ki, and N_q(0, κ(η_ki) Σ_0k) I(f_ki ∈ C_q(a_k, b_k)) denotes a truncated N_q(0, κ(η_ki) Σ_0k) distribution with truncation region f_ki ∈ C_q(a_k, b_k).
The first stage model in Equation (10) may be written in a compact form by defining the following vector and matrix notations,
X k = ( x k 1 - μ k , , x k n k - μ k ) , F k = ( f k 1 , , f k n k ) , E k = ( ε k 1 , , ε k n k ) , η k = ( η k 1 , , η k n k ) .
Then, the three-level hierarchy of the model Equation (10) can be expressed as:
X_k = Λ_k F_k + E_k,  vec(E_k) ∼ N_{pn_k}(0, D(κ(η_k)) ⊗ Ψ_k),
vec(F_k) ∼ N_{qn_k}(0, D(κ(η_k)) ⊗ Σ_0k) I(f_ki ∈ C_q(a_k, b_k)),  Cov(vec(F_k), vec(E_k) | η_k) = O,
η_ki ∼iid g(η),  i = 1, …, n_k,   (11)
where A ⊗ B denotes the Kronecker product of two matrices A and B, vec(F_k) = (f_k1′, …, f_kn_k′)′, vec(E_k) = (ε_k1′, …, ε_kn_k′)′, and D(κ(η_k)) = diag{κ(η_k1), …, κ(η_kn_k)} is an n_k × n_k diagonal matrix of the scale mixing functions. Note that the hierarchical population model Equation (11) adopts a robust discriminant modeling by the use of scale mixtures of normals, such as the SMN and the truncated SMN, and thus enables us to avoid the anomalies generated from the non-normal sample information.
The Bayesian analysis of the model in Equation (11) begins with the specification of the prior distributions of the unknown parameters. When prior information is not available, a convenient strategy for avoiding an improper posterior distribution is to use proper priors whose hyperparameters are fixed at appropriate quantities reflecting the flatness (or diffuseness) of the priors (i.e., limiting non-informative priors). For convenience, though not always optimally, we suppose that μ_k, μ_0k, (Λ_k, Ψ_k) and Σ_0k of the model in Equation (11) are independent a priori; the prior distributions for μ_k and μ_0k are normal; an inverse Wishart prior distribution is used for Σ_0k; and a generalized natural conjugate family (see [19]) of prior distributions is used for (Λ_k, Ψ_k), so that we adopt the normal prior density for Λ_k conditional on the matrix Ψ_k,
P(Λ_k | Ψ_k) ∝ |Ψ_k|^{−q/2} exp{ −(1/2) tr[ Ψ_k^{−1} (Λ_k − Λ_0k) H_k (Λ_k − Λ_0k)′ ] },  Ψ_k ∼ IW_p(R_k, τ_k),
where W ∼ IW_m(V, ν) denotes the inverse Wishart distribution whose pdf IW_m(W; V, ν) is:
IW_m(W; V, ν) ∝ |W|^{−ν/2} exp{ −(1/2) tr(W^{−1} V) },  V > 0.
Note that if Λ_k = (λ_k1, …, λ_kq), λ_k ≡ vec(Λ_k) = (λ_k1′, …, λ_kq′)′ and λ_0k ≡ vec(Λ_0k), then:
tr[ Ψ_k^{−1} (Λ_k − Λ_0k) H_k (Λ_k − Λ_0k)′ ] = (λ_k − λ_0k)′ (H_k^{−1} ⊗ Ψ_k)^{−1} (λ_k − λ_0k).
This prior elicitation of the parameters, along with the three-level hierarchical model Equation (11), produces a hierarchical screened scale mixture of normal population model, which is referred to as HSSMN( Θ ( k ) ) in the rest of this paper, where Θ ( k ) = { μ k , μ 0 k , Λ k , Ψ k , F k , Σ 0 k , η k } . The HSSMN( Θ ( k ) ) model is defined as follows.
x_ki | F_k, μ_k, Ψ_k, η_k ∼ind N_p(μ_k + Λ_k f_ki, κ(η_ki) Ψ_k),  i = 1, …, n_k,
f_ki | Ψ_k, Σ_0k, μ_0k, η_k ∼ind N_q(0, κ(η_ki) Σ_0k) I(f_ki ∈ C_q(a_k, b_k)),  i = 1, …, n_k,
μ_k ∼ N_p(θ_k, Ω_k),
μ_0k ∼ N_q(θ_0k, Ω_0k),
λ_k | Ψ_k ∼ N_pq(λ_0k, H_k^{−1} ⊗ Ψ_k),
Ψ_k ∼ IW_p(R_k, τ_k),  τ_k > 2p,
Σ_0k ∼ IW_q(Q_k, γ_k),  γ_k > 2q,
η_ki ∼ind g(η),  i = 1, …, n_k,   (12)
where λ_k ≡ vec(Λ_k), λ_0k ≡ vec(Λ_0k), and the hyperparameters (θ_k, θ_0k, Ω_k, Ω_0k, Λ_0k, H_k, R_k, τ_k, Q_k, γ_k) are fixed at appropriate quantities to reflect the flatness of the priors.
The last distributional specification is omitted in the RSN distribution case. For the HSSMN(Θ_(k)) model for the RSt_ν distribution, we may set η_ki ∼ind Gamma(ν/2, ν/2) and ν ∼ Gamma(1, 0.1) I(ν > 2), a truncated Gamma distribution (see, e.g., [20]). See, for example, [21,22] and the references therein for other choices of the prior distribution of ν.

3.2. Posterior Distributions

Based on the HSSMN( Θ ( k ) ) model structure with the likelihood and the prior distributions in Equation (12), the joint posterior distribution of Θ ( k ) is given by:
p(Θ_(k) | D_k) ∝ { ∏_{i=1}^{n_k} |κ(η_ki) Ψ_k|^{−1/2} } exp{ −(1/2) tr[ Ψ_k^{−1} (X_k − Λ_k F_k) D(κ(η_k))^{−1} (X_k − Λ_k F_k)′ ] }
× |Ψ_k|^{−(q+τ_k)/2} exp{ −(1/2) tr[Ψ_k^{−1} G_k] } ∏_{i=1}^{n_k} [ ϕ_q(f_ki; 0, κ(η_ki) Σ_0k) / Φ̄_q(C_q(a_k, b_k); 0, κ(η_ki) Σ_0k) ] ∏_{i=1}^{n_k} g(η_ki)
× IW_q(Σ_0k; Q_k, γ_k) ϕ_p(μ_k; θ_k, Ω_k) ϕ_q(μ_0k; θ_0k, Ω_0k),   (13)
where:
{ ∏_{i=1}^{n_k} |κ(η_ki) Ψ_k|^{−1/2} } exp{ −(1/2) tr[ Ψ_k^{−1} (X_k − Λ_k F_k) D(κ(η_k))^{−1} (X_k − Λ_k F_k)′ ] } ∝ ∏_{i=1}^{n_k} ϕ_p(x_ki; μ_k + Λ_k f_ki, κ(η_ki) Ψ_k),
G_k = (Λ_k − Λ_0k) H_k (Λ_k − Λ_0k)′ + R_k, and the g(η_ki)'s denote the densities of the mixing variables η_ki. Note that the joint posterior in Equation (13) does not simplify to a known density in analytic form and is thus intractable for posterior inference. Instead, we derive the full conditional posterior distribution of each of μ_k, μ_0k, λ_k ≡ vec(Λ_k), Σ_0k, F_k, Ψ_k and the η_ki's, which are useful for posterior inference based on Markov chain Monte Carlo (MCMC) methods. All of the full conditional posterior distributions are as follows (see the Appendix for their derivations; a short sampling sketch for the closed-form updates is given after the list):
(1) The full conditional distribution of μ k is a p-variate normal given by:
μ_k | Θ_(k)\μ_k, D_k ∼ N_p(μ_μk, Σ_μk),   (14)
where μ_μk = Σ_μk [ Ω_k^{−1} θ_k + ∑_{i=1}^{n_k} Ψ_k^{−1} (x_ki − Λ_k f_ki)/κ(η_ki) ] and Σ_μk = [ ∑_{i=1}^{n_k} (1/κ(η_ki)) Ψ_k^{−1} + Ω_k^{−1} ]^{−1}.
(2) The full conditional density of μ 0 k is given by:
p(μ_0k | Θ_(k)\μ_0k, D_k) ∝ ϕ_q(μ_0k; θ_0k, Ω_0k) ∏_{i=1}^{n_k} Φ̄_q(C_q(α, β); μ_0k, κ(η_ki) Σ_0k).   (15)
(3) The full conditional posterior distribution of λ k is given by:
λ_k | Θ_(k)\λ_k, D_k ∼ N_pq(μ_λk, Σ_λk),   (16)
where:
μ_λk = vec(Λ_k*),  Λ_k* = (X_k D(κ(η_k))^{−1} F_k′ + Λ_0k H_k) Q_k^{−1},  Σ_λk = Q_k^{−1} ⊗ Ψ_k,  and  Q_k = F_k D(κ(η_k))^{−1} F_k′ + H_k.
(4) The full conditional posterior distribution of Ψ k is an inverse-Wishart distribution:
Ψ_k | Θ_(k)\Ψ_k, D_k ∼ IW_p(V_k, ν_k),  ν_k > 2p,   (17)
where V_k = (X_k − Λ_k F_k) D(κ(η_k))^{−1} (X_k − Λ_k F_k)′ + (Λ_k − Λ_0k) H_k (Λ_k − Λ_0k)′ + R_k and ν_k = n_k + q + τ_k.
(5) The full conditional posterior distribution of f k i is the q-variate truncated normal given by:
f_ki | Θ_(k)\f_ki, D_k ∼ind N_q(μ_fki, κ(η_ki) Σ_fki) I(f_ki ∈ C_q(a_k, b_k)),  i = 1, …, n_k,   (18)
where μ_fki = Σ_fki Λ_k′ Ψ_k^{−1} (x_ki − μ_k) and Σ_fki = (Σ_0k^{−1} + Λ_k′ Ψ_k^{−1} Λ_k)^{−1}.
(6) The full conditional posterior density of Σ 0 k is given by:
p(Σ_0k | Θ_(k)\Σ_0k, D_k) ∝ IW_q(Σ_0k; Q_k, γ_k) ∏_{i=1}^{n_k} [ ϕ_q(f_ki; 0, κ(η_ki) Σ_0k) / Φ̄_q(C_q(a_k, b_k); 0, κ(η_ki) Σ_0k) ].   (19)
(7) The full conditional posterior densities of η k i ’s are given by:
p(η_ki | Θ_(k)\η_ki, D_k) ∝ κ(η_ki)^{−p/2} exp{ −z_ki′ Ψ_k^{−1} z_ki / (2κ(η_ki)) }
× [ ϕ_q(f_ki; 0, κ(η_ki) Σ_0k) / Φ̄_q(C_q(a_k, b_k); 0, κ(η_ki) Σ_0k) ] g(η_ki),  i = 1, …, n_k,   (20)
where z_ki = x_ki − μ_k − Λ_k f_ki and the η_ki's are independent.
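As promised above, the following is a minimal R sketch of the two closed-form updates, the normal draw of μ_k in (1) (Equation (14)) and the inverse-Wishart draw of Ψ_k in (4) (Equation (17)), for one Gibbs sweep. It assumes the remaining blocks are held at their current values, uses the RSt choice κ(η) = 1/η, and relies on MCMCpack's riwish; the degrees-of-freedom shift in the riwish call is our mapping between the paper's IW_p(V, ν) parameterization and riwish's convention, so it should be checked before use.

```r
library(MASS)       # mvrnorm
library(MCMCpack)   # riwish

# One Gibbs sweep over the closed-form blocks (Equations (14) and (17)).
# X: p x n_k matrix with columns x_ki; F: q x n_k matrix with columns f_ki;
# eta: length-n_k vector of mixing values; remaining arguments are current values / hyperparameters.
gibbs_mu_psi <- function(X, F, eta, Lambda, Psi, theta, Omega, Lambda0, H, R, tau) {
  p <- nrow(X); q <- nrow(F); nk <- ncol(X)
  kap <- 1 / eta                                      # kappa(eta_ki) = 1/eta_ki (RSt case)
  ## Equation (14): mu_k | rest ~ N_p(mu.mu, Sig.mu)
  Psi.inv <- solve(Psi)
  Sig.mu  <- solve(sum(1 / kap) * Psi.inv + solve(Omega))
  resid   <- X - Lambda %*% F                         # columns x_ki - Lambda_k f_ki
  mu.mu   <- Sig.mu %*% (solve(Omega, theta) + Psi.inv %*% (resid %*% (1 / kap)))
  mu.new  <- as.vector(mvrnorm(1, as.vector(mu.mu), Sig.mu))
  ## Equation (17): Psi_k | rest ~ IW_p(V_k, nu_k), nu_k = n_k + q + tau_k
  E  <- (X - mu.new) - Lambda %*% F                   # columns (x_ki - mu_k) - Lambda_k f_ki
  Vk <- E %*% diag(1 / kap, nk) %*% t(E) +
        (Lambda - Lambda0) %*% H %*% t(Lambda - Lambda0) + R
  nu.k    <- nk + q + tau
  Psi.new <- riwish(nu.k - p - 1, Vk)                 # df shifted to riwish's convention (assumption)
  list(mu = mu.new, Psi = Psi.new)
}
```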
Based on the above full conditional posterior distributions and the stochastic representations of the SSMN in Lemma 1, one can easily obtain Bayes estimates of the k-th SSMN population mean μ π k = E [ x | π k ] and covariance matrix Σ π k = C o v ( x | π k ) , k = 1 , , K . Specifically, the mean and covariance matrix of an observation x belonging to π k : S S M N p ( C q ( α , β ) ; μ k * , Σ k * , κ ( η ) , G ) , which are used for calculating their Bayes estimates via Rao–Blackwellization, are given by:
μ_πk = μ_k + Ω_21k Ω_11k^{−1} ξ_k,
Σ_πk = Ω_22k − Ω_21k (Ω_11k^{−1} − Ω_11k^{−1} T_k Ω_11k^{−1}) Ω_21k′,   (21)
where Ω 21 k = κ ( η ) Δ k , Ω 11 k = κ ( η ) Σ 0 k , Ω 22 k = κ ( η ) Σ k ,
ξ_k = ∫_{C_q(a_k, b_k)} z [ ζ_k (2π)^{q/2} |Ω_11k|^{1/2} ]^{−1} exp{ −z′ Ω_11k^{−1} z / 2 } dz,
ζ_k = Φ̄_q(C_q(α, β); μ_0k, Ω_11k), T_k = P_k − ξ_k ξ_k′ and:
P_k = ∫_{C_q(a_k, b_k)} z z′ [ ζ_k (2π)^{q/2} |Ω_11k|^{1/2} ]^{−1} exp{ −z′ Ω_11k^{−1} z / 2 } dz.
We see that these moments of Equation (21) agree with the formula for the mean and covariance matrix of the untruncated marginal distribution of a general multivariate truncated distribution given by [23]. Readers are referred to [24] with the R package tmvtnorm and [25] with the R package mvtnorm for implementing calculations of ξ k and P k involved in the first and second moments.
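Following that pointer, the sketch below computes μ_{π_k} and Σ_{π_k} for the RSN member (κ(η) ≡ 1) directly from the truncated moments returned by tmvtnorm::mtmvnorm. It uses the conditional-moment form of the screened mean and covariance, which is algebraically equivalent to Equation (21); the helper name ssmn_moments is ours, and Delta is again the p × q block Δ_k.

```r
library(tmvtnorm)   # mtmvnorm: mean and covariance of a rectangle-truncated normal

# Screened mean and covariance of [x | pi_k] for the RSN member (kappa(eta) = 1).
ssmn_moments <- function(mu, mu0, Sigma, Sigma0, Delta, alpha, beta) {
  tm <- mtmvnorm(mean = mu0, sigma = Sigma0, lower = alpha, upper = beta)
  xi <- tm$tmean - mu0                    # xi_k: shift of the truncated mean of v0
  Tk <- tm$tvar                           # T_k : covariance of v0 given v0 in C_q(alpha, beta)
  B  <- Delta %*% solve(Sigma0)           # Lambda_k = Delta_k Sigma_0k^{-1}
  list(mean = as.vector(mu + B %*% xi),
       cov  = Sigma - B %*% t(Delta) + B %*% Tk %*% t(B))
}
```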
When the sampling information, i.e., the observed training samples, is augmented by proper prior knowledge, the anomalies of the maximum likelihood estimate of the SSMN model, investigated by [16], disappear in the HSSMN(Θ_(k)) model. Furthermore, note that the conditional distribution of λ_k in Equation (16) is pq-dimensional; hence, its Gibbs sampling step requires inverting a matrix of order pq, which may incur a high computational cost in implementing the MCMC method. For large q, a Gibbs sampler based on the full conditional posterior distributions of the λ_kj, j = 1, …, p, is computationally more convenient than the Gibbs sampler using λ_k in Equation (16), where λ_k ≡ vec(Λ_k) and Λ_k ≡ (λ_k1, …, λ_kq).
For this purpose, we defined the following notations: for j = 1 , , p ,
λ̃_k(j) = (E_j ⊗ I_p) λ_k,  θ̃_k(j) = (E_j ⊗ I_p) μ_λk,
Ω̃_k(j) = (E_j ⊗ I_p) Σ_λk (E_j ⊗ I_p)′,  E_j = (e_j, e_1, …, e_{j−1}, e_{j+1}, …, e_q),
where e j denotes the j-th column of I q , namely an elementary vector with unity for its j-th element and zeros elsewhere. Furthermore, we consider the following partitions:
λ̃_k(j) = (λ_kj′, λ_kj*′)′,  θ̃_k(j) = (θ̃_k(1j)′, θ̃_k(2j)′)′,  and  Ω̃_k(j) = [ Ω̃_k11(j)  Ω̃_k12(j) ; Ω̃_k21(j)  Ω̃_k22(j) ],
where the orders of λ k j * , θ ˜ k ( 1 j ) , Ω ˜ k 11 ( j ) and Ω ˜ k 21 ( j ) are ( p - 1 ) q × 1 , q × 1 , q × q and ( p - 1 ) q × q , respectively. Under these partitions, the conditional property of a multivariate normal distribution leads to the full conditional posterior distributions of λ k j given by:
λ_kj | Θ_(k)\λ_kj, D_k ∼ N_q(μ_λkj, Σ_λkj),   (22)
for j = 1, …, q, where:
μ_λkj = θ̃_k(1j) + Ω̃_k12(j) Ω̃_k22(j)^{−1} (λ_kj* − θ̃_k(2j))  and  Σ_λkj = Ω̃_k11(j) − Ω̃_k12(j) Ω̃_k22(j)^{−1} Ω̃_k21(j).
When p is large, we may partition λ k j into two vectors with smaller dimensions, say λ k = ( λ k j ( 1 ) , λ k j ( 2 ) , ) , then use their full conditional normal distributions for the Gibbs sampler.
Now, the posterior sampling can be implemented by using all of the conditional posterior Equations (14)–(20). The Gibbs sampler and Metropolis–Hastings algorithm within the Gibbs sampler may be used to obtain posterior samples of all of the unknown parameters Θ ( k ) . Note that in the case where the p q -dimensional matrix is too large to manipulate for computation, the Gibbs sampler can be modified by replacing the full conditional posterior Equation (16) with Equation (22). That is, as indicated by Equation (22), the modified Gibbs sampler based on Equation (22) would be more convenient for numerical computation than the first one using Equation (16). The detailed Markov chain Monte Carlo algorithm with Gibbs sampling is discussed in the next subsection.

3.3. Markov Chain Monte Carlo Sampling Scheme

It is not complicated to construct an MCMC sampling scheme working with Θ ( k ) = { μ k , μ 0 k , Λ k , Ψ k , F k , Σ 0 k , η k } , since a routine Gibbs sampler would work to generate posterior samples of ( μ k , Λ k , Ψ k , F k ) based on each of their full conditional posterior distributions obtained in Section 3.2. In the posterior sampling of μ 0 k , Σ 0 k and η k , Metropolis–Hastings within the Gibbs algorithm would be used, since their conditional posterior densities do not have explicit forms of known distributions as in Equation (15), Equation (19) and Equation (20).
Here, for simplicity, we considered the MCMC algorithm based on the HSSMN ( Θ ( k ) ) model with a known screening scheme, in which μ 0 k and Σ 0 k are assumed to be known. The extension to the general HSSMN ( Θ ( k ) ) model with unknown μ 0 k and Σ 0 k can be made without difficulty.
The MCMC algorithm starts with some initial values μ k [ 0 ] , λ k [ 0 ] , Ψ k [ 0 ] , F k [ 0 ] and η k [ 0 ] . The detailed posterior sampling steps are as follows:
  • Step 1: generate μ k by using the full conditional posterior distribution in Equation (14).
  • Step 2: generate λ k by using the full conditional posterior distribution in Equation (16).
  • Step 3: generate inverse-Wishart random matrix ψ k by using the full conditional posterior distribution in Equation (17).
  • Step 4: generate independent q-variate truncated normal random variables f k i by using the full conditional posterior distribution in Equation (18).
  • Step 5: given the current values { μ k , Λ k , Ψ k , F k } , we independently generate a candidate η k i from a proposal density q ( η k i * | η k i ) = g ( η k i * ) , as suggested by [26], which is used for a Metropolis–Hastings algorithm. Then, accept the candidate value with the acceptance rate:
    α(η_ki, η_ki*) = min{ p(Θ_(k) | η_ki*) / p(Θ_(k) | η_ki), 1 },
    i = 1, …, n_k. This acceptance rate follows because the target density is proportional to p(Θ_(k) | η_ki) g(η_ki), the proposal density equals g(η_ki*), and p(Θ_(k) | η_ki) is uniformly bounded for η_ki > 0, where:
    p(Θ_(k) | η_ki) = ϕ_p(x_ki; μ_k + Λ_k f_ki, κ(η_ki) Ψ_k) ϕ_q(f_ki; 0, κ(η_ki) Σ_0k) / Φ̄_q(C_q(a_k, b_k); 0, κ(η_ki) Σ_0k)
    and g(·) is the density of the mixing variable η_ki. Note that η_k = (η_k1, …, η_kn_k)′. A short sketch of this update follows.
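A minimal sketch of this update for a single η_ki in the RSt case (proposal g = Gamma(ν/2, ν/2), κ(η) = 1/η) is given below; because the proposal is the mixing density g itself, the Metropolis–Hastings ratio reduces to the ratio of the p(Θ_(k) | η_ki) terms above. The function names are ours.

```r
library(mvtnorm)

# p(Theta_(k) | eta_ki) of Step 5 with kappa(eta) = 1/eta.
p_theta_eta <- function(eta, x, f, mu, Lambda, Psi, Sigma0, a, b) {
  kap <- 1 / eta
  dmvnorm(x, mean = as.vector(mu + Lambda %*% f), sigma = kap * Psi) *
    dmvnorm(f, mean = rep(0, length(f)), sigma = kap * Sigma0) /
    pmvnorm(lower = a, upper = b, sigma = kap * Sigma0)[1]
}

# One independence Metropolis-Hastings update of eta_ki (proposal drawn from g).
update_eta <- function(eta, x, f, mu, Lambda, Psi, Sigma0, a, b, nu = 5) {
  eta.star <- rgamma(1, shape = nu / 2, rate = nu / 2)
  ratio <- p_theta_eta(eta.star, x, f, mu, Lambda, Psi, Sigma0, a, b) /
           p_theta_eta(eta,      x, f, mu, Lambda, Psi, Sigma0, a, b)
  if (runif(1) < min(1, ratio)) eta.star else eta
}
```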
When one conducts a posterior inference of the HSSMN ( Θ ( k ) ) model using the samples obtained from the MCMC sampling algorithm, the following points should be noted.
(i)
See, e.g., [18], for the sampling method for η_ki from various mixing distributions g(η_ki) of the SMN distributions, such as the multivariate t, multivariate logit, multivariate stable and multivariate exponential power models.
(ii)
Suppose the HSSMN(Θ_(k)) model involves unknown μ_0k. Then, as indicated by the full conditional posterior of μ_0k in Equation (15), the complexity of the conditional distribution prevents us from using straightforward Gibbs sampling. Instead, we may use a simple random walk Metropolis algorithm with a symmetric normal proposal density q(μ_0k* | μ_0k) = q(|μ_0k* − μ_0k|) to sample from the conditional distribution of μ_0k; that is, given the current point μ_0k, the candidate point is μ_0k* ∼ N_q(μ_0k, D), where the diagonal matrix D should be tuned so that the acceptance rate of the candidate point is around 0.25 (see, e.g., [26]).
(iii)
When the HSSMN(Θ_(k)) model involves unknown Σ_0k: the MCMC sampling algorithm using the full conditional posterior Equation (19) is not straightforward, because the conditional posterior density is complex and of no known form. Instead, we may apply a Metropolized hit-and-run algorithm, described by [27], to sample from the conditional posterior of Σ_0k.
(iv)
One can easily calculate the posterior estimate of Θ_k = (μ_k*, Σ_k*) from that of Θ_(k), because the re-parameterizing relations are Ψ_k = Σ_k − Δ_k Σ_0k^{−1} Δ_k′ and Λ_k = Δ_k Σ_0k^{−1}.

4. The Predictive Classification Rule

Suppose we have K populations π_k, k = 1, …, K, each specified by the HSSMN(Θ_(k)) model. For each of the populations, we have the screened training sample D_k comprised of a set of independent observations {x_ki, i = 1, …, n_k} whose population label is z_ki = k. Let x be assigned to one of the K populations, with prior probability p_k of belonging to π_k, ∑_{k=1}^K p_k = 1. Then, the predictive density of x given D under the HSSMN(Θ_(k)) model, with Θ_(k) taking values in its parameter space, is:
p(x | D, z = k) = ∫ p(x | Θ_(k)) p(Θ_(k) | D) dΘ_(k),  k = 1, …, K,   (23)
and the posterior probability that x belongs to π_k, i.e., p(z = k | D, x) = p(x ∈ π_k | D, x), is:
p(x ∈ π_k | D, x) = p(x | D, z = k) p(z = k | D) / ∑_{j=1}^K p(x | D, z = j) p(z = j | D),  k = 1, …, K,   (24)
where D = ∪_{k=1}^K D_k, p(x | Θ_(k)) is equal to Equation (6) and p(Θ_(k) | D) is the joint posterior density given in Equation (13). We see from Equation (24) that the total posterior probability of misclassifying x from π_i to π_j, i ≠ j, is defined by:
TPM(j) = ∑_{i=1, i≠j}^K p(x | D, z = i) p(z = i | D) / ∑_{ℓ=1}^K p(x | D, z = ℓ) p(z = ℓ | D).   (25)
We minimize the misclassification error if we choose j so as to minimize Equation (25); that is, we select the k that gives the maximum posterior probability p(x ∈ π_k | D, x) (see, e.g., Theorem 6.7.1 of [28] (p. 234)). Thus, an optimal Bayesian predictive discrimination rule that minimizes the classification error is to classify x into π_k if x ∈ R_k, where the optimal classification region is given by:
R_k : p(x | D, z = k) p(z = k | D) > p(x | D, z = j) p(z = j | D),  for all j ≠ k;  k = 1, …, K,   (26)
where p(z = k | D) is the posterior probability of population π_k given the dataset D. If the values of the p_k's are a priori known, then p(z = k | D) = p_k.
Since we are unable to obtain an analytic solution of Equation (26), a numerical approach is required. Thus, we used the MCMC method of the previous section to draw samples from the posterior density of the parameters, p ( Θ ( k ) | D ) , to approximate the predictive density, Equation (23), by:
p(x | D, z = k) ≈ (1/(N_k − M)) ∑_{t=M+1}^{N_k} p(x | Θ_k^t),  k = 1, …, K,   (27)
where Θ k t ’s are posterior samples generated from the MCMC process under the HSSMN ( Θ ( k ) ) model and M and N k are the burn-in period and run length, respectively.
If we assume Dirichlet priors for p k , that is:
[p_1, …, p_{K−1}] ∼ Dirichlet(d_1, …, d_{K−1}; d_K)
(see, e.g., [19] (p. 143) for the distributional properties), then:
p(z = k | D) = E[p_k | D] = (d_k + n_k) / ∑_{j=1}^K (d_j + n_j),  k = 1, …, K − 1,
and p(z = K | D) = 1 − ∑_{k=1}^{K−1} p(z = k | D).
Thus, the posterior probabilities in Equation (24) and the minimum error classification region R k in Equation (26) can be generated within the MCMC scheme, which uses Equation (27) to approximate the predictive densities involved in Equation (24) and Equation (26).
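Operationally, Equations (26) and (27) amount to averaging the SSMN density over the retained posterior draws for each population and assigning x to the class with the largest resulting posterior probability. The sketch below assumes draws[[k]] is a list of post-burn-in parameter draws for π_k (each a list with elements mu, mu0, Sigma, Sigma0 and Delta) and reuses the illustrative density evaluator drsn from Section 2; both are our notation, not the paper's.

```r
# Approximate predictive densities (Equation (27)) and the optimal rule (Equation (26)).
classify_bpda <- function(x, draws, lower, upper,
                          prior = rep(1 / length(draws), length(draws))) {
  pred <- sapply(seq_along(draws), function(k)
    mean(sapply(draws[[k]], function(th)
      drsn(x, th$mu, th$mu0, th$Sigma, th$Sigma0, th$Delta, lower, upper))))
  post <- prior * pred / sum(prior * pred)   # posterior probabilities, Equation (24)
  list(class = which.max(post), posterior = post)
}
```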

5. Simulation Study

This section presents the results of a simulation study to show the convergence of the MCMC algorithm and the performance of the BPDA_SSMN. Simulation of the training sample observations, model estimation by the MCMC algorithm and a comparison of classification results among three BPDA methods were implemented in R. The three methods consist of the two proposed BPDA_SSMN methods (i.e., BPDA_RSN and BPDA_RSt for classifying RSN and RSt populations) and the BPDA_N of [11] (for classifying unscreened normal populations).

5.1. A Simulation Study: Convergence of the MCMC Algorithm

This simulation study considers inference of the HSSMN(Θ_(k)) model in a two-dimensional case by generating a training sample of one thousand observations, n_k = 1000, from each population π_k, k = 1, 2, 3. We considered the following specific choice of parameters, i.e., μ_k = (μ_k1, μ_k2)′, Σ_0k, μ_0k, C_q(α, β), λ_k = vec(Λ_k) = (λ_k1, …, λ_k4)′ and Ψ_k = Σ_k − Δ_k Σ_0k^{−1} Δ_k′ = {ψ_kij}, for generating synthetic data from π_k:
μ_k = (1 + k, −2 + k)′,  Σ_0k = [ 7 + ε_k  −2 ; −2  4 + ε_k ],  α = (−0.5, −0.5)′,  β = (4, 4)′,  μ_0k = (0, 0)′,  Δ_k = [ 2  1 ; 2  0 ],  Σ_k = [ 3 + ε_k  0 ; 0  1 + ε_k ],  and  ε_k = 0.1 × k.
Based on the above parameter values with p = q = 2, we simulated 200 sets of three training samples, each of size n_k = 1000, from the three populations π_k, k = 1, 2, 3. Two cases of screened populations were assumed, that is, π_k : RSN_p(C_q(α, β); μ_k*, Σ_k*) and π_k : RSt_p(C_q(α, β); μ_k*, Σ_k*, ν = 5). The respective datasets were generated by using the stochastic representation of each population (see Lemma 1). Given a generated training sample, the corresponding population parameters were estimated by using the MCMC algorithm based on the HSSMN(Θ_(k)) model, Equation (12), for each screened population distribution. We used μ_k = 0, Δ_k = I_2 and Ψ_k = I_2 as the initial values of the MCMC algorithm. To satisfy an objective Bayesian perspective considered by [29], we need to specify the hyperparameters (θ_k, δ_k, Ω_k, Ω_0k, H_k, R_k, Q_k, τ_k, γ_k) of the HSSMN(Θ_(k)) model so as to be insensitive to changes of the priors. Thus, we assumed that we have no information about the parameters. To reflect this, we adopted θ_k = 0, δ_k = 0, Ω_k = 10^3 I_p, Ω_0k = 10^3 I_q, H_k = 10^{−3} I_q, R_k = 10^{−3} I_p, Q_k = 10^{−3} I_q, τ_k = 10^{−3} + p + 1 and γ_k = 10^{−3} + q + 1 (see, e.g., [18]).
The MCMC samplers were based on 20,000 iterations as burn-in, followed by a further 20,000 iterations with a thinning interval of 10. Thus, a final MCMC sample of size 2000 was obtained for each HSSMN(Θ_(k)) model. To save space, Table 1 provides posterior summaries only for the parameters of the π_1 distribution. Columns 4–9 of the table list the mean and three quantiles of the 200 sets of posterior samples, obtained by repeatedly applying the MCMC method to the 200 training samples of size n_1 = 1000; the remaining two columns list formal convergence test results for the MCMC algorithm. In estimating the Monte Carlo error (MC error) in Column 5, we used the batch mean method with 50 batches; see, e.g., [30] (pp. 39–40). The low values of the MC errors indicate that the variability of each estimate due to the simulation is well controlled. The table also compares the MCMC results with the true parameter values (listed in Column 3): (i) each parameter value in Column 3 is located in the credible interval (2.5% quantile, 97.5% quantile); (ii) for each parameter, the difference between its true value and the corresponding posterior mean is less than twice the standard error (s.e.). Thus, the posterior summaries, obtained by using the weakly informative priors, indicate that the MCMC method based on the HSSMN(Θ_(1)) model performs well in estimating the population parameters, regardless of the SSMN model (RSN or RSt) considered.
Table 1. Posterior summaries of 200 Markov chain Monte Carlo (MCMC) results for the π_1 models.
Model (π_1)  Parameter  True  Mean  MC Error  s.e.  2.5%  Median  97.5%  R_c  p-Value
RSN  μ_11  2.000  1.966  0.003  0.064  1.882  1.964  2.149  1.014  0.492
     μ_12  −1.000  −0.974  0.002  0.033  −1.023  −0.974  −0.903  1.011  0.164
     λ_11  0.312  0.320  0.008  0.159  0.046  0.322  0.819  1.021  0.944
     λ_12  0.406  0.407  0.007  0.164  0.030  0.417  0.872  1.018  0.107
     λ_13  0.250  0.253  0.004  0.083  0.082  0.256  0.439  1.019  0.629
     λ_14  0.125  0.133  0.004  0.067  0.003  0.133  0.408  1.017  0.761
     ψ_111  1.968  2.032  0.005  0.130  1.743  2.008  2.265  1.034  0.634
     ψ_112  −0.625  −0.627  0.002  0.098  −0.821  −0.617  −0.405  1.022  0.778
     ψ_122  0.500  0.566  0.001  0.039  0.465  0.557  0.638  1.018  0.445
RSt  μ_11  2.000  2.036  0.004  0.069  1.867  2.050  2.166  1.015  0.251
     μ_12  −1.000  −1.042  0.003  0.036  −1.137  −1.054  −0.974  1.012  0.365
     λ_11  0.312  0.318  0.008  0.072  0.186  0.320  0.601  1.017  0.654
     λ_12  0.406  0.405  0.006  0.074  0.262  0.414  0.562  1.019  0.712
     λ_13  0.250  0.255  0.005  0.051  0.113  0.257  0.387  1.023  0.661
     λ_14  0.125  0.136  0.005  0.055  0.027  0.133  0.301  1.019  0.598
     ψ_111  1.968  1.906  0.006  0.108  1.781  1.996  2.211  1.023  0.481
     ψ_112  −0.625  −0.620  0.003  0.101  −0.818  −0.615  −0.422  1.021  0.541
     ψ_122  0.500  0.459  0.002  0.044  0.366  0.457  0.578  1.016  0.412
Some of the trace plots from an MCMC run are provided in Figure 1. Each plot shows a parallel zone centered near the true parameter value of interest, with no obvious trend or periodicity. These plots and the small MC error values listed in Table 1 support the convergence of the MCMC algorithm. For a formal diagnostic check, we calculated the Brooks and Gelman diagnostic statistic R_c (the adjusted shrinkage factor introduced by [31]) using MCMC runs with three chains in parallel, each one starting from different initial values. The calculated R_c value for each parameter is listed in the 10th column of Table 1. Table 1 shows that all of the R_c values are close to one, indicating convergence of the MCMC algorithm. For another formal diagnostic check, we applied the Heidelberger–Welch diagnostic tests of [32] to the single-chain MCMC runs that were used to plot Figure 1. They consist of the stationarity test and the half-width test for the MCMC runs of each parameter. The 11th column of Table 1 lists the p-value of the test for the stationarity of the single Markov chain, where all of the p-values are larger than 0.1. Furthermore, all of the half-width tests, which test the convergence of the Markov chain of a single parameter, were passed. Thus, all of the diagnostic checks (formal and informal) support the convergence of the proposed MCMC algorithm, and hence, we can say that it generates an MCMC sample from the marginal posterior distributions of interest (i.e., of the SSMN population parameters). Similar estimation results apply to the posterior summaries of the other parameters of the π_2 and π_3 distributions. According to these simulation results, we can say that the MCMC algorithm constructed in Section 3.3 provides an efficient method for estimating the SSMN distributions. To achieve this quality of the MCMC algorithm in higher dimensional cases (with large p and/or q), the diagnostic tests considered in this section should be used to monitor the convergence of the algorithm; for more details, see [30].
Figure 1. Trace plots of μ_11, μ_12, λ_11 and ψ_111 generated from HSSMN(Θ_(1)) of the RSt model with ν = 5.
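Both formal diagnostics reported in Table 1 are available in the coda package. A minimal sketch, assuming chain1, chain2 and chain3 are matrices of posterior draws (one column per parameter) from three parallel runs, is:

```r
library(coda)

# Brooks-Gelman shrinkage factors (R_c) from three parallel chains, and the
# Heidelberger-Welch stationarity and half-width tests on a single chain.
mc <- mcmc.list(mcmc(chain1), mcmc(chain2), mcmc(chain3))
gelman.diag(mc)            # potential scale reduction factors per parameter
heidel.diag(mcmc(chain1))  # stationarity p-values and half-width test results
```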

5.2. A Simulation Study: Performance of the Predictive Methods

This simulation study compares the performance of the three BPDA methods using training samples generated from three rectangle screened populations, π_k (k = 1, 2, 3). The three methods compared are BPDA_RSN, BPDA_RSt with degrees of freedom ν = 5, and BPDA_N (a standard predictive method with no screening). Two different cases of rectangle screened population distributions were used to generate the training samples. One case is the rectangle screened population π_k with the RSN_p(C_q(α, β); μ_k*, Σ_k*) distribution. The other case is π_k with the RSt_p(C_q(α, β); μ_k*, Σ_k*, ν = 5) distribution, chosen in order to examine the robustness of BPDA_RSt in discriminating observations from heavy-tailed empirical distributions. For each case, we obtained 200 sets of training and validation (or testing) samples of each size n_k = 20, 50, 100 generated from the rectangle screened distribution of π_k. They are denoted by D_k(i) and V_k(i) (i = 1, …, 200). The i-th validation sample V_k(i) corresponding to the training sample D_k(i) was simply obtained by setting V_k(i) = D_k(i−1) (i = 1, …, 200), where D_k(0) = D_k(200).
The parameter values of the screened population distributions of the three populations π k were given by:
μ_k* = ( 0_q′, ε(−2 + k) 1_p′ )′,  Σ_k* = [ I_q  Δ′ ; Δ  I_p ],  Δ = ρ J_{p×q},  α = a 1_q,  β = 1_q,
for p = 2, 5, q = 2 and k = 1, 2, 3. Further, we assumed that the parameters μ_0k and Σ_0k of the underlying q-dimensional screening vector v_0 and the rectangle screening region C_q(α, β) were known as given above. Thus, we may investigate the performance of the BPDA methods by varying the values of the correlation ρ, the dimension p of the predictor vector, the rectangle screening region and the differences among the three population means and covariance matrices, whose expressions can be found in [4]. Here, 1_r is an r × 1 summing vector whose every element is unity, and J_{p×q} denotes a p × q matrix (q = 2) whose every odd row is equal to (1, 0) and every even row is (0, 1).
Using the training samples, we calculated the approximate predictive densities in Equation (27) by the MCMC algorithm proposed in Section 3.3. In this calculation, we assumed that p_k = 1/3, because n_1 = n_2 = n_3 = n. Thus, the posterior probabilities in Equation (24) and the minimum error classification region R_k in Equation (26) can be estimated within the MCMC scheme, which uses Equation (27) to approximate the predictive densities involved in both Equation (24) and Equation (26). Then, we estimated the classification error rates of the three BPDA methods by using the validation samples, V_k(i) (i = 1, …, 200). To apply the BPDA_RSN and BPDA_RSt methods to the simulated training samples, we used the optimal classification rule based on Equation (26), while we used the posterior odds ratio given in [11] to implement the BPDA_N method. We then compared the classification results in terms of error rates. The error rate of each population (ER_πk) and the total error rate (Total ER) were estimated by:
Total ER = ∑_{k=1}^3 p_k ER_πk  and  ER_πk = n_k*/n_k,  k = 1, 2, 3,
where n k * is the number of misclassified observations out of n k validation sample observations from π k .
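For completeness, the error rates defined above can be tallied with a few lines of R; here pred and truth are assumed to be the predicted and true population labels of the pooled validation observations, with equal prior probabilities p_k = 1/3 as in the study.

```r
# Per-population error rates ER_pi_k = n_k*/n_k and the total error rate.
error_rates <- function(pred, truth, prior = rep(1/3, 3)) {
  er <- sapply(sort(unique(truth)), function(k) mean(pred[truth == k] != k))
  c(ER = er, Total = sum(prior * er))
}
```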
For each case of the π_k distributions, the above procedure was implemented on each set of 200 validation samples to evaluate the error rates of the BPDA methods. Here, [Case 1] denotes that the training (and validation) samples were generated from π_k : RSN_p(C_q(α, β); μ_k*, Σ_k*), and [Case 2] indicates that they were generated from π_k : RSt_p(C_q(α, β); μ_k*, Σ_k*, ν = 5), k = 1, 2, 3. For each case, Table 2 compares the mean classification error rates obtained from the 200 replicated classifications using the BPDA methods. The error rates and their standard errors in Table 2 indicate the following. (i) Both the BPDA_RSN and BPDA_RSt methods work reasonably well in classifying screened observations, compared to the BPDA_N method. This implies that, in BPDA, they provide better classification results than the BPDA_N, provided that the π_k's are screened by a rectangle screening scheme. (ii) The performance of the BPDA_SSMN methods becomes better as the correlation (ρ) between the screening variables and the predictor variables becomes larger. (iii) Comparing the error rates with respect to the values of a, we see that the BPDA_SSMN methods tend to yield better performance in the discrimination of data screened by a small rectangle screening region. (iv) The performance of the three BPDA methods improves when the differences among the population means increase. (v) An increase in the dimension p and the training sample size n also tends to yield better performance of the BPDA methods. (vi) As expected, the performance of the BPDA_RSN in [Case 1] is better than that of the other two methods, and the differences in the estimated error rates exceed the corresponding two standard errors. Further, the considerable gain in the error rates over the BPDA_N demonstrates the utility of the BPDA_RSN in the discriminant analysis. (vii) As for [Case 2], the table indicates that the performance of the BPDA_RSt is better than that of the two other methods. This demonstrates the robustness of the BPDA_RSt method in the discrimination with screened and heavy-tailed data.
Table 2. Classification error rates: the respective standard errors are in parentheses.
p  n  a  Method  ρ = 0.5, ε = 0.1  ρ = 0.5, ε = 0.4  ρ = 0.9, ε = 0.1  ρ = 0.9, ε = 0.4
[Case 1]
2  20  0.5  BPDA_RSN  0.322(0.0025)  0.174(0.0022)  0.281(0.0024)  0.106(0.0020)
2  20  0.5  BPDA_RSt  0.335(0.0025)  0.185(0.0023)  0.306(0.0025)  0.115(0.0021)
2  20  0.5  BPDA_N    0.350(0.0025)  0.206(0.0023)  0.356(0.0025)  0.205(0.0021)
2  20  −0.5  BPDA_RSN  0.329(0.0027)  0.182(0.0023)  0.301(0.0025)  0.134(0.0021)
2  20  −0.5  BPDA_RSt  0.348(0.0024)  0.193(0.0022)  0.319(0.0024)  0.142(0.0021)
2  20  −0.5  BPDA_N    0.349(0.0025)  0.201(0.0023)  0.349(0.0025)  0.192(0.0020)
2  100  0.5  BPDA_RSN  0.303(0.0016)  0.161(0.0014)  0.266(0.0015)  0.097(0.0013)
2  100  0.5  BPDA_RSt  0.316(0.0017)  0.165(0.0013)  0.275(0.0015)  0.101(0.0013)
2  100  0.5  BPDA_N    0.351(0.0025)  0.186(0.0023)  0.356(0.0025)  0.186(0.0021)
2  100  −0.5  BPDA_RSN  0.306(0.0015)  0.163(0.0014)  0.282(0.0014)  0.116(0.0013)
2  100  −0.5  BPDA_RSt  0.318(0.0017)  0.168(0.0015)  0.291(0.0015)  0.121(0.0013)
2  100  −0.5  BPDA_N    0.338(0.0024)  0.172(0.0023)  0.337(0.0026)  0.170(0.0021)
5  20  0.5  BPDA_RSN  0.318(0.0025)  0.158(0.0022)  0.240(0.0024)  0.101(0.0020)
5  20  0.5  BPDA_RSt  0.327(0.0026)  0.175(0.0023)  0.276(0.0025)  0.114(0.0021)
5  20  0.5  BPDA_N    0.337(0.0026)  0.183(0.0023)  0.332(0.0025)  0.184(0.0020)
5  20  −0.5  BPDA_RSN  0.321(0.0025)  0.165(0.0023)  0.231(0.0025)  0.109(0.0021)
5  20  −0.5  BPDA_RSt  0.330(0.0026)  0.207(0.0023)  0.318(0.0025)  0.141(0.0021)
5  20  −0.5  BPDA_N    0.345(0.0026)  0.216(0.0024)  0.346(0.0025)  0.218(0.0021)
5  100  0.5  BPDA_RSN  0.280(0.0015)  0.150(0.0014)  0.233(0.0015)  0.084(0.0012)
5  100  0.5  BPDA_RSt  0.291(0.0016)  0.153(0.0015)  0.249(0.0015)  0.092(0.0013)
5  100  0.5  BPDA_N    0.307(0.0025)  0.186(0.0023)  0.308(0.0025)  0.189(0.0021)
5  100  −0.5  BPDA_RSN  0.291(0.0016)  0.163(0.0014)  0.239(0.0015)  0.103(0.0013)
5  100  −0.5  BPDA_RSt  0.294(0.0016)  0.169(0.0015)  0.253(0.0015)  0.117(0.0013)
5  100  −0.5  BPDA_N    0.305(0.0024)  0.175(0.0022)  0.301(0.0025)  0.176(0.0021)
[Case 2]
2  20  0.5  BPDA_RSN  0.351(0.0025)  0.189(0.0022)  0.310(0.0025)  0.114(0.0021)
2  20  0.5  BPDA_RSt  0.320(0.0024)  0.175(0.0023)  0.293(0.0024)  0.105(0.0020)
2  20  0.5  BPDA_N    0.367(0.0026)  0.185(0.0023)  0.365(0.0024)  0.191(0.0020)
2  20  −0.5  BPDA_RSN  0.349(0.0026)  0.192(0.0022)  0.317(0.0024)  0.149(0.0022)
2  20  −0.5  BPDA_RSt  0.321(0.0023)  0.183(0.0021)  0.304(0.0023)  0.132(0.0021)
2  20  −0.5  BPDA_N    0.356(0.0025)  0.210(0.0023)  0.357(0.0025)  0.199(0.0020)
2  100  0.5  BPDA_RSN  0.313(0.0016)  0.164(0.0015)  0.273(0.0015)  0.098(0.0014)
2  100  0.5  BPDA_RSt  0.306(0.0015)  0.158(0.0013)  0.265(0.0014)  0.091(0.0012)
2  100  0.5  BPDA_N    0.346(0.0023)  0.179(0.0022)  0.341(0.0024)  0.175(0.0022)
2  100  −0.5  BPDA_RSN  0.321(0.0015)  0.170(0.0014)  0.287(0.0015)  0.119(0.0015)
2  100  −0.5  BPDA_RSt  0.310(0.0014)  0.164(0.0013)  0.281(0.0013)  0.112(0.0013)
2  100  −0.5  BPDA_N    0.329(0.0025)  0.181(0.0025)  0.327(0.0027)  0.176(0.0022)
5  20  0.5  BPDA_RSN  0.329(0.0024)  0.181(0.0024)  0.281(0.0023)  0.119(0.0021)
5  20  0.5  BPDA_RSt  0.317(0.0023)  0.164(0.0020)  0.265(0.0021)  0.094(0.0020)
5  20  0.5  BPDA_N    0.340(0.0027)  0.196(0.0024)  0.314(0.0026)  0.152(0.0022)
5  20  −0.5  BPDA_RSN  0.342(0.0025)  0.205(0.0024)  0.332(0.0024)  0.194(0.0024)
5  20  −0.5  BPDA_RSt  0.328(0.0022)  0.171(0.0022)  0.275(0.0022)  0.118(0.0021)
5  20  −0.5  BPDA_N    0.351(0.0026)  0.224(0.0025)  0.329(0.0025)  0.175(0.0025)
5  100  0.5  BPDA_RSN  0.284(0.0016)  0.155(0.0018)  0.283(0.0016)  0.154(0.0013)
5  100  0.5  BPDA_RSt  0.271(0.0014)  0.149(0.0014)  0.238(0.0014)  0.086(0.0011)
5  100  0.5  BPDA_N    0.294(0.0026)  0.192(0.0024)  0.274(0.0026)  0.161(0.0024)
5  100  −0.5  BPDA_RSN  0.289(0.0016)  0.177(0.0015)  0.288(0.0016)  0.175(0.0013)
5  100  −0.5  BPDA_RSt  0.278(0.0013)  0.162(0.0013)  0.231(0.0014)  0.107(0.0011)
5  100  −0.5  BPDA_N    0.312(0.0025)  0.178(0.0025)  0.270(0.0026)  0.141(0.0022)

6. Conclusions

In this paper, we proposed an optimal predictive method (BPDA) for the discriminant analysis of multidimensional screened data. In order to incorporate the prior information about a screening mechanism flexibly in the analysis, we introduced the SSMN models. Then, we provided the HSSMN ( Θ ( k ) ) model for Bayesian inference of the SSMN populations, where the screened data were generated. Based on the HSSMN ( Θ ( k ) ) model, posterior distributions of Θ ( k ) were derived, and the calculation of the optimal predictive classification rule was discussed by using an efficient MCMC method. Numerical studies with simulated screened observations were given to illustrate the convergence of the MCMC method and the usefulness of the BPDA.
The methodological results of the Bayesian estimation procedure proposed in this paper can be extended to other multivariate linear models that incorporate non-normal errors, a general covariance matrix and truncated random covariates. For example, the seemingly unrelated regression (SUR) model and the factor analysis model (see, e.g., [19]) can be placed in the same framework as the proposed HSSMN(Θ_(k)) model in Equation (12). The former is a special case of the HSSMN(Θ_(k)) model in which the latent vectors are observable as predictors; therefore, when the regression errors are non-normal, it would be plausible to apply the proposed approach, using the HSSMN(Θ_(k)) model, to work with a robust SUR model. The latter is a natural extension of the oblique factor analysis model to the case of non-normal measurement errors. The HSSMN(Θ_(k)) model can also be extended to accommodate missing values, as done for other models by [33,34]. We hope to address these issues in the near future.

Acknowledgments

The research of Hea-Jung Kim was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (2013R1A2A2A01004790).

Author Contributions

The author developed a Bayesian predictive method for the discriminant analysis of screened data. For the multivariate technique, the author introduced a predictive discrimination method with the SSMN populations ( B P D A S S M N ) and provided a Bayesian estimation methodology, which is suited to the B P D A S S M N . The methodology consists of constructing a hierarchical model for the SSMN populations (HSSMN ( Θ ( k ) ) ) and using an efficient MCMC algorithm to estimate the SSMN models, as well as an optimal rule for the B P D A S S M N .

Conflicts of Interest

The author declares no conflict of interest.

Appendix

Derivations of the full conditional posterior distributions
(1)
The full conditional posterior density of μ k given μ 0 k , λ k , Ψ k , F k , Σ 0 k , η k and D k is proportional to:
∏_{i=1}^{n_k} ϕ_p(x_ki; μ_k + Λ_k f_ki, κ(η_ki) Ψ_k) ϕ_p(μ_k; θ_k, Ω_k) ∝ exp{ −(1/2)(μ_k − μ_μk)′ Σ_μk^{−1} (μ_k − μ_μk) },
which is a kernel of the N p ( μ μ k , Σ μ k ) distribution.
(2)
It is obvious from the joint posterior density in Equation (13).
(3)
It is straightforward to see from Equation (13) that the full conditional posterior density of λ k is given by:
p(Λ_k | Θ_(k)\Λ_k, D_k) ∝ exp{ −(1/2) tr[Ψ_k^{−1} V_k] } ∝ exp{ −(1/2) tr[ Ψ_k^{−1} (Λ_k − Λ_k*) Q_k (Λ_k − Λ_k*)′ ] } ∝ exp{ −(1/2)(λ_k − μ_λk)′ Σ_λk^{−1} (λ_k − μ_λk) }.
This is a kernel of N p q ( μ λ k , Σ λ k ) , where V k = ( X k - Λ k F k ) D ( κ ( η k ) ) - 1 ( X k - Λ k F k ) + ( Λ k - Λ 0 k ) H k ( Λ k - Λ 0 k ) + R k and ν k = n k + q + τ k .
(4)
We see from Equation (13) that the full conditional posterior density of Ψ k is given by:
p(Ψ_k | Θ_(k)\Ψ_k, D_k) ∝ |Ψ_k|^{−(n_k+q)/2} exp{ −(1/2) tr[ Ψ_k^{−1} (X_k − Λ_k F_k) D(κ(η_k))^{−1} (X_k − Λ_k F_k)′ ] } × exp{ −(1/2) tr[ Ψ_k^{−1} (Λ_k − Λ_0k) H_k (Λ_k − Λ_0k)′ ] } × IW_p(Ψ_k; R_k, τ_k) ∝ |Ψ_k|^{−(n_k+τ_k+q)/2} exp{ −(1/2) tr[ Ψ_k^{−1} V_k ] }.
This is a kernel of I W p ( V k , ν k ) .
(5)
We see, from Equation (13), that the full conditional posterior densities of f k i ’s are independent, and each density is given by:
p(f_ki | Θ_(k)\f_ki, D_k) ∝ ϕ_q(f_ki; 0, κ(η_ki) Σ_0k) ϕ_p(x_ki; μ_k + Λ_k f_ki, κ(η_ki) Ψ_k) I(f_ki ∈ C_q(a_k, b_k)) ∝ exp{ −(1/(2κ(η_ki))) [ f_ki′ (Σ_0k^{−1} + Λ_k′ Ψ_k^{−1} Λ_k) f_ki − 2 f_ki′ Λ_k′ Ψ_k^{−1} (x_ki − μ_k) ] } I(f_ki ∈ C_q(a_k, b_k)) ∝ exp{ −(1/(2κ(η_ki))) (f_ki − μ_fki)′ Σ_fki^{−1} (f_ki − μ_fki) } I(f_ki ∈ C_q(a_k, b_k)),
which is a kernel of the q-variate truncated normal N_q(μ_fki, κ(η_ki) Σ_fki) I(f_ki ∈ C_q(a_k, b_k)).
(6)
It is obvious from the joint posterior density in Equation (13).
(7)
It is obvious from the joint posterior density in Equation (13).

References

  1. Catsiapis, G.; Robinson, C. Sample selection bias with multiple selection rules: An application to student aid grants. J. Econom. 1982, 18, 351–368. [Google Scholar] [CrossRef]
  2. Mohanty, M.S. Determination of participation decision, hiring decision, and wages in a double selection framework: Male-female wage differentials in the U.S. labor market revisited. Contemp. Econ. Policy 2001, 19, 197–212. [Google Scholar] [CrossRef]
  3. Kim, H.J. A class of weighted multivariate normal distributions and its properties. J. Multivar. Anal. 2008, 99, 1758–1771. [Google Scholar] [CrossRef]
  4. Kim, H.J.; Kim, H.-M. A class of rectangle-screened multivariate normal distributions and its applications. Statistics 2015, 49, 878–899. [Google Scholar] [CrossRef]
  5. Lin, T.I.; Ho, H.J.; Chen, C.L. Analysis of multivariate skew normal models with incomplete data. J. Multivar. Anal. 2009, 100, 2337–2351. [Google Scholar] [CrossRef]
  6. Arellano-Valle, R.B.; Branco, M.D.; Genton, M.G. A unified view of skewed distributions arising from selections. Can. J. Stat. 2006, 34, 581–601. [Google Scholar]
  7. Kim, H.J. Classification of a screened data into one of two normal populations perturbed by a screening scheme. J. Multivar. Anal. 2011, 102, 1361–1373. [Google Scholar] [CrossRef]
  8. Kim, H.J. A best linear threshold classification with scale mixture of skew normal populations. Comput. Stat. 2015, 30, 1–28. [Google Scholar] [CrossRef]
  9. Marchenko, Y.V.; Genton, M.G. A Heckman selection-t model. J. Am. Stat. Assoc. 2012, 107, 304–315. [Google Scholar] [CrossRef]
  10. Sahu, S.K.; Dey, D.K.; Branco, M.D. A new class of multivariate skew distributions with applications to Bayesian regression models. Can. J. Stat. 2003, 31, 129–150. [Google Scholar] [CrossRef]
  11. Geisser, S. Posterior odds for multivariate normal classifications. J. R. Stat. Soc. B 1964, 26, 69–76. [Google Scholar]
  12. Lachenbruch, P.A.; Sneeringer, C.; Revo, L.T. Robustness of the linear and quadratic discriminant function to certain types of non-normality. Commun. Stat. 1973, 1, 39–57. [Google Scholar] [CrossRef]
  13. Wang, Y.; Chen, H.; Zeng, D.; Mauro, C.; Duan, N.; Shear, M.K. Auxiliary marker-assisted classification in the absence of class identifiers. J. Am. Stat. Assoc. 2013, 108, 553–565. [Google Scholar] [CrossRef] [PubMed]
  14. Webb, A. Statistical Pattern Recognition; Wiley: New York, NY, USA, 2002. [Google Scholar]
  15. Aitchison, J.; Habbema, J.D.F.; Kay, J.W. A critical comparison of two methods of statistical discrimination. Appl. Stat. 1977, 26, 15–25. [Google Scholar] [CrossRef]
  16. Azzalini, A.; Capitanio, A. Statistical application of the multivariate skew-normal distribution. J. R. Stat. Soc. B 1999, 65, 367–389. [Google Scholar] [CrossRef]
  17. Branco, M.D.; Dey, D.K. A general class of multivariate skew-elliptical distributions. J. Multivar. Anal. 2001, 79, 99–113. [Google Scholar]
  18. Chen, M-H.; Dey, D.K. Bayesian modeling of correlated binary responses via scale mixture of multivariate normal link functions. Indian J. Stat. 1998, 60, 322–343. [Google Scholar]
  19. Press, S.J. Applied Multivariate Analysis, 2nd ed.; Dover: New York, NY, USA, 2005. [Google Scholar]
  20. Reza-Zadkarami, M.; Rowhani, M. Application of skew-normal in classification of satellite image. J. Data Sci. 2010, 8, 597–606. [Google Scholar]
  21. Wang, W.L.; Fan, T.H. Bayesian analysis of multivariate t linear mixed models using a combination of IBF and Gibbs samplers. J. Multivar. Anal. 2012, 105, 300–310. [Google Scholar] [CrossRef]
  22. Wang, W.L.; Lin, T.I. Bayesian analysis of multivariate t linear mixed models with missing responses at random. J. Stat. Comput. Simul. 2015, 85. [Google Scholar] [CrossRef]
  23. Johnson, N.L.; Kotz, S. Distribution in Statistics: Continuous Multivariate Distributions; Wiley: New York, NY, USA, 1972. [Google Scholar]
  24. Wilhelm, S.; Manjunath, B.G. tmvtnorm: Truncated multivariate normal distribution and student t distribution. R J. 2010, 1, 25–29. [Google Scholar]
  25. Genz, A.; Bretz, F. Computation of Multivariate Normal and t Probabilities; Springer: New York, NY, USA, 2009. [Google Scholar]
  26. Chib, S.; Greenberg, E. Understanding the Metropolis-Hastings algorithm. Am. Stat. 1995, 49, 327–335. [Google Scholar]
  27. Chen, M.-H.; Schmeiser, B.W. Performance of the Gibbs, hit-and-run, and Metropolis samplers. J. Comput. Gr. Stat. 1993, 2, 251–272. [Google Scholar] [CrossRef]
  28. Anderson, T.W. Introduction to Multivariate Statistical Analysis, 3rd ed.; Wiley: Hoboken, NJ, USA, 2003. [Google Scholar]
  29. Edwards, W.; Lindman, H.; Savage, L.J. Bayesian statistical inference for psychological research. Psychol. Rev. 1963, 70, 192–242. [Google Scholar] [CrossRef]
  30. Ntzoufras, I. Bayesian Modeling Using WinBUGS; Wiley: New York, NY, USA, 2009. [Google Scholar]
  31. Brooks, S.; Gelman, A. Alternative methods for monitoring convergence of iterative simulations. J. Comput. Gr. Stat. 1998, 7, 434–455. [Google Scholar]
  32. Heidelberger, P.; Welch, P. Simulation run length control in the presence of an initial transient. Oper. Res. 1983, 31, 1109–1144. [Google Scholar] [CrossRef]
  33. Lin, T.I. Learning from incomplete data via parameterized t mixture models through eigenvalue decomposition. Comput. Stat. Data Anal. 2014, 71, 183–195. [Google Scholar] [CrossRef]
  34. Lin, T.I.; Ho, H.J.; Chen, C.L. Analysis of multivariate skew normal models with incomplete data. J. Multivar. Anal. 2009, 100, 2337–2351. [Google Scholar] [CrossRef]
