A critical issue in modeling binary response data is the choice of the link function. We introduce a new link based on the Student’s t-distribution (t-link) for correlated binary data. The t-link extends the common probit-normal link by adding one parameter that controls the heaviness of the tails of the link. We propose an EM algorithm for computing the maximum likelihood estimates for generalized linear mixed t-link models for correlated binary data. In contrast with recent developments (Tan et al. in J. Stat. Comput. Simul. 77:929–943, 2007; Meza et al. in Comput. Stat. Data Anal. 53:1350–1360, 2009), this algorithm uses closed-form expressions at the E-step, as opposed to Monte Carlo simulation. Our proposed algorithm relies on available formulas for the mean and variance of a truncated multivariate t-distribution. To illustrate the new method, a real data set on respiratory infection in children and a simulation study are presented.
We thank the editor, associate editor and two referees, whose constructive comments led to a much improved presentation. Victor Lachos acknowledges support from CNPq-Brazil (Grant 305054/2011-2) and from FAPESP-Brazil (Grant 2011/17400-6). Marcos Prates would like to acknowledge the partial support of Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG-Brazil).
Proof of Proposition 1
First note that if \(\mathbf{X}\sim t_{p}(\boldsymbol {\mu },\boldsymbol {\varSigma },\nu)\), then we can write
It follows that
which concludes the proof. □
Lemma 1
If U∼Gamma(α,β), then for any vector \(\mathbf{B}\in \mathbb{R}^{p}\) and a p×p positive definite matrix Σ,
If \(\mathbf{V}\sim N_{p}(\mathbf{0},\boldsymbol {\varSigma })\), then
where, clearly \(\mathbf{T}=\frac{\mathbf{V}}{(U\beta/\alpha)^{1/2}}\) has a multivariate Student’s t-distribution, which concludes the proof. □
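The scale-mixture construction used in this proof can be checked numerically. The following is a minimal sketch for the univariate case only (p = 1), assuming that Gamma(α,β) is parameterized by shape α and rate β, which is consistent with U i ∼Gamma(ν/2,ν/2) having mean one. Under that assumption T has a Student's t-distribution with 2α degrees of freedom, so its variance should be close to σ²·2α/(2α−2). The function name `sample_t_via_mixture` and the chosen parameter values are illustrative and are not taken from the paper.

```python
import math
import random


def sample_t_via_mixture(alpha, beta, sigma, rng):
    """Draw T = V / sqrt(U * beta / alpha) with U ~ Gamma(alpha, rate=beta), V ~ N(0, sigma^2)."""
    # random.gammavariate takes (shape, scale), so rate beta becomes scale 1/beta.
    u = rng.gammavariate(alpha, 1.0 / beta)
    v = rng.gauss(0.0, sigma)
    return v / math.sqrt(u * beta / alpha)


rng = random.Random(0)  # fixed seed so the check is reproducible
alpha, beta, sigma = 5.0, 2.0, 1.0  # illustrative values (assumption)
draws = [sample_t_via_mixture(alpha, beta, sigma, rng) for _ in range(200_000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / (len(draws) - 1)

# Variance of a scaled t-distribution with nu = 2 * alpha degrees of freedom (nu > 2).
nu = 2.0 * alpha
theory_var = sigma ** 2 * nu / (nu - 2.0)

print(f"sample variance = {sample_var:.3f}, theoretical variance = {theory_var:.3f}")
```

Note that β cancels out of the normalized mixing variable Uβ/α, so the degrees of freedom depend on α alone.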
Details of the EM Algorithm:
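Treat \(\mathbf{b}=\{\mathbf{b}_{i}\}^{m}_{i=1}\), \(\mathbf{Z}=\{\mathbf{Z}_{i}\}^{m}_{i=1}\) and \(\mathbf{U}=\{{U}_{i}\}^{m}_{i=1}\) as missing data. From the definition of the latent variable \(\mathbf{Z}\), the response \(\mathbf{Y}\) is completely determined by \(\mathbf{Z}\), so \(\{\mathbf{Y},\mathbf{Z}\}\) carries the same information as \(\mathbf{Z}\). Then, the joint density for the complete data \(\mathbf{Y}_{\mathrm{com}}=\{\mathbf{Y},\mathbf{Z},\mathbf{b},\mathbf{U}\}\) is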
To complete the description of the EM-type algorithm for ML estimation of the t-GLMM, it is necessary to derive the four conditional expectations of the complete-data sufficient statistics: \(E[U_{i}|\mathbf {Y}_{i}]\), \(E[U_{i}\mathbf {Z}_{i}|\mathbf {Y}_{i}]\), \(E[U_{i}\mathbf {b}_{i}|\mathbf {Y}_{i}]\) and \(E[U_{i}\mathbf {b}_{i}\mathbf {b}^{\top}_{i}|\mathbf {Y}_{i}]\). To calculate them, we first derive the conditional predictive distribution of the missing data, which is given by:
Since f(b|Y,Z,u,θ) is proportional to (9), we obtain the following result:
where \(\boldsymbol {\varDelta }_{i}=\mathbf {D}\mathbf {W}^{\top}_{i} \boldsymbol {\varOmega }_{i}^{-1}\), \(\boldsymbol {\varLambda }_{i}=\mathbf {D}-\mathbf {D}\mathbf {W}^{\top}_{i}\boldsymbol {\varOmega }^{-1}_{i}\mathbf {W}_{i}\mathbf {D}\) and \(\boldsymbol {\varOmega }_{i}=\mathbf {W}_{i}\mathbf {D}\mathbf {W}^{\top}_{i}+\mathbf{I}_{n_{i}}\), i=1,…,m. To derive the second term on the right-hand side of (10), we use the following result from Chib and Greenberg (1998)
which indicates that, given \(\mathbf {Z}_{i}\), the conditional probability of \(\mathbf {Y}_{i}\) is independent of \(\mathbf {b}_{i}\) and \(u_{i}\). Hence, expression (11) implies \(P(\mathbf {Y}_{i}=\mathbf {y}_{i}|\mathbf {Z}_{i},\boldsymbol {\theta })=\mathbb{I}_{(\mathbf {Z}_{i} \in \mathbb{B}_{i})}\). Since the conditional distribution of \(\mathbf {Z}_{i}|u_{i},\boldsymbol {\theta }\) is normal and \(U_{i}\sim \mathrm{Gamma}(\nu/2,\nu/2)\), the marginal distribution of \(\mathbf {Z}_{i}|\boldsymbol {\theta }\) is \(t_{n_{i}}(\mathbf {X}_{i}\boldsymbol {\beta },\boldsymbol {\varOmega }_{i},\nu)\). Furthermore, from
we obtain
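We now use the previous results together with the following property: if \(\mathbf {Z}|\boldsymbol {\theta }\) follows \(t_{p}(\boldsymbol {\mu },\boldsymbol {\varSigma },\nu)\) and \(U\sim \mathrm{Gamma}(\nu/2,\nu/2)\), then \(E[U|\mathbf {Z}]=\frac{\nu+p}{\nu+\delta}\) (Lachos et al. 2011), where \(\delta=(\mathbf {Z}-\boldsymbol {\mu })^{\top}\boldsymbol {\varSigma }^{-1}(\mathbf {Z}-\boldsymbol {\mu })\) is the squared Mahalanobis distance. It follows that: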
where \(\bar{\mathbf {Z}}^{2}_{i}=E [\frac{\nu+n_{i}}{\nu+\delta_{i}}\mathbf {Z}_{i}\mathbf {Z}^{\top}_{i}|\mathbf {Y}_{i} ]\), \(\delta_{i}=(\mathbf {Z}_{i}-\boldsymbol {\gamma }_{i})^{\top} \boldsymbol {\varOmega }^{-1}_{i}(\mathbf {Z}_{i}-\boldsymbol {\gamma }_{i})\), \(\boldsymbol {\varDelta }_{i}=\mathbf {D}\mathbf {W}^{\top}_{i} \boldsymbol {\varOmega }_{i}^{-1}\), \(\boldsymbol {\varLambda }_{i}=\mathbf {D}-\mathbf {D}\mathbf {W}^{\top}_{i}\boldsymbol {\varOmega }^{-1}_{i}\mathbf {W}_{i}\mathbf {D}\), \(\boldsymbol {\varOmega }_{i}=\mathbf {W}_{i}\mathbf {D}\mathbf {W}^{\top}_{i}+\mathbf{I}_{n_{i}}\), \(\boldsymbol {\gamma }_{i}=\mathbf {X}_{i}\boldsymbol {\beta }\), and \(\mathbb{B}_{i}=B_{i1}\times\cdots\times B_{in_{i}}\), where \(B_{ij}\) is the interval (0,∞) if \(y_{ij}=1\) and the interval (−∞,0] if \(y_{ij}=0\).
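The cluster-level quantities defined above can be assembled directly from W i and D. The sketch below is a plain-Python illustration, not the authors' implementation: it builds Ω i, Δ i and Λ i, evaluates δ i and the weight (ν+n i)/(ν+δ i), and tests membership of a latent vector in the region 𝔹 i. All function names (`em_matrices`, `mahalanobis_sq`, `weight`, `in_region`) and the example values of W and D are hypothetical. As a sanity check, it compares Λ i with (D⁻¹+W iᵀW i)⁻¹, which should agree by the Woodbury matrix identity.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def inverse(A):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(A)
    M = [list(map(float, row)) + e for row, e in zip(A, identity(n))]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [x / piv for x in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [row[n:] for row in M]


def em_matrices(W, D):
    """Return (Delta_i, Lambda_i, Omega_i) for one cluster."""
    n = len(W)
    WD = matmul(W, D)
    WDWt = matmul(WD, transpose(W))
    Omega = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(WDWt, identity(n))]
    Delta = matmul(matmul(D, transpose(W)), inverse(Omega))  # D W^T Omega^{-1}
    correction = matmul(Delta, WD)                           # D W^T Omega^{-1} W D
    Lam = [[d - c for d, c in zip(r1, r2)] for r1, r2 in zip(D, correction)]
    return Delta, Lam, Omega


def mahalanobis_sq(z, gamma, Omega):
    """delta_i = (z - gamma)^T Omega^{-1} (z - gamma)."""
    r = [[zj - gj] for zj, gj in zip(z, gamma)]
    return matmul(matmul(transpose(r), inverse(Omega)), r)[0][0]


def weight(nu, n_i, delta_i):
    """(nu + n_i) / (nu + delta_i), the conditional mean of U_i given Z_i."""
    return (nu + n_i) / (nu + delta_i)


def in_region(z, y):
    """True if z_j lies in (0, inf) when y_j = 1 and in (-inf, 0] when y_j = 0."""
    return all((zj > 0) if yj == 1 else (zj <= 0) for zj, yj in zip(z, y))


# Illustrative cluster: n_i = 3 observations, random intercept and slope (assumed values).
W = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
D = [[1.0, 0.2], [0.2, 0.5]]
Delta, Lam, Omega = em_matrices(W, D)

# Woodbury check: Lambda_i should equal (D^{-1} + W^T W)^{-1}.
WtW = matmul(transpose(W), W)
Lam_alt = inverse([[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(inverse(D), WtW)])
max_diff = max(abs(a - b) for r1, r2 in zip(Lam, Lam_alt) for a, b in zip(r1, r2))
print("max |Lambda - (D^-1 + W^T W)^-1| =", max_diff)
```

The explicit inverse is used here only for readability. With larger clusters, a linear solve or a Cholesky factorization of Ω i would be the more stable choice.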
Prates, M.O., Costa, D.R. & Lachos, V.H. Generalized linear mixed models for correlated binary data with t-link. Stat Comput 24, 1111–1123 (2014). https://doi.org/10.1007/s11222-013-9423-3
