
Tweedie Compound Poisson Models with Covariate-Dependent Random Effects for Multilevel Semicontinuous Data

Renjun Ma ¹,*, Md. Dedarul Islam ¹, M. Tariqul Hasan ¹ and Bent Jørgensen ²

¹ Department of Mathematics and Statistics, University of New Brunswick, Fredericton, NB E3B 5A3, Canada

² Department of Statistics, University of Southern Denmark, DK-5230 Odense, Denmark

* Author to whom correspondence should be addressed.

Entropy 2023, 25(6), 863; https://doi.org/10.3390/e25060863

Submission received: 3 April 2023 / Revised: 16 May 2023 / Accepted: 24 May 2023 / Published: 28 May 2023

(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data)


Abstract

Multilevel semicontinuous data occur frequently in medical, environmental, insurance and financial studies. Such data are often measured with covariates at different levels; however, they have traditionally been modelled with covariate-independent random effects. Ignoring the dependence between cluster-specific random effects and cluster-specific covariates in these traditional approaches may lead to the ecological fallacy and to misleading conclusions. In this paper, we propose a Tweedie compound Poisson model with covariate-dependent random effects to analyze multilevel semicontinuous data, in which covariates at different levels are incorporated at their relevant levels. The estimation of our models is based on the orthodox best linear unbiased predictors of the random effects. Explicit expressions for the random effects predictors facilitate computation and interpretation of our models. Our approach is illustrated through the analysis of the basic symptoms inventory study data, where 409 adolescents from 269 families were each observed between 1 and 17 times. The performance of the proposed methodology was also examined through simulation studies.

Keywords:

best linear unbiased predictors; clustered data; random effects; repeated data; two-part models; zero-inflated data

MSC:

62J05; 62J12; 62P10

1. Introduction

Random effects models have been widely used in the analysis of multilevel data in various fields of application. Random effects have traditionally been assumed to be covariate-independent in these models; however, such an assumption may lead to the ecological fallacy [1] in the interpretation of cluster or sub-cluster level covariates. The ecological fallacy occurs when misleading inference is made about covariate effects because group level covariates are associated directly with individual observations. To account for the correlation between cluster-specific random effects and cluster-specific covariates, models with covariate-dependent random effects have been introduced to handle clustered binary data [2], zero-inflated count data [3], discrete survival data [4] and survival data [5,6]. Modeling of multilevel semicontinuous data has recently become an area of active research [7,8,9]; however, the cluster-specific and sub-cluster-specific random effects have so far been assumed to be covariate-independent in the literature. Multilevel semicontinuous data with cluster-specific covariates are frequently observed in medical, environmental, economic and insurance studies. An example of three-level semicontinuous data with cluster or sub-cluster level covariates is the basic symptoms inventory (BSI) study data [10], where 409 adolescents from 269 independent HIV-infected parents were each observed between 1 and 17 times. The purpose of this study was to evaluate the effectiveness of an intervention. The intervention was designed based on social learning theory and focused primarily on skill building to reduce the risk from dangerous sexual practices and substance abuse [10]. At baseline, patients and adolescents were randomly assigned to the intervention (132 patients and 203 youth) or to standard care (137 patients and 206 youth). The effect of the intervention on adolescents was measured by the global severity index (GSI). The GSI is a composite index based on 53 psychiatric indices and is an indicator of abnormal mental behavior [11]. The GSI response is right-skewed with an excessive number of exact zeros. Furthermore, parental, child-specific and visit-specific covariates were observed for these 409 adolescents; therefore, accounting for the covariates at their relevant levels would be more appropriate.

In this paper, we introduce a Tweedie compound Poisson model with covariate-dependent cluster-specific and sub-cluster-specific random effects for three-level semicontinuous data. Unlike two-part models [8,9], our model characterizes the zero and positive components in an integral way instead of separately. In addition, the skewness can be flexibly modeled by the power index parameter of the Tweedie family of compound Poisson models. Furthermore, our approach consolidates conditional and marginal modeling. The estimation of our model is based on the orthodox best linear unbiased predictors (BLUP) of the random effects given the data, as an extension of [12]. The explicit expressions derived for the BLUP of the random effects in this paper facilitate computation and interpretation of our models. To the best of our knowledge, this is the first time that a compound Poisson model has been developed with covariate-dependent random effects. Tweedie compound Poisson models are often used in practice to characterize insurance claims, rainfall, pollutants and health data.

The rest of the paper is organized as follows. After introducing our compound Poisson mixed model with covariate-dependent random effects in Section 2, we discuss the prediction of random effects and model estimation in Section 3. Our approach is illustrated through the analysis of the basic symptoms inventory data in Section 4. The simulations and conclusions are presented in Section 5 and Section 6, respectively.

2. Tweedie Compound Poisson Models with Covariate-Dependent Random Effects

2.1. The Model

Zero-inflated continuous data are usually referred to as semicontinuous data; in this paper, we consider only nonnegative semicontinuous data. Let $Y_{ijk}$ represent the semicontinuous response recorded for the $k$th observation ($k = 1, 2, \dots, n_{ij}$) of the $j$th sub-cluster ($j = 1, 2, \dots, J_{i}$) of the $i$th independent cluster ($i = 1, 2, \dots, I$). The response vector can then be expressed as $Y = (Y_{111}, \dots, Y_{11n_{11}}, \dots, Y_{IJ_{I}1}, \dots, Y_{IJ_{I}n_{IJ_{I}}})^{'}$. We consider the cluster level random effect $U_{i}$ for the responses of the $i$th cluster and the sub-cluster level random effect $V_{ij}$ for the responses from the $j$th sub-cluster of the $i$th cluster. Let $W = (U^{'}, V^{'})^{'}$ denote the vector of random effects, where $U = (U_{1}, \dots, U_{i}, \dots, U_{I})^{'}$ and $V = (V_{1}^{'}, \dots, V_{i}^{'}, \dots, V_{I}^{'})^{'}$ with $V_{i} = (V_{i1}, \dots, V_{ij}, \dots, V_{iJ_{i}})^{'}$. Our model is based on the following three assumptions:

Assumption 1.

The cluster level random effects $U_{1}, \dots, U_{i}, \dots, U_{I}$ are independent and gamma distributed with mean $μ_{i}$ and dispersion parameter $σ^{2}$; that is, $E(U_{i}) = μ_{i}$ and $\operatorname{Var}(U_{i}) = σ^{2} μ_{i}^{2}$, where $μ_{i} = \exp(Z_{i}^{'} β_{(1)})$ and $Z_{i}$ is the cluster level covariate vector of the $i$th cluster.

Assumption 2.

Given the cluster level random effects $U = (U_{1}, \dots, U_{i}, \dots, U_{I})^{'}$, the sub-cluster level random effects $V_{11}, \dots, V_{1J_{1}}, \dots, V_{I1}, \dots, V_{IJ_{I}}$ are conditionally independent and gamma distributed with mean $μ_{ij} U_{i}$ and dispersion parameter $τ^{2} U_{i}^{-1}$, so that $\operatorname{Var}(V_{ij} \mid U) = τ^{2} μ_{ij}^{2} U_{i}$, where $μ_{ij} = \exp(Z_{ij}^{'} β_{(2)})$ and $Z_{ij}$ represents the sub-cluster level covariates.

Assumption 3.

Given the cluster and sub-cluster level random effects $W = (U^{'}, V^{'})^{'}$, the responses $Y_{ijk}$ follow a compound Poisson distribution, which can be expressed as

$$Y_{ijk} \mid W \sim c_{p}(y_{ijk}; ρ^{2}) \exp\left\{\frac{1}{ρ^{2}} \left(\frac{y_{ijk} μ_{ijk}^{1-p}}{1-p} - \frac{μ_{ijk}^{2-p}}{2-p}\right)\right\}, \quad 1 < p < 2, \tag{1}$$

where the explicit expression for $c_{p}(y_{ijk}; ρ^{2})$ is given by [13] but is immaterial to the derivations in the following sections. This Tweedie family is also known as the power-variance family, since we have $E(Y_{ijk} \mid W) = μ_{ijk} V_{ij}$ and $\operatorname{Var}(Y_{ijk} \mid W) = ρ^{2} V_{ij}^{1-p} μ_{ijk}^{p} V_{ij}^{p} = ρ^{2} μ_{ijk}^{p} V_{ij}$. In (1), $μ_{ijk} = \exp(Z_{ijk}^{'} β_{(3)})$, where $Z_{ijk}$ represents the observation level covariates.

The index parameter is restricted to the range $1 < p < 2$ for the semicontinuous response, since only this sub-family of Tweedie distributions has support for the response $Y_{ijk}$ on the left-closed interval $[0, +\infty)$. In addition, gamma distributed random effects have traditionally been used in multiplicative models for multilevel data.
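To make the role of the restriction $1 < p < 2$ concrete, the conditional distribution can be written in the standard compound Poisson-gamma form. The worked equation below is a sketch that assumes the conditional mean $μ_{ijk} V_{ij}$ and conditional variance $ρ^{2} μ_{ijk}^{p} V_{ij}$ stated in Assumption 3; the symbols $N$, $G_{n}$, $λ_{ijk}$, $α$ and $γ_{ijk}$ are introduced here for illustration only and do not appear elsewhere in the paper.

```latex
\begin{aligned}
Y_{ijk} \mid W &\overset{d}{=} \sum_{n=1}^{N} G_{n}, \qquad
N \sim \operatorname{Poisson}(\lambda_{ijk}), \qquad
G_{n} \overset{\text{iid}}{\sim} \operatorname{Gamma}(\text{shape } \alpha,\ \text{scale } \gamma_{ijk}), \\
\lambda_{ijk} &= \frac{\mu_{ijk}^{2-p} V_{ij}}{\rho^{2} (2-p)}, \qquad
\alpha = \frac{2-p}{p-1}, \qquad
\gamma_{ijk} = \rho^{2} (p-1) \mu_{ijk}^{p-1}, \\
E(Y_{ijk} \mid W) &= \lambda_{ijk} \alpha \gamma_{ijk} = \mu_{ijk} V_{ij}, \qquad
\operatorname{Var}(Y_{ijk} \mid W) = \lambda_{ijk} \alpha (\alpha + 1) \gamma_{ijk}^{2} = \rho^{2} \mu_{ijk}^{p} V_{ij}, \\
P(Y_{ijk} = 0 \mid W) &= \exp(-\lambda_{ijk}).
\end{aligned}
```

Under this representation the sub-cluster random effect $V_{ij}$ enters only through the Poisson rate, so a larger $V_{ij}$ lowers the probability of an exact zero while leaving the distribution of each positive jump unchanged.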

2.2. Moment Structure

This section presents the moment structure of the compound Poisson mixed model with covariate-dependent random effects introduced in the previous subsection. The moments can be obtained after some algebra by conditioning on the random effects; the unconditional mean and covariance are presented here to facilitate parameter estimation. As the $U_{i}$ independently follow gamma distributions with mean $μ_{i}$ and dispersion parameter $σ^{2}$, the moment structure of $U_{i}$ can be expressed as

$$E(U_{i}) = μ_{i} \quad \text{and} \quad \operatorname{Cov}[U_{s}, U_{i}] = δ(s, i) σ^{2} μ_{i}^{2},$$

where $δ(s, i) = 1$ if $s = i$ and 0 otherwise. Following Assumption 2, the conditional mean and variance of $V_{ij}$ given $U$ can be expressed as $E(V_{ij} \mid U) = μ_{ij} U_{i}$ and $\operatorname{Var}(V_{ij} \mid U) = τ^{2} μ_{ij}^{2} U_{i}$. This implies that the unconditional mean of $V_{ij}$ is

$$E(V_{ij}) = μ_{i} μ_{ij} \tag{2}$$

and that the covariance of $V_{st}$ and $V_{ij}$ is

$$\operatorname{Cov}[V_{st}, V_{ij}] = δ(s, i) \{σ^{2} μ_{i}^{2} μ_{ij} μ_{it} + δ(t, j) τ^{2} μ_{i} μ_{ij}^{2}\}. \tag{3}$$

After some extensive algebra, $\operatorname{Cov}[U_{s}, Y_{ijk}]$ and $\operatorname{Cov}[V_{st}, Y_{ijk}]$ can be expressed as

$$\operatorname{Cov}[U_{s}, Y_{ijk}] = δ(s, i) σ^{2} μ_{i}^{2} μ_{ij} μ_{ijk}$$

and

$$\operatorname{Cov}[V_{st}, Y_{ijk}] = δ(s, i) \{σ^{2} μ_{i}^{2} μ_{ij} μ_{it} + δ(t, j) τ^{2} μ_{i} μ_{ij}^{2}\} μ_{ijk},$$

respectively. Following Assumption 3, the conditional mean and variance of $Y_{ijk}$ given $W$ are $E[Y_{ijk} \mid W] = μ_{ijk} V_{ij}$ and $\operatorname{Var}[Y_{ijk} \mid W] = ρ^{2} μ_{ijk}^{p} V_{ij}$, respectively. Thus the unconditional mean of the response variable $Y_{ijk}$ can be calculated by conditioning on the vector of random effects $W$, which gives

$$E[Y_{ijk}] = μ_{i} μ_{ij} μ_{ijk} = \exp(X_{ijk}^{'} β), \tag{4}$$

where $X_{ijk}^{'} = (Z_{i}^{'}, Z_{ij}^{'}, Z_{ijk}^{'})$. Similarly, the unconditional covariance between $Y_{ijk}$ and $Y_{stl}$ can be simplified as

$$\operatorname{Cov}[Y_{stl}, Y_{ijk}] = δ(s, i) \{σ^{2} μ_{i}^{2} μ_{ij} μ_{it} μ_{ijk} μ_{itl} + δ(t, j) [τ^{2} μ_{i} μ_{ij}^{2} μ_{ijk} μ_{ijl} + δ(l, k) ρ^{2} μ_{i} μ_{ij} μ_{ijk}^{p}]\}. \tag{5}$$

The moment structure described above will be used to develop the estimating equations for the regression and random effect parameters in the next section.
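As a check on the diagonal terms of these covariances, the law of total variance can be applied level by level. The following worked step is a sketch that uses only the conditional moments stated above; it reproduces the case $s = i$, $t = j$ of the covariance of the sub-cluster random effects and the case $s = i$, $t = j$, $l = k$ of the covariance of the responses.

```latex
\begin{aligned}
\operatorname{Var}(V_{ij})
  &= E\{\operatorname{Var}(V_{ij} \mid U)\} + \operatorname{Var}\{E(V_{ij} \mid U)\}
   = \tau^{2} \mu_{ij}^{2} E(U_{i}) + \mu_{ij}^{2} \operatorname{Var}(U_{i}) \\
  &= \tau^{2} \mu_{i} \mu_{ij}^{2} + \sigma^{2} \mu_{i}^{2} \mu_{ij}^{2}, \\
\operatorname{Var}(Y_{ijk})
  &= E\{\operatorname{Var}(Y_{ijk} \mid W)\} + \operatorname{Var}\{E(Y_{ijk} \mid W)\}
   = \rho^{2} \mu_{ijk}^{p} E(V_{ij}) + \mu_{ijk}^{2} \operatorname{Var}(V_{ij}) \\
  &= \rho^{2} \mu_{i} \mu_{ij} \mu_{ijk}^{p}
   + \tau^{2} \mu_{i} \mu_{ij}^{2} \mu_{ijk}^{2}
   + \sigma^{2} \mu_{i}^{2} \mu_{ij}^{2} \mu_{ijk}^{2}.
\end{aligned}
```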

In the literature, marginal and conditional inferences refer to the transformed linearity in regression parameters under appropriate link function for marginal and conditional means, respectively [14,15,16]. Our approach consolidates both conditional and marginal modeling interpretations under one model since both the conditional mean and the marginal mean of our model are clearly linear in regression parameters under the log-link.

3. Estimation of Parameters

In this section, we discuss the prediction of random effects and estimation of regression and random effects parameters.

3.1. Best Linear Unbiased Predictors of Random Effects

As in [12], the orthodox BLUP of the cluster level random effect $U_{i}$ can be expressed in matrix form as

$$\hat{U}_{i} = E(U_{i}) + \operatorname{Cov}(U_{i}, Y_{i}) \operatorname{Var}(Y_{i})^{-1} (Y_{i} - E(Y_{i})), \tag{6}$$

where $Y_{i}$ denotes the vector of responses from the $i$th cluster. Furthermore, the explicit expression of the BLUP for the cluster level random effect $U_{i}$ is given by

$$\hat{U}_{i} = \frac{μ_{i} + σ^{2} μ_{i} \sum_{j=1}^{J_{i}} \sum_{k=1}^{n_{ij}} w_{ij} μ_{ijk}^{1-p} Y_{ijk}}{1 + σ^{2} μ_{i} \sum_{j=1}^{J_{i}} \sum_{k=1}^{n_{ij}} w_{ij} μ_{ij} μ_{ijk}^{2-p}}, \tag{7}$$

where $w_{ij} = 1 / (ρ^{2} + τ^{2} μ_{ij} \sum_{k=1}^{n_{ij}} μ_{ijk}^{2-p})$. Similarly, the BLUP of the sub-cluster level random effect $V_{ij}$ can be calculated as

$$\hat{V}_{ij} = E(V_{ij}) + \operatorname{Cov}(V_{ij}, Y_{i}) \operatorname{Var}(Y_{i})^{-1} (Y_{i} - E(Y_{i})), \tag{8}$$

which can be simplified as

$$\begin{aligned} \hat{V}_{ij} &= μ_{ij} \hat{U}_{i} + τ^{2} μ_{ij} w_{ij} \sum_{k=1}^{n_{ij}} μ_{ijk}^{1-p} (Y_{ijk} - μ_{ij} μ_{ijk} \hat{U}_{i}) \\ &= ρ^{2} w_{ij} μ_{ij} \hat{U}_{i} + τ^{2} μ_{ij} w_{ij} \sum_{k=1}^{n_{ij}} μ_{ijk}^{1-p} Y_{ijk}. \end{aligned} \tag{9}$$

The BLUP predictors for the cluster and sub-cluster random effects given in Equations (7) and (9) are positive and are explicit linear combinations of the semicontinuous responses. These BLUP predictors provide the best prediction of the random effects within the class of all linear functions of the observed responses.
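The agreement between the matrix forms and the explicit forms of the predictors can be illustrated numerically. The Python sketch below is not the authors' implementation: the function names, argument layout and numerical values are assumptions chosen for illustration, and all means and dispersion parameters are treated as known. It evaluates the explicit expressions (7) and (9) for a single cluster and compares them with the matrix forms (6) and (8), with the covariances built from the moment structure of Section 2.2.

```python
import numpy as np

def blup_closed_form(mu_i, mu_ij, mu_ijk, y, sigma2, tau2, rho2, p):
    """Explicit predictors (7) and (9) for one cluster.

    mu_ij is a sequence of sub-cluster means; mu_ijk and y are lists holding
    one 1-D array per sub-cluster (hypothetical layout for this sketch).
    """
    num, den = mu_i, 1.0
    w, s = [], []
    for b, c, yj in zip(mu_ij, mu_ijk, y):
        w_j = 1.0 / (rho2 + tau2 * b * np.sum(c ** (2 - p)))
        s_j = np.sum(c ** (1 - p) * yj)
        num += sigma2 * mu_i * w_j * s_j
        den += sigma2 * mu_i * w_j * b * np.sum(c ** (2 - p))
        w.append(w_j)
        s.append(s_j)
    u_hat = num / den
    v_hat = np.array([rho2 * w_j * b * u_hat + tau2 * b * w_j * s_j
                      for b, w_j, s_j in zip(mu_ij, w, s)])
    return u_hat, v_hat

def blup_matrix_form(mu_i, mu_ij, mu_ijk, y, sigma2, tau2, rho2, p):
    """Matrix-form predictors (6) and (8) using the marginal covariance (5)."""
    g = np.concatenate([np.full(len(c), j) for j, c in enumerate(mu_ijk)])
    b = np.asarray(mu_ij, dtype=float)[g]        # mu_ij repeated per observation
    c = np.concatenate(mu_ijk)                   # mu_ijk
    m = b * c
    same = g[:, None] == g[None, :]              # delta(t, j)
    cov_y = (sigma2 * mu_i**2 + same * tau2 * mu_i) * np.outer(m, m)
    cov_y += np.diag(rho2 * mu_i * b * c**p)
    resid = np.linalg.solve(cov_y, np.concatenate(y) - mu_i * m)
    cov_uy = sigma2 * mu_i**2 * m                # Cov(U_i, Y_i)
    u_hat = mu_i + cov_uy @ resid
    v_hat = [mu_i * bj + (bj * cov_uy + (g == j) * tau2 * mu_i * bj * m) @ resid
             for j, bj in enumerate(mu_ij)]
    return u_hat, np.array(v_hat)

# Hypothetical cluster with two sub-clusters (three and two observations).
mu_ijk = [np.array([0.8, 1.2, 1.0]), np.array([1.5, 0.6])]
y = [np.array([0.0, 2.3, 0.7]), np.array([0.0, 1.1])]
args = (1.3, [0.9, 1.4], mu_ijk, y, 0.4, 0.6, 0.5, 1.55)
print(blup_closed_form(*args))
print(blup_matrix_form(*args))
```

The explicit form needs only sums over the observations of each sub-cluster, whereas the matrix form requires solving a linear system whose size is the total number of observations in the cluster.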

3.2. Estimation of Regression Parameters

In this section, we first consider the estimation of the regression parameters in the case of known random effect parameters. Following [12], we differentiate the partially observed ‘joint’ log-likelihood of the Tweedie mixed model for the data and random effects with respect to $β = (β_{(1)}, β_{(2)}, β_{(3)})^{'}$ to obtain the partially observed ‘joint’ score function. We then replace the random effects with their BLUP predictors to obtain an unbiased estimating function for the regression parameters $β$. Specifically, the joint score function for the cluster level regression parameters is

$$\frac{\partial ℓ}{\partial β_{(1)}} = \sum_{i=1}^{I} Z_{i} \frac{μ_{i}^{-1}}{σ^{2}} (U_{i} - μ_{i}),$$

which, after substituting the BLUP predictors of the cluster level random effects, becomes

$$ψ^{(1)}(β) = \sum_{i=1}^{I} Z_{i} \frac{μ_{i}^{-1}(β)}{σ^{2}} [\hat{U}_{i}(β) - μ_{i}(β)] = \sum_{i=1}^{I} ψ_{i}^{(1)}(β), \tag{10}$$

where the $i$th component corresponds to the $i$th independent cluster. Similarly, the sub-cluster level and observation level score functions can be simplified as

$$ψ^{(2)}(β) = \sum_{i=1}^{I} \sum_{j=1}^{J_{i}} Z_{ij} \frac{μ_{ij}^{-1}(β)}{τ^{2}} [\hat{V}_{ij}(β) - \hat{U}_{i}(β) μ_{ij}(β)] = \sum_{i=1}^{I} ψ_{i}^{(2)}(β), \tag{11}$$

and

$$ψ^{(3)}(β) = \sum_{i=1}^{I} \sum_{j=1}^{J_{i}} \sum_{k=1}^{n_{ij}} Z_{ijk} \frac{μ_{ijk}^{1-p}(β)}{ρ^{2}} [Y_{ijk} - \hat{V}_{ij}(β) μ_{ijk}(β)] = \sum_{i=1}^{I} ψ_{i}^{(3)}(β). \tag{12}$$

Without loss of generality, we assume that the matrix of covariates $X = (Z_{i}^{'}, Z_{ij}^{'}, Z_{ijk}^{'})^{'}$ is of full rank. Let the global estimating function be defined by

$$ψ(β) = (ψ^{(1)}(β)^{'}, ψ^{(2)}(β)^{'}, ψ^{(3)}(β)^{'})^{'}. \tag{13}$$

As in [12,17], a global matrix expression can be obtained for this $ψ(β)$. Furthermore, a simple relationship can be obtained among the sensitivity matrix $S(β) = E_{β}\{\partial ψ(β) / \partial β^{⊤}\}$, the variability matrix $V(β) = E_{β}\{ψ(β) ψ^{⊤}(β)\}$ and the Godambe information matrix $J(β) = S(β) V(β)^{-1} S(β)^{⊤}$. Since the joint density for $(Y, U)$ forms an exponential family in the regression parameter $β$, the proof in [17] applies to $ψ(β)$ as long as the score functions are linear in both $U$ and $Y$. This linearity is clear, since the score functions are in fact the right-hand sides of (10)–(12) with $\hat{U}(β)$ replaced by $U$. A component-wise proof can also be found in [18]. Thus we have the following results:

Theorem 1.

For Tweedie compound Poisson models with covariate-dependent random effects, the predicted global score function $ψ(β)$, sensitivity matrix $S(β)$ and variability matrix $V(β)$ can be expressed as follows:

1. $ψ(β) = X^{⊤} \operatorname{diag}(E(Y)) \operatorname{Var}^{-1}(Y) (Y - E(Y)),$
2. $V(β) = \operatorname{Var}(ψ(β)) = X^{⊤} \operatorname{diag}(E(Y)) \operatorname{Var}^{-1}(Y) \operatorname{diag}(E(Y)) X,$
3. $J(β) = -S(β) = V(β).$

Following [12], under mild regularity conditions, the solution of the estimating equation $ψ(β) = 0$ is consistent and asymptotically normal with asymptotic mean $β$ and asymptotic variance given by the inverse of the negative sensitivity matrix, $-S^{-1}(β)$, where $S(β) = E_{β}\{\partial ψ(β) / \partial β^{⊤}\}$, as the clusters are assumed to be independent. In addition, the estimating function $ψ(β)$ is optimal in the sense that it attains the minimum asymptotic covariance for the estimator of $β$ within the class of all linear functions of $Y$ [12]. As in [12], the estimating equation $ψ(β) = 0$ can be solved iteratively using the scoring algorithm, where the value of $β$ is updated following

$$β^{*} = β - S^{-1}(β) ψ(β).$$

As the asymptotic variance matrix of the regression parameter estimator $\hat{β}$ is $-S^{-1}(β)$, the estimated standard errors of the components of $\hat{β}$ are obtained as the square roots of the corresponding diagonal elements of this matrix.
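The scoring update can be sketched in code by combining Theorem 1 with the marginal moments (4) and (5), since $-S(β) = V(β)$ implies $β^{*} = β + V^{-1}(β) ψ(β)$. The Python sketch below is illustrative only and is not the authors' implementation: the dictionary keys `Z1`, `Z2`, `Z3`, `sub` and `y`, the function names, the toy covariates and the numerical values are all assumptions made for this example, and the dispersion parameters and $p$ are held fixed.

```python
import numpy as np

def cluster_moments(beta, cl, sizes, disp, p):
    """Marginal mean (4), covariance (5) and design matrix of one cluster."""
    q1, q2, _ = sizes
    sigma2, tau2, rho2 = disp
    g = cl["sub"]                                   # sub-cluster index of each observation
    mu_i = np.exp(cl["Z1"] @ beta[:q1])             # cluster level mean (scalar)
    b = np.exp(cl["Z2"] @ beta[q1:q1 + q2])[g]      # mu_ij, repeated per observation
    c = np.exp(cl["Z3"] @ beta[q1 + q2:])           # mu_ijk
    m = b * c
    same = g[:, None] == g[None, :]                 # delta(t, j)
    cov = (sigma2 * mu_i**2 + same * tau2 * mu_i) * np.outer(m, m)
    cov += np.diag(rho2 * mu_i * b * c**p)
    X = np.hstack([np.tile(cl["Z1"], (g.size, 1)), cl["Z2"][g], cl["Z3"]])
    return mu_i * m, cov, X

def scoring_step(beta, clusters, sizes, disp, p):
    """One update beta + V^{-1} psi, summing Theorem 1 over independent clusters."""
    psi = np.zeros_like(beta)
    V = np.zeros((beta.size, beta.size))
    for cl in clusters:
        mean, cov, X = cluster_moments(beta, cl, sizes, disp, p)
        D = mean[:, None] * X                       # diag(E(Y)) X
        A = np.linalg.solve(cov, np.column_stack([cl["y"] - mean, D]))
        psi += D.T @ A[:, 0]
        V += D.T @ A[:, 1:]
    return beta + np.linalg.solve(V, psi), psi

# Hypothetical toy data: 6 clusters, 2 sub-clusters each, 3 observations per sub-cluster.
rng = np.random.default_rng(0)
sizes, disp, p = (1, 1, 2), (0.4, 0.6, 0.5), 1.55
beta_true = np.array([0.2, -0.1, 0.3, 0.5])
clusters = []
for i in range(6):
    sub = np.repeat(np.arange(2), 3)
    cl = {"Z1": np.array([float(i % 2)]), "Z2": rng.normal(size=(2, 1)),
          "Z3": np.column_stack([np.ones(6), rng.normal(size=6)]), "sub": sub}
    cl["y"] = cluster_moments(beta_true, cl, sizes, disp, p)[0]   # noise-free responses
    clusters.append(cl)

beta = beta_true + 0.05                             # perturbed starting value
for _ in range(10):
    beta, psi = scoring_step(beta, clusters, sizes, disp, p)
print(beta, np.abs(psi).max())
```

The responses in this toy example are set equal to their marginal means, so the estimating equation is solved exactly at `beta_true`; with real data the same update would be alternated with the prediction of random effects and the re-estimation of the dispersion parameters.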

3.3. Estimation of Random Effects Parameters

The dispersion parameters are now assumed to be unknown and are estimated using adjusted Pearson estimators [12], in which the Pearson estimator is corrected for the bias that arises from replacing the unobserved random effects with their predictors. We may thus estimate $σ^{2}$ by the following adjusted Pearson estimator:

$$\hat{σ}^{2} = \frac{1}{I} \sum_{i=1}^{I} \frac{(\hat{U}_{i} - μ_{i})^{2}}{μ_{i}^{2}} + \frac{1}{I} \sum_{i=1}^{I} \frac{d(i)}{μ_{i}^{2}}, \tag{14}$$

where $d(i)$ has the explicit expression

$$d(i) = E(\hat{U}_{i} - U_{i})^{2} = \frac{σ^{2} μ_{i}^{2}}{1 + σ^{2} μ_{i} \sum_{j=1}^{J_{i}} \sum_{k=1}^{n_{ij}} w_{ij} μ_{ij} μ_{ijk}^{2-p}}.$$

The estimator of $τ^{2}$ is

$$\hat{τ}^{2} = \frac{1}{I} \sum_{i=1}^{I} \frac{1}{J_{i}} \sum_{j=1}^{J_{i}} \frac{(\hat{V}_{ij} - μ_{ij} \hat{U}_{i})^{2}}{μ_{i} μ_{ij}^{2}} + \frac{1}{I} \sum_{i=1}^{I} \frac{1}{J_{i}} \sum_{j=1}^{J_{i}} \frac{d(i) μ_{ij}^{2} + d(ij) - 2 ρ^{2} d(i) w_{ij} μ_{ij}^{2}}{μ_{i} μ_{ij}^{2}}, \tag{15}$$

where $d(ij) = E(\hat{V}_{ij} - V_{ij})^{2} = ρ^{2} w_{ij} (τ^{2} μ_{i} μ_{ij}^{2} + ρ^{2} d(i) w_{ij} μ_{ij}^{2})$. The estimator of $ρ^{2}$ is

$$\hat{ρ}^{2} = \frac{1}{I} \sum_{i=1}^{I} \frac{1}{J_{i}} \sum_{j=1}^{J_{i}} \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} \frac{(Y_{ijk} - \hat{V}_{ij} μ_{ijk})^{2}}{μ_{i} μ_{ij} μ_{ijk}^{p}} + \frac{1}{I} \sum_{i=1}^{I} \frac{1}{J_{i}} \sum_{j=1}^{J_{i}} \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} \frac{d(ij) μ_{ijk}^{2-p}}{μ_{i} μ_{ij}}. \tag{16}$$

As the clusters are assumed to be independent, standard estimating function theory can be applied to demonstrate the consistency and asymptotic normality of the estimators of both the regression and random effects parameters of our models as the number of clusters tends to infinity. A proof is given in Chapter 5 of [18]. Our estimation algorithm iterates among updating the regression parameters through the scoring method, predicting the random effects via the BLUP predictors (7) and (9), and updating the dispersion parameters through the moment estimators given in (14)–(16).
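The bias adjustment in the estimator of $σ^{2}$ can be sketched as follows. This Python fragment is a minimal illustration, not the authors' code: the function names `d_i` and `sigma2_update` and the input layout are hypothetical, and the means, predictors and current dispersion values are assumed to be available from the preceding steps of the iteration.

```python
import numpy as np

def d_i(mu_i, mu_ij, mu_ijk, sigma2, tau2, rho2, p):
    """Prediction variance d(i) = E(U_hat_i - U_i)^2 for one cluster."""
    info = 0.0
    for b, c in zip(mu_ij, mu_ijk):
        s = np.sum(np.asarray(c, dtype=float) ** (2 - p))   # sum_k mu_ijk^(2-p)
        w = 1.0 / (rho2 + tau2 * b * s)                      # weight w_ij
        info += w * b * s
    return sigma2 * mu_i**2 / (1.0 + sigma2 * mu_i * info)

def sigma2_update(u_hat, mu, d):
    """Adjusted Pearson estimator (14): Pearson term plus bias correction."""
    u_hat, mu, d = (np.asarray(a, dtype=float) for a in (u_hat, mu, d))
    return np.mean((u_hat - mu) ** 2 / mu**2) + np.mean(d / mu**2)

# Hypothetical inputs for two clusters (illustrative values only).
mu = [1.0, 1.2]
d = [d_i(1.0, [0.9], [np.array([1.0, 1.1])], 0.4, 0.6, 0.5, 1.55),
     d_i(1.2, [1.1, 0.8], [np.array([0.7]), np.array([1.3, 0.9])], 0.4, 0.6, 0.5, 1.55)]
print(sigma2_update([1.3, 0.9], mu, d))
```

Because $d(i)$ itself depends on the current dispersion values, this update would be repeated within the iteration until the estimates stabilize.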

4. Analysis of Basic Symptoms Inventory Study Data

In this section, we apply our proposed covariate dependent semicontinuous model to the BSI study data.

4.1. Model Specification

In the BSI study, 409 adolescents from 269 independent HIV-infected parents were followed over time between 1995 and 2002. Each adolescent was observed between 1 and 17 times. The data therefore have a hierarchical structure, with observations nested within adolescents and adolescents nested within parents. The response of interest is the global severity index (GSI) score, with covariates at the observation, adolescent and parent levels. A complete list of these variables with descriptions is given in Table 1.

The BSI data were analyzed in [19] as longitudinal normal data ignoring parental level; however, there is a large proportion of exact zeros in the GSI response as shown in Figure 1; therefore, a three level model for semicontinuous data is more appropriate.

Let $Y_{ijk}$ represent the GSI for observation $k$ from adolescent $j$ of the $i$th parent. The random effects for parent $i$ and for adolescent $j$ of parent $i$ are denoted by $U_{i}$ and $V_{ij}$, respectively. The three-level Tweedie model with covariate-dependent random effects (TMCDRE) is then specified as follows:

$$Y_{ijk} \mid V = v \sim Tw_{p}(μ_{ijk} v_{ij}, ρ^{2} v_{ij}^{1-p}), \tag{17}$$

with $μ_{ijk} = \exp(z_{ijk}^{⊤} β_{(3)})$. In (17), $z_{ijk}$ represents the observation level covariates and $V_{ij}$ given $U = u$ follows

$$V_{ij} \mid U = u \sim Gamma(μ_{ij} u_{i}, τ^{2} u_{i}^{-1}), \tag{18}$$

where $μ_{ij} = \exp(z_{ij}^{⊤} β_{(2)})$ and $z_{ij}$ represents the sub-cluster level covariates. In (18), the cluster level random effects $U_{i}$ are assumed to be independent with

$$U_{i} \sim Gamma(μ_{i}, σ^{2}), \tag{19}$$

where $μ_{i} = \exp(z_{i}^{⊤} β_{(1)})$ and $z_{i}$ represents the cluster level covariates.

4.2. Analysis Results

First, we obtained the maximum likelihood estimate $\hat{p}$ of the index parameter from Tweedie’s compound Poisson model, using the R package “tweedie” Version 2.07 (Dunn, 2010), in the presence of the covariates described above. The estimated index parameter was $\hat{p} = 1.55$. With $p = 1.55$, we applied our approach to estimate the regression and random effect parameters based on the model specified above. The estimates of the regression and random effect parameters are presented in Table 2. To compare our proposed model with Tweedie compound Poisson models with covariate-independent random effects, Table 2 also presents the analysis results based on the conventional Tweedie mixed model (CTMM).

First, our results in Table 2 show that the effects of the observation level covariates on the GSI score did not change materially in terms of direction, significance and magnitude. Second, all covariates at the adolescent level had positive effects on the GSI score, with the race factor clearly insignificant and the gender factor highly significant. Interestingly, the effect of baseline age on the GSI score changed from significant to insignificant at the significance level of 0.05 when all covariates were considered at their relevant levels. Third, the effects of the parent level covariates on the GSI score again did not change materially in terms of direction, significance and magnitude. Our results therefore demonstrate that analysis results may change materially when all covariates are associated at their relevant levels, although we did not observe any dramatic change in this analysis. This might partly be explained by the nature of the covariates involved in this study, since no covariates on health status were available. Another interesting phenomenon is that the variation at the adolescent level was clearly shifted to the parent level when all covariates were associated at their relevant levels. A finding of practical importance is that the intervention was not effective, whether or not covariates were associated at their relevant levels in the analysis.

5. Simulation Studies

In this section, we compare the performance of the proposed model (TMCDRE) with that of the conventional Tweedie mixed model (CTMM) through simulation studies, in order to assess the effect of associating all covariates at their relevant levels. To replicate data under realistic conditions, we simulated data based on the analysis results given in Table 2, with the respective arrangements of covariates under the TMCDRE and CTMM. The estimated regression parameters ($\hat{β}$) and random effect parameters ($σ^{2}$, $τ^{2}$ and $ρ^{2}$) under the TMCDRE and CTMM in Table 2 were taken as the true values for the respective models. The simulation studies under the TMCDRE and CTMM were carried out separately and are reported in Section 5.1 and Section 5.2, respectively.

5.1. Simulating Data from the Tweedie Compound Poisson Model with Covariate-Dependent Random Effects

In this section, we first generated clustered semicontinuous data from the TMCDRE model in Table 2. We then analysed the simulated data based on the TMCDRE and CTMM models. Our objective here was to compare the performance of TMCDRE and CTMM when the true data were generated from TMCDRE. The data generation procedure is given below.

1. We first generate 269 samples ($u_{1}, \dots, u_{269}$) using $Gamma(μ_{i}, σ^{2})$, where

$μ_{i} = \exp(β_{7} \text{Treatment} + β_{8} \text{Parent gender} + β_{9} \text{Parent age}).$

2. In the second step, we generate $n_{i}$ samples ($v_{i1}, \dots, v_{in_{i}}$, where $n_{i}$ varies from 1 to 5) for each $u_{i}$ using $Gamma(μ_{ij} u_{i}, τ^{2} u_{i}^{-1})$, where

$μ_{ij} = \exp(β_{4} \text{Base age} + β_{5} \text{Hispanic} + β_{6} \text{Gender}).$

3. Finally, we generate $n_{ij}$ samples ($Y_{ij1}, \dots, Y_{ijn_{ij}}$, where $n_{ij}$ varies from 1 to 17) for each $v_{ij}$ using $v_{ij} Tw_{p}(μ_{ijk}, ρ^{2} / v_{ij})$, where

$μ_{ijk} = \exp(β_{0} + β_{1} \text{True month} + β_{2} \text{Spring} + β_{3} \text{Summer}).$
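The three generation steps above can be sketched in Python with NumPy. This is a minimal illustration rather than the code used in the study: the function names are hypothetical, the means passed in the example are made-up placeholders rather than values from Table 2, and the Tweedie draw relies on the standard compound Poisson-gamma representation for $1 < p < 2$. Here $Gamma(m, δ)$ is read as a gamma distribution with mean $m$ and variance $δ m^{2}$, in line with the moment structure of Section 2.2.

```python
import numpy as np

def rgamma_mean_disp(rng, mean, disp):
    """Gamma draw with E = mean and Var = disp * mean**2."""
    return rng.gamma(shape=1.0 / disp, scale=mean * disp)

def rtweedie(rng, mu, phi, p):
    """Compound Poisson-gamma draw with E = mu and Var = phi * mu**p, 1 < p < 2."""
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    phi = np.broadcast_to(np.asarray(phi, dtype=float), mu.shape)
    lam = mu ** (2 - p) / (phi * (2 - p))        # Poisson rate
    alpha = (2 - p) / (p - 1)                    # gamma shape of each jump
    scale = phi * (p - 1) * mu ** (p - 1)        # gamma scale of each jump
    n = rng.poisson(lam)
    y = np.zeros(mu.shape)
    pos = n > 0                                  # exact zeros where no jump occurs
    y[pos] = rng.gamma(shape=n[pos] * alpha, scale=scale[pos])
    return y

def simulate_cluster(rng, mu_i, mu_ij, mu_ijk, sigma2, tau2, rho2, p):
    """One cluster: u_i, then v_ij given u_i, then Y_ijk = v_ij * Tw_p(mu_ijk, rho2 / v_ij)."""
    u = rgamma_mean_disp(rng, mu_i, sigma2)                  # step 1
    ys = []
    for b, c in zip(mu_ij, mu_ijk):
        v = rgamma_mean_disp(rng, b * u, tau2 / u)           # step 2
        ys.append(v * rtweedie(rng, c, rho2 / v, p))         # step 3
    return u, ys

# Example with placeholder means and the dispersion values quoted in Section 5.2.
rng = np.random.default_rng(2023)
u, ys = simulate_cluster(rng, 1.0, [1.1, 0.9],
                         [np.full(4, 0.35), np.full(2, 0.30)],
                         sigma2=0.10, tau2=0.59, rho2=0.53, p=1.55)
print(u, ys)
```

Setting `mu_i` and every entry of `mu_ij` to 1 gives the covariate-independent random effects used when data are generated from the conventional model.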

We carried out 400 simulation runs using the procedure discussed above. The estimated bias, simulated standard errors (Sim SE) and estimated standard errors (Est SE) of the regression and random effect parameters using the TMCDRE and CTMM techniques are presented in Table 3. Parameter estimates of cluster level covariates are less biased for TMCDRE than for CTMM. The parameter estimate of the sub-cluster level continuous covariate Adolescent base age is less biased for TMCDRE than for CTMM. TMCDRE produced less biased estimates than CTMM for the observation level binary covariates Spring and Summer. All dispersion parameters are overestimated by TMCDRE, whereas $σ^{2}$ and $τ^{2}$ are overestimated and $ρ^{2}$ is underestimated by CTMM. Dispersion parameters are more biased for CTMM than for TMCDRE.

Table 3 also shows the Sim SE and Est SE. The differences between Sim SE and Est SE for the parameter estimates of the observation level covariates (True month, Spring and Summer) are smaller for TMCDRE than for CTMM. The differences between Sim SE and Est SE for the regression parameter estimates of all sub-cluster level covariates (Adolescent base age, Hispanic, Gender) and cluster level covariates (Treatment and Parent base age) are also smaller for TMCDRE than for CTMM. The only exception is the covariate Gender.Pr. As expected, TMCDRE performed better than CTMM when data were simulated from the TMCDRE model.

5.2. Simulating Data from the Tweedie Compound Poisson Model with Covariate-Independent Random Effects

In this section, we first generate clustered semicontinuous data from the conventional Tweedie mixed model (CTMM). We then analyze these simulated data using the TMCDRE and CTMM methods. Our objective is to compare the performance of TMCDRE and CTMM when the true data were generated from CTMM. The data generation procedure is given below.

1. We generate 269 samples ($u_{1}, \dots, u_{269}$) using $Gamma(1, σ^{2})$.
2. In the second step, we generate $n_{i}$ samples ($v_{i1}, \dots, v_{in_{i}}$, where $n_{i}$ varies from 1 to 5) for each $u_{i}$ using $Gamma(u_{i}, τ^{2} u_{i}^{-1})$.
3. Finally, we generate $n_{ij}$ samples ($Y_{ij1}, \dots, Y_{ijn_{ij}}$, where $n_{ij}$ varies from 1 to 17) for each $v_{ij}$ using $v_{ij} Tw_{p}(μ_{ijk}, ρ^{2} / v_{ij})$, where

$μ_{ijk} = \exp(β^{⊤} X).$

The fixed effect and random effect parameter estimates in Table 2 obtained with the CTMM technique ($\hat{β} = (-1.15, -0.047, 0.086, -0.020, 0.044, 0.122, 0.401, 0.0149, 0.230, -0.019)$, $σ^{2} = 0.10$, $τ^{2} = 0.59$ and $ρ^{2} = 0.53$) are used as the true parameter values. As in Section 5.1, we carried out 400 simulation runs using the procedure discussed above. The estimated bias, simulated standard errors (Sim SE) and estimated standard errors (Est SE) of the regression and random effect parameters using the TMCDRE and CTMM techniques are presented in Table 4.

Our simulation results indicate that, although the data are generated from CTMM, the TMCDRE parameter estimate for the observation level continuous covariate True month is less biased than the corresponding estimate from CTMM. The TMCDRE parameter estimate for the cluster level binary covariate Parent gender is also less biased than the corresponding estimate from CTMM. CTMM gives less biased estimates than TMCDRE for all other covariates. The dispersion parameters $σ^{2}$ and $ρ^{2}$ are overestimated by both the conventional and the covariate-dependent approach, whereas $τ^{2}$ is underestimated by both CTMM and TMCDRE. Dispersion parameters are more biased for TMCDRE than for CTMM.

Table 4 also indicates that the difference between Sim SE and Est SE for the parameter estimate of the observation level binary covariate Spring is smaller for TMCDRE than for CTMM. One important feature of TMCDRE is that covariates are separated into their corresponding levels. The simulation study shows that the differences between Sim SE and Est SE of the parameter estimates corresponding to the sub-cluster level binary covariates (Hispanic, Gender) are smaller for TMCDRE than for CTMM. The differences between Sim SE and Est SE for all other covariates are smaller for CTMM than for TMCDRE. In other words, when data were generated from the CTMM, CTMM performed better than TMCDRE in general; however, TMCDRE still produced better estimates than CTMM for some covariates.

6. Conclusions

In this paper, we have introduced a three-level Tweedie compound Poisson model with covariate-dependent random effects for semicontinuous data. As random effects at different levels are likely to be associated with the covariates at relevant levels, this new model enables us to account for such association in our analysis. Our analysis of BSI data demonstrate that analysis results may change materially when all covariates were associated at their relevant levels. Our simulation study has shown that Tweedie compound Poisson model with covariate-dependent random effects performs better than Tweedie compound Poisson model with covariate-independent random effects when random effects at different levels are associated with the covariates at relevant levels. Although we also compared performances between models with covariate-dependent and covariate-independent random effects when random effects at different levels are not associated with the covariates at relevant levels, such no association situation is far less likely in practice. Therefore, Tweedie compound Poisson model with covariate-dependent random effects is more appropriate for analysis of multilevel semicontinuous data with covariates at different levels.

The Tweedie compound Poisson model with covariate-dependent random effects is far more complex than the Tweedie compound Poisson model with covariate-independent random effects; however, we still obtained explicit expressions for the random effects predictors at different levels. These explicit expressions may facilitate interpretation of clustering effects at the cluster and sub-cluster levels.

Models with covariate-dependent random effects have been introduced to handle multilevel binary, zero-inflated and survival data in the literature. Our model for multilevel semicontinuous data has enriched this class of models. Furthermore, Tweedie models with covariate-dependent random effects have also been developed in [18,20] for multilevel continuous and count data.

Author Contributions

Conceptualization, B.J. and R.M.; methodology, R.M.; software, M.D.I. and M.T.H.; formal analysis, M.D.I., R.M. and M.T.H.; data curation, M.D.I., R.M. and M.T.H.; writing—original draft preparation, M.T.H., R.M. and M.D.I.; writing—review and editing, R.M. and M.T.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Sciences and Engineering Research Council of Canada (RGPIN-2020-04751 and RGPIN-2017-04246).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available from https://robweiss.faculty.biostat.ucla.edu/book-data-sets (accessed on 14 March 2023).

Acknowledgments

This paper is dedicated to the memory of Bent Jørgensen. The authors thank the editor and four anonymous referees for their helpful comments that greatly improved the presentation of these results.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Freedman, D.A. Ecological inference and the ecological fallacy. Int. Encycl. Soc. Behav. Sci. 1999, 6, 4027–4030.
2. Wang, Z.; Louis, T.A. Marginalized binary mixed-effects models with covariate-dependent random effects and likelihood inference. Biometrics 2004, 60, 884–891.
3. Wong, K.Y.; Lam, K.F. Modeling zero-inflated count data using a covariate-dependent random effect model. Stat. Med. 2013, 32, 1283–1293.
4. Scheike, T.H.; Ekstrøm, C.T. A discrete survival model with random effects and covariate-dependent selection. Appl. Stoch. Model. Data Anal. 1998, 14, 153–163.
5. Liu, D.; Kalbfleisch, J.D.; Schaubel, D.E. A positive stable frailty model for clustered failure time data with covariate-dependent frailty. Biometrics 2011, 67, 8–17.
6. Zhou, H.; Hanson, T.; Jara, A.; Zhang, J. Modelling county level breast cancer survival data using a covariate-adjusted frailty proportional hazards model. Ann. Appl. Stat. 2015, 9, 43.
7. Zhang, Y. Likelihood-based and Bayesian methods for Tweedie compound Poisson linear mixed models. Stat. Comput. 2013, 23, 743–757.
8. Su, L.; Tom, B.D.; Farewell, V.T. A likelihood-based two-part marginal model for longitudinal semicontinuous data. Stat. Methods Med. Res. 2015, 24, 194–205.
9. Liu, L.; Strawderman, R.L.; Johnson, B.A.; O’Quigley, J.M. Analyzing repeated measures semi-continuous data, with application to an alcohol dependence study. Stat. Methods Med. Res. 2016, 25, 133–152.
10. Bursch, B.; Lester, P.; Jiang, L.; Rotheram-Borus, M.J.; Weiss, R. Psychosocial predictors of somatic symptoms in adolescents of parents with HIV: A six-year longitudinal study. AIDS Care 2008, 20, 667–676.
11. McGillicuddy, N.B.; Rychtarik, R.G.; Morsheimer, E.T.; Burke-Storer, M.R. Agreement Between Parent and Adolescent Reports of Adolescent Substance Use. J. Child Adolesc. Subst. Abuse 2007, 16, 59–78.
12. Ma, R.; Jørgensen, B. Nested generalized linear mixed models: An orthodox best linear unbiased predictor approach. J. R. Stat. Soc. Ser. B Stat. Methodol. 2007, 69, 625–641.
13. Duan, X.; Ma, R.; Zhang, X. Forecasting occurrence and quantity of monthly precipitation simultaneously while accounting for complex serial correlation. Int. J. Climatol. 2022, 42, 9494–9509.
14. Lee, Y.; Nelder, J.A. Hierarchical generalized linear models (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 1996, 58, 619–678.
15. Lee, Y.; Nelder, J.A. Conditional and marginal models: Another view. Stat. Sci. 2004, 19, 219–238.
16. Ma, R.; Yan, G.; Hasan, M.T. Tweedie family of generalized linear models with distribution-free random effects for skewed longitudinal data. Stat. Med. 2018, 37, 3519–3532.
17. Ma, R.; Krewski, D.; Burnett, R.T. Random effects Cox models: A Poisson modelling approach. Biometrika 2003, 90, 157–169.
18. Ma, R. An Orthodox BLUP Approach to Generalized Linear Mixed Models. Ph.D. Thesis, University of British Columbia, Vancouver, BC, Canada, 1999.
19. Weiss, R. Modeling Longitudinal Data; Springer: New York, NY, USA, 2005.
20. Islam, M.D. Analysis of Clustered Data Using Tweedie Models with Covariate-Dependent Random Effects. Master’s Thesis, University of New Brunswick, Fredericton, NB, Canada, 2013.

Figure 1. Histogram of the global severity index (GSI) scores in the basic symptoms inventory (BSI) data. The bold bar on the left of the histogram indicates the proportion of exact zeros.
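The spike at zero in Figure 1 is the feature that a Tweedie compound Poisson distribution can accommodate. As a hedged illustration, the sketch below simulates from the standard compound Poisson representation (a Poisson number of gamma-distributed jumps) for a power parameter 1 < p < 2. The function names and the values mu = 1, phi = 1, p = 1.5 are hypothetical and are not estimates from the BSI data; random effects and covariates are omitted.

```python
import numpy as np


def tweedie_cp_params(mu, phi, p):
    """Map Tweedie (mean mu, dispersion phi, power p) with 1 < p < 2
    to the Poisson rate and the gamma shape and scale of the jumps."""
    if not 1.0 < p < 2.0:
        raise ValueError("compound Poisson case requires 1 < p < 2")
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))   # Poisson rate
    shape = (2.0 - p) / (p - 1.0)               # gamma shape per jump
    scale = phi * (p - 1.0) * mu ** (p - 1.0)   # gamma scale per jump
    return lam, shape, scale


def simulate_tweedie_cp(mu, phi, p, size, rng):
    """Draw `size` values; an observation is exactly zero when no jump occurs."""
    lam, shape, scale = tweedie_cp_params(mu, phi, p)
    counts = rng.poisson(lam, size=size)
    y = np.zeros(size)
    positive = counts > 0
    # A sum of n independent Gamma(shape, scale) draws is Gamma(n * shape, scale).
    y[positive] = rng.gamma(shape * counts[positive], scale)
    return y


# Hypothetical parameter values chosen only for illustration.
rng = np.random.default_rng(1)
mu, phi, p = 1.0, 1.0, 1.5
y = simulate_tweedie_cp(mu, phi, p, size=20000, rng=rng)

lam, _, _ = tweedie_cp_params(mu, phi, p)
print("empirical share of exact zeros:", np.mean(y == 0.0))
print("theoretical P(Y = 0) = exp(-lambda):", np.exp(-lam))
print("empirical mean:", y.mean())
```

Under this representation the probability of an exact zero is exp(-lambda), so the same distribution produces both the point mass at zero and the right-skewed positive values seen in a histogram of semicontinuous scores.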

Table 1. Variable description for the BSI data.

Variable Type	Name	Description
Response	GSI	BSI global severity index.
Explanatory	Cluster level	Parent level
	Treatment	Intervention or not.
	Gender.Par	1 if Parent is female or 0 otherwise.
	Age.Par	Parent’s baseline age.
	Sub-cluster level	Adolescent level
	Age.Adol	Adolescent’s baseline age.
	Race.Adol	1 if Adolescent is Hispanic or 0 otherwise.
	Gender.Adol	1 if Adolescent is female or 0 otherwise.
	Observation level
	Months	Number of months adolescent in the study.
	Spring	Spring season.
	Summer	Summer season.

Table 2. Parameter estimates for the BSI data based on CTMM and TMCDRE.

Levels	Covariates	CTMM Estimate	CTMM St. Error	CTMM p-Value	TMCDRE Estimate	TMCDRE St. Error	TMCDRE p-Value
Observation	Intercept	−1.1574	0.4449	0.0093	−1.2114	0.4515	0.0074
	Months	−0.0473	0.0072	0.0000	−0.0503	0.0071	0.0000
	Spring	0.0866	0.0325	0.0061	0.0880	0.0316	0.0045
	Summer	−0.0204	0.0338	0.5353	−0.0173	0.0326	0.5892
Adolescent	Age.Ad	0.0444	0.0221	0.0455	0.0403	0.0221	0.0719
	Race.Ad	0.1226	0.0908	0.1802	0.1148	0.0908	0.2077
	Gender.Ad	0.4017	0.0863	0.0001	0.4058	0.0863	0.0000
Parent	Treatment	0.0149	0.0916	0.8729	0.0110	0.0918	0.9045
	Gender.Pr	0.2305	0.1240	0.0643	0.2310	0.1326	0.0819
	Age.Pr	−0.0199	0.0092	0.0308	−0.0165	0.0095	0.0854
	$σ^{2}$	0.1007			0.1050
	$τ^{2}$	0.5921			0.3844
	$ρ^{2}$	0.5316			0.6794

Table 3. Summary statistics for 400 simulations from Tweedie model with covariate-dependent random effects.

Levels	Covariates	True Value	CTMM Bias	CTMM Sim. SE ^a	CTMM Est. SE ^b	TMCDRE Bias	TMCDRE Sim. SE ^a	TMCDRE Est. SE ^b
Observation	Intercept ( $β_{0}$ )	−1.2114	−0.0146	0.4463	0.4401	−0.0238	0.4354	0.4439
	Months ( $β_{1}$ )	−0.0503	−0.0004	0.0075	0.0081	0.0006	0.0073	0.0086
	Spring ( $β_{2}$ )	0.0880	−0.0015	0.0303	0.0317	−0.0008	0.0296	0.0309
	Summer ( $β_{3}$ )	−0.0173	−0.0004	0.0318	0.0312	0.0002	0.0317	0.0312
Sub-cluster	Age.Ad ( $β_{4}$ )	0.0403	0.0012	0.0234	0.0218	0.0010	0.0233	0.0219
	Race.Ad ( $β_{5}$ )	0.1148	0.0007	0.0906	0.0899	0.0009	0.0897	0.0895
	Gender.Ad ( $β_{6}$ )	0.4058	0.0062	0.0887	0.0852	0.0062	0.0871	0.0845
Cluster	Treatment ( $β_{7}$ )	0.0110	0.0081	0.0920	0.0909	0.0076	0.0906	0.0906
	Gender.Pr ( $β_{8}$ )	0.2310	0.0093	0.1247	0.1229	0.0090	0.1232	0.1254
	Age.Pr ( $β_{9}$ )	−0.0165	−0.0007	0.0100	0.0091	−0.0004	0.0098	0.0094
	$σ^{2}$	0.1050	0.0018	0.0561		0.0034	0.0551
	$τ^{2}$	0.3844	0.1871	0.1000		0.0117	0.1170
	$ρ^{2}$	0.6794	−0.1471	0.0718		0.0270	0.1201

^a: Sim. SE, standard error of estimates over 400 simulations. ^b: Est. SE, average of 400 estimated standard errors.

Table 4. Summary statistics for 400 simulations from Conventional Tweedie mixed model.

Levels	Covariates	True Value	CTMM Bias	CTMM Sim. SE ^a	CTMM Est. SE ^b	TMCDRE Bias	TMCDRE Sim. SE ^a	TMCDRE Est. SE ^b
Observation	Intercept ( $β_{0}$ )	−1.1574	−0.0053	0.4170	0.4373	−0.0001	0.4193	0.4455
	Months ( $β_{1}$ )	−0.0473	−0.0001	0.0071	0.0071	0.0001	0.0072	0.0070
	Spring ( $β_{2}$ )	0.0866	−0.0002	0.0332	0.0318	−0.0004	0.0331	0.0320
	Summer ( $β_{3}$ )	−0.0204	−0.0003	0.0342	0.0326	−0.0004	0.0342	0.0324
Sub-cluster	Age.Ad ( $β_{4}$ )	0.0444	−0.0014	0.0205	0.0217	−0.0016	0.0205	0.0219
	Race.Ad ( $β_{5}$ )	0.1226	−0.0029	0.0836	0.0893	−0.0031	0.0848	0.0897
	Gender.Ad ( $β_{6}$ )	0.4017	0.0020	0.0833	0.0846	0.0028	0.0840	0.0845
Cluster	Treatment ( $β_{7}$ )	0.0149	0.0010	0.0892	0.0902	0.0014	0.0891	0.0908
	Gender.Pr ( $β_{8}$ )	0.2305	−0.0052	0.0091	0.0090	−0.0047	0.0091	0.0094
	Age.Pr ( $β_{9}$ )	−0.0199	−0.0003	0.1240	0.1221	−0.0005	0.1232	0.1262
	$σ^{2}$	0.1007	0.0036	0.0544		0.0104	0.0552
	$τ^{2}$	0.5921	−0.0294	0.1029		−0.2436	0.1397
	$ρ^{2}$	0.5316	0.0030	0.0305		−0.1576	0.1001

^a: Sim. SE, standard error of estimates over 400 simulations. ^b: Est. SE, average of 400 estimated standard errors.


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

