Abstract
Diagnostic classification models are confirmatory in the sense that the relationship between the latent attributes and responses to items is specified or parameterized. Such models are readily interpretable, with each component of the model usually having a practical meaning. However, parameterized diagnostic classification models are sometimes too simple to capture all the data patterns, resulting in significant lack of fit. In this paper, we attempt to obtain a compromise between interpretability and goodness of fit by regularizing a latent class model. Our approach starts with minimal assumptions on the data structure, followed by suitable regularization to reduce complexity, so that a readily interpretable, yet flexible, model is obtained. An expectation–maximization-type algorithm is developed for efficient computation. It is shown that the proposed approach enjoys good theoretical properties. Results from simulation studies and a real application are presented.
References
Allman, E. S., Matias, C., & Rhodes, J. A. (2009). Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37, 3099–3132.
American Psychiatric Association. (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: American Psychiatric Association.
Chen, J., & Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95, 759–771.
Chen, Y., Liu, J., Xu, G., & Ying, Z. (2015a). Statistical analysis of Q-matrix based diagnostic classification models. Journal of the American Statistical Association, 110, 850–866.
Chen, Y., Liu, J., & Ying, Z. (2015b). Online item calibration for Q-matrix in CD-CAT. Applied Psychological Measurement, 39, 5–15.
Croon, M. (1990). Latent class analysis with ordered latent classes. British Journal of Mathematical and Statistical Psychology, 43, 171–192.
Croon, M. (1991). Investigating mokken scalability of dichotomous items by means of ordinal latent class analysis. British Journal of Mathematical and Statistical Psychology, 44, 315–331.
Dalrymple, K., & D’Avanzato, C. (2013). Differentiating the subtypes of social anxiety disorder. Expert Review of Neurotherapeutics, 13, 1271–1283.
de la Torre, J. (2011). The generalized DINA model framework. Psychometrika, 76, 179–199.
de la Torre, J., & Douglas, J. (2004). Higher order latent trait models for cognitive diagnosis. Psychometrika, 69, 333–353.
DiBello, L. V., Stout, W. F., & Roussos, L. A. (1995). Unified cognitive/psychometric diagnostic assessment likelihood-based classification techniques. In R. L. B. Paul, D. Nichols, & Susan F. Chipman (Eds.), Cognitively diagnostic assessment (pp. 361–389). Hillsdale, NJ: Erlbaum.
Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.
Fan, Y., & Tang, C. Y. (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75, 531–552.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1–22.
Goodman, L. A. (1974a). The analysis of systems of qualitative variables when some of the variables are unobservable. Part I—a modified latent structure approach. American Journal of Sociology, 79, 1179–1259.
Goodman, L. A. (1974b). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215–231.
Grant, B. F., Kaplan, K., Shepard, J., & Moore, T. (2003). Source and accuracy statement for Wave 1 of the 2001–2002 National Epidemiologic Survey on Alcohol and Related Conditions. Bethesda, MD: National Institute on Alcohol Abuse and Alcoholism.
Haberman, S. J., von Davier, M., & Lee, Y.-H. (2008). Comparison of multidimensional item response models: Multivariate normal ability distributions versus multivariate polytomous ability distributions. ETS Research Rep. No. RR-08-45. Princeton, NJ: ETS.
Haertel, E. H. (1989). Using restricted latent class models to map the skill structure of achievement items. Journal of Educational Measurement, 26, 301–321.
Henson, R. A., Templin, J. L., & Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74, 191–210.
Junker, B., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258–272.
Kessler, R. C., Berglund, P., Demler, O., Jin, R., Merikangas, K. R., & Walters, E. E. (2005). Lifetime prevalence and age-of-onset distributions of DSM-IV disorders in the national comorbidity survey replication. Archives of General Psychiatry, 62, 593–602.
Lazarsfeld, P. F., Henry, N. W., & Anderson, T. W. (1968). Latent structure analysis. Boston, MA: Houghton Mifflin.
Lehmann, E. L., & Casella, G. (2006). Theory of point estimation. New York: Springer.
Leighton, J., & Gierl, M. (2007). Cognitive diagnostic assessment for education: Theory and applications. Cambridge: Cambridge University Press.
Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy model for cognitive assessment: A variation on Tatsuoka’s rule-space approach. Journal of Educational Measurement, 41, 205–237.
Li, X., Liu, J., & Ying, Z. (2016). Chernoff index for Cox test of separate parametric families. arXiv:1606.08248.
Liu, J., Xu, G., & Ying, Z. (2012). Data-driven learning of Q-matrix. Applied Psychological Measurement, 36, 548–564.
Liu, J., Xu, G., & Ying, Z. (2013). Theory of self-learning Q-matrix. Bernoulli, 19, 1790–1817.
Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. The Annals of Statistics, 12, 758–765.
Rupp, A., & Templin, J. (2008). Unique characteristics of diagnostic classification models: A comprehensive review of the current state-of-the-art. Measurement: Interdisciplinary Research and Perspective, 6, 219–262.
Rupp, A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. New York: Guilford Press.
Stein, M. B., & Stein, D. J. (2008). Social anxiety disorder. The Lancet, 371, 1115–1125.
Tatsuoka, C. (2002). Data analytic methods for latent partially ordered classification models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 51, 337–350.
Tatsuoka, C., & Ferguson, T. (2003). Sequential classification on partially ordered sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65, 143–157.
Tatsuoka, K. (1985). A probabilistic model for diagnosing misconceptions in the pattern classification approach. Journal of Educational Statistics, 12, 55–73.
Tatsuoka, K. (2009). Cognitive assessment: An introduction to the rule space method. New York: Routledge.
Templin, J., & Henson, R. (2006). Measurement of psychological disorders using cognitive diagnosis models. Psychological Methods, 11, 287–305.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58, 267–288.
von Davier, M. (2005). A general diagnostic model applied to language testing data. ETS Research Rep. No. RR-05-16. Princeton, NJ: ETS.
von Davier, M. (2008). A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology, 61, 287–307.
von Davier, M. (2014). The DINA model as a constrained general diagnostic model: Two variants of a model equivalency. British Journal of Mathematical and Statistical Psychology, 67, 49–71.
von Davier, M., & Haberman, S. J. (2014). Hierarchical diagnostic classification models morphing into unidimensional ‘diagnostic’ classification models—a commentary. Psychometrika, 79, 340–346.
von Davier, M., & Yamamoto, K. (2004). A class of models for cognitive diagnosis. Paper presented at the 4th Spearman Conference, Philadelphia, PA.
Wang, H., Li, B., & Leng, C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71, 671–683.
Wang, H., Li, R., & Tsai, C.-L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94, 553–568.
Wang, T., & Zhu, L. (2011). Consistent tuning parameter selection in high dimensional sparse linear regression. Journal of Multivariate Analysis, 102, 1141–1151.
Xu, G. (2016). Identifiability of restricted latent class models with binary responses. arXiv:1603.04140.
Acknowledgments
This work was supported by NSF (grant nos. SES-1323977, IIS-1633360), Army Research Office (grant no. W911NF-15-1-0159), and NIH (grant no. R01GM047845).
Appendices
Appendix 1: Estimation Via the Expectation–Maximization Algorithm
We propose to use the expectation–maximization (EM) algorithm combined with the coordinate descent algorithm to compute the regularized estimator in (10) for given \(\lambda \) and M. The algorithm guarantees a monotonically non-decreasing objective function over iterations. Given initial values of \(\mathbf {c}\) and \(\varvec{\pi }\), the algorithm iterates between the E-step and the M-step until convergence.
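As a concrete point of reference, the unpenalized version of this EM iteration can be sketched in Python as follows. This is a minimal sketch with our own function and variable names; the regularized estimator in the paper replaces the closed-form update of c with the penalized M-step described below.

```python
import numpy as np

def em_latent_class(R, M, n_iter=100, seed=0):
    """Sketch of the (unpenalized) EM iteration for a latent class model.

    R: N x J binary response matrix; M: number of latent classes.
    The regularized estimator in the paper replaces the closed-form
    update of c below with a penalized M-step.
    """
    rng = np.random.default_rng(seed)
    N, J = R.shape
    pi = np.full(M, 1.0 / M)                  # latent class proportions
    c = rng.uniform(0.3, 0.7, size=(J, M))    # item response probabilities

    for _ in range(n_iter):
        # E-step: posterior class memberships q[i, m], computed in log space
        logq = R @ np.log(c) + (1 - R) @ np.log(1 - c) + np.log(pi)
        logq -= logq.max(axis=1, keepdims=True)
        q = np.exp(logq)
        q /= q.sum(axis=1, keepdims=True)

        # M-step: update pi and c from expected counts
        pi = q.mean(axis=0)
        a = R.T @ q                           # expected correct responses per class
        b = (1 - R).T @ q                     # expected incorrect responses per class
        c = np.clip(a / (a + b), 1e-6, 1 - 1e-6)
    return c, pi
```

Each pass performs one E-step (posterior class memberships) and one closed-form M-step, so the unpenalized log-likelihood is non-decreasing across iterations.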
1.1 E-step
In the E-step, one computes the Q-function,
The expectation is taken with respect to \(m_i, i=1,\ldots ,N\). The notation \(E_{\mathbf {c},\varvec{\pi }}\) denotes the conditional expectation under the posterior distribution corresponding to parameters \(\mathbf {c}\) and \(\varvec{\pi }\). The complete data log-likelihood function is
Under the posterior distribution, \(m_i\), \(i=1,\ldots ,N\) are independent and the posterior distribution associated with the parameters \(\mathbf {c}\) and \(\varvec{\pi }\) is
The Q-function takes the following additive form,
1.2 M-step
The M-step consists of maximizing the regularized Q-function with respect to \((\mathbf {c}^*, \varvec{\pi }^*)\)
Note that in the objective function, the term
involves only \(\varvec{\pi }^*\), and for each j the term
involves only \(\mathbf {c}_{j}^*\). Therefore, we can maximize the Q-function with respect to \(\varvec{\pi }^*\) and each \(\mathbf {c}_j^*\) independently. In particular,
can be computed as follows:
We maximize
where \(a^j_m=\sum _{i=1}^N q_{im}R_i^j\) and \(b^j_m=\sum _{i=1}^N q_{im}(1-R^j_i)\). Here, \(a^j_m\) represents the expected number of respondents who are from latent class m and have responded correctly to item j, and \(b^j_m\) represents the expected number of respondents who are from latent class m and have responded incorrectly to item j, given the responses and the current parameter estimates.
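In matrix form, with the posterior weights collected in an N × M matrix q and the responses in an N × J matrix R, these expected counts are plain matrix products. A small numerical sketch (the numbers and names are illustrative only):

```python
import numpy as np

# Toy posterior weights q (N x M) and binary responses R (N x J);
# the values are illustrative only.
q = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])
R = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])

a = R.T @ q           # a[j, m] = sum_i q[i, m] * R[i, j]
b = (1 - R).T @ q     # b[j, m] = sum_i q[i, m] * (1 - R[i, j])
x_star = a / (a + b)  # posterior-weighted proportion correct per class

# For every item j, a[j, m] + b[j, m] equals the expected size of class m
assert np.allclose(a + b, q.sum(axis=0))
```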
Let
We first show the result for the order of \(c^\dagger _{j,m}, m=1,\ldots ,M.\)
Proposition 1
Let \(x^*_{j,m}=\frac{a^j_m}{a^j_m+b^j_m}\) and \(c^\dagger _{j,m}\) be defined in (17), \(j=1,\ldots ,J\), \(m=1,\ldots ,M\). Then for each j, the order of \(c^\dagger _{j,1},\ldots ,c^\dagger _{j,M}\) is the same as that of \(x^*_{j,1},\ldots ,x^*_{j,M}\). That is, for \(l\ne s, 1\le l,s\le M\), if \(x^*_{j,l}\ge x^*_{j,s}\) then \(c^\dagger _{j,l}\ge c^\dagger _{j,s}\).
Because of this proposition, the computation in (17) is greatly simplified. Instead of searching over the whole domain \([0,1]^M\), we only need to consider a much smaller subspace (whose volume is \(1/(M!)\)) determined by the order of \(x^*_{j,1}, \ldots , x^*_{j,M}\). Once the order of \(c^\dagger _{j,1},\ldots ,c^\dagger _{j,M}\) is known, we parameterize the maximization problem by the order statistics. For instance, if \(x^*_{j,1}<\cdots <x^*_{j,M}\), then \(c^\dagger _{j,(m)}=c^\dagger _{j,m}\). In this case, we write
where \(d_l = c_{j,(l+1)}^* - c_{j,(l)}^*\). Then we apply the coordinate descent algorithm to the reparametrized function \(Q^r_j(c_{j,(1)},d_1,\ldots ,d_{M-1})\) subject to the constraint that \(c_{j,(1)},d_1,\ldots ,d_{M-1}\ge 0\) and \(c_{j,(1)}+\sum _{m=1}^{M-1}d_m\le 1\). For more details about the coordinate descent algorithm, see Friedman, Hastie, and Tibshirani (2010).
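The reparameterization by the smallest value and the successive gaps is a simple bijection, which can be sketched as follows (function names are ours):

```python
import numpy as np

def to_gaps(c_sorted):
    """Map sorted values c_(1) <= ... <= c_(M) to (c_(1), d_1, ..., d_{M-1})
    with d_l = c_(l+1) - c_(l)."""
    return np.concatenate([c_sorted[:1], np.diff(c_sorted)])

def from_gaps(params):
    """Inverse map: cumulative sums recover the sorted values."""
    return np.cumsum(params)

c_sorted = np.array([0.2, 0.2, 0.55, 0.9])
params = to_gaps(c_sorted)
# All coordinates are nonnegative and c_(1) + sum_l d_l = c_(M) <= 1,
# which is exactly the constraint set of the reparameterized problem.
assert np.all(params >= 0) and params.sum() <= 1
assert np.allclose(from_gaps(params), c_sorted)
```

Gaps \(d_l\) that the penalty shrinks exactly to zero correspond to merged latent classes, which is what makes this parameterization convenient for the penalized M-step.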
Appendix 2: Proof of Proposition 1
Proof
For simplicity of notation, we assume \(M=2\) and \(x^*_{j,1}\le x^*_{j,2}\). For \(M>2\), the proof is similar. Assume to the contrary that \(c^\dagger _{j,1}>c^\dagger _{j,2}\). Then according to (17)
According to (16), this can be simplified to
Because \(p_{\lambda }^{SCAD}(c^\dagger _{j,1}-c^\dagger _{j,2})\ge 0\), (18) and (19) remain true after removing the term \(-2p_{\lambda }^{SCAD}(c^\dagger _{j,1}-c^\dagger _{j,2})\). According to the definition of \(x^*_{j,1}\) and \(x^*_{j,2}\), we have
Adding these two inequalities up gives
Therefore
However, the function \(\log x-\log (1-x)\) is strictly increasing for \(x\in (0,1)\), so (20) is impossible. This finishes the proof. \(\square \)
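As a quick numerical illustration of this final step, the logit function used above is indeed strictly increasing on (0, 1):

```python
import numpy as np

# Numerical sanity check of the final step of the proof: the logit
# x -> log(x) - log(1 - x) is strictly increasing on (0, 1).
x = np.linspace(0.01, 0.99, 99)
logit = np.log(x) - np.log(1 - x)
assert np.all(np.diff(logit) > 0)
```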
Appendix 3: Proof of Theorem 1
Proof
Throughout the proof, for two sequences of vectors \(a_N\) and \(b_N\), we write \(a_N=o(b_N)\) if \(\Vert a_N\Vert /\Vert b_N\Vert \) tends to zero and \(a_N=O(b_N)\) if \(\Vert a_N\Vert /\Vert b_N\Vert \) is bounded as N varies. Moreover, for two sequences of random vectors \(a_N\) and \(b_N\), we write \(a_N=o_P(b_N)\) if \(\Vert a_N\Vert /\Vert b_N\Vert \) converges to zero in probability and \(a_N=O_P(b_N)\) if \(\Vert a_N\Vert /\Vert b_{N}\Vert \) is tight in probability. To simplify the notation, we denote the true model parameters by \((\mathbf {c}, \varvec{\pi })\) and write \(\varvec{\theta }=(\mathbf {c},\varvec{\pi }_{-1})\), \(\hat{\varvec{\theta }}=(\hat{\mathbf {c}}^{\lambda _N},\hat{\varvec{\pi }}_{-1}^{\lambda _N})\), and \(\varvec{\theta }'=(\mathbf {c}',\varvec{\pi }_{-1}')\). Note that the event \(\Vert \hat{\varvec{\theta }}-\varvec{\theta }\Vert \ge \frac{C}{\sqrt{N}}\) implies
Therefore, it is sufficient to show that for each \(\varepsilon >0\), there exists a sufficiently large constant C such that
We split the probability above into two parts,
where
and
Here, \(\varepsilon _1\) is a positive constant independent of N, whose value will be chosen later. We present upper bounds for \(I_1\) and \(I_2\) separately. The next lemma, whose proof is given in Appendix 5, provides an upper bound for \(I_1\).
Lemma 1
For any fixed \(\varepsilon _1>0\), there exists a positive constant \(\varepsilon _2\) (depending on \(\varepsilon _1)\) such that for sufficiently large N, we have \( I_1\le e^{-\varepsilon _2 N}. \)
We proceed to the \(I_2\) term. We first analyze
It is straightforward to check that for \(\varvec{\theta }'\in \Theta \), there exists a sufficiently large positive constant \(\eta \) such that
where \(\nabla ^2l\) and \(\nabla ^3 l\) denote vectors consisting of all second and third partial derivatives of l, respectively. According to (21), we compute the Taylor expansion of \(l(\varvec{\theta }')\) around \(\varvec{\theta }\) for \(\varvec{\theta }'\in \Theta \)
In (22), the term \(O_P(\Vert \varvec{\theta }'-\varvec{\theta }\Vert ^2\sqrt{N})\) corresponds to the remainder term for the second derivatives at \(\varvec{\theta }\) and the term \(O(\Vert \varvec{\theta }'-\varvec{\theta }\Vert ^3 N)\) corresponds to the terms involving third derivatives. Note that for \(\Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \varepsilon _1\), there exists a positive constant \(C_2\), independent of \(\varepsilon _1\), such that \(O(\Vert \varvec{\theta }'-\varvec{\theta }\Vert ^3N)\le C_2 \varepsilon _1 \Vert \varvec{\theta }'-\varvec{\theta }\Vert ^2N\). Thus, the \(O(\Vert \varvec{\theta }'-\varvec{\theta }\Vert ^3 N)\) term is dominated by the second term, that is,
Also note that \(O_P(\Vert \varvec{\theta }'-\varvec{\theta }\Vert ^2\sqrt{N})=O_P(\Vert \varvec{\theta }'-\varvec{\theta }\Vert \sqrt{N})\), and \(\nabla l(\varvec{\theta })=O_P(\sqrt{N})\). Thus,
Combining (22), (23) and (24) gives
which is further bounded above by
Therefore, by choosing \(\varepsilon _1\) sufficiently small, we have
We proceed to the penalty term. For simplicity of discussion, we only state the proof for the case where there is no \(j\in \{1,\ldots ,J\}\) such that \(c_{j,1}=c_{j,2}=\cdots =c_{j,M}\). That is, all items have discrimination power. When there are items that have the same item response function among all the latent classes, the proof is similar, and is thus omitted.
Define a function \(\mathbf {gap}(\varvec{\beta })=\min \{|\beta _i-\beta _j|:\beta _i\ne \beta _j, i=1,\ldots ,M, j=1,\ldots ,M \}\), where \(\varvec{\beta }=(\beta _1,\ldots ,\beta _M)\in R^M\) and there exist i and j such that \(\beta _i\ne \beta _j\). Note that the difference of order statistics \(c_{j,(m+1)}-c_{j,(m)}\) is either zero or greater than \(\frac{\mathbf {gap}(\mathbf {c}_j)}{4}\). Recall that in the definition (9), \(p_{\lambda _N}^{SCAD}(x)=\frac{(a+1)^2\lambda _N^2}{2}\) for all \(|x|\ge a\lambda _N\). Thus, the penalty term \(p_{\lambda _N}^{SCAD}(c_{j,(m+1)}-c_{j,(m)})\) is either 0 (when \(c_{j,(m+1)}-c_{j,(m)}=0\)) or \(\frac{(a+1)^2\lambda _N^2}{2}\) (when \(c_{j,(m+1)}-c_{j,(m)}>0\)) for N sufficiently large such that \(\lambda _N<\frac{\min _{1\le j\le J}\mathbf {gap}(\mathbf {c}_j)}{4a}\). Therefore,
where \(\mathrm {Card}(\cdot )\) denotes the number of elements in a set. On the other hand, we have the following lemma on \(\kappa _{\lambda _N}(\mathbf {c}')\), whose proof is given in Appendix 5.
Lemma 2
If \(\Vert \mathbf {c}'-\mathbf {c}\Vert <\frac{1}{4}\min _{1\le j\le J}\mathbf {gap}(\mathbf {c}_j)\) and \(\lambda _N\le \frac{1}{4a}\min _{1\le j\le J}\mathbf {gap}(\mathbf {c}_j)\), then
The above lemma and (26) show that \( \kappa _{\lambda _N}(\mathbf {c}')-\kappa _{\lambda _N}(\mathbf {c})\ge 0 \) for \(\lambda _N \le \frac{\min _{1\le j\le J}\mathbf {gap}(\mathbf {c}_j)}{4a}.\) Combining this with (25), we have that for sufficiently large N,
Note that \(\inf _{\Vert v\Vert =1}v^{\top }I(\varvec{\theta })v \) is equal to the smallest eigenvalue of \(I(\varvec{\theta })\), which is positive by Assumption A3. Therefore, we have
for C sufficiently large. Combining our results for \(I_1\) and \(I_2\), we conclude the proof. \(\square \)
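The gap function introduced in the proof above can be written out directly; a minimal sketch (function name ours):

```python
import numpy as np

def gap(beta):
    """gap(beta) = min |beta_i - beta_j| over pairs with beta_i != beta_j,
    as defined in the proof; requires at least two distinct entries."""
    beta = np.asarray(beta, dtype=float)
    diffs = np.abs(beta[:, None] - beta[None, :])
    positive = diffs[diffs > 0]
    return positive.min()

# Repeated values are ignored; only distinct pairs contribute.
assert np.isclose(gap([0.2, 0.2, 0.5, 0.9]), 0.3)
```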
Appendix 4: Proof of Theorem 2
Proof
We first present a useful lemma, whose proof is given in Appendix 5.
Lemma 3
There exist constants C and \(C_1\) such that
for sufficiently large N.
Let the event \(\Omega _1=\Big \{\sup _{\Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \frac{C}{\sqrt{N}}}\Vert \nabla l(\varvec{\theta }')\Vert \le C_1\sqrt{N},\Vert \hat{\varvec{\theta }}-\varvec{\theta }\Vert \le \frac{C}{\sqrt{N}}\Big \}\). It is sufficient to show that on the event \(\Omega _1\), \(\hat{\mathbf {c}}^{\lambda _N}\) and \(\mathbf {c}\) have the same partially merged pattern for N large enough. We prove this by contradiction. Assume, on the contrary, that the partially merged patterns of \(\hat{\mathbf {c}}^{\lambda _N}\) and \(\mathbf {c}\) are different. We will then construct a \(\tilde{\varvec{\theta }}\in \Theta \) such that
which contradicts the definition of \(\hat{\varvec{\theta }}\). Without loss of generality, assume that \(\hat{\mathbf {c}}_1^{\lambda _N}\) and \(\mathbf {c}_1\) have different partially merged patterns. That is, there exist \(m_1,m_2\in \{1,\ldots ,M\}\) such that \(c_{1,m_1}\le c_{1,m_2}\) but \(\hat{c}^{\lambda _N}_{1,m_1}>\hat{c}^{\lambda _N}_{1,m_2}\). There are two cases: (1) \(c_{1,m_1}<c_{1,m_2}\) and (2) \(c_{1,m_1}=c_{1,m_2}\). Because on the event \(\Omega _1\), \(|\hat{c}^{\lambda _N}_{1,m_i}-c_{1,m_i}|<\frac{C}{\sqrt{N}}\) \((i=1,2)\), the first case is not possible when N is sufficiently large. Thus, we only need to consider the second case where \(c_{1,m_1}=c_{1,m_2}\) and \(\hat{c}^{\lambda _N}_{1,m_1}>\hat{c}^{\lambda _N}_{1,m_2}\). Define two sets of indices as follows:
and
The set B is a subset of A, collecting the indices at which \(\hat{c}^{\lambda _N}_{1,m}\) is minimized. By the assumption above, both A and B are non-empty. Now we construct \(\tilde{\mathbf {c}}\) as follows:
where \(\Delta \) is a sufficiently small positive number that will be chosen later. For \(j=2,\ldots ,J\) and \(m=1,\ldots ,M\), we keep \(\tilde{c}_{j,m}=\hat{c}_{j,m}^{\lambda _N}\). We also set \(\tilde{\varvec{\pi }}_{-1}=\hat{\varvec{\pi }}_{-1}^{\lambda _N}\). That is, \(\tilde{\varvec{\theta }}\) and \(\hat{\varvec{\theta }}\) are the same except for \(\tilde{c}_{1,m}\) where \(m\in B\). We proceed to compare \(l(\tilde{\varvec{\theta }})-N\kappa _{\lambda _N}(\tilde{\mathbf {c}})\) and \(l(\hat{\varvec{\theta }})-N\kappa _{\lambda _N}(\hat{\mathbf {c}}^{\lambda _N})\). Because \(\tilde{\varvec{\theta }}\) and \(\tilde{\mathbf {c}}\) depend on \(\Delta \), we write \(\tilde{\varvec{\theta }}(\Delta )\) and \(\tilde{\mathbf {c}}(\Delta )\) to indicate this dependence.
Lemma 4
On the event \(\Omega _1\), for N sufficiently large, \(\kappa _{\lambda _N}(\tilde{\mathbf {c}}(\Delta ))\) is differentiable at 0. Furthermore, \( \frac{d\kappa _{\lambda _N}(\tilde{\mathbf {c}}(\Delta ))}{d\Delta } = -\lambda _N.\)
The lemma above allows us to take the derivative of \(q(\Delta )=l(\tilde{\varvec{\theta }}(\Delta ))-N\kappa _{\lambda _N}(\tilde{\mathbf {c}}(\Delta ))\) with respect to \(\Delta \) on the event \(\Omega _1\),
Recall that on event \(\Omega _1\), \(|\sum _{m\in B}\frac{\partial l(\hat{\varvec{\theta }})}{\partial c_{1,m}}|\le C_1\sqrt{N}\mathrm {Card}(B)\). This, together with Lemma 4, gives
Note that \(\sqrt{N}\lambda _N\rightarrow \infty \) as \(N\rightarrow \infty \). Thus, \(\dot{q}(0)>0\) for sufficiently large N. This implies that \(q(\Delta )>q(0)=l(\hat{\varvec{\theta }})-N\kappa _{\lambda _N}(\hat{\mathbf {c}}^{\lambda _N})\) for sufficiently small positive \(\Delta \). It further implies that (27) holds for such \(\tilde{\varvec{\theta }}(\Delta )\), contradicting the definition of \(\hat{\varvec{\theta }}\). \(\square \)
Appendix 5: Proof of Supporting Lemmas
Proof of Lemma 1
Note that the event
implies
Thus, we have an upper bound for the probability
According to the definition of \(\kappa _{\lambda _N}\), we have \(0\le \kappa _{\lambda _N}(\mathbf {c}')\le J(M-1)\times \frac{(a+1)^2\lambda _N^2}{2}\). Therefore, (28) is further bounded above by
where \(C_3=J(M-1)\times \frac{(a+1)^2}{2}\) is a constant. Note that \(\lambda _N\rightarrow 0\) as \(N\rightarrow \infty \), so the right-hand side of the above display is the type I error probability of the generalized likelihood ratio test with an \(e^{o(N)}\) cut-off value for testing
whose exponential decay rate has been established in Lemma 3 of Li, Liu, and Ying (2016): there exists a rate \(\rho >0\) such that \(P\Big ( \sup _{\Vert \varvec{\theta }'-\varvec{\theta }\Vert \ge \varepsilon _1,\varvec{\theta }'\in \Theta } \{l(\varvec{\theta }')-l(\varvec{\theta })\}\ge C_2 N\lambda _N^2 \Big )=e^{-(\rho +o(1))N}.\) Choosing \(\varepsilon _2\) to be positive and smaller than \(\rho \), we conclude our proof. \(\square \)
Proof of Lemma 2
Because \(\kappa _{\lambda _N}(\mathbf {c}')=\sum _{j=1}^J p_{\lambda _N}(\mathbf {c}'_j)\), it is sufficient to show that for each \(j\in \{1,\ldots ,J\}\)
Similar to the discussion preceding (26), we only need to prove that for each \(j\in \{1,\ldots ,J\}\),
We first prove that for each \(m\in \{1,\ldots ,M-1\}\), if \(c_{j,(m+1)}-c_{j,(m)}>0\), then there exists \(m'\in \{1,\ldots ,M-1\}\) such that
To proceed, we define a set \(D=\{ l:c_{j,l}=c_{j,(m)} \}\) and choose \(m' \in D\) such that \( c'_{j,m'}=\max _{k\in D} c'_{j,k}.\) Recall that we assume \(\Vert \mathbf {c}'-\mathbf {c}\Vert \le \frac{1}{4}\min _{1\le j\le J}\mathbf {gap}(\mathbf {c}_j)\). Thus, we have \(|c'_{j,m'}-c_{j,(m)}| = |c'_{j,m'}-c_{j,m'}|\le \frac{1}{4}\mathbf {gap}(\mathbf {c}_j)\). Moreover, for each l such that \(c'_{j,l}>c'_{j,m'}\), we have \(l \notin D\) by the choice of \(m'\). We then show
by contradiction. If \(c'_{j,l} \le c_{j,(m)} + \frac{1}{2} \mathbf {gap}(\mathbf {c}_j) \), then
Since \(c_{j,(m+1)}\ge c_{j,(m)}+\mathbf {gap}(\mathbf {c}_j)\), combining with (32) implies that
On the one hand, \(c_{j,l} = c_{j,(m)}\) would imply \(l \in D\), contradicting \(l \notin D\). On the other hand, if \(c_{j,l} < c_{j,(m)}\), then \(c_{j,l}\le c_{j,(m-1)}\) and
contradicting \(c'_{j,l}>c'_{j,m'}\). Therefore,
when \(\lambda _{N}\le \frac{1}{4a}\min _{j\in \{1,\ldots ,J\}}\mathbf {gap}(\mathbf {c}_j)\). Therefore, (31) holds for \(\lambda _{N}\le \frac{1}{4a}\min _{j\in \{1,\ldots ,J\}}\mathbf {gap}(\mathbf {c}_j)\). Notice that for different m with \(c_{j,(m+1)}-c_{j,(m)}>0\), the corresponding indices \(m'\) for which (31) holds are distinct. Thus, (30) is proved. \(\square \)
Proof of Lemma 3
According to Theorem 1, for each \(\varepsilon >0\), there exists a constant C such that for sufficiently large N,
Now, for \(\Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \frac{C}{\sqrt{N}}\), we expand \(\nabla l(\varvec{\theta }')\) around \(\varvec{\theta }\),
By (21) the right-hand side of the above display is further bounded above by \(\eta N\times \frac{C}{\sqrt{N}}= C\eta \sqrt{N}\). Thus, for \(\Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \frac{C}{\sqrt{N}}\),
Taking the supremum with respect to \(\varvec{\theta }'\) in the above display, we have
Note that \(\Vert \nabla l(\varvec{\theta }) \Vert =O_P(\sqrt{N})\). This and the above display yield
Consequently, we can choose \(C_1\) sufficiently large such that
We combine this with (33), concluding the proof. \(\square \)
Proof of Lemma 4
Let \(K=\mathrm {Card}(\{\hat{c}^{\lambda _N}_{1,1},\ldots ,\hat{c}^{\lambda _N}_{1,M} \})\) be the number of distinct values in \(\hat{\mathbf {c}}^{\lambda _N}_1\). Define the vector of ordered distinct values in \(\hat{\mathbf {c}}^{\lambda _N}_1\) as \(\hat{\gamma }=(\hat{\gamma }_1,\ldots ,\hat{\gamma }_K)^T\) such that \(\hat{\gamma }_1<\hat{\gamma }_2<\cdots <\hat{\gamma }_K\) and \(\{ \hat{\gamma }_1,\ldots ,\hat{\gamma }_K \}=\{ \hat{c}_{1,1}^{\lambda _N},\ldots ,\hat{c}_{1,M}^{\lambda _N} \}\). We define \(\tilde{\gamma }\) in the same manner. Let \(k^*\) satisfy \(\hat{\gamma }_{k^*}=\min _{l\in A}\hat{c}^{\lambda _N}_{1,l}\). We choose \(|\Delta |<\min \{\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*} ,\hat{\gamma }_{k^*}-\hat{\gamma }_{k^*-1}\}\). Then \(\tilde{\mathbf {c}}_1\) and \(\hat{\mathbf {c}}^{\lambda _N}_1\) have the same partially merged pattern and for \(k=1,\ldots ,K\)
The penalty term for \(\tilde{\mathbf {c}}_1\) is
By (34), the above display becomes
where we set \(p_{\lambda _N}^{SCAD}(\hat{\gamma }_{k^*}+\Delta -\hat{\gamma }_{k^*-1})\) to be 0 if \(k^*=1\). We compare this with the penalty term of \(\hat{\mathbf {c}}_1^{\lambda _N}\)
where we define \( q_1(\Delta )=p_{\lambda _N}^{SCAD}(\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*}-\Delta )-p_{\lambda _N}^{SCAD}(\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*}) \) and \( q_2(\Delta )=p_{\lambda _N}^{SCAD}(\hat{\gamma }_{k^*}+\Delta -\hat{\gamma }_{k^*-1})-p_{\lambda _N}^{SCAD}(\hat{\gamma }_{k^*}-\hat{\gamma }_{k^*-1}). \) We will show that \(\dot{q}_1(0)=-\lambda _N\) and \(\dot{q}_2(0)=0\). To proceed, we first analyze \(\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*}\). Let \(m^*_2\) satisfy \(\hat{c}_{1,m^*_2}^{\lambda _N}=\min _{l\in A}\hat{c}_{1,l}^{\lambda _N} = \hat{\gamma }_{k^*}\). According to the definition of the set A, there exists \(m_1\) such that \(\hat{c}_{1,m_1}^{\lambda _N}>\hat{c}_{1,m^*_2}^{\lambda _N}\) and \(c_{1,m_1}=c_{1,m^*_2}\). Note that \(\hat{c}_{1,m_1}^{\lambda _N}\le c_{1,m_1}+\frac{C}{\sqrt{N}}\) and \(\hat{c}_{1,m^*_2}^{\lambda _N}\ge c_{1,m^*_2}-\frac{C}{\sqrt{N}}\) on the event \(\Omega _1\), so \(\hat{c}_{1,m_1}^{\lambda _N}\le \hat{c}_{1,m^*_2}^{\lambda _N}+\frac{2C}{\sqrt{N}}\). Recall that \(\hat{\gamma }_{k^*}= \min _{l\in A}\hat{c}_{1,l}^{\lambda _N}=\hat{c}_{1,m^*_2}^{\lambda _N}\) and \(\hat{\gamma }_{k^*+1}\le \hat{c}^{\lambda _N}_{1,m_1}\). Thus, \(\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*}\le \frac{2C}{\sqrt{N}}\). Because \(\lambda _N\sqrt{N}\rightarrow \infty \) as N grows large, \(\frac{2C}{\sqrt{N}}<\frac{\lambda _N}{2}\) for sufficiently large N. Consequently, for \(|\Delta |<\frac{C}{\sqrt{N}}\),
According to the definition of \(p_{\lambda _N}^{SCAD}\) in (9) and (36), we have
Because \(\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*}>0\), for \(|\Delta |<\min \{\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*}, \frac{C}{\sqrt{N}} \}\),
Therefore,
Now we proceed to the analysis of \(q_2\). If \(k^*=1\), then \(q_2(\Delta )\) is set to 0, and so is \(\dot{q}_2(\Delta )\). We proceed to the case where \(k^*\ge 2\). Choose \(m^*_1\) such that \(\hat{c}^{\lambda _N}_{1,m^*_1}=\hat{\gamma }_{k^*-1}\). As \(\hat{c}^{\lambda _N}_{1,m^*_1}=\hat{\gamma }_{k^*-1}<\hat{\gamma }_{k^*}=\hat{c}^{\lambda _N}_{1,m^*_2}\), we know \(m_1^*\notin A\) and \(c_{1,m^*_1}\ne c_{1,m^*_2}\) because of the definition of A and B. Furthermore, according to the analysis below (27), it is not possible to have \(c_{1,m^*_1}>c_{1,m^*_2}\) on event \(\Omega _1\). Thus, we have \(c_{1,m^*_1}<c_{1,m^*_2}\). Now let N be sufficiently large such that \(\frac{2C}{\sqrt{N}}<\frac{c_{1,m^*_2}-c_{1,m^*_1}}{2}\), then \(\hat{c}^{\lambda _N}_{1,m^*_2}-\hat{c}^{\lambda _N}_{1,m^*_1}>\frac{c_{1,m^*_2}-c_{1,m^*_1}}{2}\) on the event \(\Omega _1\). Thus, for \(|\Delta |<\frac{c_{1,m^*_2}-c_{1,m^*_1}}{4}\) we have
By the definition in (9), for N sufficiently large such that \(a\lambda _N<\frac{c_{1,m^*_2}-c_{1,m^*_1}}{4}\),
Thus, \(\dot{q_2}(0)=0\). Combining this with \(\dot{q_1}(0)=-\lambda _N\) and (35), \( \frac{d}{d\Delta }\{p_{\lambda _N}(\tilde{\mathbf {c}}_1)-p_{\lambda _N}(\hat{\mathbf {c}}_1^{\lambda _N})\}|_{\Delta =0}=-\lambda _N. \) We conclude the proof by noting that \(\kappa _{\lambda _N}(\tilde{\mathbf {c}})=\sum _{j=1}^Jp_{\lambda _N}(\tilde{\mathbf {c}}_j)\) and that \(\tilde{\mathbf {c}}_j=\hat{\mathbf {c}}_j^{\lambda _N}\) for \(j\in \{2,\ldots ,J\}\). \(\square \)
Chen, Y., Li, X., Liu, J. et al. Regularized Latent Class Analysis with Application in Cognitive Diagnosis. Psychometrika 82, 660–692 (2017). https://doi.org/10.1007/s11336-016-9545-6