
Performance Evaluation of Missing-Value Imputation Clustering Based on a Multivariate Gaussian Mixture Model

  • Jing Xiao ,

    Contributed equally to this work with: Jing Xiao, Qiongqiong Xu

    Affiliation Department of Epidemiology and Medical Statistics, School of Public Health, Nantong University, Nantong, 226019, China

  • Qiongqiong Xu ,

    Contributed equally to this work with: Jing Xiao, Qiongqiong Xu

    Affiliation Department of Epidemiology and Medical Statistics, School of Public Health, Nantong University, Nantong, 226019, China

  • Chuanli Wu,

    Affiliation Department of Epidemiology and Medical Statistics, School of Public Health, Nantong University, Nantong, 226019, China

  • Yuexia Gao,

    Affiliation Department of Epidemiology and Medical Statistics, School of Public Health, Nantong University, Nantong, 226019, China

  • Tianqi Hua,

    Affiliation Department of Epidemiology and Medical Statistics, School of Public Health, Nantong University, Nantong, 226019, China

  • Chenwu Xu

    qtls@yzu.edu.cn

    Affiliation Jiangsu Key Laboratory of Crop Genetics and Physiology/Co-Innovation Center for Modern Production Technology of Grain Crops, Key Laboratory of Plant Functional Genomics of the Ministry of Education, Yangzhou University, Yangzhou, 225009, China

Abstract

Background

It is challenging to deal with mixture models when missing values occur in clustering datasets.

Methods and Results

We propose a dynamic clustering algorithm based on a multivariate Gaussian mixture model that efficiently imputes missing values to generate a “pseudo-complete” dataset. Parameters of the different clusters and the missing values are estimated by maximum likelihood, implemented with an expectation-maximization algorithm, and multivariate individuals are clustered by Bayesian posterior probability. A simulation showed that our proposed method converges quickly and estimates missing values accurately. The algorithm was further validated with Fisher’s Iris dataset, the Yeast Cell-cycle Gene-expression dataset, and the CIFAR-10 images dataset. The results indicate that our algorithm offers highly accurate clustering, comparable to that obtained with a complete dataset without missing values. Furthermore, our algorithm yielded a lower misjudgment rate than both a clustering algorithm with missing data deleted and one with missing values imputed by mean replacement.

Conclusion

We demonstrate that our missing-value imputation clustering algorithm is feasible and superior to both of these other clustering algorithms in certain situations.

Introduction

Clustering analysis, a multivariate statistical method, refers to the process of classifying a set of observations into subsets, called clusters, such that observations in the same cluster are similar in certain respects [1–3]. Clustering is widely used in the medical sciences, for instance when clustering diseases or gene-expression profiles. Clustering methods usually fall into two categories: hierarchical clustering methods [4], which are used for small datasets [5,6]; and dynamic clustering methods, such as K-means [7,8] and self-organizing maps [9], which begin with an initial partitioning of the individuals and iteratively move individuals from one cluster to another until a convergence criterion is met. With dynamic clustering, the number of clusters must be specified in advance [10]. Dynamic algorithms are mostly heuristically motivated, and they do not require an underlying statistical model. Nevertheless, selecting the “correct” number of clusters and the best clustering method remains a topic for discussion. Model-based clustering methods [11–13] are a type of dynamic clustering based on the hypothesis that the whole dataset is a finite mixture of the same type of distribution with different sets of parameters, such as a finite mixture of multivariate Gaussian distributions. Compared with “heuristic” algorithms, one obvious advantage of model-based clustering is that objective statistical criteria, such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) [14], can be used to determine the number of clusters.

Missing data present a problem for medical research, given that data are often supplied retrospectively and from various sources [15]. In particular, missing data occur frequently in microarray experiments, and missing values are especially common in large-scale studies involving dozens of variables and hundreds of individuals. Many clustering methods, however, require a full set of data: individuals with missing values must either be deleted or have their missing values estimated prior to the analysis. Consequently, several missing-value imputation methods have been developed [16–20], such as mean substitution, regression imputation, fuzzy c-means (FCM) clustering of incomplete data [21], and Gaussian mixture model-based missing-value imputation classification [22]. In this study, we propose a dynamic, model-based missing-value imputation clustering algorithm. With our proposed method, missing values are estimated iteratively, during the process of clustering individuals, until arithmetic convergence is reached. We used 12 simulated datasets, each in two versions: a complete version and a version in which 10% of the individuals had at least one variable missing. We used these datasets to evaluate the clustering accuracy and the accuracy and precision of the cluster-parameter estimators. We compared our proposed algorithm, which imputes missing values by maximum-likelihood estimation (yielding a “pseudo-complete” dataset), with a model-based clustering algorithm using the complete dataset and with two other model-based clustering algorithms, one using a dataset with missing values deleted and the other imputing the missing values by mean replacement. In addition, we compared our algorithm with the FCM clustering method using real datasets, viz., Fisher’s Iris dataset, the Yeast Cell-cycle Gene-expression dataset, and the CIFAR-10 images dataset.

Methods

Multivariate Gaussian mixture model

The dataset is arranged as an n × k matrix denoted by Y, where n is the number of individuals and k is the number of variables. Let y_ij be the observed value of the i-th individual for the j-th variable, for i = 1, 2, …, n and j = 1, 2, …, k. Let Y_i = (y_i1, y_i2, …, y_ik)^T be the i-th column of the matrix Y^T, i.e., a k × 1 vector of the data for individual i under all variables; the values y_i1, …, y_ik represent, for example, the expression levels of the i-th individual across the k variables. Under a finite multivariate Gaussian mixture model, each Y_i is assumed to follow a k-dimensional Gaussian mixture distribution. Mathematically, the mixture density for c clusters is as follows:

    f(Y_i) = \sum_{l=1}^{c} p_l \phi_l(Y_i; \mu_l, S_l),    (1)

where \phi_l(Y_i; \mu_l, S_l) is the probability density function of the l-th k-dimensional Gaussian distribution with mean vector \mu_l = (\mu_{l1}, \mu_{l2}, …, \mu_{lk}) and variance-covariance matrix S_l, for l = 1, 2, …, c. Moreover, p_l, with \sum_{l=1}^{c} p_l = 1, is the mixing proportion of cluster l, defined as the proportion of individuals that belong to the l-th cluster. The joint log-likelihood function for n independent individual vectors is defined as follows:

    \ln L(\Theta) = \sum_{i=1}^{n} \ln \left[ \sum_{l=1}^{c} p_l \phi_l(Y_i; \mu_l, S_l) \right].    (2)
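To make Eqs (1) and (2) concrete, the following minimal sketch evaluates the joint log-likelihood of a fitted mixture. Python with numpy/scipy is used purely for illustration (the paper’s own implementation used SAS/IML), and the function name is ours.

    import numpy as np
    from scipy.stats import multivariate_normal

    def mixture_log_likelihood(Y, p, mu, S):
        # Y: (n, k) data; p: (c,) mixing proportions; mu: (c, k) mean vectors;
        # S: (c, k, k) variance-covariance matrices. Returns ln L of Eq (2).
        c = len(p)
        # dens[i, l] = p_l * phi_l(Y_i; mu_l, S_l), the summand of Eq (1)
        dens = np.column_stack([
            p[l] * multivariate_normal.pdf(Y, mean=mu[l], cov=S[l])
            for l in range(c)
        ])
        return np.sum(np.log(dens.sum(axis=1)))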

Clustering algorithm for missing values

Most reports [23–25] consider only complete datasets (without missing values) for Gaussian mixture clustering; indeed, it is challenging to estimate missing values accurately. A further problem is how to choose the optimal number of clusters c and how to assign the n individuals to the c clusters precisely. We therefore propose the following missing-value imputation algorithm.

Suppose that there are r missing values in Y_i; without loss of generality, let y_i1, y_i2, …, y_ir be missing. Then Y_i, \mu_l, and S_l are partitioned as

    Y_i = \begin{pmatrix} Y_i^{(1)} \\ Y_i^{(2)} \end{pmatrix}, \quad \mu_l = \begin{pmatrix} \mu_l^{(1)} \\ \mu_l^{(2)} \end{pmatrix}, \quad S_l = \begin{pmatrix} S_l^{11} & S_l^{12} \\ S_l^{21} & S_l^{22} \end{pmatrix},

where Y_i^{(1)} = (y_i1, …, y_ir)^T collects the r missing values, Y_i^{(2)} = (y_{i,r+1}, …, y_ik)^T collects the observed values, and \mu_l and S_l are partitioned conformably.

Suppose further that Y_i is from the l-th k-dimensional Gaussian distribution and that Y_i^{(2)} is known. The conditional expectation of Y_i^{(1)}, given that Y_i belongs to the l-th cluster, is derived as follows:

    E_l(Y_i^{(1)} | Y_i^{(2)}) = \mu_l^{(1)} + S_l^{12} (S_l^{22})^{-1} (Y_i^{(2)} - \mu_l^{(2)}).    (3)

We provide initial values for \mu_l^{(1)}, S_l^{11}, and S_l^{12}. Then, by the criterion of minimum mean-square deviation, the conditional expectation E_l(Y_i^{(1)} | Y_i^{(2)}) in Eq (3) is the best predictor of Y_i^{(1)} [26]. Under the mixture distribution of the individual vector Y_i, the conditional expectation of Y_i^{(1)} given Y_i^{(2)} is

    E(Y_i^{(1)} | Y_i^{(2)}) = \sum_{l=1}^{c} \tau_{li} E_l(Y_i^{(1)} | Y_i^{(2)}), \quad \tau_{li} = \frac{p_l \phi_l(Y_i^{(2)}; \mu_l^{(2)}, S_l^{22})}{\sum_{m=1}^{c} p_m \phi_m(Y_i^{(2)}; \mu_m^{(2)}, S_m^{22})},    (4)

where E(Y_i^{(1)} | Y_i^{(2)}) denotes the estimator of the missing values Y_i^{(1)} and \tau_{li} is the posterior probability of cluster l given the observed sub-vector. Subsequently, a “pseudo-complete” dataset can be constructed.
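A minimal sketch of this imputation step (Eqs (3) and (4)), again in illustrative Python with numpy/scipy and with function names of our own choosing:

    import numpy as np
    from scipy.stats import multivariate_normal

    def impute_individual(y, miss, p, mu, S):
        # y: (k,) observation with NaN at the positions flagged by the
        # boolean mask `miss`; p, mu, S as in mixture_log_likelihood.
        obs = ~miss
        c = len(p)
        cond_means = np.empty((c, miss.sum()))
        weights = np.empty(c)
        for l in range(c):
            S12 = S[l][np.ix_(miss, obs)]
            S22 = S[l][np.ix_(obs, obs)]
            resid = y[obs] - mu[l][obs]
            # Eq (3): cluster-wise conditional mean of the missing block
            cond_means[l] = mu[l][miss] + S12 @ np.linalg.solve(S22, resid)
            # weight tau_li: cluster density evaluated on the observed block
            weights[l] = p[l] * multivariate_normal.pdf(
                y[obs], mean=mu[l][obs], cov=S22)
        weights /= weights.sum()
        # Eq (4): posterior-weighted average of the conditional means
        y_filled = y.copy()
        y_filled[miss] = weights @ cond_means
        return y_filled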

Iterative clustering algorithm

The proposed model-based clustering algorithm for datasets with missing values assigns each individual to one of the c clusters with a certain probability. We define this probability as p_li, the probability that the i-th individual belongs to the l-th cluster. An individual is assigned to the l-th cluster if p_li exceeds a pre-determined threshold. The probabilities can be calculated using the expectation-maximization (EM) algorithm [27] to maximize the likelihood in Eq (2). The EM algorithm begins with initial parameter values that are set in advance; each parameter is then updated iteratively until convergence to a (local) maximizer of the likelihood is reached. An EM iteration includes the following steps:

  1. Initialize the prior probabilities of the cluster assignment and the cluster parameters:

         p_l^{(0)}, \ \mu_l^{(0)}, \ S_l^{(0)}, \quad l = 1, 2, …, c.    (5)

  2. Calculate the missing values using Eqs (3) and (4) to generate the “pseudo-complete” dataset \tilde{Y}_i for each individual with missing values.
  3. Update the posterior probabilities of the cluster assignment:

         p_{li} = \frac{p_l \phi_l(\tilde{Y}_i; \mu_l, S_l)}{\sum_{m=1}^{c} p_m \phi_m(\tilde{Y}_i; \mu_m, S_m)}.    (6)

  4. Update the cluster proportions, mean vectors, and variance-covariance matrices:

         p_l = \frac{1}{n} \sum_{i=1}^{n} p_{li},    (7)

         \mu_l = \frac{\sum_{i=1}^{n} p_{li} \tilde{Y}_i}{\sum_{i=1}^{n} p_{li}},    (8)

         S_l = \frac{\sum_{i=1}^{n} p_{li} (\tilde{Y}_i - \mu_l)(\tilde{Y}_i - \mu_l)^T}{\sum_{i=1}^{n} p_{li}}.    (9)

  5. Repeat steps (2)–(4) until convergence is reached (a compact sketch of this loop is given below).
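The following illustrative Python sketch assembles steps (1)–(5), reusing the impute_individual helper sketched above; the initialization scheme (equal proportions, random rows as means, identity covariances) and the log-likelihood stopping rule are our assumptions, not prescriptions from the paper.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_cluster_with_imputation(Y, c, max_iter=200, tol=1e-6, seed=0):
        # Y: (n, k) with np.nan marking missing entries.
        rng = np.random.default_rng(seed)
        n, k = Y.shape
        miss = np.isnan(Y)
        # Step 1: assumed initialization (Eq 5)
        p = np.full(c, 1.0 / c)
        start = np.where(miss, np.nanmean(Y, axis=0), Y)
        mu = start[rng.choice(n, c, replace=False)].copy()
        S = np.stack([np.eye(k) for _ in range(c)])
        ll_old = -np.inf
        for _ in range(max_iter):
            # Step 2: impute -> "pseudo-complete" dataset (Eqs 3-4)
            Yc = np.array([impute_individual(Y[i], miss[i], p, mu, S)
                           if miss[i].any() else Y[i] for i in range(n)])
            # Step 3: posterior probabilities p_li (Eq 6)
            dens = np.column_stack([
                p[l] * multivariate_normal.pdf(Yc, mean=mu[l], cov=S[l])
                for l in range(c)])
            post = dens / dens.sum(axis=1, keepdims=True)
            # Step 4: M-step updates (Eqs 7-9)
            nl = post.sum(axis=0)
            p = nl / n
            mu = (post.T @ Yc) / nl[:, None]
            for l in range(c):
                R = Yc - mu[l]
                S[l] = (post[:, l, None] * R).T @ R / nl[l]
            # Step 5: stop when the log-likelihood stabilizes
            ll = np.sum(np.log(dens.sum(axis=1)))
            if abs(ll - ll_old) < tol:
                break
            ll_old = ll
        return p, mu, S, post, ll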

The number of clusters c can also be treated as an unknown parameter and inferred with the BIC or AIC. The BIC is derived as follows:

    BIC = 2 \ln L(\hat{\Theta}) - q \ln n,    (10)

where q = c(k+1)(k+2)/2 − 1 is the number of independent parameters to be estimated in the model, L(\hat{\Theta}) is the likelihood evaluated at \hat{\Theta}, the vector of maximum-likelihood estimates of the parameters, and n is the size of the dataset. The number of clusters c is determined by the maximum BIC value.
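Under the same illustrative assumptions, model selection by Eq (10) amounts to fitting the mixture for each candidate c and keeping the largest BIC:

    import numpy as np

    def select_c_by_bic(Y, c_candidates):
        # Assumes em_cluster_with_imputation, sketched above, returns ln L.
        n, k = Y.shape
        best_c, best_bic = None, -np.inf
        for c in c_candidates:
            *_, ll = em_cluster_with_imputation(Y, c)
            q = c * (k + 1) * (k + 2) / 2 - 1   # free parameters, Eq (10)
            bic = 2 * ll - q * np.log(n)
            if bic > best_bic:
                best_c, best_bic = c, bic
        return best_c, best_bic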

Results

Simulation analysis

A simulation was designed to evaluate the feasibility and accuracy of the proposed missing-value imputation algorithm. In the simulation, the individuals belonging to each cluster were known exactly and without subjective error; by contrast, when analyzing real data, the misjudgment rate (MR) may also reflect confounding experimental errors and subjective errors. Without loss of generality, we simulated datasets with a two-dimensional Gaussian distribution and some missing values. All simulations were performed using SAS software (version 9.3; SAS/IML).

Design of the simulation.

Suppose that 500 individuals are drawn from each of two (two-dimensional) Gaussian populations, denoted by MVN(μ1, S1) and MVN(μ2, S2), and that 10% of the individuals then have at least one variable randomly removed. The entire dataset thus contains 1000 individuals; each individual has two variables, and 100 individuals have at least one variable missing. The cluster mean vectors μ1 and μ2 were simulated at three different levels: A1: μ1 = (0, 0), μ2 = (2.5, 2.5); A2: μ1 = (0, 0), μ2 = (2.0, 2.0); and A3: μ1 = (0, 0), μ2 = (1.5, 1.5). The variance-covariance matrices are denoted by S1 and S2 (without loss of generality, supposing that S1 = S2 = S), and S was set at four levels of increasing dispersion, denoted B1, B2, B3, and B4. The total number of treatment combinations (datasets) is therefore 3 × 4 = 12, that is, A1B1, A1B2, A1B3, A1B4, A2B1, A2B2, A2B3, A2B4, A3B1, A3B2, A3B3, and A3B4. Twenty replicated simulations were conducted for each of the twelve scenarios.
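A sketch of one such scenario in illustrative Python (the paper’s covariance levels B1–B4 are not specified above, so an identity matrix stands in as an assumed placeholder):

    import numpy as np

    rng = np.random.default_rng(1)
    mu1, mu2 = np.array([0.0, 0.0]), np.array([2.5, 2.5])   # level A1
    S = np.eye(2)                     # placeholder covariance (assumed)
    Y = np.vstack([rng.multivariate_normal(mu1, S, 500),
                   rng.multivariate_normal(mu2, S, 500)])
    truth = np.repeat([0, 1], 500)
    # remove one of the two variables in a random 10% of the 1000 rows
    rows = rng.choice(1000, size=100, replace=False)
    Y[rows, rng.integers(0, 2, size=100)] = np.nan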

Our missing-value imputation clustering algorithm (denoted M-3, generating a “pseudo-complete” dataset) was compared with three other clustering algorithms: one using the complete dataset (denoted M-1, i.e., the simulated dataset above without any missing values); one using a dataset from which all individuals with missing variables were deleted (denoted M-2, a cropped dataset); and one using missing-value imputation by mean replacement, in which the missing values are replaced by the mean value of the whole dataset (denoted M-4, another “pseudo-complete” dataset). We used these four algorithms to cluster the simulated datasets. All four are based on the multivariate Gaussian mixture model and use maximum-likelihood estimation with the EM algorithm, so their clustering results can be compared fairly. We evaluated M-1, M-2, M-3, and M-4 using the following metrics: (1) the average convergence rate; (2) the accuracy and precision of their respective parameter estimates; and (3) the total misjudgment rate (MR), where MR is the ratio of all misjudged individuals to the total number of individuals across the 20 replicates, i.e., MR = 1 − accuracy. A chi-square test and multiple comparisons with Scheffé confidence intervals were used to test the differences in MR among the four algorithms.
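Because cluster labels are arbitrary, computing MR = 1 − accuracy requires matching fitted clusters to true ones; a small illustrative helper (ours, not the paper’s):

    import numpy as np
    from itertools import permutations

    def misjudgment_rate(truth, assigned):
        # Best accuracy over all label permutations (fine for small c;
        # for many clusters use an assignment algorithm instead).
        cs = np.unique(truth)
        best_acc = max(np.mean(np.take(perm, assigned) == truth)
                       for perm in permutations(cs))
        return 1.0 - best_acc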

Simulation results.

The average parameter estimates include the mean vectors (± se), the variance-covariance matrices (± se), the value of the maximum-likelihood function, and the total MR for the four clustering algorithms. The results are presented in Tables 1–3. Our proposed missing-value imputation clustering algorithm (M-3) outperformed the other two algorithms (M-2 and M-4) in terms of the convergence rate for most of the simulated datasets. Its superiority was most obvious when the mean vectors of the two clusters were near each other and the variance-covariance of each cluster was more dispersed, as with datasets A2B4, A3B3, and A3B4. Moreover, the proposed algorithm produced the most accurate and precise parameter estimates, closely approximating the true values.

Table 1. Average parameter estimates under 4 different simulation datasets A1B1-A1B4 in 20 replicates.

https://doi.org/10.1371/journal.pone.0161112.t001

Table 2. Average parameter estimates under 4 different simulation datasets A2B1-A2B4 in 20 replicates.

https://doi.org/10.1371/journal.pone.0161112.t002

Table 3. Average parameter estimates under 4 different simulation datasets A3B1-A3B4 in 20 replicates.

https://doi.org/10.1371/journal.pone.0161112.t003

Thus, our proposed missing-value imputation clustering algorithm correctly clusters individuals with missing variables. The total MR shows that M-3 performed significantly better than M-4 in all simulation trials (p < 0.05) and significantly better than M-2 in trials A2B4, A3B3, and A3B4 (p < 0.05). Moreover, the clustering accuracy of our proposed algorithm was comparable to that obtained with the complete dataset (M-1) (p > 0.05). In general, the more similar the mean vectors of the two clusters, or the more dispersed their variance-covariance, the higher the MR of all clustering methods and the better the performance of our proposed algorithm relative to M-2 and M-4. For example, in scenario A3B4, convergence with M-2 and M-4 was slower than in the other scenarios, and these algorithms had total MRs of 48.32% and 56.00%, respectively, whereas the total MR with our proposed algorithm M-3 was 21.30%. In fact, this result was slightly better than that of M-1, which had an MR of 28.00%.

Real data analysis

Iris dataset.

Fisher’s Iris dataset (Fisher, 1936) [12] is perhaps the best-known database for comparing clustering algorithms. The dataset contains three clusters of 50 individuals each, where each cluster is a species of the genus Iris: namely, Iris setosa, Iris versicolor, and Iris virginica; the three species thus define the true clusters. The dataset contains four variables: sepal length, sepal width, petal length, and petal width. Fisher’s Iris dataset is available on the Internet (http://archive.ics.uci.edu/ml/datasets/Iris). We randomly removed one of the four variables in 10% of the individual observations of the Iris data; the result constituted a dataset with missing values. Table 4 lists the clustering results from five clustering algorithms: the four clustering algorithms above [the clustering algorithm with a complete dataset (M-1), the clustering algorithm in which individuals with missing variables are deleted (M-2), our proposed missing-value imputation algorithm (M-3), and the clustering algorithm with missing-value imputation by mean replacement (M-4)] and the FCM clustering method applied to the dataset with data missing from 10% of the individuals. In the M-4 and FCM methods, the missing values are replaced by the mean of the whole dataset. The superiority of our proposed method is obvious insofar as M-3 produced the lowest MR of the five algorithms. Because the dataset approximately follows a Gaussian mixture distribution, our algorithm makes full use of the known data to estimate the missing values; accordingly, its accuracy was higher than that of the FCM method and the M-2 and M-4 algorithms, and it closely approximated the accuracy obtained with the complete-dataset clustering algorithm (M-1). Moreover, the model-based algorithms (M-1, M-2, M-3, and M-4) reached their highest BIC values at three clusters, validating the effectiveness of using the BIC to determine the number of clusters.
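An illustrative end-to-end run of this experiment, assuming scikit-learn for the data and the routines sketched in Methods (em_cluster_with_imputation and misjudgment_rate are our illustrative names):

    import numpy as np
    from sklearn.datasets import load_iris

    rng = np.random.default_rng(0)
    Y = load_iris().data.astype(float)                  # 150 x 4
    truth = load_iris().target
    rows = rng.choice(len(Y), size=15, replace=False)   # 10% of 150
    Y[rows, rng.integers(0, 4, size=15)] = np.nan       # one variable each
    p, mu, S, post, ll = em_cluster_with_imputation(Y, c=3)   # M-3
    print("MR:", misjudgment_rate(truth, post.argmax(axis=1)))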

Table 4. The comparison among four algorithms and FCM method for Fisher’s Iris dataset.

https://doi.org/10.1371/journal.pone.0161112.t004

Yeast Cell-cycle dataset.

The Yeast Cell-cycle dataset [28] records the fluctuation in the expression levels of the whole genome (6220 genes) over 17 time points taken at 10-minute intervals, covering nearly two cell cycles. The entire raw dataset is available at http://genome-www.stanford.edu/cellcycle/. Among the effectively expressed genes, 416 were found to show consistent periodic changes in expression level. Yeung et al. selected 384 genes over 15 time points whose expression peaked at one of five (c = 5) cell-cycle phases, determined on the basis of bud size, yielding five non-overlapping clusters [29]. These 384 gene-expression profiles, a subset with a moderate sample size, are available at http://www.cs.washington.edu/homes/kayee/model. All 384 genes were assigned to one of the five clusters by the original researchers [28,29]; we can therefore use this dataset to test the performance of the clustering algorithms introduced here and compare them with existing methods. We randomly removed one of the 15 variable values in 10% of the genes to create a dataset with missing values. Table 5 lists the clustering results from five clustering algorithms: the four model-based clustering algorithms above (M-1, M-2, M-3, and M-4) and the FCM clustering method. Further validating the results from Fisher’s Iris dataset, we found that our missing-value imputation algorithm had the lowest MR. In addition, because our missing-value imputation based on the Gaussian mixture distribution is compatible with the Yeast Cell-cycle dataset, the accuracy of our method was significantly higher than that of the FCM method, and it closely approximated the accuracy obtained with the complete-dataset clustering algorithm (M-1).

Table 5. The comparison among four algorithms and FCM method for Yeast Cell-cycle dataset.

https://doi.org/10.1371/journal.pone.0161112.t005

CIFAR-10 images dataset.

The CIFAR-10 dataset [30] contains 60,000 32 × 32 color images (3072 variables) divided into ten classes representing different kinds of pictures. The classes are designed to be completely mutually exclusive; for example, neither the automobile class nor the truck class contains images of pickup trucks. The sample size is large: each class contains 5000 training images and 1000 test images. The dataset is ordinarily used for supervised classification, in which classifiers trained on the training images are evaluated on the test images, and it is available at http://www.cs.toronto.edu/~kriz/cifar.html. We used the training images to test the performance of our unsupervised clustering algorithm and compared it with existing clustering algorithms. We randomly removed 10 of the 3072 variable values in 10% of the color images to create a dataset with missing values.

Table 6 lists the clustering results from five clustering algorithms: the four model-based clustering algorithms above (M-1, M-2, M-3, and M-4) and the FCM clustering method. We found that our missing-value imputation algorithm had a lower total MR than algorithms M-2 and M-4. The accuracy of our method closely approximated that of the clustering algorithm with complete data (M-1) but was lower than that of the FCM method. (It should be noted that most clusters of the CIFAR-10 data deviate from the normal distribution.) Thus, our model-based algorithm does not cluster these data more accurately than the FCM method. Moreover, computation was much slower with algorithms M-1, M-2, M-3, and M-4.

Table 6. The comparison among four algorithms and FCM method for CIFAR-10 dataset.

https://doi.org/10.1371/journal.pone.0161112.t006

Discussion

Missing values are unavoidable in experimental data and in data from field investigations [31]. Many imputation methods [16–20] have been proposed, such as the mean-substitution method, the regression imputation method, the EM estimation method, and the FCM method. Clustering is a widely used statistical method, especially for gene-expression profiles, but most classical clustering methods, such as K-means clustering and hierarchical clustering, cannot deal with missing values. We therefore developed a model-based clustering algorithm for imputing missing values. The proposed method utilizes the information in the existing data and is based on the Gaussian mixture model. With our imputation algorithm, missing values are estimated by calculating their conditional expectation, from which a “pseudo-complete” dataset can be generated and used for clustering.

The performance of our proposed algorithm was tested and compared with other algorithms using both real and simulated data. We found that our proposed method outperformed other algorithms. In particular, our proposed method produced clustering results comparable to an algorithm that uses a complete dataset, and it outperformed an algorithm in which samples with missing data are removed and another algorithm in which missing values are imputed by mean replacement. The results are consistent with Hunt and Jorgensen’s reports [32]. Even with datasets in which the boundaries of the clusters overlap, such as datasets A3B3, A2B4, and A3B4, our algorithm performed well.

Furthermore, we found that the proposed algorithm outperformed the FCM clustering method on Fisher’s Iris dataset and on the Yeast Cell-cycle dataset. The FCM clustering method, a heuristic approach, has been verified as a successful missing-value clustering method [33]. FCM clusters data based on the characteristics of the samples. The criteria for FCM clustering are inconsistent, however, because the characteristics of dynamic algorithms lead to uncertainties about the samples, degrading FCM’s ability to cluster data precisely [34–36]. Our study shows that the proposed method is more likely to achieve a low MR when the dataset truly follows a multivariate Gaussian mixture distribution; that is, it is more accurate than other methods on both simulated data and real data, e.g., Fisher’s Iris dataset and the Yeast Cell-cycle dataset. One possible reason is that our method estimates missing values with a conditional expectation based on the Gaussian mixture distribution and generates a “pseudo-complete” dataset. In addition, our method combines dynamic clustering with discriminant analysis and uses the EM algorithm to maximize the likelihood function. It can thus better exploit the information in the known data to discover the intrinsic and implicit properties of a dataset. Model-based methods are usually better than heuristic approaches, even though the former tend to be more time-consuming. Model-based methods depend on a probability model, which is usually proposed on the basis of experience and feasibility. The convergence and accuracy of the EM algorithm have already been established [25]. The Gaussian mixture model is both convenient and robust. Moreover, a model-based method provides substantial information, e.g., the posterior probabilities of the individuals and the parameters of each cluster, and this information can be exploited to impute missing values more accurately through iterative updates until convergence to a (local) maximizer of the likelihood is reached.

Our method uses the Bayesian posterior probability to cluster individuals; each individual is assigned to the cluster with the highest posterior probability. It is more reasonable to assign individuals to a cluster for which the posterior probability is high, such as 0.8 or 0.9, and this improves the clustering accuracy. Traditional dynamic clustering methods assign individuals directly to a single category, which is difficult when the boundaries of the clusters overlap. Another advantage of a model-based approach is that statistical criteria, such as the AIC and BIC [14], can be used to discover the number of clusters that best suits the data. We confirmed the effectiveness of such criteria with results indicating three clusters for Fisher’s Iris dataset (a dataset with exactly three species of Iris) and five clusters for the Yeast Cell-cycle Gene-expression dataset. Alternatively, the reversible-jump Markov-chain Monte Carlo (MCMC) method has been developed to infer the number of distributions in a mixture [37]; MCMC estimates the optimal number of clusters from a Bayesian perspective, and our future research will compare MCMC with the BIC criterion used in this study. Finally, the raw data collected by experimenters should be normalized beforehand; our method is sensitive to data that have not been normalized, so a log or some other data transformation should be applied to make the data approximately normally distributed. Furthermore, because our proposal relies on the information from known individuals when imputing missing values, a large sample size is desirable for greater accuracy in estimating missing values.

Conclusions

In this paper, we have presented a model-based missing-data imputation algorithm for clustering numerical data with missing values. The algorithm achieved a high convergence rate when tested on simulated and real datasets. In addition, its statistical power is high: it accurately estimates parameters and correctly clusters individuals with missing variables. Moreover, our evaluation shows a low MR even when the mean vectors of two clusters are close to each other or their variance-covariance is dispersed. The algorithm was applied successfully to datasets (both simulated and real) with missing data, and its performance is superior to that of an algorithm in which samples with missing data are removed and another in which missing values are imputed by mean replacement. A comparative evaluation demonstrated that our proposed algorithm outperforms the FCM method when the data in each cluster fit a multivariate Gaussian distribution.

Acknowledgments

The authors wish to thank all the organizations that shared the open datasets, e.g., Fisher’s Iris dataset, the Yeast Cell-cycle dataset, and the CIFAR-10 dataset.

Author Contributions

  1. Conceptualization: JX CWX.
  2. Formal analysis: JX QQX CLW YXG TQH.
  3. Funding acquisition: JX CWX.
  4. Methodology: JX QQX.
  5. Software: JX QQX CLW.
  6. Validation: JX CWX.
  7. Writing - original draft: JX QQX CLW.
  8. Writing - review & editing: JX CWX.

References

  1. Wylie MP, Holtzman J. The non-line of sight problem in mobile location estimation. In: Proc IEEE ICUPC. Cambridge. 1996; 2: 827–31. https://doi.org/10.1109/ICUPC.1996.562692
  2. Zhang YT, Fang KT. Introduction to Multivariate Statistical Analysis. Beijing: Science Press; 1983; 401–57.
  3. Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. New Jersey: Prentice-Hall, Inc; 1992; 532–60.
  4. Eisen M, Spellman P, Brown P, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences. 1998; 95(25): 14863–8.
  5. Quackenbush J. Computational analysis of microarray data. Nature Reviews Genetics. 2001; 2(6): 418–27. pmid:11389458
  6. Speed T. Statistical Analysis of Gene Expression Microarray Data. London: Chapman & Hall/CRC Press; 2003; 45–7.
  7. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium. 1967; 1: 281–97.
  8. Hartigan JA, Wong MA. A K-means clustering algorithm. Applied Statistics. 1979; 28(1): 100–8.
  9. Herrero J, Valencia A, Dopazo J. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics. 2001; 17(2): 126–36. pmid:11238068
  10. Selim SZ, Alsultan K. A simulated annealing algorithm for the clustering problem. Pattern Recognition. 1991; 24(10): 1003–8.
  11. Dasgupta A, Raftery AE. Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association. 1998; 93: 294–302.
  12. McLachlan GJ, Basford KE. Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker; 1988.
  13. Titterington DM, Smith AFM, Makov UE. Statistical Analysis of Finite Mixture Distributions. New York: John Wiley & Sons, Inc; 1985.
  14. Schwarz G. Estimating the dimension of a model. Annals of Statistics. 1978; 6(2): 461–4.
  15. Siminski K. Clustering with missing values. Fundamenta Informaticae. 2013; 123(3): 331–50.
  16. Rubin DB. Inference and missing data. Biometrika. 1976; 63(3): 581–92.
  17. Rubin DB. Multiple imputations in sample surveys: a phenomenological Bayesian approach to nonresponse. Journal of the American Statistical Association. 1978; (1): 20–34.
  18. Carpenter J, Kenward M. Multiple Imputation and its Application. New York: Wiley & Sons; 2012.
  19. Chen J, Shao J. Nearest neighbor imputation for survey data. Journal of Official Statistics. 2000; 16(2): 113–32.
  20. Yang J, Zhao Y, Ding WX. Missing data in survey sampling interpolation method. Applica Stat Manage (Chin). 2008; 27: 821–32.
  21. Hathaway RJ, Bezdek JC. Fuzzy c-means clustering of incomplete data. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics. 2001; 31(5): 735–44. pmid:18244838
  22. Ghahramani Z, Jordan MI. Supervised learning from incomplete data via an EM approach. Advances in Neural Information Processing Systems. 1994; 6: 120–7.
  23. Qu Y, Xu SZ. Supervised cluster analysis for microarray data based on multivariate Gaussian mixture. Bioinformatics. 2004; 20: 1905–13. pmid:15044244
  24. Si YQ, Liu P, Li PH, Brutnell TP. Model-based clustering for RNA-seq data. Bioinformatics. 2014; 30(2): 197–205. pmid:24191069
  25. Hayes M, Pyon YS, Li J. A model-based clustering method for genomic structural variant prediction and genotyping using paired-end sequencing data. PLoS ONE. 2012; 7(12): e52881. pmid:23300804
  26. Wang SC, Li XL, Tang HY. Hybrid data clustering based on dependency structure and Gibbs sampling. Lecture Notes in Computer Science. 2006; 4304: 1145–51.
  27. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B. 1977; 39(1): 1–38.
  28. Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell. 1998; 2(1): 65–73. pmid:9702192
  29. Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL. Model-based clustering and data transformations for gene expression data. Bioinformatics. 2001; 17: 977–87. pmid:11673243
  30. Krizhevsky A. Learning multiple layers of features from tiny images. Master’s thesis, Dept. of Computer Science, University of Toronto; 2009.
  31. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, et al. Missing value estimation methods for DNA microarrays. Bioinformatics. 2001; 17: 520–25. pmid:11395428
  32. Hunt L, Jorgensen M. Mixture model clustering for mixed data with missing information. Computational Statistics & Data Analysis. 2003; 41(3–4): 429–40.
  33. Jiao YB, Liu HB, Zhang P, Wang XQ, Wei HB. Unsupervised performance evaluation strategy for bridge superstructure based on fuzzy clustering and field data. The Scientific World Journal. 2013; 2013: 544–54. pmid:24288483
  34. Sebzalli YM, Wang XZ. Knowledge discovery from process operational data using PCA and fuzzy clustering. Engineering Applications of Artificial Intelligence. 2001; 14(5): 607–16.
  35. Podofillini L, Zio E, Mercurio D, Dang VN. Dynamic safety assessment: scenario identification via a possibilistic clustering approach. Reliability Engineering & System Safety. 2010; 95(5): 534–49.
  36. Li SY. Engineering Fuzzy Mathematics with Application. Harbin: Harbin Institute of Technology Press; 2004.
  37. Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995; 82(4): 711–32.