Article

Hybrid Fuzzy C-Means Clustering Algorithm Oriented to Big Data Realms

by Joaquín Pérez-Ortega 1,*, Sandra Silvia Roblero-Aguilar 1,2, Nelva Nely Almanza-Ortega 2, Juan Frausto Solís 3, Crispín Zavala-Díaz 4, Yasmín Hernández 1 and Vanesa Landero-Nájera 5

1 Tecnológico Nacional de México/Cenidet, Cuernavaca 62490, Mexico
2 Tecnológico Nacional de México/IT Tlalnepantla, Tlalnepantla 54070, Mexico
3 Tecnológico Nacional de México/IT Cd. Madero, Madero 89440, Mexico
4 Faculty of Accounting, Administration and Informatics, Universidad Autónoma del Estado de Morelos, Cuernavaca 62209, Mexico
5 Computer Systems, Universidad Politécnica de Apodaca, Apodaca 66600, Mexico
* Author to whom correspondence should be addressed.
Axioms 2022, 11(8), 377; https://doi.org/10.3390/axioms11080377
Submission received: 24 June 2022 / Revised: 27 July 2022 / Accepted: 27 July 2022 / Published: 31 July 2022
(This article belongs to the Special Issue Computational and Mathematical Methods in Science and Engineering)

Abstract

A hybrid variant of the Fuzzy C-Means and K-Means algorithms is proposed for clustering large datasets such as those found in Big Data environments. The Fuzzy C-Means algorithm is sensitive to the initial values of the membership matrix; therefore, a special configuration of this matrix can accelerate the convergence of the algorithm. In this sense, a new approach is proposed, which we call Hybrid OK-Means Fuzzy C-Means (HOFCM), that optimizes the values of the membership matrix parameter. The approach consists of three steps: (a) generate a set of n solutions of a dataset X by applying a variant of the K-Means algorithm; (b) select the best solution as the basis for generating the optimized membership matrix; (c) solve the dataset X with Fuzzy C-Means. The experimental results with four real datasets and one synthetic dataset show that HOFCM reduces the time by up to 93.94% compared to the average time of the standard Fuzzy C-Means. It is highlighted that the quality of the solution was reduced by only 2.51% in the worst case.
MSC:
62H30; 90C70; 68W40; 91C20

1. Introduction

The clustering of objects according to their attributes has been studied extensively in various areas, such as bioinformatics, pattern recognition, business, and image processing, among others [1,2]. Moreover, it has gained relevance due to the exponential increase in data in different areas of knowledge. In general, a complete dataset could provide a basis for its analysis, reasoning, and decision-making [3,4].
Clustering aims to partition a set of n objects into subsets, called clusters, in such a way that the objects in any cluster have similar attributes and, at the same time, are different from the objects in any other cluster. This description corresponds to hard or conventional clustering, often associated with the K-Means algorithm [5]. However, in many domains, each object needs to belong to two or more groups with different degrees of membership [6,7,8]. The algorithm that allows this is Fuzzy C-Means (FCM). Nevertheless, the complexity of FCM is greater than that of K-Means. In [9], the time complexity of the FCM algorithm is O(ndc²t), where n is the number of objects, d is the number of dimensions, c is the number of clusters, and t is the number of iterations. In this paper, the terms dimensions and attributes are used interchangeably to denote the characteristics of the objects.
Clustering is one of the most studied NP-hard problems in the scientific community [10]. There are no known efficient algorithms that solve large datasets in polynomial time; such problems are considered computationally intractable. Hence, improving the FCM clustering algorithm is a relevant and open research problem.
The standard FCM algorithm is sensitive to the random initial values of the membership matrix, which express the degree of membership of each object in each cluster. Therefore, a particular configuration of the membership matrix can accelerate the convergence of the algorithm. In the specialized literature, there have been publications that try to improve the FCM algorithm by creating a better initial configuration. However, we found no research that optimizes this matrix. In this paper, we aim to reduce the time complexity of the algorithm by optimizing the membership matrix. To achieve this goal, we propose an algorithm called Hybrid OK-Means Fuzzy C-Means (HOFCM).
This paper is organized as follows: Section 2 presents related work. Section 3 describes the algorithms used in this research. Section 4 shows the improvement proposal. Section 5 reports the results obtained. Section 6 presents the discussion. Conclusions and ideas for future research are given in Section 7.

2. Related Work

In [11], it can be inferred that the FCM algorithm comprises the following phases: initialization, calculation of the centroids, classification, and convergence. The initialization phase requires the definition of the following parameters: the number of clusters, the value of the weighting exponent, the random membership matrix, the convergence value (ε), and the maximum number of iterations. In [12,13,14], the initialization phase differs from the one described above: instead of generating a random membership matrix, these papers compute the initial centroids using the K-Means algorithm or one of its variants.
In [12], an algorithm called FCM++ was proposed. To generate the initial centroids, a variant of the K-Means++ algorithm [15] was implemented; the procedure then continues with the standard FCM algorithm. However, FCM++ has some limitations: the first concerns the IRIS and WINE datasets, where the number of iterations is higher than that of the standard FCM for c = 8 and c = 10, respectively; the second concerns the choice of the parameter p added to K-Means++, since its effectiveness depends on the user’s experience. A description of the K-Means++ algorithm is provided in Section 3.2.
In [13], the K-Means algorithm was implemented to generate the final centroids, which, through a transformation process called One-Hot encoding, generates the membership matrix required by the FCM algorithm. This research proposes selecting attributes through entropy weighting and Box–Cox transformation, so out of 2170 original dimensions, only 10 dimensions were used. A dataset called CSI 800, which stores the history of daily stock prices, was used to perform the experimental evaluation. A 15% decrease in execution time was observed in the results obtained.
In [14], an algorithm called NFCM was proposed. This algorithm implements the strategy of the K-Means++ algorithm [15], with the difference that NFCM considers the information provided by the membership matrix to select the following centroids. Two datasets, namely, SPAM and IRIS, were used to evaluate its performance against the FCM++ algorithm. On IRIS, the NFCM algorithm showed an advantage in the number of iterations and CPU time when c > 4. On the SPAM dataset, when c = 8 and c = 9, the FCM++ algorithm was more competitive.
There have been improvements to other phases of the standard FCM algorithm. However, this research focuses on the initialization phase. Readers interested in improvements to the FCM algorithm in the calculation of the centroid and classification phases may refer to [16,17,18]. For improvements that refer to the fuzzy factor, refer to [19,20].
Among the improvements reported to date that have had the greatest impact on the scientific community, the initialization phase of the FCM algorithm is the least studied. It is essential to mention that the improvement proposals in references [12,13,14] are the closest to the proposal in this research. However, it has not been shown analytically whether the K-Means solution lies on the convergence path of FCM.

3. Materials and Methods

This section extensively describes the K-Means, K++, O-K-Means, and FCM algorithms.

3.1. K-Means Algorithm

The K-Means clustering algorithm is one of the most relevant, widely studied, and used algorithms [21]. Its popularity is mainly due to the ease of interpreting its results. This algorithm is an iterative method that consists of partitioning a set of n objects into k ≥ 2 clusters; thus, the objects in one cluster are similar to each other and different from those in other clusters [22]. The formulation of the K-Means problem is described below:
Let X = {x1, …, xn} be the set of n objects to be partitioned by a criterion of similarity, where xi ∈ Rd for i = 1, …, n, and d ≥ 1 is the number of dimensions. Furthermore, let k ≥ 2 be an integer and K = {1, …, k}. For a k-partition P = {G(1), …, G(k)} of X, denote by vj the centroid of cluster G(j), for j ∈ K, and let V = {v1, …, vk} and W = {w11, …, wij}.
In Equation (1), the clustering problem is shown as an optimization problem [23]:
$$P: \quad \text{minimize } z(W, V) = \sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij}\, D(x_i, v_j) \tag{1}$$
$$\text{subject to } \sum_{j=1}^{k} w_{ij} = 1, \quad \text{for } i = 1, \ldots, n,$$
$$w_{ij} = 0 \text{ or } 1, \quad \text{for } i = 1, \ldots, n \text{ and } j = 1, \ldots, k,$$
where wij = 1 ⇔ the object xi belongs to cluster G(j), and D(xi, vj) denotes the Euclidean distance between xi and vj for i =1, …, n and j = 1, …, k. The pseudocode of the standard K-Means algorithm is shown in Algorithm 1 [24].
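As a concrete illustration (not part of the original study), the objective z(W, V) of Equation (1) can be evaluated for a given hard assignment as in the following Python/NumPy sketch; the function name and variables are illustrative only:

```python
import numpy as np

def kmeans_objective(X, V, labels):
    """Evaluate z(W, V) of Eq. (1): the sum of Euclidean distances from each
    object x_i to the centroid of the cluster it is assigned to (w_ij = 1)."""
    return float(np.sum(np.linalg.norm(X - V[labels], axis=1)))
```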
Algorithm 1: Standard K-Means
Initialization:
  X := {x1, …, xn};
  V := {v1, …, vk};
Classification:
  For each xi ∈ X and vk ∈ V {
     Calculate the Euclidean distance from each xi to the k centroids;
     Assign the object xi to the nearest centroid vk; }
Calculate centroids:
  Calculate the centroid vk of each cluster;
Convergence:
  If V := {v1, …, vk} does not change in two consecutive iterations:
     Stop the algorithm;
  Otherwise:
     Go to Classification
End of algorithm
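For readers who prefer executable code, the following Python/NumPy sketch mirrors Algorithm 1. It is an illustrative re-implementation under the usual Lloyd-style interpretation, not the authors’ C implementation, and the random initialization shown is only one possible choice:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal sketch of Algorithm 1 (standard K-Means)."""
    rng = np.random.default_rng(seed)
    # Initialization: k objects chosen at random as initial centroids
    V = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Classification: assign each object to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Calculate centroids: mean of the objects assigned to each cluster
        new_V = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else V[j]
                          for j in range(k)])
        # Convergence: centroids unchanged in two consecutive iterations
        if np.allclose(new_V, V):
            break
        V = new_V
    return V, labels
```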

3.2. K++ Algorithm

The K++ algorithm, proposed in [15], initializes the clustering centroids of the K-Means algorithm by selecting objects from the dataset that are far from each other in a probabilistic manner. This method accelerates convergence and is theoretically guaranteed to be O(log k)-competitive with the optimal solution. The pseudocode of the K++ algorithm is shown in Algorithm 2.
Algorithm 2: K++
Initialization:
    X := {x1, …, xn};
    Assign the value for k;
    V := Ø;
    Select the first centroid v1 uniformly at random and set V := V ∪ {v1};
    For i = 2 to k:
       Select the i-th centroid vi from X with probability D(xi, vj)/∑x∈X D(xi, vj);
       V := V ∪ {vi};
    End of for
    Return V
End of algorithm
In Algorithm 2, the initialization of the centroids with the K++ algorithm is shown. Given dataset X and the value of k, the first centroid is selected uniformly at random from dataset X and assigned to the set of centroids V. The second centroid, and each subsequent one until the total of k is reached, is then selected with the indicated probability distribution. In this paper, the algorithm is referred to as K++ when it is used only to initialize the centroids.
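A hedged Python sketch of the K++ seeding follows. It uses the D² weighting of Arthur and Vassilvitskii [15] (distance to the nearest already-selected centroid, squared); the pseudocode above states the weights without squaring, so this should be read as one plausible reading, not the authors’ exact implementation:

```python
import numpy as np

def kpp_init(X, k, seed=0):
    """K++ seeding sketch: probabilistic selection of k initial centroids."""
    rng = np.random.default_rng(seed)
    V = [X[rng.integers(len(X))]]            # first centroid: uniform at random
    for _ in range(1, k):
        # Distance from every object to its nearest already-selected centroid
        d = np.min(np.linalg.norm(X[:, None, :] - np.array(V)[None, :, :], axis=2), axis=1)
        p = d ** 2 / np.sum(d ** 2)          # D^2 weighting [15]
        V.append(X[rng.choice(len(X), p=p)])
    return np.array(V)
```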

3.3. O-K-Means Algorithm

There have been different improvements to the K-Means algorithm. In [22], O-K-Means was proposed, which accelerates the convergence process by stopping the algorithm when the percentage of objects that change cluster in an iteration falls below a threshold. This threshold expresses a trade-off between computational effort and solution quality. The pseudocode of the O-K-Means algorithm is shown in Algorithm 3.
Algorithm 3: O-K-Means
Initialization:
  X := {x1, …, xn};
  V := {v1, …, vk};
  εok := threshold value for determining O-K-Means convergence;
Classification:
  For each xi ∈ X and vk ∈ V {
     Calculate the Euclidean distance from each xi to the k centroids;
     Assign the object xi to the nearest centroid vk;
     Compute γ };
Calculate centroids:
  Calculate the centroid vk;
Convergence:
  If (γ ≤ εok):
     Stop the algorithm;
  Otherwise:
     Go to Classification
End of algorithm
The indicator γ represents the percentage of objects that change the cluster in an iteration t, which is calculated as γt = 100(ot/n), where ot is the number of objects that change the cluster.
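The following Python sketch illustrates the O-K-Means stopping rule built on the same skeleton as the K-Means sketch above; it is an assumption-laden illustration of [22], with the default εok = 0.72 taken from the experimental setup in Section 5:

```python
import numpy as np

def ok_means(X, V, eps_ok=0.72, max_iter=100):
    """O-K-Means sketch: stop when gamma, the percentage of objects changing
    cluster in an iteration, is at most eps_ok."""
    n = len(X)
    labels = np.full(n, -1)
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        gamma = 100.0 * np.count_nonzero(new_labels != labels) / n   # gamma_t = 100(o_t/n)
        labels = new_labels
        V = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else V[j]
                      for j in range(len(V))])
        if gamma <= eps_ok:
            break
    return V, labels
```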

3.4. FCM Algorithm

The fuzzy set theory proposed by Zadeh [25] in 1965 introduced the idea of membership uncertainty described by a membership function. Cluster analysis theory was proposed by Bellman, Kalaba, and Zadeh [26], and Ruspini [27] coined the concept of fuzzy partitioning, more specifically, the fuzzy clustering algorithm. These works set the tone for research on fuzzy clustering. In 1973, Dunn [28] extended hard clustering to preliminary concepts of fuzzy means. Finally, in 1981, Bezdek [11] generalized Dunn’s approach to obtain an infinite family of Fuzzy C-Means algorithms. The basic idea is that X = {x1, …, xn} is the set of n objects to be partitioned by a similarity criterion, where xi ∈ Rd for i = 1, …, n, and c is the number of clusters, where 2 ≤ c < n.
The fuzzy clustering problem was formulated as an optimization problem [27], minimizing a function, as shown in Equation (2):
$$P: \quad \text{minimize } J_m(U, V) = \sum_{i=1}^{n} \sum_{j=1}^{c} (\mu_{ij})^m\, D(x_i, v_j) \tag{2}$$
where U = [µij] is the membership matrix of each object i to each cluster j; V = {v1, …, vc} is the set of centroids, where vj is the centroid of cluster j; m is the weighting exponent or fuzzy factor that indicates how much the clusters overlap, with m > 1; and D(xi, vj) denotes the Euclidean distance between the object xi and the centroid vj for i = 1, …, n and j = 1, …, c.
Minimizing Jm, an estimated model of U and V is obtained as
$$u_{ij} = \frac{1}{\displaystyle\sum_{k=1}^{c} \left( \frac{D(x_i, v_j)}{D(x_i, v_k)} \right)^{2/(m-1)}}, \quad 1 \le i \le n;\ 1 \le j \le c \tag{3}$$
$$v_j = \frac{\displaystyle\sum_{i=1}^{n} (u_{ij})^m x_i}{\displaystyle\sum_{i=1}^{n} (u_{ij})^m}, \quad 1 \le j \le c \tag{4}$$
where xi and vj are vectors that belong to a space Rd and are represented as
$$x_i = (x_1, x_2, \ldots, x_d), \quad 1 \le i \le n \tag{5}$$
$$v_j = (v_1, v_2, \ldots, v_d), \quad 1 \le j \le c \tag{6}$$
The constraints of fuzzy clustering are formalized in Equations (7)–(9):
$$u_{ij} \in [0, 1], \quad 1 \le j \le c,\ 1 \le i \le n \tag{7}$$
$$\sum_{j=1}^{c} u_{ij} = 1, \quad 1 \le i \le n \tag{8}$$
$$0 < \sum_{i=1}^{n} u_{ij} < n, \quad 1 \le j \le c \tag{9}$$
Equation (7) indicates that the degree of membership of an object i to a cluster j must be between 0 and 1. Equation (8) defines that the sum of the degrees of membership of an object i to the different clusters must be equal to 1. Equation (9) indicates that the sum of all the degrees of membership in a cluster must be greater than 0 and less than n; that is, there must be neither empty clusters nor a single cluster containing all the objects.
The pseudocode of the standard FCM algorithm is shown in Algorithm 4 [11].
Algorithm 4: Standard FCM
Initialization:
  Assign the values for c and m;
  Determine the value of the threshold ε for convergence;
  t := 0, TMAX := 50;
  X := {x1, …, xn};
  U(t) := {µ11, …, µij} is randomly generated;
Calculate centroids:
  Calculate the centroid vj;
Classification:
  Calculate and update the membership matrix U(t+1) := {µij};
Convergence:
  If max [abs(µij(t) − µij(t+1))] < ε or t ≥ TMAX:
     Stop the algorithm;
  Otherwise:
     U(t) := U(t+1) and t := t + 1;
     Go to Calculate centroids
End of algorithm
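For reference, a compact Python sketch of Algorithm 4 is given below; it implements Equations (3) and (4) in vectorized form and is illustrative only (the guard against zero distances is an implementation detail not discussed in the paper):

```python
import numpy as np

def fcm(X, c, m=2.0, eps=0.01, t_max=50, seed=0):
    """Standard FCM sketch: random U, then alternate Eq. (4) and Eq. (3) updates."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                    # rows sum to 1, Eq. (8)
    for _ in range(t_max):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]          # centroids, Eq. (4)
        d = np.fmax(np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2), 1e-12)
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)  # Eq. (3)
        if np.max(np.abs(U_new - U)) < eps:               # convergence test
            U = U_new
            break
        U = U_new
    return U, V
```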

4. Proposal for Improvement

The FCM algorithm is sensitive to the initial values of the membership matrix. Finding a good initial configuration is important since it accelerates the algorithm’s convergence. To reduce the time complexity of the FCM algorithm, this section proposes a new solution approach that we call HOFCM. This approach results from observations of the dynamic behavior of the standard K-Means and FCM algorithms. Figure 1 shows Example 1, in which two alternatives for selecting the initial centroids for the FCM algorithm are observed. The first alternative is to generate the centroids randomly (Panel (a)); the second runs the K-Means algorithm (Panel (c)) and passes its final centroids as initial centroids to FCM (Panel (d)). Panel (e) presents the final centroids of FCM under the two alternatives. The dataset used is URBAN, which contains the longitude and latitude coordinates of traffic accidents in urban areas of Great Britain [29]; its values are n = 360,177 and d = 2. For the example, a value of c = 8 is considered.
From Figure 1, the following observations emerge:
Observation 1: The positions of the final centroids for the standard K-Means and FCM algorithms, when run with randomly generated initial centroids and K-Means-generated centroids, are, in general, approximately the same, as shown in the sample in Panel (e).
Observation 2: The solution time of FCM, in the first alternative, is 60,506.91 ms (Panel (b)), and in the second, it is 26,120.84 ms (Panel (d)). This last value is the result of adding the execution times of the K-Means and FCM of Panels (c) and (d). This results in a time reduction of 56.83%.
Observation 3: The difference in the value of the objective function is negligible compared to the gain in the solution of the algorithm.
Employing the same idea as that in Figure 1, in Figure 2, Example 2 is presented, illustrating the dynamic behavior of the K-Means and FCM algorithms with different initial random centroids.
From Figure 2, the following observation can be deduced:
Observation 4: The final centroids of the K-Means algorithm, which are passed as an input parameter to FCM, do not always favor the reduction of the time complexity of FCM. In Panel (d), it is observed that the execution times, also considering the execution times of the K-Means algorithm, increased by 41% compared to those in Panel (b). However, the quality of the solution was better by 33.5%.
It is important to mention that, in the different experiments carried out, the behavior of the K-Means algorithm was similar to that in Example 1. Most of the time, its final centroids are close to the neighborhood of the final centroids of FCM. However, sometimes this is not the case. To address this problem, HOFCM takes a sample of 10 solutions of the dataset. Additionally, promising heuristics of the K-Means algorithm are used. In Figure 3, the structure of the HOFCM algorithm is described.
Figure 3 shows that, given dataset X, a pre-processing phase generates a set of 10 solutions for that dataset. For this, the K++ algorithm is executed, which returns initial centroids; the O-K-Means algorithm receives these. At the end of the convergence phase, such an algorithm returns optimized centroids. From the set of 10 solutions of dataset X, the value of i, in which the objective function obtained the minimum value, is identified, and its final centroids are selected. These final centroids are transformed into the initial values of the membership matrix through a transformation function s.
The components of the transformation function s are described in more detail in the following subsection.

4.1. Transformation Functions

The transformation function s has three elements: domain, codomain, and transformation rules [30]. The following mathematical notation corresponds to the description of the domain:
X = {x1, …, xn} is the set of n objects to partition, where xi ∈ Rd for i = 1, …, n, and d ≥ 1 is the number of dimensions.
V = {v1, …, vc} is the set of final centroids of a solution obtained with the O-K-Means algorithm, where vj is the centroid of the cluster j.
xi and vj are the vectors described in Equations (5) and (6).
m = value of the weighting exponent, where 1 < m < ∞.
The codomain is represented by µij, where µij is the membership matrix of each object xi to each cluster j.
The transformation rules are defined in Equations (10) and (11):
$$u_{ij} = 1 \quad \text{if } x_i = v_j \tag{10}$$
$$u_{ij} = \frac{1}{\displaystyle\sum_{k=1}^{c} \left( \frac{D(x_i, v_j)}{D(x_i, v_k)} \right)^{2/(m-1)}} \quad \text{if } x_i \ne v_j \tag{11}$$
The transformation function s is defined in Equation (12):
$$s(x_i, v_j, m) = u_{ij} \tag{12}$$
This heuristic is reasonable because the O-K-Means algorithm has shown promising results in the experimental part. Starting from its solution, the FCM algorithm has less clustering work left to do, and, therefore, the number of iterations is reduced.
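A minimal Python sketch of the transformation function s follows, assuming the same NumPy conventions as the earlier sketches; the tolerance used to detect xi = vj is an illustrative choice, not specified in the paper:

```python
import numpy as np

def transform_s(X, V, m=2.0):
    """Build the initial FCM membership matrix from O-K-Means centroids, Eqs. (10)-(12)."""
    d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)   # D(x_i, v_j)
    U = np.zeros(d.shape)
    coincident = d < 1e-12            # x_i coincides with a centroid: Eq. (10)
    U[coincident] = 1.0
    regular = ~coincident.any(axis=1)
    r = d[regular]
    U[regular] = 1.0 / ((r[:, :, None] / r[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)  # Eq. (11)
    return U
```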
In Algorithm 5, the pseudocode that incorporates the HOFCM heuristics described in Figure 3 is shown.
  Algorithm 5: HOFCM
Initialization:
    X := {x1, …, xn};
    V := {v1, …, vc};
    εok := threshold value for determining O-K-Means convergence;
    Assign the value for c;
    i := 1;
    Repeat
      Function K++ (X, c):
        Return V′;
      Function O-K-Means (X, V′, εok, c):
        Return V″;
      i := i + 1;
    While i <= 10;
    Select V″ for the value of i at which the objective function obtained the minimum value;
    Apply the transformation function s;
    Determine the value of the threshold ε to determine the convergence of the algorithm;
    Assign the value for m;
    t := 1;
Calculate centroids:
    Calculate the centroid vj;
Classification:
    Calculate and update the membership matrix U(t+1) := {µij};
Convergence:
    If max [abs(µij(t) − µij(t+1))] < ε:
      Stop the algorithm;
    Otherwise:
      U(t) := U(t+1) and t := t + 1;
      Go to Calculate centroids
End of algorithm
The HOFCM algorithm has four phases. The first is the initialization phase, in which three functions are used: (1) the K++ function, which receives dataset X and the number of clusters as input parameters and returns a set of centroids identified by the variable V′; (2) the O-K-Means function, which receives as input parameters dataset X, the set of centroids V′ generated by K++, the threshold value εok that determines the convergence of the algorithm, and the number of clusters, and, at the end of its convergence phase, returns optimized centroids identified by the variable V″; and (3) the transformation function s, which transforms the final centroids of the O-K-Means algorithm, taken from the solution with the lowest value of the objective function, into the optimized initial values of the membership matrix; these are the input for the FCM algorithm. The second, third, and fourth phases of the HOFCM algorithm, called calculation of centroids, classification, and convergence, respectively, are those of the standard FCM.
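Putting the pieces together, the sketch below strings the earlier illustrative functions (kpp_init, ok_means, transform_s) into the four-phase HOFCM flow of Algorithm 5; the iteration cap on the FCM loop is a safety measure added here, not part of the published pseudocode:

```python
import numpy as np

def hofcm(X, c, m=2.0, eps=0.01, eps_ok=0.72, n_solutions=10, seed=0):
    """HOFCM sketch: 10 K++/O-K-Means solutions, best one transformed into the
    initial membership matrix, then standard FCM phases until convergence."""
    best_V, best_z = None, np.inf
    for i in range(n_solutions):
        V0 = kpp_init(X, c, seed=seed + i)
        V, labels = ok_means(X, V0, eps_ok=eps_ok)
        z = np.sum(np.linalg.norm(X - V[labels], axis=1))   # objective z(W, V), Eq. (1)
        if z < best_z:
            best_V, best_z = V, z
    U = transform_s(X, best_V, m=m)                          # transformation function s
    for _ in range(200):                                     # safety cap (illustrative)
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]             # calculate centroids
        d = np.fmax(np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2), 1e-12)
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        if np.max(np.abs(U_new - U)) < eps:                  # convergence
            return U_new, V
        U = U_new
    return U, V
```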

5. Results

5.1. Experiment Environment

To evaluate the performance of HOFCM, a set of four real datasets obtained from the UCI machine learning repository [29] was selected. Additionally, a synthetic dataset was generated whose attribute values were drawn at random from a uniform distribution between 0 and 1. To compare the results of HOFCM, the following algorithms were selected and implemented: K-Means, O-K-Means, K++, standard FCM, FCM++, NFCM, and HOFCM. All algorithms were coded in C language with the GCC 7.4.0 compiler. The computer equipment used had the following configuration: Intel® Core™ i9-10900 (Santa Clara, CA, USA) at 2.80 GHz, 32 GB RAM, a 1 TB hard drive, and the Windows 10 Pro operating system.
For the design of the computational experiments and the analysis of our algorithms, we used the methodology proposed in [31].
The standard FCM, FCM++, NFCM, and HOFCM algorithms were executed with the following parameter values: m = 2 and ε = 0.01. For the standard FCM, the initial values of the membership matrix were randomly generated; for the FCM++ algorithm, the centroids were generated with the K++ algorithm, and for the NFCM algorithm, with a variant of it. In the case of the HOFCM algorithm, the optimized membership matrix, whose values result from the transformation function s described in Section 4.1, was used.
For the K-Means algorithm, the centroids were randomly generated, and for O-K-Means, they were generated with the K++ algorithm. Additionally, it is worth mentioning that the threshold value that determines the convergence of O-K-Means was εok = 0.72.
For all of the implemented algorithms, the values c = 2, 4, 6, 8, 10, 14, 18, and 26 were used. With this, eight test case configurations were considered for each dataset, one for each value of c.
The algorithms were run 10 times for each of the eight test case configurations of each dataset. The number of runs was chosen on the basis of preliminary tests, in which a sample size of 10 runs was found to provide good results.
Table 1 describes the set of datasets used. The table structure is as follows: column one contains the dataset identifier; column two contains the name; column three contains the data type; columns four and five contain the total numbers of objects and dimensions of the dataset; and column six contains the product of columns four and five. It is important to note that the SPAM, URBAN, and 1m2d datasets in this research are considered large datasets, having an n*d indicator greater than 200,000.

5.2. Description of Test Cases

In general, the design of the experiments focused on showing the quality of the solution and the efficiency of the HOFCM algorithm when solving large datasets such as those used in Big Data. To perform the experimental evaluation of the HOFCM, two experiments were designed to compare the results obtained with respect to the standard FCM, FCM++, FCM-KMeans, and NFCM algorithms.

5.2.1. Description of Experiment I

The purpose of this experiment was to compare the results of the standard FCM and HOFCM algorithms.
Table 2 shows the average percentages of time reduction and the gain of the objective function, resulting from the comparison of the standard FCM and HOFCM algorithms, for all the datasets in Experiment I. The structure of Table 2 is as follows: column 1 contains the consecutive number of test cases; column two includes the value of each cluster; columns 3, 5, 7, 9, and 11 contain the average percentages of time of each dataset; and columns 4, 6, 8, 10, and 12 contain the average percentages of the value of the objective function for each dataset.
In Equations (13) and (14), it is shown how the percentages of time reduction and gain in the objective function, respectively, were obtained:
$$\%\,time = 100 \left(1 - \frac{averageTime_{HOFCM}}{averageTime_{standard\,FCM}}\right) \tag{13}$$
$$\%\,J_m = 100 \left(1 - \frac{J_m Average_{HOFCM}}{J_m Average_{standard\,FCM}}\right) \tag{14}$$
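For instance, applying Equation (13) to the times in Observation 2 of Example 1 gives %time = 100(1 − 26,120.84/60,506.91) ≈ 56.83%, which is the reduction reported there; Equation (14) is read analogously, with negative values indicating a loss of solution quality.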
Figure 4 shows the millisecond time averages of each of the eight test cases of the five datasets in Experiment 1.
The analysis of the results in Table 2 and Figure 4 is described in Section 5.3.1.

5.2.2. Description of Experiment II

The purpose of this experiment was to compare the results of HOFCM and the other improvements of the standard FCM algorithm, which are called FCM++ [12], FCM-KMeans [13], and NFCM [14].
Table 3, Table 4 and Table 5 show the average percentages of time reduction and the gain of the objective function resulting from the HOFCM versus FCM++, HOFCM versus FCM-KMeans, and HOFCM versus NFCM comparisons, respectively, for all the datasets and test cases in Experiment II. The structure of the tables is similar to that of Table 2. Equations (13) and (14) are used to obtain the time reduction and gain percentages in the objective function, respectively.
Figure 5 shows the behavior of the set of datasets for each of the test cases in Experiment II.
The analysis of the results in Table 3, Table 4 and Table 5 and Figure 5 is described in Section 5.3.2.

5.3. Analysis of Experiments

This section analyzes the results obtained after solving the problems described in Experiments I and II.

5.3.1. Analysis of the Results of Experiment I

Based on the experimental results shown in Table 2 and Figure 4, where the results of HOFCM and the standard FCM are shown, the following was observed:
  • In the case of HOFCM, for all datasets, time was reduced in all test cases. In the best case, it was reduced by up to 93.94%. This percentage is highlighted in bold in column seven. Notably, this percentage was achieved with the SPAM dataset, which has high dimensionality.
  • HOFCM improved the quality of the solution in large datasets in 20 of 24 cases. In the best case, the quality of the solution was enhanced by 68.31%. This percentage is highlighted in bold in column eight. In the worst case, a quality loss of 2.19% was identified, as can be seen in column four. Regarding the small datasets, the solution quality was improved in 62.5% of the cases, that is, in 10 of the 16 test cases.
  • In general, based on the results obtained, it is possible to affirm that the HOFCM proposal performs better than the standard FCM algorithm.

5.3.2. Analysis of the Results of Experiment II

Considering the results of Table 3, Table 4 and Table 5 and Figure 5, the following was observed:
  • HOFCM outperformed the FCM++ algorithm in terms of solution time in all test cases with large datasets. In the small datasets, it was slower in only one case. Regarding the quality of the solution in large datasets, in the best case, an average gain of 5.48% was obtained, and in the worst case, there was a loss of 2.33%. Both percentages are in bold in column eight of Table 3. It can be stated that HOFCM was better in terms of solution quality in 75% of all cases.
  • HOFCM outperformed the FCM-KMeans algorithm in solution quality in 82.5% of all test cases. Regarding the solution time, HOFCM was better in the large and real datasets.
  • HOFCM outperformed the NFCM algorithm in solution quality in 85% of all test cases. Regarding the solution time, HOFCM was better in the large datasets.
  • In all three comparisons, HOFCM obtained the greatest time reductions and the greatest gains in solution quality when dealing with large and real datasets.
  • In general, based on the results obtained, it is possible to affirm that the HOFCM proposal performs better than the FCM++, FCM-KMeans, and NFCM algorithms.
As an example, Table 6 shows the sum of the solution times in milliseconds for the eight test cases of each dataset and each algorithm implemented in the two experiments carried out in this research. The structure of Table 6 is as follows: column one shows the name of the dataset, and columns two–six show the names of each algorithm.
As can be observed in Table 6, the standard FCM algorithm is the one that requires the most time to converge over the eight test cases for all the datasets. In general, it can be stated that the HOFCM algorithm is 4.90, 1.99, 3.30, and 3.48 times faster than the standard FCM, FCM-KMeans, FCM++, and NFCM algorithms, respectively.

6. Discussion

In this research, HOFCM was compared not only with the standard FCM algorithm but also with other improved algorithms in the literature, namely FCM++ [12], FCM-KMeans [13], and NFCM [14], which are the closest to HOFCM. It is worth mentioning that, for these three improved algorithms, it has not been shown analytically whether the solution of the K-Means algorithm lies on the convergence path of FCM most of the time. For this reason, in this research, it was proposed to obtain a set of ten solutions for each of the datasets used with the K++ and O-K-Means algorithms, which have shown promising results. The purpose of having ten solutions is to obtain the best set of final centroids by selecting the best value of the objective function, which allows, through a transformation process, the optimized membership matrix that accelerates the algorithm’s convergence to be obtained.
The results obtained in the experimental part show that the HOFCM algorithm outperformed the standard FCM, FCM++, FCM-KMeans, and NFCM algorithms in overall solution time; evidence for this is given in Table 6. Based on the above, it is highlighted that, for the initialization of the standard FCM, it is not enough to implement the K++ algorithm, as in the case of FCM++; a variant of K++, as in the case of NFCM; or the K-Means algorithm, as in the case of FCM-KMeans.
Another important point to note is that, in the results presented in Table 6, the FCM++ and NFCM algorithms show a minimum difference of approximately 5%. This can be corroborated by the results presented in [14]. Additionally, it can be stated that, in general, the FCM-KMeans algorithm obtained better results than those mentioned above.

7. Conclusions

This paper shows that it is feasible to reduce the time complexity of the FCM algorithm by optimizing the initial membership matrix. To validate the proposal, which we call HOFCM, a set of experiments composed of real and synthetic datasets was designed. It is noteworthy that the sizes of some of the datasets were larger than those of the datasets reported in the specialized literature. To compare the results of the algorithms, all datasets were solved using HOFCM and the standard FCM, FCM++, FCM-KMeans, and NFCM algorithms.
Based on the results, it was observed that HOFCM reduced the solution time on all large datasets compared to the standard FCM algorithm. When comparing HOFCM with the other algorithms, it was observed to be 2.0, 3.3, and 3.5 times faster than the FCM-KMeans, FCM++, and NFCM algorithms, respectively. In particular, the HOFCM algorithm showed good performance in solving large datasets. For example, with the SPAM dataset, which has high dimensionality, the time was reduced by 93.94% compared to the standard FCM. Regarding the quality of the solution, it was observed that, with large datasets, a gain of up to 68.31% was obtained in the best case, and in the worst case, the quality was reduced by 2.51%.
To continue this research, we suggest two lines of investigation: the first is to implement HOFCM under the parallel and distributed programming paradigm, and the second is to determine in detail the type of datasets where the FCM algorithm is dominant.
Finally, we consider that the principles used in our approach could be used in other variants that improve the FCM algorithm.

Author Contributions

Conceptualization, J.P.-O., S.S.R.-A. and N.N.A.-O.; methodology, J.P.-O. and S.S.R.-A.; software, S.S.R.-A. and N.N.A.-O.; validation, J.P.-O., S.S.R.-A., C.Z.-D., Y.H. and J.F.S.; formal analysis, J.P.-O., J.F.S. and C.Z.-D.; investigation, J.P.-O., S.S.R.-A. and N.N.A.-O.; resources, S.S.R.-A., N.N.A.-O. and V.L.-N.; data curation, S.S.R.-A., N.N.A.-O. and V.L.-N.; writing—original draft preparation, J.P.-O. and S.S.R.-A.; writing—review and editing, J.P.-O., S.S.R.-A., N.N.A.-O., J.F.S., C.Z.-D., Y.H. and V.L.-N.; visualization, S.S.R.-A.; supervision, J.P.-O.; project administration, S.S.R.-A.; funding acquisition, J.P.-O. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Tecnológico Nacional de México, grant number 13869.22-P, grant number 13541.22-P, and PRODEP.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The real datasets used were obtained from the UCI machine learning repository https://archive.ics.uci.edu/ml/index.php (accessed on 26 January 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Yang, M.S. A survey of fuzzy clustering. Math. Comput. Model. 1993, 18, 1–16.
  2. Nayak, J.; Naik, B.; Behera, H.S. Fuzzy C-Means (FCM) Clustering Algorithm: A Decade Review from 2000 to 2014. In Proceedings of the Comput Intell Data Mining, Odisha, India, 20–21 December 2014.
  3. Shirkhorshidi, A.S.; Aghabozorgi, S.; Wah, T.Y.; Herawan, T. Big Data Clustering: A Review. In Proceedings of the International Conference on Computational Science and Its Applications—ICCSA 2014, Guimaraes, Portugal, 30 June–3 July 2014.
  4. Ajin, V.W.; Kumar, L.D. Big data and clustering algorithms. In Proceedings of the 2016 International Conference on Research Advances in Integrated Navigation Systems (RAINS), Bangalore, India, 6–7 April 2016.
  5. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symp Math Statis and Probability, Berkeley, CA, USA, 21 June–18 July 1965.
  6. Ruspini, E.H.; Bezdek, J.C.; Keller, J.M. Fuzzy Clustering: A Historical Perspective. IEEE Comput. Intell. Mag. 2019, 14, 45–55.
  7. Lee, G.M.; Gao, X. A Hybrid Approach Combining Fuzzy c-Means-Based Genetic Algorithm and Machine Learning for Predicting Job Cycle Times for Semiconductor Manufacturing. Appl. Sci. 2021, 11, 7428.
  8. Lee, S.J.; Song, D.H.; Kim, K.B.; Park, H.J. Efficient Fuzzy Image Stretching for Automatic Ganglion Cyst Extraction Using Fuzzy C-Means Quantization. Appl. Sci. 2021, 11, 12094.
  9. Ghosh, S.; Kumar, S. Comparative Analysis of K-Means and Fuzzy C-Means Algorithms. Int. J. Adv. Comput. Sci. Appl. 2013, 4, 35–39.
  10. Garey, M.R.; Johnson, D.S. Computers and Intractability: A Guide to the Theory of NP-Completeness; W. H. Freeman & Co.: New York, NY, USA, 1979; pp. 109–120.
  11. Bezdek, J.C. Pattern Recognition with Fuzzy Objective Function Algorithms; Plenum Press: New York, NY, USA, 1981; pp. 43–93.
  12. Stetco, A.; Zeng, X.J.; Keane, J. Fuzzy C-means++: Fuzzy C-means with effective seeding initialization. Expert Syst. Appl. 2015, 42, 7541–7548.
  13. Wu, Z.; Chen, G.; Yao, J. The Stock Classification Based on Entropy Weight Method and Improved Fuzzy C-means Algorithm. In Proceedings of the 2019 4th International Conference on Big Data and Computing, Guangzhou, China, 10–12 May 2019.
  14. Liu, Q.; Liu, J.; Li, M.; Zhou, Y. Approximation algorithms for fuzzy C-means problem based on seeding method. Theor. Comput. Sci. 2021, 885, 146–158.
  15. Arthur, D.; Vassilvitskii, S. k-means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007.
  16. Cai, W.; Chen, S.; Zhang, D. Fast and robust fuzzy c-means clustering algorithms incorporating local information for image segmentation. Pattern Recognit. 2007, 40, 825–838.
  17. Al-Ayyoub, M.; Al-andoli, M.; Jararweh, Y.; Smadi, M.; Gupta, B.B. Improving fuzzy C-mean-based community detection in social networks using dynamic parallelism. Comput Elect. Eng. 2018, 74, 533–546.
  18. Hashemzadeh, M.; Oskouei, A.G.; Farajzadeh, N. New fuzzy C-means clustering method based on feature-weight and cluster-weight learning. Appl. Soft. Comput. 2019, 78, 324–345.
  19. Khang, T.D.; Vuong, N.D.; Tran, M.-K.; Fowler, M. Fuzzy C-Means Clustering Algorithm with Multiple Fuzzification Coefficients. Algorithms 2020, 13, 158.
  20. Khang, T.D.; Tran, M.-K.; Fowler, M. A Novel Semi-Supervised Fuzzy C-Means Clustering Algorithm Using Multiple Fuzzification Coefficients. Algorithms 2021, 14, 258.
  21. Naldi, M.C.; Campello, R.J.G.B. Comparison of distributed evolutionary k-means clustering algorithms. Neurocomputing 2015, 163, 78–93.
  22. Pérez, J.; Almanza, N.N.; Romero, D. Balancing effort and benefit of K-means clustering algorithms in Big Data realms. PLoS ONE 2018, 13, e0201874.
  23. Selim, S.Z.; Ismail, M.A. K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality. IEEE Trans. Pattern Anal. Mach. Intell. 1984, PAMI-6, 81–87.
  24. Jancey, R.C. Multidimensional group analysis. Aust. J. Bot. 1966, 14, 127–130.
  25. Zadeh, L.A. Fuzzy sets. Inf. Control 1965, 8, 338–353.
  26. Bellman, R.; Kalaba, R.; Zadeh, L.A. Abstraction and pattern classification. J. Math. Anal. Appl. 1966, 13, 1–7.
  27. Ruspini, E.H. A new approach to clustering. Inf. Control 1969, 15, 22–32.
  28. Dunn, J.C. A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. J. Cybern. 1974, 3, 32–57.
  29. UCI Machine Learning Repository, University of California. Available online: https://archive.ics.uci.edu/ml/index.php (accessed on 26 January 2022).
  30. Rosen, K.H. Discrete Mathematics and Its Applications; McGraw-Hill Education: New York, NY, USA, 2018; pp. 90–98.
  31. McGeoch, C.C. A Guide to Experimental Algorithmics; Cambridge University Press: Cambridge, UK, 2012; pp. 17–114.
Figure 1. Initial and final centroids of the standard K-Means and FCM algorithms with the URBAN dataset in Example 1. Panel (a) shows the randomly generated initial centroids; (b) contains the final centroids with the solution of the standard FCM algorithm, considering the random initial centroids; (c) displays the final centroids of the K-Means algorithm receiving random centroids; (d) displays the final centroids of FCM with initial centroids generated after the convergence of K-Means; and (e) presents a comparison of the solutions of the FCM algorithm, whose initial centroids are random and generated by K-Means.
Figure 2. Initial and final centroids of the standard K-Means and FCM algorithms with the URBAN dataset in Example 2. Panel (a) shows the randomly generated initial centroids; (b) contains the final centroids with the solution of the standard FCM algorithm, considering the random initial centroids; (c) visualizes the final centroids with the solution of the K-Means algorithm that receives random centroids; (d) shows the final centroids of FCM with initial centroids generated after K-Means convergence; and (e) presents a comparison of solutions of the FCM and K-Means algorithms.
Figure 3. Structure of the HOFCM algorithm.
Figure 4. Results of the clustering in Experiment I.
Figure 5. The behavior of datasets in Experiment II.
Table 1. Datasets used in experiments.

| Id | Name | Type | n | d | Size Indicator n*d |
|---|---|---|---|---|---|
| 1 | WDBC | Real | 569 | 30 | 17,070 |
| 2 | ABALONE | Real | 4177 | 7 | 29,239 |
| 3 | SPAM | Real | 4601 | 57 | 262,257 |
| 4 | URBAN | Real | 360,177 | 2 | 720,354 |
| 5 | 1m2d | Synthetic | 1,000,000 | 2 | 2,000,000 |
Table 2. Average percentages of time reduction and value of the objective function in Experiment I (standard FCM versus HOFCM).

| P | c | WDBC % time | WDBC % Jm | ABALONE % time | ABALONE % Jm | SPAM % time | SPAM % Jm | URBAN % time | URBAN % Jm | 1m2d % time | 1m2d % Jm |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 80.88 | 0.00 | 64.14 | 0.0035 | 54.37 | −0.001 | 46.12 | 4.12 | 74.95 | 0.03 |
| 2 | 4 | 87.12 | 0.02 | 56.15 | −0.02 | 75.36 | 6.53 | 76.33 | −0.001 | 87.81 | 0.03 |
| 3 | 6 | 67.83 | −2.19 | 52.59 | 0.00 | 92.58 | 22.75 | 68.18 | 0.07 | 45.26 | 0.36 |
| 4 | 8 | 64.30 | −1.17 | 86.60 | 0.08 | **93.94** | 36.90 | 74.60 | 1.42 | 85.56 | 0.22 |
| 5 | 10 | 57.55 | 3.03 | 75.61 | 0.09 | 86.22 | 50.69 | 70.98 | 1.85 | 67.75 | −0.05 |
| 6 | 14 | 78.99 | 4.50 | 37.00 | −0.11 | 77.20 | 49.30 | 70.38 | 4.26 | 89.65 | −0.27 |
| 7 | 18 | 59.04 | 12.39 | 2.71 | −0.60 | 79.80 | 56.24 | 68.45 | 7.81 | 84.07 | 0.05 |
| 8 | 26 | 64.99 | 27.31 | 36.78 | −2.05 | 84.88 | **68.31** | 44.38 | 10.73 | 88.00 | 0.02 |
Table 3. Average percentages of time reduction and value of the objective function in Experiment II; HOFCM versus FCM++.

| P | c | WDBC % time | WDBC % Jm | ABALONE % time | ABALONE % Jm | SPAM % time | SPAM % Jm | URBAN % time | URBAN % Jm | 1m2d % time | 1m2d % Jm |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 66.87 | 0.00 | 59.38 | 0.00 | 41.67 | 0.00 | 18.19 | 0.00 | 32.08 | 0.03 |
| 2 | 4 | 78.59 | 0.02 | 36.47 | −0.02 | 53.08 | 0.01 | 52.36 | 0.00 | 73.61 | 0.03 |
| 3 | 6 | 69.42 | −0.52 | 54.03 | 0.04 | 54.04 | **5.48** | 54.83 | 1.86 | 37.78 | −0.01 |
| 4 | 8 | 57.86 | −0.20 | 80.64 | 0.05 | 60.42 | 1.97 | 70.42 | 2.20 | 84.36 | 0.02 |
| 5 | 10 | 57.29 | 1.76 | 68.80 | 0.09 | 71.62 | **−2.33** | 63.27 | 1.70 | 62.20 | −0.07 |
| 6 | 14 | 45.97 | 0.43 | 39.00 | 0.04 | 25.56 | 2.77 | 41.40 | 2.10 | 86.60 | −0.19 |
| 7 | 18 | 37.01 | 0.18 | −7.22 | −0.06 | 32.72 | 0.94 | 40.44 | 3.57 | 80.08 | −0.03 |
| 8 | 26 | 27.58 | 2.92 | 38.48 | −0.71 | 42.08 | 2.24 | 13.33 | 0.62 | 82.99 | 0.09 |
Table 4. Average percentages of time reduction and value of the objective function in Experiment II; HOFCM versus FCM-KMeans.

| P | c | WDBC % time | WDBC % Jm | ABALONE % time | ABALONE % Jm | SPAM % time | SPAM % Jm | URBAN % time | URBAN % Jm | 1m2d % time | 1m2d % Jm |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 20.05 | 0.00 | 15.52 | 0.00 | 33.95 | 0.00 | 30.15 | 0.00 | 55.79 | 0.03 |
| 2 | 4 | 23.74 | 0.00 | 9.72 | 0.01 | 58.11 | 6.53 | 28.55 | 0.00 | 47.73 | 0.00 |
| 3 | 6 | 32.29 | −2.51 | −10.98 | 0.00 | 90.88 | 22.75 | 68.57 | 0.97 | −13.94 | 0.42 |
| 4 | 8 | 47.87 | −1.28 | 50.98 | −0.02 | 72.58 | 14.66 | 58.99 | 0.14 | 53.89 | −0.02 |
| 5 | 10 | 66.05 | 3.09 | 60.49 | 0.12 | 41.19 | 33.65 | 49.89 | 3.56 | −63.13 | 0.05 |
| 6 | 14 | 84.13 | 4.52 | 46.83 | 0.69 | 57.67 | 50.13 | 63.45 | 2.17 | 78.87 | 0.06 |
| 7 | 18 | 84.21 | 11.99 | 24.45 | −0.06 | 73.88 | 57.61 | 42.56 | 9.40 | 41.81 | 0.02 |
| 8 | 26 | 80.98 | 29.01 | 41.00 | −0.81 | 83.03 | 66.86 | 5.55 | 3.99 | 45.48 | −0.01 |
Table 5. Average percentages of time reduction and value of the objective function in Experiment II; HOFCM versus NFCM.

| P | c | WDBC % time | WDBC % Jm | ABALONE % time | ABALONE % Jm | SPAM % time | SPAM % Jm | URBAN % time | URBAN % Jm | 1m2d % time | 1m2d % Jm |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 74.65 | 0.01 | 61.84 | 0.00 | 47.58 | 0.00 | 18.43 | 0.00 | 19.79 | 0.03 |
| 2 | 4 | 76.70 | 3.74 | 44.02 | −0.02 | 55.53 | 4.66 | 60.83 | 0.00 | 82.07 | 0.03 |
| 3 | 6 | 61.78 | −1.09 | 51.07 | 0.02 | 75.69 | 4.77 | 54.85 | 2.34 | 30.90 | 0.13 |
| 4 | 8 | 55.55 | −0.30 | 82.67 | 0.76 | 81.25 | 9.07 | 72.68 | 1.81 | 87.45 | 0.01 |
| 5 | 10 | 48.54 | 3.21 | 71.65 | 0.46 | 55.00 | 20.87 | 56.78 | 3.77 | 59.43 | 0.16 |
| 6 | 14 | 35.87 | 1.15 | 53.76 | 0.17 | 19.88 | 12.98 | 58.73 | 2.38 | 86.76 | −0.15 |
| 7 | 18 | 1.49 | 2.63 | −4.36 | −0.40 | 34.42 | 1.35 | 44.02 | 3.29 | 76.42 | 0.12 |
| 8 | 26 | 20.09 | 4.98 | 26.10 | −2.02 | 47.18 | 5.33 | 22.79 | 3.47 | 84.93 | 0.14 |
Table 6. Sum of solution times in milliseconds of Experiments I and II.

| Dataset Name | Standard FCM | FCM-KMeans | FCM++ | NFCM | HOFCM |
|---|---|---|---|---|---|
| WDBC | 64,516.85 | 116,784.57 | 34,037.94 | 28,395.16 | 21,689.84 |
| ABALONE | 115,702.20 | 123,930.39 | 113,680.29 | 105,730.07 | 76,996.63 |
| SPAM | 1,746,301.34 | 1,355,655.16 | 482,874.71 | 513,205.42 | 289,610.79 |
| URBAN | 3,080,009.70 | 1,886,664.03 | 1,885,697.60 | 2,104,360.89 | 1,321,799.74 |
| 1m2d | 11,181,675.94 | 3,091,877.12 | 8,385,299.13 | 8,729,514.14 | 1,592,132.11 |
| The total sum of solution time | 16,188,206.03 | 6,574,911.27 | 10,901,589.67 | 11,481,205.68 | 3,302,229.11 * |
| The number of times by which the HOFCM algorithm is faster | 4.90 | 1.99 | 3.30 | 3.48 | |

* Best result obtained.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
