Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, one object of the present invention is to provide a method for fast clustering image data based on spectral clustering, which not only can ensure the accuracy of the spectral clustering algorithm, but also effectively improves the efficiency and application of image data clustering, and is simple and easy to implement.
Another objective of the present invention is to provide an apparatus for fast clustering image data based on spectral clustering.
In order to achieve the above object, an embodiment of the present invention provides a method for fast clustering image data based on spectral clustering, including the following steps: inputting an image data set X, and extracting m initial central points I from the data set X by using an improved spectral clustering algorithm; calculating the distance between the image data set X and each initial central point of each I to obtain a data set Dist; obtaining a second data set Dist 'according to the data set Dist normalization matrix D, and calculating a normalization Laplacian matrix L of the second data set Dist'; solving the first k eigenvalues and eigenvectors E of the normalized Laplacian matrix L, and selecting k initial central points from the eigenvectors E by using the improved spectral clustering algorithm to perform clustering by using a k-means clustering algorithm to obtain a clustering result, wherein m and k are positive integers.
The image data fast clustering method based on spectral clustering of the embodiment of the invention clusters the image data through a simple and fast spectral clustering algorithm, and solves the problems of large calculation cost and high complexity when the image data are clustered by the traditional spectral clustering algorithm, thereby not only ensuring the accuracy of the spectral clustering algorithm, but also being easy to understand and realize, enabling the algorithm to be used more effectively, and improving the efficiency and application of image data clustering.
In addition, the image data fast clustering method based on spectral clustering according to the above embodiment of the present invention may further have the following additional technical features:
further, in an embodiment of the present invention, the clustering using k-means includes: randomly selecting k objects as initial clustering centers; the distance between each object and each seed cluster center is calculated, and each object is assigned to the cluster center closest to the seed cluster center.
Further, in an embodiment of the present invention, after the image data set X is input, the method further includes: a matrix is derived from the height of the image, the width of the image and the pixels of the image dataset X.
Further, in an embodiment of the present invention, the improved spectral clustering algorithm is: the method comprises the steps of representing an input image data set again to obtain a new data set, carrying out similarity matrix calculation, Laplace matrix calculation and Laplace matrix feature solving on the new data set, using a new initial central point selection mode to replace random point selection on an obtained feature vector matrix, and finally using k-means clustering to obtain a clustering result.
Further, in one embodiment of the present invention, the inputs of the improved spectral clustering algorithm are the image data set X, the m initial centers and the cluster number k, and the output of the improved spectral clustering algorithm is k clusters.
In order to achieve the above object, an embodiment of another aspect of the present invention provides an apparatus for fast clustering image data based on spectral clustering, including: the system comprises an input module, a data processing module and a data processing module, wherein the input module is used for inputting an image data set X and extracting m initial central points I from the data set X by using an improved spectral clustering algorithm; the first calculation module is used for calculating the distance between the image data set X and each initial central point of each I to obtain a data set Dist; the second calculation module is used for obtaining a second data set Dist 'according to the data set Dist normalization matrix D and calculating a normalization Laplace matrix L of the second data set Dist'; and the solving module is used for solving the first k eigenvalues and eigenvectors E of the normalized Laplace matrix L, selecting k initial central points from the eigenvectors E by using the improved spectral clustering algorithm, and clustering by using k-means to obtain a clustering result, wherein m and k are positive integers.
The image data fast clustering device based on spectral clustering of the embodiment of the invention clusters the image data through a simple and fast spectral clustering algorithm, and solves the problems of large calculation cost and high complexity when the traditional spectral clustering algorithm clusters the image data, thereby not only ensuring the accuracy of the spectral clustering algorithm, but also being easy to understand and realize, enabling the algorithm to be used more effectively, and improving the efficiency and application of image data clustering.
In addition, the image data fast clustering device based on spectral clustering according to the above embodiment of the present invention may further have the following additional technical features:
further, in an embodiment of the present invention, the solving module is further configured to randomly select k objects as initial clustering centers; the distance between each object and each seed cluster center is calculated, and each object is assigned to the cluster center closest to the seed cluster center.
Further, in an embodiment of the present invention, the method further includes: and the matrix acquisition module is used for obtaining a matrix according to the height of the image data set X, the width of the image and the pixels of the image after the image data set X is input.
Further, in an embodiment of the present invention, the improved spectral clustering algorithm is: the method comprises the steps of representing an input image data set again to obtain a new data set, carrying out similarity matrix calculation, Laplace matrix calculation and Laplace matrix feature solving on the new data set, using a new initial central point selection mode to replace random point selection on an obtained feature vector matrix, and finally using k-means clustering to obtain a clustering result.
Further, in one embodiment of the present invention, the inputs of the improved spectral clustering algorithm are the image data set X, the m initial centers and the cluster number k, and the output of the improved spectral clustering algorithm is k clusters.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The present invention is based on the recognition and discovery by the inventors of the following problems:
generally, when objects are researched and processed, objects are often required to be classified, for example, samples are classified according to indexes of geophysical prospecting and chemical prospecting in geological exploration; ancient biological studies have classified mined bones according to their shape and size; in dam monitoring, due to the fact that the obtained observation data amount is huge, the observation data amount sometimes needs to be classified and merged, typical representatives of the objects are obtained and then deep analysis is carried out, the objects are classified, and then the rules of the objects are summarized and found to be an important method for people to know the world and reform the world.
In recent years, data taxonomy gradually forms a new branch called cluster analysis, the cluster analysis is suitable for data sets of different types, and the development and application of clustering technology are promoted in various research fields, such as engineering, biology, medicine, language, anthropology, psychology, marketology and the like.
The cluster analysis is also called group analysis or point group analysis, is a quantitative method for researching the classification problem of multi-element objects, is an emerging multivariate statistical method, and is the combination of contemporary taxonomy and multivariate analysis. The basic principle is that according to the self-attribute of the samples, the affinity and the sparsity among the samples are quantitatively determined by a mathematical method according to certain similarity or difference indexes, and the samples are clustered according to the affinity and the sparsity degree.
The clustering analysis is to place the classified objects in a multidimensional space and classify the classified objects according to the degree of affinity and sparseness of the spatial relationship. In popular terms, clustering analysis is to identify objects according to their different attributes, and to group objects with similar attributes into one group, so that objects in the same group have high similarity. Common clustering analysis methods include a system clustering method, a dynamic clustering method, a fuzzy clustering method and the like.
Clustering analysis, which classifies individuals (samples) or objects (variables) into categories by degree of similarity (distance), so that elements in the same category have stronger similarity than elements in other categories. The goal is to maximize the homogeneity of inter-class elements and the heterogeneity of inter-class elements. The main basis is that samples grouped into the same dataset should be similar to each other, while samples belonging to different groups should be sufficiently dissimilar.
Wherein (1) K-means is a clustering analysis algorithm for iterative solution. The method comprises the steps of randomly selecting k objects as initial clustering centers, then calculating the distance between each object and each seed clustering center, and allocating each object to the nearest clustering center. The cluster centers and the objects assigned to them represent a cluster. The cluster center of a cluster is recalculated for each sample assigned based on the objects existing in the cluster. This process will be repeated until some termination condition is met. The termination condition may be that no (or minimum number) objects are reassigned to different clusters, no (or minimum number) cluster centers are changed again, and the sum of squared errors is locally minimal. The principle of k-means clustering is simple, the realization is convenient, the convergence speed is high, the model has strong interpretability, the parameter adjustment only needs to modify the cluster number k, and the time complexity is linearly related to the sample number, so that the algorithm is very efficient and has good flexibility for processing a large data set. However, this algorithm is more difficult to converge for data sets that are not convex; if the types of the data are unbalanced, such as unbalanced data amount or different variance of the categories, the clustering effect is poor. Meanwhile, due to the adoption of an iterative mode, only a local optimal solution can be obtained, and the method is sensitive to noise and abnormal points.
(2) The spectral clustering algorithm is established on the basis of a spectrogram theory, and compared with the traditional clustering algorithm, the spectral clustering algorithm has the advantages that clustering can be performed on a sample space with any shape, and the global optimal solution is converged. Dividing the weighted undirected graph into two or more optimal subgraphs, and enabling the interior of the subgraphs to be similar as much as possible and the distance between the subgraphs to be far as possible.
The algorithm first defines an affinity matrix describing the similarity of pairs of data points based on a given sample data set, and computes the eigenvalues and eigenvectors of the matrix, and then selects the appropriate eigenvector to cluster the different data points. The spectral clustering algorithm is initially used in the fields of computer vision, VLSI (Very Large Scale Integration), and the like, and has recently been used in machine learning, and has rapidly become a research hotspot in the field of machine learning internationally.
The spectral clustering algorithm is established on the basis of spectrogram theory in graph theory, essentially converts the clustering problem into the optimal partitioning problem of the graph, is a point-to-cluster algorithm, and has good application prospect for data clustering. When a relatively complex clustering problem is encountered, and k-means hardly has a good effect, spectral clustering can be used.
The traditional spectral clustering mainly comprises the following four steps: A. an adjacency matrix W is constructed based on given data. B. The normalized laplacian matrix L is calculated from the adjacency matrix. C. The first k eigenvectors of the matrix are solved and a new eigenspace is constructed therefrom. D. The feature vectors in the feature space are clustered by some clustering algorithm, such as k-means. The basic steps are shown in figure 1.
After the data set is constructed into an undirected graph, the undirected graph needs to be Cut into a plurality of k sub-graphs which are not connected with each other, and currently, MinCut (Minimum Cut, Minimum Cut criterion), RCut (Ratio Cut, proportional Cut criterion) and Ncu (Normalized Cut criterion) are adopted. The spectral clustering is an algorithm for clustering the eigenvectors of the data Laplace matrix, is essentially used for converting the clustering problem into the optimal partitioning problem of a graph, and is a point-to-cluster algorithm. The common generation method of the similarity matrix is based on a full-connection mode of K neighbors or Gaussian kernels, the common cutting mode is Ncut, and the last step of traditional clustering is based on K-means.
Although the spectral clustering algorithm has good clustering effect, the spectral clustering algorithm has the problems of high calculation complexity, long calculation time and the like. In the four steps of the spectral clustering algorithm, similarity needs to be calculated among samples, and then the matrix is subjected to characteristic decomposition and finally clustered. Each step involves a large number of computations, so the computational overhead becomes a major performance bottleneck for spectral clustering. In the case of ensuring the spectral clustering performance, how to reduce the computational complexity of the spectral clustering algorithm becomes a challenging problem.
In recent years, researchers have made many efforts and attempts to solve the above problems. They can be broadly divided into two categories: the first category of algorithms focuses on reducing the computational load of matrix decomposition. Given a similarity matrix, this type of algorithm accelerates feature decomposition by faster approximation methods. Although this type of method has worked well, the construction of the similarity matrix still requires quadratic computation time, which limits their use. The second category of algorithms attempts to reduce the computational cost of both graph construction and matrix decomposition by approximating the original graph. In general, the second class of algorithms is more practical than the first class of algorithms because it reduces the cost of all stages of spectral clustering.
However, clustering of the feature vectors and data representations prior to establishing the similarity matrix is often neglected. Meanwhile, most of the spectral clustering algorithms proposed at present are too complex, difficult to understand and difficult to widely apply. Therefore, the embodiment of the invention focuses on the data representation and the clustering process of the feature vectors so as to reduce the calculation overhead of spectral clustering. Specifically, the method comprises the following steps: due to the wide popularization and application of the spectral clustering algorithm, the clustering effect on the image data is obvious, however, the image data is high in data dimension and large in data scale. In order to solve the problems of high calculation cost and high time complexity of a spectral clustering algorithm, the embodiment of the invention focuses on data representation before the construction of a similar matrix, and optimizes a k-means algorithm to reduce the calculation cost of spectral clustering, so that the spectral clustering algorithm is more suitable for image data.
The following describes a method and an apparatus for fast clustering image data based on spectral clustering according to an embodiment of the present invention with reference to the accompanying drawings, and first, a method for fast clustering image data based on spectral clustering according to an embodiment of the present invention will be described with reference to the accompanying drawings.
FIG. 2 is a flowchart of a method for fast clustering image data based on spectral clustering according to an embodiment of the present invention.
As shown in fig. 2, the method for fast clustering image data based on spectral clustering comprises the following steps:
in step S201, an image data set X is input, and m initial center points I are extracted from the data set X using a modified spectral clustering algorithm.
As can be appreciated, the input image dataset X,
using kmc
2M initial center points I are chosen from the data set X,
wherein kmc
2The approximate Kmeans + + of the sub-linear time is shown, and the initial point selection method of Kmeans is shown by the Kmeans + +.
Further, in one embodiment of the present invention, the improved spectral clustering algorithm is: the method comprises the steps of representing an input image data set again to obtain a new data set, carrying out similarity matrix calculation, Laplace matrix calculation and Laplace matrix feature solving on the new data set, using a new initial central point selection mode to replace random point selection on an obtained feature vector matrix, and finally using k-means clustering to obtain a clustering result.
Wherein the inputs of the improved spectral clustering algorithm are X, m initial centers and the clustering number k of the image data set, and the output of the improved spectral clustering algorithm is k clusters.
It can be understood that, in the embodiment of the present invention, a simple and fast spectral clustering algorithm KSC is designed to perform image data clustering, where KSC is an improved spectral clustering algorithm, and the structure of KSC is shown in fig. 3. The algorithm of KSC is shown in table 1.
TABLE 1
Further, in an embodiment of the present invention, after the image data set X is input, the method further includes: the matrix is derived from the height of the image, the width of the image and the pixels of the image dataset X.
In step S202, the distance between the image dataset X and each initial center point of each I is calculated, resulting in a dataset Dist.
It will be appreciated that the distance between each of the points X and I is calculated, resulting in a data set Dist,
in step S103, a second data set Dist 'is obtained according to the data set Dist normalization matrix D, and a normalized laplacian matrix L of the second data set Dist' is calculated.
It will be appreciated that normalizing the matrix D results in Dist',
the normalized laplacian matrix L for Dist' is calculated.
In step S204, the first k eigenvalues and eigenvectors E of the normalized laplacian matrix L are solved, and k initial central points are selected from the eigenvectors E by using an improved spectral clustering algorithm, so as to perform clustering by using k-means, and obtain a clustering result, where m and k are positive integers.
It will be appreciated that solving for the first k eigenvalues of L and the eigenvector E, using kmc2Choosing k from EAnd (4) clustering the initial central points by using k-means.
In one embodiment of the present invention, the clustering using k-means includes: randomly selecting k objects as initial clustering centers; the distance between each object to the respective seed cluster center is calculated and each object is assigned to the closest cluster center.
The image data fast clustering method based on spectral clustering will be further described by the following specific embodiments, specifically as follows:
1. data set
And Rind Data: the data set consists of 1100 two-dimensional points, forming two concentric circles.
MNIST: the data set comprises 60000 pictures of handwritten data from 0 to 9, each picture being represented by 784-dimensional original pixel values.
USPS: the data set comprises 9298 pictures of handwritten data from 1 to 10, each picture consisting of 256-dimensional data.
MNIST and USPS are commonly used for cluster analysis, and they generally evaluate the cluster quality well.
2. Effectiveness of
This section is intended to evaluate the effectiveness of the KSC algorithm. The spectral clustering algorithm is a clustering algorithm developed from graph theory, so that it is expected that the KSC can maintain the clustering characteristics of spectral clustering. In this section, the present example compares KSC to the k-means and SC algorithms. The embodiment of the invention uses the Rind Data to perform visual clustering display.
First, the data sets were clustered using k-means and SC, and the results are shown in FIG. 4 and FIG. 5. As can be seen in FIGS. 4 and 5, the data is divided into two parts based on the k-means of the distance partition. And the SC divides the data into two concentric circles.
Subsequently, the same experiment is performed by using KSC in the embodiment of the present invention, and the result is shown in fig. 6.
3. Performance testing
The performance test mainly examines the accuracy and the running time of the algorithm, and the embodiment of the invention uses NMI (Normalized Mutual Information) as an index for evaluating the accuracy. Meanwhile, two Clustering algorithms of RSC (Robust Spectral Clustering) and KFSC (Kalman Spectral Clustering) are added for comparison test. The embodiment of the invention performs the part of experiments on the image data sets MNIST and USPS, and the KFSC has poor performance, so that the USPS data sets are not drawn so as to better show the results.
The result of the accuracy of the clustering is shown in fig. 7 and fig. 8, and it can be seen from the graphs that the accuracy of KSC is basically similar to SC and RSC, and the clustering effect of k-means is not good. It is consistent with the expectations of embodiments of the invention for algorithms that maintain the accuracy of KSCs as spectral clusters.
In fig. 9, 10 and 11, it can be seen that k-means sufficiently exhibits the characteristic of high calculation speed. Meanwhile, KSC does not show its own advantages when the amount of data is small. However, with the increase of data volume, the operation time of the KSC is obviously lower than that of SC and RSC algorithms, and the KSC improves the operation speed of the algorithms and accords with the original purpose of algorithm design on the premise of ensuring certain accuracy by combining the index of algorithm accuracy.
In summary, the embodiment of the invention designs a simple and fast spectral clustering algorithm KSC for image data clustering. The embodiment of the invention firstly expresses the image data in a matrix form, and redefines the obtained matrix data set, thereby reducing the calculation cost among samples and improving the algorithm efficiency. At the same time, kmc is used2The process of random point selection of k-means is replaced, and the convergence of the algorithm is accelerated. Finally, the algorithm is simple in principle and easy to implement, and can be well applied to image data clustering analysis, so that the image data clustering efficiency and application are improved.
According to the image data fast clustering method based on spectral clustering provided by the embodiment of the invention, the image data clustering is carried out through a simple and fast spectral clustering algorithm, and the problems of large calculation cost and high complexity when the image data clustering is carried out through the traditional spectral clustering algorithm are solved, so that the accuracy of the spectral clustering algorithm can be ensured, the algorithm is easy to understand and realize, the algorithm can be used more effectively, and the image data clustering efficiency and application are improved.
Next, an image data fast clustering apparatus based on spectral clustering according to an embodiment of the present invention will be described with reference to the drawings.
Fig. 12 is a schematic structural diagram of an apparatus for fast clustering image data based on spectral clustering according to an embodiment of the present invention.
As shown in fig. 12, the image data fast clustering apparatus 10 based on spectral clustering includes: an input module 100, a first calculation module 200, a second calculation module 300, and a solving module 400.
The input module 100 is configured to input an image data set X, and extract m initial central points I from the data set X by using an improved spectral clustering algorithm; the first calculating module 200 is configured to calculate a distance between the image data set X and each initial central point of each I, so as to obtain a data set Dist; the second calculating module 300 is configured to obtain a second data set Dist 'according to the data set Dist normalization matrix D, and calculate a normalized laplacian matrix L of the second data set Dist'; the solving module 400 is configured to solve the first k eigenvalues and eigenvectors E of the normalized laplacian matrix L, and select k initial central points from the eigenvectors E by using an improved spectral clustering algorithm, so as to perform clustering by using k-means, thereby obtaining a clustering result, where m and k are positive integers. The device 10 of the embodiment of the invention not only can ensure the accuracy of the spectral clustering algorithm, but also effectively improves the efficiency and application of image data clustering, and is simple and easy to realize.
Further, in an embodiment of the present invention, the solving module 400 is further configured to randomly select k objects as initial clustering centers; the distance between each object to the respective seed cluster center is calculated and each object is assigned to the closest cluster center.
Further, in one embodiment of the present invention, the apparatus 10 of the embodiment of the present invention further comprises: and a matrix acquisition module. The matrix acquisition module is used for obtaining a matrix according to the height of an image of the image data set X, the width of the image and pixels of the image after the image data set X is input.
Further, in one embodiment of the present invention, the improved spectral clustering algorithm is: the method comprises the steps of representing an input image data set again to obtain a new data set, carrying out similarity matrix calculation, Laplace matrix calculation and Laplace matrix feature solving on the new data set, using a new initial central point selection mode to replace random point selection on an obtained feature vector matrix, and finally using k-means clustering to obtain a clustering result.
Further, in one embodiment of the present invention, the inputs to the improved spectral clustering algorithm are X, m initial centers and the number of clusters k for the image data set, and the output of the improved spectral clustering algorithm is k clusters.
It should be noted that the foregoing explanation of the embodiment of the method for fast clustering image data based on spectral clustering is also applicable to the device for fast clustering image data based on spectral clustering of this embodiment, and is not repeated here.
According to the image data fast clustering device based on spectral clustering provided by the embodiment of the invention, image data clustering is carried out through a simple and fast spectral clustering algorithm, and the problems of large calculation cost and high complexity when the traditional spectral clustering algorithm is used for carrying out image data clustering are solved, so that the accuracy of the spectral clustering algorithm can be ensured, the algorithm is easy to understand and realize, the algorithm can be used more effectively, and the image data clustering efficiency and application are improved.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the first feature "on" or "under" the second feature may be directly contacting the first and second features or indirectly contacting the first and second features through an intermediate. Also, a first feature "on," "over," and "above" a second feature may be directly or diagonally above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature being "under," "below," and "beneath" a second feature may be directly under or obliquely under the first feature, or may simply mean that the first feature is at a lesser elevation than the second feature.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.