CN101968852A - Entropy sequencing-based semi-supervision spectral clustering method for determining clustering number - Google Patents
- Publication number
- CN101968852A CN101968852A CN2010102787672A CN201010278767A CN101968852A CN 101968852 A CN101968852 A CN 101968852A CN 2010102787672 A CN2010102787672 A CN 2010102787672A CN 201010278767 A CN201010278767 A CN 201010278767A CN 101968852 A CN101968852 A CN 101968852A
- Authority
- CN
- China
- Prior art keywords
- clusters
- points
- clustering
- point
- column
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2323—Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts
Abstract
The invention discloses a method for determining the number of clusters in semi-supervised spectral clustering based on entropy ranking, which mainly addresses the problem of selecting the eigenvectors of the Laplacian matrix in spectral clustering. The process is as follows: the eigenvectors are "rearranged" using entropy-ranking theory to obtain the columns of highest importance. For a k-class problem, the first k columns of the ranked eigenvector matrix are extracted and projected into a k-dimensional space. The points are clustered according to their distances to the 2k semi-axes of that space; clusters that contain no points, or fewer points than one percent of the input data, are removed, and the number of surviving clusters is recorded as c. The first c columns are then extracted and the operation is repeated until the number of clusters stabilizes; the resulting number is taken as the optimal number of clusters. Finally, the input data points are labeled according to their coordinates to obtain the clustering result. The method is adaptive, achieves high clustering accuracy, and can be used to adaptively determine the number of categories in an image.
Description
Technical Field

The present invention belongs to the technical field of image processing and relates to an image clustering method that can be applied in the field of image clustering to determine the number of clusters adaptively.
Background Art

Image clustering is an important step in image processing. Its purpose is to group the different regions of an image into different classes according to the relationships between image pixels. Spectral clustering is a clustering method that has emerged in recent years; the idea originated in spectral graph partitioning theory, where clustering is viewed as a multi-way partitioning problem on an undirected graph. Spectral clustering is superior to traditional clustering algorithms because it is not limited by the shape of the sample space and converges to a globally optimal solution, so spectral clustering algorithms have been widely applied in the field of image clustering.

The semi-supervised spectral clustering algorithms proposed in recent years improve the clustering result by adding manually labeled class labels on top of spectral clustering. These labels act as prior information; generally, the prior information is incorporated by modifying the affinity matrix. The number of class labels matters: too few are insufficient to reach the desired clustering result, while too many impose an excessive computational and storage burden, so how many labels to add must be weighed in practice.

Two problems attract common attention in spectral clustering: the adaptive determination of the scale parameter and of the number of clusters. Methods for determining the scale parameter have matured considerably in recent years; the present invention mainly addresses the determination of the number of clusters. Determining the number of clusters automatically makes the clustering process more adaptive and reduces manual work. Most existing methods proceed by analyzing the eigenvalues and eigenvectors of the Laplacian matrix constructed from the affinity matrix.

In 2001, A. Ng et al. pointed out that the number of eigenvalues equal to 1 gives the number of classes of the data set; see A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm", Advances in Neural Information Processing Systems (NIPS). This method is easily affected by noise, which leads to errors in the clustering result.

In 2005, the self-tuning spectral clustering algorithm of Zelnik-Manor and Perona rotated part of the eigenvectors to construct an objective function related to the number of clusters, taking the number that minimizes it as the optimal number of clusters; see Zelnik-Manor, L., and Perona, P., "Self-tuning spectral clustering", Advances in Neural Information Processing Systems. This method handles some complex problems well, but the repeated reorganization and rotation of the eigenvectors greatly increases the computational cost, and because a threshold must be set manually, errors arise across different data sets and different practitioners, together with a larger workload.

In 2008, Zhong et al. proposed a new adaptive spectral clustering (ASC) algorithm; see Qingliu, Z., and Zixing, C., "Adaptive spectral clustering algorithm for color image segmentation application", Research of Computers 25(12), (2008). It replaces the local N-nearest-neighbor scale parameter with one based on the global average N-nearest-neighbor distance, and determines the optimal number of classes from the ratio of the average difference between the first k adjacent eigenvector columns to the first column: the k at which this ratio falls below a given threshold is taken as the optimal number of classes. Because the threshold must be set manually, the algorithm's self-tuning ability is greatly reduced.

The ACNA algorithm proposed by Wang et al. in 2005 assigns data points to different classes according to the distances between eigenvector points and the corresponding coordinate axes; the value of k to which several iterations finally converge is taken as the optimal number of clusters. See Chongjun, W., Wujun, L., Lin, D., Juan, T., and Shifu, C., "Image segmentation using spectral clustering", Proceedings of the 17th IEEE International Conference on Tools with Artificial Intelligence, IEEE Computer Society, 677-678 (2005). The algorithm achieves fairly good results, but the results are unsatisfactory for complex data.

Because the above methods for automatically determining the number of clusters all select the eigenvectors corresponding to the k largest eigenvalues, they share the following shortcomings: 1. they are susceptible to noise, causing clustering errors; 2. they perform poorly or fail on large data; 3. a threshold must be set manually.
Summary of the Invention

The purpose of the present invention is to overcome the shortcoming of the above clustering methods, which select only the eigenvectors corresponding to the k largest eigenvalues and thereby lose part of the image information. It proposes an entropy-ranking-based semi-supervised spectral clustering method for determining the number of clusters: entropy-ranking theory is used to "rearrange" the eigenvectors by their importance and the k most important columns are selected, so that the number of clusters is determined accurately without a manually set threshold, the clustering error is reduced, and the clustering effect is improved.

1. A method for determining the number of clusters by entropy-ranking-based semi-supervised spectral clustering, comprising the following steps:

(1) Input the data set X = {x1, x2, ..., xn} ∈ Rd, where xi denotes any point in the data set, i ∈ (1, n), n is the number of data points, and d is the data dimension;

(2) Compute the scale parameter σi of each point in data set X and the affinity matrix A of the data set;

(3) Extract pairwise constraint information from the manually added class labels and use it to modify the affinity matrix, where a must-link constraint requires two sample points to belong to the same class and a cannot-link constraint forbids two sample points from being in the same class;

(4) Construct the Laplacian matrix from the modified affinity matrix: L = D^(-1/2) A D^(-1/2), where D is a diagonal matrix whose diagonal elements are Dii = Σj Aij;

(5) Perform eigendecomposition of the Laplacian matrix and arrange the eigenvectors from large to small according to their eigenvalues;

(6) Reorder the eigenvectors by their importance with the entropy-ranking method:

(6a) Remove each column of the eigenvector matrix in turn, compute the entropy of the remaining columns, and take the resulting value as the entropy of the removed column;

(6b) "Rearrange" the columns in descending order of their entropy values to obtain an ordering of the eigenvectors from high to low importance, denoted VR;

(7) Initialize the number of clusters k = 2, take the first k columns of VR, and normalize;

(8) Adaptively determine the number of clusters:

(8a) Treat the first k columns of VR as n k-dimensional points and project them into a k-dimensional coordinate system;

(8b) Let the positive and negative directions of each coordinate axis each represent one cluster, and divide the input data points into 2k classes according to the distance from each point to each semi-axis of the coordinate system;

(8c) Remove, among the 2k classes, any cluster containing no points or fewer points than one percent of the input data points, and record the number of surviving clusters as c;

(9) Compare k and c. If they differ, set k = c and return to step (8); if they are equal, the resulting k is the optimal number of clusters, denoted km;

(10) Divide the input data points into km classes and label them according to their coordinates to obtain the clustering result.
2. The method for determining the number of clusters according to claim 1, wherein the scale parameter σi of each point in data set X and the affinity matrix A in step (2) are computed by the following formulas:

σi = ‖xi − xd‖

where σi denotes the scale parameter of any data point, and xd is the d-th nearest neighbor of point xi among the remaining points, with d = 7 chosen;

Aij = exp(−‖xi − xj‖² / σiσj), i, j ∈ (1, n)

where Aij denotes any element of the affinity matrix A, σi and σj denote the scale parameters of points xi and xj in the data set, and ‖xi − xj‖ denotes the Euclidean distance between xi and xj.
3. The method for determining the number of clusters according to claim 1, wherein the entropy E of the remaining columns in step (6a) is computed by the following formula:

E = −Σi p(Vi) log p(Vi)

where V denotes the eigenvector matrix of the Laplacian matrix, Vi denotes any column of it, and p(Vi) denotes the probability of column Vi. Since this probability is not easy to obtain in practice, it is replaced in the actual computation by the affinity Sij between column Vi and any other column, i.e.

Sij = exp(−‖Vi − Vj‖² / σiσj)

where σi and σj denote the scale parameters of eigenvector columns Vi and Vj, and ‖Vi − Vj‖ denotes the Euclidean distance between columns Vi and Vj.
Compared with the prior art, the present invention has the following advantages:

1. Because the entropy-ranking method is used to reorder the eigenvectors by their importance, eigenvectors containing more of the source image's information can be selected, improving the clustering effect;

2. Because the data set is classified according to the distance from each point in the first k ranked eigenvector columns to each semi-axis of the k-dimensional coordinate system, the number of clusters can be determined adaptively without manually setting a threshold.
Brief Description of the Drawings

Figure 1 is the implementation flowchart of the present invention;

Figure 2 shows the cluster-number determination and clustering results of the present invention on four artificial data sets;

Figure 3 shows the clustering results of the present invention and existing methods on a Berkeley image;

Figure 4 shows the clustering results of the present invention and existing methods on SAR image 1;

Figure 5 shows the clustering results of the present invention and existing methods on SAR image 2.
Detailed Description

Referring to Figure 1, the concrete implementation of the present invention comprises the following steps:

Step 1: compute the scale parameter σi of each point and the affinity matrix A of the data set.

1a) Input the image data set X = {x1, x2, ..., xn} ∈ Rd, where xi denotes any point in the data set, i ∈ (1, n), n is the number of data points, and d is the data dimension;

1b) Compute the scale parameter σi of each point of the input image data X:

σi = ‖xi − xd‖

where σi denotes the scale parameter of any data point, xd is the d-th nearest neighbor of point xi among the remaining points (d = 7 is chosen), and ‖xi − xd‖ denotes the Euclidean distance between xi and xd. The role of this scale parameter is mainly to adjust the measurement scale according to the data distribution, so that the clustering objects are normalized and fall within the clustering range;

1c) Compute the affinity matrix A of the input image data set X, i.e., the degree of similarity between any two points of the image:

Aij = exp(−‖xi − xj‖² / σiσj), i, j ∈ (1, n)

where Aij denotes any element of the affinity matrix A, σi and σj denote the scale parameters of points xi and xj, and ‖xi − xj‖ denotes their Euclidean distance; Aij describes the degree of similarity between points xi and xj. When xi and xj coincide, the element at the corresponding position of the affinity matrix is 1, i.e., Aii = Ajj = 1; the affinity matrix is therefore a symmetric matrix with ones on the diagonal.
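Steps 1b and 1c above can be sketched in Python with NumPy. The patent gives no reference implementation, so the function name and the `nn` parameter (written `nn` here to avoid clashing with the patent's use of d for the data dimension; the patent chooses the 7th nearest neighbor) are illustrative assumptions:

```python
import numpy as np

def affinity_matrix(X, nn=7):
    """Self-tuning affinity: sigma_i is the distance from x_i to its nn-th
    nearest neighbour; A_ij = exp(-||x_i - x_j||^2 / (sigma_i * sigma_j))."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))     # pairwise Euclidean distances
    # index nn in the sorted row skips the point itself (index 0, distance 0)
    sigma = np.sort(dist, axis=1)[:, nn]
    A = np.exp(-dist ** 2 / (sigma[:, None] * sigma[None, :]))
    np.fill_diagonal(A, 1.0)                    # A_ii = exp(0) = 1, made explicit
    return A, sigma
```

By construction the result is symmetric, has ones on the diagonal, and all entries lie in (0, 1], matching the properties stated above.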
Step 2: extract pairwise constraint information from the manually added class labels and use it to modify the affinity matrix.

So-called semi-supervised spectral clustering adds manually labeled class labels to traditional spectral clustering in order to improve the clustering result. The present invention modifies the affinity matrix by incorporating pairwise constraint information extracted from the class labels; the implementation steps are as follows:

2a) Add class labels to the source image; the labeled points are pixels of the image marked manually to provide prior information for the clustering algorithm;

2b) Extract pairwise constraint information from the labeled points. A pairwise constraint describes whether two points belong to the same class. Pairwise constraints are widely used because they are less difficult to obtain than full class membership information: an operator can derive pairwise constraints from class membership information, but not the reverse. Pairwise constraints come in two kinds, must-link and cannot-link: a must-link constraint requires two sample points to belong to the same class, while a cannot-link constraint forbids two sample points from being in the same class;

2c) Modify the affinity matrix with the pairwise constraint information, i.e., incorporate the manual class labels of step 2a) into the clustering process as prior information according to the following conditions, in order to improve the clustering result:

Aij = 1 if (xi, xj) is a must-link pair; Aij = 0 if (xi, xj) is a cannot-link pair

where xi and xj are any points in the data set.
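A minimal sketch of the affinity-matrix modification of step 2c. The symmetric must-link/cannot-link update is the standard rule for encoding pairwise constraints; the function name and the pair-list representation are assumptions:

```python
import numpy as np

def apply_constraints(A, must_link, cannot_link):
    """Encode pairwise prior information in the affinity matrix:
    must-link pairs get maximal affinity (1), cannot-link pairs get none (0)."""
    A = A.copy()
    for i, j in must_link:
        A[i, j] = A[j, i] = 1.0   # x_i and x_j must fall in the same class
    for i, j in cannot_link:
        A[i, j] = A[j, i] = 0.0   # x_i and x_j must not share a class
    return A
```

Updating both (i, j) and (j, i) keeps the modified matrix symmetric, which the Laplacian construction in the next step relies on.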
Step 3: construct the degree matrix D from the affinity matrix A. D is a diagonal matrix whose diagonal elements Dii are the sums of the elements Aij in the i-th row of the affinity matrix: Dii = Σj Aij.

Construct the Laplacian matrix from the modified affinity matrix and the degree matrix:

L = D^(-1/2) A D^(-1/2)

The undirected graph of the source image represented by the Laplacian matrix naturally forms several connected components, so the matrix can be represented as a block-diagonal matrix; the eigenvalues and eigenvectors of this matrix contain much of the image's information.
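Steps 3 and 4 (degree matrix, normalized Laplacian, eigendecomposition) can be sketched as follows; the symmetric normalization D^(-1/2) A D^(-1/2) is written with NumPy broadcasting instead of explicit matrix products, and the function names are illustrative:

```python
import numpy as np

def normalized_laplacian(A):
    """L = D^(-1/2) A D^(-1/2), where D_ii = sum_j A_ij.
    Assumes every row of A has a positive sum (true for the Gaussian affinity)."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def sorted_eigenvectors(L):
    """Eigendecomposition of the symmetric L, columns ordered by
    descending eigenvalue."""
    w, V = np.linalg.eigh(L)        # eigh returns ascending eigenvalues
    order = np.argsort(w)[::-1]     # largest first
    return w[order], V[:, order]
```

For a nonnegative affinity matrix the largest eigenvalue of this L is 1, which is what the eigenvalue-counting heuristics discussed in the background rely on.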
Step 4: perform eigendecomposition of the Laplacian matrix and arrange the eigenvectors from large to small, where the "size" of an eigenvector refers to the size of its corresponding eigenvalue.

Step 5: reorder the eigenvectors by their importance with the entropy-ranking method; the concrete steps are as follows:

5a) Remove each column of the eigenvector matrix in turn, compute the entropy of the remaining columns, and take the resulting value as the entropy of the removed column. In traditional spectral clustering, for a k-class problem, after the eigenvectors of the Laplacian matrix are obtained, they are arranged from large to small and the first k are selected to partition the original data points. However, the largest eigenvectors do not necessarily contain all the information of the source image, especially for images with a complex structure, so it is necessary to introduce entropy-ranking theory to sort the eigenvectors and obtain the k most important for clustering. The entropy E of the remaining columns is computed by the following formula:

E = −Σi p(Vi) log p(Vi)

where V denotes the eigenvector matrix of the Laplacian matrix, Vi denotes any column of it, and p(Vi) denotes the probability of column Vi. Since this probability is not easy to obtain in practice, it is replaced in the actual computation by the affinity Sij between column Vi and any other column, i.e.

Sij = exp(−‖Vi − Vj‖² / σiσj)

where σi and σj denote the scale parameters of eigenvector columns Vi and Vj, and ‖Vi − Vj‖ denotes the Euclidean distance between columns Vi and Vj;

5b) "Rearrange" the columns in descending order of their entropy values to obtain an ordering of the eigenvectors from high to low importance, denoted VR.
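A sketch of the entropy reordering of steps 5a and 5b. Two simplifying assumptions are made, since the patent does not pin them down: the similarity-based entropy below uses the common feature-ranking form −Σ [S log S + (1 − S) log(1 − S)], and unit scale parameters stand in for the per-column σ; every function name is likewise illustrative:

```python
import numpy as np

def column_set_entropy(V):
    """Entropy of a set of columns, with pairwise column affinities S_ij
    standing in for the unavailable probabilities p(V_i)."""
    m = V.shape[1]
    E = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            s = np.exp(-np.linalg.norm(V[:, i] - V[:, j]) ** 2)
            s = np.clip(s, 1e-12, 1 - 1e-12)   # guard log(0)
            E -= s * np.log(s) + (1 - s) * np.log(1 - s)
    return E

def entropy_rank(V):
    """Leave-one-column-out ranking: the entropy of the remaining columns is
    assigned to the removed column; columns are then sorted by that value in
    descending order, giving the reordered matrix VR."""
    scores = np.array([column_set_entropy(np.delete(V, k, axis=1))
                       for k in range(V.shape[1])])
    order = np.argsort(scores)[::-1]
    return V[:, order], order
```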
Step 6: initialize the number of clusters k = 2, set p = k, take the first k columns of VR, and normalize.

Step 7: adaptively determine the number of clusters.

7a) First project the first two columns of VR into a two-dimensional space and divide the points into four classes according to their distances to the four coordinate semi-axes; remove any cluster containing no data points or fewer points than one percent of the input data, and record the number of surviving classes as p. Then take the first p columns of the eigenvector matrix, project them into a p-dimensional space, and divide the n input points into 2p classes according to the distance from each point to the 2p semi-axes of the p-dimensional space;

7b) Remove, among the 2p classes, any cluster containing no points or fewer points than one percent of the input data; the final number of classes is then at most 2p, and the number of surviving clusters is recorded as c.

Step 8: compare the candidate cluster number c obtained in step 7 with the cluster number k initialized in step 6. If they differ, set k = c and return to step 7; if they are equal, the resulting c is the optimal number of clusters, denoted km. Through several iterations of the clustering procedure, this step partitions the input data points by the distance from each point to each semi-axis of the coordinate system and finally determines the optimal number of clusters.
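The cluster-number loop of steps 6 to 8 can be sketched as below. Assigning a point to the nearest semi-axis is implemented here as picking the coordinate of largest absolute value together with its sign, which for row-normalized points amounts to minimizing the distance to the 2k half-axes; the function names and the `min_frac`/`max_iter` guards are assumptions, not part of the patent:

```python
import numpy as np

def assign_half_axes(P):
    """Label each k-dimensional point with one of the 2k coordinate
    half-axes: the axis of largest |coordinate|, split by sign."""
    axis = np.abs(P).argmax(axis=1)
    sign = np.sign(P[np.arange(len(P)), axis])
    return axis * 2 + (sign < 0)        # labels in {0, ..., 2k-1}

def estimate_k(VR, k0=2, min_frac=0.01, max_iter=50):
    """Iterate: take the first k ranked columns, row-normalize, cluster on
    the 2k half-axes, drop clusters holding fewer than min_frac of the
    points, and repeat until the surviving count c equals k."""
    n = VR.shape[0]
    k = k0
    labels = np.zeros(n, dtype=int)
    for _ in range(max_iter):
        k = min(k, VR.shape[1])         # cannot take more columns than exist
        P = VR[:, :k]
        P = P / (np.linalg.norm(P, axis=1, keepdims=True) + 1e-12)
        labels = assign_half_axes(P)
        counts = np.bincount(labels, minlength=2 * k)
        c = int((counts >= max(1, min_frac * n)).sum())
        if c == k:                      # stable: optimal cluster number km
            return k, labels
        k = max(c, 1)
    return k, labels
```

On data whose rows lie near three half-axes, the loop starts at k = 2, finds c = 3 surviving half-axis clusters, retries with k = 3, and stabilizes there.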
Step 9: output the segmentation result image, divide the input data points into km classes, and label the input data points according to their coordinates to obtain the clustering result.
本发明的效果可以通过以下仿真进一步说明:Effect of the present invention can be further illustrated by following simulation:
仿真的硬件平台为:Intel Core2 Duo CPU E6550@2.33GHZ、2GB RAM,软件平台为MATLAB 7.6。The simulated hardware platform is: Intel Core2 Duo CPU E6550@2.33GHZ, 2GB RAM, and the software platform is MATLAB 7.6.
图2给出了本发明在四组人工数据集上的聚类数确定结果和聚类结果,其中,图2(a)、图2(b)、图2(c)、图2(d)的类别数分别为3、4、5、3,由图2可见本发明能正确的确定聚类数,也能得到正确的聚类结果。Fig. 2 has provided the clustering number determination result and clustering result of the present invention on four groups of artificial data sets, wherein, Fig. 2 (a), Fig. 2 (b), Fig. 2 (c), Fig. 2 (d) The numbers of categories are 3, 4, 5, and 3 respectively. It can be seen from FIG. 2 that the present invention can correctly determine the number of clusters and obtain correct clustering results.
图3给出了Berkeley自然图像的聚类数确定和聚类结果,其中图3(a)、图3(b)、图3(c)、图3(d)分别代表原图像,无监督谱聚类的聚类结果,半监督谱聚类的聚类结果和本发明方法聚类结果,图3(b)中无监督谱聚类和图3(c)中半监督谱聚类的聚类数均是事先设定的,分别为3、2、3,图3(d)中本发明的聚类数是自适应确定的,由聚类结果可得,半监督谱聚类的方法较传统无监督谱聚类所得结果有很大改善。从图3(d)可见,本发明中提出的方法比半监督谱聚类的效果更好,尤其对于图像边缘完整性的保持优于图3(b)中无监督谱聚类和图3(c)中半监督谱聚类方法。Figure 3 shows the determination of the number of clusters and the clustering results of Berkeley natural images, where Figure 3(a), Figure 3(b), Figure 3(c), and Figure 3(d) represent the original image, and the unsupervised spectrum The clustering results of clustering, the clustering results of semi-supervised spectral clustering and the clustering results of the method of the present invention, the clustering of unsupervised spectral clustering in Fig. 3 (b) and semi-supervised spectral clustering in Fig. 3 (c) The numbers are pre-set, respectively 3, 2, and 3. The number of clusters of the present invention in Fig. 3 (d) is determined adaptively, and can be obtained from the clustering results. The method of semi-supervised spectral clustering is more traditional The results obtained by unsupervised spectral clustering are much improved. It can be seen from Fig. 3(d) that the method proposed in the present invention has a better effect than semi-supervised spectral clustering, especially for the preservation of image edge integrity, which is better than unsupervised spectral clustering in Fig. 3(b) and Fig. 3( c) Semi-supervised spectral clustering method in middle.
图4给出了本发明用于SAR图像1的聚类数确定和聚类结果,其中图4(a)、图4(b)、图4(c)、图4(d)分别代表原图像,无监督谱聚类的聚类结果,半监督谱聚类的聚类结果和本发明方法聚类结果,图4(c)中半监督谱聚类虽然能大体上分开图4(a)中的水域和陆地,但边界保持不好,个别水域聚类错误,由图4(d)可见,本发明的方法能够有效地克服这些缺点。Figure 4 shows the determination of the number of clusters and the clustering results for SAR image 1 according to the present invention, wherein Figure 4 (a), Figure 4 (b), Figure 4 (c), and Figure 4 (d) represent the original image respectively , the clustering result of unsupervised spectral clustering, the clustering result of semi-supervised spectral clustering and the clustering result of the method of the present invention, although the semi-supervised spectral clustering in Fig. 4 (c) can be roughly separated in Fig. 4 (a) waters and land, but the boundaries are not well maintained, and individual waters are clustered incorrectly. It can be seen from Figure 4(d) that the method of the present invention can effectively overcome these shortcomings.
Figure 5 shows cluster-number determination and clustering results of the proposed method on SAR image 2. Figures 5(a), 5(b), 5(c), and 5(d) show, respectively, the original image, the unsupervised spectral clustering result, the semi-supervised spectral clustering result, and the result of the proposed method. Figure 5(d) shows that the proposed method correctly determines the number of clusters for this image and effectively overcomes the regional speckle and incoherence exhibited by the semi-supervised spectral clustering in Figure 5(c).
Experiments show that the proposed entropy-ranking-based semi-supervised spectral clustering method determines the number of clusters adaptively and accurately while also improving clustering quality, making it well suited to adaptive cluster-number determination for both natural images and SAR images.
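The workflow described above — spectral clustering in which the number of clusters is read off the spectrum rather than set in advance — can be sketched in a few lines. The patent's entropy-ranking criterion for selecting eigenvectors is not reproduced here; as a simplified stand-in, this sketch uses the well-known eigengap heuristic on the symmetric normalized Laplacian. All names and parameters (`estimate_k_and_cluster`, `sigma`, `k_max`) are illustrative choices, not from the patent.

```python
import numpy as np

def _kmeans(U, k, iters=50):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [U[0]]
    for _ in range(k - 1):
        # distance of each point to its nearest current center
        dists = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[int(np.argmax(dists))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels

def estimate_k_and_cluster(X, sigma=1.0, k_max=6):
    """Spectral clustering whose cluster number k is chosen automatically
    by the eigengap heuristic (a stand-in for entropy ranking)."""
    # Gaussian affinity matrix with zeroed diagonal
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # symmetric normalized Laplacian  L = I - D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    d_is = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(X)) - d_is[:, None] * W * d_is[None, :]
    vals, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    # eigengap heuristic: k = position of the largest gap among the
    # smallest k_max + 1 eigenvalues
    k = int(np.argmax(np.diff(vals[:k_max + 1]))) + 1
    # embed points with the first k eigenvectors and row-normalize
    U = vecs[:, :k]
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    return k, _kmeans(U, k)
```

On two well-separated groups of points, the Laplacian has two near-zero eigenvalues followed by a large gap, so `k = 2` is recovered without being supplied by the user; the patent replaces this gap criterion with an entropy ranking of the eigenvectors, which is reported to be more robust on real images.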
Claims (3)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010102787672A CN101968852A (en) | 2010-09-09 | 2010-09-09 | Entropy sequencing-based semi-supervision spectral clustering method for determining clustering number |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010102787672A CN101968852A (en) | 2010-09-09 | 2010-09-09 | Entropy sequencing-based semi-supervision spectral clustering method for determining clustering number |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101968852A true CN101968852A (en) | 2011-02-09 |
Family
ID=43548003
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2010102787672A Pending CN101968852A (en) | 2010-09-09 | 2010-09-09 | Entropy sequencing-based semi-supervision spectral clustering method for determining clustering number |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101968852A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102722578A (en) * | 2012-05-31 | 2012-10-10 | 浙江大学 | Unsupervised cluster characteristic selection method based on Laplace regularization |
CN102982544A (en) * | 2012-11-21 | 2013-03-20 | 清华大学 | Multiple foreground object image interactive segmentation method |
CN103218419A (en) * | 2013-03-29 | 2013-07-24 | 新浪网技术(中国)有限公司 | Network tag clustering method and network tag clustering system |
CN103995821A (en) * | 2014-03-14 | 2014-08-20 | 盐城工学院 | Selective clustering integration method based on spectral clustering algorithm |
CN105139430A (en) * | 2015-08-27 | 2015-12-09 | 哈尔滨工程大学 | Medical image clustering method based on entropy |
CN107679138A (en) * | 2017-09-22 | 2018-02-09 | 陕西师范大学 | Spectrum signature system of selection based on local scale parameter, entropy and cosine similarity |
CN108364021A (en) * | 2018-02-08 | 2018-08-03 | 西北工业大学 | A kind of bearing fault characteristics extracting method arranging entropy based on level |
CN109800660A (en) * | 2018-12-27 | 2019-05-24 | 国网江苏省电力有限公司电力科学研究院 | A kind of voltage sag source identification method and system based on big data cluster |
CN109816034A (en) * | 2019-01-31 | 2019-05-28 | 清华大学 | Signal feature combination selection method, device, computer equipment and storage medium |
CN111144481A (en) * | 2019-12-26 | 2020-05-12 | 厦门大学 | Clustering method of single-molecule electric transport data based on spectral clustering |
CN114842334A (en) * | 2022-04-14 | 2022-08-02 | 西安电子科技大学 | Annotation method of hyperspectral data based on unsupervised clustering |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101751666A (en) * | 2009-10-16 | 2010-06-23 | 西安电子科技大学 | Semi-supervised multi-spectral remote sensing image segmentation method based on spectral clustering |
- 2010-09-09: CN application CN2010102787672A, patent CN101968852A/en, status: active, Pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101751666A (en) * | 2009-10-16 | 2010-06-23 | 西安电子科技大学 | Semi-supervised multi-spectral remote sensing image segmentation method based on spectral clustering |
Non-Patent Citations (3)
Title |
---|
Wang Chongjun et al., "Image Segmentation Using Spectral Clustering," 17th IEEE International Conference on Tools with Artificial Intelligence, 2005-11-30, pp. 677-678 (cited against claims 1-3). |
Chun Yang et al., "Self-Tuning Semi-Supervised Spectral Clustering," 2008 International Conference on Computational Intelligence and Security, vol. 1, 2008-12-17, pp. 1-5 (cited against claims 1-3). |
Feng Zhao et al., "Spectral clustering with eigenvector selection based on entropy ranking," Neurocomputing, vol. 73, no. 10-12, 2010-06-30, pp. 1704-1717 (cited against claims 1-3). |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102722578A (en) * | 2012-05-31 | 2012-10-10 | 浙江大学 | Unsupervised cluster characteristic selection method based on Laplace regularization |
CN102982544A (en) * | 2012-11-21 | 2013-03-20 | 清华大学 | Multiple foreground object image interactive segmentation method |
CN103218419A (en) * | 2013-03-29 | 2013-07-24 | 新浪网技术(中国)有限公司 | Network tag clustering method and network tag clustering system |
CN103995821A (en) * | 2014-03-14 | 2014-08-20 | 盐城工学院 | Selective clustering integration method based on spectral clustering algorithm |
CN103995821B (en) * | 2014-03-14 | 2017-05-10 | 盐城工学院 | Selective clustering integration method based on spectral clustering algorithm |
CN105139430A (en) * | 2015-08-27 | 2015-12-09 | 哈尔滨工程大学 | Medical image clustering method based on entropy |
CN107679138B (en) * | 2017-09-22 | 2021-08-27 | 陕西师范大学 | Spectral feature selection method based on local scale parameters, entropy and cosine similarity |
CN107679138A (en) * | 2017-09-22 | 2018-02-09 | 陕西师范大学 | Spectrum signature system of selection based on local scale parameter, entropy and cosine similarity |
CN108364021A (en) * | 2018-02-08 | 2018-08-03 | 西北工业大学 | A kind of bearing fault characteristics extracting method arranging entropy based on level |
CN109800660A (en) * | 2018-12-27 | 2019-05-24 | 国网江苏省电力有限公司电力科学研究院 | A kind of voltage sag source identification method and system based on big data cluster |
CN109800660B (en) * | 2018-12-27 | 2020-11-10 | 国网江苏省电力有限公司电力科学研究院 | A method and system for identifying voltage sag sources based on big data clustering |
CN109816034A (en) * | 2019-01-31 | 2019-05-28 | 清华大学 | Signal feature combination selection method, device, computer equipment and storage medium |
CN111144481A (en) * | 2019-12-26 | 2020-05-12 | 厦门大学 | Clustering method of single-molecule electric transport data based on spectral clustering |
CN111144481B (en) * | 2019-12-26 | 2022-06-21 | 厦门大学 | A clustering method for single-molecule electrical transport data based on spectral clustering |
CN114842334A (en) * | 2022-04-14 | 2022-08-02 | 西安电子科技大学 | Annotation method of hyperspectral data based on unsupervised clustering |
CN114842334B (en) * | 2022-04-14 | 2025-04-08 | 西安电子科技大学 | Hyperspectral data annotation method based on unsupervised clustering |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wen et al. | Incomplete multiview spectral clustering with adaptive graph learning | |
CN101968852A (en) | Entropy sequencing-based semi-supervision spectral clustering method for determining clustering number | |
Luo et al. | Large margin multi-modal multi-task feature extraction for image classification | |
US9524449B2 (en) | Generation of visual pattern classes for visual pattern recognition | |
CN113011499A (en) | Hyperspectral remote sensing image classification method based on double-attention machine system | |
Xinshao et al. | Weed seeds classification based on PCANet deep learning baseline | |
Zhang et al. | Detecting densely distributed graph patterns for fine-grained image categorization | |
Hsu et al. | Capturing implicit hierarchical structure in 3D biomedical images with self-supervised hyperbolic representations | |
CN111785329A (en) | A clustering method for single-cell RNA sequencing based on adversarial autoencoders | |
Badawi et al. | A hybrid memetic algorithm (genetic algorithm and great deluge local search) with back-propagation classifier for fish recognition | |
CN102592148A (en) | Face identification method based on non-negative matrix factorization and a plurality of distance functions | |
He et al. | Automatic magnetic resonance image prostate segmentation based on adaptive feature learning probability boosting tree initialization and CNN-ASM refinement | |
CN103440512A (en) | Identifying method of brain cognitive states based on tensor locality preserving projection | |
CN104933445A (en) | Mass image classification method based on distributed K-means | |
CN107133640A (en) | Image classification method based on topography's block description and Fei Sheer vectors | |
CN106991355A (en) | The face identification method of the analytical type dictionary learning model kept based on topology | |
CN104376051A (en) | Random structure conformal Hash information retrieval method | |
CN110689092A (en) | Sole pattern image depth clustering method based on data guidance | |
Cho et al. | Genetic evolution processing of data structures for image classification | |
CN107392921B (en) | A semi-supervised multi-objective clustering image segmentation method based on Chebyshev distance | |
CN118411714A (en) | Image texture classification method and system | |
Kanari et al. | Quantifying topological invariants of neuronal morphologies | |
CN108416389B (en) | Image classification method based on denoising sparse autoencoder and density space sampling | |
Minarno et al. | Leaf based plant species classification using deep convolutional neural network | |
Liu et al. | Building holistic descriptors for scene recognition: A multi-objective genetic programming approach |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| C12 | Rejection of a patent application after its publication | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20110209 |