CN109635849A

CN109635849A - Target clustering method and system based on three-way c-means decision

Info

Publication number: CN109635849A
Application number: CN201811401683.6A
Authority: CN
Inventors: 张凯; 刘三女牙; 孙建文
Original assignee: Central China Normal University
Current assignee: Central China Normal University
Priority date: 2018-11-22
Filing date: 2018-11-22
Publication date: 2019-04-16

Abstract

The invention provides a target clustering method and system based on three-way c-means decision, belonging to the technical field of machine learning clustering. The invention models a cluster as a positive domain, a boundary domain and a negative domain, and assigns each target to the appropriate domain of a cluster according to the relative relationship between the cluster's center point and the target. The method can be applied to any problem in which cluster boundaries are unclear, so it is widely applicable and clusters well. Further, when the center point of a cluster is calculated, the weight of a target is determined by the number of positive domains and boundary domains to which the target belongs rather than by an empirical weight, so that cluster analysis of the targets can be performed more effectively.

Description

Target clustering method and system based on three-way c-means decision

Technical field

The present invention relates to the technical field of machine learning clustering, and more particularly to a target clustering method and system based on three-way c-means decision, especially suitable for clustering educational resources.

Background

With the development of data mining technology, more and more target clustering techniques are applied to class prediction; common application scenarios include image segmentation, biomedical recognition and educational resource classification. Taking educational resource classification as an example, several different types of educational resources can be clustered according to features such as type (video, text, exercise), usage duration (the average length of time a resource is used) and usage frequency (the number of times a resource is used in a given term), and the result can inform the development of educational resources from an application perspective. Further, analyzing these results together with student information data can make the development of educational resources more targeted.

The main purpose of target clustering is to assign similar targets to one cluster, so that the similarity of targets within the same cluster is as high as possible and the similarity of targets in different clusters is as low as possible. In traditional clustering methods each target can belong to only one cluster; such methods are hard clustering methods. As applications have deepened, however, hard clustering has run into problems, one of which is the uncertain boundary between clusters: some targets may lie between several clusters. This is beyond what hard clustering can handle, and it is exactly the kind of problem soft clustering is designed for.

The most important class of technical solutions in soft clustering models clusters using rough sets or a similar theory, models targets using fuzzy sets or a similar theory, and finally substitutes the modeled clusters and targets into the framework of the traditional k-means clustering algorithm.

Two problems remain prominent in this class of soft clustering methods. On the one hand, clusters are modeled with a variety of similar theories; besides rough sets there are also shadowed sets and others. All of these theories regard a cluster as three domains: one made up of targets that definitely belong to the cluster, one made up of targets that may belong to the cluster, and one made up of targets that are unlikely to belong to the cluster. The applicant has found that these theories are internally consistent and can be summarized by three-way decision theory, yet current soft clustering methods do not use three-way decision theory to model clusters. On the other hand, when a cluster center is calculated, different weights are applied to targets in different domains, and these weights are determined empirically; as a consequence, the cluster center is very sensitive to the weight values. Both problems urgently need to be solved.

Summary of the invention

In view of the defects of the prior art, the technical purpose of the invention is to provide a target clustering method that models clusters using three-way decision theory and can cluster targets more effectively.

To achieve this technical purpose, the invention adopts the following technical solution:

A target clustering method based on three-way c-means decision, in which a cluster c_i is modeled as a positive domain, a boundary domain and a negative domain, denoted POS(c_i), BND(c_i) and NEG(c_i) respectively; the positive domain of a cluster consists of the targets that definitely belong to the cluster, the boundary domain consists of the targets that may belong to the cluster, and the negative domain consists of the targets that are unlikely to belong to the cluster;

This method comprises the following steps:

(1) randomly assign each target data item x_j to be clustered to the positive domain of one of the k clusters, where x_j ∈ U and U is the set of all target data to be clustered;

(2) calculate the center points of the k clusters;

(3) according to the calculated center points, reassign all target data to the different domains of the k clusters;

(4) check whether the iteration termination condition is met; if not, return to step (2); otherwise, terminate;
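The four steps above can be sketched as a small driver loop. This is only an illustration under stated assumptions: the function name `three_way_c_means` and the four callables passed to it (`init`, `compute_centers`, `reassign`, `converged`) are hypothetical placeholders for steps (1) to (4), and the default cap of 100 iterations is taken from the embodiment described later in this document.

```python
def three_way_c_means(data, k, init, compute_centers, reassign, converged,
                      max_iter=100):
    """Hypothetical driver for steps (1)-(4); all callables are supplied by the caller."""
    domains = init(data, k)                # step (1): random initial assignment
    centers = None
    for _ in range(max_iter):
        new_centers = compute_centers(data, domains)   # step (2)
        domains = reassign(data, new_centers)          # step (3)
        # step (4): compare with the previous iteration, if there is one
        done = centers is not None and converged(centers, new_centers)
        centers = new_centers
        if done:
            break
    return centers, domains
```

Passing the steps in as callables keeps the loop independent of how memberships, weights and domains are actually computed.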

In step (3), all target data are reassigned to the clusters as follows:

Define the relation function r(c_i, x_j) = μ_ij, where μ_ij is the fuzzy membership value expressing the degree of similarity between target x_j and cluster c_i;

Establish the relation vector [r(c₁, x_j), r(c₂, x_j), …, r(c_k, x_j)]^T = [μ_1j, μ_2j, …, μ_kj]^T, which expresses the degree of similarity between target x_j and each cluster;

Define the feature function, which extracts the maximum value of the relation vector;

Define the relative relation function, whose value t_ij describes the relation of target x_j to cluster c_i relative to the other clusters; the larger the value, the closer the relation between target x_j and cluster c_i, and its range is (0, 1];

Define the relative belonging set, which describes the set of clusters to which target x_j may belong, where t_mj and t_nj are respectively the largest and second-largest values of [t_ij], 1 ≤ i ≤ k. The clusters in this set are the clusters to which target x_j may belong: if the set contains only one cluster, target x_j is assigned to the positive domain of that cluster; if the set contains two or more clusters, target x_j is assigned to the boundary domains of those clusters;
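The assignment rule just described can be sketched as follows. Two parts are assumptions rather than the patent's formulas, which are not reproduced in this text: the relative relation value is taken to be t_ij = μ_ij / max_i μ_ij (consistent with the stated range (0, 1] for positive memberships), and the relative belonging set is approximated by a hypothetical cutoff `threshold` instead of the original definition involving t_mj and t_nj. The function name `assign_domains` is likewise illustrative.

```python
def assign_domains(memberships, threshold=0.8):
    """Decide POS or BND for one target x_j from its memberships [mu_1j, ..., mu_kj].

    Returns (domain, cluster_indices); indices are 0-based.
    `threshold` is a hypothetical stand-in for the patent's set definition.
    """
    peak = max(memberships)          # feature function: maximum of the relation vector
    if peak <= 0:
        raise ValueError("at least one membership value must be positive")
    t = [mu / peak for mu in memberships]   # assumed relative relation values
    belonging = [i for i, t_i in enumerate(t) if t_i >= threshold]
    # One candidate cluster: positive domain; two or more: boundary domains.
    domain = "POS" if len(belonging) == 1 else "BND"
    return domain, belonging
```

For example, memberships of 0.9 and 0.1 leave a single candidate cluster, so the target goes to that cluster's positive domain, whereas 0.45, 0.40 and 0.15 leave two candidates, so the target goes to the boundary domains of both.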

Establish the evaluation function v(c_i, x_j), which describes the relative relation value of target x_j with respect to cluster c_i; with α = 1, the evaluation-based clustering model is as follows:

A target clustering system based on three-way c-means decision, in which a cluster c_i is modeled as a positive domain, a boundary domain and a negative domain, denoted POS(c_i), BND(c_i) and NEG(c_i) respectively; the positive domain of a cluster consists of the targets that definitely belong to the cluster, the boundary domain consists of the targets that may belong to the cluster, and the negative domain consists of the targets that are unlikely to belong to the cluster;

The system includes the following modules:

An initial assignment module, for randomly assigning each target data item x_j to be clustered to the positive domain of one of the k clusters, where x_j ∈ U and U is the set of all target data to be clustered;

A center point calculation module, for calculating the center points of the k clusters;

An update assignment module, for reassigning all target data to the different domains of the k clusters according to the calculated center points;

An iteration termination determination module, for checking whether the iteration termination condition is met; if not, control returns to the center point calculation module; otherwise, the process terminates;

The update assignment module reassigns all target data to the clusters as follows:

Define the relation function r(c_i, x_j) = μ_ij, where μ_ij is the fuzzy membership value expressing the degree of similarity between target x_j and cluster c_i;

Establish the relation vector [r(c₁, x_j), r(c₂, x_j), …, r(c_k, x_j)]^T = [μ_1j, μ_2j, …, μ_kj]^T, which expresses the degree of similarity between target x_j and each cluster;

Define the relative belonging set, which describes the set of clusters to which target x_j may belong, where t_mj and t_nj are respectively the largest and second-largest values of [t_ij], 1 ≤ i ≤ k. The clusters in this set are the clusters to which target x_j may belong: if the set contains only one cluster, target x_j is assigned to the positive domain of that cluster; if the set contains two or more clusters, target x_j is assigned to the boundary domains of those clusters;

Further, the center point of a cluster is calculated as follows:

where mean_i denotes the center point of cluster c_i; POS(c_i) denotes the positive domain of cluster c_i and |POS(c_i)| the number of targets in that positive domain; BND(c_i) denotes the boundary domain of cluster c_i and |BND(c_i)| the number of targets in that boundary domain; and w_ij denotes the weight of target x_j for cluster c_i.
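A minimal sketch of a weighted center computation is given below. The center formula itself is not reproduced in this text, so the sketch assumes a plain weighted average over POS(c_i) and BND(c_i), with positive-domain targets carrying weight 1 and boundary-domain targets carrying w_ij. The original formula also involves |POS(c_i)| and |BND(c_i)| and may normalize differently. The name `cluster_center` and the argument layout are illustrative.

```python
def cluster_center(pos_points, bnd_weighted):
    """Assumed weighted mean for one cluster c_i.

    pos_points:   list of points (tuples) in POS(c_i), each given weight 1.
    bnd_weighted: list of (point, w_ij) pairs for targets in BND(c_i).
    Returns the center as a tuple, or None if the cluster has no weight.
    """
    items = [(p, 1.0) for p in pos_points] + list(bnd_weighted)
    total = sum(w for _, w in items)
    if total == 0:
        return None
    dim = len(items[0][0])
    return tuple(sum(w * p[d] for p, w in items) / total for d in range(dim))
```

Under this assumption a boundary target shared between clusters pulls each center in proportion to its weight, rather than by a fixed empirical factor.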

Further, the weight w_ij of target x_j for cluster c_i is computed from the values μ_ij ∈ M_xj, where μ_ij denotes the fuzzy membership value expressing the degree of similarity between target x_j and cluster c_i, and M_xj denotes the set of fuzzy membership values expressing the similarity between target x_j and the clusters to which it belongs.
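One way to read this weighting, sketched here as an assumption because the weight formula is not reproduced in this text, is that the membership values in M_xj are normalized to sum to 1, which matches the normalization step in the embodiment. The function name `normalized_weights` and the dictionary layout are illustrative.

```python
def normalized_weights(memberships, belonging):
    """Assumed weight rule: w_ij = mu_ij / sum of the values in M_xj.

    memberships: dict mapping cluster index -> mu_ij for one target x_j.
    belonging:   indices of the clusters whose POS or BND contains x_j.
    Returns a dict mapping cluster index -> w_ij.
    """
    m_xj = {i: memberships[i] for i in belonging}   # the set M_xj
    total = sum(m_xj.values())
    if total == 0:
        raise ValueError("membership values in M_xj must not all be zero")
    return {i: mu / total for i, mu in m_xj.items()}
```

With this rule a target that belongs to a single cluster always receives weight 1, and a target shared by several clusters splits its weight according to its memberships, so the weight depends on how many domains the target belongs to.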

Further, the fuzzy membership value expressing the degree of similarity between target x_j and cluster c_i is calculated as follows:

where μ_ij denotes the fuzzy membership value expressing the degree of similarity between target x_j and cluster c_i; k denotes the number of clusters; 1 ≤ i ≤ k; 1 ≤ j ≤ n, where n is the number of targets in the data set; d_ij and d_lj denote the Euclidean distances from target x_j to cluster c_i and to cluster c_l respectively; and the parameter m > 1.
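The membership computation can be sketched as follows. Because the formula is not reproduced in this text, the sketch assumes the standard fuzzy c-means membership μ_ij = 1 / Σ_l (d_ij / d_lj)^(2/(m−1)), which uses exactly the symbols listed above (k, d_ij, d_lj, m > 1). The function name `fuzzy_memberships` and the handling of zero distances are illustrative assumptions.

```python
import math

def fuzzy_memberships(x, centers, m=2.0):
    """Return [mu_1j, ..., mu_kj] for one target x against k cluster centers.

    Assumes the standard fuzzy c-means formula; m > 1 is the fuzzifier.
    """
    if m <= 1:
        raise ValueError("m must be greater than 1")
    d = [math.dist(x, c) for c in centers]   # Euclidean distances d_ij
    # Assumption: a target lying exactly on one or more centers
    # shares full membership among those centers.
    zeros = [i for i, dist in enumerate(d) if dist == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(d))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_l) ** exponent for d_l in d) for d_i in d]
```

With centers at distances 1 and 3 from the target and m = 2, the memberships come out as 0.9 and 0.1, and they always sum to 1 across the k clusters.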

Further, the three domains of the same cluster satisfy the following conditions:

The three domains of different clusters satisfy the following conditions:

Compared with existing clustering methods, the target clustering method of the present invention is based on the three-way c-means algorithm. Addressing the cluster-boundary uncertainty that is common in practical clustering problems, the invention models each cluster as a positive domain and a boundary domain. The method applies to any problem in which cluster boundaries are unclear, so it is widely applicable and clusters well.

Further, when a cluster center is calculated, the weight of a target is determined by the number of upper approximations (positive domains and boundary domains) to which the target belongs, rather than by an empirical weight, so that cluster analysis of the targets can be performed more effectively.

The invention can effectively cluster a variety of educational data sets and is especially suitable for fields such as student performance clustering and educational resource clustering.

Brief description of the drawings

Fig. 1 is a flow chart of the target clustering method of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawing and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and do not limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.

The specific implementation steps of the invention are described in further detail below with reference to Fig. 1.

Step 1. Input a D-dimensional educational resource data set to be clustered, the number of clusters k, and the cutoff threshold ξ.

Step 2. Initialization: generate a random number r for each data item, where r is a natural number between 1 and k. According to this random number r, the data item is assigned to the positive domain of the corresponding cluster c_i.
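The initialization in Step 2 can be sketched as below. The function name `random_initial_assignment`, the optional `seed` parameter and the representation of positive domains as lists of data indices are illustrative assumptions. The sketch also assumes that a random number r selects cluster c_r.

```python
import random

def random_initial_assignment(n, k, seed=None):
    """Assign each of n data items to the positive domain of a random cluster.

    Returns a dict mapping cluster number 1..k -> list of data indices 0..n-1.
    """
    rng = random.Random(seed)
    pos = {i: [] for i in range(1, k + 1)}
    for j in range(n):
        r = rng.randint(1, k)   # natural number between 1 and k, inclusive
        pos[r].append(j)
    return pos
```

After this step every data item sits in exactly one positive domain and all boundary domains are empty. Some clusters may start empty when n is small relative to k.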

Step 3. Calculate the center point of each educational resource cluster.

Calculate the fuzzy membership values of each data item. According to formula (1), the fuzzy membership value of each data item with respect to each cluster can be calculated, as shown in the table.

Determine the clusters whose positive or boundary domain each data item belongs to. For any data item x_j, compute the set of clusters to which it belongs. For example, if x_j lies in the upper approximations of the three clusters c₁, c₃ and c₄, the set consists of those three clusters.

Find the fuzzy membership values of the data item with respect to these clusters. For any data item x_j, compute its membership value set M_xj; in the example above, M_xj consists of the membership values of x_j with respect to c₁, c₃ and c₄.

Normalize these values to obtain w_ij.

Use the normalized values as weights to calculate the mean of each cluster.

Step 4. Reassign the data to the clusters according to the mean of each cluster.

Define the relation function r(c_i, x_j) = μ_ij.

Define the feature function.

Define the relative relation function.

Define the relative belonging set.

Establish the evaluation function v(c_i, x_j) and distribute the data to the different clusters accordingly:

POS(c_i) = {x_j ∈ U | v(c_i, x_j) ≥ 1};

Step 5. Check the termination condition, as follows: record the mean of each cluster at every iteration. The algorithm is judged to have converged if the difference between each cluster's mean and its mean in the previous iteration is less than the preset cutoff threshold ξ, or if the algorithm has run 100 iterations. Whichever termination condition is met first, the algorithm proceeds to step 6; otherwise it returns to step 3.
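The termination check in Step 5 can be sketched as follows. The text does not say how the "difference" between two means is measured, so the sketch assumes the Euclidean distance between the old and new mean of each cluster. The function name `should_stop` and its parameters are illustrative.

```python
import math

def should_stop(prev_means, means, xi, iteration, max_iter=100):
    """Return True when either termination condition of Step 5 holds.

    prev_means, means: lists of cluster means (tuples) from consecutive iterations.
    xi:                preset cutoff threshold.
    iteration:         number of iterations completed so far.
    """
    if iteration >= max_iter:
        return True
    # Assumed measure of change: Euclidean distance per cluster mean.
    return all(math.dist(p, c) < xi for p, c in zip(prev_means, means))
```

The iteration cap guarantees termination even when the means keep oscillating by more than ξ.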

Step 6. Output the positive domain and the boundary domain of each cluster.

Those skilled in the art will readily understand that the foregoing describes only preferred embodiments of the invention and is not intended to limit the invention; any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within the protection scope of the invention.

Claims

1. A target clustering method based on three-way c-means decision, in which a cluster c_i is modeled as a positive domain, a boundary domain and a negative domain, denoted POS(c_i), BND(c_i) and NEG(c_i) respectively; wherein the positive domain of a cluster consists of the targets that definitely belong to the cluster, the boundary domain consists of the targets that may belong to the cluster, and the negative domain consists of the targets that are unlikely to belong to the cluster;

characterized in that the method comprises the following steps:

(1) randomly assigning each target data item x_j to be clustered to the positive domain of one of the k clusters, where x_j ∈ U and U is the set of all target data to be clustered;

(2) calculating the center points of the k clusters;

establishing the relation vector [r(c₁, x_j), r(c₂, x_j), …, r(c_k, x_j)]^T = [μ_1j, μ_2j, …, μ_kj]^T, which expresses the degree of similarity between target x_j and each cluster;

defining the feature function, which extracts the maximum value of the relation vector;

defining the relative relation function, whose value t_ij describes the relation of target x_j to cluster c_i relative to the other clusters, wherein the larger the value, the closer the relation between target x_j and cluster c_i, and the range of the value is (0, 1];

defining the relative belonging set, which describes the set of clusters to which target x_j may belong, wherein t_mj and t_nj are respectively the largest and second-largest values of [t_ij], 1 ≤ i ≤ k; the clusters in this set are the clusters to which target x_j may belong; if the set contains only one cluster, target x_j is assigned to the positive domain of that cluster; if the set contains two or more clusters, target x_j is assigned to the boundary domains of those clusters;

2. The target clustering method based on three-way c-means decision according to claim 1, characterized in that the center point of a cluster is calculated as follows:

wherein mean_i denotes the center point of cluster c_i; POS(c_i) denotes the positive domain of cluster c_i and |POS(c_i)| the number of targets in that positive domain; BND(c_i) denotes the boundary domain of cluster c_i and |BND(c_i)| the number of targets in that boundary domain; and w_ij denotes the weight of target x_j for cluster c_i.

3. The target clustering method based on three-way c-means decision according to claim 2, characterized in that the weight w_ij of target x_j for cluster c_i is defined in terms of μ_ij and M_xj, wherein μ_ij denotes the fuzzy membership value expressing the degree of similarity between target x_j and cluster c_i, and M_xj denotes the set of fuzzy membership values expressing the similarity between target x_j and the clusters to which it belongs.

4. The target clustering method based on three-way c-means decision according to claim 3, characterized in that the fuzzy membership value expressing the degree of similarity between target x_j and cluster c_i is calculated as follows:

wherein μ_ij denotes the fuzzy membership value expressing the degree of similarity between target x_j and cluster c_i; k denotes the number of clusters; 1 ≤ i ≤ k; 1 ≤ j ≤ n, where n is the number of targets in the data set; d_ij and d_lj denote the Euclidean distances from target x_j to cluster c_i and to cluster c_l respectively; and the parameter m > 1.

5. The target clustering method based on three-way c-means decision according to claim 1, 2, 3 or 4, characterized in that

the three domains of the same cluster satisfy the following conditions:

the three domains of different clusters satisfy the following conditions:

6. A target clustering system based on three-way c-means decision, in which a cluster c_i is modeled as a positive domain, a boundary domain and a negative domain, denoted POS(c_i), BND(c_i) and NEG(c_i) respectively; wherein the positive domain of a cluster consists of the targets that definitely belong to the cluster, the boundary domain consists of the targets that may belong to the cluster, and the negative domain consists of the targets that are unlikely to belong to the cluster;

characterized in that the system comprises the following modules:

an initial assignment module, for randomly assigning each target data item x_j to be clustered to the positive domain of one of the k clusters, where x_j ∈ U and U is the set of all target data to be clustered;

an update assignment module, for reassigning all target data to the different domains of the k clusters according to the calculated center points;

an iteration termination determination module, for checking whether the iteration termination condition is met; if not, control returns to the center point calculation module; otherwise, the process terminates;

establishing the relation vector [r(c₁, x_j), r(c₂, x_j), …, r(c_k, x_j)]^T = [μ_1j, μ_2j, …, μ_kj]^T, which expresses the degree of similarity between target x_j and each cluster;

7. The target clustering system based on three-way c-means decision according to claim 6, characterized in that the center point of a cluster is calculated as follows:

wherein mean_i denotes the center point of cluster c_i; POS(c_i) denotes the positive domain of cluster c_i and |POS(c_i)| the number of targets in that positive domain; BND(c_i) denotes the boundary domain of cluster c_i and |BND(c_i)| the number of targets in that boundary domain; and w_ij denotes the weight of target x_j for cluster c_i.

8. The target clustering system based on three-way c-means decision according to claim 7, characterized in that the weight w_ij of target x_j for cluster c_i is defined in terms of μ_ij and M_xj, wherein μ_ij denotes the fuzzy membership value expressing the degree of similarity between target x_j and cluster c_i, and M_xj denotes the set of fuzzy membership values expressing the similarity between target x_j and the clusters to which it belongs.