CN108805174A

CN108805174A - clustering method and device

Info

Publication number: CN108805174A
Application number: CN201810482717.2A
Authority: CN
Inventors: 姚佳
Original assignee: Guangdong Hui He Science And Technology Development Co Ltd
Current assignee: Guangdong Hui He Science And Technology Development Co Ltd
Priority date: 2018-05-18
Filing date: 2018-05-18
Publication date: 2018-11-13
Anticipated expiration: 2038-05-18
Also published as: CN108805174B

Abstract

A kind of clustering method of the embodiment of the present application offer and device, method include：Read include multiple samples pending data, and by pending data RDD objects；The first preset quantity target sample is randomly determined in multiple samples, using the first preset quantity sample as the cluster centre of the first preset quantity cluster；In sample in multiple samples in addition to target sample, it is randomly determined the second preset quantity sample；For each in the second preset quantity sample, the sample is calculated at a distance from the cluster centre of each cluster, the cluster which being divided into where nearest cluster centre at a distance from the sample；The average value of each sample in each cluster is calculated as the first average value, the value of the cluster centre of the cluster is updated according to the current value of first average value and the cluster centre of the cluster, until meeting preset condition.The result that so, it is possible relatively effectively to avoid is local optimum.

Description

Clustering method and device

Technical field

This application involves big data technical fields, in particular to a kind of clustering method and device.

Background technology

Clustering method has extremely wide application in various fields such as machine learning, data minings.Conventional cluster side In method, it can be divided into hierarchical clustering, division formula cluster according to certain characteristic, gather based on Density Clustering, based on network clustering, core Class, spectral clustering etc. are a variety of.Accordingly, hierarchical clustering has brich, rock, Chameleon etc., division formula cluster have K-means and Its mutation etc., Density Clustering have dbscan, OPTICS etc., Grid Clustering to have sting, clique etc..In these clustering methods, It is most common to be no more than K-means.Although K-means has a priori assumption such as circular distribution, preassigns cluster classification number Deng defect, but its relatively low computation complexity still allow its become popular algorithm.But in number to be treated According to measure it is huge in the case of, the computation complexity of existing K-means is still larger, and is easy to be easily trapped into local optimum Problem.

Invention content

In view of this, the purpose of the application includes a kind of clustering method of offer and device, to improve the above problem.

In order to achieve the above object, the embodiment of the present application adopts the following technical scheme that：

In a first aspect, the embodiment of the present application provides a kind of clustering method, it is applied to the terminal device based on Spark frames, The method includes：

Read by the Spark frames include multiple samples pending data, and by RDD pairs of the pending data As；

The first preset quantity target sample is randomly determined in the multiple sample, by first preset quantity Cluster centre of the sample respectively as the first preset quantity cluster；

In sample in the multiple sample in addition to the first preset quantity target sample, it is randomly determined Two preset quantity samples, wherein second preset quantity is the preset multiple of first preset quantity；

For each sample in the second preset quantity sample, the sample and the cluster centre of each cluster are calculated Distance, the cluster which being divided into where nearest cluster centre at a distance from the sample；

The average value of each sample in each cluster is calculated as the first average value, according to the poly- of first average value and the cluster The current value at class center is updated the value of the cluster centre of the cluster；

When being unsatisfactory for preset condition after the updating, the first preset quantity mesh is removed in the multiple sample again In sample except standard specimen sheet, the second preset quantity sample is determined, and the second preset quantity sample redefined is drawn Corresponding cluster is assigned to be updated with the value of the cluster centre to each cluster.

Optionally, the clustering method provided according to the embodiment of the present application first aspect, the preset condition include：

Value and the average value of the angle of updated value before the cluster centre update of each cluster are less than predetermined angle；

The number being updated to the value of the cluster centre of each cluster according to identified second preset quantity sample reaches Preset times.

Optionally, the clustering method provided according to the embodiment of the present application first aspect is preset in the terminal device poly- Class center turnover rate；

The value of the cluster centre of the cluster is carried out more according to the current value of first average value and the cluster centre of the cluster Newly, including：

With the weight that the cluster centre turnover rate is first average value, with 1 and the cluster centre turnover rate Difference is the weight of the current value of the cluster centre of the cluster, calculates the current value of first average value and the cluster centre of the cluster Weighted average, the weighted average that the value of the cluster centre of the cluster is updated to.

Optionally, the clustering method provided according to the embodiment of the present application first aspect, the method further include：

In meeting the preset condition any one when, the cluster centre of each cluster is carried out more according to the multiple sample Newly；

When being unsatisfactory for the preset condition after update, again according to the multiple sample to the cluster centre of each cluster into Row update.

Optionally, the clustering method provided according to the embodiment of the present application first aspect, according to the multiple sample to each cluster Cluster centre be updated, including：

For each sample in the multiple sample, the sample is calculated at a distance from the cluster centre of each cluster, by the sample Originally the cluster being divided into where nearest cluster centre at a distance from the sample；

The average value of each sample in each cluster is calculated to be updated to as the second average value, and by the value of the cluster centre of the cluster Second average value.

Second aspect, the embodiment of the present application also provide a kind of clustering apparatus, are set applied to the terminal based on Spark frames Standby, described device includes：

Data read module, for including the pending data of multiple samples by Spark frames reading, and by institute State pending data RDD objects；

Center determining module will for being randomly determined the first preset quantity target sample in the multiple sample Cluster centre of the first preset quantity sample respectively as the first preset quantity cluster；

Sample chooses module, for the sample in the multiple sample in addition to the first preset quantity target sample In this, it is randomly determined the second preset quantity sample, wherein second preset quantity is the pre- of first preset quantity If multiple；

First division module, for for each sample in the second preset quantity sample, calculating the sample At a distance from the cluster centre of each cluster, the cluster which is divided into where nearest cluster centre at a distance from the sample；

First update module, for calculating the average value of each sample in each cluster as the first average value, according to described The current value of the cluster centre of one average value and the cluster is updated the value of the cluster centre of the cluster, is unsatisfactory for after the updating When preset condition, again in the sample in the multiple sample in addition to the first preset quantity target sample, determine Second preset quantity sample, and the second preset quantity sample redefined is divided into corresponding cluster to gather to each cluster The value at class center is updated.

Optionally, the clustering apparatus provided according to the embodiment of the present application second aspect, the preset condition include：

Optionally, the clustering apparatus provided according to the embodiment of the present application second aspect, 8. is according to claim 7 poly- Class device, which is characterized in that cluster centre turnover rate is preset in the terminal device；First update module is according to The mode that the current value of the cluster centre of first average value and the cluster is updated the value of the cluster centre of the cluster is：

Optionally, the clustering apparatus provided according to the embodiment of the present application second aspect, the clustering apparatus further include：

Second update module, in meeting the preset condition any one when, according to the multiple sample to each The cluster centre of cluster is updated, when being unsatisfactory for the preset condition after update, again according to the multiple sample to each The cluster centre of cluster is updated.

Optionally, the clustering apparatus provided according to the embodiment of the present application second aspect, second update module is according to institute Stating the mode that multiple samples are updated the cluster centre of each cluster is：

Compared to existing technologies, the embodiment of the present application has the advantages that：

A kind of clustering method and device provided by the embodiments of the present application are applied to the terminal device based on Spark frames.Eventually The data read are formed RDD objects by end equipment by the pending data that the reading of Spark frames includes multiple samples. Determine the first preset quantity target sample respectively as in the cluster of the first preset quantity cluster at random in multiple sample The heart, it is random to determine the second preset quantity sample in the sample in multiple samples in addition to the first preset quantity target sample This, then the second preset quantity sample is divided in corresponding cluster.Then the average value conduct of each sample in each cluster is calculated First average value carries out the value of the cluster centre of the cluster according to the current value of first average value and the cluster centre of the cluster Update, until meeting preset condition.By aforementioned process, it is not necessary to be all updated, be dropped according to whole samples in each update Low computation complexity, and can avoid the problem that being easy to converge on local optimum to a certain extent.

Description of the drawings

It, below will be to needed in the embodiment attached in order to illustrate more clearly of the technical solution of the embodiment of the present application Figure is briefly described, it should be understood that the following drawings illustrates only some embodiments of the application, therefore is not construed as pair The restriction of range for those of ordinary skill in the art without creative efforts, can also be according to this A little attached drawings obtain other relevant attached drawings.

Fig. 1 is a kind of connection block diagram of terminal device provided by the embodiments of the present application；

Fig. 2 is a kind of flow diagram of clustering method provided by the embodiments of the present application；

The another flow diagram of Fig. 3 clustering methods provided by the embodiments of the present application；

Fig. 4 is a kind of functional block diagram of clustering apparatus provided by the embodiments of the present application.

Icon：100- terminal devices；110- memories；120- processors；130- communication units；200- clustering apparatus； 210- data read modules；The centers 220- determining module；230- samples choose module；240- division modules；250- first updates Module；The second update modules of 260-.

Specific implementation mode

To keep the purpose, technical scheme and advantage of the embodiment of the present application clearer, below in conjunction with the embodiment of the present application In attached drawing, technical solutions in the embodiments of the present application is clearly and completely described, it is clear that described embodiment is Some embodiments of the present application, instead of all the embodiments.The application being usually described and illustrated herein in the accompanying drawings is implemented The component of example can be arranged and be designed with a variety of different configurations.

Therefore, below the detailed description of the embodiments herein to providing in the accompanying drawings be not intended to limit it is claimed Scope of the present application, but be merely representative of the selected embodiment of the application.Based on the embodiment in the application, this field is common The every other embodiment that technical staff is obtained without creative efforts belongs to the model of the application protection It encloses.

It should be noted that：Similar label and letter indicate similar terms in following attached drawing, therefore, once a certain Xiang Yi It is defined, then it further need not be defined and explained in subsequent attached drawing in a attached drawing.

There are many clustering methods to be suitble to use under big data environment at present, but K-means is wherein relatively handy It is a kind of.Herein by taking the scene based on Spark frames as an example, the K-means in some embodiments is illustrated：

(1) it is randomly determined k sample in given sample, each in this k sample represents gathering for cluster Class center, wherein k is preassigned cluster classification number, i.e., the quantity of the required cluster divided；

(2) in the remaining sample in given sample, each sample is calculated at a distance from the cluster centre of each cluster, by this Sample is divided into the cluster representated by the nearest cluster centre with the sample；

(3) average value for calculating each sample in each cluster, the average value is updated to by the value of the cluster centre of the cluster.

(4) (2) are constantly repeated to (3) step, until meeting corresponding conditions.

However, through inventor the study found that in the above method, in order to ensure to restrain, each iteration (update) can all make With whole samples directly as the learning sample of the new cluster centre of cluster, exactly K-means can be caused to be easy to converge in this way Local optimum.Also, the iteration at each cluster center is all directly to use new value, and ignore update completely in above-mentioned mode The value of cluster centre before, this causes K-menas to be easy to receive the interference of some outliers.In addition, larger in data volume In the case of, the time complexity of the K-means of the above method is still higher.

Based on this, a kind of clustering method of the embodiment of the present application proposition and device are set applied to the terminal based on Spark frames It is standby, to improve the above problem at least partly.It should be appreciated that clustering method provided in this embodiment and device can also be applied to Other frames such as MapReduce, the present embodiment are not limited system.The content will be described in detail below.

As shown in Figure 1, being a kind of block diagram of terminal device 100 provided by the embodiments of the present application, terminal device 100 Can be the arbitrary electronic equipment with data processing function and communication function, for example, with server specifically, terminal can be worked as When equipment 100 is server, either an individual server, can also be the server set of multiple server compositions Group, the present embodiment are without limitation.

In the present embodiment, Spark corresponding operation platforms can be installed, wherein Spark is in terminal device 100 The computing engines for aiming at large-scale data processing and the Universal-purpose quick of design, are the general of the class Hadoop MapReduce to increase income Parallel frame.

Terminal device 100 includes clustering apparatus 200, memory 110, processor 120 and communication unit 130.

In the present embodiment, memory 110, processor 120 and 130 each element of communication unit are direct or indirect between each other Ground is electrically connected, to realize the transmission or interaction of data.For example, these elements between each other can be total by one or more communication Line or signal wire, which are realized, to be electrically connected.Clustering apparatus 200 include it is at least one can be with the shape of software or firmware (firmware) Formula is stored in memory 110 or is solidificated in the software in the operating system (Operating System, OS) of terminal device 100 Function module.Processor 120 is for executing the executable module stored in memory 110, for example, included by clustering apparatus 200 Software function module and computer program etc..

Wherein, memory 110 may be, but not limited to, random access memory (Random Access Memory, RAM), read-only memory (Read Only Memory, ROM), programmable read only memory (Programmable Read-Only Memory, PROM), erasable read-only memory (Erasable Programmable Read-Only Memory, EPROM), Electricallyerasable ROM (EEROM) (Electric Erasable Programmable Read-Only Memory, EEPROM) etc..

Processor 120 can be a kind of IC chip, have signal handling capacity.The processor 120 can also It is general processor, such as central processing unit (Central Processing Unit, CPU), network processing unit (Network Processor, NP), microprocessor etc.；Can also be digital signal processor (DSP)), application-specific integrated circuit (ASIC), scene Programmable gate array (FPGA) either other programmable logic device, discrete gate or transistor logic, discrete hardware group Part；Processor 120 can also be any conventional processor, may be implemented or executes and is in the embodiment of the present invention disclosed each Method, step and logic diagram.

Communication unit 130 is used to establish the communication connection between the terminal device 100 and other equipment, to realize data Interaction or communication.

It should be appreciated that structure shown in FIG. 1 is only to illustrate, terminal device 100 can also include than shown in Fig. 1 more or more Few component, it is possible to have with configuration entirely different shown in Fig. 1.It is worth noting that, each component shown in FIG. 1 can herein To be realized with hardware, software, or its combination, the present embodiment is without limitation.

As shown in Fig. 2, being a kind of flow diagram of clustering method provided by the embodiments of the present application, which can Applied to terminal device 100 shown in FIG. 1.

Step S201, read by the Spark include multiple samples pending data, and by the pending data Form RDD objects.

Wherein, RDD is the distinctive data models of Spark, and the multiple sample refers to the previously given sample for cluster This.

Step S202 is randomly determined the first preset quantity target sample in the multiple sample, by described first Cluster centre of the preset quantity sample respectively as the first preset quantity cluster.

Wherein, first preset quantity is predetermined required cluster categorical measure, i.e., the required cluster divided Quantity.

Step S203, in the sample in the multiple sample in addition to the first preset quantity target sample, with Determine the second preset quantity sample to machine.

Wherein, second preset quantity is the preset multiple of first preset quantity.In the present embodiment, described Two preset quantities are predetermined value, indicate need to be used for the cluster centre of cluster more using how many sample in each iteration Newly, the second preset quantity typically 100-1000.

For example, it is assumed that the first preset quantity is k, preset multiple γ, then the second preset quantity is k* γ.

Step S204 calculates the sample and each cluster for each sample in the second preset quantity sample The distance of cluster centre, the cluster which being divided into where nearest cluster centre at a distance from the sample.

By taking above-mentioned second preset quantity is k* γ as an example, for each in k* γ samples, sample arrival is calculated The distance of the current cluster centre of each cluster, and determine the cluster corresponding to the cluster centre for the distance minimum for wherein reaching the sample It is assigned to the target cluster as target cluster, then by the sample.

Step S205 calculates the average value of each sample in each cluster as the first average value, according to first average value And the current value of the cluster centre of the cluster is updated the value of the cluster centre of the cluster.

In the present embodiment, for each cluster, after obtaining the first average value of the cluster, in conjunction with the cluster centre of the cluster Current value the cluster centre of the cluster is updated.

Optionally, in the present embodiment, cluster centre turnover rate can be preset in terminal device 100.Accordingly, step S205 may comprise steps of：

Assuming that above-mentioned cluster centre turnover rate is α, the current value of the cluster centre of the cluster is old_center, is calculated The first average value arrived is new_center, then as the new value center (the i.e. described weighted average) of the cluster centre of the cluster It can be calculate by the following formula to obtain：

Center=α * new_center+ (1- α) * old_center

Wherein, cluster centre turnover rate can be controlled according to actual conditions, so as to avoid being based on part sample Carry out the excessively poor caused influence of certain more new effects.

Step S206, when being unsatisfactory for preset condition after the updating, again except described first is pre- in the multiple sample If in the sample except quantity target sample, determining the second preset quantity sample, and the second present count that will be redefined Amount sample is divided into corresponding cluster and is updated with the value of the cluster centre to each cluster.

In other words, in the present embodiment, terminal device 100 can repeat step S203- steps S205 until meeting institute State preset condition.

A specific example is given below, to be elaborated to step S203- steps S206.

Assuming that having been repeated 1000 times to step S203- steps S206 in total in this example, wherein ith calculates Obtained the first average value (i.e. above-mentioned new_center) is v_i, what ith was acquired by way of calculating weighted average The new value (i.e. above-mentioned center) of cluster centre is V_i, it is assumed that cluster centre turnover rate α is 0.8, then V_iFollowing formula meter can be passed through It obtains：

V_i=0.8*v_i+0.2*v_i-1

It is possible thereby to learn, repeat what above-mentioned steps S203- steps S205 was updated with the value to cluster centre Process realizes really by way of exponent-weighted average, in other words, in clustering method provided in this embodiment, Influence of the newer value to current results before all having been fully considered when updating each time, rather than treat in isolation or carry out every Primary update, so, it is possible relatively reliable to avoid the problem that local convergence.

Optionally, in the present embodiment, the preset condition may include：

Wherein, the predetermined angle and the preset times can flexibly be set according to actual conditions, the present embodiment It is without limitation.

In addition, inventor also found if be iterated more to the cluster centre of each cluster only with part sample always Newly, it may be difficult to ensure Complete Convergence, so after reaching preset condition according to mode above-mentioned, can further use complete Portion's sample is updated the cluster centre of each cluster.On the basis of executing abovementioned steps, need to only be based on again whole samples into The update of a little number of row (being typically less than 10 times), time complexity are negligible.

Based on this, as shown in figure 3, clustering method provided in this embodiment can also include step S301 and step S302.

Step S301, in meeting the preset condition any one when, according to the multiple sample to the cluster of each cluster Center is updated.

Wherein, step S301 can be realized by following sub-step：

Wherein, the second average value refers to working as according to whole samples (each sample in the multiple sample) to each cluster Cluster centre is updated the average value of each sample in each cluster being calculated later.

Step S302, when being unsatisfactory for the preset condition after update, again according to the multiple sample to each cluster Cluster centre is updated.

In the present embodiment, the iteration update carried out based on whole samples is proceeded to the when of meeting the preset condition and stopped Only.Based on above-mentioned analysis it is found that the update less than 10 times would generally be carried out, time complexity is negligible.

As shown in figure 4, the embodiment of the present application also provides a kind of clustering apparatus 200, it is applied to terminal device shown in FIG. 1 100.The clustering apparatus 200 includes data read module 210, center determining module 220, sample selection module 230, divides mould Block 240 and the first update module 250.

Wherein, data read module 210 is used to pass through the pending number that Spark frames reading includes multiple samples According to, and the pending data is formed into RDD objects.

In the present embodiment, the description as described in data read module 210 is specifically referred to the detailed of step S201 shown in Fig. 2 Thin description, i.e. step S201 can be executed by data read module 210.

Center determining module 220 is used to be randomly determined the first preset quantity target sample in the multiple sample, Using the first preset quantity sample as the cluster centre of the first preset quantity cluster.

In the present embodiment, the description as described in center determining module 220 is specifically referred to the detailed of step S202 shown in Fig. 2 Thin description, i.e. step S202 can be executed by center determining module 220.

Sample is chosen module 230 and is used in the multiple sample in addition to the first preset quantity target sample In sample, it is randomly determined the second preset quantity sample, wherein second preset quantity is first preset quantity Preset multiple.

In the present embodiment, the description as described in sample chooses module 230 is specifically referred to the detailed of step S203 shown in Fig. 2 Thin description, i.e. step S203 can choose module 230 by sample and execute.

Division module 240 is used for for each sample in the second preset quantity sample, calculate the sample and The distance of the cluster centre of each cluster, the cluster which being divided into where nearest cluster centre at a distance from the sample.

In the present embodiment, the description as described in division module 240 specifically refers to retouching in detail to step S204 shown in Fig. 2 It states, i.e. step S204 can be executed by division module 240.

First update module 250 is used to calculate the average value of each sample in each cluster as the first average value, according to described The current value of the cluster centre of first average value and the cluster is updated the value of the cluster centre of the cluster, is discontented with after the updating When sufficient preset condition, again in the sample in the multiple sample in addition to the first preset quantity target sample, really Fixed second preset quantity sample, and the second preset quantity sample redefined is divided into corresponding cluster with to each cluster The value of cluster centre is updated.

In the present embodiment, the description as described in the first update module 250 is specifically referred to the detailed of step S205 shown in Fig. 2 Thin description, i.e. step S205 can be executed by the first update module 250.

Wherein, the preset condition may include：

Optionally, terminal device 100 can in be preset with cluster centre turnover rate.

Accordingly, the first update module 250 according to the current value of first average value and the cluster centre of the cluster to this The mode that the value of the cluster centre of cluster is updated can be：

Optionally, the clustering apparatus 200 can also include the second update module 250.

Wherein, the second update module 260 be used in meeting the preset condition any one when, according to the multiple sample This is updated the cluster centre of each cluster, when being unsatisfactory for the preset condition after update, again according to the multiple sample This is updated the cluster centre of each cluster.

In the present embodiment, the description as described in the second update module 260 specifically refers in the above to being walked shown in Fig. 3 The detailed description of rapid S301 and step S302, i.e. step S301 and step S302 can be executed by the second update module 260.

Optionally, in the present embodiment, the second update module 260 according to the multiple sample to the cluster centre of each cluster into The newer mode of row can be：

In conclusion a kind of clustering method provided by the embodiments of the present application and device, are applied to the end based on Spark frames End equipment.Terminal device includes the pending data of multiple samples by the reading of Spark frames, and the data read are formed RDD objects.Determine the first preset quantity target sample respectively as the first preset quantity cluster at random in multiple sample Cluster centre, it is random to determine that second is default in the sample in multiple samples in addition to the first preset quantity target sample Quantity sample, then the second preset quantity sample is divided in corresponding cluster.Then the flat of each sample in each cluster is calculated Mean value is as the first average value, according to the current value of first average value and the cluster centre of the cluster to the cluster centre of the cluster Value be updated, until meet preset condition.By aforementioned process, it is not necessary to all be carried out according to whole samples in each update Update, reduces computation complexity, and can avoid the problem that being easy to converge on local optimum to a certain extent.

In embodiment provided herein, it should be understood that disclosed device and method, it can also be by other Mode realize.The apparatus embodiments described above are merely exemplary, for example, the flow chart and block diagram in attached drawing are shown According to the device, the architectural framework in the cards of method and computer program product, function of multiple embodiments of the application And operation.In this regard, each box in flowchart or block diagram can represent one of a module, section or code Point, a part for the module, section or code includes one or more for implementing the specified logical function executable Instruction.It should also be noted that at some as in the realization method replaced, the function of being marked in box can also be attached to be different from The sequence marked in figure occurs.For example, two continuous boxes can essentially be basically executed in parallel, they also may be used sometimes To execute in the opposite order, this is depended on the functions involved.It is also noted that each of block diagram and or flow chart The combination of box in box and block diagram and or flow chart, function or the dedicated of action are based on as defined in execution The system of hardware is realized, or can be realized using a combination of dedicated hardware and computer instructions.

In addition, each function module in each embodiment of the application can integrate to form an independent portion Point, can also be modules individualism, can also two or more modules be integrated to form an independent part.

It, can be with if the function is realized and when sold or used as an independent product in the form of software function module It is stored in a computer read/write memory medium.Based on this understanding, the technical solution of the application is substantially in other words The part of the part that contributes to existing technology or the technical solution can be expressed in the form of software products, the meter Calculation machine software product is stored in a storage medium, including some instructions are used so that a computer equipment (can be People's computer, server or network equipment etc.) execute each embodiment the method for the application all or part of step. And storage medium above-mentioned includes：USB flash disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), arbitrary access are deposited The various media that can store program code such as reservoir (RAM, Random Access Memory), magnetic disc or CD.

It should be noted that herein, relational terms such as first and second and the like are used merely to a reality Body or operation are distinguished with another entity or operation, are deposited without necessarily requiring or implying between these entities or operation In any actual relationship or order or sequence.Moreover, the terms "include", "comprise" or its any other variant are intended to Non-exclusive inclusion, so that the process, method, article or equipment including a series of elements is not only wanted including those Element, but also include other elements that are not explicitly listed, or further include for this process, method, article or equipment Intrinsic element.In the absence of more restrictions, the element limited by sentence "including a ...", it is not excluded that There is also other identical elements in process, method, article or equipment including the element.

The above, the only specific implementation mode of the application, but the protection domain of the application is not limited thereto, it is any Those familiar with the art can easily think of the change or the replacement in the technical scope that the application discloses, and should all contain It covers within the protection domain of the application.Therefore, the protection domain of the application should be subject to the protection scope in claims.

Claims

1. a kind of clustering method, which is characterized in that it is applied to the terminal device based on Spark, the method includes：

Include the pending data of multiple samples by Spark readings, and the pending data is formed into RDD objects；

The first preset quantity target sample is randomly determined in the multiple sample, by the first preset quantity sample Respectively as the cluster centre of the first preset quantity cluster；

In sample in the multiple sample in addition to the first preset quantity target sample, it is pre- to be randomly determined second If quantity sample, wherein second preset quantity is the preset multiple of first preset quantity；

For each sample in the second preset quantity sample, calculate the sample and the cluster centre of each cluster away from From the cluster being divided into the sample where nearest cluster centre at a distance from the sample；

The average value of each sample in each cluster is calculated as the first average value, according in the cluster of first average value and the cluster The current value of the heart is updated the value of the cluster centre of the cluster；

When being unsatisfactory for preset condition after the updating, the first preset quantity target sample is removed in the multiple sample again In sample except this, the second preset quantity sample is determined, and the second preset quantity sample redefined is divided into Corresponding cluster is updated with the value of the cluster centre to each cluster.

2. clustering method according to claim 1, which is characterized in that the preset condition includes：

The number being updated to the value of the cluster centre of each cluster according to identified second preset quantity sample reaches default Number.

3. clustering method according to claim 2, which is characterized in that be preset with cluster centre update in the terminal device Rate；

The value of the cluster centre of the cluster is updated according to the current value of first average value and the cluster centre of the cluster, is wrapped It includes：

With the weight that the cluster centre turnover rate is first average value, with 1 and the difference of the cluster centre turnover rate For the weight of the current value of the cluster centre of the cluster, adding for the current value of first average value and the cluster centre of the cluster is calculated Weight average value, the weighted average that the value of the cluster centre of the cluster is updated to.

4. clustering method according to claim 2 or 3, which is characterized in that the method further includes：

In meeting the preset condition any one when, the cluster centre of each cluster is updated according to the multiple sample；

When being unsatisfactory for the preset condition after update, the cluster centre of each cluster is carried out more according to the multiple sample again Newly.

5. clustering method according to claim 4, which is characterized in that according to the multiple sample to the cluster centre of each cluster It is updated, including：

For each sample in the multiple sample, the sample is calculated at a distance from the cluster centre of each cluster, which is drawn Assign to the cluster where nearest cluster centre at a distance from the sample；

The average value of each sample in each cluster is calculated as the second average value, and by the value of the cluster centre of the cluster be updated to this Two average values.

6. a kind of clustering apparatus, which is characterized in that be applied to the terminal device based on Spark frames, described device includes：

Data read module for including the pending data of multiple samples by Spark frames reading, and is waited for described It handles data and forms RDD objects；

Center determining module will be described for being randomly determined the first preset quantity target sample in the multiple sample Cluster centre of the first preset quantity sample respectively as the first preset quantity cluster；

Sample chooses module, for the sample in the multiple sample in addition to the first preset quantity target sample In, it is randomly determined the second preset quantity sample, wherein second preset quantity is the default of first preset quantity Multiple；

Division module, for for each sample in the second preset quantity sample, calculating the sample and each cluster The distance of cluster centre, the cluster which being divided into where nearest cluster centre at a distance from the sample；

First update module, it is flat according to described first for calculating the average value of each sample in each cluster as the first average value The current value of the cluster centre of mean value and the cluster is updated the value of the cluster centre of the cluster, is unsatisfactory for presetting after the updating When condition, again in the sample in the multiple sample in addition to the first preset quantity target sample, second is determined Preset quantity sample, and the second preset quantity sample redefined is divided into corresponding cluster in the cluster to each cluster The value of the heart is updated.

7. clustering apparatus according to claim 6, which is characterized in that the preset condition includes：

8. clustering apparatus according to claim 7, which is characterized in that be preset with cluster centre update in the terminal device Rate；First update module is according to the current value of first average value and the cluster centre of the cluster to the cluster centre of the cluster The mode that is updated of value be：

9. clustering apparatus according to claim 7 or 8, which is characterized in that the clustering apparatus further includes：

Second update module, in meeting the preset condition any one when, according to the multiple sample to each cluster Cluster centre is updated, when being unsatisfactory for the preset condition after update, again according to the multiple sample to each cluster Cluster centre is updated.

10. clustering apparatus according to claim 9, which is characterized in that second update module is according to the multiple sample This mode being updated to the cluster centre of each cluster is：