[go: up one dir, main page]

CN108805174A - clustering method and device - Google Patents

clustering method and device Download PDF

Info

Publication number
CN108805174A
CN108805174A CN201810482717.2A CN201810482717A CN108805174A CN 108805174 A CN108805174 A CN 108805174A CN 201810482717 A CN201810482717 A CN 201810482717A CN 108805174 A CN108805174 A CN 108805174A
Authority
CN
China
Prior art keywords
cluster
sample
cluster centre
value
centre
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810482717.2A
Other languages
Chinese (zh)
Other versions
CN108805174B (en
Inventor
姚佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Hui He Science And Technology Development Co Ltd
Original Assignee
Guangdong Hui He Science And Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Hui He Science And Technology Development Co Ltd filed Critical Guangdong Hui He Science And Technology Development Co Ltd
Priority to CN201810482717.2A priority Critical patent/CN108805174B/en
Publication of CN108805174A publication Critical patent/CN108805174A/en
Application granted granted Critical
Publication of CN108805174B publication Critical patent/CN108805174B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A kind of clustering method of the embodiment of the present application offer and device, method include:Read include multiple samples pending data, and by pending data RDD objects;The first preset quantity target sample is randomly determined in multiple samples, using the first preset quantity sample as the cluster centre of the first preset quantity cluster;In sample in multiple samples in addition to target sample, it is randomly determined the second preset quantity sample;For each in the second preset quantity sample, the sample is calculated at a distance from the cluster centre of each cluster, the cluster which being divided into where nearest cluster centre at a distance from the sample;The average value of each sample in each cluster is calculated as the first average value, the value of the cluster centre of the cluster is updated according to the current value of first average value and the cluster centre of the cluster, until meeting preset condition.The result that so, it is possible relatively effectively to avoid is local optimum.

Description

Clustering method and device
Technical field
This application involves big data technical fields, in particular to a kind of clustering method and device.
Background technology
Clustering method has extremely wide application in various fields such as machine learning, data minings.Conventional cluster side In method, it can be divided into hierarchical clustering, division formula cluster according to certain characteristic, gather based on Density Clustering, based on network clustering, core Class, spectral clustering etc. are a variety of.Accordingly, hierarchical clustering has brich, rock, Chameleon etc., division formula cluster have K-means and Its mutation etc., Density Clustering have dbscan, OPTICS etc., Grid Clustering to have sting, clique etc..In these clustering methods, It is most common to be no more than K-means.Although K-means has a priori assumption such as circular distribution, preassigns cluster classification number Deng defect, but its relatively low computation complexity still allow its become popular algorithm.But in number to be treated According to measure it is huge in the case of, the computation complexity of existing K-means is still larger, and is easy to be easily trapped into local optimum Problem.
Invention content
In view of this, the purpose of the application includes a kind of clustering method of offer and device, to improve the above problem.
In order to achieve the above object, the embodiment of the present application adopts the following technical scheme that:
In a first aspect, the embodiment of the present application provides a kind of clustering method, it is applied to the terminal device based on Spark frames, The method includes:
Read by the Spark frames include multiple samples pending data, and by RDD pairs of the pending data As;
The first preset quantity target sample is randomly determined in the multiple sample, by first preset quantity Cluster centre of the sample respectively as the first preset quantity cluster;
In sample in the multiple sample in addition to the first preset quantity target sample, it is randomly determined Two preset quantity samples, wherein second preset quantity is the preset multiple of first preset quantity;
For each sample in the second preset quantity sample, the sample and the cluster centre of each cluster are calculated Distance, the cluster which being divided into where nearest cluster centre at a distance from the sample;
The average value of each sample in each cluster is calculated as the first average value, according to the poly- of first average value and the cluster The current value at class center is updated the value of the cluster centre of the cluster;
When being unsatisfactory for preset condition after the updating, the first preset quantity mesh is removed in the multiple sample again In sample except standard specimen sheet, the second preset quantity sample is determined, and the second preset quantity sample redefined is drawn Corresponding cluster is assigned to be updated with the value of the cluster centre to each cluster.
Optionally, the clustering method provided according to the embodiment of the present application first aspect, the preset condition include:
Value and the average value of the angle of updated value before the cluster centre update of each cluster are less than predetermined angle;
The number being updated to the value of the cluster centre of each cluster according to identified second preset quantity sample reaches Preset times.
Optionally, the clustering method provided according to the embodiment of the present application first aspect is preset in the terminal device poly- Class center turnover rate;
The value of the cluster centre of the cluster is carried out more according to the current value of first average value and the cluster centre of the cluster Newly, including:
With the weight that the cluster centre turnover rate is first average value, with 1 and the cluster centre turnover rate Difference is the weight of the current value of the cluster centre of the cluster, calculates the current value of first average value and the cluster centre of the cluster Weighted average, the weighted average that the value of the cluster centre of the cluster is updated to.
Optionally, the clustering method provided according to the embodiment of the present application first aspect, the method further include:
In meeting the preset condition any one when, the cluster centre of each cluster is carried out more according to the multiple sample Newly;
When being unsatisfactory for the preset condition after update, again according to the multiple sample to the cluster centre of each cluster into Row update.
Optionally, the clustering method provided according to the embodiment of the present application first aspect, according to the multiple sample to each cluster Cluster centre be updated, including:
For each sample in the multiple sample, the sample is calculated at a distance from the cluster centre of each cluster, by the sample Originally the cluster being divided into where nearest cluster centre at a distance from the sample;
The average value of each sample in each cluster is calculated to be updated to as the second average value, and by the value of the cluster centre of the cluster Second average value.
Second aspect, the embodiment of the present application also provide a kind of clustering apparatus, are set applied to the terminal based on Spark frames Standby, described device includes:
Data read module, for including the pending data of multiple samples by Spark frames reading, and by institute State pending data RDD objects;
Center determining module will for being randomly determined the first preset quantity target sample in the multiple sample Cluster centre of the first preset quantity sample respectively as the first preset quantity cluster;
Sample chooses module, for the sample in the multiple sample in addition to the first preset quantity target sample In this, it is randomly determined the second preset quantity sample, wherein second preset quantity is the pre- of first preset quantity If multiple;
First division module, for for each sample in the second preset quantity sample, calculating the sample At a distance from the cluster centre of each cluster, the cluster which is divided into where nearest cluster centre at a distance from the sample;
First update module, for calculating the average value of each sample in each cluster as the first average value, according to described The current value of the cluster centre of one average value and the cluster is updated the value of the cluster centre of the cluster, is unsatisfactory for after the updating When preset condition, again in the sample in the multiple sample in addition to the first preset quantity target sample, determine Second preset quantity sample, and the second preset quantity sample redefined is divided into corresponding cluster to gather to each cluster The value at class center is updated.
Optionally, the clustering apparatus provided according to the embodiment of the present application second aspect, the preset condition include:
Value and the average value of the angle of updated value before the cluster centre update of each cluster are less than predetermined angle;
The number being updated to the value of the cluster centre of each cluster according to identified second preset quantity sample reaches Preset times.
Optionally, the clustering apparatus provided according to the embodiment of the present application second aspect, 8. is according to claim 7 poly- Class device, which is characterized in that cluster centre turnover rate is preset in the terminal device;First update module is according to The mode that the current value of the cluster centre of first average value and the cluster is updated the value of the cluster centre of the cluster is:
With the weight that the cluster centre turnover rate is first average value, with 1 and the cluster centre turnover rate Difference is the weight of the current value of the cluster centre of the cluster, calculates the current value of first average value and the cluster centre of the cluster Weighted average, the weighted average that the value of the cluster centre of the cluster is updated to.
Optionally, the clustering apparatus provided according to the embodiment of the present application second aspect, the clustering apparatus further include:
Second update module, in meeting the preset condition any one when, according to the multiple sample to each The cluster centre of cluster is updated, when being unsatisfactory for the preset condition after update, again according to the multiple sample to each The cluster centre of cluster is updated.
Optionally, the clustering apparatus provided according to the embodiment of the present application second aspect, second update module is according to institute Stating the mode that multiple samples are updated the cluster centre of each cluster is:
For each sample in the multiple sample, the sample is calculated at a distance from the cluster centre of each cluster, by the sample Originally the cluster being divided into where nearest cluster centre at a distance from the sample;
The average value of each sample in each cluster is calculated to be updated to as the second average value, and by the value of the cluster centre of the cluster Second average value.
Compared to existing technologies, the embodiment of the present application has the advantages that:
A kind of clustering method and device provided by the embodiments of the present application are applied to the terminal device based on Spark frames.Eventually The data read are formed RDD objects by end equipment by the pending data that the reading of Spark frames includes multiple samples. Determine the first preset quantity target sample respectively as in the cluster of the first preset quantity cluster at random in multiple sample The heart, it is random to determine the second preset quantity sample in the sample in multiple samples in addition to the first preset quantity target sample This, then the second preset quantity sample is divided in corresponding cluster.Then the average value conduct of each sample in each cluster is calculated First average value carries out the value of the cluster centre of the cluster according to the current value of first average value and the cluster centre of the cluster Update, until meeting preset condition.By aforementioned process, it is not necessary to be all updated, be dropped according to whole samples in each update Low computation complexity, and can avoid the problem that being easy to converge on local optimum to a certain extent.
Description of the drawings
It, below will be to needed in the embodiment attached in order to illustrate more clearly of the technical solution of the embodiment of the present application Figure is briefly described, it should be understood that the following drawings illustrates only some embodiments of the application, therefore is not construed as pair The restriction of range for those of ordinary skill in the art without creative efforts, can also be according to this A little attached drawings obtain other relevant attached drawings.
Fig. 1 is a kind of connection block diagram of terminal device provided by the embodiments of the present application;
Fig. 2 is a kind of flow diagram of clustering method provided by the embodiments of the present application;
The another flow diagram of Fig. 3 clustering methods provided by the embodiments of the present application;
Fig. 4 is a kind of functional block diagram of clustering apparatus provided by the embodiments of the present application.
Icon:100- terminal devices;110- memories;120- processors;130- communication units;200- clustering apparatus; 210- data read modules;The centers 220- determining module;230- samples choose module;240- division modules;250- first updates Module;The second update modules of 260-.
Specific implementation mode
To keep the purpose, technical scheme and advantage of the embodiment of the present application clearer, below in conjunction with the embodiment of the present application In attached drawing, technical solutions in the embodiments of the present application is clearly and completely described, it is clear that described embodiment is Some embodiments of the present application, instead of all the embodiments.The application being usually described and illustrated herein in the accompanying drawings is implemented The component of example can be arranged and be designed with a variety of different configurations.
Therefore, below the detailed description of the embodiments herein to providing in the accompanying drawings be not intended to limit it is claimed Scope of the present application, but be merely representative of the selected embodiment of the application.Based on the embodiment in the application, this field is common The every other embodiment that technical staff is obtained without creative efforts belongs to the model of the application protection It encloses.
It should be noted that:Similar label and letter indicate similar terms in following attached drawing, therefore, once a certain Xiang Yi It is defined, then it further need not be defined and explained in subsequent attached drawing in a attached drawing.
There are many clustering methods to be suitble to use under big data environment at present, but K-means is wherein relatively handy It is a kind of.Herein by taking the scene based on Spark frames as an example, the K-means in some embodiments is illustrated:
(1) it is randomly determined k sample in given sample, each in this k sample represents gathering for cluster Class center, wherein k is preassigned cluster classification number, i.e., the quantity of the required cluster divided;
(2) in the remaining sample in given sample, each sample is calculated at a distance from the cluster centre of each cluster, by this Sample is divided into the cluster representated by the nearest cluster centre with the sample;
(3) average value for calculating each sample in each cluster, the average value is updated to by the value of the cluster centre of the cluster.
(4) (2) are constantly repeated to (3) step, until meeting corresponding conditions.
However, through inventor the study found that in the above method, in order to ensure to restrain, each iteration (update) can all make With whole samples directly as the learning sample of the new cluster centre of cluster, exactly K-means can be caused to be easy to converge in this way Local optimum.Also, the iteration at each cluster center is all directly to use new value, and ignore update completely in above-mentioned mode The value of cluster centre before, this causes K-menas to be easy to receive the interference of some outliers.In addition, larger in data volume In the case of, the time complexity of the K-means of the above method is still higher.
Based on this, a kind of clustering method of the embodiment of the present application proposition and device are set applied to the terminal based on Spark frames It is standby, to improve the above problem at least partly.It should be appreciated that clustering method provided in this embodiment and device can also be applied to Other frames such as MapReduce, the present embodiment are not limited system.The content will be described in detail below.
As shown in Figure 1, being a kind of block diagram of terminal device 100 provided by the embodiments of the present application, terminal device 100 Can be the arbitrary electronic equipment with data processing function and communication function, for example, with server specifically, terminal can be worked as When equipment 100 is server, either an individual server, can also be the server set of multiple server compositions Group, the present embodiment are without limitation.
In the present embodiment, Spark corresponding operation platforms can be installed, wherein Spark is in terminal device 100 The computing engines for aiming at large-scale data processing and the Universal-purpose quick of design, are the general of the class Hadoop MapReduce to increase income Parallel frame.
Terminal device 100 includes clustering apparatus 200, memory 110, processor 120 and communication unit 130.
In the present embodiment, memory 110, processor 120 and 130 each element of communication unit are direct or indirect between each other Ground is electrically connected, to realize the transmission or interaction of data.For example, these elements between each other can be total by one or more communication Line or signal wire, which are realized, to be electrically connected.Clustering apparatus 200 include it is at least one can be with the shape of software or firmware (firmware) Formula is stored in memory 110 or is solidificated in the software in the operating system (Operating System, OS) of terminal device 100 Function module.Processor 120 is for executing the executable module stored in memory 110, for example, included by clustering apparatus 200 Software function module and computer program etc..
Wherein, memory 110 may be, but not limited to, random access memory (Random Access Memory, RAM), read-only memory (Read Only Memory, ROM), programmable read only memory (Programmable Read-Only Memory, PROM), erasable read-only memory (Erasable Programmable Read-Only Memory, EPROM), Electricallyerasable ROM (EEROM) (Electric Erasable Programmable Read-Only Memory, EEPROM) etc..
Processor 120 can be a kind of IC chip, have signal handling capacity.The processor 120 can also It is general processor, such as central processing unit (Central Processing Unit, CPU), network processing unit (Network Processor, NP), microprocessor etc.;Can also be digital signal processor (DSP)), application-specific integrated circuit (ASIC), scene Programmable gate array (FPGA) either other programmable logic device, discrete gate or transistor logic, discrete hardware group Part;Processor 120 can also be any conventional processor, may be implemented or executes and is in the embodiment of the present invention disclosed each Method, step and logic diagram.
Communication unit 130 is used to establish the communication connection between the terminal device 100 and other equipment, to realize data Interaction or communication.
It should be appreciated that structure shown in FIG. 1 is only to illustrate, terminal device 100 can also include than shown in Fig. 1 more or more Few component, it is possible to have with configuration entirely different shown in Fig. 1.It is worth noting that, each component shown in FIG. 1 can herein To be realized with hardware, software, or its combination, the present embodiment is without limitation.
As shown in Fig. 2, being a kind of flow diagram of clustering method provided by the embodiments of the present application, which can Applied to terminal device 100 shown in FIG. 1.
Step S201, read by the Spark include multiple samples pending data, and by the pending data Form RDD objects.
Wherein, RDD is the distinctive data models of Spark, and the multiple sample refers to the previously given sample for cluster This.
Step S202 is randomly determined the first preset quantity target sample in the multiple sample, by described first Cluster centre of the preset quantity sample respectively as the first preset quantity cluster.
Wherein, first preset quantity is predetermined required cluster categorical measure, i.e., the required cluster divided Quantity.
Step S203, in the sample in the multiple sample in addition to the first preset quantity target sample, with Determine the second preset quantity sample to machine.
Wherein, second preset quantity is the preset multiple of first preset quantity.In the present embodiment, described Two preset quantities are predetermined value, indicate need to be used for the cluster centre of cluster more using how many sample in each iteration Newly, the second preset quantity typically 100-1000.
For example, it is assumed that the first preset quantity is k, preset multiple γ, then the second preset quantity is k* γ.
Step S204 calculates the sample and each cluster for each sample in the second preset quantity sample The distance of cluster centre, the cluster which being divided into where nearest cluster centre at a distance from the sample.
By taking above-mentioned second preset quantity is k* γ as an example, for each in k* γ samples, sample arrival is calculated The distance of the current cluster centre of each cluster, and determine the cluster corresponding to the cluster centre for the distance minimum for wherein reaching the sample It is assigned to the target cluster as target cluster, then by the sample.
Step S205 calculates the average value of each sample in each cluster as the first average value, according to first average value And the current value of the cluster centre of the cluster is updated the value of the cluster centre of the cluster.
In the present embodiment, for each cluster, after obtaining the first average value of the cluster, in conjunction with the cluster centre of the cluster Current value the cluster centre of the cluster is updated.
Optionally, in the present embodiment, cluster centre turnover rate can be preset in terminal device 100.Accordingly, step S205 may comprise steps of:
With the weight that the cluster centre turnover rate is first average value, with 1 and the cluster centre turnover rate Difference is the weight of the current value of the cluster centre of the cluster, calculates the current value of first average value and the cluster centre of the cluster Weighted average, the weighted average that the value of the cluster centre of the cluster is updated to.
Assuming that above-mentioned cluster centre turnover rate is α, the current value of the cluster centre of the cluster is old_center, is calculated The first average value arrived is new_center, then as the new value center (the i.e. described weighted average) of the cluster centre of the cluster It can be calculate by the following formula to obtain:
Center=α * new_center+ (1- α) * old_center
Wherein, cluster centre turnover rate can be controlled according to actual conditions, so as to avoid being based on part sample Carry out the excessively poor caused influence of certain more new effects.
Step S206, when being unsatisfactory for preset condition after the updating, again except described first is pre- in the multiple sample If in the sample except quantity target sample, determining the second preset quantity sample, and the second present count that will be redefined Amount sample is divided into corresponding cluster and is updated with the value of the cluster centre to each cluster.
In other words, in the present embodiment, terminal device 100 can repeat step S203- steps S205 until meeting institute State preset condition.
A specific example is given below, to be elaborated to step S203- steps S206.
Assuming that having been repeated 1000 times to step S203- steps S206 in total in this example, wherein ith calculates Obtained the first average value (i.e. above-mentioned new_center) is vi, what ith was acquired by way of calculating weighted average The new value (i.e. above-mentioned center) of cluster centre is Vi, it is assumed that cluster centre turnover rate α is 0.8, then ViFollowing formula meter can be passed through It obtains:
Vi=0.8*vi+0.2*vi-1
It is possible thereby to learn, repeat what above-mentioned steps S203- steps S205 was updated with the value to cluster centre Process realizes really by way of exponent-weighted average, in other words, in clustering method provided in this embodiment, Influence of the newer value to current results before all having been fully considered when updating each time, rather than treat in isolation or carry out every Primary update, so, it is possible relatively reliable to avoid the problem that local convergence.
Optionally, in the present embodiment, the preset condition may include:
Value and the average value of the angle of updated value before the cluster centre update of each cluster are less than predetermined angle;
The number being updated to the value of the cluster centre of each cluster according to identified second preset quantity sample reaches Preset times.
Wherein, the predetermined angle and the preset times can flexibly be set according to actual conditions, the present embodiment It is without limitation.
In addition, inventor also found if be iterated more to the cluster centre of each cluster only with part sample always Newly, it may be difficult to ensure Complete Convergence, so after reaching preset condition according to mode above-mentioned, can further use complete Portion's sample is updated the cluster centre of each cluster.On the basis of executing abovementioned steps, need to only be based on again whole samples into The update of a little number of row (being typically less than 10 times), time complexity are negligible.
Based on this, as shown in figure 3, clustering method provided in this embodiment can also include step S301 and step S302.
Step S301, in meeting the preset condition any one when, according to the multiple sample to the cluster of each cluster Center is updated.
Wherein, step S301 can be realized by following sub-step:
For each sample in the multiple sample, the sample is calculated at a distance from the cluster centre of each cluster, by the sample Originally the cluster being divided into where nearest cluster centre at a distance from the sample;
The average value of each sample in each cluster is calculated to be updated to as the second average value, and by the value of the cluster centre of the cluster Second average value.
Wherein, the second average value refers to working as according to whole samples (each sample in the multiple sample) to each cluster Cluster centre is updated the average value of each sample in each cluster being calculated later.
Step S302, when being unsatisfactory for the preset condition after update, again according to the multiple sample to each cluster Cluster centre is updated.
In the present embodiment, the iteration update carried out based on whole samples is proceeded to the when of meeting the preset condition and stopped Only.Based on above-mentioned analysis it is found that the update less than 10 times would generally be carried out, time complexity is negligible.
As shown in figure 4, the embodiment of the present application also provides a kind of clustering apparatus 200, it is applied to terminal device shown in FIG. 1 100.The clustering apparatus 200 includes data read module 210, center determining module 220, sample selection module 230, divides mould Block 240 and the first update module 250.
Wherein, data read module 210 is used to pass through the pending number that Spark frames reading includes multiple samples According to, and the pending data is formed into RDD objects.
In the present embodiment, the description as described in data read module 210 is specifically referred to the detailed of step S201 shown in Fig. 2 Thin description, i.e. step S201 can be executed by data read module 210.
Center determining module 220 is used to be randomly determined the first preset quantity target sample in the multiple sample, Using the first preset quantity sample as the cluster centre of the first preset quantity cluster.
In the present embodiment, the description as described in center determining module 220 is specifically referred to the detailed of step S202 shown in Fig. 2 Thin description, i.e. step S202 can be executed by center determining module 220.
Sample is chosen module 230 and is used in the multiple sample in addition to the first preset quantity target sample In sample, it is randomly determined the second preset quantity sample, wherein second preset quantity is first preset quantity Preset multiple.
In the present embodiment, the description as described in sample chooses module 230 is specifically referred to the detailed of step S203 shown in Fig. 2 Thin description, i.e. step S203 can choose module 230 by sample and execute.
Division module 240 is used for for each sample in the second preset quantity sample, calculate the sample and The distance of the cluster centre of each cluster, the cluster which being divided into where nearest cluster centre at a distance from the sample.
In the present embodiment, the description as described in division module 240 specifically refers to retouching in detail to step S204 shown in Fig. 2 It states, i.e. step S204 can be executed by division module 240.
First update module 250 is used to calculate the average value of each sample in each cluster as the first average value, according to described The current value of the cluster centre of first average value and the cluster is updated the value of the cluster centre of the cluster, is discontented with after the updating When sufficient preset condition, again in the sample in the multiple sample in addition to the first preset quantity target sample, really Fixed second preset quantity sample, and the second preset quantity sample redefined is divided into corresponding cluster with to each cluster The value of cluster centre is updated.
In the present embodiment, the description as described in the first update module 250 is specifically referred to the detailed of step S205 shown in Fig. 2 Thin description, i.e. step S205 can be executed by the first update module 250.
Wherein, the preset condition may include:
Value and the average value of the angle of updated value before the cluster centre update of each cluster are less than predetermined angle;
The number being updated to the value of the cluster centre of each cluster according to identified second preset quantity sample reaches Preset times.
Optionally, terminal device 100 can in be preset with cluster centre turnover rate.
Accordingly, the first update module 250 according to the current value of first average value and the cluster centre of the cluster to this The mode that the value of the cluster centre of cluster is updated can be:
With the weight that the cluster centre turnover rate is first average value, with 1 and the cluster centre turnover rate Difference is the weight of the current value of the cluster centre of the cluster, calculates the current value of first average value and the cluster centre of the cluster Weighted average, the weighted average that the value of the cluster centre of the cluster is updated to.
Optionally, the clustering apparatus 200 can also include the second update module 250.
Wherein, the second update module 260 be used in meeting the preset condition any one when, according to the multiple sample This is updated the cluster centre of each cluster, when being unsatisfactory for the preset condition after update, again according to the multiple sample This is updated the cluster centre of each cluster.
In the present embodiment, the description as described in the second update module 260 specifically refers in the above to being walked shown in Fig. 3 The detailed description of rapid S301 and step S302, i.e. step S301 and step S302 can be executed by the second update module 260.
Optionally, in the present embodiment, the second update module 260 according to the multiple sample to the cluster centre of each cluster into The newer mode of row can be:
For each sample in the multiple sample, the sample is calculated at a distance from the cluster centre of each cluster, by the sample Originally the cluster being divided into where nearest cluster centre at a distance from the sample;
The average value of each sample in each cluster is calculated to be updated to as the second average value, and by the value of the cluster centre of the cluster Second average value.
In conclusion a kind of clustering method provided by the embodiments of the present application and device, are applied to the end based on Spark frames End equipment.Terminal device includes the pending data of multiple samples by the reading of Spark frames, and the data read are formed RDD objects.Determine the first preset quantity target sample respectively as the first preset quantity cluster at random in multiple sample Cluster centre, it is random to determine that second is default in the sample in multiple samples in addition to the first preset quantity target sample Quantity sample, then the second preset quantity sample is divided in corresponding cluster.Then the flat of each sample in each cluster is calculated Mean value is as the first average value, according to the current value of first average value and the cluster centre of the cluster to the cluster centre of the cluster Value be updated, until meet preset condition.By aforementioned process, it is not necessary to all be carried out according to whole samples in each update Update, reduces computation complexity, and can avoid the problem that being easy to converge on local optimum to a certain extent.
In embodiment provided herein, it should be understood that disclosed device and method, it can also be by other Mode realize.The apparatus embodiments described above are merely exemplary, for example, the flow chart and block diagram in attached drawing are shown According to the device, the architectural framework in the cards of method and computer program product, function of multiple embodiments of the application And operation.In this regard, each box in flowchart or block diagram can represent one of a module, section or code Point, a part for the module, section or code includes one or more for implementing the specified logical function executable Instruction.It should also be noted that at some as in the realization method replaced, the function of being marked in box can also be attached to be different from The sequence marked in figure occurs.For example, two continuous boxes can essentially be basically executed in parallel, they also may be used sometimes To execute in the opposite order, this is depended on the functions involved.It is also noted that each of block diagram and or flow chart The combination of box in box and block diagram and or flow chart, function or the dedicated of action are based on as defined in execution The system of hardware is realized, or can be realized using a combination of dedicated hardware and computer instructions.
In addition, each function module in each embodiment of the application can integrate to form an independent portion Point, can also be modules individualism, can also two or more modules be integrated to form an independent part.
It, can be with if the function is realized and when sold or used as an independent product in the form of software function module It is stored in a computer read/write memory medium.Based on this understanding, the technical solution of the application is substantially in other words The part of the part that contributes to existing technology or the technical solution can be expressed in the form of software products, the meter Calculation machine software product is stored in a storage medium, including some instructions are used so that a computer equipment (can be People's computer, server or network equipment etc.) execute each embodiment the method for the application all or part of step. And storage medium above-mentioned includes:USB flash disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), arbitrary access are deposited The various media that can store program code such as reservoir (RAM, Random Access Memory), magnetic disc or CD.
It should be noted that herein, relational terms such as first and second and the like are used merely to a reality Body or operation are distinguished with another entity or operation, are deposited without necessarily requiring or implying between these entities or operation In any actual relationship or order or sequence.Moreover, the terms "include", "comprise" or its any other variant are intended to Non-exclusive inclusion, so that the process, method, article or equipment including a series of elements is not only wanted including those Element, but also include other elements that are not explicitly listed, or further include for this process, method, article or equipment Intrinsic element.In the absence of more restrictions, the element limited by sentence "including a ...", it is not excluded that There is also other identical elements in process, method, article or equipment including the element.
The above, the only specific implementation mode of the application, but the protection domain of the application is not limited thereto, it is any Those familiar with the art can easily think of the change or the replacement in the technical scope that the application discloses, and should all contain It covers within the protection domain of the application.Therefore, the protection domain of the application should be subject to the protection scope in claims.

Claims (10)

1. a kind of clustering method, which is characterized in that it is applied to the terminal device based on Spark, the method includes:
Include the pending data of multiple samples by Spark readings, and the pending data is formed into RDD objects;
The first preset quantity target sample is randomly determined in the multiple sample, by the first preset quantity sample Respectively as the cluster centre of the first preset quantity cluster;
In sample in the multiple sample in addition to the first preset quantity target sample, it is pre- to be randomly determined second If quantity sample, wherein second preset quantity is the preset multiple of first preset quantity;
For each sample in the second preset quantity sample, calculate the sample and the cluster centre of each cluster away from From the cluster being divided into the sample where nearest cluster centre at a distance from the sample;
The average value of each sample in each cluster is calculated as the first average value, according in the cluster of first average value and the cluster The current value of the heart is updated the value of the cluster centre of the cluster;
When being unsatisfactory for preset condition after the updating, the first preset quantity target sample is removed in the multiple sample again In sample except this, the second preset quantity sample is determined, and the second preset quantity sample redefined is divided into Corresponding cluster is updated with the value of the cluster centre to each cluster.
2. clustering method according to claim 1, which is characterized in that the preset condition includes:
Value and the average value of the angle of updated value before the cluster centre update of each cluster are less than predetermined angle;
The number being updated to the value of the cluster centre of each cluster according to identified second preset quantity sample reaches default Number.
3. clustering method according to claim 2, which is characterized in that be preset with cluster centre update in the terminal device Rate;
The value of the cluster centre of the cluster is updated according to the current value of first average value and the cluster centre of the cluster, is wrapped It includes:
With the weight that the cluster centre turnover rate is first average value, with 1 and the difference of the cluster centre turnover rate For the weight of the current value of the cluster centre of the cluster, adding for the current value of first average value and the cluster centre of the cluster is calculated Weight average value, the weighted average that the value of the cluster centre of the cluster is updated to.
4. clustering method according to claim 2 or 3, which is characterized in that the method further includes:
In meeting the preset condition any one when, the cluster centre of each cluster is updated according to the multiple sample;
When being unsatisfactory for the preset condition after update, the cluster centre of each cluster is carried out more according to the multiple sample again Newly.
5. clustering method according to claim 4, which is characterized in that according to the multiple sample to the cluster centre of each cluster It is updated, including:
For each sample in the multiple sample, the sample is calculated at a distance from the cluster centre of each cluster, which is drawn Assign to the cluster where nearest cluster centre at a distance from the sample;
The average value of each sample in each cluster is calculated as the second average value, and by the value of the cluster centre of the cluster be updated to this Two average values.
6. a kind of clustering apparatus, which is characterized in that be applied to the terminal device based on Spark frames, described device includes:
Data read module for including the pending data of multiple samples by Spark frames reading, and is waited for described It handles data and forms RDD objects;
Center determining module will be described for being randomly determined the first preset quantity target sample in the multiple sample Cluster centre of the first preset quantity sample respectively as the first preset quantity cluster;
Sample chooses module, for the sample in the multiple sample in addition to the first preset quantity target sample In, it is randomly determined the second preset quantity sample, wherein second preset quantity is the default of first preset quantity Multiple;
Division module, for for each sample in the second preset quantity sample, calculating the sample and each cluster The distance of cluster centre, the cluster which being divided into where nearest cluster centre at a distance from the sample;
First update module, it is flat according to described first for calculating the average value of each sample in each cluster as the first average value The current value of the cluster centre of mean value and the cluster is updated the value of the cluster centre of the cluster, is unsatisfactory for presetting after the updating When condition, again in the sample in the multiple sample in addition to the first preset quantity target sample, second is determined Preset quantity sample, and the second preset quantity sample redefined is divided into corresponding cluster in the cluster to each cluster The value of the heart is updated.
7. clustering apparatus according to claim 6, which is characterized in that the preset condition includes:
Value and the average value of the angle of updated value before the cluster centre update of each cluster are less than predetermined angle;
The number being updated to the value of the cluster centre of each cluster according to identified second preset quantity sample reaches default Number.
8. clustering apparatus according to claim 7, which is characterized in that be preset with cluster centre update in the terminal device Rate;First update module is according to the current value of first average value and the cluster centre of the cluster to the cluster centre of the cluster The mode that is updated of value be:
With the weight that the cluster centre turnover rate is first average value, with 1 and the difference of the cluster centre turnover rate For the weight of the current value of the cluster centre of the cluster, adding for the current value of first average value and the cluster centre of the cluster is calculated Weight average value, the weighted average that the value of the cluster centre of the cluster is updated to.
9. clustering apparatus according to claim 7 or 8, which is characterized in that the clustering apparatus further includes:
Second update module, in meeting the preset condition any one when, according to the multiple sample to each cluster Cluster centre is updated, when being unsatisfactory for the preset condition after update, again according to the multiple sample to each cluster Cluster centre is updated.
10. clustering apparatus according to claim 9, which is characterized in that second update module is according to the multiple sample This mode being updated to the cluster centre of each cluster is:
For each sample in the multiple sample, the sample is calculated at a distance from the cluster centre of each cluster, which is drawn Assign to the cluster where nearest cluster centre at a distance from the sample;
The average value of each sample in each cluster is calculated as the second average value, and by the value of the cluster centre of the cluster be updated to this Two average values.
CN201810482717.2A 2018-05-18 2018-05-18 Clustering method and device Active CN108805174B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810482717.2A CN108805174B (en) 2018-05-18 2018-05-18 Clustering method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810482717.2A CN108805174B (en) 2018-05-18 2018-05-18 Clustering method and device

Publications (2)

Publication Number Publication Date
CN108805174A true CN108805174A (en) 2018-11-13
CN108805174B CN108805174B (en) 2022-03-29

Family

ID=64091236

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810482717.2A Active CN108805174B (en) 2018-05-18 2018-05-18 Clustering method and device

Country Status (1)

Country Link
CN (1) CN108805174B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110045371A (en) * 2019-04-28 2019-07-23 软通智慧科技有限公司 Identification method, device, equipment and storage medium
CN110400020A (en) * 2019-07-31 2019-11-01 北京百度网讯科技有限公司 Method and device for outputting information
CN112215247A (en) * 2019-07-10 2021-01-12 南京地平线机器人技术有限公司 Method and device for clustering feature vectors and electronic equipment
CN112560731A (en) * 2020-12-22 2021-03-26 苏州科达科技股份有限公司 Feature clustering method, database updating method, electronic device and storage medium
CN112949697A (en) * 2021-02-07 2021-06-11 广州杰赛科技股份有限公司 Method and device for confirming pipeline abnormity and computer readable storage medium
CN113077015A (en) * 2021-04-29 2021-07-06 平安科技(深圳)有限公司 Sample selection method and device, computer equipment and storage medium
CN113111893A (en) * 2020-01-09 2021-07-13 中国移动通信集团四川有限公司 Data processing method and system and electronic equipment
CN113393412A (en) * 2020-02-27 2021-09-14 中国石油天然气股份有限公司 Method and device for determining characteristic value of corrosion defect in gas pipeline
CN113762311A (en) * 2021-01-28 2021-12-07 北京沃东天骏信息技术有限公司 A data clustering method, device and computer-readable storage medium
CN114358102A (en) * 2021-09-10 2022-04-15 腾讯科技(深圳)有限公司 Data classification method, device, equipment and storage medium
CN114662579A (en) * 2022-03-10 2022-06-24 中国工商银行股份有限公司 Clustering method and clustering equipment
CN114969317A (en) * 2021-02-26 2022-08-30 中移(苏州)软件技术有限公司 Data processing method and device, electronic equipment and storage medium
CN118094273A (en) * 2024-01-19 2024-05-28 北京大学 Clustering method, device, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049651A (en) * 2012-12-13 2013-04-17 航天科工深圳(集团)有限公司 Method and device used for power load aggregation
CN103593609A (en) * 2012-08-16 2014-02-19 阿里巴巴集团控股有限公司 Trustworthy behavior recognition method and device
CN103699678A (en) * 2013-12-31 2014-04-02 苏州大学 Hierarchical clustering method and system based on multistage layered sampling
CN105913077A (en) * 2016-04-07 2016-08-31 华北电力大学(保定) Data clustering method based on dimensionality reduction and sampling
CN106570173A (en) * 2016-11-09 2017-04-19 重庆邮电大学 High-dimensional sparse text data clustering method based on Spark
CN106682116A (en) * 2016-12-08 2017-05-17 重庆邮电大学 OPTICS point sorting clustering method based on Spark memory computing big data platform
CN107578070A (en) * 2017-09-19 2018-01-12 安徽中科美络信息技术有限公司 K means initial cluster center method for optimizing based on neighborhood information and mean difference degree

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593609A (en) * 2012-08-16 2014-02-19 阿里巴巴集团控股有限公司 Trustworthy behavior recognition method and device
CN103049651A (en) * 2012-12-13 2013-04-17 航天科工深圳(集团)有限公司 Method and device used for power load aggregation
CN103699678A (en) * 2013-12-31 2014-04-02 苏州大学 Hierarchical clustering method and system based on multistage layered sampling
CN105913077A (en) * 2016-04-07 2016-08-31 华北电力大学(保定) Data clustering method based on dimensionality reduction and sampling
CN106570173A (en) * 2016-11-09 2017-04-19 重庆邮电大学 High-dimensional sparse text data clustering method based on Spark
CN106682116A (en) * 2016-12-08 2017-05-17 重庆邮电大学 OPTICS point sorting clustering method based on Spark memory computing big data platform
CN107578070A (en) * 2017-09-19 2018-01-12 安徽中科美络信息技术有限公司 K means initial cluster center method for optimizing based on neighborhood information and mean difference degree

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110045371A (en) * 2019-04-28 2019-07-23 软通智慧科技有限公司 Identification method, device, equipment and storage medium
CN112215247A (en) * 2019-07-10 2021-01-12 南京地平线机器人技术有限公司 Method and device for clustering feature vectors and electronic equipment
CN112215247B (en) * 2019-07-10 2025-03-14 南京地平线机器人技术有限公司 Method, device and electronic device for clustering feature vectors
CN110400020A (en) * 2019-07-31 2019-11-01 北京百度网讯科技有限公司 Method and device for outputting information
CN113111893A (en) * 2020-01-09 2021-07-13 中国移动通信集团四川有限公司 Data processing method and system and electronic equipment
CN113393412A (en) * 2020-02-27 2021-09-14 中国石油天然气股份有限公司 Method and device for determining characteristic value of corrosion defect in gas pipeline
CN113393412B (en) * 2020-02-27 2024-05-31 中国石油天然气股份有限公司 Method and device for determining characteristic value of corrosion defect in gas pipeline
CN112560731B (en) * 2020-12-22 2022-07-01 苏州科达科技股份有限公司 Feature clustering method, database updating method, electronic device and storage medium
CN112560731A (en) * 2020-12-22 2021-03-26 苏州科达科技股份有限公司 Feature clustering method, database updating method, electronic device and storage medium
CN113762311A (en) * 2021-01-28 2021-12-07 北京沃东天骏信息技术有限公司 A data clustering method, device and computer-readable storage medium
CN112949697A (en) * 2021-02-07 2021-06-11 广州杰赛科技股份有限公司 Method and device for confirming pipeline abnormity and computer readable storage medium
CN114969317A (en) * 2021-02-26 2022-08-30 中移(苏州)软件技术有限公司 Data processing method and device, electronic equipment and storage medium
CN113077015A (en) * 2021-04-29 2021-07-06 平安科技(深圳)有限公司 Sample selection method and device, computer equipment and storage medium
CN114358102A (en) * 2021-09-10 2022-04-15 腾讯科技(深圳)有限公司 Data classification method, device, equipment and storage medium
CN114662579A (en) * 2022-03-10 2022-06-24 中国工商银行股份有限公司 Clustering method and clustering equipment
CN118094273A (en) * 2024-01-19 2024-05-28 北京大学 Clustering method, device, computer equipment and storage medium
CN118094273B (en) * 2024-01-19 2025-03-21 北京大学 Clustering method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN108805174B (en) 2022-03-29

Similar Documents

Publication Publication Date Title
CN108805174A (en) clustering method and device
CN115131566A (en) Automatic image segmentation method based on super-pixels and improved fuzzy C-means clustering
JP2021533474A (en) Node classification method, model training method, and its equipment, equipment and computer program
CN110827924B (en) Clustering method and device for gene expression data, computer equipment and storage medium
CN108960520A (en) Power load prediction method, system, computer equipment and medium
CN111526119A (en) Abnormal flow detection method and device, electronic equipment and computer readable medium
CN108197795B (en) Malicious group account identification method, device, terminal and storage medium
CN118504775B (en) Urban planning method and system based on digital twinning
CN104484600B (en) Intrusion detection method and device based on improved density clustering
CN106682414A (en) Method and device for establishing timing sequence prediction model
CN110751354B (en) Abnormal user detection method and device
CN110796485A (en) Method and device for improving prediction precision of prediction model
CN115099875A (en) Data classification method based on decision tree model and related equipment
CN110166344A (en) A kind of identity recognition methods, device and relevant device
CN114781650A (en) Data processing method, device, equipment and storage medium
CN110378543A (en) Leaving office Risk Forecast Method, device, computer equipment and storage medium
CN112416560A (en) Truth value inference and online task allocation method for numerical tasks in crowdsourcing scene
CN118245823B (en) Insulator leakage current prediction method, device, equipment and storage medium
CN117495571B (en) Data processing method and device, electronic equipment and storage medium
CN113807391A (en) Task model training method and device, electronic equipment and storage medium
CN112597323A (en) Remote sensing image storage and migration method and device and storage medium
CN118568528A (en) Photovoltaic power generation power prediction and model construction method and device
CN117369954A (en) JVM optimization method and device of risk processing framework for big data construction
CN117253079A (en) Model training method, device, equipment and storage medium
CN111079843A (en) Training method based on RBF neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant