CN108805174A - clustering method and device - Google Patents
clustering method and device Download PDFInfo
- Publication number
- CN108805174A CN108805174A CN201810482717.2A CN201810482717A CN108805174A CN 108805174 A CN108805174 A CN 108805174A CN 201810482717 A CN201810482717 A CN 201810482717A CN 108805174 A CN108805174 A CN 108805174A
- Authority
- CN
- China
- Prior art keywords
- cluster
- sample
- cluster centre
- value
- centre
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 49
- 230000007306 turnover Effects 0.000 claims description 19
- 230000015654 memory Effects 0.000 description 16
- 230000006870 function Effects 0.000 description 12
- 238000010586 diagram Methods 0.000 description 11
- 238000004891 communication Methods 0.000 description 8
- 230000008569 process Effects 0.000 description 6
- 238000005516 engineering process Methods 0.000 description 3
- 230000008901 benefit Effects 0.000 description 2
- 238000004590 computer program Methods 0.000 description 2
- 230000003993 interaction Effects 0.000 description 2
- 241000122205 Chamaeleonidae Species 0.000 description 1
- 230000009471 action Effects 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 230000005540 biological transmission Effects 0.000 description 1
- 238000004422 calculation algorithm Methods 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000007418 data mining Methods 0.000 description 1
- 238000013499 data model Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000002955 isolation Methods 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 230000035772 mutation Effects 0.000 description 1
- 239000011435 rock Substances 0.000 description 1
- 230000003595 spectral effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A kind of clustering method of the embodiment of the present application offer and device, method include:Read include multiple samples pending data, and by pending data RDD objects;The first preset quantity target sample is randomly determined in multiple samples, using the first preset quantity sample as the cluster centre of the first preset quantity cluster;In sample in multiple samples in addition to target sample, it is randomly determined the second preset quantity sample;For each in the second preset quantity sample, the sample is calculated at a distance from the cluster centre of each cluster, the cluster which being divided into where nearest cluster centre at a distance from the sample;The average value of each sample in each cluster is calculated as the first average value, the value of the cluster centre of the cluster is updated according to the current value of first average value and the cluster centre of the cluster, until meeting preset condition.The result that so, it is possible relatively effectively to avoid is local optimum.
Description
Technical field
This application involves big data technical fields, in particular to a kind of clustering method and device.
Background technology
Clustering method has extremely wide application in various fields such as machine learning, data minings.Conventional cluster side
In method, it can be divided into hierarchical clustering, division formula cluster according to certain characteristic, gather based on Density Clustering, based on network clustering, core
Class, spectral clustering etc. are a variety of.Accordingly, hierarchical clustering has brich, rock, Chameleon etc., division formula cluster have K-means and
Its mutation etc., Density Clustering have dbscan, OPTICS etc., Grid Clustering to have sting, clique etc..In these clustering methods,
It is most common to be no more than K-means.Although K-means has a priori assumption such as circular distribution, preassigns cluster classification number
Deng defect, but its relatively low computation complexity still allow its become popular algorithm.But in number to be treated
According to measure it is huge in the case of, the computation complexity of existing K-means is still larger, and is easy to be easily trapped into local optimum
Problem.
Invention content
In view of this, the purpose of the application includes a kind of clustering method of offer and device, to improve the above problem.
In order to achieve the above object, the embodiment of the present application adopts the following technical scheme that:
In a first aspect, the embodiment of the present application provides a kind of clustering method, it is applied to the terminal device based on Spark frames,
The method includes:
Read by the Spark frames include multiple samples pending data, and by RDD pairs of the pending data
As;
The first preset quantity target sample is randomly determined in the multiple sample, by first preset quantity
Cluster centre of the sample respectively as the first preset quantity cluster;
In sample in the multiple sample in addition to the first preset quantity target sample, it is randomly determined
Two preset quantity samples, wherein second preset quantity is the preset multiple of first preset quantity;
For each sample in the second preset quantity sample, the sample and the cluster centre of each cluster are calculated
Distance, the cluster which being divided into where nearest cluster centre at a distance from the sample;
The average value of each sample in each cluster is calculated as the first average value, according to the poly- of first average value and the cluster
The current value at class center is updated the value of the cluster centre of the cluster;
When being unsatisfactory for preset condition after the updating, the first preset quantity mesh is removed in the multiple sample again
In sample except standard specimen sheet, the second preset quantity sample is determined, and the second preset quantity sample redefined is drawn
Corresponding cluster is assigned to be updated with the value of the cluster centre to each cluster.
Optionally, the clustering method provided according to the embodiment of the present application first aspect, the preset condition include:
Value and the average value of the angle of updated value before the cluster centre update of each cluster are less than predetermined angle;
The number being updated to the value of the cluster centre of each cluster according to identified second preset quantity sample reaches
Preset times.
Optionally, the clustering method provided according to the embodiment of the present application first aspect is preset in the terminal device poly-
Class center turnover rate;
The value of the cluster centre of the cluster is carried out more according to the current value of first average value and the cluster centre of the cluster
Newly, including:
With the weight that the cluster centre turnover rate is first average value, with 1 and the cluster centre turnover rate
Difference is the weight of the current value of the cluster centre of the cluster, calculates the current value of first average value and the cluster centre of the cluster
Weighted average, the weighted average that the value of the cluster centre of the cluster is updated to.
Optionally, the clustering method provided according to the embodiment of the present application first aspect, the method further include:
In meeting the preset condition any one when, the cluster centre of each cluster is carried out more according to the multiple sample
Newly;
When being unsatisfactory for the preset condition after update, again according to the multiple sample to the cluster centre of each cluster into
Row update.
Optionally, the clustering method provided according to the embodiment of the present application first aspect, according to the multiple sample to each cluster
Cluster centre be updated, including:
For each sample in the multiple sample, the sample is calculated at a distance from the cluster centre of each cluster, by the sample
Originally the cluster being divided into where nearest cluster centre at a distance from the sample;
The average value of each sample in each cluster is calculated to be updated to as the second average value, and by the value of the cluster centre of the cluster
Second average value.
Second aspect, the embodiment of the present application also provide a kind of clustering apparatus, are set applied to the terminal based on Spark frames
Standby, described device includes:
Data read module, for including the pending data of multiple samples by Spark frames reading, and by institute
State pending data RDD objects;
Center determining module will for being randomly determined the first preset quantity target sample in the multiple sample
Cluster centre of the first preset quantity sample respectively as the first preset quantity cluster;
Sample chooses module, for the sample in the multiple sample in addition to the first preset quantity target sample
In this, it is randomly determined the second preset quantity sample, wherein second preset quantity is the pre- of first preset quantity
If multiple;
First division module, for for each sample in the second preset quantity sample, calculating the sample
At a distance from the cluster centre of each cluster, the cluster which is divided into where nearest cluster centre at a distance from the sample;
First update module, for calculating the average value of each sample in each cluster as the first average value, according to described
The current value of the cluster centre of one average value and the cluster is updated the value of the cluster centre of the cluster, is unsatisfactory for after the updating
When preset condition, again in the sample in the multiple sample in addition to the first preset quantity target sample, determine
Second preset quantity sample, and the second preset quantity sample redefined is divided into corresponding cluster to gather to each cluster
The value at class center is updated.
Optionally, the clustering apparatus provided according to the embodiment of the present application second aspect, the preset condition include:
Value and the average value of the angle of updated value before the cluster centre update of each cluster are less than predetermined angle;
The number being updated to the value of the cluster centre of each cluster according to identified second preset quantity sample reaches
Preset times.
Optionally, the clustering apparatus provided according to the embodiment of the present application second aspect, 8. is according to claim 7 poly-
Class device, which is characterized in that cluster centre turnover rate is preset in the terminal device;First update module is according to
The mode that the current value of the cluster centre of first average value and the cluster is updated the value of the cluster centre of the cluster is:
With the weight that the cluster centre turnover rate is first average value, with 1 and the cluster centre turnover rate
Difference is the weight of the current value of the cluster centre of the cluster, calculates the current value of first average value and the cluster centre of the cluster
Weighted average, the weighted average that the value of the cluster centre of the cluster is updated to.
Optionally, the clustering apparatus provided according to the embodiment of the present application second aspect, the clustering apparatus further include:
Second update module, in meeting the preset condition any one when, according to the multiple sample to each
The cluster centre of cluster is updated, when being unsatisfactory for the preset condition after update, again according to the multiple sample to each
The cluster centre of cluster is updated.
Optionally, the clustering apparatus provided according to the embodiment of the present application second aspect, second update module is according to institute
Stating the mode that multiple samples are updated the cluster centre of each cluster is:
For each sample in the multiple sample, the sample is calculated at a distance from the cluster centre of each cluster, by the sample
Originally the cluster being divided into where nearest cluster centre at a distance from the sample;
The average value of each sample in each cluster is calculated to be updated to as the second average value, and by the value of the cluster centre of the cluster
Second average value.
Compared to existing technologies, the embodiment of the present application has the advantages that:
A kind of clustering method and device provided by the embodiments of the present application are applied to the terminal device based on Spark frames.Eventually
The data read are formed RDD objects by end equipment by the pending data that the reading of Spark frames includes multiple samples.
Determine the first preset quantity target sample respectively as in the cluster of the first preset quantity cluster at random in multiple sample
The heart, it is random to determine the second preset quantity sample in the sample in multiple samples in addition to the first preset quantity target sample
This, then the second preset quantity sample is divided in corresponding cluster.Then the average value conduct of each sample in each cluster is calculated
First average value carries out the value of the cluster centre of the cluster according to the current value of first average value and the cluster centre of the cluster
Update, until meeting preset condition.By aforementioned process, it is not necessary to be all updated, be dropped according to whole samples in each update
Low computation complexity, and can avoid the problem that being easy to converge on local optimum to a certain extent.
Description of the drawings
It, below will be to needed in the embodiment attached in order to illustrate more clearly of the technical solution of the embodiment of the present application
Figure is briefly described, it should be understood that the following drawings illustrates only some embodiments of the application, therefore is not construed as pair
The restriction of range for those of ordinary skill in the art without creative efforts, can also be according to this
A little attached drawings obtain other relevant attached drawings.
Fig. 1 is a kind of connection block diagram of terminal device provided by the embodiments of the present application;
Fig. 2 is a kind of flow diagram of clustering method provided by the embodiments of the present application;
The another flow diagram of Fig. 3 clustering methods provided by the embodiments of the present application;
Fig. 4 is a kind of functional block diagram of clustering apparatus provided by the embodiments of the present application.
Icon:100- terminal devices;110- memories;120- processors;130- communication units;200- clustering apparatus;
210- data read modules;The centers 220- determining module;230- samples choose module;240- division modules;250- first updates
Module;The second update modules of 260-.
Specific implementation mode
To keep the purpose, technical scheme and advantage of the embodiment of the present application clearer, below in conjunction with the embodiment of the present application
In attached drawing, technical solutions in the embodiments of the present application is clearly and completely described, it is clear that described embodiment is
Some embodiments of the present application, instead of all the embodiments.The application being usually described and illustrated herein in the accompanying drawings is implemented
The component of example can be arranged and be designed with a variety of different configurations.
Therefore, below the detailed description of the embodiments herein to providing in the accompanying drawings be not intended to limit it is claimed
Scope of the present application, but be merely representative of the selected embodiment of the application.Based on the embodiment in the application, this field is common
The every other embodiment that technical staff is obtained without creative efforts belongs to the model of the application protection
It encloses.
It should be noted that:Similar label and letter indicate similar terms in following attached drawing, therefore, once a certain Xiang Yi
It is defined, then it further need not be defined and explained in subsequent attached drawing in a attached drawing.
There are many clustering methods to be suitble to use under big data environment at present, but K-means is wherein relatively handy
It is a kind of.Herein by taking the scene based on Spark frames as an example, the K-means in some embodiments is illustrated:
(1) it is randomly determined k sample in given sample, each in this k sample represents gathering for cluster
Class center, wherein k is preassigned cluster classification number, i.e., the quantity of the required cluster divided;
(2) in the remaining sample in given sample, each sample is calculated at a distance from the cluster centre of each cluster, by this
Sample is divided into the cluster representated by the nearest cluster centre with the sample;
(3) average value for calculating each sample in each cluster, the average value is updated to by the value of the cluster centre of the cluster.
(4) (2) are constantly repeated to (3) step, until meeting corresponding conditions.
However, through inventor the study found that in the above method, in order to ensure to restrain, each iteration (update) can all make
With whole samples directly as the learning sample of the new cluster centre of cluster, exactly K-means can be caused to be easy to converge in this way
Local optimum.Also, the iteration at each cluster center is all directly to use new value, and ignore update completely in above-mentioned mode
The value of cluster centre before, this causes K-menas to be easy to receive the interference of some outliers.In addition, larger in data volume
In the case of, the time complexity of the K-means of the above method is still higher.
Based on this, a kind of clustering method of the embodiment of the present application proposition and device are set applied to the terminal based on Spark frames
It is standby, to improve the above problem at least partly.It should be appreciated that clustering method provided in this embodiment and device can also be applied to
Other frames such as MapReduce, the present embodiment are not limited system.The content will be described in detail below.
As shown in Figure 1, being a kind of block diagram of terminal device 100 provided by the embodiments of the present application, terminal device 100
Can be the arbitrary electronic equipment with data processing function and communication function, for example, with server specifically, terminal can be worked as
When equipment 100 is server, either an individual server, can also be the server set of multiple server compositions
Group, the present embodiment are without limitation.
In the present embodiment, Spark corresponding operation platforms can be installed, wherein Spark is in terminal device 100
The computing engines for aiming at large-scale data processing and the Universal-purpose quick of design, are the general of the class Hadoop MapReduce to increase income
Parallel frame.
Terminal device 100 includes clustering apparatus 200, memory 110, processor 120 and communication unit 130.
In the present embodiment, memory 110, processor 120 and 130 each element of communication unit are direct or indirect between each other
Ground is electrically connected, to realize the transmission or interaction of data.For example, these elements between each other can be total by one or more communication
Line or signal wire, which are realized, to be electrically connected.Clustering apparatus 200 include it is at least one can be with the shape of software or firmware (firmware)
Formula is stored in memory 110 or is solidificated in the software in the operating system (Operating System, OS) of terminal device 100
Function module.Processor 120 is for executing the executable module stored in memory 110, for example, included by clustering apparatus 200
Software function module and computer program etc..
Wherein, memory 110 may be, but not limited to, random access memory (Random Access Memory,
RAM), read-only memory (Read Only Memory, ROM), programmable read only memory (Programmable Read-Only
Memory, PROM), erasable read-only memory (Erasable Programmable Read-Only Memory, EPROM),
Electricallyerasable ROM (EEROM) (Electric Erasable Programmable Read-Only Memory, EEPROM) etc..
Processor 120 can be a kind of IC chip, have signal handling capacity.The processor 120 can also
It is general processor, such as central processing unit (Central Processing Unit, CPU), network processing unit (Network
Processor, NP), microprocessor etc.;Can also be digital signal processor (DSP)), application-specific integrated circuit (ASIC), scene
Programmable gate array (FPGA) either other programmable logic device, discrete gate or transistor logic, discrete hardware group
Part;Processor 120 can also be any conventional processor, may be implemented or executes and is in the embodiment of the present invention disclosed each
Method, step and logic diagram.
Communication unit 130 is used to establish the communication connection between the terminal device 100 and other equipment, to realize data
Interaction or communication.
It should be appreciated that structure shown in FIG. 1 is only to illustrate, terminal device 100 can also include than shown in Fig. 1 more or more
Few component, it is possible to have with configuration entirely different shown in Fig. 1.It is worth noting that, each component shown in FIG. 1 can herein
To be realized with hardware, software, or its combination, the present embodiment is without limitation.
As shown in Fig. 2, being a kind of flow diagram of clustering method provided by the embodiments of the present application, which can
Applied to terminal device 100 shown in FIG. 1.
Step S201, read by the Spark include multiple samples pending data, and by the pending data
Form RDD objects.
Wherein, RDD is the distinctive data models of Spark, and the multiple sample refers to the previously given sample for cluster
This.
Step S202 is randomly determined the first preset quantity target sample in the multiple sample, by described first
Cluster centre of the preset quantity sample respectively as the first preset quantity cluster.
Wherein, first preset quantity is predetermined required cluster categorical measure, i.e., the required cluster divided
Quantity.
Step S203, in the sample in the multiple sample in addition to the first preset quantity target sample, with
Determine the second preset quantity sample to machine.
Wherein, second preset quantity is the preset multiple of first preset quantity.In the present embodiment, described
Two preset quantities are predetermined value, indicate need to be used for the cluster centre of cluster more using how many sample in each iteration
Newly, the second preset quantity typically 100-1000.
For example, it is assumed that the first preset quantity is k, preset multiple γ, then the second preset quantity is k* γ.
Step S204 calculates the sample and each cluster for each sample in the second preset quantity sample
The distance of cluster centre, the cluster which being divided into where nearest cluster centre at a distance from the sample.
By taking above-mentioned second preset quantity is k* γ as an example, for each in k* γ samples, sample arrival is calculated
The distance of the current cluster centre of each cluster, and determine the cluster corresponding to the cluster centre for the distance minimum for wherein reaching the sample
It is assigned to the target cluster as target cluster, then by the sample.
Step S205 calculates the average value of each sample in each cluster as the first average value, according to first average value
And the current value of the cluster centre of the cluster is updated the value of the cluster centre of the cluster.
In the present embodiment, for each cluster, after obtaining the first average value of the cluster, in conjunction with the cluster centre of the cluster
Current value the cluster centre of the cluster is updated.
Optionally, in the present embodiment, cluster centre turnover rate can be preset in terminal device 100.Accordingly, step
S205 may comprise steps of:
With the weight that the cluster centre turnover rate is first average value, with 1 and the cluster centre turnover rate
Difference is the weight of the current value of the cluster centre of the cluster, calculates the current value of first average value and the cluster centre of the cluster
Weighted average, the weighted average that the value of the cluster centre of the cluster is updated to.
Assuming that above-mentioned cluster centre turnover rate is α, the current value of the cluster centre of the cluster is old_center, is calculated
The first average value arrived is new_center, then as the new value center (the i.e. described weighted average) of the cluster centre of the cluster
It can be calculate by the following formula to obtain:
Center=α * new_center+ (1- α) * old_center
Wherein, cluster centre turnover rate can be controlled according to actual conditions, so as to avoid being based on part sample
Carry out the excessively poor caused influence of certain more new effects.
Step S206, when being unsatisfactory for preset condition after the updating, again except described first is pre- in the multiple sample
If in the sample except quantity target sample, determining the second preset quantity sample, and the second present count that will be redefined
Amount sample is divided into corresponding cluster and is updated with the value of the cluster centre to each cluster.
In other words, in the present embodiment, terminal device 100 can repeat step S203- steps S205 until meeting institute
State preset condition.
A specific example is given below, to be elaborated to step S203- steps S206.
Assuming that having been repeated 1000 times to step S203- steps S206 in total in this example, wherein ith calculates
Obtained the first average value (i.e. above-mentioned new_center) is vi, what ith was acquired by way of calculating weighted average
The new value (i.e. above-mentioned center) of cluster centre is Vi, it is assumed that cluster centre turnover rate α is 0.8, then ViFollowing formula meter can be passed through
It obtains:
Vi=0.8*vi+0.2*vi-1
It is possible thereby to learn, repeat what above-mentioned steps S203- steps S205 was updated with the value to cluster centre
Process realizes really by way of exponent-weighted average, in other words, in clustering method provided in this embodiment,
Influence of the newer value to current results before all having been fully considered when updating each time, rather than treat in isolation or carry out every
Primary update, so, it is possible relatively reliable to avoid the problem that local convergence.
Optionally, in the present embodiment, the preset condition may include:
Value and the average value of the angle of updated value before the cluster centre update of each cluster are less than predetermined angle;
The number being updated to the value of the cluster centre of each cluster according to identified second preset quantity sample reaches
Preset times.
Wherein, the predetermined angle and the preset times can flexibly be set according to actual conditions, the present embodiment
It is without limitation.
In addition, inventor also found if be iterated more to the cluster centre of each cluster only with part sample always
Newly, it may be difficult to ensure Complete Convergence, so after reaching preset condition according to mode above-mentioned, can further use complete
Portion's sample is updated the cluster centre of each cluster.On the basis of executing abovementioned steps, need to only be based on again whole samples into
The update of a little number of row (being typically less than 10 times), time complexity are negligible.
Based on this, as shown in figure 3, clustering method provided in this embodiment can also include step S301 and step S302.
Step S301, in meeting the preset condition any one when, according to the multiple sample to the cluster of each cluster
Center is updated.
Wherein, step S301 can be realized by following sub-step:
For each sample in the multiple sample, the sample is calculated at a distance from the cluster centre of each cluster, by the sample
Originally the cluster being divided into where nearest cluster centre at a distance from the sample;
The average value of each sample in each cluster is calculated to be updated to as the second average value, and by the value of the cluster centre of the cluster
Second average value.
Wherein, the second average value refers to working as according to whole samples (each sample in the multiple sample) to each cluster
Cluster centre is updated the average value of each sample in each cluster being calculated later.
Step S302, when being unsatisfactory for the preset condition after update, again according to the multiple sample to each cluster
Cluster centre is updated.
In the present embodiment, the iteration update carried out based on whole samples is proceeded to the when of meeting the preset condition and stopped
Only.Based on above-mentioned analysis it is found that the update less than 10 times would generally be carried out, time complexity is negligible.
As shown in figure 4, the embodiment of the present application also provides a kind of clustering apparatus 200, it is applied to terminal device shown in FIG. 1
100.The clustering apparatus 200 includes data read module 210, center determining module 220, sample selection module 230, divides mould
Block 240 and the first update module 250.
Wherein, data read module 210 is used to pass through the pending number that Spark frames reading includes multiple samples
According to, and the pending data is formed into RDD objects.
In the present embodiment, the description as described in data read module 210 is specifically referred to the detailed of step S201 shown in Fig. 2
Thin description, i.e. step S201 can be executed by data read module 210.
Center determining module 220 is used to be randomly determined the first preset quantity target sample in the multiple sample,
Using the first preset quantity sample as the cluster centre of the first preset quantity cluster.
In the present embodiment, the description as described in center determining module 220 is specifically referred to the detailed of step S202 shown in Fig. 2
Thin description, i.e. step S202 can be executed by center determining module 220.
Sample is chosen module 230 and is used in the multiple sample in addition to the first preset quantity target sample
In sample, it is randomly determined the second preset quantity sample, wherein second preset quantity is first preset quantity
Preset multiple.
In the present embodiment, the description as described in sample chooses module 230 is specifically referred to the detailed of step S203 shown in Fig. 2
Thin description, i.e. step S203 can choose module 230 by sample and execute.
Division module 240 is used for for each sample in the second preset quantity sample, calculate the sample and
The distance of the cluster centre of each cluster, the cluster which being divided into where nearest cluster centre at a distance from the sample.
In the present embodiment, the description as described in division module 240 specifically refers to retouching in detail to step S204 shown in Fig. 2
It states, i.e. step S204 can be executed by division module 240.
First update module 250 is used to calculate the average value of each sample in each cluster as the first average value, according to described
The current value of the cluster centre of first average value and the cluster is updated the value of the cluster centre of the cluster, is discontented with after the updating
When sufficient preset condition, again in the sample in the multiple sample in addition to the first preset quantity target sample, really
Fixed second preset quantity sample, and the second preset quantity sample redefined is divided into corresponding cluster with to each cluster
The value of cluster centre is updated.
In the present embodiment, the description as described in the first update module 250 is specifically referred to the detailed of step S205 shown in Fig. 2
Thin description, i.e. step S205 can be executed by the first update module 250.
Wherein, the preset condition may include:
Value and the average value of the angle of updated value before the cluster centre update of each cluster are less than predetermined angle;
The number being updated to the value of the cluster centre of each cluster according to identified second preset quantity sample reaches
Preset times.
Optionally, terminal device 100 can in be preset with cluster centre turnover rate.
Accordingly, the first update module 250 according to the current value of first average value and the cluster centre of the cluster to this
The mode that the value of the cluster centre of cluster is updated can be:
With the weight that the cluster centre turnover rate is first average value, with 1 and the cluster centre turnover rate
Difference is the weight of the current value of the cluster centre of the cluster, calculates the current value of first average value and the cluster centre of the cluster
Weighted average, the weighted average that the value of the cluster centre of the cluster is updated to.
Optionally, the clustering apparatus 200 can also include the second update module 250.
Wherein, the second update module 260 be used in meeting the preset condition any one when, according to the multiple sample
This is updated the cluster centre of each cluster, when being unsatisfactory for the preset condition after update, again according to the multiple sample
This is updated the cluster centre of each cluster.
In the present embodiment, the description as described in the second update module 260 specifically refers in the above to being walked shown in Fig. 3
The detailed description of rapid S301 and step S302, i.e. step S301 and step S302 can be executed by the second update module 260.
Optionally, in the present embodiment, the second update module 260 according to the multiple sample to the cluster centre of each cluster into
The newer mode of row can be:
For each sample in the multiple sample, the sample is calculated at a distance from the cluster centre of each cluster, by the sample
Originally the cluster being divided into where nearest cluster centre at a distance from the sample;
The average value of each sample in each cluster is calculated to be updated to as the second average value, and by the value of the cluster centre of the cluster
Second average value.
In conclusion a kind of clustering method provided by the embodiments of the present application and device, are applied to the end based on Spark frames
End equipment.Terminal device includes the pending data of multiple samples by the reading of Spark frames, and the data read are formed
RDD objects.Determine the first preset quantity target sample respectively as the first preset quantity cluster at random in multiple sample
Cluster centre, it is random to determine that second is default in the sample in multiple samples in addition to the first preset quantity target sample
Quantity sample, then the second preset quantity sample is divided in corresponding cluster.Then the flat of each sample in each cluster is calculated
Mean value is as the first average value, according to the current value of first average value and the cluster centre of the cluster to the cluster centre of the cluster
Value be updated, until meet preset condition.By aforementioned process, it is not necessary to all be carried out according to whole samples in each update
Update, reduces computation complexity, and can avoid the problem that being easy to converge on local optimum to a certain extent.
In embodiment provided herein, it should be understood that disclosed device and method, it can also be by other
Mode realize.The apparatus embodiments described above are merely exemplary, for example, the flow chart and block diagram in attached drawing are shown
According to the device, the architectural framework in the cards of method and computer program product, function of multiple embodiments of the application
And operation.In this regard, each box in flowchart or block diagram can represent one of a module, section or code
Point, a part for the module, section or code includes one or more for implementing the specified logical function executable
Instruction.It should also be noted that at some as in the realization method replaced, the function of being marked in box can also be attached to be different from
The sequence marked in figure occurs.For example, two continuous boxes can essentially be basically executed in parallel, they also may be used sometimes
To execute in the opposite order, this is depended on the functions involved.It is also noted that each of block diagram and or flow chart
The combination of box in box and block diagram and or flow chart, function or the dedicated of action are based on as defined in execution
The system of hardware is realized, or can be realized using a combination of dedicated hardware and computer instructions.
In addition, each function module in each embodiment of the application can integrate to form an independent portion
Point, can also be modules individualism, can also two or more modules be integrated to form an independent part.
It, can be with if the function is realized and when sold or used as an independent product in the form of software function module
It is stored in a computer read/write memory medium.Based on this understanding, the technical solution of the application is substantially in other words
The part of the part that contributes to existing technology or the technical solution can be expressed in the form of software products, the meter
Calculation machine software product is stored in a storage medium, including some instructions are used so that a computer equipment (can be
People's computer, server or network equipment etc.) execute each embodiment the method for the application all or part of step.
And storage medium above-mentioned includes:USB flash disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), arbitrary access are deposited
The various media that can store program code such as reservoir (RAM, Random Access Memory), magnetic disc or CD.
It should be noted that herein, relational terms such as first and second and the like are used merely to a reality
Body or operation are distinguished with another entity or operation, are deposited without necessarily requiring or implying between these entities or operation
In any actual relationship or order or sequence.Moreover, the terms "include", "comprise" or its any other variant are intended to
Non-exclusive inclusion, so that the process, method, article or equipment including a series of elements is not only wanted including those
Element, but also include other elements that are not explicitly listed, or further include for this process, method, article or equipment
Intrinsic element.In the absence of more restrictions, the element limited by sentence "including a ...", it is not excluded that
There is also other identical elements in process, method, article or equipment including the element.
The above, the only specific implementation mode of the application, but the protection domain of the application is not limited thereto, it is any
Those familiar with the art can easily think of the change or the replacement in the technical scope that the application discloses, and should all contain
It covers within the protection domain of the application.Therefore, the protection domain of the application should be subject to the protection scope in claims.
Claims (10)
1. a kind of clustering method, which is characterized in that it is applied to the terminal device based on Spark, the method includes:
Include the pending data of multiple samples by Spark readings, and the pending data is formed into RDD objects;
The first preset quantity target sample is randomly determined in the multiple sample, by the first preset quantity sample
Respectively as the cluster centre of the first preset quantity cluster;
In sample in the multiple sample in addition to the first preset quantity target sample, it is pre- to be randomly determined second
If quantity sample, wherein second preset quantity is the preset multiple of first preset quantity;
For each sample in the second preset quantity sample, calculate the sample and the cluster centre of each cluster away from
From the cluster being divided into the sample where nearest cluster centre at a distance from the sample;
The average value of each sample in each cluster is calculated as the first average value, according in the cluster of first average value and the cluster
The current value of the heart is updated the value of the cluster centre of the cluster;
When being unsatisfactory for preset condition after the updating, the first preset quantity target sample is removed in the multiple sample again
In sample except this, the second preset quantity sample is determined, and the second preset quantity sample redefined is divided into
Corresponding cluster is updated with the value of the cluster centre to each cluster.
2. clustering method according to claim 1, which is characterized in that the preset condition includes:
Value and the average value of the angle of updated value before the cluster centre update of each cluster are less than predetermined angle;
The number being updated to the value of the cluster centre of each cluster according to identified second preset quantity sample reaches default
Number.
3. clustering method according to claim 2, which is characterized in that be preset with cluster centre update in the terminal device
Rate;
The value of the cluster centre of the cluster is updated according to the current value of first average value and the cluster centre of the cluster, is wrapped
It includes:
With the weight that the cluster centre turnover rate is first average value, with 1 and the difference of the cluster centre turnover rate
For the weight of the current value of the cluster centre of the cluster, adding for the current value of first average value and the cluster centre of the cluster is calculated
Weight average value, the weighted average that the value of the cluster centre of the cluster is updated to.
4. clustering method according to claim 2 or 3, which is characterized in that the method further includes:
In meeting the preset condition any one when, the cluster centre of each cluster is updated according to the multiple sample;
When being unsatisfactory for the preset condition after update, the cluster centre of each cluster is carried out more according to the multiple sample again
Newly.
5. clustering method according to claim 4, which is characterized in that according to the multiple sample to the cluster centre of each cluster
It is updated, including:
For each sample in the multiple sample, the sample is calculated at a distance from the cluster centre of each cluster, which is drawn
Assign to the cluster where nearest cluster centre at a distance from the sample;
The average value of each sample in each cluster is calculated as the second average value, and by the value of the cluster centre of the cluster be updated to this
Two average values.
6. a kind of clustering apparatus, which is characterized in that be applied to the terminal device based on Spark frames, described device includes:
Data read module for including the pending data of multiple samples by Spark frames reading, and is waited for described
It handles data and forms RDD objects;
Center determining module will be described for being randomly determined the first preset quantity target sample in the multiple sample
Cluster centre of the first preset quantity sample respectively as the first preset quantity cluster;
Sample chooses module, for the sample in the multiple sample in addition to the first preset quantity target sample
In, it is randomly determined the second preset quantity sample, wherein second preset quantity is the default of first preset quantity
Multiple;
Division module, for for each sample in the second preset quantity sample, calculating the sample and each cluster
The distance of cluster centre, the cluster which being divided into where nearest cluster centre at a distance from the sample;
First update module, it is flat according to described first for calculating the average value of each sample in each cluster as the first average value
The current value of the cluster centre of mean value and the cluster is updated the value of the cluster centre of the cluster, is unsatisfactory for presetting after the updating
When condition, again in the sample in the multiple sample in addition to the first preset quantity target sample, second is determined
Preset quantity sample, and the second preset quantity sample redefined is divided into corresponding cluster in the cluster to each cluster
The value of the heart is updated.
7. clustering apparatus according to claim 6, which is characterized in that the preset condition includes:
Value and the average value of the angle of updated value before the cluster centre update of each cluster are less than predetermined angle;
The number being updated to the value of the cluster centre of each cluster according to identified second preset quantity sample reaches default
Number.
8. clustering apparatus according to claim 7, which is characterized in that be preset with cluster centre update in the terminal device
Rate;First update module is according to the current value of first average value and the cluster centre of the cluster to the cluster centre of the cluster
The mode that is updated of value be:
With the weight that the cluster centre turnover rate is first average value, with 1 and the difference of the cluster centre turnover rate
For the weight of the current value of the cluster centre of the cluster, adding for the current value of first average value and the cluster centre of the cluster is calculated
Weight average value, the weighted average that the value of the cluster centre of the cluster is updated to.
9. clustering apparatus according to claim 7 or 8, which is characterized in that the clustering apparatus further includes:
Second update module, in meeting the preset condition any one when, according to the multiple sample to each cluster
Cluster centre is updated, when being unsatisfactory for the preset condition after update, again according to the multiple sample to each cluster
Cluster centre is updated.
10. clustering apparatus according to claim 9, which is characterized in that second update module is according to the multiple sample
This mode being updated to the cluster centre of each cluster is:
For each sample in the multiple sample, the sample is calculated at a distance from the cluster centre of each cluster, which is drawn
Assign to the cluster where nearest cluster centre at a distance from the sample;
The average value of each sample in each cluster is calculated as the second average value, and by the value of the cluster centre of the cluster be updated to this
Two average values.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810482717.2A CN108805174B (en) | 2018-05-18 | 2018-05-18 | Clustering method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810482717.2A CN108805174B (en) | 2018-05-18 | 2018-05-18 | Clustering method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108805174A true CN108805174A (en) | 2018-11-13 |
CN108805174B CN108805174B (en) | 2022-03-29 |
Family
ID=64091236
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810482717.2A Active CN108805174B (en) | 2018-05-18 | 2018-05-18 | Clustering method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108805174B (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110045371A (en) * | 2019-04-28 | 2019-07-23 | 软通智慧科技有限公司 | Identification method, device, equipment and storage medium |
CN110400020A (en) * | 2019-07-31 | 2019-11-01 | 北京百度网讯科技有限公司 | Method and device for outputting information |
CN112215247A (en) * | 2019-07-10 | 2021-01-12 | 南京地平线机器人技术有限公司 | Method and device for clustering feature vectors and electronic equipment |
CN112560731A (en) * | 2020-12-22 | 2021-03-26 | 苏州科达科技股份有限公司 | Feature clustering method, database updating method, electronic device and storage medium |
CN112949697A (en) * | 2021-02-07 | 2021-06-11 | 广州杰赛科技股份有限公司 | Method and device for confirming pipeline abnormity and computer readable storage medium |
CN113077015A (en) * | 2021-04-29 | 2021-07-06 | 平安科技(深圳)有限公司 | Sample selection method and device, computer equipment and storage medium |
CN113111893A (en) * | 2020-01-09 | 2021-07-13 | 中国移动通信集团四川有限公司 | Data processing method and system and electronic equipment |
CN113393412A (en) * | 2020-02-27 | 2021-09-14 | 中国石油天然气股份有限公司 | Method and device for determining characteristic value of corrosion defect in gas pipeline |
CN113762311A (en) * | 2021-01-28 | 2021-12-07 | 北京沃东天骏信息技术有限公司 | A data clustering method, device and computer-readable storage medium |
CN114358102A (en) * | 2021-09-10 | 2022-04-15 | 腾讯科技(深圳)有限公司 | Data classification method, device, equipment and storage medium |
CN114662579A (en) * | 2022-03-10 | 2022-06-24 | 中国工商银行股份有限公司 | Clustering method and clustering equipment |
CN114969317A (en) * | 2021-02-26 | 2022-08-30 | 中移(苏州)软件技术有限公司 | Data processing method and device, electronic equipment and storage medium |
CN118094273A (en) * | 2024-01-19 | 2024-05-28 | 北京大学 | Clustering method, device, computer equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103049651A (en) * | 2012-12-13 | 2013-04-17 | 航天科工深圳(集团)有限公司 | Method and device used for power load aggregation |
CN103593609A (en) * | 2012-08-16 | 2014-02-19 | 阿里巴巴集团控股有限公司 | Trustworthy behavior recognition method and device |
CN103699678A (en) * | 2013-12-31 | 2014-04-02 | 苏州大学 | Hierarchical clustering method and system based on multistage layered sampling |
CN105913077A (en) * | 2016-04-07 | 2016-08-31 | 华北电力大学(保定) | Data clustering method based on dimensionality reduction and sampling |
CN106570173A (en) * | 2016-11-09 | 2017-04-19 | 重庆邮电大学 | High-dimensional sparse text data clustering method based on Spark |
CN106682116A (en) * | 2016-12-08 | 2017-05-17 | 重庆邮电大学 | OPTICS point sorting clustering method based on Spark memory computing big data platform |
CN107578070A (en) * | 2017-09-19 | 2018-01-12 | 安徽中科美络信息技术有限公司 | K means initial cluster center method for optimizing based on neighborhood information and mean difference degree |
-
2018
- 2018-05-18 CN CN201810482717.2A patent/CN108805174B/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103593609A (en) * | 2012-08-16 | 2014-02-19 | 阿里巴巴集团控股有限公司 | Trustworthy behavior recognition method and device |
CN103049651A (en) * | 2012-12-13 | 2013-04-17 | 航天科工深圳(集团)有限公司 | Method and device used for power load aggregation |
CN103699678A (en) * | 2013-12-31 | 2014-04-02 | 苏州大学 | Hierarchical clustering method and system based on multistage layered sampling |
CN105913077A (en) * | 2016-04-07 | 2016-08-31 | 华北电力大学(保定) | Data clustering method based on dimensionality reduction and sampling |
CN106570173A (en) * | 2016-11-09 | 2017-04-19 | 重庆邮电大学 | High-dimensional sparse text data clustering method based on Spark |
CN106682116A (en) * | 2016-12-08 | 2017-05-17 | 重庆邮电大学 | OPTICS point sorting clustering method based on Spark memory computing big data platform |
CN107578070A (en) * | 2017-09-19 | 2018-01-12 | 安徽中科美络信息技术有限公司 | K means initial cluster center method for optimizing based on neighborhood information and mean difference degree |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110045371A (en) * | 2019-04-28 | 2019-07-23 | 软通智慧科技有限公司 | Identification method, device, equipment and storage medium |
CN112215247A (en) * | 2019-07-10 | 2021-01-12 | 南京地平线机器人技术有限公司 | Method and device for clustering feature vectors and electronic equipment |
CN112215247B (en) * | 2019-07-10 | 2025-03-14 | 南京地平线机器人技术有限公司 | Method, device and electronic device for clustering feature vectors |
CN110400020A (en) * | 2019-07-31 | 2019-11-01 | 北京百度网讯科技有限公司 | Method and device for outputting information |
CN113111893A (en) * | 2020-01-09 | 2021-07-13 | 中国移动通信集团四川有限公司 | Data processing method and system and electronic equipment |
CN113393412A (en) * | 2020-02-27 | 2021-09-14 | 中国石油天然气股份有限公司 | Method and device for determining characteristic value of corrosion defect in gas pipeline |
CN113393412B (en) * | 2020-02-27 | 2024-05-31 | 中国石油天然气股份有限公司 | Method and device for determining characteristic value of corrosion defect in gas pipeline |
CN112560731B (en) * | 2020-12-22 | 2022-07-01 | 苏州科达科技股份有限公司 | Feature clustering method, database updating method, electronic device and storage medium |
CN112560731A (en) * | 2020-12-22 | 2021-03-26 | 苏州科达科技股份有限公司 | Feature clustering method, database updating method, electronic device and storage medium |
CN113762311A (en) * | 2021-01-28 | 2021-12-07 | 北京沃东天骏信息技术有限公司 | A data clustering method, device and computer-readable storage medium |
CN112949697A (en) * | 2021-02-07 | 2021-06-11 | 广州杰赛科技股份有限公司 | Method and device for confirming pipeline abnormity and computer readable storage medium |
CN114969317A (en) * | 2021-02-26 | 2022-08-30 | 中移(苏州)软件技术有限公司 | Data processing method and device, electronic equipment and storage medium |
CN113077015A (en) * | 2021-04-29 | 2021-07-06 | 平安科技(深圳)有限公司 | Sample selection method and device, computer equipment and storage medium |
CN114358102A (en) * | 2021-09-10 | 2022-04-15 | 腾讯科技(深圳)有限公司 | Data classification method, device, equipment and storage medium |
CN114662579A (en) * | 2022-03-10 | 2022-06-24 | 中国工商银行股份有限公司 | Clustering method and clustering equipment |
CN118094273A (en) * | 2024-01-19 | 2024-05-28 | 北京大学 | Clustering method, device, computer equipment and storage medium |
CN118094273B (en) * | 2024-01-19 | 2025-03-21 | 北京大学 | Clustering method, device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108805174B (en) | 2022-03-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108805174A (en) | clustering method and device | |
CN115131566A (en) | Automatic image segmentation method based on super-pixels and improved fuzzy C-means clustering | |
JP2021533474A (en) | Node classification method, model training method, and its equipment, equipment and computer program | |
CN110827924B (en) | Clustering method and device for gene expression data, computer equipment and storage medium | |
CN108960520A (en) | Power load prediction method, system, computer equipment and medium | |
CN111526119A (en) | Abnormal flow detection method and device, electronic equipment and computer readable medium | |
CN108197795B (en) | Malicious group account identification method, device, terminal and storage medium | |
CN118504775B (en) | Urban planning method and system based on digital twinning | |
CN104484600B (en) | Intrusion detection method and device based on improved density clustering | |
CN106682414A (en) | Method and device for establishing timing sequence prediction model | |
CN110751354B (en) | Abnormal user detection method and device | |
CN110796485A (en) | Method and device for improving prediction precision of prediction model | |
CN115099875A (en) | Data classification method based on decision tree model and related equipment | |
CN110166344A (en) | A kind of identity recognition methods, device and relevant device | |
CN114781650A (en) | Data processing method, device, equipment and storage medium | |
CN110378543A (en) | Leaving office Risk Forecast Method, device, computer equipment and storage medium | |
CN112416560A (en) | Truth value inference and online task allocation method for numerical tasks in crowdsourcing scene | |
CN118245823B (en) | Insulator leakage current prediction method, device, equipment and storage medium | |
CN117495571B (en) | Data processing method and device, electronic equipment and storage medium | |
CN113807391A (en) | Task model training method and device, electronic equipment and storage medium | |
CN112597323A (en) | Remote sensing image storage and migration method and device and storage medium | |
CN118568528A (en) | Photovoltaic power generation power prediction and model construction method and device | |
CN117369954A (en) | JVM optimization method and device of risk processing framework for big data construction | |
CN117253079A (en) | Model training method, device, equipment and storage medium | |
CN111079843A (en) | Training method based on RBF neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |