An anomaly detection method based on distribution-probability similarity measurement
[technical field]
The present invention relates to anomaly detection methods in the field of machine learning, and more particularly to an anomaly detection method based on distribution-probability similarity measurement.
[background art]
When anomaly detection problems are solved with machine learning methods, several difficulties arise: no abnormal data are available for training, some anomalies are only local, and the distributions of the individual dimensions of the data may differ widely. Solving the unsupervised local anomaly detection problem with a suitable algorithm, so as to improve a model's ability to discriminate normal samples from the full range of abnormal samples, is one of the current research hotspots.
At present, conventional methods for the anomaly detection problem fall broadly into three categories. The first is the mathematical-statistics approach, which judges whether a point is abnormal from the probability of occurrence of each dimension of the data point, or of various combinations of its dimensions. The second is the distance-based approach, which judges abnormality mainly by computing how close a test point is to the normal data, either locally or globally. The third is the distribution-based approach, which judges abnormality by clustering or by computing relative distribution densities.
These methods, however, rest on certain assumptions: that abnormal points lie far from the clusters of the normal distribution, that the density of abnormal data is far lower than that of normal data, and so on. In a real application environment, the abnormal data may be highly concentrated, may lie close to the normal clusters, or may even be wrapped inside a normal cluster. The assumptions made by the above algorithms are thus over-idealized; in practice they do not always hold, or hold only with small probability, which makes the recognition performance of the model unstable. In addition, clustered anomalies arise in events such as epidemic outbreaks and network attacks: when the anomaly occurs, a large amount of abnormal data concentrates in one or more regions of the distribution. In other words, the anomalies are spatially concentrated and densely distributed, and this situation cannot be judged correctly by the family of density-based algorithms. In practice the concentrated-anomaly case is of great research value: a discrete abnormal point indicates an anomaly whose probability of occurrence is low, whereas a region of concentrated anomalies has a relatively high probability of occurrence, and detecting such anomalies can reduce losses to the greatest extent. In a network attack, for example, if the clustered anomaly can be detected in time and its attack pattern identified, effective information support can be given to network operation and maintenance staff, and compromise of the system can be avoided.
In conclusion for abnormality detection problem, there are following difficult points at present: abnormal data acquisition is more difficult;Abnormal number
More closely there is local anomaly according to away from normal data, existing major part algorithm only judges abnormal journey by global or local data distribution
Degree, can not comprehensively consider global and local distributed intelligence;Abnormal data distribution is more intensive, and distribution density is close with normal data
Even higher, highdensity abnormal data cluster is easily judged as normal data by the algorithm based on relative density;Each attribute of data point
Cloth range difference is larger, and by traditional distance calculating method such as Euclidean distance, weight has big difference between different dimensions, and early period
Normalization, standardization can adapt data original distribution state again;Abnormal data is distributed in inside normal data, and is distributed
Range has certain intersection;Have label training data concentrate data point distribution will not completely high density concentrate on a region, need to sieve
The training data for having information redundancy is selected, guarantees that subsequent processing is interference-free.
[summary of the invention]
In view of this, an embodiment of the present invention proposes an anomaly detection method based on distribution-probability similarity measurement, so as to improve the discrimination of the classification model over positive and negative samples as a whole.
An anomaly detection method based on distribution-probability similarity measurement proposed by an embodiment of the present invention comprises:
randomly sampling multiple times to obtain multiple subsets of the normal sample data, saving the random isolation process of each subset in a full binary tree structure, and setting the threshold depth for backtracking according to a deviation ratio;
according to the leaf node of each tree into which a test point falls and the threshold depth, backtracking from the leaf node containing the test point to its ancestor node at the threshold depth, and extracting all data under that node as the training data for measuring similarity with the test point;
taking the test point and a point in the training data set as endpoints, computing separately for each attribute dimension the probability that the remaining data points appear between these two points, and combining this with the Minkowski distance to compute the dissimilarity between the test point and all points in the data set, obtaining the anomaly score of the point.
In the above method, the multiple random samplings that obtain multiple subsets of the normal sample data, the saving of each subset's random isolation process in a full binary tree structure, and the setting of the backtracking threshold depth according to the deviation ratio proceed as follows. The training data set D is randomly sampled to obtain several training subsets X_all; each subset X contains m samples, X = {X1, X2, ..., Xm}, where m is a positive integer smaller than the size of the training data set D and may be chosen appropriately for the actual situation. Each sample point contains n dimensions, i.e. the i-th sample is Xi = (xi1, xi2, ..., xin). A dimension and an isolation threshold are chosen at random, the isolation threshold being a random value in that dimension between the subset's maximum and minimum. Iteration continues until one of the following three conditions is met, at which point it ends: (1) each isolated subspace contains only one sample; (2) every sample point in the subspace has the same value in the chosen dimension; (3) the iteration limit is reached. This process is recorded in a tree structure, forming a full binary tree in which each node has either zero or two child nodes; a leaf node stores the samples of its isolated subspace, and an internal node stores the isolation dimension and the corresponding threshold. Dt serves as the depth threshold for the backtracking search for neighbouring training points in a tree, and is determined from the average depth E(h(x)) of each training point over all trees, as follows:
E(h(x)) = (1/t) · Σ_{i=1}^{t} l_i(x)
where E(h(x)) is the mean depth of sample x after traversing all t isolation trees, t is a suitable positive integer chosen according to the actual situation, and l_i(x) is the path depth of x in the i-th tree;
For Dt a deviation ratio r must be set, 0 ≤ r ≤ 1, i.e. the proportion of data in the all-normal training data set D that deviates relatively far from the normal data distribution; the setting of r depends on the dispersion of the data distribution and on the metric requirements placed on the model in the actual situation. According to the chosen deviation ratio r, the minimum value among the top ((1-r)·100)% of the training samples' mean depths is selected as the depth threshold Dt for the backtracking search for local training points in the trees.
In the above method, the backtracking from the leaf node of each tree into which the test point falls to the ancestor node at the threshold depth, and the extraction of all data under that node as the training data for measuring similarity with the test point, proceed as follows. The test point is sent down every tree. If the test sample falls into a node below depth Dt, the node containing the test point is backtracked upward to the ancestor node at depth Dt, and all training samples Ltd under that ancestor node are extracted as the data from which the test sample's degree of abnormality is next computed; the data extracted from all the trees are merged as the training set for the next stage. If the test sample is isolated before reaching depth Dt, all the data in the tree are used as the training set for the next stage: such a test sample lies far from the normal data in the global distribution and has little local training data in its vicinity, so all of the tree's training samples are extracted as the next-stage training data to further verify the degree of abnormality of the test data.
In the above method, the computation that takes the test point and a point in the training data set as endpoints, computes separately for each attribute dimension the probability that the remaining data points appear between the two points, combines this with the Minkowski distance to compute the dissimilarity between the test point and all points in the data set, and obtains the anomaly score of the point, proceeds as follows. First, Ri(x, y) is defined as the region between the values xi and yi of samples x and y in the i-th dimension, where here x ∈ Ltd. Let S be the subspace occupied by all the data of Ltd, and Si the distribution range of space S in the i-th dimension; if yi falls outside Si, the extreme value of Si is taken as the boundary, and Ri(x, y) becomes Ri(x, S). Numi(x, y, d, S) is a Boolean indicating whether the i-th dimension value di lies within the range Ri(x, y), where d is any sample of Ltd other than x. Mi(x, y | Ltd, S) is the number of training points of Ltd lying between x and y in the i-th dimension, as follows:
Mi(x, y | Ltd, S) = Σ_{d ∈ Ltd, d ≠ x} I(Numi(x, y, d, S))
where I(·) is the indicator function, whose value is 1 if the condition in the brackets is true and 0 otherwise. The proportion of Ltd accounted for by Mi(x, y | Ltd, S) is then taken as the dissimilarity of x and y in the i-th dimension, and the dissimilarity of x and y over all dimensions, D'(x, y), is computed as follows:
D'(x, y) = ( Σ_{i=1}^{n} ( Mi(x, y | Ltd, S) / |Ltd| )^p )^(1/p)
where p is the exponent of the Minkowski distance. Finally, the anomaly score p(y) of test point y is as follows:
p(y) = Σ_{x ∈ Ltd} D'(x, y)
where p(y) is the sum of the dissimilarities between test point y and all points of Ltd. The test points are ranked: the larger p(y), the higher the degree of abnormality.
[description of the drawings]
In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for the embodiments are briefly described below. It is apparent that the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without any creative labor.
Fig. 1 is a flow diagram of the anomaly detection method based on distribution-probability similarity measurement proposed by an embodiment of the present invention.
[specific embodiment]
For a better understanding of the technical solution of the present invention, the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
It will be appreciated that the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
An embodiment of the present invention provides an anomaly detection method based on distribution-probability similarity measurement. Fig. 1 is a flow diagram of the method proposed by this embodiment; as shown in Fig. 1, the method comprises the following steps.
Step 101: randomly sample multiple times to obtain multiple subsets of the normal sample data, save the random isolation process of each subset in a full binary tree structure, and set the threshold depth for backtracking according to the deviation ratio.
Specifically, the training data set D is randomly sampled to obtain several training subsets X_all; each subset X contains m samples, X = {X1, X2, ..., Xm}, where m is a positive integer smaller than the size of the training data set D and may be chosen appropriately for the actual situation. Each sample point contains n dimensions, i.e. the i-th sample is Xi = (xi1, xi2, ..., xin). A dimension and an isolation threshold are chosen at random, the isolation threshold being a random value in that dimension between the subset's maximum and minimum. Iteration continues until one of the following three conditions is met, at which point it ends: (1) each isolated subspace contains only one sample; (2) every sample point in the subspace has the same value in the chosen dimension; (3) the iteration limit is reached. This process is recorded in a tree structure, forming a full binary tree in which each node has either zero or two child nodes; a leaf node stores the samples of its isolated subspace, and an internal node stores the isolation dimension and the corresponding threshold. Dt serves as the depth threshold for the backtracking search for neighbouring training points in a tree, and is determined from the average depth E(h(x)) of each training point over all trees, as follows:
E(h(x)) = (1/t) · Σ_{i=1}^{t} l_i(x)
where E(h(x)) is the mean depth of sample x after traversing all t isolation trees, t is a suitable positive integer chosen according to the actual situation, and l_i(x) is the path depth of x in the i-th tree;
For Dt a deviation ratio r must be set, 0 ≤ r ≤ 1, i.e. the proportion of data in the all-normal training data set D that deviates relatively far from the normal data distribution; the setting of r depends on the dispersion of the data distribution and on the metric requirements placed on the model in the actual situation. According to the chosen deviation ratio r, the minimum value among the top ((1-r)·100)% of the training samples' mean depths is selected as the depth threshold Dt for the backtracking search for local training points in the trees.
Algorithm 1 and Algorithm 2 give the pseudocode for the isolation process of step 101 and for the depth-threshold setting method.
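Since the pseudocode of Algorithms 1 and 2 is not reproduced here, the following is a minimal Python sketch of step 101 under stated assumptions: the tree is grown exactly as described (random dimension, random split value, stopping on isolation, constant values, or a depth limit), and Dt is taken as the minimum of the top ((1-r)·100)% of the per-sample mean depths. The names `IsolationNode`, `build_forest` and `depth_threshold` are illustrative, not from the original.

```python
import random

class IsolationNode:
    """One node of the full binary tree recording the random isolation process."""
    def __init__(self, data, depth, limit):
        self.left = self.right = None
        self.dim = self.threshold = None
        self.data = data  # samples of the isolated subspace (used at leaves)
        # stop: (1) a single sample is isolated, or (3) the depth limit is hit
        if len(data) <= 1 or depth >= limit:
            return
        dim = random.randrange(len(data[0]))
        lo = min(p[dim] for p in data)
        hi = max(p[dim] for p in data)
        if lo == hi:  # stop: (2) identical values in the chosen dimension
            return
        self.dim, self.threshold = dim, random.uniform(lo, hi)
        self.left = IsolationNode([p for p in data if p[dim] < self.threshold],
                                  depth + 1, limit)
        self.right = IsolationNode([p for p in data if p[dim] >= self.threshold],
                                   depth + 1, limit)

def build_forest(D, t=10, m=64, limit=8):
    """t isolation trees, each on a random subset of m samples of D."""
    return [IsolationNode(random.sample(D, min(m, len(D))), 0, limit)
            for _ in range(t)]

def path_depth(tree, x):
    """l_i(x): depth of the leaf that sample x reaches in one tree."""
    node, depth = tree, 0
    while node.dim is not None:
        node = node.left if x[node.dim] < node.threshold else node.right
        depth += 1
    return depth

def depth_threshold(forest, D, r=0.1):
    """Dt: minimum of the top ((1-r)*100)% of mean depths E(h(x)) over D."""
    mean_depths = sorted((sum(path_depth(tr, x) for tr in forest) / len(forest)
                          for x in D), reverse=True)
    k = max(1, int((1 - r) * len(mean_depths)))
    return mean_depths[k - 1]
```

With r = 0, Dt is simply the smallest mean depth over the whole training set; a larger r discards the most quickly isolated (most deviating) fraction of samples before the minimum is taken.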
Step 102: according to the leaf node of each tree into which the test point falls and the threshold depth, backtrack from the leaf node containing the test point to the ancestor node at the threshold depth, and extract all data under that node as the training data for measuring similarity with the test point.
Specifically, the test point is sent down every tree. If the test sample falls into a node below depth Dt, the node containing the test point is backtracked upward to the ancestor node at depth Dt, and all training samples Ltd under that ancestor node are extracted as the data from which the test sample's degree of abnormality is next computed; the data extracted from all the trees are merged as the training set for the next stage. If the test sample is isolated before reaching depth Dt, all the data in the tree are used as the training set for the next stage: such a test sample lies far from the normal data in the global distribution and has little local training data in its vicinity, so all of the tree's training samples are extracted as the next-stage training data to further verify the degree of abnormality of the test data.
Algorithm 3 and Algorithm 4 give the pseudocode of the local-training-data extraction method of step 102.
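The pseudocode of Algorithms 3 and 4 is likewise not reproduced; as a sketch of step 102 only, the following self-contained Python fragment builds one isolation tree as nested dicts and then extracts Ltd for a test point by backtracking to the depth-Dt ancestor. The dict-based tree and the names `build_tree`, `leaves_under` and `local_training_set` are assumptions for illustration, not the original pseudocode.

```python
import random

def build_tree(data, depth=0, limit=8):
    """One random isolation tree over `data` (a list of tuples), as nested dicts."""
    node = {"data": data, "dim": None, "thr": None, "left": None, "right": None}
    if len(data) <= 1 or depth >= limit:
        return node
    dim = random.randrange(len(data[0]))
    lo, hi = min(p[dim] for p in data), max(p[dim] for p in data)
    if lo == hi:
        return node
    node["dim"], node["thr"] = dim, random.uniform(lo, hi)
    node["left"] = build_tree([p for p in data if p[dim] < node["thr"]],
                              depth + 1, limit)
    node["right"] = build_tree([p for p in data if p[dim] >= node["thr"]],
                               depth + 1, limit)
    return node

def leaves_under(node):
    """All training samples stored in the leaves below `node`."""
    if node["dim"] is None:
        return list(node["data"])
    return leaves_under(node["left"]) + leaves_under(node["right"])

def local_training_set(trees, y, dt):
    """Ltd: walk y down each tree recording the path; if the leaf lies deeper
    than dt, keep everything under the ancestor at depth dt; otherwise (the
    point was isolated early, i.e. far from the normal data) keep the whole
    tree's samples. The extractions of all trees are merged."""
    ltd = []
    for tree in trees:
        path, node = [tree], tree
        while node["dim"] is not None:
            node = node["left"] if y[node["dim"]] < node["thr"] else node["right"]
            path.append(node)
        anchor = path[int(dt)] if len(path) - 1 > dt else tree
        ltd.extend(leaves_under(anchor))
    return ltd
```

Note that Ltd is a multiset: a sample appearing under the anchor node of several trees is extracted once per tree, which weights points that are repeatedly close to the test point.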
Step 103: taking the test point and a point in the training data set as endpoints, compute separately for each attribute dimension the probability that the remaining data points appear between the two points, and combine this with the Minkowski distance to compute the dissimilarity between the test point and all points in the data set, obtaining the anomaly score of the point.
Specifically, first, Ri(x, y) is defined as the region between the values xi and yi of samples x and y in the i-th dimension, where here x ∈ Ltd. Let S be the subspace occupied by all the data of Ltd, and Si the distribution range of space S in the i-th dimension; if yi falls outside Si, the extreme value of Si is taken as the boundary, and Ri(x, y) becomes Ri(x, S). Numi(x, y, d, S) is a Boolean indicating whether the i-th dimension value di lies within the range Ri(x, y), where d is any sample of Ltd other than x. Mi(x, y | Ltd, S) is the number of training points of Ltd lying between x and y in the i-th dimension, as follows:
Mi(x, y | Ltd, S) = Σ_{d ∈ Ltd, d ≠ x} I(Numi(x, y, d, S))
where I(·) is the indicator function, whose value is 1 if the condition in the brackets is true and 0 otherwise. The proportion of Ltd accounted for by Mi(x, y | Ltd, S) is then taken as the dissimilarity of x and y in the i-th dimension, and the dissimilarity of x and y over all dimensions, D'(x, y), is computed as follows:
D'(x, y) = ( Σ_{i=1}^{n} ( Mi(x, y | Ltd, S) / |Ltd| )^p )^(1/p)
where p is the exponent of the Minkowski distance. Finally, the anomaly score p(y) of test point y is as follows:
p(y) = Σ_{x ∈ Ltd} D'(x, y)
where p(y) is the sum of the dissimilarities between test point y and all points of Ltd. The test points are ranked: the larger p(y), the higher the degree of abnormality.
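As a sketch of step 103 only, the following Python fragment computes D'(x, y) and the anomaly score p(y) for a local training set Ltd given as a plain list of tuples. The clipping of y to the boundary of Si (turning Ri(x, y) into Ri(x, S)) and the division by |Ltd| follow the description above; the helper names are illustrative assumptions.

```python
def dissimilarity(x, y, ltd, s_lo, s_hi, p=2):
    """D'(x, y): Minkowski combination over the dimensions of the fraction
    of Ltd points whose i-th value falls in the region R_i(x, y)."""
    total = 0.0
    for i in range(len(x)):
        yi = min(max(y[i], s_lo[i]), s_hi[i])  # R_i(x, y) -> R_i(x, S)
        lo, hi = min(x[i], yi), max(x[i], yi)
        # M_i(x, y | Ltd, S): training points other than x between x and y
        m_i = sum(1 for d in ltd if d is not x and lo <= d[i] <= hi)
        total += (m_i / len(ltd)) ** p
    return total ** (1.0 / p)

def anomaly_score(y, ltd, p=2):
    """p(y): sum of D'(x, y) over all x in Ltd; larger means more abnormal."""
    n = len(y)
    s_lo = [min(d[i] for d in ltd) for i in range(n)]  # boundaries of S_i
    s_hi = [max(d[i] for d in ltd) for i in range(n)]
    return sum(dissimilarity(x, y, ltd, s_lo, s_hi, p) for x in ltd)
```

A test point inside a dense region of Ltd accumulates small between-point masses in every dimension and so receives a low score, while a point at or beyond the edge of Ltd accumulates large masses and a high score.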
Table 1 shows, for the anomaly detection task on 10 public data sets solved by the method provided by the embodiment of the present invention, the preprocessing of the data sets: the classes in each data set are divided into normal data and abnormal data.
Table 1
Table 2 shows, for the same anomaly detection task on the 10 public data sets, the comparison experiments on the AUC value (the probability that a randomly selected positive sample is ranked higher than a randomly selected negative sample). The comparison methods used in the embodiment of the present invention are eight methods of the same type that do not require balanced classification: KNN, iForest, SCiForest, iNNE, ALSH, L1SH, L2SH and KLSH. From Table 2 it can be concluded that the proposed method DPSM achieves a significant improvement in AUC over the comparison methods on the public data sets. In particular, it improves greatly in the first seven groups of experiments, where the proposed method is at the highest level among the compared methods, and in the remaining two groups it is also close to the highest level. The method proposed by the embodiment of the present invention thus achieves a certain breakthrough among anomaly detection methods.
Table 2
In conclusion the embodiment of the present invention has the advantages that
In the technical solution that the present invention is implemented, multiple stochastical sampling obtains multiple subsets of normal sample data, with complete two
Fork tree construction saves the random isolation processes of each subset, and the threshold depth of backtracking delimited according to drift ratio;According to test point
External leaf node position and the threshold depth for falling in each tree, the ancestors that leaf node where it traces back to threshold depth save
Point extracts training data of all data as measurement and test point similarity under the node;With test point and training dataset
Certain interior point is endpoint, probability of the remaining data points appearance between this two o'clock is calculated separately in each attribute dimensions, in conjunction with Min Shi
Distance calculates the dissimilar degree of all the points in test point and data set, obtains the exceptional value of the point.According to embodiments of the present invention
The technical solution of offer can effectively solve local anomaly test problems, can be using original training data according to where test point
The distribution of regional area normal data obtains its intensity of anomaly, improve the local anomaly of abnormality detection model detectability and its
Overall target.
The foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit the present invention; any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.