Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the embodiments of the disclosure can be practiced without one or more of the specific details, or with other means, components, materials, devices, or operations. In such cases, well-known structures, methods, devices, implementations, materials, or operations will not be shown or described in detail.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As previously mentioned, in a data-driven context, traffic requires indicators to interpret to describe, track, and push traffic. And the index fluctuates during the service operation. Therefore, the process of finding the cause of the fluctuation is often the trigger for positioning the problem, correcting in time, finding the growth point and generating the business insights.
The inventor finds that when the conventional data product is used for detecting the abnormality, the abnormality is often detected by adopting a method of artificially setting a threshold value according to experience under the condition of not considering the periodicity and the trend of time series data, so that false alarm is easily generated. When the abnormity is detected, multi-dimensional index disassembly is not supported, and the application of attribution of related indexes and events is easy to generate meaningless attribution results.
According to the embodiment of the application, when the abnormity is detected, the time series decomposition algorithm is utilized, and the abnormity detection is carried out on the target index on the basis of identifying the periodic component and the trend component of the target index, so that the accuracy of judging whether the target index is abnormal or not is improved.
According to other embodiments of the application, when the abnormal reason of the target index is positioned, the abnormal reason of the target index is positioned by adopting methods such as dimension disassembly, related indexes and related events, so that the abnormal reason of the target index is analyzed more comprehensively, and the accuracy and the interpretability of the positioning of the abnormal reason are improved.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, various non-limiting embodiments accompanying the present application will be described below by way of example with reference to the accompanying drawings. Other embodiments, which can be derived by one of ordinary skill in the art from the embodiments given herein without making any creative effort, are also within the scope of the present disclosure.
Before describing embodiments of the present application, terms appearing in the present application are explained first.
Dimension: factors that affect the target index change.
Kats (kit to Analyze Time Series, kats for short) is a lightweight, easy-to-use, extensible, and versatile framework for timing analysis in Python.
ARIMA (automated Integrated Moving Average model, abbreviated as differential Integrated Moving Average Autoregressive model) is a time series prediction analysis method.
Random forest: the method is an algorithm for integrating a plurality of trees by the idea of ensemble learning, the basic unit of the algorithm is a decision tree, and the algorithm is a machine learning algorithm. A
AutoEncoder: also known as an autoencoder, is a type of artificial neural network used in semi-supervised learning and unsupervised learning, and performs characterization learning on input information by using the input information as a learning target.
Prophet algorithm: the method for predicting the time series data based on the additive model has strong robustness to lost data and trend change, can generally well process abnormal values, and is suitable for the time series with strong seasonal influence and a plurality of seasonal historical data. Specific embodiments according to the present application will be described in detail below with reference to the accompanying drawings.
Fig. 1 shows a flowchart of a method for locating an abnormal cause of a target index according to an exemplary embodiment of the present application, and the following describes in detail the method for locating an abnormal cause of a target index according to an exemplary embodiment of the present application with reference to fig. 1.
As shown in fig. 1, in step S101, in response to a target index abnormality, a dimension combination is constructed according to a preset dimension.
According to some embodiments, the preset dimensions include an advertiser type, an advertising plan type, an advertising sub-channel, and a payment type.
According to the embodiment of the present application, before step S101, the method for locating an abnormality cause shown in fig. 1 further includes performing abnormality detection on the target index.
For example, the target indicator is detected for anomalies using time series decomposition algorithms, which according to some embodiments include Kats, ARIMA, random forest, autoencor, and Prophet algorithms.
According to some embodiments, the target indicator is detected for abnormality through the steps shown in fig. 2, which are not described herein again.
According to some embodiments, in step S101, dimension combinations are constructed through the steps shown in fig. 3, which are not described herein again.
In step S103, a contribution degree of the dimension combination is calculated, where the contribution degree is used to characterize a contribution size of the dimension combination causing the target index to be abnormal.
According to the embodiment of the present application, step S103 calculates the contribution degrees of each dimension in the dimension combination, and the specific calculation process is shown in fig. 3 and is not described herein again.
In step S105, the dimension combination affecting the target index abnormality is located according to the contribution degree.
According to some embodiments, the decision tree model is used to locate the dimension combinations affecting the target index anomalies according to the contribution degrees, and the final output result is the difference dimension combination contribution rate shown in fig. 10, where the value of each contribution rate corresponds to one dimension combination. In the result shown in fig. 10, the larger the value of the contribution ratio is, the greater the relevance between the dimensional combination corresponding to the contribution ratio and the occurrence of an abnormality in the target index is.
For example, the input X of the decision tree model is different dimension combinations, the predicted marking y is the contribution rate, and the regression prediction algorithm is utilized to position the dimension combination which influences the target index abnormality. The specific process is as follows:
firstly, calculating the variance of the contribution degree in the dimension combination;
then, determining the highest information gain cutting method according to the calculated variance;
and finally, cutting a data space through a greedy algorithm, and finding a dimension combination space with the highest contribution absolute value.
Because the decision tree has the problem of overfitting, according to the embodiment of the application, when the decision tree model is constructed, a post-pruning method is adopted, so that the average F1-score on the simulation data set is up to 91.9%.
According to an embodiment of the application, a decision tree is constructed with post-pruning by:
firstly, constructing a whole decision tree;
and then, if the generalization performance can be improved after the subtree corresponding to the non-leaf node of the decision tree is changed into the leaf node, replacing the subtree with the page node.
Then, selecting a non-leaf node of a level with a large node surface error rate gain value, and deleting left and right child nodes of the non-leaf node;
and then, if the surface error rate gain values of a plurality of non-leaf nodes are the same or small, selecting the non-leaf node with the largest number of sub-nodes in the non-leaf nodes for pruning.
And finally, determining the number of layers of the decision tree by using the complexity of a post-pruning algorithm.
According to some embodiments, the complexity of the post-pruning algorithm is shown in equation (1).
Wherein, R (T) is the variance of the node T, R (Tt) is the sum of the variances of the subtrees Tt of the node T, and | T | is the number of nodes of the decision tree. Alpha is alpha eff Large indicates that the information gain of the node t is high.
FIG. 11 is a diagram illustrating the number of transaction dimensions versus the mean of the variance of the decision tree layer nodes according to an example embodiment of the present application. Taking the curve shown by the reference numeral 2 as an example, when the dimension of the transaction is 2, the variance of the decision tree at the second layer is the highest, and the decision tree is divided downwards from the second layer, so that the smaller the variance is, the smaller the information gain is, and the overfitting is obvious. From the polyline shown in FIG. 11, the inflection points of the variance are at 2 levels, and thus, the maximum depth of the decision tree is equal to 2.
For the index that cannot be disassembled or the reason that the target index is abnormal still cannot be located according to the calculated contribution degree after dimension disassembly, according to the embodiment of the present application, the correlation between each index in the preset index set and the target index is calculated by using the correlation index attribution method as shown in fig. 4, or the correlation between each event in the preset event set and the target index is determined by using the correlation event attribution method as shown in fig. 5. And determining the reason causing the target index abnormality according to the correlation value.
According to the embodiment shown in fig. 1, when the abnormal cause of the target index is positioned, when dimension disassembly is adopted, dimension combination causing the target index abnormality is obtained by dimension drilling down and utilizing a decision tree layer by layer, or the cause causing the target index abnormality is determined by calculating the cross-correlation value of the target index and the related value of the related event, so that the abnormal cause of the target index is analyzed more comprehensively, and the accuracy and the interpretability of abnormal cause positioning are improved.
Fig. 2 is a flowchart of an anomaly detection method according to an exemplary embodiment of the present application, as shown in fig. 2, in step S201, time series data is decomposed into a period component, a trend component and a residual component by using a time series decomposition algorithm;
in step S203, the period component and the trend component are removed from the time series data to obtain a residual component of the time series data.
According to some embodiments, the period component is a fluctuation of the target index in a certain period, e.g., day, week. The trend component is a low variance change of the target index over a long period (low frequency or large period). The residual component is a value obtained by removing the trend component and the cycle component from the time-series data of the target index.
According to embodiments of the present application, a Kats algorithm is employed to detect timing outliers.
In step S205, an abnormality determination is performed on the residual component.
According to some embodiments of the present application, according to the distribution of the time sequence residual value of the target index, if the residual value exceeds a quartile interval of 3 times, the target index is considered to be abnormal.
Fig. 12 is a schematic diagram of a change detection billboard according to an exemplary embodiment of the present application, such as the time-varying graph of the target index shown in fig. 12, where there is a drop-shaped graphical location indicating that the target index is abnormal.
According to the embodiment of the application, when the abnormity is detected, the time series decomposition algorithm is utilized, and the abnormity detection is carried out on the target index on the basis of identifying the periodic component and the trend component of the target index, so that the accuracy of judging whether the target index is abnormal or not is improved.
Fig. 3 is a flowchart illustrating a dimension splitting method according to an exemplary embodiment of the present application, and as shown in fig. 3, in step S301, a preset dimension is split to obtain a sub-dimension of the preset dimension;
according to the embodiment of the application, at least one of an absolute value index dimension disassembling method, a ratio index dimension disassembling method and a sub-index multiplication disassembling method is used for disassembling the preset dimension.
The absolute value index dimension resolving method is shown in formula (2).
Y=X1+X2+X3 (2)
Wherein, Y is a target index, for example, a flow rate of each channel, and X1, X2, and X3 are flow rates of a certain channel, for example, a terminal application wake-up channel or an external casting drainage channel.
The ratio index dimension decomposition method is shown in formula (3).
Where Y is the target index, e.g., page click rate, P is the number of page clicks, and S is the pageThe number of surface views, xi is the channel dimension,
the number of page clicks for the APP wake-up channel,
and i is the channel number.
The sub-index multiplication disassembly method is shown in formula (4).
Y=∏ i X i (4)
Where Y is the target index, e.g., in the case of known thousand search yields = average price per click thousands times
In the case of click rate, Y is the thousand search yields, X1 is the average price per click, and X2 is the thousand click rate.
For example, the last 1 day's effective advertising financial consumption = the last 1 day's effective bid page visit amount, the last 1 day's effective advertising depth, the last one day's effective advertising coverage, the last 1 day's effective advertising click cost.
It should be noted that what disassembling method is used is related to the characteristics of the predetermined dimension. And if the preset dimension is the sum of several sub-dimensions, such as a user access frequency index and a browsed page number index, utilizing an absolute value class index dimension disassembling method. If the preset dimension is the product of several sub-dimensions, such as revenue, a ratio index dimension decomposition method is adopted. And if the preset dimension is a ratio index, for example, the page click rate, adopting a ratio index dimension disassembling method. How to disassemble the preset dimension indexes can be simultaneously performed by two or three of an absolute value index dimension disassembling method, a ratio index dimension disassembling method and a sub-index multiplication disassembling method, and any index dimension disassembling method can be adopted for disassembling.
In step S303, the sub-dimensions are combined to obtain a dimension combination, where the obtained dimension combination is a sub-dimension combination of a preset dimension.
For example, the sub-dimensions are combined using the cartesian product.
Assuming that the target index Y fluctuates by
Wherein Y1 is the data value of the current month, and Y0 is the data value of the previous month (year-on-year or year-on-year). Calculating the contribution degree of the dimension combination means calculating the contribution size of the dimension combination causing the target index to be abnormal.
According to the embodiment of the application, the absolute value class index dimension decomposition method calculates the contribution degree by using the method shown in formula (5).
According to the embodiment of the present application, the ratio index dimension decomposition method calculates the contribution degree using the methods shown in equations (6) to (8).
According to the embodiment of the application, the sub-index multiplication disassembling method utilizes the method shown in the formula (9)
The month of day data, X, of the disassembled index X 0 And the data of the previous month ratio or ring ratio of the disassembled index X.
Fig. 4 shows a flowchart of a method for calculating a correlation between a preset index set and a target index, as shown in fig. 4, and in step S401, a cross-correlation value between each index in the preset index set and the target index is calculated, where the index set is an index set that affects an abnormality of the target index, as shown in fig. 5.
In step S403, the indexes related to the target index abnormality in the preset index set are determined by using the cross correlation values. And the larger the calculated correlation value is, the greater the relevance between the indexes in the preset index set and the target index abnormity is.
Fig. 5 is a flowchart illustrating a method for calculating a cross-correlation value between each index in a preset index set and a target index according to an exemplary embodiment of the present application, as shown in fig. 5, in step S501, a time-series sequence of each index in the preset index set and the target index is calculated respectively by using an anomaly detection model.
According to some embodiments, assuming that X1 is a target index time-series value and X2 is a time-series value of each index in a preset index set, the target index time-series value and the time-series value of each index in the preset index set are calculated by formula (10) and formula (11), respectively.
Y1=f(X1) (10)
Y2=f(X2) (11)
Wherein,
indicating the degree of abnormality of the target index Xi at time t;
a data value indicating that the observed target index X1 occurs at a certain time T, T belonging to a time set T within the observation period;
representing the data value of each set X2 in a preset index set for observation at a certain time T, wherein T belongs to a time set T in an observation period; f is an anomaly detection model, such as Kats algorithm, ARIMA algorithm, random forest algorithm, autoEncoder algorithm, and Prophet algorithm.
In step S503, a cross-correlation value of the time-series sequence of each index in the preset index set and the time-series sequence of the target index is calculated.
According to some embodiments, the cross-correlation value of each index in the preset index set with the target index is calculated by formula (12).
k = -2, indicating that index Y2 is obtained two days later than the value of target index Y1.
Fig. 6 shows a flowchart of a method for calculating a correlation between a preset event set and a target index according to an exemplary embodiment of the present application, and as shown in fig. 6, in step S601, a correlation value between each event in the preset event set and the target index is calculated, where the event set is an event set that affects an abnormality of the target index, as shown in fig. 7.
In step S603, an event related to the target index abnormality in the preset event set is determined using the correlation value. The larger the calculated correlation value is, the greater the relevance between the event in the preset event set and the target index abnormality is.
Fig. 7 is a flowchart illustrating a method for calculating a correlation value between each event in a preset event set and a target index according to an exemplary embodiment of the present application, where in step S701, a time sequence of the target index is calculated using an anomaly detection model, as shown in formula (3), as shown in fig. 7.
In step S703, the time-series of each event in the event set is determined using the time length of the time-series of the target index.
For example, X2= {1 (did _ happy) T }, T ∈ T, representing whether each event in the event set occurred, if it occurred, 1, and if it did not occur, 0 at some time T, where the observation time length of X2 is | T '|, and | T' | ≦ T |, and
in step S705, the correlation value of the time series of each event and the time series of the target index is calculated as shown in equations (13) and (14).
Wherein r is MEAN Is the daily average of the outliers of the target index within the time window T of occurrence of the event.
r SUM (Y 1 ,X 2 )=∑ T (Y 1 ,X 2 ) (14)
Wherein r is SUM The sum of the outliers of the target index within the time window of occurrence T.
If the abnormal values of the formula (13) and the formula (14) are high, it indicates that the event has a strong causal relationship causing the abnormality of the target index.
According to embodiments of the present application, the event dimension mainly includes holidays, a/B experiments (e.g., new functions are tested by a method of a/B experiments, a tested user group is expanded, or the entire user group receives a new optimization scenario), bug fixes, version releases, code rollback, marketing budget adjustment, promotional campaigns, and pricing mechanism adjustment.
The embodiments of the present application have been described above primarily from a method perspective. Those of skill in the art will readily appreciate that the present application is capable of hardware or a combination of hardware and computer software implementing the operations or steps of the various examples described in connection with the embodiments disclosed herein. Skilled artisans may implement the described functionality in varying ways for each particular operation or method, and such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Embodiments of the apparatus of the present application are described below. For details which are not described in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Fig. 8 is a block diagram of an abnormal cause locating device of a target index according to an exemplary embodiment of the present application, where the abnormal cause locating device shown in fig. 8 includes a dimension combination construction unit 801, a contribution degree calculation unit 803, and a dimension combination determination unit 805, where the dimension combination construction unit 801 is configured to construct a dimension combination according to a preset dimension in response to an abnormal target index; the contribution degree calculating unit 803 is configured to calculate a contribution degree of the dimension combination, where the contribution degree is used to characterize a contribution size of the dimension combination that causes the target index to be abnormal; the dimension combination determination unit 805 is configured to locate a dimension combination affecting the target index abnormality according to the contribution degree.
Fig. 9 illustrates an electronic device according to an exemplary embodiment of the present application. An electronic device 200 according to this embodiment of the present application is described below with reference to fig. 9. The electronic device 200 shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 9, the electronic device 200 is in the form of a general purpose computing device. The components of the electronic device 200 may include, but are not limited to: at least one processing unit 210, at least one memory unit 220, a bus 230 connecting different system components (including the memory unit 220 and the processing unit 210), a display unit 240, and the like.
Wherein the storage unit stores program code that can be executed by the processing unit 210, so that the processing unit 210 executes the methods according to various exemplary embodiments of the present application described herein. For example, the processing unit 210 may perform a method as shown in fig. 1.
The storage unit 220 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 2201 and/or a cache memory unit 2202, and may further include a read only memory unit (ROM) 2203.
The storage unit 220 may also include a program/utility 2204 having a set (at least one) of program modules 2205, such program modules 2205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which or some combination thereof may comprise an implementation of a network environment.
Bus 230 may be any bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 200 may also communicate with one or more external devices 300 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 200, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 200 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 250. Also, the electronic device 200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 260. The network adapter 260 may communicate with other modules of the electronic device 200 via the bus 230. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, and may also be implemented by software in combination with necessary hardware. The technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, or a network device, etc.) to execute the above method according to the embodiments of the present application.
The software product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable storage medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable storage medium may be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The computer readable medium carries one or more programs which, when executed by a device, cause the computer readable medium to perform the functions described above.
Those skilled in the art will appreciate that the modules described above may be distributed in the apparatus according to the description of the embodiments, or may be modified accordingly in one or more apparatuses unique from the embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
According to an embodiment of the application, a computer program is proposed, comprising computer programs or instructions, which, when executed by a processor, may perform the above described method.
The foregoing embodiments have been described in detail to illustrate the principles and implementations of the present application, and the foregoing embodiments are only used to help understand the method and its core idea of the present application. Meanwhile, a person skilled in the art should, according to the idea of the present application, change or modify the embodiments and applications of the present application based on the scope of the present application. In view of the above, the description should not be taken as limiting the application.