CN119377038A

CN119377038A - Application monitoring method, device, equipment, storage medium and product

Info

Publication number: CN119377038A
Application number: CN202411334049.0A
Authority: CN
Inventors: 朱少先; 郑少忠; 陈川峰; 李治; 徐平; 蒋旭; 谢功琴; 丁亚雷; 易楚翘; 王立奇; 周传声; 吴涵哲; 王冬青; 刘云; 杨畯; 邓晓龙; 丛浩; 曾媛; 匡力; 吴陶
Original assignee: China Merchants Bank Co Ltd
Current assignee: China Merchants Bank Co Ltd
Priority date: 2024-09-24
Filing date: 2024-09-24
Publication date: 2025-01-28

Abstract

The present application discloses an application monitoring method, device, equipment, storage medium and product, which relate to the field of anomaly detection technology. The method generates a performance indicator baseline based on historical performance indicator data of a target business system; monitors the target business system to obtain real-time performance indicator data; generates an original alarm information set through a preset general anomaly detection algorithm and the real-time performance indicator data and the performance indicator baseline; iteratively optimizes the original alarm information set through a preset alarm enhancement strategy to obtain an alarm enhancement information set. Through the above scheme, by means of alarm enhancement, the accuracy of fault detection is improved without the need for a classification algorithm.

Description

Application monitoring method, device, equipment, storage medium and product

Technical Field

The present application relates to the field of anomaly detection technologies, and in particular, to an application monitoring method, apparatus, device, storage medium, and product.

Background

Application monitoring allows an operation and maintenance team to discover and handle potential problems in a timely manner, ensuring the continuous availability and stability of business applications. The quantity and quality of the alarms raised during application monitoring also affect the reliability of the actions taken after an anomaly is discovered. Therefore, to improve the quality and efficiency of emergency fault response, the accuracy of anomaly detection in application monitoring is particularly important.

Currently, anomaly detection in IT application monitoring is implemented in one of two ways, distinguished by the number of algorithms used. In the first, a single anomaly detection algorithm monitors a plurality of IT applications. This method trains the algorithm on the historical indexes of the IT application systems and outputs a model. While an IT application is running, all of its index data are input into the model in real time to obtain an anomaly judgment result, and monitoring and alarming of the IT application are performed according to that result. In the second, a plurality of anomaly detection algorithms monitor a plurality of IT applications. This method classifies the IT application systems by means of manual prior knowledge, classification algorithms, clustering algorithms and the like, and selects a different anomaly detection algorithm for each class. However, the existing methods have certain defects regardless of whether a single anomaly detection algorithm or a plurality of anomaly detection algorithms is used.

The existing method that uses a single anomaly detection algorithm cannot adapt to varied data forms, performs poorly on IT application systems with complex data characteristics or service characteristics, and has difficulty achieving good overall monitoring performance when the number of monitored IT applications is large. Moreover, once the algorithm is adjusted for the data characteristics of one IT application, its fault discovery performance on the data characteristics of other IT applications is easily affected. The existing methods that use multiple anomaly detection algorithms mostly rely on clustering, classification and manual prior knowledge to classify IT applications. The subsequent procedure depends excessively on the classification result and is subject to various limitations, and when the data form of an IT application changes, the change cannot be identified automatically, so fault discovery performance is poor.

The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present application and is not intended to represent an admission that the foregoing is prior art.

Disclosure of Invention

The application mainly aims to provide an application monitoring method, an application monitoring device, application monitoring equipment, a storage medium and a computer program product, and aims to improve fault discovery accuracy in an alarm enhancement mode.

In order to achieve the above object, the present application provides an application monitoring method, which includes:

generating a performance index baseline based on historical performance index data of a target service system;

monitoring the target service system to acquire real-time performance index data;

generating an original alarm information set through a preset general anomaly detection algorithm, the real-time performance index data and the performance index baseline;

and carrying out iterative optimization on the original alarm information set through a preset alarm enhancement strategy to obtain an alarm enhancement information set.

In an embodiment, the real-time performance index data includes a plurality of performance characteristic indexes of the target service system, and the step of generating the original alarm information set by a preset general anomaly detection algorithm and the real-time performance index data and the performance index baseline includes:

calculating index difference degrees between the performance characteristic indexes and the corresponding performance index baselines through a preset general anomaly detection algorithm;

judging whether the index difference degree exceeds a preset difference degree threshold;

if the index difference degree exceeds the difference degree threshold, determining the performance characteristic index as a target performance characteristic index;

and generating corresponding alarm information based on the target performance characteristic index to obtain an original alarm information set.

In an embodiment, the step of iteratively optimizing the original alarm information set by a preset alarm enhancement policy to obtain the alarm enhancement information set includes:

generating a supplementary monitoring object set according to a supplementary monitoring strategy based on the real-time performance index data;

performing anomaly detection on the real-time performance index data in the supplementary monitoring object set according to a preset anomaly detection optimization algorithm to obtain an alarm supplementary information set;

generating an enhanced monitoring object set according to the supplementary monitoring strategy based on the alarm supplementary information set;

and performing anomaly detection on the real-time performance index data in the enhanced monitoring object set according to a preset anomaly detection optimization algorithm to obtain alarm enhancement information.

In an embodiment, the step of generating a supplementary monitoring object set based on the real-time performance index data and the original alarm information set includes:

counting the number of false positives and the number of false negatives corresponding to each performance characteristic index;

determining the false positive rate and the false negative rate corresponding to each performance characteristic index according to the number of false positives and the number of false negatives;

determining, as a target optimization alarm index, any performance characteristic index for which any of the number of false positives, the number of false negatives, the false positive rate and the false negative rate exceeds a preset false alarm threshold;

and generating a supplementary monitoring object set based on the target optimization alarm index.
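The selection steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the present application: the names `AlarmStats` and `select_supplementary_objects`, the labeled-event tuple format, the rate definitions (false positives over alarms raised, false negatives over true anomalies) and the threshold values are all hypothetical, and only the two rates are compared against thresholds for brevity.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass
class AlarmStats:
    """Per-index counters built from labeled alarm history."""
    true_positives: int = 0
    false_positives: int = 0
    false_negatives: int = 0

    @property
    def false_positive_rate(self) -> float:
        # Assumed definition: share of raised alarms that were wrong.
        raised = self.true_positives + self.false_positives
        return self.false_positives / raised if raised else 0.0

    @property
    def false_negative_rate(self) -> float:
        # Assumed definition: share of true anomalies that were missed.
        actual = self.true_positives + self.false_negatives
        return self.false_negatives / actual if actual else 0.0


def select_supplementary_objects(
    labeled_events: Iterable[Tuple[str, bool, bool]],
    max_false_positive_rate: float = 0.2,  # hypothetical threshold
    max_false_negative_rate: float = 0.1,  # hypothetical threshold
) -> Set[str]:
    """Each event is (index_name, alarm_was_raised, was_truly_anomalous)."""
    stats: Dict[str, AlarmStats] = {}
    for index_name, alarmed, anomalous in labeled_events:
        entry = stats.setdefault(index_name, AlarmStats())
        if alarmed and anomalous:
            entry.true_positives += 1
        elif alarmed:
            entry.false_positives += 1
        elif anomalous:
            entry.false_negatives += 1
    # Indexes whose alarm quality is poor form the supplementary set.
    return {
        name
        for name, entry in stats.items()
        if entry.false_positive_rate > max_false_positive_rate
        or entry.false_negative_rate > max_false_negative_rate
    }
```

A threshold on the raw counts could be added to the final condition in the same way.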

In an embodiment, the anomaly detection optimization algorithm includes at least one of a time-series based anomaly detection algorithm, a machine-learning based anomaly detection algorithm, and a deep-learning based anomaly detection algorithm, and the training process of the anomaly detection optimization algorithm includes:

constructing a corresponding anomaly detection optimization basic model based on the type of the anomaly detection optimization algorithm;

performing off-line optimization on target parameters in the anomaly detection optimization basic model through the historical performance index data to obtain an anomaly detection optimization intermediate model;

and incrementally learning the data change trend of the real-time performance index data based on the anomaly detection optimization intermediate model to obtain the anomaly detection optimization algorithm.
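One way to picture the offline-then-incremental training described above is a detector whose parameters are first fitted on historical data and then updated one observation at a time. The sketch below uses Welford's online mean and variance update as a stand-in technique; the class name `IncrementalDetector`, the multiplier `k` and the choice of a mean and standard deviation model are assumptions for illustration and are not taken from the present application.

```python
import math
from typing import Iterable


class IncrementalDetector:
    """Mean/std model fitted offline, then updated online (Welford's method)."""

    def __init__(self, k: float = 3.0) -> None:
        self.k = k          # hypothetical width of the normal band, in std units
        self.n = 0          # number of observations seen so far
        self.mean = 0.0
        self._m2 = 0.0      # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        """Incremental step: fold one new observation into the model."""
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (value - self.mean)

    def fit_offline(self, history: Iterable[float]) -> "IncrementalDetector":
        """Offline step: fit the parameters on historical data."""
        for value in history:
            self.update(value)
        return self

    @property
    def std(self) -> float:
        return math.sqrt(self._m2 / self.n) if self.n else 0.0

    def is_anomalous(self, value: float) -> bool:
        return self.n > 1 and abs(value - self.mean) > self.k * self.std
```

Whether anomalous observations should also be fed to `update` is a design choice this sketch leaves to the caller.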

In an embodiment, the step of performing iterative optimization on the original alarm information set through a preset alarm enhancement policy to obtain an alarm enhancement information set further includes:

performing historical alarm simulation trial run based on the anomaly detection optimization algorithm and the historical performance index data to obtain trial run recall rate and trial run accuracy rate;

and if the trial run recall rate and the trial run accuracy rate meet preset conditions, introducing the anomaly detection optimization algorithm.
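The trial run above can be illustrated with a short replay over labeled historical data. This is a sketch under assumptions: the function names `trial_run` and `should_introduce` and the minimum values are hypothetical, and the "trial run accuracy rate" is interpreted here as precision (correct alarms over alarms raised), which the present application does not state explicitly.

```python
from typing import Callable, Sequence, Tuple


def trial_run(
    detector: Callable[[float], bool],
    values: Sequence[float],
    labels: Sequence[bool],
) -> Tuple[float, float]:
    """Replay historical values through a candidate detector.

    Returns (recall, precision) against the historical anomaly labels.
    """
    tp = fp = fn = 0
    for value, is_anomaly in zip(values, labels):
        predicted = detector(value)
        if predicted and is_anomaly:
            tp += 1
        elif predicted:
            fp += 1
        elif is_anomaly:
            fn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


def should_introduce(
    recall: float,
    precision: float,
    min_recall: float = 0.9,      # hypothetical preset condition
    min_precision: float = 0.8,   # hypothetical preset condition
) -> bool:
    """Introduce the candidate algorithm only if both conditions hold."""
    return recall >= min_recall and precision >= min_precision
```

For instance, a candidate that catches every labeled anomaly but raises one extra alarm in three would pass the recall condition and fail the precision condition under these example values.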

In addition, to achieve the above object, the present application also proposes an application monitoring apparatus including:

a generating module, used for generating a performance index baseline based on historical performance index data of a target service system;

a monitoring module, used for monitoring the target service system to acquire real-time performance index data;

an alarm module, used for generating an original alarm information set through a preset general anomaly detection algorithm, the real-time performance index data and the performance index baseline;

and the enhancement module is used for carrying out iterative optimization on the original alarm information set through a preset alarm enhancement strategy to obtain an alarm enhancement information set.

In addition, in order to achieve the above object, the application also proposes an application monitoring device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program being configured to implement the steps of the application monitoring method as described above.

Furthermore, to achieve the above object, the present application also proposes a storage medium, which is a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the steps of the application monitoring method as described above.

Furthermore, to achieve the above object, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of an application monitoring method as described above.

According to one or more technical schemes, a performance index baseline is generated based on historical performance index data of a target service system, the target service system is monitored to obtain real-time performance index data, an original alarm information set is generated through a preset general anomaly detection algorithm, the real-time performance index data and the performance index baseline, the original alarm information set is subjected to iterative optimization through a preset alarm enhancement strategy, and an alarm enhancement information set is obtained.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.

FIG. 1 is a schematic flow chart of a first embodiment of an application monitoring method of the present application;

FIG. 2 is a schematic flow chart of a second embodiment of an application monitoring method according to the present application;

FIG. 3 is a schematic flow chart of the alarm enhancement strategy in the application monitoring method of the present application;

FIG. 4 is a schematic diagram of the algorithm replacement logic in the application monitoring method of the present application;

FIG. 5 is a schematic diagram of an alternative flow of an anomaly detection optimization algorithm and an algorithm online evaluation flow in the application monitoring method of the present application;

FIG. 6 is a schematic diagram of the layered, decoupled service structure in the application monitoring method of the present application;

FIG. 7 is a schematic diagram of the data flow of the anomaly detection algorithm in the application monitoring method of the present application;

FIG. 8 is a schematic diagram of the technical framework and architecture of the anomaly detection algorithm in the application monitoring method of the present application;

FIG. 9 is a schematic block diagram of an application monitoring device according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of the device in the hardware operating environment involved in the application monitoring method in an embodiment of the present application.

The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the technical solution of the present application and are not intended to limit the present application.

For a better understanding of the technical solution of the present application, the following detailed description will be given with reference to the drawings and the specific embodiments.

The present solution mainly provides an anomaly detection algorithm and an alarm enhancement method to improve the accuracy and usability of operation and maintenance monitoring. It analyzes the multi-dimensional historical index data of an application system and, in combination with real-time alarm conditions, intelligently selects and optimizes the anomaly detection algorithm best suited to the current data characteristics, forming an efficient anomaly detection model. To adapt to changing business requirements, an algorithm library containing multiple algorithms is constructed, so that the product can flexibly handle monitoring tasks in different scenarios. In addition, the solution provides an algorithm selection and replacement strategy, so that high fault discovery accuracy can be maintained under various service scenarios and data characteristics.

To further improve usability, the solution also designs an anomaly detection algorithm platform that integrates key functions such as alarm labeling, model training, and algorithm configuration and replacement, making the use and management of the algorithms more intuitive and convenient. For alarm enhancement, the solution adopts an iterative optimization method that improves alarm accuracy by continuously adjusting and optimizing the algorithm according to the alarm results, similar to Boosting. Meanwhile, to support rapid iteration and updating of the algorithms, a general pluggable algorithm system framework is provided, which allows algorithm components to be flexibly replaced and updated across different monitoring objects.
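A pluggable algorithm framework of the kind described above is commonly built around a registry that maps algorithm names to interchangeable detector functions. The following is a hedged sketch only: the registry, the `register` decorator, the two example detectors and the `MonitoredObject` class are hypothetical names chosen for illustration, not the framework of the present application.

```python
import statistics
from typing import Callable, Dict, Sequence

# A detector takes the historical values and the latest value.
Detector = Callable[[Sequence[float], float], bool]

_REGISTRY: Dict[str, Detector] = {}


def register(name: str) -> Callable[[Detector], Detector]:
    """Decorator that adds a detector to the algorithm library."""
    def decorator(fn: Detector) -> Detector:
        _REGISTRY[name] = fn
        return fn
    return decorator


@register("three_sigma")
def three_sigma(history: Sequence[float], value: float) -> bool:
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return abs(value - mean) > 3 * std


@register("min_max")
def min_max(history: Sequence[float], value: float) -> bool:
    return value < min(history) or value > max(history)


class MonitoredObject:
    """One monitoring object with a replaceable detection algorithm."""

    def __init__(self, name: str, algorithm: str = "three_sigma") -> None:
        self.name = name
        self.replace_algorithm(algorithm)

    def replace_algorithm(self, algorithm: str) -> None:
        if algorithm not in _REGISTRY:
            raise KeyError(f"unknown algorithm: {algorithm}")
        self.algorithm = algorithm

    def check(self, history: Sequence[float], value: float) -> bool:
        return _REGISTRY[self.algorithm](history, value)
```

Because each monitoring object stores only the algorithm name, swapping the algorithm for one object does not affect any other object.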

Finally, to simplify configuration and replacement operations when the algorithms are in use, the solution also provides a series of tools for using the algorithms, which further improve the user-friendliness and operating efficiency of the overall monitoring system. Overall, through technical innovation, the solution provides a comprehensive, efficient and easily managed monitoring tool for the field of IT application monitoring, and aims to help operation and maintenance teams discover and handle potential system problems more effectively and ensure the continuous availability and stability of business applications.

In the prior art, it is difficult to adapt to diversified data patterns and complex business characteristics with a single anomaly detection algorithm, which results in poor performance on particular IT application systems. In addition, adjusting the algorithm to optimize the monitoring performance of one application may negatively affect other applications, making overall monitoring performance difficult to balance. The prior art also relies excessively on manual prior knowledge and classification algorithms to classify IT applications, which not only increases the complexity of the system but also limits its flexibility and adaptability. Meanwhile, existing monitoring systems are insufficiently able to recognize changes in data form and cannot respond to changes in system state in time, so fault discovery performance declines. In terms of usability, the prior art generally does not adequately consider the relationships among data storage, algorithm suitability, alarm labeling and algorithm configuration, so operation and maintenance teams face considerable difficulty in use. Finally, existing monitoring systems generally lack an effective mechanism for dynamically optimizing and updating algorithms, so it is difficult to adjust the algorithm dynamically according to real-time monitoring data and alarm results, which affects the accuracy and efficiency of monitoring. Together, these problems lead to the limitations and deficiencies of existing IT application monitoring technologies in practical operation and maintenance.

The application provides a solution, which is characterized by generating a performance index baseline based on historical performance index data of a target service system, monitoring the target service system to acquire real-time performance index data, generating an original alarm information set through a preset general anomaly detection algorithm and the real-time performance index data and the performance index baseline, and carrying out iterative optimization on the original alarm information set through a preset alarm enhancement strategy to obtain an alarm enhancement information set.

It should be noted that, the execution body of the embodiment may be a computing service device with functions of application monitoring, network communication and program running, such as a tablet computer, a personal computer, a mobile phone, or an electronic device, an application monitoring system, or the like capable of implementing the above functions. The present embodiment and the following embodiments will be described below by taking an application monitoring system as an example.

Based on this, an embodiment of the present application provides an application monitoring method, and referring to fig. 1, fig. 1 is a schematic flow chart of a first embodiment of the application monitoring method of the present application.

In this embodiment, the application monitoring method includes steps S1000 to S4000:

Step S1000, generating a performance index baseline based on historical performance index data of a target service system;

In this embodiment, IT application monitoring uses index data such as the request volume, response rate, success rate, average response time, application CPU usage and application memory usage of an application to monitor and analyze the running state and performance of the service application system in real time. Through IT application monitoring, the operation and maintenance team can discover and handle potential problems in time, ensuring the continuous availability and stability of the service application. Meanwhile, the quantity and quality of the alarms raised during IT application monitoring affect the reliability of the actions taken after an anomaly is discovered. Therefore, to improve the quality and efficiency of emergency fault response, the accuracy of anomaly detection in IT application monitoring is particularly important.

In this embodiment, the significance of IT application monitoring mainly includes the following: 1. Automated monitoring reduces the time cost of manual inspection and fault detection and improves operation and maintenance efficiency. 2. Potential problems are discovered and handled in time, ensuring the stable operation of the service application system and the continuity of the service. 3. Resource allocation is adjusted reasonably according to the analysis of the monitoring data, improving resource utilization and reducing cost.

To monitor millions or tens of millions of IT applications, an intelligent operation and maintenance anomaly detection algorithm system with high accuracy, simple configuration and strong timeliness is needed to achieve effective overall monitoring, so that limited emergency resources are focused on the anomalies that really require attention and emergency response efficiency is improved.

In this embodiment, generating a performance index baseline based on the historical performance index data of the target service system is the first step in constructing an anomaly detection model. A performance index baseline generally refers to statistical features of a performance index calculated from historical data, such as the mean, median, mode and variance. The purpose of this step is to establish a "normal" reference standard of performance, so that abnormal behavior that differs significantly from this standard can be identified later.

Additionally, it should be noted that the process of generating the performance index baseline may be implemented by a variety of statistical methods, such as moving average, exponential smoothing, and the like. The construction of the baseline requires consideration of the time span of the historical data and the stability of the data to ensure the accuracy and reliability of the baseline.
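As a hedged illustration of one of the statistical methods named above, the following sketch computes an exponentially smoothed baseline. The function name `exponential_smoothing_baseline` and the smoothing factor `alpha` are assumptions for illustration and do not come from the present application.

```python
from typing import List, Sequence


def exponential_smoothing_baseline(
    history: Sequence[float], alpha: float = 0.3
) -> List[float]:
    """Return the smoothed series s_t = alpha * x_t + (1 - alpha) * s_{t-1}.

    A larger alpha follows recent data more closely; a smaller alpha
    produces a steadier baseline.
    """
    if not history:
        raise ValueError("history must not be empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [float(history[0])]  # seed with the first observation
    for value in history[1:]:
        smoothed.append(alpha * value + (1.0 - alpha) * smoothed[-1])
    return smoothed
```

The last element of the returned list can serve as the current baseline value for the index.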

For example, in one embodiment, the system first collects performance index data of the business system over a period of time, such as CPU usage, memory usage, transaction response time, etc. Then, the system calculates the mean value, standard deviation, etc. of the indexes by adopting a statistical analysis method, thereby forming a performance index baseline. These baselines will serve as reference criteria for subsequent anomaly detection.
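The baseline construction in this example can be sketched with Python's standard `statistics` module. The `Baseline` record and the `build_baselines` function are hypothetical names, and the use of the population standard deviation is an assumption made for this sketch.

```python
import statistics
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Baseline:
    """Summary statistics of one performance index over its history."""
    mean: float
    std: float
    minimum: float
    maximum: float


def build_baselines(history: Dict[str, Sequence[float]]) -> Dict[str, Baseline]:
    """Compute one baseline per index name from its historical values."""
    baselines: Dict[str, Baseline] = {}
    for name, values in history.items():
        if len(values) < 2:
            raise ValueError(f"not enough history for index {name!r}")
        baselines[name] = Baseline(
            mean=statistics.fmean(values),
            std=statistics.pstdev(values),  # population standard deviation
            minimum=min(values),
            maximum=max(values),
        )
    return baselines
```

For example, `build_baselines({"cpu": [40, 50, 60]})` yields a baseline for `cpu` with a mean of 50.0 and a range of 40 to 60.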

Step S2000, monitoring the target service system to acquire real-time performance index data;

It should be noted that monitoring the target service system to acquire real-time performance index data is an essential part of the anomaly detection process. Real-time performance index data refers to performance data generated while the system is running, currently or in the recent past, such as the current CPU load, memory usage and network traffic. The purpose of this step is to capture the latest state of the system and discover possible anomalies in time.

Additionally, it should be noted that the acquisition of real-time performance index data may be implemented by various monitoring tools and probes, which may be deployed on various key nodes of the business system, to collect and transmit data in real-time. The acquired data needs to be high in real-time performance and good in accuracy so as to truly reflect the performance condition of the system.

For example, in one particular embodiment, the system deploys multiple monitoring probes on key components of the business system, such as database servers, application servers, etc. The probes collect performance indexes such as CPU utilization rate, memory utilization amount, response time and the like in real time, and send data to a monitoring center for analysis.
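The probe arrangement in this example might look like the following minimal sketch, in which each probe reads its metrics and pushes samples to a shared queue standing in for the monitoring center. The names `MetricSample` and `Probe`, the reader callables and the use of an in-process queue are all assumptions for illustration; a real deployment would read actual system counters and send the data over a network.

```python
import queue
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class MetricSample:
    source: str       # component the probe is deployed on
    metric: str       # performance index name
    value: float
    timestamp: float  # seconds since the epoch


class Probe:
    """Collects metrics from one component and forwards them to a sink."""

    def __init__(
        self,
        source: str,
        readers: Dict[str, Callable[[], float]],
        sink: "queue.Queue[MetricSample]",
    ) -> None:
        self.source = source
        self.readers = readers  # metric name -> function returning its value
        self.sink = sink        # stands in for the monitoring center

    def collect_once(self) -> int:
        """Read every metric once and return the number of samples sent."""
        now = time.time()
        for metric, read in self.readers.items():
            self.sink.put(MetricSample(self.source, metric, float(read()), now))
        return len(self.readers)
```

Calling `collect_once` on a timer gives the periodic, near-real-time collection that the example describes.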

In a possible implementation manner, the real-time performance index data includes a plurality of performance characteristic indexes of the target service system, and step S2000 may include steps S2100 to S2400:

Step S2100, calculating index difference degrees between the performance characteristic indexes and the corresponding performance index baselines through a preset general anomaly detection algorithm;

Step S2200, judging whether the index difference degree exceeds a preset difference degree threshold;

Step S2300, determining the performance characteristic index whose index difference degree exceeds the difference degree threshold as a target performance characteristic index;

Step S2400, generating corresponding alarm information based on the target performance characteristic index to obtain an original alarm information set.

It should be noted that, the real-time performance index data is key information reflecting the current running state of the target service system, and generally covers the performance of the system in different dimensions, such as CPU utilization, memory utilization, network traffic, response time, and the like. These indicators are the basis for monitoring system performance and stability, and potential problems and anomalies can be found in time through real-time tracking and analysis of the indicators.

In the monitoring process, the system uses a preset general anomaly detection algorithm to calculate the degree of difference between each performance characteristic index and the pre-established performance index baseline. The performance index baseline is a normal operating range derived from historical data, and the index difference degree reflects how far the real-time data deviates from that normal range. By setting a difference degree threshold, the system can determine when the deviation exceeds the normal fluctuation range and thereby identify a possible anomaly.

When the index difference degree exceeds the threshold, the corresponding performance characteristic index is determined as a target performance characteristic index, that is, an index exhibiting abnormal behavior. Based on the target performance characteristic indexes, the system generates corresponding alarm information to construct the original alarm information set. Each alarm in the set records in detail key information such as the time and location of the anomaly and the performance index involved, providing an important basis for subsequent analysis and processing.
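The difference calculation and alarm generation described above can be sketched as follows. This is an illustration under assumptions: the difference degree is taken to be the absolute deviation from the baseline mean in units of standard deviation, the threshold of 3.0 is an example value, and the `Alarm` record and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Alarm:
    timestamp: float   # when the anomaly occurred
    source: str        # where it occurred
    metric: str        # which performance index is involved
    value: float
    difference: float  # index difference degree that triggered the alarm


def difference_degree(value: float, mean: float, std: float) -> float:
    """Assumed measure: deviation from the baseline mean in std units."""
    if std == 0:
        return 0.0 if value == mean else float("inf")
    return abs(value - mean) / std


def generate_original_alarms(
    sample: Dict[str, float],
    baselines: Dict[str, Tuple[float, float]],  # metric -> (mean, std)
    timestamp: float,
    source: str,
    threshold: float = 3.0,  # hypothetical difference degree threshold
) -> List[Alarm]:
    """Return one alarm per index whose difference degree exceeds the threshold."""
    alarms: List[Alarm] = []
    for metric, value in sample.items():
        if metric not in baselines:
            continue  # no baseline yet, so nothing to compare against
        mean, std = baselines[metric]
        diff = difference_degree(value, mean, std)
        if diff > threshold:
            alarms.append(Alarm(timestamp, source, metric, value, diff))
    return alarms
```

With a CPU baseline of mean 50 and standard deviation 5, a reading of 90 has a difference degree of 8 and raises an alarm, while a latency reading half a standard deviation above its mean does not.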

Additionally, it should be noted that the alarm enhancement policy is introduced to further improve the accuracy and reliability of the alarm information. Through iterative optimization, the system can reduce false alarm and missing alarm, and improve the correlation and accuracy of the alarm. This strategy involves further analysis and processing of the original set of alert information including, but not limited to, validation, merging, prioritization, etc. of the alerts.

For example, in one possible implementation, the system first collects historical performance data for the target business system and generates a performance index baseline based on the data. Then, the system monitors the performance index of the business system in real time, and once the index which has significant difference from the base line is found, the difference degree is calculated and whether the threshold value is exceeded is judged. For indexes exceeding the threshold, the system generates alarm information, and applies an alarm enhancement strategy for further analysis and optimization according to the content and the context information of the alarm. In this way, the system not only can discover anomalies in time, but also can provide more accurate and useful alarm information to help the operation and maintenance team to handle potential problems more effectively.

Step S3000, generating an original alarm information set through a preset general anomaly detection algorithm, the real-time performance index data and the performance index baseline;

It should be noted that generating the original alarm information set through the preset general anomaly detection algorithm, the real-time performance index data and the performance index baseline is the core of the anomaly detection process. A general anomaly detection algorithm is an algorithm that can be applied to various business systems and scenarios, such as a statistical process control algorithm or a clustering algorithm. The purpose of this step is to use the algorithm to compare the real-time data with the baseline and quickly identify possible anomalies.

Additionally, it should be noted that, the algorithm may determine whether the real-time data deviates from the normal range according to a preset rule or threshold, so as to generate the alarm information. The alarm information generally comprises key information such as time, position, type and the like of occurrence of the abnormality, and provides basis for subsequent analysis and processing.

Specifically, in this embodiment, when prior knowledge is lacking, the method can rely only on posterior conditions to iterate rapidly and find a suitable algorithm. Starting from the general algorithm, the method continuously selects a more appropriate algorithm according to the actual anomaly detection effect, so as to achieve the desired alarm effect. That is, based on the posterior condition (the final monitoring effect), the method continuously enhances the accuracy of the alarm results in a Boosting-like manner and iterates toward a suitable algorithm, thereby achieving the purpose of 'a priori classification'. All monitoring objects can generate alarms using the general anomaly detection algorithm. Using a statistics-based method, this algorithm calculates statistical characteristics such as the mean, variance, maximum and minimum of each index from its historical data to form a baseline value, and sends an alarm when the real-time index value of the application is lower than the baseline value. The algorithm can quickly bring a large number of applications under monitoring, but its performance on some monitored objects is not ideal: false alarms or missed alarms occur, which reduces the monitoring performance for those applications.

For example, in one embodiment, the system employs a statistical-based general anomaly detection algorithm that calculates the deviation of the real-time performance metric data from the baseline and sets a threshold. When the deviation exceeds a threshold, the system considers that the abnormality is detected and generates corresponding alarm information.

Step S4000, carrying out iterative optimization on the original alarm information set through a preset alarm enhancement strategy to obtain the alarm enhancement information set.

It should be noted that, the original alarm information set is iteratively optimized by the preset alarm enhancement strategy, so as to improve the accuracy of the alarm and reduce false alarm or missing alarm. The alert enhancement policies may include further analysis, filtering, and validation of the original alert, as well as pattern recognition based on historical alert data and real-time performance data.

Additionally, it should be noted that the alarm enhancement strategy may be implemented by machine learning, deep learning and similar techniques, which can learn and optimize alarm rules from a large amount of historical data so as to improve the relevance and accuracy of the alarms.

For example, in one embodiment, the system first performs a preliminary screening of the original alarm information set to exclude alarms that are frequently false. Then, the system analyzes the remaining alarms using a machine learning model, and classifies and prioritizes them according to the historical pattern of the alarms and the characteristics of the real-time data, thereby generating a more accurate and reliable alarm enhancement information set.

In one possible implementation, the step S4000 may include steps S4100 to S4400:

Step S4100, generating a supplementary monitoring object set according to a supplementary monitoring strategy based on the real-time performance index data;

Step S4200, performing anomaly detection on the real-time performance index data in the supplementary monitoring object set according to a preset anomaly detection optimization algorithm to obtain an alarm supplementary information set;

Step S4300, generating an enhanced monitoring object set according to the supplementary monitoring strategy based on the alarm supplementary information set;

Step S4400, performing anomaly detection on the real-time performance index data in the enhanced monitoring object set according to a preset anomaly detection optimization algorithm to obtain alarm enhancement information.

It should be noted that, the alarm enhancement strategy is one of the core links of the present invention, and aims to improve the accuracy and reliability of the alarm through the iterative optimization process. The strategy generates a supplementary monitoring object set by analyzing the real-time performance index data and combining with a preset monitoring rule, and carries out deep abnormality detection on the objects so as to identify and supplement possible missed abnormal conditions. The anomaly detection optimization algorithm plays a key role in the process, and can select the most suitable algorithm model for real-time analysis and prediction based on different types of data and scene requirements.

Additionally, it should be noted that the alarm enhancement process includes not only analysis of real-time data, but also learning and simulation of historical alarm data. Through simulation trial run, the system can evaluate the performance of the new algorithm on historical data and ensure the effectiveness of the new algorithm in practical application. The process is helpful for reducing false alarms and missing alarms, and improving the practicability and accuracy of the alarms.

For example, in one possible implementation, the system first determines a set of monitoring objects that require additional attention based on the real-time performance metric data and a preset supplemental monitoring strategy. Then, the system uses an anomaly detection optimization algorithm, such as a method based on time series analysis, machine learning or deep learning, to conduct deep analysis on the real-time data of the objects, identify potential anomaly patterns, and generate an alarm supplemental information set. And then, the system further screens and confirms the enhanced monitoring object set according to the alarm supplementary information set and the supplementary monitoring strategy, and re-analyzes the objects to obtain final alarm enhanced information. In the process, the system can also perform historical alarm simulation trial run so as to verify the effectiveness of the new algorithm and ensure that the new algorithm can meet the preset performance requirements. Through the series of iterative optimization steps, the system can generate an alarm enhancement information set with higher accuracy and richer information, and more reliable decision support is provided for operation and maintenance teams.

Specifically, the anomaly detection optimization algorithm in this embodiment may mainly include three types. The first is the time-series-based anomaly detection algorithm, which is generally used to process data points arranged in time order. Such algorithms have simple models, evident effects, fast training and strong interpretability. Common algorithms include Holt-Winters, ARIMA and Prophet, which take into account the trend, seasonality and other patterns of how data changes over time. For example, the autoregressive integrated moving average (ARIMA) model is a commonly used time series prediction method that predicts future data points by fitting historical data and identifies outliers that deviate significantly from the predicted values. In addition, the exponential smoothing state space model (ETS) is also an effective method for processing data with trend and seasonality.
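The idea of flagging points that deviate significantly from a time-series forecast can be illustrated with a deliberately simplified sketch. It uses simple exponential smoothing as a stand-in forecaster, which, unlike the models named above, ignores trend and seasonality. The function name and the parameters `alpha`, `k` and `warmup` are assumptions for illustration, not values from the application.

```python
def ses_residual_anomalies(series, alpha=0.3, k=3.0, warmup=5):
    """Flag indices whose one-step-ahead forecast error is unusually large.

    The forecast for each point is the exponentially smoothed level of the
    points before it. A point is flagged when its absolute error exceeds
    k times the mean absolute error observed so far.
    """
    level = series[0]   # smoothed level, used as the forecast for the next point
    errors = []         # absolute forecast errors of points accepted as normal
    anomalies = []
    for t in range(1, len(series)):
        error = abs(series[t] - level)
        if len(errors) >= warmup:
            typical = sum(errors) / len(errors)
            if error > k * max(typical, 1e-9):
                anomalies.append(t)
                continue  # keep the outlier out of the level and error history
        errors.append(error)
        level = alpha * series[t] + (1 - alpha) * level
    return anomalies


# Hypothetical index values with one spike at position 7.
series = [10, 11, 10, 12, 11, 10, 11, 40, 11, 10]
flagged = ses_residual_anomalies(series)
```

In practice a model with explicit trend and seasonal components would take the place of the smoothed level, while the residual check could stay the same.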

The second is the machine-learning-based anomaly detection algorithm, which identifies anomalies by learning patterns from historical data. These algorithms may be supervised or unsupervised. Such algorithms work well on complex data and retain some interpretability. Common algorithms include SVM and DBSCAN. Unsupervised learning algorithms such as K-means clustering and principal component analysis (PCA) are often used for anomaly detection because they can identify anomalous patterns in the data without explicit labels. For example, K-means clustering divides data points into clusters, and points that are far from the center of every cluster may be outliers. Another popular unsupervised machine learning algorithm is the isolation forest, which "isolates" observations by randomly selecting a feature and a split point; anomalies are usually easier to isolate.

The third is the deep-learning-based anomaly detection algorithm. Deep learning algorithms, and neural networks in particular, can learn high-level feature representations from complex data structures. Such algorithms achieve accurate results on complex data, but have high training cost and poor interpretability. Common algorithms include the long short-term memory network (LSTM), a special kind of recurrent neural network (RNN) that can handle sequence dependence in time series and make predictions, which makes it well suited to anomaly detection. An LSTM recognizes normal patterns by learning long-term dependencies and can highlight behavior that violates these patterns. In addition, an autoencoder learns a compressed representation of data through a neural network and can be used to identify anomalies in the data, because anomalies generally cannot be compressed effectively.

The embodiment provides an application monitoring method, which comprises the steps of generating a performance index baseline based on historical performance index data of a target service system, monitoring the target service system to obtain real-time performance index data, generating an original alarm information set through a preset general anomaly detection algorithm, the real-time performance index data and the performance index baseline, and carrying out iterative optimization on the original alarm information set through a preset alarm enhancement strategy to obtain an alarm enhancement information set.

In the second embodiment of the present application, for content that is the same as or similar to the first embodiment, reference may be made to the above description, which is not repeated here. On this basis, referring to fig. 2, step S4100 of the application monitoring method further includes steps S4110 to S4140:

Step S4110, counting the number of false positives and the number of false negatives corresponding to each of the performance characteristic indexes;

Step S4120, determining the false alarm rate and the missing report rate corresponding to the performance characteristic indexes according to the false alarm times and the missing report times;

Step S4130, determining, as a target optimized alarm index, each performance characteristic index for which any one of the false alarm times, the missing report times, the false alarm rate and the missing report rate exceeds a preset false alarm threshold;

Step S4140, generating a supplementary monitoring object set based on the target optimized alarm index.

It should be noted that counting the number of false alarms and the number of missed alarms corresponding to each of the performance characteristic indexes is one of the key steps of the alarm enhancement strategy. A false alarm is a normal event wrongly marked as abnormal, while a missed alarm is a real abnormal event that was not detected. By counting these events precisely, the performance of the monitoring system on specific performance indexes can be quantified, which provides a basis for subsequent optimization.

Referring to fig. 3, fig. 3 is a flow chart of the alarm enhancement strategy of the application monitoring method of the present application. All monitoring objects first generate alarms using the general anomaly detection algorithm. Using the historical data of each index and a statistics-based method, this algorithm calculates statistical characteristics of the index, such as its mean, variance, maximum and minimum, and forms a baseline value. When the real-time index value of the application is lower than the baseline value, an alarm is issued. The algorithm makes it possible to monitor a large number of applications quickly, but its performance on some monitored objects is not ideal: false alarms or missed alarms occur, which degrades application monitoring performance.

Based on the false alarm and missed alarm results of the general anomaly detection algorithm, the number of false alarms and the number of missed alarms are counted, and the false alarm rate and the missed alarm rate are calculated. Upper limits for these four indicators are set according to service requirements, and these limits are used to screen out a set A of monitoring objects on which the general anomaly detection algorithm performs poorly. A new anomaly detection algorithm is then applied to these objects to obtain alarm results.
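The screening step above can be sketched as follows. The class and function names, the limit values, and the two rate definitions (false alarms over alarms raised, missed alarms over real anomalies) are assumptions chosen for illustration; the application does not fix them.

```python
from dataclasses import dataclass


@dataclass
class AlarmStats:
    true_alarms: int    # alarms that matched a real anomaly
    false_alarms: int   # alarms raised on normal behaviour
    missed_alarms: int  # real anomalies for which no alarm was raised

    @property
    def false_alarm_rate(self):
        raised = self.true_alarms + self.false_alarms
        return self.false_alarms / raised if raised else 0.0

    @property
    def missed_alarm_rate(self):
        actual = self.true_alarms + self.missed_alarms
        return self.missed_alarms / actual if actual else 0.0


@dataclass
class Limits:
    # Assumed upper limits; in the method they are set per service requirements.
    max_false_alarms: int = 10
    max_missed_alarms: int = 3
    max_false_alarm_rate: float = 0.3
    max_missed_alarm_rate: float = 0.2


def exceeds_any_limit(stats, limits):
    """True if any one of the four indicators exceeds its upper limit."""
    return (
        stats.false_alarms > limits.max_false_alarms
        or stats.missed_alarms > limits.max_missed_alarms
        or stats.false_alarm_rate > limits.max_false_alarm_rate
        or stats.missed_alarm_rate > limits.max_missed_alarm_rate
    )


def screen(stats_by_object, limits):
    """Return the set of monitoring objects that need a new algorithm."""
    return {name for name, s in stats_by_object.items() if exceeds_any_limit(s, limits)}


# Hypothetical feedback gathered for three indexes under the general algorithm.
stats = {
    "cpu_usage": AlarmStats(true_alarms=40, false_alarms=2, missed_alarms=1),
    "response_time": AlarmStats(true_alarms=10, false_alarms=12, missed_alarms=0),
    "memory_usage": AlarmStats(true_alarms=5, false_alarms=1, missed_alarms=4),
}
set_a = screen(stats, Limits())
```

Here `response_time` is selected because of its false alarm count and `memory_usage` because of its missed alarm count, while `cpu_usage` stays with the general algorithm.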

Similarly, the new anomaly detection algorithm cannot be perfectly suited to every monitored object in set A. Therefore, based on its false alarm and missed alarm results, set A is screened in the same manner using the number of false alarms, the number of missed alarms, the false alarm rate and the missed alarm rate, and a further new anomaly detection algorithm B is applied to the screened objects to obtain alarm results.

These steps are repeated continuously. In each repetition, the new anomaly detection algorithm B is more complex than anomaly detection algorithm A, the algorithm it follows, can mine deeper characteristics of the application indexes, and achieves better fitting and anomaly detection.

Finally, all alarms are integrated to obtain an approximately complete and accurate alarm set. Such alarms give an accurate prompt when the application state is abnormal, accompanied by alarm information including the abnormal index, the time of the abnormality, a brief description and the like.
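The repeated screen-and-replace loop, followed by integration of all alarms, can be sketched as a cascade. This is an illustrative sketch only: `run_cascade`, the `performs_poorly` callback, and the toy lookup-table detectors are hypothetical stand-ins for real algorithms and real posterior feedback.

```python
def run_cascade(objects, detectors, performs_poorly):
    """Apply detectors stage by stage, from simplest to most complex.

    objects:         names of the monitoring objects
    detectors:       ordered list of (algorithm_name, detect_fn) pairs
    performs_poorly: callback(obj, alarms) -> bool based on posterior feedback

    An object stays with the first algorithm that performs acceptably on it;
    the last stage keeps whatever objects remain.
    """
    remaining = list(objects)
    final_alarms, assigned = {}, {}
    for stage, (algo_name, detect_fn) in enumerate(detectors):
        is_last = stage == len(detectors) - 1
        next_remaining = []
        for obj in remaining:
            alarms = detect_fn(obj)
            if not is_last and performs_poorly(obj, alarms):
                next_remaining.append(obj)  # screened into the next stage's set
            else:
                final_alarms[obj] = alarms
                assigned[obj] = algo_name
        remaining = next_remaining
    return final_alarms, assigned


# Toy data: the time steps at which each application was really abnormal,
# and what each (hypothetical) algorithm reports for the objects it sees.
truth = {"app_a": {3}, "app_b": {5}, "app_c": {7}}
general = {"app_a": {3}, "app_b": {1, 2, 5}, "app_c": set()}
time_series = {"app_b": {5}, "app_c": {2}}
machine_learning = {"app_c": {7}}


def performs_poorly(obj, alarms):
    # Stand-in for the false alarm / missed alarm screening.
    return alarms != truth[obj]


detectors = [
    ("general", general.get),
    ("time_series", time_series.get),
    ("machine_learning", machine_learning.get),
]
final_alarms, assigned = run_cascade(truth, detectors, performs_poorly)
```

Note that the algorithms are not combined into one model; only the per-object alarm results are merged into the final set.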

In a possible implementation manner, the anomaly detection optimization algorithm includes at least one of an anomaly detection algorithm based on a time sequence, an anomaly detection algorithm based on machine learning and an anomaly detection algorithm based on deep learning, and a training process of the anomaly detection optimization algorithm includes steps A1000-A3000:

Step A1000, constructing a corresponding anomaly detection optimization basic model based on the type of the anomaly detection optimization algorithm;

Step A2000, performing offline optimization on target parameters in the anomaly detection optimization basic model through the historical performance index data to obtain an anomaly detection optimization intermediate model;

Step A3000, incrementally learning the data change trend of the real-time performance index data based on the anomaly detection optimization intermediate model to obtain the anomaly detection optimization algorithm.

It should be noted that the anomaly detection optimization algorithm is a key technology used in the present application to improve the monitoring accuracy of IT applications. The algorithm constructs a multi-level, multi-dimensional anomaly detection system by comprehensively using techniques such as time series analysis, machine learning and deep learning. Referring to fig. 4, fig. 4 is a schematic diagram of the algorithm replacement logic of the application monitoring method of the present application. Unlike the traditional Boosting algorithm, this embodiment does not integrate all algorithms into a single enhanced model; instead, it enhances the final alarm result, thereby improving monitoring accuracy.

All monitoring objects are filtered by algorithms at different stages. Because there is a certain tolerance for algorithm effect, a progressive strategy can be adopted at the algorithm selection level. In the first stage, a general default algorithm is used for monitoring. In the second stage, a time-series-based anomaly detection algorithm is used; its model is simple, its training is fast, it is highly interpretable, and its effect on complex data is evident. In the third stage, machine-learning-based algorithms are used for anomaly detection; they work well on complex data and still retain some interpretability. In the last stage, deep-learning-based algorithms are used; their performance is excellent, but they require a large amount of training resources and are poorly interpretable.

In actual application, the algorithm mainly involves three links: offline model training, model update training, and online real-time anomaly detection.

The offline model training part mainly uses long-term historical application index data to tune the parameters of the model. This process may use parameter optimization methods such as gradient descent or the Adam optimizer to search for parameters, and finally yields a tuned algorithm model. The model has a certain capacity to fit the application index data and can output an anomaly detection result when real-time index data is input.

The pattern of the application index data changes over time, whereas the model obtained by offline training is fitted to historical data. When the application index data changes, the model's ability to fit it declines, and anomaly detection performance declines with it. At this point, the model is corrected by incrementally learning the trend changes of the new data on top of the initialized model, so as to continuously improve the generalization ability of the model. Specifically, starting from the model parameters obtained by offline training, new incremental data is used to adjust the model parameters incrementally; this adjustment may likewise use parameter optimization methods such as gradient descent or the Adam optimizer. After tuning, the algorithm model with the adjusted parameters is output.

In the online anomaly detection process, real-time application index data is input into the tuned algorithm model, and the output obtained through calculation is the anomaly detection result.
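The three links can be illustrated with a toy model that has a single trainable parameter. This is a sketch under strong simplifying assumptions: the `LevelModel` class, its learning rate, epoch counts and threshold multiplier are all hypothetical, and a real model would have many more parameters. It only shows how offline gradient descent, incremental updating from the offline parameters, and online detection fit together.

```python
class LevelModel:
    """Toy model whose only trainable parameter is the expected level of an index."""

    def __init__(self, learning_rate=0.1):
        self.mu = 0.0        # trainable parameter: expected level
        self.scale = 1.0     # typical absolute residual, used for the threshold
        self.learning_rate = learning_rate

    def _gradient_step(self, batch):
        # Gradient of the mean squared error (1/n) * sum((x - mu)^2) w.r.t. mu.
        grad = -2.0 * sum(x - self.mu for x in batch) / len(batch)
        self.mu -= self.learning_rate * grad

    def _refresh_scale(self, batch):
        # Mean absolute residual; the floor avoids a zero-width threshold.
        self.scale = max(sum(abs(x - self.mu) for x in batch) / len(batch), 1e-6)

    def fit_offline(self, history, epochs=200):
        """Link 1: tune the parameter on long-term historical data."""
        for _ in range(epochs):
            self._gradient_step(history)
        self._refresh_scale(history)
        return self

    def update_incremental(self, new_data, epochs=20):
        """Link 2: continue from the offline parameter using only new data."""
        for _ in range(epochs):
            self._gradient_step(new_data)
        self._refresh_scale(new_data)
        return self

    def is_anomalous(self, value, k=4.0):
        """Link 3: online detection against the current parameters."""
        return abs(value - self.mu) > k * self.scale


# Hypothetical data: the index hovers around 50, then drifts to around 60.
model = LevelModel().fit_offline([50, 52, 48, 51, 49] * 4)
before = model.is_anomalous(58)   # judged against the old level
model.update_incremental([60, 62, 58, 61, 59])
after = model.is_anomalous(58)    # judged against the updated level
```

The value 58 is abnormal relative to the offline model but normal after the incremental update, which is the drift-adaptation effect the update link is meant to provide.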

In this embodiment, the training process of the anomaly detection optimization algorithm begins with constructing a base model, and selects an appropriate mathematical model and initial parameters according to the characteristics of the algorithm. For example, for time series based algorithms, appropriate statistical models and parameters are selected, and for machine learning or deep learning based algorithms, appropriate network structures and initialization weights may be selected.

Next, the system uses the historical performance index data to perform offline tuning of the base model. The process reveals the behavior mode of the business system in normal and abnormal states through historical data, so that the algorithm model can more accurately identify and predict future abnormal events. Off-line tuning typically employs supervised or unsupervised learning methods to adjust model parameters by optimization algorithms such as gradient descent, etc., to minimize prediction errors.

And then, the system further perfects and optimizes an algorithm model by incrementally learning the data change trend of the real-time performance index data based on the intermediate model after the optimization, so that the algorithm can adapt to the change of the performance index of the service system along with time, and the timeliness and the accuracy of anomaly detection are maintained.

Finally, the performance of the algorithm model is evaluated through simulation trial runs and actual application, including calculating the recall rate and accuracy rate of the trial run stage to verify the effectiveness of the algorithm. Only when the trial run result meets the preset condition is the algorithm formally introduced into the monitoring system, to ensure its effectiveness and reliability in practical application.

For example, in one possible implementation, the system first collects historical performance data for the target business system, including metrics such as CPU usage, memory usage and transaction response time. Then, based on these data, the system constructs an ARIMA model based on time series, a random forest model based on machine learning, and an LSTM model based on deep learning. Through offline tuning, the system determines the optimal parameter configuration for these models. The system then performs incremental learning on these models using the real-time data to accommodate the latest changes in the performance metrics. In the simulation trial run stage, the system evaluates the recall rate and accuracy rate of the models, and after confirming that they meet the preset performance standard, formally applies them to the real-time monitoring of the business system, so that abnormal conditions of the business system are detected efficiently and accurately.

In one possible implementation, the step S4000 is preceded by steps B1000-B2000:

Step B1000, performing historical alarm simulation trial run based on the anomaly detection optimization algorithm and the historical performance index data to obtain trial run recall rate and trial run accuracy rate;

Step B2000, if the trial run recall rate and the trial run accuracy rate meet preset conditions, introducing the anomaly detection optimization algorithm.

It should be noted that the introduction and evaluation of the anomaly detection optimization algorithm is an important link in ensuring the reliability of the alarm system. Before these algorithms are actually applied, their validity must be verified through a historical alarm simulation trial run. Referring to fig. 5, fig. 5 is a schematic diagram of the replacement process and online evaluation process of the anomaly detection optimization algorithm in the application monitoring method of the present application. In this process, the system uses historical performance index data to simulate how the algorithm would operate in production, so as to evaluate its performance on historical data; this is an indispensable part of the algorithm optimization process.

Additionally, it should be noted that the trial recall and the trial accuracy are key indicators for measuring the performance of the algorithm. Recall refers to the ratio of the abnormal event successfully identified by the algorithm to all actual abnormal events, while accuracy refers to the ratio of the abnormal event correctly predicted by the algorithm to all abnormal events predicted by the algorithm. Only if both of these criteria reach the preset conditions, the algorithm is considered to be sufficiently robust and can be introduced into a real-time monitoring system for use.
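The two trial run indicators and the introduction check of step B2000 can be sketched as follows. The accuracy defined above (correct predictions over all predicted anomalies) corresponds to what is commonly called precision. The function names and the minimum values of 0.8 are assumptions for illustration; the application only states that preset conditions must be met.

```python
def trial_run_metrics(predicted, actual):
    """Return (recall, accuracy) for one trial run.

    predicted: identifiers of events the algorithm flagged as abnormal
    actual:    identifiers of events labelled abnormal in the historical data
    """
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    # Convention assumed here: an empty denominator counts as a perfect score.
    recall = hits / len(actual) if actual else 1.0
    accuracy = hits / len(predicted) if predicted else 1.0
    return recall, accuracy


def may_introduce(recall, accuracy, min_recall=0.8, min_accuracy=0.8):
    """Introduce the algorithm only if both indicators meet the preset conditions."""
    return recall >= min_recall and accuracy >= min_accuracy


# Hypothetical trial run: 5 labelled anomalies, 5 predictions, 4 in common.
actual = {3, 8, 15, 21, 30}
predicted = {3, 8, 15, 21, 27}
recall, accuracy = trial_run_metrics(predicted, actual)
accepted = may_introduce(recall, accuracy)
```

In this example both indicators equal 0.8, so the algorithm just meets the assumed conditions.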

For example, in one possible implementation, the system first performs a simulation run based on historical performance index data and an anomaly detection optimization algorithm. In the process, the system builds one or more basic models, and adjusts model parameters through historical data to form an anomaly detection optimization intermediate model. The system will then use these intermediate models to perform simulation tests on the historical alert data, calculating the trial recall and the trial accuracy. If the indexes meet the preset performance standard, the algorithm has higher accuracy and reliability on the historical data, and the system monitors the real-time performance index data by adopting the algorithm, so that an alarm enhancement information set is generated. The whole flow ensures that the algorithm is fully tested and verified before the actual application, and improves the reliability of an alarm system and the confidence of an operation and maintenance team on alarm information.

In this embodiment, implementing the alarm enhancement method involves multiple algorithm replacements, and each replacement directly affects the monitoring strategy, so the method defines strict replacement standards and procedures. Before a new algorithm goes online, a historical alarm simulation trial run is performed on it, and its performance is measured accurately against existing historical alarm labels. After the new algorithm goes online, its performance is monitored dynamically, and algorithm replacement is carried out according to its actual effect in production, so that monitoring remains effective overall.

In particular, to evaluate the performance of the new algorithm on historical data accurately, the trial run should contain a sufficient number of valid and invalid alarms, and the whole process should simulate a production scenario. For monitored objects without a sufficient number of real alarms, artificial alarms are generated using methods such as alarm migration and generative adversarial networks. In the trial run step, the alarm recall rate and accuracy rate of the new algorithm must not be lower than those of the old algorithm; otherwise the new algorithm is considered insufficiently effective, cannot replace the existing algorithm, and another new algorithm needs to be tried.

In addition, under this method a large number of algorithms need to be replaced and updated frequently. Therefore, in this embodiment the algorithm part is extracted to layer and decouple the overall service. Referring to fig. 6, fig. 6 is a schematic diagram of the layered, decoupled service structure of the application monitoring method of the present application, which is divided from bottom to top into a data layer, an algorithm layer and a service layer. The data layer mainly performs data-related processing and operations to meet the input requirements of the algorithm layer. The algorithm layer focuses on the development and practice of multiple types of algorithms and supports updating algorithms in a standard pluggable manner, so that bringing a new algorithm online is simpler and users can concentrate on developing and iterating the algorithm model. Specifically, at the software level the method defines standard input and output structures for the three links of offline model training, online model updating and online anomaly detection, and implements hot loading of new algorithm files through a software framework. An algorithm therefore only needs to adapt the inputs and outputs of the three links to these structures according to its own logic; once the algorithm file is uploaded, the algorithm can be applied in the other links. The service layer depends on the data results of the algorithm layer and produces the final output in combination with the relevant service requirements.
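One possible shape for such a pluggable algorithm layer is sketched below. Everything in it is an assumption made for illustration: the method names `train_offline`, `update_online` and `detect`, the convention that an uploaded file defines a class named `Algorithm`, and the `REGISTRY` dictionary are hypothetical and are not specified by the application. The sketch loads an algorithm file at run time with Python's standard `importlib` machinery and checks that it provides the three links.

```python
import importlib.util
import os
import tempfile

# The three links every uploaded algorithm is assumed to implement.
REQUIRED_LINKS = ("train_offline", "update_online", "detect")
REGISTRY = {}


def load_algorithm_file(path, name):
    """Load an algorithm file at run time and register its Algorithm class."""
    spec = importlib.util.spec_from_file_location(f"algo_{name}", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    cls = getattr(module, "Algorithm")
    missing = [m for m in REQUIRED_LINKS if not callable(getattr(cls, m, None))]
    if missing:
        raise TypeError(f"algorithm {name!r} lacks required links: {missing}")
    REGISTRY[name] = cls
    return cls


# A hypothetical uploaded algorithm file: alarm when a value exceeds every
# value seen during training and updating.
SOURCE = '''
class Algorithm:
    def train_offline(self, history):
        self.limit = max(history)

    def update_online(self, new_data):
        self.limit = max(self.limit, max(new_data))

    def detect(self, value):
        return value > self.limit
'''

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "max_limit.py")
    with open(path, "w") as handle:
        handle.write(SOURCE)
    load_algorithm_file(path, "max_limit")

# The service layer only needs the registry and the three standard links.
algo = REGISTRY["max_limit"]()
algo.train_offline([1, 2, 3])
algo.update_online([4])
result = algo.detect(10)
```

Because callers depend only on the three link names, a new algorithm file can be registered without changing the data layer or the service layer.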

Referring to fig. 7 and fig. 8, fig. 7 is a schematic diagram of the data flow of the anomaly detection algorithm in the application monitoring method of the present application, and fig. 8 is a schematic diagram of the technical framework and architecture of the anomaly detection algorithm in the application monitoring method of the present application. In the data layer, the data source is minute-level application index data in Kafka. Because offline model training requires large-scale, long-duration historical data, one copy of the data is written to HDFS in batches and synchronized to Hive so that the full data set can be loaded quickly. In the model update training stage, an initial model and most of the historical data already exist, and not all models need to be updated at the same time, so only the incremental data of some monitoring objects needs to be acquired. The minute-level application index data in Kafka is therefore also written to ClickHouse, so that it can be queried quickly for a specified object. In addition to this long-term persistent storage, the online detection part consumes the minute-level data in real time to determine whether an anomaly has occurred.

It should be noted that the foregoing examples are only for understanding the present application, and are not intended to limit the application monitoring method of the present application, and more forms of simple transformation based on the technical concept are all within the scope of the present application.

Referring to fig. 9, the present application further provides an application monitoring device 90, which includes:

A generating module 91, configured to generate a performance index baseline based on historical performance index data of the target service system;

The monitoring module 92 is configured to monitor the target service system to obtain real-time performance index data;

An alarm module 93, configured to generate an original alarm information set through a preset general anomaly detection algorithm, the real-time performance index data and the performance index baseline;

The enhancement module 94 is configured to perform iterative optimization on the original alarm information set through a preset alarm enhancement policy, so as to obtain an alarm enhancement information set.

The application monitoring device provided by the application can solve the technical problem of application monitoring by adopting the application monitoring method in the embodiment. Compared with the prior art, the application monitoring device provided by the application has the same beneficial effects as the application monitoring method provided by the embodiment, and other technical features in the application monitoring device are the same as the features disclosed by the embodiment method, and are not repeated herein.

The application provides an application monitoring device which comprises at least one processor and a memory in communication connection with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the application monitoring method in the first embodiment.

Referring now to fig. 10, a schematic diagram of an application monitoring device suitable for implementing embodiments of the present application is shown. The application monitoring device in the embodiment of the present application may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (Personal Digital Assistant), a PAD (tablet computer), a PMP (Portable Media Player) or an in-vehicle terminal (e.g., an in-vehicle navigation terminal), and a fixed terminal such as a digital TV or a desktop computer. The application monitoring device shown in fig. 10 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments of the present application.

As shown in fig. 10, the application monitoring device may include a processing device 1001 (e.g., a central processing unit, a graphics processor, etc.) that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage device 1003 into a random access memory (RAM) 1004. The RAM 1004 also stores various programs and data required for the operation of the application monitoring device. The processing device 1001, the ROM 1002 and the RAM 1004 are connected to each other by a bus 1005. An input/output (I/O) interface 1006 is also connected to the bus. In general, the following systems may be connected to the I/O interface 1006: an input device 1007 including, for example, a touch screen, a touch pad, a keyboard, a mouse, an image sensor, a microphone, an accelerometer, a gyroscope, etc.; an output device 1008 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 1003 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 1009. The communication device 1009 may allow the application monitoring device to communicate wirelessly or by wire with other devices to exchange data. While an application monitoring device having various systems is shown in the figures, it should be understood that not all of the illustrated systems are required to be implemented or provided. More or fewer systems may alternatively be implemented or provided.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through a communication device, or installed from the storage device 1003, or installed from the ROM 1002. The above-described functions defined in the method of the disclosed embodiment of the application are performed when the computer program is executed by the processing device 1001.

Compared with the prior art, the application monitoring device provided by the present application has the same beneficial effects as the application monitoring method provided by the above embodiment, and the other technical features of the application monitoring device are the same as the features disclosed in the method of the above embodiment, which are not repeated herein.

It is to be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the description of the above embodiments, particular features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples.

The foregoing is merely illustrative of the present application, and the present application is not limited thereto. Any variation or substitution that a person skilled in the art can readily conceive of falls within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

The present application provides a computer-readable storage medium having computer-readable program instructions (i.e., a computer program) stored thereon for performing the application monitoring method in the above-described embodiments.

The computer-readable storage medium provided by the present application may be, for example, a USB flash drive, and may more generally be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this embodiment, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable storage medium may be transmitted using any appropriate medium, including but not limited to electrical wire, fiber optic cable, RF (radio frequency), and the like, or any suitable combination of the foregoing.

The computer readable storage medium may be included in the application monitoring device or may exist alone without being incorporated into the application monitoring device.

Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules involved in the embodiments of the present application may be implemented in software or in hardware. In some cases, the name of a module does not constitute a limitation on the module itself.

The readable storage medium provided by the present application is a computer-readable storage medium that stores computer-readable program instructions (i.e., a computer program) for executing the application monitoring method, so that the technical problem of application monitoring can be solved. Compared with the prior art, the beneficial effects of the computer-readable storage medium provided by the present application are the same as those of the application monitoring method provided by the above embodiment, and are not repeated herein.

The application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of an application monitoring method as described above.

The computer program product provided by the present application can solve the technical problem of application monitoring. Compared with the prior art, the beneficial effects of the computer program product provided by the present application are the same as those of the application monitoring method provided by the above embodiment, and are not repeated herein.

The foregoing describes only some embodiments of the present application and is not intended to limit its scope. Any equivalent structural change made on the basis of the description and the accompanying drawings under the technical concept of the present application, and any direct or indirect application in other related technical fields, falls within the scope of protection of the present application.

Claims

1. An application monitoring method, the method comprising:

2. The method of claim 1, wherein the real-time performance indicator data comprises a plurality of performance characteristic indicators of the target business system, and wherein the generating the original set of alert information by a preset general anomaly detection algorithm and the real-time performance indicator data and the performance indicator baseline comprises:

judging whether the index difference exceeds a preset difference threshold;

3. The method of claim 2, wherein the step of iteratively optimizing the original alert information set by a preset alert enhancement policy to obtain an alert enhancement information set comprises:

4. The method of claim 3, wherein the generating a supplemental monitoring object set based on the real-time performance metric data and the original alert information set comprises:

5. The method of claim 3, wherein the anomaly detection optimization algorithm comprises at least one of a time-series based anomaly detection algorithm, a machine-learning based anomaly detection algorithm, and a deep-learning based anomaly detection algorithm, the training process of the anomaly detection optimization algorithm comprising:

6. The method of claim 5, wherein before the step of iteratively optimizing the original alert information set by a preset alert enhancement policy to obtain the alert enhancement information set, the method further comprises:

7. An application monitoring device, the device comprising:

8. An application monitoring device, characterized in that the device comprises a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program being configured to implement the steps of the application monitoring method according to any one of claims 1 to 6.

9. A storage medium, characterized in that the storage medium is a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the steps of the application monitoring method according to any one of claims 1 to 6.

10. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the steps of the application monitoring method according to any of claims 1 to 6.
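
As a non-limiting illustration, the threshold judgment recited in claim 2 (judging whether the index difference exceeds a preset difference threshold) might be sketched as follows. The intermediate steps of claim 2 are not reproduced in this text, so this sketch assumes that the index difference is the absolute difference between a real-time indicator value and its baseline value. The names `Alert`, `detect_threshold_alerts`, `realtime`, `baseline`, and `thresholds` are hypothetical and do not appear in the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """One entry of a hypothetical original alarm information set."""
    indicator: str
    observed: float
    baseline: float
    difference: float


def detect_threshold_alerts(realtime, baseline, thresholds):
    """Return an alert for each indicator whose difference exceeds its threshold.

    Assumption: the index difference is |observed - baseline|. The application
    may define it differently (e.g. as a relative or signed deviation).
    """
    alerts = []
    for name, observed in realtime.items():
        if name not in baseline or name not in thresholds:
            # No baseline or threshold for this indicator: nothing to compare.
            continue
        difference = abs(observed - baseline[name])
        if difference > thresholds[name]:
            alerts.append(Alert(name, observed, baseline[name], difference))
    return alerts


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the application.
    realtime = {"cpu": 0.95, "latency_ms": 120.0}
    baseline = {"cpu": 0.50, "latency_ms": 100.0}
    thresholds = {"cpu": 0.30, "latency_ms": 50.0}
    for alert in detect_threshold_alerts(realtime, baseline, thresholds):
        print(alert)
```

In this sketch, indicators that lack a baseline or a threshold are skipped rather than reported, which is a design choice of the illustration and is not stated in the application.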