CN111290913A

CN111290913A - Fault location visualization system and method based on operation and maintenance data prediction

Info

Publication number: CN111290913A
Application number: CN202010079674.0A
Authority: CN
Inventors: 王子健; 周扬帆; 付娇娇; 陈昊; 蔡煜; 曹袖
Original assignee: CERNET Corp; Fudan University
Current assignee: CERNET Corp; Fudan University
Priority date: 2020-02-04
Filing date: 2020-02-04
Publication date: 2020-06-16

Abstract

The invention discloses a fault location visualization system and method based on operation and maintenance data prediction. The invention collects the machine and application log information in the network cluster by using the existing log collection framework to generate operation and maintenance big data, and then processes the operation and maintenance big data by using an artificial intelligence method, thereby predicting the network fault in advance and displaying the network fault in a visual mode. The invention has the beneficial effects that: the fault can be identified efficiently and accurately, the network fault can be predicted in advance, the operation and maintenance personnel can be given sufficient time to carry out operation and maintenance work in time, and the work efficiency of the operation and maintenance personnel can be improved effectively.

Description

A fault location visualization system and method based on operation and maintenance data prediction

Technical field

The invention relates to a fault location visualization system and method based on operation and maintenance data prediction, and belongs to the technical field of computer networks and intelligent operation and maintenance.

Background art

As network technology matures, wireless network coverage keeps expanding and the number of smart devices around people keeps growing, so the number of devices connected to networks has increased sharply. Under these conditions, the quality of the information services a network provides is a key factor in user experience. For example, if an access control authentication system on the network fails, the work efficiency of its users may drop considerably. For these two reasons, networks now place higher demands on the accuracy and timeliness of fault location, and network fault detection and location has become a key research problem.

Existing frameworks for locating and analyzing network faults include eSight, ELK, Splunk, and Zabbix, and their main functions are largely the same. They first deploy an application on each endpoint to collect the device's log information, then extract the device's key information from those logs, locate and analyze faults by setting thresholds and applying simple statistical analysis, and provide visualization and notification mechanisms to assist operation and maintenance personnel.

Although many such frameworks exist, they have a number of shortcomings in the current environment. First, most of them detect and locate faults mainly through thresholds and the extraction of error log information. This approach struggles to detect errors in time and is prone to missed detections, and the risk grows as networks grow. Moreover, the thresholds are generally specified by experts; because systems are upgraded frequently, the work of re-understanding the system's parameter settings is excessive and the resulting accuracy is low. Second, the sharp increase in connected devices makes the volume of log data too large. Even when it is processed with current frameworks, the amount of information is still too large to help operation and maintenance personnel understand the current network state in time. When personnel cannot understand machine state in time, many faults cannot be handled promptly, and operation and maintenance falls into a passive mode in which personnel wait for user feedback before acting.

Summary of the invention

In view of the deficiencies of the prior art, the purpose of the present invention is to provide a fault location visualization system and method based on operation and maintenance data prediction. The invention uses artificial intelligence technology to identify faults efficiently and accurately and to predict network faults in advance, giving operation and maintenance personnel sufficient time to act and effectively improving their work efficiency.

The technical solution of the present invention is described in detail below.

A fault location visualization system based on operation and maintenance data prediction comprises a data collection part, an algorithm prediction part, and a visual display part, wherein:

Data collection part: collects the machine and application logs of each machine in the cluster, uniformly extracts the key performance indicators used to monitor cluster status, and then builds the operation and maintenance big data by time series extraction;

Algorithm prediction part: uses machine learning and neural network methods to learn prior experience from the historical operation and maintenance big data already collected, thereby generating an artificial intelligence prediction model that matches actual production patterns; the model is then used to predict faults from operation and maintenance data collected in real time;

Visualization part: displays all log information and fault prediction information.

In the present invention, the data collection part includes a log collection module and a performance indicator extraction module, wherein:

Log collection module: includes a distributed log collection component and a centralized log storage component; it collects the machine and application logs of each machine in the cluster and handles log redirection and centralized processing;

Performance indicator extraction module: extracts the indicator information that reflects machine performance from the logs and builds the operation and maintenance big data by time series extraction.

In the present invention, the performance indicator extraction module extracts the indicator information that reflects machine performance by filtering and keyword extraction; this information includes CPU usage, memory usage, disk read bandwidth, and network traffic.

In the present invention, the algorithm prediction part includes an algorithm adaptive module and a fault prediction module, wherein:

Algorithm adaptive module: trains models on the previous day's historical operation and maintenance big data using a statistical feature screening algorithm and a cross-validation screening algorithm, and obtains the artificial intelligence model used to predict faults on the current day;

Fault prediction module: uses the artificial intelligence model to predict faults from operation and maintenance data collected in real time.

In the present invention, the visualization part includes a real-time monitoring module, a historical prediction module, a log information retrieval module, and a machine performance curve display module, wherein:

Real-time monitoring module: displays the real-time performance status of the machines in the cluster; the performance status includes CPU utilization, memory utilization, disk throughput, network bandwidth, fault prediction information, and the machine status obtained from information collected through thresholds and SNMP;

Historical prediction module: caches the historical fault prediction information of the machines in the cluster;

Log information retrieval module: provides the extracted machine performance log information;

Machine performance curve display module: presents log information to operation and maintenance personnel as curve graphs.

The present invention also provides a fault location visualization method based on the above system, with the following steps:

(1) Collect the machine and application logs of each machine in the cluster, uniformly extract the key performance indicators used to monitor cluster status, and then build the operation and maintenance big data by time series extraction;

(2) Use machine learning and neural network methods to learn prior experience from the historical operation and maintenance big data already collected, thereby generating an artificial intelligence prediction model that matches actual production patterns; then use the model to predict faults from operation and maintenance data collected in real time;

(3) Display all log information and fault prediction information.

In the present invention, in step (1), the indicator information that reflects machine performance is extracted by filtering and keyword extraction; this information includes CPU usage, memory usage, disk read bandwidth, and network traffic.

In the present invention, in step (2), models are trained on the previous day's historical operation and maintenance big data using a statistical feature screening algorithm and a cross-validation screening algorithm, to obtain the artificial intelligence model used to predict faults on the current day.

In the present invention, in step (3), the information is displayed as curve graphs and tables.

Compared with the prior art, the beneficial effects of the present invention are:

Using machine learning and neural network methods, the invention learns prior experience from the historical operation and maintenance big data already collected, and the resulting trained model can accurately predict and locate network faults;

The invention displays conventional log monitoring information and the predicted fault information of each machine on a visual operation and maintenance information platform, which effectively helps operation and maintenance personnel complete their tasks intuitively and efficiently.

Description of the drawings

Figure 1 is the system architecture diagram.

Figure 2 is the system operation flow chart.

Figure 3 is a structural diagram of the data collection part.

Figure 4 is a structural diagram of the algorithm prediction part.

Figure 5 is a structural diagram of the visual display part.

Detailed description of the embodiments

The technical solution of the present invention is described in detail below with reference to the accompanying drawings and embodiments.

The invention first uses a distributed log collection framework to collect the machine log information on each device, then extracts the key information from those logs as the machine's performance indicators at that moment and builds the operation and maintenance big data. The processed data is then used to train an artificial intelligence model. When the latest real-time performance data is fed in, the model predicts whether these machines will fail within a following period of time, and the result is presented to operation and maintenance personnel as charts and tables. This helps them identify problems and prepare solutions in advance, addressing the problem that personnel cannot respond in time in today's complex networks.

1. Overall system architecture

The system architecture is shown in Figure 1. The system consists of three main parts: the data collection part (log collection server), the algorithm prediction part (fault prediction server), and the visual display part (visualization front end).

The data collection part collects the machine and application logs of each machine in the cluster. Lightweight log collection tools are deployed on the monitored machines; these distributed tools ship the logs to a server, where the key performance indicators are uniformly extracted to monitor cluster status and to build the operation and maintenance big data.

The algorithm prediction part uses machine learning and neural network methods to learn prior experience from the historical operation and maintenance big data already collected, thereby generating a prediction model that matches actual production patterns. During actual prediction, real-time performance data is fed directly to the model as input, and the learned patterns are used to predict whether a fault will occur within a following period of time.

The visualization part displays all log information and prediction information. Because even a very detailed text report is harder to absorb than a chart, the invention uses charts to give operation and maintenance personnel the predicted faults and real-time status information as a reference, helping them understand and handle possible faults in the network cluster efficiently.

The operation of the whole system is shown in Figure 2. First, lightweight log collection components run on every machine. They are deployed across the cluster in a distributed manner and send all required system log and application log information over the network to a designated storage server. A log collection and storage tool on that server receives and saves the logs of all machines in the cluster for centralized storage and processing. On the same server, the key performance indicators of each machine are then extracted from the logs by filtering, keyword extraction, and similar methods, forming the operation and maintenance big data for all machines. Next, on a separate server, the neural network algorithm we built is trained on the collected historical operation and maintenance big data to produce an artificial intelligence model suited to our environment. The operation and maintenance data collected in real time is fed to the model as input, so the model predicts in real time whether machines in the cluster will fail within a following period of time, completing fault prediction and location based on operation and maintenance big data. Finally, we use a front-end framework to build a visual operation and maintenance data display platform that monitors cluster status in real time, raises early warnings for devices with predicted faults, and caches the predicted fault information for later handling. Operation and maintenance personnel can thus understand the cluster status efficiently through charts, and use the detailed log information provided to find faults accurately and design solutions in advance.

2. Data collection part

As shown in Figure 3, the data collection part consists mainly of a log collection module and a performance indicator extraction module. It collects monitoring information from the cluster and extracts the key performance indicators.

Log collection module: handles log redirection and storage. To collect logs, the invention draws on the mature cluster log monitoring and management frameworks already used for low-level operation and maintenance work at many large companies. We mainly use the log collection tool and log storage tool from the mainstream ELK framework to redirect and centralize logs, so that the log information in the cluster is stored and processed centrally for unified management.

Performance indicator extraction module: extracts the indicators in the logs that reflect machine performance. Our scheme mainly collects each machine's CPU utilization, memory utilization, disk read speed, and network traffic bandwidth to measure the state of a machine. These key performance indicators can be extracted from the machine's system log information by keyword extraction and filtering, so that they reflect the state of each machine in the cluster, and the operation and maintenance big data is then built by time series extraction.
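As an illustration of the keyword-and-filter extraction described above, the following Python sketch parses one log line into KPI values. The log line layout (`ts=... host=... cpu=...`), the field names, and the function `extract_kpis` are assumptions made for this example only; the source does not specify a log format.

```python
import re
from typing import Dict, Optional, Union

# Hypothetical log layout, assumed for illustration only.
KPI_PATTERN = re.compile(
    r"ts=(?P<ts>\d+)\s+host=(?P<host>\S+)\s+"
    r"cpu=(?P<cpu>\d+(?:\.\d+)?)\s+mem=(?P<mem>\d+(?:\.\d+)?)\s+"
    r"disk_read=(?P<disk_read>\d+(?:\.\d+)?)\s+net=(?P<net>\d+(?:\.\d+)?)"
)


def extract_kpis(line: str) -> Optional[Dict[str, Union[int, str, float]]]:
    """Return the KPI fields of one log line, or None if the line is not a KPI line."""
    # Filtering step: cheap keyword check before running the regular expression.
    if "cpu=" not in line:
        return None
    # Keyword extraction step: pull the named fields out of the line.
    match = KPI_PATTERN.search(line)
    if match is None:
        return None
    return {
        "ts": int(match["ts"]),
        "host": match["host"],
        "cpu": float(match["cpu"]),
        "mem": float(match["mem"]),
        "disk_read": float(match["disk_read"]),
        "net": float(match["net"]),
    }
```

Lines that do not carry KPI fields, such as ordinary service messages, are dropped by the filter and return `None`.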

3. Algorithm prediction part

As shown in Figure 4, the algorithm prediction part adaptively selects a suitable artificial intelligence algorithm according to the type and statistical characteristics of the collected operation and maintenance big data, so that different types of data can be predicted. It consists of two modules: the algorithm adaptive module and the fault prediction module.

Algorithm adaptive module: screens for a suitable algorithm to learn the prior experience contained in the operation and maintenance big data. Because a cluster system is continually iterated and redesigned, the statistical characteristics and distribution of its operation and maintenance data may change greatly. Continuing to predict with the previous version of the model would sharply reduce prediction accuracy and cause a large number of false alarms. To prevent this, the invention trains models using a statistical feature screening algorithm and a cross-validation screening algorithm. Specifically, a model is trained every day on the previous day's historical operation and maintenance big data. That data is collected as training data, and several statistical indicators are computed from it to select suitable algorithms. The training data is then processed into time series and divided into a training set and a test set. Each candidate algorithm, configured with suitable parameters, is trained on the training set and then validated on the test set, and the model that performs best is persisted as the artificial intelligence model used to predict faults that day.

Fault prediction module: uses the persisted model to predict faults from operation and maintenance data collected in real time. Once the previous module has produced the model, real-time machine performance data is processed into operation and maintenance data and fed to the model as input. The model outputs whether the machine is likely to fail within a following time period and what kind of fault may occur, completing the task of network fault location and prediction.

4. Visual display part

As shown in Figure 5, the visual display part consists mainly of a real-time monitoring module, a historical prediction module, a log information retrieval module, and a machine performance curve display module.

Real-time monitoring module: displays the real-time performance status of the machines in the cluster, mainly each machine's real-time CPU utilization, memory utilization, disk throughput, and network bandwidth, and whether a fault is predicted for that machine in the following time period. It also collects information through thresholds, SNMP, and similar methods to judge whether a machine is operating normally.

Historical prediction module: caches the historical prediction information of each machine. The professional operation and maintenance personnel we interviewed advised us to save the record of every prediction, because once a fault occurs, these predictions carry important information that helps personnel understand the fault promptly.

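A per-machine prediction cache of the kind described above could look like the following Python sketch. The class names `Prediction` and `PredictionHistory`, the record fields, and the bounded size are assumptions for illustration; the source does not say how the history is stored.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Deque, Dict, List


@dataclass(frozen=True)
class Prediction:
    """One prediction record (hypothetical fields)."""
    ts: int               # second at which the prediction was made
    fault_expected: bool  # model output for the following period


class PredictionHistory:
    """Keeps the most recent predictions for each machine."""

    def __init__(self, max_per_machine: int = 1000) -> None:
        self._max = max_per_machine
        # Each machine gets its own bounded queue; the oldest records drop off first.
        self._store: Dict[str, Deque[Prediction]] = defaultdict(
            lambda: deque(maxlen=self._max)
        )

    def record(self, machine: str, prediction: Prediction) -> None:
        self._store[machine].append(prediction)

    def since(self, machine: str, start_ts: int) -> List[Prediction]:
        """Return the stored predictions for a machine made at or after start_ts."""
        return [p for p in self._store.get(machine, ()) if p.ts >= start_ts]
```

After a fault, calling `since` with a timestamp shortly before the incident returns the predictions that led up to it.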

Log information retrieval module: provides the machine performance log information we have extracted. This information is currently the main basis on which operation and maintenance personnel work, and it is provided here so that they can continue to work in the way they are used to. When personnel receive a predicted fault, they can quickly review the machine's current log information and combine it with their own operation and maintenance knowledge to confirm the exact cause of the fault and prepare a corresponding solution.

Machine performance curve display module: presents log information to operation and maintenance personnel as curve graphs, so that they can see at a glance whether a performance curve is abnormal and quickly identify the fault. The real-time prediction results are also shown on the performance curves, giving personnel important supporting information.

Based on the above system, the present invention proposes a fault location visualization method based on operation and maintenance big data prediction. The specific implementation steps are as follows:

First, the user installs a lightweight log collection tool on the machines in each network cluster and keeps it running continuously. By modifying its configuration file, the tool is set to collect and redirect the machine's system log information and the specified application performance logs, so that the relevant information from all machines in the cluster is gathered and their performance can be monitored. Compared with other tools, this kind of log collection tool is simpler and more convenient, and its own overhead is almost negligible, so deploying it widely in the cluster does not add excessive overhead and the performance of the original network is preserved. All log information is redirected to a server in the cluster on which a log storage tool is configured, where it is stored for parsing in the next step.

Second, the relevant logs of all machines in the cluster are redirected to the log storage server by the distributed log collection tools. The log storage server may also use a message queue tool to buffer messages, ensuring that all log information can be stored centrally on the server. To monitor the performance of the machines in the cluster, filtering or keyword extraction is then used to extract indicators that reflect machine performance from the system logs, such as CPU utilization, memory usage, disk throughput, and network bandwidth traffic. With these indicators, the state of the machines in the cluster can be monitored in real time, and they also provide the basis for the subsequent prediction work.

Third, after the performance indicator data of all machines has been collected, it is processed into time series data. By default, time series are built with a window size of 100 seconds and an interval of 1 second between consecutive series. The label of each time series indicates whether a fault occurs within the 1000 seconds after its time window: 1 if so, 0 otherwise. For example, let sₙ denote the machine's information at second n. Suppose we have the data for s₁, s₂, s₃, …, s₁₀₁ and a fault occurs within the following 1000 seconds. We then build two time windows: one containing s₁, s₂, s₃, …, s₁₀₀ and one containing s₂, s₃, s₄, …, s₁₀₁, both labeled 1. After all data has been built into time series and stored, summary statistics are computed over every time series window and used as additional features. The result is the operation and maintenance big data.
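The windowing and labeling rule above can be sketched in Python as follows. The function names `build_windows` and `window_stats`, the use of a single scalar reading per second, and the particular statistics chosen are assumptions for illustration; the source only fixes the defaults of a 100-second window, a 1-second interval, and a 1000-second label horizon. Indices are 0-based here, so the text's s₁ corresponds to `samples[0]`.

```python
import statistics
from typing import Dict, List, Sequence, Set, Tuple


def build_windows(
    samples: Sequence[float],
    fault_seconds: Set[int],
    window: int = 100,
    horizon: int = 1000,
    stride: int = 1,
) -> List[Tuple[List[float], int]]:
    """Slice per-second samples into overlapping windows with 0/1 fault labels."""
    windows = []
    for start in range(0, len(samples) - window + 1, stride):
        end = start + window  # index of the first second after the window
        # Label 1 if any fault falls within `horizon` seconds after the window.
        label = int(any(end <= t < end + horizon for t in fault_seconds))
        windows.append((list(samples[start:end]), label))
    return windows


def window_stats(values: Sequence[float]) -> Dict[str, float]:
    """Summary statistics of one window, usable as extra features (assumed set)."""
    return {
        "mean": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
        "std": statistics.pstdev(values),
    }
```

With 101 samples and one fault inside the horizon, `build_windows` yields exactly the two windows of the worked example, both labeled 1.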

Fourth, a model is trained to perform fault prediction. To keep the model timely and accurate, all operation and maintenance data from the previous day is divided into a training set and a test set, which are used to train a suitable model for predicting network faults on the following day. The default ratio of training set to test set is 4:1. To complete the algorithm screening, each candidate algorithm with its preset parameters is trained on the training set and validated on the test set, and the model with the best validation score is taken as the prediction model used in production and persisted for real-time fault prediction and location in the next step.
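The screening step above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the two candidate models (`MajorityModel`, `MeanThresholdModel`) are toy stand-ins invented for the example, not the algorithms used in the source, the split is chronological, and accuracy on the held-out 20% is used as the validation score.

```python
from statistics import fmean
from typing import Dict, List, Sequence, Tuple

Dataset = List[Tuple[Sequence[float], int]]  # (window, label) pairs


class MajorityModel:
    """Toy baseline: always predicts the most common training label."""

    def fit(self, train: Dataset) -> "MajorityModel":
        ones = sum(label for _, label in train)
        self.label = int(2 * ones >= len(train))
        return self

    def predict(self, window: Sequence[float]) -> int:
        return self.label


class MeanThresholdModel:
    """Toy rule: predicts a fault when the window mean exceeds a learned cut-off."""

    def fit(self, train: Dataset) -> "MeanThresholdModel":
        pos = [fmean(w) for w, y in train if y == 1]
        neg = [fmean(w) for w, y in train if y == 0]
        # Midpoint between class means; never fires if one class is absent.
        self.threshold = (fmean(pos) + fmean(neg)) / 2 if pos and neg else float("inf")
        return self

    def predict(self, window: Sequence[float]) -> int:
        return int(fmean(window) > self.threshold)


def split_4_to_1(data: Dataset) -> Tuple[Dataset, Dataset]:
    """Chronological 4:1 split into training and test sets."""
    cut = len(data) * 4 // 5
    return data[:cut], data[cut:]


def accuracy(model, test: Dataset) -> float:
    return sum(model.predict(w) == y for w, y in test) / len(test)


def select_best(data: Dataset, candidates: Dict[str, object]):
    """Train every candidate on the training set and keep the best test scorer."""
    train, test = split_4_to_1(data)
    scored = [(accuracy(m.fit(train), test), name, m) for name, m in candidates.items()]
    score, name, model = max(scored, key=lambda item: item[0])
    return name, model, score
```

The selected model object would then be saved to disk so that the real-time prediction step can load it; Python's standard `pickle` module is one way to do that.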

Fifth, in the production environment, real-time fault prediction works as follows. The current reading of a given machine and all of its data from the preceding 99 seconds are processed into a time series. The persisted prediction model is then called with this processed data as input, and its output indicates whether a fault will occur on that machine within the 1000 seconds following the prediction. This completes real-time fault prediction and location and helps operation and maintenance personnel identify potential faults in advance.
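A rolling buffer is one simple way to keep "the current reading plus the preceding 99 seconds" available for each machine. The Python sketch below assumes one scalar reading per second and a model exposed as a plain callable; the class name `RealtimePredictor` and this interface are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Optional


class RealtimePredictor:
    """Feeds the latest `window` readings of one machine to a prediction model."""

    def __init__(self, model: Callable[[List[float]], int], window: int = 100) -> None:
        self._model = model
        # A bounded deque discards the oldest reading automatically.
        self._buffer: Deque[float] = deque(maxlen=window)

    def push(self, reading: float) -> Optional[int]:
        """Add one per-second reading; return a prediction once the window is full."""
        self._buffer.append(reading)
        if len(self._buffer) < self._buffer.maxlen:
            return None  # not enough history yet
        return self._model(list(self._buffer))
```

Each call to `push` after the first 100 readings produces one prediction, matching the 1-second interval between consecutive windows used at training time.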

In the sixth step, once fault prediction is complete and the prediction results have been generated, the results are combined with the actual performance information for visual display, so that operation and maintenance personnel can understand the state of the machines in the cluster simply and intuitively. The display first shows, for every machine in the cluster, its real-time performance metrics (CPU utilization, memory utilization, etc.) and its real-time fault prediction result; these serve the purpose of real-time monitoring. If a fault is predicted for a machine, personnel can click the machine name to view detailed information, including all of that machine's performance logs and its performance curves. The performance curves let personnel spot problems visually, and predicted faults are marked on them for reference. After identifying the point in time at which a specific performance fault may occur, personnel can review the corresponding performance logs to understand the fault in advance and prepare a solution. Finally, to give personnel a more complete understanding of faults, a history of predicted faults is retained for analysis.

Claims

1. A fault location visualization system based on operation and maintenance data prediction, characterized by comprising a data collection part, an algorithm prediction part and a visualization part; wherein:

the data collection part: collecting machine and application logs of each machine in the cluster, uniformly extracting key performance indicators used for monitoring the cluster state, and constructing operation and maintenance big data by using a time series extraction method;

the algorithm prediction part: learning prior experience from the collected historical operation and maintenance big data by using machine learning and neural network methods, so as to generate an artificial intelligence prediction model meeting the requirements of actual production rules; and then carrying out fault prediction on the operation and maintenance data acquired in real time by using the artificial intelligence model;

the visualization part: displaying all log information and fault prediction information.

2. The fault location visualization system according to claim 1, wherein the data collection part comprises a log collection module and a performance index extraction module; wherein:

the log collection module: comprising a distributed log collection component and a centralized log storage component, which are used for collecting machine and application logs of all machines in the cluster and for redirecting the logs for centralized processing;

the performance index extraction module: extracting index information reflecting machine performance from the logs, and constructing operation and maintenance big data by a time series extraction method.

3. The fault location visualization system according to claim 1, wherein in the performance index extraction module, index information reflecting machine performance is extracted by filtering and keyword extraction methods; the index information of the machine performance comprises CPU utilization information, memory utilization information, disk read bandwidth information and network traffic information.

4. The fault location visualization system according to claim 1, wherein the algorithm prediction part comprises an algorithm adaptation module and a fault prediction module; wherein:

the algorithm adaptation module: training a model through a statistical feature screening algorithm and a cross-validation screening algorithm based on historical operation and maintenance big data of the previous day, so as to obtain an artificial intelligence model for predicting faults on the current day;

the fault prediction module: carrying out fault prediction on the operation and maintenance data acquired in real time by using the artificial intelligence model.

5. The fault location visualization system according to claim 1, wherein the visualization part comprises a real-time monitoring module, a history prediction module, a log information retrieval module and a machine performance curve display module; wherein:

the real-time monitoring module: used for displaying the real-time performance state of the machines in the cluster; the performance state comprises CPU utilization, memory utilization, disk throughput, network bandwidth, fault prediction information, and machine state obtained by collecting information through a threshold value and an SNMP method;

the history prediction module: used for caching historical fault prediction information of the machines in the cluster;

the log information retrieval module: used for providing the extracted machine performance log information;

the machine performance curve display module: used for displaying the log information to operation and maintenance personnel in the form of graphs.

6. A fault location visualization method based on the system of claim 1, characterized by comprising the following specific steps:

(1) collecting machine and application logs of each machine in the cluster, uniformly extracting key performance indicators used for monitoring the cluster state, and constructing operation and maintenance big data by using a time series extraction method;

(2) learning prior experience from the collected historical operation and maintenance big data by using machine learning and neural network methods, so as to generate an artificial intelligence prediction model meeting the requirements of actual production rules; and carrying out fault prediction on the operation and maintenance data acquired in real time by using the artificial intelligence model;

(3) displaying all log information and fault prediction information.

7. The fault location visualization method according to claim 6, wherein in the step (1), index information reflecting machine performance is extracted by filtering and keyword extraction methods; the index information of the machine performance comprises CPU utilization information, memory utilization information, disk read bandwidth information and network traffic information.

8. The fault location visualization method according to claim 6, wherein in the step (2), the model is trained through a statistical feature screening algorithm and a cross-validation screening algorithm based on historical operation and maintenance big data of the previous day, so as to obtain an artificial intelligence model for predicting faults on the current day.

9. The fault location visualization method according to claim 6, wherein in the step (3), the visualization is performed by means of graphs and tables.