CN119248617A - Elasticsearch cluster fault intelligent detection system

Info

Publication number: CN119248617A
Application number: CN202411575774.7A
Authority: CN
Inventors: 杨昌玉; 原帅; 刘衍琦; 彭洪浩; 吴雪安
Original assignee: Yantai Institute Of Technology
Current assignee: Yantai Institute Of Technology
Priority date: 2024-11-06
Filing date: 2024-11-06
Publication date: 2025-01-03

Abstract

The invention provides an intelligent fault detection system for an Elasticsearch cluster, relating to the technical field of big data storage and retrieval. The system comprises a data acquisition module, an Elasticsearch cluster fault analysis module, an expert system module, and a model optimization and self-learning module. A fault detection model in the Elasticsearch cluster fault analysis module analyzes the current log data and monitoring index data of the Elasticsearch cluster and determines the predicted fault type of the cluster, realizing automatic detection and prediction. Fusing a deep learning algorithm with an expert rule base allows complex and dynamic fault scenarios to be handled efficiently, reduces the false-alarm and missed-detection rates, and improves detection accuracy. The model optimization and self-learning module realizes self-learning and dynamic optimization of the fault detection model, reducing operation and maintenance complexity and cost.

Description

Intelligent fault detection system for an Elasticsearch cluster

Technical Field

The invention relates to the technical field of big data storage and retrieval, and in particular to an intelligent fault detection system for an Elasticsearch cluster.

Background

Elasticsearch is a distributed, real-time, high-performance search and analysis engine that can process large amounts of data and provide fast, accurate search results. In practical applications, an Elasticsearch cluster may encounter various faults, such as node downtime, full disks, and network failures. Detecting Elasticsearch cluster faults is therefore crucial.

In the prior art, fault detection for an Elasticsearch cluster mainly depends on traditional monitoring tools and manual troubleshooting, and generally comprises the following aspects:

(1) Performance monitoring and log analysis: evaluating the health of the cluster by monitoring performance indexes of the Elasticsearch cluster (such as central processing unit (CPU) utilization, memory utilization, and query delay), while analyzing the Elasticsearch logs with a log analysis tool to detect abnormal behavior or error information;

(2) Visualization tool application: using Kibana, Grafana, or other visualization tools to display the performance indexes of the cluster, and deploying a separate log analysis tool (e.g., Logstash) to process and analyze log data;

(3) Real-time monitoring and history recording: monitoring the running state of the Elasticsearch cluster in real time so that abnormalities and faults in the cluster are found and handled promptly, while keeping a history of cluster operation to assist fault troubleshooting and performance optimization.

However, the prior art relies on static monitoring and manual experience-based analysis. It cannot efficiently handle complex and dynamic fault scenarios, has difficulty predicting and preventing potential faults, and suffers from low efficiency, high operation and maintenance complexity, and high cost.

Disclosure of Invention

In view of the problems in the prior art, embodiments of the invention provide an intelligent fault detection system for an Elasticsearch cluster.

The invention provides an intelligent fault detection system for an Elasticsearch cluster, comprising a data acquisition module, an Elasticsearch cluster fault analysis module, an expert system module, and a model optimization and self-learning module, wherein:

the data acquisition module is used for acquiring current log data and monitoring index data of the Elasticsearch cluster;

the Elasticsearch cluster fault analysis module is used for extracting feature vectors of the current log data and the monitoring index data by using a natural language processing (NLP) algorithm, inputting the feature vectors into a fault detection model to obtain the probability distribution over fault types output by the fault detection model, and determining the predicted fault type of the Elasticsearch cluster, wherein the fault detection model is a deep-learning-based large-scale pre-trained language model trained on a training data set that includes historical log data and monitoring index data marked as normal, historical log data and monitoring index data marked as faulty, and the related fault types;

the expert system module is used for verifying the predicted fault type according to the current log data, the monitoring index data, and an expert rule base, and determining the target fault type of the Elasticsearch cluster according to the verification result, wherein the expert rule base comprises a plurality of rules, and each rule comprises a fault type, an Elasticsearch cluster fault symptom, and a solution;

the model optimization and self-learning module is used for training the fault detection model according to the current log data, the monitoring index data, the predicted fault type, and the target fault type.

Optionally, the expert system module is specifically configured to:

matching and verifying the current log data, the monitoring index data, and the predicted fault type against the rules of the expert rule base to obtain a verification result;

when the verification result is that a target rule exists in the expert rule base whose fault type and Elasticsearch cluster fault symptom match the predicted fault type, the current log data, and the monitoring index data, determining that the target fault type of the Elasticsearch cluster is the predicted fault type;

and when the verification result is that no target rule matching the current log data, the monitoring index data, and the predicted fault type exists in the expert rule base, identifying the Elasticsearch cluster fault symptom in the expert rule base that corresponds to the current log data and the monitoring index data, and determining the fault type associated with that symptom as the target fault type of the Elasticsearch cluster.

Optionally, the fault detection model includes an input layer, a convolution layer, a pooling layer, a fully connected layer, and a Softmax layer;

the input layer receives the feature vector; the convolution layer extracts local features from the feature vector through convolution kernels to identify each fault type; the pooling layer downsamples the local features extracted by the convolution layer to obtain key features; the fully connected layer maps the key features from the pooling layer to the fault types to obtain a feature mapping result; and the Softmax layer generates the probability distribution over fault types from the feature mapping result output by the fully connected layer.

Optionally, the data acquisition module is further configured to:

collecting the historical log data and monitoring index data, the various fault types, and the solutions, and sending them to the model optimization and self-learning module.

Optionally, the model optimization and self-learning module is further configured to:

marking the historical log data and the monitoring index data with the corresponding fault types according to the Elasticsearch cluster fault symptoms they correspond to; for the historical log data and monitoring index data marked as faulty, recording the occurrence time, trigger conditions, operation steps, and degree of impact of the fault on the Elasticsearch cluster and its services; and generating the training data set;

periodically retraining and updating the fault detection model according to newly added historical log data and monitoring index data;

retraining the fault detection model according to updated rules and newly added rules in the expert rule base;

and measuring the difference between the predicted fault type and the target fault type with a cross-entropy loss function, and updating the parameters of the fault detection model.

Optionally, the fault type includes at least one of:

index fragmentation, query delay, central processing unit (CPU) overload, memory overflow, primary node failure, data node failure, and network partition failure.

Optionally, the system further comprises:

a system monitoring and feedback module, used for monitoring the running conditions of the data acquisition module, the Elasticsearch cluster fault analysis module, the expert system module, and the model optimization and self-learning module.

The invention also provides an intelligent fault detection method for an Elasticsearch cluster, comprising the following steps:

collecting current log data and monitoring index data of the Elasticsearch cluster;

extracting feature vectors of the current log data and the monitoring index data by using a natural language processing (NLP) algorithm, inputting the feature vectors into a fault detection model to obtain the probability distribution over fault types output by the fault detection model, and determining the predicted fault type of the Elasticsearch cluster, wherein the fault detection model is a deep-learning-based large-scale pre-trained language model trained on a training data set that includes historical log data and monitoring index data marked as normal, historical log data and monitoring index data marked as faulty, and the related fault types;

verifying the predicted fault type according to the current log data, the monitoring index data, and an expert rule base, and determining the target fault type of the Elasticsearch cluster according to the verification result, wherein the expert rule base comprises a plurality of rules, and each rule comprises a fault type, an Elasticsearch cluster fault symptom, and a solution;

and training the fault detection model according to the current log data, the monitoring index data, the predicted fault type, and the target fault type.

The invention also provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the intelligent fault detection method for an Elasticsearch cluster described above when executing the program.

The invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the intelligent fault detection method for an Elasticsearch cluster described above.

The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the intelligent fault detection method for an Elasticsearch cluster described above.

The intelligent fault detection system for an Elasticsearch cluster provided by the invention comprises a data acquisition module, an Elasticsearch cluster fault analysis module, an expert system module, and a model optimization and self-learning module. The fault detection model in the Elasticsearch cluster fault analysis module performs in-depth analysis of the current log data and monitoring index data of the Elasticsearch cluster and determines the predicted fault type of the cluster, thereby predicting potential fault risks. This realizes automatic detection and prediction of Elasticsearch cluster faults, reduces dependence on manual experience, improves detection efficiency, and helps operation and maintenance personnel take measures in advance to avoid faults. Complex and dynamic fault scenarios can be handled efficiently, the false-alarm and missed-detection rates are reduced, and detection accuracy is improved. Through the model optimization and self-learning module, the fault detection model achieves continuous learning and dynamic optimization, so that it maintains a high level of fault detection capability and operation and maintenance cost is reduced.

Drawings

In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of the framework of the intelligent fault detection system for an Elasticsearch cluster provided by the invention;

FIG. 2 is a schematic diagram of the module relationships of the intelligent fault detection system for an Elasticsearch cluster provided by the invention;

FIG. 3 is a schematic flow chart of the intelligent fault detection method for an Elasticsearch cluster provided by the invention;

FIG. 4 is a schematic structural diagram of the electronic device provided by the invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In order to facilitate a clearer understanding of various embodiments of the present application, some relevant knowledge will be presented first.

Currently, fault detection for an Elasticsearch cluster relies primarily on traditional monitoring tools and manual troubleshooting. These methods typically include (1) performance monitoring and log analysis, (2) visualization tool application, and (3) real-time monitoring and history recording.

These methods provide real-time performance monitoring and abnormality response, allow the cluster state to be inspected and analyzed visually in combination with visualization tools, scale well, and can process large-scale log and monitoring data.

However, these methods also have disadvantages:

(1) Reliance on manual analysis and low efficiency:

for complex fault scenarios, analysis and troubleshooting rely on manual experience, which is inefficient and prone to misjudgment;

(2) Inability to predict and prevent potential faults:

traditional monitoring methods mainly target faults that have already occurred, lack intelligent and automatic prediction capability, and have difficulty preventing potential fault risks in time;

(3) High operation and maintenance complexity:

multiple independent monitoring and analysis tools must be configured and maintained, which increases system complexity and operation and maintenance cost;

(4) Processing capacity bottlenecks:

in large-scale clusters, performance monitoring and log analysis can face bottlenecks in data volume and processing capacity, making it difficult to respond to and handle faults in time.

In view of the above problems, the invention provides an intelligent fault detection system for an Elasticsearch cluster, which realizes intelligent, automatic, and accurate fault detection for the Elasticsearch cluster.

The intelligent fault detection system for an Elasticsearch cluster provided by the invention is described below with reference to FIG. 1. FIG. 1 is a schematic diagram of the framework of the system. Referring to FIG. 1, the system includes:

a data acquisition module 101, an Elasticsearch cluster fault analysis module 102, an expert system module 103, and a model optimization and self-learning module 104, wherein:

The data acquisition module 101 is configured to acquire current log data and monitoring index data of the Elasticsearch cluster.

In the embodiment of the invention, the data acquisition module 101 is connected to the Elasticsearch cluster fault analysis module 102. It is responsible for collecting current log data and monitoring index data from the Elasticsearch cluster and transmitting them to the Elasticsearch cluster fault analysis module 102 for subsequent in-depth analysis and fault detection.

In practical applications, to provide a reliable data basis for the Elasticsearch cluster fault analysis module 102, the data acquisition module 101 needs to collect and organize the monitoring index data and log data of the Elasticsearch cluster in real time.

The monitoring index data include functional indexes and performance indexes, where the performance indexes include CPU utilization, memory usage, query delay, and the like; the log data include error logs, warning logs, and the like.
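
As an illustration of the kind of data the data acquisition module 101 gathers, the following minimal Python sketch flattens a node statistics response into per-node performance indexes and keeps only warning and error log lines. It assumes the metrics come from a parsed response of the Elasticsearch `GET _nodes/stats` API, which the source does not specify; the function names, the sample response, and the sample log lines are hypothetical.

```python
from typing import Any, Dict, List, Sequence


def extract_node_metrics(nodes_stats: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a parsed `GET _nodes/stats` response into one record per node."""
    records = []
    for node_id, node in nodes_stats.get("nodes", {}).items():
        records.append({
            "node_id": node_id,
            "name": node.get("name"),
            # CPU utilization of the host, in percent
            "cpu_percent": node.get("os", {}).get("cpu", {}).get("percent"),
            # JVM heap usage, in percent (a proxy for memory usage)
            "heap_used_percent": node.get("jvm", {}).get("mem", {}).get("heap_used_percent"),
        })
    return records


def filter_log_lines(lines: Sequence[str], levels: Sequence[str] = ("WARN", "ERROR")) -> List[str]:
    """Keep only log lines whose bracketed level field is one of `levels`."""
    return [line for line in lines if any(f"[{level}" in line for level in levels)]


# Made-up sample data, for illustration only.
sample_stats = {
    "nodes": {
        "abc123": {
            "name": "node-1",
            "os": {"cpu": {"percent": 91}},
            "jvm": {"mem": {"heap_used_percent": 78}},
        }
    }
}
sample_logs = [
    "[2024-11-06T10:00:00,000][INFO ][o.e.n.Node] [node-1] started",
    "[2024-11-06T10:05:00,000][WARN ][o.e.c.r.a.DiskThresholdMonitor] [node-1] high disk watermark exceeded",
    "[2024-11-06T10:06:00,000][ERROR][o.e.b.Bootstrap] [node-1] node validation exception",
]

metrics = extract_node_metrics(sample_stats)
alerts = filter_log_lines(sample_logs)
```

In a deployment, the dictionary would be obtained from the cluster over HTTP and the log lines read from the node log files; both steps are omitted here to keep the sketch self-contained.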

The Elasticsearch cluster fault analysis module 102 is configured to extract feature vectors of the current log data and the monitoring index data by using a natural language processing (NLP) algorithm, input the feature vectors into a fault detection model, obtain the probability distribution over fault types output by the fault detection model, and determine the predicted fault type of the Elasticsearch cluster, where the fault detection model is a deep-learning-based large-scale pre-trained language model trained on a training data set that includes historical log data and monitoring index data marked as normal, historical log data and monitoring index data marked as faulty, and the related fault types.

In the embodiment of the invention, the Elasticsearch cluster fault analysis module 102 implements intelligent detection and prediction of faults, reducing manual intervention. Based on the fault detection model, the module performs feature extraction and in-depth analysis on the monitoring index data and log data provided by the data acquisition module 101 to identify potential faults and anomalies.

In practical applications, the fault detection model is a customized large-scale pre-trained language model, such as the Tongyi Qianwen (Qwen) model. The model is trained and optimized on Elasticsearch cluster data, including log data, monitoring index data, and historical fault cases, and is adapted to the log structure, monitoring index data, and common fault modes of the Elasticsearch cluster to ensure its effectiveness and practicality.

The Elasticsearch cluster fault analysis module 102 is connected to the expert system module 103 and passes the predicted fault type to the expert system module 103 for verification and optimization.

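Optionally, the fault type includes at least one of: index fragmentation, query delay, CPU overload, memory overflow, primary node failure, data node failure, and network partition failure.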

The expert system module 103 is configured to verify the predicted fault type according to the current log data, the monitoring index data, and an expert rule base, and determine the target fault type of the Elasticsearch cluster according to the verification result, where the expert rule base includes a plurality of rules, and each rule includes a fault type, an Elasticsearch cluster fault symptom, and a solution.

In the embodiment of the present invention, in order to improve the accuracy and practicality of fault detection, an expert system module 103 is introduced. Based on the expert rule base, the expert system module 103 verifies and optimizes the predicted fault type output by the fault detection model, so as to determine the target fault type and improve the reliability under the complex fault scene. Each rule in the expert rule base defines a particular fault type, fault symptom, and corresponding solution. The rule base is deployed in a local environment, the specificity of the actual application scene is fully considered, and the effectiveness and applicability of the rule base are ensured.

The module incorporates the operation and maintenance experience of domain experts, audits the detection results of the Elasticsearch cluster fault analysis module 102, and provides optimization suggestions. This improves detection accuracy and is particularly important in complex or novel fault scenarios.

It should be noted that, the expert system module 103 may transmit the optimized feedback information to the model optimization and self-learning module 104, so as to update the fault detection model.

The model optimization and self-learning module 104 is configured to perform model training on the fault detection model according to the current log data and the monitoring index data, the predicted fault type and the target fault type.

To remain sensitive to the latest fault modes and maintain high detection accuracy, the intelligent fault detection system for an Elasticsearch cluster is provided with the model optimization and self-learning module 104, which realizes continuous optimization and self-learning of the fault detection model.

The specific implementation steps are as follows:

(1) Continuous data collection and preprocessing

While the system is running, new log data and monitoring index data are collected continuously. The newly collected data are standardized and their features extracted so that they meet the requirements of model training;

(2) Incremental learning and parameter optimization

The new data are used for incremental learning of the fault detection model, which learns new fault characteristics and improves its ability to detect novel faults. Model parameters are adjusted dynamically according to the performance of the fault detection model on the new data, improving the generalization capability and robustness of the model;

(3) Expert feedback fusion

Feedback provided by the expert system module 103 and new fault rules are incorporated into the model training data set. Retraining corrects the model's misjudgments of specific fault types and improves detection accuracy;

(4) Model update and deployment

The optimized fault detection model is deployed to the Elasticsearch cluster fault analysis module 102 to improve its detection capability. This forms a closed loop of data collection, model optimization, and model deployment that continuously improves system performance.
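
The closed loop described in steps (1) to (4) can be sketched as a small feedback buffer that stores each sample with its expert-confirmed target fault type and signals when retraining is due. This is a minimal Python sketch under stated assumptions: the class name `FeedbackBuffer`, the `retrain_threshold` parameter, and the trigger policy (retrain after a fixed number of disagreements between predicted and target fault type) are hypothetical and are not specified in the source.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Sample = Tuple[List[float], str]  # (feature vector, target fault type)


@dataclass
class FeedbackBuffer:
    # Hypothetical policy: retrain once this many predictions were corrected.
    retrain_threshold: int = 3
    samples: List[Sample] = field(default_factory=list)
    disagreements: int = 0

    def record(self, features: List[float], predicted: str, target: str) -> bool:
        """Store the sample under its target label; return True when retraining is due."""
        self.samples.append((features, target))
        if predicted != target:
            self.disagreements += 1
        return self.disagreements >= self.retrain_threshold

    def drain(self) -> List[Sample]:
        """Hand the collected samples to the trainer and reset the buffer."""
        batch, self.samples = self.samples, []
        self.disagreements = 0
        return batch
```

A caller would invoke `record` after each verification by the expert system module 103 and, when it returns True, pass the result of `drain` to whatever incremental training routine is in use before redeploying the model.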

The model optimization and self-learning module 104 is described in detail in the following embodiments and is not elaborated here.

In the embodiment of the invention, the model optimization and self-learning module 104 is capable of continuous learning and dynamic adaptation, remains sensitive to new faults, and improves the long-term effectiveness of the system. By continuously collecting new data and expert feedback, the module performs incremental learning and dynamic optimization on the fault detection model, forming a closed loop. The modules of the intelligent fault detection system for an Elasticsearch cluster cooperate closely through data transmission and feedback mechanisms, ensuring that the fault detection and optimization workflow for the Elasticsearch cluster runs efficiently.

The intelligent fault detection system for an Elasticsearch cluster provided by the invention comprises a data acquisition module, an Elasticsearch cluster fault analysis module, an expert system module, and a model optimization and self-learning module. The fault detection model in the Elasticsearch cluster fault analysis module performs in-depth analysis of the current log data and monitoring index data of the Elasticsearch cluster and determines the predicted fault type of the cluster, thereby predicting potential fault risks. This realizes automatic detection and prediction of Elasticsearch cluster faults, reduces dependence on manual experience, improves detection efficiency, and helps operation and maintenance personnel take measures in advance to avoid faults. Fusing a deep learning algorithm with an expert rule base allows complex and dynamic fault scenarios to be handled efficiently, reduces the false-alarm and missed-detection rates, and improves detection accuracy. Through the model optimization and self-learning module, the fault detection model achieves self-learning and dynamic optimization, that is, continuous learning and dynamic adaptation, so that it maintains a high level of fault detection capability and operation and maintenance cost is reduced.

In the embodiment of the invention, a fault detection model adopts a framework of combining a convolutional neural network (Convolutional Neural Networks, CNN) and a fully-connected network. In practical application, the fault detection model generates probability distribution of each fault type, which can be realized by the following steps:

(1) The input layer receives the feature vector. Before the monitoring index data and log data of the Elasticsearch cluster are fed to the input layer of the fault detection model, they are standardized and vectorized.

Specifically, for log data, a natural language processing model (e.g., a BERT model) is used to convert the log data into a vector representation. The log data contain characteristic information about the various faults that occur, so the fault detection model can capture fault modes hidden in the log data.

For monitoring index data (such as CPU utilization, memory usage, and query delay), standardization and dimension reduction are required to extract key features. The monitoring index data are closely related to the occurrence of faults, and by analyzing them the fault detection model can identify abnormal performance behavior.

(2) The convolution layer extracts local features from the feature vector through convolution kernels and identifies fault modes and abnormal behavior in the system logs.

(3) The pooling layer downsamples the output of the convolution layer, reducing the data dimension while retaining key features.

(4) The fully connected layer further processes the key features and maps them to the different fault types, such as primary node failure, data node failure, and network partition.

(5) The Softmax layer converts the feature mapping result output by the fully connected layer into a probability distribution over the fault types.

This deep learning and prediction mechanism enables the fault detection model to accurately identify known faults and gives it some ability to detect unknown faults.
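
The layer sequence in steps (1) to (5) can be illustrated with a minimal forward pass written in plain Python. This is a sketch only: the one-dimensional convolution, the ReLU activation, the pooling size, and all weights and input values are assumptions chosen for illustration, and a real model would learn its weights and use a deep learning framework.

```python
import math
from typing import List


def conv1d(x: List[float], kernel: List[float]) -> List[float]:
    """Valid 1-D convolution (cross-correlation) followed by ReLU."""
    k = len(kernel)
    return [
        max(0.0, sum(x[i + j] * kernel[j] for j in range(k)))
        for i in range(len(x) - k + 1)
    ]


def max_pool(x: List[float], size: int) -> List[float]:
    """Non-overlapping max pooling; a trailing partial window is kept."""
    return [max(x[i:i + size]) for i in range(0, len(x), size)]


def dense(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer: one output (logit) per fault type."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over the logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, kernel, weights, bias, pool_size=2):
    """Input -> convolution -> pooling -> fully connected -> Softmax."""
    return softmax(dense(max_pool(conv1d(x, kernel), pool_size), weights, bias))


# Illustrative values only; none of these come from the source.
FAULT_TYPES = ["primary_node_failure", "data_node_failure", "network_partition"]
x = [0.2, 0.9, 0.1, 0.4, 0.8, 0.3]              # feature vector (length 6)
kernel = [0.5, -0.5, 1.0]                        # convolution output has length 4
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # 3 fault types x 2 pooled features
bias = [0.0, 0.0, 0.0]

probs = forward(x, kernel, weights, bias)
predicted = FAULT_TYPES[probs.index(max(probs))]
```

The predicted fault type is simply the entry with the highest probability; the full distribution `probs` is what the Softmax layer hands on for verification.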

To ensure the efficiency and accuracy of the fault detection model, it is necessary to continuously optimize it with the model optimization and self-learning module 104 during model training and application.

Specifically, the model optimization and self-learning module 104 is further configured to perform the following steps:

1) measuring the difference between the predicted result and the actual result with a cross-entropy loss function to guide the updating of model parameters;

2) combining an Adam optimizer with an adaptive learning rate strategy to accelerate model convergence and avoid overfitting;

3) periodically retraining and updating the model as new fault data are collected, maintaining sensitivity to the latest fault types.
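
Steps 1) and 2) can be illustrated with a toy Python example that minimizes the cross-entropy loss of a three-class Softmax output using Adam updates. It is a sketch under stated assumptions: the parameters being optimized are the logits themselves rather than the weights of a full network, the hyperparameter values are common defaults rather than values from the source, and no learning rate schedule is included.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], target_index: int) -> float:
    """Cross-entropy loss for a single sample with a one-hot target."""
    return -math.log(probs[target_index])


class Adam:
    """Plain Adam update with bias-corrected moment estimates."""

    def __init__(self, size: int, lr: float = 0.1, beta1: float = 0.9,
                 beta2: float = 0.999, eps: float = 1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * size   # first-moment estimate
        self.v = [0.0] * size   # second-moment estimate
        self.t = 0              # step counter

    def step(self, params: List[float], grads: List[float]) -> List[float]:
        self.t += 1
        updated = []
        for i, (p, g) in enumerate(zip(params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            updated.append(p - self.lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return updated


logits = [0.0, 0.0, 0.0]
target = 2  # index of the target fault type (illustrative)
optimizer = Adam(size=len(logits))
initial_loss = cross_entropy(softmax(logits), target)

for _ in range(50):
    probs = softmax(logits)
    # Gradient of the cross-entropy loss w.r.t. the logits: p - one_hot(target)
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    logits = optimizer.step(logits, grads)

final_loss = cross_entropy(softmax(logits), target)
```

In the system described here, the target index would come from the target fault type confirmed by the expert system module 103, and the gradient would be propagated back through all layers of the fault detection model.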

Optionally, the expert system module 103 is specifically configured to:

In the embodiment of the invention, to improve the accuracy and practicality of fault detection, the expert system module 103 is introduced into the intelligent fault detection system for an Elasticsearch cluster. The operation and maintenance experience accumulated by domain experts over the long term is systematized into an independent expert rule base, which participates in the fault detection decision process together with the Elasticsearch cluster fault analysis module 102. The specific implementation flow is as follows:

(1) Expert rule base construction and localized deployment:

The operation and maintenance experience of domain experts is summarized and systematized into a rule base covering common fault types. Each rule defines a particular fault type, its symptom, and the corresponding solution. The rule base is deployed in the local environment and takes the specifics of the actual application scenario into account, which ensures its effectiveness and applicability.

(2) Real-time verification and optimization:

After the Elasticsearch cluster fault analysis module 102 outputs the predicted fault type, the expert system module 103 calls the expert rule base in real time to verify the predicted fault type and confirms the fault type through rule matching. The expert rule base comprises a plurality of rules, and each rule comprises a fault type, an Elasticsearch cluster fault symptom, and a corresponding solution.

Case 1: the current log data, the monitoring index data, and the predicted fault type output by the fault detection model are matched and verified against the rules of the expert rule base. When the verification result is that a target rule exists in the expert rule base whose fault type and Elasticsearch cluster fault symptom match the predicted fault type, the current log data, and the monitoring index data, the target fault type of the Elasticsearch cluster is determined to be the predicted fault type.

Case 2: when the verification result is that no target rule matching the current log data, the monitoring index data, and the predicted fault type exists in the expert rule base, the Elasticsearch cluster fault symptom in the expert rule base that corresponds to the current log data and the monitoring index data is identified, and the fault type associated with that symptom is determined as the target fault type of the Elasticsearch cluster.

(3) Rule base maintenance and updating

While the system is running, whenever a new fault type is found or expert feedback is received, the rule base is updated and refined to keep it timely and effective.
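
The two verification cases above can be sketched in Python as follows. The sketch assumes that a fault symptom can be expressed as a predicate over the observed monitoring indexes; the `Rule` structure, the thresholds, the solution texts, and the behavior when no rule matches at all (returning `None`) are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Observation = Dict[str, float]  # monitoring index name -> current value


@dataclass
class Rule:
    fault_type: str
    symptom: Callable[[Observation], bool]  # True when the symptom is present
    solution: str


def verify(predicted: str, obs: Observation, rules: List[Rule]) -> Optional[str]:
    """Return the target fault type for an observation and a predicted fault type."""
    # Case 1: a rule with the predicted fault type also matches the observed symptom.
    for rule in rules:
        if rule.fault_type == predicted and rule.symptom(obs):
            return predicted
    # Case 2: no such rule; use the fault type of a rule whose symptom matches.
    for rule in rules:
        if rule.symptom(obs):
            return rule.fault_type
    # Not covered by the source: no symptom matches at all (assumption: undecided).
    return None


# Hypothetical rules with made-up thresholds and solutions.
RULES = [
    Rule("cpu_overload", lambda o: o.get("cpu_percent", 0) > 90,
         "reduce query load or add nodes"),
    Rule("memory_overflow", lambda o: o.get("heap_used_percent", 0) > 95,
         "increase heap size or reduce cache usage"),
]

confirmed = verify("cpu_overload", {"cpu_percent": 97, "heap_used_percent": 40}, RULES)
corrected = verify("cpu_overload", {"cpu_percent": 30, "heap_used_percent": 98}, RULES)
```

Adding or editing entries in `RULES` corresponds to the rule base maintenance of step (3); the pair of predicted and target fault types is what would be fed back for retraining.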

Optionally, the model optimization and self-learning module 104 is further configured to:

retraining the fault detection model according to updated rules and newly added rules in the expert rule base. These measures keep the fault detection model sensitive to the latest fault types.

Optionally, the data acquisition module 101 is further configured to:

Collect the historical log data and monitoring index data, the various fault types and the solutions, and send them to the model optimization and self-learning module 104.

In the embodiment of the present invention, the data acquisition module 101 sends the historical log data, the monitoring index data, the various fault types and the solutions to the model optimization and self-learning module 104, so that the model optimization and self-learning module 104 can generate a high-quality training data set for training the fault detection model. This ensures that the fault detection model can learn the complex associations among different fault types and thus accurately identify and predict various complex faults.

In practical application, the training data set includes historical log data and monitoring index data of the Elasticsearch cluster, various fault types, solutions and a plurality of sample rules in the expert rule base, and each sample rule includes a fault type, an Elasticsearch cluster fault symptom and a corresponding solution.

Specifically, the fault types include index fragmentation, query latency, CPU overload, memory overflow, master node failure, data node failure, network partition, and the like. The training data set contains not only the occurrence conditions, symptoms and impact of each fault but also rich manual experience, which ensures its diversity and representativeness.

In the data collection process, the monitoring index data (such as CPU utilization, memory usage and query latency) and the log data (such as WARN and ERROR log entries) collected by the data acquisition module 101 undergo standardization, feature extraction and dimension reduction, and thereby provide accurate input features for the subsequent fault detection model.
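As a hedged illustration of the preprocessing described above, the following Python sketch standardizes a metric series with a z-score and extracts a simple feature from log lines by counting WARN and ERROR entries. The function names and the choice of z-score standardization are assumptions; the description does not specify which standardization or feature extraction technique is used, and dimension reduction is omitted here.

```python
import re
import statistics
from collections import Counter


def standardize(values):
    """Z-score a metric series; a constant series maps to all zeros."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


LEVEL_PATTERN = re.compile(r"\b(WARN|ERROR)\b")


def log_level_counts(lines):
    """Return [WARN count, ERROR count] over the given log lines."""
    counts = Counter()
    for line in lines:
        match = LEVEL_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return [counts["WARN"], counts["ERROR"]]


cpu_utilization = [0.35, 0.40, 0.95]  # hypothetical samples
features = standardize(cpu_utilization) + log_level_counts(
    ["[WARN ] gc overhead", "[ERROR] shard failed", "[INFO ] started"]
)
```

The resulting `features` list concatenates the standardized metric values with the two log-level counts, which is one simple way to form a single input vector from both data sources.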

Optionally, the model optimization and self-learning module 104 is configured to collect historical log data and monitoring index data of the Elasticsearch cluster, together with the various fault types and solutions, and to label each item of historical log data and monitoring index data to obtain a training data set.

The historical log data and the monitoring index data can be labeled as follows:

When labeling the historical log data and the monitoring index data, each item is first labeled with a specific fault type according to its fault characteristics; the fault types include index fragmentation, query latency, CPU overload, memory overflow, master node failure, data node failure, network partition and the like. Then, for each fault sample, the occurrence time, the triggering condition, the operation steps and the degree of impact of the fault on the cluster and the service are recorded in detail to form labels describing the occurrence conditions and impact of the fault. To ensure consistent and accurate annotation, the initially labeled data is audited and corrected by domain experts, and doubtful or disputed samples are resolved through expert discussion.
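One possible shape for a labeled training sample is sketched below in Python. The class name, field names and label strings are hypothetical; only the kinds of information recorded (fault type, occurrence time, triggering condition, operation steps, impact and expert review) come from the description above.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Label names are illustrative; "normal" covers samples marked as normal.
FAULT_TYPES = {
    "normal", "index_fragmentation", "query_latency", "cpu_overload",
    "memory_overflow", "master_node_failure", "data_node_failure",
    "network_partition",
}


@dataclass
class LabeledSample:
    log_lines: list
    metrics: dict
    fault_type: str
    occurred_at: datetime
    trigger_condition: str
    operation_steps: list = field(default_factory=list)
    impact: str = ""
    expert_reviewed: bool = False  # set to True after the expert audit

    def __post_init__(self):
        # Reject labels outside the agreed fault-type vocabulary.
        if self.fault_type not in FAULT_TYPES:
            raise ValueError(f"unknown fault type: {self.fault_type}")


sample = LabeledSample(
    log_lines=["[ERROR] rejected execution of search request"],
    metrics={"cpu_utilization": 0.97},
    fault_type="cpu_overload",
    occurred_at=datetime(2024, 1, 1, 12, 0),
    trigger_condition="burst of aggregation queries",
)
```

Keeping the expert review status on the sample itself makes it straightforward to train only on audited data if that is desired.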

In the Elasticsearch cluster fault intelligent detection system provided by the embodiment of the invention, the fault detection model, working closely with the model optimization and self-learning module 104, remains highly sensitive to new fault types. Intelligent and automatic detection and prediction of Elasticsearch cluster faults is thereby realized, further improving the efficiency and accuracy of fault detection.

Fig. 2 is a schematic block diagram of the Elasticsearch cluster fault intelligent detection system provided by the invention. In addition to the data acquisition module 101, the Elasticsearch cluster fault analysis module 102, the expert system module 103 and the model optimization and self-learning module 104 shown in Fig. 1, the system shown in Fig. 2 further includes a system monitoring and feedback module 105, wherein:

The system monitoring and feedback module 105 is connected to the data acquisition module 101, the Elasticsearch cluster fault analysis module 102, the expert system module 103, and the model optimization and self-learning module 104, and is configured to monitor the operation of the data acquisition module 101, the Elasticsearch cluster fault analysis module 102, the expert system module 103, and the model optimization and self-learning module 104.

In the embodiment of the present invention, the system monitoring and feedback module 105 runs through the whole system: it monitors the running state and performance of each module, coordinates the data flow and the information flow, and controls the behavior of the other modules through a feedback mechanism, thereby ensuring that the system runs stably. Through the system monitoring and feedback module 105, the modules cooperate closely, which ensures that the Elasticsearch cluster fault detection and optimization workflow is carried out efficiently.

In summary, to address the problems of the prior art in fault detection, namely reliance on manual analysis, inability to predict potential faults, and high operation and maintenance complexity, the invention provides an Elasticsearch cluster fault intelligent detection system based on data set construction and the fusion of a localized intelligent model with expert experience. The system has the following advantages:

1. Improving the accuracy and efficiency of fault detection

(1) Intelligent and automatic. A customized large language model performs deep analysis on large volumes of log and monitoring index data, so that faults are detected and predicted automatically, dependence on manual experience is reduced, and detection efficiency is greatly improved.

(2) Accurate identification of complex faults. By combining deep learning with expert experience, the model can accurately identify complex and dynamic fault scenarios, reduce the false alarm rate and the missed-detection rate, and improve detection accuracy.

2. Having fault prediction and prevention capabilities

(1) Early warning of potential faults. The model analyzes historical data and real-time indicators and predicts potential fault risks, helping operation and maintenance personnel take measures in advance and avoid faults.

(2) Dynamic adaptation to environmental changes. Through self-learning and dynamic optimization of the model, the system can adapt to changes in cluster configuration and load and continuously maintain a high level of detection capability.

3. Reducing operation and maintenance complexity and cost

(1) Functional integration. The system integrates data collection, intelligent analysis, the expert system and other functions, which reduces dependence on multiple independent monitoring and analysis tools and lowers operation and maintenance complexity.

(2) Reduced manual intervention. The automatic detection and prediction mechanism significantly reduces the workload of manual troubleshooting, lowers operation and maintenance cost, and improves operation and maintenance efficiency.

4. Improving the stability and scalability of the system

(1) Efficient processing of large-scale data. By using an advanced deep learning algorithm, the system can efficiently process the log and performance data of large-scale clusters, overcoming the bottleneck of traditional methods in data volume and processing capacity.

(2) Modular design. The system has a clear structure and well-defined connections between modules; it is easy to deploy, maintain and extend, and can adapt to growing service demands.

The Elasticsearch cluster fault intelligent detection method provided by the invention is described below; the method described below and the system described above may be referred to in correspondence with each other. Fig. 3 is a schematic flow chart of the method. As shown in Fig. 3, the method is executed by the Elasticsearch cluster fault intelligent detection system and specifically includes steps 301 to 304:

Step 301, collecting current log data and monitoring index data of the Elasticsearch cluster.

Step 302, extracting feature vectors of the current log data and monitoring index data by using an NLP algorithm, inputting the feature vectors into a fault detection model to obtain the probability distribution of each fault type output by the fault detection model, and determining the predicted fault type of the Elasticsearch cluster, wherein the fault detection model is a deep-learning-based large-scale pre-trained language model and is trained based on historical log data and monitoring index data marked as normal, historical log data and monitoring index data marked as faults, and the associated fault types included in a training data set.

Step 303, verifying the predicted fault type according to the current log data, the monitoring index data and an expert rule base, and determining the target fault type of the Elasticsearch cluster according to the verification result, wherein the expert rule base comprises a plurality of rules, and each rule comprises a fault type, an Elasticsearch cluster fault symptom and a solution.

Step 304, performing model training on the fault detection model according to the current log data, the monitoring index data, the predicted fault type and the target fault type.
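The relationship between the probability distribution of step 302 and the training signal of step 304 can be illustrated with a small Python sketch. It assumes a Softmax output and a cross entropy loss, both of which are named in the claims; the score values, label names and function names are hypothetical, and the model itself is replaced by a fixed list of scores.

```python
import math

# Fault types named in the description, plus "normal" (label names are illustrative).
FAULT_TYPES = [
    "normal", "index_fragmentation", "query_latency", "cpu_overload",
    "memory_overflow", "master_node_failure", "data_node_failure",
    "network_partition",
]


def softmax(scores):
    """Turn raw per-type scores into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_fault_type(probs):
    """Step 302: pick the fault type with the highest probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return FAULT_TYPES[best]


def cross_entropy(probs, target_fault_type):
    """Step 304 training signal: the loss is small when the target type is likely."""
    p = probs[FAULT_TYPES.index(target_fault_type)]
    return -math.log(max(p, 1e-12))  # clamp to avoid log(0)


scores = [0.1, 0.0, 0.2, 2.5, 0.3, 0.0, 0.0, 0.1]  # hypothetical model output
probs = softmax(scores)
predicted = predicted_fault_type(probs)
loss_if_confirmed = cross_entropy(probs, predicted)
loss_if_overruled = cross_entropy(probs, "memory_overflow")
```

When the expert system confirms the prediction in step 303, the loss is small and the model changes little; when the expert system determines a different target fault type, the loss is larger and drives a stronger parameter update.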

The Elasticsearch cluster fault intelligent detection method provided by the invention reduces operation and maintenance complexity. The customized fault detection model performs deep analysis on the current log data and monitoring index data of the Elasticsearch cluster and determines the predicted fault type, so that potential fault risks are predicted and faults of the Elasticsearch cluster are detected and predicted automatically. This reduces dependence on manual experience, greatly improves detection efficiency, and helps operation and maintenance personnel take measures in advance to avoid faults. Moreover, complex and dynamic fault scenarios can be processed efficiently, the false alarm rate and the missed-detection rate are reduced, and detection accuracy is improved. The fault detection model also realizes self-learning and dynamic optimization, that is, it has continuous learning and dynamic adaptation capabilities, so that it can continuously maintain a high level of fault detection capability while reducing operation and maintenance complexity and cost.

Fig. 4 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 4, the electronic device may include a processor 410, a communications interface 420, a memory 430, and a communication bus 440, where the processor 410, the communications interface 420, and the memory 430 communicate with each other through the communication bus 440. The processor 410 may call logic instructions in the memory 430 to execute the Elasticsearch cluster fault intelligent detection method, which includes: collecting current log data and monitoring index data of the Elasticsearch cluster; extracting feature vectors of the current log data and the monitoring index data by using an NLP algorithm, inputting the feature vectors into a fault detection model to obtain the probability distribution of each fault type output by the fault detection model, and determining the predicted fault type of the Elasticsearch cluster, where the fault detection model is a deep-learning-based large-scale pre-trained language model and is trained based on historical log data and monitoring index data marked as normal, historical log data and monitoring index data marked as faults, and the associated fault types included in a training data set; verifying the predicted fault type according to the current log data, the monitoring index data and an expert rule base, and determining the target fault type of the Elasticsearch cluster according to the verification result, where the expert rule base includes a plurality of rules and each rule includes a fault type, an Elasticsearch cluster fault symptom and a solution; and performing model training on the fault detection model according to the current log data, the monitoring index data, the predicted fault type and the target fault type.

Further, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.

In another aspect, the invention also provides a computer program product comprising a computer program, where the computer program can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the Elasticsearch cluster fault intelligent detection method provided above, which includes: collecting current log data and monitoring index data of the Elasticsearch cluster; extracting feature vectors of the current log data and the monitoring index data by using an NLP algorithm, inputting the feature vectors into a fault detection model to obtain the probability distribution of each fault type output by the fault detection model, and determining the predicted fault type of the Elasticsearch cluster, where the fault detection model is a deep-learning-based large-scale pre-trained language model and is trained based on historical log data and monitoring index data marked as normal, historical log data and monitoring index data marked as faults, and the associated fault types included in a training data set; verifying the predicted fault type according to the current log data, the monitoring index data and an expert rule base, and determining the target fault type of the Elasticsearch cluster according to the verification result, where the expert rule base includes a plurality of rules and each rule includes a fault type, an Elasticsearch cluster fault symptom and a solution; and performing model training on the fault detection model according to the current log data, the monitoring index data, the predicted fault type and the target fault type.

In still another aspect, the invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the Elasticsearch cluster fault intelligent detection method provided above, which includes: collecting current log data and monitoring index data of the Elasticsearch cluster; extracting feature vectors of the current log data and the monitoring index data by using an NLP algorithm, inputting the feature vectors into a fault detection model to obtain the probability distribution of each fault type output by the fault detection model, and determining the predicted fault type of the Elasticsearch cluster, where the fault detection model is a deep-learning-based large-scale pre-trained language model and is trained based on historical log data and monitoring index data marked as normal, historical log data and monitoring index data marked as faults, and the associated fault types included in a training data set; verifying the predicted fault type according to the current log data, the monitoring index data and an expert rule base, and determining the target fault type of the Elasticsearch cluster according to the verification result, where the expert rule base includes a plurality of rules and each rule includes a fault type, an Elasticsearch cluster fault symptom and a solution; and performing model training on the fault detection model according to the current log data, the monitoring index data, the predicted fault type and the target fault type.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.

Claims

1. An Elasticsearch cluster fault intelligent detection system, characterized by comprising a data acquisition module, an Elasticsearch cluster fault analysis module, an expert system module and a model optimization and self-learning module; wherein:

The data acquisition module is used to collect current log data and monitoring indicator data of the Elasticsearch cluster;

The Elasticsearch cluster fault analysis module is used to extract feature vectors of the current log data and monitoring indicator data using a natural language processing (NLP) algorithm, input the feature vectors into a fault detection model, obtain the probability distribution of each fault type output by the fault detection model, and determine the predicted fault type of the Elasticsearch cluster; wherein the fault detection model is a large-scale pre-trained language model based on deep learning, and the fault detection model is trained based on historical log data and monitoring indicator data marked as normal, historical log data and monitoring indicator data marked as faults, and associated fault types included in a training data set;

The expert system module is used to verify the predicted fault type according to the current log data, monitoring indicator data and expert rule base, and determine the target fault type of the Elasticsearch cluster according to the verification result; wherein the expert rule base includes multiple rules, each rule includes a fault type, Elasticsearch cluster fault symptoms and solutions;

The model optimization and self-learning module is used to perform model training on the fault detection model based on the current log data and monitoring indicator data, the predicted fault type and the target fault type.

2. The Elasticsearch cluster fault intelligent detection system according to claim 1, characterized in that the expert system module is specifically used for:

Matching and verifying the current log data, monitoring indicator data and the predicted fault type with the rules of the expert rule base to obtain a verification result;

When the verification result is that the target rule exists in the expert rule base, and the fault type and the Elasticsearch cluster fault symptom of the target rule match the predicted fault type and the current log data and the monitoring indicator data, determining that the target fault type of the Elasticsearch cluster is the predicted fault type;

When the verification result is that there is no target rule in the expert rule base that matches the current log data and monitoring indicator data and the predicted fault type, based on the Elasticsearch cluster fault symptoms corresponding to the current log data and monitoring indicator data in the expert rule base, the fault type associated with the Elasticsearch cluster fault symptom in the expert rule base is determined as the target fault type of the Elasticsearch cluster.

3. The Elasticsearch cluster fault intelligent detection system according to claim 1, characterized in that the fault detection model includes an input layer, a convolution layer, a pooling layer, a fully connected layer and a Softmax layer;

Wherein the input layer receives the feature vector; the convolution layer extracts local features in the feature vector through the convolution kernel to identify each fault type; the pooling layer downsamples the local features extracted by the convolution layer to obtain key features; the fully connected layer maps the key features sampled by the pooling layer to each fault type to obtain a feature mapping result; the Softmax layer generates a probability distribution of each fault type according to the feature mapping result output by the fully connected layer.

4. The Elasticsearch cluster fault intelligent detection system according to claim 1, characterized in that the data acquisition module is also used for:

The historical log data and monitoring indicator data, various fault types and solutions are collected and sent to the model optimization and self-learning module.

5. The Elasticsearch cluster fault intelligent detection system according to claim 4, characterized in that the model optimization and self-learning module is also used for:

According to the Elasticsearch cluster failure symptoms corresponding to the historical log data and monitoring indicator data, the historical log data and monitoring indicator data are marked as corresponding failure types, and the occurrence time, triggering conditions, operation steps, and impact of the failure on the Elasticsearch cluster and business corresponding to each historical log data and monitoring indicator data marked as a failure are recorded to generate the training data set.

6. The Elasticsearch cluster fault intelligent detection system according to claim 4, characterized in that the model optimization and self-learning module is also used for:

According to the newly added historical log data and monitoring indicator data, the fault detection model is regularly retrained and updated.

7. The Elasticsearch cluster fault intelligent detection system according to claim 1, characterized in that the model optimization and self-learning module is also used for:

The fault detection model is retrained according to the updated rules and newly added rules in the expert rule base.

8. The Elasticsearch cluster fault intelligent detection system according to claim 1, wherein the model optimization and self-learning module is further used for:

A cross entropy loss function is used to measure the difference between the predicted fault type and the target fault type, and the parameters of the fault detection model are updated.

9. The Elasticsearch cluster fault intelligent detection system according to claim 1, wherein the fault type includes at least one of the following:

Index fragmentation, query delay, CPU overload, memory overflow, master node failure, data node failure, network partition failure.

10. The Elasticsearch cluster fault intelligent detection system according to any one of claims 4 to 6, characterized in that the system further comprises:

A system monitoring and feedback module, which is connected to the data acquisition module, the Elasticsearch cluster fault analysis module, the expert system module and the model optimization and self-learning module, and is used to monitor the operation of the data acquisition module, the Elasticsearch cluster fault analysis module, the expert system module and the model optimization and self-learning module.