CN119248617A - Elasticsearch cluster fault intelligent detection system - Google Patents
Elasticsearch cluster fault intelligent detection system
- Publication number
- CN119248617A (CN202411575774.7A)
- Authority
- CN
- China
- Prior art keywords
- fault
- data
- module
- model
- fault type
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/3476—Data logging
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides an Elasticsearch cluster fault intelligent detection system, relating to the technical field of big data storage and retrieval. The system comprises a data acquisition module, an Elasticsearch cluster fault analysis module, an expert system module, and a model optimization and self-learning module. The fault detection model of the Elasticsearch cluster fault analysis module analyzes the current log data and monitoring index data of the Elasticsearch cluster and determines the predicted fault type of the cluster, realizing automatic detection and prediction. Fusing the deep learning algorithm with the expert rule base allows complex and dynamic fault scenarios to be handled efficiently, reduces the false-alarm and missed-alarm rates, and improves detection accuracy. The model optimization and self-learning module realizes self-learning and dynamic optimization of the fault detection model, reducing operation and maintenance complexity and cost.
Description
Technical Field
The invention relates to the technical field of big data storage and retrieval, and in particular to an Elasticsearch cluster fault intelligent detection system.
Background
Elasticsearch is a distributed, real-time, high-performance search and analytics engine that can process large volumes of data and return fast, accurate search results. In practical applications, an Elasticsearch cluster may encounter various faults, such as node downtime, full disks, and network failures. Detecting Elasticsearch cluster faults is therefore crucial.
In the prior art, fault detection for Elasticsearch clusters relies mainly on traditional monitoring tools and manual troubleshooting, and generally covers the following aspects:
(1) Performance monitoring and log analysis: the health of the cluster is evaluated by monitoring performance indicators of the Elasticsearch cluster (such as central processing unit (CPU) utilization, memory utilization, and query latency), while a log analysis tool analyzes the Elasticsearch logs to detect abnormal behavior or error messages;
(2) Visualization tool application: Kibana, Grafana, or other visualization tools are used to display the performance indicators of the cluster, and a separate log analysis tool (e.g., Logstash) is deployed to process and analyze log data;
(3) Real-time monitoring and history recording: the running state of the Elasticsearch cluster is monitored in real time so that anomalies and faults in the cluster are found and handled promptly. A history of cluster operating conditions is also kept to assist troubleshooting and performance optimization.
However, the prior art relies on static monitoring and on analysis based on manual experience. It cannot efficiently handle complex and dynamic fault scenarios, has difficulty predicting and preventing potential faults, and suffers from low efficiency, high operation and maintenance complexity, and high cost.
Disclosure of Invention
In view of the problems in the prior art, embodiments of the invention provide an Elasticsearch cluster fault intelligent detection system.
The invention provides an Elasticsearch cluster fault intelligent detection system, comprising a data acquisition module, an Elasticsearch cluster fault analysis module, an expert system module, and a model optimization and self-learning module, wherein:
the data acquisition module is used for acquiring current log data and monitoring index data of the Elasticsearch cluster;
the Elasticsearch cluster fault analysis module is used for extracting feature vectors of the current log data and the monitoring index data by using a natural language processing (NLP) algorithm, inputting the feature vectors into a fault detection model to obtain the probability distribution over fault types output by the fault detection model, and determining the predicted fault type of the Elasticsearch cluster, wherein the fault detection model is a deep-learning-based large-scale pre-trained language model obtained by training on a training data set that includes historical log data and monitoring index data marked as normal, historical log data and monitoring index data marked as faulty, and the related fault types;
the expert system module is used for verifying the predicted fault type according to the current log data, the monitoring index data, and an expert rule base, and determining the target fault type of the Elasticsearch cluster according to the verification result, wherein the expert rule base comprises a plurality of rules, and each rule comprises a fault type, an Elasticsearch cluster fault symptom, and a solution;
the model optimization and self-learning module is used for training the fault detection model according to the current log data, the monitoring index data, the predicted fault type, and the target fault type.
Optionally, the expert system module is specifically configured to:
match and verify the current log data, the monitoring index data, and the predicted fault type against the rules of the expert rule base to obtain a verification result;
when the verification result is that a target rule exists in the expert rule base whose fault type matches the predicted fault type and whose Elasticsearch cluster fault symptom matches the current log data and the monitoring index data, determine that the target fault type of the Elasticsearch cluster is the predicted fault type;
when the verification result is that no target rule in the expert rule base matches the current log data, the monitoring index data, and the predicted fault type, identify the Elasticsearch cluster fault symptom in the expert rule base that corresponds to the current log data and the monitoring index data, and determine the fault type associated with that symptom in the expert rule base as the target fault type of the Elasticsearch cluster.
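The two verification branches above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Rule` class, the `verify` function, the observation dictionary keys, the thresholds, and the sample solutions are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical container for values derived from current logs and metrics.
Observation = Dict[str, object]

@dataclass
class Rule:
    fault_type: str                          # fault type named by the rule
    symptom: Callable[[Observation], bool]   # True when the rule's symptom is observed
    solution: str                            # suggested remedy (illustrative text)

def verify(predicted: str, obs: Observation, rules: List[Rule]) -> Optional[str]:
    """Return the target fault type, or None when no rule's symptom is observed."""
    matched = [r for r in rules if r.symptom(obs)]
    # Branch 1: a rule agrees with the prediction on fault type and symptom.
    for rule in matched:
        if rule.fault_type == predicted:
            return predicted
    # Branch 2: no rule confirms the prediction, so use the fault type
    # associated with the symptom that the data actually shows.
    if matched:
        return matched[0].fault_type
    return None

# Illustrative rules; the 90 percent threshold is an assumption.
RULES = [
    Rule("CPU overload", lambda o: o.get("cpu_percent", 0) > 90,
         "Reduce query load or add nodes"),
    Rule("memory overflow", lambda o: "OutOfMemoryError" in str(o.get("log", "")),
         "Increase heap size or reduce memory pressure"),
]
```

When several rules match in the second branch, this sketch simply takes the first one; a real rule base would likely need a priority or scoring scheme, which the source does not specify.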
Optionally, the fault detection model includes an input layer, a convolution layer, a pooling layer, a fully connected layer, and a Softmax layer;
the input layer receives the feature vector; the convolution layer extracts local features from the feature vector through convolution kernels to identify each fault type; the pooling layer downsamples the local features extracted by the convolution layer to obtain key features; the fully connected layer maps the key features produced by the pooling layer to each fault type to obtain a feature mapping result; and the Softmax layer generates the probability distribution over fault types from the feature mapping result output by the fully connected layer.
Optionally, the data acquisition module is further configured to:
collect the historical log data, the monitoring index data, the various fault types, and the solutions, and send them to the model optimization and self-learning module.
Optionally, the model optimization and self-learning module is further configured to:
mark the historical log data and the monitoring index data with the corresponding fault types according to the Elasticsearch cluster fault symptoms they exhibit; for the historical log data and monitoring index data marked as faulty, record the occurrence time of the fault, the triggering conditions, the operation steps, and the degree of impact on the Elasticsearch cluster and its services; and generate the training data set.
Optionally, the model optimization and self-learning module is further configured to:
periodically retrain and update the fault detection model according to newly added historical log data and monitoring index data.
Optionally, the model optimization and self-learning module is further configured to:
retrain the fault detection model according to updated rules and newly added rules in the expert rule base.
Optionally, the model optimization and self-learning module is further configured to:
measure the difference between the predicted fault type and the target fault type using a cross-entropy loss function, and update the parameters of the fault detection model.
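As a rough illustration of the cross-entropy update, the sketch below applies one gradient step to a single linear layer followed by softmax. The source does not specify the optimizer, learning rate, or which parameters are updated, so plain gradient descent, `lr=0.1`, and the function names here are assumptions. For a one-hot target y and predicted distribution p, the loss is -log p[target], and its gradient with respect to the logits is p - y.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Cross-entropy against a one-hot target; epsilon guards log(0)."""
    return -math.log(probs[target_index] + 1e-12)

def sgd_step(weights, bias, x, target_index, lr=0.1):
    """Update weights and bias in place; return the loss before the update."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    loss = cross_entropy(probs, target_index)
    for k, row in enumerate(weights):
        # Gradient of the loss with respect to logit k is p_k - y_k.
        grad = probs[k] - (1.0 if k == target_index else 0.0)
        for j in range(len(row)):
            row[j] -= lr * grad * x[j]
        bias[k] -= lr * grad
    return loss
```

Here `target_index` would be the index of the target fault type determined by the expert system module, and `x` the feature vector for the same sample.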
Optionally, the fault type includes at least one of:
index fragmentation, query delay, central processing unit (CPU) overload, memory overflow, master node failure, data node failure, and network partition failure.
Optionally, the system further comprises:
a system monitoring and feedback module, used for monitoring the running conditions of the data acquisition module, the Elasticsearch cluster fault analysis module, the expert system module, and the model optimization and self-learning module.
The invention also provides an Elasticsearch cluster fault intelligent detection method, comprising:
collecting current log data and monitoring index data of an Elasticsearch cluster;
extracting feature vectors of the current log data and the monitoring index data by using a natural language processing (NLP) algorithm, inputting the feature vectors into a fault detection model to obtain the probability distribution over fault types output by the fault detection model, and determining the predicted fault type of the Elasticsearch cluster, wherein the fault detection model is a deep-learning-based large-scale pre-trained language model obtained by training on a training data set that includes historical log data and monitoring index data marked as normal, historical log data and monitoring index data marked as faulty, and the related fault types;
verifying the predicted fault type according to the current log data, the monitoring index data, and an expert rule base, and determining the target fault type of the Elasticsearch cluster according to the verification result, wherein the expert rule base comprises a plurality of rules, and each rule comprises a fault type, an Elasticsearch cluster fault symptom, and a solution;
training the fault detection model according to the current log data, the monitoring index data, the predicted fault type, and the target fault type.
The invention also provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the Elasticsearch cluster fault intelligent detection method described above when executing the program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the Elasticsearch cluster fault intelligent detection method described above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the Elasticsearch cluster fault intelligent detection method described above.
The Elasticsearch cluster fault intelligent detection system provided by the invention comprises a data acquisition module, an Elasticsearch cluster fault analysis module, an expert system module, and a model optimization and self-learning module. The fault detection model in the Elasticsearch cluster fault analysis module performs in-depth analysis of the current log data and monitoring index data of the Elasticsearch cluster and determines the predicted fault type of the cluster, thereby predicting potential fault risks. This realizes automatic detection and prediction of Elasticsearch cluster faults, reduces dependence on manual experience, improves detection efficiency, and helps operation and maintenance personnel take measures in advance to avoid faults. Complex and dynamic fault scenarios can be handled efficiently, the false-alarm and missed-alarm rates are reduced, and detection accuracy is improved. Through the model optimization and self-learning module, the fault detection model learns continuously and is optimized dynamically, so that it maintains a high level of fault detection capability and operation and maintenance cost is reduced.
Drawings
In order to illustrate the technical solutions of the invention or of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of the framework of the Elasticsearch cluster fault intelligent detection system provided by the invention;
FIG. 2 is a schematic diagram of the module relationships of the Elasticsearch cluster fault intelligent detection system provided by the invention;
FIG. 3 is a schematic flow chart of the Elasticsearch cluster fault intelligent detection method provided by the invention;
FIG. 4 is a schematic structural diagram of the electronic device provided by the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order to facilitate a clearer understanding of various embodiments of the present application, some relevant knowledge will be presented first.
Currently, fault detection for Elasticsearch clusters relies primarily on traditional monitoring tools and manual troubleshooting. These methods typically include (1) performance monitoring and log analysis, (2) visualization tool application, and (3) real-time monitoring and history recording.
These methods can provide real-time performance monitoring and anomaly response, allow the cluster state to be inspected and analyzed visually with visualization tools, scale well, and can process large-scale logs and monitoring data.
However, these methods also have disadvantages:
(1) Reliance on manual analysis and low efficiency:
complex fault scenarios are analyzed and investigated on the basis of manual experience, which is inefficient and prone to misjudgment;
(2) Inability to predict and prevent potential faults:
traditional monitoring methods mainly address faults that have already occurred, lack intelligent and automatic prediction capability, and have difficulty preventing potential fault risks in time;
(3) High operation and maintenance complexity:
multiple independent monitoring and analysis tools must be configured and maintained, which increases system complexity and operation and maintenance cost;
(4) Processing capacity bottlenecks:
in large-scale clusters, performance monitoring and log analysis can hit bottlenecks in data volume and processing capacity, making it difficult to respond to and handle faults in time.
In view of the above problems, the invention provides an Elasticsearch cluster fault intelligent detection system for realizing intelligent, automatic, and accurate fault detection for Elasticsearch clusters.
The Elasticsearch cluster fault intelligent detection system provided by the invention is described below with reference to FIG. 1. FIG. 1 is a schematic diagram of the framework of the Elasticsearch cluster fault intelligent detection system provided by the invention. Referring to FIG. 1, the system includes:
a data acquisition module 101, an Elasticsearch cluster fault analysis module 102, an expert system module 103, and a model optimization and self-learning module 104, wherein:
The data acquisition module 101 is configured to acquire current log data and monitoring index data of the Elasticsearch cluster.
In the embodiment of the invention, the data acquisition module 101 is connected to the Elasticsearch cluster fault analysis module 102. It is responsible for collecting current log data and monitoring index data from the Elasticsearch cluster and transmitting them to the Elasticsearch cluster fault analysis module 102 for subsequent in-depth analysis and fault detection.
In practical applications, to provide a reliable data basis for the Elasticsearch cluster fault analysis module 102, the data acquisition module 101 needs to collect and organize the monitoring index data and log data of the Elasticsearch cluster in real time.
The monitoring index data include functional indicators and performance indicators, where the performance indicators include CPU utilization, memory usage, query latency, and the like; the log data include error logs, warning logs, and the like.
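A minimal sketch of how such a sample might be assembled is shown below. It assumes the metrics arrive as a dictionary shaped like the response of the Elasticsearch node stats API (`GET _nodes/stats`) and that log lines carry bracketed levels such as `[ERROR]` and `[WARN ]`; the function name `collect_sample` and the output layout are hypothetical, and the field paths should be checked against the Elasticsearch version in use.

```python
def collect_sample(node_stats: dict, log_lines: list) -> dict:
    """Flatten a node-stats-like payload and keep only error/warning log lines."""
    metrics = {}
    for node_id, node in node_stats.get("nodes", {}).items():
        search = node.get("indices", {}).get("search", {})
        total = search.get("query_total", 0)
        metrics[node.get("name", node_id)] = {
            "cpu_percent": node.get("os", {}).get("cpu", {}).get("percent"),
            "heap_used_percent": node.get("jvm", {}).get("mem", {}).get("heap_used_percent"),
            # Average query latency since node start, derived from cumulative counters.
            "avg_query_ms": search.get("query_time_in_millis", 0) / total if total else 0.0,
        }
    logs = [ln for ln in log_lines if "[ERROR]" in ln or "[WARN" in ln]
    return {"metrics": metrics, "logs": logs}

# Example payload with made-up values.
SAMPLE_STATS = {
    "nodes": {
        "abc123": {
            "name": "node-1",
            "os": {"cpu": {"percent": 87}},
            "jvm": {"mem": {"heap_used_percent": 72}},
            "indices": {"search": {"query_total": 200, "query_time_in_millis": 5000}},
        }
    }
}
SAMPLE_LOGS = [
    "[2024-01-01T00:00:00,000][INFO ][o.e.n.Node] started",
    "[2024-01-01T00:00:01,000][WARN ][o.e.m.j.JvmGcMonitorService] gc overhead",
    "[2024-01-01T00:00:02,000][ERROR][o.e.b.Bootstrap] fatal error",
]
```

Fetching the payload over HTTP and tailing log files are left out so that the sketch stays independent of any particular client library.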
The Elasticsearch cluster fault analysis module 102 is configured to extract feature vectors of the current log data and the monitoring index data by using a natural language processing (NLP) algorithm, input the feature vectors into a fault detection model to obtain the probability distribution over fault types output by the fault detection model, and determine the predicted fault type of the Elasticsearch cluster, where the fault detection model is a deep-learning-based large-scale pre-trained language model obtained by training on a training data set that includes historical log data and monitoring index data marked as normal, historical log data and monitoring index data marked as faulty, and the related fault types.
In the embodiment of the invention, the Elasticsearch cluster fault analysis module 102 is configured to implement intelligent detection and prediction of faults so as to reduce manual intervention. Based on the fault detection model, the module performs feature extraction and in-depth analysis on the monitoring index data and log data provided by the data acquisition module 101, identifying potential faults and anomalies.
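The flow from raw data to a predicted fault type can be sketched as follows. The source does not name the NLP algorithm, so a hashed bag-of-words is used here purely as a stand-in; the function names, the vector dimension, and the scaling of percentage metrics are assumptions.

```python
import re
import zlib

def featurize(log_lines, metrics, dim=16):
    """Build a feature vector: hashed token counts from logs plus scaled metrics."""
    vec = [0.0] * dim
    for line in log_lines:
        for token in re.findall(r"[a-z_]+", line.lower()):
            # crc32 gives a hash that is stable across processes, unlike hash().
            vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    # Append metrics in a fixed key order; values are assumed to be percentages.
    return vec + [metrics[k] / 100.0 for k in sorted(metrics)]

def predict(probabilities: dict) -> str:
    """Pick the fault type with the highest probability."""
    return max(probabilities, key=probabilities.get)
```

For instance, `predict({"normal": 0.1, "CPU overload": 0.7, "query delay": 0.2})` selects `"CPU overload"`, which would then be passed on for verification.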
In practical applications, the fault detection model is a customized large-scale pre-trained language model, such as the Tongyi Qianwen (Qwen) model. The model is trained and optimized in close combination with Elasticsearch cluster data, including log data, monitoring index data, and historical fault cases, and is adapted to the log structure, monitoring index data, and common fault modes of the Elasticsearch cluster to ensure its effectiveness and practicality.
The Elasticsearch cluster fault analysis module 102 is connected to the expert system module 103 and passes the predicted fault type to the expert system module 103 for verification and optimization.
Optionally, the fault type includes at least one of:
index fragmentation, query delay, central processing unit (CPU) overload, memory overflow, master node failure, data node failure, and network partition failure.
The expert system module 103 is configured to verify the predicted fault type according to the current log data, the monitoring index data, and an expert rule base, and determine the target fault type of the Elasticsearch cluster according to the verification result, where the expert rule base includes a plurality of rules, and each rule includes a fault type, an Elasticsearch cluster fault symptom, and a solution.
In the embodiment of the invention, the expert system module 103 is introduced to improve the accuracy and practicality of fault detection. Based on the expert rule base, the expert system module 103 verifies and optimizes the predicted fault type output by the fault detection model, thereby determining the target fault type and improving reliability in complex fault scenarios. Each rule in the expert rule base defines a particular fault type, its fault symptom, and the corresponding solution. The rule base is deployed in the local environment and takes the particulars of the actual application scenario into account, which ensures its effectiveness and applicability.
The module incorporates the operation and maintenance experience of domain experts, reviews the detection results of the Elasticsearch cluster fault analysis module 102, provides optimization suggestions, and improves detection accuracy. It plays an important role in complex or novel fault scenarios.
It should be noted that the expert system module 103 may transmit its optimization feedback to the model optimization and self-learning module 104 so that the fault detection model can be updated.
The model optimization and self-learning module 104 is configured to train the fault detection model according to the current log data, the monitoring index data, the predicted fault type, and the target fault type.
To remain sensitive to the latest fault modes and maintain high detection accuracy, the Elasticsearch cluster fault intelligent detection system is provided with the model optimization and self-learning module 104, which realizes continuous optimization and self-learning of the fault detection model.
The specific implementation steps are as follows:
(1) Continuous data collection and preprocessing
While the system is running, new log data and monitoring index data are collected continuously. The newly collected data are normalized and their features extracted so that they meet the requirements of model training;
(2) Incremental learning and parameter optimization
The fault detection model is trained incrementally on the new data so that it learns new fault characteristics and improves its ability to detect novel faults. Model parameters are adjusted dynamically according to the performance of the fault detection model on the new data, improving the generalization and robustness of the model;
(3) Expert feedback fusion
Feedback provided by the expert system module 103 and new fault rules are incorporated into the model training data set. Retraining corrects the model's misjudgments on specific fault types and improves detection accuracy;
(4) Model update and deployment
The optimized fault detection model is deployed to the Elasticsearch cluster fault analysis module 102, improving its detection capability. Data collection, model optimization, and model deployment thus form a closed loop that continuously improves system performance.
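The closed loop formed by steps (1) to (4) can be sketched as a buffer that accumulates verified samples and triggers retraining when enough have arrived. This is only one possible arrangement: the `FeedbackBuffer` class, the `run_cycle` function, the batch-size trigger, and the caller-supplied `retrain` callback are hypothetical, and the source does not state when or how often retraining runs.

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackBuffer:
    retrain_threshold: int = 3               # hypothetical batch size
    samples: list = field(default_factory=list)

    def add(self, features, predicted, target) -> bool:
        """Store a verified sample; report whether a retraining batch is ready."""
        # The third element flags samples where the expert system corrected the model.
        self.samples.append((features, target, predicted != target))
        return len(self.samples) >= self.retrain_threshold

    def drain(self) -> list:
        """Hand over the buffered batch and start a new one."""
        batch, self.samples = self.samples, []
        return batch

def run_cycle(buffer, events, retrain) -> int:
    """Feed (features, predicted, target) events; return the number of retraining rounds."""
    rounds = 0
    for features, predicted, target in events:
        if buffer.add(features, predicted, target):
            retrain(buffer.drain())          # model update and redeployment happen here
            rounds += 1
    return rounds
```

Keeping the corrected-sample flag in the batch lets the retraining step weight or report misjudged cases separately, which matches the stated goal of correcting misjudgments on specific fault types.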
The optimization and self-learning module 104 is described in detail in the following embodiments, and is not described herein.
In the embodiment of the invention, the model optimization and self-learning module 104 has the capability of continuous learning and dynamic adaptation, keeps the sensitivity to new faults, and improves the long-term effectiveness of the system. The module performs incremental learning and dynamic optimization on the fault detection model by continuously collecting new data and expert feedback, thereby forming a closed loop. Each module in the intelligent detection system for the failure of the elastic search cluster realizes close cooperation through a data transmission and feedback mechanism, and ensures the efficient performance of the failure detection and optimization workflow of the elastic search cluster.
The Elasticsearch cluster fault intelligent detection system provided by the invention comprises a data acquisition module, an Elasticsearch cluster fault analysis module, an expert system module, and a model optimization and self-learning module. The fault detection model in the Elasticsearch cluster fault analysis module performs deep analysis on the current log data and monitoring index data of the Elasticsearch cluster to determine the predicted fault type of the cluster, and thereby predicts potential fault risks. This realizes automatic detection and prediction of Elasticsearch cluster faults, reduces dependence on manual experience, improves detection efficiency, and helps operation and maintenance personnel take measures in advance to avoid faults. By fusing a deep learning algorithm with an expert rule base, the system can handle complex and dynamic fault scenarios efficiently, reducing the false alarm rate and the missed alarm rate and improving detection accuracy. Through the model optimization and self-learning module, the fault detection model achieves self-learning and dynamic optimization, that is, continuous learning and dynamic adaptation, so that it maintains a high level of fault detection capability while the complexity and cost of operation and maintenance are reduced.
Optionally, the fault detection model includes an input layer, a convolution layer, a pooling layer, a fully connected layer, and a Softmax layer;
The input layer receives the feature vector. The convolution layer extracts local features from the feature vector through a convolution kernel in order to identify each fault type. The pooling layer downsamples the local features extracted by the convolution layer to obtain key features. The fully connected layer maps the key features sampled by the pooling layer to each fault type to obtain a feature mapping result. The Softmax layer generates the probability distribution over the fault types from the feature mapping result output by the fully connected layer.
In the embodiment of the invention, the fault detection model adopts an architecture that combines a convolutional neural network (Convolutional Neural Network, CNN) with a fully connected network. In practical application, the fault detection model generates the probability distribution of each fault type through the following steps:
(1) The input layer receives the feature vector. Before the monitoring index data and the log data of the Elasticsearch cluster are input into the input layer of the fault detection model, they are standardized and vectorized.
Specifically, for log data, a natural language processing model (e.g., the BERT model) is used to convert the log data into a vector representation. The log data contains characteristic information about the various faults that have occurred, so that the fault detection model can capture the fault patterns hidden in the log data.
For monitoring index data (such as CPU utilization, memory usage, query latency and the like), standardization and dimension reduction are required, and key features are extracted. The monitoring index data is closely related to the occurrence of faults, and by analyzing it the fault detection model can identify abnormal performance behavior.
(2) The convolution layer extracts local features from the feature vector through the convolution kernel, and identifies fault patterns and abnormal behavior in the system log.
(3) The pooling layer downsamples the output of the convolution layer, reducing the data dimension while retaining key features.
(4) The fully connected layer further processes the key features and maps them to the different fault types, such as master node failure, data node failure, network partition, etc.
(5) The Softmax layer converts the feature mapping result output by the fully connected layer into the probability distribution over the fault types.
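The five stages above can be traced with a minimal forward pass. The following is only a sketch in plain Python, without a deep learning framework. The fault-type labels, the feature vector, the convolution kernel and the fully connected weights are made-up illustrative values and are not taken from the invention; a real model would learn the kernel and weights during training.

```python
import math

# Illustrative labels only; the real set of fault types is defined by the training data.
FAULT_TYPES = ["master_node_failure", "data_node_failure", "network_partition"]

def conv1d(x, kernel):
    """Valid 1-D convolution (cross-correlation) followed by ReLU."""
    k = len(kernel)
    return [max(0.0, sum(x[i + j] * kernel[j] for j in range(k)))
            for i in range(len(x) - k + 1)]

def max_pool(x, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def dense(x, weights, bias):
    """Fully connected layer: one row of weights per fault type."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(features, kernel, weights, bias):
    local_features = conv1d(features, kernel)    # convolution layer
    key_features = max_pool(local_features, 2)   # pooling layer
    logits = dense(key_features, weights, bias)  # fully connected layer
    probs = softmax(logits)                      # Softmax layer
    best = max(range(len(probs)), key=probs.__getitem__)
    return FAULT_TYPES[best], probs

# Hypothetical inputs and parameters, chosen only to make the sketch runnable.
features = [0.1, 0.9, 0.8, 0.2, 0.0, 0.7]
kernel = [1.0, -1.0]
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
bias = [0.0, 0.0, 0.0]

predicted_type, probabilities = predict(features, kernel, weights, bias)
```

The predicted fault type is simply the class with the highest probability; the full distribution is kept so that a later verification step can see how confident the model was.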
This deep learning and prediction mechanism enables the fault detection model to identify known faults accurately and also gives it some ability to detect unknown faults.
To ensure the efficiency and accuracy of the fault detection model, it is necessary to continuously optimize it with the model optimization and self-learning module 104 during model training and application.
Specifically, the model optimization and self-learning module 104 is further configured to:
Measure the difference between the predicted fault type and the target fault type using a cross-entropy loss function, and update the parameters of the fault detection model.
This specifically comprises the following steps:
1) Measuring the difference between the predicted result and the actual result using a cross-entropy loss function, which guides the updating of model parameters;
2) Combining an Adam optimizer with an adaptive learning rate strategy to accelerate model convergence and avoid overfitting;
3) As new fault data is collected, periodically retraining and updating the model to maintain sensitivity to the latest fault types.
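The loss and update steps above can be illustrated with a small numerical sketch. It assumes a single training sample and, for simplicity, treats the three logits themselves as the trainable parameters; the hyperparameters (`lr`, `b1`, `b2`, `eps`) are common Adam defaults apart from the learning rate and are assumptions, not values from the invention. No separate learning rate schedule is shown; the only adaptation is the per-parameter scaling built into Adam.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Cross-entropy loss for one sample whose true class index is `target`."""
    return -math.log(probs[target])

def adam_step(params, grads, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; returns the new parameters and moment estimates."""
    new_params, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(params, grads, m, v):
        mi = b1 * mi + (1 - b1) * g          # first moment (mean of gradients)
        vi = b2 * vi + (1 - b2) * g * g      # second moment (mean of squared gradients)
        m_hat = mi / (1 - b1 ** t)           # bias correction
        v_hat = vi / (1 - b2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_params, new_m, new_v

logits = [0.0, 0.0, 0.0]  # simplification: logits are the trainable parameters
target = 2                # index of the target fault type for this sample
m = [0.0, 0.0, 0.0]
v = [0.0, 0.0, 0.0]
losses = []
for t in range(1, 21):
    probs = softmax(logits)
    losses.append(cross_entropy(probs, target))
    # Gradient of softmax + cross-entropy with respect to the logits: p - one_hot(target)
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    logits, m, v = adam_step(logits, grads, m, v, t)
```

With uniform initial probabilities the first loss equals ln 3, and the loss falls on every step as the logit of the target class grows relative to the others.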
Optionally, the expert system module 103 is specifically configured to:
Match and verify the current log data, the monitoring index data and the predicted fault type against the rules of the expert rule base to obtain a verification result;
When the verification result is that a target rule exists in the expert rule base, that is, a rule whose fault type matches the predicted fault type and whose Elasticsearch cluster fault symptom matches the current log data and the monitoring index data, determine that the target fault type of the Elasticsearch cluster is the predicted fault type;
When the verification result is that no target rule matching the current log data, the monitoring index data and the predicted fault type exists in the expert rule base, identify the Elasticsearch cluster fault symptom in the expert rule base that corresponds to the current log data and the monitoring index data, and determine the fault type associated with that fault symptom in the expert rule base as the target fault type of the Elasticsearch cluster.
In the embodiment of the invention, in order to improve the accuracy and practicability of fault detection, an expert system module 103 is introduced into the Elasticsearch cluster fault intelligent detection system. The operation and maintenance experience accumulated by domain experts over a long period is systematized into an independent expert rule base, and the expert rule base and the Elasticsearch cluster fault analysis module 102 jointly participate in the fault detection decision process. The specific implementation flow is as follows:
(1) Expert rule base construction and local deployment:
The operation and maintenance experience of domain experts is summarized and systematized to establish a rule base covering common fault types. Each rule defines a particular fault type, its symptoms, and the corresponding solution. The rule base is deployed in the local environment and takes the particularities of the actual application scenario into account, which ensures its effectiveness and applicability.
(2) Real-time verification and optimization:
After the Elasticsearch cluster fault analysis module 102 outputs the predicted fault type, the expert system module 103 calls the expert rule base in real time to verify the predicted fault type, and confirms the fault type through rule matching. The expert rule base comprises a plurality of rules, and each rule comprises a fault type, an Elasticsearch cluster fault symptom and a corresponding solution.
Case one: the current log data, the monitoring index data and the predicted fault type output by the fault detection model are matched and verified against the rules of the expert rule base. When the verification result is that a target rule exists in the expert rule base whose fault type matches the predicted fault type and whose Elasticsearch cluster fault symptom matches the current log data and the monitoring index data, the target fault type of the Elasticsearch cluster is determined to be the predicted fault type.
Case two: when the verification result is that no target rule matching the current log data, the monitoring index data and the predicted fault type exists in the expert rule base, the Elasticsearch cluster fault symptom in the expert rule base that corresponds to the current log data and the monitoring index data is identified, and the fault type associated with that fault symptom in the expert rule base is determined as the target fault type of the Elasticsearch cluster.
(3) Rule base maintenance and updating:
While the system is running, whenever a new fault type is discovered or expert feedback is received, the rule base is updated and refined so that it remains timely and effective.
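The two verification cases described above can be sketched as a small rule-matching routine. This is a hedged illustration only: the `Rule` structure, the symptom labels, the placeholder solutions, and the choice to prefer the matching rule with the most symptoms in case two are all assumptions made for the example. The text does not specify how symptoms are derived from log and monitoring index data, nor what happens when no rule applies.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    fault_type: str
    symptoms: frozenset  # symptom labels that must all be observed
    solution: str

def verify(predicted_type, observed_symptoms, rules):
    """Return (target_fault_type, solution) following the two cases above."""
    observed = set(observed_symptoms)
    # Case one: a rule agrees with both the prediction and the observed symptoms.
    for rule in rules:
        if rule.fault_type == predicted_type and rule.symptoms <= observed:
            return rule.fault_type, rule.solution
    # Case two: no such rule, so fall back to a rule matched by symptoms alone.
    candidates = [r for r in rules if r.symptoms <= observed]
    if candidates:
        best = max(candidates, key=lambda r: len(r.symptoms))  # assumed tie-breaking policy
        return best.fault_type, best.solution
    return None, None  # not covered by the text; left undecided here

# Hypothetical rule base with placeholder solutions.
RULES = [
    Rule("cpu_overload", frozenset({"cpu_usage_high", "query_latency_high"}), "placeholder solution A"),
    Rule("memory_overflow", frozenset({"heap_usage_high", "gc_pause_long"}), "placeholder solution B"),
]

case_one = verify("cpu_overload", ["cpu_usage_high", "query_latency_high"], RULES)
case_two = verify("cpu_overload", ["heap_usage_high", "gc_pause_long"], RULES)
no_match = verify("network_partition", ["disk_full"], RULES)
```

In `case_two` the model's prediction is overridden because the observed symptoms match only the memory overflow rule, which is the behavior described for the second case.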
Optionally, the model optimization and self-learning module 104 is further configured to:
Retrain the fault detection model according to the updated rules and the newly added rules in the expert rule base. This keeps the fault detection model sensitive to the latest fault types.
Optionally, the data acquisition module 101 is further configured to:
Collect historical log data and monitoring index data, together with the various fault types and their solutions, and send them to the model optimization and self-learning module 104.
In the embodiment of the present invention, the data acquisition module 101 sends the historical log data, the monitoring index data, the various fault types and the solutions to the model optimization and self-learning module 104. The model optimization and self-learning module 104 uses them to generate a high-quality training data set for training the fault detection model, so that the fault detection model can learn the complex relationships among different fault types and accurately identify and predict various complex faults.
In practical application, the training data set includes the historical log data and monitoring index data of the Elasticsearch cluster, the various fault types and solutions, and a plurality of sample rules from the expert rule base, where each sample rule includes a fault type, an Elasticsearch cluster fault symptom and a corresponding solution.
Specifically, the fault types include index fragmentation, query latency, CPU overload, memory overflow, master node failure, data node failure, network partition, and the like. The training data set not only contains the occurrence conditions, symptoms and impact of the faults, but also incorporates extensive manual experience, which ensures the diversity and representativeness of the training data set.
During data collection, the monitoring index data (such as CPU utilization, memory usage, query latency, etc.) and the log data (such as WARN and ERROR log entries) collected by the data acquisition module 101 undergo standardization, feature extraction and dimension reduction, and thereby provide accurate input features for the subsequent fault detection model.
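As one concrete possibility for the standardization step, the sketch below applies z-score standardization to a single monitoring index. The text does not name a specific standardization method, so z-scoring is an assumption, and the CPU utilization values are invented for illustration. Feature extraction and dimension reduction are not shown.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score standardization: subtract the mean, divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant series carries no variation; map it to zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical CPU utilization samples in percent; the last one is an obvious spike.
cpu_utilization = [35.0, 40.0, 38.0, 95.0]
z_scores = standardize(cpu_utilization)
```

After standardization, indices with different units (percentages, bytes, milliseconds) share a common scale, and the spike stands out as the largest positive value.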
Optionally, the model optimization and self-learning module 104 is configured to collect the historical log data and monitoring index data of the Elasticsearch cluster, together with the various fault types and solutions, and to label each item of historical log data and monitoring index data to obtain a training data set.
The labeling of the historical log data and the monitoring index data can be realized as follows:
The historical log data and the monitoring index data are labeled with the corresponding fault types according to the Elasticsearch cluster fault symptoms they exhibit. For the historical log data and monitoring index data labeled as faulty, the occurrence time, trigger conditions, operation steps, and degree of impact of the fault on the Elasticsearch cluster and on the service are recorded, and the training data set is generated.
When labeling the historical log data and the monitoring index data, each item is first labeled with a specific fault type according to its fault characteristics, where the fault types include index fragmentation, query latency, CPU overload, memory overflow, master node failure, data node failure, network partition and the like. Then the occurrence time, trigger conditions and operation steps of each fault sample, and the degree of impact of the fault on the cluster and the service, are recorded in detail to form labels describing the occurrence conditions and impact of the fault. To ensure the consistency and accuracy of the labels, the initially labeled data is reviewed and corrected by domain experts, and questionable or disputed samples are resolved by consensus through expert discussion.
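A labeled training sample as described above might be represented as follows. This is a sketch under assumptions: the field names, the symptom-to-fault mapping and the sample values are hypothetical and chosen only to mirror the items the text says are recorded (fault type, occurrence time, trigger condition, operation steps, impact) plus a flag for the expert review step.

```python
from dataclasses import dataclass, field

@dataclass
class LabeledSample:
    log_lines: list
    metrics: dict
    fault_type: str                     # "normal" or a fault type label
    occurred_at: str = ""
    trigger_condition: str = ""
    operation_steps: list = field(default_factory=list)
    impact: str = ""
    expert_reviewed: bool = False       # set to True once a domain expert has audited the label

def label_sample(log_lines, metrics, symptoms, symptom_to_fault, **fault_details):
    """Label a sample by its first recognized symptom; unrecognized samples are labeled normal."""
    for symptom in symptoms:
        fault_type = symptom_to_fault.get(symptom)
        if fault_type is not None:
            return LabeledSample(log_lines, metrics, fault_type, **fault_details)
    return LabeledSample(log_lines, metrics, "normal")

# Hypothetical mapping and sample values, for illustration only.
SYMPTOM_TO_FAULT = {"cpu_usage_high": "cpu_overload", "heap_usage_high": "memory_overflow"}

faulty = label_sample(
    ["placeholder log line"], {"cpu": 0.97}, ["cpu_usage_high"], SYMPTOM_TO_FAULT,
    occurred_at="2024-01-01T00:00:00",
    trigger_condition="placeholder trigger",
    operation_steps=["placeholder step"],
    impact="placeholder impact",
)
normal = label_sample(["placeholder log line"], {"cpu": 0.20}, [], SYMPTOM_TO_FAULT)
```

Keeping the review flag on the record makes it easy to exclude samples that have not yet passed expert review from the training data set.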
Optionally, the model optimization and self-learning module 104 is further configured to:
Periodically retrain and update the fault detection model according to newly added historical log data and monitoring index data.
In the Elasticsearch cluster fault intelligent detection system provided by the embodiment of the invention, the fault detection model is tightly coupled with the model optimization and self-learning module 104 and therefore remains highly sensitive to new fault types. This realizes intelligent, automatic detection and prediction of Elasticsearch cluster faults and further improves the efficiency and accuracy of fault detection.
Fig. 2 is a schematic block diagram of the Elasticsearch cluster fault intelligent detection system provided by the invention. In addition to the data acquisition module 101, the Elasticsearch cluster fault analysis module 102, the expert system module 103 and the model optimization and self-learning module 104 shown in fig. 1, the system shown in fig. 2 further includes a system monitoring and feedback module 105, wherein:
The system monitoring and feedback module 105 is connected to the data acquisition module 101, the Elasticsearch cluster fault analysis module 102, the expert system module 103, and the model optimization and self-learning module 104, and is configured to monitor the operating conditions of these four modules.
In the embodiment of the present invention, the system monitoring and feedback module 105 runs through the whole system: it monitors the running state and performance of each module, coordinates the data flow and information flow, and controls the behavior of the other modules through a feedback mechanism to ensure that the system runs stably. Through the system monitoring and feedback module 105, the modules cooperate closely, so that the fault detection and optimization workflow of the Elasticsearch cluster runs efficiently.
In summary, to address the problems of existing fault detection techniques, such as reliance on manual analysis, inability to predict potential faults, and high operation and maintenance complexity, the invention provides an Elasticsearch cluster fault intelligent detection system based on data set construction and on the fusion of a locally deployed intelligent model with expert experience. The system has the following advantages:
1. Improving the accuracy and efficiency of fault detection
(1) Intelligent and automatic. A customized large language model performs deep analysis on a large volume of log and monitoring index data, realizing automatic detection and prediction of faults, reducing dependence on manual experience, and improving detection efficiency.
(2) Accurate identification of complex faults. By combining deep learning with expert experience, the model can accurately identify complex and dynamic fault scenarios, reducing the false alarm rate and the missed alarm rate and improving detection accuracy.
2. Having fault prediction and prevention capabilities
(1) Early warning of potential faults. The model analyzes historical data and real-time indexes to predict potential fault risks, helping operation and maintenance personnel take measures in advance and avoid faults.
(2) Dynamic adaptation to environmental changes. Through self-learning and dynamic optimization of the model, the system adapts to changes in cluster configuration and load, and continuously maintains a high level of detection capability.
3. Reducing operation and maintenance complexity and cost
(1) Functional integration. The system integrates data collection, intelligent analysis, the expert system and other functions, reducing dependence on multiple independent monitoring and analysis tools and lowering the complexity of operation and maintenance.
(2) Reduced manual intervention. The automatic detection and prediction mechanism reduces the workload of manual troubleshooting, lowers operation and maintenance cost, and improves operation and maintenance efficiency.
4. Improving stability and expandability of system
(1) Efficient processing of large-scale data. Using deep learning algorithms, the system can efficiently process the log and performance data of large-scale clusters, overcoming the bottlenecks of traditional methods in data volume and processing capacity.
(2) Modular design. The system has a clear structure and well-defined connections between modules; it is easy to deploy, maintain and extend, and can adapt to growing service demands.
The Elasticsearch cluster fault intelligent detection method provided by the invention is described below. The method described below and the Elasticsearch cluster fault intelligent detection system described above may be referred to in correspondence with each other. Fig. 3 is a schematic flow chart of the method. As shown in fig. 3, the method is executed by the Elasticsearch cluster fault intelligent detection system and specifically includes steps 301 to 304, where:
Step 301, collecting current log data and monitoring index data of an Elasticsearch cluster.
Step 302, extracting feature vectors of the current log data and the monitoring index data by using an NLP algorithm, inputting the feature vectors into a fault detection model to obtain the probability distribution of each fault type output by the fault detection model, and determining the predicted fault type of the Elasticsearch cluster, wherein the fault detection model is a deep-learning-based large-scale pre-trained language model and is trained on the historical log data and monitoring index data marked as normal, the historical log data and monitoring index data marked as faulty, and the associated fault types included in a training data set.
Step 303, verifying the predicted fault type according to the current log data, the monitoring index data and an expert rule base, and determining the target fault type of the Elasticsearch cluster according to the verification result, wherein the expert rule base comprises a plurality of rules, and each rule comprises a fault type, an Elasticsearch cluster fault symptom and a solution.
Step 304, performing model training on the fault detection model according to the current log data, the monitoring index data, the predicted fault type and the target fault type.
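Steps 301 to 304 can be read as one detection cycle. The sketch below wires the four steps together with injected callables so that each stage can be replaced independently. Everything in it is an assumption made for illustration: the function names, the stub implementations, the sample values, and in particular the simplification that step 304 only queues a training sample rather than running a training job.

```python
def run_detection_cycle(collect, extract_features, predict_probs, verify, training_queue):
    """Run one pass of steps 301-304 and return (predicted_type, target_type)."""
    logs, metrics = collect()                        # step 301: collect current data
    features = extract_features(logs, metrics)       # step 302: feature vector
    probs = predict_probs(features)                  #           probability per fault type
    predicted = max(probs, key=probs.get)            #           predicted fault type
    target = verify(predicted, logs, metrics)        # step 303: expert rule verification
    training_queue.append(                           # step 304: keep the sample for retraining
        {"features": features, "predicted": predicted, "target": target}
    )
    return predicted, target

# Hypothetical stubs standing in for the real modules.
def collect():
    return ["placeholder log line"], {"cpu": 0.42}

def extract_features(logs, metrics):
    return [float(len(logs)), metrics["cpu"]]

def predict_probs(features):
    return {"network_partition": 0.7, "cpu_overload": 0.3}

def verify(predicted, logs, metrics):
    return predicted  # stub: the expert rules agree with the model

queue = []
predicted, target = run_detection_cycle(collect, extract_features, predict_probs, verify, queue)
```

Samples where `predicted` and `target` differ are the most informative ones for retraining, since they record a case in which the expert rule base corrected the model.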
The Elasticsearch cluster fault intelligent detection method provided by the invention reduces the complexity of operation and maintenance. The customized fault detection model performs deep analysis on the current log data and monitoring index data of the Elasticsearch cluster to determine the predicted fault type of the cluster, and thereby predicts potential fault risks. This realizes automatic detection and prediction of Elasticsearch cluster faults, reduces dependence on manual experience, improves detection efficiency, and helps operation and maintenance personnel take measures in advance to avoid faults. The method can also handle complex and dynamic fault scenarios efficiently, reducing the false alarm rate and the missed alarm rate and improving detection accuracy. In addition, the fault detection model achieves self-learning and dynamic optimization, that is, continuous learning and dynamic adaptation, so that it maintains a high level of fault detection capability while the complexity and cost of operation and maintenance are reduced.
Fig. 4 is a schematic diagram of the physical structure of an electronic device. As shown in fig. 4, the electronic device may include a processor 410, a communication interface 420, a memory 430, and a communication bus 440, where the processor 410, the communication interface 420, and the memory 430 communicate with each other through the communication bus 440. The processor 410 may call logic instructions in the memory 430 to execute the Elasticsearch cluster fault intelligent detection method. The method includes: collecting current log data and monitoring index data of the Elasticsearch cluster; extracting feature vectors of the current log data and the monitoring index data by using an NLP algorithm, inputting the feature vectors into a fault detection model to obtain the probability distribution of each fault type output by the fault detection model, and determining the predicted fault type of the Elasticsearch cluster, where the fault detection model is a deep-learning-based large-scale pre-trained language model and is trained on the historical log data and monitoring index data marked as normal, the historical log data and monitoring index data marked as faulty, and the associated fault types included in a training data set; verifying the predicted fault type according to the current log data, the monitoring index data and an expert rule base, and determining the target fault type of the Elasticsearch cluster according to the verification result, where the expert rule base includes a plurality of rules and each rule includes a fault type, an Elasticsearch cluster fault symptom and a solution; and performing model training on the fault detection model according to the current log data, the monitoring index data, the predicted fault type and the target fault type.
Further, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, the invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the Elasticsearch cluster fault intelligent detection method provided above. The method includes: collecting current log data and monitoring index data of the Elasticsearch cluster; extracting feature vectors of the current log data and the monitoring index data by using an NLP algorithm, inputting the feature vectors into a fault detection model to obtain the probability distribution of each fault type output by the fault detection model, and determining the predicted fault type of the Elasticsearch cluster, where the fault detection model is a deep-learning-based large-scale pre-trained language model and is trained on the historical log data and monitoring index data marked as normal, the historical log data and monitoring index data marked as faulty, and the associated fault types included in a training data set; verifying the predicted fault type according to the current log data, the monitoring index data and an expert rule base, and determining the target fault type of the Elasticsearch cluster according to the verification result, where the expert rule base includes a plurality of rules and each rule includes a fault type, an Elasticsearch cluster fault symptom and a solution; and performing model training on the fault detection model according to the current log data, the monitoring index data, the predicted fault type and the target fault type.
In still another aspect, the invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the Elasticsearch cluster fault intelligent detection method provided above. The method includes: collecting current log data and monitoring index data of the Elasticsearch cluster; extracting feature vectors of the current log data and the monitoring index data by using an NLP algorithm, inputting the feature vectors into a fault detection model to obtain the probability distribution of each fault type output by the fault detection model, and determining the predicted fault type of the Elasticsearch cluster, where the fault detection model is a deep-learning-based large-scale pre-trained language model and is trained on the historical log data and monitoring index data marked as normal, the historical log data and monitoring index data marked as faulty, and the associated fault types included in a training data set; verifying the predicted fault type according to the current log data, the monitoring index data and an expert rule base, and determining the target fault type of the Elasticsearch cluster according to the verification result, where the expert rule base includes a plurality of rules and each rule includes a fault type, an Elasticsearch cluster fault symptom and a solution; and performing model training on the fault detection model according to the current log data, the monitoring index data, the predicted fault type and the target fault type.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411575774.7A CN119248617A (en) | 2024-11-06 | 2024-11-06 | Elasticsearch cluster fault intelligent detection system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411575774.7A CN119248617A (en) | 2024-11-06 | 2024-11-06 | Elasticsearch cluster fault intelligent detection system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN119248617A (en) | 2025-01-03 |
Family
ID=94022537
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202411575774.7A Pending CN119248617A (en) | 2024-11-06 | 2024-11-06 | Elasticsearch cluster fault intelligent detection system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN119248617A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119415221A (en) * | 2025-01-06 | 2025-02-11 | 北京壁仞科技开发有限公司 | Active testing methods, devices, equipment, media and products for AI computing clusters |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||